Differential Expression with TrajAtlas#

Setup#

import numpy as np
import scanpy as sc
import pandas as pd
import TrajAtlas as tja
from sklearn.cluster import KMeans
sc.settings.verbosity = 3
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (7, 7)

tdiff=tja.diff.Tdiff()

Datasets#

In this tutorial, we continue with the adata we created in OPCST projecting

adata=sc.read("../data/3.19_adata_immediate_step1.h5ad")

We started with milo standard pipeline, which was described in Differential Abundance with TrajAtlas.

tdata = tdiff.load(adata)
sc.pp.neighbors(tdata['rna'], use_rep='scANVI', n_neighbors=80, n_pcs=15)
tdiff.make_nhoods(tdata['rna'], prop=0.10)
tdata =  tdiff.count_nhoods(tdata, sample_col="sample")

computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:29)


/home/gilberthan/anaconda3/envs/scarches/lib/python3.8/site-packages/anndata/_core/anndata.py:117: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

For every neighborhood, we generated pseudobulk. You need to specify which column in “adata.obs” represents the sample, group, and pseudotime, respectively.

pseudobulk=tdiff.make_pseudobulk_parallel(mdata=tdata,sample_col="sample",group_col="group",time_col="pseduoPred",njob=25,min_cell=4)

100%|██████████| 1001/1001 [03:57<00:00,  4.21it/s]

The differential expression pipeline follows steps similar to those in “Differential Abundance.” Specifically, we utilized a Generalized Linear Model (GLM) with edgeR to compute the differential expression for each neighborhood. Subsequently, we mapped the results onto the pseudotime axis, dividing it into n intervals and applying a binomial model to each interval. Additionally, we performed permutation null hypothesis testing by shuffling group labels to obtain lambda. The comprehensive differentiation outcomes are stored in pseudobulk.var.overall_gene_p, while the within-interval differential expression results are stored in pseudobulk.varm.exprPoint. The false discovery rate (FDR) within each interval is stored in pseudobulk.varm.gene_p_adj. Moreover, the expression profiles of each group are stored in pseudobulk.varm.group1_cpm and pseudobulk.varm.group2_cpm.

tdiff.de(tdata,design="~group",model_contrasts="groupYoung-groupMA",njob=-1,fix_libsize=False,shuffle_times=5)

Detecting differential expression in neighborhoods......

Using edgeR to find DEG......

  0%|          | 1/1001 [00:01<19:11,  1.15s/it]2024-03-22 20:04:04.127204: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT 

547 differential genes were detected!

tdata

MuData object with n_obs × n_vars = 15079 × 22076
  3 modalities
    rna:	15079 x 22076
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent_mito', 'RNA_snn_res.0.5', 'seurat_clusters', 'group', 'sample', 'pseduoPred', 'pred_level1_anno', 'pred_level2_anno', 'pred_level3_anno', 'pred_level4_anno', 'pred_level5_anno', 'pred_level6_anno', 'pred_level7_anno', 'pred_lineage_fibro', 'pred_lineage_lepr', 'pred_lineage_msc', 'pred_lineage_chondro', 'lineageSum', 'Cell', 'nhood_ixs_random', 'nhood_ixs_refined', 'nhood_kth_distance'
      var:	'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
      uns:	'group_colors', 'neighbors', 'pred_level2_anno_colors', 'pred_level3_anno_colors', 'pred_lineage_chondro_colors', 'pred_lineage_fibro_colors', 'pred_lineage_lepr_colors', 'pred_lineage_msc_colors', 'umap', 'nhood_neighbors_key'
      obsm:	'X_umap', 'scANVI', 'nhoods'
      layers:	'counts'
      obsp:	'connectivities', 'distances'
    tdiff:	10 x 1001
      var:	'index_cell', 'kth_distance', 'time', 'range_down', 'range_up'
      uns:	'sample_col'
      varm:	'groupMA', 'groupYoung', 'Accept', 'logChange', 'null_mean'
    pseudobulk:	9120 x 22076
      obs:	'sample', 'group', 'pseduoPred', 'nhoods_index'
      var:	'overall_gene_p'
      uns:	'sample_col', 'group_col', 'time_col', 'var1', 'var2', 'shuffle_times', 'sum'
      varm:	'PValue', 'logFC', 'logCPM', 'FDR', 'F', 'groupMA', 'groupYoung', 'SPFDR', 'null_mean', 'nullPoint', 'truePoint', 'exprPoint', 'group1_cpm', 'group2_cpm', 'gene_p_adj'

Plot#

Group genes into gene clusters based on their pseudotemporal expression patterns. Currently, we offer two clustering strategies: Kmeans and Stage.

In Kmeans mode, genes are grouped using Kmeans clustering, which identifies differential expression patterns between two groups.

In Stage mode, genes are grouped based on the stage (early or late) at which they exhibit differential expression (up or down).

kmean_label= tja.diff.split_gene(tdata,mode="Kmean")

/home/gilberthan/anaconda3/envs/scarches/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
/home/gilberthan/Desktop/disk1/limb/3.9_wt_integrate/7.6_software/3.15_trajAtlas/script/../../../../TrajAtlas/TrajAtlas/TrajDiff/trajdiff_utils.py:157: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mdata["rna"].var["Kmeans"].loc[labels.index]=np.array(labels["geneGroup"])

We then utilized pyComplexHeatmap to generate a four-panel heatmap. The heatmap represents pseudotemporal gene expression for group 1, pseudotemporal gene expression for group 2, differential expression, and false discovery rate, respectively.

tdiff.plotDE(tdata,row_split=kmean_label)

Starting plotting..
Starting calculating row orders..
Reordering rows..
Starting calculating col orders..
Reordering cols..
Plotting matrix..
Collecting legends..
Starting plotting..
Starting calculating col orders..
Reordering cols..
Plotting matrix..
Collecting legends..
Estimated legend width: 7.5 mm

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

png

Based on the results, we observe that gene groups cluster according to their expression patterns.

stage_label= tja.diff.split_gene(tdata,mode="Stage")

/home/gilberthan/Desktop/disk1/limb/3.9_wt_integrate/7.6_software/3.15_trajAtlas/script/../../../../TrajAtlas/TrajAtlas/TrajDiff/trajdiff_utils_12.py:198: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mdata["rna"].var["Stage"].loc[labels.index]=np.array(labels)

tdiff.plotDE(tdata,row_split=stage_label)

Starting plotting..
Starting calculating row orders..
Reordering rows..
Starting calculating col orders..
Reordering cols..
Plotting matrix..
Collecting legends..
Starting plotting..
Starting calculating col orders..
Reordering cols..
Plotting matrix..
Collecting legends..
Estimated legend width: 7.5 mm

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

<Figure size 1200x800 with 0 Axes>

png

From the results, it’s evident that gene groups exert their functions at different differentiation stages. However, it’s worth noting that due to the limited number of cells at the late stage, we were unable to detect many genes at this stage.”