Matilda is a multi-task framework for learning from single-cell multimodal omics data. Matilda leverages the information from the multi-modality of such data and trains a neural network model to simultaneously learn multiple tasks including data simulation, dimension reduction, visualization, classification, and feature selection.
Matilda is developed using PyTorch 1.9.1 and requires >=1 GPU to run. We recommend using conda enviroment to install and run Matilda. We assume conda is installed. You can use the provided environment or install the environment by yourself accoring to your hardware settings. Note the following installation code snippets were tested on a Ubuntu system (v20.04) with NVIDIA GeForce 3090 GPU. The installation process needs about 15 minutes.
Step 1: Create and activate the conda environment for matilda using our provided file
conda env create -f environment_matilda.yaml
conda activate environment_matilda
Step 2: The following python packages are required for running Matilda: h5py, numpy, pandas, captum. They can be installed in the conda environment as below:
pip install h5py
pip install numpy
pip install pandas
pip install captum
pip install tqdm
pip install scipy
pip install scanpy
Step 3: Otain Matilda by clonning the github repository:
git clone https://github.com/liuchunlei0430/Matilda.git
Step 1: Create and activate the conda environment for matilda
conda create -n environment_matilda python=3.7
conda activate environment_matilda
Step 2: Check the environment including GPU settings and the highest CUDA version allowed by the GPU.
nvidia-smi
Step 3: Install pytorch and cuda version based on your GPU settings.
# Example code for installing CUDA 11.3
conda install pytorch==1.9.1 torchvision==0.10.1 torchaudio==0.9.1 cudatoolkit=11.3 -c pytorch -c conda-forge
Step 4: The following python packages are required for running Matilda: h5py, numpy, pandas, captum. They can be installed in the conda environment as below:
pip install h5py
pip install numpy
pip install pandas
pip install captum
pip install tqdm
pip install scipy
pip install scanpy
Step 5: Otain Matilda by clonning the github repository:
git clone https://github.com/liuchunlei0430/Matilda.git
Matilda’s main function takes expression data (e.g., RNA, ADT, ATAC) in .h5
format and cell type labels in .csv
format. Matilda expects raw count data for RNA and ADT modalities. For ATAC modality, Matilda expects the 'gene activity score' generated by Seurat from raw count data.
An example for creating .h5 file from expression matrix in the R environment is as below:
write_h5 <- function(exprs_list, h5file_list) {
for (i in seq_along(exprs_list)) {
h5createFile(h5file_list[i])
h5createGroup(h5file_list[i], "matrix")
writeHDF5Array(t((exprs_list[[i]])), h5file_list[i], name = "matrix/data")
h5write(rownames(exprs_list[[i]]), h5file_list[i], name = "matrix/features")
h5write(colnames(exprs_list[[i]]), h5file_list[i], name = "matrix/barcodes")
}
}
write_h5(exprs_list = list(rna = train_rna, h5file_list = "/Matilda/data/TEA-seq/train_rna.h5")
An example for creating gene activity score from ATAC modality in the R environment using human gene annotation is as below:
gene.activities <- CreateGeneActivityMatrix2(peak.matrix=teaseq.peak,
annotation.file = “Homo_sapiens.GRCh38.90.chr.gtf.gz”,
seq.levels = c(1:22, “X”, “Y”),
seq_replace = c(“:”))
As an example, the processed TEA-seq dataset by Swanson et al. (GSE158013) is provided for the example run, which is saved in ./Matilda/data/TEAseq
.
Users can prepare the example dataset as input for Matilda or use their own datasets.
Training and testing on demo dataset will cost no more than 1 minute with GPU.
cd Matilda
cd main
# training the matilda model
python main_matilda_train.py --rna [trainRNA] --adt [trainADT] --atac [trainATAC] --cty [traincty] #[training dataset]
# Example run
python main_matilda_train.py --rna ../data/TEAseq/train_rna.h5 --adt ../data/TEAseq/train_adt.h5 --atac ../data/TEAseq/train_atac.h5 --cty ../data/TEAseq/train_cty.csv
Training dataset information
--rna
: path to training data RNA modality.--adt
: path to training data ADT modality (can be null if ATAC is provided).--atac
: path to training data ATAC modality (can be null if ADT is provided). Note ATAC data should be summarised to the gene level as "gene activity score".--cty
: path to the labels of training data.
Training and model config
--batch_size
: Batch size (set as 64 by default)--epochs
: Number of epochs.--lr
: Learning rate.--z_dim
: Dimension of latent space.--hidden_rna
: Dimension of RNA branch.--hidden_adt
: Dimension of ADT branch.--hidden_atac
: Dimension of ATAC branch.
Other config
--seed
: The random seed for training.--augmentation
: Whether to augment simulated data.
Note: after training, the model will be saved in ./Matilda/trained_model/
.
After training the model, we can use main_matilda_task.py
to do multiple tasks with different augments.
--classification
: whether to do cell type classification.--fs
: whether to do cell type feature selection.--dim_reduce
: whether to do dimension reduction.--simulation
: whether to do simulation.--simulation_ct
: an integer index for which cell type to simulate. Only be activated whensimulation = True
.--simulation_num
: the number of cells to simulate for the specified cell type. Only be activated whensimulation = True
.
1) Multi-task on the training data
# using the trained model for data simulation
python main_matilda_task.py --rna [trainRNA] --adt [trainADT] --atac [trainATAC] --cty [traincty] --simulation True --simulation_ct 1 --simulation_num 200
# Example run
python main_matilda_task.py --rna ../data/TEAseq/train_rna.h5 --adt ../data/TEAseq/train_adt.h5 --atac ../data/TEAseq/train_atac.h5 --cty ../data/TEAseq/train_cty.csv --simulation True --simulation_ct 1 --simulation_num 200
Output: The output will be saved in ./Matilda/output/simulation_result/TEAseq/reference/
. To generate UMAP plots for the simulated data using R, run ./Matilda/qc/visualize_simulated_data.Rmd
. The UMAPs are:
# using the trained model for data dimension reduction and visualisation
python main_matilda_task.py --rna [trainRNA] --adt [trainADT] --atac [trainATAC] --cty [traincty] --dim_reduce True
# Example run
python main_matilda_task.py --rna ../data/TEAseq/train_rna.h5 --adt ../data/TEAseq/train_adt.h5 --atac ../data/TEAseq/train_atac.h5 --cty ../data/TEAseq/train_cty.csv --dim_reduce True
Output: The output will be saved in ./Matilda/output/dim_reduce/TEAseq/reference/
. To generate UMAP plots and 4 clustering metrices, i.e., ARI, NMI, FM, Jaccard, for the latent space using R, run ./Matilda/qc/visualize_latent_space.Rmd
. The UMAPs are:
# using the trained model for feature selection
python main_matilda_task.py --rna [trainRNA] --adt [trainADT] --atac [trainATAC] --cty [traincty] --fs True
# Example run
python main_matilda_task.py --rna ../data/TEAseq/train_rna.h5 --adt ../data/TEAseq/train_adt.h5 --atac ../data/TEAseq/train_atac.h5 --cty ../data/TEAseq/train_cty.csv --fs True
Output: The output, i.e. feature importance scores, will be saved in ./Matilda/output/marker/TEAseq/reference/
.
2) Multi-task on the query data
# using the trained model for classifying query data
python main_matilda_task.py --rna [queryRNA] --adt [queryADT] --atac [queryATAC] --cty [querycty] --classification True
# Example run
python main_matilda_task.py --rna ../data/TEAseq/test_rna.h5 --adt ../data/TEAseq/test_adt.h5 --atac ../data/TEAseq/test_atac.h5 --cty ../data/TEAseq/test_cty.csv --classification True --query True
Output: The output will be saved in ./Matilda/output/classification/TEAseq/query/
.
cell ID: 0 real cell type: T.CD4.Memory predicted cell type: T.CD4.Naive probability: 0.77
cell ID: 1 real cell type: B.Activated predicted cell type: B.Activated probability: 0.53
cell ID: 2 real cell type: B.Naive predicted cell type: B.Naive probability: 0.73
cell ID: 3 real cell type: T.CD4.Naive predicted cell type: T.CD4.Naive probability: 0.78
cell ID: 4 real cell type: T.CD4.Memory predicted cell type: T.CD4.Memory probability: 0.87
cell ID: 5 real cell type: Mono.CD14 predicted cell type: Mono.CD14 probability: 0.95
cell ID: 6 real cell type: B.Naive predicted cell type: B.Naive probability: 0.78
cell ID: 7 real cell type: Mono.CD14 predicted cell type: Mono.CD14 probability: 0.96
cell ID: 8 real cell type: T.CD8.Effector predicted cell type: T.CD8.Effector probability: 0.95
……
cell type ID: 0 cell type: B.Activated prec : tensor(72.2454, device='cuda:0') number: 180
cell type ID: 1 cell type: B.Naive prec : tensor(98.1400, device='cuda:0') number: 802
cell type ID: 2 cell type: DC.Myeloid prec : tensor(40., device='cuda:0') number: 11
cell type ID: 3 cell type: Mono.CD14 prec : tensor(98.6156, device='cuda:0') number: 639
cell type ID: 4 cell type: Mono.CD16 prec : tensor(74.1379, device='cuda:0') number: 37
cell type ID: 5 cell type: NK prec : tensor(97.1820, device='cuda:0') number: 283
cell type ID: 6 cell type: Platelets prec : tensor(45.4545, device='cuda:0') number: 12
cell type ID: 7 cell type: T.CD4.Memory prec : tensor(73.3831, device='cuda:0') number: 1189
cell type ID: 8 cell type: T.CD4.Naive prec : tensor(76.2363, device='cuda:0') number: 1020
cell type ID: 9 cell type: T.CD8.Effector prec : tensor(83.4451, device='cuda:0') number: 576
cell type ID: 10 cell type: T.CD8.Naive prec : tensor(84.5635, device='cuda:0') number: 299
# using the trained model for dimension reduction and visualising query data
python main_matilda_task.py --rna [queryRNA] --adt [queryADT] --atac [queryATAC] --cty [querycty] --dim_reduce True
# Example run
python main_matilda_task.py --rna ../data/TEAseq/test_rna.h5 --adt ../data/TEAseq/test_adt.h5 --atac ../data/TEAseq/test_atac.h5 --cty ../data/TEAseq/test_cty.csv --dim_reduce True --query True
Output: The output will be saved in ./Matilda/output/dim_reduce/TEAseq/query/
. To generate UMAP plots and 4 clustering metrices, i.e., ARI, NMI, FM, Jaccard, for the latent space using R, run ./Matilda/qc/visualize_latent_space.Rmd
. The UMAPs are:
# using the trained model for feature selection
python main_matilda_task.py --rna [queryRNA] --adt [queryADT] --atac [queryATAC] --cty [querycty] --fs True
# Example run
python main_matilda_task.py --rna ../data/TEAseq/test_rna.h5 --adt ../data/TEAseq/test_adt.h5 --atac ../data/TEAseq/test_atac.h5 --cty ../data/TEAseq/test_cty.csv --fs True --query True
Output: The output, i.e. feature importance scores, will be saved in ./Matilda/output/markers/TEAseq/query/
.
[1] Ramaswamy, A. et al. Immune dysregulation and autoreactivity correlate with disease severity in SARS-CoV-2-associated multisystem inflammatory syndrome in children. Immunity 54, 1083– 1095.e7 (2021).
[2] Ma, A., McDermaid, A., Xu, J., Chang, Y. & Ma, Q. Integrative Methods and Practical Challenges for Single-Cell Multi-omics. Trends Biotechnol. 38, 1007–1022 (2020).
[3] Swanson, E. et al. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. Elife 10, (2021).
This project is covered under the Apache 2.0 License.