diff --git a/CHANGELOG.md b/CHANGELOG.md index 89292403..9289f644 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,7 +6,13 @@ - LSI in panpipes_preprocess is run on the highly variable features - n_comp for LSI ### fixed + +- changed all instances of ADT into PROT +- changed all instances of GEX to RNA +- changed the params to fix plotting as mentioned in issue #41 +- typo in readme - set default seaborn <=0.12.2 to avoid issue #104, #126 + ### dependencies ## v0.3.1 @@ -17,7 +23,7 @@ - Spatial data analysis is now included in panpipes - panpipes qc_spatial - panpipes preprocess_spatial - - panpipes_deconvolution_spatials + - panpipes_deconvolution_spatial ### fixed - make sure columns from individual modalities that are not in the multimodal outer obs can be used to diff --git a/README.md b/README.md index b30eb0bc..d15eeb54 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ Created and Maintained by Charlotte Rich-Griffin and Fabiola Curion Additional contributors: Sarah Ouologuem, Devika Agarwal and Tom Thomas -See out [documentation](https://panpipes-pipelines.readthedocs.io/en/latest/) and our [preprint](https://www.biorxiv.org/content/10.1101/2023.03.11.532085v1): +See our [documentation](https://panpipes-pipelines.readthedocs.io/en/latest/) and our [preprint](https://www.biorxiv.org/content/10.1101/2023.03.11.532085v1): Panpipes: a pipeline for multiomic single-cell data analysis Charlotte Rich-Griffin*, Fabiola Curion*, Tom Thomas, Devika Agarwal, Fabian J. Theis, Calliope A. Dendrou. bioRxiv 2023.03.11.532085; @@ -12,10 +12,10 @@ doi: https://doi.org/10.1101/2023.03.11.532085 # Introduction -These pipelines use cgat-core pipeline software +These workflows use cgat-core pipeline software -Available pipelines: -1. "ingest" : for the ingestion of data and computation of QC metrics' +Available workflows: +1. "ingest" : for the ingestion of data and computation of QC metrics 2. "preprocess" : for filtering and normalising of each modality 3. "integration" : integrate and batch correction using single and multimodal methods 4. "clustering" : cell clustering on single modalities diff --git a/docs/setup_for_ingest.md b/docs/setup_for_ingest.md index 86db2324..b389b86b 100644 --- a/docs/setup_for_ingest.md +++ b/docs/setup_for_ingest.md @@ -3,13 +3,13 @@ For ingest the minimum required columns are -sample_id | gex_path | gex_filetype +sample_id | rna_path | rna_filetype ----------|----------|------------- If you want to analyse other modalities, add columns to the input file -- adt_path/adt_filetype +- prot_path/prot_filetype - atac_path/atac_filetype - tcr_path/tcr_filetype - bcr_path/bcr_filetype @@ -20,7 +20,7 @@ example at `resources/sample_file_ingest.txt` If giving a cellranger path, give the path folder containing all the cellranger outputs. Otherwise path should be the complete path to the file. -If you have cellranger outputs which have gex and adt within the same files, specify the same path in gex_path and adt_path +If you have cellranger outputs which have rna and prot within the same files, specify the same path in rna_path and prot_path To include sample level metadata, you can add additional columns to the submission file e.g Tissue and Diagnoisis columns in `resources/sample_file_ingest.txt` @@ -38,13 +38,13 @@ For each modality per sample, specify the value in the key column in the X_filet modality |key |description ------------|----------|---------- -gex/adt/atac|cellranger| the "outs" folder produced by **cellranger count** -gex/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** -gex/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger -gex/adt/atac|hd5 | Read a generic .h5 (hdf5) file. -gex/adt/atac|h5ad | Anndata h5ad objects (one per sample) -gex/adt/atac|txt_matrix | tab-delimited file (one per sample) -gex/adt/atac|csv_matrix | comma-delimited file (one per sample) +rna/prot/atac|cellranger| the "outs" folder produced by **cellranger count** +rna/prot/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** +rna/prot/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger +rna/prot/atac|hd5 | Read a generic .h5 (hdf5) file. +rna/prot/atac|h5ad | Anndata h5ad objects (one per sample) +rna/prot/atac|txt_matrix | tab-delimited file (one per sample) +rna/prot/atac|csv_matrix | comma-delimited file (one per sample) tcr/bcr |cellranger_vdj| Path to filtered_contig_annotations.csv, all_contig_annotations.csv or all_contig_annotations.json. produced by **cellranger vdj** further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_10x_vdj.html) tcr/bcr |tracer| data from [TraCeR](https://github.com/Teichlab/tracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_tracer.html) tcr/bcr |bracer| data from [BraCeR](https://github.com/Teichlab/bracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_bracer.html) diff --git a/docs/usage/sample_file_qc_mm.md b/docs/usage/sample_file_qc_mm.md index 47ca20e7..c6e5cc96 100644 --- a/docs/usage/sample_file_qc_mm.md +++ b/docs/usage/sample_file_qc_mm.md @@ -1,12 +1,10 @@ Example sample submission file ----------------------------- -| sample_id | gex_path | gex_filetype | adt_path | adt_filetype | tissue | diagnosis | +| sample_id | rna_path | rna_filetype | prot_path | prot_filetype | tissue | diagnosis | |-----------|------------------------------------|--------------|-------------------------------------|--------------|--------|-----------| -| Sample1 | Sample1_gex.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | +| Sample1 | Sample1_rna.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | | Sample2 | cellranger_count/Sample2_GEX/outs/ | cellranger | cellranger_count/Sample2_CITE/outs/ | cellranger | pbmc | diseased | Download this file: [sample_file_qc_mm.txt](sample_file_qc_mm.txt) - - diff --git a/docs/usage/sample_file_qc_mm.txt b/docs/usage/sample_file_qc_mm.txt index 5cc13a1f..77775ce8 100644 --- a/docs/usage/sample_file_qc_mm.txt +++ b/docs/usage/sample_file_qc_mm.txt @@ -1,3 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype tissue diagnosis +sample_id rna_path rna_filetype prot_path prot_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/docs/usage/setup_for_qc_mm.md b/docs/usage/setup_for_qc_mm.md index ee32ffce..3f20695b 100644 --- a/docs/usage/setup_for_qc_mm.md +++ b/docs/usage/setup_for_qc_mm.md @@ -6,21 +6,21 @@ The multimodal QC pipeline (qc_mm) requires a sample submission file which it us The minimum required columns are -sample_id | gex_path | gex_filetype +sample_id | rna_path | rna_filetype ----------|----------|------------- If you want to analyse other modalities, add additional columns to the input file -- adt_path/adt_filetype +- prot_path/prot_filetype - atac_path/atac_filetype - tcr_path/tcr_filetype - bcr_path/bcr_filetype **sample id**: Each row must have a unique sample ID. -**{X}_paths**: If giving a cellranger path, give the path folder containing all the cellranger outputs, known as the `outs` folder. Otherwise path should be the complete path to the file. If you have cellranger outputs which have gex and adt within the same files, specify the same path in gex_path and adt_path +**{X}_paths**: If giving a cellranger path, give the path folder containing all the cellranger outputs, known as the `outs` folder. Otherwise path should be the complete path to the file. If you have cellranger outputs which have rna and prot within the same files, specify the same path in rna_path and prot_path **{X}_filetype**: The "filetype" column tells panpipe how to read in the data. Panpipes supports a range of inputs. See the [supported input filetypes](#supported-input-filetypes) below to see the options for the {X}_filetype columns @@ -35,7 +35,7 @@ You will also need to list which additional metadata columns you want to include ## Example sample submission file -| sample_id | gex_path | gex_filetype | adt_path | adt_filetype | tissue | diagnosis | +| sample_id | rna_path | rna_filetype | prot_path | prot_filetype | tissue | diagnosis | |-----------|------------------------------------|--------------|-------------------------------------|--------------|--------|-----------| | Sample1 | Sample1_gex.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | | Sample2 | cellranger_count/Sample2_GEX/outs/ | cellranger | cellranger_count/Sample2_CITE/outs/ | cellranger | pbmc | diseased | @@ -58,13 +58,13 @@ For each modality per sample, specify the value in the key column in the X_filet modality |key |description ------------|----------|---------- -gex/adt/atac|cellranger| the "outs" folder produced by **cellranger count** -gex/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** -gex/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger -gex/adt/atac|hd5 | Read a generic .h5 (hdf5) file. -gex/adt/atac|h5ad | Anndata h5ad objects (one per sample) -gex/adt/atac|txt_matrix | tab-delimited file (one per sample) -gex/adt/atac|csv_matrix | comma-delimited file (one per sample) +rna/prot/atac|cellranger| the "outs" folder produced by **cellranger count** +rna/prot/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** +rna/prot/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger +rna/prot/atac|hd5 | Read a generic .h5 (hdf5) file. +rna/prot/atac|h5ad | Anndata h5ad objects (one per sample) +rna/prot/atac|txt_matrix | tab-delimited file (one per sample) +rna/prot/atac|csv_matrix | comma-delimited file (one per sample) tcr/bcr |cellranger_vdj| Path to filtered_contig_annotations.csv, all_contig_annotations.csv or all_contig_annotations.json. produced by **cellranger vdj** further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_10x_vdj.html) tcr/bcr |tracer| data from [TraCeR](https://github.com/Teichlab/tracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_tracer.html) tcr/bcr |bracer| data from [BraCeR](https://github.com/Teichlab/bracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_bracer.html) diff --git a/docs/workflows/integration.md b/docs/workflows/integration.md index eee7396b..dc923aa1 100644 --- a/docs/workflows/integration.md +++ b/docs/workflows/integration.md @@ -5,7 +5,7 @@ The panpipes integration pipeline implements a variety of tools to batch correct ![integration_flowchart](../img/integration_coloured.drawio.png) -The flowchart indicates which tools are available for each modality out of GEX (also referred to as RNA), ADT (also referred to as PROT) and ATAC. You can run as many of these tools as you choose and then you can run `panpipes integration make merge_batch_correction` to create a final object containing one reduced dimension representation and one nearest neighbor graph per modality. This can be used as input to the clustering pipeline. +The flowchart indicates which tools are available for each modality out of RNA (also referred to as GEX), PROT (also referred to as ADT) and ATAC. You can run as many of these tools as you choose and then you can run `panpipes integration make merge_batch_correction` to create a final object containing one reduced dimension representation and one nearest neighbor graph per modality. This can be used as input to the clustering pipeline. ## Steps to run: diff --git a/docs/workflows/pipeline_preprocess_preprint.yml b/docs/workflows/pipeline_preprocess_preprint.yml index d985d7d1..c172b00e 100644 --- a/docs/workflows/pipeline_preprocess_preprint.yml +++ b/docs/workflows/pipeline_preprocess_preprint.yml @@ -157,11 +157,11 @@ plotqc: grouping_var: sample_id,orig.ident # use these continuous variables to plot gradients and distributions rna_metrics: pct_counts_mt,pct_counts_rp,pct_counts_hb,doublet_scores - prot_metrics: total_counts,log1p_total_counts,n_adt_by_counts + prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts atac_metrics: total_counts rep_metrics: - +# -------------------------------------------------------------------------------------------------------- # RNA Normalisation # -------------------------------------------------------------------------------------------------------- # hvg_flavour options include "seurat", "cell_ranger", "seurat_v3", default; "seurat" @@ -213,8 +213,8 @@ pca: scree_n_pcs: 50 color_by: sample_id - -# Protein (ADT) normalisation +# -------------------------------------------------------------------------------------------------------- +# Protein (PROT) normalisation # -------------------------------------------------------------------------------------------------------- prot: # comma separated string of normalisation options @@ -229,7 +229,7 @@ prot: # CLR parameters: # margin determines whether you normalise per cell (as you would RNA norm), - # or by feature (recommended, due to the variable nature of adts). + # or by feature (recommended, due to the variable nature of prot assays). # CLR margin 0 is recommended for informative qc plots in this pipeline # 0 = normalise colwise (per feature) # 1 = normalise rowwise (per cell) @@ -250,9 +250,9 @@ prot: # do you want to save the prot normalised assay additionally as a txt file: save_norm_prot_mtx: False - +# -------------------------------------------------------------------------------------------------------- # ATAC preprocessing and normalisation -# -------------------------- +# -------------------------------------------------------------------------------------------------------- atac: binarize: False normalize: TFIDF diff --git a/docs/workflows/qc.md b/docs/workflows/qc.md index cb298c4c..553e3efc 100644 --- a/docs/workflows/qc.md +++ b/docs/workflows/qc.md @@ -22,7 +22,7 @@ Then qc metrics are computed using [scanpy.pp.calculated_qc_metrics](https://sca - Protein metadata such as isotype status or shorter names incorporated into the object. Inputs for this are described [here] - Per cell QC metrics computed as described above, including pct_isotype where isotype information is available. - Per Protein metrics, total_counts, and are computed, in order to compare the binding of different antibodies (applicable when your assay is CITE-seq based). These are defined in the yml: -`prot_metrics_per_adt: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts` +`plot_metrics_per_prot: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts` - A rudimentary check for cells that are 'isotype' outliers, i.e. the cells where the isotype content is in the top 10% quantile for more than 2 isotypes. (these parameters are customisable in the`pipeline.yml`). See the function [here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/funcs/scmethods.py#L328). ``` # isotype outliers: one way to determine which cells are very sticky is to work out which cells have the most isotype UMIs @@ -63,7 +63,7 @@ There is an additional optional `assess_background` step, if the raw data (inclu [Inputs to Multimodal QC pipeline](../setup_for_qc_mm) 2. Generate qc genelists as described in [Gene list format](../gene_list_format) -3. For adt assay - generate the protein metadata file +3. For prot assay - generate the protein metadata file [example]((https://github.com/DendrouLab/panpipes/blob/main/resources/protein_metadata_w_iso.md)). This file is integrated into the mdata\['prot'\].var slot. 4. Generate config file (`panpipes ingest config`) diff --git a/panpipes/.DS_Store b/panpipes/.DS_Store index bf2bd9b8..1d920382 100644 Binary files a/panpipes/.DS_Store and b/panpipes/.DS_Store differ diff --git a/panpipes/R_scripts/plotQC.R b/panpipes/R_scripts/plotQC.R index c29e3494..c8b1e7b0 100644 --- a/panpipes/R_scripts/plotQC.R +++ b/panpipes/R_scripts/plotQC.R @@ -17,7 +17,8 @@ do_scatter_plot <- function(df, x, y, facet=NULL, hue=NULL){ } g <- g+ theme_bw()+ theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5), - strip.text.x = element_text(size=6)) + strip.text.x = element_text(size=6), + legend.key.size = unit(0.2, 'cm')) return(g) } @@ -184,25 +185,25 @@ for (sc in rna_source_facet){ if (all(c("total_counts","n_genes_by_counts")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="total_counts",y="n_genes_by_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-nUMI_v_rna-genes.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("log1p_total_counts","log1p_n_genes_by_counts")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="log1p_total_counts",y="log1p_n_genes_by_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-log1p_nUMI_v_rna-log1p_genes.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("total_counts","pct_counts_mt")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="total_counts",y="pct_counts_mt", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-nUMI_v_rna-pct_mt.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("n_genes_by_counts","doublet_scores","total_counts")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="n_genes_by_counts",y="doublet_scores", hue="total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-genes_rna-doublet_scores_rna-numi.png")), - type="cairo", width= 2*ncols, height=2*nrows, dpi=120) + type="cairo", width= 3*ncols, height=3*nrows, dpi=200) } } @@ -238,7 +239,7 @@ if(!is.null(opt$prot_qc_metrics)){ } # do the following plots: - # - prot:total_counts vs prot_nadt_by_counts + # - prot:total_counts vs prot_nprot_by_counts # - prot:total_counts vs prot:isotype_counts # - prot:log1p_total_counts vs prot:log1p_isotype_counts # - prot:total_counts vs prot:pct_isotype_counts @@ -255,30 +256,30 @@ if(!is.null(opt$prot_qc_metrics)){ } # plot once per source facet message("protein scatter plots") - if (all(c("total_counts","n_adt_by_counts")%in% colnames(prot_data_plot))){ - g <- do_scatter_plot(prot_data_plot,x="total_counts",y="n_adt_by_counts", facet=sc) - ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-adt.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + if (all(c("total_counts","n_prot_by_counts")%in% colnames(prot_data_plot))){ + g <- do_scatter_plot(prot_data_plot,x="total_counts",y="n_prot_by_counts", facet=sc) + ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-prot.png")), type="cairo", + width= 3*ncols, height=3*nrows, dpi=200) } - if (all(c("log1p_total_counts","log1p_n_adt_by_counts")%in% colnames(prot_data_plot))){ - g <- do_scatter_plot(prot_data_plot,x="log1p_total_counts",y="log1p_n_adt_by_counts", facet=sc) - ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-log1p_nUMI_v_prot-log1p_adt.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + if (all(c("log1p_total_counts","log1p_n_prot_by_counts")%in% colnames(prot_data_plot))){ + g <- do_scatter_plot(prot_data_plot,x="log1p_total_counts",y="log1p_n_prot_by_counts", facet=sc) + ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-log1p_nUMI_v_prot-log1p_prot.png")), type="cairo", + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("total_counts","pct_counts_isotype")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="total_counts",y="pct_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-pct_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("total_counts","total_counts_isotype")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="total_counts",y="total_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-total_counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("log1p_total_counts","log1p_total_counts_isotype", "isotype_exclude")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="log1p_total_counts",y="log1p_total_counts_isotype", hue="isotype_exclude", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-log1p_nUMI_v_prot-log1p_total_counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } } @@ -376,27 +377,27 @@ if(!is.null(opt$prot_qc_metrics)){ if (all(c("rna.total_counts","prot.total_counts")%in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.total_counts",y="prot.total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-nUMI_v_rna-nUMI.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.log1p_total_counts","prot.log1p_total_counts")%in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.log1p_total_counts",y="prot.log1p_total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-log1p_nUMI_v_rna-log1p_nUMI.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.total_counts","prot.total_counts_isotype")%in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.total_counts",y="prot.total_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-nUMI_v_prot-counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.log1p_total_counts","prot.log1p_total_counts_isotype") %in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.log1p_total_counts",y="prot.log1p_total_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-log1p_nUMI_v_prot-log1p_counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.doublet_scores","prot.log1p_total_counts") %in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.doublet_scores",y="prot.log1p_total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-doublet_scores_v_prot-log1p_nUMI.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } diff --git a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R index e1dde13a..e7ccf528 100644 --- a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R +++ b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R @@ -54,8 +54,8 @@ project <- opt$project if(!is.null(opt$csvpaths)) { caf <- read.table(opt$csvpaths, header=T, sep="\t") - if(!"gex_path" %in% colnames(caf)){ - caf$gex_path=caf$path + if(!"rna_path" %in% colnames(caf)){ + caf$rna_path=caf$path caf <- caf %>% filter(filetype=="cellranger") if(nrow(caf)==0){ message("no cellranger inputs") @@ -63,8 +63,8 @@ if(!is.null(opt$csvpaths)) { } } - # paths <- gsub("filtered_feature_bc_matrix","metrics_summary.csv",caf$gex_path) - paths <- file.path(caf$gex_path, "metrics_summary.csv") + # paths <- gsub("filtered_feature_bc_matrix","metrics_summary.csv",caf$rna_path) + paths <- file.path(caf$rna_path, "metrics_summary.csv") # paths <- gsub("\\/$","", paths) @@ -146,7 +146,7 @@ if(ncol(tag.mat)>0){ } if(nrow(mat)==0){mat <- tag.mat} -#modify this to deal with AB reads and GEX reads, need to split the knee plotting in 2 if you want both GEX and ADT knees in 2 different graphs +#modify this to deal with AB reads and rna reads, need to split the knee plotting in 2 if you want both rna and PROT knees in 2 different graphs if(opt$kneeplot){ print("plotting kneeplot") gsub("metrics_summary.csv","raw_feature_bc_matrix" , paths) ->normpath @@ -155,7 +155,7 @@ if(opt$kneeplot){ for( a in 1:length(normpath)){ print(normpath[a]) mm<-Read10X(normpath[a]) - if(class(mm)=="list") { mm<-do.call("rbind",mm)} #modify this to deal with AB reads and GEX reads, need to split the knee plotting in 2 if you want both GEX and ADT in 2 different graphs + if(class(mm)=="list") { mm<-do.call("rbind",mm)} #modify this to deal with prot reads and rna reads, need to split the knee plotting in 2 if you want both rna and prot in 2 different graphs colnames(mm) <- paste0(colnames(mm),"_",rownames(mat)[a]) xx <- CreateSeuratObject(mm,assay = "RNA", min.cells = 0, min.features = 0) xx@meta.data %>% @@ -200,11 +200,11 @@ if(opt$kneeplot){ theme(legend.key.size = unit(0.2,"cm"), axis.text=element_text(size=13,face="bold", color="black")) ->gb } gb + fl$layers ->gb - ggsave(gb,height = 15 ,width = 17, dpi=300, file=paste0(opt$figdir,"10xMetrics_",project,"_gex_knee_plot.png"), type="cairo-png") + ggsave(gb,height = 15 ,width = 17, dpi=300, file=paste0(opt$figdir,"10xMetrics_",project,"_rna_knee_plot.png"), type="cairo-png") } -# if ADT exists then repeat for ADT +# if PROT exists then repeat for PROT message("finished pipeline") diff --git a/panpipes/funcs/io.py b/panpipes/funcs/io.py index 387713fd..6d399925 100644 --- a/panpipes/funcs/io.py +++ b/panpipes/funcs/io.py @@ -48,41 +48,41 @@ def gen_load_anndata_jobs(caf, load_raw=False, mode_dictionary = {}, load_prot_f Generate a load_adatas job for each line in submission.txt """ for nn in range(0, caf.shape[0]): - if pd.isna(caf['gex_path'][nn]): - gex_path= None - gex_filetype=None - elif caf['gex_filetype'][nn]=="cellranger" and mode_dictionary["rna"]: - gex_path, gex_filetype = update_cellranger_col(caf['gex_path'][nn], raw=load_raw, method="count") - elif caf['gex_filetype'][nn]=="cellranger_multi" and mode_dictionary["rna"]: - gex_path, gex_filetype = update_cellranger_col(caf['gex_path'][nn], raw=load_raw, method="multi", + if pd.isna(caf['rna_path'][nn]): + rna_path= None + rna_filetype=None + elif caf['rna_filetype'][nn]=="cellranger" and mode_dictionary["rna"]: + rna_path, rna_filetype = update_cellranger_col(caf['rna_path'][nn], raw=load_raw, method="count") + elif caf['rna_filetype'][nn]=="cellranger_multi" and mode_dictionary["rna"]: + rna_path, rna_filetype = update_cellranger_col(caf['rna_path'][nn], raw=load_raw, method="multi", sample_id=caf['sample_id'][nn]) else: - gex_path, gex_filetype = caf[['gex_path', "gex_filetype"]].iloc[nn] + rna_path, rna_filetype = caf[['rna_path', "rna_filetype"]].iloc[nn] if load_raw: - gex_path = re.sub("filtered", "raw", gex_path) - # manage the adt paths - if ('adt_path' in caf.columns and mode_dictionary["prot"]): - # check if its the same as the gex path (data in the same file) - if pd.isna(caf['adt_path'][nn]): - adt_path= None - adt_filetype=None - elif caf['adt_path'][nn] == caf['gex_path'][nn]: - adt_path, adt_filetype = gex_path, gex_filetype - elif caf['adt_filetype'][nn]=="cellranger": - # we might want to load the raw here because we want to then subset by good gex barcodes, + rna_path = re.sub("filtered", "raw", rna_path) + # manage the prot paths + if ('prot_path' in caf.columns and mode_dictionary["prot"]): + # check if its the same as the rna path (data in the same file) + if pd.isna(caf['prot_path'][nn]): + prot_path= None + prot_filetype=None + elif caf['prot_path'][nn] == caf['rna_path'][nn]: + prot_path, prot_filetype = rna_path, rna_filetype + elif caf['prot_filetype'][nn]=="cellranger": + # we might want to load the raw here because we want to then subset by good gex (rna) barcodes, # this is why the load_prot_from_raw argument exists - adt_path, adt_filetype = update_cellranger_col(caf['adt_path'][nn], raw=load_prot_from_raw) - elif caf['adt_filetype'][nn]=="cellranger_multi": - # celranger multi has the same prot and gex barcodes - adt_path, adt_filetype = update_cellranger_col(caf['adt_path'][nn], raw=load_raw, method="multi", + prot_path, prot_filetype = update_cellranger_col(caf['prot_path'][nn], raw=load_prot_from_raw) + elif caf['prot_filetype'][nn]=="cellranger_multi": + # celranger multi has the same prot and gex (rna) barcodes + prot_path, prot_filetype = update_cellranger_col(caf['prot_path'][nn], raw=load_raw, method="multi", sample_id=caf['sample_id'][nn]) else: - adt_path, adt_filetype = caf[['adt_path', "adt_filetype"]].iloc[nn] + prot_path, prot_filetype = caf[['prot_path', "prot_filetype"]].iloc[nn] if load_prot_from_raw or load_raw: - adt_path = re.sub("filtered", "raw", adt_path) + prot_path = re.sub("filtered", "raw", prot_path) else: - adt_path= None - adt_filetype=None + prot_path= None + prot_filetype=None # load tcr_path if 'tcr_path' in caf.columns and mode_dictionary["tcr"] and pd.notna(caf['tcr_path'][nn]): tcr_path = caf['tcr_path'][nn] @@ -131,10 +131,10 @@ def gen_load_anndata_jobs(caf, load_raw=False, mode_dictionary = {}, load_prot_f else: outfile = outfile + ".h5mu" sample_id = caf['sample_id'][nn] - yield gex_path, outfile, \ + yield rna_path, outfile, \ sample_id, \ - gex_filetype, \ - adt_path, adt_filetype, \ + rna_filetype, \ + prot_path, prot_filetype, \ tcr_path, tcr_filetype, \ bcr_path, bcr_filetype, \ atac_path, atac_filetype, \ @@ -259,9 +259,9 @@ def write_obs(mdata, output_prefix="./", output_suffix="_filtered_cell_metadata. def check_submission_file(caf): if "filetype" in caf.columns or "path" in caf.columns: raise ValueError("you appear to be using the old notation for the sample submission file, \ - please update to use gex_path instead of path and gex_filetype instead of filetype") + please update to use rna_path instead of path and rna_filetype instead of filetype") # check for required cols - req_cols = ['sample_id', 'gex_path', 'gex_filetype'] + req_cols = ['sample_id', 'rna_path', 'rna_filetype'] for rc in req_cols: if rc not in caf.columns: raise ValueError("required column %s missing from submission file" % rc) @@ -363,7 +363,7 @@ def scp_read_10x_mtx(filename: PathLike, library_keep=None, *args, **kwargs) -> expanded sc.read_10x_mtx to filter for the library adapted from https://github.com/scverse/muon/blob/master/muon/_prot/io.py """ - adata = read_10x_mtx(filename, gex_only=False, *args, **kwargs) + adata = read_10x_mtx(filename, gex_only=False, *args, **kwargs) #need to leave gex as this is scanpy's code logging.debug("filtering cellranger outputs to %s" % library_keep) if library_keep is not None: adata = adata[ @@ -482,20 +482,21 @@ def load_mdata_from_multiple_files(all_files_dict): Parameters ---------- all_files_dict: dict - dictionary containing one key per assay out of [GEX, ADT, TCR, BCR, ATAC] + dictionary containing one key per assay out of [RNA, PROT, TCR, BCR, ATAC] and the values for each key is a list of file path and fie type - e.g. {"GEX": [filepath, "filetype"], - "ADT: [file path, "filetype"]} - Filetypes supported for gex/adt: ["cellranger", "h5ad", "csv_matrix", "txt_matrix", "10X_h5"], + e.g. {"RNA": [filepath, "filetype"], + "PROT: [file path, "filetype"]} + Filetypes supported for RNA/prot: ["cellranger", "h5ad", "csv_matrix", "txt_matrix", "10X_h5"], Filetypes supported for atac (multiome preferred is 10X_h5) ["10X_h5","cellranger","h5ad"] Filetypes supported for rep: ["cellranger_vdj", "airr", "tracer", "bracer" ] See scirpy documentation for more information of repertoire input formats https://scverse.org/scirpy/latest/api.html#module-scirpy.io """ # convert names to match mudata conventions - # mudata_conventional_names={"GEX":"rna", "ADT":"prot", "TCR":"tcr", - # "BCR":"bcr", "ATAC": "atac", "spatial":"spatial"} + # mudata_conventional_names + # {"GEX":"rna", "ADT":"prot", "TCR":"tcr", "BCR":"bcr", "ATAC": "atac", "spatial":"spatial"} # all_files_dict = {mudata_conventional_names[nm]: x for (nm, x) in all_files_dict.items()} + # note: scanpy's default function use gex_only as param so we need to leave that in logging.debug(all_files_dict.keys()) # load in separate anndata for each expected modality data_dict = {} @@ -578,8 +579,8 @@ def write_10x_counts(adata, path, layer=None): features = adata.var.reset_index().rename(columns={"index": "gene_symbols"}) elif "gene_symbols" in adata.var.columns: features = adata.var.reset_index().rename(columns={"index": "gene_ids"}) - if "adt_id" in features.columns: - features = features.drop(columns=['gene_symbols']).rename(columns={"adt_id" : "gene_symbols"}) + if "prot_id" in features.columns: + features = features.drop(columns=['gene_symbols']).rename(columns={"prot_id" : "gene_symbols"}) features = features[['gene_ids','gene_symbols', 'feature_types']] features.to_csv(os.path.join(path, "features.tsv.gz"), sep='\t', index=None, header=None) diff --git a/panpipes/funcs/processing.py b/panpipes/funcs/processing.py index b1781a92..906dbc70 100644 --- a/panpipes/funcs/processing.py +++ b/panpipes/funcs/processing.py @@ -253,7 +253,7 @@ def merge_with_adata_obs(adata, df, on_col="sample_id", inplace=False): return new_df -def add_var_mtd(prot, df, left_on="adt_id", right_on="adt_id"): +def add_var_mtd(prot, df, left_on="prot_id", right_on="prot_id"): if not isinstance(prot, AnnData): raise TypeError("prot is not an AnnData object") if not isinstance(df, pd.DataFrame): @@ -276,7 +276,7 @@ def add_var_mtd(prot, df, left_on="adt_id", right_on="adt_id"): prot.var.index = orig_index logging.debug(prot.var.head()) else: - warnings.warn(UserWarning("prot.var is at risk of getting disrodered, not merging the df")) + warnings.warn(UserWarning("prot.var is at risk of getting disordered, not merging the df")) diff --git a/panpipes/funcs/scmethods.py b/panpipes/funcs/scmethods.py index 2dcde11f..0960eed9 100644 --- a/panpipes/funcs/scmethods.py +++ b/panpipes/funcs/scmethods.py @@ -432,7 +432,7 @@ def X_is_raw(adata): # run clr -def run_adt_normalise(mdata, +def run_prot_normalise(mdata, mdata_bg, method: Literal['dsb', 'clr'] = 'clr', clr_margin: Literal[0, 1] = '0', diff --git a/panpipes/panpipes/pipeline_clustering.py b/panpipes/panpipes/pipeline_clustering.py index 886b597f..99837875 100644 --- a/panpipes/panpipes/pipeline_clustering.py +++ b/panpipes/panpipes/pipeline_clustering.py @@ -103,7 +103,7 @@ def calc_sm_umaps(infile, outfile, mod, mindist, log_file): --min_dist %(mindist)s """ if mod is not None and mod != "multimodal": - # if this is not specified it will use gex default + # if this is not specified it will use rna default cmd += " --modality %(mod)s" elif mod=="multimodal": if PARAMS['multimodal_integration_method'].lower() == "wnn": @@ -146,7 +146,7 @@ def calc_cluster(infile, outfile, mod, res, alg, log_file): --algorithm %(alg)s """ if mod is not None and mod != "multimodal": - # if this is not specified it will use gex default + # if this is not specified it will use rna default cmd += " --modality %(mod)s" elif mod=="multimodal": if PARAMS['multimodal_integration_method'].lower() == "wnn": @@ -412,7 +412,7 @@ def marker_analysis(fname): # if PARAMS['use_muon']: # cmd += " --use_muon True" # if PARAMS['modality']: -# # if this is not specified it will use gex default +# # if this is not specified it will use rna default # cmd += " --modality %(modality)s" # print(cmd) # P.run(cmd, job_threads=PARAMS['resources_threads_medium']) diff --git a/panpipes/panpipes/pipeline_ingest.py b/panpipes/panpipes/pipeline_ingest.py index 2e714c4f..f1d1b89e 100644 --- a/panpipes/panpipes/pipeline_ingest.py +++ b/panpipes/panpipes/pipeline_ingest.py @@ -425,10 +425,10 @@ def run_scanpy_prot_qc(log_file, outfile, unfilt_file): cmd += " --channel_col %(channel_col)s" else: cmd += " --channel_col sample_id" - if PARAMS['prot_plotqc_metrics']: - cmd += " --per_cell_metrics %(prot_plotqc_metrics)s" - if PARAMS['prot_metrics_per_adt']: - cmd += " --per_adt_metrics %(prot_metrics_per_adt)s" + if PARAMS['plotqc_prot_metrics']: + cmd += " --per_cell_metrics %(plotqc_prot_metrics)s" + if PARAMS['plot_metrics_per_prot']: + cmd += " --per_prot_metrics %(plot_metrics_per_prot)s" if PARAMS['identify_isotype_outliers']: cmd += " --identify_isotype_outliers %(identify_isotype_outliers)s" cmd += " --isotype_upper_quantile %(isotype_upper_quantile)s" diff --git a/panpipes/panpipes/pipeline_ingest/pipeline.yml b/panpipes/panpipes/pipeline_ingest/pipeline.yml index 221b0b26..581d54e8 100644 --- a/panpipes/panpipes/pipeline_ingest/pipeline.yml +++ b/panpipes/panpipes/pipeline_ingest/pipeline.yml @@ -87,20 +87,20 @@ barcode_mtd: metadatacols: #--------------------------------------- -# Loading adt data - additional options +# Loading prot data - additional options #--------------------------------------- # As default the ingest will choose the first column of the cellranger features.tsv.gz # To merge extra information about the antibodies e.g. whether they are hashing antibodies or isotypes, # Create a table with the first column equivalent to the first column of cellrangers features.tsv.gz # and specify it below (save as txt file) protein_metadata_table: -# Make sure there are unique entries in this column for each adt. +# Make sure there are unique entries in this column for each prot. # include a column called isotype (containing True or False) if you want to qc isotypes # include a column called hashing_ab (containing True or False) if your dataset contains hashing antibodies which you wish to split into a separate modality # If you want to update the mudata index with a column from protein_metadata_table, specify here # If there are overlaps with the rna gene symbols, then you might have trouble down the line -# It is recommended to add something to make the labels unique e.g. "adt_CD4" +# It is recommended to add something to make the labels unique e.g. "prot_CD4" index_col_choice: # If providing separate prot and rna 10X outputs, then the pipeline will load the filtered rna 10X outputs and the raw prot 10X counts @@ -120,7 +120,7 @@ subset_prot_barcodes_to_rna: False plot_10X_metrics: True kneeplot: False # ------------ -## Doublets on GEX - Scrublet +## Doublets on RNA - Scrublet # ------------ scr: run: True @@ -155,7 +155,7 @@ scr: #if use_thr is True, this thr will be used to define doublets # ------------ -# GEX QC +# RNA QC # ------------ # this part of the pipeline allows to generate the QC parameters that will be used to # evaluate inclusion/ exclusion criteria. Filtering of cells/genes happens in the following pipeline @@ -181,7 +181,7 @@ custom_genes_file: resources/qc_genelist_1.0.csv calc_proportions: hb,mt,rp score_genes: MarkersNeutro - +# ------------ # Plot RNA QC metrics # ------------ # all metrics should be inputted as a comma separated string e.g. a,b,c @@ -191,25 +191,24 @@ plotqc_grouping_var: sample_id plotqc_rna_metrics: doublet_scores,pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig - # ------------ -# ADT qc - requires prot_path to be included in the submission file +# Plot PROT QC metrics # ------------ +# requires prot_path to be included in the submission file # all metrics should be inputted as a comma separated string e.g. a,b,c -### ADT QC metrics: # as standard the following metrics are calculated for prot data # per cell metrics: -# total_counts,log1p_total_counts,n_adt_by_counts,log1p_n_adt_by_counts +# total_counts,log1p_total_counts,n_prot_by_counts,log1p_n_prot_by_counts # if isotypes can be detected then the following are calculated also: # total_counts_isotype,pct_counts_isotype # choose which ones you want to plot here -plotqc_prot_metrics: total_counts,log1p_total_counts,n_adt_by_counts,pct_counts_isotype - -# per antibody metrics +plotqc_prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype +# Since the protein antibody panels usually count fewer features than the RNA, it may be interesting to +# visualize breakdowns of single proteins plots describing their count distribution and other qc options. +# choose which ones you want to plot here, for example # n_cells_by_counts,mean_counts,log1p_mean_counts,pct_dropout_by_counts,total_counts,log1p_total_counts -# choose which ones you want to plot here: -prot_metrics_per_adt: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts +plot_metrics_per_prot: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts # isotype outliers: one way to determine which cells are very sticky is to work out which cells have the most isotype UMIs # associated to them, to label a cell as an isotype outlier, it must meet or exceed the following crietria: @@ -218,7 +217,6 @@ prot_metrics_per_adt: total_counts,log1p_total_counts,n_cells_by_counts,mean_cou identify_isotype_outliers: True isotype_upper_quantile: 90 isotype_n_pass: 2 -# TODO: work out how to plot this. ### Investigate per channel antibody staining: # This can help determine any inconsistencies in staining per channel and other QC concerns. @@ -228,10 +226,10 @@ isotype_n_pass: 2 # i.e. the 10X channel. # this is usually the sample_id column (otherwise leave the next parameter blank) channel_col: sample_id -# It is important to note that in ingest the per channel normalised adt scores are not saved in the mudata object +# It is important to note that in ingest the per channel normalised prot scores are not saved in the mudata object # this is because if you perform feature normalisation (clr normalisation margin 0 or dsb normalisation), # on subsets of cells then the normalised values cannot be simply concatenated. -# The ADT normalisation is rerun pn the complete object in the preprocess pipeline +# The PROT normalisation is rerun pn the complete object in the preprocess pipeline # (or you can run this pipeline with channel_col set as None) # it is important to note, if you choose to run the clr on a per channel basis, then it is not stored in the h5mu file. # if you want to save the per channel normalised values set the following to True: @@ -246,7 +244,7 @@ normalisation_methods: clr # CLR parameters: # margin determines whether you normalise per cell (as you would RNA norm), -# or by feature (recommended, due to the variable nature of adts). +# or by feature (recommended, due to the variable nature of prot assays). # CLR margin 0 is recommended for informative qc plots in this pipeline # 0 = normalise colwise (per feature) # 1 = normalise rowwise (per cell) @@ -263,11 +261,8 @@ quantile_clipping: True # from both rna and protein assays, # see details for how to make sure your files are compatible in the assess background section below - - - # ------------ -# ATAC qc +# Plot ATAC QC metrics # ------------ # we require initializing one csv file per aggregated ATAC/multiome experiment. @@ -287,9 +282,8 @@ features_tss: #resources/features_tss_hg19.tsv # all metrics should be inputted as a comma separated string e.g. a,b,c plotqc_atac_metrics: n_genes_by_counts,total_counts,pct_fragments_in_peaks,atac_peak_region_fragments,atac_mitochondrial_reads,atac_TSS_fragments - # ------------ -# Repertoire QC +# Plot Repertoire QC metrics # ------------ # Repertoire data will be stored in one modality called "rep", if you provide both TCR and BCR data then this will be merged, # but various functions will be fun on TCR and BCR separately diff --git a/panpipes/panpipes/pipeline_integration/pipeline.yml b/panpipes/panpipes/pipeline_integration/pipeline.yml index a51b5728..cebaff87 100644 --- a/panpipes/panpipes/pipeline_integration/pipeline.yml +++ b/panpipes/panpipes/pipeline_integration/pipeline.yml @@ -223,9 +223,9 @@ multimodal: modalities: rna,prot exclude_mt_genes: True mt_column: mt - # to filter outliers manually create a column called adt_outliers in mdata['prot'].obs + # to filter outliers manually create a column called prot_outliers in mdata['prot'].obs filter_by_hvg: True - filter_adt_outliers: False + filter_prot_outliers: False model_args: latent_distribution: "normal" training_args: diff --git a/panpipes/panpipes/pipeline_preprocess/pipeline.yml b/panpipes/panpipes/pipeline_preprocess/pipeline.yml index 2f5bfa8b..44e1f4de 100644 --- a/panpipes/panpipes/pipeline_preprocess/pipeline.yml +++ b/panpipes/panpipes/pipeline_preprocess/pipeline.yml @@ -16,13 +16,10 @@ condaenv: # allows for tweaking where the jobs get submitted, in case there is a special queue for long jobs or you have access to a gpu # leave as is if you do not want to use the alternative queues -# also rescomp users need to request gpu access. +# -------------------------- # Start # -------------------------- -# either one that exists already with - - sample_prefix: unfiltered_obj: # if running this on prefiltered datat then @@ -31,8 +28,6 @@ unfiltered_obj: #3. put renamed file in the same folder as this yml. #4. set filtering run: to False below. -#TO DO this needs to go i think it doesn't make sense to save this file. - # leave empty if you don't want to save an intermediate muData with normalized expression # stored in X for the RNA object output_logged_mudata: @@ -49,6 +44,7 @@ modalities: # for each modality, starting with rna, it will first filter on obs and then vars. # each modality has a dictionary in the following format. This is fully customisable to any # columns in the mudata.obs or var object. +# To avoid being prescriptive with the covariates names, we allow to provide custom names. # When specifying a column name, make sure it exactly matches the column name in the h5mu object. # rna: # obs: @@ -158,13 +154,13 @@ plotqc: grouping_var: sample_id # specify metrics in the metadata to plot: rna_metrics: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores - prot_metrics: total_counts,log1p_total_counts,n_adt_by_counts,pct_counts_isotype + prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype atac_metrics: rep_metrics: - +# -------------------------------------------------------------------------------------------------------- # RNA Normalisation # -------------------------------------------------------------------------------------------------------- # hvg_flavour options include "seurat", "cell_ranger", "seurat_v3", default; "seurat" @@ -212,8 +208,8 @@ pca: scree_n_pcs: 50 color_by: sample_id - -# Protein (ADT) normalisation +# -------------------------------------------------------------------------------------------------------- +# Protein (PROT) normalisation # -------------------------------------------------------------------------------------------------------- prot: # comma separated string of normalisation options @@ -228,7 +224,7 @@ prot: # CLR parameters: # margin determines whether you normalise per cell (as you would RNA norm), - # or by feature (recommended, due to the variable nature of adts). + # or by feature (recommended, due to the variable nature of prot assays). # CLR margin 0 is recommended for informative qc plots in this pipeline # 0 = normalise colwise (per feature) # 1 = normalise rowwise (per cell) @@ -249,9 +245,9 @@ prot: # do you want to save the prot normalised assay additionally as a txt file: save_norm_prot_mtx: False - +# -------------------------------------------------------------------------------------------------------- # ATAC preprocessing and normalisation -# -------------------------- +# -------------------------------------------------------------------------------------------------------- atac: binarize: False normalize: TFIDF # "log1p" or "TFIDF" diff --git a/panpipes/panpipes/pipeline_refmap/pipeline.yml b/panpipes/panpipes/pipeline_refmap/pipeline.yml index 55fc8c9f..abe82ce7 100644 --- a/panpipes/panpipes/pipeline_refmap/pipeline.yml +++ b/panpipes/panpipes/pipeline_refmap/pipeline.yml @@ -94,26 +94,5 @@ neighbors: # scanpy | hnsw (from scvelo) method: scanpy -#---------------------- -# wnn seurat params -#---------------------- - -wnn: path_to_wnn_h5 -# instead to a path to a model, in wnn you have to provide a path to a seurat or mudata/anndata object with wnn integration precomputed -wnn_mapping_normalization: SCTransform, LogNormalize -# specify normalization to use for query data, same used to build reference wnn. -normalize: False -# if RNA is pre-normalized specify method above and set normalize to False -# provide here a dictionary of the columns or assays you want to predict as defined here -# https://satijalab.org/seurat/reference/transferdata -# for the param refdata ( https://satijalab.org/seurat/reference/transferdata#:~:text=(%0A%20%20anchorset%2C-,refdata,-%2C%0A%20%20reference%20%3D%20NULL) -# if multiple lines are specified these will be supplied as list. -# write as the example that follows [column_or_data_to_create_in_query]:[name_of_data_in_reference] -refdata: - celltype.l1 : celltype.l1 - celltype.l2 : celltype.l2 - predicted_ADT : ADT - - diff --git a/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py b/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py index 480ea6dd..fc297932 100644 --- a/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py +++ b/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py @@ -93,8 +93,8 @@ def get_all_unique_paths(pipe_df): all_paths.columns = ['sample_id', 'path_type', 'path'] all_paths = all_paths.drop_duplicates() # rename path_type to match cellranger terminolgy - recode_dict = {'gex_path': "Gene Expression", - 'adt_path': "Antibody Capture", + recode_dict = {'rna_path': "Gene Expression", + 'prot_path': "Antibody Capture", 'tcr_path': 'VDJ T', 'bcr_path': 'VDJ B'} all_paths['path_type'] = all_paths['path_type'].replace(recode_dict) @@ -216,11 +216,11 @@ def parse_10x_cellranger_count(path_df, convert_df, path_col='metrics_summary_p ## gene expression extra plots ----------------------------- -gex_tenx_metrics = tenx_metrics_full[tenx_metrics_full['library_type'] == 'Gene Expression'] +rna_tenx_metrics = tenx_metrics_full[tenx_metrics_full['library_type'] == 'Gene Expression'] # sequencing_saturaion plot scatter plot plot_metrics = ['Sequencing saturation', 'Mean reads per cell', 'Estimated number of cells', 'Number of reads', 'Median UMI counts per cell'] -plt_df = gex_tenx_metrics[gex_tenx_metrics.metric_name.isin(plot_metrics)] +plt_df = rna_tenx_metrics[rna_tenx_metrics.metric_name.isin(plot_metrics)] plt_df = plt_df[['sample_id', 'metric_name', 'metric_value']] plt_tab = plt_df.pivot_table(index='sample_id', columns='metric_name', values='metric_value', aggfunc=sum) # sns.scatterplot(data=plt_tab, diff --git a/panpipes/python_scripts/assess_background.py b/panpipes/python_scripts/assess_background.py index 106bafa9..21fb3829 100644 --- a/panpipes/python_scripts/assess_background.py +++ b/panpipes/python_scripts/assess_background.py @@ -147,7 +147,7 @@ # quantifying the top background features ## Repeat for protein (if it exists) if 'prot' in mdata.mod.keys(): - # this time we'll just use all the adts. + # this time we'll just use all the prot vars. top_genes = list(mdata_bg['prot'].var_names) bg_df = pnp.scmethods.get_mean_background_fraction(mdata_bg['prot'], top_background=top_genes, group_by=args.channel_col) if bg_df.shape[0] >1: @@ -157,7 +157,7 @@ fig, ax = plt.subplots(nrows=2,ncols=1,figsize=(24,10), facecolor="white") sns.heatmap(bg_df.iloc[:,1:split_int], ax=ax[0]) sns.heatmap(bg_df.iloc[:,split_int:len(top_genes)], ax=ax[1]) - fig.suptitle("mean exprs (raw counts) of ADT in empty drops") + fig.suptitle("mean exprs (raw counts) of PROT in empty drops") fig.tight_layout() else: fig, ax = plt.subplots(figsize=(12,10), facecolor="white") @@ -173,7 +173,7 @@ pnp.plotting.adjust_x_axis(ax[0]) sns.heatmap(plt_df.iloc[split_int:len(top_genes),:], ax=ax[1]) pnp.plotting.adjust_x_axis(ax[1]) - fig.suptitle("mean exprs (raw counts) of ADT in empty drops") + fig.suptitle("mean exprs (raw counts) of PROT in empty drops") fig.tight_layout() else: fig, ax= plt.subplots(figsize=(12,8)) @@ -183,7 +183,7 @@ plt.savefig(os.path.join(args.figpath,"barplot_background_" + args.channel_col + "_prot_top_expressed.png")) -## QC for gex and protein fcomparing foreground and background +## QC for rna and protein comparing foreground and background if 'rna' in mdata.mod.keys(): pnp.pl.scatter_fg_vs_bg(mdata, mdata_bg,x="rna:log1p_n_genes_by_counts", y="rna:log1p_total_counts", facet_row=args.channel_col) diff --git a/panpipes/python_scripts/batch_correct_totalvi.py b/panpipes/python_scripts/batch_correct_totalvi.py index 8fde16fb..b7214484 100644 --- a/panpipes/python_scripts/batch_correct_totalvi.py +++ b/panpipes/python_scripts/batch_correct_totalvi.py @@ -123,14 +123,14 @@ sc_raw = sc_raw[sc_raw.obs_names.isin(rna.obs_names),: ] rna.layers["counts"] = sc_raw.X.copy() -# filter adt outliers -if params['multimodal']['totalvi']['filter_adt_outliers']: - # for this to work the user needs to (manually) make a column called adt_outliers +# filter prot outliers +if params['multimodal']['totalvi']['filter_prot_outliers']: + # for this to work the user needs to (manually) make a column called prot_outliers # actually there is a thing in the qc pipe that calculates outliers, I don't like it very much though - if "adt_outliers" in mdata['prot'].columns: - mu.pp.filter_obs(mdata, "adt_outliers") + if "prot_outliers" in mdata['prot'].columns: + mu.pp.filter_obs(mdata, "prot_outliers") else: - raise ValueError("adt_outliers column not found in mdata['prot'].obs") + raise ValueError("prot_outliers column not found in mdata['prot'].obs") # exluding isotypes if 'isotype' in prot.var.columns: diff --git a/panpipes/python_scripts/concat_adata.py b/panpipes/python_scripts/concat_adata.py index 02a0a468..9382541c 100644 --- a/panpipes/python_scripts/concat_adata.py +++ b/panpipes/python_scripts/concat_adata.py @@ -94,7 +94,7 @@ mdata=mdatas[0] del temp elif 'prot' in temp.mod.keys() or 'rna' in temp.mod.keys(): - ## IF GEX and ADT is ok to concatenate ---------- + ## IF RNA and PROT is ok to concatenate ---------- mdata = concat_mdatas(mdatas, batch_key="sample_id", join_type=args.join_type) @@ -178,7 +178,7 @@ L.debug(mdata.obs.columns) L.debug(mdata.obs.head()) -# update the protein variable to add in extra info like isotype and alternate name for adt +# update the protein variable to add in extra info like isotype and alternate name for prot if args.protein_var_table is not None: try: df = pd.read_csv(args.protein_var_table, sep='\t', index_col=0) diff --git a/panpipes/python_scripts/run_dsb_clr.py b/panpipes/python_scripts/run_dsb_clr.py index 7fb0ffab..201231e1 100644 --- a/panpipes/python_scripts/run_dsb_clr.py +++ b/panpipes/python_scripts/run_dsb_clr.py @@ -35,7 +35,7 @@ default=False, help="") parser.add_argument("--figpath", - default="./adt_figures", + default="./prot_figures", help="") parser.add_argument("--save_mtx", default=False, @@ -106,7 +106,7 @@ if 'clr' in norm_methods: # make sure to start from raw counts mdata["prot"].X = mdata["prot"].layers["raw_counts"] - pnp.scmethods.run_adt_normalise(mdata=mdata, mdata_raw=None, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_raw=None, method="clr", clr_margin=int(args.clr_margin)) L.info("saving ridgeplot") @@ -117,7 +117,7 @@ plt.savefig(os.path.join(args.figpath, si + "_clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_clr" , si ), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_clr" , si ), layer="raw_counts") # then run dsb if 'dsb' in norm_methods: mdata_raw = mdata_raw_per_sample[si] @@ -129,7 +129,7 @@ # mu.pl.histogram(mdata_raw["rna"], ["log10umi"], bins=50) # plt.savefig(os.path.join(args.figpath, si + "_log10umi.png")) # run dsb - pnp.scmethods.run_adt_normalise(mdata=mdata, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_raw=mdata_raw, method="dsb", isotypes=isotypes) @@ -139,7 +139,7 @@ pnp.plotting.ridgeplot(mdata["prot"], features=isotypes, layer="dsb", splitplot=1) plt.savefig(os.path.join(args.figpath, si + "_dsb_ridgeplot_isotypes.png")) if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_dsb" , si), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_dsb" , si), layer="raw_counts") else: # run on all the data (not on a channel basis) # first run clr @@ -151,7 +151,7 @@ # make sure to start from raw counts all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"] # run normalise - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_raw=None, method="clr", clr_margin=int(args.clr_margin)) @@ -162,7 +162,7 @@ plt.savefig(os.path.join(args.figpath, "clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_clr"), layer="clr") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_clr"), layer="clr") if 'dsb' in norm_methods: # make sure to start from raw counts all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"] @@ -171,7 +171,7 @@ all_mdata_raw["rna"].obs["log10umi"] = np.array(np.log10(all_mdata_raw["rna"].X.sum(axis=1) + 1)).reshape(-1) mu.pl.histogram(all_mdata_raw["rna"], ["log10umi"], bins=50) plt.savefig(os.path.join(args.figpath, "all_log10umi.png")) - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_raw=all_mdata_raw, method="dsb", isotypes=isotypes) @@ -187,7 +187,7 @@ plt.savefig(os.path.join(args.figpath, "dsb_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_dsb"), layer="dsb") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_dsb"), layer="dsb") # run pca on dsb normalised (if dsb was run otherwise clr is in X) sc.tl.pca(all_mdata['prot'], n_comps=50, svd_solver='arpack', random_state=0) diff --git a/panpipes/python_scripts/run_preprocess_prot.py b/panpipes/python_scripts/run_preprocess_prot.py index 7563869f..f06665c9 100644 --- a/panpipes/python_scripts/run_preprocess_prot.py +++ b/panpipes/python_scripts/run_preprocess_prot.py @@ -44,7 +44,7 @@ default=None, help="") parser.add_argument("--figpath", - default="./adt_figures", + default="./prot_figures", help="") parser.add_argument("--save_mtx", default=False, @@ -55,7 +55,9 @@ args, opt = parser.parse_known_args() -# args = argparse.Namespace(filtered_mudata='test.h5mu', bg_mudata='/well/cartography/users/zsj686/non_cart_projects/005-multimodal_scpipelines/ingest/test_raw.h5mu', channel_col=None, normalisation_methods='clr,dsb', clr_margin='0', quantile_clipping='True', figpath='./figures/adt', save_mtx=False, save_mudata_path='test.h5mu') +# args = argparse.Namespace(filtered_mudata='test.h5mu', +# bg_mudata='/well/cartography/users/zsj686/non_cart_projects/005-multimodal_scpipelines/ingest/test_raw.h5mu', +# channel_col=None, normalisation_methods='clr,dsb', clr_margin='0', quantile_clipping='True', figpath='./figures/prot', save_mtx=False, save_mudata_path='test.h5mu') save_mtx=pnp.pp.check_for_bool(args.save_mtx) norm_methods = args.normalisation_methods.split(',') @@ -137,7 +139,7 @@ if 'clr' in norm_methods: # make sure to start from raw counts mdata["prot"].X = mdata["prot"].layers["raw_counts"].copy() - pnp.scmethods.run_adt_normalise(mdata=mdata, mdata_bg=None, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_bg=None, method="clr", clr_margin=int(args.clr_margin)) L.info("saving ridgeplot") @@ -148,7 +150,7 @@ plt.savefig(os.path.join(args.figpath, str(si) + "_clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_clr" , str(si) ), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_clr" , str(si) ), layer="raw_counts") # then run dsb if 'dsb' in norm_methods: mdata_bg = mdata_bg_per_sample[si] @@ -161,7 +163,7 @@ # plt.savefig(os.path.join(args.figpath, si + "_log10umi.png")) # run dsb # this stores a layer named after the method as well as overwriting X - pnp.scmethods.run_adt_normalise(mdata=mdata, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_bg=mdata_bg, method="dsb", isotypes=isotypes) @@ -171,7 +173,7 @@ pnp.plotting.ridgeplot(mdata["prot"], features=isotypes, layer="dsb", splitplot=1) plt.savefig(os.path.join(args.figpath, str(si) + "_dsb_ridgeplot_isotypes.png")) if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_dsb" , str(si)), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_dsb" , str(si)), layer="raw_counts") else: # run on all the data (not on a channel basis) # first run clr @@ -184,7 +186,7 @@ all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"].copy() # run normalise # this stores a layer named after the method as well as overwriting X - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_bg=None, method="clr", clr_margin=int(args.clr_margin)) @@ -195,7 +197,7 @@ plt.savefig(os.path.join(args.figpath, "clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_clr"), layer="clr") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_clr"), layer="clr") if 'dsb' in norm_methods: # make sure to start from raw counts all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"].copy() @@ -205,7 +207,7 @@ mu.pl.histogram(all_mdata_bg["rna"], ["log10umi"], bins=50) plt.savefig(os.path.join(args.figpath, "all_log10umi.png")) # this stores a layer named after the method as well as overwriting X - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_bg=all_mdata_bg, method="dsb", isotypes=isotypes) @@ -221,7 +223,7 @@ plt.savefig(os.path.join(args.figpath, "dsb_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_dsb"), layer="dsb") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_dsb"), layer="dsb") if args.store_as_x is not None: all_mdata["prot"].X = all_mdata["prot"].layers[args.store_as_x] diff --git a/panpipes/python_scripts/run_scanpyQC_prot.py b/panpipes/python_scripts/run_scanpyQC_prot.py index abcaac84..64a61879 100644 --- a/panpipes/python_scripts/run_scanpyQC_prot.py +++ b/panpipes/python_scripts/run_scanpyQC_prot.py @@ -1,8 +1,8 @@ ''' -scanpy QC script ADT +scanpy QC script PROT order of QC: -- GEX -- ADT +- RNA +- PROT - Repertoire - ATAC ''' @@ -64,7 +64,7 @@ parser.add_argument("--per_cell_metrics", default=None, help="") -parser.add_argument("--per_adt_metrics", +parser.add_argument("--per_prot_metrics", default=None, help="") @@ -118,7 +118,7 @@ # if isotypes are found then include them in the caluclate qc metrics call. # this calculates standard metrics sc.pp.calculate_qc_metrics(prot, - var_type="adt", + var_type="prot", qc_vars=qc_vars, percent_top=None,log1p=True, inplace=True) @@ -151,7 +151,7 @@ # values on a per channel basis # IS this cause you need the pandas per channel to plot? YES -# the rest of the script is concerned with Per ADT metrics instead of per cell. +# the rest of the script is concerned with Per PROT metrics instead of per cell. # per cell metrics are plotted in plotqc.R @@ -161,7 +161,7 @@ for si in prot.obs[channel_col].unique(): L.info(si) obs_df, var_df = sc.pp.calculate_qc_metrics(prot[prot.obs[channel_col]==si], - var_type="adt", + var_type="prot", qc_vars=qc_vars, percent_top=None,log1p=True, inplace=False) # store the var df for the metrics we care about in the dictionary. @@ -170,24 +170,24 @@ out[si][channel_col] = si # concatenate all the var_dfs, so we can summarise the per channel data in boxplots. -var_dat = pd.concat(out).reset_index().rename(columns={'level_0':channel_col, 'level_1': "adt_id"}) -adt_id_col=var_dat.columns[1] +var_dat = pd.concat(out).reset_index().rename(columns={'level_0':channel_col, 'level_1': "prot_id"}) +prot_id_col=var_dat.columns[1] -if args.per_adt_metrics is not None: - per_adt_metrics = args.per_adt_metrics.split(",") - per_adt_metrics = [a.strip() for a in per_adt_metrics] +if args.per_prot_metrics is not None: + per_prot_metrics = args.per_prot_metrics.split(",") + per_prot_metrics = [a.strip() for a in per_prot_metrics] # let's make some boxplot figures. - for qcp in per_adt_metrics: + for qcp in per_prot_metrics: fig, ax = plt.subplots(nrows=1,figsize=(24,6)) - sns.boxplot(data=var_dat, x=adt_id_col, y=qcp, ax=ax) + sns.boxplot(data=var_dat, x=prot_id_col, y=qcp, ax=ax) adjust_x_axis(ax) fig.tight_layout() plt.savefig(os.path.join(figdir, "boxplot_" + channel_col + "_" + qcp + ".png")) plt.close() # write out the per channel metrics in a separate csv. - var_dat.to_csv(args.sampleprefix + "_adt_qc_metrics_per_" + channel_col + ".csv") + var_dat.to_csv(args.sampleprefix + "_prot_qc_metrics_per_" + channel_col + ".csv") L.info("Done") diff --git a/panpipes/python_scripts/run_scanpyQC_rep.py b/panpipes/python_scripts/run_scanpyQC_rep.py index 8cf53eb1..e2b12c0a 100644 --- a/panpipes/python_scripts/run_scanpyQC_rep.py +++ b/panpipes/python_scripts/run_scanpyQC_rep.py @@ -1,8 +1,8 @@ ''' -scanpy QC script GEX +scanpy QC script RNA order of QC: -- GEX -- ADT +- RNA +- PROT - Repertoire - ATAC ''' diff --git a/panpipes/python_scripts/run_scanpyQC_rna.py b/panpipes/python_scripts/run_scanpyQC_rna.py index 9947e98d..70f37ce8 100644 --- a/panpipes/python_scripts/run_scanpyQC_rna.py +++ b/panpipes/python_scripts/run_scanpyQC_rna.py @@ -1,8 +1,8 @@ ''' -scanpy QC script GEX +scanpy QC script RNA order of QC: -- GEX -- ADT +- RNA +- PROT - Repertoire - ATAC ''' @@ -62,7 +62,7 @@ args, opt = parser.parse_known_args() -L.info("Running scanpy gex qc pipeline") +L.info("Running scanpy rna qc pipeline") sc.settings.verbosity = 3 diff --git a/panpipes/resources/sample_file_qc_mm.txt b/panpipes/resources/sample_file_qc_mm.txt index beae3718..77775ce8 100644 --- a/panpipes/resources/sample_file_qc_mm.txt +++ b/panpipes/resources/sample_file_qc_mm.txt @@ -1 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file +sample_id rna_path rna_filetype prot_path prot_filetype tissue diagnosis +Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy +Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/panpipes/resources/sample_file_qc_mm_gex_only.txt b/panpipes/resources/sample_file_qc_mm_gex_only.txt index beae3718..77775ce8 100644 --- a/panpipes/resources/sample_file_qc_mm_gex_only.txt +++ b/panpipes/resources/sample_file_qc_mm_gex_only.txt @@ -1 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file +sample_id rna_path rna_filetype prot_path prot_filetype tissue diagnosis +Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy +Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/tests/sample_caf.tsv b/tests/sample_caf.tsv index fbece7be..d60886ed 100644 --- a/tests/sample_caf.tsv +++ b/tests/sample_caf.tsv @@ -1,3 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype +sample_id rna_path rna_filetype prot_path prot_filetype sample1 ../sample_1_rna cellranger ../sample_1_prot cellranger sample2 ../sample_2_rna cellranger ../sample_2_prot cellranger