From ff19deabdbdd6806533928704e9066a352a00231 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 11:46:04 +0200 Subject: [PATCH 01/28] checking gex on setup md files --- docs/setup_for_ingest.md | 18 +++++++++--------- docs/usage/sample_file_qc_mm.md | 6 ++---- 2 files changed, 11 insertions(+), 13 deletions(-) diff --git a/docs/setup_for_ingest.md b/docs/setup_for_ingest.md index 86db2324..030f7f91 100644 --- a/docs/setup_for_ingest.md +++ b/docs/setup_for_ingest.md @@ -3,7 +3,7 @@ For ingest the minimum required columns are -sample_id | gex_path | gex_filetype +sample_id | rna_path | rna_filetype ----------|----------|------------- @@ -20,7 +20,7 @@ example at `resources/sample_file_ingest.txt` If giving a cellranger path, give the path folder containing all the cellranger outputs. Otherwise path should be the complete path to the file. -If you have cellranger outputs which have gex and adt within the same files, specify the same path in gex_path and adt_path +If you have cellranger outputs which have rna and adt within the same files, specify the same path in rna_path and adt_path To include sample level metadata, you can add additional columns to the submission file e.g Tissue and Diagnoisis columns in `resources/sample_file_ingest.txt` @@ -38,13 +38,13 @@ For each modality per sample, specify the value in the key column in the X_filet modality |key |description ------------|----------|---------- -gex/adt/atac|cellranger| the "outs" folder produced by **cellranger count** -gex/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** -gex/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger -gex/adt/atac|hd5 | Read a generic .h5 (hdf5) file. -gex/adt/atac|h5ad | Anndata h5ad objects (one per sample) -gex/adt/atac|txt_matrix | tab-delimited file (one per sample) -gex/adt/atac|csv_matrix | comma-delimited file (one per sample) +rna/adt/atac|cellranger| the "outs" folder produced by **cellranger count** +rna/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** +rna/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger +rna/adt/atac|hd5 | Read a generic .h5 (hdf5) file. +rna/adt/atac|h5ad | Anndata h5ad objects (one per sample) +rna/adt/atac|txt_matrix | tab-delimited file (one per sample) +rna/adt/atac|csv_matrix | comma-delimited file (one per sample) tcr/bcr |cellranger_vdj| Path to filtered_contig_annotations.csv, all_contig_annotations.csv or all_contig_annotations.json. produced by **cellranger vdj** further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_10x_vdj.html) tcr/bcr |tracer| data from [TraCeR](https://github.com/Teichlab/tracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_tracer.html) tcr/bcr |bracer| data from [BraCeR](https://github.com/Teichlab/bracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_bracer.html) diff --git a/docs/usage/sample_file_qc_mm.md b/docs/usage/sample_file_qc_mm.md index 47ca20e7..cf4269bb 100644 --- a/docs/usage/sample_file_qc_mm.md +++ b/docs/usage/sample_file_qc_mm.md @@ -1,12 +1,10 @@ Example sample submission file ----------------------------- -| sample_id | gex_path | gex_filetype | adt_path | adt_filetype | tissue | diagnosis | +| sample_id | rna_path | rna_filetype | adt_path | adt_filetype | tissue | diagnosis | |-----------|------------------------------------|--------------|-------------------------------------|--------------|--------|-----------| -| Sample1 | Sample1_gex.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | +| Sample1 | Sample1_rna.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | | Sample2 | cellranger_count/Sample2_GEX/outs/ | cellranger | cellranger_count/Sample2_CITE/outs/ | cellranger | pbmc | diseased | Download this file: [sample_file_qc_mm.txt](sample_file_qc_mm.txt) - - From 8fb3f31d25cec8b40572b2349f350a3076e7174e Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 11:47:04 +0200 Subject: [PATCH 02/28] sample qc mm fix --- docs/usage/sample_file_qc_mm.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/usage/sample_file_qc_mm.txt b/docs/usage/sample_file_qc_mm.txt index 5cc13a1f..7e45675e 100644 --- a/docs/usage/sample_file_qc_mm.txt +++ b/docs/usage/sample_file_qc_mm.txt @@ -1,3 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype tissue diagnosis +sample_id rna_path rna_filetype adt_path adt_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file From 5998619e5773a7979b6712c7ee0bdaae85a70234 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 11:48:33 +0200 Subject: [PATCH 03/28] gex setup qc mm --- docs/usage/setup_for_qc_mm.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/usage/setup_for_qc_mm.md b/docs/usage/setup_for_qc_mm.md index ee32ffce..1f6df9ec 100644 --- a/docs/usage/setup_for_qc_mm.md +++ b/docs/usage/setup_for_qc_mm.md @@ -6,7 +6,7 @@ The multimodal QC pipeline (qc_mm) requires a sample submission file which it us The minimum required columns are -sample_id | gex_path | gex_filetype +sample_id | rna_path | rna_filetype ----------|----------|------------- @@ -20,7 +20,7 @@ If you want to analyse other modalities, add additional columns to the input fil **sample id**: Each row must have a unique sample ID. -**{X}_paths**: If giving a cellranger path, give the path folder containing all the cellranger outputs, known as the `outs` folder. Otherwise path should be the complete path to the file. If you have cellranger outputs which have gex and adt within the same files, specify the same path in gex_path and adt_path +**{X}_paths**: If giving a cellranger path, give the path folder containing all the cellranger outputs, known as the `outs` folder. Otherwise path should be the complete path to the file. If you have cellranger outputs which have rna and adt within the same files, specify the same path in rna_path and adt_path **{X}_filetype**: The "filetype" column tells panpipe how to read in the data. Panpipes supports a range of inputs. See the [supported input filetypes](#supported-input-filetypes) below to see the options for the {X}_filetype columns @@ -35,7 +35,7 @@ You will also need to list which additional metadata columns you want to include ## Example sample submission file -| sample_id | gex_path | gex_filetype | adt_path | adt_filetype | tissue | diagnosis | +| sample_id | rna_path | rna_filetype | adt_path | adt_filetype | tissue | diagnosis | |-----------|------------------------------------|--------------|-------------------------------------|--------------|--------|-----------| | Sample1 | Sample1_gex.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | | Sample2 | cellranger_count/Sample2_GEX/outs/ | cellranger | cellranger_count/Sample2_CITE/outs/ | cellranger | pbmc | diseased | @@ -58,13 +58,13 @@ For each modality per sample, specify the value in the key column in the X_filet modality |key |description ------------|----------|---------- -gex/adt/atac|cellranger| the "outs" folder produced by **cellranger count** -gex/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** -gex/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger -gex/adt/atac|hd5 | Read a generic .h5 (hdf5) file. -gex/adt/atac|h5ad | Anndata h5ad objects (one per sample) -gex/adt/atac|txt_matrix | tab-delimited file (one per sample) -gex/adt/atac|csv_matrix | comma-delimited file (one per sample) +rna/adt/atac|cellranger| the "outs" folder produced by **cellranger count** +rna/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** +rna/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger +rna/adt/atac|hd5 | Read a generic .h5 (hdf5) file. +rna/adt/atac|h5ad | Anndata h5ad objects (one per sample) +rna/adt/atac|txt_matrix | tab-delimited file (one per sample) +rna/adt/atac|csv_matrix | comma-delimited file (one per sample) tcr/bcr |cellranger_vdj| Path to filtered_contig_annotations.csv, all_contig_annotations.csv or all_contig_annotations.json. produced by **cellranger vdj** further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_10x_vdj.html) tcr/bcr |tracer| data from [TraCeR](https://github.com/Teichlab/tracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_tracer.html) tcr/bcr |bracer| data from [BraCeR](https://github.com/Teichlab/bracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_bracer.html) From 1935529a063f47a6e1ab048b8397538967e17e9d Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 11:57:27 +0200 Subject: [PATCH 04/28] gex_only scanpy --- panpipes/funcs/io.py | 47 ++++++++++++++++++++++---------------------- 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/panpipes/funcs/io.py b/panpipes/funcs/io.py index 387713fd..e8777738 100644 --- a/panpipes/funcs/io.py +++ b/panpipes/funcs/io.py @@ -48,32 +48,32 @@ def gen_load_anndata_jobs(caf, load_raw=False, mode_dictionary = {}, load_prot_f Generate a load_adatas job for each line in submission.txt """ for nn in range(0, caf.shape[0]): - if pd.isna(caf['gex_path'][nn]): - gex_path= None - gex_filetype=None - elif caf['gex_filetype'][nn]=="cellranger" and mode_dictionary["rna"]: - gex_path, gex_filetype = update_cellranger_col(caf['gex_path'][nn], raw=load_raw, method="count") - elif caf['gex_filetype'][nn]=="cellranger_multi" and mode_dictionary["rna"]: - gex_path, gex_filetype = update_cellranger_col(caf['gex_path'][nn], raw=load_raw, method="multi", + if pd.isna(caf['rna_path'][nn]): + rna_path= None + rna_filetype=None + elif caf['rna_filetype'][nn]=="cellranger" and mode_dictionary["rna"]: + rna_path, rna_filetype = update_cellranger_col(caf['rna_path'][nn], raw=load_raw, method="count") + elif caf['rna_filetype'][nn]=="cellranger_multi" and mode_dictionary["rna"]: + rna_path, rna_filetype = update_cellranger_col(caf['rna_path'][nn], raw=load_raw, method="multi", sample_id=caf['sample_id'][nn]) else: - gex_path, gex_filetype = caf[['gex_path', "gex_filetype"]].iloc[nn] + rna_path, rna_filetype = caf[['rna_path', "rna_filetype"]].iloc[nn] if load_raw: - gex_path = re.sub("filtered", "raw", gex_path) + rna_path = re.sub("filtered", "raw", rna_path) # manage the adt paths if ('adt_path' in caf.columns and mode_dictionary["prot"]): # check if its the same as the gex path (data in the same file) if pd.isna(caf['adt_path'][nn]): adt_path= None adt_filetype=None - elif caf['adt_path'][nn] == caf['gex_path'][nn]: - adt_path, adt_filetype = gex_path, gex_filetype + elif caf['adt_path'][nn] == caf['rna_path'][nn]: + adt_path, adt_filetype = rna_path, rna_filetype elif caf['adt_filetype'][nn]=="cellranger": - # we might want to load the raw here because we want to then subset by good gex barcodes, + # we might want to load the raw here because we want to then subset by good gex (rna) barcodes, # this is why the load_prot_from_raw argument exists adt_path, adt_filetype = update_cellranger_col(caf['adt_path'][nn], raw=load_prot_from_raw) elif caf['adt_filetype'][nn]=="cellranger_multi": - # celranger multi has the same prot and gex barcodes + # celranger multi has the same prot and gex (rna) barcodes adt_path, adt_filetype = update_cellranger_col(caf['adt_path'][nn], raw=load_raw, method="multi", sample_id=caf['sample_id'][nn]) else: @@ -131,9 +131,9 @@ def gen_load_anndata_jobs(caf, load_raw=False, mode_dictionary = {}, load_prot_f else: outfile = outfile + ".h5mu" sample_id = caf['sample_id'][nn] - yield gex_path, outfile, \ + yield rna_path, outfile, \ sample_id, \ - gex_filetype, \ + rna_filetype, \ adt_path, adt_filetype, \ tcr_path, tcr_filetype, \ bcr_path, bcr_filetype, \ @@ -259,9 +259,9 @@ def write_obs(mdata, output_prefix="./", output_suffix="_filtered_cell_metadata. def check_submission_file(caf): if "filetype" in caf.columns or "path" in caf.columns: raise ValueError("you appear to be using the old notation for the sample submission file, \ - please update to use gex_path instead of path and gex_filetype instead of filetype") + please update to use rna_path instead of path and rna_filetype instead of filetype") # check for required cols - req_cols = ['sample_id', 'gex_path', 'gex_filetype'] + req_cols = ['sample_id', 'rna_path', 'rna_filetype'] for rc in req_cols: if rc not in caf.columns: raise ValueError("required column %s missing from submission file" % rc) @@ -363,7 +363,7 @@ def scp_read_10x_mtx(filename: PathLike, library_keep=None, *args, **kwargs) -> expanded sc.read_10x_mtx to filter for the library adapted from https://github.com/scverse/muon/blob/master/muon/_prot/io.py """ - adata = read_10x_mtx(filename, gex_only=False, *args, **kwargs) + adata = read_10x_mtx(filename, gex_only=False, *args, **kwargs) #need to leave gex as this is scanpy's code logging.debug("filtering cellranger outputs to %s" % library_keep) if library_keep is not None: adata = adata[ @@ -482,20 +482,21 @@ def load_mdata_from_multiple_files(all_files_dict): Parameters ---------- all_files_dict: dict - dictionary containing one key per assay out of [GEX, ADT, TCR, BCR, ATAC] + dictionary containing one key per assay out of [RNA, ADT, TCR, BCR, ATAC] and the values for each key is a list of file path and fie type - e.g. {"GEX": [filepath, "filetype"], + e.g. {"RNA": [filepath, "filetype"], "ADT: [file path, "filetype"]} - Filetypes supported for gex/adt: ["cellranger", "h5ad", "csv_matrix", "txt_matrix", "10X_h5"], + Filetypes supported for RNA/adt: ["cellranger", "h5ad", "csv_matrix", "txt_matrix", "10X_h5"], Filetypes supported for atac (multiome preferred is 10X_h5) ["10X_h5","cellranger","h5ad"] Filetypes supported for rep: ["cellranger_vdj", "airr", "tracer", "bracer" ] See scirpy documentation for more information of repertoire input formats https://scverse.org/scirpy/latest/api.html#module-scirpy.io """ # convert names to match mudata conventions - # mudata_conventional_names={"GEX":"rna", "ADT":"prot", "TCR":"tcr", - # "BCR":"bcr", "ATAC": "atac", "spatial":"spatial"} + # mudata_conventional_names + # {"GEX":"rna", "ADT":"prot", "TCR":"tcr", "BCR":"bcr", "ATAC": "atac", "spatial":"spatial"} # all_files_dict = {mudata_conventional_names[nm]: x for (nm, x) in all_files_dict.items()} + # note: scanpy's default function use gex_only as param so we need to leave that in logging.debug(all_files_dict.keys()) # load in separate anndata for each expected modality data_dict = {} From f543504a0de468552e78c7c7def826a58ff7bbdf Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 11:58:15 +0200 Subject: [PATCH 05/28] io fix --- panpipes/funcs/io.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/panpipes/funcs/io.py b/panpipes/funcs/io.py index e8777738..833409a6 100644 --- a/panpipes/funcs/io.py +++ b/panpipes/funcs/io.py @@ -62,7 +62,7 @@ def gen_load_anndata_jobs(caf, load_raw=False, mode_dictionary = {}, load_prot_f rna_path = re.sub("filtered", "raw", rna_path) # manage the adt paths if ('adt_path' in caf.columns and mode_dictionary["prot"]): - # check if its the same as the gex path (data in the same file) + # check if its the same as the rna path (data in the same file) if pd.isna(caf['adt_path'][nn]): adt_path= None adt_filetype=None From 79c8fad01cffe374f2e28e52b6ca68a036d46436 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:00:18 +0200 Subject: [PATCH 06/28] rename --- .../python_scripts/aggregate_cellranger_summary_metrics.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py b/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py index 480ea6dd..47530a0a 100644 --- a/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py +++ b/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py @@ -93,7 +93,7 @@ def get_all_unique_paths(pipe_df): all_paths.columns = ['sample_id', 'path_type', 'path'] all_paths = all_paths.drop_duplicates() # rename path_type to match cellranger terminolgy - recode_dict = {'gex_path': "Gene Expression", + recode_dict = {'rna_path': "Gene Expression", 'adt_path': "Antibody Capture", 'tcr_path': 'VDJ T', 'bcr_path': 'VDJ B'} @@ -216,11 +216,11 @@ def parse_10x_cellranger_count(path_df, convert_df, path_col='metrics_summary_p ## gene expression extra plots ----------------------------- -gex_tenx_metrics = tenx_metrics_full[tenx_metrics_full['library_type'] == 'Gene Expression'] +rna_tenx_metrics = tenx_metrics_full[tenx_metrics_full['library_type'] == 'Gene Expression'] # sequencing_saturaion plot scatter plot plot_metrics = ['Sequencing saturation', 'Mean reads per cell', 'Estimated number of cells', 'Number of reads', 'Median UMI counts per cell'] -plt_df = gex_tenx_metrics[gex_tenx_metrics.metric_name.isin(plot_metrics)] +plt_df = rna_tenx_metrics[rna_tenx_metrics.metric_name.isin(plot_metrics)] plt_df = plt_df[['sample_id', 'metric_name', 'metric_value']] plt_tab = plt_df.pivot_table(index='sample_id', columns='metric_name', values='metric_value', aggfunc=sum) # sns.scatterplot(data=plt_tab, From 26156e0214a3b7302a926b979db5710f5c7c8d76 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:01:08 +0200 Subject: [PATCH 07/28] small typo --- panpipes/python_scripts/assess_background.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/panpipes/python_scripts/assess_background.py b/panpipes/python_scripts/assess_background.py index 106bafa9..086a8580 100644 --- a/panpipes/python_scripts/assess_background.py +++ b/panpipes/python_scripts/assess_background.py @@ -183,7 +183,7 @@ plt.savefig(os.path.join(args.figpath,"barplot_background_" + args.channel_col + "_prot_top_expressed.png")) -## QC for gex and protein fcomparing foreground and background +## QC for rna and protein comparing foreground and background if 'rna' in mdata.mod.keys(): pnp.pl.scatter_fg_vs_bg(mdata, mdata_bg,x="rna:log1p_n_genes_by_counts", y="rna:log1p_total_counts", facet_row=args.channel_col) From 1fd759a74a76498345b851825334fe35541ea3ff Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:03:32 +0200 Subject: [PATCH 08/28] qc message --- panpipes/python_scripts/run_scanpyQC_rna.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/panpipes/python_scripts/run_scanpyQC_rna.py b/panpipes/python_scripts/run_scanpyQC_rna.py index 9947e98d..d47972ba 100644 --- a/panpipes/python_scripts/run_scanpyQC_rna.py +++ b/panpipes/python_scripts/run_scanpyQC_rna.py @@ -1,7 +1,7 @@ ''' -scanpy QC script GEX +scanpy QC script RNA order of QC: -- GEX +- RNA - ADT - Repertoire - ATAC @@ -62,7 +62,7 @@ args, opt = parser.parse_known_args() -L.info("Running scanpy gex qc pipeline") +L.info("Running scanpy rna qc pipeline") sc.settings.verbosity = 3 From 8b28dfa9ba6b7c6e13249d0a86521134e2c4f08b Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:05:37 +0200 Subject: [PATCH 09/28] barplot_10x --- panpipes/R_scripts/produce_barplot_10xmetric.v3.R | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R index e1dde13a..59cffae1 100644 --- a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R +++ b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R @@ -54,8 +54,8 @@ project <- opt$project if(!is.null(opt$csvpaths)) { caf <- read.table(opt$csvpaths, header=T, sep="\t") - if(!"gex_path" %in% colnames(caf)){ - caf$gex_path=caf$path + if(!"rna_path" %in% colnames(caf)){ + caf$rna_path=caf$path caf <- caf %>% filter(filetype=="cellranger") if(nrow(caf)==0){ message("no cellranger inputs") @@ -64,7 +64,7 @@ if(!is.null(opt$csvpaths)) { } # paths <- gsub("filtered_feature_bc_matrix","metrics_summary.csv",caf$gex_path) - paths <- file.path(caf$gex_path, "metrics_summary.csv") + paths <- file.path(caf$rna_path, "metrics_summary.csv") # paths <- gsub("\\/$","", paths) @@ -146,7 +146,7 @@ if(ncol(tag.mat)>0){ } if(nrow(mat)==0){mat <- tag.mat} -#modify this to deal with AB reads and GEX reads, need to split the knee plotting in 2 if you want both GEX and ADT knees in 2 different graphs +#modify this to deal with AB reads and rna reads, need to split the knee plotting in 2 if you want both rna and ADT knees in 2 different graphs if(opt$kneeplot){ print("plotting kneeplot") gsub("metrics_summary.csv","raw_feature_bc_matrix" , paths) ->normpath @@ -155,7 +155,7 @@ if(opt$kneeplot){ for( a in 1:length(normpath)){ print(normpath[a]) mm<-Read10X(normpath[a]) - if(class(mm)=="list") { mm<-do.call("rbind",mm)} #modify this to deal with AB reads and GEX reads, need to split the knee plotting in 2 if you want both GEX and ADT in 2 different graphs + if(class(mm)=="list") { mm<-do.call("rbind",mm)} #modify this to deal with prot reads and rna reads, need to split the knee plotting in 2 if you want both rna and prot in 2 different graphs colnames(mm) <- paste0(colnames(mm),"_",rownames(mat)[a]) xx <- CreateSeuratObject(mm,assay = "RNA", min.cells = 0, min.features = 0) xx@meta.data %>% @@ -200,7 +200,7 @@ if(opt$kneeplot){ theme(legend.key.size = unit(0.2,"cm"), axis.text=element_text(size=13,face="bold", color="black")) ->gb } gb + fl$layers ->gb - ggsave(gb,height = 15 ,width = 17, dpi=300, file=paste0(opt$figdir,"10xMetrics_",project,"_gex_knee_plot.png"), type="cairo-png") + ggsave(gb,height = 15 ,width = 17, dpi=300, file=paste0(opt$figdir,"10xMetrics_",project,"_rna_knee_plot.png"), type="cairo-png") } From 2ff238575c9d8a18bb7b312f63ad329157f31919 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:08:17 +0200 Subject: [PATCH 10/28] caf files --- panpipes/resources/sample_file_qc_mm.txt | 4 +++- panpipes/resources/sample_file_qc_mm_gex_only.txt | 4 +++- tests/sample_caf.tsv | 2 +- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/panpipes/resources/sample_file_qc_mm.txt b/panpipes/resources/sample_file_qc_mm.txt index beae3718..7e45675e 100644 --- a/panpipes/resources/sample_file_qc_mm.txt +++ b/panpipes/resources/sample_file_qc_mm.txt @@ -1 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file +sample_id rna_path rna_filetype adt_path adt_filetype tissue diagnosis +Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy +Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/panpipes/resources/sample_file_qc_mm_gex_only.txt b/panpipes/resources/sample_file_qc_mm_gex_only.txt index beae3718..7e45675e 100644 --- a/panpipes/resources/sample_file_qc_mm_gex_only.txt +++ b/panpipes/resources/sample_file_qc_mm_gex_only.txt @@ -1 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file +sample_id rna_path rna_filetype adt_path adt_filetype tissue diagnosis +Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy +Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/tests/sample_caf.tsv b/tests/sample_caf.tsv index fbece7be..cdcd773b 100644 --- a/tests/sample_caf.tsv +++ b/tests/sample_caf.tsv @@ -1,3 +1,3 @@ -sample_id gex_path gex_filetype adt_path adt_filetype +sample_id rna_path rna_filetype adt_path adt_filetype sample1 ../sample_1_rna cellranger ../sample_1_prot cellranger sample2 ../sample_2_rna cellranger ../sample_2_prot cellranger From 5dce3b506d12e7b838f3837453a0006f63bbf997 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:10:09 +0200 Subject: [PATCH 11/28] search adt --- docs/setup_for_ingest.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/setup_for_ingest.md b/docs/setup_for_ingest.md index 030f7f91..b389b86b 100644 --- a/docs/setup_for_ingest.md +++ b/docs/setup_for_ingest.md @@ -9,7 +9,7 @@ sample_id | rna_path | rna_filetype If you want to analyse other modalities, add columns to the input file -- adt_path/adt_filetype +- prot_path/prot_filetype - atac_path/atac_filetype - tcr_path/tcr_filetype - bcr_path/bcr_filetype @@ -20,7 +20,7 @@ example at `resources/sample_file_ingest.txt` If giving a cellranger path, give the path folder containing all the cellranger outputs. Otherwise path should be the complete path to the file. -If you have cellranger outputs which have rna and adt within the same files, specify the same path in rna_path and adt_path +If you have cellranger outputs which have rna and prot within the same files, specify the same path in rna_path and prot_path To include sample level metadata, you can add additional columns to the submission file e.g Tissue and Diagnoisis columns in `resources/sample_file_ingest.txt` @@ -38,13 +38,13 @@ For each modality per sample, specify the value in the key column in the X_filet modality |key |description ------------|----------|---------- -rna/adt/atac|cellranger| the "outs" folder produced by **cellranger count** -rna/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** -rna/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger -rna/adt/atac|hd5 | Read a generic .h5 (hdf5) file. -rna/adt/atac|h5ad | Anndata h5ad objects (one per sample) -rna/adt/atac|txt_matrix | tab-delimited file (one per sample) -rna/adt/atac|csv_matrix | comma-delimited file (one per sample) +rna/prot/atac|cellranger| the "outs" folder produced by **cellranger count** +rna/prot/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** +rna/prot/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger +rna/prot/atac|hd5 | Read a generic .h5 (hdf5) file. +rna/prot/atac|h5ad | Anndata h5ad objects (one per sample) +rna/prot/atac|txt_matrix | tab-delimited file (one per sample) +rna/prot/atac|csv_matrix | comma-delimited file (one per sample) tcr/bcr |cellranger_vdj| Path to filtered_contig_annotations.csv, all_contig_annotations.csv or all_contig_annotations.json. produced by **cellranger vdj** further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_10x_vdj.html) tcr/bcr |tracer| data from [TraCeR](https://github.com/Teichlab/tracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_tracer.html) tcr/bcr |bracer| data from [BraCeR](https://github.com/Teichlab/bracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_bracer.html) From 365abfd2de0a84b4ed950804d6819013c797fb0a Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:14:10 +0200 Subject: [PATCH 12/28] changing adt --- docs/usage/sample_file_qc_mm.md | 2 +- docs/usage/sample_file_qc_mm.txt | 2 +- docs/usage/setup_for_qc_mm.md | 20 ++++++++++---------- docs/workflows/qc.md | 4 ++-- 4 files changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/usage/sample_file_qc_mm.md b/docs/usage/sample_file_qc_mm.md index cf4269bb..c6e5cc96 100644 --- a/docs/usage/sample_file_qc_mm.md +++ b/docs/usage/sample_file_qc_mm.md @@ -1,7 +1,7 @@ Example sample submission file ----------------------------- -| sample_id | rna_path | rna_filetype | adt_path | adt_filetype | tissue | diagnosis | +| sample_id | rna_path | rna_filetype | prot_path | prot_filetype | tissue | diagnosis | |-----------|------------------------------------|--------------|-------------------------------------|--------------|--------|-----------| | Sample1 | Sample1_rna.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | | Sample2 | cellranger_count/Sample2_GEX/outs/ | cellranger | cellranger_count/Sample2_CITE/outs/ | cellranger | pbmc | diseased | diff --git a/docs/usage/sample_file_qc_mm.txt b/docs/usage/sample_file_qc_mm.txt index 7e45675e..77775ce8 100644 --- a/docs/usage/sample_file_qc_mm.txt +++ b/docs/usage/sample_file_qc_mm.txt @@ -1,3 +1,3 @@ -sample_id rna_path rna_filetype adt_path adt_filetype tissue diagnosis +sample_id rna_path rna_filetype prot_path prot_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/docs/usage/setup_for_qc_mm.md b/docs/usage/setup_for_qc_mm.md index 1f6df9ec..3f20695b 100644 --- a/docs/usage/setup_for_qc_mm.md +++ b/docs/usage/setup_for_qc_mm.md @@ -13,14 +13,14 @@ sample_id | rna_path | rna_filetype If you want to analyse other modalities, add additional columns to the input file -- adt_path/adt_filetype +- prot_path/prot_filetype - atac_path/atac_filetype - tcr_path/tcr_filetype - bcr_path/bcr_filetype **sample id**: Each row must have a unique sample ID. -**{X}_paths**: If giving a cellranger path, give the path folder containing all the cellranger outputs, known as the `outs` folder. Otherwise path should be the complete path to the file. If you have cellranger outputs which have rna and adt within the same files, specify the same path in rna_path and adt_path +**{X}_paths**: If giving a cellranger path, give the path folder containing all the cellranger outputs, known as the `outs` folder. Otherwise path should be the complete path to the file. If you have cellranger outputs which have rna and prot within the same files, specify the same path in rna_path and prot_path **{X}_filetype**: The "filetype" column tells panpipe how to read in the data. Panpipes supports a range of inputs. See the [supported input filetypes](#supported-input-filetypes) below to see the options for the {X}_filetype columns @@ -35,7 +35,7 @@ You will also need to list which additional metadata columns you want to include ## Example sample submission file -| sample_id | rna_path | rna_filetype | adt_path | adt_filetype | tissue | diagnosis | +| sample_id | rna_path | rna_filetype | prot_path | prot_filetype | tissue | diagnosis | |-----------|------------------------------------|--------------|-------------------------------------|--------------|--------|-----------| | Sample1 | Sample1_gex.csv | csv_matrix | Sample1_adt.csv | csv_matrix | pbmc | healthy | | Sample2 | cellranger_count/Sample2_GEX/outs/ | cellranger | cellranger_count/Sample2_CITE/outs/ | cellranger | pbmc | diseased | @@ -58,13 +58,13 @@ For each modality per sample, specify the value in the key column in the X_filet modality |key |description ------------|----------|---------- -rna/adt/atac|cellranger| the "outs" folder produced by **cellranger count** -rna/adt/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** -rna/adt/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger -rna/adt/atac|hd5 | Read a generic .h5 (hdf5) file. -rna/adt/atac|h5ad | Anndata h5ad objects (one per sample) -rna/adt/atac|txt_matrix | tab-delimited file (one per sample) -rna/adt/atac|csv_matrix | comma-delimited file (one per sample) +rna/prot/atac|cellranger| the "outs" folder produced by **cellranger count** +rna/prot/atac|cellranger_multi| the "outs" folder produced by **cellranger multi** +rna/prot/atac|10X_h5 | outs/filtered_feature_bc_matrix.h5 produced by cellranger +rna/prot/atac|hd5 | Read a generic .h5 (hdf5) file. +rna/prot/atac|h5ad | Anndata h5ad objects (one per sample) +rna/prot/atac|txt_matrix | tab-delimited file (one per sample) +rna/prot/atac|csv_matrix | comma-delimited file (one per sample) tcr/bcr |cellranger_vdj| Path to filtered_contig_annotations.csv, all_contig_annotations.csv or all_contig_annotations.json. produced by **cellranger vdj** further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_10x_vdj.html) tcr/bcr |tracer| data from [TraCeR](https://github.com/Teichlab/tracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_tracer.html) tcr/bcr |bracer| data from [BraCeR](https://github.com/Teichlab/bracer) further [details](https://scverse.org/scirpy/latest/generated/scirpy.io.read_bracer.html) diff --git a/docs/workflows/qc.md b/docs/workflows/qc.md index cb298c4c..553e3efc 100644 --- a/docs/workflows/qc.md +++ b/docs/workflows/qc.md @@ -22,7 +22,7 @@ Then qc metrics are computed using [scanpy.pp.calculated_qc_metrics](https://sca - Protein metadata such as isotype status or shorter names incorporated into the object. Inputs for this are described [here] - Per cell QC metrics computed as described above, including pct_isotype where isotype information is available. - Per Protein metrics, total_counts, and are computed, in order to compare the binding of different antibodies (applicable when your assay is CITE-seq based). These are defined in the yml: -`prot_metrics_per_adt: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts` +`plot_metrics_per_prot: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts` - A rudimentary check for cells that are 'isotype' outliers, i.e. the cells where the isotype content is in the top 10% quantile for more than 2 isotypes. (these parameters are customisable in the`pipeline.yml`). See the function [here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/funcs/scmethods.py#L328). ``` # isotype outliers: one way to determine which cells are very sticky is to work out which cells have the most isotype UMIs @@ -63,7 +63,7 @@ There is an additional optional `assess_background` step, if the raw data (inclu [Inputs to Multimodal QC pipeline](../setup_for_qc_mm) 2. Generate qc genelists as described in [Gene list format](../gene_list_format) -3. For adt assay - generate the protein metadata file +3. For prot assay - generate the protein metadata file [example]((https://github.com/DendrouLab/panpipes/blob/main/resources/protein_metadata_w_iso.md)). This file is integrated into the mdata\['prot'\].var slot. 4. Generate config file (`panpipes ingest config`) From 5e35a93273b5a95ba01399f26fe6408daeb5754a Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:16:00 +0200 Subject: [PATCH 13/28] fixes adt in io.py --- panpipes/funcs/io.py | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/panpipes/funcs/io.py b/panpipes/funcs/io.py index 833409a6..7db749b4 100644 --- a/panpipes/funcs/io.py +++ b/panpipes/funcs/io.py @@ -60,29 +60,29 @@ def gen_load_anndata_jobs(caf, load_raw=False, mode_dictionary = {}, load_prot_f rna_path, rna_filetype = caf[['rna_path', "rna_filetype"]].iloc[nn] if load_raw: rna_path = re.sub("filtered", "raw", rna_path) - # manage the adt paths - if ('adt_path' in caf.columns and mode_dictionary["prot"]): + # manage the prot paths + if ('prot_path' in caf.columns and mode_dictionary["prot"]): # check if its the same as the rna path (data in the same file) - if pd.isna(caf['adt_path'][nn]): - adt_path= None - adt_filetype=None - elif caf['adt_path'][nn] == caf['rna_path'][nn]: - adt_path, adt_filetype = rna_path, rna_filetype - elif caf['adt_filetype'][nn]=="cellranger": + if pd.isna(caf['prot_path'][nn]): + prot_path= None + prot_filetype=None + elif caf['prot_path'][nn] == caf['rna_path'][nn]: + prot_path, prot_filetype = rna_path, rna_filetype + elif caf['prot_filetype'][nn]=="cellranger": # we might want to load the raw here because we want to then subset by good gex (rna) barcodes, # this is why the load_prot_from_raw argument exists - adt_path, adt_filetype = update_cellranger_col(caf['adt_path'][nn], raw=load_prot_from_raw) - elif caf['adt_filetype'][nn]=="cellranger_multi": + prot_path, prot_filetype = update_cellranger_col(caf['prot_path'][nn], raw=load_prot_from_raw) + elif caf['prot_filetype'][nn]=="cellranger_multi": # celranger multi has the same prot and gex (rna) barcodes - adt_path, adt_filetype = update_cellranger_col(caf['adt_path'][nn], raw=load_raw, method="multi", + prot_path, prot_filetype = update_cellranger_col(caf['prot_path'][nn], raw=load_raw, method="multi", sample_id=caf['sample_id'][nn]) else: - adt_path, adt_filetype = caf[['adt_path', "adt_filetype"]].iloc[nn] + prot_path, prot_filetype = caf[['prot_path', "prot_filetype"]].iloc[nn] if load_prot_from_raw or load_raw: - adt_path = re.sub("filtered", "raw", adt_path) + prot_path = re.sub("filtered", "raw", prot_path) else: - adt_path= None - adt_filetype=None + prot_path= None + prot_filetype=None # load tcr_path if 'tcr_path' in caf.columns and mode_dictionary["tcr"] and pd.notna(caf['tcr_path'][nn]): tcr_path = caf['tcr_path'][nn] @@ -134,7 +134,7 @@ def gen_load_anndata_jobs(caf, load_raw=False, mode_dictionary = {}, load_prot_f yield rna_path, outfile, \ sample_id, \ rna_filetype, \ - adt_path, adt_filetype, \ + prot_path, prot_filetype, \ tcr_path, tcr_filetype, \ bcr_path, bcr_filetype, \ atac_path, atac_filetype, \ @@ -486,7 +486,7 @@ def load_mdata_from_multiple_files(all_files_dict): and the values for each key is a list of file path and fie type e.g. {"RNA": [filepath, "filetype"], "ADT: [file path, "filetype"]} - Filetypes supported for RNA/adt: ["cellranger", "h5ad", "csv_matrix", "txt_matrix", "10X_h5"], + Filetypes supported for RNA/prot: ["cellranger", "h5ad", "csv_matrix", "txt_matrix", "10X_h5"], Filetypes supported for atac (multiome preferred is 10X_h5) ["10X_h5","cellranger","h5ad"] Filetypes supported for rep: ["cellranger_vdj", "airr", "tracer", "bracer" ] See scirpy documentation for more information of repertoire input formats @@ -579,8 +579,8 @@ def write_10x_counts(adata, path, layer=None): features = adata.var.reset_index().rename(columns={"index": "gene_symbols"}) elif "gene_symbols" in adata.var.columns: features = adata.var.reset_index().rename(columns={"index": "gene_ids"}) - if "adt_id" in features.columns: - features = features.drop(columns=['gene_symbols']).rename(columns={"adt_id" : "gene_symbols"}) + if "prot_id" in features.columns: + features = features.drop(columns=['gene_symbols']).rename(columns={"prot_id" : "gene_symbols"}) features = features[['gene_ids','gene_symbols', 'feature_types']] features.to_csv(os.path.join(path, "features.tsv.gz"), sep='\t', index=None, header=None) From 5cb560205f1489eadbe6dc71f8b1626656fc1e37 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:17:47 +0200 Subject: [PATCH 14/28] fixes adt --- panpipes/funcs/processing.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/panpipes/funcs/processing.py b/panpipes/funcs/processing.py index b1781a92..906dbc70 100644 --- a/panpipes/funcs/processing.py +++ b/panpipes/funcs/processing.py @@ -253,7 +253,7 @@ def merge_with_adata_obs(adata, df, on_col="sample_id", inplace=False): return new_df -def add_var_mtd(prot, df, left_on="adt_id", right_on="adt_id"): +def add_var_mtd(prot, df, left_on="prot_id", right_on="prot_id"): if not isinstance(prot, AnnData): raise TypeError("prot is not an AnnData object") if not isinstance(df, pd.DataFrame): @@ -276,7 +276,7 @@ def add_var_mtd(prot, df, left_on="adt_id", right_on="adt_id"): prot.var.index = orig_index logging.debug(prot.var.head()) else: - warnings.warn(UserWarning("prot.var is at risk of getting disrodered, not merging the df")) + warnings.warn(UserWarning("prot.var is at risk of getting disordered, not merging the df")) From ecfb26f96df96b994d971e27e5a8b10015c3f477 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:23:44 +0200 Subject: [PATCH 15/28] adt check flow --- panpipes/funcs/scmethods.py | 2 +- panpipes/panpipes/pipeline_ingest.py | 4 ++-- panpipes/panpipes/pipeline_ingest/pipeline.yml | 16 ++++++++-------- panpipes/python_scripts/run_dsb_clr.py | 18 +++++++++--------- 4 files changed, 20 insertions(+), 20 deletions(-) diff --git a/panpipes/funcs/scmethods.py b/panpipes/funcs/scmethods.py index d7213ad3..c9d1932e 100644 --- a/panpipes/funcs/scmethods.py +++ b/panpipes/funcs/scmethods.py @@ -413,7 +413,7 @@ def X_is_raw(adata): # run clr -def run_adt_normalise(mdata, +def run_prot_normalise(mdata, mdata_bg, method: Literal['dsb', 'clr'] = 'clr', clr_margin: Literal[0, 1] = '0', diff --git a/panpipes/panpipes/pipeline_ingest.py b/panpipes/panpipes/pipeline_ingest.py index 2e714c4f..430c5b42 100644 --- a/panpipes/panpipes/pipeline_ingest.py +++ b/panpipes/panpipes/pipeline_ingest.py @@ -427,8 +427,8 @@ def run_scanpy_prot_qc(log_file, outfile, unfilt_file): cmd += " --channel_col sample_id" if PARAMS['prot_plotqc_metrics']: cmd += " --per_cell_metrics %(prot_plotqc_metrics)s" - if PARAMS['prot_metrics_per_adt']: - cmd += " --per_adt_metrics %(prot_metrics_per_adt)s" + if PARAMS['prot_metrics_per_prot']: + cmd += " --per_prot_metrics %(prot_metrics_per_prot)s" if PARAMS['identify_isotype_outliers']: cmd += " --identify_isotype_outliers %(identify_isotype_outliers)s" cmd += " --isotype_upper_quantile %(isotype_upper_quantile)s" diff --git a/panpipes/panpipes/pipeline_ingest/pipeline.yml b/panpipes/panpipes/pipeline_ingest/pipeline.yml index 221b0b26..e9ce97d7 100644 --- a/panpipes/panpipes/pipeline_ingest/pipeline.yml +++ b/panpipes/panpipes/pipeline_ingest/pipeline.yml @@ -87,20 +87,20 @@ barcode_mtd: metadatacols: #--------------------------------------- -# Loading adt data - additional options +# Loading prot data - additional options #--------------------------------------- # As default the ingest will choose the first column of the cellranger features.tsv.gz # To merge extra information about the antibodies e.g. whether they are hashing antibodies or isotypes, # Create a table with the first column equivalent to the first column of cellrangers features.tsv.gz # and specify it below (save as txt file) protein_metadata_table: -# Make sure there are unique entries in this column for each adt. +# Make sure there are unique entries in this column for each prot. # include a column called isotype (containing True or False) if you want to qc isotypes # include a column called hashing_ab (containing True or False) if your dataset contains hashing antibodies which you wish to split into a separate modality # If you want to update the mudata index with a column from protein_metadata_table, specify here # If there are overlaps with the rna gene symbols, then you might have trouble down the line -# It is recommended to add something to make the labels unique e.g. "adt_CD4" +# It is recommended to add something to make the labels unique e.g. "prot_CD4" index_col_choice: # If providing separate prot and rna 10X outputs, then the pipeline will load the filtered rna 10X outputs and the raw prot 10X counts @@ -200,16 +200,16 @@ plotqc_rna_metrics: doublet_scores,pct_counts_mt,pct_counts_rp,pct_counts_hb,pct ### ADT QC metrics: # as standard the following metrics are calculated for prot data # per cell metrics: -# total_counts,log1p_total_counts,n_adt_by_counts,log1p_n_adt_by_counts +# total_counts,log1p_total_counts,n_prot_by_counts,log1p_n_prot_by_counts # if isotypes can be detected then the following are calculated also: # total_counts_isotype,pct_counts_isotype # choose which ones you want to plot here -plotqc_prot_metrics: total_counts,log1p_total_counts,n_adt_by_counts,pct_counts_isotype +plotqc_prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype # per antibody metrics # n_cells_by_counts,mean_counts,log1p_mean_counts,pct_dropout_by_counts,total_counts,log1p_total_counts # choose which ones you want to plot here: -prot_metrics_per_adt: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts +plot_metrics_per_prot: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts # isotype outliers: one way to determine which cells are very sticky is to work out which cells have the most isotype UMIs # associated to them, to label a cell as an isotype outlier, it must meet or exceed the following crietria: @@ -228,7 +228,7 @@ isotype_n_pass: 2 # i.e. the 10X channel. # this is usually the sample_id column (otherwise leave the next parameter blank) channel_col: sample_id -# It is important to note that in ingest the per channel normalised adt scores are not saved in the mudata object +# It is important to note that in ingest the per channel normalised prot scores are not saved in the mudata object # this is because if you perform feature normalisation (clr normalisation margin 0 or dsb normalisation), # on subsets of cells then the normalised values cannot be simply concatenated. # The ADT normalisation is rerun pn the complete object in the preprocess pipeline @@ -246,7 +246,7 @@ normalisation_methods: clr # CLR parameters: # margin determines whether you normalise per cell (as you would RNA norm), -# or by feature (recommended, due to the variable nature of adts). +# or by feature (recommended, due to the variable nature of prot assays). # CLR margin 0 is recommended for informative qc plots in this pipeline # 0 = normalise colwise (per feature) # 1 = normalise rowwise (per cell) diff --git a/panpipes/python_scripts/run_dsb_clr.py b/panpipes/python_scripts/run_dsb_clr.py index 7fb0ffab..201231e1 100644 --- a/panpipes/python_scripts/run_dsb_clr.py +++ b/panpipes/python_scripts/run_dsb_clr.py @@ -35,7 +35,7 @@ default=False, help="") parser.add_argument("--figpath", - default="./adt_figures", + default="./prot_figures", help="") parser.add_argument("--save_mtx", default=False, @@ -106,7 +106,7 @@ if 'clr' in norm_methods: # make sure to start from raw counts mdata["prot"].X = mdata["prot"].layers["raw_counts"] - pnp.scmethods.run_adt_normalise(mdata=mdata, mdata_raw=None, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_raw=None, method="clr", clr_margin=int(args.clr_margin)) L.info("saving ridgeplot") @@ -117,7 +117,7 @@ plt.savefig(os.path.join(args.figpath, si + "_clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_clr" , si ), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_clr" , si ), layer="raw_counts") # then run dsb if 'dsb' in norm_methods: mdata_raw = mdata_raw_per_sample[si] @@ -129,7 +129,7 @@ # mu.pl.histogram(mdata_raw["rna"], ["log10umi"], bins=50) # plt.savefig(os.path.join(args.figpath, si + "_log10umi.png")) # run dsb - pnp.scmethods.run_adt_normalise(mdata=mdata, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_raw=mdata_raw, method="dsb", isotypes=isotypes) @@ -139,7 +139,7 @@ pnp.plotting.ridgeplot(mdata["prot"], features=isotypes, layer="dsb", splitplot=1) plt.savefig(os.path.join(args.figpath, si + "_dsb_ridgeplot_isotypes.png")) if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_dsb" , si), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_dsb" , si), layer="raw_counts") else: # run on all the data (not on a channel basis) # first run clr @@ -151,7 +151,7 @@ # make sure to start from raw counts all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"] # run normalise - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_raw=None, method="clr", clr_margin=int(args.clr_margin)) @@ -162,7 +162,7 @@ plt.savefig(os.path.join(args.figpath, "clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_clr"), layer="clr") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_clr"), layer="clr") if 'dsb' in norm_methods: # make sure to start from raw counts all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"] @@ -171,7 +171,7 @@ all_mdata_raw["rna"].obs["log10umi"] = np.array(np.log10(all_mdata_raw["rna"].X.sum(axis=1) + 1)).reshape(-1) mu.pl.histogram(all_mdata_raw["rna"], ["log10umi"], bins=50) plt.savefig(os.path.join(args.figpath, "all_log10umi.png")) - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_raw=all_mdata_raw, method="dsb", isotypes=isotypes) @@ -187,7 +187,7 @@ plt.savefig(os.path.join(args.figpath, "dsb_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_dsb"), layer="dsb") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_dsb"), layer="dsb") # run pca on dsb normalised (if dsb was run otherwise clr is in X) sc.tl.pca(all_mdata['prot'], n_comps=50, svd_solver='arpack', random_state=0) From 2de137c48579696ee4cbaa79ed87485e2c63e5ea Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:33:23 +0200 Subject: [PATCH 16/28] multiple files adt prot --- .../pipeline_preprocess_preprint.yml | 4 ++-- panpipes/R_scripts/plotQC.R | 14 ++++++------ .../pipeline_integration/pipeline.yml | 4 ++-- .../panpipes/pipeline_preprocess/pipeline.yml | 4 ++-- .../aggregate_cellranger_summary_metrics.py | 2 +- .../python_scripts/batch_correct_totalvi.py | 12 +++++----- .../python_scripts/run_preprocess_prot.py | 22 ++++++++++--------- panpipes/python_scripts/run_scanpyQC_prot.py | 22 +++++++++---------- panpipes/resources/sample_file_qc_mm.txt | 2 +- .../resources/sample_file_qc_mm_gex_only.txt | 2 +- tests/sample_caf.tsv | 2 +- 11 files changed, 46 insertions(+), 44 deletions(-) diff --git a/docs/workflows/pipeline_preprocess_preprint.yml b/docs/workflows/pipeline_preprocess_preprint.yml index d985d7d1..8b1488da 100644 --- a/docs/workflows/pipeline_preprocess_preprint.yml +++ b/docs/workflows/pipeline_preprocess_preprint.yml @@ -157,7 +157,7 @@ plotqc: grouping_var: sample_id,orig.ident # use these continuous variables to plot gradients and distributions rna_metrics: pct_counts_mt,pct_counts_rp,pct_counts_hb,doublet_scores - prot_metrics: total_counts,log1p_total_counts,n_adt_by_counts + prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts atac_metrics: total_counts rep_metrics: @@ -229,7 +229,7 @@ prot: # CLR parameters: # margin determines whether you normalise per cell (as you would RNA norm), - # or by feature (recommended, due to the variable nature of adts). + # or by feature (recommended, due to the variable nature of prot assays). # CLR margin 0 is recommended for informative qc plots in this pipeline # 0 = normalise colwise (per feature) # 1 = normalise rowwise (per cell) diff --git a/panpipes/R_scripts/plotQC.R b/panpipes/R_scripts/plotQC.R index c29e3494..58d61665 100644 --- a/panpipes/R_scripts/plotQC.R +++ b/panpipes/R_scripts/plotQC.R @@ -238,7 +238,7 @@ if(!is.null(opt$prot_qc_metrics)){ } # do the following plots: - # - prot:total_counts vs prot_nadt_by_counts + # - prot:total_counts vs prot_nprot_by_counts # - prot:total_counts vs prot:isotype_counts # - prot:log1p_total_counts vs prot:log1p_isotype_counts # - prot:total_counts vs prot:pct_isotype_counts @@ -255,14 +255,14 @@ if(!is.null(opt$prot_qc_metrics)){ } # plot once per source facet message("protein scatter plots") - if (all(c("total_counts","n_adt_by_counts")%in% colnames(prot_data_plot))){ - g <- do_scatter_plot(prot_data_plot,x="total_counts",y="n_adt_by_counts", facet=sc) - ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-adt.png")), type="cairo", + if (all(c("total_counts","n_prot_by_counts")%in% colnames(prot_data_plot))){ + g <- do_scatter_plot(prot_data_plot,x="total_counts",y="n_prot_by_counts", facet=sc) + ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-prot.png")), type="cairo", width= 2*ncols, height=2*nrows, dpi=120) } - if (all(c("log1p_total_counts","log1p_n_adt_by_counts")%in% colnames(prot_data_plot))){ - g <- do_scatter_plot(prot_data_plot,x="log1p_total_counts",y="log1p_n_adt_by_counts", facet=sc) - ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-log1p_nUMI_v_prot-log1p_adt.png")), type="cairo", + if (all(c("log1p_total_counts","log1p_n_prot_by_counts")%in% colnames(prot_data_plot))){ + g <- do_scatter_plot(prot_data_plot,x="log1p_total_counts",y="log1p_n_prot_by_counts", facet=sc) + ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-log1p_nUMI_v_prot-log1p_prot.png")), type="cairo", width= 2*ncols, height=2*nrows, dpi=120) } if (all(c("total_counts","pct_counts_isotype")%in% colnames(prot_data_plot))){ diff --git a/panpipes/panpipes/pipeline_integration/pipeline.yml b/panpipes/panpipes/pipeline_integration/pipeline.yml index a51b5728..cebaff87 100644 --- a/panpipes/panpipes/pipeline_integration/pipeline.yml +++ b/panpipes/panpipes/pipeline_integration/pipeline.yml @@ -223,9 +223,9 @@ multimodal: modalities: rna,prot exclude_mt_genes: True mt_column: mt - # to filter outliers manually create a column called adt_outliers in mdata['prot'].obs + # to filter outliers manually create a column called prot_outliers in mdata['prot'].obs filter_by_hvg: True - filter_adt_outliers: False + filter_prot_outliers: False model_args: latent_distribution: "normal" training_args: diff --git a/panpipes/panpipes/pipeline_preprocess/pipeline.yml b/panpipes/panpipes/pipeline_preprocess/pipeline.yml index db229e66..e36a9f3e 100644 --- a/panpipes/panpipes/pipeline_preprocess/pipeline.yml +++ b/panpipes/panpipes/pipeline_preprocess/pipeline.yml @@ -167,7 +167,7 @@ plotqc: grouping_var: sample_id # specify metrics in the metadata to plot: rna_metrics: pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig,doublet_scores - prot_metrics: total_counts,log1p_total_counts,n_adt_by_counts,pct_counts_isotype + prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype atac_metrics: rep_metrics: # spatialT_metrics: total_counts @@ -238,7 +238,7 @@ prot: # CLR parameters: # margin determines whether you normalise per cell (as you would RNA norm), - # or by feature (recommended, due to the variable nature of adts). + # or by feature (recommended, due to the variable nature of prot assays). # CLR margin 0 is recommended for informative qc plots in this pipeline # 0 = normalise colwise (per feature) # 1 = normalise rowwise (per cell) diff --git a/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py b/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py index 47530a0a..fc297932 100644 --- a/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py +++ b/panpipes/python_scripts/aggregate_cellranger_summary_metrics.py @@ -94,7 +94,7 @@ def get_all_unique_paths(pipe_df): all_paths = all_paths.drop_duplicates() # rename path_type to match cellranger terminolgy recode_dict = {'rna_path': "Gene Expression", - 'adt_path': "Antibody Capture", + 'prot_path': "Antibody Capture", 'tcr_path': 'VDJ T', 'bcr_path': 'VDJ B'} all_paths['path_type'] = all_paths['path_type'].replace(recode_dict) diff --git a/panpipes/python_scripts/batch_correct_totalvi.py b/panpipes/python_scripts/batch_correct_totalvi.py index 8fde16fb..b7214484 100644 --- a/panpipes/python_scripts/batch_correct_totalvi.py +++ b/panpipes/python_scripts/batch_correct_totalvi.py @@ -123,14 +123,14 @@ sc_raw = sc_raw[sc_raw.obs_names.isin(rna.obs_names),: ] rna.layers["counts"] = sc_raw.X.copy() -# filter adt outliers -if params['multimodal']['totalvi']['filter_adt_outliers']: - # for this to work the user needs to (manually) make a column called adt_outliers +# filter prot outliers +if params['multimodal']['totalvi']['filter_prot_outliers']: + # for this to work the user needs to (manually) make a column called prot_outliers # actually there is a thing in the qc pipe that calculates outliers, I don't like it very much though - if "adt_outliers" in mdata['prot'].columns: - mu.pp.filter_obs(mdata, "adt_outliers") + if "prot_outliers" in mdata['prot'].columns: + mu.pp.filter_obs(mdata, "prot_outliers") else: - raise ValueError("adt_outliers column not found in mdata['prot'].obs") + raise ValueError("prot_outliers column not found in mdata['prot'].obs") # exluding isotypes if 'isotype' in prot.var.columns: diff --git a/panpipes/python_scripts/run_preprocess_prot.py b/panpipes/python_scripts/run_preprocess_prot.py index 7563869f..f06665c9 100644 --- a/panpipes/python_scripts/run_preprocess_prot.py +++ b/panpipes/python_scripts/run_preprocess_prot.py @@ -44,7 +44,7 @@ default=None, help="") parser.add_argument("--figpath", - default="./adt_figures", + default="./prot_figures", help="") parser.add_argument("--save_mtx", default=False, @@ -55,7 +55,9 @@ args, opt = parser.parse_known_args() -# args = argparse.Namespace(filtered_mudata='test.h5mu', bg_mudata='/well/cartography/users/zsj686/non_cart_projects/005-multimodal_scpipelines/ingest/test_raw.h5mu', channel_col=None, normalisation_methods='clr,dsb', clr_margin='0', quantile_clipping='True', figpath='./figures/adt', save_mtx=False, save_mudata_path='test.h5mu') +# args = argparse.Namespace(filtered_mudata='test.h5mu', +# bg_mudata='/well/cartography/users/zsj686/non_cart_projects/005-multimodal_scpipelines/ingest/test_raw.h5mu', +# channel_col=None, normalisation_methods='clr,dsb', clr_margin='0', quantile_clipping='True', figpath='./figures/prot', save_mtx=False, save_mudata_path='test.h5mu') save_mtx=pnp.pp.check_for_bool(args.save_mtx) norm_methods = args.normalisation_methods.split(',') @@ -137,7 +139,7 @@ if 'clr' in norm_methods: # make sure to start from raw counts mdata["prot"].X = mdata["prot"].layers["raw_counts"].copy() - pnp.scmethods.run_adt_normalise(mdata=mdata, mdata_bg=None, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_bg=None, method="clr", clr_margin=int(args.clr_margin)) L.info("saving ridgeplot") @@ -148,7 +150,7 @@ plt.savefig(os.path.join(args.figpath, str(si) + "_clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_clr" , str(si) ), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_clr" , str(si) ), layer="raw_counts") # then run dsb if 'dsb' in norm_methods: mdata_bg = mdata_bg_per_sample[si] @@ -161,7 +163,7 @@ # plt.savefig(os.path.join(args.figpath, si + "_log10umi.png")) # run dsb # this stores a layer named after the method as well as overwriting X - pnp.scmethods.run_adt_normalise(mdata=mdata, + pnp.scmethods.run_prot_normalise(mdata=mdata, mdata_bg=mdata_bg, method="dsb", isotypes=isotypes) @@ -171,7 +173,7 @@ pnp.plotting.ridgeplot(mdata["prot"], features=isotypes, layer="dsb", splitplot=1) plt.savefig(os.path.join(args.figpath, str(si) + "_dsb_ridgeplot_isotypes.png")) if save_mtx: - pnp.io.write_10x_counts(mdata["prot"], os.path.join("adt_dsb" , str(si)), layer="raw_counts") + pnp.io.write_10x_counts(mdata["prot"], os.path.join("prot_dsb" , str(si)), layer="raw_counts") else: # run on all the data (not on a channel basis) # first run clr @@ -184,7 +186,7 @@ all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"].copy() # run normalise # this stores a layer named after the method as well as overwriting X - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_bg=None, method="clr", clr_margin=int(args.clr_margin)) @@ -195,7 +197,7 @@ plt.savefig(os.path.join(args.figpath, "clr_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_clr"), layer="clr") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_clr"), layer="clr") if 'dsb' in norm_methods: # make sure to start from raw counts all_mdata["prot"].X = all_mdata["prot"].layers["raw_counts"].copy() @@ -205,7 +207,7 @@ mu.pl.histogram(all_mdata_bg["rna"], ["log10umi"], bins=50) plt.savefig(os.path.join(args.figpath, "all_log10umi.png")) # this stores a layer named after the method as well as overwriting X - pnp.scmethods.run_adt_normalise(mdata=all_mdata, + pnp.scmethods.run_prot_normalise(mdata=all_mdata, mdata_bg=all_mdata_bg, method="dsb", isotypes=isotypes) @@ -221,7 +223,7 @@ plt.savefig(os.path.join(args.figpath, "dsb_ridgeplot_isotypes.png")) # save out the data in what format? if save_mtx: - pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("adt_dsb"), layer="dsb") + pnp.io.write_10x_counts(all_mdata["prot"], os.path.join("prot_dsb"), layer="dsb") if args.store_as_x is not None: all_mdata["prot"].X = all_mdata["prot"].layers[args.store_as_x] diff --git a/panpipes/python_scripts/run_scanpyQC_prot.py b/panpipes/python_scripts/run_scanpyQC_prot.py index abcaac84..86630a2c 100644 --- a/panpipes/python_scripts/run_scanpyQC_prot.py +++ b/panpipes/python_scripts/run_scanpyQC_prot.py @@ -64,7 +64,7 @@ parser.add_argument("--per_cell_metrics", default=None, help="") -parser.add_argument("--per_adt_metrics", +parser.add_argument("--per_prot_metrics", default=None, help="") @@ -118,7 +118,7 @@ # if isotypes are found then include them in the caluclate qc metrics call. # this calculates standard metrics sc.pp.calculate_qc_metrics(prot, - var_type="adt", + var_type="prot", qc_vars=qc_vars, percent_top=None,log1p=True, inplace=True) @@ -161,7 +161,7 @@ for si in prot.obs[channel_col].unique(): L.info(si) obs_df, var_df = sc.pp.calculate_qc_metrics(prot[prot.obs[channel_col]==si], - var_type="adt", + var_type="prot", qc_vars=qc_vars, percent_top=None,log1p=True, inplace=False) # store the var df for the metrics we care about in the dictionary. @@ -170,24 +170,24 @@ out[si][channel_col] = si # concatenate all the var_dfs, so we can summarise the per channel data in boxplots. -var_dat = pd.concat(out).reset_index().rename(columns={'level_0':channel_col, 'level_1': "adt_id"}) -adt_id_col=var_dat.columns[1] +var_dat = pd.concat(out).reset_index().rename(columns={'level_0':channel_col, 'level_1': "prot_id"}) +prot_id_col=var_dat.columns[1] -if args.per_adt_metrics is not None: - per_adt_metrics = args.per_adt_metrics.split(",") - per_adt_metrics = [a.strip() for a in per_adt_metrics] +if args.per_prot_metrics is not None: + per_prot_metrics = args.per_prot_metrics.split(",") + per_prot_metrics = [a.strip() for a in per_prot_metrics] # let's make some boxplot figures. - for qcp in per_adt_metrics: + for qcp in per_prot_metrics: fig, ax = plt.subplots(nrows=1,figsize=(24,6)) - sns.boxplot(data=var_dat, x=adt_id_col, y=qcp, ax=ax) + sns.boxplot(data=var_dat, x=prot_id_col, y=qcp, ax=ax) adjust_x_axis(ax) fig.tight_layout() plt.savefig(os.path.join(figdir, "boxplot_" + channel_col + "_" + qcp + ".png")) plt.close() # write out the per channel metrics in a separate csv. - var_dat.to_csv(args.sampleprefix + "_adt_qc_metrics_per_" + channel_col + ".csv") + var_dat.to_csv(args.sampleprefix + "_prot_qc_metrics_per_" + channel_col + ".csv") L.info("Done") diff --git a/panpipes/resources/sample_file_qc_mm.txt b/panpipes/resources/sample_file_qc_mm.txt index 7e45675e..77775ce8 100644 --- a/panpipes/resources/sample_file_qc_mm.txt +++ b/panpipes/resources/sample_file_qc_mm.txt @@ -1,3 +1,3 @@ -sample_id rna_path rna_filetype adt_path adt_filetype tissue diagnosis +sample_id rna_path rna_filetype prot_path prot_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/panpipes/resources/sample_file_qc_mm_gex_only.txt b/panpipes/resources/sample_file_qc_mm_gex_only.txt index 7e45675e..77775ce8 100644 --- a/panpipes/resources/sample_file_qc_mm_gex_only.txt +++ b/panpipes/resources/sample_file_qc_mm_gex_only.txt @@ -1,3 +1,3 @@ -sample_id rna_path rna_filetype adt_path adt_filetype tissue diagnosis +sample_id rna_path rna_filetype prot_path prot_filetype tissue diagnosis Sample1 Sample1_gex.csv csv_matrix Sample1_adt.csv csv_matrix pbmc healthy Sample2 cellranger_count/Sample2_GEX/outs/ cellranger cellranger_count/Sample2_CITE/outs/ cellranger pbmc diseased \ No newline at end of file diff --git a/tests/sample_caf.tsv b/tests/sample_caf.tsv index cdcd773b..d60886ed 100644 --- a/tests/sample_caf.tsv +++ b/tests/sample_caf.tsv @@ -1,3 +1,3 @@ -sample_id rna_path rna_filetype adt_path adt_filetype +sample_id rna_path rna_filetype prot_path prot_filetype sample1 ../sample_1_rna cellranger ../sample_1_prot cellranger sample2 ../sample_2_rna cellranger ../sample_2_prot cellranger From 0b790ae0a12f610bae000bfaa9884b10e16f3bc1 Mon Sep 17 00:00:00 2001 From: bio-la Date: Mon, 25 Sep 2023 12:35:57 +0200 Subject: [PATCH 17/28] typos --- panpipes/panpipes/pipeline_ingest.py | 4 ++-- panpipes/python_scripts/assess_background.py | 2 +- panpipes/python_scripts/concat_adata.py | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/panpipes/panpipes/pipeline_ingest.py b/panpipes/panpipes/pipeline_ingest.py index 430c5b42..acda4dbc 100644 --- a/panpipes/panpipes/pipeline_ingest.py +++ b/panpipes/panpipes/pipeline_ingest.py @@ -427,8 +427,8 @@ def run_scanpy_prot_qc(log_file, outfile, unfilt_file): cmd += " --channel_col sample_id" if PARAMS['prot_plotqc_metrics']: cmd += " --per_cell_metrics %(prot_plotqc_metrics)s" - if PARAMS['prot_metrics_per_prot']: - cmd += " --per_prot_metrics %(prot_metrics_per_prot)s" + if PARAMS['plot_metrics_per_prot']: + cmd += " --per_prot_metrics %(plot_metrics_per_prot)s" if PARAMS['identify_isotype_outliers']: cmd += " --identify_isotype_outliers %(identify_isotype_outliers)s" cmd += " --isotype_upper_quantile %(isotype_upper_quantile)s" diff --git a/panpipes/python_scripts/assess_background.py b/panpipes/python_scripts/assess_background.py index 086a8580..634278f6 100644 --- a/panpipes/python_scripts/assess_background.py +++ b/panpipes/python_scripts/assess_background.py @@ -147,7 +147,7 @@ # quantifying the top background features ## Repeat for protein (if it exists) if 'prot' in mdata.mod.keys(): - # this time we'll just use all the adts. + # this time we'll just use all the prot vars. top_genes = list(mdata_bg['prot'].var_names) bg_df = pnp.scmethods.get_mean_background_fraction(mdata_bg['prot'], top_background=top_genes, group_by=args.channel_col) if bg_df.shape[0] >1: diff --git a/panpipes/python_scripts/concat_adata.py b/panpipes/python_scripts/concat_adata.py index 02a0a468..6344b8eb 100644 --- a/panpipes/python_scripts/concat_adata.py +++ b/panpipes/python_scripts/concat_adata.py @@ -178,7 +178,7 @@ L.debug(mdata.obs.columns) L.debug(mdata.obs.head()) -# update the protein variable to add in extra info like isotype and alternate name for adt +# update the protein variable to add in extra info like isotype and alternate name for prot if args.protein_var_table is not None: try: df = pd.read_csv(args.protein_var_table, sep='\t', index_col=0) From d153b4e292a0b1fee126505d3f11898b68b1922d Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 29 Sep 2023 11:50:05 +0200 Subject: [PATCH 18/28] fixes to plotting params in yml and ingest --- panpipes/panpipes/pipeline_ingest.py | 2 +- .../panpipes/pipeline_ingest/pipeline.yml | 22 +++++++------------ 2 files changed, 9 insertions(+), 15 deletions(-) diff --git a/panpipes/panpipes/pipeline_ingest.py b/panpipes/panpipes/pipeline_ingest.py index acda4dbc..55693c12 100644 --- a/panpipes/panpipes/pipeline_ingest.py +++ b/panpipes/panpipes/pipeline_ingest.py @@ -426,7 +426,7 @@ def run_scanpy_prot_qc(log_file, outfile, unfilt_file): else: cmd += " --channel_col sample_id" if PARAMS['prot_plotqc_metrics']: - cmd += " --per_cell_metrics %(prot_plotqc_metrics)s" + cmd += " --per_cell_metrics %(plotqc_prot_metrics)s" if PARAMS['plot_metrics_per_prot']: cmd += " --per_prot_metrics %(plot_metrics_per_prot)s" if PARAMS['identify_isotype_outliers']: diff --git a/panpipes/panpipes/pipeline_ingest/pipeline.yml b/panpipes/panpipes/pipeline_ingest/pipeline.yml index e9ce97d7..a6fc1242 100644 --- a/panpipes/panpipes/pipeline_ingest/pipeline.yml +++ b/panpipes/panpipes/pipeline_ingest/pipeline.yml @@ -181,7 +181,7 @@ custom_genes_file: resources/qc_genelist_1.0.csv calc_proportions: hb,mt,rp score_genes: MarkersNeutro - +# ------------ # Plot RNA QC metrics # ------------ # all metrics should be inputted as a comma separated string e.g. a,b,c @@ -191,13 +191,12 @@ plotqc_grouping_var: sample_id plotqc_rna_metrics: doublet_scores,pct_counts_mt,pct_counts_rp,pct_counts_hb,pct_counts_ig - # ------------ -# ADT qc - requires prot_path to be included in the submission file +# Plot PROT QC metrics # ------------ +# requires prot_path to be included in the submission file # all metrics should be inputted as a comma separated string e.g. a,b,c -### ADT QC metrics: # as standard the following metrics are calculated for prot data # per cell metrics: # total_counts,log1p_total_counts,n_prot_by_counts,log1p_n_prot_by_counts @@ -205,10 +204,10 @@ plotqc_rna_metrics: doublet_scores,pct_counts_mt,pct_counts_rp,pct_counts_hb,pct # total_counts_isotype,pct_counts_isotype # choose which ones you want to plot here plotqc_prot_metrics: total_counts,log1p_total_counts,n_prot_by_counts,pct_counts_isotype - -# per antibody metrics +# Since the protein antibody panels usually count fewer features than the RNA, it may be interesting to +# visualize breakdowns of single proteins plots describing their count distribution and other qc options. +# choose which ones you want to plot here, for example # n_cells_by_counts,mean_counts,log1p_mean_counts,pct_dropout_by_counts,total_counts,log1p_total_counts -# choose which ones you want to plot here: plot_metrics_per_prot: total_counts,log1p_total_counts,n_cells_by_counts,mean_counts # isotype outliers: one way to determine which cells are very sticky is to work out which cells have the most isotype UMIs @@ -218,7 +217,6 @@ plot_metrics_per_prot: total_counts,log1p_total_counts,n_cells_by_counts,mean_co identify_isotype_outliers: True isotype_upper_quantile: 90 isotype_n_pass: 2 -# TODO: work out how to plot this. ### Investigate per channel antibody staining: # This can help determine any inconsistencies in staining per channel and other QC concerns. @@ -263,11 +261,8 @@ quantile_clipping: True # from both rna and protein assays, # see details for how to make sure your files are compatible in the assess background section below - - - # ------------ -# ATAC qc +# Plot ATAC QC metrics # ------------ # we require initializing one csv file per aggregated ATAC/multiome experiment. @@ -287,9 +282,8 @@ features_tss: #resources/features_tss_hg19.tsv # all metrics should be inputted as a comma separated string e.g. a,b,c plotqc_atac_metrics: n_genes_by_counts,total_counts,pct_fragments_in_peaks,atac_peak_region_fragments,atac_mitochondrial_reads,atac_TSS_fragments - # ------------ -# Repertoire QC +# Plot Repertoire QC metrics # ------------ # Repertoire data will be stored in one modality called "rep", if you provide both TCR and BCR data then this will be merged, # but various functions will be fun on TCR and BCR separately From 327eb723e0b9c0966b47a6b06ada13fcc702c2ee Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 29 Sep 2023 11:58:33 +0200 Subject: [PATCH 19/28] fixes to names --- panpipes/panpipes/pipeline_ingest.py | 2 +- panpipes/panpipes/pipeline_ingest/pipeline.yml | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/panpipes/panpipes/pipeline_ingest.py b/panpipes/panpipes/pipeline_ingest.py index 55693c12..f1d1b89e 100644 --- a/panpipes/panpipes/pipeline_ingest.py +++ b/panpipes/panpipes/pipeline_ingest.py @@ -425,7 +425,7 @@ def run_scanpy_prot_qc(log_file, outfile, unfilt_file): cmd += " --channel_col %(channel_col)s" else: cmd += " --channel_col sample_id" - if PARAMS['prot_plotqc_metrics']: + if PARAMS['plotqc_prot_metrics']: cmd += " --per_cell_metrics %(plotqc_prot_metrics)s" if PARAMS['plot_metrics_per_prot']: cmd += " --per_prot_metrics %(plot_metrics_per_prot)s" diff --git a/panpipes/panpipes/pipeline_ingest/pipeline.yml b/panpipes/panpipes/pipeline_ingest/pipeline.yml index a6fc1242..e3e0366f 100644 --- a/panpipes/panpipes/pipeline_ingest/pipeline.yml +++ b/panpipes/panpipes/pipeline_ingest/pipeline.yml @@ -120,7 +120,7 @@ subset_prot_barcodes_to_rna: False plot_10X_metrics: True kneeplot: False # ------------ -## Doublets on GEX - Scrublet +## Doublets on RNA - Scrublet # ------------ scr: run: True @@ -155,7 +155,7 @@ scr: #if use_thr is True, this thr will be used to define doublets # ------------ -# GEX QC +# RNA QC # ------------ # this part of the pipeline allows to generate the QC parameters that will be used to # evaluate inclusion/ exclusion criteria. Filtering of cells/genes happens in the following pipeline From 556e66404a8a60720bcecd6c78a94ae531ca13ec Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 29 Sep 2023 12:15:11 +0200 Subject: [PATCH 20/28] plotting fixes --- panpipes/R_scripts/plotQC.R | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/panpipes/R_scripts/plotQC.R b/panpipes/R_scripts/plotQC.R index 58d61665..c8b1e7b0 100644 --- a/panpipes/R_scripts/plotQC.R +++ b/panpipes/R_scripts/plotQC.R @@ -17,7 +17,8 @@ do_scatter_plot <- function(df, x, y, facet=NULL, hue=NULL){ } g <- g+ theme_bw()+ theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5), - strip.text.x = element_text(size=6)) + strip.text.x = element_text(size=6), + legend.key.size = unit(0.2, 'cm')) return(g) } @@ -184,25 +185,25 @@ for (sc in rna_source_facet){ if (all(c("total_counts","n_genes_by_counts")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="total_counts",y="n_genes_by_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-nUMI_v_rna-genes.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("log1p_total_counts","log1p_n_genes_by_counts")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="log1p_total_counts",y="log1p_n_genes_by_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-log1p_nUMI_v_rna-log1p_genes.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("total_counts","pct_counts_mt")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="total_counts",y="pct_counts_mt", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-nUMI_v_rna-pct_mt.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("n_genes_by_counts","doublet_scores","total_counts")%in% colnames(rna_data_plot))){ g <- do_scatter_plot(rna_data_plot,x="n_genes_by_counts",y="doublet_scores", hue="total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_rna-genes_rna-doublet_scores_rna-numi.png")), - type="cairo", width= 2*ncols, height=2*nrows, dpi=120) + type="cairo", width= 3*ncols, height=3*nrows, dpi=200) } } @@ -258,27 +259,27 @@ if(!is.null(opt$prot_qc_metrics)){ if (all(c("total_counts","n_prot_by_counts")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="total_counts",y="n_prot_by_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-prot.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("log1p_total_counts","log1p_n_prot_by_counts")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="log1p_total_counts",y="log1p_n_prot_by_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-log1p_nUMI_v_prot-log1p_prot.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("total_counts","pct_counts_isotype")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="total_counts",y="pct_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-pct_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("total_counts","total_counts_isotype")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="total_counts",y="total_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-nUMI_v_prot-total_counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("log1p_total_counts","log1p_total_counts_isotype", "isotype_exclude")%in% colnames(prot_data_plot))){ g <- do_scatter_plot(prot_data_plot,x="log1p_total_counts",y="log1p_total_counts_isotype", hue="isotype_exclude", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "_prot-log1p_nUMI_v_prot-log1p_total_counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } } @@ -376,27 +377,27 @@ if(!is.null(opt$prot_qc_metrics)){ if (all(c("rna.total_counts","prot.total_counts")%in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.total_counts",y="prot.total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-nUMI_v_rna-nUMI.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.log1p_total_counts","prot.log1p_total_counts")%in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.log1p_total_counts",y="prot.log1p_total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-log1p_nUMI_v_rna-log1p_nUMI.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.total_counts","prot.total_counts_isotype")%in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.total_counts",y="prot.total_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-nUMI_v_prot-counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.log1p_total_counts","prot.log1p_total_counts_isotype") %in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.log1p_total_counts",y="prot.log1p_total_counts_isotype", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-log1p_nUMI_v_prot-log1p_counts_isotype.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } if (all(c("rna.doublet_scores","prot.log1p_total_counts") %in% colnames(data_plot))){ g <- do_scatter_plot(data_plot,x="rna.doublet_scores",y="prot.log1p_total_counts", facet=sc) ggsave(g, filename=file.path(outpath, paste0("scatter_", sc, "-doublet_scores_v_prot-log1p_nUMI.png")), type="cairo", - width= 2*ncols, height=2*nrows, dpi=120) + width= 3*ncols, height=3*nrows, dpi=200) } From b10fe3b5e8c776f005263a003a43e792441ee62a Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 29 Sep 2023 12:23:53 +0200 Subject: [PATCH 21/28] fix to names --- .../pipeline_preprocess_preprint.yml | 10 ++++----- .../R_scripts/produce_barplot_10xmetric.v3.R | 4 ++-- panpipes/funcs/io.py | 4 ++-- .../panpipes/pipeline_ingest/pipeline.yml | 2 +- .../panpipes/pipeline_preprocess/pipeline.yml | 20 +++++++----------- .../panpipes/pipeline_refmap/pipeline.yml | 21 ------------------- panpipes/python_scripts/assess_background.py | 4 ++-- panpipes/python_scripts/run_scanpyQC_prot.py | 8 +++---- panpipes/python_scripts/run_scanpyQC_rep.py | 6 +++--- panpipes/python_scripts/run_scanpyQC_rna.py | 2 +- 10 files changed, 28 insertions(+), 53 deletions(-) diff --git a/docs/workflows/pipeline_preprocess_preprint.yml b/docs/workflows/pipeline_preprocess_preprint.yml index 8b1488da..c172b00e 100644 --- a/docs/workflows/pipeline_preprocess_preprint.yml +++ b/docs/workflows/pipeline_preprocess_preprint.yml @@ -161,7 +161,7 @@ plotqc: atac_metrics: total_counts rep_metrics: - +# -------------------------------------------------------------------------------------------------------- # RNA Normalisation # -------------------------------------------------------------------------------------------------------- # hvg_flavour options include "seurat", "cell_ranger", "seurat_v3", default; "seurat" @@ -213,8 +213,8 @@ pca: scree_n_pcs: 50 color_by: sample_id - -# Protein (ADT) normalisation +# -------------------------------------------------------------------------------------------------------- +# Protein (PROT) normalisation # -------------------------------------------------------------------------------------------------------- prot: # comma separated string of normalisation options @@ -250,9 +250,9 @@ prot: # do you want to save the prot normalised assay additionally as a txt file: save_norm_prot_mtx: False - +# -------------------------------------------------------------------------------------------------------- # ATAC preprocessing and normalisation -# -------------------------- +# -------------------------------------------------------------------------------------------------------- atac: binarize: False normalize: TFIDF diff --git a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R index 59cffae1..0a86fc3c 100644 --- a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R +++ b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R @@ -146,7 +146,7 @@ if(ncol(tag.mat)>0){ } if(nrow(mat)==0){mat <- tag.mat} -#modify this to deal with AB reads and rna reads, need to split the knee plotting in 2 if you want both rna and ADT knees in 2 different graphs +#modify this to deal with AB reads and rna reads, need to split the knee plotting in 2 if you want both rna and PROT knees in 2 different graphs if(opt$kneeplot){ print("plotting kneeplot") gsub("metrics_summary.csv","raw_feature_bc_matrix" , paths) ->normpath @@ -205,6 +205,6 @@ if(opt$kneeplot){ } -# if ADT exists then repeat for ADT +# if PROT exists then repeat for PROT message("finished pipeline") diff --git a/panpipes/funcs/io.py b/panpipes/funcs/io.py index 7db749b4..6d399925 100644 --- a/panpipes/funcs/io.py +++ b/panpipes/funcs/io.py @@ -482,10 +482,10 @@ def load_mdata_from_multiple_files(all_files_dict): Parameters ---------- all_files_dict: dict - dictionary containing one key per assay out of [RNA, ADT, TCR, BCR, ATAC] + dictionary containing one key per assay out of [RNA, PROT, TCR, BCR, ATAC] and the values for each key is a list of file path and fie type e.g. {"RNA": [filepath, "filetype"], - "ADT: [file path, "filetype"]} + "PROT: [file path, "filetype"]} Filetypes supported for RNA/prot: ["cellranger", "h5ad", "csv_matrix", "txt_matrix", "10X_h5"], Filetypes supported for atac (multiome preferred is 10X_h5) ["10X_h5","cellranger","h5ad"] Filetypes supported for rep: ["cellranger_vdj", "airr", "tracer", "bracer" ] diff --git a/panpipes/panpipes/pipeline_ingest/pipeline.yml b/panpipes/panpipes/pipeline_ingest/pipeline.yml index e3e0366f..581d54e8 100644 --- a/panpipes/panpipes/pipeline_ingest/pipeline.yml +++ b/panpipes/panpipes/pipeline_ingest/pipeline.yml @@ -229,7 +229,7 @@ channel_col: sample_id # It is important to note that in ingest the per channel normalised prot scores are not saved in the mudata object # this is because if you perform feature normalisation (clr normalisation margin 0 or dsb normalisation), # on subsets of cells then the normalised values cannot be simply concatenated. -# The ADT normalisation is rerun pn the complete object in the preprocess pipeline +# The PROT normalisation is rerun pn the complete object in the preprocess pipeline # (or you can run this pipeline with channel_col set as None) # it is important to note, if you choose to run the clr on a per channel basis, then it is not stored in the h5mu file. # if you want to save the per channel normalised values set the following to True: diff --git a/panpipes/panpipes/pipeline_preprocess/pipeline.yml b/panpipes/panpipes/pipeline_preprocess/pipeline.yml index e36a9f3e..7f71f9fe 100644 --- a/panpipes/panpipes/pipeline_preprocess/pipeline.yml +++ b/panpipes/panpipes/pipeline_preprocess/pipeline.yml @@ -16,13 +16,10 @@ condaenv: # allows for tweaking where the jobs get submitted, in case there is a special queue for long jobs or you have access to a gpu # leave as is if you do not want to use the alternative queues -# also rescomp users need to request gpu access. +# -------------------------- # Start # -------------------------- -# either one that exists already with - - sample_prefix: unfiltered_obj: # if running this on prefiltered datat then @@ -31,8 +28,6 @@ unfiltered_obj: #3. put renamed file in the same folder as this yml. #4. set filtering run: to False below. -#TO DO this needs to go i think it doesn't make sense to save this file. - # leave empty if you don't want to save an intermediate muData with normalized expression # stored in X for the RNA object output_logged_mudata: @@ -43,13 +38,14 @@ modalities: rep: False atac: False #spatialT: True - +# -------------------------------------------------------------------------------------------------------- # Filtering # -------------------------- # the filtering process in panpipes is sequential as it goes through the filtering dictionary. # for each modality, starting with rna, it will first filter on obs and then vars. # each modality has a dictionary in the following format. This is fully customisable to any # columns in the mudata.obs or var object. +# To avoid being prescriptive with the covariates names, we allow to provide custom names. # When specifying a column name, make sure it exactly matches the column name in the h5mu object. # rna: # obs: @@ -174,7 +170,7 @@ plotqc: - +# -------------------------------------------------------------------------------------------------------- # RNA Normalisation # -------------------------------------------------------------------------------------------------------- # hvg_flavour options include "seurat", "cell_ranger", "seurat_v3", default; "seurat" @@ -222,8 +218,8 @@ pca: scree_n_pcs: 50 color_by: sample_id - -# Protein (ADT) normalisation +# -------------------------------------------------------------------------------------------------------- +# Protein (PROT) normalisation # -------------------------------------------------------------------------------------------------------- prot: # comma separated string of normalisation options @@ -259,9 +255,9 @@ prot: # do you want to save the prot normalised assay additionally as a txt file: save_norm_prot_mtx: False - +# -------------------------------------------------------------------------------------------------------- # ATAC preprocessing and normalisation -# -------------------------- +# -------------------------------------------------------------------------------------------------------- atac: binarize: False normalize: TFIDF diff --git a/panpipes/panpipes/pipeline_refmap/pipeline.yml b/panpipes/panpipes/pipeline_refmap/pipeline.yml index 55fc8c9f..abe82ce7 100644 --- a/panpipes/panpipes/pipeline_refmap/pipeline.yml +++ b/panpipes/panpipes/pipeline_refmap/pipeline.yml @@ -94,26 +94,5 @@ neighbors: # scanpy | hnsw (from scvelo) method: scanpy -#---------------------- -# wnn seurat params -#---------------------- - -wnn: path_to_wnn_h5 -# instead to a path to a model, in wnn you have to provide a path to a seurat or mudata/anndata object with wnn integration precomputed -wnn_mapping_normalization: SCTransform, LogNormalize -# specify normalization to use for query data, same used to build reference wnn. -normalize: False -# if RNA is pre-normalized specify method above and set normalize to False -# provide here a dictionary of the columns or assays you want to predict as defined here -# https://satijalab.org/seurat/reference/transferdata -# for the param refdata ( https://satijalab.org/seurat/reference/transferdata#:~:text=(%0A%20%20anchorset%2C-,refdata,-%2C%0A%20%20reference%20%3D%20NULL) -# if multiple lines are specified these will be supplied as list. -# write as the example that follows [column_or_data_to_create_in_query]:[name_of_data_in_reference] -refdata: - celltype.l1 : celltype.l1 - celltype.l2 : celltype.l2 - predicted_ADT : ADT - - diff --git a/panpipes/python_scripts/assess_background.py b/panpipes/python_scripts/assess_background.py index 634278f6..21fb3829 100644 --- a/panpipes/python_scripts/assess_background.py +++ b/panpipes/python_scripts/assess_background.py @@ -157,7 +157,7 @@ fig, ax = plt.subplots(nrows=2,ncols=1,figsize=(24,10), facecolor="white") sns.heatmap(bg_df.iloc[:,1:split_int], ax=ax[0]) sns.heatmap(bg_df.iloc[:,split_int:len(top_genes)], ax=ax[1]) - fig.suptitle("mean exprs (raw counts) of ADT in empty drops") + fig.suptitle("mean exprs (raw counts) of PROT in empty drops") fig.tight_layout() else: fig, ax = plt.subplots(figsize=(12,10), facecolor="white") @@ -173,7 +173,7 @@ pnp.plotting.adjust_x_axis(ax[0]) sns.heatmap(plt_df.iloc[split_int:len(top_genes),:], ax=ax[1]) pnp.plotting.adjust_x_axis(ax[1]) - fig.suptitle("mean exprs (raw counts) of ADT in empty drops") + fig.suptitle("mean exprs (raw counts) of PROT in empty drops") fig.tight_layout() else: fig, ax= plt.subplots(figsize=(12,8)) diff --git a/panpipes/python_scripts/run_scanpyQC_prot.py b/panpipes/python_scripts/run_scanpyQC_prot.py index 86630a2c..64a61879 100644 --- a/panpipes/python_scripts/run_scanpyQC_prot.py +++ b/panpipes/python_scripts/run_scanpyQC_prot.py @@ -1,8 +1,8 @@ ''' -scanpy QC script ADT +scanpy QC script PROT order of QC: -- GEX -- ADT +- RNA +- PROT - Repertoire - ATAC ''' @@ -151,7 +151,7 @@ # values on a per channel basis # IS this cause you need the pandas per channel to plot? YES -# the rest of the script is concerned with Per ADT metrics instead of per cell. +# the rest of the script is concerned with Per PROT metrics instead of per cell. # per cell metrics are plotted in plotqc.R diff --git a/panpipes/python_scripts/run_scanpyQC_rep.py b/panpipes/python_scripts/run_scanpyQC_rep.py index 8cf53eb1..e2b12c0a 100644 --- a/panpipes/python_scripts/run_scanpyQC_rep.py +++ b/panpipes/python_scripts/run_scanpyQC_rep.py @@ -1,8 +1,8 @@ ''' -scanpy QC script GEX +scanpy QC script RNA order of QC: -- GEX -- ADT +- RNA +- PROT - Repertoire - ATAC ''' diff --git a/panpipes/python_scripts/run_scanpyQC_rna.py b/panpipes/python_scripts/run_scanpyQC_rna.py index d47972ba..70f37ce8 100644 --- a/panpipes/python_scripts/run_scanpyQC_rna.py +++ b/panpipes/python_scripts/run_scanpyQC_rna.py @@ -2,7 +2,7 @@ scanpy QC script RNA order of QC: - RNA -- ADT +- PROT - Repertoire - ATAC ''' From d35dadf317cf29cb661d9a9dfd1d305ea621b6ad Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 29 Sep 2023 12:24:58 +0200 Subject: [PATCH 22/28] adt --- panpipes/python_scripts/concat_adata.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/panpipes/python_scripts/concat_adata.py b/panpipes/python_scripts/concat_adata.py index 6344b8eb..8d4650ec 100644 --- a/panpipes/python_scripts/concat_adata.py +++ b/panpipes/python_scripts/concat_adata.py @@ -94,7 +94,7 @@ mdata=mdatas[0] del temp elif 'prot' in temp.mod.keys() or 'rna' in temp.mod.keys(): - ## IF GEX and ADT is ok to concatenate ---------- + ## IF GEX and PROT is ok to concatenate ---------- mdata = concat_mdatas(mdatas, batch_key="sample_id", join_type=args.join_type) From 583957a53ff0caf6dc5fdacf466825125aa0781f Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 29 Sep 2023 12:27:59 +0200 Subject: [PATCH 23/28] final name --- docs/workflows/integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/workflows/integration.md b/docs/workflows/integration.md index eee7396b..aaa206bc 100644 --- a/docs/workflows/integration.md +++ b/docs/workflows/integration.md @@ -5,7 +5,7 @@ The panpipes integration pipeline implements a variety of tools to batch correct ![integration_flowchart](../img/integration_coloured.drawio.png) -The flowchart indicates which tools are available for each modality out of GEX (also referred to as RNA), ADT (also referred to as PROT) and ATAC. You can run as many of these tools as you choose and then you can run `panpipes integration make merge_batch_correction` to create a final object containing one reduced dimension representation and one nearest neighbor graph per modality. This can be used as input to the clustering pipeline. +The flowchart indicates which tools are available for each modality out of GEX (also referred to as RNA), PROT (also referred to as ADT) and ATAC. You can run as many of these tools as you choose and then you can run `panpipes integration make merge_batch_correction` to create a final object containing one reduced dimension representation and one nearest neighbor graph per modality. This can be used as input to the clustering pipeline. ## Steps to run: From 847559d07bc3142f1ae40a10e22d9cdbeac5d684 Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 29 Sep 2023 12:30:18 +0200 Subject: [PATCH 24/28] rna to gex --- docs/workflows/integration.md | 2 +- panpipes/R_scripts/produce_barplot_10xmetric.v3.R | 2 +- panpipes/panpipes/pipeline_clustering.py | 6 +++--- panpipes/python_scripts/concat_adata.py | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/workflows/integration.md b/docs/workflows/integration.md index aaa206bc..dc923aa1 100644 --- a/docs/workflows/integration.md +++ b/docs/workflows/integration.md @@ -5,7 +5,7 @@ The panpipes integration pipeline implements a variety of tools to batch correct ![integration_flowchart](../img/integration_coloured.drawio.png) -The flowchart indicates which tools are available for each modality out of GEX (also referred to as RNA), PROT (also referred to as ADT) and ATAC. You can run as many of these tools as you choose and then you can run `panpipes integration make merge_batch_correction` to create a final object containing one reduced dimension representation and one nearest neighbor graph per modality. This can be used as input to the clustering pipeline. +The flowchart indicates which tools are available for each modality out of RNA (also referred to as GEX), PROT (also referred to as ADT) and ATAC. You can run as many of these tools as you choose and then you can run `panpipes integration make merge_batch_correction` to create a final object containing one reduced dimension representation and one nearest neighbor graph per modality. This can be used as input to the clustering pipeline. ## Steps to run: diff --git a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R index 0a86fc3c..e7ccf528 100644 --- a/panpipes/R_scripts/produce_barplot_10xmetric.v3.R +++ b/panpipes/R_scripts/produce_barplot_10xmetric.v3.R @@ -63,7 +63,7 @@ if(!is.null(opt$csvpaths)) { } } - # paths <- gsub("filtered_feature_bc_matrix","metrics_summary.csv",caf$gex_path) + # paths <- gsub("filtered_feature_bc_matrix","metrics_summary.csv",caf$rna_path) paths <- file.path(caf$rna_path, "metrics_summary.csv") # paths <- gsub("\\/$","", paths) diff --git a/panpipes/panpipes/pipeline_clustering.py b/panpipes/panpipes/pipeline_clustering.py index 886b597f..99837875 100644 --- a/panpipes/panpipes/pipeline_clustering.py +++ b/panpipes/panpipes/pipeline_clustering.py @@ -103,7 +103,7 @@ def calc_sm_umaps(infile, outfile, mod, mindist, log_file): --min_dist %(mindist)s """ if mod is not None and mod != "multimodal": - # if this is not specified it will use gex default + # if this is not specified it will use rna default cmd += " --modality %(mod)s" elif mod=="multimodal": if PARAMS['multimodal_integration_method'].lower() == "wnn": @@ -146,7 +146,7 @@ def calc_cluster(infile, outfile, mod, res, alg, log_file): --algorithm %(alg)s """ if mod is not None and mod != "multimodal": - # if this is not specified it will use gex default + # if this is not specified it will use rna default cmd += " --modality %(mod)s" elif mod=="multimodal": if PARAMS['multimodal_integration_method'].lower() == "wnn": @@ -412,7 +412,7 @@ def marker_analysis(fname): # if PARAMS['use_muon']: # cmd += " --use_muon True" # if PARAMS['modality']: -# # if this is not specified it will use gex default +# # if this is not specified it will use rna default # cmd += " --modality %(modality)s" # print(cmd) # P.run(cmd, job_threads=PARAMS['resources_threads_medium']) diff --git a/panpipes/python_scripts/concat_adata.py b/panpipes/python_scripts/concat_adata.py index 8d4650ec..9382541c 100644 --- a/panpipes/python_scripts/concat_adata.py +++ b/panpipes/python_scripts/concat_adata.py @@ -94,7 +94,7 @@ mdata=mdatas[0] del temp elif 'prot' in temp.mod.keys() or 'rna' in temp.mod.keys(): - ## IF GEX and PROT is ok to concatenate ---------- + ## IF RNA and PROT is ok to concatenate ---------- mdata = concat_mdatas(mdatas, batch_key="sample_id", join_type=args.join_type) From 483363d9523561e38b294db0dee759ec6a7624ac Mon Sep 17 00:00:00 2001 From: bio-la Date: Wed, 18 Oct 2023 16:07:02 +0200 Subject: [PATCH 25/28] changes added to changelog.md file --- CHANGELOG.md | 4 +++- panpipes/.DS_Store | Bin 8196 -> 6148 bytes 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0c2dd957..ce52af05 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,8 @@ ### added ### fixed +- changed all instances of ADT into PROT +- changed all instances of GEX to RNA ### dependencies @@ -15,7 +17,7 @@ - Spatial data analysis is now included in panpipes - panpipes qc_spatial - panpipes preprocess_spatial - - panpipes_deconvolution_spatials + - panpipes_deconvolution_spatial ### fixed - make sure columns from individual modalities that are not in the multimodal outer obs can be used to diff --git a/panpipes/.DS_Store b/panpipes/.DS_Store index bf2bd9b83852df2f18b5f9e3e8cbe08f50eab467..1d92038220f8b8453a857c10e6663b21a0d435db 100644 GIT binary patch delta 145 zcmZp1XfcprU|?W$DortDU=RQ@Ie-{MGjdEU6q~50$jCY|U^gS{WFCQLDGr7LhD3%u zAejlosSL$wDaFZ2`T04F8w=yu7PE732r>h;0)YTGkZ=W=va#?x^JIP*O9hZHBLf2y Vlx6|ZKn}=0kP{d-$Mei#1^_qT7KZ=; literal 8196 zcmeHMzfT)682wCBDN?J_7KAc@bZf;@`eW*HMWIqQq|8+aAq1pc?n(;tO`WR#0qm?S zNVFRZe**smJ9O&M_t{QjCl_+5kc!~5WWU4y-r4W@lFMg^NUiQ~tq_%nsEEq={0W+j z!sA>=%8cz<1S;@}y6x6VwGl_jNUU^-1LA-;P+6OB-wI(pUVor+Zue*}~j1X3oiU z*S+-p-Jj29-MQHVP1CbIz4RK3ruhuLtedsAptg>yU5csWD}M7~8m_DP=%TMV_uNrD zQ2j4_Uw6p5v;0n=w+TP?ptpgm?$9$xua5IGUCoyV*S+U6=e(}s>Dj(sz0c1w_W27` z)+nOL-`nslRNffxYWr1r-+P89a0Yvrm*eYL=Y5;a(80U54IOn#d?hb_PiDAUD&P8A zCd}jRPkxrKCty8KGxQwb6E}2v`SSFpSXaZRpM4Eu@9nNWPkhF@dF@iff!lK6k-1Tc z^Z&K+=l{1go!mzp5C`s{1FBH2mRGSbn_G`8k#p?;bpw?P$E5}}1r3LAzfQ-2{eKwZ gI Date: Fri, 20 Oct 2023 16:53:08 +0200 Subject: [PATCH 26/28] typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b30eb0bc..fb4ecc68 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ Created and Maintained by Charlotte Rich-Griffin and Fabiola Curion Additional contributors: Sarah Ouologuem, Devika Agarwal and Tom Thomas -See out [documentation](https://panpipes-pipelines.readthedocs.io/en/latest/) and our [preprint](https://www.biorxiv.org/content/10.1101/2023.03.11.532085v1): +See our [documentation](https://panpipes-pipelines.readthedocs.io/en/latest/) and our [preprint](https://www.biorxiv.org/content/10.1101/2023.03.11.532085v1): Panpipes: a pipeline for multiomic single-cell data analysis Charlotte Rich-Griffin*, Fabiola Curion*, Tom Thomas, Devika Agarwal, Fabian J. Theis, Calliope A. Dendrou. bioRxiv 2023.03.11.532085; From 368b34f4e37eaf8409d1de7a68274b8a533a21f6 Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 20 Oct 2023 16:54:08 +0200 Subject: [PATCH 27/28] typos --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index fb4ecc68..d15eeb54 100644 --- a/README.md +++ b/README.md @@ -12,10 +12,10 @@ doi: https://doi.org/10.1101/2023.03.11.532085 # Introduction -These pipelines use cgat-core pipeline software +These workflows use cgat-core pipeline software -Available pipelines: -1. "ingest" : for the ingestion of data and computation of QC metrics' +Available workflows: +1. "ingest" : for the ingestion of data and computation of QC metrics 2. "preprocess" : for filtering and normalising of each modality 3. "integration" : integrate and batch correction using single and multimodal methods 4. "clustering" : cell clustering on single modalities From ed332efbf8d34d6161b4a104a187ff126358c072 Mon Sep 17 00:00:00 2001 From: bio-la Date: Fri, 20 Oct 2023 16:54:28 +0200 Subject: [PATCH 28/28] added fixes to changelog --- CHANGELOG.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index ce52af05..bbf5f582 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,9 @@ ### fixed - changed all instances of ADT into PROT - changed all instances of GEX to RNA +- changed the params to fix plotting as mentioned in issue #41 +- typo in readme + ### dependencies