Fc sparsedocsfix #146

Merged · 13 commits · Dec 1, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -33,6 +33,7 @@
- fixed lsi requirement for atac
- fixed top features for atac
- fixed filtering HVG for rna
- moved pynndescent to PyPI dependencies


### dependencies
86 changes: 40 additions & 46 deletions docs/install.md
@@ -1,17 +1,18 @@

# Installation of panpipes

### Create virtual environment

We recommend running panpipes within a virtual environment to maintain reproducibility.


### Option 1: create conda environment (Recommended)

To run panpipes, we install it in a conda environment with R and python.
Panpipes has a lot of dependencies, so you may want to consider the faster [`mamba`](https://mamba.readthedocs.io/en/latest/index.html) instead of `conda` for installation.

```
conda config --add channels conda-forge
conda config --set channel_priority strict
# you should remove the strict priority afterwards!
@@ -24,52 +25,29 @@ now we activate the environment
conda activate pipeline_env
```

This follows the suggestions made here: [https://www.biostars.org/p/498049/](https://www.biostars.org/p/498049/)
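
For reference, the environment-creation step that precedes `conda activate pipeline_env` might look like this (a minimal sketch — the environment name comes from the activate command above, but the pinned python version is an assumption; see the full docs for the recommended versions):

```
conda create --name pipeline_env python=3.10
```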

Let's first install the R packages:
```
conda install -c conda-forge r-tidyverse r-optparse r-ggforce r-ggraph r-xtable r-hdf5r r-clustree
```

Then we can install panpipes:

#### 1. Installing panpipes from PyPI

You can install `panpipes` directly from `PyPI` with:

```
pip install panpipes
```

If you intend to use panpipes for spatial analysis, instead install:
```
pip install 'panpipes[spatial]'
```
The `[spatial]` extra includes the squidpy and cell2location packages.



#### 2. Nightly versions of panpipes

If you would prefer to use the most recent dev version, install from GitHub:

@@ -79,9 +57,25 @@ cd panpipes
pip install -e .
```

------------

Panpipes requires the unix package `time`.
You can check whether it is installed with `dpkg-query -W time`. If `time` is not already installed, you can install it with

```
conda install time
```
or

```
apt-get install time
```



### Option 2: python venv environment

Navigate to where you want to create your virtual environment and follow the steps below to create a pip virtual environment

```
python3 -m venv --prompt=panpipes python3-venv-panpipes/
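# then activate it (a sketch; adjust the path if you created the venv elsewhere)
source python3-venv-panpipes/bin/activate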
@@ -98,19 +92,21 @@ As explained in the conda installation, you can install `panpipes` with:
```
pip install panpipes
```
or install a nightly version of panpipes by cloning the GitHub repo.

#### R packages installation in python venv

If you are using a venv virtual environment, the pipeline will call a local R installation, so make sure R is installed and install the required packages with the command we provide below.
(The `panpipes install_r_dependencies` command requires that you specify a CRAN mirror in your `.Rprofile`.)
For example, add this line to your `.Rprofile` to automatically fetch the preferred mirror:

*Remember to customise with your preferred [R mirror](https://cran.r-project.org/mirrors.html).*

```
options(repos = c(CRAN="https://cran.uni-muenster.de/"))
```

Now, to automatically install the R dependencies, run:

```
panpipes install_r_dependencies
@@ -131,13 +127,11 @@ A list of available pipelines should appear!


You're all set to run `panpipes` on your local machine.
If you want to configure it on an HPC server, follow the next instructions.

## Pipeline configuration for HPC clusters
(For SGE or SLURM clusters)
*Note: You only need this configuration step if you want to use an HPC to dispatch individual tasks as separate parallel jobs. You won't need this for a local installation of panpipes.*

Create a yml file for the cgat-core pipeline software to read:
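
As an illustration, a minimal `.cgat.yml` for a SLURM cluster might look like this (a sketch based on cgat-core's cluster configuration; the queue name is a placeholder for one on your cluster):

```
cluster:
    queue_manager: slurm
    queue: main
```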

@@ -189,7 +183,7 @@ echo "export DRMAA_LIBRARY_PATH=$PATH_TO/libdrmaa.so.1.0" >> ~/.bashrc
```

### Specifying Conda environments to run panpipes
If using conda environments, you can use one single big environment (the instructions provided do just that) or create one for each of the workflows in panpipes (i.e. one workflow = one environment).
The environment(s) should be specified in the `.cgat.yml` global configuration file or in each workflow's `pipeline.yml` configuration file, and will be picked up by the pipeline as the default environment.
Please note that a conda environment specified in a workflow's configuration file takes precedence when running that pipeline.
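
For example, a sketch of how the environment could be referenced (the `condaenv` key and path here are assumptions — check your pipeline.yml template for the exact field):

```
condaenv: /path/to/miniconda3/envs/pipeline_env
```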

1 change: 1 addition & 0 deletions docs/release_notes.md
@@ -1,2 +1,3 @@
Release Notes
==============

3 changes: 2 additions & 1 deletion docs/tutorials/index.md
@@ -1,7 +1,7 @@
Tutorials
==========

Check out the following tutorials, which take you through common single-cell multimodal analysis steps with Panpipes:


- [Ingest workflow](https://panpipes-tutorials.readthedocs.io/en/latest/ingesting_data/Ingesting_data_with_panpipes.html)
@@ -21,4 +21,5 @@ Spatial analysis:
Additional tutorials:

- [Ingesting multiome from cellranger outputs](https://panpipes-tutorials.readthedocs.io/en/latest/ingesting_multiome/ingesting_mome.html)
- [Ingesting mouse data](https://panpipes-tutorials.readthedocs.io/en/latest/ingesting_mouse/Ingesting_mouse_data_with_panpipes.html)

2 changes: 1 addition & 1 deletion docs/usage/general_principles.md
@@ -92,4 +92,4 @@ When it's completed, you will find a message informing you it's done, like this

## Final notes

All panpipes workflows follow these general principles, with specific custom parameters and input files for each workflow. See the [Workflows](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/index.html) section for detailed info on each workflow and check out our [Tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html) for more examples.
17 changes: 10 additions & 7 deletions docs/workflows/preprocess.md
@@ -4,18 +4,20 @@ Preprocessing

## Pipeline steps

The preprocess pipeline filters the data as defined in the [filtering dictionary](../usage/filter_dict_instructions.md) section of the `pipeline.yml`. The data can also be downsampled to a defined number of cells.
Then each modality is normalised and scaled. For RNA, this means normalising counts per cell with [scanpy.pp.normalize_total](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html) and, optionally, regressing out covariates and scaling the data using scanpy functions. Highly variable genes (HVGs) are also calculated, and a PCA is performed on those highly variable genes. There is an option to exclude specific genes from the HVGs, e.g. HLA genes or BCR/TCR genes. These are specified in the same way as all [gene lists](../usage/gene_list_format). In the example below, the "group" in the gene list file is "exclude".
```
hvg:
  exclude_file: resources/qc_genelist_1.0.csv
  exclude: "exclude"
```
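
For instance, the exclude file referenced above could contain rows like these (an illustrative sketch assuming the mod/feature/group column layout described in the gene list format docs):

```
mod,feature,group
rna,HLA-A,exclude
rna,HLA-B,exclude
rna,HLA-C,exclude
```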

For the protein assay, the data are normalised either by centred log-ratio (CLR) or by dsb, as described in the muon documentation [here](https://muon.readthedocs.io/en/latest/omics/citeseq.html). There is additional panpipes functionality to trim dsb outliers, as discussed on the dsb [github page](https://github.com/niaid/dsb/issues/9). Note that dsb can only be run if the input data contains raw counts (the cellranger outs folder).
PCA is then performed on the protein data; the number of components can be specified, and it is automatically adjusted to `n_vars - 1` when `n_pcs > n_vars`.
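
For orientation, the underlying muon calls look roughly like this (a sketch of the library API, not of how panpipes invokes it internally; `mdata_raw` stands for the unfiltered cellranger counts):

```
import muon as mu

# CLR normalisation acts on the protein modality alone
mu.prot.pp.clr(mdata["prot"])

# dsb needs the raw (unfiltered) counts alongside the filtered object
mu.prot.pp.dsb(mdata, mdata_raw)
```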


For the ATAC assay, the data are normalised either by standard normalisation or with one of the included TFIDF flavours (see [normalization](https://panpipes-pipelines.readthedocs.io/en/latest/usage/normalization_methods.html)).
Then, dimensionality reduction is computed, either LSI or PCA, with a custom-defined number of components.
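
Roughly, the corresponding muon operations are the following (a sketch; panpipes exposes these choices through the pipeline.yml rather than through direct calls):

```
from muon import atac as ac

ac.pp.tfidf(mdata["atac"])  # TFIDF normalisation
ac.tl.lsi(mdata["atac"])    # LSI dimensionality reduction
```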


## Steps to run
@@ -25,18 +27,19 @@
``panpipes preprocess config``
2. edit the pipeline.yml file

- The filtering options are dynamic depending on your `ingest` inputs. This is described [here](../usage/filter_dict_instructions.md)
- There are lots of options for normalisation explained in the
  pipeline.yml and in [normalization](https://panpipes-pipelines.readthedocs.io/en/latest/usage/normalization_methods.html);
  check the one that works for your data

3. Run the complete preprocess pipeline with
``panpipes preprocess make full``

The h5mu file output by ``preprocess`` is filtered and normalised, and
for rna and atac highly variable genes are computed.


## Expected structure of MuData object
The ideal way to run `panpipes preprocess` is to use the output mudata file from `panpipes ingest`, as this will make sure the MuData object has correctly named layers and slots.

The bare minimum MuData object requires raw data in the X slot of each modality, a sample_id column in the .obs slot of each modality, and in the common (outer) obs.
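
A minimal object satisfying this could be built as follows (an illustrative sketch with toy dimensions):

```
import numpy as np
import anndata as ad
import mudata as md

# raw counts in X and a sample_id column in .obs for each modality
rna = ad.AnnData(X=np.random.poisson(1.0, (100, 2000)).astype(np.float32))
prot = ad.AnnData(X=np.random.poisson(1.0, (100, 30)).astype(np.float32))
rna.obs["sample_id"] = "sample1"
prot.obs["sample_id"] = "sample1"

mdata = md.MuData({"rna": rna, "prot": prot})
mdata.obs["sample_id"] = "sample1"  # the common (outer) obs

mdata.write("unfiltered.h5mu")
```
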
Binary file modified panpipes/.DS_Store
Binary file not shown.
5 changes: 4 additions & 1 deletion panpipes/entry.py
@@ -36,7 +36,10 @@ def main(argv=None):
'3. "integration" : integrate and batch correction using single and multimodal methods',
'4. "clustering" : cell clustering on single modalities',
'5. "refmap" : transfer scvi-tools models from published data to your data',
'6. "vis" : visualise metrics from other pipelines in context of experiment metadata']
'6. "vis" : visualise metrics from other pipelines in context of experiment metadata',
'7. "qc_spatial" : for the ingestion of spatial transcriptomics (ST) data',
'8. "preprocess_spatial" : for filtering and normalizing ST data',
'9. "deconvolution_spatial" : for the cell type deconvolution of ST slides']
print(*pipelines_list, sep="\n")
return
command = argv[1]
2 changes: 1 addition & 1 deletion panpipes/panpipes/pipeline_preprocess/pipeline.yml
@@ -264,7 +264,7 @@ prot:
# note that this feature is in the default muon mu.pp.dsb code, but manually implemented in this code.
quantile_clipping: True

# which normalisation method to store in the X slot. If you run more than one normalisation method,
# specify which one to store in the X slot; if not specified, 'dsb' is the default when run.
store_as_X:

6 changes: 3 additions & 3 deletions panpipes/python_scripts/run_scanpyQC_prot.py
@@ -95,10 +95,10 @@
per_cell_metrics = args.per_cell_metrics.split(",")
per_cell_metrics = [a.strip() for a in per_cell_metrics]

# TODO: What happens if it is None?


# work out if we already have isotype column, if not try to infer from index.
if 'isotype' not in prot.var.columns:
# this means that isotype column was not included in the protein conversion table
# so we are going to have a whack at identifying them
@@ -123,7 +123,7 @@
percent_top=None, log1p=True, inplace=True)

## let's assess the isotype outlier cells.
#(Cells with an excessive amount of isotype indicating stickiness)
if (len(isotypes) > 0) & check_for_bool(args.identify_isotype_outliers):
L.info("identifying isotype outliers")
# this means we found some isotypes earlier
1 change: 1 addition & 0 deletions pyproject.toml
@@ -49,6 +49,7 @@ dependencies = [
"paramiko",
"pep8",
"pysam",
"pynndescent",
"pytest",
"pyyaml",
"ruffus",