This repository contains instructions to create the expected directory structure, pull down a Singularity container, and run a Seurat pipeline that integrates multiple samples into one dataset.
Purpose: facilitate the generation of uniform preliminary analyses across projects and promote an understanding of the computational steps involved in scRNA-seq analysis.
To use this repository, you will need to clone the repo to your working space, pull down a Singularity container with the required software pre-installed, and provide count matrices as input.
- Get the scripts
- Collect count matrices
- Collect the container
- Test the software container
- Load in data and plot QC parameters
- Set thresholds and create a metadata file
- Modify the provided `.sbatch` file and submit the job
There are several ways to clone the repository (including through GitHub Desktop), but for today we will be using the Git command line.
A general note: if working on Alpine, it would be ideal to clone this to a Peta Library allocation, but if that is not available, `Scratch` or `Projects` directories will also work. If using `Scratch`, please note that data is deleted every 90 days, so you will need to complete regular backups; if using `Projects`, storage is limited (250 GB), so there may not be sufficient room to complete the analysis.
For the purposes of today we will be working in `scratch` in a directory called `scrna-analysis`.
#make directory and navigate there
mkdir -p /scratch/alpine/$USER/scrna-analysis/
cd /scratch/alpine/$USER/scrna-analysis/
#clone the repo
git clone https://github.com/dyammons/scrna-scripts.git
#navigate into the repo
cd scrna-scripts
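To confirm the clone succeeded, you can list the repository contents (a quick sanity check; you should see the scripts referenced below):
#confirm the repo contents are present
ls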
For ease of creating the required output directories, the `build_dir.sh` script is provided. This short script will generate the necessary output directories and subfolders for each major cell type.
You can pass multiple arguments to this script, where each argument is the name of a cell subtype.
Note: you can always add more later by rerunning this script.
#run to create dir structure for "allCells"
bash build_dir.sh allCells
#example with more cell types
#bash build_dir.sh allCells tcells bcells
Go up a level and you should now see `input` and `output` in addition to the original `scrna-scripts` directory.
cd ..
ls
#input output scrna-scripts
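If the `tree` utility is available on your system (it is not guaranteed on every cluster), it gives a tidy view of the new layout:
#optional: view the output directory two levels deep
tree -L 2 output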
The directory structure in `output` will look something like this:
output/
├── allCells
│   ├── linDEG
│   └── pseudoBulk
├── cb_input
├── cb_output
├── clustree
├── s1
├── s2
├── s3
├── singleR
└── viln
    └── allCells
You will now need to copy your single-cell count matrices into the `input` directory. Within `input`, each sample should have its own directory containing the corresponding `features.tsv.gz`, `matrix.mtx.gz`, and `barcodes.tsv.gz` files (dir tree below).
input/
├── sample1
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── sample2
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── sample3
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
└── ...
For today's session you can copy prepared count matrices from the shared scratch space:
#navigate to input
cd input
#copy the files
cp -r /scratch/alpine/[email protected]/dump/input/* .
Alternatively, if you are starting from your own Cell Ranger outputs, the steps below gather the matrices for you. First navigate into `input`:
#navigate to input
cd input
Create a string array that contains the sample names.
#indicate path to the directory containing the Cell Ranger output directories (one per sample)
path=/scratch/alpine/$USER/project_scrna_01/02_scripts
#set string array with names of dirs you want to get data from
dirs=$( ls -l $path | grep "^d" | awk '{print $9}' )
declare -a StringArray=($dirs)
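Before copying anything, it is worth echoing the array to confirm it picked up the sample directories you expect:
#print the detected sample names
echo "${StringArray[@]}"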
Copy the data over.
#loop through the array to create sample sub-directories then copy the contents of filtered_feature_bc_matrix
for val in "${StringArray[@]}"; do
    folder="./$val/"
    mkdir -p "$folder"
    filez="$path/$val/outs/filtered_feature_bc_matrix/*"
    cp $filez "$folder"
done
Alternatively, you can save the same code as a script.
Create a script file:
nano getData.sh
Copy in the contents below, then MODIFY the paths as needed for your directory structure.
#!/usr/bin/env bash
###MODIFY as needed!
###Usage: bash getData.sh
###Run this in the input directory (or change the paths in the code as needed).
### User input ###
#indicate path to the directory containing the Cell Ranger output directories (one per sample)
path=/scratch/alpine/$USER/project_01/02_scripts
### END User input ###
### CODE ###
#set string array with names of dirs you want to get data from
dirs=$( ls -l $path | grep "^d" | awk '{print $9}' )
declare -a StringArray=($dirs)
#loop through the array to create sample sub-directories then copy the filtered_feature_bc_matrix
for val in "${StringArray[@]}"; do
    folder="./$val/"
    mkdir -p "$folder"
    filez="$path/$val/outs/filtered_feature_bc_matrix/*"
    cp $filez "$folder"
done
### END CODE ###
Run the script in `input` to copy the files to the required location.
bash getData.sh
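Whichever route you took, confirm that each sample directory now contains the three expected files before moving on (run from inside `input`):
#list the contents of each sample directory
for d in */; do echo "== $d =="; ls "$d"; done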
With the input in place we are nearly ready to get the code running! The next step is to get the Singularity container we will be using to run the script.
So, let's pull it down from Sylabs.
#move into the scripts dir
cd ../scrna-scripts/
#pull down the sif
singularity pull --arch amd64 library://dyammons/r-env/r4.3.1-seurat:v1
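If the pull succeeds, the image file will appear in the current directory; `singularity inspect` will print its metadata if you want to double-check what you downloaded:
#confirm the image is present
ls -lh r4.3.1-seurat_v1.sif
#optionally view the container metadata
singularity inspect r4.3.1-seurat_v1.sif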
If the pull fails, try running the following and then retry the command above.
#establish the connection to syslabs
apptainer remote add --no-login SylabsCloud cloud.sylabs.io
apptainer remote use SylabsCloud
export APPTAINER_CACHEDIR=/scratch/alpine/$USER/cache/
export APPTAINER_TMPDIR=/scratch/alpine/$USER/tmp/
export SINGULARITY_CACHEDIR=/scratch/alpine/$USER/cache/
export SINGULARITY_TMPDIR=/scratch/alpine/$USER/tmp/
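The exported cache and tmp paths must be writable directories; creating them up front avoids errors on some systems (a small precaution, not always strictly required):
#create the cache and tmp directories if they do not already exist
mkdir -p /scratch/alpine/$USER/cache/ /scratch/alpine/$USER/tmp/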
If all of the above fails, you can copy the `.sif` from my scratch space.
#move into the scripts dir
cd ../scrna-scripts/
#copy the sif
cp /scratch/alpine/[email protected]/scrna-analysis-done/scrna-scripts/r4.3.1-seurat_v1.sif .
Let's make sure we can enter the container and that the software is accessible for our use.
To do this we will launch a shell inside the container. This is very similar to what `conda activate env` does, if you are familiar with `conda`.
#it is important to bind (-B) a directory at least 1 level up from the scripts folder
singularity shell -B $PWD/../ r4.3.1-seurat_v1.sif
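Once the shell starts, the prompt will change (typically to `Singularity>`). A quick way to confirm the expected software is on the PATH:
#confirm R is available inside the container
which R
R --version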
While in the container we have access to all the software. So, let's launch an `R` session to ensure we can `source` the `customFunctions.R` file that will be key to running the code.
R
source("./customFunctions.R")
If all the packages load without a problem, then we are good to move forward!
Since we are already in the container, let's run the code to generate the QC parameters so we can set thresholds for the pipeline.
load10x(din = "../input/", dout = "../output/s1/", outName = "qc_test", testQC = T)
#expected console output: Saving 7 x 7 in image
We can now use our file navigator panel to inspect the QC plots (`../output/s1`).
Now we can view the files and decide on thresholds.
I recommend erring on the side of caution and setting them permissively, as we can always go back and increase the stringency later on.
We will code in the thresholds by opening the `script1.R` file and customizing the MODIFY section of the script.
An excerpt is provided here.
######### MODIFY #########
#set output name
experiment <- "pbmc_analysis_20231129"
outName <- "allCells"
contrast <- c("Osteosarcoma", "Healthy") #first term VS second term
#set QC thresholds
nFeature_RNA_high <- 5500
nFeature_RNA_low <- 100
percent.mt_high <- 10
nCount_RNA_high <- 30000
nCount_RNA_low <- 200
########## END MODIFY #########
Lastly, we will enter some metadata that will be used to colorize the samples and to load short sample names.
To do this we will open `./metaData/refColz.csv` in a text editor and modify it as desired.
- `orig.ident` values should exactly match the sample names as defined in the `input` sub-directories
- `name` values can be anything you want, typically a shorthand for the sample name
- delete extra/unused rows
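For illustration only (these sample names and shorthands are hypothetical, and the provided file may contain additional columns, such as colors, that you can leave as-is), an edited `refColz.csv` might begin like this:
orig.ident,name
sample1,healthy_1
sample2,healthy_2
sample3,os_1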
Once the values are entered in the R script and the metadata is entered, we are ready to run the preliminary script. So, let's exit the container and prepare the `.sbatch` file.
#quit the R session
q()
n
#leave the container
exit
Open `cute_seurat.sbatch` in a text editor and modify it as desired.
Key parts to modify are:
- `ntasks`: the current default is set to 10. This worked well for 6 samples; you may need to scale up if running more samples
- `time`: 2 hours should be good, but if running > 10 samples you may want to increase it
- `mail-user`: change this to your email so I don't get a notification that you ran a job (unless you want me to know)
For reference, here is the full script:
#!/usr/bin/env bash
#SBATCH --job-name=seu_prelim
#SBATCH --ntasks=10 # 10 worked well for 6 samples with ~5k cells each, scale up if more samples
#SBATCH --nodes=1 # this script is designed to run on one node
#SBATCH --time=02:00:00 # set time; default = 4 hours
#SBATCH --partition=amilan # modify this to reflect which partition you want to use
#SBATCH --qos=normal # modify this to reflect which qos you want to use. Options are 'normal' and 'testing'
#SBATCH --mail-type=END # Keep these two lines of code if you want an e-mail sent to you when it is complete.
#SBATCH [email protected] ### change to your email ###
#SBATCH --output=seu_prelim_%j.log #modify as desired - will output a log file where the "%j" inserts the job ID number
######### Instructions ###########
#remove any loaded software
module purge
#run R script
singularity exec -B $PWD/../ r4.3.1-seurat_v1.sif Rscript script1.R
#submit the job
sbatch cute_seurat.sbatch
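After submitting, you can track progress with standard Slurm tools and the log file named in the `--output` line (replace the job ID placeholder with the number `sbatch` prints):
#check job status
squeue -u $USER
#follow the log as it is written
tail -f seu_prelim_<jobID>.log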
The job should be completed in 1-3 hours depending on the number of samples you are integrating.
Questions? Submit an issue or reach out to Dylan Ammons directly.