-
Notifications
You must be signed in to change notification settings - Fork 10
Smart seq2 scRNA seq tutorial
Before starting, you may want to check:
- How to install Eoulsan;
- If you have a functional installation of Docker. Docker allows us easily deploy complex packages like Bioconductor in a reproducible way;
- I use Eoulsan for the first time
Processing SMART-seq2 data (from read filtering to gene expression counting) is similar to bulk RNA-seq analysis. The normalization and cell quality control steps as well as the downstream "biological" analysis require methods specific to scRNA-seq. These methods are shared with the 10x Genomics Chromium data analysis (as of now [12th June, 2018], the 10x-derived expression files aren't compatible yet with the tools implemented in Eoulsan, downstream analysis should be done directly in R).
There are some requirements to run a full workflow (you can download those files by following the instructions in this tutorial). You will need:
- A dataset (FASTQ files) ;
- A reference genome (FASTA file) ;
- A transcriptome annotation (GFF, GFF3, or GTF file) ;
- A gene annotation file (TSV file) ;
- A workflow file (XML file) ;
- A design file (TXT file).
Here is a concise view of the steps of the workflow:
Step | Module name | Input(s) | Output(s) |
---|---|---|---|
Samples' quality check with FastQC | fastqc | FASTQ | HTML |
Filter raw FASTQ files | filterreads | FASTQ | FASTQ |
Map the reads using STAR | mapreads | FASTQ | SAM |
Remove unmapped and multimapped reads | filtersam | SAM | SAM |
Count alignments with HTSeq-count | expression | SAM | TSV |
Aggregate quality checking with MultiQC | multiqc | LOG | HTML |
Create a Bioconductor SingleCellExperiment Object* | rsinglecellexperimentcreator | TSV (matrix) + TSV (gene annotation) | RDS |
# Step 1: download the data
### FASTQ files
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/scrnaseq-smartseq2/smartseq2.tar
tar -xvf smartseq2.tar
### Reference genome
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.fasta.bz2
### Annotation files
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gff.bz2
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gtf.bz2
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.tsv.bz2
# Step 2: create the design file
eoulsan.sh createdesign *.fastq.bz2 mm10ens91.fasta.bz2 mm10ens91.gff.bz2 mm10ens91.gtf.bz2 mm10ens91.tsv.bz2
# Step 3: create the workflow file or use/modify one
wget https://github.com/GenomicParisCentre/eoulsan/wiki/files/workflow-scrnaseq-smartseq2.xml
# Step 4: launch the analysis
eoulsan.sh exec workflow-local.xml design.txt
In this section, some sample files are provided to test an Eoulsan single-cell RNA-seq workflow on SMART-seq2 data. The data corresponds to mouse dendritic cells (including dendritic cells' precursors, precursors common to macrophages and dendritic cells). To download it:
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/scrnaseq-smartseq2/smartseq2.tar
$ tar -xvf smartseq2.tar
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gff.bz2
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gtf.bz2
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.tsv.bz2
For clarity and less clutter, the reads are stored in a separate folder (named "data" here).
You can create a design file with the following command:
$ eoulsan.sh createdesign -p data/*.fastq.gz genome.fasta.bz2 annotation.gff.bz2 annotation.gtf.bz2
You can modify the design.txt
file to add additional information about known cell types or markers.
Note: Eoulsan automatically handles compressed files.
To create a workflow file, the best solution is to reuse an existing one and adapt it to your needs. You can copy the workflow from the 10x Genomics Demo. You can download it with the following command:
$ wget https://github.com/ComputationalSystemsBiology/Single-cell-RNA-seq/blob/master/Demo/10xGenomics/workflow.xml
The workflow file contains the list of steps executed by Eoulsan. Each step has parameters and is related to a module to be executed. For each step you can change, add or remove parameters as well as inputs (although Eoulsan takes as input the previous step's output). Parameters are specific to a module, check the documentation of the built-in modules for the list of available parameters.
As we want handling exon-exon junctions while mapping reads, we need to create a custom STAR index.
<!-- Create a custom STAR index using feature annotation -->
<step skip="false">
<module>starindexgenerator</module>
<parameters>
<!-- The overhang value must be greater than the length of the reads -->
<parameter>
<name>overhang</name>
<value>100</value>
</parameter>
<!-- Use a feature annotation file -->
<parameter>
<name>use.gtf.file</name>
<value>true</value>
</parameter>
<!-- The format of feature file is GFF -->
<parameter>
<name>features.file.format</name>
<value>gff</value>
</parameter>
<!-- The name of the feature for exon is "exon" in the feature annotation file -->
<parameter>
<name>gtf.feature.exon</name>
<value>exon</value>
</parameter>
<!-- For each exon feature we can get the transcript ID using the "Parent" tag -->
<parameter>
<name>gtf.tag.exon.parent.transcript</name>
<value>Parent</value>
</parameter>
</parameters>
</step>
See FastQC.
<!-- FastQC of raw reads -->
<step id="step1rawfastqc" skip="false">
<module>fastqc</module>
</step>
<!-- Filter reads -->
<step id="step2filterread" skip="false" discardoutput="asap">
<module>filterreads</module>
<parameters>
<!-- Remove all the read that do not pass Illumina QC -->
<parameter>
<name>illuminaid</name>
<value></value>
</parameter>
<!-- Remove polyN tails of the reads -->
<parameter>
<name>trimpolynend</name>
<value></value>
</parameter>
<!-- Remove reads with length lower than 40 -->
<parameter>
<name>length.minimal.length.threshold</name>
<value>40</value>
</parameter>
<!-- Remove reads with mean QScore < 30 -->
<parameter>
<name>quality.threshold</name>
<value>30</value>
</parameter>
</parameters>
</step>
<!-- Mapping of the reads -->
<step id="step3mapreads" skip="false" discardoutput="asap">
<module>mapreads</module>
<parameters>
<!-- Use STAR as mapper -->
<parameter>
<name>mapper</name>
<value>star</value>
</parameter>
<!-- The version of STAR to use is 2.5.2b -->
<parameter>
<name>mapper.version</name>
<value>2.5.2b</value>
</parameter>
<!-- Additional STAR parameter (keep unmapped reads in SAM output) -->
<parameter>
<name>mapper.arguments</name>
<value>--outSAMunmapped Within</value>
</parameter>
</parameters>
</step>
<!-- Filtering of the SAM alignments -->
<step id="step4filtersam" skip="false" discardoutput="asap">
<module>filtersam</module>
<parameters>
<!-- Remove the unmapped entries in the files -->
<parameter>
<name>removeunmapped</name>
<value>true</value>
</parameter>
<!-- Remove the multimatches alignments -->
<parameter>
<name>removemultimatches</name>
<value>true</value>
</parameter>
</parameters>
</step>
<!-- Convert SAM to BAM file and sort the BAM file by coordinate -->
<step id="step5sam2bam" skip="false">
<module>sam2bam</module>
<parameters>
<!-- Compression level of the BAM file -->
<parameter>
<name>compression.level</name>
<value>5</value>
</parameter>
</parameters>
</step>
<!-- Compute expression -->
<step id="step6expression" skip="false">
<module>expression</module>
<parameters>
<!-- We use the fast Java implementation of HTSeq-count to perform the counting -->
<parameter>
<name>counter</name>
<value>htseq-count</value>
</parameter>
<!-- The input feature annotation file is in GTF format -->
<parameter>
<name>features.file.format</name>
<value>gtf</value>
</parameter>
<!-- The feature type to use is "exon", we count by exon -->
<parameter>
<name>genomic.type</name>
<value>exon</value>
</parameter>
<!-- We aggregate the counts by genes -->
<parameter>
<name>attribute.id</name>
<value>gene_id</value>
</parameter>
<!-- The library was oriented -->
<parameter>
<name>stranded</name>
<value>yes</value>
</parameter>
<!-- Use "union mode" as overlap mode -->
<parameter>
<name>overlap.mode</name>
<value>union</value>
</parameter>
<!-- We do not remove ambiguous cases -->
<parameter>
<name>remove.ambiguous.cases</name>
<value>false</value>
</parameter>
</parameters>
</step>
<!-- MultiQC -->
<step id="step7multiqc" skip="false">
<module>multiqc</module>
<parameters>
<!-- MultiQC will contain reports from FastQC, STAR and HTSeq-count -->
<parameter>
<name>reports</name>
<value>fastqc,mapreads,expression</value>
</parameter>
<!-- Docker is required to launch this step -->
<parameter>
<name>use.docker</name>
<value>true</value>
</parameter>
</parameters>
</step>
This step allow to create a RDS file that contains a Bioconductor SingleCellExperiment object. This object contains the matrices generated by the step 3.11 and the cell and gene annotations.
The cell annotations are defined in the design file. Users can add many columns to describe their cells. In this step, only the annotation which column names start with a prefix (e.g. "Cell.") will be added to the SingleCellExperiment object.
The gene annotations are defined in the mm10ens91.tsv.bz2
file.
<!-- Create a SingleCellExperiment Bioconductor Object in a RDS file -->
<step id="step8singlecellexperiment" skip="false">
<module>rsinglecellexperimentcreator</module>
<parameters>
<parameter>
<name>input.matrices</name>
<value>true</value>
</parameter>
<parameter>
<name>design.prefix</name>
<value>Cell.</value>
</parameter>
<parameter>
<name>r.execution.mode</name>
<value>docker</value>
</parameter>
</parameters>
</step>
Once your design file and workflow file are ready, you can launch Eoulsan analysis with the following command:
eoulsan.sh exec workflow-scrnaseq-smartseq2.xml design.txt
Before starting the analysis, Eoulsan will check if:
- the design file and workflow file are valid
- the input files are valid (fasta and annotation files)
- modules called in the workflow exist
- the order of the steps is correct (in terms of inputs / outputs needed for each step)
- a genome mapper index is needed. In this case, a step to generate this index is automatically added.
If successful, you will obtain a new directory named eoulsan-YearMonthDay-Time
with log files about the analysis. Result files are stored in the current directory.
sce <- readRDS('step8singlecellexperiment_output/step8singlecellexperiment_output_sce_matrix.rds')
summary(sce)