Skip to content

Smart seq2 scRNA seq tutorial

LehmannN edited this page Sep 29, 2021 · 5 revisions

Before starting, you may want to check:

Processing SMART-seq2 data (from read filtering to gene expression counting) is similar to bulk RNA-seq analysis. The normalization and cell quality control steps as well as the downstream "biological" analysis require methods specific to scRNA-seq. These methods are shared with the 10x Genomics Chromium data analysis (as of now [12th June, 2018], the 10x-derived expression files aren't compatible yet with the tools implemented in Eoulsan, downstream analysis should be done directly in R).

There are some requirements to run a full workflow (you can download those files by following the instructions in this tutorial). You will need:

  • A dataset (FASTQ files) ;
  • A reference genome (FASTA file) ;
  • A transcriptome annotation (GFF, GFF3, or GTF file) ;
  • A gene annotation file (TSV file) ;
  • A workflow file (XML file) ;
  • A design file (TXT file).

Here is a concise view of the steps of the workflow:

Step Module name Input(s) Output(s)
Samples' quality check with FastQC fastqc FASTQ HTML
Filter raw FASTQ files filterreads FASTQ FASTQ
Map the reads using STAR mapreads FASTQ SAM
Remove unmapped and multimapped reads filtersam SAM SAM
Count alignments with HTSeq-count expression SAM TSV
Aggregate quality checking with MultiQC multiqc LOG HTML
Merge expression matrices (1 per cell) into one "genes x cells" matrix matrixcreator TSV TSV
Extract cells' metadata reducer TXT TSV
Extract features' metadata featuresmetadataextractor GFF, GFF3, or GTF TSV
Filter bad quality cells saturation-filtering TSV TSV
Normalization (no spike-ins) scpoolednormalization TSV TSV

Short version

# Step 1: download the data

### FASTQ files
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/scrnaseq-smartseq2/smartseq2.tar
tar -xvf smartseq2.tar

### Reference genome
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.fasta.bz2

### Annotation files
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gff.bz2
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gtf.bz2
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.tsv.bz2

# Step 2: create the design file
eoulsan.sh createdesign *.fastq.bz2 mm10ens91.fasta.bz2 mm10ens91.gff.bz2 mm10ens91.gtf.bz2 mm10ens91.tsv.bz2

# Step 3: create the workflow file or use/modify one
wget https://github.com/GenomicParisCentre/eoulsan/wiki/files/workflow-scrnaseq-smartseq2.xml

# Step 4: launch the analysis
eoulsan.sh exec workflow-local.xml design.txt

Long version

Step 1: downloading the data

In this section, some sample files are provided to test an Eoulsan single-cell RNA-seq workflow on SMART-seq2 data. The data corresponds to mouse dendritic cells (including dendritic cells' precursors, precursors common to macrophages and dendritic cells). To download it:

$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/scrnaseq-smartseq2/smartseq2.tar
$ tar -xvf smartseq2.tar
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gff.bz2
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gtf.bz2
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.tsv.bz2

For clarity and less clutter, the reads are stored in a separate folder (named "data" here).

Step 2: create the design file

You can create a design file with the following command:

$ eoulsan.sh createdesign -p data/*.fastq.gz genome.fasta.bz2 annotation.gff.bz2 annotation.gtf.bz2

You can modify the design.txt file to add additional information about known cell types or markers. Note: Eoulsan automatically handles compressed files.

Step 3: create the workflow file

To create a workflow file, the best solution is to reuse an existing one and adapt it to your needs. You can copy the workflow from the 10x Genomics Demo. You can download it with the following command:

$ wget https://github.com/ComputationalSystemsBiology/Single-cell-RNA-seq/blob/master/Demo/10xGenomics/workflow.xml

The workflow file contains the list of steps executed by Eoulsan. Each step has parameters and is related to a module to be executed. For each step you can change, add or remove parameters as well as inputs (although Eoulsan takes as input the previous step's output). Parameters are specific to a module, check the documentation of the built-in modules for the list of available parameters.

Step 3.0: Create STAR index

As we want handling exon-exon junctions while mapping reads, we need to create a custom STAR index.

        <!-- Create a custom STAR index using feature annotation -->
        <step skip="false">
            <module>starindexgenerator</module>
            <parameters>

                <!-- The overhang value must be greater than the length of the reads -->
                <parameter>
                    <name>overhang</name>
                    <value>100</value>
                </parameter>

                <!-- Use a feature annotation file -->
                <parameter>
                    <name>use.gtf.file</name>
                    <value>true</value>
                </parameter>

                <!-- The format of feature file is GFF -->
                <parameter>
                    <name>features.file.format</name>
                    <value>gff</value>
                </parameter>

                <!-- The name of the feature for exon is "exon" in the feature annotation file -->
                <parameter>
                    <name>gtf.feature.exon</name>
                    <value>exon</value>
                </parameter>

                <!-- For each exon feature we can get the transcript ID using the "Parent" tag -->
                <parameter>
                    <name>gtf.tag.exon.parent.transcript</name>
                    <value>Parent</value>
                </parameter>
            </parameters>
        </step>

Step 3.1: FastQC

See FastQC.

        <!-- FastQC of raw reads -->
        <step id="step1rawfastqc" skip="false">
            <module>fastqc</module>
        </step>

Step 3.2: Filter reads

        <!-- Filter reads -->
        <step id="step2filterread" skip="false" discardoutput="asap">
            <module>filterreads</module>
            <parameters>
                
                <!-- Remove all the read that do not pass Illumina QC -->
                <parameter>
                    <name>illuminaid</name>
                    <value></value>
                </parameter>
                
                <!-- Remove polyN tails of the reads -->
                <parameter>
                    <name>trimpolynend</name>
                    <value></value>
                </parameter>
                
                <!-- Remove reads with length lower than 40 -->
                <parameter>
                    <name>length.minimal.length.threshold</name>
                    <value>40</value>
                </parameter>
                
                <!-- Remove reads with mean QScore < 30 -->
                <parameter>
                    <name>quality.threshold</name>
                    <value>30</value>
                </parameter>

            </parameters>
        </step>

Step 3.3: Map the reads to the reference genome

        <!-- Mapping of the reads -->
        <step id="step3mapreads" skip="false" discardoutput="asap">
            <module>mapreads</module>
            <parameters>

                <!-- Use STAR as mapper -->
                <parameter>
                    <name>mapper</name>
                    <value>star</value>
                </parameter>

                <!-- The version of STAR to use is 2.5.2b -->
                <parameter>
                    <name>mapper.version</name>
                    <value>2.5.2b</value>
                </parameter>

                <!-- Additional STAR parameter (keep unmapped reads in SAM output) -->
                <parameter>
                    <name>mapper.arguments</name>
                    <value>--outSAMunmapped Within</value>
                </parameter>
            </parameters>
        </step>

Step 3.4: Filter mapping

        <!-- Filtering of the SAM alignments -->
        <step id="step4filtersam" skip="false" discardoutput="asap">
            <module>filtersam</module>
            <parameters>

                <!-- Remove the unmapped entries in the files -->
                <parameter>
                    <name>removeunmapped</name>
                    <value>true</value>
                </parameter>

                <!-- Remove the multimatches alignments -->
                <parameter>
                    <name>removemultimatches</name>
                    <value>true</value>
                </parameter>

            </parameters>
        </step>

Step 3.5: SAM file conversion

        <!-- Convert SAM to BAM file and sort the BAM file by coordinate -->
        <step id="step5sam2bam" skip="false">
            <module>sam2bam</module>
            <parameters>

                <!-- Compression level of the BAM file -->
                <parameter>
                    <name>compression.level</name>
                    <value>5</value>
                </parameter>

            </parameters>
        </step>

Step 3.6: Expression couting

        <!-- Compute expression -->
        <step id="step6expression" skip="false">
            <module>expression</module>
            <parameters>

                <!-- We use the fast Java implementation of HTSeq-count to perform the counting -->
                <parameter>
                    <name>counter</name>
                    <value>htseq-count</value>
                </parameter>

                <!-- The input feature annotation file is in GTF format -->
                <parameter>
                    <name>features.file.format</name>
                    <value>gtf</value>
                </parameter>

                <!-- The feature type to use is "exon", we count by exon -->
                <parameter>
                    <name>genomic.type</name>
                    <value>exon</value>
                </parameter>

                <!-- We aggregate the counts by genes -->
                <parameter>
                    <name>attribute.id</name>
                    <value>gene_id</value>
                </parameter>

                <!-- The library was oriented -->
                <parameter>
                    <name>stranded</name>
                    <value>yes</value>
                </parameter>

                <!-- Use "union mode" as overlap mode -->
                <parameter>
                    <name>overlap.mode</name>
                    <value>union</value>
                </parameter>

                <!-- We do not remove ambiguous cases -->
                <parameter>
                    <name>remove.ambiguous.cases</name>
                    <value>false</value>
                </parameter>

            </parameters>
        </step>

Step 3.7: MultiQC

        <!-- MultiQC -->
        <step id="step7multiqc" skip="false">
            <module>multiqc</module>
            <parameters>

                <!-- MultiQC will contain reports from FastQC, STAR and HTSeq-count -->
                <parameter>
                    <name>reports</name>
                    <value>fastqc,mapreads,expression</value>
                </parameter>

                <!-- Docker is required to launch this step -->
                <parameter>
                    <name>use.docker</name>
                    <value>true</value>
                </parameter>

            </parameters>
        </step>

Step 3.8: Create a SingleCellExperiment Bioconductor

This step allow to create a RDS file that contains a Bioconductor SingleCellExperiment object. This object contains the matrices generated by the step 3.11 and the cell and gene annotations.

The cell annotations are defined in the design file. Users can add many columns to describe their cells. In this step, only the annotation which column names start with a prefix (e.g. "Cell.") will be added to the SingleCellExperiment object.

The gene annotations are defined in the mm10ens91.tsv.bz2 file.

        <!-- Create a SingleCellExperiment Bioconductor Object in a RDS file -->
        <step id="step8singlecellexperiment" skip="false">
            <module>rsinglecellexperimentcreator</module>
            <parameters>
                <parameter>
                    <name>input.matrices</name>
                    <value>true</value>
                </parameter>
                <parameter>
                    <name>design.prefix</name>
                    <value>Cell.</value>
                </parameter>
                <parameter>
                    <name>r.execution.mode</name>
                    <value>docker</value>
                </parameter>
            </parameters>
        </step>

Step 4: launch the analysis

Once your design file and workflow file are ready, you can launch Eoulsan analysis with the following command:

eoulsan.sh exec workflow-scrnaseq-smartseq2.xml design.txt

Before starting the analysis, Eoulsan will check if:

  • the design file and workflow file are valid
  • the input files are valid (fasta and annotation files)
  • modules called in the workflow exist
  • the order of the steps is correct (in terms of inputs / outputs needed for each step)
  • a genome mapper index is needed. In this case, a step to generate this index is automatically added.

If successful, you will obtain a new directory named eoulsan-YearMonthDay-Time with log files about the analysis. Result files are stored in the current directory.

Step 5: Perform the downstream analysis in R

sce <- readRDS('step8singlecellexperiment_output/step8singlecellexperiment_output_sce_matrix.rds')
summary(sce)