Each module write output files in the subfolders of '<parent-dir>'.
STAR and HTSeq command options selected as suggested in the QuantSeq data analysis manual, page 18.
We are using UMI Tools with basic options for this pipeline.
Important file/directory outputs from each module, marked as bold for your attention.

Preprocessing

Process raw FASTQ files

The 1st 6nt are UMI. So, we aim to remove UMI sequences from the Illumina reads using the umi_tools extract. Then, we use cutadapt with -u option to trim 12 nt and also, remove low quality and short reads.

bash process_fastq.sh '<parent-dir>' '<fastq-dir>' '<#of-jobs>'

OUTPUTS:

Location	Description
`fastq-processed/umi_extract`	FASTQ files after running `umi_tools extract` command on files inside '' directory.
`fastq-processed/trim`	FASTQ files after running `cutadapt` command on above FASTQ files as the input.
`logs/umi_tools_extract`	Log files from running `umi_tools extract`
`logs/cutadapt_trim`	Log files from running `cutadapt`

Alignment

This command run STAR for all samples in the '<fastq-dir>'. To use processed FASTQ files as the input, specify '<fastq-dir>' as fastq-processed/trim.

bash alignment.sh '<parent-dir>' '<fastq-dir>' '<#of-jobs>'

OUTPUTS:

Location	Description
`bam`	Actual results from STAR run in BAM format.
`logs/star_aligner`	Log files from STAR run.

Process BAM files

Here, [umi_tools dedup](https://umi-tools.readthedocs.io/en/latest/reference/dedup.html) detect duplicate reads in the BAM files. To use BAM files from the alignment module as the input, specify '<bam-dir>' as bam.

bash umi_dedup.sh '<parent-dir>' '<bam-dir>' '<#of-jobs>'

OUTPUTS:

Location	Description
`bam-processed`	Actual results from `umi_tools dedup` run in BAM format.
`logs/umi_tools_dedup`	Log files from `umi_tools dedup` run.

Measure gene level counts

Gene level counts measured using the HTSeq package. To use processed BAM files as input, specify '<bam-dir>' as bam-processed.

bash htseq-count.sh '<parent-dir>' '<bam-dir>' '<count-dir>'

OUTPUTS:

if '<count-dir>' replaced by counts:

Location	Description
`counts`	Actual results from `htseq-count` run.

Quality controls

Fastq files and the pre-processing steps were assessed for quality control using the FastQC and multiQC programs.

bash fastqc.sh '<parent-dir>' '<fastq-dir>' '<fastQC-dir>' '<#of-jobs>'

OUTPUTS:

if '<fastQC-dir>' replaced by fastQC:

Location	Description
`fastQC`	Actual results from `fastqc` run.
`mutiqc-fastQC`	All data from `mutiqc` run on the above folder.
`mutiqc-fastQC.html`	HTML report of QCs for all FASTQ files in '' folder.

Also, this line will provide another HTML report for STAR, Cutadapt, UMI Tools (extract and dedup) and HTSeq commands into mutiqc-preprocessing.html file.

multiqc '<count-dir>' logs/ -n mutiqc-preprocessing

Downstream analysis

Count based basic analysis

HTSeq counts processed in R using the DESeq2 package. First, the varianceStabilizingTransformation function from DESeq2 used for the Principal Component Analysis (PCA). The first three PCs were selected for a 3-D scatter plot visualization using the R-Plotly package. For heatmap visualizations, normalized counts extracted from the DESeq object, genes with less than 10 counts in all samples were removed as low expressed genes, and also, normalized across samples (SD = 1, mean ~=0) for each gene.

Rscript DESeq.R '<parent-dir>' '<count-dir>' '<samplesheet.txt>' '<#of-jobs>'

OUTPUTS:

Location	Description
`deseq/pca3d`	`plotly` data
`deseq/pca3d.html`	PCA; Interactive 3D scatter plot for the first 3 PCs.
`deseq/mostVar_Heatmap.pdf`	Heatmap plot in `pdf` format
`deseq/mostVar_Heatmap.png`	Heatmap plot in `png` format
`deseq/counts.normalized.txt`	Raw counts for all samples in a single file.
`deseq/counts.raw.txt`	Normalized counts (using DESeq2) for all samples in a single file.

Differential expression analysis

[coming soon]

Enrichment analysis

[coming soon]

References

Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England), 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
Smith, T., Heger, A., & Sudbery, I. (2017). UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome research, 27(3), 491–499. https://doi.org/10.1101/gr.209601.116
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. https://doi.org/10.14806/ej.17.1.200
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England), 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635
Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics (Oxford, England), 31(2), 166–169. https://doi.org/10.1093/bioinformatics/btu638
Frankish, A., Diekhans, M., Ferreira, A. M., Johnson, R., Jungreis, I., Loveland, J., Mudge, J. M., Sisu, C., Wright, J., Armstrong, J., Barnes, I., Berry, A., Bignell, A., Carbonell Sala, S., Chrast, J., Cunningham, F., Di Domenico, T., Donaldson, S., Fiddes, I. T., García Girón, C., … Flicek, P. (2019). GENCODE reference annotation for the human and mouse genomes. Nucleic acids research, 47(D1), D766–D773. https://doi.org/10.1093/nar/gky955
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8
Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England), 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
Smith, T., Heger, A., & Sudbery, I. (2017). UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome research, 27(3), 491–499. https://doi.org/10.1101/gr.209601.116
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. https://doi.org/10.14806/ej.17.1.200
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England), 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635
Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. New York: Chapman and Hall/CRC, https://doi.org/10.1201/9780429447273

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Preprocessing

Process raw FASTQ files

OUTPUTS:

Alignment

OUTPUTS:

Process BAM files

OUTPUTS:

Measure gene level counts

OUTPUTS:

Quality controls

OUTPUTS:

Downstream analysis

Count based basic analysis

OUTPUTS:

Differential expression analysis

Enrichment analysis

References

Files

README.md

Latest commit

History

README.md

File metadata and controls

Preprocessing

Process raw FASTQ files

OUTPUTS:

Alignment

OUTPUTS:

Process BAM files

OUTPUTS:

Measure gene level counts

OUTPUTS:

Quality controls

OUTPUTS:

Downstream analysis

Count based basic analysis

OUTPUTS:

Differential expression analysis

Enrichment analysis

References