STRT-Seq technology and errors #12

Closed
Davidwei7 opened this issue Apr 3, 2023 · 10 comments

Comments

@Davidwei7

Dear Sir/Madam,
Hope you are well.
Now that the problem with the Docker image on my cluster is resolved, I tried my first run of launch_universc.sh with the STRT-Seq technology setting. I ran everything in a Singularity image converted from the Docker image.
My command is this:
launch_universc.sh -R1 SRR6026844_sra_S1_L001_R1_001.fastq -R2 SRR6026844_sra_S1_L001_R2_001.fastq -t strt-seq -r /lustre/project/m2_jgu-canshank3/Comparison/Human/HomSap_GRCh38 -i SRR6026844

Please see the below snapshots of the processes and errors:

[images: four screenshots of the process output and errors]

Please see below the first few rows of the fastq files:

head -n 24 SRR6026844.sra_S1_L001_R1_001.fastq
@SRR6026844.sra.fastq.1 1 length=150
AAGCAGTGGTATCAACGCAGAGTACATGGGGAAAAAGAGAAAAGTGGAGGGATGTGTGGGCCTAGACAGGGGAAAAAGGAGAACAGGAGGCTCCAGACTGGTGAGGAAGGGGAGTGGGCTGGGCGTGCGGCTCATGCCTGTCATCCCAGC
+SRR6026844.sra.fastq.1 1 length=150
AA<<FFJJJFJJJJJJAJJJJJFA<JFJF<7FFFJJ--FJJJJA-F-AJJF<7FAA-FFJJ<AJJJFJJJ--7AJAJJFFJ<J7AJFA<FJ-7-AAJ7JF<<F7AJAAFJ7--777FJJFAA<JA-AJFAJJ-<7<7<FFAJF-FFAF7F
@SRR6026844.sra.fastq.2 2 length=150
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTCGTATCAAAGCAGAGTACATGGG
+SRR6026844.sra.fastq.2 2 length=150
AA7FAF7-FFJJJJJJFJJJJJAAJJ7FFJJJJJJJJJJJJJJJJJJJJJJJFJJJFJJJAJJJJJJJJJAJJJJJJJJJJJ-<JJ<J<<-FJA<-<--A7AFJ--7AJJ<<-FF-FJAJA-A<-7F-7AA<--7---FF-)--<AJFJJ
@SRR6026844.sra.fastq.3 3 length=150
AAGCAGTGGTATCAACGCAGAGTACATGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR6026844.sra.fastq.3 3 length=150
AA<FFF<FFFJJJJJJJFJ<JF<JFJ<A<7-FJJJJJFJFJFJJJJJJJJJJFJJJJJJJJJJJJJJJJAJFJJJJ<FJ-FJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJJJJAJJJAJFFJJJJJFFJJJJJFJJJFFAA7FFJ
@SRR6026844.sra.fastq.4 4 length=150
TGACCTGTCCCCTCTGGCTGCCTCTGAGTCTGAATCTCCCAAAGAGAGAAACCAATTTCTAAGAGGACTGGATTGCAGAAGACTCGGGGACAACATTTGATCCAAGATCTTAAATGTTATATTGATAACCATGCTCAGCAATGAGCTATT
+SRR6026844.sra.fastq.4 4 length=150
AA<-<7AFFJAFJJJF-7<FJFFAJJA7FFFJF7FJ7JJJF7FJAJFJFF7FFFFJ-FJJJ-<A-F<JFJ-<AA7-AJJJ<FFFJFFJF<JJAJF-FJA-FJJJ7FJJJJF<7A<FJFFJ-<AAJJJ7F7<F<FF7-<7<-<FAF<FAFJ
@SRR6026844.sra.fastq.5 5 length=150
GGAAGGAAGGAAGAAAGAAAGAAAGATAGAGAGAGAGAGAGAGAGAGAAAGATAGAGAGAAATAAAGAAACAAAGAAAGAAAGAAAGAAAGAAAGAAAAAAAAAGAAAAATACAAAAAAAAAAATTCACTTAACTCAGGGGTTCGGAGAT
+SRR6026844.sra.fastq.5 5 length=150
-A-FFF77F-F<<F<JJAF<AJFFJ----<<FJAFFF<FF<FJF<<AFJJJ<<FJJFJFJJ7-F<JAJJA---7A<7AF-<7-777AJAFJJ<-AJJ-F-<-A---A---77F7--7F<FAF<A-------7--7--A--))-)7)---7
@SRR6026844.sra.fastq.6 6 length=150
CCTCCAGATACCACTGAGCCTCTTGCCCATGATTCAGAGCTTTCAAGGATAGGCTTTATTCTGCAAGCAATCAAATAATAAATCTATTCTGCTGAGAGATCACAAAAAAAAAAAAAAAAAAAAAAAAAAACCTATTTGCTGATGAGATCA
+SRR6026844.sra.fastq.6 6 length=150
AA<7-7<<FFJJFFFJJJJJJJ<FJF<J<JFJJJJJJJJJJJJJFJJJFA7F<FJJJFFFJJAJFJ<FF-FA77JA-AJFFJFFA7-FA-FJJJ-AFJ----<F-<AJJA<<--AF-7AFA--AF-A<--7-7-AA<---7---7---7-

head -n 24 SRR6026844_sra_S1_L001_R2_001.fastq
@SRR6026844.sra.fastq.1 1 length=150
NAGGTGCATTCGCCCTCCGTAGAAATCCATGCCAAGTACGCTCCTTCCATTGATTTTCTTGGATCGGGTGTGCACCGCGTAGCTCAGCATGGCAAGTCTGTGTAGTCCGTGGACCCGCCAGGACCCCCCGCCGCACGAGACGCAATACGT
+SRR6026844.sra.fastq.1 1 length=150
#AAA--A--777-A---AA--7------7-7))--)---7)-7----7--------7--77-----7))-))7-7)<--))77-7--)---))--)----7-----7-))7))-)))))-))))-))))-)-)-)))))-))))7---7-
@SRR6026844.sra.fastq.2 2 length=150
NTATGACTCCACCCCTCAGAGAGGAGGAGGCGACGGGGACAACAACTCACAGAGAGCAAAGTCCGTGGCAACCACCCCGTCTGCGGAGAGCAGGTCCGACCCTACTAGACGAGAGACAACGAACGCCGGACCGCACAATGGCGAGAGCTA
+SRR6026844.sra.fastq.2 2 length=150
#<A-----A---7A--A--<-7---))7)7--7)-)---77<F-7-7--7---7-77------77--)))-7)--))))))))-))7)7))-)))))))))))-)----7--)-)----------)))))))))7-)<----))-)))-7
@SRR6026844.sra.fastq.3 3 length=150
NCCACATATAGGGAAACATTTTAATTCTTAGTTATTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTTCATTTTTTTTATTT
+SRR6026844.sra.fastq.3 3 length=150
#AAAA-F-A<-A--7----------7--------7------77--F<----7<-<7A<A--AA7-A-----F-<<-<-7--7--7-<77A<7A7-A<-A-F--<A<<-------------A<7-F7-------7----7----7---7--
@SRR6026844.sra.fastq.4 4 length=150
NACTTCGATATAAGATTTTTTTTTTTATTTATTACTCAAAGTTTAGAACATTTTATTAAAGTACAAAAATGTTAGAATTTAGCTAATAGAAAAACATAGTAAATATTTAAAAAAACGCTTATAAAATTACTCAAGGCACCCACAGAAAAC
+SRR6026844.sra.fastq.4 4 length=150
#AAFFAJ-FAJFJJJJ----------------------A---7-77-FFF----7--<-A-7A<--7<----A-A7<---77<-7<-77<FJ77-<-7--AAF-A-------AFJF<---7-<7--7-<----7--)-))-))))7-7--
@SRR6026844.sra.fastq.5 5 length=150
TCGAAGTATGGTGATATCGGAAGAGCTTCGAGTACGTAAATAGTGTAGATCTCGGTTGTCGTCTTATCATTAAAAAAACATTTCTTACTTTTCTCTCTTCGCACACCTCACTTCCTCGCTATATTGCTTCCTCCCTTCCGGGGACAGACC
+SRR6026844.sra.fastq.5 5 length=150
--AF-FF<--7F7-A---<-7--------7-7---7---7-------7----<-)--)---------7---<------------------------7----)-))))-)------7-<))----7----7)--7<)-7<))----))--)
@SRR6026844.sra.fastq.6 6 length=150
TATCAGCAAATAGGGTTTTTTTTTATTATTTTATTTTTTTTTTGATCCCTGAGGAGAATAGCGTTCATATGTGAGTTCTGGCAGAACAAAGGCTAACCTTGAAAGGCCTGTTATCTGGGACAGAAGCCCAGGAGTGCTCGTGTCTGTACC
+SRR6026844.sra.fastq.6 6 length=150
AAAAAFF<FF-FJ-F----------------------------77---)-))7A-7<7----------7------------777)----7A--------7-<AA7<-)7-)-----77)))))----7)))-)7-))-)7)7-)------

Do you have any idea why the process did not complete?

I also have some information about the FASTQ files, which I am sharing here in case it helps resolve the problem I am facing.

Firstly, the sequence structure of the fastq file is this:

[image: diagram of the sequence structure of the reads]

Secondly, more detail on how the authors analysed their data is in their GitHub repository: https://github.com/zorrodong/HECA/tree/master/scRNA-seq_pipeline_hg38

I am not sure whether my thinking on the following points is correct:

  1. The read structure of these FASTQ files is the opposite of what UniverSC assumes by default, so do I need to swap read 1 and read 2 before running the command?
  2. The FASTQ files use the authors' own barcode design, so do I need to provide a barcode list with -b barcode_96_8bp.txt? (barcode_96_8bp.txt is in their GitHub repository: https://github.com/zorrodong/HECA/blob/master/scRNA-seq_pipeline_hg38/barcode_96_8bp.txt)
  3. The read structure is in line with strt-seq, but the barcode length is 8 and the UMI length is 8, so do I need to use custom_8_8 in my command? Does UniverSC currently support this setting?
  4. The authors also used umi_tools in their pipeline to first extract UMIs and barcodes from the raw FASTQ files. Do I need to do this before using UniverSC? (The authors' pipeline is shown here: [image: the authors' pipeline])

I am terribly sorry for giving so much information on my issue. I am quite new to complex bioinformatics problems and want to use your software for integrative analysis. I have three datasets, generated with BD Rhapsody, 10x Genomics, and STRT-Seq (described above), and the STRT-Seq FASTQ data described in this post is the main reference we are comparing against, so I want to do this correctly.

Thanks for developing this tool, and looking forward to your response.

David

@Davidwei7
Author

Hi all, I was wondering if you have had a chance to look into the issue I am experiencing, which is described in two posts? Thank you in advance. Looking forward to your response.
Best Wishes,
David

@TomKellyGenetics
Collaborator

Hi, sorry for the delayed response. I don't have much time at the moment but I have some ideas on what may be causing this. I think it is unrelated to the technology. The "proc" command is used to detect the number of cores available to set the default number of threads.

Please try running it again with the number of threads set manually with --threads and let us know if the problem persists. I'll note that your system configuration differs from ours, as discussed in previous issues, so some dependencies may be missing.
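
As a minimal sketch of the suggestion above, the thread count can be worked out once in the shell and passed explicitly, so the run does not depend on automatic core detection inside the container. The `detect_cores` helper is hypothetical (not part of UniverSC), the `--threads` flag is taken from the comment above, and the command is only printed so it can be checked against the script's own usage text before running.

```shell
# Hypothetical helper (not part of UniverSC): report usable cores, falling back to 1.
detect_cores() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  else
    echo 1
  fi
}

threads="$(detect_cores)"
threads="${threads:-1}"   # guard against an empty result
ref=/lustre/project/m2_jgu-canshank3/Comparison/Human/HomSap_GRCh38

# Print the command for review rather than running it.
echo launch_universc.sh \
  -R1 SRR6026844_sra_S1_L001_R1_001.fastq \
  -R2 SRR6026844_sra_S1_L001_R2_001.fastq \
  -t strt-seq -r "$ref" -i SRR6026844 \
  --threads "$threads"
```

On a shared cluster it may be preferable to replace the detected value with the number of cores actually allocated to the job.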

@TomKellyGenetics
Collaborator

This dataset appears to use SmartSeq2.

Construction protocol: A modified Smart-seq2 protocol was applied for single-cell RNA-seq. Briefly, a single cell was picked into the lysis buffer by mouth pipette. The reverse transcription reaction was performed with 24 oligo (dT) primer anchored with the 8 bp cell specific barcode, and also with 8 bp unique molecular identifiers (UMIs).

https://www.ncbi.nlm.nih.gov/sra/?term=SRR6026844
https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR6026844

Note that our code supports the following configurations:
launch_universc.sh: STRT-Seq (6 bp barcode, no UMI): strt-seq
launch_universc.sh: STRT-Seq-C1 (8 bp barcode, 5 bp UMI): strt-seq-c1
launch_universc.sh: STRT-Seq-2i (13 bp barcode, 6 bp UMI): strt-seq-2i

It is possible to support an 8bp UMI but it will require a dedicated configuration. If it is a popular protocol we can support this but it appears to be a custom workflow used in this paper. Another possible workaround is to rename R1 and R2 (manually switch them) and run custom_8_8 which assumes R1 contains [BC][UMI].... and R2 contains transcript reads (as for 10x settings).
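
The swap workaround described above could look like the sketch below. It links the files under exchanged names in a separate directory instead of renaming the originals. `swap_reads` and the `swapped` directory are hypothetical names, and the commented command uses only flags and settings that already appear in this thread.

```shell
# Hypothetical helper: expose R2 under the R1 file name and R1 under the R2
# file name inside OUTDIR, leaving the original files untouched.
# Usage: swap_reads R1.fastq R2.fastq OUTDIR
swap_reads() {
  r1="$1"; r2="$2"; outdir="$3"
  mkdir -p "$outdir" || return 1
  # Resolve absolute paths so the symlinks stay valid from inside OUTDIR.
  r1_abs="$(cd "$(dirname "$r1")" && pwd)/$(basename "$r1")"
  r2_abs="$(cd "$(dirname "$r2")" && pwd)/$(basename "$r2")"
  ln -sf "$r2_abs" "$outdir/$(basename "$r1")"
  ln -sf "$r1_abs" "$outdir/$(basename "$r2")"
}

# Example (not executed here):
# swap_reads SRR6026844_sra_S1_L001_R1_001.fastq SRR6026844_sra_S1_L001_R2_001.fastq swapped
# launch_universc.sh -R1 swapped/SRR6026844_sra_S1_L001_R1_001.fastq \
#   -R2 swapped/SRR6026844_sra_S1_L001_R2_001.fastq \
#   -t custom_8_8 -r /path/to/reference -i SRR6026844
```

Symlinks keep the originals available under their real names, which makes it easier to undo the swap if a dedicated setting becomes available.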

I'll note the paper here to investigate in more detail later:

Fan X et al., "Spatial transcriptomic survey of human embryonic cerebral cortex by single-cell RNA-seq analysis.", Cell Res, 2018 Jul;28(7):730-745

@TomKellyGenetics
Collaborator

@Davidwei7 sorry for the delayed response. I've investigated the issues with these protocols and updated the source code to support them.

Please note that the protocol of Fan et al. (2018) is significantly modified from the one originally published by Islam et al. (2011).

> We modified the STRT-seq method for amplification of single-cell transcriptomes by changing the reverse transcription primer, the induced cell barcode, and the unique molecular identifier (UMI).

This requires a different bioinformatics approach.

> Raw reads were first segregated based on the cell-specific barcode information in read 2 of the pair-ended reads. Then, sequences in read 1 were trimmed with customized scripts to remove the TSO sequence, the polyA tail sequence and sequences with low-quality bases (N > 10%) or contaminated with adapters. Subsequently, the stripped read 1 sequences were aligned to the hg19 human reference genome.

Therefore I have created separate technology settings "strt-seq" for the original protocol and "strt-seq-2018" for the custom version. I've pushed this new configuration to the "dev" branch so it is possible to update to the development version to try it. There are minor changes to the source code so I expect it will run without errors. I've tested it on raw SRA data in FASTQ format from both publications and confirmed it created Cell Ranger compatible files.
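
Before choosing between these settings, the read layout can be checked directly from the FASTQ files. The sketch below tallies the most frequent prefixes of a read. `tally_prefix` is a hypothetical helper, and the assumption that the 8 bp barcode occupies bases 1 to 8 of read 2 with the UMI in bases 9 to 16 comes from the protocol description and the `custom_8_8` layout mentioned earlier in this thread, not from UniverSC itself.

```shell
# Hypothetical helper: count the most common substrings of the sequence lines.
# Usage: tally_prefix FASTQ START LENGTH [TOP_N]
tally_prefix() {
  # Sequence lines are every 4th line starting at line 2 of each record.
  awk -v s="$2" -v n="$3" 'NR % 4 == 2 { print substr($0, s, n) }' "$1" \
    | sort | uniq -c | sort -rn | head -n "${4:-10}"
}

# Example (not executed here):
# tally_prefix SRR6026844_sra_S1_L001_R2_001.fastq 1 8   # candidate cell barcodes
# tally_prefix SRR6026844_sra_S1_L001_R2_001.fastq 9 8   # candidate UMIs
```

If the layout assumption holds, the barcode tally should be dominated by a small set of sequences that can be compared with the authors' barcode_96_8bp.txt, while the UMI tally should be far more diverse.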

@TomKellyGenetics
Collaborator

Closing this issue as this technology is now supported. Raw reads from SRR6026844 were tested without errors. Please re-open this issue or file another if problems with your environment still prevent you from replicating this.

@adc0032

adc0032 commented Feb 7, 2024

Hi!
Has this been added to the main branch and incorporated into the Docker image? I don't have an option for strt-seq-2018 in my Docker installation.

@TomKellyGenetics
Collaborator

v1.2.7 has been merged and released on GitHub. Docker builds are in progress and will be available soon.

@TomKellyGenetics
Collaborator

TomKellyGenetics commented Feb 8, 2024

@adc0032 @Davidwei7 @kbattenb The latest version (1.2.7) passed docker builds and is now available on dockerhub: https://hub.docker.com/r/tomkellygenetics/universc/tags

This version supports STRT-Seq, PIP-Seq, and VASA-Seq protocols. I have some minor changes to versioning and issues #17 and #20 under consideration, but the above technologies should work, resolving issues #12 and #16.
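
One way to confirm that an installed copy or container includes the new setting is to search its technology listing. This is only a sketch: `has_technology` is a hypothetical helper, and it assumes that the listing uses the `launch_universc.sh: <description>: <setting>` line format quoted earlier in this thread and that the script prints this listing in its help text.

```shell
# Hypothetical helper: read a technology listing on stdin and succeed only if
# a line ends with ": NAME" (so "strt-seq" does not match "strt-seq-c1").
# Usage: some_listing_command | has_technology NAME
has_technology() {
  grep -Eq ":[[:space:]]*$1[[:space:]]*\$"
}

# Example (not executed here). Assumption: --help prints the listing.
# if launch_universc.sh --help 2>&1 | has_technology strt-seq-2018; then
#   echo "strt-seq-2018 is available"
# else
#   echo "strt-seq-2018 not found; the image may predate v1.2.7"
# fi
```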

@adc0032

adc0032 commented Feb 14, 2024

@TomKellyGenetics

Thank you for getting this updated!

Should strt-seq-2018 still be undergoing file format conversion in line 3138?

#STRT-Seq

I don't see it included here in the STRT-Seq section.

@TomKellyGenetics
Collaborator

I think not, since UMIs are already included in the 2018 custom protocol. It may be necessary to remove the TSO sequence as described in the paper (by hard trimming the R1 reads) if you are using (paired-end) 10x 5' scRNA chemistry settings. I think it is not necessary to perform TSO conversion on R2 after the barcode and UMI, as you can just use 10x 3' scRNA chemistry settings, which ignore the rest of this read.
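
A minimal sketch of trimming the TSO from R1 follows. The 30 nt sequence is the prefix shared by the first three R1 records shown at the top of this thread; treating it as the TSO is an assumption, as is trimming only reads that start with an exact match (the paper's customized scripts are not reproduced here). `trim_tso` is a hypothetical helper, not a UniverSC function.

```shell
# Hypothetical helper: remove a leading TSO from each read and trim the
# quality string by the same length. Reads without the prefix pass unchanged.
# Usage: trim_tso R1.fastq TSO_SEQUENCE > R1.trimmed.fastq
trim_tso() {
  awk -v tso="$2" '
    NR % 4 == 1 { header = $0; next }
    NR % 4 == 2 { seq = $0; next }
    NR % 4 == 3 { plus = $0; next }
    {
      qual = $0
      if (index(seq, tso) == 1) {        # TSO found at the start of the read
        seq  = substr(seq,  length(tso) + 1)
        qual = substr(qual, length(tso) + 1)
      }
      print header; print seq; print plus; print qual
    }' "$1"
}

# Example (not executed here); the sequence is assumed from the reads above:
# trim_tso SRR6026844_sra_S1_L001_R1_001.fastq AAGCAGTGGTATCAACGCAGAGTACATGGG \
#   > SRR6026844_trimmed_S1_L001_R1_001.fastq
```

Reads made up of repeated TSO copies would keep the later copies, and a read consisting only of the TSO would be left empty, so a length filter may be needed afterwards.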
