Error with splitseq #10
Comments
Thanks for reporting this issue. It is a national holiday in Japan, so I may take a few days to get back to you, but I will investigate this. User feedback really helps us to troubleshoot these issues and support a wide range of technologies. In the meantime, can you please provide some more information to help us resolve the problem with the source code? Specifically, are you using the latest version (v1.2.5.1) or an older one? A minimal example of your input files would help to test the update before releasing it. For example, the first 20 lines of each FASTQ file will be sufficient. The logs already help to narrow it down.
Based on this error message, Kai was correct to ask me to handle this. I suspect the problem is specific to this technology, in code that I am responsible for: Lines 3453 to 3478 in c2d0c88
If you’ve not tried already, please pull the latest version from GitHub and try running this code. If there is still an error, I will need to update the source code. In that case I expect it is a minor syntax error introduced in recent updates, so I should be able to fix it within a few days. I’ll update this thread if I manage to reproduce the error and test a solution in the development version.
Note that the latest release may already solve this, as I found a bug in version 1.2.4 (17 Sept 2022) affecting this section (88e20dd). Sorry for the inconvenience; this may be an issue I am already aware of. Please try updating to version 1.2.5.1 (18 Jan 2023) and close this issue if it works. If you installed UniverSC during the above period, you may be affected by a syntax error that’s now been resolved.
Thanks for the quick response and suggestions. I am already using universcversion="1.2.5.1". The first 20 lines of the FASTQ files: Read1
and Read2
I want to add, I tried the tool with dropseq samples and that worked with no errors. |
Thanks for sharing this information. I've managed to reproduce this issue on my system and confirmed that the code block specific to this technology is causing the problem. The files were correctly created in the input4cellranger directory, but after executing this code the R2 FASTQ file is empty because the sed command gives invalid output. This means I can test a solution on my system and update the source code on GitHub once it works. I'll push it to the "dev" branch for the development version and notify you when it is ready. Apologies for the inconvenience. It appears to be an oversight on my part when integrating this technology into the pipeline. Reassuringly, no other technologies should be affected by this issue, as you've noted. We tested it extensively with Drop-Seq data (this was actually our original motivation to create UniverSC in the first place), and I am pleased to hear that others such as yourself recognise the need for this and that it is working for them.
@kbattenb I confirmed that it is the 2nd sed call in this subroutine that is failing; the 1st works as expected. I will handle this and check the Split-Seq specifications to ensure it is correct. I've had some issues with NCBI SRA IDs in the FASTQ headers before but I think I addressed them while testing published data for the paper. Note this study uses NextSeq, which sequences the reverse complement of the R2 sequence, so the barcodes are read in reverse order. The 2nd sed call is intended to address this possibility but it fails due to mismatches in adapter sequences. I'll update the regular expressions to account for this.
Note this protocol uses the Split-Seq method (Rosenberg et al., 2018) with modifications. This may be the Split-Seq v2 referred to in COMBINE-lab/salmon#699 (comment), which has different adapter sequences. It should be possible to support this but I may need more time to get a demo working.
The development version supports Split-Seq v1 adapters. You can try it with:
I can run UniverSC on this public data and call Cell Ranger without errors. This corrects a syntax error in the handling of quality scores (introduced when correcting the bugs in v1.2.3.4 discussed above) and ensures that adapter sequences are removed by correctly matching the sequences given in the above example. Thanks to your feedback we are able to support additional techniques like this.
Note this does not support Split-Seq v2 adapters (yet). The public data provided has the longer adapters expected for Split-Seq v1 as cited in Rosenberg et al. (2018). Some mismatched adapter sequences are permitted, but frameshifts will cause mismatched barcodes to be skipped, as barcodes are assumed to be a fixed distance apart (consistent with how salmon/alevin and zUMIs handle this). The "NN" bases at the beginning of R2 sequences are automatically removed. If the adapters do not match, the match is skipped and the script attempts to use the reverse complement. The UMI is automatically moved to the end after the barcode sequences, and the barcode order (B3-B1) is corrected (to B1-B3). However, each BC or UMI sequence will still be the reverse complement. It may be necessary to use a barcode whitelist for Split-Seq with barcode sequences in the reverse complement (try this if the number of cells detected is far lower than expected). No change is needed for UMIs as they will still be unique sequences. Please try running the "dev" version (1.2.5.2-dev) and contact us if you have trouble.
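A minimal sketch of building a reverse-complement whitelist, as suggested above for cases where far fewer cells are detected than expected. It assumes 24 bp whitelist entries made of three 8 bp barcodes in B1-B2-B3 order, with each segment reverse-complemented individually so that the segment order is preserved. The function names are hypothetical and this is not the UniverSC implementation, which is written in shell.

```python
# Illustration only: function names are hypothetical, not part of UniverSC.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def segmentwise_reverse_complement(barcode: str, seg_len: int = 8) -> str:
    """Reverse-complement each fixed-length segment, keeping segment order.

    Assumes the barcode is a concatenation of equal-length segments,
    e.g. a 24 bp entry made of three 8 bp round barcodes.
    """
    if len(barcode) % seg_len != 0:
        raise ValueError("barcode length must be a multiple of seg_len")
    segments = [barcode[i:i + seg_len] for i in range(0, len(barcode), seg_len)]
    return "".join(reverse_complement(s) for s in segments)


def convert_whitelist(barcodes, seg_len: int = 8):
    """Apply the segment-wise reverse complement to every whitelist entry."""
    return [segmentwise_reverse_complement(b, seg_len) for b in barcodes]
```

Reverse-complementing the whole 24 bp string instead would also reverse the B1-B2-B3 order, which is why the sketch works per segment.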
Hi Tom, Thanks for the effort. Please see the log below:
|
Sorry, I am afraid it appears to be the same error. I think it may be a problem merging changes from the development branch with git. Please try running
You can verify the branch or version is correct as follows:
This should display which branch you are using and the dev version number above. You may need to press “q” to quit the branch list; the current branch is marked with “*”. Note you may also need to update the barcode whitelist. Ideally I’d like to update the default whitelist for this technology, but you can use custom inputs to test a whitelist if that is faster than waiting for us developers to add support for it. Provided each cell has a unique barcode and they match the whitelist, it should give valid QC and results. You may need to run
This command resets Cell Ranger to its original settings and then forces configuration of the new whitelist (in case it conflicts with your existing installation). This may also be useful to switch to dropseq settings. You may also need to delete the .lock files listed in the logs to run a new technology if a previous run aborted without completing (only do this if you know no other UniverSC runs are in progress).
I’ve added support for both versions of splitseq and it is backwards compatible (splitseq and splitseq-v1 are aliases for the same setting). Use v1, not v2, for this data.
Hi Tom, I guess the problem is that I am not able to pull the dev branch
|
Oh I see now, I’m relieved actually. Hopefully the updated script will work for you once you have it merged. As for git settings, I think the problem is cloning the repo only copied the master branch. You’ll need to create the dev branch on your local repository and pull updates.
If the issue persists with the updated script, let me know and I will try to test it again. A common issue is that sequence and quality scores are different lengths in converted fastq files but it should be avoided in this case. |
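A minimal sketch of a check for the common issue mentioned above, where sequence and quality strings differ in length in a converted FASTQ file. It assumes standard 4-line FASTQ records; the function names are hypothetical and the check is independent of UniverSC.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def check_fastq(lines):
    """Return (number of records, 1-based numbers of malformed records).

    A record is malformed if it is incomplete, lacks the '@' or '+'
    markers, or has sequence and quality strings of different lengths.
    A trailing incomplete record is counted and reported as malformed.
    """
    it = (line.rstrip("\r\n") for line in lines)
    n_records, bad = 0, []
    # Passing the same iterator four times groups the lines into records.
    for header, seq, plus, qual in zip_longest(it, it, it, it):
        n_records += 1
        ok = (
            qual is not None
            and header.startswith("@")
            and plus.startswith("+")
            and len(seq) == len(qual)
        )
        if not ok:
            bad.append(n_records)
    return n_records, bad
```

Running `check_fastq(open_fastq(path))` on the R1 and R2 files and comparing the record counts would also reveal an empty or truncated R2 file.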
Thanks Tom! I created a white list using all possible permutations of barcodes 8bp long. Next I tried this code:
The setup flag works with 10x but not with splitseq; it expects to see all the files:
Then I ran your pipeline, specifying this new whitelist with the -b option. But this time I get a different error after Cell Ranger starts:
|
Hi Dilara, Thanks for reporting the issue with
This is an error we've encountered before. It is due to FASTQ files not being available or being too small. Please check the files in the input4cellranger- directory. Ensure R1 and R2 have the same number of lines and the sequence and quality scores are the same length (it could be a bug with the patch I tested last week). Note that the whitelist for splitseq needs to be 24 bp in length as there are 3 barcodes joined together. With all permutations of 8 bp BCs, you can then generate all combinations of [BC1]-[BC2]-[BC3]. Here is an example of the code doing that. Line 1916 in 7dc550e
Lines 2017 to 2018 in 7dc550e
This will allow all possible barcodes to run but does not guarantee they are correct. If the number of cells is very low, it is possible the adapter sequences were removed incorrectly. I've used similar specifications to other tools supporting this technology (zUMIs, salmon/alevin, dropEst), but the reverse complement is needed for NextSeq and NovaSeq and, as others have noted, it is computationally challenging with variability in the adapters. I hope this helps to narrow down the problem. Please check the results carefully as this is one of the more difficult technologies for us to handle.
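A minimal sketch of joining per-round barcodes into 24 bp [BC1][BC2][BC3] whitelist entries, as described above. The function names are hypothetical, and this is not the code referenced in the repository. Note that combining all 4^8 possible 8 bp sequences across three rounds gives 4^24 entries, so in practice this is only feasible with a limited list of known barcodes per round.

```python
from itertools import product


def all_kmers(k: int = 8, alphabet: str = "ACGT"):
    """Yield every possible sequence of length k (4**k sequences for DNA)."""
    return ("".join(p) for p in product(alphabet, repeat=k))


def combine_rounds(bc1_list, bc2_list, bc3_list):
    """Yield concatenated [BC1][BC2][BC3] barcodes for every combination."""
    for bc1, bc2, bc3 in product(bc1_list, bc2_list, bc3_list):
        yield bc1 + bc2 + bc3


def write_whitelist(path, barcodes):
    """Write one barcode per line, the usual plain-text whitelist layout."""
    with open(path, "w") as handle:
        for barcode in barcodes:
            handle.write(barcode + "\n")
```

For example, `combine_rounds(round1, round1, round1)` with one list of known round barcodes yields every three-round combination of that list.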
Hi Tom, |
Hi Dilara, Sorry to hear you still have trouble. Let's see if we can help. We've done as many as 16 bp barcodes with all combinations (4^8) but it is slow for downstream analyses if invalid barcodes are filtered out (even on a server with plenty of memory). All combinations of 3 x 8 bp segments would be (4^8)^3, so it would be very large. It is better to use the known barcodes for the technology to avoid this. For example, the barcodes are given in kharchenkolab/dropEst/data/barcodes/split_seq, where each row is BC1, BC2, or BC3. They appear to be identical, with 96 barcodes each. I recall checking this was consistent with the supplementary data for Rosenberg et al. (2018). Therefore all 3-way combinations of this whitelist should give the correct 96^3 (884,736) barcodes. This is around the same number as 10x v2, so it should be supported by Cell Ranger.
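The whitelist sizes discussed above can be checked with a few lines of arithmetic. This is a minimal sketch; `whitelist_size` is a hypothetical helper that assumes every round uses the same number of barcodes.

```python
def whitelist_size(barcodes_per_round: int, rounds: int = 3) -> int:
    """Number of combined barcodes when each round uses the same list."""
    return barcodes_per_round ** rounds


all_8mers = 4 ** 8  # every possible 8 bp sequence

print(f"{all_8mers:,}")                  # 65,536
print(f"{whitelist_size(all_8mers):,}")  # 281,474,976,710,656
print(f"{whitelist_size(96):,}")         # 884,736
```

The gap between the last two numbers is why a list of known barcodes is preferable to enumerating every possible sequence.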
The potential issues remaining to match this with your sequences are whether a reverse complement is needed (I can check this on your example data above) and whether the adapters are aligned correctly. Notably, 94 cycles matches the specifications in the zUMIs and dropEst example configurations, but there are leading NN basecalls in the R2 FASTQ file with poor quality scores. I've adjusted the adapter trimming to remove these but that may cause mismatches to the whitelist.
It is possible to account for mismatches but it is not currently supported by our scripts. Generally, I am not sure it is beneficial, as accurate UMI sequences are necessary to get accurate count data and, with sufficiently deep sequencing, errors in barcodes will be filtered out as low-coverage cells while the same molecules will be re-sequenced with the same UMI. There are diminishing marginal improvements from implementing this. This is also a difficult problem for us because, unlike technology-specific pipelines, we cannot assume that the technology users will run has been designed with barcodes with sufficient differences (by Hamming or Levenshtein distance) to avoid mismatches swapping barcodes with another cell.
However, our pipeline is compatible with pre-processed reads, provided they are the same length and not truncated, with full barcodes. It is possible to generate consensus reads using the UMI parameters in "fastp", for example. I'd also recommend trimming poor quality R2 reads and filtering trimmed reads that are shorter than 94 bp. Make sure to use a tool that supports paired-end reads or match the R1 reads after processing (https://github.com/linsalrob/fastq-pair). In my experience, mapping trimmed reads also gives a higher count per cell. I'll compare the expected barcodes to the public data you've shared and update the script if necessary to match them correctly.
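A minimal sketch of the length filter recommended above, keeping R1 and R2 in step so the pairs stay matched. It assumes both files list reads in the same order and that records have already been parsed into (header, sequence, quality) tuples. The function name is hypothetical, and dedicated tools such as fastp or fastq-pair remain the more robust option.

```python
def filter_pairs(r1_records, r2_records, min_r2_len: int = 94):
    """Yield (r1, r2) pairs whose R2 read is long enough and well-formed.

    Each record is a (header, sequence, quality) tuple. Pairs are dropped
    together, so the two outputs keep the same number of reads in the
    same order.
    """
    for r1, r2 in zip(r1_records, r2_records):
        _, seq, qual = r2
        if len(seq) >= min_r2_len and len(seq) == len(qual):
            yield r1, r2
```

Because `zip` stops at the shorter input, any unpaired reads at the end of one file are dropped rather than written out of step.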
I confirmed the reverse complement of the above whitelist matches the barcodes in R here:
I also checked by
However, Lines 2075 to 2080 in 7dc550e
Kai: I'll handle this issue. Dilara: please wait until the development version is updated as the current version filters adapters incorrectly.
It should also be possible to support both the original barcodes and reverse complement. All but 1 (96*2 -1 = 191) are not palindromic so there are 191^3 (6,967,871) permutations. 10x v3 uses over 3 million barcodes so this would require around twice as much memory as a 10x run. |
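A minimal sketch of the counting argument above: a whitelist holding both orientations contains each barcode and its reverse complement once, and a barcode that equals its own reverse complement contributes only one entry. The toy barcodes below are made up for illustration and are not the Split-Seq barcodes.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(barcodes):
    """Return the sorted union of the barcodes and their reverse complements.

    The union is smaller than 2 * len(barcodes) whenever a barcode is its
    own reverse complement, or when one barcode's reverse complement is
    already in the list.
    """
    return sorted(set(barcodes) | {reverse_complement(b) for b in barcodes})


# Toy example (hypothetical barcodes): ACGTACGT is its own reverse complement.
toy = ["ACGTACGT", "AAAACCCC", "AAAAAAAA"]
both = with_reverse_complements(toy)
print(len(both))           # 5, i.e. 3 * 2 - 1
print(f"{191 ** 3:,}")     # 6,967,871
```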
The development branch is now updated: 4edff5b...9ba7bba Please try pulling it (from branch
Published SPLiT-Seq data is now fully supported based on my tests. The version is also pre-released as a docker image v.1.2.5.2-dev.
Many thanks Tom, really! I updated the tool and ran it but I had the same error. I confirmed that R1 and R2 are the same length, but I found that R2 had reads of unequal length (some reads are still unchanged); see below:
|
Hi Dilara, Sorry to hear you still have trouble. Which error are you referring to? The memory issues discussed above should be resolved with the new barcode whitelist (I changed the default name to force it to update existing installs). There is no need for you to set
Is it the issue with Cell Ranger calling
Some untruncated lines are expected. Those beginning with 'NN' were not converted as there were mismatches to the adapter sequence (due to sequencing errors and variable adapters). These should be filtered out as invalid barcodes as they will not match the barcode whitelist. It's not ideal as some data will not be used, but I think these reads have poorer sequence quality. As the sequence and quality scores are the same length for each read, I would expect Cell Ranger can parse them in the 'chunk reads' step. Is the other FASTQ file valid and does it have the same number of reads (same number of lines by
On the other hand, the adapters seem to be a fixed length in a fixed position (as described for zUMIs, dropEst, and Salmon). The known barcodes match these parameters. It turns out the original problem was caused by syntax errors in sed and the reverse complement sequence in R2 from NextSeq on v1.5 chemistry. I think there is no longer a need to match exact adapter sequences and we can safely assume barcodes are in a fixed position (i.e., separated by 30 bp adapters): NN[ 8 bp UMI][ 8 bp BC3]---30 bp----[8 bp BC2]---30 bp---[8 bp BC1]. It is a very minor change in the source to support this but it would break automated detection of the reverse complement. Ideally, I'd prefer it wasn't necessary to configure the run differently for HiSeq or NovaSeq data. To force it to run for your data I'll provide a patch to update the code.
Save the above in a file called 'patch.txt' and run
|
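A minimal sketch of the fixed-position layout described in the comment above, NN[8 bp UMI][8 bp BC3]---30 bp---[8 bp BC2]---30 bp---[8 bp BC1], which adds up to 94 bases. It is not the patch itself. The offsets are derived from that layout, the function name is hypothetical, and the output order (BC1, BC2, BC3, then UMI) follows the conversion described earlier in the thread.

```python
# 0-based (start, end) offsets derived from the 94 bp layout:
# NN | UMI | BC3 | 30 bp adapter | BC2 | 30 bp adapter | BC1
LAYOUT = {
    "umi": (2, 10),
    "bc3": (10, 18),
    "bc2": (48, 56),
    "bc1": (86, 94),
}
OUTPUT_ORDER = ("bc1", "bc2", "bc3", "umi")


def extract_fixed(seq: str, qual: str):
    """Slice barcodes and UMI by position and return (sequence, quality).

    Returns None when the read is shorter than the layout or when the
    sequence and quality strings differ in length. The same slices are
    applied to both strings so their lengths stay equal.
    """
    if len(seq) < 94 or len(seq) != len(qual):
        return None
    new_seq = "".join(seq[slice(*LAYOUT[name])] for name in OUTPUT_ORDER)
    new_qual = "".join(qual[slice(*LAYOUT[name])] for name in OUTPUT_ORDER)
    return new_seq, new_qual
```

This approach ignores the adapter content entirely, so it tolerates adapter mismatches but cannot tell whether a read is in the reverse-complement orientation.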
Hi Tom, Unfortunately the pipeline still fails at this step after all these changes:
In the end I tried these samples with STARsolo and it worked. Since I am short on time, I decided to drop this pipeline and move on with STARsolo. Thank you very much for all your help so far! I am closing the issue.
Hi!
I am trying to run your tool with the splitseq technology; however, cellranger does not run properly.
First, I did not have I1 and I2, so I followed the guideline on your main page and created dummy I1 and I2 files using the first two indexes in whitelists/split-seq_round1_barcode.txt.
Then I ran the pipeline, indicating technology 'splitseq'
My log file looks like this: