TIR taking a long time - possible workaround? #175
Comments
Yeah, it's a headache that chronically flares up... The ideal solution would be rewriting the TIR module to handle a large number of small scaffolds more efficiently, but my Python skill is only at the read-and-modify level, with occasional debugging luck, so this task has sat on a deep waiting list.

Your approach will work when your goal is just to identify TIR sequences. The final annotation, however, needs to combine both structural and homology annotations, and combining scaffolds into a pseudo-chromosome will not give proper coordinates for the final annotation. And, as you mentioned, joining sequences directly may create artificial structures that confuse the program. A workaround for this is to put 100 Ns between each join, so that most spurious links between scaffolds are removed in the filtering step.

You may also split the genome into smaller portions (without joining the sequences), run TIR (`EDTA_raw.pl --type tir`) on each of them within the scheduled walltime, then manually combine the intact TIR files (both intact.fa and intact.gff3) and place them in the original TIR folder. That way the program will see finished TIR files, treat the step as done, and move on.

Shujun
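The split-and-run approach amounts to: split the FASTA, run `EDTA_raw.pl --type tir` on each part, then concatenate the per-part intact.fa and intact.gff3 into the original TIR folder. A minimal pure-Python sketch of just the splitting step, assuming plain FASTA input (seqkit's `split -p` is the robust way to do this; the function names and the round-robin distribution here are my own illustration, not EDTA or seqkit code):

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_records(records, parts):
    """Distribute records round-robin into `parts` buckets,
    one bucket per TIR job (one simple splitting strategy)."""
    buckets = [[] for _ in range(parts)]
    for i, rec in enumerate(records):
        buckets[i % parts].append(rec)
    return buckets
```

Each bucket would then be written to its own FASTA and submitted as a separate TIR job. Since the scaffolds are not joined, the per-part GFF3 coordinates stay relative to their own scaffolds, so the intact files can simply be concatenated afterwards.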
Thanks for the feedback! Yes, I should have put Ns in; my solution was too lazy, and the spacer would avoid misjoined TEs. Running a bunch of TIR jobs is probably a great alternative: a bit fiddlier, but guaranteed to have zero misjoins. Ideally the TIR step wouldn't create a million files but would keep most of that in memory; my cluster's Lustre file system really can't handle that, which is why I have to use a Singularity overlay-FS to write the results into. So the slowness comes from the file system as much as from the language.

For the final model I use the TEsorter output anyway: I take the TElib.fa from EDTA, put that into TEsorter, re-classify TEsorter's Unknowns with the original EDTA-assigned class, and run my own RepeatMasker against the original scaffolds. So that part shouldn't be problematic, I think, since I'm using the correct reference.

But I just checked: EDTA with my fake scaffolds predicted 753 TEs, and the RepeatMasker run with my correct scaffolds found alignments for 746 of them. So 7 TEs may be due to misjoins? I'll close this issue now; others who hit this problem can use this workaround.
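A tiny helper for the re-classification step might look like this. The `name#Class` FASTA header convention is EDTA's, but `final_class` and its inputs are hypothetical, assuming you have already parsed TEsorter's classification for each entry (e.g. from its `*.cls.tsv` output):

```python
def final_class(tesorter_cls, edta_header):
    """Prefer TEsorter's classification; fall back to EDTA's
    original class when TEsorter reports Unknown.
    EDTA library headers look like 'TE_00000001#LTR/Gypsy'."""
    # Recover EDTA's original class from the header, if present.
    edta_cls = edta_header.split("#", 1)[1] if "#" in edta_header else "Unknown"
    if tesorter_cls and tesorter_cls != "Unknown":
        return tesorter_cls
    return edta_cls
```

The merged classes can then be written back into the library headers before handing the library to RepeatMasker via `-lib`.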
You may only find out with a more extended comparison. Without knowing the genome size, 746 TEs honestly seems too few. The benefit of using EDTA for the final annotation is that you get structurally intact TEs - only helpful if you want to do something with them.
OK :) it's a seagrass genome, they're all weird. I'll try out the splitting-up-TIR-input thing to see where I'll end up! |
Hi
As others report (for example, #66), the TIR step can still run for a long time on highly fragmented genomes with a high number of scaffolds.
I've now found this workaround and I'm wondering whether it's OK; I can't see any drawback.

I split the genome (35,334 scaffolds) into 500 chunks with `seqkit split -p 500`, then collapsed each chunk into a single fake scaffold by writing one header and appending all sequence lines without their headers, along the lines of `echo ">fake_scaffold" > new_file && grep -v '>' filename >> new_file`.

Before this, the TIR step would run out of walltime after 96 hours. With my faked genome the TIR step takes less than 2 hours.
Is this a valid workaround? Does it impact the quality of the predictions somehow? All I can think of is that EDTA might join repeats from randomly adjacent scaffolds into one piece, giving a falsely merged repeat, but that should occur only rarely, right? I can't compare with the un-faked genome, as that EDTA run never finished :)
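For anyone trying the pseudo-scaffold route anyway, the collapsing step, with the 100-N spacer suggested above so that candidates spanning a junction get filtered out, might be sketched as follows (the function names are illustrative, not EDTA code):

```python
def join_with_spacer(seqs, n_spacer=100):
    """Concatenate scaffold sequences into one pseudo-scaffold,
    separated by runs of N so that TE candidates spanning a
    scaffold junction can be filtered out downstream."""
    return ("N" * n_spacer).join(seqs)

def write_fasta(path, header, seq, width=60):
    """Write the pseudo-scaffold as line-wrapped FASTA."""
    with open(path, "w") as fh:
        fh.write(f">{header}\n")
        for i in range(0, len(seq), width):
            fh.write(seq[i:i + width] + "\n")
```

Note that, as discussed above, coordinates on such a pseudo-scaffold are only useful for identifying TE sequences, not for the final annotation.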