TIR taking a long time - possible workaround? #175

Closed
philippbayer opened this issue Mar 23, 2021 · 4 comments

Comments

@philippbayer

Hi

As others have reported (for example, #66), the TIR step can still run for a long time on highly fragmented genomes with a large number of scaffolds.

I've now found this workaround and I'm wondering whether it's OK. I can't see any drawback.

  1. Split the scaffolds into N files. I used seqkit split -p 500 with a genome with 35,334 scaffolds.
  2. Merge each file into one fake pseudomolecule using echo ">${filename}" > new_file && grep -v '>' "${filename}" >> new_file.
  3. Concatenate all these fake pseudomolecules into one big file, in my case containing 498 fake pseudomolecules.
  4. Run EDTA with that.
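
Steps 1 to 3 above can be sketched in plain Python. This is only an illustration of the idea, not the seqkit/grep commands actually used: `read_fasta`, `fake_pseudomolecules`, and the `pseudo_N` names are hypothetical, and the binning (consecutive records, ceiling-sized bins) is an assumption rather than a description of how `seqkit split -p` distributes records.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the ID (text up to the first whitespace).
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fake_pseudomolecules(records, n_parts):
    """Group records into at most n_parts bins and concatenate each bin.

    Sequences are joined directly, with no spacer, as in the workaround
    described above. Bin names (pseudo_1, pseudo_2, ...) are hypothetical.
    """
    records = list(records)
    if not records:
        return []
    size = -(-len(records) // n_parts)  # ceiling division
    merged = []
    for start in range(0, len(records), size):
        part = records[start:start + size]
        merged.append((f"pseudo_{len(merged) + 1}",
                       "".join(seq for _, seq in part)))
    return merged
```

With this particular binning, 35,334 records and `n_parts=500` give 71 records per bin and 498 bins in total.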

Before this, the TIR step would run out of walltime after 96 hours. With my faked genome the TIR step takes less than 2 hours.

Is this a valid workaround? Does this impact the quality of the predictions somehow? All I can think of is that EDTA might join repeats from randomly adjacent scaffolds into one piece so you get a falsely merged repeat, but that should occur only rarely, right? I can't compare with the un-faked genome as that EDTA run never finished :)

@oushujun
Owner

Yeah, it's a chronic headache...

The ideal solution would be to rewrite the TIR module so that it handles a large number of small scaffolds more efficiently, but my Python skills are only at the read-and-modify level, with occasional debugging luck. So this task has been deep on the waiting list...

Your approach will work when your goal is to identify TIR sequences. However, the final annotation needs to combine both structural and homology annotations, and combining scaffolds into pseudo-chromosomes will not give proper coordinates for the final annotation. And as you mentioned, joining sequences directly may create artificial structures that confuse the program. A workaround for this issue is to put 100 Ns between each join, so that most links between scaffolds are removed in the filtering step.
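
A minimal sketch of the 100-N spacer idea, which also records where each scaffold starts so that coordinates on the joined sequence can be mapped back to the original scaffolds. The function names and the index layout are hypothetical and not part of EDTA. A feature whose start and end map to different scaffolds, or into a spacer, would be a join artifact.

```python
from bisect import bisect_right

SPACER = "N" * 100  # 100 Ns between joined scaffolds, as suggested above


def join_with_spacer(records, spacer=SPACER):
    """Join (name, seq) records with a spacer between them.

    Returns (joined_sequence, index), where index is a list of
    (offset, name, length) tuples with 0-based offsets into the join.
    """
    parts, index, offset = [], [], 0
    for i, (name, seq) in enumerate(records):
        if i:
            parts.append(spacer)
            offset += len(spacer)
        index.append((offset, name, len(seq)))
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), index


def to_scaffold_coords(index, pos):
    """Map a 1-based position on the joined sequence back to
    (scaffold_name, 1-based position), or None if it falls in a spacer
    or outside the joined sequence."""
    starts = [offset for offset, _, _ in index]
    i = bisect_right(starts, pos - 1) - 1
    if i < 0:
        return None
    offset, name, length = index[i]
    local = pos - offset
    return (name, local) if 1 <= local <= length else None
```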

You may also split the genome into smaller portions (without joining the sequences) and run TIR (EDTA_raw.pl --type tir) on each of them within the scheduled time, then manually combine the intact TIR files (both intact.fa and intact.gff3) and place them in the original TIR folder. This way, the program will see the finished TIR files, treat the step as done, and move on.
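
The manual combining step could look roughly like the sketch below. It assumes the per-portion outputs are ordinary FASTA and GFF3 text and that plain concatenation is enough, because unjoined scaffolds keep their original names and coordinates. The exact file names and folder EDTA expects are not spelled out here, so the functions work on text and leave the paths to the caller.

```python
def merge_fasta(texts):
    """Concatenate FASTA texts, refusing duplicate sequence IDs."""
    seen, merged = set(), []
    for text in texts:
        for line in text.splitlines():
            if not line.strip():
                continue
            if line.startswith(">"):
                name = (line[1:].split() or [""])[0]
                if name in seen:
                    raise ValueError(f"duplicate sequence ID: {name}")
                seen.add(name)
            merged.append(line)
    return "".join(line + "\n" for line in merged)


def merge_gff3(texts):
    """Concatenate GFF3 texts, keeping '#' lines only from the first one."""
    merged = []
    for i, text in enumerate(texts):
        for line in text.splitlines():
            if not line.strip():
                continue
            if line.startswith("#") and i > 0:
                continue  # drop repeated headers/directives from later parts
            merged.append(line)
    return "".join(line + "\n" for line in merged)
```

The duplicate-ID check is a design choice for this sketch: if two portions report the same sequence ID, something went wrong in the split, and it seems safer to stop than to merge silently.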

Shujun

@philippbayer
Author

philippbayer commented Mar 25, 2021

Thanks for the feedback! Yes, I should've put Ns in; I guess my solution was too lazy. That would avoid misjoined TEs. Running a bunch of separate TIR jobs is probably a great alternative: a bit fiddlier, but guaranteed to have zero misjoins!

Ideally the TIR step wouldn't make a million files but would instead keep most of that in memory. My cluster's Lustre FS really can't handle that many files, which is why I have to write the results into a Singularity overlay FS. So the slowness comes from the file system as much as from the language.

For the final model I use the TEsorter output anyway: I take the TElib.fa from EDTA, run it through TEsorter, re-classify TEsorter's Unknowns with the class EDTA originally assigned, and run my own RepeatMasker against the original scaffolds. That part shouldn't be problematic, I think, since I'm using the correct reference. But I just checked: EDTA with my fake scaffolds predicted 753 TEs, and the RepeatMasker run with my correct scaffolds found alignments for 746 of those. So 7 TEs are due to misjoins?
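
One way to list which library entries got no alignment is a set difference between the library IDs and the repeat names in the RepeatMasker `.out` table. The sketch below assumes library headers look like `>ID#class` and that the repeat name is the 10th whitespace-separated column of `.out` data rows. Both should be checked against the actual files, and the function names are hypothetical. A missing alignment does not by itself show that an entry came from a misjoin.

```python
def library_ids(fasta_lines):
    """Collect IDs from FASTA headers, dropping any '#class' suffix."""
    return {
        line[1:].split()[0].split("#")[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def repeatmasker_hit_names(out_lines):
    """Collect repeat names from RepeatMasker .out data rows.

    Data rows are assumed to start with an integer score; header and
    blank lines do not, so they are skipped.
    """
    names = set()
    for line in out_lines:
        fields = line.split()
        if len(fields) >= 11 and fields[0].isdigit():
            names.add(fields[9])  # assumed repeat-name column
    return names


def entries_without_hits(fasta_lines, out_lines):
    """Library IDs with no alignment in the RepeatMasker output, sorted."""
    return sorted(library_ids(fasta_lines) - repeatmasker_hit_names(out_lines))
```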

I'll close this issue now. Others who have this problem might use this workaround.

@oushujun
Owner

You may only find out with a more extended comparison. Without knowing the size of the genome, 746 TEs honestly seems too few. The benefit of using EDTA for the final annotation is that you get structurally intact TEs, which is only helpful if you want to do further work on them.

@philippbayer
Author

OK :) it's a seagrass genome, they're all weird. I'll try out the splitting-up-TIR-input thing to see where I'll end up!
