TIR taking a long time - possible workaround? #175

philippbayer · 2021-03-23T04:19:37Z

Hi

As others have reported (for example, in #66), the TIR step can still run for a very long time on highly fragmented genomes with a large number of scaffolds.

I've now found a workaround and I'm wondering whether it's OK. I can't see any drawback.

1. Split the scaffolds into N files. I used seqkit split -p 500 on a genome with 35,334 scaffolds.
2. Merge each file into one fake pseudomolecule using echo ">${filename}" > new_file && grep -v '>' "${filename}" >> new_file.
3. Concatenate all these fake pseudomolecules into one big file, in my case containing 498 fake pseudomolecules.
4. Run EDTA on that file.
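
The steps above can also be sketched in Python. This is only an illustration of the idea, not what seqkit or EDTA do internally: the function names, the `pseudo` prefix, and the choice to group consecutive records are all assumptions made for the example.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def make_pseudomolecules(records, n_parts, prefix="pseudo"):
    """Group consecutive records into at most n_parts groups and
    concatenate each group into one fake pseudomolecule (no spacer)."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    if not records:
        return []
    per_part = -(-len(records) // n_parts)  # ceiling division
    pseudo = []
    for i in range(0, len(records), per_part):
        group = records[i:i + per_part]
        joined = "".join(seq for _, seq in group)
        pseudo.append((f"{prefix}_{len(pseudo) + 1}", joined))
    return pseudo


if __name__ == "__main__":
    demo = read_fasta(">a\nAC\nGT\n>b\nTT\n>c\nGG\n")
    for name, seq in make_pseudomolecules(demo, 2):
        print(f">{name}\n{seq}")
```

Because the sequences are joined directly, neighbouring scaffolds touch each other in the output, which is exactly where falsely merged repeats could arise.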

Before this, the TIR step would run out of walltime after 96 hours. With my faked genome the TIR step takes less than 2 hours.

Is this a valid workaround? Does this impact the quality of the predictions somehow? All I can think of is that EDTA might join repeats from randomly adjacent scaffolds into one piece so you get a falsely merged repeat, but that should occur only rarely, right? I can't compare with the un-faked genome as that EDTA run never finished :)


oushujun · 2021-03-24T15:24:26Z

Yeah, it's a chronic headache...

An ideal solution would be to rewrite the TIR module so that it handles a large number of small scaffolds more efficiently, but my Python skills are only at the read-and-modify level, with occasional debugging luck. So this task has been deep on the waiting list...

Your approach will work when your goal is to identify TIR sequences. The final annotation, however, needs to combine both structural and homology annotations, and scaffolds combined into pseudo-chromosomes will not provide proper coordinates for it. And as you mentioned, joining sequences directly may create artificial structures that confuse the program. A workaround for this issue is to put 100 Ns between each join, so that most links between scaffolds are removed in the filtering step.
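
To illustrate the two points above (an N spacer at each join, and the loss of the original coordinates), here is a minimal Python sketch. It is not part of EDTA; the function names and the layout table are hypothetical helpers. It joins scaffolds with a spacer while recording where each scaffold starts, so a position on the pseudomolecule can be translated back to the original scaffold.

```python
from bisect import bisect_right

SPACER = "N" * 100  # the 100 Ns suggested above


def join_with_spacer(records, spacer=SPACER):
    """Join (name, seq) records with a spacer between neighbours.

    Returns the joined sequence and a layout list of
    (start, end, name) tuples, 0-based and half-open.
    """
    layout, parts, pos = [], [], 0
    for i, (name, seq) in enumerate(records):
        if i:
            parts.append(spacer)
            pos += len(spacer)
        layout.append((pos, pos + len(seq), name))
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), layout


def to_scaffold_coord(layout, pos):
    """Translate a 0-based pseudomolecule position into
    (scaffold_name, 0-based position), or None if it falls in a spacer
    or outside the pseudomolecule."""
    starts = [start for start, _, _ in layout]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    start, end, name = layout[i]
    if pos >= end:
        return None
    return name, pos - start


if __name__ == "__main__":
    joined, layout = join_with_spacer([("s1", "ACGT"), ("s2", "GG")])
    print(len(joined), layout)
    print(to_scaffold_coord(layout, 104))
```

A feature whose start and end map to different scaffolds, or to `None`, would span a join and is a candidate misjoin.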

You may also split the genome into smaller portions (without joining the sequences) and run TIR (EDTA_raw.pl --type tir) on each of them within the scheduled time, then manually combine the intact TIR files (both intact.fa and intact.gff3) and place them in the original TIR folder. This way, the program will see finished TIR files, treat the step as done, and move on.
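
A rough Python sketch of the bookkeeping around this suggestion follows. It does not call EDTA and assumes nothing about EDTA's file names or folder layout; both helper functions are hypothetical. The first assigns whole scaffolds to batches of roughly equal total length, so each TIR run takes a similar time; the second concatenates per-batch GFF3 text while keeping the header lines of the first file only.

```python
import heapq


def balance_batches(lengths, n_batches):
    """Greedily assign scaffolds (dict name -> length) to n_batches
    so that the total bp per batch is roughly equal.
    Longest scaffolds are placed first, each into the lightest batch."""
    heap = [(0, i) for i in range(n_batches)]
    batches = [[] for _ in range(n_batches)]
    ordered = sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, length in ordered:
        total, i = heapq.heappop(heap)
        batches[i].append(name)
        heapq.heappush(heap, (total + length, i))
    return batches


def merge_gff3(texts):
    """Concatenate GFF3 texts, keeping '#' lines from the first text only
    and skipping blank lines."""
    out = []
    for idx, text in enumerate(texts):
        for line in text.splitlines():
            if not line.strip():
                continue
            if line.startswith("#") and idx > 0:
                continue
            out.append(line)
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    print(balance_batches({"a": 10, "b": 7, "c": 5, "d": 4}, 2))
```

Since the scaffolds are never joined, the coordinates in the merged files refer to the original scaffold names, and the per-batch FASTA files can simply be concatenated.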

Shujun

philippbayer · 2021-03-25T03:13:16Z

Thanks for the feedback! Yes, I should have put Ns in to avoid misjoined TEs; I guess my solution was too lazy. Running a bunch of separate TIR jobs is probably a great alternative. It's a bit fiddlier, but it's guaranteed to have zero misjoins!

Ideally the TIR step wouldn't create a million files but would instead keep most of that in memory. My cluster's Lustre FS really can't handle that many files, which is why I have to write the results into a Singularity overlay FS. So the slowness comes from the file system as much as from the language.

For the final model I use the TESorter output anyway: I take the TElib.fa from EDTA, put that into TESorter, re-classify TESorter's Unknowns with the class originally assigned by EDTA, and run my own RepeatMasker on the original scaffolds. That part shouldn't be problematic, I think, as I'm using the correct reference. But I just checked: EDTA with my fake scaffolds predicted 753 TEs, and the RepeatMasker run with my correct scaffolds found alignments for 746 of those. So 7 TEs are due to misjoins?
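
The check described above (which library entries never align to the real scaffolds) can be sketched as a set difference. This is a minimal illustration under two assumptions that should be verified against the actual files: that library headers have the form `>name#class`, and that the repeat name is the tenth whitespace-separated column of a RepeatMasker `.out` file. The function names are hypothetical.

```python
def library_ids(fasta_text):
    """Collect TE names from FASTA headers, dropping any '#class' suffix
    (assumed header form: >name#class)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            ids.add(line[1:].split()[0].split("#")[0])
    return ids


def repeatmasker_hit_ids(out_text):
    """Collect repeat names from RepeatMasker .out text.
    Data lines are assumed to start with an integer score and to carry
    the repeat name in the tenth column; header lines are skipped."""
    ids = set()
    for line in out_text.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[0].isdigit():
            ids.add(fields[9])
    return ids


def unmatched(lib_ids, hit_ids):
    """Library entries with no alignment at all, sorted for stable output."""
    return sorted(lib_ids - hit_ids)


if __name__ == "__main__":
    lib = library_ids(">TE_1#DNA/Helitron\nACGT\n>TE_2#LTR/Gypsy\nGGCC\n")
    out = (
        "   SW   perc perc perc  query  position in query\n"
        "score   div. del. ins.  sequence  begin end (left)\n"
        "\n"
        "  320 15.6 6.2 0.0 scaf1 10 90 (100) + TE_1 DNA/Helitron 1 80 (0) 1\n"
    )
    print(unmatched(lib, repeatmasker_hit_ids(out)))
```

An entry without hits is only a candidate misjoin; it could also be missing for other reasons, such as RepeatMasker score thresholds.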

I'll close this issue now. Others who have this problem might use this workaround.

oushujun · 2021-03-25T10:43:31Z

You may only find out with a more extended comparison. Without knowing the size of the genome, 746 TEs honestly seems too few. The benefit of using EDTA for the final annotation is that you get structurally intact TEs, which is only helpful if you want to do something with them.

philippbayer · 2021-03-30T01:23:50Z

OK :) it's a seagrass genome, they're all weird. I'll try out the splitting-up-TIR-input thing to see where I'll end up!

philippbayer closed this as completed Mar 25, 2021

This was referenced Apr 19, 2021

Split genome into several sequences #142

Closed

How to run EDTA in large genomes (>10Gb)? #61

Closed

EDTA_raw.pl run with -type tir is glitchy #135

Closed

bvs mentioned this issue Jul 5, 2022

Annotating draft genome with thousands of sequences #276

Closed
