How to get softmasked genome as output #166

Closed
romseg opened this issue Feb 24, 2021 · 18 comments
Labels
help wanted Extra attention is needed

Comments

@romseg

romseg commented Feb 24, 2021

Dear author,

Is it possible to get a softmasked genome instead of the default hardmasked one? Softmasking is sometimes required or recommended as input by other annotators (other than Maker) or by mapping programs, so this option would be very useful. If it is not currently available in Braker, I would appreciate your suggestions on how to convert the hardmasked file to a softmasked one. Thanks!
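The conversion asked about here (hardmasked to softmasked) can be sketched in a few lines when the unmasked genome is still available. This is only an illustration and not something EDTA provides: the function name `hard_to_soft` is hypothetical, and it assumes the two sequences come from the same assembly, have the same length, and that the mask character is `N`.

```python
def hard_to_soft(unmasked: str, hardmasked: str, mask_char: str = "N") -> str:
    """Lowercase every base that is masked in `hardmasked` but not in `unmasked`.

    Assumption: both strings are the same sequence from the same assembly,
    so position i in one corresponds to position i in the other.
    """
    if len(unmasked) != len(hardmasked):
        raise ValueError("sequences differ in length; are they from the same assembly?")
    out = []
    for orig, masked in zip(unmasked, hardmasked):
        if masked.upper() == mask_char and orig.upper() != mask_char:
            out.append(orig.lower())   # base was hardmasked: restore it in lowercase
        else:
            out.append(orig)           # untouched base, or a real N/gap in the assembly
    return "".join(out)


print(hard_to_soft("ACGTNACGT", "ACNNNACGT"))  # ACgtNACGT
```

Real gaps (N in the unmasked genome) stay as N, so only bases that the masking step replaced are turned into lowercase.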

@oushujun
Owner

oushujun commented Feb 25, 2021 via email

@romseg
Author

romseg commented Feb 26, 2021

The usage for 'make_masked.pl' is:

Usage: perl make_masked.pl -genome unmasked_genome.fa [options]
		-rmout	[file]	Required. The repeatmasker.out file

But I don't have the 'repeatmasker.out' file. Can I use the hardmasked EDTA output file 'genome.fa.new.masked' instead?

Thanks for your help!

@oushujun
Owner

oushujun commented Feb 26, 2021 via email

@romseg
Author

romseg commented Feb 26, 2021

Oh, I see. I believe it is this one: 'genome.fa.mod.EDTA.RM.out'. I will give it a try. Thanks for your help! :)

Rom

@romseg
Author

romseg commented Mar 2, 2021

Hi Shujun,

It did its job, but in addition to softmasking all sequences that were hardmasked in the original 'genome.fa.mod.MAKER.masked' (99 Mbp), 'make_masked.pl' with 'genome.fa.mod.EDTA.RM.out' softmasked an extra ~50 Mbp (149 Mbp in total). It softmasked additional short fragments and in many cases extended the previously hardmasked fragments. I can't tell what these extra softmasked sequences are. I am wondering why there is a difference and which masked version would be more useful for gene annotation (with Maker and/or Braker). At first glance, the softmasked version generated with RM.out seems more complete (149 Mbp). Thanks!

Best,
Rom
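Totals like the 99 Mbp and 149 Mbp quoted above can be obtained by counting lowercase and N characters in each masked FASTA file. The sketch below is one way to do that; the function name `count_masking` is hypothetical, and it counts every N (including assembly gaps) as hardmasked, which is an assumption that may overstate hardmasking.

```python
def count_masking(fasta_lines):
    """Count total, softmasked (lowercase) and N/n characters in FASTA lines."""
    total = soft = n_count = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        n_here = line.count("N") + line.count("n")
        total += len(line)
        n_count += n_here
        # lowercase letters other than 'n' are treated as softmasked bases
        soft += sum(ch.islower() for ch in line) - line.count("n")
    return {"total": total, "soft": soft, "n": n_count}


demo = [">chr1", "ACGTacgtNNNN", "acgtACGTnn"]
print(count_masking(demo))  # {'total': 22, 'soft': 8, 'n': 6}
```

For a real genome, pass an open file handle instead of the demo list, for example `count_masking(open("genome.fa.mod.MAKER.masked"))`.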

@oushujun
Owner

oushujun commented Mar 3, 2021 via email

@romseg
Author

romseg commented Mar 4, 2021

Hi Shujun,

That makes sense. It is good to avoid masking genic regions, especially for annotation.

One final question on this masking topic: in the stats of my sum file, 256191256 bp [256 Mbp] (51.54% of the total length) is reported as bpMasked (please see below), since these bases were identified as TEs. This number is higher than the number of hardmasked bp in the MAKER.masked file (99 Mbp) or in the softmasked one I produced with the 'make_masked.pl' script (149 Mbp). Is this difference also intended to avoid masking genic regions? At first glance, going from 256 to 99 Mbp seems a big reduction, but maybe I am not interpreting the results in the sum file correctly. I would be grateful to have your thoughts. Thanks!
Btw, this is a plant genome.

Repeat Classes
==============
Total Sequences: 396
Total Length: 497039057 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
LTR                    --           --           --   
    Copia              144280       69906519     14.06% 
    Gypsy              39887        29166065     5.87% 
    unknown            88174        32473354     6.53% 
TIR                    --           --           --   
    CACTA              65760        25490265     5.13% 
    Mutator            86889        39879220     8.02% 
    PIF_Harbinger      85261        22057616     4.44% 
    Tc1_Mariner        7590         1603803      0.32% 
    hAT                73115        24595527     4.95% 
nonTIR                 --           --           --   
    helitron           45381        11018887     2.22% 
                      ---------------------------------
    total interspersed 636337       256191256    51.54%

---------------------------------------------------------
Total                  636337       256191256    51.54%

The best,
Rom
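As a quick consistency check on the sum-file table above, the per-superfamily bpMasked values can be added up and divided by the total length. The numbers are copied from the table; the dictionary keys are just labels chosen for this example.

```python
# bpMasked per superfamily, copied from the sum-file table in this comment
bp_masked = {
    "LTR/Copia": 69906519,
    "LTR/Gypsy": 29166065,
    "LTR/unknown": 32473354,
    "TIR/CACTA": 25490265,
    "TIR/Mutator": 39879220,
    "TIR/PIF_Harbinger": 22057616,
    "TIR/Tc1_Mariner": 1603803,
    "TIR/hAT": 24595527,
    "nonTIR/helitron": 11018887,
}
total_length = 497039057  # "Total Length" from the same table

masked = sum(bp_masked.values())
print(masked)                                   # 256191256
print(f"{100 * masked / total_length:.2f}%")    # 51.54%
```

The rows add up exactly to the reported total, so the 256 Mbp figure is the sum over all annotated TE superfamilies.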

@oushujun
Owner

oushujun commented Mar 5, 2021

Hi Rom,

The sum file covers all sequences that EDTA considers to be TEs. The MAKER.masked file is a subset of the sum file, which was produced by make_masked.pl with the parameters -maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N. You may need to change the parameters for make_masked.pl to make a softmasked version close to what's described in the sum file. I also need to correct myself: you'd better use the EDTA.anno/*EDTA.TEanno.out file to produce the masked genome, because it is the most complete.
E.g.:
perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out

Best,
Shujun
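To make the effect of these thresholds concrete, here is a minimal sketch of filtering RepeatMasker-style .out records and softmasking the surviving intervals. It is not the code of make_masked.pl: the function names are hypothetical, the meaning of each threshold (divergence at most maxdiv, score at least minscore, length at least minlen) is inferred from the flag names, and whether the real script uses inclusive comparisons is an assumption.

```python
def parse_rm_out(lines):
    """Yield (score, divergence, seq_id, start, end) from RepeatMasker .out lines."""
    for line in lines:
        fields = line.split()
        # Data rows start with an integer SW score; header and blank lines do not.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        yield int(fields[0]), float(fields[1]), fields[4], int(fields[5]), int(fields[6])


def select_hits(records, maxdiv=30.0, minscore=1000, minlen=1000):
    """Keep hits passing all three thresholds (inclusive bounds are an assumption)."""
    return [r for r in records
            if r[1] <= maxdiv and r[0] >= minscore and (r[4] - r[3] + 1) >= minlen]


def softmask(seq, intervals):
    """Lowercase the given 1-based, inclusive intervals of seq."""
    chars = list(seq)
    for start, end in intervals:
        chars[start - 1:end] = [c.lower() for c in chars[start - 1:end]]
    return "".join(chars)


demo_out = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)    repeat    class/family",
    "",
    " 1200   10.0  0.0  0.0  chr1      3     8     (12)   +  TE_a      LTR/Copia   1 6 (0) 1",
    "  300   35.0  0.0  0.0  chr1      12    14    (6)    +  TE_b      DNA/hAT     1 3 (0) 2",
]
seq = "ACGTACGTACGTACGTACGT"

# minlen is lowered to 5 only because the demo sequence is tiny
hits = select_hits(parse_rm_out(demo_out), maxdiv=30, minscore=1000, minlen=5)
print(softmask(seq, [(h[3], h[4]) for h in hits]))  # ACgtacgtACGTACGTACGT
```

Relaxing the thresholds lets more hits through, which is the direction of the change from the MAKER.masked settings to the -minlen 80 example in this comment.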

@romseg
Author

romseg commented Mar 6, 2021

Hi Shujun,

It worked pretty well! Masking with genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out and the suggested parameters produced 254947490 softmasked bp, which is very close to the 256191256 bpMasked reported in the sum file of my genome.

It is good to have all these masking alternatives for downstream processing. Thanks for the assistance and for designing EDTA! It is a great program that makes research so much easier.

All my questions were answered and this thread can be closed.

The best,
Rom
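How close the two figures reported above are can be worked out directly; the variable names below are chosen for this example only.

```python
sum_file_masked = 256191256     # bpMasked reported in the sum file
softmasked_teanno = 254947490   # softmasked bp obtained with TEanno.out and -minlen 80

difference = sum_file_masked - softmasked_teanno
print(difference)                                             # 1243766
print(f"{100 * softmasked_teanno / sum_file_masked:.2f}%")    # 99.51%
```

So the softmasked genome covers about 99.5% of the bases counted as TEs in the sum file, with roughly 1.2 Mbp left unmasked.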

@oushujun oushujun closed this as completed Mar 7, 2021
@oushujun oushujun added the help wanted Extra attention is needed label Mar 7, 2021
@SC-Duan

SC-Duan commented Jun 10, 2021

Hi Shujun,
I want to get a softmasked genome using my own repeat library and feed it to BRAKER. I don't have the mod.EDTA.TEanno.out file, so I want to use RepeatMasker and would like to ask for your help.

  1. You said "The MAKER.masked file is a subset of the sum file, which was produced by make_masked.pl with parameters -maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N." Are these parameters recommended for BRAKER? Or should I "change the parameter for make_masked.pl to make a softmasked version close to what's described in the sum file"?
  2. Do the parameters "-maxdiv 30 -minscore 1000" in make_masked.pl correspond to "-div 30 -cutoff 1000" in RepeatMasker? And what corresponds to "-minlen 1000"? If so, should I still set "-nolow -norna"?
  3. I checked the EDTA.pl file and found these lines:
    #make low-threshold masked genome for MAKER
    `perl $make_masked -genome $genome -rmout $genome.out -maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N -threads $threads -exclude $exclude`
    Should I use the EDTA.anno/*EDTA.TEanno.out file or genome.fa.mod.EDTA.RM.out to mask the genome?
    Thank you very much!

@oushujun
Owner

@dzaccook

  1. To get a softmasked genome you need to use -hardmask 0. For the other parameters, I am not sure whether there is a better parameter space for BRAKER. The purpose of this script is to filter out short TEs, since some of them overlap with genes, and masking such sequences may interfere with gene annotation algorithms. Frankly speaking, I am not familiar with the algorithms of gene annotators, so you may need to play around with different settings to find out.

  2. Yes, (-maxdiv 30 -minscore 1000) = (-div 30 -cutoff 1000). There is no equivalent parameter in RepeatMasker for "-minlen 1000", AFAIK. For the purpose of removing non-genic sequences, you probably want to include "-nolow -norna", but this is a presumption and has not been fully benchmarked.

  3. It doesn't really matter that much. The two files overlap heavily, and the entries that do not overlap but still pass the filtering scheme probably won't have a huge impact on your gene annotation.

Shujun
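The parameter mapping in point 2 can be written out as a RepeatMasker command line. The sketch below only assembles and prints the command; it does not run anything. `-lib`, `-pa`, `-div`, `-cutoff`, `-nolow`, `-norna` and `-xsmall` are RepeatMasker options (`-xsmall` returns repeats in lowercase, i.e. softmasked), while the function name, file names and thread count are placeholders, and whether these settings suit BRAKER is untested, as noted in the reply above.

```python
import shlex


def repeatmasker_cmd(genome, library, threads=4, div=30, cutoff=1000,
                     skip_low=True, skip_rna=True):
    """Assemble a RepeatMasker argument list mirroring -maxdiv/-minscore."""
    cmd = ["RepeatMasker", "-lib", library, "-pa", str(threads),
           "-div", str(div), "-cutoff", str(cutoff),
           "-xsmall"]                 # lowercase (softmask) instead of N
    if skip_low:
        cmd.append("-nolow")          # do not mask low-complexity DNA / simple repeats
    if skip_rna:
        cmd.append("-norna")          # do not mask small RNA (pseudo)genes
    cmd.append(genome)
    return cmd


# "my_TE_lib.fa" and "genome.fa" are placeholder file names
print(shlex.join(repeatmasker_cmd("genome.fa", "my_TE_lib.fa")))
```

Because there is no RepeatMasker counterpart to -minlen, a length filter would still have to be applied afterwards, for instance by passing the resulting .out file to make_masked.pl with -hardmask 0.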

@SC-Duan

SC-Duan commented Jun 11, 2021

Hi Shujun,
Thank you very much! I will try it.
The best,
zac

@FengjuanjuanCMS

I used make_masked.pl and all the output files are empty. Has anyone encountered this and can explain the reason?

Thank you very much

@oushujun
Owner

oushujun commented Jul 5, 2022

@FengjuanjuanCMS you may need to check the RepeatMasker output file provided to the -rmout parameter. --Shujun
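One way to check the file given to -rmout is to confirm that it contains data rows at all and that its sequence names also occur in the genome FASTA. The thread does not say what caused the empty output, so mismatched names are only a guess at one possible cause; the function names here are hypothetical.

```python
def fasta_ids(lines):
    """Collect the first word of every FASTA header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def rmout_ids(lines):
    """Collect query sequence names (5th column) from RepeatMasker .out data rows."""
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 7 and fields[0].isdigit():
            ids.add(fields[4])
    return ids


def check_inputs(fasta_lines, rmout_lines):
    genome, annotated = fasta_ids(fasta_lines), rmout_ids(rmout_lines)
    return {
        "rmout_has_records": bool(annotated),
        "shared": sorted(genome & annotated),
        "only_in_rmout": sorted(annotated - genome),
    }


demo_fasta = [">chr1 first contig", "ACGT", ">chr2", "ACGT"]
demo_rmout = ["1200 10.0 0.0 0.0 chr1 3 8 (12) + TE_a LTR/Copia 1 6 (0) 1",
              "1500 12.0 0.0 0.0 scaffold_9 1 9 (0) + TE_b DNA/hAT 1 9 (0) 2"]
print(check_inputs(demo_fasta, demo_rmout))
```

An empty "shared" list, or `rmout_has_records` being False, would point at the annotation file rather than at make_masked.pl itself.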

@Wanjie-Feng

@oushujun
Hi Shujun,
I wonder whether the softmasked genome can be used directly for subsequent gene structure annotation if I use the following command to softmask my genome:

perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out

Do I need to consider simple repeat sequences? In addition, are the telomere sequences softmasked by the above command?

@oushujun
Owner

oushujun commented Jan 11, 2024 via email

@Wanjie-Feng

1.

Hi Rom,

The sum file has all sequences of what EDTA believes as TEs. The MAKER.masked file is a subset of the sum file, which was produced by make_masked.pl with parameters -maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N. You may need to change the parameter for make_masked.pl to make a softmasked version close to what's described in the sum file. I need to correct myself, that you'd better use the EDTA.anno/*EDTA.TEanno.out file to produce the masked genome because this is the most complete.
eg.
perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out

Best,
Shujun

2.

“4. Low-threshold TE masking: $genome.mod.MAKER.masked. This is a genome file with only long TEs (>=1 kb) being masked. You may use this for de novo gene annotations. In practice, this approach will reduce overmasking for genic regions, which can improve gene prediction quality. However, initial gene models should contain TEs and need further filtering. ”

3.

Mostly just TEs. For gene annotation purpose you may want to unmask shorter TEs(eg <500bp) to preserve the gene space. Check out the wiki. Shujun

-------------------------

Based on the information above, I think the following command is appropriate for producing a softmasked genome for de novo gene structure annotation:

perl ../util/make_masked.pl -genome genome.fa -minlen 500 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out

@chun-he-316

Hi Shujun,
I want to get a softmasked genome. I used the command "perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out", but this error occurred: "Permission denied ../util/make_masked.pl line 54."
Please tell me how to solve this issue. Thanks.
The best,
Chun
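A "Permission denied" error usually means that a file could not be read or that the working directory is not writable. What line 54 of make_masked.pl actually does is not shown in this thread, so the sketch below is only a generic check of the paths involved; the function name `check_access` is hypothetical.

```python
import os


def check_access(script, inputs, workdir="."):
    """Return a list of readable/writable problems for the given paths."""
    problems = []
    if not os.access(script, os.R_OK):
        problems.append(f"cannot read script: {script}")
    for path in inputs:
        if not os.access(path, os.R_OK):
            problems.append(f"cannot read input: {path}")
    # Writing output files needs write and search permission on the directory.
    if not os.access(workdir, os.W_OK | os.X_OK):
        problems.append(f"cannot write to directory: {workdir}")
    return problems


for problem in check_access(
    "../util/make_masked.pl",
    ["genome.fa", "genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out"],
):
    print(problem)
```

If the working directory turns out not to be writable, running the command from a directory you own (with full paths to the inputs) may be worth trying.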
