Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in GERP analysis #97

Open
JuliusFalck opened this issue Dec 9, 2024 · 11 comments
Open

Error in GERP analysis #97

JuliusFalck opened this issue Dec 9, 2024 · 11 comments
Assignees
Labels
bug Something isn't working next version Include this issue into the next pipeline version

Comments

@JuliusFalck
Copy link

I am having trouble getting the GERP part of the pipeline to work. I get this error: a fasta file is missing for one chunk of one of the outgroup species.

This is the Error in the main log:

Error in rule concatenate_fasta_per_contig:

results/logs/13_GERP/genome_chunks/ref_Hirundo_rustica/fasta/chunk38_concatenate_fasta_per_contig.log:

cat: results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/Lagopus_muta_NC_053487.1.fasta: No such file or directory

It looks like all the other Fasta chunks have been created except for this one.

My understanding is that the rule bam2fasta should produce the missing file. However, there is no error in that rule log. The only difference with other bam2fasta logs is that the other logs have this line:

[mpileup] 1 samples in 1 input files

right after: Activating singularity image...

But the log for Lagopus_muta chunk38 does not have that line.

Any suggestions for things to try would be very appreciated.

@verku
Copy link
Collaborator

verku commented Dec 11, 2024

Hi Julius!

This part of the pipeline is written in a way so that Snakemake only checks for the presence of the directories holding the files that are produced by the different rules, i.e. Snakemake itself is aware of the folder results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/ but not of the file results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/Lagopus_muta_NC_053487.1.fasta. As a first test I would therefore suggest that you delete the folder results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/ and re-start the pipeline, to see if it was a cluster-related error (which has been in my experience often the case with this type of error). If you still get the same error, let me know!

@verku verku self-assigned this Dec 11, 2024
@JuliusFalck
Copy link
Author

Thanks, I think that worked, or at least I no longer have that error. Sadly, I still can'tant get the GERP analysis to work. My new error is in "group_CpG_genotype_bed_formatting_group" this is the error part of the slurm log.

`Error in rule merge_CpG_genotype_beds:
jobid: 1
output: results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed, results/ref_Hirundo_rustica.merged.CpG_vcf.bed
log: results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log (check log file(s) for error details)
shell:

    files=`echo  | awk '{print NF}'`
    if [ $files -gt 1 ] # check if there are at least 2 files for merging. If there is only one file, copy the sorted bam file.
    then
      cat  | sort -k1,1 -k2,2n > results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed 2> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log &&
      bedtools merge -i results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed > results/ref_Hirundo_rustica.merged.CpG_vcf.bed 2>> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log
    else
      touch results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed && cp  results/ref_Hirundo_rustica.merged.CpG_vcf.bed 2> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log &&
      echo "Only one file present for merging. Copying the input bed file." >> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log
    fi
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

`

And the ref_Hirundo_rustica_merge_CpG_genotype_beds.log says:
cp: missing destination file operand after 'results/ref_Hirundo_rustica.merged.CpG_vcf.bed' Try 'cp --help' for more information.

I have tried rerunning the pipeline several times and deleted results to get it to rerun stuff but to no avail. I moved the pipeline to PDC/Darde, and I am trying to continue running the gerp part from there. Could this be a problem? I got the conda environment working in my Dardel project directory.

@JuliusFalck JuliusFalck changed the title Error in rule concatenate_fasta_per_contig: missing fasta file Error in GERP analysis Jan 11, 2025
@verku
Copy link
Collaborator

verku commented Jan 13, 2025

Hi Julius, thanks for reaching out again! I can see that there is a bug in the code. I'll fix it as soon as possible and will let you know when the code is updated!

@verku verku added bug Something isn't working next version Include this issue into the next pipeline version labels Jan 13, 2025
@JuliusFalck
Copy link
Author

Thank you very much; now that I know it is a bug, I will stop trying to fix it in vain.

@verku
Copy link
Collaborator

verku commented Jan 24, 2025

Hi @JuliusFalck ! I had another look at the code and I think the error happens somewhere else in the code or in the configuration of the pipeline. Can you send me this part of your config file?

#####
# OPTIONAL:
# Identify CpG sites for removal from mlRho analyses, from VCF files 
# and downstream analyses and define samples to be CpG filtered. 
# Three different methods are available to identify CpG sites.

# This step will generate several BED files containing genome 
# coordinates of CpG sites (file names ending with: "*.CpG_method.bed"), 
# all genome regions outside of CpG sites ("*.noCpG_method.bed"), 
# as well as intersected BED files of coordinates of CpG sites 
# and repeat elements ("*.CpG_method.repeats.bed") and regions 
# outside of CpG sites and repeat elements ("*.noCpG_method.repma.bed").

# Set CpG_identification and one of the three methods listed 
# below to True and specify samples in the list below in 
# which CpG sites should be identified and/or from which 
# CpG sites should be removed.
CpG_identification: False

# Method 1: 
# Identify CpG sites in single-individual VCF files of samples 
# listed below. BED files with CpG sites from all samples are 
# merged into one file for CpG site filtering. Only genotype 
# information is considered for CpG site identification.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf: False

# Method 2: 
# Identify CpG sites in the reference genome.
# Ignores genotype information of samples mapped to the reference 
# genome.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_reference: False

# Method 3: 
# Identify CpG sites in single-individual VCF files and in the 
# reference genome. BED files with CpG sites from all samples 
# and from the reference genome are merged into one file for 
# CpG site filtering.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf_and_reference: False

# Combined list of modern and historical samples in which CpG sites 
# should be identified if CpG_from_vcf or CpG_from_vcf_and_reference 
# is set to True, and from which CpG sites should be removed based 
# on any of the three available CpG identification methods chosen above.
# Sample names without lane or index number in quotation marks, 
# separated by commas (e.g. ["VK01", "VK02", "VK03"]).
# List has to be left empty ([]) if neither CpG identification nor 
# CpG site filtering is run.
# Keep the list of sample names for the entire pipeline run 
# so that the correct files are used for downstream analyses.
CpG_samplenames: []
#####

@JuliusFalck
Copy link
Author

JuliusFalck commented Jan 24, 2025

Hi @verku, here is that part of the config file, tell me if there is anything else you need.

# OPTIONAL:
# Identify CpG sites for removal from mlRho analyses, from VCF files 
# and downstream analyses and define samples to be CpG filtered. 
# Three different methods are available to identify CpG sites.

# This step will generate several BED files containing genome 
# coordinates of CpG sites (file names ending with: "*.CpG_method.bed"), 
# all genome regions outside of CpG sites ("*.noCpG_method.bed"), 
# as well as intersected BED files of coordinates of CpG sites 
# and repeat elements ("*.CpG_method.repeats.bed") and regions 
# outside of CpG sites and repeat elements ("*.noCpG_method.repma.bed").

# Set CpG_identification and one of the three methods listed 
# below to True and specify samples in the list below in 
# which CpG sites should be identified and/or from which 
# CpG sites should be removed.
CpG_identification: True

# Method 1: 
# Identify CpG sites in single-individual VCF files of samples 
# listed below. BED files with CpG sites from all samples are 
# merged into one file for CpG site filtering. Only genotype 
# information is considered for CpG site identification.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf: True

# Method 2: 
# Identify CpG sites in the reference genome.
# Ignores genotype information of samples mapped to the reference 
# genome.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_reference: False

# Method 3: 
# Identify CpG sites in single-individual VCF files and in the 
# reference genome. BED files with CpG sites from all samples 
# and from the reference genome are merged into one file for 
# CpG site filtering.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf_and_reference: False

# Combined list of modern and historical samples in which CpG sites 
# should be identified if CpG_from_vcf or CpG_from_vcf_and_reference 
# is set to True, and from which CpG sites should be removed based 
# on any of the three available CpG identification methods chosen above.
# Sample names without lane or index number in quotation marks, 
# separated by commas (e.g. ["VK01", "VK02", "VK03"]).
# List has to be left empty ([]) if neither CpG identification nor 
# CpG site filtering is run.
# Keep the list of sample names for the entire pipeline run 
# so that the correct files are used for downstream analyses.
CpG_samplenames: ["Heteromirafaarcheri70new", "Heteromirafaarcheri68new", "Heteromirafaarcheri69new", "Heteromirafaarcheri34new", "Calendulaudaburra19new", "Calendulaudaburra24new", "Spizocoryssclateri15new", "Spizocoryssclateri52new", "Spizocoryssclateri16new", "Amirafrarufocinnamomea33new", "Calendulaudaafricanoides47new", "Calendulaudaafricanoides43new", "Calendulaudaalbescens22new", "Calendulaudaalbescens23new", "Calendulaudaerythrochlamys17new", "Calendulaudaerythrochlamys31new", "Calendulaudasabota49new", "Calendulaudasabota50new", "Coryphaafricana51new", "Coryphaafricana18new", "Coryphaathi74new", "Mirafrapasserina37new", "Mirafrapasserina1new", "Mirafrapasserina38new", "Mirafrapasserina39new", "Mirafrapasserina12new", "Mirafrapasserina13new", "Mirafrapasserina14new", "Plocealaudaaffinis71new", "Plocealaudaaffinis77new", "Plocealaudaassamica72new", "Plocealaudaerythrocephala41new", "Plocealaudaerythrocephala42new", "Plocealaudaerythrocephala76new", "Plocealaudaerythroptera67new", "Plocealaudamicroptera36new", "Plocealaudamicroptera66new", "Plocealaudamicroptera73new", "Heteromirafraruddi60new", "Spizocorysfringillaris44new", "Spizocorysfringillaris59new", "Calendulaudaburra25new", "Spizocoryssclateri53new", "Coryphafasciolata32new", "Amirafrarufocinnamomea40new", "Coryphafasciolata75new", "Amirafrarufocinnamomea58new", "Calendulaudaafricanoides20new", "Calendulaudaalbescens21new", "Calendulaudaerythrochlamys61new", "Calendulaudasabota62new", "Calendulaudasabota35new", "Coryphaathi55new", "Coryphaathi56new", "Coryphafasciolata46new", "Coryphafasciolata30new", "Coryphafasciolata65new", "Mirafracheniana63new", "Mirafracheniana64new", "Mirafrapasserina48new", "Spizocorysconirostris45new", "Spizocorysstarki54new", "Alaudarazae2new", "Alaudarazae3new", "Alaudarazae4new", "Alaudarazae5new", "Alaudarazae6new", "Alaudarazae7new", "Alaudarazae8new", "Alaudarazae9new", "Alaudarazae10new", "Alaudarazae11new", "Chersophilusduponti27new", "Chersophilusduponti28new", "Chersophilusduponti29new", "Ramphocorisclotbey57new", "Chersomanesalbofasciata26new"]
#####```

@verku
Copy link
Collaborator

verku commented Jan 24, 2025

Thank you! I'm looking for a reason why your Snakemake run did not assign anything as input to the rule "merge_CpG_genotype_beds" but this part of the config file looks good. Could you send me the entire config file, just to be sure that the rest is correct? You should be able to share the file here in the Github issue.

@JuliusFalck
Copy link
Author

JuliusFalck commented Jan 24, 2025

Here is the config file I had to change it to txt to share it.

config.txt

@verku
Copy link
Collaborator

verku commented Jan 24, 2025

Hi! Thank you for sharing, the config file looks good! I think that somehow the input files for this rule are missing, maybe when moving the data over to Dardel. Can you check if these files are present?

results/modern/vcf/ref_Hirundo_rustica/*.merged.rmdup.merged.realn.Q30.sorted.CpG.bed

These files should have been created by the rule make_CpG_genotype_bed.

I found one little typo in your config file that should not affect any of these issues: the path to the sex chromosome file should be "config/chrsex.txt", i.e. dropping the slash in the beginning. :-)

@JuliusFalck
Copy link
Author

Yes, currently, I do not have dose results files. What do I need to do to remake them? Do I have to remove files in the reference directory to get GenErode to make them, or just run the whole pipeline again?

@verku
Copy link
Collaborator

verku commented Jan 27, 2025

Yes, exactly! When you restart the pipeline next time, Snakemake will run the rule again because it will check for the presence of these files. Good luck!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working next version Include this issue into the next pipeline version
Projects
None yet
Development

No branches or pull requests

2 participants