Error in GERP analysis #97

JuliusFalck · 2024-12-09T09:07:03Z

I am having trouble getting the GERP part of the pipeline to work. I get this error: a fasta file is missing for one chunk of one of the outgroup species.

This is the Error in the main log:

Error in rule concatenate_fasta_per_contig:

results/logs/13_GERP/genome_chunks/ref_Hirundo_rustica/fasta/chunk38_concatenate_fasta_per_contig.log:

cat: results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/Lagopus_muta_NC_053487.1.fasta: No such file or directory

It looks like all the other Fasta chunks have been created except for this one.

My understanding is that the rule bam2fasta should produce the missing file. However, there is no error in that rule log. The only difference with other bam2fasta logs is that the other logs have this line:

[mpileup] 1 samples in 1 input files

right after: Activating singularity image...

But the log for Lagopus_muta chunk38 does not have that line.

Any suggestions for things to try would be very appreciated.

The text was updated successfully, but these errors were encountered:

verku · 2024-12-11T14:54:46Z

Hi Julius!

This part of the pipeline is written in a way so that Snakemake only checks for the presence of the directories holding the files that are produced by the different rules, i.e. Snakemake itself is aware of the folder results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/ but not of the file results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/Lagopus_muta_NC_053487.1.fasta. As a first test I would therefore suggest that you delete the folder results/gerp/genome_chunks/ref_Hirundo_rustica/fasta/Lagopus_muta_chunk38/ and re-start the pipeline, to see if it was a cluster-related error (which has been in my experience often the case with this type of error). If you still get the same error, let me know!

JuliusFalck · 2025-01-11T09:24:08Z

Thanks, I think that worked, or at least I no longer have that error. Sadly, I still can'tant get the GERP analysis to work. My new error is in "group_CpG_genotype_bed_formatting_group" this is the error part of the slurm log.

`Error in rule merge_CpG_genotype_beds:
jobid: 1
output: results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed, results/ref_Hirundo_rustica.merged.CpG_vcf.bed
log: results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log (check log file(s) for error details)
shell:

    files=`echo  | awk '{print NF}'`
    if [ $files -gt 1 ] # check if there are at least 2 files for merging. If there is only one file, copy the sorted bam file.
    then
      cat  | sort -k1,1 -k2,2n > results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed 2> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log &&
      bedtools merge -i results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed > results/ref_Hirundo_rustica.merged.CpG_vcf.bed 2>> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log
    else
      touch results/ref_Hirundo_rustica.concatenated.CpG_vcf.bed && cp  results/ref_Hirundo_rustica.merged.CpG_vcf.bed 2> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log &&
      echo "Only one file present for merging. Copying the input bed file." >> results/logs/5_CpG_identification/ref_Hirundo_rustica_merge_CpG_genotype_beds.log
    fi
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

`

And the ref_Hirundo_rustica_merge_CpG_genotype_beds.log says:
cp: missing destination file operand after 'results/ref_Hirundo_rustica.merged.CpG_vcf.bed' Try 'cp --help' for more information.

I have tried rerunning the pipeline several times and deleted results to get it to rerun stuff but to no avail. I moved the pipeline to PDC/Darde, and I am trying to continue running the gerp part from there. Could this be a problem? I got the conda environment working in my Dardel project directory.

verku · 2025-01-13T07:55:54Z

Hi Julius, thanks for reaching out again! I can see that there is a bug in the code. I'll fix it as soon as possible and will let you know when the code is updated!

JuliusFalck · 2025-01-13T14:34:26Z

Thank you very much; now that I know it is a bug, I will stop trying to fix it in vain.

verku · 2025-01-24T08:05:11Z

Hi @JuliusFalck ! I had another look at the code and I think the error happens somewhere else in the code or in the configuration of the pipeline. Can you send me this part of your config file?

#####
# OPTIONAL:
# Identify CpG sites for removal from mlRho analyses, from VCF files 
# and downstream analyses and define samples to be CpG filtered. 
# Three different methods are available to identify CpG sites.

# This step will generate several BED files containing genome 
# coordinates of CpG sites (file names ending with: "*.CpG_method.bed"), 
# all genome regions outside of CpG sites ("*.noCpG_method.bed"), 
# as well as intersected BED files of coordinates of CpG sites 
# and repeat elements ("*.CpG_method.repeats.bed") and regions 
# outside of CpG sites and repeat elements ("*.noCpG_method.repma.bed").

# Set CpG_identification and one of the three methods listed 
# below to True and specify samples in the list below in 
# which CpG sites should be identified and/or from which 
# CpG sites should be removed.
CpG_identification: False

# Method 1: 
# Identify CpG sites in single-individual VCF files of samples 
# listed below. BED files with CpG sites from all samples are 
# merged into one file for CpG site filtering. Only genotype 
# information is considered for CpG site identification.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf: False

# Method 2: 
# Identify CpG sites in the reference genome.
# Ignores genotype information of samples mapped to the reference 
# genome.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_reference: False

# Method 3: 
# Identify CpG sites in single-individual VCF files and in the 
# reference genome. BED files with CpG sites from all samples 
# and from the reference genome are merged into one file for 
# CpG site filtering.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf_and_reference: False

# Combined list of modern and historical samples in which CpG sites 
# should be identified if CpG_from_vcf or CpG_from_vcf_and_reference 
# is set to True, and from which CpG sites should be removed based 
# on any of the three available CpG identification methods chosen above.
# Sample names without lane or index number in quotation marks, 
# separated by commas (e.g. ["VK01", "VK02", "VK03"]).
# List has to be left empty ([]) if neither CpG identification nor 
# CpG site filtering is run.
# Keep the list of sample names for the entire pipeline run 
# so that the correct files are used for downstream analyses.
CpG_samplenames: []
#####

JuliusFalck · 2025-01-24T08:12:09Z

Hi @verku, here is that part of the config file, tell me if there is anything else you need.

# OPTIONAL:
# Identify CpG sites for removal from mlRho analyses, from VCF files 
# and downstream analyses and define samples to be CpG filtered. 
# Three different methods are available to identify CpG sites.

# This step will generate several BED files containing genome 
# coordinates of CpG sites (file names ending with: "*.CpG_method.bed"), 
# all genome regions outside of CpG sites ("*.noCpG_method.bed"), 
# as well as intersected BED files of coordinates of CpG sites 
# and repeat elements ("*.CpG_method.repeats.bed") and regions 
# outside of CpG sites and repeat elements ("*.noCpG_method.repma.bed").

# Set CpG_identification and one of the three methods listed 
# below to True and specify samples in the list below in 
# which CpG sites should be identified and/or from which 
# CpG sites should be removed.
CpG_identification: True

# Method 1: 
# Identify CpG sites in single-individual VCF files of samples 
# listed below. BED files with CpG sites from all samples are 
# merged into one file for CpG site filtering. Only genotype 
# information is considered for CpG site identification.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf: True

# Method 2: 
# Identify CpG sites in the reference genome.
# Ignores genotype information of samples mapped to the reference 
# genome.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_reference: False

# Method 3: 
# Identify CpG sites in single-individual VCF files and in the 
# reference genome. BED files with CpG sites from all samples 
# and from the reference genome are merged into one file for 
# CpG site filtering.
# Has to be kept at True for the rest of the pipeline, if chosen 
# method for CpG identification, so that the correct files are 
# used for downstream analyses.
CpG_from_vcf_and_reference: False

# Combined list of modern and historical samples in which CpG sites 
# should be identified if CpG_from_vcf or CpG_from_vcf_and_reference 
# is set to True, and from which CpG sites should be removed based 
# on any of the three available CpG identification methods chosen above.
# Sample names without lane or index number in quotation marks, 
# separated by commas (e.g. ["VK01", "VK02", "VK03"]).
# List has to be left empty ([]) if neither CpG identification nor 
# CpG site filtering is run.
# Keep the list of sample names for the entire pipeline run 
# so that the correct files are used for downstream analyses.
CpG_samplenames: ["Heteromirafaarcheri70new", "Heteromirafaarcheri68new", "Heteromirafaarcheri69new", "Heteromirafaarcheri34new", "Calendulaudaburra19new", "Calendulaudaburra24new", "Spizocoryssclateri15new", "Spizocoryssclateri52new", "Spizocoryssclateri16new", "Amirafrarufocinnamomea33new", "Calendulaudaafricanoides47new", "Calendulaudaafricanoides43new", "Calendulaudaalbescens22new", "Calendulaudaalbescens23new", "Calendulaudaerythrochlamys17new", "Calendulaudaerythrochlamys31new", "Calendulaudasabota49new", "Calendulaudasabota50new", "Coryphaafricana51new", "Coryphaafricana18new", "Coryphaathi74new", "Mirafrapasserina37new", "Mirafrapasserina1new", "Mirafrapasserina38new", "Mirafrapasserina39new", "Mirafrapasserina12new", "Mirafrapasserina13new", "Mirafrapasserina14new", "Plocealaudaaffinis71new", "Plocealaudaaffinis77new", "Plocealaudaassamica72new", "Plocealaudaerythrocephala41new", "Plocealaudaerythrocephala42new", "Plocealaudaerythrocephala76new", "Plocealaudaerythroptera67new", "Plocealaudamicroptera36new", "Plocealaudamicroptera66new", "Plocealaudamicroptera73new", "Heteromirafraruddi60new", "Spizocorysfringillaris44new", "Spizocorysfringillaris59new", "Calendulaudaburra25new", "Spizocoryssclateri53new", "Coryphafasciolata32new", "Amirafrarufocinnamomea40new", "Coryphafasciolata75new", "Amirafrarufocinnamomea58new", "Calendulaudaafricanoides20new", "Calendulaudaalbescens21new", "Calendulaudaerythrochlamys61new", "Calendulaudasabota62new", "Calendulaudasabota35new", "Coryphaathi55new", "Coryphaathi56new", "Coryphafasciolata46new", "Coryphafasciolata30new", "Coryphafasciolata65new", "Mirafracheniana63new", "Mirafracheniana64new", "Mirafrapasserina48new", "Spizocorysconirostris45new", "Spizocorysstarki54new", "Alaudarazae2new", "Alaudarazae3new", "Alaudarazae4new", "Alaudarazae5new", "Alaudarazae6new", "Alaudarazae7new", "Alaudarazae8new", "Alaudarazae9new", "Alaudarazae10new", "Alaudarazae11new", "Chersophilusduponti27new", "Chersophilusduponti28new", "Chersophilusduponti29new", "Ramphocorisclotbey57new", "Chersomanesalbofasciata26new"]
#####```

verku · 2025-01-24T10:12:01Z

Thank you! I'm looking for a reason why your Snakemake run did not assign anything as input to the rule "merge_CpG_genotype_beds" but this part of the config file looks good. Could you send me the entire config file, just to be sure that the rest is correct? You should be able to share the file here in the Github issue.

JuliusFalck · 2025-01-24T10:17:45Z

Here is the config file I had to change it to txt to share it.

config.txt

verku · 2025-01-24T13:39:24Z

Hi! Thank you for sharing, the config file looks good! I think that somehow the input files for this rule are missing, maybe when moving the data over to Dardel. Can you check if these files are present?

results/modern/vcf/ref_Hirundo_rustica/*.merged.rmdup.merged.realn.Q30.sorted.CpG.bed

These files should have been created by the rule make_CpG_genotype_bed.

I found one little typo in your config file that should not affect any of these issues: the path to the sex chromosome file should be "config/chrsex.txt", i.e. dropping the slash in the beginning. :-)

JuliusFalck · 2025-01-24T16:02:50Z

Yes, currently, I do not have dose results files. What do I need to do to remake them? Do I have to remove files in the reference directory to get GenErode to make them, or just run the whole pipeline again?

verku · 2025-01-27T07:32:53Z

Yes, exactly! When you restart the pipeline next time, Snakemake will run the rule again because it will check for the presence of these files. Good luck!

verku self-assigned this Dec 11, 2024

JuliusFalck changed the title ~~Error in rule concatenate_fasta_per_contig: missing fasta file~~ Error in GERP analysis Jan 11, 2025

verku added bug Something isn't working next version Include this issue into the next pipeline version labels Jan 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error in GERP analysis #97

Error in GERP analysis #97

JuliusFalck commented Dec 9, 2024

verku commented Dec 11, 2024

JuliusFalck commented Jan 11, 2025

verku commented Jan 13, 2025

JuliusFalck commented Jan 13, 2025

verku commented Jan 24, 2025

JuliusFalck commented Jan 24, 2025 •

edited

Loading

verku commented Jan 24, 2025

JuliusFalck commented Jan 24, 2025 •

edited

Loading

verku commented Jan 24, 2025

JuliusFalck commented Jan 24, 2025

verku commented Jan 27, 2025

Error in GERP analysis #97

Error in GERP analysis #97

Comments

JuliusFalck commented Dec 9, 2024

verku commented Dec 11, 2024

JuliusFalck commented Jan 11, 2025

verku commented Jan 13, 2025

JuliusFalck commented Jan 13, 2025

verku commented Jan 24, 2025

JuliusFalck commented Jan 24, 2025 • edited Loading

verku commented Jan 24, 2025

JuliusFalck commented Jan 24, 2025 • edited Loading

verku commented Jan 24, 2025

JuliusFalck commented Jan 24, 2025

verku commented Jan 27, 2025

JuliusFalck commented Jan 24, 2025 •

edited

Loading

JuliusFalck commented Jan 24, 2025 •

edited

Loading