Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the sed snippets with Python ones #168

Merged
merged 8 commits into from
Aug 17, 2023
Merged

Conversation

abhi18av
Copy link
Member

@abhi18av abhi18av commented Aug 13, 2023

Hey @LennertVerboven and @TimHHH ,

I've accommodated both Python scripts within the two processes. Here's some context

  1. SNPEFF

Continuing from #163 (comment), I tweaked that script for the python2 environment which is in magma-env-2 // magma-container-2

  1. GATK_VARIANTS_TO_TABLE

Continuing from #163 (comment) , the generated results are a bit different from the previous file.

Question: Is params.median_coverage_cutoff no longer necessary as it has been substituted by params.site_representation_cutoff ?

==========

However, one problem that I perceived is that the exact expected file isn't being generated in both steps hence I created a DRAFT PR. Once the debugging is done we can merge this PR.

LennertVerboven and others added 4 commits August 10, 2023 12:20
Added the site_representation_cutoff
Added the script which replace the sed commands used to create the fasta file for phylogeny
Moved site representation to EXPERIENCED USERS section
@LennertVerboven
Copy link
Collaborator

Question: Is params.median_coverage_cutoff no longer necessary as it has been substituted by params.site_representation_cutoff ?

They should both remain, the median coverage is used to filter samples but was mistakenly also used here.

I dont understand the issue with the files not being generated

@abhi18av
Copy link
Member Author

abhi18av commented Aug 14, 2023

They should both remain, the median coverage is used to filter samples but was mistakenly also used here.

I see - therefore I assume that we don't need the cohort stats file to reflect this cut-off right?

I dont understand the issue with the files not being generated

Yeah, strange for me too. But what I suggest is that you run the pipeline and then do debugging based on the work directory file.

.command.sh contains the final shell script

.command.run is the driver script, which stages files, activates conda/docker and then runs .command.sh

I suggest, you copy the python script there and then update the .command.sh to use ./python_script.py syntax and then just run .command.run for debugging.

@LennertVerboven
Copy link
Collaborator

Added a newline that I forgot here to fix the issue with the snpeff replacemnt of sed

@abhi18av
Copy link
Member Author

Okay, I tested again with the change, but the result is same still. The downstream INDEX_FEATURE_FILE process fails

gatk IndexFeatureFile --java-options "-Xmx4G" \
       \
      -I joint.raw_variants.annotated.vcf.gz

Error log

  13:46:45.420 INFO  IndexFeatureFile - Shutting down engine
  [August 16, 2023 at 1:46:45 PM UTC] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 0.01 minutes.
  Runtime.totalMemory()=2147483648
  ***********************************************************************

  A USER ERROR has occurred: Error while trying to create index for joint.raw_variants.annotated.vcf.gz. Error was: htsjdk.tribble.TribbleException.InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

@abhi18av
Copy link
Member Author

Okay, this looks functional now , had to tweak the variants to table process 8e3e1f3 and now the pipeline is finishing.

I'm opening this PR for review, please do another pipeline-level test on your side - to check the final analysis - and then we can merge.

@abhi18av abhi18av marked this pull request as ready for review August 17, 2023 02:51
@abhi18av abhi18av requested review from vrennie and mdediegofuertes and removed request for LennertVerboven August 17, 2023 02:51
@LennertVerboven LennertVerboven merged commit dfbfe61 into develop Aug 17, 2023
@abhi18av abhi18av deleted the replace_sed branch August 17, 2023 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor pipeline to move away from use of sed (failure during phylogeny snp-sites step)
3 participants