Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update pb fq2bam #7357

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Update pb fq2bam #7357

wants to merge 8 commits into from

Conversation

gburnett-nvidia
Copy link
Contributor

@gburnett-nvidia gburnett-nvidia commented Jan 23, 2025

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@gburnett-nvidia
Copy link
Contributor Author

This PR includes testing for fastq merging, as well as changes to the input and output spec @blajoie

@blajoie
Copy link

blajoie commented Jan 23, 2025

Please let us know if any changes are needed to the inputs/outputs, we did implement a small refactor there with a goal of simplifying things. Would also appreciate a look at how we handled multiple input fastq pairs and keeping the RGs organized.

A separate PR for parabricks/deepvariant is coming ~tomorrow.

cc @gburnett-nvidia

@blajoie blajoie requested a review from famosab January 31, 2025 21:02
Copy link
Contributor

@famosab famosab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are my two cents on the proposed changes :)

tuple val(meta4), path(interval_file)
path(known_sites)

tuple val(meta), val(read_group), path (r1_fastq, stageAs: "?/*"), path (r2_fastq, stageAs: "?/*"), path(interval_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we somehow get the read_group value from the meta? I think that would be more practical in the pipeline later on.

Why are the reads staged as you indicated there?

Comment on lines +17 to +21
tuple val(meta), path("*.bam"), path("*.bai") , emit: bam_bai
tuple val(meta), path("qc_metrics/*"), optional: true, emit: qc_metrics
tuple val(meta), path("*.table"), optional: true, emit: bqsr_table
tuple val(meta), path("*.duplicate-metrics.txt"), optional: true, emit: duplicate_metrics
path "versions.yml", emit: versions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you align this as it was before? :)

Comment on lines +39 to +50
def readgroups_string = read_group.collect { rg -> "@RG\\tID:${rg.read_group}__${rg.sample}\\tSM:${rg.sample}\\tPL:${rg.platform}\\tLB:${rg.sample}\\tPU:${rg.read_group}" }

def in_fq_command = meta.single_end
? (r1_fastq instanceof List
? r1_fastq.collect { "--in-se-fq $it" }.join(' ')
: "--in-se-fq ${r1_fastq}"
)
: (r1_fastq instanceof List && r2_fastq instanceof List && readgroups_string instanceof List
? (r1_fastq.indexed().collect { idx, r1 -> "--in-fq $r1 ${r2_fastq[idx]} \"${readgroups_string[idx]}\"" }).join(' ')
: "--in-fq ${r1_fastq} ${r2_fastq} \"${readgroups_string.join(' ')}\""
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe explain what you are doin here exactly in a comment? At least for me its not directly clear what is going on (even though it probably makes a lot of sense!!!) :)

"""
INDEX=`find -L ./ -name "*.amb" | sed 's/\\.amb\$//'`
cp $fasta \$INDEX

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(trailing whitespace needs to be removed or else linting fails)

Comment on lines +28 to +35
- r1_fastq:
type: file
description: R1 fastq file
pattern: "*.fastq.gz"
- r2_fastq:
type: file
description: R2 fastq file
pattern: "*.fastq.gz"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means that the module only works with paired end fastqs now?

path(known_sites)

tuple val(meta), val(read_group), path (r1_fastq, stageAs: "?/*"), path (r2_fastq, stageAs: "?/*"), path(interval_file)
tuple path(fasta), path(fai)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tuple path(fasta), path(fai)
tuple val(meta2), path(fasta), path(fai)

tuple val(meta4), path(interval_file)
path(known_sites)

tuple val(meta), val(read_group), path (r1_fastq, stageAs: "?/*"), path (r2_fastq, stageAs: "?/*"), path(interval_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also means that you always expect the interval file in the input channel with the reads? Why not leave it seperated as before? I think that would be less confusing!


tuple val(meta), val(read_group), path (r1_fastq, stageAs: "?/*"), path (r2_fastq, stageAs: "?/*"), path(interval_file)
tuple path(fasta), path(fai)
tuple val(meta1), path(index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tuple val(meta1), path(index)
tuple val(meta3), path(index)

Comment on lines +264 to +269
params {
module_args = '--low-memory'
// Ref: https://forums.developer.nvidia.com/t/problem-with-gpu/256825/6
// Parabricks’s fq2bam requires 24GB of memory.
// Using --low-memory for testing
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed for a stub test

Suggested change
params {
module_args = '--low-memory'
// Ref: https://forums.developer.nvidia.com/t/problem-with-gpu/256825/6
// Parabricks’s fq2bam requires 24GB of memory.
// Using --low-memory for testing
}

Comment on lines +298 to +303
{ assert snapshot(
bam(process.out.bam_bai[0][1]).getReadsMD5(),
file(process.out.bam_bai[0][2]).name,
process.out.versions,
path(process.out.versions[0]).yaml
).match() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a stub we should be able to assert like:

Suggested change
{ assert snapshot(
bam(process.out.bam_bai[0][1]).getReadsMD5(),
file(process.out.bam_bai[0][2]).name,
process.out.versions,
path(process.out.versions[0]).yaml
).match() }
{ assert snapshot(process.out).match() }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants