de-sparkify SV discovery #6652
Conversation
Thanks for all this refactoring work. I only have very minor comments. Most of them just involve renaming methods to reflect their new locations (or to improve on their original names).
/**
 * A source of reference base calls.
 */
public interface BasicReference {
I think someone from the engine team should probably review this part of the PR (this interface and its uses).
This seems OK, but it's a shame to add yet another reference interface to the mess of reference interfaces/classes that already exist. It seems like the basic need is to unify ReferenceContext and the mess of ReferenceSparkSources? I don't see an easy way to fix it without adding something new or doing a more substantial rework of the Spark reference classes. (I wouldn't be opposed to substantially reworking those in the future, but it's probably out of scope here.)
I would like to consolidate all of them into a smaller set of classes/interfaces that can be shared between Spark and non-Spark code, but we never get around to it.
Seems ok unless @droazen has comments.
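For context, a minimal sketch of the kind of unified read-only reference abstraction being discussed; the method name and signature here are illustrative assumptions, not the actual GATK API:

import htsjdk.samtools.util.Locatable;

// Hypothetical sketch: one read-only interface that ReferenceContext-style
// sources and the Spark reference sources could both implement.
public interface BasicReference {
    // Return the reference bases overlapping the given interval.
    byte[] getBases( final Locatable interval );
}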
private String currentContigName = null;
private final List<GATKRead> readsForCurrentContig = new ArrayList<>();
private final Map<NovelAdjacencyAndAltHaplotype, SimpleNovelAdjacencyAndChimericAlignmentEvidence> simpleMap =
Could you rename simpleMap and complexMap to something like evidenceForSimpleNovelAdjacencies and contigsForComplexVariants, or something similar? I found these names to be too generic for me to understand their uses in the code below.
        contigNameToCpxVariantAttributes.put(contigName, relevantAttributes);
    }
}
final Map<NovelAdjacencyAndAltHaplotype, List<SimpleChimera>> redoMap = new HashMap<>();
Maybe call redoMap novelAdjacenciesToReinterpret?
for ( final SimpleChimera simpleChimera : chimeras ) {
    final NovelAdjacencyAndAltHaplotype novelAdjacency =
            new NovelAdjacencyAndAltHaplotype(simpleChimera, alignedContig.getContigSequence(), refDict);
    final List<SimpleChimera> mapVal = redoMap.get(novelAdjacency);
mapVal -> chimerasForNovelAdjacency?
        .filterForConsistency(reinterpretedVariants, contigNameToCpxVariantAttributes, reference);
variants.addAll(SegmentedCpxVariantSimpleVariantExtractor.removeDuplicates(extractedMultiSegmentVariants, consistentVariants));

final List<VariantContext> filteredVariants = AnnotatedVariantProducer.filterMergedVCF(variants, discoverStageArgs);
filterMergedVariantList would be a more accurate name.
final CpxVariantCanonicalRepresentation cpxVariantCanonicalRepresentation = pair._1;
final byte[] refBases = getRefBases(reference, cpxVariantCanonicalRepresentation);

public static VariantContext turnIntoVariantContext( final CpxVariantCanonicalRepresentation cpxVariantCanonicalRepresentation,
While you're moving this, want to rename it to something with one fewer word, like createVariantContext or even just toVariantContext?
 * below we simply take the inferred type and turn it to a VC,
 * future implementation may integrate other types of evidence and re-interpret if necessary
 */
public List<VariantContext> turnIntoVariantContexts( final List<SvType> svTypes,
Again, what about renaming to createVariantContexts or toVariantContexts while we're moving stuff around?
    return Arrays.asList(builder1.make(), builder2.make());
}

private Map<String, Object> getAssemblyEvidenceRelatedAnnotations( final List<SimpleChimera> splitAlignmentEvidence ) {
Maybe make this method static?
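Presumably the change is just adding the static modifier, since nothing in the signature suggests instance state is needed:

private static Map<String, Object> getAssemblyEvidenceRelatedAnnotations( final List<SimpleChimera> splitAlignmentEvidence ) {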
@@ -139,17 +141,31 @@ public void testPutReadsWithSameNameInSamePartition(int numPairs, int numPartiti
    return reads;
}

@Test(expectedExceptions = GATKException.class)
Great to see this test changed to passing.
👍 Did you run into this in the wild?
@lbergelson Yes, I did see the empty partition issue in the wild. Situation was a very small BAM containing large contigs from a local assembly with lots of supplemental alignments.
        return count > 1;
    })
    .collect();
Assert.assertEquals(xPartitionPairs.size(), 0);
It might be nice to also assert that each read name you expect is still there, since the fix you implemented involves removing some groups and re-adding them to different partitions.
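A sketch of what that extra check might look like, assuming the test keeps the original reads in a List<GATKRead> named reads and the repartitioned RDD in repartitionedReads (both names hypothetical):

// Hypothetical: verify no read name was lost or duplicated by the regrouping.
final List<String> expectedNames = reads.stream().map(GATKRead::getName).sorted().collect(Collectors.toList());
final List<String> actualNames = new ArrayList<>(repartitionedReads.map(GATKRead::getName).collect());
actualNames.sort(String::compareTo);
Assert.assertEquals(actualNames, expectedNames);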
I'm not sure if someone from the engine team wants to take a quick look at the …
public class StructuralVariantDiscoverer extends ReadWalker {
    @Argument(doc = "Name of output VCF.", shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
            fullName = "outputVCFName")
    private static String outputVCFName;
It's really weird to make arguments static. That seems like a mistake to me.
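The fix is presumably just dropping the modifier so the argument is injected per tool instance:

@Argument(doc = "Name of output VCF.", shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
        fullName = "outputVCFName")
private String outputVCFName;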
@tedsharpe Thanks for making this fix. We had some incorrect assumptions about what we were likely to actually run into in the wild when we wrote that. I have some familiar nitpicks and a few minor comments.
    }
    return Iterators.singletonIterator(firstGroup);
})

final List<GATKRead> empty = Collections.emptyList();
Pulling this up seems strange because now we have to serialize the empty list.
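A sketch of one alternative that avoids capturing anything in the closure: since Collections.emptyList() returns an immutable shared instance, it can simply be called where the value is needed, e.g.

// Hypothetical: nothing extra is captured and serialized with the lambda.
return Iterators.singletonIterator(Collections.<GATKRead>emptyList());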
final String groupName = firstRead.getName();
while ( it.hasNext() ) {
    final GATKRead read = it.next();
    if ( !groupName.equals(read.getName()) ) break;
You know how I feel about {}
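For the record, the braced version of the loop above would read (the body past the break is elided in the diff):

while ( it.hasNext() ) {
    final GATKRead read = it.next();
    if ( !groupName.equals(read.getName()) ) {
        break;
    }
    // ... handle the next read of the current group ...
}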
firstGroupSizes[0] = 0; // first partition has no predecessor to handle its first group of reads
JavaRDD<GATKRead> readsSansFirstGroup = reads.mapPartitionsWithIndex( (idx, itr) ->
    { int groupSize = firstGroupSizes[idx];
      while ( itr.hasNext() && groupSize-- > 0 ) itr.next();
brackets brackets everywhere
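As a sketch, the same logic with conventional brace placement; the return statement is an assumed continuation, since the rest of the lambda is elided in the diff:

JavaRDD<GATKRead> readsSansFirstGroup = reads.mapPartitionsWithIndex( (idx, itr) -> {
    // skip the reads belonging to this partition's first group
    int groupSize = firstGroupSizes[idx];
    while ( itr.hasNext() && groupSize-- > 0 ) {
        itr.next();
    }
    return itr; // assumed: hand back the remaining reads
}, false);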
// Creating one group in the middle that should cause problems
reads.addAll(40, createPairedReads(header, 1, 30));
reads.addAll(createPairedReads(header, 1, 30));
reads.sort( (r1, r2) -> r1.getName().compareTo(r2.getName()) );
I like Comparator.comparing(GATKRead::getName) but the lambda is fine too.
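i.e., a sketch of the suggested form (java.util.Comparator):

reads.sort(Comparator.comparing(GATKRead::getName));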
while ( itr.hasNext() ) readLocs.add(new Tuple2<>(itr.next().getName(), idx));
return readLocs.iterator();
}, false)
.mapToPair(t -> t).groupByKey().filter(t -> {
I always think it's much more readable to split these operations onto multiple lines:
.mapToPair()
.groupByKey()
.filter(...)
}, false)
.mapToPair(t -> t).groupByKey().filter(t -> {
    int count = 0;
    for ( final int idx : t._2() ) ++count;
Why not Iterators.size(t._2())?
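One note: after groupByKey(), t._2() is an Iterable rather than an Iterator, so the matching Guava helper would presumably be Iterables.size:

// com.google.common.collect.Iterables; counts the grouped partition indices
.filter(t -> Iterables.size(t._2()) > 1)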
@tedsharpe If this PR is not looking like it's going to be merged anytime soon, would it be possible to spin off the bug fix for #6319 that's embedded in this branch into a separate PR? We promised a user that it would go out with the next GATK release.
I believe I've addressed all reviewers' suggestions. Please let me know if we're ready for a rebase, squash, and merge.
Looks good to me now, thanks for doing all that renaming -- makes it a lot more readable IMO.
Huh, test failure:
Unlikely to be related to this branch.
If you otherwise approve @lbergelson, I can rebase and we'll see if it goes away.
Go for it. I assume it's randomly occurring.