-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make VCFCodec the default for query streams from GenomicsDB #6675
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalinigans Back to you with my comments. If you're able to address them by Friday morning, this PR can go into the next GATK release.
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from non-GenomicsDB inputpath " + path); | ||
} else if (Files.notExists(IOUtils.getPath(workspace))) { | ||
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from non-GenomicsDB input path " + path); | ||
} else if (Files.notExists(IOUtils.getPath(workspace.endsWith("/")?workspace:workspace+"/"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you should just normalize the workspace to end with a slash on tool startup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can remove the check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -8,11 +8,11 @@ | |||
|
|||
public class GenomicsDBArgumentCollection implements Serializable { | |||
private static final long serialVersionUID = 1L; | |||
public static final String USE_VCF_CODEC_LONG_NAME = "genomicsdb-use-vcf-codec"; | |||
public static final String USE_BCF_CODEC_LONG_NAME = "genomicsdb-use-bcf-codec"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that this argument does not control which codec is used in GenomicsDBImport.getReaderFromPath()
, which is hardcoded to use a VCFCodec
. Should this argument affect the codec used in GenomicsDBImport
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not apply to GenomicsDBImport as VCFCodec
is hardcoded there - see GenomicsDBImport.java#L833 as you mention. I am guessing this is because the --variant/-V
option for GenomicsDBImport mentions only vcf files as inputs. However, GenomicsDB itself can accept either vcf/bcf files and if we allow bcf files as inputs in the future, it will be the tool deciding the Codec and not the user.
{new File[]{getTestFile("sample1.vcf"), getTestFile("sample2.vcf"), getTestFile("sample3.vcf"), getTestFile("sample4.vcf"), getTestFile("sample5.vcf")}, | ||
getTestFile("fiveSampleTest.vcf"), null, Arrays.asList(new SimpleInterval("chr20", 251370, 252000), new SimpleInterval("chr20", 263000, 265600)), Arrays.asList("--merge-input-intervals", "--only-output-calls-starting-in-intervals", "--genomicsdb-use-vcf-codec"), b38_reference_20_21}, | ||
getTestFile("fiveSampleTest.vcf"), null, Arrays.asList(new SimpleInterval("chr20", 251370, 252000), new SimpleInterval("chr20", 263000, 265600)), Arrays.asList("--merge-input-intervals", "--only-output-calls-starting-in-intervals", "--genomicsdb-use-bcf-codec"), b38_reference_20_21}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add at least one integration test for GenotypeGVCFs
that uses the BCF codec?
Also for GenotypeGVCFs
, do you have a simple test case with a 64-bit value that fails with BCF and passes with VCF? If so, it would be good to have both a positive test case with VCF that passes and a negative test case with BCF that fails with an expected exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
… add test to check that a combined INFO field resulting in a 64-bit value is handled by VCFCodec, but not by BCF2Codec
Hopefully, this can make it into the releases. |
@@ -387,9 +387,7 @@ final void printCacheStats() { | |||
protected static FeatureReader<VariantContext> getGenomicsDBFeatureReader(final GATKPath path, final File reference, final GenomicsDBOptions genomicsDBOptions) { | |||
final String workspace = IOUtils.getGenomicsDBAbsolutePath(path) ; | |||
if (workspace == null) { | |||
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from non-GenomicsDB inputpath " + path); | |||
} else if (Files.notExists(IOUtils.getPath(workspace))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't actually suggesting that you remove this check completely, as I think it's a necessary check -- just that it might be better to normalize the workspace name to end with a slash at an earlier point in the code, if we can't handle lack of a trailing slash.
Since time is short and the rest of the PR looks good, I am going to restore the previous version of this line via a separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will move the normalization to IOUtils.getGenomicsDBAbsolutePath() in a future PR. Thanks.
* Make VCFCodec the default for query streams from GenomicsDB and other minor changes * Add GenotypeGVCFs integration test for both VCF and BCF2 codecs. Also add test to check that a combined INFO field resulting in a 64-bit value is handled by VCFCodec, but not by BCF2Codec
* Make VCFCodec the default for query streams from GenomicsDB and other minor changes * Add GenotypeGVCFs integration test for both VCF and BCF2 codecs. Also add test to check that a combined INFO field resulting in a 64-bit value is handled by VCFCodec, but not by BCF2Codec
GenomicsDB sometimes returns a 64 bit type as a result of computations. This type is not supported by BCF2Codec while decoding resulting in NPEs - see #6548 and #6667. The initial reservation against using VCFCodec as the default was performance related, benchmarks show the BCF2Codec to be about 10-15% faster than VCF2Codec, but VCFCodec handles all the types correctly. This PR makes VCFCodec the default and the argument
--genomicsdb-use-vcf-codec
has been replaced by--genomicsdb-use-bcf-codec
.Also, included in this PR are some argument documentation fixes and one bug fix where a com.google.cloud.storage.StorageException was being thrown if a -V argument pointing to a genomicsdb GCS workspace(e.g. gendb.gs://mybucket/myworkspace) did not end in a slash.