Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make VCFCodec the default for query streams from GenomicsDB #6675

Merged
merged 3 commits into from
Jun 26, 2020

Conversation

nalinigans
Copy link
Collaborator

@nalinigans nalinigans commented Jun 24, 2020

GenomicsDB sometimes returns a 64 bit type as a result of computations. This type is not supported by BCF2Codec while decoding resulting in NPEs - see #6548 and #6667. The initial reservation against using VCFCodec as the default was performance related, benchmarks show the BCF2Codec to be about 10-15% faster than VCF2Codec, but VCFCodec handles all the types correctly. This PR makes VCFCodec the default and the argument --genomicsdb-use-vcf-codec has been replaced by --genomicsdb-use-bcf-codec.

Also, included in this PR are some argument documentation fixes and one bug fix where a com.google.cloud.storage.StorageException was being thrown if a -V argument pointing to a genomicsdb GCS workspace(e.g. gendb.gs://mybucket/myworkspace) did not end in a slash.

@droazen droazen self-assigned this Jun 24, 2020
@droazen droazen self-requested a review June 24, 2020 15:41
Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalinigans Back to you with my comments. If you're able to address them by Friday morning, this PR can go into the next GATK release.

throw new IllegalArgumentException("Trying to create a GenomicsDBReader from non-GenomicsDB inputpath " + path);
} else if (Files.notExists(IOUtils.getPath(workspace))) {
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from non-GenomicsDB input path " + path);
} else if (Files.notExists(IOUtils.getPath(workspace.endsWith("/")?workspace:workspace+"/"))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you should just normalize the workspace to end with a slash on tool startup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can remove the check here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -8,11 +8,11 @@

public class GenomicsDBArgumentCollection implements Serializable {
private static final long serialVersionUID = 1L;
public static final String USE_VCF_CODEC_LONG_NAME = "genomicsdb-use-vcf-codec";
public static final String USE_BCF_CODEC_LONG_NAME = "genomicsdb-use-bcf-codec";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that this argument does not control which codec is used in GenomicsDBImport.getReaderFromPath(), which is hardcoded to use a VCFCodec. Should this argument affect the codec used in GenomicsDBImport?

Copy link
Collaborator Author

@nalinigans nalinigans Jun 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not apply to GenomicsDBImport as VCFCodec is hardcoded there - see GenomicsDBImport.java#L833 as you mention. I am guessing this is because the --variant/-V option for GenomicsDBImport mentions only vcf files as inputs. However, GenomicsDB itself can accept either vcf/bcf files and if we allow bcf files as inputs in the future, it will be the tool deciding the Codec and not the user.

{new File[]{getTestFile("sample1.vcf"), getTestFile("sample2.vcf"), getTestFile("sample3.vcf"), getTestFile("sample4.vcf"), getTestFile("sample5.vcf")},
getTestFile("fiveSampleTest.vcf"), null, Arrays.asList(new SimpleInterval("chr20", 251370, 252000), new SimpleInterval("chr20", 263000, 265600)), Arrays.asList("--merge-input-intervals", "--only-output-calls-starting-in-intervals", "--genomicsdb-use-vcf-codec"), b38_reference_20_21},
getTestFile("fiveSampleTest.vcf"), null, Arrays.asList(new SimpleInterval("chr20", 251370, 252000), new SimpleInterval("chr20", 263000, 265600)), Arrays.asList("--merge-input-intervals", "--only-output-calls-starting-in-intervals", "--genomicsdb-use-bcf-codec"), b38_reference_20_21},
Copy link
Contributor

@droazen droazen Jun 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add at least one integration test for GenotypeGVCFs that uses the BCF codec?

Also for GenotypeGVCFs, do you have a simple test case with a 64-bit value that fails with BCF and passes with VCF? If so, it would be good to have both a positive test case with VCF that passes and a negative test case with BCF that fails with an expected exception.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

… add test to check that a combined INFO field resulting in a 64-bit value is handled by VCFCodec, but not by BCF2Codec
@gatk-bot
Copy link

Travis reported job failures from build 30806
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk11 30806.13 logs

@nalinigans nalinigans requested a review from droazen June 26, 2020 00:00
@nalinigans
Copy link
Collaborator Author

@nalinigans Back to you with my comments. If you're able to address them by Friday morning, this PR can go into the next GATK release.

Hopefully, this can make it into the releases.

@nalinigans nalinigans closed this Jun 26, 2020
@nalinigans nalinigans reopened this Jun 26, 2020
@@ -387,9 +387,7 @@ final void printCacheStats() {
protected static FeatureReader<VariantContext> getGenomicsDBFeatureReader(final GATKPath path, final File reference, final GenomicsDBOptions genomicsDBOptions) {
final String workspace = IOUtils.getGenomicsDBAbsolutePath(path) ;
if (workspace == null) {
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from non-GenomicsDB inputpath " + path);
} else if (Files.notExists(IOUtils.getPath(workspace))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't actually suggesting that you remove this check completely, as I think it's a necessary check -- just that it might be better to normalize the workspace name to end with a slash at an earlier point in the code, if we can't handle lack of a trailing slash.

Since time is short and the rest of the PR looks good, I am going to restore the previous version of this line via a separate commit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will move the normalization to IOUtils.getGenomicsDBAbsolutePath() in a future PR. Thanks.

@droazen droazen merged commit bc0994c into master Jun 26, 2020
@droazen droazen deleted the ng_vcfcodec_default branch June 26, 2020 16:45
jonn-smith pushed a commit that referenced this pull request Jul 14, 2020
* Make VCFCodec the default for query streams from GenomicsDB and other minor changes

* Add GenotypeGVCFs integration test for both VCF and BCF2 codecs. Also add test to check that a  combined INFO field resulting in a 64-bit value is handled by VCFCodec, but not by BCF2Codec
mwalker174 pushed a commit that referenced this pull request Nov 3, 2020
* Make VCFCodec the default for query streams from GenomicsDB and other minor changes

* Add GenotypeGVCFs integration test for both VCF and BCF2 codecs. Also add test to check that a  combined INFO field resulting in a 64-bit value is handled by VCFCodec, but not by BCF2Codec
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants