Performance issue with SelectVariants from GenomicsDB stored in GCS #7070

Open
GATKSupportTeam opened this issue Feb 4, 2021 · 1 comment

GATKSupportTeam commented Feb 4, 2021

Summary

This user is now able to access the GenomicsDB workspace but is seeing performance problems with SelectVariants. The same command run locally took less than a minute. Is there anything the user could change in how they run SelectVariants to improve performance?

GATK Info

GATK 4.1.9.0

This request was created from a contribution made by Lucas Taniguti on February 01, 2021 22:41 UTC.

Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/360076845511-How-do-I-SelectVariants-from-GenomicsDB-stored-in-GCS-#community_comment_360014183291

--

Thank you, it has started to work with gendb.gs://

But now I think it does not finish. I have only one sample stored in the database and I'm selecting only chr20:1-1000000, yet it has been running for more than 30 minutes. Is this expected?

I'm using a VM from GCE, in the same region as the GCS bucket.

    Using GATK jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -Xms5g -jar /home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar SelectVariants -R Homo_sapiens_assembly38.fasta -V gendb.gs://mybucket/genomicsdb -L chr20:1-1000000 -O teste.vcf.gz
  
    23:01:23.595 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taniguti/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    23:01:23.914 INFO  SelectVariants - ------------------------------------------------------------
    23:01:23.915 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.9.0
    23:01:23.915 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    23:01:23.918 INFO  SelectVariants - Executing as taniguti@phasing-shapeit4-taniguti on Linux v5.4.0-1036-gcp amd64
    23:01:23.918 INFO  SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1+1-Ubuntu-0ubuntu1.20.04
    23:01:23.919 INFO  SelectVariants - Start Date/Time: February 1, 2021 at 11:01:23 PM UTC
    23:01:23.919 INFO  SelectVariants - ------------------------------------------------------------
    23:01:23.919 INFO  SelectVariants - ------------------------------------------------------------
    23:01:23.928 INFO  SelectVariants - HTSJDK Version: 2.23.0
    23:01:23.929 INFO  SelectVariants - Picard Version: 2.23.3
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    23:01:23.929 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    23:01:23.930 INFO  SelectVariants - Deflater: IntelDeflater
    23:01:23.930 INFO  SelectVariants - Inflater: IntelInflater
    23:01:23.930 INFO  SelectVariants - GCS max retries/reopens: 20
    23:01:23.930 INFO  SelectVariants - Requester pays: disabled
    23:01:23.930 INFO  SelectVariants - Initializing engine
    23:01:25.939 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    23:01:39.847 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field AS_InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.847 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field AS_QD  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field DS  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:39.848 info  NativeGenomicsDB - pid=4376 tid=4377 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
    23:01:51.886 INFO  IntervalArgumentCollection - Processing 1000000 bp from intervals
    23:01:51.918 INFO  SelectVariants - Done initializing engine
    23:01:52.050 INFO  ProgressMeter - Starting traversal
    23:01:52.051 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute

(created from [Zendesk ticket #105490](https://broadinstitute.zendesk.com/agent/tickets/105490); gz#105490)


gbrandt6 commented Feb 9, 2021

The user has posted an update with the jstack logs, which can be downloaded here: https://gatk.broadinstitute.org/hc/en-us/community/posts/360076845511/comments/360014258071
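
For anyone trying to reproduce the hang, here is a minimal sketch of how thread dumps like these might be collected from a running GATK process. It is not taken from the user's post: the function name `capture_dumps`, the output file naming, and the `JSTACK` override variable are all hypothetical. The only external tool assumed is the JDK's `jstack`, called with `-l` to include lock information.

```shell
#!/bin/sh
# Sketch only: take several thread dumps of one JVM, a few seconds apart,
# so that threads stuck in the same frame across dumps stand out.
# Usage: capture_dumps PID [COUNT] [INTERVAL_SECONDS] [OUT_DIR]
# The body runs in a subshell "( ... )" so its variables stay local.
capture_dumps() (
    pid="$1"
    count="${2:-3}"
    interval="${3:-10}"
    out_dir="${4:-.}"
    dumper="${JSTACK:-jstack}"   # hypothetical override, handy for testing

    mkdir -p "$out_dir"
    i=1
    while [ "$i" -le "$count" ]; do
        # One file per dump, e.g. jstack.4376.1.txt
        "$dumper" -l "$pid" > "$out_dir/jstack.$pid.$i.txt"
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
)

# Example (commented out): dump the first JVM whose command line mentions the GATK jar.
# capture_dumps "$(pgrep -f gatk-package | head -n 1)" 3 10 dumps
```

Taking several dumps rather than one makes it easier to tell a process that is slow (frames change between dumps) from one that is stuck (frames stay the same).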
They also report that when they run GenomicsDBImport on both samples in a single command, GenotypeGVCFs completes in 14 minutes. If they instead import one sample at a time (using --genomicsdb-update-workspace-path), the GenotypeGVCFs process appears to hang.
@nalinigans @mlathara Any thoughts?
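
To make the two import patterns described above concrete, here is a rough sketch of what each might look like. The user's actual GenomicsDBImport commands are not in this thread, so the sample file names, the `chr20` interval, and the helper functions (`run`, `import_batch`, `import_incremental`) are assumptions for illustration; only the bucket path and the `--genomicsdb-update-workspace-path` option come from the report. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the GATK commands unless DRY_RUN=0 is set.
WORKSPACE="${WORKSPACE:-gs://mybucket/genomicsdb}"   # bucket path as in the report
INTERVAL="${INTERVAL:-chr20}"                         # assumed interval

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pattern 1: both samples imported in a single GenomicsDBImport call
# (the case where GenotypeGVCFs reportedly finished in 14 minutes).
import_batch() {
    run gatk GenomicsDBImport \
        -V "$1" -V "$2" \
        --genomicsdb-workspace-path "$WORKSPACE" \
        -L "$INTERVAL"
}

# Pattern 2: create the workspace with the first sample, then append the
# second (the case where GenotypeGVCFs reportedly appears hung).
# Assumption: the update step reuses the intervals stored in the workspace,
# so no -L is passed; add it if your GATK version requires it.
import_incremental() {
    run gatk GenomicsDBImport \
        -V "$1" \
        --genomicsdb-workspace-path "$WORKSPACE" \
        -L "$INTERVAL"
    run gatk GenomicsDBImport \
        -V "$2" \
        --genomicsdb-update-workspace-path "$WORKSPACE"
}

# Example (commented out; file names are hypothetical):
# import_incremental sample1.g.vcf.gz sample2.g.vcf.gz
```

Building one workspace each way from the same two gVCFs and then running the same GenotypeGVCFs or SelectVariants command against both would help confirm whether the incremental import is what triggers the slowdown.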
