
CountBasesSpark doesn't work with -L opt #6319

Closed
bhanugandham opened this issue Dec 16, 2019 · 7 comments · Fixed by #6767
Comments

@bhanugandham
Contributor

User report:

I tested this in 4.1.4.0 and 4.1.4.1.

gatk CountBasesSpark \
     -I input_reads.bam \
     -O base_count.txt

When I run this command it works and produces the correct output, base_count.txt.
But I want to count the bases located within the regions in an interval file, so:

gatk CountBasesSpark \
     -I input_reads.bam \
     -O base_count.txt \
     -L interval.file

This command does not run successfully; I see errors like this:

......
19/11/28 17:44:01 INFO NewHadoopRDD: Input split: file:/disks/disk1/data_sample/19NGS142/19NGS142.bam:1476395008+33554432
19/11/28 17:44:01 INFO NewHadoopRDD: Input split: file:/disks/disk1/data_sample/19NGS142/19NGS142.bam:1509949440+33554432
19/11/28 17:44:01 INFO NewHadoopRDD: Input split: file:/disks/disk1/data_sample/19NGS142/19NGS142.bam:704643072+33554432
19/11/28 17:44:02 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 7)
java.util.NoSuchElementException: next on empty iterator
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
        at scala.collection.Iterator$$anon$13.next(Iterator.scala:469)
......

The interval.file is fine because I use it for the whole GATK pipeline.
CountReadsSpark fails with the same error.

Please check this.

Thanks.
Chris

This issue was generated from your [forums] post.
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/24645/countbasesspark-doesnt-work-with-l-opt/p1
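
For reference, the -L argument used in the report accepts the standard GATK interval formats, such as a BED file, a Picard-style .interval_list, or explicit contig:start-stop strings. A minimal BED-style sketch is shown below; the file name, contig names, and coordinates are illustrative only, not taken from the report:

# intervals.bed -- tab-separated, 0-based half-open coordinates (illustrative values)
chr1	1000000	1100000
chr2	5000000	5200000

gatk CountBasesSpark \
     -I input_reads.bam \
     -O base_count.txt \
     -L intervals.bed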

@bhanugandham
Contributor Author

@droazen Created an issue ticket based on our discussion during GATK office hours.

@bhanugandham bhanugandham added this to the GATK-Priority-Backlog milestone Dec 16, 2019
@droazen droazen removed their assignment Dec 16, 2019
@droazen droazen removed this from the GATK-Priority-Backlog milestone Jun 22, 2020
@spromanos

Hi GATK team,

I tried running your PathSeq pipeline (broadinstitute/gatk:4.1.8.0) on my cohort and almost half of the samples failed the scoring step with this error message:

20/07/17 09:38:35 INFO NewHadoopRDD: Input split: file:/cromwell_root/fc-6e61d4b2-bdc8-4abd-bb94-18d8fa11d9b6/7c1b0faa-e956-4289-9e2d-4fb8b9eff6ff/PathSeqPipeline/0ca5578f-70d3-498e-b7cc-23590f0ab31f/call-PathSeqAlign/MMRF_2072_2_BM.microbe_aligned.paired.bam:33554432+33554432
20/07/17 09:38:46 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.util.NoSuchElementException: next on empty iterator
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
        at scala.collection.Iterator$$anon$13.next(Iterator.scala:469)
        at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
        at org.broadinstitute.hellbender.relocated.com.google.common.collect.Iterators$PeekingImpl.next(Iterators.java:1155)
        at org.broadinstitute.hellbender.utils.spark.SparkUtils.lambda$putReadsWithTheSameNameInTheSamePartition$7bd206b0$1(SparkUtils.java:190)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Looking at the aligned bams that go into the scoring task, they don't appear to be empty or different to the rest of the cohort. Any thoughts?

@droazen
Contributor

droazen commented Jul 17, 2020

@spromanos Could you please open a new ticket for this issue instead of replying to this existing ticket?

@spromanos

Done! Thanks

@droazen
Contributor

droazen commented Jul 27, 2020

There is a potential fix for the "next on empty iterator" error in PR #6652 -- this should be part of the next GATK release, and may enable us to close this ticket.

@droazen
Contributor

droazen commented Aug 12, 2020

This is believed to be fixed in the now-merged PR #6652.

@droazen
Contributor

droazen commented Aug 12, 2020

@lbergelson Can you confirm that this issue can now be closed as resolved?

lbergelson added a commit that referenced this issue Aug 25, 2020
* It turns out RDD.reduce crashes when it encounters empty data; use fold instead.
* Fix #6319
droazen pushed a commit that referenced this issue Aug 26, 2020
It turns out RDD.reduce crashes when it encounters empty data; use fold instead.

Fixes #6319
mwalker174 pushed a commit that referenced this issue Nov 3, 2020
It turns out RDD.reduce crashes when it encounters empty data; use fold instead.

Fixes #6319
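
The commit message above describes the actual fix: Spark's RDD.reduce has no identity element, so it fails when a tool ends up with an empty RDD (for example when -L restricts the input to intervals that contain no reads), while fold takes an explicit zero value and simply returns it for empty input. Below is a minimal sketch of that difference using the Java Spark API, assuming a local Spark context; the ReduceVsFoldDemo class and the toy Long counts are illustrative only, not GATK code:

import java.util.Collections;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative only -- not GATK code. Shows why a sum implemented with
// reduce() breaks on empty input while fold() does not.
public class ReduceVsFoldDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReduceVsFoldDemo").setMaster("local[1]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Restricting a tool to intervals can leave it with no reads at all;
            // model that here as an empty RDD of per-read base counts.
            JavaRDD<Long> baseCounts = sc.parallelize(Collections.<Long>emptyList());

            // reduce() has no identity element, so it throws on an empty RDD.
            try {
                baseCounts.reduce(Long::sum);
            } catch (Exception e) {
                System.out.println("reduce on empty input failed: " + e);
            }

            // fold() takes an explicit zero value, so an empty RDD yields 0 instead of crashing.
            long total = baseCounts.fold(0L, Long::sum);
            System.out.println("fold on empty input: " + total);  // prints 0
        }
    }
}

This is why the fix swaps reduce for fold in the counting tools: the explicit zero value turns an interval selection that matches no reads into a valid zero-count result rather than an error.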