[BUG] a CollectionAccumulator has null _list, causing the executor's heartbeat thread to die #1522

Closed
abellina opened this issue Jan 14, 2021 · 4 comments
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@abellina
Collaborator

I am seeing an issue that I can't yet explain, so at this stage I am just filing the bug. What we see is a CollectionAccumulator that appears to have a null _list member (according to the exception below). More research is needed to find the cause. Note that our tests still pass: Spark adds a new executor, invalidates the map output, and carries on, so even a test that is mid-run when this happens can pass.

21/01/14 19:37:26 ERROR Utils: Uncaught exception in thread executor-heartbeater
java.lang.NullPointerException
	at org.apache.spark.util.CollectionAccumulator.isZero(AccumulatorV2.scala:457)
	at org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$2(Executor.scala:902)
	at org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$2$adapted(Executor.scala:902)
	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filterNot(TraversableLike.scala:355)
	at scala.collection.TraversableLike.filterNot$(TraversableLike.scala:355)
	at scala.collection.AbstractTraversable.filterNot(Traversable.scala:108)
	at org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$1(Executor.scala:902)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:896)
	at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:200)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
	at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
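
The trace above passes through `FutureTask.runAndReset` and `ScheduledThreadPoolExecutor$ScheduledFutureTask.run`, which suggests the heartbeat runs as a periodic scheduled task. For such tasks, the JDK documents that an exception in one execution suppresses all later executions, which would explain why a single NullPointerException ends the heartbeat for good. The following is a minimal sketch of that JDK behavior only; the class and method names are hypothetical and nothing here is Spark code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it only illustrates JDK scheduling semantics.
class HeartbeatSuppressionSketch {

    /** Schedules a periodic task that throws on its third run and returns the total run count. */
    static int runsBeforeSuppression() {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            ScheduledFuture<?> heartbeat = pool.scheduleAtFixedRate(() -> {
                if (runs.incrementAndGet() == 3) {
                    // Stand-in for the NPE thrown from isZero() in the trace.
                    throw new NullPointerException("simulated null _list");
                }
            }, 0, 10, TimeUnit.MILLISECONDS);

            try {
                heartbeat.get(); // blocks until the periodic task fails
            } catch (ExecutionException expected) {
                // The cause is the simulated NullPointerException.
            }

            Thread.sleep(100); // leave room for further runs, if there were any
            return runs.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("runs before suppression: " + runsBeforeSuppression());
    }
}
```

The run count stays at three even though the pool itself is still alive, so nothing restarts the task unless the caller reschedules it.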
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 14, 2021
@revans2
Collaborator

revans2 commented Jan 15, 2021

Just FYI: I ran into a similar issue with some of the TPC-DS tests, but it showed up only on the CPU run, not on the GPU run. I suspected a race condition in Spark itself, but I have not investigated it further.

@abellina abellina added the P0 Must have for release label Jan 15, 2021
@sameerz sameerz added this to the Jan 18 - Jan 29 milestone Jan 16, 2021
@sameerz sameerz removed this from the Jan 18 - Jan 29 milestone Jan 29, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Feb 2, 2021
@gerashegalov
Collaborator

@abellina what Spark version was this issue encountered with?

@abellina
Collaborator Author

abellina commented Feb 2, 2021

@gerashegalov it was using Spark 3.0.1.

@gerashegalov
Collaborator

This issue is explained in SPARK-20977: an incompletely initialized accumulator is registered, in violation of JLS Section 17.5.3.
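
To make the mechanism concrete, here is a minimal sketch of how registering `this` from a superclass `readObject` can expose a subclass field that has not been restored yet. Java deserialization restores a class hierarchy from superclass to subclass, so a hook in the base class runs before the subclass's fields are set. All names (`BaseAcc`, `ListAcc`, `REGISTRY`) are hypothetical stand-ins, and this is a simplified model rather than Spark's actual classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical model of the publication problem; not Spark source.
class ReadObjectPublicationSketch {

    /** Stand-in for a registry that another thread (such as a heartbeater) could poll. */
    static final List<BaseAcc> REGISTRY = new CopyOnWriteArrayList<>();

    abstract static class BaseAcc implements Serializable {
        private static final long serialVersionUID = 1L;

        // Records what an observer would have seen at registration time.
        transient boolean stateWasNullAtRegistration;

        abstract boolean stateIsNull();

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            // Publishing `this` here happens BEFORE subclass fields are restored.
            REGISTRY.add(this);
            stateWasNullAtRegistration = stateIsNull();
        }
    }

    static final class ListAcc extends BaseAcc {
        private static final long serialVersionUID = 1L;

        // Final field: set by the deserializer after BaseAcc.readObject returns.
        private final List<String> list = Collections.synchronizedList(new ArrayList<>());

        @Override
        boolean stateIsNull() {
            return list == null;
        }

        boolean isZero() {
            return list.isEmpty(); // would throw NPE if called while list is still null
        }
    }

    /** Serializes and deserializes the accumulator in memory. */
    static ListAcc roundTrip(ListAcc acc) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(acc);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ListAcc) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ListAcc copy = roundTrip(new ListAcc());
        System.out.println("null at registration: " + copy.stateWasNullAtRegistration);
        System.out.println("isZero after full deserialization: " + copy.isZero());
    }
}
```

In this single-threaded model the window closes as soon as deserialization finishes. With a second thread reading the registry, a call to `isZero()` that lands inside that window would hit the null field, which is consistent with the sporadic failures reported in this thread.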

@sameerz sameerz closed this as completed Feb 11, 2021
srowen pushed a commit to apache/spark that referenced this issue Feb 21, 2021
…Accumulator

This PR is a fix for the JLS 17.5.3 violation identified in
zsxwing's [19/Feb/19 11:47 comment](https://issues.apache.org/jira/browse/SPARK-20977?focusedCommentId=16772277&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772277) on the JIRA.

### What changes were proposed in this pull request?
- Use a var field to hold the state of the collection accumulator
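
One way to read "use a var field" is a non-final field that is created lazily under a lock, so that a reader who sees the object early gets an empty collection instead of a null dereference. The sketch below illustrates that idea in Java under that assumption; the class and method names are hypothetical and the actual Spark patch may differ in its details.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a lazily initialized, non-final accumulator state.
class LazyListAccumulatorSketch<T> {

    // Non-final on purpose: may legitimately be null until first use.
    private List<T> list;

    /** Returns the backing list, creating it on first access. */
    private synchronized List<T> getOrCreate() {
        if (list == null) {
            list = new ArrayList<>();
        }
        return list;
    }

    /** Safe even if called before any value was added or restored. */
    synchronized boolean isZero() {
        return getOrCreate().isEmpty();
    }

    synchronized void add(T value) {
        getOrCreate().add(value);
    }

    /** Returns an immutable snapshot of the current contents. */
    synchronized List<T> value() {
        return Collections.unmodifiableList(new ArrayList<>(getOrCreate()));
    }

    public static void main(String[] args) {
        LazyListAccumulatorSketch<String> acc = new LazyListAccumulatorSketch<>();
        System.out.println("isZero before add: " + acc.isZero());
        acc.add("a");
        System.out.println("value after add: " + acc.value());
    }
}
```

Every access goes through the same monitor, so visibility of the field no longer depends on final-field semantics.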

### Why are the changes needed?
AccumulatorV2 auto-registers the accumulator during readObject, which doesn't work with final fields that are post-processed outside readObject. As it stands, incompletely initialized objects are published to the heartbeat thread. This leads to sporadic exceptions that knock out executors, which increases the cost of jobs. We observe such failures on a regular basis; see NVIDIA/spark-rapids#1522.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
- This is a concurrency bug that is almost impossible to reproduce as a quick unit test.
- By trial and error I crafted a command (NVIDIA/spark-rapids#1688) that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these exceptions did not show up in an overnight run of 10+ hours.
- Existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite`

Closes #31540 from gerashegalov/SPARK-20977.

Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
srowen pushed a commit to apache/spark that referenced this issue Feb 21, 2021
(cherry picked from commit fadd0f5)
Signed-off-by: Sean Owen <[email protected]>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023