# [BUG] a CollectionAccumulator has null `_list`, causing the executor's heartbeat thread to die #1522
Comments
Just FYI: I ran into a similar issue with some of the TPC-DS tests, but it showed up only on the CPU run, not on the GPU run. I suspected it was a race condition in Spark itself, but I have not really pushed on it much.
@abellina what Spark version was this issue encountered with?
@gerashegalov it was using Spark 3.0.1.
This issue is explained in SPARK-20977, where an incompletely initialized accumulator is registered in violation of JLS Section 17.5.3.
…Accumulator

This PR is a fix for the JLS 17.5.3 violation identified in zsxwing's [19/Feb/19 11:47 comment](https://issues.apache.org/jira/browse/SPARK-20977?focusedCommentId=16772277&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772277) on the JIRA.

### What changes were proposed in this pull request?

- Use a var field to hold the state of the collection accumulator

### Why are the changes needed?

AccumulatorV2 auto-registration of the accumulator during readObject doesn't work with final fields that are post-processed outside readObject. As it stands, incompletely initialized objects are published to the heartbeat thread. This leads to sporadic exceptions knocking out executors, which increases the cost of jobs. We observe such failures on a regular basis: NVIDIA/spark-rapids#1522.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

- This is a concurrency bug that is almost impossible to reproduce as a quick unit test.
- By trial and error I crafted a command (NVIDIA/spark-rapids#1688) that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these exceptions have not shown up after running overnight for 10+ hours.
- Existing unit tests in `AccumulatorV2Suite` and `LiveEntitySuite`.

Closes #31540 from gerashegalov/SPARK-20977.

Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
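The ordering hazard described in the commit message can be reproduced deterministically outside Spark. The sketch below uses hypothetical stand-in classes (`AccumulatorBase`, `ListAccumulator` — not Spark's actual code) to mimic the AccumulatorV2 pattern: Java serialization deserializes the superclass portion first, so when the superclass's `readObject` "registers" the object, the subclass's final `_list` field has not been assigned yet and a registration-time reader observes null, even though the fully deserialized object is fine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AccumulatorV2's auto-registration in readObject.
class AccumulatorBase implements Serializable {
    // Records what a "heartbeat registration" would see mid-deserialization.
    static volatile String observedAtRegistration;

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // AccumulatorV2 registers the accumulator here; we just record
        // what a concurrent reader of the registered object would see.
        observedAtRegistration = describe();
    }

    String describe() { return "base"; }
}

// Hypothetical stand-in for CollectionAccumulator.
class ListAccumulator extends AccumulatorBase {
    // Final field: assigned only when the SUBCLASS portion of the stream is
    // deserialized, which happens after the superclass readObject has run.
    private final List<String> _list = new ArrayList<>();

    @Override
    String describe() {
        return _list == null ? "_list is null" : "_list has " + _list.size() + " items";
    }
}

public class UnsafePublicationDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(new ListAccumulator());
        ListAccumulator acc = (ListAccumulator) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();

        System.out.println("at registration:       " + AccumulatorBase.observedAtRegistration);
        System.out.println("after deserialization: " + acc.describe());
    }
}
```

Running this prints `_list is null` for the registration-time view and `_list has 0 items` afterwards, which matches the sporadic NullPointerException in the heartbeat thread: the exception fires only if the heartbeat happens to touch the accumulator in that window.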
…Accumulator (cherry picked from commit fadd0f5)

Signed-off-by: Sean Owen <[email protected]>
…IDIA#1522) Signed-off-by: spark-rapids automation <[email protected]>
I am seeing an issue that I can't readily explain so far; at this stage I am just filing the bug. What we see is a `CollectionAccumulator` that appears to have a null `_list` member (according to the exception). More research needs to be done to find the cause. I should mention that our tests will pass, since Spark will add a new executor, invalidate map output, and go ahead, even passing a test it is in the middle of running.
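The "use a var field" remedy from the linked patch can be sketched as lazy initialization: hold the collection in a non-final field and create it on first access, so code running against a partially deserialized object (such as heartbeat registration) gets an empty placeholder instead of null. The class names below (`LazyBase`, `LazyListAccumulator`) are hypothetical illustrations, not Spark's exact implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical base that, like AccumulatorV2, touches the object during
// its own readObject, before subclass fields have been deserialized.
class LazyBase implements Serializable {
    static volatile int sizeSeenAtRegistration = -1;

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        sizeSeenAtRegistration = size(); // "registration" reads the list early
    }

    int size() { return 0; }
}

class LazyListAccumulator extends LazyBase {
    // Deliberately non-final: it may still be null mid-deserialization,
    // but every access goes through getOrCreate().
    private List<String> _list;

    private synchronized List<String> getOrCreate() {
        if (_list == null) {
            _list = new ArrayList<>();
        }
        return _list;
    }

    void add(String v) { getOrCreate().add(v); }

    @Override
    int size() { return getOrCreate().size(); }
}

public class LazyInitDemo {
    public static void main(String[] args) throws Exception {
        LazyListAccumulator acc = new LazyListAccumulator();
        acc.add("a");
        acc.add("b");

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(acc);
        LazyListAccumulator copy = (LazyListAccumulator) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();

        // No NullPointerException: registration saw an empty placeholder,
        // and the deserialized contents are intact afterwards.
        System.out.println("size at registration: " + LazyBase.sizeSeenAtRegistration);
        System.out.println("size after:           " + copy.size());
    }
}
```

The early read during deserialization now returns 0 rather than throwing, and the serialized contents (size 2 here) are restored once deserialization of the subclass portion completes and overwrites the placeholder.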