
[SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator #31540

Closed

Conversation

@gerashegalov (Contributor) commented Feb 10, 2021

This PR is a fix for the JLS 17.5.3 violation identified in
@zsxwing's 19/Feb/19 11:47 comment on the JIRA.

What changes were proposed in this pull request?

  • Use a var field to hold the state of the collection accumulator

Why are the changes needed?

AccumulatorV2's auto-registration of the accumulator during readObject doesn't work with final fields that are post-processed outside readObject. As it stands, incompletely initialized objects are published to the heartbeat thread. This leads to sporadic exceptions that knock out executors, which increases the cost of jobs. We observe such failures on a regular basis (NVIDIA/spark-rapids#1522).
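
To make the failure mode concrete, here is a minimal sketch of the race (hypothetical names such as `AccBase`, `ListAcc`, and `Registry`; not Spark's actual classes): the base class's readObject publishes `this` before the subclass's fields have been deserialized.

```scala
import java.io.{IOException, ObjectInputStream}
import java.util.concurrent.ConcurrentLinkedQueue

// Stand-in for the driver-side registry that the heartbeat thread polls.
object Registry {
  val live = new ConcurrentLinkedQueue[AccBase]()
}

abstract class AccBase extends Serializable {
  def report(): String

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    // Auto-registration publishes `this` while the subclass's portion of the
    // stream is still unread: ListAcc.list is null at this point.
    Registry.live.add(this)
  }
}

class ListAcc extends AccBase {
  // Restored only after AccBase.readObject has already run, so a thread that
  // picked this object up from Registry can observe null here.
  private val list = new java.util.ArrayList[String]()
  override def report(): String = s"size=${list.size}" // sporadic NPE
}
```

Because `list` compiles to a final field, JLS 17.5.3 only guarantees other threads see its deserialized value if the reference is not published before deserialization completes; the early registration breaks exactly that precondition.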

Does this PR introduce any user-facing change?

None

How was this patch tested?

  • This is a concurrency bug that is almost impossible to reproduce in a quick unit test.
  • By trial and error I crafted a command ([WIP] Allow repeated iterations of integration tests in the same Spark app, NVIDIA/spark-rapids#1688) that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these exceptions did not show up while running overnight for 10+ hours.
  • Existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite`.

This PR is a fix for the JLS 17.5.3 violation identified in Shixiong Zhu's 19/Feb/19 11:47 comment on the JIRA.
- Avoid using a final field to hold the state of the collection accumulator
@github-actions bot added the CORE label Feb 10, 2021
@mridulm (Contributor) commented Feb 10, 2021

This does not necessarily solve the issue that @zsxwing detailed - the issue here is that registerAccumulator should not be called in readObject before subclasses have completed readObject.

One possible solution would be to introduce two methods:

a) A protected method doHandleDriverSideAccumulator() in AccumulatorV2, which contains all the code that currently follows defaultReadObject in readObject.
b) Call handleDriverSideAccumulator after defaultReadObject in AccumulatorV2's readObject; in AccumulatorV2 this protected method simply delegates to doHandleDriverSideAccumulator.
c) In subclasses with local state, override handleDriverSideAccumulator to do nothing, and invoke doHandleDriverSideAccumulator after the subclass's readObject is done.

This will ensure AccumulatorV2 and its subclasses register only after their state has been initialized. (Rough sketch; please change logic/names/etc. as relevant.)

Note, there are other accumulators with local state; we should do this for all of them.
Thoughts?
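
A rough sketch of what that two-method shape could look like (hypothetical class names; the method names follow the comment above, and this is not code from this PR):

```scala
import java.io.{IOException, ObjectInputStream}

abstract class AccumulatorV2Sketch extends Serializable {
  // All of the registration logic that previously lived in readObject.
  protected final def doHandleDriverSideAccumulator(): Unit = {
    // ... register this accumulator with the driver-side context ...
  }

  // Overridable hook; the base class delegates straight through.
  protected def handleDriverSideAccumulator(): Unit =
    doHandleDriverSideAccumulator()

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    handleDriverSideAccumulator() // a no-op for stateful subclasses
  }
}

class CollectionAccumulatorSketch[T] extends AccumulatorV2Sketch {
  private val _list = new java.util.ArrayList[T]()

  // Defer registration: this class's fields are not restored yet when the
  // base class's readObject runs.
  override protected def handleDriverSideAccumulator(): Unit = ()

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    doHandleDriverSideAccumulator() // now safe: _list has been restored
  }
}
```

The delegation in the base class keeps stateless accumulators working unchanged; only subclasses with local state need the override.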

@gerashegalov (Contributor, Author)

Thanks for taking a look @mridulm

Correct, this patch does not address the problem in general; it just mitigates it for CollectionAccumulator and its subclass PythonAccumulatorV2. Any solution relying on readObject to publish the object, including the one you propose, will publish the object too early.

A generally correct solution that also works for user-defined accumulators must not rely on readObject, IMO, and it is more involved.
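
Purely to illustrate that point, one readObject-free shape (hypothetical, not something proposed or implemented in this PR) would defer registration to the first driver-side use, by which time the object is fully deserialized:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch: readObject is not involved at all. Whichever method
// touches the accumulator first on the driver performs the registration,
// and by then the object is fully constructed.
abstract class LazyRegisteringAcc extends Serializable {
  private val registered = new AtomicBoolean(false)

  protected def registerWithContext(): Unit // actual registration logic

  protected final def ensureRegistered(): Unit =
    if (registered.compareAndSet(false, true)) registerWithContext()
}
```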


```diff
 override def copyAndReset(): CollectionAccumulator[T] = new CollectionAccumulator

 override def copy(): CollectionAccumulator[T] = {
   val newAcc = new CollectionAccumulator[T]
-  _list.synchronized {
-    newAcc._list.addAll(_list)
+  this.synchronized {
```
A Member commented on this diff:
Obviously we can't sync on _list anymore. This changes it to sync on the object itself, which should be fine AFAIK: I don't think anything else depends on the lock of the accumulator itself, and it only manipulates its own state while holding the lock.
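
For context, a simplified sketch of the resulting shape (method set abbreviated; the lazy getOrCreate helper here illustrates how a null observed mid-deserialization stays harmless):

```scala
import java.util.{ArrayList, Collections, List => JList}

// Sketch of the post-change shape: a non-final (var) field, created lazily,
// with every access synchronized on the accumulator itself.
class CollectionAccumulatorSketch[T] extends Serializable {
  private var _list: JList[T] = _ // var: no final-field freeze semantics

  // A reader racing with deserialization may see null; creating the list
  // on demand makes that harmless.
  private def getOrCreate: JList[T] = {
    _list = Option(_list).getOrElse(new ArrayList[T]())
    _list
  }

  def add(v: T): Unit = this.synchronized { getOrCreate.add(v) }

  def value: JList[T] = this.synchronized {
    Collections.unmodifiableList(new ArrayList[T](getOrCreate))
  }
}
```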

@srowen (Member) commented Feb 20, 2021

I think this is OK as a narrow fix for this particular case. Any objection, if tests pass?

@srowen (Member) commented Feb 20, 2021

Jenkins test this please

@SparkQA commented Feb 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39894/

@SparkQA commented Feb 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39894/

@SparkQA commented Feb 20, 2021

Test build #135314 has finished for PR 31540 at commit c9918ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen closed this in fadd0f5 Feb 21, 2021
srowen pushed a commit that referenced this pull request Feb 21, 2021
[SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

This PR is a fix for the JLS 17.5.3 violation identified in
zsxwing's [19/Feb/19 11:47 comment](https://issues.apache.org/jira/browse/SPARK-20977?focusedCommentId=16772277&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772277) on the JIRA.

### What changes were proposed in this pull request?
- Use a var field to hold the state of the collection accumulator

### Why are the changes needed?
AccumulatorV2's auto-registration of the accumulator during readObject doesn't work with final fields that are post-processed outside readObject. As it stands, incompletely initialized objects are published to the heartbeat thread. This leads to sporadic exceptions that knock out executors, which increases the cost of jobs. We observe such failures on a regular basis (NVIDIA/spark-rapids#1522).

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
- This is a concurrency bug that is almost impossible to reproduce in a quick unit test.
- By trial and error I crafted a command (NVIDIA/spark-rapids#1688) that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these exceptions did not show up while running overnight for 10+ hours.
- existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite`

Closes #31540 from gerashegalov/SPARK-20977.

Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit fadd0f5)
Signed-off-by: Sean Owen <[email protected]>
@srowen (Member) commented Feb 21, 2021

Merged to master/3.1

@zhengruifeng (Contributor)

> This does not necessarily solve the issue that @zsxwing detailed - the issue here is that registerAccumulator should not be called in readObject before subclasses have completed readObject.
>
> One possible solution would be to introduce two methods:
>
> a) A protected method doHandleDriverSideAccumulator() in AccumulatorV2, which contains all the code that currently follows defaultReadObject in readObject.
> b) Call handleDriverSideAccumulator after defaultReadObject in AccumulatorV2's readObject; in AccumulatorV2 this protected method simply delegates to doHandleDriverSideAccumulator.
> c) In subclasses with local state, override handleDriverSideAccumulator to do nothing, and invoke doHandleDriverSideAccumulator after the subclass's readObject is done.
>
> This will ensure AccumulatorV2 and its subclasses register only after their state has been initialized. (Rough sketch; please change logic/names/etc. as relevant.)
>
> Note, there are other accumulators with local state; we should do this for all of them.
> Thoughts?

+1

I recently implemented some AccumulatorV2 subclasses in my work (complex statistics containing transient lazy vars and using collections like OpenHashMap/Array/etc.), and I hit lots of NPEs that sporadically made tasks fail. I tried the approach in this PR, but it did not help noticeably.

@gerashegalov (Contributor, Author)

@zhengruifeng can you provide a minimal code reproduction for the NPEs you are observing?

istreeter added a commit to snowplow/snowplow-rdb-loader that referenced this pull request Oct 4, 2024
We've seen exceptions in spark executors like:

```
java.lang.NullPointerException: Cannot invoke "scala.collection.mutable.Set.isEmpty()" because the return value of "com.snowplowanalytics.snowplow.rdbloader.transformer.batch.spark.TypesAccumulator.accum()" is null
```

The error is coming from our Spark Accumulator for accumulating Iglu
types. This is similar to [an issue previously seen][1] in Spark's own
`CollectionAccumulator`. That issue [was fixed in Spark][2] by making
the accumulator's internal state non-final, and synchronizing access to
the internal state. So here we make the exact same change to our own
Accumulator.

It is a rare race condition which is hard to reproduce.

[1]: https://issues.apache.org/jira/browse/SPARK-20977
[2]: apache/spark#31540
spenes pushed a commit to snowplow/snowplow-rdb-loader that referenced this pull request Oct 4, 2024