[SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator #31540
This PR is a fix for the JLS 17.5.3 violation identified in Shixiong Zhu's 19/Feb/19 11:47 comment on the JIRA. - avoid using a final field to hold the state of the collection accumulator
This does not necessarily solve the issue that @zsxwing detailed - the issue here is One possible solution would be to introduce two methods. a) A protected method This will ensure AccumulatorV2 and subclasses will register only after state has been initialized. Note, there are other accumulators with local state; we should do this for all. |
Thanks for taking a look @mridulm correct, this patch does not address the problem in general, it just mitigates it for A generally correct solution that will work for user-defined Accumulators as well must not rely on readObject imo, and it is more involved. |
  override def copyAndReset(): CollectionAccumulator[T] = new CollectionAccumulator

  override def copy(): CollectionAccumulator[T] = {
    val newAcc = new CollectionAccumulator[T]
-   _list.synchronized {
-     newAcc._list.addAll(_list)
+   this.synchronized {
Obviously we can't sync on _list anymore. This changes to sync on the object itself, which should be fine AFAIK - I don't think anything else depends on the lock of the accumulator itself and it's only manipulating its own state while holding the lock.
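The locking change discussed in this comment can be sketched in isolation. This is only an illustration of the pattern (non-final state, created lazily, guarded by the lock on the accumulator itself), not the code merged in this PR. `ListAccumulatorSketch` and `getOrCreate` are hypothetical names, and the class deliberately does not extend Spark's `AccumulatorV2` so that it compiles without Spark on the classpath.

```scala
import java.util.{ArrayList, Collections, List => JList}

// Hypothetical stand-in for an accumulator; it does NOT extend Spark's AccumulatorV2.
class ListAccumulatorSketch[T] extends Serializable {
  // Non-final state: it may be null until first use, so it cannot serve as the lock.
  private var _list: JList[T] = null

  // Creates the backing list on demand. Callers are expected to hold the lock on `this`.
  private def getOrCreate: JList[T] = {
    if (_list == null) _list = new ArrayList[T]()
    _list
  }

  def isZero: Boolean = this.synchronized(getOrCreate.isEmpty)

  def add(v: T): Unit = this.synchronized { getOrCreate.add(v); () }

  def copy(): ListAccumulatorSketch[T] = {
    val newAcc = new ListAccumulatorSketch[T]
    // Lock the accumulator itself rather than the (possibly null) list.
    // newAcc has not been shared with any other thread yet, so it needs no lock here.
    this.synchronized {
      newAcc.getOrCreate.addAll(getOrCreate)
    }
    newAcc
  }

  // Returns a read-only snapshot so callers never see later mutations.
  def value: JList[T] = this.synchronized {
    Collections.unmodifiableList(new ArrayList[T](getOrCreate))
  }
}
```

Because every access goes through `this.synchronized`, a reader that arrives before the list exists simply creates it instead of dereferencing null. The trade-off is that the accumulator's own monitor becomes part of its contract, which matches the comment's assumption that nothing else depends on that lock.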
I think this is OK as a narrow fix for this particular case. Any objections, if tests pass? |
Jenkins test this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #135314 has finished for PR 31540 at commit
…Accumulator

This PR is a fix for the JLS 17.5.3 violation identified in zsxwing's [19/Feb/19 11:47 comment](https://issues.apache.org/jira/browse/SPARK-20977?focusedCommentId=16772277&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772277) on the JIRA.

### What changes were proposed in this pull request?
- Use a var field to hold the state of the collection accumulator

### Why are the changes needed?
AccumulatorV2 auto-registration of accumulator during readObject doesn't work with final fields that are post-processed outside readObject. As it stands incompletely initialized objects are published to heartbeat thread. This leads to sporadic exceptions knocking out executors which increases the cost of the jobs. We observe such failures on a regular basis NVIDIA/spark-rapids#1522.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
- this is a concurrency bug that is almost impossible to reproduce as a quick unit test.
- By trial and error I crafted a command NVIDIA/spark-rapids#1688 that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these Exceptions have not shown up after running overnight for 10+ hours
- existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite`

Closes #31540 from gerashegalov/SPARK-20977.

Authored-by: Gera Shegalov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit fadd0f5)
Signed-off-by: Sean Owen <[email protected]>
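The window described under "Why are the changes needed?" can be made visible without any threads. The sketch below uses plain Java serialization and hypothetical classes (`BaseSketch`, `SubSketch`, `SerializationOrderSketch`); it does not use Spark and does not reproduce the race itself. It only shows that a superclass `readObject` runs before the subclass's fields have been restored, which is the moment at which a registration hook could hand out a half-initialized object.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical base class: its readObject hook plays the role of a registration step.
abstract class BaseSketch extends Serializable {
  // Records what the subclass state looked like while the hook was running.
  @transient var stateSeenInReadObject: String = null

  def describeState: String

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    // Subclass fields have not been deserialized yet at this point.
    stateSeenInReadObject = describeState
  }
}

// Hypothetical subclass holding its state in a final field (a Scala `val`).
class SubSketch extends BaseSketch {
  private val items = new java.util.ArrayList[String]()

  def describeState: String = if (items == null) "null" else "initialized"
}

object SerializationOrderSketch {
  // Serializes the object to a byte array and reads it back.
  def roundTrip(obj: SubSketch): SubSketch = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[SubSketch] finally in.close()
  }
}
```

After `SerializationOrderSketch.roundTrip(new SubSketch)`, the restored object should report `stateSeenInReadObject` as `"null"` while `describeState` returns `"initialized"`. Any other thread that obtained the object from inside the hook could have observed the first of those two states.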
Merged to master/3.1 |
+1 I recently implemented some AccumulatorV2 subclasses (complex statistics containing transient lazy vars and using collections like openhashmap/array/etc) in my work, and there are lots of NPEs that probably make tasks fail. I have tried an approach like the one in this PR, but it did not evidently help. |
@zhengruifeng can you provide a minimal code example reproducing the NPEs you are observing? |
We've seen exceptions in spark executors like:

```
java.lang.NullPointerException: Cannot invoke "scala.collection.mutable.Set.isEmpty()" because the return value of "com.snowplowanalytics.snowplow.rdbloader.transformer.batch.spark.TypesAccumulator.accum()" is null
```

The error is coming from our Spark Accumulator for accumulating Iglu types. This is similar to [an issue previously seen][1] in Spark's own `CollectionAccumulator`. That issue [was fixed in Spark][2] by making the accumulator's internal state non-final, and synchronizing access to the internal state. So here we make the exact same change to our own Accumulator.

It is a rare race condition which is hard to reproduce.

[1]: https://issues.apache.org/jira/browse/SPARK-20977
[2]: apache/spark#31540