Retry broadcast OOM with BHJ disabled within the same spark session #17528

Merged
merged 1 commit into prestodb:master from retry_broadcast_oom
Mar 28, 2022

Conversation

Contributor

@pgupta2 pgupta2 commented Mar 25, 2022

Presto on Spark uses temp storage for storing and distributing
broadcast tables. The Spark driver performs the necessary threshold
checks on the broadcast table, and if its size is over the threshold,
the query fails with a broadcast OOM. The only way to fix this
failure is to disable broadcast join in the query.

As we are able to detect a broadcast OOM on the driver confidently,
we can simply disable broadcast join, replan, and resubmit the
query for execution. This can happen within the same Spark
session itself and thus would not need any user intervention
to fix such failures.
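
The retry flow described above, as a minimal self-contained sketch. QueryFailure, RetryStrategy, and runWithRetry are hypothetical stand-ins for illustration (the PR's actual classes are PrestoSparkFailure, RetryExecutionStrategy, and PrestoSparkRunner); only the overall shape of the flow is taken from this description.

import java.util.Optional;

public final class BroadcastOomRetrySketch
{
    // Hypothetical stand-in for the PR's RetryExecutionStrategy.
    enum RetryStrategy
    {
        DISABLE_BROADCAST_JOIN
    }

    // Hypothetical stand-in for PrestoSparkFailure: a runtime exception that
    // carries an optional retry strategy derived from the failure details.
    static final class QueryFailure
            extends RuntimeException
    {
        private final Optional<RetryStrategy> retryStrategy;

        QueryFailure(String message, Optional<RetryStrategy> retryStrategy)
        {
            super(message);
            this.retryStrategy = retryStrategy;
        }

        Optional<RetryStrategy> getRetryStrategy()
        {
            return retryStrategy;
        }
    }

    // Run the query once; on a broadcast OOM flagged as retryable, replan with
    // broadcast join disabled and resubmit within the same Spark session.
    static String runWithRetry(String query)
    {
        try {
            return execute(query, true);
        }
        catch (QueryFailure failure) {
            if (failure.getRetryStrategy().isPresent()) {
                return execute(query, false);
            }
            throw failure;
        }
    }

    // Toy executor: fails with a broadcast OOM unless broadcast join is disabled.
    private static String execute(String query, boolean broadcastJoinEnabled)
    {
        if (broadcastJoinEnabled) {
            throw new QueryFailure(
                    "Query exceeded per-node broadcast memory limit",
                    Optional.of(RetryStrategy.DISABLE_BROADCAST_JOIN));
        }
        return "FINISHED: " + query;
    }

    public static void main(String[] args)
    {
        System.out.println(runWithRetry("SELECT 1"));
    }
}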

Test plan -

  • Unit Test
  • Tested production workload and verified that retry logic is working correctly.
== RELEASE NOTES ==

Spark Changes
* Add a new configuration property ``spark.retry-on-out-of-memory-broadcast-join-enabled`` to disable broadcast join on a broadcast OOM and retry the query within the same Spark session. This can be overridden by the ``spark_retry_on_out_of_memory_broadcast_join_enabled`` session property.

@pgupta2 pgupta2 marked this pull request as draft March 25, 2022 01:22
@pgupta2 pgupta2 force-pushed the retry_broadcast_oom branch from 6ff98a0 to f739e05 Compare March 25, 2022 06:07
@pgupta2 pgupta2 marked this pull request as ready for review March 25, 2022 15:33

import static java.util.Objects.requireNonNull;

public class PrestoSparkFailure
Contributor

nitpick - maybe rename to reflect that this is a runtime exception.

Contributor Author

This class is based on Failure.java in the presto-main module, and I have tried to keep things as similar as possible between the two. The reason I had to create this new class is to enable the flow of error information from presto-main to the presto-spark-launcher module where PrestoSparkRunner resides. PrestoSparkRunner is the entity that orchestrates the execution of a PoS query, and thus it needs access to the failure info to decide if it should retry or not.

@singcha singcha self-requested a review March 25, 2022 17:51

private IPrestoSparkQueryExecution createSparkQueryExecution(
Contributor

Do we need to split this into a function, or can this be pulled into execute()?

Contributor Author

You are right. This is not needed. I did some refactoring earlier, then changed it again and forgot to remove this.

@@ -88,7 +91,136 @@ public void run(
Optional<String> queryDataOutputLocation)
{
IPrestoSparkQueryExecutionFactory queryExecutionFactory = driverPrestoSparkService.getQueryExecutionFactory();
try {
Contributor

The number of arguments probably justifies moving them into a context structure?

Contributor Author

Makes sense. Let me make this change.

public class PrestoSparkFailure
        extends RuntimeException
{
    private final String type;
Contributor

Instead of strings, shall we think of enums? Both for the error code and the type?

Contributor Author

Refer to the comment above.

@@ -41,6 +41,7 @@
private int splitAssignmentBatchSize = 1_000_000;
private double memoryRevokingThreshold;
private double memoryRevokingTarget;
private boolean disableBroadcastJoinOnOOM;
Contributor

nit: ...OnOutOfMemory

We don't use abbreviations in the codebase.

Also, it would be good to use a positive name like enable instead of disable. Having a negation in the config usually makes it harder for users to understand. Looking through the PR, maybe retryOnOutOfMemoryBroadcastJoin.

Comment on lines 230 to 236
public boolean isDisableBroadcastJoinOnOOM()
{
    return disableBroadcastJoinOnOOM;
}

@Config("spark.disable-broadcast-join-on-oom")
public PrestoSparkConfig setDisableBroadcastJoinOnOOM(boolean disableBroadcastJoinOnOOM)
Contributor

same nits
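
A small sketch of what the renamed accessors could look like, assuming the positive name suggested above and the property name that appears in this PR's release notes. The class name and the Airlift @Config import are illustrative assumptions, not the merged code.

import com.facebook.airlift.configuration.Config;

public class RetryConfigSketch
{
    private boolean retryOnOutOfMemoryBroadcastJoinEnabled;

    public boolean isRetryOnOutOfMemoryBroadcastJoinEnabled()
    {
        return retryOnOutOfMemoryBroadcastJoinEnabled;
    }

    // Positive name, no abbreviation; property name taken from the release notes.
    @Config("spark.retry-on-out-of-memory-broadcast-join-enabled")
    public RetryConfigSketch setRetryOnOutOfMemoryBroadcastJoinEnabled(boolean retryOnOutOfMemoryBroadcastJoinEnabled)
    {
        this.retryOnOutOfMemoryBroadcastJoinEnabled = retryOnOutOfMemoryBroadcastJoinEnabled;
        return this;
    }
}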

prestoSparkSession.getCatalogSessionProperties(),
prestoSparkSession.getTraceToken());
}

private static Map<String, String> getFinalSystemProperties(Map<String, String> systemProperties, Optional<RetryExecutionStrategy> retryExecutionStrategy)
{
if (retryExecutionStrategy.isPresent()) {
Contributor

nit

if (!retryExecutionStrategy.isPresent()) {
    return systemProperties;
}
...

@@ -41,6 +41,7 @@
public static final String SPARK_SPLIT_ASSIGNMENT_BATCH_SIZE = "spark_split_assignment_batch_size";
public static final String SPARK_MEMORY_REVOKING_THRESHOLD = "spark_memory_revoking_threshold";
public static final String SPARK_MEMORY_REVOKING_TARGET = "spark_memory_revoking_target";
public static final String SPARK_DISABLE_BROADCAST_JOIN_ON_OOM = "spark_disable_broadcast_join_on_oom";
Contributor

same nit: spell out OOM

Comment on lines 283 to 285
if (executionFailureInfo == null) {
    return null;
}
Contributor

This scares me a bit, for both input and output. From the call sites, it seems there are no nulls? Actually we might checkArgument non-null here.

Contributor Author

toPrestoSparkFailure() is called recursively, and executionFailureInfo will be null where we explicitly throw an error from the Spark driver itself, as in the case of a broadcast join OOM detected on the driver side.

Contributor

OK, maybe add a comment for now. Let me read the logic deeper in the next iteration lol

@@ -908,6 +910,35 @@ public void testStorageBasedBroadcastJoinMaxThreshold()
"Query exceeded per-node total memory limit of 1MB \\[Compressed broadcast size: .*kB; Uncompressed broadcast size: .*MB\\]");
}

@Test
public void testDisableBroadcastJoinOnOOM()
Contributor

same nit OOM

Comment on lines 51 to 55
public boolean isBroadcastJoinOOM()
{
    return getErrorCode().equals("EXCEEDED_LOCAL_MEMORY_LIMIT")
            && getMessage().contains("Query exceeded per-node broadcast memory limit");
}
Contributor

hmmmm this looks a bit hacky...

Can we return a set of RetryExecutionStrategy instead? Check my other comment at toPrestoSparkFailure

Comment on lines 119 to 120
String disableBroadcastJoinOnOOM = sessionProperties.get("spark_disable_broadcast_join_on_oom");
if (disableBroadcastJoinOnOOM != null && disableBroadcastJoinOnOOM.equalsIgnoreCase("true") && failure.isBroadcastJoinOOM()) {
Contributor

Let's make these checks very generic. In the future, we may have more retry strategies. We should get the retry signal from PrestoSparkFailure directly instead of having string/session comparisons scattered around.
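
A hedged sketch of centralizing that signal: one helper (here called getRetryExecutionStrategy, the name used in the restructuring suggestion below) maps the session setting plus the failure's error code and message to an optional retry strategy, so the runner only checks the strategy carried by the failure. The property name, error code, and message fragment are the ones appearing in this thread; everything else is illustrative.

import java.util.Map;
import java.util.Optional;

final class RetrySignalSketch
{
    enum RetryExecutionStrategy
    {
        DISABLE_BROADCAST_JOIN
    }

    // Illustrative mapping from (session properties, failure details) to a retry strategy.
    static Optional<RetryExecutionStrategy> getRetryExecutionStrategy(
            Map<String, String> sessionProperties,
            String errorCodeName,
            String message)
    {
        boolean retryEnabled = Boolean.parseBoolean(
                sessionProperties.getOrDefault("spark_retry_on_out_of_memory_broadcast_join_enabled", "false"));
        boolean broadcastJoinOom = "EXCEEDED_LOCAL_MEMORY_LIMIT".equals(errorCodeName)
                && message != null
                && message.contains("Query exceeded per-node broadcast memory limit");
        if (retryEnabled && broadcastJoinOom) {
            return Optional.of(RetryExecutionStrategy.DISABLE_BROADCAST_JOIN);
        }
        return Optional.empty();
    }
}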

@pgupta2 pgupta2 force-pushed the retry_broadcast_oom branch from f739e05 to ff45eef Compare March 27, 2022 07:21
@pgupta2 pgupta2 requested review from highker and souravpal March 27, 2022 19:48
Comment on lines +37 to +59
if (executionFailureInfo == null) {
    return null;
}
Contributor

By reading the logic, only recursive calls can accept or return nulls, right? Shall we restructure the code like the following?

public static PrestoSparkFailure toPrestoSparkFailure(Session session, ExecutionFailureInfo executionFailureInfo)
{
    requireNonNull(executionFailureInfo, "executionFailureInfo is null");
    PrestoSparkFailure prestoSparkFailure = toPrestoSparkFailure(executionFailureInfo);
    checkState(prestoSparkFailure != null);

    Optional<RetryExecutionStrategy> retryExecutionStrategy = getRetryExecutionStrategy(session, executionFailureInfo.getErrorCode(), executionFailureInfo.getMessage());
    return new PrestoSparkFailure(
            prestoSparkFailure.getMessage(),
            prestoSparkFailure.getCause(),
            prestoSparkFailure.getType(),
            prestoSparkFailure.getErrorCode(),
            retryExecutionStrategy);
}

@Nullable
private static PrestoSparkFailure toPrestoSparkFailure(ExecutionFailureInfo executionFailureInfo)
{
    if (executionFailureInfo == null) {
        return null;
    }

    PrestoSparkFailure prestoSparkFailure = new PrestoSparkFailure(
            executionFailureInfo.getMessage(),
            toPrestoSparkFailure(executionFailureInfo.getCause()),
            executionFailureInfo.getType(),
            executionFailureInfo.getErrorCode() == null ? "" : executionFailureInfo.getErrorCode().getName(),
            Optional.empty());

    for (ExecutionFailureInfo suppressed : executionFailureInfo.getSuppressed()) {
        prestoSparkFailure.addSuppressed(requireNonNull(toPrestoSparkFailure(suppressed), "suppressed failure is null"));
    }
    ImmutableList.Builder<StackTraceElement> stackTraceBuilder = ImmutableList.builder();
    for (String stack : executionFailureInfo.getStack()) {
        stackTraceBuilder.add(toStackTraceElement(stack));
    }
    List<StackTraceElement> stackTrace = stackTraceBuilder.build();
    prestoSparkFailure.setStackTrace(stackTrace.toArray(new StackTraceElement[stackTrace.size()]));
    return prestoSparkFailure;
}

Comment on lines 74 to 78
private static boolean isBroadcastJoinOOM(ErrorCode errorCode, String message)
{
    return errorCode == EXCEEDED_LOCAL_MEMORY_LIMIT.toErrorCode()
            && message.contains("Query exceeded per-node broadcast memory limit");
}
Contributor

Let's inline this function by introducing a new error code in StandardErrorCode: EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT so we don't compare message content

Contributor Author

@pgupta2 pgupta2 Mar 28, 2022

This is something that I discussed with the team as well a few weeks ago. The only concern I had was whether this will have any side effects on our accounting/alerting/monitoring, since we will be changing the existing error code for broadcast failures (or are you suggesting that we introduce a totally new error code for broadcast join OOMs in Presto on Spark only?). If you feel this will be safe, I am more than happy to do this in this PR itself.

Contributor

Yes, we introduce a new error code 'EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT' that is only thrown in the exceededLocalBroadcastMemoryLimit branch of ExceededMemoryLimitException.
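
For illustration only, a self-contained sketch of that direction. These are not the real presto-spi classes; the constant and the exceededLocalBroadcastMemoryLimit name come from this thread, everything else is a stand-in. With a dedicated code, the retry check compares error codes rather than message text.

final class BroadcastOomErrorCodeSketch
{
    // Stand-in for the relevant StandardErrorCode names; the second one is the
    // dedicated code proposed above.
    enum ErrorCodeName
    {
        EXCEEDED_LOCAL_MEMORY_LIMIT,
        EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT
    }

    // Stand-in for an execution failure that carries an error code.
    static final class Failure
            extends RuntimeException
    {
        private final ErrorCodeName errorCode;

        Failure(ErrorCodeName errorCode, String message)
        {
            super(message);
            this.errorCode = errorCode;
        }

        ErrorCodeName getErrorCode()
        {
            return errorCode;
        }
    }

    // Stand-in for the broadcast branch of ExceededMemoryLimitException: it now
    // throws the dedicated code instead of the generic local-memory-limit one.
    static Failure exceededLocalBroadcastMemoryLimit(String maxMemory, String additionalInfo)
    {
        return new Failure(
                ErrorCodeName.EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT,
                "Query exceeded per-node broadcast memory limit of " + maxMemory + " " + additionalInfo);
    }

    // The retry check no longer parses the message content.
    static boolean isBroadcastJoinOutOfMemory(Failure failure)
    {
        return failure.getErrorCode() == ErrorCodeName.EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT;
    }
}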

@pgupta2 pgupta2 force-pushed the retry_broadcast_oom branch from ff45eef to af8feb8 Compare March 28, 2022 16:24
@pgupta2 pgupta2 requested a review from highker March 28, 2022 16:25
@highker highker self-assigned this Mar 28, 2022
@highker highker merged commit e98b500 into prestodb:master Mar 28, 2022
@mshang816 mshang816 mentioned this pull request May 17, 2022