Retry broadcast OOM with BHJ disabled within the same spark session #17528
Conversation
6ff98a0 to f739e05 (Compare)
import static java.util.Objects.requireNonNull;

public class PrestoSparkFailure
Nitpick: maybe rename this to reflect that it is a runtime exception.
This class is based on Failure.java in the presto-main module, and I have tried to keep the two as similar as possible. The reason I had to create this new class is to enable the flow of error information from presto-main to the presto-spark-launcher module, where PrestoSparkRunner resides. PrestoSparkRunner is the entity that orchestrates the execution of a Presto on Spark query, so it needs access to the failure info to decide whether it should retry or not.
private IPrestoSparkQueryExecution createSparkQueryExecution(
Do we need to split this into a separate function, or can it be pulled into execute()?
You are right, this is not needed. I actually did some refactoring earlier, then changed it again and forgot to remove this.
@@ -88,7 +91,136 @@ public void run(
        Optional<String> queryDataOutputLocation)
{
    IPrestoSparkQueryExecutionFactory queryExecutionFactory = driverPrestoSparkService.getQueryExecutionFactory();
    try {
The number of arguments probably justifies moving them into a context structure?
Makes sense. Let me do this change.
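For illustration, a minimal sketch of such a context structure; the class name and fields below are assumptions based on the arguments visible in this diff, not the actual change.

import java.util.Optional;

import static java.util.Objects.requireNonNull;

// Hypothetical parameter object bundling the arguments currently passed to run();
// the name and most fields are illustrative only.
public class PrestoSparkRunnerContext
{
    private final String catalog;
    private final String schema;
    private final Optional<String> queryDataOutputLocation;

    public PrestoSparkRunnerContext(String catalog, String schema, Optional<String> queryDataOutputLocation)
    {
        this.catalog = requireNonNull(catalog, "catalog is null");
        this.schema = requireNonNull(schema, "schema is null");
        this.queryDataOutputLocation = requireNonNull(queryDataOutputLocation, "queryDataOutputLocation is null");
    }

    public String getCatalog()
    {
        return catalog;
    }

    public String getSchema()
    {
        return schema;
    }

    public Optional<String> getQueryDataOutputLocation()
    {
        return queryDataOutputLocation;
    }
}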
public class PrestoSparkFailure
        extends RuntimeException
{
    private final String type;
Instead of strings, shall we consider enums, for both the error code and the type?
Refer to the comment above.
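For illustration only, a hedged sketch of what an enum-based type could look like if the suggestion were adopted; the enum name and constants are hypothetical.

// Hypothetical enum replacing the String type field; the constants are illustrative.
public enum PrestoSparkFailureType
{
    USER_ERROR,
    INTERNAL_ERROR,
    INSUFFICIENT_RESOURCES,
    EXTERNAL
}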
@@ -41,6 +41,7 @@
    private int splitAssignmentBatchSize = 1_000_000;
    private double memoryRevokingThreshold;
    private double memoryRevokingTarget;
    private boolean disableBroadcastJoinOnOOM;
Nit: spell this out as ...OnOutOfMemory; we don't use abbreviations in the codebase. Also, it would be good to use a positive name like enable instead of disable; having a negation in the config usually makes it harder for users to understand. Looking through the PR, maybe retryOnOutOfMemoryBroadcastJoin.
public boolean isDisableBroadcastJoinOnOOM()
{
    return disableBroadcastJoinOnOOM;
}

@Config("spark.disable-broadcast-join-on-oom")
public PrestoSparkConfig setDisableBroadcastJoinOnOOM(boolean disableBroadcastJoinOnOOM)
same nits
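A sketch of what the renamed property could look like, following the naming suggestions above; the exact config key and method names are assumptions, not the final change.

private boolean retryOnOutOfMemoryBroadcastJoin;

public boolean isRetryOnOutOfMemoryBroadcastJoin()
{
    return retryOnOutOfMemoryBroadcastJoin;
}

// Hypothetical positively-named config key; the real name may differ.
@Config("spark.retry-on-out-of-memory-broadcast-join")
public PrestoSparkConfig setRetryOnOutOfMemoryBroadcastJoin(boolean retryOnOutOfMemoryBroadcastJoin)
{
    this.retryOnOutOfMemoryBroadcastJoin = retryOnOutOfMemoryBroadcastJoin;
    return this;
}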
        prestoSparkSession.getCatalogSessionProperties(),
        prestoSparkSession.getTraceToken());
}

private static Map<String, String> getFinalSystemProperties(Map<String, String> systemProperties, Optional<RetryExecutionStrategy> retryExecutionStrategy)
{
    if (retryExecutionStrategy.isPresent()) {
Nit: invert the condition and return early:
if (!retryExecutionStrategy.isPresent()) {
return systemProperties;
}
...
@@ -41,6 +41,7 @@
    public static final String SPARK_SPLIT_ASSIGNMENT_BATCH_SIZE = "spark_split_assignment_batch_size";
    public static final String SPARK_MEMORY_REVOKING_THRESHOLD = "spark_memory_revoking_threshold";
    public static final String SPARK_MEMORY_REVOKING_TARGET = "spark_memory_revoking_target";
    public static final String SPARK_DISABLE_BROADCAST_JOIN_ON_OOM = "spark_disable_broadcast_join_on_oom";
same nit: spell out OOM
if (executionFailureInfo == null) {
    return null;
}
This scares me a bit, for both the input and the output. From the call sites, it seems there are no nulls? We might actually checkArgument non-null here.
toPrestoSparkFailure() is called recursively, and executionFailureInfo will be null where we explicitly throw an error from the Spark driver itself, like in the case of a broadcast join OOM detected on the driver side.
OK, maybe add a comment for now. Let me read the logic deeper in the next iteration lol
@@ -908,6 +910,35 @@ public void testStorageBasedBroadcastJoinMaxThreshold()
            "Query exceeded per-node total memory limit of 1MB \\[Compressed broadcast size: .*kB; Uncompressed broadcast size: .*MB\\]");
}

@Test
public void testDisableBroadcastJoinOnOOM()
Same nit: spell out OOM.
public boolean isBroadcastJoinOOM()
{
    return getErrorCode().equals("EXCEEDED_LOCAL_MEMORY_LIMIT")
            && getMessage().contains("Query exceeded per-node broadcast memory limit");
}
Hmm, this looks a bit hacky. Can we return a set of RetryExecutionStrategy instead? Check my other comment at toPrestoSparkFailure.
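A hedged sketch of what that could look like; the enum constant and accessor shown here are assumptions based on this discussion, not the PR's final code.

// Hypothetical retry strategy enum; DISABLE_BROADCAST_JOIN is the case discussed in this PR.
public enum RetryExecutionStrategy
{
    DISABLE_BROADCAST_JOIN
}

// On PrestoSparkFailure: expose the retry decision directly so callers do not
// have to compare error codes or message strings.
public Optional<RetryExecutionStrategy> getRetryExecutionStrategy()
{
    return retryExecutionStrategy;
}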
String disableBroadcastJoinOnOOM = sessionProperties.get("spark_disable_broadcast_join_on_oom");
if (disableBroadcastJoinOnOOM != null && disableBroadcastJoinOnOOM.equalsIgnoreCase("true") && failure.isBroadcastJoinOOM()) {
Let's make these tests very generic. In the future, we may have more retry strategies. We should get the retry signal from PrestoSparkFailure directly instead of having string/session comparisons scattered around.
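For illustration, a sketch of the caller side under that approach; the method and parameter names follow the sketch above and are assumptions, not the PR's actual code.

// Hypothetical helper in PrestoSparkRunner: the retry decision comes from the
// failure itself, with the config flag only gating whether retries are allowed.
private static boolean shouldRetry(PrestoSparkFailure failure, boolean retryOnOutOfMemoryBroadcastJoinEnabled)
{
    return retryOnOutOfMemoryBroadcastJoinEnabled
            && failure.getRetryExecutionStrategy().isPresent();
}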
f739e05 to ff45eef (Compare)
if (executionFailureInfo == null) {
    return null;
}
Reading through the logic, only the recursive calls can accept or return nulls, right? Shall we restructure the code into something like the following?
public static PrestoSparkFailure toPrestoSparkFailure(Session session, ExecutionFailureInfo executionFailureInfo)
{
    requireNonNull(executionFailureInfo, "executionFailureInfo is null");
    PrestoSparkFailure prestoSparkFailure = toPrestoSparkFailure(executionFailureInfo);
    checkState(prestoSparkFailure != null);
    Optional<RetryExecutionStrategy> retryExecutionStrategy = getRetryExecutionStrategy(session, executionFailureInfo.getErrorCode(), executionFailureInfo.getMessage());
    return new PrestoSparkFailure(
            prestoSparkFailure.getMessage(),
            prestoSparkFailure.getCause(),
            prestoSparkFailure.getType(),
            prestoSparkFailure.getErrorCode(),
            retryExecutionStrategy);
}

@Nullable
private static PrestoSparkFailure toPrestoSparkFailure(ExecutionFailureInfo executionFailureInfo)
{
    if (executionFailureInfo == null) {
        return null;
    }

    PrestoSparkFailure prestoSparkFailure = new PrestoSparkFailure(
            executionFailureInfo.getMessage(),
            toPrestoSparkFailure(executionFailureInfo.getCause()),
            executionFailureInfo.getType(),
            executionFailureInfo.getErrorCode() == null ? "" : executionFailureInfo.getErrorCode().getName(),
            Optional.empty());

    for (ExecutionFailureInfo suppressed : executionFailureInfo.getSuppressed()) {
        prestoSparkFailure.addSuppressed(requireNonNull(toPrestoSparkFailure(suppressed), "suppressed failure is null"));
    }

    ImmutableList.Builder<StackTraceElement> stackTraceBuilder = ImmutableList.builder();
    for (String stack : executionFailureInfo.getStack()) {
        stackTraceBuilder.add(toStackTraceElement(stack));
    }
    List<StackTraceElement> stackTrace = stackTraceBuilder.build();
    prestoSparkFailure.setStackTrace(stackTrace.toArray(new StackTraceElement[stackTrace.size()]));
    return prestoSparkFailure;
}
private static boolean isBroadcastJoinOOM(ErrorCode errorCode, String message)
{
    return errorCode == EXCEEDED_LOCAL_MEMORY_LIMIT.toErrorCode()
            && message.contains("Query exceeded per-node broadcast memory limit");
}
Let's inline this function by introducing a new error code in StandardErrorCode: EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT, so we don't compare message content.
This is something that I discussed with the team as well a few weeks ago. The only concern I had was whether this would have any side effects on our accounting/alerting/monitoring, since we would be changing the existing error code for broadcast failures (or are you suggesting that we introduce a totally new error code for broadcast join OOMs in Presto on Spark only?). If you feel this is safe, I am more than happy to do it in this PR itself.
Yes, we introduce a new error code EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT that is only thrown in the exceededLocalBroadcastMemoryLimit branch in ExceededMemoryLimitException.
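A hedged sketch of the change being discussed; the numeric code below is a placeholder and the helper's exact shape is an assumption.

// In StandardErrorCode (sketch): a dedicated code for broadcast join memory failures.
// The numeric value is a placeholder, not the real assignment.
EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT(0x0002_000A, INSUFFICIENT_RESOURCES),

// With a dedicated code, the broadcast OOM check reduces to an error-code
// comparison and no longer needs to inspect the message text.
private static boolean isBroadcastJoinOutOfMemory(ErrorCode errorCode)
{
    return EXCEEDED_LOCAL_BROADCAST_JOIN_MEMORY_LIMIT.toErrorCode().equals(errorCode);
}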
ff45eef to af8feb8 (Compare)
Presto on Spark uses temp storage for storing and distributing
broadcast tables. The Spark driver performs the necessary threshold
checks on the broadcast table, and if its size is over the threshold,
the query fails with a broadcast OOM. The only way to fix this
failure is to disable broadcast join in the query.
As we are able to confidently detect the broadcast OOM on the driver,
we can simply disable broadcast join, replan, and resubmit the
query for execution. This can happen within the same Spark
session itself and thus does not need any user intervention
to fix such failures.
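As a rough illustration of the flow described above, a sketch of how the retry could be wired in the runner; the method names and the mapping of the strategy to session properties are assumptions rather than the PR's exact implementation.

// Illustrative retry loop (names hypothetical): run the query once, and if the
// failure carries a retry strategy, replan and resubmit within the same Spark
// session with broadcast join disabled.
private void executeWithRetry(String query)
{
    try {
        execute(query, Optional.empty());
    }
    catch (PrestoSparkFailure failure) {
        Optional<RetryExecutionStrategy> retryStrategy = failure.getRetryExecutionStrategy();
        if (retryStrategy.isPresent()) {
            // the strategy is translated into overriding system session properties
            // (for example, forcing a partitioned join distribution) before resubmission
            execute(query, retryStrategy);
            return;
        }
        throw failure;
    }
}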
Test plan -