vdk-impala: Introduce COMPUTE STATS statements #2584
Merged
Why:
We were sporadically hitting failures on large tables when using the processing templates with quality checks specified.
The failures appear when we try to move the data from the staging table (where the quality checks have already been performed) into the target table. Because no "COMPUTE STATS" statement had been run against the staging table, selecting data from it could hit query limits and fail. Before submitting this PR, the following check was done to support that decision:
-A job processing a large table with quality checks defined was run 45 times; only 17 runs succeeded, and the other 28 failed with the issue described above. After a "COMPUTE STATS" statement was added against the staging table, before the data is processed to the target, the job completed 100 consecutive successful runs. That is why we expect this enhancement to make the execution of the templates more reliable.
More details explained in
#1361
What:
-Add a COMPUTE STATS statement that is executed against the staging table right before the data is moved to prod, in order to optimize subsequent selects on it.
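
The ordering described above can be sketched roughly as follows. This is only an illustration, not the actual template code in this PR: `build_load_statements`, its parameters, and the `INSERT OVERWRITE ... SELECT *` load step are hypothetical stand-ins; the only thing taken from the change itself is that `COMPUTE STATS` runs against the staging table before the data is read from it.

```python
def build_load_statements(staging_schema, staging_table, target_schema, target_table):
    """Return the SQL statements for the final load step, in execution order.

    Hypothetical helper for illustration; the real templates may build
    and run their queries differently.
    """
    staging = f"{staging_schema}.{staging_table}"
    target = f"{target_schema}.{target_table}"
    return [
        # Refresh table and column statistics on the staging table first,
        # so the planner has them when the following SELECT is planned.
        f"COMPUTE STATS {staging}",
        # Assumed load step: copy the already-checked staging data into the target.
        f"INSERT OVERWRITE TABLE {target} SELECT * FROM {staging}",
    ]


if __name__ == "__main__":
    # Example names are made up.
    for statement in build_load_statements("stg", "orders_staging", "prod", "orders"):
        print(statement)
```

The point of the sketch is only the order: statistics are computed on the staging table before any statement selects from it.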
Signed-off-by: Stefan Buldeev [email protected]