-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Avoid to call import and export Arrow array for native execution #1055
fix: Avoid to call import and export Arrow array for native execution #1055
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1055 +/- ##
============================================
- Coverage 34.46% 34.32% -0.15%
- Complexity 888 893 +5
============================================
Files 113 114 +1
Lines 43580 42916 -664
Branches 9658 9339 -319
============================================
- Hits 15021 14732 -289
+ Misses 25507 25336 -171
+ Partials 3052 2848 -204 ☔ View full report in Codecov by Sentry. |
@kazuyukitanimura. I can run the benchmarks now, but I did not see any difference in performance compared to the main branch. Are you still seeing a performance benefit after the recent changes? |
Thank you @andygrove |
5 iterations
After
|
Before
After
|
I see these results from 3 runs of q39 (both a+b) with 1TB input: main: 59.3s/57.0s/57.1s I am using these configs:
|
Thanks @andygrove
|
I guess the diff is only observable on very small scale (ms). Once the query time is second/minute/hour level, the diff is insignificant. |
Thank you @andygrove @viirya I addressed memory issues and added DSv2 support (except Iceberg). This is ready for review again. I also run with 1TB by myself. I still see 10% ish speed up that is the order of seconds not Before (1TB)
After (1TB)
|
It looks like less than 5%? 27979 - 27281 = 698 ~= 2% So I believe that the diff is insignificant for longer query time. |
Thanks @viirya
I used
The performance gain is proportional to the input size by design because it saves the overhead for all input arrow vectors. |
33349 obviously looks like it is affected by some noises. q44 avg diff is also ~ 5%.
I mean it is less significant with longer query time. For queries running many minutes or hours, it has no difference. I doubt that if it is worth. |
@viirya Let me try to convince one more time.
If the query time is long because the data size, this PR still helps (5% at least?). If the query time is long because the query itself is complex, this PR has less value.
The latest change is pretty small after following your change on Arrow spec memory model. There are only 3 main changes
The rest of changes are only for passing the information of the new mode is used.
Do you have any recommendations here? What if I add a feature flag to enable/disable this new code flow. The latest change is fully backward compatible. We can easily enable/disable with a single flag to manipulate |
@kazuyukitanimura I plan on reviewing this PR this week. I have found that the import/export cost during shuffle is excessive (see #1123) and would like to understand more about the approach in this PR. |
I understand what this PR is doing now. By using CometNativeVector when importing from native, and then reexporting CometNativeVector we are just passing memory addresses and avoiding extra serde costs. I am going to spend some time debugging with my benchmarks to ensure this path is being tested. |
Currently this PR only handles scan + exec, but if we can cover shuffle similarly the design may become clearner |
The approach in this PR still serializes the schema with each batch (as is necessary when using the Arrow C data interface). I am not sure that the fast path reduces much cost. I think it is worth exploring more but I am not sure we should merge this PR yet, especially as the comet-parquet-exec work may replace some of this code. |
Thanks @andygrove
This PR is for saving Java import/export to/from java value vector format
In the flamegraph,
I can close this for now
I thought this PR part does not change as we still need to send back the results to JVM for pure scan case, but I could be wrong... |
[comet-parquet-exec ] POC2 certainly does some import/export to pass data back for each columnar batch, but I see no way to remove that. Maybe we can apply some learnings from this PR once [comet-parquet-exec] is more stable. |
Closing this for now |
Which issue does this PR close?
Rationale for this change
Performance improvement
What changes are included in this PR?
This PR has changes to avoid to call import and export Arrow array for native execution
How are these changes tested?
Exisiting tests