-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python/c++] Allow GOW multi-submit for single commit #3764
base: main
Are you sure you want to change the base?
[python/c++] Allow GOW multi-submit for single commit #3764
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3764 +/- ##
=======================================
Coverage 89.15% 89.16%
=======================================
Files 54 54
Lines 6419 6431 +12
=======================================
+ Hits 5723 5734 +11
- Misses 696 697 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
if layout == clib.ResultOrder.unordered: | ||
for batch in batches: | ||
# Create new ManagedQuery per each batch | ||
mq = ManagedQuery(self)._handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note: recent analysis showed that the ManagedQuery
constructor does a core array open on every constructor. That means that when we benchmark this, we will expect a bit of a performance regression. However, this is an intentionally paid cost in order to reap the later (read-time) benefits. (Which admittedly will mostly be realized on the global-order case which this if-block is not.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were already creating a new ManagedQuery
per batch before (both unordered and global order), so I do not think there will be a regression
mq.set_layout(layout) | ||
|
||
# Submit for each batch but don't finalize | ||
for batch in batches[:-1]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if len(batches)
is 0? Can that ever happen? If so, can we maybe assert that, in order to avoid the less-intuitive IndexError
? Or, just early-out from this method?
>>> x = []
>>> x[:-1]
[]
>>> x[-1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a check above that returns early if it is empty.
batches = values.to_batches()
if not batches:
return
Co-authored-by: John Kerl <[email protected]>
be310df
to
7b6d96c
Compare
Issue and/or context:
#2054
Previously, this chunked table would have resulted in 3 fragments or commits. After separating the
submit
andfinalize
methods, this only results in a single fragment. Note that for unordered writes, this would still result in 3 fragments.Changes:
finalize
andfinalize_and_submit
fromsubmit_write
_write_table
. For unordered writes, the writes remained unchanged where for each batch, create a newManagedQuery
for each batch andsubmit
. For global order writes, create a singleManagedQuery
,submit
for each batch, andsubmit_and_finalize
at the end