[python/c++] Allow GOW multi-submit for single commit #3764

Open: nguyenv wants to merge 5 commits into main

@nguyenv nguyenv commented Mar 7, 2025

Issue and/or context:

#2054

Previously, writing the chunked table below produced three fragments (and three commits), one per chunk. With the submit and finalize methods separated, a global-order write now produces a single fragment. Unordered writes still produce three fragments.

    with soma.open(uri, mode="w") as A:
        # Three-chunk table
        A.write(
            pa.concat_tables(
                [
                    pa.Table.from_pandas(df_0, preserve_index=False),
                    pa.Table.from_pandas(df_1, preserve_index=False),
                    pa.Table.from_pandas(df_2, preserve_index=False),
                ]
            ),

            # Do not sort the coordinates -- global order write
            platform_config=soma.TileDBWriteOptions(**{"sort_coords": False}),
        )

    # There should be a single fragment even though there are three chunks (and
    # therefore three submits) in the array because we only finalize once at
    # the end
    assert len(list((Path(uri) / "__commits").iterdir())) == 1
    assert len(list((Path(uri) / "__fragments").iterdir())) == 1
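
The two assertions above can be folded into a small helper for reuse in other tests. This is only a sketch: `count_entries` is a hypothetical name, and it assumes the array lives on the local filesystem, since `Path(uri)` does not work for remote URIs.

```python
from pathlib import Path


def count_entries(array_uri, subdir):
    """Count the entries in a subdirectory of a local array.

    Returns 0 when the subdirectory does not exist. The directory names
    (`__fragments`, `__commits`) are the ones used in the test above.
    """
    path = Path(array_uri) / subdir
    return sum(1 for _ in path.iterdir()) if path.is_dir() else 0
```

With this, the checks become `count_entries(uri, "__commits") == 1` and `count_entries(uri, "__fragments") == 1`.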

Changes:

  • Separate finalize and submit_and_finalize from submit_write
  • For sparse array writes in Python, add the utility method _write_table. Unordered writes are unchanged: a new ManagedQuery is created and submitted for each batch. Global order writes create a single ManagedQuery, submit each batch, and call submit_and_finalize at the end
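
The two write paths in the list above can be sketched roughly as follows. This is a simplified model, not the code in this PR: `FakeQuery` is a hypothetical stand-in for ManagedQuery that writes nothing, its method signatures are assumptions, and it assumes that an unordered submit yields one fragment while a global order query yields one fragment when it is finalized.

```python
class FakeQuery:
    """Hypothetical stand-in for ManagedQuery: counts fragments, writes nothing."""

    fragments = 0  # counter shared by all queries (simplified model)

    def __init__(self, layout):
        self.layout = layout
        self.pending = False

    def submit_write(self, batch):
        if self.layout == "unordered":
            # Modelled assumption: each unordered submit yields its own fragment.
            FakeQuery.fragments += 1
        else:
            # Global order: data accumulates until the query is finalized.
            self.pending = True

    def finalize(self):
        if self.pending:
            FakeQuery.fragments += 1
            self.pending = False

    def submit_and_finalize(self, batch):
        self.submit_write(batch)
        self.finalize()


def write_batches(batches, layout):
    """Return the number of fragments the modelled write would create."""
    FakeQuery.fragments = 0
    if not batches:
        return 0
    if layout == "unordered":
        # One query per batch, as before this PR.
        for batch in batches:
            FakeQuery(layout).submit_write(batch)
    else:
        # One query for all batches; finalize only with the last one.
        mq = FakeQuery(layout)
        for batch in batches[:-1]:
            mq.submit_write(batch)
        mq.submit_and_finalize(batches[-1])
    return FakeQuery.fragments
```

Under these assumptions, three batches give three fragments for the unordered layout and one fragment for the global order layout, matching the behavior described in this PR.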

codecov bot commented Mar 7, 2025

Codecov Report

Attention: Patch coverage is 90.47619% with 4 lines in your changes missing coverage. Please review.

Project coverage is 89.16%. Comparing base (aefd8ed) to head (7b6d96c).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3764   +/-   ##
=======================================
  Coverage   89.15%   89.16%           
=======================================
  Files          54       54           
  Lines        6419     6431   +12     
=======================================
+ Hits         5723     5734   +11     
- Misses        696      697    +1     
Flag Coverage Δ
python 89.16% <90.47%> (+<0.01%) ⬆️

Components Coverage Δ
python_api 89.16% <90.47%> (+<0.01%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@nguyenv nguyenv marked this pull request as ready for review March 7, 2025 17:52

    if layout == clib.ResultOrder.unordered:
        for batch in batches:
            # Create new ManagedQuery per each batch
            mq = ManagedQuery(self)._handle

Member

Just a note: recent analysis showed that the ManagedQuery constructor opens the core array every time it is called. When we benchmark this, we should therefore expect a small performance regression. However, this is a cost we pay intentionally in order to reap the later (read-time) benefits, which admittedly will mostly be realized in the global-order case, not in this if-block.

Member Author

We were already creating a new ManagedQuery per batch (for both unordered and global-order writes), so I do not think there will be a regression.

    mq.set_layout(layout)

    # Submit for each batch but don't finalize
    for batch in batches[:-1]:

Member

What happens if len(batches) is 0? Can that ever happen? If so, can we maybe assert that, in order to avoid the less-intuitive IndexError? Or, just early-out from this method?

>>> x = []
>>> x[:-1]
[]
>>> x[-1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Member Author

There is a check above that returns early if it is empty.

        batches = values.to_batches()
        if not batches:
            return
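
To illustrate why the early return makes the later `batches[-1]` access safe, here is a small sketch that uses plain lists in place of Arrow record batches. The helper name `split_for_global_order` is hypothetical and does not appear in this PR.

```python
def split_for_global_order(batches):
    """Split batches into (leading, last) for a global order write.

    Returns None when there is nothing to write, so callers never reach
    the `batches[-1]` lookup with an empty list.
    """
    if not batches:
        return None
    # Leading batches are submitted without finalizing; the last batch
    # is the one passed to submit_and_finalize.
    return batches[:-1], batches[-1]
```

For a single batch the leading list is empty, so the submit loop body never runs and the only call is the final submit_and_finalize.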

@nguyenv nguyenv force-pushed the viviannguyen/sc-49083/c-plumb-global-order-write-with-fragment branch from be310df to 7b6d96c Compare March 7, 2025 22:53
@nguyenv nguyenv requested review from johnkerl and ktsitsi March 10, 2025 14:31
2 participants