Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use dataclasses for ByteRangeRequests #2585

Merged
merged 30 commits into from
Jan 9, 2025

Conversation

maxrjones
Copy link
Member

@maxrjones maxrjones commented Dec 22, 2024

This draft PR modifies the ByteRangeRequest type to follow the same approach as obstore, as suggested here. It is a larger change than the decision made in #1900 (comment) to specify ByteRangeRequest tuples as start, stop rather than start, length which has not yet been implemented in main.

As for performance impacts, please see these simple timings - https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks/blob/main/plot-results.ipynb

@normanrz
Copy link
Member

Why not use literate instead of a tuple?

class ExplicitRange(TypedDict):
    start: int
    end: int

@normanrz
Copy link
Member

I added the changes for sharding in 4a73b70

@maxrjones
Copy link
Member Author

Why not use literate instead of a tuple?

class ExplicitRange(TypedDict):
    start: int
    end: int

My understanding from #2437 (comment) was that there were concerns over performance for dataclasses, which I thought could extend to TypedDict over tuple. If you'd rather literate instead of a tuple, I could make that change and run some simple tests to check performance.

@normanrz
Copy link
Member

normanrz commented Jan 2, 2025

If you'd rather literate instead of a tuple, I could make that change and run some simple tests to check performance.

Yeah. I think that would be great. If the performance is ok, I would prefer a dict (or dataclass).
Btw, have you done some performance testing how @dataclass(slots=True) or NamedTuple compares to a dict?

@maxrjones
Copy link
Member Author

If you'd rather literate instead of a tuple, I could make that change and run some simple tests to check performance.

Yeah. I think that would be great. If the performance is ok, I would prefer a dict (or dataclass). Btw, have you done some performance testing how @dataclass(slots=True) or NamedTuple compares to a dict?

Thanks @normanrz, I haven't done those tests but can try that out along with your earlier suggestion. For timing, do you think this PR would be considered for the V3 release? I could prioritize this if so, but if not I won't rush to work on it.

@normanrz
Copy link
Member

normanrz commented Jan 2, 2025

We are closing 3.0 right now, so I guess this will be a post-3.0 issue. So, no rush here.

@maxrjones
Copy link
Member Author

maxrjones commented Jan 2, 2025

We are closing 3.0 right now, so I guess this will be a post-3.0 issue. So, no rush here.

Sounds good, I was pushing for pre v3.0 in #2412 (comment) under the assumption that using (start, end) or TypedDict/NamedTuple/Dataclass instead would not be accepted afterwards because it is a breaking change. But I am happy to wait if you are open to later changes in behavior.

@normanrz
Copy link
Member

normanrz commented Jan 2, 2025

My thinking is that this is not a big enough blocker for the 3.0 release. We can change this afterwards, but we will probably need a backwards-compat layer.
@jhamman what do you think here?

@jhamman
Copy link
Member

jhamman commented Jan 4, 2025

I agree. This can come after 3.0.0. But the sooner the better. The interface isn't really used by anyone external except for icechunk so if we do this quickly, we can probably get by without a compat layer.

Comment on lines 280 to 302
key_ranges = list(key_ranges)
paths: list[str] = []
starts: list[int | None] = []
stops: list[int | None] = []
for key, byte_range in key_ranges:
paths.append(_dereference_path(self.path, key))
if byte_range is None:
starts.append(None)
stops.append(None)
elif isinstance(byte_range, tuple):
starts.append(byte_range[0])
stops.append(byte_range[1])
elif isinstance(byte_range, dict):
if "offset" in byte_range:
starts.append(byte_range["offset"]) # type: ignore[typeddict-item]
stops.append(None)
elif "suffix" in byte_range:
starts.append(-byte_range["suffix"])
stops.append(None)
else:
raise ValueError("Invalid format for ByteRangeRequest")
else:
raise ValueError("Invalid format for ByteRangeRequest")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where the PR could most likely cause a performance hit by iterating over the inputs rather than using zip.

@maxrjones maxrjones changed the title [RFC] Use TypedDicts for more literate ByteRangeRequests Use TypedDicts for more literate ByteRangeRequests Jan 7, 2025
@normanrz
Copy link
Member

normanrz commented Jan 8, 2025

I also think the performance is ok.

My problem with tuples is that it is not clear what the second component is. Without further docs, it could be either the exclusive end or a byte length. That is why I like the dataclass (with kwargs: values = asyncio.run(store.get('c/0', byte_range=ExplicitRange(start=1, end=5)))) or typed dict approach better.

Have we considered using a slice (with disallowed step != 1)? It has clear semantics for start and end and supports suffix and offset.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 8, 2025

i think slice is a bit too expressive, and it doesn't play nicely with type checking, so we would need to examine each one at runtime to ensure that its a valid specification of a dense interval.

@normanrz
Copy link
Member

normanrz commented Jan 8, 2025

We could really use rust-style enums here :D

@normanrz
Copy link
Member

normanrz commented Jan 8, 2025

Well, after 5min of googling I learned about the match statement (new in Python 3.10), which kind of enables rust-style enums.

from dataclasses import dataclass

@dataclass
class ExplicitByteRequest:
    start: int
    end: int

@dataclass
class SuffixByteRequest:
    suffix: int

@dataclass
class OffsetByteRequest:
    offset: int

ByteRequest = ExplicitByteRequest | SuffixByteRequest | OffsetByteRequest

def get(path: str, byte_range: ByteRequest | None):
    match byte_range:
        case None:
            return print(path)
        case ExplicitByteRequest(start, end):
            return print(path, f"{start=}, {end=}")
        case SuffixByteRequest(suffix):
            return print(path, f"start={-suffix}, end=None")
        case OffsetByteRequest(offset):
            return print(path, f"start={offset}, end=None")
        case _:
            raise Exception(f"Unexpected byte_range, got {byte_range}.")

get("test", None)
get("test", ExplicitByteRequest(start=5, end=7))
get("test", SuffixByteRequest(2))
get("test", OffsetByteRequest(5))

@maxrjones
Copy link
Member Author

Well, after 5min of googling I learned about the match statement (new in Python 3.10), which kind of enables rust-style enums.

Time flies 😅 , I didn't realize Zarr-Python was already at 3.11+. This will be nice to use regardless of the argument syntax.

I share some of Norman's concerns over Tuple. While it's easy to document the behavior inside Zarr-Python, there is a risk of differences if/when people implement new storage classes externally. Also, while I think (start, stop) makes more sense than (start, length) due to its similarity to an HTTP Range, I am a bit concerned that the sudden change of behavior will cause issues for super early adopters.

I'd like to leave it to @normanrz @jhamman @d-v-b and any other zarr-python devs to decide on the path forward given the factors about simplicity and expressiveness noted in past comments 🙏 I'm happy to contribute additional prototypes or benchmarks if it helps your decision-making.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 8, 2025

I am a bit concerned that the sudden change of behavior will cause issues for super early adopters.

IMO the least disruptive thing for adopters at any stage is to use semantics that are familiar. both python indexing and http range requests use (start, stop) semantics, which makes it more familiar than (start, length) so we should switch to the former as soon as we can. How about we use the explicit dataclasses until someone complains that they take too long to write, and then we can consider widening the type to accept tuples?

@normanrz
Copy link
Member

normanrz commented Jan 8, 2025

I think changing at this point is ok.
Again, I think the dataclasses are very expressive, here. We might even have a class like this:

@dataclass
class ExplicitByteRequest:
    start: int
    end: int
    
    def __init__(self, start: int, *, end: int | None = None, length: int | None = None):
        object.__setattr__(self, "start", start)
        match (end, length):
            case (None, None):
                raise Exception("Neither end nor length is specified")
            case (end, None):
                object.__setattr__(self, "end", end)
            case (None, length):
                object.__setattr__(self, "end", start + length)
            case _:
                raise Exception("Only end OR length must be specified")

            
assert ExplicitByteRequest(start=5, end=10) == ExplicitByteRequest(start=5, length=5)

Also, I wouldn't really consider this an end-user-facing API because it will be used by store (or other extensions) developers. Complaining about typing too much would be a bit funny.

@maxrjones maxrjones changed the title Use TypedDicts for more literate ByteRangeRequests Use dataclasses for ByteRangeRequests Jan 8, 2025
Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I really like how the API turned out.
Just one comment on the naming of the dataclasses.

@maxrjones
Copy link
Member Author

maxrjones commented Jan 8, 2025

FYI @normanrz https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks/blob/main/plot-results.ipynb suggests that using match; case is actually noticeably slower than if; elif for explicit byte ranges. I know we're not optimizing around performance right now, but IMO the improved readabiliity in a7d35f8 is not worth any performance hit such that I'd like to revert the change.

@@ -31,7 +31,6 @@
ZATTRS_JSON = ".zattrs"
ZMETADATA_V2_JSON = ".zmetadata"

ByteRangeRequest = tuple[int | None, int | None]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% confident in this change, but couldn't otherwise quickly find a way to avoid circular imports. @d-v-b can you please confirm that this removal won't cause any issues?

@maxrjones
Copy link
Member Author

Thanks all for your help with this PR!

@normanrz normanrz merged commit 0328656 into zarr-developers:main Jan 9, 2025
28 checks passed
@normanrz
Copy link
Member

normanrz commented Jan 9, 2025

Thanks @maxrjones for pushing this through!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants