-
-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use dataclasses for ByteRangeRequests #2585
Use dataclasses for ByteRangeRequests #2585
Conversation
Why not use literate instead of a tuple? class ExplicitRange(TypedDict):
start: int
end: int |
I added the changes for sharding in 4a73b70 |
My understanding from #2437 (comment) was that there were concerns over performance for dataclasses, which I thought could extend to TypedDict over tuple. If you'd rather literate instead of a tuple, I could make that change and run some simple tests to check performance. |
Yeah. I think that would be great. If the performance is ok, I would prefer a dict (or dataclass). |
Thanks @normanrz, I haven't done those tests but can try that out along with your earlier suggestion. For timing, do you think this PR would be considered for the V3 release? I could prioritize this if so, but if not I won't rush to work on it. |
We are closing 3.0 right now, so I guess this will be a post-3.0 issue. So, no rush here. |
Sounds good, I was pushing for pre v3.0 in #2412 (comment) under the assumption that using |
My thinking is that this is not a big enough blocker for the 3.0 release. We can change this afterwards, but we will probably need a backwards-compat layer. |
I agree. This can come after 3.0.0. But the sooner the better. The interface isn't really used by anyone external except for icechunk so if we do this quickly, we can probably get by without a compat layer. |
src/zarr/storage/_fsspec.py
Outdated
key_ranges = list(key_ranges) | ||
paths: list[str] = [] | ||
starts: list[int | None] = [] | ||
stops: list[int | None] = [] | ||
for key, byte_range in key_ranges: | ||
paths.append(_dereference_path(self.path, key)) | ||
if byte_range is None: | ||
starts.append(None) | ||
stops.append(None) | ||
elif isinstance(byte_range, tuple): | ||
starts.append(byte_range[0]) | ||
stops.append(byte_range[1]) | ||
elif isinstance(byte_range, dict): | ||
if "offset" in byte_range: | ||
starts.append(byte_range["offset"]) # type: ignore[typeddict-item] | ||
stops.append(None) | ||
elif "suffix" in byte_range: | ||
starts.append(-byte_range["suffix"]) | ||
stops.append(None) | ||
else: | ||
raise ValueError("Invalid format for ByteRangeRequest") | ||
else: | ||
raise ValueError("Invalid format for ByteRangeRequest") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is where the PR could most likely cause a performance hit by iterating over the inputs rather than using zip.
I also think the performance is ok. My problem with tuples is that it is not clear what the second component is. Without further docs, it could be either the exclusive end or a byte length. That is why I like the dataclass (with kwargs: Have we considered using a |
i think |
We could really use rust-style enums here :D |
Well, after 5min of googling I learned about the from dataclasses import dataclass
@dataclass
class ExplicitByteRequest:
start: int
end: int
@dataclass
class SuffixByteRequest:
suffix: int
@dataclass
class OffsetByteRequest:
offset: int
ByteRequest = ExplicitByteRequest | SuffixByteRequest | OffsetByteRequest
def get(path: str, byte_range: ByteRequest | None):
match byte_range:
case None:
return print(path)
case ExplicitByteRequest(start, end):
return print(path, f"{start=}, {end=}")
case SuffixByteRequest(suffix):
return print(path, f"start={-suffix}, end=None")
case OffsetByteRequest(offset):
return print(path, f"start={offset}, end=None")
case _:
raise Exception(f"Unexpected byte_range, got {byte_range}.")
get("test", None)
get("test", ExplicitByteRequest(start=5, end=7))
get("test", SuffixByteRequest(2))
get("test", OffsetByteRequest(5)) |
Time flies 😅 , I didn't realize Zarr-Python was already at 3.11+. This will be nice to use regardless of the argument syntax. I share some of Norman's concerns over Tuple. While it's easy to document the behavior inside Zarr-Python, there is a risk of differences if/when people implement new storage classes externally. Also, while I think I'd like to leave it to @normanrz @jhamman @d-v-b and any other zarr-python devs to decide on the path forward given the factors about simplicity and expressiveness noted in past comments 🙏 I'm happy to contribute additional prototypes or benchmarks if it helps your decision-making. |
IMO the least disruptive thing for adopters at any stage is to use semantics that are familiar. both python indexing and http range requests use |
I think changing at this point is ok. @dataclass
class ExplicitByteRequest:
start: int
end: int
def __init__(self, start: int, *, end: int | None = None, length: int | None = None):
object.__setattr__(self, "start", start)
match (end, length):
case (None, None):
raise Exception("Neither end nor length is specified")
case (end, None):
object.__setattr__(self, "end", end)
case (None, length):
object.__setattr__(self, "end", start + length)
case _:
raise Exception("Only end OR length must be specified")
assert ExplicitByteRequest(start=5, end=10) == ExplicitByteRequest(start=5, length=5) Also, I wouldn't really consider this an end-user-facing API because it will be used by store (or other extensions) developers. Complaining about typing too much would be a bit funny. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! I really like how the API turned out.
Just one comment on the naming of the dataclasses.
FYI @normanrz https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks/blob/main/plot-results.ipynb suggests that using match; case is actually noticeably slower than if; elif for explicit byte ranges. I know we're not optimizing around performance right now, but IMO the improved readabiliity in a7d35f8 is not worth any performance hit such that I'd like to revert the change. |
@@ -31,7 +31,6 @@ | |||
ZATTRS_JSON = ".zattrs" | |||
ZMETADATA_V2_JSON = ".zmetadata" | |||
|
|||
ByteRangeRequest = tuple[int | None, int | None] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% confident in this change, but couldn't otherwise quickly find a way to avoid circular imports. @d-v-b can you please confirm that this removal won't cause any issues?
Thanks all for your help with this PR! |
Thanks @maxrjones for pushing this through! |
This draft PR modifies the ByteRangeRequest type to follow the same approach as
obstore
, as suggested here. It is a larger change than the decision made in #1900 (comment) to specifyByteRangeRequest
tuples asstart, stop
rather thanstart, length
which has not yet been implemented inmain
.As for performance impacts, please see these simple timings - https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks/blob/main/plot-results.ipynb