Use dataclasses for ByteRangeRequests #2585

maxrjones · 2024-12-22T22:52:08Z

This draft PR modifies the ByteRangeRequest type to follow the same approach as obstore, as suggested here. It is a larger change than the decision made in #1900 (comment) to specify ByteRangeRequest tuples as (start, stop) rather than (start, length), a decision that has not yet been implemented in main.

As for performance impacts, please see these simple timings - https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks/blob/main/plot-results.ipynb
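
To make the two tuple conventions concrete, here is a minimal sketch of how the same bytes are addressed under (start, length) and (start, stop). The helper names are hypothetical and are not part of Zarr-Python or obstore.

```python
def read_start_length(data: bytes, start: int, length: int) -> bytes:
    # (start, length): the second value counts bytes from `start`.
    return data[start : start + length]


def read_start_stop(data: bytes, start: int, stop: int) -> bytes:
    # (start, stop): the second value is an exclusive end position.
    return data[start:stop]


data = b"0123456789"

# The same three bytes, spelled two ways.
by_length = read_start_length(data, 2, 3)
by_stop = read_start_stop(data, 2, 5)

# The same tuple (2, 3) means different things under each convention.
misread = read_start_stop(data, 2, 3)
```

The ambiguity of a bare tuple is visible in the last line: nothing in `(2, 3)` itself says which convention is meant.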

normanrz · 2024-12-30T12:21:08Z

Why not use literate instead of a tuple?

class ExplicitRange(TypedDict):
    start: int
    end: int

normanrz · 2024-12-30T12:25:40Z

I added the changes for sharding in 4a73b70

maxrjones · 2024-12-30T13:25:20Z

Why not use literate instead of a tuple?
class ExplicitRange(TypedDict):
    start: int
    end: int

My understanding from #2437 (comment) was that there were concerns over performance for dataclasses, which I thought could extend to TypedDict over tuple. If you'd rather literate instead of a tuple, I could make that change and run some simple tests to check performance.

normanrz · 2025-01-02T13:53:51Z

If you'd rather literate instead of a tuple, I could make that change and run some simple tests to check performance.

Yeah. I think that would be great. If the performance is ok, I would prefer a dict (or dataclass).
Btw, have you done some performance testing how @dataclass(slots=True) or NamedTuple compares to a dict?

maxrjones · 2025-01-02T16:33:28Z

If you'd rather literate instead of a tuple, I could make that change and run some simple tests to check performance.

Yeah. I think that would be great. If the performance is ok, I would prefer a dict (or dataclass). Btw, have you done some performance testing how @dataclass(slots=True) or NamedTuple compares to a dict?

Thanks @normanrz, I haven't done those tests but can try that out along with your earlier suggestion. For timing, do you think this PR would be considered for the V3 release? I could prioritize this if so, but if not I won't rush to work on it.

normanrz · 2025-01-02T17:20:43Z

We are closing 3.0 right now, so I guess this will be a post-3.0 issue. So, no rush here.

maxrjones · 2025-01-02T17:57:14Z

We are closing 3.0 right now, so I guess this will be a post-3.0 issue. So, no rush here.

Sounds good, I was pushing for pre v3.0 in #2412 (comment) under the assumption that using (start, end) or TypedDict/NamedTuple/Dataclass instead would not be accepted afterwards because it is a breaking change. But I am happy to wait if you are open to later changes in behavior.

normanrz · 2025-01-02T18:01:10Z

My thinking is that this is not a big enough blocker for the 3.0 release. We can change this afterwards, but we will probably need a backwards-compat layer.
@jhamman what do you think here?

jhamman · 2025-01-04T00:34:15Z

I agree. This can come after 3.0.0. But the sooner the better. The interface isn't really used by anyone external except for icechunk so if we do this quickly, we can probably get by without a compat layer.
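
For reference, a backwards-compat layer of the kind mentioned above might look roughly like the sketch below. It is hypothetical and was not part of this PR. It assumes legacy tuples meant (start, length), that a negative start with no length meant a suffix, and that a missing start meant 0; the function name `coerce_byte_request` is invented for illustration. The class names follow the final names in this PR's commit history.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeByteRequest:
    start: int  # first byte, inclusive
    end: int  # last byte, exclusive


@dataclass(frozen=True)
class OffsetByteRequest:
    offset: int  # read from this byte to the end


@dataclass(frozen=True)
class SuffixByteRequest:
    suffix: int  # read the final `suffix` bytes


_REQUEST_TYPES = (RangeByteRequest, OffsetByteRequest, SuffixByteRequest)


def coerce_byte_request(byte_range):
    """Hypothetical shim: accept a legacy tuple and return a dataclass request."""
    if byte_range is None or isinstance(byte_range, _REQUEST_TYPES):
        return byte_range
    if isinstance(byte_range, tuple) and len(byte_range) == 2:
        warnings.warn(
            "tuple byte ranges are deprecated; pass a ByteRequest dataclass",
            DeprecationWarning,
            stacklevel=2,
        )
        # ASSUMPTION: legacy tuples were (start, length).
        start, length = byte_range
        if length is None:
            if start is None:
                return None
            # ASSUMPTION: a negative start with no length meant "last N bytes".
            return SuffixByteRequest(-start) if start < 0 else OffsetByteRequest(start)
        # ASSUMPTION: a missing start meant the beginning of the value.
        start = 0 if start is None else start
        return RangeByteRequest(start, start + length)
    raise TypeError(f"Unexpected byte_range, got {byte_range!r}.")
```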

maxrjones · 2025-01-06T23:58:30Z

src/zarr/storage/_fsspec.py

+            key_ranges = list(key_ranges)
+            paths: list[str] = []
+            starts: list[int | None] = []
+            stops: list[int | None] = []
+            for key, byte_range in key_ranges:
+                paths.append(_dereference_path(self.path, key))
+                if byte_range is None:
+                    starts.append(None)
+                    stops.append(None)
+                elif isinstance(byte_range, tuple):
+                    starts.append(byte_range[0])
+                    stops.append(byte_range[1])
+                elif isinstance(byte_range, dict):
+                    if "offset" in byte_range:
+                        starts.append(byte_range["offset"])  # type: ignore[typeddict-item]
+                        stops.append(None)
+                    elif "suffix" in byte_range:
+                        starts.append(-byte_range["suffix"])
+                        stops.append(None)
+                    else:
+                        raise ValueError("Invalid format for ByteRangeRequest")
+                else:
+                    raise ValueError("Invalid format for ByteRangeRequest")


This is where the PR is most likely to cause a performance hit, because it iterates over the inputs rather than using zip.
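
The trade-off can be illustrated in isolation. The sketch below is a simplified stand-in for the store code (the function names are hypothetical): when every request is a plain tuple the inputs can be transposed with zip, whereas mixed request kinds need a per-item branch.

```python
def split_with_zip(key_ranges):
    # Works only when every byte_range is a (start, stop) tuple.
    keys, ranges = zip(*key_ranges)
    starts, stops = zip(*ranges)
    return list(keys), list(starts), list(stops)


def split_with_loop(key_ranges):
    # Handles None (whole value) as well as (start, stop) tuples,
    # at the cost of a Python-level branch per item.
    keys, starts, stops = [], [], []
    for key, byte_range in key_ranges:
        keys.append(key)
        if byte_range is None:
            starts.append(None)
            stops.append(None)
        else:
            starts.append(byte_range[0])
            stops.append(byte_range[1])
    return keys, starts, stops


tuple_only = [("c/0", (0, 10)), ("c/1", (5, 20))]
mixed = [("c/0", (0, 10)), ("c/1", None)]
```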

normanrz · 2025-01-08T14:56:41Z

I also think the performance is ok.

My problem with tuples is that it is not clear what the second component is. Without further docs, it could be either the exclusive end or a byte length. That is why I like the dataclass (with kwargs: values = asyncio.run(store.get('c/0', byte_range=ExplicitRange(start=1, end=5)))) or typed dict approach better.

Have we considered using a slice (with disallowed step != 1)? It has clear semantics for start and end and supports suffix and offset.

d-v-b · 2025-01-08T15:30:13Z

I think slice is a bit too expressive, and it doesn't play nicely with type checking, so we would need to examine each one at runtime to ensure that it's a valid specification of a dense interval.

normanrz · 2025-01-08T15:32:00Z

We could really use rust-style enums here :D

normanrz · 2025-01-08T15:40:23Z

Well, after 5min of googling I learned about the match statement (new in Python 3.10), which kind of enables rust-style enums.

from dataclasses import dataclass

@dataclass
class ExplicitByteRequest:
    start: int
    end: int

@dataclass
class SuffixByteRequest:
    suffix: int

@dataclass
class OffsetByteRequest:
    offset: int

ByteRequest = ExplicitByteRequest | SuffixByteRequest | OffsetByteRequest

def get(path: str, byte_range: ByteRequest | None):
    match byte_range:
        case None:
            return print(path)
        case ExplicitByteRequest(start, end):
            return print(path, f"{start=}, {end=}")
        case SuffixByteRequest(suffix):
            return print(path, f"start={-suffix}, end=None")
        case OffsetByteRequest(offset):
            return print(path, f"start={offset}, end=None")
        case _:
            raise Exception(f"Unexpected byte_range, got {byte_range}.")

get("test", None)
get("test", ExplicitByteRequest(start=5, end=7))
get("test", SuffixByteRequest(2))
get("test", OffsetByteRequest(5))

maxrjones · 2025-01-08T16:18:34Z

Well, after 5min of googling I learned about the match statement (new in Python 3.10), which kind of enables rust-style enums.

Time flies 😅 , I didn't realize Zarr-Python was already at 3.11+. This will be nice to use regardless of the argument syntax.

I share some of Norman's concerns over Tuple. While it's easy to document the behavior inside Zarr-Python, there is a risk of differences if/when people implement new storage classes externally. Also, while I think (start, stop) makes more sense than (start, length) due to its similarity to an HTTP Range, I am a bit concerned that the sudden change of behavior will cause issues for super early adopters.

I'd like to leave it to @normanrz @jhamman @d-v-b and any other zarr-python devs to decide on the path forward given the factors about simplicity and expressiveness noted in past comments 🙏 I'm happy to contribute additional prototypes or benchmarks if it helps your decision-making.

d-v-b · 2025-01-08T16:25:29Z

I am a bit concerned that the sudden change of behavior will cause issues for super early adopters.

IMO the least disruptive thing for adopters at any stage is to use semantics that are familiar. Both Python indexing and HTTP range requests use (start, stop) semantics, which makes it more familiar than (start, length), so we should switch to the former as soon as we can. How about we use the explicit dataclasses until someone complains that they take too long to write, and then we can consider widening the type to accept tuples?

normanrz · 2025-01-08T17:00:07Z

I think changing at this point is ok.
Again, I think the dataclasses are very expressive, here. We might even have a class like this:

from dataclasses import dataclass

@dataclass
class ExplicitByteRequest:
    start: int
    end: int

    # @dataclass keeps an __init__ defined in the class body, so this custom
    # constructor can accept either an exclusive end or a length.
    def __init__(self, start: int, *, end: int | None = None, length: int | None = None):
        object.__setattr__(self, "start", start)
        match (end, length):
            case (None, None):
                raise Exception("Neither end nor length is specified")
            case (end, None):
                object.__setattr__(self, "end", end)
            case (None, length):
                # Normalize a length to an exclusive end.
                object.__setattr__(self, "end", start + length)
            case _:
                raise Exception("Only end OR length must be specified")


# Both spellings describe the same byte range.
assert ExplicitByteRequest(start=5, end=10) == ExplicitByteRequest(start=5, length=5)

Also, I wouldn't really consider this an end-user-facing API because it will be used by store (or other extensions) developers. Complaining about typing too much would be a bit funny.

normanrz

This is great! I really like how the API turned out.
Just one comment on the naming of the dataclasses.

src/zarr/abc/store.py

maxrjones · 2025-01-08T20:43:32Z

FYI @normanrz https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks/blob/main/plot-results.ipynb suggests that using match; case is actually noticeably slower than if; elif for explicit byte ranges. I know we're not optimizing around performance right now, but IMO the improved readability in a7d35f8 is not worth any performance hit, so I'd like to revert the change.

This reverts commit a7d35f8.

maxrjones · 2025-01-08T21:38:07Z

src/zarr/core/common.py

@@ -31,7 +31,6 @@
 ZATTRS_JSON = ".zattrs"
 ZMETADATA_V2_JSON = ".zmetadata"

-ByteRangeRequest = tuple[int | None, int | None]


I'm not 100% confident in this change, but couldn't otherwise quickly find a way to avoid circular imports. @d-v-b can you please confirm that this removal won't cause any issues?

maxrjones · 2025-01-08T21:39:59Z

Thanks all for your help with this PR!

normanrz · 2025-01-09T08:57:31Z

Thanks @maxrjones for pushing this through!

maxrjones added 2 commits December 22, 2024 17:29

Use TypedDicts for more literate ByteRangeRequests

125a729

Update utility function

608f390

fixes sharding

4a73b70

Merge branch 'main' into literate-byte-ranges

b1b38f9

maxrjones added 7 commits January 6, 2025 10:22

Merge branch 'main' into literate-byte-ranges

5d06965

Merge branch 'main' into literate-byte-ranges

70a81ec

Ignore mypy errors

c4e6625

Merge branch 'main' into literate-byte-ranges

395b0da

Fix offset in _normalize_byte_range_index

f8dc6e5

Update get_partial_values for FsspecStore

78dfa76

Merge branch 'main' into literate-byte-ranges

66a8b81

maxrjones added 7 commits January 6, 2025 19:07

Re-add fs._cat_ranges argument

61035c6

Simplify typing

76ba672

Update _normalize to return start, stop

68a6df3

Merge branch 'main' into literate-byte-ranges

bd92bae

Use explicit range

650fb38

Use dataclasses

46070f4

Update typing

8464094

maxrjones changed the title ~~[RFC] Use TypedDicts for more literate ByteRangeRequests~~ Use TypedDicts for more literate ByteRangeRequests Jan 7, 2025

Merge branch 'byterange-dataclass' into literate-byte-ranges

646454e

maxrjones changed the title ~~Use TypedDicts for more literate ByteRangeRequests~~ Use dataclasses for ByteRangeRequests Jan 8, 2025

maxrjones added 6 commits January 8, 2025 14:12

Update docstring

4cf6e11

Merge branch 'main' into literate-byte-ranges

af2b06a

Rename ExplicitRange to ExplicitByteRequest

e084313

Rename OffsetRange to OffsetByteRequest

7659be4

Rename SuffixRange to SuffixByteRequest

fff58dc

Use match; case instead of if; elif

a7d35f8

normanrz approved these changes Jan 8, 2025

maxrjones added 5 commits January 8, 2025 16:12

Revert "Use match; case instead of if; elif"

be6324f

This reverts commit a7d35f8.

Update ByteRangeRequest to ByteRequest

7191c84

Remove ByteRange definition from common

a8ea2da

Rename ExplicitByteRequest to RangeByteRequest

e7d29c5

Provide more informative error message

e6120bf

normanrz merged commit 0328656 into zarr-developers:main Jan 9, 2025
28 checks passed

maxrjones deleted the literate-byte-ranges branch January 9, 2025 14:35

jhamman mentioned this pull request Jan 10, 2025

Make the ByteRangeRequest more literate #2437

Closed

This was referenced Jan 10, 2025

Use new ByteRequest syntax and raise NotImplementedError on pickling kylebarron/zarr-python#4

Merged

Change get_range start and end parameters to be kwarg only, add length as alternative developmentseed/obstore#155

Closed
