Error in string data type handling with 0 fill value #2792
Comments
where did these zarr arrays come from? Unfortunately the v2 spec doesn't associate the |
kaizhang/anndata-rs#15 @zqfang?
Yeah you are right. It differs between
It looks like the old behaviour was indeed to just |
from the anndata.write_zarr() |
and in the context of anndata, what is the expected interpretation of |
@ilan-gold? I think it might have just been overlooked. Perhaps |
if the intention is for the fill value to be the string On zarr-python's side, my personal preference would be to view "dtype is string, but fill value is a JSON number" as an error in metadata parsing, and we should produce a nice error message that advises people on how to fix their metadata documents. That might be a bit drastic given our history of permissiveness here, so maybe we start by doing what you recommend @LDeakin: casting JSON numbers to strings, and emitting a warning if this is necessary, with the promise that in a few releases fill values will be handled more strictly.
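To make the "cast and warn" option concrete, here is a minimal sketch of what such a step could look like. The function name `coerce_string_fill_value` is hypothetical and not part of zarr-python, and using `str()` for the cast (so `0` becomes `"0"`) is an assumption; mapping `0` to `""` would be another possible reading of the proposal.

```python
import warnings


def coerce_string_fill_value(fill_value):
    """Hypothetical helper: normalise a JSON fill value for a string dtype.

    Strings and null pass through unchanged. JSON numbers are cast with
    str() and a warning is emitted. Anything else is rejected.
    """
    if fill_value is None or isinstance(fill_value, str):
        return fill_value
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(fill_value, (int, float)) and not isinstance(fill_value, bool):
        warnings.warn(
            f"String array has numeric fill value {fill_value!r}; "
            "casting it to a string. This may become an error in a later release.",
            FutureWarning,
            stacklevel=2,
        )
        return str(fill_value)
    raise TypeError(f"Unsupported fill value for a string dtype: {fill_value!r}")
```

Reading stays permissive under this sketch, while the warning gives people time to fix their metadata documents before any stricter handling.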
If I understand this correctly, basically all of anndata's i/o would break with all previously created data until users change their files directly? I am not so sure that's a good idea.
I also am not so sure about this for similar reasons. Why not just continue the backwards compatibility for v2, and for v3 do things correctly? At the minimum, for reading (if you want to prevent people from writing bad data, that is fine).
No, that is just for merging arrays like Looking at our codebase, there is no mention at all of I will look into why this is happening, but I am strongly opposed to breaking anndata's read capabilities contingent on users changing a relatively obscure metadata field in an outdated file format |
This is default zarr v2 (i.e., the package) behavior:
import zarr
import numcodecs
import numpy as np
g = zarr.open("foo.zarr")
g.create_dataset("bar", shape=(50,), dtype=object, object_codec=numcodecs.VLenUTF8())
g["bar"][...] = np.array(["foo"] * 50).astype(object)
assert g["bar"].fill_value == 0
So this would break compat with anyone who has written a string array like this in the past, which I imagine is more than us. I think a fix for the v3 package that preserves the read behavior but disables writing arrays with the wrong fill-value is reasonable.
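For anyone wanting to check whether their own stores are affected, a rough sketch of a metadata check follows. It assumes the array was written the way shown above, so that `.zarray` has dtype `|O`, a `vlen-utf8` entry in `filters`, and a numeric `fill_value`. The function name is hypothetical and the example document is hand-written for illustration, not copied from a real store.

```python
import json


def is_vlen_string_with_numeric_fill(zarray_json: str) -> bool:
    """Hypothetical check: does this .zarray describe a variable-length
    UTF-8 string array whose fill value is a JSON number?"""
    meta = json.loads(zarray_json)
    filters = meta.get("filters") or []
    is_vlen_utf8 = meta.get("dtype") == "|O" and any(
        f.get("id") == "vlen-utf8" for f in filters
    )
    fill = meta.get("fill_value")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(fill, (int, float)) and not isinstance(fill, bool)
    return is_vlen_utf8 and is_number


# Hand-written example document (illustrative only).
example = json.dumps(
    {
        "zarr_format": 2,
        "shape": [50],
        "chunks": [50],
        "dtype": "|O",
        "filters": [{"id": "vlen-utf8"}],
        "fill_value": 0,
    }
)
print(is_vlen_string_with_numeric_fill(example))
```

A reader could use a check like this to keep accepting such arrays, while a writer could use it to refuse creating new ones.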
so if this is due to a bad default from zarr-python 2.x, then we should keep that behavior in zarr-python 3 to minimize discomfort. I'm not even sure if we should change how we write new zarr v2 data, since zarr v2 readers might be assuming the fill value for string arrays will be a JSON number? |
That's a fair point. I am not sure I'd expect packages to lack parsing for the correct metadata but it's possible. |
Ok full investigation of the old behaviour (zarr-python 2.18.4):
#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "zarr==2.18.4",
# "numcodecs<=0.14.0",
# ]
# ///
import zarr
import numcodecs
import numpy as np
array = zarr.open(dtype=str, shape=(5,), chunks=(2,))
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "0", "0"]).all()
# .zarray { "fill_value": "0" }
array = zarr.open(dtype=object, object_codec=numcodecs.VLenUTF8(), shape=(5,), chunks=(2,))
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == np.array(["a", "bb", "", "", 0], dtype=object)).all()
# .zarray { "fill_value": 0 }
array = zarr.open(dtype=str, shape=(5,), chunks=(2,), fill_value = None)
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "", None]).all()
# .zarray { "fill_value": null }
array = zarr.open(dtype=object, object_codec=numcodecs.VLenUTF8(), shape=(5,), chunks=(2,), fill_value = None)
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "", None]).all()
# .zarray { "fill_value": null }
Summarising:
So |
Zarr version
3.0.1
Numcodecs version
Python Version
3.12
Operating System
Linux
Installation
see reproducer
Description
There are Zarr V2 string arrays in the wild with a 0 fill value. zarr-python interprets this as an empty string (e.g. when partially writing a chunk), but a completely missing chunk returns 0's rather than ""s. See reproducer.
Steps to reproduce
Additional output
No response