Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in string data type handling with 0 fill value #2792

Open
LDeakin opened this issue Feb 3, 2025 · 11 comments
Open

Error in string data type handling with 0 fill value #2792

LDeakin opened this issue Feb 3, 2025 · 11 comments
Labels
bug Potential issues with the zarr-python library

Comments

@LDeakin
Copy link
Contributor

LDeakin commented Feb 3, 2025

Zarr version

3.0.1

Numcodecs version

Python Version

3.12

Operating System

Linux

Installation

see reproducer

Description

There are Zarr V2 string arrays in the wild with a 0 fill value. zarr-python interprets this as an empty string (e.g. when partially writing a chunk), but a completely missing chunk returns 0's rather than ""s. See reproducer.

Steps to reproduce

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarr==3.0.1",
# ]
# ///

import zarr

array = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    dtype=str,
    shape=(5,),
    chunks=(2,),
    filters=zarr.codecs.vlen_utf8.VLenUTF8(),
    compressors=[None],
    fill_value=0,
    zarr_format=2,
    overwrite=True,
)
array[:3] = ["a", "bb", ""]
print(array.info)
# assert (array[:] == ["a", "bb", "", "", ""]).all() # EXPECTED
# array[:] is ["a", "bb", "", "", 0]

Additional output

No response

@LDeakin LDeakin added the bug Potential issues with the zarr-python library label Feb 3, 2025
@d-v-b
Copy link
Contributor

d-v-b commented Feb 5, 2025

where did these zarr arrays come from? Unfortunately the v2 spec doesn't associate the fill_value field with a concrete type, so it's not possible to state with authority that a JSON number is an invalid fill value for an array of strings, but I feel like a fill value stored as the json number 0 does not unambiguously map to the empty string "". Is that what people expect here?

@LDeakin
Copy link
Contributor Author

LDeakin commented Feb 5, 2025

where did these zarr arrays come from?

kaizhang/anndata-rs#15 @zqfang?

I feel like a fill value stored as the json number 0 does not unambiguously map to the empty string "".

Yeah you are right. It differs between zarr-python versions:

  • 2.18.4: ['a' 'bb' '' '0' '0']
  • 3.0.1: ['a' 'bb' '' '' 0]

It looks like the old behaviour was indeed to just str() the fill values

@zqfang
Copy link

zqfang commented Feb 6, 2025

where did these zarr arrays come from?

from the anndata.write_zarr()

@d-v-b
Copy link
Contributor

d-v-b commented Feb 6, 2025

and in the context of anndata, what is the expected interpretation of 0 (a JSON number) as a fill value for an array of strings? Is it intended to be interpreted as the string "0"?

@LDeakin
Copy link
Contributor Author

LDeakin commented Feb 6, 2025

Is it intended to be interpreted as the string "0"?

@ilan-gold? I think it might have just been overlooked. anndata has used a default fill value of 0 for every data type for a long time https://github.com/scverse/anndata/blame/8d7beab49d04f5c1d91c847e0e1af99795d4d25f/src/anndata/_core/merge.py#L723-L738.

Perhaps zarr-python should restore the whole str(fill_value) behaviour with zarr_format=2 string arrays for compatibility, and just reject non-string fill values for zarr_format=3?

@d-v-b
Copy link
Contributor

d-v-b commented Feb 6, 2025

if the intention is for the fill value to be the string "0", then it seems like anndata.write_zarr should be setting the fill value to a string instead of a number.

On zarr-python's side, my personal preference would be to view "dtype is string, but fill value is a JSON number" as an error in metadata parsing, and we should produce a nice error message that advises people on how to fix their metadata documents.

That might be a bit drastic given our history of permissiveness here, so we maybe we start by doing what you recommend @LDeakin: casting JSON numbers to strings, and emitting a warning if this is necessary, with the promise that in a few releases fill values will be handled more strictly.

@ilan-gold
Copy link
Contributor

we should produce a nice error message that advises people on how to fix their metadata documents.

If I understand this correctly, basically all of anndata's i/o would break with all previously created data until users change their files directly? I am not so sure that's a good idea.

casting JSON numbers to strings, and emitting a warning if this is necessary, with the promise that in a few releases fill values will be handled more strictly.

I also am not so sure about this for similar reasons. Why not just continue the backwards compatibility for v2, and for v3 do things correctly? At the minimum, for reading (if you want to prevent people from writing bad data, that is fine).

@ilan-gold? I think it might have just been overlooked. anndata has used a default fill value of 0 for every data type for a long time https://github.com/scverse/anndata/blame/8d7beab49d04f5c1d91c847e0e1af99795d4d25f/src/anndata/_core/merge.py#L723-L738.

No that is just for merging arrays like anndata.concat([my_anndata, her_anndata])

Looking at our codebase, there is not mention at all of fill_value outside of the concatenation context.

I will look into why this is happening, but I am strongly opposed to breaking anndata's read capabilities contingent on users changing a realtively-obscure metadata field in an outdated file format

@ilan-gold
Copy link
Contributor

ilan-gold commented Feb 7, 2025

This is default zarr v2 (i.e., the package) behavior:

import zarr
import numcodecs

g = zarr.open("foo.zarr")
g.create_dataset("bar", shape=(50,), dtype=object, object_codec=numcodecs.VLenUTF8())
g["bar"][...] = np.array(["foo"] * 50).astype(object)
assert g["bar"].fill_value == 0

So this would break compat with anyone who has written a string array like this in the past, which I imagine is more than us. I think a fix for the v3 package that preserves the read behavior but disables writing arrays with the wrong fill-value is reasonable.

@d-v-b
Copy link
Contributor

d-v-b commented Feb 7, 2025

so if this is due to a bad default from zarr-python 2.x, then we should keep that behavior in zarr-python 3 to minimize discomfort. I'm not even sure if we should change how we write new zarr v2 data, since zarr v2 readers might be assuming the fill value for string arrays will be a JSON number?

@ilan-gold
Copy link
Contributor

ilan-gold commented Feb 7, 2025

zarr v2 readers might be assuming the fill value for string arrays will be a JSON number?

That's a fair point. I am not sure I'd expect packages to lack parsing for the correct metadata but it's possible.

@LDeakin
Copy link
Contributor Author

LDeakin commented Feb 8, 2025

Ok full investigation of the old behaviour (zarr-python 2.18.4):

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarr==2.18.4",
#     "numcodecs<=0.14.0",
# ]
# ///

import zarr
import numcodecs
import numpy as np

array = zarr.open(dtype=str, shape=(5,), chunks=(2,))
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "0", "0"]).all()
# .zarray { "fill_value": "0" }

array = zarr.open(dtype=object, object_codec=numcodecs.VLenUTF8(), shape=(5,), chunks=(2,))
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == np.array(["a", "bb", "", "", 0], dtype=object)).all()
# .zarray { "fill_value": 0 }

array = zarr.open(dtype=str, shape=(5,), chunks=(2,), fill_value = None)
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "", None]).all()
# .zarray { "fill_value": null }

array = zarr.open(dtype=object, object_codec=numcodecs.VLenUTF8(), shape=(5,), chunks=(2,), fill_value = None)
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "", None]).all()
# .zarray { "fill_value": null }

Summarising:

  • dtype=str has "0" fill values 🥲
  • dtype=object, object_codec=numcodecs.VLenUTF8() has a 0 fill value that maps to "" on disk
  • a None/null fill value always maps to "" on disk

So zarr-python 3.x.x just needs to add backwards compatibility support for the second case for Zarr V2 data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants