Error in string data type handling with 0 fill value #2792

LDeakin · 2025-02-03T22:49:18Z

Zarr version

3.0.1

Numcodecs version

Python Version

3.12

Operating System

Linux

Installation

see reproducer

Description

There are Zarr V2 string arrays in the wild with a 0 fill value. zarr-python interprets this as an empty string (e.g. when partially writing a chunk), but a completely missing chunk returns 0's rather than ""s. See reproducer.

Steps to reproduce

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarr==3.0.1",
# ]
# ///

import zarr

array = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    dtype=str,
    shape=(5,),
    chunks=(2,),
    filters=zarr.codecs.vlen_utf8.VLenUTF8(),
    compressors=[None],
    fill_value=0,
    zarr_format=2,
    overwrite=True,
)
array[:3] = ["a", "bb", ""]
print(array.info)
# assert (array[:] == ["a", "bb", "", "", ""]).all() # EXPECTED
# array[:] is ["a", "bb", "", "", 0]

Additional output

No response

d-v-b · 2025-02-05T12:02:27Z

where did these zarr arrays come from? Unfortunately the v2 spec doesn't associate the fill_value field with a concrete type, so it's not possible to state with authority that a JSON number is an invalid fill value for an array of strings, but I feel like a fill value stored as the json number 0 does not unambiguously map to the empty string "". Is that what people expect here?

LDeakin · 2025-02-05T20:27:33Z

where did these zarr arrays come from?

kaizhang/anndata-rs#15 @zqfang?

I feel like a fill value stored as the json number 0 does not unambiguously map to the empty string "".

Yeah you are right. It differs between zarr-python versions:

2.18.4: ['a' 'bb' '' '0' '0']
3.0.1: ['a' 'bb' '' '' 0]

It looks like the old behaviour was indeed to just str() the fill values

zqfang · 2025-02-06T08:21:08Z

where did these zarr arrays come from?

from the anndata.write_zarr()

d-v-b · 2025-02-06T08:27:12Z

and in the context of anndata, what is the expected interpretation of 0 (a JSON number) as a fill value for an array of strings? Is it intended to be interpreted as the string "0"?

LDeakin · 2025-02-06T10:04:45Z

Is it intended to be interpreted as the string "0"?

@ilan-gold? I think it might have just been overlooked. anndata has used a default fill value of 0 for every data type for a long time https://github.com/scverse/anndata/blame/8d7beab49d04f5c1d91c847e0e1af99795d4d25f/src/anndata/_core/merge.py#L723-L738.

Perhaps zarr-python should restore the whole str(fill_value) behaviour with zarr_format=2 string arrays for compatibility, and just reject non-string fill values for zarr_format=3?

d-v-b · 2025-02-06T10:36:01Z

if the intention is for the fill value to be the string "0", then it seems like anndata.write_zarr should be setting the fill value to a string instead of a number.

On zarr-python's side, my personal preference would be to view "dtype is string, but fill value is a JSON number" as an error in metadata parsing, and we should produce a nice error message that advises people on how to fix their metadata documents.

That might be a bit drastic given our history of permissiveness here, so maybe we start by doing what you recommend @LDeakin: casting JSON numbers to strings, and emitting a warning if this is necessary, with the promise that in a few releases fill values will be handled more strictly.
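
A minimal sketch of the "cast and warn" idea, for illustration only. The helper name `coerce_string_fill_value` is hypothetical and not part of zarr-python; the warning category, the message wording, and the choice to pass `None` through unchanged are all assumptions.

```python
import warnings


def coerce_string_fill_value(fill_value):
    """Hypothetical helper: normalise the fill value of a string array.

    Strings are returned unchanged. JSON numbers are cast with str() and a
    warning is emitted. None (JSON null) is passed through untouched, which
    is an assumption of this sketch.
    """
    if fill_value is None or isinstance(fill_value, str):
        return fill_value
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(fill_value, (int, float)) and not isinstance(fill_value, bool):
        warnings.warn(
            f"Fill value {fill_value!r} is a JSON number but the array dtype "
            "is string; casting it to a string. This may become an error in "
            "a future release.",
            FutureWarning,
            stacklevel=2,
        )
        return str(fill_value)
    raise TypeError(
        f"Cannot use {fill_value!r} as a fill value for a string array"
    )
```

With this sketch, a metadata document containing `"fill_value": 0` would yield the string "0", matching the str() behaviour observed for zarr-python 2.18.4 above.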

ilan-gold · 2025-02-07T09:21:49Z

we should produce a nice error message that advises people on how to fix their metadata documents.

If I understand this correctly, basically all of anndata's i/o would break with all previously created data until users change their files directly? I am not so sure that's a good idea.

casting JSON numbers to strings, and emitting a warning if this is necessary, with the promise that in a few releases fill values will be handled more strictly.

I also am not so sure about this for similar reasons. Why not just continue the backwards compatibility for v2, and for v3 do things correctly? At the minimum, for reading (if you want to prevent people from writing bad data, that is fine).

@ilan-gold? I think it might have just been overlooked. anndata has used a default fill value of 0 for every data type for a long time https://github.com/scverse/anndata/blame/8d7beab49d04f5c1d91c847e0e1af99795d4d25f/src/anndata/_core/merge.py#L723-L738.

No that is just for merging arrays like anndata.concat([my_anndata, her_anndata])

Looking at our codebase, there is no mention at all of fill_value outside of the concatenation context.

I will look into why this is happening, but I am strongly opposed to breaking anndata's read capabilities contingent on users changing a relatively obscure metadata field in an outdated file format.

ilan-gold · 2025-02-07T09:34:54Z

This is default zarr v2 (i.e., the package) behavior:

import zarr
import numcodecs
import numpy as np

g = zarr.open("foo.zarr")
g.create_dataset("bar", shape=(50,), dtype=object, object_codec=numcodecs.VLenUTF8())
g["bar"][...] = np.array(["foo"] * 50).astype(object)
assert g["bar"].fill_value == 0

So this would break compat with anyone who has written a string array like this in the past, which I imagine is more than us. I think a fix for the v3 package that preserves the read behavior but disables writing arrays with the wrong fill-value is reasonable.

d-v-b · 2025-02-07T10:10:08Z

so if this is due to a bad default from zarr-python 2.x, then we should keep that behavior in zarr-python 3 to minimize discomfort. I'm not even sure if we should change how we write new zarr v2 data, since zarr v2 readers might be assuming the fill value for string arrays will be a JSON number?

ilan-gold · 2025-02-07T10:20:05Z

zarr v2 readers might be assuming the fill value for string arrays will be a JSON number?

That's a fair point. I am not sure I'd expect packages to lack parsing for the correct metadata but it's possible.

LDeakin · 2025-02-08T00:17:40Z

Ok full investigation of the old behaviour (zarr-python 2.18.4):

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarr==2.18.4",
#     "numcodecs<=0.14.0",
# ]
# ///

import zarr
import numcodecs
import numpy as np

array = zarr.open(dtype=str, shape=(5,), chunks=(2,))
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "0", "0"]).all()
# .zarray { "fill_value": "0" }

array = zarr.open(dtype=object, object_codec=numcodecs.VLenUTF8(), shape=(5,), chunks=(2,))
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == np.array(["a", "bb", "", "", 0], dtype=object)).all()
# .zarray { "fill_value": 0 }

array = zarr.open(dtype=str, shape=(5,), chunks=(2,), fill_value = None)
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "", None]).all()
# .zarray { "fill_value": null }

array = zarr.open(dtype=object, object_codec=numcodecs.VLenUTF8(), shape=(5,), chunks=(2,), fill_value = None)
array[:3] = np.array(["a", "bb", ""], dtype=object)
assert (array[:] == ["a", "bb", "", "", None]).all()
# .zarray { "fill_value": null }

Summarising:

dtype=str has "0" fill values 🥲
dtype=object, object_codec=numcodecs.VLenUTF8() has a 0 fill value that maps to "" on disk
a None/null fill value always maps to "" on disk

So zarr-python 3.x.x just needs to add backwards compatibility support for the second case for Zarr V2 data.
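
A rough sketch of that backwards-compatibility rule, assuming it is applied only when reading Zarr V2 metadata for a variable-length UTF-8 string array. The helper name `resolve_v2_string_fill_value` is hypothetical and not zarr-python API, and rejecting non-zero numbers is an assumption of this sketch rather than something decided in this thread.

```python
import json


def resolve_v2_string_fill_value(raw):
    """Hypothetical helper: map a raw Zarr V2 JSON fill value to a string.

    Follows the summary above:
      - null      -> ""
      - number 0  -> ""   (legacy default for object + VLenUTF8 arrays)
      - a string  -> used as-is (covers the "0" written for dtype=str)
    """
    if raw is None:
        return ""
    if isinstance(raw, str):
        return raw
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(raw, int) and not isinstance(raw, bool) and raw == 0:
        return ""
    raise ValueError(f"Unsupported fill value for a V2 string array: {raw!r}")


# Example: a trimmed-down metadata document carrying the legacy default.
zarray = json.loads('{"fill_value": 0}')
fill = resolve_v2_string_fill_value(zarray["fill_value"])
```

Because a JSON string "0" is kept as "0" while a JSON number 0 becomes "", the first and second cases above stay distinguishable from the metadata alone.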

LDeakin added the bug Potential issues with the zarr-python library label Feb 3, 2025

LDeakin mentioned this issue Feb 3, 2025: handle fill_value datatype for string-array LDeakin/zarrs#140 (Merged)

moradology mentioned this issue Feb 4, 2025: Parse 0 fill value as "" for str dtype #2798 (Closed)

LDeakin mentioned this issue Feb 5, 2025: (feat): full v2 compat via python fallback ilan-gold/zarrs-python#84 (Merged)

LDeakin mentioned this issue Feb 7, 2025: fix(metadata): interpret 0 fill value as "0" for Zarr V2 string arrays LDeakin/zarrs#142 (Closed)

LDeakin mentioned this issue Feb 13, 2025: Make create_array signatures consistent #2819 (Open)
