Arrow ipc Array -> Bytes codec #3613

rabernat · 2025-12-03T19:50:06Z

Implementation of the arrow-ipc Array -> Bytes codec proposed in zarr-developers/zarr-extensions#41

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

rabernat · 2025-12-03T19:59:04Z

src/zarr/core/metadata/v3.py

+    if isinstance(dtype, VariableLengthUTF8) and codec_class_name not in (
+        "VLenUTF8Codec",
+        "ArrowIPCCodec",
+    ):  # type: ignore[unreachable]


This change allows us to use either the vlen-utf8 codec or the arrow-ipc codec to encode variable-length strings.

Flagging that this sort of logic for mapping codec / dtype compatibility feels quite brittle and non-scalable. But I don't have a better proposal in mind.

d-v-b (Dec 3, 2025):

I feel the same way! We might need a dtype × codec compatibility matrix, though I'm not sure whether it should track compatibility or incompatibility.
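
For illustration, one shape such a compatibility matrix could take is an allow-list keyed by dtype class name. This is only a sketch: `COMPATIBLE_CODECS` and `is_compatible` are hypothetical names that do not exist in zarr-python, and treating unlisted dtypes as unconstrained is an assumption, not something decided in this thread.

```python
# Hypothetical sketch of a dtype x codec allow-list; not zarr-python API.
# Keys are dtype class names, values are the codec class names allowed
# to serialize that dtype.
COMPATIBLE_CODECS: dict[str, frozenset[str]] = {
    "VariableLengthUTF8": frozenset({"VLenUTF8Codec", "ArrowIPCCodec"}),
}


def is_compatible(dtype_name: str, codec_class_name: str) -> bool:
    """Return True if the codec may be used with the dtype.

    Assumption: a dtype with no entry in the table is unconstrained.
    """
    allowed = COMPATIBLE_CODECS.get(dtype_name)
    return allowed is None or codec_class_name in allowed
```

Tracking compatibility (as here) means a new codec is rejected for a listed dtype until someone adds it; tracking incompatibility would invert that default.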

rabernat · 2025-12-03T19:59:53Z

@d-v-b - resolving the typing errors here is beyond my ability. Would appreciate your help. 🙏

d-v-b · 2025-12-04T08:39:06Z

CI is passing as of 96273a8

ilan-gold · 2025-12-11T14:22:45Z

src/zarr/codecs/arrow.py

+        # Note: we only expect a single batch per chunk
+        record_batch = record_batch_reader.read_next_batch()
+        array = record_batch.column(self.column_name)
+        numpy_array = array.to_numpy()


Very happy to see this happening :)

I would be very curious about the behavior of non-standard types here. What does something like a geometry dtype (which isn't in pyarrow) or a DictionaryArray (which is in the core but has an implicit masking of sorts) do here? To be honest, I can't work out the exact behavior from the pyarrow docs.

Would it make sense to have a custom buffer class, similar to what @keewis is doing for sparse (I think)?

rabernat added 2 commits November 7, 2025 22:46

added arrow bytes codec

e8a0afd

make column name customizable

9c0b409

github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Dec 3, 2025

rabernat mentioned this pull request Dec 3, 2025

Add arrow-ipc Array -> Bytes codec zarr-developers/zarr-extensions#41

Open

rabernat commented Dec 3, 2025


d-v-b added 2 commits December 3, 2025 21:08

Merge branch 'main' into arrow-ipc-codec

d665e42

fix type checking errors

96273a8

ilan-gold reviewed Dec 11, 2025


ivirshup mentioned this pull request Dec 11, 2025

Add support for lists in obs scverse/anndata#1923

Open
