@rabernat (Contributor) commented Dec 3, 2025

Implementation of arrow-ipc Array Bytes codec proposed in zarr-developers/zarr-extensions#41

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Dec 3, 2025
if isinstance(dtype, VariableLengthUTF8) and codec_class_name not in (
    "VLenUTF8Codec",
    "ArrowIPCCodec",
):  # type: ignore[unreachable]
Contributor Author

This change allows us to use either the vlen-utf8 or the arrow-ipc codec to encode variable-length strings.

Flagging that this sort of logic for mapping codec / dtype compatibility feels quite brittle and non-scalable. But I don't have a better proposal in mind.

Contributor

I feel the same way! We might need a dtype x codec compatibility matrix; not sure whether it should track compatibility or incompatibility.
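
One way to picture the matrix idea is as an allow-list keyed by dtype. This is only a rough sketch: `COMPATIBLE_CODECS` and `is_compatible` are hypothetical names, not zarr-python API, and the only row filled in is the one implied by the check in the diff above. Dtypes that are not listed are treated as unconstrained, which mirrors the current `if` statement.

```python
from __future__ import annotations

# Hypothetical allow-list: dtype class name -> codec class names it may be paired with.
# Only the VariableLengthUTF8 row is taken from the diff; nothing here is zarr-python API.
COMPATIBLE_CODECS: dict[str, frozenset[str]] = {
    "VariableLengthUTF8": frozenset({"VLenUTF8Codec", "ArrowIPCCodec"}),
}


def is_compatible(dtype_name: str, codec_name: str) -> bool:
    """Return False only when the dtype has a row and the codec is absent from it."""
    allowed = COMPATIBLE_CODECS.get(dtype_name)
    # Dtypes without a row are unconstrained, like the existing isinstance check.
    return allowed is None or codec_name in allowed
```

Tracking compatibility this way means every new codec has to be added to each row it supports; tracking incompatibility would flip that default, so the choice mostly depends on which list is expected to stay shorter.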

@rabernat (Contributor Author) commented Dec 3, 2025

@d-v-b - resolving the typing errors here is beyond my ability. Would appreciate your help. 🙏

@d-v-b (Contributor) commented Dec 4, 2025

CI is passing as of 96273a8

# Note: we only expect a single batch per chunk
record_batch = record_batch_reader.read_next_batch()
array = record_batch.column(self.column_name)
numpy_array = array.to_numpy()
@ilan-gold (Contributor) commented Dec 11, 2025

Very happy to see this happening :)

I would be very curious about the behavior of non-standard types here. What does something like a geometry dtype (which isn't in pyarrow) or a DictionaryArray (which is in the core but has an implicit masking of sorts) do here? I can't quite work it out from the pyarrow docs, to be honest.

Would it make sense to have a custom buffer class similar to what @keewis is doing for sparse (I think?)?
