-
-
Notifications
You must be signed in to change notification settings - Fork 370
Arrow ipc Array -> Bytes codec #3613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
src/zarr/core/metadata/v3.py
Outdated
| if isinstance(dtype, VariableLengthUTF8) and codec_class_name not in ( | ||
| "VLenUTF8Codec", | ||
| "ArrowIPCCodec", | ||
| ): # type: ignore[unreachable] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change allows us to use either the vlen-bytes or the arrow-ipc codec to encode variable length strings.
Flagging that this sort of logic for mapping codec / dtype compatibility feels quite brittle and non-scalable. But I don't have a better proposal in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i feel the same way! we might need a dtype x codecs compatibility matrix, not sure if it should track compatibility or incompatibility
|
@d-v-b - resolving the typing errors here is beyond my ability. Would appreciate your help. 🙏 |
|
ci is passing via 96273a8 |
| # Note: we only expect a single batch per chunk | ||
| record_batch = record_batch_reader.read_next_batch() | ||
| array = record_batch.column(self.column_name) | ||
| numpy_array = array.to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very happy to see this happening :)
I would be very curious about the behavior of non-standard types here. What does something like geometry dtype (which isn't in pyarrow) or DictionaryArray (which is in the core but has an implicit masking of sorts) do here? I can't deduce from the pyarrow docs exactly to be honest
Would it make sense to have a custom buffer class similar to what @keewis is doing for sparse (I think?)
Implementation of
arrow-ipcArray Bytes codec proposed in zarr-developers/zarr-extensions#41TODO:
docs/user-guide/*.mdchanges/