Skip to content
This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

Make PyResultSet and PyRecordBatch serializable/picklable #27

Closed
frank-lsf opened this issue Mar 5, 2023 · 3 comments
Closed

Make PyResultSet and PyRecordBatch serializable/picklable #27

frank-lsf opened this issue Mar 5, 2023 · 3 comments

Comments

@frank-lsf
Copy link
Contributor

The first step towards implementing shuffle using Ray is that we need a way to put the execution result into the Ray Object Store. Currently https://github.com/datafusion-contrib/ray-sql/blob/main/raysql/worker.py#L22 returns a string representation of the result set which would not work if we want to recover the result set from the worker return value. We need to make the ResultSet and therefore RecordBatch classes picklable in Python.

@frank-lsf
Copy link
Contributor Author

@andygrove I'm working on this. Is there a way to serialize a Rust RecordBatch into raw bytes?

Also do you know how to make this work in pyo3? I'm reading PyO3/pyo3#100 which led me to https://gist.github.com/ethanhs/fd4123487974c91c7e5960acc9aa2a77.

@andygrove
Copy link
Collaborator

Hi @franklsf95. Yes, it is possible to serialize RecordBatch in Arrow IPC format. DataFusion has an IPCWriter struct that does this. You can see how this is used in the current Ray SQL ShuffleWriter.

To make this work with Pyo3, I think I would just delegate to Rust to serialize the batches to bytes then have a Python class that contains the bytes. This is what I did with the query plan - see https://github.com/datafusion-contrib/ray-sql/blob/main/raysql/context.py#L74

@frank-lsf
Copy link
Contributor Author

Done in #28.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants