removing fsspec in python in favour of object_store in rust #11056

Open
svaningelgem opened this issue Sep 11, 2023 · 6 comments
Labels
accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@svaningelgem
Contributor

Description

Hi @ritchie46 ,

I was in discussion with @Qqwy about the following:
In Rust we now have the *_cloud methods for sinking to a cloud service via object_store.
Would it make sense to generalize every call to the read/scan/write methods so that they use object_store, instead of relying on fsspec on the Python side?

My idea is this:

  • normal methods: accept str | Path | BytesIO (a writable byte stream for write/sink, a readable one for read/scan)
  • cloud methods: accept str | Path (like S3Path) | BytesIO (a writable byte stream for write/sink, a readable one for read/scan) + cloud options

Within the Python API I would combine these two, using a default parameter for the cloud options that can optionally be passed in for cloud paths. Sadly, the same is not possible in Rust.
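A minimal sketch of how such a combined Python entry point could route its input. The name `describe_sink`, the `cloud_options` dict, and the "any scheme other than file:// means cloud" rule are all assumptions made for illustration; none of this is the Polars API.

```python
from __future__ import annotations

import io
from pathlib import Path
from typing import Any, Optional, Union

# The three input kinds proposed above.
Target = Union[str, Path, io.BytesIO]


def describe_sink(target: Target, cloud_options: Optional[dict[str, Any]] = None) -> str:
    """Hypothetical single entry point: report how a write would be routed."""
    if isinstance(target, io.BytesIO):
        if cloud_options:
            raise ValueError("cloud_options do not apply to an in-memory stream")
        return "stream"

    # Note: pathlib collapses "s3://" to "s3:/", so cloud locations are
    # expected as plain strings (or a dedicated type such as S3Path) here.
    text = str(target)

    # Assumption: any URL scheme other than file:// denotes a cloud location.
    if "://" in text and not text.startswith("file://"):
        return "cloud"

    if cloud_options:
        raise ValueError("cloud_options were given for a local path")
    return "local"
```

With this shape, `cloud_options` stays optional for local paths and streams, and passing it where it cannot apply fails loudly instead of being ignored.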

My only issue was with globs: how to handle these, but it seems (Qqwy checked it) that these are handled by object_store as well. So we're fine there.

Second, Qqwy pointed out that there are methods like parse_url that parse a given URI and try to figure out what you're trying to do. [Relative paths are not supported.]
==> For the parse_url & glob logic he concluded that it's already present in polars/object_store.

My ultimate wish for the "file interface API" would be a struct that accepts a list of writable/readable streams and does whatever is needed for the specific file type.
Once that exists, any file format could be implemented easily as the ability to read from and write to a stream. All the other code would be boilerplate.
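The stream-based idea can be sketched in Python as follows. `StreamFormat`, `CsvFormat`, and `round_trip` are hypothetical names chosen for this example, and CSV stands in for any format; this is not how Polars structures its readers and writers.

```python
import csv
import io
from typing import BinaryIO, Protocol

Rows = list[dict[str, str]]


class StreamFormat(Protocol):
    """Hypothetical interface: a format only needs to read and write streams."""

    def write(self, rows: Rows, sink: BinaryIO) -> None: ...

    def read(self, source: BinaryIO) -> Rows: ...


class CsvFormat:
    """One concrete format; it never learns where the bytes end up."""

    def write(self, rows: Rows, sink: BinaryIO) -> None:
        if not rows:
            return
        buffer = io.StringIO(newline="")
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
        sink.write(buffer.getvalue().encode("utf-8"))

    def read(self, source: BinaryIO) -> Rows:
        text = source.read().decode("utf-8")
        return list(csv.DictReader(io.StringIO(text, newline="")))


def round_trip(fmt: StreamFormat, rows: Rows) -> Rows:
    """Write to an in-memory stream and read the same bytes back."""
    stream = io.BytesIO()
    fmt.write(rows, stream)
    stream.seek(0)
    return fmt.read(stream)
```

Whether the stream is a local file, a cloud upload, or a `BytesIO` is decided by the caller, which is the boilerplate part the comment refers to.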

My question for this ticket is: what is your view on the API?
(and secondary: is this even possible in Rust?)

Thanks

svaningelgem added the enhancement label (New feature or an improvement of an existing feature) on Sep 11, 2023
@ion-elgreco
Contributor

Why not have one method, and based on that you invoke the object store? Because object_store can also be used for local files, same as fsspec.

@svaningelgem
Contributor Author

Even better. I just don't know the capabilities of Rust :'(

@abealcantara
Contributor

Agreed. I vote for merging sink_parquet_cloud and sink_parquet. Having a single method in Rust that supports both local files and object-store files, for each format and for both read and write operations, feels more natural and is similar to what other data processing frameworks currently offer. As @svaningelgem pointed out, the object_store crate supports Path and also supports local storage, so I think this will simplify the Polars API. We can then implement cloud support for the other formats without adding more methods to the DataFrame API.

@ritchie46
Member

I got this on my todo list. Will look into this a bit later.

@Qqwy
Contributor

Qqwy commented Sep 12, 2023

> My only issue was with globs: how to handle these, but it seems (Qqwy checked it) that these are handled by object_store as well. So we're fine there.

For correctness: these are actually handled by Polars itself and not by ObjectStore; cf. polars-io/src/cloud/glob.rs.
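To illustrate what handling globs on the caller's side can mean, here is a rough sketch that filters a list of object keys against a glob pattern. It is not the Polars implementation; `glob_to_regex` and `match_keys` are hypothetical helpers, and the wildcard rules (`*` and `?` stay within one path segment, `**` crosses `/`) are simplifying assumptions.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple glob into a regex (assumed rules, see above)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")      # crosses path separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")    # exactly one non-separator character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def match_keys(pattern: str, keys: list[str]) -> list[str]:
    """Keep only the keys that match the whole pattern."""
    regex = glob_to_regex(pattern)
    return [key for key in keys if regex.fullmatch(key)]
```

In practice the keys would come from listing the store under the literal prefix that precedes the first wildcard, so only a small part of the bucket needs to be enumerated.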

> Why not have one method, and based on that you invoke the object store? Because object_store can also be used for local files, same as fsspec.

I don't know fsspec in detail, but ObjectStore only handles absolute paths, and local paths need to be prefixed by file://. We thus need to take a little care to translate relative paths to file:// absolute paths beforehand.
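On the Python side, that translation step could look roughly like the sketch below. `to_object_store_url` is a hypothetical helper, and treating anything that contains `://` as an already complete URL is an assumption.

```python
from pathlib import Path


def to_object_store_url(location: str) -> str:
    """Return a URL; relative or absolute local paths become file:// URLs."""
    # Assumption: anything that already carries a scheme is passed through.
    if "://" in location:
        return location
    # resolve() makes the path absolute against the current working
    # directory; as_uri() then renders it as a file:// URL.
    return Path(location).resolve().as_uri()
```

For example, `to_object_store_url("data/file.parquet")` yields a `file://` URL rooted at the current working directory, while `s3://bucket/key` is returned unchanged.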

@kylebarron
Contributor

kylebarron commented Jan 29, 2025

People here might be interested: I wrote obstore, a binding of the Rust object_store crate for Python. Early benchmarks indicate up to 10x higher throughput than s3fs when used in an async Python context (this specific scenario is a range request fetching the first 16kb of a file).

Is polars already using fsspec in an async way?

If you use obstore, it would still go through Python, and so may be less efficient than a native implementation in Rust using object_store, but it also means you wouldn't have to link object_store at compile time.

Projects
Status: Ready
Development

No branches or pull requests

7 participants