-
Notifications
You must be signed in to change notification settings - Fork 878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
object_store: Use the official AWS SDK for s3 interactions #5143
Comments
We moved away from SDKs to keep the dependency footprint down and provide better consistency across the implementations What would be the advantage of such a change? |
Regarding dependency footprint: https://crates.io/crates/aws-sdk-s3/1.4.0/dependencies
Also AWS is not the only implementation of the S3 API out there and I doubt they support features (like conditional operations) that other S3 implementations support. |
@chitralverma what would be the benefit of using the official SDK? Are there features it supports that are not supported by the custom one? I can imagine potentially some of the more advanced access controls or something Since the ObjectStore is a trait, perhaps we could create a new crate (could start outside of this repo) that uses the official SDK that could be used by anyone who needs the extra features |
I guess one benefit is getting stuff like #5140 for close to free |
Closing as not planned, thank you all for the discussion |
One of the limitations of the current object-store implementation is that its authentication support is more restrictive than the one provided by AWS, for example. https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-sourcing-external.html. How can this feature be added to the current implementation? |
Users can provide their own credential providers - https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.with_credentials This could even be using the AWS SDK if they so wish |
for users that are using those libraries via a Python wrapper, how should this work? I.e. object_store imported by polars used in Python |
That is a question for the maintainers of the python wrapper |
So it seems to me the summary of this ticket is:
Thus I propose (and I think this is what @tustvold is also proposing) for the first step is that someone implements the struct OfficalAmazonS3 {
...
}
impl ObjectStore for OfficialAmazonS3 {
// implementation based on aws sdk goes here
} I won't have time to devote to this project (likely ever) but I think it would be straightforward for anyone with the need (the earlier versions of object_store were actually implemented this way). Once that code is complete, I think we could then have a separate discussion about "is there willingness to maintain that implementation in the But the first step in all possible futures is for someone to implement |
If it is just credentials, it would be sufficient to just implement a CredentialProvider, I believe datafusion-cli might already be doing this and is just a couple of lines of code. If the desire it to use this from python, it will need a conversation with the maintainers of the python bindings on whether to take this approach or to expose CredentialProvider in the python interface so it can be used with boto. A full reimplementation using the AWS SDK is possible, but is a non-trivial amount of work, and would need people to drive it as Andrew describes |
Just FYI, I'm working on a project reqsign focused on implementing all the request signing features provided by AWS (and all other cloud providers). I believe it would be beneficial to have a separate signing crate instead of the entire AWS SDK. It's being refactored to simplify its layout and significantly reduce its dependency tree. We can discuss the details later. |
One specific problem we have is how we inject this logic when the object store is wrapped by a crate which exposes python bindings, for example delta-rs, polars or DataFusion-python Let's say this crate would already exists, how can we integrate it with those frameworks above ? |
Reqsign will provide a Python binding, allowing the Python ecosystem to integrate with it directly. Like what we do for |
Either the same way datafusion-cli does by overriding the credential provider at construction time, or by the maintainers of said python wrapper adding first class support for credential providers. We already provide sufficient extension points for them to do this
So long as we preserve existing functionality, including around things like the client configuration used to make credential requests, and someone is willing to drive this, no objections from me |
@edmondop and I spoke about this at the NYC DataFUsion meetup yesterday I think the usecase is that
If they could control the source, they could use the existing object_store @edmondop mentioned something about some AWS SDK mechanism that calls out to an external program / process to get credentials. While less efficient this would allow someone to plug in whatever authentication mechanism they wanted without having to change the source code Perhaps we can follow this model somehow and build a "get credentials from an external process / souce" into the object store to provide a way that did not require the aws sdk, nor a bug dependency tree, but could support arbitrary authentication 🤔 |
Thank you @alamb . My proposal is that we extend object_store to support an additional way of providing credentials, called This process is also implemented as a part of the AWS SDK as described here and have the strong benefit of keeping the object_store crate simple, but support users in providing their custom authentication mechanism. ObjectStore could be then extended to use a custom environment variable and use the credential provider instead. The documentation from AWS provides the type of schema required by the SDK, which we can follow {
"Version": 1,
"AccessKeyId": "an AWS access key",
"SecretAccessKey": "your AWS secret access key",
"SessionToken": "the AWS session token for temporary credentials",
"Expiration": "RFC3339 timestamp for when the credentials expire"
} I am sure many have already implemented a custom credential provider, and the object store implementation would be minimal. It would allow people to use objecstore, and thefore datafusion and polars, with complex authentication mechanisms without polluting the objecstore crate what do you tihnk @tustvold ? Is this solution acceptable? Can I open a new issue for it? |
My only worry would be that users might provide "untrusted" user input to the config system, and this would then allow for a limited form of RCE. Perhaps we could just provide a CredentialProvider that calls a process, and leave it to polars to hook it up based on a config? This would avoid surprises for people using object_store in a server context? Edit: it might also be worthwhile filing a polars issue about making CredentialProvider overridable, as this might be generally useful for them Edit edit: IIRC we provide a mechanism to use the Azure and GCP CLIs to source tokens, perhaps one option would be to add something similar for S3? |
Yes, I agree it should be something explicitly enabled
I think most users of Polars use the python API (aka are not compiling rust programs) which means Rust level APIs (if I understand what you are proposing) is not likely to satisfy the usecase for their users
That certainly sounds like a good idea to me (follow the same model as the others) |
@edmondop -- can you pull the relevant discussion into a new (non closed) ticket ? |
I meant some way to expose it in the python API |
Opened this #6422 |
FYI I have filed an upstream polars issue as this seems to be a fairly frequent source of these reports - pola-rs/polars#18979 |
As an update here, polars now exposes the credential provider functionality allowing fetching credentials via alternative mechanisms in python (in addition to the native Rust support we provide) |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently the official AWS SDK for Rust was made GA.
https://aws.amazon.com/blogs/developer/announcing-general-availability-of-the-aws-sdk-for-rust/
This ticket is created to discuss the potential to refactor the existing s3 implementation in favour of using the new SDK.
Describe the solution you'd like
remove the current HTTP(s) based implementation in favour of the official SDK
https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/
Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered: