Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object_store: Use the official AWS SDK for s3 interactions #5143

Closed
chitralverma opened this issue Nov 29, 2023 · 25 comments
Closed

object_store: Use the official AWS SDK for s3 interactions #5143

chitralverma opened this issue Nov 29, 2023 · 25 comments
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog

Comments

@chitralverma
Copy link

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently the official AWS SDK for Rust was made GA.
https://aws.amazon.com/blogs/developer/announcing-general-availability-of-the-aws-sdk-for-rust/

This ticket is created to discuss the potential to refactor the existing s3 implementation in favour of using the new SDK.

Describe the solution you'd like
remove the current HTTP(s) based implementation in favour of the official SDK
https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/

Describe alternatives you've considered

Additional context

@chitralverma chitralverma added the enhancement Any new improvement worthy of a entry in the changelog label Nov 29, 2023
@chitralverma
Copy link
Author

cc: @tustvold @alamb

@tustvold
Copy link
Contributor

We moved away from SDKs to keep the dependency footprint down and provide better consistency across the implementations

#2176

What would be the advantage of such a change?

@crepererum
Copy link
Contributor

Regarding dependency footprint: https://crates.io/crates/aws-sdk-s3/1.4.0/dependencies

aws-sdk-s3 comes with 13 AWS-specific crates (direct dependencies only, haven't looked at indirect ones), 9 of which are used for the API generator generator smithy alone.


Also AWS is not the only implementation of the S3 API out there and I doubt they support features (like conditional operations) that other S3 implementations support.

@alamb
Copy link
Contributor

alamb commented Nov 29, 2023

@chitralverma what would be the benefit of using the official SDK? Are there features it supports that are not supported by the custom one?

I can imagine potentially some of the more advanced access controls or something

Since the ObjectStore is a trait, perhaps we could create a new crate (could start outside of this repo) that uses the official SDK that could be used by anyone who needs the extra features

@kszlim
Copy link
Contributor

kszlim commented Dec 5, 2023

I guess one benefit is getting stuff like #5140 for close to free

@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Dec 23, 2023
@tustvold
Copy link
Contributor

Closing as not planned, thank you all for the discussion

@tustvold tustvold added the development-process Related to development process of arrow-rs label Dec 23, 2023
@edmondop
Copy link
Contributor

One of the limitations of the current object-store implementation is that its authentication support is more restrictive than the one provided by AWS, for example. https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-sourcing-external.html. How can this feature be added to the current implementation?

@tustvold
Copy link
Contributor

How can this feature be added to the current implementation?

Users can provide their own credential providers - https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.with_credentials

This could even be using the AWS SDK if they so wish

@edmondop
Copy link
Contributor

for users that are using those libraries via a Python wrapper, how should this work? I.e. object_store imported by polars used in Python

@tustvold
Copy link
Contributor

That is a question for the maintainers of the python wrapper

@alamb
Copy link
Contributor

alamb commented Sep 1, 2024

So it seems to me the summary of this ticket is:

  1. There is a legitamate need for certain authentication / other features that are not implemented in the object_store's re-implementation of the S3
  2. The "official" AWS s3 sdk provides these other features but at the expense of a significantly larger dependency set

Thus I propose (and I think this is what @tustvold is also proposing) for the first step is that someone implements the ObjectStore interface using the official AWS S3 SDK outside of the object_store crate. This would mean something like

struct OfficalAmazonS3 {
...
}

impl ObjectStore for OfficialAmazonS3 { 
 // implementation based on aws sdk goes here
}

I won't have time to devote to this project (likely ever) but I think it would be straightforward for anyone with the need (the earlier versions of object_store were actually implemented this way).

Once that code is complete, I think we could then have a separate discussion about "is there willingness to maintain that implementation in the object_store crate" (we would have to find some maintainer bandwidth to do so) to ease distribution

But the first step in all possible futures is for someone to implement OfficialAmazonS3.

@tustvold
Copy link
Contributor

tustvold commented Sep 1, 2024

If it is just credentials, it would be sufficient to just implement a CredentialProvider, I believe datafusion-cli might already be doing this and is just a couple of lines of code.

If the desire it to use this from python, it will need a conversation with the maintainers of the python bindings on whether to take this approach or to expose CredentialProvider in the python interface so it can be used with boto.

A full reimplementation using the AWS SDK is possible, but is a non-trivial amount of work, and would need people to drive it as Andrew describes

@Xuanwo
Copy link
Member

Xuanwo commented Sep 1, 2024

  1. There is a legitamate need for certain authentication / other features that are not implemented in the object_store's re-implementation of the S3

Just FYI, I'm working on a project reqsign focused on implementing all the request signing features provided by AWS (and all other cloud providers). I believe it would be beneficial to have a separate signing crate instead of the entire AWS SDK.

It's being refactored to simplify its layout and significantly reduce its dependency tree. We can discuss the details later.

@edmondop
Copy link
Contributor

edmondop commented Sep 1, 2024

One specific problem we have is how we inject this logic when the object store is wrapped by a crate which exposes python bindings, for example delta-rs, polars or DataFusion-python

Let's say this crate would already exists, how can we integrate it with those frameworks above ?

@Xuanwo
Copy link
Member

Xuanwo commented Sep 1, 2024

Let's say this crate would already exists, how can we integrate it with those frameworks above ?

Reqsign will provide a Python binding, allowing the Python ecosystem to integrate with it directly. Like what we do for object_store and arrow.

@tustvold
Copy link
Contributor

tustvold commented Sep 1, 2024

Let's say this crate would already exists, how can we integrate it with those frameworks above ?

Either the same way datafusion-cli does by overriding the credential provider at construction time, or by the maintainers of said python wrapper adding first class support for credential providers. We already provide sufficient extension points for them to do this

It's being refactored to simplify its layout and significantly reduce its dependency tree

So long as we preserve existing functionality, including around things like the client configuration used to make credential requests, and someone is willing to drive this, no objections from me

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

@edmondop and I spoke about this at the NYC DataFUsion meetup yesterday

I think the usecase is that

  1. User uses object_store indirectly via polars
  2. polars does not provide any way to modify / configure s3 connections at runtime

If they could control the source, they could use the existing object_store CredentialProvider trait but they control neither the source or distribution

@edmondop mentioned something about some AWS SDK mechanism that calls out to an external program / process to get credentials. While less efficient this would allow someone to plug in whatever authentication mechanism they wanted without having to change the source code

Perhaps we can follow this model somehow and build a "get credentials from an external process / souce" into the object store to provide a way that did not require the aws sdk, nor a bug dependency tree, but could support arbitrary authentication 🤔

@edmondop
Copy link
Contributor

Thank you @alamb . My proposal is that we extend object_store to support an additional way of providing credentials, called credential_provider.

This process is also implemented as a part of the AWS SDK as described here and have the strong benefit of keeping the object_store crate simple, but support users in providing their custom authentication mechanism.

ObjectStore could be then extended to use a custom environment variable and use the credential provider instead. The documentation from AWS provides the type of schema required by the SDK, which we can follow

{
    "Version": 1,
    "AccessKeyId": "an AWS access key",
    "SecretAccessKey": "your AWS secret access key",
    "SessionToken": "the AWS session token for temporary credentials", 
    "Expiration": "RFC3339 timestamp for when the credentials expire"
}  

I am sure many have already implemented a custom credential provider, and the object store implementation would be minimal. It would allow people to use objecstore, and thefore datafusion and polars, with complex authentication mechanisms without polluting the objecstore crate

what do you tihnk @tustvold ? Is this solution acceptable? Can I open a new issue for it?

@tustvold
Copy link
Contributor

tustvold commented Sep 18, 2024

My only worry would be that users might provide "untrusted" user input to the config system, and this would then allow for a limited form of RCE.

Perhaps we could just provide a CredentialProvider that calls a process, and leave it to polars to hook it up based on a config? This would avoid surprises for people using object_store in a server context?

Edit: it might also be worthwhile filing a polars issue about making CredentialProvider overridable, as this might be generally useful for them

Edit edit: IIRC we provide a mechanism to use the Azure and GCP CLIs to source tokens, perhaps one option would be to add something similar for S3?

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

Perhaps we could just provide a CredentialProvider that calls a process, and leave it to polars to hook it up based on a config? This would avoid surprises for people using object_store in a server context?

Yes, I agree it should be something explicitly enabled

Edit: it might also be worthwhile filing a polars issue about making CredentialProvider overridable, as this might be generally useful for them

I think most users of Polars use the python API (aka are not compiling rust programs) which means Rust level APIs (if I understand what you are proposing) is not likely to satisfy the usecase for their users

Edit edit: IIRC we provide a mechanism to use the Azure and GCP CLIs to source tokens, perhaps one option would be to add something similar for S3?

That certainly sounds like a good idea to me (follow the same model as the others)

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

@edmondop -- can you pull the relevant discussion into a new (non closed) ticket ?

@tustvold
Copy link
Contributor

I think most users of Polars use the python API

I meant some way to expose it in the python API

@edmondop
Copy link
Contributor

Opened this #6422

@tustvold
Copy link
Contributor

FYI I have filed an upstream polars issue as this seems to be a fairly frequent source of these reports - pola-rs/polars#18979

@tustvold
Copy link
Contributor

tustvold commented Oct 25, 2024

As an update here, polars now exposes the credential provider functionality allowing fetching credentials via alternative mechanisms in python (in addition to the native Rust support we provide)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

7 participants