Allow Overriding Object Store Credential Provider #18979

Closed
tustvold opened this issue Sep 27, 2024 · 15 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@tustvold

tustvold commented Sep 27, 2024

Description

Problem

object_store lets users source credentials in a custom way by supplying a custom CredentialProvider. This is an important capability for supporting authentication schemes we don't natively support, such as AWS_PROFILE (#18757) and SSO. The object_store crate aims to support the most common authentication mechanisms, but does not aim to be a full re-implementation of all the authentication functionality of the various cloud providers.

Proposal

I would like a way to override the credential provider used by object stores, in particular to allow:

  • Using boto to source credentials when using polars via the python bindings
  • Using aws-sdk-rust to source credentials when using polars via the rust bindings (this may already be possible)

Alternatives Considered

Users could use software like aws-vault to generate session credentials. Whilst this has other security benefits, people may not wish to do this for various reasons.

Related Context

@alamb

alamb commented Sep 28, 2024

FWIW the use case we are hearing in object_store is:

  • As a data scientist who loves polars
  • I want to use it to process data that lives on S3 without first having to copy the data locally
  • I can't configure polars to access my data directly on S3 because my company requires us to use (some uncommon access control method that is not supported by the object_store crate) to access S3.

(you can see from the linked tickets this often comes down to us in object_store as a request to implement the various access control methods directly)

We are hoping that exposing access to the general purpose mechanism in polars would allow users to access their data using polars directly

@ritchie46
Member

This sounds like a way to enable a lot of users, which is great.

I don't really know how this would work (as I don't know enough about this topic), so I need some help in understanding what is requested from us.

How would we enable this? I see that there is a trait [CredentialProvider](https://docs.rs/object_store/latest/object_store/trait.CredentialProvider.html#).

Can object_store be instantiated with that? Or can it be passed as a dynamic argument?

What would this look like on the Python side?

@tustvold
Author

tustvold commented Sep 28, 2024

The various store builders allow providing a custom credential provider at construction time.

https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.with_credentials

I don't know enough about polars to know precisely what this might look like when hooked up, especially via Python. Changes may be needed on the object_store side to facilitate this, but I wanted to start the discussion.

I suspect it will be necessary to use https://docs.rs/object_store/latest/object_store/enum.ObjectStoreScheme.html directly as opposed to the type-erased parse_url method

To be clear, I don't have capacity to implement this, but I'm happy to assist.

@alamb

alamb commented Sep 29, 2024

What I personally suggest is adding a way in polars for users to call out to a separate process to retrieve credentials when needed.

Here is how this works with AWS, though we don't yet support this via object_store (tracked by apache/arrow-rs#6422).

There are similar mechanisms for azure and gcp, for example: https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.with_use_azure_cli

So from polars this could look like

  1. Set some polars configuration to allow external process credentials (enabling explicitly likely prevents some potential security issues)
  2. Configure object store instances to use external processes for credentials
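
As a rough illustration of the external-process route, here is a minimal Python sketch that runs a helper command and parses what it prints. It assumes the helper emits JSON in the shape AWS documents for `credential_process` (`Version`, `AccessKeyId`, `SecretAccessKey`, plus optional `SessionToken` and `Expiration`). The function name `credentials_from_process`, the returned key names, and the `FAKE_HELPER` stand-in are hypothetical and are not part of polars or object_store.

```python
import json
import subprocess
import sys


def credentials_from_process(command):
    """Run an external helper and parse credential_process-style JSON.

    `command` is an argument list (no shell), which avoids shell injection.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    payload = json.loads(result.stdout)
    if payload.get("Version") != 1:
        raise ValueError("unsupported credential_process version")
    creds = {
        "aws_access_key_id": payload["AccessKeyId"],
        "aws_secret_access_key": payload["SecretAccessKey"],
    }
    # SessionToken is only present for temporary credentials.
    if "SessionToken" in payload:
        creds["aws_session_token"] = payload["SessionToken"]
    # Expiration (ISO 8601 string) is optional; None means no stated expiry.
    return creds, payload.get("Expiration")


# Hypothetical stand-in helper: a Python one-liner printing fake credentials.
FAKE_HELPER = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"Version": 1, "AccessKeyId": "AKIAEXAMPLE", '
    '"SecretAccessKey": "secret-example", "SessionToken": "token-example", '
    '"Expiration": "2030-01-01T00:00:00Z"}))',
]
```

Calling `credentials_from_process(FAKE_HELPER)` returns the parsed credentials and the expiration string. Requiring an explicit argument list, rather than a shell string, is one way to address the security concern raised in step 1.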

@tustvold
Author

tustvold commented Sep 29, 2024

FWIW I view calling out to a separate process as strictly less flexible than what I propose here, limited to AWS, and tbh a bit of a hack. Providing a way for users to provide this within the Polars process would be cleaner, could work out of the box (e.g. using the cloud provider's SDK if available), and be more secure.

Tbh if we can get traction here I'd be tempted to not do apache/arrow-rs#6422 and instead fix the issue properly.

@alamb

alamb commented Sep 30, 2024

> fix the issue properly

I don't know what you mean by "fix it properly" -- do you mean somehow have an API in polars that provides credentials via arbitrary python code provided to some polars API?

@tustvold
Author

> somehow have an API in polars that provides credentials via arbitrary python code provided to some polars API?

Precisely. This would solve the problem not only for AWS but also for any of the other stores we support. We expose this API for a reason 😄
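
To make the in-process idea concrete, here is a minimal Python sketch of a caching wrapper around a user-supplied callable. It assumes the callable returns a dict of credentials plus an optional expiry as a Unix timestamp. `CachingCredentialProvider`, `fetch`, `clock`, and `refresh_margin` are hypothetical names chosen for illustration, not an API of polars or object_store.

```python
import time


class CachingCredentialProvider:
    """Wrap a user callable and re-invoke it only once credentials near expiry."""

    def __init__(self, fetch, clock=time.time, refresh_margin=60):
        self._fetch = fetch            # returns (credentials_dict, expiry_or_None)
        self._clock = clock            # injectable so tests can control time
        self._margin = refresh_margin  # refresh this many seconds before expiry
        self._cached = None
        self._expiry = None

    def __call__(self):
        now = self._clock()
        # An expiry of None is treated as "never expires".
        expired = self._expiry is not None and now >= self._expiry - self._margin
        if self._cached is None or expired:
            self._cached, self._expiry = self._fetch()
        return self._cached
```

The `fetch` callable could wrap a cloud provider's SDK, so the same wrapper would work for any store rather than only AWS.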

@ritchie46
Member

Alright, I still haven't had time to research this, but just know that we are willing to help and implement here. I will come back once I have more knowledge and sensible input. ;)

@tustvold
Author

FWIW I've also filed a simpler proposal in #19022 that might be more immediately actionable if you can tolerate its compromises.

@ion-elgreco
Contributor

@ritchie46 you might be able to take inspiration from how it's done in delta-rs: https://github.com/delta-io/delta-rs/blob/main/crates/aws/src/credentials.rs

@ritchie46
Member

FYI: I asked @nameexhaustion to look into this.

@Skumin

Skumin commented Oct 31, 2024

Should scan_parquet still look for a JSON service account on 1.12.0 when accessing a parquet file stored on GCS now that this has been implemented? I'm getting ComputeError: Generic GCS error: GCP credential error: Unable to open service account file from [USER]@[PROJECT].iam.gserviceaccount.com: No such file or directory (os error 2), even though I'm providing a credentials function in the credential_provider argument in scan_parquet that contains a valid bearer token.
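
For reference, the kind of credentials function described here might look like the sketch below. It rests on assumptions not confirmed in this thread: that `credential_provider` accepts a zero-argument callable returning a tuple of a credentials dict and an optional expiry as a Unix timestamp, and that the GCS key is named `bearer_token`. The token value, the bucket path, and the function name `get_gcs_credentials` are placeholders.

```python
import time


def get_gcs_credentials():
    """Return (credentials, expiry) in the shape assumed for `credential_provider`."""
    token = "ya29.example-token"  # placeholder; obtain a real token here
    expiry = int(time.time()) + 3600  # assumed: Unix timestamp in seconds, or None
    return {"bearer_token": token}, expiry


# Assumed usage (requires polars; the bucket path is a placeholder):
# import polars as pl
# lf = pl.scan_parquet(
#     "gs://my-bucket/data.parquet",
#     credential_provider=get_gcs_credentials,
# )
```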

@nameexhaustion
Collaborator

> Should scan_parquet still look for a JSON service account on 1.12.0 when accessing a parquet file stored on GCS now that this has been implemented?

We shouldn't. I think they are being loaded from the environment; I will make a PR to fix this.

@edmondop

edmondop commented Nov 6, 2024

Thank you so much @tustvold @alamb @nameexhaustion and @ritchie46. This is huge.

@Skumin

Skumin commented Nov 14, 2024

Can confirm that this now works on 1.13.0. Great stuff!

7 participants