AWS_PROFILE should be supported in cloud storage I/O config #18757
Comments
I doubt Polars has control over this; it comes from `object_store`.
Oh, I somehow didn't realize they were separate libraries. Looks like it used to be experimentally supported, but that support was dropped. Bummer.
Yikes. It looks like there's no easy way to get support for AWS profiles in polars, then. That's a big lack of functionality on the `object_store` side.
👋 object_store maintainer here. The major challenge with supporting AWS_PROFILE is the sheer scope of such an initiative; even the official Rust AWS SDK continues to have issues in this space (awslabs/aws-sdk-rust#1193). Whilst we did at one point support AWS_PROFILE in object_store, it was tacked on and led to surprising inconsistencies for users, as only some of the configuration would be respected.

We do not use SDKs, as this allows for a more consistent experience across stores, especially since AWS is the only provider with an official Rust SDK, along with a significantly smaller dependency footprint. There is more information in apache/arrow-rs#2176.

This support for AWS_PROFILE was therefore removed and replaced with a more flexible API allowing users and system integrators to configure how to source credentials from their environment. I have filed #18979 to suggest exposing this in polars.

Edit: As an aside, I would strongly encourage using aws-vault to generate session credentials, as this not only avoids this class of issue, but also avoids storing credentials in plain text on the filesystem and relying on individual apps/tools to use the correct profile.
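Until something like #18979 lands, one workaround in the spirit of the maintainer's comment is to resolve the profile's credentials out-of-band and hand them to polars explicitly. A minimal sketch, assuming boto3 is available; the profile name and S3 path are placeholders, and note that temporary credentials fetched this way will not refresh automatically:

```python
# Sketch of a workaround (not a polars-native API): resolve the named
# profile's credentials with boto3, then pass them to polars explicitly.
# "my-profile" and the s3 path are hypothetical placeholders.
import boto3
import polars as pl

session = boto3.Session(profile_name="my-profile")
frozen = session.get_credentials().get_frozen_credentials()

storage_options = {
    "aws_access_key_id": frozen.access_key,
    "aws_secret_access_key": frozen.secret_key,
}
if frozen.token:  # present for assumed-role / SSO sessions
    storage_options["aws_session_token"] = frozen.token
if session.region_name:
    storage_options["aws_region"] = session.region_name

df = pl.read_parquet("s3://my-bucket/my-data.parquet", storage_options=storage_options)
```

Alternatively, following the aws-vault suggestion, running the process as `aws-vault exec my-profile -- python app.py` injects short-lived credentials as plain environment variables, which polars picks up without any profile handling at all.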
First of all, I wanted to say a huge thanks for everyone's efforts on making this work so seamlessly, thank you!! I encountered an odd situation: I tried setting `AWS_PROFILE`, and the read hung indefinitely. One thing worth noting is that I do have an `endpoint_url` configured in that profile.

```
ubuntu [/mnt/code]: python
Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import polars as pl
>>> os.environ["AWS_PROFILE"] = "my-profile"
>>> with pl.Config(verbose=True):
... ff = pl.read_parquet("s3://my-bucket/my-data.parquet")
...
Auto-selected credential provider: CredentialProviderAWS
Async thread count: 1
[FetchedCredentialsCache]: Call update_func: current_time = 1738359742, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
```

However, if I also set `AWS_ENDPOINT_URL`, it works:

```
>>> import os
>>> import polars as pl
>>> os.environ["AWS_PROFILE"] = "my-profile"
>>> os.environ["AWS_ENDPOINT_URL"] = "https://my-endpoint.com/"
>>> with pl.Config(verbose=True):
... ff = pl.read_parquet("s3://my-bucket/my-data.parquet")
...
Auto-selected credential provider: CredentialProviderAWS
Async thread count: 1
[FetchedCredentialsCache]: Call update_func: current_time = 1738360226, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
async download_chunk_size: 67108864
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738360227, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738360227, expiry = (never expires)
POLARS PREFETCH_SIZE: 16
querying metadata of 1/1 files...
reading of 1/1 file...
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738360227, expiry = (never expires)
parquet scan with parallel = Columns
```

So it seems to me there are two issues: one is that the read can hang indefinitely. The other is more of a wish: if `endpoint_url` were also read from the profile, setting `AWS_ENDPOINT_URL` manually wouldn't be needed.

I tried to figure out how to pull the `endpoint_url` out of a profile:

```python
from botocore.session import Session
session = Session(profile="default")
config = session.get_scoped_config()
config.get("endpoint_url")
```

Somehow it should be possible, since the profile config is accessible through `botocore`.

edit: updated the python block above because I figured out how to use `get_scoped_config`.
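Building on that, until this is supported natively one could bridge the profile's `endpoint_url` into polars by hand. A sketch under the same assumptions (hypothetical profile and bucket names; `aws_endpoint_url` is the object_store config key that the `AWS_ENDPOINT_URL` variable above maps to):

```python
# Sketch: pull endpoint_url/region out of an AWS profile with botocore
# and forward them to polars via storage_options.
# "my-profile" and the s3 path are hypothetical placeholders.
import polars as pl
from botocore.session import Session

scoped = Session(profile="my-profile").get_scoped_config()

storage_options = {}
if scoped.get("endpoint_url"):
    storage_options["aws_endpoint_url"] = scoped["endpoint_url"]
if scoped.get("region"):
    storage_options["aws_region"] = scoped["region"]

df = pl.read_parquet("s3://my-bucket/my-data.parquet", storage_options=storage_options)
```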
Description
I have a variety of different AWS/S3 profiles in my `~/.aws/credentials` and `~/.aws/config` files. I'd like to be able to either explicitly pass `profile` into `storage_options`, or implicitly set an `AWS_PROFILE` environment variable, so that I can be sure to use the appropriate bucket keys, endpoint, and other configs. I saw here that `profile` is not listed as a supported option: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

`polars` seems to use the first profile listed in those `~/.aws` files, even if the profile name is not 'default'. By ensuring the relevant profile was listed first, `pl.read_parquet("s3://my-bucket/my-parquet/*.parquet")` would work, but being order-dependent is confusing and not scalable.

FWIW this functionality exists in `pandas` (see the sketch below), and I'm hoping to migrate code to `polars`, but this is kind of essential.
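For comparison, the `pandas` behavior referred to above goes through fsspec/s3fs, which accepts a profile name directly; a sketch with placeholder names:

```python
# pandas delegates s3:// paths to s3fs, which resolves the named profile
# from ~/.aws/config and ~/.aws/credentials.
# "my-profile" and the s3 path are hypothetical placeholders.
import pandas as pd

df = pd.read_parquet(
    "s3://my-bucket/my-parquet/part-0.parquet",
    storage_options={"profile": "my-profile"},  # forwarded to s3fs.S3FileSystem
)
```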