AWS_PROFILE should be supported in cloud storage I/O config #18757
Comments
I doubt Polars has control over this; it comes from `object_store`.
Oh, I somehow didn't realize they were separate libraries. Looks like it used to be experimentally supported, but that support was dropped. Bummer.
Yikes. It looks like there's no easy way to get support for AWS profiles in polars, then. That's a big lack of functionality on the `object_store` side.
👋 object_store maintainer here. The major challenge with supporting AWS_PROFILE is the sheer scope of such an initiative; even the official Rust AWS SDK continues to have issues in this space (awslabs/aws-sdk-rust#1193). Whilst we did at one point support AWS_PROFILE in object_store, it was tacked on and led to surprising inconsistencies for users, as only some of the configuration would be respected.

We do not use SDKs, as this allows for a more consistent experience across stores, especially since AWS is the only provider with an official Rust SDK, along with a significantly smaller dependency footprint. There is more information in apache/arrow-rs#2176.

This support for AWS_PROFILE was therefore removed and replaced with a more flexible API allowing users and system integrators to configure how to source credentials from their environment. I have filed #18979 to suggest exposing this in polars.

Edit: As an aside, I would strongly encourage using aws-vault to generate session credentials, as this not only avoids this class of issue, but also avoids storing credentials in plain text on the filesystem and relying on individual apps/tools to use the correct profile.
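Until something like #18979 lands, one workaround in the spirit of the maintainer's comment is to resolve the profile's credentials out-of-band and hand them to polars explicitly. A minimal sketch, assuming boto3 is available; the profile name and S3 path are placeholders, and note that temporary credentials fetched this way will not refresh automatically:

```python
# Sketch of a workaround (not a polars-native API): resolve the named
# profile's credentials with boto3, then pass them to polars explicitly.
# "my-profile" and the s3 path are hypothetical placeholders.
import boto3
import polars as pl

session = boto3.Session(profile_name="my-profile")
frozen = session.get_credentials().get_frozen_credentials()

storage_options = {
    "aws_access_key_id": frozen.access_key,
    "aws_secret_access_key": frozen.secret_key,
}
if frozen.token:  # present for assumed-role / SSO sessions
    storage_options["aws_session_token"] = frozen.token
if session.region_name:
    storage_options["aws_region"] = session.region_name

df = pl.read_parquet("s3://my-bucket/my-data.parquet", storage_options=storage_options)
```

Alternatively, following the aws-vault suggestion, running the process as `aws-vault exec my-profile -- python app.py` injects short-lived credentials as plain environment variables, which polars picks up without any profile handling at all.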
First of all, I wanted to say a huge thanks for everyone's efforts on making this work so seamlessly, thank you!! I encountered an odd situation: I tried setting `AWS_PROFILE`, and the read hung indefinitely. One thing worth noting is that I do have an `endpoint_url` configured in that profile.

```
ubuntu [/mnt/code]: python
Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import polars as pl
>>> os.environ["AWS_PROFILE"] = "my-profile"
>>> with pl.Config(verbose=True):
... ff = pl.read_parquet("s3://my-bucket/my-data.parquet")
...
Auto-selected credential provider: CredentialProviderAWS
Async thread count: 1
[FetchedCredentialsCache]: Call update_func: current_time = 1738359742, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
```

However, if I also set `AWS_ENDPOINT_URL`, it works:

```
>>> import os
>>> import polars as pl
>>> os.environ["AWS_PROFILE"] = "my-profile"
>>> os.environ["AWS_ENDPOINT_URL"] = "https://my-endpoint.com/"
>>> with pl.Config(verbose=True):
... ff = pl.read_parquet("s3://my-bucket/my-data.parquet")
...
Auto-selected credential provider: CredentialProviderAWS
Async thread count: 1
[FetchedCredentialsCache]: Call update_func: current_time = 1738360226, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
async download_chunk_size: 67108864
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738360227, expiry = (never expires)
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738360227, expiry = (never expires)
POLARS PREFETCH_SIZE: 16
querying metadata of 1/1 files...
reading of 1/1 file...
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Using cached credentials: current_time = 1738360227, expiry = (never expires)
parquet scan with parallel = Columns
```

So it seems to me there are two issues: one is that the read can hang indefinitely. The other is more of a wish: if `endpoint_url` were also read from the profile, setting `AWS_ENDPOINT_URL` manually wouldn't be needed.

I tried to figure out how to pull the `endpoint_url` out of a profile:

```python
from botocore.session import Session
session = Session(profile="default")
config = session.get_scoped_config()
config.get("endpoint_url")
```

Somehow it should be possible, since the profile config is accessible through `botocore`.

edit: updated the python block above because I figured out how to use `get_scoped_config`.
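Building on that, until this is supported natively one could bridge the profile's `endpoint_url` into polars by hand. A sketch under the same assumptions (hypothetical profile and bucket names; `aws_endpoint_url` is the object_store config key that the `AWS_ENDPOINT_URL` variable above maps to):

```python
# Sketch: pull endpoint_url/region out of an AWS profile with botocore
# and forward them to polars via storage_options.
# "my-profile" and the s3 path are hypothetical placeholders.
import polars as pl
from botocore.session import Session

scoped = Session(profile="my-profile").get_scoped_config()

storage_options = {}
if scoped.get("endpoint_url"):
    storage_options["aws_endpoint_url"] = scoped["endpoint_url"]
if scoped.get("region"):
    storage_options["aws_region"] = scoped["region"]

df = pl.read_parquet("s3://my-bucket/my-data.parquet", storage_options=storage_options)
```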
Description
I have a variety of different AWS/S3 profiles in my `~/.aws/credentials` and `~/.aws/config` files. I'd like to be able to either explicitly pass `profile` into `storage_options`, or implicitly set an `AWS_PROFILE` environment variable, so that I can be sure to use the appropriate bucket keys, endpoint, and other configs. I saw here that `profile` is not listed as a supported option: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

`polars` seems to use the first profile listed in those `~/.aws` files, even if the profile name is not 'default'. By ensuring the relevant profile was listed first, `pl.read_parquet("s3://my-bucket/my-parquet/*.parquet")` would work, but being order-dependent is confusing and not scalable.

FWIW this functionality exists in `pandas` (see the sketch below), and I'm hoping to migrate code to `polars`, but this is kind of essential.
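For comparison, the `pandas` behavior referred to above goes through fsspec/s3fs, which accepts a profile name directly; a sketch with placeholder names:

```python
# pandas delegates s3:// paths to s3fs, which resolves the named profile
# from ~/.aws/config and ~/.aws/credentials.
# "my-profile" and the s3 path are hypothetical placeholders.
import pandas as pd

df = pd.read_parquet(
    "s3://my-bucket/my-parquet/part-0.parquet",
    storage_options={"profile": "my-profile"},  # forwarded to s3fs.S3FileSystem
)
```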