Issue reading S3 files #18907

Closed
2 tasks done
stevenmanton opened this issue Sep 24, 2024 · 3 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@stevenmanton

stevenmanton commented Sep 24, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os

import boto3
import pandas as pd
import polars as pl
import pyarrow.dataset as ds
import s3fs
from pyarrow.fs import S3FileSystem

os.environ["AWS_PROFILE"] = "develop"

uri = "s3://bucket/path/to/file.parquet"

# This line fails:
_ = pl.read_parquet(uri)

# However, these all pass:
s3 = s3fs.S3FileSystem()
s3.ls(uri)

_ = pd.read_parquet(uri)

dataset = ds.dataset(uri)

_ = pl.read_parquet(uri, use_pyarrow=True)

S3FileSystem().get_file_info(uri[5:])

boto3.client('s3').head_object(Bucket="bucket", Key="path/to/file.parquet")

Log output

Async thread count: 2

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[13], line 1
----> 1 _ = pl.read_parquet(uri)

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/io/parquet/functions.py:209, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    206     else:
    207         lf = lf.select(columns)
--> 209 return lf.collect()

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2033, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2031 # Only for testing purposes
   2032 callback = _kwargs.get("post_opt_callback", callback)
-> 2033 return wrap_df(ldf.collect(callback))

ComputeError: Generic S3 error: Client error with status 403 Forbidden: No Body

Issue description

I'm unable to load files from S3 in certain environments. The issue seems related to using named AWS profiles. Other tools (e.g., boto3, pyarrow, s3fs), however, don't have this issue. Perhaps the internal Rust implementation that handles the AWS access doesn't pick up the environment variable? (Though the documentation states: "Polars looks for these as environment variable")
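A possible workaround, sketched here under assumptions rather than taken from the thread: since boto3 resolves the named profile correctly, one could let boto3 resolve the credentials and hand them to Polars explicitly through `storage_options`. The helper names below (`storage_options_from_credentials`, `storage_options_for_profile`) are hypothetical, and the option keys are the ones I understand the `object_store` S3 backend to accept; treat both as assumptions to verify against your installed versions.

```python
from typing import Dict, Optional


def storage_options_from_credentials(
    access_key: str,
    secret_key: str,
    token: Optional[str] = None,
    region: Optional[str] = None,
) -> Dict[str, str]:
    """Build a storage_options dict from already-resolved AWS credentials.

    The key names are assumed to match what the object_store S3 backend
    accepts; optional values are only included when present.
    """
    options = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    if token:
        options["aws_session_token"] = token
    if region:
        options["aws_region"] = region
    return options


def storage_options_for_profile(profile_name: str) -> Dict[str, str]:
    """Resolve a named AWS profile with boto3 and convert it to storage_options."""
    import boto3  # imported lazily so the pure helper above has no dependencies

    session = boto3.Session(profile_name=profile_name)
    credentials = session.get_credentials()
    if credentials is None:
        raise RuntimeError(f"No credentials found for profile {profile_name!r}")
    # Freeze the credentials so key, secret, and token are read consistently.
    frozen = credentials.get_frozen_credentials()
    return storage_options_from_credentials(
        frozen.access_key,
        frozen.secret_key,
        token=frozen.token,
        region=session.region_name,
    )
```

With this in place, the failing call might be written as `pl.read_parquet(uri, storage_options=storage_options_for_profile("develop"))`. Note that temporary credentials resolved this way are a snapshot and are not refreshed once they expire.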

Expected behavior

The parquet file should load seamlessly from S3.

Installed versions

--------Version info---------
Polars:              1.8.1
Index type:          UInt32
Platform:            Linux-5.10.225-191.878.amzn2int.x86_64-x86_64-with-glibc2.26
Python:              3.12.3 (main, Apr 15 2024, 18:01:35) [Clang 17.0.6 ]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          2.2.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.6.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.1.4
pyarrow              15.0.2
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           2.0.35
torch                2.4.1+cu121
xlsx2csv             <not installed>
xlsxwriter           <not installed>
stevenmanton added the bug, needs triage, and python labels Sep 24, 2024
@avimallu
Contributor

avimallu commented Sep 24, 2024

As mentioned in this issue, you'll need to raise this as a feature request (FR) for the object_store Rust package, as Polars likely has limited control over its functionality.

Also, your code seems to be saying that Pandas fails to load the S3 URI. 🤔

# This line fails:
_ = pd.read_parquet(uri)

@tustvold

I've filed #18979 to expose the necessary functionality in Polars to allow you to resolve this.

@nameexhaustion
Collaborator

This should be fixed in the latest releases; we now automatically use boto3 if it is installed.

If there are still any errors, please open a new issue.

Closed as completed via #19677
