Testing with scan_parquet doesn't work anymore from within io/cloud/test_aws.py #11528

Open
2 tasks done
svaningelgem opened this issue Oct 5, 2023 · 3 comments
Labels
A-io Area: reading and writing data bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@svaningelgem
Contributor

svaningelgem commented Oct 5, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Just re-add (pl.scan_parquet, "parquet"), to the parameters of test_scan_s3.

(It was removed by @ritchie46 in PR #11210.)

Log output

exceptions.ComputeError: Generic S3 error: response error "request error", after 0 retries: builder error for url (http://127.0.0.1:5000/bucket/foods1.parquet): URL scheme is not allowed

Issue description

The call fails, I believe because the object_store crate does not accept plain http URLs by default.
So, following the object_store docs, I added:

    # monkeypatch_module.setenv("AWS_ENDPOINT", f"http://{host}:{port}")
    monkeypatch_module.setenv("AWS_ALLOW_HTTP", "true")

to the s3_base fixture (same file). I tried with the AWS_ENDPOINT line both enabled and disabled.
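For anyone reproducing this outside the pytest fixture, the environment setup described above could look roughly like the sketch below. It only sets the two variables named in this report and does not fix the hang. The helper names `local_s3_env` and `patched_env` are made up for illustration and are not part of Polars or the test suite.

```python
import os
from contextlib import contextmanager


def local_s3_env(host: str, port: int) -> dict:
    """Build the two variables from this report for a local plain-HTTP endpoint."""
    return {
        # Hypothetical helper; the variable names are the ones quoted in the issue.
        "AWS_ENDPOINT": f"http://{host}:{port}",
        "AWS_ALLOW_HTTP": "true",
    }


@contextmanager
def patched_env(overrides: dict):
    """Temporarily apply overrides to os.environ and restore old values on exit."""
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```

Inside a pytest fixture, `monkeypatch.setenv` does the same job; the context manager is just a standalone equivalent.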

But this just hung the test (deadlock?), i.e.:

INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.csv HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.ipc HTTP/1.1" 200 -
INFO     werkzeug:_internal.py:96 127.0.0.1 - - [05/Oct/2023 09:28:29] "PUT /bucket/foods1.parquet HTTP/1.1" 200 -
Terminated

The final "Terminated" line is there because I killed the process myself after a minute or so.

This is fairly similar to #11372, but I opened a separate issue because this one focuses purely on the testing.

Expected behavior

I would expect scan_parquet to return a LazyFrame.

Installed versions

(main branch)
--------Version info---------
Polars:              0.19.7
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:  0.7.0
cloudpickle:         2.2.1
connectorx:          0.3.2
deltalake:           0.10.1
fsspec:              2023.9.2
gevent:              23.9.1
matplotlib:          3.8.0
numpy:               1.26.0
openpyxl:            3.1.2
pandas:              2.1.1
pyarrow:             13.0.0
pydantic:            2.4.2
pyiceberg:           0.5.0
pyxlsb:              1.0.10
sqlalchemy:          2.0.21
xlsx2csv:            0.8.1
xlsxwriter:          3.1.6
@svaningelgem svaningelgem added bug Something isn't working python Related to Python Polars labels Oct 5, 2023
@ritchie46
Member

It is because object_store tries to connect to AWS. This has more to do with making this work with moto testing than with an actual bug in the AWS connection code.

@svaningelgem
Contributor Author

Indeed, but if it's not tested, how can we (read: I) improve on it? 😁

I'm trying to make sink_parquet work with the object_store code (ticket #11056), but if I can't test it, I can't fix it. And I don't know Rust that well (better now that I'm digging into it, but still). So if it's not too much trouble:

  • Could you describe what is needed to make it work?
  • Or if it's faster: fix the tests?

Thanks

@TylerGrantSmith

@svaningelgem I observed the same issue while trying to use a ThreadedMotoServer. You can get this to work if you launch moto_server as a subprocess instead. I am currently using this as a workaround for polars + S3 testing in Python.
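A minimal sketch of that subprocess workaround, for reference. It is not the code from the comment above. It assumes the `moto_server` command is on PATH and accepts `-H`/`-p` for host and port, which may differ between moto versions. The helper names are made up for illustration.

```python
import socket
import subprocess
import time


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until something accepts TCP connections on host:port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def start_moto_subprocess(host: str = "127.0.0.1", port: int = 5000) -> subprocess.Popen:
    """Launch moto_server in its own process instead of a thread in the test process."""
    # Assumption: the installed moto CLI accepts `moto_server -H <host> -p <port>`.
    proc = subprocess.Popen(["moto_server", "-H", host, "-p", str(port)])
    if not wait_for_port(host, port):
        proc.terminate()
        proc.wait()
        raise RuntimeError(f"moto_server did not start on {host}:{port}")
    return proc
```

A session-scoped fixture could call `start_moto_subprocess(port=find_free_port())`, yield the endpoint URL, and call `proc.terminate()` on teardown.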

@stinodego stinodego added needs triage Awaiting prioritization by a maintainer A-io Area: reading and writing data labels Jan 13, 2024