
Incorrect results using multiple buckets with Azure Data Lake Storage URI #20347

Closed
2 tasks done
nameexhaustion opened this issue Dec 18, 2024 · 0 comments · Fixed by #20357
Assignees
Labels
A-io-cloud Area: reading/writing to cloud storage accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments


nameexhaustion commented Dec 18, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

print(pl.scan_parquet("abfss://[email protected]/file1.parquet").explain())
# Parquet SCAN [abfss://account.dfs.core.windows.net/file1.parquet]
print(pl.scan_parquet("abfss://[email protected]/...").explain())
# Parquet SCAN [abfss://account.dfs.core.windows.net/file1.parquet]

Where file1.parquet exists only in bucket1.
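A minimal, self-contained sketch of the failure mode described below (the names ObjectStore, parse_abfss, and get_store_buggy are hypothetical illustrations, not the actual Polars internals): a store cache keyed only on the account host means the first bucket seen "wins" for all later URIs.

```python
from urllib.parse import urlparse


class ObjectStore:
    """Stand-in for a cloud object store client (hypothetical)."""

    def __init__(self, account: str, bucket: str):
        self.account = account
        self.bucket = bucket


_cache: dict[str, ObjectStore] = {}


def parse_abfss(uri: str) -> tuple[str, str, str]:
    # abfss://bucket@account.dfs.core.windows.net/path
    parsed = urlparse(uri)
    bucket, account = parsed.netloc.split("@", 1)
    return account, bucket, parsed.path


def get_store_buggy(uri: str) -> ObjectStore:
    account, bucket, _ = parse_abfss(uri)
    # BUG: the cache key omits the bucket, so the first store created for
    # this account is silently reused for every bucket on the same account.
    if account not in _cache:
        _cache[account] = ObjectStore(account, bucket)
    return _cache[account]


s1 = get_store_buggy("abfss://bucket1@account.dfs.core.windows.net/file1.parquet")
s2 = get_store_buggy("abfss://bucket2@account.dfs.core.windows.net/file1.parquet")
print(s2.bucket)  # prints "bucket1" -- the bucket2 URI reuses bucket1's store
```

This mirrors the explain() output above: the second scan resolves against bucket1 even though its URI names bucket2.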

Log output

No response

Issue description

When using an Azure Data Lake Storage URI [1], the first bucket that gets used becomes hardcoded into the Polars object store cache. Subsequent scan operations ignore the bucket specified in the URI and always use the first bucket.

Expected behavior

The second scan_parquet in the example should fail, as the file does not exist in bucket2, instead of incorrectly scanning from bucket1.

The resolved path should also include the bucket@ segment:

print(pl.scan_parquet("abfss://[email protected]/file1.parquet").explain())
# Parquet SCAN [abfss://[email protected]/file1.parquet]
print(pl.scan_parquet("abfss://[email protected]/...").explain())
# Expected at least 1 source

Installed versions

1.17.1

Footnotes

  1. https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri

@nameexhaustion nameexhaustion added bug Something isn't working python Related to Python Polars P-high Priority: high A-io-cloud Area: reading/writing to cloud storage labels Dec 18, 2024
@nameexhaustion nameexhaustion self-assigned this Dec 18, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Dec 18, 2024
@nameexhaustion nameexhaustion changed the title Incorrect object store cache for azure Incorrect object store cache for azure when using different buckets Dec 18, 2024
@nameexhaustion nameexhaustion changed the title Incorrect object store cache for azure when using different buckets Cannot use multiple buckets with Azure Data Lake Storage URI Dec 18, 2024
@nameexhaustion nameexhaustion changed the title Cannot use multiple buckets with Azure Data Lake Storage URI Incorrect results using multiple buckets with Azure Data Lake Storage URI Dec 18, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Dec 19, 2024
@c-peters c-peters added the accepted Ready for implementation label Dec 23, 2024