
Incorrect results using multiple buckets with Azure Data Lake Storage URI #20347

Closed
2 tasks done
nameexhaustion opened this issue Dec 18, 2024 · 0 comments · Fixed by #20357
Assignees
Labels
A-io-cloud Area: reading/writing to cloud storage accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments


nameexhaustion commented Dec 18, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

print(pl.scan_parquet("abfss://[email protected]/file1.parquet").explain())
# Parquet SCAN [abfss://account.dfs.core.windows.net/file1.parquet]
print(pl.scan_parquet("abfss://[email protected]/...").explain())
# Parquet SCAN [abfss://account.dfs.core.windows.net/file1.parquet]

Where file1.parquet exists only in bucket1.
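A minimal, self-contained sketch of the failure mode described below (the names ObjectStore, parse_abfss, and get_store_buggy are hypothetical illustrations, not the actual Polars internals): a store cache keyed only on the account host means the first bucket seen "wins" for all later URIs.

```python
from urllib.parse import urlparse


class ObjectStore:
    """Stand-in for a cloud object store client (hypothetical)."""

    def __init__(self, account: str, bucket: str):
        self.account = account
        self.bucket = bucket


_cache: dict[str, ObjectStore] = {}


def parse_abfss(uri: str) -> tuple[str, str, str]:
    # abfss://bucket@account.dfs.core.windows.net/path
    parsed = urlparse(uri)
    bucket, account = parsed.netloc.split("@", 1)
    return account, bucket, parsed.path


def get_store_buggy(uri: str) -> ObjectStore:
    account, bucket, _ = parse_abfss(uri)
    # BUG: the cache key omits the bucket, so the first store created for
    # this account is silently reused for every bucket on the same account.
    if account not in _cache:
        _cache[account] = ObjectStore(account, bucket)
    return _cache[account]


s1 = get_store_buggy("abfss://bucket1@account.dfs.core.windows.net/file1.parquet")
s2 = get_store_buggy("abfss://bucket2@account.dfs.core.windows.net/file1.parquet")
print(s2.bucket)  # prints "bucket1" -- the bucket2 URI reuses bucket1's store
```

This mirrors the explain() output above: the second scan resolves against bucket1 even though its URI names bucket2.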

Log output

No response

Issue description

When using an Azure Data Lake Storage URI [1], the first bucket that gets used becomes hardcoded into the Polars object store cache. Subsequent scan operations ignore the bucket specified in the URI and always use the first bucket.

Expected behavior

The second scan_parquet in the example should fail, as the file does not exist in bucket2, instead of incorrectly scanning from bucket1.

The resolved path should also include the bucket@ segment:

print(pl.scan_parquet("abfss://[email protected]/file1.parquet").explain())
# Parquet SCAN [abfss://[email protected]/file1.parquet]
print(pl.scan_parquet("abfss://[email protected]/...").explain())
# Expected at least 1 source

Installed versions

1.17.1

Footnotes

  1. https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri

@nameexhaustion nameexhaustion added bug Something isn't working python Related to Python Polars P-high Priority: high A-io-cloud Area: reading/writing to cloud storage labels Dec 18, 2024
@nameexhaustion nameexhaustion self-assigned this Dec 18, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Dec 18, 2024
@nameexhaustion nameexhaustion changed the title Incorrect object store cache for azure Incorrect object store cache for azure when using different buckets Dec 18, 2024
@nameexhaustion nameexhaustion changed the title Incorrect object store cache for azure when using different buckets Cannot use multiple buckets with Azure Data Lake Storage URI Dec 18, 2024
@nameexhaustion nameexhaustion changed the title Cannot use multiple buckets with Azure Data Lake Storage URI Incorrect results using multiple buckets with Azure Data Lake Storage URI Dec 18, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Dec 19, 2024
@c-peters c-peters added the accepted Ready for implementation label Dec 23, 2024