scan_csv() in container fails with disk space error (e.g. AWS lambda, or container) #17946
Can you set |
Hi, I'm not sure if I should start another issue for this, but I'm pretty sure I'm having the same issue. When running inside an AWS Lambda, I am able to read a CSV and write it to a Parquet file using read_csv and write_parquet, but no such luck with scan_csv and sink_parquet. I'm getting the same type of error and have tried the same methods to solve the issue as @GBMsejimenez. I've gotten the code down to the bare minimum needed to reproduce the error (the CSV file being tested consists of only a header and two lines of data, and the bucket and path in the file name have been edited out).

```python
import json

import polars as pl
import s3fs

# Intended as environment variables; as written these are no-op Python assignments
POLARS_PANIC_ON_ERR = 1
RUST_BACKTRACE = 1

# Lambda entry
def lambda_handler(event, context):
    pl.show_versions()
    csv_file = 's3://{BUCKET}/{PATH}/test.csv'
    # parquet_file = 's3://{BUCKET}/{PATH}/test.parquet'
    fs = s3fs.S3FileSystem(anon=False)
    df = pl.scan_csv(csv_file).collect(streaming=True)
    return {
        'statusCode': 200,
        'body': json.dumps("Finished")
    }
```

This is giving me an error of (with {BUCKET} and {PATH} having actual values):
My polars versions if necessary
|
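For contrast, a minimal sketch of the eager read_csv/write_parquet path the comment above reports working (bucket and path placeholders as in the example; the eager read goes through memory rather than Polars' on-disk file cache):

```python
import polars as pl
import s3fs

def lambda_handler(event, context):
    fs = s3fs.S3FileSystem(anon=False)
    # Eager read: the file is pulled into memory, no on-disk cache involved
    with fs.open("s3://{BUCKET}/{PATH}/test.csv", "rb") as f:
        df = pl.read_csv(f)
    # Write the result back to S3 as Parquet
    with fs.open("s3://{BUCKET}/{PATH}/test.parquet", "wb") as f:
        df.write_parquet(f)
    return {"statusCode": 200, "body": "Finished"}
```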
@wjglenn3 I'm experiencing the same issue when using a Docker container-based Lambda |
Hey, we are experiencing the same issue with Docker in AWS Lambda, and we have attempted all the combinations. I also tried installing s3fs, which is needed for read_csv, but it also breaks with the error: `ComputeError: failed to allocate 12345 bytes to download uri = s3://...` Here's my minimum example that breaks:

```python
import asyncio

import boto3
import polars as pl
import uvloop

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

session = boto3.session.Session(region_name="us-west-2")
credentials = session.get_credentials().get_frozen_credentials()

storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}

async def do():
    df = pl.scan_csv(
        "s3://.../*.csv",  # example path
        storage_options=storage_options,
    ).collect()
    print(df)

def lambda_handler(event, context):
    uvloop.run(do())
    return "OK"
```

@alexander-beedie could you please be so kind as to look at this issue? Thank you for the efforts! |
Hi there, not sure that it matters, but I'm having the exact same issue here using a docker image in AWS Lambda, when collecting my LazyFrame with its execution plan. Hopefully, the more people report running into this, the higher the fix gets prioritized... The LazyFrame explain plan is
I have cut out the select columns and some simple with_columns statements from this execution plan, and also the exact abfs path and filename, but it is trying to scan a CSV file from an Azure container. The code obviously runs fine locally, but not within Lambda, with the exact same error message as described above. Cheers! |
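For reference, a minimal sketch of printing such a plan via `LazyFrame.explain()` (the abfs URI is a placeholder; credentials/storage_options omitted):

```python
import polars as pl

lf = pl.scan_csv("abfs://{CONTAINER}/{PATH}/file.csv")
print(lf.explain())  # prints the optimized plan that collect() would execute
```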
Hi, from my understanding and my attempts, there's a bug that prevents using scan_csv inside Lambda's Docker. Hopefully someone can give more context here. |
It might be a bug on Lambda's side instead of Polars'? I have premium support there, so I might create a ticket for AWS to investigate from their side. Will let you know in this thread if and when anything comes of that... |
For me, the very same thing happens (memory allocation error) on |
This error is due to insufficient disk space - we require enough disk space for the entire file to be downloaded for `scan_csv()`. |
*Issue retitled: scan_csv() fails if there is not enough disk space to download the whole file (e.g. AWS lambda, or container)*
I've encountered this issue when using a test file of only a few KB. The issue still occurs even if there is plenty of available disk space in the lambda runtime |
That's correct - I've experienced the same issue with files of a few KB using scan_csv, while read_csv even works. |
Could it have to do with the system primitives that are used to determine the temporary directory, and how they interact with the Lambda ephemeral storage? |
That could make sense - some kind of permission issue, or a small disk partition for the temp path. It would be interesting to see where Polars attempts to download the file, since we specify no path for it on the lazy scans. |
I also really don't think it is due to insufficient disk space: the minimal example I created for AWS support fails to scan a tiny CSV file less than 7 KB large. The memory it is trying to allocate is 6750 bytes, with 512 MB of ephemeral storage allocated. jerome-viveret-onfido's comment makes more sense. AWS is looking into this as well. |
For debugging, setting the `POLARS_VERBOSE=1` environment variable will log where the file cache writes. It can be changed to a mount point with more storage by setting the `POLARS_TEMP_DIR` environment variable. |
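A minimal sketch of wiring that up in a handler module, assuming the two variables named in the comment above (setting them before importing polars is the safe option, so the file cache picks them up; the S3 path is a placeholder):

```python
import os

# Set before importing polars so the file cache picks them up
os.environ["POLARS_VERBOSE"] = "1"      # log file-cache activity, including its path
os.environ["POLARS_TEMP_DIR"] = "/tmp"  # redirect the cache to Lambda's writable storage

import polars as pl

def lambda_handler(event, context):
    df = pl.scan_csv("s3://{BUCKET}/{PATH}/test.csv").collect()
    return {"statusCode": 200, "body": str(df.shape)}
```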
Using POLARS_VERBOSE, I can see it is trying to write to: /tmp/polars/file-cache/. |
I went through the issue once again with a colleague. I'll leave some conclusions down here; they might be a reference for future investigations or ideas about the issue:
My current hypothesis for the issue is that since the cached files go under /tmp/file-cache, not directly to /tmp, AWS Lambda fails to write to the ephemeral storage. It would be interesting to modify this path in the Rust source to write directly to /tmp instead of a subdirectory and see if it works. |
In addition to the above, I tried running the following chunk of code inside Docker in AWS Lambda, thinking that it would fail, but it worked. So my hypothesis about writing to subfolders of /tmp not working is no longer valid... 🤔

```python
import os
from typing import Any

def handler(event: Any, context: Any):
    os.makedirs("/tmp/a")
    with open("/tmp/a/a.txt", "w") as f:
        f.write("a")
    with open("/tmp/a/a.txt", "r") as f:
        print(f.read())
```

Maybe it's just the allocate operation that fails, and the write wouldn't be the problem? This is the system call that Polars uses to allocate space for the file to be written: https://wasix.org/docs/api-reference/wasi/fd_allocate |
Please adjust the title of the issue, since it's not matching properly. The issue is about running Polars (scan_csv) inside a Docker container in AWS Lambda, and scan_csv fails even if there is disk space. Thanks in advance! |
*Issue retitled: scan_csv() in container fails with disk space error (e.g. AWS lambda, or container)*
@HectorPascual, I have added an environment flag that will be available in the next release and can be set: `POLARS_IGNORE_FILE_CACHE_ALLOCATE_ERROR=1`. |
I also just ran into this bug. In my environment the Lambda is already writing to
Furthermore, as @HectorPascual noted, the failure is happening with this function call:
This suggests AWS Lambda (and perhaps Docker in general) is preventing this syscall. To test this I attempted the same operation in Python:
And sure enough, it fails with a similar error:
|
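A minimal sketch of that kind of test, assuming `os.posix_fallocate` (which maps to the same fallocate-style allocation as the fd_allocate call linked above); per the comments, the plain write succeeds in Lambda while the allocation raises an OSError:

```python
import os

# A plain write to /tmp succeeds in Lambda...
path = "/tmp/fallocate-test"
with open(path, "w") as f:
    f.write("a")

# ...but pre-allocating space for the file does not:
# on Lambda this raises OSError, similar to the error Polars reports.
fd = os.open(path, os.O_RDWR)
try:
    os.posix_fallocate(fd, 0, 1024)  # allocate 1024 bytes starting at offset 0
finally:
    os.close(fd)
```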
Hey, very good example for reproducing the error - was it run in AWS Lambda too? You can try setting the flag @nameexhaustion mentioned and see if the result changes: `POLARS_IGNORE_FILE_CACHE_ALLOCATE_ERROR=1`. I wasn't able to check it yet, will check ASAP. |
I did try setting that |
I am having the same issue
The CSV file is approximately 3.8 GB. |
Read the full thread :-), then you'll see it was fixed in #20796. The fix made it into the Python Polars 1.21.0 release, so just upgrade your Polars version. |
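A quick sanity check after upgrading (the 1.21.0 threshold comes from the comment above):

```python
import polars as pl

print(pl.__version__)  # should be >= 1.21.0 to include the fix from #20796
```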
@nameexhaustion can we close this one now? |
Closed as completed via #20796 |
Checks
Reproducible example
Log output
Issue description
I'm new to Polars and attempting to implement an RFM analysis using the library. As part of my proposed architecture, I need to run the code in an AWS Lambda function. I've successfully implemented the RFM analysis and uploaded the code to Lambda using a Docker image.
Despite the code running successfully in my local container, I'm encountering a "failed to allocate 25954093 bytes" error when running it in the Lambda function. I've tried to troubleshoot the issue, ruling out credential errors (the scan_csv call itself doesn't throw any errors) and explicitly passing AWS credentials to the scan_csv function.
Attempts to Resolve
I've attempted to apply solutions from issues #7774 and #1777, including the following (a sketch combining both follows the list):
- Setting streaming=True on the collect method
- Defining my schema columns as pl.Utf8 or pl.Int64
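A sketch of what those two attempts look like combined (column names, dtypes, and the S3 path are hypothetical):

```python
import polars as pl

df = (
    pl.scan_csv(
        "s3://{BUCKET}/{PATH}/data.csv",
        schema_overrides={"customer_id": pl.Utf8, "amount": pl.Int64},
    )
    .collect(streaming=True)
)
```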
Thanks in advance 🤗
Expected behavior
The Polars code should work seamlessly in the Lambda function, just like it does in the local container, without any memory allocation errors.
Installed versions