
scan_csv() in container fails with disk space error (e.g. AWS lambda, or container) #17946

Closed
2 tasks done
GBMsejimenez opened this issue Jul 30, 2024 · 27 comments
Labels: accepted (Ready for implementation), bug (Something isn't working), python (Related to Python Polars)

Comments

@GBMsejimenez

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import json
import boto3
import polars as pl

session = boto3.Session()
CREDENTIALS = session.get_credentials()
STORAGE_OPTIONS = {
    "aws_region": "us-east-1",
    "aws_access_key_id": CREDENTIALS.access_key,
    "aws_secret_access_key": CREDENTIALS.secret_key,
}
if CREDENTIALS.token:
    STORAGE_OPTIONS.update({"session_token": CREDENTIALS.token})

print(STORAGE_OPTIONS)

# Define the schema for reading the CSV file
SCHEMA = {
    "user_id": pl.Int32,
    "transaction_date": pl.Datetime,
    "order_id": pl.Int32,
    "price": pl.Float32,
    "quantity": pl.Int16,
    "item_id": pl.Int32,
    "item_desc": pl.Utf8,
}


def read_s3(uri: str) -> pl.LazyFrame:
    """
    Read a CSV file from S3 using Polars.

    :param uri: S3 URI of the CSV file.
    :return: Polars LazyFrame with the CSV data.
    """
    return pl.scan_csv(
        uri,
        schema_overrides=SCHEMA,
        ignore_errors=True,
        truncate_ragged_lines=True,
        storage_options=STORAGE_OPTIONS,
    )


def apply_rfm(df: pl.LazyFrame) -> pl.LazyFrame:
    """
    Calculate RFM scores for each user and segment them.

    :param df: Input dataframe.
    :return: Dataframe with RFM scores and segments.
    """

    df_rfm = df.group_by("user_id").agg(
        recency=pl.col("transaction_date").max(),  # Most recent transaction date
        frequency=pl.col("order_id").n_unique(),  # Number of unique orders
        monetary=pl.col("total_amount_plus_taxes").sum(),  # Total monetary value
    )
    latest_date = df.select(pl.col("transaction_date").max()).collect().item()
    df_rfm = df_rfm.with_columns(
        recency=(
            latest_date - pl.col("recency")
        ).dt.total_days()  # Calculate recency in days
    )

    print("RFM Calculated")
    return df_rfm


def handler(event: dict, context: dict) -> dict:
    try:
        uri = event["Records"][0]["s3"]["uri"]

        df = read_s3(uri)

        df = apply_rfm(df)
        
        return {
            "statusCode": 200,
            "body": json.dumps("RFM loaded to DataBase"),
        }

    except Exception as e:
        print(f"Error in RFM process: {e}")
        return {"statusCode": 500, "body": json.dumps("Error in RFM process")}

Log output

INIT_REPORT Init Duration: 10008.73 ms	Phase: init	Status: timeout
Error in RFM process: failed to allocate 25954093 bytes to download uri = s3://aws-us-east-1-dev-s3-xxx/xxx/dataset_processed302e6eea-f9ed-4df4-8ad5-b7c8eada0658.csv
This error occurred with the following context stack:
[1] 'csv scan' failed
[2] 'filter' input failed to resolve
[3] 'filter' input failed to resolve
[4] 'select' input failed to resolve
END RequestId: 3c9f6613-f850-48e4-8658-1b47af8d8786
REPORT RequestId: 3c9f6613-f850-48e4-8658-1b47af8d8786	Duration: 26514.89 ms	Billed Duration: 26515 ms	Memory Size: 10240 MB	Max Memory Used: 159 MB

Issue description

I'm new to Polars and attempting to implement an RFM analysis using the library. As part of my proposed architecture, I need to run the code in an AWS Lambda function. I've successfully implemented the RFM analysis and uploaded the code to Lambda using a Docker image.

Despite the code running successfully in my local container, I'm encountering a "failed to allocate 25954093 bytes" error when running it in the Lambda function. I've tried to troubleshoot the issue by ruling out credential errors (the scan_csv call itself doesn't throw) and by explicitly passing AWS credentials to the scan_csv function.

Attempts to Resolve
I've attempted to apply solutions from issues #7774 and #1777, including:

Setting streaming=True on the collect method
Defining my schema columns as pl.Utf8 or integer types (a minimal sketch of both attempts follows below)
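For reference, a minimal sketch of both attempts, reusing the read_s3/SCHEMA/STORAGE_OPTIONS definitions from the reproducible example above (the S3 URI here is only a placeholder):

import polars as pl

# Attempt 1: force the streaming engine when collecting (Polars 1.x API).
lf = read_s3("s3://example-bucket/example.csv")  # placeholder URI
df = lf.collect(streaming=True)

# Attempt 2: relax the schema overrides to plain string columns.
RELAXED_SCHEMA = {name: pl.Utf8 for name in SCHEMA}
lf = pl.scan_csv(
    "s3://example-bucket/example.csv",  # placeholder URI
    schema_overrides=RELAXED_SCHEMA,
    storage_options=STORAGE_OPTIONS,
)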

Thanks in advance 🤗

Expected behavior

The Polars code should work seamlessly in the Lambda function, just as it does in the local container, without any memory allocation errors.

Installed versions

--------Version info---------
Polars:               1.3.0
Index type:           UInt32
Platform:             Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.34
Python:               3.12.3 (main, Jun  5 2024, 03:37:09) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)]       

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                2.0.1
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@GBMsejimenez GBMsejimenez added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 30, 2024
@ritchie46
Member

Can you set POLARS_PANIC_ON_ERR=1 and RUST_BACKTRACE=1 and show us the backtrace log?
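For anyone reproducing this in a Lambda container image, a minimal sketch of one way to set those flags is to export them in the handler module before any Polars I/O runs (they can equally be set as environment variables in the Lambda function configuration):

import os

# Debug flags requested above; set them in the process environment so the
# Rust side of Polars sees them.
os.environ["POLARS_PANIC_ON_ERR"] = "1"
os.environ["RUST_BACKTRACE"] = "1"

import polars as pl  # imported after setting the flags, to be safe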

@wjglenn3

Hi, I'm not sure if I should start another issue for this, but I'm pretty sure I'm having the same problem. When running inside an AWS Lambda, I am able to read a CSV and write it to a Parquet file using read_csv and write_parquet, but I have not had much luck with scan_csv and sink_parquet. I'm getting the same type of error and have tried the same methods as @GBMsejimenez to solve it.

I've gotten the code down to the bare minimum needed to reproduce the error (the CSV file being tested consists of only a header and two lines of data, and the bucket and path in the file name have been edited out).

import json
import os

import polars as pl
import s3fs

# These were intended as environment flags; plain Python assignments have no
# effect, so export them via os.environ instead.
os.environ["POLARS_PANIC_ON_ERR"] = "1"
os.environ["RUST_BACKTRACE"] = "1"
 
# Lambda entry
def lambda_handler(event, context):
    
    pl.show_versions()
    
    csv_file = 's3://{BUCKET}/{PATH}/test.csv'
    #parquet_file = 's3://{BUCKET}/{PATH}/test.parquet'

    fs = s3fs.S3FileSystem(anon=False)

    df = pl.scan_csv(csv_file).collect(streaming=True)
    

    return {
        'statusCode': 200,
        'body': json.dumps("Finished")
    }

This gives me the following error (with {BUCKET} and {PATH} replaced by actual values):

[ERROR] ComputeError: failed to allocate 1343 bytes to download uri = s3://{BUCKET}/{PATH}/test.csv
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 40, in lambda_handler
    df = pl.scan_csv(csv_file).collect()
  File "/opt/python/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))

My polars versions if necessary

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.10.219-229.866.amzn2.x86_64-x86_64-with-glibc2.26
Python:               3.11.6 (main, Feb  7 2024, 11:27:56) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)]
----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@qmg-tmay

@wjglenn3 I'm experiencing the same issue when using a Docker container-based Lambda.

@HectorPascual

HectorPascual commented Sep 13, 2024

Hey, we are experiencing the same issue with Docker in AWS Lambda; we have attempted all the combinations.

I also tried installing s3fs, which is needed for read_csv, but it also breaks with this error:

ComputeError : failed to allocate 12345 bytes to download uri = s3://...

Here's my minimal example that breaks:

import asyncio

import boto3
import polars as pl
import uvloop

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

session = boto3.session.Session(region_name="us-west-2")
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}


async def do():
    df = pl.scan_csv(
        "s3://.../*.csv",  # example path
        storage_options=storage_options,
    ).collect()
    print(df)


def lambda_handler(event, context):
    uvloop.run(do())
    return "OK"

@alexander-beedie could you please be so kind as to look into this issue?

Thank you for the efforts!

@AnskeVan

Hi there,

Not sure that it matters, but I'm having the exact same issue here using a Docker image in AWS Lambda, when collecting my LazyFrame with its execution plan. Hopefully, the more people report running into this, the higher the fix gets prioritized...

The LazyFrame explain plan is:

WITH_COLUMNS:
......
   SELECT 
   ...........
   FROM
     WITH_COLUMNS:
     [false.alias("monthly_export_origin")
	 , String(abfs://.../.../../filename.csv).alias("export_filename")
	 , String(2024-11-22T10:17:11.599+00:00).str.strptime([String(raise)]).alias("rec_inserted")] 
      Csv SCAN [abfs://.../.../../filename.csv]
      PROJECT 65/65 COLUMNS

I have cut out the select columns and some simple with_columns statements from this execution plan, as well as the exact abfs path and filename, but it is trying to scan a CSV file from an Azure container. The code runs fine locally, but not within Lambda, with the exact same error message as described above:
failed to allocate 122128901 bytes to download uri = abfs://.../.../../filename.csv

Cheers!

@HectorPascual


Hi,

From my understanding and my attempts, there's a bug preventing scan_csv from working inside Lambda's Docker runtime. Hopefully someone can give more context here.

@AnskeVan


It might be a bug on Lambda's side rather than Polars'? I have premium support there, so I might create a ticket for AWS to investigate from their side. I'll let you know in this thread if and when anything comes out of that...

@jerome-viveret-onfido

For me, the very same thing happens (a memory allocation error) on collect_schema() when applied to a lazy frame. It is worth noting that it happens with scan_csv only, not with scan_parquet.

@nameexhaustion
Collaborator

This error is due to insufficient disk space: scan_csv requires enough disk space for the entire file to be downloaded. We may improve this in the future as streaming functionality improves.
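If the file genuinely does not fit on disk, one workaround on the AWS side is to raise the function's ephemeral storage. A hedged sketch using boto3 (the function name is a placeholder; Size is in MB and Lambda accepts 512 to 10,240):

import boto3

# Sketch: increase the Lambda function's /tmp ephemeral storage so a large
# CSV download can fit; "my-polars-function" is a placeholder name.
client = boto3.client("lambda")
client.update_function_configuration(
    FunctionName="my-polars-function",
    EphemeralStorage={"Size": 4096},  # MB
)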

@nameexhaustion nameexhaustion added enhancement New feature or an improvement of an existing feature new-streaming Features for or dependent on the new streaming engine and removed needs triage Awaiting prioritization by a maintainer labels Nov 28, 2024
@nameexhaustion nameexhaustion changed the title Groupby using lazy mode on a csv throw an memory allocation error when running on AWS lambda scan_csv() fails if there is not enough disk space to download the whole file (e.g. AWS lambda, or container) Nov 28, 2024
@nameexhaustion nameexhaustion removed the bug Something isn't working label Nov 28, 2024
@qmg-tmay


I've encountered this issue when using a test file of only a few KB. The issue still occurs even if there is plenty of available disk space in the Lambda runtime.

@HectorPascual


That's correct. I've experienced the same issue with files of only a few KB using scan_csv, while read_csv works fine.

@jerome-viveret-onfido

jerome-viveret-onfido commented Nov 28, 2024

Could it have to do with the system primitives that are used to determine the temporary directory, and how that interacts with the Lambda ephemeral storage?

@HectorPascual


That could make sense: some kind of permission issue, or a small disk partition for the temp path. It would be interesting to see where Polars attempts to download the file, since we specify no path for it in the lazy scans.

@AnskeVan


I also really don't think it is due to insufficient disk space: the minimal example I created for AWS support fails to scan a tiny CSV file less than 7 KB in size. The space it is trying to allocate is 6750 bytes, with 512 MB of ephemeral storage allocated. @jerome-viveret-onfido's comment makes more sense. AWS is looking into this as well.

@nameexhaustion
Collaborator

For debugging, setting the POLARS_VERBOSE=1 environment variable will print the path of the temporary directory.

It can be changed to a mount point with more storage by setting the POLARS_TEMP_DIR environment variable.
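A minimal sketch of that setup in a Lambda handler module (the cache directory name here is just an example):

import os

# Log the file-cache path and point the cache at a directory under writable /tmp.
os.environ["POLARS_VERBOSE"] = "1"
os.environ["POLARS_TEMP_DIR"] = "/tmp/polars-cache"  # example path

import polars as pl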

@AnskeVan

AnskeVan commented Dec 3, 2024

Using POLARS_VERBOSE, I can see it is trying to write to /tmp/polars/file-cache/.
If I change the path (with POLARS_TEMP_DIR) to just /tmp, I still get the error: failed to create temporary directory: path = '/tmp', err = Read-only file system (os error 30)
The problem here is indeed specific to using Lambda with Docker. Just running in Lambda, the ephemeral storage would probably work even without changing the default tmp dir path, but I don't think you can use the actual ephemeral store from a Docker container in Lambda. In accordance with this, the AWS docs state the following:
The container image must be able to run on a read-only file system. Your function code can access a writable /tmp directory with between 512 MB and 10,240 MB, in 1-MB increments, of storage.
Being able to run on a read-only file system is clearly not the case when you use scan_csv.
Docker containers based on Linux images do have a /tmp folder, but it can only be read from, not written to.
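As a quick sanity check from inside the function, a sketch like this reports whether /tmp is writable and how much space is actually free, which helps separate the read-only-filesystem theory from a genuine lack of disk space:

import os
import shutil

# In-Lambda diagnostic: is /tmp writable, and how much space is free?
print("/tmp writable:", os.access("/tmp", os.W_OK))
print("/tmp free bytes:", shutil.disk_usage("/tmp").free)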

@HectorPascual

I went through the issue once again with a colleague. I'll leave some conclusions down here; they might be a reference for future investigations or ideas about the issue:

My current hypothesis is that since the cached files go under /tmp/file-cache, not directly to /tmp, AWS Lambda fails to write to the ephemeral storage. It would be interesting to modify this path in the Rust source to write directly to /tmp instead of a subdirectory and see if it works.

@HectorPascual

HectorPascual commented Dec 12, 2024

In addition to the above, I tried running the following chunk of code inside Docker in AWS Lambda, thinking that it would fail, but it worked. So my hypothesis about writes to subfolders of /tmp not working is no longer valid... 🤔

import os
from typing import Any

def handler(event: Any, context: Any):
    os.makedirs("/tmp/a")
    with open("/tmp/a/a.txt", "w") as f:
        f.write("a")

    with open("/tmp/a/a.txt", "r") as f:
        print(f.read())

Maybe it's just the allocate operation that fails, and the write wouldn't? This is the system call that Polars uses to allocate space for the file to be written: https://wasix.org/docs/api-reference/wasi/fd_allocate

@HectorPascual

HectorPascual commented Dec 18, 2024

Please adjust the title of the issue, since it doesn't describe the problem accurately.

The issue is about running Polars (scan_csv) inside a Docker container in AWS Lambda, and scan_csv fails even when there is disk space available.

Thanks in advance!

@nameexhaustion nameexhaustion changed the title scan_csv() fails if there is not enough disk space to download the whole file (e.g. AWS lambda, or container) scan_csv() in container fails with disk space error (e.g. AWS lambda, or container) Dec 19, 2024
@nameexhaustion
Collaborator

@HectorPascual, I have added an environment flag, POLARS_IGNORE_FILE_CACHE_ALLOCATE_ERROR=1, that will be available in the next release. Could you give it a try (together with POLARS_VERBOSE=1 set) to see if it helps?

@mattyellen

I also just ran into this bug. In my environment the Lambda is already writing to /tmp, and as others have reported, it is failing with very small allocations. This suggests it's not a disk space issue, nor a read-only filesystem.

Furthermore, as @HectorPascual noted, the failure is happening in this function call: file.allocate(remote_metadata.size). So it's not just trying to write to disk; it's trying to execute an fd_allocate syscall. The specific error being returned is:

Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }

This suggests AWS Lambda (and perhaps Docker in general) is preventing this syscall. To test this, I attempted the same operation in Python:

import logging
import os

logger = logging.getLogger(__name__)

fd = os.open("/tmp/my_file.txt", os.O_RDWR | os.O_CREAT)
try:
    logger.info("Allocating 1024 bytes from the beginning of the file")
    os.posix_fallocate(fd, 0, 1024)
finally:
    os.close(fd)

And sure enough, it fails with a similar error:

  "errorMessage": "[Errno 1] Operation not permitted",
  "errorType": "PermissionError",
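A companion sketch (an assumption based on the earlier /tmp write test in this thread, not verified against Lambda's sandbox documentation): extending the same file with ftruncate or a plain write is expected to succeed, which would point at the allocate call specifically being blocked rather than /tmp being unwritable:

import os

# Create the same file and extend it without posix_fallocate.
fd = os.open("/tmp/my_file.txt", os.O_RDWR | os.O_CREAT)
try:
    os.ftruncate(fd, 1024)      # extend to 1024 bytes without an explicit allocate
    os.pwrite(fd, b"hello", 0)  # plain write, exercising the same disk path
finally:
    os.close(fd)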

@HectorPascual


Hey, that's a very good example for reproducing the error. Was it run in AWS Lambda too?

You can try setting the flag @nameexhaustion mentioned and see if the result changes: POLARS_IGNORE_FILE_CACHE_ALLOCATE_ERROR=1. I wasn't able to check it yet; will check ASAP.

@nameexhaustion nameexhaustion added bug Something isn't working accepted Ready for implementation and removed enhancement New feature or an improvement of an existing feature new-streaming Features for or dependent on the new streaming engine labels Jan 19, 2025
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 19, 2025
@nameexhaustion nameexhaustion self-assigned this Jan 19, 2025
@mattyellen

I did try setting that POLARS_IGNORE_FILE_CACHE_ALLOCATE_ERROR environment variable. It seemed to get past the first check but then failed later on. I could collect more information if necessary, but it looks like we may already have a fix.

@Ayusharma0698

I am having the same issue:
polars.exceptions.ComputeError: failed to allocate 4044667404 bytes to download uri.
My scan_csv call looks something like this:

pl.scan_csv(
    s3_path,
    storage_options={
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "region": os.environ["REGION"],
        "session_token": credentials.token,
    },
    infer_schema_length=10000,
    ignore_errors=True,
    truncate_ragged_lines=True,
    skip_rows=skip_first_rows,
    separator=delimiter,
    has_header=with_header,
    glob=False,
    encoding="utf8-lossy",
)

The CSV file is approximately 3.8 GB. Any suggestions would be appreciated.

@AnskeVan


Read the full thread :-) and you'll see it is fixed in #20796. The fix made it into the Python Polars 1.21.0 release, so just upgrade your Polars version.
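For completeness, a quick post-upgrade check (assuming pip; pl.__version__ is the standard version attribute):

# After: pip install --upgrade "polars>=1.21.0"
import polars as pl

# Confirm the running interpreter picked up a release that includes the fix.
assert tuple(map(int, pl.__version__.split(".")[:3])) >= (1, 21, 0), pl.__version__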

@ritchie46
Member

@nameexhaustion can we close this one now?

@nameexhaustion
Collaborator

Closed as completed via #20796

@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jan 28, 2025