Memory leak in source-s3 while reading big CSV file #6870
Comments
Thanks for opening the issue, @amorskoy! Currently we read the file in chunks rather than loading all of it into memory. Maybe something is not working as intended... we're going to investigate. See airbyte/airbyte-integrations/connectors/source-file/source_file/client.py, lines 298 to 307 in e5abaec
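For readers unfamiliar with the chunked approach mentioned in this comment, here is a minimal sketch of the general idea. It is not the connector's code at the referenced lines: `iter_csv_chunks` and `chunk_size` are hypothetical names, and it uses only the standard `csv` module.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_csv_chunks(stream: TextIO, chunk_size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of at most `chunk_size` row dicts from a CSV text stream.

    Hypothetical helper for illustration; not taken from the Airbyte codebase.
    """
    reader = csv.DictReader(stream)
    while True:
        # islice pulls only the next `chunk_size` rows from the reader.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Usage: each chunk can be processed and dropped before the next is read.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
sizes = [len(chunk) for chunk in iter_csv_chunks(sample, chunk_size=2)]
```

With this pattern memory should stay roughly proportional to `chunk_size`, provided nothing downstream keeps references to earlier chunks.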
@amorskoy Can you share the script to create the mock file using Faker?
@marcosmarxm Sure, let me do it in a few hours, once I arrive.
@marcosmarxm Here it is. I found it in a gist and slightly modified it to support cyclic headers and faster generation, using a suffix for deduplication. It is not ideal for data processing, but for the connectivity domain it is enough.
Feel free to edit allow_cycled_headers to tune it for your needs
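The script itself is not reproduced in this thread, so the following is only a sketch of one possible reading of "cyclic headers" with "a suffix for deduplication". Everything here is an assumption: it uses the standard `random` module instead of Faker, `BASE_HEADERS`, `build_headers` and `write_mock_csv` are hypothetical names, and the behavior given to `allow_cycled_headers` is a guess.

```python
import csv
import io
import random
from itertools import cycle, islice
from typing import List, TextIO

# Hypothetical base fields; the original script presumably used Faker providers.
BASE_HEADERS = ["name", "email", "city"]


def build_headers(n_columns: int, allow_cycled_headers: bool = True) -> List[str]:
    """Repeat the base headers until n_columns is reached, suffixing each
    with its column index so that repeated names stay unique."""
    if not allow_cycled_headers:
        if n_columns > len(BASE_HEADERS):
            raise ValueError("not enough base headers without cycling")
        return BASE_HEADERS[:n_columns]
    cycled = islice(cycle(BASE_HEADERS), n_columns)
    return [f"{name}_{i}" for i, name in enumerate(cycled)]


def write_mock_csv(out: TextIO, n_rows: int, n_columns: int, seed: int = 0) -> None:
    """Write a header row plus n_rows of random placeholder values."""
    rng = random.Random(seed)
    writer = csv.writer(out)
    writer.writerow(build_headers(n_columns))
    for _ in range(n_rows):
        writer.writerow([f"v{rng.randrange(10**6)}" for _ in range(n_columns)])


# Usage: a tiny 3 x 5 file written to an in-memory buffer.
buffer = io.StringIO()
write_mock_csv(buffer, n_rows=3, n_columns=5)
lines = buffer.getvalue().splitlines()
header_line = lines[0]
```

Rows are written one at a time, so the generator itself does not need to hold the whole file in memory.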
@sherifnada Thanks, I will, but it may take a few days as I am a little out of context at the moment. Please let me know whether I should close the issue in the meantime or keep it open until I have checked.
I'll keep this open until we hear back from you :)
@sherifnada Well, it looks better, but it seems the leak is still present on the latest airbyte master - commit
Seems that RAM consumption on S3 reader
cc @marcosmarxm Sorry, I forgot to cc you above ^^^^
Environment
11645689431a69c689a15b620e4a2b6bc7b045c3
Current Behavior
I have a CSV file on S3: 35 GB, 8M rows x 500 columns, synthetically generated with the Python Faker library.
1K sample is attached
sample_synth_1K_500.csv
Memory consumption grows until OOM (30 GB on a 32 GB EC2 instance).
![memleak_source_s3](https://user-images.githubusercontent.com/1633433/136349912-bb9ab03a-e101-4b0e-b1d7-25360f8b1f49.png)
![resources_8m_x_500](https://user-images.githubusercontent.com/1633433/136349916-7baf6522-50a7-43d9-ba8a-7e513dc40b4c.png)
Screenshots for htop in the middle and before OOM are attached.
Expected Behavior
A normal sync without OOM and with predictable, fixed RAM consumption for the connector's main.py
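One way to check the "fixed RAM consumption" expectation outside of htop is to compare peak allocations with the standard `tracemalloc` module. The sketch below uses a synthetic row generator rather than the connector; `rows`, `consume_streaming`, `consume_buffered` and `peak_bytes` are hypothetical names chosen for illustration.

```python
import tracemalloc
from typing import Callable, Iterator


def rows(n: int) -> Iterator[str]:
    """Synthetic stand-in for rows read from a large CSV file."""
    for i in range(n):
        yield f"row-{i}," + "x" * 100


def consume_streaming(n: int) -> int:
    """Process rows one at a time; nothing is retained between rows."""
    total = 0
    for row in rows(n):
        total += len(row)
    return total


def consume_buffered(n: int) -> int:
    """Materialize every row first, as a leaking reader effectively does."""
    buffered = list(rows(n))
    return sum(len(row) for row in buffered)


def peak_bytes(func: Callable[[int], int], n: int) -> int:
    """Return the peak number of bytes allocated while func(n) runs."""
    tracemalloc.start()
    try:
        func(n)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


streaming_peak = peak_bytes(consume_streaming, 20_000)
buffered_peak = peak_bytes(consume_buffered, 20_000)
print(f"streaming peak: {streaming_peak} B, buffered peak: {buffered_peak} B")
```

If the streaming peak grows with the row count in a real reader, something is holding on to rows that have already been processed.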
Logs
Logs for sync job attached
logs-6-0.txt