
Memory leak after updating to 1.25.0 #3614

Closed
ron8mcr opened this issue Mar 3, 2023 · 10 comments
Labels: closed-for-staleness, p2 (This is a standard priority issue)

Comments

ron8mcr commented Mar 3, 2023

Describe the bug

Memory leaks started to happen after updating to boto3>=1.25.0.

Expected Behavior

No changes in memory usage after updating the version

Current Behavior

Memory usage increases after each execution of

  • create session
  • create client
  • call some API

Reproduction Steps

Here is the smallest script that reproduces the problem:

import boto3

for _ in range(100):
    boto3.Session().client('s3').list_buckets()

Running this with boto3<1.25.0 requires ~80MB of memory:

pip uninstall boto3 botocore
pip install 'boto3<1.25.0'
/usr/bin/time -v python list_buckets.py

...
Maximum resident set size (kbytes): 79404
...

Running this with boto3>=1.25.0 requires ~177MB:

pip uninstall boto3 botocore
pip install 'boto3>=1.25.0,<1.26.0'
/usr/bin/time -v python list_buckets.py

...
Maximum resident set size (kbytes): 177208
...

The more iterations I set in range, the larger the difference in memory usage.

Possible Solution

No response

Additional Information/Context

Found this in the context of a Django app using https://github.com/revsys/django-health-check (which performs periodic calls to an S3 bucket).

I've checked similar issues:

But those issues refer to earlier versions of the SDK, while I've only hit this problem with 1.25.0+.

SDK version used

1.25.0

Environment details (OS name and version, etc.)

ManjaroLinux 22.0.4

ron8mcr added the bug (This issue is a confirmed bug.) and needs-triage (This issue or PR still needs to be triaged.) labels on Mar 3, 2023
tim-finnigan self-assigned this on Mar 7, 2023
tim-finnigan (Contributor) commented:

Hi @ron8mcr, thanks for reaching out. In the boto3 CHANGELOG you can see the following change introduced in 1.25.0:

feature:Endpoints: [``botocore``] Implemented new endpoint ruleset system to dynamically derive endpoints and settings for services

The S3 endpoint ruleset is particularly large, which could explain the increased memory usage you're describing. Do you see the issue when using other services? What are you using to profile memory usage? And have you tried using gc.collect() or any other suggestions from related issues?

tim-finnigan added the response-requested (Waiting on additional information or feedback.) and s3 labels and removed the bug and needs-triage labels on Mar 7, 2023
ron8mcr (Author) commented Mar 9, 2023

Hi @tim-finnigan

The S3 endpoint ruleset is particularly large, which could explain the increased memory usage you're describing.

Yes, some increase in memory usage is expected. However, memory usage grows after each API call, and the more iterations happen, the more RAM is required.


Do you see the issue when using other services?

Checked with cloudwatch, ec2, ecs, and iam; no problems.


What are you using to profile memory usage?

This problem was actually found by the CloudWatch agent and later confirmed with a sample script using htop and psutil.Process(os.getpid()).memory_info().rss.
tracemalloc pointed to python/json, similar to this reply


And have you tried using gc.collect() or any other suggestions from related issues?

Yes, gc.collect() didn't help. Storing a session per thread also didn't help, as the example script does not use any multithreading/multiprocessing/async features.


P.S. I've added a plot of memory usage and increased the number of iterations. The memory usage growth actually stops at some point.

Sample script
import os

import boto3
import psutil
import matplotlib.pyplot as pp

memory_usages = []

for i in range(150):
    # record the current RSS (in MB) before each call
    process = psutil.Process(os.getpid())
    memory = process.memory_info().rss / 1024 / 1024
    memory_usages.append(memory)
    boto3.Session().client('s3').list_buckets()

pp.plot(memory_usages)
pp.show()

[plot: memory usage (MB) per iteration; growth plateaus after a number of iterations]

github-actions bot removed the response-requested label on Mar 9, 2023
alexanderGerbik commented:

I think this issue starts from boto3>=1.24.72, which depends on botocore>=1.27.72.

In botocore==1.27.72 they implemented EndpointProvider, which uses lru_cache (https://github.com/boto/botocore/blob/1.27.72/botocore/endpoint_provider.py#L671). That explains the memory usage growth and why it stops: the growth stops once the LRU cache reaches maxsize and starts evicting cached values.
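
A minimal sketch of the mechanism (illustrative names only, not botocore code): when lru_cache decorates an instance method, each cache key holds a reference to self, so every instance that has been called through the cache stays alive until its entry is evicted.

import gc
from functools import lru_cache


class Provider:
    """Hypothetical stand-in for an object holding a large ruleset."""

    def __init__(self):
        self.ruleset = "x" * 1_000_000  # ~1 MB payload per instance

    @lru_cache(maxsize=100)  # this cache is shared by all instances
    def resolve(self, region):
        return f"https://service.{region}.example.com"


providers = [Provider() for _ in range(50)]
for p in providers:
    p.resolve("us-east-1")

del providers
gc.collect()

# All 50 instances are still referenced from the cache keys, so none were freed.
alive = [o for o in gc.get_objects() if isinstance(o, Provider)]
print(len(alive))  # 50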

tim-finnigan (Contributor) commented:

After bringing this issue up for discussion with the team, I wanted to clarify a couple of things. The behavior originally described here is really increased memory usage rather than a memory leak. As previously mentioned, the new endpoint ruleset files recently added for services like S3 are large (~1.6MB). But the excessive memory usage problem here is due to the accumulation of Sessions. If you reuse sessions, you don't need to keep reloading the same large JSON files. Maybe this is something that could be better clarified in the documentation.
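
For example, in the reproduction script above the endpoint rulesets are reloaded on every iteration; hoisting the session and client out of the loop (a sketch of the reuse suggestion, not new boto3 API) avoids that:

import boto3

# Create the session and client once and reuse them; the large endpoint
# ruleset and other data files are then loaded only a single time.
session = boto3.Session()
s3 = session.client('s3')

for _ in range(100):
    s3.list_buckets()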

tim-finnigan added the documentation (This is a problem with documentation.) and p2 (This is a standard priority issue) labels on Mar 14, 2023
tim-finnigan removed their assignment on Mar 14, 2023

hemslo commented Mar 15, 2023

Using lru_cache on an instance method causes a memory leak because self is captured in the cache.
The simple fix is to change it to an instance-level cache in __init__:

self.resolve_endpoint = lru_cache(maxsize=CACHE_SIZE)(self.resolve_endpoint)

If a global cache is still wanted, refactor it into a static method or, even better, a function.
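
A fuller sketch of that fix (illustrative class, not the real botocore EndpointProvider): wrapping the bound method in __init__ ties the cache's lifetime to the instance, so the cached entries are released when the instance is garbage-collected.

from functools import lru_cache

CACHE_SIZE = 100  # illustrative value


class EndpointProviderLike:
    """Hypothetical stand-in for a class whose method is expensive to call."""

    def __init__(self, ruleset):
        self.ruleset = ruleset
        # Per-instance cache: replaces a class-level @lru_cache that would
        # otherwise keep a reference to `self` in a module-lifetime cache.
        self.resolve_endpoint = lru_cache(maxsize=CACHE_SIZE)(self.resolve_endpoint)

    def resolve_endpoint(self, region):
        # expensive rule evaluation would go here
        return f"https://service.{region}.example.com"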

References:

  1. Python functools lru_cache with instance methods: release object
  2. Don't wrap instance methods with 'functools.lru_cache' decorator in Python

ron8mcr (Author) commented Mar 16, 2023

@tim-finnigan @alexanderGerbik @hemslo
I can confirm that disabling lru_cache in botocore leads to a significant decrease in memory usage:

[plots of memory usage: botocore==1.29.92, with disabled cache, and botocore==1.27.71]

Meanwhile, botocore already has a decorator for instance-level caching, see https://github.com/boto/botocore/blob/1.29.92/botocore/utils.py#L1395

I've tried using it and everything worked perfectly for my case.
Prepared PR for botocore - boto/botocore#2889
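
For illustration, an instance-level cache decorator along these lines (a simplified sketch; botocore's actual instance_cache differs in detail) stores results on the instance itself, so they are freed together with it:

import functools


def instance_memo(method):
    """Hypothetical simplified instance-level cache (not botocore's implementation)."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # The cache dict lives on the instance, not at module level, so it is
        # garbage-collected along with the instance.
        cache = self.__dict__.setdefault('_instance_memo', {})
        key = (method.__name__, args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = method(self, *args, **kwargs)
        return cache[key]

    return wrapper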

@alexanderGerbik @hemslo much thanks for your suggestions

tim-finnigan (Contributor) commented:

Hi ron8mcr, thanks for reporting this and doing the deep dive. We have been researching this in parallel to understand what is happening because we have not seen the same behavior in our own integration testing.

The root cause of what you are seeing is the large number of botocore sessions that get created in your code snippet. Creating new session objects is known to be expensive. Each session object not only consumes a large amount of memory but is also slow to create, because several files get loaded from the file system to create the object. (Coincidentally, those files do not get cached across Session objects, and instance_cache is used to ensure this.) The number of Session objects in each script should be minimized; usually one is enough. For example, we recommend creating the botocore session outside the AWS Lambda handler so it is reused across invocations (see https://docs.aws.amazon.com/codeguru/detector-library/python/lambda-client-reuse/ for more context).
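
A sketch of that Lambda recommendation (hypothetical handler, assuming an S3 use case): the client is created once per execution environment and reused across invocations.

import boto3

# Created at import time, outside the handler, so the client and its loaded
# endpoint ruleset are reused for every invocation in this environment.
s3 = boto3.client('s3')


def lambda_handler(event, context):
    response = s3.list_buckets()
    return {'bucket_count': len(response['Buckets'])}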

The reason you are observing high memory usage with the lru_cache is not that each cache entry is large (even though self is large, the cache entry only stores a reference to it), but that the existence of the cache entries prevents effective garbage collection of the many Session objects. Each session will have its own instance of the endpoint ruleset object (self), and these can only be removed from memory once the corresponding cache entry is evicted. The cache filling up explains the plateau in memory consumption shown in this post.

As a path forward, we would recommend that you review your code for opportunities to reduce the number of separate botocore Sessions. This will probably have memory and runtime benefits beyond the problem described in this post. We will also review your PR because it mitigates the problem for those rare situations where a large number of Session objects are necessary. However, this requires further research because we have to thoroughly evaluate whether the downside of instance_cache not having a maximum size could lead to performance degradations in other use cases.


hemslo commented Mar 17, 2023

We have use cases with a large number of Session objects: to support dynamic AWS accounts, they can't share the same session. Because of the lru_cache introduced in botocore 1.27.72, we have to pin boto3==1.24.71 and botocore==1.27.71.

It would be nice if we could cache _load_file across sessions; it consumes 300+ MB at our peak usage.
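
One possible mitigation, sketched under the assumption that per-account credentials (for example, from STS AssumeRole) are available: create the per-account clients from a single shared Session so its data loader caches the JSON files once. Whether this fits a given multi-account setup would need to be verified.

import boto3

# A single shared session: its data loader caches service and endpoint JSON
# files, so clients for different accounts do not reload them.
shared_session = boto3.Session()


def client_for_account(service_name, credentials, region):
    # `credentials` is assumed to be a dict of temporary keys for one account.
    return shared_session.client(
        service_name,
        region_name=region,
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken'],
    )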

ron8mcr (Author) commented Mar 17, 2023

Hi @tim-finnigan,

As a path forward, we would recommend that you review your code for opportunities to reduce the number of separate botocore Sessions

Actually, we found this within different Django projects which didn't even use boto3 in their own codebase. However, it looks like the problem may be solved after a few changes in the other app:

We will also review your PR

Thanks in advance!

jonemo added the closing-soon (This issue will automatically close in 4 days unless further comments are made.) label and removed the documentation and s3 labels on May 9, 2023
jonemo (Contributor) commented May 9, 2023

Addressed in boto/botocore#2934

github-actions bot removed the closing-soon label on May 9, 2023
tim-finnigan added the closing-soon label on May 9, 2023
github-actions bot added the closed-for-staleness label and removed the closing-soon label on May 11, 2023