
Memory leak after updating to 1.25.0 #3614

Closed
ron8mcr opened this issue Mar 3, 2023 · 10 comments
Labels: closed-for-staleness, p2 (This is a standard priority issue)

Comments

ron8mcr commented Mar 3, 2023

Describe the bug

Memory leaks started to happen after updating to boto3>=1.25.0.

Expected Behavior

No changes in memory usage after updating the version

Current Behavior

Memory usage increases after each execution of

  • create session
  • create client
  • call some API

Reproduction Steps

Here is the smallest script that reproduces the problem:

import boto3

for _ in range(100):
    boto3.Session().client('s3').list_buckets()

Running this with boto3<1.25.0 requires ~80MB of memory:

pip uninstall boto3 botocore
pip install 'boto3<1.25.0'
/usr/bin/time -v python list_buckets.py

...
Maximum resident set size (kbytes): 79404
...

Running this with boto3>=1.25.0 requires ~177MB:

pip uninstall boto3 botocore
pip install 'boto3>=1.25.0,<1.26.0'
/usr/bin/time -v python list_buckets.py

...
Maximum resident set size (kbytes): 177208
...

The more iterations I set in range, the larger the difference in memory usage.

Possible Solution

No response

Additional Information/Context

Found this in the context of a Django app using https://github.com/revsys/django-health-check (which performs periodic calls to an S3 bucket).

I've checked similar issues:

But those issues refer to earlier versions of the SDK, while I've only hit this problem with 1.25.0+.

SDK version used

1.25.0

Environment details (OS name and version, etc.)

ManjaroLinux 22.0.4

ron8mcr added the bug (This issue is a confirmed bug.) and needs-triage (This issue or PR still needs to be triaged.) labels on Mar 3, 2023
tim-finnigan self-assigned this on Mar 7, 2023
tim-finnigan (Contributor) commented:

Hi @ron8mcr, thanks for reaching out. In the boto3 CHANGELOG you can see the following change introduced in 1.25.0:

feature:Endpoints: [``botocore``] Implemented new endpoint ruleset system to dynamically derive endpoints and settings for services

The S3 endpoint ruleset is particularly large, which could explain the increased memory usage you're describing. Do you see the issue when using other services? What are you using to profile memory usage? And have you tried using gc.collect() or any other suggestions from related issues?

tim-finnigan added the response-requested (Waiting on additional information or feedback.) and s3 labels and removed the bug and needs-triage labels on Mar 7, 2023
ron8mcr (Author) commented Mar 9, 2023

Hi @tim-finnigan

The S3 endpoint ruleset is particularly large, which could explain the increased memory usage you're describing.

Yes, some increase in memory usage is expected. However, memory usage grows after each API call, and the more iterations happen, the more RAM is required.


Do you see the issue when using other services?

Checked with cloudwatch, ec2, ecs, and iam; no problems.


What are you using to profile memory usage?

This problem was actually found by the CloudWatch agent and later confirmed with a sample script using htop and psutil.Process(os.getpid()).memory_info().rss.
tracemalloc pointed to python/json, similar to this reply


And have you tried using gc.collect() or any other suggestions from related issues?

Yes, gc.collect() didn't help. Storing a session per thread also didn't help, as the example script does not use any multithreading/multiprocessing/async features.


P.S. I've added a plot of memory usage and increased the number of iterations. The memory usage growth actually stops at some point.

Sample script
import os

import boto3
import psutil
import matplotlib.pyplot as pp

memory_usages = []

for i in range(150):
    # record the current RSS (in MB) before each call
    process = psutil.Process(os.getpid())
    memory = process.memory_info().rss / 1024 / 1024
    memory_usages.append(memory)
    boto3.Session().client('s3').list_buckets()

pp.plot(memory_usages)
pp.show()

[plot: memory usage (MB) per iteration; growth plateaus after a number of iterations]

github-actions bot removed the response-requested label on Mar 9, 2023
alexanderGerbik commented:

I think this issue starts from boto3>=1.24.72, which depends on botocore>=1.27.72.

In botocore==1.27.72 they implemented EndpointProvider, which uses lru_cache (https://github.com/boto/botocore/blob/1.27.72/botocore/endpoint_provider.py#L671). That explains the memory usage growth and why it stops: the growth stops once the LRU cache reaches maxsize and starts evicting cached values.
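
A minimal sketch of the mechanism (illustrative names only, not botocore code): when lru_cache decorates an instance method, each cache key holds a reference to self, so every instance that has been called through the cache stays alive until its entry is evicted.

import gc
from functools import lru_cache


class Provider:
    """Hypothetical stand-in for an object holding a large ruleset."""

    def __init__(self):
        self.ruleset = "x" * 1_000_000  # ~1 MB payload per instance

    @lru_cache(maxsize=100)  # this cache is shared by all instances
    def resolve(self, region):
        return f"https://service.{region}.example.com"


providers = [Provider() for _ in range(50)]
for p in providers:
    p.resolve("us-east-1")

del providers
gc.collect()

# All 50 instances are still referenced from the cache keys, so none were freed.
alive = [o for o in gc.get_objects() if isinstance(o, Provider)]
print(len(alive))  # 50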

tim-finnigan (Contributor) commented:

After bringing this issue up for discussion with the team, I wanted to clarify a couple of things. The behavior originally described here is really increased memory usage rather than a memory leak. As previously mentioned, the new endpoint ruleset files recently added for services like S3 are large (~1.6MB). But the excessive memory usage problem here is due to the accumulation of Sessions. If you reuse sessions, you don't need to keep reloading the same large JSON files. Maybe this is something that could be better clarified in the documentation.
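
For example, in the reproduction script above the endpoint rulesets are reloaded on every iteration; hoisting the session and client out of the loop (a sketch of the reuse suggestion, not new boto3 API) avoids that:

import boto3

# Create the session and client once and reuse them; the large endpoint
# ruleset and other data files are then loaded only a single time.
session = boto3.Session()
s3 = session.client('s3')

for _ in range(100):
    s3.list_buckets()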

tim-finnigan added the documentation (This is a problem with documentation.) and p2 (This is a standard priority issue) labels on Mar 14, 2023
tim-finnigan removed their assignment on Mar 14, 2023

hemslo commented Mar 15, 2023

Using lru_cache on an instance method causes a memory leak because self is captured in the cache.
The simple fix is to change it to an instance-level cache in __init__:

self.resolve_endpoint = lru_cache(maxsize=CACHE_SIZE)(self.resolve_endpoint)

If a global cache is still wanted, refactor it into a static method or, even better, a function.
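
A fuller sketch of that fix (illustrative class, not the real botocore EndpointProvider): wrapping the bound method in __init__ ties the cache's lifetime to the instance, so the cached entries are released when the instance is garbage-collected.

from functools import lru_cache

CACHE_SIZE = 100  # illustrative value


class EndpointProviderLike:
    """Hypothetical stand-in for a class whose method is expensive to call."""

    def __init__(self, ruleset):
        self.ruleset = ruleset
        # Per-instance cache: replaces a class-level @lru_cache that would
        # otherwise keep a reference to `self` in a module-lifetime cache.
        self.resolve_endpoint = lru_cache(maxsize=CACHE_SIZE)(self.resolve_endpoint)

    def resolve_endpoint(self, region):
        # expensive rule evaluation would go here
        return f"https://service.{region}.example.com"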

References:

  1. Python functools lru_cache with instance methods: release object
  2. Don't wrap instance methods with 'functools.lru_cache' decorator in Python

ron8mcr (Author) commented Mar 16, 2023

@tim-finnigan @alexanderGerbik @hemslo
I can confirm that disabling lru_cache in botocore leads to a significant decrease in memory usage:

[plots of memory usage: botocore==1.29.92, with disabled cache, and botocore==1.27.71]

Meanwhile, botocore already has a decorator for instance-level caching, see https://github.com/boto/botocore/blob/1.29.92/botocore/utils.py#L1395

I've tried using it and everything worked perfectly for my case.
Prepared PR for botocore - boto/botocore#2889
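
For illustration, an instance-level cache decorator along these lines (a simplified sketch; botocore's actual instance_cache differs in detail) stores results on the instance itself, so they are freed together with it:

import functools


def instance_memo(method):
    """Hypothetical simplified instance-level cache (not botocore's implementation)."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # The cache dict lives on the instance, not at module level, so it is
        # garbage-collected along with the instance.
        cache = self.__dict__.setdefault('_instance_memo', {})
        key = (method.__name__, args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = method(self, *args, **kwargs)
        return cache[key]

    return wrapper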

@alexanderGerbik @hemslo much thanks for your suggestions

tim-finnigan (Contributor) commented:

Hi ron8mcr, thanks for reporting this and doing the deep dive. We have been researching this in parallel to understand what is happening because we have not seen the same behavior in our own integration testing.

The root cause of what you are seeing is the large number of botocore sessions that get created in your code snippet. Creating new session objects is known to be expensive. Each session object not only consumes a large amount of memory but is also slow to create, because several files get loaded from the file system to create the object. (Coincidentally, those files do not get cached across Session objects, and instance_cache is used to ensure this.) The number of Session objects in each script should be minimized; usually one is enough. For example, we recommend creating the botocore session outside the AWS Lambda handler so it is reused across invocations (see https://docs.aws.amazon.com/codeguru/detector-library/python/lambda-client-reuse/ for more context).
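
A sketch of that Lambda recommendation (hypothetical handler, assuming an S3 use case): the client is created once per execution environment and reused across invocations.

import boto3

# Created at import time, outside the handler, so the client and its loaded
# endpoint ruleset are reused for every invocation in this environment.
s3 = boto3.client('s3')


def lambda_handler(event, context):
    response = s3.list_buckets()
    return {'bucket_count': len(response['Buckets'])}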

The reason you are observing high memory usage with the lru_cache is not that each cache entry is large (even though self is large, the cache entry only stores a reference to it), but that the existence of the cache entries prevents effective garbage collection of the many Session objects. Each session will have its own instance of the endpoint ruleset object (self), and these can only be removed from memory once the corresponding cache entry is evicted. The cache filling up explains the plateau in memory consumption shown in this post.

As a path forward, we would recommend that you review your code for opportunities to reduce the number of separate botocore Sessions. This will probably have memory and runtime benefits beyond the problem described in this post. We will also review your PR because it mitigates the problem for those rare situations where a large number of Session objects are necessary. However, this requires further research because we have to thoroughly evaluate whether the downside of instance_cache not having a maximum size could lead to performance degradations in other use cases.


hemslo commented Mar 17, 2023

We have use cases with a large number of Session objects: to support dynamic AWS accounts, they can't share the same session. Because of the lru_cache introduced in botocore 1.27.72, we have to pin boto3==1.24.71 and botocore==1.27.71.

It would be nice if we could cache _load_file across sessions; it consumes 300+ MB at our peak usage.
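
One possible mitigation, sketched under the assumption that per-account credentials (for example, from STS AssumeRole) are available: create the per-account clients from a single shared Session so its data loader caches the JSON files once. Whether this fits a given multi-account setup would need to be verified.

import boto3

# A single shared session: its data loader caches service and endpoint JSON
# files, so clients for different accounts do not reload them.
shared_session = boto3.Session()


def client_for_account(service_name, credentials, region):
    # `credentials` is assumed to be a dict of temporary keys for one account.
    return shared_session.client(
        service_name,
        region_name=region,
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken'],
    )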

ron8mcr (Author) commented Mar 17, 2023

Hi @tim-finnigan,

As a path forward, we would recommend that you review your code for opportunities to reduce the number of separate botocore Sessions

Actually, we found this within different Django projects which didn't even use boto3 in their own codebase. However, it looks like the problem may be solved after a few changes in the other app:

We will also review your PR

Thanks in advance!

jonemo added the closing-soon (This issue will automatically close in 4 days unless further comments are made.) label and removed the documentation and s3 labels on May 9, 2023
jonemo (Contributor) commented May 9, 2023

Addressed in boto/botocore#2934

github-actions bot removed the closing-soon label on May 9, 2023
tim-finnigan added the closing-soon label on May 9, 2023
github-actions bot added the closed-for-staleness label and removed the closing-soon label on May 11, 2023