Uncontrolled memory growth in aws_kinesis_streams #18397
Comments
Seeing the same issue in 0.27.0. We have Vector agents on EC2 instances that idle at a few MB of memory usage, but when a large amount of data comes in (~380 MB of audit logs in our test case), Vector immediately starts consuming all system memory and eventually the instance fully freezes. I've also tried a number of methods for throttling the throughput (request rate limit, disk buffer, adaptive concurrency with aggressive falloff), but either Vector becomes severely throttled or the memory growth is merely slowed and still climbs to the point of exhausting all system memory. I've confirmed that this is an issue with the Kinesis sink specifically, as we are able to sink to an S3 bucket or a local file at full throughput with a minimal increase in memory usage; it's only when sinking to a Kinesis stream that this happens. This also isn't a matter of Kinesis rate limits: we can trigger it when writing to a Kinesis Firehose with an increased rate limit, and I've confirmed that the rate limit isn't hit while this happens.
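For context, the throttling attempts described above correspond to sink settings roughly like the following partial sketch; the sink name and values are illustrative placeholders, not the configuration actually used:

```yaml
sinks:
  kinesis:
    type: aws_kinesis_streams
    # ... stream_name, region, encoding, etc. ...
    request:
      rate_limit_num: 50          # cap on requests per rate_limit_duration_secs (illustrative)
      adaptive_concurrency:
        decrease_ratio: 0.5       # back off harder than the 0.9 default ("aggressive falloff")
    buffer:
      type: disk                  # spill to disk instead of buffering in memory
      max_size: 1073741824        # 1 GiB
```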
I think @jszwedko is on the right track regarding concurrency (in the comment from the linked issue above). This change seems to work around the issue to some extent.
I tested a build with this change, and Vector was able to successfully churn through a large k8s log file with reasonable memory usage (~1Gi). I can even set `request_builder_concurrency_limit` to 5 without memory exploding (it uses a bit more, but still < 1.5Gi). That being said, I am not sure what the implications of this are.
One other peculiar item we've noticed: we use the default setting of `request.concurrency`.
I think the documentation might actually be wrong there and the default is, in fact, `adaptive`.
Yes, you are correct. So the default seems to be adaptive. That being said, unfortunately even setting concurrency to 1 didn't solve the OOM issue. Memory growth is definitely slower, but it steadily rises until it reaches the limit. So far the only effective workaround I've found is the change above, though I am unclear on its implications.
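For clarity, pinning the concurrency to a fixed value rather than relying on the adaptive default looks like this (the sink name is a placeholder):

```yaml
sinks:
  kinesis:
    type: aws_kinesis_streams
    # ...
    request:
      concurrency: 1    # fixed concurrency; "adaptive" is the effective default discussed above
```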
I tested the same and can confirm memory grows linearly with events processed but is never released. We've resorted to bypassing Kinesis for the time being and sinking directly to S3, which fortunately works well enough for our use case. Max memory usage with the S3 sink is ~2.5 MB.
I will note that there is a knob that has allowed us to work around this issue to an extent (by dialing it down).
Problem
We ship JSON-encoded Kubernetes container logs from EKS worker nodes to a Kinesis stream. We have noticed that, in the presence of large container log files, the `aws_kinesis_streams` sink increases memory usage until it gets OOMKilled.

Here is how to reproduce this:

First, launch a pod that generates a couple of hundred megabytes of console output. You could do this by using an Ubuntu container launched with this command:
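A hypothetical equivalent (pod name, image, and message text are placeholders; the `head -n 10000000` bound is the one referenced later in this report):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-spammer              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: spammer
      image: ubuntu:22.04
      command: ["bash", "-c"]
      # Emit ~10 million short JSON lines to stdout, then idle so the pod keeps running.
      args:
        - |
          yes '{"msg":"synthetic log line for testing"}' | head -n 10000000
          sleep infinity
```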
The log file size can be verified by logging onto a worker node:
In this case, the file in question is ~383M. The relevant part of the vector configuration is given below. Running this with the `--allocation-tracing` option, we see the following from `vector top`:

In this example, we've set an `8Gi` limit for the pod, which it exhausts in a matter of seconds. We've also experimented with setting `request.rate_limit_num`. In this case, the memory growth is a bit slower, but it still OOMs in a matter of minutes.

Interestingly, if you dial down the amount of pod output, e.g. by changing `head -n 10000000` to `head -n 100000` in the above configuration, we can see the memory rise temporarily to ~2GiB and then drop back down to ~23MiB after about a minute.

Is there anything obviously wrong we are overlooking here?
Configuration
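A minimal sketch of the kind of sink configuration being discussed, assuming Vector's built-in `kubernetes_logs` source; the stream name, region, partition key, and limits are placeholders rather than the actual values:

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs

sinks:
  kinesis:
    type: aws_kinesis_streams
    inputs: ["kubernetes_logs"]
    stream_name: "example-audit-stream"          # placeholder
    region: "us-east-1"                          # placeholder
    partition_key_field: "kubernetes.pod_name"   # placeholder
    encoding:
      codec: json
    request:
      concurrency: adaptive                      # effective default, per the discussion above
      rate_limit_num: 100                        # illustrative; slows but does not stop the growth
```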
Version
vector 0.32.0 (x86_64-unknown-linux-gnu 1b403e1 2023-08-15 14:56:36.089460954)
Debug Output
No response
Example Data
No response
Additional Context
Vector is running as a daemonset in EKS, using an IAM role for the service account.
References
No response