Overview of the Issue
We had a tablet start OOMing repeatedly on Friday, and it took a while to track down. We isolated it to a materialize stream (1 source tablet -> 16 target tablets) after running a large transaction that updated 63M rows by itself. The tablet spiked to 60 GB of RAM; it normally runs at a steady state of around 100 MB.
We were able to temporarily move the tablet to a larger server to get the transaction processed, then move it back to its original place in our cluster. What made this particularly hard to diagnose is that the OOM didn't correlate with any specific logs.
Obviously huge transactions aren't great and are a source of issues. I've just been wondering if there's a way to handle a scenario like this more gracefully, since it kept flapping the db primary. This was clearly amplified by needing to process the same tx 16 times. There may not be a way to reduce the memory overhead for a single stream, but maybe if we could detect a certain threshold, we could then somehow process it serially? That sounds like a very difficult coordination problem. Maybe the throttler could be leveraged somehow?
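To make the "detect a threshold, then process serially" idea a bit more concrete, here is a rough sketch in Python. It is not Vitess code and does not reflect how the streams are actually scheduled. `LargeTxGate`, `row_threshold`, and `simulate` are hypothetical names, and it assumes all 16 streams are served from a single process where a shared lock is possible.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LargeTxGate:
    """Hypothetical gate: transactions at or above row_threshold run one at a time."""

    def __init__(self, row_threshold):
        self.row_threshold = row_threshold
        self._large_lock = threading.Lock()   # serializes large transactions
        self._stats_lock = threading.Lock()   # protects the counters below
        self._active_large = 0
        self.peak_large = 0                   # max large transactions in flight at once

    def apply(self, row_count, apply_fn):
        # Small transactions skip the gate and run fully concurrently.
        if row_count < self.row_threshold:
            return apply_fn()
        # Large transactions wait their turn, so only one is buffered at a time.
        with self._large_lock:
            with self._stats_lock:
                self._active_large += 1
                self.peak_large = max(self.peak_large, self._active_large)
            try:
                return apply_fn()
            finally:
                with self._stats_lock:
                    self._active_large -= 1


def simulate(streams=16, row_count=63_000_000, row_threshold=1_000_000):
    """Send the same transaction through `streams` workers; return (peak, completed)."""
    gate = LargeTxGate(row_threshold)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [
            pool.submit(gate.apply, row_count, lambda: row_count)
            for _ in range(streams)
        ]
        results = [f.result() for f in futures]
    return gate.peak_large, len(results)
```

With the numbers from this issue (16 streams, one 63M-row transaction), the gate would let only one copy of the large transaction be in flight at a time, at the cost of the streams finishing one after another. The 1M-row threshold is an arbitrary placeholder.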
Feel free to close this issue if there isn't anything actionable.
cc @mattlord
Reproduction Steps
Binary Version
Operating System and Environment details
Log Fragments
No response