SG gets killed due to excessive memory usage with continuous doc update #2651
Python script used to update the doc. Update the following params in the script: current_rev = "1-ca9ad22802b66f662ff171f226211d5c"
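For reference, here is a minimal sketch of what such a continuous-update loop looks like against the Sync Gateway REST API, written in Go rather than the original Python; the host/port, database name, and doc ID are placeholder assumptions, not values from the original script.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder values -- substitute the real SG host/port, database, doc ID,
	// and starting revision from the original Python script.
	docURL := "http://localhost:4985/db/doc1"
	currentRev := "1-ca9ad22802b66f662ff171f226211d5c"

	for i := 0; ; i++ {
		body, _ := json.Marshal(map[string]interface{}{"counter": i})

		// Each PUT against the current rev creates one more revision of the same doc.
		req, err := http.NewRequest("PUT", fmt.Sprintf("%s?rev=%s", docURL, currentRev), bytes.NewReader(body))
		if err != nil {
			panic(err)
		}
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		var result struct {
			Rev string `json:"rev"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
			panic(err)
		}
		resp.Body.Close()

		// Feed the newly assigned rev into the next update.
		currentRev = result.Rev
	}
}
```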
I was able to reproduce this and saw the memory continuing to increase. Here is the heap profile:
After swapping out the sscanf call, the same issue is happening, but the heap looks slightly different:
I think I have an idea why this particular test is causing runaway memory. I added these verbose logs:
and basically, even when the LRU rev cache starts getting pruned due to hitting capacity, each item being added to the rev cache carries an ever-growing list of revisions, so the overall size of the in-memory cache continues to grow without bound (nearly without bound, since revs_limit is set to 10000000).
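To illustrate the mechanism (a simplified model, not Sync Gateway's actual rev cache types or default sizes): each cached entry for a newer revision drags along the doc's full revision history, so even an LRU capped at a fixed number of entries keeps growing in bytes as the history gets longer.

```go
package main

import "fmt"

// Hypothetical, simplified model -- not Sync Gateway's actual rev cache types.
// Each cached entry carries the doc's full revision history, whose length is
// bounded only by revs_limit.
type revCacheEntry struct {
	revID   string
	history []string // ancestor rev IDs for this revision
}

func main() {
	const cacheEntries = 5000 // assumed LRU capacity, in entries, for illustration
	const bytesPerRevID = 34  // rough size of one "gen-digest" rev ID string

	// Back-of-the-envelope size of the rev histories held by a full cache once
	// the doc has reached generation g: the entry count stays fixed, but every
	// entry's history is roughly g rev IDs long, so memory keeps climbing.
	for _, g := range []int{1000, 10000, 100000} {
		approxBytes := cacheEntries * g * bytesPerRevID
		fmt.Printf("generation %6d: ~%d MB of rev history in the cache\n", g, approxBytes/(1<<20))
	}
}
```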
This LRU cache implementation has the ability to put an approximate cap on the total size in bytes: https://github.com/karlseguin/ccache/blob/master/readme.md (see the "Size" section towards the end of the doc). Not necessarily suggesting we switch, but it might be worth adding a similar feature.
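For context, a rough sketch of how that library's size-based cap works, based on my reading of its readme; the import path, key format, and 100 MB limit here are assumptions for illustration, not a proposal for SG's actual cache:

```go
package main

import (
	"time"

	"github.com/karlseguin/ccache"
)

// Values that implement Size() int64 are counted against MaxSize by their
// reported size rather than counting as 1 item each.
type cachedRev struct {
	body []byte
}

func (c *cachedRev) Size() int64 { return int64(len(c.body)) }

func main() {
	// Cap the cache at roughly 100 MB worth of Size() units.
	cache := ccache.New(ccache.Configure().MaxSize(100 * 1024 * 1024))

	cache.Set("doc1@3-abc", &cachedRev{body: make([]byte, 64*1024)}, time.Hour)

	if item := cache.Get("doc1@3-abc"); item != nil {
		rev := item.Value().(*cachedRev)
		_ = rev
	}
}
```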
This feature request is related: #2642
I agree it might be useful to have an absolute cap on the size of the revision cache (to go along with our existing mechanism for managing entry size, revs_limit). In real-world scenarios, where revs_limit is set to a reasonable value, do you think unresolved conflicts are resulting in a very high number of revisions being retained per entry in the revision cache? If that's the case, we may be able to better manage this by either:
Revisiting the revision pruning algorithm
Rev pruning before #795 (related to #501)
Rev pruning after
Implications
Todo: update the repro script, and repro the conflict rev cache blow-up in a functional test (based on the rev tree scenario / new algo discussed).
Even with revs_limit set to 20, memory should grow without bound.
I've reproduced the memory growth issue with a conflicted rev tree.
SG config with revs_limit=20:
Create rev 1
Response:
Create rev 2
Response:
Create rev 2b
Response:
Find winning rev
Response:
Run repro.py against winning rev
Update script variables:
Observed:
Benchmark results
Existing TestPruneRevisions test discrepancy
There is a functionality discrepancy between the old revs pruning algorithm and the new one under development, and I want to make sure the new behavior is the desired behavior here.
Before pruning
After pruning with old pruning algorithm with maxdepth=1
I think it's safe to ignore that empty node; it's most likely just a bug in the graphviz dot export.
After pruning with new algorithm with maxdepth=1
Ditto, imagine this as being two rev trees with a single root node each.
@tleyden I believe that's the desired behaviour. To confirm - the pruning with the new algorithm in the above scenario matches the pruning with the old-old algorithm, correct? Actually, I'll go further - I think this is definitely the desired behaviour, to address the core problem (e.g. the scenario where there are 1000 revisions between 3-drei and 4-vier).
I haven't tested it yet, but I'm assuming it should match, since the new rev pruning algorithm is mostly a copy-paste-tweak of the old-old algorithm and should only diverge when it comes to dealing with tombstoned branches older than the calculated tombstone threshold.
Re-ran the same script with SG 1.5.0-455. With no xattrs and 2 GB RAM, I was able to make 8523 updates before SG got killed. With xattrs and 6 GB RAM, I was able to make 16709 updates before SG got killed. The behavior is the same. I talked to Traun about it, and Traun said that revision pruning will kick in once we hit the revs_limit, so that memory usage stays stable. In this case, the pruning will not kick in as the revs_limit is set to 10 million.
Some of the rev tree pruning above doesn't seem like the correct behaviour to me - before I launch off about that, is there a better place to be having this discussion?
@JFlath Yes, I think this is a good place for the discussion, since it will be in context.
So, I guess the best place to start is: do we have a document that defines exactly how this is supposed to behave? I've seen a few cases now where different platforms have handled the rev tree differently.
The best docs I'm aware of are this blog post series: https://blog.couchbase.com/database-sizes-and-conflict-resolution/. It's now outdated after the latest changes, but I absolutely agree we need to get this into our official docs, and I've already been pushing for that.
Agree. I've discussed with @jamiltz. He was going to pick up the blog post content and publish it as part of the docs portal; of course, he has to go through the exercise of reformatting / restructuring it. We will have the content in both the blog and the docs, which is fine. The docs will be maintained... the blog could go out of date (ideally, I will churn out a new blog post specific to the changes as a follow-on). Timeline for the effort to move to docs, @jamiltz?
Sorry, let me rephrase - is there any document that was written on the design of this? The blog post documents what happens under the current SG (AFAIK), but that's very different from a design spec. However, @tleyden, your latest comment worries me:
Has the expected behaviour changed? Or does the blog post not document the expected behaviour?
@JFlath Not that I'm aware of. The most comprehensive spec was (and probably still is) the unit test suite that exercises the rev pruning behavior.
Yes, there was a pretty dramatic change recently to the rev pruning behavior in #2669, with a follow-up fix in #2697.
Correct. The blog post hasn't been updated yet to document the expected behavior. Again, the unit test suite is currently the best "documentation" of expected behavior that I'm aware of. I agree though that a design document is in order here. It would have helped out during the changes for #2669.
So, as mentioned, I think there's a serious issue with the pruning behaviour here. Let's take a given tree with revs_limit: 10. We've pruned to the limit, and then later inserted a conflict a little bit after where we'd already pruned to:

Now let's add a few more to the winning branch; still, things look fine:

Now, I realise I've got a conflict and delete it, creating a tombstone:

Up until now, we're fine. That tombstone exists to be replicated out and tombstone the conflicting branch on other devices, but a few more updates to the winning branch and the tombstoned branch is pruned away entirely. This could be milliseconds after the tombstoning. A lot of conflict resolution will update both branches (with a merge, for example). It seems like we've accidentally swapped tombstoning behaviour for purging behaviour, which means replication won't work. This is especially bad if you do conflict resolution server-side using the REST API.

Reading back, it feels like we've assumed that a large delta in rev generation == a large delta in time, which isn't guaranteed to be the case.
@djpongh / @adamcfraser I'm re-opening this based on @JFlath's last comment: #2651 (comment), since I think it might merit more discussion.
The previous algorithm (the one we've been using since 2015) already had this behaviour when a conflict was tombstoned - it calculated maxDepth based on the oldest non-tombstoned revision, and purged everything older than maxDepth-revsLimit (which could include a recently tombstoned revision).
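To make that generation arithmetic concrete, here is a hedged sketch (invented names and a simplified model, not the actual Sync Gateway code) of pruning anchored at the oldest non-tombstoned leaf: any branch that falls more than revs_limit generations behind it becomes prunable, no matter how recently it was tombstoned.

```go
package main

import "fmt"

// Hypothetical, simplified model of generation-based pruning -- not the real
// Sync Gateway implementation.
type leaf struct {
	generation int
	tombstoned bool
}

// pruneCutoff anchors the prune window at the oldest non-tombstoned leaf:
// everything more than revsLimit generations behind it becomes prunable.
func pruneCutoff(leaves []leaf, revsLimit int) int {
	oldestLiveGen := -1
	for _, l := range leaves {
		if !l.tombstoned && (oldestLiveGen == -1 || l.generation < oldestLiveGen) {
			oldestLiveGen = l.generation
		}
	}
	return oldestLiveGen - revsLimit
}

func main() {
	// Winning branch at generation 1000; a conflict at generation 990 was
	// tombstoned moments ago. With revs_limit=10 the cutoff is 990, so the
	// fresh tombstone is already eligible for purging -- only the generation
	// delta matters, wall-clock time never enters into it.
	leaves := []leaf{
		{generation: 1000, tombstoned: false},
		{generation: 990, tombstoned: true},
	}
	cutoff := pruneCutoff(leaves, 10)
	for _, l := range leaves {
		if l.generation <= cutoff {
			fmt.Printf("purging branch at generation %d (tombstoned=%v)\n", l.generation, l.tombstoned)
		}
	}
}
```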
Re-closing, since it doesn't seem like there are any action items at the moment.
#2669) Cherry picked from 1.3.1.2 commit a1f7bd5
* Fixes #2651 SG gets killed due to excessive memory usage with continuous doc update
* Address PR feedback re walking the tree fewer times
* Address PR feedback about test-only methods and confusing method naming
* Call parseRevID() instead of ParseRevID()
* Fix test compile issue
Sync Gateway version
1.5.0-429
Operating system
CentOS 7
Config file
Steps to reproduce
The aim of this testing was to hit the xattr memory limit and test SG's behavior. I manually tried out the two scenarios below. It seems like SG gets killed well before hitting the xattr memory limit.
Scenario 1:
/var/log/messages:
Scenario 2:
/var/log/messages:
Expected behavior
SG memory usage should be constant.
Actual behavior
SG memory usage grows continuously.