SG gets killed due to excessive memory usage with continuous doc update #2651

Closed

raghusarangapani opened this issue Jun 14, 2017 · 32 comments

raghusarangapani commented Jun 14, 2017

Sync Gateway version

1.5.0-429

Operating system

CentOS 7

Config file

{
    "interface": ":4984",
    "adminInterface": "0.0.0.0:4985",
    "maxIncomingConnections": 0,
    "maxCouchbaseConnections": 16,
    "maxFileDescriptors": 90000,
    "slowServerCallWarningThreshold": 500,
    "compressResponses": false,
    "log": ["CRUD+", "Cache+", "HTTP+", "Changes+"],
    "verbose": "true",
    "databases": {
        "db": {
            "unsupported": {
                "enable_extended_attributes": true
            },
            "server": "http://<CBS_IP>:8091",
            "revs_limit": 10000000,
            "bucket": "data-bucket",
            "username": "data-bucket",
            "password": "password",
            "users": {"GUEST": {"disabled": false, "admin_channels": ["*"]}}
        }
    }
}

Steps to reproduce

The aim of this testing was to hit the xattr memory limit and test SG's behavior. I manually tried the two scenarios below. It seems like SG gets killed well before hitting the xattr memory limit.

Scenario 1:

  • SG 1.5 without xattrs running on a VM.
  • VM has 2GB RAM.
  • Add one doc to SG.
  • revs_limit is set to 10 million.
  • Constantly update the doc with a script.
  • After 8535 updates, SG gets killed by the OS.
  • When SG gets killed, it is using around 90% CPU and nearly all of the memory, and the OS is out of memory.

/var/log/messages:

Jun 14 21:59:16 localhost kernel: 38610 total pagecache pages
Jun 14 21:59:16 localhost kernel: 35800 pages in swap cache
Jun 14 21:59:16 localhost kernel: Swap cache stats: add 16917332, delete 16881532, find 5161984/6611476
Jun 14 21:59:16 localhost kernel: Free swap  = 0kB
Jun 14 21:59:16 localhost kernel: Total swap = 1572860kB
Jun 14 21:59:16 localhost kernel: 524174 pages RAM
Jun 14 21:59:16 localhost kernel: 0 pages HighMem/MovableOnly
Jun 14 21:59:16 localhost kernel: 53130 pages reserved
Jun 14 21:59:16 localhost kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jun 14 21:59:16 localhost kernel: [ 3288]     0  3288   831901   398350    1519   366000             0 sync_gateway
Jun 14 21:40:02 localhost kernel: Out of memory: Kill process 2777 (sync_gateway) score 874 or sacrifice child
Jun 14 21:40:02 localhost kernel: Killed process 2777 (sync_gateway) total-vm:3285600kB, anon-rss:1554664kB, file-rss:0kB

Scenario 2:

  • SG 1.5 with xattrs enabled running on a VM.
  • VM has 6GB RAM.
  • Add one doc to SG.
  • revs_limit is set to 10 million.
  • Constantly update the doc with a script.
  • After 16568 updates, SG gets killed by the OS.
  • When SG gets killed, it is using around 90% CPU and nearly all of the memory, and the OS is out of memory.

/var/log/messages:

Jun 14 22:53:19 localhost kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 14 22:53:19 localhost kernel: 62698 total pagecache pages
Jun 14 22:53:19 localhost kernel: 58157 pages in swap cache
Jun 14 22:53:19 localhost kernel: Swap cache stats: add 2824375, delete 2766218, find 763838/1027429
Jun 14 22:53:19 localhost kernel: Free swap  = 0kB
Jun 14 22:53:19 localhost kernel: Total swap = 1572860kB
Jun 14 22:53:19 localhost kernel: 1572750 pages RAM
Jun 14 22:53:19 localhost kernel: 0 pages HighMem/MovableOnly
Jun 14 22:53:19 localhost kernel: 86030 pages reserved
Jun 14 22:53:19 localhost kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jun 14 22:53:20 localhost kernel: [ 2368]     0  2368  1796765  1370579    3421   367426             0 sync_gateway
Jun 14 22:53:20 localhost kernel: Out of memory: Kill process 2368 (sync_gateway) score 898 or sacrifice child
Jun 14 22:53:20 localhost kernel: Killed process 2368 (sync_gateway) total-vm:7187060kB, anon-rss:5482280kB, file-rss:36kB

Expected behavior

SG memory usage should remain roughly constant under repeated updates of a single document.

Actual behavior

SG memory usage grows without bound until the OS kills the process.

raghusarangapani commented Jun 14, 2017

Python script used to update the doc:
https://gist.github.com/raghusarangapani/d1f4e2e2523b59c2b5b5bee4e790fc07

Update the following params in the script:

current_rev = "1-ca9ad22802b66f662ff171f226211d5c"
number_updates = 10000000
url = "http://192.168.33.22:4984"
db = "db"
doc_id = "Test"

tleyden commented Jun 19, 2017

I was able to reproduce this and saw the memory continuing to increase.

Here is the heap profile:


==============================================================================
Running go tool pprof -- which can take several seconds: heap format: text
go tool pprof -text -seconds=5 "/opt/couchbase-sync-gateway/bin/sync_gateway" http://127.0.0.1:4985/_debug/pprof/heap
==============================================================================
Fetching profile from http://127.0.0.1:4985/_debug/pprof/heap
Saved profile in /tmp/tmpAlwQLn/pprof.sync_gateway.127.0.0.1:4985.inuse_objects.inuse_space.002.pb.gz
2780.87MB of 2834.69MB total (98.10%)
Dropped 314 nodes (cum <= 14.17MB)
      flat  flat%   sum%        cum   cum%
 1833.56MB 64.68% 64.68%  1833.56MB 64.68%  fmt.(*ss).convertString
  937.72MB 33.08% 97.76%  2772.77MB 97.82%  github.com/couchbase/sync_gateway/db.encodeRevisions
    5.59MB   0.2% 97.96%    14.24MB   0.5%  github.com/couchbase/sync_gateway/db.RevTree.UnmarshalJSON
    3.50MB  0.12% 98.08%    18.13MB  0.64%  encoding/json.Unmarshal
    0.50MB 0.018% 98.10%  1835.06MB 64.74%  github.com/couchbase/sync_gateway/db.ParseRevID
         0     0% 98.10%    18.63MB  0.66%  encoding/json.(*decodeState).object
         0     0% 98.10%    18.63MB  0.66%  encoding/json.(*decodeState).unmarshal
         0     0% 98.10%    18.63MB  0.66%  encoding/json.(*decodeState).value
         0     0% 98.10%  1833.56MB 64.68%  fmt.(*ss).doScanf
         0     0% 98.10%  1833.56MB 64.68%  fmt.(*ss).scanOne
         0     0% 98.10%  1834.56MB 64.72%  fmt.Fscanf
         0     0% 98.10%  1834.56MB 64.72%  fmt.Sscanf
         0     0% 98.10%    34.20MB  1.21%  github.com/couchbase/sync_gateway/base.(*CouchbaseBucketGoCB).WriteUpdate
         0     0% 98.10%    34.20MB  1.21%  github.com/couchbase/sync_gateway/base.CouchbaseBucketGoCB.WriteUpdate
         0     0% 98.10%  2803.81MB 98.91%  github.com/couchbase/sync_gateway/db.(*Database).Put
         0     0% 98.10%  2809.81MB 99.12%  github.com/couchbase/sync_gateway/db.(*Database).updateAndReturnDoc
         0     0% 98.10%    31.39MB  1.11%  github.com/couchbase/sync_gateway/db.(*Database).updateAndReturnDoc.func3
         0     0% 98.10%  2809.81MB 99.12%  github.com/couchbase/sync_gateway/db.(*Database).updateDoc
         0     0% 98.10%    14.24MB   0.5%  github.com/couchbase/sync_gateway/db.(*RevTree).UnmarshalJSON
         0     0% 98.10%    18.13MB  0.64%  github.com/couchbase/sync_gateway/db.(*document).UnmarshalJSON
         0     0% 98.10%    18.13MB  0.64%  github.com/couchbase/sync_gateway/db.unmarshalDocument
         0     0% 98.10%  2803.31MB 98.89%  github.com/couchbase/sync_gateway/rest.(*handler).handlePutDoc
         0     0% 98.10%  2803.31MB 98.89%  github.com/couchbase/sync_gateway/rest.(*handler).invoke
         0     0% 98.10%  2801.67MB 98.83%  github.com/couchbase/sync_gateway/rest.makeHandler.func1
         0     0% 98.10%  2799.29MB 98.75%  github.com/couchbase/sync_gateway/rest.wrapRouter.func1
         0     0% 98.10%  2801.67MB 98.83%  github.com/gorilla/mux.(*Router).ServeHTTP
         0     0% 98.10%  2788.89MB 98.38%  net/http.(*conn).serve
         0     0% 98.10%  2802.17MB 98.85%  net/http.HandlerFunc.ServeHTTP
         0     0% 98.10%     2797MB 98.67%  net/http.serverHandler.ServeHTTP
         0     0% 98.10%  2806.80MB 99.02%  runtime.goexit

heap.pdf

tleyden commented Jun 20, 2017

After swapping out the sscanf call, the same issue is happening, but the heap looks slightly different:

Fetching profile from http://127.0.0.1:4985/_debug/pprof/heap
Saved profile in /tmp/tmp20qpCD/pprof.sync_gateway.127.0.0.1:4985.inuse_objects.inuse_space.002.pb.gz
2494.80MB of 2540.65MB total (98.20%)
Dropped 270 nodes (cum <= 12.70MB)
      flat  flat%   sum%        cum   cum%
 1823.58MB 71.78% 71.78%  1828.58MB 71.97%  encoding/json.(*decodeState).literalStore
  663.38MB 26.11% 97.89%   663.38MB 26.11%  github.com/couchbase/sync_gateway/db.encodeRevisions
    4.34MB  0.17% 98.06%  1832.92MB 72.14%  github.com/couchbase/sync_gateway/db.RevTree.UnmarshalJSON
    3.50MB  0.14% 98.20%  1841.52MB 72.48%  encoding/json.Unmarshal
         0     0% 98.20%  1828.58MB 71.97%  encoding/json.(*decodeState).array
         0     0% 98.20%  1828.58MB 71.97%  encoding/json.(*decodeState).literal
         0     0% 98.20%  1842.02MB 72.50%  encoding/json.(*decodeState).object
         0     0% 98.20%  1842.52MB 72.52%  encoding/json.(*decodeState).unmarshal
         0     0% 98.20%  1842.52MB 72.52%  encoding/json.(*decodeState).value
         0     0% 98.20%  1852.04MB 72.90%  github.com/couchbase/sync_gateway/base.(*CouchbaseBucketGoCB).WriteUpdate
         0     0% 98.20%  1852.04MB 72.90%  github.com/couchbase/sync_gateway/base.CouchbaseBucketGoCB.WriteUpdate
         0     0% 98.20%   693.73MB 27.31%  github.com/couchbase/sync_gateway/db.(*Database).Put
         0     0% 98.20%  2520.32MB 99.20%  github.com/couchbase/sync_gateway/db.(*Database).updateAndReturnDoc
         0     0% 98.20%  1849.65MB 72.80%  github.com/couchbase/sync_gateway/db.(*Database).updateAndReturnDoc.func3
         0     0% 98.20%  2520.32MB 99.20%  github.com/couchbase/sync_gateway/db.(*Database).updateDoc
         0     0% 98.20%  1832.92MB 72.14%  github.com/couchbase/sync_gateway/db.(*RevTree).UnmarshalJSON
         0     0% 98.20%  1841.52MB 72.48%  github.com/couchbase/sync_gateway/db.(*document).UnmarshalJSON
         0     0% 98.20%  1840.02MB 72.42%  github.com/couchbase/sync_gateway/db.unmarshalDocument
         0     0% 98.20%   694.73MB 27.34%  github.com/couchbase/sync_gateway/rest.(*handler).handlePutDoc
         0     0% 98.20%   694.73MB 27.34%  github.com/couchbase/sync_gateway/rest.(*handler).invoke
         0     0% 98.20%   694.23MB 27.32%  github.com/couchbase/sync_gateway/rest.makeHandler.func1
         0     0% 98.20%   691.64MB 27.22%  github.com/couchbase/sync_gateway/rest.wrapRouter.func1
         0     0% 98.20%   693.23MB 27.29%  github.com/gorilla/mux.(*Router).ServeHTTP
         0     0% 98.20%   680.50MB 26.78%  net/http.(*conn).serve
         0     0% 98.20%   694.23MB 27.32%  net/http.HandlerFunc.ServeHTTP
         0     0% 98.20%   690.61MB 27.18%  net/http.serverHandler.ServeHTTP
         0     0% 98.20%   693.14MB 27.28%  runtime.goexit

heap.pdf

I think I have an idea why this particular test is causing runaway memory. I added these verbose logs:

2017-06-20T01:10:28.061Z Cache: Adding doc to revision cache with 5644 revisions
2017-06-20T01:10:28.062Z Cache: Going to purge oldest revcache value
2017-06-20T01:10:28.062Z Cache: Purging oldest revcache value.  Cache size 5001 -> 5000.  Capacity: 5000

Basically, even when the LRU rev cache starts getting pruned after hitting capacity, each item being added to the cache carries an ever-growing list of revisions, so the overall size of the in-memory cache continues to grow essentially without bound (it is bounded only by revs_limit, which is set to 10000000 here).

tleyden commented Jun 20, 2017

This LRU cache implementation has the ability to put an approximate cap on the total size in bytes:

https://github.com/karlseguin/ccache/blob/master/readme.md

See the "Size" section towards the end of the doc. Not necessarily suggesting to switch, but it might be worth adding a similar feature.

@raghusarangapani

This feature request is related: #2642. Maybe we can borrow the server's implementation?

adamcfraser commented Jun 20, 2017

I agree it might be useful to have an absolute cap on the size of the revision cache (to go along with our existing mechanism for managing entry size, revs_limit).

In real-world scenarios, where revs_limit is set to a reasonable value, do you think unresolved conflicts are resulting in a very high number of revisions being retained per entry in the revision cache?

If that's the case, we may be able to better manage this by either:

  • revisiting the revision pruning algorithm
  • storing revision histories in the rev cache that are pruned more aggressively than in the persisted documents, and triggering bucket retrieval when the ancestors aren't found in the rev cache. This would probably satisfy the more common scenario (updates where the ancestor is a leaf revision) at less cost, while still providing a way to handle uncommon branching at additional cost.

tleyden commented Jun 20, 2017

Revisiting the revision pruning algorithm

Rev pruning before #795 (related to #501)

  • Each leaf retained revs_limit ancestors (tombstoned or not)

Rev pruning after

  • Determine the min depth based on the non-tombstoned leaf with the lowest rev generation
  • Prune based on that number (min depth - revs_limit)
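
As a rough illustration of that rule (my reading of the bullets above; the authoritative definition is the Sync Gateway rev tree code and its unit tests):

# Hypothetical sketch of the pruning threshold calculation, assuming rev IDs
# of the form "generation-hash" (e.g. "3-abc").
def generation(rev_id):
    return int(rev_id.split("-", 1)[0])

def prune_threshold(leaf_rev_ids, is_tombstoned, revs_limit):
    # Min depth is driven by the non-tombstoned leaf with the lowest generation.
    live_leaves = [r for r in leaf_rev_ids if not is_tombstoned(r)]
    if not live_leaves:
        return 0
    min_gen = min(generation(r) for r in live_leaves)
    # Revisions with a generation at or below this value get pruned.
    return min_gen - revs_limit

Under this reading, a non-tombstoned conflict leaf that stays at a low generation keeps the threshold from ever advancing, which matches the "rev pruning is thwarted by unresolved conflict" observation later in this thread.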

Implications

  • New approach simpler
  • Less storage in some cases, more storage in others
  • Lots more storage when branches have vastly different lengths
    • If conflicts are not resolved, this is likely to happen and will add storage overhead
  • Less storage overhead

tleyden commented Jun 20, 2017

Todo: update repro script

Repro the conflict rev cache blow-up in a functional test (based on the rev tree scenario discussed for the new algorithm):

  • Create rev 1
  • Create rev 2
  • Create rev 2b
  • Create n revisions under rev 2

Even with revs_limit set to 20, memory should grow without bound.
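
A sketch of what that functional test could do via the REST API (host, payloads, and the loop count are placeholders; the curl transcript in a later comment shows the equivalent manual steps):

import requests

admin = "http://localhost:4985/db"  # SG admin REST API; placeholder host

# rev 1
r = requests.post(admin + "/", json={"key": "val"})
doc_id, rev1 = r.json()["id"], r.json()["rev"]

# rev 2 on the winning branch
r = requests.put("{}/{}".format(admin, doc_id),
                 params={"rev": rev1}, json={"key": "val"})
winning_rev = r.json()["rev"]

# rev 2b: inject a conflicting sibling of rev 2 using new_edits=false
requests.put("{}/{}".format(admin, doc_id),
             params={"new_edits": "false"},
             json={"key": "val 2b",
                   "_revisions": {"start": 2,
                                  "ids": ["123456789", rev1.split("-", 1)[1]]}})

# n revisions under rev 2; with the conflict left unresolved, memory use
# should keep growing even though revs_limit is only 20
for i in range(100000):
    r = requests.put("{}/{}".format(admin, doc_id),
                     params={"rev": winning_rev}, json={"update": i})
    winning_rev = r.json()["rev"]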

tleyden commented Jun 20, 2017

  • Benchmark current algorithm (will need to generate large rev trees)

tleyden commented Jun 20, 2017

I've reproduced the memory growth issue with a conflicted rev tree.

SG config with revs_limit=20

{
    "interface": ":4984",
    "adminInterface": "0.0.0.0:4985",
    "maxIncomingConnections": 0,
    "maxCouchbaseConnections": 16,
    "maxFileDescriptors": 90000,
    "slowServerCallWarningThreshold": 500,
    "compressResponses": false,
    "log": ["CRUD+", "Cache+", "HTTP+", "Changes+"],
    "verbose": "true",
    "databases": {
        "db": {
            "unsupported": {
                "enable_extended_attributes": false
            },
            "server": "http://ec2-52-88-232-23.compute-1.amazonaws.com:8091",
            "revs_limit": 20,
            "bucket": "data-bucket",
            "username": "data-bucket",
            "password": "password",
            "users": {"GUEST": {"disabled": false, "admin_channels": ["*"]}}
        }
    }
}

Create rev 1

$ curl -X POST \
>   http://ec2-34-33-196-180.compute-1.amazonaws.com:4985/db/ \
>   -H 'authorization: Basic Zm9vOmZvbw==' \
>   -H 'cache-control: no-cache' \
>   -H 'content-type: application/json' \
>   -H 'postman-token: a6eddde0-394f-7098-58b2-74814d3858ae' \
>   -d '{
>   "key": "val"
> }'

Response

{"id":"84f839a3521de0a94a445e5ca1cccdd4","ok":true,"rev":"1-ecbd22495c41163e3657699fd1f5ca28"}

Create rev 2

$ curl -X PUT \
>   http://ec2-34-33-196-180.compute-1.amazonaws.com:4985/db/84f839a3521de0a94a445e5ca1cccdd4 \
>   -H 'authorization: Basic Zm9vOmZvbw==' \
>   -H 'cache-control: no-cache' \
>   -H 'content-type: application/json' \
>   -H 'postman-token: cd6513b5-2904-d463-27b5-b8b3fef46cd8' \
>   -d '{
>   "key": "val",
>   "_rev": "1-ecbd22495c41163e3657699fd1f5ca28"
> }'

Response:

{"id":"84f839a3521de0a94a445e5ca1cccdd4","ok":true,"rev":"2-9751b0b11d5cbeae6d9d756603b4cdc2"}

Create rev 2b

$ curl -X PUT \
>   'http://ec2-34-33-196-180.compute-1.amazonaws.com:4985/db/84f839a3521de0a94a445e5ca1cccdd4?new_edits=false' \
>   -H 'authorization: Basic Zm9vOmZvbw==' \
>   -H 'cache-control: no-cache' \
>   -H 'content-type: application/json' \
>   -H 'postman-token: 89552472-c8b7-d1be-25c1-1851a4c46e0e' \
>   -d '{
>   "key": "val 2b",
>     "_revisions": {"start":2,"ids":["123456789","ecbd22495c41163e3657699fd1f5ca28"]}
> }'

Response

{"id":"84f839a3521de0a94a445e5ca1cccdd4","ok":true,"rev":"2-123456789"}

Find winning rev

$ curl -X GET \
>   http://ec2-34-33-196-180.compute-1.amazonaws.com:4985/db/84f839a3521de0a94a445e5ca1cccdd4 \
>   -H 'authorization: Basic Zm9vOmZvbw==' \
>   -H 'cache-control: no-cache' \
>   -H 'content-type: application/json' \
>   -H 'postman-token: 972f3ab7-8344-a1f1-0cc3-2be4a25b9be5' 

Response

{"_id":"84f839a3521de0a94a445e5ca1cccdd4","_rev":"2-9751b0b11d5cbeae6d9d756603b4cdc2","key":"val"}

Run repro.py against winning rev

Update script variables:

doc_id = "84f839a3521de0a94a445e5ca1cccdd4"
current_rev = "2-9751b0b11d5cbeae6d9d756603b4cdc2"

Observed

  1. Memory grows without bound
  2. Rev pruning is thwarted by unresolved conflict, as can be seen in the raw doc

tleyden commented Jun 21, 2017

Benchmark results

BenchmarkRevTreePruning on commit e1ddf43 with values

non-winning unresolved revs | non-winning tombstoned revs | unconflictedBranchNumRevs | winningBranchNumRevs | maxdepth | pruning ns/op
60   | 25   | 50   | 100   | 50 | 244604
600  | 250  | 500  | 1000  | 50 | 2592843
6000 | 2500 | 5000 | 10000 | 50 | 35363358

tleyden commented Jun 22, 2017

Here are the current rev tree pruning unit tests with the before/after revtrees

OneWinningOneNonwinningBranch

Before pruning

[image: output24]

After Prune maxdepth=2

[image: output24]

tleyden commented Jun 22, 2017

OneWinningOneOldTombstonedBranch

Before pruning

[image: output25]

After Prune maxdepth=2

[image: output26]

tleyden commented Jun 22, 2017

OneWinningOneOldAndOneRecentTombstonedBranch

Before pruning

[image: output27]

After Prune maxdepth=2

[image: output28]

tleyden commented Jun 22, 2017

Existing TestPruneRevisions test discrepancy

There is a functionality discrepancy between the old revs pruning algorithm and the new one under development, and I want to make sure the new behavior is the desired behavior here.

Before pruning

[image: output29]

After pruning with old pruning algorithm with maxdepth=1

[image: output30]

I think it's safe to ignore that empty node; it's most likely just a bug in the graphviz dot export.

After pruning with new algorithm with maxdepth=1

[image: output31]

Ditto; imagine this as two rev trees, each with a single root node.

adamcfraser commented Jun 22, 2017

@tleyden I believe that's the desired behaviour. To confirm - the pruning with the new algorithm in the above scenario matches the pruning with the old-old algorithm, correct?

Actually, I'll go further - I think this is definitely the desired behaviour, to address the core problem (e.g. the scenario where there are 1000 revisions between 3-drei and 4-vier)

tleyden commented Jun 22, 2017

I haven't tested it yet, but I'm assuming it should match, since the new rev pruning algorithm is mostly a copy-paste-tweak of the old-old algorithm and should only diverge when it comes to dealing with tombstoned branches older than the calculated tombstone threshold.

tleyden added a commit that referenced this issue Jun 23, 2017
#2669)

* Fixes #2651 SG gets killed due to excessive memory usage with continuous doc update

* Address PR feedback re walking tree less times

* Address PR feedback about test-only methods and confusing method naming
tleyden added a commit that referenced this issue Jun 27, 2017
#2669)

* Fixes #2651 SG gets killed due to excessive memory usage with continuous doc update

* Address PR feedback re walking tree less times

* Address PR feedback about test-only methods and confusing method naming

# Conflicts:
#	db/revtree_test.go
tleyden added a commit that referenced this issue Jun 27, 2017
#2669)

* Fixes #2651 SG gets killed due to excessive memory usage with continuous doc update

* Address PR feedback re walking tree less times

* Address PR feedback about test-only methods and confusing method naming

Call parseRevID() instead of ParseRevID()

Fix test compile issue
tleyden pushed a commit that referenced this issue Jun 28, 2017
@raghusarangapani

Re-ran the same script with SG 1.5.0-455. With no xattrs and 2 GB RAM, I was able to make 8523 updates before SG got killed. With xattrs and 6 GB RAM, I was able to make 16709 updates before SG got killed. The behavior is the same. I talked to Traun about it, and he said that revision pruning will kick in once we hit the revs_limit, so that memory usage stays stable. In this case, pruning will not kick in because revs_limit is set to 10 million.

JFlath commented Jul 4, 2017

Some of the rev tree pruning above doesn't seem like the correct behaviour to me. Before I launch off about that, is there a better place to be having this discussion?

tleyden commented Jul 6, 2017

@JFlath Yes, I think this is a good place for the discussion, since it will be in context.

JFlath commented Jul 11, 2017

So, I guess the best place to start is: do we have a document that defines exactly how this is supposed to behave? I've seen a few cases now where different platforms have handled the rev tree differently.

tleyden commented Jul 11, 2017

The best docs I'm aware of are this blog post series:

https://blog.couchbase.com/database-sizes-and-conflict-resolution/

Which is now out-dated after latest changes. But I absolutely agree we need to get this into our official docs, and I've already been pushing for that.

rajagp commented Jul 11, 2017

Agree. I've discussed this with @jamiltz. He was going to pick up the blog post content and publish it as part of the docs portal; of course, he has to go through the exercise of reformatting/restructuring it. We will have the content in both the blog and the docs, which is fine. The docs will be maintained; the blog could go out of date (ideally, I will churn out a new blog post specific to the changes as a follow-on). Timeline for the effort to move this to the docs, @jamiltz?

JFlath commented Jul 11, 2017

Sorry, let me rephrase: is there any document that was written on the design of this? The blog post documents what happens under the current SG (AFAIK), but that's very different from a design spec.

However, @tleyden your latest comment worries me:

Which is now out-dated after latest changes.

Has the expected behaviour changed? Or does the blog post not document the expected behaviour?

tleyden commented Jul 12, 2017

Sorry, let me rephrase: is there any document that was written on the design of this?

@JFlath Not that I'm aware of. The most comprehensive spec was (and probably still is) the unit test suite that exercises the rev pruning behavior.

Has the expected behaviour changed?

Yes, there was a pretty dramatic change recently to the rev pruning behavior in #2669, with a follow-up fix in #2697.

Or does the blog post not document the expected behaviour?

Correct. The blog post hasn't been updated yet to document the expected behavior. Again, the unit test suite is currently the best "documentation" of expected behavior that I'm aware of.

I agree though that a design document is in order here. It would have helped out during the changes for #2669.

JFlath commented Jul 18, 2017

So, as mentioned, I think there's a serious issue with the pruning behaviour here. Let's take a given tree with revs_limit: 10. We've pruned to the limit, and then later inserted a conflict a little bit after where we'd already pruned to:

[image: conflict]

Now let's add a few more revisions to the winning branch; things still look fine:

[image: prune]

Now, I realise I've got a conflict and delete it, creating a tombstone:

[image: deleted]

Up until now, we're fine. That tombstone exists to be replicated out and tombstone the 12-conflict branch elsewhere. However, as soon as I touch the winning branch, the tombstone gets pruned:

[image: deleted 1]

This could be milliseconds after the tombstoning. A lot of conflict resolution will update both branches (with a merge, for example). It seems like we've accidentally swapped tombstoning behaviour for purging behaviour, which means replication won't work. This is especially bad if you do conflict resolution server-side using the REST API.

Reading back, it feels like we've assumed that a large delta in rev generation == a large delta in time, which isn't guaranteed to be the case.

tleyden commented Jul 18, 2017

@djpongh / @adamcfraser I'm re-opening this based on @JFlath's last comment: #2651 (comment) -- since I think it might merit more discussion.

tleyden reopened this Jul 18, 2017
@adamcfraser

The previous algorithm (the one we've been using since 2015) already had this behaviour when a conflict was tombstoned - it calculated maxDepth based on the oldest non-tombstoned revision, and purged everything older than maxDepth-revsLimit (which could include a recently tombstoned revision).

JFlath commented Jul 18, 2017

So, after checking back, this seems to be an issue with the old behaviour too. Here are the same steps on the previous behaviour:

[image: conflict_ga dot]

[image: prune_ga dot]

[image: deleted_ga dot]

[image: deleted 1_ga dot]

djpongh added the review label Jul 19, 2017
tleyden commented Jul 20, 2017

Re-closing, since it doesn't seem like there are any action items at the moment.

tleyden closed this as completed Jul 20, 2017
tleyden removed the review label Jul 20, 2017
tleyden pushed a commit that referenced this issue Jul 26, 2017
#2669)

Cherry picked from 1.3.1.2 commit a1f7bd5

* Fixes #2651 SG gets killed due to excessive memory usage with continuous doc update

* Address PR feedback re walking tree less times

* Address PR feedback about test-only methods and confusing method naming

Call parseRevID() instead of ParseRevID()

Fix test compile issue