Incremental prune #46

Merged
merged 47 commits into juju:master on Dec 13, 2018

Conversation

jameinel (Member)

This introduces a potential change to the pruning algorithm, which has the nice benefit of significantly simplifying it.

The first thing is that we completely get rid of the 'working' set of transactions-that-are-referenced. Instead, we assume that only docs referenced by a txn could reference that transaction (which should be true). This also means that however many transactions we manage to process can be removed, without needing to worry about batching lots of txns to process at the same time.

Then we read txns that are marked completed and are 'old enough', look up the txn-queues of the documents they reference, and read those (caching them in an LRU). It should be safe to cache the queues, because only new txns ever get added to a queue, so a cached queue should never gain a reference to a completed txn. Because we know that most of our txns touch a small set of documents over and over, an LRU cache of 10k entries has a very high hit rate (on prodstack, while processing ~6M txns touching 6M docs, we have a 98% hit rate).
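
As a concrete illustration of that doc cache, here is a minimal sketch. It assumes hashicorp/golang-lru as a stand-in LRU (the branch may use a different implementation), and the docKey and queue field names are illustrative; the 10k size and hit-rate numbers come from the paragraph above.

import lru "github.com/hashicorp/golang-lru"

// docKey identifies a document referenced from a txn's operations.
type docKey struct {
	Collection string
	DocId      interface{}
}

// docCache keeps recently seen txn-queues keyed by document.
type docCache struct {
	cache *lru.Cache // docKey -> []string (the doc's txn-queue tokens)
}

func newDocCache() (*docCache, error) {
	c, err := lru.New(10000) // ~98% hit rate on prodstack per the numbers above
	if err != nil {
		return nil, err
	}
	return &docCache{cache: c}, nil
}

func (d *docCache) queue(key docKey) ([]string, bool) {
	if q, ok := d.cache.Get(key); ok {
		return q.([]string), true
	}
	return nil, false
}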

As we process the txns, we group them in batches of 1000, since that is approximately the right batch size for deleting txns. We then inspect the docs to see whether we need to clean their txn-queues of a txn we are about to delete. For docs that are updated often, the old txn has likely already been removed from the queue, so there is nothing to do.
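
For reference, a hedged sketch of the two core mgo operations this loop boils down to: deleting a batch of completed txns by id, and pulling stale tokens out of a document's txn-queue. The helper names are illustrative, not the branch's actual functions.

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// removeBatch deletes up to a batch (~1000) of completed txns in one query.
func removeBatch(txns *mgo.Collection, batch []bson.ObjectId) error {
	_, err := txns.RemoveAll(bson.M{"_id": bson.M{"$in": batch}})
	return err
}

// cleanDoc pulls the tokens of already-pruned txns out of a doc's txn-queue.
func cleanDoc(coll *mgo.Collection, docId interface{}, staleTokens []string) error {
	return coll.UpdateId(docId, bson.M{"$pullAll": bson.M{"txn-queue": staleTokens}})
}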

This code needs a lot more tests, but I think the shape of it is correct.

This does the cleaning and the txn removal in the same core loop, and
seems to do so with a lot less infrastructure needed.
It seems the Metrics cleanup code just does a coll.RemoveAll() call,
rather than creating a txn Operation to remove the entries, which means
we end up with a bunch of metrics transactions that refer to documents
that aren't there anymore. The same may be true for cloudimagemetadata.
jameinel (Member Author) commented Nov 29, 2018

2018-11-29 18:26:04 INFO  pruning removed 21879688 txns and cleaned 135091 docs in 43m45.459s.
2018-11-29 18:26:04 DEBUG prune stats: txn.PrunerStats{DocCacheHits:0x14b1a54, DocCacheMisses:0x336c4, CollectionQueries:0xb853, DocReads:0xe8e4, DocStillMissing:0x24de0, StashQueries:0x4a77, StashReads:0xc9a1, DocQueuesCleaned:0x20fb3, DocTokensCleaned:0x292df, DocsAlreadyClean:0x153bcc9, TxnsRemoved:0x14ddb88, TxnsNotRemoved:0x60}
2018-11-29 18:26:04 INFO  clean and prune cleaned 135091 docs in 0 collections
  removed 21879688 transactions and 0 stash documents in 43m45.46s
1063.77user 33.33system 43:45.47elapsed 41%CPU (0avgtext+0avgdata 25564maxresident)k
0inputs+616outputs (0major+5801minor)pagefaults 0swaps

compare that with the old code:

2018-11-27 11:50:18 INFO  total clean and prune cleaned 46918 docs in 2059 collections
  removed 28789869 transactions and 61343 stash documents in 1h8m5.895s
670.62user 21.16system 1:15:52elapsed 15%CPU (0avgtext+0avgdata 315280maxresident)k
0inputs+1880outputs (0major+78127minor)pagefaults 0swaps

I only ran prune, not prune+purgemissing, etc. The CPU time is much higher, nearly 2x, but it must be a lot lighter on the database, since the wall clock time is a lot faster. Memory consumption is also far lower (25MB max vs 315MB max).

Switch from uint64 to int; on 64-bit platforms they only differ by 1 bit,
and it means that kr/pretty will not use hex to display them.

Try caching known-missing documents, and see whether that helps
anything.

Try Bulk.Remove() instead of just Remove. This still uses $in rather
than a series of Remove calls; we can try that one next.
DocsAlreadyClean int
TxnsRemoved int
TxnsNotRemoved int
}


More of a question than a comment: is there a preference for int rather than, say, uint in the Juju code base? As a bikeshedding remark, none of those fields seem like they should ever need to hold negative values.

jameinel (Member Author):

len() returns an 'int', so for fields that aggregate that sort of data, it made sense to just put it into an int. At one point I explicitly used a uint64, but it turns out that kr/pretty encodes uint64 as a hex value, which makes it hard to get nice debugging output.

I agree that uint would be appropriate, but given things like len() return int it often makes sense to just use the normal type.
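
To make the kr/pretty point concrete, a small sketch; the hex rendering matches the PrunerStats log lines above, and the struct names here are illustrative.

package main

import (
	"fmt"

	"github.com/kr/pretty"
)

type statsUint struct{ TxnsRemoved uint64 }
type statsInt struct{ TxnsRemoved int }

func main() {
	// uint64 fields come out in hex (like "TxnsRemoved:0x14ddb88" above),
	// while plain int fields print in decimal.
	fmt.Printf("%# v\n", pretty.Formatter(statsUint{TxnsRemoved: 21879688}))
	fmt.Printf("%# v\n", pretty.Formatter(statsInt{TxnsRemoved: 21879688}))
}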

We can potentially have multiple txn remove threads, but I want to see
if it actually makes anything faster, or whether we just end up with
more contention reading/deleting from the same txns collection.
jameinel (Member Author) commented Nov 29, 2018

I should caveat that I cheated a little with the elapsed time, more than I realized. If you run all stages you get something more like:
1476.50user 48.65system 1:05:17elapsed 38%CPU (0avgtext+0avgdata 65272maxresident)k
0inputs+872outputs (0major+16707minor)pagefaults 0swaps

So still faster, but a fair bit of the original elapsed time was in stages other than prune.

Comparing Incremental using Bulk.Remove gives:

2018-11-29 19:23:35 INFO  Running stage "prune": Prune finalised transactions
2018-11-29 19:23:51 DEBUG pruning has removed 0 txns, handling 95094 docs (83991 in cache)
...
2018-11-29 20:20:54 DEBUG prune stats: txn.PrunerStats{DocCacheHits:28486188, DocCacheMisses:209515, DocMissingCacheHit:57195, DocsMissing:122456, CollectionQueries:63056, DocReads:84945, DocStillMissing:124570, StashQueries:24845, StashReads:66207, DocQueuesCleaned:172010, DocTokensCleaned:219042, DocsAlreadyClean:29163972, TxnsRemoved:0, TxnsToRemove:28768773, TxnsNotRemoved:96}
2018-11-29 20:20:54 INFO  clean and prune cleaned 172010 docs in 0 collections
  removed 0 transactions and 0 stash documents in 57m18.431s

vs the old direct txns.Remove calls:

2018-11-29 17:42:18 INFO  Running stage "prune": Prune finalised transactions
2018-11-29 17:42:34 DEBUG pruning has removed 102000 txns, handling 97277 docs (77474 in cache)
...
2018-11-29 18:26:04 INFO  pruning removed 21879688 txns and cleaned 135091 docs in 43m45.459s.
2018-11-29 18:26:04 DEBUG prune stats: txn.PrunerStats{DocCacheHits:0x14b1a54, DocCacheMisses:0x336c4, CollectionQueries:0xb853, DocReads:0xe8e4, DocStillMissing:0x24de0, StashQueries:0x4a77, StashReads:0xc9a1, DocQueuesCleaned:0x20fb3, DocTokensCleaned:0x292df, DocsAlreadyClean:0x153bcc9, TxnsRemoved:0x14ddb88, TxnsNotRemoved:0x60}
2018-11-29 18:26:04 INFO  clean and prune cleaned 135091 docs in 0 collections
  removed 21879688 transactions and 0 stash documents in 43m45.46s
1063.77user 33.33system 43:45.47elapsed 41%CPU (0avgtext+0avgdata 25564maxresident)k

That suggests direct removals are a lot faster, so I'll switch back.

(for comparison, the original mgopurge snap has this for just the prune stage:

2018-11-27 11:50:18 INFO  total clean and prune cleaned 46918 docs in 2059 collections
  removed 28789869 transactions and 61343 stash documents in 1h8m5.895s

So txns.Remove vs bulk.Remove seems to be the biggest win here.
)

They both seem to result in slower overall performance. (475k txns/min
vs 499k txns/min with this version).

Will also want to experiment with removing the Sort() as we don't really
need it.
jameinel (Member Author):

I also missed the "removed 21879688" vs "removed 28789869" difference, which also skews any wall-clock comparison. It may be faster just because it isn't removing as many txns. (There are some txns that look problematic when looking in this direction, which we'll need to deal with.) Metrics, in particular, creates docs with txns but removes them with raw access, which has only not bitten us because the txns involved weren't multi-document.

jameinel (Member Author) commented Nov 30, 2018

I went back to a very basic setup with just txns.Remove, and I removed the Sort("_id") to see if it would be faster when it could yield whatever transactions it wanted. It seems way slower:

2018-11-30 06:21:04 INFO  clean and prune cleaned 253438 docs in 0 collections
  removed 28789773 transactions and 0 stash documents in 1h10m28.314s
1618.78user 60.67system 1:20:46elapsed 34%CPU (0avgtext+0avgdata 64084maxresident)k
1720inputs+976outputs (10major+16157minor)pagefaults 0swaps
408k/min

The best I've seen so far was closer to 499k/min, and the Bulk.Remove with sort was around 450k/min.

Break things up into multiple functions, hopefully making it a lot
easier to read what is going on.

Change the docWithQueue so we cache the transformation from token to txn
ids. Since we're likely passing over the same documents over and over,
it makes sense to not do the conversion millions of times.
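
For context, an mgo/txn token is the txn's ObjectId hex followed by "_" and a nonce, so the conversion is a substring plus bson.ObjectIdHex. A hedged sketch of the memoization this commit describes (the field layout is illustrative, not the branch's actual struct):

import "gopkg.in/mgo.v2/bson"

// docWithQueue caches a doc's raw txn-queue tokens and, lazily, the
// parsed txn ids, so repeat passes over the same doc skip the parsing.
type docWithQueue struct {
	Queue  []string        // tokens of the form "<txn-id-hex>_<nonce>"
	txnIds []bson.ObjectId // derived from Queue on first use, then cached
}

func (d *docWithQueue) ids() []bson.ObjectId {
	if d.txnIds == nil {
		d.txnIds = make([]bson.ObjectId, 0, len(d.Queue))
		for _, tok := range d.Queue {
			// the first 24 hex characters of a token are the txn id
			d.txnIds = append(d.txnIds, bson.ObjectIdHex(tok[:24]))
		}
	}
	return d.txnIds
}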

Change the 'find tokens to pull' logic to bias towards making no
allocations when there is nothing to do, which is the case 99% of the
time. (See the sketch below.)
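
A sketch of that zero-allocation bias: keep the result slice nil until a token actually needs pulling. The pruned map and the 24-character token prefix follow the earlier sketches (bson import as above); the names are illustrative.

// findTokensToPull returns the tokens in queue whose txns have been
// pruned. It allocates nothing in the common case of an empty result.
func findTokensToPull(queue []string, pruned map[bson.ObjectId]bool) []string {
	var toPull []string // stays nil when there is nothing to do (~99% of calls)
	for _, tok := range queue {
		if pruned[bson.ObjectIdHex(tok[:24])] {
			toPull = append(toPull, tok)
		}
	}
	return toPull
}
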
Might give interesting insight into what is going on.
jameinel (Member Author) commented Nov 30, 2018

Doing some refactoring, and caching some of the token conversions gives us:

2018-11-30 08:26:48 INFO  pruning removed 28789773 txns and cleaned 172246 docs in 54m45.286s.
2018-11-30 08:26:48 DEBUG prune stats: txn.PrunerStats{DocCacheHits:28498761, DocCacheMisses:209875, DocMissingCacheHit:57227, DocsMissing:122512, CollectionQueries:63139, DocReads:85204, DocStillMissing:124671, StashQueries:24863, StashReads:66284, DocQueuesCleaned:172246, DocTokensCleaned:219370, DocsAlreadyClean:29062589, TxnsRemoved:28789773, TxnsNotRemoved:96}

525k/min

2018-11-30 08:26:48 INFO  clean and prune cleaned 172246 docs in 0 collections
  removed 28789773 transactions and 0 stash documents in 54m45.287s
1413.46user 48.89system 1:02:52elapsed 38%CPU (0avgtext+0avgdata 234696maxresident)k
0inputs+848outputs (0major+59342minor)pagefaults 0swaps

I'm a bit surprised to see memory go up that high, and CPU not drop more.

Put them 'top down' instead of 'bottom up'. So the primary function is
at the top, and then what it splits into is next, and what they call
after that, etc.
Whenever we read a document from the DB, cache all the various strings
that we care about in an LRU cache. See if that helps peak memory; it
might also help object comparison, since we don't have to compare
string contents if they refer to the same memory.
jameinel (Member Author):

I added some time checks and maybe a small tweak, and this is what I got for where time is spent:

2018-11-30 09:39:19 DEBUG prune stats: txn.PrunerStats{
 StashLookupTime:  20.200686785,
 CacheLookupTime:  76.749753259,
     DocReadTime: 799.310336552,
   DocLookupTime: 896.356869914,
  DocCleanupTime: 194.916944016,
     TxnReadTime: 368.235192550,
   TxnRemoveTime:1923.639124006,

      DocCacheHits:28498764,
    DocCacheMisses:  209872,
DocMissingCacheHit:   57227,
       DocsMissing:  122512,
 CollectionQueries:   63137,
          DocReads:   85202,
   DocStillMissing:  124670,
      StashQueries:   24863,
        StashReads:   66283,
  DocQueuesCleaned:  172246,
  DocTokensCleaned:  219370,
  DocsAlreadyClean:29062589,
       TxnsRemoved:28789773,
    TxnsNotRemoved:      96,
}
2018-11-30 09:39:19 INFO  clean and prune cleaned 172246 docs in 0 collections
  removed 28789773 transactions and 0 stash documents in 56m23.334s
1411.07user 48.84system 1:04:31elapsed 37%CPU (0avgtext+0avgdata 193876maxresident)k
8728inputs+872outputs (38major+49516minor)pagefaults 0swaps

So that is 8,510 txns/s, or 510k/min.

We spend 1900s (32min) in txns.Remove. The next biggest is 900s/15min reading the docs from the various collections, and that is for reading in only 209k docs, against the 28M lookups served from cache.
We only spend 200s/3.3min pulling the tokens out of the documents, though we do that 172k times.
We spend a bit of time reading the transactions: 368s/6min.

I'm a bit surprised by the doc lookup time. We read 28M transactions in 368s, but it takes 800s to read 209k documents from across the tree. I wonder if there is a particular collection that isn't indexed, causing some sort of table scans? Everything should be read by _id, which should always be the primary key. It could also be something like deserialization time, though I do Select down to only the attributes that I care about.

Note that we do seem to be missing a few of the repeated documents in the doc cache:

db.txns.distinct("o.d").length # note: takes 5min to run
135344

So we refer to 135k distinct documents across all transactions, but we read 209k docs, meaning on average we read each document about 2x. We could try increasing the document cache, or possibly use a 2-level cache (so documents that are accessed only infrequently don't evict documents that are accessed frequently).

jameinel (Member Author):

The last change added a string cache, where after reading documents from the database, we first pass it through an LRU cache of string values. This means we're fairly likely to avoid duplicate copies of the same strings in memory. This ended up with:

2018-11-30 11:36:22 DEBUG prune stats: txn.PrunerStats{
   StashLookupTime:  20.466,
   CacheLookupTime:  87.774,
       DocReadTime: 584.499,
     DocLookupTime: 692.828,
    DocCleanupTime: 150.877,
       TxnReadTime: 486.907,
     TxnRemoveTime:1938.871,

      DocCacheHits:28498764,
    DocCacheMisses:  209868,
DocMissingCacheHit:   57232,
       DocsMissing:  122512,
 CollectionQueries:   63138,
          DocReads:   85201,
   DocStillMissing:  124667,
      StashQueries:   24863,
        StashReads:   66284,
  DocQueuesCleaned:  172246,
  DocTokensCleaned:  219370,
  DocsAlreadyClean:29062589,
       TxnsRemoved:28789773,
    TxnsNotRemoved:      96,
       ObjCacheHit:58519606,
      ObjCacheMiss:29694001,
}
2018-11-30 11:36:22 INFO  clean and prune cleaned 172246 docs in 0 collections
  removed 28789773 transactions and 0 stash documents in 54m29.683s
8,806 txns/s 528k/min

1531.37user 44.31system 1:00:39elapsed 43%CPU (0avgtext+0avgdata 58436maxresident)k
0inputs+856outputs (0major+15048minor)pagefaults 0swaps

So we have a significant reduction in peak memory (58MB vs 193MB, or even the earlier 234MB).
Wall clock time improved by 4 minutes, and txn rate improved (528k/min vs 510k/min).

Txn remove time was the same and txn read time was worse (487s vs 368s), but doc read time was much better (584s vs 799s).

The string cache had a reasonably good hit rate (58.5M hits vs 29.7M misses).
I am curious how many distinct strings we actually have. We know that we have
28.8M different transaction ids, so that is a floor on the number of strings we
would want to cache. On the flip side, the collection names are also put in the
cache and are each referenced many times. But given that our total miss rate is
really not bad, we're probably doing about as well as we can.
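
A minimal sketch of the string-interning idea, assuming an LRU with Get/Add like the earlier sketch; p.strCache is an assumed field, though the cacheString call itself is visible in the code quoted further down.

// cacheString returns a previously cached copy of s when we have one, so
// identical strings read from the DB share a single allocation in memory.
func (p *IncrementalPruner) cacheString(s string) string {
	if cached, ok := p.strCache.Get(s); ok {
		return cached.(string)
	}
	p.strCache.Add(s, s)
	return s
}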

Includes a slight refactor.

Implement support for cleaning up the txns.stash of all documents that
don't have any references in their txn-queue.

Currently this always does multiple-pass pruning, but we will want to put
that under a flag.
You also have to copy the session early, because otherwise the session
can be closed by the time the goroutine starts.
Instead of IncrementalPruner handling multiple threads, CleanAndPrune
now handles multiple threads. This should avoid contention as the
individual workers process documents.
Instead of processing the docs for each txn, aggregate the docs to be
cleaned up, and then clean up each doc only once, avoiding parsing the
txn-queue multiple times for things-to-be-done. (Sketched below.)
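
A hedged sketch of that aggregation, reusing the docKey type from the earlier sketch; the txnDoc shape is illustrative. The point is one cleanup per doc instead of one per (txn, doc) pair.

// txnDoc is the slice of a txn we need here: its id, nonce, and the
// keys of the documents it touched (illustrative shape).
type txnDoc struct {
	Id    bson.ObjectId
	Nonce string
	Docs  []docKey
}

// groupTokensByDoc collects, per document, every token that the batch
// will remove, so each doc's txn-queue is parsed and cleaned only once.
func groupTokensByDoc(batch []txnDoc) map[docKey][]string {
	toClean := make(map[docKey][]string)
	for _, t := range batch {
		tok := t.Id.Hex() + "_" + t.Nonce
		for _, dk := range t.Docs {
			toClean[dk] = append(toClean[dk], tok)
		}
	}
	return toClean
}
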
The total time spent in cleanup is 10x less than the time we spend doing
other things. So don't bother optimizing it and making it harder to
understand.
This is essentially how I format them whenever I want to evaluate them,
so just do the formatting ahead of time. This gives a much
easier-to-read format.

The only question is whether we would rather left-align the strings,
but I personally right-align them.
If we have a doc affected by multiple txns, we were cleaning it multiple
times in the same pass, rather than only cleaning it once. Yay tests.
jameinel (Member Author):

I now have proper testing on this branch, and I think it is ready to land.

We can't depend on github.com/juju/mgo in the test suite, so we have
to handle the behavior of the old version of mgo, which doesn't clean up
the txn-queue during a Remove operation. The tests should also pass
properly if they are run against a patched mgo.
jameinel (Member Author):

!!build!!

jameinel (Member Author):

!!build!!

// TODO(jam): Is it worth caching the docKey object itself? We aren't using pointers to the docKey,
// but it does save a double lookup on both Collection and DocId.
key.Collection = p.cacheString(key.Collection)
// note that DocId is 99% of the time just a string
Member:

Out of interest, what would the other 1% be - null?

jameinel (Member Author):

txns.stash has a compound key of {"collection": <collection>, "id": <doc _id>}, and it is possible for the _id to be anything. In testing we frequently use ints. Mongo defaults to an ObjectId, but that is actually just a string.

Pretty much everything that juju does is strings.
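
As a Go shape, that compound key looks roughly like the following; the bson tags mirror the {"collection": ..., "id": ...} form described above, but treat the exact field names as illustrative.

// stashDocKey is the compound _id used by documents in txns.stash.
type stashDocKey struct {
	Collection string      `bson:"collection"`
	Id         interface{} `bson:"id"` // usually a string in juju, but can be anything
}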

hmlanigan (Member) left a comment:

Just a few small comments and questions.


// IncrementalPruner reads the transaction table incrementally, seeing if it can remove the current set of transactions,
// and then moves on to newer transactions. It only thinks about 1k txns at a time, because that is the batch size that
// can be deleted. Instead, it caches documents that it has seen.
Member:

Instead of what?

jameinel (Member Author):

The original prune code would grab the list of all known transaction ids, then walk all of the documents in the database, removing from the 'safe' set any id that was still referenced. After walking everything, whatever transaction ids remained were safe to remove.

That is a 'safe' strategy, but it requires doing a lot of work before you can do anything.
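
For contrast, a hedged sketch of that mark-and-sweep strategy, with the inputs flattened into slices (the real code walked the DB collection by collection; this just shows the set arithmetic, with the token prefix convention from the earlier sketches):

// markSweepPrune: start with every completed txn id as "safe", drop any
// id still referenced from some doc's txn-queue, and return the rest.
func markSweepPrune(completedIds []bson.ObjectId, allQueues [][]string) []bson.ObjectId {
	safe := make(map[bson.ObjectId]bool, len(completedIds))
	for _, id := range completedIds {
		safe[id] = true
	}
	for _, queue := range allQueues {
		for _, tok := range queue {
			delete(safe, bson.ObjectIdHex(tok[:24])) // still referenced
		}
	}
	remaining := make([]bson.ObjectId, 0, len(safe))
	for id := range safe {
		remaining = append(remaining, id)
	}
	return remaining
}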

DocsCleaned int
}

// IncrementalPruneArgs specifies the parameters for running incremental cleanup steps..
Member:

s/.././

jameinel (Member Author):

👍

// oldest instead of form oldest to newest.
ReverseOrder bool

// TxnBatchSize is how many transaction to process at once.
Member:

s/transaction/transactions/

jameinel (Member Author):

👍

return tokensToPull, newQueue, newTxns
}

func (p *IncrementalPruner) removeTxns(txnsToDelete []bson.ObjectId, txns *mgo.Collection, errorCh chan error, wg *sync.WaitGroup) error {
Member:

Why return an error here? It appears it will always return nil.

jameinel (Member Author):

Fair point. At one point it didn't fork a goroutine and so could return the error, but it doesn't anymore.

p.missingCache.KnownMissing(docKey)
if docKey.Collection == "metrics" {
// Note: (jam 2018-12-06) This is a special case. Metrics are
// *known* to violate the transaction guarantees. The are
Member:

s/The/They/

jameinel (Member Author):

👍

stats, err := pruner.Prune(args.Txns)
mu.Lock()
pstats = CombineStats(pstats, stats)
if anyErr == nil {
Member:

Why only care about the first error hit by either goroutine?

jameinel (Member Author):

It's mostly a case of 'what do we do with all the other errors?'. Ideally we'd probably want to at least log them; I went ahead and added that.
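
For illustration, the pattern that ends up with, as a hedged sketch; the first-error-wins shape follows the quoted snippet, and the logging call is a stand-in.

import (
	"log"
	"sync"
)

// recordErr keeps the first error for the caller and logs the rest,
// so secondary failures are visible without changing the return value.
func recordErr(mu *sync.Mutex, anyErr *error, err error) {
	if err == nil {
		return
	}
	mu.Lock()
	defer mu.Unlock()
	if *anyErr == nil {
		*anyErr = err
	} else {
		log.Printf("additional error during prune: %v", err)
	}
}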

// CleanAndPrune runs the cleanup steps, and then follows up with pruning all
// of the transactions that are no longer referenced.
func CleanAndPrune(args CleanAndPruneArgs) (CleanupStats, error) {
tStart := time.Now()
Member:

The CR checklist suggests using utils.Clock instead of time.Now or time.After, though the reason may be lost in history.

jameinel (Member Author):

So, if you ever want to test something, you shouldn't depend on the wall clock; using a Clock gives you an interface to inject artificial time.
I don't think it particularly matters for 'how long did this operation take', which is just an informative message to the user. Though I probably should thread through a Clock and add a test around TxnBatchSleepTime. Thanks for the reminder.
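
A hedged sketch of that Clock threading, using the utils clock.Clock interface the checklist refers to; the args struct here is illustrative, not the PR's actual IncrementalPruneArgs layout.

import (
	"time"

	"github.com/juju/utils/clock"
)

type pruneArgs struct {
	Clock             clock.Clock   // wall clock in production, fake clock in tests
	TxnBatchSleepTime time.Duration // pause between batches to be gentle on the DB
}

func (a *pruneArgs) sleepBetweenBatches() {
	if a.TxnBatchSleepTime > 0 {
		<-a.Clock.After(a.TxnBatchSleepTime) // testable, unlike time.Sleep
	}
}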

SimonRichardson and others added 2 commits December 12, 2018 21:33
comment typo

Co-Authored-By: jameinel <[email protected]>
A couple of typos, and a genuine concern around adding a clock.Clock
interface.
jameinel (Member Author):

$$merge$$

jujubot merged commit b66278d into juju:master on Dec 13, 2018