Incremental prune #46
Conversation
This does the cleaning and the txn removal in the same core loop, and seems to do so with a lot less infrastructure.
It seems the Metrics cleanup code just does a coll.RemoveAll() call, rather than creating a txn Operation to remove the entries, which means we end up with a bunch of metrics transactions that refer to documents that aren't there anymore. The same may be true for cloudimagemetadata.
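For anyone unfamiliar with the distinction, here is a rough Go sketch (illustrative only, not the mgopurge code; the helper names are made up) of removing metrics documents directly versus removing them through the txn layer, which is what would keep the transaction history consistent:

package prune

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/txn"
)

// removeMetricsRaw deletes documents directly. Any completed transactions in
// the txns collection that reference these documents are left behind, which is
// the orphaning described above.
func removeMetricsRaw(metrics *mgo.Collection) error {
	_, err := metrics.RemoveAll(nil) // nil selector matches every document
	return err
}

// removeMetricsViaTxn removes the same documents through the txn layer, so the
// transaction history stays consistent with the data. ids is whatever set of
// document ids the caller wants removed.
func removeMetricsViaTxn(txnsColl *mgo.Collection, ids []interface{}) error {
	runner := txn.NewRunner(txnsColl)
	ops := make([]txn.Op, 0, len(ids))
	for _, id := range ids {
		ops = append(ops, txn.Op{C: "metrics", Id: id, Remove: true})
	}
	return runner.Run(ops, "", nil)
}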
Compare that with the old code:
I only ran prune, not prune+purgemissing, etc. The CPU time is much higher, nearly 2x, but it must be a lot lighter on the database, as the wall clock time is a lot faster. Memory consumption is also far lower (25MB max vs 315MB max).
Switch from uint64 to int, on 64-bit they only have 1 bit difference, and it means that kr/pretty will not use hex to display them.
Try caching known-missing documents, and see whether that helps anything.
Try Bulk.Remove() instead of just Remove. This still uses $in rather than a series of Remove calls, we can try that one next. (The removal variants are sketched below.)
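As a rough illustration of the removal variants mentioned in these commits (illustrative mgo code, not the PR's actual implementation), the three flavours look roughly like this:

package prune

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// removeWithIn issues a single RemoveAll with an $in selector over the batch.
func removeWithIn(txns *mgo.Collection, ids []bson.ObjectId) error {
	_, err := txns.RemoveAll(bson.M{"_id": bson.M{"$in": ids}})
	return err
}

// removeWithBulk queues one remove per id and sends them in a single Bulk call.
func removeWithBulk(txns *mgo.Collection, ids []bson.ObjectId) error {
	bulk := txns.Bulk()
	bulk.Unordered()
	for _, id := range ids {
		bulk.Remove(bson.M{"_id": id})
	}
	_, err := bulk.Run()
	return err
}

// removeOneByOne issues an individual Remove per id; later comments in this
// thread report that this direct approach ends up fastest for this workload.
func removeOneByOne(txns *mgo.Collection, ids []bson.ObjectId) error {
	for _, id := range ids {
		if err := txns.RemoveId(id); err != nil && err != mgo.ErrNotFound {
			return err
		}
	}
	return nil
}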
c478006 to 09bf01d
	DocsAlreadyClean int
	TxnsRemoved      int
	TxnsNotRemoved   int
}
More of a question than a comment: is there a preference for int rather than, say, uint in the Juju code base? As a bikeshedding remark, none of those fields seem like they should need to hold negative values.
len() returns an int, so for fields that aggregate that sort of data it made sense to just put it into an int. At one point I explicitly used a uint64, but it turns out that kr/pretty encodes uint64 as a hex value, which makes it hard to get nice debugging results.
I agree that uint would be appropriate, but given that things like len() return int, it often makes sense to just use the normal type.
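A tiny illustration of the formatting point (hypothetical struct names; the hex rendering of uint64 is as reported above, not verified here):

package prune

import (
	"fmt"

	"github.com/kr/pretty"
)

// Two variants of a stats struct, differing only in the field type.
type statsUint struct{ TxnsRemoved uint64 }
type statsInt struct{ TxnsRemoved int }

func compareFormatting() {
	// Per the observation above, kr/pretty renders uint64 fields in hex,
	// which makes large counters awkward to read; plain int avoids that and
	// matches what len() already returns.
	fmt.Println(pretty.Sprint(statsUint{TxnsRemoved: 28789869}))
	fmt.Println(pretty.Sprint(statsInt{TxnsRemoved: 28789869}))
}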
We can potentially have multiple txn remove threads, but I want to see if it actually makes anything faster, or whether we just end up with more contention reading/deleting from the same txn.
I should caveat that I cheated a little bit with the elapsed time, more than I realized. If you run all stages you get something more like: So it is still faster, but a fair bit of the original elapsed time was in stages other than prune. Comparing Incremental using Bulk.Remove gives:
vs the old direct txns.Remove calls:
That would say that direct removals are a lot faster, so I'll switch back. (For comparison, the original mgopurge snap has this for just the prune stage:
So txns.Remove vs Bulk.Remove seems to be the biggest win here.
They both seem to result in slower overall performance. (475k txns/min vs 499k txns/min with this version). Will also want to experiment with removing the Sort() as we don't really need it.
I also missed the "removed 21879688" vs "removed 28789869", which also skews any sort of wall clock metrics. It may be that it's faster just because it isn't removing as many txns. (There are some txns that look problematic when looking in this direction, which we'll need to deal with.) metrics, in particular, creates docs with txns, but removes them with raw access, which only hasn't bitten us because the txns involved weren't multi-document.
I went back to a very basic setup with just txns.Remove, and I removed the Sort("_id") to see if it would be faster if it could yield whatever transactions it wanted. It seems way slower.
The best I've seen so far was closer to 499k/min, and the Bulk.Remove with sort was around 450k/min.
Break things up into multiple functions, hopefully making it a lot easier to read what is going on.
Change the docWithQueue so we cache the transformation from token to txn ids. Since we're likely passing over the same documents over and over, it makes sense to not do the conversion millions of times. (A sketch of this caching follows below.)
Change the 'find tokens to pull' to bias towards not having any allocations when there is nothing to do, which is the case 99% of the time.
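A hypothetical sketch of what caching the token-to-id conversion on the document might look like (type and field names are illustrative; the token layout follows mgo/txn's "<txn id hex>_<nonce>" convention):

package prune

import "gopkg.in/mgo.v2/bson"

// docWithQueue caches a document's txn-queue along with the parsed txn ids,
// so the token-to-id conversion is only paid once per cached document.
type docWithQueue struct {
	Queue []string        // raw tokens, of the form "<txn id hex>_<nonce>"
	ids   []bson.ObjectId // lazily derived from Queue
}

// txnIDs converts the queue tokens to ObjectIds on first use and caches them.
func (d *docWithQueue) txnIDs() []bson.ObjectId {
	if d.ids != nil {
		return d.ids
	}
	d.ids = make([]bson.ObjectId, 0, len(d.Queue))
	for _, token := range d.Queue {
		// The first 24 characters of a token are the hex form of the txn's ObjectId.
		if len(token) >= 24 && bson.IsObjectIdHex(token[:24]) {
			d.ids = append(d.ids, bson.ObjectIdHex(token[:24]))
		}
	}
	return d.ids
}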
Might give interesting insight into what is going on.
Doing some refactoring, and caching some of the token conversions gives us:
I'm a bit surprised to see memory go up that high, and CPU not drop more.
Put them 'top down' instead of 'bottom up'. So the primary function is at the top, and then what it splits into is next, and what they call after that, etc.
Whenever we read a document from the DB, cache all the various strings that we care about in an LRU cache. See if that helps peak memory; it might also help object comparison, since we don't have to compare string contents if they refer to the same memory.
I added some time checks and maybe a small tweak, and this is what I got for where time is spent:
So that is 8,510 txns/s or 510k/min. We spend 1900s (32min) in txns.Remove. And the next biggest is 900s/15min in reading the docs from the various collections. And that is reading in 209k docs, out of 28M that we have already read. I'm a bit surprised by the doc lookup time. We read 28M transactions in 368s, but it takes 800s to read 209k documents from across the tree. I wonder if there is a particular collection that isn't indexed, causing some sort of table scans? Everything should be read by _id, which should always be the primary key. It could also be something like deserialization time, though I do Select down to only the attributes that I care about. Note that we do seem to be missing a few of the repeated documents in the doc cache:
So we refer to 135k documents across all transactions, but we read 209k docs, so on average we read a document more than once. We could try increasing the document cache, or possibly use a 2-level cache (so documents that are only accessed infrequently don't wipe out documents that are accessed frequently).
The last change added a string cache, where after reading documents from the database, we first pass it through an LRU cache of string values. This means we're fairly likely to avoid duplicate copies of the same strings in memory. This ended up with:
So we have a significant reduction in peak memory (58MB vs 193MB or even 234MB). Txn remove time was the same, read time was worse, but doc read time was much better. The string cache had a reasonably good hit rate (29.7M misses, 58.5M hits).
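A minimal, self-contained sketch of the string-interning idea (not the PR's actual LRU; this version just caps a map and drops it when full):

package prune

// stringInterner sketches the idea above: after reading a document, pass every
// string we keep through the cache so repeated values share one backing
// allocation. The real change uses an LRU; this sketch just caps the map size
// and throws everything away when it gets too big.
type stringInterner struct {
	max    int
	seen   map[string]string
	hits   int64
	misses int64
}

func newStringInterner(max int) *stringInterner {
	return &stringInterner{max: max, seen: make(map[string]string, max)}
}

func (si *stringInterner) intern(s string) string {
	if cached, ok := si.seen[s]; ok {
		si.hits++
		return cached
	}
	si.misses++
	if len(si.seen) >= si.max {
		// Crude eviction: drop the whole cache. An LRU would evict only
		// the least recently used entries.
		si.seen = make(map[string]string, si.max)
	}
	si.seen[s] = s
	return s
}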
Includes a slight refactor. Implement support for cleaning up the txns.stash of all documents that don't have any references in their txn-queue. Currently always does multiple pass pruning, but we will want to put that under a flag.
You also have to copy the session early, because otherwise the session can be closed by the time the goroutine starts.
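A small sketch of the pattern (illustrative helper, not the PR's code): copy the session while it is still known to be open, then hand the copy to the goroutine:

package prune

import (
	"sync"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// removeAsync copies the session before starting the goroutine. If the copy
// happened inside the goroutine, the caller might already have closed the
// original session by the time the goroutine runs.
func removeAsync(txns *mgo.Collection, ids []bson.ObjectId, errorCh chan<- error, wg *sync.WaitGroup) {
	session := txns.Database.Session.Copy() // copy *before* starting the goroutine
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer session.Close()
		coll := txns.With(session)
		if _, err := coll.RemoveAll(bson.M{"_id": bson.M{"$in": ids}}); err != nil {
			errorCh <- err
		}
	}()
}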
Instead of IncrementalPruner handling multiple threads, now CleanAndPrune handles multiple threads. This should avoid contention as the individual workers process documents.
Instead of processing the docs for each txn, aggregate the docs to be cleaned up, and then clean up each doc only once, avoiding parsing the txn-queue multiple times for things to be done.
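A hypothetical sketch of that aggregation (names are illustrative and the txn document is pared down): collect, per document, all of the tokens that should be pulled across the whole batch, so each document's txn-queue is only rewritten once per pass.

package prune

import "gopkg.in/mgo.v2/bson"

// docKey identifies a document touched by a transaction (collection + _id);
// the field names here are illustrative, not necessarily the PR's.
type docKey struct {
	Collection string
	DocId      interface{}
}

// txnDoc is a pared-down, illustrative view of a transaction document.
type txnDoc struct {
	Id    bson.ObjectId `bson:"_id"`
	Nonce string        `bson:"n"`
	Docs  []docKey      // documents this txn touched
}

func (t txnDoc) token() string {
	return t.Id.Hex() + "_" + t.Nonce
}

// gatherTokensToPull walks a batch of completed txns and aggregates, per
// document, the set of tokens to pull from its txn-queue. Each doc then gets
// cleaned exactly once per pass, instead of once per txn that references it.
func gatherTokensToPull(batch []txnDoc) map[docKey][]string {
	toPull := make(map[docKey][]string)
	for _, txn := range batch {
		for _, key := range txn.Docs {
			toPull[key] = append(toPull[key], txn.token())
		}
	}
	return toPull
}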
We don't need to report it again here.
The total time spent in cleanup is 10x less than the time we spend doing other things. So don't bother optimizing it and making it harder to understand.
This is essentially how I format them whenever I want to evaluate it, so just do the formatting ahead of time. This gives much easier-to-read formatting. The only question is whether we would rather left-align the strings, but I personally right-align them.
If we have a doc affected by multiple txns, we were cleaning it multiple times in the same pass, rather than only cleaning it one time. Yay tests.
I now have proper testing on this branch, and I think it is ready to land.
We can't depend on github.com/juju/mgo in the test suite, so we have to handle the behavior of the old version of mgo, which doesn't clean up the txn-queue during a Remove operation. The tests should also properly pass if they are running against a patched mgo.
328c5a8 to a629c80
!!build!!
!!build!!
// TODO(jam): Is it worth caching the docKey object itself?, we aren't using pointers to the docKey
// it does save a double lookup on both Collection and DocId
key.Collection = p.cacheString(key.Collection)
// note that DocId is 99% of the time just a string
Out of interest, what would it be the other 1% of the time - null?
txns.stash has a compound key of {"collection": , "id": } and it is possible for _id to be anything. In testing we frequently use ints. Mongo defaults to an ObjectId, but that is actually just a string.
Pretty much everything that juju does is strings.
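For illustration, the stash's compound key can be modelled roughly like this (the bson tags and field names are an assumption for the sketch, not copied from mgo/txn):

package prune

import "gopkg.in/mgo.v2/bson"

// stashKey sketches the compound _id used by txns.stash: the collection name
// plus the original document id, which can be any bson value (a string in
// normal Juju use, often an int in tests, an ObjectId by Mongo default).
type stashKey struct {
	Collection string      `bson:"c"`
	Id         interface{} `bson:"id"`
}

// selector builds a query matching one stash document by its compound key.
func (k stashKey) selector() bson.M {
	return bson.M{"_id": bson.M{"c": k.Collection, "id": k.Id}}
}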
just a few small comments and questions.
// IncrementalPruner reads the transaction table incrementally, seeing if it can remove the current set of transactions,
// and then moves on to newer transactions. It only thinks about 1k txns at a time, because that is the batch size that
// can be deleted. Instead, it caches documents that it has seen.
Instead of what?
The original prune code would grab the list of all known transaction ids, then walk all of the documents in the database, removing from the 'safe' set any ids that were still referenced. After walking everything it would then treat the remaining transaction ids as safe to remove.
Which is a 'safe' strategy, but requires doing a lot of work before you can do anything.
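A sketch of that original 'safe set' strategy in Go, to contrast with the incremental approach (field and state values come from a reading of mgo/txn and should be double-checked; this is not the old mgopurge code verbatim):

package prune

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// pruneBySafeSet builds the full set of completed txn ids, walks every
// document in every collection removing any id still referenced by a
// txn-queue, and only then deletes whatever is left. All of the memory and
// time is spent up front before a single transaction can be removed.
func pruneBySafeSet(db *mgo.Database, collections []string) error {
	txns := db.C("txns")
	safe := make(map[bson.ObjectId]struct{})

	var txnID struct {
		Id bson.ObjectId `bson:"_id"`
	}
	// States 5 (aborted) and 6 (applied) are the terminal states in mgo/txn.
	iter := txns.Find(bson.M{"s": bson.M{"$gte": 5}}).Select(bson.M{"_id": 1}).Iter()
	for iter.Next(&txnID) {
		safe[txnID.Id] = struct{}{}
	}
	if err := iter.Close(); err != nil {
		return err
	}

	// Walk every document and drop still-referenced ids from the safe set.
	for _, name := range collections {
		var doc struct {
			Queue []string `bson:"txn-queue"`
		}
		docIter := db.C(name).Find(nil).Select(bson.M{"txn-queue": 1}).Iter()
		for docIter.Next(&doc) {
			for _, token := range doc.Queue {
				if len(token) >= 24 && bson.IsObjectIdHex(token[:24]) {
					delete(safe, bson.ObjectIdHex(token[:24]))
				}
			}
		}
		if err := docIter.Close(); err != nil {
			return err
		}
	}

	// Everything remaining is unreferenced and can go.
	ids := make([]bson.ObjectId, 0, len(safe))
	for id := range safe {
		ids = append(ids, id)
	}
	_, err := txns.RemoveAll(bson.M{"_id": bson.M{"$in": ids}})
	return err
}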
incrementalprune.go
Outdated
	DocsCleaned int
}

// IncrementalPruneArgs specifies the parameters for running incremental cleanup steps..
s/.././
👍
incrementalprune.go
Outdated
	// oldest instead of form oldest to newest.
	ReverseOrder bool

	// TxnBatchSize is how many transaction to process at once.
s/transaction/transactions/
👍
incrementalprune.go
Outdated
	return tokensToPull, newQueue, newTxns
}

func (p *IncrementalPruner) removeTxns(txnsToDelete []bson.ObjectId, txns *mgo.Collection, errorCh chan error, wg *sync.WaitGroup) error {
Why return an error here? It appears it will always return nil?
Fair point. At one point it didn't fork a goroutine and so could return the error, but it doesn't anymore.
incrementalprune.go
Outdated
p.missingCache.KnownMissing(docKey)
if docKey.Collection == "metrics" {
	// Note: (jam 2018-12-06) This is a special case. Metrics are
	// *known* to violate the transaction guarantees. The are
s/The/They/
👍
stats, err := pruner.Prune(args.Txns)
mu.Lock()
pstats = CombineStats(pstats, stats)
if anyErr == nil {
Why only care about the first error hit by either goroutine?
It's mostly a case of 'what do we do with all the other errors'. Probably we'd ideally want to at least log them. I went ahead and added that.
// CleanAndPrune runs the cleanup steps, and then follows up with pruning all
// of the transactions that are no longer referenced.
func CleanAndPrune(args CleanAndPruneArgs) (CleanupStats, error) {
	tStart := time.Now()
The CR checklist suggests using utils.Clock instead of time.Now or time.After, though the reason may be lost in history.
So, if you ever want to test something, you shouldn't depend on the wall clock. Using a Clock gives you an interface to inject artificial time.
I don't think it particularly matters for 'how long did this operation take' which is just an informative message to the user. Though I probably should thread through a Clock and add a test around TxnBatchSleepTime. Thanks for the reminder.
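A minimal sketch of what threading a clock through might look like, using a locally defined interface rather than any particular Juju package (names are illustrative):

package prune

import "time"

// Clock is the small slice of wall-clock behaviour the pruner needs. In Juju
// this would typically come from the project's clock package; a local
// interface keeps the sketch self-contained.
type Clock interface {
	Now() time.Time
	After(d time.Duration) <-chan time.Time
}

// wallClock satisfies Clock with the real time package.
type wallClock struct{}

func (wallClock) Now() time.Time                         { return time.Now() }
func (wallClock) After(d time.Duration) <-chan time.Time { return time.After(d) }

// sleepBetweenBatches is where injecting a Clock pays off: a test can supply a
// fake clock and assert that the pruner waits TxnBatchSleepTime between
// batches without actually sleeping.
func sleepBetweenBatches(clk Clock, sleep time.Duration) {
	if sleep > 0 {
		<-clk.After(sleep)
	}
}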
comment typo Co-Authored-By: jameinel <[email protected]>
A couple of typos, and a genuine concern around adding a clock.Clock interface.
This introduces a potential change to the pruning algorithm, which has a nice benefit of significantly simplifying it.
The first thing is that we completely get rid of the 'working' set of transactions-that-are-referenced. Instead, we assume that only docs referenced by a txn could reference that transaction (which should be true). This also means that however many transactions we manage to process, they will be removed, without the need to worry about batching lots of txns to process at the same time.
Then, we read txns that are marked completed and are 'old enough', and then look up the txn-queues of the documents they reference and read those (and cache that in an LRU). It should be safe to cache the queue, because only new txns should ever get added to queues, so they should never reference a completed queue. Because we know that most of our txns touch a small set of documents over and over, an LRU cache of 10k has a very high hit rate (on prodstack, while processing ~6M txns, touching 6M docs, we have a 98% hit rate).
As we process the txns, we group them in batches of 1000, since that is approximately the right batch size for deleting txns. We then inspect the docs, and see if we need to clean the txn-queues for a txn we are considering deleting. For docs that are updated often, the old txn is likely already removed from the queue, so there is nothing to do.
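Putting the description above into a rough Go sketch (illustrative only; the queue-cleaning and document LRU details are elided, and the state values come from a reading of mgo/txn):

package prune

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// prunerTxn is the slice of a transaction document this sketch needs.
type prunerTxn struct {
	Id bson.ObjectId `bson:"_id"`
}

// incrementalPrune walks completed txns that are old enough, collects them in
// batches, and deletes each batch before moving on, instead of building a
// global safe set first.
func incrementalPrune(txns *mgo.Collection, oldestAllowed bson.ObjectId, batchSize int) error {
	query := bson.M{
		"s":   bson.M{"$gte": 5},            // 5 (aborted) and 6 (applied) are terminal states
		"_id": bson.M{"$lt": oldestAllowed}, // ObjectIds embed a timestamp, so this is an age cutoff
	}
	iter := txns.Find(query).Select(bson.M{"_id": 1}).Iter()

	batch := make([]bson.ObjectId, 0, batchSize)
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		// The real code first pulls stale tokens from the txn-queues of the
		// documents these txns touch (consulting the doc cache) before removing.
		_, err := txns.RemoveAll(bson.M{"_id": bson.M{"$in": batch}})
		batch = batch[:0]
		return err
	}

	var t prunerTxn
	for iter.Next(&t) {
		batch = append(batch, t.Id)
		if len(batch) >= batchSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	if err := flush(); err != nil {
		return err
	}
	return iter.Close()
}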
This code needs a lot more tests, but I think the shape of it is correct.