
Use disk-based pruning strategy #22

Merged (1 commit into juju:master, Mar 31, 2017)
Conversation

@mjs mjs commented Mar 30, 2017

Previously, transaction pruning would consume unmanageable amounts of
memory when large numbers of transactions were present.

Instead of holding the transaction ids working set in memory, it is
now kept in a temporary database collection (generated by an
aggregation pipeline query) which uses much less memory.

Also of note:

  • queries now use the maximum batch size to maximise throughput
  • bulk record removal has been abstracted into a bulkRemover
    type (much cleaner)
  • progress for long-running operations is now emitted using a timer,
    resulting in less noise and more useful feedback

As well as consuming less memory, the new approach is also much
faster. Using a real-world database dump containing 19 million
transactions where 18.5 million were to be pruned, the total prune
time dropped from 154 minutes to 9 minutes (-94%)!

No test changes were required as the external behaviour hasn't changed.
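
In code terms, the core of the change reads roughly as follows (a sketch assembled from the diff excerpts below: taborted/tapplied are mgo/txn's completed-state codes, runPipe is this diff's pipeline helper, and buildWorkingSet is an illustrative name, not from the diff):

// buildWorkingSet projects the ids of all completed transactions
// into a temporary on-disk collection, instead of accumulating
// them in a Go map.
func buildWorkingSet(txns *mgo.Collection) (string, error) {
	workingSetName := txns.Name + ".prunetemp"
	pipe := txns.Pipe([]bson.M{
		// Keep only completed (aborted or applied) transactions.
		{"$match": bson.M{"s": bson.M{"$in": []int{taborted, tapplied}}}},
		// The txn id is the _id, so project just that.
		{"$project": bson.M{"_id": 1}},
		// $out writes the result to a collection on disk,
		// atomically replacing any previous contents.
		{"$out": workingSetName},
	})
	var dummy bson.D
	if err := runPipe(pipe, &dummy); err != nil {
		return "", err
	}
	return workingSetName, nil
}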

prune.go Outdated
// memory hit is short-lived this is probably acceptable.
txnIds := make(map[bson.ObjectId]struct{})
collNames := make(map[string]struct{})
workingSetName := txns.Name + ".prunetemp"
Contributor

Do we know what will happen with concurrent calls to PruneTxns? I'm just wondering about what happens if we call PruneTxns and the mongo client request times out, but the op's still running in mongo. Maybe we could check for existence first?

According to the mongo docs, the output collection will be atomically replaced. I guess at worst we would invalidate the earlier call's iterators, if any are still active.

Author

Concurrent calls are unlikely given how PruneTxns will ultimately be run, but it is worth considering. I'm reluctant to use the existence of the temp collection to block prune attempts because if a jujud gets killed for some reason during pruning then pruning will never happen again (until someone manually fixes the problem).

I'll try out concurrent calls and see what happens, I guess.

Author

I've manually tested this quite a bit now. Depending on timing concurrent runs tend to cause cursors to be killed. This is handled gracefully enough without doing anything bad to the collections or txn data. Purging continues correctly the next time around.

Contributor

Great, thanks for confirming!

prune.go Outdated
{"$out": workingSetName},
})
var dummy bson.D
if err := runPipe(pipe, &dummy); err != nil {
Contributor

maybe just inline &bson.D{}?

Author

That's better. Fixed.

prune.go Outdated
return err
}
func newBulkRemover(coll *mgo.Collection) *bulkRemover {
session := coll.Database.Session.Copy()
Contributor

why copy the session?

Author

Thanks for picking this up. I intended to dig into this more. I wasn't sure if it was OK to perform DB operations on the same session that had an active iterator.

On reflection it's probably ok, but I'll make sure first. If so, I'll remove the Copy.

Author

It doesn't look like the Copy was needed so I've removed it. That simplifies a few other aspects as well.

Member

@jameinel jameinel left a comment

Being faster and more memory efficient is great. I had a few thoughts about the algorithm, which I think we can choose whether or not to address now.

prune.go Outdated
// memory hit is short-lived this is probably acceptable.
txnIds := make(map[bson.ObjectId]struct{})
collNames := make(map[string]struct{})
workingSetName := txns.Name + ".prunetemp"
Member

I take it it was easier to project into another collection in the same DB rather than a second DB?

Author

$out can only output to a collection in the same database.

prune.go Outdated
matchCompleted := bson.M{"$match": bson.M{"s": bson.M{"$in": []int{taborted, tapplied}}}}
pipe := txns.Pipe([]bson.M{
matchCompleted,
{"$project": bson.M{"_id": 1}},
Member

Nice that because the transaction ids are the _id field, it maps directly into your target collection's primary key.

prune.go Outdated
logger.Debugf("loading collections with transactions")
pipe = txns.Pipe([]bson.M{
matchCompleted,
{"$project": bson.M{"_id": 0, "c": "$o.c"}},
Member

I take it this is "give me only things in the o.c field as just the value c",
then "turn the arrays of c into a single top-level list of c objects",
and then "aggregate all of the separate c objects so that we have only unique c values".

Given that you don't have interesting documents (just a single field), is there something cheaper than "$group"? Maybe not.

Given that all this work only gives you the collection names, I'm curious why you do it this way instead of just iterating all known collections and looking for 'txn-queue' fields.
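
For reference, the complete pipeline being described here looks roughly like this (a sketch; the $unwind and $group stages are cut off in the excerpt above):

pipe = txns.Pipe([]bson.M{
	matchCompleted,
	// Keep only the per-operation collection names ("o.c").
	{"$project": bson.M{"_id": 0, "c": "$o.c"}},
	// Flatten each txn's array of names into one document per name.
	{"$unwind": "$c"},
	// Group by name, leaving one document per distinct collection.
	{"$group": bson.M{"_id": "$c"}},
})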

Author

I thought about this a bit. The code doesn't know which collections use transactions so it works it out by looking in txns. There are a few collections like statushistory and sequence which don't use txns, so I was trying to avoid searching those unnecessarily for txn-queue fields.

This pipeline is fairly expensive though so maybe the algorithm should just look in all collections (except a few mgo/txn related ones). I'll play with that.

I don't think there's anything cheaper than $group that would work here.

prune.go Outdated
return fmt.Errorf("failed to read all txns: %v", err)
var collNames []string
for _, doc := range collNameDocs {
if len(doc) == 1 {
Member

What else would you get? len(doc) == 0?

Author

@mjs mjs Mar 30, 2017

The length should never be anything other than 1 but I wanted to avoid any chance of panic.

prune.go Outdated
for _, doc := range collNameDocs {
if len(doc) == 1 {
if collName, ok := doc[0].Value.(string); ok {
collNames = append(collNames, collName)
Member

I'm curious if this is also available as something like doc.Id(), but maybe you'd have to use a struct like:

var collNameDocs struct {
  Collection string `bson:"_id"`
}

Would that be clearer to parse?

Author

Yep, that's better. I'll update to this.
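
Something like the following sketch, presumably (Collection is an illustrative field name):

// Decode each group result straight into a typed struct rather
// than indexing into a bson.D.
var doc struct {
	Collection string `bson:"_id"`
}
iter := pipe.Iter()
for iter.Next(&doc) {
	collNames = append(collNames, doc.Collection)
}
if err := iter.Close(); err != nil {
	return fmt.Errorf("failed to read all txns: %v", err)
}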

prune.go Outdated
coll := db.C(collName)
var tDoc struct {
Queue []string `bson:"txn-queue"`
}
iter := coll.Find(nil).Select(bson.M{"txn-queue": 1}).Iter()
query := coll.Find(nil).Select(bson.M{"txn-queue": 1})
query.Batch(batchBytes)
Member

https://docs.mongodb.com/manual/reference/method/cursor.batchSize/

Seems to say that the batch size is the number of documents to batch, not the number of bytes to batch.

https://godoc.org/gopkg.in/mgo.v2#Pipe.Batch

Says "sets the batch size when fetching documents" but that doesn't specify units either.

Given that you've run this with a batch size of 16M, maybe I'm wrong. This feels like something that might be Mongo Version dependent.

Author

I researched this a bit when I wrote the code and convinced myself it was in bytes, but I've looked again and now I'm pretty sure it's in documents. It's really not very clear in the docs.

It seems that if you set it to a number of docs that results in a reply bigger than 16MB, it'll be limited to the maximum anyway. According to the docs, 101 docs is about 1MB (not quite sure how that works out), so I'll set the batch size to 1616 to get the maximum.
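
The fix would then look something like this sketch (maxBatchDocs is an illustrative name):

// Batch by document count, sized so that a full batch approaches
// the 16MB reply limit; the server caps the reply size regardless.
const maxBatchDocs = 1616
query := coll.Find(nil).Select(bson.M{"txn-queue": 1})
query.Batch(maxBatchDocs)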

coll := db.C(collName)
var tDoc struct {
Queue []string `bson:"txn-queue"`
}
iter := coll.Find(nil).Select(bson.M{"txn-queue": 1}).Iter()
query := coll.Find(nil).Select(bson.M{"txn-queue": 1})
Member

I assume this is the equivalent of:
db.collection.find({}, {"txn-queue": 1})

I wonder if it would be helpful to do the same trick we did earlier with:
db.collection.find({}, {"_id": 0, "txn-queue": 1})

It also really feels like it wants to be another pipeline. Would it be sane to do the aggregation into another temp table so that we wouldn't have to iterate the docs?

Something like:

pipe = txns.Pipe([]bson.M{
	{"$project": bson.M{"_id": 0, "txn": "$txn-queue"}},
	{"$unwind": "$txn"},
	{"$group": bson.M{"_id": "$txn"}},
})

Feels really good here, as we don't end up double handling the same txn ids over and over again.

Author

@mjs mjs Mar 30, 2017

Unfortunately I don't think that helps because the txn-queue field values are <txn id>_<nonce>. I'll have to check if the nonce is reused but I suspect it's not.

Member

ah, right, you have to chop the nonce off the transaction ids.
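
That splitting might look like this minimal sketch (tokenToId is an illustrative helper, not from the diff):

// tokenToId recovers the txn id from a txn-queue token of the
// form "<txn id hex>_<nonce>".
func tokenToId(token string) (bson.ObjectId, bool) {
	parts := strings.SplitN(token, "_", 2)
	if len(parts) != 2 || !bson.IsObjectIdHex(parts[0]) {
		return "", false
	}
	return bson.ObjectIdHex(parts[0]), true
}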

prune.go Outdated
iter := query.Iter()
var doc bson.D
for iter.Next(&doc) {
if err := remover.remove(doc[0].Value); err != nil {
Member

I'm curious why you had to do the "if len() == 1" check before, but here you don't.

Author

No good reason. I'll use a struct as above anyway.

}

func (r *bulkRemover) remove(id interface{}) error {
r.chunk.Remove(bson.D{{"_id", id}})
Member

It feels like we're going to end up removing the same txn id over and over and over again.
And while I realize we don't really want to cache all the txn ids in memory, I wonder about having a limited txn id set. Maybe a per-chunk set, and we skip items that we know are already in the set? We could probably use some hit-rate tracking here, but just a simple:

if r.thisChunkSet.Contains(id) {
  // We've already queued this one for removal
  return nil
}

And then thisChunkSet gets reset to the empty Set() in newChunk.

Thoughts?

Author

I guess some txns can involve a lot of documents, although most don't, so I don't know how much of an issue it really is. I'll try out the set and see how much it actually gets hit.

Note that this is only relevant when still-referenced txns are being removed from the working set. The actual removal from the txns collection, which also uses a bulkRemover, works from a unique set of txn ids.
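
A sketch of the per-chunk set being discussed (thisChunkSet and newChunk come from the suggestion above; flush is an illustrative name for the chunk-cycling step):

func (r *bulkRemover) remove(id interface{}) error {
	if _, ok := r.thisChunkSet[id]; ok {
		// Already queued for removal in this chunk.
		return nil
	}
	r.thisChunkSet[id] = struct{}{}
	r.chunk.Remove(bson.D{{"_id", id}})
	r.chunkSize++
	if r.chunkSize >= maxChunkSize {
		// Runs the bulk op and starts a new chunk, which resets
		// thisChunkSet to an empty set.
		return r.flush()
	}
	return nil
}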

prune.go Outdated
func (r *bulkRemover) remove(id interface{}) error {
r.chunk.Remove(bson.D{{"_id", id}})
r.chunkSize++
if r.chunkSize == maxChunkSize {
Member

As a general peeve of mine, I really prefer these checks to be:

if r.chunkSize >= maxChunkSize {
}

It's a bit of belt-and-suspenders checking, but "==" always has the catastrophic failure mode where you accidentally incremented twice and suddenly the trap never triggers because you walked right past it.

Author

Sure - I'll change that.

prune.go Outdated
// Load the ids of all completed and aborted txns into a separate
// temporary collection.
logger.Debugf("loading all completed transactions")
matchCompleted := bson.M{"$match": bson.M{"s": bson.M{"$in": []int{taborted, tapplied}}}}
Member

Another thing I just noted:

time db.txns.find({"s": 5}).count()
and
time db.txns.find({"s": 6}).count()
are quite fast, whereas
time db.txns.find({"s": {"$in": [5, 6]}}).count()
is very slow (db.txns.find().explain() still said it was using the index).
So doing 2 passes with explicit values might be significantly faster than the "$in" version.

Author

@mjs mjs Mar 30, 2017

db.txns.find({"s": {$gte: 5}}) is fast, but using that in the aggregation pipeline is still just as slow as $in, even though explain: true says it'll use the index. Very strange.

I've tried with MongoDB 2.6 and 3.2. Same problem with both.

Author

Doing two passes means a lot of extra work later (2 collections to remove txns from) so I'm just going to leave it using $gte.
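
So the match stage ends up as a single pass, roughly:

// taborted (5) and tapplied (6) are the highest mgo/txn state
// codes, so a single $gte covers exactly the completed states.
matchCompleted := bson.M{"$match": bson.M{"s": bson.M{"$gte": taborted}}}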

@mjs mjs force-pushed the 1676427-txn-pruning-memory branch from 432e544 to bcdca6e on March 31, 2017 at 01:19
@mjs mjs force-pushed the 1676427-txn-pruning-memory branch from bcdca6e to d12ab61 on March 31, 2017 at 01:56
Author

mjs commented Mar 31, 2017

$$merge$$

@mjs mjs merged commit 8d53f9d into juju:master Mar 31, 2017
@mjs mjs deleted the 1676427-txn-pruning-memory branch March 31, 2017 02:11