
RFC: Partial storage locking to increase 'storing signature' performance #421

Open
saschagrunert opened this issue Aug 22, 2019 · 25 comments


@saschagrunert
Member

This is more of a rough idea, and I'm wondering if you have had any thoughts on this as well.
It is somewhat related to #420.

We're currently locking the whole containers, images, and layers directories when doing write operations on the store. I think this could be improved, at the cost of a more complex locking mechanism.

The idea is to not lock the whole layer store, but only the layers which belong to a certain image. This would allow storing two images in parallel if they do not share any layers.
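
A rough sketch of the idea from the caller's side (everything here is hypothetical, not existing c/storage API):

```go
package main

import "sync"

// Hypothetical per-layer locks, keyed by layer ID. A real implementation
// would need cross-process locks (e.g. file locks), not in-process mutexes;
// this only illustrates the granularity.
var layerLocks sync.Map // map[string]*sync.Mutex

func lockLayer(id string) *sync.Mutex {
	l, _ := layerLocks.LoadOrStore(id, &sync.Mutex{})
	return l.(*sync.Mutex)
}

// lockLayersForImage takes only the locks of the layers one image uses, so
// two images that do not share layers can be stored in parallel.
func lockLayersForImage(layerIDs []string) (unlock func()) {
	var held []*sync.Mutex
	for _, id := range layerIDs {
		m := lockLayer(id)
		m.Lock()
		held = append(held, m)
	}
	return func() {
		for i := len(held) - 1; i >= 0; i-- {
			held[i].Unlock()
		}
	}
}

func main() {
	unlock := lockLayersForImage([]string{"layer-a", "layer-b"})
	defer unlock()
	// ... write the image and its layers here ...
}
```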

WDYT?

@vrothberg
Member

@nalind, WDYT?

That's roughly related to a PR over at c/image (containers/image#611) which does per-layer locking to avoid redundantly downloading a given blob. That PR isn't merged, and I doubt we ever will merge it, but the conclusion regarding the layer locks is that they are fine as long as they reside on a tmpfs (the number of lock files could grow large over time).

My 2 cents: fine-grained locking can be really tricky to implement (more critical sections to lock and get wrong) and can sometimes decrease performance (more locks need to be acquired, scheduling can hurt caches, etc.), but I'm convinced it would speed things up here.
One thing we might need to check is the risk of running into ABBA deadlocks (in case of multi-writer copy dependencies).
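
A common mitigation for ABBA deadlocks is to acquire the per-layer locks in a canonical order, e.g. sorted by layer ID, so two writers with overlapping layer sets never wait on each other in opposite order. Building on the hypothetical `lockLayer` sketch above (a fragment; assumes the `sort` package is imported):

```go
// Sorting the IDs first imposes a global acquisition order across all
// writers, which rules out ABBA cycles on overlapping layer sets.
ids := append([]string(nil), layerIDs...)
sort.Strings(ids)
for _, id := range ids {
	lockLayer(id).Lock()
}
```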

@rhatdan
Member

rhatdan commented Aug 22, 2019

Be careful about creating too many files on tmpfs; I think we found a cache leak in the kernel based on the number of files left in tmpfs. @mheon would know the specifics.

@vrothberg
Member

Ouch, yes that'll hurt. @rhatdan, any other way to do that? We need some way to clean them up.

@rhatdan
Member

rhatdan commented Aug 22, 2019

@vrothberg memory-based locking?

@vrothberg
Member

@rhatdan sounds complex in shared memory but possible.

One thing the afternoon coffee made me think about is how we could migrate. Assuming we have fine-grained locking in place, how could we remain backward compatible? Other tools with older versions of c/storage might be running in parallel.

@rhatdan
Member

rhatdan commented Aug 22, 2019

Well, they would still be taking the higher-level locks, so they could block the newer tools; the newer tools could not block the older ones.
I guess it would be best to have a flag that would block the older tools from running, but I am not sure we can do this.

@vrothberg
Member

New versions could take the big lock as readers; this way, old versions would block trying to acquire it as writers.
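
A minimal illustration of that ordering, assuming both old and new tools contend on the same store-wide lock file (the path and the use of flock(2) are made up for the sketch; c/storage manages its real lock files with its own code). A fragment from inside a hypothetical store function:

```go
// New-style tool: hold the store-wide lock shared while working, so old
// tools (which take it exclusively) block until we are done, and we block
// while they hold it. The fine-grained per-layer locks are then taken
// exclusively underneath. Uses golang.org/x/sys/unix; the path is illustrative.
big, err := os.OpenFile("/run/containers/storage/store.lock", os.O_CREATE|os.O_RDWR, 0o600)
if err != nil {
	return err
}
defer big.Close()
if err := unix.Flock(int(big.Fd()), unix.LOCK_SH); err != nil {
	return err
}
defer unix.Flock(int(big.Fd()), unix.LOCK_UN)
// ... acquire the per-layer locks exclusively and store the image ...
```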

@saschagrunert
Member Author

Sounds like a plan. I would still keep the locks close to the storage somehow. Would something like bbolt work out for this use case?

@vrothberg
Member

@mheon will know. It's crucial that the locks are non-persistent (in contrast to the containers, image, layer databases etc.). Is that possible with bolt?

@mheon
Member

mheon commented Aug 22, 2019

As long as the DB lives on a tmpfs, sure - that'll wipe the DB on reboot. If you need something even less persistent than that, you can close your DB connection, remove the DB, and reopen, and that ought to resolve things.

@saschagrunert
Member Author

saschagrunert commented Aug 22, 2019

As long as the DB lives on a tmpfs, sure - that'll wipe the DB on reboot. If you need something even less persistent than that, you can close your DB connection, remove the DB, and reopen, and that ought to resolve things.

I think we should still store the (let me call it) lock DB on a filesystem on disk to allow multiple processes (buildah, podman) to see if something is locked or not, right?

@rhatdan
Member

rhatdan commented Aug 22, 2019

Yes, I believe so.

@mheon
Member

mheon commented Aug 22, 2019

Hmmm. So in that case, you'd want to detect reboots, and then "unlock" all the locks?

Libpod does something similar (detect a reboot via a file in tmpfs disappearing, and then "refresh" the DB by clearing nonpersistent state). That might work?
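
For reference, the libpod-style check is roughly the following: keep a sentinel file on a tmpfs, and if it is missing at startup, the machine has rebooted and the non-persistent lock state can be cleared. A hedged sketch (the path is illustrative, not an existing c/storage location):

```go
package main

import "os"

// refreshAfterReboot reports whether the machine rebooted since the last
// run by checking for a sentinel file on a tmpfs, recreating it if needed.
func refreshAfterReboot() (rebooted bool, err error) {
	const sentinel = "/run/containers/storage/alive"
	if _, err := os.Stat(sentinel); err == nil {
		return false, nil // sentinel still present: no reboot since the last run
	} else if !os.IsNotExist(err) {
		return false, err
	}
	f, err := os.Create(sentinel) // /run is a tmpfs, so the file vanishes on reboot
	if err != nil {
		return false, err
	}
	f.Close()
	return true, nil
}

func main() {
	if rebooted, err := refreshAfterReboot(); err == nil && rebooted {
		// ... reset the non-persistent lock state here ...
	}
}
```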

@vrothberg
Member

@mheon, I'd just put the entire DB on a tmpfs.

@rhatdan
Member

rhatdan commented Aug 22, 2019

I agree putting the DB on tmpfs would work. containers/storage already relies on tmpfs for cleaning up mount counts.

@vrothberg
Member

containers/image#611 also places the layer locks on the storage tmpfs.

@saschagrunert
Member Author

So if we speak in terms of boltdb, the database would have three buckets for layers, containers, and images, where the entries indicate (per ID) whether it's a read or write lock, right?
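
For context, the bucket layout I have in mind would look roughly like this with bbolt (purely illustrative; the path is made up, and whether bolt is the right tool for cross-process locking at all is an open question):

```go
package main

import bolt "go.etcd.io/bbolt"

func main() {
	// Open a lock database on a tmpfs so it disappears on reboot.
	db, err := bolt.Open("/run/containers/storage/locks.db", 0o600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// One bucket per object type; entries would map an ID to its lock state.
	if err := db.Update(func(tx *bolt.Tx) error {
		for _, name := range []string{"layers", "containers", "images"} {
			if _, err := tx.CreateBucketIfNotExists([]byte(name)); err != nil {
				return err
			}
		}
		return nil
	}); err != nil {
		panic(err)
	}
}
```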

@vrothberg
Member

I am not actually sure how we could implement process-locking with bolt. We still need file-locks, which makes me believe that we can follow the idea of containers/image#611 and put them in the storage's tmpfs and reference them by convention (e.g., layers in ./layer-locks, images in ./image-locks etc.).

@saschagrunert
Member Author

I am not actually sure how we could implement process-locking with bolt. We still need file-locks, which makes me believe that we can follow the idea of containers/image#611 and put them in the storage's tmpfs and reference them by convention (e.g., layers in ./layer-locks, images in ./image-locks etc.).

That also sounds good; some additional questions from my side:

  • Which file locks would we need additionally when trying to go via the database?
  • What could the convention look like? Are we storing the locks per line or in some serialized hash map format?
  • Would it be faster than a key value store?

@vrothberg
Member

* Which file locks would we need additionally when trying to go via the database?

I assume we would need one lock to serialize accesses to the database (which would turn into a bottleneck). I would not know how to implement a synchronization mechanism in a database without reimplementing a mutex. That's why I'm more inclined to just use file locking.

There is an open PR against storage to add a flock package (i.e., #378) which allows for try-lock semantics that might come in handy at some point. @rhatdan, maybe we should revive that PR?

* What could the _convention_ look like? Are we storing the locks per line or in some serialized hash map format?

In containers/image#611 we create file locks in a specific folder; the file names are the digests of the blobs. We can do the same for containers, images, and layers if we want to.

If we want to acquire a layer as a reader, we would do an RLock() on the lock file (e.g., /storage-tmps/$storage-driver/layer-lock/$digest-of-layer.lock).

* Would it be faster than a key value store?

I bet that file locking performs better, as we don't need to load the entire DB in each process but only use the lock we actually need. File locks are real synchronization primitives with proper scheduling, signaling, etc. If we tried to emulate them with a DB, we'd run into all kinds of issues regarding fairness, scheduling, signaling, and so on.
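
A hedged sketch of that per-digest lock-file convention using plain flock(2) (the directory layout, names, and use of flock are illustrative; the real code would presumably go through the storage's own lockfile helpers):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// lockLayerFile opens (or creates) the per-layer lock file on the storage
// tmpfs and acquires it shared (readers) or exclusive (writers).
func lockLayerFile(digest string, exclusive bool) (*os.File, error) {
	dir := "/run/containers/storage/layer-locks" // hypothetical tmpfs location
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return nil, err
	}
	f, err := os.OpenFile(filepath.Join(dir, digest+".lock"), os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	how := unix.LOCK_SH // reader
	if exclusive {
		how = unix.LOCK_EX // writer
	}
	if err := unix.Flock(int(f.Fd()), how); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	// Reader example: hold the layer lock shared while checking whether the
	// layer already exists; a writer committing the layer would pass true.
	f, err := lockLayerFile("sha256-deadbeef", false)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close() // closing the file releases the flock
	// ... use the layer ...
}
```

Adding unix.LOCK_NB to the flock call would also give the try-lock behavior mentioned above for #378.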

@rhatdan
Member

rhatdan commented Nov 1, 2019

@saschagrunert @vrothberg Still something we want to pursue?

@saschagrunert
Member Author

Yes, but we're still waiting for feedback from @nalind :)

@vrothberg
Member

@saschagrunert @vrothberg Still something we want to pursue?

Yes, definitely worth pursuing. Layer-locking is something we might need to generalize to cover other scenarios as well (e.g., pull-delete races, simultaneous pulls, etc.).

@rhatdan
Member

rhatdan commented Aug 3, 2020

Also, this issue is still waiting for #473, correct?

@saschagrunert
Member Author

Also, this issue is still waiting for #473, correct?

Basically, yes. Technically the features could live on their own, but I see #473 as a pre-step for this enhancement.
