layerStore.Load under layerStore.RLock is racy #1332
@nalind @haircommander I’ve been told that performance with in-process concurrency is a major concern in this area. Is there any way I could quantify that? Ideally a benchmark I can run on-demand? Or specific code paths that matter? (I assume that’s mostly CRI-O, which nowadays has its own in-memory caching for both images and sandboxes. Which parts of CRI-O?)
I don't think we have a good reproducer for storage deadlocks/lock contention/races as of now. We've only seen these situations in production so far.
A tentative RFC: Currently, all getters within a single process lock a […]. I propose to, more or less, turn […].
(And similarly for the container and image stores.)
@haircommander To clarify, I’m not looking for a test of correctness, I’m looking for benchmarks to quantify performance, at least to the extent it is relevant for CRI-O.
Note to self: The above, and #1346, do nothing for #1136. But if we had all readers going through a few wrappers, we could look into handling incomplete layers in there. Maybe lock the store inter-process for reading only, load, and if there are incomplete layers, abort, lock inter-process for writing, clean the layers up, and then do the actual data read. The point would be to structure things so that we have confidence that […]
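A rough Go sketch of that read path, with an in-process `sync.RWMutex` standing in for the inter-process lock file; `startReading`, `load`, `hasIncompleteLayers`, and `cleanupIncompleteLayers` are placeholder names, not the actual containers/storage API:

```go
package main

import "sync"

type layerStore struct {
	lock sync.RWMutex // stand-in for the inter-process lock file
	// ... layer data such as byid would live here ...
}

// Placeholders for the real operations.
func (r *layerStore) load() error                    { return nil } // re-read layers.json
func (r *layerStore) hasIncompleteLayers() bool      { return false }
func (r *layerStore) cleanupIncompleteLayers() error { return nil }

// startReading returns with the read lock held; the caller reads the data
// and then releases the lock.
func (r *layerStore) startReading() error {
	r.lock.RLock()
	if err := r.load(); err != nil {
		r.lock.RUnlock()
		return err
	}
	if !r.hasIncompleteLayers() {
		return nil // common case: read lock is held, caller can read
	}

	// Abort the read attempt and escalate to a write lock.
	r.lock.RUnlock()
	r.lock.Lock()
	// Re-load: another process may have changed things while we were unlocked.
	if err := r.load(); err != nil {
		r.lock.Unlock()
		return err
	}
	if r.hasIncompleteLayers() {
		if err := r.cleanupIncompleteLayers(); err != nil {
			r.lock.Unlock()
			return err
		}
	}
	r.lock.Unlock()

	// Retake the shared lock and load the now-consistent data.
	r.lock.RLock()
	if err := r.load(); err != nil {
		r.lock.RUnlock()
		return err
	}
	return nil
}

func main() {
	r := &layerStore{}
	if err := r.startReading(); err != nil {
		panic(err)
	}
	defer r.lock.RUnlock()
	// ... actual data read happens here ...
}
```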
I guess a simple metric could be the cri-o e2e test time. Theoretically, any additional lock contention would cause them to run slower. They currently run ~1 hour. Past that, if you get me a library to vendor, we could try some manual tests. We don't currently have a benchmarking suite handy.
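Absent a dedicated benchmarking suite, one option is a small `go test -bench` sketch along these lines to put a number on read-path contention. It only assumes the public `storage.GetStore` / `Store.Layers` API; the empty temporary store is a placeholder, and a store populated with realistic layers would give more representative numbers:

```go
package storagebench

import (
	"testing"

	"github.com/containers/storage"
)

// BenchmarkConcurrentLayerReads hammers the store's read path from many
// goroutines, so any additional contention on the read lock should show
// up directly in ns/op.
func BenchmarkConcurrentLayerReads(b *testing.B) {
	store, err := storage.GetStore(storage.StoreOptions{
		GraphRoot: b.TempDir(), // placeholder: an empty throwaway store
		RunRoot:   b.TempDir(),
	})
	if err != nil {
		b.Skipf("could not create a store: %v", err)
	}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// Each call takes the layer store's read lock and reloads if needed.
			if _, err := store.Layers(); err != nil {
				b.Error(err)
				return
			}
		}
	})
}
```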
The original intent seems to be #898 (comment); that seems to work for the container and image stores, where reloads are only driven by […]
From #1322:

`store.Mounted` (locks `layerStore` for reading) → `ReloadIfChanged` (locks `loadMut`, irrelevant because there is no other concurrent `ReloadIfChanged`) → `Load` (locks nothing, updates things like `r.byid`)

`store.Layer` (locks `layerStore` for reading) → `Get` (locks nothing, reads things like `r.byid`)

The read lock is shared, right? So there is no exclusion of `Load` modifying the data structure vs. any other reader of the store AFAICT.

I just can’t see how calling `ReloadIfChanged` with a shared reader lock is supposed to work in this case. My initial guess is that it is not supported at all to use the same store from multiple goroutines, and each goroutine should have its own store — but that’s clearly not correct, `store.GetStore` carefully returns a pre-existing matching store instance for concurrent callers.
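For illustration only (this is not containers/storage code), a stripped-down program that reproduces the pattern described above: a `Load`-style rebuild of a `byid` map and a `Get`-style lookup both run under the shared read lock, and `go run -race` reports the conflict as a data race:

```go
package main

import "sync"

type store struct {
	mu   sync.RWMutex
	byid map[string]string
}

// reloadLocked mimics Load: it is called with only the read lock held,
// yet it replaces r.byid.
func (r *store) reloadLocked() {
	r.byid = map[string]string{"layer1": "data"}
}

// get mimics Get: it reads r.byid under the same shared read lock.
func (r *store) get(id string) string {
	return r.byid[id]
}

func main() {
	r := &store{byid: map[string]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		writer := i == 0
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				r.mu.RLock() // shared: does NOT exclude the other RLock holder
				if writer {
					r.reloadLocked() // Load-style mutation under RLock
				} else {
					_ = r.get("layer1") // Get-style read under RLock
				}
				r.mu.RUnlock()
			}
		}()
	}
	wg.Wait()
}
```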