
chunked: fix reuse of the layers cache #2024

Merged
6 commits merged into containers:main on Sep 19, 2024

Conversation

giuseppe (Member)

The global singleton was never updated, causing the cache to be recreated for every layer.

It is not possible to hold the layersCache mutex for the entire load(), because load() calls into some store APIs; since findDigestInternal() is already called while some store locks are held, doing so would cause a deadlock.

Closes: #2023
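
For context, here is a minimal, self-contained sketch of the bug (not taken from the PR; the real getLayersCacheRef lives in pkg/chunked/cache_linux.go and takes a storage.Store, which is omitted here): declaring the variable with := created a new local that shadowed the package-level singleton, so the global was never set.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for the real layersCache type.
type layersCache struct {
	refs int
}

var (
	cacheMutex sync.Mutex
	cache      *layersCache // package-level singleton
)

func getLayersCacheRef() *layersCache {
	cacheMutex.Lock()
	defer cacheMutex.Unlock()
	if cache != nil {
		cache.refs++
		return cache
	}
	// Before the fix, `cache := &layersCache{refs: 1}` declared a new local
	// that shadowed the package-level `cache`, so the singleton stayed nil
	// and the cache was rebuilt on every call.
	cache = &layersCache{refs: 1} // the fix: plain assignment updates the global
	return cache
}

func main() {
	a := getLayersCacheRef()
	b := getLayersCacheRef()
	fmt.Println(a == b, b.refs) // true 2: the second call reuses the singleton
}
```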

openshift-ci bot (Contributor) commented Jul 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rhatdan (Member) commented Jul 13, 2024

LGTM
@mtrmac PTAL

@giuseppe giuseppe marked this pull request as ready for review on July 15, 2024 11:45
@giuseppe giuseppe changed the title from "[WIP] chunked: fix reuse of the layers cache" to "chunked: fix reuse of the layers cache" on Jul 15, 2024
mtrmac (Collaborator) left a comment

Thanks for handling this!

A fairly brief skim for now…


(It would also be convenient to have the locking rules documented in detail — e.g. which fields of layersCache and layers are immutable, and which are protected by a lock (and by which one).

Especially the lock hierarchy of layersCache.mutex vs. the locks of store/layerStore seems fairly non-obvious to me, given how far the code locations are from each other and how dissimilar the situation seems.

But all of that is blocked on having settled on design in the first place.)
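
As an illustration only (not from the PR), the kind of field-by-field locking documentation being requested might look like this; the field names, the lock-ordering note, and the simplified types are assumptions:

```go
package cachedoc

import "sync"

// layer is a stand-in for the real per-layer metadata type.
type layer struct {
	id string
}

// layersCache shows the requested documentation style: every field states
// whether it is immutable or which lock guards it. The field set here is
// illustrative, not the actual layout.
type layersCache struct {
	// storeID is set at construction time and never changes afterwards
	// (immutable, safe to read without any lock).
	storeID string

	// mutex protects all fields below. Example ordering rule: code holding
	// mutex must not call back into store/layerStore APIs that take their
	// own locks, to avoid the deadlock discussed in this PR.
	mutex  sync.Mutex
	refs   int      // guarded by mutex
	layers []*layer // guarded by mutex
}
```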

@@ -111,7 +111,7 @@ func getLayersCacheRef(store storage.Store) *layersCache {
 		cache.refs++
 		return cache
 	}
-	cache := &layersCache{
+	cache = &layersCache{
mtrmac (Collaborator)

(I wonder about making this kind of mistake harder to do … naming the global cacheSingleton, globalCache, or something, might make it less likely to conflict with a local. OTOH that’s very likely an overreaction.)

(outdated review thread on pkg/chunked/cache_linux.go, resolved)
TomSweeneyRedHat (Member)
@giuseppe @mtrmac this too looks like a good one to get in for the vendor dance. Would it be possible to wrap this up in the next day or so, or should we consider it for the 1.55.1 release (1.55.0 is the vendor)?

@TomSweeneyRedHat TomSweeneyRedHat added the 5.2 Wanted for Podman v5.2 label Jul 24, 2024
TomSweeneyRedHat (Member)
If this is merged on or before August 12, 2024, please cherry-pick this to the release-1.55 branch.

cgwalters (Contributor)

> @giuseppe @mtrmac this too looks like a good one to get in for the vendor dance.

This is just my opinion, but basically this is attempting to optimize zstd:chunked somewhat, and it introduces some open questions around concurrency. It doesn't make sense to rush it, and if it is merged, it's the type of change that should only be cherry-picked if there is a known reason to do so.

@cgwalters cgwalters removed the 5.2 Wanted for Podman v5.2 label Jul 29, 2024
@giuseppe giuseppe force-pushed the fix-reuse-of-cache branch 2 times, most recently from c324874 to 2b77871, on September 12, 2024 09:36
giuseppe (Member, Author)

I think the new version is a reasonable solution to the deadlock issues.

I've added a patch to split ApplyDiffWithDiffer so that the heavy I/O part (and everything that calls into the store) is now done without the callers holding any lock on the store itself (see the sketch after this list).

There are some other nice side effects:

  • ApplyDiffWithDiffer can be used in parallel
  • It solves the problem where multiple load() calls run at the same time and do the same work
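
A rough sketch of that split, under assumed names (stageLayerWithoutLocks and commitStagedLayer are hypothetical helpers, not the actual containers/storage API): the heavy I/O happens with no store lock held, and the lock is taken only for the short metadata update at the end.

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu     sync.Mutex
	layers map[string]string // layer ID -> staged location (simplified)
}

// stageLayerWithoutLocks does the expensive work (fetching and assembling
// the chunked layer) with no store lock held, so several layers can be
// staged concurrently and helpers can call back into the store safely.
func (s *store) stageLayerWithoutLocks(id string) (string, error) {
	stagingDir := "/tmp/staging-" + id // placeholder for the heavy I/O
	return stagingDir, nil
}

// commitStagedLayer takes the store lock only for the brief metadata update
// that records the already-staged layer.
func (s *store) commitStagedLayer(id, stagingDir string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.layers[id] = stagingDir
	return nil
}

func main() {
	s := &store{layers: map[string]string{}}
	dir, err := s.stageLayerWithoutLocks("layer1")
	if err == nil {
		err = s.commitStagedLayer("layer1", dir)
	}
	fmt.Println(s.layers, err)
}
```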

@giuseppe giuseppe force-pushed the fix-reuse-of-cache branch 2 times, most recently from ee8e7f4 to 6dc4f68, on September 12, 2024 12:03
mtrmac (Collaborator) left a comment

I like the approach but I don’t think it works with the current API.

(Start with the ⚠️ comment; the other review comments are ~irrelevant at this point.)

(outdated review threads on store.go and layers.go, resolved)
@giuseppe giuseppe force-pushed the fix-reuse-of-cache branch 3 times, most recently from 918081c to ed6b4c1, on September 16, 2024 07:44
(outdated review threads on store.go and layers.go, resolved)
giuseppe (Member, Author)

@mtrmac thanks for the review. I've addressed the comments in the latest version.

(outdated review thread on store.go, resolved)
@giuseppe giuseppe force-pushed the fix-reuse-of-cache branch 2 times, most recently from 455b8dc to 7fceea5, on September 17, 2024 19:12
mtrmac (Collaborator) left a comment

LGTM. Thanks!

(outdated review thread on store.go, resolved)
Signed-off-by: Giuseppe Scrivano <[email protected]>
it is not clear if it is needed, so simplify it.

Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
the global singleton was never updated, causing the cache to be recreated
for every layer.

It is not possible to hold the layersCache mutex for the entire load(),
because load() calls into some store APIs; since findDigestInternal() is
already called while some store locks are held, doing so would cause a
deadlock.

Another benefit is that now only one goroutine can run load(), which
prevents multiple calls to load() from happening in parallel and doing the
same work.

Closes: containers#2023

Signed-off-by: Giuseppe Scrivano <[email protected]>
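
To illustrate the load() serialization mentioned above (a simplified sketch with assumed types, not the real cache structure): because the heavy work no longer runs while store locks are held, the cache mutex can be held across load(), so concurrent callers wait for the single in-flight load and then reuse its result.

```go
package main

import (
	"fmt"
	"sync"
)

type layersCache struct {
	mu     sync.Mutex
	loaded bool
	layers []string // stand-in for the real per-layer metadata
}

// load is serialized by mu: the first caller does the expensive scan,
// every later caller finds loaded == true and reuses the result.
func (c *layersCache) load() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.loaded {
		return c.layers
	}
	c.layers = []string{"layer-a", "layer-b"} // placeholder for the real scan
	c.loaded = true
	return c.layers
}

func main() {
	c := &layersCache{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.load()
		}()
	}
	wg.Wait()
	fmt.Println(c.load()) // loaded exactly once, reused by every caller
}
```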
mtrmac (Collaborator) left a comment

/lgtm

Thanks again!

@openshift-ci openshift-ci bot added the lgtm label Sep 19, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 1959722 into containers:main Sep 19, 2024
18 checks passed
Development

Successfully merging this pull request may close these issues:

  • chunked layersCache seems to never be actually reused

5 participants