
Deduplicate cache fill requests to avoid cache stampede in proxy #308

Closed

Kunde21 opened this issue Jul 19, 2018 · 6 comments
Labels: enhancement (New feature or request), proxy (Work to do on the module proxy)

Kunde21 commented Jul 19, 2018

Large and heavily-used projects will have users with automated update mechanisms, leading to a flood of requests for the newest tag. Proxy & Olympus need to be able to identify duplicate requests and only fill the cache from VCS once.

Initially, we can simply block all duplicate requests until the module is downloaded (#290), but this will become more complicated when working with streams (#281).

@Kunde21 Kunde21 changed the title Reduplicate cache fill requests to avoid cache stampede in proxy Deduplicate cache fill requests to avoid cache stampede in proxy Jul 19, 2018

arschles commented Jul 19, 2018

Totally agree. Taking inspiration from https://github.com/golang/groupcache, I have a rough proposal (which I'm pretty sure needs work!).

The basic idea is that we build a distributed lease on which any process can watch for release events. Then, when a server receives a go get module@version request:

  1. Check for module@version in storage. If it exists, return it and stop. If not, continue
  2. Attempt to acquire lease for module@version
  3. If it's acquired, fetch module@version, return it to client, fill storage, release lease and return
  4. If it's not acquired, subscribe to the release event for module@version. When it fires, go to 1

These are leases, not locks, so they should have timeouts. A few notes (a rough sketch of the whole flow follows below):

  • If we get to (4) and the release event fires, module@version might not have been stored (if the process that had the lease failed)
  • We should have a maximum number of retries until failure, because (4) -> (1) could otherwise loop forever if things keep failing
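
To make the proposal concrete, here is a rough Go sketch of that request flow. The `Storage` and `LeaseTable` interfaces, `ErrNotFound`, and the `fetch` callback are all hypothetical names for illustration, not anything from the Athens codebase:

```go
package proxy

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrNotFound is what Storage.Get returns for a missing module@version.
var ErrNotFound = errors.New("module@version not found in storage")

// Storage and LeaseTable are hypothetical interfaces, just to make the flow concrete.
type Storage interface {
	Get(ctx context.Context, mod, ver string) ([]byte, error)
	Save(ctx context.Context, mod, ver string, zip []byte) error
}

type LeaseTable interface {
	Acquire(ctx context.Context, key string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, key string) error
	WaitForRelease(ctx context.Context, key string) error // returns on release or lease expiry
}

const maxAttempts = 5 // step (4) -> (1) must not loop forever

func fill(ctx context.Context, st Storage, leases LeaseTable,
	fetch func(context.Context, string, string) ([]byte, error), mod, ver string) ([]byte, error) {
	key := mod + "@" + ver
	for i := 0; i < maxAttempts; i++ {
		// 1. Check storage first; if the module is there, we're done.
		zip, err := st.Get(ctx, mod, ver)
		if err == nil {
			return zip, nil
		}
		if !errors.Is(err, ErrNotFound) {
			return nil, err
		}

		// 2. Try to acquire the lease for module@version.
		acquired, err := leases.Acquire(ctx, key, 2*time.Minute)
		if err != nil {
			return nil, err
		}

		if acquired {
			// 3. We own the lease: fetch, fill storage, release (deferred), return.
			defer leases.Release(ctx, key)
			zip, err := fetch(ctx, mod, ver)
			if err != nil {
				return nil, err
			}
			if err := st.Save(ctx, mod, ver, zip); err != nil {
				return nil, err
			}
			return zip, nil
		}

		// 4. Someone else holds the lease: wait for the release event, then go to 1.
		// The holder may have failed, so the re-check at 1 can still miss.
		if err := leases.WaitForRelease(ctx, key); err != nil {
			return nil, err
		}
	}
	return nil, fmt.Errorf("no fill for %s after %d attempts", key, maxAttempts)
}
```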

Some implementation notes on the distributed lease table:

  • Instead of requiring the table to fire an event, waiter processes can instead poll a timestamp to determine whether the lease has expired
  • The "lease" can be implemented by doing an atomic check-and-set on a key/value store


Kunde21 commented Jul 19, 2018

I'm hesitant to put the fetch logic on the request side of the interaction. If we keep the fetch on the storage side, we can keep the retry logic there. That also allows us to hand failed requests to a background worker that performs the retry with back-off and/or fills from elsewhere in the Athens system.

As a wrapper around the storage backend, the "cache-warmer" would implement the storage.Lister and storage.Getter interfaces with the following flow (where "or timeout" means the incoming request's context was cancelled or timed out):

  1. Check for an in-flight module@version cache fill request. If one exists, return its result when the fill request completes (or timeout).
  2. If no cache fill request exists, check for module@version in the storage backend. If it exists, return the module from storage.
  3. If the module doesn't exist in storage, register a cache fill request and return its result when the fill request completes (or timeout).
  4. The cache-warmer attempts to fetch the module (up to N retries - configurable). If successful, return the module and put it in storage.
  5. On failure, return the error to all requests waiting on that module. The cache fill request can be dropped, added to a back-off list with a deadline, added to an async fill queue, etc.

All of this can live between the storage backend interface and the selected backend, giving us flexibility to change how we fill the cache and return results.
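
A minimal sketch of such a wrapper, with a hypothetical `Backend` interface and `fetch` callback standing in for the real Athens storage interfaces and VCS download path:

```go
package warmer

import (
	"context"
	"sync"
)

// Backend is a hypothetical slice of the storage interface the warmer wraps;
// the real interfaces would be storage.Getter/storage.Lister from Athens.
type Backend interface {
	Get(ctx context.Context, mod, ver string) ([]byte, error) // errors when absent
	Save(ctx context.Context, mod, ver string, zip []byte) error
}

// call tracks one in-flight cache fill; done is closed to broadcast to all waiters.
type call struct {
	done chan struct{}
	zip  []byte
	err  error
}

// CacheWarmer sits between the storage interface and the selected backend,
// deduplicating concurrent fills for the same module@version.
type CacheWarmer struct {
	backend  Backend
	fetch    func(ctx context.Context, mod, ver string) ([]byte, error) // VCS fetch, with its own retry/back-off
	mu       sync.Mutex
	inflight map[string]*call
}

func (c *CacheWarmer) Get(ctx context.Context, mod, ver string) ([]byte, error) {
	key := mod + "@" + ver

	// 1. A fill request is already in flight: wait for it (or the request context).
	c.mu.Lock()
	if cl, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		return await(ctx, cl)
	}
	c.mu.Unlock()

	// 2. No fill in flight: check the storage backend.
	if zip, err := c.backend.Get(ctx, mod, ver); err == nil {
		return zip, nil
	}

	// 3. Not in storage: register a fill request (first caller wins) and wait on it.
	c.mu.Lock()
	cl, ok := c.inflight[key]
	if !ok {
		cl = &call{done: make(chan struct{})}
		c.inflight[key] = cl
		go c.fill(key, mod, ver, cl)
	}
	c.mu.Unlock()
	return await(ctx, cl)
}

// fill performs steps 4 and 5: fetch from VCS, save to storage, notify all waiters.
func (c *CacheWarmer) fill(key, mod, ver string, cl *call) {
	ctx := context.Background() // the fill outlives any single request's context
	cl.zip, cl.err = c.fetch(ctx, mod, ver)
	if cl.err == nil {
		cl.err = c.backend.Save(ctx, mod, ver, cl.zip)
	}
	c.mu.Lock()
	delete(c.inflight, key) // on failure this could instead go to a back-off list or async fill queue
	c.mu.Unlock()
	close(cl.done)
}

func await(ctx context.Context, cl *call) ([]byte, error) {
	select {
	case <-cl.done:
		return cl.zip, cl.err
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```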

I think if we focus on deduplication at the instance level, we can scale that out with distributed leases across larger deployments later. We'll already be reducing the traffic on the VCS site by doing so, while getting the single-instance proxy ready for users to test locally and in-org.

@arschles added the enhancement, registry, and proxy labels Jul 20, 2018
@marwan-at-work

I'd like to grab this if no one has worked on it yet. P.S. This is the package you're talking about, @arschles, right? https://github.com/golang/groupcache/tree/master/singleflight
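
For reference, a minimal sketch of how that singleflight package could be applied to the fill path; `fillFromVCS` is a placeholder for the proxy's real download-and-store logic:

```go
package proxy

import "github.com/golang/groupcache/singleflight"

var group singleflight.Group

// getModule deduplicates concurrent fills for the same module@version: all
// callers sharing the key get the result of a single fillFromVCS call.
func getModule(mod, ver string) ([]byte, error) {
	v, err := group.Do(mod+"@"+ver, func() (interface{}, error) {
		return fillFromVCS(mod, ver)
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}

// fillFromVCS is a stand-in for downloading module@version and saving it to storage.
func fillFromVCS(mod, ver string) ([]byte, error) { return nil, nil }
```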


michalpristas commented Aug 13, 2018

We should think about whether or not the proxy will be scaled 1:1 with the underlying DB.
If not (adding more processing power - more instances with a shared DB), we should think about some distributed lock here,
something like watches in ZooKeeper or Consul.
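
As one possible shape for that, a sketch using the Consul lock API (github.com/hashicorp/consul/api), which handles the waiting/watching for us; `fillFromVCS` is again a placeholder, and the key layout is made up:

```go
package proxy

import (
	consul "github.com/hashicorp/consul/api"
)

// fillWithLock serializes cache fills for one module@version across proxy
// instances that share a DB, using a Consul session lock.
func fillWithLock(mod, ver string) error {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		return err
	}
	lock, err := client.LockKey("athens/fill/" + mod + "@" + ver)
	if err != nil {
		return err
	}
	if _, err := lock.Lock(nil); err != nil { // blocks until the lock is acquired
		return err
	}
	defer lock.Unlock()

	// Another instance may have finished the fill while we waited, so the real
	// code should re-check storage here before fetching from VCS.
	_, err = fillFromVCS(mod, ver)
	return err
}

// fillFromVCS is a stand-in for downloading module@version and saving it to storage.
func fillFromVCS(mod, ver string) ([]byte, error) { return nil, nil }
```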

@marwan-at-work

@michalpristas does the proxy need to be scalable? It's meant to run internally, so 1 instance of the proxy server seems OK to me for the MVP. But I'm happy to do a distributed lock thing if people wanna scale the proxy; it sounds more fun to implement anyway :)

@michalpristas

That's the question; we should elaborate more on that.
For the MVP I guess it's fine to think of the proxy as a single-instance thing, but I would like to keep the discussion going. As you said, this is where it gets REALLY interesting.
Can't put enough emphasis on "really".
