
Caching improvements? #2

Closed
m-mohr opened this issue Jul 26, 2021 · 8 comments

Comments

@m-mohr
Member

m-mohr commented Jul 26, 2021

Right now the cache TTL seems to be 5 minutes, which means that (due to the low request frequency) almost every request has to refresh the cache, which leads to long loading times on the website and in the clients. For example, the Web Editor takes roughly 5 seconds to connect without a (server-side) cache hit and under a second with one.

A 5-minute cache TTL seems pretty low; do we really expect metadata (collections, processes, file formats, ...) to change that frequently? 60 minutes or even a day could be reasonable, too. I'd then suggest refreshing the data via a cron job and always returning cached data to the user, so that loading times are consistent.

Origin: https://github.com/openEOPlatform/architecture-docs/issues/22#issuecomment-886254097
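
For illustration, a minimal sketch of that cron-refresh idea (the function names and cache keys below are hypothetical, not part of the aggregator code): a background job periodically repopulates the cache, and request handlers only ever read from it, so response times stay flat even when a refresh is slow or fails.

```python
import threading
import time

# Hypothetical fetch functions standing in for real requests to the back-ends.
def fetch_collections():
    return ["collection-a", "collection-b"]

def fetch_processes():
    return ["load_collection", "reduce_dimension"]

FETCHERS = {"collections": fetch_collections, "processes": fetch_processes}
CACHE = {}  # key -> last successfully fetched payload

def refresh_metadata():
    """Refresh job: fetch fresh metadata, only replacing cache entries on success."""
    for key, fetch in FETCHERS.items():
        try:
            CACHE[key] = fetch()
        except Exception:
            pass  # keep serving the previous entry if a back-end is down

def start_refresh_loop(interval_seconds=3600):
    """Poor man's cron: refresh in a background thread, e.g. hourly."""
    def loop():
        while True:
            refresh_metadata()
            time.sleep(interval_seconds)
    threading.Thread(target=loop, daemon=True).start()

def get_collections():
    # Request handlers never refresh anything themselves; they always return
    # whatever is currently cached, so response latency stays consistent.
    return CACHE.get("collections", [])
```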

@soxofaan
Member

That minute-level cache TTL was just for initial development purposes.

A TTL on the order of hours indeed makes more sense now that things are getting more concrete.

@m-mohr
Member Author

m-mohr commented Sep 1, 2021

Recently some back-ends have had issues, and those were fed through to the aggregator, which sometimes responded with errors or timeouts.

So in addition to the cron job proposal above, I think it's also a good idea to only clear the cache once a successful request has been made (or the cached data is very outdated), so that people can still retrieve metadata even while a back-end is temporarily offline. Metadata requests should always be delivered in a second or less, instead of occasionally taking up to a minute as they do right now.

soxofaan self-assigned this Sep 1, 2021
@soxofaan
Member

A couple of ideas to improve the caching in the aggregator:

  • Find a way to flush all or a subset of the caches.
    • The easy way would be through some custom HTTP endpoint. However: at the moment caching is done in memory and there are already multiple (gunicorn) workers in play (and we might scale up to a load-balanced setup on multiple machines), so a single "flush cache" HTTP request will not work.
    • Centralize the cache in memcached, redis, ... to allow easy cache flushing across multiple instances/workers.
  • Smarter caching, to handle temporary glitches better (see the sketch below):
    • If a cache entry is outdated: keep it anyway if it cannot be updated yet due to a hard failure.
    • If there is a minor failure in one of the back-ends: cache the result with a shorter TTL than the default.
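
A minimal sketch of that "smarter caching" idea (illustrative names like `SOFT_TTL` and `get_cached`, not the aggregator's actual API): entries carry a soft TTL, and when a refresh fails the stale value is kept and served instead of propagating the back-end error.

```python
import time

SOFT_TTL = 60 * 60   # normal refresh interval, e.g. 1 hour
ERROR_TTL = 5 * 60   # retry sooner after a failed refresh

_cache = {}  # key -> (value, expires_at)

def get_cached(key, fetch):
    """Return a cached value; refresh when the soft TTL has expired, but fall
    back to the stale value if the refresh fails (back-end glitch, timeout, ...)."""
    now = time.time()
    entry = _cache.get(key)
    if entry and now < entry[1]:
        return entry[0]
    try:
        value = fetch()
        _cache[key] = (value, now + SOFT_TTL)
        return value
    except Exception:
        if entry:
            # Refresh failed: keep serving the outdated entry and retry soon.
            _cache[key] = (entry[0], now + ERROR_TTL)
            return entry[0]
        raise  # nothing cached yet, so the error has to surface
```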

@m-mohr
Member Author

m-mohr commented Sep 27, 2021

This is basically what we already do in the openEO Hub, although that is all JS with a MongoDB in the background and daily crawling via a cron job. There we have logic for exactly these cases: the data in the DB is not cleared until new data successfully comes in, or until there has been a certain number of consecutive failures. So if you need some insights into that, feel free to contact @christophfriedrich.

@soxofaan
Member

Note: under #15 (83cfb7b), the number of gunicorn workers was increased to 10, which means that there are currently 10 separate processes, each with its own in-memory cache (containing the same things). Sharing the cache would be better for performance (right now it can take long to warm up all the caches) and consistency (different workers might serve different cached data).
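
For illustration (not what the aggregator ended up doing; ZooKeeper-based caching was merged later in this thread), sharing the cache across gunicorn workers could look roughly like this with redis-py, where all worker processes read and write the same keys:

```python
import json

import redis

# One Redis instance shared by all gunicorn workers (and machines behind a
# load balancer), so an entry written by one worker is visible to all of them.
client = redis.Redis(host="localhost", port=6379)

def get_metadata(key, fetch, ttl=3600):
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    value = fetch()
    client.setex(key, ttl, json.dumps(value))
    return value
```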

@soxofaan
Member

soxofaan commented Jun 3, 2022

Another idea to take into account: do the caching at the proxy/load-balancer level instead of in the Flask app itself.
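
One possible way to enable that (a sketch, not how the aggregator actually does it): have the Flask app emit `Cache-Control` headers on metadata endpoints, so a caching reverse proxy (nginx, Varnish, ...) can store the responses without the app caching anything itself. The endpoint and TTL below are illustrative.

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/collections")
def collections():
    # Illustrative payload; in reality this would be aggregated from the back-ends.
    resp = jsonify({"collections": []})
    # Let a caching reverse proxy store the response for an hour instead of
    # caching it inside the Flask app.
    resp.cache_control.public = True
    resp.cache_control.max_age = 3600
    return resp
```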

soxofaan added a commit that referenced this issue Aug 5, 2022
soxofaan added a commit that referenced this issue Aug 8, 2022
soxofaan added a commit that referenced this issue Aug 8, 2022
For most unit tests we want no caching or simple dict based caching
soxofaan added a commit that referenced this issue Aug 10, 2022
start using memoizers in MultiBackendConnection and AggregatorBackendImplementation
refactor memoizers some more to support this properly
improve test coverage
soxofaan added a commit that referenced this issue Aug 10, 2022
soxofaan added several commits that referenced this issue Sep 19, 2022
@soxofaan
Member

merged initial usage of zookeeper based caching in 99332f7

soxofaan added a commit that referenced this issue Sep 20, 2022
soxofaan added a commit that referenced this issue Sep 21, 2022
caching JsonSerde needed support for serialization of custom classes (`_InternalCollectionMetadata` in this case)
soxofaan added a commit that referenced this issue Sep 21, 2022
had to add additional gzip'ing of json because process registry payload of process metadata is too large for default zookeeper limits
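
For context, a minimal sketch of that serde step (illustrative names, not the aggregator's actual `JsonSerde`): JSON-encode the payload and gzip it so that large metadata, such as the process registry, stays under ZooKeeper's default znode size limit (around 1 MB), then reverse the steps on read.

```python
import gzip
import json

def serialize(payload: dict) -> bytes:
    """JSON-encode and gzip a payload so that large metadata fits in a znode."""
    return gzip.compress(json.dumps(payload).encode("utf8"))

def deserialize(data: bytes) -> dict:
    return json.loads(gzip.decompress(data).decode("utf8"))

# Hypothetical round trip with a kazoo ZooKeeper client:
#   zk.create("/cache/processes", serialize({"processes": []}), makepath=True)
#   payload = deserialize(zk.get("/cache/processes")[0])
```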
@soxofaan
Member

I think the most important caching issue listed here was the lack of a shared cache between all the workers, which made cache misses very frequent in practice.
I have now added a shared cache (through ZooKeeper at the moment), which should bring a serious caching performance improvement.

There are some more ideas left here, but I'd prefer to close this general issue and move the remaining ideas to separate tickets for further discussion:
