Memoization Storage #3587

Open
alexec opened this issue Jul 24, 2020 · 15 comments

@alexec
Contributor

alexec commented Jul 24, 2020

Summary

Memoization is a feature that allows users to run workflows faster by avoiding repeating work that has already been done.

Currently, memoization uses a Kubernetes ConfigMap for storage. This will not scale to a large number of entries, and it requires elevated RBAC. Instead, we should provide the option to use an alternative database to store these entries.

Motivation

Large workflows.

Proposal

Options:

  • Use the database.
  • Use any artifact storage.

See #944


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@Ark-kun
Member

Ark-kun commented Sep 27, 2020

I think we can use the Artifact drivers to store the caching metadata.
We can store the cache entry as an artifact using the same artifact location configuration. For example, in s3://<some_bucket>/artifacts/<cache_key>/cache_entries.yaml.
P.S. There are some benefits to allowing multiple entries for the same cache_key: even with exactly the same inputs, a volatile component can produce different results, and in some scenarios all of them should be cached.
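
A minimal sketch of this idea, not Argo's implementation: it assumes a hypothetical artifact store with `get` and `put` (here an in-memory stand-in for S3 or similar) and keeps a list of entries per cache key, so that several results for the same inputs can coexist. The object path follows the layout suggested above; the entries are serialized as JSON purely to stay within the standard library, and all class and function names are made up for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class InMemoryArtifactStore:
    """Hypothetical stand-in for an artifact driver (S3, GCS, ...)."""
    objects: dict = field(default_factory=dict)

    def get(self, path):
        # Returns None when the object does not exist.
        return self.objects.get(path)

    def put(self, path, data):
        self.objects[path] = data


def entries_path(cache_key):
    # Layout from the comment above: artifacts/<cache_key>/cache_entries.yaml
    return f"artifacts/{cache_key}/cache_entries.yaml"


def load_entries(store, cache_key):
    """Return every cached entry for the key (empty list on a cache miss)."""
    raw = store.get(entries_path(cache_key))
    return json.loads(raw) if raw is not None else []


def save_entry(store, cache_key, outputs):
    """Append a new entry instead of overwriting, and return the entry count."""
    entries = load_entries(store, cache_key)
    entries.append({"outputs": outputs})
    store.put(entries_path(cache_key), json.dumps(entries))
    return len(entries)
```

Appending rather than overwriting is what allows a volatile step to have more than one cached result; a real backend would also need to handle concurrent writers, which this sketch ignores.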

@alexec
Contributor Author

alexec commented Sep 28, 2020

Interesting idea. We just need storage and this is a good option.

@simster7 simster7 self-assigned this Sep 30, 2020
@whynowy whynowy assigned alexec and unassigned simster7 Oct 7, 2020
@alexec alexec removed their assignment Oct 14, 2020
@mkjpryor-stfc

This is required for caching large outputs because etcd places a limit on the maximum size of a configmap. Piggy-backing on the artifact storage sounds like it should be feasible to me.
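
As a rough illustration of the size constraint, the sketch below checks whether adding a cache entry would keep a ConfigMap's data under 1 MiB, the limit Kubernetes documents for ConfigMap data. The byte counting is an approximation (the real limit applies to the serialized object), and the function names are hypothetical.

```python
CONFIGMAP_LIMIT_BYTES = 1024 * 1024  # Kubernetes caps ConfigMap data at 1 MiB


def configmap_size(data):
    """Approximate size: UTF-8 bytes of all keys and values combined."""
    return sum(len(k.encode()) + len(v.encode()) for k, v in data.items())


def fits_in_configmap(data, key, value, limit=CONFIGMAP_LIMIT_BYTES):
    """Return True if setting data[key] = value stays within the limit."""
    updated = dict(data)  # copy so the caller's mapping is not modified
    updated[key] = value
    return configmap_size(updated) <= limit
```

Because every cache entry shares the same ConfigMap, the budget is consumed by the total of all entries, not by any single output.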

@lowc1012
Contributor

Hi, is anybody working on this issue?
I'm interested in working on this. Could I take it forward?

@leonharetd
Contributor

I'm interested in this. I want to try it.

@sarabala1979 sarabala1979 added this to the 2022 Q1 milestone Jan 19, 2022
@alexec alexec removed this from the 2022 Q1 milestone Feb 8, 2022
@alexec alexec added the google-summer-of-code Google Summer of Code contributions label Feb 8, 2022
@attreyee-muk

I would like to contribute to this project for GSOC 2022. Can you please give me some more details on this?

@sarabala1979
Member

> I would like to contribute to this project for GSOC 2022. Can you please give me some more details on this?

Here is the current memoization implementation document https://github.com/argoproj/argo-workflows/blob/master/docs/memoization.md

@attreyee-muk

Okay. Thank you.

@alexec
Contributor Author

alexec commented Feb 14, 2022

If you'd like to do this as part of GSoC, you'll need to sign up here:

https://summerofcode.withgoogle.com

GSoC does not start for several months, so if you're instead looking to make an impact today, and don't need the benefits of GSoC (see their website for the details), then mentoring might be the right approach for you.

@attreyee-muk

@alexec The applications for participants will open in April, right? I'm actually a bit new to all of this.

@terrytangyuan
Member

@attreyee-muk

Thank you @terrytangyuan

@Mostafa-wael

How can I apply for this idea for GSoC? Is there any communication channel with the mentors?

@sudhanshu456

@alexec Hey, can you please help me understand how I should get into mentoring? I've been working with Argo Workflows for a year.

@terrytangyuan terrytangyuan removed the google-summer-of-code Google Summer of Code contributions label Mar 16, 2023
@print-sid8

print-sid8 commented Oct 5, 2024

Has there been any progress on making DB/Storage the output location for steps and enabling caching of the steps?

If I have a step that outputs artifacts to S3 as its final action, and the same step is used in another workflow, or the same workflow is rerun, does the current ConfigMap-based cache implementation recognize that this step has run before, skip it, and use the cache to continue to the next step?

If so, a simple solution could be to write to an S3 location that includes a user-supplied cache version identifier, and use the same S3 path as input to the next dependent step to emulate caching.

Am I right or wrong?
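
The workaround described in this comment can be sketched roughly as follows. This is an illustration of the idea only and says nothing about how Argo's memoization behaves: the path layout, the `cache_version` parameter, and the `existing_paths` set (standing in for an "object exists" check against S3) are all assumptions.

```python
import hashlib
import json


def cache_location(bucket, step_name, inputs, cache_version):
    """Derive a deterministic S3-style path from the step, inputs and version."""
    payload = json.dumps(
        {"step": step_name, "inputs": inputs, "version": cache_version},
        sort_keys=True,  # stable ordering so equal inputs hash identically
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return f"s3://{bucket}/cache/{step_name}/{digest}"


def run_step(existing_paths, bucket, step_name, inputs, cache_version, compute):
    """Run compute(path) only if nothing is stored at the derived path yet.

    Returns (path, ran), where ran is False when the step was skipped.
    """
    path = cache_location(bucket, step_name, inputs, cache_version)
    if path in existing_paths:
        return path, False
    compute(path)  # the step writes its outputs to the path
    existing_paths.add(path)
    return path, True
```

Changing the cache version yields a different path, which forces the step to run again; the downstream step only needs the returned path as its input.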
