
[ML] Ability to move anomaly detection jobs between clusters #37987

Open
sophiec20 opened this issue Jan 29, 2019 · 4 comments
Labels
>feature :ml Machine learning

Comments

@sophiec20
Contributor

sophiec20 commented Jan 29, 2019

Ability to "snapshot" and "restore" job and datafeed configurations so that they can be moved between clusters.

Could be just the job and datafeed config; for example, if a job has been proven in a staging environment, its config could then easily be transferred to production. It would also allow storing job configs in git (for example), from where they can easily be recreated - the current GET jobs API does not return a clean config which can be recreated.
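
To illustrate the "clean config" gap, here is a minimal sketch (assuming a cluster at localhost:9200, a job id of `my_job`, and the Python `requests` library; the list of fields to strip is a best-effort assumption, not an official contract) of the manual round trip needed today: fetch the job, drop the server-generated fields that the PUT API rejects, and keep the remainder so it can be versioned in git and recreated later.

```python
import json
import requests

ES = "http://localhost:9200"  # assumed cluster address

# GET returns server-generated fields (create_time, job_version, ...)
# that PUT _ml/anomaly_detectors/<job_id> will not accept back.
job = requests.get(f"{ES}/_ml/anomaly_detectors/my_job").json()["jobs"][0]

# Strip the generated fields so the remainder can be recreated as-is.
for field in ("job_id", "job_type", "job_version", "create_time",
              "finished_time", "model_snapshot_id", "results_index_name"):
    job.pop(field, None)

# This cleaned config could be committed to git and later recreated with:
# requests.put(f"{ES}/_ml/anomaly_detectors/my_job_copy", json=job)
print(json.dumps(job, indent=2))
```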

Or in a DR scenario, the whole job could be moved, including the model and persisted state. The job could be set to continue from where it left off (or from the time of the latest persisted state), provided the source indices were also available to read. This would also be applicable to a migration / side-by-side upgrade scenario.
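
For the "continue from the time of the latest persisted state" part, a rough sketch under the assumption that the job config, its model state, and the source indices are already present on the target cluster (the ids `my_job` and `my_datafeed` are placeholders): the latest model snapshot records how far the model had seen data, and the datafeed can be started from that point.

```python
import requests

ES = "http://localhost:9200"  # assumed target-cluster address

# Look up the most recent persisted model snapshot for the job.
latest = requests.get(
    f"{ES}/_ml/anomaly_detectors/my_job/model_snapshots",
    params={"sort": "timestamp", "desc": "true", "size": 1},
).json()["model_snapshots"][0]

# Open the job, then start its datafeed from the last record the persisted
# model had seen, so analysis continues instead of restarting from scratch.
requests.post(f"{ES}/_ml/anomaly_detectors/my_job/_open")
requests.post(
    f"{ES}/_ml/datafeeds/my_datafeed/_start",
    params={"start": latest["latest_record_time_stamp"]},
)
```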

I suggest we be careful about overloading the snapshot/restore terminology.

Note: This has often been requested and discussed, but seems to be lacking an issue (that I can find).

@sophiec20 sophiec20 added >feature :ml Machine learning labels Jan 29, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@benwtrent
Member

Another idea on this enhancement.

Sometimes it is necessary to "clone" a job on the same cluster, but KEEP the state. The scenario is:

  • The cluster is aggressively cleaning up old data, so relearning from scratch on old data is not possible.
  • Something occurs on the current job that causes it to lock up or stop processing appropriately (some bug on our part).
  • The user needs to create a new job, but does not have all the past data from which the model can learn.
  • The user is stuck cloning the job, but loses known past seasonality.

This would be fixed if the user could "export/import" on the same cluster (with the model state).
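
For context, a sketch of why the state is lost today (assuming a job named `old_job` and the Python `requests` library): model snapshots belong to the job that created them and can only be restored into that same job via the revert API, so a clone under a new job id starts with no model state at all - which is the gap an export/import with state would close.

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Snapshots are listed per job; there is no API to attach them to a new job id.
latest = requests.get(
    f"{ES}/_ml/anomaly_detectors/old_job/model_snapshots",
    params={"sort": "timestamp", "desc": "true", "size": 1},
).json()["model_snapshots"][0]

# Reverting only works within the same (closed) job:
requests.post(
    f"{ES}/_ml/anomaly_detectors/old_job/model_snapshots/"
    f"{latest['snapshot_id']}/_revert"
)
# A clone such as "old_job_copy" has no snapshots to revert to, so it must
# relearn from whatever raw data the cluster still retains.
```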

@sophiec20
Contributor Author

Something occurs on the current job that causes it to lock up or stop processing appropriately (some bug on our part)

I can't easily think of a scenario where the reasons for the current job getting locked up would not also apply to the export/imported version.

Although I agree that we should explore other reasons for staying in the same cluster, such as perhaps moving from a shared results index to a dedicated one.

Also, I think we do need to be careful that this would not be (ab)used as a way to bootstrap a job - because in general, it will take longer to unlearn a model trained on different data than to learn from scratch on the right data. Advice should be given through docs and best practices.

@benwtrent
Member

I can't easily think of a scenario where the reasons for the current job getting locked up would not also apply to the export/imported version.

I can; this is a real bug that occurred. The Java code threw an exception during _flush and we did not handle it correctly in the code. This did not always happen, but there was NO way to recover from it without cloning + retraining.

@sophiec20 sophiec20 changed the title [ML] Ability to move jobs between clusters [ML] Ability to move anomaly detection jobs between clusters Sep 28, 2020