
[ML] Ability to move anomaly detection jobs between clusters #37987

Open
sophiec20 opened this issue Jan 29, 2019 · 4 comments
Labels
>feature :ml Machine learning

Comments

@sophiec20
Contributor

sophiec20 commented Jan 29, 2019

Ability to "snapshot" and "restore" job and datafeed configurations so that they can be moved between clusters.

Could be just the job and datafeed config; for example, if a job has been proven in a staging environment, its config could then easily be transferred to production. It would also allow storing job configs in git (for example), from where they can easily be recreated - the current GET jobs API does not return a clean config which can be recreated.
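
To illustrate the "clean config" gap, here is a minimal sketch (assuming a cluster at localhost:9200, a job id of `my_job`, and the Python `requests` library; the list of fields to strip is a best-effort assumption, not an official contract) of the manual round trip needed today: fetch the job, drop the server-generated fields that the PUT API rejects, and keep the remainder so it can be versioned in git and recreated later.

```python
import json
import requests

ES = "http://localhost:9200"  # assumed cluster address

# GET returns server-generated fields (create_time, job_version, ...)
# that PUT _ml/anomaly_detectors/<job_id> will not accept back.
job = requests.get(f"{ES}/_ml/anomaly_detectors/my_job").json()["jobs"][0]

# Strip the generated fields so the remainder can be recreated as-is.
for field in ("job_id", "job_type", "job_version", "create_time",
              "finished_time", "model_snapshot_id", "results_index_name"):
    job.pop(field, None)

# This cleaned config could be committed to git and later recreated with:
# requests.put(f"{ES}/_ml/anomaly_detectors/my_job_copy", json=job)
print(json.dumps(job, indent=2))
```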

Or in a DR scenario, the whole job could be moved, including the model and persisted state. The job could be set to continue from where it left off (or from the time of the latest persisted state), provided the source indices were also available to read. This would also be applicable to a migration / side-by-side upgrade scenario.
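
For the "continue from the time of the latest persisted state" part, a rough sketch under the assumption that the job config, its model state, and the source indices are already present on the target cluster (the ids `my_job` and `my_datafeed` are placeholders): the latest model snapshot records how far the model had seen data, and the datafeed can be started from that point.

```python
import requests

ES = "http://localhost:9200"  # assumed target-cluster address

# Look up the most recent persisted model snapshot for the job.
latest = requests.get(
    f"{ES}/_ml/anomaly_detectors/my_job/model_snapshots",
    params={"sort": "timestamp", "desc": "true", "size": 1},
).json()["model_snapshots"][0]

# Open the job, then start its datafeed from the last record the persisted
# model had seen, so analysis continues instead of restarting from scratch.
requests.post(f"{ES}/_ml/anomaly_detectors/my_job/_open")
requests.post(
    f"{ES}/_ml/datafeeds/my_datafeed/_start",
    params={"start": latest["latest_record_time_stamp"]},
)
```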

I suggest we be careful about overloading the snapshot/restore terminology.

Note: This has often been requested and discussed, but seems to be lacking an issue (that I can find).

@sophiec20 sophiec20 added >feature :ml Machine learning labels Jan 29, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@benwtrent
Member

Another idea on this enhancement.

Sometimes it is necessary to "clone" a job on the same cluster, but KEEP the state. The scenario is:

  • The cluster is aggressively cleaning up old data, so relearning from scratch on old data is not possible.
  • Something occurs on the current job that causes it to lock up or stop processing appropriately (some bug on our part).
  • The user needs to create a new job, but does not have all the past data from which the model can learn.
  • The user is stuck cloning the job, but loses known past seasonality.

This would be fixed if the user could "export/import" on the same cluster (with the model state).
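
For context, a sketch of why the state is lost today (assuming a job named `old_job` and the Python `requests` library): model snapshots belong to the job that created them and can only be restored into that same job via the revert API, so a clone under a new job id starts with no model state at all - which is the gap an export/import with state would close.

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Snapshots are listed per job; there is no API to attach them to a new job id.
latest = requests.get(
    f"{ES}/_ml/anomaly_detectors/old_job/model_snapshots",
    params={"sort": "timestamp", "desc": "true", "size": 1},
).json()["model_snapshots"][0]

# Reverting only works within the same (closed) job:
requests.post(
    f"{ES}/_ml/anomaly_detectors/old_job/model_snapshots/"
    f"{latest['snapshot_id']}/_revert"
)
# A clone such as "old_job_copy" has no snapshots to revert to, so it must
# relearn from whatever raw data the cluster still retains.
```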

@sophiec20
Contributor Author

Something occurs on the current job that causes it to lock up or stop processing appropriately (some bug on our part)

I can't easily think of a scenario where the reasons for the current job getting locked up would not also apply to the export/imported version.

Although I agree that we should explore other reasons for staying in the same cluster, such as perhaps moving from a shared results index to a dedicated one.

Also, I think we do need to be careful that this would not be (ab)used as a way to bootstrap a job - because in general, it will take longer to unlearn a model trained on different data than to learn from scratch on the right data. Advice should be given through docs and best practices.

@benwtrent
Member

I can't easily think of a scenario where the reasons for the current job getting locked up would not also apply to the export/imported version.

I can; this is a real bug that occurred. The Java code threw an exception during _flush and we did not handle it correctly in the code. This did not always happen, but there was NO way to recover from it without cloning + retraining.

@sophiec20 sophiec20 changed the title [ML] Ability to move jobs between clusters [ML] Ability to move anomaly detection jobs between clusters Sep 28, 2020