Model extractor #518

edsavage · 2019-06-27T15:47:23Z

Added a new executable in devbin/model_extractor that restores part of the model state in a more human-digestible format (uncompressed, with meaningful tag names) than the standard persistence format used by autodetect. The tool is intended for debugging and educational purposes in general, and specifically as a means of automating several manual steps key to the Python notebooks residing in https://github.com/elastic/ml-cpp-data (which should make producing these notebooks simpler).

I've created this PR as a draft initially, as it is still very much a work in progress, but I am keen to canvass opinion on the approach taken.

The main points to note are:

* A new command line option has been added to autodetect to trigger persistence every N buckets (rather than after an interval of time).
* New persistence tags with meaningful strings have been added to a large proportion of the existing model classes. These are optionally chosen based on the type of inserter passed to the acceptPersistInserter methods.
* The model_extractor executable restores an entire autodetect persistence dump (or sequence of such dumps) from file or pipe and re-persists the part of the hierarchy of interest in human-readable format. Currently we're only interested in the residual models.
* Extra statistics are generated during the re-persistence of residual model state for convenience.
Currently the sequence of steps required to perform model extraction is entirely based around the command line. For example:

$ ./build/distribution/platform/darwin-x86_64/bin/autodetect --jobid=test --bucketspan=60 --summarycountfield=count --timefield=time --delimiter=, --modelplotconfig=modelplotconfig.conf --fieldconfig=fieldconfig.conf --persist normal_named_pipe --persistIsPipe --bucketPersistInterval=1 < normal.csv >  normal.log
$ ./devbin/model_extractor --input normal_named_pipe --inputIsPipe --output normal_residual_models.json

Some post-processing of the extracted model documents is required to transform them into a format accepted by, e.g., the standard Python json module. This is due to the presence of duplicate object names at the same level of the hierarchy. Currently this is done with a jq snippet obtained from a comment on jqlang/jq#1795 ("Unexpected behavior with duplicate attributes").
The issue with the non-standards-compliant JSON output can now be bypassed entirely by specifying '--outputFormat=XML' in the command line arguments. This generates the model state as an XML document wrapped in a JSON object named 'xml' in the output. The convenience of parsing the XML document comes at the expense of verbosity.

First working version of utility to extract model state from a state dump persisted by autodetect and to then print the residual models in a somewhat friendlier, readable format

* Tidy up parsing command line arguments.
* Add the ability to read input from file, named pipe or stdin - similarly for writing output

* Allow autodetect to persist state periodically based on the number of buckets that have been processed
* Cleaned up the JSON output generated by model_extractor to simplify parsing
* Removed unused code

Reverted to persisting extracted model state docs in standard ES format, i.e. separated by nulls. This allows for the possibility of eventually using ES indices as source/sink for the model state extraction. Added a Python script to parse model state documents. It makes use of a jq snippet to first transform the state documents into a format acceptable to the standard Python json module by dealing with objects with duplicate names.
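To illustrate the "separated by nulls" convention mentioned above, here is a minimal sketch of splitting such a stream into individual documents. The helper name `splitOnNulls` is hypothetical and not part of this PR; the actual tooling handles this in the Python scripts.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): split a buffer containing
// '\0'-separated documents into one string per document.
// Empty chunks, e.g. from a trailing separator, are skipped.
std::vector<std::string> splitOnNulls(const std::string& buffer) {
    std::vector<std::string> docs;
    std::string::size_type start{0};
    while (start < buffer.size()) {
        std::string::size_type end{buffer.find('\0', start)};
        if (end == std::string::npos) {
            end = buffer.size();
        }
        if (end > start) {
            docs.emplace_back(buffer, start, end - start);
        }
        start = end + 1;
    }
    return docs;
}
```

Each returned string would then be handed to whatever JSON or XML parser is in use.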

droberts195

Looks like good progress!

droberts195 · 2019-07-01T08:39:20Z

bin/autodetect/CCmdLineParser.cc

@@ -103,7 +104,9 @@ bool CCmdLineParser::parse(int argc,
                        "Optional file to persist state to - not present means no state persistence")
            ("persistIsPipe", "Specified persist file is a named pipe")
            ("persistInterval", boost::program_options::value<core_t::TTime>(),
-                        "Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit")
+                        "Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)")


The part "if not specified then models will only be persisted at program exit" is not true any more, so it might be best to just remove it.

I also find this a little confusing: (Mutually exclusive with bucketPersistInterval). Should this not be specified at the same time as bucketPersistInterval, i.e. causes error, does one take precedence, etc? Might be worth being more specific.

It feels a little awkward to have two alternative mechanisms for specifying the same thing. What is the rationale for keeping them separate? Tying to the bucket length makes more sense to me since the models change on this time scale.

> Tying to the bucket length makes more sense to me since the models change on this time scale.

But remember that this is not greenfield functionality in this PR. We've done background persistence based on wall clock time for about 5 years now. The rationale for using wall clock time was that if a lookback churns through 1000 half hour buckets in 20 minutes of wall clock time we probably don't need a background persistence in the middle of that work, but if running in real time then we want lots of periodic persists during the 500 hours it takes to get through 1000 half hour buckets.

This really leads on to the question of whether this functionality is only ever going to be used as an internal debugging tool. At one time the thinking was that it would be internal only. End users don't call the C++ process directly, so it doesn't matter if the options are esoteric (although it's still best for the descriptions to be factually accurate). But if the thinking on exposing this externally has now changed then the external user interface needs more careful thought, including how to protect people from accidentally doing a denial-of-service attack on themselves by for example persisting huge model state every 2 buckets during a 10000 bucket lookback. Also before exposing this externally things like using jq from a Python script would need revisiting...

droberts195 · 2019-07-01T08:39:43Z

bin/autodetect/CCmdLineParser.cc

-                        "Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit")
+                        "Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)")
+            ("bucketPersistInterval", boost::program_options::value<std::size_t>(),
+                        "Optional number of buckets after which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with persistInterval)")


The part "if not specified then models will only be persisted at program exit" is not true (because there might be wall clock periodic persistence), so it might be best to just remove it.

droberts195 · 2019-07-01T08:50:23Z

bin/autodetect/CCmdLineParser.cc

+                                             const std::string& opt2) {
+            if (vm.count(opt1) && !vm[opt1].defaulted() && vm.count(opt2) &&
+                !vm[opt2].defaulted())
+                throw std::logic_error("Conflicting options '" + opt1 +


I'm not sure logic_error is the most appropriate here - its description is:

It reports errors that are a consequence of faulty logic within the program such as violating logical preconditions or class invariants and may be

But this is a user error, so maybe std::runtime_error would be better.

Also, nit, we now prefer == false to !.
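As a rough sketch of the exception-type suggestion above: the check below throws std::runtime_error when two mutually exclusive options are both supplied. A plain std::map stands in for boost::program_options::variables_map so that the example is self-contained; the alias `TStrStrMap` and the function name are assumptions for illustration only.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for the parsed command line; the code under review uses
// boost::program_options::variables_map instead.
using TStrStrMap = std::map<std::string, std::string>;

// Hypothetical helper: report a user error (not a programming error)
// if both options were supplied.
void checkConflictingOptions(const TStrStrMap& options,
                             const std::string& opt1,
                             const std::string& opt2) {
    if (options.count(opt1) > 0 && options.count(opt2) > 0) {
        throw std::runtime_error("Conflicting options '" + opt1 +
                                 "' and '" + opt2 + "'");
    }
}
```

The caller would catch the exception at the top level, print the message and exit with a failure status.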

droberts195 · 2019-07-01T08:57:19Z

lib/api/CBackgroundPersister.cc

+                     << " is still in progress - increasing persistence interval by "
+                     << PERSIST_BUCKET_INCREMENT << " buckets");
+
+            m_BucketPersistInterval += PERSIST_BUCKET_INCREMENT;


I'm not sure you really want this for the case of wanting to generate rich debug information. A lookback can easily churn through buckets faster than state can be persisted, and this will mean that somebody who asked for rich debug every 10 buckets might end up getting it every 20 buckets.

It may be better to add the functionality to persist in the foreground - see elastic/elasticsearch#29770 - and use that for the case of wanting to persist every N buckets.

droberts195 · 2019-07-01T10:30:27Z

lib/maths/CGammaRateConjugate.cc

@@ -691,6 +691,17 @@ class CLogMarginalLikelihood : core::CNonCopyable {

 } // detail::

+const std::string READABLE_OFFSET_TAG("offset");


It might be less error prone to define the readable and minified tags on the same line. This could be something like:

const TStrStrPr OFFSET_TAG("a", "offset");

(where TStrStrPr is a std::pair<std::string, std::string>)

Then overload insertValue so that the first argument can be either a std::string or TStrStrPr. This would avoid the need to have all the readableTags ? READABLE_FOO_TAG : FOO_TAG constructs as the test for readable tags would move into the TStrStrPr overload of insertValue.

Alternatively a custom class could be used:

const core::CPersistTag OFFSET_TAG("a", "offset");

To start with it would basically just be a pair of strings, but would be more future proof if we ever wanted to add another way of creating tags in the future, as we wouldn't need to search and replace a load of TStrStrPrs with core::CPersistTag at that future time.
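A minimal sketch of what such a tag class might look like, assuming an accessor `name(bool readable)` that selects between the two forms. The accessor and member names are illustrative assumptions, not necessarily what ends up in the PR.

```cpp
#include <string>
#include <utility>

// Sketch only: couples the short (minified) tag with its readable form.
class CPersistTag {
public:
    CPersistTag(std::string shortName, std::string longName)
        : m_ShortName(std::move(shortName)), m_LongName(std::move(longName)) {}

    // Return the readable form if requested, otherwise the short form.
    const std::string& name(bool readable) const {
        return readable ? m_LongName : m_ShortName;
    }

private:
    std::string m_ShortName;
    std::string m_LongName;
};

// Both forms are defined on one line, so they cannot drift apart.
const CPersistTag OFFSET_TAG("a", "offset");
```

An inserter that knows whether it wants readable tags can then call `OFFSET_TAG.name(readable)` itself, which removes the need for a ternary at every call site.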

droberts195 · 2019-07-01T10:37:09Z

lib/maths/CGammaRateConjugate.cc

@@ -1439,6 +1450,21 @@ void CGammaRateConjugate::print(const std::string& indent, std::string& result)
        return;
    }

+    std::string meanStr{"<unknown>"};
+    std::string sdStr{"<unknown>"};


It might be worth considering if <unknown> is the best string to use to represent unknown values. Is the likely client for reading the values Python? If so, would some other string be easier for Python to interpret as meaning unknown?

Same in the other places where <unknown> is used.

The other thing about this is that using the string literal <unknown> in all these places means that:

1. If we ever want to change it we've got to grep for it, and
2. There's a call to strlen on every initialisation.

You could solve these problems by declaring a const std::string containing the unknown value. That will also store the length making copying more efficient and can easily be changed if we decide the primary client is, say, JavaScript rather than Python.
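For illustration, a sketch of that suggestion. The constant name `UNKNOWN_VALUE_STRING` and the helper `formatStatistic` are assumptions made for this example.

```cpp
#include <string>

// Defined once: changing the representation of "unknown" is a one-line edit,
// and the length is stored with the string rather than recomputed.
const std::string UNKNOWN_VALUE_STRING("<unknown>");

// Hypothetical helper: format a statistic, falling back to the shared constant.
std::string formatStatistic(bool isKnown, double value) {
    return isKnown ? std::to_string(value) : UNKNOWN_VALUE_STRING;
}
```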

droberts195 · 2019-07-01T10:48:18Z

devbin/model_extractor/model_state_parser.py

+import json
+import sys
+import sh
+


I think this file should have:

1. A copyright header.
2. A comment saying what the prerequisites are. For example, it requires jq to be installed, which it isn't by default on macOS. And does it require Python 2 or Python 3, or does it work in both?

Created a class to couple together the short form of a persistence tag and its associated long, readable form. This is designed to be a drop in replacement for the existing persistence tags resulting in minimal changes (if any) to existing acceptPersistInserter routines. Unit tests TBD.

Attending to some helpful comments from review

tveasey

I've done a first pass. Aside from some minor things, my main observation is that introducing two persist intervals feels a bit clunky. I want to understand the rationale for this better. To me, tying the interval to the bucket length makes sense, except I guess it introduces problems for lookback. I think it might be worth discussing this a bit offline.

tveasey · 2019-07-01T11:11:04Z

bin/autodetect/CCmdLineParser.cc

@@ -103,7 +104,9 @@ bool CCmdLineParser::parse(int argc,
                        "Optional file to persist state to - not present means no state persistence")
            ("persistIsPipe", "Specified persist file is a named pipe")
            ("persistInterval", boost::program_options::value<core_t::TTime>(),
-                        "Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit")
+                        "Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)")


I also find this a little confusing: (Mutually exclusive with bucketPersistInterval). Should this not be specified at the same time as bucketPersistInterval, i.e. causes error, does one take precedence, etc? Might be worth being more specific.

It feels a little awkward to have two alternative mechanisms for specifying the same thing. What is the rationale for keeping them separate? Tying to the bucket length makes more sense to me since the models change on this time scale.

tveasey · 2019-07-01T11:13:15Z

bin/autodetect/CCmdLineParser.cc

+                                             const std::string& opt2) {
+            if (vm.count(opt1) && !vm[opt1].defaulted() && vm.count(opt2) &&
+                !vm[opt2].defaulted())
+                throw std::logic_error("Conflicting options '" + opt1 +


Also, nit, we now prefer == false to !.

tveasey · 2019-07-01T13:56:46Z

devbin/model_extractor/Main.cc

+
+    ml::core_t::TTime completeToTime{0};
+    ml::core_t::TTime prevCompleteToTime{0};
+    while (restoredJob.restoreState(restoreSearcher, completeToTime) == true) {


nit: no need for == true.

tveasey · 2019-07-01T13:57:23Z

devbin/model_extractor/Main.cc

+    ml::model::CLimits limits;
+    ml::api::CFieldConfig fieldConfig;
+
+    if (!fieldConfig.initFromFile(ml::core::COsFileFuncs::NULL_FILENAME)) {


nit: mixing styles (and elsewhere). I think we prefer == false now.

tveasey · 2019-07-01T14:00:26Z

devbin/model_extractor/Main.cc

+    // Read command line options
+    std::string logProperties;
+    std::string inputFileName;
+    bool isInputFileNamedPipe(false);


nit: this file uses a mixture of (), = and {} initialisation. I generally prefer the new style initialisation since it catches accidental narrowing.
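A small sketch of the narrowing point, using a hypothetical options struct (the member names only loosely mirror the variables in Main.cc):

```cpp
#include <cstddef>
#include <string>

// Hypothetical grouping of command line options, all brace-initialised.
struct SOptions {
    std::string s_InputFileName{};
    bool s_IsInputFileNamedPipe{false};
    std::size_t s_BucketPersistInterval{0};
};

// Parenthesised initialisation allows a narrowing conversion:
// the fractional part is silently dropped.
int truncatingInit(double value) {
    int result(value);
    // int rejected{value}; // ill-formed: narrowing conversion from double to int
    return result;
}
```

With braces the accidental conversion is diagnosed at compile time, rather than surfacing later as a surprising value.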

tveasey · 2019-07-01T14:25:13Z

include/core/CPersistUtils.h

@@ -891,7 +899,8 @@ class CRestorerImpl<BasicRestore> {
    template<typename A, typename B>
    static bool newLevel(std::pair<A, B>& t, CStateRestoreTraverser& traverser) {
        if (traverser.name() != FIRST_TAG) {
-            LOG_ERROR(<< "Tag mismatch at " << traverser.name() << ", expected " << FIRST_TAG);
+            LOG_ERROR(<< "Tag mismatch at " << traverser.name() << ", expected "
+                      << FIRST_TAG.name(false));


I think this warrants a comment as to why one doesn't have readable tags here.

tveasey · 2019-07-01T14:30:09Z

lib/api/CAnomalyJob.cc

@@ -83,6 +83,14 @@ const std::string INTERIM_BUCKET_CORRECTOR_TAG("k");
 //! The minimum version required to read the state corresponding to a model snapshot.
 //! This should be updated every time there is a breaking change to the model state.
 const std::string MODEL_SNAPSHOT_MIN_VERSION("6.4.0");
+
+// Persist state as JSON with meaningful tag names.


tveasey · 2019-07-01T14:57:26Z

lib/maths/CGammaRateConjugate.cc

+    result += "mean = " + meanStr + " sd = " + sdStr;
+}
+
+void CGammaRateConjugate::restoreDescriptiveStatistics(std::string& meanStr,


I'm not sure about the naming of this function, especially as it is used in e.g. print and persist: to me this reads as "do some form of restore". How about printMarginalLikelihoodStatistics()?

I also wonder if this might reasonably return std::pair<std::string, std::string>; the strings would be moved into place after all. That way we could have an implementation in CPrior which checks if the prior is non-informative and returns unknown values, and otherwise calls a virtual implementation. You would then have just one place where the unknown values are defined.
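The shape being suggested could look roughly like the sketch below. The names `UNKNOWN_VALUE_STRING`, `doPrintMarginalLikelihoodStatistics` and `CExamplePrior` are assumptions for illustration, and the real CPrior interface is of course much larger.

```cpp
#include <string>
#include <utility>

using TStrStrPr = std::pair<std::string, std::string>;

const std::string UNKNOWN_VALUE_STRING("<unknown>");

class CPrior {
public:
    virtual ~CPrior() = default;

    // Non-virtual entry point: the only place the "unknown" fallback lives.
    TStrStrPr printMarginalLikelihoodStatistics() const {
        if (this->isNonInformative()) {
            return {UNKNOWN_VALUE_STRING, UNKNOWN_VALUE_STRING};
        }
        return this->doPrintMarginalLikelihoodStatistics();
    }

    virtual bool isNonInformative() const = 0;

private:
    // Derived priors supply the (mean, sd) strings for the informative case.
    virtual TStrStrPr doPrintMarginalLikelihoodStatistics() const = 0;
};

// Toy prior returning fixed values, purely to exercise the base class.
class CExamplePrior : public CPrior {
public:
    explicit CExamplePrior(bool nonInformative)
        : m_NonInformative{nonInformative} {}

    bool isNonInformative() const override { return m_NonInformative; }

private:
    TStrStrPr doPrintMarginalLikelihoodStatistics() const override {
        return {"1.5", "0.25"};
    }

    bool m_NonInformative;
};
```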

tveasey · 2019-07-01T14:59:33Z

include/core/CStatePersistInserter.h

+//! This is since currently the long form of the tag names are not required to be restored
+//! from state - only persisted.
+//!
+class CORE_EXPORT CPersistenceTag {


++ nice helper class.

Attending to code review comments from @tveasey

droberts195 · 2019-07-02T10:43:25Z

lib/maths/CPrior.cc

+
+    try {
+        return this->doPrintMarginalLikelihoodStatistics();
+    } catch (...) {}


I don't think it's good practice to use catch (...) except where it's effectively a finally clause and the caught exception is rethrown or otherwise propagated.

Also, this method catches all its exceptions so I don't think we need a try here. (Unless you really can't avoid it, say callback into 3rd party code, none of our functions should be throwing exceptions.)

droberts195 · 2019-07-02T10:45:34Z

lib/maths/CPrior.cc

+    TStrStrPr unknownValuePr{UNKNOWN_VALUE_STRING, UNKNOWN_VALUE_STRING};
+
+    if (this->isNonInformative()) {
+        return unknownValuePr;


You can construct a pair from an initializer list, so this line and the other one that returns unknownValuePr could be return {UNKNOWN_VALUE_STRING, UNKNOWN_VALUE_STRING};. Then you wouldn't have to construct unknownValuePr in the happy-day case of a useful mean and standard deviation being available.

droberts195 · 2019-07-02T10:48:49Z

include/maths/CClusterer.h

@@ -147,7 +147,7 @@ class CClusterer : public CClustererTypes {
    //! \name Clusterer Contract
    //@{
    //! Get the tag name for this clusterer.
-    virtual std::string persistenceTag() const = 0;
+    virtual core::TPersistenceTag persistenceTag() const = 0;


Is there ever a need for this method to return a dynamically constructed tag rather than a static constant tag? If not then returning a const reference would avoid the need to copy two strings into the return value.
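A sketch of the const-reference variant, assuming every implementation returns a static constant. Here std::string stands in for core::TPersistenceTag, and `CExampleClusterer` is a hypothetical implementation.

```cpp
#include <string>

class CClusterer {
public:
    virtual ~CClusterer() = default;

    // Returning a const reference avoids copying the tag on every call.
    // This is only safe if implementations return an object that outlives
    // the call, such as a static constant.
    virtual const std::string& persistenceTag() const = 0;
};

class CExampleClusterer : public CClusterer {
public:
    const std::string& persistenceTag() const override { return TAG; }

private:
    static const std::string TAG;
};

const std::string CExampleClusterer::TAG("example");
```

If any implementation ever needed to build its tag dynamically, this signature would force it to cache the result, which is the trade-off to weigh.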

* Attending to code review comments
* Added tests exercising the new persistence tags

* Committing previously missed CCmdLineParser class files
* Adding support for extracting model state in XML format. This works around the issue of the JSON format not being directly parsable by standard means without first massaging with e.g. the 'jq' JSON command line processor. This comes at the expense of the larger size of the XML documents.
* Added example python script demonstrating how to parse the model state output when in XML format.

Winterflower · 2019-07-03T16:57:30Z

devbin/model_extractor/model_state_parser.py

+    if len(sys.argv) < 2:
+        data = sys.stdin.read();
+    else:
+        fileName = sys.argv[1]


I'm not entirely familiar with the use case here, so this could be a misplaced suggestion, but argparse is a built-in library in Python 3 that can be used to create parsers for CLI arguments.

Modified model state parser scripts to additionally extract the prior weights

# Conflicts:
# bin/autodetect/CCmdLineParser.cc
# bin/autodetect/Main.cc
# include/api/CPersistenceManager.h
# lib/api/CAnomalyJob.cc
# lib/api/CPersistenceManager.cc

droberts195

LGTM on the understanding that it has no impact on the anomaly detection job results.

Please don't merge today, as we're in the middle of investigating some other performance regressions. But if you merge tomorrow then we can look out for any unforeseen changes to results or performance in anomaly detection. /cc @dolaru

Extract and print residual models in a readable format

A utility to extract model state from a state dump persisted by autodetect and to then print the residual models in a somewhat friendlier, readable format

* Has the ability to read input from file, named pipe or stdin - similarly for writing output
* Regenerates marginal likelihood mean and sd from persisted state
* Added option for `autodetect` to persist every N buckets
* Extracted model state documents are in standard ES ML format, i.e. separated by nulls. This allows for the possibility of eventually using ES indices as source/sink for the model state extraction.
* Has support for both XML and JSON output formats
* Example python scripts to parse model state documents.

Extract and print residual models in a readable format

A utility to extract model state from a state dump persisted by autodetect and to then print the residual models in a somewhat friendlier, readable format

Backport #518

edsavage added 6 commits June 26, 2019 17:05

Print residual models in a readable format

640dc67

First working version of utility to extract model state from a state dump persisted by autodetect and to then print the residual models in a somewhat friendlier, readable format

Formalize command line parsing

075a913

* Tidy up parsing command line arguments.
* Add the ability to read input from file, named pipe or stdin - similarly for writing output

Correctly read from autodetect pipe

4d5a290

Regenerate prior mean and sd from persisted state

e78b3a2

Autodetect optionally to persist every N buckets

dfd3e9b

* Allow autodetect to persist state periodically based on the number of buckets that have been processed
* Cleaned up the JSON output generated by model_extractor to simplify parsing
* Removed unused code

Fix unit test compilation error

fd6735a

edsavage added review WIP :ml >feature v8.0.0 labels Jun 27, 2019

edsavage requested review from stevedodson, droberts195 and tveasey June 27, 2019 15:47

droberts195 reviewed Jul 1, 2019

edsavage added 3 commits July 1, 2019 12:23

Removed unnecessary namespacing

bc919e4

Tidied and commented code

77d6688

Attending to some helpful comments from review

tveasey reviewed Jul 1, 2019

edsavage added 3 commits July 1, 2019 16:10

Formatting

50096eb

Further tidy up

9d53cf3

Attending to code review comments from @tveasey

Slight refactoring of printMarginalLikelihoodStatistics

0108d7a

droberts195 reviewed Jul 2, 2019

edsavage added 2 commits July 2, 2019 15:42

Additional tidy up and test cases

ecb4fcb

* Attending to code review comments
* Added tests exercising the new persistence tags

Winterflower reviewed Jul 3, 2019

edsavage added 3 commits July 4, 2019 10:13

Improved argument parsing in python scripts

8ec3f7c

Extract prior weights from model state

982ab97

Modified model state parser scripts to additionally extract the prior weights

Slight tweaks to parser scripts

4f2ff8e

edsavage added 2 commits July 22, 2019 15:20

Merge branch 'master' into model_extractor

77b6bb2

# Conflicts:
# bin/autodetect/CCmdLineParser.cc
# bin/autodetect/Main.cc
# include/api/CPersistenceManager.h
# lib/api/CAnomalyJob.cc
# lib/api/CPersistenceManager.cc

Merge cleanup

2afbce8

droberts195 added >non-issue v7.4.0 and removed >feature WIP labels Jul 25, 2019

droberts195 approved these changes Jul 25, 2019

droberts195 marked this pull request as ready for review July 25, 2019 10:41

edsavage added 2 commits July 26, 2019 09:49

Merge branch 'master' of github.com:elastic/ml-cpp into model_extractor

965356b

Fixed compilation warning

9c09f3e

edsavage merged commit ac03003 into elastic:master Jul 26, 2019

edsavage mentioned this pull request Jul 26, 2019

[7.4][ML] Model extractor (#518) #566

Merged

edsavage deleted the model_extractor branch July 26, 2019 12:15
