Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model extractor #518

Merged
merged 22 commits into from
Jul 26, 2019
Merged

Model extractor #518

merged 22 commits into from
Jul 26, 2019

Conversation

edsavage
Copy link
Contributor

@edsavage edsavage commented Jun 27, 2019

Added a new executable in devbin/model_extractor that is designed to restore part of the model state in a more human digestible format (un-compressed / meaningful tag names) than the standard persistence format used by autodetect. This new tool is intended to be used for debug / educational purposes in general but specifically as a means of automating several manual steps key to the python notebooks residing in https://github.com/elastic/ml-cpp-data (which should make production of these notebooks a simpler task).

I've created this PR as a draft initially as this is still very much a work-in-progress but am keen to canvas opinion of the approach taken.

The main points to note are:

  • A new command line option added to autodetect to trigger persistence every N buckets (rather than an interval of time)
  • New persistence tags with meaningful strings have been added into a large proportion of the existing model classes. These are optionally chosen based on the type of inserter passed to the acceptPersistInserter methods.
  • The model_extractor executable restores an entire autodetect persistence dump (or sequence of such dumps) from file or pipe and re-persists the part of the hierarchy of interest in human readable format. Currently we're only interested in the residual models
  • Extra statistics are generated during the re-persistence of residual model state for convenience.
  • Currently the sequence of steps required to perform model extraction is entirely based around the command line. For example:
$ ./build/distribution/platform/darwin-x86_64/bin/autodetect --jobid=test --bucketspan=60 --summarycountfield=count --timefield=time --delimiter=, --modelplotconfig=modelplotconfig.conf --fieldconfig=fieldconfig.conf --persist normal_named_pipe --persistIsPipe --bucketPersistInterval=1 < normal.csv >  normal.log
$ ./devbin/model_extractor --input normal_named_pipe --inputIsPipe --output normal_residual_models.json
  • Some post-processing of the extracted model documents is required to transform them into a more valid format acceptable by e.g. the standard python json module. This is due to the presence of duplicate object names at the same level of the hierarchy. Currently this is done with a jq snippet obtained from Unexpected behavior with duplicate attributes jqlang/jq#1795 (comment)
  • The issue with the non-standards compliant JSON output format can now be bypassed entirely by specifying '--outputFormat=XML' in the command line arguments. This generates the model state as an XML document wrapped in a JSON object named 'xml' in the output. The convenience of parsing the XML document comes at the expense of verbosity.

edsavage added 6 commits June 26, 2019 17:05
First working version of utility to extract model state from a state
dump persisted by autodetect and to then print the residual models in a
somewhat friendlier, readable format
* Tidy up parsing command line arguments.
* Add the ability to read input from file, named pipe or stdin - similarly
for writing output
 * Allow autodetect to persist state periodically based on the number of
 buckets that have been processed
 * Cleaned up the JSON output generated by model_extractor to simplify
 parsing
 * Removed unused code
Reverted back to persisting extracted model state docs in standard ES
format, i.e. separated by nulls. This allows for the possibility of
eventually using ES indices as source/sink for the model state
extraction.

Added a python script to parse model state documents. It makes use of jq
snippet to first transform the state documents into a format acceptable
to the standard python json module by dealing with objects with
duplicate names.
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like good progress!

@@ -103,7 +104,9 @@ bool CCmdLineParser::parse(int argc,
"Optional file to persist state to - not present means no state persistence")
("persistIsPipe", "Specified persist file is a named pipe")
("persistInterval", boost::program_options::value<core_t::TTime>(),
"Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit")
"Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not specified then models will only be persisted at program exit is not true any more, so might be best to just remove this bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find this a little confusing: (Mutually exclusive with bucketPersistInterval). Should this not be specified at the same time as bucketPersistInterval, i.e. causes error, does one take precedence, etc? Might be worth being more specific.

It feels a little awkward to have two alternative mechanisms for specifying the same thing. What is the rationale for keeping them separate? Tying to the bucket length makes more sense to me since the models change on this time scale.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tying to the bucket length makes more sense to me since the models change on this time scale.

But remember that this is not greenfield functionality in this PR. We've done background persistence based on wall clock time for about 5 years now. The rationale for using wall clock time was that if a lookback churns through 1000 half hour buckets in 20 minutes of wall clock time we probably don't need a background persistence in the middle of that work, but if running in real time then we want lots of periodic persists during the 500 hours it takes to get through 1000 half hour buckets.

This really leads on to the question of whether this functionality is only ever going to be used as an internal debugging tool. At one time the thinking was that it would be internal only. End users don't call the C++ process directly, so it doesn't matter if the options are esoteric (although it's still best for the descriptions to be factually accurate). But if the thinking on exposing this externally has now changed then the external user interface needs more careful thought, including how to protect people from accidentally doing a denial-of-service attack on themselves by for example persisting huge model state every 2 buckets during a 10000 bucket lookback. Also before exposing this externally things like using jq from a Python script would need revisiting...

"Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit")
"Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)")
("bucketPersistInterval", boost::program_options::value<std::size_t>(),
"Optional number of buckets after which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with persistInterval)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not specified then models will only be persisted at program exit is not true (because there might be wall clock periodic persistence), so might be best to just remove this bit.

const std::string& opt2) {
if (vm.count(opt1) && !vm[opt1].defaulted() && vm.count(opt2) &&
!vm[opt2].defaulted())
throw std::logic_error("Conflicting options '" + opt1 +
Copy link
Contributor

@droberts195 droberts195 Jul 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure logic_error is the most appropriate here - its description is:

It reports errors that are a consequence of faulty logic within the program such as violating logical preconditions or class invariants and may be

But this is a user error. So maybe runtime_exception would be better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, nit, we now prefer == false to !.

<< " is still in progress - increasing persistence interval by "
<< PERSIST_BUCKET_INCREMENT << " buckets");

m_BucketPersistInterval += PERSIST_BUCKET_INCREMENT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you really want this for the case of wanting to generate rich debug information. A lookback can easily churn through buckets faster than state can be persisted, and this will mean that somebody who asked for rich debug every 10 buckets might end up getting it every 20 buckets.

It may be better to add the functionality to persist in the foreground - see elastic/elasticsearch#29770 - and use that for the case of wanting to persist every N buckets.

@@ -691,6 +691,17 @@ class CLogMarginalLikelihood : core::CNonCopyable {

} // detail::

const std::string READABLE_OFFSET_TAG("offset");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be less error prone to define the readable and minified tags on the same line. This could be something like:

const TStrStrPr OFFSET_TAG("a", "offset");

(where TStrStrPr is a std::pair<std::string, std::string>)

Then overload insertValue so that the first argument can be either a std::string or TStrStrPr. This would avoid the need to have all the readableTags ? READABLE_FOO_TAG : FOO_TAG constructs as the test for readable tags would move into the TStrStrPr overload of insertValue.

Alternatively a custom class could be used:

const core::CPersistTag OFFSET_TAG("a", "offset");

To start with it would basically just be a pair of strings, but would be more future proof if we ever wanted to add another way of creating tags in the future, as we wouldn't need to search and replace a load of TStrStrPrs with core::CPersistTag at that future time.

@@ -1439,6 +1450,21 @@ void CGammaRateConjugate::print(const std::string& indent, std::string& result)
return;
}

std::string meanStr{"<unknown>"};
std::string sdStr{"<unknown>"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth considering if <unknown> is the best string to use to represent unknown values. Is the likely client for reading the values Python? If so, would some other string be easier for Python to interpret as meaning unknown?

Same in the other places where <unknown> is used.

The other thing about this is that using the string literal <unknown> in all the places means that:

  1. If we ever want to change it we've got to grep for it and
  2. There's a call to strlen on every initialisation

You could solve these problems by declaring a const std::string containing the unknown value. That will also store the length making copying more efficient and can easily be changed if we decide the primary client is, say, JavaScript rather than Python.

import json
import sys
import sh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file should have:

  1. A copyright header.
  2. A comment saying what the prerequisites are. For example, it requires jq be installed, which it isn't by default on macOS. And does it require Python 2 or Python 3 or does it work in both?

edsavage added 3 commits July 1, 2019 12:23
Created a class to couple together the short form of a persistence tag
and its associated long, readable form. This is designed to be a drop in
replacement for the existing persistence tags resulting in minimal
changes (if any) to existing acceptPersistInserter routines.

Unit tests TBD.
Attending to some helpful comments from review
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a first pass. Aside from some minor things my main observation is introducing two persist intervals feels a bit clunky. I want to understand better the rationale for this. To me tying the interval to the bucket length makes sense, except I guess it introduces problems for look back. I think it might be worth discussing this a bit offline.

@@ -103,7 +104,9 @@ bool CCmdLineParser::parse(int argc,
"Optional file to persist state to - not present means no state persistence")
("persistIsPipe", "Specified persist file is a named pipe")
("persistInterval", boost::program_options::value<core_t::TTime>(),
"Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit")
"Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find this a little confusing: (Mutually exclusive with bucketPersistInterval). Should this not be specified at the same time as bucketPersistInterval, i.e. causes error, does one take precedence, etc? Might be worth being more specific.

It feels a little awkward to have two alternative mechanisms for specifying the same thing. What is the rationale for keeping them separate? Tying to the bucket length makes more sense to me since the models change on this time scale.

const std::string& opt2) {
if (vm.count(opt1) && !vm[opt1].defaulted() && vm.count(opt2) &&
!vm[opt2].defaulted())
throw std::logic_error("Conflicting options '" + opt1 +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, nit, we now prefer == false to !.


ml::core_t::TTime completeToTime{0};
ml::core_t::TTime prevCompleteToTime{0};
while (restoredJob.restoreState(restoreSearcher, completeToTime) == true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for == true.

ml::model::CLimits limits;
ml::api::CFieldConfig fieldConfig;

if (!fieldConfig.initFromFile(ml::core::COsFileFuncs::NULL_FILENAME)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: mixing styles (and elsewhere). I think we prefer == false now.

// Read command line options
std::string logProperties;
std::string inputFileName;
bool isInputFileNamedPipe(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this file uses a mixture of (), = and {} initialisation. I generally prefer the new style initialisation since it catches accidental narrowing.

@@ -891,7 +899,8 @@ class CRestorerImpl<BasicRestore> {
template<typename A, typename B>
static bool newLevel(std::pair<A, B>& t, CStateRestoreTraverser& traverser) {
if (traverser.name() != FIRST_TAG) {
LOG_ERROR(<< "Tag mismatch at " << traverser.name() << ", expected " << FIRST_TAG);
LOG_ERROR(<< "Tag mismatch at " << traverser.name() << ", expected "
<< FIRST_TAG.name(false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this warrants a comment as to why one doesn't have readable tags here.

@@ -83,6 +83,14 @@ const std::string INTERIM_BUCKET_CORRECTOR_TAG("k");
//! The minimum version required to read the state corresponding to a model snapshot.
//! This should be updated every time there is a breaking change to the model state.
const std::string MODEL_SNAPSHOT_MIN_VERSION("6.4.0");

// Persist state as JSON with meaningful tag names.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//!

result += "mean = " + meanStr + " sd = " + sdStr;
}

void CGammaRateConjugate::restoreDescriptiveStatistics(std::string& meanStr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the naming of this function especially as it is used in e.g. print and persist: to me this reads as do some form of restore. How about printMarginalLikelihoodStatistics()?

I also wonder if this might reasonably return std::pair<std::string, std::string> they would be moved into place after all. That way we could have an implementation in CPrior which checks if non informative and returns unknown values otherwise calls a virtual implementation: that way you can have just one place where these are defined.

//! This is since currently the long form of the tag names are not required to be restored
//! from state - only persisted.
//!
class CORE_EXPORT CPersistenceTag {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ nice helper class.


try {
return this->doPrintMarginalLikelihoodStatistics();
} catch (...) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's good practice to use catch (...) except where it's effectively a finally clause and the caught exception is rethrown or otherwise propagated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this method catches all its exceptions so I don't think we need a try here. (Unless you really can't avoid it, say callback into 3rd party code, none of our functions should be throwing exceptions.)

TStrStrPr unknownValuePr{UNKNOWN_VALUE_STRING, UNKNOWN_VALUE_STRING};

if (this->isNonInformative()) {
return unknownValuePr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can construct a pair from an intializer list, so this line and the other one that returns unknownValuePr could be return {UNKNOWN_VALUE_STRING, UNKNOWN_VALUE_STRING};. Then you wouldn't have to construct unknownValuePr in the happy day case of a useful mean and standard deviation being available.

@@ -147,7 +147,7 @@ class CClusterer : public CClustererTypes {
//! \name Clusterer Contract
//@{
//! Get the tag name for this clusterer.
virtual std::string persistenceTag() const = 0;
virtual core::TPersistenceTag persistenceTag() const = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there ever a need for this method to return a dynamically constructed tag rather than a static constant tag? If not then returning a const reference would avoid the need to copy two strings into the return value.

edsavage added 2 commits July 2, 2019 15:42
* Attending to code review comments
* Added tests exercising the new persistence tags
* Committing previously missed CCmdLineParser class files
* Adding support for extracting model state in XML format. This works
around the issue of the JSON format not being directly parsable by
standard means without first massaging with e.g. the 'jq' JSON command
line processor. This comes at the expense of the larger size of the XML
documents.
* Added example python script demonstrating how to parse the model state
output when in XML format.
if len(sys.argv) < 2:
data = sys.stdin.read();
else:
fileName = sys.argv[1]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely familliar with the usecase here, so this could be a misplaced suggestion but argparse is a builti-in lib in Py3 that can be used to create parsers for cli arguments.

edsavage added 2 commits July 22, 2019 15:20
# Conflicts:
#	bin/autodetect/CCmdLineParser.cc
#	bin/autodetect/Main.cc
#	include/api/CPersistenceManager.h
#	lib/api/CAnomalyJob.cc
#	lib/api/CPersistenceManager.cc
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on the understanding that it has no impact on the anomaly detection job results.

Please don't merge today, as we're in the middle of investigating some other performance regressions. But if you merge tomorrow then we can look out for any unforeseen changes to results or performance in anomaly detection. /cc @dolaru

@droberts195 droberts195 marked this pull request as ready for review July 25, 2019 10:41
@edsavage edsavage merged commit ac03003 into elastic:master Jul 26, 2019
edsavage added a commit to edsavage/ml-cpp that referenced this pull request Jul 26, 2019
Extract and print residual models in a readable format

A utility to extract model state from a state
dump persisted by autodetect and to then print the residual models in a
somewhat friendlier, readable format

* Has the ability to read input from file, named pipe or stdin - similarly
for writing output
* Regenerates marginal likelihood mean and sd from persisted state
* Added option for `autodetect` to persist every N buckets
* Extracted model state documents are in standard ES ML
format, i.e. separated by nulls. This allows for the possibility of
eventually using ES indices as source/sink for the model state
extraction.
* Has support for both XML and JSON output formats
* Example python scripts to parse model state documents.
edsavage added a commit that referenced this pull request Jul 26, 2019
Extract and print residual models in a readable format

A utility to extract model state from a state
dump persisted by autodetect and to then print the residual models in a
somewhat friendlier, readable format

Backport #518
@edsavage edsavage deleted the model_extractor branch July 26, 2019 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants