-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Model extractor #518
Model extractor #518
Conversation
First working version of utility to extract model state from a state dump persisted by autodetect and to then print the residual models in a somewhat friendlier, readable format
* Tidy up parsing command line arguments. * Add the ability to read input from file, named pipe or stdin - similarly for writing output
* Allow autodetect to persist state periodically based on the number of buckets that have been processed * Cleaned up the JSON output generated by model_extractor to simplify parsing * Removed unused code
Reverted back to persisting extracted model state docs in standard ES format, i.e. separated by nulls. This allows for the possibility of eventually using ES indices as source/sink for the model state extraction. Added a python script to parse model state documents. It makes use of jq snippet to first transform the state documents into a format acceptable to the standard python json module by dealing with objects with duplicate names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like good progress!
bin/autodetect/CCmdLineParser.cc
Outdated
@@ -103,7 +104,9 @@ bool CCmdLineParser::parse(int argc, | |||
"Optional file to persist state to - not present means no state persistence") | |||
("persistIsPipe", "Specified persist file is a named pipe") | |||
("persistInterval", boost::program_options::value<core_t::TTime>(), | |||
"Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit") | |||
"Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not specified then models will only be persisted at program exit
is not true any more, so might be best to just remove this bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also find this a little confusing: (Mutually exclusive with bucketPersistInterval)
. Should this not be specified at the same time as bucketPersistInterval
, i.e. causes error, does one take precedence, etc? Might be worth being more specific.
It feels a little awkward to have two alternative mechanisms for specifying the same thing. What is the rationale for keeping them separate? Tying to the bucket length makes more sense to me since the models change on this time scale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tying to the bucket length makes more sense to me since the models change on this time scale.
But remember that this is not greenfield functionality in this PR. We've done background persistence based on wall clock time for about 5 years now. The rationale for using wall clock time was that if a lookback churns through 1000 half hour buckets in 20 minutes of wall clock time we probably don't need a background persistence in the middle of that work, but if running in real time then we want lots of periodic persists during the 500 hours it takes to get through 1000 half hour buckets.
This really leads on to the question of whether this functionality is only ever going to be used as an internal debugging tool. At one time the thinking was that it would be internal only. End users don't call the C++ process directly, so it doesn't matter if the options are esoteric (although it's still best for the descriptions to be factually accurate). But if the thinking on exposing this externally has now changed then the external user interface needs more careful thought, including how to protect people from accidentally doing a denial-of-service attack on themselves by for example persisting huge model state every 2 buckets during a 10000 bucket lookback. Also before exposing this externally things like using jq
from a Python script would need revisiting...
bin/autodetect/CCmdLineParser.cc
Outdated
"Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit") | ||
"Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)") | ||
("bucketPersistInterval", boost::program_options::value<std::size_t>(), | ||
"Optional number of buckets after which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with persistInterval)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not specified then models will only be persisted at program exit
is not true (because there might be wall clock periodic persistence), so might be best to just remove this bit.
bin/autodetect/CCmdLineParser.cc
Outdated
const std::string& opt2) { | ||
if (vm.count(opt1) && !vm[opt1].defaulted() && vm.count(opt2) && | ||
!vm[opt2].defaulted()) | ||
throw std::logic_error("Conflicting options '" + opt1 + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure logic_error
is the most appropriate here - its description is:
It reports errors that are a consequence of faulty logic within the program such as violating logical preconditions or class invariants and may be
But this is a user error. So maybe runtime_exception
would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, nit, we now prefer == false
to !
.
lib/api/CBackgroundPersister.cc
Outdated
<< " is still in progress - increasing persistence interval by " | ||
<< PERSIST_BUCKET_INCREMENT << " buckets"); | ||
|
||
m_BucketPersistInterval += PERSIST_BUCKET_INCREMENT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure you really want this for the case of wanting to generate rich debug information. A lookback can easily churn through buckets faster than state can be persisted, and this will mean that somebody who asked for rich debug every 10 buckets might end up getting it every 20 buckets.
It may be better to add the functionality to persist in the foreground - see elastic/elasticsearch#29770 - and use that for the case of wanting to persist every N buckets.
lib/maths/CGammaRateConjugate.cc
Outdated
@@ -691,6 +691,17 @@ class CLogMarginalLikelihood : core::CNonCopyable { | |||
|
|||
} // detail:: | |||
|
|||
const std::string READABLE_OFFSET_TAG("offset"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be less error prone to define the readable and minified tags on the same line. This could be something like:
const TStrStrPr OFFSET_TAG("a", "offset");
(where TStrStrPr
is a std::pair<std::string, std::string>
)
Then overload insertValue
so that the first argument can be either a std::string
or TStrStrPr
. This would avoid the need to have all the readableTags ? READABLE_FOO_TAG : FOO_TAG
constructs as the test for readable tags would move into the TStrStrPr
overload of insertValue
.
Alternatively a custom class could be used:
const core::CPersistTag OFFSET_TAG("a", "offset");
To start with it would basically just be a pair of strings, but would be more future proof if we ever wanted to add another way of creating tags in the future, as we wouldn't need to search and replace a load of TStrStrPr
s with core::CPersistTag
at that future time.
lib/maths/CGammaRateConjugate.cc
Outdated
@@ -1439,6 +1450,21 @@ void CGammaRateConjugate::print(const std::string& indent, std::string& result) | |||
return; | |||
} | |||
|
|||
std::string meanStr{"<unknown>"}; | |||
std::string sdStr{"<unknown>"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth considering if <unknown>
is the best string to use to represent unknown values. Is the likely client for reading the values Python? If so, would some other string be easier for Python to interpret as meaning unknown?
Same in the other places where <unknown>
is used.
The other thing about this is that using the string literal <unknown>
in all the places means that:
- If we ever want to change it we've got to grep for it and
- There's a call to
strlen
on every initialisation
You could solve these problems by declaring a const std::string
containing the unknown value. That will also store the length making copying more efficient and can easily be changed if we decide the primary client is, say, JavaScript rather than Python.
import json | ||
import sys | ||
import sh | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this file should have:
- A copyright header.
- A comment saying what the prerequisites are. For example, it requires
jq
be installed, which it isn't by default on macOS. And does it require Python 2 or Python 3 or does it work in both?
Created a class to couple together the short form of a persistence tag and its associated long, readable form. This is designed to be a drop in replacement for the existing persistence tags resulting in minimal changes (if any) to existing acceptPersistInserter routines. Unit tests TBD.
Attending to some helpful comments from review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first pass. Aside from some minor things my main observation is introducing two persist intervals feels a bit clunky. I want to understand better the rationale for this. To me tying the interval to the bucket length makes sense, except I guess it introduces problems for look back. I think it might be worth discussing this a bit offline.
bin/autodetect/CCmdLineParser.cc
Outdated
@@ -103,7 +104,9 @@ bool CCmdLineParser::parse(int argc, | |||
"Optional file to persist state to - not present means no state persistence") | |||
("persistIsPipe", "Specified persist file is a named pipe") | |||
("persistInterval", boost::program_options::value<core_t::TTime>(), | |||
"Optional interval at which to periodically persist model state - if not specified then models will only be persisted at program exit") | |||
"Optional time interval at which to periodically persist model state - if not specified then models will only be persisted at program exit. (Mutually exclusive with bucketPersistInterval)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also find this a little confusing: (Mutually exclusive with bucketPersistInterval)
. Should this not be specified at the same time as bucketPersistInterval
, i.e. causes error, does one take precedence, etc? Might be worth being more specific.
It feels a little awkward to have two alternative mechanisms for specifying the same thing. What is the rationale for keeping them separate? Tying to the bucket length makes more sense to me since the models change on this time scale.
bin/autodetect/CCmdLineParser.cc
Outdated
const std::string& opt2) { | ||
if (vm.count(opt1) && !vm[opt1].defaulted() && vm.count(opt2) && | ||
!vm[opt2].defaulted()) | ||
throw std::logic_error("Conflicting options '" + opt1 + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, nit, we now prefer == false
to !
.
devbin/model_extractor/Main.cc
Outdated
|
||
ml::core_t::TTime completeToTime{0}; | ||
ml::core_t::TTime prevCompleteToTime{0}; | ||
while (restoredJob.restoreState(restoreSearcher, completeToTime) == true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need for == true
.
devbin/model_extractor/Main.cc
Outdated
ml::model::CLimits limits; | ||
ml::api::CFieldConfig fieldConfig; | ||
|
||
if (!fieldConfig.initFromFile(ml::core::COsFileFuncs::NULL_FILENAME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: mixing styles (and elsewhere). I think we prefer == false
now.
devbin/model_extractor/Main.cc
Outdated
// Read command line options | ||
std::string logProperties; | ||
std::string inputFileName; | ||
bool isInputFileNamedPipe(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this file uses a mixture of (), = and {} initialisation. I generally prefer the new style initialisation since it catches accidental narrowing.
@@ -891,7 +899,8 @@ class CRestorerImpl<BasicRestore> { | |||
template<typename A, typename B> | |||
static bool newLevel(std::pair<A, B>& t, CStateRestoreTraverser& traverser) { | |||
if (traverser.name() != FIRST_TAG) { | |||
LOG_ERROR(<< "Tag mismatch at " << traverser.name() << ", expected " << FIRST_TAG); | |||
LOG_ERROR(<< "Tag mismatch at " << traverser.name() << ", expected " | |||
<< FIRST_TAG.name(false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this warrants a comment as to why one doesn't have readable tags here.
lib/api/CAnomalyJob.cc
Outdated
@@ -83,6 +83,14 @@ const std::string INTERIM_BUCKET_CORRECTOR_TAG("k"); | |||
//! The minimum version required to read the state corresponding to a model snapshot. | |||
//! This should be updated every time there is a breaking change to the model state. | |||
const std::string MODEL_SNAPSHOT_MIN_VERSION("6.4.0"); | |||
|
|||
// Persist state as JSON with meaningful tag names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//!
lib/maths/CGammaRateConjugate.cc
Outdated
result += "mean = " + meanStr + " sd = " + sdStr; | ||
} | ||
|
||
void CGammaRateConjugate::restoreDescriptiveStatistics(std::string& meanStr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the naming of this function especially as it is used in e.g. print and persist: to me this reads as do some form of restore. How about printMarginalLikelihoodStatistics()
?
I also wonder if this might reasonably return std::pair<std::string, std::string>
they would be moved into place after all. That way we could have an implementation in CPrior
which checks if non informative and returns unknown values otherwise calls a virtual implementation: that way you can have just one place where these are defined.
//! This is since currently the long form of the tag names are not required to be restored | ||
//! from state - only persisted. | ||
//! | ||
class CORE_EXPORT CPersistenceTag { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ nice helper class.
Attending to code review comments from @tveasey
lib/maths/CPrior.cc
Outdated
|
||
try { | ||
return this->doPrintMarginalLikelihoodStatistics(); | ||
} catch (...) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's good practice to use catch (...)
except where it's effectively a finally
clause and the caught exception is rethrown or otherwise propagated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this method catches all its exceptions so I don't think we need a try here. (Unless you really can't avoid it, say callback into 3rd party code, none of our functions should be throwing exceptions.)
lib/maths/CPrior.cc
Outdated
TStrStrPr unknownValuePr{UNKNOWN_VALUE_STRING, UNKNOWN_VALUE_STRING}; | ||
|
||
if (this->isNonInformative()) { | ||
return unknownValuePr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can construct a pair from an intializer list, so this line and the other one that returns unknownValuePr
could be return {UNKNOWN_VALUE_STRING, UNKNOWN_VALUE_STRING};
. Then you wouldn't have to construct unknownValuePr
in the happy day case of a useful mean and standard deviation being available.
include/maths/CClusterer.h
Outdated
@@ -147,7 +147,7 @@ class CClusterer : public CClustererTypes { | |||
//! \name Clusterer Contract | |||
//@{ | |||
//! Get the tag name for this clusterer. | |||
virtual std::string persistenceTag() const = 0; | |||
virtual core::TPersistenceTag persistenceTag() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there ever a need for this method to return a dynamically constructed tag rather than a static constant tag? If not then returning a const reference would avoid the need to copy two strings into the return value.
* Attending to code review comments * Added tests exercising the new persistence tags
* Committing previously missed CCmdLineParser class files * Adding support for extracting model state in XML format. This works around the issue of the JSON format not being directly parsable by standard means without first massaging with e.g. the 'jq' JSON command line processor. This comes at the expense of the larger size of the XML documents. * Added example python script demonstrating how to parse the model state output when in XML format.
if len(sys.argv) < 2: | ||
data = sys.stdin.read(); | ||
else: | ||
fileName = sys.argv[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely familliar with the usecase here, so this could be a misplaced suggestion but argparse
is a builti-in lib in Py3 that can be used to create parsers for cli arguments.
Modified model state parser scripts to additionally extract the prior weights
# Conflicts: # bin/autodetect/CCmdLineParser.cc # bin/autodetect/Main.cc # include/api/CPersistenceManager.h # lib/api/CAnomalyJob.cc # lib/api/CPersistenceManager.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on the understanding that it has no impact on the anomaly detection job results.
Please don't merge today, as we're in the middle of investigating some other performance regressions. But if you merge tomorrow then we can look out for any unforeseen changes to results or performance in anomaly detection. /cc @dolaru
Extract and print residual models in a readable format A utility to extract model state from a state dump persisted by autodetect and to then print the residual models in a somewhat friendlier, readable format * Has the ability to read input from file, named pipe or stdin - similarly for writing output * Regenerates marginal likelihood mean and sd from persisted state * Added option for `autodetect` to persist every N buckets * Extracted model state documents are in standard ES ML format, i.e. separated by nulls. This allows for the possibility of eventually using ES indices as source/sink for the model state extraction. * Has support for both XML and JSON output formats * Example python scripts to parse model state documents.
Added a new executable in
devbin/model_extractor
that is designed to restore part of the model state in a more human digestible format (un-compressed / meaningful tag names) than the standard persistence format used byautodetect
. This new tool is intended to be used for debug / educational purposes in general but specifically as a means of automating several manual steps key to the python notebooks residing in https://github.com/elastic/ml-cpp-data (which should make production of these notebooks a simpler task).I've created this PR as a draft initially as this is still very much a work-in-progress but am keen to canvas opinion of the approach taken.
The main points to note are:
acceptPersistInserter
methods.model_extractor
executable restores an entireautodetect
persistence dump (or sequence of such dumps) from file or pipe and re-persists the part of the hierarchy of interest in human readable format. Currently we're only interested in the residual modelsjq
snippet obtained from Unexpected behavior with duplicate attributes jqlang/jq#1795 (comment)