-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ingest: Introduce the dissect processor #32884
Conversation
The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work.
Pinging @elastic/es-core-infra |
@jakelandis Before I review the code, I want to understand better how a new feature can be marked as breaking? Why can't we introduce this in 6.x with the same behavior we intend for 7.0? (I also don't think #32297 should be marked as breaking, sorry I didn't notice that before.) If we must break this, then the first PR should have the intended 6.x behavior, and a followup with the breaking change in 7.0 should be made separately. |
@rjernst - the intent here is to match Beats and Logstash's implementation. However, in the course of implementing for this for the ingest I found a point of confusion w/r/t an overloaded use of the After some discussion with Beats and Logstash, they plan to make a corresponding (breaking) change for 7.x too. However there was a desire to allow copy/pasting of patterns between products withing major versions, which necessitates a breaking change here too.
Apologies for the mis-ordering of PR's. I had planned once all changes were in master to backport and then add an extra commit the 6.x rules (cherry-picking + 1 commit), and seek review on that one commit. If we are OK with ingest node dissect syntax being slightly different then Beats and Logstash in 6.x, then I can remove the breaking changes label and do a normal backport. The difference is pretty minor ... the change between 6.x and 7.x is |
The logic makes sense for why a breaking change. But the order should be as I described in the previous comment. Can this PR be amended with the old behavior? |
@rjernst I am re-visiting the discussion around the need to make this a breaking change. I will update this PR with the result of that discussion. |
After further discussion, this will NOT be a breaking change. The syntax between ingest/Logstash/Beats will have a minor difference w/r/t I am removing the breaking label. @rjernst This is ready to review, and will be a normal cherry-picked backport. |
ping @rjernst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although I am a bit apprehensive about this being in ingest-common? That is turning into what I was afraid of: a catch all where all ingest processors seem to go.
[[dissect-key-modifiers]] | ||
==== Dissect key modifiers | ||
Key modifiers can change the default behavior for dissection. Key modifiers may be found on the left or right | ||
of the `%{keyname}` always inside the `{%` and `}`. For example `%{+keyname ->}` has the append and right padding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? {%
-> %{
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. thanks!
} | ||
|
||
//package private for testing | ||
String getField() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all the members are final, you could just make them package private and not need the getters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup. removed the getters and made the members package private. thanks!
@rjernst - thanks for the review. Will merge once the cosmetic changes and updated branch goes green.
I am happy to move stuff around (in a different PR), did you have any specific ideas? I can log something for further discussion too. |
It might be worth discussing with the core team. Grok was originally a plugin because ingest-common was supposed to have no dependencies. |
All green merging. will backport as-is for inclusion with 6.5 |
The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work.
* master: [Rollup] Better error message when trying to set non-rollup index (#32965) HLRC: Use Optional in validation logic (#33104) Remove unused User class from protocol (#33137) ingest: Introduce the dissect processor (#32884) [Docs] Add link to es-kotlin-wrapper-client (#32618) [Docs] Remove repeating words (#33087) Minor spelling and grammar fix (#32931) Remove support for deprecated params._agg/_aggs for scripted metric aggregations (#32979) Watcher: Simplify finding next date in cron schedule (#33015) Run Third party audit with forbidden APIs CLI (part3/3) (#33052) Fix plugin build test on Windows (#33078) HLRC+MINOR: Remove Unused Private Method (#33165) Remove old unused test script files (#32970) Build analysis-icu client JAR (#33184) Ensure to generate identical NoOp for the same failure (#33141) ShardSearchFailure#readFrom to set index and shardId (#33161)
The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work.
* Introduce the dissect library (#32297) The dissect library will be used for the ingest node as an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Note - Regular expressions are used during construction of the objects, but not in the hot path. A dissect pattern takes the form of: '%{a} %{b},%{c}' which is composed of 3 keys (a,b,c) and two delimiters (space and comma). This dissect pattern will match a string of the form: 'foo bar,baz' and will result a key/value pairing of 'a=foo, b=bar, and c=baz'. See the comments in DissectParser for a full explanation. This commit does not include the ingest node processor that will consume it. However, the consumption should be a trivial mapping between the key/value pairing returned by the parser and the key/value pairing needed for the IngestDocument. * ingest: Introduce the dissect processor (#32884) The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work. * ingest: minor - update test to include dissect (#33211) This change also includes placing the bytes processor in the correct order (helps to avoid merge conflict when back patching processors)
The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.
Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.
EDIT: removed comments about breaking change, since this will no longer be a breaking change.
Rally Benchmark:
Rally track here: https://github.com/jakelandis/rally-tracks-ingest/blob/dissect_tests/http_logs_for_ingest/track.json
Full test results here: https://gist.github.com/jakelandis/de554f394e5455ee66797c2b956d63bf
Single node, 1 shard, 0 replicas, i7 4 core/8 thread running ~3.7GHz with NVMe disk.
_source is disabled and refresh_interval is set to -1.
All fields use the following mapping to focus on the ingest node performance.
Data set is the same as data set as https://github.com/elastic/rally-tracks/tree/master/http_logs based on Web server logs from the 1998 Football world cup. However, this test uses an un-parsed format for the data.
Example log lines:
The dataset has 247,249,096 documents ... each test was run 4 times, once with 1 lap (~250m), once with 2 laps (~500m), once with 3 laps (~750m), and once with 4 laps (~1b).
Baseline - runs an ingest pipeline that does nothing (specifically it attempts to upper case a missing field). This accounts for the overhead of creating the IngestDocument and pushing the data through a pipeline.
Grok - extracts the timestamp, ip, request, status, and size from the source string with the Grok processor. (
%{IPORHOST:clientip} %{HTTPDUSER} %{USER} \\[%{TIMESTAMP_ISO8601:@timestamp}\\] \"(?:%{WORD} %{NOTSPACE:request}(?: HTTP/%{NUMBER})?|%{DATA})\" %{NUMBER:status} (?:%{NUMBER:size}|-)
)Dissect - extracts the timestamp, ip, request, status, and size from the source string with the Dissect processor. (
%{clientip} %{?ident} %{?auth} [%{@timestamp}] \"%{?verb} %{request} HTTP/%{?httpversion}\" %{status} %{size}
)Results (average the medians across all runs)
Dissect is about 8% faster then Grok. (ranges from 7-10% across runs)
Dissect is about 15% slower then the Baseline. (ranges from 14-16% across runs)
Grok is about 22% slower then the Baseline. (ranges from 20%-24% across runs)
Throughput (docs/second - higher is better)
data:image/s3,"s3://crabby-images/ce62c/ce62c3a11a51b82a9f6094b203d54e40c94d6970" alt="image"
Screen shot of rendered help here: https://gist.github.com/jakelandis/dd0b627029f26fd52957a00b0c7e118f