ingest: Introduce the dissect processor #32884

jakelandis · 2018-08-15T19:57:35Z

The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.

Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.

EDIT: removed comments about breaking change, since this will no longer be a breaking change.

Rally Benchmark:

Rally track here: https://github.com/jakelandis/rally-tracks-ingest/blob/dissect_tests/http_logs_for_ingest/track.json
Full test results here: https://gist.github.com/jakelandis/de554f394e5455ee66797c2b956d63bf

Single node, 1 shard, 0 replicas, i7 4 core/8 thread running ~3.7GHz with NVMe disk.

_source is disabled and refresh_interval is set to -1.

All fields use the following mapping to focus on the ingest node performance.

"type": "keyword",
"index": false,
"doc_values": false

Data set is the same as data set as https://github.com/elastic/rally-tracks/tree/master/http_logs based on Web server logs from the 1998 Football world cup. However, this test uses an un-parsed format for the data.

Example log lines:

{"message" : "30.87.8.0 - - [1998-05-24T15:00:01-05:00] \"GET /images/info.gif HTTP/1.0\" 200 1251"}
{"message" : "28.87.8.0 - - [1998-05-24T15:00:01-05:00] \"GET /french/images/hm_official.gif HTTP/1.1\" 200 972"}
{"message" : "17.87.8.0 - - [1998-05-24T15:00:01-05:00] \"GET /french/hosts/cfo/images/cfo/cfophot3.jpg HTTP/1.0\" 200 6695"}
{"message" : "72.236.7.0 - - [1998-05-24T15:00:01-05:00] \"GET /images/s102380.gif HTTP/1.1\" 200 207"}
{"message" : "82.36.0.0 - - [1998-05-24T15:00:02-05:00] \"GET /images/nav_bg_bottom.jpg HTTP/1.0\" 200 8389"}

The dataset has 247,249,096 documents ... each test was run 4 times, once with 1 lap (~250m), once with 2 laps (~500m), once with 3 laps (~750m), and once with 4 laps (~1b).

Baseline - runs an ingest pipeline that does nothing (specifically it attempts to upper case a missing field). This accounts for the overhead of creating the IngestDocument and pushing the data through a pipeline.

Grok - extracts the timestamp, ip, request, status, and size from the source string with the Grok processor. (%{IPORHOST:clientip} %{HTTPDUSER} %{USER} \\[%{TIMESTAMP_ISO8601:@timestamp}\\] \"(?:%{WORD} %{NOTSPACE:request}(?: HTTP/%{NUMBER})?|%{DATA})\" %{NUMBER:status} (?:%{NUMBER:size}|-))

Dissect - extracts the timestamp, ip, request, status, and size from the source string with the Dissect processor. (%{clientip} %{?ident} %{?auth} [%{@timestamp}] \"%{?verb} %{request} HTTP/%{?httpversion}\" %{status} %{size})

Results (average the medians across all runs)

Dissect is about 8% faster then Grok. (ranges from 7-10% across runs)
Dissect is about 15% slower then the Baseline. (ranges from 14-16% across runs)
Grok is about 22% slower then the Baseline. (ranges from 20%-24% across runs)

Throughput (docs/second - higher is better)

Screen shot of rendered help here: https://gist.github.com/jakelandis/dd0b627029f26fd52957a00b0c7e118f

The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work.

elasticmachine · 2018-08-15T19:57:37Z

Pinging @elastic/es-core-infra

rjernst · 2018-08-16T18:22:27Z

@jakelandis Before I review the code, I want to understand better how a new feature can be marked as breaking? Why can't we introduce this in 6.x with the same behavior we intend for 7.0? (I also don't think #32297 should be marked as breaking, sorry I didn't notice that before.) If we must break this, then the first PR should have the intended 6.x behavior, and a followup with the breaking change in 7.0 should be made separately.

jakelandis · 2018-08-16T19:07:36Z

@rjernst - the intent here is to match Beats and Logstash's implementation. However, in the course of implementing for this for the ingest I found a point of confusion w/r/t an overloaded use of the ? symbol (as well as a couple other minor things) that I wanted to change. Unfortunately this is a breaking w/r/t to the way Beats and Logstash currently work.

After some discussion with Beats and Logstash, they plan to make a corresponding (breaking) change for 7.x too. However there was a desire to allow copy/pasting of patterns between products withing major versions, which necessitates a breaking change here too.

If we must break this, then the first PR should have the intended 6.x behavior, and a followup with the breaking change in 7.0 should be made separately.

Apologies for the mis-ordering of PR's. I had planned once all changes were in master to backport and then add an extra commit the 6.x rules (cherry-picking + 1 commit), and seek review on that one commit.

If we are OK with ingest node dissect syntax being slightly different then Beats and Logstash in 6.x, then I can remove the breaking changes label and do a normal backport. The difference is pretty minor ... the change between 6.x and 7.x is ? should now be * and some extra validation is enforced. The 7x->6x code diff is here: jakelandis/elasticsearch@c2ada87...2b4a4d5 (some extra fluff in that diff which will get cleaned up for the PR)

rjernst · 2018-08-16T22:41:48Z

The logic makes sense for why a breaking change. But the order should be as I described in the previous comment. Can this PR be amended with the old behavior?

jakelandis · 2018-08-17T14:36:06Z

@rjernst I am re-visiting the discussion around the need to make this a breaking change. I will update this PR with the result of that discussion.

jakelandis · 2018-08-21T13:43:22Z

After further discussion, this will NOT be a breaking change. The syntax between ingest/Logstash/Beats will have a minor difference w/r/t * vs ? for 6.x...but that is OK since it will avoid a breaking change in Elasticsearch.

I am removing the breaking label.

@rjernst This is ready to review, and will be a normal cherry-picked backport.

jakelandis · 2018-08-27T13:56:54Z

ping @rjernst

rjernst

LGTM, although I am a bit apprehensive about this being in ingest-common? That is turning into what I was afraid of: a catch all where all ingest processors seem to go.

rjernst · 2018-08-27T18:20:10Z

docs/reference/ingest/ingest-node.asciidoc

+[[dissect-key-modifiers]]
+==== Dissect key modifiers
+Key modifiers can change the default behavior for dissection. Key modifiers may be found on the left or right
+of the `%{keyname}` always inside the `{%` and `}`. For example `%{+keyname ->}` has the append and right padding


typo? {% -> %{

fixed. thanks!

rjernst · 2018-08-27T18:22:39Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/DissectProcessor.java

+    }
+
+    //package private for testing
+    String getField() {


Since all the members are final, you could just make them package private and not need the getters?

yup. removed the getters and made the members package private. thanks!

jakelandis · 2018-08-27T21:21:34Z

@rjernst - thanks for the review. Will merge once the cosmetic changes and updated branch goes green.

although I am a bit apprehensive about this being in ingest-common? That is turning into what I was afraid of: a catch all where all ingest processors seem to go.

I am happy to move stuff around (in a different PR), did you have any specific ideas? I can log something for further discussion too.

rjernst · 2018-08-27T21:32:00Z

It might be worth discussing with the core team. Grok was originally a plugin because ingest-common was supposed to have no dependencies.

jakelandis · 2018-08-28T14:11:05Z

All green merging. will backport as-is for inclusion with 6.5

The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work.

* master: [Rollup] Better error message when trying to set non-rollup index (#32965) HLRC: Use Optional in validation logic (#33104) Remove unused User class from protocol (#33137) ingest: Introduce the dissect processor (#32884) [Docs] Add link to es-kotlin-wrapper-client (#32618) [Docs] Remove repeating words (#33087) Minor spelling and grammar fix (#32931) Remove support for deprecated params._agg/_aggs for scripted metric aggregations (#32979) Watcher: Simplify finding next date in cron schedule (#33015) Run Third party audit with forbidden APIs CLI (part3/3) (#33052) Fix plugin build test on Windows (#33078) HLRC+MINOR: Remove Unused Private Method (#33165) Remove old unused test script files (#32970) Build analysis-icu client JAR (#33184) Ensure to generate identical NoOp for the same failure (#33141) ShardSearchFailure#readFrom to set index and shardId (#33161)

The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work.

* Introduce the dissect library (#32297) The dissect library will be used for the ingest node as an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Note - Regular expressions are used during construction of the objects, but not in the hot path. A dissect pattern takes the form of: '%{a} %{b},%{c}' which is composed of 3 keys (a,b,c) and two delimiters (space and comma). This dissect pattern will match a string of the form: 'foo bar,baz' and will result a key/value pairing of 'a=foo, b=bar, and c=baz'. See the comments in DissectParser for a full explanation. This commit does not include the ingest node processor that will consume it. However, the consumption should be a trivial mapping between the key/value pairing returned by the parser and the key/value pairing needed for the IngestDocument. * ingest: Introduce the dissect processor (#32884) The ingest node dissect processor is an alternative to Grok to split a string based on a pattern. Dissect differs from Grok such that regular expressions are not used to split the string. Dissect can be used to parse a source text field with a simpler pattern, and is often faster the Grok for basic string parsing. This processor uses the dissect library which does most of the work. * ingest: minor - update test to include dissect (#33211) This change also includes placing the bytes processor in the correct order (helps to avoid merge conflict when back patching processors)

jakelandis added >enhancement >breaking :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 labels Aug 15, 2018

jakelandis requested a review from rjernst August 16, 2018 15:51

jakelandis added v6.5.0 and removed >breaking labels Aug 21, 2018

rjernst approved these changes Aug 27, 2018

View reviewed changes

jakelandis added 2 commits August 27, 2018 16:05

code review changes

5aba95a

Merge branch 'master' into dissect_processor

f957c1b

jakelandis merged commit 79b507d into elastic:master Aug 28, 2018

jakelandis added the backport pending label Aug 28, 2018

jakelandis mentioned this pull request Sep 5, 2018

ingest: Dissect processor and lib for 6x #33422

Merged

jakelandis removed the backport pending label Sep 5, 2018

Mpdreamz mentioned this pull request Dec 13, 2018

[meta] 6.5.0 Release elastic/elasticsearch-net#3457

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ingest: Introduce the dissect processor #32884

ingest: Introduce the dissect processor #32884

jakelandis commented Aug 15, 2018 •

edited

Loading

elasticmachine commented Aug 15, 2018

rjernst commented Aug 16, 2018

jakelandis commented Aug 16, 2018

rjernst commented Aug 16, 2018

jakelandis commented Aug 17, 2018

jakelandis commented Aug 21, 2018

jakelandis commented Aug 27, 2018

rjernst left a comment

rjernst Aug 27, 2018

jakelandis Aug 27, 2018

rjernst Aug 27, 2018

jakelandis Aug 27, 2018

jakelandis commented Aug 27, 2018

rjernst commented Aug 27, 2018

jakelandis commented Aug 28, 2018

ingest: Introduce the dissect processor #32884

ingest: Introduce the dissect processor #32884

Conversation

jakelandis commented Aug 15, 2018 • edited Loading

Rally Benchmark:

elasticmachine commented Aug 15, 2018

rjernst commented Aug 16, 2018

jakelandis commented Aug 16, 2018

rjernst commented Aug 16, 2018

jakelandis commented Aug 17, 2018

jakelandis commented Aug 21, 2018

jakelandis commented Aug 27, 2018

rjernst left a comment

Choose a reason for hiding this comment

rjernst Aug 27, 2018

Choose a reason for hiding this comment

jakelandis Aug 27, 2018

Choose a reason for hiding this comment

rjernst Aug 27, 2018

Choose a reason for hiding this comment

jakelandis Aug 27, 2018

Choose a reason for hiding this comment

jakelandis commented Aug 27, 2018

rjernst commented Aug 27, 2018

jakelandis commented Aug 28, 2018

jakelandis commented Aug 15, 2018 •

edited

Loading