Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest: Introduce the dissect processor #32884

Merged
merged 3 commits into from
Aug 28, 2018

Conversation

jakelandis
Copy link
Contributor

@jakelandis jakelandis commented Aug 15, 2018

The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.

Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.


EDIT: removed comments about breaking change, since this will no longer be a breaking change.

Rally Benchmark:

Rally track here: https://github.com/jakelandis/rally-tracks-ingest/blob/dissect_tests/http_logs_for_ingest/track.json
Full test results here: https://gist.github.com/jakelandis/de554f394e5455ee66797c2b956d63bf

Single node, 1 shard, 0 replicas, i7 4 core/8 thread running ~3.7GHz with NVMe disk.

_source is disabled and refresh_interval is set to -1.

All fields use the following mapping to focus on the ingest node performance.

"type": "keyword",
"index": false,
"doc_values": false

Data set is the same as data set as https://github.com/elastic/rally-tracks/tree/master/http_logs based on Web server logs from the 1998 Football world cup. However, this test uses an un-parsed format for the data.

Example log lines:

{"message" : "30.87.8.0 - - [1998-05-24T15:00:01-05:00] \"GET /images/info.gif HTTP/1.0\" 200 1251"}
{"message" : "28.87.8.0 - - [1998-05-24T15:00:01-05:00] \"GET /french/images/hm_official.gif HTTP/1.1\" 200 972"}
{"message" : "17.87.8.0 - - [1998-05-24T15:00:01-05:00] \"GET /french/hosts/cfo/images/cfo/cfophot3.jpg HTTP/1.0\" 200 6695"}
{"message" : "72.236.7.0 - - [1998-05-24T15:00:01-05:00] \"GET /images/s102380.gif HTTP/1.1\" 200 207"}
{"message" : "82.36.0.0 - - [1998-05-24T15:00:02-05:00] \"GET /images/nav_bg_bottom.jpg HTTP/1.0\" 200 8389"}

The dataset has 247,249,096 documents ... each test was run 4 times, once with 1 lap (~250m), once with 2 laps (~500m), once with 3 laps (~750m), and once with 4 laps (~1b).

Baseline - runs an ingest pipeline that does nothing (specifically it attempts to upper case a missing field). This accounts for the overhead of creating the IngestDocument and pushing the data through a pipeline.

Grok - extracts the timestamp, ip, request, status, and size from the source string with the Grok processor. (%{IPORHOST:clientip} %{HTTPDUSER} %{USER} \\[%{TIMESTAMP_ISO8601:@timestamp}\\] \"(?:%{WORD} %{NOTSPACE:request}(?: HTTP/%{NUMBER})?|%{DATA})\" %{NUMBER:status} (?:%{NUMBER:size}|-))

Dissect - extracts the timestamp, ip, request, status, and size from the source string with the Dissect processor. (%{clientip} %{?ident} %{?auth} [%{@timestamp}] \"%{?verb} %{request} HTTP/%{?httpversion}\" %{status} %{size})

Results (average the medians across all runs)

Dissect is about 8% faster then Grok. (ranges from 7-10% across runs)
Dissect is about 15% slower then the Baseline. (ranges from 14-16% across runs)
Grok is about 22% slower then the Baseline. (ranges from 20%-24% across runs)

Throughput (docs/second - higher is better)
image


Screen shot of rendered help here: https://gist.github.com/jakelandis/dd0b627029f26fd52957a00b0c7e118f

The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.

Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.
@jakelandis jakelandis added >enhancement >breaking :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 labels Aug 15, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@jakelandis jakelandis requested a review from rjernst August 16, 2018 15:51
@rjernst
Copy link
Member

rjernst commented Aug 16, 2018

@jakelandis Before I review the code, I want to understand better how a new feature can be marked as breaking? Why can't we introduce this in 6.x with the same behavior we intend for 7.0? (I also don't think #32297 should be marked as breaking, sorry I didn't notice that before.) If we must break this, then the first PR should have the intended 6.x behavior, and a followup with the breaking change in 7.0 should be made separately.

@jakelandis
Copy link
Contributor Author

@rjernst - the intent here is to match Beats and Logstash's implementation. However, in the course of implementing for this for the ingest I found a point of confusion w/r/t an overloaded use of the ? symbol (as well as a couple other minor things) that I wanted to change. Unfortunately this is a breaking w/r/t to the way Beats and Logstash currently work.

After some discussion with Beats and Logstash, they plan to make a corresponding (breaking) change for 7.x too. However there was a desire to allow copy/pasting of patterns between products withing major versions, which necessitates a breaking change here too.

If we must break this, then the first PR should have the intended 6.x behavior, and a followup with the breaking change in 7.0 should be made separately.

Apologies for the mis-ordering of PR's. I had planned once all changes were in master to backport and then add an extra commit the 6.x rules (cherry-picking + 1 commit), and seek review on that one commit.

If we are OK with ingest node dissect syntax being slightly different then Beats and Logstash in 6.x, then I can remove the breaking changes label and do a normal backport. The difference is pretty minor ... the change between 6.x and 7.x is ? should now be * and some extra validation is enforced. The 7x->6x code diff is here: jakelandis/elasticsearch@c2ada87...2b4a4d5 (some extra fluff in that diff which will get cleaned up for the PR)

@rjernst
Copy link
Member

rjernst commented Aug 16, 2018

The logic makes sense for why a breaking change. But the order should be as I described in the previous comment. Can this PR be amended with the old behavior?

@jakelandis
Copy link
Contributor Author

@rjernst I am re-visiting the discussion around the need to make this a breaking change. I will update this PR with the result of that discussion.

@jakelandis
Copy link
Contributor Author

After further discussion, this will NOT be a breaking change. The syntax between ingest/Logstash/Beats will have a minor difference w/r/t * vs ? for 6.x...but that is OK since it will avoid a breaking change in Elasticsearch.

I am removing the breaking label.

@rjernst This is ready to review, and will be a normal cherry-picked backport.

@jakelandis
Copy link
Contributor Author

ping @rjernst

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I am a bit apprehensive about this being in ingest-common? That is turning into what I was afraid of: a catch all where all ingest processors seem to go.

[[dissect-key-modifiers]]
==== Dissect key modifiers
Key modifiers can change the default behavior for dissection. Key modifiers may be found on the left or right
of the `%{keyname}` always inside the `{%` and `}`. For example `%{+keyname ->}` has the append and right padding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? {% -> %{

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. thanks!

}

//package private for testing
String getField() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all the members are final, you could just make them package private and not need the getters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. removed the getters and made the members package private. thanks!

@jakelandis
Copy link
Contributor Author

@rjernst - thanks for the review. Will merge once the cosmetic changes and updated branch goes green.

although I am a bit apprehensive about this being in ingest-common? That is turning into what I was afraid of: a catch all where all ingest processors seem to go.

I am happy to move stuff around (in a different PR), did you have any specific ideas? I can log something for further discussion too.

@rjernst
Copy link
Member

rjernst commented Aug 27, 2018

It might be worth discussing with the core team. Grok was originally a plugin because ingest-common was supposed to have no dependencies.

@jakelandis
Copy link
Contributor Author

All green merging. will backport as-is for inclusion with 6.5

@jakelandis jakelandis merged commit 79b507d into elastic:master Aug 28, 2018
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Aug 28, 2018
The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.

Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.
dnhatn added a commit that referenced this pull request Aug 28, 2018
* master:
  [Rollup] Better error message when trying to set non-rollup index (#32965)
  HLRC: Use Optional in validation logic (#33104)
  Remove unused User class from protocol (#33137)
  ingest: Introduce the dissect processor (#32884)
  [Docs] Add link to es-kotlin-wrapper-client (#32618)
  [Docs] Remove repeating words (#33087)
  Minor spelling and grammar fix (#32931)
  Remove support for deprecated params._agg/_aggs for scripted metric aggregations (#32979)
  Watcher: Simplify finding next date in cron schedule (#33015)
  Run Third party audit with forbidden APIs CLI  (part3/3) (#33052)
  Fix plugin build test on Windows (#33078)
  HLRC+MINOR: Remove Unused Private Method (#33165)
  Remove old unused test script files (#32970)
  Build analysis-icu client JAR (#33184)
  Ensure to generate identical NoOp for the same failure (#33141)
  ShardSearchFailure#readFrom to set index and shardId (#33161)
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Sep 5, 2018
The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.

Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.
jakelandis added a commit that referenced this pull request Sep 5, 2018
* Introduce the dissect library (#32297)

The dissect library will be used for the ingest node as an alternative
to Grok to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the string.
Note - Regular expressions are used during construction of the
objects, but not in the hot path.

A dissect pattern takes the form of: '%{a} %{b},%{c}' which is
composed of 3 keys (a,b,c) and two delimiters (space and comma).
This dissect pattern will match a string of the form: 'foo bar,baz'
and will result a key/value pairing of 'a=foo, b=bar, and c=baz'.
See the comments in DissectParser for a full explanation.

This commit does not include the ingest node processor that will consume
it. However, the consumption should be a trivial mapping between the
key/value pairing returned by the parser and the key/value pairing
needed for the IngestDocument.

* ingest: Introduce the dissect processor (#32884)

The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.

Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.

* ingest: minor - update test to include dissect (#33211)

This change also includes placing the bytes processor in the correct
order (helps to avoid merge conflict when back patching processors)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants