
Ingest: Enable Instantiating and Calling other Processors from ScriptProcessor #32043

Conversation

original-brownbear
Member

This would be a simple prototype for how calling other processors from the ScriptProcessor could work.

Example

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script": {
        "extra_processors": {
          "custom_set_proc": {
            "set": {
              "field": "field1",
              "value": 582.1
            }
          }
        },
        "lang": "painless",
        "source": """
        extraProcessors.invoke("custom_set_proc", doc);
        ctx.field_a_plus_b_times_c = (ctx.field_a + ctx.field_b) * params.param_c
        """,
        "params": {
          "param_c": 10
        }
      }
    }
  ]
}
PUT my-index/_doc/my-id?pipeline=my-pipeline-id
{
  "field_a": 5,
  "field_b": 6
}
GET my-index/_search

returns:

{
  "took": 69,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index",
        "_type": "_doc",
        "_id": "my-id",
        "_score": 1,
        "_source": {
          "field1": 582.1,
          "field_a": 5,
          "field_b": 6,
          "field_a_plus_b_times_c": 110
        }
      }
    ]
  }
}
  • Painless and access to the existing ctx still work as before
  • A processor defined in the extra_processors field can be invoked via extraProcessors.invoke("custom_set_proc", doc);
    • doc is the actual IngestDocument; it is obviously redundant given ctx, but ctx can't go anywhere because of BwC anyway
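
A rough sketch of the wrapper that could back extraProcessors (all names here are illustrative, and the Processor#execute signature is assumed from the codebase at the time):

import java.util.Map;

import org.elasticsearch.ingest.IngestDocument;
import org.elasticsearch.ingest.Processor;

// Hypothetical wrapper exposed to Painless as `extraProcessors`. Each entry in
// the `extra_processors` config is built into a regular Processor instance by
// the ScriptProcessor factory and stored by name.
public final class ExtraProcessors {

    private final Map<String, Processor> processors;

    public ExtraProcessors(Map<String, Processor> processors) {
        this.processors = processors;
    }

    // Called from the script as extraProcessors.invoke("custom_set_proc", doc).
    public void invoke(String name, IngestDocument doc) throws Exception {
        Processor processor = processors.get(name);
        if (processor == null) {
            throw new IllegalArgumentException("unknown processor [" + name + "]");
        }
        processor.execute(doc);
    }
}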

This is built on top of #32003

Motivation for doing it this way:

  • Defining the processors in a separate extra_processors field does not require any tricky changes to Painless parsing and needs only a minimal amount of extra whitelisting, since we don't have to whitelist all the processors separately
    • Same goes for the security manager override: this approach keeps it in one place (we need that because we can't call things like Mustache from an insecure script context)

@original-brownbear original-brownbear added the WIP and :Data Management/Ingest Node labels Jul 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@rjernst
Member

rjernst commented Jul 13, 2018

@original-brownbear Thanks for starting to think about this! I was envisioning something a bit different, though. In particular, this design suffers from still having to configure the processors outside of the actual painless script.

My idea was more to expose each processor as a method in painless, at least for the ones that make sense. For set, I don't think it makes sense to have a method, since it is already a simple one-liner in painless. But imagine wanting to use the new bytes processor with something like this:

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
{
  "script": {
    "lang": "painless",
    "source": "ctx.somebytes = Processors.parseBytes(ctx.somebytes)"
  }
}
  ]
}

This is possible with the relatively recent whitelisting made available in painless. Also, in the near term future, we will be able to statically import methods in the whitelist, so that using the Processors prefix is not necessary.
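
For illustration, the whitelisted helper could be as thin as a static delegate to the parsing logic the bytes processor already uses. This is only a sketch; the Processors class name and parseBytes signature are assumptions, not the actual API:

import org.elasticsearch.common.unit.ByteSizeValue;

// Hypothetical static helper whitelisted for the ingest script context.
public final class Processors {

    private Processors() {}

    // Exposed to Painless as Processors.parseBytes(...); reuses the byte-size
    // parsing that backs the bytes processor and returns the value in bytes.
    public static long parseBytes(String value) {
        return ByteSizeValue.parseBytesSizeValue(value, "ingest").getBytes();
    }
}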

@original-brownbear
Member Author

original-brownbear commented Jul 13, 2018

@rjernst

In particular, this design suffers from still having to configure the processors outside of the actual painless script.

I actually found that to be a big advantage of my approach :D for a few reasons:

  1. You can't describe all processors as static method calls. For some you can, but Grok for example won't work so well (unless you make the Painless compiler way smarter), because instantiating the Grok processor runs through a quasi-compilation step of instantiating the correct Grok object (i.e. compiling the necessary regexes; see the sketch after this list).
  2. This actually gives the ability to quasi-define methods via nested script processors to get reusable execution units (this would be a big step up from LS configs as discussed).
  3. less important: This is probably more readable for processors like Grok, where you can have fairly complex configurations and having to invoke a static method with a huge number of parameters for all the config options seems unnatural, even if it were to compile efficiently.
  4. less important maybe: This is really straight forward to implement with the current codebase.
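
To make point 1 concrete, here is a minimal sketch of why Grok instantiation amounts to a compilation step (assuming the Grok class's API at the time):

import java.util.Map;

import org.elasticsearch.grok.Grok;

public class GrokCompilationSketch {
    public static void main(String[] args) {
        // Constructing the Grok object resolves the pattern bank and compiles
        // the underlying regex; this is the "quasi-compilation" step that a
        // naive per-call static method would repeat for every document.
        Map<String, String> patternBank = Grok.getBuiltinPatterns();
        Grok grok = new Grok(patternBank, "%{IP:client} %{WORD:method}");

        // Once compiled, per-document matching is cheap and reusable.
        Map<String, Object> fields = grok.captures("127.0.0.1 GET");
        System.out.println(fields); // {client=127.0.0.1, method=GET}
    }
}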

That said, I agree the API isn't so nice here. But couldn't we resolve that by adding some more logic to the compiler, e.g. to compile this section of my example:

"custom_set_proc": {
         "set": {
           "field": "field1",
           "value": 582.1
}

into a method custom_set_proc(ctx)?
i.e. this would become equivalent to above example:

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script": {
        "extra_processors": {
          "custom_set_proc": {
             "set": {
               "field": "field1",
               "value": 582.1
            }
          }
        },
        "lang": "painless",
        "source": """
        custom_set_proc(doc);
        ctx.field_a_plus_b_times_c = (ctx.field_a + ctx.field_b) * params.param_c
        """,
        "params": {
          "param_c": 10
        }
      }
    }
  ]
}

(then we could also preload stuff like your example of the bytes processor, which has no configuration and only takes the data it processes as input, but also keep the ability to neatly handle things like Grok that need more compilation ... or nested script processors)? :)
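
One cheap way to get there would be a preprocessing pass in the ScriptProcessor factory that rewrites these calls before compilation. The sketch below is purely illustrative; a real implementation would work on tokens rather than naive string replacement, and the extraProcessors binding is the hypothetical wrapper sketched earlier:

import java.util.Map;

// Rewrites calls like `custom_set_proc(doc)` into
// `extraProcessors.invoke("custom_set_proc", doc)` for every processor name
// defined in the extra_processors config.
public final class ExtraProcessorCallRewriter {

    public static String rewrite(Map<String, Object> extraProcessors, String userSource) {
        String source = userSource;
        for (String name : extraProcessors.keySet()) {
            source = source.replace(
                name + "(doc)",
                "extraProcessors.invoke(\"" + name + "\", doc)");
        }
        return source;
    }
}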

@original-brownbear
Member Author

@rjernst one thing I was thinking about here that wouldn't require much work, and would mostly keep your API suggestion, would be to add a context field (maybe a different name would be good :P) to the IngestScript context (like we currently have params, or extraProcessors in my example) that keeps the references to the caches needed by things like the UA processor, Grok, etc.
Then we could simply keep those as static calls as well and pass that object as the first argument:

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script": {
        "lang": "painless",
        "source": """ctx.groked = Processors.grok(context, "patternsstring", "grokstring", ctx.someField)"""
      }
    }
  ]
}

if we always pass that object to those processors that need a cache, then its lifecycle will be the same as that of its processor, and we don't need any new fields in the script processor's DSL.

Just an idea :) Not as perfect as not having the context parameter, but kind of standard as an approach to this kind of problem and not too unergonomic for users (imo) :)
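
A minimal sketch of what that could look like, extending the hypothetical Processors helper from above with a simplified signature (every name here is an illustrative assumption; the point is that the cache's lifecycle matches the compiled script, so the Grok pattern is compiled once rather than per document):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.elasticsearch.grok.Grok;

public final class Processors {

    private Processors() {}

    // One context instance per compiled script, created by the factory.
    public static Map<String, Object> newContext() {
        return new ConcurrentHashMap<>();
    }

    // Exposed to Painless as Processors.grok(context, ...).
    public static Map<String, Object> grok(Map<String, Object> context,
                                           String grokPattern, String input) {
        // Compile (and cache) the Grok pattern on first use for this script.
        Grok grok = (Grok) context.computeIfAbsent(grokPattern,
            pattern -> new Grok(Grok.getBuiltinPatterns(), pattern));
        return grok.captures(input);
    }
}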

@original-brownbear
Member Author

@rjernst also, regarding the above suggestion: if we're going for a context object for keeping caches, it wouldn't be too hard to make this transparent to the user either, and allow for doing this without a performance hit:

ctx.groked = Processors.grok("patternsstring", "grokstring", ctx.someField)

we could:

  1. Dumbest approach (but it likely works, since we're talking about a very limited number of calls here and tokenising this isn't terribly hard):
    • Just preprocess sources in the ScriptProcessor#Factory and simply add in the context arg before passing the source string to the Painless compiler.
  2. Trickier approach, but it would enable doing the same kind of thing outside of the ScriptProcessor:
    • Simply create an annotation like e.g. @PainlessMethod that can be added to static methods that take the context as their first argument, then have Painless keep and pass an instance of it per compiled script (this is effectively how JRuby works).

=> the Context object doesn't look so bad imo :P ... also it's kind of a proven approach to this since this is how many scripting languages optimize these kinds of calls under the hood.
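
For option (2), the moving parts could be as small as the following sketch (the annotation name comes from the comment above; everything else is an assumption about how Painless might wire it up):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;

// Hypothetical marker annotation: Painless would detect it on whitelisted
// static methods and transparently pass its per-script context instance as
// the hidden first argument.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface PainlessMethod {}

final class ProcessorsSketch {

    // The user writes Processors.grok("patterns", ctx.someField); the compiler
    // injects the context argument because of the annotation.
    @PainlessMethod
    static Object grok(Map<String, Object> context, String pattern, String input) {
        // ... look up or build the compiled Grok in `context`, then match ...
        return null; // placeholder body; the caching mechanics are sketched above
    }
}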

@talevy
Contributor

talevy commented Jul 18, 2018

@original-brownbear I am on board to explore (2).
