
Ingest: Enable Instantiating and Calling other Processors from ScriptProcessor #32043

Conversation

original-brownbear
Member

This would be a simple prototype for how calling other processors from the ScriptProcessor could work.

Example

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script": {
        "extra_processors": {
          "custom_set_proc": {
            "set": {
              "field": "field1",
              "value": 582.1
            }
          }
        },
        "lang": "painless",
        "source": """
        extraProcessors.invoke("custom_set_proc", doc);
        ctx.field_a_plus_b_times_c = (ctx.field_a + ctx.field_b) * params.param_c
        """,
        "params": {
          "param_c": 10
        }
      }
    }
  ]
}
PUT my-index/_doc/my-id?pipeline=my-pipeline-id
{
  "field_a": 5,
  "field_b": 6
}
GET my-index/_search

returns:

{
  "took": 69,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index",
        "_type": "_doc",
        "_id": "my-id",
        "_score": 1,
        "_source": {
          "field1": 582.1,
          "field_a": 5,
          "field_b": 6,
          "field_a_plus_b_times_c": 110
        }
      }
    ]
  }
}
  • Painless and access to the existing ctx still work as before
  • A processor defined in the extra_processors field can be invoked via extraProcessors.invoke("custom_set_proc", doc);
    • doc is the actual IngestDocument; it is obviously redundant given ctx, but ctx can't go anywhere because of BwC anyway
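
A rough sketch of the wrapper that could back extraProcessors (all names here are illustrative, and the Processor#execute signature is assumed from the codebase at the time):

import java.util.Map;

import org.elasticsearch.ingest.IngestDocument;
import org.elasticsearch.ingest.Processor;

// Hypothetical wrapper exposed to Painless as `extraProcessors`. Each entry in
// the `extra_processors` config is built into a regular Processor instance by
// the ScriptProcessor factory and stored by name.
public final class ExtraProcessors {

    private final Map<String, Processor> processors;

    public ExtraProcessors(Map<String, Processor> processors) {
        this.processors = processors;
    }

    // Called from the script as extraProcessors.invoke("custom_set_proc", doc).
    public void invoke(String name, IngestDocument doc) throws Exception {
        Processor processor = processors.get(name);
        if (processor == null) {
            throw new IllegalArgumentException("unknown processor [" + name + "]");
        }
        processor.execute(doc);
    }
}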

This is built on top of #32003

Motivation for doing it this way:

  • Defining the processors in a separate extra_processors field does not require any tricky changes to Painless parsing and needs only a minimal amount of extra whitelisting, since we don't have to whitelist all the processors separately
    • Same goes for the security manager override: this approach keeps it in one place (we need that because we can't call things like Mustache from an insecure script context)

@original-brownbear original-brownbear added the WIP and :Data Management/Ingest Node labels Jul 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@rjernst
Member

rjernst commented Jul 13, 2018

@original-brownbear Thanks for starting to think about this! I was envisioning something a bit different, though. In particular, this design suffers from still having to configure the processors outside of the actual painless script.

My idea was more to expose each processor as a method in painless, at least for the ones that make sense. For set, I don't think it makes sense to have a method, since it is already a simple one-liner in painless. But imagine wanting to use the new bytes processor with something like this:

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
{
  "script": {
    "lang": "painless",
    "source": "ctx.somebytes = Processors.parseBytes(ctx.somebytes)"
  }
}
  ]
}

This is possible with the relatively recent whitelisting made available in painless. Also, in the near term future, we will be able to statically import methods in the whitelist, so that using the Processors prefix is not necessary.
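
For illustration, the whitelisted helper could be as thin as a static delegate to the parsing logic the bytes processor already uses. This is only a sketch; the Processors class name and parseBytes signature are assumptions, not the actual API:

import org.elasticsearch.common.unit.ByteSizeValue;

// Hypothetical static helper whitelisted for the ingest script context.
public final class Processors {

    private Processors() {}

    // Exposed to Painless as Processors.parseBytes(...); reuses the byte-size
    // parsing that backs the bytes processor and returns the value in bytes.
    public static long parseBytes(String value) {
        return ByteSizeValue.parseBytesSizeValue(value, "ingest").getBytes();
    }
}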

@original-brownbear
Member Author

original-brownbear commented Jul 13, 2018

@rjernst

In particular, this design suffers from still having to configure the processors outside of the actual painless script.

I actually found that to be a big advantage of my approach :D for a few reasons:

  1. You can't describe all processors as static method calls. For some you can, but Grok for example won't work so well (unless you make the Painless compiler way smarter), because instantiating the Grok processor runs through a quasi-compilation step of instantiating the correct Grok object (i.e. compiling the necessary regexes; see the sketch after this list).
  2. This actually gives the ability to quasi-define methods via nested script processors to get reusable execution units (this would be a big step up from LS configs as discussed).
  3. less important: This is probably more readable for processors like Grok, where you can have fairly complex configurations and having to invoke a static method with a huge number of parameters for all the config options seems unnatural, even if it were to compile efficiently.
  4. less important maybe: This is really straight forward to implement with the current codebase.
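
To make point 1 concrete, here is a minimal sketch of why Grok instantiation amounts to a compilation step (assuming the Grok class's API at the time):

import java.util.Map;

import org.elasticsearch.grok.Grok;

public class GrokCompilationSketch {
    public static void main(String[] args) {
        // Constructing the Grok object resolves the pattern bank and compiles
        // the underlying regex; this is the "quasi-compilation" step that a
        // naive per-call static method would repeat for every document.
        Map<String, String> patternBank = Grok.getBuiltinPatterns();
        Grok grok = new Grok(patternBank, "%{IP:client} %{WORD:method}");

        // Once compiled, per-document matching is cheap and reusable.
        Map<String, Object> fields = grok.captures("127.0.0.1 GET");
        System.out.println(fields); // {client=127.0.0.1, method=GET}
    }
}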

That said, I agree the API isn't so nice here. But couldn't we resolve that by adding some more logic to the compiler, e.g. to compile this section of my example:

"custom_set_proc": {
         "set": {
           "field": "field1",
           "value": 582.1
}

into a method custom_set_proc(ctx)?
i.e. this would become equivalent to above example:

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script": {
        "extra_processors": {
          "custom_set_proc": {
             "set": {
               "field": "field1",
               "value": 582.1
            }
          }
        },
        "lang": "painless",
        "source": """
        custom_set_proc(doc);
        ctx.field_a_plus_b_times_c = (ctx.field_a + ctx.field_b) * params.param_c
        """,
        "params": {
          "param_c": 10
        }
      }
    }
  ]
}

(then we could also preload stuff like your example of the bytes processor, which has no configuration and only takes the data it processes as input, but also keep the ability to neatly handle things like Grok that need more compilation ... or nested script processors)? :)
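
One cheap way to get there would be a preprocessing pass in the ScriptProcessor factory that rewrites these calls before compilation. The sketch below is purely illustrative; a real implementation would work on tokens rather than naive string replacement, and the extraProcessors binding is the hypothetical wrapper sketched earlier:

import java.util.Map;

// Rewrites calls like `custom_set_proc(doc)` into
// `extraProcessors.invoke("custom_set_proc", doc)` for every processor name
// defined in the extra_processors config.
public final class ExtraProcessorCallRewriter {

    public static String rewrite(Map<String, Object> extraProcessors, String userSource) {
        String source = userSource;
        for (String name : extraProcessors.keySet()) {
            source = source.replace(
                name + "(doc)",
                "extraProcessors.invoke(\"" + name + "\", doc)");
        }
        return source;
    }
}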

@original-brownbear
Member Author

@rjernst one thing I was thinking about here that wouldn't require much work, and would mostly keep your API suggestion, would be to add a context field (maybe a different name would be good :P) to the IngestScript context (like we currently have params, or extraProcessors in my example) that keeps the references to the caches needed by things like the UA processor, Grok, etc.
Then we could simply keep those as static calls as well and pass that object as the first argument:

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script": {
        "lang": "painless",
        "source": """ctx.groked = Processors.grok(context, "patternsstring", "grokstring", ctx.someField)"""
      }
    }
  ]
}

if we always pass that object to those processors that need a cache, then its lifecycle will be the same as that of its processor, and we don't need any new fields in the script processor's DSL.

Just an idea :) Not as perfect as not having the context parameter, but kind of standard as an approach to this kind of problem and not too unergonomic for users (imo) :)
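
A minimal sketch of what that could look like, extending the hypothetical Processors helper from above with a simplified signature (every name here is an illustrative assumption; the point is that the cache's lifecycle matches the compiled script, so the Grok pattern is compiled once rather than per document):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.elasticsearch.grok.Grok;

public final class Processors {

    private Processors() {}

    // One context instance per compiled script, created by the factory.
    public static Map<String, Object> newContext() {
        return new ConcurrentHashMap<>();
    }

    // Exposed to Painless as Processors.grok(context, ...).
    public static Map<String, Object> grok(Map<String, Object> context,
                                           String grokPattern, String input) {
        // Compile (and cache) the Grok pattern on first use for this script.
        Grok grok = (Grok) context.computeIfAbsent(grokPattern,
            pattern -> new Grok(Grok.getBuiltinPatterns(), pattern));
        return grok.captures(input);
    }
}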

@original-brownbear
Member Author

@rjernst also, regarding the above suggestion: if we're going for a context object for keeping caches, it wouldn't be too hard to make this transparent to the user either, and allow for doing this without a performance hit:

ctx.groked = Processors.grok("patternsstring", "grokstring", ctx.someField)

we could:

  1. Dumbest approach (but it likely works, since we're talking about a very limited number of calls here and tokenising this isn't terribly hard):
    • Just preprocess sources in the ScriptProcessor#Factory and simply add in the context arg before passing the source string to the Painless compiler.
  2. Trickier approach, but it would enable doing the same kind of thing outside of the ScriptProcessor:
    • Simply create an annotation like e.g. @PainlessMethod that can be added to static methods that take the context as their first argument, then have Painless keep and pass an instance of it per compiled script (this is effectively how JRuby works).

=> the Context object doesn't look so bad imo :P ... also it's kind of a proven approach to this since this is how many scripting languages optimize these kinds of calls under the hood.
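
For option (2), the moving parts could be as small as the following sketch (the annotation name comes from the comment above; everything else is an assumption about how Painless might wire it up):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;

// Hypothetical marker annotation: Painless would detect it on whitelisted
// static methods and transparently pass its per-script context instance as
// the hidden first argument.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface PainlessMethod {}

final class ProcessorsSketch {

    // The user writes Processors.grok("patterns", ctx.someField); the compiler
    // injects the context argument because of the annotation.
    @PainlessMethod
    static Object grok(Map<String, Object> context, String pattern, String input) {
        // ... look up or build the compiled Grok in `context`, then match ...
        return null; // placeholder body; the caching mechanics are sketched above
    }
}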

@talevy
Contributor

talevy commented Jul 18, 2018

@original-brownbear I am on board to explore (2).
