Adding a simulate ingest API #99270

masseyke · 2023-09-06T20:15:26Z

This is a draft PR that introduces a new _ingest/_simulate API. It runs every pipeline that would be executed for the given data on a given index, but instead of indexing the data it returns the transformed documents. It differs from the simulate pipeline API, which only runs the single pipeline it is given. The new API can run an arbitrarily long chain of pipelines: the given pipeline, the default pipeline of the given index, the default pipelines of any indices that the reroute processor forwards the data to, and the final pipeline of the last index in the chain.
For example, if we have the following pipelines:

curl -X PUT "localhost:9200/_ingest/pipeline/my-pipeline?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-long-field",
        "value": 10
      }
    },
    {
      "set": {
        "field": "my-boolean-field",
        "value": true
      }
    },
    {
      "lowercase": {
        "field": "my-keyword-field"
      }
    },
    {
      "reroute": {
        "destination": "my-index-2"
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-final-pipeline?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-boolean-field",
        "value": false
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-pipeline-2?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-long-field",
        "value": 20
      }
    },
    {
      "uppercase": {
        "field": "my-keyword-field"
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-final-pipeline-2?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-new-boolean-field",
        "value": false
      }
    }
  ]
}
'

And then the following index:

curl -X PUT "localhost:9200/my-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "default_pipeline": "my-pipeline",
      "final_pipeline": "my-final-pipeline"
    }
  }
}
'

Then calling _ingest/_simulate with this data:

curl -X POST "localhost:9200/_ingest/_simulate?pretty&index=my-index" -H 'Content-Type: application/json' -d'
{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO"
      }
    },
    {
      "_source": {
        "my-keyword-field": "BAR"
      }
    }
  ]
}
'

might return:

{
  "errors" : false,
  "took" : 0,
  "ingest_took" : 1,
  "items" : [
    {
      "create" : {
        "_index" : "my-index-2",
        "_source" : {
          "my-long-field" : 20,
          "my-keyword-field" : "FOO",
          "my-new-boolean-field" : false
        },
        "executed_pipelines" : [
          "my-pipeline",
          "my-pipeline-2",
          "my-final-pipeline-2"
        ],
        "status" : 201
      }
    },
    {
      "create" : {
        "_index" : "my-index-2",
        "_source" : {
          "my-long-field" : 20,
          "my-keyword-field" : "BAR",
          "my-new-boolean-field" : false
        },
        "executed_pipelines" : [
          "my-pipeline",
          "my-pipeline-2",
          "my-final-pipeline-2"
        ],
        "status" : 201
      }
    }
  ]
}
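The order in which pipelines are chained can be sketched with a small in-memory model. This is only an illustration of the behavior described above, not the code in this PR: `run_pipeline` and `simulate` are hypothetical helpers, only the four processor types used in the example are handled, and the settings for my-index-2 (default pipeline my-pipeline-2, final pipeline my-final-pipeline-2) are an assumption inferred from the executed_pipelines list, since the index is not created in the example.

```python
from copy import deepcopy

# Pipeline definitions copied from the curl examples above.
PIPELINES = {
    "my-pipeline": [
        {"set": {"field": "my-long-field", "value": 10}},
        {"set": {"field": "my-boolean-field", "value": True}},
        {"lowercase": {"field": "my-keyword-field"}},
        {"reroute": {"destination": "my-index-2"}},
    ],
    "my-final-pipeline": [
        {"set": {"field": "my-boolean-field", "value": False}},
    ],
    "my-pipeline-2": [
        {"set": {"field": "my-long-field", "value": 20}},
        {"uppercase": {"field": "my-keyword-field"}},
    ],
    "my-final-pipeline-2": [
        {"set": {"field": "my-new-boolean-field", "value": False}},
    ],
}

# my-index comes from the example; the my-index-2 entry is an ASSUMPTION.
INDEX_SETTINGS = {
    "my-index": {
        "default_pipeline": "my-pipeline",
        "final_pipeline": "my-final-pipeline",
    },
    "my-index-2": {
        "default_pipeline": "my-pipeline-2",
        "final_pipeline": "my-final-pipeline-2",
    },
}


def run_pipeline(name, source):
    """Apply one pipeline to source in place; return a reroute destination or None."""
    for processor in PIPELINES[name]:
        kind, config = next(iter(processor.items()))
        if kind == "set":
            source[config["field"]] = config["value"]
        elif kind == "lowercase":
            source[config["field"]] = source[config["field"]].lower()
        elif kind == "uppercase":
            source[config["field"]] = source[config["field"]].upper()
        elif kind == "reroute":
            # Skip the rest of this pipeline and hand the document to another index.
            return config["destination"]
        else:
            raise ValueError(f"processor not modelled in this sketch: {kind}")
    return None


def simulate(index, source, max_hops=10):
    """Follow default pipelines across reroutes, then run the last index's final pipeline."""
    source = deepcopy(source)
    executed = []
    for _ in range(max_hops):
        pipeline = INDEX_SETTINGS.get(index, {}).get("default_pipeline")
        if pipeline is None:
            break
        executed.append(pipeline)
        destination = run_pipeline(pipeline, source)
        if destination is None or destination == index:
            break
        index = destination
    else:
        raise RuntimeError("too many reroutes")
    final = INDEX_SETTINGS.get(index, {}).get("final_pipeline")
    if final is not None:
        executed.append(final)
        run_pipeline(final, source)
    return {"_index": index, "_source": source, "executed_pipelines": executed}


result = simulate("my-index", {"my-keyword-field": "FOO"})
print(result["_index"], result["executed_pipelines"])
```

Note that my-final-pipeline never runs in this model, because the document leaves my-index before reaching it; only the final pipeline of the last index in the chain is applied.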

You can also specify substitute pipeline definitions so that you can try pipeline changes without actually having to change pipelines. For example, to substitute a new my-pipeline-2, you could do the following:

curl -X POST "localhost:9200/_ingest/_simulate?pretty&index=my-index" -H 'Content-Type: application/json' -d'
{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO"
      }
    },
    {
      "_source": {
        "my-keyword-field": "BAR"
      }
    }
  ],
  "pipeline_substitutions": {
    "my-pipeline-2": {
      "processors": [
        {
          "set": {
            "field": "my-new-boolean-field",
            "value": true
          }
        }
      ]
    }
  }
}
'

This substitutes the pipeline body given in the request for the my-pipeline-2 stored in the cluster. The pipeline definition is only changed for this request, and does not impact anything else running on the cluster now or in the future.
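For scripted experiments, the same request can be assembled programmatically. The following is a minimal sketch using only the Python standard library; `build_simulate_request` is a hypothetical helper, the base URL is the localhost address from the curl examples, and the request shape (the `index` query parameter, `docs`, `pipeline_substitutions`) is the one proposed in this draft, so it may not match what eventually ships. The sketch only builds the request and does not send it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_simulate_request(index, docs, pipeline_substitutions=None,
                           base_url="http://localhost:9200"):
    """Build (but do not send) a POST to _ingest/_simulate as described in this PR."""
    body = {"docs": [{"_source": doc} for doc in docs]}
    if pipeline_substitutions:
        # Substitutes apply to this request only; stored pipelines are untouched.
        body["pipeline_substitutions"] = pipeline_substitutions
    url = f"{base_url}/_ingest/_simulate?{urlencode({'index': index})}"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_simulate_request(
    "my-index",
    docs=[{"my-keyword-field": "FOO"}, {"my-keyword-field": "BAR"}],
    pipeline_substitutions={
        "my-pipeline-2": {
            "processors": [
                {"set": {"field": "my-new-boolean-field", "value": True}}
            ]
        }
    },
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

To actually run the simulation, the request object could be passed to `urllib.request.urlopen` against a cluster built from this branch.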

As a side note, here are some of the guidelines I followed (and the reason the code is a little odd):

Make the API easy to use, and familiar to users of the simulate pipeline API.
Use as much of the existing bulk API logic as possible, so that simulate does not diverge from real ingest behavior.
Do not impact bulk API performance.
Modify the bulk API code as little as possible. This is very critical code, and any change is an opportunity to introduce bugs.

elasticsearchmachine · 2023-09-06T20:15:51Z

Hi @masseyke, I've created a changelog YAML for you.

ruflin · 2023-09-18T08:14:33Z

I like the direction this is taking. It means we have some unmodified sample events, it is possible to use the simulate API with these events and see what the end result is / where these events end up. The pipeline substitution is key. Imagine at some point, also templates / component templates can be substituted.

@masseyke The focus on the output is on _source for the docs. What happens in synthetic source scenarios like TSDB?

masseyke · 2023-09-26T13:56:22Z

> I like the direction this is taking. It means we have some unmodified sample events, it is possible to use the simulate API with these events and see what the end result is / where these events end up. The pipeline substitution is key. Imagine at some point, also templates / component templates can be substituted.
>
> @masseyke The focus on the output is on _source for the docs. What happens in synthetic source scenarios like TSDB?

I don't think I'm following. The source is maintained by the pipelines until indexing time, and that is what is displayed in the output. Indexing itself doesn't give us the source as output, and we're not querying the index to get the source / synthetic source.

ruflin · 2023-09-26T14:29:54Z

> I don't think I'm following.

Oversight on my end. Of course the _source is only removed during indexing 🤦‍♂️ All good.

masseyke · 2023-10-26T22:24:03Z

Replaced by #101409

Adding a simulate ingest API

a123747

masseyke added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.11.0 labels Sep 6, 2023

Update docs/changelog/99270.yaml

0ed1a82

masseyke added 9 commits September 6, 2023 15:50

cleanup

f4026f1

minor cleanup

b26e621

Merge branch 'feature/simulate-ingest-with-pipeline-defs' of github.c…

a117446

…om:masseyke/elasticsearch into feature/simulate-ingest-with-pipeline-defs

merging main

cf2497e

merging main

c14a77a

transport options were removed from ActionType

fa39cdd

adding simulate bulk transport action to list of non operator actions

5b2e1b7

avoiding NPE

358a4fe

merging main

b2301a4

masseyke added 2 commits September 25, 2023 11:24

merging main

c1b43e1

fixing compilation error introduced by merge

c1cbebc

mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023

masseyke closed this Oct 26, 2023