Adding a simulate ingest API #99270

@masseyke masseyke commented Sep 6, 2023

This is a draft PR that introduces a new _ingest/_simulate API. It runs every pipeline that would be executed for data sent to a given index but, instead of indexing the data, returns the transformed documents. The existing simulate pipeline API, by contrast, runs only the single pipeline it is given. The new API can run an arbitrary number of pipelines: the given pipeline, the default pipeline of the given index, the default pipelines of any indices that the reroute processor forwards the data to, and the final pipeline of the last index in the chain.
For example, if we have the following pipelines:

curl -X PUT "localhost:9200/_ingest/pipeline/my-pipeline?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-long-field",
        "value": 10
      }
    },
    {
      "set": {
        "field": "my-boolean-field",
        "value": true
      }
    },
    {
      "lowercase": {
        "field": "my-keyword-field"
      }
    },
    {
      "reroute": {
        "destination": "my-index-2"
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-final-pipeline?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-boolean-field",
        "value": false
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-pipeline-2?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-long-field",
        "value": 20
      }
    },
    {
      "uppercase": {
        "field": "my-keyword-field"
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-final-pipeline-2?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-new-boolean-field",
        "value": false
      }
    }
  ]
}
'

And then the following index:

curl -X PUT "localhost:9200/my-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "default_pipeline": "my-pipeline",
      "final_pipeline": "my-final-pipeline"
    }
  }
}
'
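
The sample output further down lists my-pipeline-2 and my-final-pipeline-2 as executed, which presumably means the reroute target my-index-2 exists with those as its default and final pipelines. That index is not defined in this description, so the following is only a sketch under that assumption (the settings values are inferred from the sample output, not taken from the PR):

```shell
# Assumed definition of the reroute target; not part of the original description.
# The default_pipeline / final_pipeline values are inferred from the sample output.
INDEX_2_BODY='{
  "settings": {
    "index": {
      "default_pipeline": "my-pipeline-2",
      "final_pipeline": "my-final-pipeline-2"
    }
  }
}'

curl -X PUT "localhost:9200/my-index-2?pretty" -H 'Content-Type: application/json' -d "$INDEX_2_BODY" \
  || echo "request failed; is Elasticsearch listening on localhost:9200?"
```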

Then calling _ingest/_simulate with this data:

curl -X POST "localhost:9200/_ingest/_simulate?pretty&index=my-index" -H 'Content-Type: application/json' -d'
{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO"
      }
    },
    {
      "_source": {
        "my-keyword-field": "BAR"
      }
    }
  ]
}
'

might return

{
  "errors" : false,
  "took" : 0,
  "ingest_took" : 1,
  "items" : [
    {
      "create" : {
        "_index" : "my-index-2",
        "_source" : {
          "my-long-field" : 20,
          "my-keyword-field" : "FOO",
          "my-new-boolean-field" : false
        },
        "executed_pipelines" : [
          "my-pipeline",
          "my-pipeline-2",
          "my-final-pipeline-2"
        ],
        "status" : 201
      }
    },
    {
      "create" : {
        "_index" : "my-index-2",
        "_source" : {
          "my-long-field" : 20,
          "my-keyword-field" : "BAR",
          "my-new-boolean-field" : false
        },
        "executed_pipelines" : [
          "my-pipeline",
          "my-pipeline-2",
          "my-final-pipeline-2"
        ],
        "status" : 201
      }
    }
  ]
}
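
For contrast, the existing simulate pipeline API can be pointed at my-pipeline alone. As described above, it runs only that one pipeline, so my-pipeline-2 and the final pipelines would not be applied. A minimal request, reusing the first sample document (the response is omitted because it is not part of this PR description):

```shell
# Existing simulate pipeline API: runs only the named pipeline (my-pipeline),
# without following the reroute into my-index-2's pipelines.
SINGLE_PIPELINE_BODY='{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO"
      }
    }
  ]
}'

curl -X POST "localhost:9200/_ingest/pipeline/my-pipeline/_simulate?pretty" -H 'Content-Type: application/json' -d "$SINGLE_PIPELINE_BODY" \
  || echo "request failed; is Elasticsearch listening on localhost:9200?"
```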

You can also specify substitute pipeline definitions, so that you can try out pipeline changes without modifying the pipelines stored in the cluster. For example, to substitute a new my-pipeline-2, you could do the following:

curl -X POST "localhost:9200/_ingest/_simulate?pretty&index=my-index" -H 'Content-Type: application/json' -d'
{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO"
      }
    },
    {
      "_source": {
        "my-keyword-field": "BAR"
      }
    }
  ],
  "pipeline_substitutions": {
    "my-pipeline-2": {
      "processors": [
        {
          "set": {
            "field": "my-new-boolean-field",
            "value": true
          }
        }
      ]
    }
  }
}
'

This substitutes the pipeline body given in the request for the my-pipeline-2 stored in the cluster. The pipeline definition is changed only for this request and does not affect anything else running on the cluster, now or in the future.

As a side note, here are some of the guidelines I followed, which explain why the code is a little odd:

  • Make the API easy to use, and familiar to users of the simulate pipeline API.
  • Use as much of the existing bulk API logic as possible, so that simulate does not diverge from real ingest behavior.
  • Do not impact bulk API performance.
  • Modify the bulk API code as little as possible. This is very critical code, and any change is an opportunity to introduce bugs.

@masseyke masseyke added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.11.0 labels Sep 6, 2023
@elasticsearchmachine
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@ruflin
Contributor

ruflin commented Sep 18, 2023

I like the direction this is taking. It means that if we have some unmodified sample events, we can use the simulate API with them and see what the end result is and where the events end up. The pipeline substitution is key. Imagine that at some point templates and component templates can be substituted too.

@masseyke The focus of the output is on _source for the docs. What happens in synthetic source scenarios like TSDB?

@masseyke
Member Author

> I like the direction this is taking. It means that if we have some unmodified sample events, we can use the simulate API with them and see what the end result is and where the events end up. The pipeline substitution is key. Imagine that at some point templates and component templates can be substituted too.
>
> @masseyke The focus of the output is on _source for the docs. What happens in synthetic source scenarios like TSDB?

I don't think I'm following. The source is maintained by the pipelines until indexing time, and that is what is displayed in the output. Indexing itself doesn't give us the source as output, and we're not querying the index to get the source / synthetic source.

@ruflin
Contributor

ruflin commented Sep 26, 2023

> I don't think I'm following.

Oversight on my end. Of course the _source is only removed during indexing 🤦‍♂️ All good.

@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
@masseyke
Member Author

Replaced by #101409

@masseyke masseyke closed this Oct 26, 2023