
[Indexing Strategy] Prevent wrong dataset / namespace values #64991

Open
ruflin opened this issue Nov 12, 2020 · 17 comments
Labels
:Data Management/Data streams Data streams and their lifecycles >feature Team:Data Management Meta label for data/management team

Comments

@ruflin
Contributor

ruflin commented Nov 12, 2020

The new indexing strategy consists of 3 parts: type, dataset, namespace. The default templates in Elasticsearch for logs-*-* and metrics-*-* set the default value for data_stream.type directly in the template as it is known in advance. For data_stream.dataset and data_stream.namespace the value is picked from the first document which contains these fields. As long as the field does not exist in the document, it is not set.
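For illustration, a simplified sketch of such a template (the template name and body here are examples, not the exact built-in definitions):

PUT _index_template/logs-example
{
  "index_patterns": ["logs-*-*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "data_stream": {
          "properties": {
            "type": { "type": "constant_keyword", "value": "logs" },
            "dataset": { "type": "constant_keyword" },
            "namespace": { "type": "constant_keyword" }
          }
        }
      }
    }
  }
}

Because dataset and namespace have no value in the template, each backing index locks them in from the first document that carries them.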

This makes the following possible:

POST logs-foo-bar/_doc/
{
  "@timestamp": "2099-11-15T13:12:00",
  "message": "GET /search HTTP/1.1 200 1070000"
  "data_stream.dataset": "hello"
}

The value of the constant_keyword field data_stream.dataset is now set to hello, but according to the indexing strategy it should be foo. If a document with the correct value is ingested later, it is rejected:

POST logs-foo-bar/_doc/
{
  "@timestamp": "2099-11-15T13:12:00",
  "message": "GET /search HTTP/1.1 200 1070000",
  "data_stream.dataset": "foo"
}

This is currently an edge case, as the Elastic Agent and, in the future, Logstash will always set the correct values. But with more adoption of the new indexing strategy, I expect the fields will not always be set.

Not having the fields set also leads to the problem that queries, for example on data_stream.dataset=foo, are run against indices that appear to hold that value when in reality it does not exist.
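For example, the following search fans out to logs-foo-bar because the index pattern matches, even though no document in that data stream carries the value:

GET logs-*-*/_search
{
  "query": {
    "term": { "data_stream.dataset": "foo" }
  }
}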

This issue is to discuss whether we should put any measures in place and, if so, which options we have.

@ruflin
Contributor Author

ruflin commented Nov 12, 2020

@jpountz @dakrone One idea I have is that we could use an ingest pipeline to automatically add the missing values.
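A minimal sketch of what that could look like, assuming the fields arrive as a nested data_stream object and the target name follows the {type}-{dataset}-{namespace} scheme (the pipeline name and script are hypothetical):

PUT _ingest/pipeline/fill-data-stream-fields
{
  "description": "Fill in missing data_stream fields from the target name",
  "processors": [
    {
      "script": {
        "source": """
          // ctx['_index'] holds the name the document was sent to, e.g. logs-foo-bar
          def parts = ctx['_index'].splitOnToken('-');
          if (parts.length == 3) {
            if (ctx['data_stream'] == null) {
              ctx['data_stream'] = new HashMap();
            }
            ctx['data_stream'].putIfAbsent('type', parts[0]);
            ctx['data_stream'].putIfAbsent('dataset', parts[1]);
            ctx['data_stream'].putIfAbsent('namespace', parts[2]);
          }
        """
      }
    }
  ]
}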

@tlrx tlrx added the :Data Management/Data streams Data streams and their lifecycles label Dec 3, 2020
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Dec 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@tlrx tlrx added >feature and removed Team:Data Management Meta label for data/management team labels Dec 3, 2020
@tlrx
Member

tlrx commented Dec 3, 2020

@dakrone I'm assigning this issue to you as it looks related to data streams to me. Please feel free to relabel/reassign/close if you think it is more appropriate.

@danhermann danhermann added the Team:Data Management Meta label for data/management team label Dec 3, 2020
@martijnvg
Member

Preventing the data_stream.dataset, data_stream.type and data_stream.namespace fields from containing values that don't match what they should be based on the data stream name is, I think, validation that falls into either the application logic category or the data stream logic category.

If we determine that this is application logic, then I think what @ruflin suggests (enforcing this validation via a pipeline) is best. However, if we think this kind of validation is data stream related, then perhaps we can enforce it natively in Elasticsearch. There is something to be said for this; for example, the existence and correctness of the @timestamp field is natively enforced. We can argue that data_stream.dataset, data_stream.type and data_stream.namespace field validation falls into the same category as the data stream's timestamp field validation.
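For example, indexing a document without a timestamp into a data stream already fails natively:

POST logs-foo-bar/_doc/
{
  "message": "GET /search HTTP/1.1 200 1070000"
}

This request is rejected because the data stream's @timestamp field is missing; the same style of enforcement could cover the data_stream.* fields.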

@ruflin
Contributor Author

ruflin commented Mar 17, 2021

@martijnvg I like the comparison with the @timestamp field. As Elasticsearch already ships the logs-*-* etc. templates, it probably should enforce the correctness of these indices too.

@jpountz
Contributor

jpountz commented Mar 25, 2021

I wonder if we could make these fields computed directly based on the data stream name, instead of having the type/dataset/namespace available both in mappings and data stream names, and then having to add validation to avoid inconsistencies.

For instance, maybe Elasticsearch could support scripted constant keywords, something like:

{
  "mappings": {
    "runtime": {
      "data_stream.dataset": {
        "type": "constant_keyword",
        "script": {
          "source": "data_stream.split('-')[1]"
        }
      }
    }
  }
}

Or maybe we could even have type/dataset/namespace as first-class citizens in Elasticsearch and available as metadata fields that extract values directly from the data stream's name, and then templates could make these fields available via aliases, e.g.

{
  "mappings":  {
    "properties": {
      "data_stream": {
        "properties": {
          "dataset": {
            "type": "alias",
            "path": "_dataset"
          }
        }
      }
    }
  }
}

And we could stop setting the values as part of documents.

@danhermann
Contributor

+1 on the idea of making type, dataset, and namespace first-class citizens in ES. It feels cleaner and more explicit than after-the-fact validation that the value of a document's dataset field matches the dataset portion of the data stream name.

@ruflin
Contributor Author

ruflin commented Mar 26, 2021

++ on making them first-class citizens. If documents were shipped with the fields, these would just be used for validation and then dropped from the doc itself.

@martijnvg
Member

I'm also +1 on making the type, dataset, and namespace fields first-class meta fields of a data stream; then the alias field mapping that Adrien mentions can be added automatically to the mapping of a backing index.

@dakrone
Member

dakrone commented Apr 8, 2021

So it sounds like there are a couple of options for how to derive the dataset name:

  • Derive it from the data stream dynamically
  • Store it in the metadata for the data stream
  • <something else>

In terms of deriving it from the data stream dynamically: what happens when the name of the data stream changes? (i.e., what if the original indices are added to a different alias and then the alias is converted into a data stream with a different name) Will that break the querying side if it is derived differently?

If we store it in the metadata—is it okay for that value to be missing if a user were to restore a subset of the data stream indices from a snapshot and attempt to search them? Will we write this metadata into the new data stream when converting an alias into a data stream?

@martijnvg
Member

If we store these meta properties as part of the data stream then we're resilient to a rename of the data stream, so that is a plus for that option.

Will we write this metadata into the new data stream when converting an alias into a data stream?

Perhaps the migrate API will require these meta fields as arguments? But then these meta properties could be changed to any value... However, if these meta properties were also stored as part of IndexMetadata, then we could always resolve them back?
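For illustration, a hypothetical variant of the existing migrate API with the meta fields passed explicitly (this request body is not part of the current API):

POST _data_stream/_migrate/logs-nginx.access-default
{
  "data_stream": {
    "type": "logs",
    "dataset": "nginx.access",
    "namespace": "default"
  }
}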

@ruflin
Contributor Author

ruflin commented Apr 8, 2021

Can / should the name of a data stream change?

@dakrone
Member

dakrone commented Apr 8, 2021

Can / should the name of a data stream change?

In terms of "can", yes, currently through a manual-only and convoluted process (snapshot the backing indices only, restore backing indices without restoring data stream, add backing indices to an alias, use the convert-alias-to-data-stream API to create a new data stream).

In terms of "should", I think I'd prefer to plan for a world where a data stream can be renamed, whether through a snapshot restore (similar to how we can rename indices on restore, we may eventually add something where you can rename a data stream on restore), or through a different API like cloning a data stream. Even if we don't have these features right now, it'd be good to think about whether any design would accommodate those additions in the future.

@jpountz
Contributor

jpountz commented Apr 15, 2021

One worry I have about extracting these as properties of the data stream is that we might come back to the original problem we are having here, where we'd expect data_stream.type, data_stream.dataset and data_stream.namespace to always be consistent with the data stream name {type}-{dataset}-{namespace} but we'd actually be allowing ways to create inconsistency by renaming data streams.

If a user wants to give their data streams a different name through a restore, I wonder if we should recommend changing the namespace only, e.g. moving from logs-nginx.access-default to logs-nginx.access-legacy.

@ruflin
Contributor Author

ruflin commented Apr 18, 2021

Let's assume for a moment that the data_stream fields just become a property of the data stream and are not part of each document. Even if a document is shipped with the fields inside, they would be stripped out (modifying source 🤔, ignoring for now). In this case, a rename would just be renaming properties.

In the current world, a rename would require a reindex, as the content of the data would have to be modified. I personally think this should be fine.

In any case, we should not allow inconsistencies; this smells like trouble.

@jpountz Interesting idea around the namespace. Even if data is restored, the dataset should never change. So ++: if we get to a rename feature, only changing the namespace should be allowed.

@felixbarny
Member

Related: the proposed data stream router takes care of always populating data_stream.* fields
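For context, a sketch of how such a router might be used from an ingest pipeline; the processor name and options follow the proposal and are shown for illustration only:

PUT _ingest/pipeline/route-logs
{
  "processors": [
    {
      "reroute": {
        "dataset": "nginx.access",
        "namespace": "default"
      }
    }
  ]
}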

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)
