
[Indexing Strategy] Prevent wrong dataset / namespace values #64991

Open
ruflin opened this issue Nov 12, 2020 · 17 comments
Labels
:Data Management/Data streams Data streams and their lifecycles >feature Team:Data Management Meta label for data/management team

Comments

@ruflin
Contributor

ruflin commented Nov 12, 2020

The new indexing strategy consists of 3 parts: type, dataset, namespace. The default templates in Elasticsearch for logs-*-* and metrics-*-* set the default value for data_stream.type directly in the template as it is known in advance. For data_stream.dataset and data_stream.namespace the value is picked from the first document which contains these fields. As long as the field does not exist in the document, it is not set.
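For illustration, a simplified sketch of such a template (the template name and body here are examples, not the exact built-in definitions):

PUT _index_template/logs-example
{
  "index_patterns": ["logs-*-*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "data_stream": {
          "properties": {
            "type": { "type": "constant_keyword", "value": "logs" },
            "dataset": { "type": "constant_keyword" },
            "namespace": { "type": "constant_keyword" }
          }
        }
      }
    }
  }
}

Because dataset and namespace have no value in the template, each backing index locks them in from the first document that carries them.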

This makes the following possible:

POST logs-foo-bar/_doc/
{
  "@timestamp": "2099-11-15T13:12:00",
  "message": "GET /search HTTP/1.1 200 1070000"
  "data_stream.dataset": "hello"
}

The value of the constant_keyword field data_stream.dataset is now set to hello, but according to the indexing strategy it should be foo. If a document with the correct value is ingested later, it is rejected:

POST logs-foo-bar/_doc/
{
  "@timestamp": "2099-11-15T13:12:00",
  "message": "GET /search HTTP/1.1 200 1070000",
  "data_stream.dataset": "foo"
}

This is currently an edge case, as the Elastic Agent and, in the future, Logstash will always set the correct values. But with more adoption of the new indexing strategy, I expect the fields will not always be set.

Not having the fields set also leads to the problem that queries, for example on data_stream.dataset=foo, are run against indices that appear to hold that value when in reality it does not exist.
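For example, the following search fans out to logs-foo-bar because the index pattern matches, even though no document in that data stream carries the value:

GET logs-*-*/_search
{
  "query": {
    "term": { "data_stream.dataset": "foo" }
  }
}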

This issue is to discuss whether we should put any measures in place and, if so, which options we have.

@ruflin
Contributor Author

ruflin commented Nov 12, 2020

@jpountz @dakrone One idea I have is that we could use an ingest pipeline to automatically add the missing values.
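A minimal sketch of what that could look like, assuming the fields arrive as a nested data_stream object and the target name follows the {type}-{dataset}-{namespace} scheme (the pipeline name and script are hypothetical):

PUT _ingest/pipeline/fill-data-stream-fields
{
  "description": "Fill in missing data_stream fields from the target name",
  "processors": [
    {
      "script": {
        "source": """
          // ctx['_index'] holds the name the document was sent to, e.g. logs-foo-bar
          def parts = ctx['_index'].splitOnToken('-');
          if (parts.length == 3) {
            if (ctx['data_stream'] == null) {
              ctx['data_stream'] = new HashMap();
            }
            ctx['data_stream'].putIfAbsent('type', parts[0]);
            ctx['data_stream'].putIfAbsent('dataset', parts[1]);
            ctx['data_stream'].putIfAbsent('namespace', parts[2]);
          }
        """
      }
    }
  ]
}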

@tlrx tlrx added the :Data Management/Data streams Data streams and their lifecycles label Dec 3, 2020
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Dec 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@tlrx tlrx added >feature and removed Team:Data Management Meta label for data/management team labels Dec 3, 2020
@tlrx
Member

tlrx commented Dec 3, 2020

@dakrone I'm assigning this issue to you as it looks related to data streams to me. Please feel free to relabel/reassign/close if you think it is more appropriate.

@danhermann danhermann added the Team:Data Management Meta label for data/management team label Dec 3, 2020
@martijnvg
Member

Preventing the data_stream.dataset, data_stream.type and data_stream.namespace fields from containing values that don't match what they should be based on the data stream name is, I think, validation that falls into either the application logic category or the data stream logic category.

If we determine that this is application logic, then I think what @ruflin suggests (enforcing this validation via a pipeline) is best. However, if we think this kind of validation is data stream related, then perhaps we can enforce it natively in Elasticsearch. There is something to be said for this; for example, the existence and correctness of the @timestamp field is natively enforced. We can argue that data_stream.dataset, data_stream.type and data_stream.namespace field validation falls into the same category as the data stream's timestamp field validation.
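For example, indexing a document without a timestamp into a data stream already fails natively:

POST logs-foo-bar/_doc/
{
  "message": "GET /search HTTP/1.1 200 1070000"
}

This request is rejected because the data stream's @timestamp field is missing; the same style of enforcement could cover the data_stream.* fields.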

@ruflin
Contributor Author

ruflin commented Mar 17, 2021

@martijnvg I like the comparison with the @timestamp field. As Elasticsearch already ships the logs-*-* etc. templates, it probably should enforce the correctness of these indices too.

@jpountz
Contributor

jpountz commented Mar 25, 2021

I wonder if we could make these fields computed directly based on the data stream name, instead of having the type/dataset/namespace available both in mappings and data stream names, and then having to add validation to avoid inconsistencies.

For instance, maybe Elasticsearch could support scripted constant keywords, something like:

{
  "mappings": {
    "runtime": {
      "data_stream.dataset": {
        "type": "constant_keyword",
        "script": {
          "source": "data_stream.split('-')[1]"
        }
      }
    }
  }
}

Or maybe we could even have type/dataset/namespace as first-class citizens in Elasticsearch and available as metadata fields that extract values directly from the data stream's name, and then templates could make these fields available via aliases, e.g.

{
  "mappings":  {
    "properties": {
      "data_stream": {
        "properties": {
          "dataset": {
            "type": "alias",
            "path": "_dataset"
          }
        }
      }
    }
  }
}

And we could stop setting the values as part of documents.

@danhermann
Contributor

+1 on the idea of making type, dataset, and namespace first-class citizens in ES. It feels cleaner and more explicit than after-the-fact validation that the value of a document's dataset field matches the dataset portion of the data stream name.

@ruflin
Contributor Author

ruflin commented Mar 26, 2021

++ on making them first-class citizens. If documents were shipped with the fields, these would just be used for validation and then dropped from the doc itself.

@martijnvg
Member

I'm also +1 on making the type, dataset, and namespace fields first-class meta fields of a data stream; then the alias field mapping that Adrien mentions can be added automatically to the mapping of a backing index.

@dakrone
Member

dakrone commented Apr 8, 2021

So it sounds like there are a couple of options for how to derive the dataset name:

  • Derive it from the data stream dynamically
  • Store it in the metadata for the data stream
  • <something else>

In terms of deriving it from the data stream dynamically: what happens when the name of the data stream changes? (i.e., what if the original indices are added to a different alias and then the alias is converted into a data stream with a different name) Will that break the querying side if it is derived differently?

If we store it in the metadata—is it okay for that value to be missing if a user were to restore a subset of the data stream indices from a snapshot and attempt to search them? Will we write this metadata into the new data stream when converting an alias into a data stream?

@martijnvg
Member

If we store these meta properties as part of the data stream then we're resilient to a rename of the data stream, so that is a plus for that option.

Will we write this metadata into the new data stream when converting an alias into a data stream?

Perhaps the migrate API will require these meta fields as arguments? But then these meta properties could be changed to any value... However, if these meta properties were also stored as part of IndexMetadata, then we could always resolve them back?
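For illustration, a hypothetical variant of the existing migrate API with the meta fields passed explicitly (this request body is not part of the current API):

POST _data_stream/_migrate/logs-nginx.access-default
{
  "data_stream": {
    "type": "logs",
    "dataset": "nginx.access",
    "namespace": "default"
  }
}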

@ruflin
Contributor Author

ruflin commented Apr 8, 2021

Can / should the name of a data stream change?

@dakrone
Member

dakrone commented Apr 8, 2021

Can / should the name of a data stream change?

In terms of "can", yes, currently through a manual-only and convoluted process (snapshot the backing indices only, restore backing indices without restoring data stream, add backing indices to an alias, use the convert-alias-to-data-stream API to create a new data stream).

In terms of "should", I think I'd prefer to plan for a world where a data stream can be renamed, whether through a snapshot restore (similar to how we can rename indices on restore, we may eventually add something where you can rename a data stream on restore), or through a different API like cloning a data stream. Even if we don't have these features right now, it'd be good to think about whether any design would accommodate those additions in the future.

@jpountz
Contributor

jpountz commented Apr 15, 2021

One worry I have about extracting these as properties of the data stream is that we might come back to the original problem we are having here, where we'd expect data_stream.type, data_stream.dataset and data_stream.namespace to always be consistent with the data stream name {type}-{dataset}-{namespace} but we'd actually be allowing ways to create inconsistency by renaming data streams.

If a user wants to give their data streams a different name through a restore, I wonder if we should recommend changing the namespace only, e.g. moving from logs-nginx.access-default to logs-nginx.access-legacy.

@ruflin
Contributor Author

ruflin commented Apr 18, 2021

Let's assume for a moment that the data_stream fields just become a property of the data stream and are not part of each document. Even if a document is shipped with the fields inside, they would be stripped out (modifying source 🤔, ignoring for now). In this case, a rename would just be renaming properties.

In the current world, a rename would require a reindex, as the content of the data would have to be modified. I personally think this should be fine.

In any case, we should not allow inconsistencies; this smells like trouble.

@jpountz Interesting idea around the namespace. Even if data is restored, the dataset should never change. So ++: if we get to a rename feature, only changing the namespace should be allowed.

@felixbarny
Member

Related: the proposed data stream router takes care of always populating data_stream.* fields
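For context, a sketch of how such a router might be used from an ingest pipeline; the processor name and options follow the proposal and are shown for illustration only:

PUT _ingest/pipeline/route-logs
{
  "processors": [
    {
      "reroute": {
        "dataset": "nginx.access",
        "namespace": "default"
      }
    }
  ]
}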

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)
