[Indexing Strategy] Prevent wrong dataset / namespace values #64991
Pinging @elastic/es-core-features (Team:Core/Features)
@dakrone I'm assigning this issue to you as it looks related to data streams to me. Please feel free to relabel/reassign/close if you think it is more appropriate.
Preventing that the `data_stream.dataset` and `data_stream.namespace` values diverge from the data stream name sounds like application logic to me. If we determine that this is application logic then I think what @ruflin suggests (enforcing this validation via a pipeline) is best. However, if we think this kind of validation is data stream related then perhaps we can enforce this natively in Elasticsearch. There is something to say for this; for example, the existence and correctness of the `@timestamp` field is already validated natively for data streams.
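For illustration, a validation pipeline along the lines @ruflin suggests could look like the sketch below; the pipeline name and the naive name-matching condition are assumptions, not anything specified in this thread:

```json
PUT _ingest/pipeline/validate-data-stream-fields
{
  "description": "Reject documents whose data_stream.dataset does not match the data stream name",
  "processors": [
    {
      "fail": {
        "if": "ctx.data_stream?.dataset != null && !ctx._index.contains('-' + ctx.data_stream.dataset + '-')",
        "message": "data_stream.dataset does not match the data stream this document is indexed into"
      }
    }
  ]
}
```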
@martijnvg I like the comparison with the `@timestamp` field.
I wonder if we could make these fields computed directly based on the data stream name, instead of having the type/dataset/namespace available both in mappings and data stream names, and then having to add validation to avoid inconsistencies. For instance, maybe Elasticsearch could support scripted constant keywords, something like:
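Scripted constant keywords do not exist in Elasticsearch today, so the snippet below is purely hypothetical syntax (the `script` parameter on `constant_keyword` and the `indexName()` helper are invented, and it ignores that backing indices carry `.ds-` prefixes and generation suffixes):

```json
PUT _component_template/data-stream-fields
{
  "template": {
    "mappings": {
      "properties": {
        "data_stream.dataset": {
          "type": "constant_keyword",
          "script": "emit(indexName().split('-')[1])"
        },
        "data_stream.namespace": {
          "type": "constant_keyword",
          "script": "emit(indexName().split('-')[2])"
        }
      }
    }
  }
}
```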
Or maybe we could even have type/dataset/namespace as first-class citizens in Elasticsearch and available as metadata fields that extract values directly from the data stream's name, and then templates could make these fields available via aliases, e.g.
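Field aliases are a real Elasticsearch feature, but the `_data_stream_dataset`/`_data_stream_namespace` metadata fields below are hypothetical, so this is only a sketch of what such a template could look like:

```json
PUT _component_template/data-stream-aliases
{
  "template": {
    "mappings": {
      "properties": {
        "data_stream.dataset": {
          "type": "alias",
          "path": "_data_stream_dataset"
        },
        "data_stream.namespace": {
          "type": "alias",
          "path": "_data_stream_namespace"
        }
      }
    }
  }
}
```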
And we could stop setting the values as part of documents.
+1 on the idea of making type, dataset, and namespace first-class citizens in ES. It feels cleaner and more explicit than after-the-fact validation that the value of a document's `data_stream.dataset` field matches the data stream it is indexed into.
++ on making them first-class citizens. If documents were shipped with the fields, these would just be used for validation and then dropped from the doc itself.
I'm also +1 on making the type, dataset, and namespace fields first-class meta fields of a data stream and then deriving their values from the data stream name.
So it sounds like there are a couple of options for how to derive the dataset name:
- derive it dynamically from the data stream's name
- store it in the data stream's metadata
In terms of deriving it from the data stream dynamically: what happens when the name of the data stream changes? (i.e., what if the original indices are added to a different alias and the alias is then converted into a data stream with a different name?) Will that break the querying side if it is derived differently? If we store it in the metadata: is it okay for that value to be missing if a user were to restore a subset of the data stream indices from a snapshot and attempt to search them? Will we write this metadata into the new data stream when converting an alias into a data stream?
If we store these meta properties as part of the data stream then we're resilient to a rename of the data stream, so that is a plus for that option.
Perhaps the migrate API will require these meta fields as arguments? But then these meta properties could be changed to any value... However, if these meta properties were also stored as part of `IndexMetadata` then we could always resolve them back?
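Hypothetically (today's migrate-to-data-stream API takes no such body), requiring the meta fields as arguments could look like:

```json
POST _data_stream/_migrate/my-logs-alias
{
  "type": "logs",
  "dataset": "nginx.access",
  "namespace": "default"
}
```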
Can / should the name of a data stream change?
In terms of "can", yes, currently through a manual-only and convoluted process (snapshot the backing indices only, restore backing indices without restoring the data stream, add backing indices to an alias, use the convert-alias-to-data-stream API to create a new data stream). In terms of "should", I think I'd prefer to plan for a world where a data stream can be renamed, whether through a snapshot restore (similar to how we can rename indices on restore, we may eventually add something where you can rename a data stream on restore), or through a different API like cloning a data stream. Even if we don't have these features right now, it'd be good to think about whether any design would accommodate those additions in the future.
One worry I have about extracting these as properties of the data stream is that we might come back to the original problem we are having here, where we'd expect these properties to stay consistent with the data stream name, yet they could diverge. If a user wants to give their data streams a different name through a restore, I wonder if we should recommend changing the namespace only, e.g. moving from `logs-nginx-default` to `logs-nginx-restored`.
Let's assume for a moment that the `data_stream` fields just become a property on the data stream and are not part of each document. Even if a document is shipped with the fields inside, they would be stripped out (modifying source 🤔, ignoring for now). In this case, a rename would be just renaming properties. In the current world, a rename would require a reindex, as the content of the data would have to be modified. I personally think this should be fine. In any case, we should not allow inconsistencies; this smells like trouble. @jpountz Interesting idea around the namespace. Even if data is restored, the dataset should never change. So ++: if we get to a rename feature, only the namespace should be allowed to change.
Related: the proposed data stream router takes care of always populating the `data_stream.dataset` and `data_stream.namespace` fields.
Pinging @elastic/es-data-management (Team:Data Management)
The new indexing strategy consists of 3 parts: type, dataset, namespace. The default templates in Elasticsearch for `logs-*-*` and `metrics-*-*` set the default value for `data_stream.type` directly in the template, as it is known in advance. For `data_stream.dataset` and `data_stream.namespace` the value is picked from the first document which contains these fields. As long as the field does not exist in the document, it is not set. This makes the following possible:
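For example (a hedged sketch; the data stream name `logs-foo-default` and the values `hello`/`foo` are illustrative, not taken from a real setup): a first document carrying a wrong dataset value is accepted, and its value becomes the constant for the `constant_keyword` field:

```json
POST logs-foo-default/_doc?op_type=create
{
  "@timestamp": "2020-11-12T10:00:00Z",
  "message": "some log line",
  "data_stream": {
    "type": "logs",
    "dataset": "hello",
    "namespace": "default"
  }
}
```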
The value for the constant_keyword field `data_stream.dataset` is now set to `hello`, but according to the indexing strategy it should be `foo`. If a document with the correct value is ingested later, that document is rejected:
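A sketch of such a rejected request, using the same illustrative names as above:

```json
POST logs-foo-default/_doc?op_type=create
{
  "@timestamp": "2020-11-12T10:00:01Z",
  "message": "another log line",
  "data_stream": {
    "type": "logs",
    "dataset": "foo",
    "namespace": "default"
  }
}
```

Elasticsearch answers with an error along the lines of `field [data_stream.dataset] only accepts values that are equal to the value defined in the mappings [hello], but got [foo]` (exact wording varies by version).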
This is currently an edge case, as the Elastic Agent and, in the future, Logstash will always set the correct values. But with more adoption of the new indexing strategy, I expect the fields will not always be set.
Not having the fields set also leads to the problem that queries will be run, for example on `data_stream.dataset: foo`, as there appear to be indices for it while in reality the value does not exist. This issue is to discuss whether we should put any measures in place and, if yes, which options we have.
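To illustrate the last point (a sketch with the same illustrative names as above): the index pattern suggests a `foo` dataset exists, yet a query such as the following returns no hits because the field value was never set:

```json
GET logs-*/_search
{
  "query": {
    "term": {
      "data_stream.dataset": "foo"
    }
  }
}
```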