[WIP] Re-introduce hash processor #47047
Conversation
It is useful to have a processor similar to logstash-filter-fingerprint in Elasticsearch: a processor that leverages a variety of hashing algorithms to create cryptographically secure one-way hashes of values in documents. This processor applies a pbkdf2hmac hashing scheme to fields in documents before indexing.
Pinging @elastic/es-core-features
Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService, consistentSettings);
Instantiating ConsistentSettingsService here seems unnecessarily heavy, especially since the state publishing capability of ConsistentSettingsService isn't useful here. I'd suggest refactoring ConsistentSettingsService to expose the logic for checking the consistency of settings in the areAllConsistent method and call that directly both here and in the accept(ClusterState) method above.
Or ConsistentSettingsService should be made accessible in Processor.Parameters so that it can be used here?
+1 to making it accessible via Processor.Parameters.
So it would look something like this: create the ConsistentSettingsService in Node -> IngestService -> Processor.Parameters, then pass the ConsistentSettingsService to this factory.
However, I don't think that is possible today, because to instantiate a ConsistentSettingsService you need the secureSettings to check at construction time, which you won't have in the Node object.
I haven't vetted this too deeply, but I think you can change

new ConsistentSettingsService(settings, clusterService, consistentSettings)
consistentSettingsService.areAllConsistent()

to

new ConsistentSettingsService(settings, clusterService)
consistentSettingsService.areAllConsistent(consistentSettings)

and then create the ConsistentSettingsService in Node, pass it down to Processor.Parameters, and pull the service from the parameters when creating the factory.
If this does work, can you please submit that change as a separate PR?
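For illustration, the factory side of that change might then look roughly like the sketch below; the consistentSettingsService field on Processor.Parameters is the hypothetical addition being discussed here, not something that exists today.

```java
// Hypothetical sketch: no service is constructed in the factory any more; the instance
// created in Node arrives via Processor.Parameters and is queried with just the settings
// this processor cares about.
Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
if (parameters.consistentSettingsService.areAllConsistent(consistentSettings) == false) {
    throw ConfigurationUtils.newConfigurationException(TYPE, processorTag, "key_setting",
            "inconsistent hash key [" + keySettingName + "]");
}
```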
Looks good! I left a few comments.
Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService, consistentSettings);
if (consistentSettingsService.areAllConsistent() == false) {
    throw ConfigurationUtils.newConfigurationException(TYPE, processorTag, "key_setting", "inconsistent hash key [" + keySettingName + "]");
Is this expected to fail often? I'm asking because this method is used both to validate the processor config on the elected master node and to instantiate the processor instance for a pipeline on all ingest nodes. In the latter case there is no good retry mechanism if this fails here (only when the next cluster state update comes, but that may take a while). So maybe we can fail only in the processor itself, when it is detected via accept(...) that settings are inconsistent? Then we also have a single place where this logic is applied.
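A minimal sketch of that alternative, purely to make it concrete (the consistentHashes field and the exact error message are assumptions):

```java
// Hypothetical sketch: the factory no longer throws; the failure happens at execution time,
// driven by the consistentHashes flag that accept(ClusterState) maintains.
@Override
public IngestDocument execute(IngestDocument document) {
    if (consistentHashes == false) {
        throw new IllegalStateException("hash key [" + keySettingName + "] is inconsistent across the cluster");
    }
    // ... existing hashing of the configured fields ...
    return document;
}
```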
@Override
public void accept(ClusterState clusterState) {
    // check hash keys for consistency and unset consistentHashes flag if inconsistent
IngestService#addIngestClusterStateListener(...) should be invoked in the factory, otherwise this method will never be invoked.
Also, in the enrich branch the enrich processor factory implements Consumer<ClusterState>, whereas here the processor itself does. This is a subtle difference: processor instances are created when pipelines are created and discarded when a pipeline no longer exists. However, if we discard this kind of processor, we still keep a reference to it via the ingestClusterStateListeners list in IngestService. In order to make this work, the following changes should be made (a sketch follows below):

- An IngestService#removeIngestClusterStateListener(...) method should be added.
- The innerUpdatePipelines(...) method in IngestService around line 569 should invoke close on processors that implement Closeable.
- This processor should implement the Closeable interface and then invoke IngestService#removeIngestClusterStateListener(this).
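To visualize the lifecycle wiring described in the list above, here is a rough sketch; the remove hook does not exist yet and the hashing internals are elided:

```java
// Hypothetical sketch: the processor registers itself as an ingest cluster state listener on
// creation and deregisters in close(), which innerUpdatePipelines(...) would call when the
// pipeline owning this processor goes away.
public final class HashProcessor extends AbstractProcessor implements Consumer<ClusterState>, Closeable {

    private final IngestService ingestService;
    private volatile boolean consistentHashes = true;

    HashProcessor(String tag, IngestService ingestService) {
        super(tag);
        this.ingestService = ingestService;
        ingestService.addIngestClusterStateListener(this);    // existing hook
    }

    @Override
    public void accept(ClusterState clusterState) {
        // check hash keys for consistency and unset consistentHashes flag if inconsistent
    }

    @Override
    public IngestDocument execute(IngestDocument document) {
        // hashing of the configured fields elided in this sketch
        return document;
    }

    @Override
    public String getType() {
        return "hash";
    }

    @Override
    public void close() {
        ingestService.removeIngestClusterStateListener(this); // proposed new hook
    }
}
```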
Thanks for the review comments, @martijnvg. What do you think about centralizing the consistency checking of settings in ConsistentSettingsService? Each hash processor would then verify the consistency of its own hash key with the service. That would eliminate the need to track the lifecycle of the processor and would also resolve the question above about failing the pipeline validation check on the elected master node. It might also have a performance benefit when multiple hash processors use the same key, since the consistency check would then happen only once for all of them.
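To make that concrete, each processor could do something like the sketch below with the shared service; the per-key call shape builds on the areAllConsistent(...) overload suggested earlier in this thread and is an assumption, not existing API:

```java
// Hypothetical sketch: a hash processor validates only its own key via the shared service
// before hashing, instead of tracking cluster state updates itself.
private void ensureKeyConsistent() {
    Setting<?> keySetting = HMAC_KEY_SETTING.getConcreteSetting(keySettingName);
    if (consistentSettingsService.areAllConsistent(Collections.singletonList(keySetting)) == false) {
        throw new IllegalStateException("inconsistent hash key [" + keySettingName + "]");
    }
}
```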
I see, that makes sense. But then ConsistentSettingsService should be a public component that is made accessible in Processor.Parameters?
Yes, I'll make the changes so ConsistentSettingsService is accessible in Processor.Parameters and request another review pass then.
I recently started working on a fingerprint processor for Beats. However, I wanted to discuss the use case that motivated the processor work in Beats in the context of the work being done in this PR here.

One of the use cases of fingerprinting is deduplication of documents. Imagine a Beat retrying sending the same document to Elasticsearch. By sending documents through an ingest pipeline that fingerprints the document and uses the result as the document ID, the retried delivery would not create a duplicate document.

Per the current implementation in this PR, users are required to define a secret key to be used in the fingerprint MAC calculation. Users must define this key on every Elasticsearch node in its keystore. For the deduplication use case, I wonder if the secret key is overkill, especially since it requires additional setup in the keystore on every Elasticsearch node by users? Imagine a Filebeat module defining an ingest pipeline with this processor: every user of that module would first have to set up the key on every node.

I do agree that, for the nonrepudiation use case, having a secret key makes sense, as it makes it harder to tamper with the document contents and fingerprint. So what do you think about making the secret key optional? That is, if the key setting is not defined, the processor would compute the fingerprint without a secret key.
@ycombinator I think the hash processor is solving a slightly different use case than a fingerprint processor would. With this hash processor, you have to define which fields you wish to hash, and each field will end up with its own hash. This is primarily for an anonymization use case, so you can anonymize while still preserving relationships and perform analysis on the anonymized data. I think what you want is a way to (easily) say hash all (or a specified subset) of the fields in this document to get a single hash value that can be used for the _id? If so, I don't think this processor will help you. I added the team-discuss label to talk about this use case, but feel free to log an issue if you feel that this is something that should be handled by ES instead of Beats.
Ah, thanks for clarifying @jakelandis. I made my way to this PR by following links from #16938. Is that issue still planned?
No. It looks like the hash processor superseded it, but without fulfilling all the use cases as originally suggested on that issue.
I also just realized that if a user decides to add a field to the configuration of the ingest processor, it will result in a mapping change from text to object, possibly breaking the ability to index. I think we should change it to always emit a map unless only one value is defined AND a target field is specified. That, or maybe introduce a new option "replace_field": true or false, make "target_field" and "replace_field" mutually exclusive, and have "target_field" always emit a map and "replace_field" always emit text.
Removing the team-discuss label - Beats has merged its fingerprint processor, Logstash has one too, and this processor will not be enhanced to (easily) serve that use case. If we need a fingerprint processor, we can introduce one independent of this hash processor.
@elasticmachine update branch
…o child processors
This commit does not use the
With this approach, it may also be possible to make the hash processor's secret key reloadable so that an ingest node with an inconsistent key can be fixed without restarting the node. This approach for ensuring secret key consistency is also planned for use with encrypted snapshot repositories because the same issues arose there.
@jakelandis, what do you think of the approach described in this comment?
@elasticmachine update branch
Closing as this work is no longer scheduled.
Re-introduces #31087 and adds checks to ensure that the hash key is consistent with the current cluster state.
The consistency of the hash key is checked at pipeline creation time (either at node startup for an existing pipeline or when a new pipeline is created) and any time the cluster state changes. The latter check is necessary because nothing prevents a master-eligible node from joining the cluster with an inconsistent hash key. Were that node subsequently elected master, it would publish its inconsistent hash key as part of the cluster state.
If the hash key is found to be inconsistent at pipeline creation time, the pipeline will not be created and a descriptive error message will be logged. Any attempts to index documents through that pipeline will fail with an error message that the pipeline could not be found.
If the hash key is found to be inconsistent at cluster state change time, a flag will be set in the hash processor and any attempts to hash documents will fail with a descriptive error message.
Draft -- still needs tests.