
[WIP] Re-introduce hash processor #47047

Closed

Conversation

danhermann (Contributor)

Re-introduces #31087 and adds checks to ensure that the hash key is consistent with current cluster state.

The consistency of the hash key is checked at pipeline creation time (either at node startup for an existing pipeline or when a new pipeline is created) and any time cluster state changes. The latter check is necessary because nothing prevents a master-eligible node from joining the cluster with an inconsistent hash key. Were that node to be subsequently elected as master, it would publish its inconsistent hash key as part of the cluster state.

If the hash key is found to be inconsistent at pipeline creation time, the pipeline will not be created and a descriptive error message will be logged. Any attempts to index documents through that pipeline will fail with an error message that the pipeline could not be found.

If the hash key is found to be inconsistent at cluster state change time, a flag will be set in the hash processor and any attempts to hash documents will fail with a descriptive error message.
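For illustration, a minimal sketch of that flag behavior (class, field, and method names here are hypothetical, not from this PR's diff):

import java.util.function.Consumer;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.ingest.IngestDocument;

// Simplified sketch: the processor listens for cluster state changes and
// refuses to hash once an inconsistent key has been observed.
public final class HashProcessorSketch implements Consumer<ClusterState> {

    private volatile boolean consistentKeys = true; // hypothetical flag

    @Override
    public void accept(ClusterState clusterState) {
        // re-check the hash key against the newly published cluster state,
        // since a master with a different key may have been elected
        consistentKeys = checkKeyConsistency(clusterState);
    }

    public IngestDocument execute(IngestDocument document) {
        if (consistentKeys == false) {
            throw new IllegalStateException("hash key is inconsistent with cluster state");
        }
        // ... hash the configured fields ...
        return document;
    }

    private boolean checkKeyConsistency(ClusterState clusterState) {
        return true; // placeholder for the actual consistency check
    }
}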

Draft -- still needs tests.

talevy and others added 3 commits September 24, 2019 13:07
It is useful to have a processor similar to logstash-filter-fingerprint in Elasticsearch: a processor that leverages a variety of hashing algorithms to create cryptographically secure one-way hashes of field values in documents.

This processor applies a pbkdf2hmac hashing scheme to document fields at indexing time.
@danhermann added the WIP and :Data Management/Ingest Node labels Sep 24, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features

}

Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService, consistentSettings);
Contributor Author

Instantiating ConsistentSettingsService here seems unnecessarily heavy, especially since the state publishing capability of ConsistentSettingsService isn't useful here. I'd suggest refactoring ConsistentSettingsService to expose the logic for checking the consistency of settings in the areAllConsistent method and call that directly both here and in the accept(ClusterState) method above.

Member

Or ConsistentSettingsService should be made accessible in Processor.Parameters so that it can be used here?

Contributor

+1 to making it accessible via Processor.Parameters

So it would look something like this:

(create the ConsistentSettingsService in) Node -> IngestService -> Processor.Params, then pass the ConsistentSettingsService to this factory.

However, I don't think that is possible today, because to instantiate a ConsistentSettingsService you need the secureSettings to check at construction time, which you won't have in the Node object.

I haven't vetted this too deeply, but I think you can change

new ConsistentSettingsService(settings, clusterService, consistentSettings)
consistentSettingsService.areAllConsistent()

to

new ConsistentSettingsService(settings, clusterService)
consistentSettingsService.areAllConsistent(consistentSettings)

and then create the ConsistentSettingsService in Node, and pass it down to Processor.Params then pull the service from the params when creating the factory.
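If that works, the wiring might look roughly like this (a sketch only; the consistentSettingsService field on Processor.Parameters is hypothetical):

// in Node: construct the service once, without binding the settings to check
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService);

// in the hash processor factory: pull the service from Processor.Parameters
// (hypothetical field) and check only this processor's settings
Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
if (parameters.consistentSettingsService.areAllConsistent(consistentSettings) == false) {
    throw ConfigurationUtils.newConfigurationException(TYPE, processorTag, "key_setting", "inconsistent hash key [" + keySettingName + "]");
}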

If this does work, can you please submit that change as a separate PR?

@martijnvg (Member) left a comment

Looks good! I left a few comments.

Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService, consistentSettings);
if (consistentSettingsService.areAllConsistent() == false) {
throw ConfigurationUtils.newConfigurationException(TYPE, processorTag, "key_setting", "inconsistent hash key [" + keySettingName + "]");
Member

Is this expected to fail often? I'm asking because this method is used both to validate the processor config on the elected master node and to instantiate the processor instance for a pipeline on all ingest nodes. In the latter case there is no good retry mechanism if this fails (only when the next cluster state update arrives, which may take a while). So maybe we can fail only in the processor itself, when accept(...) detects that the settings are inconsistent? Then we would also have a single place where this logic is applied.


@Override
public void accept(ClusterState clusterState) {
// check hash keys for consistency and unset consistentHashes flag if inconsistent
Member

IngestService#addIngestClusterStateListener(...) should be invoked in the factory, otherwise this method will never be invoked.

Member

Also, in the enrich branch the enrich processor factory implements Consumer<ClusterState>, while here the processor itself does. This is a subtle difference: processor instances are created when pipelines are created and discarded when a pipeline no longer exists. However, if we discard this kind of processor, we still keep a reference to it via the ingestClusterStateListeners list in IngestService. In order to make this work, the following changes should be made (see the sketch after this list):

  • An IngestService#removeIngestClusterStateListener(...) method should be added.
  • The innerUpdatePipelines(...) method in IngestService, around line 569, should invoke close() on processors that implement Closeable.
  • This processor should implement the Closeable interface and invoke IngestService#removeIngestClusterStateListener(this) from close().
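A sketch of that lifecycle wiring (removeIngestClusterStateListener does not exist yet, and the constructor shape is illustrative):

import java.io.Closeable;
import java.util.function.Consumer;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.ingest.AbstractProcessor;
import org.elasticsearch.ingest.IngestDocument;
import org.elasticsearch.ingest.IngestService;

public final class HashProcessor extends AbstractProcessor implements Closeable, Consumer<ClusterState> {

    private final IngestService ingestService;

    HashProcessor(String tag, IngestService ingestService) {
        super(tag);
        this.ingestService = ingestService;
        // registered here (or in the factory) so accept(...) is actually invoked
        ingestService.addIngestClusterStateListener(this);
    }

    @Override
    public void accept(ClusterState clusterState) {
        // re-check hash key consistency on every cluster state update
    }

    @Override
    public IngestDocument execute(IngestDocument document) {
        // ... hash the configured fields ...
        return document;
    }

    @Override
    public String getType() {
        return "hash";
    }

    @Override
    public void close() {
        // invoked by innerUpdatePipelines(...) when the pipeline is discarded,
        // so the listener list does not leak references to dead processors
        ingestService.removeIngestClusterStateListener(this);
    }
}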

@danhermann (Contributor Author), Sep 25, 2019

Thanks for the review comments, @martijnvg. What do you think about centralizing the consistency checking of settings in ConsistentSettingsService? Each hash processor would then verify the consistency of its own hash key with the service. That would eliminate the need to track the lifecycle of the processor and would also resolve the question above about failing the pipeline validation check on the elected master node. It might also have a performance benefit if multiple hash processors used the same key, since the consistency check for all of them would happen only once.
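For example, each processor's hot path could reduce to something like this (the per-setting isConsistent method on ConsistentSettingsService is hypothetical):

// hypothetical: the service owns all consistency state, the processor just asks
public IngestDocument execute(IngestDocument document) {
    if (consistentSettingsService.isConsistent(keySetting) == false) {
        throw new IllegalStateException("inconsistent hash key [" + keySetting.getKey() + "]");
    }
    // ... hash the configured fields ...
    return document;
}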

Member

I see, that makes sense. But then ConsistentSettingsService should be a public component that is made accessible in Processor.Parameters?

Contributor Author

Yes, I'll make the changes so ConsistentSettingsService is accessible in Processor.Parameters and request another review pass then.

@ycombinator (Contributor) commented Oct 23, 2019

I recently started working on a fingerprint processor for Beats, unaware of this PR here. Now that I've become aware of this PR (thanks @andrewkroh for the pointer to it and the discussion in elastic/beats#1872) and after off-GitHub discussions with @urso and @danhermann, I think it makes sense for me to not continue working on the Beats processor.

However, I wanted to discuss the use case that motivated the processor work in Beats in the context of the work being done in this PR here.

One of the use cases of fingerprinting is deduplication of documents. Imagine a Beat retrying sending the same document to Elasticsearch. By sending documents through an Ingest pipeline with the hash processor being implemented in this PR, an _id for the document could be derived and the document would not be duplicated in the index.

Per the current implementation in this PR, users are required to define a secret key to be used in the fingerprint MAC calculation. Users must define this key on every Elasticsearch node in its keystore via a setting that looks like xpack.security.ingest.hash.<processorName>.key. The key must be consistent across all Ingest nodes or the node with the inconsistent key will not run the Ingest pipeline. Finally, users must reference this setting name in their Ingest pipeline definition where they use the hash processor, via the key_setting processor option.

For the deduplication use case, I wonder if the secret key is overkill, especially since it requires users to perform additional keystore setup on every Elasticsearch node. Imagine a Filebeat module defining an ingest pipeline with the hash processor in it. It would have to specify a value for the key_setting processor option. When users are about to use this Filebeat module, they would need to make sure the setting specified in the key_setting option is created in the keystore on every Elasticsearch node.

I do agree that, for the nonrepudiation use case, having a secret key makes sense as it makes it harder to tamper with the document contents and fingerprint.

So what do you think about making the secret key optional? That is, if the key_setting processor option isn't specified, the processor wouldn't even try to load up the secret key from the keystore. This would make the Filebeat setup for the deduplication use case much easier.

@jakelandis (Contributor)

@ycombinator I think the hash processor is solving a slightly different use case than a fingerprint processor would. With this hash processor, you have to define which fields you wish to hash, and each field ends up with its own hash. This is primarily for an anonymization use case, so you can anonymize data while still preserving relationships and performing analysis on the anonymized data.

I think what you want is a way to (easily) hash all (or a specified subset) of the fields in a document to get a single hash value that can be used for the _id? If so, I don't think this processor will help you. I added the team-discuss label to talk about this use case, but feel free to log an issue if you feel this is something that should be handled by ES instead of Beats.

@ycombinator (Contributor)

Ah, thanks for clarifying, @jakelandis. I made my way to this PR by following links from #16938, which is about a fingerprint processor in Elasticsearch (as opposed to the hash one being implemented by this PR). Are there still plans to implement a fingerprint processor in Elasticsearch? If not, I will resume working on a fingerprint processor in Beats.

@jakelandis (Contributor)

Are there still plans to implement a fingerprint processor in Elasticsearch?

No. It looks like the hash processor superseded it, but without fulfilling all the use cases as originally suggested on that issue.

@jakelandis (Contributor)

I also just realized that if a user decides to add a field to the configuration of the ingest processor, it will result in a mapping change from text to object, possibly breaking the ability to index. I think we should change it to always emit a map unless there is only one value defined AND a target field is specified.

That, or maybe introduce a new option "replace_field": true or false, make "target_field" and "replace_field" mutually exclusive, and have "target_field" always emit a map while "replace_field" always emits text (see the sketch below).
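For illustration, the two modes might behave like this (a sketch; the option handling and the hash helper are hypothetical):

Map<String, Object> hashes = new HashMap<>();
for (String field : fields) {
    hashes.put(field, hash(document.getFieldValue(field, String.class)));
}
if (replaceField) {
    // single field only: overwrite the value in place, so the mapping stays text
    document.setFieldValue(fields.get(0), hashes.get(fields.get(0)));
} else {
    // always emit a map under target_field, so adding a second field later
    // never flips an existing text mapping to object
    document.setFieldValue(targetField, hashes);
}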

@jakelandis (Contributor)

Removing the team-discuss label - Beats has merged its fingerprint processor, Logstash has one too, and this processor will not be enhanced to (easily) serve that use case. If we need a fingerprint processor, we can introduce one independent of this hash processor.

@danhermann (Contributor Author)

@elasticmachine update branch

@danhermann (Contributor Author)

This commit does not use the ConsistentSettingsService to ensure that hash processors on different nodes have consistent keys. Instead, the hash of the hash processor's key on the master node at the time the ingest pipeline was created is persisted in cluster state as part of the pipeline definition. Each ingest node can compare the hash of its own key to the one persisted in cluster state and raise an error if the two are not the same. This addresses two difficulties with using the ConsistentSettingsService with the hash processor:

  • The biggest difficulty is that ConsistentSettingsService broadcasts the value of the key on the current master node. Because different master-eligible nodes may have different values in their keystores for the hash processor's secret key, it would be possible for the broadcasted value of a hash processor's key to change if a failover occurred to a master node with a different key. This would, of course, result in inconsistent hashing.

  • A second difficulty is that the key for a hash processor could change by changing its value in the keystore and restarting the master node. This would result in inconsistent hashing for documents before and after the key change. Ideally, the lifetime of the hash processor's key would be tied to the lifetime of the processor itself. This approach ensures that a hash processor's key may not change.

With this approach, it may also be possible to make the hash processor's secret key reloadable so that an ingest node with an inconsistent key can be fixed without restarting the node.

This approach for ensuring secret key consistency is also planned for use with encrypted snapshot repositories because the same issues arose there.
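A rough sketch of that comparison (simplified; the real change would likely salt the digest, and all variable names here are illustrative):

// on the master node at pipeline creation time: store a digest of its key
// in the pipeline definition that goes into cluster state
String persistedKeyDigest = MessageDigests.toHexString(MessageDigests.sha256().digest(masterKeyBytes));

// on each ingest node when the processor is instantiated: compare the digest
// of the local key against the digest persisted with the pipeline
String localKeyDigest = MessageDigests.toHexString(MessageDigests.sha256().digest(localKeyBytes));
if (localKeyDigest.equals(persistedKeyDigest) == false) {
    throw new IllegalStateException("hash key on this node does not match the key the pipeline was created with");
}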

@danhermann (Contributor Author)

@jakelandis, what do you think of the approach described in this comment?

@danhermann (Contributor Author)

@elasticmachine update branch

@danhermann (Contributor Author)

Closing as this work is no longer scheduled.

@danhermann closed this Mar 4, 2021