
[WIP] Re-introduce hash processor #47047

Closed

Conversation

danhermann (Contributor)

Re-introduces #31087 and adds checks to ensure that the hash key is consistent with current cluster state.

The consistency of the hash key is checked at pipeline creation time (either at node startup for an existing pipeline or when a new pipeline is created) and any time cluster state changes. The latter check is necessary because nothing prevents a master-eligible node from joining the cluster with an inconsistent hash key. Were that node to be subsequently elected as master, it would publish its inconsistent hash key as part of the cluster state.

If the hash key is found to be inconsistent at pipeline creation time, the pipeline will not be created and a descriptive error message will be logged. Any attempts to index documents through that pipeline will fail with an error message that the pipeline could not be found.

If the hash key is found to be inconsistent at cluster state change time, a flag will be set in the hash processor and any attempts to hash documents will fail with a descriptive error message.
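For illustration, a minimal sketch of that flag behavior (class, field, and method names here are hypothetical, not from this PR's diff):

import java.util.function.Consumer;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.ingest.IngestDocument;

// Simplified sketch: the processor listens for cluster state changes and
// refuses to hash once an inconsistent key has been observed.
public final class HashProcessorSketch implements Consumer<ClusterState> {

    private volatile boolean consistentKeys = true; // hypothetical flag

    @Override
    public void accept(ClusterState clusterState) {
        // re-check the hash key against the newly published cluster state,
        // since a master with a different key may have been elected
        consistentKeys = checkKeyConsistency(clusterState);
    }

    public IngestDocument execute(IngestDocument document) {
        if (consistentKeys == false) {
            throw new IllegalStateException("hash key is inconsistent with cluster state");
        }
        // ... hash the configured fields ...
        return document;
    }

    private boolean checkKeyConsistency(ClusterState clusterState) {
        return true; // placeholder for the actual consistency check
    }
}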

Draft -- still needs tests.

talevy and others added 3 commits September 24, 2019 13:07
It is useful to have a processor similar to logstash-filter-fingerprint in Elasticsearch: a processor that leverages a variety of hashing algorithms to create cryptographically secure one-way hashes of field values in documents.

This processor applies a pbkdf2hmac hashing scheme to document fields at indexing time.
@danhermann added the WIP and :Data Management/Ingest Node labels Sep 24, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features

}

Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService, consistentSettings);
Contributor Author

Instantiating ConsistentSettingsService here seems unnecessarily heavy, especially since the state publishing capability of ConsistentSettingsService isn't useful here. I'd suggest refactoring ConsistentSettingsService to expose the logic for checking the consistency of settings in the areAllConsistent method and call that directly both here and in the accept(ClusterState) method above.

Member

Or ConsistentSettingsService should be made accessible in Processor.Parameters so that it can be used here?

Contributor

+1 to making it accessible via Processor.Parameters

So it would look something like this:

(create the ConsistentSettingsService in) Node -> IngestService -> Processor.Params, then pass the ConsistentSettingsService to this factory.

However, I don't think that is possible today, because to instantiate a ConsistentSettingsService you need the secureSettings to check at construction time, which you won't have in the Node object.

I haven't vetted this too deeply, but I think you can change

new ConsistentSettingsService(settings, clusterService, consistentSettings)
consistentSettingsService.areAllConsistent()

to

new ConsistentSettingsService(settings, clusterService)
consistentSettingsService.areAllConsistent(consistentSettings)

and then create the ConsistentSettingsService in Node, and pass it down to Processor.Params then pull the service from the params when creating the factory.
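If that works, the wiring might look roughly like this (a sketch only; the consistentSettingsService field on Processor.Parameters is hypothetical):

// in Node: construct the service once, without binding the settings to check
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService);

// in the hash processor factory: pull the service from Processor.Parameters
// (hypothetical field) and check only this processor's settings
Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
if (parameters.consistentSettingsService.areAllConsistent(consistentSettings) == false) {
    throw ConfigurationUtils.newConfigurationException(TYPE, processorTag, "key_setting", "inconsistent hash key [" + keySettingName + "]");
}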

If this does work, can you please submit that change as a separate PR?

@martijnvg (Member) left a comment

Looks good! I left a few comments.

Collection<Setting<?>> consistentSettings = HMAC_KEY_SETTING.getAllConcreteSettings(settings).collect(Collectors.toList());
ConsistentSettingsService consistentSettingsService = new ConsistentSettingsService(settings, clusterService, consistentSettings);
if (consistentSettingsService.areAllConsistent() == false) {
throw ConfigurationUtils.newConfigurationException(TYPE, processorTag, "key_setting", "inconsistent hash key [" + keySettingName + "]");
Member

Is this expected to fail often? I'm asking because this method is used both to validate the processor config on the elected master node and to instantiate the processor instance for a pipeline on all ingest nodes. In the latter case there is no good retry mechanism if this fails (only when the next cluster state update arrives, which may take a while). So maybe we can fail only in the processor itself, when accept(...) detects that the settings are inconsistent? Then we would also have a single place where this logic is applied.


@Override
public void accept(ClusterState clusterState) {
// check hash keys for consistency and unset consistentHashes flag if inconsistent
Member

IngestService#addIngestClusterStateListener(...) should be invoked in the factory, otherwise this method will never be invoked.

Member

Also, in the enrich branch the enrich processor factory implements Consumer<ClusterState>, while here the processor itself does. This is a subtle difference: processor instances are created when pipelines are created and discarded when a pipeline no longer exists. However, if we discard this kind of processor, we still keep a reference to it via the ingestClusterStateListeners list in IngestService. In order to make this work, the following changes should be made (see the sketch after this list):

  • An IngestService#removeIngestClusterStateListener(...) method should be added.
  • The innerUpdatePipelines(...) method in IngestService, around line 569, should invoke close() on processors that implement Closeable.
  • This processor should implement the Closeable interface and invoke IngestService#removeIngestClusterStateListener(this) from close().
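A sketch of that lifecycle wiring (removeIngestClusterStateListener does not exist yet, and the constructor shape is illustrative):

import java.io.Closeable;
import java.util.function.Consumer;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.ingest.AbstractProcessor;
import org.elasticsearch.ingest.IngestDocument;
import org.elasticsearch.ingest.IngestService;

public final class HashProcessor extends AbstractProcessor implements Closeable, Consumer<ClusterState> {

    private final IngestService ingestService;

    HashProcessor(String tag, IngestService ingestService) {
        super(tag);
        this.ingestService = ingestService;
        // registered here (or in the factory) so accept(...) is actually invoked
        ingestService.addIngestClusterStateListener(this);
    }

    @Override
    public void accept(ClusterState clusterState) {
        // re-check hash key consistency on every cluster state update
    }

    @Override
    public IngestDocument execute(IngestDocument document) {
        // ... hash the configured fields ...
        return document;
    }

    @Override
    public String getType() {
        return "hash";
    }

    @Override
    public void close() {
        // invoked by innerUpdatePipelines(...) when the pipeline is discarded,
        // so the listener list does not leak references to dead processors
        ingestService.removeIngestClusterStateListener(this);
    }
}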

@danhermann (Contributor Author), Sep 25, 2019

Thanks for the review comments, @martijnvg. What do you think about centralizing the consistency checking of settings in ConsistentSettingsService? Each hash processor would then verify the consistency of its own hash key with the service. That would eliminate the need to track the lifecycle of the processor and would also resolve the question above about failing the pipeline validation check on the elected master node. It might also have a performance benefit if multiple hash processors used the same key, since the consistency check for all of them would happen only once.
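For example, each processor's hot path could reduce to something like this (the per-setting isConsistent method on ConsistentSettingsService is hypothetical):

// hypothetical: the service owns all consistency state, the processor just asks
public IngestDocument execute(IngestDocument document) {
    if (consistentSettingsService.isConsistent(keySetting) == false) {
        throw new IllegalStateException("inconsistent hash key [" + keySetting.getKey() + "]");
    }
    // ... hash the configured fields ...
    return document;
}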

Member

I see, that makes sense. But then ConsistentSettingsService should be a public component that is made accessible in Processor.Parameters?

Contributor Author

Yes, I'll make the changes so ConsistentSettingsService is accessible in Processor.Parameters and request another review pass then.

@ycombinator (Contributor) commented Oct 23, 2019

I recently started working on a fingerprint processor for Beats, unaware of this PR here. Now that I've become aware of this PR (thanks @andrewkroh for the pointer to it and the discussion in elastic/beats#1872) and after off-GitHub discussions with @urso and @danhermann, I think it makes sense for me to not continue working on the Beats processor.

However, I wanted to discuss the use case that motivated the processor work in Beats in the context of the work being done in this PR here.

One of the use cases of fingerprinting is deduplication of documents. Imagine a Beat retrying sending the same document to Elasticsearch. By sending documents through an Ingest pipeline with the hash processor being implemented in this PR, an _id for the document could be derived and the document would not be duplicated in the index.

Per the current implementation in this PR, users are required to define a secret key to be used in the fingerprint MAC calculation. Users must define this key on every Elasticsearch node in its keystore via a setting that looks like xpack.security.ingest.hash.<processorName>.key. The key must be consistent across all Ingest nodes or the node with the inconsistent key will not run the Ingest pipeline. Finally, users must reference this setting name in their Ingest pipeline definition where they use the hash processor, via the key_setting processor option.

For the deduplication use case, I wonder if the secret key is overkill, especially since it requires users to perform additional keystore setup on every Elasticsearch node. Imagine a Filebeat module defining an ingest pipeline with the hash processor in it. It would have to specify a value for the key_setting processor option. When users are about to use this Filebeat module, they would need to make sure the setting specified in the key_setting option is created in the keystore on every Elasticsearch node.

I do agree that, for the nonrepudiation use case, having a secret key makes sense as it makes it harder to tamper with the document contents and fingerprint.

So what do you think about making the secret key optional? That is, if the key_setting processor option isn't specified, the processor wouldn't even try to load up the secret key from the keystore. This would make the Filebeat setup for the deduplication use case much easier.

@jakelandis (Contributor)

@ycombinator I think the hash processor is solving a slightly different use case than a fingerprint processor would. With this hash processor, you have to define which fields you wish to hash, and each field ends up with its own hash. This is primarily for an anonymization use case, so you can anonymize data while still preserving relationships and performing analysis on the anonymized data.

I think what you want is a way to (easily) hash all (or a specified subset) of the fields in a document to get a single hash value that can be used for the _id? If so, I don't think this processor will help you. I added the team-discuss label to talk about this use case, but feel free to log an issue if you feel this is something that should be handled by ES instead of Beats.

@ycombinator (Contributor)

Ah, thanks for clarifying, @jakelandis. I made my way to this PR by following links from #16938, which is about a fingerprint processor in Elasticsearch (as opposed to the hash one being implemented by this PR). Are there still plans to implement a fingerprint processor in Elasticsearch? If not, I will resume working on a fingerprint processor in Beats.

@jakelandis (Contributor)

Are there still plans to implement a fingerprint processor in Elasticsearch?

No. It looks like the hash processor superseded it, but without fulfilling all the use cases as originally suggested on that issue.

@jakelandis (Contributor)

I also just realized that if a user decides to add a field to the configuration of the ingest processor, it will result in a mapping change from text to object, possibly breaking the ability to index. I think we should change it to always emit a map unless there is only one value defined AND a target field is specified.

That, or maybe introduce a new option "replace_field": true or false, make "target_field" and "replace_field" mutually exclusive, and have "target_field" always emit a map while "replace_field" always emits text (see the sketch below).
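For illustration, the two modes might behave like this (a sketch; the option handling and the hash helper are hypothetical):

Map<String, Object> hashes = new HashMap<>();
for (String field : fields) {
    hashes.put(field, hash(document.getFieldValue(field, String.class)));
}
if (replaceField) {
    // single field only: overwrite the value in place, so the mapping stays text
    document.setFieldValue(fields.get(0), hashes.get(fields.get(0)));
} else {
    // always emit a map under target_field, so adding a second field later
    // never flips an existing text mapping to object
    document.setFieldValue(targetField, hashes);
}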

@jakelandis (Contributor)

Removing the team-discuss label - Beats has merged its fingerprint processor, Logstash has one too, and this processor will not be enhanced to (easily) serve that use case. If we need a fingerprint processor, we can introduce one independent of this hash processor.

@danhermann (Contributor Author)

@elasticmachine update branch

@danhermann (Contributor Author)

This commit does not use the ConsistentSettingsService to ensure that hash processors on different nodes have consistent keys. Instead, the hash of the hash processor's key on the master node at the time the ingest pipeline was created is persisted in cluster state as part of the pipeline definition. Each ingest node can compare the hash of its own key to the one persisted in cluster state and raise an error if the two are not the same. This addresses two difficulties with using the ConsistentSettingsService with the hash processor:

  • The biggest difficulty is that ConsistentSettingsService broadcasts the value of the key on the current master node. Because different master-eligible nodes may have different values in their keystores for the hash processor's secret key, it would be possible for the broadcasted value of a hash processor's key to change if a failover occurred to a master node with a different key. This would, of course, result in inconsistent hashing.

  • A second difficulty is that the key for a hash processor could change by changing its value in the keystore and restarting the master node. This would result in inconsistent hashing for documents before and after the key change. Ideally, the lifetime of the hash processor's key would be tied to the lifetime of the processor itself. This approach ensures that a hash processor's key may not change.

With this approach, it may also be possible to make the hash processor's secret key reloadable so that an ingest node with an inconsistent key can be fixed without restarting the node.

This approach for ensuring secret key consistency is also planned for use with encrypted snapshot repositories because the same issues arose there.
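A rough sketch of that comparison (simplified; the real change would likely salt the digest, and all variable names here are illustrative):

// on the master node at pipeline creation time: store a digest of its key
// in the pipeline definition that goes into cluster state
String persistedKeyDigest = MessageDigests.toHexString(MessageDigests.sha256().digest(masterKeyBytes));

// on each ingest node when the processor is instantiated: compare the digest
// of the local key against the digest persisted with the pipeline
String localKeyDigest = MessageDigests.toHexString(MessageDigests.sha256().digest(localKeyBytes));
if (localKeyDigest.equals(persistedKeyDigest) == false) {
    throw new IllegalStateException("hash key on this node does not match the key the pipeline was created with");
}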

@danhermann (Contributor Author)

@jakelandis, what do you think of the approach described in this comment?

@danhermann (Contributor Author)

@elasticmachine update branch

@danhermann (Contributor Author)

Closing as this work is no longer scheduled.

@danhermann closed this Mar 4, 2021