Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download the geoip databases only when needed #92335

Merged

Conversation

masseyke
Copy link
Member

@masseyke masseyke commented Dec 13, 2022

We currently download the geoip databases (used by the geoip processor) whether you need them or not. This PR changes it so that we only download them if you have at least one geoip processor in your cluster, or when you add a new geoip processor.
Closes #90673

@masseyke masseyke added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.7.0 labels Dec 13, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've updated the changelog YAML for you.

@masseyke
Copy link
Member Author

@elasticmachine update branch

@masseyke
Copy link
Member Author

masseyke commented Jan 5, 2023

@elasticmachine update branch

@masseyke masseyke marked this pull request as ready for review January 12, 2023 21:19
@masseyke masseyke requested a review from jbaiera January 12, 2023 21:19
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jan 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke
Copy link
Member Author

@elasticmachine update branch

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, I think we've identified a couple of potential leaks and un/likely concurrency problems pertaining to the cluster service usage. Posting the comment review for now to get the ideas in writing. Will continue to review.

@@ -125,6 +125,8 @@ protected GeoIpDownloader createTask(
parentTaskId,
headers
);
clusterService.addListener(downloader);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should update the GeoIpDownloader to remove itself from the cluster listeners when it is cancelled.

Comment on lines 140 to 141
clusterService.getClusterSettings().addSettingsUpdateConsumer(EAGER_DOWNLOAD_SETTING, this::setEagerDownload);
clusterService.getClusterSettings().addSettingsUpdateConsumer(POLL_INTERVAL_SETTING, this::setPollInterval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this outside of the PR messaging, but we both think this could lead to some messy leaks.

Every time this task is allocated to a node, these settings listeners are being installed. There is no way to remove them and it doesn't look like they correctly check if the task is cancelled before rescheduling the background logic when they change. We should definitely protect against multiple downloaders running when updating either of these settings.

We could add a check to the update methods to make sure the task is active, but this still has the possibility of leaving a number of leaked task instances in the cluster service on any nodes that have previously cancelled them. We agreed that it would be better if we added a settings container of some sort that keeps an instance to the active downloader task on a node and calls it during updates if it exists. The settings container is light weight and can stick around without too much overhead since there will only ever be one of them, but the downloaders can come and go without leaking multiple update consumers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The summary of another offline conversation:
It's just a bad idea to have these relatively-short-lived objects like GeoIpDownloader be settings or cluster state listeners because:

  1. You have to be careful to un-listen when you're done so that you don't wind up with leaks
  2. You have increased risk of race conditions since the condition you are listening for could happen in between the time the object is created and the time it is registered as a listener (not to mention when its predecessor is un-registered).

So I've moved all listeners for settings and cluster state changes into the executor. The task gets the current values for these things from the executor when it needs them, and the executor handles telling the task when it needs to reschedule because of a change in a dynamic property.

@masseyke masseyke requested a review from jbaiera January 26, 2023 16:01
@masseyke
Copy link
Member Author

@elasticmachine update branch

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is looking really great, thanks for iterating on the listener registration woes. We sync'd up offline, but there's definitely an issue with the processor detection logic. I think once that is squared away we're pretty much there!

@@ -85,8 +85,9 @@ public class IngestGeoIpPlugin extends Plugin implements IngestPlugin, SystemInd
public List<Setting<?>> getSettings() {
return Arrays.asList(
CACHE_SIZE,
GeoIpDownloaderTaskExecutor.EAGER_DOWNLOAD_SETTING,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can we tidy these up so that they are all ordered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Done.

}

if (event.metadataChanged() && event.changedCustomMetadataSet().contains(IngestMetadata.TYPE)) {
boolean newAtLeastOneGeoipProcessor = hasAtLeastOneGeoipProcessor(event.state());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it feels a little weird that we potentially calculate this a second time when bootstrapping. Probably not a performance bug but it does look a little funny. Especially since we calculate it with a different state call here than earlier in the same invocation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will usually happen twice will it? It happens on bootstrap, and then happens below only if the cluster state change includes an ingest metadata change (so only if someone modifies a pipeline definition), right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I hadn't thought about that. Looks fine to me then.

} else {
stopTask(() -> clusterService.addListener(this));
if (taskIsBootstrapped.getAndSet(true) == false) {
this.atLeastOneGeoipProcessor = hasAtLeastOneGeoipProcessor(clusterService.state());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that the comment about running this line twice on bootstrap isn't a big problem - should this be using event.state() instead of clusterService.state() for consistency? I don't think anything breaks as is, but my gut leans toward consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah might as well be consistent. Done.

}

if (event.metadataChanged() && event.changedCustomMetadataSet().contains(IngestMetadata.TYPE)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Map<String, Object> pipelineMap = pipelineDefinition.getConfigAsMap();
List<Map<String, Object>> processors = (List<Map<String, Object>>) pipelineMap.get(Pipeline.PROCESSORS_KEY);
if (processors != null) {
return processors.stream().anyMatch(processor -> processor.containsKey(GeoIpProcessor.TYPE));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized that this might technically be too simple of a check. Annoyingly, processors can supply on_failure fields that contain processors, which can supply on_failure fields of their own, and I don't think we have any restriction on depth nor processor type allowed. The foreach processor also has a nested processor. Not sure if there are any others off the top of my head.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow good catch. I just changed this check to be recursive for those two things. I don't know of any others either.

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are looking pretty good though! Just some additional corrections for ya!

@masseyke masseyke requested a review from jbaiera January 27, 2023 22:56
Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thanks again for iterating on this! The changes look really great!

@masseyke masseyke merged commit 13b7190 into elastic:main Jan 30, 2023
@masseyke masseyke deleted the fix/download-geoip-databases-only-when-needed branch January 30, 2023 19:07
mark-vieira pushed a commit to mark-vieira/elasticsearch that referenced this pull request Jan 31, 2023
This commit changes the geoip downloader so that we only download the geoip databases if you
have at least one geoip processor in your cluster, or when you add a new geoip processor (or if
`ingest.geoip.downloader.eager.download` is explicitly set to true).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disable/Re-visit MaxMind GeoIP automatic downloading
4 participants