
Add DatabaseRegistry for locally managing databases managed by GeoIpDownloader #69540

Merged
merged 17 commits into elastic:master, Mar 4, 2021

Conversation

martijnvg
Member

@martijnvg martijnvg commented Feb 24, 2021

This component is responsible for making the databases maintained by GeoIpDownloader
available for ingest processors.

Also provides a lookup mechanism for geoip processors, with a fallback to LocalDatabases.
All databases are downloaded into a geoip tmp directory, which is created at node startup.

The following high level steps are executed after each cluster state update (a rough sketch follows the list):

  1. Check which databases are available in GeoIpTaskState,
    which is part of the geoip downloader persistent task.
  2. For each database, check whether it has changed (by comparing
    the local and remote md5 hash) or is missing locally.
  3. For each database identified in step 2, start downloading the database
    chunks. Each chunk is appended to a tmp file (inside the geoip tmp dir) and,
    after all chunks have been downloaded, the database is uncompressed and
    renamed to its final filename. After this the database is loaded and,
    if there is an old instance of this database, that instance is closed.
  4. Clean up locally loaded databases that are no longer mentioned in GeoIpTaskState.
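A rough sketch of how these steps could hang together on a cluster state update (illustrative only; names such as checkDatabases, retrieveAndUpdateDatabase and removeStaleEntries are assumptions, not necessarily the actual implementation):

// Illustrative sketch only; checkDatabases, retrieveAndUpdateDatabase, removeStaleEntries
// and the DatabaseReference#md5 field are assumed names, not the actual implementation.
void checkDatabases(ClusterState state) {
    PersistentTasksCustomMetadata.PersistentTask<?> task =
        PersistentTasksCustomMetadata.getTaskWithId(state, GeoIpDownloader.GEOIP_DOWNLOADER);
    GeoIpTaskState taskState =
        task == null || task.getState() == null ? GeoIpTaskState.EMPTY : (GeoIpTaskState) task.getState();

    // Steps 1-3: (re)download and load databases whose remote md5 differs from the local copy,
    // or which are missing locally.
    taskState.getDatabases().forEach((name, metadata) -> {
        DatabaseReference reference = databases.get(name);
        String localMd5 = reference != null ? reference.md5 : null;
        if (metadata.getMd5().equals(localMd5) == false) {
            retrieveAndUpdateDatabase(name, metadata); // download chunks, decompress, rename, swap reference
        }
    });

    // Step 4: close and remove locally loaded databases that are no longer part of the task state.
    List<String> staleEntries = new ArrayList<>(databases.keySet());
    staleEntries.removeAll(taskState.getDatabases().keySet());
    removeStaleEntries(staleEntries);
}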

Relates to #68920

@martijnvg martijnvg added the :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP) label Feb 24, 2021
@martijnvg martijnvg mentioned this pull request Feb 24, 2021
15 tasks
@martijnvg martijnvg marked this pull request as ready for review March 1, 2021 14:25
@elasticmachine elasticmachine added the Team:Data Management (Meta label for data/management team) label Mar 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

Contributor

@probakowski probakowski left a comment

Overall looks good; there are several simplifications we can make.


private final ConcurrentMap<String, DatabaseReference> databases = new ConcurrentHashMap<>();

DatabaseRegistry(Environment environment,
Contributor

nit: I'm not a fan of the argument-per-line style, especially if they all fit nicely on a single line.

Member Author

Right, I think for at least some of those statements I assumed that they were longer than 140 chars.

return;
}

PersistentTasksCustomMetadata persistentTasks = state.metadata().custom(PersistentTasksCustomMetadata.TYPE);
Contributor

we can simplify here with PersistentTasksCustomMetadata.getTaskWithId:

PersistentTask<?> task = PersistentTasksCustomMetadata.getTaskWithId(state, GeoIpDownloader.GEOIP_DOWNLOADER);
GeoIpTaskState taskState = task == null || task.getState() == null ? GeoIpTaskState.EMPTY : (GeoIpTaskState) task.getState();

for (var entry : taskState.getDatabases().entrySet()) {
String name = entry.getKey();
GeoIpTaskState.Metadata metadata = entry.getValue();
DatabaseReference reference = databases.get(entry.getKey());
Contributor

Suggested change:
- DatabaseReference reference = databases.get(entry.getKey());
+ DatabaseReference reference = databases.get(name);

String name = entry.getKey();
GeoIpTaskState.Metadata metadata = entry.getValue();
DatabaseReference reference = databases.get(entry.getKey());
String remoteMd5 = entry.getValue().getMd5();
Contributor

Suggested change:
- String remoteMd5 = entry.getValue().getMd5();
+ String remoteMd5 = metadata.getMd5();

Comment on lines 179 to 181
for (var entry : taskState.getDatabases().entrySet()) {
String name = entry.getKey();
GeoIpTaskState.Metadata metadata = entry.getValue();
Contributor

we can use Map.forEach here:

Suggested change:
- for (var entry : taskState.getDatabases().entrySet()) {
-     String name = entry.getKey();
-     GeoIpTaskState.Metadata metadata = entry.getValue();
+ taskState.getDatabases().forEach((name, metadata) -> {


for (DatabaseReference reference : references) {
reference.close();
Files.delete(reference.databaseFile);
Contributor

Will it mess something up if we delete the file during a concurrent lookup? reference.close() may not actually close anything if there are any ongoing lookups.

Member Author

I think *nix-like systems handle this gracefully (a file that is deleted while in use is only actually removed after its file descriptor has been closed). So I think the current approach will not mess anything up, but on Windows I'm not so sure. Anyway, perhaps it makes more sense to move the actual deletion of the db file elsewhere (maybe to DatabaseReaderLazyLoader#doClose(), or at least initiate it from there).
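A minimal sketch of what moving the deletion into the loader could look like, assuming a usage counter guards the close (field and method names here are hypothetical, not the actual DatabaseReaderLazyLoader implementation):

// Hypothetical sketch of deferring the file deletion until the last in-flight lookup has
// released the loader; names (currentUsages, shutdown, databaseReader, databaseFile) are
// assumptions, not the actual DatabaseReaderLazyLoader code.
private final AtomicInteger currentUsages = new AtomicInteger(1);
private volatile boolean deleteDatabaseFileOnClose = false;

void shutdown(boolean deleteDatabaseFile) throws IOException {
    this.deleteDatabaseFileOnClose = deleteDatabaseFile;
    if (currentUsages.decrementAndGet() == 0) {
        doClose(); // no lookups in flight, safe to close and delete
    }
}

void doClose() throws IOException {
    databaseReader.close();
    if (deleteDatabaseFileOnClose) {
        Files.delete(databaseFile); // the reader is closed first, so deleting works on Windows too
    }
}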


void removeStaleEntries(Collection<String> staleEntries) {
try {
List<DatabaseReference> references = new ArrayList<>();
Contributor

do we need this list? can we just merge these two for loops?

Member Author

Yeah, I think merging is possible.

Member Author

I think this is a direct copy from the PoC, which did something silly.

Comment on lines 296 to 298
if (lastSortValue != null) {
lastSortValue = null;
}
Contributor

Suggested change:
- if (lastSortValue != null) {
-     lastSortValue = null;
- }
+ lastSortValue = null;

(we always end up with null here anyway)

Member Author

I think that is a leftover.

Member Author

(the if check)

builder.endObject();
}
builder.endObject();
// TODO: change geoip fixture to not return city db content for country and asn databases:
Contributor

I'd rather not have these big TODOs here; I can fix the geoip fixture if you want before we merge this PR.
I wonder how stable/fragile this will be considering the third-party test (where we hit the real service and data may change eventually).

Member Author

> I can fix the geoip fixture if you want before we merge this PR.

That would be great; then the other databases can also be tested here instead of just the city db.

> I wonder how stable/fragile this will be considering the third-party test (where we hit the real service and data may change eventually)

Not sure how to enforce that the test fixture stays in line with the actual API that infra is building.
The best thing I can come up with is that we share a spec, like we do with our REST APIs. If any of us
makes a change, then we at least know about it.

Comment on lines 84 to 89
DatabaseRegistry registry = new DatabaseRegistry(
parameters.env,
parameters.client,
geoIpCache,
parameters.genericExecutor
);
Contributor

nit:

Suggested change:
- DatabaseRegistry registry = new DatabaseRegistry(
-     parameters.env,
-     parameters.client,
-     geoIpCache,
-     parameters.genericExecutor
- );
+ DatabaseRegistry registry = new DatabaseRegistry(parameters.env, parameters.client, geoIpCache, parameters.genericExecutor);

@probakowski
Contributor

Also, the test failure in elasticsearch-ci/1 looks like a real problem.

@martijnvg martijnvg requested a review from probakowski March 3, 2021 16:00
Contributor

@probakowski probakowski left a comment

LGTM, thanks @martijnvg for adding this!
I left 2 super minor nits (optional)

try {
do {
SearchRequest searchRequest =
createSearchRequest(databaseName, metadata.getFirstChunk(), metadata.getLastChunk(), lastSortValue);
Contributor

We can extract metadata.getFirstChunk() and metadata.getLastChunk() to local variables (even outside of the loop) and move this to a single line.
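A sketch of what that extraction might look like (illustrative only; the locals could live outside the do/while loop):

// Illustrative only: extract the chunk bounds once and build the request on one line.
int firstChunk = metadata.getFirstChunk();
int lastChunk = metadata.getLastChunk();
SearchRequest searchRequest = createSearchRequest(databaseName, firstChunk, lastChunk, lastSortValue);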

Comment on lines 266 to 267
List<String> files = list.map(Path::toString)
.collect(Collectors.toList());
Contributor

Suggested change:
- List<String> files = list.map(Path::toString)
-     .collect(Collectors.toList());
+ List<String> files = list.map(Path::toString).collect(Collectors.toList());

// so it is ok if this happens in a blocking manner on a thread from the generic thread pool.
// This makes the code easier to understand and maintain.
SearchResponse searchResponse = client.search(searchRequest).actionGet();
if (searchResponse.getHits().getHits().length == 0) {
Contributor

@probakowski probakowski Mar 4, 2021

Another idea: we know exactly which chunks we need by id, so we can do:

for (int i = firstChunk; i <= lastChunk; i++) {
    Map<String, Object> source = client.prepareGet(DATABASES_INDEX, name + "_" + i).get().getSource();
    ...
}

It should be simpler than a search with a last sort value.

Member Author

I like that idea. This should make the code here much simpler.
A search by id would also be fine.
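For illustration, the get-by-id variant could look roughly like this; DATABASES_INDEX and the chunk id scheme follow the suggestion above, and the base64-encoded "data" field is an assumption rather than the actual document format:

// Rough sketch of fetching chunks by id and appending them to the tmp file. The index
// name, the "data" field and the base64 encoding are assumptions, not necessarily what
// the final implementation does.
try (OutputStream out = Files.newOutputStream(tmpFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
    for (int i = metadata.getFirstChunk(); i <= metadata.getLastChunk(); i++) {
        Map<String, Object> source = client.prepareGet(DATABASES_INDEX, name + "_" + i).get().getSource();
        byte[] chunk = Base64.getDecoder().decode((String) source.get("data"));
        out.write(chunk);
    }
}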

Member Author

I pushed: acf2a4b

@martijnvg martijnvg merged commit 6c35c25 into elastic:master Mar 4, 2021
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 4, 2021
…GeoIpDownloader

Backport of elastic#69540 to 7.x branch.

martijnvg added a commit that referenced this pull request Mar 10, 2021
…ownloader (#69971)

Backport of #69540 to 7.x branch.


Other cherry-picked commits:

* Fix ReloadingDatabasesWhilePerformingGeoLookupsIT (#70163)

Wait for ingest threads to stop using the DatabaseReaderLazyLoader, so that during the next run the db update thread doesn't try to remove the db again (because the file hasn't yet been deleted).

Also delete the tmp dirs this test creates at the end of the test, so that when repeating this test many times it doesn't accumulate many directories with database files.

Closes #69980

* Fix clean up of old entries in DatabaseRegistry.initialize (#70135)

This change switches the clean up in DatabaseRegistry.initialize from using Files.walk and stream operations to Files.walkFileTree, which can be made more robust in case of errors (a walkFileTree sketch follows this list).

* Fix DatabaseRegistryTests (#70180)

This test predefined the expected md5 hashes as constants, which matched the hashes produced with Java 15.
However, Java 16 produces different md5 hashes, so the expected md5 hashes no longer matched
the actual ones, which caused tests in this test suite to fail (when running with Java 16 only).

The tests now generate the expected md5 hash during the test instead of using predefined constants (an md5 sketch follows this list).

Closes #69986

* Fix GeoIpDownloaderIT#testUseGeoIpProcessorWithDownloadedDBs(...) test (#70215)

The test failure looks legit, because there is a possibility that the same database
was downloaded twice. See the added comment in the DatabaseRegistry class.

Relates to #69972

* Muted GeoIpDownloaderIT#testUseGeoIpProcessorWithDownloadedDBs(...) test,
see #69972
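For context on the Files.walkFileTree change mentioned above: a visitor can log a failure and continue with the remaining files, unlike a Files.walk stream pipeline that aborts on the first IOException. A rough sketch (geoipTmpDirectory and logger are assumed names, not the exact code from #70135):

// Illustrative sketch only; the visitor keeps going when a single file fails to delete.
Files.walkFileTree(geoipTmpDirectory, new SimpleFileVisitor<Path>() {
    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        try {
            Files.delete(file);
        } catch (IOException e) {
            logger.warn("failed to delete stale file [" + file + "]", e);
        }
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException e) {
        logger.warn("failed to visit file [" + file + "]", e);
        return FileVisitResult.CONTINUE;
    }
});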
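And for the DatabaseRegistryTests fix above, computing the expected md5 at test time could be as simple as the following sketch, using the standard JDK MessageDigest API (variable names are illustrative):

// Sketch: compute the expected md5 of the database file in the test itself instead of
// hard-coding it as a constant. Variable names are illustrative.
byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(databaseFile));
StringBuilder expectedMd5 = new StringBuilder();
for (byte b : digest) {
    expectedMd5.append(String.format(Locale.ROOT, "%02x", b));
}
assertEquals(expectedMd5.toString(), actualMd5);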

Co-authored-by: Przemko Robakowski <[email protected]>
Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
>non-issue
Team:Data Management (Meta label for data/management team)
v7.13.0
v8.0.0-alpha1