
Support for ES6 using indexes created in ES5 #1253

Merged: @peternied merged 23 commits into opensearch-project:main from shadow-jars on Jan 30, 2025

Conversation

@peternied peternied commented Jan 28, 2025 (Member)

Description

Support loading both Lucene 7 & 9 inside of RFS using shadow jar
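For background on the approach: the Gradle Shadow plugin can rewrite ("relocate") package names at build time, which is why a diff later in this PR imports shadow.lucene9.org.apache.lucene classes. Below is a minimal sketch of why relocation lets both versions coexist in one JVM; the shadow.lucene7 prefix is an assumption for illustration, since only the lucene9 prefix is visible in this PR.

public class DualLuceneVersionsExample {
    public static void main(String[] args) {
        // After relocation the two Lucene versions no longer share package names,
        // so both sets of classes can be loaded by the same classloader without
        // conflicting on the classpath.
        System.out.println(shadow.lucene7.org.apache.lucene.util.Version.LATEST);
        System.out.println(shadow.lucene9.org.apache.lucene.util.Version.LATEST);
    }
}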

Issues Resolved

Check List

  • New functionality includes testing
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peter Nied <[email protected]>
@peternied peternied marked this pull request as ready for review January 28, 2025 22:07
@chelma chelma (Member) left a comment

Thanks for the work on this! Have a few questions.

@chelma chelma (Member) commented Jan 28, 2025

@peternied Oh, also - do we have a test to confirm this enables us to read Lucene 6 segments? I didn't see one, and don't want to repeat the last time we had a PR claiming to solve this issue. 😛

@peternied peternied (Member Author) commented

@chelma The end-to-end tests cover this scenario by running RFS against an ES 5 cluster. I confirmed it by disabling the Lucene 7 reader, which reproduced the same error that was originally reported.

codecov bot commented Jan 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.03%. Comparing base (d527b87) to head (5c22580).
Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1253      +/-   ##
============================================
- Coverage     80.16%   80.03%   -0.13%     
- Complexity     3110     3127      +17     
============================================
  Files           430      432       +2     
  Lines         15903    16005     +102     
  Branches       1082     1089       +7     
============================================
+ Hits          12749    12810      +61     
- Misses         2493     2528      +35     
- Partials        661      667       +6     
Flag      | Coverage Δ
unittests | 80.03% <ø> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.


@chelma chelma (Member) left a comment

There are a couple of new SonarQube violations in the new/modified code; resolve those and I'll approve!

Signed-off-by: Peter Nied <[email protected]>
@chelma chelma (Member) left a comment

Ship it! Thanks Peter.

Checking allocations on the test: previously it used between 400-480 MB of memory while running, and adding the additional jars for Lucene 7/9 put us very close to constantly running out of memory. I've dialed the document size down to 10 MB. Still chunky, but with all the testing and duplicate representations the test now uses ~180 MB of memory, so we should be in the clear.

Signed-off-by: Peter Nied <[email protected]>
Signed-off-by: Peter Nied <[email protected]>
@peternied peternied merged commit 5b924e6 into opensearch-project:main Jan 30, 2025
23 checks passed
@gregschohn gregschohn (Collaborator) left a comment

Thanks for doing this. I recognize that this is merged, but I think a follow-up could address some of these minor issues.

* happens is that the document is marked as "deleted" in the Lucene Index, but it is still present in the Lucene segment
* on disk. The next time a merge occurs, that segment will be deleted, and the deleted documents in it are thereby
* removed from the Lucene Index. A similar thing happens when a document is updated; the old document is marked as
* "deleted" in the Lucene segment and the new version of the document is added in a new Lucene segment. Until a merge
Inline review comment (Collaborator):

Can you please clarify the paradox between a segment being immutable and the segment having documents marked as deleted? I haven't seen great explanations of the exact mechanics of how the tombstone bitmaps are updated.
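(For context: a minimal sketch, assuming Lucene 9 APIs, of how deletion tombstones coexist with immutable segments. The segment's data files are never rewritten; deletes are recorded in a separate per-segment "live docs" structure, written as a new generation file alongside the segment at each commit, which readers consult through the Bits interface.)

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

import java.nio.file.Path;

public class LiveDocsExample {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Path.of(args[0]));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            for (LeafReaderContext leaf : reader.leaves()) {
                Bits liveDocs = leaf.reader().getLiveDocs(); // null => segment has no deletes
                int tombstoned = 0;
                for (int docId = 0; docId < leaf.reader().maxDoc(); docId++) {
                    if (liveDocs != null && !liveDocs.get(docId)) {
                        tombstoned++; // still physically in the segment files; reclaimed on merge
                    }
                }
                System.out.printf("%s: %d max docs, %d tombstoned%n",
                        leaf.reader(), leaf.reader().maxDoc(), tombstoned);
            }
        }
    }
}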

* document might still exist in the Lucene segments on disk, all of which have the same _id (from the ES/OS perspective).
*
* Additionally, Elasticsearch 7 introduced a feature called "soft deletes" which allows you to mark a document as
* "deleted" in the Lucene Index without actually removing it from the Lucene Index. From what I can gather, soft deletes
Inline review comment (Collaborator):

Instead of "From what I can gather", you can reference the link below, as you do, and just say that it's used for certain shard synchronization or management issues, then cite the link. Further below, it's worth calling out that soft-deleted docs WILL eventually become fully-deleted Lucene docs; that's why we don't want to load them onto the target.
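(For context: a minimal sketch of filtering soft-deleted documents so they are not reindexed against the target, assuming Lucene 9 APIs and the "__soft_deletes" field name used by Elasticsearch 7+.)

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SoftDeletesDirectoryReaderWrapper;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Path;

public class SoftDeletesExample {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Path.of(args[0]))) {
            DirectoryReader raw = DirectoryReader.open(dir);
            // Wrapping folds soft-deleted docs into the liveDocs view, so they are
            // skipped exactly like hard-deleted (tombstoned) documents.
            try (DirectoryReader reader =
                     new SoftDeletesDirectoryReaderWrapper(raw, "__soft_deletes")) {
                System.out.println("live docs after filtering soft deletes: " + reader.numDocs());
            } // closing the wrapper also closes the wrapped reader
        }
    }
}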

* 1. We make sure we use the latest Lucene commit point on the Lucene Index. A commit is a Lucene abstraction that
* comprises a consistent, point-in-time view of the Segments in the Lucene Index. By default, a DirectoryReader
* will use the latest commit point.
* 2. We use a concept called "liveDocs" to determine if a document is "live" or not. The liveDocs are a bitset that
Inline review comment (Collaborator):

Is this a bitset or compressed bitset?

* cluster.
*
* In order to retrieve only those documents that would be considered "live" in ES/OS, we use a few tricks:
* 1. We make sure we use the latest Lucene commit point on the Lucene Index. A commit is a Lucene abstraction that
Inline review comment (Collaborator):

If there were two snapshots in a repo - a full one that successfully finished and a partial one that didn't - would this be true? Could we get the latest commit point for some shards in an index from the second snapshot, and shards from the first snapshot, further back in time, for other shards?
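(For context: a minimal sketch, assuming Lucene 9 APIs, of enumerating the commit points in an index directory and pinning a reader to an explicit commit, rather than relying on DirectoryReader.open(Directory), which uses the latest commit by default.)

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Path;
import java.util.List;

public class CommitPointExample {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Path.of(args[0]))) {
            // Commits are listed oldest-first; each one is a consistent,
            // point-in-time view of the index's segments.
            List<IndexCommit> commits = DirectoryReader.listCommits(dir);
            for (IndexCommit commit : commits) {
                System.out.printf("generation=%d segmentsFile=%s%n",
                        commit.getGeneration(), commit.getSegmentsFileName());
            }
            // Open a reader against an explicit commit (here, the newest one).
            try (DirectoryReader reader = DirectoryReader.open(commits.get(commits.size() - 1))) {
                System.out.println("docs at this commit: " + reader.numDocs());
            }
        }
    }
}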

* Just like deleted documents and old versions of updated documents, we don't want to reindex them against the target
* cluster.
*
* In order to retrieve only those documents that would be considered "live" in ES/OS, we use a few tricks:
Inline review comment (Collaborator):

s/tricks/strategies/ - or something else, like "algorithms". "Tricks" implies that there's something non-obvious, or maybe something that's only true in certain cases.

snapshotReader.getSoftDeletesFieldData()
);
}
public class LuceneDocumentsReader7 implements LuceneDocumentsReader {
Inline review comment (Collaborator):

For later: if we're able to isolate the Lucene 7 vs. Lucene 9 code and we only need one version in the runtime at once, it should be a chip shot to load the implementations through the ServiceLoader interface (and not require shadowing at all). See the sketch below.

It would also be nice not to require as much copied code (e.g. just building/linking against different versions, or injecting version-specific overrides).
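(A hypothetical sketch of that ServiceLoader approach; the provider interface and method names are illustrative, not the repo's actual API.)

import java.util.ServiceLoader;

// Hypothetical SPI: each Lucene-version-specific module would register an
// implementation under META-INF/services using this interface's class name.
interface VersionedReaderProvider {
    boolean supportsLuceneMajorVersion(int major);
}

public final class ReaderLocator {
    // Scan the registered providers for one that handles the source cluster's
    // Lucene version; only that module needs to be on the runtime classpath.
    public static VersionedReaderProvider forLuceneMajorVersion(int major) {
        for (VersionedReaderProvider provider : ServiceLoader.load(VersionedReaderProvider.class)) {
            if (provider.supportsLuceneMajorVersion(major)) {
                return provider;
            }
        }
        throw new IllegalArgumentException("No reader provider for Lucene " + major);
    }
}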

@@ -15,7 +15,7 @@
 import com.fasterxml.jackson.databind.JsonNode;
 import com.fasterxml.jackson.databind.ObjectMapper;
 import lombok.Getter;
-import org.apache.lucene.util.BytesRef;
+import shadow.lucene9.org.apache.lucene.util.BytesRef;
Inline review comment (Collaborator):

This seems like it could be a liability. If there were a difference between the serialized ES 5 bytes and how Lucene 9 reads them, we could have a problem, right? I'm not too worried, but it does seem like it could become a bigger issue as we go further back in ES/Lucene versions.

import lombok.extern.slf4j.Slf4j;
import reactor.core.publisher.Flux;

public interface LuceneDocumentsReader {
Inline review comment (Collaborator):

Rename to ILuceneDocumentsReader to show that it's an interface.

@peternied peternied deleted the shadow-jars branch January 30, 2025 19:37
@peternied peternied restored the shadow-jars branch February 11, 2025 20:25
Development

Successfully merging this pull request may close these issues.

[BUG] RFS Fails to read ES 5 created indices on ES 6 Snapshot
4 participants