Introduce primary context #25122
Conversation
@bleskes I'm opening this for discussion on coming up with an effective strategy for testing.
The target of a primary relocation is not aware of the state of the replication group. In particular, it is not tracking in-sync and initializing shards and their checkpoints. This means that after the target shard is started, its knowledge of the replication group could differ from that of the relocation source. In particular, this differing view can lead to it computing a global checkpoint that moves backwards after it becomes aware of the state of the entire replication group. This commit addresses this issue by transferring a primary context during relocation handoff.
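For illustration, a minimal sketch of what such a primary context might carry, using the field names that appear in the diff excerpts below; the serialization is elided and the class shape is not the final API:

import com.carrotsearch.hppc.ObjectLongMap;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

import java.io.IOException;

/**
 * Illustrative sketch only: the view of the replication group that the relocation
 * source hands off to the relocation target.
 */
public class PrimaryContext implements Writeable {

    // cluster state version the source had applied when the context was captured
    private final long appliedClusterStateVersion;

    // allocation id -> local checkpoint of the in-sync shard copies
    private final ObjectLongMap<String> inSyncLocalCheckpoints;

    // allocation id -> local checkpoint of the initializing (tracked) shard copies
    private final ObjectLongMap<String> trackingLocalCheckpoints;

    public PrimaryContext(final long appliedClusterStateVersion,
                          final ObjectLongMap<String> inSyncLocalCheckpoints,
                          final ObjectLongMap<String> trackingLocalCheckpoints) {
        this.appliedClusterStateVersion = appliedClusterStateVersion;
        this.inSyncLocalCheckpoints = inSyncLocalCheckpoints;
        this.trackingLocalCheckpoints = trackingLocalCheckpoints;
    }

    @Override
    public void writeTo(final StreamOutput out) throws IOException {
        out.writeVLong(appliedClusterStateVersion);
        // writing of the two checkpoint maps is elided in this sketch
    }
}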
I did an initial pass and left some comments, until we discuss the testing aspect.
 *
 * @param seqNoPrimaryContext the sequence number context
 */
synchronized void updateAllocationIdsFromPrimaryContext(final SeqNoPrimaryContext seqNoPrimaryContext) {
I think this is tricky. The master is the one that drives the set of shards that are allocated. If the copy was removed by the master, we shouldn't re-add it because of a primary handoff that happens concurrently. I think we should make the primary context a recovery-level thing that uses the existing shard API (updateLocalCheckpoint/markAllocationIdAsInSync).
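For illustration, a rough sketch of that recovery-level application, reusing only calls that appear elsewhere in this diff; the method name and the context accessors are assumptions:

// rough sketch, not the actual change: apply the handed-off context on the target
// through the existing shard-level API rather than by re-adding allocation ids
void applyPrimaryContext(final SeqNoPrimaryContext context) throws InterruptedException {
    // initializing copies: only feed their checkpoints into the existing tracking machinery
    for (final ObjectLongCursor<String> cursor : context.trackingLocalCheckpoints()) {
        updateLocalCheckpoint(cursor.key, cursor.value);
    }
    // in-sync copies: promote them through the existing markAllocationIdAsInSync path
    for (final ObjectLongCursor<String> cursor : context.inSyncLocalCheckpoints()) {
        markAllocationIdAsInSync(cursor.key, cursor.value);
    }
}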
import java.io.IOException;

public class SeqNoPrimaryContext implements Writeable {
java docs pls?
 */
public PrimaryContext primaryContext() {
    verifyPrimary();
    assert shardRouting.relocating();
add the shardRouting as a message?
public PrimaryContext primaryContext() {
    verifyPrimary();
    assert shardRouting.relocating();
    assert !shardRouting.isRelocationTarget();
please add a message
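For instance (message text is illustrative):

assert shardRouting.relocating() : "expected a relocating shard, but was: " + shardRouting;
assert !shardRouting.isRelocationTarget() : "expected the relocation source, but was the target: " + shardRouting;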
Back to you @bleskes.
Thanks @jasontedor. I left a bunch of comments.
@@ -23,10 +23,12 @@
import com.carrotsearch.hppc.ObjectLongMap;
import com.carrotsearch.hppc.cursors.ObjectLongCursor;
import org.elasticsearch.common.SuppressForbidden;
import org.elasticsearch.common.collect.HppcMaps;
This seems unused
Yes.
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.shard.AbstractIndexShardComponent;
import org.elasticsearch.index.shard.ShardId;

import java.util.Arrays;
As does this
Yes.
 */
for (final ObjectLongCursor<String> cursor : seqNoPrimaryContext.inSyncLocalCheckpoints()) {
    updateLocalCheckpoint(cursor.key, cursor.value);
    assert cursor.value >= globalCheckpoint
can we also assert that the key is not found in the in-sync map? in fact, the in-sync map should be empty, no?
Yes.
Thinking about this more, why would the in-sync map be empty? The target could have applied a cluster state containing the source as an active allocation ID? I don't think we can make any assertion here at all?
You are right. What I was thinking is that any "promotion" to in-sync should go through the primary, but that is currently not the case. I do think that this is a better & simpler model that we can switch to as a follow-up (the current model is based on the thinking we had back when we built around a background sync and didn't have the locked-down clean handoffs we have now). For now, I think we can assert that all the values in the in-sync map are unknown. Not sure how much it's worth though. Up to you.
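For illustration, such an assertion might look like this; using UNASSIGNED_SEQ_NO as the sentinel for "no checkpoint reported yet" is an assumption:

// sketch of the suggested assertion: every copy already present in the in-sync map
// should not have reported a local checkpoint yet
for (final ObjectLongCursor<String> cursor : inSyncLocalCheckpoints) {
    assert cursor.value == SequenceNumbersService.UNASSIGNED_SEQ_NO
        : "expected the in-sync checkpoint for [" + cursor.key + "] to be unknown, but was [" + cursor.value + "]";
}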
    assert cursor.value >= globalCheckpoint
        : "local checkpoint [" + cursor.value + "] violates being at least the global checkpoint [" + globalCheckpoint + "]";
    try {
        markAllocationIdAsInSync(cursor.key, cursor.value);
This works for started shards because we have a check in markAllocationIdAsInSync that ignores the call if it can't find the aId in the tracking map.
Given that the in-sync map is empty, I wonder if this would be simpler to work with if we changed seqNoPrimaryContext to be a map of aId -> local checkpoint plus a set of in-sync aIds. We would then loop over the map and call updateLocalCheckpoint for everything in it. Then we can do the promotion manually, instead of calling markAllocationIdAsInSync and suffering everything it brings with it. Concretely we'd just do:

if (trackingLocalCheckpoints.containsKey(allocationId)) {
    long current = trackingLocalCheckpoints.remove(allocationId);
    inSyncLocalCheckpoints.put(allocationId, current);
}
I would prefer the sequence number related state of the primary context look like the state of the global checkpoint tracker. It's easier to think about.
> I would prefer the sequence number related state of the primary context look like the state of the global checkpoint tracker

Ok. I would still prefer not using markAllocationIdAsInSync as is and all the complexity it brings.
Sure.
 * @return the sequence number primary context
 */
public SeqNoPrimaryContext seqNoPrimaryContext() {
    synchronized (globalCheckpointTracker) {
I think we should push this down to the tracker. Then construction and application are the same. Also the external lock is ugly :)
Yeah, I figured you'd say that.
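A minimal sketch of pushing the construction down into the tracker, so it sits behind the same monitor used when applying a context; the method body is illustrative:

// sketch: the tracker builds the context itself, so snapshotting and applying it
// are guarded by the same monitor and no external lock is needed
synchronized PrimaryContext primaryContext() {
    return new PrimaryContext(
        appliedClusterStateVersion,
        new ObjectLongHashMap<>(inSyncLocalCheckpoints),
        new ObjectLongHashMap<>(trackingLocalCheckpoints));
}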
try {
    getEngine().seqNoService().markAllocationIdAsInSync(allocationId, localCheckpoint);
    /*
     * We could have blocked so long waiting for the replica to catch up that we fell idle and there will not be a
can you explain why this is needed? is it because we don't have the background sync? what happens?
This comment was only moved; I'm not addressing it in this PR.
/**
 * Represents the primary context which encapsulates the view the primary shard has of its replication group.
 */
public class PrimaryContext implements Writeable {
Shall we remove this for now? I think it's a leftover from my previous, cancelled ask?
No, this is intentional and was always this way even before your cancelled ask.
Can you explain? It is an empty abstraction at the moment. We can always add it when we need it?
I removed it in 4ba8d5c.
public class RecoveryHandoffPrimaryContextRequest extends TransportRequest {

    private long recoveryId;
    private ShardId shardId;
can we add a toString? I don't see any other usage for this shardId field
Sure.
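For example, something along these lines (format is illustrative and only the fields shown above are included):

@Override
public String toString() {
    return "RecoveryHandoffPrimaryContextRequest{" +
        "recoveryId=" + recoveryId +
        ", shardId=" + shardId +
        '}';
}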
 * Initialize an empty request (used to serialize into when reading from a stream).
 */
@SuppressWarnings("WeakerAccess")
public RecoveryHandoffPrimaryContextRequest() {
I think this can be package-private and then the SuppressWarnings makes sense?
That's weird, there's no other reason to add the suppression unless I had already made it package private at some point. I'm not sure why I would have changed it to public. I will change it back.
@@ -448,7 +451,23 @@ public void finalizeRecovery(final long targetLocalCheckpoint) {
    StopWatch stopWatch = new StopWatch().start();
    logger.trace("finalizing recovery");
    cancellableThreads.execute(() -> {
        shard.markAllocationIdAsInSync(request.targetAllocationId(), targetLocalCheckpoint);
        final CountDownLatch latch = new CountDownLatch(1);
see other comment
Okay.
@ywelsch Back to you.
Left one more question
import org.elasticsearch.index.shard.ShardId;

import java.security.Security;
import gone wrong :)
    sealed = false;
    throw e;
}
return () -> {
I'm confused why we need both a consumer and a releasable here. Maybe it's good enough to have a "releaseFailedPrimaryContext" method that sets sealed back to false.
I would like the sealing to remain active if the relocation successfully completed (i.e. the source shard has been successfully marked as "relocating"). This validates that no one is advancing the global checkpoint on a primary that is no longer the active primary. Similarly (not for this PR, but in a follow-up) I would like the GlobalCheckPointTracker to be initially sealed and only be unsealed when a primary shard gets activated (whether by becoming active after recovery or being initialized with a primary context during relocation handoff).
Well, it was your suggestion to return a Releasable. Yet, I will change this. 😇
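A minimal sketch of the proposed method, assuming a sealed flag on the tracker; the assertion is illustrative:

// sketch: only a failed handoff unseals the tracker again; a successful handoff
// leaves it sealed so nothing can advance the global checkpoint on the old primary
synchronized void releaseFailedPrimaryContext() {
    assert sealed : "expected the tracker to be sealed when releasing a failed primary context";
    sealed = false;
}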
final ObjectLongMap<String> inSyncLocalCheckpoints = new ObjectLongHashMap<>(this.inSyncLocalCheckpoints);
final ObjectLongMap<String> trackingLocalCheckpoints = new ObjectLongHashMap<>(this.trackingLocalCheckpoints);
try {
    consumer.accept(new PrimaryContext(appliedClusterStateVersion, inSyncLocalCheckpoints, trackingLocalCheckpoints));
why run the handoff under lock? This will block cluster state updates that call updateAllocationIdsFromMaster.
This was only an inadvertent side-effect of turning the context inside out to return a Releasable, but it definitely has to go!
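One shape this could take, as a sketch: snapshot the tracker state under the monitor but invoke the consumer outside of it, so concurrent cluster state updates are not blocked.

// sketch only: copy the state under the tracker's monitor, run the handoff outside it
final PrimaryContext primaryContext;
synchronized (this) {
    primaryContext = new PrimaryContext(
        appliedClusterStateVersion,
        new ObjectLongHashMap<>(inSyncLocalCheckpoints),
        new ObjectLongHashMap<>(trackingLocalCheckpoints));
}
consumer.accept(primaryContext);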
        final Set<String> activeAllocationIds, final Set<String> initializingAllocationIds) {
        final long applyingClusterStateVersion, final Set<String> activeAllocationIds, final Set<String> initializingAllocationIds) {
    if (sealed) {
        throw new IllegalStateException("global checkpoint tracker is sealed");
Cluster state updates can still happen during a relocation handoff (or after), so I don't think that we should check for sealing here. In fact, as the handoff can fail and the primary continue operating as usual, we also want to take all cluster state updates that happened during handoff into consideration.
I pushed a change that addresses this.
LGTM
The target of a primary relocation is not aware of the state of the replication group. In particular, it is not tracking in-sync and initializing shards and their checkpoints. This means that after the target shard is started, its knowledge of the replication group could differ from that of the relocation source. In particular, this differing view can lead to it computing a global checkpoint that moves backwards after it becomes aware of the state of the entire replication group. This commit addresses this issue by transferring a primary context during relocation handoff.
Relates #10708, relates #25355