
ShardFollowNodeTask fetches operations twice #32453

Closed
dnhatn opened this issue Jul 28, 2018 · 9 comments

Labels: >bug, :Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features)

dnhatn (Member) commented Jul 28, 2018

Since #31581, ShardFollowNodeTask may fetch the same range of operations twice. The following log shows that we fetched the range [1680 to 2024] twice:

[ShardFollowNodeTask] fetch from=1620, to=1679, receive [1620 2024]
[ShardFollowNodeTask] fetch from=1680, to=1984, receive [1680 2084]

This bug, combined with the follower not using the FollowingEngine (PR #32448), can explain why we see many deletes on the replicas of the follower (but not on its primaries). The full log is below; the overlap arithmetic is sketched right after it.

[ShardFollowNodeTask] send shard request from=0, batch=405, max=-1, retry=0, gcp=-1, num_reads=1
[ShardFollowNodeTask] fetch from=0, to=-1, receive [0 404]
[ShardFollowNodeTask] send shard request from=405, batch=405, max=809, retry=0, gcp=1679, num_reads=1
[ShardFollowNodeTask] send shard request from=810, batch=405, max=1214, retry=0, gcp=1679, num_reads=2
[ShardFollowNodeTask] send shard request from=1215, batch=405, max=1619, retry=0, gcp=1679, num_reads=3
[ShardFollowNodeTask] send shard request from=1620, batch=405, max=1679, retry=0, gcp=1679, num_reads=4

[ShardFollowNodeTask] fetch from=810, to=1214, receive [810 1214]
[ShardFollowNodeTask] send shard request from=1680, batch=405, max=1984, retry=0, gcp=1984, num_reads=4
[ShardFollowNodeTask] fetch from=405, to=809, receive [405 809]
[ShardFollowNodeTask] fetch from=1215, to=1619, receive [1215 1619]
[ShardFollowNodeTask] fetch from=1620, to=1679, receive [1620 2024] (*)
[ShardFollowNodeTask] send shard request from=2025, batch=405, max=2322, retry=0, gcp=2322, num_reads=2
[ShardFollowNodeTask] fetch from=1680, to=1984, receive [1680 2084] (**)
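
For reference, the overlap in the log above falls out of simple batch arithmetic: the left-over request only needs [1620, 1679], but since it is sent with the full batch size (405 in this log), the leader is allowed to answer with operations up to 1620 + 405 - 1 = 2024, which collides with the concurrently requested range starting at 1680. A minimal illustration of that arithmetic (the names below are hypothetical, not the actual ShardFollowNodeTask fields):

```java
// Illustrative arithmetic only; the names are hypothetical and are not taken
// from the actual ShardFollowNodeTask implementation.
public class FetchOverlapDemo {
    public static void main(String[] args) {
        final long maxBatchCount = 405;                 // "batch=405" in the log

        // Left-over request: the follower only needs [1620, 1679] ...
        final long leftOverFrom = 1620;
        // ... but it is sent with the full batch size, so the leader may answer
        // with operations up to:
        final long leaderMayAnswerUpTo = leftOverFrom + maxBatchCount - 1;   // 2024

        // Concurrent follow-up request once the global checkpoint advanced:
        final long nextFrom = 1680;
        final long nextUpTo = nextFrom + maxBatchCount - 1;                  // 2084

        // The two responses overlap on [1680, 2024], i.e. those ops are fetched twice.
        final long overlapFrom = Math.max(leftOverFrom, nextFrom);           // 1680
        final long overlapTo = Math.min(leaderMayAnswerUpTo, nextUpTo);      // 2024
        System.out.printf("overlap = [%d, %d]%n", overlapFrom, overlapTo);
    }
}
```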
dnhatn added the >bug and :Distributed Indexing/CCR labels on Jul 28, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

dnhatn (Member, Author) commented Jul 28, 2018

/cc @martijnvg @jasontedor @bleskes

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Jul 30, 2018
Today ShardFollowNodeTask might fetch some operations more than once.
This happens because we ask the leader for up to max_batch_count
operations (instead of the left-over size) for the left-over request.
The leader can then freely respond with up to max_batch_count operations,
and at the same time, if one of the previous requests has completed, we
might issue another read request whose range overlaps with the response
to the left-over request.

Closes elastic#32453
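
In other words, the fix is to cap the left-over request at the left-over size instead of at max_batch_count. A minimal sketch of that idea, with hypothetical, simplified names (see #32455 for the actual change):

```java
// Sketch of the idea described in the commit message above; the names are
// hypothetical and simplified, not the actual code (see #32455 for the real fix).
public class LeftOverBatchSketch {

    /** How many operations to request starting at {@code from}. */
    static long requestSize(long from, long maxRequiredSeqNo, long maxBatchCount) {
        final long leftOver = maxRequiredSeqNo - from + 1;   // ops still needed in this range
        // Before the fix: the left-over request asked for a full maxBatchCount, so the
        // leader could legally return operations beyond maxRequiredSeqNo and overlap a
        // concurrently issued read for the next range.
        // After the fix: cap the request at the left-over size.
        return Math.min(maxBatchCount, leftOver);
    }

    public static void main(String[] args) {
        // The left-over request from the log: we only need [1620, 1679].
        System.out.println(requestSize(1620, 1679, 405));    // prints 60, not 405
    }
}
```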
bleskes (Contributor) commented Jul 30, 2018

Good catch @dnhatn. This feels like a bug to me; we should not get ops we didn't ask for, should we?

[ShardFollowNodeTask] fetch from=1620, to=1679, receive [1620 2024] (*)

bleskes (Contributor) commented Jul 30, 2018

Also, I'm not sure I follow how this explains the deletes, can you clarify? If the primary processes the same ops twice, but ignores the associated seq# and issues its own (and replicates it), how does that explain that the primary has no deletes but the replica does?

dnhatn (Member, Author) commented Jul 30, 2018

@bleskes The primary uses version numbers to resolve the indexing plan and rejects the duplicate operations, whereas the replica uses seq# to resolve the indexing plan and indexes the duplicate operations as stale documents. Does that sound reasonable to you?
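
To make that asymmetry concrete, here is a purely conceptual sketch, not the actual engine code: the method names and comparison rules are simplified assumptions, meant only to show why a duplicate operation is rejected on the primary but indexed as a stale copy on the replica, where it then surfaces in the deleted-docs statistics.

```java
// Conceptual sketch only, NOT the actual Elasticsearch engine code; it merely
// illustrates the primary/replica asymmetry described in the comment above.
enum IndexingPlan { INDEX, REJECT_AS_DUPLICATE, INDEX_AS_STALE }

final class PlanSketch {

    // On the primary, the plan is resolved from version numbers (external-versioning
    // style): a replayed op carries a version that is not newer than what is already
    // indexed, so it is rejected and never replicated again.
    static IndexingPlan planOnPrimary(long opVersion, long currentVersion) {
        return opVersion > currentVersion ? IndexingPlan.INDEX : IndexingPlan.REJECT_AS_DUPLICATE;
    }

    // On the replica, the plan is resolved from sequence numbers: an op whose seq# is
    // not above the highest seq# already seen for that doc is still indexed, but as a
    // stale copy, which is what surfaces as "deletes" on the follower's replicas.
    static IndexingPlan planOnReplica(long opSeqNo, long maxSeqNoSeenForDoc) {
        return opSeqNo > maxSeqNoSeenForDoc ? IndexingPlan.INDEX : IndexingPlan.INDEX_AS_STALE;
    }
}
```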

bleskes (Contributor) commented Jul 30, 2018

Thx @dnhatn, I forgot that we don't use the bulk shard request and that exceptions on the primary don't map to a lack of replication. It does sound reasonable. Thanks!


dnhatn self-assigned this on Jul 30, 2018
dnhatn added a commit that referenced this issue Jul 31, 2018
dnhatn (Member, Author) commented Jul 31, 2018

Fixed in #32455

dnhatn closed this as completed on Jul 31, 2018
dnhatn added a commit that referenced this issue Aug 2, 2018
Closes #32453
martijnvg (Member) commented

@dnhatn Great catch! 🎉
