Track max shard size for all copies #50638
Conversation
Today we distinguish the sizes of primary and replica shards when contemplating disk-based shard allocation, using a map of shard sizes keyed by strings of the form `[index][0][p]`. There is no need to distinguish primaries and replicas like this. Moreover, in the case where a replica is not yet allocated, the size of the primary is a good estimate for the size of the replica. Finally, in the case where there are multiple replicas, we took the size of _one_ of them, which may not have been the largest. This commit addresses these issues by keying the map of shard sizes by `ShardId`, collapsing primaries and replicas into a single entry, and taking the maximum size across all copies of each shard.
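To make the change concrete, here is a minimal sketch, not the actual Elasticsearch code: `ShardSizeTracker` and the nested `ShardId` record are hypothetical stand-ins (the real `org.elasticsearch.index.shard.ShardId` is a full class), showing the effect of keying by shard and merging sizes with `Math::max`:

```java
import java.util.HashMap;
import java.util.Map;

class ShardSizeTracker {
    // Hypothetical stand-in for org.elasticsearch.index.shard.ShardId.
    record ShardId(String index, int id) {}

    private final Map<ShardId, Long> maxShardSizes = new HashMap<>();

    // Called once per copy (primary or replica); replaces the old string
    // keys of the form "[index][0][p]" and keeps the largest size seen.
    void recordCopySize(ShardId shardId, long sizeInBytes) {
        maxShardSizes.merge(shardId, sizeInBytes, Math::max);
    }

    long maxSizeOf(ShardId shardId) {
        return maxShardSizes.getOrDefault(shardId, 0L);
    }
}
```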
Pinging @elastic/es-distributed (:Distributed/Allocation)
```java
} else {
    shardKey = c.key.toString();
}
builder.humanReadableField(shardKey + "_bytes", shardKey, new ByteSizeValue(c.value));
```
This is arguably a breaking change since we will no longer include the `[p]` or `[r]` suffix in the shard sizes reported by the allocation explain output. We could fake these out for BWC, reporting the size of each shard twice, and allowing users to opt in to the future behaviour here with a system property. TBD.
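For illustration only, a rough sketch of that BWC fake-out: report each shard's (max) size twice under the legacy `[p]`/`[r]` keys unless the user opts in to the new single-key behaviour. The class, method, and flag are hypothetical stand-ins for the system-property plumbing, not the actual XContent-building code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ShardSizeFields {
    static Map<String, Long> render(String index, int shard, long maxBytes, boolean newBehaviour) {
        Map<String, Long> fields = new LinkedHashMap<>();
        String base = "[" + index + "][" + shard + "]";
        if (newBehaviour) {
            fields.put(base, maxBytes);          // single collapsed entry per shard
        } else {
            fields.put(base + "[p]", maxBytes);  // same value reported twice,
            fields.put(base + "[r]", maxBytes);  // preserving the legacy field names
        }
        return fields;
    }
}
```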
The allocation explain API is mostly meant for human consumption. As long as the field names are preserved in the response output, things should be ok.
As it stands, this PR changes the field names in the response output, because it drops the `[p]` or `[r]` marker from `shardKey`. Even if we were to track all copies, I think we'd also need to change these field names.
I see. I think breaking this is fine.
I wonder if collapsing this even more (not only replicas, but also primary + replicas) will cause instability (e.g. a background merge of a replica forcing a primary to be reallocated). Given that most clusters run with a single replica, this might not have been much of an issue before but could become one now. I also wonder why we are not keying this by a `<shard-id, node-id>` pair (i.e. why we collapsed all replicas), falling back to the primary info if there is no info for the given node.
Could you clarify the mechanism for that outcome? Note that we only use the shard sizes tracked here to correct for the effects of in-flight relocations. If a primary is not already relocating then its tracked size shouldn't be relevant, unless I'm missing something.
My bad, I thought that the tracked shard sizes of non-moving shards would be used as well in the …
My main worry is if the added conservatism from #46079 (where we count the relocating shard up to twice on the target host), together with the conservatism added here (where we could pick a larger size than necessary), can cause effects where a relocation triggers another (unnecessary) relocation on the target node in a "not too far from full" cluster. We could make it slightly less conservative (on average) by picking the primary size instead of the max size. This seems reasonable because it is the copy used for a new allocation anyway. The counterargument is that it is also used for the subtraction in `canRemain`, which could lead to relocating shards off the node more eagerly than necessary. While these are all likely minor effects, I think I would be in favor of being more precise, i.e., keep all shard sizes, subtract the specific shard size, and use the primary size for estimating the future size of any shard. In edge cases, where we do not know the size of a relocating shard or primary shard, we can pick the primary or max size.
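A rough sketch of that more precise bookkeeping, under stated assumptions: every type here is a hypothetical stand-in rather than an Elasticsearch class. It tracks a size per (shard, node) copy, subtracts the exact copy size for a `canRemain`-style check, and estimates an incoming copy from the primary size, falling back to the max:

```java
import java.util.HashMap;
import java.util.Map;

class PerCopyShardSizes {
    record ShardId(String index, int id) {}
    record CopyKey(ShardId shard, String nodeId) {}

    private final Map<CopyKey, Long> copySizes = new HashMap<>();
    private final Map<ShardId, Long> primarySizes = new HashMap<>();
    private final Map<ShardId, Long> maxSizes = new HashMap<>();

    void record(ShardId shard, String nodeId, boolean primary, long bytes) {
        copySizes.put(new CopyKey(shard, nodeId), bytes);
        if (primary) {
            primarySizes.put(shard, bytes);
        }
        maxSizes.merge(shard, bytes, Math::max);
    }

    // Exact size of the copy on this node, for the canRemain-style subtraction;
    // falls back to the estimate if this copy's size is unknown.
    long sizeOnNode(ShardId shard, String nodeId) {
        Long exact = copySizes.get(new CopyKey(shard, nodeId));
        return exact != null ? exact : estimatedSize(shard);
    }

    // Estimate for a not-yet-allocated or relocating copy: the primary size
    // (the copy a new allocation recovers from), else the max across copies.
    long estimatedSize(ShardId shard) {
        Long primary = primarySizes.get(shard);
        return primary != null ? primary : maxSizes.getOrDefault(shard, 0L);
    }
}
```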
👍 I'm also in favour of tracking the sizes of all the copies. Any thoughts on the implications of the changes to the response from the cluster allocation explain API? |