Track max shard size for all copies #50638
Conversation
Today we distinguish the sizes of primary and replica shards when contemplating disk-based shard allocation, using a map of shard sizes keyed by strings of the form `[index][0][p]`. There is no need to distinguish primaries and replicas like this. Moreover, in the case where a replica is not yet allocated, the size of the primary is a good estimate for the size of the replica. Finally, in the case where there are multiple replicas, we took the size of _one_ of them, which may not have been the largest. This commit addresses these issues by keying the map of shard sizes by `ShardId`, collapsing primaries and replicas into a single entry, and taking the maximum size across all copies of each shard.
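To make the change concrete, here is a minimal sketch, not the actual Elasticsearch code: `ShardSizeTracker` and the nested `ShardId` record are hypothetical stand-ins (the real `org.elasticsearch.index.shard.ShardId` is a full class), showing the effect of keying by shard and merging sizes with `Math::max`:

```java
import java.util.HashMap;
import java.util.Map;

class ShardSizeTracker {
    // Hypothetical stand-in for org.elasticsearch.index.shard.ShardId.
    record ShardId(String index, int id) {}

    private final Map<ShardId, Long> maxShardSizes = new HashMap<>();

    // Called once per copy (primary or replica); replaces the old string
    // keys of the form "[index][0][p]" and keeps the largest size seen.
    void recordCopySize(ShardId shardId, long sizeInBytes) {
        maxShardSizes.merge(shardId, sizeInBytes, Math::max);
    }

    long maxSizeOf(ShardId shardId) {
        return maxShardSizes.getOrDefault(shardId, 0L);
    }
}
```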
Pinging @elastic/es-distributed (:Distributed/Allocation)
```java
} else {
    shardKey = c.key.toString();
}
builder.humanReadableField(shardKey + "_bytes", shardKey, new ByteSizeValue(c.value));
```
This is arguably a breaking change since we will no longer include the `[p]` or `[r]` suffix in the shard sizes reported by the allocation explain output. We could fake these out for BWC, reporting the size of each shard twice, and allowing users to opt in to the future behaviour here with a system property. TBD.
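For illustration only, a rough sketch of that BWC fake-out: report each shard's (max) size twice under the legacy `[p]`/`[r]` keys unless the user opts in to the new single-key behaviour. The class, method, and flag are hypothetical stand-ins for the system-property plumbing, not the actual XContent-building code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ShardSizeFields {
    static Map<String, Long> render(String index, int shard, long maxBytes, boolean newBehaviour) {
        Map<String, Long> fields = new LinkedHashMap<>();
        String base = "[" + index + "][" + shard + "]";
        if (newBehaviour) {
            fields.put(base, maxBytes);          // single collapsed entry per shard
        } else {
            fields.put(base + "[p]", maxBytes);  // same value reported twice,
            fields.put(base + "[r]", maxBytes);  // preserving the legacy field names
        }
        return fields;
    }
}
```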
The allocation explain API is mostly meant for human consumption. As long as the field names are preserved in the response output, things should be ok.
As it stands, this PR changes the field names in the response output, because it drops the `[p]` or `[r]` marker from `shardKey`. Even if we were to track all copies, I think we'd also need to change these field names.
I see. I think breaking this is fine.
I wonder if collapsing this even more (not only replicas, but also primary + replicas) will cause instability (e.g. a background merge of a replica forcing a primary to be reallocated). Given that most clusters run with a single replica, this might not have been much of an issue before but could become one now. I also wonder why we are not keying this by a `<shard-id, node-id>` pair (i.e. why we collapsed all replicas), falling back to the primary info if there is no info for the given node.
Could you clarify the mechanism for that outcome? Note that we only use the shard sizes tracked here to correct for the effects of in-flight relocations. If a primary is not already relocating then its tracked size shouldn't be relevant, unless I'm missing something.
My bad, I thought that the tracked shard sizes of non-moving shards would be used as well in the …
My main worry is if the added conservatism from #46079 (where we count the relocating shard up to twice on the target host), together with the conservatism added here (where we could pick a larger size than necessary), can cause effects where a relocation triggers another (unnecessary) relocation on the target node in a "not too far from full" cluster. We could make it slightly less conservative (on average) by picking the primary size instead of the max size. This seems reasonable because it is the copy used for a new allocation anyway. The counterargument is that it is also used for the subtraction in `canRemain`, which could lead to relocating shards off the node more eagerly than necessary. While these are all likely minor effects, I think I would be in favor of being more precise, i.e., keep all shard sizes, subtract the specific shard size, and use the primary size for estimating the future size of any shard. In edge cases, where we do not know the size of a relocating shard or primary shard, we can pick the primary or max size.
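A rough sketch of that more precise bookkeeping, under stated assumptions: every type here is a hypothetical stand-in rather than an Elasticsearch class. It tracks a size per (shard, node) copy, subtracts the exact copy size for a `canRemain`-style check, and estimates an incoming copy from the primary size, falling back to the max:

```java
import java.util.HashMap;
import java.util.Map;

class PerCopyShardSizes {
    record ShardId(String index, int id) {}
    record CopyKey(ShardId shard, String nodeId) {}

    private final Map<CopyKey, Long> copySizes = new HashMap<>();
    private final Map<ShardId, Long> primarySizes = new HashMap<>();
    private final Map<ShardId, Long> maxSizes = new HashMap<>();

    void record(ShardId shard, String nodeId, boolean primary, long bytes) {
        copySizes.put(new CopyKey(shard, nodeId), bytes);
        if (primary) {
            primarySizes.put(shard, bytes);
        }
        maxSizes.merge(shard, bytes, Math::max);
    }

    // Exact size of the copy on this node, for the canRemain-style subtraction;
    // falls back to the estimate if this copy's size is unknown.
    long sizeOnNode(ShardId shard, String nodeId) {
        Long exact = copySizes.get(new CopyKey(shard, nodeId));
        return exact != null ? exact : estimatedSize(shard);
    }

    // Estimate for a not-yet-allocated or relocating copy: the primary size
    // (the copy a new allocation recovers from), else the max across copies.
    long estimatedSize(ShardId shard) {
        Long primary = primarySizes.get(shard);
        return primary != null ? primary : maxSizes.getOrDefault(shard, 0L);
    }
}
```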
👍 I'm also in favour of tracking the sizes of all the copies. Any thoughts on the implications of the changes to the response from the cluster allocation explain API? |