Track max shard size for all copies #50638

Conversation

DaveCTurner (Contributor)

Today we distinguish the sizes of primary and replica shards when contemplating
disk-based shard allocation, using a map of shard sizes keyed by strings of the
form `[index][0][p]`. There is no need to distinguish primaries and replicas
like this. Moreover in the case where a replica is not yet allocated the size
of the primary is a good estimate for the size of the replica. Finally, in the
case where there are multiple replicas we took the size of _one_ of them, which
may not have been the largest.

This commit addresses these issues by keying the map of shard sizes by
`ShardId`, collapsing primaries and replicas into a single entry, and taking
the maximum size across all copies of each shard.
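
A minimal sketch of the bookkeeping described above, using simplified stand-in types rather than the real `ShardId` and routing classes (the names here are illustrative, not the PR's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for ShardId and a shard copy; illustration only.
record ShardKey(String index, int shard) {}
record ShardCopy(ShardKey id, boolean primary, long sizeInBytes) {}

class MaxShardSizes {
    // One entry per shard, regardless of how many copies exist.
    private final Map<ShardKey, Long> maxSizeByShard = new HashMap<>();

    void track(ShardCopy copy) {
        // Keep the largest size seen across all copies (primary and replicas).
        maxSizeByShard.merge(copy.id(), copy.sizeInBytes(), Math::max);
    }

    // An unassigned replica has no size of its own, so the tracked size
    // (which covers the primary) doubles as its estimate.
    long estimatedSize(ShardKey id) {
        return maxSizeByShard.getOrDefault(id, 0L);
    }
}
```
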
@DaveCTurner added the >non-issue, WIP, :Distributed Coordination/Allocation, v8.0.0 and v7.6.0 labels on Jan 6, 2020

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Allocation)

Inline review thread on the code that renders shard sizes in the allocation explain output:

    } else {
        shardKey = c.key.toString();
    }
    builder.humanReadableField(shardKey + "_bytes", shardKey, new ByteSizeValue(c.value));

DaveCTurner (Contributor, PR author):
This is arguably a breaking change since we will no longer include the [p] or [r] suffix in the shard sizes reported by the allocation explain output. We could fake these out for BWC, reporting the size of each shard twice, and allowing users to opt-in to the future behaviour here with a system property. TBD.

Contributor (reviewer):
The allocation explain API is mostly meant for human consumption. As long as the field names are preserved in the response output, things should be ok.

DaveCTurner (Contributor, PR author):
As it stands, this PR changes the field names in the response output, because it drops the [p] or [r] marker from shardKey. Even if we were to track all copies, I think we'd still need to change these field names.
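
A small illustration of the field-name change being discussed, assuming the new shard key renders as [index][shard] once the marker is gone (the index name and shard number are made up):

```java
// Illustration only: the explain output builds its field name as shardKey + "_bytes",
// so dropping the [p]/[r] marker from shardKey changes the field name itself.
class ExplainFieldNameExample {
    public static void main(String[] args) {
        String oldShardKey = "[my-index][0][p]"; // old string key: index, shard number, primary/replica marker
        String newShardKey = "[my-index][0]";    // new key: one entry per shard, marker dropped
        System.out.println(oldShardKey + "_bytes"); // before: [my-index][0][p]_bytes
        System.out.println(newShardKey + "_bytes"); // after:  [my-index][0]_bytes
    }
}
```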

Contributor (reviewer):
I see. I think breaking this is fine.

@ywelsch (Contributor) commented on Jan 6, 2020:

I wonder if collapsing this even more (not only replicas, but also primary + replicas) will cause instability (e.g. a background merge of a replica forcing a primary to be reallocated). Given that most clusters run with a single replica, this might not have been much of an issue before but could become one now.

I also wonder why we are not keying this by <shard-id, node-id> pair (i.e. why we collapsed all replicas), falling back to the primary info if there is no info for the given node.
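
A hedged sketch of the alternative keying described in this comment, again with made-up types: one size per (shard, node) pair, falling back to the primary's size when there is no entry for the given node:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: sizes keyed by (shard, node) rather than collapsed across copies.
record CopyKey(String index, int shard, String nodeId) {}
record ShardRef(String index, int shard) {}

class PerCopyShardSizes {
    private final Map<CopyKey, Long> sizeByCopy = new HashMap<>();
    private final Map<ShardRef, Long> primarySize = new HashMap<>();

    // Exact size of the copy on this node if known, otherwise the primary's size (may be null).
    Long sizeOf(String index, int shard, String nodeId) {
        Long exact = sizeByCopy.get(new CopyKey(index, shard, nodeId));
        return exact != null ? exact : primarySize.get(new ShardRef(index, shard));
    }
}
```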

@DaveCTurner (Contributor, PR author):

> instability (e.g. background-merge of a replica forcing a primary to be reallocated)

Could you clarify the mechanism for that outcome?

Note that we only use the shard sizes tracked here to correct for the effects of in-flight relocations. If a primary is not already relocating then its tracked size shouldn't be relevant, unless I'm missing something.
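
A rough sketch of that correction, with hypothetical types: the tracked sizes only adjust a node's projected usage for copies currently relocating onto or off it, so a non-relocating primary's tracked size never enters the calculation:

```java
import java.util.List;

// Illustration only: projecting a node's disk usage under in-flight relocations.
record Relocation(long shardSizeInBytes, boolean incoming) {}

class DiskProjection {
    static long projectedUsageBytes(long currentUsageBytes, List<Relocation> inFlight) {
        long projected = currentUsageBytes;
        for (Relocation r : inFlight) {
            // A copy moving onto the node will add its size; one moving away will free it up.
            projected += r.incoming() ? r.shardSizeInBytes() : -r.shardSizeInBytes();
        }
        return projected;
    }
}
```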

@ywelsch (Contributor) commented on Jan 6, 2020:

My bad, I thought that the tracked shard sizes of non-moving shards would be used as well in the canRemain scenario.

@henningandersen (Contributor):

My main worry is whether the added conservatism from #46079 (where we count the relocating shard up to twice on the target host), together with the conservatism added here (where we could pick a larger size than necessary), can cause effects where a relocation triggers another (unnecessary) relocation on the target node in a "not too far from full" cluster.

We could make it slightly less conservative (on average) by picking the primary size instead of the max size. This seems reasonable because it is the copy used for a new allocation anyway. The counterargument is that it is also used for the subtraction in canRemain, which could lead to relocating shards off the node more eagerly than necessary.

While these are likely all minor effects, I think I would be in favor of being more precise, i.e., keeping all shard sizes, subtracting the specific shard size, and using the primary size for estimating the future size of any shard. In edge cases, where we do not know the size of a relocating shard or primary shard, we can pick the primary or max size.
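
A hedged sketch of that more precise lookup order, with made-up key formats and method names: use the specific copy's size when subtracting it, the primary's size when estimating a future copy, and fall back to the maximum across copies only when neither is known:

```java
import java.util.Map;

// Illustration only: the fallback chain suggested above.
class ShardSizeEstimator {
    private final Map<String, Long> sizeByCopy;      // "[index][shard][node]" -> bytes
    private final Map<String, Long> primarySize;     // "[index][shard]"       -> bytes
    private final Map<String, Long> maxSizeOfCopies; // "[index][shard]"       -> bytes

    ShardSizeEstimator(Map<String, Long> sizeByCopy,
                       Map<String, Long> primarySize,
                       Map<String, Long> maxSizeOfCopies) {
        this.sizeByCopy = sizeByCopy;
        this.primarySize = primarySize;
        this.maxSizeOfCopies = maxSizeOfCopies;
    }

    // Subtracting a departing copy (the canRemain case): prefer that copy's own measured size.
    long sizeToSubtract(String index, int shard, String nodeId) {
        String copyKey = "[" + index + "][" + shard + "][" + nodeId + "]";
        String shardKey = "[" + index + "][" + shard + "]";
        return sizeByCopy.getOrDefault(copyKey,
            primarySize.getOrDefault(shardKey, maxSizeOfCopies.getOrDefault(shardKey, 0L)));
    }

    // Estimating a not-yet-allocated copy: the primary is what it will be recovered from.
    long estimatedFutureSize(String index, int shard) {
        String shardKey = "[" + index + "][" + shard + "]";
        return primarySize.getOrDefault(shardKey, maxSizeOfCopies.getOrDefault(shardKey, 0L));
    }
}
```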

@DaveCTurner (Contributor, PR author):

👍 I'm also in favour of tracking the sizes of all the copies.

Any thoughts on the implications of the changes to the response from the cluster allocation explain API?

@polyfractal added v7.7.0 and removed v7.6.0 labels on Jan 15, 2020
@ywelsch removed their request for review on Jan 17, 2020
@bpintea added v7.8.0 and removed v7.7.0 labels on Mar 25, 2020
@rjernst added the Team:Distributed (Obsolete) label on May 4, 2020
@andreidan added v7.11.0 and removed v7.10.0 labels on Oct 7, 2020
@danhermann added v8.1.0 and removed v7.16.0 labels on Oct 27, 2021
@arteam removed the v8.0.0 label on Jan 12, 2022
@mark-vieira added v8.2.0 and removed v8.1.0 labels on Feb 2, 2022
@elasticsearchmachine changed the base branch from master to main on Jul 22, 2022
@DaveCTurner deleted the 2020-01-03-track-max-shard-size-for-all-copies branch on Jul 23, 2022