backupccl: send chunks with fail scatters to random node in generative ssp #97589

Merged 1 commit into cockroachdb:master on Apr 3, 2023

Conversation

@rhu713 (Contributor) commented Feb 23, 2023

For chunks that fail to scatter, this patch routes the chunk to a random node instead of the current node. This is necessary because, prior to the generative version, split and scatter processors ran on every node, so routing failed-scatter chunks to the current node introduced no imbalance. The generative split and scatter processor runs on only one node, so keeping the old behavior would cause that single node to process every chunk that fails to scatter.

Addresses runs 6 and 9 of #99206

Release note: None
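
To make the routing concrete, here is a minimal Go sketch of the idea described above. It is an illustration, not the processor code: routeFailedScatterChunk, its signature, and the integer node IDs are hypothetical, while hashing the chunk's scatter key with FNV-32a and falling back to the current node on a hash error follow the diff excerpts discussed in the review below.

package main

import (
	"fmt"
	"hash/fnv"
)

// routeFailedScatterChunk is a hypothetical helper illustrating the routing
// change: hash the chunk's scatter key and map it onto the known node IDs,
// rather than always returning the current node.
func routeFailedScatterChunk(scatterKey []byte, currentNode int, cachedNodeIDs []int) int {
	if len(cachedNodeIDs) == 0 {
		// No node IDs cached yet; fall back to the current node.
		return currentNode
	}
	h := fnv.New32a()
	if _, err := h.Write(scatterKey); err != nil {
		// On a hash error, fall back to the current node (mirroring the
		// warning path in the diff excerpt below).
		return currentNode
	}
	// Pick a deterministic pseudo-random node based on the hashed key.
	return cachedNodeIDs[int(h.Sum32()%uint32(len(cachedNodeIDs)))]
}

func main() {
	nodes := []int{1, 2, 3}
	fmt.Println(routeFailedScatterChunk([]byte("/Table/106/1/42"), 1, nodes))
}

Because the hash of the scatter key picks the destination, failed-scatter chunks are spread across the known nodes instead of piling up on the node running the single generative processor.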

@cockroach-teamcity (Member) commented:

This change is Reviewable

@rhu713 rhu713 marked this pull request as ready for review February 23, 2023 20:15
@rhu713 rhu713 requested a review from a team as a code owner February 23, 2023 20:15
@rhu713 rhu713 requested review from dt and removed request for a team February 23, 2023 20:15
@rhu713 rhu713 changed the title backupccl: add setting for num split workers, fail scatters to random node backupccl: send chunks with fail scatters to random node in generative ssp Mar 20, 2023
@msbutler (Collaborator) left a comment:

Looks great! Left some nits.

One q: do we understand why kv fails to scatter?

log.Warningf(ctx, "scatter returned node 0. Route span starting at %s to current node %v because of hash error: %v",
scatterKey, nodeID, err)
} else {
randomNum := int(hash.Sum32())

Collaborator:
nit: call this variable a hashedKey instead?

Contributor Author:
done

if nodeID, ok := flowCtx.NodeID.OptionalNodeID(); ok {
cachedNodeIDs := cache.cachedNodeIDs()
if len(cachedNodeIDs) > 0 && len(importSpanChunk.entries) > 0 {
hash := fnv.New32a()

Collaborator:
nit: I don't think the memory footprint of a hash is that big, but we could instantiate it outside of the loop and reset it here.

Contributor Author:
done
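
As a small aside on the nit above, this is a minimal sketch of the suggested pattern: create the FNV hasher once outside the loop and call Reset() before each reuse. The keys slice here is hypothetical; the real loop iterates over restore span chunk entries.

package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	// Hypothetical keys standing in for the chunk entries' start keys.
	keys := [][]byte{
		[]byte("/Table/106/1/a"),
		[]byte("/Table/106/1/b"),
	}

	h := fnv.New32a() // instantiate the hasher once, outside the loop
	for _, k := range keys {
		h.Reset() // clear state left over from the previous key
		h.Write(k)
		fmt.Println(h.Sum32())
	}
}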

if nodeID, ok := flowCtx.NodeID.OptionalNodeID(); ok {
cachedNodeIDs := cache.cachedNodeIDs()
if len(cachedNodeIDs) > 0 && len(importSpanChunk.entries) > 0 {

Collaborator:
nit: I don't think we need to check the length of importSpanChunk.entries, given that we grab the first entry above (line 360).

Contributor Author:
done

testDiskMonitor := execinfra.NewTestDiskMonitor(ctx, st)
defer testDiskMonitor.Stop(ctx)

// Set up the test so that the test context is canceled after the first entry

Collaborator:
I think this comment can be removed.

Contributor Author:
removed

@msbutler (Collaborator) commented:

Also, we should backport this to 23.1.

@rhu713 rhu713 added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Mar 30, 2023

@rhu713 (Contributor Author) commented Apr 3, 2023:

bors r+

@craig (craig bot) commented Apr 3, 2023:

Build succeeded.

@craig craig bot merged commit 66dfb23 into cockroachdb:master Apr 3, 2023
@blathers-crl (bot) commented Apr 3, 2023:

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 210650d to blathers/backport-release-23.1-97589: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
