Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nonreliable method of validating shardawareness in e2e's #2175

Merged
merged 1 commit into from
Oct 31, 2024

Conversation

zimnx
Copy link
Collaborator

@zimnx zimnx commented Oct 29, 2024

Gocql doesn't expose any way of checking if shardaware ports are used successfully, hence I implemented simple driver within a test, able to send initial packet and reading to which shard connection was established. Connections are established from within the client Pod which better reflects how clients are connecting in real environments.

Requires:

Fixes #1028

@zimnx zimnx added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/flake Categorizes issue or PR as related to a flaky test. labels Oct 29, 2024
@scylla-operator-bot scylla-operator-bot bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 29, 2024
@tnozicka
Copy link
Contributor

tnozicka commented Oct 29, 2024

Can you elaborate on why

func (d *ScyllaShardAwareDialer) DialContext(ctx context.Context, network, addr string) (conn net.Conn, err error) {
sourcePort := ScyllaGetSourcePort(ctx)
if sourcePort == 0 {
return d.Dialer.DialContext(ctx, network, addr)
}
dialerWithLocalAddr := d.Dialer
dialerWithLocalAddr.LocalAddr, err = net.ResolveTCPAddr(network, fmt.Sprintf(":%d", sourcePort))
if err != nil {
return nil, err
}
return dialerWithLocalAddr.DialContext(ctx, network, addr)
}

is not enough? (failing for 0, or failing to dial)

@zimnx
Copy link
Collaborator Author

zimnx commented Oct 29, 2024

is not enough? (failing for 0, or failing to dial)

You will dial successfully to shard aware port, but how do you know if you actually got the right shard without parsing the protocol? Gocql driver doesn't allow you to check it.
When there's NAT in between client and server, the only side effect is log message which cannot be validated.

Current approach used injected Dialer, but it can't determine if correct shard was chosen so it depended on particular gocql behavior - first connection is to any shard and the rest should use shard aware range.
But sometimes driver didn't try to establish connections to all shards within test timeout.
Why? I don't know, but relying on it isn't the best idea. We can do better by sending frames on our own and checking how shards are assigned.

Copy link
Contributor

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@zimnx zimnx force-pushed the fix-shardawareness-flake branch from 3c4c7af to c266820 Compare October 30, 2024 11:20
@zimnx zimnx requested review from tnozicka and rzetelskik October 30, 2024 11:21
@zimnx
Copy link
Collaborator Author

zimnx commented Oct 30, 2024

E2E's are failing because image is missing netcat. Once it's there I will retrigger them.
Local runs with image having it are passing.

Copy link
Contributor

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the update
(lgtm)

/assign @rzetelskik

@rzetelskik
Copy link
Member

/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Oct 30, 2024
@zimnx
Copy link
Collaborator Author

zimnx commented Oct 31, 2024

scylladb/scylla-operator-images#76 landed
/retest

@zimnx zimnx force-pushed the fix-shardawareness-flake branch from c266820 to 088f5e2 Compare October 31, 2024 09:58
@scylla-operator-bot scylla-operator-bot bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2024
@zimnx
Copy link
Collaborator Author

zimnx commented Oct 31, 2024

Bumped node-setup image to 0.0.3

Gocql doesn't expose any way of checking if shardaware ports are used successfully, hence I implemented simple driver within a test, able to send initial packet and reading to which shard connection was established.
Connections are established from within the client Pod which better reflects how clients are connecting in real environments.
@zimnx zimnx force-pushed the fix-shardawareness-flake branch from 088f5e2 to fbcd610 Compare October 31, 2024 10:39
@zimnx
Copy link
Collaborator Author

zimnx commented Oct 31, 2024

One minor change was made to remove -x from client Pod to prevent printing commands to stderr which is later validated and expected to be empty. There's no benefit of having this print. Expanded command is available in e2e dump.
@rzetelskik ptal

Copy link
Member

@rzetelskik rzetelskik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2024
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot merged commit 6f9523a into scylladb:master Oct 31, 2024
12 checks passed
@zimnx zimnx deleted the fix-shardawareness-flake branch October 31, 2024 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/flake Categorizes issue or PR as related to a flaky test. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flake: ScyllaCluster should allow to build connection pool using shard aware ports
3 participants