-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: add roach test / workload for testing connection latencies #62166
Conversation
d18f9b8
to
b5604be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glad this wasn't too bad! just some minor comments
b5604be
to
5104487
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/workload/connectionlatency/connectionlatency.go, line 20 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
let's use
github.com/jackc/pgx/v4
Done.
pkg/workload/connectionlatency/connectionlatency.go, line 60 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
is this always 1? it would be nice to be able to run this test in different configurations -- single-node, multi-node in 1 region, and multi-region. that way we can compare how it differs for all these.
take a look at some of the other tests on https://roachperf.crdb.dev/
some of them havenodes=X
andmultiregion
specifiers. not sure how to get that type of configuration, but i think we want it here.
Changed it to 1,3,5 nodes and one multiregion test with 6 nodes. Let me know what you think
pkg/workload/connectionlatency/connectionlatency.go, line 82 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i think we'd want a
defer conn.Close()
otherwise the connection will remain open forever? (i think -- curious to hear what you see when trying this)
Done.
08bdb59
to
7bdefb5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)
pkg/cmd/roachtest/connection_latency.go, line 20 at r2 (raw file):
) func registerConnectionLatencyTest(r *testRegistry) {
oh one more thing i thought of! does this test both insecure
and secure
clusters? we've seen that connection behaves differently when password authn is needed for non-root users
or if that seems like too many tests, we really only need to test with secure
and make sure not to connect with root
pkg/cmd/roachtest/connection_latency.go, line 39 at r2 (raw file):
geoZones := []string{"us-east1-b", "us-west1-b", "europe-west2-b"} if cloud == aws {
i think it's fine if the test only works in GCE
pkg/cmd/roachtest/connection_latency.go, line 58 at r2 (raw file):
// Copying over multiregion configuration from indexes.go numMultiRegionNodes := 6
i think it's more common for there to be 3 nodes per per region, so we'd want something more like what the tpcc test does (9 nodes)
7bdefb5
to
be049c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/cmd/roachtest/connection_latency.go, line 20 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
oh one more thing i thought of! does this test both
insecure
andsecure
clusters? we've seen that connection behaves differently when password authn is needed for non-root usersor if that seems like too many tests, we really only need to test with
secure
and make sure not to connect withroot
Made the test only use secure & not root
pkg/cmd/roachtest/connection_latency.go, line 39 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i think it's fine if the test only works in GCE
Done.
pkg/cmd/roachtest/connection_latency.go, line 58 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i think it's more common for there to be 3 nodes per per region, so we'd want something more like what the tpcc test does (9 nodes)
Did 9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just have questions that we can discuss!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)
pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):
workloadCmd := fmt.Sprintf( `./workload run connectionlatency --user testuser --secure --duration 30s --histograms=%s/stats.json`,
would we want to pass --urls
as a flag here? again, just asking since i'm not sure how it all works
pkg/cmd/roachtest/connection_latency.go, line 51 at r4 (raw file):
geoZonesStr := strings.Join(geoZones, ",") nodesConfig := []int{1, 3, 5}
maybe we just need to test the 3-node cluster and the 9-node multiregion cluster. well, the only reason i'm saying this is just to avoid extra cruft, and since i don't have a good sense of how much overhead the additional setup is. if all the extra tests are cheap, then fine to leave as is
pkg/workload/cli/run.go, line 194 at r4 (raw file):
if len(urls) == 0 { crdbDefaultURL := fmt.Sprintf(`postgres://%s@localhost:26257?sslmode=disable`, *user)
i'm not familiar with how this works -- why does it connect to localhost? where does it connect from?
pkg/workload/connectionlatency/connectionlatency.go, line 60 at r4 (raw file):
) (workload.QueryLoad, error) { ql := workload.QueryLoad{} if len(urls) != 1 {
do we expect the urls length to be 1 even for the multi-node clusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)
pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
would we want to pass
--urls
as a flag here? again, just asking since i'm not sure how it all works
That's an option if we want to make it more configurable, by default the workload runs on the same machine as the crdb node so localhost works.
pkg/workload/connectionlatency/connectionlatency.go, line 60 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
do we expect the urls length to be 1 even for the multi-node clusters?
Yeah if we don't specify a url, it'll use the localhost one, so the workload will connect using the node on the same machine. We can change this though
be049c5
to
a741ac4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)
pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):
Previously, RichardJCai (Richard Cai) wrote…
That's an option if we want to make it more configurable, by default the workload runs on the same machine as the crdb node so localhost works.
which machine does workload
run on if it's a multi-node cluster?
either way, for a multi-node cluster, i think we'd want a broad sample by testing connections to all the nodes in the cluster. this would show us issues like the one where we saw that it had to fetch data across different regions.
6034e5a
to
858dc5b
Compare
Support running workload with secure mode (sslmode=require) and allow a user to be passed in. Previously, roachtests were only run as root on insecure mode leaving some gaps in our testing. Release note: None
858dc5b
to
57e03f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
which machine does
workload
run on if it's a multi-node cluster?either way, for a multi-node cluster, i think we'd want a broad sample by testing connections to all the nodes in the cluster. this would show us issues like the one where we saw that it had to fetch data across different regions.
Updated so each node will connect to each other node.
I wonder if we should make the tracking a bit more granular for this though, ie region to region latency test.
pkg/cmd/roachtest/connection_latency.go, line 51 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
maybe we just need to test the 3-node cluster and the 9-node multiregion cluster. well, the only reason i'm saying this is just to avoid extra cruft, and since i don't have a good sense of how much overhead the additional setup is. if all the extra tests are cheap, then fine to leave as is
Done.
pkg/workload/cli/run.go, line 194 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i'm not familiar with how this works -- why does it connect to localhost? where does it connect from?
This is if no URL is specified, we use localhost by default. I believe it's assumed the VM will host the node.
57e03f3
to
751c307
Compare
@rafiss following up more on this point, I mentioned I've attached the output and some of the p99(ms) latencies are really bad, ~800ms for connect. |
751c307
to
f344824
Compare
I think at the moment, a p99 of 800ms might be expected. I think the auth code makes either 2 or 3 KV roundtrips, so if all those are cross-region, this could happen. See these related issues #58869 #36160 But yeah I agree it would be nice to be able to track this granularly. I think The other issue with this though is that the leaseholder placements won't be fixed across different test runs. So e.g. uswest1 could be slow one night, then the next night useast1 could be slow. Maybe the test should do an |
f344824
to
79ee93d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the output to show which region is used. It assumes that the regions/zones stay consistent throughout the test's history.
elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
30.0s 0 397 13.2 223.5 226.5 234.9 234.9 251.7 connect-to-cloud=gce,region=europe-west2,zone=europe-west2-b
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
30.0s 0 150 5.0 427.9 436.2 436.2 436.2 453.0 connect-to-cloud=gce,region=us-east1,zone=us-east1-b
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
30.0s 0 108 3.6 587.2 604.0 604.0 604.0 604.0 connect-to-cloud=gce,region=us-west1,zone=us-west1-b
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
30.0s 0 653 21.8 370.1 226.5 738.2 738.2 738.2 select
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
79ee93d
to
9b02137
Compare
Okay I think I finally have this in a state I'm satisfied with. Example output Latencies connecting from us-west
Latencies connecting from us-east
Latencies connecting from eu-west
|
9b02137
to
59c67c4
Compare
Add roach test and workload test for testing connection latencies. Release note: None
59c67c4
to
3a6415a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool!! this looks great
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)
pkg/workload/connectionlatency/connectionlatency.go, line 82 at r6 (raw file):
var locality string err = conn.QueryRow(ctx, "SHOW LOCALITY").Scan(&locality)
one thing i just realized is that it is expected for the latency to be high when connection from one region to a separate region, and there isn't much we can do about that
so the most important metrics out of this test will be the useast->useast
, uswest->uswest
, and euwest->euwest
connection latencies. the other ones may not be as important, but i think can still be useful so we can compare.
anyway, just noting this down so we have a better idea of how we'll use this test in the future.
Thanks for the review!! bors r=rafiss |
Build succeeded: |
Add roach test and workload test for testing connection latencies.
Release note: None
Making it a private test right now (not available through
cockroach workload
)Need to also add it to teamcity and make sure the results are reported in roachperf, will open PR there.
Resolves #59394