sql: add roach test / workload for testing connection latencies #62166

Merged
merged 2 commits into cockroachdb:master from connection_latency_test
Apr 1, 2021

Conversation

RichardJCai
Contributor

@RichardJCai RichardJCai commented Mar 17, 2021

Add roach test and workload test for testing connection latencies.

Release note: None

Making it a private test right now (not available through cockroach workload)

We also need to add it to TeamCity and make sure the results are reported in roachperf; I will open a PR there.

Resolves #59394

@RichardJCai RichardJCai requested review from rafiss and a team March 17, 2021 20:32
@cockroach-teamcity
Member

This change is Reviewable

@RichardJCai RichardJCai force-pushed the connection_latency_test branch 3 times, most recently from d18f9b8 to b5604be Compare March 17, 2021 20:39
Collaborator

@rafiss rafiss left a comment

glad this wasn't too bad! just some minor comments

@RichardJCai RichardJCai force-pushed the connection_latency_test branch from b5604be to 5104487 Compare March 18, 2021 20:19
Contributor Author

@RichardJCai RichardJCai left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/workload/connectionlatency/connectionlatency.go, line 20 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

let's use github.com/jackc/pgx/v4

Done.


pkg/workload/connectionlatency/connectionlatency.go, line 60 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

is this always 1? it would be nice to be able to run this test in different configurations -- single-node, multi-node in 1 region, and multi-region. that way we can compare how it differs for all these.

take a look at some of the other tests on https://roachperf.crdb.dev/
some of them have nodes=X and multiregion specifiers. not sure how to get that type of configuration, but i think we want it here.

Changed it to 1,3,5 nodes and one multiregion test with 6 nodes. Let me know what you think


pkg/workload/connectionlatency/connectionlatency.go, line 82 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i think we'd want a defer conn.Close() otherwise the connection will remain open forever? (i think -- curious to hear what you see when trying this)

Done.

@RichardJCai RichardJCai force-pushed the connection_latency_test branch 2 times, most recently from 08bdb59 to 7bdefb5 Compare March 18, 2021 20:53
@RichardJCai RichardJCai requested review from rafiss and a team March 18, 2021 20:53
Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)


pkg/cmd/roachtest/connection_latency.go, line 20 at r2 (raw file):

)

func registerConnectionLatencyTest(r *testRegistry) {

oh one more thing i thought of! does this test both insecure and secure clusters? we've seen that connection behaves differently when password authn is needed for non-root users

or if that seems like too many tests, we really only need to test with secure and make sure not to connect with root


pkg/cmd/roachtest/connection_latency.go, line 39 at r2 (raw file):

	geoZones := []string{"us-east1-b", "us-west1-b", "europe-west2-b"}
	if cloud == aws {

i think it's fine if the test only works in GCE


pkg/cmd/roachtest/connection_latency.go, line 58 at r2 (raw file):

	// Copying over multiregion configuration from indexes.go
	numMultiRegionNodes := 6

i think it's more common for there to be 3 nodes per region, so we'd want something more like what the tpcc test does (9 nodes)

@RichardJCai RichardJCai force-pushed the connection_latency_test branch from 7bdefb5 to be049c5 Compare March 23, 2021 16:47
Contributor Author

@RichardJCai RichardJCai left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/cmd/roachtest/connection_latency.go, line 20 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

oh one more thing i thought of! does this test both insecure and secure clusters? we've seen that connection behaves differently when password authn is needed for non-root users

or if that seems like too many tests, we really only need to test with secure and make sure not to connect with root

Made the test only use secure & not root


pkg/cmd/roachtest/connection_latency.go, line 39 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i think it's fine if the test only works in GCE

Done.


pkg/cmd/roachtest/connection_latency.go, line 58 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i think it's more common for there to be 3 nodes per region, so we'd want something more like what the tpcc test does (9 nodes)

Did 9

Collaborator

@rafiss rafiss left a comment

just have questions that we can discuss!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)


pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):

		workloadCmd := fmt.Sprintf(
			`./workload run connectionlatency --user testuser --secure --duration 30s --histograms=%s/stats.json`,

would we want to pass --urls as a flag here? again, just asking since i'm not sure how it all works


pkg/cmd/roachtest/connection_latency.go, line 51 at r4 (raw file):

	geoZonesStr := strings.Join(geoZones, ",")

	nodesConfig := []int{1, 3, 5}

maybe we just need to test the 3-node cluster and the 9-node multiregion cluster. well, the only reason i'm saying this is just to avoid extra cruft, and since i don't have a good sense of how much overhead the additional setup is. if all the extra tests are cheap, then fine to leave as is


pkg/workload/cli/run.go, line 194 at r4 (raw file):

		if len(urls) == 0 {
			crdbDefaultURL := fmt.Sprintf(`postgres://%s@localhost:26257?sslmode=disable`, *user)

i'm not familiar with how this works -- why does it connect to localhost? where does it connect from?


pkg/workload/connectionlatency/connectionlatency.go, line 60 at r4 (raw file):

) (workload.QueryLoad, error) {
	ql := workload.QueryLoad{}
	if len(urls) != 1 {

do we expect the urls length to be 1 even for the multi-node clusters?

Contributor Author

@RichardJCai RichardJCai left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)


pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

would we want to pass --urls as a flag here? again, just asking since i'm not sure how it all works

That's an option if we want to make it more configurable. By default the workload runs on the same machine as the crdb node, so localhost works.


pkg/workload/connectionlatency/connectionlatency.go, line 60 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

do we expect the urls length to be 1 even for the multi-node clusters?

Yeah if we don't specify a url, it'll use the localhost one, so the workload will connect using the node on the same machine. We can change this though

@RichardJCai RichardJCai force-pushed the connection_latency_test branch from be049c5 to a741ac4 Compare March 24, 2021 18:20
Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)


pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):

Previously, RichardJCai (Richard Cai) wrote…

That's an option if we want to make it more configurable. By default the workload runs on the same machine as the crdb node, so localhost works.

which machine does workload run on if it's a multi-node cluster?

either way, for a multi-node cluster, i think we'd want a broad sample by testing connections to all the nodes in the cluster. this would show us issues like the one where we saw that it had to fetch data across different regions.

@RichardJCai RichardJCai force-pushed the connection_latency_test branch 2 times, most recently from 6034e5a to 858dc5b Compare March 25, 2021 21:32
Support running workload with secure mode (sslmode=require) and
allow a user to be passed in.

Previously, roachtests were only run as root on insecure mode leaving
some gaps in our testing.

Release note: None
@RichardJCai RichardJCai force-pushed the connection_latency_test branch from 858dc5b to 57e03f3 Compare March 25, 2021 22:05
Contributor Author

@RichardJCai RichardJCai left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/cmd/roachtest/connection_latency.go, line 42 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

which machine does workload run on if it's a multi-node cluster?

either way, for a multi-node cluster, i think we'd want a broad sample by testing connections to all the nodes in the cluster. this would show us issues like the one where we saw that it had to fetch data across different regions.

Updated so each node will connect to each other node.

I wonder if we should make the tracking a bit more granular for this though, ie region to region latency test.
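
As a rough illustration of "each node will connect to each other node", the sketch below enumerates every ordered (from, to) pair for a cluster of a given size. The pair type and allPairs function are hypothetical names, node numbering from 1 is an assumption, and whether the actual test includes a node connecting to itself is also an assumption made here for simplicity.

```go
package main

import "fmt"

// pair identifies one connection direction: the node running the
// workload (From) and the node it connects to (To).
type pair struct {
	From, To int
}

// allPairs returns every ordered (from, to) combination for nodes 1..numNodes.
// Self-pairs are included here; that is an assumption, not confirmed by the PR.
func allPairs(numNodes int) []pair {
	pairs := make([]pair, 0, numNodes*numNodes)
	for from := 1; from <= numNodes; from++ {
		for to := 1; to <= numNodes; to++ {
			pairs = append(pairs, pair{From: from, To: to})
		}
	}
	return pairs
}

func main() {
	// For the 9-node multi-region cluster this yields 81 directions.
	fmt.Println(len(allPairs(9)))
}
```

The number of pairs grows quadratically with cluster size, which is one reason to keep the set of tested configurations small.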


pkg/cmd/roachtest/connection_latency.go, line 51 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

maybe we just need to test the 3-node cluster and the 9-node multiregion cluster. well, the only reason i'm saying this is just to avoid extra cruft, and since i don't have a good sense of how much overhead the additional setup is. if all the extra tests are cheap, then fine to leave as is

Done.


pkg/workload/cli/run.go, line 194 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i'm not familiar with how this works -- why does it connect to localhost? where does it connect from?

If no URL is specified, we use localhost by default. I believe it's assumed the VM will host the node.
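
For reference, a minimal sketch of how such a default URL could be built when a user and a secure mode are passed in. The localhost:26257 address and the sslmode values (disable, require) come from this PR; the defaultURL helper and its signature are hypothetical and do not reflect the actual code in run.go.

```go
package main

import (
	"fmt"
	"net/url"
)

// defaultURL is a hypothetical helper that builds the fallback connection
// URL used when no URL is given: the node on localhost, with sslmode
// chosen by the secure flag.
func defaultURL(user string, secure bool) string {
	sslmode := "disable"
	if secure {
		sslmode = "require"
	}
	u := url.URL{
		Scheme:   "postgres",
		User:     url.User(user),
		Host:     "localhost:26257",
		RawQuery: url.Values{"sslmode": {sslmode}}.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(defaultURL("testuser", true))
	fmt.Println(defaultURL("root", false))
}
```

Building the URL with net/url rather than fmt.Sprintf means a username containing reserved characters is escaped correctly.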

@RichardJCai RichardJCai force-pushed the connection_latency_test branch from 57e03f3 to 751c307 Compare March 26, 2021 20:40
@RichardJCai
Contributor Author

which machine does workload run on if it's a multi-node cluster?

either way, for a multi-node cluster, i think we'd want a broad sample by testing connections to all the nodes in the cluster. this would show us issues like the one where we saw that it had to fetch data across different regions.

@rafiss following up on this point: I mentioned wondering whether we should make the tracking more granular, i.e. a region-to-region latency test. Is there a way to check which region a node is in? I couldn't find one, so for now the tracked latencies can vary depending on which node is connecting to which. Do you think this is sufficient?

I've attached the output and some of the p99(ms) latencies are really bad, ~800ms for connect.
run_202535.430_n1-9_workload_run_connectionlatency.log

@RichardJCai RichardJCai requested a review from rafiss March 26, 2021 20:44
@RichardJCai RichardJCai force-pushed the connection_latency_test branch from 751c307 to f344824 Compare March 29, 2021 15:27
@rafiss
Collaborator

rafiss commented Mar 30, 2021

I think at the moment, a p99 of 800ms might be expected. I think the auth code makes either 2 or 3 KV roundtrips, so if all those are cross-region, this could happen. See these related issues #58869 #36160

But yeah I agree it would be nice to be able to track this granularly. I think SHOW LOCALITY might give you what you want to find the node's region: https://www.cockroachlabs.com/docs/stable/show-locality.html

The other issue with this though is that the leaseholder placements won't be fixed across different test runs. So e.g. uswest1 could be slow one night, then the next night useast1 could be slow.

Maybe the test should do an ALTER RANGE system CONFIGURE ZONE to pin the system leaseholder to the same region in each test? IDK if it's super needed though. it might just be fine that different regions are slow across runs. let's start by not changing any zone configs and if the test is too hard to understand we can add that later.

@RichardJCai RichardJCai force-pushed the connection_latency_test branch from f344824 to 79ee93d Compare March 31, 2021 02:43
Contributor Author

@RichardJCai RichardJCai left a comment

Updated the output to show which region is used. It assumes that the regions/zones stay consistent throughout the test's history.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            397           13.2    223.5    226.5    234.9    234.9    251.7  connect-to-cloud=gce,region=europe-west2,zone=europe-west2-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            150            5.0    427.9    436.2    436.2    436.2    453.0  connect-to-cloud=gce,region=us-east1,zone=us-east1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            108            3.6    587.2    604.0    604.0    604.0    604.0  connect-to-cloud=gce,region=us-west1,zone=us-west1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            653           21.8    370.1    226.5    738.2    738.2    738.2  select

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

@RichardJCai RichardJCai force-pushed the connection_latency_test branch from 79ee93d to 9b02137 Compare March 31, 2021 19:24
@RichardJCai
Contributor Author

Okay I think I finally have this in a state I'm satisfied with.

Example output

Latencies connecting from us-west

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            154            5.1    519.1    520.1    536.9    570.4    570.4  connect-from-us-west1-b-to-cloud=gce,region=europe-west2,zone=europe-west2-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            208            6.9    444.6    453.0    453.0    453.0    453.0  connect-from-us-west1-b-to-cloud=gce,region=us-east1,zone=us-east1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            445           14.8    267.4    268.4    285.2    285.2    302.0  connect-from-us-west1-b-to-cloud=gce,region=us-west1,zone=us-west1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            805           26.8    404.9    285.2    671.1    671.1    704.6  select

Latencies connecting from us-east

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            192            6.4    447.3    453.0    469.8    536.9    570.4  connect-from-us-east1-b-to-cloud=gce,region=europe-west2,zone=europe-west2-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0           1194           39.8     99.2    100.7    104.9    109.1    318.8  connect-from-us-east1-b-to-cloud=gce,region=us-east1,zone=us-east1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            200            6.7    464.1    469.8    486.5    536.9    536.9  connect-from-us-east1-b-to-cloud=gce,region=us-west1,zone=us-west1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0           1586           52.9    207.0    100.7    536.9    570.4    637.5  select

Latencies connecting from eu-west

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0           8340          278.0     13.2     13.1     18.9     22.0     33.6  connect-from-europe-west2-b-to-cloud=gce,region=europe-west2,zone=europe-west2-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            168            5.6    541.7    570.4    570.4    570.4    570.4  connect-from-europe-west2-b-to-cloud=gce,region=us-east1,zone=us-east1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0            116            3.9    785.6    805.3    805.3    838.9    838.9  connect-from-europe-west2-b-to-cloud=gce,region=us-west1,zone=us-west1-b

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   30.0s        0           8622          287.4     37.6     14.2     21.0    906.0    973.1  select
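
The histogram names in the output above appear to combine the client's zone with the server's locality string. A minimal sketch of that naming, assuming the format connect-from-<zone>-to-<locality> as read off the output; parseLocality and histogramName are hypothetical helpers, not the functions used in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLocality splits a locality string such as
// "cloud=gce,region=us-east1,zone=us-east1-b" into its key/value tiers.
// Tiers without an "=" are skipped.
func parseLocality(locality string) map[string]string {
	tiers := make(map[string]string)
	for _, tier := range strings.Split(locality, ",") {
		kv := strings.SplitN(tier, "=", 2)
		if len(kv) == 2 {
			tiers[kv[0]] = kv[1]
		}
	}
	return tiers
}

// histogramName labels a connection histogram with the zone the client
// runs in and the full locality of the node it connects to.
func histogramName(fromZone, toLocality string) string {
	return fmt.Sprintf("connect-from-%s-to-%s", fromZone, toLocality)
}

func main() {
	locality := "cloud=gce,region=us-east1,zone=us-east1-b"
	fmt.Println(parseLocality(locality)["region"])
	fmt.Println(histogramName("us-west1-b", locality))
}
```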

@RichardJCai RichardJCai force-pushed the connection_latency_test branch from 9b02137 to 59c67c4 Compare March 31, 2021 19:37
Add roach test and workload test for testing connection latencies.

Release note: None
@RichardJCai RichardJCai force-pushed the connection_latency_test branch from 59c67c4 to 3a6415a Compare March 31, 2021 19:45
Collaborator

@rafiss rafiss left a comment

cool!! this looks great

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @RichardJCai)


pkg/workload/connectionlatency/connectionlatency.go, line 82 at r6 (raw file):

		var locality string
		err = conn.QueryRow(ctx, "SHOW LOCALITY").Scan(&locality)

one thing i just realized is that it is expected for the latency to be high when connecting from one region to a separate region, and there isn't much we can do about that

so the most important metrics out of this test will be the useast->useast, uswest->uswest, and euwest->euwest connection latencies. the other ones may not be as important, but i think can still be useful so we can compare.

anyway, just noting this down so we have a better idea of how we'll use this test in the future.
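
One way to pick out the metrics this comment calls most important is to filter the histogram names for connections that stay in the same place. The sketch below is an assumption-laden illustration: it relies on the connect-from-<zone>-to-<locality> naming seen in the example output, uses a matching zone as a proxy for "same region", and sameZone is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// sameZone reports whether a histogram name of the assumed form
// "connect-from-<zone>-to-<locality>" describes a connection whose
// target locality has the same zone as the client.
func sameZone(name string) bool {
	rest := strings.TrimPrefix(name, "connect-from-")
	if rest == name {
		return false // not a connect histogram, e.g. "select"
	}
	parts := strings.SplitN(rest, "-to-", 2)
	if len(parts) != 2 {
		return false
	}
	fromZone, locality := parts[0], parts[1]
	for _, tier := range strings.Split(locality, ",") {
		if tier == "zone="+fromZone {
			return true
		}
	}
	return false
}

func main() {
	names := []string{
		"connect-from-us-east1-b-to-cloud=gce,region=us-east1,zone=us-east1-b",
		"connect-from-us-east1-b-to-cloud=gce,region=us-west1,zone=us-west1-b",
		"select",
	}
	for _, n := range names {
		fmt.Printf("%v %s\n", sameZone(n), n)
	}
}
```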

@RichardJCai
Contributor Author

Thanks for the review!!

bors r=rafiss

@craig craig bot merged commit 1764c5f into cockroachdb:master Apr 1, 2021
@craig
Contributor

craig bot commented Apr 1, 2021

Build succeeded:

Development

Successfully merging this pull request may close these issues.

roachtest: create a workload that tests latencies of repeated connection attempts
3 participants