Add panels for CQL request and response sizes #1928

Closed
amnonh opened this issue Mar 8, 2023 · 12 comments · Fixed by #1969
Labels
enhancement New feature or request

Comments

@amnonh
Collaborator

amnonh commented Mar 8, 2023

Scylla now reports CQL request and response size metrics by type; we should show them.

@amnonh amnonh added the enhancement New feature or request label Mar 8, 2023
@vladzcloudius
Contributor

@avikivity please comment on the below.

I think we want the following information to be added to the dashboards eventually (a rough PromQL sketch follows the list):

  • All the below per Service Level
    • Payload size rate: number of received payload bytes per second
    • Response size rate: number of bytes in responses sent per second
    • Average received payload size
    • Average sent response size
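
A rough PromQL sketch of the two rate panels (the metric names scylla_transport_cql_request_bytes / scylla_transport_cql_response_bytes and the scheduling_group_name label are assumptions taken from the queries discussed later in this thread):

# received payload bytes per second, per Service Level (scheduling group)
sum by (scheduling_group_name) (rate(scylla_transport_cql_request_bytes[1m]))

# response bytes sent per second, per Service Level (scheduling group)
sum by (scheduling_group_name) (rate(scylla_transport_cql_response_bytes[1m]))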

@vladzcloudius
Contributor

Here is a demo of how per-SL metrics look:

[screenshot: demo of per-Service-Level metrics panels]

@amnonh
Collaborator Author

amnonh commented May 2, 2023

@vladzcloudius in what Scylla version is this available?

@vladzcloudius
Contributor

@vladzcloudius in what Scylla version is this available?

It's available in master.
Backports to OSS releases are out of my scope.
2023.1 is supposed to have it. I don't know about the rest.

cc @mykaul @DoronArazii

@DoronArazii

2023.1.0-rc5 has it (meaning it will be part of 2023.1.0).
There's also a request to have it on the 2022.2 branch, targeted to 2022.2.7, since 2022.2.6 is already cooking.
Avi should take care of this.

@amnonh
Collaborator Author

amnonh commented May 2, 2023

@DoronArazii What about open source versions?

@DoronArazii

DoronArazii commented May 2, 2023

AFAIU the only OSS version that will have it is 5.3 (through master), which is equal to 2023.2.0.
@avikivity is backporting only to Ent versions.
scylladb/scylladb#13305 (comment)
Avi, please confirm.

@avikivity
Member

Yes, backported to 2023.1 and 2022.1.

@amnonh
Collaborator Author

amnonh commented May 3, 2023

@vladzcloudius just to make sure: in the formula you used in the example (it would be nice if you could copy-paste it into the issue), you divided response size by request count; I assume the request count is used for both request and response.

What I'm planning to do is sum(rate(scylla_transport_cql_request_bytes{scheduling_group_name=~"$scheduling_group"}[1m]))/sum(rate(scylla_transport_cql_requests_count{scheduling_group_name=~"$scheduling_group"}[1m]))

and the same for the response, per scheduling_group
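
Spelled out, the response-side query would presumably be (a sketch; the scylla_transport_cql_response_bytes name is an assumption mirroring the request_bytes naming, and the denominator reuses the same request counter):

sum(rate(scylla_transport_cql_response_bytes{scheduling_group_name=~"$scheduling_group"}[1m]))
/
sum(rate(scylla_transport_cql_requests_count{scheduling_group_name=~"$scheduling_group"}[1m]))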

Let me know if I'm correct

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

Can you please take a look at #1969

@vladzcloudius
Contributor

@vladzcloudius just to make sure: in the formula you used in the example (it would be nice if you could copy-paste it into the issue), you divided response size by request count; I assume the request count is used for both request and response.

Correct, but there is a correction to how you used it (see below).

What I'm planning to do is sum(rate(scylla_transport_cql_request_bytes{scheduling_group_name=~"$scheduling_group"}[1m]))/sum(rate(scylla_transport_cql_requests_count{scheduling_group_name=~"$scheduling_group"}[1m]))

and the same for the response, per scheduling_group

Let me know if I'm correct

Not exactly.

  • All graphs should allow filtering by:
    • Node
    • shard
    • cluster
    • DC
    • scheduling group
      (Your PR below seems to allow the filtering as I described above)
  • CQL requests can come as EXECUTE, as QUERY or as PREPARE. Hence your numerator and denominator have to be a sum over the above opcode kinds - as in my example (see the sketch after this list). Adding all types of opcodes together can create a rather convoluted picture under certain conditions, e.g. if there is an authorization storm/attack. This is going to pull averages down. More on this below.
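
As a concrete illustration, a sketch of the average request size restricted to those opcodes (assuming the byte and count metrics carry the kind label used in the shorthand formulas below):

sum(rate(scylla_transport_cql_request_bytes{kind=~"EXECUTE|QUERY|PREPARE"}[1m]))
/
sum(rate(scylla_transport_cql_requests_count{kind=~"EXECUTE|QUERY|PREPARE"}[1m]))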

But this is a very formal approach.
Now let's see how we can construct the formula to make the data more practical.

Since payload sizes are usually as follows:

  • Reads:
    • Request payload: small
    • Response payload: large
  • Writes:
    • Request payload: large
    • Response payload: small
  • Prepare:
    • Both payloads: small
  • Authorization opcodes:
    • Both payloads: small

In the case where there is an equal number of reads and writes, the graphs as suggested in the demo or by you are going to show roughly half of the dominating payload sizes (request payload for writes and response payload for reads).

If the ratio shifts to either side, e.g. if there is 1 read for every 10 writes, the response payload graph is going to become even less useful since it is going to show a value ~10x smaller than the average read response payload size.

Here is a suggestion: why don't we do the following:
we can have (in addition to the general ones which sum everything together, like you did) an "Estimated CQL Reads Response Payload Size" which is going to have the following formula (I'm omitting some labels and functions to keep things shorter):

(response_bytes{kind="EXECUTE"} + response_bytes{kind="QUERY"})
/
scylla_cql_reads

And for "Estimated CQL Writes Request Payload Size"

(request_bytes{kind="EXECUTE"} + request_bytes{kind="QUERY"})
/
scylla_cql_inserts + scylla_cql_updates
(AFAIR statements in batches update these counters for each individual statement)

These are going to show values slightly larger than the actual ones, but IMO they are going to be more useful in practice, since a sharp change in the reads:writes ratio is going to cause only a slight change on the corresponding graph as long as the rate of the corresponding request type and the corresponding payload size stay the same (e.g. reads rate and response payload size for the "Estimated CQL Reads Response Payload Size").
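
Spelled out as full PromQL, the reads estimate would look roughly like this (a sketch; the full byte-metric names and the [1m] rate windows are assumptions omitted from the shorthand above):

sum(rate(scylla_transport_cql_response_bytes{kind=~"EXECUTE|QUERY"}[1m]))
/
sum(rate(scylla_cql_reads[1m]))

and the writes estimate like this:

sum(rate(scylla_transport_cql_request_bytes{kind=~"EXECUTE|QUERY"}[1m]))
/
(sum(rate(scylla_cql_inserts[1m])) + sum(rate(scylla_cql_updates[1m])))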

Makes sense?

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

That's intentional: the former counts the number of requests while the latter counts the number of bytes in each request.

Can you please take a look at #1969

Looked. Let's decide on items above first.

@amnonh
Collaborator Author

amnonh commented May 4, 2023

  • CQL requests can come as EXECUTE, as QUERY or as PREPARE. Hence your numerator and denominator have to be a sum over the above opcode kinds - as in my example. Adding all types of opcodes together can create a rather convoluted picture under certain conditions, e.g. if there is an authorization storm/attack. This is going to pull averages down. More on this below.

In your example, I saw you just combined all kinds together (which is the equivalent of dropping a label and letting the sum do it for you).

But this is a very formal approach. Now let's see how we can construct the formula to make the data more practical.

Since payload sizes are usually as follows:

  • Reads:

    • Request payload: small
    • Response payload: large
  • Writes:

    • Request payload: large
    • Response payload: small
  • Prepare:

    • Both payloads: small
  • Authorization opcodes:

    • Both payloads: small

In the case where there is an equal number of reads and writes, the graphs as suggested in the demo or by you are going to show roughly half of the dominating payload sizes (request payload for writes and response payload for reads).

I find this confusing; the name suggests that requests are all the messages sent to Scylla (both select and insert)
and responses are the messages sent from Scylla (again, both select and insert).

If the ratio shifts to either side, e.g. if there is 1 read for every 10 writes, the response payload graph is going to become even less useful since it is going to show a value ~10x smaller than the average read response payload size.

Here is a suggestion: why don't we do the following: we can have (in addition to the general ones which sum everything together, like you did) an "Estimated CQL Reads Response Payload Size" which is going to have the following formula (I'm omitting some labels and functions to keep things shorter):

(response_bytes{kind="EXECUTE"} + response_bytes{kind="QUERY"})
/
scylla_cql_reads

And for "Estimated CQL Writes Request Payload Size"

(request_bytes{kind="EXECUTE"} + request_bytes{kind="QUERY"})
/
scylla_cql_inserts + scylla_cql_updates
(AFAIR statements in batches update these counters for each individual statement)

These are going to show values slightly larger than the actual ones, but IMO they are going to be more useful in practice, since a sharp change in the reads:writes ratio is going to cause only a slight change on the corresponding graph as long as the rate of the corresponding request type and the corresponding payload size stay the same (e.g. reads rate and response payload size for the "Estimated CQL Reads Response Payload Size").

Makes sense?

Let's do a call so I can get a better understanding

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

That's intentional: the former counts the number of requests while the latter counts the number of bytes in each request.

Can you please take a look at #1969

Looked. Let's decide on items above first.

@vladzcloudius
Contributor

  • CQL requests can come as EXECUTE, as QUERY or as PREPARE. Hence your numerator and denominator have to be a sum over the above opcode kinds - as in my example. Adding all types of opcodes together can create a rather convoluted picture under certain conditions, e.g. if there is an authorization storm/attack. This is going to pull averages down. More on this below.

In your example, I saw you just combined all kinds together (which is the equivalent of dropping a label and letting the sum do it for you).

No, I did not. Please look again.

But this is a very formal approach. Now let's see how we can construct the formula to make the data more practical.
Since payload sizes are usually as follows:

  • Reads:

    • Request payload: small
    • Response payload: large
  • Writes:

    • Request payload: large
    • Response payload: small
  • Prepare:

    • Both payloads: small
  • Authorization opcodes:

    • Both payloads: small

In the case where there is an equal number of reads and writes, the graphs as suggested in the demo or by you are going to show roughly half of the dominating payload sizes (request payload for writes and response payload for reads).

I find this confusing; the name suggests that requests are all the messages sent to Scylla (both select and insert) and responses are the messages sent from Scylla (again, both select and insert).

That's correct. What's the confusion?

If the ratio shifts to either side, e.g. if there is 1 read for every 10 writes, the response payload graph is going to become even less useful since it is going to show a value ~10x smaller than the average read response payload size.
Here is a suggestion: why don't we do the following: we can have (in addition to the general ones which sum everything together, like you did) an "Estimated CQL Reads Response Payload Size" which is going to have the following formula (I'm omitting some labels and functions to keep things shorter):

(response_bytes{kind="EXECUTE"} + response_bytes{kind="QUERY"})
/
scylla_cql_reads

And for "Estimated CQL Writes Request Payload Size"

(request_bytes{kind="EXECUTE"} + request_bytes{kind="QUERY"})
/
scylla_cql_inserts + scylla_cql_updates
(AFAIR statements in batches update these counters for each individual statement)

These are going to show values slightly larger than the actual ones, but IMO they are going to be more useful in practice, since a sharp change in the reads:writes ratio is going to cause only a slight change on the corresponding graph as long as the rate of the corresponding request type and the corresponding payload size stay the same (e.g. reads rate and response payload size for the "Estimated CQL Reads Response Payload Size").
Makes sense?

Let's do a call so I can get a better understanding

Sure. Please send an invite.

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

That's intentional: the former counts the number of requests while the latter counts the number of bytes in each request.

Can you please take a look at #1969

Looked. Let's decide on items above first.
