Add panels for CQL request and response sizes #1928

Closed
amnonh opened this issue Mar 8, 2023 · 12 comments · Fixed by #1969
Labels
enhancement New feature or request

Comments

@amnonh
Collaborator

amnonh commented Mar 8, 2023

Scylla now reports CQL request and response size metrics by type; we should show them.

@amnonh amnonh added the enhancement New feature or request label Mar 8, 2023
@vladzcloudius
Contributor

@avikivity please comment on the below.

I think we want the following information to be added to the dashboards eventually (a rough PromQL sketch follows the list):

  • All the below per Service Level
    • Payload size rate: number of received payload bytes per second
    • Response size rate: number of bytes in responses sent per second
    • Average received payload size
    • Average sent response size
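
A rough PromQL sketch of the two rate panels (the metric names scylla_transport_cql_request_bytes / scylla_transport_cql_response_bytes and the scheduling_group_name label are assumptions taken from the queries discussed later in this thread):

# received payload bytes per second, per Service Level (scheduling group)
sum by (scheduling_group_name) (rate(scylla_transport_cql_request_bytes[1m]))

# response bytes sent per second, per Service Level (scheduling group)
sum by (scheduling_group_name) (rate(scylla_transport_cql_response_bytes[1m]))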

@vladzcloudius
Contributor

Here is a demo of how per-SL metrics look:

[screenshot: demo of per-Service-Level metrics panels]

@amnonh
Collaborator Author

amnonh commented May 2, 2023

@vladzcloudius in what Scylla version is this available?

@vladzcloudius
Contributor

@vladzcloudius in what Scylla version is this available?

It's available in master.
Backports to OSS releases are out of my scope.
2023.1 is supposed to have it. I don't know about the rest.

cc @mykaul @DoronArazii

@DoronArazii

2023.1.0-rc5 has it (meaning it will be part of 2023.1.0).
There's also a request to have it on the 2022.2 branch, targeted to 2022.2.7, since 2022.2.6 is already cooking.
Avi should take care of this.

@amnonh
Collaborator Author

amnonh commented May 2, 2023

@DoronArazii What about open source versions?

@DoronArazii

DoronArazii commented May 2, 2023

AFAIU the only OSS version that will have it is 5.3 (through master), which is equal to 2023.2.0.
@avikivity is backporting only to Ent versions.
scylladb/scylladb#13305 (comment)
Avi, please confirm.

@avikivity
Member

Yes, backported to 2023.1 and 2022.1.

@amnonh
Collaborator Author

amnonh commented May 3, 2023

@vladzcloudius just to make sure: in the formula you used in the example (it would be nice if you could copy-paste it into the issue), you divided response size by request count; I assume the request count is used for both request and response.

What I'm planning to do is sum(rate(scylla_transport_cql_request_bytes{scheduling_group_name=~"$scheduling_group"}[1m]))/sum(rate(scylla_transport_cql_requests_count{scheduling_group_name=~"$scheduling_group"}[1m]))

and the same for the response, per scheduling_group
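
Spelled out, the response-side query would presumably be (a sketch; the scylla_transport_cql_response_bytes name is an assumption mirroring the request_bytes naming, and the denominator reuses the same request counter):

sum(rate(scylla_transport_cql_response_bytes{scheduling_group_name=~"$scheduling_group"}[1m]))
/
sum(rate(scylla_transport_cql_requests_count{scheduling_group_name=~"$scheduling_group"}[1m]))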

Let me know if I'm correct

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

Can you please take a look at #1969

@vladzcloudius
Contributor

@vladzcloudius just to make sure: in the formula you used in the example (it would be nice if you could copy-paste it into the issue), you divided response size by request count; I assume the request count is used for both request and response.

Correct, but there is a correction to how you used it (see below).

What I'm planning to do is sum(rate(scylla_transport_cql_request_bytes{scheduling_group_name=~"$scheduling_group"}[1m]))/sum(rate(scylla_transport_cql_requests_count{scheduling_group_name=~"$scheduling_group"}[1m]))

and the same for the response, per scheduling_group

Let me know if I'm correct

Not exactly.

  • All graphs should allow filtering by:
    • Node
    • shard
    • cluster
    • DC
    • scheduling group
      (Your PR below seems to allow the filtering as I described above)
  • CQL requests can come as EXECUTE, as QUERY or as PREPARE. Hence your numerator and denominator have to be a sum over the above opcode kinds - as in my example (see the sketch after this list). Adding all types of opcodes together can create a rather convoluted picture under certain conditions, e.g. if there is an authorization storm/attack. This is going to pull averages down. More on this below.
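
As a concrete illustration, a sketch of the average request size restricted to those opcodes (assuming the byte and count metrics carry the kind label used in the shorthand formulas below):

sum(rate(scylla_transport_cql_request_bytes{kind=~"EXECUTE|QUERY|PREPARE"}[1m]))
/
sum(rate(scylla_transport_cql_requests_count{kind=~"EXECUTE|QUERY|PREPARE"}[1m]))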

But this is a very formal approach.
Now let's see how we can construct the formula to make the data more practical.

Since payload sizes are usually as follows:

  • Reads:
    • Request payload: small
    • Response payload: large
  • Writes:
    • Request payload: large
    • Response payload: small
  • Prepare:
    • Both payloads: small
  • Authorization opcodes:
    • Both payloads: small

In the case where there is an equal number of reads and writes, the graphs as suggested in the demo or by you are going to show roughly half of the dominating payload sizes (request payload for writes and response payload for reads).

If the ratio shifts to either side, e.g. if there is 1 read for every 10 writes, the response payload graph is going to become even less useful since it is going to show a value ~10x smaller than the average read response payload size.

Here is a suggestion: why don't we do the following:
we can have (in addition to the general ones which sum everything together, like you did) an "Estimated CQL Reads Response Payload Size" which is going to have the following formula (I'm omitting some labels and functions to keep things shorter):

(response_bytes{kind="EXECUTE"} + response_bytes{kind="QUERY"})
/
scylla_cql_reads

And for "Estimated CQL Writes Request Payload Size"

(request_bytes{kind="EXECUTE"} + request_bytes{kind="QUERY"})
/
scylla_cql_inserts + scylla_cql_updates
(AFAIR statements in batches update these counters for each individual statement)

These are going to show values slightly larger than the actual ones, but IMO they are going to be more useful in practice, since a sharp change in the reads:writes ratio is going to cause only a slight change on the corresponding graph as long as the rate of the corresponding request type and the corresponding payload size stay the same (e.g. reads rate and response payload size for the "Estimated CQL Reads Response Payload Size").
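
Spelled out as full PromQL, the reads estimate would look roughly like this (a sketch; the full byte-metric names and the [1m] rate windows are assumptions omitted from the shorthand above):

sum(rate(scylla_transport_cql_response_bytes{kind=~"EXECUTE|QUERY"}[1m]))
/
sum(rate(scylla_cql_reads[1m]))

and the writes estimate like this:

sum(rate(scylla_transport_cql_request_bytes{kind=~"EXECUTE|QUERY"}[1m]))
/
(sum(rate(scylla_cql_inserts[1m])) + sum(rate(scylla_cql_updates[1m])))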

Makes sense?

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

That's intentional: the former counts the number of requests while the latter counts the number of bytes in each request.

Can you please take a look at #1969

Looked. Let's decide on items above first.

@amnonh
Collaborator Author

amnonh commented May 4, 2023

  • CQL requests can come as EXECUTE, as QUERY or as PREPARE. Hence your numerator and denominator have to be a sum over the above opcode kinds - as in my example. Adding all types of opcodes together can create a rather convoluted picture under certain conditions, e.g. if there is an authorization storm/attack. This is going to pull averages down. More on this below.

In your example, I saw you just combined all kinds together (which is the equivalent of dropping a label and letting the sum do it for you).

But this is a very formal approach. Now let's see how we can construct the formula to make the data more practical.

Since payload sizes are usually as follows:

  • Reads:

    • Request payload: small
    • Response payload: large
  • Writes:

    • Request payload: large
    • Response payload: small
  • Prepare:

    • Both payloads: small
  • Authorization opcodes:

    • Both payloads: small

In the case where there is an equal number of reads and writes, the graphs as suggested in the demo or by you are going to show roughly half of the dominating payload sizes (request payload for writes and response payload for reads).

I find this confusing; the name suggests that requests are all the messages sent to Scylla (both select and insert)
and responses are the messages sent from Scylla (again, both select and insert).

If the ratio shifts to either side, e.g. if there is 1 read for every 10 writes, the response payload graph is going to become even less useful since it is going to show a value ~10x smaller than the average read response payload size.

Here is a suggestion: why don't we do the following: we can have (in addition to the general ones which sum everything together, like you did) an "Estimated CQL Reads Response Payload Size" which is going to have the following formula (I'm omitting some labels and functions to keep things shorter):

(response_bytes{kind="EXECUTE"} + response_bytes{kind="QUERY"})
/
scylla_cql_reads

And for "Estimated CQL Writes Request Payload Size"

(request_bytes{kind="EXECUTE"} + request_bytes{kind="QUERY"})
/
scylla_cql_inserts + scylla_cql_updates
(AFAIR statements in batches update these counters for each individual statement)

These are going to show values slightly larger than the actual ones, but IMO they are going to be more useful in practice, since a sharp change in the reads:writes ratio is going to cause only a slight change on the corresponding graph as long as the rate of the corresponding request type and the corresponding payload size stay the same (e.g. reads rate and response payload size for the "Estimated CQL Reads Response Payload Size").

Makes sense?

Let's do a call so I can get a better understanding

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

That's intentional: the former counts the number of requests while the latter counts the number of bytes in each request.

Can you please take a look at #1969

Looked. Let's decide on items above first.

@vladzcloudius
Contributor

  • CQL requests can come as EXECUTE, as QUERY or as PREPARE. Hence your numerator and denominator have to be a sum over the above opcode kinds - as in my example. Adding all types of opcodes together can create a rather convoluted picture under certain conditions, e.g. if there is an authorization storm/attack. This is going to pull averages down. More on this below.

In your example, I saw you just combined all kinds together (which is the equivalent of dropping a label and letting the sum do it for you).

No, I did not. Please look again.

But this is a very formal approach. Now let's see how we can construct the formula to make the data more practical.
Since payload sizes are usually as follows:

  • Reads:

    • Request payload: small
    • Response payload: large
  • Writes:

    • Request payload: large
    • Response payload: small
  • Prepare:

    • Both payloads: small
  • Authorization opcodes:

    • Both payloads: small

In the case where there is an equal number of reads and writes, the graphs as suggested in the demo or by you are going to show roughly half of the dominating payload sizes (request payload for writes and response payload for reads).

I find this confusing; the name suggests that requests are all the messages sent to Scylla (both select and insert) and responses are the messages sent from Scylla (again, both select and insert).

That's correct. What's the confusion?

If the ratio shifts to either side, e.g. if there is 1 read for every 10 writes, the response payload graph is going to become even less useful since it is going to show a value ~10x smaller than the average read response payload size.
Here is a suggestion: why don't we do the following: we can have (in addition to the general ones which sum everything together, like you did) an "Estimated CQL Reads Response Payload Size" which is going to have the following formula (I'm omitting some labels and functions to keep things shorter):

(response_bytes{kind="EXECUTE"} + response_bytes{kind="QUERY"})
/
scylla_cql_reads

And for "Estimated CQL Writes Request Payload Size"

(request_bytes{kind="EXECUTE"} + request_bytes{kind="QUERY"})
/
scylla_cql_inserts + scylla_cql_updates
(AFAIR statements in batches update these counters for each individual statement)

These are going to show values slightly larger than the actual ones, but IMO they are going to be more useful in practice, since a sharp change in the reads:writes ratio is going to cause only a slight change on the corresponding graph as long as the rate of the corresponding request type and the corresponding payload size stay the same (e.g. reads rate and response payload size for the "Estimated CQL Reads Response Payload Size").
Makes sense?

Let's do a call so I can get a better understanding

Sure. Please send an invite.

Also, I've noticed that it's requests_count but request_bytes, which is not a big deal, just inconsistent.

That's intentional: the former counts the number of requests while the latter counts the number of bytes in each request.

Can you please take a look at #1969

Looked. Let's decide on items above first.
