Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feedback: Submodule box.stat #851

Closed
TarantoolBot opened this issue Aug 2, 2019 · 12 comments
Closed

feedback: Submodule box.stat #851

TarantoolBot opened this issue Aug 2, 2019 · 12 comments
Assignees
Labels
feature A new functionality need feedback [special status] On hold, awaiting feedback reference [location] Tarantool manual, Reference part server [area] Task relates to Tarantool's server (core) functionality

Comments

@TarantoolBot
Copy link
Collaborator

<…>SERT:
total: 0
rps: 0
AUTH:
total: 0
rps: 0
|ERROR|:
total: 0
rps: 0
UPDATE:
total: 0
rps: 0
...
<…>

https://www.tarantool.io/en/doc/1.10/book/box/box_stat/

What does 'ERROR' field means? I don't find any description about that on this page.

@lenkis lenkis added 1.10 feature A new functionality reference [location] Tarantool manual, Reference part server [area] Task relates to Tarantool's server (core) functionality labels Aug 2, 2019
@Totktonada
Copy link
Member

"ERROR" is the count of requests that resulted in an error.

I had assumed the following questions. Does it apply only to iproto requests or for console commands / application logic / background fibers too? Does 'resulted in an error' mean IPROTO_ERROR responses? Or it applies to errors that are catched by pcall / xpcall too (they do not lead to IPROTO_ERROR response)?

I don't sure, but as I see it is counted in an error constructor for some error types. It does not matter whether it leads to an IPROTO_ERROR response or whether it was pcalled. It does not matter whether it caused by an iproto request, from console or from an application code.

We need to understand what is the exact list of errors, which are counted here. And verify my observations above.

@Totktonada Totktonada reopened this Aug 13, 2019
@alyapunov
Copy link

alyapunov commented Aug 14, 2019 via email

@pgulutzan
Copy link
Contributor

Sure, I can verify what Mr Turenko says and am sure Mr Osipov is right.
My random way of trying to verify looked like this:
box.space.T:drop()
box.schema.space.create('T')
box.space.T:create_index('I')
box.space.T:insert{1}
box.space.T:insert{1}
box.stat().ERROR
x, y = pcall(function() box.space.T:insert{1} end)
box.stat().ERROR
box.space.space_1:drop()
s = box.schema.space.create('space_1')
i = s:create_index('space_1_index',{})
function on_replace_function () box.space.T:insert{1} end
box.space.space_1:on_replace(on_replace_function)
box.stat().ERROR
box.space.space_1:insert{1,'Hi'}
box.stat().ERROR
box.stat().ERROR
x, y = pcall(function() no_such_function() end)
box.stat().ERROR
-- box.stat().ERROR
-- box.execute([[CREATE TABLE t (s1 INT PRIMARY KEY);]])
-- box.stat().ERROR
box.stat().ERROR
fiber = require('fiber')
function function_name()
box.space.T:insert{1}
end
fiber_object = fiber.create(function_name)
box.stat().ERROR
box.stat().ERROR
box.error({code=1, reason='This tuple already has 3 fields'})
box.stat().ERROR
fio = require('fio')
errno = require('errno')
box.stat().ERROR
f = fio.open('/tmp/xxxxq.txt', {'O_RDONLY' })
f:read(4096)
f:close()
box.stat().ERROR
box.stat().ERROR
os.execute('rm xxxxxxx')
box.stat().ERROR

From that I see that:
If there is an error, ERROR goes up.
If there is an error with pcall of an insert, ERROR goes up.
If there is an error with pcall of a nonexistent function, ERROR does not go up.
If there is an error in a trigger, ERROR goes up.
if the error happens via box.execute, ERROR goes up.
If the error happens via fiber.create, ERROR goes up.
If the error happens via box.error, ERROR goes up.
If the error is in fio, ERROR does not go up.
If the error is in os, ERROR does not go up.

So I can say that:
"If and only if an error happens during execution of any function
in the box module, regardless whether execution is
within a background fiber or within a trigger or
due to SQL, ERROR count goes up."
I thought that is more detail than necessary because
it is reasonable that box.stat().ERROR would be about
all box error statistics and nothing else, but it does
no harm.

@Totktonada
Copy link
Member

In brief: I think we should be as precise here as we can to make the statistics
useful for a user.


box.stat().ERROR is increased when a ClientError or its
inheritor is created. ClientError (and its inheritors) does not created
outside of src/box. However some src/box/lua modules are exposed via
require('foo') and not via box global. It is not obvious that errors from
such modules will be counted too. Example:

key_def = require('key_def')
box.cfg{}
box.stat().ERROR -- total: 0
key_def.new({{fieldno = 1, type = 'integer', collation_id = 1024}})
---
- error: 'Wrong index options (field 1): collation was not found by ID'
...
box.stat().ERROR -- total: 1

Also box functions can raise usual Lua text errors. Example:

box.cfg{}
ffi = require('ffi')
box.stat().ERROR -- total: 0
box.tuple.new(ffi.new('char *'))
---
- error: unsupported Lua type 'cdata'
...
box.stat().ERROR -- total: 0

Or, say, key_def, which is part of box, can raise IllegalParams, which does not
increase the statistics:

box.cfg{}
key_def = require('key_def')
box.stat().ERROR -- total: 0
---
- error: fieldno must not be nil
...
box.stat().ERROR -- total: 0

Calling an undefined Lua function is counted in the statistics as on a server
as well as on a client:

box.cfg{listen=3301}
box.schema.user.grant('guest','read,write,execute','universe')
net_box = require('net.box')
c = net_box.connect('localhost:3301')
box.stat().ERROR -- total: 0
c:call('foo')
---
- error: Procedure 'foo' is not defined
...
box.stat().ERROR -- total: 2

However c:call('error', {'hello'}) is counted only on a client.

It seems we can say that box.stat().ERROR counts box errors, but it will not
give a user precise understanding which errors are counted and which are not.
Maybe more formal (and so, I think, more useful) description would be:

The following errors are counted in the statistics: failed box functions
calls (except some argument validation errors), failed calls to database
related built-in modules (xlog, key_def, merger, maybe others) (except some
argument validation errors), failed net.box calls (access denied, no such
function, a Lua error raised in a function), failed requests via a binary
port (access denied, no such function, ...; but not errors raised in a Lua
function).

My point here is that a user should understand what is going on to use a
feature, in particular, to interpret the statistics.


Just for the record, a list of error types in tarantool:

List of ClientErrors:

ClientError
 | LoggedError
 | AccessDeniedError
 | UnsupportedIndexFeature

List of errors that are not a ClientError inheritors:

XlogError
 | XlogGapError
SystemError
 | SocketError
 | OutOfMemory
 | TimedOut
ChannelIsClosed
FiberIsCancelled
LuajitError
IllegalParams
CollationError
SwimError
CryptoError

@kostja
Copy link
Contributor

kostja commented Aug 19, 2019

OK, imagine you did document it precisely. What happens when you change the implementation? Do you think Tarantool will need to stick with the above forever now?

box.stast.error purpose is to reflect exceptions during iproto query execution. If there are too many exceptions,then something goes wrong. It's an aggregated metric. Usually it does not grow.

@Totktonada
Copy link
Member

Okay, my point was that it is less usable when a user need experimenting to understand what is going on. Now I also think that something goes wrong in tarantool if we need to describe a feature in this way.

purpose is to reflect exceptions during iproto query execution

So why it does not do exaclty so (count iproto error responses)? Many apps has background fibers, net.box requests, administration tools (at least console), pcalled box requests and they can affect this metric.

I filed the issue after a customer question (they observed some non-zero values here), so I guess I'm not only person who wonder what is this metric.

@pgulutzan
Copy link
Contributor

If you think that "something is wrong in Tarantool if we need to describe a feature in this way", then do you mean that this is really a bug in the implementation, not a bug in the documentation?
If so, we can close this issue. If not, I can quote exactly the words that you propose as a "more useful description", and close this issue.

@Totktonada
Copy link
Member

Totktonada commented Aug 19, 2019

I'll try to collect more thoughts on that in developers mailing list.

Update: Sent 'RFC: How to count box.stat().ERROR?' email.

@pgulutzan pgulutzan added the need feedback [special status] On hold, awaiting feedback label Aug 27, 2019
@pgulutzan
Copy link
Contributor

@Toktonada: The RFC on the developers mailing list did not result in a major change, did it? Our manual still says: "ERROR" is the count of requests that resulted in an error. I can change to your more precise description above, that is, the words "The following errors are counted" etc. Or I can do nothing.

@Totktonada
Copy link
Member

The discussion is stalled. It seems that it is unclear what is better to do in tarantool in the short run and in the future. So it would be not productive to assume that we'll change something soon in tarantool.

Re documentation: It seems we're unable to actually document this feature, because of its implementation. So any description is okay. This issue can be closed, I think.

NB: Personally I think it would be good to count iproto error responses (simpler way) or those responses per error code. I'll not do anything here for now, but if we'll find a demand (say, some amount of questions from users), I'll raise a question (discussion / issue) against tarantool itself rather then documentation.

@pgulutzan
Copy link
Contributor

Okay, thanks for the full explanation. I'll close this.

@Totktonada
Copy link
Member

I think the 'RFC: How to count box.stat().ERROR?' discussion can be shared here.


Alexander Turenko

box.stat().ERROR counts different kinds of error, let me cite the issue
linked below:

failed box functions calls (except some argument validation errors),
failed calls to database related built-in modules (xlog, key_def,
merger, maybe others) (except some argument validation errors), failed
net.box calls (access denied, no such function, a Lua error raised in
a function), failed requests via a binary port (access denied, no such
function, ...; but not errors raised in a Lua function).

However Kostya O. says:

box.stast.error purpose is to reflect exceptions during iproto query
execution. If there are too many exceptions,then something goes wrong.
It's an aggregated metric. Usually it does not grow.

I think that the current way to count errors statistics does not give
enough information for a user to understand what is going on and, say,
plan some actions.

The intention of this email is to collect thoughts whether we need to
change / expand it somehow in tarantool.

I see the following ways to provide more helphul information:

  • Split the statistics by types: IPROTO ERROR responses, net.box errors
    (on a client), unhandled Lua errors (it can lead to IPROTO_ERROR or
    can occur in a background fiber), transaction rollbacks (it looks
    important), maybe others (don't sure where to draw lines for box
    errors, key_def Lua module errors and so).
  • Split the statistics by a fiber name (if any) / fiber id where it
    occurs.
  • Split the statistics by an error code (at least a ClientError always
    has a code like ER_TUPLE_FOUND.

Maybe we need all of that in some way. I would appreciate any feedback.

The discussion was originally started here:
#851

Please, look into the issue for more background information.

WBR, Alexander Turenko.


Konstantin Osipov

* Alexander Turenko [email protected] [19/08/20 12:38]:

failed box functions calls (except some argument validation errors),
failed calls to database related built-in modules (xlog, key_def,
merger, maybe others) (except some argument validation errors), failed
net.box calls (access denied, no such function, a Lua error raised in
a function), failed requests via a binary port (access denied, no such
function, ...; but not errors raised in a Lua function).

However Kostya O. says:

box.stast.error purpose is to reflect exceptions during iproto query
execution. If there are too many exceptions,then something goes wrong.
It's an aggregated metric. Usually it does not grow.

I think that the current way to count errors statistics does not give
enough information for a user to understand what is going on and, say,
plan some actions.

True.

The intention of this email is to collect thoughts whether we need to
change / expand it somehow in tarantool.

Sure we do.

I see the following ways to provide more helphul information:

  • Split the statistics by types: IPROTO ERROR responses, net.box errors
    (on a client), unhandled Lua errors (it can lead to IPROTO_ERROR or
    can occur in a background fiber), transaction rollbacks (it looks
    important), maybe others (don't sure where to draw lines for box
    errors, key_def Lua module errors and so).
  • Split the statistics by a fiber name (if any) / fiber id where it
    occurs.
  • Split the statistics by an error code (at least a ClientError always
    has a code like ER_TUPLE_FOUND.

Maybe we need all of that in some way. I would appreciate any feedback.

The discussion was originally started here:
#851

Please, look into the issue for more background information.

No. We need more fundamental mechanisms than that.

By splitting some metric across a single dimension you don't take
into account a) other metrics b) other dimensions.

There are the following dimensions in which you reason about
instrumentation:

  • metrics; there can be different types of metrics, I think
    Prometheus pioneered the taxonomy of different metric types;
    metrics may also be grouped by the resource they relate to
    (memory, disk, concurrency, i.e. locks/latches, sockets, files).
  • database objects, e.g. tables, functions, views, triggers.
  • database users
  • queries
  • time window - the current time, or some history window.

You may want to look at each metric from each or a combination of
dimension. E.g. you may want to know the amount of lock waits a
certain query by a given user has had over a period of time, or
currently has.

So please stop reasoning about any particular metric. If you want
to improve instrumentation, write a spec about a completely new
implementation, and try to incorporate all the requirements from
github issues in it.

Then we can come up with an implementation plan.

--
Konstantin Osipov, Moscow, Russia


Alexander Turenko

Okay, there are much more powerful ways to collect metrics. Should it
keep us from picking low hanging fruits? Maybe I should propose
something less abstract?

I think it worth to add error metrics per error code: it is simple to
implement, does not lead to any architectural changes and provide a way
more data to analyze.

WBR, Alexander Turenko.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature A new functionality need feedback [special status] On hold, awaiting feedback reference [location] Tarantool manual, Reference part server [area] Task relates to Tarantool's server (core) functionality
Projects
None yet
Development

No branches or pull requests

6 participants