feedback: Submodule box.stat #851

TarantoolBot · 2019-08-02T15:42:52Z

<…>SERT:
total: 0
rps: 0
AUTH:
total: 0
rps: 0
|ERROR|:
total: 0
rps: 0
UPDATE:
total: 0
rps: 0
...
<…>

https://www.tarantool.io/en/doc/1.10/book/box/box_stat/

What does 'ERROR' field means? I don't find any description about that on this page.

Totktonada · 2019-08-13T22:53:00Z

"ERROR" is the count of requests that resulted in an error.

I had assumed the following questions. Does it apply only to iproto requests or for console commands / application logic / background fibers too? Does 'resulted in an error' mean IPROTO_ERROR responses? Or it applies to errors that are catched by pcall / xpcall too (they do not lead to IPROTO_ERROR response)?

I don't sure, but as I see it is counted in an error constructor for some error types. It does not matter whether it leads to an IPROTO_ERROR response or whether it was pcalled. It does not matter whether it caused by an iproto request, from console or from an application code.

We need to understand what is the exact list of errors, which are counted here. And verify my observations above.

alyapunov · 2019-08-14T22:14:52Z

* Alexander Turenko <[email protected]> [19/08/14 01:53]:

Reopened #851.

these are exceptions, all exceptions, whenever we create a new struct error we ++ ERRORS.

…

-- Konstantin Osipov, Moscow, Russia

pgulutzan · 2019-08-14T22:50:23Z

Sure, I can verify what Mr Turenko says and am sure Mr Osipov is right.
My random way of trying to verify looked like this:
box.space.T:drop()
box.schema.space.create('T')
box.space.T:create_index('I')
box.space.T:insert{1}
box.space.T:insert{1}
box.stat().ERROR
x, y = pcall(function() box.space.T:insert{1} end)
box.stat().ERROR
box.space.space_1:drop()
s = box.schema.space.create('space_1')
i = s:create_index('space_1_index',{})
function on_replace_function () box.space.T:insert{1} end
box.space.space_1:on_replace(on_replace_function)
box.stat().ERROR
box.space.space_1:insert{1,'Hi'}
box.stat().ERROR
box.stat().ERROR
x, y = pcall(function() no_such_function() end)
box.stat().ERROR
-- box.stat().ERROR
-- box.execute([[CREATE TABLE t (s1 INT PRIMARY KEY);]])
-- box.stat().ERROR
box.stat().ERROR
fiber = require('fiber')
function function_name()
box.space.T:insert{1}
end
fiber_object = fiber.create(function_name)
box.stat().ERROR
box.stat().ERROR
box.error({code=1, reason='This tuple already has 3 fields'})
box.stat().ERROR
fio = require('fio')
errno = require('errno')
box.stat().ERROR
f = fio.open('/tmp/xxxxq.txt', {'O_RDONLY' })
f:read(4096)
f:close()
box.stat().ERROR
box.stat().ERROR
os.execute('rm xxxxxxx')
box.stat().ERROR

From that I see that:
If there is an error, ERROR goes up.
If there is an error with pcall of an insert, ERROR goes up.
If there is an error with pcall of a nonexistent function, ERROR does not go up.
If there is an error in a trigger, ERROR goes up.
if the error happens via box.execute, ERROR goes up.
If the error happens via fiber.create, ERROR goes up.
If the error happens via box.error, ERROR goes up.
If the error is in fio, ERROR does not go up.
If the error is in os, ERROR does not go up.

So I can say that:
"If and only if an error happens during execution of any function
in the box module, regardless whether execution is
within a background fiber or within a trigger or
due to SQL, ERROR count goes up."
I thought that is more detail than necessary because
it is reasonable that box.stat().ERROR would be about
all box error statistics and nothing else, but it does
no harm.

Totktonada · 2019-08-18T22:28:28Z

In brief: I think we should be as precise here as we can to make the statistics
useful for a user.

box.stat().ERROR is increased when a ClientError or its
inheritor is created. ClientError (and its inheritors) does not created
outside of src/box. However some src/box/lua modules are exposed via
require('foo') and not via box global. It is not obvious that errors from
such modules will be counted too. Example:

key_def = require('key_def')
box.cfg{}
box.stat().ERROR -- total: 0
key_def.new({{fieldno = 1, type = 'integer', collation_id = 1024}})
---
- error: 'Wrong index options (field 1): collation was not found by ID'
...
box.stat().ERROR -- total: 1

Also box functions can raise usual Lua text errors. Example:

box.cfg{}
ffi = require('ffi')
box.stat().ERROR -- total: 0
box.tuple.new(ffi.new('char *'))
---
- error: unsupported Lua type 'cdata'
...
box.stat().ERROR -- total: 0

Or, say, key_def, which is part of box, can raise IllegalParams, which does not
increase the statistics:

box.cfg{}
key_def = require('key_def')
box.stat().ERROR -- total: 0
---
- error: fieldno must not be nil
...
box.stat().ERROR -- total: 0

Calling an undefined Lua function is counted in the statistics as on a server
as well as on a client:

box.cfg{listen=3301}
box.schema.user.grant('guest','read,write,execute','universe')
net_box = require('net.box')
c = net_box.connect('localhost:3301')
box.stat().ERROR -- total: 0
c:call('foo')
---
- error: Procedure 'foo' is not defined
...
box.stat().ERROR -- total: 2

However c:call('error', {'hello'}) is counted only on a client.

It seems we can say that box.stat().ERROR counts box errors, but it will not
give a user precise understanding which errors are counted and which are not.
Maybe more formal (and so, I think, more useful) description would be:

The following errors are counted in the statistics: failed box functions
calls (except some argument validation errors), failed calls to database
related built-in modules (xlog, key_def, merger, maybe others) (except some
argument validation errors), failed net.box calls (access denied, no such
function, a Lua error raised in a function), failed requests via a binary
port (access denied, no such function, ...; but not errors raised in a Lua
function).

My point here is that a user should understand what is going on to use a
feature, in particular, to interpret the statistics.

Just for the record, a list of error types in tarantool:

List of ClientErrors:

ClientError
 | LoggedError
 | AccessDeniedError
 | UnsupportedIndexFeature

List of errors that are not a ClientError inheritors:

XlogError
 | XlogGapError
SystemError
 | SocketError
 | OutOfMemory
 | TimedOut
ChannelIsClosed
FiberIsCancelled
LuajitError
IllegalParams
CollationError
SwimError
CryptoError

kostja · 2019-08-19T07:35:39Z

OK, imagine you did document it precisely. What happens when you change the implementation? Do you think Tarantool will need to stick with the above forever now?

box.stast.error purpose is to reflect exceptions during iproto query execution. If there are too many exceptions,then something goes wrong. It's an aggregated metric. Usually it does not grow.

Totktonada · 2019-08-19T08:29:53Z

Okay, my point was that it is less usable when a user need experimenting to understand what is going on. Now I also think that something goes wrong in tarantool if we need to describe a feature in this way.

purpose is to reflect exceptions during iproto query execution

So why it does not do exaclty so (count iproto error responses)? Many apps has background fibers, net.box requests, administration tools (at least console), pcalled box requests and they can affect this metric.

I filed the issue after a customer question (they observed some non-zero values here), so I guess I'm not only person who wonder what is this metric.

pgulutzan · 2019-08-19T21:44:54Z

If you think that "something is wrong in Tarantool if we need to describe a feature in this way", then do you mean that this is really a bug in the implementation, not a bug in the documentation?
If so, we can close this issue. If not, I can quote exactly the words that you propose as a "more useful description", and close this issue.

Totktonada · 2019-08-19T22:02:06Z

I'll try to collect more thoughts on that in developers mailing list.

Update: Sent 'RFC: How to count box.stat().ERROR?' email.

pgulutzan · 2019-12-06T17:11:15Z

@Toktonada: The RFC on the developers mailing list did not result in a major change, did it? Our manual still says: "ERROR" is the count of requests that resulted in an error. I can change to your more precise description above, that is, the words "The following errors are counted" etc. Or I can do nothing.

Totktonada · 2019-12-06T17:53:12Z

The discussion is stalled. It seems that it is unclear what is better to do in tarantool in the short run and in the future. So it would be not productive to assume that we'll change something soon in tarantool.

Re documentation: It seems we're unable to actually document this feature, because of its implementation. So any description is okay. This issue can be closed, I think.

NB: Personally I think it would be good to count iproto error responses (simpler way) or those responses per error code. I'll not do anything here for now, but if we'll find a demand (say, some amount of questions from users), I'll raise a question (discussion / issue) against tarantool itself rather then documentation.

pgulutzan · 2019-12-06T18:39:59Z

Okay, thanks for the full explanation. I'll close this.

Totktonada · 2020-12-15T16:13:36Z

I think the 'RFC: How to count box.stat().ERROR?' discussion can be shared here.

Alexander Turenko

box.stat().ERROR counts different kinds of error, let me cite the issue
linked below:

failed box functions calls (except some argument validation errors),
failed calls to database related built-in modules (xlog, key_def,
merger, maybe others) (except some argument validation errors), failed
net.box calls (access denied, no such function, a Lua error raised in
a function), failed requests via a binary port (access denied, no such
function, ...; but not errors raised in a Lua function).

However Kostya O. says:

box.stast.error purpose is to reflect exceptions during iproto query
execution. If there are too many exceptions,then something goes wrong.
It's an aggregated metric. Usually it does not grow.

I think that the current way to count errors statistics does not give
enough information for a user to understand what is going on and, say,
plan some actions.

The intention of this email is to collect thoughts whether we need to
change / expand it somehow in tarantool.

I see the following ways to provide more helphul information:

Split the statistics by types: IPROTO ERROR responses, net.box errors
(on a client), unhandled Lua errors (it can lead to IPROTO_ERROR or
can occur in a background fiber), transaction rollbacks (it looks
important), maybe others (don't sure where to draw lines for box
errors, key_def Lua module errors and so).
Split the statistics by a fiber name (if any) / fiber id where it
occurs.
Split the statistics by an error code (at least a ClientError always
has a code like ER_TUPLE_FOUND.

Maybe we need all of that in some way. I would appreciate any feedback.

The discussion was originally started here:
#851

Please, look into the issue for more background information.

WBR, Alexander Turenko.

Konstantin Osipov

* Alexander Turenko [email protected] [19/08/20 12:38]:

failed box functions calls (except some argument validation errors),
failed calls to database related built-in modules (xlog, key_def,
merger, maybe others) (except some argument validation errors), failed
net.box calls (access denied, no such function, a Lua error raised in
a function), failed requests via a binary port (access denied, no such
function, ...; but not errors raised in a Lua function).

However Kostya O. says:

box.stast.error purpose is to reflect exceptions during iproto query
execution. If there are too many exceptions,then something goes wrong.
It's an aggregated metric. Usually it does not grow.

I think that the current way to count errors statistics does not give
enough information for a user to understand what is going on and, say,
plan some actions.

True.

The intention of this email is to collect thoughts whether we need to
change / expand it somehow in tarantool.

Sure we do.

I see the following ways to provide more helphul information:

Split the statistics by types: IPROTO ERROR responses, net.box errors
(on a client), unhandled Lua errors (it can lead to IPROTO_ERROR or
can occur in a background fiber), transaction rollbacks (it looks
important), maybe others (don't sure where to draw lines for box
errors, key_def Lua module errors and so).

Split the statistics by a fiber name (if any) / fiber id where it
occurs.

Split the statistics by an error code (at least a ClientError always
has a code like ER_TUPLE_FOUND.

Maybe we need all of that in some way. I would appreciate any feedback.

The discussion was originally started here:
#851

Please, look into the issue for more background information.

No. We need more fundamental mechanisms than that.

By splitting some metric across a single dimension you don't take
into account a) other metrics b) other dimensions.

There are the following dimensions in which you reason about
instrumentation:

metrics; there can be different types of metrics, I think
Prometheus pioneered the taxonomy of different metric types;
metrics may also be grouped by the resource they relate to
(memory, disk, concurrency, i.e. locks/latches, sockets, files).
database objects, e.g. tables, functions, views, triggers.
database users
queries
time window - the current time, or some history window.

You may want to look at each metric from each or a combination of
dimension. E.g. you may want to know the amount of lock waits a
certain query by a given user has had over a period of time, or
currently has.

So please stop reasoning about any particular metric. If you want
to improve instrumentation, write a spec about a completely new
implementation, and try to incorporate all the requirements from
github issues in it.

Then we can come up with an implementation plan.

--
Konstantin Osipov, Moscow, Russia

Alexander Turenko

Okay, there are much more powerful ways to collect metrics. Should it
keep us from picking low hanging fruits? Maybe I should propose
something less abstract?

I think it worth to add error metrics per error code: it is simple to
implement, does not lead to any architectural changes and provide a way
more data to analyze.

WBR, Alexander Turenko.

lenkis assigned pgulutzan Aug 2, 2019

lenkis added 1.10 feature A new functionality reference [location] Tarantool manual, Reference part server [area] Task relates to Tarantool's server (core) functionality labels Aug 2, 2019

pgulutzan closed this as completed in 07d8a06 Aug 13, 2019

Totktonada reopened this Aug 13, 2019

pgulutzan added the need feedback [special status] On hold, awaiting feedback label Aug 27, 2019

pgulutzan closed this as completed Dec 6, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feedback: Submodule box.stat #851

feedback: Submodule box.stat #851

TarantoolBot commented Aug 2, 2019

Totktonada commented Aug 13, 2019

alyapunov commented Aug 14, 2019 via email

pgulutzan commented Aug 14, 2019

Totktonada commented Aug 18, 2019

kostja commented Aug 19, 2019

Totktonada commented Aug 19, 2019

pgulutzan commented Aug 19, 2019

Totktonada commented Aug 19, 2019 •

edited

Loading

pgulutzan commented Dec 6, 2019

Totktonada commented Dec 6, 2019

pgulutzan commented Dec 6, 2019

Totktonada commented Dec 15, 2020

feedback: Submodule box.stat #851

feedback: Submodule box.stat #851

Comments

TarantoolBot commented Aug 2, 2019

Totktonada commented Aug 13, 2019

alyapunov commented Aug 14, 2019 via email

pgulutzan commented Aug 14, 2019

Totktonada commented Aug 18, 2019

kostja commented Aug 19, 2019

Totktonada commented Aug 19, 2019

pgulutzan commented Aug 19, 2019

Totktonada commented Aug 19, 2019 • edited Loading

pgulutzan commented Dec 6, 2019

Totktonada commented Dec 6, 2019

pgulutzan commented Dec 6, 2019

Totktonada commented Dec 15, 2020

Totktonada commented Aug 19, 2019 •

edited

Loading