Additional metrics for bulk requests #286
Conversation
Thanks for your PR @cdahlqvist! I left a couple of comments.
@@ -114,6 +114,8 @@ def __call__(self, es, params):

* ``index``: name of the affected index. May be `None` if it could not be derived.
* ``bulk-size``: bulk size, e.g. 5.000.
* ``bulk-request-size-bytes``: size of the full bulk requset in bytes
nit: typo "requset" -> "request"
esrally/driver/runner.py
bulk_request_size_bytes = 0
total_document_size_bytes = 0

for i in range(len(params["body"])):
This adds some overhead which is, depending on the track, at worst in the single-digit percentage range (I've measured up to 95ms). I think it would make sense to move this loop to the method `detailed_stats`. Users can then enable it for their tracks by setting `detailed-results` to `true`. Complete example:
```json
{
    "name": "bulk-index",
    "operation-type": "index",
    "bulk-size": 5000,
    "detailed-results": true
}
```
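A minimal, self-contained sketch of what that suggestion could look like; the function shape, the sample `params`, and reading the flag via `params.get("detailed-results", False)` are assumptions for illustration, not esrally's actual code:

```python
# Sketch only: compute the extra byte counts in a separate helper and
# call it solely when the track opts in via "detailed-results".
def detailed_stats(params):
    bulk_request_size_bytes = 0
    for data in params["body"]:
        bulk_request_size_bytes += len(data)
    return {"bulk-request-size-bytes": bulk_request_size_bytes}

params = {"detailed-results": True, "body": ['{"index": {}}', '{"f": 1}']}
result = {}
if params.get("detailed-results", False):
    result.update(detailed_stats(params))
print(result)  # {'bulk-request-size-bytes': 21}
```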
Btw, you could also use `enumerate` for the loop instead of `range` to simplify it a bit:
```python
for line_number, data in enumerate(params["body"]):
    line_size = len(data)
    if params["action_metadata_present"]:
        total_document_size_bytes += line_size
    # ...
```
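(Compared to the `range`-based version, `enumerate` yields each line together with its index, so no separate `params["body"][i]` lookup is needed.)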
Another fine detail: the `len` function in Python returns the number of characters (although it's not really clearly documented), but combining the following two facts shows that this is the case:

- "[len] [r]eturn[s] the length (the number of items) of an object. The argument may be a sequence (such as a string [...])" (source)
- "Strings are immutable sequences of Unicode code points." (source)

Therefore, we can reason that `len` returns the number of Unicode code points (i.e. characters) of a string. For characters that can be represented with 1 byte, the number of characters is identical to the number of bytes, but strictly speaking it's the number of characters. To determine the number of bytes you need to encode the string as UTF-8, e.g. `len(s.encode('utf-8'))`.
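A quick illustration of the distinction (the sample string is made up):

```python
s = "héllo"  # 'é' is a single code point but takes two bytes in UTF-8

print(len(s))                  # 5 -> number of Unicode code points
print(len(s.encode("utf-8")))  # 6 -> number of bytes sent over the wire
```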
@danielmitterdorfer Have updated the PR according to the given suggestions.
LGTM. I think it makes sense to add the following two assertions to `BulkIndexRunnerTests#test_mixed_bulk_with_detailed_stats(self, es)`:

```python
self.assertEqual(158, result["bulk-request-size-bytes"])
self.assertEqual(62, result["total-document-size-bytes"])
```
Added the assertions to the test.
LGTM. You can merge at any time.
When comparing bulk indexing throughput for different types of data, pure EPS (events per second) can be very misleading if events have different sizes. I have found the total volume of JSON records indexed per time period to sometimes be a more useful measure, as it takes event size into consideration.

This pull request adds the total size of the bulk request, as well as the total size of the documents indexed, to the reported metrics, allowing the amount of JSON indexed per timeframe and the total volume sent over the wire to be calculated.
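As an illustration of the calculation this enables, a small post-processing sketch; the metric names follow this PR, while the sample values and the `took` field (assumed here to be the bulk response time in milliseconds) are made up:

```python
# Hypothetical sample; only the metric names come from this PR.
result = {
    "total-document-size-bytes": 62_000_000,  # JSON payload actually indexed
    "took": 5_000,                            # assumed: response time in ms
}

mb_per_second = result["total-document-size-bytes"] / 1_000_000 / (result["took"] / 1_000)
print(f"{mb_per_second:.1f} MB of JSON indexed per second")  # 12.4
```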