Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional metrics for bulk requests #286

Merged
merged 5 commits into from
Jun 29, 2017
Merged

Conversation

cdahlqvist
Copy link
Contributor

When comparing bulk indexing throughput for different types of data, pure EPS can be very misleading if events have different sizes. I have found the total volume of JSON records indexed per time period to sometimes be a more useful measure, as this takes event size into consideration.

This pull request adds the total size of the bulk request as well as the total size of the documents indexed to the metrics reported, allowing the amount of JSON indexed per timeframe and total volume sent over the wire to be calculated.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR @cdahlqvist! I left a couple of comments.

@@ -114,6 +114,8 @@ def __call__(self, es, params):

* ``index``: name of the affected index. May be `None` if it could not be derived.
* ``bulk-size``: bulk size, e.g. 5.000.
* ``bulk-request-size-bytes``: size of the full bulk requset in bytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo "requset" -> "request"

bulk_request_size_bytes = 0
total_document_size_bytes = 0

for i in range(len(params["body"])):
Copy link
Member

@danielmitterdorfer danielmitterdorfer Jun 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds some overhead which is - depending on the track - at worst in the single-digit percentage range (I've measured up to 95ms). I think it would make sense to move this loop to the method detailed_stats. Users can then enable it for their tracks by setting detailed-results to true. Complete example:

{
  "name": "bulk-index",
  "operation-type": "index",
  "bulk-size": 5000,
  "detailed-results": true
}

Btw, you could also use enumerate for the loop instead of range to simplify it a bit:

for line_number, data in enumerate(params["body"]):
    line_size = len(data)
    if params["action_metadata_present"]:
        total_document_size_bytes += line_size
    #....

Another fine detail: the len function in Python returns the number of characters (although it's not really clearly documented) but combining the following two facts show that this is the case:

  1. "[len] [r]eturn[s] the length (the number of items) of an object. The argument may be a sequence (such as a string [...])" (source)
  2. "Strings are immutable sequences of Unicode code points." (source)

Therefore, we can reason that len returns the number of Unicode code points (i.e. characters) of a string. For characters that can be represented with 1 byte, the number of characters is identical to the number of bytes but strictly speaking it's the number of characters. To determine the number of bytes you need to encode the string as UTF-8, e.g. len(s.encode('utf-8')).

@danielmitterdorfer danielmitterdorfer added :Metrics How metrics are stored, calculated or aggregated Review enhancement Improves the status quo labels Jun 20, 2017
@cdahlqvist
Copy link
Contributor Author

@danielmitterdorfer Have updated the PR according to given suggestions.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think it makes sense to add the following two asserations to BulkIndexRunnerTests#test_mixed_bulk_with_detailed_stats(self, es):

        self.assertEqual(158, result["bulk-request-size-bytes"])
        self.assertEqual(62, result["total-document-size-bytes"])

@cdahlqvist
Copy link
Contributor Author

Added the assertions to the test.

@danielmitterdorfer
Copy link
Member

LGTM. You can merge at any time.

@cdahlqvist cdahlqvist merged commit 70065e2 into elastic:master Jun 29, 2017
@danielmitterdorfer danielmitterdorfer added this to the 0.6.1 milestone Jul 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo :Metrics How metrics are stored, calculated or aggregated
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants