
JSON serialization/deserialization performance #254

Closed
olivere opened this issue Mar 29, 2016 · 19 comments

@olivere
Owner

olivere commented Mar 29, 2016

JSON serialization/deserialization seems to be an issue for many users, especially those running Bulk and Search operations at scale.

Things we could do:

  1. Use ffjson for hot paths like Search/Bulk.
  2. Reuse buffers via sync.Pool to reduce GC pressure. Something like this.
  3. Keep an eye on #317 and/or use go-drainclose before this gets fixed in Go. Of course, we need to check whether this has an impact on Elastic as well.

Related issues #253, #208.

@olivere olivere changed the title JSON performance JSON serialization/deserialization performance Mar 29, 2016
@dimfeld
Contributor

dimfeld commented Mar 29, 2016

In addition to ffjson, I recently came across https://github.com/mailru/easyjson, which is based on a similar concept (code generation) but purports to be faster. I haven't checked it out in detail, though.

@olivere
Owner Author

olivere commented Mar 29, 2016

Just tested draining the body in PerformRequest (option 3 above), but that doesn't make a difference. If it's a regression in the stdlib, it'll be fixed. If it's not, it's safe to keep it as is. Either way, it doesn't make sense to fix it in Elastic.

@olivere
Owner Author

olivere commented Mar 30, 2016

This is just an experiment. I don't have an opinion yet if this is a good or bad idea, really. Feedback welcome...

I just tested creating the JSON manually, hard-coded via a bytes.Buffer, in the json-serialization branch. Compare before and after.

This currently only covers the action_and_metadata line (see bulk docs), where JSON serialization can be hard-coded because we know the fields in advance (in contrast to the document data).

For reference, here are the benchmark results (run via go test -run=Bulk -bench=Bulk -benchmem, then compare before/after via benchcmp):

benchmark                                              old ns/op     new ns/op     delta
BenchmarkBulkDeleteRequestSerialization-4              3324          3112          -6.38%
BenchmarkBulkIndexRequestSerialization-4               5754          5710          -0.76%
BenchmarkBulkEstimatedSizeInBytesWith1Request-4        13423         12547         -6.53%
BenchmarkBulkEstimatedSizeInBytesWith100Requests-4     1328849       1213027       -8.72%
BenchmarkBulkUpdateRequestSerialization-4              5009          4641          -7.35%

benchmark                                              old allocs     new allocs     delta
BenchmarkBulkDeleteRequestSerialization-4              29             34             +17.24%
BenchmarkBulkIndexRequestSerialization-4               37             43             +16.22%
BenchmarkBulkEstimatedSizeInBytesWith1Request-4        109            125            +14.68%
BenchmarkBulkEstimatedSizeInBytesWith100Requests-4     10612          12212          +15.08%
BenchmarkBulkUpdateRequestSerialization-4              42             47             +11.90%

benchmark                                              old bytes     new bytes     delta
BenchmarkBulkDeleteRequestSerialization-4              1512          640           -57.67%
BenchmarkBulkIndexRequestSerialization-4               2384          1528          -35.91%
BenchmarkBulkEstimatedSizeInBytesWith1Request-4        6259          3660          -41.52%
BenchmarkBulkEstimatedSizeInBytesWith100Requests-4     631082        371193        -41.18%
BenchmarkBulkUpdateRequestSerialization-4              2440          1520          -37.70%

Applications could speed things up further and remove JSON serialization altogether by supplying a string or json.RawMessage, as Elastic only runs the JSON serializer if the data structure cannot be transformed into a string directly (see e.g. bulk_index_request.go).

@r--w

r--w commented Mar 31, 2016

Hi,
I just tried creating JSON documents manually and passing them as strings. This reduced memory usage significantly. One thing worth mentioning is that you need to escape values, as in https://github.com/golang/go/blob/5fea2ccc77eb50a9704fa04b7c61755fe34e1d95/src/encoding/json/encode.go#L788, and use a sync.Pool of bytes.Buffer in order to avoid allocations.

@olivere
Owner Author

olivere commented Mar 31, 2016

@r--w Yes, escaping is required, and it is certainly not bulletproof. It's just an experiment to find out whether it's worth the effort. I will never implement this for all code paths: probably only Bulk/Search is critical enough.

@r--w

r--w commented Mar 31, 2016

In my opinion, the option to create JSON manually (during indexing) should be sufficient for all "hardcore" users. I just didn't know it was available (I should have read the source code first ;)

@olivere
Owner Author

olivere commented Mar 31, 2016

No, the documentation should be better ;-)

@sidazhang

@olivere Is this something that you are actively pursuing?

@olivere
Owner Author

olivere commented Jun 29, 2016

Not at the moment.

There is some performance testing ahead in my day-to-day job, so we'll see how that goes and if I can get the results back into Elastic.

@rami-dabain

I have used https://github.com/buger/jsonparser; it's about 10x faster than the stock encoding/json package.

@olivere
Owner Author

olivere commented Aug 29, 2016

@rami-dabain Do you have any stats on how much time is actually spent in JSON encoding/decoding for your use case? Can you post your code somewhere?

@olivere
Owner Author

olivere commented Aug 30, 2016

A pattern I've been using recently is to split work into several goroutines via a pipeline. The excellent golang.org/x/sync/errgroup makes that dead simple. Here's an example of that pattern.

JSON performance might not be perfect in the stdlib. However, if you use all cores, you might not need to squeeze out the last nanoseconds. With the example above, I was able to saturate our Elasticsearch cluster, and suddenly JSON decoding performance seems like a minor issue. ;-)

I'm very interested in getting feedback. Does that work for you as well?

@rami-dabain

I haven't used jsonparser directly with ES. However, I used it in an HTTP receiver service: CPU was maxed out at around 12k requests/second while the receiver was decoding the JSON requests, and that went up to ~45k requests/second when I switched to parsing the JSON with the mentioned library. Most of the gains were on fetching values for a known key path. I suggest giving it a try, as it doesn't require any compilation or code generation.

@olivere
Owner Author

olivere commented Sep 2, 2016

I'm not going to use jsonparser in Elastic by default. But what could possibly work is to return structures (for performance-critical services) in a way that lets users plug in alternative JSON parsers like jsonparser.

@rami-dabain

That would be great, and more flexible

@olivere
Owner Author

olivere commented Sep 2, 2016

TBH I think you could already do that by implementing your own Decoder as described in the wiki. However, you would probably need to return a matching response, i.e. a BulkResponse for bulk requests or a SearchResponse for search requests.

I don't have the time to play around with it now, but I'll keep this issue open for a while as a way to gather feedback.

@rami-dabain

I use the following:

    .Search(index).
        Type(typ).
        Query(query).
        From(0).Size(1).
        Do()

As I need only one result, then:

    for _, hit := range searchResult.Hits.Hits {
        jsonparser.ArrayEach(*hit.Source, func(value []byte, dataType jsonparser.ValueType, offset int, err error) {
            v, _ := jsonparser.ParseString(value)
            println(v)
        }, "clients", result.ClientId, "filters")

        break // only one result is enough
    }

If no json.Unmarshal is called within Elastic, then all is fine!

@olivere
Owner Author

olivere commented Sep 2, 2016

In that case, json.Unmarshal is called (see here). However, decoding of the _source field is deferred by using json.RawMessage (see here). So I guess your code doesn't save that much time.

What would probably work is to implement your own Decoder and do a type switch in its Decode(...) func on the desired result (e.g. *SearchResult), then do the decoding with whatever you choose as your favorite JSON decoder. You absolutely have to make sure that all the fields of e.g. *SearchResult that your application requires are set by your Decoder implementation, otherwise you'd shoot yourself in the foot. Something like this (untested!):

type AwesomeDecoder struct{}

func (d *AwesomeDecoder) Decode(data []byte, v interface{}) error {
  switch t := v.(type) {
  case *elastic.SearchResult:
    // Use your favorite JSON parser here to fill in the req'd
    // fields of the *elastic.SearchResult in t.
    _ = t
    return nil
  default:
    return json.Unmarshal(data, v)
  }
}

Yeah, not nice, but that probably works without changing a single line of code in Elastic.

@olivere
Owner Author

olivere commented Sep 22, 2016

I'm closing this for two reasons:

First, I will keep using encoding/json as the sole mechanism to serialize/deserialize JSON in Elastic. There are alternative ways to do so, but these require workarounds such as the Decoder method described above.

Second, based on my experience with production load, serialization/deserialization becomes less of an issue once you do things concurrently. E.g. deserialization has been an issue for me when scrolling; but scrolling in parallel turns this into an I/O problem, and JSON serialization/deserialization is simply no longer a bottleneck.

If you still have performance issues regarding JSON serialization/deserialization, let me know in a new issue.

@olivere olivere closed this as completed Sep 22, 2016