
[Operator] Accelerate the CPU side performance of topk #10205

Closed
sxjscience opened this issue Mar 22, 2018 · 24 comments

Comments

@sxjscience
Member

One issue with topk is that the CPU implementation is much slower than the numpy version. Here is a speed test done by @mjpost. We need to improve its speed.
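(The original numbers aren't reproduced in this thread. As an illustration only, a minimal sketch of how such a CPU comparison could be run, assuming mxnet and numpy are installed and using a made-up beam-search-like shape, is:)

```python
import time

import mxnet as mx
import numpy as np

# Illustrative shape: scores for one beam-search step (batch * beam, vocab).
data_np = np.random.rand(320, 30000).astype(np.float32)
data_nd = mx.nd.array(data_np, ctx=mx.cpu())

# MXNet topk on CPU.
start = time.time()
for _ in range(10):
    out = mx.nd.topk(data_nd, axis=-1, k=5, ret_typ='indices')
    out.wait_to_read()  # force the asynchronous call to finish
print('mx.nd.topk     : %.4f s' % (time.time() - start))

# numpy counterpart: argpartition does a partial selection of the top k.
start = time.time()
for _ in range(10):
    part = np.argpartition(data_np, -5, axis=-1)[:, -5:]
print('np.argpartition: %.4f s' % (time.time() - start))
```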

sxjscience changed the title from [Operator] Accelerate the CPU performance of topk to [Operator] Accelerate the CPU side performance of topk on Mar 22, 2018
@marcoabreu
Contributor

@pengzhao-intel
Contributor

@xinyu-intel can help with this optimization. Is there a timeline, @sxjscience?

@sxjscience
Member Author

No strict timeline from my side. Currently the Sockeye team switches to the numpy topk when the CPU is used. We may ask @fhieber whether there is a timeline. This feature mainly affects the speed of beam search on CPU.

@pengzhao-intel
Contributor

Thanks, @sxjscience @fhieber. We're working on Sockeye performance optimization, so this will be a good case for us.

@mjpost

mjpost commented Mar 23, 2018

Our solution in Sockeye is to switch on the architecture context. In CPU context, we convert our matrix to numpy and use its version. The relevant lines are as follows:
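(The exact Sockeye lines aren't reproduced in this thread. A rough sketch of that kind of context-based dispatch, with a hypothetical helper name rather than the actual Sockeye code, might look like:)

```python
import mxnet as mx
import numpy as np


def topk_indices(scores, k):
    """Hypothetical helper: indices of the k smallest scores along the
    last axis, using numpy on CPU and mx.nd.topk on other devices."""
    if scores.context.device_type == 'cpu':
        # np.argpartition does a partial selection, which is much cheaper
        # than the full sort the CPU operator performs at the moment.
        idx = np.argpartition(scores.asnumpy(), k, axis=-1)[..., :k]
        return mx.nd.array(idx, ctx=scores.context)
    # On GPU the built-in operator is already fast.
    return mx.nd.topk(scores, axis=-1, k=k, ret_typ='indices', is_ascend=True)
```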

I should also note that the CPU decoding above didn't make use of the MKL libraries, since I didn't have an AMI with that set up. I'd be happy to run those experiments if I could get some help to properly set up the appropriate libraries.

@asmushetzel
Contributor

It would already be far faster if the CPU implementation used std::partial_sort() instead of doing a full sort and then picking the top k (which is what it does right now). The STL implementation of partial_sort() is very good and would be hard to beat.
For the Sockeye test case, I verified that such an implementation removes the bottleneck on CPU.
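(The actual change would live in the C++ backend; as an illustration only, the same idea can be shown in numpy, where a partial selection followed by sorting just the k survivors produces the same top-k as a full sort at much lower cost:)

```python
import numpy as np

x = np.random.rand(30000)
k = 5

# Full sort, then take the top k (roughly what the operator does today).
top_full = np.argsort(x)[::-1][:k]

# Partial selection first (linear time), then sort only the k survivors
# (k log k) -- the same idea as std::partial_sort.
cand = np.argpartition(x, -k)[-k:]
top_partial = cand[np.argsort(x[cand])[::-1]]

assert np.array_equal(top_full, top_partial)
```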

@pengzhao-intel
Contributor

pengzhao-intel commented Mar 25, 2018

@mjpost thanks a lot for the information.
Do you have a representative case we can work on, or should we just use the Sockeye example?

BTW, after the new MKL-DNN backend is enabled, we see that inference performance on C5 can get on par with GPU on P3. If you need any help setting up MKL/MKL-DNN, feel free to ping me.

@asmushetzel good points! We'd like to try it and report back with performance numbers :)

@mjamroz
Contributor

mjamroz commented Mar 25, 2018

@pengzhao-intel I believe you mean @mjpost, not me ;-).

@pengzhao-intel
Contributor

@mjamroz sorry for the typo.

@xinyu-intel
Contributor

@mjpost @asmushetzel @pengzhao-intel Thanks a lot! Since I couldn't find the real case, I used np.random.rand(10, 3, 1000, 1000) as input and got the following result:

Time to do Parse and initialize is 4.120000 seconds
Time to perform SortByKey is 38.850000 seconds
Time to Assign results is 1.490000 seconds

This shows that the SortByKey operation (which uses std::stable_sort) costs the most time, and that replacing it with std::partial_sort() should do better :)

@sxjscience
Member Author

sxjscience commented Mar 26, 2018 via email

@fhieber
Contributor

fhieber commented Mar 26, 2018

sounds great, thanks for taking on this one!

@mjpost

mjpost commented Mar 26, 2018

I can provide you with a Sockeye model and a 3,000-line test set (German-English) if you like; let me know. Though it seems your random test above confirmed the problem.

There are three more issues related to topk() that may or may not be worth considering right now. I don't want to complicate this performance issue with unrelated API issues, but some of these might be easy to address, so I thought I would list them here all in one place:

  • The documentation for topk() doesn't specify whether the returned k elements are sorted. They currently are (because of the underlying reliance on sort()), but we don't know whether they still will be after this change. Could the documentation clarify this? (See our issue on this here.)
  • We actually care about the locations of the top k items as well as their scores. We therefore have to flatten the array, call MXNet's topk(), and then use numpy's unravel_index() to map the indices back into the shape of the original array. It would be really great if we could pass in the unflattened tensor, along with the axis to perform topk() on, and directly get back indices in the shape of the original tensor (a sketch of the current workaround follows this list).
  • Finally, we currently call topk() multiple times on contiguous slices of our batched tensor (see the loop over sentno here). We could get a large performance improvement across all architectures for batched decoding if the topk() call could run on multiple subregions of a tensor in parallel.
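(For the second point above, the current flatten-and-unravel workaround looks roughly like the following; shapes and variable names are illustrative, not the actual Sockeye code:)

```python
import mxnet as mx
import numpy as np

# scores: (batch, beam, vocab); we want the k best entries per batch element
# together with their (beam, vocab) coordinates.
batch, beam, vocab, k = 2, 5, 30000, 5
scores = mx.nd.random.uniform(shape=(batch, beam, vocab))

flat = scores.reshape((batch, beam * vocab))
# Indices into the flattened (beam * vocab) axis.
flat_idx = mx.nd.topk(flat, axis=-1, k=k, ret_typ='indices').asnumpy().astype('int64')
# Map flat indices back to (beam, vocab) coordinates for each batch row.
beam_idx, vocab_idx = np.unravel_index(flat_idx.reshape(-1), (beam, vocab))
beam_idx = beam_idx.reshape(batch, k)
vocab_idx = vocab_idx.reshape(batch, k)
```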

@pengzhao-intel
Contributor

@mjpost Thanks for the information. We will take care of these items.
We have the WMT17 dataset and will use it for further performance evaluation.

@tdomhan
Contributor

tdomhan commented Mar 26, 2018

Thanks for looking into this! I once quickly added std::nth_element, which led to a 30x improvement over the current sorting code (just for topk, not for the entire translation). For the purpose of developing this, it is probably sufficient to look at random matrices of the right shape rather than running Sockeye end to end. We normally run topk over vectors of size vocab * beam_size, where vocab is typically 30,000 and beam_size is typically 5. If we could take a matrix of shape (batch_size, beam_size * vocab) and parallelize over the batch dimension, that would of course be even better.

@sxjscience
Member Author

For the parallel top-k, the current version should already support it via the “axis” parameter.
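(For reference, using the beam-search layout described above, a single batched call would look something like this; shapes are illustrative:)

```python
import mxnet as mx

batch_size, beam_size, vocab = 64, 5, 30000
# One row of (beam_size * vocab) hypothesis scores per sentence in the batch.
scores = mx.nd.random.uniform(shape=(batch_size, beam_size * vocab))

# A single call handles the whole batch; rows are independent along axis -1.
best = mx.nd.topk(scores, axis=-1, k=beam_size, ret_typ='indices')
print(best.shape)  # (64, 5)
```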

@tdomhan
Contributor

tdomhan commented Mar 26, 2018

Would you expect this to be faster than sorting each entry individually?
I don't fully understand the sorting code, but I was under the impression that sorting is done globally over the entire matrix and then a secondary sort is performed to handle batch ids.

@sxjscience
Member Author

This should be faster when GPU is used. However, when CPU is used, this may be slower.

@ciyongch
Contributor

ciyongch commented Aug 8, 2018

Hi @sxjscience , @pengzhao-intel

I made some enhancements (#12085) based on the latest code. I also ran some tests with the single topk op and the Sockeye translate model, and the results look good with this enhancement.

  1. Single topk op performance boost

    With an input ndarray of np.random.rand(10, 3, 1000, 1000), ret_typ='value', and k=3, the optimized version shows a great performance speedup over the out-of-box version, especially when the axis is -1.

    axis           out-of-box version   optimized version   speedup
    default (-1)     601.46 ms             7.40 ms           81.3x
    0               1436.65 ms           553.22 ms            2.6x
    1               1617.92 ms           423.96 ms            3.82x
    2                922.24 ms           400.52 ms            2.30x

  2. Sockeye performance boost with the optimized topk

    Tested with the Sockeye translate model, with batch_size=64 and beam_size=10, the optimized topk contributed a 3.43x speedup over the out-of-box version.

    out-of-box version   optimized version   speedup
    5.35 sent/sec        18.37 sent/sec      3.43x

@tdomhan
Contributor

tdomhan commented Aug 8, 2018

thanks for the update! 3.4x would be quite an impressive speed-up. :)

@mjpost

mjpost commented Aug 8, 2018

Yes, this looks awesome. It's great to see the topk() speedup borne out in end-to-end performance as well.

@fhieber
Contributor

fhieber commented Aug 8, 2018

really nice! It'd be great if this makes it into the 1.3 release.

@pengzhao-intel
Contributor

In fact, I have to apologize for the delay in delivering the fix and reporting back.
It would be really nice to include it in 1.3 (I know it's the last minute before the code freeze).

Thanks, @tdomhan @mjpost @fhieber @sxjscience

@ciyongch
Contributor

Hi @tdomhan @mjpost @fhieber, the optimized topk has now been merged into MXNet's master branch :)
I submitted a new PR awslabs/sockeye#506 to switch topk from the numpy version to the MXNet version during inference on CPU devices. Please take a look when you have time.
