
[Operator] Accelerate the CPU side performance of topk #10205

Closed
sxjscience opened this issue Mar 22, 2018 · 24 comments

Comments

@sxjscience
Member

One issue with topk is that the CPU implementation is much slower than the numpy version. Here is a speed test done by @mjpost. We need to improve its speed.
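(The original numbers aren't reproduced in this thread. As an illustration only, a minimal sketch of how such a CPU comparison could be run, assuming mxnet and numpy are installed and using a made-up beam-search-like shape, is:)

```python
import time

import mxnet as mx
import numpy as np

# Illustrative shape: scores for one beam-search step (batch * beam, vocab).
data_np = np.random.rand(320, 30000).astype(np.float32)
data_nd = mx.nd.array(data_np, ctx=mx.cpu())

# MXNet topk on CPU.
start = time.time()
for _ in range(10):
    out = mx.nd.topk(data_nd, axis=-1, k=5, ret_typ='indices')
    out.wait_to_read()  # force the asynchronous call to finish
print('mx.nd.topk     : %.4f s' % (time.time() - start))

# numpy counterpart: argpartition does a partial selection of the top k.
start = time.time()
for _ in range(10):
    part = np.argpartition(data_np, -5, axis=-1)[:, -5:]
print('np.argpartition: %.4f s' % (time.time() - start))
```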

sxjscience changed the title from [Operator] Accelerate the CPU performance of topk to [Operator] Accelerate the CPU side performance of topk on Mar 22, 2018
@marcoabreu
Contributor

@pengzhao-intel
Contributor

@xinyu-intel can help with this optimization. Is there a timeline, @sxjscience?

@sxjscience
Member Author

No strict timeline from my side. Currently the Sockeye team switches to the numpy topk when the CPU is used. We may ask @fhieber whether there is a timeline. This feature mainly affects the speed of beam search on CPU.

@pengzhao-intel
Contributor

Thanks, @sxjscience @fhieber. We're working on Sockeye performance optimization, so this will be a good case for us.

@mjpost

mjpost commented Mar 23, 2018

Our solution in Sockeye is to switch on the architecture context. In CPU context, we convert our matrix to numpy and use its version. The relevant lines are as follows:
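(The exact Sockeye lines aren't reproduced in this thread. A rough sketch of that kind of context-based dispatch, with a hypothetical helper name rather than the actual Sockeye code, might look like:)

```python
import mxnet as mx
import numpy as np


def topk_indices(scores, k):
    """Hypothetical helper: indices of the k smallest scores along the
    last axis, using numpy on CPU and mx.nd.topk on other devices."""
    if scores.context.device_type == 'cpu':
        # np.argpartition does a partial selection, which is much cheaper
        # than the full sort the CPU operator performs at the moment.
        idx = np.argpartition(scores.asnumpy(), k, axis=-1)[..., :k]
        return mx.nd.array(idx, ctx=scores.context)
    # On GPU the built-in operator is already fast.
    return mx.nd.topk(scores, axis=-1, k=k, ret_typ='indices', is_ascend=True)
```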

I should also note that the CPU decoding above didn't make use of the MKL libraries, since I didn't have an AMI with that set up. I'd be happy to run those experiments if I could get some help to properly set up the appropriate libraries.

@asmushetzel
Contributor

It would already be far faster if the CPU implementation used std::partial_sort() instead of doing a full sort and then picking the top k (which is what it does right now). The STL implementation of partial_sort() is very good and would be hard to beat.
For the Sockeye test case, I verified that such an implementation removes the bottleneck on CPU.
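(The actual change would live in the C++ backend; as an illustration only, the same idea can be shown in numpy, where a partial selection followed by sorting just the k survivors produces the same top-k as a full sort at much lower cost:)

```python
import numpy as np

x = np.random.rand(30000)
k = 5

# Full sort, then take the top k (roughly what the operator does today).
top_full = np.argsort(x)[::-1][:k]

# Partial selection first (linear time), then sort only the k survivors
# (k log k) -- the same idea as std::partial_sort.
cand = np.argpartition(x, -k)[-k:]
top_partial = cand[np.argsort(x[cand])[::-1]]

assert np.array_equal(top_full, top_partial)
```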

@pengzhao-intel
Contributor

pengzhao-intel commented Mar 25, 2018

@mjpost thanks a lot for the information.
Do you have a representative case we can work on, or should we just use the Sockeye example?

BTW, after the new MKL-DNN backend is enabled, we see that inference performance on C5 can get on par with GPU on P3. If you need any help setting up MKL/MKL-DNN, feel free to ping me.

@asmushetzel good points! We'd like to try it and report back with performance numbers :)

@mjamroz
Contributor

mjamroz commented Mar 25, 2018

@pengzhao-intel I believe you mean @mjpost, not me ;-).

@pengzhao-intel
Contributor

@mjamroz sorry for the typo.

@xinyu-intel
Contributor

@mjpost @asmushetzel @pengzhao-intel Thanks a lot! Since I couldn't find the real case, I used np.random.rand(10, 3, 1000, 1000) as input and got the following result:

Time to do Parse and initialize is 4.120000 seconds
Time to perform SortByKey is 38.850000 seconds
Time to Assign results is 1.490000 seconds

This shows that the SortByKey operation (which uses std::stable_sort) costs the most time, and that replacing it with std::partial_sort() should do better :)

@sxjscience
Member Author

sxjscience commented Mar 26, 2018 via email

@fhieber
Contributor

fhieber commented Mar 26, 2018

sounds great, thanks for taking on this one!

@mjpost

mjpost commented Mar 26, 2018

I can provide you with a Sockeye model and a 3,000-line test set (German-English) if you like; let me know. Though it seems your random test above confirmed the problem.

There are three more issues related to topk() that may or may not be worth considering right now. I don't want to complicate this performance issue with unrelated API issues, but some of these might be easy to address, so I thought I would list them here all in one place:

  • The documentation for topk() doesn't specify whether the returned k elements are sorted. They currently are (because of the underlying reliance on sort()), but we don't know whether they still will be after this change. Could the documentation clarify this? (See our issue on this here.)
  • We actually care about the locations of the top k items as well as their scores. We therefore have to flatten the array, call MXNet's topk(), and then use numpy's unravel_index() to map the indices back into the shape of the original array. It would be really great if we could pass in the unflattened tensor, along with the axis to perform topk() on, and directly get back indices in the shape of the original tensor (a sketch of the current workaround follows this list).
  • Finally, we currently call topk() multiple times on contiguous slices of our batched tensor (see the loop over sentno here). We could get a large performance improvement across all architectures for batched decoding if the topk() call could run on multiple subregions of a tensor in parallel.
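(For the second point above, the current flatten-and-unravel workaround looks roughly like the following; shapes and variable names are illustrative, not the actual Sockeye code:)

```python
import mxnet as mx
import numpy as np

# scores: (batch, beam, vocab); we want the k best entries per batch element
# together with their (beam, vocab) coordinates.
batch, beam, vocab, k = 2, 5, 30000, 5
scores = mx.nd.random.uniform(shape=(batch, beam, vocab))

flat = scores.reshape((batch, beam * vocab))
# Indices into the flattened (beam * vocab) axis.
flat_idx = mx.nd.topk(flat, axis=-1, k=k, ret_typ='indices').asnumpy().astype('int64')
# Map flat indices back to (beam, vocab) coordinates for each batch row.
beam_idx, vocab_idx = np.unravel_index(flat_idx.reshape(-1), (beam, vocab))
beam_idx = beam_idx.reshape(batch, k)
vocab_idx = vocab_idx.reshape(batch, k)
```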

@pengzhao-intel
Contributor

@mjpost Thanks for the information. We will take care of these items.
We have the WMT17 dataset and will use it for further performance evaluation.

@tdomhan
Contributor

tdomhan commented Mar 26, 2018

Thanks for looking into this! I once quickly added std::nth_element, which led to a 30x improvement over the current sorting code (just for topk, not for the entire translation). For the purpose of developing this, it is probably sufficient to look at random matrices of the right shape rather than running Sockeye end to end. We normally run topk over vectors of size vocab * beam_size, where vocab is typically 30,000 and beam_size is typically 5. If we could take a matrix of shape (batch_size, beam_size * vocab) and parallelize over the batch dimension, that would of course be even better.

@sxjscience
Member Author

For the parallel top-k, the current version should already support it via the “axis” parameter.
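(For reference, using the beam-search layout described above, a single batched call would look something like this; shapes are illustrative:)

```python
import mxnet as mx

batch_size, beam_size, vocab = 64, 5, 30000
# One row of (beam_size * vocab) hypothesis scores per sentence in the batch.
scores = mx.nd.random.uniform(shape=(batch_size, beam_size * vocab))

# A single call handles the whole batch; rows are independent along axis -1.
best = mx.nd.topk(scores, axis=-1, k=beam_size, ret_typ='indices')
print(best.shape)  # (64, 5)
```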

@tdomhan
Contributor

tdomhan commented Mar 26, 2018

Would you expect this to be faster than sorting each entry individually?
I don't fully understand the sorting code, but I was under the impression that sorting is done globally over the entire matrix and then a secondary sort is performed to handle batch ids.

@sxjscience
Member Author

This should be faster when GPU is used. However, when CPU is used, this may be slower.

@ciyongch
Contributor

ciyongch commented Aug 8, 2018

Hi @sxjscience , @pengzhao-intel

I made some enhancements (#12085) based on the latest code. I also ran some tests with the single topk op and the Sockeye translate model, and the results look good with this enhancement.

  1. Single topk op performance boost

    With an input ndarray of np.random.rand(10, 3, 1000, 1000), ret_typ='value', and k=3, the optimized version shows a great performance speedup over the out-of-box version, especially when the axis is -1.

    axis           out-of-box version   optimized version   speedup
    default (-1)     601.46 ms             7.40 ms           81.3x
    0               1436.65 ms           553.22 ms            2.6x
    1               1617.92 ms           423.96 ms            3.82x
    2                922.24 ms           400.52 ms            2.30x

  2. Sockeye performance boost with the optimized topk

    Tested with the Sockeye translate model, with batch_size=64 and beam_size=10, the optimized topk contributed a 3.43x speedup over the out-of-box version.

    out-of-box version   optimized version   speedup
    5.35 sent/sec        18.37 sent/sec      3.43x

@tdomhan
Contributor

tdomhan commented Aug 8, 2018

thanks for the update! 3.4x would be quite an impressive speed-up. :)

@mjpost

mjpost commented Aug 8, 2018

Yes, this looks awesome. It's great to see the topk() speedup borne out in end-to-end performance as well.

@fhieber
Contributor

fhieber commented Aug 8, 2018

really nice! It'd be great if this makes it into the 1.3 release.

@pengzhao-intel
Contributor

In fact, I have to apologize for the delay in delivering the fix and reporting back.
It would be really nice to include it in 1.3 (I know it's the last minute before the code freeze).

Thanks, @tdomhan @mjpost @fhieber @sxjscience

@ciyongch
Contributor

Hi @tdomhan @mjpost @fhieber, the optimized topk has now been merged into MXNet's master branch :)
I submitted a new PR awslabs/sockeye#506 to switch topk from the numpy version to the MXNet version during inference on CPU devices. Please take a look when you have time.
