[Operator] Accelerate the CPU side performance of topk #10205
@xinyu-intel can help with this optimization. Is there a timeline? @sxjscience
No strict timeline from my side. Currently the sockeye team switches to the numpy topk if CPU is used. You may ask @fhieber if there is a timeline. This feature mainly affects the speed of beam search on CPU.
Thanks @sxjscience. @fhieber, we're working on Sockeye performance optimization, so this will be a good case for us.
Our solution in Sockeye is to switch on the architecture context: in a CPU context, we convert our matrix to numpy and use its version. The relevant lines are as follows: I should also note that the CPU decoding above didn't make use of the MKL libraries, since I didn't have an AMI with that set up. I'd be happy to run those experiments if I could get some help properly setting up the appropriate libraries.
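As a hedged sketch of the fallback pattern described above (not Sockeye's actual code; the function and variable names are illustrative), the CPU path can convert to numpy and use partial selection for top-k:

```python
# Illustrative sketch of a CPU fallback: do top-k in numpy instead of the
# mxnet operator. Names here are hypothetical, not Sockeye's real code.
import numpy as np

def numpy_topk(scores: np.ndarray, k: int):
    """(values, indices) of the k largest entries along the last axis,
    in descending order, using partial selection instead of a full sort."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]  # O(n) selection
    vals = np.take_along_axis(scores, idx, axis=-1)
    order = np.argsort(-vals, axis=-1)                    # sort only k items
    return (np.take_along_axis(vals, order, axis=-1),
            np.take_along_axis(idx, order, axis=-1))

# In a beam-search loop one would do something like (hypothetical):
#   scores_np = scores_ndarray.asnumpy()  # NDArray -> numpy on CPU context
#   best_vals, best_ids = numpy_topk(scores_np, beam_size)
```

The win comes from `argpartition`, which selects the k largest entries without sorting the whole row.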
It would already be much faster if the CPU implementation used std::partial_sort() instead of doing a full sort and then picking the top-k (which is what it does right now). The STL implementation of partial_sort() is very good and would be hard to beat.
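As a rough numpy illustration of why partial selection beats a full sort for top-k (np.partition plays the role std::partial_sort would play in the C++ operator; timings are machine-dependent and purely indicative):

```python
# Compare full-sort top-k vs. partial-selection top-k in numpy.
# np.partition is the numpy analogue of std::partial_sort here.
import time
import numpy as np

x = np.random.rand(10, 3, 1000, 1000)  # the test input used in this thread
k = 5

t0 = time.time()
full = np.sort(x, axis=-1)[..., ::-1][..., :k]   # full O(n log n) sort
t1 = time.time()
part = np.partition(x, -k, axis=-1)[..., -k:]    # O(n) partial selection
part = np.sort(part, axis=-1)[..., ::-1]         # order only the k winners
t2 = time.time()

assert np.allclose(full, part)                   # same top-k values
print(f"full sort: {t1 - t0:.2f}s  partial: {t2 - t1:.2f}s")
```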
@mjpost really thanks for the information. BTW, after the new MKL-DNN backend is enabled, we see that inference performance on C5 can get on par with GPU on P3. If you need any help setting up MKL/MKL-DNN, feel free to ping me. @asmushetzel good point! We'd like to try it and report back with perf numbers :)
@pengzhao-intel I believe you mean @mjpost, not me ;-).
@mjamroz sorry for the typo.
@mjpost @asmushetzel @pengzhao-intel Really thanks! Because I could not find a real case, my input was np.random.rand(10,3,1000,1000), and I got the following result:
Time to do Parse and initialize is 4.120000 seconds
Time to perform SortByKey is 38.850000 seconds
Time to Assign results is 1.490000 seconds
This proves that the SortByKey operation (which uses std::stable_sort) costs the most time, and replacing it with std::partial_sort() should do better :)
Looks good. Let's use partial_sort.
Sounds great, thanks for taking this one on!
I can provide you with a Sockeye model and a 3,000-line test set (German-English) if you like; let me know. Though it seems your random test above confirmed the problem. There are three more issues related to
@mjpost Thanks for the information. We will take care of these items.
Thanks for looking into this! I once quickly added
For the parallel top-k, the current version should already support it via the "axis" parameter.
Would you expect this to be faster than sorting each entry individually?
This should be faster when GPU is used. However, when CPU is used, this may be slower. |
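For reference, here is a hedged numpy sketch of what an axis-wise (batched) top-k does: it selects k entries per slice along one axis while batching over all the others. This is an analogue of the operator's `axis` parameter, not MXNet's implementation:

```python
# Batched top-k along an arbitrary axis, sketched in numpy.
# Not MXNet's implementation; an illustration of `axis` semantics only.
import numpy as np

def topk_along_axis(x: np.ndarray, k: int, axis: int = -1) -> np.ndarray:
    """Indices of the k largest entries along `axis`, ordered descending."""
    xm = np.moveaxis(x, axis, -1)                     # bring axis to the end
    idx = np.argpartition(xm, -k, axis=-1)[..., -k:]  # O(n) partial selection
    vals = np.take_along_axis(xm, idx, axis=-1)
    order = np.argsort(-vals, axis=-1)                # order only k winners
    idx = np.take_along_axis(idx, order, axis=-1)
    return np.moveaxis(idx, -1, axis)                 # restore axis position
```

Because the selection is independent per slice, the slices can be processed in parallel, which is where a GPU wins and a single CPU thread may not.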
Hi @sxjscience, @pengzhao-intel, I made some enhancements (#12085) based on the latest code. I also ran some tests with the single topk op and the sockeye translate model, and the results look good with this enhancement.
Thanks for the update! 3.4x would be quite an impressive speed-up. :)
Yes, this looks awesome. It's great to see the
Really nice! It'd be great if this makes it into the 1.3 release.
In fact, I have to apologize for the delay in the fix and in reporting back. Thanks, @tdomhan @mjpost @fhieber @sxjscience
Hi @tdomhan @mjpost @fhieber, the optimized
One issue with topk is that the CPU implementation is much slower than the numpy version. Here is a speed test done by @mjpost. We need to accelerate it.