-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: FuseRecv #224
RFC: FuseRecv #224
Conversation
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the ℹ️ Googlers: Go here for more info. |
@googlebot I consent. |
@googlebot I consent |
1 similar comment
@googlebot I consent |
rfcs/20200411-fuse_recv.md
Outdated
|
||
## Design Proposal | ||
|
||
 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these figures should have relative path, i.e. s/20200409-fuse_recv/\./
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx, done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's strange...I still cannot see the pictures inline. Maybe take a look at https://github.com/tensorflow/community/pull/214/files#diff-a30752476c946e1740658ce11cc02d89R631 and see how it's done?
Ah I see, it should be 
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bairen, i renamed pictures name to 0411***, please check this: https://github.com/liutongxuan/community/blob/master/rfcs/20200411-fuse_recv.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works now. Thanks!
rfcs/20200411-fuse_recv.md
Outdated
### Compatibility | ||
* This feature works with the ParameterServerStrategy. | ||
* This feature considers tensors on difference devices such as CPU, GPU and TPU. | ||
* Independent of SaveModel or checkpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SavedModel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thx
#### Fuse the tensors into a single Send/Recv Solution 2 (Derek Murray) | ||
Pack the tensor contents into a single flattened buffer. This would be very | ||
similar to the ScopedAllocator optimization that [email protected] and | ||
[email protected] implemented for collectives, and it might be possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collectives -> collective ops
``` | ||
|
||
### Compatibility | ||
* This feature works with the ParameterServerStrategy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about tensor fusion for allreduce strategy? Can certain parts of the implementation be shared?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For collective ops there is a Grappler optimizer called scoped allocator which does similar job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll submit a PR to TensorFlow as soon as possible.
@googlebot I consent. |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
include additional information for every Recv tensor. | ||
|
||
## Questions and Discussion Topics | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a beta user of this feature, based on my testing result. I have some questions hope to discuss.
-
Conceptually, How could FuseRecvOp accelerate the training speed?
FuseRecv will not always reduce the time fetching tensors from another host. Based on my statistic, a recv action(no matter RecvOp or FuseRecvOp), from client-side view, 90%+ of time is spent on collecting data from server's rendezvous. Because FuseRecv requires all tensors are collected then return, a FuseRecv with n tensors, the overall time may even close to n * Recv requests individually.
So based on my understanding, how FuseRecvOp acceletates is: Reduce the number of rpc calls significantly --> lower rpc framwork's pressure --> save some CPU time & reduce context-switches --> so other op's execution will be benefit. Am I right?
Like Paul Tucker's idea: Dynamic Fusion in runtime. If FuseRecv doesn't require all tensors collected, can partial returns after a timeout, then retry remainings? will this be better? Em, but it sounds very hard to adjust the timeout. -
How will this feature proform when we change the data transmission engine?
If my first understanding is correct. then the performance gain from FuseRecv is from resources saved by Grpc. If we use other frameworks like RDMA, seastar, etc. How will FuseRecv perform? -
Can you provide more details about how backward graph are fused?
because from the ps-side, the count of RPC received is reduced largely(~50%), but from worker-side, the reduction is ~10%. seems like gradients are not fused tightly?
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Right, but not only reduce the rpc time. For example, in tensorflow, we should find the rendzvous by the step_id for every recv request (
auto rendez = FindOrCreate(step_id)
) . TheFindOrCreate
function should call a mutex_lock to protect the table_. So If we use FuseRecv, the lock times will reduce also. There are many other considers here :)
We have a design for Dynamic Fusion, we can have discussion in another thread. -
How to rewrite the graph is the mainly challenge. This part can be reuse when we use other frameworks. But we have to add FuseRecv interface to the new transmission engine.
-
The FuseRecv only consider the
Recv
operation in the graph. If these N Recv nodes can be fused, then we fuse them. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.
One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.
One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.
There are different strategies to apply here, we design to support multiple fusion strategy in the optimizer. For example there's simple strategy is that we fuse the recv nodes when their destination node is the same.
What we shown in the RFC is more complicated and aggressive fusion strategy, and we applied in our models got significant performance improvement. Besides, with cost model, we could explore more.
For the dynamic solution which mentioned, time window is hard to define. What we concern is not the implementation, we don't want turning the timing window to brings overhead to users. Of course there's could be more automatic way to smartly adjust the window, but in our point of view, in graph optimizer we could get proper fusion as well based on cost model and analysis of the graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.
One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.
Our current approach is to rewrite the graph based on a cost model, and 'FuseRecv' must wait all tensors to be ready. This is a static fusion.
Back to the dynamic fusion, maybe it can be implemented base the static fusion. For example, 'A' send a 'FuseRecv' request to 'B' for five tensors. Here we also need a time window(or somewhat window). We assume 3 tensors are ready when the waiting time exceeds the specified window time, then we send the 3 tensors back to 'A'. And the last 2 tensors should be sent to 'A' later. So we should have a state manager here to manage these requests and responses. The benefit of the dynamic solution which base on the static fusion is that we try to fuse as much as we can(base on the cost model), and it can achieve the same performance as before in the worst case(response one by one). And this dynamic fusion can simplify the setting of time window. For example, poor performance may result from too large time window settings. But don't worry, we have the static fusion which base on the cost model, it can promise to get better performance in most cases.
This is a simple idea, we can have much more discussion in the meeting.
Does this optimization apply to intra-node host from/to device Recv, not necessarily triggering an RPC? |
It's only for remote nodes, not for local recv between host and device. We haven't tried, but it could be interesting to fuse the local recv as well which could reduce conflict in rendezvous. |
Instead of calling this "FuseRecv", should we simply extend the existing Recv op (or call it RecvV2) ? |
RecvV2 seems a good one as well. We can discuss this in review meeting. Thanks your suggestion. |
Design Review is scheduled for 5/12/20 at the Networking SIG. |
This has been accepted, just waiting for the final updates to this PR and notes from the review before merging. |
Design review notes Static fusion performance
Concerns about performance with RDMA
Details on how the backward graph is fused
Intra-node FuseRecv
Op name
What kind of models will benefit:
Should the send node also be fused?
Configuration options to enable FuseRecv
Op definition
|
@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed? |
For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster. |
That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network? |
@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance. |
e2e, whole training |
I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv |
Ok, thanks a lot!!! :) |
|
Seastar could bring 2-4x faster. Fuse Recv brings extra 2x faster. |
Cool! I would try to reproduce in my own device. |
@liutongxuan, hi, I have use this branch https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv to complie the tensorflow. But some bazel question happened when I complie it. I wanna know what bazel version can be used to complie it. |
I found this link is out of date, could you show the new code link again? thanks |
This RFC will be open for comment until Friday, May 1st, 2020.
FuseRecv
Objective
This RFC proposes a new FuseRecv Op which would receive multiple tensors with
different types through one Remote Procedure Call (RPC). This feature could
significantly reduce the number of RPC calls in most rank or match models
such as Search, Recommend or Ad systems.