RFC: FuseRecv #224

liutongxuan · 2020-04-11T06:07:47Z

This RFC will be open for comment until Friday, May 1st, 2020.

FuseRecv

Status	Proposed
Author(s)	Tongxuan Liu([email protected]) Peng Tao([email protected]) Langshi Chen ([email protected])
Reviewers(s)	Ayush Dubey([email protected]) Jeroen Bédorf([email protected]) Derek Murray([email protected]) Bairen Yi([email protected]) Paul Tucker([email protected])
Sponsor	Ayush Dubey([email protected])
Updated	2020-04-11

Objective

This RFC proposes a new FuseRecv Op which would receive multiple tensors with
different types through one Remote Procedure Call (RPC). This feature could
significantly reduce the number of RPC calls in most rank or match models
such as Search, Recommend or Ad systems.

googlebot · 2020-04-11T06:07:51Z

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

liutongxuan · 2020-04-11T06:22:39Z

@googlebot I consent.

francktcheng · 2020-04-11T07:10:43Z

@googlebot I consent

shanshanpt · 2020-04-11T09:31:59Z

@googlebot I consent

byronyi · 2020-04-13T09:48:23Z

rfcs/20200411-fuse_recv.md

+
+## Design Proposal
+
+![Figure 1: Current graph partition strategy](20200409-fuse_recv/current_graph_partition_strategy.png "Current graph partition strategy")


All these figures should have relative path, i.e. s/20200409-fuse_recv/\./.

That's strange...I still cannot see the pictures inline. Maybe take a look at https://github.com/tensorflow/community/pull/214/files#diff-a30752476c946e1740658ce11cc02d89R631 and see how it's done?

Ah I see, it should be ![Figure 1: Current graph partition strategy](20200409-fuse_recv/current_graph_partition_strategy.png)?

bairen, i renamed pictures name to 0411***, please check this: https://github.com/liutongxuan/community/blob/master/rfcs/20200411-fuse_recv.md

It works now. Thanks!

terrytangyuan · 2020-04-13T11:40:13Z

rfcs/20200411-fuse_recv.md

+### Compatibility
+* This feature works with the ParameterServerStrategy.
+* This feature considers tensors on difference devices such as CPU, GPU and TPU.
+* Independent of SaveModel or checkpoint.


terrytangyuan · 2020-04-13T11:43:04Z

rfcs/20200411-fuse_recv.md

+#### Fuse the tensors into a single Send/Recv Solution 2 (Derek Murray)
+Pack the tensor contents into a single flattened buffer. This would be very
+similar to the ScopedAllocator optimization that [email protected] and
+[email protected] implemented for collectives, and it might be possible


collectives -> collective ops

terrytangyuan · 2020-04-13T11:45:17Z

rfcs/20200411-fuse_recv.md

+```
+
+### Compatibility
+* This feature works with the ParameterServerStrategy.


How about tensor fusion for allreduce strategy? Can certain parts of the implementation be shared?

For collective ops there is a Grappler optimizer called scoped allocator which does similar job.

i'll submit a PR to TensorFlow as soon as possible.

jbedorf · 2020-04-14T15:11:13Z

@googlebot I consent.

googlebot · 2020-04-14T15:11:17Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

legatoo · 2020-04-16T04:26:17Z

rfcs/20200411-fuse_recv.md

+include additional information for every Recv tensor.
+
+## Questions and Discussion Topics
+


I'm a beta user of this feature, based on my testing result. I have some questions hope to discuss.

Conceptually, How could FuseRecvOp accelerate the training speed?
FuseRecv will not always reduce the time fetching tensors from another host. Based on my statistic, a recv action(no matter RecvOp or FuseRecvOp), from client-side view, 90%+ of time is spent on collecting data from server's rendezvous. Because FuseRecv requires all tensors are collected then return, a FuseRecv with n tensors, the overall time may even close to n * Recv requests individually.

So based on my understanding, how FuseRecvOp acceletates is: Reduce the number of rpc calls significantly --> lower rpc framwork's pressure --> save some CPU time & reduce context-switches --> so other op's execution will be benefit. Am I right?

Like Paul Tucker's idea: Dynamic Fusion in runtime. If FuseRecv doesn't require all tensors collected, can partial returns after a timeout, then retry remainings? will this be better? Em, but it sounds very hard to adjust the timeout.

How will this feature proform when we change the data transmission engine?
If my first understanding is correct. then the performance gain from FuseRecv is from resources saved by Grpc. If we use other frameworks like RDMA, seastar, etc. How will FuseRecv perform?

Can you provide more details about how backward graph are fused?
because from the ps-side, the count of RPC received is reduced largely(~50%), but from worker-side, the reduction is ~10%. seems like gradients are not fused tightly?

Thanks.

Right, but not only reduce the rpc time. For example, in tensorflow, we should find the rendzvous by the step_id for every recv request (auto rendez = FindOrCreate(step_id)) . The FindOrCreate function should call a mutex_lock to protect the table_. So If we use FuseRecv, the lock times will reduce also. There are many other considers here :)
We have a design for Dynamic Fusion, we can have discussion in another thread.

How to rewrite the graph is the mainly challenge. This part can be reuse when we use other frameworks. But we have to add FuseRecv interface to the new transmission engine.

The FuseRecv only consider the Recv operation in the graph. If these N Recv nodes can be fused, then we fuse them. :)

I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.

One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.

I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.

One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.

There are different strategies to apply here, we design to support multiple fusion strategy in the optimizer. For example there's simple strategy is that we fuse the recv nodes when their destination node is the same.

What we shown in the RFC is more complicated and aggressive fusion strategy, and we applied in our models got significant performance improvement. Besides, with cost model, we could explore more.

For the dynamic solution which mentioned, time window is hard to define. What we concern is not the implementation, we don't want turning the timing window to brings overhead to users. Of course there's could be more automatic way to smartly adjust the window, but in our point of view, in graph optimizer we could get proper fusion as well based on cost model and analysis of the graph.

I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.

One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.

Our current approach is to rewrite the graph based on a cost model, and 'FuseRecv' must wait all tensors to be ready. This is a static fusion.
Back to the dynamic fusion, maybe it can be implemented base the static fusion. For example, 'A' send a 'FuseRecv' request to 'B' for five tensors. Here we also need a time window(or somewhat window). We assume 3 tensors are ready when the waiting time exceeds the specified window time, then we send the 3 tensors back to 'A'. And the last 2 tensors should be sent to 'A' later. So we should have a state manager here to manage these requests and responses. The benefit of the dynamic solution which base on the static fusion is that we try to fuse as much as we can(base on the cost model), and it can achieve the same performance as before in the worst case(response one by one). And this dynamic fusion can simplify the setting of time window. For example, poor performance may result from too large time window settings. But don't worry, we have the static fusion which base on the cost model, it can promise to get better performance in most cases.
This is a simple idea, we can have much more discussion in the meeting.

byronyi · 2020-04-21T08:24:35Z

Does this optimization apply to intra-node host from/to device Recv, not necessarily triggering an RPC?

liutongxuan · 2020-04-21T12:29:47Z

Does this optimization apply to intra-node host from/to device Recv, not necessarily triggering an RPC?

It's only for remote nodes, not for local recv between host and device. We haven't tried, but it could be interesting to fuse the local recv as well which could reduce conflict in rendezvous.

agarwal-ashish · 2020-04-24T21:06:36Z

Instead of calling this "FuseRecv", should we simply extend the existing Recv op (or call it RecvV2) ?

liutongxuan · 2020-04-25T03:27:27Z

Instead of calling this "FuseRecv", should we simply extend the existing Recv op (or call it RecvV2) ?

RecvV2 seems a good one as well. We can discuss this in review meeting. Thanks your suggestion.

ematejska · 2020-04-27T20:43:12Z

Design Review is scheduled for 5/12/20 at the Networking SIG.

ematejska · 2020-05-12T15:48:05Z

This has been accepted, just waiting for the final updates to this PR and notes from the review before merging.

dubey · 2020-05-12T16:15:51Z

Design review notes

Static fusion performance

Alex Passos: is static fusion always net positive benefit? It will introduce extra latency. Should it be turned on by default always?
- We will start off as an opt-in config param marked experimental
- Once the implementation is submitted, we will try performance on a few models, and revisit the non-default config in 6 months.

Concerns about performance with RDMA

It seems orthogonal to the proposal. Performance impact is unknown

Details on how the backward graph is fused

TF doesn’t distinguish between forward and backward graph

Intra-node FuseRecv

Bairen: Should we use fuserecv for single host send-recv? Context: sparse embedding models in heterogeneous GPU clusters. Similar pattern observed for host to device copies. Workaround - manually fuse recv nodes.
- The authors haven’t tried intra-node, no performance data
- Does implementation work for intra-node cases?
- Authors - yes
- Alex - what benefit would you get? Kernel launch overhead?
- Bairen - executor/launch overhead, typically tf.gather of indexed slices. Sometimes we see large number of cudaMemcpyHostToDevice.
- Alex - if this is future work, we can leave it out of scope. Overhead of RPC is much larger than overhead of executor op launch

Op name

FuseRecv vs RecvV2
Alex: _Recv is internal op - extending it is okay from API perspective
Conclusion: we will augment the existing _Recv op
Action Item: modify RFC to incorporate these changes

What kind of models will benefit:

Sparse models for ads with feature columns
Result in 100s of rpcs
1000s of workers and parameter servers
Data fetch per rpc is very small

Should the send node also be fused?

Alternative: mrry@ - pack tensors into n length DT_VARIANT
- Advantage: we reuse existing ops and rpc machinery
- No perceived performance benefit, a different way to implement the same idea
- Conclusion: okay to continue with the current proposal
Alternative: mrry@ - pack into single flattened buffer
- Cons: does not work when fused tensors are different types and shapes
- Alex: incremental follow up work that can be carried out after this PR

Configuration options to enable FuseRecv

Current proposal doesn’t cover TF2
For TF2, we need tf.config.experimental.do_fuse_recv - which will mimic other existing tf.config implementations
Action Item: augment RFC with TF2 configuration details

Op definition

send_device, recv_device, send_device_incarnation - should be singletons
tensor_name and tensor_type should be lists
Action Item: change the op defs as described above

alohali · 2020-08-02T07:56:39Z

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

liutongxuan · 2020-08-06T01:59:54Z

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

alohali · 2020-08-06T03:52:18Z

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

Cuttstage · 2020-08-06T03:52:56Z

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

liutongxuan · 2020-08-06T08:32:03Z

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training

liutongxuan · 2020-08-06T08:34:22Z

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

Cuttstage · 2020-08-06T13:04:54Z

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

Ok, thanks a lot!!! :)

alohali · 2020-08-07T08:41:14Z

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training
hi, I also find your doc about this:Seastar-based RPC for TF Worker Service https://docs.google.com/document/d/1f1m-98rbH33WE0qNb3tP0yt9Jjbb-rprvweLobRbTCA/edit#heading=h.oowm89xd5yg8
And it says only using seastar instead of grpc brings 2x speed up for w&d, but in https://discuss.tf.wiki/t/topic/387 says seastar + FuseRecv bring it. is there any error? which really brings the speed up?

liutongxuan · 2020-08-07T08:55:28Z

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training
hi, I also find your doc about this:Seastar-based RPC for TF Worker Service https://docs.google.com/document/d/1f1m-98rbH33WE0qNb3tP0yt9Jjbb-rprvweLobRbTCA/edit#heading=h.oowm89xd5yg8
And it says only using seastar instead of grpc brings 2x speed up for w&d, but in https://discuss.tf.wiki/t/topic/387 says seastar + FuseRecv bring it. is there any error? which really brings the speed up?

Seastar could bring 2-4x faster. Fuse Recv brings extra 2x faster.

alohali · 2020-08-07T09:09:22Z

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training
hi, I also find your doc about this:Seastar-based RPC for TF Worker Service https://docs.google.com/document/d/1f1m-98rbH33WE0qNb3tP0yt9Jjbb-rprvweLobRbTCA/edit#heading=h.oowm89xd5yg8
And it says only using seastar instead of grpc brings 2x speed up for w&d, but in https://discuss.tf.wiki/t/topic/387 says seastar + FuseRecv bring it. is there any error? which really brings the speed up?

Seastar could bring 2-4x faster. Fuse Recv brings extra 2x faster.

Cool! I would try to reproduce in my own device.

Cuttstage · 2020-08-14T09:17:49Z

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

Ok, thanks a lot!!! :)

@liutongxuan, hi, I have use this branch https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv to complie the tensorflow. But some bazel question happened when I complie it. I wanna know what bazel version can be used to complie it.

xingkong1994 · 2021-08-24T12:23:39Z

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

I found this link is out of date, could you show the new code link again? thanks

liutongxuan and others added 2 commits April 10, 2020 10:19

RFC: FuseRecv

9cdd68b

Fix irregular writing and expression issues.

5f54a30

liutongxuan requested review from ematejska, ewilderj, martinwicke and theadactyl as code owners April 11, 2020 06:07

googlebot added the cla: no label Apr 11, 2020

byronyi reviewed Apr 13, 2020

View reviewed changes

update reviewers and submit time

57ee7b2

liutongxuan force-pushed the master branch from 68ca610 to 57ee7b2 Compare April 13, 2020 09:59

terrytangyuan reviewed Apr 13, 2020

View reviewed changes

update sponsor

b005a46

googlebot added cla: yes and removed cla: no labels Apr 14, 2020

fix typo

af8ab47

ematejska added the RFC: Proposed RFC Design Document label Apr 15, 2020

ematejska closed this Apr 15, 2020

ematejska reopened this Apr 15, 2020

legatoo reviewed Apr 16, 2020

View reviewed changes

Update to set the status to proposed.

912657c

Update user API and op definition based on comments.

f4b925c

ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels May 12, 2020

ematejska approved these changes May 12, 2020

View reviewed changes

ematejska merged commit 12f4414 into tensorflow:master May 12, 2020


		## Design Proposal

		![Figure 1: Current graph partition strategy](20200409-fuse_recv/current_graph_partition_strategy.png "Current graph partition strategy")

		include additional information for every Recv tensor.

		## Questions and Discussion Topics

RFC: FuseRecv #224

RFC: FuseRecv #224

Conversation

liutongxuan commented Apr 11, 2020 • edited by ematejska Loading

FuseRecv

Objective

googlebot commented Apr 11, 2020

liutongxuan commented Apr 11, 2020

francktcheng commented Apr 11, 2020

shanshanpt commented Apr 11, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbedorf commented Apr 14, 2020

googlebot commented Apr 14, 2020

legatoo Apr 16, 2020 • edited Loading

Choose a reason for hiding this comment

shanshanpt Apr 21, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

byronyi commented Apr 21, 2020

liutongxuan commented Apr 21, 2020

agarwal-ashish commented Apr 24, 2020

liutongxuan commented Apr 25, 2020 • edited Loading

ematejska commented Apr 27, 2020

ematejska commented May 12, 2020

dubey commented May 12, 2020

alohali commented Aug 2, 2020

liutongxuan commented Aug 6, 2020

alohali commented Aug 6, 2020

Cuttstage commented Aug 6, 2020

liutongxuan commented Aug 6, 2020

liutongxuan commented Aug 6, 2020

Cuttstage commented Aug 6, 2020

alohali commented Aug 7, 2020

liutongxuan commented Aug 7, 2020

alohali commented Aug 7, 2020

Cuttstage commented Aug 14, 2020

xingkong1994 commented Aug 24, 2021

liutongxuan commented Apr 11, 2020 •

edited by ematejska

Loading

legatoo Apr 16, 2020 •

edited

Loading

shanshanpt Apr 21, 2020 •

edited

Loading

liutongxuan commented Apr 25, 2020 •

edited

Loading