Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: FuseRecv #224

Merged
merged 7 commits into from
May 12, 2020
Merged

RFC: FuseRecv #224

merged 7 commits into from
May 12, 2020

Conversation

liutongxuan
Copy link
Contributor

@liutongxuan liutongxuan commented Apr 11, 2020

This RFC will be open for comment until Friday, May 1st, 2020.

FuseRecv

Status Proposed
Author(s) Tongxuan Liu([email protected]) Peng Tao([email protected]) Langshi Chen ([email protected])
Reviewers(s) Ayush Dubey([email protected]) Jeroen Bédorf([email protected]) Derek Murray([email protected]) Bairen Yi([email protected]) Paul Tucker([email protected])
Sponsor Ayush Dubey([email protected])
Updated 2020-04-11

Objective

This RFC proposes a new FuseRecv Op which would receive multiple tensors with
different types through one Remote Procedure Call (RPC). This feature could
significantly reduce the number of RPC calls in most rank or match models
such as Search, Recommend or Ad systems.

@googlebot
Copy link

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@liutongxuan
Copy link
Contributor Author

@googlebot I consent.

@francktcheng
Copy link

@googlebot I consent

1 similar comment
@shanshanpt
Copy link

@googlebot I consent


## Design Proposal

![Figure 1: Current graph partition strategy](20200409-fuse_recv/current_graph_partition_strategy.png "Current graph partition strategy")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these figures should have relative path, i.e. s/20200409-fuse_recv/\./.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's strange...I still cannot see the pictures inline. Maybe take a look at https://github.com/tensorflow/community/pull/214/files#diff-a30752476c946e1740658ce11cc02d89R631 and see how it's done?

Ah I see, it should be ![Figure 1: Current graph partition strategy](20200409-fuse_recv/current_graph_partition_strategy.png)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bairen, i renamed pictures name to 0411***, please check this: https://github.com/liutongxuan/community/blob/master/rfcs/20200411-fuse_recv.md

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works now. Thanks!

### Compatibility
* This feature works with the ParameterServerStrategy.
* This feature considers tensors on difference devices such as CPU, GPU and TPU.
* Independent of SaveModel or checkpoint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SavedModel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thx

#### Fuse the tensors into a single Send/Recv Solution 2 (Derek Murray)
Pack the tensor contents into a single flattened buffer. This would be very
similar to the ScopedAllocator optimization that [email protected] and
[email protected] implemented for collectives, and it might be possible
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collectives -> collective ops

```

### Compatibility
* This feature works with the ParameterServerStrategy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about tensor fusion for allreduce strategy? Can certain parts of the implementation be shared?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For collective ops there is a Grappler optimizer called scoped allocator which does similar job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll submit a PR to TensorFlow as soon as possible.

@jbedorf
Copy link
Member

jbedorf commented Apr 14, 2020

@googlebot I consent.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@ematejska ematejska added the RFC: Proposed RFC Design Document label Apr 15, 2020
@ematejska ematejska closed this Apr 15, 2020
@ematejska ematejska reopened this Apr 15, 2020
include additional information for every Recv tensor.

## Questions and Discussion Topics

Copy link

@legatoo legatoo Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a beta user of this feature, based on my testing result. I have some questions hope to discuss.

  1. Conceptually, How could FuseRecvOp accelerate the training speed?
    FuseRecv will not always reduce the time fetching tensors from another host. Based on my statistic, a recv action(no matter RecvOp or FuseRecvOp), from client-side view, 90%+ of time is spent on collecting data from server's rendezvous. Because FuseRecv requires all tensors are collected then return, a FuseRecv with n tensors, the overall time may even close to n * Recv requests individually.

    So based on my understanding, how FuseRecvOp acceletates is: Reduce the number of rpc calls significantly --> lower rpc framwork's pressure --> save some CPU time & reduce context-switches --> so other op's execution will be benefit. Am I right?

    Like Paul Tucker's idea: Dynamic Fusion in runtime. If FuseRecv doesn't require all tensors collected, can partial returns after a timeout, then retry remainings? will this be better? Em, but it sounds very hard to adjust the timeout.

  2. How will this feature proform when we change the data transmission engine?
    If my first understanding is correct. then the performance gain from FuseRecv is from resources saved by Grpc. If we use other frameworks like RDMA, seastar, etc. How will FuseRecv perform?

  3. Can you provide more details about how backward graph are fused?
    because from the ps-side, the count of RPC received is reduced largely(~50%), but from worker-side, the reduction is ~10%. seems like gradients are not fused tightly?

Thanks.

Copy link

@shanshanpt shanshanpt Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Right, but not only reduce the rpc time. For example, in tensorflow, we should find the rendzvous by the step_id for every recv request (auto rendez = FindOrCreate(step_id)) . The FindOrCreate function should call a mutex_lock to protect the table_. So If we use FuseRecv, the lock times will reduce also. There are many other considers here :)
    We have a design for Dynamic Fusion, we can have discussion in another thread.

  2. How to rewrite the graph is the mainly challenge. This part can be reuse when we use other frameworks. But we have to add FuseRecv interface to the new transmission engine.

  3. The FuseRecv only consider the Recv operation in the graph. If these N Recv nodes can be fused, then we fuse them. :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.

One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.

One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.

There are different strategies to apply here, we design to support multiple fusion strategy in the optimizer. For example there's simple strategy is that we fuse the recv nodes when their destination node is the same.

What we shown in the RFC is more complicated and aggressive fusion strategy, and we applied in our models got significant performance improvement. Besides, with cost model, we could explore more.

For the dynamic solution which mentioned, time window is hard to define. What we concern is not the implementation, we don't want turning the timing window to brings overhead to users. Of course there's could be more automatic way to smartly adjust the window, but in our point of view, in graph optimizer we could get proper fusion as well based on cost model and analysis of the graph.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @legatoo that in absence of non-strict execution of ops, there is a tradeoff which the user / runtime needs to be aware of: that fusion could cause all fused tensors receives to block till all of the transfers are done.

One alternative to discuss in the doc is to achieve RPC fusion transparently in the runtime instead of by changing the graph structure. i.e. Different Recv calls could enqueue the requests in a central global request queue and dequeue the results from there. This queue could then dynamically control batching of RPCs and could be configured with various timing parameters to control the RPC overhead vs latency of dispatch tradeoffs.

Our current approach is to rewrite the graph based on a cost model, and 'FuseRecv' must wait all tensors to be ready. This is a static fusion.
Back to the dynamic fusion, maybe it can be implemented base the static fusion. For example, 'A' send a 'FuseRecv' request to 'B' for five tensors. Here we also need a time window(or somewhat window). We assume 3 tensors are ready when the waiting time exceeds the specified window time, then we send the 3 tensors back to 'A'. And the last 2 tensors should be sent to 'A' later. So we should have a state manager here to manage these requests and responses. The benefit of the dynamic solution which base on the static fusion is that we try to fuse as much as we can(base on the cost model), and it can achieve the same performance as before in the worst case(response one by one). And this dynamic fusion can simplify the setting of time window. For example, poor performance may result from too large time window settings. But don't worry, we have the static fusion which base on the cost model, it can promise to get better performance in most cases.
This is a simple idea, we can have much more discussion in the meeting.

@byronyi
Copy link
Contributor

byronyi commented Apr 21, 2020

Does this optimization apply to intra-node host from/to device Recv, not necessarily triggering an RPC?

@liutongxuan
Copy link
Contributor Author

Does this optimization apply to intra-node host from/to device Recv, not necessarily triggering an RPC?

It's only for remote nodes, not for local recv between host and device. We haven't tried, but it could be interesting to fuse the local recv as well which could reduce conflict in rendezvous.

@agarwal-ashish
Copy link

Instead of calling this "FuseRecv", should we simply extend the existing Recv op (or call it RecvV2) ?

@liutongxuan
Copy link
Contributor Author

liutongxuan commented Apr 25, 2020

Instead of calling this "FuseRecv", should we simply extend the existing Recv op (or call it RecvV2) ?

RecvV2 seems a good one as well. We can discuss this in review meeting. Thanks your suggestion.

@ematejska
Copy link
Contributor

Design Review is scheduled for 5/12/20 at the Networking SIG.

@ematejska
Copy link
Contributor

This has been accepted, just waiting for the final updates to this PR and notes from the review before merging.

@dubey
Copy link
Member

dubey commented May 12, 2020

Design review notes

Static fusion performance

  • Alex Passos: is static fusion always net positive benefit? It will introduce extra latency. Should it be turned on by default always?
    • We will start off as an opt-in config param marked experimental
    • Once the implementation is submitted, we will try performance on a few models, and revisit the non-default config in 6 months.

Concerns about performance with RDMA

  • It seems orthogonal to the proposal. Performance impact is unknown

Details on how the backward graph is fused

  • TF doesn’t distinguish between forward and backward graph

Intra-node FuseRecv

  • Bairen: Should we use fuserecv for single host send-recv? Context: sparse embedding models in heterogeneous GPU clusters. Similar pattern observed for host to device copies. Workaround - manually fuse recv nodes.
    • The authors haven’t tried intra-node, no performance data
    • Does implementation work for intra-node cases?
    • Authors - yes
    • Alex - what benefit would you get? Kernel launch overhead?
    • Bairen - executor/launch overhead, typically tf.gather of indexed slices. Sometimes we see large number of cudaMemcpyHostToDevice.
    • Alex - if this is future work, we can leave it out of scope. Overhead of RPC is much larger than overhead of executor op launch

Op name

  • FuseRecv vs RecvV2
  • Alex: _Recv is internal op - extending it is okay from API perspective
  • Conclusion: we will augment the existing _Recv op
  • Action Item: modify RFC to incorporate these changes

What kind of models will benefit:

  • Sparse models for ads with feature columns
  • Result in 100s of rpcs
  • 1000s of workers and parameter servers
  • Data fetch per rpc is very small

Should the send node also be fused?

  • Alternative: mrry@ - pack tensors into n length DT_VARIANT
    • Advantage: we reuse existing ops and rpc machinery
    • No perceived performance benefit, a different way to implement the same idea
    • Conclusion: okay to continue with the current proposal
  • Alternative: mrry@ - pack into single flattened buffer
    • Cons: does not work when fused tensors are different types and shapes
    • Alex: incremental follow up work that can be carried out after this PR

Configuration options to enable FuseRecv

  • Current proposal doesn’t cover TF2
  • For TF2, we need tf.config.experimental.do_fuse_recv - which will mimic other existing tf.config implementations
  • Action Item: augment RFC with TF2 configuration details

Op definition

  • send_device, recv_device, send_device_incarnation - should be singletons
  • tensor_name and tensor_type should be lists
  • Action Item: change the op defs as described above

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels May 12, 2020
@ematejska ematejska merged commit 12f4414 into tensorflow:master May 12, 2020
@alohali
Copy link

alohali commented Aug 2, 2020

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

@liutongxuan
Copy link
Contributor Author

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

@alohali
Copy link

alohali commented Aug 6, 2020

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

@Cuttstage
Copy link

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

@liutongxuan
Copy link
Contributor Author

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training

@liutongxuan
Copy link
Contributor Author

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

@Cuttstage
Copy link

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

Ok, thanks a lot!!! :)

@alohali
Copy link

alohali commented Aug 7, 2020

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training
hi, I also find your doc about this:Seastar-based RPC for TF Worker Service https://docs.google.com/document/d/1f1m-98rbH33WE0qNb3tP0yt9Jjbb-rprvweLobRbTCA/edit#heading=h.oowm89xd5yg8
And it says only using seastar instead of grpc brings 2x speed up for w&d, but in https://discuss.tf.wiki/t/topic/387 says seastar + FuseRecv bring it. is there any error? which really brings the speed up?

@liutongxuan
Copy link
Contributor Author

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training
hi, I also find your doc about this:Seastar-based RPC for TF Worker Service https://docs.google.com/document/d/1f1m-98rbH33WE0qNb3tP0yt9Jjbb-rprvweLobRbTCA/edit#heading=h.oowm89xd5yg8
And it says only using seastar instead of grpc brings 2x speed up for w&d, but in https://discuss.tf.wiki/t/topic/387 says seastar + FuseRecv bring it. is there any error? which really brings the speed up?

Seastar could bring 2-4x faster. Fuse Recv brings extra 2x faster.

@alohali
Copy link

alohali commented Aug 7, 2020

@liutongxuan hi do you have test data to show how much rpc calls reduced? and how much improve in total training speed?

For kinds of CTR models with hundreds of feature columns, this feature could reduce 90% RPC, and brings 2 times faster.

That's cool! For 2 times faster, do you mean the RPC processes 2x or the whole network?

e2e, whole training
hi, I also find your doc about this:Seastar-based RPC for TF Worker Service https://docs.google.com/document/d/1f1m-98rbH33WE0qNb3tP0yt9Jjbb-rprvweLobRbTCA/edit#heading=h.oowm89xd5yg8
And it says only using seastar instead of grpc brings 2x speed up for w&d, but in https://discuss.tf.wiki/t/topic/387 says seastar + FuseRecv bring it. is there any error? which really brings the speed up?

Seastar could bring 2-4x faster. Fuse Recv brings extra 2x faster.

Cool! I would try to reproduce in my own device.

@Cuttstage
Copy link

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

Ok, thanks a lot!!! :)

@liutongxuan, hi, I have use this branch https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv to complie the tensorflow. But some bazel question happened when I complie it. I wanna know what bazel version can be used to complie it.

@xingkong1994
Copy link

@liutongxuan hi, I wanna use this feature in my own machine. Is there any code branch which has this wonderful feature? I just want to merge this into my tensorflow source code to impore the performance.

I'm working on the PR, you can try the code: https://github.com/liutongxuan/tensorflow/tree/features/fuse_recv

I found this link is out of date, could you show the new code link again? thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes RFC: Accepted RFC Design Document: Accepted by Review
Projects
None yet
Development

Successfully merging this pull request may close these issues.