[TF2.1] Performance: Control flow and scalar ops 225x slower than raw Python and 24000x slower than C++ #34500
Comments
First, it's great to see these measurements put together! There are a few issues to address before the benchmark is conclusive (see below), but I'd love to continue the discussion; I think such a benchmark will be a useful indicator of the performance of control flow ops, which are becoming more pervasive even in ML.

The speed differences are somewhat expected: TF ops (like PyTorch's and NumPy's, for that matter) are optimized for compute-intensive vector calculations, where what looks like huge per-op overhead is dwarfed by the computation itself. When single scalars are involved they really don't shine, and even something as slow as Python easily outmatches them. NumPy has lower per-op overhead, so it is probably faster than TF here, but still slower than pure Python. I ran this test to convince myself of that - feel free to add it to your suite:
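(The snippet itself isn't preserved in this copy of the thread; a minimal sketch of such a scalar-by-scalar NumPy test, assuming it mirrored the fizz/buzz/fizzbuzz counting loop of the benchmark, could look like the following - not the exact code from the comment.)

```python
import numpy as np

def fizzbuzz_numpy_scalar(n):
    # Same scalar-at-a-time loop as the pure-Python version, except every
    # value is a NumPy scalar, so each modulo/comparison pays NumPy's
    # per-op dispatch overhead.
    counts = np.zeros(3, dtype=np.int64)  # (fizz, buzz, fizzbuzz)
    for i in range(1, n + 1):
        i = np.int64(i)
        if i % 15 == 0:
            counts[2] += 1
        elif i % 3 == 0:
            counts[0] += 1
        elif i % 5 == 0:
            counts[1] += 1
    return counts

# e.g.: timeit.timeit(lambda: fizzbuzz_numpy_scalar(100_000), number=10)
```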
I believe we should aim to match the performance of NumPy, so there's definitely much room for improvement.
@mdanatg Thanks for the response. Re: (1) and (2) - the performance gap is so wide that measurement variance is not material. Here are 10 runs of both the saved model and direct execution (run as a for loop in the same session); the best run is highlighted.
Thank you for the pointer to the NumPy variant of the test as a reasonable baseline. Here are the numbers after including NumPy and a raw-ops variant (no AutoGraph):
It's safe to say that, allowing for measurement variance, the three TF setups are within a stone's throw of each other and not qualitatively different. And as you rightly point out, these systems are set up for vectorized calculations, so this isn't really an AutoGraph bug - it's a core-ops runtime performance issue. We're considering these ops to generate cross features as part of our critical serving path, so the slowdown would have direct, user-visible latency impact; our current code is straight C++ and we're willing to accept some latency hit. Having some insight into what's driving this kind of overhead, and a way to avoid it, would be very much appreciated.
Thanks for double-checking - I still think it's a good practice to average several executions, but the important part was to confirm the steady-state performance within the same session. The efficiency of TF control flow ops is something that we hope to improve in the future, but it will take a bit of time. In the meantime, I'd be happy to help optimize your specific code, so that at least this issue doesn't block you - can you describe the computations you need to perform for these cross features? As a side note, an effective way to minimize this kind of slowdown is to use vector ops. Here's the same fizzbuzz code rewritten to demonstrate:
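(The rewritten snippet isn't preserved here; the sketch below shows one possible vectorized formulation - divisibility masks over a tf.range instead of a per-element loop - and is an assumption, not necessarily the exact code from the comment.)

```python
import tensorflow as tf

@tf.function
def fizzbuzz_vectorized(n):
    # One vector op per predicate instead of 100K scalar ops: build the
    # whole range, compute divisibility masks, then reduce to three counts.
    i = tf.range(1, n + 1)
    div3 = tf.equal(i % 3, 0)
    div5 = tf.equal(i % 5, 0)
    both = tf.logical_and(div3, div5)
    fizz = tf.logical_and(div3, tf.logical_not(div5))
    buzz = tf.logical_and(div5, tf.logical_not(div3))
    return tf.stack([
        tf.reduce_sum(tf.cast(fizz, tf.int32)),
        tf.reduce_sum(tf.cast(buzz, tf.int32)),
        tf.reduce_sum(tf.cast(both, tf.int32)),
    ])

print(fizzbuzz_vectorized(tf.constant(100000)))  # [26667, 13334, 6666]
```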
The performance target for such code should be the C++ benchmark, because the op kernels themselves are implemented in C++. I think it already rivals NumPy, or at least it did in my measurements:
Of course, not all computations can be vectorized in this fashion, so the method is not always applicable.
Update - the numbers improve a lot for this benchmark when using XLA compilation, approaching Python in my tests:
Unfortunately, this API is experimental, so I wouldn't recommend using it in production. It should still give us a better indication of what the baseline should be, though.
This is very helpful, thank you. A colleague also got a vectorized PyTorch implementation working in about 8ms (vs 15ms for Python), so the baseline for a vectorized implementation is about 8ms. Re: the XLA compile - is it on the control flow impl or the vectorized impl?
I tested it with the control flow impl:
XLA requires all shapes to be static and complains about the use of
BTW, there is a better way to use XLA, one that is more likely to be supported and has less overhead as well:
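(The snippet is missing from this copy; assuming the suggestion refers to tf.function's experimental_compile flag - available in the TF nightlies of the time and later renamed jit_compile - which JIT-compiles the whole function with XLA, a sketch would be:)

```python
import tensorflow as tf

# Assumption: the "better way" is compiling the whole tf.function with XLA
# via the experimental_compile flag, rather than a separate XLA compile API.
@tf.function(experimental_compile=True)
def fizzbuzz_xla(n):
    fizz = tf.constant(0)
    buzz = tf.constant(0)
    fizzbuzz = tf.constant(0)
    for i in tf.range(1, n + 1):  # AutoGraph loop, compiled as a whole by XLA
        if i % 15 == 0:
            fizzbuzz += 1
        elif i % 3 == 0:
            fizz += 1
        elif i % 5 == 0:
            buzz += 1
    return tf.stack([fizz, buzz, fizzbuzz])

print(fizzbuzz_xla(tf.constant(100000)))
```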
With this, performance starts to approach the order of magnitude of C++. Of course, we would like things to be this fast by default, without resorting to arcane flags, so there is still a bit of work to be done.
I ran the XLA compiled version with TF nightly:
Even if I hardcode the count (100K) in the range(n), the speeds improve but aren't C++ comparable:
Did you mean "approach the order of magnitude of Python" (not C++)? The gap still seems to be about 400x vs. the C++ code (though only 4.5x vs. Python). Could you take a look at the XLA implementation in the repo? https://github.com/divyekapoor/ml-op-benchmarks
Readme updated with the new numbers.
FWIW, with a constant value in range(n) (n=100K), the non-XLA version of the model gets a huge boost in speed:
and it beats Python and is about 20x slower than C++ (quite nice!). The XLA version of the model, though, isn't able to optimize the graph - as above, it's at around 26ms instead of 4ms, a 6x gap vs. the prod version of the saved model.
Summarizing the issue: it seems we have the following open questions -
For moving forward with (1), are there any other next steps, @mdanatg?
Your summary looks right. We're working on improving the graph executor, which should help resolve both (1) and (2), but it's a longer-term effort and I don't have a clear estimate of when it will be ready. Oddly, I ran a simplified version of the code (without the tf.Module / saved model bit) and there the results were significantly better than Python (see the results and code below). So perhaps there is some extra overhead around either the module or the saved model. But that's beside the point, which is that control flow is currently slow in TF graph mode. Side note - here's the code and the timings I got:
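(The exact code and timings from that comment aren't preserved here; a sketch of such a simplified variant - a plain @tf.function without tf.Module or saved model, timed after an initial trace - might look like this, under the assumption that it uses the same counting loop as the benchmark.)

```python
import timeit
import tensorflow as tf

@tf.function
def tf_fizzbuzz(n):
    # AutoGraph converts the loop and the if/elif chain into
    # tf.while_loop / tf.cond ops in the traced graph.
    fizz = tf.constant(0)
    buzz = tf.constant(0)
    fizzbuzz = tf.constant(0)
    for i in tf.range(1, n + 1):
        if i % 15 == 0:
            fizzbuzz += 1
        elif i % 3 == 0:
            fizz += 1
        elif i % 5 == 0:
            buzz += 1
    return tf.stack([fizz, buzz, fizzbuzz])

n = tf.constant(100000)
tf_fizzbuzz(n)  # trace once so the timing below reflects steady-state execution
print(timeit.timeit(lambda: tf_fizzbuzz(n), number=10) / 10, "s per run")
```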
@divyekapoor |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
Summary
TF op performance for a simple subgraph (built with AutoGraph) is at least two orders of magnitude slower than expected: looping over 100K numbers takes 4+ seconds instead of 18ms (or much faster).
Benchmark code and repro instructions:
https://github.com/divyekapoor/ml-op-benchmarks
make tfbench
Raw Python: running the same code without tf.Module and @tf.function.
Raw C++: Equivalent implementation in straight C++.
Performance table:
The multipliers use the corresponding Python and C++ code as unit = 1.
The benchmark script is attached at the bottom of this issue and only has a dependency on TensorFlow.
https://github.com/divyekapoor/ml-op-benchmarks has code that can be directly cloned and executed.
(If it would help, TF ops are ~40% slower than Torch ops for the same FizzBuzz benchmark)
See pytorch/pytorch#30365 for the related PyTorch issue.
Performance similar to raw Python is the expected behavior.
System information
== tensorflow import ============================================
tf.version.VERSION = 2.1.0-dev20191107
tf.version.GIT_VERSION = v1.12.1-17543-gb4b5ce680c
tf.version.COMPILER_VERSION = 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)
== python version ==============================================
(major, minor, micro, releaselevel, serial)
(3, 7, 4, 'final', 0)
Setup
Performance benchmark for conditional ops set up with a FizzBuzz test case:
Input: n -> a range limit (100K)
Output: a 3-element tensor with counts for (fizz, buzz, fizzbuzz)
Goal: To estimate the performance overhead of TF ops versus raw python / raw C++.
Benchmark file is attached.
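(For reference, a minimal sketch of what the benchmarked setup looks like: a tf.Module whose method is traced via @tf.function and exported as a saved model. The actual code lives in the linked repo and may differ in details; the export path below is hypothetical.)

```python
import tensorflow as tf

class FizzBuzzModule(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.int32)])
    def fizzbuzz(self, n):
        # AutoGraph turns this loop and if/elif chain into graph control flow.
        fizz = tf.constant(0)
        buzz = tf.constant(0)
        fizzbuzz = tf.constant(0)
        for i in tf.range(1, n + 1):
            if i % 15 == 0:
                fizzbuzz += 1
            elif i % 3 == 0:
                fizz += 1
            elif i % 5 == 0:
                buzz += 1
        return tf.stack([fizz, buzz, fizzbuzz])

module = FizzBuzzModule()
tf.saved_model.save(module, "/tmp/fizzbuzz_model")   # hypothetical export path
reloaded = tf.saved_model.load("/tmp/fizzbuzz_model")
print(reloaded.fizzbuzz(tf.constant(100000)))        # -> counts of (fizz, buzz, fizzbuzz)
```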
Describe the current behavior
FizzBuzz with TF ops is 225x slower than the same code in Raw Python and 24K+x slower than the corresponding C++ implementation.
Describe the expected behavior
FizzBuzz with TF ops should be within 10-50% of raw Python or faster.
Code to reproduce the issue
Attached to this report.
To reproduce:
Full version: https://github.com/divyekapoor/ml-op-benchmarks
Other info / logs
Performance table:
Raw latency == run with range input N = 100K
Per Run latency == Raw latency / 100K (one run through the op graph)
fizz.tar.gz