[TF2.1] Performance: Control flow and scalar ops 225x slower than raw Python and 24000x slower than C++ #34500
Comments
First, it's great to see these measurements put together! There are a few issues to address before the benchmark is conclusive (see below), but I'd love to continue the discussion; I think such a benchmark will be a useful indicator of the performance of control flow ops, which are becoming more pervasive even in ML.

The speed differences are somewhat expected: TF ops (like PyTorch's and NumPy's, for that matter) are optimized for compute-intensive vector calculations, where what looks like huge per-op overhead is dwarfed by the computation itself. When single scalars are involved they really don't shine, and even something as slow as Python easily outmatches them. NumPy has lower per-op overhead, so it is probably faster than TF here, but still slower than pure Python. I ran this test to convince myself of that - feel free to add it to your suite:
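(The snippet itself isn't preserved in this copy of the thread; a minimal sketch of such a scalar-by-scalar NumPy test, assuming it mirrored the fizz/buzz/fizzbuzz counting loop of the benchmark, could look like the following - not the exact code from the comment.)

```python
import numpy as np

def fizzbuzz_numpy_scalar(n):
    # Same scalar-at-a-time loop as the pure-Python version, except every
    # value is a NumPy scalar, so each modulo/comparison pays NumPy's
    # per-op dispatch overhead.
    counts = np.zeros(3, dtype=np.int64)  # (fizz, buzz, fizzbuzz)
    for i in range(1, n + 1):
        i = np.int64(i)
        if i % 15 == 0:
            counts[2] += 1
        elif i % 3 == 0:
            counts[0] += 1
        elif i % 5 == 0:
            counts[1] += 1
    return counts

# e.g.: timeit.timeit(lambda: fizzbuzz_numpy_scalar(100_000), number=10)
```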
I believe we should aim to match the performance of NumPy, so there's definitely much room for improvement.
@mdanatg Thanks for the response. Re: (1) and (2) - the performance gap is so wide that measurement variance is not material. Here are 10 runs of both the saved model and direct execution (run as a for loop in the same session); the best run is highlighted.
Thank you for the pointer to the NumPy variant of the test as a reasonable baseline. Here are the numbers after including NumPy and a raw-ops variant (no AutoGraph):
It's safe to say that, allowing for measurement variance, the three TF setups are within a stone's throw of each other and not qualitatively different. And as you rightly point out, these systems are set up for vectorized calculations, so this isn't really an AutoGraph bug - it's a core-ops runtime performance issue. We're considering these ops to generate cross features as part of our critical serving path, so the slowdown would have direct, user-visible latency impact; our current code is straight C++ and we're willing to accept some latency hit. Having some insight into what's driving this kind of overhead, and a way to avoid it, would be very much appreciated.
Thanks for double-checking - I still think it's a good practice to average several executions, but the important part was to confirm the steady-state performance within the same session. The efficiency of TF control flow ops is something that we hope to improve in the future, but it will take a bit of time. In the meantime, I'd be happy to help optimize your specific code, so that at least this issue doesn't block you - can you describe the computations you need to perform for these cross features? As a side note, an effective way to minimize this kind of slowdown is to use vector ops. Here's the same fizzbuzz code rewritten to demonstrate:
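(The rewritten snippet isn't preserved here; the sketch below shows one possible vectorized formulation - divisibility masks over a tf.range instead of a per-element loop - and is an assumption, not necessarily the exact code from the comment.)

```python
import tensorflow as tf

@tf.function
def fizzbuzz_vectorized(n):
    # One vector op per predicate instead of 100K scalar ops: build the
    # whole range, compute divisibility masks, then reduce to three counts.
    i = tf.range(1, n + 1)
    div3 = tf.equal(i % 3, 0)
    div5 = tf.equal(i % 5, 0)
    both = tf.logical_and(div3, div5)
    fizz = tf.logical_and(div3, tf.logical_not(div5))
    buzz = tf.logical_and(div5, tf.logical_not(div3))
    return tf.stack([
        tf.reduce_sum(tf.cast(fizz, tf.int32)),
        tf.reduce_sum(tf.cast(buzz, tf.int32)),
        tf.reduce_sum(tf.cast(both, tf.int32)),
    ])

print(fizzbuzz_vectorized(tf.constant(100000)))  # [26667, 13334, 6666]
```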
The performance target for such code should be the C++ benchmark, because the op kernels themselves are implemented in C++. I think it already rivals NumPy, or at least it did in my measurements:
Of course, not all computations can be vectorized in this fashion, so the method is not always applicable.
Update - the numbers improve a lot for this benchmark when using XLA compilation, approaching Python in my tests:
Unfortunately, this API is experimental, so I wouldn't recommend using it in production. It should still give us a better indication of what the baseline should be, though.
This is very helpful, thank you. A colleague also got a vectorized PyTorch implementation working in about 8ms (vs 15ms for Python), so the baseline for a vectorized implementation is about 8ms. Re: the XLA compile - is it on the control flow impl or the vectorized impl?
I tested it with the control flow impl:
XLA requires all shapes to be static and complains about the use of
BTW, there is a better way to use XLA, one that is more likely to be supported and has less overhead as well:
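(The snippet is missing from this copy; assuming the suggestion refers to tf.function's experimental_compile flag - available in the TF nightlies of the time and later renamed jit_compile - which JIT-compiles the whole function with XLA, a sketch would be:)

```python
import tensorflow as tf

# Assumption: the "better way" is compiling the whole tf.function with XLA
# via the experimental_compile flag, rather than a separate XLA compile API.
@tf.function(experimental_compile=True)
def fizzbuzz_xla(n):
    fizz = tf.constant(0)
    buzz = tf.constant(0)
    fizzbuzz = tf.constant(0)
    for i in tf.range(1, n + 1):  # AutoGraph loop, compiled as a whole by XLA
        if i % 15 == 0:
            fizzbuzz += 1
        elif i % 3 == 0:
            fizz += 1
        elif i % 5 == 0:
            buzz += 1
    return tf.stack([fizz, buzz, fizzbuzz])

print(fizzbuzz_xla(tf.constant(100000)))
```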
With this, performance starts to approach the order of magnitude of C++. Of course, we would like things to be this fast by default, without resorting to arcane flags, so there is still a bit of work to be done.
I ran the XLA compiled version with TF nightly:
Even if I hardcode the count (100K) in the range(n), the speeds improve but aren't C++ comparable:
Did you mean "approach the order of magnitude of Python" (not C++)? The gap still seems to be about 400x vs. the C++ code (though only 4.5x vs. Python). Could you take a look at the XLA implementation in the repo? https://github.com/divyekapoor/ml-op-benchmarks
Readme updated with the new numbers.
FWIW, with a constant value in range(n) (n=100K), the non-XLA version of the model gets a huge boost in speed:
and it beats Python and is about 20x slower than C++ (quite nice!). The XLA version of the model, though, isn't able to optimize the graph - as above, it's at around 26ms instead of 4ms, a 6x gap vs. the prod version of the saved model.
Summarizing the issue: it seems we have the following open questions -
For moving forward with (1), are there any other next steps, @mdanatg?
Your summary looks right. We're working on improving the graph executor, which should help resolve both (1) and (2), but it's a longer-term effort and I don't have a clear estimate of when it will be ready. Oddly, I ran a simplified version of the code (without the tf.Module / saved model bit) and there the results were significantly better than Python (see the results and code below). So perhaps there is some extra overhead around either the module or the saved model. But that's beside the point, which is that control flow is currently slow in TF graph mode. Side note - here's the code and the timings I got:
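(The exact code and timings from that comment aren't preserved here; a sketch of such a simplified variant - a plain @tf.function without tf.Module or saved model, timed after an initial trace - might look like this, under the assumption that it uses the same counting loop as the benchmark.)

```python
import timeit
import tensorflow as tf

@tf.function
def tf_fizzbuzz(n):
    # AutoGraph converts the loop and the if/elif chain into
    # tf.while_loop / tf.cond ops in the traced graph.
    fizz = tf.constant(0)
    buzz = tf.constant(0)
    fizzbuzz = tf.constant(0)
    for i in tf.range(1, n + 1):
        if i % 15 == 0:
            fizzbuzz += 1
        elif i % 3 == 0:
            fizz += 1
        elif i % 5 == 0:
            buzz += 1
    return tf.stack([fizz, buzz, fizzbuzz])

n = tf.constant(100000)
tf_fizzbuzz(n)  # trace once so the timing below reflects steady-state execution
print(timeit.timeit(lambda: tf_fizzbuzz(n), number=10) / 10, "s per run")
```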
@divyekapoor |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
Summary
TF op performance for a simple subgraph (built with AutoGraph) is at least two orders of magnitude slower than expected: looping over 100K numbers takes 4+ seconds instead of 18ms (or much faster).
Benchmark code and repro instructions:
https://github.com/divyekapoor/ml-op-benchmarks
make tfbench
Raw Python: running the same code without tf.Module and @tf.function.
Raw C++: Equivalent implementation in straight C++.
Performance table:
The multipliers use the corresponding Python and C++ code as unit = 1.
The benchmark script is attached at the bottom of this issue and only has a dependency on TensorFlow.
https://github.com/divyekapoor/ml-op-benchmarks has code that can be directly cloned and executed.
(If it would help, TF ops are ~40% slower than Torch ops for the same FizzBuzz benchmark)
See pytorch/pytorch#30365 for the related PyTorch issue.
Performance similar to raw Python is the expected behavior.
System information
== tensorflow import ============================================
tf.version.VERSION = 2.1.0-dev20191107
tf.version.GIT_VERSION = v1.12.1-17543-gb4b5ce680c
tf.version.COMPILER_VERSION = 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)
== python version ==============================================
(major, minor, micro, releaselevel, serial)
(3, 7, 4, 'final', 0)
Setup
Performance benchmark for conditional ops set up with a FizzBuzz test case:
Input: n -> a range limit (100K)
Output: a 3-element tensor with counts for (fizz, buzz, fizzbuzz)
Goal: To estimate the performance overhead of TF ops versus raw python / raw C++.
Benchmark file is attached.
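(For reference, a minimal sketch of what the benchmarked setup looks like: a tf.Module whose method is traced via @tf.function and exported as a saved model. The actual code lives in the linked repo and may differ in details; the export path below is hypothetical.)

```python
import tensorflow as tf

class FizzBuzzModule(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.int32)])
    def fizzbuzz(self, n):
        # AutoGraph turns this loop and if/elif chain into graph control flow.
        fizz = tf.constant(0)
        buzz = tf.constant(0)
        fizzbuzz = tf.constant(0)
        for i in tf.range(1, n + 1):
            if i % 15 == 0:
                fizzbuzz += 1
            elif i % 3 == 0:
                fizz += 1
            elif i % 5 == 0:
                buzz += 1
        return tf.stack([fizz, buzz, fizzbuzz])

module = FizzBuzzModule()
tf.saved_model.save(module, "/tmp/fizzbuzz_model")   # hypothetical export path
reloaded = tf.saved_model.load("/tmp/fizzbuzz_model")
print(reloaded.fizzbuzz(tf.constant(100000)))        # -> counts of (fizz, buzz, fizzbuzz)
```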
Describe the current behavior
FizzBuzz with TF ops is 225x slower than the same code in Raw Python and 24K+x slower than the corresponding C++ implementation.
Describe the expected behavior
FizzBuzz with TF ops should be within 10-50% of raw Python or faster.
Code to reproduce the issue
Attached to this report.
To reproduce:
Full version: https://github.com/divyekapoor/ml-op-benchmarks
Other info / logs
Performance table:
Raw latency == run with range input N = 100K
Per Run latency == Raw latency / 100K (one run through the op graph)
fizz.tar.gz