BUG: RandomContrast slows down training 6-fold (#581)
I suppose that we are in the same case as #291. The root cause is:

```
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformFullIntV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomGetKeyCounter cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting AdjustContrastv2 cause Input "contrast_factor" of op 'AdjustContrastv2' expected to be loop invariant.
```

(The same block of warnings is emitted a second time in the original log.)
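For context, these warnings mean that the pfor transformation behind `tf.vectorized_map` found ops it cannot vectorize and fell back to a sequential `tf.while_loop`, which is where the slowdown comes from. A minimal sketch (my own illustration, not code from this issue) of how per-image randomness triggers this class of warnings:

```python
import tensorflow as tf

def augment(image):
    # A per-image random factor: the random ops involved and the resulting
    # non-loop-invariant "contrast_factor" input of AdjustContrastv2 cannot
    # be vectorized, so pfor falls back to a while_loop for them.
    factor = tf.random.uniform([], 0.5, 1.5)
    return tf.image.adjust_contrast(image, factor)

images = tf.random.uniform([8, 224, 224, 3])
# Emits "Using a while_loop for converting ..." warnings similar to the ones
# above (the exact set of ops depends on the TF version).
augmented = tf.vectorized_map(augment, images)
```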
I will pin this issue. This is a significant slowdown; maybe we need to consider manually vectorizing, which would be very unfortunate.
I think it will be more useful to brainstorm a solution for the old and more general problem at #291 rather than pinning every single issue, as this is already the second one (tensorflow/tensorflow#56242).
Would you mind explaining what you have in mind for a "solution for the old and more general problem"? I'm not sure how we would solve this in the general case.
The superset of this issue is #291, and the root cause is our choice of `tf.vectorized_map` for within-the-batch randomization.
About my last comment: /cc @ishark @wangpengmit
Per https://www.tensorflow.org/api_docs/python/tf/image/adjust_contrast, adjust_contrast can take a stack of images but only a scalar contrast factor, not a list. That's too bad, especially for such a simple function.
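To make the limitation concrete, here is a small sketch (mine, not from the thread): a scalar factor applies to the whole stack, while per-image factors force an explicit, sequential map:

```python
import tensorflow as tf

images = tf.random.uniform([4, 64, 64, 3])

# Supported: one scalar contrast factor for the whole stack of images.
batch_out = tf.image.adjust_contrast(images, 1.3)

# Not supported directly: a per-image factor. One workaround (sequential,
# hence slow) is an explicit map over (image, factor) pairs.
factors = tf.random.uniform([4], 0.5, 1.5)
per_image_out = tf.map_fn(
    lambda pair: tf.image.adjust_contrast(pair[0], pair[1]),
    (images, factors),
    fn_output_signature=tf.float32,
)
```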
Isn't this namespace orphaned (#74 (comment))? Also, I never heard that we want to contribute to or coordinate with the tf.image.* API. @MarkDaoust, see the full thread from the early weeks of the Keras-cv repo at #122 (comment).
The same is happening for RandAugment. Without RandAugment, each epoch takes around 35s on my machine; with RandAugment it takes about 2min 25s. Is any resolution on the roadmap?
@atlanticstarr1 We have started a thread at tensorflow/tensorflow#55639 (comment); you could try to ping there.
Just to confirm my hypothesis about @DavidLandup0's original random contrast example (#581 (comment)) at the origin of this ticket: I've tested it with TF 2.10.0 on Colab, and the overhead of the within-the-batch randomization / vectorized_map fallback is huge. As a quick workaround, using a constant factor removes the overhead; with the official within-the-batch/vectorized_map fallback we see a large performance drop. Please check it yourself with this Colab so we are on the same page without waiting for the private GPU CI auth.
/cc @martin-gorner
As we are potentially going to introduce this issue also in the new 3D preprocessing API with #986, I want to clarify this example, as it is similar to other KPL cases. Assuming that we had 100% coverage of the pfor converters (which we obviously don't actually have), let's see what happens in this case. This is the registered pfor converter for AdjustContrastv2:

```python
# From tensorflow/python/ops/parallel_for/pfor.py
@RegisterPFor("AdjustContrastv2")
def _convert_adjust_contrastv2(pfor_input):
  images = pfor_input.stacked_input(0)
  contrast_factor = pfor_input.unstacked_input(1)
  return wrap(gen_image_ops.adjust_contrastv2(images, contrast_factor), True)
```

As you can see, `contrast_factor` is an unstacked (loop-invariant) input, so the converter can only apply one factor to the whole batch. Instead, with our within-the-batch augmentation policy, we want to have a different random factor for each single image in the batch.
So, as you can see, it is not strictly an issue of pfor converter coverage. The main problem is that we want to adopt a within-the-batch policy, and this is going to carry a performance overhead independently of the use of `tf.vectorized_map`. So I think that we have two main options here, other than extending the converter coverage.
The only paper mentioned in this repo was in #372 (comment); it would at least let us apply the same randomized factor to sub-batches, partially limiting the performance impact of the current "fully within the batch" policy.
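For reference, here is a sketch of the "manually vectorizing" option mentioned earlier in the thread, applied to contrast: a hand-written batched version that accepts one factor per image and so sidesteps the loop-invariant `contrast_factor` restriction of AdjustContrastv2. This is my own illustration of the idea, not KerasCV code:

```python
import tensorflow as tf

def batched_adjust_contrast(images, factors):
    """Contrast adjustment with a per-image factor.

    images:  float tensor of shape [B, H, W, C]
    factors: float tensor of shape [B]
    Mirrors tf.image.adjust_contrast semantics:
    (x - mean) * factor + mean, with the mean taken per image and channel.
    """
    means = tf.reduce_mean(images, axis=[1, 2], keepdims=True)  # [B, 1, 1, C]
    factors = tf.reshape(factors, [-1, 1, 1, 1])  # broadcast over H, W, C
    return (images - means) * factors + means

# Usage: one random factor per image, with no vectorized_map needed.
images = tf.random.uniform([8, 32, 32, 3])
factors = tf.random.uniform([8], 0.5, 1.5)
out = batched_adjust_contrast(images, factors)
```

The trade-off is that every such op would need its own hand-written batched version, which is the "very unfortunate" part noted earlier.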
I'm not sure I really understand the argument here. To me it seems the main issue is that AdjustContrastv2 only accepts a scalar, loop-invariant contrast_factor.
It is. And this will happen every time you want to randomize, within the batch, an arg of an op where that arg is a scalar by TF API design. We could say that it is a TF issue and not our concern, but who really has the resources to change all these APIs/ops/kernels in TF for our new design needs? At the same time, Keras is no longer a multi-backend library: users are impacted directly by the performance issues of this design and cannot switch to an alternative "backend", so in many cases the separation between Keras issues and TF issues doesn't make sense from a user's point of view. A limitation of the within-the-batch randomization policy is going to impact the KerasCV user base directly, and claiming that it is a TF (team?) issue doesn't solve anyone's problems. Also, we don't have (public?) experimental data on the accuracy/epoch gains of randomizing the augmentation for each batch element, which is the design choice that created all this performance overhead given the TF API design we currently have for many ops.
tensorflow-macos==2.10.0 and 2.11.0 have this issue. I use RandomRotation.
@kidfrom We have already extensively discussed this in many tickets.
To track the performance thread we had recently on the RepeatedAugmentation layer: #1293 (comment). If we can have the same image with different augmentations in a batch, why can't we have different images with the same augmentation params in a batch? /cc @LukeWood
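A sketch of that idea (my own, not from the thread): sampling one factor per batch keeps `contrast_factor` loop invariant, so the op runs fully vectorized and randomness comes from variation across batches rather than within them:

```python
import tensorflow as tf

def batch_level_random_contrast(images, lower=0.5, upper=1.5):
    # One random factor shared by every image in the batch: AdjustContrastv2
    # is called once with a scalar, so no while_loop fallback is needed.
    factor = tf.random.uniform([], lower, upper)
    return tf.image.adjust_contrast(images, factor)

# Applied per batch in a tf.data pipeline, each batch still gets a different
# factor, just not each image.
ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform([64, 32, 32, 3]))
ds = ds.batch(16).map(batch_level_random_contrast)
```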
Closing due to staleness. |
Original issue description: Augmentation will obviously slow down training, but it shouldn't be a 6-fold slowdown. This happens with the RandomContrast layer, which makes the training time per epoch grow from ~100s to ~600s. I'd share a Colab notebook, but there seems to be an issue with importing KerasCV on Colab, so here are the steps to reproduce:
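The steps themselves are not shown above; a hedged sketch of a comparable repro (the `keras_cv.layers.RandomContrast(value_range, factor)` signature is assumed here and may differ by version) is to compare epoch wall time with and without the layer:

```python
import time

import tensorflow as tf
import keras_cv

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

# The layer under discussion; value_range/factor args are assumed.
augment = keras_cv.layers.RandomContrast(value_range=(0, 255), factor=0.5)

ds = (
    tf.data.Dataset.from_tensor_slices((x_train.astype("float32"), y_train))
    .batch(128)
    .map(lambda x, y: (augment(x, training=True), y),
         num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile("adam",
              tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

start = time.time()
model.fit(ds, epochs=1)
print(f"epoch time: {time.time() - start:.1f}s")  # rerun with `augment` removed
```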