Here we consider distributed training of a CNN model from `keras-applications` for image classification.
These are simple examples, intended only to show how to add distributed-training code when the data are stored as a set of TFRecord files. The dataset is a randomly drawn subset of ImageNet containing about 120,000 images.
- Exercise: writing a decoder for an ImageNet TFRecord file
- Training Inception on ImageNet (single node)
- Distributed training of Inception on ImageNet with Horovod (1)
- Distributed training of Inception on ImageNet (`tf.distribute`)
- Distributed training of Inception on ImageNet with Horovod (`dataset.shard`/interleave)
- Distributed training of Inception on ImageNet with Horovod (sharding files by hand/interleave)
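The last two variants differ in how the TFRecord files are split across workers. The following minimal sketch, in plain Python with no TensorFlow required, illustrates round-robin file sharding. The file-name pattern and the helpers `tfrecord_files` and `shard_files` are hypothetical and are not taken from the examples. In a Horovod script, `num_workers` and `rank` would come from `hvd.size()` and `hvd.rank()`, and each worker would then read its own files with `tf.data.TFRecordDataset`, typically combined through `Dataset.interleave`.

```python
def tfrecord_files(num_files):
    """Hypothetical ImageNet-style TFRecord file names (illustration only)."""
    return [f"train-{i:05d}-of-{num_files:05d}" for i in range(num_files)]


def shard_files(files, num_workers, rank):
    """Round-robin assignment: worker `rank` keeps every num_workers-th file.

    This matches what tf.data.Dataset.shard(num_workers, rank) keeps when it is
    applied to an unshuffled dataset of file names.
    """
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    return files[rank::num_workers]


if __name__ == "__main__":
    files = tfrecord_files(8)
    num_workers = 4  # would be hvd.size() under Horovod
    for rank in range(num_workers):  # would be hvd.rank() on each worker
        print(rank, shard_files(files, num_workers, rank))
```

Sharding the list of file names before the records are read means that each worker opens only its own files. If `shard` is instead applied to the stream of decoded records, the partition is still disjoint, but every worker reads every file and discards the records it does not keep.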