For a detailed overview of the dataset, visit the Cityscapes website and the Cityscapes GitHub repository. This repository focuses solely on the Pixel-Level Semantic Labeling Task of the Cityscapes dataset.
E.g.: Train a `DeepLabV3plus` model named `MyDeepLabV3plus` with an `EfficientNetV2B0` backbone, `Dice Loss` as the loss function, a batch size of 1, the `relu` activation function, and a dropout rate of 0.1 for the Dropout layers, for 60 epochs.
- Train the model
> python3 train_model.py --data_path /path/to/dataset --model_type DeepLabV3plus --model_name MyDeepLabV3plus --backbone EfficientNetV2B0 --loss DiceLoss --batch_size 1 --activation relu --dropout 0.1 --epochs 60
- Evaluate the model on the validation set:
  - Evaluate the MeanIoU
  - Evaluate the IoU of every class separately
  - Generate the confusion matrix for the validation set
> python3 evaluate_model.py --data_path /path/to/dataset --model_type DeepLabV3plus --model_name MyDeepLabV3plus --backbone EfficientNetV2B0
- Create predictions for the validation and test sets

Perform inference on the validation set and save the predicted images:
> python3 create_predictions.py --data_path /path/to/dataset --model_type DeepLabV3plus --model_name MyDeepLabV3plus --backbone EfficientNetV2B0 --split "val"

Perform inference on the test set and save the predicted images:
> python3 create_predictions.py --data_path /path/to/dataset --model_type DeepLabV3plus --model_name MyDeepLabV3plus --backbone EfficientNetV2B0 --split "test"
Predictions are saved under the `predictions/<model-type>/<model-name>/<split>` directory. For the example above, the following 4 directories are created:
- predictions/DeepLabV3plus/MyDeepLabV3plus/val/rgb
- predictions/DeepLabV3plus/MyDeepLabV3plus/val/grayscale
- predictions/DeepLabV3plus/MyDeepLabV3plus/test/rgb
- predictions/DeepLabV3plus/MyDeepLabV3plus/test/grayscale
The RGB images look like the following:
The `run.sh` script performs model training and evaluation by default, and can optionally make predictions on the test set. It invokes the Python scripts above and, when the predict flag is set, also adds all the logs and predictions of the given model to a zip archive.
> ./run.sh -d /path/to/dataset -t DeepLabV3plus -n MyDeepLabV3plus -b EfficientNetV2B0
Parses files which are under the following directory structure:
<data_path> : the root directory of the Cityscapes dataset
|
├── gtFine_trainvaltest
│ └── gtFine
│ ├── test
│ │ ├── berlin
│ │ ├── bielefeld
│ │ ├── bonn
│ │ ├── leverkusen
│ │ ├── mainz
│ │ └── munich
│ ├── train
│ │ ├── aachen
│ │ ├── bochum
│ │ ├── bremen
│ │ ├── cologne
│ │ ├── darmstadt
│ │ ├── dusseldorf
│ │ ├── erfurt
│ │ ├── hamburg
│ │ ├── hanover
│ │ ├── jena
│ │ ├── krefeld
│ │ ├── monchengladbach
│ │ ├── strasbourg
│ │ ├── stuttgart
│ │ ├── tubingen
│ │ ├── ulm
│ │ ├── weimar
│ │ └── zurich
│ └── val
│ ├── frankfurt
│ ├── lindau
│ └── munster
└── leftImg8bit_trainvaltest
└── leftImg8bit
├── test
│ ├── berlin
│ ├── bielefeld
│ ├── bonn
│ ├── leverkusen
│ ├── mainz
│ └── munich
├── train
│ ├── aachen
│ ├── bochum
│ ├── bremen
│ ├── cologne
│ ├── darmstadt
│ ├── dusseldorf
│ ├── erfurt
│ ├── hamburg
│ ├── hanover
│ ├── jena
│ ├── krefeld
│ ├── monchengladbach
│ ├── strasbourg
│ ├── stuttgart
│ ├── tubingen
│ ├── ulm
│ ├── weimar
│ └── zurich
└── val
├── frankfurt
├── lindau
└── munster
Each of the train, val, and test directories contains subdirectories named after a city. To use a whole split, `subfolder='all'` must be passed to the `Dataset.create()` method in order to read the images from all the subfolders. For testing purposes, a smaller number of images from the dataset can be used by passing `subfolder='<CityName>'`. For example, passing `split='train'` to the `Dataset()` constructor and `subfolder='aachen'` to the `create()` method will make the Dataset object read only the 174 images in the aachen folder and convert them into a `tf.data.Dataset`. You can choose either all the subfolders or exactly one of them, but not an arbitrary combination. After the images (`x`) and the ground truth images (`y`) are read and decoded, they are combined into a single `(x, y)` object.
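As an illustration of the calls above, here is a minimal, hypothetical usage sketch; the import path and the exact signatures of `Dataset` and `create()` are assumptions based on the description, not the repository's verbatim API:

```python
from dataset import Dataset  # hypothetical import path

# Read only the 174 training images under the aachen subfolder,
# applying the normalization that the EfficientNet family expects.
ds = Dataset(split='train', preprocessing='EfficientNet')
train_ds = ds.create(subfolder='aachen')  # -> tf.data.Dataset of (x, y) pairs

# Read the entire training split instead:
full_train_ds = Dataset(split='train').create(subfolder='all')
```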
Generally, images have a shape of `(batch_size, height, width, channels)`.
- Split the image into smaller patches with spatial resolution `(256, 256)`. Because every image has a spatial resolution of `(1024, 2048)`, 32 patches are produced and they comprise a single batch. This means that when the patching technique is used, the batch size is fixed to 32. After this operation the images have a shape of `(32, 256, 256, 3)`, while the ground truth images have a shape of `(32, 256, 256, 1)`. To enable patching, set the `use_patches` argument of the `create()` method to `True` (a sketch of this step is given below).
- Perform data augmentation (see the sketch below):
  - Randomly perform horizontal flipping of images
  - Randomly adjust brightness
  - Randomly adjust contrast
  - Apply Gaussian blur with a random kernel size and variance

  NOTE: While all augmentations are performed on the input images, only horizontal flipping is applied to the ground truth images, because changing the pixel values of a ground truth image would change the classes its pixels belong to.
- Normalize images:
  - By default, the input pixel values are scaled between -1 and 1
  - If using a pretrained backbone, normalize according to what the pretrained network expects at its input. To determine what type of preprocessing will be applied to the images, the name of the pretrained network family must be passed as the `preprocessing` argument of the Dataset constructor. For example, if a model from the EfficientNet family (e.g. EfficientNetB0, EfficientNetB1, etc.) is used as a backbone, then `preprocessing = "EfficientNet"` must be passed (see the dispatch sketch below).
- Preprocess ground truth images (see the sketch below):
  - Map eval IDs to train IDs
  - Convert to one-hot encoding
  - After this operation the ground truth images have a shape of `(batch_size, 1024, 2048, num_classes)`
Finally, the dataset that is created consists of elements `(image, ground_truth)` with shapes `((batch_size, height, width, 3), (batch_size, height, width, num_classes))`.
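A rough sketch of the patch-splitting step, using `tf.image.extract_patches`; this mirrors the behavior described above but is not necessarily how the repository implements it:

```python
import tensorflow as tf

def split_into_patches(image, patch_size=256):
    """Split one (1024, 2048, C) image into a (32, 256, 256, C) batch of patches."""
    channels = image.shape[-1]  # assumes a statically known channel count
    patches = tf.image.extract_patches(
        images=image[tf.newaxis, ...],           # add a batch dimension
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],  # non-overlapping patches
        rates=[1, 1, 1, 1],
        padding='VALID',
    )
    # extract_patches flattens each patch; a 1024x2048 image yields a 4x8
    # grid, i.e. 32 patches, which become the batch dimension here.
    return tf.reshape(patches, [-1, patch_size, patch_size, channels])
```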
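The augmentation step could look roughly like the following; the brightness/contrast ranges are illustrative assumptions, and Gaussian blur is omitted because core TensorFlow has no blur op (it would typically come from e.g. `tensorflow_addons`):

```python
import tensorflow as tf

def augment(image, label):
    # The flip is the only op applied to both image and ground truth, so the
    # spatial correspondence between them is preserved.
    flip = tf.random.uniform([]) > 0.5
    image = tf.cond(flip, lambda: tf.image.flip_left_right(image), lambda: image)
    label = tf.cond(flip, lambda: tf.image.flip_left_right(label), lambda: label)
    # Photometric ops change pixel values, so they touch the input image only.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image, label
```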
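For backbone-specific normalization, one plausible implementation is to dispatch on the family name to the matching Keras `preprocess_input` function; the mapping below is an assumption covering only a few families:

```python
import tensorflow as tf

# Name -> Keras preprocessing function (a subset, for illustration).
PREPROCESSING = {
    "EfficientNet": tf.keras.applications.efficientnet.preprocess_input,
    "EfficientNetV2": tf.keras.applications.efficientnet_v2.preprocess_input,
    "ResNet": tf.keras.applications.resnet.preprocess_input,
    "MobileNetV2": tf.keras.applications.mobilenet_v2.preprocess_input,
}

def normalize(image, preprocessing=None):
    image = tf.cast(image, tf.float32)
    if preprocessing is None:
        return image / 127.5 - 1.0  # default: scale pixel values to [-1, 1]
    return PREPROCESSING[preprocessing](image)
```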
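The eval-ID-to-train-ID mapping plus one-hot encoding might be implemented as below. The lookup table follows the public cityscapesScripts label definitions; collapsing the ignored pixels (255) into a final extra class is an assumed convention here, not necessarily the repository's:

```python
import tensorflow as tf

# Cityscapes label IDs -> train IDs, from the cityscapesScripts labels module.
# 255 marks classes that are ignored during training.
ID_TO_TRAIN_ID = tf.constant(
    [255, 255, 255, 255, 255, 255, 255, 0, 1, 255, 255, 2, 3, 4, 255, 255,
     255, 5, 255, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 255, 255, 16, 17, 18],
    dtype=tf.int32,
)
NUM_CLASSES = 20  # assumption: 19 train classes + 1 class for ignored pixels

def preprocess_label(label):
    """label: (H, W, 1) uint8 tensor of Cityscapes label IDs."""
    train_ids = tf.gather(ID_TO_TRAIN_ID, tf.cast(label[..., 0], tf.int32))
    # Assumed convention: collapse ignored pixels into one extra class.
    train_ids = tf.where(train_ids == 255, NUM_CLASSES - 1, train_ids)
    return tf.one_hot(train_ids, depth=NUM_CLASSES)  # (H, W, NUM_CLASSES)
```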
| Models | Reference |
|---|---|
| U-net | U-Net: Convolutional Networks for Biomedical Image Segmentation |
| Residual U-net | - |
| Attention U-net | Attention U-Net: Learning Where to Look for the Pancreas, CBAM: Convolutional Block Attention Module |
| U-net++ | UNet++: A Nested U-Net Architecture for Medical Image Segmentation |
| DeepLabV3+ | Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation |
Using an ImageNet-pretrained backbone is supported only for U-net, Residual U-net, and DeepLabV3+.
| Network Family | Reference |
|---|---|
| ResNet | Deep Residual Learning for Image Recognition |
| ResNetV2 | Identity Mappings in Deep Residual Networks |
| EfficientNet | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks |
| EfficientNetV2 | EfficientNetV2: Smaller Models and Faster Training |
| MobileNet | MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications |
| MobileNetV2 | MobileNetV2: Inverted Residuals and Linear Bottlenecks |
| MobileNetV3 | Searching for MobileNetV3 |
| RegNetX & RegNetY | Designing Network Design Spaces |