In this project, your goal is to write a software pipeline to detect vehicles in a video (start with test_video.mp4 and later run it on the full project_video.mp4).
I spent some time trying an SVM with color and gradient features to detect vehicles.
The SVM had very good training accuracy, but its accuracy dropped a lot when tested on real-world footage.
Based on my experience in P4 (advanced lane finding) and a lot of reading about it, I
think continuing this way could get me through the project video, but it would not be able to handle my own
video. All the work was in test
In this project, I spent the majority of my time exploring other options. I like SSD, U-Net and Faster R-CNN, but I chose YOLO because it is fast while still giving good results. It succeeded on the project video. I then recorded a video with my iPhone mounted in my car, and YOLO performed very well on it.
In this video, camera calibration has been provided in the folder camera_cal, and the lane finding and
object detection images are camera calibrated.
In this video there are cars in front of me, which is the main problem for a color and gradient lane finding solution, so I only applied object detection to this video. We can see that YOLO does a great job here of detecting and tracking cars, traffic lights, etc.
You only look once (YOLO) is a state-of-the-art, real-time object detection system. In this project I am using the YAD2K project, which is a Keras / TensorFlow implementation of YOLO_v2.
Please note that because we are using YAD2K, we have to install the specific Keras version documented on the YAD2K project page.
- The image is divided into a grid of small cells, for example 7 * 7
- Every cell predicts a number of bounding boxes; every box contains
  - center_point_x
  - center_point_y
  - bounding_box_width
  - bounding_box_height
  - object_probability
- Every cell predicts the probability of each class
- A threshold is applied to all bounding boxes
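The thresholding step at the end of the list can be sketched with NumPy. This is only an illustration and not the YAD2K code: the function name `filter_boxes`, the array layout and the 0.6 threshold are assumptions made for this example. The score of a box is taken as its object probability multiplied by the best class probability of its cell.

```python
import numpy as np

def filter_boxes(boxes, object_prob, class_probs, threshold=0.6):
    """Keep boxes whose best class score reaches the threshold.

    boxes:       (N, 4) array of box coordinates
    object_prob: (N,)   probability that each box contains an object
    class_probs: (N, C) class probabilities of the cell each box belongs to
    """
    class_scores = object_prob[:, np.newaxis] * class_probs  # shape (N, C)
    best_class = np.argmax(class_scores, axis=1)
    best_score = np.max(class_scores, axis=1)
    keep = best_score >= threshold
    return boxes[keep], best_score[keep], best_class[keep]

# Three made-up boxes; only the first one has a score above 0.6.
boxes = np.array([[10, 10, 50, 50], [20, 20, 60, 60], [0, 0, 5, 5]], dtype=float)
object_prob = np.array([0.9, 0.5, 0.1])
class_probs = np.array([[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])
kept_boxes, kept_scores, kept_classes = filter_boxes(boxes, object_prob, class_probs)
```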
If we split the image into 7 * 7 grid cells, each cell predicts 2 bounding boxes, and we have 20 classes to predict,
the total output size would be 7 * 7 * (2 * 5 + 20) = 1470
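The same arithmetic as a small helper, in case you want to try other grid sizes or class counts. The function name `yolo_output_size` is made up for this example.

```python
def yolo_output_size(grid_size, boxes_per_cell, num_classes, values_per_box=5):
    """Number of values predicted for one image.

    Each box carries 5 values (x, y, width, height, object probability);
    the class probabilities are predicted once per cell.
    """
    per_cell = boxes_per_cell * values_per_box + num_classes
    return grid_size * grid_size * per_cell

total = yolo_output_size(grid_size=7, boxes_per_cell=2, num_classes=20)
print(total)  # 1470
```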
During the training stage, all images go through random scaling and translations of up to 20% of the original image size, and the exposure and saturation of each image are randomly adjusted by up to a factor of 1.5 in the HSV color space.
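A rough sketch of the exposure and saturation part of this augmentation is shown here. It is not taken from the YOLO training code: the function names are made up, the image is assumed to be a float HSV array with values in [0, 1], and I assume that "up to a factor of 1.5" means a factor drawn between 1/1.5 and 1.5. Random scaling and translation are not shown.

```python
import numpy as np

def random_scale(max_factor, rng):
    """Draw a factor in [1/max_factor, max_factor], above or below 1 with equal chance."""
    factor = rng.uniform(1.0, max_factor)
    return factor if rng.random() < 0.5 else 1.0 / factor

def jitter_exposure_saturation(hsv_image, max_factor=1.5, rng=None):
    """Randomly scale the S and V channels of a float HSV image in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    out = hsv_image.astype(np.float64).copy()
    out[..., 1] = np.clip(out[..., 1] * random_scale(max_factor, rng), 0.0, 1.0)  # saturation
    out[..., 2] = np.clip(out[..., 2] * random_scale(max_factor, rng), 0.0, 1.0)  # exposure (value)
    return out

rng = np.random.default_rng(0)
hsv = rng.random((4, 4, 3))  # stand-in for a real HSV image
augmented = jitter_exposure_saturation(hsv, rng=rng)
```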
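Before being fed into the network, the image is resized to 416x416, and the network splits this image further into a grid (7x7 in the example above; YOLO_v2 downsamples by a factor of 32, which gives a 13x13 grid for a 416x416 input). All images are normalised to the range 0 to 1 before being fed into the network.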
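A minimal sketch of this preprocessing step, assuming an HxWx3 uint8 frame. To keep the example free of extra dependencies it uses a nearest-neighbour resize written in NumPy, which is a stand-in for whatever image library the project actually uses; the function name `preprocess` is also an assumption.

```python
import numpy as np

def preprocess(image, size=(416, 416)):
    """Resize an HxWx3 uint8 image to `size` and scale pixel values to [0, 1]."""
    height, width = image.shape[:2]
    # Nearest-neighbour resize: pick a source row/column for each target pixel.
    rows = np.arange(size[0]) * height // size[0]
    cols = np.arange(size[1]) * width // size[1]
    resized = image[rows[:, np.newaxis], cols]
    normalised = resized.astype(np.float32) / 255.0
    return normalised[np.newaxis, ...]  # add the batch dimension

frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)
batch = preprocess(frame)  # shape (1, 416, 416, 3), values in [0, 1]
```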
The YOLO network architecture is inspired by GoogLeNet. It contains 24 convolutional layers followed by 2 fully connected layers. 1x1 convolutional layers are used to reduce the feature space, followed by 3x3 convolutional layers.
Just as in HOG (Histogram of Oriented Gradients) or orientation histograms, gradient computation is the first step.
HOG uses [-1, 0, 1] and [-1, 0, 1]T as filter kernels, whereas YOLO contains 7x7 and many 3x3 and 1x1 filters,
whose parameters are learned during the training process.
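To make the comparison concrete, here is a small NumPy sketch of the fixed HOG gradient step. It only covers the gradient computation with the [-1, 0, 1] kernels, not the full HOG descriptor, and the function name `hog_gradients` is made up. A CNN's first layer does a similar filtering operation, except that the kernel values are learned.

```python
import numpy as np

def hog_gradients(gray):
    """Gradient magnitude and orientation using the [-1, 0, 1] kernels."""
    gray = gray.astype(np.float64)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    # Horizontal kernel [-1, 0, 1]: right neighbour minus left neighbour.
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    # Vertical kernel [-1, 0, 1]^T: lower neighbour minus upper neighbour.
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned, 0 to 180 degrees
    return magnitude, orientation

# A vertical edge: dark on the left, bright on the right.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
magnitude, orientation = hog_gradients(image)
```

The magnitude is non-zero only in the two columns next to the edge, which is the kind of response the first convolutional layer also tends to learn.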
After training, if we look at the first layer, edges, directions, colors and shapes start to form as features; this
is similar to HOG, as shown below.
However, a CNN can stack these kinds of layers: the higher the layer, the bigger its receptive field, so it
gets a better abstraction of the features.
### Post process
All predictions are based on the 416x416 scaled image size; to make the result useful we have to re-scale them back to the original image shape.
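A minimal sketch of the re-scaling, assuming the 416x416 network input mentioned earlier and boxes stored in (y_min, x_min, y_max, x_max) order. Both the box order and the function name `rescale_boxes` are assumptions for this example; the actual code may work with normalised coordinates instead.

```python
import numpy as np

def rescale_boxes(boxes, model_size, image_size):
    """Map boxes from model input coordinates back to the original image.

    boxes:      (N, 4) array in (y_min, x_min, y_max, x_max) order
    model_size: (height, width) of the network input, e.g. (416, 416)
    image_size: (height, width) of the original image
    """
    scale_y = image_size[0] / model_size[0]
    scale_x = image_size[1] / model_size[1]
    scale = np.array([scale_y, scale_x, scale_y, scale_x])
    return boxes * scale

boxes_416 = np.array([[208.0, 104.0, 416.0, 312.0]])
boxes_original = rescale_boxes(boxes_416, model_size=(416, 416), image_size=(720, 1280))
```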
The network may find many bounding boxes for the same object; in this case we use non-maximum suppression to get the
best result. See the yolo_eval
method in
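To show the idea behind non-maximum suppression, here is a plain NumPy sketch. It is not the yolo_eval code itself; the function names, the (y_min, x_min, y_max, x_max) box order and the 0.5 IoU threshold are assumptions for this example. The box with the highest score is kept, every remaining box that overlaps it too much is dropped, and the process repeats.

```python
import numpy as np

def iou(box, others):
    """Intersection over union of one box with an (N, 4) array of boxes."""
    y1 = np.maximum(box[0], others[:, 0])
    x1 = np.maximum(box[1], others[:, 1])
    y2 = np.minimum(box[2], others[:, 2])
    x2 = np.minimum(box[3], others[:, 3])
    intersection = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    other_areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return intersection / (area + other_areas - intersection)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # drop boxes that overlap the best one
    return keep

boxes = np.array([[100, 100, 200, 200], [105, 105, 205, 205], [300, 300, 400, 400]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```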
The main code is located in
, and all support files are in yolo:
- yolo.cfg defines the neural network structure
- yolo/model_data/coco_classes.txt defines the classes the system can detect
- yolo/model_data/yolo.h5 contains the weights pre-trained on the COCO dataset, following the yolo.cfg neural network structure
method in class YoloDetector
is the main entry point: provide an image and it will
return bounding_boxes, scores and classes, for example:
- bounding_boxes=[[405, 786, 492, 934]]
- scores=[0.68]
- classes=[2]
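The snippet below shows one way to read such a result. It rests on two assumptions: that the box order is (top, left, bottom, right), and that the class index points into the list in coco_classes.txt, where index 2 is "car" in the usual COCO ordering. The helper name `describe_detections` is made up.

```python
# First three entries of the COCO class list, enough for this example.
class_names = ["person", "bicycle", "car"]

def describe_detections(bounding_boxes, scores, classes, class_names):
    """Format each detection as 'label score (top, left, bottom, right)'."""
    lines = []
    for box, score, class_index in zip(bounding_boxes, scores, classes):
        top, left, bottom, right = box
        label = class_names[class_index]
        lines.append("{} {:.2f} ({}, {}, {}, {})".format(label, score, top, left, bottom, right))
    return lines

detections = describe_detections([[405, 786, 492, 934]], [0.68], [2], class_names)
print(detections[0])  # car 0.68 (405, 786, 492, 934)
```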
For more test images, please visit object detect folder
Class LaneFinder
has been modified to add one more parameter called object_detection_func.
By default, it is a lambda which returns a black image: object_detection_func=lambda image: np.zeros_like(image)
The undistorted image is passed into object_detection_func, and its output is added to the final result.
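The idea of the hook can be sketched as follows. This is not the LaneFinder code: `process_frame` and `lane_overlay_func` are hypothetical names, and the way the overlays are combined (a clipped sum) is an assumption. Only the default value of object_detection_func is taken from the project.

```python
import numpy as np

def process_frame(undistorted, lane_overlay_func,
                  object_detection_func=lambda image: np.zeros_like(image)):
    """Combine the lane overlay and the object detection overlay on one frame."""
    lane_overlay = lane_overlay_func(undistorted)
    object_overlay = object_detection_func(undistorted)
    # Sum in a wider type, then clip back to the valid 8-bit range.
    combined = (undistorted.astype(np.int32)
                + lane_overlay.astype(np.int32)
                + object_overlay.astype(np.int32))
    return np.clip(combined, 0, 255).astype(np.uint8)

frame = np.full((4, 4, 3), 100, dtype=np.uint8)
no_lanes = lambda image: np.zeros_like(image)

# With the default (black image) the frame is unchanged.
unchanged = process_frame(frame, no_lanes)
# A detector that paints bright pixels changes the final result.
marked = process_frame(frame, no_lanes, lambda image: np.full_like(image, 200))
```

Because the default returns a black image, lane finding keeps working on its own, and a detector can be plugged in without touching the rest of the pipeline.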
- Hand-picked features do not generalize well enough
Based on my experience in P4 (advanced lane finding) and in this project, I think color, color space and gradient features with an SVM or Decision Trees do not generalize well enough. They depend heavily on parameters that a human provides, whereas the deep learning approach is more about defining the loss function and letting the computer figure out the best parameters. As long as we have lots of training data, it can do better than human-picked parameters.
- YOLO works really well
The project video works really well, as it doesn't have many elements. My own video was not focused properly, but YOLO was still able to identify vehicles, traffic lights and a person on a bicycle.
- A heat-map or moving average solution would still be beneficial
I noticed that it misses some objects in some frames. We could lower the threshold, which results in more objects being detected, and at the same time create a heat map based on history data. For example, if a car has been detected in the last 3 frames, we have very high confidence that it will appear in frames 4 and 5; however, if it is still not detected in frame 6, we can remove it from our list.
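A minimal sketch of this history idea, which I have not implemented in the project. The class name `DetectionHistory` and its parameters are made up, and it assumes each detection already has a stable id across frames (for example from matching boxes by overlap), which is not shown here.

```python
class DetectionHistory:
    """Keep an object alive for a few frames after it was last detected."""

    def __init__(self, min_hits=3, max_misses=2):
        self.min_hits = min_hits      # detections needed before we trust an object
        self.max_misses = max_misses  # missed frames tolerated before dropping it
        self.hits = {}
        self.misses = {}

    def update(self, detected_ids):
        """Record one frame and return the ids currently considered present."""
        detected = set(detected_ids)
        for object_id in detected:
            self.hits[object_id] = self.hits.get(object_id, 0) + 1
            self.misses[object_id] = 0
        for object_id in list(self.hits):
            if object_id not in detected:
                self.misses[object_id] += 1
                if self.misses[object_id] > self.max_misses:
                    del self.hits[object_id]
                    del self.misses[object_id]
        return sorted(i for i, count in self.hits.items() if count >= self.min_hits)

history = DetectionHistory()
# The car is seen in frames 1 to 3 and then missed in frames 4 to 6.
frames = [["car"], ["car"], ["car"], [], [], []]
present = [history.update(frame) for frame in frames]
```

With these settings the car is reported from frame 3, kept through frames 4 and 5, and removed in frame 6.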
- Stanford CS class CS231n: Convolutional Neural Networks
- Paper You Only Look Once: Unified, Real-Time Object Detection
conda env create -f environment.yml
python --version
to check the Python version; you should have Python 3.5.