I wrote this blog post for a Udacity Data Scientist Nanodegree project; the code, data, and models can be downloaded from GitHub.
Part 1: Introduction
The purpose of this article is to demonstrate how we can identify dog breeds from a given picture. If a person happens to be in the picture, we also want to know which kind of dog shares the most similarity with this person.
In order to fulfill this task, we design the following processing chain:
- When we are given a picture, we first run it through the Person Detector module, which tells us whether the picture contains a person.
- In parallel, we run the same picture through the Dog Detector module, which tells us whether the picture contains a dog.
- If we are sure that either a person or a dog is present, we continue with the Dog Breed Classifier module, which tells us the dog breed that the person/dog belongs to. We skip the cases where a picture passes the examination of both the Person Detector and the Dog Detector modules, as this is a confusing case. A minimal sketch of this control flow is shown after the list.
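The control flow can be summarized with the following sketch, where face_detector, dog_detector, and predict_breed are hypothetical names for the three modules discussed in the rest of the article:

def classify_image(img_path):
    is_person = face_detector(img_path)  # Person Detector module
    is_dog = dog_detector(img_path)      # Dog Detector module
    if is_person and is_dog:
        return 'Confusing case: both a person and a dog were detected.'
    if is_dog:
        return 'Detected a dog of breed: ' + predict_breed(img_path)
    if is_person:
        return 'This person looks like a: ' + predict_breed(img_path)
    return 'Neither a person nor a dog was detected.'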
In the following sections, we discuss the three modules in detail: the Person Detector, the Dog Detector, and the Dog Breed Classifier.
Part 2: Person Detector
How do we detect a person in an RGB image?
The basic idea behind the Person Detector is that if we can detect a person’s face in the image, there is a good chance that a person is present. In that sense, the Person Detector is really a Face Detector.
Fortunately, there are a lot of open-source face detectors; in this project we use one implemented in OpenCV, based on Haar feature-based cascade classifiers. OpenCV provides many pre-trained face detectors, stored as XML files on GitHub. See Face Detection using Haar Cascades for more details about this method.
First, let’s see how we can use this face detector; here is the demonstration code:
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# extract pre-trained face detector
face_cascade = cv2.CascadeClassifier('haarcascades/haarcascade_frontalface_alt.xml')

# load color (BGR) image
img = cv2.imread(human_files[3])
#img = cv2.imread(train_files[3])

# convert BGR image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# find faces in image
faces = face_cascade.detectMultiScale(gray)

# print number of faces detected in the image
print('Number of faces detected:', len(faces))

# get bounding box for each detected face
for (x, y, w, h) in faces:
    # add bounding box to color image
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

# convert BGR image to RGB for plotting
cv_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# display the image, along with bounding box
plt.imshow(cv_rgb)
plt.show()
How do we evaluate the result?
Let’s evaluate this method for person detection before we use it. The metric we use is the percentage of the first 100 images in the dog and human face data sets with a detected human face. For the dog data set, we expect the detection rate to be 0%, while for the human face data set we expect it to be 100%.
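A minimal sketch of this evaluation, reusing the face_cascade defined above and assuming human_files and dog_files are lists of image paths from the two data sets:

import numpy as np

def face_detector(img_path):
    # return True if at least one face is detected in the image
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0

human_rate = np.mean([face_detector(f) for f in human_files[:100]])
dog_rate = np.mean([face_detector(f) for f in dog_files[:100]])
print('Faces detected in human files: {:.0%}, in dog files: {:.0%}'.format(human_rate, dog_rate))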
With the data sets we have, we achieved a 100% detection rate on the human data set and 12% on the dog data set. The OpenCV face detector thus produces relatively many false positives, but the performance is acceptable for the current experiment. However, we have to admit that the method has a limitation: it requires a person with a clearly visible face. A person may be captured by the camera without a clear face; in that situation, the method we just discussed will fail. One way to solve this problem is to use a deep neural network trained on images of people with both clear and unclear/no faces, and then use the trained network to perform human detection.
Both the human data set and the dog data set can be found at the end of the article.
Part 3: Dog Detector
How do we detect dogs?
We use a pre-trained ResNet-50 model to detect dogs in images, with weights trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks. ImageNet contains over 10 million URLs, each linking to an image of an object from one of 1000 categories. Given an image, this pre-trained ResNet-50 model returns a prediction (drawn from the available ImageNet categories) for the object contained in the image.
ResNet-50 is a deep neural network architecture with 25,636,712 parameters in total, of which 25,583,592 are trainable. If you want to visualize it, please check Netscope.
ResNet-50 outputs a probability vector of size 1000, and we use the argmax function to identify the integer corresponding to the model’s predicted object class. Checking this dictionary, we see that if the identified integer is between 151 and 268 (inclusive), a dog has been detected.
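A minimal sketch of the dog detector, following the standard Keras usage of the pre-trained ResNet-50 (the helper names are ours):

import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing import image

ResNet50_model = ResNet50(weights='imagenet')

def path_to_tensor(img_path):
    # load the image and resize it to the 224x224 input ResNet-50 expects
    img = image.load_img(img_path, target_size=(224, 224))
    # convert to a 4D tensor of shape (1, 224, 224, 3)
    return np.expand_dims(image.img_to_array(img), axis=0)

def dog_detector(img_path):
    # ImageNet class indices 151-268 (inclusive) correspond to dog breeds
    prediction = np.argmax(ResNet50_model.predict(preprocess_input(path_to_tensor(img_path))))
    return 151 <= prediction <= 268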
How do we evaluate the result?
The way we evaluate dog detection accuracy is similar to what we did for person detection. We use the same human and dog data sets, and the metric is the percentage of the first 100 images in each data set in which a dog is detected. Ideally, the detection rate should be 0% for the human data set and 100% for the dog data set.
In our experiment, we found that the dog detection rate on the dog data set reaches 100%, while the dog detection rate on the human data set is 1%. This result is very good for our application.
Both the human data set and the dog data set can be found at the end of the article.
Part 4: Dog Breed Classifier
How do we classify dog breed?
Dog breeds are not easy to classify, and in many cases even humans cannot tell the difference between two breeds. For example:
Deep neural networks have become more and more popular, and many deep learning practitioners have reported that they can achieve much better results than humans. It is therefore natural to use a deep neural network for our dog breed classification problem.
Before we dive into the deep neural network, let’s first look at our dog data set, which will be used to train, validate, and test our dog breed classifier.
What does the dog data set look like?
The dog data set contains 8,351 images covering 133 dog categories. An overview of the images in this data set is shown below:
We randomly split the data set into three groups: training (6,680 images), validation (835), and test (836):
How do we pre-process data?
Before the data can be used for neural network training, we need to process it so that it can serve as input to the neural network. In our experiment, the following processing is applied (a short sketch follows the list):
- Image size normalization: resize the input image so that its dimensions become [224, 224, 3], where 224 refers to the width and height of the image and 3 refers to its bands (RGB).
- Image pixel value normalization: divide the pixel values by 255.0 so that they fall in the range 0.0 to 1.0.
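A minimal sketch of this pre-processing, reusing path_to_tensor from the dog detector sketch above:

import numpy as np

def paths_to_tensors(img_paths):
    # resize every image to (224, 224, 3) and stack them into one 4D array
    tensors = [path_to_tensor(p) for p in img_paths]
    # scale pixel values from [0, 255] to [0.0, 1.0]
    return np.vstack(tensors).astype('float32') / 255.0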
How do we design/train/evaluate a Neural Network?
Before we try a well-known neural network architecture with transfer learning, let’s build our first neural network ourselves. Our architecture is inspired by VGG-16, but it is a shallow network. We follow these rules when designing it:
- Whenever the spatial size of the feature maps is halved, we double the number of feature channels, which helps limit the information loss during training.
- We use a Dropout layer to regularize the neural network.
- We use a softmax layer as the last layer because this is a multi-class classification problem, and we use categorical cross-entropy as the loss function.
The summary of our Neural Network is as follows:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 224, 224, 16) 448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 112, 112, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 112, 112, 32) 4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 56, 56, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 56, 56, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 28, 28, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 50176) 0
_________________________________________________________________
dense (Dense) (None, 300) 15053100
_________________________________________________________________
dropout (Dropout) (None, 300) 0
_________________________________________________________________
dense_1 (Dense) (None, 133) 40033
=================================================================
Total params: 15,116,717
Trainable params: 15,116,717
Non-trainable params: 0
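For reference, here is a sketch that reproduces this architecture in Keras; the 3x3 kernels, 'same' padding, ReLU activations, and the dropout rate are assumptions consistent with the parameter counts above:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(16, (3, 3), padding='same', activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(300, activation='relu'),
    Dropout(0.4),  # dropout rate is an assumption
    Dense(133, activation='softmax'),
])
model.summary()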
For training, we use the rmsprop optimizer, and the evaluation metric we use is accuracy, which is defined as:
accuracy = sum(estimated_dog_breed==groundtruth_dog_breed)/test_dataset_number
However, the trained model does not generalize well on test data, even though we use an early-stopping technique to pick the best one among all the models saved at different epochs. Our self-defined model reaches 7.5359% accuracy.
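For completeness, a sketch of the training set-up, assuming train_tensors/valid_tensors and one-hot encoded train_targets/valid_targets come from the split described above (the epoch count and batch size are assumptions):

from keras.callbacks import ModelCheckpoint

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# keep only the weights with the best validation loss seen during training
checkpointer = ModelCheckpoint(filepath='weights.best.from_scratch.hdf5',
                               save_best_only=True, verbose=1)
model.fit(train_tensors, train_targets,
          validation_data=(valid_tensors, valid_targets),
          epochs=10, batch_size=20, callbacks=[checkpointer])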
Transfer learning with VGG16
Our self-designed network does not work well. There are many possible reasons: the architecture is shallow, the number of training epochs is small, the hyper-parameters could be refined, and so on. Among these, the fact that our training data set is relatively small compared to the large number of model parameters to estimate is a major contributing factor.
We could use data augmentation to artificially increase the number of training samples, which is a good direction to go. In our experiment, however, we decided to use another technique: transfer learning.
The basic idea behind transfer learning is to use a well-trained neural network as a feature extractor and only retrain the last layer (or the last few layers) for the target application. Transfer learning has a loose analogy with how our brains reuse what they have learned, and deep learning practitioners use it often. However, we cannot use our self-defined neural network in a transfer learning context, because nobody has already trained it on a large number of images. Instead, we have to use a well-designed network from the literature.
Among all the well-known neural network architectures, we select VGG16 due to its simplicity and popularity. You can check its visualization on the Netscope website. The trained parameters of VGG16 are part of the Keras library, and we can use the following code to obtain the feature extractor:
def extract_VGG16(tensor):
    from keras.applications.vgg16 import VGG16, preprocess_input
    return VGG16(weights='imagenet', include_top=False).predict(preprocess_input(tensor))
In the Udacity course, the extracted features for the training/validation/test data sets are provided by the tutors; for VGG16, the extracted feature database can be downloaded from VGG-16 bottleneck features. This way, we avoid generating the data for training the last layer of our network ourselves.
The last convolution layer in VGG-16 has output size [7, 7, 512], and the extracted VGG-16 bottleneck features have the same shape. Since we want to classify the dog breed into one of the 133 categories, a straightforward way to set up the last layer is to connect all the bottleneck features (7*7*512 = 25,088) to 133 output units. However, if we do it this way, we have a lot of parameters to estimate: 25,088*133 + 133 = 3,336,837, a number comparable in magnitude to our self-defined neural network (15,116,717 parameters). As we discussed before, one purpose of transfer learning is to reduce the number of parameters we have to learn. To reduce them further, we apply a GlobalAveragePooling2D layer to the bottleneck features. After a bottleneck feature passes through this layer, its shape is reduced to [1, 1, 512], and so is the number of learned parameters: 133*512 + 133 = 68,229.
The implementation in Keras is as follows:
VGG16_model = Sequential()
VGG16_model.add(GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]))
VGG16_model.add(Dense(133, activation='softmax'))
VGG16_model.summary()
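Training this small head on the pre-computed bottleneck features is fast; a sketch, reusing ModelCheckpoint as before and assuming train_VGG16/valid_VGG16 and the corresponding one-hot targets are loaded from the bottleneck feature file:

VGG16_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
checkpointer = ModelCheckpoint(filepath='weights.best.VGG16.hdf5',
                               save_best_only=True, verbose=1)
VGG16_model.fit(train_VGG16, train_targets,
                validation_data=(valid_VGG16, valid_targets),
                epochs=20, batch_size=20, callbacks=[checkpointer])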
We use classification accuracy to evaluate the trained model (as we did before), and we find that transfer learning dramatically improves dog breed classification accuracy: with the VGG-16 pre-trained model, we reach 69.4976% accuracy on the test data set.
Transfer learning with InceptionV3
Encouraged by this interesting result with transfer learning, we decided to go further and use the InceptionV3 neural network as our backbone for transfer learning.
InceptionV3 is a far more advanced neural network architecture than VGG-16, and it is much deeper. We refer readers interested in this network to A Review of Popular Deep Learning Architectures: ResNet, InceptionV3, and SqueezeNet.
Using the same test data set and the same accuracy metric, with InceptionV3 we achieve an even better classification accuracy: 79.6651%. This is the best accuracy we have obtained so far.
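The set-up mirrors the VGG16 one; a sketch, assuming the InceptionV3 bottleneck features are loaded into train_InceptionV3 (a hypothetical name):

from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense

InceptionV3_model = Sequential()
InceptionV3_model.add(GlobalAveragePooling2D(input_shape=train_InceptionV3.shape[1:]))
InceptionV3_model.add(Dense(133, activation='softmax'))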
Part 5: Dog Breed Classifier Application
Since InceptionV3 obtained the best result, when we create an application using the procedure described in Part 1, the default dog breed classifier comes from the InceptionV3 model.
We are satisfied with the application, and it gives decent results. Here are some examples:
Part 6: Conclusion
In this project we set up a dog breed classification program for a given RGB picture, and it shows good results:
- It can reject pictures in which no dog or person appears.
- If the picture contains a dog, we can classify it into the right breed category.
- If the picture contains a person, we can find the dog breed the person looks like most.
For person detection, we use a traditional machine learning algorithm from OpenCV; for dog breed classification, we use the InceptionV3 neural network with transfer learning. We have verified that we can obtain decent results with the adopted techniques.
However, as in many computer vision projects, there are always cases that fail our system. In particular, the person detection module makes the strong assumption that a clear face appears in the image. For our current application this is a reasonable assumption, but it does not make for a robust general-purpose person detection method. An alternative is to use a deep neural network, as we did for dog breed classification.
The dog breed classifier still has room for improvement:
- Image augmentation will help
- Hyper-parameter searching will help
- More data will help
- More advanced architecture will help
- ….
Part 7: Code, data and models
Code
Data
- human data set
- dog data set
- dog data set VGG-16 bottleneck features
- dog data set InceptionV3 bottleneck features
Models
Part 8: Acknowledgements
This project was initiated by the Udacity Data Scientist Nanodegree course, and part of the project code was provided by Udacity.