Technical Review of Object Detection Using Deep Neural Networks

ifeelfree
15 min read · Dec 31, 2020


Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. The major problems object detection faces are as follows:

  • objects may appear anywhere in the image
  • objects may have diverse shapes
  • objects may have diverse sizes
  • class imbalance biases the network towards learning background information, which hurts accuracy

This leads to the dilemma of object detection: on the one hand, when performing classification of an object, we want the model to learn location invariance: regardless of where the cat appears in the image, we want to classify it as a cat. On the other hand, when performing detection of the object, we want to learn location variance: if the cat is in the top left-hand corner, we want to draw a box in the top left-hand corner. So if we are trying to share convolutional computation across the whole network, how do we compromise between location invariance and location variance?

Table of Contents

· Part 1: Introduction
· 1.1 Object Detection
· 1.2 Segmentation and its Variants
· 1.3 Process Pipeline
· Part 2: Backbone Networks
· Part 3: Object Detection
· 3.1 Two-stage Object Detection
· 3.2 One-stage Object Detection
· 3.3 Other Components
· Part 4: Segmentation and Its Variants
· 4.1 Semantic Segmentation
· 4.1.1 Fully Convolutional Network (FCN)
· 4.1.2 U-Net
· 4.1.3 SegNet
· 4.1.4 DeconvNet
· 4.1.5 RedNet
· 4.2 Instance Segmentation
· Part 5: Datasets
· Part 6: Metrics
· Part 7: Practical Advice on How to Improve Object Detection Accuracy
· Part 8: Reference

Part 1: Introduction

1.1 Object Detection

Traditional object detection finds a rectangle (a bounding box) around each object and also classifies the object within that rectangle.


Existing domain-specific image object detectors can usually be divided into two categories:

One Stage

  • Representative algorithms: OverFeat, YOLO, SSD, etc.
  • Basic idea: detect all candidate bounding boxes that may contain objects of interest, then prune away redundant, overlapping boxes by combining a confidence threshold with Non-Maximum Suppression (NMS); a minimal sketch of NMS follows this list.
  • Advantage: speed
  • Disadvantage: high memory cost and lower detection accuracy
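
For concreteness, here is a minimal NumPy sketch of greedy NMS. The [x1, y1, x2, y2] box format and the 0.5 IoU threshold are illustrative choices, not values from any particular detector.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and suppress
    remaining boxes whose IoU with it exceeds the threshold.
    boxes: (N, 4) float array of [x1, y1, x2, y2]; scores: (N,)."""
    order = scores.argsort()[::-1]               # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]                             # best remaining box
        keep.append(i)
        # IoU between box i and every other remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop suppressed boxes
    return keep
```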

Two Stage

  • Representative algorithms: R-CNN, Fast R-CNN, Faster R-CNN, etc.
  • Basic idea: decouple region proposal from detection. In the first stage, candidate regions are proposed; in the second stage, those regions are classified and their boxes refined.
  • Advantage: accuracy
  • Disadvantage: speed

1.2 Segmentation and its Variants

Semantic Segmentation

Semantic segmentation draws a mask that outlines objects: every pixel is assigned a class label, without distinguishing between individual objects of the same class.

Instance Segmentation

Instance segmentation goes one step further: besides per-pixel class labels, it separates individual object instances of the same class (see Part 4.2).

1.3 Process Pipeline

Deep-neural-network-based object detection pipelines generally have four steps: image pre-processing, feature extraction, classification and localization, and post-processing.

Firstly, raw images from the datasets cannot be fed into the network directly. We need to resize them to the size the network expects and may enhance them, for example by adjusting brightness, color, or contrast. Data augmentation is also available where more training variety is needed, such as flipping, rotation, scaling, cropping, translation, or adding Gaussian noise.
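
As a concrete sketch of such a pipeline, here is a torchvision example; the target size and augmentation magnitudes are arbitrary illustrative values, and note that for detection the geometric transforms (flip, rotation, crop) must be applied to the bounding boxes as well:

```python
import torchvision.transforms as T

# Illustrative pre-processing/augmentation pipeline for training images.
train_transform = T.Compose([
    T.Resize((512, 512)),                                         # resize to the network's input size
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # brightness/color/contrast
    T.RandomHorizontalFlip(p=0.5),                                # flipping
    T.RandomRotation(degrees=10),                                 # rotation
    T.ToTensor(),                                                 # PIL image -> float tensor in [0, 1]
])
```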

Secondly, feature extraction is a key step for further detection. The feature quality directly determines the upper bound of subsequent tasks like classification and localization.

Thirdly, the detector head is responsible for proposing and refining bounding boxes, producing classification scores and bounding box coordinates.

Finally, the post-processing step removes weak detection results. For example, NMS is a widely used method in which the highest-scoring box suppresses nearby boxes with inferior classification scores.

Part 2: Backbone Networks

The backbone network acts as the basic feature extractor for the object detection task: it takes images as inputs and outputs feature maps of those inputs.

  • Most backbone networks for detection are classification networks with the last fully connected layers taken out (see the sketch after this list).
  • Newer high-performance classification networks (deeper, with densely connected layers) can improve precision and reduce the complexity of the object detection task.
  • It is important for the feature extractor to generalize, producing features usable at any scale. For this, Feature Pyramid Networks (FPN) are used, which extract features at every scale (small, medium, and large). This type of feature extractor is used in most modern object detectors.
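
The first bullet can be made concrete in a few lines. Below is a sketch, assuming PyTorch and torchvision with ResNet-50 as one illustrative choice of classification network, of turning a classifier into a backbone by dropping its final pooling and fully connected layers:

```python
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet50(pretrained=True)                 # a classification network
backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc

x = torch.randn(1, 3, 512, 512)   # dummy image batch
features = backbone(x)            # feature maps instead of class scores
print(features.shape)             # torch.Size([1, 2048, 16, 16]), i.e. stride 32
```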

Part 3: Object Detection

Object detection algorithm classification
Representative algorithms for object detection

3.1 Two-stage Object Detection

Faster R-CNN architecture

The above figure shows the Faster R-CNN architecture. In this architecture, a Region Proposal Network (RPN) feeds region proposals into the classifier and box regressor.

At the beginning of two-stage method history, region proposals were produced not by a neural network but by a traditional image-processing method called Selective Search: it merges similar pixels based on texture information using a merge-set (union-find) data structure. The R-CNN and Fast R-CNN methods use Selective Search for region proposals.

Fast R-CNN
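
To see a two-stage detector end to end, the sketch below runs torchvision's reference Faster R-CNN implementation on a dummy image; the 0.5 score threshold is my own illustrative choice:

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()                       # inference mode

image = torch.rand(3, 480, 640)    # dummy RGB image with values in [0, 1]
with torch.no_grad():
    outputs = model([image])       # a list with one dict per input image

boxes  = outputs[0]["boxes"]       # (N, 4) proposals refined by the second stage
scores = outputs[0]["scores"]      # classification confidences
labels = outputs[0]["labels"]      # predicted class indices
keep = scores > 0.5                # discard weak detections
print(boxes[keep], labels[keep])
```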

3.2 One-stage Object Detection

Anchor-based (SSD, YOLOv2, etc.)

Anchor-based methods were the first used in one-stage object detection. They have the following properties (a small anchor-generation sketch follows the list):

  • Detection performance is sensitive to the sizes, aspect ratios and number of anchor boxes.
  • Even with careful design, because the scales and aspect ratios of anchor boxes are kept fixed, detectors have difficulty dealing with object candidates that show large shape variations, particularly small objects. The pre-defined anchor boxes also hamper the generalization ability of detectors, as they need to be re-designed for new detection tasks with different object sizes or aspect ratios.
  • In order to achieve a high recall rate, an anchor-based detector is required to densely place anchor boxes on the input image.
  • Anchor boxes also involve complicated computation such as calculating the intersection-over-union (IoU) scores with ground-truth bounding boxes.
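
To illustrate how dense the tiling becomes, here is a small sketch that places anchors of several scales and aspect ratios at every feature-map location; the particular scales, ratios, and stride are arbitrary example values:

```python
import itertools
import torch

def make_anchors(feat_h, feat_w, stride,
                 scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Place len(scales) * len(ratios) anchors, as [x1, y1, x2, y2] in
    image coordinates, at every location of a feat_h x feat_w map."""
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # anchor centre
        for s, r in itertools.product(scales, ratios):
            w, h = s / r ** 0.5, s * r ** 0.5             # aspect ratio r = h / w
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)

# Even a modest 64x64 feature map at stride 8 yields 64 * 64 * 9 = 36864 anchors.
print(make_anchors(64, 64, stride=8).shape)   # torch.Size([36864, 4])
```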

Anchor-free (YOLOv1, CornerNet, DenseBox, etc.)

  • Detection is unified with many other FCN-solvable tasks such as semantic segmentation, making it easier to re-use ideas from those tasks.
  • Detection becomes proposal-free and anchor-free, which significantly reduces the number of design parameters. Those parameters typically need heuristic tuning, and many tricks are involved in achieving good performance; an anchor-free framework therefore makes the detector, particularly its training, considerably simpler.
  • By eliminating anchor boxes, an anchor-free detector completely avoids the complicated computation related to anchor boxes, such as the IoU computation and the matching between anchor boxes and ground-truth boxes during training, resulting in faster training and testing as well as a smaller training memory footprint than its anchor-based counterpart.

3.3 Other Components

Loss Function

The final step of object detection is classification and bounding box localization. This step is generally driven by a combination of loss functions: a regression loss and a classification loss. Different methods/networks use different variations of the final loss, but the main component functions are described below.

The final output of the feature extractor is used to calculate the loss, which is backpropagated to adjust the localization values and class probabilities. In generic terms, these modules are also known as the classifier head and the regressor head.

Some commonly used loss functions for the regressor head include:

1. Mean Squared Error Loss / L2 Loss: MSE is one of the most commonly used loss functions. It is the mean of the squared differences between the target variable and the predicted variable.

2. Mean Absolute Error: MAE is another loss function used for the regressor head. It is the mean of the absolute differences between the target and the predicted variable.

3. Huber Loss: Huber loss is less sensitive to outliers in the data than squared error loss, and it is also differentiable at 0. It is essentially absolute error that becomes quadratic when the error is small. How small the error has to be to become quadratic depends on a hyperparameter, 𝛿 (delta), which can be tuned. Huber loss approaches MAE as 𝛿 → 0 and MSE as 𝛿 → ∞.
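
All three regression losses are one-liners in PyTorch. The sketch below compares them on toy values; note that F.smooth_l1_loss is the Huber loss with 𝛿 = 1, the variant commonly used for box regression:

```python
import torch
import torch.nn.functional as F

pred   = torch.tensor([0.5, 2.0, 10.0])   # toy predicted coordinates (one outlier)
target = torch.tensor([0.0, 2.5, 4.0])    # toy ground-truth coordinates

mse   = F.mse_loss(pred, target)          # L2: mean of squared differences
mae   = F.l1_loss(pred, target)           # L1: mean of absolute differences
huber = F.smooth_l1_loss(pred, target)    # Huber with delta = 1 ("smooth L1")
print(mse.item(), mae.item(), huber.item())   # ~12.17, ~2.33, ~1.92
```

The outlier (10.0 vs 4.0) dominates MSE but is only penalized linearly by MAE and Huber, which is exactly the robustness described above.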

Some commonly used loss functions for the classifier head include:

1. Cross-entropy loss: measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label: predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
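
A minimal PyTorch illustration with toy class scores: F.cross_entropy combines a softmax with the negative log-likelihood, so the loss equals -log of the probability assigned to the true class:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw scores for 3 classes (toy values)
label  = torch.tensor([0])                  # ground-truth class index

loss = F.cross_entropy(logits, label)       # softmax + negative log-likelihood
p_true = F.softmax(logits, dim=1)[0, 0]     # probability assigned to the true class
print(p_true.item(), loss.item())           # loss == -log(p_true); 0 for a perfect model
```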

Post Processing

As noted in Section 1.3, the standard post-processing step is NMS, which suppresses duplicate and low-confidence detections.

Part 4: Segmentation and Its Variants

4.1 Semantic Segmentation

4.1.1 Fully Convolutional Network (FCN)

The first straightforward idea for extending a classification neural network to segmentation is the Fully Convolutional Network (FCN).

The structure of the fully convolutional network

FCN stands for fully convolutional network. In the original classification structure, the last layer of a CNN may be a softmax layer that predicts the probability of each category, so only one result can be produced per image. Long et al. therefore treated the model with another concept: the fully connected layer in the usual structure can be regarded as a convolution layer whose kernel size is the size of the whole feature map! Is it possible, then, to predict the category of each pixel just by convolution?

The answer is yes. In FCN, the image is processed through the network, and a "coarse" feature response map is produced at the end. This feature response map roughly represents the category of the original image at pixel level; however, its size has shrunk to 1/32 of the input. To reconstruct the original image size, the first idea is bilinear interpolation. However, a fixed interpolation is not well suited to realistic situations.
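
The fully-connected-as-convolution reinterpretation above can be verified in a few lines. In this sketch the 512 x 7 x 7 feature map size and the 21 classes are illustrative; copying the weights of a fully connected layer into a 7 x 7 convolution gives identical outputs, and on a larger input the convolution slides, producing exactly such a coarse grid of per-location predictions:

```python
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 21)            # classifier over a 512 x 7 x 7 feature map
conv = nn.Conv2d(512, 21, kernel_size=7)   # the same weights viewed as a 7x7 kernel
conv.weight.data = fc.weight.data.view(21, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5))  # True

# On a larger feature map the "classifier" now yields a grid of predictions:
print(conv(torch.randn(1, 512, 14, 14)).shape)   # torch.Size([1, 21, 8, 8])
```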

Another idea is a learnable up-sampling method, which can more reasonably learn how to deal with the problem case by case. In the up-sampling process of FCN, the coarse feature map is operated on by a learnable up-sampling layer. Next, element-wise addition is applied with an earlier feature map. Through that earlier feature map, the result recovers location and detail information that was destroyed by the max pooling layers.

How up-sampling works in FCN
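
A minimal sketch of this fusion step with illustrative channel counts: a transposed convolution doubles the resolution of the coarse stride-32 map, and the result is added element-wise to a projected stride-16 map from an earlier layer:

```python
import torch
import torch.nn as nn

upsample  = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1)  # learnable 2x up-sampling
skip_proj = nn.Conv2d(256, 256, kernel_size=1)   # 1x1 projection on the skip branch

coarse = torch.randn(1, 512, 16, 16)   # stride-32 feature map
pool4  = torch.randn(1, 256, 32, 32)   # stride-16 feature map from an earlier layer

fused = upsample(coarse) + skip_proj(pool4)   # element-wise addition, FCN-style
print(fused.shape)                            # torch.Size([1, 256, 32, 32])
```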

The authors proposed three kinds of results: FCN-32s, FCN-16s, and FCN-8s. The number denotes the shrinking factor of the prediction: for example, the result of FCN-32s is 1/32 the size of the original image. In contrast, the result of FCN-8s passes through two learnable up-sampling layers and element-wise additions.

Results of FCN at the different output scales

In the authors' experiments, we can see that FCN-8s performs best: the detailed boundaries of the person and the bicycle are clearer than in the other two results. Moreover, as the authors mention, FCN-4s and FCN-2s do not perform better than FCN-8s. Thus, it is not certain that more fusion leads to higher accuracy.

4.1.2 U-Net

FCN caused a big bang in this territory: its straightforward concept hinted at how to solve this kind of problem. After FCN, lots of models were launched. Since the shapes of these models look like a horizontal hourglass, I call them hourglass-like models. These models do a good job in many different tasks, including pixel segmentation, object recognition, denoising, super-resolution, etc.

The next one is U-Net. The specialty of this model is that it was proposed to solve medical imaging problems in the first place! The shape of the whole model looks just like the English letter "U"; as a result, the name of this model is U-Net. You can see the shape as drawn in the original paper below.

The structure of U-Net in the original paper

The U-Net authors targeted biomedical segmentation, a task that does not depend on a rich set of object categories. To speed up computation, U-Net drops the last two layers of the VGG-style encoder; this is the first advantage of U-Net. Second, rather than using element-wise addition to fuse the previous tensor with the up-sampled tensor, U-Net instead concatenates the two along the channel dimension.

The structure of U-Net, sketched by myself

This is another structural image of U-Net. As you can see, after each up-sampling operation, the tensor also passes through two convolution layers to refine the features. In the original paper, the authors also introduced a weighting scheme that emphasizes the margins between different instances; we do not consider the loss function details in this article. Overall, U-Net is a great model that makes full use of the details from earlier layers.
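
A minimal sketch of one U-Net decoder stage, assuming same-padded convolutions for simplicity (the original paper uses unpadded ones) and illustrative channel counts; it shows the up-sample, concatenate, then two-convolutions pattern:

```python
import torch
import torch.nn as nn

class UNetUpBlock(nn.Module):
    """Up-sample, concatenate the encoder skip tensor along the
    channel dimension, then apply two 3x3 convolutions."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # concatenation, unlike FCN's addition
        return self.conv(x)

block = UNetUpBlock(in_ch=256, skip_ch=128, out_ch=128)
out = block(torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64))
print(out.shape)   # torch.Size([1, 128, 64, 64])
```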

4.1.3 SegNet

The idea of a learnable up-sampling layer is great. However, Badrinarayanan et al. [3] thought that the structure of U-Net was not perfect enough. The major problem is max pooling.

The idea of max pooling

The above image shows the process of max pooling. In each pooling window, we choose the max value as the result, which represents the strongest response to the feature. However, the location information is lost after this operation. Is there some method to solve this location-loss problem?

The alternative is pooling indices: during the max pooling operation, we record the location of the max value in each pooling window. After this recording, we obtain a max-pooling mask. Conversely, we can utilize this mask to fill each max value back into its original corresponding position. By consulting the mask, the location information is not lost during the down-sampling process.

The structure of SegNet

The above image illustrates the structure of SegNet. The curved arrows at the bottom represent the pooling-indices technique. After each stage, we simply consult the mask and fill the max values back into their original positions. Next, we apply convolution layers as U-Net does. At the end, we use a softmax layer to predict the result.
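
PyTorch exposes this mechanism directly; in this small sketch (the tensor size is illustrative), MaxPool2d returns its argmax indices and MaxUnpool2d consults them to put each value back in its original position:

```python
import torch
import torch.nn as nn

pool   = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)
y, indices = pool(x)     # y: (1, 64, 16, 16); indices: the max-pooling mask
z = unpool(y, indices)   # (1, 64, 32, 32): max values restored at their
print(z.shape)           # original positions, zeros everywhere else
```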

4.1.4 DeconvNet

With the pooling-indices mechanism, location detail can be preserved. Noh et al. [4] raised another creative idea. As we know, a convolution layer learns the features of an object; we can regard the kernel as a retina that perceives a specific pattern. By sliding over the image, it produces the response to that specific feature. In other words, the process of convolution is just like "extracting features".

However, can we do this process in reverse order? We can regard the reverse as "rendering" the features back onto a feature map: if convolution maps the image into a low-dimensional feature-response space, the reverse operation maps it back.

The structure of DeconvNet

In the process of DeconvNet, the image goes through VGG, yielding a low-dimensional feature response map. This feature map retains the rough structure of the original image. Next, we render this coarse feature map into category space: through layers of deconvolution and pooling indices, the location and category details are recovered at the end.

You may be curious about a question: if DeconvNet was the first to use deconvolution as the up-sampling method, why are there yellow parts in the FCN image? As far as I know, the original FCN authors did not announce their source code; however, third-party re-implementations can be found. To simplify the "learnable" up-sampling mechanism, they simply use deconvolution as the alternative. As a result, I use yellow blocks to represent this design.

To conclude on DeconvNet: most of its structure is similar to SegNet. The only difference is changing the convolution layers to deconvolution layers during up-sampling.

4.1.5 RedNet

The full name of RedNet is residual encoder-decoder network. After the previous ideas, Mao et al. thought of two points:

  1. Pooling indices are not perfect! With this mechanism, we have to record the location information in an extra mask. This not only costs extra memory but also takes time to compute the max values with a sliding window.
  2. The residual idea was becoming more and more popular. Is it possible to use it to enhance performance on this task?

The structure of RedNet

RedNet solves the previous two problems. To my surprise, the idea and structure of RedNet are quite simple and clear! It is composed only of convolution layers, deconvolution layers, and additions! Rather than losing location information and spending extra memory, Mao et al. got rid of max pooling directly. Throughout the whole process, the size of the feature map does not change at all: the image simply goes through the layers of convolution; next, we apply deconvolutions with element-wise addition of the corresponding earlier tensors.
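
Below is a toy sketch in the spirit of RedNet, with illustrative depth and widths: no pooling anywhere, stride-1 convolutions and deconvolutions only, and symmetric skip additions, including one from the input itself:

```python
import torch
import torch.nn as nn

class RedNetTiny(nn.Module):
    """Toy residual encoder-decoder: convolutions, deconvolutions, and
    element-wise additions only; the feature-map size never changes."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(ch, ch, 3, padding=1), nn.ReLU(True))
        self.dec1 = nn.ConvTranspose2d(ch, 3, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2) + e1    # symmetric skip from the mirrored encoder layer
        return self.dec1(d2) + x   # outermost symmetric skip from the input

net = RedNetTiny()
print(net(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 3, 64, 64])
```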

The two different ways of skip connection

But one question remains: where is the residual concept? From our previous experience, skip connections are ordered, like the upper design in the image above. The design of RedNet's skip connections, however, is like the lower one. Why did the authors make this change?

The two experiments related to skip connection

In fact, the authors ran two experiments. The first examines the performance with and without skip connections; the left chart of the above image shows the result. As you can see, the red line reaches higher values, which shows that skip connections are essential for enhancing performance.

The right side shows the results for the two different skip connections. The blue and green lines are the ordered residual connection (in the style of the original ResNet by He et al.), and the other two lines are the symmetric residual connection. From the chart values, the symmetric assignment indeed achieves higher scores. As a result, the authors adopted the symmetric skip connection.

4.2 Instance Segmentation

Instance segmentation is a combination of two sub-problems, object detection and semantic segmentation, and Mask R-CNN is an algorithm that can achieve instance segmentation.

Mask R-CNN is a combination of Faster R-CNN and FCN, and it uses RoIAlign, which preserves the spatial alignment of features with no loss of information. You can check our Object Detection Papers: Mask R-CNN article for more details.

Part 5: Datasets

1. PASCAL VOC Datasets

2. MS COCO Datasets

Microsoft Common Objects in Context (COCO) is a competition built around Microsoft's 2014 COCO dataset, and it has become the most authoritative and important benchmark in the field of object recognition and detection.

You can check our Object Detection Datasets: COCO article for more details on this dataset.

Part 6: Metrics

1. IoU (Intersection over Union)

To decide whether a prediction is correct with respect to an object, IoU (also called the Jaccard index) is used. It is defined as the area of intersection between the predicted bounding box and the ground-truth bounding box, divided by the area of their union. A prediction is considered a true positive if IoU > threshold, and a false positive if IoU < threshold.
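
A minimal sketch in plain Python, assuming boxes in [x1, y1, x2, y2] format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))   # 25 / 175 ≈ 0.143
```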

2. Precision and Recall

To understand mAP, let's go through precision and recall first. Recall is the true positive rate: of all the actual positives, how many are predicted as true positives. Precision is the positive predictive value: of all the positive predictions, how many are true positives.
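
In formulas, precision = TP / (TP + FP) and recall = TP / (TP + FN). A tiny sketch with toy detection counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# e.g. 7 correct detections, 3 spurious detections, 5 missed objects:
print(precision_recall(tp=7, fp=3, fn=5))   # (0.7, 0.5833...)
```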

3. Mean Average Precision

Average precision (AP) is the area under the precision-recall curve for a single class; mAP is the mean of AP over all classes (and, for COCO, additionally averaged over several IoU thresholds).

Part 7: Practical Advice on How to Improve Object Detection Accuracy

Some references:

Part 8: Reference

Papers

Blogs

Codes

Video
