The Evolution of Object Detection:

From Viola-Jones to Single-Shot Detector

Samuel Samsudin Ng
6 min read · May 18, 2023

Introduction

Computer vision tasks encompass a range of techniques aimed at extracting meaningful information from visual data. Image classification focuses on assigning a single label or category to an entire image, aiming to identify the primary object or scene depicted. Image localization goes a step further by not only classifying the image but also determining the location of the object within it using a bounding box.

Object detection extends localization by identifying multiple objects in an image, providing their class labels and precise bounding box coordinates. Lastly, image segmentation delves into pixel-level analysis by assigning a label to each individual pixel, thereby separating different objects or regions within the image. While image classification provides a high-level understanding, localization, object detection, and segmentation offer increasingly detailed and granular insights, enabling more precise and comprehensive analysis of visual data.

This article focuses on the evolution of object detection: a fundamental task in computer vision that involves identifying and localizing objects within an image or video. It has diverse applications, ranging from autonomous driving and surveillance to image recognition and augmented reality. It is like giving a computer the ability to recognize different things, such as cars, people, or animals, just by looking at pictures or videos.

Computer vision tasks

Over the years, researchers have made significant strides in improving the efficiency and accuracy of object detection models. One major evolution in this field has been the transition from multi-shot detection to single-shot detection techniques. In the next three minutes, we will explore the journey of object detection models and understand the advantages of single-shot detection for real-time applications.

In the Beginning

The first object detection algorithm is difficult to pinpoint, as the field has evolved over several decades with numerous contributions. However, one notable early object detection algorithm is the Viola-Jones algorithm, introduced in 2001 by Paul Viola and Michael Jones. It combined Haar-like features with a boosted cascade of classifiers (commonly known as Haar cascades) to detect objects, particularly faces, in images.

Haar-like features are rectangular features that capture local image variations, such as changes in intensity, texture, or edges. During object detection, these features are used as a set of templates to scan an image at various scales and positions. By evaluating the responses of these features across the image, the algorithm can determine the presence or absence of an object of interest.

For example, a Haar-like feature may consist of two darker rectangles on the left and right with a lighter rectangle between them. This roughly matches the appearance of a face, where the two eye regions are darker than the bridge of the nose. A strong response to this feature therefore suggests the presence of eyes in that image region.

Examples of Haar-like features for face detection.
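To make this concrete, here is a minimal NumPy sketch of how a three-rectangle Haar-like feature response can be computed with an integral image, which is what made Viola-Jones fast enough for real time. The helper names (integral_image, rect_sum, three_rect_feature) and the double weighting of the middle rectangle are illustrative choices, not taken from any specific library.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns: any rectangle sum then
    needs only four array look-ups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w x h rectangle with top-left corner (x, y),
    computed from the integral image ii."""
    a = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    b = ii[y - 1, x + w - 1] if y > 0 else 0
    c = ii[y + h - 1, x - 1] if x > 0 else 0
    d = ii[y + h - 1, x + w - 1]
    return d - b - c + a

def three_rect_feature(ii, x, y, w, h):
    """Three-rectangle feature: the two outer (darker) rectangles minus the
    middle (lighter) one. The middle is counted twice so that a uniform
    image region gives a response of zero."""
    left   = rect_sum(ii, x,         y, w, h)
    middle = rect_sum(ii, x + w,     y, w, h)
    right  = rect_sum(ii, x + 2 * w, y, w, h)
    return (left + right) - 2 * middle

# Toy usage on a random 24x24 "image" (a typical Viola-Jones window size).
img = np.random.rand(24, 24)
ii = integral_image(img)
print(three_rect_feature(ii, x=2, y=8, w=4, h=6))
```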

AdaBoost, a machine learning algorithm, was employed to select and combine these features to create a robust classifier capable of distinguishing between object and non-object regions. The Viola-Jones algorithm offered real-time face detection capabilities, making it practical for various applications, including digital cameras, video surveillance, and human-computer interaction systems. While subsequent algorithms have surpassed its performance, the Viola-Jones algorithm remains a significant milestone in the history of object detection, inspiring further research and advancements in the field.
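For readers who want to try the classic detector themselves, a pre-trained Viola-Jones cascade ships with OpenCV. The sketch below is a minimal example, assuming opencv-python is installed; the input and output file names are placeholders.

```python
import cv2

# Load one of the pre-trained Haar cascades bundled with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("photo.jpg")            # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Scan the image at multiple scales; each detection is returned as (x, y, w, h).
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", img)            # placeholder output image
```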

The Birth of Multi-Shot Object Detection

Multi-shot object detection emerged as a significant advancement in computer vision, driven by the need for more accurate and comprehensive object recognition. While its specific birth is challenging to attribute to a single instance, notable breakthroughs were made in the late 2000s and early 2010s. The development of region-based convolutional neural networks (R-CNN) in 2013 marked a crucial milestone.

R-CNN introduced the idea of using region proposals to identify potential object locations, which were then processed by a convolutional neural network for classification. This pioneering work laid the foundation for subsequent multi-shot object detection models such as Fast R-CNN, Faster R-CNN, and R-FCN. These advancements, fueled by deep learning and improved computing resources, revolutionized the field of object detection, allowing for more accurate and efficient detection of objects in images and videos.

In essence, detection is performed through multiple processing steps (hence multi-shot). In the first step, the algorithm proposes potential object locations, called regions of interest (RoIs), using techniques like selective search or region proposal networks (RPNs). These regions are then passed through a classifier to identify the objects within them. Examples of popular multi-shot detection models include Faster R-CNN (Region-based Convolutional Neural Network) and R-FCN (Region-based Fully Convolutional Networks).

Multi-shot detection involves first proposing regions where objects are most likely present, and then classifying the objects in these areas. [Image taken from 1311.2524.pdf (arxiv.org)]
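As a point of reference, torchvision ships a pre-trained Faster R-CNN that wraps this two-step pipeline (a region proposal network followed by per-RoI classification) behind a single call. The sketch below assumes a recent torchvision (0.13 or later) and uses a placeholder image path.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained two-stage detector: RPN proposals + per-RoI classification and box refinement.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("street.jpg").convert("RGB"))  # placeholder image path

with torch.no_grad():
    # The model takes a list of 3xHxW tensors and returns one dict per image
    # with 'boxes', 'labels', and 'scores'.
    outputs = model([img])[0]

keep = outputs["scores"] > 0.5
print(outputs["boxes"][keep], outputs["labels"][keep])
```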

While multi-shot detection models achieved remarkable accuracy, they suffered from certain limitations. One major concern was speed: since they required two sequential steps, these models were computationally expensive and unsuitable for real-time applications. Additionally, the multi-shot approach often led to redundant computations, since early models such as R-CNN extracted features independently for each RoI. This redundancy added further computational overhead and hindered efficiency.

The Rise of Single-Shot Detection

To overcome the limitations of multi-shot detection, researchers introduced a revolutionary concept known as single-shot detection. Unlike multi-shot detection, it localizes and classifies objects directly in a single processing step. This approach significantly speeds up the detection process, making it far better suited to real-time applications.

One prominent single-shot detection model that gained widespread recognition is called YOLO (You Only Look Once). YOLO divides the input image into a grid and predicts bounding boxes and class probabilities directly from this grid. This grid-based approach allows YOLO to simultaneously detect multiple objects without the need for multiple passes or region proposals.
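The snippet below is a minimal sketch of how such a grid output can be decoded into boxes. The tensor layout, an S×S grid with B boxes of five values each plus C class scores per cell in the spirit of the original YOLO paper, and the confidence threshold are assumptions made for illustration, not a reproduction of any particular implementation.

```python
import torch

def decode_yolo_grid(pred, S=7, B=2, C=20, conf_thresh=0.25):
    """Decode a YOLO-v1-style prediction tensor of shape (S, S, B*5 + C) into
    (x1, y1, x2, y2, confidence, class_id) rows in image-relative coordinates."""
    boxes = []
    for i in range(S):          # grid row
        for j in range(S):      # grid column
            cell = pred[i, j]
            class_id = int(torch.argmax(cell[B * 5:]))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5].tolist()
                if conf < conf_thresh:
                    continue
                # (x, y) are offsets within the cell; (w, h) are relative to the image.
                cx, cy = (j + x) / S, (i + y) / S
                boxes.append((cx - w / 2, cy - h / 2,
                              cx + w / 2, cy + h / 2, conf, class_id))
    return boxes

# Toy usage on a random prediction tensor.
pred = torch.rand(7, 7, 2 * 5 + 20)
print(len(decode_yolo_grid(pred)))
```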

SSD (Single Shot MultiBox Detector) is another popular object detection algorithm, known for its efficiency and accuracy. SSD divides the image into grids at multiple scales and predicts where objects are located within those grids. SSD also predicts the class of each object it finds, so it can distinguish between a cat and a dog, for example. This makes SSD a valuable tool in applications like self-driving cars, security systems, and many other areas where object recognition is important.
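As with Faster R-CNN above, torchvision also provides a pre-trained SSD. A minimal inference sketch, again assuming torchvision 0.13 or later and a placeholder image path, looks like this:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Single-shot detector: one forward pass produces boxes, labels, and scores.
model = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("street.jpg").convert("RGB"))  # placeholder image path
with torch.no_grad():
    out = model([img])[0]   # same output dict format as the two-stage models above

keep = out["scores"] > 0.5
for box, label in zip(out["boxes"][keep], out["labels"][keep]):
    print(int(label), [round(v, 1) for v in box.tolist()])
```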

For interested readers, I will publish another article covering the SSD algorithm in greater detail. We will look at its technical background, network architecture, dataset, applications, as well as some PyTorch examples.

Conclusion

In the beginning, there was Viola-Jones. Driven by the rise of deep learning and the need for more comprehensive object detection, more advanced detection algorithms emerged. The evolution from multi-shot to single-shot detection models represents a significant breakthrough in the field. Single-shot models address the limitations of their predecessors by providing real-time performance, simplicity, and high accuracy. With continuous advancements in deep learning and computer vision, single-shot detection techniques are likely to dominate the future of object detection, enabling a wide range of applications that require efficient and rapid object recognition.

