How to Connect Text and Images

Part 1: Understanding Zero-Shot Learning


Despite deep learning’s revolutionary impact on computer vision, existing approaches suffer from several significant problems. For example, traditional vision datasets are time-consuming and expensive to build, yet they teach only a small subset of visual concepts.

In this series, we will learn how to connect images and text using a zero-shot classifier, with hands-on examples. This is Part 1; Part 2 covers zero-shot learning with the CLIP model.

Even though Deep Learning is capable of tackling a variety of computer vision problems through supervised learning, it is subject to the following limitations:

In supervised classification, one needs a large number of labeled training instances (for each class) to train a truly robust model.

Furthermore, the trained classifier is limited to classifying instances within the classes represented by the training data and cannot handle novel classes. It is also possible that the necessary data cannot all be collected at once but instead arrives in smaller batches.

Zero-shot learning addresses these limitations. Let’s look at its fundamentals.

What is Zero-Shot Learning?

Zero-shot learning allows a model to recognize what it hasn’t seen before.

This learning strategy refers to the capacity to perform a task without having been given any training examples for it, for instance, using a model trained on cats and dogs to identify birds. The “seen” classes are covered by the training instances, whereas the “unseen” classes have no labeled examples.

Zero shot learning structural view

The general idea of zero-shot learning is to transfer the knowledge already contained in the training instances to the task of testing instance classification. Thus, zero-shot learning is a subfield of transfer learning.

The importance of Zero-Shot Learning

Data Labeling is a labor-intensive job.

The majority of the time spent on any machine learning project goes into data-centric operations.

Obtaining annotations is especially difficult where specialized domain experts are required to do the job. For example, developing biomedical datasets requires the expertise of trained medical professionals, which is expensive.

What’s more, you might lack enough training data for each class, captured under conditions that would help the model reflect real-world scenarios.

For example —

If a new bird species has just been identified, an existing bird classifier needs to generalize to this new species. Perhaps the newly identified species is rare and has only a few instances, while the other bird species have thousands of images per class. As a result, your dataset distribution will be imbalanced, which hinders model performance even in a fully supervised setting.

Methods like unsupervised learning also fail in scenarios where different sub-categories of the same object need to be classified — for instance, trying to identify different breeds of dogs.

Zero-Shot Learning aims to alleviate such problems by performing image classification on the fly on novel data classes (unseen classes) by using the knowledge already learned by the model during its training stage.

How does Zero-Shot Learning work?

In Zero-Shot Learning, the data consists of the following:

  1. Seen Classes: These are the data classes that have been used to train the deep learning model.
  2. Unseen Classes: These are the data classes on which the existing deep model needs to generalize. Data from these classes were not used during training.
  3. Auxiliary Information: Since no labeled instances belonging to the unseen classes are available, some auxiliary information is necessary to solve the Zero-Shot Learning problem. Such auxiliary information should contain information about all of the unseen classes, which can be descriptions, semantic information, or word embeddings.

Example of semantic embedding using an attribute vector

On the most basic level, Zero-Shot Learning is a two-stage process involving Training and Inference:

  1. Training: The knowledge about the labeled set of data samples is acquired.
  2. Inference: The knowledge previously acquired is extended, and the auxiliary information provided is used for the new set of classes.
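
To make the two stages concrete, here is a minimal, hypothetical sketch of attribute-based zero-shot inference. The class names, attribute vectors, and predicted attribute scores are all invented for illustration; in a real system, the attribute predictor is the model trained on the seen classes.

    import numpy as np

    # Auxiliary information: one attribute vector per class
    # (attributes: has_stripes, has_four_legs, can_fly)
    class_attributes = {
        "horse": np.array([0.0, 1.0, 0.0]),  # seen class
        "eagle": np.array([0.0, 0.0, 1.0]),  # seen class
        "zebra": np.array([1.0, 1.0, 0.0]),  # unseen class
    }

    # Attribute scores predicted for a test image; in practice this vector
    # comes from a model trained only on the seen classes
    predicted = np.array([0.9, 0.8, 0.1])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Zero-shot inference: pick the class whose attribute vector is most similar
    best = max(class_attributes, key=lambda c: cosine(predicted, class_attributes[c]))
    print(best)  # -> zebra (an unseen class)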

The two most common approaches used to solve zero-shot recognition problems are:

  1. Classifier-based methods
  2. Instance-based methods

Now let’s implement an example to show how zero-shot learning works. One of the popular methods for zero-shot text classification is Natural Language Inference (NLI).

Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise.”

Examples from http://nlpprogress.com/english/natural_language_inference.html

Using the NLI method, we can treat the sentence to be classified as the premise and construct a hypothesis for each classification label.

E.g., Let’s say we have the sentence “Don’t let yesterday take up too much of today,” and we would like to classify whether this sentence is about

  1. advice
  2. cooking,
  3. dancing

Now, for all three classification labels, we can form three hypotheses:

Hypothesis 1: This text is about advice

Hypothesis 2: This text is about cooking

Hypothesis 3: This text is about dancing
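
In the transformers pipeline used below, these hypotheses are built automatically from a template. Here is a quick sketch of the equivalent construction; the template string matches the wording above (the pipeline’s own default is “This example is {}.”):

    # The zero-shot pipeline turns each candidate label into an NLI hypothesis
    premise = "Don't let yesterday take up too much of today"
    hypothesis_template = "This text is about {}"
    candidate_labels = ["advice", "cooking", "dancing"]

    hypotheses = [hypothesis_template.format(label) for label in candidate_labels]
    print(hypotheses)
    # ['This text is about advice', 'This text is about cooking', 'This text is about dancing']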

We will use the bart-large model fine-tuned on the MultiNLI (MNLI) dataset. If you have a look at the dataset, you will see that it consists of premise-hypothesis pairs labeled with entailment classes. So we will use the facebook/bart-large-mnli model from the Hugging Face Hub.

Diving into the code implementation:

First, install the transformers library.
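
The install command is not shown here; assuming a standard Python environment, something like the following works (PyTorch is needed as the pipeline backend):

    pip install transformers torch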

Then import the pipeline module from transformers. The pipeline takes the name of the task we want to perform, and we can specify the model to use. Here the task is zero-shot classification, and we will use the “facebook/bart-large-mnli” model.
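
A sketch of that step using the standard transformers API:

    from transformers import pipeline

    # Zero-shot classification pipeline backed by the MNLI-trained BART model
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")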

Now we declare the sequence we want to classify and the candidate labels, and pass both to the classifier. It returns a score for each label. Here the travel label gets the highest score, as it best matches the sequence.
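
The exact sequence and labels used in the original screenshot are not shown; the sketch below uses a commonly cited example sentence together with the labels implied by the text (the scores in the comment are illustrative):

    sequence = "One day I will see the world"
    candidate_labels = ["travel", "cooking", "dancing"]

    result = classifier(sequence, candidate_labels)
    print(result)
    # {'sequence': 'One day I will see the world',
    #  'labels': ['travel', 'dancing', 'cooking'],
    #  'scores': [0.98, 0.01, 0.01]}  # 'travel' gets the highest score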

So this zero-shot classification pipeline has performed the whole task: it took the sequence, created a hypothesis for each label, ran the pre-trained “facebook/bart-large-mnli” model, which is trained specifically on this premise-hypothesis classification, and finally produced a score for each label.

Now, it is possible that a sequence belongs to more than one label, which is multi-label classification. In that case, we can pass one more flag to the classifier: multi_label=True.

Here we added another candidate label, exploration, which also matches the spirit of the sentence well.
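
A sketch of the multi-label call, assuming the same sequence plus the extra label:

    candidate_labels = ["travel", "cooking", "dancing", "exploration"]

    # With multi_label=True every label is scored independently,
    # so several labels can score close to 1.0 at the same time
    result = classifier(sequence, candidate_labels, multi_label=True)
    for label, score in zip(result["labels"], result["scores"]):
        print(f"{label}: {score:.2f}")
    # Both 'travel' and 'exploration' receive high scores (values are illustrative)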

Now, what if we want to work with image data with zero-shot learning?

Although there are multiple approaches to zero-shot learning for image datasets, our next article (Part 2) focuses on a recent method called Contrastive Language-Image Pretraining (CLIP), proposed by OpenAI, which has performed well in a zero-shot setting [2].

We will discuss these in our next article.

Applications of Zero-Shot Learning

Finally, let’s have a look at some of the most prominent Zero-Shot Learning applications.

Image classification

Such a system takes visual input (like images) and searches for related information on the web. Even though search engines may be trained on dozens of different categories of images, users can still ask them to look for new things. A Zero-Shot Learning framework is therefore useful for dealing with situations like these.

Semantic segmentation

Example: COVID-19 Chest X-Ray Diagnosis

The COVID-19 infection is characterized by white ground-glass opacities in the lungs of patients, which are captured by radiological images of the chest (X-Rays or CT-Scans). Segmenting the lung lobes out of the complete image can aid in the diagnosis of COVID-19. However, labeled segmented images of such cases are scarce, and thus a zero-shot semantic segmentation can aid in this problem.

Image generation

Example: Text/Sketch-to-Image Generation

Several deep learning frameworks generate realistic images from only text or sketch inputs. Such models frequently have to deal with previously unseen classes of data, and zero-shot frameworks have been devised for both text-to-image and sketch-to-image generation (see the references below).

Object detection

Example: Autonomous vehicles

There is a need for detecting and classifying objects on the fly in autonomous navigation applications to decide what actions to take. For example, seeing a car/truck/bus means it should avoid them, a red traffic light means to stop before the stop line, etc.

Bounding Box annotations for object detection

Detecting novel objects and knowing how to respond to them is essential in such cases, and thus a Zero-Shot backbone framework is helpful.

Moreover, there are more applications of zero-shot learning like audio processing, resolution enhancement, action recognition, style transfer, and so on.

References:

  1. https://openaccess.thecvf.com/content/CVPR2022/papers/Tewel_ZeroCap_Zero-Shot_Image-to-Text_Generation_for_Visual-Semantic_Arithmetic_CVPR_2022_paper.pdf
  2. https://openaccess.thecvf.com/content_ECCV_2018/papers/Sasikiran_Yelamarthi_A_Zero-Shot_Framework_ECCV_2018_paper.pdf
  3. https://arxiv.org/pdf/2203.01386v2.pdf
  4. https://arxiv.org/abs/2102.12092
  5. https://www.v7labs.com/blog/zero-shot-learning-guide
