Anything V3: Text-to-Image Synthesis

Ben
Feb 8, 2023 · 7 min read
Image generated with Anything V3 (created by the author)

In this article, we will focus on the NLP side of the Stable Diffusion model, covering the architecture and the theory behind it, and then build our own model using Anything V3, a fine-tuned variant of Stable Diffusion.

Introduction

In my previous post, Understanding the Diffusion Model and the theory behind it, I briefly introduced the theory and the mathematics of the diffusion model. This time, let's explore it a little deeper: the text-to-image synthesis model known as Stable Diffusion. 🚲 🚲

The architecture of Stable Diffusion (source)

The image above shows the architecture of Stable Diffusion; if you have read my previous post, you will notice that the only difference from the classic diffusion model is that Stable Diffusion injects the condition into each layer using an attention mechanism. This modification makes it possible to control the generation based on the input text or image. 😆

However, we cannot feed the condition (semantic map, text, …) directly into the model, as it would make the training unstable. Thus, in Stable Diffusion 2, the authors used OpenCLIP-ViT/H as the text encoder (τ_θ in the image above) to extract the word representation (repr.).

In fact, there are several ways to extract the word repr.; for example, attention-based models (Transformer, BERT, GPT-3, …) can perform similarly to the Contrastive Language–Image Pre-training (CLIP) method. However, to keep things simple, we will only focus on CLIP and the attention mechanism in this post.

Before we start, let us consider a question.

What is a word representation, and why do we need it in NLP tasks?

In representation learning, there are two kinds of word repr., non-contextual and contextual, both of which aim to capture the syntactic and semantic meaning of words or sentences.

Imagine the input sentence "I bought an apple watch which has an apple icon on the watch band." Clearly, "apple" carries two different meanings here, so it needs two different word repr. during the embedding process.

Example of contextualized word representations: embedding a word with different meanings into different spaces (source)

To prevent our network from being confused by cases like this, we can use another network (e.g. a Transformer or BERT) as an encoder to map the word into different spaces, which describes the word's semantics better.
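To see this in practice, here is a minimal sketch using the Hugging Face transformers library and the bert-base-uncased checkpoint (both assumptions for illustration; they are not part of the Stable Diffusion pipeline) that embeds the two occurrences of "apple" and compares them:

```python
# Minimal sketch: contextual word representations with BERT.
# Assumptions: Hugging Face `transformers` and the `bert-base-uncased` checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "I bought an apple watch which has an apple icon on the watch band"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768): one vector per token

# Locate the two occurrences of the token "apple".
apple_id = tokenizer.convert_tokens_to_ids("apple")
positions = (inputs.input_ids[0] == apple_id).nonzero().flatten()

# The two "apple" vectors differ because each one is conditioned on its context.
emb_1, emb_2 = hidden[positions[0]], hidden[positions[1]]
print(torch.cosine_similarity(emb_1, emb_2, dim=0))  # < 1.0: two different meanings
```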

Contrastive Language–Image Pre-training

There are two disadvantages of current supervised deep learning methods:

  1. Labelled data is labour-intensive and expensive to produce; even once labelled, it can only be used in a narrow field.
  2. A neural network may perform well on training benchmarks, but its performance on real-world data may not be as good (possibly due to slight differences between the distributions of real-world data and training data).
Comparison between ResNet-101 and CLIP (source)

CLIP proposes a method that enables zero-shot learning on new datasets; because it is not specialised for a specific field, the model becomes very robust and has a high degree of generalisation.

CLIP’s training process

The core concept of CLIP is to "map text and images into the same space". To explain this, let's consider a simple example:

  • A sentence: "A Golden Retriever playing a game in the park."
  • A picture of a Labrador Retriever playing fetch in the park.

From the above examples, we can see that whether it is the sentence or the image, we can find the common feature "a retriever playing a game in the park" in both. This common feature is the point to which CLIP tries to map both the text and the image.

The training process of CLIP (source)

As our target is to map both text and images into the same space, we need two encoders to extract the representation from each modality.

  • Text Encoder: a Transformer.
  • Image Encoder: a ResNet-50 as the base architecture (with modifications), or a Vision Transformer (ViT).

We first feed the texts and images into the encoders and get the representations T1, T2, … and I1, I2, …, respectively. Then we compute the inner product between each T and I. (I1, I2, I3, … are different images, since we feed in a batch of data.)

We expect a high inner-product value for correctly paired data (e.g. T1 ⋅ I1, T2 ⋅ I2, the blue diagonal in the image), while for incorrectly paired data we want the value to be as low as possible.

This is a straightforward approach: our training objective is to map text and images with similar content to the same region of the space. The inner product is a common measure of similarity, so a larger inner product means the mapped positions of the text and image are closer.
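Conceptually, the training objective can be sketched as below. This is a simplified from-scratch PyTorch version of the contrastive loss, not CLIP's actual code, and the temperature value is an assumption:

```python
# Minimal sketch of CLIP's contrastive objective. The embeddings would come
# from the text and image encoders; only the similarity/loss step is shown.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Normalise so the inner product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix: entry (i, j) = I_i . T_j
    logits = image_emb @ text_emb.t() / temperature

    # The correct pairs sit on the diagonal (I1.T1, I2.T2, ...).
    labels = torch.arange(logits.size(0))

    # Symmetric cross-entropy: push the diagonal up, the off-diagonal down.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Example with random 512-d embeddings for a batch of 8 image/text pairs.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```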

For text-to-image synthesis, Stable Diffusion usually takes CLIP's text encoder to extract the repr. of the input sentence.
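As a minimal sketch, the text repr. can be extracted with the Hugging Face transformers library; the openai/clip-vit-large-patch14 checkpoint assumed here is the text encoder used by Stable Diffusion 1.x, and the prompt is just an example:

```python
# Minimal sketch: extract the text representation the U-Net will attend to.
# Assumptions: Hugging Face `transformers` and the `openai/clip-vit-large-patch14`
# checkpoint (the text encoder of Stable Diffusion 1.x).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a golden retriever playing in the park"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    # (1, 77, 768): one vector per token, fed to the U-Net's cross-attention layers.
    text_repr = text_encoder(tokens.input_ids).last_hidden_state
```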

Attention Mechanism

Another crucial aspect of Stable Diffusion is the use of cross-attention, which lets the network selectively extract the relevant information from the input representation.

A student taking notes in class (image created by the author)

“During my high school years, I tended to be easily distracted and would often bring out a novel or play games during class time. As a result, my teacher would frequently yell at me to focus on my studies.”

From the above example, you can easily see that the core concept of the attention mechanism is "focus": how to avoid being distracted by things such as games and novels and concentrate on learning.

Attention map of the Transformer (source)

The above image is the attention map of the Transformer, the model introduced in the paper Attention Is All You Need.

In the attention layer, to generate the semantic vector for the word “making”, the model does not consider the entire input sentence but instead selects the most relevant words to represent it. In our case, they are “more”, “difficult”, and “making”.

By applying attention to our Stable Diffusion model, we let each layer in the U-Net select the information it needs instead of using the full sentence representation. This improvement makes the model more stable and also helps us understand what each layer is doing. 😄

Implementation of Attention Mechanism

Extracting information based on attention scores (source)

We first use three fully connected (Dense) layers to obtain the Query, Key and Value vectors for each input word.

The attention mechanism formula (source)

The query (q1) performs an inner product with every key (k1, k2, …); after applying a softmax, the result becomes a probability distribution (the attention map) showing how much attention should be paid to each word in the sentence.

Finally, we multiply the attention map by the values (v1, v2, …) to get the output (b1) of the layer, which incorporates information from the whole sentence.

(The above image only shows the attention mechanism for the first token. In practice, the outputs for all tokens are computed at the same time thanks to parallel computation on the GPU.)
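Here is a minimal, single-head sketch of this computation in PyTorch; the dimensions and layer names are illustrative, not the actual Stable Diffusion implementation:

```python
# Minimal sketch of (single-head) scaled dot-product attention.
import math
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Three fully connected (Dense) layers producing Query, Key and Value.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x, context=None):
        # Self-attention if no context is given; cross-attention otherwise
        # (in Stable Diffusion, `context` would be the CLIP text repr.).
        context = x if context is None else context
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)

        # Attention map: softmax(q . k / sqrt(d_k)), one row per query token.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)

        # Weighted sum of the values: every output mixes information from all tokens.
        return attn @ v

out = Attention(64)(torch.randn(1, 10, 64))  # self-attention over 10 tokens
```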

For a deeper understanding of the attention mechanism, I highly recommend Professor Hung-Yi Lee's lecture videos from National Taiwan University.

Alright, that's everything you need to know about Stable Diffusion. Finally, let's review its training process once more!! 😈

The loss function of Stable Diffusion (source): L = E_{ε(x), y, Є∼N(0,1), t} [ ‖Є − Є_θ(z_t, t, τ_θ(y))‖² ]
  1. x and y represent our input image and text.
  2. ε is the image encoder: z ∼ ε(x); after z undergoes the forward diffusion process, we obtain a noisy latent z_t.
  3. τ_θ represents the text encoder (CLIP); τ_θ(y) is the word repr. of the input sentence.
  4. Є_θ represents our U-Net model, which is used to predict Є (the noise we added to z during the diffusion process).

Finally, we calculate the mean squared error (MSE) between Є and the U-Net's prediction Є_θ(z_t, t, τ_θ(y)) and then update our model. 👏 👏
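Putting the pieces together, a single training step looks roughly like the sketch below; vae, text_encoder, unet and noise_scheduler are placeholders standing in for the real Stable Diffusion components, so this shows only the structure of the step, not the actual training code:

```python
# Structural sketch of one Stable Diffusion training step, matching the loss above.
# `vae`, `text_encoder`, `unet` and `noise_scheduler` are placeholder components.
import torch
import torch.nn.functional as F

def training_step(image, input_ids, vae, text_encoder, unet, noise_scheduler):
    z = vae.encode(image)                         # z ~ ε(x): encode the image into latent space
    cond = text_encoder(input_ids)                # τ_θ(y): word repr. of the prompt

    noise = torch.randn_like(z)                   # Є: the noise we add (the prediction target)
    t = torch.randint(0, 1000, (z.size(0),))      # random diffusion timestep per sample
    z_t = noise_scheduler.add_noise(z, noise, t)  # forward diffusion: noisy latent z_t

    noise_pred = unet(z_t, t, cond)               # Є_θ(z_t, t, τ_θ(y)): U-Net predicts the noise
    return F.mse_loss(noise_pred, noise)          # MSE between predicted and true noise
```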

Implementation

Training a Stable Diffusion model is quite time-consuming, so I chose to use pre-trained weights for this implementation.

If you want to train your own Stable Diffusion model, I have attached my previous work here; you only need to change the input of the attention layer to the CLIP representation.

As I'm using Google Colab, we first need to grant the notebook cloud-storage permissions.

Here are the details of the Stable Diffusion web UI: link

Here are the details of Anything V3: link
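For reference, here is a rough sketch of the Colab cells, assuming the AUTOMATIC1111 stable-diffusion-webui repository linked above; the exact flags, paths and checkpoint download step are assumptions and may differ in your setup:

```python
# Rough Colab sketch (assumptions: the AUTOMATIC1111 stable-diffusion-webui
# repo and its default folder layout; adjust to your own setup).
!git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
%cd stable-diffusion-webui

# Put the Anything V3 checkpoint (downloaded from the page linked above)
# under models/Stable-diffusion/, then launch with a public Gradio link.
!python launch.py --share
```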

Screenshot of the Stable Diffusion web UI

After executing the above commands, you will enter a user interface where you can adjust the image size, specify the number of samples, and set the number of sampling steps for the reverse diffusion process.

Another interesting feature is the availability of a training option, allowing you to modify the network’s layers and train your own custom model. 😙

Result

Reference

  1. High-Resolution Image Synthesis with Latent Diffusion Models
  2. CLIP: Connecting Text and Images
  3. Attention Is All You Need

