Re-imagining Glamour Photography with Generative AI

Diffusers + Abyss Orange Mix 2 + Google Colab

Y. Natsume
14 min read · Jun 30, 2023

In one of my past lives I dabbled quite a bit in glamour photography. The entire process of conceptualizing the image, directing the model, and post-processing the photograph was a challenge I really relished.

Sometimes I made things even more challenging by shooting on film, which meant that I could only view my photographs days after the shoot and could not easily retake any bad shots. Heck, I even had plans for daguerreotypes to push the challenge further!

Life, however, took me down a different path (partly thanks to Fujifilm discontinuing various films), although I never quite forgot about glamour photography.

Synthetic illustration created using Abyss Orange Mix 2 and Diffusers. Image created by the author.

Despite not having done any serious photography for some years now, I too was caught up with the rest of the world when image generative AI models such as DALL-E, Midjourney or Stable Diffusion were released. I immediately set about playing with them to better understand what they can do, and to see if I could use them as part of my post-processing workflow (I still have a huge backlog of post-processing).

I was extremely surprised and pleased by the capabilities of these image generative AI models, and also very thankful that life decided to turn me to deep learning instead! In this article, I share my experiences and demonstrate how to use pre-trained state-of-the-art diffusion models to create synthetic (glamour photography-like) images!

Stable Diffusion — Generative AI for Paupers, Misers and Cheapskates

Many state-of-the-art generative AI models are not open source or free to use. Thankfully, there are various open-source, free-to-use options floating around. For image generation there is Stable Diffusion, which can run on a typical workstation equipped with a GPU with 8 GB of memory.

For paupers, misers or cheapskates who refuse to fork out money even for a typical workstation (like me!), Google Colab is a great alternative platform — why pay money when you can sell your soul! Check my other article on how to run models on Colab’s GPU!

In this section we present the main details of Stable Diffusion, summarized from the original paper [1], which I encourage you to read if you have the time.

Diagram showing how the Stable Diffusion model training occurs. Image taken from https://arxiv.org/pdf/2112.10752.pdf.

Hallucinating Images by Denoising Noise

Stable Diffusion does not create images out of thin air (this is AI, not magic!). Instead, Stable Diffusion creates images by iteratively denoising a noisy image (bottom process in the diagram above) using U-nets.

Starting from pure noise, Stable Diffusion hallucinates details here and there in each iteration during the denoising process, and over sufficient iterations, an entire image is hallucinated out of pure noise!

During training, a reverse process is involved — images are diffused to pure noise (top process in the diagram). This is where the “diffusion” in Stable Diffusion comes from. The denoising process can be thought of as reversing the diffusion of an image into noise!

In addition to the initial pure noise latent image, an input prompt consisting of tokenized and encoded text is used to guide the denoising process — we need to tell the model what to draw!

Latent Space

Stable Diffusion works in latent space instead of pixel space — this is what allows Stable Diffusion to work on a typical workstation. Pixel space is extremely large, and requires powerful computing resources to process.

On the other hand, latent space is essentially a compressed space perceptually equivalent to the original pixel space. Latent space is much smaller than pixel space while still containing the important perceptual details. For example, a 512 × 512 × 3 pixel image is compressed into a 64 × 64 × 4 latent representation, roughly a 48-fold reduction in the number of values to process.

However, this also means that the outputs of Stable Diffusion must be decoded from latent space back to pixel space. This can be performed using an auto-encoder, for instance (remember that an auto-encoder is used to learn efficient low dimensional embeddings of some high dimensional space).

Denoising Process Summary

Text from a prompt is tokenized and encoded numerically. A random image of pure noise is initialized in latent space. This noisy latent image is iteratively denoised using U-nets while being guided by the prompt. After a set number of iterations, the denoised latent image is converted to pixel space.

Python Libraries

Although there are many UIs available for use such as automatic1111 or Swift Diffusers for Macs, we will focus on Python libraries in this article.

Stable Diffusion is available through several Python libraries such as KerasCV or Diffusers — for this article we will use Diffusers and pre-trained models available on HuggingFace.

Diffusers

Diffusers is a library that provides diffusion pipelines and access to various diffusion models on HuggingFace. Although usable “out-of-the-box”, Diffusers also allows users to customize each of the seven components of the diffusion pipeline:

  • Feature Extractor — extracts features from generated images to be used as inputs for the safety checker.
  • Safety Checker — a classification model that screens outputs for potentially harmful content, and replaces the output with a black image if such content is detected.
  • Scheduler — essentially an ODE integration scheme that defines how the U-net output from one iteration is used to update the latents for the next iteration.
  • Text Encoder — encodes the prompt text into numerical features.
  • Tokenizer — tokenizes the prompt text. Together with the text encoder, it is used to guide the denoising process as described in the section above.
  • U-net — iteratively denoises noisy images in latent space. This is the main part of Stable Diffusion described in the section above.
  • Variational Auto-Encoder — generates the final output image by decoding the latent space images to pixel space.

Most diffusion models on HuggingFace provide all components required to run the entire diffusion pipeline, although it is possible to replace default components, and even mix and match compatible components from different models. Your mileage might vary though.
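To make the division of labour among these components concrete, here is a simplified sketch of the text-to-image loop written against the component names above. It omits classifier-free guidance (which is where the guidance scale enters) and several other details handled by the real StableDiffusionPipeline, so treat it as an illustration rather than the library’s actual implementation.

import torch

# Simplified sketch of the text-to-image loop using a loaded pipeline's
# components. Classifier-free guidance and device/dtype handling are omitted;
# this assumes the pipeline runs in float32 on the CPU.
def sketch_generate(pipe, prompt, steps = 50, height = 512, width = 512):
    # 1. Tokenize and encode the prompt into numerical features.
    tokens = pipe.tokenizer(prompt, padding = "max_length", truncation = True,
                            max_length = pipe.tokenizer.model_max_length,
                            return_tensors = "pt")
    text_embeddings = pipe.text_encoder(tokens.input_ids)[0]

    # 2. Initialize a pure noise image in latent space (8x smaller per side).
    latents = torch.randn(1, pipe.unet.config.in_channels,
                          height // 8, width // 8)
    latents = latents * pipe.scheduler.init_noise_sigma

    # 3. Iteratively denoise the latents, guided by the text embeddings.
    pipe.scheduler.set_timesteps(steps)
    for t in pipe.scheduler.timesteps:
        latent_input = pipe.scheduler.scale_model_input(latents, t)
        noise_pred = pipe.unet(latent_input, t,
                               encoder_hidden_states = text_embeddings).sample
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

    # 4. Decode the final latents back to pixel space with the VAE.
    # 0.18215 is the latent scaling factor used by Stable Diffusion v1 models.
    image = pipe.vae.decode(latents / 0.18215).sample
    return image  # Tensor in [-1, 1]; still needs conversion to a PIL image.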

Package Versions

I used the following package versions. Note that the API changes between versions, so you might need to modify some of the function arguments.

!pip install -q transformers==4.25.1
!pip install -q accelerate==0.15.0
!pip install -q diffusers==0.11.1
!pip install -q huggingface_hub==0.12.0
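The code snippets in the rest of the article also assume that torch (pre-installed on Colab) and diffusers have been imported:

import torch
import diffusers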

Abyss Orange Mix 2 — High Quality Highly Realistic Illustrations

Now that we have described how Stable Diffusion works as well as the main components of the Diffusers library, it is time to move on to the models. As described earlier, U-nets are used to create synthetic images by denoising noisy images. Depending on how the U-nets are trained and conditioned, the final outputs will be very different.

There are a wide variety of models to choose from. For example, the vanilla version released by Stability AI is able to generate a wide variety of images such as horse-riding astronauts. For the rest of this article, we will use Abyss Orange Mix 2 (AOM2) — one of the Orange Mix models provided under the CreativeML OpenRAIL-M license, designed to generate high-quality, highly realistic illustrations (not photographs).

We use AOM2 instead of a photo-realistic model partly because the original aim was to re-imagine glamour photography, not to re-create or replicate it!

Pre-Trained AOM2 on HuggingFace

The pre-trained weights for AOM2 can be downloaded directly from HuggingFace — the downloaded contents of the seven components should be placed in a local directory with the following structure. Note that in addition to the seven components above, there is also a model_index.json file containing general information about the model.

abyss_orange_mix_2
├── feature_extractor
│   └── preprocessor_config.json
├── model_index.json
├── safety_checker
│   ├── config.json
│   └── pytorch_model.bin
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   ├── config.json
│   └── pytorch_model.bin
├── tokenizer
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── unet
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── vae
    ├── config.json
    └── diffusion_pytorch_model.bin
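If you prefer to script the download instead of fetching the files by hand, huggingface_hub’s snapshot_download can pull a diffusers-format repository into the local cache. The repo id below is a placeholder, not a real repository — substitute the repository that actually hosts the diffusers-format AOM2 weights.

from huggingface_hub import snapshot_download

# Placeholder repo id -- replace it with the repository hosting the
# diffusers-format AOM2 weights you want to use.
local_path = snapshot_download(repo_id = "your-namespace/abyss-orange-mix-2-diffusers")
print(local_path)  # Contains model_index.json and the seven component folders.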

Loading the Downloaded AOM2 Components

The downloaded AOM2 components can be loaded into a Diffusers pipeline using the Python function below. The function loads the pre-trained model, and allows the user to replace the scheduler and the safety checker if so desired.

As mentioned earlier, schedulers define how the outputs from the previous iteration are updated. Therefore, different schedulers will produce different outputs for the same model and prompts. We will demonstrate this effect very shortly.

# Load (text to image) diffuser pipeline downloaded from HuggingFace.
# Currently supports only changing the scheduler and safety checker.
def load_pipeline(model_dir,
                  scheduler = None,
                  safety_checker = False,
                  device_name = torch.device("cpu"),
                  torch_dtype = torch.float32):
    """
    Loads a pre-trained diffusion pipeline downloaded from HuggingFace.

    Arguments
        model_dir: str
            Path to the downloaded model directory.
        scheduler: str or None
            Scheduler to use. Currently only "EDS", "EADS" or "DPMSMS"
            are supported. If None, the default scheduler will be used.
        safety_checker: bool
            Turn on/off the model safety checker.
        device_name: torch.device
            Device name to run the model on. Run on GPUs!
        torch_dtype: torch.float32 or torch.float16
            Dtype to run the model on. Choice of 32 bit or 16 bit floats.
            16 bit floats are less computationally intensive.
    Returns
        pipe: StableDiffusionPipeline
            Loaded diffuser pipeline.
    """
    # Load the pre-trained diffusion pipeline.
    pipe = diffusers.StableDiffusionPipeline.from_pretrained(
        model_dir,
        torch_dtype = torch_dtype
    )

    # Change the scheduler to either EDS, EADS or DPMSMS.
    # TODO: add more options for the scheduler.
    if scheduler in ["EulerAncestralDiscreteScheduler", "EADS"]:
        pipe.scheduler = diffusers.EulerAncestralDiscreteScheduler.from_config(
            pipe.scheduler.config
        )
    elif scheduler in ["EulerDiscreteScheduler", "EDS"]:
        pipe.scheduler = diffusers.EulerDiscreteScheduler.from_config(
            pipe.scheduler.config
        )
    elif scheduler in ["DPMSolverMultistepScheduler", "DPMSMS"]:
        pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(
            pipe.scheduler.config
        )
    # Else the default scheduler is used.

    # Disable the safety checker if so desired.
    if safety_checker is False:
        pipe.safety_checker = lambda images, **kwargs: [
            images, [False] * len(images)
        ]

    # Move the pipeline to the GPU if available.
    pipe = pipe.to(device_name)

    return pipe

Running the Loaded Pipeline

The loaded diffusion pipeline can be run using the Python function below. In addition to the loaded pipeline, the prompt and the negative prompt, there are a few more important parameters to note.

  • First we need to specify the number of denoising steps to take — 50 is a good starting point and you will definitely want to play around with this value.
  • Abyss Orange Mix 2 also requires that the output image dimensions be multiples of 8. This is because the model compresses each spatial dimension by a factor of 8 when moving between pixel space and latent space.
  • Scale controls how closely the model follows the prompts. A larger scale means that the model will follow the prompts as closely as possible at the cost of image quality.
  • Finally, we need a seed to initialize the noisy image.
# Run diffuser pipeline.
def run_pipe(pipe,
             prompt,
             negative_prompt = None,
             steps = 50,
             width = 512,   # Multiple of 8.
             height = 704,  # Multiple of 8.
             scale = 12,
             seed = 123456789,
             n_images = 1,
             device_name = torch.device("cpu")):
    """
    Runs a loaded diffusion pipeline to generate a list of images.

    Arguments
        pipe: StableDiffusionPipeline
            Stable diffusion pipeline from load_pipeline.
        prompt: str
            Prompt used to guide the denoising process.
        negative_prompt: str or None
            Negative prompt used to guide the denoising process.
            Used to restrict the possibilities of the output image.
        steps: int
            Number of denoising iterations.
        width, height: int
            Dimensions of the output image. Must be multiples of 8.
        scale: float
            Guidance scale which controls how closely the model follows
            the prompt. Higher values follow the prompt more closely.
        seed: int
            Random seed used to initialize the noisy latent image.
        n_images: int
            How many images to produce.
        device_name: torch.device
            Device name to run the model on. Run on GPUs!
    Returns
        image_list: list
            List of output images.
    """
    if width % 8 != 0:
        print("Image width must be a multiple of 8... adjusting!")
        width = int(width / 8) * 8
    if height % 8 != 0:
        print("Image height must be a multiple of 8... adjusting!")
        height = int(height / 8) * 8

    # Using a torch.Generator allows for deterministic behaviour.
    gen = torch.Generator(device = device_name).manual_seed(seed)
    image_list = []

    with torch.autocast("cuda"):
        for i in range(n_images):
            image = pipe(prompt,
                         height = height,
                         width = width,
                         num_inference_steps = steps,
                         guidance_scale = scale,
                         negative_prompt = negative_prompt,
                         generator = gen)

            image_list = image_list + image.images

    return image_list

The load_pipeline function defined earlier allows for the selection of three different schedulers. If a scheduler is not specified, the default PNDMScheduler will be used. Using different schedulers tends to change the final output image.

Prompt Engineering

Prompts are required to tell Stable Diffusion what sort of image to create — Stable Diffusion can’t read your mind!

In its simplest form, a prompt is input into Abyss Orange Mix 2 as a series of descriptive words separated by commas. An example prompt for an image of a woman wearing a black elegant dress standing in a ballroom could be:

prompt = """woman,black elegant dress,standing,ballroom"""

Wrapping descriptions in parentheses forces Stable Diffusion to pay more attention to those details. Increasing the number of parentheses increases the amount of attention Stable Diffusion pays to those details. For example, to place emphasis on ballroom and even more emphasis on black elegant dress:

prompt = """woman,((black elegant dress)),standing,(ballroom)"""

In addition to prompts, the pipeline also allows for negative prompts. Negative prompts are perhaps even more important than the prompts — they tell Stable Diffusion what NOT to draw.

For example, the documentation for the Orange Mix models suggests using simple negative prompts to steer the model away from low quality outputs, such as:

negative_prompt = """(worst quality, low quality:1.4)"""

Prompt engineering is an art in itself, and the possibilities for prompts and negative prompts are endless. Thankfully, online communities such as Reddit provide a place for everyone to share their results and exchange ideas. There are also other resources such as stable-diffusion-art or this medium post by Umberto Grando.
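As a reference for the code in the next section, here is a made-up prompt pair in the comma-separated style described above (the prompts actually used for the images in this article are not shown, as explained below):

# A made-up example pair -- not the prompts used for the images in this article.
prompt = """woman,((black elegant dress)),standing,(ballroom),soft lighting"""
negative_prompt = """(worst quality, low quality:1.4),(badly drawn hands)"""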

Running the AOM2 Pipeline

First of all, load the model files downloaded from HuggingFace. We specify the directory of the downloaded model files and, for the time being, do not specify a scheduler to use. We also leave the safety checker off, and indicate that we would like to run the model on the GPU using float32 precision.

# Load the pipeline from the downloaded files.
# Note that by setting scheduler = "EDS", "EADS" or "DPMSMS"
# we can use different schedulers.
pipe = load_pipeline(model_dir = "abyss_orange_mix_2",
                     scheduler = None,
                     safety_checker = False,
                     device_name = torch.device("cuda"),
                     torch_dtype = torch.float32)

We will not reveal the actual prompts used in this article — the hunt for prompts which result in images that you really like is part of the experience!

Once you have come up with some prompts, the loaded pipeline can be executed to generate a list of synthetic images. The images are output as PIL.Image objects and can be easily saved or displayed.

# Starting random seed value.
i = 0

# 10 seeds = 10 different images.
seeds = list(range(i, i + 10))

# prompt and negative_prompt are the strings you have engineered yourself.
device_name = torch.device("cuda")

images = []
for seed in seeds:
    images += run_pipe(
        pipe,
        prompt,
        negative_prompt,
        steps = 50,
        width = 512,   # 512 = 64 * 8
        height = 768,  # 768 = 96 * 8
        scale = 12,
        seed = seed,
        n_images = 1,
        device_name = device_name
    )
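Since the outputs are PIL.Image objects, saving them to disk is a one-liner (the file names below are arbitrary):

# Save each generated image to disk.
for n, image in enumerate(images):
    image.save(f"aom2_seed_{n:02d}.png")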

Note that the output dimensions have an aspect ratio of 1.5 (768 / 512), which is the aspect ratio of 35 mm camera film! Other commonly used film formats are 6 × 6 and 6 × 7 (120 film) or 4 × 5 (large format film).

Also, note that how large your output images can be depends on your GPU’s memory. Less capable GPUs will not be able to handle large images.

Effects of Different Schedulers

As mentioned earlier, schedulers define how the U-net outputs from the previous iteration are updated. In the example output images below, the overall picture generally does not change, although small details within it do.

The default PNDM scheduler has not drawn the face too well, and has drawn creases into the dress. On the other hand, the EDS and DPMSMS schedulers have produced much better drawings of the face and smoothed out the dress creases. The DPMSMS scheduler has also rendered the tree on the right as a street lamp instead.

Effects of using different schedulers. Image created by the author.

For other images, using different schedulers can produce very different pictures. For example, while not displayed above, the EADS scheduler produced a completely different image — the dress was completely black and of a different design.
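One simple (if memory-hungry) way to set up such a side-by-side comparison is to reload the pipeline once per scheduler and generate with an identical prompt and seed, reusing the load_pipeline and run_pipe functions defined above. The exact settings used for the images in this article are not shown, so treat this as a sketch.

# Generate the same image with each scheduler for a side-by-side comparison.
# None falls back to the default PNDMScheduler.
scheduler_names = [None, "EDS", "EADS", "DPMSMS"]

comparison = {}
for name in scheduler_names:
    pipe = load_pipeline(model_dir = "abyss_orange_mix_2",
                         scheduler = name,
                         safety_checker = False,
                         device_name = torch.device("cuda"),
                         torch_dtype = torch.float32)
    comparison[name] = run_pipe(pipe,
                                prompt,
                                negative_prompt,
                                steps = 50,
                                width = 512,
                                height = 768,
                                scale = 12,
                                seed = 0,
                                n_images = 1,
                                device_name = torch.device("cuda"))[0]

    # Free GPU memory before loading the next pipeline.
    del pipe
    torch.cuda.empty_cache()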

The Problem with Hands

Despite the capabilities of generative AI models, they have a very hard time rendering hands properly. The vast majority of images generated contained badly drawn hands: too many fingers, too few fingers, mangled hands, anatomically impossible hands and so on. One possible reason for this is that realistic looking hands are hard to draw even for humans!

Badly drawn hands are the norm! Image created by the author.

In many cases the image is fixable, and image editing software such as GIMP can be used to correct the errors. In fact, I very rarely get an image that is usable directly from Stable Diffusion — the end result usually involves some amount of post-processing and editing.

Prospects

What we demonstrated in this article is extremely superficial, and the possibilities for Stable Diffusion are endless indeed. For example, embeddings can be used together with prompts for more control over how the synthetic images are generated.

The output images are generally low resolution (under 1000 pixels on a side), mainly due to GPU memory limitations. This can however be overcome by using Real-ESRGAN to upscale the image to a much larger resolution. Similarly to Stable Diffusion, Real-ESRGAN hallucinates details from a small image to create a larger one. This is AI, not magic!

The output images can also be further post processed using inpainting. This can help to fix errors such as badly drawn hands, or to add more details or elements to the image.
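Diffusers ships a dedicated inpainting pipeline for this. The sketch below uses the publicly available runwayml/stable-diffusion-inpainting checkpoint as an example (the AOM2 directory used above is not an inpainting model), and the image and mask file names are placeholders.

from PIL import Image

# Inpainting sketch: white areas of the mask are regenerated, black areas kept.
# The checkpoint and the image/mask file names are illustrative placeholders.
inpaint_pipe = diffusers.StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype = torch.float16
).to("cuda")

result = inpaint_pipe(prompt = "detailed hands",
                      image = Image.open("aom2_seed_00.png"),
                      mask_image = Image.open("hand_mask.png"),
                      num_inference_steps = 50).images[0]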

Recently, LoRAs have been widely adopted within the Stable Diffusion community to make model fine-tuning a less painful task.

In this article we demonstrated what is known as text-to-image generation. There are other pipelines as well, such as image-to-image generation. Image-to-image pipelines allow you to use an input image as the initial condition instead of pure noise. This is the main pipeline I use for my glamour photography post-processing backlog!
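A minimal sketch of the image-to-image pipeline, reusing the same local AOM2 directory; the input file name is a placeholder, and strength controls how far the output is allowed to drift from the input.

from PIL import Image

# Image-to-image sketch: start the denoising from an existing photograph
# instead of pure noise. The input file name is a placeholder.
img2img_pipe = diffusers.StableDiffusionImg2ImgPipeline.from_pretrained(
    "abyss_orange_mix_2",
    torch_dtype = torch.float32
).to("cuda")

result = img2img_pipe(prompt = prompt,
                      negative_prompt = negative_prompt,
                      image = Image.open("glamour_scan.jpg").convert("RGB"),
                      strength = 0.6,  # 0 = keep the input, 1 = ignore it.
                      guidance_scale = 12).images[0]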

Finally, there are plenty of pre-trained diffusion models out there in the wild — way too many to be listed here. HuggingFace has some models which can be used with the Diffusers library. Other websites such as CivitAI have plenty more (mainly for use with automatic1111)!

Before I end, I can’t stress enough how truly inspiring the Reddit community is. It is a great place to learn from the successes of others and, more importantly, from their failures! Thank you for reading!

References

[1] High-Resolution Image Synthesis with Latent Diffusion Models

[2] Learning Transferable Visual Models From Natural Language Supervision

[3] https://huggingface.co/docs/diffusers/index

[4] https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img

[5] https://keras.io/guides/keras_cv/generate_images_with_stable_diffusion/

[6] https://getimg.ai/guides/interactive-guide-to-stable-diffusion-guidance-scale-parameter

[7] Pseudo Numerical Methods for Diffusion Models on Manifolds

[8] Elucidating the Design Space of Diffusion-Based Generative Models

[9] DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps

[10] DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

[11] https://www.reddit.com/r/sdforall/comments/10woqgs/stable_diffusion_prompt_a_definitive_guide/

[12] https://www.reddit.com/r/StableDiffusion/comments/11gzbc0/i_did_the_work_so_you_dont_have_to_my_quick/

[13] https://huggingface.co/docs/diffusers/training/lora
