A Visual Journey through the Loss Function

Simon Palma
7 min read · Jun 4, 2023

Introduction

Visualizing the loss function and the trajectory taken to reach its minima can provide valuable insights into the learning process: a deeper understanding of the function’s behavior and, at the same time, a way to spot potential challenges and pitfalls along the optimization path.

Visual representations allow us to observe the contours of the loss function, revealing areas of high or low error. Moreover, by visualizing the trajectory traveled during optimization, we gain insight into the paths that a neural network navigates in search of the minima.

Objective

In this article, my objective is to better understand the shape of the loss function and the optimization process by interpreting their visual representations, and through that interpretation, to develop a deeper comprehension of neural network training.

My initial objective was to observe how visual representations depict the dynamics of various optimization algorithms during neural network training. However, during the process, my focus shifted to visualizing the behavior of a specific optimization algorithm: mini-batch gradient descent. Comparisons among different optimizers will be explored in a future article.

Motivation

I was exploring the optimization process during neural network training and the role of different optimizers in finding the minima of the loss function. While it seemed quite clear at first, after attempting to implement these concepts myself, I realized that my understanding was not complete.

Many sources assumed this knowledge as common understanding, leaving me with a rather blurry perception of how the visual illustrations were created. Thus, I decided to dig into how to produce a visualization that revealed the shape of the loss landscape for an example of a neural network.

Disclaimer

The main reason for writing this article is to deepen my own understanding through these visual insights. However, in the process, it could also serve as a useful resource for others who may be encountering similar challenges in comprehending the optimization process.

Dataset and Model

I utilized a dataset called ‘Cat vs non-Cat’ for my task. The objective was to classify images as either containing a cat or not, making it a classic binary classification problem. The dataset comprised 209 color images, each of size 64 x 64 pixels with 3 color channels. To prepare the data for training, I performed two transformations: a) flattening each 64 x 64 x 3 sample into a 12288-dimensional vector, and b) scaling the original pixel values from the range [0, 255] to [0, 1].
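The article doesn’t include the loading or preprocessing code, but the two transformations are simple. Here is a minimal sketch, assuming the images and labels are already available as NumPy arrays (images of shape (209, 64, 64, 3) with values in [0, 255], labels of shape (209,)):

```python
import numpy as np

def preprocess(images: np.ndarray) -> np.ndarray:
    # a) Flatten each 64 x 64 x 3 image into a 12288-dimensional vector
    flat = images.reshape(images.shape[0], -1)   # shape: (209, 12288)
    # b) Scale pixel values from [0, 255] to [0, 1]
    return flat.astype(np.float32) / 255.0
```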

The architecture used for training was relatively simple: a 4-layer MLP with an input size of 12288 and layer sizes of 20, 7, 5, and 1. This means that the weight matrix W⁽¹⁾ had shape [20 x 12288], W⁽²⁾ had shape [7 x 20], and so on. All layers except the output layer were followed by a ReLU activation function, while the activation function for the output layer was a sigmoid. A positive prediction (indicating the presence of a cat) was made when the output neuron’s value was greater than or equal to 0.5, and a negative prediction (indicating a non-cat image) was made otherwise.
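The article doesn’t show the model code; a sketch of this architecture in PyTorch (the framework choice is my assumption) could look like this:

```python
import torch
import torch.nn as nn

# 4-layer MLP: 12288 -> 20 -> 7 -> 5 -> 1, ReLU on hidden layers, sigmoid output
model = nn.Sequential(
    nn.Linear(12288, 20), nn.ReLU(),    # W(1): [20 x 12288]
    nn.Linear(20, 7), nn.ReLU(),        # W(2): [7 x 20]
    nn.Linear(7, 5), nn.ReLU(),         # W(3): [5 x 7]
    nn.Linear(5, 1), nn.Sigmoid(),      # W(4): [1 x 5]
)

def predict(x: torch.Tensor) -> torch.Tensor:
    # Positive (cat) prediction when the output neuron is >= 0.5
    return (model(x) >= 0.5).float()
```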

I opted to keep both the architecture and the training process as simple as possible. Mini-batch gradient descent served as the optimization algorithm, without dropout or any other form of regularization. I focused on incorporating only the essential elements necessary for achieving a significant reduction in the loss function at each training epoch, resulting in decent model performance.
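A minimal sketch of such a training loop, continuing from the previous snippets (the batch size, learning rate, and number of epochs here are assumptions, not the values used in the article):

```python
import torch

loss_fn = torch.nn.BCELoss()              # binary cross-entropy on sigmoid outputs
lr, batch_size, epochs = 0.01, 64, 100    # assumed hyperparameters

X = torch.from_numpy(preprocess(images))                      # (209, 12288)
y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)    # (209, 1)

for epoch in range(epochs):
    perm = torch.randperm(X.shape[0])     # reshuffle the samples every epoch
    for start in range(0, X.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        loss = loss_fn(model(X[idx]), y[idx])
        model.zero_grad()
        loss.backward()
        with torch.no_grad():             # plain gradient descent step, no momentum
            for p in model.parameters():
                p -= lr * p.grad
```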

Loss Function Visualization

Before visualizing the shape of the loss function in the mentioned network, I will detail the process of achieving this visualization. To accomplish this, I will explain the calculations involved using a simpler network as an example. Let’s assume we aim to observe the shape of the loss function in the following NN:

Example of NN

The weights will have a shape as depicted in the illustration below.

Weights of example NN

The paper Visualizing the Loss Landscape of Neural Nets provides a brief description of the approach I implemented for visualization. To plot the contour of the loss function, we begin by selecting a center point W*, which represents the set of weights after the full training. Additionally, we choose two direction vectors, δ and η.

Direction vectors

Furthermore, two random sets with the same size as W* are created. The values of these sets have magnitudes similar to the weights in W*. I’ll call these sets 𝛼 and 𝛽.

Random sets
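One way to build such sets, continuing the sketch above, is to draw one random tensor per weight tensor in W* and rescale it to a comparable norm (the exact scaling used in the article is not specified, so this rescaling is an assumption):

```python
import torch

# W*: the weights after full training
w_star = [p.detach().clone() for p in model.parameters()]

def random_set(reference):
    # One random tensor per weight tensor, rescaled to a similar magnitude
    out = []
    for w in reference:
        d = torch.randn_like(w)
        d *= w.norm() / (d.norm() + 1e-10)
        out.append(d)
    return out

alpha = random_set(w_star)   # 𝛼
beta = random_set(w_star)    # 𝛽
```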

The idea is to plot a function of the following form:

Loss function form

As mentioned earlier, δ and η are direction vectors with values ranging within [-1, 1]. To construct the plot, we can sample a range of values within this interval. For instance, if we choose 150 values for each direction vector, we can generate a total of 22,500 different directions (representing all possible combinations of δ and η). Each direction is associated with one of the random sets that will be used to shift our initial set, W*.

Another way to see it is that we will compute the value of the loss function for W* + change, for a lot of change configurations, where change is 𝛼δ + 𝛽η.

Change set

We will be able to compute multiple loss values, with each value being associated with its corresponding (δ, η) direction.

Loss function matrix form
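Continuing the sketch, the loss grid could be filled in by shifting the trained weights for every (δ, η) combination and evaluating the loss on the training set (variable names follow the snippets above):

```python
import numpy as np
import torch

deltas = np.linspace(-1, 1, 150)
etas = np.linspace(-1, 1, 150)
loss_grid = np.zeros((len(deltas), len(etas)))   # one loss value per (δ, η) pair

with torch.no_grad():
    for i, d in enumerate(deltas):
        for j, e in enumerate(etas):
            # Shift every weight tensor: W* + 𝛼·δ + 𝛽·η
            for p, w, a, b in zip(model.parameters(), w_star, alpha, beta):
                p.copy_(w + d * a + e * b)
            loss_grid[i, j] = loss_fn(model(X), y).item()
    # Restore the trained weights afterwards
    for p, w in zip(model.parameters(), w_star):
        p.copy_(w)
```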

Going back to my “Cat vs non-Cat” dataset, I applied the process just described to the NN trained on this data.

After calculating the loss values for all possible direction combinations (a total of 22,500), while using the same random sets 𝛼 and 𝛽, we can plot a contour figure and visualize its 2D and 3D representations.

2D (left) and 3D (right) plot of part of the loss function contour
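A sketch of how the two views might be produced from the grid above with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

D, E = np.meshgrid(deltas, etas, indexing="ij")   # matches loss_grid[i, j] indexing

fig = plt.figure(figsize=(10, 4))

ax2d = fig.add_subplot(1, 2, 1)                   # 2D contour view
ax2d.contour(D, E, loss_grid, levels=30)
ax2d.set_xlabel("δ")
ax2d.set_ylabel("η")

ax3d = fig.add_subplot(1, 2, 2, projection="3d")  # 3D surface view
ax3d.plot_surface(D, E, loss_grid, cmap="viridis")

plt.show()
```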

It appears that the shape of the function is incomplete. This is because the plot relies on the random sets, and different sets will reveal different parts of the function. One approach to obtaining a more comprehensive shape is to stack multiple function plots using various random sets.

That’s about it! We have successfully plotted the contour of the loss function using a pair of random sets 𝛼 and 𝛽, and two direction vectors δ and η.

Final Words

Initially, I intended to talk about how to plot the optimization trajectory on the loss function. However, when projecting the path onto the random sets and trying to plot it, I couldn’t visualize anything meaningful due to the significant difference in scale.

For now, I need to catch up on a few modules that I have left unattended in my learning journey. However, I plan to return to this part and provide a proper implementation, editing this article as soon as I find some time.

The idea is to have something similar to what I’m showing right below (the trajectory is a made-up example):

Made-up 2D plot of the optimization path on top of the loss function

Furthermore, in a future article, I will compare various optimization algorithms, which was my initial motivation. This loss-landscape repository can serve as a source of inspiration.

Conclusion

Visualizing the loss function during the training of a neural network can be seen as a practical tool for understanding the learning process. Through this visual journey, we have explored one technique to reveal the shape of the loss function.

By gaining a deeper understanding of the loss function’s form, we can identify challenges and pitfalls that may arise during optimization. Visual representations, such as contour plots and trajectory visualizations, allow us to observe the contours of the loss function, uncover areas of high or low error, and diagnose potential issues like local minima or exploding gradients.

Furthermore, when done successfully, visualizing the trajectory taken during optimization showcases the paths that neural networks navigate in their quest for the minima. This insight not only enhances our comprehension of the network learning process but could also serve as a starting point for developing and testing more efficient optimization strategies.
