Deep Learning from Scratch in Modern C++: Activation Functions

Luiz doleron
May 8, 2023

Let’s have fun by implementing Activation Functions in C++.

Artificial Neural Networks are an example of biologically inspired models. In ANNs, processing units, called neurons, are grouped into layers of computation, usually to perform pattern recognition tasks.

In this model, we very often prefer to control the output of each layer to obey some constraint. For example, we can limit the output of a neuron to the interval [0, 1], [0, ∞), or [-1, +1]. Another very common scenario is to guarantee that the neurons of the same layer always sum up to 1. The way to apply these constraints is by using activation functions.

In this story, we will cover 5 important activation functions: sigmoid, tanh, ReLU, identity, and Softmax.

About this series

In this series, we will learn how to code the must-know deep learning algorithms such as convolutions, backpropagation, activation functions, optimizers, deep neural networks, and so on, using only plain and modern C++.

This story is: Activation Functions in C++

Check other stories:

0 — Fundamentals of deep learning programming in Modern C++

1 — Coding 2D convolutions in C++

2 — Cost Functions using Lambdas

3 — Implementing Gradient Descent

… more to come.

The Sigmoid Activation

Historically, the most famous activation is the Sigmoid function:

Sigmoid function and first derivative

This chart shows three important properties of the sigmoid:

  • its output is constrained between 0 and 1;
  • it is smooth or, in more precise mathematical jargon, differentiable;
  • it is S-shaped.

You might be wondering why the shape matters. An S-shaped curve means that the function is approximately linear in the neighborhood of the origin:

This helps achieve faster convergence for small inputs. There are two ways to define the sigmoid formula:

sigmoid(x) = exp(x) / (exp(x) + 1)   or   sigmoid(x) = 1 / (1 + exp(-x))

The two formulas are equivalent, but when implementing, we will prefer to use the latter:

double sigmoid(double x)
{
    return 1. / (1. + exp(-x));
}

The reason why we prefer the second formula is that the first one is numerically less stable. Very often, we also use a short-circuit when implementing the sigmoid:

double sigmoid(double x)
{
    double result;
    if (x >= 45.) result = 1.;
    else if (x <= -45.) result = 0.;
    else result = 1. / (1. + exp(-x));
    return result;
}

This saves a lot of processing and avoids numerical issues when |x| is large.
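As a quick sanity check, here is a small, self-contained program exercising the function (it repeats the sigmoid above so the snippet compiles on its own):

#include <cmath>
#include <iostream>

// same short-circuited sigmoid as above, repeated here for completeness
double sigmoid(double x)
{
    double result;
    if (x >= 45.) result = 1.;
    else if (x <= -45.) result = 0.;
    else result = 1. / (1. + std::exp(-x));
    return result;
}

int main()
{
    std::cout << sigmoid(0.) << "\n";    // 0.5, the midpoint of the S curve
    std::cout << sigmoid(2.) << "\n";    // ~0.88
    std::cout << sigmoid(-2.) << "\n";   // ~0.12
    std::cout << sigmoid(100.) << "\n";  // 1, short-circuited
    std::cout << sigmoid(-100.) << "\n"; // 0, short-circuited
    return 0;
}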

The sigmoid derivative

Using the chain rule, we can find the sigmoid derivative as:

sigmoid'(x) = sigmoid(x) · (1 - sigmoid(x))

For our convenience, let’s group the sigmoid and its first derivative into a functor:

class Sigmoid : public ActivationFunction
{
public:

    virtual Matrix operator()(const Matrix &z) const
    {
        return z.unaryExpr(std::ref(Sigmoid::helper));
    }

    virtual Matrix jacobian(const Vector &z) const
    {
        Vector output = (*this)(z);

        Vector diagonal = output.unaryExpr([](double y) {
            return (1. - y) * y;
        });

        DiagonalMatrix result = diagonal.asDiagonal();

        return result;
    }

private:

    static double helper(double z)
    {
        double result;
        if (z >= 45.) result = 1.;
        else if (z <= -45.) result = 0.;
        else result = 1. / (1. + exp(-z));
        return result;
    }

};

We will see how to use the activation function derivative when covering the backpropagation algorithm.

Sigmoid is widely used in the output layer of binary classifiers or of regression systems where the outcome is always non-negative. If the output can be a negative value, consider using the Tanh activation described below.
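Here is a short usage sketch of the functor. It assumes the Eigen-based aliases used throughout this series (e.g., Matrix = Eigen::MatrixXd, Vector = Eigen::VectorXd) and the ActivationFunction base class from the previous stories:

// A minimal usage sketch, not part of the original series code.
Matrix Z(2, 3);
Z << -2., 0., 2.,
     -50., 0.5, 50.;

Sigmoid activation;
Matrix A = activation(Z);                 // coefficient-wise sigmoid, values in (0, 1)
Matrix J = activation.jacobian(Z.col(0)); // diagonal Jacobian for one column of pre-activations

std::cout << A << "\n\n" << J << "\n";    // entries with |z| >= 45 are saturated to 0 or 1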

The Tanh activation

As its name suggests, the tanh activation is defined by the hyperbolic tangent function:

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

Like sigmoid, tanh is also S-shaped and differentiable. However, the bounds of tanh are -1 and 1:

Tanh function and first derivative

The tanh activation and the sigmoid activation are tightly related:

tanh(x) = 2 · sigmoid(2x) - 1
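We can verify this relationship numerically with a tiny sketch (it reuses the scalar sigmoid() defined earlier and assumes <cmath> and <iostream> are included):

// Quick numerical check of tanh(x) == 2 * sigmoid(2x) - 1
for (double x : {-3., -1., 0., 0.5, 2.})
{
    double lhs = std::tanh(x);
    double rhs = 2. * sigmoid(2. * x) - 1.;
    std::cout << x << ": " << lhs << " vs " << rhs << "\n"; // the two values match
}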

Note that, since tanh can output negative values, we cannot use it with some loss functions like logcosh.

The first derivative of tanh is:

tanh'(x) = 1 - tanh²(x)

We can pack tanh and its derivative into a single functor:

class Tanh : public ActivationFunction
{
public:

    virtual Matrix operator()(const Matrix &z) const
    {
        return z.unaryExpr(std::ref(tanh));
    }

    virtual Matrix jacobian(const Vector &z) const
    {
        Vector output = (*this)(z);

        Vector diagonal = output.unaryExpr([](double y) {
            return (1. - y * y);
        });

        DiagonalMatrix result = diagonal.asDiagonal();

        return result;
    }
};

ReLU

A problem with Sigmoid and Tanh is that they are computationally expensive, which makes training take longer. ReLU is a much simpler activation:

ReLU(x) = max(0, x)

ReLU activation and first derivative

Since ReLU is basically a simple comparison, its computational cost is very low compared to the other functions.

We can implement ReLU as follows:

class ReLU : public ActivationFunction
{
public:

    virtual Matrix operator()(const Matrix &z) const
    {
        return z.unaryExpr([](double v) {
            return std::max(0., v);
        });
    }

    virtual Matrix jacobian(const Vector &z) const
    {
        Vector output = (*this)(z);

        Vector diagonal = output.unaryExpr([](double y) {
            double result = 0.;
            if (y > 0) result = 1.;
            return result;
        });

        DiagonalMatrix result = diagonal.asDiagonal();

        return result;
    }

};

The relevant points are:

  • it is bounded for negative values but unbounded for positive values of x: its range is [0, ∞);
  • it is not differentiable at x = 0. In practice, we relax this condition by assuming that the derivative dReLU(x)/dx is 0 when x = 0.

Since ReLU consists basically of a single comparison, it is a really fast operation to compute. Its first derivative is also fast to calculate:

ReLU'(x) = 1 if x > 0, and 0 otherwise

Despite its advantages, ReLU has three main drawbacks:

  • since it is not bounded above, we cannot use it to constrain the output to [0, 1]. Because of this, in practice, ReLU is usually found only in internal (hidden) layers.
  • since ReLU is 0 for any x < 0, sometimes our models “die” during training because some or all neurons get stuck in a state where they only output 0.
  • since the derivative of ReLU is not continuous at x = 0, model training can be unstable for some inputs.

There are some alternatives to address these issues (see Softplus, Leaky ReLU, ELU, and GELU). However, due to its considerable benefits, ReLU is still largely used in real-world models. One of these alternatives, Leaky ReLU, is sketched below as an illustration.
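A Leaky ReLU functor can follow the same ActivationFunction interface used in this story. This is just a sketch, not code from the original series; the negative slope of 0.01 is a common but arbitrary choice:

// Sketch of a Leaky ReLU activation. For x < 0 it outputs a small negative
// value (0.01 * x) instead of 0, so neurons keep receiving a gradient.
class LeakyReLU : public ActivationFunction
{
public:

    virtual Matrix operator()(const Matrix &z) const
    {
        return z.unaryExpr([](double v) {
            return v > 0. ? v : 0.01 * v;
        });
    }

    virtual Matrix jacobian(const Vector &z) const
    {
        Vector diagonal = z.unaryExpr([](double v) {
            return v > 0. ? 1. : 0.01;
        });

        DiagonalMatrix result = diagonal.asDiagonal();

        return result;
    }
};

Because the derivative is never exactly 0 for negative inputs, this variant mitigates the “dying ReLU” problem described above.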

Identity Activation

The identity activation is defined simply by:

identity(x) = x

and its derivative is:

identity'(x) = 1

Using an identity activation means that the neuron’s output is not modified in any way. The implementation, in this case, is pretty simple:

class Identity : public ActivationFunction
{
public:

    virtual Matrix operator()(const Matrix &z) const { return z; }

    virtual Matrix jacobian(const Vector &z) const
    {
        Vector diagonal = Vector::Ones(z.rows());

        DiagonalMatrix result = diagonal.asDiagonal();

        return result;
    }
};
Identity and first derivative

Softmax

Consider that we have a picture of a pet and we need to determine which animal it is: a dog? A cat? A hamster? A bird? A guinea pig? In machine learning, we usually model a problem like this as a classification problem and refer to the model as a classifier.

Softmax is very suitable as the output of classifiers because it actually represents a discrete probability distribution. For example, consider the examples below:

A classifier for cats, dogs, and birds

In the previous example, the network is pretty sure that the pet in the image is a cat. In the next example, the model scores the image as a dog:

In deep learning models, we use Softmax to represent this output type.

These amazing pet pictures were taken by Amber Janssens.

Defining Softmax

The original formula of Softmax is:

softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)

This formula means that if we have k neurons, the output of the i-th neuron is given by the exponential of xᵢ divided by the sum of the exponentials of all neurons xⱼ.

A very first implementation of Softmax can be:

const auto buggy_softmax(const Vector &z) {

    Vector expo = z.array().exp();
    Vector sums = expo.colwise().sum();
    Vector result = expo.array().rowwise() / sums.transpose().array();
    return result;

};

We will see soon that this implementation has a severe flaw. But this code does the job of illustrating the most important aspect of softmax: the result of each neuron depends on every individual input.

We can run this code:

Vector input1 = Vector::Zero(3);
input1 << 0.1, 1., -2.;

std::cout << "Input 1:\n" << input1.transpose() << "\n\n";
std::cout << "results in:\n" << buggy_softmax(input1).transpose() << "\n\n";

to an output of approximately:

0.2792 0.6866 0.0342

The two most important aspects of Softmax are:

  • the sum of all outputs is always 1;
  • each output value lies in the interval [0, 1].

Implementing Softmax

The problem with our previous softmax implementation is that the exponential function grows very fast. For example, e¹⁰ is approximately 22,026, but e¹⁰⁰ is about 2.688117×10⁴³, a dauntingly huge number. It turns out that our implementation fails even if we use modest numbers as inputs:

Vector input2 = Vector::Zero(4);
input2 << 100, 1000., -500., 200.;

std::cout << "Input 2:\n" << input2.transpose() << "\n\n";
std::cout << "results in:\n" << buggy_softmax(input2).transpose() << "\n\n";
std::cout << "using the buggy implementation.\n";

This happens because C++ has a fixed-size representation for floating-point numbers: using regular 64-bit doubles, any call to the cmath exp(x) with an argument of roughly 710 or more overflows to inf, and dividing inf by inf produces nan.
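We can see the overflow threshold directly with a tiny, self-contained experiment (the exact boundary may vary slightly across platforms):

#include <cmath>
#include <iostream>

int main()
{
    // doubles can represent values only up to about 1.8e308,
    // so exp() overflows to inf somewhere around x = 709.8
    std::cout << std::exp(709.0) << "\n";  // ~8.2e307, still finite
    std::cout << std::exp(710.0) << "\n";  // inf
    std::cout << std::exp(1000.0) << "\n"; // inf
    return 0;
}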

Luckily, we can fix it by using the following trick:

softmax(x)ᵢ = exp(xᵢ - m) / Σⱼ exp(xⱼ - m)

where m is the max input:

m = maxⱼ xⱼ

Subtracting m from every input does not change the result, but it makes every exponent non-positive, so exp() can no longer overflow.

Now, by fixing the code, we get:

const auto good_softmax(const Vector &z) {

    Vector maxs = z.colwise().maxCoeff();
    Vector reduc = z.rowwise() - maxs.transpose();
    Vector expo = reduc.array().exp();
    Vector sums = expo.colwise().sum();
    Vector result = expo.array().rowwise() / sums.transpose().array();
    return result;

};
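Running the stable version on the same problematic input now produces finite values (a usage sketch; it assumes input2 and good_softmax() as defined above):

// Re-running the problematic input with the stable implementation.
std::cout << "Input 2:\n" << input2.transpose() << "\n\n";
std::cout << "results in:\n" << good_softmax(input2).transpose() << "\n\n";
// the largest logit (1000) gets a probability close to 1, the others close to 0,
// and the outputs still sum to 1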

Overflow is one source of numeric instability.

Numeric stability is a very common concern when we are developing real-world deep learning systems.

The Softmax derivatives

There is a very noticeable difference between Softmax and the other activations. In general, activations like sigmoid or ReLU are coefficient-wise operations, i.e., the value of one coefficient does not influence the other coefficients. In Softmax, of course, this is not true, because all values need to sum up to 1. This dependence makes the softmax derivative a little bit tricky to calculate. Nevertheless, after a little bit of calculus and with the help of our old friend the chain rule, we can figure out:

∂softmax(x)ᵢ / ∂xⱼ = softmax(x)ᵢ · (δᵢⱼ - softmax(x)ⱼ)

where δᵢⱼ is 1 when i = j and 0 otherwise.

Let me know if you want to read the development of this derivative.

For example, if we have five neurons, the derivative of each neuron with respect to every neuron in the same layer is given by a 5×5 Jacobian matrix. Writing s = softmax(x), this matrix is diag(s) - s·sᵀ, i.e., its entry (i, j) is sᵢ(δᵢⱼ - sⱼ).

This derivative will be applied in the next story when we train our first classifier.
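As a quick illustration of that formula (a sketch, assuming the Vector/Matrix aliases and the good_softmax() defined above; not code from the original series), we can build the Jacobian for a three-neuron output directly from the identity J = diag(s) - s·sᵀ:

// Jacobian of softmax for a small example, using J = diag(s) - s * s^T
Vector z(3);
z << 0.1, 1., -2.;

Vector s = good_softmax(z);
Matrix J = Matrix(s.asDiagonal()) - s * s.transpose();

std::cout << J << "\n"; // 3x3 matrix; each row sums to (approximately) 0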

Packing Softmax for further usage

Finally, we can implement the Softmax functor as follows:

class Softmax : public ActivationFunction
{
public:

    virtual Matrix operator()(const Matrix &z) const
    {
        if (z.rows() == 1)
        {
            throw std::invalid_argument("Softmax is not suitable for single value outputs. Use sigmoid/tanh instead.");
        }
        Vector maxs = z.colwise().maxCoeff();
        Matrix reduc = z.rowwise() - maxs.transpose();
        Matrix expo = reduc.array().exp();
        Vector sums = expo.colwise().sum();
        Matrix result = expo.array().rowwise() / sums.transpose().array();
        return result;
    }

    virtual Matrix jacobian(const Vector &z) const
    {
        Matrix output = (*this)(z);

        Matrix outputAsDiagonal = output.asDiagonal();

        Matrix result = outputAsDiagonal - (output * output.transpose());

        return result;
    }

};

Almost every classifier today uses Softmax in the output layer. We will cover some real-world examples of softmax in forthcoming stories.

Other activation functions

There are several other activation functions. In addition to the ones described here, we can list Softplus, Softsign, SELU, ELU, GELU, exponential, swish, and others. In general, they are variations of sigmoid or ReLU.

Conclusion and Next Steps

Activation functions are one of the most important building blocks of machine learning models. In this story, we learned how to implement some of the most important ones: Sigmoid, Tanh, ReLU, Identity, and Softmax.

In the next story, we will take a deep dive into the implementation of the most important deep learning algorithm: backpropagation. From scratch, in C++ and Eigen.
