Activation Functions in Neural Networks

The Spark Your Neural Network Needs: Understanding the Significance of Activation Functions

From the traditional Sigmoid and ReLU to cutting-edge activation functions like GeLU, this article delves into their significance, math, and guidelines for choosing the ideal function for your model.

Gaurav Nair


A Neural Network depicting Activation Functions in Action. Image by the Author.


Introduction

As we continue to see the rise in the field of AI, more and more people are finding themselves drawn towards this field. With the advent of newer and more advanced generative models, deep learning/neural networks have become the backbone of all the generative AI applications we use today.

A Neural Network comprises many components, each with its own role in capturing the different patterns in the data. Among these many components, the activation function is one of the fundamental elements inside the Neural Network.

Suppose you watched a new movie and your friends are asking for a review. You quickly recollect the whole film and decide whether it was good or bad. The activation function inside the Neural Network does the same thing: based on the information received from the previous layer, it decides whether to activate the neuron (send the information to the next layer) or stay quiet (do nothing).

The idea of the activation function was inspired by the action potential in neuroscience. An action potential, also known as a nerve impulse, is the electrical event that activates a neuron and allows it to interact with other neurons. The first known use of an activation function dates back to 1962, when Frank Rosenblatt introduced the threshold step function, modelled on this neuroscience.

Let’s discuss in detail what activation functions are, how they bring non-linearity in the network, understand the math behind them, and look at the different types of activation functions.

What are Activation Functions?

The activation function can be defined as a mathematical function that introduces non-linearity to the neural network. This enables the models to learn complex patterns and helps them make accurate predictions.

In simple words, think of it as a switch that decides whether a neuron should be activated or not. If the input is above a certain threshold or meets certain criteria, the neuron activates (fires); otherwise, it remains inactive.

By infusing non-linear transformations in the network, activation functions help in the creation of more advanced and sophisticated decision boundaries. This leads to better accuracy and predictive capabilities.

Decision Boundary created by Network with and without Activation Function. Image by the Author.

Activation functions are certainly the driving force behind the neural network’s ability to handle real-world problems that are non-linear in nature. However, their importance goes beyond infusing non-linear transformations. Let’s explore their role in the neural network.

The Importance of Activation Functions

Activation functions provide benefits beyond simply introducing non-linearity to the network. As we already know, activation functions are among the essential elements of a neural network; let us see the different ways they help the network learn complex patterns and relationships in the data:

  1. Non-linearity: As previously discussed, activation functions bring non-linearity to the network. The real-world data is usually non-linear and hence we cannot rely on the linear mathematical approaches for complex problems. The activation function helps us achieve non-linearity in the network to capture complex patterns with the help of mathematical functions.
  2. Gradient Propagation: During training, neural networks use backpropagation together with optimizers such as gradient descent to update the weights and biases and minimize the error. The activation function’s derivative determines how the gradient flows backward, and therefore by how much each weight is updated.
  3. Decision Making: It is the responsibility of the activation function to decide the actions of a neuron. Depending on the weighted sum received from the previous layers, the neural network with the help of the activation function can assign different levels of importance to different inputs depending on the task at hand.
  4. Modelling Complex Relationships: The activation functions help neural networks to model complex relationships between the input and the output. By stacking multiple layers of neurons, the activation function helps the neural network to learn hierarchical representations. The lower layers can learn simple features while the upper layers can learn more complex features by combining what was learned by the lower layers.

Let us understand the above points with the help of an example. Say, we are working on a computer vision task to identify the images of cats and dogs.

  • Non-linearity: This being real-world data, it will be inherently non-linear as the images will be in different poses and backgrounds. The activation function will help the neural network to capture the complex patterns in diverse images.
  • Gradient Propagation: During the training, backpropagation will update the weights and biases to minimize the error. The activation function’s derivative is used to compute the gradients which actually defines how much each weight should be updated.
  • Decision-making: Each neuron in the hidden layers receives a weighted sum from the previous layer, and the activation function acts as the decision-maker for whether that neuron should activate or remain inactive. This is critical for giving different levels of importance to features such as edges, shapes, and other patterns.
  • Modelling relationships: The activation function will help the neural network to model complex relationships. In the lower layers, it can focus on the shapes and edges, and in the upper layers, it can learn complex patterns by combining what it has learned in the lower layers.

The Math Behind Activation Functions

Understanding the math behind activation functions is really important to know how they work under the hood. A good mathematical understanding will also help us optimize the neural network and build better models. Let’s take a simple example of a binary classification task, focusing on forward propagation. Assume the following input values and learned weights and bias:
a = 2
b = 3

w1 = 0.5
w2 = 0.3
bias = 2

Forward Propagation with Activation. Image by the Author.

Forward Propagation: During forward propagation, the weighted sum will be calculated as:

z = w1 * a + w2 * b + bias
z = 0.5 * 2 + 0.3 * 3 + 2
z = 1 + 0.9 + 2
z = 3.9

As this is a binary classification task, we will use the Sigmoid activation function. We will be covering the reason for using the Sigmoid function later in this article. For now, let’s just focus on the formula:

Sigmoid Function. Image by the Author.

Now we need to plug in the value of the weighted sum in this formula.

f(z) = 1 / (1 + e^(-z))
f(z) = 1 / (1 + e^(-3.9))
f(z) = 1 / 1.02024
f(z) ≈ 0.98

This is known as the activation value (the neuron’s level of activation) for the given inputs and weights.
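Here is a minimal Python/NumPy sketch of this forward pass (the variable names mirror the example above; this is purely illustrative, not a full network):

import numpy as np

# Inputs and learned parameters from the example above
a, b = 2, 3
w1, w2, bias = 0.5, 0.3, 2

def sigmoid(z):
    # Squash the weighted sum into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Forward propagation: weighted sum, then activation
z = w1 * a + w2 * b + bias       # 0.5*2 + 0.3*3 + 2 = 3.9
activation = sigmoid(z)          # approximately 0.98
print(z, round(activation, 2))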

What can we interpret from this value?

Since we are working on a binary classification task, the value 0.98 can be interpreted as a strong probability that the input belongs to the positive class. The intuition is simple: the higher the activation value, the more likely the neuron is to be activated.

Can we build a Neural Network without an Activation Function?

Wondering what happens when we attempt to build a neural network without the activation function?

It is certainly going to make our neural network architecture and math easier, right?

Alright, let’s do this!

Assume we have a neural network with no activation function, so the output of each layer is sent directly to the next layer. The whole network then becomes a series of linear transformations, and a composition of linear transformations is itself just a single linear transformation. There is no non-linearity at all.

The main problem with this is that the network will fail to learn any complex patterns or relationships in the data. A network without an activation function can only draw a straight line or a hyperplane to separate the classes, as the sketch below demonstrates.
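A quick NumPy sketch (with arbitrary, randomly generated weights) makes this concrete: two stacked linear layers without an activation are mathematically equivalent to a single linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                      # an arbitrary input vector

# Two "hidden layers" with no activation function
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3,))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2,))
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into one linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_output = W_combined @ x + b_combined

print(np.allclose(two_layer_output, one_layer_output))   # True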

Network without Activation. Image Captured from Tensorflow Playground

In the above image, I have trained a linear network (without activation) for a classification task in the TensorFlow Playground. Of its four layers, two are hidden layers with three neurons each, and the network was trained for 500 epochs. You can see that it failed to learn the pattern in the data: because the network is purely linear, it could only draw a straight line or hyperplane to separate the data points. This results in high training and test loss, making the model ineffective.

Network with Activation. Image Captured from Tensorflow Playground

Now, let’s look at a similar network architecture but with an activation function. You can see that the network was able to identify the patterns in the data accurately, separating the two classes, and the training and test loss decreased significantly. In this case, I have used the Tanh activation function, which we will go through in detail later in this article.

Real-world data is usually not linearly separable, so a network without an activation function will perform poorly. By introducing non-linearity, the neural network gains the ability to approximate complex continuous functions, allowing it to understand intricate relationships and patterns.

Types of Activation Functions

As more and more industries incorporate AI into their business, the data we need to work with is increasingly diverse. While it is crucial to understand what type of output we expect from our model, there is no single activation function that works for everything. That is why it is important to choose the right activation function for a particular problem, as it can significantly impact the performance of the neural network.

There are many different types of activation functions to consider, each with its own advantages and disadvantages. The choice of the activation function depends upon the nature of the problem, desired outputs, and the architecture of the neural network. Let’s discuss the most commonly used and important ones:

Sigmoid Function

Sigmoid Activation Function. Image by the Author.

The sigmoid function transforms the input to a range between 0 and 1. Mathematically, it can be represented as:

Mathematical Representation of Sigmoid. Image by the Author.

Here, ‘e’ refers to the base of the natural logarithm, and ‘z’ is the input value (the weighted sum).

Use cases:

The sigmoid function is commonly used in the output layer of neural networks for binary classification tasks, where the objective is to assign the input to one of two classes.

Advantages:

The sigmoid function can be interpreted as a probabilistic function that transforms the input between the values 0 and 1. This makes it useful in cases where the output needs to predict the likelihood of a specific class.

Disadvantages:

The sigmoid function is one of the main reasons neural networks suffer from the vanishing/exploding gradient problem.

  • Vanishing/Exploding Gradients: Because the input is squashed into the range 0 to 1, the sigmoid’s derivative is small: it peaks at 0.25 and approaches zero for inputs of large magnitude. During backpropagation these small derivatives are multiplied layer by layer, so the more layers there are, the more the gradient shrinks, which leads to the vanishing gradient problem (see the sketch after this list).
  • Biased Output: With 0.5 as its midpoint, the sigmoid produces outputs that are not centred around zero. This makes learning and training slower and hinders convergence.
  • Computational drawbacks: Evaluating the exponential function can be computationally expensive when working with very large datasets or deep networks.
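To make the vanishing gradient point concrete, here is a short sketch of the sigmoid and its derivative. The derivative never exceeds 0.25 and shrinks rapidly as the input grows, and multiplying many such small factors across layers drives gradients toward zero.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)        # peaks at 0.25 when z = 0

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.4f}  derivative = {sigmoid_derivative(z):.6f}")
# The derivative of 0.25 at z = 0 falls to roughly 0.000045 at z = 10.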

Hyperbolic Tangent Function (tanh)

Tanh Activation Function. Image by the Author.

The tanh function is an activation function that transforms the input to a value between -1 and 1. Tanh has an S-shaped curve similar to the sigmoid, but the tanh curve is symmetric around zero. Tanh can, in fact, be viewed as a rescaled and shifted sigmoid.

Mathematically, the Tanh function is represented as:

Mathematical Representation of Tanh. Image by the Author.

Similar to the sigmoid function, ‘e’ represents the base of the natural logarithm, and ‘z’ is the input to the activation function.

Use cases:

The Tanh activation function is suited to classification tasks where the data is symmetric in nature, that is, where positive and negative values hold equal importance.

Advantages:

The tanh function solves some of the problems of the sigmoid. Because it is symmetric around zero, it can model positive as well as negative values equally well, and its zero-centred output makes training more stable than with the sigmoid function.

Disadvantages:

  • Vanishing/Exploding Gradients: As with the sigmoid, the derivative of tanh becomes very small for inputs of large magnitude. This causes slow learning and the vanishing gradient problem.
  • Saturated Output: For very large or very small input values, tanh saturates: the output gets close to -1 or 1 and the function becomes flat in that region. The gradient then becomes very close to zero, hindering weight updates during backpropagation.
  • Computationally Expensive: Similar to the sigmoid function, computing the exponential formula is expensive, especially for large datasets.
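One way to see the relationship with the sigmoid is that tanh is just a rescaled, zero-centred sigmoid: tanh(z) = 2 * sigmoid(2z) - 1. A quick numerical check:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
tanh_direct = np.tanh(z)
tanh_from_sigmoid = 2 * sigmoid(2 * z) - 1     # rescaled, shifted sigmoid

print(np.allclose(tanh_direct, tanh_from_sigmoid))   # True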

Rectified Linear Unit (ReLU)

ReLU Activation Function. Image by the Author.

ReLU is the most popular and most commonly used activation function. It outputs the input value if it is positive and zero otherwise; in other words, ReLU only ever returns a positive value or zero. It is also described as a zero-to-identity mapping.

Mathematically, it is given as,

Mathematical Representation of ReLU. Image by the Author.

‘z’ represents the input value to the neuron and the ‘max’ function returns the maximum value out of zero and z.

Use Cases:

ReLU is used mainly in the hidden layers of neural networks because it brings sparsity to the activations: it selectively activates only a subset of neurons and sets all other outputs to zero. This leads to a more efficient representation of high-dimensional data.

Advantages:

  • Non-linearity: ReLU’s ability to create sparsity by removing the negative values enhances the neural network’s ability to learn complex non-linear patterns and relationships in data.
  • Computationally Efficient: Unlike sigmoid and tanh, ReLU has a simple mathematical function hence it is computationally efficient, providing faster training and inference.
  • No Vanishing Gradient Problem: For positive inputs, ReLU’s derivative is exactly 1, so gradients are preserved rather than shrunk during backpropagation, which leads to better and faster learning.

Disadvantages:

  • Dying ReLU: If a neuron’s weighted sum keeps falling below zero during training, ReLU outputs zero and passes back a zero gradient, so the neuron can become permanently inactive and stop contributing to learning (see the sketch after this list). This reduces model capacity; a neural network with more active neurons and layers has greater capacity to capture complex patterns.
  • Saturated Output for Negatives: ReLU maps every negative input to zero, so the model cannot differentiate between mildly and strongly negative pre-activations.
  • Optimization Challenges: ReLU is not differentiable at zero, and its gradient is zero over the entire negative region, which can make life harder for gradient-based optimization algorithms.
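Here is a minimal sketch of ReLU and its gradient. The zero gradient for negative pre-activations is exactly what underlies the dying ReLU issue described above.

import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)     # 1 for positive inputs, 0 otherwise

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.   0.   0.   0.5  3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]  -> no gradient flows for z <= 0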

Leaky ReLU

Leaky ReLU Function. Image by the Author.

The Leaky ReLU activation function is a variant of ReLU that addresses the dying-neuron issue. Leaky ReLU allows a small, non-zero output even for negative values by introducing a small slope on the negative side.

Mathematically, it is given as:

Mathematical Representation of Leaky ReLU. Image by the Author.

Here, ‘z’ is the input to the activation function.

Use Cases:

Like ReLU, Leaky ReLU encourages sparse activation, except that negative values are scaled down to small non-zero values rather than set to zero outright. This makes it useful for high-dimensional data where dead neurons would otherwise be a problem.

Advantages:

  • Solves Dying ReLU: Leaky ReLU allows a small non-zero output for negative inputs. This prevents neurons from becoming permanently inactive and hence solves the dying ReLU problem, which also helps gradients flow during backpropagation.
  • Non-linearity: Similar to the traditional ReLU, Leaky ReLU introduces non-linearity to the network which helps it to learn complex patterns. The slight negative slope helps preserve some information which enhances the network’s ability.
  • Computational Efficiency: Leaky ReLU has a simple mathematical function similar to ReLU and hence is a computationally efficient choice while working with large-scale neural networks.

Disadvantages:

The small negative slope helps the network preserve some information from negative values and prevents the dying ReLU problem, but it can also introduce unwanted activations that affect the model’s performance. In addition, the slope is fixed by hand rather than learned, so it may not be optimal for every problem.
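A minimal Leaky ReLU sketch, using the commonly chosen slope of 0.01 (this value is a convention, not a requirement):

import numpy as np

def leaky_relu(z, alpha=0.01):
    # Pass positive values through; scale negatives by a small slope
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)     # the gradient never becomes exactly zero

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(z))        # [-0.03  -0.005  0.5    3.   ]
print(leaky_relu_grad(z))   # [0.01  0.01  1.    1.  ]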

Parametric ReLU (PReLU)

Parametric ReLU Function. Image by the Author.

The parametric ReLU is a variant of ReLU that incorporates a learnable parameter to control the slope of negative values. PReLU allows the slope to be adjusted during the training process, which enables the network to have an optimal slope. As you would have guessed, this is a fix for the main drawback of Leaky ReLU.

Mathematically, it is given as:

Mathematical Representation of PReLU. Image by the Author.

Here, ‘z’ represents the input, and ‘a’ represents the learnable parameter that controls the slope.

Use Cases:

The parametric ReLU provides the flexibility to adapt the slope of the activation function based on the needs of different input regions. Hence, it is used in cases where the characteristics and distribution may vary significantly across different regions of the data.

Advantages:

Adaptive Slope: PReLU allows the slope to be adjusted for negative values during the training phase. This allows the neural network to learn the optimal slope value for each neuron. This also helps the network to learn more complex patterns.

Eliminating Dying ReLU: Similar to Leaky ReLU, PReLU also helps in eliminating the dying ReLU problem by allowing non-zero values for negative inputs. This also promotes better gradient flow during backpropagation.

Disadvantages:

Model Complexity and Computational Resources: The learnable parameter ‘a’ adds extra parameters and complexity to the neural network, and reaching its optimal value requires more computational resources and a longer training time.

Overfitting: In cases where the dataset is limited or small, PReLU can result in overfitting since the network will learn the different slope values for each of the neurons during the training.
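Below is a small NumPy sketch of PReLU’s forward pass and the gradient of its output with respect to the learnable slope ‘a’. In practice, deep learning frameworks learn ‘a’ alongside the other weights; this is only an illustration, and the initial value of 0.25 is just a common convention.

import numpy as np

def prelu(z, a):
    # PReLU: identity for positive inputs, learnable slope 'a' for negative inputs
    return np.where(z > 0, z, a * z)

def prelu_grad_a(z):
    # Gradient of the output with respect to the learnable slope 'a'
    return np.where(z > 0, 0.0, z)     # only negative inputs update 'a'

z = np.array([-2.0, -1.0, 1.0, 2.0])
a = 0.25                               # a common initial value for 'a'
print(prelu(z, a))       # [-0.5   -0.25   1.     2.  ]
print(prelu_grad_a(z))   # [-2. -1.  0.  0.]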

Exponential Linear Unit(ELU)

ELU Function. Image by the Author.

ELU is another variant of the ReLU activation function. It aims to address the dying ReLU problem by allowing negative outputs for negative inputs. Unlike Leaky ReLU, which uses a fixed linear slope for negative inputs, ELU uses a smooth, continuously differentiable exponential curve whose depth is controlled by a hyperparameter ‘α’.

Mathematically, it is given as:

Mathematical Representation of ELU. Image by the Author.

Here, ‘z’ is the input, and ‘α’ is the hyperparameter that controls the slope.

Use Cases:

Similar to Leaky ReLU, ELU prevents the dying ReLU problem by allowing negative outputs for negative inputs. This improves gradient flow during backpropagation and makes the training phase more stable.

Advantages:

  • Elimination of Dying ReLU: Similar to Leaky ReLU and Parametric ReLU, ELU helps in mitigating the Dying ReLU problem by keeping the neurons active even when the weighted sum falls below zero.
  • Smoothness: The ELU graph is curved which makes it continuously differentiable. This helps during the backpropagation allowing the model and optimization algorithms to work effectively.
  • Improved Representation: The negative curve makes provisions for negative values allowing the model to capture complex representations and improved performance.

Disadvantages:

  • Computational Cost: The exponential function in the ELU formula is computationally expensive when working with large datasets or neural networks with many layers.
  • Hyperparameter Tuning: Finding the optimal value for the hyperparameter ‘α’ can be challenging and time-consuming. A value of α = 1.0 is a common default, but depending on the task at hand, we might have to try several values to find the best slope.
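A minimal ELU sketch, assuming the common default of α = 1.0; note how the negative side saturates smoothly at -α instead of being cut off at zero:

import numpy as np

def elu(z, alpha=1.0):
    # Identity for positive inputs, smooth exponential curve below zero
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(z))
# approximately [-0.993 -0.632  0.  1.  5.]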

Softmax

Softmax Function. Image by the Author.

Softmax is one of the most popular activation functions for multi-class classification problems. It converts a vector of numbers into a probability distribution, where each output represents the probability of the corresponding class.

In simpler terms, softmax can be thought of as a probability calculator for ‘n’ different events: given a score for each event, it returns the probability of each one while taking all the others into account. This makes it a handy tool for choosing among several options.

Mathematically, it is represented as:

Mathematical Representation of Softmax. Image by the Author.

Here, ‘z’ represents the values from the neurons of the output layer. The exponential term in the numerator acts as the non-linear function while the denominator is a normalization term that helps convert the output into probabilistic values.
‘N’ represents the number of classes.
‘z sub i’ represents the weighted sum of the input for specific class i. This value is the contribution of class i to the overall decision-making process.
‘z sub j’ represents the weighted sum of the input for class j, where j ranges from 1 to N, covering all classes in the classification problem.

Use Cases:

Softmax is commonly used in the output layer of neural networks for multi-class classification problems. The raw outputs from the final layer, commonly known as logits, are transformed into probabilities using the softmax function.

Since softmax produces a probability distribution over all classes, it is suitable for tasks where we need not only the most likely class but also the relative likelihoods of the other classes.

Advantages:

  • Probability-based: Softmax outputs a probability distribution that sums to 1, which makes the model’s predictions easy to interpret, compare, and evaluate.
  • Gradient friendly: Softmax is differentiable, so it works well with gradient-based optimization algorithms and backpropagation.
  • Multi-class friendly: Softmax handles an arbitrary number of classes simultaneously in a single, reliable operation.

Disadvantages:

  • Large Inputs: Very large logits can make the exponentials overflow or produce extremely peaked probabilities, causing numerical instability; a common remedy is to subtract the maximum logit before exponentiating (see the sketch after this list).
  • Sensitive to Outliers: A single unusually large logit can dominate the distribution, crowding out the probabilities of the other classes.
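A minimal softmax sketch; subtracting the maximum logit before exponentiating is a standard numerical-stability trick (it addresses the ‘Large Inputs’ point above and leaves the result unchanged):

import numpy as np

def softmax(logits):
    # Convert raw logits into a probability distribution over classes
    shifted = logits - np.max(logits)     # numerical stability: avoids overflow
    exp = np.exp(shifted)
    return exp / np.sum(exp)

logits = np.array([2.0, 1.0, 0.1])        # raw outputs for 3 classes
probs = softmax(logits)
print(probs)          # approximately [0.659 0.242 0.099]
print(probs.sum())    # 1.0 (up to floating-point rounding)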

Gaussian Error Linear Unit (GeLU)

GeLU Function. Image by the Author.

GeLU combines ideas from stochastic regularization techniques such as dropout with the non-linearity of activation functions such as ReLU. Let’s unpack what happens in each of these parts.

Dropout regularization randomly “drops” (deactivates) certain neurons in the network during each iteration, so those neurons do not contribute to training for that iteration. This introduces randomness into the network, and neurons do not become overly reliant on any specific feature of the data during training, which ultimately improves generalization to unseen data.

We know that ReLU is a zero-to-identity mapping: it multiplies the input by zero if it is negative and by one if it is positive. This introduces non-linearity through a piecewise linear function, allowing the network to learn complex relationships in the data.

GeLU also draws on the idea of zoneout regularization. Unlike dropout, which drops neurons, zoneout stochastically retains some neurons’ original values during each iteration. This provides controlled randomness during training and promotes a balance between exploration and exploitation.

Mathematically, it is given as:

Mathematical Representation of GeLU. Image by the Author.

Here, z is the input, and the remaining term is the standard Gaussian cumulative distribution function, usually written using the Gaussian error function.

Since calculating the Gaussian error function for every neuron is computationally expensive and time-consuming, there are also approximations of the above formula in terms of tanh and sigmoid.

Tanh approximation for GeLU. Image by the Author.
Sigmoid approximation for GeLU. Image by the Author.
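Here is a small sketch comparing the exact GeLU (via the Gaussian error function) with its tanh approximation; the formulas follow the expressions above, and the two agree to several decimal places.

import math

def gelu_exact(z):
    # GeLU(z) = z * Phi(z), with Phi the standard Gaussian CDF
    return 0.5 * z * (1 + math.erf(z / math.sqrt(2)))

def gelu_tanh_approx(z):
    # Tanh-based approximation used when erf is too costly
    return 0.5 * z * (1 + math.tanh(math.sqrt(2 / math.pi) * (z + 0.044715 * z**3)))

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"z = {z:5.2f}  exact = {gelu_exact(z):.5f}  approx = {gelu_tanh_approx(z):.5f}")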

Use Cases:

GeLU is one of the most popular activation functions and has been reported to match or outperform ReLU and ELU across computer vision, natural language processing, and speech recognition tasks. GeLU’s combination of stochastic and non-linear behaviour is well suited to improving model performance and preventing overfitting.

Advantages:

Non-linearity: Since GeLU incorporates nonlinearity from ReLU, it is efficient in understanding complex patterns and relationships in the data.

Stochasticity: The dropout-like stochastic behaviour behind GeLU makes the network more robust and reduces overfitting, which also helps generalization to unseen data.

Disadvantages:

Computationally Intensive: GeLU involves evaluating the Gaussian error function, and even its sigmoid and tanh approximations require exponential calculations, which can be computationally intensive and time-consuming, especially in large-scale neural networks.

Optimization: GeLU’s blend of dropout-like behaviour and ReLU-like non-linearity can make the neural network more challenging to optimize, and gradients may not always converge as efficiently as with plain ReLU.

How to choose the correct activation function for your model?

As we have discussed previously, choosing the correct activation function for your neural network can depend on several factors, including the type of data you are working with, the architecture of your neural network, and the task you are trying to solve. However, there are some general guidelines that can help choose the correct activation function for your neural network:

  • Understanding of different activation functions: Different activation functions have different properties, such as their type and range of output, advantages, and disadvantages. Understanding these properties can help choose an activation function that is well-suited for the task.
  • Type of data you are working with: Some activation functions may be better suited to certain types of data. For example, for a binary output a sigmoid (or a two-class softmax) is a good choice, while for continuous data with a wide range, ReLU, Leaky ReLU, or GeLU will usually do the job.
  • Experiment with different activation functions: Ultimately, the best way to determine which activation function will be the right choice for your neural network is to experiment with different functions and see which one gives you the best performance. Training your neural network with different activation functions and comparing their performance on a validation set will be the best option for making better models.

There are some other guidelines as well, that can be useful:

  • It is recommended to begin with ReLU and if it does not give the desired results, you can move to other activation functions.
  • Because of vanishing gradients, sigmoid and tanh are generally avoided in hidden layers; reserve them for the output layer.
  • For binary classification, use sigmoid (or tanh) in the output layer, and for multi-class classification, use softmax, as illustrated in the sketch below.
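As a concrete illustration of these guidelines, here is how the activation choices might look in a small (hypothetical) Keras binary classifier: ReLU in the hidden layers and a sigmoid output. The layer sizes and input shape are made up for the example.

import tensorflow as tf

# A small binary classifier sketch: ReLU in the hidden layers, sigmoid at the output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                      # 20 input features (made up)
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# For a multi-class problem, the output layer would instead be
# Dense(num_classes, activation="softmax") with a categorical cross-entropy loss.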

Here’s a cheat sheet for all the functions we discussed in this article.

Activation Functions Cheat Sheet. Image by the Author.

Conclusion

Activation Functions play a major role in the neural network and deep learning models that underpin many of the generative models we use today. Besides bringing non-linearity to the network, they enable the model to learn complex patterns while supporting healthy gradient flow and reasonable training times.

We discussed that there is no one-size-fits-all activation function. A good understanding of each activation function, its mathematics, and the kind of output it produces plays a major role in choosing the right one for the problem. We may also need a fair amount of experimentation and evaluation to find the right activation function for our network.

I am on a journey to learn and grow and am open to feedback or corrections. If anything in this blog seems incorrect or has a provision for improvement, please don’t hesitate to let me know.

Connect with me on LinkedIn!

If this blog has helped you, please consider clapping (up to 50 claps per user are allowed). Thank you for taking the time to read and explore with me!🙂

