How to Optimize GPU Usage During Model Training With neptune.ai

14th May, 2024

TL;DR

GPUs can greatly accelerate deep learning model training, as they are specialized for performing the tensor operations at the heart of neural networks.

Since GPUs are expensive resources, utilizing them to their fullest degree is paramount. Metrics like GPU usage, memory utilization, and power consumption provide insight into resource utilization and potential for improvement.

Strategies for improving GPU usage include mixed-precision training, optimizing data transfer and processing, and appropriately dividing workloads between CPU and GPU.

GPU and CPU metrics can be monitored using an ML experiment tracker like Neptune, enabling teams to identify bottlenecks and systematically improve training performance.

As data scientists or machine-learning engineers, one of the routine tasks we tackle involves rapid experimentation and training multiple models under different settings to identify the most effective ones. This iterative process is usually one of the most costly and time-consuming phases, so any possible optimization is worth exploring.

Making use of Graphics Processing Units (GPUs) for deep learning (DL) has significantly accelerated the training phase due to the GPU’s parallel processing capabilities and high memory bandwidth. The GPU’s cores are specialized for performing the matrix multiplications at the heart of DL algorithms.

GPUs are typically the most expensive resource in a training pipeline. Thus, we must make the most out of them. This requires careful attention to GPU settings, such as utilization and memory allocation. Optimizing your GPU instances ensures that your organization pays only for what it needs, especially when using a cloud-based service where every hour and minute counts on the bill.

In this article, we’ll start by exploring some critical GPU metrics, followed by techniques for optimizing GPU performance. We’ll look at how factors like batch size, framework selection, and the design of your data pipeline can profoundly impact the efficient utilization of GPUs. In the later part of this article, we’ll see how monitoring the utilization of resources such as GPU, CPU, and memory can help determine why the GPU isn’t being used to its full potential.

Metrics for evaluating GPU performance

To understand whether a GPU is operating at its maximum potential, we rely on specific metrics that provide valuable insights, including utilization, memory usage, power consumption, and temperature. For NVIDIA GPUs, which most of us likely use, you can inspect these metrics with the `nvidia-smi` command-line tool.
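For example, running `nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,power.draw,temperature.gpu --format=csv -l 5` prints these metrics as CSV every five seconds; `nvidia-smi --help-query-gpu` lists all available query fields.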

Utilization

The GPU utilization metric quantifies how heavily the GPU is engaged during the training of deep-learning models. Expressed as a percentage, it represents the fraction of time over the past sample period during which one or more kernels (CUDA kernels) were executing on the GPU.

Memory

A GPU’s memory plays a significant role during model training. It’s in charge of holding everything from the model’s parameters to the data being processed.

The GPU memory usage metric reflects the amount of memory allocated to the GPU relative to its total memory capacity. By observing this metric, we can find the largest possible batch size, allowing us to exploit the parallel processing capabilities of GPUs as much as possible. It is also important to keep track of this metric to avoid out-of-memory errors.
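If you prefer to check memory usage from within your training script rather than via `nvidia-smi`, PyTorch exposes its allocator statistics directly. A minimal sketch, assuming a single CUDA device (device 0):

import torch

# Memory currently occupied by tensors vs. memory reserved by PyTorch's caching allocator
allocated_gb = torch.cuda.memory_allocated(0) / 1e9
reserved_gb = torch.cuda.memory_reserved(0) / 1e9
total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

print(f"allocated: {allocated_gb:.2f} GB | reserved: {reserved_gb:.2f} GB | total: {total_gb:.2f} GB")

Note that the values reported by `nvidia-smi` are typically higher than `memory_allocated`, since PyTorch’s caching allocator reserves memory ahead of time and the CUDA context overhead is not included in the allocator statistics.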

The GPU memory utilization metric indicates the percentage of time over the past sample period during which the GPU’s memory controller was busy reading from or writing to memory. A lower GPU memory utilization typically indicates that the GPU spends more time computing than fetching data from memory. One way to lower this percentage is to increase the batch size so that the GPU spends relatively less time fetching data.

We can also allow GPUs to perform computations and access memory simultaneously. NVIDIA’s blog has a great article on how to Overlap Data Transfers in CUDA.
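In PyTorch, this kind of overlap can be expressed with CUDA streams and non-blocking copies from pinned host memory. Below is a minimal sketch; the tensor shapes and the commented-out model call are placeholders for your own training code:

import torch

copy_stream = torch.cuda.Stream()

# The source tensor must live in pinned (page-locked) host memory for a truly asynchronous copy
next_batch_cpu = torch.randn(256, 3, 224, 224).pin_memory()

with torch.cuda.stream(copy_stream):
    # Copy the next batch to the GPU on a side stream ...
    next_batch_gpu = next_batch_cpu.to("cuda", non_blocking=True)

# ... while the default stream keeps computing on the current batch:
# current_output = model(current_batch_gpu)

# Synchronize before using the freshly copied batch
torch.cuda.current_stream().wait_stream(copy_stream)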

Power and temperature

Tracking a GPU’s temperature and power consumption ensures optimal performance and prevents issues such as overheating. GPUs are power-intensive hardware and generate heat during operation. Thus, they require cooling solutions to maintain their temperature at an acceptable level.

GPU temperature is measured on the Celsius scale, and monitoring it is crucial to ensure the GPU operates within acceptable temperature levels. High temperatures can lead to overheating issues, triggering an automatic reduction in the GPU’s clock speed to prevent further overheating, thus impacting performance.

The GPU power usage metric reflects the total electrical power used in watts. This metric is essential in verifying that the GPU is receiving the necessary power for optimal functioning while also serving as a valuable indicator for detecting potential hardware issues, including problems with the power supply unit.

GPU performance-optimization techniques

In the excitement over using powerful GPUs, it’s easy to forget the importance of managing these resources efficiently. Even though GPUs excel at parallel computations, their full potential might go to waste if they are not allocated and managed properly. 

In the previous section, we introduced standard metrics that might indicate that you’re not fully utilizing your GPU resources. Let’s explore effective strategies for addressing this and maximizing your GPUs.

Increase the batch size to increase GPU utilization

If you’re dealing with low GPU usage while training, increasing the batch size is the first thing you should try. The available GPU memory constrains the maximum batch size, and exceeding it triggers an out-of-memory error.

Another consideration when increasing the batch size is that it can lead to lower accuracy on test data. Research investigating the impact of batch size when training DL models has revealed that using a large batch size often leads training to converge to sharp minima, resulting in poorer generalization.

Effective workarounds, such as increasing the learning rate or employing techniques like Layer-wise Adaptive Rate Scaling (LARS), can allow for larger batch sizes without compromising the accuracy of the models.

Due to these tradeoffs, finding the optimal batch size typically involves a trial-and-error approach to balance the positive and negative effects.
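As a concrete illustration of the memory side of that search, the sketch below (a hypothetical helper, not a general-purpose utility) doubles the batch size until a forward/backward pass runs out of GPU memory:

import torch

def find_max_batch_size(model, sample_shape, device="cuda", start=8, limit=4096):
    # Double the batch size until a forward/backward pass runs out of GPU memory
    batch_size, max_working = start, None
    while batch_size <= limit:
        try:
            batch = torch.randn(batch_size, *sample_shape, device=device)
            model(batch).sum().backward()
            model.zero_grad(set_to_none=True)
            max_working = batch_size
            batch_size *= 2
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            break
        finally:
            torch.cuda.empty_cache()
    return max_working

Keep in mind that the largest batch size that fits in memory is not necessarily the one that generalizes best, as discussed above.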

Use mixed-precision training to maximize GPU performance

Neural networks operate by manipulating numerical values, typically expressed as floating-point numbers in either 32-bit or 64-bit formats. How many bits are available to store a number directly impacts the computational efficiency and accuracy of the model. The fewer bits that have to be manipulated, the faster the computation – but the lower the precision.

Mixed-precision training is a technique employed in model training that utilizes different floating-point types (e.g., 32-bit and 16-bit) to improve computing speed and reduce memory usage while maintaining accuracy. It achieves computational acceleration by executing operations in a 16-bit format while keeping certain parts of the model in a 32-bit format to ensure numerical stability.

Mixed-precision training improves GPU usage by lowering the required memory, allowing the training of larger models, or setting larger batch sizes. It enables batch sizes up to twice as large, significantly boosting GPU utilization.

Another significant benefit is decreased computation time, as 16-bit operations halve the number of bytes accessed, thus reducing the time spent in memory-limited layers such as batch normalization, activations, and pooling. NVIDIA claims up to eight times the 16-bit arithmetic throughput compared to 32-bit for their GPUs.

It is important to note that NVIDIA GPUs with compute capability 7.0 or higher experience the most significant performance boost from mixed precision. They have dedicated hardware units for 16-bit matrix operations called Tensor Cores.
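In PyTorch, mixed-precision training is available out of the box via `torch.cuda.amp`. A minimal training-step sketch, assuming `model`, `optimizer`, and `data_loader` are already defined:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # scales the loss to prevent small FP16 gradients from underflowing

for inputs, targets in data_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)

    with autocast():  # runs the forward pass in FP16 where it is numerically safe
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, targets)

    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()                # adjusts the loss scale for the next iteration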

I recommend this Mixed Precision Training tutorial for a more thorough introduction.

Optimization in deep learning

Have a look at other articles on our blog exploring aspects of optimization in deep learning:

  • Deep Learning Model Optimization Methods: Deep learning models exhibit excellent performance but require high computational resources. Techniques like pruning, quantization, and knowledge distillation are vital for improving computational efficiency.
  • Best Tools for Model Tuning and Hyperparameter Optimization: Systematically tuning the hyperparameters of a machine learning model to improve its performance is a crucial step in any machine learning workflow.
  • Deep Learning Optimization Algorithms: Training deep learning models means solving an optimization problem: The model is incrementally adapted to minimize an objective function. A range of optimizers are used in deep learning, each addressing a particular shortcoming of the basic gradient descent approach.

Optimize the data pipeline to increase GPU utilization

To maximize GPU utilization, we must ensure that the GPU remains consistently busy and avoid situations where it remains idle, waiting for data. We need a well-optimized data pipeline to achieve this goal.

The pipeline involves several steps. Data samples are initially loaded from storage to the main memory, requiring input and output operations (I/O). Subsequently, the data goes through pre-processing, mainly on the CPU, and finally, the preprocessed data is transferred into the GPU’s memory.

It’s crucial to ensure that all of these steps are performed efficiently. So, let’s dive into the specifics of I/O, focusing on the data transfer from the storage to the main memory and from the main memory to the GPU.

Optimizing data loading

Data-loading costs are primarily dominated by I/O operations. Their effective management is crucial for machine-learning workloads due to the typically high frequency of I/O requests. For example, when training on large datasets, the data might be spread across multiple smaller files. In other cases, data is collected incrementally, for instance, from hardware sensors. When using GPUs, I/O may become the bottleneck, as the speed at which data is provided to the GPUs can be a limiting factor, impacting the overall speed of the entire pipeline.

Local SSD drives provide excellent performance for teams dealing with smaller datasets. However, for more extensive DL tasks, remote storage solutions connected to GPU clusters are necessary. The mismatch between the GPUs’ rapid processing capability and the slower data retrieval from cloud storage services can create a performance bottleneck. One way to address this I/O bottleneck is by caching frequently accessed data, bringing it closer to the compute resources. This can significantly improve performance when working with large datasets.

Optimizing data transfer between CPU and GPU

Another essential consideration regarding data is the transfer speed between the CPU and GPU. A simple way to optimize this is by leveraging so-called CPU-pinned memory, which facilitates faster data transfer from CPU memory to GPU memory by having the CPU write into the parts of its memory that the GPU can access directly. (This feature is readily available in PyTorch’s `DataLoader` by setting `pin_memory` to `True`.) By utilizing pinned memory, you can also overlap data transfers with computation on the GPU.
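Putting this together in PyTorch, a data loader with pinned memory and several worker processes might be configured as follows (the in-memory dataset and the worker count are placeholders to adapt to your data and hardware):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in practice this would read your training samples
dataset = TensorDataset(torch.randn(1_000, 3, 64, 64), torch.randint(0, 10, (1_000,)))

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,      # parallel CPU workers for loading and preprocessing
    pin_memory=True,    # allocate batches in page-locked host memory
    prefetch_factor=2,  # batches prefetched per worker
)

for inputs, targets in loader:
    # non_blocking=True lets the host-to-device copy overlap with GPU computation
    inputs = inputs.to("cuda", non_blocking=True)
    targets = targets.to("cuda", non_blocking=True)
    # ... forward and backward pass ...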

Optimizing data preprocessing

In addition to I/O and data transfer to the GPU, data preprocessing can become a bottleneck for GPU utilization. During the training phase, the CPU is responsible for data preprocessing, and a bottleneck emerges when one or more CPUs reach their maximum utilization, causing the GPU to be partially idle as it waits for the CPU to provide the next batch of training data.

There are several ways we can deal with this. One approach is to structure the preprocessing pipeline into tasks that can be completed offline, i.e., at the data-creation phase before the training starts. Shifting operations to the data-creation phase will free up some CPU cycles during training.

To optimize runtime, it might seem logical to move all tasks in the data preprocessing pipeline offline. However, this approach is not always ideal for training, as introducing a degree of randomness to input samples during training is often beneficial. For instance, random rotations and flips improve results for some tasks, such as adversarial training.

Another approach for addressing the data preprocessing bottleneck is to move data operations onto the GPU. NVIDIA provides a Data Loading Library (DALI) for building highly-optimized data preprocessing pipelines, which offloads specific tasks to GPUs, such as decoding, cropping, and resizing images. While this approach enhances the efficiency of the data-loading pipeline, it comes with the drawback of burdening the GPU with an extra workload.
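As an illustration, a DALI pipeline that reads JPEG files, decodes them with GPU assistance, and resizes them might look roughly like the sketch below (`/data/train` is a placeholder directory; check the DALI documentation for the operator arguments supported by your version):

from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=128, num_threads=4, device_id=0)
def train_pipeline(data_dir="/data/train"):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    # "mixed" decodes JPEGs partly on the CPU and partly on the GPU
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = train_pipeline()
pipe.build()
images, labels = pipe.run()

For training, the pipeline is typically wrapped in a framework iterator (e.g., `DALIGenericIterator` from `nvidia.dali.plugin.pytorch`) rather than calling `run()` manually.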

Choice of deep learning framework

In DL, we can choose from multiple frameworks, such as TensorFlow, PyTorch, and JAX (all of which are also available in the multi-backend framework Keras).

Each framework has a different set of performance features and applies various techniques to optimize the implementation of algorithms. Therefore, the same algorithm can exhibit a considerable variation in performance across frameworks.

There are various criteria you might consider when choosing a framework, such as ease of coding, flexibility, community support, or the learning curve. But here, we’re zooming in on how these frameworks utilize resources, particularly the GPU.

It is important to note that there is no definitive winner among these frameworks, as their GPU utilization fluctuates based on various factors, such as the specific task, dataset characteristics, and the neural network’s architecture.

A research paper published in 2021 compared different DL frameworks that were current at the time. The researchers implemented different model architectures such as CNNs and LSTMs, trained the models on different datasets like CIFAR-100 and IMDB Reviews, and observed different CPU and GPU metrics. Afterward, they also compared the models’ performance. The researchers found a significant variation in GPU usage, training time, and task performance between frameworks.

Comparative analysis of GPU usage across different frameworks. (a) Average GPU utilization during training of a CNN on different datasets, highlighting the variations in efficiency between frameworks. (b) Average GPU utilization during training of LSTMs on different datasets, likewise showing variations in efficiency between frameworks. | Source

Strategy for optimizing GPU usage

Knowing how resources, like GPUs, CPUs, and memory, are utilized during the training phase can help optimize training and maximize GPU capabilities. Here are some guidelines that may help you identify the possible bottlenecks based on the usage of these resources: 

  • If the GPU utilization is low and the CPU utilization is high, it suggests potential bottlenecks in data loading or preprocessing.
  • If you notice that CPU memory utilization is low, a quick way to boost performance is to increase the number of workers your data loader uses.
  • If your GPU utilization is low and the CPU utilization is consistently low despite having a sufficiently large dataset, it may indicate suboptimal resource utilization in your code. In such cases, explore parallelization and code optimization techniques and ensure that your model architecture can efficiently use the GPU.
  • If GPU memory utilization is low, increasing the batch size is a potential strategy to enhance resource utilization.

Case Study: Monitoring and optimizing GPU usage with neptune.ai

Closely monitoring resource utilization, including CPU, GPU, and memory, adds an additional layer of complexity to our workflow. Experiment tracking tools such as Neptune can simplify this process, making our job easier and ensuring well-organized monitoring.

Neptune automatically logs system metrics, including hardware consumption metrics such as CPU, GPU (NVIDIA only), and memory usage. Additionally, you can log into Neptune any hardware metric you can access from your training script. Check out this example dashboard, which displays charts generated from these metrics logged by Neptune.

Setting up Neptune for logging GPU metrics

Neptune provides capabilities to log metrics from various processes, which is particularly useful as most of our machine-learning pipelines involve multiple stages. These stages, such as data preprocessing, model training, and inference, are often managed by different scripts. When we declare a custom monitoring namespace in each script, we can capture the metadata from these various steps in a single run:

import neptune

run = neptune.init_run(monitoring_namespace="monitoring/preprocessing")

Head to the documentation to read more about logging with Neptune in a sequential pipeline.
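Beyond what Neptune collects automatically, you can also sample GPU statistics yourself, for example with the `nvidia-ml-py` bindings (`pynvml`), and log them under a custom namespace. A rough sketch (the namespace names and the sampling loop are arbitrary choices; in a real pipeline you would run this alongside training, e.g., in a background thread):

import time
import neptune
import pynvml

run = neptune.init_run()
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(60):  # sample once per second for one minute
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    power_watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # reported in milliwatts

    run["monitoring/gpu/utilization"].append(util.gpu)
    run["monitoring/gpu/memory_used_gb"].append(mem.used / 1e9)
    run["monitoring/gpu/power_watts"].append(power_watts)
    time.sleep(1)

pynvml.nvmlShutdown()
run.stop()

With older versions of the Neptune client, `.log()` is used in place of `.append()`.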

How Brainly optimized their GPU usage with Neptune

Brainly is a learning platform offering an extensive knowledge base for all school subjects. Brainly’s Visual Search team uses Neptune to track the training of the Visual Content Extraction (VICE) system for their “Snap to Solve” feature. This core feature allows users to take and upload photos with questions or problems and offers relevant solutions.

When monitoring their training, the charts generated during the training process revealed that the GPU was not being utilized to its maximum capacity. In fact, it frequently sat completely idle.

GPU utilization plot generated with Neptune. The line plots show GPU usage over time for individual GPUs used by Brainly’s Visual Search team. Frequently, GPU utilization drops below 25%, sometimes even reaching values close to zero. This indicates that by improving GPU utilization, training could be sped up, and GPU resources could be used more efficiently.

Brainly’s team thoroughly investigated the issue, tracking the usage of various resources, including the CPU. They identified a bottleneck within their data preprocessing pipeline as the root cause of the problem. Specifically, they noted inefficiencies with copying images from the CPU memory to the GPU memory, as well as with data augmentation.

To address this, they optimized their data augmentation tasks by decompressing JPEGs and transitioning from plain TensorFlow and Keras to NVIDIA DALI for data preprocessing. They further chose multiprocessing over multithreading for processing jobs to create a more efficient data pipeline.

Using Neptune, the team analyzed the performance improvement brought about by each optimization step.

GPU optimization checklist 

In this article, we’ve looked into various ways to assess GPU usage and explored strategies to improve it. The optimal GPU settings and setup for training deep learning models vary depending on the specific task, and there is no alternative to thorough analysis and systematic experimentation.

However, across the many projects I’ve worked on, the following guidelines for optimizing GPU usage have often proven helpful:

  • Always monitor GPU memory usage during training. If a decent amount of memory is still free, try increasing the batch size, combined with techniques such as learning-rate scaling or LARS that preserve the model’s performance at larger batch sizes.
  • Check if your GPU supports mixed precision and implement it while training for maximum performance.
  • Track the utilization of CPU and GPU to identify possible bottlenecks in your data pipeline. Carefully assess the impact of each improvement on the overall performance.
  • Explore the capabilities and learn about the implementation details of your DL framework. Each framework exhibits unique capabilities in utilizing the GPU, influenced by factors like model architecture and type of task.
