Optimizing TFLite Models for Efficient On-Device Machine Learning: A Comparison of Quantization Techniques

Ibrahim Olagoke
Aug 3, 2023

Introduction

In recent years, machine learning models have become increasingly powerful and complex, enabling remarkable advances across a wide range of applications. However, these sophisticated models often come at the cost of increased memory usage and computation, making them challenging to deploy on resource-constrained devices such as smartphones and IoT devices. To address this challenge, TensorFlow Lite (TFLite), Google’s lightweight machine learning framework, offers a range of model optimization techniques, including quantization, that improve efficiency with little or no loss in accuracy.

In this blog post, we will explore the concept of quantization and how it can significantly reduce the memory footprint and inference time of machine learning models. We will delve into three types of quantization techniques — FP-16 quantization, Dynamic Range Quantization, and Integer Quantization — and compare their impact on model performance and efficiency. To facilitate this comparison, we’ll use a baseline Keras model and optimize it using these quantization techniques, converting it to TensorFlow Lite models for deployment on resource-limited devices.

Understanding Quantization

Quantization is a model optimization technique that involves reducing the precision of the model’s weights and activations, resulting in a more compact representation. Traditional neural networks typically use 32-bit floating-point numbers (FP32) to store weights and perform computations. However, many devices, such as mobile phones and microcontrollers, can perform computations more efficiently with lower precision data types. Quantization enables us to use reduced precision, such as 16-bit floating-point (FP16) or 8-bit integers, while still maintaining acceptable model accuracy.
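To make this concrete, below is a minimal sketch (not part of the pipeline that follows) of the affine mapping commonly used for 8-bit quantization: a real value x is approximated as scale * (q - zero_point), where q is an 8-bit integer. The exact scheme TFLite applies varies by tensor and technique, but the idea is the same.

import numpy as np

# Toy example of 8-bit affine quantization: map floats in [w_min, w_max]
# onto the uint8 range [0, 255], then dequantize to see the rounding error.
weights = np.array([-1.2, -0.4, 0.0, 0.7, 1.5], dtype=np.float32)

w_min, w_max = float(weights.min()), float(weights.max())
scale = (w_max - w_min) / 255.0          # step size between quantized levels
zero_point = int(round(-w_min / scale))  # uint8 code that represents 0.0

quantized = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print(quantized)    # 8-bit codes, e.g. [  0  75 113 179 255]
print(dequantized)  # close to the original weights, within one scale step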

Comparing Quantization Techniques

  1. FP-16 Quantization:
    FP-16 quantization reduces the model’s precision to 16-bit floating-point numbers. This technique can result in smaller model sizes and faster computations due to the lower precision arithmetic. We will compare its accuracy with the baseline Keras model.
  2. Dynamic Range Quantization:
    Dynamic Range Quantization stores the model’s weights as 8-bit integers, while activations are quantized dynamically at runtime based on their observed range. It requires no calibration data and often strikes a good balance between model size reduction and accuracy preservation.
  3. Integer Quantization:
    Integer Quantization is the most aggressive technique, using 8-bit integers for both weights and activations. It requires a small representative dataset to calibrate the activation ranges, and it offers the smallest model size and the fastest inference time, though with a potentially more noticeable impact on model accuracy.

Building the Base Model

We will build a simple image classifier to distinguish between images of cats and dogs. This classifier will serve as the base model for our comparison of different optimization techniques. Our goal is to optimize the model’s efficiency while maintaining its accuracy, making it suitable for deployment on resource-constrained devices.

Importing Necessary Libraries and Packages

To begin, we’ll import the essential libraries and packages required for our image classification task. These include TensorFlow, Keras, NumPy, and other dependencies that facilitate model training and evaluation.

# Importing necessary libraries and packages.
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_model_optimization as tfmot
from tensorflow.keras.layers import Dropout, Dense, BatchNormalization
%load_ext tensorboard

Dataset Preparation

We will load the Cats vs. Dogs dataset using the tfds.load() function from the TensorFlow Datasets (TFDS) library. The function returns three datasets (train_ds, val_ds, and test_ds) for training, validation, and testing, respectively, along with additional information about the dataset stored in the info variable.
We split the dataset into three parts: 70% for training, 20% for validation, and 10% for testing.

# Loading the Cats vs. Dogs dataset.
(train_ds, val_ds, test_ds), info = tfds.load(
    'cats_vs_dogs',
    split=['train[:70%]', 'train[70%:90%]', 'train[90%:]'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True
)

Now, let’s gather important information about the dataset, such as the number of classes, class names, and the number of images in the training, validation, and testing sets. This information is helpful for understanding the characteristics of the dataset and setting up the machine learning pipeline correctly.

# Obtaining dataset information.

# Printing the number of classes in the dataset.
print("Number of Classes: " + str(info.features['label'].num_classes))

# Printing the names of the classes in the dataset.
print("Classes: " + str(info.features['label'].names))

# Calculating the number of training images.
NUM_TRAIN_IMAGES = tf.data.experimental.cardinality(train_ds).numpy()
print("Training Images: " + str(NUM_TRAIN_IMAGES))

# Calculating the number of validation images.
NUM_VAL_IMAGES = tf.data.experimental.cardinality(val_ds).numpy()
print("Validation Images: " + str(NUM_VAL_IMAGES))

# Calculating the number of testing images.
NUM_TEST_IMAGES = tf.data.experimental.cardinality(test_ds).numpy()
print("Testing Images: " + str(NUM_TEST_IMAGES))
Output

Number of Classes: 2
Classes: ['cat', 'dog']
Training Images: 16283
Validation Images: 4653
Testing Images: 2326

We will use the show_examples() function from the TensorFlow Datasets (TFDS) visualization module to display examples from the training dataset.
The function generates a grid of sample images from the training dataset and displays them along with their corresponding labels.

# Visualizing the training dataset.

# Using TensorFlow Datasets' visualization function to display examples from the training dataset.
visual = tfds.visualization.show_examples(train_ds, info)

Resizing the images to a fixed size is necessary because most neural network architectures expect inputs of a specific size. We set the batch size to 16, so the model processes 16 images per batch during training, and we resize the images to 224x224 pixels, a common input size for popular image classification models such as VGG, ResNet, and MobileNet.

# Defining batch-size and input image size.
batch_size = 16
img_size = [224, 224]

# Resizing images in the dataset.
train_ds = train_ds.map(lambda x, y: (tf.image.resize(x, img_size), y))
val_ds = val_ds.map(lambda x, y: (tf.image.resize(x, img_size), y))
test_ds = test_ds.map(lambda x, y: (tf.image.resize(x, img_size), y))

For our preprocessing steps, we will make use of caching, batching, and prefetching. Caching keeps decoded examples in memory after the first epoch, batching groups examples so the model processes several at once, and prefetching overlaps data preparation with model execution.

Together, these steps optimize data loading and training efficiency, which is essential when training deep learning models on large datasets.

# Preprocessing steps

# Caching the training dataset.
train_ds = train_ds.cache()

# Batching the training dataset.
train_ds = train_ds.batch(batch_size)

# Prefetching the training dataset.
train_ds = train_ds.prefetch(buffer_size=10)


# Caching the validation dataset.
val_ds = val_ds.cache()

# Batching the validation dataset.
val_ds = val_ds.batch(batch_size)

# Prefetching the validation dataset.
val_ds = val_ds.prefetch(buffer_size=10)


# Caching the testing dataset.
test_ds = test_ds.cache()

# Batching the testing dataset.
test_ds = test_ds.batch(batch_size)

# Prefetching the testing dataset.
test_ds = test_ds.prefetch(buffer_size=10)

One note on ordering: cache() should come before batch() so that individual decoded examples are cached, and prefetch() should be the last step so that data preparation overlaps with training. Beyond that, the pipeline can be adapted to your specific use case; an equivalent chained form is shown below.
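As a side note, the same pipeline is often written as a single chain, with tf.data.AUTOTUNE letting TensorFlow pick the prefetch buffer size. A minimal equivalent sketch (this replaces the step-by-step version above; do not run both):

AUTOTUNE = tf.data.AUTOTUNE

# Equivalent chained form of the pipeline above.
train_ds = train_ds.cache().batch(batch_size).prefetch(AUTOTUNE)
val_ds = val_ds.cache().batch(batch_size).prefetch(AUTOTUNE)
test_ds = test_ds.cache().batch(batch_size).prefetch(AUTOTUNE)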

Next, we need to extract and save test images and labels from the test dataset.

# Extracting and saving test images and labels from the test dataset.
test_images = []
test_labels = []

# Unbatching the test dataset to iterate over individual images and labels.
for image, label in test_ds.unbatch():
    test_images.append(image)
    test_labels.append(label)

The result of this code will be two lists, test_images and test_labels, which will contain all the test images and their corresponding labels, respectively. These lists can be further used for evaluation or any other analysis after the model has been trained.

Alternatively, we can use generators or other data loading mechanisms to process the test dataset in batches during evaluation, as in the sketch below.
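For example, here is a minimal alternative sketch that collects the test set batch-by-batch into NumPy arrays (the test_images_np and test_labels_np names are ours; the rest of this post keeps using the lists built above):

# Alternative: materialize the test set as NumPy arrays, one batch at a time,
# instead of appending individual images in a Python loop.
test_images_np = np.concatenate([x.numpy() for x, _ in test_ds], axis=0)
test_labels_np = np.concatenate([y.numpy() for _, y in test_ds], axis=0)

print(test_images_np.shape)  # (2326, 224, 224, 3)
print(test_labels_np.shape)  # (2326,)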

Model Loading

We will define the model architecture using the EfficientNetB0 model from Keras, and then set all the layers in the EfficientNetB0 model as trainable.

# Defining the model architecture.

# Using EfficientNetB0 as the base model, excluding the top layer (classification layer).
# Loading pre-trained weights from ImageNet.
# Specifying the input shape to (224, 224, 3) and using global max pooling for feature extraction.
efnet = tf.keras.applications.EfficientNetB0(
    include_top=False,
    weights='imagenet',
    input_shape=(224, 224, 3),
    pooling='max'
)

# Unfreezing all the layers of the model.
for layer in efnet.layers:
    layer.trainable = True

We create an instance of the EfficientNetB0 model. The include_top=False argument means that the final fully connected layers for classification will not be included.

We then load pre-trained weights from the ImageNet dataset, which helps in initializing the model with weights learned from a large dataset of images, and specify the input shape of the images expected by the model.

EfficientNetB0 expects images of size 224x224 with 3 color channels (RGB).

Setting the layers as trainable lets us fine-tune the entire pre-trained model on our specific task. Since EfficientNetB0 is pre-trained on ImageNet, its layers already encode feature representations that transfer well to other image tasks; unfreezing them allows those features to adapt to our data. When compute is limited, a common alternative is to unfreeze only the top of the backbone, as sketched below.
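A minimal sketch of that alternative, assuming an arbitrary cutoff of 20 layers (we do not use this variant in the rest of the post):

# Alternative: freeze all but the last 20 layers of the backbone and
# fine-tune only the top. The cutoff of 20 is arbitrary, for illustration.
for layer in efnet.layers[:-20]:
    layer.trainable = False
for layer in efnet.layers[-20:]:
    layer.trainable = True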

Dense, BatchNormalization, and Dropout layers are added on top of the base model (EfficientNetB0). These additional layers are used to fine-tune the model for the specific task at hand (binary classification of Cats vs. Dogs).

# Adding Dense, BatchNormalization, and Dropout layers to the base model.

# Connecting a Dense layer with 512 units and ReLU activation to the output of the base model.
x = Dense(512, activation='relu')(efnet.output)

# Adding a BatchNormalization layer after the Dense layer.
x = BatchNormalization()(x)

# Connecting another Dense layer with 64 units and ReLU activation.
x = Dense(64, activation='relu')(x)

# Adding a Dropout layer with a rate of 0.2 (20% of the units will be randomly set to 0 during training).
x = Dropout(0.2)(x)

# Connecting the final Dense layer with 2 units (output classes) and using softmax activation for binary classification.
predictions = Dense(2, activation='softmax')(x)

This will create a custom model that takes images as input, applies feature extraction using EfficientNetB0, and then adds several fully connected layers with BatchNormalization and Dropout for further feature learning and regularization. Finally, the model produces the probability distribution over the two classes using the softmax activation function.

Compiling the Model

We create the final model by defining its input and output layers. The input layer of the model is the same as the input of the efnet (EfficientNetB0) base model. The output layer of the model is the predictions layer that we defined earlier, which represents the final predictions of the model.

We will compile the model, specifying the optimizer, loss function, and evaluation metrics to be used during training. In our case:

  • optimizer=tf.keras.optimizers.Adam(0.0001): The Adam optimizer is used with a learning rate of 0.0001. Adam is an adaptive learning rate optimization algorithm that is commonly used in deep learning models.
  • loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False): The sparse categorical cross-entropy loss works with integer class labels, as provided by our dataset. from_logits=False tells the loss that the model's outputs are already probabilities (our final layer applies softmax), so no additional softmax is applied when computing the loss.
  • metrics=["accuracy"]: The accuracy metric is used to evaluate the model's performance during training.

We then print a summary of the model architecture, showing the layer names, output shapes, and the number of trainable parameters.

# Defining the input and output layers of the model.
model = Model(inputs=efnet.input, outputs=predictions)

# Compiling the model.
model.compile(optimizer=tf.keras.optimizers.Adam(0.0001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=["accuracy"])

# Obtaining the model summary.
model.summary()
Output

Total params: 4,740,453
Trainable params: 4,697,406
Non-trainable params: 43,047

We will now have a fully compiled model with the EfficientNetB0 base architecture, followed by additional layers for our binary classification task. The model is ready for training using the specified optimizer, loss function, and evaluation metric.

Training Model

Let’s set up a ModelCheckpoint callback and then train the model for 10 epochs using the callback.

The callback is used to save the model’s weights during training to a specified file, allowing us to save the best model or specific checkpoints at different stages of training.

# Creating the ModelCheckpoint callback. The .h5 extension makes Keras save the weights in HDF5 format.
checkpoint_callback = ModelCheckpoint(filepath='/content/model_checkpoint.h5', save_weights_only=True)

# Training the model for 10 epochs using the checkpoint callback.
# Note: train_ds and val_ds are already batched, so Keras simply iterates
# over all batches each epoch; no steps_per_epoch is needed.
model.fit(train_ds, epochs=10,
          validation_data=val_ds,
          shuffle=False, callbacks=[checkpoint_callback])

At the end of each epoch, the ModelCheckpoint callback saves the model’s weights to the specified file “model_checkpoint.h5”.

Evaluating Model

The trained model will be evaluated on the test dataset to calculate the test accuracy.

# Evaluating the model on the test dataset.
_, baseline_model_accuracy = model.evaluate(test_ds, verbose=0)

# Printing the baseline test accuracy in percentage.
print('The Baseline test accuracy:', baseline_model_accuracy * 100)
Output

The Baseline test accuracy: 97.54 %

We calculated the accuracy of the model on the test dataset, which provides an indication of how well the model performs on unseen data. The test accuracy is essential to assess the generalization capability of the model and to determine how well it can make predictions on new, previously unseen data.

Exploring Different Quantization Techniques with TFLite

  1. Float-16 Quantization

We will pass the trained Keras model to the TensorFlow Lite (TF Lite) Converter to convert it to a TensorFlow Lite model with float16 quantization.

# Passing the Keras model to the TF Lite Converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Using float-16 quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

# Converting the model.
tflite_fp16_model = converter.convert()

# Saving the model.
with open('/content/fp_16_model.tflite', 'wb') as f:
    f.write(tflite_fp16_model)

This will give us a TensorFlow Lite model with float16 quantization saved in the file “fp_16_model.tflite”. This quantized model is optimized for reduced size and improved performance on hardware that supports float16 computations.
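Float16 quantization stores weights at half precision, so the file on disk should be roughly half the size of the float32 baseline. As a quick sanity check (a sketch; the baseline_model.h5 path is our own choice here, and exact numbers depend on the model and TF version):

# Saving the baseline Keras model so we can compare file sizes on disk.
model.save('/content/baseline_model.h5')

baseline_mb = os.path.getsize('/content/baseline_model.h5') / (1024 * 1024)
fp16_mb = os.path.getsize('/content/fp_16_model.tflite') / (1024 * 1024)

print(f'Baseline Keras model: {baseline_mb:.1f} MB')
print(f'Float16 TFLite model: {fp16_mb:.1f} MB')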

def evaluate(interpreter):
    prediction = []
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    # Note: the dtype must come from the input details, not the output details.
    input_format = interpreter.get_input_details()[0]['dtype']

    for i, test_image in enumerate(test_images):
        if i % 100 == 0:
            print('Evaluated on {n} results so far.'.format(n=i))
        # Adding a batch dimension and casting to the interpreter's input dtype.
        test_image = np.expand_dims(test_image, axis=0).astype(input_format)
        interpreter.set_tensor(input_index, test_image)

        # Run inference.
        interpreter.invoke()
        output = interpreter.tensor(output_index)
        predicted_label = np.argmax(output()[0])
        prediction.append(predicted_label)

    print('\n')
    # Comparing prediction results with ground truth labels to calculate accuracy.
    prediction = np.array(prediction)
    accuracy = (prediction == np.array(test_labels)).mean()
    return accuracy

This evaluate() function allows us to run a TensorFlow Lite model over the test images and obtain its accuracy on the test dataset.

The FP-16 quantized TensorFlow Lite model is then loaded into an interpreter, and then the model is evaluated on the test dataset using the previously defined evaluate function.

Let’s obtain the test accuracy of both the FP-16 quantized TensorFlow Lite model and the baseline Keras model.

# Passing the FP-16 TF Lite model to the interpreter.
interpreter = tf.lite.Interpreter('/content/fp_16_model.tflite')

# Allocating tensors.
interpreter.allocate_tensors()

# Evaluating the model on the test dataset.
test_accuracy_fp_16 = evaluate(interpreter)

# Printing the test accuracy for the FP-16 quantized TFLite model and the baseline Keras model.
print('Float 16 Quantized TFLite Model Test Accuracy:', test_accuracy_fp_16*100)
print('Baseline Keras Model Test Accuracy:', baseline_model_accuracy*100)
Output

Float 16 Quantized TFLite Model Test Accuracy: 97.55%
Baseline Keras Model Test Accuracy: 97.54%

2. Dynamic Range Quantization

# Passing the baseline Keras model to the TF Lite Converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Using Dynamic Range Quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Converting the model.
tflite_quant_model = converter.convert()

# Saving the model.
with open('/content/dynamic_quant_model.tflite', 'wb') as f:
    f.write(tflite_quant_model)

We will have a TensorFlow Lite model with Dynamic Range Quantization saved in the file “dynamic_quant_model.tflite”. This quantized model is optimized for reduced size and improved performance on various hardware platforms, including those that support 8-bit integer operations.

Let’s obtain the test accuracy of both the Dynamically Quantized TensorFlow Lite model and the baseline Keras model on the same test images.

# Passing the Dynamically Quantized TF Lite model to the Interpreter.
interpreter = tf.lite.Interpreter('/content/dynamic_quant_model.tflite')

# Allocating tensors.
interpreter.allocate_tensors()

# Evaluating the model on the test images.
test_accuracy_dynamic = evaluate(interpreter)

# Printing the test accuracy for the Dynamically Quantized TFLite model and the baseline Keras model.
print('Dynamically Quantized TFLite Model Test Accuracy:', test_accuracy_dynamic*100)
print('Baseline Keras Model Test Accuracy:', baseline_model_accuracy*100)
Output

Dynamically Quantized TFLite Model Test Accuracy: 97.16%
Baseline Keras Model Test Accuracy: 97.54%

3. Integer Quantization

# Passing the baseline Keras model to the TF Lite Converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Defining the representative dataset from a sample of the test images.
# (Calibration data should reflect what the model sees at inference time;
# a sample of the training set is an equally common choice.)
def representative_data_gen():
    for input_value in tf.data.Dataset.from_tensor_slices(test_images).batch(1).take(100):
        yield [input_value]

# Set the representative dataset for post-training quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen

# Using Integer Quantization.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

# Setting the input and output tensors to uint8.
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

# Converting the model.
int_quant_model = converter.convert()

# Saving the Integer Quantized TF Lite model.
with open('/content/int_quant_model.tflite', 'wb') as f:
    f.write(int_quant_model)

We will have an Integer Quantized TensorFlow Lite model saved in the file “int_quant_model.tflite”. This quantized model can be used for efficient on-device inference on resource-constrained devices with minimal loss in accuracy.
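Because we set inference_input_type to tf.uint8, the integer model now expects raw uint8 pixels. Our evaluate() function simply casts inputs to the interpreter's input dtype, which works here because the resized images are already in the 0-255 range. More generally, the safe approach (a minimal sketch) is to apply the input tensor's quantization parameters explicitly:

# Inspecting the input quantization parameters of the integer model.
interpreter = tf.lite.Interpreter(model_path='/content/int_quant_model.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
scale, zero_point = input_details['quantization']
print(input_details['dtype'], scale, zero_point)

# Quantizing one float image into the uint8 domain the model expects.
# (The scale and zero_point values depend on the calibration data.)
image = test_images[0].numpy()
quantized_image = np.clip(image / scale + zero_point, 0, 255).astype(np.uint8)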

Let’s obtain the test accuracy of the Integer Quantized TFLite model and the baseline Keras model on the same test images.

# Passing the Integer Quantized TF Lite model to the Interpreter.
interpreter = tf.lite.Interpreter('/content/int_quant_model.tflite')

# Allocating tensors.
interpreter.allocate_tensors()

# Evaluating the model on the test images.
test_accuracy_integer = evaluate(interpreter)

# Printing the test accuracy for the Integer Quantized TFLite model and the baseline Keras model.
print('Integer Quantized TFLite Model Test Accuracy:', test_accuracy_integer*100)
print('Baseline Keras Model Test Accuracy:', baseline_model_accuracy*100)
Output

Integer Quantized TFLite Model Test Accuracy: 94.88%
Baseline Keras Model Test Accuracy: 97.54%

Test Accuracy Comparison

To compare the test accuracies of the base model, FP-16 quantized model, dynamic range quantized model, and integer quantized model, we will use a bar chart for the visualization.

import matplotlib.pyplot as plt

# Test accuracies of the models
test_accuracies = [baseline_model_accuracy * 100, test_accuracy_fp_16 * 100, test_accuracy_dynamic * 100, test_accuracy_integer * 100]

# Model names for the x-axis labels
model_names = ['Baseline Model', 'FP-16 Quantized', 'Dynamic Range Quantized', 'Integer Quantized']

# Create the bar chart
plt.figure(figsize=(10, 6))
plt.bar(model_names, test_accuracies, color=['blue', 'green', 'orange', 'red'])
plt.ylim(0, 100) # Set the y-axis limit to show accuracy as a percentage (0 to 100%)
plt.xlabel('Model')
plt.ylabel('Test Accuracy (%)')
plt.title('Test Accuracy Comparison')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

# Display the bar chart
plt.show()
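Accuracy is only half of the picture; the main payoff of quantization is a smaller model file. As a final check (a sketch, assuming the file paths used above), we can compare the on-disk sizes of the three TFLite models:

# Comparing the on-disk sizes of the three quantized TFLite models.
for name, path in [('FP-16', '/content/fp_16_model.tflite'),
                   ('Dynamic Range', '/content/dynamic_quant_model.tflite'),
                   ('Integer', '/content/int_quant_model.tflite')]:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f'{name} quantized model: {size_mb:.1f} MB')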

Conclusion

We’ve explored the importance of model optimization for deploying machine learning models on resource-constrained devices. We’ve demonstrated how quantization techniques offered by TensorFlow Lite can significantly improve model efficiency by reducing model size and speeding up inference times.

Through a comparison of three quantization techniques — FP-16 quantization, Dynamic Range Quantization, and Integer Quantization — we’ve illustrated the trade-offs between model accuracy and efficiency. By understanding the strengths and limitations of each technique, we can make informed decisions when optimizing our models for on-device machine learning applications.

In summary, TensorFlow Lite’s quantization techniques provide valuable tools for striking the right balance between model efficiency and accuracy, enabling the deployment of powerful machine learning models on a wide range of devices.

References

https://www.tensorflow.org/lite
https://learnopencv.com/
