From Dev to Production: Deploying HuggingFace BERT with KServe

The Future of NLP Deployment: BERT Models and KServe in Action

Vinayak Shanawad
8 min read · Sep 11, 2023

In this post, I will demonstrate how to deploy a HuggingFace pre-trained model (BERT for text classification with the Hugging Face Transformers library) to run as a KServe-hosted model.

First, let’s understand what is KServe and why we need KServe.

🤔What is KServe?

KServe was initially called KFServing (KubeFlow Serving) and was designed so that model serving could be operated in a standardized way across frameworks right out of the box. There was a need for a model serving system that could easily run on existing Kubernetes and Istio stacks and also provide model explainability, inference graph operations, and other model management functions.

Model serving using KServe (Photo by Kubeflow)

🤷‍♂️Why KServe?

  • KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases.
  • Provides a performant, standardized inference protocol across ML frameworks (TensorFlow, XGBoost, Scikit-learn, PyTorch, and ONNX).
  • Supports modern serverless inference workloads with autoscaling, including scale-to-zero on GPUs.
  • Offers simple, pluggable production serving covering prediction, pre/post-processing, monitoring, and explainability.
  • Enables advanced deployments with canary rollouts, experiments, ensembles, and transformers.

🛠️Setting Up KServe

To demo the Hugging Face model on KServe, we'll use the local (Windows OS) quick-install method on a minikube Kubernetes cluster. The standalone "quick install" sets up Istio and Knative for us without having to install all of Kubeflow and the extra components that tend to slow down local demo installs.

Once our local minikube installation is complete, let's start the minikube cluster.

# Start the minikube cluster
minikube start

# Check the status of minikube cluster
minikube status

The second command should give us the output below if our cluster is healthy:

Minikube cluster status (Photo by Author)

First, we need to get a copy of the KServe repository on our local system. Use git bash to clone the KServe repository.

cd kubeflow
git clone https://github.com/kserve/kserve.git

The install script can't download Istio 1.17.2 on Windows due to some issue, so instead we download Istio 1.17.2 manually from the Istio release page for Windows. Extract the istio-1.17.2-win.zip file and place the istio-1.17.2 folder under the kubeflow directory.
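If you prefer the command line, something like the command below should fetch the archive (the download URL is assumed from the standard Istio release asset naming, so double-check it against the release page):

# Download the Istio 1.17.2 release archive for Windows
curl -L -o istio-1.17.2-win.zip https://github.com/istio/istio/releases/download/1.17.2/istio-1.17.2-win.zip

Extract the archive with your tool of choice and move the resulting istio-1.17.2 folder into the kubeflow directory before running the install script.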

cd kubeflow/kserve
./hack/quick_install.sh

This will install KServe along with its core dependencies, such as Knative Serving, all with the same install script. The install takes around 30–60 seconds, depending on your system.

Note: Sometimes the installer will fail because a component has not yet been completely installed; just run the installer a second time if you see failures in the console logs.

KServe installation logs (Photo by Author)

Once our installation is complete, we can confirm that the KServe install is working on our minikube cluster with the following command:

kubectl get pods -n kserve
KServe control manager (Photo by Author)

We can also list all the pods in the minikube cluster:

kubectl get pods -A
Pods status (Photo by Author)

🚀Deploying the Custom HuggingFace Model Server on KServe

There are two main ways to deploy a model as an InferenceService on KServe:

  1. Deploy the saved model with a pre-built model server on a pre-existing image
  2. Deploy a saved model already wrapped in a pre-existing container as a custom model server

Most of the time we want to deploy on a pre-built model server as this will create the least amount of work for our engineering team.

There are many pre-built model servers included with KServe out of the box. Our built-in model server options are:

  1. tensorflow
  2. sklearn
  3. pytorch
  4. onnx
  5. tensorrt
  6. xgboost

Sometimes we’ll have a model that will not wire up correctly with the pre-built images. The reasons this could happen include:

  • Model built with different dependency versions than the model server
  • Model not saved in the file format the model server expects
  • Model was built with a new/custom framework not yet supported by KServe
  • Model is in a container image that has a REST interface different from the TensorFlow V1 HTTP API that KServe expects

For any of the cases above, we have three options for deploying our model:

  1. Wrap our custom model in our own container, where our container runs its own web server to expose the model endpoint
  2. Use the KServe Model Server as the web server (with its standard TensorFlow V1 API) and then overload the load() and predict() methods
  3. Deploy a pre-built container image with a custom REST API, bypassing InferenceService and sending the HTTP request directly to the predictor

Of the three options, using the KServe Model Server and overloading its methods will likely be the most popular route for folks who just want to deploy a custom model.

Given that Hugging Face has a unique Python API and a lot of dependencies, it does not work on KServe out of the box. In this case, we need to do two key tasks:

  1. Create a new Python class that inherits from the KServe Model class, with custom methods for load() and predict()
  2. Build a custom container image and then store it in a container repository

The remainder of this post will be focused on:

  1. Building a custom Python kserve.Model with the Hugging Face BERT model wired in
  2. Building a Docker container with the custom Python kserve.Model and pushing it to Docker Hub
  3. Deploying the custom InferenceService to our minikube Kubernetes cluster
  4. Testing the KServe-hosted Hugging Face model

Now let’s get to work building out our custom text classification InferenceService on KServe.

1. Building a Custom Python Model Server

In the code below we can see our custom Model with the Hugging Face code wired into the load() and predict() methods.

from typing import Dict

import kserve
import torch

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from kserve import ModelServer
import logging


class KServeBERTSentimentModel(kserve.Model):

    def __init__(self, name: str):
        super().__init__(name)
        KSERVE_LOGGER_NAME = 'kserve'
        self.logger = logging.getLogger(KSERVE_LOGGER_NAME)
        self.name = name
        self.ready = False

    def load(self):
        # Build tokenizer and model
        name = "distilbert-base-uncased-finetuned-sst-2-english"
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
        self.ready = True

    def predict(self, request: Dict, headers: Dict) -> Dict:
        sequence = request["sequence"]
        self.logger.info(f"sequence:-- {sequence}")

        inputs = self.tokenizer(
            sequence,
            return_tensors="pt",
            max_length=128,
            padding="max_length",
            truncation=True,
        )

        # Run prediction
        with torch.no_grad():
            predictions = self.model(**inputs)[0]
            scores = torch.nn.Softmax(dim=1)(predictions)

        results = [
            {"label": self.model.config.id2label[item.argmax().item()], "score": item.max().item()}
            for item in scores
        ]
        self.logger.info(f"results:-- {results}")

        # Return a dictionary, which will be JSON serializable
        return {"predictions": results}


if __name__ == "__main__":
    model = KServeBERTSentimentModel("bert-sentiment")
    model.load()

    model_server = ModelServer(http_port=8080, workers=1)
    model_server.start([model])

There are two things happening in the above code with respect to integrating with the model server:

  1. The Hugging Face BERT model is loaded in the load(…) method
  2. The predict(...) method takes incoming inference input from the REST call and passes it to the Hugging Face AutoModelForSequenceClassification model instance

The Hugging Face model we're using here is "distilbert-base-uncased-finetuned-sst-2-english". The model and its associated tokenizer are loaded from pre-trained checkpoints via the Hugging Face Transformers library.
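As a quick sanity check outside of KServe, we can run the same checkpoint through the Transformers pipeline API before wrapping it in the model server (the example sentence is arbitrary):

# Optional: verify the checkpoint locally with the high-level pipeline API
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Prints something like: [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("KServe makes model serving much easier!"))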

2. Building a new Docker image for the Model Server

Once our model serving code above is saved locally, we will build a new Docker container image with the code and the required dependencies packaged inside. Examples of the container build command and the push to a container repository (here, Docker Hub) are shown below.
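The exact Dockerfile is up to you; a minimal sketch might look like the one below, assuming the model server code from section 1 is saved as bert_sentiment_model.py and the dependencies (kserve, torch, transformers) are listed in requirements.txt (both file names are placeholders):

# Minimal Dockerfile sketch; adjust file names and versions to your project
FROM python:3.9-slim

WORKDIR /app

# Install the model server dependencies (e.g. kserve, torch, transformers)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the custom model server code from section 1
COPY bert_sentiment_model.py .

# Our ModelServer listens on port 8080
EXPOSE 8080

ENTRYPOINT ["python", "bert_sentiment_model.py"]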

Build the new container with our custom code and then send it over to the container repository of your choice:

# Build the container on your local machine
docker build -t kserve-custom-model .

# Tag and push the container to your docker registry
docker tag kserve-custom-model {username}/kserve-model-repo:v1.0
docker push {username}/kserve-model-repo:v1.0

For those who would prefer to use a pre-built version of this container and skip the coding and Docker steps, just use my container on Docker Hub.

Now let’s move on to deploying our model server in our container as an InferenceService on KServe.

3. Deploying Custom Model Server on KServe with kubectl

Given that KServe treats models as infrastructure, we deploy a model on KServe with a yaml file that describes the k8s model resource (here, an InferenceService) as a custom object. The code listing below shows the yaml file used to create our custom InferenceService object on the local k8s cluster.

We need to set four parameters to uniquely identify the model:

  • apiVersion: “serving.kserve.io/v1beta1”
  • kind: “InferenceService”
  • metadata.name: [the model’s unique name inside the namespace]
  • metadata.namespace: [the namespace your model will live in]

Here we’re using the generic kserve-custom-model as our metadata.name and our model will be created in the default namespace.

Towards the end of the spec, we ask Kubernetes to schedule our container with 4GB of RAM, as Hugging Face models tend to take up a lot of space in memory.
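A sketch of that yaml file is shown below; the container image and the exact resource values are assumptions, so substitute the image you pushed in the previous step:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: kserve-custom-model
  namespace: default
spec:
  predictor:
    containers:
      - name: kserve-container
        # Replace with the image you pushed to your registry
        image: "{username}/kserve-model-repo:v1.0"
        resources:
          requests:
            memory: 4Gi   # the 4GB of RAM mentioned above
          limits:
            memory: 4Gi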

Once we have our yaml file configured we can create the Kubernetes object with Kubectl as shown below.

kubectl apply -f deploy_bert_sentiment.yaml

Once we run the above kubectl command, we should have a working InferenceService running on our local kubernetes cluster. We can check the status of our model with the kubectl command:

kubectl get inferenceservices

This should give us output as shown below.

Custom model status (Photo by Author)

Deploying a custom model on KServe is not as easy as using a pre-built model server, but as we've seen so far, it's not terrible either.🚀🚀

4. Test the KServe-hosted HuggingFace Model

Now let's make an inference call to our locally hosted Hugging Face sentiment analysis model on KServe. First, we need to do some port forwarding so that our model's port is exposed on our local system:

kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80

We'll use curl to send the input.json file to the predict method of our custom Hugging Face InferenceService on KServe:

curl -v -H "Host: kserve-custom-model.default.example.com" http://localhost:8080/v1/models/bert-sentiment:predict  -d @./input.json
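The predict() method in our model server reads the "sequence" key from the request body, so a minimal input.json could look like this (the example text is arbitrary):

{
  "sequence": "I loved this movie, the acting was fantastic!"
}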

The response will look like:

Model inference (Photo by Author)
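Based on the dictionary returned by predict(), the response body has the following shape (the score shown here is illustrative):

{
  "predictions": [
    {
      "label": "POSITIVE",
      "score": 0.9998
    }
  ]
}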

This example has shown how to take a non-trivial NLP model and host it as a custom InferenceService on KServe.


Vinayak Shanawad

Machine Learning Engineer | 3x Kaggle Expert | MLOps | LLMOps | Learning, improving and evolving.