Real-time Face Detection at 30 FPS on CPU using Mediapipe in Python: A Step-by-Step Guide.

AiPhile
8 min read · Jul 1, 2023

Curious about computer vision and face detection? In this beginner’s guide, we’ll explore real-time face detection using Mediapipe and Python. Discover how to leverage the powerful combination of Mediapipe and Python to detect faces at an impressive rate of 30 FPS on the CPU. Let’s get into it.

Photo by Luana Freitas from Pexels, with the face detected.

A Bit About MediaPipe

MediaPipe is an open-source framework for building cross-platform, customizable machine-learning solutions for live and streaming media. It can be used to perform a variety of computer vision tasks, such as face detection (our case here), pose estimation, and object tracking. MediaPipe is available for Android, iOS, desktop, and embedded devices; you can read more about it in the official documentation.
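To give a feel for what ships with the library, here is a quick sketch of a few of the ready-made solutions exposed under mp.solutions (we only use face_detection in this guide):

import mediapipe as mp

# A few of the ready-made solutions MediaPipe exposes
# (we only use face_detection in this guide):
print(mp.solutions.face_detection)  # face detection
print(mp.solutions.face_mesh)       # dense facial landmarks
print(mp.solutions.pose)            # body pose estimation
print(mp.solutions.hands)           # hand tracking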

Setting up a Dev Environment

First, you need to install Python, which you can find here. Then you can use pip (Python’s package manager, for short) to install the required modules/packages like Mediapipe. After that, choose your preferred IDE for coding; here is my setup for computer vision in case you want to try it out.

You can use this command to install the MediaPipe module using the Terminal

pip install mediapipe

Alongside Mediapipe, several other modules/libraries will be installed, because MediaPipe depends on them. One of them is OpenCV, which is used for computer vision and image processing; here we will use OpenCV to get footage from the webcam so we can detect the face in each frame/image.
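A quick way to check that the install worked is to import both libraries and print their versions (a minimal sanity check; the version numbers on your machine will differ):

import cv2 as cv
import mediapipe as mp

# If these imports succeed, the installation worked.
print("OpenCV:", cv.__version__)
print("MediaPipe:", mp.__version__)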

Let's Capture Video Footage from the Camera

Here is the script that grabs footage from the camera and displays it in a window. It also calculates the frames per second (FPS) and draws it on the window, so later on it will be easier to compare the raw camera feed against the feed with face detection running.

For better understanding, I have added comments to the script to explain each line.

import cv2 as cv  # Import the OpenCV library
import time  # Import the time library

# Create a VideoCapture object called cap
# (use index 0 if index 1 does not match your webcam)
cap = cv.VideoCapture(1)

# Start the timer
start_time = time.time()

# Initialize the frame counter
frame_counter = 0

# This is an infinite loop that will continue to run until the user presses the `q` key
while True:

    # Increment the frame counter
    frame_counter += 1

    # Read a frame from the webcam
    ret, frame = cap.read()

    # If the frame was not successfully captured, break out of the loop
    if not ret:
        break

    # Calculate the FPS
    fps = frame_counter / (time.time() - start_time)

    # Display the FPS on the frame
    cv.putText(frame, f"FPS: {fps:.3f}", (30, 30), cv.FONT_HERSHEY_PLAIN, 1.5, (0, 255, 255), 2, cv.LINE_AA)

    # Display the frame on the screen
    cv.imshow("frame", frame)

    # Check if the user has pressed the `q` key; if yes, close the program
    key = cv.waitKey(1)
    if key == ord("q"):
        break

# Release the VideoCapture object
cap.release()

# Close all open windows
cv.destroyAllWindows()

Run the above script to test whether your webcam is working correctly.

Here is the output:

Camera feed.

Let’s Explore the Face Detection System

Imports / Python Modules

First of all, we need to import the modules. Here we are importing cv2 as cv, the computer vision/image processing library that will help us process the image as we access the webcam, as in the code above.

The time module, import time, is used to track time so we can calculate the FPS.

MediaPipe is our main module, used here for face detection; it has different models for different solutions like face detection, landmark detection, pose estimation, etc.

The last import is NumPy, which is used for scientific computing. Here we will perform multiplication on arrays, and NumPy makes that easy. It is also fast compared to plain Python because its performance-critical core is written in C, with a convenient Python interface on top. This allows NumPy to be both fast and easy to use.

import cv2 as cv 
import time
import mediapipe as mp
import numpy as np
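As a tiny illustration of the elementwise multiply we will use later to scale normalized coordinates into pixels (the numbers here are made up for the example):

import numpy as np

# A normalized box [xmin, ymin, width, height] scaled to a 640x480 frame:
box = np.multiply([0.25, 0.2, 0.4, 0.5], [640, 480, 640, 480]).astype(int)
print(box)  # [160  96 256 240]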

Configure Face Detection Model

mp_face_detection = mp.solutions.face_detection
with mp_face_detection.FaceDetection(model_selection=1, min_detection_confidence=0.5) as face_detector:

The code here initializes the face detection solution from Mediapipe. There are multiple pre-trained solutions available, but here we are choosing the face detection model.

Create the face detection object/Configure model.

with mp_face_detection.FaceDetection(model_selection=1, min_detection_confidence=0.5) as face_detector:

Here we open the face detection model with the with statement [context manager] and configure it with two parameters, model_selection and min_detection_confidence. The model_selection parameter lets us select between the available models: model 0 is very fast but is a short-range model, best for faces close to the camera, while model 1 is more reliable at detection at the cost of some detection speed (FPS). If your PC is on the high-end side, you won’t notice any speed difference between the two at all, only the detection quality. I have selected model 1, which is quite accurate: model_selection=1. The other parameter, min_detection_confidence=0.5, specifies how confident the model must be to report a result; I have set it to 0.5, so only detections the model is at least 50% sure are faces will be returned.
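For comparison, switching to the faster short-range model is a one-parameter change (a sketch; the rest of the pipeline stays exactly the same):

# Short-range model: faster, best for faces close to the camera.
with mp_face_detection.FaceDetection(
    model_selection=0, min_detection_confidence=0.5
) as face_detector:
    pass  # process frames exactly as shown below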

Process the Image/Frame and Get the Raw Data from MediaPipe

When the image is passed to MediaPipe for processing/face detection, it’s recommended to change the color format of the image first, since OpenCV’s format is BGR (Blue, Green, Red) while Mediapipe accepts RGB. The conversion is quite simple: pass the image/frame to the cv.cvtColor(image, cv.COLOR_BGR2RGB) function and it will return it in RGB format.

rgb_frame = cv.cvtColor(frame, cv.COLOR_BGR2RGB)

The detector processes rgb_frame [the image] and returns the results as data, which you can access via results.detections. It’s nothing but a list of all the faces detected in the image or frame, and each detected face carries 3 data points.

results = face_detector.process(rgb_frame)

Before extracting the data, we have to check whether any faces were detected or not. We can achieve that with an if statement, and then extract all the data inside the if block: if results.detections:
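Putting those two steps together, the guard looks like this (a minimal sketch; results.detections is None when no face is found):

results = face_detector.process(rgb_frame)
if results.detections:                # None when no face is found
    for face in results.detections:   # one entry per detected face
        print(face.score[0])          # confidence of this detection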

Extracting Data

Label ID, face.label_id, which always comes out as zero; I don’t know the reason why.

Score, face.score, which is the confidence score; it shows how sure the model is that the detection is actually a face. Its value is between 0.0 and 1.0.

Location data, face.location_data, which, as its name suggests, holds the location or coordinates of the face in the image. It is further divided into two data points: the face bounding box and the face landmarks.
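If you want to see those raw fields for yourself, a quick inspection of the first detection might look like this (a sketch; the printed values are just examples):

# Inspect the first detected face (assumes results.detections is non-empty).
face = results.detections[0]
print(face.label_id)                             # e.g. [0]
print(face.score)                                # e.g. [0.93]
print(face.location_data.relative_bounding_box)  # normalized box fields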

Face bounding box data

The face bounding box, face.location_data.relative_bounding_box, has four values in it: the normalized values of the face bounding box in the image/frame, listed below.

  • xmin face.location_data.relative_bounding_box.xmin is the X coordinate of the top-left corner of the face bounding box.
  • ymin face.location_data.relative_bounding_box.ymin is the Y coordinate of the top-left corner of the face bounding box.
  • width face.location_data.relative_bounding_box.width is the width of the bounding box.
  • height face.location_data.relative_bounding_box.height is the height of the bounding box.
# Checking whether a face is present in the detections or not.
if results.detections:
    # Looping through all the detections/faces
    for face in results.detections:
        # Extracting the relative bounding box data and converting
        # normalized coordinates into pixel coordinates: use NumPy's
        # multiply function to scale each value by the corresponding
        # frame width/height.
        face_react = np.multiply(
            [
                face.location_data.relative_bounding_box.xmin,
                face.location_data.relative_bounding_box.ymin,
                face.location_data.relative_bounding_box.width,
                face.location_data.relative_bounding_box.height,
            ],
            [frame_width, frame_height, frame_width, frame_height],
        ).astype(int)

Drawing the face bounding box, with the confidence score at the top-left corner of the box.

# Draw the rectangle around each face.
cv.rectangle(frame, face_react, color=(255, 255, 255), thickness=2)
# Draw the score at the top-left corner of the rectangle.
cv.putText(frame, f"Face {face.score[0]*100:.1f}", face_react[:2], cv.FONT_HERSHEY_DUPLEX, 0.4, (0, 255, 0), 1, cv.LINE_AA)

Face Landmarks / Key-points

These are the six basic key points [landmarks] of the face, face.location_data.relative_keypoints: the two eyes, the two ear tragions, the nose tip, and the center of the mouth. Each key point has x and y values (normalized coordinates), which you can access per key point as keypoint.x and keypoint.y. Since there are multiple values, you can easily iterate over each key point using a for loop.

All these values are normalized to the image width and height, so we need to convert them into pixel coordinates simply by multiplying them by the width and height of the image.

# Getting all the face landmarks through a list comprehension and storing them in a NumPy array.
key_points = np.array([(p.x, p.y) for p in face.location_data.relative_keypoints])

# Convert normalized values into pixel coordinates, and turn them into integers.
key_points_coords = np.multiply(key_points, [frame_width, frame_height]).astype(int)

Draw the Face Key-points

Here we are iterating over each key point and drawing a circle on each point.

# Looping through each of key_points_coords and drawing two circles.
for p in key_points_coords:
    cv.circle(frame, p, 4, (255, 255, 255), 2)
    cv.circle(frame, p, 2, (0, 0, 0), -1)

Here is the complete source code.

import time
import cv2 as cv
import mediapipe as mp
import numpy as np

mp_face_detection = mp.solutions.face_detection
cap = cv.VideoCapture(0)

with mp_face_detection.FaceDetection(model_selection=1, min_detection_confidence=0.5) as face_detector:
    frame_counter = 0
    fonts = cv.FONT_HERSHEY_PLAIN
    start_time = time.time()
    while True:
        frame_counter += 1
        ret, frame = cap.read()
        if not ret:
            break
        # Convert BGR (OpenCV) to RGB (MediaPipe) before processing.
        rgb_frame = cv.cvtColor(frame, cv.COLOR_BGR2RGB)

        results = face_detector.process(rgb_frame)
        frame_height, frame_width, c = frame.shape
        if results.detections:
            for face in results.detections:
                # Scale the normalized bounding box to pixel coordinates.
                face_react = np.multiply(
                    [
                        face.location_data.relative_bounding_box.xmin,
                        face.location_data.relative_bounding_box.ymin,
                        face.location_data.relative_bounding_box.width,
                        face.location_data.relative_bounding_box.height,
                    ],
                    [frame_width, frame_height, frame_width, frame_height],
                ).astype(int)

                cv.rectangle(frame, face_react, color=(255, 255, 255), thickness=2)
                # Scale the normalized key points to pixel coordinates.
                key_points = np.array([(p.x, p.y) for p in face.location_data.relative_keypoints])
                key_points_coords = np.multiply(key_points, [frame_width, frame_height]).astype(int)
                for p in key_points_coords:
                    cv.circle(frame, p, 4, (255, 255, 255), 2)
                    cv.circle(frame, p, 2, (0, 0, 0), -1)

        fps = frame_counter / (time.time() - start_time)
        cv.putText(frame, f"FPS: {fps:.2f}", (30, 30), cv.FONT_HERSHEY_DUPLEX, 0.7, (0, 255, 255), 2)
        cv.imshow("frame", frame)
        key = cv.waitKey(1)
        if key == ord("q"):
            break

cap.release()
cv.destroyAllWindows()

Face detection results.

Face detection.

Final thoughts.

In conclusion, MediaPipe’s face detection capabilities provide a wealth of information for any project requiring facial analysis. With its advanced algorithms and neural networks, MediaPipe can detect various facial features such as the eyes, nose, and mouth, as well as facial landmarks. This information can be used for a variety of applications. While it may take some time to understand the various data points provided, the results are well worth the effort, as the accuracy of the data can greatly enhance the success of any facial analysis project.

If you would like to go through video content, take a look at this tutorial from my YouTube channel.

The upcoming blog post will be about face-time calculation using face detection; for details, check out this GitHub repository: https://github.com/Asadullah-Dal17/AiPhile-Mediapipe-Course-2023
