Targeting the Right Audience: A Data-Driven Approach to Customer Segmentation

Raghav Sriram
Apr 16, 2023

How Clustering Can Help You Understand Your Customers Better

Customer segmentation is crucial for businesses that want to better understand their customers, target marketing efforts, and improve satisfaction. Clustering, a popular machine learning technique, identifies patterns in large datasets to group similar customers and surface insights. This article explores the concept of customer segmentation using clustering, its benefits, and common algorithms, and provides a step-by-step guide with a Python case study. Marketers, business owners, and data analysts can use it to build a solid foundation in clustering-based customer segmentation and improve their marketing strategies and customer engagement.

Customer segmentation can help businesses identify groups of customers with similar needs and preferences, enabling them to tailor their marketing efforts to each group. For example, a retailer can use customer segmentation to identify customers who prefer luxury items and target them with promotions for high-end products, while also promoting budget-friendly products to customers who prefer lower-priced items. This approach can increase customer satisfaction and drive sales.

Getting started

Segmentation means grouping customers with similar attributes so that you can target your communications and personalize your business without reaching out to each customer individually. Below, we walk through the clustering code, analyze the segmented data, and discuss actions you can take based on it.

In this example, we use the Mall Customer Segmentation dataset and select two relevant features (income and spending habits), ‘Annual_Income_kdollars’ and ‘Spending_Score’, for clustering. The dataset is first scaled using StandardScaler to standardize the features. We then use the elbow method to find the optimal number of clusters, which is 5 in this case. Finally, we perform KMeans clustering and visualize the results. We will go step by step below.

Import necessary libraries

The first step is to import the required libraries: pandas, seaborn, matplotlib.pyplot, KMeans from sklearn.cluster, and StandardScaler from sklearn.preprocessing.

# Import all the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Load data

The Mall_Customer dataset (from Kaggle) is a popular dataset used for customer segmentation and marketing analysis. It contains information on customers of a mall, including their age, gender, annual income, and spending score. The spending score is a metric assigned to each customer based on their purchasing behavior, and is a measure of how much a customer is likely to spend at the mall.

The code then loads the data from a CSV file containing information about mall customers and prints the first five rows using the head() method. We then rename the columns and print the first five rows again.

# Read the data file 
data = pd.read_csv('Mall_Customers.csv')
# Print 5 rows before renaming
print(data.head())

# Rename columns.
data.rename(columns = {'Spending Score (1-100)':'Spending_Score', 'Annual Income (k$)':'Annual_Income_kdollars'}, inplace = True)
# Print 5 rows after renaming
print(data.head())

The output is shown below.

# DATA BEFORE renaming
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40

# DATA AFTER renaming

CustomerID Gender Age Annual_Income_kdollars Spending_Score
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
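
As an optional sanity check (not part of the original walkthrough), it is worth confirming that there are no missing values and looking at the ranges of the numeric columns before clustering; a minimal sketch:

# Optional: quick look at column types, missing values, and numeric ranges
# (assumes 'data' has been loaded and renamed as above)
data.info()                      # column types and non-null counts
print(data.isnull().sum())       # missing values per column
print(data[['Age', 'Annual_Income_kdollars', 'Spending_Score']].describe())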

Select features and scale data

Next, the code selects two relevant features, namely Annual_Income_kdollars and Spending_Score, and stores them in a new variable X. These are the two variables we are going to use for the clustering (income and spending patterns).

The data is then standardized using the StandardScaler() function from the sklearn.preprocessing library. This scales each feature to a mean of 0 and a standard deviation of 1, putting the features on a comparable scale. Because KMeans is distance-based, this prevents the feature with the larger range from dominating the clustering and helps the algorithm perform well.

The formula used to standardize the data is:

(x - mean) / standard deviation

If the original value is greater than the mean, the scaled value is positive; if it is less than the mean, the scaled value is negative.
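
To make the formula concrete, here is a small sketch of my own (not part of the original code) that standardizes the first customer's income by hand. Note that StandardScaler divides by the population standard deviation (ddof=0), which is why pandas' default std() would give a slightly different number.

# Illustration only: standardize one income value by hand
# (assumes 'data' has been loaded and renamed as above)
income = data['Annual_Income_kdollars']
mean = income.mean()            # roughly 60.56 for this dataset
std = income.std(ddof=0)        # population standard deviation, matching StandardScaler

value = income.iloc[0]          # first customer's income (15)
print((value - mean) / std)     # matches the first value of X_scaled computed below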

# Select relevant features 
# Create a new dataframe with only the
# 'Annual_Income_kdollars' and 'Spending_Score' columns from the original data

X = data[['Annual_Income_kdollars', 'Spending_Score']]

# Scale data
# StandardScaler is a preprocessing step used to standardize the data in a
# machine learning model.

# Create a StandardScaler object to standardize the data
scaler = StandardScaler()

# Standardize the 'X' dataframe using the StandardScaler object
X_scaled = scaler.fit_transform(X)

# Print the first 5 rows of X
print(X.head())

# Get the mean for the two columns
print("Annual_Income_kdollars = " , X['Annual_Income_kdollars'].mean())
print("Spending_Score = ", X['Spending_Score'].mean())

# Print the first 5 rows of X_scaled
print(X_scaled[:5,:])
# Mean of Spending_Score is 50. Anything above 50 for Spending_Score will appear positive. Below 50 will be negative
# Mean of Annual_Income_kdollars is 60. Anything above 60 for Annual_Income_kdollars will appear positive. Below 60 will be negative

Output from the above code.

# DATA BEFORE scaling
Annual_Income_kdollars Spending_Score
0 15 39
1 15 81
2 16 6
3 16 77
4 17 40

# MEAN VALUES
Annual_Income_kdollars = 60.56
Spending_Score = 50.2

# Mean of Spending_Score is 50.
#Anything above 50 for Spending_Score will appear positive.
#Below 50 will be negative

# Mean of Annual_Income_kdollars is 60.
#Anything above 60 for Annual_Income_kdollars will appear positive.
#Below 60 will be negative

# DATA AFTER scaling
Annual_Income_kdollars Spending_Score
[[-1.73899919 -0.43480148]
[-1.73899919 1.19570407]
[-1.70082976 -1.71591298]
[-1.70082976 1.04041783]
[-1.66266033 -0.39597992]]

# All 5 rows displayed for Annual_Income_kdollars are below the mean of 60,
# so the scaled values are negative.

# The 2nd and 4th rows displayed for Spending_Score are above the mean of 50,
# so the scaled values are positive for those rows and negative for the others.
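
If you want to double-check the scaling, a quick sketch (my own addition) confirms that each scaled column now has a mean of approximately 0 and a standard deviation of approximately 1:

# Verify the effect of StandardScaler (assumes X_scaled from the code above)
print(X_scaled.mean(axis=0))    # should be close to [0, 0]
print(X_scaled.std(axis=0))     # should be close to [1, 1]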

Find optimum number of clusters

The code then uses the elbow method to determine the optimal number of clusters to use for the KMeans algorithm. The elbow method involves fitting the KMeans algorithm with different numbers of clusters and computing the within-cluster sum of squares (WCSS) for each number of clusters. The optimal number of clusters is the point on the plot of WCSS against the number of clusters where the curve starts to level off, which is called the “elbow”.

KMeans takes several parameters, including:

  • n_clusters: the number of clusters to create. This determines the number of groups that the data will be divided into.
  • init: the method used to initialize the cluster centers. The default value is 'k-means++', which is a smart initialization technique that can lead to faster convergence.
  • max_iter: the maximum number of iterations allowed for the algorithm to converge. If the algorithm has not converged by this number of iterations, it will stop.
  • n_init: the number of times the algorithm will be run with different initializations of the cluster centers. The final result will be the best one from these runs.
  • random_state: the seed value for the random number generator used in the initialization of the cluster centers. This ensures that the results are reproducible.

With these parameters set appropriately, KMeans can produce meaningful customer segments that give businesses insight into their customer base and inform marketing strategy.

# Find optimal number of clusters using the elbow method

# Create an empty list to store the within-cluster sum of squares for each number of clusters
wcss = []

# Iterate over the range of possible number of clusters, from 1 to 10
for i in range(1, 11):
    # Create a KMeans object with the specified number of clusters and other parameters
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)

    # Fit the KMeans object to the standardized data
    kmeans.fit(X_scaled)

    # Append the within-cluster sum of squares to the list
    wcss.append(kmeans.inertia_)

The code then plots the results using matplotlib.

# Plot the within-cluster sum of squares for each number of clusters
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method') # Set the title of the plot
plt.xlabel('Number of clusters') # Set the label for the x-axis
plt.ylabel('WCSS') # Set the label for the y-axis
plt.show() # Show the plot

The elbow point indicates the point of diminishing returns, where increasing the number of clusters no longer leads to a significant improvement in clustering performance. Therefore, the optimal number of clusters is the number at the elbow point on the plot, which is 5 in this case.
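
The elbow can be a little subjective to read, so as an optional cross-check (not part of the original walkthrough) you can also compute the silhouette score for each candidate number of clusters; a higher score indicates better-separated clusters.

# Optional cross-check of the elbow result using silhouette scores
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    labels = km.fit_predict(X_scaled)
    # Silhouette score ranges from -1 to 1; higher means better-separated clusters
    print(k, silhouette_score(X_scaled, labels))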

Perform KMeans clustering and visualize the data

The code then performs KMeans clustering with a chosen number of clusters (in this case, 5), using the KMeans() function from sklearn.cluster. It stores the cluster labels in a new variable clusters.

Finally, the code visualizes the clusters using a scatter plot. It uses the sns.scatterplot() function from seaborn to create the plot, with the hue parameter set to the clusters variable to color the points according to their cluster label. The resulting plot shows the customer segments based on their Annual_Income_kdollars and Spending_Score. The code then displays the plot using matplotlib.

# Perform KMeans clustering

# Specify the number of clusters to use
n_clusters = 5

# Create a KMeans object with the specified number of clusters and other parameters
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)

# Fit the KMeans object to the standardized data and assign the
# cluster labels to the 'clusters' variable
clusters = kmeans.fit_predict(X_scaled)

# Visualize the clusters

# Create a scatter plot of the data, with points colored by cluster label
sns.scatterplot(data=X, x='Annual_Income_kdollars', y='Spending_Score', hue=clusters, palette='deep', legend=None)

plt.title('Customer Segments') # Set the title of the plot
plt.show() # Show the plot
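
To interpret the clusters in the original units (thousands of dollars and spending-score points) rather than standardized values, you can map the fitted cluster centers back with the scaler; a short sketch of my own:

# Convert the cluster centers back to the original scale for easier interpretation
# (assumes 'scaler' and 'kmeans' from the code above)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=['Annual_Income_kdollars', 'Spending_Score'])
print(centers_df)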

Add segment data to the dataset

Add the cluster labels to the original dataset as a new column and print the first five rows.

# Merge the segment data to the original dataset
data['Segment'] = kmeans.labels_

# Print 5 rows
print(data.head())

Output of the new dataset

CustomerID  Gender  Age  Annual_Income_kdollars  Spending_Score  Segment
0 1 Male 19 15 39 4
1 2 Male 21 15 81 3
2 3 Female 20 16 6 4
3 4 Female 23 16 77 3
4 5 Female 31 17 40 4

Analyzing customer segments

# Print average values for the 5 segments
print(data[['Annual_Income_kdollars', 'Spending_Score', 'Age', 'Segment']].groupby('Segment').mean())
Annual_Income_kdollars  Spending_Score        Age
Segment
0 88.200000 17.114286 41.114286
1 55.296296 49.518519 42.716049
2 86.538462 82.128205 32.692308
3 25.727273 79.363636 25.272727
4 26.304348 20.913043 45.217391

What can we see from the above table?

  1. Income analysis
    – Segments 0 and 2 have high average income.
    – Segments 3 and 4 have low average income.
    – Segment 1 has medium income.
  2. Spending analysis
    – Segments 0 and 4 have low average spending.
    – Segments 2 and 3 have high average spending.
    – Segment 1 has medium spending.
  3. Age analysis
    – Segment 3 has the youngest customers (mid 20s).
    – Segment 2 is in their early 30s.
    – The rest are above 40.

The above can be summarized as:

  • Segment 0: High Income, Low Spending, Above 40
  • Segment 1: Medium Income, Medium Spending, Above 40
  • Segment 2: High Income, High Spending, Early 30s
  • Segment 3: Low Income, High Spending, Mid 20s
  • Segment 4: Low Income, Low Spending, Above 40

It will be helpful to see these visually.

In the plots below I will rename segments 0, 1, 2, 3, and 4 to ‘HI-LS’, ‘MI-MS’, ‘HI-HS’, ‘LI-HS’, and ‘LI-LS’ to reflect the definitions above, where HI means high income, LI means low income, HS means high spending, LS means low spending, and MI and MS mean medium income and medium spending.
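
Rather than relabeling the x-axis ticks on every plot, an alternative (not what the code below does) is to map the codes to a label column once and plot that column directly; a minimal sketch, where 'Segment_Label' is a column name of my own choosing:

# Optional alternative: map numeric segment codes to descriptive labels once
segment_labels = {0: 'HI-LS', 1: 'MI-MS', 2: 'HI-HS', 3: 'LI-HS', 4: 'LI-LS'}
data['Segment_Label'] = data['Segment'].map(segment_labels)
print(data[['Segment', 'Segment_Label']].head())

This is also a little more robust than fixed tick labels, since KMeans cluster numbers are arbitrary and can change between runs.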

# Plot segment counts
ax=sns.countplot(data=data, x='Segment')
ax.set_xticklabels(labels = ['HI-LS','MI-MS', 'HI-HS', 'LI-HS', 'LI-LS'])
plt.title('Customer Count by Segment')
plt.show()

# Create a barplot of average income for each segment
ax=sns.barplot(data=data, x='Segment', y='Annual_Income_kdollars', ci=None)
ax.set_xticklabels(labels = ['HI-LS','MI-MS', 'HI-HS', 'LI-HS', 'LI-LS'])
plt.title('Average Income by Segment')
plt.show()

# Create a barplot of spending for each segment
ax=sns.barplot(data=data, x='Segment', y='Spending_Score', ci=None)
ax.set_xticklabels(labels = ['HI-LS','MI-MS', 'HI-HS', 'LI-HS', 'LI-LS'])
plt.title('Average Spending by Segment')
plt.show()

# Create a barplot of Age for each segment
ax=sns.barplot(data=data, x='Segment', y='Age', ci=None)
ax.set_xticklabels(labels = ['HI-LS','MI-MS', 'HI-HS', 'LI-HS', 'LI-LS'])
plt.title('Average Age by Segment')
plt.show()

# Plot the gender distribution within each segment
sns.histplot(binwidth=1,hue='Segment', x='Gender',data=data,stat="count",multiple="dodge")
plt.title('Segment by Gender')
plt.show()

We have a lot more medium-income / medium-spending customers than any other segment.

The above charts are self-explanatory.

In the first segment (HI-LS), the numbers of males and females are roughly the same; all of the other segments have noticeably more females than males.
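
That observation comes from the gender chart; if you want the underlying numbers, a quick cross-tabulation (my own addition) gives the same picture:

# Cross-tabulate segment against gender to back up the chart
print(pd.crosstab(data['Segment'], data['Gender']))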

I felt it's easier to visualize if I plot both Income and Spending on the same chart.

# Create a barplot of average income and spending for each segment
segment_means = data[['Annual_Income_kdollars', 'Spending_Score', 'Age', 'Segment']].groupby('Segment').mean()
ax = segment_means[['Annual_Income_kdollars', 'Spending_Score']].plot(kind='bar', figsize=(8, 6), color=['blue', 'orange'])
ax.set_xticklabels(labels = ['HI-LS','MI-MS', 'HI-HS', 'LI-HS', 'LI-LS'])
ax.set_xlabel('Segment')
ax.set_ylabel('Mean Value')
ax.set_title('Average Income and Spending by Segment')
plt.show()

The above chart makes the distribution much easier to understand; if we had plotted it first, the groups would have been easier to identify. For example, it is immediately obvious that the first group is high income and low spending, the third group is high income and high spending, and the fourth group is low income and high spending.

Now let's identify some meaning for these groups.

The clusters generated can represent customer segments based on their annual income and spending score:

  1. Cluster 0: High income, low spending (Careful spenders)
  2. Cluster 1: Moderate income, moderate spending (Average customers)
  3. Cluster 2: High income, high spending (Big spenders, early 30s)
  4. Cluster 3: Low income, high spending (Young spendthrifts)
  5. Cluster 4: Low income, low spending (Economical customers)

Except for cluster 0, all other clusters are predominantly female. Cluster 0 has roughly the same number of males and females.

These segments can help businesses target their marketing and sales efforts more effectively.

Now that we have identified customer segments using KMeans clustering, businesses can use these insights to tailor their marketing and sales strategies for each segment. Here are some suggestions on how to target each customer segment:

Careful spenders (High income, low spending):

  • Focus on building brand loyalty through personalized offers and promotions.
  • Showcase high-quality, durable, and cost-effective products.
  • Offer exclusive loyalty programs or memberships.

Average customers (Moderate income, moderate spending):

  • Provide a wide range of products that cater to their varying needs and preferences.
  • Offer discounts and promotions on popular items.
  • Implement targeted marketing campaigns to drive sales during seasonal peaks.

Spendthrifts (Low income, high spending):

  • Offer attractive financing options and installment plans to make high-priced items more accessible.
  • Promote products with high perceived value and exclusivity.
  • Employ scarcity marketing tactics to encourage impulse purchases.

Big spenders (High income, high spending):

  • Focus on offering premium and luxury products to cater to their taste for high-quality items.
  • Provide exceptional customer service and personalized shopping experiences.
  • Invite them to exclusive events or offer limited-edition products.

Economical customers (Low income, low spending):

  • Highlight cost-effective and budget-friendly products in marketing campaigns.
  • Offer bulk purchase discounts or bundle deals to incentivize spending.
  • Provide value-based promotions, such as “buy one, get one free” or coupons.

All segments other than segment 0 have a majority of female customers, so it is worth prioritizing product recommendations that appeal to women for those segments.

Understanding the unique characteristics and preferences of each customer segment allows businesses to optimize their marketing and sales strategies, ultimately leading to increased revenue and customer satisfaction.
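
As a practical starting point for acting on these strategies, you could export one customer list per segment for the marketing team; a sketch of my own, with illustrative file names:

# Export one customer list per segment for campaign targeting
# (the file names here are just an illustration)
for segment_id, segment_data in data.groupby('Segment'):
    segment_data.to_csv(f'segment_{segment_id}_customers.csv', index=False)
    print(f'Segment {segment_id}: {len(segment_data)} customers exported')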

Monitor your targeting results

Once businesses have tailored their marketing and sales strategies to target each customer segment effectively, it’s crucial to monitor and analyze the results. This enables continuous optimization and ensures that efforts remain focused on the most profitable customer segments.
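
What monitoring looks like depends on the data collected after a campaign. As a purely hypothetical sketch, suppose a later 'campaign_results.csv' file records a 'CustomerID' and a 'Purchased' flag (neither is part of the Mall Customers dataset); you could then compare conversion rates across segments:

# Hypothetical monitoring example: 'campaign_results.csv', and its 'CustomerID'
# and 'Purchased' columns, are assumptions for illustration only
results = pd.read_csv('campaign_results.csv')
merged = data.merge(results, on='CustomerID')

# Conversion rate per segment: share of customers who purchased after the campaign
print(merged.groupby('Segment')['Purchased'].mean())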

Conclusion

Clustering-based customer segmentation is a powerful tool for businesses to improve marketing strategies and customer satisfaction. It involves dividing customers into groups based on their behavior and preferences. This article provided a step-by-step guide on how to perform customer segmentation using clustering in Python, along with a case study. Regardless of your role, clustering-based customer segmentation equips you with tools to make informed decisions and engage customers effectively, driving growth and success in a dynamic business environment.


Raghav Sriram

Passionate about AI, Robotics, and STEM. Exploring the intersection of technology and innovation