Principal Component Analysis (PCA)

Increase in Data Dimensions

Introduction: The Curse of Dimensionality and the Need for PCA

Imagine you’re a data scientist working with a vast dataset of astronomical observations, aiming to uncover patterns and insights about distant galaxies. Each observation in your dataset contains hundreds of features: brightness levels in different wavelengths, distances, sizes, and many more. Initially, it seems like having more features should give you more information and better predictive power. But as you delve deeper, you encounter an unexpected problem: the more features you have, the sparser the data becomes in the high-dimensional space. This is the ‘curse of dimensionality’, a phenomenon where the volume of the space increases so much that the available data becomes sparse, making it increasingly difficult to find patterns or make predictions.

Sparse means thinly dispersed or scattered.

This is where dimensionality reduction techniques become crucial. In essence, dimensionality reduction is about simplifying the complexity of high-dimensional data while retaining as much meaningful information as possible. There are two primary types of dimensionality reduction: projection and manifold learning.

  1. Projection involves mapping data points from a high-dimensional space to a lower-dimensional space. It’s like casting a shadow of a three-dimensional object onto a two-dimensional surface. The shadow (or projection) captures certain aspects of the object, although it might lose some details.
Projection Learning (Image Source)
  2. Manifold Learning is a bit more intricate. It assumes that the data lies along a manifold (a lower-dimensional curved surface embedded in the high-dimensional space). Manifold learning techniques aim to unfold this manifold, like flattening a crumpled piece of paper, to reveal the simpler, intrinsic structure hidden within the complex high-dimensional data.
Manifold Learning (Image Source)

Both methods have their merits, but in this article, we’ll focus on a particular projection technique that stands out for its simplicity and effectiveness: Principal Component Analysis (PCA). PCA helps in identifying the directions (called principal components) along which the variance in the data is maximum. In other words, PCA finds the lines along which the data is most spread out, and it projects the high-dimensional data along these lines, thereby reducing dimensions while preserving as much variability as possible.

As we explore PCA further, we’ll delve into its mathematical foundations, including how it preserves variance, identifies principal components, and why it assumes data is centered around the origin. We’ll also look at practical implementations using Python’s Scikit-Learn library, discussing how to choose the right number of dimensions, PCA for compression, and advanced variations like Randomized PCA and Incremental PCA.

By the end of this journey, you’ll have a deep understanding of PCA, empowering you to tackle the curse of dimensionality in your own datasets, and unveiling patterns that were hidden in the vastness of high-dimensional space.

The Mathematics Behind PCA

Let’s dive into the mathematical core of Principal Component Analysis (PCA). PCA isn’t just a black-box tool; understanding its mathematics gives you deeper insight into your data and into how PCA manipulates it.

1. Preserving Variance: The Goal of PCA

At the heart of PCA is the concept of variance, which measures how much the data points are spread out. The main goal of PCA is to preserve as much variance as possible when we reduce the dimensionality of the data.

Let’s say our dataset has n dimensions, and we want to reduce it to k dimensions (where k < n). PCA achieves this by finding the k directions (principal components) along which the variance is maximized. These directions are identified using the covariance matrix of the data. The covariance matrix, denoted as Σ, is an n × n matrix where each element represents the covariance between two dimensions.

The principal components are the eigenvectors of the covariance matrix Σ, and they are sorted by their corresponding eigenvalues in descending order. The eigenvalue associated with each eigenvector indicates the amount of variance captured by that principal component.
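To make these definitions concrete, here is a minimal NumPy sketch (not part of the original walkthrough; the data matrix X below is just a random stand-in) that builds the covariance matrix and sorts its eigenvectors by eigenvalue:

import numpy as np

# Stand-in data matrix: 100 observations, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Sample covariance matrix Σ (3 x 3); np.cov subtracts the column means internally
cov = np.cov(X, rowvar=False)

# Eigendecomposition; eigh is appropriate because Σ is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort the eigenpairs by eigenvalue in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]          # variance captured by each component
eigenvectors = eigenvectors[:, order]     # columns are the principal components

Read off in this order, the eigenvalues tell you how much variance each principal component captures.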

2. Principal Components: The New Axes

The principal components are orthogonal to each other, and they form a new set of axes for the data. The first principal component is the direction in which the data varies the most. The second principal component is orthogonal to the first and is the direction of the next highest variance, and so on.

Mathematically, if we denote our data matrix as X (with zero mean), the first principal component PC₁ can be found by solving the equation:

Σ v₁ = λ₁ v₁

where v₁ is the eigenvector corresponding to the largest eigenvalue λ₁ of Σ. The second principal component PC₂ is found in a similar manner, subject to the constraint that it is orthogonal to PC₁, and so on for subsequent components.

3. Why Centering the Data is Important

PCA assumes that the data is centered around the origin, i.e., that each feature has zero mean. If the data is not centered, the first principal component tends to point from the origin toward the mean of the data rather than along the direction of greatest variance, so the components reflect the data’s offset instead of its actual structure. (Scaling the features to comparable units is a separate preprocessing concern.) Centering is done by subtracting the mean of each variable from the dataset. Mathematically, if xᵢⱼ is the jᵗʰ variable for the iᵗʰ observation, then the centered variable x̃ᵢⱼ is:

x̃ᵢⱼ = xᵢⱼ − x̄ⱼ

where x̄ⱼ is the mean of the jᵗʰ variable.
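In code, centering is a one-liner. A minimal sketch, continuing with the stand-in X from the previous snippet:

# Subtract each column's (feature's) mean so the data is centered at the origin
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))   # ~ [0, 0, 0] up to floating-point error

Note that Scikit-Learn’s PCA performs this centering for you automatically before fitting.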

4. Projecting Data Down to d-Dimensions

Once we have the principal components, we can transform the data into the new space they define. This transformation is a projection of the data onto the principal components. If Wd is the matrix whose columns are the first d eigenvectors (the top d principal components), then the projected data Y in the reduced space is given by:

Y = X Wd

Here, X is the centered data matrix, and Y is the transformed data with d dimensions.
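Continuing the same sketch, projecting the centered data down to d = 2 dimensions is a single matrix multiplication with the first two eigenvectors:

# W_d: the first d eigenvectors (principal components) as columns
d = 2
W_d = eigenvectors[:, :d]

# Project the centered data onto the principal components
Y = X_centered @ W_d      # shape: (100, 2)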

In the next section, we’ll see how to implement these concepts using Python’s Scikit-Learn library, covering practical aspects like choosing the right number of dimensions for your PCA, and how to use PCA for data compression and incremental learning.

Let’s see it in action 💻

Step 1: Generating and Visualizing Data

Now, let’s generate our dataset. We’re creating a synthetic 3D dataset for this demonstration. Visualizing this data will help us understand its structure and how PCA might transform it.

# imports
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Generating a synthetic 3D dataset: 100 points whose features are correlated
# through a random 3x3 linear mixing, so PCA has structure to uncover
np.random.seed(0)
X = np.dot(np.random.rand(100, 3), np.random.rand(3, 3))

# Visualizing the original 3D data
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
plt.title("Original 3D Data")
plt.show()
Synthetically Generated Data

Step 2: Applying PCA to Reduce Dimensions

With our 3D dataset ready, the next step is to apply PCA. We fit it with all three components so that we can inspect how much variance each one captures (in Step 3), and then plot only the first two principal components, giving us a 2D projection that retains as much variance as possible.

# Applying PCA (keeping all 3 components so we can inspect each one's variance;
# only the first two components are plotted below)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
# Plotting Original Data vs. PCA Transformed Data
fig = plt.figure(figsize=(12, 6))
# Plotting the Original 3D Data
ax1 = fig.add_subplot(121, projection='3d')
ax1.scatter(X[:, 0], X[:, 1], X[:, 2], color='red', alpha=0.5)
ax1.set_title("Original Data")
ax1.set_xlabel("X axis")
ax1.set_ylabel("Y axis")
ax1.set_zlabel("Z axis")

# Plotting the PCA Transformed 2D Data
ax2 = fig.add_subplot(122)
ax2.scatter(X_pca[:, 0], X_pca[:, 1], color='blue', alpha=0.5)
ax2.set_title("PCA Transformed Data (2D)")
ax2.set_xlabel("Principal Component 1")
ax2.set_ylabel("Principal Component 2")

plt.tight_layout()
plt.show()
Original Data vs PCA-Transformed Data

Step 3: The Explained Variance Ratio

One key aspect of PCA is understanding how much variance each principal component captures. This is known as the explained variance ratio.

# Explained Variance Ratio
explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance)

#output:
# Explained Variance Ratio: [0.83350234 0.13910518 0.02739248]

# Creating a bar chart for the Explained Variance Ratio
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.7, align='center')
plt.title('Explained Variance by Each Principal Component')
plt.xlabel('Principal Component')
plt.ylabel('Variance Ratio')
plt.xticks(range(1, len(explained_variance) + 1))
plt.show()
Explained Variance Ratio
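As a sanity check that ties this back to the math section, the explained variance ratio is simply each eigenvalue of the covariance matrix divided by the sum of all the eigenvalues. A small sketch of that check, reusing the fitted pca object and mirroring the earlier manual eigenvalue computation:

# Eigenvalues of the sample covariance matrix of X, sorted in descending order
cov = np.cov(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

# These should match scikit-learn's values up to floating-point error
print(eigenvalues / eigenvalues.sum())   # ~ pca.explained_variance_ratio_
print(pca.explained_variance_)           # per-component variance (the eigenvalues)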

Step 4: Choosing the Number of Dimensions

A crucial decision in PCA is choosing the number of dimensions to reduce to. This often depends on the cumulative explained variance ratio. We typically select enough components to retain a significant percentage of the total variance, like 95%.

# Cumulative variance
cumulative_variance = np.cumsum(explained_variance)
# Determining the number of components to reach 95% variance
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print("Number of components to retain 95% variance:", n_components_95)

#output
# Number of components to retain 95% variance: 2
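Scikit-Learn can also make this choice for you: if you pass a float between 0 and 1 as n_components, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. A quick sketch:

# Let PCA pick the number of components that preserves 95% of the variance
pca_auto = PCA(n_components=0.95)
X_reduced_auto = pca_auto.fit_transform(X)
print(pca_auto.n_components_)   # should agree with n_components_95 above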

Step 5: PCA for Data Compression and Reconstruction

PCA can also be used for data compression. We can compress the data to fewer dimensions and then reconstruct it back. This process, while lossy, retains much of the original information.

# Reducing dimensions for compression
pca_95 = PCA(n_components=n_components_95)
X_reduced = pca_95.fit_transform(X)

# Reconstructing the data from the compressed form
X_reconstructed = pca_95.inverse_transform(X_reduced)
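Before visualizing, we can put a number on how much information the compression discards. One simple measure is the mean squared distance between the original points and their reconstructions, often called the reconstruction error:

# Average squared distance between each original point and its reconstruction
reconstruction_error = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1))
print("Mean squared reconstruction error:", reconstruction_error)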

To understand the impact of compression and reconstruction, let’s visualize the reconstructed data alongside the original data. This comparison will show us the loss of information due to dimensionality reduction.

# Plotting Original Data
fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(131, projection='3d')
ax1.scatter(X[:, 0], X[:, 1], X[:, 2], color='red', alpha=0.5)
ax1.set_title("Original Data")
ax1.set_xlabel("X")
ax1.set_ylabel("Y")
ax1.set_zlabel("Z")

# Plotting Reduced Data
ax2 = fig.add_subplot(132)
ax2.scatter(X_reduced[:, 0], X_reduced[:, 1], color='blue', alpha=0.5)
ax2.set_title("Reduced Data (2D)")
ax2.set_xlabel("Principal Component 1")
ax2.set_ylabel("Principal Component 2")

# Plotting Reconstructed Data
ax3 = fig.add_subplot(133, projection='3d')
ax3.scatter(X_reconstructed[:, 0], X_reconstructed[:, 1], X_reconstructed[:, 2], color='green', alpha=0.5)
ax3.set_title("Reconstructed Data")
ax3.set_xlabel("X")
ax3.set_ylabel("Y")
ax3.set_zlabel("Z")

plt.tight_layout()
plt.show()
Original vs Reduced vs Reconstructed Data

Step 6: Exploring Variants: Randomized and Incremental PCA

Finally, let’s explore PCA variants like Randomized PCA, which is faster for large datasets, and Incremental PCA, suitable for datasets too large to fit in memory. These variants can be particularly useful in different scenarios and offer flexibility in handling diverse data sizes and computational constraints.

from sklearn.decomposition import IncrementalPCA, PCA

# Randomized PCA
randomized_pca = PCA(n_components=n_components_95, svd_solver='randomized')
X_reduced_randomized = randomized_pca.fit_transform(X)

# Incremental PCA (fitted here in a single call since the toy data fits in memory;
# for truly large datasets you would feed it in batches, as sketched below)
incremental_pca = IncrementalPCA(n_components=n_components_95)
X_reduced_incremental = incremental_pca.fit_transform(X)

# Plotting Results of Randomized PCA and Incremental PCA
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

# Randomized PCA
ax1.scatter(X_reduced_randomized[:, 0], X_reduced_randomized[:, 1], color='purple', alpha=0.5)
ax1.set_title("Randomized PCA")
ax1.set_xlabel("Principal Component 1")
ax1.set_ylabel("Principal Component 2")

# Incremental PCA
ax2.scatter(X_reduced_incremental[:, 0], X_reduced_incremental[:, 1], color='orange', alpha=0.5)
ax2.set_title("Incremental PCA")
ax2.set_xlabel("Principal Component 1")
ax2.set_ylabel("Principal Component 2")

plt.tight_layout()
plt.show()
Randomized vs Incremental PCA
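The snippet above fits IncrementalPCA on the whole array in one call, which works here because our toy dataset easily fits in memory. To show what the out-of-core workflow looks like, here is a minimal sketch that feeds the data in batches via partial_fit; the batch count is arbitrary and chosen only for illustration, and with a genuinely large dataset each batch would be loaded from disk (for example via numpy.memmap) rather than sliced from an in-memory array:

# Fitting IncrementalPCA batch by batch, as you would for out-of-core data
n_batches = 5
ipca = IncrementalPCA(n_components=n_components_95)
for X_batch in np.array_split(X, n_batches):
    ipca.partial_fit(X_batch)

# Transformation can then be applied (also batch by batch, if necessary)
X_reduced_stream = ipca.transform(X)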

Transition to Real-World Application

Up to this point, we’ve explored PCA through synthetic examples, which has allowed us to understand the mechanics and mathematics behind this powerful dimensionality reduction technique. These examples serve as a stepping stone to grasping the core concepts without the added complexity of real-world noise and data irregularities. However, the true value of PCA is best demonstrated on actual, complex datasets, where the intricacies of data patterns come to life.

We will now take our PCA exploration further by applying it to the Wine Quality dataset. This dataset, with its rich and nuanced physicochemical properties, offers a perfect opportunity to see PCA in action on real-world data. By reducing the dimensionality, we aim to uncover the underlying structure of the data and identify the principal components that capture the essence of wine quality.

To see the step-by-step application of PCA on the Wine Quality dataset, including preprocessing, variance analysis, and visualizations, check out the detailed Jupyter Notebook I’ve prepared. It will walk you through the entire process, providing insights into how PCA can be a valuable tool in your data analysis arsenal.

Conclusion

As we have journeyed from the abstract complexities of high-dimensional spaces to the tangible realities of the Wine Quality dataset, Principal Component Analysis (PCA) has proven to be an indispensable ally. We began with an exploration of the curse of dimensionality, an inevitable challenge in modern data analysis, where the expansion of our feature space can paradoxically obscure the insights we seek. With each added dimension, our data became sparser, and the risk of overfitting grew.

PCA emerged as a guiding light, offering a pathway through the curse by projecting our data into a lower-dimensional space that preserves its intrinsic variability. We navigated through the mathematical foundations of PCA, understanding how it identifies the principal components that act as the axes along which the essence of our data is most clearly revealed.

Taking theory into practice, we applied PCA to a synthetic dataset, which laid the groundwork for our real-world application. In the Wine Quality dataset, PCA did more than just simplify data; it unveiled the relationships and patterns that contribute to the quality of wine, patterns that may have remained hidden in the shadow of higher dimensions.

As we conclude this exploration, it is my hope that the insights gained here illuminate your path in data science, whether you’re reducing dimensions, uncovering hidden patterns, or making data-driven decisions. I am grateful for your engagement and curiosity throughout this article.

For more discussions on the intersection of data analysis and real-world applications, and to continue our exploration of machine learning together, I invite you to connect with me on LinkedIn. Your thoughts, questions, and insights are always welcome as we further our shared journey in the vast and ever-evolving landscape of data science.

Follow me on LinkedIn for more articles and discussions on data science.

Thank you for reading, and may your data be ever in your favor!

