Principal Component Analysis (PCA) Using Python
Principal Component Analysis (PCA) is an unsupervised dimensionality-reduction technique that projects a dataset with many features onto a smaller set of uncorrelated components while retaining as much of the original variance as possible. Below, we go through a step-by-step implementation of PCA in Python, with explanations and outputs.
Steps for PCA Implementation in Python
We'll use Python's scikit-learn (sklearn) library to implement PCA, along with numpy, pandas, and matplotlib for data manipulation and visualization.
Step 1: Import Required Libraries
First, we need to import all the necessary libraries for PCA.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
- numpy: For numerical operations like array manipulation.
- pandas: For data handling and operations on datasets.
- matplotlib.pyplot: For data visualization.
- sklearn.decomposition.PCA: For applying PCA.
- sklearn.preprocessing.StandardScaler: For scaling the data.
Step 2: Load the Dataset
For this example, we will use the famous Iris dataset, which is available in sklearn. It contains 150 samples of iris flowers, with 4 features (sepal length, sepal width, petal length, petal width) and 3 target classes (species).
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable (Species)
- X contains the features (4 dimensions).
- y contains the target labels (species: setosa, versicolor, virginica).
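If you want to confirm the dataset's shape and class names before going further, a quick optional check looks like this:
# Quick check of the dataset dimensions and class names
print(X.shape)            # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']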
Step 3: Standardize the Data
PCA is sensitive to the scale of the features, so we need to standardize the data. Standardization ensures that each feature has a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
- StandardScaler: This will standardize the data, transforming each feature to have zero mean and unit variance.
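As an optional sanity check, you can verify that each standardized feature now has (approximately) zero mean and unit standard deviation:
# Each standardized feature should have mean ~0 and standard deviation ~1
print("Means:", X_scaled.mean(axis=0).round(6))
print("Std devs:", X_scaled.std(axis=0).round(6))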
Step 4: Apply PCA
Now, we can apply PCA. Let's say we want to reduce the dimensionality to 2 principal components (for visualization purposes).
# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
- PCA(n_components=2): We are asking PCA to reduce the dataset to 2 principal components. This is suitable for 2D visualization.
- fit_transform(): This function fits the PCA model to the standardized data and then transforms it into the new principal component space.
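If you are curious how each original feature contributes to the new axes, the fitted pca object exposes the component loadings. This inspection is optional, and the PC1/PC2 labels below are just names we choose for readability:
# Each row is a principal component expressed as weights on the original features
loadings = pd.DataFrame(pca.components_,
                        columns=iris.feature_names,
                        index=['PC1', 'PC2'])
print(loadings)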
Step 5: Explained Variance
The explained variance tells us how much information (variance) is captured by each principal component. It's essential to understand how much of the original data's variance is retained in the reduced data.
# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance: {explained_variance}")
- explained_variance_ratio_: This attribute provides the percentage of variance explained by each principal component.
Output:
Explained Variance: [0.72962445 0.22850762]
- The first principal component explains about 72.96% of the variance and the second about 22.85%, so together they retain roughly 95.81% of the original variance in the standardized data.
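A common way to report this total is the cumulative explained variance, which is simply the running sum of the ratios:
# Cumulative variance retained by the first k components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(f"Cumulative Explained Variance: {cumulative_variance}")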
Step 6: Visualize the Transformed Data
We can now visualize the 2D representation of the data in the reduced space.
# Create a DataFrame for the PCA results
pca_df = pd.DataFrame(data=X_pca, columns=['Principal Component 1', 'Principal Component 2'])
# Plot the data points in 2D
plt.figure(figsize=(8, 6))
plt.scatter(pca_df['Principal Component 1'], pca_df['Principal Component 2'], c=y, cmap='viridis', edgecolor='k', s=100)
plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Species')
plt.show()
- We are plotting the 2 principal components against each other using plt.scatter.
- The color of the points corresponds to the species labels (target variable y), which allows us to see how well PCA has separated the classes in the new 2D space.
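If you prefer an explicit legend with species names instead of a colorbar, one possible variant (using the same X_pca and y from above) is:
# Alternative plot: one scatter call per species so the legend shows class names
plt.figure(figsize=(8, 6))
for class_index, species in enumerate(iris.target_names):
    mask = (y == class_index)
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=species, edgecolor='k', s=100)
plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Species')
plt.show()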
Step 7: Recover the Original Data
After applying PCA, you can map the reduced data back toward the original feature space with an inverse transform. Because PCA is a linear transformation, this projection is exact only when all components are kept; with 2 of the 4 components, the result is an approximation of the standardized data.
# Reconstruct the data
X_reconstructed = pca.inverse_transform(X_pca)
# Print the first 5 reconstructed samples
print(f"Reconstructed Data (first 5 samples): \n{X_reconstructed[:5]}")
- inverse_transform(): Projects the 2-component data back into the (standardized) 4-feature space using the principal components; measuring the reconstruction error and undoing the scaling are sketched below.
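Since only 2 of the 4 components were kept, the reconstruction is lossy. A minimal sketch for quantifying the error and returning to the original units, reusing the scaler and arrays defined above:
# Mean squared reconstruction error in the standardized space
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Mean squared reconstruction error: {reconstruction_error:.4f}")

# Undo the standardization to express the reconstruction in the original units
X_original_units = scaler.inverse_transform(X_reconstructed)
print(X_original_units[:5])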
Step 8: Summary and Output
Let's summarize what we've done in this step-by-step implementation:
- Data Loading: We used the Iris dataset, which has 150 samples and 4 features.
- Standardization: We scaled the features to ensure that they contribute equally to the PCA.
- PCA Transformation: We reduced the data to 2 principal components.
- Explained Variance: We found that the first two components explain roughly 95.81% of the variance in the data.
- Visualization: We visualized the reduced data in 2D and observed how well PCA separated the different species.
- Reconstruction: We demonstrated how the original data can be approximately reconstructed from the principal components.
Key Takeaways
In this implementation:
- We successfully applied PCA to the Iris dataset.
- By reducing the data to two principal components, we retained most of the variance (roughly 95.81%).
- We visualized the data and showed that PCA effectively separated the different species in 2D.
PCA is a powerful tool for dimensionality reduction and is commonly used in machine learning to preprocess data, reducing noise and redundancy so that downstream algorithms can learn patterns more easily.
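As a closing sketch (not part of the walkthrough above), scikit-learn also accepts a variance fraction for n_components, which is a convenient way to use PCA as a preprocessing step; for example, keeping enough components to retain at least 95% of the variance:
# Let PCA choose the number of components needed to retain >= 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print(f"Components kept: {pca_95.n_components_}")
print(f"Variance retained: {pca_95.explained_variance_ratio_.sum():.4f}")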