Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is an unsupervised machine learning technique for reducing the number of features in high-dimensional, correlated data sets. Data such as images and text documents are high dimensional, which demands unnecessary computation power and storage. The basic goal of PCA is to keep the components (directions) with high variance, because high variance along a direction means it carries more information. In this post, I will explain the mathematical background of PCA and explore it step by step.
To understand PCA, the following concepts are important (a short NumPy illustration of them follows the list).
Variance
It measures how a data variable is scattered around its mean value.
Covariance
It measures how two data variables vary with respect to each other.
Eigenvalues and Eigenvectors
For a square matrix A, a non-zero vector v is an eigenvector with eigenvalue λ if Av = λv. In PCA, the eigenvectors of the covariance matrix give the principal directions, and the eigenvalues give the variance along those directions.
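As a quick, hedged illustration of these three quantities, here is a minimal NumPy sketch on made-up toy values (not the data set used later in this post):
import numpy as np
a = np.array([0.1, 0.4, 0.35, 0.8, 0.05])   # toy variable a
b = np.array([0.2, 0.5, 0.30, 0.9, 0.10])   # toy variable b
print(np.var(a, ddof=1))                    # sample variance of a (spread around its mean)
print(np.cov(a, b))                         # 2x2 covariance matrix of a and b
vals, vecs = np.linalg.eig(np.cov(a, b))    # eigen decomposition of the covariance matrix
print(vals)                                 # eigenvalues: variance along each principal direction
print(vecs)                                 # eigenvectors (columns): the principal directions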
Steps in PCA
Input- A Dataset Having n Features x1, x2, x3, …, xn
Step-1
Find Covariance Matrix of Features
Step-2
Perform Eigen Decomposition of Symmetric Matrix
Step-3
Sort Eigen Values in Decreasing Order
Output- Eigenvectors Ordered by Eigenvalue From High to Low Give the Principal Components
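Putting these steps together, a minimal sketch of the whole pipeline could look like the function below. This is my own summary, not code from the post: the data matrix X is assumed to have one row per feature (matching np.cov's default), and I use np.linalg.eigh, which is intended for symmetric matrices, whereas the post itself uses np.linalg.eig.
import numpy as np
def pca_components(X):
    # X: array of shape (n_features, n_samples), one row per feature
    covmat = np.cov(X)                            # Step-1: covariance matrix of the features
    eigenval, eigenvec = np.linalg.eigh(covmat)   # Step-2: eigen decomposition (eigh returns ascending eigenvalues)
    order = np.argsort(eigenval)[::-1]            # Step-3: sort eigenvalues in decreasing order
    return eigenval[order], eigenvec[:, order]    # Output: eigenvalues and matching eigenvectors, largest first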
Dataset
To explore PCA, I have taken a dummy data set having 2 features (2 dimensional dataset) and 35 instances.
Two features x1 and x2 are
x1=[0.10,.15,.48,1.0,.34,.45,.10,.65,.9,0.10,-.12,0.2,1.0,0.5,-.6,.24,.12,.13,.9,.20,-.45,.86,.13,.15,.16,-.26,-.32,.47,-.34,.57,.26,.5,.9,.8,.12]
x2=[.15,.12,.45,.7,.1,-.05,-.16,.13,.20,-.02,.17,.12,.15,.12,-.19,.16,-.05,0.14,0.20,0.10,.5,.15,.18,.4,.7,.18,.1,.17,.3,0.08,.18,0.4,.15,.17,.10]
We will see which feature is more important after applying PCA.
Step-1 Find Covariance Matrix of Dataset’s Features
In Python, NumPy's cov() function is used to compute the covariance matrix. The following code produces the result:
covmat=np.cov(x1x2)
print('Covariance Matrix\n',covmat)
Covariance Matrix
[[0.17567513 0.01437941]
[0.01437941 0.03768235]]
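To see what np.cov is doing here, this hedged sketch (my own addition, reusing the x1 and x2 lists above) recomputes the same matrix by mean-centering each feature and dividing the matrix of products by n - 1:
X = np.array([x1, x2])                     # shape (2, 35): one row per feature
Xc = X - X.mean(axis=1, keepdims=True)     # subtract each feature's mean
manual_cov = Xc @ Xc.T / (X.shape[1] - 1)  # unbiased sample covariance (divide by n - 1)
print(manual_cov)                          # should match np.cov(x1x2) above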
Step-2 Perform Eigen Decomposition of Symmetric Matrix
A symmetric matrix has all real eigenvalues, and its eigenvectors are orthogonal.
The following code computes the eigenvalues and eigenvectors in Python.
eigenval,eigenvec=egn.eig(covmat)
print('Eigenvalues\n',eigenval)
print('Eigenvectors\n',eigenvec)
Eigenvalues
[0.17715759 0.03619989]
Eigenvectors
[[ 0.99472755 -0.10255295]
[ 0.10255295 0.99472755]]
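As a quick sanity check of the claim above (my own addition, not part of the original post), you can verify that the two eigenvectors are orthogonal and that each one satisfies the eigenvalue equation:
v1 = eigenvec[:, 0]                    # NumPy returns eigenvectors as columns
v2 = eigenvec[:, 1]
print(np.dot(v1, v2))                  # ~0: the eigenvectors are orthogonal
print(covmat @ v1 - eigenval[0] * v1)  # ~[0, 0]: covmat @ v1 equals eigenval[0] * v1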
Step-3
Sort Eigenvalues in Decreasing Order
You can see that the eigenvalues [0.17715759, 0.03619989] are already in high-to-low order. The principal component corresponding to 0.17715759 is more significant than the one corresponding to 0.03619989.
Observe the scatter plot of the data and the orthogonal eigenvectors. The eigenvector corresponding to the higher eigenvalue points along the main spread of the data, which shows that the principal component with the higher eigenvalue is more important.
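The post stops at ranking the components, but if you actually wanted to reduce the data to one dimension, a minimal sketch (my own addition, reusing eigenval, eigenvec, x1, and x2 from above) would project the mean-centered data onto the eigenvector with the largest eigenvalue:
order = np.argsort(eigenval)[::-1]      # indices of eigenvalues, largest first
pc1 = eigenvec[:, order[0]]             # first principal component (direction of highest variance)
X = np.array([x1, x2])
Xc = X - X.mean(axis=1, keepdims=True)  # mean-center the features
reduced = pc1 @ Xc                      # 1-D projection of all 35 instances onto pc1
print(reduced.shape)                    # (35,)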
Python Code to Understand Principal Component Analysis (PCA)
import matplotlib.pyplot as plt
import numpy as np
from numpy import linalg as egn
fig = plt.figure()
ax = fig.gca()
# 2-dimensional dummy data set: 35 instances of the two features x1 and x2
x1=[0.10,.15,.48,1.0,.34,.45,.10,.65,.9,0.10,-.12,0.2,1.0,0.5,-.6,.24,.12,.13,.9,.20,-.45,.86,.13,.15,.16,-.26,-.32,.47,-.34,.57,.26,.5,.9,.8,.12]
x2=[.15,.12,.45,.7,.1,-.05,-.16,.13,.20,-.02,.17,.12,.15,.12,-.19,.16,-0.05,0.14,0.20,0.10,.5,.15,.18,.4,.7,.18,.1,.17,.3,0.08,.18,0.4,.15,.17,.10]
# Step-1: covariance matrix of the features (one row per feature)
x1x2=np.array([x1,x2])
covmat=np.cov(x1x2)
print('Covariance Matrix\n',covmat)
# Step-2: eigen decomposition of the symmetric covariance matrix
eigenval,eigenvec=egn.eig(covmat)
print('Eigenvalues\n',eigenval)
print('Eigenvectors\n',eigenvec)
# The eigenvectors are the columns of eigenvec
u=eigenvec[:,0]
v=eigenvec[:,1]
print('First Eigenvector\n',u)
print('Second Eigenvector\n',v)
# Step-3: plot the data with both eigenvectors drawn as arrows from the origin
x, y = ([0, 0], [0, 0])
col=['g','r']
ax.quiver(x, y, [u[0], v[0]], [u[1], v[1]], scale=3, color=col)
ax.set_ylim(-2, 2)
ax.set_xlim(-2, 2)
ax.scatter(x1, x2, c='b', marker='o')
ax.set_xlabel('x1 feature')
ax.set_ylabel('x2 feature')
plt.savefig("PCA.jpg")
plt.show()