Python code examples of explained variance in PCA
Some Python code and numerical examples illustrating how explained_variance_ and explained_variance_ratio_ are calculated in PCA. Scikit-learn's documentation describes explained_variance_ as:
The amount of variance explained by each of the selected components.
(See here for Python code examples of PCA vs. SVD: https://medium.com/@zhang_yang/python-code-examples-of-pca-v-s-svd-4e9861db0a71)
Imports:
import numpy as np
import pandas as pd
from numpy.linalg import svd
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
Make some fake data (no random seed is set, so re-running produces different numbers):
X = np.random.random((5, 8))
X.round(2)
Out[5]:
array([[0.51, 0.95, 0.86, 0.34, 0.96, 0.18, 0.5 , 0.88],
[0.86, 0.91, 0.83, 0.24, 0.14, 0.58, 0.07, 0.13],
[0.12, 0.56, 0.61, 0.53, 0.31, 0.03, 0.74, 0.24],
[0.62, 0.17, 0.69, 0.49, 0.6 , 0.59, 0.77, 0.21],
[0.91, 0.2 , 0.53, 0.71, 0.2 , 0.35, 0.89, 0.68]])
Number of components
k = 5
SVD
Remove column means:
X_0mean = X - X.mean(0)
X_0mean.round(2)
Out[8]:
array([[-0.09, 0.39, 0.16, -0.12, 0.52, -0.17, -0.09, 0.45],
[ 0.25, 0.36, 0.13, -0.23, -0.3 , 0.23, -0.52, -0.3 ],
[-0.49, -0. , -0.1 , 0.07, -0.13, -0.31, 0.14, -0.19],
[ 0.02, -0.39, -0.02, 0.03, 0.16, 0.24, 0.17, -0.22],
[ 0.31, -0.36, -0.17, 0.25, -0.24, 0. , 0.3 , 0.25]])
In [9]:
U, s, Vh = svd(X_0mean, full_matrices=False)
PCA
pca = PCA(n_components=k)
pca.fit(X)
X_transformed = pca.transform(X)
Compare singular values from PCA and SVD
s.round(2), pca.singular_values_.round(2)
Out[13]:
(array([1.04, 0.94, 0.72, 0.51, 0. ]), array([1.04, 0.94, 0.72, 0.51, 0. ]))
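The smallest singular value is 0 by construction: with k equal to the number of samples (5), mean-centering removes one degree of freedom, so the centered 5x8 matrix has rank at most 4 and at most 4 components carry variance. A quick check (a minimal sketch using fresh seeded data, not the arrays above):

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
X = rng.random((5, 8))
X_0mean = X - X.mean(0)
# centering the 5 rows removes one degree of freedom, so rank <= 4
print(np.linalg.matrix_rank(X_0mean))
s = np.linalg.svd(X_0mean, compute_uv=False)
print(s[-1])  # the 5th singular value is numerically ~0
```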
Explained variance
In [14]:
pca.explained_variance_.round(2)
Out[14]:
array([0.27, 0.22, 0.13, 0.07, 0. ])
In [15]:
n_sample = X.shape[0]
This is one way to calculate explained_variance_ (note the n_sample-1 denominator, which gives the unbiased estimate of the variance):
# https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py
(s**2/(n_sample-1)).round(2)
Out[16]:
array([0.27, 0.22, 0.13, 0.07, 0. ])
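The match above is only shown on rounded values; a self-contained np.allclose check (again with fresh seeded data, so the numbers differ from the arrays above) confirms the equality exactly:

```python
import numpy as np
from numpy.linalg import svd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
X = rng.random((5, 8))
_, s, _ = svd(X - X.mean(0), full_matrices=False)
pca = PCA(n_components=5).fit(X)
n_sample = X.shape[0]
# explained_variance_ is the squared singular values over (n_sample - 1)
print(np.allclose(pca.explained_variance_, s**2/(n_sample - 1)))  # True
```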
This is an equivalent way to calculate explained_variance_ from the transformed data:
np.array([(X_transformed[:, i]**2).sum()/(n_sample-1) for i in range(k)]).round(2)
Out[17]:
array([0.27, 0.22, 0.13, 0.07, 0. ])
Equivalently, it can be calculated from the SVD's rank-1 components:
vs = []
for i in range(k):
    Xi = U[:, i].reshape(-1, 1)*s[i]@Vh[i].reshape(1, -1)
    var_i = (Xi**2).sum()/(n_sample-1)
    vs.append(var_i)
np.array(vs).round(2)
Out[18]:
array([0.27, 0.22, 0.13, 0.07, 0. ])
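The per-component variances also add up to the total variance of the centered data, because the rank-1 terms of the SVD are mutually orthogonal. A minimal sketch of this additivity, using fresh seeded data:

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
X = rng.random((5, 8))
X_0mean = X - X.mean(0)
s = np.linalg.svd(X_0mean, compute_uv=False)
n_sample = X.shape[0]
# orthogonality of the rank-1 SVD terms makes the variances additive:
# sum_i s_i**2 equals the squared Frobenius norm of X_0mean
per_component = s**2/(n_sample - 1)
total_var = (X_0mean**2).sum()/(n_sample - 1)
print(np.isclose(per_component.sum(), total_var))  # True
```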
Explained variance ratio
explained_variance_ratio_ can be derived from explained_variance_, or from the singular values:
(
    pca.explained_variance_ratio_.round(2),
    np.array([pca.explained_variance_[i]/pca.explained_variance_.sum() for i in range(k)]).round(2),
    np.array([s[i]**2 / sum(s**2) for i in range(k)]).round(2),
)
Out[19]:
(array([0.39, 0.32, 0.19, 0.1 , 0. ]),
array([0.39, 0.32, 0.19, 0.1 , 0. ]),
array([0.39, 0.32, 0.19, 0.1 , 0. ]))
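The list comprehensions above can also be vectorized; the (n_sample-1) factors cancel in the ratio, which is why it can be computed from the singular values alone. A sketch with fresh seeded data:

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
X = rng.random((5, 8))
s = np.linalg.svd(X - X.mean(0), compute_uv=False)
# the (n_sample - 1) denominators cancel, leaving only singular values
ratio = s**2/(s**2).sum()
print(np.isclose(ratio.sum(), 1.0))  # True: the ratios sum to 1
```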
Equivalently, it can be calculated from the SVD's rank-1 components:
total_var = (X_0mean**2).sum()/(n_sample-1)
rs = []
for i in range(k):
    Xi = U[:, i].reshape(-1, 1)*s[i]@Vh[i].reshape(1, -1)
    var_i = (Xi**2).sum()/(n_sample-1)
    rs.append(var_i/total_var)
np.array(rs).round(2)
Out[34]:
array([0.39, 0.32, 0.19, 0.1 , 0. ])
Cumulative explained variance ratio
In [21]:
np.array([pca.explained_variance_ratio_[:i].sum() for i in range(1, k+1)]).round(2)
Out[21]:
array([0.39, 0.72, 0.9 , 1. , 1. ])
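The slicing loop above can be replaced by np.cumsum. A sketch with fresh seeded data (so the numbers differ from the array above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
X = rng.random((5, 8))
pca = PCA(n_components=5).fit(X)
# np.cumsum gives the cumulative explained variance ratio in one call
cum = np.cumsum(pca.explained_variance_ratio_)
print(cum.round(2))  # the last entry is 1.0: all components together explain everything
```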
Via SVD:
total_sq = (X_0mean**2).sum()
rs = []
for i in range(1, k+1):
    Xi = U[:, :i]*s[:i]@Vh[:i]
    rs.append((Xi**2).sum()/total_sq)
np.array(rs).round(2)
Out[23]:
array([0.39, 0.72, 0.9 , 1. , 1. ])
Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/pca_svd_variance_explained.ipynb
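As a side note, TruncatedSVD was imported above but never used. Unlike PCA it does not center the input itself, so applied to the pre-centered data it reproduces PCA's singular values. A minimal sketch with fresh seeded data:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
X = rng.random((5, 8))
X_0mean = X - X.mean(0)
# TruncatedSVD does no centering, so it is given the centered matrix here;
# arpack needs n_components < min(X.shape), hence 4 rather than 5
tsvd = TruncatedSVD(n_components=4, algorithm='arpack').fit(X_0mean)
pca = PCA(n_components=4).fit(X)
print(np.allclose(tsvd.singular_values_, pca.singular_values_))  # True
```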