Python code examples of explained variance in PCA

Yang Zhang
2 min read · Jun 2, 2018


Some Python code and numerical examples illustrating how explained_variance_ and explained_variance_ratio_ are calculated in PCA. Scikit-learn's documentation describes explained_variance_ as:

The amount of variance explained by each of the selected components.

(See here for Python code examples of PCA vs. SVD: https://medium.com/@zhang_yang/python-code-examples-of-pca-v-s-svd-4e9861db0a71)

Imports:

import numpy as np
import pandas as pd
from numpy.linalg import svd
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

Make some fake data:

X = np.random.random((5, 8))
X.round(2)

Out[5]:

array([[0.51, 0.95, 0.86, 0.34, 0.96, 0.18, 0.5 , 0.88],
       [0.86, 0.91, 0.83, 0.24, 0.14, 0.58, 0.07, 0.13],
       [0.12, 0.56, 0.61, 0.53, 0.31, 0.03, 0.74, 0.24],
       [0.62, 0.17, 0.69, 0.49, 0.6 , 0.59, 0.77, 0.21],
       [0.91, 0.2 , 0.53, 0.71, 0.2 , 0.35, 0.89, 0.68]])
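Note that the outputs below depend on this particular random draw, so a fresh run will produce different numbers. To reproduce a run end to end, one could fix the seed before generating X (a hypothetical choice, not part of the original notebook):

np.random.seed(0)  # any fixed seed works; the outputs shown here came from an unseeded run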

Number of components:

k = 5

SVD

Remove column means:

X_0mean = X - X.mean(0)
X_0mean.round(2)

Out[8]:

array([[-0.09,  0.39,  0.16, -0.12,  0.52, -0.17, -0.09,  0.45],
       [ 0.25,  0.36,  0.13, -0.23, -0.3 ,  0.23, -0.52, -0.3 ],
       [-0.49, -0.  , -0.1 ,  0.07, -0.13, -0.31,  0.14, -0.19],
       [ 0.02, -0.39, -0.02,  0.03,  0.16,  0.24,  0.17, -0.22],
       [ 0.31, -0.36, -0.17,  0.25, -0.24,  0.  ,  0.3 ,  0.25]])

In [9]:

U, s, Vh = svd(X_0mean, full_matrices=False)  # economy SVD: U is (5, 5), s is (5,), Vh is (5, 8)

PCA

pca = PCA(n_components=k)  # PCA centers X internally before running an SVD
pca.fit(X)
X_transformed = pca.transform(X)

Compare singular values from PCA and SVD

s.round(2), pca.singular_values_.round(2)

Out[13]:

(array([1.04, 0.94, 0.72, 0.51, 0.  ]), array([1.04, 0.94, 0.72, 0.51, 0.  ]))
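Two things are worth noting here. PCA centers the data internally, which is why its singular values match the SVD of X_0mean rather than of X. And the last singular value is (numerically) zero: centering 5 samples leaves at most 4 linearly independent rows, so X_0mean has rank 4. A quick check, reusing the variables defined above:

np.allclose(s, pca.singular_values_)  # True: PCA on X matches SVD on X_0mean
np.isclose(s[-1], 0)                  # True: rank of X_0mean is n_sample - 1 = 4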

Explained variance

In [14]:

pca.explained_variance_.round(2)

Out[14]:

array([0.27, 0.22, 0.13, 0.07, 0.  ])

In [15]:

n_sample = X.shape[0]

This is one way to calculate explained_variance_ (note the n_sample - 1 denominator, which gives the unbiased estimate of the variance):

# https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py
(s**2/(n_sample-1)).round(2)

Out[16]:

array([0.27, 0.22, 0.13, 0.07, 0.  ])

This is an equivalent way to calculate explained_variance_:

np.array([(X_transformed[:, i]**2).sum()/(n_sample-1) for i in range(k)]).round(2)

Out[17]:

array([0.27, 0.22, 0.13, 0.07, 0.  ])
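Because PCA centers the data, the columns of X_transformed have zero mean, so the sum of squares above is just the per-column sample variance. A one-line equivalent (a sketch, not from the original notebook):

np.var(X_transformed, axis=0, ddof=1).round(2)  # ddof=1 for the unbiased estimate
# -> array([0.27, 0.22, 0.13, 0.07, 0.  ])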

Equivalently, it can be calculated from the SVD by reconstructing each rank-1 component:

vs = []
for i in range(k):
    Xi = U[:, i].reshape(-1, 1) * s[i] @ Vh[i].reshape(1, -1)  # rank-1 matrix for component i
    var_i = (Xi**2).sum() / (n_sample - 1)
    vs.append(var_i)
np.array(vs).round(2)

Out[18]:

array([0.27, 0.22, 0.13, 0.07, 0.  ])
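The reason the SVD route agrees with the previous two: the transformed data is U scaled by the singular values, up to a per-component sign convention that scikit-learn applies. A quick check under that assumption:

np.allclose(np.abs(X_transformed), np.abs(U * s))  # True up to per-component sign flips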

Explained variance ratio

explained_variance_ratio_ can be derived from explained_variance_ or directly from the singular values:

(
    pca.explained_variance_ratio_.round(2),
    np.array([pca.explained_variance_[i] / pca.explained_variance_.sum() for i in range(k)]).round(2),
    np.array([s[i]**2 / sum(s**2) for i in range(k)]).round(2),
)

Out[19]:

(array([0.39, 0.32, 0.19, 0.1 , 0.  ]),
 array([0.39, 0.32, 0.19, 0.1 , 0.  ]),
 array([0.39, 0.32, 0.19, 0.1 , 0.  ]))
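Equivalently, each ratio is a component's variance divided by the total variance of the data, which is the sum of the per-feature sample variances of X (a sketch, not from the original notebook):

total_var = np.var(X, axis=0, ddof=1).sum()  # same as (X_0mean**2).sum() / (n_sample - 1)
(pca.explained_variance_ / total_var).round(2)
# -> array([0.39, 0.32, 0.19, 0.1 , 0.  ])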

It can also be derived from the SVD:

total_var = (X_0mean**2).sum() / (n_sample - 1)  # total variance of the centered data
rs = []
for i in range(k):
    Xi = U[:, i].reshape(-1, 1) * s[i] @ Vh[i].reshape(1, -1)  # rank-1 matrix for component i
    var_i = (Xi**2).sum() / (n_sample - 1)
    rs.append(var_i / total_var)
np.array(rs).round(2)

Out[34]:

array([0.39, 0.32, 0.19, 0.1 , 0.  ])

Cumulative explained variance ratio

In [21]:

np.array([pca.explained_variance_ratio_[:i].sum() for i in range(1, k+1)]).round(2)

Out[21]:

array([0.39, 0.72, 0.9 , 1.  , 1.  ])
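np.cumsum expresses the same thing more directly:

np.cumsum(pca.explained_variance_ratio_).round(2)
# -> array([0.39, 0.72, 0.9 , 1.  , 1.  ])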

Via the SVD:

total_sq = (X_0mean**2).sum()
rs = []
for i in range(1, k+1):
    Xi = U[:, :i] * s[:i] @ Vh[:i]  # rank-i reconstruction from the top i components
    rs.append((Xi**2).sum() / total_sq)

np.array(rs).round(2)

Out[23]:

array([0.39, 0.72, 0.9 , 1.  , 1.  ])
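As a final sanity check, the SVD-based cumulative ratios should agree with the cumulative sum of the PCA attribute (reusing rs from the block above):

np.allclose(np.array(rs), np.cumsum(pca.explained_variance_ratio_))  # True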

Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/pca_svd_variance_explained.ipynb
