Python code examples of PCA vs. SVD
Some Python code and numerical examples illustrating the relationship between PCA and SVD (including truncated SVD), and specifically how PCA can be performed via SVD.
Mathematical explanations can be found in posts like these:
- https://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca
- https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
Imports:
import numpy as np
import pandas as pd
from numpy.linalg import svd
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
Make some fake data to do PCA on:
X = np.random.random((5, 8))
X.round(2)
Out:
array([[0.9 , 0.06, 0.33, 0.6 , 0.21, 0.11, 0.92, 0.76],
[0.2 , 0.82, 0.26, 0.88, 0.92, 0.85, 0.34, 0.08],
[0.54, 0.74, 0.11, 0.03, 0.02, 0.55, 0.72, 0.87],
[0.68, 0.76, 0.71, 0.57, 0.45, 0.36, 0.31, 0.78],
[0.92, 0.81, 0.34, 0.47, 0.1 , 0.4 , 0.05, 0.7 ]])
Do PCA with 3 components:
k = 3
pca = PCA(n_components=k)
pca.fit(X)
Out:
PCA(copy=True, iterated_power='auto', n_components=3, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
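As an aside, PCA's explained variances are just the squared singular values divided by n_samples - 1 (this is how scikit-learn computes `explained_variance_`). A small self-contained check on freshly generated data (not the exact array above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.random_sample((5, 8))

pca = PCA(n_components=3).fit(X)

# explained_variance_ equals singular_values_**2 / (n_samples - 1)
n_samples = X.shape[0]
assert np.allclose(pca.explained_variance_,
                   pca.singular_values_**2 / (n_samples - 1))
```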
Column means of data:
X.mean(0).round(2)
Out:
array([0.65, 0.64, 0.35, 0.51, 0.34, 0.45, 0.47, 0.64])
Remove column means from data:
X_0mean = X - X.mean(0)
X_0mean.round(2)
Out:
array([[ 0.26, -0.57, -0.02, 0.09, -0.13, -0.34, 0.45, 0.12],
[-0.45, 0.18, -0.09, 0.37, 0.58, 0.39, -0.12, -0.56],
[-0.11, 0.1 , -0.24, -0.48, -0.32, 0.09, 0.25, 0.23],
[ 0.03, 0.12, 0.36, 0.06, 0.11, -0.09, -0.16, 0.14],
[ 0.27, 0.17, -0.01, -0.04, -0.24, -0.05, -0.42, 0.06]])
Do SVD:
U, s, Vh = svd(X_0mean, full_matrices=False)
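A quick sanity check on the factorization (a sketch on freshly generated data, not the exact array above): with `full_matrices=False`, `U @ diag(s) @ Vh` reconstructs the centered matrix exactly, and because column-centering makes the rows sum to the zero vector, the rank is at most n_samples - 1 and the smallest singular value is numerically zero:

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.RandomState(0)
X = rng.random_sample((5, 8))
X_0mean = X - X.mean(0)

U, s, Vh = svd(X_0mean, full_matrices=False)

# The thin SVD reconstructs the centered data exactly.
assert np.allclose(U @ np.diag(s) @ Vh, X_0mean)

# Columns have zero mean, so the 5 rows sum to the zero vector:
# rank <= n_samples - 1, and the smallest singular value is ~0.
assert np.isclose(s[-1], 0)
```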
Compare SVD and PCA
singular values
Run:
pca.singular_values_.round(2)
Out:
array([1.29, 0.84, 0.73])
Run:
s.round(2)
Out:
array([1.29, 0.84, 0.73, 0.38, 0.  ])
Note that the full SVD returns all five singular values: the first k match PCA's, and the last is numerically zero because mean-centering makes the five rows linearly dependent.
components
Run:
pca.components_.round(2)
Out:
array([[-0.37, 0.29, 0.02, 0.32, 0.52, 0.35, -0.28, -0.45],
[-0.05, -0.61, -0.02, 0.38, 0.31, -0.17, 0.56, -0.21],
[-0.44, 0.09, -0.5 , -0.44, -0.1 , 0.35, 0.48, -0.06]])
Run:
Vh[:k].round(2)
Out:
array([[ 0.37, -0.29, -0.02, -0.32, -0.52, -0.35, 0.28, 0.45],
[-0.05, -0.61, -0.02, 0.38, 0.31, -0.17, 0.56, -0.21],
[-0.44, 0.09, -0.5 , -0.44, -0.1 , 0.35, 0.48, -0.06]])
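The first row differs only by sign; up to a per-row sign, PCA's components and the top rows of Vh coincide. A sign-insensitive check, as a self-contained sketch on freshly generated data (not the exact arrays above):

```python
import numpy as np
from numpy.linalg import svd
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.random_sample((5, 8))
k = 3

pca = PCA(n_components=k).fit(X)
U, s, Vh = svd(X - X.mean(0), full_matrices=False)

# Same top-k singular values.
assert np.allclose(pca.singular_values_, s[:k])

# Components agree up to a per-row sign, so their absolute
# values are identical.
assert np.allclose(np.abs(pca.components_), np.abs(Vh[:k]))
```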
transformed data
Run:
pca.transform(X).round(2)
Out:
array([[-0.6 , 0.62, -0.09],
[ 1.06, 0.21, 0.15],
[-0.4 , -0.25, 0.56],
[ 0.05, -0.13, -0.34],
[-0.11, -0.45, -0.27]])
Run:
(U[:, :k]*s[:k]).round(2)
Out:
array([[ 0.6 , 0.62, -0.09],
[-1.06, 0.21, 0.15],
[ 0.4 , -0.25, 0.56],
[-0.05, -0.13, -0.34],
[ 0.11, -0.45, -0.27]])
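The expression `U[:, :k]*s[:k]` has a direct interpretation: it is the centered data projected onto the top k right singular vectors, i.e. `X_0mean @ Vh[:k].T`, which is exactly what PCA's transform computes (up to the sign flips above). A self-contained sketch on fresh data:

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.RandomState(0)
X = rng.random_sample((5, 8))
X_0mean = X - X.mean(0)
k = 3

U, s, Vh = svd(X_0mean, full_matrices=False)

# Scaling the left singular vectors by the singular values equals
# projecting the centered data onto the right singular vectors.
assert np.allclose(U[:, :k] * s[:k], X_0mean @ Vh[:k].T)
```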
Truncated SVD
Note how some signs are flipped between the raw numpy SVD and PCA. The signs agree if we instead use scikit-learn's TruncatedSVD, because scikit-learn applies the same deterministic sign-fixing step inside both PCA and TruncatedSVD. As the TruncatedSVD documentation puts it:
SVD suffers from a problem called “sign indeterminancy”, which means the sign of the components_ and the output from transform depend on the algorithm and random state. To work around this, fit instances of this class to data once, then keep the instance around to do transformations.
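One common sign convention (a simplified sketch of what scikit-learn's internal svd_flip helper does, not its exact logic): flip each row of Vh so that its largest-magnitude entry is positive, and apply the same flip to the corresponding column of U so the product is unchanged:

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.RandomState(0)
X = rng.random_sample((5, 8))
X_0mean = X - X.mean(0)

U, s, Vh = svd(X_0mean, full_matrices=False)

# For each row of Vh, take the sign of its largest-magnitude entry.
max_abs_idx = np.argmax(np.abs(Vh), axis=1)
signs = np.sign(Vh[np.arange(Vh.shape[0]), max_abs_idx])

# Flip rows of Vh and the matching columns of U; since the same
# sign multiplies both factors, U @ diag(s) @ Vh is unchanged.
Vh_flipped = Vh * signs[:, np.newaxis]
U_flipped = U * signs

assert np.allclose(U_flipped @ np.diag(s) @ Vh_flipped, X_0mean)
```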
Run:
svd_tr = TruncatedSVD(n_components=k)
svd_tr.fit(X_0mean)
Out:
TruncatedSVD(algorithm='randomized', n_components=3, n_iter=5,
random_state=None, tol=0.0)
Compare Truncated SVD and PCA
singular values
Run:
pca.singular_values_.round(2)
Out:
array([1.29, 0.84, 0.73])
Run:
svd_tr.singular_values_.round(2)
Out:
array([1.29, 0.84, 0.73])
components
(Notice the same signs)
Run:
pca.components_.round(2)
Out:
array([[-0.37, 0.29, 0.02, 0.32, 0.52, 0.35, -0.28, -0.45],
[-0.05, -0.61, -0.02, 0.38, 0.31, -0.17, 0.56, -0.21],
[-0.44, 0.09, -0.5 , -0.44, -0.1 , 0.35, 0.48, -0.06]])
Run:
svd_tr.components_.round(2)
Out:
array([[-0.37, 0.29, 0.02, 0.32, 0.52, 0.35, -0.28, -0.45],
[-0.05, -0.61, -0.02, 0.38, 0.31, -0.17, 0.56, -0.21],
[-0.44, 0.09, -0.5 , -0.44, -0.1 , 0.35, 0.48, -0.06]])
transformed
(Notice the same signs)
Run:
pca.transform(X).round(2)
Out:
array([[-0.6 , 0.62, -0.09],
[ 1.06, 0.21, 0.15],
[-0.4 , -0.25, 0.56],
[ 0.05, -0.13, -0.34],
[-0.11, -0.45, -0.27]])
Run:
svd_tr.transform(X_0mean).round(2)
Out:
array([[-0.6 , 0.62, -0.09],
[ 1.06, 0.21, 0.15],
[-0.4 , -0.25, 0.56],
[ 0.05, -0.13, -0.34],
[-0.11, -0.45, -0.27]])
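Putting it together, the agreement can be checked programmatically. A self-contained sketch on freshly generated data with a fixed random_state (not the exact arrays above); the absolute-value comparison sidesteps any residual per-component sign ambiguity:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.random_sample((5, 8))
X_0mean = X - X.mean(0)
k = 3

pca = PCA(n_components=k).fit(X)
svd_tr = TruncatedSVD(n_components=k, random_state=0).fit(X_0mean)

# Same singular values, and the same transformed data up to
# (at most) a per-component sign.
assert np.allclose(pca.singular_values_, svd_tr.singular_values_)
assert np.allclose(np.abs(pca.transform(X)),
                   np.abs(svd_tr.transform(X_0mean)))
```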
Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/pca_vs_svd.ipynb