Python code examples of using SVD (PCA) for embeddings
Some Python code and numerical examples illustrating how to use SVD and PCA for embeddings.
Imports:
import numpy as np
import numpy.linalg as la
from sklearn.decomposition import PCA
Make some fake data. Think of it as a movie-rating matrix of shape n_user by n_movie.
X = np.array([
[4, 4, 0, 0],
[3, 3, 0, 0],
[5, 5, 0, 0],
[0, 0, 3, 3],
[0, 0, 2, 2],
[0, 0, 5, 5],
])
m, n = X.shape
SVD:
U, s, Vh = la.svd(X, full_matrices=False)
Sigma = np.diag(s)
U.shape, s.shape, Sigma.shape, Vh.shape
Out[55]:
((6, 4), (4,), (4, 4), (4, 4))
Number of components:
k = 2
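Why k = 2 works here (a quick check, not part of the original notebook): the fraction of squared singular values carried by each component shows that the first two components capture everything.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])

U, s, Vh = la.svd(X, full_matrices=False)

# Fraction of total energy (squared singular value) per component.
ratio = s**2 / np.sum(s**2)
print(ratio.round(4))  # the first two components carry all the energy
```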
Compare the two below: the last two singular values are zero, so keeping only the first k columns discards nothing.
In [58]:
(U * s).round(2)
Out[58]:
array([[-5.66, 0. , 0. , -0. ],
[-4.24, 0. , -0. , -0. ],
[-7.07, 0. , -0. , 0. ],
[ 0. , -4.24, 0. , 0. ],
[ 0. , -2.83, 0. , 0. ],
[ 0. , -7.07, 0. , -0. ]])
In [59]:
(U[:,:k] * s[:k]).round(2)
Out[59]:
array([[-5.66, 0. ],
[-4.24, 0. ],
[-7.07, 0. ],
[ 0. , -4.24],
[ 0. , -2.83],
[ 0. , -7.07]])
Compare the two below: the same truncation applied to the movie factors.
In [60]:
(Vh.T * s).round(2)
Out[60]:
array([[-7.07, -0. , -0. , 0. ],
[-7.07, -0. , 0. , 0. ],
[-0. , -6.16, 0. , -0. ],
[-0. , -6.16, 0. , 0. ]])
In [61]:
(Vh[:k].T * s[:k]).round(2)
Out[61]:
array([[-7.07, -0. ],
[-7.07, -0. ],
[-0. , -6.16],
[-0. , -6.16]])
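To make the "nothing is lost" claim concrete (a sketch added here, not in the original notebook): the rank-k truncation reconstructs X exactly, because X itself has rank 2.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])

U, s, Vh = la.svd(X, full_matrices=False)
k = 2

# Rank-k reconstruction: U_k @ diag(s_k) @ Vh_k.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k]
print(np.allclose(X_k, X))  # True: X has rank 2
```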
The rows of U[:,:k] * s[:k] are the user embeddings:
In [62]:
eb_u = U[:,:k] * s[:k]
(eb_u).round(2)
Out[62]:
array([[-5.66, 0. ],
[-4.24, 0. ],
[-7.07, 0. ],
[ 0. , -4.24],
[ 0. , -2.83],
[ 0. , -7.07]])
The rows of Vh[:k].T are the movie embeddings:
In [63]:
eb_m = Vh[:k].T
(eb_m).round(2)
Out[63]:
array([[-0.71, -0. ],
[-0.71, -0. ],
[-0. , -0.71],
[-0. , -0.71]])
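One thing these embeddings are immediately useful for (a hypothetical usage sketch, not in the original notebook) is similarity search: cosine similarity between rows of eb_m recovers the fact that movies 0 and 1 are rated together, as are movies 2 and 3.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])

U, s, Vh = la.svd(X, full_matrices=False)
k = 2
eb_m = Vh[:k].T  # one row per movie

# Cosine similarity between movie embeddings.
unit = eb_m / la.norm(eb_m, axis=1, keepdims=True)
sim = unit @ unit.T
print(sim.round(2))  # movies 0,1 are aligned; movies 0,2 are orthogonal
```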
Compare the two below, which show that X = user_embedding @ movie_embedding.T:
In [65]:
(eb_u @ eb_m.T)
Out[65]:
array([[4., 4., 0., 0.],
[3., 3., 0., 0.],
[5., 5., 0., 0.],
[0., 0., 3., 3.],
[0., 0., 2., 2.],
[0., 0., 5., 5.]])
In [67]:
X
Out[67]:
array([[4, 4, 0, 0],
[3, 3, 0, 0],
[5, 5, 0, 0],
[0, 0, 3, 3],
[0, 0, 2, 2],
[0, 0, 5, 5]])
Compare the two below: user_embedding = X @ movie_embedding
In [69]:
eb_u.round(2)
Out[69]:
array([[-5.66, 0. ],
[-4.24, 0. ],
[-7.07, 0. ],
[ 0. , -4.24],
[ 0. , -2.83],
[ 0. , -7.07]])
In [70]:
(X @ eb_m).round(2)
Out[70]:
array([[-5.66, 0. ],
[-4.24, 0. ],
[-7.07, 0. ],
[ 0. , -4.24],
[ 0. , -2.83],
[ 0. , -7.07]])
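The identity user_embedding = X @ movie_embedding also shows how to embed a new user without recomputing the SVD (often called "fold-in"). A sketch with a hypothetical new user who only rated the first movie (my addition, not in the original notebook):

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])

U, s, Vh = la.svd(X, full_matrices=False)
k = 2
eb_m = Vh[:k].T

# Hypothetical new user: rated only the first movie.
new_user = np.array([5, 0, 0, 0])
emb = new_user @ eb_m
print(emb.round(2))  # lands on the same component as users 0-2
```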
PCA equivalence: PCA is SVD applied to the mean-centered data.
In [71]:
pca = PCA(n_components=k)
X_mean = X.mean(axis=0)
X_nrm = X - X_mean
X_nrm.round(2)
Out[71]:
array([[ 2. , 2. , -1.67, -1.67],
[ 1. , 1. , -1.67, -1.67],
[ 3. , 3. , -1.67, -1.67],
[-2. , -2. , 1.33, 1.33],
[-2. , -2. , 0.33, 0.33],
[-2. , -2. , 3.33, 3.33]])
In [72]:
U_, s_, Vh_ = la.svd(X_nrm, full_matrices=False)
Compare the two below: both give the user embeddings.
In [73]:
(pca.fit_transform(X_nrm)).round(2)
Out[73]:
array([[-3.68, 0.12],
[-2.62, -0.82],
[-4.74, 1.06],
[ 3.37, -0.47],
[ 2.43, -1.53],
[ 5.25, 1.64]])
In [74]:
eb_u_ = U_[:, :k]*s_[:k]
(eb_u_).round(2)
Out[74]:
array([[-3.68, 0.12],
[-2.62, -0.82],
[-4.74, 1.06],
[ 3.37, -0.47],
[ 2.43, -1.53],
[ 5.25, 1.64]])
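Note that sklearn's PCA subtracts the column means itself, so passing the raw X gives the same scores as passing the pre-centered X_nrm; the fitted mean is stored in pca.mean_. A small sketch of this (my addition, not in the original notebook):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
k = 2

pca_raw = PCA(n_components=k)
Z_raw = pca_raw.fit_transform(X)  # PCA centers internally

pca_ctr = PCA(n_components=k)
Z_ctr = pca_ctr.fit_transform(X - X.mean(axis=0))  # pre-centered input

print(np.allclose(Z_raw, Z_ctr))  # True
print(pca_raw.mean_)              # the column means PCA subtracted
```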
Compare the two below: both contain the movie embeddings (pca.components_ stores them as rows, i.e. it is eb_m_.T):
In [75]:
eb_m_ = Vh_[:k].T
(eb_m_).round(2)
Out[75]:
array([[-0.53, 0.47],
[-0.53, 0.47],
[ 0.47, 0.53],
[ 0.47, 0.53]])
In [76]:
(pca.components_).round(2)
Out[76]:
array([[-0.53, -0.53, 0.47, 0.47],
[ 0.47, 0.47, 0.53, 0.53]])
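One caveat when comparing the two: singular vectors are only defined up to sign, and sklearn additionally applies its own deterministic sign convention (svd_flip), so on other data Vh_ and pca.components_ may differ by a per-component sign flip even though they match exactly here. A safer comparison aligns signs first (a sketch, my addition):

```python
import numpy as np
import numpy.linalg as la
from sklearn.decomposition import PCA

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
k = 2
X_nrm = X - X.mean(axis=0)

U_, s_, Vh_ = la.svd(X_nrm, full_matrices=False)
pca = PCA(n_components=k).fit(X_nrm)

# Flip each SVD component to match PCA's sign convention, then compare.
signs = np.sign(np.sum(Vh_[:k] * pca.components_, axis=1))
aligned = Vh_[:k] * signs[:, None]
print(np.allclose(aligned, pca.components_))  # True after sign alignment
```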
Compare the two below: X_nrm = user_embedding @ movie_embedding.T
In [77]:
(eb_u_ @ eb_m_.T).round(2)
Out[77]:
array([[ 2. , 2. , -1.67, -1.67],
[ 1. , 1. , -1.67, -1.67],
[ 3. , 3. , -1.67, -1.67],
[-2. , -2. , 1.33, 1.33],
[-2. , -2. , 0.33, 0.33],
[-2. , -2. , 3.33, 3.33]])
In [78]:
(X_nrm).round(2)
Out[78]:
array([[ 2. , 2. , -1.67, -1.67],
[ 1. , 1. , -1.67, -1.67],
[ 3. , 3. , -1.67, -1.67],
[-2. , -2. , 1.33, 1.33],
[-2. , -2. , 0.33, 0.33],
[-2. , -2. , 3.33, 3.33]])
Compare the two below: user_embedding = X_nrm @ movie_embedding
In [79]:
(eb_u_).round(2)
Out[79]:
array([[-3.68, 0.12],
[-2.62, -0.82],
[-4.74, 1.06],
[ 3.37, -0.47],
[ 2.43, -1.53],
[ 5.25, 1.64]])
In [80]:
(X_nrm @ eb_m_).round(2)
Out[80]:
array([[-3.68, 0.12],
[-2.62, -0.82],
[-4.74, 1.06],
[ 3.37, -0.47],
[ 2.43, -1.53],
[ 5.25, 1.64]])
Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/svd_embedding.ipynb