Python code examples of using SVD (PCA) for embeddings

Jun 2, 2018

Some Python code and numerical examples illustrating how to use SVD and PCA for embeddings.

Imports:

import numpy as np
import pandas as pd
import numpy.linalg as la
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

Make some fake data. Think of it as a movie rating matrix of shape n_user by n_movie (here 6 users and 4 movies).

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
m, n = X.shape

SVD:

U, s, Vh = la.svd(X, full_matrices=False)
Sigma = np.diag(s)

U.shape, s.shape, Sigma.shape, Vh.shape

Out[55]:

((6, 4), (4,), (4, 4), (4, 4))
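
As a quick sanity check on the factorization, the three factors can be multiplied back together. The sketch below rebuilds the same X so that it runs on its own; the names X_rebuilt, reconstruction_ok and broadcast_ok are introduced here for illustration only.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
U, s, Vh = la.svd(X, full_matrices=False)
Sigma = np.diag(s)

# Multiplying the three factors back together should recover X
# (up to floating-point error).
X_rebuilt = U @ Sigma @ Vh
reconstruction_ok = np.allclose(X_rebuilt, X)

# Broadcasting U * s scales column j of U by s[j],
# which is the same thing as the matrix product U @ Sigma.
broadcast_ok = np.allclose(U * s, U @ Sigma)

print(reconstruction_ok, broadcast_ok)
```

This is why the cells that follow can write U * s instead of U @ Sigma.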

Number of components:

k = 2
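
Here k is set by hand. One common way to pick it from the data is to look at how much of the squared singular-value mass the first few components carry. The sketch below is only an illustration: the 0.99 threshold and the names energy and k_auto are assumptions, not part of the original notebook.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])

# Singular values only, in descending order.
s = la.svd(X, compute_uv=False)

# Cumulative share of the squared singular values ("energy").
energy = np.cumsum(s**2) / np.sum(s**2)

# Smallest k whose cumulative share reaches the (assumed) 0.99 threshold.
k_auto = int(np.searchsorted(energy, 0.99) + 1)

# The numerical rank gives the same answer for this matrix.
rank = la.matrix_rank(X)

print(k_auto, rank)
```

For this X both values are 2, matching the hand-picked k.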

Compare the two below: the full U * s and its first k columns. The dropped columns are all zeros because X has rank 2.

In [58]:

(U * s).round(2)

Out[58]:

array([[-5.66,  0.  ,  0.  , -0.  ],
       [-4.24,  0.  , -0.  , -0.  ],
       [-7.07,  0.  , -0.  ,  0.  ],
       [ 0.  , -4.24,  0.  ,  0.  ],
       [ 0.  , -2.83,  0.  ,  0.  ],
       [ 0.  , -7.07,  0.  , -0.  ]])

In [59]:

(U[:,:k] * s[:k]).round(2)

Out[59]:

array([[-5.66,  0.  ],
       [-4.24,  0.  ],
       [-7.07,  0.  ],
       [ 0.  , -4.24],
       [ 0.  , -2.83],
       [ 0.  , -7.07]])

Compare the two below: the full Vh.T * s and its first k columns. Again the dropped columns are all zeros.

In [60]:

(Vh.T * s).round(2)

Out[60]:

array([[-7.07, -0.  , -0.  ,  0.  ],
       [-7.07, -0.  ,  0.  ,  0.  ],
       [-0.  , -6.16,  0.  , -0.  ],
       [-0.  , -6.16,  0.  ,  0.  ]])

In [61]:

(Vh[:k].T * s[:k]).round(2)

Out[61]:

array([[-7.07, -0.  ],
       [-7.07, -0.  ],
       [-0.  , -6.16],
       [-0.  , -6.16]])

U[:,:k] * s[:k] are embeddings of users:

In [62]:

eb_u = U[:,:k] * s[:k]
(eb_u).round(2)

Out[62]:

array([[-5.66,  0.  ],
       [-4.24,  0.  ],
       [-7.07,  0.  ],
       [ 0.  , -4.24],
       [ 0.  , -2.83],
       [ 0.  , -7.07]])

Vh[:k].T are embeddings of movies:

In [63]:

eb_m = Vh[:k].T
(eb_m).round(2)

Out[63]:

array([[-0.71, -0.  ],
       [-0.71, -0.  ],
       [-0.  , -0.71],
       [-0.  , -0.71]])

Compare the two below, which show that X = user_embedding @ movie_embedding.T

In [65]:

(eb_u @ eb_m.T)

Out[65]:

array([[4., 4., 0., 0.],
       [3., 3., 0., 0.],
       [5., 5., 0., 0.],
       [0., 0., 3., 3.],
       [0., 0., 2., 2.],
       [0., 0., 5., 5.]])

In [67]:

X

Out[67]:

array([[4, 4, 0, 0],
       [3, 3, 0, 0],
       [5, 5, 0, 0],
       [0, 0, 3, 3],
       [0, 0, 2, 2],
       [0, 0, 5, 5]])
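
The reconstruction is exact here only because X has rank 2 and k = 2. As a rough illustration of what happens with fewer components, the sketch below compares k = 1 and k = 2. The helper truncated is a hypothetical name introduced for this example; the check relies on the standard result that the Frobenius error of a rank-k truncation equals the root of the sum of the squared dropped singular values.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
U, s, Vh = la.svd(X, full_matrices=False)

def truncated(k):
    """Rank-k reconstruction: user embeddings times movie embeddings."""
    return (U[:, :k] * s[:k]) @ Vh[:k]

# la.norm on a 2-D array defaults to the Frobenius norm.
err_k1 = la.norm(X - truncated(1))
err_k2 = la.norm(X - truncated(2))

# Error predicted from the dropped singular values.
predicted_k1 = np.sqrt(np.sum(s[1:] ** 2))

print(round(err_k1, 2), round(predicted_k1, 2), round(err_k2, 2))
```

With k = 1 the second group of users and movies is lost entirely, so the error equals the second singular value.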

Compare the two below: user_embedding = X @ movie_embedding

In [69]:

eb_u.round(2)

Out[69]:

array([[-5.66,  0.  ],
       [-4.24,  0.  ],
       [-7.07,  0.  ],
       [ 0.  , -4.24],
       [ 0.  , -2.83],
       [ 0.  , -7.07]])

In [70]:

(X @ eb_m).round(2)

Out[70]:

array([[-5.66,  0.  ],
       [-4.24,  0.  ],
       [-7.07,  0.  ],
       [ 0.  , -4.24],
       [ 0.  , -2.83],
       [ 0.  , -7.07]])
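
Because user_embedding = X @ movie_embedding, a user who was not in X can be embedded by projecting their rating row the same way. The sketch below is a minimal illustration; the new_user ratings and the cosine-similarity comparison are assumptions added for this example.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
k = 2
U, s, Vh = la.svd(X, full_matrices=False)
eb_u = U[:, :k] * s[:k]   # user embeddings
eb_m = Vh[:k].T           # movie embeddings

# A hypothetical new user who rated only the first two movies.
new_user = np.array([4, 5, 0, 0])

# Project the rating row onto the movie embeddings.
eb_new = new_user @ eb_m

# Cosine similarity between the new user and every existing user.
cos_sim = (eb_u @ eb_new) / (la.norm(eb_u, axis=1) * la.norm(eb_new))

print(cos_sim.round(2))
```

The new user lines up with the first three users and is orthogonal to the last three, which is what the block structure of X suggests.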

PCA equivalency

In [71]:

pca = PCA(n_components=k)
X_mean = X.mean(axis=0)
X_nrm = X - X_mean
X_nrm.round(2)

Out[71]:

array([[ 2.  ,  2.  , -1.67, -1.67],
       [ 1.  ,  1.  , -1.67, -1.67],
       [ 3.  ,  3.  , -1.67, -1.67],
       [-2.  , -2.  ,  1.33,  1.33],
       [-2.  , -2.  ,  0.33,  0.33],
       [-2.  , -2.  ,  3.33,  3.33]])

In [72]:

U_, s_, Vh_ = la.svd(X_nrm, full_matrices=False)

Compare the two below: both are the user embeddings, once from sklearn's PCA and once from the SVD of the centered matrix.

In [73]:

(pca.fit_transform(X_nrm)).round(2)

Out[73]:

array([[-3.68,  0.12],
       [-2.62, -0.82],
       [-4.74,  1.06],
       [ 3.37, -0.47],
       [ 2.43, -1.53],
       [ 5.25,  1.64]])

In [74]:

eb_u_ = U_[:, :k]*s_[:k]
(eb_u_).round(2)

Out[74]:

array([[-3.68,  0.12],
       [-2.62, -0.82],
       [-4.74,  1.06],
       [ 3.37, -0.47],
       [ 2.43, -1.53],
       [ 5.25,  1.64]])
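
The two outputs above agree exactly, but singular vectors are only determined up to a sign per component, so another library or build may return flipped columns. The sketch below shows that flipping a component in both embeddings leaves their product unchanged; the signs array and the *_flipped names are introduced here for illustration.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
k = 2
X_nrm = X - X.mean(axis=0)
U_, s_, Vh_ = la.svd(X_nrm, full_matrices=False)
eb_u_ = U_[:, :k] * s_[:k]
eb_m_ = Vh_[:k].T

# Flip the sign of the second component in both embeddings.
signs = np.array([1.0, -1.0])
eb_u_flipped = eb_u_ * signs
eb_m_flipped = eb_m_ * signs

# The product is unchanged, so both sign choices are valid factorizations.
same_product = np.allclose(eb_u_flipped @ eb_m_flipped.T, eb_u_ @ eb_m_.T)

# Comparing absolute values is a simple way to test equality up to sign.
same_up_to_sign = np.allclose(np.abs(eb_u_flipped), np.abs(eb_u_))

print(same_product, same_up_to_sign)
```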

Compare the two below: both are the movie embeddings. Note that pca.components_ is the transpose of eb_m_.

In [75]:

eb_m_ = Vh_[:k].T
(eb_m_).round(2)

Out[75]:

array([[-0.53,  0.47],
       [-0.53,  0.47],
       [ 0.47,  0.53],
       [ 0.47,  0.53]])

In [76]:

(pca.components_).round(2)

Out[76]:

array([[-0.53, -0.53,  0.47,  0.47],
       [ 0.47,  0.47,  0.53,  0.53]])
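
Another way to see the equivalence is through the covariance matrix: its eigenvectors are the principal directions, and its eigenvalues are the squared singular values of the centered matrix divided by m - 1. The sketch below checks this with NumPy only; the names cov, eigvals, eigvecs and var_from_svd are introduced for this example.

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
m, n = X.shape
k = 2
X_nrm = X - X.mean(axis=0)
U_, s_, Vh_ = la.svd(X_nrm, full_matrices=False)

# Sample covariance of the movie columns (divides by m - 1).
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse to descending.
eigvals, eigvecs = la.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Variance along each component, computed from the singular values.
var_from_svd = s_**2 / (m - 1)

values_match = np.allclose(eigvals, var_from_svd)
# Eigenvectors match the rows of Vh_ up to sign.
vectors_match = np.allclose(np.abs(eigvecs[:, :k].T), np.abs(Vh_[:k]))

print(values_match, vectors_match)
```

sklearn's PCA exposes the same per-component variances as pca.explained_variance_.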

Compare the two below: X_nrm (the centered X) = user_embedding @ movie_embedding.T

In [77]:

(eb_u_ @ eb_m_.T).round(2)

Out[77]:

array([[ 2.  ,  2.  , -1.67, -1.67],
       [ 1.  ,  1.  , -1.67, -1.67],
       [ 3.  ,  3.  , -1.67, -1.67],
       [-2.  , -2.  ,  1.33,  1.33],
       [-2.  , -2.  ,  0.33,  0.33],
       [-2.  , -2.  ,  3.33,  3.33]])

In [78]:

(X_nrm).round(2)

Out[78]:

array([[ 2.  ,  2.  , -1.67, -1.67],
       [ 1.  ,  1.  , -1.67, -1.67],
       [ 3.  ,  3.  , -1.67, -1.67],
       [-2.  , -2.  ,  1.33,  1.33],
       [-2.  , -2.  ,  0.33,  0.33],
       [-2.  , -2.  ,  3.33,  3.33]])

Compare the two below: user_embedding = X_nrm @ movie_embedding

In [79]:

(eb_u_).round(2)

Out[79]:

array([[-3.68,  0.12],
       [-2.62, -0.82],
       [-4.74,  1.06],
       [ 3.37, -0.47],
       [ 2.43, -1.53],
       [ 5.25,  1.64]])

In [80]:

(X_nrm @ eb_m_).round(2)

Out[80]:

array([[-3.68,  0.12],
       [-2.62, -0.82],
       [-4.74,  1.06],
       [ 3.37, -0.47],
       [ 2.43, -1.53],
       [ 5.25,  1.64]])
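
Since the PCA embeddings factor the centered matrix, getting back to the original ratings requires adding the column means back. A minimal sketch, with X_back as a name introduced for this example:

```python
import numpy as np
import numpy.linalg as la

X = np.array([
    [4, 4, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
    [0, 0, 5, 5],
])
k = 2
X_mean = X.mean(axis=0)
X_nrm = X - X_mean
U_, s_, Vh_ = la.svd(X_nrm, full_matrices=False)
eb_u_ = U_[:, :k] * s_[:k]   # user embeddings of the centered data
eb_m_ = Vh_[:k].T            # movie embeddings of the centered data

# Undo the centering to recover the original rating matrix.
X_back = eb_u_ @ eb_m_.T + X_mean

print(np.allclose(X_back, X))
```

The recovery is exact here because the centered matrix also has rank 2; with a smaller k it would only be an approximation.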

Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/svd_embedding.ipynb

Written by Yang Zhang