Cosine similarity = dot product for normalized vectors

Yang Zhang
2 min read · Jun 2, 2018

Some Python code examples showing how cosine similarity equals dot product for normalized vectors.

Imports:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from scipy.spatial.distance import cosine

Make and plot some fake 2d data.

n_samples = 100
n_features = 2
X = np.random.randn(n_samples, n_features)
plt.figure(figsize=(5, 5))
plt.scatter(X[:,0], X[:,1])

Plot after normalizing the data. Every point now has L2 norm 1, so the points fall on the unit circle.

X_normalized = preprocessing.normalize(X, norm='l2')
plt.figure(figsize=(5, 5))
plt.scatter(X_normalized[:,0], X_normalized[:,1])
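
As a quick numerical check of the plot, the sketch below confirms that every row has unit L2 norm after normalizing, and that `preprocessing.normalize` with `norm='l2'` gives the same result as dividing each row by its own norm. The fixed seed and the names `norms` and `X_manual` are illustrative assumptions, not part of the original notebook, and the manual division assumes no row is all zeros.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)
X = rng.standard_normal((100, 2))

X_normalized = preprocessing.normalize(X, norm='l2')

# L2 norm of every row after normalizing
norms = np.linalg.norm(X_normalized, axis=1)
print(np.allclose(norms, 1.0))  # True

# Same result by dividing each row by its own L2 norm
X_manual = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.allclose(X_normalized, X_manual))  # True
```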

Numerical examples of dot product vs. cosine similarity:

n_samples = 8
n_features = 5
X = np.random.uniform(0, 2, size=(n_samples, n_features))
Y = np.random.uniform(-1, 3, size=(n_samples, n_features))

In general, the dot product is not equal to cosine similarity:

np.allclose(np.dot(X, Y.T), cosine_similarity(X, Y))

Out[22]:

False
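
The gap comes from the denominator in the cosine formula: cosine similarity is the dot product divided by the product of the two vectors' L2 norms. A minimal sketch of that formula in plain NumPy, where `x_norms`, `y_norms`, and `cos_manual` are illustrative names and the seed is an assumption:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(8, 5))
Y = rng.uniform(-1, 3, size=(8, 5))

x_norms = np.linalg.norm(X, axis=1)  # shape (8,)
y_norms = np.linalg.norm(Y, axis=1)  # shape (8,)

# Pairwise dot products divided by the product of the corresponding norms
cos_manual = np.dot(X, Y.T) / np.outer(x_norms, y_norms)

print(np.allclose(cos_manual, cosine_similarity(X, Y)))  # True
```

When all norms are 1 the denominator is 1 everywhere, which is why the two quantities coincide for normalized vectors.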

Normalizing:

X_normalized = preprocessing.normalize(X, norm='l2')
Y_normalized = preprocessing.normalize(Y, norm='l2')

Normalizing does not change cosine similarity:

np.allclose(cosine_similarity(X, Y), cosine_similarity(X_normalized, Y_normalized))

Out[26]:

True
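
This holds because cosine similarity depends only on direction: scaling either vector by a positive constant scales the numerator and the denominator by the same factor. A small sketch of that invariance, where the per-row scale factors `a` and `b` are arbitrary values chosen for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(8, 5))
Y = rng.uniform(-1, 3, size=(8, 5))

# Arbitrary positive scale factor for each row (hypothetical values)
a = rng.uniform(0.1, 10, size=(8, 1))
b = rng.uniform(0.1, 10, size=(8, 1))

cos_original = cosine_similarity(X, Y)
cos_scaled = cosine_similarity(a * X, b * Y)

print(np.allclose(cos_original, cos_scaled))  # True
```

L2 normalization is the special case where each row is scaled by the reciprocal of its own norm.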

After normalizing, dot product equals cosine similarity:

np.allclose(np.dot(X_normalized, Y_normalized.T), cosine_similarity(X_normalized, Y_normalized))

Out[27]:

True

As the scikit-learn documentation notes:

normalized vectors, in which case cosine_similarity is equivalent to linear_kernel, only slower.

This can be shown too:

np.allclose(linear_kernel(X_normalized, Y_normalized), cosine_similarity(X_normalized, Y_normalized))

Out[28]:

True

After normalization, the Euclidean distance between x and y reduces to sqrt(1 + 1 - 2*dot(x, y)), i.e., sqrt(2 - 2*cosine_similarity(x, y)). Compare the outputs below:

np.sqrt(sum((X_normalized[0] - Y_normalized[0])**2)), np.linalg.norm(X_normalized[0] - Y_normalized[0])

Out[29]:

(0.7509789525594854, 0.7509789525594854)

np.sqrt(2 - 2 * np.dot(X_normalized[0], Y_normalized[0])), np.linalg.norm(X_normalized[0] - Y_normalized[0])

Out[30]:

(0.7509789525594853, 0.7509789525594854)

np.sqrt(2 - 2 * cosine_similarity(X_normalized[0].reshape(1, -1), Y_normalized[0].reshape(1, -1)))

Out[31]:

array([[0.75097895]])
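
The agreement among these outputs follows from expanding the squared distance. A short derivation, assuming x and y are already L2-normalized so that both norms equal 1:

```latex
\begin{aligned}
\lVert x - y \rVert^2 &= (x - y) \cdot (x - y) \\
                      &= \lVert x \rVert^2 + \lVert y \rVert^2 - 2\, x \cdot y \\
                      &= 1 + 1 - 2\, x \cdot y \\
                      &= 2 - 2 \cos(x, y),
\end{aligned}
\qquad\text{so}\qquad
\lVert x - y \rVert = \sqrt{2 - 2 \cos(x, y)}.
```

The last step uses the fact that the dot product of unit vectors is their cosine similarity.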

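A useful consequence is that Euclidean distance can stand in for cosine similarity where the latter is not supported: on normalized vectors, sqrt(2 - 2*cosine_similarity) shrinks as cosine similarity grows, so both give the same nearest-neighbor ranking. An example is sklearn’s tree-based KNN. As mentioned in its documentation, cosine distance is not allowed for the tree but Euclidean is: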

metric : string or callable, default ‘minkowski’ the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric.
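
A minimal sketch of this substitution, assuming `NearestNeighbors` with `algorithm='kd_tree'` as the tree-based search; the arrays `X` and `Q`, the seed, and `k` are hypothetical choices for illustration. It checks that Euclidean neighbors of the normalized vectors match the top-k cosine similarities of the raw vectors.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))  # hypothetical "database" vectors
Q = rng.standard_normal((10, 5))   # hypothetical query vectors

X_normalized = preprocessing.normalize(X, norm='l2')
Q_normalized = preprocessing.normalize(Q, norm='l2')

k = 3
# Tree-based search with the Euclidean metric on the normalized vectors
nn = NearestNeighbors(n_neighbors=k, algorithm='kd_tree', metric='euclidean')
nn.fit(X_normalized)
_, idx_euclidean = nn.kneighbors(Q_normalized)

# Reference ranking: the k largest cosine similarities on the raw vectors
sims = cosine_similarity(Q, X)
idx_cosine = np.argsort(-sims, axis=1)[:, :k]

# Expected to agree, barring exact ties in similarity
print(np.array_equal(idx_euclidean, idx_cosine))
```

Note that the vectors must be normalized before fitting and before querying; on unnormalized data, Euclidean and cosine rankings can differ.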

Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/normalize_vs_cosine.ipynb
