Cosine similarity = dot product for normalized vectors
Some Python examples showing that cosine similarity equals the dot product for normalized vectors.
Imports:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from scipy.spatial.distance import cosine
Make and plot some fake 2D data.
n_samples = 100
n_features = 2
X = np.random.randn(n_samples, n_features)
plt.figure(figsize=(5, 5))
plt.scatter(X[:,0], X[:,1])
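A quick sanity check (not in the original outputs): the row norms of the raw data are generally not 1.
np.linalg.norm(X, axis=1)  # row norms vary; raw Gaussian samples are not unit vectors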
Plot the data after L2 normalization. See how each row's norm becomes 1, so the points fall on the unit circle.
X_normalized = preprocessing.normalize(X, norm='l2')
plt.figure(figsize=(5, 5))
plt.scatter(X_normalized[:,0], X_normalized[:,1])
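After L2 normalization, every row norm is 1 (up to floating point), which is why the points land on the unit circle:
np.allclose(np.linalg.norm(X_normalized, axis=1), 1.0)  # expect True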
Numerical examples of dot product vs. cosine similarity:
n_samples = 8
n_features = 5
X = np.random.uniform(0, 2, size=(n_samples, n_features))
Y = np.random.uniform(-1, 3, size=(n_samples, n_features))
In general, the dot product is not equal to cosine similarity:
np.allclose(np.dot(X, Y.T), cosine_similarity(X, Y))
Out[22]:
False
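To see why, recall the definition: cosine similarity divides the dot product by both vectors' norms. A minimal hand-rolled version (manual_cosine is my name, not sklearn's) agrees with sklearn:
def manual_cosine(x, y):
    # cosine similarity = dot(x, y) / (||x|| * ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

np.allclose(manual_cosine(X[0], Y[0]),
            cosine_similarity(X[0].reshape(1, -1), Y[0].reshape(1, -1)))  # expect True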
Normalizing:
X_normalized = preprocessing.normalize(X, norm='l2')
Y_normalized = preprocessing.normalize(Y, norm='l2')
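preprocessing.normalize with norm='l2' is just a row-wise division by each row's norm; a quick equivalence check:
np.allclose(X_normalized, X / np.linalg.norm(X, axis=1, keepdims=True))  # expect True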
Normalizing does not change cosine similarity:
np.allclose(cosine_similarity(X, Y), cosine_similarity(X_normalized, Y_normalized))
Out[26]:
True
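That is because cosine similarity is invariant to positive rescaling of either vector; for example, scaling each row of X by an arbitrary positive factor changes nothing:
scales = np.random.uniform(0.1, 10, size=(n_samples, 1))  # arbitrary positive per-row factors
np.allclose(cosine_similarity(X * scales, Y), cosine_similarity(X, Y))  # expect True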
After normalizing, the dot product equals cosine similarity:
np.allclose(np.dot(X_normalized, Y_normalized.T), cosine_similarity(X_normalized, Y_normalized))
Out[27]:
True
As mentioned in the sklearn user guide:
"… normalized vectors, in which case cosine_similarity is equivalent to linear_kernel, only slower."
This can be shown too:
np.allclose(linear_kernel(X_normalized, Y_normalized), cosine_similarity(X_normalized, Y_normalized))
Out[28]:
True
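To check the "only slower" part on your own machine, a rough timing sketch (numbers will vary by setup, so none are shown here):
import timeit
t_linear = timeit.timeit(lambda: linear_kernel(X_normalized, Y_normalized), number=1000)
t_cosine = timeit.timeit(lambda: cosine_similarity(X_normalized, Y_normalized), number=1000)
print(f'linear_kernel: {t_linear:.4f}s, cosine_similarity: {t_cosine:.4f}s')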
After normalization, the Euclidean distance reduces to sqrt(||x||^2 + ||y||^2 - 2*dot(x, y)) = sqrt(1 + 1 - 2*dot(x, y)), since each norm is 1, i.e., sqrt(2 - 2*cosine_similarity). Compare the outputs below:
np.sqrt(sum((X_normalized[0] - Y_normalized[0])**2)), np.linalg.norm(X_normalized[0] - Y_normalized[0])
Out[29]:
(0.7509789525594854, 0.7509789525594854)
np.sqrt(2 - 2 * np.dot(X_normalized[0], Y_normalized[0])), np.linalg.norm(X_normalized[0] - Y_normalized[0])
Out[30]:
(0.7509789525594853, 0.7509789525594854)
np.sqrt(2 - 2 * cosine_similarity(X_normalized[0].reshape(1, -1), Y_normalized[0].reshape(1, -1)))
Out[31]:
array([[0.75097895]])
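The same identity holds for all pairs at once; a vectorized check (the np.clip guards against tiny negative values from floating-point round-off):
from sklearn.metrics.pairwise import euclidean_distances
np.allclose(euclidean_distances(X_normalized, Y_normalized),
            np.sqrt(np.clip(2 - 2 * cosine_similarity(X_normalized, Y_normalized), 0, None)))  # expect True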
A useful consequence is that we can use Euclidean distance on L2-normalized vectors in place of cosine similarity when the latter is not supported: on unit vectors, Euclidean distance is a monotonically decreasing function of cosine similarity, so nearest-neighbor rankings are preserved. An example is sklearn's KNN. As mentioned in its docs, cosine distance is not allowed but Euclidean is:
metric : string or callable, default 'minkowski'
    The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric.
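A sketch of the workaround (assuming no exact distance ties): fit NearestNeighbors with its default Euclidean metric on the L2-normalized data, and the neighbor ranking matches the one cosine similarity would give.
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=3)  # default metric='minkowski' with p=2, i.e., Euclidean
nn.fit(X_normalized)
_, euclidean_idx = nn.kneighbors(Y_normalized)

# ranking by cosine similarity directly, descending, top 3
cosine_idx = np.argsort(-cosine_similarity(Y_normalized, X_normalized), axis=1)[:, :3]
np.array_equal(euclidean_idx, cosine_idx)  # expect True barring exact ties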
Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/normalize_vs_cosine.ipynb