Cosine similarity = dot product for normalized vectors
A few Python examples showing that cosine similarity equals the dot product when the vectors are L2-normalized.
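For reference, the claim follows directly from the standard definition of cosine similarity. The symbols x and y below are just two arbitrary nonzero vectors, introduced here for illustration:

```latex
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2}
\quad\Longrightarrow\quad
\cos(x, y) = x \cdot y
\quad \text{when } \lVert x \rVert_2 = \lVert y \rVert_2 = 1 .
```

Dividing each vector by its own L2 norm makes the denominator equal to 1, which is what the normalization step in the examples does row by row.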
Imports:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from scipy.spatial.distance import cosine
Make and plot some fake 2D data.
n_samples = 100
n_features = 2
X = np.random.randn(n_samples, n_features)
plt.figure(figsize=(5, 5))
plt.scatter(X[:,0], X[:,1])
Plot the data after normalizing. Each row now has L2 norm 1, so every point lies on the unit circle.
X_normalized = preprocessing.normalize(X, norm='l2')
plt.figure(figsize=(5, 5))
plt.scatter(X_normalized[:,0], X_normalized[:,1])
Numerical examples of dot product vs. cosine similarity:
n_samples = 8
n_features = 5
X = np.random.uniform(0, 2, size=(n_samples, n_features))
Y = np.random.uniform(-1, 3, size=(n_samples, n_features))
In general, the dot product is not equal to cosine similarity:
np.allclose(np.dot(X, Y.T), cosine_similarity(X, Y))
Out[22]:
False
Normalizing each row to unit L2 norm:
X_normalized = preprocessing.normalize(X, norm='l2')
Y_normalized = preprocessing.normalize(Y, norm='l2')
Normalizing does not change cosine similarity:
np.allclose(cosine_similarity(X, Y), cosine_similarity(X_normalized, Y_normalized))
Out[26]:
True
After normalizing, the dot product equals cosine similarity:
np.allclose(np.dot(X_normalized, Y_normalized.T), cosine_similarity(X_normalized, Y_normalized))
Out[27]:
True
As mentioned in the sklearn documentation:
normalized vectors, in which case cosine_similarity is equivalent to linear_kernel, only slower.
This can be shown too:
np.allclose(linear_kernel(X_normalized, Y_normalized), cosine_similarity(X_normalized, Y_normalized))
Out[28]:
True
After normalization, the squared Euclidean distance expands to ||x||^2 + ||y||^2 - 2*dot(x, y) = 2 - 2*dot(x, y), so the Euclidean distance reduces to sqrt(1 + 1 - 2*dot(x, y)), i.e., sqrt(2 - 2*cosine_similarity). Compare the outputs below:
np.sqrt(sum((X_normalized[0] - Y_normalized[0])**2)), np.linalg.norm(X_normalized[0] - Y_normalized[0])
Out[29]:
(0.7509789525594854, 0.7509789525594854)
np.sqrt(2 - 2 * np.dot(X_normalized[0], Y_normalized[0])), np.linalg.norm(X_normalized[0] - Y_normalized[0])
Out[30]:
(0.7509789525594853, 0.7509789525594854)
np.sqrt(2 - 2 * cosine_similarity(X_normalized[0].reshape(1, -1), Y_normalized[0].reshape(1, -1)))
Out[31]:
array([[0.75097895]])
A useful consequence is that, for normalized vectors, Euclidean distance can stand in for cosine similarity when the latter is not supported: the distance shrinks as the similarity grows, so both produce the same nearest-neighbor ranking. An example is sklearn's KNN. As its documentation notes, the tree-based neighbor search does not accept cosine distance, but it does accept Euclidean:
metric : string or callable, default ‘minkowski’ the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric.
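As a rough illustration of that workaround, the sketch below compares a tree-based Euclidean search on normalized rows with a brute-force cosine search on the raw rows. The data, the seed, the choice of `ball_tree`, and the variable names are all assumptions made for this example, not part of the original notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

# Hypothetical data: 200 reference points and 10 queries in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
queries = rng.normal(size=(10, 5))

# Scale every row to unit L2 norm.
X_unit = normalize(X, norm='l2')
queries_unit = normalize(queries, norm='l2')

# Tree-based search with Euclidean distance on the normalized rows.
nn_euclidean = NearestNeighbors(n_neighbors=3, algorithm='ball_tree', metric='euclidean')
nn_euclidean.fit(X_unit)
dist_euclidean, idx_euclidean = nn_euclidean.kneighbors(queries_unit)

# Brute-force search with cosine distance (1 - cosine similarity) on the raw rows.
nn_cosine = NearestNeighbors(n_neighbors=3, algorithm='brute', metric='cosine')
nn_cosine.fit(X)
dist_cosine, idx_cosine = nn_cosine.kneighbors(queries)

# Both searches should return the same neighbors in the same order.
same_neighbors = np.array_equal(idx_euclidean, idx_cosine)

# sqrt(2 - 2 * cosine_similarity) is the same as sqrt(2 * cosine_distance).
distances_match = np.allclose(dist_euclidean, np.sqrt(2 * dist_cosine))

print(same_neighbors, distances_match)
```

The ranking agrees because sqrt(2 - 2*cosine_similarity) is a decreasing function of the similarity. Exact ties between neighbors could in principle be ordered differently by the two searches, which is unlikely with continuous random data like this.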
Code is here: https://github.com/yang-zhang/yang-zhang.github.io/blob/master/ds_math/normalize_vs_cosine.ipynb
