Semantic Product Search
This post surveys recent techniques for semantic product search, especially neural network models. We introduce and classify a number of models, and describe the commonalities and differences in their characteristics such as model architecture, loss function, etc. We believe familiarity with the various choices of the building blocks in semantic product search models is helpful for practitioners to design solutions for their own problems.
Introduction
Product search concerns the problem of retrieving relevant products from a catalog given a user query. Traditional search models (e.g., tf-idf, BM25) typically use an inverted index to match the query and the product lexically. They are good at exact matching of texts, especially for rare terms (e.g., product serial numbers), but have challenges handling morphological variations, synonyms, spelling errors, etc. On the other hand, semantic search models match the query and the product information semantically. They can better handle situations where lexical models fall short, and can potentially return more relevant, context-aware results.
We focus on semantic search models in this post.
How Semantic Product Search Basically Works
On a high level, a semantic product search model typically works as follows:
- generate embeddings for the products;
- generate an embedding for the query;
- calculate a relevance score between the query and the product embedding;
- return the most relevant products for the query according to the relevance score.
An embedding is a vector of float numbers of a certain dimension. How the embeddings are generated is the key characteristic of a semantic search model and the focus of this post. The relevance score is calculated by a similarity/compatibility function, and a common choice is the cosine similarity.
Supervised v.s. Unsupervised Models
Depending on whether or not ground truth labels are needed to train the model, the models can be classified into supervised or unsupervised. For product search, a ground truth label is the relevance measurement between a pair of (query, product), which can be binary (relevant v.s. irrelevant), or a numerical score (e.g., 0–5).
We first introduce unsupervised models, then supervised models.
Unsupervised Semantic Search
Unsupervised methods do not need ground truth labels (i.e., whether a product is relevant to a query) to learn product or query embeddings. We introduce three classes of techniques with examples: Matrix Factorization Methods, Word Embeddings, and Product Embeddings.
Matrix Factorization Methods
Matrix factorization methods use a term (i.e., word or token) by document (product information text) matrix `X`, where the element `(i,j)` in `X` is the frequency of term `i` in document `j`. Some matrix factorization (also known as matrix decomposition) operation is then performed to decompose X into a product of matrices. Here we describe one matrix factorization-based embedding method: Latent Semantic Analysis (LSA).
Latent Semantic Analysis (LSA) Indexing by latent semantic analysis
The term-document matrix `X` is decomposed by Singular value decomposition (SVD):
where U and V are orthogonal matrices and `Sigma` is a diagonal matrix.
(Image from Wikipedia)
The values `sigma_i` are called the singular values, and `u_i` and `v_i` are the left and right singular vectors. We can select the `k` largest singular values, and their corresponding singular vectors to get the rank `k` approximation to `X`:
The lower-rank k-dimensional vector `t_i^hat` and `d_j^hat` above are the term- and document-embeddings, respectively; while the relevance between a term (word or token) `i` and term (product) `j` is the `(i,j)`th element of `X_k`.
Word Embeddings
Word embeddings are learned from a text corpus by a neural model in an unsupervised way, based on the Distributional Hypothesis that “a word is characterized by the company it keeps”. Each word/token in the vocabulary receives an embedding of a fixed dimension. A common way to get the product embedding is to average the embeddings of the words in the product info text. We describe two types of word embeddings Word2Vec and GloVe.
Word2Vec Efficient Estimation of Word Representations in Vector Space
Word2Vec learns word embeddings by using a fixed size window around each word, via one of two architectures: Continuous Skip-gram Model (Skip-gram) and Continuous Bag-of-Words Model (CBOW). Both architectures use single-layer neural networks to project one-hot word vectors to dense embeddings, but use different learning objectives.
In CBOW, the sum of the neighboring words’ vectors are used as the input and the middle word is being predicted. In contrast, in Skip-gram, the middle word serves as input and the neighbors are being predicted. Both architectures are trained via the classification loss of softmax output.
GloVe GloVe: Global Vectors for Word Representation
GloVe stands for Global Vectors for Word Representation. As the name indicates, a major distinction of this model from Word2Vec is the use of global (i.e., corpus-wide) neighboring (co-occurrence) frequency instead of individual neighborhood windows. The model is then trained by predicting the likelihood of one word occurring in the neighborhood given another.
Product Embeddings
This technique uses neural networks to learn embeddings directly on products as opposed to words.
prod2vec E-commerce in Your Inbox: Product recommendations at scale
This model is not directly purposed for search but for product recommendation, and shares ideas with word embeddings. Here, a purchase sequence is treated as a “sentence” and products within the sequence as “words”. The model then learns product embedding using the Skip-gram model.
LSE Learning Latent Vector Spaces for Product Search
This unsupervised model learns product embeddings and word embeddings simultaneously in a contrastive learning scheme. Given a word N-gram and a product, word embeddings in the N-gram are averaged and transformed by a single-layer neural network to compute the cosine similarity to the product embedding. The model learns the embeddings contrastively: the similarity should be big if the product information contains N-gram, it should be small if the product is randomly sampled.
Supervised Semantic Search
We now look at supervised models trained from ground truth labels, i.e., whether a product is relevant to a query. Key characteristics of the models include the neural network architecture, the loss function, the tokenizer, etc. There exist some commonly used types for these model components:
Neural Network Architecture:
- Multi-layer Perceptron (MLP)
- Convolutional Neural Network (CNN)
- Transformer Models (e.g., Bert)
Loss Function:
- Binary (whether or not this product is relevant to this query)
- Margin (which one of the two products is more relevant to this query)
- Multinary (which one of the multiple products is relevant to this query)
Tokenizer:
We now introduce several recent supervised semantic search models and describe their common characteristics as well as specific design choices.
DSSM Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
[Neural architecture: MLP] [Loss function: Binary] [Tokenizer: Char N-gram]
This 2013 model is one of the early models adopting deep learning for search. It uses a character 3-gram tokenizer. Query and document texts are represented as bags of 3-grams. A fixed, unlearnable hashing layer embeds the tokens, followed by a MLP with 3 hidden layers. The cosine similarity between the query and the document embeddings is compared to the binary label (relevant or not) to compute the loss, where for each positive example (i.e., a relevant query-doc pair), 4 negative examples are generated by random selection.
ARC-II Convolutional neural network architectures for matching natural language sentences
[Neural architecture: CNN] [Loss function: Margin] [Tokenizer: Word 1-gram]
CNN-based search models use pairs (e.g., query-product pairs) of text to form a 2D array then use CNNs to learn latent features that better model the local interaction between tokens. In this model, a sentence `Sx` and sentence `Sy` are transformed into a sequence of pre-trained Word2Vec embeddings of the words in the sentence, and forms a 2D array. 1D and 2D convolutions are then applied to the array to get a final embedding. Note that the Word2Vec embeddings are fixed and not learnable.
A margin loss is adopted to make the model give high similarity to positive query-doc pairs (s(x,y+)) and low similarity to negative pairs (s(x,y-)).
MatchPyramid Text Matching as Image Recognition
[Neural architecture: CNN] [Loss function: Binary] [Tokenizer: Word 1-gram]
This model shares similar ideas with ARC-II in treating query-doc pairs as images, but has some key design choice differences. Its word embedding is learnable instead of fixed. The convolution operations are arranged more explicitly to reflect the relation between layers and word, phrase, and sentence-level matching pattern. Instead of a margin loss its model is trained using a binary loss.
DRMM A Deep Relevance Matching Model for Ad-hoc Retrieval
[Neural architecture: MLP] [Loss function: Margin] [Tokenizer: Word 1-gram]
This model shares the idea of modeling local interaction with the CNN-based models, but using histogram mapping instead of transforming query-doc pairs to 2D images.
It makes use of fixed (i.e., not learnable) pretrained word embeddings (e.g., Word2Vec) to embed each word in queries and documents (product information). A query-document pair is transformed into a histogram by 1) defined fixed similarity bins 2) count how many pairs of query word and document word have their similarity score (computed from word embeddings) is in each bin. An example from the paper:
- bins: {[−1, −0.5), [−0.5, −0), [0, 0.5), [0.5, 1), [1, 1]}
- query term: “car”
- document: (car, rent, truck, bump, injunction, runway)
- similarities: (1, 0.2, 0.7, 0.3, −0.1, 0.1)
- histogram: [0, 1, 3, 1, 1].
The histogram then becomes the input to the MLP, and finally a margin loss is computed based on a positive and a negative query-document pair.
Duet Learning to match using local and distributed representations of text for web search
[Neural architecture: CNN] [Loss function: Multiclass]
Duet aims to jointly measure the relevance in the sense of both exact match (local model) and semantic representation (distributed model). For the local model, the input is a binary 2D array representing whether each word in the document appears in the query, followed by convolutions and fully connected layers. The distributed model is inspired by DSSM. It uses character n-gram frequency matrix of query and document as input, followed by convolutions, element-wise product, and fully connected layers. Two similarity scores are summed from these two models, and trained by predicting which one of the documents is relevant to a given query.
Amazon Semantic Product Search Semantic Product Search
[Neural architecture: MLP] [Loss function: Margin]
It combines multiple tokenizers (word and character 1-gram, 2-gram, 3-gram, hashing, etc.) in a single bag-of-tokens. Similar to DSSM and Duet, token embeddings in a bag are averaged; but in contrast to them, query and product share the same embeddings. This works also features a novel 3-way margin loss that differentiates irrelevant, clicked, and purchased products given a query.
BERT Product Search
[Neural architecture: Transformer] [Tokenizer: Subword]
Bert and its related models achieve state-of-the-art results on multiple NLP tasks, and its use in product search is in many ongoing pieces of research. Example paper BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online E-Commerce Search.
Using Bert models for product search is straightforward. One way is to embed query and product text each with a Bert model, where the query Bert model and the product Bert model may use shared or separate weights. Another way is to combine the query and the product text into a single sentence separated by the special token [SEP] (this is the way used in the above papers). Bert models typically use subword tokenizers. The loss function could be any of the ones described above. One could choose to finetune all or some of the layers of the Bert model, or one could freeze all Bert layers and only finetune the final linear classification layers.
Discussion
As we have seen, there is a wide variety of design choices for building semantic search models.
The trade-off between implementation complexity, computation resource, and search accuracy has to be taken into consideration. For example, supervised models may be more accurate, but require expensive labels and may be more sensitive to temporal changes (as user behavior likely changes faster than product catalogs); deeper models may be more accurate but more expensive to train and serve than shallow models (and at the same time, a deeper model may also be more likely to overfit the training data if not carefully tuned). For the practical problem to solve at hand, it is important to benchmark the models against each other as well as against non-semantic baseline models on a realistic dataset and using reasonable metrics. Ensembling of semantic and non-semantic models should also be considered and tested.