
1 Introduction

An e-commerce product search engine typically serves queries in two stages, matching and ranking, for efficiency and latency reasons. In the matching stage, a query is processed and matched against hundreds of millions of products to retrieve thousands of products that are relevant to the query. In the subsequent ranking stage, the retrieved products are scored against one or more objectives and then sorted to increase the likelihood of satisfying the customer query in the top positions. Matching is therefore a critical first step towards a delightful customer experience in terms of search latency and relevance, and is the focus of this paper. Lexical matching using an inverted index [1] has been the industry-standard approach for e-commerce retrieval applications. This type of matching retrieves products in which one or more query keywords appear in textual attributes such as the title and description. Lexical matching is favorable because of its simplicity, explainability, low latency, and ability to scale to catalogs with billions of products. Despite these advantages, lexical matching has several shortcomings, such as sensitivity to spelling variants (e.g. “grey” vs “gray”) or mistakes (e.g. “sheos” instead of “shoes”), proneness to vocabulary mismatch (e.g. hypernyms, synonyms), and lack of semantic understanding (e.g. “latex free examination gloves” does not match the intent of “latex examination gloves”). These issues are largely caused by the underlying term-based representation of queries and products, which fails to capture the fine-grained relationship between terms. Researchers and practitioners typically resort to query expansion techniques to address these issues.

Dense embedding based semantic matching [2] has been shown to significantly alleviate the shortcomings of lexical matching because its distributed representation admits granular proximity between the terms of a query-product pair in a low-dimensional vector space [3]. To fulfill the low latency requirement, these semantic matching models are predominantly shallow and use a bi-encoder architecture. Bi-encoders have separate encoders for generating query and product embeddings and use cosine similarity to define the proximity of queries and products. Such an architecture allows product embeddings to be indexed offline for fast approximate nearest neighbor (ANN) search [4], with the query embedding generated in realtime. Recently, BERT-based models [5] have advanced the state of the art in natural language processing, but due to latency considerations their use in online e-commerce information retrieval is largely limited to the bi-encoder architecture [6,7,8], which does not benefit from early interaction between the query and product representations.

In this work, we propose a multi-stage training procedure to train a small BERT-based matching model for online inference that leverages a large pre-trained BERT-based matching model. A large BERT encoder (750M parameters) is first pretrained with the masked language modeling (MLM) objective on the product catalog data (details in Sect. 2.1); we refer to the trained model as ds-bert. Next, the ds-bert model is pre-finetuned using our novel query-product interaction pre-finetuning (QPI) task (see Sect. 2.2); the resulting model is referred to as qpi-bert. We find that interaction pre-finetuning greatly improves the training stability of downstream bi-encoders and significantly improves their generalization. qpi-bert is then cloned into a bi-encoder architecture and finetuned with query-product purchase signal; we refer to this model as qpi-bert-ft (see Sect. 2.3). Finally, a smaller qpi-bert bi-encoder student model (75M parameters) is distilled from the qpi-bert-ft teacher by matching the cosine similarity scores on the query-product pairs used for finetuning (see Sect. 2.4); we refer to this model as small-qpi-bert-dis.

Through our offline experiments on a large e-commerce dataset, we show that the small-qpi-bert-dis model (75M) suffers only a 3% drop in the search relevance metric compared to the qpi-bert-ft model, which has 20x its number of parameters (1.5B). The small-qpi-bert-dis model improves search relevance by 23% over a baseline DSSM-based matching model [2] with a similar number of parameters and inference latency. Using an online A/B test, we also show that the small-qpi-bert-dis model outperforms the production model with a 2% lift in both relevance and sales metrics.

Our work is closely related to the literature on semantic matching with deep learning. Some of the initial pre-BERT works include the Deep Structured Semantic Model (DSSM) [2], which constructs vector representations for queries and documents using a feedforward network and uses cosine similarity as the scoring function. DSSM-based models are widely used for real-time matching at web-scale [9, 10], and were later specialized for online product matching [3]. Post-BERT techniques leverage Pretrained Language Models (PLMs) such as BERT [5] to construct bi-encoders for matching tasks [8, 11, 12]. These techniques have broadly been applied to question answering, where the question and answer are from similar domains and interaction pre-finetuning is less essential. A recent work [13] proposes a multi-stage semantic matching training pipeline for web retrieval. However, unlike our approach, their focus is on deploying an ERNIE model (220M), while we study how large bi-encoders (1.5B) can be compressed to much smaller bi-encoders (75M) at web-scale using interaction pre-finetuning. In summary, the key contributions of this work are:

  • We propose a multi-stage training procedure to effectively train a small BERT-based matching model for online inference from a much larger model (750 million to 1.5 billion parameters).

  • We introduce a novel pre-finetuning task where a span masking and field permutation equivariant objective is used on joint query-product input text to help align the query and product representations. This task helps stabilize training and improve generalization of bi-encoders.

  • We show, using offline and online experiments at scale on an e-commerce website, that the proposed approach helps the small BERT-based small-qpi-bert-dis model significantly outperform both a DSSM-based model (by 23%) in offline experiments and a production model in an online A/B test.

2 Methodology

In this section we describe our proposed four-stage training paradigm, which consists of 1) domain-specific pretraining, 2) query-product interaction pre-finetuning, 3) finetuning for retrieval, and 4) knowledge distillation to a smaller model. In the first three stages, we train a large BERT model for product matching, and in the final stage we distill this knowledge to a smaller model that can be deployed efficiently in production (see Fig. 1).

2.1 Domain-Specific Pretraining

In the first stage of training, we pretrain a large BERT model on a domain-specific dataset for product matching. The language used to describe products (catalog fields) in the e-commerce domain differs significantly from the language used on the larger web. Product titles and descriptions use a subset of the entire vocabulary, are often structured to follow a specific pattern, and in general have a different distribution from the sources that publicly available language models are trained on. An off-the-shelf pretrained BERT-based model therefore does not perform well when finetuned for the product matching task.

Instead of using an off-the-shelf pretrained BERT model, we construct a BPE vocabulary [14] from the catalog corpus, which comprises billions of products in the e-commerce domain. We then pretrain the model on catalog text, concatenating all available text fields, such as product title and description, along with their field names. Our pretraining objective is the standard masked-language-modeling (MLM) loss [5, 15, 16]. We refer to the model trained with this strategy as the ds-bert model (see Fig. 1a).
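To make the input construction concrete, the following is a minimal sketch of how a single pretraining example could be assembled. The field names, whitespace tokenization, and 15% masking rate are illustrative assumptions; the actual pipeline uses a sentencepiece/BPE tokenizer over the full catalog (Sect. 3.1).

```python
import random

# Hedged sketch: assemble catalog text by concatenating fields with their
# names, then apply token-level MLM masking. Field names, tokenization,
# and the masking rate are illustrative assumptions, not the real schema.
def build_catalog_text(product: dict) -> str:
    return " ".join(f"{field}: {value}" for field, value in product.items() if value)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)   # model must reconstruct this token
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)         # position not scored by the MLM loss
    return masked, labels

product = {"title": "sailor fountain pen ink bottle",
           "description": "50 ml blue black ink for fountain pens"}
masked, labels = mask_tokens(build_catalog_text(product).split())
```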

Fig. 1. The four stages involved in training an effective deployable model for semantic matching.

2.2 Query-Product Interaction Pre-finetuning

Bi-encoders are preferred over cross-encoder models with full interaction for matching due to their efficiency and feasibility at runtime. Bi-encoders are, however, notoriously difficult to train on query-product pairs due to training instabilities arising from gradient variance between the two inputs. Losing the ability to explicitly model the interaction between queries and products also results in worse generalization than a cross-encoder.

In the second stage of training, we propose a novel self-supervised approach to incorporate query-product interaction in the large encoder, which is critical to improving performance on the product matching task. We use query-product paired data to help the encoder learn the relationship between a query and a product using full cross-attention. To construct such a dataset, we first identify query-product pairs that share a relevant semantic relationship; for example, all products purchased for a given query can be considered relevant, or query-product pairs can be manually labeled for relevance. In this paper, the dataset is constructed such that the query-product pairs are semantically relevant with high probability \(\alpha > 0.8\). The pre-finetuning dataset size (a few million examples) is much smaller than the pretraining dataset (a billion examples).

To perform pre-finetuning, we apply span MLM to the concatenated query and product text with a “[SEP]” token between them. At each iteration, we select spans from either the query text or the product text (never both) to mask tokens. We sample the span length (in number of words) from a geometric distribution until a predetermined percentage of tokens has been masked; the start of each span is uniformly sampled within the query or the product. During training we also observed that permuting the fields within the query and product text, a form of field permutation equivariant training, helped the model generalize better. We refer to the model trained with this strategy as the qpi-bert model (see Fig. 1b). Pre-finetuning with self-supervision on a semantically relevant paired dataset boosts generalization for matching when a large noisy training set is available. This differs from previous works that use supervision on manually labeled data.
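A minimal sketch of the span masking procedure follows, under simplifying assumptions: whitespace tokens, a 15% masking budget, and a geometric parameter of 0.2 chosen purely for illustration. It is not the production implementation.

```python
import random
import numpy as np

# Hedged sketch of QPI span masking: spans are drawn from either the query
# or the product side (never both), span lengths follow a geometric
# distribution, and masking stops at a fixed token budget.
def qpi_span_mask(query_tokens, product_tokens, mask_budget=0.15,
                  p_geom=0.2, mask_token="[MASK]"):
    tokens = query_tokens + ["[SEP]"] + product_tokens
    target = max(1, int(mask_budget * len(tokens)))
    masked, n_masked = list(tokens), 0
    while n_masked < target:
        if random.random() < 0.5:                       # mask within the query
            lo, hi = 0, len(query_tokens)
        else:                                           # or within the product
            lo, hi = len(query_tokens) + 1, len(tokens)
        span_len = int(np.random.geometric(p_geom))     # span length in words
        start = random.randrange(lo, hi)                # uniform start in that side
        for i in range(start, min(start + span_len, hi)):
            if masked[i] != mask_token:
                masked[i] = mask_token
                n_masked += 1
    return masked

# Field permutation: shuffle product fields before concatenation so the
# objective is equivariant to field order.
fields = ["title: omron blood pressure monitor", "type: upper arm"]
random.shuffle(fields)
print(qpi_span_mask("omron bp monitor".split(), " ".join(fields).split()))
```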

2.3 Finetuning for Matching

The third stage of training is to finetune the large qpi-bert teacher encoder in a bi-encoder setting for matching. We train a bi-encoder teacher as opposed to a cross-encoder teacher for retrieval because the extreme inefficiency of generating predictions for evaluation and the slow training convergence make it impractical to train cross-encoders on web-scale data with large models.

Let us denote the qpi-bert model as M, the query encoder as \(M_{q}\), and the product encoder as \(M_{p}\), where the weights of the query and product encoders are shared. In our experiments, sharing weights performed comparably to training them independently. For a query-product pair Q and P as inputs, we first generate the embedding \(Q_{emb}\) for query Q with \(M_{q}\) and the embedding \(P_{emb}\) for product P with \(M_{p}\), taking the “[CLS]” token representation in both cases. A cosine similarity score \(s_{Q,P} = \cos (Q_{emb}, P_{emb})\) is used to compute the relevance between them.
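The scoring setup can be sketched as below, assuming any encoder module that returns per-token hidden states; the toy embedding layer merely stands in for the shared qpi-bert encoder, and all dimensions are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Minimal bi-encoder sketch: a shared encoder (M_q = M_p) maps token ids to
# per-token hidden states, the "[CLS]" position is used as the embedding,
# and cosine similarity gives the relevance score s_{Q,P}.
class BiEncoder(nn.Module):
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # shared between query and product sides

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(token_ids)          # (batch, seq, dim)
        return hidden[:, 0, :]                    # "[CLS]" representation

    def forward(self, query_ids: torch.Tensor, product_ids: torch.Tensor):
        q_emb = self.embed(query_ids)
        p_emb = self.embed(product_ids)
        return F.cosine_similarity(q_emb, p_emb, dim=-1)

# Toy usage with an embedding layer standing in for the BERT encoder.
toy_encoder = nn.Embedding(1000, 64)
model = BiEncoder(toy_encoder)
scores = model(torch.randint(0, 1000, (4, 12)), torch.randint(0, 1000, (4, 48)))
```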

We train the bi-encoder using a three-part hinge loss. This loss requires the ground-truth data (\(y_{Q, P}\)) to be labeled with one of three possible values, referred to as positive (1), hard negative (0), and random negative (\(-1\)). We use the purchased products for a given query as positives and any product uniformly sampled from the catalog as a random negative. Identifying hard negatives is non-trivial [12, 17]; in this work we choose a simple yet effective approach [3], where, for a given query, all products that were shown to the user but received no interaction are treated as hard negatives. The loss takes the following form:

$$\begin{aligned} \textrm{loss}_{Q, P}(y_{Q, P}, s_{Q, P}) = {\left\{ \begin{array}{ll} \max (\delta _{pos} - s_{Q, P}, 0), &{} \text {if}\; y_{Q, P}=1,\\ \max (\delta _{hn}^{-} - s_{Q, P}, 0) + \max (s_{Q, P} - \delta _{hn}^{+}, 0), &{} \text {if}\; y_{Q, P}=0,\\ \max (s_{Q, P} - \delta _{rn}, 0), &{} \text {if}\; y_{Q, P}=-1, \end{array}\right. } \end{aligned}$$
(1)

where \(\delta _{pos}\) and \(\delta _{hn}^{-}\) are the lower thresholds for the positive and hard negative data scores respectively and \(\delta _{hn}^{+}\) and \(\delta _{rn}\) are the upper thresholds for the hard negative and random negative data scores respectively. We refer to the model trained with this strategy as the qpi-bert-ft model (see Fig. 1c).
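A minimal PyTorch sketch of Eq. (1) is given below, using the threshold values reported later in Sect. 3.1; the batch-mean reduction and tensor shapes are assumptions made for illustration.

```python
import torch

# Hedged sketch of the three-part hinge loss in Eq. (1).
# Thresholds follow Sect. 3.1: delta_pos=0.9, delta_hn+=0.55, delta_hn-=delta_rn=0.2.
def three_part_hinge_loss(scores: torch.Tensor, labels: torch.Tensor,
                          d_pos=0.9, d_hn_minus=0.2, d_hn_plus=0.55, d_rn=0.2):
    pos_loss = torch.clamp(d_pos - scores, min=0.0)                  # y = 1
    hn_loss = (torch.clamp(d_hn_minus - scores, min=0.0)
               + torch.clamp(scores - d_hn_plus, min=0.0))           # y = 0
    rn_loss = torch.clamp(scores - d_rn, min=0.0)                    # y = -1
    loss = torch.where(labels == 1, pos_loss,
                       torch.where(labels == 0, hn_loss, rn_loss))
    return loss.mean()

labels = torch.tensor([1, 0, -1])          # positive, hard negative, random negative
scores = torch.tensor([0.7, 0.4, 0.3])     # cosine similarities s_{Q,P}
print(three_part_hinge_loss(scores, labels))
```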

2.4 Distillation and Realtime Inference

The final stage of our framework is to distill the knowledge of the qpi-bert-ft teacher to a smaller student bi-encoder BERT model (75M to 150M parameters) that meets the online latency constraint. We first pretrain and pre-finetune the small model, following the same procedure as qpi-bert, to obtain the small-qpi-bert model \(\tilde{M}\). We then clone the encoder to create a query encoder \(\tilde{M_{Q}}\) and a product encoder \(\tilde{M_{P}}\). Unlike the large model case, for the small model we observe that sharing parameters between the encoders improves performance significantly. The query embedding \(\tilde{Q}_{emb}\) and product embedding \(\tilde{P}_{emb}\) for the student model are computed by averaging all token embeddings of the query Q and product P respectively. The relevance score for a query-product pair is computed using cosine similarity, i.e., \(\tilde{s}_{Q, P} = \cos (\tilde{Q}_{emb}, \tilde{P}_{emb})\). The student is trained by minimizing the distance between its score and the score generated by the qpi-bert-ft teacher using the mean squared error (MSE) loss:

$$\begin{aligned} \textrm{loss}_{Q, P}(s_{Q, P}, \tilde{s}_{Q, P})&= (s_{Q, P} - \tilde{s}_{Q, P})^{2} \end{aligned}$$
(2)
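The sketch below illustrates one distillation step under Eq. (2): the student uses mean-pooled token embeddings and regresses the teacher's cosine score. The toy embedding layer, Adam optimizer, and precomputed teacher scores are stand-ins, not the actual setup (the paper trains with LANS, Sect. 3.1).

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hedged sketch of one distillation step: mean-pooled student embeddings,
# cosine score, MSE against precomputed teacher scores s_{Q,P}.
def student_score(encoder: nn.Module, query_ids, product_ids):
    q = encoder(query_ids).mean(dim=1)      # average all token embeddings
    p = encoder(product_ids).mean(dim=1)
    return F.cosine_similarity(q, p, dim=-1)

student = nn.Embedding(1000, 64)            # stand-in for small-qpi-bert
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

query_ids = torch.randint(0, 1000, (8, 12))
product_ids = torch.randint(0, 1000, (8, 48))
teacher_scores = torch.rand(8)              # scores from qpi-bert-ft (precomputed)

opt.zero_grad()
loss = F.mse_loss(student_score(student, query_ids, product_ids), teacher_scores)
loss.backward()
opt.step()
```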

In practice, we observed that simple score matching using MSE outperformed other approaches such as an L2 loss on the embeddings directly, Margin-MSE [18] with random negatives, or contrastive losses like SimCLR [19] with random negatives. We refer to the model distilled with this strategy as the small-qpi-bert-dis model (see Fig. 1d). At runtime, for every query entered by the customer, we compute the query embedding and then retrieve the top K products using ANN search [4]. To serve traffic in realtime, we cache the product embeddings and compute only the query embedding online. The retrieved products are served directly to customers or mixed with other results and re-ranked before being displayed to the customer.
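The runtime flow can be sketched as follows. Faiss is used here purely as an illustrative ANN library (the paper's ANN system [4] is not specified in this sketch), an exact inner-product index stands in for the approximate index needed at billion scale, and the random vectors stand in for encoder outputs.

```python
import numpy as np
import faiss  # illustrative ANN library; the actual ANN system may differ

# Offline: embed the catalog with the product encoder and index it.
dim, n_products, k = 256, 10_000, 100
product_emb = np.random.rand(n_products, dim).astype("float32")  # stand-in embeddings
faiss.normalize_L2(product_emb)          # normalize so inner product = cosine
index = faiss.IndexFlatIP(dim)           # exact index for simplicity of the sketch
index.add(product_emb)

# Online: embed only the query in realtime and retrieve the top-K products.
query_emb = np.random.rand(1, dim).astype("float32")  # stand-in query embedding
faiss.normalize_L2(query_emb)
scores, product_ids = index.search(query_emb, k)
```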

3 Empirical Evaluation

3.1 Experimental Setup

Data. We use the following multilingual datasets for different stages of training:

Domain-Specific Pretraining Data: We use \(\sim \)1 billion product titles and descriptions from 14 different languages. This data is also used to construct a sentencepiece [20] tokenizer with a vocabulary size of 256K.

Interaction Pre-finetuning Data: We use \(\sim \)15M query-product pairs from 12 languages and use weak supervision in the form of rules to label them as relevant or irrelevant. \(\sim \)80% of the pairs are relevant query-product pairs.

Finetuning for Matching Data: We use \(\sim \)330M query-product pairs subsampled from a live e-commerce service to train the model for matching. We maintain a positive to hard negative to random negative ratio of 1:10:11. The pairs are collected from multiple countries covering at least 4 languages. To compute recall, we use a validation dataset that contains 28K queries and 1M products from the subsampled catalog. Human evaluation (Sect. 3.1) uses a held-out set of 100 queries.

Models. We experiment with several model variants, both small and large, summarized in Table 1. All large models we train are based on ds-bert, which is a multilingual BERT model with 38 layers, 1024 output dimensions, and 4096 hidden dimensions. When the parameters for the query and product encoders are not shared, the model has twice the parameters of a single encoder. The small models we train are multilingual BERT models with 2 layers, 256 output dimensions, and 1024 hidden dimensions. In addition, we use dssm and xlmroberta as baselines.

\(\bullet {}\) xlmroberta : Publicly available XLMRoberta [21] model which is finetuned for matching as described in Sect. 2.3.

\(\bullet {}\) dssm : Bi-encoder model with a shared embedding layer (output dimension of 256) followed by batch norm and averaged token embedding to represent the query and product [3]. To ensure effective use of vocabulary for DSSM, we create a different sentencepiece model with 300k tokens using the matching training data.

Table 1. Bi-encoder model variants. Differences are number of parameters (Params), embedding dimensionality (ED), embedding type (ET), domain-specific pretraining (DS PT), QPI prefinetuning (QPI PFT), whether encoders share parameters (Shared), whether model is distilled from qpi-bert-ft (Dis).

Metrics. R@100: This is the average purchase recall computed on the validation data for the top 100 products retrieved.

Relevance Metrics: To understand the true improvement in the quality of matches retrieved by the model, we use Toloka (toloka.yandex.com) to label the results produced by our models. For every query we retrieve 100 results and ask the annotators to label them as exact match, substitute, or other. We report the average percentage of exact (E@100), substitute (S@100), and other (O@100) matches. We use E@100 + S@100 (E+S) to measure semantic improvement in the model.
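As an illustration of the recall metric, a per-query R@100 could be computed as in the short sketch below; the data structures are hypothetical, and averaging over queries follows the definition above.

```python
# Hedged sketch of purchase recall at K (R@100) over a validation set.
def recall_at_k(retrieved: list, purchased: set, k: int = 100) -> float:
    if not purchased:
        return 0.0
    hits = len(set(retrieved[:k]) & purchased)   # purchased products found in top K
    return hits / len(purchased)

# Hypothetical per-query data: (retrieved product ids, purchased product ids).
queries = {"q1": (["p3", "p7", "p9"], {"p7", "p42"})}
r_at_100 = sum(recall_at_k(r, p) for r, p in queries.values()) / len(queries)
```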

Training. We use Deepspeed (deepspeed.ai) and PyTorch for training models on AWS P3DN instances. We use the LANS optimizer [22] with a learning rate between \(10^{-4}\) and \(10^{-6}\) depending on the model, and a batch size of 8192 for all models. During pre-finetuning, we use validation MLM accuracy for early stopping, and for finetuning we use validation recall. For the three-part hinge loss in Eq. 1, we set \(\delta _{pos} = 0.9\), \(\delta _{hn}^{+} = 0.55\), and \(\delta _{rn} = \delta _{hn}^{-} = 0.2\).

3.2 Offline and Online Results

Does our Training Strategy Help Improve Semantic Matching Performance Offline? For large models, we compare qpi-bert-ft with xlmroberta, ds-bert, and qpi-bert-ft*, and for small models, we compare small-qpi-bert-ft and small-qpi-bert-ft-sh with dssm (Table 2). a) qpi-bert-ft outperforms the other approaches in both R@100 and E+S. Among large models, ds-bert performs better than xlmroberta, and qpi-bert-ft* performs better than ds-bert. This clearly indicates progressive improvement across the stages of our approach. b) We observe that dssm outperforms xlmroberta in all metrics, indicating a vocabulary and domain mismatch between the catalog data and web data; domain-specific pretraining is essential when training the large models. c) qpi-bert-ft significantly outperforms qpi-bert-ft* in all metrics, validating the importance of interaction pre-finetuning over supervision alone for matching. d) For small models, the performance of small-qpi-bert-ft is very similar to dssm, with small-qpi-bert-ft showing a \(\sim \)45% relative lift in S@100 but an \(\sim \)8% relative drop in E@100, a \(\sim \)4% relative lift in E+S, and a \(\sim \)1% relative drop in R@100. When sharing parameters between the query and product encoders and averaging embeddings, small-qpi-bert-ft-sh-avg outperforms dssm with a \(\sim \)38% relative lift in S@100, a \(\sim \)2% relative lift in E@100, a \(\sim \)10% relative lift in E+S, and a \(\sim \)2% relative lift in R@100. The results indicate that our strategy improves performance overall and that the improvements are larger for bigger models (\(\sim \)23% relative lift in E+S over dssm). This reinforces our proposed approach: train a large model and distill its knowledge to a smaller model, instead of directly training a smaller model.

Table 2. Offline metrics of models on a multi-lingual e-commerce dataset

Can Distillation Preserve Large Model Performance? Given the large improvement in matching metrics for large models, we would ideally like to retain this improvement in smaller models using distillation. We compare small-qpi-bert-dis with qpi-bert-ft (Table 2) and observe a \(\sim \)3% relative drop in E+S and R@100. This shows that, while there is a small gap, it is possible to transfer most of the information from the 1.5B parameter qpi-bert-ft model to a 20x smaller small-qpi-bert-dis model (75M parameters) using our approach.

Does Sharing Parameters in the Bi-encoder have an Impact on Retrieval Task Performance? To understand the effect of sharing parameters between query and product encoders in the bi-encoder setting, we compare qpi-bert-ft-sh with qpi-bert-ft among the large models, and small-qpi-bert-ft-sh with small-qpi-bert-ft and small-qpi-bert-ft-sh-avg with small-qpi-bert-ft-avg among the small models (Table 2). We observe that sharing encoders has almost no impact on the performance of large models: the maximum relative drop in E+S and recall is \(\sim \)1%, with qpi-bert-ft-sh winning marginally. However, in the smaller models we observe that sharing parameters gives a large boost in performance, with a relative lift of up to \(\sim \)32% in the E+S metric and \(\sim \)60% in R@100. When the model is large enough, it is capable of learning independent encoders for both inputs; when the model is small, it benefits from sharing parameters.

Fig. 2. Top 6 results obtained by dssm and qpi-bert-ft for the queries “sailor ink” and “omron sale bp monitor machine”.

How Does our Approach Improve over a non-BERT-Based Model? To visualize the difference in matching quality between our BERT-based model and DSSM, we look at results for two queries, with DSSM retrieving more relevant products on one query and vice versa on the other (Fig. 2). For the query “sailor ink”, qpi-bert-ft performs better, as all retrieved products are relevant. For this query, dssm behaves like a lexical matcher and fetches results for both “sailor” and “ink”. For the query “omron sale bp monitor machine”, dssm retrieves all relevant matches. qpi-bert-ft, however, retrieves an irrelevant product (a fitness watch). While irrelevant, it still falls into the “personal health” product type, implying an error in semantic generalization. The significantly higher increase in S@100 compared to E@100 indicates that qpi-bert-ft is a better semantic model, as matching substitutes requires representations that capture high-level concepts, which token-level exact matching cannot achieve.

What is the Latency Improvement of the Smaller BERT Model Compared to the Large Model? We have seen that the large model can be effectively compressed to a 20x smaller model that incurs much lower inference latency. We compare the inference latencies of our models when generating query embeddings, which is representative of realtime latency because the product embeddings are generated offline and indexed for ANN. We ignore the ANN latency, as modern ANN search can be computed effectively in realtime (\(\sim \)1 ms) [4]. Figure 3 shows the time it takes to compute the query embedding (inference) for different query lengths (measured as the number of tokens after tokenization) on an r5.4xlarge AWS instance. As expected, dssm has the lowest inference time and qpi-bert has the largest. Both small-qpi-bert and dssm generate embeddings in under 1 ms for queries of up to 32 tokens, making it feasible to serve realtime traffic. small-qpi-bert reduces latency by \(\sim \)60\(\times \) compared to qpi-bert with a drop of only \(\sim \)3% in the relevance metric.

Fig. 3. Inference time for qpi-bert, small-qpi-bert, and dssm on an r5.4xlarge instance.

Table 3. A/B test results for small-qpi-bert-dis relative to the production system.

How Well Does the Approach Perform Online? To measure the impact of our approach online, we experiment with small-qpi-bert-dis in a large multi-lingual e-commerce service. The service aggregates matching results from several sources, such as lexical matchers, semantic matchers, upstream machine learning models, and advertised products. We replace only the production semantic matcher with our small-qpi-bert-dis and perform an A/B test. We measure both customer engagement metrics and relevance quality metrics. For customer engagement, we look at the change in the number of units purchased and the amount of product sales (PS). For quality, we look at the change in user-evaluated E@16, S@16, E+S@16, and sparse results (SR), the percentage of queries with fewer than 16 products retrieved. We observe (Table 3) that our approach significantly improves over the production semantic matcher and leads to a significant drop in SR. The reduction in E@16 and increase in S@16 suggest that our approach learns latent semantic meaning, increasing the substitutes displayed to customers. We also observe that our model does not have a significant impact on latency (\(\sim \)4 ms) and can be used at runtime.

4 Conclusion

In this work we develop a four-stage training paradigm to train an effective BERT model that can be deployed online to improve product matching. We introduce a new pre-finetuning task that incorporates the interaction between queries and products prior to training for retrieval which we show is critical to improving performance. Using a simple yet effective approach, we distill a large model to a smaller model and show through offline and online experiments that our approach can significantly improve customer experience. As future work, it would be interesting to incorporate other structured data from the e-commerce service to enhance representation learning, such as brand and product dimensions, as well as customer interaction data such as reviews.