Leveraging Customer Reviews for E-commerce Query Generation

Abstract. Customer reviews are an effective source of information about what people deem important in products (e.g. “strong zipper” for tents). These crowd-created descriptors not only highlight key product attributes, but can also complement seller-provided product descriptions. Motivated by this, we propose to leverage customer reviews to generate queries pertinent to target products in an e-commerce setting. While there has been work on automatic query generation, it often relied on proprietary user search data to generate query-document training pairs for learning supervised models. We take a different view and focus on leveraging reviews without training on search logs, making reproduction more viable for the public. Our method adopts an ensemble of the statistical properties of review terms and a zero-shot neural model trained on an adapted external corpus to synthesize queries. Compared to competitive baselines, we show that queries generated by our method both better align with actual customer queries and benefit retrieval effectiveness.


Introduction
Customer reviews contain diverse descriptions of how people reflect on the properties, pros and cons of the products that they have experienced. For example, properties such as "for underwater photos" or "for kayaking recording" were mentioned in reviews for action cameras, as were "compact" or "strong zipper" for tents. These descriptors not only paint a rich picture of what people deem important, but can also complement seller-provided product descriptions and uncover shopping considerations absent from them. Motivated by this, our work investigates ways to generate queries that surface key properties of the target products using reviews.
Previous work on automatic query generation often relied on human labels or logs of queries and engaged documents (or items) [18][19][20][21][22] to form relevance signals for training generative models. Despite the reported effectiveness, the cost of acquiring high-quality human labels is high, whereas access to search logs is often limited to site owners. As we approach the problem using reviews, it brings the advantage of not requiring any private, proprietary user data, making reproduction more viable for the public in general. Meanwhile, generation based on reviews is favorable as the outcome may likewise produce human-readable language patterns, potentially facilitating people-facing experiences such as related search recommendation.
(Work done while an intern at Amazon.)
We propose a simple yet effective ensemble method for query generation. Our approach starts by building a candidate set of "query-worthy" terms from reviews. First, we leverage syntactic and statistical signals to build a set of terms from reviews that are most distinguishable for a given product. A second set of candidate terms is obtained through a zero-shot sequence-to-sequence model trained on adapted external relevance signals. Our ensemble method then devises a statistics-based scoring function to rank the combined set of all candidates, from which a query can be formulated given a desired query length.
Our evaluation examines two crucial aspects of query quality. To quantify how readable the queries are, we take the human-submitted queries from logs as ground truth to evaluate how close the generated queries are to them for each product. Moreover, we investigate whether the generated queries can benefit retrieval tasks, similar to prior studies [6,7,17]. We collect pairs of product descriptions and generated queries, both of which can be derived from public sources, to train a deep neural retrieval model. During inference, we take human-submitted queries on the corresponding product to benchmark the retrieval effectiveness. Compared with the competitive alternatives YAKE [1,2] and Doc2Query [6], our approach shows significantly higher similarity with human-submitted queries and benefits retrieval performance across multiple product types.

Related Work
Related search recommendation (or query suggestion) helps people automatically discover related queries pertinent to their search journeys. With the advances in deep encoder-decoder models [9,12], query generation [6,18,19,21,22] sits at the core of many recent recommendation algorithms. Sordoni et al. [19] proposed hierarchical RNNs [26] to generate next queries based on observed queries in a session. Doc2Query [6] adapted T5 [12] to generate queries according to input documents. Ahmad et al. [22] jointly optimized two companion ranking tasks, document ranking and query suggestion, by RNNs. Our approach differs in that we do not require in-domain logs of query-document relations for supervision.
Studies have also shown that generated queries can be used to enhance retrieval effectiveness [6,7,17]. Doc2Query [6] leveraged the generated queries to enrich and expand document representations. Liang et al. [7] proposed to synthesize query-document relations based on MSMARCO [8] and Wikipedia for training large-scale neural retrieval models. Ma et al. [17] explored a similar zero-shot learning method for the different task of synthetic question generation, while Puri et al. [23] improved QA performance by incorporating synthetic questions. Our work resembles the zero-shot setup but differs in how we adapt external corpora specifically for e-commerce query generation.
Customer reviews have been adopted as a useful resource for summarization [24] and product question answering (PQA). Approaches to PQA [10,11,14,16] often take reviews as input, conditioned on which answers are generated for user questions. Deng et al. [11] jointly learned answer generation and opinion mining tasks, requiring both a reference answer and its opinion type during the training phase. While our work also depends on reviews as input, we focus on synthesizing the most relevant queries without requiring ground-truth labels.

Method
Our approach involves a candidate generation phase to identify key terms from reviews, and a selection phase that employs an unsupervised scoring function to rank and aggregate the term candidates into queries.

Statistics-based approach
We started with a pilot study to characterize the opportunity of whether and how reviews could be useful for query generation. We found that a subset of terms in reviews resemble that of search queries, which are primarily composed of combinations of nouns, adjectives and participles to reflect critical semantics. For example, given a headphone, the actual queries that had led to purchases may contain nouns such as "earbuds" or "headset" to denote product types, adjectives such as "wireless" or "comfortable" to reflect desired properties, and participles such as "running" or "sleeping" to emphasize use cases.
Inspired by this, we first leverage part-of-speech analysis to scope reviews down to the three types of POS tags. From this set, we then rely on conventional tf-idf corpus statistics to mine distinguishing terms that are salient in a product type but not generic across the entire catalog. Specifically, an importance score I_t^D = p(t, R_D) / p(t, R_G) estimates the salience of a term t in a product type D by contrasting its density in the product type's review set R_D against the generic review set R_G, where p(t, R) = freq(t, R) / Σ_{r∈R} |r|. Beyond unigrams, we also consider bigram phrases containing the unigrams: if a bigram's relative frequency is above a threshold, the bigram replaces the unigram as the candidate. We apply I_t^D to each review sentence, and collect the top-scored terms or phrases as candidates.
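The importance score above can be sketched in a few lines of Python. This is a minimal illustration under the paper's definitions, not the authors' implementation; the helper names (`term_density`, `importance`) and the token-list input format are assumptions, and the small epsilon guards against terms unseen in generic reviews.

```python
def term_density(term, reviews):
    """p(t, R): frequency of term t over the total token count of review set R.
    Each review is a list of tokens."""
    freq = sum(tokens.count(term) for tokens in reviews)
    total = sum(len(tokens) for tokens in reviews)
    return freq / total if total else 0.0

def importance(term, product_reviews, generic_reviews, eps=1e-9):
    """I_t^D: density of t in the product type's reviews contrasted
    against its density in generic reviews."""
    return term_density(term, product_reviews) / (term_density(term, generic_reviews) + eps)
```

A term like "zipper" that is dense in tent reviews but rare in the overall catalog would receive a high score, while generic words like "good" score near zero.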
A straightforward way to form queries is to use the candidates as-is. We additionally consider an alternative that trains a seq2seq model using the candidates as weak supervision (i.e., encoding review sentences to fit the candidates). By doing so, we anticipate that the terms decoded during inference generalize more broadly than a direct application. The two methods are referred to as Stats-base and Stats-s2s, respectively.

Zero-shot generation based on adapted external corpus
Recent findings [7,17] suggest that zero-shot domain adaptation can deliver high effectiveness given the knowledge embedded in large-scale language models via pre-training. We therefore fine-tune T5 [12] on MSMARCO query-passage pairs to capture a notion of generic relevance, and apply the trained model to e-commerce reviews to identify terms that are more likely to be adopted in queries.
This idea was explored by Nogueira et al. [6], whose Doc2Query approach focused on generating queries as document expansion for improving retrieval performance. Different from [6], our objective is to generate queries that are not only beneficial to retrieval but also similar to actual queries in terms of syntactic form. Thus, a direct application of Doc2Query on MSMARCO creates a gap in our case, since MSMARCO "queries" predominantly follow a natural-language question style, resulting in generated queries of similar forms. To tighten the loop, we apply POS-tag analysis to MSMARCO queries and retain only terms that satisfy the selected POS tags (i.e. nouns, adjectives and participles). For example, an original query "what does physical medicine do" is transformed into "physical medicine" as pre-processing. After this adaptation, we train the T5 seq2seq model and apply it in a zero-shot fashion to generate salient terms from input review sentences.
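The pre-processing step above can be sketched as a filter over POS-tagged tokens. This is an illustrative sketch, not the authors' code: it assumes Penn Treebank tags (nouns NN*, adjectives JJ*, participles VBG/VBN) and takes already-tagged (token, tag) pairs as input, as would be produced by a tagger such as `nltk.pos_tag`.

```python
# Penn Treebank tags to keep: nouns (NN*), adjectives (JJ*),
# and participles (VBG present, VBN past).
KEEP_PREFIXES = ("NN", "JJ")
KEEP_TAGS = {"VBG", "VBN"}

def filter_query(tagged_tokens):
    """Drop question-style function words from a query, keeping only
    content-bearing terms that match the selected POS tags."""
    kept = [tok for tok, tag in tagged_tokens
            if tag.startswith(KEEP_PREFIXES) or tag in KEEP_TAGS]
    return " ".join(kept)
```

Applied to the tagged query "what does physical medicine do", only "physical medicine" survives, turning a question-style MSMARCO query into a keyword-style e-commerce query.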

Ensemble approach to query generation
For a product p in the product type D, we employ both the statistical and zero-shot approaches on its reviews to construct candidates for generating queries, which we denote as C_p. To select representative terms from the set, we devise a scoring function S_t = freq(t, C_p) · log(|{p' ∈ D}| / |{p' | p' ∈ D, t ∈ C_{p'}}|) to rank all candidates, where higher-ranked terms are more distinguishable for a specific product, following the tf-idf intuition. Given a desired query length n, we formulate pseudo queries for a product by selecting all possible n-term combinations from the top-k scored terms in C_p. A final post-processing step removes words that become redundant after stemming and adds the product type if not already included.
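The scoring and query-formulation steps can be sketched as follows. This is a minimal interpretation of the formula above, not the authors' implementation; the data layout (a dict mapping product ids to candidate term lists) and the helper names are assumptions, and the post-processing step (stemming, product-type insertion) is omitted for brevity.

```python
import math
from itertools import combinations

def score_terms(candidates_by_product, product_id):
    """S_t = freq(t, C_p) * log(|D| / |{p' in D : t in C_p'}|),
    a tf-idf-style score over candidate sets within one product type."""
    c_p = candidates_by_product[product_id]
    n_products = len(candidates_by_product)
    scores = {}
    for t in set(c_p):
        doc_freq = sum(1 for terms in candidates_by_product.values() if t in terms)
        scores[t] = c_p.count(t) * math.log(n_products / doc_freq)
    return scores

def make_queries(scores, n, k=5):
    """Form pseudo queries as all n-term combinations of the top-k scored terms."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [" ".join(combo) for combo in combinations(top, n)]
```

A term such as "zipper" that is frequent for one tent but rare across the product type outranks a term like "tent" shared by every product, whose idf factor drives its score to zero.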

Experiments
Our evaluation set is composed of products from three different product types, together with the actual queries that were submitted by people who purchased those products on Amazon.com. As shown in Table 1, we consider headphones, tents and conditioners to evaluate our method across diverse product types, for which people tend to behave and shop differently, with the variance reflected in search queries. The query vocabulary size for conditioners, for instance, is about thrice that of tents, with headphones sitting in between.
As our approach disregards the actual queries for supervision, we primarily consider competitive baselines that do not involve using query logs. In particular, we compare to the unsupervised approach YAKE [1,2], which reportedly outperforms a variety of seminal keyword extraction approaches, including RAKE [4], TextRank [3] and SingleRank [5]. In addition, we leverage the zero-shot Doc2Query model on the adapted corpus as our baseline to reflect the absence of e-commerce logs. For generation, we initialize separate Huggingface T5-base [12] weights with a conditional generation head and fine-tune for the Stats-s2s and Doc2Query models respectively. Training is conducted on review sentences segmented by NLTK. For retrieval, we fine-tune a Sentence-Transformer [25] ms-marco-TinyBERT model pre-trained with MSMARCO data, which was shown to be effective for semantic matching. Our experiments use a standard AdamW optimizer with learning rate 0.001 and (β1, β2) = (0.9, 0.999), and conduct 2 and 4 epochs of training with a batch size of 16 for generation and retrieval respectively.

Intrinsic Similarity Evaluation
Constructing readable and human-like queries is desirable since it is practically useful for applications such as related search recommendation. A natural way to reflect readability is to evaluate the similarity between the generated and customer-submitted queries, since the latter are created by humans. In practice, we consider customer-submitted queries that had led to at least 5 purchases of the corresponding products as ground-truth queries, to which the generated queries are then compared. We use conventional metrics adopted in generative tasks, including corpus BLEU and METEOR, for evaluation. The results in Table 2 show that our ensemble approach consistently achieves the highest similarity with human queries across product types, suggesting that the statistical and zero-shot methods can be mutually beneficial.

Extrinsic Retrieval Evaluation
We further study how the generated queries can benefit e-commerce retrieval. Our evaluation methodology leverages pairs of generated queries and product descriptions to train a retrieval model, and validates its quality on actual queries. During training, we fine-tune a Sentence-Transformer on the top-3 generated queries of each product. For each query, we prepare its corresponding relevant product description, together with 49 negative product descriptions randomly sampled from the same product type. During inference, instead of generated queries, we use customer-submitted queries to fetch descriptions from the product corpus; an ideal retrieval model should rank the corresponding product description at the top. We also include BM25 as a common baseline. Table 3 shows that Doc2Query and the ensemble method are the most effective and on par in aggregate, with some variance across product types. Stats-s2s slightly outperforms Stats-base overall, which may hint at better generalization.
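The training-pair construction described above (one relevant description plus 49 sampled negatives per query) can be sketched as follows. This is an illustrative sketch, not the authors' pipeline; the function name, the (query, description, label) tuple format, and the fixed random seed are assumptions made for reproducibility of the example.

```python
import random

def build_training_pairs(query, positive_desc, corpus_descs, n_negatives=49, seed=0):
    """Pair a generated query with its relevant product description (label 1)
    and n_negatives descriptions sampled from the same product type (label 0)."""
    rng = random.Random(seed)
    pool = [d for d in corpus_descs if d != positive_desc]
    negatives = rng.sample(pool, min(n_negatives, len(pool)))
    return [(query, positive_desc, 1)] + [(query, d, 0) for d in negatives]
```

Each generated query thus yields 50 labeled pairs; at inference time the same retrieval model is scored with customer-submitted queries instead.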

Conclusion
This paper connected salient review descriptors with zero-shot generative models for e-commerce query generation, without requiring human labels or search logs. The empirical results showed that the ensemble queries both better resemble customer-submitted queries and benefit the training of effective rankers. Beyond MSMARCO, we plan to incorporate other publicly available resources, such as community question-answering threads, to generalize the notion of relevance. It is also worth considering ways to combine weak labels with a few strong labels, and examining the impact of different hyper-parameters. A user study characterizing the extent to which the generated queries reflect people's purchase intent would further aid qualitative understanding.