Introduction

Recent years have seen a significant increase in the adoption of the Linked Open Data (LOD) practice [1] by data publishers on the Web. LOD refers to the practice of describing structured data using standard markup languages (e.g. RDFaFootnote 1) and universal vocabularies (e.g. schema.org) that allow defining properties and relations of data, and publishing and interlinking such data on the open Web (hence ‘semantic markup data’). While early LOD datasets primarily took the form of a graph database such as DBpedia,Footnote 2 an increasingly popular decentralised approach has been the embedding of semantic markup data within HTML pages. The Web Data CommonsFootnote 3 (WDC) project extracts such markup data from the CommonCrawlFootnote 4 as RDF n-quads,Footnote 5 and releases them on an annual basis. As of October 2021, its statistics showed that over 60% of web pages, or over 50% of websites, contained semantic markup, amounting to over 82 billion quads, more than double the figure from 2019.

Such semantic markup data allow the ‘machine reading’ of the Web. It has not only driven the development of new products and data integration services, but also created unprecedented opportunities for research in the areas of Natural Language Processing (NLP, e.g. [2, 3]). This is because the RDF n-quads extracted from such data are described by universal vocabularies that define concepts, their properties and relationships. One of the most popular vocabularies is schema.org, which currently contains nearly 800 concepts and 1400 properties, and is used by over 10 million websites.Footnote 6 Studies have shown that such data can be used to train models for various NLP tasks, such as event extraction [2] and entity linking [4].

A particular domain witnessing a boom in semantic markup data is e-commerce, where online shops are increasingly embedding product markup data described using schema.org vocabularies into their web pages in order to improve content accessibility. Among the over 82 billion RDF n-quads mentioned above, nearly 17% are related to products and are described by schema.org vocabularies. For example, previous studies [4, 5] showed that among all product offers, 95% had an n-quad related to their names, 65% had one for their description, 35% had one for their brand, and fewer than 10% had one for their category. Such product markup data can potentially serve as language resources for various product data mining tasks, particularly given the current trend of employing neural networks to build large language models that prove effective for a wide range of NLP tasks [6,7,8,9,10]. However, we identify a gap in existing studies in this area. While there have been a small number of sporadic studies [4, 5, 11] in this direction in the product domain, the data sources used, the processes applied to these data sources, and the findings have been inconsistent. This has made it difficult to properly compare or evaluate the usefulness of such data, or even answer the question ‘to what extent, and how, can we exploit this gigantic body of semantic markup data for product data mining tasks?’.

To address this gap, in this work, we explore a series of questions in the context of the product domain, which is chosen for two reasons. First, it is one of the most promising domains where a ‘critical mass’ of such resources has been created. Second, it is a domain that continues to garner interest from both researchers and practitioners, as evidenced by a series of workshops and shared tasks sponsored by industries [11, 12]. We study the following questions:

  • How can we transform such semantic markup data to potentially useful language resources?

  • How useful are these language resources for downstream NLP tasks in the product domain?

To answer these questions in this work, we aim to achieve the following objectives:

  • To investigate and develop different methods for building language resources out of the semantic markup data;

  • To evaluate the created language resources on a number of product-related NLP tasks using state-of-the-art (SoTA) benchmarks;

  • To develop an understanding of whether and how quality issues in the semantic markup data may impact the creation of language resources;

  • To outline the implications of our study and future research directions that may help advance research and practice in this area.

Methodologically, we start by processing the semantic markup data to transform them into different types of language resources that can be used for downstream NLP tasks. For this, we review SoTA research in the area of product data mining and identify three approaches: training word-embedding models (e.g. [11]), continued pre-training of BERT-like language models (e.g. [9]), and training machine translation models (e.g. [13]) that are used as a proxy to generate product-related keywords. Next, we apply these language resources to three product-related NLP tasks (product classification, product linking, and fake product review detection) and evaluate them using SoTA methods on current benchmarks. The exact methods depend on the task and the language resources used, and will therefore be explained in detail later.

The novelty and originality of this work lie in the fact that it is the first study to systematically examine and evaluate the use of semantic markup data for multiple product data mining tasks. Our findings therefore serve as useful references for future research in this direction. First, we report that among the three methods, word embeddings are the only one that consistently improves accuracy on all three tasks. The BERT language models and the MT-based product keywords, on the other hand, do not bring consistent improvements, even though many studies have successfully developed in-domain BERT models following the simple principle of continued pre-training of a generic BERT using large domain-specific corpora. Our results thus serve as a lesson that this method may not be as easy as it seems. Second, we conduct a number of analyses of the data and show that the biased domain representation in the data and the lack of vocabulary coverage may have been contributing factors. In particular, methods such as BERT language modelling and machine translation may be more susceptible to such ‘data quality’ issues than word-embedding modelling. Finally, we discuss how these findings can inform future research and practice, and we contribute our data as public resources which can be obtained upon request.Footnote 7

We organise the remainder of this paper as follows. The next section reviews related work. Then, the Sects. “Product Classification” to “Fake Product Review Detection” present our exploration of each of the three tasks; each section introduces our methodology and experiments, and presents the results. The “Further analysis” section looks at potential quality issues of the data. In the “Discussion” section, we discuss our findings and their implications, and this is followed by a conclusion in the last section.

Related Work

While our work belongs to the general field of data mining [14,15,16], to avoid stretching our literature review too thinly, we define our criteria for literature selection here. More generally, semantic markup data is a type of LOD resource, and there have been a large number of studies [17, 18] and organised eventsFootnote 8 on the creation and consumption of LOD. However, previous studies have predominantly looked at LOD resources that are published as graph databases, such as DBpedia and Wikidata, while very few focussed on LOD published as semantic markup data embedded within web pages. The fundamental difference between the two is quality, which underpins the approaches that one can take to use such resources. Most LOD graph databases are well curated, documented and maintained. Semantic markup data, however, can be highly heterogeneous, noisy, and unbalanced [19]. There is also a blend of studies that focussed on creating further LOD resources out of existing ones [20], compared to those that actually use LOD as language resources for downstream language processing tasks. Our literature review therefore has a specific focus on the following areas: (1) work that uses semantic markup data to create language resources, and (2) work on the three tasks we focus on, i.e. product classification, product linking, and fake product review detection. While our work is also broadly related to the use of neural networks in creating domain-specific language models, such as BioBERT [6], Clinical BERT [7], SciBERT [8], E-BERT [9], and SMedBERT [10], these studies did not use semantic markup data; therefore, we do not expand our literature review to this broad area, as that would significantly increase the scope of our discussion. However, in the “Discussion” section, we discuss our results with respect to the findings of these earlier studies.

Semantic Markup Data as Language Resources

Research on using semantic markup data for downstream language processing tasks has only taken off in recent years, and studies addressing the creation of language resources from such data are therefore limited. Primpeli et al. [21] adopted an unsupervised approach to create a very large training dataset for product entity linking using semantic markup data extracted from the 2017 CommonCrawl corpus. The process started with extracting product offers that contain product identifiers annotated using the schema.org vocabulary. Offers with the same identifiers were then placed in the same cluster, followed by a cleaning process to eliminate potentially noisy clusters. The resulting clusters are considered to be product offers referring to the same product entity, and are used to train entity linking models. The work was further extended in later studies [4, 22, 23], which showed that this automatically created training dataset is of high quality and can be used to train product entity matchers with high accuracy. While these studies investigated ad-hoc usages of semantic markup data as training data for specific tasks, our work explores possibilities of utilising such data to create language resources that are usable by a wider range of tasks.

The same authors also used a product corpus to train a domain-specific word-embedding model in [22], using fastText. Specifically, they extracted the brand, name and description properties annotated by schema.org from the same corpus above to create a text corpus that was used to train fastText embeddings. This domain-specific embedding model gained a minor improvement over a generic fastText embedding model on some product linking tasks. In comparison, our work also explores using product-related corpora to train word-embedding models; however, we study whether this can generalise to other product data mining tasks.

Work that uses semantic markup data to train embedding models can be traced back to [24], where the authors used schema.org annotations (names and descriptions) of product entities to train entity embeddings using the paragraph2vec model [25]. This approach suffers from similar limitations to the above, in that the embeddings learned in such a way are ad-hoc and can only be used for entities seen during the training of the embedding models. Our study explores more generic ways of learning word embeddings.

In the 2020 Semantic Web Challenge on product data mining (MWPD2020, [11]), a corpus of 1.9 billion words extracted from the descriptions of product entities annotated by the schema.org vocabulary was used to train word-embedding models. Compared to generic word-embedding models, such models contributed to better results on the product classification task when used with a fastText baseline [12]. However, they were not used by any of the participating teams of the shared task. This study fills this gap by thoroughly evaluating them on several product data mining tasks.

Product Classification

Product classification is typically treated as an entity classification task. The process involves extracting product metadata for feature representation, then training a supervised algorithm that learns to assign category labels (i.e. classes) to product instances based on their features. Since most existing methods follow a similar process and mainly differ in terms of the metadata used, the feature representation methods, and the machine learning algorithms, below we summarise related work from these angles and highlight their similarities and differences, instead of discussing each individual method in detail.

Metadata To classify products, features must be extracted from certain product metadata. Rich, structured metadata are often not available. Therefore, the majority of the literature has only used product names, such as [26,27,28,29] and all of the systems that participated in the 2018 Rakuten Data Challenge [12]. Several studies used both names and product descriptions [13, 30,31,32,33,34,35], while a few used other metadata such as model, brand and maker, which need to be extracted from product specification web pages by an Information Extraction process [24, 36]. In addition, [24] also used product images. The work by [5, 37] used product categories allocated by the vendors and embedded as semantic markup data within the web pages. To differentiate these from the classification targets in such tasks, we refer to them as ‘site-specific product labels’ or ‘categories’. The authors noted that despite the highly heterogeneous nature of such site-specific labels across different websites, they are still very useful for supervised classification. In comparison, this work explores a ‘new’ type of metadata: product-related keywords generated by a machine translation model trained on the massive product corpora. Compared to product metadata that already exist in a dataset and are of comparatively better quality, such keywords may be very noisy. Our work is the first to explore whether keywords generated in such a way can be useful for product classification.

Feature representation Generally speaking, for text-based metadata, there are three types of feature representation. The first is based on Bag-of-Words (BoW) or N-gram models, where texts are represented based on the presence of vocabulary in the dataset using either 1-hot encoding or some weighting scheme such as TF-IDF [27, 29, 30, 36]. This often creates high-dimensional sparse feature vectors. The second uses pre-trained word embeddings or Language Models (LM) to create a relatively low-dimensional, dense feature vector of the input text.

Certain techniques need to be applied in order to compose embeddings for long text passages from single words, such as in [26, 32], which computed text embeddings based on their composing words, and in [38] (non-product domain) and [37], which joined word-embedding vectors to create a 2D tensor representing the text. In more recent work that uses pre-trained LMs such as BERT (e.g. [39]), the construction of text passage embeddings is taken care of dynamically by passing the input text through the language model directly, which takes into account the context of words. The third type applies a separate learning process to learn a continuous distributional representation of the text directly from the downstream training datasets [24, 31, 32]. Our work makes use of feature representation methods of the second and third types. However, whereas the studies in [26, 32, 37, 39] used general-purpose, pre-trained word embeddings, we study the effects of embeddings purposefully trained on product-related corpora. The studies in [24, 31, 32] created ad-hoc representations of product entities discovered in the training set; since such representations cannot be generalised to other data or tasks, our study explores more generic, data-agnostic methods for composing such representations.

Algorithms The large majority of work has used supervised machine learning methods. These include methods that use traditional machine learning algorithms [5, 24, 27, 30,31,32], and those that apply DNN-based algorithms [12, 28, 36, 37], usually based on CNNs or RNNs; the latter include the majority of the participating systems in the 2018 Rakuten Data Challenge. In addition, [5] also explored unsupervised methods based on the similarity between the feature representations of a product and the target classes, and [29] studied product clustering, which does not label the resulting product groups; these represent unsupervised methods. Further, [13] studied the problem as a machine translation task, where the goal is to learn the mapping from a sequence of words in product names to a sequence of product categories. MWPD2020 [11] showed a trend towards using pre-trained LMs for classification, such as those based on the BERT model [40]: all participants but one in the product classification task at MWPD2020 used LM-based classification methods. The work by [39], for example, used an ensemble model combining 17 different variants of the BERT model to achieve the best result on this task.

This study does not focus on developing novel algorithms but instead reuses existing ones, such as the fastText baseline in [12] and the DNN structure in [37]. It may, however, reveal which algorithms are more sensitive to the different language resources created by this study.

Product Linking

Product linking or matching is the task of determining whether multiple product offers found on different websites (sometimes even on the same website) refer to the same, identical product entity. Product linking can be achieved by one of three approaches: classification (e.g. [11]), where product offer pairs are created a priori and classified as match or non-match; clustering, where a dataset of product offers is split into groups and members of the same group are considered to be about the same product; or retrieval (e.g. [34]), where the goal is to find the matching product entity in an existing database for a given product offer. In both classification and retrieval, a ‘blocking’ process is often applied to reduce the search space. All three approaches depend on the calculation of ‘similarities’ between product offers, which makes use of product metadata. A good literature review on product linking can be found in [24]. Here, we summarise the work in terms of metadata, feature representation, and algorithms, in a similar fashion as before.

Metadata Similar to product classification, product linking typically makes use of product names (e.g. [34, 41,42,43,44,45,46]) and descriptions (e.g. [24, 34, 47]). The difference, however, is that the task also makes use of a diverse range of structured product attributes (e.g. [34, 44, 45, 48]), often defined as ‘key-value’ pairs such as those that can be extracted from product specifications (e.g. product ID, model, brand, manufacturer). Intuitively, offers that have similar sets of key-value pairs are more likely to match. Since such structured key-value attributes are often unavailable, many studies focussed on how to extract them from the descriptions of an offer [41, 49], or from the specification table of the source web page [24]. A small number of studies [24, 50, 51] also used product images. Similar to product classification, we will explore the usefulness of product-related keywords generated by a machine translation model trained on the product semantic markup data. This has not been explored before.

Feature representation Again, similar to product classification, transforming textual metadata into feature representations is broadly based on BoW (e.g. [43, 44]), pre-trained word embeddings or language models (e.g. [23, 24, 34, 45, 46, 52]), or learning word embeddings on the spot from the downstream task datasets (e.g. [45]). However, depending on the types of metadata, different methods may be adopted and then combined [49]. For example, structured key-value attributes are often kept as-is and compared as a BoW, particularly if the values are short (e.g. product IDs). In [44], the concept of a ‘q-gram’ was introduced to represent short texts (especially key-value pairs) as a set of character n-grams. Longer texts such as descriptions are better represented using word embeddings or LMs; in this direction, similar sets of methods to those for product classification are used. For image data, typical pixel-based image representation approaches are widely used [24, 50, 51]. As with product classification, in terms of novelty our work focuses on evaluating word embeddings purposefully trained on product-related corpora, whereas many earlier studies used generic word embeddings. We also use more generic methods for composing feature representations for product linking, whereas previous models tried to learn ‘ad-hoc’ representations.

Algorithms Since the prediction of linking/matching of product offers depends on a notion of ‘similarity’, some methods will have an ‘intermediary’ step that converts product metadata features to similarity features [34, 43, 48]. This is typically done by applying similarity metrics—usually based on string form, or word/character distribution—to the textual feature representations of two offers. Again, depending on the metadata, different similarity metrics may be applied [34, 43, 44, 49]. This ‘intermediary’ process creates a feature vector consisting of similarity scores computed by different measures, or using different features. The vector is then subject to another process to determine if the two offers should match. However, as mentioned before, some methods [45, 48] do not require such an intermediary step, as the similarity computation is embedded as part of the method that tackles the task in an ‘end-to-end’ fashion.

In terms of the method for the end task, most studies are based on supervised binary classification, which aims to determine whether a pair of offers match. Following a similar pattern to product classification, the classification algorithms have evolved from traditional [34, 41, 47], to DNN-based [34, 45], to LM-based [23, 33, 46, 52]. Depending on the classification algorithm, the input could be the similarity feature vector of a pair (e.g. [41]) computed by the intermediary step, or directly the feature vectors derived from the metadata of each offer [23, 45, 52]. In the study by [23], which extends their earlier work in [52], the authors proposed a multi-task learning neural network based on the BERT model, tailored for the product linking task: in addition to learning to predict whether two product offers refer to the same entity (binary classification), the model at the same time learns to predict the product identifier shared by the two offers (multi-class classification). Additionally, one can also make use of similarity cutoff thresholds to determine match/non-match [42].

Clustering is used in a number of studies, such as [44] that clustered offers based on the ‘q-grams’ derived from their names and key-value pairs; and [53] where a ‘strength of ties’ style of clustering was applied to a ‘network’ of important words derived from a pair of product offers to determine if they form a cohesive ‘community’ and therefore, should match.

Methods that require offer pairs as input will often require a ‘blocking’ pre-process that aims to reduce the search space for pairs, to create a minimal set of pairs for classification. Blocking strategies are varied and often lightweight, such as [49] that is based on matching manufacturers and categories, and [42] that is based on string prefix.

A unique direction of research in product linking looks into automated expansion of training data, either in terms of training instances, or metadata that can be used for feature extraction. For example, [42] enriched the name of product offers with tokens retrieved using a web search engine. [33] used product offer names to fetch similar entities from Wikidata, to create additional training instances.

Compared to the state of the art, we focus on the sub-task of supervised, binary classification of match/non-match, while ignoring the ‘blocking’ process. Our method uses state-of-the-art algorithms, as our research focus is on evaluating the impact of the language resources created from the semantic markup data on existing algorithms.

Fake Product Review Detection

Fake reviews, as per [54], generally refer to reviews created in an attempt to mislead consumers (either in a positive or a negative way). They are also known as deceptive opinions, spam opinions, or spam reviews [55]. Fake online reviews in e-commerce significantly affect consumers, merchants, and market dynamics. In extreme cases, they have led to financial losses for companies and to legal cases [56]. While traditionally fake reviews are written by humans, with the advancement of Natural Language Generation technology it has been shown that fake reviews automatically generated by programs are even more difficult for human annotators to detect [57]. There is an extensive amount of work on automated fake review detection and, for that reason, we refer readers to the survey by [54], while below we present a brief overview of this field, highlighting the novelty of our work. Further, in addition to studies focussing on detecting the content, there is work (e.g. [58, 59]) that detects spammers (users) and spammer groups (networks), which we do not cover here.

Detecting fake reviews is predominantly treated as a supervised, binary text classification task. Thus, similar to product classification, it involves extracting features of the review text (metadata), representing them in a machine-processable format (feature representation), and training a model that is able to generalise patterns using the features and apply them to unseen data (algorithm). In terms of features (metadata), [57] broadly categorised them into ‘lexical’ and ‘non-lexical’. Lexical features are attributes derived from the text, such as words, n-grams, punctuation and latent topics. Non-lexical features are metadata related to the reviews (e.g. ratings, stars) or their authors (ID, location, number of reviews generated). In terms of feature representation and algorithms, the same patterns as in product classification are observed, as both tasks are handled by text classification approaches. Briefly, research has evolved from early methods that use manually engineered features in a 1-hot encoding (e.g. [60]), to pre-trained word embeddings (e.g. [61, 62]), to learning representations of the target dataset on the spot as part of the model (e.g. [63]). The use of machine learning algorithms has also evolved from the earlier classic algorithms such as SVM and logistic regression (e.g. [60]), to deep neural networks (e.g. [61, 62]), to very large LMs such as BERT (e.g. [57]).

Compared to previous studies, our work does not aim to introduce new features or algorithms. Instead, we explore the usefulness of the feature representations learned from massive product-related semantic markup data for the task of fake review detection. Earlier work such as [62, 64] used word embeddings pre-trained on general-purpose corpora, and [61] trained domain-specific word embeddings using an Amazon product review corpus. In contrast, our work is the first to explore whether a corpus of product details (instead of their reviews) can be used to learn word embeddings for this task. Compared to [57], who also used LMs, our work explores the effect of continued pre-training of LMs using an in-domain corpus, which [57] did not.

Reflection

Summarising the related work above, our study addresses two limitations of the state of the art. First, despite the abundance of semantic markup data on the Web, only a very small number of studies have explored the use of such data to create language resources for downstream language processing tasks. Among them, the typical approach is training embedding models using such data [11, 24, 52]. However, these methods and/or resources are often ad-hoc, and their effects have not been compared on the same tasks.

Second, despite the continued interest in research on product classification, linking, and fake review detection, the use of language resources to support such tasks has been highly inconsistent, ranging from no use at all to the use of a diverse set of word-embedding models (e.g. [26, 33, 37]). It is unclear, for example, whether the earlier success of building domain-specific LMs by continued pre-training of BERT models on in-domain corpora can be replicated in this domain. Adding to this complexity is the use of different datasets and a diverse range of machine learning models, from traditional algorithms (e.g. SVM, logistic regression), to deep neural networks, to pre-trained language models. The implication is that it is extremely difficult to compare the effect of using certain language resources on such tasks.

Motivated by these issues, our work in the following sections explores three different ways of creating language resources from semantic markup data, and systematically evaluates them under uniform settings on the three downstream tasks mentioned above.

Building Language Resources

In this section, we describe our method for creating and evaluating the language resources for product data mining. We begin by introducing the data sources we use to create the language resources (“Data sources”). We then discuss three different ways of using these data sources to create different types of language resources: training word-embedding models, continued pre-training of BERT-like LMs, and training machine translation models that are used as a proxy to generate product-related keywords. These language resources will later be used in the three downstream tasks, detailed in “Product classification”, “Product linking” and “Fake product review detection”.

Data Sources

In order to create language resources using semantic markup data for the product domain, we used the 2017 release of the structured data crawled by the WDC project. Specifically, we only downloaded and processed the class-specific subsets of the schema.org data related to sg:Product.Footnote 9 This contains nearly 5 billion RDF n-quads, extracted from over 267 million web pages and over 812 thousand hosts. Each n-quad contains a subject, predicate, object, and a graph label which in this case, denotes the source URL of the n-quad.

Next, we parse this dataset to identify product offer instances and build a SolrFootnote 10 index of product offers with their attributes found in the n-quads. This is done by first searching for ‘definition n-quads’ that define a product offer instance, i.e. n-quads with http://www.w3.org/1999/02/22-rdf-syntax-ns#type as the predicate and either sg:Product or sg:Offer as the object, and then parsing other n-quads with the same subject as a definition n-quad to create property-value pairs for each offer. Only data that are potentially English are retained. This is achieved by automatically checking whether the source URL (i.e. the graph label) contains a top-level domain that clearly indicates a non-English website (e.g. .fr, .cn). This Solr index is further processed to create two corpora: a product description corpus and a product category corpus.
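The exact pipeline is not reproduced here, but the following minimal sketch illustrates the grouping logic, assuming the class-specific WDC dump is available as gzipped n-quad files. The regular-expression parsing, the top-level-domain list, and the single-pass grouping (which assumes a definition n-quad precedes the other property n-quads of the same subject) are simplifications; in practice a proper n-quads parser would be used and the resulting records posted to Solr.

```python
import gzip
import re
from collections import defaultdict

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
# schema.org class URIs as they appear in the dump (http/https variants both occur in practice)
PRODUCT_TYPES = {"<http://schema.org/Product>", "<http://schema.org/Offer>"}
NON_EN_TLDS = (".fr", ".cn", ".de", ".es", ".it", ".ru", ".jp")  # illustrative subset

# Very simplified n-quad pattern: subject, predicate, object, graph label.
QUAD = re.compile(r"^(<[^>]+>)\s+(<[^>]+>)\s+(.+?)\s+(<[^>]+>)\s*\.\s*$")

def iter_quads(path):
    with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            m = QUAD.match(line)
            if m:
                yield m.groups()

def collect_offers(path):
    """Group property-value pairs by the subject of each sg:Product/sg:Offer definition n-quad."""
    offers = defaultdict(dict)
    for subj, pred, obj, graph in iter_quads(path):
        if any(tld + "/" in graph or graph.rstrip(">").endswith(tld) for tld in NON_EN_TLDS):
            continue  # keep only (potentially) English hosts
        if pred == RDF_TYPE and obj in PRODUCT_TYPES:
            offers[subj]["_source"] = graph        # remember the definition n-quad's source URL
        elif subj in offers:
            offers[subj][pred] = obj.strip('"')    # attach other properties of the same subject
    return offers
```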

The product description corpus contains descriptions of product offers. These are extracted from the sg:Product/description property of each product offer. A light cleaning process is applied to ensure that only descriptions containing between 50 and 250 words are selected. This restriction reduces content that is likely to be very noisy: for example, we noticed that product descriptions sometimes contain only a handful of generic words, while at other times they are too long and can include the entire web page content. These texts are also normalised to keep only alpha-numeric characters. If a token contains digits only, it is replaced with a symbolic token to indicate a digit-only token. The resulting product description corpus contains over 1.9 billion tokens, extracted from over 34 million product offers.
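As an illustration of this cleaning step, the sketch below shows one way to implement the length filter and normalisation; the placeholder symbol for digit-only tokens and the exact regular expression are assumptions, not the exact ones used.

```python
import re

DIGIT_TOKEN = "<num>"  # illustrative placeholder for digit-only tokens

def clean_description(text, min_words=50, max_words=250):
    """Normalise an sg:Product/description value; return None if it falls outside the length window."""
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)      # keep only alpha-numeric characters
    tokens = text.split()
    if not (min_words <= len(tokens) <= max_words):
        return None                                   # too short (generic) or too long (whole-page dumps)
    tokens = [DIGIT_TOKEN if t.isdigit() else t for t in tokens]
    return " ".join(tokens)
```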

The product category corpus contains over 700 thousand pairs of product names and site-specific categories. These are selected from offer instances that have both an n-quad defining their name and one defining their site-specific label. Product names are extracted from the sg:Product/name or sg:Offer/name properties, while site-specific labels are extracted from sg:Product/category or sg:Offer/category. Both product names and site-specific labels are subject to a light cleaning process in which only alpha-numeric characters are retained, and those containing more than 10 tokens (delimited by whitespace characters) or fewer than 2 tokens are removed. These restrictions serve the same purpose: to reduce noise in the data. In addition, digit-only tokens are replaced with the same universal symbol. Further, a stop word list is used to filter out generic site-specific labels, such as Home and Product, and only pairs extracted from the 100 largest hosts (measured by the number of product offer instances found on each host) are kept. This focuses on hosts that are likely to be large e-commerce vendors and therefore to have defined relatively good-quality site-specific categorisation schemata.
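A sketch of the corresponding filtering for the category corpus is shown below, assuming candidate (host, name, label) triples have already been extracted and normalised as above; the stop list shown is a small illustrative subset.

```python
from collections import Counter

GENERIC_LABELS = {"home", "product", "products"}  # illustrative stop list of generic site-specific labels

def filter_category_pairs(triples, top_k_hosts=100):
    """triples: iterable of (host, product_name, site_specific_label), already normalised."""
    triples = list(triples)
    host_counts = Counter(host for host, _, _ in triples)
    top_hosts = {h for h, _ in host_counts.most_common(top_k_hosts)}  # largest hosts by offer count
    kept = []
    for host, name, label in triples:
        if host not in top_hosts or label.lower() in GENERIC_LABELS:
            continue
        if not (2 <= len(name.split()) <= 10) or not (2 <= len(label.split()) <= 10):
            continue  # drop very short or very long names/labels
        kept.append((name, label))
    return kept
```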

We will explain how we use these corpora to build language resources below.

Training Word-Embedding Models

The first approach to utilising the above corpora is training word-embedding models. As discussed before, only a couple of studies [11, 24] used semantic markup data to train embedding models; however, [24] trained product embeddings that are ad-hoc, while our earlier work [11] developed word-embedding models that were not thoroughly evaluated. Here, following our previous work, we simply use the GensimFootnote 11 implementation of the Word2Vec algorithm [65] to train word-embedding models on the product description corpus. We use the skip-gram algorithm for training, as it was shown to better represent infrequent words [65]. This fits our data well, as a notable fraction of words in our product classification and linking datasets (see Appendix 1) are not among the most frequent words found in the product description corpus.

We use a sliding window of 10, a minimum frequency threshold of 5 and lower-cased text, keeping the remaining parameters at their defaults. The word embeddings have 300 dimensions. We refer to this model as the ‘product word embeddings’.
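A minimal training script along these lines, using Gensim 4 parameter names and assuming the cleaned corpus has been written as one lower-cased description per line, could look as follows (file names are illustrative):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One lower-cased, cleaned product description per line (hypothetical file name).
sentences = LineSentence("product_descriptions.txt")

model = Word2Vec(
    sentences,
    vector_size=300,   # 300-dimensional embeddings
    sg=1,              # skip-gram
    window=10,         # sliding window of 10
    min_count=5,       # minimum frequency threshold of 5
    workers=8,
)
model.wv.save("product_word_embeddings.kv")  # the 'product word embeddings'
```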

Continued Pre-training of BERT Language Models

The second approach explores the continued pre-training of large LMs. The principle of ‘continued pre-training’ of LMs has been introduced in recent research: the idea is to take an existing LM such as BERT and further train it on large, in-domain, unlabelled corpora (e.g. [66, 67]).

We explore the benefits of continued pre-training of the BERT model on our product description corpus, and refer to the resulting LM as ‘\(\mathrm{BERT}_{\mathrm{prod}}\)’. Specifically, we take the ‘bert-base-uncased’ modelFootnote 12 and run the masked language modelling task on our product description corpus, keeping all hyperparameters at their defaults.Footnote 13

However, pre-training LMs is an extremely resource-demanding process, and due to our limited access to HPC resources, we had to split our product description corpus into small segments and create different versions of the \(\mathrm{BERT}_{\mathrm{prod}}\) model. Specifically, we randomly sampled 8% of our corpus (approx. 570 MB, the maximum corpus size that our hardware can accommodate for the pre-training process) 7 times, ensuring no overlap of the selected product descriptions, thus creating 7 smaller corpora with which to continue pre-training the BERT model. This results in 7 \(\mathrm{BERT}_{\mathrm{prod}}\) models, and the total amount of data used for continued pre-training represents 50% of the original product description corpus.
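A sketch of this continued pre-training step with the Hugging Face transformers and datasets libraries is given below for one of the seven corpus samples; the file name, sequence length and training arguments are illustrative rather than the exact (default) settings used.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One of the seven ~570 MB random samples of the product description corpus (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "product_descriptions_sample1.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_prod_1",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,   # applies random masking for the MLM objective
)
trainer.train()
trainer.save_model("bert_prod_1")  # one of the 7 BERT_prod checkpoints
```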

Training Machine Translation Models

The third approach to utilising the product corpora is inspired by the work of [13]. The authors cast product classification as an MT task, whose goal is to learn the mapping from the sequence of words in a product name to the sequence of category labels that form a hierarchical path. In this sense, the product names and their category label paths are treated as two different languages.

However, an important difference between their work and ours is that the dataset they used for training the MT models is, arguably, of much better quality. This is because it was collected from a single vendor website; hence there is only one categorisation scheme, and the naming and categorisation of products are generally consistent. In contrast, our product category corpus contains data from hundreds of different hosts, potentially selling very different products and therefore using highly different and inconsistent categorisation schemata with different levels of hierarchy. Further, our goal in product classification is to assign category labels from a universal schema to products from different vendors. Therefore, the site-specific categories cannot be directly used as classification targets.

Therefore, instead of using this corpus to directly train a product classifier, we use it to train MT models that map a sequence of words in a product name to the sequence of words in the product’s site-specific category. Then, given a product name in the downstream task data, we apply the MT model to generate a sequence of words which, although unlikely to map to the final classification labels, may still be indicative of the product’s ‘type’ or ‘category’ and therefore become useful features for the downstream tasks. We refer to these words as ‘product-related keywords’ (denoted ‘pk’).

To train the MT model, we apply the off-the-shelf MT toolkit OpenNMT [68] to the product category corpus. The encoder and decoder are 2-layer LSTMs with 500 hidden units. We use the default settings for all other hyperparameters in the distributed implementation.
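The data preparation for this step amounts to writing the name/category pairs as a parallel corpus in the plain-text format consumed by OpenNMT-style toolkits; a minimal sketch is given below (file names are illustrative). The toolkit's standard vocabulary-building and training scripts are then run with their default hyperparameters, giving the 2-layer LSTM encoder/decoder with 500 hidden units described above.

```python
def write_parallel_corpus(pairs, src_path="src-train.txt", tgt_path="tgt-train.txt"):
    """Write (product_name, site_specific_category) pairs as aligned source/target files:
    line i of the source file is a tokenised product name, and line i of the target file
    is the corresponding site-specific category ('target language')."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for name, category in pairs:
            src.write(name.lower() + "\n")
            tgt.write(category.lower() + "\n")

# At inference time, the trained model is applied to downstream product names to
# 'translate' them into the product-related keywords (pk) used as extra features.
```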

Product Classification

In this section, we explore the use of the different language resources created in “Building language resources” for the task of product classification. We describe the datasets used for this study, then configure a number of models and compare them to evaluate the impact of these language resources on these datasets. We then present the results, which are further discussed in the “Discussion” section together with the results from the other tasks.

Datasets

Table 1 Summary of datasets for product classification

We use four datasets listed in Table 1. The Rakuten dataset is the one used in the Rakuten Data Challenge [12]. This contains one million product offers crawled from Rakuten.com, an online e-commerce marketplace. Each product offer only has one type of metadata, i.e. its name. The IceCat dataset is released under the WDC project,Footnote 14 and contains over 760k product offers crawled from IceCat.de, a worldwide publisher and syndicator of multilingual, standardised product data from various domains. Each offer has three types of metadata: name, description and brand. The WDC-25 dataset is also released by the WDC project,Footnote 15 and contains around 24k product offers randomly sampled from over 79k websites. These are classified into a flat categorisation scheme of 25 different labels, developed with reference to the Amazon, Google and UNSPSCFootnote 16 product catalogue taxonomies. This is split into a training set of over 20k offers and a test set of around 5000 offers. Each offer has a large number of metadata but only the following are selected for this work: name, description, brand, and manufacturer. The MWPD-PC dataset is the product classification dataset released in the MWPD2020 challenge [11]. It contains around 16k product offers randomly sampled from the structured product data (described by the schema.org vocabulary) crawled by the WDC project. These are classified into three levels of classification (lvl1 to lvl3) following the GS1 Global Product Classification standard (GPC).Footnote 17 Each offer has the following metadata: name, description, and site-specific label.

The product metadata have various word lengths across the datasets. However, neural network-based classification models require text input of a fixed length. The normal practice is that if an input text is shorter than this fixed length, it is padded with ‘arbitrary’ tokens; if it is longer, it is truncated. We configure the lengths according to Table 2, based on the longest input observed in the datasets and the corresponding metadata used. All training, validation, and test splits are based on the original data releases. Our selection of datasets represents a significant degree of diversity, containing data sourced from single vendors (Rakuten and IceCat) as well as from a heterogeneous range of websites (WDC-25, MWPD-PC). Table 1 shows the statistics of these datasets.

Table 2 Configuration of input word length for neural network based classification models

Model Configurations

Models are configured based on the variations of the input product metadata, feature representation methods, and the machine learning algorithms. Figure 1 lists these models that will be discussed in detail below.

Fig. 1 Configurations of different models for comparison for the product classification task. The shaded box represents the components of a model to be changed for comparison

Using Word Embeddings

Shown in Fig. 1a (Part (a) Experiments), the baseline and the corresponding comparative models differ in terms of the word-embedding representations used (shaded in grey). Given a product, each model takes as input all of its metadata available in a dataset and passes them through the different word embeddings to construct a feature representation for the product. An ML algorithm then learns to classify the products based on these features.

In terms of the word embeddings that are key for comparison, we compare our product word embeddings (prod) against the generic Word2Vec embedding model pre-trained on Google News (ggl).Footnote 18 In terms of ML algorithms, we test a simple linear SVM (SVM), the fastText baseline used in the MWPD2020 shared task for product classification [11] (FT.MWPD), and the ‘GN-DeepCN’ structure proposed in [37] with either a biLSTM (GND.biLSTM) or HAN (GND.HAN) as its sub-structure. For SVM and fastText, the different product metadata are concatenated into a single text and treated uniformly. For GND.biLSTM and GND.HAN, each specific type of product metadata is fed into a sub-structure (biLSTM or HAN) that learns a separate feature representation for it. The implementation and specifications of these algorithms are as follows:

  • SVM: implemented in Scikit-Learn 0.19, with the parameters set as follows: regularisation parameter (C) of 0.01, one-vs-rest multi-class training, balanced class weights, L2 penalisation and squared hinge loss.

  • fastText: default implementation as in [11]

  • GND.biLSTM and GND.HAN: default implementation by [37], using 20 epochs and a batch size of 128. All other hyperparameters remain unchanged.

Following this, a model using the product word embeddings is compared against itself when using the generic word embeddings, e.g. \(\mathrm{SVM}_{\mathrm{prod}}\) against \(\mathrm{SVM}_{\mathrm{ggl}}\), or \(\mathrm{GND.HAN}_{\mathrm{prod}}\) against \(\mathrm{GND.HAN}_{\mathrm{ggl}}\).
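For illustration, the sketch below shows one way the SVM configuration above can be combined with either embedding model; averaging the token vectors is an assumption made for the purpose of the example, not necessarily the exact feature composition used, and the embedding file name is the one from the earlier sketch.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.svm import LinearSVC

# Load either the product word embeddings (prod) or a generic model (ggl); file name is illustrative.
wv = KeyedVectors.load("product_word_embeddings.kv")

def embed_text(text, dim=300):
    """Average the embeddings of in-vocabulary tokens; zero vector if none are covered."""
    vecs = [wv[t] for t in text.lower().split() if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_svm(texts, labels):
    # texts: concatenated product metadata per offer, as described above
    X = np.vstack([embed_text(t) for t in texts])
    clf = LinearSVC(C=0.01, loss="squared_hinge", penalty="l2",
                    class_weight="balanced", multi_class="ovr")
    return clf.fit(X, labels)
```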

Using Language Models

Shown in Fig. 1b (Part (b) Experiments), the baseline and the corresponding comparative models differ in terms of the underlying language model (LM) used. Given a product, each model takes as input all of its metadata available in a dataset and passes it into the LM (shaded in grey), which produces its feature representation and learns classification patterns in an end-to-end fashion.

In terms of the LM, we compare a generic BERT model (\(\mathrm {BERT}_\mathrm{{default}}\)) against the models created following our method in “Continued pre-training of BERT language models” (\(\mathrm {BERT}_\mathrm{{prod}}\)). As mentioned before, we had to create 7 different LMs; here, \(\mathrm{BERT}_{\mathrm{prod}}\) refers to the average performance recorded over all of them. For both \(\mathrm{BERT}_{\mathrm{default}}\) and \(\mathrm{BERT}_{\mathrm{prod}}\), classification is achieved by stacking a linear layer on top of the corresponding language model; the output of the first token from the final hidden state of the model is used for the final classification. As with SVM and FT.MWPD, the product metadata are concatenated into a single piece of text. The implementation and specifications are as follows, with a minimal sketch given after the list:

  • Implemented based on PyTorch 1.7.0,Footnote 19 with a batch size of 32, a learning rate of 2e\(-\)5, 10 epochs, and the Adam algorithm with weight decay for optimisation. All other hyperparameters remain unchanged from the distribution.

  • For \(\mathrm{BERT}_{\mathrm{default}}\), the ‘bert-base-uncased’ model from the generic distribution is used.
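The sketch below illustrates this fine-tuning setup with the Hugging Face transformers library; it is illustrative rather than the exact training script. In particular, the number of classes would be set per dataset (25 is used here as an example, matching WDC-25), and the full training loop (batching over epochs, evaluation) is omitted.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_CLASSES = 25  # e.g. for the WDC-25 dataset; set per dataset/level

# "bert-base-uncased" gives BERT_default; pointing to a BERT_prod checkpoint gives BERT_prod.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=NUM_CLASSES)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # Adam with weight decay

def training_step(texts, labels, max_length):
    """texts: product metadata concatenated into one string per offer; labels: class indices."""
    model.train()
    batch = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=max_length, return_tensors="pt")
    # A linear classification layer sits on top of the first-token representation.
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```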

Using Machine Translation Models

Shown in Fig. 1c (Part (c) Experiments), the baseline and the corresponding comparative models differ in terms of the product metadata used (shaded in grey). As discussed before, we apply the MT model trained in “Training machine translation models” to the product names from each dataset to generate product-related keywords (pk), and use these keywords as an additional type of metadata for each product.

For each model described above in “Using word embeddings” and “Using language models”, we first restrict the product metadata to product names only, then create two variants to compare against each other: one using the product name only (n), the other using the name plus the product-related keywords (n,pk; in this case, the fixed input text length is set to double that of the name, i.e. 64). In addition, each model only uses the generic language resources (i.e. the generic word embeddings, or the generic BERT LM). This excludes the effects of all other factors, thus allowing the results to focus on the use of the product-related keywords.

As examples, \(\mathrm{SVM}_{\mathrm{n}}\) is compared against \(\mathrm{SVM}_{\mathrm{n,pk}}\), both using the generic Google News word embeddings; while \(\mathrm{BERT}_{\mathrm{n}}\) is compared against \(\mathrm{BERT}_{\mathrm{n,pk}}\), both using the bert-base-uncased generic LM.

Evaluation Metrics

In terms of evaluation metrics, we use the standard Precision (P), Recall (R) and F1 scores for classification tasks. These are calculated using Eqs. (1)–(3), where TP denotes True Positives, FP False Positives, and FN False Negatives. As some of our datasets contain highly unbalanced classes (e.g. MWPD-PC), we report macro-averages across all classes (the arithmetic mean of the individual classes’ P, R and F1 scores) in order to analyse a classifier’s performance on small classes, as well as weighted macro-averages (similar to macro-averages, but weighing the score of each class label by its number of true instances when calculating the average), which were used in [11] for ranking all participating systems.

$$\begin{aligned} \mathrm{Precision}&= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \end{aligned}$$
(1)
$$\begin{aligned} \mathrm{Recall}&= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{aligned}$$
(2)
$$\begin{aligned} \mathrm{F1}&= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \end{aligned}$$
(3)
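Equivalent macro and weighted macro averages can be computed directly with scikit-learn, for example:

```python
from sklearn.metrics import precision_recall_fscore_support

def report(y_true, y_pred):
    """Return macro- and weighted macro-averaged P, R, F1 over all classes."""
    macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted", zero_division=0)
    return {"macro_P": macro[0], "macro_R": macro[1], "macro_F1": macro[2],
            "weighted_P": weighted[0], "weighted_R": weighted[1], "weighted_F1": weighted[2]}
```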

Result Summary

In terms of the effects of using word-embedding models, Tables 3 and 4 show that our skip-gram word-embedding model trained on the product description corpus brings consistent improvements on all datasets, with all classifiers. This improvement is observed for Precision, Recall, and F1 (macro- and weighted macro-average), with only a handful of exceptions where the results were very close to the baseline. For example, on the Rakuten dataset, \(\mathrm{GND.HAN}_{\mathrm{prod}}\) obtains a macro-F1 of 37.7, which is lower than but still comparable to the corresponding baseline \(\mathrm{GND.HAN}_{\mathrm{ggl}}\)’s 37.8. The improvement can be substantial in many cases, such as a 9.0-point gain in macro-F1 by \(\mathrm{GND.biLSTM}_{\mathrm{prod}}\) over \(\mathrm{GND.biLSTM}_{\mathrm{ggl}}\) on the MWPD-PC (lvl1) dataset (row 5, Table 4), and a 6.9-point gain in macro-F1 by \(\mathrm{FT.MWPD}_{\mathrm{prod}}\) over \(\mathrm{FT.MWPD}_{\mathrm{ggl}}\) on the MWPD-PC (lvl2) dataset (row 11, Table 3). The improvement on IceCat is the smallest, but consistent; the baselines on this dataset already achieve very high F1.

Table 3 Product classification results comparing the use of word-embedding models (MWPD-PC dataset)
Table 4 Product classification results comparing the use of word-embedding models (other datasets)

In terms of the effects of continued pre-training of the LM, Table 5 shows less promising results. We are unable to obtain consistent improvements on all datasets, but only on the MWPD-PC and IceCat datasets, where the improvement is very small. One may argue that a potential reason for the better results on the MWPD-PC dataset is the possible similarity between the corpus used to create this gold standard and the corpus used for continued pre-training of the BERT LM, as both are based on the n-quad corpora released by the WDC project. However, we expect such impact to be minimal. On the one hand, we ensured that different releases were used (the Nov 2017 release for the product description corpus, and a mixture of the Nov 2018 and pre-2014 releases for the MWPD-PC gold standardFootnote 20). On the other hand, the releases were based on random crawls of the Web. Interestingly, the BERT-based classifiers achieved better results than SVM, the GND-based structures, or FT.MWPD on all datasets except MWPD-PC lvl3, which is harder due to its more fine-grained classes.

Table 5 Product classification results comparing the use of continued pre-training of the BERT language model

In terms of the effects of MT-based product keywords, Tables 6 and 7 show that they do not bring consistent benefits, regardless of dataset or classifier. Although there are cases where such keywords improve the results, in the majority of cases they caused classifier accuracy to decrease. The SVM classifier is the only one that benefited from such keywords in most cases across the datasets. Nevertheless, we cannot conclude that such keywords are useful for the product classification task.

Table 6 Product classification results comparing the use of MT-generated product keywords (MWPD-PC dataset)
Table 7 Product classification results comparing the use of MT-generated product keywords (other datasets)

Product Linking

In this section, we explore the use of the different language resources created in “Building language resources” for the task of product linking. Following a similar structure to “Product classification”, we present the datasets used, the configuration of models, and their evaluation results.

Datasets

As shown in Table 8, we use a total of 9 datasets from two main sources: the WDC project and the DeepMatcher project. All datasets consist of pairs of product offers and a binary label indicating whether the offers match. The WDC project released several product linking datasets by parsing and annotating samples of the CommonCrawl corpus; these are used in later studies such as [22]. We use the ‘small’ dataset as reported in [22] for a number of reasons. First, the ‘small’, ‘medium’, ‘large’ and ‘extra large’ datasets all share the same test set; the only difference is the size of the training set, which contains a different number of instances created in a distantly supervised manner. Second, our choice is also limited by our computational resources. Each offer has the following metadata: name, description, price, brand, specification table as text, specification key-value pairs, and site-specific label.

Table 8 Summary of datasets for product linking

The DeepMatcher project released 13 datasets for evaluating entity linking, although not all of them are related to the product domain. These are further split into three groups: ‘structured’, where product metadata are defined as key-value pairs with atomic values, i.e. short, pure values that are not a composition of multiple values that should appear separately; ‘textual’, where product metadata are long textual blobs (e.g. a long title, a short description); and ‘dirty’, where product metadata are structured, but the values for some attributes could be misplaced or empty. We selected 8 datasets that are arguably product-related: 5 structured, 1 textual and 2 dirty datasets. Each dataset is split into train, test and validation sets with a ratio of 3:1:1.

Similar to the product classification datasets, the product metadata have various word lengths, and for neural network-based classification models we need to set a fixed length for them when they are used as input texts. These are configured according to Table 9. All training, validation, and test splits are based on the original data releases. As in the classification task, our selection of datasets is very diverse, as shown in Table 8.

Table 9 Configuration of input word length for neural network-based models. ‘all’ refers to concatenating all product metadata detailed in Table 8 into a single text input

Model Configurations

Again, since our focus is evaluating the effect of different language resources on this task, we use ‘out of the box’ state-of-the-art solutions to configure different models using different language resources for comparison. Specifically, we use DeepMatcher [69] and the Natural Language Inference (NLI) model based on BERT [70].

DeepMatcher (DM) is a software packageFootnote 21 implementing state-of-the-art entity linking algorithms using DNNs. It splits the matching process into three modules: the attribute embedding module, which transforms the input textual data of an entity mention into word embedding-based representations; the similarity representation module, which learns a representation that captures the similarity of two entity mentions using their embedding representations; and the classifier module, which takes the similarity representations as input to determine whether the two entity mentions match. The similarity representation module has two key components: attribute summarisation, which implements different DNN structures for interpreting the embedding representations of the input entities; and attribute comparison, which implements different measures for comparing the ‘summary vectors’ generated by the summarisation component. In this work, we configure DeepMatcher as follows:

  • Similarity representation module: we use a ‘hybrid’ attribute summariser and the ‘element-wise absolute difference’ attribute comparator, as these were found to be the optimal settings for a wide range of scenarios

  • Classifier module: we use the multi-layer NN, which is the only option available

  • Attribute embedding module: this is a factor for comparison and is detailed in the section below.

Other hyperparameters of DM remain unchanged from the default software distribution.
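Under these settings, configuring and training DM follows DeepMatcher's published API; the sketch below assumes pre-split CSV files with 'left_'/'right_'-prefixed attribute columns and a 'label' column, and the path, file names and save paths are illustrative.

```python
import deepmatcher as dm

# Load the pre-split offer-pair CSV files (column and file names are illustrative).
train, validation, test = dm.data.process(
    path="data/wdc_small",
    train="train.csv", validation="valid.csv", test="test.csv",
    embeddings="fasttext.en.bin",     # the attribute embedding module: the factor under comparison
)

model = dm.MatchingModel(
    attr_summarizer="hybrid",         # 'hybrid' attribute summariser
    attr_comparator="abs-diff",       # element-wise absolute difference comparator
)
model.run_train(train, validation, best_save_path="dm_hybrid.pth")
f1 = model.run_eval(test)             # F1 on the positive (match) class
```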

For the BERT-based NLI model (BERT), we simply use a state-of-the-art implementation from Keras.Footnote 22 The model consists of two channels, each taking one sentence as input to learn a representation vector. These vectors are then concatenated and passed to a simple linear structure for classification, which determines whether the two sentences entail each other. We treat each product entity as a ‘sentence’ and construct a textual representation that fits the model. All specifications and configurations remain unchanged from the implementation above.

Next, Fig. 2 lists the variants of DM and BERT, using different language resources and/or product metadata, which are discussed in detail below.

Fig. 2 Configurations of different models for comparison for the product linking task. The shaded box represents the components of a model to be changed for comparison

Using Word Embeddings

Shown in Fig. 2a (Part (a) Experiments), DM using a built-in generic word-embedding model (\(\mathrm{DM}_{\mathrm{default}}\) baseline) is compared with DM using the product word-embedding model (\(\mathrm{DM}_{\mathrm{prod}}\)). Given a product, all of its metadata are concatenated into a single text input.

Using Language Models

Shown in Fig. 2b (Part (b) Experiments), the BERT NLI model either uses the default, generic LM ‘bert-base-uncased’ (\(\mathrm{BERT}_{\mathrm{default}}\) baseline), or the product LMs (\(\mathrm{BERT}_{\mathrm{prod}}\)). Same as product classification, \(\mathrm{BERT}_{\mathrm{prod}}\) refers to the average performance recorded for all the seven product LMs. Product metadata are also concatenated as a single text input.

Using Machine Translation Models

As shown in Fig. 2c (Part (c) Experiments), the baselines and their corresponding comparative models differ in the product metadata used. Following the same procedure as in the product classification experiments, the MT model is applied to the product names from each dataset to generate product-related keywords (pk), which are used as an additional type of metadata for each product. DM and BERT (each using its generic word embeddings and LM, respectively) then use either only product names as input (n), or product names plus product keywords (n, pk; in this case the fixed input text length is set to 64). A sketch of this input construction is given below.
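The following is a minimal sketch of how the two input variants could be constructed, assuming the MT-generated keywords are already available as a list per product; the truncation to 64 word tokens follows the setting above, while the function and example values are hypothetical.

```python
def build_input(name, keywords=None, max_tokens=64):
    """Build the text input for DM/BERT: product name only (n),
    or product name plus MT-generated keywords (n, pk)."""
    tokens = name.split()
    if keywords:                      # the (n, pk) variant
        tokens += list(keywords)
    return ' '.join(tokens[:max_tokens])

# Hypothetical example, with keywords as produced by the MT model for this name
print(build_input('acme trail running shoe model x'))
print(build_input('acme trail running shoe model x', ['footwear', 'running', 'sport']))
```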

Evaluation Metrics

Since all datasets treat the task as binary classification, we use the same evaluation metrics as for product classification. The only difference is that, following the literature, the metrics are computed for the positive class only (i.e. true matches). A brief sketch of this computation is shown below.
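As an illustration (using scikit-learn rather than any evaluation script from this study), precision, recall and F1 restricted to the positive ‘match’ class can be computed as follows; the label values are assumptions.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and predictions: 1 = match, 0 = non-match.
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

# average='binary' with pos_label=1 scores the positive (match) class only.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary', pos_label=1)
print(f'P={p:.3f} R={r:.3f} F1={f1:.3f}')
```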

Result Summary

In terms of the word-embedding model (Table 10), our skip-gram-based word embeddings further improved F1 on six out of nine datasets. However, on the other three datasets they caused a significant decline in F1. Referring to Table 3, we would argue that these three datasets may be either too small (BeerAdvo-RateBeer (S)) or less relevant to the conventional ‘product’ domain (Fodors-Zagats (S), restaurants; Amazon-Google (S), software).

In terms of the continued pre-training of the LM and the MT-based product keywords, based on the results in Tables 11 and 12, we are unable to conclude that either is useful for this task. Improvements can be noticed on some datasets, but they are very inconsistent.

Table 10 Product linking results comparing the use of word embedding models
Table 11 Product linking results comparing the use of language models
Table 12 Product linking results comparing the use of machine translation models for generating product keywords

Fake Product Review Detection

While the previous two tasks concern data that are typically properties of products, fake product review detection concerns data that is only indirectly connected to products. For this task, we experiment only with the word-embedding models and in-domain LMs, not the MT model. This is because typical review datasets do not contain the product names we require as input to the MT model, and using the review text as input would not make sense: the MT model is trained to learn mappings between short sequences of words, and the vocabularies used during its training are very different.

Datasets

We use the dataset from [57], which was created using a Natural Language Generation model and contains over 40,000 reviews of products from 10 broad categories (e.g. Books, Electronics), each labelled as either fake or genuine. For neural network-based classifiers that require a fixed input text length, this is set to 512. Although the reviews are automatically generated by algorithms, the authors showed that they proved difficult for human annotators to differentiate. We do not expand our experiments to other fake review datasets because, as we show later, we observed the same patterns as in the other two tasks and do not expect additional datasets to add further value to our findings.

Model Configurations and Evaluation Metrics

Since fake review detection is a binary text classification task, our model configurations follow those of the product classification task (“Model configurations”). The dataset has only one source of text input, namely the review text; therefore, the models vary only in the underlying language resources used. Figure 3 lists these models, which are briefly covered below.

Fig. 3 Configurations of the different models compared for the fake product review detection task. The shaded box marks the component of a model that is varied for comparison

In terms of using word-embedding models, as shown in Fig. 3a (Part (a) Experiments), the baselines and their corresponding comparative models differ in the word embedding representations used (shaded in grey). As in product classification, we compare our product word embeddings (prod) against the generic, pre-trained Word2Vec embedding model (ggl). We combine all the algorithms listed in “Using word embeddings” with each of the two word-embedding models, retaining the same configuration and specifications.

In terms of using in-domain LMs, as shown in Fig. 3b (Part (b) Experiments), the baselines and their corresponding comparative models differ in the underlying LMs used. Again we use the same model variants as in “Using language models”, but on this dataset, retaining the same configuration and specifications.

In terms of evaluation, the same metrics for product classification explained in “Evaluation Metrics” are used here.

Result Summary

Overall (see Table 13), we observe the same patterns as in the product classification and linking tasks. On the one hand, product word embeddings trained on the product description corpus led to consistent improvements in F1 over the generic word-embedding model, with the highest gain noted for the SVM classifier and the lowest for the fastText classifier. Compared to the product classification task, it is worth highlighting that the text content of the two datasets can be notably different, which suggests that the ‘knowledge’ captured by the product word embeddings is potentially transferable to more general product data mining tasks. On the other hand, continued pre-training of BERT still led to detrimental effects.

Table 13 Fake product review detection results comparing the use of word-embedding models and language models

Further Analysis

In this section, we conduct further analysis of our datasets in order to better understand the potential contributing factors to the overall negative results.

Data Provenance

One potential cause of lower-quality training data is an unbalanced data distribution. To understand whether this could have been an issue in our study, we analysed the dominating hosts that contributed to the product description and product category corpora, in order to discover whether certain dominating hosts sell only a restricted range of products. To do this, we manually inspected the 100 largest hosts, measured by the number of product offer instances found for each host, and classified them by the types of products they sell. A sketch of how such host statistics can be gathered is shown below.
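As an illustration of this provenance analysis (not the exact script used in the study), the page URL of each n-quad can be reduced to its host and the hosts ranked by the number of quads they contribute; the input format is assumed to be one n-quad per line with the page URL as the graph (final) element.

```python
from collections import Counter
from urllib.parse import urlparse

def top_hosts(nquad_file, k=100):
    """Approximate ranking of hosts by the number of quads they contribute."""
    counts = Counter()
    with open(nquad_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            tokens = line.rstrip().rstrip('.').split()
            if not tokens:
                continue
            page_url = tokens[-1].strip('<>')   # the graph label of the quad, i.e. the page URL
            counts[urlparse(page_url).netloc] += 1
    return counts.most_common(k)

# Hypothetical usage on a product offer extract
for host, n in top_hosts('product_offers.nq', k=100):
    print(host, n)
```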

Table 14 The number of the 100 largest hosts (ranked by the number of product offer instances found in the product description and category corpora) by type of products sold online

As Table 14 shows, a significant portion of the dominating hosts sell fashion-related products, typically clothing, footwear and accessories. We therefore expect our product description and category corpora to contain a significant portion of data related to these domains. Notice that products from other domains such as software, beer and restaurants are under-represented, which may help explain why our word embeddings were not useful on the three product linking datasets mentioned above. However, this analysis does not explain why the other language resources, i.e. the language model and the MT-based product keywords, are not as useful as the word embeddings.

Vocabulary Coverage

Here, we study the extent to which the vocabulary of the corpus we used to build the language resources represents that of the target tasks. Taking the product classification and linking datasets as examples, we calculate a number of statistics on the training set of each dataset and show them in Table 15. Avg tok is the average number of tokens (separated by the white space character) per instance (concatenating all available metadata) within a dataset. Avg % non-digit tok is the ratio between the average number of tokens excluding digit-only tokens and the average number of tokens per instance. Avg % non-digit toks in PDC is the ratio between the average number of non-digit tokens found in the vocabulary of the product description corpus (PDC) and the average number of tokens per instance. In other words, Avg % non-digit toks in PDC indicates how much of the training data is covered by the vocabulary of the product description corpus. A sketch of how these statistics can be computed is given below.
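The following is a minimal sketch of the three statistics under the definitions above, assuming the PDC vocabulary is available as a set of tokens and using whitespace tokenisation as described; the example inputs are hypothetical.

```python
def vocab_stats(instances, pdc_vocab):
    """Compute Avg tok, Avg % non-digit tok and Avg % non-digit toks in PDC
    for a list of text instances (all metadata already concatenated)."""
    n = len(instances)
    tok_counts, nondigit_counts, covered_counts = [], [], []
    for text in instances:
        tokens = text.split()                               # whitespace tokenisation
        nondigit = [t for t in tokens if not t.isdigit()]   # drop digit-only tokens
        covered = [t for t in nondigit if t in pdc_vocab]   # covered by the PDC vocabulary
        tok_counts.append(len(tokens))
        nondigit_counts.append(len(nondigit))
        covered_counts.append(len(covered))
    avg_tok = sum(tok_counts) / n
    return {
        'Avg tok': avg_tok,
        'Avg % non-digit tok': (sum(nondigit_counts) / n) / avg_tok,
        'Avg % non-digit toks in PDC': (sum(covered_counts) / n) / avg_tok,
    }

# Hypothetical usage
print(vocab_stats(['acme shoe size 42', '12345 usb cable 2 m'],
                  {'acme', 'shoe', 'usb', 'cable'}))
```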

Table 15 Vocabulary analysis of each dataset

Comparing the product classification datasets against the linking datasets, there is an obvious pattern: the product linking datasets contain a much larger percentage of non-alphabetic tokens (Avg % non-digit tok), and a much smaller percentage of their alphabetic tokens is covered by the product description corpus (Avg % non-digit toks in PDC). We argue that this difference could explain why our word embeddings and fine-tuned BERT language model are less effective on the product linking task.

Among the product linking datasets, recall from Table 10, which reports the results of using word embeddings, that BeerAdvo-RateBeer (S), Fodors-Zagats (S) and Amazon-Google (S) are the three datasets where our word-embedding model caused the baseline performance to decline. Inspecting these datasets in Table 15, we find that they share three patterns: the text content is short (Avg tok); a large percentage of tokens are non-alphabetic (e.g. 58% for Fodors-Zagats (S), 29% for BeerAdvo-RateBeer (S)); and a relatively small percentage of the vocabulary is covered by the product description corpus (e.g. Avg % non-digit toks in PDC for Fodors-Zagats (S) is only 43%). The extreme case is the Fodors-Zagats (S) dataset: each instance contains an average of 19 tokens, of which only 8 are alphabetic words, and of those only 3 or 4 are covered by the product description corpus. As a result, the learning algorithms could have been very sensitive to the very few words that are covered by the vocabulary. Adding to this the potential under-representation of these domains discussed in the previous section, the combination of these factors could explain the decline in performance when these word embeddings are used.

Product Keywords Analysis

Here, we focus on understanding the failure of the MT-generated product keywords. Our idea of MT-based product keywords is inspired by the work of [13], who cast product classification as an MT task that aims to learn the mapping between product names and their category classes, and successfully tested this approach on the Rakuten dataset. We therefore compare the Rakuten dataset against the product category corpus (PCC) to understand whether there are differences between the two datasets used to train the MT models. For each instance in a dataset, we count the number of tokens in its product name and the number of tokens in its classification label. In the Rakuten dataset, product classifications are pseudonymised as sequences of ID numbers, such as 1608 \(\rangle\) 2320 \(\rangle\) 2173 \(\rangle\) 2878; Li et al. [13] treated these as sequences of tokens, and we therefore count the number of tokens separated by ‘\(\rangle\)’. For the PCC, the equivalent product classifications are the site-specific product categories/labels, which we used to train the MT model. We compare the distributions of these word counts in Fig. 4. Further, we also count the number of unique tokens found in the product names and in the classification/site-specific categories of each dataset, and calculate the ‘name/category unique word ratio’. This number is 116 for Rakuten and 4 for the PCC; a sketch of this calculation is given below.
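The following is a minimal sketch of the ‘name/category unique word ratio’ under the description above; the category separator and the example records are assumptions for illustration.

```python
def unique_word_ratio(records, category_sep=' > '):
    """records: iterable of (product_name, category_label) pairs.
    Returns |unique name tokens| / |unique category tokens|."""
    name_vocab, cat_vocab = set(), set()
    for name, category in records:
        name_vocab.update(name.split())              # tokens in the product name
        cat_vocab.update(category.split(category_sep))  # tokens in the classification label
    return len(name_vocab) / len(cat_vocab)

# Hypothetical example records
records = [
    ('acme trail running shoe model x', 'shoes > running'),
    ('acme road running shoe model y', 'shoes > running'),
]
print(unique_word_ratio(records))
```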

Fig. 4 Comparison of the distributions of text length in the product names from the Rakuten dataset and from the product category corpus (PCC), and comparison of the distributions of word frequency in the product classification labels from the Rakuten dataset and in the site-specific categories of the PCC

Figure 4 shows that while the numbers of words in the product classifications of the two datasets are generally comparable, the Rakuten dataset contains generally longer product names than the PCC. The name/category unique word ratio of Rakuten is also orders of magnitude higher than that of the PCC. In other words, the PCC is a much ‘sparser’ MT training set than the Rakuten dataset. Intuitively, it would have been easier to generalise on the Rakuten dataset, as a significantly larger number of tokens in the product names map to the same token in the product classification; by contrast, there are significantly fewer examples from which to learn this mapping in the PCC.

We conclude that the failure of the MT-based product keywords stems from the inconsistent quality of the generated keywords, which in turn could be due to the sparsity of the product category corpus used to train the MT model.

Discussion

With the growing amount and availability of semantic markup data on the Web, research has started to look at how such a gigantic data resource can be used to support various data mining tasks [4, 5, 11]. However, there has been significant variation in the data sources used, the tasks addressed, the processes applied, and the findings reported. In this section, we discuss our work from three perspectives: (1) our first research question; (2) our second research question; and (3) how to interpret the generally negative results of pre-training language models.

In terms of 1) our first research question (also objective 1), we processed the markup data from the 2017 WDC release and extracted all n-quads related to products. Then, following the work of [6, 11, 13], we transformed the n-quads into three different types of language resources: word embeddings, a refined BERT LM, and a machine translation model for generating product keywords from product names. Our choice of methods represents the most popular options in product NLP, and to the best of our knowledge, our work is the first to utilise and compare these different methods in a single study.

In terms of 2) our second research question (also objective 2), we conducted a wide range of experiments covering three tasks and 10 datasets of different sizes, which were used to evaluate different SoTA models built with the created language resources. In terms of scope, to the best of our knowledge there are no previous studies of a comparable scale. On the whole, we found that only word embeddings led to consistent improvement across all tasks. In this direction, we noted before that the earlier work by [22] used the fastText algorithm to train word embeddings on a product-related corpus collected from semantic markup data (so-called ‘self-trained embeddings’). They showed that this domain-specific embedding model marginally increased product linking performance on some product categories, but overall did not offer significant value. In comparison, our Word2Vec skip-gram word-embedding model gained a notable improvement over the generic embedding model on the WDC-small dataset (Table 10), which was also used by [22]. To explain this difference, we suspect that the size of the training corpora and the different pre-processing in the two studies could be the reason. The corpus used in [22] focussed on product linking and therefore filtered the underlying data based on whether a product offer contains useful product identifiers. This process could have eliminated a significant proportion of data that may have been useful, since less than 10% of product offers contain such information. Our product description corpus, on the other hand, is much larger. This seems to suggest that a larger, more diverse set of product markup data is more beneficial for training word embeddings.

In terms of 3) the negative results from pre-training language models, we notice that, although this is generally inconsistent with the wider literature on training in-domain LMs [6,7,8,9,10], similar observations have been reported previously [6, 7]. We believe there can be many reasons for this, but we speculate that the primary ones are the size and quality of the data used for in-domain pre-training. Compared to the E-BERT model [9], which is the most similar to ours, we note significant differences in the pre-training process. While we used the ‘out of the box’ BERT configuration with little change, E-BERT modified the pre-training process in many ways using high-quality external resources and complex processing algorithms. All these modifications allowed E-BERT to learn product-related knowledge more effectively. Similarly, SMedBERT [10] changed the pre-training process by incorporating structured information such as knowledge graphs.

At the same time, it is worth noting studies that also used a simple ‘out of the box’ BERT pre-training process (the same as ours) with an in-domain corpus and obtained better results on downstream tasks, such as BioBERT [6], SciBERT [8] and Clinical BERT [7]. It is possible that the main differentiating factor is the size and quality of the underlying corpora used for pre-training. All these earlier studies used resources of arguably higher quality and larger quantity. For example, SMedBERT, E-BERT and Clinical BERT used well-curated vocabularies, knowledge graphs or corpora, while SciBERT and BioBERT used scientific publications that are well written and follow a generally consistent structure. Their unstructured in-domain corpora are typically of a comparable size to the original corpus used for training BERT, or much larger (BioBERT). In contrast, our corpora are much noisier, as they are collected from heterogeneous websites with no standardisation of how content is written. Our corpora are also much smaller, because we had to split the dataset into smaller chunks to meet our computational restrictions.

One question that remains unanswered is why, given that both our word-embedding model and our BERT LM are trained on the same corpus, the word embeddings are more useful than the BERT model. On the one hand, the corpus used to pre-train BERT is much smaller, as we were unable to use the entire product description corpus as we did for the word-embedding model. On the other hand, our word embeddings are trained with the Word2Vec skip-gram algorithm, which learns word embeddings by predicting the context of a given word. The BERT LM pre-training followed the Masked Language Modelling (MLM) task, which instead predicts a word given its context. This rationale is similar to Word2Vec’s Continuous Bag-of-Words (CBOW) algorithm, which was shown to be less effective at modelling infrequent words [65]. As Appendix 1 shows, some of the tasks we evaluated may contain many words that are under-represented in the product description corpus, and this might have affected the continued in-domain pre-training of BERT. A sketch of the skip-gram training set-up is given below.
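For reference, the following is a minimal sketch of skip-gram training with gensim on a product description corpus, assuming one whitespace-tokenisable description per line; the file name and hyperparameter values are illustrative, not the exact settings used in this study.

```python
from gensim.models import Word2Vec

# Stream the corpus: one product description per line (assumed format).
class Descriptions:
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.lower().split()

# sg=1 selects the skip-gram objective (predict context words from the target word),
# as opposed to CBOW (sg=0), which predicts the target word from its context.
model = Word2Vec(sentences=Descriptions('product_descriptions.txt'),
                 vector_size=300, window=5, min_count=5, sg=1, workers=4)
model.wv.save('product_skipgram.kv')   # keep only the word vectors for downstream use
```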

Beyond this theoretical analysis, we also conducted further analysis of the data quality (objective 3), which to some extent confirmed several points made in the discussion above. We noticed that the semantic markup data are highly unbalanced across domains, with fashion-related data accounting for a significant percentage. This could have led to noticeable vocabulary ‘gaps’, which could be detrimental to datasets containing short texts that fall into such ‘gaps’, or to algorithms that are sensitive to infrequent words. Although previous work such as [4, 5] reported similar findings, they did not evaluate the impact of such problems on tasks that use the data. Our work therefore serves as first evidence of how such data quality issues impact NLP tasks.

Conclusion

In this work, we conducted an exploratory study of using structured linked data embedded within HTML webpages for the creation of language resources for downstream NLP tasks. Despite the generally negative results, we can draw important lessons that may inform future research and practice (objective 4).

From a theoretical point of view, our results serve as a lesson to researchers looking to develop novel methods that exploit the growing semantic markup data on the Web. While the data has reached a ‘critical mass’, its sheer size does not seem to outweigh certain ‘quality’ aspects, and this may impact downstream tasks that exploit such data. To address this, a useful task would be developing an approach to assess the quality of a semantic markup dataset, or to identify a ‘quality subset’. However, there are many challenges; for example, the notion of ‘quality’ may be task- and data-dependent.

Leading on from this, we argue there is also an implication for the increasingly popular research on developing large domain-specific LMs. While many successful studies have been reported in different domains, our study shows that language modelling using very large unstructured corpora may not be as straightforward as the literature indicates. Crucially, there is a lack of understanding of the ‘conditions for success’: how much data is sufficient for training a domain-specific LM, how ‘balanced’ does the data need to be, and in what ways? We believe these are important questions for further investigation, given the increasing popularity and importance of domain-specific LMs in NLP.

From a practical point of view, our study shows that pre-training BERT appears to be more susceptible to dataset noise and size, while training word embeddings appears to be more robust. This can inform practitioners when choosing between these popular approaches: although BERT-based methods dominate the mainstream in research, earlier methods such as training word embeddings may perform just as well, or even better, in certain scenarios.

We also call for further effort from data publishers adopting the semantic markup practice. Although there has been remarkable progress in the quantity of semantic markup data, it may now be time to place more emphasis on quality, so that a wider community can benefit from such data.

Our work, however, is limited in a number of ways. First, partly due to the limited space of this article, we did not compare all available options of each method for creating language resources. For example, instead of Word2Vec, there are other alternatives for training word embeddings [71]; the same can be said for pre-training LMs. Second, due to limited computing resources, we were unable to pre-train our BERT LM on the full product description corpus, which could have affected our results to some extent.

Reflecting on the above points, we highlight a few future research directions. First, as discussed above, there can be significant value in researching the quality aspects of semantic markup data, and ultimately in developing metrics and processes to identify and select subsets of the data optimised for different tasks. This can be broadly considered an issue of linked data quality, but many research questions arise: how to define such quality metrics, how generic or task-specific they can be, how to use them to guide dataset sub-selection, and what impact this will have on the language resources created from the selected data and on the downstream tasks using such resources.

Second, it may be worth exploring the use of these structured data in less domain-specific tasks. For example, structured data embedded within specific HTML elements can be viewed as annotations on that web page, and the corresponding HTML formatting properties may serve as useful, more generalisable features for automatically tagging content on other web pages formatted with similar properties. The idea is that how different types of content are formatted ‘relative to each other’ on a product listing page can be consistent across many different domains and websites, which has been validated in different contexts such as [72]. The abundance of already-annotated product listing pages, in the form of semantic markup data, creates an opportunity to train such taggers in a self-supervised way.

Our future work will explore some of the above questions and research directions.