1 Semantic Annotations

Millions of websites have started to annotate data about products, people, organizations, places, local businesses, and events in their HTML pages using markup formats such as Microdata, JSON-LD, RDFa, and Microformats [12]. These annotations are used by all major search engines to display rich snippets in search results. The annotations are also one source of content for the knowledge graphs that the search engines use to rank search results and to display knowledge cards next to search resultsFootnote 1. As Google, Bing, and Yandex recommend using terms from the schema.org vocabularyFootnote 2, this vocabulary is used by most websites. Figure 1 shows an example of how an offer and a review for a tent are annotated within an HTML page of an e‑shop using the Microdata syntaxFootnote 3 and the schema.org vocabulary. The left side of Fig. 1 shows part of the HTML page as it is rendered by the browser. On the right, we see the corresponding source code. The itemtype attributes of the DIV and SPAN elements define the types of entities that are described, e.g. product and review. The itemprop attributes specify the properties that are used to describe the entities, e.g. name, productID, image, ratingValue, and reviewBody.

Fig. 1 Example of Microdata and schema.org annotations within an HTML page.

The Web Data Commons (WDC) projectFootnote 4 monitors the adoption of schema.org annotations on the Web by analysing the Common CrawlFootnote 5, a series of public web corpora each containing several billion HTML pages [12]. The November 2018 version of the Common Crawl contains 2.5 billion pages originating from 32.8 million pay-level domains (PLDs)Footnote 6. Out of these PLDs, 9.6 million (29.3%) use semantic annotations. Table 1 gives an overview of the most frequently offered types of data (schema.org classes). The table distinguishes between the two most widely used annotation syntaxes: the Microdata syntax for annotating data in the BODY of HTML pages and the JSON-LD syntax, which is used to embed data into the HEAD section of HTML pages. As the table shows, in total around 850 thousand websites provide product data using the schema.org vocabulary. The product properties that are most widely used are name, description, brand, and image. Interestingly, and crucially for using semantic annotations from different websites to train matching methods, 30.5% of the websites annotate product identifiers, such as MPNs, GTINs, or SKUs, which allow offers for the same products to be clustered.

Table 1 Number of websites (PLDs) offering specific types of data.

2 Cleansing Schema.org Product Data

Semantic annotations are placed into the templates that are used to render HTML pages by thousands of webmasters. As these webmasters have different levels of knowledge and different understandings of the schema.org vocabulary, schema.org terms are not used consistently and according to the specification on all sites. Thus, semantic annotations need to be cleansed before they can be used for training. In the following, we describe the pipeline of cleansing operations that we apply for creating our training dataset. We use the Web Data Commons product corpus version November 2017Footnote 7 as the starting point for the creation of the training set. The corpus contains 809 million schema:Product and schema:Offer entities originating from 581,482 websites. First, we select the subset of the offers that provide some kind of product identifier. Afterwards, we group the offers into ID-clusters based on the identifiers and cleanse abnormalities in the clustering. In the following, we provide details about both steps.

Selection of offers with identifiers. For the creation of the training corpus, we want to gather all products and offers which include a globally unique identifier and can thus be clustered using this identifier. Examining the annotations, we notice that many websites annotate globally scoped identifiers, such as GTIN or MPN, using the vendor-scoped term sku (stock keeping unit) or the generic terms identifier and productID. We thus consider all offers that contain any identifier-related term (e.g. gtin8, gtin12, gtin13, gtin14, mpn, sku, identifier, and productID) and try to filter out vendor-specific identifiers later in the cleansing process. Similar to the observations of [13] and [8], we notice that 6% of the websites annotating product offers have syntax errors in the URIs identifying schema.org terms or use deprecated or even undefined terms. As we do not want to miss these offers, we include in our training set all entities which have at least one property with an identity-revealing suffixFootnote 8. Using this selection strategy, we find that 116 million of the 809 million offers (14%) in the Web Data Commons product corpus contain some sort of identifier.
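A minimal sketch of this suffix-based selection, assuming offers are available as dictionaries mapping (possibly malformed) property URIs to values; the suffix list mirrors the terms named above, everything else is illustrative rather than the exact implementation used to build the corpus.

```python
# Illustrative sketch: keep an offer if any of its property URIs ends in an
# identifier-revealing suffix, so that typos or deprecated namespaces in the
# URI prefix do not cause offers to be lost.

ID_SUFFIXES = ("gtin8", "gtin12", "gtin13", "gtin14",
               "mpn", "sku", "identifier", "productid")

def has_identifier_property(offer: dict) -> bool:
    """offer: dictionary mapping (possibly malformed) property URIs to values."""
    return any(prop.strip().lower().rstrip("/").endswith(ID_SUFFIXES)
               for prop in offer)

# Usage (assuming `offers` is an iterable of such dictionaries):
# offers_with_ids = [o for o in offers if has_identifier_property(o)]
```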

Detection and removal of listing pages and advertisements. We want to include the comprehensive description of a product from its detail page into the training set, and not the summary of this information that is often found on listing pages and in advertisements on other detail pages. However, identifiers in listing items and advertisements are annotated as well, which makes it necessary to detect those entities and remove them from the corpus. For the detection of listing pages and advertisements, we use a heuristic which relies on the following features: the number of schema.org/Offer and schema.org/Product entities per web page, the variation of the length of the product descriptions, the number of identifier values, and the semantic connection to parent entities using the terms schema:relatedTo and schema:similarTo. Our heuristic for identifying listing pages and advertisements achieves an F1 score of 94.8% on a manually annotated test set. This cleansing step removes 49% of the offer entities, leaving 58 million non-listing and non-advertisement offers in the training set.
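The sketch below illustrates the kind of page-level features this heuristic relies on; the feature names, the identifier suffixes, and the simple decision rule are placeholders, not the tuned heuristic behind the reported F1 score.

```python
from statistics import pstdev

def page_features(page_entities):
    """page_entities: schema.org/Offer and schema.org/Product entities
    extracted from one web page, each as a dict of property -> value."""
    descriptions = [str(e.get("description", "")) for e in page_entities]
    id_values = {v for e in page_entities for p, v in e.items()
                 if p.lower().endswith(("sku", "mpn", "gtin13", "productid"))}
    return {
        "num_entities": len(page_entities),
        "description_length_stdev":
            pstdev(len(d) for d in descriptions) if len(descriptions) > 1 else 0.0,
        "num_identifier_values": len(id_values),
        "has_related_links": any(p.endswith(("relatedTo", "similarTo"))
                                 for e in page_entities for p in e),
    }

def looks_like_listing_page(page_entities, max_entities=10):
    # Placeholder rule: many entities with many distinct identifier values on
    # one page is a strong signal for a listing page or an advertisement block.
    f = page_features(page_entities)
    return (f["num_entities"] > max_entities
            and f["num_identifier_values"] > max_entities)
```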

Filtering by identifier length. In the next step, the identifier values are normalized by removing non-alphanumeric characters and common prefixes such as initial zero digits and identifier-related strings like ean, mpn, sku, and isbn. Considering the length of global identifiers such as GTIN or ISBN numbers in comparison to vendor-specific identifiers, which are often relatively short, we filter out all offers having identifiers that are shorter than 8 characters. Additionally, offers whose identifier values consist entirely of alphabetical characters are removed. Finally, we observe that a considerable number of websites use the same identifier value to annotate all their offers, likely due to an error in the script generating the pages. We detect these websites and remove their offers from the training set. After applying these filtering steps, 26 million offer entities remain in the training set.
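A sketch of these normalization and filtering rules, assuming raw identifier strings as input; the length threshold of 8 characters and the prefix list are taken from the description above, the rest is illustrative.

```python
import re
from typing import Optional

ID_PREFIXES = ("ean", "mpn", "sku", "isbn")

def normalize_identifier(raw: str) -> Optional[str]:
    """Return a cleaned identifier value, or None if it should be filtered out."""
    value = re.sub(r"[^0-9a-z]", "", raw.lower())   # keep alphanumerics only
    for prefix in ID_PREFIXES:                      # strip identifier-related prefixes
        if value.startswith(prefix):
            value = value[len(prefix):]
    value = value.lstrip("0")                       # strip initial zero digits
    if len(value) < 8:                              # too short: likely vendor-specific
        return None
    if value.isalpha():                             # purely alphabetical values are dropped
        return None
    return value

def uses_constant_identifier(site_offers, min_offers=10):
    """Flag websites that annotate (nearly) all offers with one identifier value."""
    ids = {o.get("normalized_id") for o in site_offers}
    return len(site_offers) >= min_offers and len(ids - {None}) <= 1
```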

Cluster creation. We group the remaining 26 million offers into 18 million clusters using their identifier values. Single offers may contain multiple alternative identifiers referring to the same product, e.g. GTIN8 and GTIN12, or GTIN12 and MPN. We use this information to merge clusters referring to the same product, which reduces the number of clusters to 16 million. 13 million of these clusters contain only a single offer. We also notice that some websites include identifiers referring to product categories, such as UNSPSC numbersFootnote 9, in addition to identifiers referring to single products in the annotations. For detecting such cases, we examine the structure of the identifier co-occurrence graph within each cluster. We discover that vertices having a degree larger than 10 and a clustering coefficient of \(C_{i}<0.2\) tend to represent product categories rather than single products, and we split the clusters accordingly. This leads to the creation of 199,139 additional clusters.
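One way to realize this transitive grouping is a union-find structure over identifier values: each offer links all of its alternative identifiers, which transitively merges the corresponding clusters. The sketch below covers only the merging step; the splitting of category nodes could, for instance, be implemented with networkx by computing vertex degrees and clustering coefficients (nx.clustering) on the identifier co-occurrence graph.

```python
class UnionFind:
    """Minimal union-find over identifier values (illustrative sketch)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def build_clusters(offers):
    """offers: list of offers, each carrying a list of normalized identifiers."""
    uf = UnionFind()
    for offer in offers:
        ids = offer["normalized_identifiers"]
        for other in ids[1:]:
            uf.union(ids[0], other)       # alternative ids refer to the same product
    clusters = {}
    for offer in offers:
        root = uf.find(offer["normalized_identifiers"][0])
        clusters.setdefault(root, []).append(offer)
    return clusters
```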

Offer categorization. The schema.org vocabulary contains terms for annotating the product category of an offer. However, less than 2% of the offer pages in the WDC 2017 corpus annotate category information. Different shops use different categorization schemata for presenting their products to the customers. We do not attempt to solve the resulting large-scale taxonomy integration problem, but re-classify the offers into 26 product categories that we selected from the upper parts of the Amazon product taxonomy. We use a publicly available Amazon product and reviews datasetFootnote 10 and apply transfer learning [17] in order to assign product category labels to the clusters of our corpus. In cases for which the confidence of assigning a category label is low, we assign the label other category.
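A hedged sketch of this re-classification step: a text classifier is trained on Amazon product titles labelled with the 26 target categories and applied to the offer clusters, falling back to the label other category when the prediction confidence is low. This simplifies the transfer learning setup of [17]; the confidence threshold and the feature choice are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_category_classifier(amazon_titles, amazon_categories):
    """Train on the labelled Amazon product data (source domain)."""
    clf = make_pipeline(TfidfVectorizer(lowercase=True),
                        LogisticRegression(max_iter=1000))
    clf.fit(amazon_titles, amazon_categories)
    return clf

def categorize_cluster(clf, offer_names, threshold=0.5):
    """Classify a cluster via the concatenated names of its offers (target domain)."""
    probabilities = clf.predict_proba([" ".join(offer_names)])[0]
    if probabilities.max() < threshold:      # low confidence -> fallback label
        return "other category"
    return clf.classes_[probabilities.argmax()]
```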

3 Profile of the WDC Training Dataset for Large-Scale Product Matching

Applying the cleansing procedure described above to the Web Data Commons product corpus (November 2017) results in a training dataset consisting of 26 million offers originating from 79 thousand websites. The dataset has a compressed size of 6.4 GB. We call the dataset the WDC Training Dataset for Large-Scale Product Matching (WDC-LSPM). Using the identifiers, the offers are grouped into 16 million clusters referring to the same products. 13 million of these clusters have a size of one, 1.9 million have a size of two, and 1.1 million have a size larger than two. We also create an English-language subset which includes only offers from the top-level domains com, net, co.uk, and org. The English-language subset has a size of 3.9 GB (compressed) and consists of 16 million offers which are grouped into 10 million clusters. Out of these clusters, 8.4 million have a size of one, 1 million have a size of two, and 625.7 thousand have a size larger than two. Considering only clusters of English offers having a size larger than five, and excluding clusters bigger than 80 offers which may introduce noise, 20.7 million positive training examples (pairs of matching product offers) and a maximum of 2.6 trillion negative training examples can be derived from the dataset.
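The number of positive training examples follows directly from the cluster size distribution: a cluster with n offers contributes n(n-1)/2 matching pairs. A small sketch of this computation, with the size bounds stated above (larger than five, at most 80 offers); `cluster_sizes` is an assumed input.

```python
def count_positive_pairs(cluster_sizes, min_size=6, max_size=80):
    """cluster_sizes: iterable of cluster sizes, e.g. from the English subset."""
    return sum(n * (n - 1) // 2          # each cluster of size n yields n*(n-1)/2 pairs
               for n in cluster_sizes
               if min_size <= n <= max_size)
```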

Table 2 Amount of offers in the training set and gold standard having specific properties.

We extract all descriptive properties of offers that are annotated with schema.org terms. Table 2 shows the distribution of descriptive schema.org properties in the Full and English training sets as well as the distribution of identifier-related schema.org properties in both sets. We see that the density of the descriptive properties other than name and description is rather low (\(<50\%\)). This is in line with earlier findings [12] that only a rather small subset of the schema.org vocabulary is actually widely used on the Web. We also see that over 75% of the identifiers that were used for clustering were annotated using the terms sku and productID, which justifies our decision not to ignore these, in theory vendor-specific, properties but to also consider their values in the cleansing process.

In addition to the annotated properties, we also extract product specifications in the form of key/value pairs from HTML tables that are included in the detail pages. For this, we use the method described by Petrovski et al. [16] and Qiu et al. [19]. The method detects specification tables for 24% of the offers contained in the Full set and 17% of the offers in the English set.

Table 3 shows the distribution of offers per product category as well as the cluster size distribution in the English training set. Considering clusters having a size larger than two, we can derive more than 1.2 million positive pairs for the clusters in the categories Office Product and Clothing, and more than 300 thousand pairs each for the categories Shoes, Camera and Photo, Cell Phone and Accessories, Computers and Accessories, and Jewelry. These numbers of pairs of offers referring to the same products are likely large enough for training even very data-hungry entity matching methods and, even within each category alone, are much larger than the product matching datasets that have been publicly available so far (see Sect. 7).

Table 3 Distribution of product categories in the English training set.

4 Quality of the Clustering

In order to get an impression of the quality of the ID-clustering, we randomly sample 900 pairs of offers belonging to the same clusters and manually verify whether the offers really refer to the same product by inspecting the name and description values of the offers. We discover that 93.4% of the pairs are correct, meaning that both offers refer to the same product. We find that 2.1% of the pairs in the sample (19 out of 900) are wrong due to web pages providing wrong identifier values. We consider this value to be low enough to use the identifiers for generating training pairs. We verify this assumption using the experiments described in Sect. 6. We further find that 1.0% of the pairs in the sample (9 out of 900) are wrong due to errors introduced by our transitive grouping strategy, which combines two clusters if a single offer is found that is annotated with the identifier values of both clusters (e.g. a GTIN8 and an MPN number). In future work, we plan to investigate stricter merging criteria which might result in a better compromise between cluster size and amount of errors. For 3.4% of the sample (31 offer pairs), the authors of this article were together not able to decide whether the two offers refer to the same product, as the names and descriptions were too short (e.g. just “Samsung Galaxy”) or too general (e.g. “computer software”). In 20 out of these 31 cases, name and description together contained less than four tokens. If desired, these pairs can be deleted from the training set using a length filter.
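Such a length filter could look as follows; the four-token threshold corresponds to the observation above, and whitespace tokenization is a simplifying assumption.

```python
def is_too_short(offer, min_tokens=4):
    """Flag offers whose name and description together carry too little text."""
    text = f"{offer.get('name', '')} {offer.get('description', '')}"
    return len(text.split()) < min_tokens

def filter_pairs(pairs, min_tokens=4):
    """Drop pairs in which either offer is too short to be judged reliably."""
    return [(a, b) for a, b in pairs
            if not (is_too_short(a, min_tokens) or is_too_short(b, min_tokens))]
```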

5 WDC Gold Standard for Large-Scale Product Matching

Having some noise in the training set is acceptable, but should be avoided for the test set. We thus create a clean evaluation gold standard by manually verifying for 2,000 pairs of offers whether they refer to the same product or not. The 2,000 pairs of offers differ from the 900 pairs that we verified in order to assess the quality of the clustering. The level of difficulty of a matching task as well as the suitability of a matching method for the task both depend on the structuredness of the data to be matched. Thus, we select for the gold standard two product categories containing less structured offers (watches and sneaker shoes) as well as two categories containing more structured offers (computers & accessories and camera & photo). First, we identify the clusters belonging to the selected product categories. We then sample positive pairs from within these clusters as well as textually similar negative pairs across clusters and manually check the correctness of the label. The resulting gold standard consists of 150 positive and 350 negative pairs for each category. The offers contained in the gold standard originate from the following numbers of clusters from each category: 338 for computers & accessories, 231 for camera & photo, 269 for watches and 186 for sneakers. The two right-most columns in Table 2 contain the density of the schema.org properties of the offers in the gold standard. The training set and the gold standard are provided for public download on the Web Data Commons websiteFootnote 11 which also provides additional statistics about both.

6 Entity Resolution Experiments

As a result of the success of embeddings and deep neural networks for tasks such as image recognition and natural language processing, the question of whether these techniques also increase the performance of entity matching has recently moved into the research focus [5, 15, 20, 23]. Current results by Mudgal et al. [15] suggest that deep learning techniques perform comparably to traditional symbolic matching techniques on strongly structured data but outperform traditional techniques by a margin of 5% to 10% in F1 on less structured data such as product descriptions in e‑commerce. The problem with these results is that they are not verifiable, as they have been produced using training data “from a major retailer” [15] which is not available to the public.

This section presents a set of matching experiments conducted using the English Training Set and the WDC gold standard. The experiments are intended, on the one hand, to verify the utility of our training set. On the other hand, we use our training set and gold standard to replicate the results of Mudgal et al. [15]. First, we perform an unsupervised bag-of-words (BOW) experiment using TF-IDF and cosine similarity. Afterwards, we train various supervised models such as Logistic Regression, Naive Bayes, LinearSVC, Decision Trees, and Random Forests using (i) binary word co-occurrence vectors and (ii) string similarity scores, automatically generated by the Magellan framework [9], as features. As neural network-based matchers, we combine all network types implemented in the deepmatcher framework (e.g. RNNs, Attention, and Hybrid, all with default parameters) with pre-trained and self-trained fastText embeddings.
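As an illustration of the unsupervised baseline, the following sketch represents both offers of a pair as TF-IDF vectors over their textual attributes and predicts a match when their cosine similarity exceeds a threshold; the threshold value and the choice of attributes are assumptions, not the exact configuration behind the results in Table 4.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bow_matcher(pairs, threshold=0.6):
    """pairs: list of (text_a, text_b) tuples, each text built from
    e.g. title, description, and brand of one offer."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit([text for pair in pairs for text in pair])
    predictions = []
    for text_a, text_b in pairs:
        vectors = vectorizer.transform([text_a, text_b])
        similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
        predictions.append(similarity >= threshold)   # match if similar enough
    return predictions
```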

We experiment with different subsets of the offer features title, description, brand, and specification table content. All identifier-related properties (lower part of Table 2) are removed from the offers. Due to resource limitations, we do not use the complete English training set for the supervised experiments but subsets of potentially interesting training examples (e.g. positive pairs from many different clusters and negative pairs from different clusters where both offers have a similar description). For the category computers, we use 20 thousand positive and 21 thousand negative training examples, for cameras 11 thousand positive and negative examples, for watches 6,289 positives and 9,161 negatives, and for sneakers 3,709 positives and 6,060 negatives.

The results of all experiments are summarized in Table 4. For each category, we report the best performing method/feature combination. As expected, the supervised methods significantly outperform the unsupervised BOW approach. More interestingly, the deep learning approaches using pre-trained fastText embeddings are 8–10% better in F1 than the supervised methods using symbolic features. This confirms the result of Mudgal et al. that deep learning based matching methods excel on tasks involving less structured entity descriptions. More information about the exact configuration of all methods, as well as the results of the less well performing method/feature combinations, can be found on the project’s web pageFootnote 12.

Table 4 Results of the product matching experiments.

7 Comparison to Existing Entity Resolution Benchmark Datasets

Entity resolution is a long-standing research area in which various benchmark datasets are used to compare matching methods. Table 5 gives an overview of entity resolution benchmark datasets along the dimensions of public availability, number of sources from which the data originates, and number of positive pairs (i.e. pairs of records referring to the same real-world entity).

The two classic datasets in the area of product matching are Abt-Buy and Amazon-Google, introduced by Köpcke, Thor, and Rahm [10]. Gokhale et al. introduce another public product dataset, Walmart-Amazon [7]. In our previous work [18], we publish a gold standard for product data extraction and matching covering 32 different e‑shops. Several datasets for evaluating duplicate detection methods are provided for public download by Naumann et al.Footnote 13. The datasets describe movies, CDs, restaurants, scientific papers, and countries. Further benchmark datasets have been introduced for the Instance Matching Track of the Ontology Alignment Evaluation Initiative (OAEI)Footnote 14. Daskalaki et al. give an overview of these datasets [3]. A large citation dataset, Citeseer-DBLP, offering 550 thousand matches is provided in the Magellan Data Repository [2]. Finally, a large song dataset containing 1.2 million matching pairs has been used to evaluate Falcon [24]. Mudgal et al. [15] use several large product datasets with up to 111 thousand positive pairs for evaluating their deep learning methods. Unfortunately, these datasets are not public.

The table shows that, concerning the number of positive pairs, our training datasets (WDC-LSPM and WDC-LSPM English) are four orders of magnitude larger than the other public evaluation datasets in the area of product matching. Compared to the Falcon-Songs dataset, WDC-LSPM English is 17 times larger. Concerning the number of sources, WDC-LSPM English covers 43,293 sources, while the existing datasets cover at most 32 sources. The other datasets do not explicitly distinguish between training and test set but leave the split to the user. We distinguish between training set and gold standard and give different quality guarantees for both.

Table 5 Overview of entity resolution benchmark datasets.

8 Using Semantic Annotations as Training Data for Other Tasks

The previous sections have demonstrated the utility of semantic annotations for creating training data for product matching. Aside from product matching, semantic annotations can also be used to create large training sets for other tasks, such as information extraction or sentiment analysis. In this section, we discuss the potential of using semantic annotations within these two areas.

Information Extraction. Semantic annotations about types (e.g. product, event, hotel, local business, cooking recipe) and properties (e.g. name, address, opening hours, ingredient), together with the structure of the HTML code around the annotations, can be used to train information extraction methods to recognize the same type of information in web pages that do not contain such annotations. For instance, the annotation of the product price 69,99 Euro within an HTML page provides the learning algorithm with an example of the structure and unit of measurement of price values as well as an example of the HTML structures that are used around price values on this page.
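The following sketch illustrates how (property, value, HTML context) triples could be harvested from Microdata annotations with BeautifulSoup and used as distant supervision for an extractor that is then applied to un-annotated pages; variable names and the context representation are illustrative.

```python
from bs4 import BeautifulSoup

def annotation_examples(html: str):
    """Yield (itemprop, text value, enclosing tag path) triples from one page."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(itemprop=True):
        path = "/".join(reversed([parent.name for parent in element.parents
                                  if parent.name and parent.name != "[document]"]))
        yield (element["itemprop"],
               element.get_text(strip=True),
               f"{path}/{element.name}")

# For a price annotation this yields something like
# ("price", "69,99", "html/body/div/span"), exposing both the value format and
# the surrounding HTML structure to a learning algorithm.
```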

A successful example of an information extraction system that employs schema.org annotations as training data is the work of Foley et al. [6]. The purpose of their system is to discover data about local events, such as small-venue concerts, theatre performances, garage sales, and movie screenings, on web pages. To train their system, they use event data from web pages which is annotated using the schema.org event properties name, date, time, and location. They evaluate their method on 700 million web pages from the ClueWeb12 corpus. Using 217,000 explicitly annotated events as supervision, they are able to double recall at a precision level of 85%. Unfortunately, they publish neither their code nor the event dataset that they have extracted from the ClueWeb12 corpus.

A series of public information extraction evaluation datasets built using schema.org annotations was compiled by Meusel and Paulheim for the information extraction challenges conducted at the Linked Data for Information Extraction (LD4IE) workshops in 2014 and 2015. The dataset of the LD4IE Challenge 2014Footnote 15 consists of web pages containing hCardFootnote 16 annotations describing contact information of persons and organizations. The goal of the challenge is to extract such contact information from pages without annotations. The dataset of the LD4IE Challenge 2015 [14]Footnote 17 consists of HTML pages that contain schema.org annotations describing music recordings, persons, cooking recipes, restaurants, and sports events. This dataset was extracted from the December 2014 version of the Common Crawl. Altogether, the pages originate from 7,300 different websites. Again, the goal of the challenge is to extract such information from pages without annotations.

Sentiment Analysis. The goal of sentiment analysis is to determine the polarity of a given text towards an entity or towards different aspects characterizing the entity [11]. State-of-the-art sentiment detection methods [4, 22, 25] are usually supervised. What is needed to train them are pairs consisting of a polarity score (e.g. positive, neutral, negative, or scaled 1 to 5) and a text expressing this polarity towards the entity. In addition, it is also useful to know the type of the described entity, e.g. its product category or type of local business, in order to learn specific models for different entity types.

In total, around 130 thousand websites that are covered by the Web Data Commons 2018 Microdata corpus use the schema.org vocabulary to annotate reviews (see lower part of Table 1). Figure 1 shows an example of how a review about the tent is annotated in the HTML code of the web page. The schema.org term ratingValue is used to annotate the polarity score that is assigned to the tent. The term bestRating determines the rating scale and the term reviewBody annotates the free-text review. The first itemtype annotation determines the type of the reviewed entity, e.g. product. The Web Data Commons 2018 Microdata corpus contains 13.5 million schema:Review entitiesFootnote 18 that annotate review values and review bodies and can thus be used to train sentiment analysis methods. Table 6 shows the distribution of these reviews depending on the type of entity that is reviewed. We see that the corpus contains 6.3 million ratingValue/reviewBody pairs about 1.8 million different products, as well as 1.7 million ratingValue/reviewBody pairs judging 455 thousand local businesses.
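A sketch of how such annotated reviews could be turned into training pairs for a sentiment classifier, assuming the review entities have already been extracted into dictionaries; normalizing the score to a common 1-5 scale via bestRating is a simplifying assumption.

```python
from typing import Optional, Tuple

def review_to_training_pair(review: dict) -> Optional[Tuple[str, float]]:
    """Map an extracted schema:Review entity to a (text, polarity) pair."""
    body = review.get("reviewBody")
    rating = review.get("ratingValue")
    if not body or rating is None:
        return None
    best = float(review.get("bestRating", 5.0))     # rating scale, defaults to 5
    polarity = 5.0 * float(rating) / best           # normalize to a 1-5 scale
    return (body, polarity)

# Usage (assuming `reviews` is a list of extracted review entities):
# training_pairs = [p for p in map(review_to_training_pair, reviews) if p]
```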

Table 6 Distribution of schema:Review entities over different domains in the WDC Microdata 2018 corpus.

There exists a large body of research on sentiment analysis [4, 11, 22, 25]. However, to the best of our knowledge, none of the approaches exploits semantically annotated reviews from the Web as supervision. Commonly used sources of training data for sentiment analysis are tweets, which are for instance used for SemEval-2017 Task 4 [21]. The SemEval-2017 training sets consist of 20,000 to 50,000 text/polarity pairs, depending on the specific subtask. A large collection of recommender systems datasetsFootnote 19 has been collected by Julian McAuley. The datasets contain for instance reviews about products (e.g. 82.83 million reviews crawled from Amazon between 1996 and 2014), local businesses (e.g. 11.45 million reviews from Google Maps), and books (1.5 million reviews from GoodReads, 2017). Compared to these datasets, using semantically annotated reviews from the Web as training data has the advantage that the reviews cover many languages [4, 12], cover more entity types (e.g. also hotels, events, services), originate from a larger number of sources, and are more up-to-date.

9 Conclusion

This article has demonstrated the potential of using semantic annotations from the Web as training data for supervised matching methods. In addition, we have also explored the potential of using semantic annotations as training data for information extraction and sentiment analysis. The experiments in Sect. 6 clearly showed the usefulness of the training data for the task of product matching despite the dataset containing some noise (see the error analysis in Sect. 4).

While the generated training dataset is already large, it has been built using only the tip of the iceberg, as the Common Crawl only covers 3.1 billion HTML pages while commercial crawls are believed to cover at least one order of magnitude more pages. Thus, if specific experiments require more data, it is clearly possible to crawl deeper into the websites that we have identified as annotating specific types of data and retrieve large quantities of additional data.