The WDC Gold Standards for Product Feature Extraction and Product Matching

  • Petar Petrovski
  • Anna Primpeli
  • Robert Meusel
  • Christian Bizer
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 278)


Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity that is found on the Web, and (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions. To overcome these shortcomings, we have created two public gold standards. The WDC Product Feature Extraction Gold Standard consists of over 500 product web pages originating from 32 different websites, on which we have annotated all product attributes (338 distinct attributes) that appear in product titles, product descriptions, as well as tables and lists. The WDC Product Matching Gold Standard consists of over 75,000 correspondences between 150 products (mobile phones, TVs, and headphones) in a central catalog and offers for these products on the 32 websites. To verify that the gold standards are challenging enough, we ran several baseline feature extraction and matching methods, resulting in F-score values in the range 0.39 to 0.67. In addition to the gold standards, we also provide a corpus of 13 million product pages from the same websites, which might be useful as background knowledge for training feature extraction and matching methods.
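The pipeline outlined in the abstract (extract attributes from an offer, compare them with a matching function, score the result with an F-score) can be illustrated with a minimal sketch. Everything below is a hypothetical toy for illustration and not one of the baselines evaluated in the paper: the regular expressions, the attribute names `storage` and `screen_size`, the Jaccard-based matching function, and the 0.5 threshold are all assumptions.

```python
import re


def extract_features(title):
    """Toy extractor: pull attribute-value pairs out of an offer title.

    The patterns and attribute names are illustrative assumptions,
    not the annotation schema of the gold standard.
    """
    patterns = {
        "storage": r'(\d+)\s*gb',
        "screen_size": r'(\d+(?:\.\d+)?)\s*(?:inch|")',
    }
    text = title.lower()
    features = {}
    for name, pattern in patterns.items():
        found = re.search(pattern, text)
        if found:
            features[name] = found.group(1)
    return features


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_match(offer_features, product_features, threshold=0.5):
    """Assumed matching function: Jaccard overlap of attribute-value pairs."""
    similarity = jaccard(offer_features.items(), product_features.items())
    return similarity >= threshold


def f1_score(predicted, gold):
    """F1 over sets of (offer_id, product_id) correspondences."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical catalog entry and two offers (not taken from the gold standard).
catalog_product = {"storage": "64", "screen_size": "4.7"}
offer_a = extract_features("Apple iPhone 6 64GB 4.7 inch")
offer_b = extract_features('Apple iPhone 6 16GB 4.7"')
```

In this toy setting, `offer_a` agrees with the catalog entry on both attributes and is accepted, while `offer_b` shares only the screen size and falls below the threshold. Real offers are far more heterogeneous, which is the motivation the abstract gives for annotating attributes across titles, descriptions, tables, and lists.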


e-commerce, Product feature extraction, Product matching



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Petar Petrovski (1)
  • Anna Primpeli (1)
  • Robert Meusel (1)
  • Christian Bizer (1)
  1. Data and Web Science Group, University of Mannheim, Mannheim, Germany
