Extracting Product Descriptions from Polish E-Commerce Websites Using Classification and Clustering

Kołaczkowski, Piotr; Gawrysiak, Piotr

doi:10.1007/978-3-642-21916-0_49


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6804)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems


Abstract

A novel method for extracting product descriptions from e-commerce websites is presented. The algorithm consists of three major steps: (1) extracting descriptions of appropriate length from the source documents related to the search query using shallow text analysis methods; (2) assigning each description to one of the predefined categories by means of text classification; and (3) grouping the results with a text clustering algorithm and returning the descriptions found in the highest-quality clusters. The recall and precision of the search are examined using a set of queries for laptops currently being sold on popular shopping sites. It is shown that, although extraction based purely on classification and extraction based purely on clustering both give acceptable results, the highest precision is achieved when the two are used together. It was also observed that examining roughly the first 20 sites returned by Google is sufficient to obtain high-quality descriptions of popular products.
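The three steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the shallow analysis rules, the classifier, the clustering algorithm, or the cluster quality measure. Here a paragraph length filter, a keyword-count classifier, greedy Jaccard clustering, and cluster size as a quality proxy are stand-ins, and every function name, keyword list, and threshold is hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; \\w is Unicode-aware, so Polish letters survive."""
    return re.findall(r"\w+", text.lower())


def extract_candidates(documents, query, min_words=5, max_words=60):
    """Step 1 (assumed rule): keep paragraphs of suitable length that mention a query term."""
    query_terms = set(tokens(query))
    candidates = []
    for doc in documents:
        for paragraph in re.split(r"\n\s*\n", doc):
            words = tokens(paragraph)
            if min_words <= len(words) <= max_words and query_terms & set(words):
                candidates.append(paragraph.strip())
    return candidates


# Hypothetical keyword lists standing in for a trained text classifier.
CATEGORY_KEYWORDS = {
    "laptop": {"laptop", "procesor", "ram", "ekran", "dysk"},
    "other": {"dostawa", "koszyk", "regulamin", "zwrot"},
}


def classify(text):
    """Step 2 (stand-in): pick the category whose keywords occur most often."""
    counts = Counter(tokens(text))
    scores = {cat: sum(counts[w] for w in kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)


def jaccard(a, b):
    """Jaccard similarity of the two texts' token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cluster(texts, threshold=0.3):
    """Step 3 (stand-in): greedy single-pass clustering against each cluster's first member."""
    clusters = []
    for text in texts:
        for members in clusters:
            if jaccard(text, members[0]) >= threshold:
                members.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def best_descriptions(documents, query, category):
    """Run all three steps and return the largest cluster for the wanted category."""
    candidates = extract_candidates(documents, query)
    in_category = [c for c in candidates if classify(c) == category]
    clusters = cluster(in_category)
    # Cluster size is only a crude proxy for cluster quality.
    return max(clusters, key=len) if clusters else []
```

Calling `best_descriptions(pages, "laptop Acme X1", "laptop")` on a list of page texts would return the group of mutually similar paragraphs classified as laptop descriptions. Combining the category filter with the clustering step mirrors the abstract's observation that the two techniques together give the highest precision.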

This work is supported by the National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 by the Strategic scientific research and experimental development program: "Interdisciplinary System for Interactive Scientific and Scientific-Technical Information".

Author information

Authors and Affiliations

Institute of Computer Science, Warsaw University of Technology, Poland
Piotr Kołaczkowski & Piotr Gawrysiak

Editor information

Editors and Affiliations

Faculty of Electronics and Information Technology, Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665, Warsaw, Poland
Marzena Kryszkiewicz
Faculty of Electronics and Information Technology, Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665, Warsaw, Poland
Henryk Rybinski
University of Warsaw, 02-097, Warsaw, Poland
Andrzej Skowron
Faculty of Electronics and Information Technology, Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665, Warsaw, Poland
Zbigniew W. Raś

About this paper

Cite this paper

Kołaczkowski, P., Gawrysiak, P. (2011). Extracting Product Descriptions from Polish E-Commerce Websites Using Classification and Clustering. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2011. Lecture Notes in Computer Science, vol 6804. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21916-0_49

DOI: https://doi.org/10.1007/978-3-642-21916-0_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21915-3
Online ISBN: 978-3-642-21916-0
eBook Packages: Computer Science, Computer Science (R0)