Extracting Product Descriptions from Polish E-Commerce Websites Using Classification and Clustering

  • Conference paper
Foundations of Intelligent Systems (ISMIS 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6804)

Abstract

A novel method for extracting product descriptions from e-commerce websites is presented. The algorithm consists of three major steps: (1) extracting descriptions of appropriate length from the source documents related to the search query using shallow text analysis methods; (2) assigning each description to one of the predefined categories by means of text classification; and (3) grouping the results with a text clustering algorithm and returning the descriptions found in the highest-quality clusters. The recall and precision of the search are examined using a set of queries for laptops currently sold on popular shopping sites. It is shown that, although extraction based purely on classification and extraction based purely on clustering both give acceptable results, the highest precision is achieved when the two are used together. It was also observed that examining roughly the first 20 sites returned by Google is sufficient to obtain high-quality descriptions of popular products.
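The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the shallow analysis rules, the classifier, the clustering algorithm, or the cluster quality measure. Everything below is therefore an assumption made for illustration, including the length limits, the keyword table `CATEGORY_KEYWORDS`, the greedy cosine-similarity clustering, and the use of cluster size as a stand-in for cluster quality.

```python
import math
import re
from collections import Counter

# Hypothetical keyword table; the paper's categories and classifier are not given here.
CATEGORY_KEYWORDS = {
    "laptop": {"laptop", "notebook", "processor", "ram", "screen"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def extract_candidates(documents, query, min_words=8, max_words=60):
    """Step 1 (assumed rule): keep paragraphs of suitable length that mention a query term."""
    query_terms = set(tokenize(query))
    candidates = []
    for doc in documents:
        for paragraph in doc.split("\n"):
            words = tokenize(paragraph)
            if min_words <= len(words) <= max_words and query_terms & set(words):
                candidates.append(paragraph.strip())
    return candidates


def classify(description):
    """Step 2 (toy stand-in): pick the category with the most keyword hits, else 'other'."""
    words = set(tokenize(description))
    best, best_hits = "other", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(descriptions, threshold=0.3):
    """Step 3 (assumed algorithm): greedy single-pass clustering on cosine similarity."""
    clusters = []  # each entry is (summed word counts, member descriptions)
    for description in descriptions:
        vec = Counter(tokenize(description))
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)
                members.append(description)
                break
        else:
            clusters.append((vec, [description]))
    return [members for _, members in clusters]


def best_descriptions(documents, query, category):
    """Run the three steps and return the largest cluster (size as a proxy for quality)."""
    candidates = extract_candidates(documents, query)
    in_category = [c for c in candidates if classify(c) == category]
    return max(cluster(in_category), key=len, default=[])
```

Calling `best_descriptions(pages, "Acme X1", "laptop")` on a list of page texts would return the largest group of mutually similar paragraphs classified as laptop descriptions. Filtering by class before clustering mirrors the abstract's claim that the two techniques work best together; a real system would replace each toy component with a trained model and a proper quality measure.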

This work is supported by the National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 by the Strategic scientific research and experimental development program: “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kołaczkowski, P., Gawrysiak, P. (2011). Extracting Product Descriptions from Polish E-Commerce Websites Using Classification and Clustering. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2011. Lecture Notes in Computer Science, vol 6804. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21916-0_49

  • DOI: https://doi.org/10.1007/978-3-642-21916-0_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21915-3

  • Online ISBN: 978-3-642-21916-0

  • eBook Packages: Computer Science, Computer Science (R0)
