A Decision Tree Framework for Semi-Automatic Extraction of Product Attributes from the Web

  • Lior Rokach
  • Roni Romano
  • Barak Chizi
  • Oded Maimon
Part of the Studies in Computational Intelligence book series (SCI, volume 23)


Semi-automatic extraction of product attributes from URLs is an important issue for comparison-shopping agents. In this paper we examine a novel decision tree framework for extracting product attributes. The core induction framework consists of three main stages. In the first stage, a large set of regular expression-based patterns is induced by employing a longest common subsequence algorithm. In the second stage, we filter the initial set and keep only the most useful patterns. In the last stage, we represent the extraction problem (in which the domain values are not known in advance) as a classification problem and employ an ensemble of decision trees. An empirical study performed on real-world extraction tasks illustrates the capability of the proposed framework.
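The first stage can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, the pairwise comparison of only two examples, the lazy wildcard form, and the function names `lcs_tokens` and `induce_pattern` are all assumptions made for illustration. The idea is that tokens shared by two example strings (their longest common subsequence) become literal parts of a regular expression, while the positions where the examples differ become capture groups.

```python
import re


def lcs_tokens(a, b):
    """Return one longest common subsequence of two token sequences (classic DP)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover one subsequence of that length.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def induce_pattern(s1, s2):
    """Hypothetical helper: shared tokens stay literal, differing gaps become groups."""
    t1, t2 = s1.split(), s2.split()  # assumption: whitespace tokenization
    parts, i, j = [], 0, 0
    for tok in lcs_tokens(t1, t2):
        i2, j2 = t1.index(tok, i), t2.index(tok, j)
        if i2 > i or j2 > j:
            # At least one example has extra tokens here: capture them.
            parts.append(r"(.*?)")
        parts.append(re.escape(tok))
        i, j = i2 + 1, j2 + 1
    if i < len(t1) or j < len(t2):
        parts.append(r"(.*?)")  # trailing tokens that differ
    return r"\s*".join(parts)


pattern = induce_pattern("Price: $ 199 USD", "Price: $ 249 USD")
match = re.fullmatch(pattern, "Price: $ 305 USD")
print(match.group(1) if match else None)
```

In the framework described above, many such candidate patterns would then be filtered in the second stage, and the surviving ones would presumably serve as inputs to the decision tree ensemble in the third stage; this sketch covers neither of those stages.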





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Lior Rokach (1)
  • Roni Romano (2)
  • Barak Chizi (2)
  • Oded Maimon (2)

  1. Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
  2. Department of Industrial Engineering, Tel Aviv University, Israel
