A Decision Tree Framework for Semi-Automatic Extraction of Product Attributes from the Web
Semi-automatic extraction of product attributes from URLs is an important issue for comparison-shopping agents. In this paper we examine a novel decision tree framework for extracting product attributes. The core induction framework consists of three main stages. In the first stage, a large set of regular expression-based patterns is induced by employing a longest common subsequence algorithm. In the second stage, we filter the initial set and keep only the most useful patterns. In the last stage, we represent the extraction problem (in which the domain values are not known in advance) as a classification problem and employ an ensemble of decision trees. An empirical study performed on real-world extraction tasks illustrates the capability of the proposed framework.
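The first stage can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `lcs_tokens` and `induce_pattern`, the whitespace tokenization, and the choice to turn each gap into a capture group are all assumptions made for illustration. The sketch aligns two example strings with a token-level longest common subsequence, keeps the shared tokens as literals, and replaces the differing spans with wildcards. It assumes every gap is non-empty in both strings.

```python
import re

def lcs_tokens(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover the matched positions.
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def induce_pattern(s1, s2):
    """Build a regex from two examples (hypothetical helper, illustration only)."""
    a, b = s1.split(), s2.split()
    parts = []
    prev_i, prev_j = -1, -1
    for i, j in lcs_tokens(a, b):
        # Tokens skipped since the previous match become a capture group.
        if i - prev_i > 1 or j - prev_j > 1:
            parts.append("(.+?)")
        parts.append(re.escape(a[i]))
        prev_i, prev_j = i, j
    # Unmatched trailing tokens become a final greedy capture group.
    if prev_i < len(a) - 1 or prev_j < len(b) - 1:
        parts.append("(.+)")
    return r"\s+".join(parts)

pattern = induce_pattern("Screen size: 15 inch LCD", "Screen size: 17 inch LCD")
match = re.fullmatch(pattern, "Screen size: 19 inch LCD")
print(match.group(1))
```

In this sketch the captured group plays the role of a candidate attribute value. Inducing patterns from many example pairs would yield the large initial pattern set that the second stage filters.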