Abstract
This paper investigates the problem of classifying HTML documents in the context of WebClass, a client-server application developed to support the search activity of a geographically distributed group of people with common interests. The two main issues studied are the selection of features to represent HTML documents and the construction of the classifiers. A new feature selection technique is presented, and its interaction with different classifiers is studied experimentally. Results show that performance improves even with simple classifiers and that the proposed feature selection technique compares favorably with other well-known approaches.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Malerba, D., Esposito, F., Ceci, M. (2002). Mining HTML Pages to Support Document Sharing in a Cooperative System. In: Chaudhri, A.B., Unland, R., Djeraba, C., Lindner, W. (eds) XML-Based Data Management and Multimedia Engineering — EDBT 2002 Workshops. EDBT 2002. Lecture Notes in Computer Science, vol 2490. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36128-6_25
DOI: https://doi.org/10.1007/3-540-36128-6_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00130-0
Online ISBN: 978-3-540-36128-2
eBook Packages: Springer Book Archive