Abstract
Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These add further complexity to the categorization task but also make it possible to use information that is not available in standard text classification, such as metadata and the content of the pages that link to, and are linked from, the web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on the issues of representation that differentiate them from ‘conventional’ classification tasks and from each other.
© 2009 IFIP International Federation for Information Processing
About this chapter
Cite this chapter
Benbrahim, H., Bramer, M. (2009). Text and Hypertext Categorization. In: Bramer, M. (eds) Artificial Intelligence An International Perspective. Lecture Notes in Computer Science, vol 5640. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03226-4_2
DOI: https://doi.org/10.1007/978-3-642-03226-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03225-7
Online ISBN: 978-3-642-03226-4
eBook Packages: Computer Science (R0)