Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5640)

Abstract

Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.

Copyright information

© 2009 IFIP International Federation for Information Processing

About this chapter

Cite this chapter

Benbrahim, H., Bramer, M. (2009). Text and Hypertext Categorization. In: Bramer, M. (eds) Artificial Intelligence An International Perspective. Lecture Notes in Computer Science, vol 5640. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03226-4_2

  • DOI: https://doi.org/10.1007/978-3-642-03226-4_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03225-7

  • Online ISBN: 978-3-642-03226-4

  • eBook Packages: Computer Science (R0)
