Text and Hypertext Categorization

Benbrahim, Houda; Bramer, Max

doi:10.1007/978-3-642-03226-4_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5640)

Abstract

Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.

Author information

Authors and Affiliations

Ernst and Young LLP, 1 More London Place, London, SE1 2AF, United Kingdom
Houda Benbrahim
School of Computing, University of Portsmouth, Portsmouth, Hants, PO1 3HE, United Kingdom
Max Bramer

Editor information

Editors and Affiliations

School of Computing, Lion Terrace, University of Portsmouth, PO1 3HE, Portsmouth, Hants, UK
Max Bramer

About this chapter

Cite this chapter

Benbrahim, H., Bramer, M. (2009). Text and Hypertext Categorization. In: Bramer, M. (ed.) Artificial Intelligence: An International Perspective. Lecture Notes in Computer Science, vol 5640. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03226-4_2

DOI: https://doi.org/10.1007/978-3-642-03226-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03225-7
Online ISBN: 978-3-642-03226-4
eBook Packages: Computer Science, Computer Science (R0)
