
Hidden Markov Models for Text Categorization in Multi-Page Documents

Journal of Intelligent Information Systems

Abstract

In the traditional setting, text categorization is formulated as a concept learning problem where each instance is a single isolated document. However, this perspective is not appropriate for many digital libraries whose contents are scanned, optically read books or magazines. In this paper, we propose a more general formulation of text categorization that allows documents to be organized as sequences of pages. We introduce a novel hybrid system specifically designed for multi-page text documents. The architecture relies on hidden Markov models whose emissions are bags of words generated by a multinomial word event model, as in the generative portion of the Naive Bayes classifier. The rationale behind our proposal is that taking into account the contextual information provided by the whole page sequence can aid disambiguation and improve single-page classification accuracy. Our results on two datasets of scanned journals from the Making of America collection confirm the importance of using whole page sequences. The empirical evaluation indicates that the error rate obtained by running the Naive Bayes classifier on isolated pages can be significantly reduced when contextual information is incorporated.
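To make the architecture concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): page categories act as HMM states, each state emits a page's bag of words under a smoothed multinomial word model (the generative part of Naive Bayes), and Viterbi decoding over the page sequence supplies the contextual information described above. All function names, parameter names, and the smoothing floor are illustrative assumptions.

```python
import math
from collections import Counter


def emission_logprob(page_tokens, word_logprob, floor=math.log(1e-8)):
    # Multinomial log-likelihood of one page's bag of words under one category;
    # word_logprob is assumed to hold smoothed log P(word | category) values,
    # with a floor for words unseen in training.
    counts = Counter(page_tokens)
    return sum(n * word_logprob.get(w, floor) for w, n in counts.items())


def viterbi_page_labels(pages, categories, start_logp, trans_logp, word_logprob):
    # pages:        list of token lists, one per page of the document
    # categories:   page categories, used as HMM states
    # start_logp:   {category: log P(category) for the first page}
    # trans_logp:   {(prev_cat, cat): log P(cat | prev_cat)}
    # word_logprob: {category: {word: log P(word | category)}}
    delta = {c: start_logp[c] + emission_logprob(pages[0], word_logprob[c])
             for c in categories}
    backptrs = []
    for page in pages[1:]:
        new_delta, bp = {}, {}
        for c in categories:
            prev = max(categories, key=lambda p: delta[p] + trans_logp[(p, c)])
            bp[c] = prev
            new_delta[c] = (delta[prev] + trans_logp[(prev, c)]
                            + emission_logprob(page, word_logprob[c]))
        delta = new_delta
        backptrs.append(bp)
    best = max(categories, key=delta.get)   # best category for the last page
    path = [best]
    for bp in reversed(backptrs):           # follow back-pointers to the first page
        path.append(bp[path[-1]])
    return list(reversed(path))
```

Parameter estimation is omitted here; in a setup of this kind the emission tables would be learned as in Naive Bayes and the transition probabilities from labeled page sequences.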







Cite this article

Frasconi, P., Soda, G. & Vullo, A. Hidden Markov Models for Text Categorization in Multi-Page Documents. Journal of Intelligent Information Systems 18, 195–217 (2002). https://doi.org/10.1023/A:1013681528748
