Multinomial Event Model Based Abstraction for Sequence and Text Classification

Kang, Dae-Ki; Zhang, Jun; Silvescu, Adrian; Honavar, Vasant

doi:10.1007/11527862_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3607)

Included in the following conference series:

International Symposium on Abstraction, Reformulation, and Approximation

Abstract

In many machine learning applications that deal with sequences, there is a need for learning algorithms that can effectively utilize the hierarchical grouping of words. We introduce the Word Taxonomy guided Naive Bayes Learner for the Multinomial Event Model (WTNBL-MN), which exploits a word taxonomy to generate compact classifiers, and the Word Taxonomy Learner (WTL), which automatically constructs a word taxonomy from sequence data. WTNBL-MN generalizes the Naive Bayes learner for the Multinomial Event Model to learn classifiers from data using a word taxonomy. WTL uses hierarchical agglomerative clustering to cluster words based on the distribution of class labels that co-occur with the words. Our experimental results on protein localization sequences and Reuters text show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by the standard Naive Bayes learner for the Multinomial Model.
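As a rough illustration of the clustering idea behind WTL, the sketch below greedily merges the two word groups whose class-label distributions are most similar. It is not the authors' implementation: the function names, the toy data, and the choice of a weighted Jensen-Shannon divergence as the dissimilarity measure are all assumptions made for this example, since the abstract does not name the measure.

```python
import math
from collections import Counter, defaultdict


def class_distribution(counts, classes):
    """Normalize class co-occurrence counts into P(class | word group)."""
    total = sum(counts[c] for c in classes)
    return [counts[c] / total for c in classes]


def js_divergence(p, q, wp, wq):
    """Weighted Jensen-Shannon divergence in bits (an assumed measure)."""
    a, b = wp / (wp + wq), wq / (wp + wq)
    m = [a * x + b * y for x, y in zip(p, q)]

    def kl(u, v):
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return a * kl(p, m) + b * kl(q, m)


def build_taxonomy(docs, labels):
    """Hypothetical helper: agglomerative clustering of words.

    Returns the merges in order as (group_a, group_b, divergence),
    which together describe a binary taxonomy over the vocabulary.
    """
    classes = sorted(set(labels))
    clusters = defaultdict(Counter)
    for words, label in zip(docs, labels):
        for w in words:
            clusters[(w,)][label] += 1  # each word starts as its own group
    clusters = dict(clusters)
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                ca, cb = clusters[a], clusters[b]
                d = js_divergence(
                    class_distribution(ca, classes),
                    class_distribution(cb, classes),
                    sum(ca.values()),
                    sum(cb.values()),
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the closest pair and pool their class counts.
        clusters[tuple(sorted(a + b))] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


# Toy data (invented for illustration only).
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["stock", "bank"]]
labels = ["sport", "sport", "finance", "finance"]
for a, b, d in build_taxonomy(docs, labels):
    print(a, "+", b, f"divergence={d:.3f}")
```

On this toy data, words that occur only with the same class label are merged first at zero divergence, and the two class-pure groups are joined last. A taxonomy-guided learner could then estimate multinomial parameters for groups at a chosen level of the tree rather than for individual words, which is the source of the compactness the abstract describes.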

Author information

Authors and Affiliations

Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames, IA, 50011, USA
Dae-Ki Kang, Jun Zhang, Adrian Silvescu & Vasant Honavar

Editor information

Editors and Affiliations

UR 079 GEODES, IRD, 32 avenue Henri Varagnat, 93143, Bondy, France
Jean-Daniel Zucker
Dip. di Informatica, Università del Piemonte Orientale, Via Bellini 25/G, 15100, Alessandria, Italy
Lorenza Saitta

About this paper

Cite this paper

Kang, DK., Zhang, J., Silvescu, A., Honavar, V. (2005). Multinomial Event Model Based Abstraction for Sequence and Text Classification. In: Zucker, JD., Saitta, L. (eds) Abstraction, Reformulation and Approximation. SARA 2005. Lecture Notes in Computer Science, vol 3607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527862_10

DOI: https://doi.org/10.1007/11527862_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27872-6
Online ISBN: 978-3-540-31882-8
eBook Packages: Computer Science, Computer Science (R0)
