Learning accurate and concise naïve Bayes classifiers from attribute value taxonomies and data

Zhang, J.; Kang, D.-K.; Silvescu, A.; Honavar, V.

doi:10.1007/s10115-005-0211-z

Regular Paper
Published: 24 June 2005

Volume 9, pages 157–179, (2006)
Knowledge and Information Systems

Abstract

In many application domains, there is a need for learning algorithms that can effectively exploit attribute value taxonomies (AVT)—hierarchical groupings of attribute values—to learn compact, comprehensible and accurate classifiers from data—including data that are partially specified. This paper describes AVT-NBL, a natural generalization of the naïve Bayes learner (NBL), for learning classifiers from AVT and data. Our experimental results show that AVT-NBL is able to generate classifiers that are substantially more compact and more accurate than those produced by NBL on a broad range of data sets with different percentages of partially specified values. We also show that AVT-NBL is more efficient in its use of training data: AVT-NBL produces classifiers that outperform those produced by NBL using substantially fewer training examples.
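
To make the notion of an attribute value taxonomy concrete, the following is a minimal Python sketch under stated assumptions. It is not the AVT-NBL algorithm described in the paper: the beverage taxonomy, the toy observations, and the helper names (`leaves_under`, `aggregate_counts`, `abstract_to_cut`) are all hypothetical and introduced only for illustration. The sketch shows two simple properties of a hierarchical grouping of values: the count of an abstract value can be obtained by summing the counts of the fully specified values beneath it, and a fully specified value can be mapped to the abstract value that covers it at a chosen level of the hierarchy.

```python
from collections import Counter

# Hypothetical taxonomy for a "beverage" attribute: parent -> children.
# Nodes that do not appear as keys are leaves (fully specified values).
TAXONOMY = {
    "beverage": ["soda", "juice"],
    "soda": ["cola", "lemonade"],
    "juice": ["apple", "orange"],
}


def leaves_under(node, taxonomy):
    """Return the leaf values that lie below (or at) a taxonomy node."""
    children = taxonomy.get(node)
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves_under(child, taxonomy))
    return result


def aggregate_counts(node, leaf_counts, taxonomy):
    """Count of an abstract value: the sum of the counts of its leaves."""
    return sum(leaf_counts[leaf] for leaf in leaves_under(node, taxonomy))


def abstract_to_cut(value, cut, taxonomy):
    """Map a leaf value to the node in `cut` that covers it."""
    for node in cut:
        if value in leaves_under(node, taxonomy):
            return node
    raise ValueError(f"{value!r} is not covered by the cut {cut!r}")


# Toy observations of the attribute within a single class (made-up data).
observed = ["cola", "cola", "lemonade", "apple", "orange", "orange"]
leaf_counts = Counter(observed)

soda_count = aggregate_counts("soda", leaf_counts, TAXONOMY)
juice_count = aggregate_counts("juice", leaf_counts, TAXONOMY)
print(soda_count, juice_count)
```

Describing the attribute at the level of `soda` and `juice` needs two counts per class instead of four, which is the sense in which a classifier built over abstract values can be more compact than one built over the fully specified values.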

Author information

Authors and Affiliations

Department of Computer Science, Artificial Intelligence Research Laboratory, Computational Intelligence, Learning, and Discovery Program, Iowa State University, Ames, Iowa, 50011-1040, USA
J. Zhang, D.-K. Kang & A. Silvescu
Department of Computer Science, Artificial Intelligence Research Laboratory; Computational Intelligence, Learning, and Discovery Program; Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa, 50011-1040, USA
V. Honavar

Corresponding author

Correspondence to J. Zhang.

Additional information

This paper is an extended version of a paper published in the 4th IEEE International Conference on Data Mining, 2004.

Jun Zhang is currently a PhD candidate in computer science at Iowa State University, USA. His research interests include machine learning, data mining, ontology-driven learning, computational biology and bioinformatics, evolutionary computation and neural networks. From 1993 to 2000, he was a lecturer in computer engineering at the University of Science and Technology of China. Jun Zhang received an MS degree in computer engineering from the University of Science and Technology of China in 1993 and a BS in computer science from Hefei University of Technology, China, in 1990.

Dae-Ki Kang is a PhD student in computer science at Iowa State University. His research interests include ontology learning, relational learning, and security informatics. Prior to joining Iowa State, he worked at a Bay Area startup company and at the Electronics and Telecommunications Research Institute in South Korea. He received a master's degree in computer science from Sogang University in 1994 and a bachelor of engineering (BE) degree in computer science and engineering from Hanyang University in Ansan in 1992.

Adrian Silvescu is a PhD candidate in computer science at Iowa State University. His research interests include machine learning, artificial intelligence, bioinformatics and complex adaptive systems. He received an MS degree in theoretical computer science from the University of Bucharest, Romania, in 1997, and a BS in computer science from the University of Bucharest in 1996.

Vasant Honavar received a BE in electronics engineering from Bangalore University, India, an MS in electrical and computer engineering from Drexel University and an MS and a PhD in computer science from the University of Wisconsin, Madison. He founded (in 1990) and has been the director of the Artificial Intelligence Research Laboratory at Iowa State University (ISU), where he is currently a professor of computer science and of bioinformatics and computational biology. He directs the Computational Intelligence, Learning & Discovery Program, which he founded in 2004. Honavar's research and teaching interests include artificial intelligence, machine learning, bioinformatics, computational molecular biology, intelligent agents and multiagent systems, collaborative information systems, semantic web, environmental informatics, security informatics, social informatics, neural computation, systems biology, data mining, knowledge discovery and visualization. Honavar has published over 150 research articles in refereed journals, conferences and books and has coedited 6 books. Honavar is a coeditor-in-chief of the Journal of Cognitive Systems Research and a member of the Editorial Board of the Machine Learning Journal and the International Journal of Computer and Information Security. Prof. Honavar is a member of the Association for Computing Machinery (ACM), the American Association for Artificial Intelligence (AAAI), the Institute of Electrical and Electronics Engineers (IEEE), the International Society for Computational Biology (ISCB), the New York Academy of Sciences, the American Association for the Advancement of Science (AAAS) and the American Medical Informatics Association (AMIA).

About this article

Cite this article

Zhang, J., Kang, DK., Silvescu, A. et al. Learning accurate and concise naïve Bayes classifiers from attribute value taxonomies and data. Knowl Inf Syst 9, 157–179 (2006). https://doi.org/10.1007/s10115-005-0211-z

Received: 01 November 2004
Revised: 25 January 2005
Accepted: 19 February 2005
Published: 24 June 2005
Issue Date: February 2006
DOI: https://doi.org/10.1007/s10115-005-0211-z
