Abstract
This paper describes EGB2, a flexible model for predictive data mining that searches a parameter space to fit a family of models to data by maximum likelihood. It also shows how EGB2 can incorporate asymmetric costs of Type I and Type II errors and thereby minimize expected misclassification costs.
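As an illustration of the asymmetric-cost idea, the following is a minimal sketch of a minimum-expected-cost decision rule, not the paper's implementation. It assumes a fitted model has already produced a probability that an observation is positive; the function names and the cost parameters `cost_fp` (Type I, false positive) and `cost_fn` (Type II, false negative) are hypothetical.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability cutoff above which predicting 'positive' is cheaper.

    Follows from comparing the two expected costs below; with equal
    costs it reduces to the familiar 0.5 cutoff.
    """
    return cost_fp / (cost_fp + cost_fn)


def min_cost_label(p_positive, cost_fp, cost_fn):
    """Return 1 if predicting positive has the lower expected cost, else 0."""
    # Predicting positive is wrong only when the truth is negative.
    expected_cost_predict_pos = (1.0 - p_positive) * cost_fp
    # Predicting negative is wrong only when the truth is positive.
    expected_cost_predict_neg = p_positive * cost_fn
    return 1 if expected_cost_predict_pos < expected_cost_predict_neg else 0
```

For example, if a missed positive is assumed to be nine times as costly as a false alarm, the cutoff falls from 0.5 to 0.1, so an observation with predicted probability 0.2 would be labeled positive.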
Importantly, it has been shown that standard maximum-likelihood estimators are generally inconsistent when the proportions of labels in the sample differ from those in the population from which the sample is drawn. We show how a choice estimator that weights each observation's contribution to the log-likelihood function can contribute to estimator consistency, and how this feature can be implemented in EGB2.
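The weighting idea can be sketched as follows. This is a hedged illustration rather than the paper's code: it assumes binary labels, a known population share of positives, weights equal to population share divided by sample share for each label, and a logistic link standing in for the EGB2 family, whose form is not given here. All function and parameter names are hypothetical.

```python
import math


def logistic_cdf(z):
    """Numerically stable logistic CDF (stand-in link, an assumption)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def choice_weights(labels, population_share_pos):
    """Weight each observation by population share / sample share of its label."""
    sample_share_pos = sum(labels) / len(labels)
    w_pos = population_share_pos / sample_share_pos
    w_neg = (1.0 - population_share_pos) / (1.0 - sample_share_pos)
    return [w_pos if y == 1 else w_neg for y in labels]


def weighted_log_likelihood(beta, xs, labels, weights, cdf=logistic_cdf):
    """Sum of per-observation log-likelihood terms, each scaled by its weight."""
    total = 0.0
    for x, y, w in zip(xs, labels, weights):
        z = sum(b * xi for b, xi in zip(beta, x))
        # Clip the probability away from 0 and 1 so the logs stay finite.
        p = min(max(cdf(z), 1e-12), 1.0 - 1e-12)
        total += w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total
```

In a sample that is half positives drawn from a population with 10% positives, each positive would receive weight 0.2 and each negative weight 1.8, so the over-sampled class counts for less. Maximizing this function over `beta` with any numerical optimizer would give the weighted estimate; a different link could be passed through the `cdf` argument.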
Hansen, J.V., McDonald, J.B. A Generalized Model for Predictive Data Mining. Information Systems Frontiers 4, 179–186 (2002). https://doi.org/10.1023/A:1016050803099
DOI: https://doi.org/10.1023/A:1016050803099