A Generalized Model for Predictive Data Mining

Information Systems Frontiers

Abstract

This paper describes a flexible model for predictive data mining, EGB2, which optimizes over a parameter space to fit data to a family of models based on maximum-likelihood criteria. It is also shown how EGB2 can integrate asymmetric costs of Type I and Type II errors, thereby minimizing expected misclassification costs.
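As a rough illustration of the two ideas in this paragraph, the sketch below evaluates a standardized EGB2 distribution function as a link from a linear score to a class probability, then applies the usual expected-cost decision threshold. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the location-0, scale-1 parameterization with shape parameters `p` and `q`, the crude midpoint integration, and the textbook cost rule. It relies on the fact that if z follows a standardized EGB2(p, q) distribution, then e^z / (1 + e^z) follows a Beta(p, q) distribution, so p = q = 1 gives the logistic (logit) model.

```python
import math


def sigmoid(z):
    """Logistic function, arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def egb2_cdf(z, p, q, steps=2000):
    """Standardized EGB2 CDF at z (hypothetical helper, illustration only).

    Computed as the Beta(p, q) CDF at sigmoid(z) using a midpoint rule,
    so it is only a rough approximation when p < 1 or q < 1.
    """
    upper = sigmoid(z)
    if upper <= 0.0:
        return 0.0
    log_beta = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h  # midpoint of the i-th subinterval, strictly in (0, 1)
        log_density = (p - 1) * math.log(u) + (q - 1) * math.log1p(-u) - log_beta
        total += math.exp(log_density)
    return total * h


def cost_threshold(cost_type1, cost_type2):
    """Probability cutoff that minimizes expected misclassification cost.

    cost_type1: cost of a false positive (Type I error).
    cost_type2: cost of a false negative (Type II error).
    """
    return cost_type1 / (cost_type1 + cost_type2)


def classify(prob_positive, cost_type1, cost_type2):
    """Return 1 if predicting the positive class has the lower expected cost."""
    return 1 if prob_positive > cost_threshold(cost_type1, cost_type2) else 0
```

The cutoff follows from comparing expected costs: predicting positive costs (1 - P) times the Type I cost, predicting negative costs P times the Type II cost, so the positive label is cheaper when P exceeds cost_type1 / (cost_type1 + cost_type2). With equal costs this is the familiar 0.5 cutoff; when a Type II error is four times as costly, the cutoff drops to 0.2.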

Importantly, standard methods of computing maximum-likelihood estimators have been shown to be generally inconsistent when applied to sample data whose label proportions differ from those of the universe from which the sample is drawn. We show how a choice estimator, based on weighting each observation's contribution to the log-likelihood function, can contribute to estimator consistency, and how this feature can be implemented in EGB2.
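A minimal sketch of the weighting idea, under stated assumptions: each observation's log-likelihood term is multiplied by the ratio of its label's share in the universe to its share in the sample, in the spirit of weighted maximum likelihood for choice-based samples. The function names, the binary 0/1 labels, and the assumption that the population share of positives is known in advance are all hypothetical choices for illustration; the abstract does not specify how EGB2 implements the weights.

```python
import math


def choice_based_weights(labels, population_share_pos):
    """Weight each observation by (population share) / (sample share) of its label.

    labels: list of 0/1 labels containing at least one of each.
    population_share_pos: assumed known share of positives in the universe.
    """
    sample_share_pos = sum(labels) / len(labels)
    w_pos = population_share_pos / sample_share_pos
    w_neg = (1.0 - population_share_pos) / (1.0 - sample_share_pos)
    return [w_pos if y == 1 else w_neg for y in labels]


def weighted_log_likelihood(labels, probs, weights):
    """Weighted Bernoulli log-likelihood; probs[i] is the model's P(y_i = 1)."""
    return sum(
        w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        for y, p, w in zip(labels, probs, weights)
    )


def weighted_positive_rate(labels, weights):
    """Maximizer of the weighted log-likelihood for an intercept-only model."""
    return sum(w * y for y, w in zip(labels, weights)) / sum(weights)
```

For example, in a balanced sample of 50 positives and 50 negatives drawn from a universe where positives make up 10 percent, the weights are 0.2 for positives and 1.8 for negatives. The unweighted intercept-only estimate would be the sample rate of 0.5, whereas the weighted estimate returns the population rate of 0.1.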

Author information

Corresponding author

Correspondence to James V. Hansen.

About this article

Cite this article

Hansen, J.V., McDonald, J.B. A Generalized Model for Predictive Data Mining. Information Systems Frontiers 4, 179–186 (2002). https://doi.org/10.1023/A:1016050803099

  • DOI: https://doi.org/10.1023/A:1016050803099