Data Mining and Knowledge Discovery, Volume 20, Issue 3, pp 439–468

Medical data mining: insights from winning two competitions

  • Saharon Rosset
  • Claudia Perlich
  • Grzegorz Świrszcz
  • Prem Melville
  • Yan Liu

Abstract

Two major data mining competitions in 2008 presented challenges in medical domains: KDD Cup 2008, which concerned cancer detection from mammography data, and the INFORMS Data Mining Challenge 2008, which dealt with diagnosing pneumonia from patient information in hospital files. Our team won both competitions, and in this paper we share the lessons we learned and the insights we gained. We emphasize the aspects that pertain to the general practice and methodology of medical data mining, rather than the specifics of each modeling competition. We concentrate on three topics: information leakage and its effect on competitions and proof-of-concept projects; the consideration of real-life model performance measures in model construction and evaluation; and relational learning approaches to medical data mining tasks.

Keywords

Medical data mining · Leakage · Model evaluation · Relational learning

Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Saharon Rosset (1)
  • Claudia Perlich (2)
  • Grzegorz Świrszcz (2)
  • Prem Melville (2)
  • Yan Liu (2)

  1. School of Mathematical Sciences, Tel Aviv University, Tel Aviv, Israel
  2. IBM T.J. Watson Research Center, Yorktown Heights, USA
