An Unbalanced Data Classification Model Using Hybrid Sampling Technique for Fraud Detection

  • T. Maruthi Padmaja
  • Narendra Dhulipalla
  • P. Radha Krishna
  • Raju S. Bapi
  • A. Laha
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4815)


Detecting fraud is challenging because fraud coexists with the latest technology. A central difficulty is that fraud datasets are unbalanced: the non-fraudulent class heavily dominates the fraudulent class. In this work, we treat fraud detection as an unbalanced data classification problem and propose a model based on a hybrid sampling technique that combines random under-sampling with over-sampling using SMOTE. SMOTE widens the data region corresponding to the minority samples, and random under-sampling of the majority class balances the class distribution. The value difference metric (VDM) serves as the distance measure for SMOTE. We conducted experiments with k-NN, Radial Basis Function networks, C4.5, and Naive Bayes classifiers at varied levels of SMOTE on an insurance fraud dataset. To evaluate the learned classifiers, we use the fraud catching rate and the non-fraud catching rate, in addition to the overall accuracy of the classifier, as performance measures. The results indicate that our approach yields high prediction rates for both the fraud and non-fraud classes.
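The hybrid sampling idea described above can be illustrated with a minimal sketch for purely nominal attributes. It is not the authors' implementation. Several details are assumptions made only for illustration: the function names (`vdm_tables`, `vdm_distance`, `smote_nominal`, `hybrid_sample`, `catching_rates`), the exponent `q=2` in the VDM, the rule that a synthetic row copies each attribute from either the seed or a neighbour, under-sampling the majority class down to the size of the enlarged minority class, and reading the fraud and non-fraud catching rates as the per-class fractions of correctly predicted records.

```python
import random
from collections import Counter, defaultdict


def vdm_tables(rows, labels):
    """Estimate P(class | attribute = value) for every nominal attribute."""
    classes = sorted(set(labels))
    counts = [defaultdict(Counter) for _ in rows[0]]
    for row, label in zip(rows, labels):
        for a, value in enumerate(row):
            counts[a][value][label] += 1
    tables = []
    for attr_counts in counts:
        table = {}
        for value, by_class in attr_counts.items():
            total = sum(by_class.values())
            table[value] = [by_class[c] / total for c in classes]
        tables.append(table)
    return tables


def vdm_distance(x, y, tables, q=2):
    """Sum over attributes and classes of |P(c | x_a) - P(c | y_a)|^q."""
    return sum(
        abs(px - py) ** q
        for a, table in enumerate(tables)
        for px, py in zip(table[x[a]], table[y[a]])
    )


def smote_nominal(minority, tables, percent, k, rng):
    """Create percent // 100 synthetic rows for each minority row."""
    synthetic = []
    for seed in minority:
        others = [m for m in minority if m is not seed]
        ranked = sorted(others, key=lambda m: vdm_distance(seed, m, tables))
        neighbours = ranked[:k] or [seed]  # fall back if there is one row only
        for _ in range(percent // 100):
            mate = rng.choice(neighbours)
            # Assumption: copy each attribute from the seed or the neighbour.
            synthetic.append(tuple(rng.choice(p) for p in zip(seed, mate)))
    return synthetic


def hybrid_sample(rows, labels, minority_label, percent=200, k=3, rng=None):
    """Over-sample the minority class, then randomly under-sample the rest."""
    rng = rng or random.Random(0)
    tables = vdm_tables(rows, labels)
    minority = [r for r, l in zip(rows, labels) if l == minority_label]
    majority = [(r, l) for r, l in zip(rows, labels) if l != minority_label]
    minority = minority + smote_nominal(minority, tables, percent, k, rng)
    # Assumption: keep at most as many majority rows as minority rows.
    kept = rng.sample(majority, min(len(majority), len(minority)))
    new_rows = minority + [r for r, _ in kept]
    new_labels = [minority_label] * len(minority) + [l for _, l in kept]
    return new_rows, new_labels


def catching_rates(y_true, y_pred, fraud_label):
    """Return (fraud catching rate, non-fraud catching rate, accuracy)."""
    pairs = list(zip(y_true, y_pred))
    fraud = [t == p for t, p in pairs if t == fraud_label]
    legit = [t == p for t, p in pairs if t != fraud_label]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return sum(fraud) / len(fraud), sum(legit) / len(legit), accuracy
```

In this sketch, `percent` plays the role of the "level of SMOTE": 200 means two synthetic rows per original minority row. A classifier would then be trained on the output of `hybrid_sample` and scored with `catching_rates` on held-out data that has not been resampled.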


Keywords: Fraud detection, SMOTE, VDM, Hybrid sampling, Data mining



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • T. Maruthi Padmaja (1)
  • Narendra Dhulipalla (1)
  • P. Radha Krishna (1)
  • Raju S. Bapi (2)
  • A. Laha (1)
  1. Institute for Development and Research in Banking Technology (IDRBT), Hyderabad, India
  2. Dept of Computer and Information Sciences, University of Hyderabad – 500046, India
