A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification

  • Conference paper
Information Search, Integration, and Personalization (ISIP 2018)

Abstract

Imbalanced datasets are a recurring problem in classification, as most real-life datasets have classes that are not evenly distributed. This causes many problems for classification algorithms trained on such datasets, as they are often biased towards the majority class. Moreover, the minority class is often of greater interest to data scientists, while also being the hardest to predict. Many approaches have been proposed to tackle the problem of imbalanced datasets: they often rely on sampling the majority class or on creating synthetic examples for the minority class. In this paper, we take a completely different perspective on this problem: we propose to use the notion of distance between databases to sample from the majority class, so that the minority and majority classes are as distant as possible. The chosen distance is based on functional dependencies, with the intuition of capturing inherent constraints of the database. We propose algorithms to generate distant synthetic datasets, as well as experiments to verify our conjecture on the classification of distant instances. Despite the mixed results obtained so far, we believe this is a promising research direction, at the intersection of machine learning and databases, that deserves further investigation.
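
To make the idea of a distance based on functional dependencies more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the algorithm from the paper: the distance used here (the number of closed attribute sets on which two relations differ) is one possible choice, and the names fd_holds, closed_sets, fd_distance and most_distant_sample are hypothetical. The brute-force enumeration is exponential in the number of attributes, so it only suits toy examples.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of tuples, lhs a tuple of attribute indices,
    rhs a single attribute index.
    """
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def closed_sets(rows, n_attrs):
    """Return every attribute set that equals its own closure in rows."""
    closed = set()
    for size in range(n_attrs + 1):
        for lhs in combinations(range(n_attrs), size):
            closure = frozenset(
                a for a in range(n_attrs) if fd_holds(rows, lhs, a)
            )
            if closure == frozenset(lhs):
                closed.add(closure)
    return closed


def fd_distance(rows_a, rows_b, n_attrs):
    """Assumed distance: size of the symmetric difference of closed sets."""
    return len(closed_sets(rows_a, n_attrs) ^ closed_sets(rows_b, n_attrs))


def most_distant_sample(minority, candidates, n_attrs):
    """Pick the candidate majority sample farthest from the minority class."""
    return max(candidates, key=lambda rows: fd_distance(minority, rows, n_attrs))


# Toy data with two attributes (indices 0 and 1).
minority = [(0, 0), (1, 1)]               # 0 -> 1 and 1 -> 0 both hold
majority_a = [(0, 0), (0, 1)]             # attribute 0 is constant, 0 -> 1 fails
majority_b = [(0, 0), (1, 1), (2, 2)]     # same dependencies as the minority

print(fd_distance(minority, majority_a, 2))  # 2
print(fd_distance(minority, majority_b, 2))  # 0
```

In this toy setting, most_distant_sample(minority, [majority_b, majority_a], 2) returns majority_a, the candidate whose dependencies differ most from those of the minority class. A practical version would need a scalable way to discover dependencies and to search the space of majority samples.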

Author information

Corresponding author

Correspondence to Marie Le Guilly.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Le Guilly, M., Petit, JM., Scuturici, M. (2019). A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification. In: Kotzinos, D., Laurent, D., Spyratos, N., Tanaka, Y., Taniguchi, Ri. (eds) Information Search, Integration, and Personalization. ISIP 2018. Communications in Computer and Information Science, vol 1040. Springer, Cham. https://doi.org/10.1007/978-3-030-30284-9_8

  • DOI: https://doi.org/10.1007/978-3-030-30284-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30283-2

  • Online ISBN: 978-3-030-30284-9

  • eBook Packages: Computer Science, Computer Science (R0)
