A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification

Le Guilly, Marie; Petit, Jean-Marc; Scuturici, Marian

doi:10.1007/978-3-030-30284-9_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1040)

Included in the following conference series:

International Workshop on Information Search, Integration, and Personalization

Abstract

Imbalanced datasets are a recurring problem in classification, as most real-life datasets have classes that are not evenly distributed. This causes many problems for classification algorithms trained on such datasets, as they are often biased towards the majority class. Moreover, the minority class is often of greater interest to data scientists, while also being the hardest to predict. Many different approaches have been proposed to tackle the problem of imbalanced datasets: they often rely on sampling the majority class or creating synthetic examples for the minority one. In this paper, we take a completely different perspective on this problem: we propose to use the notion of distance between databases to sample from the majority class, so that the minority and majority classes are as distant as possible. The chosen distance is based on functional dependencies, with the intuition of capturing inherent constraints of the database. We propose algorithms to generate distant synthetic datasets, as well as experiments to verify our conjecture on the classification of distant instances. Despite the mixed results obtained so far, we believe this is a promising research direction, at the intersection of machine learning and databases, that deserves further investigation.
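To make the idea of a distance based on functional dependencies more concrete, the following is a minimal sketch and not the method of the paper. It assumes each class is a list of rows (dictionaries), enumerates the functional dependencies with a small left-hand side that each class satisfies, and uses the size of the symmetric difference of the two sets as a simplified stand-in for a distance. The function names `fd_holds`, `satisfied_fds`, and `fd_distance`, the `max_lhs` bound, and the toy data are all hypothetical choices made for illustration.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    The dependency holds when any two rows that agree on every
    attribute in lhs also agree on the attribute rhs.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = row[rhs]
        # setdefault stores val for a new key, or returns the earlier value.
        if seen.setdefault(key, val) != val:
            return False
    return True


def satisfied_fds(rows, attributes, max_lhs=2):
    """Enumerate satisfied dependencies with at most max_lhs attributes on the left."""
    fds = set()
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        for size in range(1, max_lhs + 1):
            for lhs in combinations(others, size):
                if fd_holds(rows, lhs, rhs):
                    fds.add((lhs, rhs))
    return fds


def fd_distance(rows_a, rows_b, attributes, max_lhs=2):
    """Illustrative distance (an assumption): count dependencies satisfied by exactly one side."""
    fds_a = satisfied_fds(rows_a, attributes, max_lhs)
    fds_b = satisfied_fds(rows_b, attributes, max_lhs)
    return len(fds_a ^ fds_b)


# Toy data, invented for this example only.
minority = [{"A": 1, "B": 1, "C": 0}, {"A": 2, "B": 2, "C": 0}]
majority = [{"A": 1, "B": 1, "C": 0}, {"A": 1, "B": 2, "C": 1}]

print(fd_distance(minority, majority, ["A", "B", "C"]))
```

Under this stand-in, a sampling strategy in the spirit of the abstract would prefer majority samples for which `fd_distance` to the minority class is large. The exhaustive enumeration is exponential in `max_lhs` and is only meant for small illustrative schemas.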

Author information

Authors and Affiliations

Université de Lyon, CNRS, INSA-LYON, LIRIS, UMR5205, 69621, Villeurbanne, France
Marie Le Guilly, Jean-Marc Petit & Marian Scuturici

Corresponding author

Correspondence to Marie Le Guilly.

Editor information

Editors and Affiliations

Lab. ETIS UMR 8051, University of Paris-Seine, University of Cergy-Pontoise, ENSEA, CNRS, Cergy-Pontoise, France
Dimitris Kotzinos
Lab. ETIS UMR 8051, University of Paris-Seine, University of Cergy-Pontoise, ENSEA, CNRS, Cergy-Pontoise, France
Dominique Laurent
LRI, University of Paris-Sud, Orsay, France
Nicolas Spyratos
Hokkaido University, Sapporo, Japan
Yuzuru Tanaka
Kyushu University, Fukuoka, Japan
Rin-ichiro Taniguchi

About this paper

Cite this paper

Le Guilly, M., Petit, JM., Scuturici, M. (2019). A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification. In: Kotzinos, D., Laurent, D., Spyratos, N., Tanaka, Y., Taniguchi, Ri. (eds) Information Search, Integration, and Personalization. ISIP 2018. Communications in Computer and Information Science, vol 1040. Springer, Cham. https://doi.org/10.1007/978-3-030-30284-9_8

DOI: https://doi.org/10.1007/978-3-030-30284-9_8
Published: 24 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30283-2
Online ISBN: 978-3-030-30284-9
eBook Packages: Computer Science, Computer Science (R0)