Abstract
Imbalanced datasets are a recurring problem for classification in machine learning, as most real-life datasets contain classes that are not evenly distributed. This causes many problems for classification algorithms trained on such datasets, as they are often biased towards the majority class. Moreover, the minority class is often the one of most interest to data scientists, while also being the hardest to predict. Many approaches have been proposed to tackle the problem of imbalanced datasets: they often rely on sampling the majority class, or on creating synthetic examples for the minority one. In this paper, we take a completely different perspective on this problem: we propose to use the notion of distance between databases to sample from the majority class, so that the minority and majority classes are as distant as possible. The chosen distance is based on functional dependencies, with the intuition of capturing inherent constraints of the database. We propose algorithms to generate distant synthetic datasets, as well as experiments to test our conjecture on the classification of distant instances. Despite the mixed results obtained so far, we believe this is a promising research direction, at the intersection of machine learning and databases, that deserves further investigation.
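To make the notion of a functional dependency concrete, the following Python sketch checks whether a dependency X → a holds in a small relation, and compares two relations by counting the dependencies satisfied by exactly one of them. It is an illustration only: the helper names (`fd_holds`, `satisfied_fds`, `toy_fd_distance`), the toy data, and the counting measure itself are assumptions made for this example, and they are not the distance or the algorithms used in the paper.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs is a tuple of attribute names; rhs is one name.
    The FD holds when tuples that agree on lhs also agree on rhs.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = row[rhs]
        if seen.setdefault(key, value) != value:
            return False
    return True


def satisfied_fds(rows, attrs):
    """Enumerate all FDs X -> a (X non-empty, a not in X) satisfied by rows.

    Brute force, exponential in the number of attributes: toy sizes only.
    """
    fds = set()
    for a in attrs:
        others = [b for b in attrs if b != a]
        for k in range(1, len(others) + 1):
            for lhs in combinations(others, k):
                if fd_holds(rows, lhs, a):
                    fds.add((lhs, a))
    return fds


def toy_fd_distance(rows1, rows2, attrs):
    """Hypothetical measure: number of FDs satisfied by exactly one relation."""
    return len(satisfied_fds(rows1, attrs) ^ satisfied_fds(rows2, attrs))


# Hypothetical toy relations standing in for the two classes.
attrs = ("A", "B", "C")
minority = [{"A": 1, "B": 1, "C": 0}, {"A": 2, "B": 1, "C": 0}]
majority = [{"A": 1, "B": 1, "C": 0}, {"A": 1, "B": 2, "C": 1}]

print(fd_holds(minority, ("A",), "B"))  # A -> B holds in the minority sample
print(fd_holds(majority, ("A",), "B"))  # and is violated in the majority sample
print(toy_fd_distance(minority, majority, attrs))
```

Under this toy measure, a majority sample that satisfies few of the dependencies holding in the minority class would count as distant from it, which is the kind of separation the sampling strategy aims for.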
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Le Guilly, M., Petit, JM., Scuturici, M. (2019). A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification. In: Kotzinos, D., Laurent, D., Spyratos, N., Tanaka, Y., Taniguchi, Ri. (eds) Information Search, Integration, and Personalization. ISIP 2018. Communications in Computer and Information Science, vol 1040. Springer, Cham. https://doi.org/10.1007/978-3-030-30284-9_8
DOI: https://doi.org/10.1007/978-3-030-30284-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30283-2
Online ISBN: 978-3-030-30284-9
eBook Packages: Computer Science, Computer Science (R0)