Abstract
AutoML aims to select an appropriate classification algorithm and its hyperparameters for a given dataset. However, existing AutoML methods usually ignore the intrinsically imbalanced nature of most real-world datasets, which leads to poor performance. Sampling methods are widely used to handle imbalanced data because they are independent of the classification algorithm applied afterwards. We propose a method named AutoIDL that selects sampling methods and classification algorithms simultaneously. First, AutoIDL represents datasets as graphs and extracts their meta-features with a graph embedding method; the meta-targets are identified as pairs of sampling methods and classification algorithms for each imbalanced dataset. Second, a user-based collaborative filtering method is employed to train a ranking model on the meta repository, which then selects appropriate sampling methods and algorithms for a new dataset. Extensive experimental results demonstrate that AutoIDL is effective for automated imbalanced data learning and that it outperforms state-of-the-art AutoML methods.
Supported by the National Natural Science Foundation of China under Grant No. 61702405 and the China Postdoctoral Science Foundation under Grant No. 2017M623176.
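The ranking step described in the abstract can be illustrated with a minimal sketch of user-based collaborative filtering, in which historical datasets play the role of "users" and (sampling method, classifier) pairs play the role of "items". Everything below is an assumption made for illustration and is not taken from the paper: the function names `cosine` and `rank_pairs`, the use of cosine similarity over the embeddings, the similarity-weighted average, the neighbourhood size `k`, and the toy embeddings, scores, and candidate pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_pairs(embeddings, performance, new_embedding, pairs, k=2):
    """Rank candidate (sampler, classifier) pairs for a new dataset.

    embeddings[i]     -- meta-feature vector of historical dataset i (assumed given)
    performance[i][j] -- score of candidate pair j on historical dataset i
    new_embedding     -- meta-feature vector of the new dataset
    """
    # Similarity of the new dataset to every dataset in the meta repository.
    sims = [cosine(e, new_embedding) for e in embeddings]

    # Keep the k most similar historical datasets as the neighbourhood.
    neighbours = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

    # Negative similarities are clipped to zero; fall back to uniform weights
    # if no neighbour has positive similarity.
    weights = [max(sims[i], 0.0) for i in neighbours]
    total = sum(weights)
    if total == 0:
        weights = [1.0] * len(neighbours)
        total = float(len(neighbours))

    # Predicted score of each pair: similarity-weighted mean over the neighbours.
    predicted = [
        sum(w * performance[i][j] for w, i in zip(weights, neighbours)) / total
        for j in range(len(pairs))
    ]

    # Best predicted pair first.
    order = sorted(range(len(pairs)), key=lambda j: predicted[j], reverse=True)
    return [(pairs[j], predicted[j]) for j in order]


# Toy meta repository (hypothetical values, for illustration only).
pairs = [("SMOTE", "RandomForest"), ("SMOTE", "SVM"), ("TomekLinks", "RandomForest")]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
performance = [
    [0.9, 0.6, 0.7],
    [0.5, 0.8, 0.6],
    [0.7, 0.7, 0.9],
]

ranking = rank_pairs(embeddings, performance, [1.0, 0.0], pairs, k=2)
for pair, score in ranking:
    print(pair, round(score, 3))
```

In this toy setting the new dataset is most similar to the first and third historical datasets, so the pairs that performed well on those two are ranked highest. A real system would obtain the embeddings from a graph embedding method and the performance matrix from prior evaluations, neither of which is reproduced here.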
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, J., Sun, Z., Qi, Y. (2020). AutoIDL: Automated Imbalanced Data Learning via Collaborative Filtering. In: Li, G., Shen, H., Yuan, Y., Wang, X., Liu, H., Zhao, X. (eds) Knowledge Science, Engineering and Management. KSEM 2020. Lecture Notes in Computer Science, vol 12275. Springer, Cham. https://doi.org/10.1007/978-3-030-55393-7_9
DOI: https://doi.org/10.1007/978-3-030-55393-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55392-0
Online ISBN: 978-3-030-55393-7
eBook Packages: Computer Science, Computer Science (R0)