Abstract
Getting visitors to register is a crucial factor in marketing for online news portals. Current approaches are rule-based, awarding points for specific actions [3]. Finding efficient rules can be challenging and depends on the specific task. Registrations are generally rare compared to regular visits, leading to highly imbalanced data.
We analyze several supervised classification algorithms with the data imbalance taken into account. As a case study, we use anonymized real-world data from an Austrian newspaper outlet containing visitors' session behavior, in which around 0.1% of all visits result in a registration.
We identify an ensemble approach that combines the Balanced Random Forest Classifier and the RUSBoost Classifier and correctly identifies 76% of registrations across five independent data sets.
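To illustrate why the imbalance matters for evaluation, the following minimal sketch compares accuracy with the share of registrations that are found (recall), which is how the 76% figure above reads. All numbers except the 0.1% registration rate and the 76% figure are hypothetical: the visit count, the number of false alarms, and the helper names `recall` and `accuracy` are assumptions made for illustration and are not taken from the paper.

```python
def recall(y_true, y_pred):
    """Share of actual registrations (label 1) that were predicted as 1."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


def accuracy(y_true, y_pred):
    """Share of all visits whose label was predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical data: 100,000 visits, 0.1% of which end in a registration.
n_visits = 100_000
n_registrations = 100
y_true = [1] * n_registrations + [0] * (n_visits - n_registrations)

# Baseline that never predicts a registration.
never = [0] * n_visits

# Hypothetical classifier: finds 76 of the 100 registrations
# and raises 2,000 false alarms (an assumed number).
false_alarms = 2_000
flagged = (
    [1] * 76
    + [0] * 24
    + [1] * false_alarms
    + [0] * (n_visits - n_registrations - false_alarms)
)

baseline_accuracy = accuracy(y_true, never)
baseline_recall = recall(y_true, never)
model_accuracy = accuracy(y_true, flagged)
model_recall = recall(y_true, flagged)

print(f"baseline: accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.2f}")
print(f"model:    accuracy={model_accuracy:.3f}, recall={model_recall:.2f}")
```

Under these assumptions the baseline that never predicts a registration scores higher on accuracy than the classifier that finds most registrations, which is why accuracy alone says little at a 0.1% positive rate.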
References
Alshehri, M., Alamri, A., Cristea, A.I., Stewart, C.D.: Towards designing profitable courses: predicting student purchasing behaviour in MOOCs. Int. J. Artif. Intell. Educ. 31, 215–233 (2021)
Artun, O., Levin, D.: Predictive Marketing: Easy Ways Every Marketer Can Use Customer Analytics and Big Data. Wiley Online Library (2015)
Benhaddou, Y., Leray, P.: Customer relationship management and small data - application of Bayesian network elicitation techniques for building a lead scoring model. In: 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pp. 251–255 (2017). https://doi.org/10.1109/AICCSA.2017.51
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Chen, C., Liaw, A., Breiman, L., et al.: Using random forest to learn imbalanced data. Univ. Calif. Berkeley 110(1–12), 24 (2004)
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., et al.: XGBoost: extreme gradient boosting. R Package Version 0.4-2 1(4), 1–4 (2015)
Duncan, B.A., Elkan, C.P.: Probabilistic modeling of a sales funnel to prioritize leads. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1751–1758 (2015)
Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)
Guo, X., Yin, Y., Dong, C., Yang, G., Zhou, G.: On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, vol. 4, pp. 192–201. IEEE (2008)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intell. Syst. Appl. 13(4), 18–28 (1998)
Kietzmann, J., Paschen, J., Treen, E.: Artificial intelligence in advertising: how marketers can leverage artificial intelligence along the consumer journey. J. Advert. Res. 58(3), 263–267 (2018)
Kleinbaum, D.G., Klein, M.: Logistic Regression. SBH, Springer, New York (2010). https://doi.org/10.1007/978-1-4419-1742-3
Kotsiantis, S., Kanellopoulos, D., Pintelas, P., et al.: Handling imbalanced datasets: a review. GESTS Int. Trans. Comput. Sci. Eng. 30(1), 25–36 (2006)
Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(1), 559–563 (2017)
Liaw, A., Wiener, M., et al.: Classification and regression by RandomForest. R News 2(3), 18–22 (2002)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
More, A., Rana, D.P.: Review of random forest classification techniques to resolve data imbalance. In: 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), pp. 72–78. IEEE (2017)
Nygård, R., Mezei, J.: Automating lead scoring with machine learning: an experimental study. In: Proceedings of the 53rd Hawaii International Conference on System Sciences (2020)
Patel, D., Zhou, N., Shrivastava, S., Kalagnanam, J.: Doctor for machines: a failure pattern analysis solution for industry 4.0. In: 2020 IEEE International Conference on Big Data (Big Data), pp. 1614–1623 (2020). https://doi.org/10.1109/BigData50022.2020.9378369
Polikar, R.: Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6(3), 21–45 (2006)
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1), 1–39 (2010)
Rokach, L., Maimon, O.: Decision trees. In: Liu, L., Özsu, M.T. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 165–192. Springer, Cham (2005). https://doi.org/10.1007/978-0-387-39940-9_2445
Schapire, R.E.: Explaining AdaBoost. In: Schölkopf, B., Luo, Z., Vovk, V. (eds.) Empirical Inference, pp. 37–52. Springer, Cham (2013). https://doi.org/10.1007/978-3-642-41136-6_5
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man. Cybern. Part A Syst. Humans 40(1), 185–197 (2009)
Stehman, S.V.: Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 62(1), 77–89 (1997)
Tharwat, A.: Classification assessment methods. In: Applied Computing and Informatics (2020)
Urban, T., Tatang, D., Degeling, M., Holz, T., Pohlmann, N.: Measuring the impact of the GDPR on data sharing in ad networks. In: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pp. 222–235, ASIA CCS 2020. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3320269.3372194
Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 3, 408–421 (1972)
Xie, Y., Li, X., Ngai, E., Ying, W.: Customer churn prediction using improved balanced random forests. Expert Syst. Appl. 36(3), 5445–5449 (2009)
Yegnanarayana, B.: Artificial Neural Networks. PHI Learning Pvt, Ltd., New Delhi (2009)
Ying, W.Y., Qin, Z., Zhao, Y., Li, B., Li, X.: Support vector machine and its application in customer churn prediction. Syst. Eng. Theory Pract. 7 (2007)
Zhang, Y.P., Zhang, L.N., Wang, Y.C.: Cluster-based majority under-sampling approaches for class imbalance learning. In: 2010 2nd IEEE International Conference on Information and Financial Engineering, pp. 400–404. IEEE (2010). https://doi.org/10.1109/ICIFE.2010.5609385
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Spitzer, EM., Krauss, O., Stöckl, A. (2022). Accurately Predicting User Registration in Highly Unbalanced Real-World Datasets from Online News Portals. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13426. Springer, Cham. https://doi.org/10.1007/978-3-031-12423-5_23
DOI: https://doi.org/10.1007/978-3-031-12423-5_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12422-8
Online ISBN: 978-3-031-12423-5
eBook Packages: Computer Science, Computer Science (R0)