Accurately Predicting User Registration in Highly Unbalanced Real-World Datasets from Online News Portals

Spitzer, Eva-Maria; Krauss, Oliver; Stöckl, Andreas

doi:10.1007/978-3-031-12423-5_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13426)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Getting visitors to register is a crucial factor in marketing for online news portals. Current approaches are rule-based and award points for specific actions [3]. Finding efficient rules can be challenging and depends on the specific task. Registrations are generally rare compared to regular visits, which leads to highly imbalanced data.

We analyze different supervised classification algorithms, taking the data imbalance into account. As a case study, we use anonymized real-world data from an Austrian newspaper outlet that contains the visitors' session behavior, with registrations making up around 0.1% of all visits.
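To illustrate why an imbalance of this size matters, the following minimal sketch compares accuracy and recall for a trivial classifier that never predicts a registration. The session count of 100,000 and all function names are hypothetical and chosen only for illustration; only the 0.1% registration rate comes from the abstract.

```python
def accuracy(y_true, y_pred):
    """Fraction of sessions whose label was predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual registrations (label 1) that were detected."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical data set: 100,000 sessions, 0.1% of which end in a registration.
n_sessions = 100_000
n_registrations = 100
y_true = [1] * n_registrations + [0] * (n_sessions - n_registrations)

# Trivial baseline: always predict "no registration".
y_pred_trivial = [0] * n_sessions

trivial_accuracy = accuracy(y_true, y_pred_trivial)
trivial_recall = recall(y_true, y_pred_trivial)
print(f"accuracy={trivial_accuracy:.3f}, recall={trivial_recall:.3f}")
```

Under these assumptions the baseline is right for almost every session while finding no registrations at all, which is why the share of correctly identified registrations is a more informative measure here than plain accuracy.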

We identify an ensemble approach that combines the Balanced Random Forest Classifier and the RUSBoost Classifier and correctly identifies 76% of registrations across five independent data sets.
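Both classifiers named above rely on random undersampling of the majority class before fitting their base learners. The sketch below shows that resampling step in plain Python, followed by one possible way of combining two models by averaging their predicted probabilities. The abstract does not state how the two classifiers are combined, so the averaging rule, the 0.5 threshold, the toy data, and all function names are assumptions made for illustration and not the authors' implementation.

```python
import random


def random_undersample(rows, labels, rng):
    """Keep every minority row (label 1) and an equally sized random
    sample of majority rows (label 0), drawn without replacement.
    Assumes the majority class has at least as many rows as the minority."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def soft_vote(prob_a, prob_b, threshold=0.5):
    """Average two models' registration probabilities per session and
    predict a registration when the mean reaches the threshold."""
    return [1 if (a + b) / 2 >= threshold else 0
            for a, b in zip(prob_a, prob_b)]


# Hypothetical toy data: 10 registrations among 10,000 sessions (0.1%).
rng = random.Random(0)
labels = [1] * 10 + [0] * 9_990
rows = [[float(i)] for i in range(len(labels))]

balanced_rows, balanced_labels = random_undersample(rows, labels, rng)
print(len(balanced_labels), sum(balanced_labels))

# Hypothetical probabilities from two already fitted models for three sessions.
votes = soft_vote([0.9, 0.2, 0.6], [0.7, 0.1, 0.3])
print(votes)
```

In practice the resampling happens inside each classifier (per bootstrap sample in a balanced random forest, per boosting round in RUSBoost) rather than once up front as in this simplified sketch.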

References

Alshehri, M., Alamri, A., Cristea, A.I., Stewart, C.D.: Towards designing profitable courses: predicting student purchasing behaviour in MOOCs. Int. J. Artif. Intell. Educ. 31, 215–233 (2021)
Artun, O., Levin, D.: Predictive Marketing: Easy Ways Every Marketer Can Use Customer Analytics and Big Data. Wiley Online Library (2015)
Benhaddou, Y., Leray, P.: Customer relationship management and small data - application of Bayesian network elicitation techniques for building a lead scoring model. In: 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pp. 251–255 (2017). https://doi.org/10.1109/AICCSA.2017.51
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Chen, C., Liaw, A., Breiman, L., et al.: Using random forest to learn imbalanced data. Univ. Calif. Berkeley 110(1–12), 24 (2004)
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., et al.: XGBoost: extreme gradient boosting. R Package Version 0.4-2 1(4), 1–4 (2015)
Duncan, B.A., Elkan, C.P.: Probabilistic modeling of a sales funnel to prioritize leads. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1751–1758 (2015)
Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)
Guo, X., Yin, Y., Dong, C., Yang, G., Zhou, G.: On the class imbalance problem. In: 2008 Fourth International Conference on Natural Computation, vol. 4, pp. 192–201. IEEE (2008)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intell. Syst. Appl. 13(4), 18–28 (1998)
Kietzmann, J., Paschen, J., Treen, E.: Artificial intelligence in advertising: how marketers can leverage artificial intelligence along the consumer journey. J. Advert. Res. 58(3), 263–267 (2018)
Kleinbaum, D.G., Klein, M.: Logistic Regression. SBH, Springer, New York (2010). https://doi.org/10.1007/978-1-4419-1742-3
Kotsiantis, S., Kanellopoulos, D., Pintelas, P., et al.: Handling imbalanced datasets: a review. GESTS Int. Trans. Comput. Sci. Eng. 30(1), 25–36 (2006)
Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(1), 559–563 (2017)
Liaw, A., Wiener, M., et al.: Classification and regression by RandomForest. R News 2(3), 18–22 (2002)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
More, A., Rana, D.P.: Review of random forest classification techniques to resolve data imbalance. In: 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), pp. 72–78. IEEE (2017)
Nygård, R., Mezei, J.: Automating lead scoring with machine learning: an experimental study. In: Proceedings of the 53rd Hawaii International Conference on System Sciences (2020)
Patel, D., Zhou, N., Shrivastava, S., Kalagnanam, J.: Doctor for machines: a failure pattern analysis solution for industry 4.0. In: 2020 IEEE International Conference on Big Data (Big Data), pp. 1614–1623 (2020). https://doi.org/10.1109/BigData50022.2020.9378369
Polikar, R.: Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6(3), 21–45 (2006)
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1), 1–39 (2010)
Rokach, L., Maimon, O.: Decision trees. In: Liu, L., Özsu, M.T. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 165–192. Springer, Cham (2005). https://doi.org/10.1007/978-0-387-39940-9_2445
Schapire, R.E.: Explaining AdaBoost. In: Schölkopf, B., Luo, Z., Vovk, V. (eds.) Empirical Inference, pp. 37–52. Springer, Cham (2013). https://doi.org/10.1007/978-3-642-41136-6_5
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: RusBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man. Cybern. Part A Syst. Humans 40(1), 185–197 (2009)
Stehman, S.V.: Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 62(1), 77–89 (1997)
Tharwat, A.: Classification assessment methods. In: Applied Computing and Informatics (2020)
Urban, T., Tatang, D., Degeling, M., Holz, T., Pohlmann, N.: Measuring the impact of the GDPR on data sharing in ad networks. In: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pp. 222–235, ASIA CCS 2020. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3320269.3372194
Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 3, 408–421 (1972)
Xie, Y., Li, X., Ngai, E., Ying, W.: Customer churn prediction using improved balanced random forests. Expert Syst. Appl. 36(3), 5445–5449 (2009)
Yegnanarayana, B.: Artificial Neural Networks. PHI Learning Pvt, Ltd., New Delhi (2009)
Ying, W.Y., Qin, Z., Zhao, Y., Li, B., Li, X.: Support vector machine and its application in customer churn prediction. Syst. Eng. Theory Pract. 7 (2007)
Zhang, Y.P., Zhang, L.N., Wang, Y.C.: Cluster-based majority under-sampling approaches for class imbalance learning. In: 2010 2nd IEEE International Conference on Information and Financial Engineering, pp. 400–404. IEEE (2010). https://doi.org/10.1109/ICIFE.2010.5609385

Author information

Authors and Affiliations

Advanced Information Systems and Technology, University of Applied Sciences Upper Austria, Wels, Austria
Eva-Maria Spitzer & Oliver Krauss
Digital Media Department, University of Applied Sciences Upper Austria, Wels, Austria
Andreas Stöckl

Corresponding author

Correspondence to Oliver Krauss.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
University of Calabria, Rende, Italy
Alfredo Cuzzocrea
Johannes Kepler University of Linz, Linz, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Spitzer, EM., Krauss, O., Stöckl, A. (2022). Accurately Predicting User Registration in Highly Unbalanced Real-World Datasets from Online News Portals. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13426. Springer, Cham. https://doi.org/10.1007/978-3-031-12423-5_23

DOI: https://doi.org/10.1007/978-3-031-12423-5_23
Published: 29 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12422-8
Online ISBN: 978-3-031-12423-5
eBook Packages: Computer Science, Computer Science (R0)