Abstract
Analyzing data from the web is now a primary task, understood in many different ways and pursued for a wide variety of purposes. The talk describes experience from a project devoted to analyzing such data and draws some more general conclusions. The project aimed to distinguish artificial ad-related traffic from genuine traffic. The rationale is simple: the flow of money depends on the number of clicks on, or views of, an ad, so fake clicking shifts the market to the benefit of some participants and to the loss of others. The talk describes the problem and its conceptual framing, as well as a number of technical details involving the issues and techniques of (1) variable analysis and choice; (2) clustering; (3) classification/classifiers; and (4) potential hybrid techniques, along with citations of the most interesting results. These often imply definite general conclusions, some of them quite surprising.
The work reported was carried out within the project ABTShield, led by EDGE NPD Co. Ltd.
Notes
1. The best illustration is provided by the most recent sanctions against Russia in the context of its aggression against Ukraine: one of the key issues concerned the banking system and the possibility of performing transactions.
2. The fact that the categorization is dichotomous or trichotomous does not mean that the problem is simple; see the task of identifying irony or sarcasm in expressions found on the web.
3. In this sequence of steps, we concentrate on the cognitive aspect, but, of course, the business side (expected costs and benefits) normally has to be accounted for on a par.
4. Think of various functions, aggregates, statistical representations, etc., of the raw data.
5. In the same step: is there any common-sense approach (even if very rough) to the problem?
6. We definitely believe that science should lead to truth, but most often the truth is, out of necessity, only approximated.
7. Web users are, as a rule, not aware that when they move to a given web page that is supposed to provide advertising content, their properties (as expressed, in particular, through the “cookies”) guide the flash auction, which determines the advertising material they will actually see.
8. We set aside the crawlers and bots with no “negative” objectives, such as those gathering statistical data.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Owsiński, J.W. et al. (2023). On Ups and Downs in Analyzing Web Activity Data: Notes from a Project. In: Thampi, S.M., Mukhopadhyay, J., Paprzycki, M., Li, KC. (eds) International Symposium on Intelligent Informatics. ISI 2022. Smart Innovation, Systems and Technologies, vol 333. Springer, Singapore. https://doi.org/10.1007/978-981-19-8094-7_37
DOI: https://doi.org/10.1007/978-981-19-8094-7_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8093-0
Online ISBN: 978-981-19-8094-7
eBook Packages: Engineering, Engineering (R0)