Abstract
AdEater is an early browsing assistant that automatically removes advertisement images from web pages. It generates rules from training data and applies them as the user browses. Advertisement images are replaced by transparent images displaying the word "ad"; where an image is misclassified, non-advertisement images on a page are replaced in the same way. This paper critically examines the dataset derived from a trial of AdEater and builds a robust image classifier. We apply data mining techniques to uncover associations between the features of advertisements and non-advertisements, and we predict whether images are advertisements or non-advertisements using three classification methods. Using k-fold cross-validation to train and test the models, we achieve a classification accuracy of 96.5%.
Notes
Fiol-Roig et al. [2].
See Electronic Supplementary Material. Full code available upon request.
Agrawal et al. [8].
As indicated by the value 0. We converted these values to 1, and the present values (indicated by 1) to 0, in order to mine rules among absent image features.
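The recoding described in this note can be sketched as follows. The paper's analysis was carried out in R; this Python sketch uses hypothetical feature names for illustration.

```python
# Flip binary indicator features so that "absent" (0) becomes 1 and
# "present" (1) becomes 0, allowing association rules to be mined
# over absences rather than presences.
def invert_binary_features(rows):
    """rows: list of dicts mapping feature name -> 0/1 indicator."""
    return [{name: 1 - value for name, value in row.items()} for row in rows]

# Hypothetical image-feature indicators, not taken from the dataset.
images = [
    {"url_contains_ads": 1, "alt_contains_click": 0},
    {"url_contains_ads": 0, "alt_contains_click": 1},
]
inverted = invert_binary_features(images)
print(inverted[0])  # {'url_contains_ads': 0, 'alt_contains_click': 1}
```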
For brevity, lists of more than 20 rules are not reported; these rules are available upon request.
\( \frac{\mathrm{Lift}(A \to B) - L}{U - L} \).
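The standardised lift in this note (McNicholas et al. [9]) rescales the lift of a rule to [0, 1] using its attainable lower and upper bounds L and U. A minimal sketch, with illustrative support values and bounds chosen for the example rather than taken from the paper:

```python
def lift(support_ab, support_a, support_b):
    """Lift(A -> B) = P(A and B) / (P(A) * P(B))."""
    return support_ab / (support_a * support_b)

def standardised_lift(lift_value, lower, upper):
    """Rescale lift to [0, 1] via its attainable bounds L and U:
    (Lift(A -> B) - L) / (U - L)."""
    return (lift_value - lower) / (upper - lower)

raw = lift(0.2, 0.4, 0.25)                     # 0.2 / (0.4 * 0.25) = 2.0
print(standardised_lift(raw, lower=1.0, upper=4.0))  # (2 - 1) / (4 - 1) = 0.333...
```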
Method for pruning sourced here: http://www.rdatamining.com/examples/association-rules.
Witten and Frank [10].
We applied the “binary” method in computing the distance in R.
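R's "binary" distance treats each vector as presence/absence data: among positions where at least one of the two vectors is nonzero, it is the fraction where exactly one is nonzero (the Jaccard distance). A Python sketch of that computation:

```python
def binary_distance(x, y):
    """R's dist(method = "binary"): among positions where at least one
    vector is nonzero, the fraction where exactly one is nonzero."""
    both_nonzero = sum(1 for a, b in zip(x, y) if a and b)
    any_nonzero = sum(1 for a, b in zip(x, y) if a or b)
    if any_nonzero == 0:
        return 0.0  # R itself returns NaN here; 0.0 keeps the sketch simple
    return 1 - both_nonzero / any_nonzero

print(binary_distance([1, 0, 1, 1], [1, 1, 0, 0]))  # 1 - 1/4 = 0.75
```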
Note that the data excludes the classifier variable and relates only to the 1554 image features.
Objective function: \( \sum_{i=1}^{n} \sum_{j=1}^{K} z_{ij}\, d\left( x_{i}, \mu_{j} \right) \).
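The objective sums, over all points, the distance from each point to the centre of its assigned cluster (\(z_{ij} = 1\) iff point \(i\) is assigned to cluster \(j\)). A minimal sketch; squared Euclidean distance is used here purely for illustration, while the clustering in the paper used the binary distance described above.

```python
def kmeans_objective(points, centers, assignments, d):
    """Sum over i, j of z_ij * d(x_i, mu_j), where z_ij = 1 iff
    point i is assigned to cluster j (assignments[i] == j)."""
    return sum(d(points[i], centers[assignments[i]]) for i in range(len(points)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
ctrs = [(0.5, 0.0), (10.0, 0.0)]
print(kmeans_objective(pts, ctrs, [0, 0, 1], squared_euclidean))  # 0.25 + 0.25 + 0 = 0.5
```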
See Rand [11]. The Rand Index is between 0 and 1:
\( \frac{\binom{n}{2} + 2\sum_{i=1}^{c_{1}} \sum_{j=1}^{c_{2}} \binom{n_{ij}}{2} - \left[ \sum_{i=1}^{c_{1}} \binom{n_{i\cdot}}{2} + \sum_{j=1}^{c_{2}} \binom{n_{\cdot j}}{2} \right]}{\binom{n}{2}} \).
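The Rand Index can be computed directly from the contingency table \(n_{ij}\) of the two partitions, with \(n_{i\cdot}\) and \(n_{\cdot j}\) the row and column totals. A minimal sketch of that formula:

```python
from math import comb

def rand_index(table):
    """Rand Index from the contingency table of two partitions.
    table[i][j] = number of objects in cluster i of partition 1
    and cluster j of partition 2."""
    n = sum(sum(row) for row in table)
    sum_ij = sum(comb(nij, 2) for row in table for nij in row)
    sum_i = sum(comb(sum(row), 2) for row in table)        # row totals n_i.
    sum_j = sum(comb(sum(col), 2) for col in zip(*table))  # column totals n_.j
    return (comb(n, 2) + 2 * sum_ij - (sum_i + sum_j)) / comb(n, 2)

# Identical partitions give a Rand Index of 1.
print(rand_index([[2, 0], [0, 3]]))  # 1.0
```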
Hubert and Arabie [12] developed an adjusted Rand Index.
Other methods, including bagging and random forests, were also implemented; however, owing to their run times, they could not be included in the final report.
We selected 10 folds for run-time reasons, so each fold contains about 328 observations.
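The fold sizes quoted in this note follow from splitting the dataset into 10 nearly equal parts. A sketch, assuming the 3279-instance dataset implied by the note's "about 328 observations" per fold:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds whose sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(3279, 10)
print([len(f) for f in folds])  # [328, 328, 328, 328, 328, 328, 328, 328, 328, 327]
```

In practice each fold would be drawn from a shuffled index order; contiguous blocks are used here only to keep the sketch short.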
References
Kushmerick N (1999) Learning to remove internet advertisements. In: Agents’99, proceedings of the third annual conference on autonomous agents, pp 175–181
Fiol-Roig G, Miró-Julià M, Herraiz E (2011) Data mining techniques for web page classification. In: Highlights in practical applications of agents and multiagent systems, pp 61–68. https://link.springer.com/chapter/10.1007/978-3-642-19917-2_8
Iyengar V, Apte C, Zhang T (2000) Active learning using adaptive resampling. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, 2000, pp 91–98. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.5070&rep=rep1&type=pdf
Alvarez S, Kawato T, Ruiz C (2003) Mining over loosely coupled data sources using neural experts. Computer Science Department, Boston College, Boston. https://pdfs.semanticscholar.org/bf09/0b728a3798fe4f95cc009590674fa555c316.pdf
Cohen S, Ruppin E, Dror G (2005) Feature selection based on the Shapley value. In: Proceedings of the 19th international joint conference on artificial intelligence, pp 665–670
Li Z, Wang Y, Bing Y (2005) Advertisement image detection. Manuscript. http://www.cas.mcmaster.ca/~wangy22/public/FinalReport.pdf
Almonte I, Anden R, Schwarzbek S (2012) Evaluating machine learning classification methods for internet advertising data. Manuscript http://www.irvinalmonte.com/wp-content/uploads/2016/10/IrvinAlmonte_MachineLearning_CollaborativeSample_R.pdf
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data—SIGMOD'93. p 207. https://doi.org/10.1145/170035.170072
McNicholas PD, Murphy TB, O'Regan M (2008) Standardising the lift of an association rule. Comput Stat Data Anal 52(10):4712–4721
Witten IH, Frank E (2000) Data mining: practical machine learning tools and techniques with Java implementations. Academic Press, New York
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
O’Meara, G. Mining and Classifying Images from an Advertisement Image Remover. Ann. Data. Sci. 6, 279–303 (2019). https://doi.org/10.1007/s40745-018-0164-1