Abstract
Web spam remains a serious threat to search engines because the presence of illegitimate pages can severely degrade the quality of their results. Several works have been carried out with the aim of reducing the impact of spam content. Regardless of the approach taken, every proposal has had to deal with a corpus in which the number of legitimate pages and the number of web sites with spam content differ by a very large margin. Unbalanced data is a well-known problem in many practical applications of machine learning, and it has significant effects on the performance of standard classifiers. Focusing on web spam detection, the objective of this work is two-fold: to evaluate the effect of the class imbalance ratio on popular classifiers such as Naïve Bayes, SVM and C5.0, and to assess how their performance can be improved when different types of techniques are combined in an unbalanced scenario.
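As a rough illustration of the class imbalance ratio discussed in the abstract, the following minimal sketch computes the ratio for a toy label set and applies random under-sampling, one common rebalancing technique. The abstract does not name the specific techniques evaluated in the paper, so the helper names (imbalance_ratio, random_undersample), the 95/5 split and the assumption that legitimate pages form the majority class are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_undersample(samples, labels, seed=0):
    """Randomly keep only as many items per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy corpus (an assumption): 95 legitimate hosts, 5 spam hosts.
labels = ["legit"] * 95 + ["spam"] * 5
samples = list(range(len(labels)))

ratio = imbalance_ratio(labels)

# A trivial classifier that always answers "legit" looks accurate on this
# corpus while detecting no spam at all, which is why imbalance matters.
majority_accuracy = labels.count("legit") / len(labels)

balanced = random_undersample(samples, labels)
balanced_counts = Counter(label for _, label in balanced)

print(ratio, majority_accuracy, dict(balanced_counts))
```

Under-sampling discards majority-class examples, so it trades information for balance; over-sampling and cost-sensitive learning are alternatives with different trade-offs.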
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Fdez-Glez, J., Ruano-Ordás, D., Fdez-Riverola, F., Méndez, J.R., Pavón, R., Laza, R. (2015). Analyzing the Impact of Unbalanced Data on Web Spam Classification. In: Omatu, S., et al. Distributed Computing and Artificial Intelligence, 12th International Conference. Advances in Intelligent Systems and Computing, vol 373. Springer, Cham. https://doi.org/10.1007/978-3-319-19638-1_28
DOI: https://doi.org/10.1007/978-3-319-19638-1_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19637-4
Online ISBN: 978-3-319-19638-1
eBook Packages: Engineering, Engineering (R0)