Analyzing the Impact of Unbalanced Data on Web Spam Classification

Fdez-Glez, J.; Ruano-Ordás, D.; Fdez-Riverola, F.; Méndez, J. R.; Pavón, R.; Laza, R.

doi:10.1007/978-3-319-19638-1_28


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 373)

Abstract

Web spam remains a serious threat to search engines because illegitimate pages can severely degrade the quality of their results. Several works have tried to reduce the impact of spam content. Regardless of the approach taken, every proposal has had to deal with a corpus in which the number of legitimate pages and the number of web sites with spam content differ enormously. Unbalanced data is a well-known problem in many practical applications of machine learning and has significant effects on the performance of standard classifiers. Focusing on web spam detection, the objective of this work is two-fold: to evaluate the effect of the class imbalance ratio on popular classifiers such as Naïve Bayes, SVM and C5.0, and to assess how their performance can be improved when different types of techniques are combined in an unbalanced scenario.
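To make the notion of a class imbalance ratio concrete, the following is a minimal sketch of random under-sampling to a chosen ratio. It is an illustration only, not the authors' procedure: the function name, the toy label list, the listed ratios and the assumption that spam is the minority class are all hypothetical.

```python
import random

def undersample_to_ratio(labels, ratio, minority=1, seed=0):
    """Return sorted indices keeping every minority item and at most
    ratio * (minority count) randomly chosen majority items."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    return sorted(minority_idx + rng.sample(majority_idx, keep))

# Hypothetical corpus: 1 = spam (assumed minority), 0 = legitimate.
labels = [1] * 50 + [0] * 950
for ratio in (1, 2, 4, 8):
    idx = undersample_to_ratio(labels, ratio)
    n_spam = sum(labels[i] for i in idx)
    print(f"1:{ratio} -> {n_spam} spam, {len(idx) - n_spam} legitimate")
```

Training the same classifier on each of the resulting subsets and comparing the scores would be one way to observe how sensitive it is to the ratio.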

Author information

Authors and Affiliations

Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
J. Fdez-Glez, D. Ruano-Ordás, F. Fdez-Riverola, J. R. Méndez, R. Pavón & R. Laza

Corresponding author

Correspondence to J. Fdez-Glez.

Editor information

Editors and Affiliations

Department of Electronics, Information and Communication Engineering, Osaka Institute of Technology, Osaka, Osaka, Japan
Sigeru Omatu
College of Engineering, Qatar University, Doha, Qatar
Qutaibah M. Malluhi
Department of Computing Science, Faculty of Science, University of Salamanca, Salamanca, Spain
Sara Rodríguez Gonzalez
Department of Electronics and Computers Division of Computer Science and Managem, Koszalin University of Technology, Koszalin, Poland
Grzegorz Bocewicz
Department of Philosophical, Pedagogical and Economic-Quantitative Sciences, University of Chieti-Pescara, Pescara, Italy
Edgardo Bucciarelli
Department of Philosophical, Pedagogical and Economic-Quantitative Sciences, University of Chieti-Pescara, Pescara, Italy
Gianfranco Giulioni
Zayed University, Abu Dhabi Campus, Abu Dhabi, United Arab Emirates
Farkhund Iqba

About this paper

Cite this paper

Fdez-Glez, J., Ruano-Ordás, D., Fdez-Riverola, F., Méndez, J.R., Pavón, R., Laza, R. (2015). Analyzing the Impact of Unbalanced Data on Web Spam Classification. In: Omatu, S., et al. Distributed Computing and Artificial Intelligence, 12th International Conference. Advances in Intelligent Systems and Computing, vol 373. Springer, Cham. https://doi.org/10.1007/978-3-319-19638-1_28

DOI: https://doi.org/10.1007/978-3-319-19638-1_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19637-4
Online ISBN: 978-3-319-19638-1
eBook Packages: Engineering, Engineering (R0)
