URL-Based Web Page Classification: With n-Gram Language Models

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 553)


In some settings it is important to classify a web page efficiently and reliably using only the information in its Uniform Resource Locator (URL), without visiting the page itself. For example, a social media website may need to quickly identify status updates that link to malicious websites so that it can block them. A URL is very concise and may consist of concatenated words, so classification from this information alone is a challenging task. Existing methods, such as the all-grams approach, which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets.
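To illustrate why extracting all sub-strings becomes expensive, the following is a minimal sketch of all-grams style feature extraction. The function name `all_grams` and the length bounds `min_len` and `max_len` are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Set


def all_grams(url: str, min_len: int = 4, max_len: int = 8) -> Set[str]:
    """Return every sub-string of `url` with length in [min_len, max_len].

    The length bounds are illustrative assumptions, not values from the paper.
    """
    text = url.lower()
    grams: Set[str] = set()
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the whole URL string.
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


# Toy input with short bounds so the result is easy to inspect.
features = all_grams("abcde", min_len=2, max_len=3)
```

Each URL of length L contributes up to L - n + 1 features for every length n considered, so the feature space grows quickly with the number and length of URLs.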

We recently proposed a new method for URL-based web page classification: an n-gram language model that provides competitive accuracy and scales to larger datasets. The method can also classify new URLs that contain unseen sub-sequences. In this paper we extend that presentation and include additional results to validate the approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that the method is competitive in accuracy with the best known methods while also scaling well to larger datasets.
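The general idea of classifying a URL with per-class n-gram language models can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses character trigrams, add-one (Laplace) smoothing, and uniform class priors, and the class `CharNgramLM`, the helpers `train` and `classify`, and the toy URLs are all hypothetical. Smoothing is what lets the model assign non-zero probability to sub-sequences it has never seen.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


class CharNgramLM:
    """Character n-gram language model with additive smoothing (assumed setup)."""

    def __init__(self, n: int = 3, alpha: float = 1.0) -> None:
        self.n = n
        self.alpha = alpha  # smoothing constant; 1.0 gives add-one smoothing
        self.ngram_counts: Counter = Counter()
        self.context_counts: Counter = Counter()
        self.vocab: set = set()

    def _pad(self, url: str) -> str:
        # "^" marks the start of the URL and "$" marks the end.
        return "^" * (self.n - 1) + url.lower() + "$"

    def fit(self, urls: Iterable[str]) -> "CharNgramLM":
        for url in urls:
            s = self._pad(url)
            self.vocab.update(s)
            for i in range(self.n - 1, len(s)):
                context = s[i - self.n + 1:i]
                self.ngram_counts[(context, s[i])] += 1
                self.context_counts[context] += 1
        return self

    def log_prob(self, url: str, vocab_size: int) -> float:
        s = self._pad(url)
        total = 0.0
        for i in range(self.n - 1, len(s)):
            context = s[i - self.n + 1:i]
            num = self.ngram_counts[(context, s[i])] + self.alpha
            den = self.context_counts[context] + self.alpha * vocab_size
            total += math.log(num / den)
        return total


def train(labelled: Iterable[Tuple[str, str]], n: int = 3) -> Dict[str, CharNgramLM]:
    """Fit one language model per class label."""
    by_class: Dict[str, List[str]] = {}
    for url, label in labelled:
        by_class.setdefault(label, []).append(url)
    return {label: CharNgramLM(n).fit(urls) for label, urls in by_class.items()}


def classify(models: Dict[str, CharNgramLM], url: str) -> str:
    """Pick the class whose model gives the URL the highest log-probability."""
    vocab = set().union(*(m.vocab for m in models.values()))
    vocab_size = len(vocab) + 1  # one extra slot for characters unseen in training
    return max(models, key=lambda label: models[label].log_prob(url, vocab_size))


# Toy, made-up training data purely for illustration.
toy_data = [
    ("example.edu/course/ml", "course"),
    ("example.edu/course/db", "course"),
    ("example.edu/staff/smith", "staff"),
    ("example.edu/staff/jones", "staff"),
]
models = train(toy_data)
predicted = classify(models, "example.edu/course/ai")
```

Training cost is linear in the total length of the training URLs, and classifying a URL requires one pass over its characters per class, which is the property that makes this family of models attractive for larger datasets.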


Language models, Information retrieval, Web classification, Web mining, Machine learning



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. School of Computing Sciences, University of East Anglia, Norwich, UK
