IC3K 2014: Knowledge Discovery, Knowledge Engineering and Knowledge Management, pp. 19–33
URL-Based Web Page Classification: With n-Gram Language Models
Abstract
In some situations it is important to classify a web page efficiently and reliably from the information contained in its Uniform Resource Locator (URL) alone, without visiting the page itself. For example, a social media website may need to quickly identify status updates that link to malicious websites so that it can block them. A URL is very concise and may be composed of concatenated words, so classification using only this information is a challenging task. Methods proposed for this task, such as the all-grams approach, which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets.
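To make the scaling concern concrete, the following is a minimal sketch of all-grams feature extraction, assuming it means enumerating every contiguous character sub-string of a URL. The function name `all_grams`, the length bounds `min_n` and `max_n`, and the example URL are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def all_grams(text: str, min_n: int = 1, max_n: Optional[int] = None) -> List[str]:
    """Return every contiguous sub-string of `text` with length in [min_n, max_n].

    Illustrative only: the length bounds are assumptions, not values from the paper.
    """
    if max_n is None:
        max_n = len(text)
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the string.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


# Hypothetical URL; restricting lengths to 4..8 is an arbitrary choice for the example.
features = all_grams("cs.example.edu/courses", min_n=4, max_n=8)
```

With no length bound, a string of length L yields L(L+1)/2 sub-strings (counting repeats), so the feature space grows quickly with URL length and corpus size.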
We have recently proposed a new method for URL-based web page classification: an n-gram language model that provides competitive accuracy and scales to larger datasets. The method can also classify new URLs that contain previously unseen sub-sequences. In this paper we extend our presentation and include additional results to validate the proposed approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that the method is competitive in accuracy with the best known methods while also scaling well to larger datasets.
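The general idea of classifying with n-gram language models can be sketched as follows: train one character-level model per class and assign a URL to the class whose model gives it the highest log-probability. This is an illustration under stated assumptions, not the authors' implementation. Add-one (Laplace) smoothing is used here only for brevity, since the abstract does not name the smoothing method; class priors are assumed equal; and the class `CharNGramLM`, the function `classify`, the padding symbols, and the toy URLs are all hypothetical.

```python
import math
from collections import Counter


class CharNGramLM:
    """Character n-gram language model with add-one smoothing (an assumption)."""

    def __init__(self, n: int = 3):
        self.n = n
        self.ngram_counts = Counter()    # counts of (context + next character)
        self.context_counts = Counter()  # counts of each (n-1)-character context
        self.vocab = set()

    def _pad(self, url: str) -> str:
        # "^" and "$" are hypothetical start/end markers.
        return "^" * (self.n - 1) + url.lower() + "$"

    def fit(self, urls):
        for url in urls:
            s = self._pad(url)
            self.vocab.update(s)
            for i in range(self.n - 1, len(s)):
                context = s[i - self.n + 1:i]
                self.ngram_counts[context + s[i]] += 1
                self.context_counts[context] += 1
        return self

    def log_prob(self, url: str) -> float:
        s = self._pad(url)
        v = len(self.vocab) + 1  # one extra slot for characters never seen in training
        total = 0.0
        for i in range(self.n - 1, len(s)):
            context = s[i - self.n + 1:i]
            # Smoothing keeps the probability non-zero for unseen sub-sequences.
            numerator = self.ngram_counts[context + s[i]] + 1
            denominator = self.context_counts[context] + v
            total += math.log(numerator / denominator)
        return total


def classify(url: str, models: dict) -> str:
    """Return the label whose model scores the URL highest (equal priors assumed)."""
    return max(models, key=lambda label: models[label].log_prob(url))


# Toy, made-up training data for two hypothetical classes.
models = {
    "course": CharNGramLM(n=3).fit(["cs.example.edu/courses/cs101",
                                    "example.edu/course/algorithms"]),
    "faculty": CharNGramLM(n=3).fit(["cs.example.edu/~smith",
                                     "example.edu/faculty/jones"]),
}
predicted = classify("example.edu/courses/cs202", models)
```

Because every n-gram receives a non-zero smoothed probability, a URL containing sub-sequences absent from the training data still gets a finite score from each class model. The order `n` and the choice of smoothing are the kind of parameters whose impact the paper reports testing.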
Keywords
Language models, Information retrieval, Web classification, Web mining, Machine learning