Towards ontology-based multilingual URL filtering: a big data problem

Abstract

Web content filtering is one of many techniques for limiting exposure to selected content on the Internet. Filtering has become routine over time, yet filtering multilingual web content remains difficult, especially in the big data landscape. The sheer volume of data makes it harder to develop an effective content filtering system that works in real time. Several existing systems use artificial intelligence techniques to filter URLs and identify sites with objectionable content, but most of them classify only English-language URLs. When they process multilingual URLs, these systems either fail to respond or over-block. This paper introduces a filtering system that classifies multilingual URLs based on predefined criteria for the URL, title, and metadata of a web page. Ontological approaches, together with local multilingual dictionaries, serve as the knowledge base for the challenging task of blocking URLs that do not meet the filtering criteria. The proposed work shows high accuracy in classifying multilingual URLs into two categories, white and black. Evaluation on a large dataset shows that the proposed system achieves promising accuracy, on a par with that reported in the state-of-the-art literature on semantic-based URL filtering.
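As a rough illustration of the idea summarized above, the following minimal sketch classifies a page as white or black by matching tokens from its URL, title, and metadata against per-language term lists. It is not the authors' system: the `BLOCKED_TERMS` dictionary, its placeholder terms, and the `tokenize` and `classify` helpers are assumptions made for this example, and a plain dictionary lookup stands in for the ontology-based knowledge base the paper describes.

```python
import re
from urllib.parse import urlsplit, unquote

# Hypothetical multilingual dictionary: language code -> blocked terms.
# These placeholder terms are assumptions, not the paper's knowledge base.
BLOCKED_TERMS = {
    "en": {"casino", "gambling"},
    "es": {"apuestas"},
    "de": {"glücksspiel"},
}


def tokenize(text):
    """Case-fold the text and return its set of Unicode letter-only tokens."""
    return set(re.findall(r"[^\W\d_]+", text.casefold()))


def classify(url, title="", metadata=""):
    """Return 'black' if any field contains a blocked term, else 'white'."""
    parts = urlsplit(url)
    # Decode percent-encoded characters so non-ASCII terms in the URL match.
    url_text = unquote(f"{parts.netloc} {parts.path} {parts.query}")
    tokens = tokenize(url_text) | tokenize(title) | tokenize(metadata)
    for terms in BLOCKED_TERMS.values():
        if tokens & terms:
            return "black"
    return "white"
```

Matching here is at the whole-token level, so a blocked term embedded in a longer word is not detected; that choice lowers the risk of over-blocking at the cost of missed matches.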

Notes

  1. https://datahub.io/dataset/dmoz

Corresponding author

Correspondence to Alavalapati Goutham Reddy.

Cite this article

Hussain, M., Ahmed, M., Khattak, H.A. et al. Towards ontology-based multilingual URL filtering: a big data problem. J Supercomput 74, 5003–5021 (2018). https://doi.org/10.1007/s11227-018-2338-1

Keywords

  • Filtering
  • Information processing
  • Classification
  • Ontology engineering
  • Big data