An Empirical Study on Word Sense Disambiguation for Adult Content Filtering

  • Igor Santos
  • Patxi Galán-García
  • Carlos Laorden Gómez
  • Javier Nieves
  • Borja Sanz
  • Pablo García Bringas
  • Jose Maria Gómez
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 299)

Abstract

The Internet is a powerful source of information. However, as with other media, different types of information are aimed at different audiences. In particular, adult content should not be accessible to children. In this context, several content filtering approaches have been proposed in both industry and academia. Some of these approaches use the text of a webpage to build a classic bag-of-words model, which is then used to categorise pages and filter inappropriate content. To the best of our knowledge, these methods use no semantic information at all and may therefore be evaded by attacks that exploit the well-known ambiguity of natural language. Given this background, we present the first semantics-aware adult content filtering approach, which applies a word-sense-disambiguation step before modelling webpages in order to deal with this ambiguity. We show that this approach can improve the filtering results of the classic statistical models.
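The contrast between a plain bag-of-words model and a model built after a disambiguation step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy sense inventory `SENSES`, the sense labels, the example sentences, and the simple context-overlap rule used to choose a sense. It is not the disambiguation method or the lexical resource used in the paper.

```python
from collections import Counter

# Hypothetical toy sense inventory (illustrative only): each sense of an
# ambiguous word is described by a few cue words. A real system would rely
# on a proper lexical resource and disambiguation algorithm.
SENSES = {
    "escort": {
        "escort#guard": {"police", "convoy", "security", "protect"},
        "escort#adult": {"adult", "service", "hire", "night"},
    },
}

PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(PUNCTUATION).lower() for t in text.split())
    return [t for t in tokens if t]


def bag_of_words(text):
    """Classic bag-of-words: counts of surface tokens, no sense information."""
    return Counter(tokenize(text))


def disambiguate(tokens):
    """Replace each ambiguous token with the sense whose cue words overlap
    most with the document's tokens (a simplified overlap heuristic)."""
    context = set(tokens)
    result = []
    for tok in tokens:
        senses = SENSES.get(tok)
        if senses is None:
            result.append(tok)  # unambiguous in this toy inventory
            continue
        best = max(senses, key=lambda s: len(senses[s] & context))
        result.append(best)
    return result


def bag_of_senses(text):
    """Bag-of-words built after the disambiguation step."""
    return Counter(disambiguate(tokenize(text)))


doc_a = "The police escort will protect the convoy."
doc_b = "Hire an adult escort service for the night."

# Word-level features cannot tell the two uses of "escort" apart.
print(bag_of_words(doc_a)["escort"], bag_of_words(doc_b)["escort"])

# Sense-level features keep them as separate dimensions.
print(bag_of_senses(doc_a))
print(bag_of_senses(doc_b))
```

In the word-level representation both documents contribute to the same `escort` feature, whereas the sense-level representation assigns them to different features, which is the kind of ambiguity a purely statistical filter cannot resolve.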

Keywords

information filtering, content filtering, machine learning, web categorisation

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Igor Santos (1)
  • Patxi Galán-García (1)
  • Carlos Laorden Gómez (1)
  • Javier Nieves (1)
  • Borja Sanz (1)
  • Pablo García Bringas (1)
  • Jose Maria Gómez (1)

  1. DeustoTech Computing, Universidad de Deusto, Bilbao, Spain
