Language Resources and Evaluation, Volume 52, Issue 4, pp 949–968

Open set evaluation of web genre identification

  • Dimitrios Pritsos
  • Efstathios Stamatatos
Original Paper

Abstract

Web genre detection is a task that can enhance information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. Most previous studies in this field adopt the closed-set scenario, where a given palette comprises all available genre labels. However, this is not a realistic setup, since web genres are constantly enriched with new labels and existing web genres evolve over time. Open-set classification, where some pages used in the evaluation phase do not belong to any of the known genres, is a more realistic setup for this task. In this case, all pages not belonging to known genres can be seen as noise. This paper focuses on the systematic evaluation of open-set web genre identification when the noise is either structured or unstructured. Two open-set methods, combined with alternative text representation schemes and similarity measures, are tested on two benchmark corpora. Moreover, we adopt the openness test for web genre identification, which enables the observation of effectiveness for a varying number of known/unknown labels.
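
As a rough illustration of the difference between the closed-set and open-set scenarios, the sketch below assigns a page to its most similar known genre but answers "unknown" when no genre is similar enough. It is a generic nearest-centroid example with a rejection threshold, not either of the two methods evaluated in the paper. The genre names, the three-dimensional feature vectors, the cosine similarity, and the threshold value are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify_open_set(page, centroids, threshold):
    """Return the most similar known genre, or "unknown" when the best
    similarity falls below the (hypothetical) rejection threshold."""
    best_label, best_sim = None, -1.0
    for label, center in centroids.items():
        sim = cosine(page, center)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else "unknown"


# Toy training data: two known genres with made-up term-weight vectors.
train = {
    "blog": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "eshop": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
centroids = {genre: centroid(vectors) for genre, vectors in train.items()}

known_page = [1.0, 0.0, 0.0]   # resembles the "blog" examples
noise_page = [0.0, 1.0, 0.0]   # resembles neither known genre

print(classify_open_set(known_page, centroids, threshold=0.8))
print(classify_open_set(noise_page, centroids, threshold=0.8))
```

A closed-set classifier would be forced to label the second page as one of the known genres. The rejection threshold is what lets an open-set classifier treat such a page as noise.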

Keywords

Web genre identification, Information retrieval, Natural language processing, Random feature selection

Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. University of the Aegean, Karlovassi, Samos, Greece
