Open-Set Classification for Automated Genre Identification

  • Conference paper
Advances in Information Retrieval (ECIR 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7814)

Abstract

Automated Genre Identification (AGI) of web pages is a problem of increasing importance, since web genre information (e.g. blogs, news, e-shops) can enhance modern Information Retrieval (IR) systems. The state of the art in this field treats AGI as a closed-set classification problem, for which a variety of web page representations and machine learning models have been intensively studied. In this paper, we study AGI as an open-set classification problem, which better reflects the real-world conditions of exploiting AGI in practice. Focusing on the use of content information, different text representation methods (words and character n-grams) are tested. Moreover, two classification methods are examined: one-class SVM learners, used as a baseline, and an ensemble of classifiers based on random feature subspacing, originally proposed for author identification. It is demonstrated that very high precision can be achieved in open-set AGI while recall remains relatively high.
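
To make the open-set idea concrete, the following is a minimal sketch of one common way to build a random feature subspacing ensemble over character n-grams. It is not the authors' implementation: the centroid-per-genre representation, the cosine similarity measure, and the names and default values of `iterations`, `subset_frac`, and `sigma` are all assumptions chosen for illustration. The key open-set behaviour is that the classifier may answer "unknown" (here `None`) when no genre wins often enough.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams of a lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_pages, n=4):
    """Sum the n-gram counts of all training pages of each genre."""
    centroids = {}
    for text, genre in labelled_pages:
        centroids.setdefault(genre, Counter()).update(char_ngrams(text, n))
    return centroids


def classify_open_set(text, centroids, iterations=100, subset_frac=0.5,
                      sigma=0.8, n=4, seed=0):
    """Return a genre label, or None when no genre wins often enough.

    All parameter names and defaults are hypothetical, not from the paper.
    """
    rng = random.Random(seed)
    page = char_ngrams(text, n)
    vocabulary = sorted(set().union(*centroids.values()))
    if not vocabulary:
        return None
    subset_size = max(1, int(subset_frac * len(vocabulary)))
    votes = Counter()
    for _ in range(iterations):
        # Each ensemble member sees only a random subset of the features.
        features = rng.sample(vocabulary, subset_size)
        scores = {g: cosine(page, c, features) for g, c in centroids.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0.0:  # cast no vote when nothing overlaps
            votes[best] += 1
    if not votes:
        return None
    genre, count = votes.most_common(1)[0]
    # Open-set decision: accept only if the winner takes at least a
    # fraction `sigma` of all iterations; otherwise answer "unknown".
    return genre if count / iterations >= sigma else None


# Toy training data, invented for this example.
centroids = train_centroids([
    ("add to cart buy now free shipping on all orders", "eshop"),
    ("today i wrote a new post about my weekend trip", "blog"),
])
print(classify_open_set("buy now and add to cart with free shipping", centroids))
print(classify_open_set("0000 1111 2222", centroids))
```

Raising `sigma` makes the sketch abstain more often, which trades recall for precision; this is the kind of trade-off the abstract refers to when it reports very high precision with relatively high recall.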

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pritsos, D.A., Stamatatos, E. (2013). Open-Set Classification for Automated Genre Identification. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_18

  • DOI: https://doi.org/10.1007/978-3-642-36973-5_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36972-8

  • Online ISBN: 978-3-642-36973-5

  • eBook Packages: Computer Science (R0)
