Web Genre Classification via Hierarchical Multi-label Classification

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2015 (IDEAL 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9375)

Abstract

The growing number of web pages calls for better search engines. One such improvement is to let users specify the desired web genre of the returned pages, which creates a need to predict a page's genre from the information it contains. This task is typically addressed as multi-class classification, with some recent studies advocating multi-label classification. In this paper, we propose to exploit the web genre labels by constructing a hierarchy of web genres and then using methods for hierarchical multi-label classification to boost predictive performance. We use two methods for hierarchy construction: expert-based and data-driven. The evaluation on a benchmark dataset (the 20-Genre collection corpus) reveals that using a hierarchy of web genres significantly improves the predictive performance of the classifiers, and that the data-driven hierarchy yields performance similar to that of the expert-based one, with the added value that it is obtained automatically and quickly.
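To make the idea of a label hierarchy concrete, the following is a minimal sketch, not the method used in the paper. It assumes a hypothetical two-level hierarchy (the genre and group names in `PARENT` are invented for illustration) and shows the usual hierarchy constraint in hierarchical multi-label classification: a page labelled with a genre is also treated as belonging to every ancestor of that genre. The function name `expand_with_ancestors` is likewise an assumption.

```python
from typing import Dict, Iterable, Set

# Hypothetical hierarchy for illustration only: child genre -> parent group.
# Nodes without an entry are treated as top-level.
PARENT: Dict[str, str] = {
    "blog": "personal",
    "journalistic": "informative",
    "faq": "informative",
}


def expand_with_ancestors(labels: Iterable[str], parent: Dict[str, str]) -> Set[str]:
    """Return the given labels together with all of their ancestors."""
    expanded: Set[str] = set()
    for label in labels:
        node = label
        # Walk up the hierarchy until the top, or until we reach a node
        # that has already been added (its ancestors are then present too).
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent.get(node)
    return expanded


if __name__ == "__main__":
    # A page labelled with two genres receives their parent groups as well.
    print(sorted(expand_with_ancestors(["blog", "faq"], PARENT)))
```

With the expanded label sets, a hierarchical multi-label learner can be trained on both the original genres and the groups above them; whether the groups come from an expert or from the data only changes how `PARENT` is built.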

The first two authors should be regarded as joint first authors.

Acknowledgments

We acknowledge the financial support of the European Commission through the grant ICT-2013-612944 MAESTRA.

Author information

Corresponding author

Correspondence to Dragi Kocev.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Madjarov, G., Vidulin, V., Dimitrovski, I., Kocev, D. (2015). Web Genre Classification via Hierarchical Multi-label Classification. In: Jackowski, K., Burduk, R., Walkowiak, K., Wozniak, M., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2015. IDEAL 2015. Lecture Notes in Computer Science, vol 9375. Springer, Cham. https://doi.org/10.1007/978-3-319-24834-9_2

  • DOI: https://doi.org/10.1007/978-3-319-24834-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24833-2

  • Online ISBN: 978-3-319-24834-9

  • eBook Packages: Computer Science, Computer Science (R0)
