Searching for Ground Truth: A Stepping Stone in Automating Genre Classification

  • Yunhyong Kim
  • Seamus Ross
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4877)


This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward the automated extraction of descriptive metadata for digital material. Here, we present results from experiments with human labellers. These experiments were conducted to assist in genre characterisation, to predict the obstacles that an automated system must overcome, and to contribute to a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers, based on image and stylistic modelling features, in labelling the data on which three human labellers agreed across fifteen genre classes.
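As a rough illustration of how labels from three human labellers might be reduced to an agreed data set, the sketch below keeps a document only when enough labellers chose the same genre. The abstract does not say whether agreement means unanimity or a majority, so both thresholds are shown. The function name, document identifiers, and genre names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def agreement_label(labels, min_votes):
    """Return the genre chosen by at least `min_votes` labellers, else None."""
    genre, votes = Counter(labels).most_common(1)[0]
    return genre if votes >= min_votes else None


# Hypothetical labels from three human labellers for four documents.
# The genre names are illustrative only, not the paper's fifteen classes.
assignments = {
    "doc1": ["Academic Article", "Academic Article", "Academic Article"],
    "doc2": ["Email", "Memo", "Email"],
    "doc3": ["Form", "Table", "Manual"],
    "doc4": ["Minutes", "Minutes", "Minutes"],
}

# Assumption: "agreement" may mean all three labellers (3 votes)
# or a simple majority (2 votes); the source does not specify which.
unanimous = {doc: agreement_label(labels, 3) for doc, labels in assignments.items()}
majority = {doc: agreement_label(labels, 2) for doc, labels in assignments.items()}

# Keep only the documents on which the required agreement was reached.
unanimous_set = {doc: genre for doc, genre in unanimous.items() if genre is not None}
majority_set = {doc: genre for doc, genre in majority.items() if genre is not None}
```

Under these assumptions, a stricter threshold yields a smaller but less ambiguous set of documents for evaluating a classifier, while the majority threshold retains more documents at the cost of including contested labels.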


information extraction, genre classification, automated metadata extraction, metadata, digital library, data management





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Yunhyong Kim (1)
  • Seamus Ross (1)
  1. Digital Curation Centre (DCC) & Humanities Advanced Technology Information Institute (HATII), University of Glasgow, Glasgow, UK
