Genre Classification in Automated Ingest and Appraisal Metadata

  • Yunhyong Kim
  • Seamus Ross
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4172)


Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method built on looking at the documents from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
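
The idea of classifying a document from several directions at once can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy "views" below stand in for the first two directions (visual format and string layout), and the genre labels, the `columns` and `text` fields, the cue words, and the score-summing rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical interface: a view maps a document to per-genre scores.
View = Callable[[dict], Dict[str, float]]


def layout_view(doc: dict) -> Dict[str, float]:
    """Toy stand-in for the visual-format direction: looks at column count only."""
    if doc.get("columns", 1) >= 2:
        return {"scientific_article": 0.7, "business_letter": 0.3}
    return {"scientific_article": 0.4, "business_letter": 0.6}


def text_view(doc: dict) -> Dict[str, float]:
    """Toy stand-in for the string-layout direction: counts a few cue words."""
    counts = Counter(doc.get("text", "").lower().split())
    article = counts["abstract"] + counts["references"]
    letter = counts["dear"] + counts["sincerely"]
    total = article + letter
    if total == 0:
        # No cue words found: this view expresses no preference.
        return {"scientific_article": 0.5, "business_letter": 0.5}
    return {
        "scientific_article": article / total,
        "business_letter": letter / total,
    }


def classify(doc: dict, views: List[View]) -> str:
    """Sum the per-genre scores from every view and return the top genre."""
    totals: Dict[str, float] = {}
    for view in views:
        for genre, score in view(doc).items():
            totals[genre] = totals.get(genre, 0.0) + score
    return max(totals, key=totals.get)


views = [layout_view, text_view]
paper = {"columns": 2, "text": "Abstract We propose a method References"}
letter = {"columns": 1, "text": "Dear colleague thank you sincerely"}
print(classify(paper, views))
print(classify(letter, views))
```

Summing scores is only one way to combine views; a real system would more likely train a classifier per direction and learn how to weight them.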


Keywords: Digital Library; Physical Science Research Council; Digital Material; Genre Classification; Graphic Recognition
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yunhyong Kim (1)
  • Seamus Ross (1)
  1. Digital Curation Centre (DCC) & Humanities Advanced Technology Information Institute (HATII), University of Glasgow, Glasgow, UK
