Formulating Representative Features with Respect to Genre Classification

Chapter
Part of the Text, Speech and Language Technology book series (TLTB, volume 42)

Abstract

Document classification is one of the most fundamental steps in enabling the search, selection, and ranking of digital material according to its relevance to a predefined search query. As such, it is a valuable means of knowledge discovery and an essential part of the effective and efficient management of digital documents in a repository, library, or archive.

Keywords

Genre; Feature selection; Burstiness of terms; Classification; Automation; Ground truth; Document; Human labeling agreement

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Humanities Advanced Technology and Information Institute (HATII), University of Glasgow, Glasgow, UK
  2. School of Computing, Robert Gordon University, Aberdeen, UK
  3. iSchool, University of Toronto, Toronto, Canada
