Genre identification for office document search and browsing

  • Francine Chen
  • Andreas Girgensohn
  • Matthew Cooper
  • Yijuan Lu
  • Gerry Filby
Original Paper

Abstract

When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.
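As a rough illustration of open-set tagging with an ensemble, the sketch below combines accept/reject decisions from several per-genre detectors by strict majority vote. The abstract does not describe how the authors' ensemble is built, so majority voting is an assumption made here for illustration. The function name `ensemble_tags` and the example decisions are hypothetical. Because the task is open-set, a page that no genre accepts simply receives no tag.

```python
from typing import Dict, List

# The four coarse genres named in the abstract.
GENRES = ["technical paper", "photo", "slide", "table"]


def ensemble_tags(votes: Dict[str, List[bool]]) -> List[str]:
    """Return the genres accepted by a strict majority of their detectors.

    Assumption: each genre has its own set of binary detectors, and their
    decisions are combined by majority vote. This is an illustrative
    stand-in, not the ensemble method evaluated in the paper.
    """
    tags = []
    for genre in GENRES:
        decisions = votes.get(genre, [])
        # Strict majority: more than half of the detectors must accept.
        if decisions and sum(decisions) * 2 > len(decisions):
            tags.append(genre)
    return tags


# Hypothetical decisions from three detectors per genre for two pages.
slide_page = {
    "technical paper": [False, False, True],
    "photo": [False, False, False],
    "slide": [True, True, False],
    "table": [False, False, False],
}
other_page = {genre: [False, False, False] for genre in GENRES}

print(ensemble_tags(slide_page))  # only "slide" reaches a majority
print(ensemble_tags(other_page))  # open-set case: no genre is assigned
```

In a faceted search interface, the returned list could be stored as the page's genre facet values, with an empty list leaving the page untagged.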

Keywords

Genre identification · Office documents · Image features · Text features · Classification

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Francine Chen (1)
  • Andreas Girgensohn (1)
  • Matthew Cooper (1)
  • Yijuan Lu (2)
  • Gerry Filby (1)

  1. FX Palo Alto Laboratory, Inc., Palo Alto, USA
  2. Texas State University, San Marcos, USA
