The Utility of Information Extraction in the Classification of Books

  • Tom Betts
  • Maria Milosavljevic
  • Jon Oberlander
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4425)

Abstract

We describe work on automatically assigning classification labels to books using the Library of Congress Classification scheme. This task is non-trivial due to the volume and variety of books that exist. We explore the utility of Information Extraction (IE) techniques within this text categorisation (TC) task, automatically extracting structured information from the full text of books. Experimental evaluation of performance involves a corpus of books from Project Gutenberg. Results indicate that a classifier which combines methods and tools from IE and TC significantly improves over a state-of-the-art text classifier, achieving a classification performance of F(β=1) = 0.8099.
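
The reported score is an F-measure with β = 1, which under the standard definition is the harmonic mean of precision and recall. The following is a minimal sketch of that definition only; the function name `f_beta` and the example values are illustrative assumptions and are not taken from the paper, whose averaging scheme across classes is not stated in the abstract.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-measure: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # By convention, return 0 when both precision and recall are 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical precision/recall values, for illustration only.
example_score = f_beta(precision=0.8, recall=0.8)
```

With β = 1 the measure weights precision and recall equally; larger values of β weight recall more heavily.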

Keywords

Information Extraction; Named Entity Recognition; Book Categorisation; Project Gutenberg; Ontologies; Digital Libraries

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Tom Betts (1)
  • Maria Milosavljevic (1)
  • Jon Oberlander (1)
  1. School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
