Analyzing Document Collections via Context-Aware Term Extraction

  • Daniel A. Keim
  • Daniela Oelke
  • Christian Rohrdantz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5723)

Abstract

In large collections of documents that are divided into predefined classes, the differences and similarities between those classes are of special interest. This paper presents an approach that automatically extracts terms from such document collections, describing which topics discriminate a single class from the others (discriminating terms) and which topics discriminate a subset of the classes from the remaining ones (overlap terms). The importance for real-world applications and the effectiveness of our approach are demonstrated by two practical examples. In the first application, our predefined classes correspond to different scientific conferences. By extracting terms from collections of papers published at these conferences, we automatically determine the topical differences and similarities of the conferences. In the second application, we extract terms from a collection of product reviews that show which features reviewers commented on. We obtain these terms by discriminating the product review class against a suitable counter-balance class. Finally, our method is evaluated by comparing it to alternative approaches.
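
To make the distinction between discriminating terms and overlap terms concrete, the following is a minimal sketch and not the method of the paper, whose details the abstract does not give. It assumes whitespace tokenization and a simple frequency-ratio rule; the function names, the `ratio` threshold, and the toy classes "A", "B" and "C" are hypothetical and chosen only for illustration.

```python
from collections import Counter


def class_term_frequencies(docs_by_class):
    """Map each class label to the relative frequency of each of its terms."""
    freqs = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc.lower().split())
        total = sum(counts.values())
        freqs[label] = {term: n / total for term, n in counts.items()}
    return freqs


def contrast_terms(docs_by_class, target_classes, ratio=2.0):
    """Return terms that occur in every target class and whose lowest
    relative frequency among the targets exceeds `ratio` times their
    highest relative frequency in any other class.

    Assumption: this ratio rule is a stand-in for illustration only.
    One target class gives discriminating terms; several give overlap terms.
    """
    freqs = class_term_frequencies(docs_by_class)
    targets = set(target_classes)
    others = [label for label in freqs if label not in targets]
    # Only terms present in all target classes can qualify.
    candidates = set.intersection(*(set(freqs[label]) for label in targets))
    selected = []
    for term in candidates:
        low = min(freqs[label][term] for label in targets)
        high = max((freqs[label].get(term, 0.0) for label in others), default=0.0)
        if low > ratio * high:
            selected.append(term)
    return sorted(selected)


# Hypothetical toy collection with three predefined classes.
docs = {
    "A": ["visualization layout clustering", "visualization layout data"],
    "B": ["clustering index data", "index clustering data"],
    "C": ["retrieval ranking data", "ranking retrieval data"],
}

discriminating_a = contrast_terms(docs, ["A"])     # terms that set A apart
overlap_ab = contrast_terms(docs, ["A", "B"])      # terms shared by A and B only
print(discriminating_a, overlap_ab)
```

In this toy collection, "data" is selected in neither call because it is frequent in every class, while "clustering" is not discriminating for class A alone but does separate the subset {A, B} from class C.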

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Daniel A. Keim (1)
  • Daniela Oelke (1)
  • Christian Rohrdantz (1)
  1. University of Konstanz, Germany