Abstract
In large document collections that are divided into predefined classes, the differences and similarities between those classes are of special interest. This paper presents an approach that automatically extracts terms from such collections, describing both the topics that discriminate a single class from all others (discriminating terms) and the topics that discriminate a subset of the classes from the remaining ones (overlap terms). Two practical examples demonstrate the relevance of the approach to real-world applications and its effectiveness. In the first application, the predefined classes correspond to different scientific conferences. By extracting terms from collections of papers published at these conferences, we automatically determine the topical differences and similarities between the conferences. In the second application, we extract terms from a collection of product reviews that show which features the reviewers commented on. We obtain these terms by discriminating the product review class against a suitable counter-balance class. Finally, we evaluate our method by comparing it to alternative approaches.
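To make the distinction between discriminating terms and overlap terms concrete, the following is a minimal sketch and not the method of the paper. It assumes a simple document-frequency criterion: a term is kept if it occurs in at least a share `high` of the documents of some classes and in at most a share `low` of the documents of every other class. The function names, the thresholds, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def term_share(docs):
    """Return, for each term, the fraction of documents that contain it."""
    counts = Counter()
    for doc in docs:
        # Count each term at most once per document.
        counts.update(set(doc.lower().split()))
    return {term: n / len(docs) for term, n in counts.items()}


def classify_terms(classes, high=0.6, low=0.2):
    """Split terms into discriminating terms and overlap terms.

    Assumed criterion (not taken from the paper): a term is frequent in a
    class if its share there is >= high, and it must be rare (share <= low)
    in every class where it is not frequent.
    """
    shares = {name: term_share(docs) for name, docs in classes.items()}
    vocab = set().union(*shares.values())
    discriminating, overlap = {}, {}
    for term in sorted(vocab):
        frequent = sorted(
            name for name in classes if shares[name].get(term, 0.0) >= high
        )
        rare_elsewhere = all(
            shares[name].get(term, 0.0) <= low
            for name in classes
            if name not in frequent
        )
        # Skip terms that are frequent nowhere, frequent everywhere,
        # or not clearly absent from the remaining classes.
        if not frequent or len(frequent) == len(classes) or not rare_elsewhere:
            continue
        if len(frequent) == 1:
            discriminating.setdefault(frequent[0], []).append(term)
        else:
            overlap.setdefault(tuple(frequent), []).append(term)
    return discriminating, overlap


# Hypothetical toy collection: three classes with two tiny "documents" each.
classes = {
    "vis": ["graph layout rendering", "graph rendering color"],
    "db": ["query index transaction", "query index storage"],
    "ir": ["query index ranking", "query ranking retrieval"],
}
discriminating, overlap = classify_terms(classes)
print(discriminating)
print(overlap)
```

In this toy collection, "query" is frequent in two of the three classes and absent from the third, so it is reported as an overlap term, whereas "index" is dropped because it is neither clearly shared nor clearly absent in the third class. The two-class case (a review class against a counter-balance class) is covered by the same function, where only discriminating terms can occur.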
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Keim, D.A., Oelke, D., Rohrdantz, C. (2010). Analyzing Document Collections via Context-Aware Term Extraction. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds) Natural Language Processing and Information Systems. NLDB 2009. Lecture Notes in Computer Science, vol 5723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12550-8_13
DOI: https://doi.org/10.1007/978-3-642-12550-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12549-2
Online ISBN: 978-3-642-12550-8
eBook Packages: Computer Science, Computer Science (R0)