Discovery of Novel Term Associations in a Document Collection

Hynönen, Teemu; Mahler, Sébastien; Toivonen, Hannu

doi:10.1007/978-3-642-31830-6_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7250)

Abstract

We propose a method to mine novel, document-specific associations between terms in a collection of unstructured documents. We believe that documents are often best described by the relationships they establish. This is also evidenced by the popularity of conceptual maps, mind maps, and other similar methodologies for organizing and summarizing information. Our goal is to discover term relationships that can be used to construct conceptual maps or so-called BisoNets.

The model we propose, tpf–idf–tpu, looks for pairs of terms that are associated in an individual document. It considers three aspects, two of which have been generalized from tf–idf to term pairs: term pair frequency (tpf; importance for the document), inverse document frequency (idf; uniqueness in the collection), and term pair uncorrelation (tpu; independence of the terms). The last component is needed to filter out statistically dependent pairs that are not likely to be considered novel or interesting by the user.
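
The abstract names the three components but gives no formulas, so the following is only a rough sketch of how such a score could be combined, not the authors' definitions. It assumes that tpf counts the sentences of a document in which both terms occur, that idf is the logarithm of the collection size divided by the number of documents containing the pair in a shared sentence, and that tpu is one minus the Jaccard overlap of the two terms' document sets. The function name `score_pairs`, the multiplicative combination, and the toy data are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def score_pairs(docs):
    """Score term pairs per document.

    docs: list of documents; each document is a list of sentences;
          each sentence is a list of (already normalized) terms.
    Returns one dict per document mapping (term_a, term_b) -> score.
    All three component definitions below are assumptions for illustration.
    """
    n_docs = len(docs)
    term_docs = {}        # term -> set of document indices containing it
    pair_df = Counter()   # pair -> number of documents where it shares a sentence
    pair_tf = []          # per document: pair -> number of shared sentences

    for i, doc in enumerate(docs):
        tf = Counter()
        for sentence in doc:
            uniq = sorted(set(sentence))  # sorted so each pair has one canonical order
            for term in uniq:
                term_docs.setdefault(term, set()).add(i)
            tf.update(combinations(uniq, 2))
        pair_tf.append(tf)
        pair_df.update(tf.keys())

    scores = []
    for tf in pair_tf:
        doc_scores = {}
        for (a, b), tpf in tf.items():
            # Assumed idf: rarer pairs in the collection score higher.
            idf = math.log(n_docs / pair_df[(a, b)])
            # Assumed tpu: terms that nearly always appear in the same
            # documents are treated as dependent and score close to zero.
            both = len(term_docs[a] & term_docs[b])
            either = len(term_docs[a] | term_docs[b])
            tpu = 1.0 - both / either
            doc_scores[(a, b)] = tpf * idf * tpu
        scores.append(doc_scores)
    return scores


# Toy collection: "graph" and "mining" co-occur everywhere, "music" is rare.
docs = [
    [["graph", "mining"], ["graph", "music"]],
    [["graph", "mining"]],
    [["graph", "mining"], ["text"]],
]
scores = score_pairs(docs)
best = max(scores[0], key=scores[0].get)  # highest-scoring pair in document 0
```

In this toy collection the pair that appears in every document receives a zero score in the first document, while the pair specific to that document ranks highest, which is the kind of behaviour the abstract describes.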

We present experimental results on two collections of documents: one extracted from Wikipedia, and one containing text mining articles with manually assigned term associations. The results indicate that the tpf–idf–tpu method can discover novel associations, that these associations differ from simply pairing tf–idf keywords, and that they better match the subjective associations of a reader.

Author information

Authors and Affiliations

Department of Computer Science and HIIT, University of Helsinki, Finland
Teemu Hynönen, Sébastien Mahler & Hannu Toivonen

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Konstanz, Konstanz, Germany
Michael R. Berthold

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this chapter

Cite this chapter

Hynönen, T., Mahler, S., Toivonen, H. (2012). Discovery of Novel Term Associations in a Document Collection. In: Berthold, M.R. (eds) Bisociative Knowledge Discovery. Lecture Notes in Computer Science, vol 7250. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31830-6_7

DOI: https://doi.org/10.1007/978-3-642-31830-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31829-0
Online ISBN: 978-3-642-31830-6
eBook Packages: Computer Science, Computer Science (R0)
