Name Discrimination by Clustering Similar Contexts

Pedersen, Ted; Purandare, Amruta; Kulkarni, Anagha

doi:10.1007/978-3-540-30586-6_24

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

It is relatively common for different people or organizations to share the same name. Given the increasing amount of information available online, this results in the ever-growing possibility of finding misleading or incorrect information due to confusion caused by an ambiguous name. This paper presents an unsupervised approach that resolves name ambiguity by clustering the instances of a given name into groups, each of which is associated with a distinct underlying entity. The features we employ to represent the context of an ambiguous name are statistically significant bigrams that occur in the same context as the ambiguous name. From these features we create a co-occurrence matrix where the rows and columns represent the first and second words in bigrams, and the cells contain their log-likelihood scores. Then we represent each of the contexts in which an ambiguous name appears with a second order context vector. This is created by taking the average of the vectors from the co-occurrence matrix associated with the words that make up each context. This creates a high dimensional "instance by word" matrix that is reduced to its most significant dimensions by Singular Value Decomposition (SVD). The different "meanings" of a name are discriminated by clustering these second order context vectors with the method of Repeated Bisections. We evaluate this approach by conflating pairs of names found in a large corpus of text to create ambiguous pseudo-names. We find that our method is significantly more accurate than the majority classifier, and that the best results are obtained by having a small amount of local context to represent the instance, along with a larger amount of context for identifying features, or vice versa.
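
The feature pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation; it only mirrors the steps named above (log-likelihood scored bigrams, a word-by-word co-occurrence matrix, averaged second order context vectors, and SVD reduction). The function names, the use of adjacent-word bigrams, and the significance cutoff of 3.841 (the chi-square critical value at p = 0.05 with one degree of freedom) are assumptions made for illustration. The clustering step (Repeated Bisections) is left out.

```python
import math
from collections import Counter

import numpy as np


def log_likelihood(n11, n1p, np1, npp):
    """G^2 score of a bigram from its 2x2 contingency table.

    n11: count of the bigram, n1p: bigrams starting with word 1,
    np1: bigrams ending with word 2, npp: total number of bigrams.
    """
    observed = (n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11)
    expected = (
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    )
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def cooccurrence_matrix(contexts, threshold=3.841):
    """Word-by-word matrix whose cells hold G^2 scores of significant bigrams.

    `threshold` is an assumed cutoff, not a value taken from the paper.
    """
    bigrams = Counter()
    for tokens in contexts:
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs (assumption)
    first, second = Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        first[w1] += n
        second[w2] += n
    total = sum(bigrams.values())
    vocab = sorted(set(first) | set(second))
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)))
    for (w1, w2), n in bigrams.items():
        score = log_likelihood(n, first[w1], second[w2], total)
        if score >= threshold:
            matrix[index[w1], index[w2]] = score
    return matrix, index


def second_order_vector(tokens, matrix, index):
    """Average the co-occurrence rows of the known words in one context."""
    rows = [matrix[index[t]] for t in tokens if t in index]
    if not rows:
        return np.zeros(matrix.shape[1])
    return np.mean(rows, axis=0)


def reduce_dimensions(instances, k):
    """Project the instance-by-word matrix onto its top-k singular directions."""
    u, s, _ = np.linalg.svd(instances, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]


if __name__ == "__main__":
    # Toy contexts; a real run would use text surrounding the ambiguous name.
    contexts = [["new", "york", "city"], ["new", "york", "times"]]
    matrix, index = cooccurrence_matrix(contexts)
    instances = np.vstack(
        [second_order_vector(c, matrix, index) for c in contexts])
    print(reduce_dimensions(instances, k=2).shape)
```

The reduced vectors returned by `reduce_dimensions` are what a clustering method would then group, one cluster per underlying entity.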

Author information

Authors and Affiliations

University of Minnesota, Duluth, MN, 55812, USA
Ted Pedersen & Anagha Kulkarni
University of Pittsburgh, Pittsburgh, PA, 15260, USA
Amruta Purandare

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Pedersen, T., Purandare, A., Kulkarni, A. (2005). Name Discrimination by Clustering Similar Contexts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_24

DOI: https://doi.org/10.1007/978-3-540-30586-6_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)
