Multilingual Name Disambiguation with Semantic Information
This paper studies the problem of name ambiguity, which concerns the discovery of the different underlying meanings behind a name. We have developed a semantic approach, on the basis of which a graph-based clustering algorithm determines the sets of semantically related sentences that talk about the same name. Our approach is evaluated on Bulgarian, Romanian, Spanish and English for various pairs of city, country, person and organization names. The results significantly outperform a majority-based classifier and are compared to a bigram co-occurrence approach.
Keywords: Singular Value Decomposition, Semantic Information, Semantic Similarity, Latent Semantic Analysis, Vector Space Model
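The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea it describes: sentences containing an ambiguous name are mapped into a latent semantic space with a truncated singular value decomposition, and a similarity graph over those sentences is then split into groups. Everything here is an assumption made for illustration: the function names (`lsa_sentence_vectors`, `cosine_matrix`, `cluster_by_threshold`), the toy sentences, the choice of `k=2` and the `0.5` threshold. The clustering step is a plain connected-components grouping, used as a simple stand-in and not the graph-based algorithm evaluated in the paper.

```python
import numpy as np


def lsa_sentence_vectors(sentences, k=2):
    """Represent each sentence in a k-dimensional latent space via truncated SVD."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-by-sentence count matrix: rows are terms, columns are sentences.
    X = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            X[index[w], j] += 1.0
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(S))
    # Sentence coordinates are the right singular vectors scaled by the singular values.
    return (np.diag(S[:k]) @ Vt[:k]).T


def cosine_matrix(V):
    """Pairwise cosine similarity between the rows of V."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty sentences
    U = V / norms
    return U @ U.T


def cluster_by_threshold(sim, threshold=0.5):
    """Label the connected components of the graph whose edges have sim >= threshold."""
    n = sim.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical toy contexts for the ambiguous name "paris".
sentences = [
    "paris capital france city",
    "paris france capital tourists",
    "paris hilton actress celebrity",
    "paris hilton celebrity show",
]
vectors = lsa_sentence_vectors(sentences, k=2)
labels = cluster_by_threshold(cosine_matrix(vectors), threshold=0.5)
print(labels)
```

In this toy setting the two city contexts and the two person contexts fall into separate groups, because the second latent dimension separates them even though every sentence shares the ambiguous name. On real data the number of latent dimensions and the threshold would have to be tuned per language and name pair.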