Identifying Co-referential Names Across Large Corpora

  • Levon Lloyd
  • Andrew Mehler
  • Steven Skiena
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4009)


A single logical entity can be referred to by several different names over a large text corpus. We present our algorithm for finding all such co-reference sets in a large corpus. Our algorithm involves three steps: morphological similarity detection, contextual similarity analysis, and clustering. Finally, we present experimental results on a large corpus of real news text to analyze the performance of our techniques.
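The abstract names the three steps but gives no detail on them, so the following is only a minimal sketch of how such a pipeline could be wired together, not the authors' implementation. Everything specific in it is an assumption made for illustration: `difflib.SequenceMatcher` stands in for morphological similarity, cosine similarity over bag-of-words context counts stands in for contextual similarity, connected components (via union-find) stand in for the clustering step, and the thresholds, function names, and toy data are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from math import sqrt


def morphological_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (assumption: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def coreference_sets(contexts, morph_threshold=0.6, context_threshold=0.5):
    """Group names whose spelling AND context are both similar.

    `contexts` maps each name to a Counter of words seen near it.
    Both thresholds are hypothetical values chosen for the toy data.
    """
    names = list(contexts)
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        # Step 1: keep only pairs that look alike as strings.
        if morphological_similarity(a, b) < morph_threshold:
            continue
        # Step 2: require that the pair also occurs in similar contexts.
        if cosine_similarity(contexts[a], contexts[b]) < context_threshold:
            continue
        # Step 3 (stand-in for clustering): merge the two components.
        parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


# Toy data, invented for illustration only.
example_contexts = {
    "George Bush": Counter({"president": 3, "white": 1, "house": 1, "iraq": 2}),
    "George W. Bush": Counter({"president": 2, "white": 1, "house": 1, "texas": 1}),
    "Kate Bush": Counter({"singer": 2, "album": 1, "music": 1}),
}

if __name__ == "__main__":
    for group in coreference_sets(example_contexts):
        print(sorted(group))
```

In this toy data, "Kate Bush" is close enough in spelling to "George Bush" to pass the string filter, and it is the contextual check that keeps the two apart. A pairwise loop like this is quadratic in the number of names, so at corpus scale the candidate pairs would need to be narrowed first.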


Noun Phrase; Large Corpus; Cosine Similarity; Contextual Similarity; Computational Linguistics





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Levon Lloyd (1)
  • Andrew Mehler (1)
  • Steven Skiena (1)
  1. Department of Computer Science, State University of New York at Stony Brook, Stony Brook, USA
