A Network Analysis Model for Disambiguation of Names in Lists

  • Bradley Malin
  • Edoardo Airoldi
  • Kathleen M. Carley
Abstract

In research and application, social networks are increasingly extracted from relationships inferred by name collocations in text-based documents. Although names represent real entities, they are not unique identifiers, and it is often unclear when two name observations correspond to the same underlying entity. One confounder stems from ambiguity, in which the same name correctly references multiple entities. Prior name disambiguation methods measured similarity between two names as a function of their respective documents. In this paper, we propose an alternative similarity metric based on the probability of walking from one ambiguous name to another in a random walk of the social network constructed from all documents. We experimentally validate our model on actor-actor relationships derived from the Internet Movie Database. Using a global similarity threshold, we demonstrate that random walks achieve a significant increase in disambiguation capability in comparison to prior models.
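
The random-walk similarity described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the graph construction (one node per occurrence of an ambiguous name, one shared node for every other name, edges between names that appear in the same list), the uniform transition probabilities, the fixed step limit, and the names `build_graph` and `walk_probability` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(documents, ambiguous):
    """Build an undirected co-occurrence graph from lists of names.

    Assumption: each occurrence of an ambiguous name is its own node,
    (name, document_index); every other name is one node, (name, None).
    """
    adj = defaultdict(set)
    for i, names in enumerate(documents):
        nodes = [(n, i) if n in ambiguous else (n, None) for n in names]
        for u, v in combinations(nodes, 2):
            if u != v:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def walk_probability(adj, start, target, steps):
    """Probability that a uniform random walk from `start` reaches
    `target` within `steps` steps (the target is treated as absorbing)."""
    mass = {start: 1.0}
    reached = 0.0
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in mass.items():
            neighbours = adj.get(node, ())
            if not neighbours:
                continue
            share = p / len(neighbours)
            for nb in neighbours:
                if nb == target:
                    reached += share  # walk stops once it hits the target
                else:
                    nxt[nb] += share
        mass = nxt
    return reached


# Hypothetical toy data: three lists that all mention "John Smith".
docs = [
    ["John Smith", "Alice", "Bob"],
    ["John Smith", "Bob", "Carol"],
    ["John Smith", "Dave", "Erin"],
]
graph = build_graph(docs, ambiguous={"John Smith"})
p01 = walk_probability(graph, ("John Smith", 0), ("John Smith", 1), steps=3)
p02 = walk_probability(graph, ("John Smith", 0), ("John Smith", 2), steps=3)
```

In this toy data the first two occurrences share a collaborator, so the walk can pass between them, while the third occurrence sits in a disconnected part of the network. A global threshold on the resulting score would then decide which occurrences are grouped as the same entity.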

Keywords

disambiguation, social networks, link analysis, random walks, clustering

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Bradley Malin (1, 2)
  • Edoardo Airoldi (1)
  • Kathleen M. Carley (2)
  1. Data Privacy Laboratory, Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  2. Center for the Computational Analysis of Social and Organizational Systems, Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA