A Network Analysis Model for Disambiguation of Names in Lists

Computational & Mathematical Organization Theory

Abstract

In research and application, social networks are increasingly extracted from relationships inferred from name collocations in text-based documents. Although names represent real entities, they are not unique identifiers, and it is often unclear when two name observations correspond to the same underlying entity. One confounder stems from ambiguity, in which the same name correctly refers to multiple entities. Prior name disambiguation methods measured similarity between two names as a function of their respective documents. In this paper, we propose an alternative similarity metric based on the probability of walking from one ambiguous name to another in a random walk on the social network constructed from all documents. We experimentally validate our model on actor-actor relationships derived from the Internet Movie Database. Using a global similarity threshold, we demonstrate that random walks achieve a significant increase in disambiguation capability compared with prior models.
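
The random-walk similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the graph construction, the step limit, and all names (`build_graph`, `walk_probability`, the toy documents) are assumptions made for illustration. Each occurrence of an ambiguous name is treated as its own node, all other names are shared nodes, and names listed in the same document are linked. The similarity of two ambiguous observations is then taken as the probability that a uniform random walk started at one reaches the other within a fixed number of steps.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(docs, ambiguous):
    """Build an undirected co-occurrence graph from name lists.

    docs: dict mapping a document id to a list of distinct names.
    ambiguous: set of names whose occurrences must be told apart.
    Each occurrence of an ambiguous name becomes the node (name, doc_id);
    every other name is a single node shared across documents.
    """
    adj = defaultdict(set)
    for doc_id, names in docs.items():
        nodes = [(n, doc_id) if n in ambiguous else n for n in names]
        for u, v in combinations(nodes, 2):
            if u != v:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def walk_probability(adj, start, target, steps):
    """Probability that a uniform random walk from `start` reaches
    `target` within `steps` steps (the target is treated as absorbing)."""
    mass = {start: 1.0}  # probability mass of walks that have not hit target
    reached = 0.0
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in mass.items():
            nbrs = adj.get(node, ())
            for nb in nbrs:
                share = p / len(nbrs)
                if nb == target:
                    reached += share
                else:
                    nxt[nb] += share
        mass = nxt
    return reached


# Toy data (hypothetical): three lists that each contain "J. Smith".
docs = {
    "d1": ["J. Smith", "Ann", "Bob"],
    "d2": ["J. Smith", "Bob", "Cara"],
    "d3": ["J. Smith", "Dan", "Eve"],
}
adj = build_graph(docs, ambiguous={"J. Smith"})
s1, s2, s3 = ("J. Smith", "d1"), ("J. Smith", "d2"), ("J. Smith", "d3")

near = walk_probability(adj, s1, s2, steps=3)  # d1 and d2 share "Bob"
far = walk_probability(adj, s1, s3, steps=3)   # d3 is disconnected from d1
print(near, far)
```

Under a global threshold, two observations would be merged into one entity when their score exceeds a fixed cutoff. In this toy data the observations in d1 and d2 score higher than those in d1 and d3 because they share a collaborator. The walk probability is not symmetric in general, so a practical version would need to choose a direction or combine both.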

Author information

Corresponding author

Correspondence to Bradley Malin.

Additional information

Bradley A. Malin is a Ph.D. candidate in the School of Computer Science at Carnegie Mellon University. He is an NSF IGERT fellow in the Center for Computational Analysis of Social and Organizational Systems (CASOS) and a researcher at the Laboratory for International Data Privacy. His research is interdisciplinary and combines aspects of bioinformatics, data forensics, data privacy and security, entity resolution, and public policy. He has developed learning algorithms for surveillance in distributed systems and designed formal models for the evaluation and the improvement of privacy enhancing technologies in real world environments, including healthcare and the Internet. His research on privacy in genomic databases has received several awards from the American Medical Informatics Association and has been cited in congressional briefings on health data privacy. He currently serves as managing editor of the Journal of Privacy Technology.

Edoardo M. Airoldi is a Ph.D. student in the School of Computer Science at Carnegie Mellon University. Currently, he is a researcher in the CASOS group and at the Center for Automated Learning and Discovery. His methodology is based on probability theory, approximation theorems, discrete mathematics and their geometries. His research interests include data mining and machine learning techniques for temporal and relational data, data linkage and data privacy, with important applications to dynamic networks, biological sequences and large collections of texts. His research on dynamic network tomography is the state of the art for recovering information about who is communicating with whom in a network, and it received honors from the ACM SIG-KDD community. Several companies focusing on information extraction have adopted his methodology for text analysis. He is currently investigating practical and theoretical aspects of hierarchical mixture models for temporal and relational data, and an abstract theory of data linkage.

Kathleen M. Carley is a Professor of Computer Science in ISRI, School of Computer Science at Carnegie Mellon University. She received her Ph.D. in Sociology from Harvard. Her research combines cognitive science, social and dynamic networks, and computer science (particularly artificial intelligence and machine learning techniques) to address complex social and organizational problems. Her specific research areas are computational social and organization science, social adaptation and evolution, social and dynamic network analysis, and computational text analysis. Her models meld multi-agent technology with network dynamics and empirical data. Three of the large-scale tools she and the CASOS group have developed are: BioWar, a city-scale model of weaponized biological attacks and response; Construct, a model of the co-evolution of social and knowledge networks; and ORA, a statistical toolkit for dynamic social network data.

About this article

Cite this article

Malin, B., Airoldi, E. & Carley, K.M. A Network Analysis Model for Disambiguation of Names in Lists. Comput Math Organiz Theor 11, 119–139 (2005). https://doi.org/10.1007/s10588-005-3940-3

  • DOI: https://doi.org/10.1007/s10588-005-3940-3
