Names: A New Frontier in Text Mining

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2665)


Over the past 15 years the government has funded research in information extraction, with the goal of developing the technology to extract entities, events, and their interrelationships from free text for further analysis. A crucial component of linking entities across documents is the ability to recognize when different name strings are potential references to the same entity. Given the extraordinary range of variation international names can take when rendered in the Roman alphabet, this is a daunting task. This paper surveys existing technologies for name matching and for accomplishing pieces of the cross-document extraction and linking task. It proposes a direction for future work in which existing entity extraction, coreference, and database name matching technologies would be harnessed for cross-document coreference and linking capabilities. The extension of name variant matching to free text will add important text mining functionality for intelligence and security informatics toolkits.
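
To make the name-variant problem concrete, the following is a minimal sketch of how two name strings might be flagged as potential references to the same entity. It is not the method proposed in this paper: the function names (`normalize`, `similarity`, `is_potential_match`), the use of Python's standard-library `difflib.SequenceMatcher`, and the 0.7 threshold are all assumptions chosen for illustration only.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name, strip diacritics, and keep only letters and single spaces."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation and digits with spaces, then collapse runs of whitespace.
    kept = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in no_marks.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 string similarity between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_potential_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as a candidate match; the threshold is an arbitrary assumption."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    for pair in [("Müller", "Muller"), ("Mohammed", "Muhammad"), ("Smith", "Jones")]:
        print(pair, round(similarity(*pair), 2), is_potential_match(*pair))
```

A purely string-based score like this one is only a starting point. It knows nothing about transliteration conventions, name order, or cultural naming patterns, which is the kind of variation the abstract describes as making the task daunting.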


Keywords: Noun Phrase, Free Text, Computational Linguistics, Deception Detection, Coreference Resolution
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  1. Language Analysis Systems, Inc., Herndon
  2. Institute for Security Technology Studies, Dartmouth College, Hanover
