Learning to Match Names Across Languages

  • Inderjeet Mani
  • Alex Yeh
  • Sherri Condon
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

We report on research on matching names written in different scripts across languages. We explore two trainable approaches based on comparing pronunciations. The first, a cross-lingual approach, uses an automatic name-matching program that exploits rules derived from human phonological comparisons of the two languages. The second, a monolingual approach, relies only on automatic comparison of the phonological representations of each pair of names. Alignments produced by each approach are fed to a machine learning algorithm. Results show that the monolingual approach supports machine-learning-based comparison of person names in English and Chinese with an F-measure above 97.0.
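As a rough illustration of the pipeline the abstract outlines (align two pronunciations, turn the alignment into features for a learner, score the result by F-measure), here is a minimal sketch. It is not the chapter's implementation: the unit-cost edit-distance aligner, the feature names, the phoneme symbols, and the helper names `align`, `alignment_features`, and `f_measure` are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

GAP = "-"  # hypothetical gap marker; assumed not to be a phoneme symbol


def align(src: List[str], tgt: List[str]) -> List[Tuple[str, str]]:
    """Align two phoneme sequences using unit-cost edit distance."""
    n, m = len(src), len(tgt)
    # cost[i][j] is the edit distance between src[:i] and tgt[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((src[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, tgt[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def alignment_features(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Summarize an alignment as counts that a classifier could consume."""
    matches = sum(1 for a, b in pairs if a == b)
    gaps = sum(1 for a, b in pairs if GAP in (a, b))
    return {
        "matches": matches,
        "substitutions": len(pairs) - matches - gaps,
        "gaps": gaps,
    }


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure on a 0-100 scale from match/non-match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)
```

In a setup like this, the feature dictionaries for labeled name pairs (match or non-match) would be passed to a classifier of one's choice, and `f_measure` would be computed from its true-positive, false-positive, and false-negative counts on held-out pairs.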

Keywords

Support Vector Machine; Chinese Character; Language Pair; Chinese Script; Linguistic Data Consortium
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

This research has been funded by the MITRE Innovation Program (Public Release Case Number 07–0752). We are also grateful to the reviewers for their insightful comments.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. The MITRE Corporation, Bedford, USA
  2. The MITRE Corporation, McLean, USA