A Trainable Method for the Phonetic Similarity Search in German Proper Names

  • Conference paper
Speech and Computer (SPECOM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Abstract

Efficient methods for similarity search in word databases play a significant role in various applications, such as robust search or indexing of names and addresses, spell-checking algorithms, or the monitoring of trademark rights. The underlying distance measures are associated with the users' similarity criteria, and phonetic search algorithms have been well established for decades. Nonetheless, rule-based phonetic algorithms exhibit some weak points, e.g. their strong language dependency, the search overhead caused by tolerant matching or, conversely, the risk of missing valid matches, which results in a merely pseudo-phonetic functionality in some cases. In contrast, we suggest a novel, adaptive method for similarity search in words, which is based on a trainable grapheme-to-phoneme (G2P) converter that generates the most likely and largely correct pronunciations. Only in a second step is the similarity search performed in the phonemic reference data, using a conventional string metric such as the Levenshtein distance (LD). The G2P algorithm achieves a string accuracy of up to 99.5% on a German pronunciation lexicon and can be trained for different languages or specific domains such as proper names. The similarity tolerance can be easily adjusted by parameters such as the admissible number or likelihood of pronunciation variants, as well as by the phonemic or graphemic LD. As a proof of concept, we compare the G2P-based search method with similarity matches of the conventional Cologne phonetics (Kölner Phonetik, KP) algorithm on a German surname database and a telephone book including first name, surname and street name.
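The two-step idea described in the abstract (first map each name to a phoneme sequence, then compare phoneme sequences with the Levenshtein distance) can be illustrated with a minimal sketch. It is not the authors' implementation: the trained G2P converter is replaced by a hypothetical lookup table `TOY_G2P` with hand-written, SAMPA-like transcriptions, and the names `levenshtein`, `phonetic_search` and `max_distance` are assumptions made for this example only.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for a trained G2P converter: a toy lookup table with
# hand-written, SAMPA-like transcriptions (illustrative, not from the paper).
TOY_G2P: Dict[str, Tuple[str, ...]] = {
    "Meier": ("m", "aI", "6"),
    "Meyer": ("m", "aI", "6"),
    "Maier": ("m", "aI", "6"),
    "Mayr": ("m", "aI", "6"),
    "Mahler": ("m", "a:", "l", "6"),
    "Müller": ("m", "Y", "l", "6"),
    "Miller": ("m", "I", "l", "6"),
}


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance (insert, delete, substitute; unit costs) between two
    sequences of symbols, here phonemes or single characters."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phonetic_search(query: str,
                    lexicon: Dict[str, Tuple[str, ...]],
                    max_distance: int = 1) -> List[Tuple[int, str]]:
    """Return (distance, name) pairs whose phonemic LD to the query is at
    most max_distance, sorted by distance and then by name."""
    query_phonemes = lexicon[query]  # step 1: grapheme-to-phoneme lookup
    hits = []
    for name, phonemes in lexicon.items():
        if name == query:
            continue
        distance = levenshtein(query_phonemes, phonemes)  # step 2: phonemic LD
        if distance <= max_distance:
            hits.append((distance, name))
    return sorted(hits)


if __name__ == "__main__":
    print(phonetic_search("Meier", TOY_G2P, max_distance=0))
    print(phonetic_search("Müller", TOY_G2P, max_distance=1))
```

In this sketch `max_distance` plays the role of the adjustable similarity tolerance. Spelling variants that share one pronunciation match at distance 0, although their graphemic distance is larger than zero.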

Acknowledgements

We would like to thank Haya Hadidi and Tristan Münz from the Federal Network Agency of Germany (Bundesnetzagentur) for initiating this research and for their practical hints on AAV procedures. Further thanks go to Viktor Iaroshenko from HfT Leipzig and to Gabor Pintér from Kobe University in Japan for their project support and advice.

Author information

Corresponding author

Correspondence to Oliver Jokisch.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Jokisch, O., Hain, H.-U. (2017). A Trainable Method for the Phonetic Similarity Search in German Proper Names. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_4

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3

  • eBook Packages: Computer Science, Computer Science (R0)
