Polyphon: An Algorithm for Phonetic String Matching in Russian Language

  • Viacheslav V. Paramonov
  • Alexey O. Shigarov
  • Gennagy M. Ruzhnikov
  • Polina V. Belykh
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 639)

Abstract

Data cleansing is a crucial matter in business intelligence. We propose a new phonetic algorithm for string matching in Russian that requires no transliteration from Cyrillic to Latin characters. It is based on the rules of sound formation in the Russian language. Additionally, we consider an extended algorithm for matching Cyrillic strings in which the letters of a phonetic code are represented by prime numbers and the code of a string is the sum of these numbers. Experimental results show that our algorithms accurately match phonetically similar strings in Russian.
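The prime-sum idea can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the merge table `MERGE` (voiced consonants mapped to voiceless ones, a few vowels merged), the dropped signs in `DROP`, the collapsing of repeated letters, and the assignment of primes to code letters in alphabetical order. Only the overall scheme (normalize a Cyrillic string to phonetic code letters, map each letter to a prime, and sum) follows the description above.

```python
# Hypothetical merge table: NOT the paper's rule set, only an illustration.
MERGE = {
    # voiced consonant -> voiceless counterpart
    "б": "п", "в": "ф", "г": "к", "д": "т", "ж": "ш", "з": "с",
    # similar-sounding vowels merged into one code letter
    "о": "а", "я": "а", "е": "и", "ё": "и", "э": "и", "ы": "и", "ю": "у",
}
DROP = {"ъ", "ь"}  # hard and soft signs carry no sound of their own
RUSSIAN_LETTERS = "абвгдеёжзийклмнопрстуфхцчшщыэюя"


def first_primes(n):
    """Return the first n primes by trial division (fine for tiny n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# One prime per distinct code letter, assigned in sorted order (an assumption).
CODE_LETTERS = sorted({MERGE.get(ch, ch) for ch in RUSSIAN_LETTERS})
PRIME_OF = dict(zip(CODE_LETTERS, first_primes(len(CODE_LETTERS))))


def phonetic_letters(word):
    """Normalize a Cyrillic word to a list of phonetic code letters."""
    result = []
    for ch in word.lower():
        if ch in DROP:
            continue
        ch = MERGE.get(ch, ch)
        if ch not in PRIME_OF:
            continue  # skip anything outside the code alphabet
        if not result or result[-1] != ch:  # collapse repeated letters
            result.append(ch)
    return result


def prime_sum_code(word):
    """Code of a string: the sum of the primes of its phonetic code letters."""
    return sum(PRIME_OF[ch] for ch in phonetic_letters(word))


if __name__ == "__main__":
    for name in ("Иванов", "Иваноф", "Петров"):
        print(name, phonetic_letters(name), prime_sum_code(name))
```

In this sketch the spellings "Иванов" and "Иваноф" receive the same code, while "Петров" receives a different one. Because addition is commutative, a sum-based code ignores letter order, so strings made of the same code letters in a different order (for example "кот" and "ток") also collide here; how the published algorithm handles such cases is not stated in the abstract.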

Keywords

Phonetic algorithms, String matching, Language, Classifiers

Notes

Acknowledgments

The reported study was supported in part by RFBR (grants 15-37-20042, 15-47-04348, 16-07-00411, and 16-57-44034) and by the Council for Grants of the President of the Russian Federation (grant NSh-8081.2016.9). Experiments were performed on the resources of the Shared Equipment Centre of the Integrated Information and Computing Network of the Irkutsk Research and Educational Complex (http://net.icc.ru).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Viacheslav V. Paramonov (1)
  • Alexey O. Shigarov (1)
  • Gennagy M. Ruzhnikov (1)
  • Polina V. Belykh (1)
  1. Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences (ISDCT SB RAS), Irkutsk, Russia
