Phonetic String Matching for Languages with Cyrillic Alphabet
Using phonetic similarity to compare text strings and eliminate misprints is an important problem in philology, and it is widely applied in automatic text checking. Most existing phonetic algorithms are designed for processing English words. Their comparison quality may degrade for other languages, especially those that have rich morphology and use non-Latin alphabets, e.g. East Slavic languages written in Cyrillic. We propose an approach to the phonetic comparison of Russian words. It is based on detecting letters and letter sequences that have similar pronunciation according to the rules of the language. The resulting phonetic representations of the words are coded by prime numbers. The paper evaluates the efficiency of the proposed algorithm. The algorithm has also been adapted for phonetic processing of the Mongolian language.
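The general idea can be illustrated with a minimal sketch. It is not the algorithm from the paper: the replacement rules, the letter classes, and the function names `first_primes` and `phonetic_key` are hypothetical and chosen only for illustration, and the abstract does not specify how the prime codes are combined, so this sketch simply keeps the sequence of primes as the comparison key.

```python
import re

RUSSIAN_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Hypothetical rules for letter sequences that are commonly pronounced alike.
SEQUENCE_RULES = [("тьс", "ц"), ("тс", "ц"), ("сч", "щ")]

# Hypothetical single-letter classes: reduced vowels and voiced/voiceless
# consonant pairs map to one representative; the signs ь and ъ are dropped.
LETTER_CLASSES = str.maketrans({
    "о": "а", "ё": "е", "э": "е",
    "б": "п", "в": "ф", "г": "к", "д": "т", "ж": "ш", "з": "с",
    "ь": None, "ъ": None,
})


def first_primes(n):
    """Return the first n prime numbers using trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Assumption of this sketch: one distinct prime per letter of the alphabet.
PRIME_OF = dict(zip(RUSSIAN_ALPHABET, first_primes(len(RUSSIAN_ALPHABET))))


def phonetic_key(word):
    """Return a tuple of primes describing the phonetic form of a word."""
    s = word.lower()
    for sequence, replacement in SEQUENCE_RULES:
        s = s.replace(sequence, replacement)
    s = s.translate(LETTER_CLASSES)
    s = re.sub(r"(.)\1+", r"\1", s)  # collapse runs of the same letter
    # Characters outside the Russian alphabet are ignored.
    return tuple(PRIME_OF[ch] for ch in s if ch in PRIME_OF)


def sounds_alike(a, b):
    """Two words match when their phonetic keys are identical."""
    return phonetic_key(a) == phonetic_key(b)
```

Under these illustrative rules, `sounds_alike("молоко", "малако")` and `sounds_alike("бояться", "боятся")` both hold, while `sounds_alike("кот", "кит")` does not. A real rule set would need to follow the pronunciation rules of the language far more closely than this sketch does.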
Keywords: Natural language processing, Phonetic algorithms, String comparison, Cyrillic letters
The reported study was supported in part by RFBR (grants 18-07-00758, 17-57-44006, 16-07-00411) and by RFBR and the Government of Irkutsk Region (grant 17-47-380007). Experiments were performed on the resources of the Shared Equipment Centre of Integrated information and computing network of Irkutsk Research and Educational Complex, http://net.icc.ru.