Measuring Spelling Similarity for Cognate Identification
The most commonly used measures of string similarity, such as the Longest Common Subsequence Ratio (LCSR) and those based on Edit Distance, only take into account the number of matched and mismatched characters. However, we observe that cognates belonging to a pair of languages exhibit recurrent spelling differences, such as “ph” and “f” in the English-Portuguese cognates “phase” and “fase”. These differences are attributable to the evolution of the spelling rules of each language over time, and thus should not be penalized in the same way as the arbitrary differences found in non-cognate words when word similarity is used as an indicator of cognaticity.
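To make the baseline measures concrete, the following is a minimal sketch of LCSR and an edit-distance-based similarity using their usual textbook definitions (LCS length divided by the longer word's length, and one minus the Levenshtein distance divided by the longer word's length). The function names are illustrative and the normalization is an assumption; the paper's exact formulation may differ.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length normalized by the longer word (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def edsim(a: str, b: str) -> float:
    """1 - edit distance / longer word length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


print(lcsr("phase", "fase"), edsim("phase", "fase"))
```

Under these definitions both measures score “phase” and “fase” the same way an arbitrary two-character mismatch would be scored, which is the behavior the paragraph above argues against.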
This paper describes SpSim, a new spelling similarity measure for cognate identification that tolerates characteristic spelling differences, which are automatically extracted from a set of cognates known a priori. Compared to LCSR and EdSim (Edit Distance-based similarity), SpSim yields an F-measure 10% higher when used for cognate identification on five different language pairs.
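The general idea of tolerating learned spelling differences can be illustrated with a simplified sketch. This is not the SpSim definition from the paper: it assumes that substitution patterns are harvested from known cognates with Python's `difflib.SequenceMatcher`, that only patterns with non-empty text on both sides are kept, and that a learned pattern costs nothing inside an otherwise standard edit distance. All function names here are hypothetical.

```python
from difflib import SequenceMatcher


def extract_patterns(known_cognates):
    """Collect (source, target) substitution pairs from known cognate pairs.

    Simplification: only 'replace' spans are kept, so pure insertions and
    deletions are ignored.
    """
    patterns = set()
    for src, tgt in known_cognates:
        matcher = SequenceMatcher(None, src, tgt, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                patterns.add((src[i1:i2], tgt[j1:j2]))
    return patterns


def tolerant_distance(a, b, patterns):
    """Edit distance in which any learned pattern may be applied at zero cost."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j  # one of the prefixes is empty
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(d[i - 1][j] + 1,
                       d[i][j - 1] + 1,
                       d[i - 1][j - 1] + cost)
            for x, y in patterns:
                # Apply a learned pattern if both prefixes end with it.
                if (len(x) <= i and len(y) <= j
                        and a[i - len(x):i] == x and b[j - len(y):j] == y):
                    best = min(best, d[i - len(x)][j - len(y)])
            d[i][j] = best
    return d[len(a)][len(b)]


def tolerant_similarity(a, b, patterns):
    """1 - tolerant distance / longer word length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 - tolerant_distance(a, b, patterns) / longest if longest else 1.0


patterns = extract_patterns([("phase", "fase")])
print(patterns, tolerant_similarity("photo", "foto", patterns))
```

In this sketch, a pattern learned from “phase” and “fase” lets an unseen pair with the same “ph” and “f” difference be scored as a perfect match, while pairs with unlearned differences are still penalized as in plain edit distance.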
Keywords: String Similarity, Cognate Identification, Translation Extraction