Detecting Misspelled Words in Turkish Text Using Syllable n-gram Frequencies

  • Rifat Aşliyan
  • Korhan Günel
  • Tatyana Yakhno
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4815)

Abstract

In this study, we design and implement a system that decides whether a word in Turkish text is misspelled. First, three databases of syllable monogram, bigram and trigram frequencies are constructed from the syllables of five different Turkish corpora. The system then takes the words of a Turkish text as input and computes the probability of each word from the syllable monogram, bigram and trigram frequencies in these databases. If the probability of a word is zero, the word is judged to be misspelled. To test the system, we constructed two text databases with the same words: one contains 685 misspelled words and the other contains 685 correctly spelled words. The words from these text databases are given to the system as input, and the system returns one of two results for each word: “Correctly spelled word” or “Misspelled word”. The system built on monogram and bigram frequencies achieves an 86% success rate on the misspelled words and an 88% success rate on the correctly spelled words. The system built on bigram and trigram frequencies achieves a 97% success rate on the misspelled words and a 98% success rate on the correctly spelled words.
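
The decision rule in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact probability formula or the syllabification procedure, so the sketch assumes a maximum-likelihood syllable bigram model with a word-boundary marker and a simplified syllabification rule (one vowel per syllable, with a single consonant directly before a vowel opening that vowel's syllable). The names `syllabify`, `train`, `word_probability`, `is_misspelled` and `BOUNDARY` are hypothetical, and the four-word corpus is a toy stand-in for the five corpora.

```python
from collections import Counter

VOWELS = set("aeıioöuü")
BOUNDARY = "#"  # hypothetical marker for the start and end of a word


def syllabify(word):
    """Split a lowercase Turkish word into syllables (simplified rule, an assumption).

    Every syllable holds exactly one vowel; a consonant directly before a
    vowel starts that vowel's syllable.
    """
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(vowel_positions) < 2:
        return [word]
    starts = [0]
    for pos in vowel_positions[1:]:
        # Start at the preceding consonant, or at the vowel itself if two vowels meet.
        starts.append(pos - 1 if word[pos - 1] not in VOWELS else pos)
    return [word[a:b] for a, b in zip(starts, starts[1:] + [len(word)])]


def train(words):
    """Count bigram contexts and syllable bigrams over a list of correct words."""
    contexts, bigrams = Counter(), Counter()
    for word in words:
        seq = [BOUNDARY] + syllabify(word) + [BOUNDARY]
        contexts.update(seq[:-1])          # every item that has a successor
        bigrams.update(zip(seq, seq[1:]))  # adjacent pairs, including boundaries
    return contexts, bigrams


def word_probability(word, contexts, bigrams):
    """Chain-rule probability of the word under the bigram model (no smoothing)."""
    seq = [BOUNDARY] + syllabify(word) + [BOUNDARY]
    prob = 1.0
    for prev, cur in zip(seq, seq[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0  # an unseen syllable bigram makes the whole word impossible
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


def is_misspelled(word, contexts, bigrams):
    """Apply the zero-probability decision rule described in the abstract."""
    return word_probability(word, contexts, bigrams) == 0.0


if __name__ == "__main__":
    corpus = ["kitap", "kitaplar", "okul", "okullar"]  # toy corpus, illustration only
    contexts, bigrams = train(corpus)
    for candidate in ["kitaplar", "kitpalar"]:
        label = "Misspelled word" if is_misspelled(candidate, contexts, bigrams) else "Correctly spelled word"
        print(candidate, "->", label)
```

Because the model is left unsmoothed, any word containing a syllable sequence absent from the training corpora is flagged, which is why the size and coverage of the corpora matter for the reported success rates on correctly spelled words.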

Keywords

Optical Character Recognition; Bigram Frequency; Text Database; Statistical Natural Language; Trigram Frequency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Rifat Aşliyan (1)
  • Korhan Günel (1)
  • Tatyana Yakhno (1)
  1. Dokuz Eylül University, İzmir, Turkey
