Flipping onsets to enhance syllabification
Two-year-old children who are starting to learn to speak often pronounce a polysyllabic word by flipping the onsets of consecutive syllables. Their speech can then be hard to understand, since the flipped onsets may produce another word with a very different meaning. For instance, the two onsets in the English word “me.lon” (the large round fruit of a plant of the gourd family) can be flipped to produce another word, “le.mon” (an acid fruit). In Bahasa Indonesia, such cases are quite common. For example, the two onsets in the word “ba.tu” (stone) are swapped to give “ta.bu” (taboo), those in “be.sar” (big) are flipped to give “se.bar” (spread), and those in “ru.mah” (house) are swapped to give “mu.rah” (cheap). A preliminary study on 50k Indonesian formal words shows that the ratio between the frequencies of the flipped-onset bigrams and the 50 most frequent original syllable bigrams is quite high, up to 13.09%. This research investigates adopting this phenomenon to enhance a bigram orthographic syllabification model, which commonly performs poorly on out-of-vocabulary words. A five-fold cross-validation on 50k Indonesian formal words shows that flipping onsets enhances the bigram orthographic syllabification: the syllable error rate (SER) is reduced by 18.02% in relative terms. The method is also capable of producing a quite low SER when trained on a tiny training set of 1k words and tested on 10k unseen words. In addition, it can be readily generalized to other languages, as well as to named entities, using a little language-specific knowledge about the sets of vowels, diphthongs, and consonants.
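The onset flip described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`split_onset`, `flip_onsets`, `flip_word`) are hypothetical, and the vowel set is a simplifying assumption that treats only the five letters a, e, i, o, u as vowels and ignores diphthongs. The sketch only generates flipped variants of an already syllabified word. How those variants are weighted inside the bigram model is not stated in the abstract and is not modeled here.

```python
# Assumption: a simplified vowel set; diphthongs are not handled here.
VOWELS = set("aeiou")


def split_onset(syllable):
    """Split a syllable into (onset, rest), where the onset is the
    leading run of consonant letters before the first vowel."""
    for i, ch in enumerate(syllable):
        if ch.lower() in VOWELS:
            return syllable[:i], syllable[i:]
    # No vowel found: treat the whole string as the onset.
    return syllable, ""


def flip_onsets(first, second):
    """Swap the onsets of two consecutive syllables."""
    onset1, rest1 = split_onset(first)
    onset2, rest2 = split_onset(second)
    return onset2 + rest1, onset1 + rest2


def flip_word(syllabified, sep="."):
    """Return one variant per pair of consecutive syllables,
    each with the onsets of that pair swapped."""
    syllables = syllabified.split(sep)
    variants = []
    for i in range(len(syllables) - 1):
        a, b = flip_onsets(syllables[i], syllables[i + 1])
        variants.append(sep.join(syllables[:i] + [a, b] + syllables[i + 2:]))
    return variants


if __name__ == "__main__":
    for word in ["me.lon", "ba.tu", "be.sar", "ru.mah"]:
        print(word, "->", flip_word(word))
```

Run on the four examples from the abstract, the sketch yields “le.mon”, “ta.bu”, “se.bar”, and “mu.rah”. A word with more than two syllables produces one variant per pair of adjacent syllables.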
Keywords: Bigram, Consecutive syllables, Flipping onsets, Orthographic syllabification
I would like to thank my beloved wife, Ari Virgandini, as well as my sons, Muhammad Arkan Ariyanto and Muhammad Agha Ariyanto, for the greatly inspiring idea of flipping your onsets, and also all my colleagues at Telkom University for their support.