Multimedia Tools and Applications, Volume 74, Issue 11, pp 3933–3946

Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary

Article

Abstract

The rapid growth of globalization requires handling a large number of multilingual documents, in which Japanese input coexists with English and other languages that use the Roman alphabet. Conventional Japanese input methods require users to switch the input mode between Japanese and the Latin alphabet. As a current solution, there are modeless Japanese input methods that switch the input mode automatically. However, they need training on a large amount of text data to improve their performance. This paper proposes a hybrid modeless Japanese input method that uses a non-Japanese word dictionary and n-gram character sequence features to decide whether or not to convert the input and switch to Kana input. The non-Japanese word dictionary is used to reduce false positives on non-Japanese words, and it is built from text data available on the Web. The n-gram based discriminative model is learned by a Support Vector Machine from a balanced corpus that contains texts from various domains. The evaluation of our method shows that its F-measure for the prediction of non-Kana characters improves by 7.7 % compared with an n-gram-only method. In addition, a test with real users showed that the average rating for input time fell on the "agree" side for our method, whereas it fell on the "disagree" side for the conventional Japanese input method that requires switching the input mode.
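The hybrid decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy word lists, and the boundary markers are all hypothetical, and a simple perceptron stands in for the Support Vector Machine so that the example needs only the standard library. It shows the two stages in order: a dictionary lookup that vetoes conversion for known non-Japanese words, followed by a linear classifier over character n-gram features.

```python
from collections import Counter

def char_ngrams(token, n_max=3):
    """Count character 1..n_max-grams of a token, with ^ and $ as boundary markers."""
    padded = f"^{token.lower()}$"
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats

def score(weights, feats):
    """Linear score: dot product of the weight vector and the n-gram counts."""
    return sum(weights[f] * v for f, v in feats.items())

def train_perceptron(samples, epochs=50):
    """Learn linear weights; label +1 = convert to Kana, -1 = keep Latin.

    Assumption: a perceptron is used here as a stand-in for the SVM
    named in the abstract.
    """
    weights = Counter()
    for _ in range(epochs):
        for token, label in samples:
            feats = char_ngrams(token)
            if label * score(weights, feats) <= 0:  # misclassified: update
                for f, v in feats.items():
                    weights[f] += label * v
    return weights

def should_convert(token, weights, non_japanese_words):
    """Hybrid decision: dictionary veto first, n-gram classifier second."""
    if token.lower() in non_japanese_words:
        return False  # known non-Japanese word: never convert
    return score(weights, char_ngrams(token)) > 0

# Toy data, invented for illustration only.
TRAINING = [
    ("konnichiwa", +1), ("arigatou", +1), ("nihongo", +1),
    ("sakura", +1), ("watashi", +1), ("tokyo", +1),
    ("the", -1), ("input", -1), ("method", -1),
    ("switch", -1), ("through", -1), ("language", -1),
]
# "sake" is both an English word and a valid romaji string.
NON_JAPANESE_WORDS = {"the", "input", "method", "sake"}

weights = train_perceptron(TRAINING)
for word in ("sake", "sakura", "switch"):
    print(word, should_convert(word, weights, NON_JAPANESE_WORDS))
```

The dictionary check runs before the classifier so that a word such as "sake", which looks like romaji at the character level, is left unconverted no matter how the n-gram model scores it. This is the kind of false positive the abstract says the dictionary is meant to reduce.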

Keywords

Multilingual documents; Modeless Japanese input

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Tokyo Denki University, Chiba, Japan
