LiCord: Language Independent Content Word Finder

  • Md-Mizanur Rahoman
  • Tetsuya Nasukawa
  • Hiroshi Kanayama
  • Ryutaro Ichise
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9648)


Abstract

Content words (CWs) are important segments of text. In text mining, they are used for purposes such as topic identification, document summarization, and question answering. Identifying CWs usually requires language-dependent tools. However, such tools are unavailable for many languages, and developing them for every language is costly. At the same time, given the recent growth of text content in many languages, language-independent text mining has great potential. Mining text automatically requires a way to find CWs that does not depend on language-specific tools. In this research, we devise a framework that identifies which text segments are CWs in a language-independent way. We identify structural features that relate text segments to CWs, compute these features over a large text corpus, and apply machine-learning-based classification to decide whether each segment is a CW. The proposed framework uses only a large text corpus and some training examples; beyond these, it requires no language-specific tool. We conducted experiments with our framework on three languages (English, Vietnamese, and Indonesian) and found that it achieves more than 83% accuracy.
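The pipeline outlined in the abstract (corpus-derived features, then a classifier trained on a few labeled examples) might be sketched roughly as follows. This is an illustration only, not the authors' method: the abstract does not name the features or the classifier, so the three features here (log relative frequency, token length, distinct-neighbor ratio), the whitespace tokenization, the nearest-centroid classifier, and all function names are assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict
import math


def corpus_features(sentences):
    """Map each token to [log relative frequency, length, distinct-neighbor ratio].

    These three features are hypothetical stand-ins for the paper's
    "structural features", which the abstract does not specify.
    """
    freq = Counter()
    neighbors = defaultdict(set)
    for sent in sentences:
        tokens = sent.split()  # assumption: whitespace-delimited segments
        freq.update(tokens)
        for i, tok in enumerate(tokens):
            if i > 0:
                neighbors[tok].add(tokens[i - 1])
            if i + 1 < len(tokens):
                neighbors[tok].add(tokens[i + 1])
    total = sum(freq.values())
    return {
        tok: [math.log(n / total), float(len(tok)), len(neighbors[tok]) / n]
        for tok, n in freq.items()
    }


def train_centroids(features, labels):
    """Average the feature vectors of the labeled tokens, per label.

    labels maps token -> bool (True means content word).
    """
    sums, counts = {}, Counter()
    for tok, lab in labels.items():
        vec = features[tok]
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def classify(vec, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))


# Toy corpus and training examples, invented for illustration.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog met on the mat",
]
labels = {"the": False, "on": False, "cat": True, "mat": True}

features = corpus_features(sentences)
centroids = train_centroids(features, labels)
prediction = classify(features["dog"], centroids)
print("dog is a content word:", prediction)
```

Nothing in this sketch consults a tagger, lexicon, or other language-specific resource, which is the property the abstract emphasizes. A corpus this small says nothing about accuracy.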


Keywords

  • Content words
  • Language independent
  • Word segmentation



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Md-Mizanur Rahoman (1)
  • Tetsuya Nasukawa (2)
  • Hiroshi Kanayama (2)
  • Ryutaro Ichise (1, 3)

  1. SOKENDAI (The Graduate University for Advanced Studies), Hayama, Japan
  2. IBM Research – Tokyo, Tokyo, Japan
  3. National Institute of Informatics, Tokyo, Japan
