A Hybrid Approach to Improve Bilingual Multiword Expression Extraction

  • Jianyong Duan
  • Mei Zhang
  • Lijing Tong
  • Feng Guo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5476)


We propose a hybrid approach to bilingual multiword expression extraction. The extraction process has two phases. In the first phase, a large number of candidates are extracted from the corpus by statistical methods. The multiple sequence alignment algorithm used here is sensitive to flexible multiword expressions. In the second phase, error-driven rules and patterns are learned from the corpus, and the trained rules are used to filter the candidates. Because the system has many parameters, several experiments were designed to find the settings that give the best performance. Experimental results show that our approach achieves good performance.
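To illustrate why sequence alignment can pick up flexible multiword expressions, the sketch below aligns two tokenized sentences with Smith-Waterman local alignment and returns the tokens they share. It is a pairwise simplification, not the paper's multiple sequence alignment procedure; the function name `local_align` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def local_align(a: Sequence[str], b: Sequence[str],
                match: int = 2, mismatch: int = -1, gap: int = -1
                ) -> Tuple[int, List[str]]:
    """Smith-Waterman local alignment over tokens (illustrative scoring).

    Returns the best local score and the tokens that match along the
    best-scoring local alignment, in order.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming table; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    shared: List[str] = []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                shared.append(a[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    shared.reverse()
    return best, shared


result = local_align("take this into account".split(),
                     "take it into account".split())
```

With these assumed scores, `result` is `(5, ['take', 'into', 'account'])`: the alignment bridges the differing middle token, so the discontinuous candidate "take ... into account" is recovered even though no contiguous n-gram covers it.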





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jianyong Duan (1)
  • Mei Zhang (2)
  • Lijing Tong (1)
  • Feng Guo (1)
  1. College of Information Engineering, India
  2. College of Art Design, North China University of Technology, Beijing, P.R. China
