A New Multiword Expression Metric and Its Applications

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Multiword expressions (MWEs) appear frequently in natural languages and often do not follow regular grammatical rules. Identifying MWEs in free text is a very challenging problem. This paper proposes Multiword Expression Distance (MED), a knowledge-free, unsupervised, and language-independent metric. MED is derived from an accepted physical principle and measures the distance from an n-gram to its semantics. It outperforms other state-of-the-art methods on MWEs in two applications: question answering and named entity extraction.

Author information

Corresponding author

Correspondence to Fan Bu.

Additional information

This work was supported mainly by Canada’s IDRC Research Chair in Information Technology Program, under Grant No. 104519-006. It was also supported by the National Natural Science Foundation of China under Grant No. 60973104, the National Basic Research 973 Program of China under Grant No. 2007CB311003, NSERC Grant OGP0046506, the Canada Research Chair Program, MITACS, an NSERC Collaborative Grant, and Ontario’s Premier’s Discovery Award.

About this article

Cite this article

Bu, F., Zhu, XY. & Li, M. A New Multiword Expression Metric and Its Applications. J. Comput. Sci. Technol. 26, 3–13 (2011). https://doi.org/10.1007/s11390-011-9410-0

  • DOI: https://doi.org/10.1007/s11390-011-9410-0
