Constructing and validating word similarity datasets by integrating methods from psychology, brain science and computational linguistics

  • Focus
Soft Computing

Abstract

Human-scored gold-standard word similarity datasets normally consist of word pairs with corresponding similarity scores. These datasets are popular resources for evaluating word similarity models, which are essential components of many natural language processing tasks. This paper proposes a novel multidisciplinary method for constructing and validating gold-standard word similarity datasets. The proposed method differs from previous ones in that it draws on methods from three disciplines, namely psychology, brain science and computational linguistics, to validate the soundness of the constructed datasets. In particular, to the best of our knowledge, this is the first time that event-related potential (ERP) experiments have been used to validate word similarity datasets. Using the proposed method, we constructed a Chinese gold-standard word similarity dataset with 260 word pairs and demonstrated its soundness with the interdisciplinary validation methods. Although the paper focuses only on constructing a Chinese gold-standard dataset, the proposed method is applicable to other languages.
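As an illustration of how a gold-standard dataset of this kind is typically used to evaluate a word similarity model, the sketch below computes Spearman's rank correlation between human scores and model scores. The abstract does not state which evaluation metric the paper uses, so the choice of Spearman's correlation is an assumption. The word pairs, the scores and the function names are invented for the example and are not taken from the 260-pair dataset.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: (word pair) -> human similarity score.
gold_scores = {
    ("car", "automobile"): 9.2,
    ("coast", "shore"): 8.5,
    ("bird", "crane"): 6.1,
    ("noon", "string"): 0.4,
}
# Hypothetical model output for the same pairs (e.g. cosine similarities).
model_scores = {
    ("car", "automobile"): 0.91,
    ("coast", "shore"): 0.70,
    ("bird", "crane"): 0.75,
    ("noon", "string"): 0.05,
}

pairs = list(gold_scores)
rho = spearman([gold_scores[p] for p in pairs],
               [model_scores[p] for p in pairs])
print(f"Spearman correlation with the gold standard: {rho:.3f}")
```

A rank-based measure is a common choice here because human ratings and model scores usually live on different scales, so only the ordering of the word pairs is compared.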

Notes

  1. The dataset is freely available at https://github.com/ydc/ws260.

  2. http://www.sogou.com/labs/resource/ca.php.

  3. http://www.keenage.com.

  4. http://wordnet.princeton.edu/.

  5. https://github.com/fozziethebeat/S-Space.

  6. http://ai.stanford.edu/~ehhuang/.

  7. https://code.google.com/archive/p/word2vec/.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61573294), the National Social Science Foundation of China (No. 16AZD049) and the Fujian Province 2011 Collaborative Innovation Center of TCM Health Management.

Author information

Corresponding author

Correspondence to Yidong Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by P. Angelov, F. Chao.

About this article

Cite this article

Wan, Y., Chen, Y., Shi, X. et al. Constructing and validating word similarity datasets by integrating methods from psychology, brain science and computational linguistics. Soft Comput 22, 6967–6979 (2018). https://doi.org/10.1007/s00500-018-3174-1

  • DOI: https://doi.org/10.1007/s00500-018-3174-1
