Constructing and validating word similarity datasets by integrating methods from psychology, brain science and computational linguistics

Wan, Yu; Chen, Yidong; Shi, Xiaodong; Zhou, Changle

doi:10.1007/s00500-018-3174-1

Focus
Published: 03 April 2018

Volume 22, pages 6967–6979, (2018)

Soft Computing

Yu Wan^1,2, Yidong Chen^1,2, Xiaodong Shi^1,2 & Changle Zhou^1,2

Abstract

Human-scored gold-standard word similarity datasets normally consist of word pairs with corresponding similarity scores. These datasets are popular resources for evaluating word similarity models, which are essential components of many natural language processing tasks. This paper proposes a novel multidisciplinary method for constructing and validating gold-standard word similarity datasets. The proposed method differs from previous ones in that it draws on methods from three disciplines, namely psychology, brain science and computational linguistics, to validate the soundness of the constructed datasets. Specifically, to the best of our knowledge, this is the first time that event-related potential (ERP) experiments have been used to validate word similarity datasets. Using the proposed method, we constructed a Chinese gold-standard word similarity dataset of 260 word pairs and demonstrated its soundness with the interdisciplinary validation methods. Although this paper focuses only on constructing a Chinese gold-standard dataset, the proposed method is applicable to other languages.
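
As an illustration of how a gold-standard dataset of this kind is typically used to evaluate a word similarity model, the sketch below ranks the human scores and the model scores for the same word pairs and computes the Spearman rank correlation between them. This is a minimal sketch under stated assumptions, not the authors' evaluation procedure: the word pairs, the human scores, the rating scale, and the model scores are all hypothetical, and the choice of Spearman correlation reflects common practice rather than anything stated in the abstract.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def spearman(gold_scores, model_scores):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(gold_scores), average_ranks(model_scores))


# Hypothetical gold-standard entries: (word1, word2) -> mean human score.
gold = {
    ("car", "automobile"): 9.2,
    ("coast", "shore"): 8.7,
    ("lemon", "tea"): 3.1,
    ("noon", "string"): 0.4,
}

# Hypothetical model output (e.g. cosine similarities) for the same pairs.
model = {
    ("car", "automobile"): 0.81,
    ("coast", "shore"): 0.77,
    ("lemon", "tea"): 0.52,
    ("noon", "string"): 0.05,
}

pairs = sorted(gold)
rho = spearman([gold[p] for p in pairs], [model[p] for p in pairs])
print(f"Spearman correlation with the gold standard: {rho:.3f}")
```

Rank correlation is used here because it depends only on the ordering of the pairs, so the human rating scale and the model's score range do not need to match.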

Notes

The dataset is freely available at https://github.com/ydc/ws260.
http://www.sogou.com/labs/resource/ca.php.
http://www.keenage.com.
http://wordnet.princeton.edu/.
https://github.com/fozziethebeat/S-Space.
http://ai.stanford.edu/~ehhuang/.
https://code.google.com/archive/p/word2vec/.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61573294), the National Social Science Foundation of China (No. 16AZD049) and the Fujian Province 2011 Collaborative Innovation Center of TCM Health Management.

Author information

Authors and Affiliations

Department of Cognitive Science, School of Information and Engineering, Xiamen University, Xiamen, 361005, Fujian, People’s Republic of China
Yu Wan, Yidong Chen, Xiaodong Shi & Changle Zhou
Fujian Key Laboratory of Brain-Inspired Computing Technique and Applications, Xiamen University, Xiamen, 361005, Fujian, People’s Republic of China
Yu Wan, Yidong Chen, Xiaodong Shi & Changle Zhou

Corresponding author

Correspondence to Yidong Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by P. Angelov, F. Chao.

About this article

Cite this article

Wan, Y., Chen, Y., Shi, X. et al. Constructing and validating word similarity datasets by integrating methods from psychology, brain science and computational linguistics. Soft Comput 22, 6967–6979 (2018). https://doi.org/10.1007/s00500-018-3174-1

Published: 03 April 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s00500-018-3174-1
