A Multidisciplinary Method for Constructing and Validating Word Similarity Datasets

  • Yu Wan
  • Yidong Chen (Email author)
  • Xiaodong Shi
  • Guorong Cai
  • Libai Cai
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 650)


Measuring semantic similarity is essential to many natural language processing (NLP) tasks. One widely used way to evaluate similarity models is to test their consistency with human judgments using gold-standard datasets, which consist of word pairs with similarity scores assigned by human subjects. However, previous work often describes the construction of such datasets in insufficient detail. Many questions, e.g. how the word pairs are selected and whether the scores are reasonable, are not clearly addressed. In this paper, we propose a multidisciplinary method for building and validating gold-standard semantic similarity datasets, which consists of three steps. First, word pairs are selected based on computational linguistic resources. Second, the similarities of the selected word pairs are scored by human subjects. Finally, Event-Related Potential (ERP) experiments are conducted to test the soundness of the constructed dataset. Using the proposed method, we constructed a Chinese gold-standard word similarity dataset with 260 word pairs and validated its soundness via ERP experiments. Although the paper focuses only on constructing a Chinese dataset, the proposed method is applicable to other languages.
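The consistency check between a similarity model and a human-scored dataset can be illustrated with a short sketch. The abstract does not say which consistency measure the authors use; Spearman rank correlation is a common choice for this kind of evaluation and is assumed here. The function names and all scores below are hypothetical and are not taken from the paper's dataset.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for five word pairs (illustration only).
human_scores = [4.8, 3.9, 1.2, 0.3, 2.5]       # assumed human ratings
model_scores = [0.91, 0.75, 0.30, 0.05, 0.20]  # assumed model similarities

rho = spearman(human_scores, model_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A value of rho close to 1 would indicate that the model orders the word pairs much as the human subjects do. Rank correlation is used in this sketch because it compares orderings only, so the model scores and human ratings do not need to share a scale.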


Keywords: Word similarity; Dataset; Multidisciplinary method; ERP



This work was supported by the National Natural Science Foundation of China (No. 61573294), the National Social Science Foundation of China (No. 16AZD049) and the Fujian Province 2011 Collaborative Innovation Center of TCM Health Management.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Yu Wan (1)
  • Yidong Chen (1, Email author)
  • Xiaodong Shi (1)
  • Guorong Cai (2)
  • Libai Cai (3)
  1. Department of Cognitive Science, School of Information and Engineering, Xiamen University, Xiamen, People’s Republic of China
  2. State Grid Fujian Liancheng Electric Power Company Limited, Longyan, People’s Republic of China
  3. Computer Engineering College, Jimei University, Xiamen, People’s Republic of China
