The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters

Abstract

The Chinese language has more native speakers than any other language, but research on the reading of Chinese characters is still not as well developed as research on the reading of words in alphabetic languages. Two notable gaps are the paucity of megastudies in Chinese and the relatively infrequent use of the lexical decision paradigm to investigate single-character recognition. The Chinese Lexicon Project, described in this article, is a database of lexical decision latencies for 2,500 Chinese single characters in simplified script, collected from a sample of native mainland Chinese (Mandarin) speakers (N = 35). This resource will provide a valuable adjunct to influential mega-databases, such as the English, French, and Dutch Lexicon Projects. Two separate analyses illustrate some of the advantages of megastudies: selecting the strongest measure of Chinese character frequency (the subtitle contextual diversity frequency count of Cai & Brysbaert, 2010, PLoS ONE 5(6): e10729), and conducting virtual studies to replicate and clarify existing findings. The unique morpho-syllabic nature of the Chinese writing system makes it a valuable case study for functional language contrasts. Moreover, this is the first publicly available large-scale repository of behavioral responses pertaining to Chinese language processing (the behavioral dataset is attached to this article as a supplemental file available for download). For these reasons, the data should be of substantial interest to psychologists, linguists, and other researchers.
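As a rough illustration of the first kind of analysis mentioned above, competing frequency counts can be compared by how much variance in item-level mean response time each one explains. The sketch below is a minimal example under stated assumptions: the function names, the log10(count + 1) transform, and the toy numbers are all hypothetical and are not taken from the database or from the authors' own analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_frequency_measures(mean_rts, measures):
    """Rank frequency measures by R^2 between log10(count + 1) and mean RT.

    mean_rts: item-level mean response times (ms), one per character.
    measures: dict mapping a measure name to its counts for the same items.
    Returns (name, r_squared) pairs, strongest predictor first.
    """
    scores = {}
    for name, counts in measures.items():
        log_counts = [math.log10(c + 1) for c in counts]
        scores[name] = pearson_r(log_counts, mean_rts) ** 2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up values for five hypothetical characters (assumption, not real data).
toy_rts = [520.0, 560.0, 610.0, 650.0, 700.0]
toy_measures = {
    "measure_a": [90000, 30000, 9000, 2500, 800],
    "measure_b": [50000, 60000, 4000, 9000, 700],
}
ranking = rank_frequency_measures(toy_rts, toy_measures)
```

With real data, `mean_rts` would hold the per-character latencies from the supplemental file, and each entry in `measures` would hold one published frequency count aligned to the same characters.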

Notes

  1. A mega-naming study was previously conducted by Liu, Shu, and Li (2007). Unfortunately, their naming latencies were not released for public access.

  2. The Web of Science search was conducted on July 2, 2012. Because such searches can return overly broad results, each article generated by the search was checked manually, and irrelevant titles were removed.

  3. The numerical breakdown of participants excluded on the basis of poor performance on the screening tasks or the lexical decision task is as follows: 1 participant was eliminated on the basis of the Chinese Author Recognition Test, 2 were eliminated on the basis of the HSK Chinese Proficiency Test, and 8 scored below 85% accuracy on the lexical decision task, so their data were discarded.

  4. We should point out that cncorpus was not created solely for psycholinguistic research. As a national corpus, it probably serves other functions, for example, providing historical linguists with information on the evolution of character use. One recommendation would be for the corpus to include an option for users to select, say, the period and type of text from which character/word statistics are computed, thus reducing dead weight.

  5. In the virtual replication of Leong et al. (1987), 数 (數 in traditional script) was included in the stimuli as a character with many strokes. The inclusion of 数 does not violate the stroke manipulation, since its simplified form has 13 strokes, which is above the cutoff of “many strokes” (Leong et al.’s cutoff is placed at 12). The traditional 數 has 15 strokes. In any case, we also ran an additional series of analyses that excluded 数. The same pattern of findings was elicited [response time: F1(1, 34) = 5.26, MSE = 1,985.36, p < .03, partial η² = .13].

  6. B. Chen et al. (2009) created two sets of stimuli for their three experiments (Experiment 1 [tachistoscopic task] used stimulus set 1, while Experiments 2 and 3 [visual duration threshold and lexical decision tasks] used stimulus set 2). Both sets were created on the basis of the same design and requirements, so they were combined for the drawing of our stimuli. Of the 128 unique characters found in the combined set (61 early AoA, 67 late AoA; some characters appeared in both sets), 78 are present in the Chinese Lexicon Project (38 early AoA, 40 late AoA). The value 78 is rather close to the number of stimuli used in Chen et al.’s lexical decision experiment (72 characters, i.e., 36 early AoA and 36 late AoA).
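The arithmetic behind two of the notes above can be checked with a short sketch. It assumes the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error) for the statistic reported in note 5, and it encodes the 85% accuracy criterion from note 3. The function names are hypothetical, and this is not the authors' analysis code.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def meets_accuracy_cutoff(n_correct, n_trials, cutoff=0.85):
    """True if a participant's accuracy is at or above the cutoff (note 3 uses 85%)."""
    return n_correct / n_trials >= cutoff


# Values reported in note 5: F1(1, 34) = 5.26.
effect_size = partial_eta_squared(5.26, 1, 34)
print(round(effect_size, 2))  # 0.13, matching the value reported in note 5
```

Recomputing the effect size from F and its degrees of freedom is a quick consistency check on a reported statistic; here 5.26 / (5.26 + 34) ≈ .134, which agrees with the reported .13.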

References

  1. Adelman, J. S., & Brown, G. D. A. (2008). Modeling lexical decision: The form of frequency and diversity effects. Psychological Review, 115(1), 214–229.

  2. Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823.

  3. Baayen, R. H., Piepenbrock, R., & Van Rijn, H. (1993). The CELEX lexical database on CD-ROM. Philadelphia, PA: Linguistic Data Consortium.

  4. Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology. Human Perception and Performance, 10(3), 340–357.

  5. Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology. General, 133(2), 283–316.

  6. Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., ... Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459.

  7. Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2012). Megastudies: Large scale analysis of lexical processes. In J. S. Adelman (Series and Vol. Ed.), Visual word recognition: Vol. 1. Models and methods, orthography and phonology (pp. 90–115). Hove, East Sussex: Psychology Press.

  8. Bonin, P., Chalard, M., Méot, A., & Fayol, M. (2001). Age-of-acquisition and word frequency in the lexical decision task: Further evidence from the French language. Current Psychology of Cognition, 20(6), 401–443.

  9. Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.

  10. Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kucera and Francis. Behavior Research Methods, Instruments, & Computers, 30(2), 272–277.

  11. Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One, 5(6), e10729. doi:10.1371/journal.pone.0010729

  12. Chen, B., Dent, K., You, W., & Wu, G. (2009). Age of acquisition affects early orthographic processing during Chinese character recognition. Acta Psychologica, 130(3), 196–203.

  13. Chen, B., & Peng, D. (2001). 汉语双字多义词的识别优势效应 [The effects of polysemy in two-character word identification]. Acta Psychologica Sinica, 33(4), 300–304.

  14. Chen, H.-C., & Zhou, X. (1999). Processing East Asian languages: An introduction. Language & Cognitive Processes, 14(5/6), 425–428.

  15. Cortese, M. J. (1998). Revisiting serial position effects in reading. Journal of Memory and Language, 39(4), 652–665.

  16. Cunningham, A. E., & Stanovich, K. E. (1997). Early reading acquisition and its relation to reading experience and ability 10 years later. Developmental Psychology, 33(6), 934–945.

  17. Da, J. (2004). A corpus-based study of character and bigram frequencies in Chinese e-texts and its implications for Chinese language instruction. In P. Zhang, T. Xie, & J. Xu (Eds.), Proceedings of the 4th International Conference on New Technologies in Teaching and Learning Chinese: The studies on the theory and methodology of the digitized Chinese teaching to foreigners (pp. 501–511). Beijing: The Tsinghua University Press.

  18. Dong, L.-C. (2005). 说文解字考证 [An investigation of Chinese characters’ etymology]. Beijing: Writer’s Publishing House.

  19. Faust, M. E., Balota, D. A., Spieler, D. H., & Ferraro, F. R. (1999). Individual differences in information-processing rate and amount: Implications for group differences in response latency. Psychological Bulletin, 125(6), 777–799.

  20. Feng, Z. (2002). 中国语料库研究的历史与现状 [Evolution and present situation of corpus research in China]. Journal of Chinese Language and Computing, 12(1), 43–62.

  21. Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., ... Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42(2), 488–496.

  22. Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology. General, 113(2), 256–281.

  23. Gu, J.-P. (2007). 字解: 字形图解字典 [A compendium of Chinese characters]. Singapore: Chinese Heritage Lodge.

  24. Hoosain, R. (1991). Psycholinguistic implications for linguistic relativity: A case study of Chinese. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

  25. Institute of Applied Linguistics. (2009). 国家语委现代汉语语料库介绍 [Introduction to the Modern Chinese corpus (cncorpus) by the State Language Commission]. Retrieved March 29, 2011, from Chinese Linguistic Data web site: www.cncorpus.org

  26. Institute of Applied Linguistics. (2010). 现代汉语语料库汉字频率表 [Modern Chinese corpus character frequency list]. Retrieved March 29, 2011, from Chinese Linguistic Data web site: www.cncorpus.org

  27. Institute of Linguistics in the Chinese Academy of Social Sciences. (2008). 现代汉语词典 [Modern Chinese dictionary] (5th ed.). Beijing: The Commercial Press.

  28. Katz, L., & Frost, R. (1992). The reading process is different for different orthographies: The orthographic depth hypothesis. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 67–84). Amsterdam: Elsevier North-Holland.

  29. Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1, 174. doi:10.3389/fpsyg.2010.00174

  30. Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287–304.

  31. Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.

  32. Language Teaching and Research Institute of Beijing Language and Culture University. (1986). 现代汉语频率词典 [Dictionary of modern Chinese frequency]. Beijing: Beijing Language and Culture University Press.

  33. Lee, S.-Y., & Krashen, S. (1996). Free voluntary reading and writing competence in Taiwanese high school students. Perceptual and Motor Skills, 83, 687–690.

  34. Leong, C. K., Cheng, P.-W., & Mulcahy, R. (1987). Automatic processing of morphemic orthography by mature readers. Language and Speech, 30(2), 181–196.

  35. Li, P., Tan, L. H., Bates, E., & Tzeng, O. J. L. (2006). Introduction: New frontiers in Chinese psycholinguistics. In P. Li (Series and Vol. Ed.), L. H. Tan, E. Bates, & O. J. L. Tzeng (Vol. Eds.), Handbook of East Asian psycholinguistics: Vol. 1. Chinese (pp. 1–9). Cambridge, UK: Cambridge University Press.

  36. Liu, Y. (2009). 汉语词汇研究统计述评 [A review of Chinese vocabulary statistic studies]. Chinese Language Learning, 30(1), 62–69.

  37. Liu, Y., Shu, H., & Li, P. (2007). Word naming and psycholinguistic norms: Chinese. Behavior Research Methods, 39(2), 192–198.

  38. Liu, Y., Wang, R. D., & Zhou, H. (2009). 现代汉语概论 (留学生版) [Modern Chinese: An overview]. Shanghai: Shanghai Educational Publishing House.

  39. Lu, S. C. (1989). 字词频率词典 (以拼音为序): 新加坡《小学华文教材》 [Frequency dictionary of Chinese characters, words and phrases used in Singapore primary school textbooks]. Singapore: Center of Research for Chinese, National University of Singapore.

  40. Lu, S. C. (1992). 字词频率词典 (以拼音为序): 新加坡《中学华文教材》 [Frequency dictionary of Chinese characters, words and phrases used in Singapore secondary school textbooks]. Singapore: Center of Research for Chinese, National University of Singapore.

  41. Myers, J., Huang, Y.-C., & Wang, W. (2006). Frequency effects in the processing of Chinese inflection. Journal of Memory and Language, 54(3), 300–323.

  42. Ostler, N. (2008). World languages. In P. K. Austin (Ed.), 1000 languages: The worldwide history of living and lost tongues (pp. 10–34). UK: Thames & Hudson.

  43. Peng, D., Deng, Y., & Chen, B. (2003). 汉语多义单字词的识别优势效应 [The polysemy effect in Chinese one-character word identification]. Acta Psychologica Sinica, 35(5), 569–575.

  44. Perfetti, C. A., Zhang, S., & Berent, I. (1992). Reading in English and Chinese: Evidence for a ‘universal’ phonological principle. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 227–248). Amsterdam: Elsevier North-Holland.

  45. Reynolds, M., & Besner, D. (2006). Reading aloud is not automatic: Processing capacity is required to generate a phonological code from print. Journal of Experimental Psychology. Human Perception and Performance, 32(6), 1303–1323.

  46. Rogers, H. (2005). Writing systems: A linguistic approach. Malden, MA: Blackwell Publishing.

  47. Rubenstein, H., Garfield, L., & Millikan, J. A. (1970). Homographic entries in the internal lexicon. Journal of Verbal Learning & Verbal Behavior, 9(5), 487–494.

  48. Scarborough, D. L., Cortese, C., & Scarborough, H. S. (1977). Frequency and repetition effects in lexical memory. Journal of Experimental Psychology. Human Perception and Performance, 3(1), 1–17.

  49. Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-prime (Version 1.2) [Computer software]. Pittsburgh: Psychology Software Tools Inc.

  50. Seidenberg, M. S. (1985). The time course of phonological code activation in two writing systems. Cognition, 19(1), 1–30.

  51. Share, D. L. (2008). On the Anglocentricities of current reading research and practice: The perils of overreliance on an “outlier” orthography. Psychological Bulletin, 134(4), 584–615.

  52. Stanovich, K. E., & Cunningham, A. E. (1993). Where does knowledge come from? Specific associations between print exposure and information acquisition. Journal of Educational Psychology, 85(2), 211–229.

  53. Stanovich, K. E., & West, R. F. (1989). Exposure to print and orthographic processing. Reading Research Quarterly, 24(4), 402–433.

  54. State Language Commission & News Bureau. (1988). 现代汉语通用字表 [List of Commonly Used Characters]. Retrieved June 30, 2011, from http://www.china-language.gov.cn/wenziguifan/shanghi/014c.htm

  55. Stone, G. O., Vanhoy, M., & Van Orden, G. C. (1997). Perception is a two-way street: Feedforward and feedback phonology in visual word recognition. Journal of Memory and Language, 36(3), 337–359.

  56. Sun, C. (2006). Chinese: A linguistic introduction. New York: Cambridge University Press.

  57. Tsai, P.-S., Yu, B. H.-Y., Lee, C.-Y., Tzeng, O. J. L., Hung, D. L., & Wu, D. H. (2009). An event-related potential study of the concreteness effect between Chinese nouns and verbs. Brain Research, 1253, 149–160.

  58. Urbaniak, G. C., & Plous, S. (2011). Research Randomizer (Version 3.0) [Computer software]. Retrieved on January 1, 2011, from http://www.randomizer.org/

  59. Wang, J. (2001). Recent progress in corpus linguistics in China. International Journal of Corpus Linguistics, 6(2), 281–304.

  60. Xiao, R., Rayson, P. A., & McEnery, T. (2009). Frequency dictionary of Mandarin Chinese: Core vocabulary for learners. London: Routledge.

  61. Yap, M. J., Rickard Liow, S. J., Jalil, S. B., & Faizal, S. S. B. (2010). The Malay Lexicon Project: A database of lexical statistics for 9,592 words. Behavior Research Methods, 42(4), 992–1003.

  62. Yin, B., & Rohsenow, J. S. (1994). Modern Chinese characters. Beijing: Sinolingua.

  63. Yip, M. (2002). Tone. New York: Cambridge University Press.

  64. You, W., Chen, B., & Dunlap, S. (2009). Frequency trajectory effects in Chinese character recognition: Evidence for the arbitrary mapping hypothesis. Cognition, 110(1), 39–50.

    PubMed  Article  Google Scholar 

Author information

Corresponding author

Correspondence to Wei Ping Sze.

Electronic Supplementary Materials

Supplementary material (ZIP 999 kb)

About this article

Cite this article

Sze, W.P., Rickard Liow, S.J. & Yap, M.J. The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters. Behav Res 46, 263–273 (2014). https://doi.org/10.3758/s13428-013-0355-9

Keywords

  • Mandarin
  • Visual word recognition
  • Megastudy
  • Reaction time
  • Nonalphabetic
  • Logograph