Word intuition is speakers’ intuitive knowledge of what a word is, and it can be defined individually or collectively. Individual word intuition is the word intuition of an individual speaker, whereas collective word intuition is the word intuition of the whole language community. In English and other languages whose orthography has conventionalized word boundaries, collective word intuition can be roughly modeled by speakers’ collective behavior in marking orthographic word breaks. In languages without conventionalized word boundaries in orthography, however, explicit empirical evidence for either individual or collective word intuition is very difficult to obtain.

A priori, a general method for describing the collective word intuition of Chinese speakers is to summarize the individual word intuition of all Chinese speakers. In principle, collective word intuition can be measured using word segmentation tasks (Hoosain 1992; Wang 王立 2003), assuming that speakers’ segmentation behaviors reflect their word intuition. The quantifiable result, segmentation consistency, then serves as a convenient measurement of Chinese speakers’ word intuition. This leads to a potentially very attractive feature of collective word intuition: it can be quantified with probabilistic values.

In computational and Chinese language processing, the difficulty of capturing collective word intuition leads to difficulty both in articulating a consensus word segmentation standard (Huang et al. 1996; Liu et al. 刘源等 1994) and in achieving optimal results in word segmentation (Huang and Xue 2012; Huang and Xue 2015; Huang and Zhao 黄昌宁, 赵海 2007). Presumably, a clearly articulated collective word intuition of Chinese speakers would be the ideal word segmentation standard for Chinese language processing and would make the modeling and evaluation of Chinese word segmentation explicit and straightforward. A probabilistic representation of word intuition is in fact even better suited as a word segmentation standard, given the predominance of stochastic models in computational word segmentation.

Collecting word intuition experimental data, however, is not an easy task. It is a very time- and resource-intensive process in a traditional laboratory setting. In addition, given the possible variations in such experimental results, one may even question the psychological reality of the word as a natural linguistic unit in Chinese (Hoosain 1992; Huang and Xue 2012). Only when the word intuition agreement among Chinese speakers is reasonably high can the psychological reality of the word in Chinese be supported and word intuition serve as a solid foundation for the definition of the Chinese word. Interestingly, the first studies tackling agreement in human word segmentation results were in fact conducted by computational linguists rather than psycholinguists, such as Sproat et al. (1996) and Liu and Liang 刘源, 梁南元 (1986). Liu and Liang 刘源, 梁南元 (1986) reported agreement ranging from 60% (before training) to 80% (after training), but without describing how agreement was measured, so these figures cannot be properly interpreted. Another weakness of their study is that the text stimuli were short and out of context, and hence did not model a realistic context of word segmentation. Sproat et al. (1996) conducted a similar experiment. They randomly extracted 100 sentences (4372 Chinese characters in total) from a corpus and asked six Chinese native speakers (three from Mainland China and three from Taiwan) to segment these sentence stimuli. They used the arithmetic mean of precision and recall to measure interjudge similarity for each unique subject pair. They reported a minimum interjudge similarity of 0.69, a maximum of 0.89, and a mean of 0.76. Segmentation agreement in terms of the arithmetic mean of precision and recall, however, is rarely (if ever) used in psychological or linguistic studies. Instead, various kappa statistics are more “standard” in current studies in the cognitive and social sciences. Their number of subjects (6) is also too small for this kind of study (typically 20 or more are required).
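Sproat et al.’s interjudge similarity can be sketched as follows; representing each subject’s segmentation as a set of character spans is our own modeling assumption, and the function names are hypothetical.

```python
def spans(segments):
    """Convert a list of word segments into a set of (start, end) character spans."""
    out, pos = set(), 0
    for seg in segments:
        out.add((pos, pos + len(seg)))
        pos += len(seg)
    return out

def interjudge_similarity(seg_a, seg_b):
    """Arithmetic mean of precision and recall between two segmentations
    of the same sentence, treating seg_a as the reference."""
    a, b = spans(seg_a), spans(seg_b)
    common = len(a & b)
    precision = common / len(b)  # fraction of b's words also produced by a
    recall = common / len(a)     # fraction of a's words also produced by b
    return (precision + recall) / 2

# Two hypothetical segmentations of the same five-character sentence:
sim = interjudge_similarity(["这", "是", "好", "东西"], ["这是", "好", "东西"])  # ≈ 0.583
```

Because the measure is symmetric in its two precision/recall components, swapping the two arguments yields the same value.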

The possible link between collective word intuition and word segmentation standards, as well as the fact that previous studies failed to recruit enough subjects for valid results, leads to our proposal to use the NLP technique of crowdsourcing to tackle this issue. Mechanical Turk (MTurk) has emerged in recent years as a promising solution to the linguistic data bottleneck by providing a new paradigm for linguistic experiments, i.e., the MTurk-based experiment (Berinsky et al. 2012; Buhrmester et al. 2011; Horton et al. 2011; Mason and Suri 2012; Paolacci et al. 2010; Schnoebelen and Kuperman 2010; Sprouse 2011). Data quality is the key concern in MTurk-based research because the MTurk setting is not as controllable as the laboratory setting; a host of studies have been carried out to address this concern. Comparisons between data obtained from MTurk-based and laboratory-based experiments suggest that MTurk-based experiments can provide comparable or even better data (Horton et al. 2011; Munro et al. 2010; Schnoebelen and Kuperman 2010; Sprouse 2011). A large set of classic effects previously discovered in laboratory-based experiments have been successfully replicated in MTurk-based experiments (Crump et al. 2013; Enochson and Culbertson 2015; Horton et al. 2011; Simcox and Fiez 2014). Last but not least, MTurk has been successfully used for research on Chinese language resources (Wang et al. 2014b). Our current study therefore designs a crowdsourcing-based experiment on word intuition, with the hope of constructing word segmentation resources to inform computational word segmentation in the future.

“A word is a minimum free form.” This is perhaps the most classic definition of the word, suggested by Bloomfield (1933: 178). According to this definition, however, the Chinese forms 江水 jiāngshuǐ 'river water' and 龙眼 lóngyǎn 'longan' are not words, because 江 jiāng 'river', 水 shuǐ 'water', 龙 lóng 'dragon', and 眼 yǎn 'eye' are all free forms (free morphemes). This is quite counter-intuitive. If we inspect the two forms carefully, we find that they are in fact very different: jiāngshuǐ is semantically transparent, whereas lóngyǎn is semantically opaque. Because of this, among Chinese linguists there is rarely debate on the wordhood of lóngyǎn, but there is still some debate on the wordhood of jiāngshuǐ. This leads to the hypothesis that semantic transparency may affect Chinese speakers’ word intuition. Is the word intuition agreement on semantically transparent forms significantly lower than that on semantically opaque forms? This study will also probe into this research question.



The materials of word segmentation tasks are at least phrases, but we prefer naturally occurring sentences. In order to cover more linguistic phenomena and better support the study of word intuition, we decided to use more than 150 long sentences (the crowdsourcing method makes this possible). Meanwhile, the resultant dataset will also be used to examine the effect of semantic transparency on word intuition, so these sentences should contain the words to be used in that examination. Hence, the material selection procedure consists of two steps: (1) word selection, i.e., selecting an initial set of words to be used in the examination of the semantic transparency effect, and (2) sentence selection, i.e., selecting a set of sentences which contain the words selected in step 1 (each sentence carries one word) and at the same time satisfy other requirements.

Word selection

We have already created a semantic transparency dataset, SimTransCNC 1.0, which contains overall and constituent semantic transparency rating data for about 1200 Chinese bimorphemic nominal compounds with mid-range word frequencies and free-morpheme constituents (Wang et al. 2014a). Based on this dataset, 152 words were selected. These words have two functions in this study: (1) they are used as indexes to extract sentences from the corpus; (2) they are used as word stimuli in the study of the effect of semantic transparency on word intuition agreement. For function 2, however, we will not use all 152 words; instead, these 152 words will first undergo a laboratory-based semantic transparency rating experiment (Wang et al. 2015) to further ensure the accuracy of their semantic transparency scores, and then a subset will be selected according to the laboratory results to examine the semantic transparency effect on word intuition agreement (see Section 3.5).

In the study of the semantic transparency effect on word intuition agreement, the independent variable is the semantic transparency of the words and the dependent variable is the word intuition agreement on these words. We control the length, part of speech, frequency, morphological structure, and the nature of the constituents of these words. All the words are bimorphemic nominal compounds which consist of free morphemes, have a modifier-head structure, and have mid-range frequencies. The modifier-head structure can be further divided into three main substructures: NN, AN, and VN; the selected words cover all three substructures, enabling us to see whether they make a difference. Following Libben et al. (2003), we differentiate four transparency types: TT, TO, OT, and OO, where “T” means “transparent” and “O” means “opaque”. TT words show the highest OST (overall semantic transparency: the semantic transparency of a whole compound) scores and the most balanced CST (constituent semantic transparency: the semantic transparency of a constituent of a compound) scores, e.g., 江水 jiāngshuǐ 'river water'; OO words have the lowest OST scores and the most balanced CST scores, e.g., 脾气 píqi 'temperament'; TO and OT words bear mid-range OST scores and the most imbalanced CST scores, e.g., 音色 yīnsè 'timbre' (TO) and 贵人 guìrén 'magnate' (OT). See Table 1 for the distribution of the selected words.

Table 1 Distribution of types of selected words

Sentence selection

The words selected in step 1 were used as indexes, and all the sentences carrying them in the Sinica Corpus 4.0 were extracted. One sentence was selected for each word, roughly according to the following criteria: (1) the length of the sentence should be between 20 and 50 characters (punctuation marks excluded); (2) the sentence should not contain too many punctuation marks; (3) concrete and narrative sentences were preferred to abstract ones which are difficult to understand; and (4) if we could not find a proper sentence in the Sinica Corpus for some words, we used other corpora (only 5 sentences). In this way, a total of 152 sentences were selected; for their length distribution (in characters), see Table 2.

Table 2 Length distribution of selected sentences

Crowdsourcing task design

Since a crowdsourcing task should be short, the 152 sentences were evenly and randomly divided into eight sentence groups of 19 sentences each. We created one crowdsourcing task for each sentence group on Crowdflower; according to our previous studies, Crowdflower is a more feasible platform than Amazon Mechanical Turk (MTurk) for Chinese linguistic data collection, since it is more accessible and can reach more Chinese speakers (Wang et al. 2014b).


The core of each crowdsourcing task is a questionnaire. Each questionnaire consists of five sections: (1) title, (2) instructions, (3) demographic questions, (4) screening questions, and (5) segmentation task; both simplified and traditional Chinese character versions are provided. Section 3, demographic questions, asks the online subjects (all of whom are volunteers) to provide identity information on gender, age, level of education, and email address (optional). Section 4, screening questions, consists of four simple questions on the Chinese language which test whether a subject is a Chinese speaker; the first two are open-ended Chinese character identification questions. Each shows a picture containing a simple Chinese character and asks the subject to identify it and type it in a text box below. The third question is a closed-ended homophonic character identification question: it shows the subject a character and asks him/her to identify its homophonic character among 10 different characters. The fourth is a closed-ended antonymous character identification question asking the subject to identify the antonymous character of the given one among 10 different characters. Section 4 of each of the eight crowdsourcing tasks shares the same question types but uses different question instances. Section 5, the segmentation task, shows the subjects 19 sentences and asks them to insert a word boundary symbol (“/”) at each word boundary they perceive. The subjects are required to insert a “/” after each punctuation mark and after the last character of a sentence. The subjects are also informed that they need not worry about whether their judgments are right or wrong, but should simply follow their intuitions.

Parameters of tasks

These eight crowdsourcing tasks were created with the following parameters: (1) each subject account can submit only one response per task; (2) each IP address can submit only one response per task; (3) we only accept responses from mainland China, Hong Kong, Macao, Taiwan, Singapore, Indonesia, Malaysia, Thailand, Australia, Canada, Germany, the USA, and New Zealand, since these are the main Chinese-speaking areas; and (4) we pay 0.25 USD per response.

Quality control measures

The following quality control measures were used: (1) Section 4, screening questions, was used to discriminate Chinese speakers from non-Chinese speakers and to block bots; (2) Section 5, the segmentation task, was kept invisible unless the first two screening questions were correctly answered; (3) the answers to the segmentation questions in Section 5 had to comply with the prescribed format to prevent random strings: (a) the answer for each sentence must consist only of the original sentence with one or zero “/” after each Chinese character and each punctuation mark; (b) there must be a “/” after each punctuation mark; (c) an answer must end with a “/”; (4) submission attempts were blocked unless all required questions were answered and the answers satisfied the above conditions; and (5) data cleansing was conducted after data collection to rule out invalid responses.
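The format checks in measure (3) can be sketched as a simple validator; the function name and the punctuation set are our own illustrative assumptions, not the platform's actual implementation.

```python
PUNCT = set("，。、！？；：“”（）")  # hypothetical set of punctuation marks

def valid_answer(sentence, answer):
    """Check that `answer` is `sentence` with zero or one '/' after each
    character, a mandatory '/' after every punctuation mark, and a
    trailing '/'. A sketch of the prescribed format rules."""
    # Stripping the slashes must recover the original sentence exactly.
    if answer.replace("/", "") != sentence:
        return False
    if not answer.endswith("/"):      # rule (c): answer must end with '/'
        return False
    if "//" in answer or answer.startswith("/"):  # at most one '/' per interval
        return False
    # Rule (b): every punctuation mark must be immediately followed by '/'.
    for i, ch in enumerate(answer):
        if ch in PUNCT and (i + 1 >= len(answer) or answer[i + 1] != "/"):
            return False
    return True
```

For example, `valid_answer("这是好东西。", "这是/好/东西/。/")` passes, while an answer missing the trailing slash is rejected.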


We first ran a small pretest task to check whether the tasks were correctly designed, and it turned out that the pretest task ran smoothly. Then, we launched the first task and let it run alone for about 2 days to further test the task design. After we confirmed that the tasks really did run smoothly, we launched the other seven tasks and let them run concurrently. Our aim was to collect 200 responses per task. Collection was remarkably fast in the beginning, and all eight tasks received their first 100 responses within the first 3 to 6 days; then the speed became slower and slower, and it eventually took about 1.3 months to reach our target number. After all, Crowdflower is not a native Chinese crowdsourcing platform, so this pace is understandable.


Data cleansing

All tasks successfully obtained 200 responses. However, not all responses are valid. Compared to the laboratory setting, the crowdsourcing environment is quite noisy by nature, so before the newly collected data can be used in any serious analysis to draw reliable conclusions, data cleansing must be conducted. The raw responses underwent rule-based data cleansing. A response is considered invalid if it has at least one of the following five features: (1) at least one of the four screening questions is incorrectly answered; (2) the resultant segments of at least one of its 19 sentences are all one character long; (3) at least one segment longer than seven characters is observed in the resultant segments of its 19 sentences; (4) the completion time of the response is shorter than 5 min; or (5) the completion time of the response is longer than 1 h. Invalid responses were ruled out. The numbers of valid responses for the eight tasks are listed in Table 3. The resultant dataset contains manual Chinese word segmentation data for 152 sentences whose lengths range from 20 to 46 characters (M = 32.54, SD = 5.46); each sentence was segmented by at least 123 and at most 143 subjects (M = 133.5, SD = 7.37).

Table 3 Numbers of valid responses for the tasks
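The five cleansing rules can be sketched as a single filter; the response structure (a dict with `screening_correct`, `segments`, and `seconds` keys) is a hypothetical representation we introduce for illustration.

```python
def is_valid_response(resp):
    """Apply the five rule-based cleansing criteria to one response.
    `resp` is a hypothetical dict: 'screening_correct' (bool),
    'segments' (one list of segments per sentence), 'seconds' (float)."""
    if not resp["screening_correct"]:                    # rule 1
        return False
    for sentence_segments in resp["segments"]:
        if all(len(s) == 1 for s in sentence_segments):  # rule 2: all 1-char
            return False
        if any(len(s) > 7 for s in sentence_segments):   # rule 3: > 7 chars
            return False
    if resp["seconds"] < 5 * 60:                         # rule 4: < 5 min
        return False
    if resp["seconds"] > 60 * 60:                        # rule 5: > 1 h
        return False
    return True
```

A response that segments every sentence into single characters, for instance, is rejected by rule 2 even if it passes the screening questions.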

Evaluation of experimental data

Although Fleiss’ kappa can be used to measure the agreement between raters, high agreement does not necessarily mean high data quality, especially when measuring intuition, where variation among subjects is expected. Nor can it show directly how many errors the resultant dataset actually contains. Knowing how many errors the dataset contains is very important for assessing the reliability of the conclusions drawn from it. We first define two kinds of manual segmentation errors and, based on them, propose an evaluation method called the manual segmentation error rate (MSER) to evaluate the resultant dataset.

Types of manual segmentation errors

In Chinese phrases/sentences, there are three types of non-monosyllabic segments from the point of view of manual word segmentation: ridiculous segments, indivisible segments, and modest segments. A ridiculous segment cannot be treated as one valid unit/word because it makes no sense in the context of the phrase/sentence; for example, in the phrase 这是好东西 zhè shì hǎo dōngxī 'this is a good thing', the segment 好东 hǎo dōng 'good-east' (NONSENSE) cannot be treated as one unit/word because it is incomprehensible. An indivisible segment cannot be divided because it is a fixed unit whose lexical meaning cannot easily be derived from the lexical meanings of its constituents (in other words, it is semantically opaque); it becomes incomprehensible if divided. In the example phrase, the segment 东西 dōngxī 'thing' is of this type. A modest segment can either be treated as one unit/word or be divided into two or more units/words, because it is equally comprehensible either way; the segment 这是 zhè shì 'this is' in the example phrase is of this type.

Two circumstances can be treated as errors of manual word segmentation: first, a ridiculous segment appearing in the segmentation results (type I error); and second, an indivisible segment being divided in the segmentation results (type II error). These two circumstances are not compatible with general word intuition to any extent, because the results are simply incomprehensible, and they cannot be explained by variation of word intuition among speakers; normally, when subjects do word segmentation tasks carefully according to their word intuition, such errors do not occur, so we can treat them as errors. They occur when subjects try to cheat by segmenting randomly or make accidental mistakes.

Manual segmentation error rate

A subject divides the phrase/sentence S into n (n ∈ ℕ+) segments by n segmentation operations (not n − 1: leaving the final segment at the tail as one word also counts, since the subject thereby “confirmed” that segment). A segmentation operation can yield only one of the following four results: one type I error; one type II error; one type I error plus one type II error (two errors, e.g., 好东/西 hǎo dōng/xī 'good-east/west' (NONSENSE)); or no error. Suppose e′ (e′ ∈ ℕ) is the number of times a type I error occurred during the segmentation process, and e″ (e″ ∈ ℕ) the number of times a type II error occurred; then we can define the manual segmentation error rate (MSER):

$$ MSER=\left({e}^{\prime }+{e}^{{\prime\prime}}\right)/n $$

In extreme cases, MSER could be greater than one; for example, in the segmentation result 去哈/尔滨/ qù Hā/ěrbīn/ 'go to Ha/er-bin/' (NONSENSE), e′ = 2, e″ = 1, n = 2, so MSER = 3/2. If this happens, we simply set MSER = 1. MSER can be used to evaluate manual word segmentation results: a lower MSER means better data quality. Let us consider its collective form. If S is segmented by m (m ∈ ℕ+) subjects, and the ith (1 ≤ i ≤ m) subject’s type I error count, type II error count, and segmentation operation count are \( {e}_i^{\prime }, {e}_i^{{\prime\prime}}, {n}_i \) respectively, then the collective form of MSER is:

$$ MSER=\frac{\sum \limits_{i=1}^m\left({e}_i^{\prime }+{e}_i^{{\prime\prime}}\right)}{\sum \limits_{i=1}^m{n}_i} $$

As a practical shortcut, type I errors and their counts can be found in the unigram frequency list of the segmentation results, and type II errors and their counts in the bigram frequency list.
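The individual and collective MSER definitions above can be sketched directly from the error counts; the function names are ours.

```python
def mser(e1, e2, n):
    """Individual manual segmentation error rate: (e' + e'') / n,
    clipped at 1 for the extreme cases described in the text."""
    return min((e1 + e2) / n, 1.0)

def collective_mser(counts):
    """Collective MSER over m subjects. `counts` is a list of
    (type_I_count, type_II_count, operation_count) triples, one per
    subject: sum of all errors divided by sum of all operations."""
    total_err = sum(e1 + e2 for e1, e2, _ in counts)
    total_ops = sum(n for _, _, n in counts)
    return total_err / total_ops
```

For the 去哈/尔滨/ example in the text, `mser(2, 1, 2)` is clipped from 3/2 down to 1.0.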

Among the 19 sentences of each task, three were sampled for evaluation: the first sentence, the middle (10th) sentence, and the last (19th) sentence. We calculated the MSER for each of them; see Table 4 for details. The MSERs of the segmentation results of these sentences are all very low (< .05), with a mean of only .013 (SD = .004). This means the resultant dataset contains few errors and indicates that the data quality is good.

Table 4 Manual segmentation error rates (MSER) of the segmentation results of the eight tasks

Representation of word segmentation results

Characters are written symbols used to record linguistic units (e.g., morphemes, words, phrases, and sentences). The characters used to record Chinese sentences include Chinese characters, punctuation marks, numbers, and Latin letters. All the characters constitute a character set. A sentence (more precisely, a written sentence) can be treated as a string of characters which follows proper grammatical rules. A grammar is a set of rules which combine characters into sentences. A language is the set of all possible sentences given a character set and a grammar.

In a written sentence, after each character there is an interval, which we call a character interval. A character interval can be a word boundary, which indicates the end of a word and at the same time the start of the next word if one exists; or it can be a non-word boundary, located inside a word. Each character interval can be treated as a binary variable: when a character interval is a word boundary, it has the value one; when it is a non-word boundary, it has the value zero. See Huang et al. (2007) and Li and Huang (2009) for the source of this abstraction.

The sentence S, which consists of n (n > 0) characters C1, C2, C3, …, Cn, can be represented as follows:

$$ S={C}_1{I}_1{C}_2{I}_2{C}_3{I}_3\dots {C}_n{I}_n $$

I1, I2, I3, …, In represent character intervals. Suppose that {S1, S2, S3, …, Sn} is a set of sentences of the language L, and {H1, H2, H3, …, Hm} is a set of human subjects who speak L. An intuitive word segmentation task essentially requires each human subject to assign one to the character intervals of each sentence that are word boundaries and zero to those that are not, according to intuition.

The results of an intuitive word segmentation task can be summarized in a table in the format illustrated in Table 5. Each row of the table stores the segmentation results of one human subject, and each column stores the segmentation results of one sentence. The xth human subject segments the yth sentence, and the segmentation result is Rx,y. Since the sentence to be segmented is known, it is sufficient to represent the segmentation result by listing the values of the character intervals in the sentence. So Rx,y can be treated as a vector (i1, i2, i3, …) in which i1, i2, i3, … represent the values of the first, second, third, … character intervals. The number of components of the vector equals the length len(S) of the sentence to be segmented (in characters). We call this kind of vector a segmentation result vector (SRV). Normally, this table is analyzed column by column (hence sentence by sentence). The ith (i = 1, 2, 3, …) column stores the segmentation results of the ith sentence; it can be treated as an m × len(Si) matrix, which we call a segmentation result matrix (SRM).

Table 5 Results of an intuitive word segmentation task
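Converting a subject's slash-marked answer into an SRV, and stacking SRVs for one sentence into an SRM, can be sketched as follows; the function name is ours, and the input is assumed to already satisfy the prescribed answer format.

```python
def srv(answer):
    """Build the segmentation result vector (SRV) from a slash-marked
    answer string: one binary value per character interval, 1 at word
    boundaries, 0 elsewhere. Assumes a format-valid answer (no leading
    or doubled slashes, trailing slash present)."""
    vec = []
    for ch in answer:
        if ch == "/":
            vec[-1] = 1   # mark the interval after the previous character
        else:
            vec.append(0)  # default: non-word boundary
    return vec

# Two subjects' answers for the same five-character sentence become two
# rows of a 2 x 5 segmentation result matrix (SRM):
srm = [srv("这是/好/东西/"), srv("这/是/好/东西/")]
```

The vector length always equals the character length of the sentence, since every character contributes exactly one interval.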

Calculation of word intuition between/among Chinese speakers

Word segmentation agreement between human subjects reflects the agreement of word intuition between Chinese speakers, i.e., the extent to which Chinese speakers agree with each other on what a word is intuitively. Since there is no single best way to measure agreement, we used several metrics to provide more information: (1) proportionate agreement, (2) Cohen’s kappa, and (3) Fleiss’ kappa. Suppose that a and b are two human subjects, s is the sentence to be segmented with len(s) = n, and the SRVs generated by a and b on sentence s are Ra,s = (a1, a2, a3, …, an) and Rb,s = (b1, b2, b3, …, bn). We can use proportionate agreement and Cohen’s kappa to measure the segmentation agreement between a and b based on their SRVs. When there are more than two human subjects, we can measure the segmentation agreement of each unique subject pair and then inspect the distribution, or alternatively use Fleiss’ kappa to measure the overall agreement as a summary.

Proportionate agreement

The proportionate agreement between two SRVs is defined as the number of identical judgments between the two vectors divided by the total number of judgments n:

$$ PA\left({R}_{a,s},{R}_{b,s}\right)=1-\frac{\sum \limits_{i=1}^n{\left({a}_i-{b}_i\right)}^2}{n} $$

The range of PA(Ra,s, Rb,s) is [0, 1], where 0 means complete disagreement (0% agreement) and 1 means complete agreement (100% agreement).
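The PA formula above translates directly into code; since the SRV components are binary, the squared difference simply counts disagreements. The function name is ours.

```python
def proportionate_agreement(r_a, r_b):
    """PA between two equal-length binary SRVs: the fraction of character
    intervals on which the two subjects made the same judgment."""
    assert len(r_a) == len(r_b), "SRVs must cover the same sentence"
    n = len(r_a)
    # (a - b)^2 is 1 exactly when the binary judgments differ.
    return 1 - sum((a - b) ** 2 for a, b in zip(r_a, r_b)) / n
```

For example, two SRVs of length 5 that differ on one interval yield PA = 1 − 1/5 = 0.8.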

The crowdsourcing Chinese word segmentation experiment consists of eight sessions (i.e., eight crowdsourcing word segmentation tasks). Within each session, all subjects segmented the same set of 19 sentences (see Table 3 for the subject numbers of the eight sessions). Different sessions have different human subject groups (with possible partial overlap) and disjoint sentence sets.

For each session, we calculated the proportionate agreement for each unique human subject pair. A session has 19 sentences, so each human subject generates 19 SRVs. These SRVs are concatenated into one general SRV to represent the subject’s segmentation behavior, and the calculation is based on the general SRVs. See Fig. 1 for the distributions of the proportionate agreement statistics of the eight sessions, and Table 6 for summary statistics. Because of the existence of outliers (see Fig. 1), the medians summarize the statistics better than the means. In summary, the word intuition agreement measured by proportionate agreement ranges from 0.91 to 0.93 (M = 0.92, SD = 0.001).

Fig. 1

Word segmentation agreement between human subjects: proportionate agreement

Table 6 Summaries of proportionate agreement statistics of the eight sessions

Cohen’s kappa

Cohen’s kappa (Cohen 1960) is another metric, which measures the rating agreement between two raters who classify the same set of objects into several categories. It is considered more reasonable than proportionate agreement since it takes agreement by chance into account; for this reason, Cohen’s kappa is more conservative than proportionate agreement. It is calculated using the following formula:

$$ \kappa =\frac{p_o-{p}_e}{1-{p}_e} $$

in which “pₒ is the observed proportion of agreement, and pₑ is the proportion of agreement expected by chance” (Cohen 1968). See Cohen (1960) for the details of the definition and calculation of Cohen’s kappa. For interpretation, two subjects are in complete agreement when κ = 1 and in complete disagreement when κ ≤ 0. Landis and Koch (1977) provide a scheme for the interpretation of the κ statistic (see Table 7). It is worth noting that the authors pointed out that “these divisions are clearly arbitrary”, but “they do provide useful ‘benchmarks’”, and the scheme helps to “maintain consistent nomenclature when describing the relative strength of agreement associated with kappa statistics” (Landis and Koch 1977).

Table 7 Interpretation of kappa statistics
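For two binary SRVs, Cohen’s kappa can be sketched as follows; the observed agreement is PA, and chance agreement is computed from each rater’s marginal boundary rate. The function name is ours, and the degenerate case pₑ = 1 (both raters place boundaries everywhere) is not handled.

```python
def cohens_kappa(r_a, r_b):
    """Cohen's kappa for two equal-length binary SRVs. p_o is the observed
    proportion of agreement; p_e is the agreement expected by chance,
    derived from each rater's marginal probability of marking a boundary."""
    n = len(r_a)
    p_o = sum(a == b for a, b in zip(r_a, r_b)) / n
    pa1, pb1 = sum(r_a) / n, sum(r_b) / n          # marginal P(boundary)
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)        # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

On the same pair of SRVs, kappa is always lower than PA whenever agreement is imperfect, reflecting the chance correction.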

Following the same procedure as for proportionate agreement, we calculated Cohen’s kappa for the eight sessions. See Fig. 2 for the distributions of the Cohen’s kappa statistics of the eight sessions, and Table 8 for summary statistics. Because of the existence of outliers (see Fig. 2), the medians summarize the statistics better than the means. In summary, the word intuition agreement measured by Cohen’s kappa ranges from 0.82 to 0.86 (M = 0.84, SD = 0.02). According to Table 7, this means almost perfect agreement.

Fig. 2

Word segmentation agreement between human subjects: Cohen’s kappa

Table 8 Summaries of Cohen’s kappa statistics of the eight sessions

Fleiss’ kappa

The agreement metrics discussed above, proportionate agreement and Cohen’s kappa, measure the agreement between only two subjects. When measuring the agreement among three or more subjects, Fleiss’ kappa (Fleiss 1971) should be used instead. It is calculated using the following formula:

$$ \kappa =\frac{\overline{P}-{\overline{P}}_e}{1-{\overline{P}}_e} $$

\( 1-{\overline{P}}_e \) measures “the degree of agreement attainable over and above what would be predicted by chance” (Fleiss 1971), \( \overline{P}-{\overline{P}}_e \) measures “the degree of agreement actually attained in excess of chance” (Fleiss 1971), and the whole equation is a “normalized measure of overall agreement, corrected for the amount expected by chance” (Fleiss 1971). The interpretation of Fleiss’ kappa statistics accords with that of Cohen’s kappa, so Table 7 is applicable here too. See Table 9 for the Fleiss’ kappa statistics of the eight sessions. The mean of the Fleiss’ kappa statistics is 0.78 (SD = 0.02). According to Table 7, this means substantial agreement, very close to the threshold of almost perfect agreement (0.81). Because there are outliers in the word segmentation results (see Figs. 1 and 2) which reduce the Fleiss’ kappa statistics, we could interpret the Fleiss’ kappa value of 0.78 as almost perfect agreement.

Table 9 Fleiss’ kappa statistics of the eight sessions
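As a sketch, Fleiss’ kappa for an m-subject SRM (rows = subjects, columns = character intervals, cells = binary boundary judgments) can be computed as follows; the function name is ours, and m ≥ 2 is assumed.

```python
def fleiss_kappa(srm):
    """Fleiss' kappa for an m-subject segmentation result matrix with
    two categories (boundary = 1, non-boundary = 0). Requires m >= 2."""
    m, n = len(srm), len(srm[0])
    # Count the 'boundary' votes per character interval.
    ones = [sum(row[j] for row in srm) for j in range(n)]
    # Per-interval agreement P_i = (sum of squared category counts - m) / (m(m-1)).
    P = [(c * c + (m - c) * (m - c) - m) / (m * (m - 1)) for c in ones]
    P_bar = sum(P) / n                       # mean observed agreement
    p1 = sum(ones) / (n * m)                 # overall proportion of boundary votes
    P_e = p1 ** 2 + (1 - p1) ** 2            # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Three subjects producing identical SRVs yield κ = 1; any disagreement on an interval lowers that interval's P_i and hence the overall statistic.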

Role of semantic transparency in word intuition agreement

In order to examine the role of semantic transparency in the word intuition of Chinese speakers, we first extracted the most typical compound stimuli of each semantic transparency type from the compound stimuli of the Chinese word segmentation experiment (see Table 1), according to the semantic transparency rating data of a laboratory-based rating experiment (Wang et al. 2015). See Table 10 for the distribution of the extracted word stimuli across semantic transparency types and structural types.

Table 10 Distribution of types of most typical compound

Then, we checked the segmentation results of these compounds to see whether the transparent compounds have a higher probability of being explicitly segmented into two units than the opaque compounds. Tables 11, 12, 13, and 14 show the segmentation results for the four types of compound stimuli: transparent compounds (TT), two types of partially transparent compounds (TO and OT), and opaque compounds (OO). In each table, the first column (“Type”) indicates the semantic transparency type and structural type of the compound stimuli. The second column (“Compound”) lists the compound stimuli. The third column (“#Resp”) gives the number of valid responses received by the sentences containing the compound stimuli. Column four (“#AB”) shows how many times a compound stimulus was explicitly segmented as one unit (one word). Column five (“#A/B”) indicates how many times a compound stimulus was explicitly segmented as two units (two words). There can also be other segmentation results; column six (“#Other”) shows their counts (see Wang (2016: 145) for details).

Table 11 Segmentation results of typical TT compounds
Table 12 Segmentation results of typical TO compounds
Table 13 Segmentation results of typical OT compounds
Table 14 Segmentation results of typical OO compounds
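From these columns, the per-compound segmentation proportions can be derived directly. The sketch below is our own illustration: the helper name is an assumption, the #Resp and #AB figures for jiāngshuǐ and lóngyǎn are taken from the text, and the #A/B counts are placeholders.

```python
def segmentation_proportions(n_resp, n_ab, n_a_b):
    """Per-compound probabilities of one-unit and two-unit segmentation.

    n_resp: valid responses (#Resp); n_ab: segmented as one unit (#AB);
    n_a_b: segmented as two units (#A/B).
    """
    return n_ab / n_resp, n_a_b / n_resp

# #Resp and #AB for the two compounds are from the text;
# the #A/B counts below are placeholders for illustration only.
for name, (n_resp, n_ab, n_a_b) in {
    "jiangshui (TT)": (142, 141, 0),
    "longyan (OO)": (133, 129, 1),
}.items():
    p_one, p_two = segmentation_proportions(n_resp, n_ab, n_a_b)
    print(f"{name}: P(one unit) = {p_one:.2f}, P(two units) = {p_two:.2f}")
```

The first value returned is the word intuition agreement figure quoted in the text (e.g., 141/142 ≈ 0.99 for jiāngshuǐ).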

The probabilities of the four types of compound stimuli being segmented into two units are all very close to zero; we found no evidence to support the hypothesis that transparent compounds are more likely to be segmented into two units than opaque ones. For example, the word intuition agreement of the transparent compound jiāngshuǐ is 0.99 (141/142) and that of the opaque compound lóngyǎn is 0.97 (129/133); there is no significant difference. Based on our data, we cannot conclude that semantic transparency plays no role in the word intuition of Chinese speakers, but even if it does affect word intuition, its role is rather restricted. We also found some special forms, for example monosyllabic verb + disyllabic noun (没冷水 méi lěngshuǐ 'no cold water', 拿借款 ná jièkuǎn 'fetch borrowed money', 喝开水 hē kāishuǐ 'drink boiled water', 抽大麻 chōu dàmá 'smoke marijuana'), disyllabic noun + monosyllabic localizer (草地上 cǎodì shàng 'on the grassland', 天桥上 tiānqiáo shàng 'on the overpass'), and some other forms (脾气大 píqi dà 'violent-tempered', 天王地位 tiānwáng dìwèi 'superstar status', and 幕后黑手 mùhòu hēishǒu 'black hand behind the scenes'). These forms are usually treated as phrases in theory; however, they all show substantial probabilities of being treated intuitively as one word. These forms await further study.


Conclusions

We measured word intuition agreement among Chinese speakers via word segmentation agreement. Various metrics show that Chinese speakers agree with each other almost perfectly on what a word is. Measured by proportionate agreement, word intuition agreement between Chinese speakers is about 0.9 on average; measured by Cohen’s kappa, it is about 0.8∼0.9 on average; and measured by Fleiss’ kappa, it is about 0.8, a figure that would be higher if we further filtered out outliers. There are word intuition differences among Chinese speakers, but the differences are not as large as commonly thought. These statistics strongly support the psychological reality of the Chinese word and suggest that the concept of word in Chinese linguistics has a solid psychological foundation in the Chinese-speaking community. We also studied the role of semantic transparency in word intuition agreement; at least for the compounds we examined, we found no evidence of a semantic transparency effect on word intuition agreement. Although linguists debate the wordhood of semantically transparent forms composed of free forms, such as jiāngshuǐ, there is hardly any intuitive divergence among Chinese speakers. Such high word intuition agreement also suggests that it is quite feasible to formulate a definition of the Chinese word according to the collective word intuition of Chinese speakers. Such a definition would be quite different from the classic definition of word (i.e., “minimum free form”). In addition, the data collected in this study can probably serve as an annotated lexical resource to support computational word segmentation tasks in the future.
Last but not least, although kappa-based agreement measures cannot be directly compared with F-scores, the fact that word intuition agreement is about 0.9 among native speakers suggests that the pursuit of 0.97+ F-scores in current Chinese word segmentation bakeoff competitions could reflect overfitting rather than real improvements in methodology.