Introduction

Word embedding (also known as distributed word representation) denotes a word as a real-valued, low-dimensional vector. In recent years, it has attracted significant attention and has been applied to many natural language processing (NLP) tasks, such as sentiment classification [1,2,3,4,5], sentence/concept-level sentiment analysis [6,7,8,9,10], question answering [11, 12], text analysis [13, 14], named entity recognition [15, 16], and text segmentation [17, 18]. For instance, in some prior studies, the word embeddings in a sentence were summed and averaged to obtain the probability of each sentiment [19]. An increasing number of researchers have used word embeddings as the input of neural networks in sentiment analysis tasks [20]. This approach helps neural networks encode the semantic and syntactic information of words into embeddings and thus place words with the same or similar meanings close to each other in the vector space. Among the existing methods, the continuous bag-of-words (CBOW) method and continuous skip-gram (skip-gram) method are popular because of their simplicity and efficiency in learning word embeddings from large corpora [21, 22].

Due to its success in modeling English documents, word embedding has been applied to Chinese text. In contrast to English, where the word is the basic semantic unit, characters are considered the smallest meaningful units in Chinese and are called morphemes in morphology [23]. A Chinese character may form a word by itself or, more often, be part of a word. Chen et al. [24] integrated context words with characters to improve the learning of Chinese word embeddings. From the perspective of morphology, a character can be further decomposed into sub-character components, which contain rich semantic or phonological information. With the help of the internal structural information of Chinese characters, many studies have tried to improve the learning of Chinese word embeddings by using radicals [25, 26], sub-word components [27], glyph features [28], and strokes [29].

These methods enhanced the quality of Chinese word embeddings from two distinct perspectives: morphology and semantics. In particular, they explore the semantics of characters within different words through the internal structure of characters. However, we argue that such information is insufficient to capture semantics, because a Chinese character may represent different meanings in different words, and this semantic information cannot be completely drawn from its internal structure. As shown in Fig. 1, “辶” is the radical of “道”, but it can only represent the first meaning. “道” can be decomposed into “辶” and “首”, which are still semantically related to the first meaning. The stroke n-gram feature “首” appears to have no relevance to any of the meanings. Although Chen et al. [24] proposed to address this issue by incorporating a character’s positional information in words and learning position-related character embeddings, their method cannot identify distinct meanings of a character. For instance, “道” can be used at the beginning of multiple words but with distinct meanings, as illustrated in Fig. 1.

Fig. 1

Illustrative example of radical, components, and stroke n-gram of character “道”

Chinese characters represent a combination of pronunciation, structure, and meaning, which correspond to phonology, morphology, and semantics in linguistics. However, the aforementioned methods consider only the structure and meaning. Phonology describes “the way sounds function within a given language to encode meaning” [30]. In Chinese, the pronunciation of words and characters is marked by Pinyin, the official Romanization system for Standard Chinese [31]. The Pinyin system includes five tones, namely, the flat tone, rising tone, low tone, falling tone, and neutral tone. Figure 2 shows that the syllable “ma” can produce five different characters with different tones.

Fig. 2

Example of five characters with five tones from syllable ma

In contrast to English words, which typically have a single pronunciation each, many Chinese characters have two or more pronunciations. These are called Chinese polyphonic characters. Each pronunciation can usually refer to several different meanings. In modern Standard Chinese, one-fifth of the 2400 most common characters have multiple pronunciationsFootnote 1.

Let us take the Chinese polyphonic character “长” as an example. As Fig. 3 shows, the character has two pronunciations, “cháng” and “zhǎng”, and each pronunciation carries multiple meanings in distinct words. It is therefore possible to uncover different meanings of the same character based on its pronunciation.

Although a character may represent several meanings with the same pronunciation, we can identify its meaning by using the phonological information of the other characters within the same word. This is possible because the semantics of the character are determined once it is collocated with other pronunciations. Moreover, the position problem in character-enhanced word embedding (CWE) and multi-granularity embedding (MGE) can be addressed, because the position of a character in a word is determined once its pronunciation is considered. For example, when “长” is combined with “辈 (generation)”, they can only form the word “长辈”. In this case, “长” represents “old”. In another case, “长” and “年 (year)” can produce two different words, “年长” and “长年”. They can be distinguished by the pronunciation of “长”, which is “zhǎng” in the former and “cháng” in the latter. Hence, it would be helpful to incorporate the pronunciation of characters to learn Chinese word embeddings and improve the ability to capture polysemous words.

Fig. 3

Chinese polyphonic character “长”

In this study, we propose a method named pronunciation-enhanced Chinese word embedding (PCWE), which makes full use of the information carried by Chinese characters, including phonology, morphology, and semantics. Pinyin, the phonological transcription of Chinese characters, is combined with Chinese words, characters, and sub-character components as the context inputs of PCWE. We evaluate PCWE on word similarity, word analogy reasoning, text classification, and sentiment analysis tasks to validate its effectiveness.

Proposed Method

In this section, we describe the details of PCWE, which makes full use of the phonological, internal structural, and semantic features of Chinese characters based on CBOW [22]. We do not use the skip-gram method because the difference in quality between CBOW and skip-gram is insignificant, while CBOW runs slightly faster [32]. PCWE uses context words, context characters, context sub-characters, and context pronunciations to predict the target word.

Fig. 4

Structure of PCWE, where \(w_i\) is the target word, \(w_{i-1}\) and \(w_{i+1}\) are the context words, \(c_{i-1}\) and \(c_{i+1}\) represent the context characters, \(s_{i-1}\) and \(s_{i+1}\) indicate the context sub-characters, \(p_{i-1}\) and \(p_{i+1}\) denote the context pronunciation, and \(p_{i}\) is the pronunciation of \(w_i\)

We denote the training corpus as D, the vocabulary of words as W, the character set as C, the sub-character set as S, the phonological transcription set as P, and the size of the context window as T. As shown in Fig. 4, PCWE attempts to maximize the sum of four log-likelihoods of conditional probabilities for the target word, \(w_i\), given the average of the individual context vectors:

$$\begin{aligned} L(w_i) = \sum _{k=1}^4 \log p(w_i|h_{ik}), \end{aligned}$$
(1)

where \(h_{i1}\), \(h_{i2}\), \(h_{i3}\), and \(h_{i4}\) are the compositions of the context words, context characters, context sub-characters, and context phonological transcriptions, respectively. Let \(v_{w_i}\), \(v_{c_i}\), \(v_{s_i}\), and \(v_{p_i}\) denote the vectors of word \(w_i\), character \(c_i\), sub-character \(s_i\), and phonological transcription \(p_i\), respectively, and let \(\hat{v}_{w_i}\) be the predictive vector of the target word, \(w_i\). The conditional probability is defined as

$$\begin{aligned} p(w_i|h_{ik}) = \frac{\exp (h_{ik}^{\top } \hat{v}_{w_i})}{\sum _{n=1}^{N} \exp (h_{ik}^{\top } \hat{v}_{w_n})}, \quad k = 1,2,3,4, \end{aligned}$$
(2)

where \(N\) is the size of the word vocabulary \(W\), and \(h_{i1}\) is the average of the vectors of the context words:

$$\begin{aligned} h_{i1} = \frac{1}{2T} \sum _{-T \le j \le T, \ j \ne 0} v_{w_{i+j}}. \end{aligned}$$
(3)

Similarly, \(h_{i2}\), \(h_{i3}\), and \(h_{i4}\) are the averages of the vectors of the characters, sub-characters, and pronunciations in the context, respectively.
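
For concreteness, the composition and prediction steps in Eqs. (1)–(3) can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation; the toy vocabulary size, random initialization, and index values are assumptions, and the full softmax shown here is replaced by negative sampling during actual training (see the experiment settings).

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 1000, 200                       # toy vocabulary size and embedding dimension (assumed)
V_word = rng.normal(size=(N, dim))       # input vectors of words
V_hat = rng.normal(size=(N, dim))        # output ("predictive") vectors \hat{v}_w

def compose(context_ids, table):
    """h_{ik}: average of the context vectors of one feature type, as in Eq. (3)."""
    return table[context_ids].mean(axis=0)

def cond_prob(h, target_id):
    """p(w_i | h_{ik}): softmax over all output word vectors, as in Eq. (2)."""
    scores = V_hat @ h
    scores -= scores.max()               # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[target_id]

# toy context indices for words, characters, sub-characters, and pronunciations
contexts = [np.array([3, 17]), np.array([5, 9, 12]), np.array([7, 21]), np.array([2, 4])]
tables = [V_word] + [rng.normal(size=(N, dim)) for _ in range(3)]

target = 42
objective = sum(np.log(cond_prob(compose(c, t), target))
                for c, t in zip(contexts, tables))   # Eq. (1) for one target word
```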

A similar objective function was used in the joint learning word embedding (JWE) method [27]. However, there are key differences between JWE and PCWE. PCWE integrates the phonological information of words and characters with words, characters, and sub-character components to jointly learn Chinese word embeddings, whereas JWE incorporates only word, character, and sub-character component features. In other words, PCWE makes full use of context information from the perspectives of both phonology and morphology, whereas JWE uses only the morphological features. In addition, PCWE utilizes the pronunciation of the target word, whereas JWE does not.

Experiments

We tested our method in terms of word similarity, word analogy reasoning, text classification, and sentiment analysis. Qualitative case studies were also conducted.

Datasets and Experiment Settings

We employed the Chinese Wikipedia DumpFootnote 2 as our training corpus. In preprocessing, we used THULACFootnote 3 for segmentation and part-of-speech tagging. Pure digits and non-Chinese characters were removed. Finally, we obtained a 1 GB training corpus with 169,328,817 word tokens and 613,023 unique words. We used the lists of radicals and sub-character components from [27], comprising 218 radicals, 13,253 components, and 20,879 characters.
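
The preprocessing step can be sketched as follows. The paper does not specify which THULAC interface was used, so the thulac Python package calls and the filtering regular expression below are illustrative assumptions.

```python
import re
import thulac

segmenter = thulac.thulac(seg_only=False)   # word segmentation with POS tagging

def preprocess(line):
    """Segment a line and keep only tokens consisting of Chinese characters."""
    tokens = [word for word, pos in segmenter.cut(line)]
    # drops pure digits and tokens containing non-Chinese characters
    return [t for t in tokens if re.fullmatch(r"[\u4e00-\u9fff]+", t)]
```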

We crawled 411 Chinese Pinyin spellings without tones from an online Chinese dictionary websiteFootnote 4. Since the Pinyin system includes five tones, a total of 2055 Pinyin spellings were collected. We adopted HanLPFootnote 5 to convert Chinese words into Pinyin. Finally, we obtained 3,289,771 Chinese word-Pinyin pairs.
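
The construction of word-Pinyin pairs can be sketched as below. The paper uses HanLP for this conversion; the pypinyin package is used here only as an illustrative stand-in, and the tone-marked output format is an assumption.

```python
from pypinyin import pinyin, Style

def word_to_pinyin(word):
    # one tone-marked syllable per character, joined with spaces
    return " ".join(s[0] for s in pinyin(word, style=Style.TONE))

pairs = {w: word_to_pinyin(w) for w in ["长辈", "长年", "道路"]}
# e.g. {"长辈": "zhǎng bèi", "长年": "cháng nián", "道路": "dào lù"}
```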

PCWE was compared with CBOW [22]Footnote 6, CWE [24]Footnote 7, position-based CWE (CWE+P), MGE [26]Footnote 8, and JWE [27]Footnote 9. The same parameter settings were applied to all methods for a fair comparison. We set the context window size to 5, the embedding dimension to 200, and the number of training iterations to 100. We used negative sampling with 10 negative samples, and set the initial learning rate to 0.025 and the subsampling parameter to \(10^{-4}\) for optimization. Words with a frequency of less than 5 were ignored during training.
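
For reference, the shared settings can be summarized in a single configuration; the key names below are our own and do not correspond to any specific toolkit.

```python
# Training settings applied to all compared methods.
train_config = {
    "window_size": 5,               # context window T
    "embedding_dim": 200,
    "iterations": 100,
    "negative_samples": 10,
    "initial_learning_rate": 0.025,
    "subsampling": 1e-4,
    "min_word_count": 5,            # words with frequency < 5 are ignored
}
```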

Word Similarity

Table 1 Results on word similarity

This task evaluates the ability of embeddings to capture the semantic relatedness between two words. We adopted two Chinese word similarity datasets, Wordsim-240 and Wordsim-297, provided by Chen et al. [24]Footnote 10, for evaluation. Both datasets contain Chinese word pairs with human-labeled similarity scores: 240 pairs in Wordsim-240 and 297 pairs in Wordsim-297. However, 8 words in Wordsim-240 and 10 words in Wordsim-297 did not appear in the training corpus. We removed the affected word pairs to generate Wordsim-232 and Wordsim-287.

The cosine similarity of two word embeddings was computed to measure the similarity score of a word pair. We calculated the Spearman correlation [33] between the similarity scores computed from the word embeddings and the human-labeled similarity scores. A higher Spearman correlation indicates that the word embeddings better capture the semantic similarity between words. The evaluation results are listed in Table 1.
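
This evaluation protocol can be sketched as follows; the dataset format and the embedding lookup are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(pairs, embeddings):
    """pairs: list of (word1, word2, human_score); embeddings: dict word -> vector."""
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearmanr(model_scores, human_scores).correlation
```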

The results show that PCWE outperforms CBOW, CWE, CWE+P, and MGE on both word similarity datasets. This indicates that the combination of morphological, semantic, and phonological features exploits deeper semantic and phonological information of Chinese words than the other methods do. In particular, the comparison with CWE+P verifies the benefit of exploiting phonological features rather than position features to reduce the ambiguity of Chinese characters within different words. Although PCWE performs better than all the baselines on Wordsim-287, JWE achieves the best result on Wordsim-232. A possible reason is that Wordsim-287 contains more Chinese polyphonic characters, such as “行” (háng, xíng) and “中” (zhōng, zhòng).

Word Analogy Reasoning

Table 2 Results of word analogy reasoning measured in 3CosAdd
Table 3 Results of word analogy reasoning measured in 3CosMul

This task estimates the effectiveness of word embeddings in revealing linguistic regularities between word pairs. Given three words, a, b, and c, the task aims to find a fourth word, d, such that the relation of a to b is similar to that of c to d. We used 3CosAdd [34] and 3CosMul [35] to determine the nearest word, d. We employed the analogy dataset provided by Chen et al. [24], which contains 1127 Chinese word tuples. They are categorized into three types: capitals of countries (677 tuples), cities of states/provinces (175 tuples), and family words (240 tuples). The training corpus covered all the words in the analogy dataset. We used accuracy as the evaluation metric; the results are listed in Tables 2 and 3.
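
The two scoring functions can be sketched as follows; the shift of cosine values to the [0, 1] range in the 3CosMul branch follows Levy and Goldberg [35], and the data layout (a word list aligned with the rows of an embedding matrix) is an assumption.

```python
import numpy as np

def analogy(a, b, c, words, E, method="3CosAdd", eps=1e-3):
    """Return d such that a : b is similar to c : d."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)      # row-normalize: dot product = cosine
    idx = {w: i for i, w in enumerate(words)}
    sim_a, sim_b, sim_c = (E @ E[idx[x]] for x in (a, b, c))
    if method == "3CosAdd":
        scores = sim_b - sim_a + sim_c
    else:                                                 # 3CosMul, with cosines shifted to [0, 1]
        sa, sb, sc = ((s + 1) / 2 for s in (sim_a, sim_b, sim_c))
        scores = sb * sc / (sa + eps)
    for x in (a, b, c):                                   # exclude the query words themselves
        scores[idx[x]] = -np.inf
    return words[int(scores.argmax())]
```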

From the results of word analogy reasoning with the 3CosAdd measure, PCWE achieves the second-best performance, whereas JWE performs the best. Nevertheless, the embeddings learned by PCWE yield better analogies for the Capital and City types when scored with the 3CosMul function. This could be attributed to the fact that the words in these categories rarely contain Chinese polyphonic characters, while the morphological features provide sufficient semantic information to identify similar word pairs. To verify this hypothesis, we analyzed these datasets statistically: only 3.46% of the words in all tuples contain Chinese polyphonic characters that can affect the semantics. By contrast, the ratios for Wordsim-232 and Wordsim-287 are 15.73% and 10.98%, respectively. It can also be found that the results of PCWE are better than those of CWE and CWE+P, suggesting the benefit of leveraging the compositional internal structure and phonological information.

Text Classification

Table 4 Accuracy of text classification

Text classification is widely used to evaluate the effectiveness of word embeddings in NLP tasks [36]. We adopted the Fudan dataset, which contains documents on 20 topics, for trainingFootnote 11 and testingFootnote 12. Following [29], we selected 12,545 documents (6424 for training and 6121 for testing) on five topics: environment, agriculture, economy, politics, and sports. We averaged the embeddings of the words present in a document to obtain its feature vector and trained a classifier with LIBLINEARFootnote 13 [37]. The accuracy of the different methods is given in Table 4.
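
The document representation and classification step can be sketched as below; scikit-learn's LinearSVC (which is backed by LIBLINEAR) is used as an illustrative stand-in, and the tokenized document format is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

def doc_feature(tokens, embeddings, dim=200):
    """Average the embeddings of in-vocabulary words as the document feature."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classify(train_docs, train_labels, test_docs, test_labels, embeddings):
    X_train = np.stack([doc_feature(d, embeddings) for d in train_docs])
    X_test = np.stack([doc_feature(d, embeddings) for d in test_docs])
    clf = LinearSVC().fit(X_train, train_labels)
    return clf.score(X_test, test_labels)    # classification accuracy
```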

As shown in the table, all the methods achieve an accuracy of over \(94\%\), and our method performs the best. This is because the distinct semantics of characters with different pronunciations are captured by our method. For example, in documents on the topic of economy, the word “银行 (bank)”, which contains the polyphonic character “行”, is used frequently, and its pronunciation contributes more to the accuracy than its subcomponents. Therefore, PCWE outperforms the other baselines.

Sentiment Analysis

Table 5 Accuracy of sentiment analysis

Sentiment analysis [38,39,40,41] is one of the most important applications of word embedding [19, 42, 43], and it can also be used to evaluate the effectiveness of our method. With the growing volume of Chinese text, an increasing number of researchers have conducted Chinese sentiment analysis using Chinese word embedding methods [8, 44]. We chose the Hotel Reviews datasetFootnote 14 for this experiment and analyzed the sentiment of each document to evaluate our word embedding method. The dataset contains 2000 positive reviews and 2000 negative reviews. We randomly divided it into a training set (90% of the data) and a testing set (the remaining 10%). We built a bidirectional long short-term memory (BiLSTM) network to perform the sentiment analysis task [45, 46]. The average accuracy of the different methods is shown in Table 5.
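
A minimal sketch of such a classifier is given below, written with PyTorch purely for illustration; the hidden size, pooling strategy, and the choice to freeze the pre-trained embeddings are assumptions not specified in this section.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Binary sentiment classifier over pre-trained word embeddings."""
    def __init__(self, embedding_matrix, hidden_size=128, num_classes=2):
        super().__init__()
        weights = torch.as_tensor(embedding_matrix, dtype=torch.float)
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=True)
        self.lstm = nn.LSTM(weights.shape[1], hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)         # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(x)             # (batch, seq_len, 2 * hidden_size)
        return self.fc(outputs.mean(dim=1))   # mean-pool over time, then classify
```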

It can be observed that PCWE performs significantly better than the word embedding baselines when applied to sentiment analysis. When analyzing characters with more than one pronunciation, PCWE can capture the different meanings of the different pronunciations. For example, the negative reviews contain words such as “假四星酒店 (fake four-star hotel)”. In this phrase, the polyphonic character “假” is the core of the meaning, and its pronunciation is considerably more important than its sub-character components for Chinese sentiment analysis.

Case Study

Fig. 5

Case study for semantically related words. Given the target word, the top 10 similar words identified by each method are listed

Fig. 6

Case study for related Chinese characters and pronunciations. Given a Chinese character, the top 3 related pronunciations are listed. The correct Pinyin is in boldface

Fig. 7

Case study for related Chinese characters and pronunciations. Given a Pinyin, the top 5 related Chinese characters are listed

In addition to validating the benefit of using the phonetic information of Chinese characters to improve word embedding quality, we performed a qualitative analysis by conducting case studies that present the most similar words to given target words.

Figure 5 illustrates the top-10 similar words for two target words identified by each method. The first target word is “强壮 (qiáng zhuàng, strong)”. This word includes the Chinese polyphonic character “强”, which here describes a strong man or strong power. The majority of the top-ranked words generated by CBOW have no relevance to the target word, such as “鳍状肢 (flipper)” and “尾巴 (tail)”, because CBOW incorporates only context word information. CWE finds many words related to the characters constituting the target word (“强” or “壮”), which verifies the idea of CWE to jointly learn embeddings of words and characters. However, CWE also generates the word “瘦弱 (emaciated)”, which has the opposite meaning to the target word. MGE is the worst method, as it discovers words that are not semantically related to the target word, such as “主密码 (master password)” and “短尾蝠 (Mystacina tuberculata)”. By contrast, JWE yields words that are correlated with the target word, except “聪明 (smart)”. As the best method, PCWE identifies words that are highly semantically relevant to the target word. In addition, only PCWE generates the word “大块头 (big man)”, which is closely related to the target word. Overall, PCWE can effectively capture semantic relevance, as it combines the comprehensive information in characters from the perspectives of morphology, semantics, and phonology.

The other example is “朝代 (dynasty)”, which contains the Chinese polyphonic character “朝”. It refers to an emperor’s reign or a certain emperor of a pedigree. CWE generates irrelevant words such as “分封制 (the system of enfeoffment)” and “典章制度 (ancient laws and regulations)”, which fall under the theme of laws and institutions. CWE also identifies irrelevant words on the topic of calendars, such as “大统历 (The Grand Unified Calendar)” and “统元历 (The Unified Yuan Calendar)”. As in the first example, the words found by MGE appear to have no correlation with the target word. In contrast, the majority of the similar words generated by JWE are semantically correlated with the target word, although “妃嫔 (imperial concubine)” and “史书 (historical records)” are not relevant. For PCWE, all generated words are highly semantically relevant to the target word.

Because CBOW incorporates only context word information, it tends to generate words that merely co-occur with the target word rather than semantically related words. The learning processes of CWE, MGE, and JWE involve characters, radicals, and internal structures, so they tend to identify words that share characters, radicals, or internal structures but have no semantic relevance to the target word. Our proposed method, PCWE, exploits more complete information about characters by additionally considering their phonological features.

In addition to demonstrating the effectiveness of our method in encoding phonological information, we conducted a case study to explore the relationship between characters and pronunciations. We first analyzed the effectiveness of our method in capturing Chinese polyphonic characters by listing the Pinyin most related to a given character, as shown in Fig. 6. The results show that PCWE identifies correct pronunciations for every character. We also find that our method does not recover all possible pronunciations of the characters, because some pronunciations are rarely used. Figure 7 depicts the most relevant characters for a given Pinyin. The characters retrieved for the first three Pinyin are all pronounced as expected. For the fourth Pinyin, “shān”, only the character “山” is correctly identified; although the other characters are not pronounced as the target Pinyin, they are all semantically related to hills or mountains. The same situation occurs for the last Pinyin. The results show that our method can encode the polyphonic features of characters and explore semantically relevant characters.

Related Work

Most methods for word embedding learning are based on CBOW or skip-gram [21, 22] because of their effectiveness and efficiency. However, these methods regard words as basic units and ignore the rich internal information within words. Therefore, many methods have been proposed to improve word embeddings by incorporating morphological information. Bojanowski et al. [47] and Wieting et al. [48] proposed methods for learning word embeddings with character n-grams. Goldberg and Avraham [49] proposed a method to capture both semantic and morphological information, where each word is composed of vectors of linguistic properties. Cao and Lu [50] introduced a convolutional neural network-based method that learns word embeddings with character 3-grams, roots/affixes, and inflections. Cotterell and Schütze [51] proposed a method that learns morphological word embeddings from morphologically annotated data by extending the log-bilinear model. Bhatia et al. [52] introduced a morphological prior distribution to improve word embeddings. Satapathy et al. [53] developed a systematic approach called PhonSenticNet to integrate phonetic and string features of words for concept-level sentiment analysis.

In recent years, methods specifically designed for the Chinese language, in which characters are treated as the basic semantic units, have been studied. Chen et al. [24] utilized characters to augment Chinese word embeddings and proposed the CWE method to jointly learn Chinese character and word embeddings. Yang and Sun [54] considered the semantic knowledge of characters when combining them with context words. There are also methods that incorporate the internal morphological features of characters to enhance the learning of Chinese word embeddings. With a structure similar to that of [54], Xu et al. [55] weighted characters by their semantic similarity to the words they form. Yin et al. [26] proposed MGE by extending CWE with the radicals of target words. Considering that radicals cannot fully uncover the semantics of characters, Yu et al. [27] proposed the JWE method, which utilizes context words, context characters, and context sub-characters. Li et al. [25] combined context characters and their respective radical components as inputs to learn character embeddings. Shi et al. [32] decomposed the contextual character sequence into a radical sequence to learn radical embeddings. Su and Lee [28] enhanced Chinese word representations by learning character glyph features from the bitmaps of characters. Cao et al. [29] learned Chinese word embeddings by exploiting stroke-level information. More recently, Peng et al. [56] proposed two methods to encode phonetic information and integrate the representations for Chinese sentiment analysis. Unlike these studies, our method incorporates multiple features of Chinese characters in terms of morphology, semantics, and phonology.

Conclusion and Future Work

In this study, we propose a method named PCWE to learn Chinese word embeddings. It incorporates various features of Chinese characters from the perspectives of morphology, semantics, and phonology. Experimental results on word similarity, word analogy reasoning, text classification, and sentiment analysis, together with case studies, validate the effectiveness of our method. In the future, we plan to explore more comprehensive strategies for modeling phonological information by integrating PCWE with other resources [7] or methods [6], and to apply the core idea of the proposed method to other NLP tasks, including sentiment analysis [57], reader emotion classification [58], review interpretation [59], empathetic dialogue systems [60], end-to-end dialogue systems [61], and stock market prediction [62, 63].