Cantonese Natural Language Processing in the Transformers Era

Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures.In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models.We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: colloquialism and multilinguality.


Introduction
Cantonese, or Yue Chinese, is a diaspora language with over 85 million speakers all over the world (Lai, 2004; García and Fishman, 2011; Yu, 2013; Eberhard et al., 2022). 1 It is commonly used in colloquial scenarios (e.g., daily conversation and social media) but also in formal and written contexts, such as in the Legislative Council of the Hong Kong Special Administrative Region, or in sections of special local interests in the newspapers, such social and entertainment, or in horse racing and betting information.Otherwise Standard Chinese (SCN)2 , sometimes called Putonghua (普 通话) or Guoyu (國語), is generally favored in formal and written contexts (Luke, 1995; Lee, 2016; Li, 2017; Wong and Lee, 2018).
In terms of digital language support, Mandarin Chinese thrives with a mature Natural Language Processing (NLP) environment.Chinese NLP has a versatile and growing literature from major conferences, such as ACL and COLING.In contrast, as for digital language support Cantonese is at the vital level, one level lower than thriving (cf.Ethnologue) (Zhao et al., 2024b; Zhu et al., 2024).In fact, Cantonese is an rare exception as a main diaspora language, as most diaspora languages -including but not limited to Arabic, Chinese, English, French, Hindi, Japanese, Korean, Portuguese, Spanish, etc.-have both a thriving digital language support and a strong NLP community, while Cantonese does not (Li et al., 2023; Zhao et al., 2024a).
More specifically, while current NLP paradigms have been deeply changed by large-scale pretraining models based on Transformer architectures, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), ELECTRA (Clark et al., 2020), GPT-3 (Brown et al., 2020) and GPT-4 (Achiam et al., 2023), which have achieved state-of-theart (SOTA) level of performance on several tasks.Compared to the previous generation systems, the progress was particularly remarkable in task requiring fine-grained semantic understanding, such as textual entailment, question answering and causal reasoning (Wang et al., 2018, 2019; Zhao et al., 2023).On the other hand, language technologies for Cantonese have not yet benefited from this revolution (Xiang et al., 2022).From this point of view, the number of publications in the ACL Anthology is emblematic (see Figure 1): only 61 papers are related to "Cantonese", compared to 9,756 papers for English,and 5,312 (4,919 + 393) for SCN/Mandarin.
The history of publications in Cantonese NLP, as in Figure 1, shows that the numbers of papers published yearly remains in single digit, although there is a moderate increasing trend (cf. Figure 2).However, as an emergent language in NLP, it is surprising that only a small portion (17/61, 27.9%) introduces language resources, as shown by Table   1.This explains why Cantonese NLP has a problem in terms of scarcity of resources and lack of alignment to state-of-the-art practices.

Research Topics # of Papers
In light of these concerns, this paper presents a first overview of Cantonese NLP, going through essential issues regarding this language's uniqueness, data scarcity, research progress, and major challenges.As a pilot study, we also present some preliminary analysis on Cantonese data from social media and discuss the possible challenges.We found that, given the prominence of colloquial language and code-switching in the data, it is desirable that future models will be developed to properly deal with such phenomena.Finally, we conclude our contribution by indicating some possible directions for future research.

Corpora
Cantonese was perhaps the most documented Sinitic languages in early bilingual dictionaries compiled by western missionaries (Huang et al., 2016).Some Cantonese words were included in the first 'modern' bilingual Chinese dictionary compiled by Matteo Ricci at the end of the 16th century.The majority of the bilingual dictionaries published throughout the 19th century were, indeed, dedicated to Cantonese.Given the important role of Cantonese in the context of the encounter between China and the West, it is perhaps no surprising that the first Cantonese corpus was a bilingual one.Wu (1994) introduced the work on the HKUST Chinese-English Bilingual Parallel Corpus, based on the transcriptions from the Hong Kong legislative Council.The first monolingual Cantonese corpus was most likely the CANCORP (Lee and Wong, 1998), consisting of one million characters from Cantonese-speaking children in Hong Kong.Another important corpus for child language acquisition is the CHILDES Cantonese-English Corpus by Yip and Matthews (2007), containing both audio and visual data of children conversation and the related transcripts.
The Hong Kong Cantonese Adult Language Corpus (HKCAC) focuses instead on adult language and contributes speech recorded from phone-in programs and forums (Leung and Law, 2001).This corpus also presents speech transcriptions for a total of 170k characters.Another resource, the Hong Kong University Cantonese Corpus (HKUCC) (Wong, 2006) was collected from transcribed spontaneous speech in conversations and radio programs and its annotation include word segmentation, Cantonese pronunciation and parts-of-speech, covering approximately 230,000 words.Lee (2011) introduced a parallel corpus that aligns Cantonese and SCN at the sentence level for machine translation.The annotation materials are the transcriptions of Cantonese speeches from television shows in Hong Kong, and their correspond- ing Mandarin subtitles.The corpus contains 4,135 pairs of aligned sentences, with a total of 36,775 characters in Mandarin, and 39,192 in Cantonese.Wong et al. (2017) later published a small parallel dependency treebank for Cantonese and Mandarin, based on the same textual materials.The corpus consists, in total, of 569 aligned sentences and it is annotated with the Universal Dependencies scheme (De Marneffe et al., 2014; Nivre et al., 2016).Another corpus based on the transcripts of Hong Kong Cantonese movies has been presented by Chin (2015), and made accessible to the users via an online interface. 3poken Cantonese data from television and radio programmes broadcasted in Hong Kong are the source material also for the corpus introduced by Kwong (2015).The corpus covers different topics, such as politics, affairs, economics/finance, and food/entertainment, and a variety of textual typologies (interviews, phone call transcriptions, reviews etc.).The Hong Kong Cantonese Corpus by Luke and Wong (2015) includes 150,000 words, and it also consists of transcribed Cantonese speech recordings that are annotated with both segmentation and part-of-speech tags.Ng et al. (2017) proposed the first bilingual speech corpus of Cantonese and English, built with the goal of the assessment of correct Cantonese pronunciation.Finally, the most recent introduction is the MYCan-Cor corpus (Liesenfeld, 2018), which has been built with 20 hours of Cantonese speech recorded in Malaysia (plus the videos and the related transcriptions) to support studies on multimodal communication.
Concerning domain-specific resources, the parallel corpus by Ahrens (2015) 2020) is an openaccess corpus created specifically for translation tasks and contains bilingual speech conversations in Cantonese and English, for a total of 19 hours of conversation.The transcripts have been produced with the Google Cloud Speech-to-Text application, followed by manual corrections, orthographic alignment and phonetic transcriptions.
For corpus reading and preprocessing, Lee et al. (2022) recently introduced the PyCantonese package, which includes reader modules for some of the most popular Cantonese corpora (e.g. the CHILDES Cantonese-English Bilingual Corpus, the Hong Kong Cantonese Corpus etc.), stopword lists, modules for carrying out word segmentation and part-of-speech tagging, parsing and common computational tasks involving Jyutping (e.g.romanization of the characters).

NLP Benchmarks
The gap between Cantonese and other diaspora languages in NLP research and digital support is underlined by the scarcity of benchmark datasets specifically targeting Cantonese.A first example was the shared task for Chinese Spelling Check, which was conducted in co-location with the workshop on NLP for Educational Applications in 2017.The organizers published a benchmark dataset with 6,890 sentences for normalizing Cantonese, mapping from the spoken to the written form (Fung et al., 2017).Xiang et al. (2019) provided a sentiment analysis benchmark collected OpenRice, a Hong Kong catering website, where over 60k comments are labeled with 5-level ratings indicating sentiment scores.The authors anonymized the data, filtered out comments written in other languages (e.g.SCN, English) and limited the length of the examples to 250 words. 4hen et al. ( 2020) published a rumor detection benchmark collected from Twitter, including 27,328 web-crawled tweets (13,883 rumors and 13,445 non-rumors) written in Traditional Chinese characters, in part in Taiwanese Mandarin and in part in Cantonese5 .However, the dataset does not provide the information about the language in which a tweet has been written.
For text genre categorization, a benchmark has been collected by the ToastyNews project6 .The dataset consists of more than 11000 texts, divided into 20 different categories.The texts have been extracted from LIHKG, a popular Hong Kong forum with a structure similar to Reddit, and the category labels have been generated from the discussion threads they belong to.
Finally, for the development of dialogue systems, Wang et al. (2020) presented a food-ordering dialogue dataset for Cantonese called KddRES, including dialogues extracted from Facebook and OpenRice for 10 different Hong Kong restaurants.Using this dataset, it is possible to evaluate systems either on the classification of the intention of customer statements, or on sequence labeling tasks to identify the slot of interests of a conversation (e.g. the selected food, the number of people for a reservation, the time for take-out etc.).

Pilot Study for Cantonese
In the previous sections, we have illustrated the general scarcity of resources in NLP for Cantonese.We also mentioned that Cantonese has a numerous and active social media community, and Cantonese social media language provides an interesting example for analysis, as it can show the main challenges related to the automatic processing of this language.
As we anticipated, colloquialism and multilinguality are primary obstacles to robust and effective processing.In the next sections, we present an analysis of the two phenomena in Cantonese social media.

Colloquialism and Lexical Differences
In the introductory sections, we already discussed how the Cantonese vocabulary deeply diverges from SCN (Ouyang, 1993; Snow, 2004), and mentioned the fact that, due to the long tradition of all Sinitic languages sharing a written/formal strata (i.e.written Chinese), the divergence and challenges of Cantonese are in the spoken or informal strata.This include transcriptions of speech, as well as the habit in writing to adopt a colloquial style when dealing with topics of local interest, hence we refer to it as "colloquialism").In this section, we analyze the colloquial features of Cantonese, with some examples, and present some data from a small-scale study on word surprisal (Hale, 2001(Hale, , 2016)).To start with, we examined the data from three popular Cantonese online forums: DISCUSS, LIHKG, and OpenRice (Hong Kong). 7The first two are general forums with diverse topics, while OpenRice is s the most popular forum for sharing restaurant and food reviews.Table 2 shows the statistics of the forums, where the three sources altogether contribute 1.1 Gigabytes (G) texts and 0.924 billion (B) tokens.Just to give some figures for comparison, 80G texts and 16B tokens have been used for pre-training English models on tweets (BERTweet, Nguyen et al. ( 2020)), and 5.4B tokens have been used for a relatively small size model for SCN (MacBERT, Cui et al. (2021)).This would be, to the best of our knowledge, the largest social media text collection for pre-training a Cantonese model from scratch, although the data size is certainly smaller compared to other languages.

Data Source Token
One reason why it is challenging to directly apply or adapt SCN NLP models for Cantonese is the large number of Cantonese specific vocabulary and expressions, including words with unknown forms and words with known forms but with novel meanings.These discrepancies made the pre-trained models based on Mandarin ineffective for Cantonese NLP.In addition, due to the low degree of conventionalizing, spelling mistakes are prominent in the data, such as the mis-replacement of fan3 gaau3 訓覺 instead of fan3 gaau3 瞓覺 (sleep), together with intentional misspellings in jokes and punning, which are commonly found also in newspapers headlines (Li and Costa, 2009).
As in all social media texts, slang expressions and idioms are also frequently found, requiring external knowledge and background for the correct understanding, and most of such expressions are unknown in standard Chinese.Consider the following example: gam1 ci3 jin2 coeng3 wui2 hou2 naan4 maai5 dou3 fei1 ， keoi5 dou1 hai6 zap1 sei2 gai1 sin1 zi3 jau5 dak1 tai2 zaa3 。今次演唱會 好難買到飛，佢都係執死雞先至有得睇咋。 (It's extremely hard to buy tickets for the concert.He would not have a chance to go to the concert if he did not collect a lucky coin).There are at least two expressions that would be challenging to a SCN trained model.The first is the word 飛 'fare, ticket' , which is a phonetic borrowing as discussed above.A Mandarin trained model would treat it as the verb'to fly', with a different PoS and totally different behavior.The second is the expression zap1 sei2 gai1 執死雞 is a Cantonese idiom originated from football terminology, literally meaning 'to hold (a) dead chicken' , which is shared by Mandarin and Cantonese.However, in Cantonese, it also has the idiomatic meaning that was originally used in soccer 'scoring a goal with pure luck.'These two meanings in Cantonese cannot be obtained without either a comprehensive Cantonese lexicon of colloquial usages or a large training cor-pus.Without the prior knowledge of its extended meaning of "to get a great deal", even for humans it would be challenging to make sense of the sentence, not to mention NLP models.We studied the bigram distributions of DIS-CUSS, containing forum threads in 20 different topics, and compare it with the Gigaword corpus, which is composed of text from news outlets in Chinese (Huang, 2009; Parker et al., 2011).Both datasets concern contemporary and widelydiscussed events in diverse news topics and are written in traditional Chinese.For both datasets, we sampled 260 megabytes of textual data and computed the average frequency of the union of the top 1000 most frequent bigrams in the two datasets.The relative frequencies of the bigrams are shown in Figure 4. We can observe, at a glance, that the distribution of DISCUSS exhibits a high spike on the left, and then it has a long tail of low-frequency bigrams.Notice that, given the bigger size and the more standardized nature of GigaWord, the relative frequencies of many of the shared bigrams in the long tail are comparably higher.
To explore the predictability of Cantonese text by SCN models, we utilized two representative models to extract and compare surprisal scores for Cantonese sentences and the corresponding translations in Simplified and Traditional Chinese.We chose to use the BERT-CKIP model8 , which was trained on Traditional Chinese on a concatenation of a 2020 dump of the Chinese Wikipedia and the Chinese Gigaword Corpus (Huang, 2009; Parker et al., 2011); and the RoBERTa-HFL model9 , an implementation of RoBERTa by Cui et al. (2021).It has been trained on both Simplified and Traditional characters on a 2019 dump of the Chinese Wikipedia and various news and question answering websites.
The surprisal of a word w (Hale, 2001; Levy, 2008) is generally defined as the negative log probability of the word conditioned on the sentence context, according to the following: The higher the surprisal for a given linguistic expression, the more unpredictable that expression is for a given computational model.If a model instead is able to provide confident estimates of words occurring in a corpus, the surprisal will be low.
To run our small experiment, we adopted the implementation of the minicons library (Misra, 2022), which provides handy functions to estimate probability and surprisal scores of a sentence.We randomly sampled 50 sentences from the Cantonese forums in Section 4.1, and for each of them we generate the translation in both Traditional and Simplified Chinese using the Baidu translation interface10 .Then we computed the surprisal score for each sentence using the two SCN models, and took the average across sentences.The sampling was repeated 10 times (Table 3 reports the average across different samples).Notice that, since both BERT-CKIP and RoBERTa-HFL are bidirectional models trained, the surprisal scores for each word are computed by masking the words in the sentence one-by-one, computing their probabilities in context and then applying the formula in (1).Once the scores for single words are obtained, the minicons library outputs their average as the surprisal score for the sentence. 11e tested both Cantonese sentences and Taiwan Mandarin sentences from the Academia Sinica Corpus (Huang and Chen, 1992).Note that both Hong Kong and Taiwan use traditional characters with variations in lexical choices.Thus, our study was carried out in three different writing systems to ensure that the differences in writing systems do not contribute to the surprisal scores.Thus each set of data are tested in 1) original writing forms, 2) converted writing forms with each other (i.e.Hong Kong vs. Taiwan), and 3) converted to simplified Chinese.The results in Table 3 show that for both models and for three possible writing system settings (i.e.original, switched, simplified), the Cantonese sentences tend to have higher surprisal scores.The experiment establishes that it is more difficult for SCN trained models to predict Cantonese sentences.One of the reasons of the additional difficulties may be the usage of different words in Cantonese: we computed that, compared to the translated sentences, there is an overlap of characters of 69.1% for the Traditional Chinese translation and 65.5% for the Simplified Chinese one (i.e. more than 30% of the Cantonese characters do not appear in the translations).Still, given the relatively high overlap degree, it is likely that Cantonese-specific words play a role together with other factors, such as regional usages of the same words/characters and differences in grammar.
The two models behave very differently when the Cantonese text is translated into Simplified Chinese: RoBERTa-HFL, which is trained on both Traditional and Simplified characters, reports lower surprisal scores than on the original Cantonese sentences, and has a slightly higher score for the translation from Traditional to Simplified   To better understand the nature of multilingualism, we examine the contribution of different languages to Hong Kong social media data.The opensource toolkit fastlangid is employed to analyze the language usage ratio of the datasets 12 .More specifically, we used fastlangid with the default settings and the parameter k = 1, meaning that only the most likely language shall be detected.The percentages are shown in Table 4, where the statistics have been computed as an aggregation of sentence-level results.As it can be seen, the codeswitching behavior across Cantonese and SCN is frequent; English is also very often attested in our data 13 , and we can even observe code-mixing 12 https://github.com/currentsapi/fastlangid 13It should be kept in mind that English is still one of the primary languages in Hong Kong education.with other languages.This is because Cantonesespeaking areas happen to integrate speakers of multiple nationalities (Yue-Hashimoto, 1991; Li, 2006. To exemplify the multilingualism phenomenon in Cantonese, we present some typical codeswitching cases of Cantonese and English.The original texts are followed by the English translations in brackets.The switched scripts are underlined in both the original texts and the translations.The code-switching phenomenon in E1 is commonly observed in the data: the English nouns "offer" is directly taken and inserted in a Cantonese context.E2 uses "D" in the alphabet as an alternative to Cantonese tokens di1 "啲" (of ) and dim2 "點" (some) because of their similar pronunciations.For E3, "Q" is borrowed from Hokkien, another Chinese variety of the Southern Min group that is widely used in Fujian and Taiwan, and it means "chewy".The borrowing can be explained by the geographical proximity of the Cantonese and Hokkien speaking areas and by the constant migratory flows between the two regions.
In sum, our analysis shows how colloquialism and code-switching with multiple languages are pervasive in Cantonese social media data, and thus models for Cantonese NLP will have to be robust to such phenomena.For example, future Cantonese language understanding systems could be integrated with spelling correction and dialect identification components, in order to mitigate the irregularity of the input data.

Conclusions
In this paper, our goal is to present the status of the research on Cantonese NLP, to describe the uniqueness of this language and to suggest possible solutions for addressing the current shortcoming, due to the lack of resources.Indeed, most research on Cantonese NLP has not translated into the release of useful models, corpora and benchmark datasets, which are often not publicly available or not up to date.A possible reason of this difficulty is the limited number of online sources of Cantonese text with non-restrictive licenses (Eckart de Castilho et al., 2018), which does not leave too many options to researchers for putting together new benchmarks and for training large-scale models that are Cantonese-specific.
After reviewing the existing resources and methods, we analyzed the two main challenges that such data pose to automatic systems: the pervasive colloquialism and the multilinguality of Cantonese text, which often leads to the simultaneous presence of multiple languages in the same message or post.As strategies to tackle the challenges of Cantonese NLP, we could safely indicate data augmentation and crosslingual learning as two possible ways to go, in case the collection and balancing of large-scale Cantonese corpora turn out to be too problematic.
Cantonese is one of the most pervasive diaspora languages with native speaking communities spread around the world and has a vibrant and multicultural online community, and unique features that deserve a special attention for computational modeling.With our contribution, we hope we will manage to stimulate a new interest around this language in the NLP community, and to encourage future studies that will be devoted to resource sharing and to the reproducibility of the research results on public benchmarks.

Limitations
The main limitation of this work is that we only conduct our pilot study on limited number of domains since the textual data demands more efforts to clean.In future work, we plan to extend our study in more domains and more specifically focus on multi/cross lingual scenarios.

Figure 1 :
Figure 1: Number of publications in the ACL Anthology indexed by languages as of Mar 2024.The publications were retrieved via searching the language name in either the title or the abstract.

Figure 2 :
Figure 2: Yearly publications of the 61 papers for Cantonese NLP in the ACL Anthology from 1998 to 2024.

Figure 3 :
Figure 3: Outline of the survey.
includes 6 million words from political speeches from China, Hong Kong, Taiwan and USA, and it contains more than one million words of transcribed speeches of Hong Kong'leaders before and after the handover.It consist of more than 400k words in English, and more than 600k words in Chinese/Cantonese.Pan (2019) introduced a Chinese/English Political Corpus for translation and interpretation studies.With over 6 million word tokens, the corpus consists of transcripts of both Cantonese and Mandarin and their English translations.Lee et al. (2020) introduced a Counselling Corpus in Cantonese to research domain-specific dialogues: 436 input questions were solicited from native Cantonese speakers and 150 chatbot replies were harvested from mental health websites.The authors later extended their work by collecting another dataset used for text summarization and question generation (Lee et al., 2021), containing 12,634 post-restatement pairs and 9,036 post-question pairs, all with manual annotations.It also includes 89,000 unlabeled post-reply pairs collected from the online discussion forums in Hong Kong.Finally, the SpiCE corpus by Johnson et al. (

Figure 4 :
Figure 4: Distribution of bigrams from DISCUSS and Gigaword datasets.The x-axis shows the union dataset of the top 1,000 bigrams from each dataset ordered by the average relative frequency in the two datasets.The top curve refers to DISCUSS, the bottom one to Gigaword.

Table 1 :
Papers on Cantonese by research topic (statistics checked on Mar 2024).

Table 2 :
Scales of textual data from 3 different Cantonese forums (0.924 billion tokens and 1.1 Gigabytes size in total).

Table 4 :
Ratio of language usage.Cantonese and Standard Chinese are dominant in all the datasets under consideration.