The value of language databases to linguistic and especially psycholinguistic studies has been well-recognized and demonstrated by researchers in these fields. This article introduces a newly developed spoken language database, the Cantonese AphasiaBank (Kong & Law, 2010–2014), that has been available in the public domain since 2016. The database, containing discourse samples from unimpaired native Cantonese speakers and right-handed people with aphasia (PWA) living in Hong Kong of different ages and education levels, is the first resource of its kind in an Asian language, for studying spoken narratives. The PWA in the database include different aphasia types caused by stroke. Significantly, the database encompasses not only information on distinctive linguistic properties in discourse production, but also nonverbal behaviors (i.e., co-verbal gestures) produced by healthy speakers and PWA. The elicitation protocol follows the English AphasiaBank protocol (MacWhinney, Fromm, Forbes, & Holland, 2011), but with careful adaptation to the local Chinese culture. This corpus was established with the premise that it would provide the necessary foundation for aphasiologists and clinicians to design and conduct research investigations into theoretical and clinical issues related to acquired language disorders in Chinese. The overarching goal of constructing the database was to improve the planning of assessment and remediation procedures for Chinese-speaking PWA worldwide, including those living in North America, through actively sharing this multimedia database with any clinicians and researchers who work with Chinese speakers with acquired language deficits. However, since the Cantonese AphasiaBank contains both normal and disordered language data, we will demonstrate that it constitutes a rich resource for addressing various issues related to language behavior that are of interest to linguists, neurolinguists, and psycholinguists.

Generally speaking, corpora drawn from written texts outnumber those based on the spoken language. This is also the case for Chinese, as revealed in a survey of Chinese language corpora for research (Yang, 2006), in which about 70% of the listed corpora were constructed using a written source. The Cantonese AphasiaBank was designed to be distinguished from these databases in several ways. The scale of this corpus is larger than those of most corpora, in terms not just of the number of speakers, but also of the content, which includes both unimpaired language and aphasic discourse production. In addition, information on verbal language performance, ranging from the lexical to the clausal/sentential and discourse levels, is systematically captured and presented, along with the performance of co-verbal gestures that are synchronized with the discourse language production. Finally, unlike most previous databases, which have been based on open-ended conversations, the participants in this database performed identical sets of discourse tasks with the target contents controlled. As a result of these unique features, the Cantonese AphasiaBank allows for investigations by linguists, neurolinguists, and psycholinguists. Apart from the above features, the database also supports the education of student-clinicians in speech–language pathology.

To better appreciate how the Cantonese AphasiaBank is advantageous for research applications and, therefore, a new contribution to current resources, a brief introduction to stroke-induced aphasia and a review of existing spoken Chinese databases are in order. Stroke has been listed as one of the target diseases on the global agenda for prevention and control by the World Health Organization. The burden of stroke is particularly serious in Asian countries (Kim, 2014) because of their rapidly aging populations. The prevalence of post-stroke aphasia in the Indo-European populations is about 40% (Salter, Teasell, Foley, & Allen, 2013; Wade, Hewer, David, & Menderby, 1986). Applying this prevalence rate to Chinese speakers (due to the lack of comparable figures available for Asian countries), one can see a huge demand for language rehabilitation from these individuals. However, as compared to the rigorous research agenda for investigating aphasia in English and the evidence-based protocols for managing it, very few resources have been reported for use with Chinese-speaking PWA (Kong, 2017). Chinese PWA are one of the underrepresented minority groups listed by the National Institutes of Health. Traditionally, aphasiology research has employed methods of single-case studies, case series, or participants in groups of small sizes (Martin & Kalinyak-Fliszar, 2014; Willmes, 2007). Using big data to predict epidemics, address pathological deficits, and improve quality of life has become a growing trend in the healthcare industry, including the field of speech therapy or communication sciences and disorders (Faroqi-Shah, 2016). Many clinical as well as research questions can only be answered with data from substantial numbers of patients, their performances across different language tasks, and responses to individual test items. The Cantonese AphasiaBank project (Kong & Law, 2010–2014) was initiated to go beyond the conventional narrow-sampling approach of investigation.

The Chinese language family consists of seven major dialects, including Mandarin, Min, Hakka, Wu, Cantonese, Xiang, and Gan. According to a number of reviews of spoken Chinese databases (Chui & Lai, 2008; Leung & Law, 2001; Wang, 2001; Yang, 2006), there are currently a total of 13 in Mandarin and ten in spoken Cantonese. Note that Cantonese is the second most widely spoken dialect, with over 52 million speakers distributed over southern China and overseas Chinese communities; this dialect also differs from Mandarin in that it has both a spoken and a corresponding written form. Similar to corpora in other languages, Chinese spoken databases serve different purposes and may represent different registers or genres from different sources. Some consist of recordings of read single words, utterances, and passages from printed materials in Mandarin (e.g., Chou & Tseng, 1999) or Cantonese (e.g., T. Lee, Lo, Ching, & Meng, 2002) for developing speech recognition and synthesis technology; some contain scripted dialogues from television dramas in Mandarin (e.g., J. Lee, 2011) or Cantonese (e.g., Xu & Lee, 1998), whereas others contain speech of a more spontaneous nature from telephone conversations in Mandarin (e.g., Zhou, Li, Yin, & Zong, 2010), Cantonese radio programs (e.g., Leung & Law, 2001), and monologues such as storytelling in Mandarin (e.g., Chafe, 1980) or Cantonese (e.g., Chui & Lai, 2008). The above-mentioned databases were constructed solely from language samples of unimpaired speakers.
Moreover, video files capturing speakers’ performance during the time of speech sample collection were absent, since the majority of the language materials in these databases have been drawn from audio recordings. Furthermore, only two of these Cantonese adult corpora, namely the Hong Kong Cantonese Adult Language Corpus (HKCAC; Leung & Law, 2001; Leung, Law, & Fung, 2004) and the Hong Kong University Cantonese Corpus (HKUCC; Luke & Nancarrow, 1997; Wong, 2006a), have part-of-speech (POS) tagging.

In the rest of this article, we describe the background of building the Cantonese AphasiaBank. Specific search functions of this database, its multimodal display features, as well as specific challenges of and solutions to annotation of the database contents will also be illustrated. We will also demonstrate how Cantonese AphasiaBank can be used by researchers from different language disciplines as a research tool that can subsequently facilitate the management of Chinese-speaking PWA.

Cantonese AphasiaBank

There are two corpora in the Cantonese AphasiaBank: the Cantonese Corpus of Oral Narratives (CANON) and the Database of Speech and Gesture (DoSaGE). They can be accessed at www.speech.hku.hk/caphbank/search/. After registration, users can utilize all of the database contents they wish to analyze (see Fig. 1 for a screenshot of the “About us” screen).

Fig. 1 Screenshot of the Cantonese AphasiaBank “About us” screen

Cantonese Corpus of Oral Narratives

CANON differs importantly from HKUCC in a number of respects. Although HKUCC contains mainly conversations on a variety of topics among highly educated young and middle-aged Cantonese speakers and HKCAC contains conversations recorded from phone-in programs on the radio, data in Cantonese AphasiaBank are monologues collected from local native speakers of Cantonese residing in Hong Kong (a linguistically homogeneous city of China) balanced in gender, age, and education. To be specific, the first corpus, namely CANON, contains annotated orthographic and morphological information as well as romanized transcripts of 149 unimpaired speakers and 105 PWA. The language elicitation protocol included (1) description of a single color photo displaying the scene of rescuing someone in a flood, (2) description of a single black and white line drawing of “Cat Rescue,” (3) two sequential picture description tasks using two sets of black and white drawings—“Broken Window” and “Refused Umbrella,” (4) a procedural discourse task of describing how to prepare an “Egg and Ham Sandwich,” (5) telling of two stories—“The Boy Who Cried Wolf” and “Tortoise and Hare”—and (6) a personal monologue of an important event. For all PWA, the protocol also included (7) an additional monologue task of telling their “stroke story.” Only Tasks 1–5 were elicited using pictorial materials. If needed, task-specific probing questions were also given. Apart from the above narrative tasks, each PWA was administered language tests to assess repetition of words and phrases, noun and verb naming, and (non)verbal semantic skills, as well as the Action Research Arm Test (ARAT; Lyle, 1981) to quantify the degree of upper limb hemiplegia. The demographic information for both speaker groups is given in Tables 1 and 2.
Figure 2 shows the interface displayed to users for selecting a subject using the embedded filter features in this database (including “subject type,” “aphasia type,” “gender,” “age,” “education level,” and “ARAT score”).

Table 1 Distribution of unimpaired participants and persons with aphasia (PWA) by age and education subgroups
Table 2 Demographic information on persons with aphasia (PWA)
Fig. 2 Screenshot of the subject filter features and results in the Cantonese AphasiaBank

As one can see, CANON does not contain free language samples; that is, the target content of the spoken output was controlled to be the same across our participants. The advantage of task-specific data is to enable us to have control over the content such that one may study how language outputs differ as a function of age, gender, and education. The language samples were orthographically transcribed by two linguistically trained research assistants. The interrater reliability of the orthographic transcriptions was computed by randomly selecting 10% of the samples and double-checking them against the audio recordings by the project investigators. This reliability was found to be greater than 99%.
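The reliability check described above can be sketched as follows. This is only a minimal illustration, assuming that agreement is scored position by position over tokens; the article does not state the exact scoring procedure, and the function names and sample data are hypothetical.

```python
import random

def sample_ids(ids, proportion=0.10, seed=0):
    """Randomly pick a proportion of sample identifiers for double-checking."""
    ids = list(ids)
    k = max(1, round(len(ids) * proportion))
    return random.Random(seed).sample(ids, k)

def agreement(transcribed, checked):
    """Share of token positions on which two transcriptions agree.

    Assumption: tokens are compared position by position, and any length
    difference counts as disagreement.
    """
    longest = max(len(transcribed), len(checked))
    if longest == 0:
        return 1.0
    same = sum(a == b for a, b in zip(transcribed, checked))
    return same / longest

# Hypothetical identifiers for 254 speakers (149 unimpaired + 105 PWA).
picked = sample_ids(range(254))
# Made-up token lists, not taken from the corpus.
score = agreement(["個", "男仔", "踢", "波"], ["個", "男仔", "踢", "球"])
print(len(picked), score)
```

A token-based percentage such as this is one of several possible agreement measures; the figure reported in the text may have been computed differently.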

Database of Speech and Gesture

The second corpus in Cantonese AphasiaBank is DoSaGE, which contains digitized video recordings of the procedural discourse, storytelling, and personal monologues from 131 (of the 149) unimpaired speakers and 96 (of the 105) PWA in CANON. These videos are synchronized with the corresponding orthographic transcripts, using the EUDICO Linguistic Annotator (Lausberg & Sloetjes, 2009), with independent annotations of the forms and functions of all co-verbal gestures (for a list, see Table 3).

Table 3 List of gesture annotations in the Cantonese AphasiaBank

Markup, search, and annotation in CANON

This section describes the characteristics, structure, and annotation of the data in CANON. Note that the morphological tagging of parts of speech (POSs) in CANON is automatic, unlike the manual annotation in HKUCC, with specific tagging rules written for CANON to reflect morphological processes of Cantonese.

For each participant in the Cantonese AphasiaBank, the orthographic transcriptions of all language samples are combined into one single document with a code name. The identities and personal and demographic information of the subjects, as well as the times and dates of the individual recordings, are kept in a separate master file. Each narrative in the transcript begins with a line marking the name of the sample. Transcriptions are formatted using the Codes for the Human Analysis of Transcripts (CHAT; MacWhinney 2000). The transcripts in CANON are linked to audio and video recordings through a program named Computerized Language Analysis (CLAN; MacWhinney, 2003). CLAN allows one to carry out a variety of linguistic analyses, such as frequency counts, lexical diversity, and mean length of utterances (MLU), as well as searches for user-specified combinations of words, character strings, words in context, and so forth.
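As a rough illustration of the kinds of measures named above (this is not CLAN itself), the following Python sketch computes a token frequency count, a type–token ratio as one simple index of lexical diversity, and MLU in tokens over space-delimited utterances. The function name and the sample utterances are made up for illustration.

```python
from collections import Counter

def basic_measures(utterances):
    """Return (frequency Counter, type-token ratio, MLU in tokens).

    Each utterance is a string whose tokens are separated by spaces,
    mirroring the space-delimited tokenization used in the transcripts.
    """
    tokenized = [u.split() for u in utterances if u.strip()]
    tokens = [t for utt in tokenized for t in utt]
    freq = Counter(tokens)
    ttr = len(freq) / len(tokens) if tokens else 0.0
    mlu = len(tokens) / len(tokenized) if tokenized else 0.0
    return freq, ttr, mlu

# Hypothetical utterances, not taken from the corpus.
sample = ["個 男仔 踢 波", "跟住 個 波 打爛 咗 玻璃"]
freq, ttr, mlu = basic_measures(sample)
print(freq["個"], round(ttr, 2), mlu)
```

Here MLU is counted in tokens rather than morphemes, which is a simplifying assumption of the sketch.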

In addition to orthographic transcription, each lexical entry or token demarcated by a space is annotated for POS, an automatically generated phonetic transcription in Cantonese romanization, and an English gloss. For each transcript, segmentation of utterances largely follows the “one verb or clause per line” principle, except when a verb subcategorizes for a clause. Specifically, a new utterance (therefore a new line in CLAN) is formed (1) when a speaker restarts or partially repeats what s/he just said; (2) when a speaker switches to a new topic or when the topic contains more than a noun phrase (i.e., a clause); (3) when there is an interjection, connective, or filler between clauses; or (4) whenever the adverbial 跟住(呢) “then” is used. Examples of two utterances are given in Fig. 3, which (a) shows morphological annotation with unambiguous tagging and (b) illustrates the initial POS tagging, with some tokens having multiple tags separated by the symbol “^” (shown in shaded/highlighted text), and POS annotation after the transcript has been manually verified. In each example, the first tier represents the orthographic transcription, in which tokens are divided by a space. The second tier gives, for each token and in this order, the POS, the Cantonese romanization, and an English gloss.
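The second-tier annotation described above can be read programmatically. The sketch below assumes, purely for illustration, that each tag is written as `POS|romanization=gloss`; only the ordering of the three fields and the “^” separator between alternative tags come from the description above, while the `|` and `=` delimiters, the tag names, and the sample strings are hypothetical.

```python
from typing import List, NamedTuple

class Tag(NamedTuple):
    pos: str
    jyutping: str
    gloss: str

def parse_token(annotation: str) -> List[Tag]:
    """Split one annotated token into its candidate tags.

    Assumed layout (hypothetical): POS|jyutping=gloss, with alternative
    analyses of an ambiguous token joined by "^".
    """
    tags = []
    for candidate in annotation.split("^"):
        pos, _, rest = candidate.partition("|")
        jyutping, _, gloss = rest.partition("=")
        tags.append(Tag(pos, jyutping, gloss))
    return tags

def needs_manual_check(annotation: str) -> bool:
    """A token with more than one candidate tag still awaits disambiguation."""
    return len(parse_token(annotation)) > 1

# Made-up annotation strings for illustration only.
print(parse_token("n|maau1=cat"))
print(needs_manual_check("v|dim2=order^n|dim2=dot"))
```

Flagging tokens that still carry several candidate tags is one simple way to list what remains to be verified by hand.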

Fig. 3 Format and levels of annotation of the data in CANON

The morphological (POS) tagset in CANON has a total of 38 classes, listed in Table 4, which is fewer than the 54 tags in HKUCC, another adult Cantonese corpus mentioned earlier. Although the two tagsets share many common classes, as expected, the main differences lie in that (i) derivational and inflectional morphemes are distinguished in terms of their position of occurrence in HKUCC, but classified under the category of “affixes” in CANON; (ii) major content word classes—that is, nouns, verbs, and adjectives—are distinguished in terms of number of constituent morphemes and furthermore in internal structure for verbs in CANON, but not so in HKUCC; (iii) the classes of “short form,” “fixed expressions,” and “idioms” in HKUCC all fall in the category of “expressions” in CANON; and (iv) the classes of time, pronoun, numeral, verb, adjective, and noun morphemes in HKUCC do not have counterparts in CANON. Although seven of the eight narratives from each participant were elicited using specific stimuli, the variety of lexical forms in the language samples is still evident. A total of 4,450 entries are listed in the CANON dictionary.

Table 4 List of part-of-speech tags in Cantonese AphasiaBank

For the present study, we applied the MOR tagger, a computational system for the morphosyntactic analysis of the spoken language data in the CHILDES (http://childes.psy.cmu.edu) and TalkBank (http://talkbank.org) databases, for automatic annotation of POS. This procedure generally followed those listed in the English AphasiaBank project (MacWhinney & Fromm, 2016). According to MacWhinney (2012), the MOR grammars were built for Indo-European languages, such as English, Spanish, French, Italian, Dutch, and German, as well as for three Asian languages: Mandarin, Cantonese, and Japanese. Specifically, the Cantonese MOR tagger (http://talkbank.org/morgrams/) was designed on the basis of the statistical distribution of lexical items in Cantonese and then refined with contextual rules of Cantonese grammar. Although the MOR tagger reaches 98% accuracy for English adult corpora (MacWhinney, 2012), the tagger in CANON has a comparable but slightly lower tagging accuracy (based on a total of 132,024 tokens).

Users of the Cantonese AphasiaBank who want to conduct a quick and simple search for specific lexical information in the database transcripts can do so through the “Search by Word” tab. The search can be performed by specifying one or more of the following parameters: a particular Chinese character, a specific Chinese lexical item (made up of one or more Chinese characters), jyutping (a romanization system for Cantonese developed by the Linguistic Society of Hong Kong), part-of-speech tags of the transcription, an English gloss, and the narrative task(s) that contain the search item. Figure 4 displays a screenshot of the interface displayed to users for defining search criteria on this tab. The length of utterances to be displayed in the results can also be preset by adjusting the concordance length parameter.

Fig. 4 Screenshot of the “Search by Word” filter parameters in the Cantonese AphasiaBank

A search for a simple Chinese keyword, 仔 “son,” is illustrated here. This character, apart from acting as a noun “son” in oral narratives, can also serve multiple functions, such as a suffix or a constituent of a compound noun or proper noun. If a user is interested in knowing the total number of lexical items containing this character in the Cantonese AphasiaBank, the frequency of occurrence of each item in the database and its part of speech, and/or how each lexical item was used by PWA (vs. controls) across different narrative tasks, the simple “Search Keyword (Chinese)” result displayed in Fig. 5 demonstrates the following: (a) a total of 1,303 tokens of lexical items contain the character 仔, (b) these represent a total of 69 different lexical types with different POSs across the various narrative tasks within the database, and (c) the screenshot shows contexts for three of the 69 lexical types in the “Word” column—a suffix in 几仔 “small chair,” a compound noun constituent in 細路仔 “child” and 男仔 “boy,” and a single noun in 仔 “son.” Note that this display of frequency of occurrence is different from the usual notion of frequency count in most databases of open-ended content, which is supposed to reflect the extent of a word’s usage. The textbox simply displays a list of the utterances in which the keyword 仔 appears. The tags (containing extra information pertaining to POS, jyutping, and gloss) can be removed by clicking the “Show/hide Tags” button. Each of these lines can also be clicked to further examine the language sample in which the target keyword is found. Note that there are four different matching modes—namely “Exact,” “Starts with,” “Contains,” and “Ends with”—that can be selected from the “Match Mode” drop-down menu for result filtering. In particular, the default mode of “Exact” search will only include a list of utterances in which the exact same keyword appears in the database.
For “Starts with,” any lexical items starting with the keyword entered will be identified. “Contains” mode will search for any lexical items that contain the keyword entered, and “Ends with” mode will identify any lexical items ending with the keyword entered. All users can then download their search results in the format of a Microsoft Excel worksheet. Importantly, these search results can be useful for linguists and/or psycholinguists who have specific research questions to address or hypotheses to test, and find our data collection methods and data presentation suitable for their purposes.
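The four match modes, together with the distinction between tokens and lexical types, can be illustrated with a short Python sketch. It is not the search engine behind the website; the function names and the token list are hypothetical, and only the mode names are taken from the interface described above.

```python
def matches(item: str, keyword: str, mode: str = "Exact") -> bool:
    """Apply one of the four match modes to a single lexical item."""
    if mode == "Exact":
        return item == keyword
    if mode == "Starts with":
        return item.startswith(keyword)
    if mode == "Contains":
        return keyword in item
    if mode == "Ends with":
        return item.endswith(keyword)
    raise ValueError(f"unknown match mode: {mode}")

def search(tokens, keyword, mode="Exact"):
    """Return (number of matching tokens, number of distinct matching types)."""
    hits = [t for t in tokens if matches(t, keyword, mode)]
    return len(hits), len(set(hits))

# Made-up token list for illustration; not the real corpus counts.
tokens = ["仔", "細路仔", "男仔", "几仔", "男仔", "仔細"]
for mode in ("Exact", "Starts with", "Contains", "Ends with"):
    print(mode, search(tokens, "仔", mode))
```

In this sketch an item that equals the keyword also satisfies the other three modes, so “Contains” returns the broadest set of results.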

Fig. 5 Screenshot of “Search by Word” results in the Cantonese AphasiaBank, with the corresponding keywords highlighted. The code “BroWn” in the text box indicates that these sentence lines were found in the sequential picture description task of the “Broken Window” drawings

Search and display of materials in DoSaGE

The video recordings of the subjects performing all discourse production tasks can be viewed in the “Browse Videos” tab, where a simple search form (drop-down menu of “Subject type,” “Aphasia type,” and “Task”) can be found. The video will be played on the left while a moving window of transcriptions is displayed on the right. The current sentences spoken by the subject in the video are highlighted in yellow. For the tasks of procedural discourse, storytelling, and personal monologue, synchronized annotation information about the gestures employed by the speaker is available and highlighted in orange (see Fig. 6). In other words, the video, display of language transcriptions, and gesture annotations are time-framed (with information on duration of pauses) and synchronized. The detailed multilevel and multimodal performance of the subject’s discourse production is, therefore, readily available for inspection. The transcribed texts can also be clicked for the video to skip to the exact position a user desires to view. Finally, a watermark with the user’s login e-mail address, IP address, and the time and date of viewing is automatically generated, to avoid unauthorized duplication of our database videos.

Fig. 6 Screenshot of viewing the recording of a subject in the Cantonese AphasiaBank, with synchronized display of video, language transcriptions, and gesture annotations

Specific challenges to building a Cantonese spoken corpus with POS annotation

There are many challenges in constructing a spoken corpus in Cantonese. We discuss below how we handle these difficulties and those that are unique to the language at every stage of the development of CANON, including challenges associated with transcribing and annotating spoken output of PWA.

Orthographic representation of colloquial morphemes

Cantonese is essentially a spoken language; therefore, the problem of orthographically transcribing colloquial morphemes in the language is obvious. Although written forms for these items can be found in the popular culture, such as local magazines and comic books, there is no standardization. To deal with this problem, we consult online Cantonese dictionaries including 粵語審音配詞字庫 (Chinese Character Database with Word Formations) at the Chinese University of Hong Kong (http://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/) and CantoDict version 1.4.2 (www.cantonese.sheik.co.uk/scripts/wordsearch.php?level=0). For morphemes not found in these sources, Cantonese Romanization would be used. Because CLAN supports Unicode, colloquial characters not available in Unicode are represented either in romanization or by an orthographically similar homophonous character; for instance, the progressive aspect marker , which cannot be found in Unicode, is written as 緊 in CANON.

Orthographic transcription of homophonic and homographic morphemes, allomorphs, and fused syllables

Cantonese morphemes, like those of Mandarin, have a high degree of homophony, and many of them are written with the same characters, resulting in ambiguity in tagging. This is particularly problematic if the morphemes in question are highly frequent—for example, 嘅 [kE3] as a possessive marker, a relative clause marker, and a sentence final particle. Our interim solution has been to distinguish the different homographic homophonous morphemes with a numeric code following the characters: 嘅1, 嘅2, and 嘅3. Conversely, some Cantonese morphemes have alternative acceptable phonological forms—for example, [lIN1] and [lIk1] for “carry.” In such cases, the different characters, 拎 [lIN1] and 搦 [lIk1], would be used to represent the various forms as much as possible.
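A transcript that uses such numeric codes can be mapped back to plain characters when needed. The following Python sketch separates a token into its character(s) and an optional numeric code; the function names and the sample utterance are hypothetical, and the sketch does not assume which number stands for which grammatical function.

```python
import re

# One or more non-digit characters followed by a trailing numeric code.
CODED = re.compile(r"(\D+)(\d+)")

def split_code(token):
    """Return (surface form, numeric code or None) for one token."""
    m = CODED.fullmatch(token)
    if m:
        return m.group(1), int(m.group(2))
    return token, None

def surface_form(utterance):
    """Strip numeric homograph codes from a space-delimited utterance."""
    return " ".join(split_code(t)[0] for t in utterance.split())

# Made-up utterance for illustration only.
print(split_code("嘅2"))
print(surface_form("本 書 係 我 嘅1"))
```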

A related issue is the occurrence of fused syllables, which is considered a unique feature of Chinese phonology (Wong, 2006b). Different extents of fusion, from reduction to a single syllable to deletion of a coda or onset, can occur for different syllable strings. A single character would be used if available—for example, 咩 [mE15] in Fig. 3b for 乜嘢 [mAt1 jE5] “what,” in which the rime of the first syllable and the onset of the second are deleted.

Line segmentation and tokenization

Partly due to the lack of word boundaries in Chinese text, the definition of a word in Chinese is far from straightforward. Because the characters represent morphosyllables, our guiding principle was to treat each character as a separate token, except for affixation and compounds, which may appear in the categories of nouns, verbs, and adverbs. We consulted the online HKUCC dictionary for compounds in the language. After tokenization, the discourse is segmented into units for further analysis. Although some spoken corpora would define an utterance in terms of prosodic boundaries (e.g., Zhou et al., 2010), our segmentation is syntactically based, in anticipation of building a syntactic parser. A language sample is divided into separate lines in the transcript, with each one corresponding to a clause as much as possible.

POS of grammatical morphemes

One major difficulty in grammatically annotating a Cantonese corpus is POS classification, since less systematic and theoretical research has been conducted on Cantonese grammar than on Mandarin. As is known among Chinese linguists, determining the POS of grammatical morphemes is particularly challenging for the so-called verbal particles and affixes, because many of them (e.g., prepositions or co-verbs) originate from content words. We adhered to two principles when classifying candidates for the two classes. The content word status of a morpheme has been maintained unless it is a bound morpheme—for example, 士 [si6] “scholar” (somewhat equivalent to the English suffixes “-ist” in “psychologist” and “-er” in “philosopher”) in 護士 “nurse,” 師 [si1] “professional people” in 會計師 “accountant”—and/or it is semantically empty or detached from its origin—for example, 仔 [tsAI2] (“son” when standalone) as an affix meaning “diminutive” in 簿仔 “a little notebook,” 頭 [tHAU4] (“head” when standalone) “suffix without meaning” in 石頭 “rock-suffix.” As such, 落 [lOk6] “down” in 行落去 “walk-down-go” and 埋 [maI4] “approach” in 行埋嚟 “walk-approach-come” are treated as directional verbs, and 爛 [lan6] “broken” in 搣爛 “tear-broken” as the resultative component of a verbal compound, rather than as verbal particles.

Capturing morphological processes in Cantonese

Of particular interest to Chinese linguists, much of the effort and resources in developing CANON went into the automatic annotation of morphological processes. In addition to prefixation (一 “one” ➔ 第一 “first”) and suffixation (教 “teach” ➔ 教 “teacher”), there are also infixation (討厭 “annoying” ➔ 討厭 “really annoying”) and the insertion of an aspect marker (ASP), quantifier, or verbal particle in a verbal compound (瞓覺 “sleep” ➔ 瞓咗覺 “fallen asleep,” 出街 “go out” ➔ 出街 “all gone out,” 返學 “go to school” ➔ 返學 “able to go to school”). The cases of infixation and insertion are particularly challenging, and automatic parsing of these processes has not been dealt with previously in a Chinese corpus, as far as we know. To illustrate, while 瞓咗覺 “fallen asleep” is treated as an insertion of the perfective ASP 咗 in the verb compound 瞓覺 in CANON, it would be marked as verb + ASP + (verb) morpheme in Wong (2006b). The latter method seems to be ad hoc and treats the verbal compound inconsistently with and without ASP insertion.
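The idea of treating 瞓咗覺 as the compound 瞓覺 with an inserted aspect marker can be sketched in a few lines of Python. This is only an illustration of the analysis, not the rule set of the Cantonese MOR tagger; the small word lists, the function name, and the restriction to two-character compounds are assumptions made for the example.

```python
# Verbal compounds mentioned in the text; a real lexicon would be far larger.
COMPOUNDS = {"瞓覺", "出街", "返學"}
# Insertable elements and their labels; only the perfective marker is listed.
INSERTABLE = {"咗": "ASP"}

def analyse_insertion(surface):
    """Return (compound, inserted element, label) or None.

    Assumption: compounds consist of two characters and the inserted
    element sits between them.
    """
    if len(surface) <= 2:
        return None
    for compound in COMPOUNDS:
        if len(compound) != 2:
            continue
        if surface[0] == compound[0] and surface[-1] == compound[1]:
            inserted = surface[1:-1]
            if inserted in INSERTABLE:
                return compound, inserted, INSERTABLE[inserted]
    return None

print(analyse_insertion("瞓咗覺"))
print(analyse_insertion("瞓覺"))
```

Under this analysis the tag keeps the underlying compound as the lexical item and records the inserted element separately, so the compound is represented in the same way with and without the insertion.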

Besides affixation, reduplication is a prominent morphological process in Cantonese, which may involve yes/no question formation (or A-not-A question, 傲 “proud” ➔ 傲 “proud or not proud”), intensifying the meaning of an adjective (佢生得肥 “he is fat” ➔ 佢生得肥肥 “he is very fat,” or 佢做嘢不溜都穩陣 “he is always a reliable worker” ➔ 佢做嘢不溜都穩穩陣陣 “he is always a very reliable worker”; examples adapted from Gāo, 1984, pp. 60–63), adding a tentative aspect to a verb (睇 “to look” ➔ 睇 “have a look”), a progressive aspect to a verb (瞓覺 “to sleep” ➔ 瞓吓覺 “in the middle of the sleep”), or a distributive property to a classifier (條 “classifier for long and flexible object” ➔ 條條 “every-classifier”). These surface forms can be parsed in CANON. The morphological tags contain the underlying lexical item and the added meaning (see Table 5 for illustrations). In addition, the tagger can handle situations in which two morphological processes have taken place—for instance, prefixation and suffixation: 兔 “rabbit” ➔ 兔 “rabbit + diminutive” ➔ 兔仔 “endearing + rabbit + diminutive”; or prefixation and reduplication: 妹 ➔ 妹 “sister” ➔ 妹妹 “little + sister,” 掉轉 “turn around” ➔ 掉轉 “turn around + suffix” ➔ 掉轉頭 “turn around + aspect marker insertion + suffix.”
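Two of the reduplication patterns illustrated above, AA (肥肥 from 肥) and AABB (穩穩陣陣 from 穩陣), can be recognized with a very small Python function. It is a sketch under the assumption that the base has one or two characters; the function name is hypothetical, and the A-not-A and insertion patterns are not covered.

```python
def reduplication_base(surface):
    """Return the underlying base of an AA or AABB form, else None."""
    n = len(surface)
    if n == 2 and surface[0] == surface[1]:
        return surface[0]               # AA   -> A
    if n == 4 and surface[0] == surface[1] and surface[2] == surface[3]:
        return surface[0] + surface[2]  # AABB -> AB
    return None

for form in ("肥肥", "穩穩陣陣", "瞓覺"):
    print(form, reduplication_base(form))
```

In a full tagger the recovered base would then be looked up in the lexicon, so that accidental repetitions are not mistaken for reduplication.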

Table 5 Tagging of morphological rules

Annotating spoken output of PWA

One of the greatest challenges of annotating a corpus of aphasic speech output is the potential disagreement in parsing POS in the nonfluent or grammatically ill-formed sentences of PWA. As is detailed in Kong (2016), unifying the linguistic information, paralinguistic aspects of oral production, and nonverbal skills co-occurring in spoken narratives can be a daunting task, due to the complexity of this phenomenon. This is especially the case when output is produced by language-impaired people. The availability of audio and video files that correspond to the spoken content of PWA has greatly enhanced the inter- and/or intrarater consistency of the disambiguation process in annotation.

Potential contribution to research in linguistics, psycholinguistics, and neurolinguistics

The rich linguistic and prosodic data, as well as videos with information on nonverbal behaviors, extracted from the Cantonese AphasiaBank have proven to be extremely valuable for conducting linguistic, psycholinguistic, and neurolinguistic research in Chinese. MacWhinney and Fromm (2016), in a recent review article, summarized how the AphasiaBank in English facilitated 45 research investigations in various areas of aphasic productions: (1) lexical, grammatical, and discourse output; (2) fluency and syndrome classification of aphasia; and (3) gesture employment. In addition, effects of social factors of aphasia, such as PWA’s educational background, age, gender, and occupational status, as well as intervention effects on spoken language, have been explored. Here we illustrate how the Cantonese AphasiaBank may benefit researchers of different interests:

  1. Psycholinguists in recent years have been concerned with the neural mechanisms underlying language processing and representation. Compared with the corpus that contains only natural conversational speech output of four Mandarin speakers with aphasia (Packard, 1993), our database is of a much larger scale in terms of sample size and includes a wide range of PWA with different syndromes (or types). Each participant also produced samples spanning several discourse genres, arguably with different processing demands. How linguistic performance and nonverbal behaviors would vary as a function of variables such as Cantonese-speaking PWA’s age, gender, education level, or aphasia severity can be systematically analyzed. The fact that the Cantonese AphasiaBank features an unimpaired language dataset would allow users to compare performance between PWA and controls. For example, when contrasting the gestures of PWA and those of individuals without aphasia, videos and corresponding language samples with gesture annotations of age- and education-matched pairs of PWA and control participants can first be extracted from the database. Subsequent analyses in terms of the quantity of gestures and how their forms and functions are affected by aphasia can be conducted; this would allow a better understanding of whether and how speakers may employ gestures to enrich spoken discourse. A recent study utilizing a subset of the data bank to answer the above questions has been reported (Kong, Law, Wat, & Lai, 2015).

  2. Neurolinguists, on the other hand, may be interested in examining how neural substrates are associated with performance in language production and comprehension. The various properties of the Cantonese AphasiaBank described above, such as lesion data of PWA, co-verbal gestures, as well as discourse dysfluencies of repetitions, false starts, fillers, or self-corrections, constitute a rich source of information for studying theoretical processes of language production. Examples of possible questions include the impact of planning time on online language performance or effects of task demand on discourse performance (e.g., Lai, Law, & Kong, 2017).

  3. Linguists have traditionally examined issues addressing language behaviors at specific linguistic levels, such as phonology, morphology, syntax, semantics, or discourse. In recent years an approach to language quantification has been promoted that analyzes performance across linguistic levels within an individual and how they may relate to one another—namely, multilevel analyses (e.g., Marini, Andreetta, Del Tin, & Carlomagno, 2011; Milman, Vega-Mendoza, & Clendenen, 2014) and multimodal production (Linnik, Bastiaanse, & Höhle, 2016) of healthy speakers and PWA. The Cantonese AphasiaBank has been utilized in a series of studies looking at issues specific to Cantonese aphasia from the word to the discourse level. The major findings are summarized in Table 6. The application of multilevel analyses to data in the Cantonese AphasiaBank is currently under way.

Table 6 Current major findings of investigations using Cantonese AphasiaBank

We have demonstrated how the Cantonese AphasiaBank is instrumental to enhancing our knowledge of Chinese aphasia. The availability of language materials produced by healthy as well as impaired speakers will give researchers the opportunity to perform systematic group-based comparisons that have not been possible until now. Further exploration of the data in the Cantonese AphasiaBank by researchers, instructors, speech–language pathologists, related healthcare professionals, and student clinicians may facilitate the development of fundamental principles for managing Chinese aphasia now and in the future.

Conclusions

This article has described the background and processes involved in developing the Cantonese AphasiaBank. The large-scale database is distinguished from other existing spoken Chinese corpora in terms of the availability of spoken discourse and co-verbal gestures by speakers with and without aphasia, described by both verbal and nonverbal annotations. It promotes the use of open-source, Web-accessible corpora for studying spoken language in a variety of discourse types among native Cantonese speakers and PWA. The rich information from the database has led to a series of linguistic, psycholinguistic, and neurolinguistic studies. Further contributions of the corpora toward research and educational and/or clinical use will depend on their wider usage by researchers across disciplines.