1 Introduction

Effective biomedical communication with minority linguistic groups remains elusive, due to a lack of awareness regarding appropriate means to express various medical concepts. Natural language processing (NLP) can provide a solution through automating the identification of equivalent expressions in documents composed in the language, especially where these documents are translations of English or other majority-language originals. Named entity recognition can be exploited for the development of translation standards for medical terminology, which are necessary for medical practitioners and interpreters to ensure effective communication. Likewise, information extraction involving named entities is an important step in the development of question-answering systems for minority-language speakers and automated translation services for practitioners who lack access to a qualified, licensed interpreter.

Nevertheless, biomedical NLP applications involving named entities generally remain limited to national languages, with a strong bias in favor of English, and several other official languages are at best minimally represented. This is due to a virtually non-existent supply of medically-annotated data that could be used in NLP contexts, a problem that naturally applies to the Hmong language.

The Hmong are an ethnic group of approximately 4.5 million native to Southeast Asia, with major population centers in China, Vietnam, Laos, and Thailand (Lemoine, 2005). In the past 35 years, a Hmong refugee diaspora has emerged in several developed Anglophone countries, including the United States, Canada, and Australia, with an ethnic population of more than 260,000 in the United States alone (Pfeifer et al., 2010).

The Hmong refugee community is markedly affected by communication problems between English-speaking medical practitioners and Hmong patients (Fadiman, 1998; Thornburn et al., 2012). This especially involves difficulties in finding appropriate terminology or other expression of medical concepts in the Hmong language (Johnson, 2002).

The Hmong Medical Corpus addresses this problem by providing an annotated corpus of medical information documents translated by native speakers in an official, government-recognized capacity from English originals. These documents are annotated for parts of speech (POS) as well as named entities (NEs) of several categories most affected by the communication issues mentioned above: body part/organ/organ component terms, names of diseases/syndromes, and names of signs/symptoms.

This paper makes three contributions: (1) the presentation of the construction of the Hmong Medical Corpus, including its part-of-speech and named entity tags; (2) the presentation of a viable, reproducible methodology for producing annotated biomedical corpora for resource-poor minority languages; and (3) the release of the corpus for general use.

The remainder of the paper comprises ten sections. Section 2 provides a brief review of related work, while Sect. 3 provides a typological overview of the Hmong language to orient the reader. This is followed by an overview of the corpus and data collection in Sect. 4, details of the annotation scheme in Sect. 5, the process of annotation development in Sect. 6, and details regarding the word position/POS tagger model in Sect. 7. Section 8 provides a brief description of the limitations of the project, while Sect. 9 gives corpus statistics. Section 10 gives an overview of corpus contributions, and Sect. 11 provides a summary of conclusions.

2 Related work

We briefly review corpora containing named entity annotations in the medical domain, focusing on both English and non-English corpora.

An array of English corpora involving named entity recognition has emerged in the last twenty years. An early corpus is GENIA (Kim et al., 2003), which provides annotations of biological entities. CADEC (Karimi et al., 2015) provides annotations for diseases, symptoms, drugs, adverse effects, and findings found in 1,253 medical forum posts. The NCBI corpus (Doğan et al., 2014) contains a set of 793 abstracts derived from PubMed with annotations for disease concepts and names. The CRAFT corpus (Bada et al., 2012) is a set of 97 biomedical journal articles with annotations for an array of named entity types. The CHQA corpus (Kilicoglu et al., 2018) is a set of 2,614 health-related questions from both web and email sources, with annotations for named entities and several other semantic types. The CHEMDNER corpus (Krallinger et al., 2015) includes 10,000 abstracts sourced from PubMed with annotations that focus on drugs and chemical compounds. The i2b2 Shared Tasks have also involved several corpora (Uzuner et al., 2010, 2011; Stubbs & Uzuner, 2015), which provide medical named entity annotations in a number of areas, including medications and their administration as well as health conditions associated with heart disease.

Several non-English corpora have also emerged in recent years. A recent Spanish corpus, the PharmaCoNER corpus (Gonzalez-Agirre et al., 2019), is a collection of 1,000 clinical case studies with annotations for chemicals, proteins, genes, and other biomedical or clinical substances. Two other major Spanish corpora, the DrugSemantics corpus (Moreno et al., 2017) and the IxaMedGS corpus (Oronoz et al., 2015), provide support for a range of named entity types, especially drugs and diseases. Recent French corpora include the MERLOT corpus (Campillos et al., 2018), which is comprised of 500 documents and provides named entity annotations for symptoms and disorders among several other categories, as well as the Quaero corpus (Névéol et al., 2014), which provides annotations for several medical categories based on the Unified Medical Language System (UMLS; Lindberg et al., 1993; Bodenreider, 2004). Other languages recently represented in medical corpora with named entity annotations include Chinese (Gao et al., 2019) and Romanian (Mitrofan et al., 2019).

To date, there have been no prior annotated medical corpora for a low-resource, minority language such as Hmong.

3 Typological overview

The Hmong language possesses a number of distinctive typological features that merit brief discussion to orient the reader.

Typologically, Hmong tends toward a one-to-one correspondence between syllable, morpheme, and word, with a relatively limited number of affixes and some compounding. For example, one relatively technical Hmong text sampled by White (2020) contains 805 grammatical words with 727 of these monosyllabic (at 90.3% of the total). Furthermore, Hmong possesses two phenomena that present a challenge for marking word boundaries as they are technically intermediate between a word and a phrase given their behavior in syntax: coordinate compounds and four-syllable elaborate expressions (White, 2020; cf. Wälchli, 2005). Hmong likewise contains a number of combinations that cohere as grammatical words but where the parts might be otherwise predicted to be independent words on their own, such as ib-tug ‘one-CLASSIFIER:ANIMATE’ (White, 2020; cf. Ratliff, 2009). The issues these phenomena raise and the solution pursued are referenced in Sect. 5 below.

In addition, as with a number of other Southeast Asian languages, Hmong exhibits a tendency toward a lack of obligatoriness, where grammatical categories (aspect, mood, etc.) are typically not obligatorily marked (Bisang, 2015, inter alia). There exists instead a high degree of reliance on pragmatic inference and multifunctionality, where the same form or construction may have several functions which are differentiated by context alone (Bisang, 2015). This presents a challenge to straightforward classification of parts of speech, and led to the method of asking critical questions to determine part-of-speech class in potentially ambiguous cases, as mentioned in Sect. 6 below.

Finally, several Hmong part-of-speech classes are relatively unusual. Adjectives as a cohesive part-of-speech class are a relatively limited set of eight words (Bisang, 1993), with the other property concepts (term following Post, 2008) represented by stative verbs; this affects the part-of-speech results for adjectives provided in Sect. 9 below. Hmong also contains classes common for Southeast Asian languages but relatively uncommon elsewhere such as nominal classifiers (Bisang, 1993; White, 2019), verb classifiers (Gerner, 2014), and localizers (Xiong & Cohen, 2005).

4 Overview of corpus and data collection

The U.S. state governments of Wisconsin and Minnesota have published a significant number of medical informational documents translated from English into Hmong by native speakers. The Hmong Medical Corpus is a collection of 105 of these documents, which cover medical, health insurance, and other health-related topics and were obtained through a web crawler where permitted. The documents were selected based on their coverage of disorders or pathogens, the associated symptoms, and their treatments. Given their focus on specific illnesses, these documents are all genre-specific to the biomedical domain.

The annotation process took the form of two components: combined word position/POS tagging and named entity tagging. First, we obtained combined word position and POS tags. As standard Hmong orthography places spaces between syllables rather than words, word segmentation is less than trivial. The word segmentation task was therefore treated as a sequential tagging task, as previously done for Vietnamese (Nguyen et al., 2006; Dinh et al., 2008), which exhibits the same syllable-based spacing. As the POS-tagging task is likewise a sequential tagging task, the two tasks were combined using two tags separated by a hyphen, as has been done for Chinese (Kruengkrai et al., 2009; Shao et al., 2017) and Vietnamese (Takahashi & Yamamoto, 2016; Nguyen et al., 2017). These combined tags were assigned by syllable rather than by word.

Second, named entity tagging took the form of one tag per syllable, based on three sets of labels derived from semantic types found in the UMLS Semantic Network: body part/organ/organ component terms, names of diseases/syndromes, and names of signs/symptoms.

5 Annotation scheme

The combined word position/POS tag scheme involves two components separated by a hyphen, where the word token annotation is the first element, and the POS tag the second. The word position portion takes one of three values: B (beginning of a word), I (inside of a word), or O (other). The POS tag portion takes one of the values in Table 1, based on the POS categories specifically identified in Hmong as part of ongoing analytical work. The tagset is significantly adapted from that of the Penn Chinese Treebank (Xue et al., 2005).

Table 1 POS tag categories

Because Hmong orthography exhibits syllable spacing, and because community practices show a tendency to experiment with word-based spacing while the category of “word” remains ill-defined in practice, POS tags in this scheme are specific to the morpheme rather than the word. For example, the verb sib txawv ‘differ from one another’ is composed of a verb-modifying derivational prefix sib– ‘RECIPROCAL’ and the verbal root txawv ‘differ’; this is tagged in the scheme as sib/B-AD txawv/I-VV.

An example of a sentence with the resulting annotations is as follows: Cia/B-VV tus/B-CL me/B-NN nyuam/I-NN nyob/B-VV hauv/B-LC tsev/B-NN es/B-CS txhob/B-AD mus/B-VV kawm/B-VV ntawv/B-NN ./O-PU (“Let the child stay at home and not go and study.”).
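The combined tags above can be read back into word-level units with a few lines of code. The following is a minimal sketch, not the project's own tooling: the function names `parse_tagged` and `group_words` are hypothetical, and the sketch assumes the `syllable/POSITION-POS` notation used in the examples above, with B or O opening a new unit and I extending the current one.

```python
from typing import List, Tuple

def parse_tagged(sentence: str) -> List[Tuple[str, str, str]]:
    """Split 'syllable/POSITION-POS' tokens into (syllable, position, pos) triples."""
    triples = []
    for token in sentence.split():
        # rpartition keeps any '/' inside the syllable itself on the left side.
        syllable, _, tag = token.rpartition("/")
        position, _, pos = tag.partition("-")
        triples.append((syllable, position, pos))
    return triples

def group_words(triples: List[Tuple[str, str, str]]) -> List[Tuple[str, List[str]]]:
    """Join syllables into words, keeping one POS tag per syllable (morpheme)."""
    words: List[Tuple[List[str], List[str]]] = []
    for syllable, position, pos in triples:
        if position == "I" and words:
            # 'I' continues the word opened by the preceding 'B'.
            words[-1][0].append(syllable)
            words[-1][1].append(pos)
        else:
            # 'B' (word start) and 'O' (e.g., punctuation) open a new unit.
            words.append(([syllable], [pos]))
    return [(" ".join(syls), tags) for syls, tags in words]

example = ("Cia/B-VV tus/B-CL me/B-NN nyuam/I-NN nyob/B-VV hauv/B-LC "
           "tsev/B-NN es/B-CS txhob/B-AD mus/B-VV kawm/B-VV ntawv/B-NN ./O-PU")
for word, pos_tags in group_words(parse_tagged(example)):
    print(word, pos_tags)
```

Because POS tags are kept per syllable, a word such as sib txawv retains both of its morpheme-level tags (AD and VV) after grouping.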

The named entity tags likewise have two parts separated by a hyphen: a position tag that indicates the syllable's position in the named entity and takes one of four values (B, I, E, or O, where E marks the end of the entity), and a named entity category selected (if the position is not O) from those found in Table 2.

The three categories as shown in Table 2 were selected for the following reasons: (1) the vast majority of documents selected had at least two of these categories robustly represented, and (2) these categories provide the best basis to create an automated question-answering system to improve Hmong community access to medical information—a longer-term goal of the Hmong Medical Corpus project.

Table 2 NE tag categories

Named entities of the above three categories can be nested in Hmong: for example, a disease name often contains a body part term, as in ntsws ‘lung’ in mob ntsws qhuav ‘tuberculosis’, or a symptom can include a body part term, as in tob hau ‘head’ in dias tob hau ‘be dizzy’.

This nesting is handled through the creation of separate files, where each contains NE labels of exactly one category. This allows for layering and combining of labels as appropriate for downstream NLP tasks. Other possible situations that could result in tag conflicts, such as overlap of named entities, did not present an issue.
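The layering of per-category files can be illustrated with a short sketch. It assumes each layer is a list of tags aligned one-to-one with the syllables; the function name `combine_layers` is hypothetical, the label `DSYN` for diseases/syndromes is an assumption based on the UMLS abbreviation (the corpus's own label is given in Table 2), and tagging a single-syllable entity with B is likewise an assumption.

```python
from typing import Dict, List, Tuple

def combine_layers(syllables: List[str],
                   layers: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Attach to each syllable every non-O tag it receives across the layers."""
    for name, tags in layers.items():
        if len(tags) != len(syllables):
            raise ValueError(f"layer {name!r} is not aligned with the syllables")
    combined = []
    for i, syllable in enumerate(syllables):
        tags = [layer[i] for layer in layers.values() if layer[i] != "O"]
        combined.append((syllable, tags))
    return combined

# mob ntsws qhuav 'tuberculosis' contains the body part term ntsws 'lung'.
syllables = ["mob", "ntsws", "qhuav"]
layers = {
    "DSYN": ["B-DSYN", "I-DSYN", "E-DSYN"],  # assumed disease/syndrome label
    "BPOC": ["O", "B-BPOC", "O"],            # assumed: single-syllable entity gets B
}
for syllable, tags in combine_layers(syllables, layers):
    print(syllable, tags)
```

Here ntsws carries two tags at once, one from each layer, which is the situation that a single-file, single-tag scheme could not represent.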

An extended sample from the Hmong Medical Corpus with both word position/POS tags and NE tags is provided in Appendix A.

6 Annotation development

The combined word position/POS tag annotations were developed through a two-stage process. In the first stage, a linguistic expert provided the initial POS tags; only one expert was available given the resource-marginalized status of the language. This process was guided by an ever-evolving annotation guidelines document, with revisions made to the full set of documents as necessary changes in the guidelines were identified. Potentially ambiguous cases were then checked with Hmong community collaborators, who were asked critical questions based on part-of-speech criteria specific to Hmong to determine the best possible tag. The linguistic expert then revised the tags based on the answers.

In the second stage, a tagger with a Bidirectional Long Short-Term Memory (BiLSTM; Schuster & Paliwal, 1997; Hochreiter & Schmidhuber, 1997) model architecture (described below) was trained on the documents tagged in the first stage. New documents were then tagged using the automated tagger, and these results were manually verified by the linguistic expert in consultation with Hmong community collaborators, as in the first stage. As new documents were tagged, the BiLSTM tagger was retrained with the new data, and the result applied to additional documents.

The named entity tags were generated based on the creation of curated lists for body part/organ/organ component terms, names of diseases/syndromes, and names of signs/symptoms. The curated lists of diseases/syndromes and signs/symptoms were obtained in raw form through an algorithm. These were initially drawn from semi-structured sections of those medical information documents that encoded this sort of information in a relatively consistent way across the documents. These lists were then modified and expanded manually to represent more general cases. The list for body part/organ entities was developed through the review of a range of Hmong dictionaries, published linguistic research, and consultation with Hmong community collaborators. These lists were then used to algorithmically tag the full corpus. The results were manually verified by an expert in the language.
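The list-based tagging step can be sketched as a greedy longest-match lookup over syllables. This is an illustration under stated assumptions rather than the algorithm actually used: the function name `tag_with_list` is hypothetical, the matching strategy (longest match first, left to right) is assumed, and single-syllable entities are assumed to receive a B tag.

```python
from typing import Iterable, List, Sequence, Tuple

def tag_with_list(syllables: Sequence[str],
                  terms: Iterable[Tuple[str, ...]],
                  category: str) -> List[str]:
    """Greedy longest-match tagging of syllables against a curated term list."""
    term_set = {tuple(term) for term in terms}
    max_len = max((len(term) for term in term_set), default=0)
    tags = ["O"] * len(syllables)
    i = 0
    while i < len(syllables):
        match_len = 0
        # Try the longest candidate span first, then progressively shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            if tuple(syllables[i:i + n]) in term_set:
                match_len = n
                break
        if match_len == 0:
            i += 1
            continue
        tags[i] = f"B-{category}"
        for j in range(i + 1, i + match_len - 1):
            tags[j] = f"I-{category}"
        if match_len > 1:
            tags[i + match_len - 1] = f"E-{category}"
        i += match_len
    return tags

syllables = "dias tob hau".split()  # 'be dizzy', which contains tob hau 'head'
print(tag_with_list(syllables, [("dias", "tob", "hau"), ("raws", "plab")], "SOSY"))
print(tag_with_list(syllables, [("tob", "hau"), ("ntsws",)], "BPOC"))
```

Running the tagger once per category yields one tag sequence per list. Since the match is purely on written form, any syllable sequence identical to a listed term is tagged regardless of its meaning in context, which is one reason manual verification is needed.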

The algorithm’s performance versus the final gold-standard annotations is presented in Table 3, both in terms of the correct choice of position tags for each named entity category (e.g., B-BPOC, I-BPOC, E-BPOC, O for the BPOC category) and overall. The relatively weaker performance for BPOC terms is likely due to the high number of homophones in Hmong involving pairs of named entity terms and non-terms, such as rau ‘(finger/toe)nail’ versus rau ‘six’, or siab ‘liver’ versus siab ‘be high’. SOSY terms are likewise affected since they often contain BPOC terms in Hmong; this is in addition to the homophone issue, such as with raws in raws plab ‘have diarrhea’ versus raws ‘be according to’.

Table 3 Performance of NE tagging algorithm compared against the final gold-standard annotations

7 Word position/POS tagger

The combined word position/POS tagger used to automate the annotation work described above was trained as a BiLSTM model with a hidden BiLSTM layer of size 256 and trained for 50 epochs. The models were trained using Word2Vec embeddings (Mikolov et al., 2013) of size 150 pretrained on the soc.culture.hmong Usenet Corpus (Mortensen, 2015), consistent with the approach first proposed by Wang et al. (2015). The early coding approach for creating the model was inspired in part by an approach by Ivanov (2018).

The set of hand-annotated documents from the first stage of tagging served as the initial training set, and the model was retrained with additional data from subsequent documents after expert checking of the tags, as described in Sect. 6 above. For reference, performance metrics of the final version of the BiLSTM model on the 11th (final) document appear in Table 4 below. “Non-predicted” and “Non-true” values of 0 and 1 are specified for those cases where the model predicted a label not truly present in the document or failed to predict a label that was present, which would otherwise result in zero division when calculating the Macro Precision and Macro Recall scores.
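The zero-division issue can be made concrete with a small sketch. It rests on one reading of the description above, namely that an undefined per-label precision (label never predicted) or recall (label never truly present) is replaced by a fixed value of either 0 or 1 before macro-averaging; the function name `macro_precision_recall` and its `zero_division` parameter are hypothetical and are not taken from the project's code.

```python
from typing import Sequence, Tuple

def macro_precision_recall(true: Sequence[str], pred: Sequence[str],
                           zero_division: float) -> Tuple[float, float]:
    """Macro-averaged precision and recall with a substitute for 0/0 cases."""
    labels = sorted(set(true) | set(pred))
    if not labels:
        raise ValueError("no labels to score")
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
        n_pred = sum(1 for p in pred if p == label)  # times the label was predicted
        n_true = sum(1 for t in true if t == label)  # times the label truly occurs
        # Precision is undefined if the label was never predicted;
        # recall is undefined if the label never truly occurs.
        precisions.append(tp / n_pred if n_pred else zero_division)
        recalls.append(tp / n_true if n_true else zero_division)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

true = ["B-NN", "I-NN", "B-VV", "B-JJ"]
pred = ["B-NN", "I-NN", "B-VV", "B-VV"]  # B-JJ is never predicted
print(macro_precision_recall(true, pred, zero_division=0.0))
print(macro_precision_recall(true, pred, zero_division=1.0))
```

In this toy case macro precision is 0.625 with a substitute of 0 and 0.875 with a substitute of 1, while macro recall is 0.75 either way, which shows how strongly the choice can affect macro scores when rare labels are involved.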

Table 4 Performance metrics of the final version of the BiLSTM model on the final document

8 Limitations of the study

There are two limitations of the Hmong Medical Corpus project worthy of note. The first is the presence of “translationese” (see Volansky et al., 2015) in the corpus, due to translation from English originals. This is described in more detail in Sect. 8.1 below.

The second is the restriction of annotation to a single expert annotator, as stated in Sect. 6. This precluded the use of inter-annotator agreement as a metric, which would otherwise prove useful in evaluating the quality of the annotations (cf. Fort, 2016). No additional community-based annotators could be found despite extensive effort and collaboration with two community organizations, though several members of the community were willing to provide collaborative effort as described in Sect. 6 above.

8.1 Issue of translationese

As a corpus that has been primarily translated from English, there is the expectation that some “translationese” phenomena would be present. Ideally, one would pursue a computational analysis approach along the lines of Volansky et al. (2015), inter alia, to perform an analysis comparing native Hmong text with the translation-based text found in the corpus. The issue here, however, is the lack of a corresponding corpus of the same genre and register, meaning that quantitative comparisons between the only other publicly available Hmong corpus, the soc.culture.hmong Usenet corpus (Mortensen, 2015), and the Hmong Medical Corpus would produce numbers that fail to control for these other factors (cf. Baker, 1993).

However, some qualitative observations can be made. First, English phrases appear extensively in parentheses following their Hmong equivalents, as if to enhance reference or clarify a Hmong expression for the reader by repeating its English original. Examples include tus khaub thuas (Influenza) “lit., the influenza (Influenza)”, tshuaj tua kab mob (antibiotics) “lit., medicine [that] kills pathogens (antibiotics)”, and teb chaws Meskas Sab Hnub Tuaj Qaum Teb (Northeastern United States) “lit., North-East American country (Northeastern United States)”. Second, circumlocutions are used to translate English words without Hmong equivalents, sometimes in combination with the original English word in parentheses, as with tej chaw uas nyob ib ncig yus (environment) “lit., the places that are around you (environment)”. Overly literal translations that are not standard terms in Hmong also appear, such as cov pas dej loj (Great Lakes) “lit., the big lakes (Great Lakes)”. Parenthetical explanations in Hmong also appear with some uncommon expressions, as with av hmo ntuj (night soil) (cov av uas xyaw tib neeg cov quav) “lit., night soil (night soil) (the soil that mixes [with] people’s feces)”. The use of a vague Hmong expression with an English term for clarification also occurs, as with tshuaj pleev ib ce (lotions thiab cream) “lit., medicine [with which one] smears a body (lotions and cream)” and tus kab mob swimmer’s itch “lit., the pathogen swimmer’s itch”.

This is in addition to the widespread phenomena in the Hmong diaspora of calquing and code-switching in general (White, 2021).

9 Corpus statistics

The Hmong Medical Corpus is composed of 105 medical informational documents. Basic statistics regarding its content appear in Table 5, and those of the word position/POS-tagged subset in Table 6. The corpus contains 100,535 tokens, of which 10,152 belong to the word position/POS-tagged subset, where they form 8,152 full words.

Table 5 General statistics of the corpus
Table 6 Statistics of the POS-tagged subset
Table 7 Distribution of POS-tagged tokens

Table 7 provides total counts and percentages for the POS tags associated with syllable-based tokens in the POS-tagged subset of the Hmong Medical Corpus. The largest categories by far are common nouns (18.95%) and verbs (27.79%), while non-content word categories such as nominal classifiers (11.98%) and verbal modifiers (7.51%) comprise a significant portion of the total—a situation which reflects the general syntactic properties of Hmong. The large number of verbs (VV) and common nouns (NN) is not particularly surprising, given their relatively high frequency of use cross-linguistically. Classifiers (CL), on the other hand, are relatively frequent in Hmong as they have a wide range of uses, including to indicate definite reference (Simpson et al., 2011). Localizers (LC) have a relatively high frequency of use (2.46%) as compared to demonstratives (DT; 0.92%) given their greater semantic specificity as regards deictic reference. Of particular note, only two adjectives (JJ) appear among the POS-tagged documents in the corpus, given the unusual nature of Hmong adjectives as a part-of-speech class, as described in Sect. 3 above. Foreign words (FW) feature strongly (5.19%) as a part of the overall corpus, likely as a result of typical code-mixing that features in diasporic Hmong language as well as the number of English terms that are carried over in the translations into Hmong, as described in Sect. 8.1.

The total number of tagged named entities is shown by named entity type in Table 8. The number of tagged body part references is higher than the number of each of the other two semantic groups; this reflects the tendency for body part terms to appear in the names of diseases and symptom expressions in Hmong. The total of tagged named entities is provided in the table, while the amounts in the other categories cannot be combined into totals as an individual token can have more than one tag and thus a total would count the same token multiple times.

Table 8 Distribution of NEs by semantic type

10 Corpus contributions

The Hmong Medical Corpus contributes significantly to natural language processing efforts for low-resource languages in the following ways:

  (1) it is the first NE-annotated biomedical corpus for a non-official minority language;

  (2) it provides the first publicly-available POS-tagged dataset on the Hmong language;

  (3) it is the first biomedical corpus with named entity tags for Hmong, which will enable additional NLP work in the biomedical domain for the language;

  (4) it provides a successful means of how to handle the tokenization issues involving syllable-spacing in Hmong;

  (5) it is publicly available, both to search and to download; and

  (6) it represents an effective, replicable paradigm for the development of NE-annotated corpora for other minority and low-resource languages.

11 Conclusions

We presented the Hmong Medical Corpus, a new biomedical corpus for the Hmong language. The documents comprising the corpus contain annotations of two kinds: POS tags and named entity tags. This dataset represents the first time a publicly-available annotated corpus has been released for a non-official minority language in the biomedical domain. It addresses the prior lack of annotated data for the Hmong language, enabling a range of possible Hmong-specific NLP applications of either a biomedical or general nature. The Hmong Medical Corpus is publicly available to access and download online.