An efficient prototype method to identify and correct misspellings in clinical text
Misspellings in clinical free text present challenges to natural language processing. With the objective of identifying misspellings and their corrections, we developed a prototype spelling analysis method that implements Word2Vec, Levenshtein edit distance constraints, a lexical resource, and corpus term frequencies. We used the prototype method to process two different corpora, surgical pathology reports and emergency department progress and visit notes, both extracted from Veterans Health Administration resources. We evaluated performance by measuring positive predictive value and performing an error analysis of false positive output, using four classifications. We also performed an analysis of spelling errors in each corpus, using common error classifications.
In this small-scale study utilizing a total of 76,786 clinical notes, the prototype method achieved positive predictive values of 0.9057 and 0.8979 for the surgical pathology reports and the emergency department progress and visit notes, respectively, in identifying and correcting misspelled words. False positives varied by corpus. Spelling error types were similar between the two corpora; however, the authors of emergency department progress and visit notes made over four times as many errors. Overall, the results of this study suggest that this method could also perform sufficiently well in identifying misspellings in other clinical document types.
Keywords: Spelling analysis, Spelling correction, Clinical text, Word embeddings, Word2Vec
Abbreviations
CBOW: continuous bag of words
CDW: Corporate Data Warehouse
EDVP: Emergency Department Visit and Progress Notes
EMR: electronic medical record
SP: surgical pathology notes
VHA: Veterans Health Administration
VINCI: Veterans Affairs Informatics and Computing Infrastructure
Misspellings in clinical text are common, in some instances constituting 5% of all content, and over 17% of content addressing a particular domain. Spelling irregularities in clinical text exceed those in other types of text and can significantly affect natural language processing tasks such as part-of-speech tagging, drug extraction, information retrieval, and drug-drug interaction alerts. In their studies, Ruch et al. found that even a small number of misspellings, primarily consisting of common errors, adversely affected information retrieval in clinical text.
Word embedding (a technique mapping words to real-number vectors), facilitated by Word2Vec models, holds promise in identifying words with both correct and incorrect spelling forms. Word2Vec models implement neural networks with a single hidden layer to create word vectors, using either a skip-gram or continuous bag of words (CBOW) approach. The skip-gram approach predicts multiple words in a contextual window, given a single word, whereas CBOW predicts a single word given all the other context words in the window. The end product of either method is the identification of words found in similar contexts.
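The difference between the two approaches can be illustrated by the training examples each one derives from the same window. The following is a minimal sketch only: the function names and the example sentence are hypothetical, and it shows how inputs and labels are paired, not how Word2Vec itself is implemented.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the context words together form the input; the target is the label.
    return [(tuple(ctx), tgt) for tgt, ctx in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the target is the input; each context word is a separate label.
    return [(tgt, c) for tgt, ctx in context_windows(tokens, window) for c in ctx]


# Hypothetical example sentence, already tokenized.
sentence = ["patient", "denies", "suicidal", "ideation", "today"]
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)
```

A misspelling such as "sucidal" tends to occur with the same neighbouring words as "suicidal", so either pairing scheme pushes the two vectors toward each other.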
The Veterans Health Administration (VHA) is the largest integrated health care system in the world, providing care to over 8 million patients each year at over 1243 facilities. Efforts to computerize VHA data began in the 1970s, leading to the creation of VistA, one of the first electronic medical record (EMR) systems, and, as a consequence, the creation of a vast electronic clinical data resource, which VHA maintains in its Corporate Data Warehouse (CDW). These data are made available for research activities through the Veterans Affairs Informatics and Computing Infrastructure (VINCI), a secure platform enabling data research.
We created a prototype method to identify correct and incorrect spelling forms of words in clinical text, using Word2Vec, Levenshtein edit distance, the SPECIALIST Lexicon, and corpus word frequencies, with an eye toward NLP practitioners and their work. We hypothesized that a frequent spelling of a given word would be key in identifying its correctly spelled form, an idea already explored in prior research [12, 13], and that Word2Vec similarity values, edit distance constraints, and a lexical aid could enable identification of its misspellings. Our corpora consisted of randomly selected surgical pathology notes, and emergency department visit and progress notes, from VINCI. We tested our method on these two corpora in order to gauge performance on different types of content. Three annotators assessed output. We measured the positive predictive value, performed an error analysis, and analyzed the output to characterize misspellings according to common error types.
Data procurement and preparation
We extracted two corpora from the Corporate Data Warehouse, made available through VINCI: 50,000 randomly selected surgical pathology notes (SP) and 26,786 emergency department visit and progress notes (EDVP). To gain relevant information regarding word frequencies, we tokenized each corpus. We removed words appearing in a standard stoplist, which consisted of common function words (e.g., articles, prepositions). Tokens consisting of only upper case characters, containing digits, or consisting of fewer than four characters were also removed, and the remaining tokens were transformed to lower case. These combined actions removed non-information-bearing words. Superfluous punctuation was also removed. We then computed token frequencies in each corpus and stored the 1000 most frequent words from each.
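The filtering and counting steps above might be sketched as follows. This is an illustration under stated assumptions, not the study's code: the regular-expression tokenizer, the tiny stoplist, and the function names are all hypothetical stand-ins.

```python
import re
from collections import Counter

# Hypothetical mini stoplist; the standard stoplist used in the study is not reproduced here.
STOPLIST = {"with", "from", "that", "this", "were", "have"}


def keep_token(token):
    """Apply the removal rules described in the text to one raw token."""
    if token.lower() in STOPLIST:
        return False                      # common function word
    if token.isupper():
        return False                      # only upper case characters
    if any(ch.isdigit() for ch in token):
        return False                      # contains a digit
    return len(token) >= 4                # drop tokens shorter than four characters


def preprocess(text):
    # Assumption: splitting on non-alphanumeric characters also strips punctuation.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if keep_token(t)]


def top_terms(documents, n=1000):
    """Return the n most frequent surviving tokens with their counts."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts.most_common(n)


# Hypothetical note text, for illustration only.
note = "Patient presents with CHF and 2mg dose; patient denies pain."
tokens = preprocess(note)
```

The upper case test is applied before lower-casing, so abbreviations such as "CHF" are dropped while a sentence-initial "Patient" is kept.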
After preliminary testing to determine the most effective hyperparameters, we trained a Word2Vec model for each corpus using the Gensim Word2Vec library in Python, with a dimension of 500 in the hidden layer, the CBOW algorithm, and a maximum window size of 5 words around a given target word. In these models, we also limited processing to words that occurred at least five times in their respective corpus. Each model implemented 10 training epochs. We used the same hyperparameters to train each model.
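For readers wishing to reproduce a comparable configuration, the hyperparameters above map onto Gensim's `Word2Vec` constructor roughly as shown below. This is a sketch that assumes the Gensim 4.x parameter names (`vector_size` and `epochs`; older releases used `size` and `iter`). The helper `train_model` is hypothetical, and any settings not reported in the text (for example the random seed or number of worker threads) are left at library defaults.

```python
# Hyperparameters reported in the text, expressed with Gensim 4.x parameter names.
W2V_PARAMS = {
    "vector_size": 500,  # dimension of the hidden layer, i.e. of each word vector
    "window": 5,         # maximum distance between target and context words
    "min_count": 5,      # ignore words occurring fewer than five times
    "sg": 0,             # 0 selects CBOW; 1 would select skip-gram
    "epochs": 10,        # number of training passes over the corpus
}


def train_model(sentences, **overrides):
    """Train one model on an iterable of token lists (hypothetical helper)."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim to be installed
    params = {**W2V_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

With a trained model, a call such as `model.wv.most_similar("suicidal", topn=1000)` returns the nearest words with their cosine similarities, which is one way to obtain a ranked candidate list for a target word.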
Is the target word a valid, correctly spelled term?
Is the candidate a misspelling of the target word?
To answer these questions, annotators could use a reference standard, such as a dictionary or lexicon. Candidate terms could not be a valid spelling of another word. Inflections were regarded as different words, i.e., not the same as the given target word. An affirmative answer to both questions indicated a true positive finding. We calculated Fleiss’ Kappa to assess inter-annotator agreement. Disagreements were settled by majority rule. If an annotator suspected a candidate term was a true misspelling but was unsure (answering “maybe” or “?”), its usage was reviewed in the original corpus.
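As an illustration of the agreement statistic and the majority rule, the sketch below computes Fleiss' Kappa from per-item labels and resolves each item by majority. The ratings are invented for the example and do not come from the study; the function names are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """ratings: one list of labels per item, with the same number of raters for each item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    observed = agreement_sum / n_items
    # Agreement expected by chance, from the overall label proportions.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())
    return (observed - expected) / (1 - expected)


def majority_label(item):
    """Final classification for one item: the label chosen by most annotators."""
    return Counter(item).most_common(1)[0][0]


# Invented ratings from three annotators for four candidate pairs.
ratings = [
    ["TP", "TP", "TP"],
    ["TP", "TP", "FP"],
    ["FP", "FP", "FP"],
    ["TP", "FP", "FP"],
]
kappa = fleiss_kappa(ratings)
final = [majority_label(item) for item in ratings]
```

With three annotators and two possible labels, a majority always exists, so no tie-breaking rule is needed in this setting.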
Raw output for each corpus
The method produced greater output on the most frequent 1000 words from the EDVP corpus, despite its smaller size. In total, there were 235 potential variants identified in the EDVP corpus, as compared to 53 in the SP corpus. There was also a tendency for certain words to have multiple misspelled forms, especially in the EDVP corpus, where approximately 45% of the method’s output consisted of multiple misspelled forms for 38 words. In the SP corpus, approximately 30% of the output consisted of multiple misspelled forms for 7 words. In EDVP, there were 8 different variations of the word “presents”; in SP, there were 4 different variations of the word “received”.
The Fleiss’ Kappa scores were 0.533 for output from the SP corpus and 0.466 for output from the EDVP corpus, indicating that the annotators reached moderate agreement in annotating both outputs. The authors felt these scores were sufficient for this exploratory study. Because disagreements were resolved by majority rule, the final classifications of true or false positive for output were the opinions of at least two annotators.
The positive predictive values (True Positives/(True Positives + False Positives)) for the SP and the EDVP corpora were 0.9057 and 0.8979, respectively. More true misspellings were found in the EDVP corpus than in the SP corpus. Additional file 1 includes the true positive and false positive outputs. The following example illustrates an instance of the prototype method successfully identifying a misspelling. “Suicidal”, a correctly spelled term and frequent word in the EDVP corpus, was extracted as a target input word. It was matched to another word in the EDVP corpus, “sucidal”, because the two appear in similar contexts in the corpus, as shown by the Word2Vec word embedding similarity value of “sucidal” to “suicidal”. Because it was not in the SPECIALIST Lexicon, and was within the set Levenshtein edit distance, “sucidal” was correctly identified as a misspelling of “suicidal”. In output these were represented as a pair: “suicidal, sucidal”.
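The reported values can be checked against the counts given elsewhere in this article. The sketch below assumes that the false positive counts listed by category in the discussion (5 for SP and 24 for EDVP in total) account for all false positives among the 53 and 235 output pairs; the function name is hypothetical.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Output sizes from the text; false positive totals summed from the discussion
# (SP: 3 + 2, EDVP: 12 + 9 + 3). True positives are derived by subtraction.
sp_output, sp_fp = 53, 3 + 2
edvp_output, edvp_fp = 235, 12 + 9 + 3

sp_ppv = positive_predictive_value(sp_output - sp_fp, sp_fp)
edvp_ppv = positive_predictive_value(edvp_output - edvp_fp, edvp_fp)

print(round(sp_ppv, 4), round(edvp_ppv, 4))
```

Under that assumption, 48/53 and 211/235 reproduce the two reported figures to four decimal places.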
Error analysis of output (FPs)
Table: False positives’ types and frequencies, by corpus (surgical pathology notes; emergency visit and progress notes). Categories shown: different word, spelled correctly; misspelling of different word; alternative form of noisy term.
Characterizing misspellings in the corpora
Table 2: Spelling error types by corpus and frequency (surgical pathology notes; emergency visit and progress notes).
Output and method performance
Output volume varied, yet the prototype method achieved comparable performance for each corpus. There was more output for the EDVP corpus than the SP corpus, despite its smaller size. Positive predictive values of 0.9057 (SP) and 0.8979 (EDVP) indicate comparatively good performance on the two different document types. The similar, high positive predictive values suggest that this method may perform well for multiple types of clinical text. More research is needed to determine performance for other document types.
Dual task performance
The task of only identifying misspellings is not as difficult as correcting them. This method performs both tasks relatively well. This is, in part, due to its novel application of Word2Vec, which ranks contextual words by similarity values and applies a preset cutoff, here a maximum of 1000 words per target word. This is analogous to leveraging information retrieval relevance ranking in identifying similar documents, but applied at the finer-granularity level of words. By combining this with the technique of using the most frequent spelling of the given target word in the corpus as its correct form, lexical filtering, and the Levenshtein edit distance constraint of one to three characters, the prototype method effectively identified matching pairs of correctly spelled and misspelled words. If someone were using the keyword “apetite” in searching the EDVP corpus, this method could inform the searcher both that the likely correct keyword was “appetite”, and that the corpus also included the misspellings “apettite” and “appetitie” for this concept (Additional file 1).
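The combination of filters described in this section can be sketched as follows. This is a simplified illustration, not the prototype's code: the ranked neighbour list stands in for Word2Vec output, the lexicon is a small hypothetical set standing in for the SPECIALIST Lexicon, the frequencies are invented, and the exact way the frequency rule was enforced is an assumption.

```python
def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def misspelling_candidates(target, neighbours, lexicon, freq, min_dist=1, max_dist=3):
    """Filter words ranked by embedding similarity to target (e.g. the top 1000)."""
    pairs = []
    for word in neighbours:
        if word in lexicon:
            continue  # a valid word, so not a misspelling
        if freq.get(word, 0) >= freq.get(target, 0):
            continue  # assumption: the more frequent spelling is the correct form
        if min_dist <= levenshtein(target, word) <= max_dist:
            pairs.append((target, word))
    return pairs


# Hypothetical inputs for illustration only.
neighbours = ["apetite", "hunger", "apettite", "appetites", "appetitie", "nausea"]
lexicon = {"appetite", "appetites", "hunger", "nausea"}
freq = {"appetite": 900, "apetite": 12, "apettite": 3, "appetitie": 2,
        "appetites": 5, "hunger": 40, "nausea": 700}

pairs = misspelling_candidates("appetite", neighbours, lexicon, freq)
```

The lexicon check removes valid neighbours such as inflections, while the edit distance bound removes contextually similar but unrelated strings.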
The majority of false positive output (3 FPs) for the SP corpus fell into Category 1 (Different word, spelled correctly), and the rest (2 FPs) were in Category 3 (Alternative form of noisy term). The largest share of false positive output for the EDVP corpus fell into Category 2 (Misspelling of different word: 12 FPs), but there were also notable numbers in Category 1 (9 FPs) and Category 4 (Slang equivalent: 3 FPs). Terms in Categories 1 and 4 could be added to the SPECIALIST Lexicon. Category 1 terms would definitely be of use. Officials at the National Library of Medicine would need to decide whether slang terms would enrich the SPECIALIST Lexicon.
The most common misspelling type in both corpora was letter omission (Table 2). Letter insertion and transposition were also common in each corpus, with wrong letter and multiple/mixed errors less common. However, there were over four times as many total spelling errors in the EDVP corpus. These values suggest that authors of these two corpora tend to make similar mistakes, but those writing emergency visit and progress notes make more mistakes. This may be due to one or more reasons, including different environmental conditions in which the various authors work, and presents an interesting direction for future research. More research involving other document types may provide further insight into the types and frequencies of errors made in clinical text.
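One way to assign a pair to these error types automatically is sketched below. The rules are an assumption about how the categories could be operationalized, not the procedure used in this study, and apart from the pairs quoted in this article the example words are hypothetical.

```python
def classify_error(correct, misspelled):
    """Assign a (correct, misspelled) pair to one of the common error types."""
    if correct == misspelled:
        return "none"
    if len(misspelled) == len(correct) - 1:
        # Removing one letter from the correct form yields the misspelling.
        if any(correct[:i] + correct[i + 1:] == misspelled for i in range(len(correct))):
            return "omission"
    elif len(misspelled) == len(correct) + 1:
        # Removing one letter from the misspelling yields the correct form.
        if any(misspelled[:i] + misspelled[i + 1:] == correct for i in range(len(misspelled))):
            return "insertion"
    elif len(misspelled) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(correct, misspelled)) if a != b]
        if len(diffs) == 1:
            return "wrong letter"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == misspelled[diffs[1]]
                and correct[diffs[1]] == misspelled[diffs[0]]):
            return "transposition"  # two adjacent letters swapped
    return "multiple/mixed"


examples = {
    ("suicidal", "sucidal"): classify_error("suicidal", "sucidal"),
    ("appetite", "appetitie"): classify_error("appetite", "appetitie"),
    ("appetite", "apettite"): classify_error("appetite", "apettite"),
    ("received", "recieved"): classify_error("received", "recieved"),  # hypothetical pair
    ("patient", "patiant"): classify_error("patient", "patiant"),      # hypothetical pair
}
```

Anything that does not match a single-edit pattern falls through to the multiple/mixed category, which keeps the rules conservative.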
We developed a prototype method to detect and correct misspellings in clinical text. This method uses a pipeline framework to identify correctly and incorrectly spelled word pairs by leveraging term frequencies, Word2Vec, use of the SPECIALIST Lexicon, and a Levenshtein edit distance constraint. An implementation of the method achieved 0.9057 and 0.8979 positive predictive values on two separate corpora. These promising results suggest the need for expanded research regarding this method.
This was only an exploratory study, using small corpora of surgical pathology and emergency department documents. More research is needed to determine the method’s performance for other document types. Because the method uses frequencies, performance for very infrequent terms would likely be affected.
Study design: TEW, QZ, GD. Prototype method design: TEW, YS. Output evaluation: GD, TEW, QZ. Manuscript composition: TEW, GD. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Availability of data and materials
The datasets supporting the conclusions of this article are included in Additional file 1. The datasets used as input include protected health information (PHI), therefore their access is restricted.
The views expressed are those of the authors and do not necessarily reflect those of the Department of Veterans Affairs, the United States Government, or the academic affiliate organizations.
Consent for publication
Not applicable. Data reported only includes results.
Ethics approval and consent to participate
This project was carried out in support of the VA Clinical NLP Ecosystem (Ecosystem) study, which was approved by the VA Central IRB #CIRB 13–17 Zeng. No patients were contacted; only data from the EMR database were used.
This work was supported by the VA IDEAS 2.0 HSRD Research Center and CREATE: A VHA NLP Software Ecosystem for Collaborative Development and Integration project, Grant CRE 12–315. Support from VHA enabled secure access to data, and a secure virtual environment (VINCI) in which these data could be analyzed. The organization of data as well as software resources in the VINCI environment also facilitated the project’s design. Funding from the NIH Clinical and Translational Science Award (CTSA) program, Grants UL1TR001876 and KL2TR001877, through The George Washington University and The Clinical and Translational Science Institute at Children’s National Health System, also contributed to the Biomedical Informatics Center at the George Washington University, where much of the research was performed.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Hersh WR, Campbell EM, Malveau SE. Assessing the feasibility of large-scale natural language processing in a corpus of ordinary medical records: a lexical analysis. In: Proceeding on AMIA annual fall symposium. 1997. p. 580–4.
3. Ruch P, Baud RH, Geissbühler A, Lovis C, Rassinoux A-M, Riviere A. Looking back or looking all around: comparing two spell checking strategies for documents edition in an electronic patient record. In: Proceedings of the AMIA symposium: 2001. New York: American Medical Informatics Association; 2001.
6. Ruch P. Using contextual spelling correction to improve retrieval effectiveness in degraded text collections. In: Proceedings of the 19th international conference on computational linguistics, vol. 1. Association for Computational Linguistics; 2002. p. 1–7.
7. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
8. Veterans Health Administration: about VHA. https://www.va.gov/health/aboutvha.asp.
10. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady; 1966. p. 707–10.
14. Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. Citeseer; 2010.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.