Study Population and Design
Our data comprised 2514 Chinese EMRs with a diagnosis of central nervous system (CNS) infectious or inflammatory disease, identified using International Classification of Diseases-Tenth Revision (ICD-10) codes from the medical records database of the neurology department of Xijing Hospital over a 10-year period (October 2010–December 2020). The ICD-10 codes are provided in Table S1 in the supplementary material. All EMRs were de-identified with respect to every known identifier type (ID, name, gender, age, address, etc.). Standard paraclinical tests (e.g., CSF, MRI, or EEG studies) and other routine laboratory data were obtained by review of the EMRs. The Xijing Hospital Ethics Committee approved this study (KY20192071-F-1).
Our study constructed two datasets from the 2514 Chinese EMRs: the first (552 cases) to train CNER as part of the NLP analysis pipeline, and the second (199 cases) to train and test the diagnostic classification model distinguishing AE from IE. The two datasets were filtered separately from the 2514 cases, so there was overlap between the CNER dataset and the text classification dataset. To eliminate subjective interference and evaluate the classification model more objectively, the EMR samples included in the text classification dataset were AE or IE cases with definite etiological evidence. The definitive diagnosis of IE was based on the results of traditional etiological examinations, including microscopic staining, pathogenic microbiological analysis, and PCR. The definitive diagnosis of AE was based on the exclusion of other definite causes (e.g., IE); in addition, all AE patients underwent an extensive search for neuronal antibodies and were required to have positive antibodies, in either CSF or serum, detected using commercial cell-based assay kits according to published guidelines [32, 33]. Finally, because of the possibility that autoantibodies may not be detected in definite autoimmune limbic encephalitis, the clinical diagnostic criteria proposed by Graus et al. were applied only to definite autoimmune limbic encephalitis in this study [12].
A synopsis of the overall NLP analysis pipeline is shown in Fig. 1. First, we automatically extracted symptoms (i.e., CNER) from all HPI texts by training the BiLSTM-CRF model on a dataset of 552 cases with a single diagnosis of CNS infection or AE. Second, post-structuring of the HPI in EMRs was implemented for all cases after normalizing the symptom terminologies into standard English terms. Third, four text classification models, trained on a dataset of 199 cases, were established for the differential diagnosis of AE and IE on the basis of the post-structured text of every HPI. The optimal model was identified by evaluating and comparing the performance of the four models. Finally, combined with the three typical symptoms and the results of standard paraclinical tests (e.g., CSF, MRI, or EEG studies) proposed in the Graus criteria, an assisted early diagnostic model for AE was established on the basis of the best-performing text classification model.
Data Preprocessing for CNER
We filtered all 552 patient records with a single diagnosis of CNS infection or AE from the 2514 EMRs with CNS infectious or inflammatory diseases. While all 552 cases were used at different stages of CNER development (training word embeddings, identifying symptoms), a random subset of 140 cases (25% of the 552 patient records) was selected for manual word annotation to assist with training. It has been shown that Chinese medical text segmentation is very important for producing high-quality word embeddings and promoting downstream information extraction applications [34]. Therefore, we used the Jieba Chinese word segmentation library, supported by the Python programming language, to segment the HPI in EMRs.
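To illustrate this preprocessing step, the following is a minimal sketch of segmenting an HPI text with Jieba; the example text and the optional user dictionary of neurology terms are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch of Chinese word segmentation of an HPI text with Jieba.
import jieba

# An optional user dictionary can help Jieba keep clinical terms intact (assumption):
# jieba.load_userdict("neurology_terms.txt")

hpi_text = "患者3天前出现头痛，伴发热。"          # illustrative HPI fragment
tokens = list(jieba.cut(hpi_text))               # segment the text into words
print(" ".join(tokens))                          # space-separated tokens for downstream CNER
```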
In our CNER approach, annotated data were represented in the BMESO format, in which each word was assigned to one of five classes: B, beginning of an entity; M, middle of an entity; E, ending of an entity; S, single-word entity; O, outside of an entity. The CNER problem therefore became a classification problem requiring assignment of one of the five class tags to each word. The annotation guidelines were similar to those in Yang et al.'s study [35]. One main difference was that we manually annotated only symptoms in the HPI in EMRs; thus, there was only one type of entity in this study. Another difference was that negative symptoms were recognized as whole entities [36]. The statistics of the training subset of the HPI used for CNER are shown in Table S2 in the supplementary material. There were 26,655 words and 1055 symptoms in the HPI in the training subset, and B, M, E, S, and O were annotated 1006, 479, 1006, 49, and 24,115 times, respectively.
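As a concrete illustration of the BMESO scheme, the sketch below assigns tags to segmented words given the character offsets of annotated symptoms; the sentence, offsets, and function name are illustrative, not taken from the study's annotation tooling.

```python
# Assign BMESO tags to segmented words from the character spans of annotated symptoms.
def bmeso_tags(words, entity_spans):
    # words: segmented tokens in order; entity_spans: (start_char, end_char) per symptom
    tags, offsets, pos = ["O"] * len(words), [], 0
    for w in words:
        offsets.append((pos, pos + len(w)))
        pos += len(w)
    for (s, e) in entity_spans:
        idx = [i for i, (ws, we) in enumerate(offsets) if ws >= s and we <= e]
        if len(idx) == 1:
            tags[idx[0]] = "S"                    # single-word entity
        elif idx:
            tags[idx[0]], tags[idx[-1]] = "B", "E"
            for i in idx[1:-1]:
                tags[i] = "M"
    return tags

# Illustrative sentence: "患者 出现 意识 障碍 伴 发热" with two annotated symptoms
words = ["患者", "出现", "意识", "障碍", "伴", "发热"]
spans = [(4, 8), (9, 11)]                         # character offsets of the symptoms
print(list(zip(words, bmeso_tags(words, spans))))
# [('患者','O'), ('出现','O'), ('意识','B'), ('障碍','E'), ('伴','O'), ('发热','S')]
```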
We explained the annotation guidelines to three specialized neurology physicians. The final annotation results were determined by the following rules. Word boundaries of symptoms were marked with B, M, E, S, and O tags by two physicians independently; examples of annotated sequences are provided in Table S3 in the supplementary material. After the two physicians completed their manual annotations, the inter-annotator agreement, calculated with Cohen's kappa, was 0.87. The two physicians' annotations were then compared: when the results were the same, that annotation was taken as the consensus result; when the results differed, the third physician made a final interpretation of the two annotations and selected one as the consensus result.
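The agreement statistic can be computed directly from the two physicians' tag sequences, as in the brief sketch below; the tag lists are illustrative placeholders.

```python
# Inter-annotator agreement on BMESO tag sequences using Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["O", "O", "B", "E", "O", "S"]     # illustrative tags from physician 1
annotator_2 = ["O", "O", "B", "E", "O", "O"]     # illustrative tags from physician 2
print(cohen_kappa_score(annotator_1, annotator_2))
```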
Training and Evaluation for CNER Using the BiLSTM-CRF Model
To improve the quality of the trained word embeddings, all 552 HPI texts were used as the corpus for training word embeddings, while the 140 annotated HPI texts were used for the BiLSTM-CRF model. The Word2Vec tool was used to train word embeddings with either a continuous bag-of-words (CBOW) model or a skip-gram model [37]. Empirical evidence shows that the CBOW model performs better than the skip-gram model for corpora of only hundreds of thousands of words [38, 39]. Therefore, the CBOW model was adopted in this study to obtain word embeddings. The dimension of the word embedding vector was set to 128, and the other hyper-parameters are provided in Table S4 in the supplementary material.
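A minimal sketch of this embedding step is given below; the paper specifies Word2Vec with the CBOW architecture and 128 dimensions, but the gensim library, the window size, and the example sentences are assumptions.

```python
# Training 128-dimensional CBOW word embeddings on the segmented HPI corpus (gensim >= 4).
from gensim.models import Word2Vec

sentences = [                                    # in the study: all segmented sentences
    ["患者", "出现", "头痛", "伴", "发热"],       # from the 552 HPI texts
    ["意识", "障碍", "逐渐", "加重"],
]

model = Word2Vec(
    sentences,
    vector_size=128,   # embedding dimension used in the study
    sg=0,              # sg=0 selects the CBOW architecture
    window=5,          # context window (assumption; see Table S4 for the actual values)
    min_count=1,
)
vector = model.wv["头痛"]                         # 128-dimensional vector for one word
```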
The dataset of 140 HPI texts was split into two mutually exclusive subsets: a training subset (80% of the dataset) and a testing subset (20% of the dataset) [40]. Every HPI was divided into sentences at full stops, and each sentence was used as an input sequence for the following model. The implementation of CNER in this study was based on the BiLSTM-CRF model, with word embeddings as the input of the sequences [23, 24]. To optimize the hyper-parameters of the BiLSTM-CRF model, tenfold cross-validation was performed on the training subset. Once the optimal hyper-parameters were determined, the final BiLSTM-CRF model was trained on the entire training subset with these hyper-parameters.
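The split and tuning procedure can be sketched as follows; the sentence delimiter, random seeds, and placeholder texts are illustrative assumptions.

```python
# Sentence splitting at full stops, 80/20 train/test split, and tenfold cross-validation.
from sklearn.model_selection import train_test_split, KFold

def split_sentences(hpi_text):
    return [s for s in hpi_text.split("。") if s]   # Chinese full stop as delimiter

hpi_texts = ["患者3天前出现头痛。伴发热。"] * 140     # placeholder for the 140 annotated HPI texts
train_docs, test_docs = train_test_split(hpi_texts, test_size=0.2, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(train_docs):
    pass  # train a candidate BiLSTM-CRF configuration on train_idx, validate on val_idx
```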
The hyper-parameters of the BiLSTM-CRF model were fine-tuned by training the model on the training subset and evaluating it with the F measure. The results for each configuration of the parameters (batch size, number of epochs, dropout, and learning rate with the Adam algorithm [41]) in the hyper-parameter fine-tuning stage are shown in Fig. S1 in the supplementary material. For the embeddings created from words by Word2Vec, the best hyper-parameters were as follows: batch size 2, 40 epochs, dropout 0.2, and learning rate 0.001. The other hyper-parameters are provided in Table S4 in the supplementary material.
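The following is a minimal BiLSTM-CRF sketch in TensorFlow using the hyper-parameters reported above (128-dimensional embeddings, dropout 0.2, Adam with learning rate 0.001, batch size 2, 40 epochs); the LSTM hidden size, the use of the tensorflow_addons CRF utilities, and the padding tag are assumptions, since the paper does not publish its implementation.

```python
# Minimal BiLSTM-CRF sketch for word-level BMESO tagging (TensorFlow + tensorflow_addons).
import tensorflow as tf
import tensorflow_addons as tfa

NUM_TAGS = 6        # B, M, E, S, O plus a padding tag (assumption)
EMBED_DIM = 128     # word-embedding dimension used in the study
LSTM_UNITS = 100    # hidden size (assumption; not stated in the paper)

class BiLSTMCRF(tf.keras.Model):
    def __init__(self, vocab_size):
        super().__init__()
        # In the study, the embedding weights would be initialized from the Word2Vec vectors.
        self.embed = tf.keras.layers.Embedding(vocab_size, EMBED_DIM, mask_zero=True)
        self.bilstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(LSTM_UNITS, return_sequences=True, dropout=0.2))
        self.dense = tf.keras.layers.Dense(NUM_TAGS)
        # CRF transition matrix, learned jointly with the network.
        self.transitions = tf.Variable(tf.random.uniform((NUM_TAGS, NUM_TAGS)))

    def call(self, token_ids):                        # emission scores per word
        return self.dense(self.bilstm(self.embed(token_ids)))

    def crf_loss(self, logits, tags, seq_lens):       # negative CRF log-likelihood
        ll, _ = tfa.text.crf_log_likelihood(logits, tags, seq_lens, self.transitions)
        return -tf.reduce_mean(ll)

    def decode(self, logits, seq_lens):               # Viterbi decoding of tag sequences
        tags, _ = tfa.text.crf_decode(logits, self.transitions, seq_lens)
        return tags

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
# Training: for 40 epochs, iterate over padded sentence batches of size 2, compute
# crf_loss under a tf.GradientTape, and apply the gradients with the optimizer.
```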
The performance of the final BiLSTM-CRF model was measured by precision, recall, and F measure for all entities on the independent testing subset [42]. The evaluation program provided two sets of measures, exact match and inexact match: an exact match means that an entity is correctly predicted if, and only if, its starting and ending offsets are exactly the same as those in the consensus result, whereas an inexact match means that an entity is correctly predicted if it overlaps with any entity in the consensus result [43, 44]. Table S5 in the supplementary material shows the statistics of the independent testing subset of the HPI used in this study; there were 6425 words and 296 symptoms in the HPI in EMRs. Table S5 also shows the performance of the BiLSTM-CRF model on the independent testing subset. The numbers in columns 4–6 are the precision, recall, and F measure values for all entities under the exact match or inexact match measures.
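A brief sketch of the exact- and inexact-match scoring is given below; spans are (start, end) character offsets, and the predicted and consensus span lists are illustrative placeholders.

```python
# Exact- and inexact-match precision, recall, and F measure for predicted symptom spans.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def prf(predicted, gold, exact=True):
    match = (lambda p: p in gold) if exact else (lambda p: any(overlaps(p, g) for g in gold))
    tp = sum(1 for p in predicted if match(p))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

predicted = [(4, 8), (9, 11), (15, 17)]    # illustrative predicted spans
gold = [(4, 8), (10, 12)]                  # illustrative consensus spans
print(prf(predicted, gold, exact=True))    # exact-match scores
print(prf(predicted, gold, exact=False))   # inexact-match scores
```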
Post-Structuring of the HPI
The final BiLSTM-CRF model was used to identify and extract the symptoms in the 552 HPI texts in EMRs. All symptoms were classified into different groups according to whether they had synonyms. We then obtained normalized symptom terminologies by establishing a mapping between the categorized symptoms and the international standard English medical term sets, the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) and Medical Subject Headings (MeSH). Normalization of symptoms was performed with Python regular expressions according to the normalized English symptom terminologies. The post-structuring of the HPI was then completed using the normalized symptom terminologies; in other words, we rebuilt the HPI texts using the normalized symptom terminologies, which included typical symptoms, atypical symptoms, and negative symptoms.
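The normalization step can be illustrated with the regular-expression mapping below; the synonym patterns and English terms are illustrative examples rather than the authors' full SNOMED-CT/MeSH mapping.

```python
# Map synonymous Chinese symptom expressions to normalized English terms via regex.
import re

NORMALIZATION_MAP = {
    r"头痛|头疼": "headache",
    r"发热|发烧|高热": "fever",
    r"抽搐|癫痫发作|惊厥": "seizure",
    r"记忆力(减退|下降)": "short-term memory deficit",
}

def normalize_symptoms(extracted_symptoms):
    normalized = []
    for symptom in extracted_symptoms:
        for pattern, term in NORMALIZATION_MAP.items():
            if re.search(pattern, symptom):
                normalized.append(term)
                break
    return normalized

print(normalize_symptoms(["头疼", "发烧", "癫痫发作"]))
# ['headache', 'fever', 'seizure'] -- these terms are used to rebuild the post-structured HPI
```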
Training and Evaluation for Text Classification Model
According to the definite diagnostic criteria for AE and IE, we obtained a text classification dataset of 199 cases by reviewing EMRs; the dataset included 83 AE cases and 116 IE cases. Several autoimmune CNS diseases (e.g., primary CNS angiitis, Rasmussen's encephalitis) that are often considered in the differential diagnosis of AE because of their clinical features were regarded as immune-related diseases and excluded [11, 12]. The overall data were randomly divided into a training dataset (65%) and a testing dataset (35%). The training dataset included 53 AE cases and 75 IE cases, and the independent testing dataset included 30 AE cases and 41 IE cases.
We did not perform text classification directly on the raw texts because of the impact of the EMR template. The template causes different EMRs to share many identical words that are unrelated to the symptoms of AE, and this large number of shared template words would obscure the contribution of AE symptoms and affect the result of text classification. Moreover, because of differing habits in writing EMRs, the raw texts contained a variety of synonyms for the same symptom. Therefore, our text classification models were based on the HPI rebuilt with normalized symptom terminologies rather than on the raw HPI texts in EMRs. Four text classification models were established to distinguish AE from IE on the basis of the post-structured HPI texts: two classifiers, a naïve Bayesian classifier (NBC) and a support vector machine (SVM), each combined with two text feature selection methods, bag of words (BoW) and term frequency–inverse document frequency (TF-IDF). The optimal hyper-parameters of the four text classification models were determined with the grid search function GridSearchCV from the Scikit-Learn library [45]. The hyper-parameters are provided in Table S4 in the supplementary material.
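A compact sketch of these four model configurations and the grid search is shown below; the parameter grids, the cross-validation setting, and the scoring metric are illustrative assumptions rather than the tuned values from Table S4.

```python
# Four classifier/feature combinations (NBC and SVM with BoW or TF-IDF) tuned by GridSearchCV.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipelines = {
    "NBC+BoW":   Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())]),
    "NBC+TFIDF": Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())]),
    "SVM+BoW":   Pipeline([("vec", CountVectorizer()), ("clf", SVC(probability=True))]),
    "SVM+TFIDF": Pipeline([("vec", TfidfVectorizer()), ("clf", SVC(probability=True))]),
}
param_grids = {
    "NBC": {"clf__alpha": [0.1, 0.5, 1.0]},
    "SVM": {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
}

best_models = {}
for name, pipe in pipelines.items():
    grid = param_grids["NBC"] if name.startswith("NBC") else param_grids["SVM"]
    search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")   # cv and scoring are assumptions
    # search.fit(train_texts, train_labels)  # post-structured HPI texts and AE/IE labels
    best_models[name] = search
```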
When the optimal hyper-parameters of the text classification models were determined, the final models were trained on the whole training dataset. The performance of the four text classification models was then evaluated and compared on the independent testing dataset, measured by sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUROC). Finally, the best-performing model was used to establish the assisted diagnostic model for AE.
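These metrics can be derived from the confusion matrix and predicted probabilities, as in the sketch below; y_test, y_pred, and y_score stand for the test labels, predicted classes, and predicted AE probabilities.

```python
# Sensitivity, specificity, accuracy, and AUROC on the independent testing dataset.
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_test, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    auroc = roc_auc_score(y_test, y_score)
    return sensitivity, specificity, accuracy, auroc
```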
Evaluation of the Assisted Diagnostic Model
Because the diagnostic criteria from Graus et al. [12] emphasize the importance of psychiatric symptoms, seizures, and short-term memory deficits, logical diagnosis rules for these three symptoms were added on top of the text classification model. Specifically, when the text classification model judges that a case is not an AE case but the case contains one or more of these three symptoms, the case is clinically diagnosed as an AE case. Furthermore, our assisted diagnostic model incorporated the results of standard paraclinical tests (e.g., CSF, MRI, or EEG studies) from the Graus criteria.
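The rule can be expressed as a simple override of the classifier's output, as sketched below; the function name, variable names, and English symptom terms are illustrative.

```python
# Logical diagnosis rule layered on the text classification model.
KEY_SYMPTOMS = {"psychiatric symptoms", "seizure", "short-term memory deficit"}

def assisted_diagnosis(model_prediction, normalized_symptoms):
    """model_prediction: 'AE' or 'IE' from the text classifier;
    normalized_symptoms: set of normalized symptom terms extracted from the case."""
    if model_prediction != "AE" and KEY_SYMPTOMS & set(normalized_symptoms):
        return "AE"   # override: at least one of the three typical symptoms is present
    return model_prediction   # paraclinical results (CSF, MRI, EEG) are considered separately
```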
All cases in the independent testing dataset were analyzed according to both the diagnostic criteria for PAE from Graus et al. and the assisted diagnostic model, and the results were compared with the etiological diagnosis. The performance was assessed according to sensitivity, specificity, accuracy, and confusion matrices.
We analyzed demographic, clinical, and standard paraclinical test characteristics. Continuous variables were presented as mean ± standard deviation (SD) in the descriptive analyses, while categorical and binary variables were presented as frequencies (n) and percentages (%). Student's t test and the chi-squared test were used to compare outcomes between patient subgroups for continuous and categorical data, respectively. All data acquisition, processing, and analyses were conducted in the Python programming language (version 3.7.0) [46, 47] with the TensorFlow [48, 49] and Scikit-Learn [45] libraries.
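A brief sketch of the statistical comparisons is given below; the numeric arrays and the contingency table are illustrative placeholders, not study data.

```python
# Student's t test for continuous variables and chi-squared test for categorical variables.
from scipy import stats

ae_ages = [34, 45, 29, 52]                       # illustrative ages in the AE subgroup
ie_ages = [41, 38, 60, 47]                       # illustrative ages in the IE subgroup
t_stat, p_t = stats.ttest_ind(ae_ages, ie_ages)  # Student's t test

table = [[30, 11], [20, 10]]                     # illustrative 2x2 contingency table
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
```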