Background

Imaging examinations are common, as well as efficient, diagnostic tools in clinical practice worldwide. Radiologists or sonographers perform examinations, observe images and write reports covering meaningful findings, conclusions and opinions. Imaging examination is a highly operator-dependent modality, and many factors influence the interpretation of the images, such as patients' demographics, current health status and medical histories. There can be discrepancies in such complicated and heterogeneous information (e.g., the diagnosis in a patient's radiology report differs from the one he/she really has), which may lead to imprecise clinical decisions [1]. Although such discrepancies may be inevitable given the complexity of imaging diagnosis, quality measurement and improvement are still needed to minimize avoidable errors via a manual verification process. A common objective and standardized verification process is to retrospectively compare the reports of prior imaging and follow-up pathology examinations [2]. However, only a few patients who receive an imaging examination on a certain body site will later have a surgical or pathologic biopsy on the same site. To find these patients, quality control staff regularly and manually review electronic medical records (EMRs) and scan related examination reports, which is inefficient and time-consuming. In this study, we propose a machine learning-based approach to retrieve these patients from EMRs more efficiently.

Formally, we aim to predict which of the provided report pairs, each consisting of an imaging report and a pathology report, contain overlapping body sites or regions based on their textual semantic similarity. This is slightly different from conventional text similarity settings, where researchers care about whether two sentences have the same meaning [3, 4]. We care about similar "body sites", i.e., the sites on the patient's body where an anomaly has been detected, rather than similar syntax or semantics in general. For example, Table 1 shows a report-pair from our study in the original Chinese with an English translation. This report-pair contains an overlapping body site, the parotid gland, but only the pathology report mentions "parotid gland" (腮腺), while the imaging report describes the condition of the "maxillofacial region" (颌面部). The parotid gland is anatomically located in the maxillofacial region, and thus the report-pair shares a similar "body site" and should be picked up. Despite this difference in form, the methodology remains unchanged: the model should extract features from the texts, judge whether the pair shares enough common information under certain criteria, and then assign the pair to a nominal category (match or mismatch).

Table 1 A report-pair in this study

We believe that a well-designed semantic similarity algorithm should consider three main aspects: textual features, the algorithm itself and domain knowledge. Previous works using textual features to calculate similarity are mainly based on corpus-based methods such as bag-of-words and word embeddings. Bag-of-words models, including the vector space model (VSM) [5], latent semantic analysis (LSA) [6] and latent Dirichlet allocation (LDA) [7], treat the entire text as a set of words, calculate a weight for each word to transform the text into a real-valued vector, and then compute similarity on top of these vectors [8,9,10]. These methods need handcrafted features and external lexical resources, which makes them difficult to apply in domains without much readily available knowledge. Word embeddings are low-dimensional real-valued vectors trained from large-scale unlabeled text; such vectors are able to capture semantic relationships among free-text documents [4, 11].

The convolutional neural network (CNN) is a typical artificial neural network architecture that can automatically learn, filter, cluster and combine features without much human effort. It was originally invented for computer vision [12] and was subsequently shown to be effective in natural language processing (NLP) tasks such as sentence modeling [13], search query retrieval [14] and semantic parsing [15]. CNN models can nicely represent the hierarchical structure of sentences with their layer-by-layer convolution kernels and pooling, capturing semantic patterns at different layers [16]. CNNs have also been demonstrated to be effective in capturing semantic similarities between text pairs and thus to perform well on text matching tasks [17, 18].

Moreover, domain ontologies contain many semantic relations that represent a body of knowledge. In the medical domain, there are a number of popular biomedical ontologies, such as MeSH (Medical Subject Headings) for indexing literature, the ICD taxonomy (International Classification of Diseases) for public health surveillance and billing purposes, and SNOMED-CT for aggregating medical terms across sites of healthcare. All these ontologies use graph structures [19] to represent the relationships among medical concepts. However, it is not straightforward for conventional machine learning methods to use this extra knowledge. Graph embedding technology embeds the edge and node information of a graph into low-dimensional dense vectors [20], and we believe it has great potential to facilitate the use of these ontologies.

In this study, we propose an end-to-end solution based on CNNs to help physicians and clinical quality control staff efficiently retrieve patients' examination reports for the imaging diagnosis verification process. The input of the model is an imaging and pathology report-pair from a given patient, and the output is a label indicating whether the report-pair contains overlapping body sites. We compared the accuracy of our model (with different word embedding methods) with conventional approaches such as keyword mapping, latent semantic analysis (LSA) [6], latent Dirichlet allocation (LDA) [7], Doc2Vec [21], Siamese LSTM [22] and a method based on named entity recognition (NER) [23]. Moreover, we applied the LIME algorithm [24] to identify the features that contribute most to the final results, improving the model's interpretability.

Methods

Technical workflow

Figure 1 shows the workflow of identifying matching body sites from medical report-pairs. We removed all punctuation, numbers and stop words from the raw report texts, then used Jieba (see footnote 1), a Chinese segmentation tool, to transform the texts into sequences of words for CNN model training. The study and data use were approved by the Human Research Ethics Committees of Tongren Hospital, Shanghai Jiao Tong University, Shanghai, China.
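To make the preprocessing step concrete, the following is a minimal Python sketch of it. The regular expression and the `STOP_WORDS` placeholder are illustrative assumptions rather than our exact released pipeline; Jieba is used as described above.

```python
# A minimal sketch of the text preprocessing step. STOP_WORDS is a
# placeholder for a full Chinese stop-word list (an assumption here).
import re
import jieba

STOP_WORDS = {"的", "了", "在"}  # illustrative; substitute a real list

def preprocess(report_text):
    # Drop punctuation, digits and other symbols, keeping Chinese
    # characters and Latin letters.
    cleaned = re.sub(r"[^\u4e00-\u9fffA-Za-z]", " ", report_text)
    # Segment with Jieba and remove stop words and empty tokens.
    return [w for w in jieba.lcut(cleaned)
            if w.strip() and w not in STOP_WORDS]
```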

Fig. 1 Workflow of detecting text semantic similarity

Data description

We included 4262 imaging reports and 2141 pathology reports from the EMRs of 1926 patients who were admitted to Shanghai Tongren Hospital and underwent ultrasonic examinations between 1 May 2017 and 31 July 2017, which resulted in 16,354 report-pairs. All report texts were de-identified. Each pair contained two reports, one imaging report and one pathology report. Three physicians were recruited to independently annotate whether each report-pair contains overlapping body sites; the pairwise kappa coefficients among the three physicians were 0.95, 0.95 and 0.97. The overall rate of positive pairs (those containing overlapping body sites) was 14.8% (2415/16,354). We randomly split the data into 80% for training and 20% for testing.

CNN model for text similarity detection

The structure of our model (shown in Fig. 2) can be divided into three parts: an input layer, a feature extraction layer and a fully connected layer.

Fig. 2 CNN-based neural network for text similarity detection

The input layer mapped each word to a dense vector (with 128 dimensions) and thereby transformed each report into a dense matrix. Each dense vector represented the semantic information of the corresponding word, and its values could be updated during training. We used two strategies to initialize the word vectors: randomly initialized vectors and pre-trained word vectors (a word2vec model trained with skip-gram and negative sampling) built from Baidu Encyclopedia corpora obtained from GitHub (see footnote 2). We set the window lengths of the convolution filters to 3, 4 and 5, and adopted 32 convolution filters for each window size. We then applied max-pooling operations to obtain one feature vector for each of the two reports. We concatenated the two feature vectors and passed the result through a fully connected layer and an output layer to calculate the likelihood of containing overlapping body sites. We used cross-entropy as the loss function and performed mini-batch stochastic gradient descent to train the model.
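A minimal PyTorch sketch of this two-branch architecture is given below. The embedding dimension (128) and the window sizes and filter counts (3/4/5, 32 each) follow the description above; `vocab_size` and the hidden layer size of 64 are illustrative assumptions, not values reported in the paper.

```python
# Sketch of the described two-branch text CNN; hidden size 64 and
# vocab_size are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReportPairCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128,
                 window_sizes=(3, 4, 5), num_filters=32):
        super().__init__()
        # Embedding weights may be randomly initialized or loaded from
        # pre-trained word2vec / concept vectors; they remain trainable.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=w)
            for w in window_sizes)
        feat_dim = num_filters * len(window_sizes)
        self.fc = nn.Linear(2 * feat_dim, 64)  # hidden size: assumption
        self.out = nn.Linear(64, 2)            # match / mismatch logits

    def encode(self, token_ids):
        # (batch, seq_len) -> (batch, embed_dim, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Convolve with each window size, then max-pool over time.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)

    def forward(self, imaging_ids, pathology_ids):
        feats = torch.cat([self.encode(imaging_ids),
                           self.encode(pathology_ids)], dim=1)
        # Train the returned logits with cross-entropy loss and
        # mini-batch stochastic gradient descent.
        return self.out(F.relu(self.fc(feats)))
```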

Medical concept vectors using ontology-based graph embedding

We utilized a graph embedding method as a third word vector initialization strategy, enhancing the word representations with the semantic relation information in medical ontologies. We used CMeSH (Chinese Medical Subject Headings), a Chinese version of MeSH containing about 391,892 medical concepts and 2,047,749 relations, to train our medical concept vectors. We randomly generated word sequences of length 10 by sampling neighboring concepts along the relation edges in CMeSH. The sampling process basically follows the procedure in node2vec [20] and is composed of two major steps: (1) for every node (medical concept) $V$, add its direct (1st-order) neighbors to the sampling set $\mathcal{M}_V$; (2) let $V_m$ be the $m$-th-order neighbor of $V$ and $V_m^1$ be the direct neighborhood of $V_m$, then randomly sample one node from $V_m^1$ and add it to $\mathcal{M}_V$. In our experiments, we set $m$ to 9 and sampled one word sequence for each node. We then fed the sequence set $\mathcal{M}_V$ into a word2vec model with the skip-gram method [25] to train the medical concept vectors.
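The sampling amounts to short random walks over the ontology graph, which are then treated as "sentences" for word2vec. Below is a hedged sketch under the assumption that the CMeSH relations are loaded as a networkx graph whose nodes are concept strings; the walk length (10) and skip-gram setting follow the description above, while the window size and negative-sampling count are gensim defaults, not values from the paper.

```python
# Sketch: random-walk sampling over an ontology graph plus word2vec
# training. Assumes `graph` is a networkx.Graph of CMeSH concepts.
import random
import networkx as nx
from gensim.models import Word2Vec

def sample_walk(graph, start, length=10):
    """One walk: the start concept followed by up to 9 neighbor hops."""
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(current))
        if not neighbors:
            break
        current = random.choice(neighbors)  # sample one direct neighbor
        walk.append(current)
    return walk

def train_concept_vectors(graph, dim=128):
    walks = [sample_walk(graph, node) for node in graph.nodes()]
    # sg=1 selects skip-gram; negative=5 enables negative sampling.
    return Word2Vec(walks, vector_size=dim, sg=1, negative=5,
                    window=5, min_count=1)
```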

Model evaluation

We compared the performance of our CNN model with the following six baseline models:

  • Keyword mapping. We used the vocabulary of CMeSH as a medical dictionary to filter the original text. All words outside the dictionary were discarded, and the Jaccard similarity coefficient was calculated on the keywords remaining in the two report texts (a minimal sketch of this baseline follows the list).

  • Latent semantic analysis (LSA). For this approach, we collected all reports and constructed a bag-of-words representation vector for each of them. Singular value decomposition was then performed on the matrix concatenating all bag-of-words vectors to reduce the dimensionality of the representations, and cosine similarity was measured on the vectors in the reduced-dimension space.

  • Latent Dirichlet allocation (LDA). This approach also constructed bag-of-words representations of the reports. It assumed that each report was a mixture of a set of "topics" and each topic was a mixture of the words in the vocabulary. Cosine similarity was measured on the reports' topic composition vectors.

  • Doc2Vec. Doc2Vec is an extension of the Word2Vec model [26], in which a document vector is trained together with the word vectors in the continuous bag-of-words model. Cosine similarity was measured on the learned document vectors.

  • Siamese long short-term memory (LSTM). The Siamese LSTM is often used in text similarity systems. It uses two LSTM networks to encode the two sentences respectively, then calculates the Manhattan distance between the encoded hidden vectors to decide whether the two sentences are similar. The training process is supervised.

  • Named entity recognition (NER). We used another annotated Chinese clinical EMR corpus from Shanghai Tongren Hospital. This corpus contains 46,665 sentences and 89,231 entities of four types: symptoms, diseases, lab tests and body structures. We trained a DNN-based NER model with randomly initialized word embeddings [23] and then used this model to identify all entities in the original report texts. We kept only these entity words and constructed a bag-of-words representation vector for each report. Cosine similarity was measured on these entity representation vectors.
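As referenced in the keyword mapping bullet above, that baseline reduces to set intersection over a dictionary. A minimal sketch, assuming `cmesh_vocab` is a set of CMeSH terms and the two reports are already segmented into token lists:

```python
# Keyword mapping baseline: filter tokens against a CMeSH dictionary
# and score the report-pair by Jaccard similarity.
def jaccard_keyword_similarity(tokens_a, tokens_b, cmesh_vocab):
    kw_a = set(tokens_a) & cmesh_vocab  # keep only in-dictionary words
    kw_b = set(tokens_b) & cmesh_vocab
    if not kw_a and not kw_b:
        return 0.0
    return len(kw_a & kw_b) / len(kw_a | kw_b)
```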

All models were trained on the training set and evaluated on the testing set. We performed receiver operating characteristic (ROC) curve analysis for each model and calculated the AUC score. We calculated precision, recall and F1-score using a cutoff value equal to the ratio of positive pairs in the whole dataset; report-pairs with a similarity score higher than the cutoff were labeled positive in all of our models. We used bootstrapping with 50 repeated samplings to estimate the mean and standard deviation (std) of model performance. Because of class imbalance, we report both the overall performance (macro average) and the performance for each class.
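The bootstrap estimation can be sketched as follows, assuming scikit-learn is available; the thresholding rule (scores above the positive-class ratio count as positive) follows the description above, and the function name is illustrative.

```python
# Hedged sketch of bootstrap evaluation: resample the test pairs 50
# times and report mean / std of the macro F1-score.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1(scores, labels, cutoff, n_rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    f1s = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(labels), size=len(labels))  # resample
        preds = (scores[idx] > cutoff).astype(int)
        f1s.append(f1_score(labels[idx], preds, average="macro"))
    return float(np.mean(f1s)), float(np.std(f1s))
```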

Model interpretability

We further applied the LIME algorithm to improve the interpretability of our model. LIME, proposed by Ribeiro et al. [24], can be used to explain the predictions of machine learning models. Its basic idea is to approximate the "interpretation" with another, simpler model, usually a linear model or a decision tree. We adopted LIME to identify which keywords in a report-pair our CNN model relied on to produce its result. Specifically, for a given report-pair, we first fixed the content of the imaging report and generated new samples of the pathology report by randomly deleting words. We then trained a LIME model on the generated pairs and calculated the relative importance of each word in the pathology report. Similarly, we fixed the content of the pathology report, randomly generated perturbed pairs, and trained another LIME model for the imaging report. We represented the relative importance of the keywords visually.
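The per-report procedure can be sketched with the `lime` package's text explainer, which perturbs a text by deleting words, as described above. In this hedged sketch, `predict_match_prob` is an assumed wrapper around the trained CNN returning [P(mismatch), P(match)], and the reports are assumed to be pre-segmented and space-joined so LIME's default word splitting applies to Chinese text.

```python
# Sketch: explain the pathology report while holding the imaging
# report fixed; swap arguments to explain the imaging report instead.
import numpy as np
from lime.lime_text import LimeTextExplainer

def explain_pathology_report(imaging_text, pathology_text,
                             predict_match_prob, num_features=10):
    explainer = LimeTextExplainer(class_names=["mismatch", "match"])

    def classify(pathology_variants):
        # Pair every perturbed pathology text with the fixed imaging text.
        return np.array([predict_match_prob(imaging_text, p)
                         for p in pathology_variants])

    exp = explainer.explain_instance(pathology_text, classify,
                                     num_features=num_features)
    return exp.as_list()  # [(word, importance score), ...]
```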

Results

Model performance

Table 2 shows both the average and class-level performance of all models, and Fig. 3 shows the corresponding ROC curves. The AUC scores of our CNN models with both randomly initialized vectors and pre-trained word vectors were superior to those of all baseline models, with an improvement of approximately 3–7%. In particular, the AUC score of the CNN model with medical concept vectors was 0.8% higher than that of the model with randomly initialized vectors and 1.5% higher than that of the model with pre-trained word vectors. We performed a t-test on the AUC results from 50 independent runs of the CNN with and without pre-trained medical concept vectors; the p-value was smaller than 0.001, suggesting that the improvement is significant. Not surprisingly, the keyword mapping model had the worst performance among all models.

Table 2 Performance comparison of different models (precision, recall and F1-score)
Fig. 3 ROC curves of different models

LIME experiment

We randomly selected two report-pairs containing overlapping body sites from the test set and processed them with the LIME models. Table 3 shows the original text of the two sample pairs (sample pairs No. 1 and No. 2) and Table 4 shows the corresponding results. For sample pair No. 1, the importance scores of the words "fetal membrane" (胎膜) and "umbilical cord" (脐带) in the pathology report and "fetus" (胎儿) and "fetal heart" (胎心) in the imaging report were relatively high, with scores of 0.15, 0.14, 0.12 and 0.03 respectively. The result indicates that the presence of these four words accounted for the positive judgement, with a prediction of 0.77 by our CNN model. "Fetal membrane", "umbilical cord" and "fetal heart" are all body structures contained by "fetus", so our CNN model was able to automatically and reasonably extract semantic features from the texts and make judgements. For sample pair No. 2, the word "thyroid" (甲状腺), which exists in both reports, contributed most to the result, with scores of 0.16 and 0.19 respectively. "Tubercle" (结节) and "glandular body" (腺体) are sub-structures of the thyroid gland and also contributed substantially to the final result. The LIME algorithm could efficiently locate the most related words in the text pairs and provide meaningful explanations of our model's behavior.

Table 3 The original text of selected samples
Table 4 Sample-level feature importance of sample pair 1 and 2 for both imaging and pathologic report provided by LIME algorithm

Discussion

In this paper, we proposed a direct end-to-end CNN model to judge whether two reports contain matching body sites. Compared with conventional language models based on handcrafted textual features (keywords and bag-of-words), automatically generated features (bag-of-words extracted by our NER model and word embeddings) and a neural network model with an LSTM structure, our CNN model provided more flexibility in exploring the semantic information contained in medical documents and yielded better performance. In addition, we compared three strategies to generate word vectors for our CNN model: randomly initialized vectors, pre-trained word vectors, and CMeSH-based medical concept vectors trained with graph embedding. Our CNN model with medical concept vectors outperformed the other two strategies, and the improvement was significant.

Many factors might contribute to the advantage of the CNN model. First, our CNN model is a supervised learning model and can automatically adapt its feature representations to the task objective. For LSA, LDA and Doc2Vec, the feature representations are learned in an unsupervised way, so the semantic information and co-occurrence relationships of words or characters are only weakly correlated with the current learning objective. Second, our CNN model can extract syntactic and semantic information from both local semantic patterns and the hierarchical structure of the sentences. For example, body sites may be described by physicians using anatomical terms and their relative locations; thus, information at the word or chunk level is more important than information at the sentence or document level. This may explain why the performance of our CNN model was higher than that of the Siamese LSTM model. Even though the Siamese LSTM achieved better precision than the CNN model with randomly initialized vectors, its F1-score was significantly lower than that of our CNN. Third, we used an end-to-end training strategy, which updated the feature representations and optimized the weights simultaneously.

We used a graph embedding method to utilize domain knowledge from CMeSH and gained a significant performance boost. CMeSH, like other domain-specific ontologies, organizes and represents a body of knowledge using concepts and their relations. For example, the concept "parotid gland" is a sub-class of the concept "salivary glands"; "salivary glands" is in turn a sub-class of both "exocrine glands" and "mouth", and contains the sub-class concepts "parotid gland", "salivary ducts" and "sublingual gland". In our study, the affiliation information of anatomical terms extracted by the graph embedding method was quite useful for judging overlapping body sites, and could explain the higher performance.

To validate that our model can correctly find related semantic or anatomical information and make judgements as expected, we used the LIME algorithm to analyze two concrete examples. The results show that our CNN model chose reasonable keywords as the basis for its predictions. In the real world, we can incorporate these explanations of model behavior into a computer-aided decision support system to remind clinical quality control staff why the model gives a particular result.

Our CNN model still has several limitations. We performed an error analysis and found several typical misclassifications. Table 5 shows two sample pairs from this analysis: sample pair No. 3 is a false negative and No. 4 is a false positive. Sample pair No. 3 indicates that our model could not correctly identify spatial relationships between body structures. In this pair, the imaging report described a mass observed on the ventral side of the inferior pole of the left kidney, and the pathology report described a lesion from the left adrenal gland. The report-pair shares no common anatomical terms, but the left kidney and the left adrenal gland are adjacent body structures in local anatomical space, so the pair does share a common body site. Sample pair No. 4 indicates that our model is insensitive to directional information. In this pair, both the imaging report and the pathology report described an axillary lymph node, but the imaging report referred to the left armpit and the pathology report to the right armpit, so the pair shares no common region. There are other limitations as well: first, Chinese word segmentation using Jieba is imperfect and may introduce segmentation errors, especially for medical terms; second, there is no Chinese version of SNOMED-CT or UMLS (Unified Medical Language System), so we only performed graph embedding on CMeSH, which has a relatively small number of concepts and relations; third, we only evaluated our model on Chinese medical reports, although it could be easily moved to other languages without language-specific optimizations.

Table 5 Sample pairs from error analysis

In this paper, we focused only on identifying whether a pair of reports contains overlapping body sites. We treated it as a binary classification problem, trained a CNN model and used graph embedding based on the CMeSH ontology. In the future, we could: first, treat it as a ranking problem, annotating data and training a machine learning model to identify whether report A is more similar to report B than to report C (e.g., by checking the number of overlapping body parts); second, try different graph embedding methods and combinations of medical ontologies; third, validate the end-to-end architecture on tasks in other languages.

The proposed technique can be used to match reports of medical images from different sources, helping to consolidate heterogeneous patient clinical information and improve the efficiency of clinicians. Fundamentally, our study provides a generalizable architecture for detecting information discrepancies among different sources of routinely collected clinical data. With the increasing secondary use of clinical data, commercial software has been developed for similar purposes with different data sources and algorithms, for example, IBM Watson Imaging Clinical Review (see footnote 3). Moreover, as Wang et al. [27] have envisioned, improving the quality of clinical data is one key aspect of making artificial intelligence tools truly useful in clinical practice. The effective consolidation of clinical data can help us better reconcile the data, detect potential errors and thus improve data quality.

Conclusion

In this paper, we proposed a convolutional neural network-based model that identifies report-pairs of imaging and pathologic examinations containing overlapping body sites by detecting semantic similarity. Our model exhibited superior performance compared to conventional models such as keyword mapping, LSA, LDA, Doc2Vec, Siamese LSTM and an NER-based method. We also leveraged a graph embedding method to utilize external information from medical ontologies and gained further improvement. In addition, we adopted the LIME algorithm to analyze our model's behavior in a visual way. The results indicate that our model can automatically and reasonably extract semantic features from texts and make accurate judgements. It can help retrieve patients or reports for imaging diagnosis quality measurement more efficiently.