Abstract
The paper investigates how various algorithms for adverse drug reaction (ADR) detection can be adapted to the Russian-language environment. In general, the ADR detection process consists of 4 steps: (1) data collection from social media; (2) classification/filtering of ADR assertive text segments; (3) extraction of ADR mentions from text segments; (4) analysis of extracted ADR mentions for signal generation. Implementing each step in the Russian-language environment is more difficult than in the traditional English-speaking environment. The difficulties stem primarily from the lack of the necessary databases and specialized language resources, and they are compounded by the complex grammatical structure of the Russian language. The authors present various methods of adapting machine learning algorithms in order to overcome these difficulties. For step 3, on Russian-language text forums, an ensemble classifier achieved Accuracy = 0.805. For step 4, on Russian-language EHR, adapting pyConTextNLP gave F-measure = 0.935, and adapting the ConText algorithm gave F-measure = 0.92–0.95. A method for performing step 4 in full was developed using cue-based and rule-based approaches; it achieved F-measure = 67.5%, which is quite comparable to the baseline.
1 Introduction
Processing healthcare information is one of the challenging problems of NLP. Nowadays this information includes not only clinical records proper but also content from social media. In our work we address texts concerning adverse drug reactions (ADR). ADR detection is one of the most important tasks of modern healthcare. Texts containing information on ADRs often do not comply with grammatical rules, and a significant portion of them is written in narrative form. According to the World Health Organization, ADRs are among the top ten causes of death in many countries [5], and unfortunately Russia is on this list as well. In Russia, studies on the use of NLP in medicine are under way [3, 10, 12], but they are not focused on ADR detection.
In general, the ADR detection process can be divided into the following steps: (1) data collection from suitable text sources (social media and/or clinical texts); (2) selection of text segments containing a reference to ADR; (3) extraction of assertions concerning ADR in a form suitable for further analysis (mainly in the predicate form). Implementing each step in the Russian-language environment is more difficult than in the traditional English-speaking environment. The difficulties stem primarily from the lack of the necessary databases and specialized language resources, and they are compounded by the complex grammatical structure of the Russian language. The article explores these difficulties and presents various adapted algorithms for retrieving ADR from Russian text content.
The rest of the article is organized as follows. In Sect. 2 we discuss each step mentioned above in a uniform structure: the English-language background – available Russian-language support – our proposals, developed methods and the experimental results. Section 3 concludes and outlines the future work.
2 Methods
2.1 Text Sources for Data Collection and Processing
As the literature analysis [1, 2, 4, 9, 11, 13] and real practice show, solving the ADR problem by natural language processing (NLP) methods requires the following input data: primary sources of information, marked datasets, and auxiliary linguistic resources. We conducted a comparative analysis of the most common sources of these data in English and in Russian. The available English-language information sources on ADR, their Russian-language analogues, and the substitutes for missing sources used in our work are briefly described below. A more complete review can be found, for example, in [6].
In the context of ADR detection, the needed resources can be divided into the following groups: (1) Spontaneous reporting systems; (2) Databases based on clinical records and other medical texts; (3) Dictionaries and knowledge bases; (4) Health-related websites and other network resources; (5) Specialized linguistic resources.
Spontaneous reporting systems (1), such as FAERS, VigiBase and AIS-Rospharmaconadzor, are databases of reports of suspected ADR events collected from healthcare professionals, consumers, and pharmaceutical companies. These databases are maintained by regulatory and health agencies and contain structured information in a predetermined form.
In the English segment of group (2), the main place is occupied by the MEDLINE database. There is no similar database in Russian. Among the verified databases of this type in Russia, one can name the annotated corpus of clinical free-text notes [12], based on the medical histories of more than 60 patients of the Scientific Center of Children Health with allergic and pulmonary disorders and diseases. In general, most datasets of this type in Russian are rather small and were designed in-house for research purposes other than ADR detection.
Dictionaries and knowledge bases (3) that support ADR detection are widely represented in the English-language segment. Specialized dictionaries in Russian covering all medical terminology do not yet exist. They are currently being created, mainly by individual research teams (see, for example, [8]).
Health-related websites and other network resources (4) are represented in the Russian-language segment as widely as in the English-language one. For example, the counterparts of the online health community DailyStrength are numerous websites aggregating users' messages about ADR. However, our studies have revealed a number of differences between them that matter for ADR detection: users of Russian-language web resources are much more emotional and prone to polar assessments (such as fine/terrible). Consequently, assessment scales must be chosen carefully in order to take users' opinions into account adequately.
As for specialized linguistic resources (5), the first to be named is the MetaMap toolbox. No such resource exists for the Russian language, so in order to solve the ADR problem researchers have to adapt non-specialized NLP tools or develop their own.
The brief overview shows that due to the lack of verified databases and specialized resources it is expedient to follow the path of adaptation of existing ADR algorithms designed for English content to the Russian language.
2.2 Selection of ADR-Reference Text Segments
The problem of selecting ADR-reference text segments can be considered within the class of text summarization tasks and has an extensive bibliography (see, for example, [2, 4]). In Russia, text forums are of particular interest as a source of information on ADRs, so for the comparative analysis we chose the work [11]. To solve this problem, its authors used a data set built in-house from DailyStrength. The classification was performed using three supervised classification approaches: Naïve Bayes, Support Vector Machines and Maximum Entropy. Preprocessing included adding synonymous terms (using WordNet) and negation detection. The best of the achieved results is presented in Table 1 (gray background).
We have adapted this approach to the Russian-language environment. As a data source, we used three forums. We built a parser and collected 1210 reviews on medications from these sites: asthma – 508 reviews, type 2 diabetes – 222 reviews, antibiotics – 480 reviews. All data were annotated manually for the presence or absence of ADR and for the range of ADR manifestation.
We used only the features that can be assessed for the Russian language (see Table 1). We calculated Tf.Idf using the weightTfIdf() function from the tm package for the R language. To calculate feature (8) we needed a dictionary of terms denoting adverse effects in Russian, which is currently absent. The list of adverse effects was therefore collected manually from the medical dictionary and comprised 215 adverse effects. As for feature (3), the main problem was the lack of sufficiently complete dictionaries. We used a specialized dictionary built in-house.
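The Tf.Idf weighting can be illustrated with a minimal Python sketch. It computes the term frequency normalized by document length and multiplies it by log2(N/df); that this matches the default behaviour of weightTfIdf() is an assumption, and the function name tf_idf and the three toy reviews are hypothetical and not taken from the collected corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Length-normalized term frequency times log2 inverse document frequency.
            term: (count / total) * math.log2(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already normalized toy reviews (not from the paper's dataset).
reviews = [
    ["препарат", "вызвать", "тошнота"],         # drug, cause, nausea
    ["препарат", "помочь"],                     # drug, help
    ["тошнота", "головокружение", "препарат"],  # nausea, dizziness, drug
]
weights = tf_idf(reviews)
```

A term that occurs in every review, such as "препарат", receives zero weight, so only terms that discriminate between reviews contribute to the feature matrix.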
Preprocessing included the removal of stop words and lemmatization using the SnowballC package. The classification was performed using a decision tree algorithm. Since each of the features is represented by a sufficiently large matrix, three classification models were constructed, each using one feature separately. To combine these features, an ensemble of classifiers was constructed using the classification accuracy of each solo model as a decision rule. The results are presented in Table 1 (transparent background).
Comparison of the results in Table 1 allows us to draw the following conclusions. First, despite the smaller set of specialized linguistic resources and, correspondingly, the smaller number of available features for the Russian language, the achieved values of Accuracy on English and Russian-language content are quite comparable (Table 1, in bold). Second, we confirm our earlier conclusion [7] that successful feature selection plays the most important role in forum summarization. Finally, the accuracy of selecting ADR-reference text segments depends on the quality of the content. Indeed, the posts in the DailyStrength resource contain more structure, are longer, and consist of multiple sentences more often than the posts in our forums, and this affects the results in Table 1.
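One possible reading of such a decision rule is accuracy-weighted voting, sketched below in Python. The source does not spell out the exact combination rule, so this is an assumption; the model names, the accuracy values and the function weighted_vote are all hypothetical.

```python
def weighted_vote(predictions, accuracies):
    """Combine per-model labels for one review, weighting each vote by model accuracy."""
    scores = {}
    for model, label in predictions.items():
        scores[label] = scores.get(label, 0.0) + accuracies[model]
    # The label with the largest accumulated accuracy wins.
    return max(scores, key=scores.get)


# Hypothetical validation accuracies of three single-feature models.
accuracies = {"model_1": 0.78, "model_2": 0.70, "model_3": 0.62}
# Hypothetical labels that the three models assign to one review.
predictions = {"model_1": "ADR", "model_2": "no_ADR", "model_3": "no_ADR"}
label = weighted_vote(predictions, accuracies)
```

Here the two weaker models together outweigh the strongest one (0.70 + 0.62 against 0.78), so the ensemble returns "no_ADR".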
2.3 Analysis of Extracted Fragments for the Formation of Logical Statements About ADR
First of all, we considered the possibilities of adapting existing NLP software tools for processing Russian-language texts. As a prototype, we used the method presented in [13], which ports the pyConTextNLP library from English to Swedish (pyConTextSwe). The library uses regular expressions to find the name of a disease in the text automatically and determines the degree of its confirmation within the sentence. In total, four classes are used to classify the confirmation of the disease: definite_negated_existence; probable_negated_existence; probable_existence; definite_existence. The original library contains 381 keywords and 40 names of illnesses in English. The authors of [13] created an extensive dictionary of expressions for Swedish medical texts containing 454 cues (key phrases), using subsets of a clinical corpus in Swedish.
We did a similar job to port the pyConTextNLP library to Russian. As sources for the dataset, we used resources containing anonymized medical histories. We formed a data set consisting of 29 Russian-language medical histories and containing 513 separate assertions. We translated the keywords and diagnoses into Russian and wrote regular expressions for them. We also expanded the list of diseases with the help of third-party resources, adding 2017 names of diagnoses. Based on the results of the first test of the algorithm, we extended the list of keywords, included in the regular expressions various health characteristics mentioned in the medical records, and re-tested the algorithm on the updated regular expressions.
A comparison of the results is given in Table 2. Our results are good in classes def_existence and def_negated and comparatively weak in classes prob_existence and prob_negated. Our analysis showed that this is due to the quality of the initial data: the final diagnoses corresponding to classes def_existence and def_negated are practically true in all the medical histories, while the differential diagnoses corresponding to classes prob_existence and prob_negated are described vaguely and remotely from the context of the specific medical history, so the accuracy of the algorithm is understandably low. Thus, the problem of porting the pyConTextNLP library to Russian can be considered successfully solved.
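The cue-based principle can be illustrated with a minimal Python sketch. It does not use the pyConTextNLP API; it only imitates the idea with the standard re module. The Russian cue patterns, the target patterns and the function classify are hypothetical examples, and assigning definite existence when no cue is found is an assumption of this sketch.

```python
import re

# Hypothetical cue lexicon: (pattern, assertion class), checked in this order.
CUES = [
    (re.compile(r"\bне выявлен[оаы]?\b", re.IGNORECASE), "definite_negated_existence"),  # "not detected"
    (re.compile(r"\bмаловероятн\w*", re.IGNORECASE), "probable_negated_existence"),      # "unlikely"
    (re.compile(r"\bподозрение на\b", re.IGNORECASE), "probable_existence"),             # "suspicion of"
]

# Hypothetical target list: pneumonia, asthma, bronchitis (any inflected form).
TARGETS = re.compile(r"\b(пневмони\w*|астм\w*|бронхит\w*)", re.IGNORECASE)


def classify(sentence):
    """Return (target, assertion class) pairs for every disease mention in a sentence."""
    # Assumption of this sketch: a mention without any cue is a definite assertion.
    assertion = "definite_existence"
    for pattern, cue_class in CUES:
        if pattern.search(sentence):
            assertion = cue_class
            break
    return [(m.group(0).lower(), assertion) for m in TARGETS.finditer(sentence)]


negated = classify("Пневмония не выявлена.")       # "Pneumonia was not detected."
probable = classify("Подозрение на бронхит.")      # "Suspicion of bronchitis."
affirmed = classify("Диагноз: бронхиальная астма.")  # "Diagnosis: bronchial asthma."
```

Porting to another language in this scheme amounts to replacing the cue and target patterns, which is why the size and quality of the cue lexicon dominate the result.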
Our next study was devoted to the possibility of inter-language adaptation of single triggers. As a prototype, we used the work [1], which adapts the English ConText algorithm to the Dutch language and a Dutch clinical corpus. The ConText algorithm is based on regular expressions and lists of trigger terms. It searches for words related to medical terms (cues), treats them as triggers, and defines three parameters for them: negation (denied or affirmed), experiencer (patient or other person), and temporality (at the moment, no more than 2 weeks ago, long ago), thus identifying the contextual properties in the clinical corpus. Four types of medical documentation were used in [1] as a source of information: general practitioner entries, specialist letters, radiology reports and discharge letters. The dataset comprised 7,500 documents with an average of 72 words per document. Such a volume of raw material is not available in Russia, so we built our own dataset of 23 medical records, including all the record types mentioned above.
The main differences between our approach and that of [1] are as follows. To select a suitable parameter state, the original English algorithm, as well as its Dutch adaptation in [1], uses regular expressions with a fixed set of markers; we instead search for words from specially compiled customized dictionaries. In addition, for the analysis of medical texts in Russian we used two values for the temporality parameter instead of the three used in the Dutch variant. Finally, ConText for Russian uses as terminators not only a full stop and a semicolon but also a specially developed dictionary of conjunctions, which makes it possible to determine the context of the trigger correctly. These changes have significantly increased the efficiency of identifying contextual properties of medical terms in Russian (see Table 3).
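The role of dictionary lookup and conjunction terminators can be sketched in Python as follows. This is a simplified illustration for the negation parameter only, not the adapted ConText implementation; the dictionary contents and the function is_negated are hypothetical, and the term is matched by its surface form without morphological normalization.

```python
import re

# Hypothetical dictionaries; the real ones are compiled from medical records.
NEGATION_TRIGGERS = {"нет", "без", "отрицает"}       # "no", "without", "denies"
CONJUNCTION_TERMINATORS = {"но", "однако", "хотя"}   # "but", "however", "although"
PUNCT_TERMINATORS = {".", ";"}


def tokenize(text):
    """Split lower-cased text into words and the two punctuation terminators."""
    return re.findall(r"\w+|[.;]", text.lower())


def is_negated(text, term):
    """Return True if `term` lies in the forward scope of a negation trigger."""
    in_scope = False
    for token in tokenize(text):
        if token in PUNCT_TERMINATORS or token in CONJUNCTION_TERMINATORS:
            in_scope = False   # a terminator closes the current trigger scope
        elif token in NEGATION_TRIGGERS:
            in_scope = True    # a trigger opens a new scope
        elif token == term and in_scope:
            return True
    return False


# "The patient denies cough but reports shortness of breath."
text = "Пациент отрицает кашель, но отмечает одышку."
cough_negated = is_negated(text, "кашель")
dyspnea_negated = is_negated(text, "одышку")
```

Without the conjunction "но" in the terminator dictionary, the scope of "отрицает" would extend to the end of the sentence and "одышку" would be wrongly marked as negated.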
Finally, we investigated whether full syntactic parsing is needed to solve the ADR problem. For comparison, we used [9]. In that work, graphs of grammatical dependencies are constructed using the Stanford Parser for all sentences containing medical terms. The shortest paths in these graphs are considered as the kernel of the relationship between the drug and the side effect and thus form potential 'drug – adverse effect' pairs. Negation detection is then performed with the help of linguistic rules. For adverse drug event extraction, the authors obtained F-measure = 50.5–72.2%, depending on how the applied algorithms were combined.
However, our experiments showed that, due to the complex grammatical structure of the Russian language, the use of the described kernel function leads to significant recognition errors. Therefore, we abandoned sentence parsing and developed a problem-oriented algorithm for extracting ADR from a sentence in Russian. The scheme of the algorithm is shown in Fig. 1. We tested the algorithm on 100 sentences extracted from a medical site. In the experiment for adverse drug event extraction, F-measure = 67.5% was obtained, which is quite comparable to the baseline.
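The shortest-path idea can be illustrated with a small Python sketch that runs a breadth-first search over an undirected dependency graph. The dependency edges below are written by hand for a toy English sentence and are not Stanford Parser output; the function shortest_path is hypothetical and only illustrates the principle described in [9].

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search for the shortest path in an undirected dependency graph."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorting makes the traversal order deterministic.
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path: the two terms are not connected


# Hand-written (head, dependent) edges for "Aspirin caused severe stomach pain".
edges = [
    ("caused", "aspirin"),
    ("caused", "pain"),
    ("pain", "severe"),
    ("pain", "stomach"),
]
path = shortest_path(edges, "aspirin", "pain")
```

The resulting path aspirin, caused, pain keeps only the words that link the drug to the candidate adverse effect, which is what makes it usable as a relation kernel when the parse is reliable.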
3 Conclusion
We proposed a comprehensive solution to the problem of ADR detection in Russian-language texts. To solve the problem of selecting ADR-reference text segments, we constructed an ensemble of classifiers using the classification accuracy of each solo model as a decision rule. Despite the smaller set of specialized linguistic resources and, correspondingly, the smaller number of available features for the Russian language, the achieved values of Accuracy on English and Russian-language content are quite comparable.
To solve the problem of analyzing extracted fragments for the formation of logical statements about ADR, we built a specialized dataset of medical records, a number of specially compiled customized dictionaries and a set of logical rules for the processing. These changes have significantly increased the efficiency of identifying contextual properties of medical terms in Russian. Finally, we developed a problem-oriented algorithm for extracting ADR from a sentence in Russian.
References
1. Afzal, Z., Pons, E., Kang, N., Sturkenboom, M.C., Schuemie, M.J., Kors, J.A.: ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinform. 15(1), 373 (2014)
2. Allahyari, M., et al.: Text summarization techniques: a brief survey. arXiv preprint arXiv:1707.02268 (2017)
3. Baranov, A., et al.: Technologies for complex intelligent clinical data analysis. Vestnik Rossiiskoi akademii meditsinskikh nauk 2, 160–171 (2016)
4. Bhatia, N., Jaiswal, A.: Automatic text summarization and it’s methods - a review. In: 2016 6th International Conference on Cloud System and Big Data Engineering, Confluence, pp. 65–72. IEEE (2016)
5. Gildeeva, G., Yurkov, V.: Pharmacovigilance in Russia: challenges, prospects and current state of affairs. J. Pharmacovigil. (2016)
6. Gonzalez, G.H., Tahsin, T., Goodale, B.C., Greene, A.C., Greene, C.S.: Recent advances and emerging applications in text and data mining for biomedical discovery. Brief. Bioinform. 17(1), 33–42 (2015)
7. Grozin, V., Buraya, K., Gusarova, N.: Comparison of text forum summarization depending on query type for text forums. In: Soh, P.J., Woo, W.L., Sulaiman, H.A., Othman, M.A., Saat, M.S. (eds.) Advances in Machine Learning and Signal Processing. LNEE, vol. 387, pp. 269–279. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32213-1_24
8. Lapaev, M.: Automated extraction of concept matcher thesaurus from semi-structured catalogue-like sources of data on the web. In: 2016 18th Conference of Open Innovations Association and Seminar on Information Security and Protection of Information Technology, FRUCT-ISPIT, pp. 153–160. IEEE (2016)
9. Liu, X., Chen, H.: A research framework for pharmacovigilance in health social media: identification and evaluation of patient adverse drug event reports. J. Biomed. Inform. 58, 268–279 (2015)
10. Lushnov, M., Kudashov, V., Vodyaho, A., Lapaev, M., Zhukova, N., Korobov, D.: Medical knowledge representation for evaluation of patient’s state using complex indicators. In: Ngonga Ngomo, A.-C., Křemen, P. (eds.) KESW 2016. CCIS, vol. 649, pp. 344–359. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45880-9_26
11. Sarker, A., Gonzalez, G.: Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J. Biomed. Inform. 53, 196–207 (2015)
12. Shelmanov, A., Smirnov, I., Vishneva, E.: Information extraction from clinical texts in Russian. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference, Dialogue, vol. 14, pp. 537–549 (2015)
13. Velupillai, S., et al.: Cue-based assertion classification for Swedish clinical text—Developing a lexicon for pyConTextSwe. Artif. Intell. Med. 61(3), 137–144 (2014)
Acknowledgment
This work was financially supported by the Government of the Russian Federation, “Grant 08-08”. This work was also financially supported by the Ministry of Education and Science of the Russian Federation, Agreement #14.578.21.0196 (03/10/2016). Unique Identification RFMEFI57816X0196.
© 2018 Springer Nature Switzerland AG
Cite this paper
Vatian, A. et al. (2018). Adaptation of Algorithms for Medical Information Retrieval for Working on Russian-Language Text Content. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science(), vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_11
DOI: https://doi.org/10.1007/978-3-030-00794-2_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science (R0)