Mining comorbidity patterns using retrospective analysis of big collection of outpatient records

Studying comorbidities of disorders is important for detection and prevention. To discover frequent patterns of diseases we can use retrospective analysis of population data, filtering events with common properties and similar significance. Most frequent pattern mining methods do not consider contextual information about extracted patterns. Further data mining developments might enable more efficient applications in specific tasks like comorbidity identification. We propose a cascade data mining approach for frequent pattern mining enriched with context information, including a new algorithm, MIxCO, for maximal frequent pattern mining. Text mining tools extract entities from free text and deliver additional context attributes beyond the structured information about the patients. The proposed approach was tested using pseudonymised reimbursement requests (outpatient records) submitted to the Bulgarian National Health Insurance Fund in 2010–2016 for more than 5 million citizens yearly. Experiments were run on 3 data collections. Some known comorbidities of Schizophrenia, Hyperprolactinemia and Diabetes Mellitus Type 2 are confirmed; novel hypotheses about stable comorbidities are generated. The evaluation shows that MIxCO is efficient for big dense datasets. Explicating maximal frequent itemsets makes it possible to build hypotheses concerning the relationships between the exogenous and endogenous factors triggering the formation of these sets. MIxCO will help to identify risk groups of patients with a predisposition to develop socially significant disorders like diabetes. This will turn static archives like the Diabetes Register in Bulgaria into a powerful alerting and predictive framework.

Natural Language Processing (NLP) for under-resourced languages is another challenge to be met in order to improve data mining (DM) achievements in knowledge discovery.
We propose a cascade data mining approach for frequent pattern mining enriched with context information, including a new algorithm MIxCO for maximal frequent patterns mining. Text mining tools extract entities from free text and deliver additional context attributes beyond the structured information about the patients. NLP for Bulgarian delivers entities from Outpatient Records (ORs) free texts. Novel hypotheses are generated to discover stable comorbidities and to confirm known ones. The experiments explicate some population specific comorbidities. We also discuss the effects of age, gender and demographics on these comorbidities.
The paper is structured as follows. Section 2 presents related work, Section 3 describes the data we use, and Section 4 the methods. Section 5 discusses current experiments and their medical interpretation. Section 6 sketches further work and the conclusion.

Related work
The concept of frequent itemsets was introduced by Agrawal et al. [2]. Methods for frequent pattern mining (FPM) vary from the naive BruteForce and Apriori algorithms, where the search space is organized as a prefix tree, to the Eclat/dEclat algorithm, which uses tidsets directly for support computation by processing prefix equivalence classes [1]. Another efficient algorithm is FPGrowth (Frequent Pattern Tree Approach). The generated frequent patterns can later be used to generate association rules. Most FPM algorithms generate all possible frequent patterns (FPs), and the search space grows exponentially with the number of items. Summarized information about data relations can be extracted as maximal frequent itemsets (MFI). The condensed information not only accelerates the process by reducing redundancy, but also significantly decreases the number of FPs for post-analysis. All classic FPM algorithms can be modified for MFI search by checking for maximality at each step. There are also algorithms designed specifically for MFI search. The MFCS algorithm combines top-down and bottom-up search [3]. The GenMax algorithm uses a vertical database, diffsets and an optimization that checks whether the union of all itemsets is already included in some maximal itemset and, if so, prunes the branch [4]. The FPMax algorithm is based on FP-trees and extends the FP-growth algorithm [5]. MAFIA uses depth-first traversal of the itemset lattice with effective pruning mechanisms, which works well especially when the database itemsets are very long [6].
NLP for English clinical texts has made significant progress in algorithm development and resource construction since 2000. Open-source tools like cTAKES extract information from clinical free text. Another open-source system is HITEx (Health Information Text Extraction), which extracts variables of interest from narratives [7]. Despite its limitations, the importance of NLP as a supporting technology will grow due to its constant improvement [8].
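To make the difference between all frequent patterns and the maximal ones concrete, the following brute-force sketch enumerates both on a toy database. The transactions, the item codes and the threshold `MINSUP` are hypothetical, and exhaustive enumeration is used only for illustration; it is not one of the algorithms cited above.

```python
from itertools import combinations

# Hypothetical toy database: transaction id -> set of items (ICD-10-like codes).
transactions = {
    1: {"F20", "E11", "I20"},
    2: {"F20", "E11", "I20"},
    3: {"F20", "E11", "I20"},
    4: {"F20", "E11"},
    5: {"E11", "I11"},
    6: {"E11", "I11"},
    7: {"E11", "I11", "F20"},
    8: {"M17"},
}
MINSUP = 3  # absolute support threshold, assumed for this example


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db, minsup):
    """Enumerate all non-empty frequent itemsets by exhaustive search."""
    all_items = sorted(set().union(*db.values()))
    found = []
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            candidate = frozenset(combo)
            if support(candidate, db) >= minsup:
                found.append(candidate)
    return found


def maximal_only(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [x for x in frequent if not any(x < y for y in frequent)]


frequent = frequent_itemsets(transactions, MINSUP)
maximal = maximal_only(frequent)
print(len(frequent), "frequent itemsets,", len(maximal), "maximal")
for itemset in maximal:
    print(sorted(itemset))
```

On this toy input nine itemsets are frequent but only two of them are maximal, which is the kind of condensation that MFI mining aims for.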
Studies on multimorbidity are a great challenge given the mismatch between the high prevalence of this condition and relatively smaller number of research papers [9], which is partly due to lack of data. Machine learning (ML) is the basic technology used in such studies. For instance, four ML techniques (logistic regression, k-nearest neighbors, multifactor dimensionality reduction and support vector machines) were applied to assess risks for diabetes, hypertension and their comorbidity in a cohort of 270,172 hospital visitors (89,858 diabetic, 58,745 hypertensive and 30,522 comorbid patients) in Kuwait, with accuracy > 85% (for diabetes) and > 90% (for hypertension) [10]. An original approach for predicting a comorbid medical condition incidence and progression of medical conditions, using self-posted data available on patient-oriented social media sites, is presented in [11]. The similarity between patient postings is calculated and the risk of a condition is derived thus producing a ranked list of medical conditions for each patient. An algorithm to build medical condition progression trajectories is suggested. The condition incidence model predicts future conditions with coverage of 48% (top-20) and 75% (top-100).

Materials
The data repository we use currently contains more than 262 million pseudonymised ORs submitted to the Bulgarian National Health Insurance Fund (NHIF) in 2010-2016 for more than 5 million citizens yearly. In Bulgaria ORs are produced by General Practitioners and Specialists from Ambulatory Care for every contact with the patient. Despite their primary accounting purpose the Table 1, are processed by our NLP tools.

Methods
The system architecture is shown in Fig. 1. Text mining modules convert the raw text to structured data. We developed a drug extractor that uses regular expressions to describe linguistic patterns [12]; it handles 2239 drug names included in the NHIF nomenclatures. For extraction of clinical test data (body mass index (BMI), weight, blood pressure etc.) we designed a Numerical value extractor [13].
We search for as many associations between chronic diseases as possible. A tabular method using a vertical database is proposed, with depth-first traversal as well as set intersection and diffsets. Further processing of the MFI is applied to remove diagnostic related groups. Some context information is added to each MFI to study comorbidities. This information is presented as attribute-value tuples for each patient; the post-processing identifies the importance of different attributes for each MFI.

Mining maximal frequent itemsets
For the collection $S$ of ORs we extract the set of all different patient identifiers $P = \{p_1, p_2, \ldots, p_N\}$. This set corresponds to transaction identifiers (tids) and we call them pids (patient identifiers). We consider each patient visit to a doctor as a single event. For each patient $p_i \in P$ an event sequence of tuples $\langle event, timestamp \rangle$ is generated: $E(p_i) = (\langle e_1, t_1 \rangle, \langle e_2, t_2 \rangle, \ldots, \langle e_{k_i}, t_{k_i} \rangle)$, where the events $e_j$ come from the set $E$ of all possible events and the timestamps $t_j$ from the set $T$ of all possible timestamps. Let $C = \{c_1, c_2, \ldots, c_p\}$ be the set of all chronic diseases, which we call items. Each subset $X \subseteq C$ is called an itemset. We define a projection function $\pi$ with $\pi(E(p_i)) = (c_{i1}, c_{i2}, \ldots, c_{i m_i})$, such that for each patient $p_i \in P$ the projected time sequence contains only the first occurrence (onset) of each chronic disorder recorded in $E(p_i)$.
Let $D \subseteq P \times 2^C$ be the set of all itemsets in our collection after the projection $\pi$, in the format $\langle pid, itemset \rangle$. We will call $D$ the database. We are looking for itemsets $X \subseteq C$ with frequency $sup(X)$ above a given minsup. Let $F$ denote the set of all frequent itemsets, i.e. $F = \{X \mid X \subseteq C \text{ and } sup(X) \geq minsup\}$. A frequent itemset $X \in F$ is called maximal if it has no frequent supersets. Let $M$ denote the set of all maximal frequent itemsets, i.e. $M = \{X \mid X \in F \text{ and } \nexists\, Y \in F \text{ such that } X \subset Y\}$. Let $2^X$ denote the power set (set of all subsets) of the itemset $X$. Then each subset of $X \in F$ is also a frequent itemset, i.e. $Y \in 2^X$ implies $Y \in F$.
For each item $c \in C$ we define the set called pidset: $p(c) = \{p_i \mid \langle p_i, X \rangle \in D,\ c \in X\}$. We preprocess the database $D$ by generating pidsets and transform it into the vertical database $D_V = \{\langle c, p(c) \rangle \mid c \in C\}$. Let $w \in C$. We define the projection $P_w$ of the database $D_V$ by pidset intersection, $P_w(D_V) = \{\langle c, p(c) \cap p(w) \rangle \mid c \in C,\ c \neq w\}$, and its complement by pidset difference, $\overline{P}_w(D_V) = \{\langle c, p(c) \setminus p(w) \rangle \mid c \in C,\ c \neq w\}$.
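The vertical representation and the two projections can be sketched in a few lines. This is a minimal illustration under stated assumptions: the database `D` is a hypothetical toy example, the function names are invented for this sketch, and excluding `w` itself from both projections is an assumption rather than something the text states.

```python
# Hypothetical horizontal database D: pid -> set of chronic disease codes.
D = {
    1: {"F20", "E11", "I20"},
    2: {"F20", "E11"},
    3: {"E11", "I11"},
    4: {"F20", "I20"},
}


def to_vertical(db):
    """Build the vertical database: item c -> pidset p(c)."""
    vertical = {}
    for pid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(pid)
    return vertical


def project(vertical, w):
    """Projection by intersection: keep only patients who also have w."""
    return {c: pids & vertical[w] for c, pids in vertical.items() if c != w}


def project_complement(vertical, w):
    """Complement by difference: keep only patients who do not have w."""
    return {c: pids - vertical[w] for c, pids in vertical.items() if c != w}


D_V = to_vertical(D)
print(D_V["F20"])                      # pidset of F20
print(project(D_V, "I20"))             # pidsets intersected with p(I20)
print(project_complement(D_V, "I20"))  # pidsets with p(I20) removed
```

Both projections only need set intersection and set difference on pidsets, so the horizontal database does not have to be rescanned after the initial transformation.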

Algorithm MIxCO (MIning COmorbidity)
Assume that the set of all maximal frequent itemsets $M$ is initially empty. We reduce the database $D_V$ by deleting all tuples that contain items with support below minsup and further process the resulting database $D'_V$. Obviously the maximal frequent itemsets contain as many items as possible, thus they must also contain items with low frequency. In order to identify maximal frequent itemsets we start from the weakest item $w \in C$ in $D'_V$. There are two cases: either a maximal frequent itemset $X$ contains $w$, or it does not. Thus we need to split $D'_V$ into two subsets by the projections $P_w(D'_V)$ and $\overline{P}_w(D'_V)$. We apply the algorithm MIxCO recursively to search for all maximal frequent itemsets in $P_w(D'_V)$.
Let the result set of all maximal frequent itemsets in $P_w(D'_V)$ be $M_w$. We add the item $w$ to each of them, $M'_w = \{Y \mid X \in M_w,\ Y = X \cup \{w\}\}$, and obtain the maximal frequent itemsets that contain $w$. Let $B$ be the set of all members of $P_w(D'_V)$ that were removed by MIxCO due to low frequency (below minsup). These items cannot be excluded from further consideration: they have low frequency in combination with $w$ but support above minsup in the entire database $D'_V$, so they can be members of maximal frequent itemsets that do not contain $w$. We update $\overline{P}_w(D'_V)$ by adding those itemsets that contain members of $B$: $\{\ldots \langle c, z \rangle \in P_w(D'_V),\ \langle c, y \rangle \in \overline{P}_w(D'_V)\}$. We then apply MIxCO recursively to the updated $\overline{P}_w(D'_V)$.
We illustrate MIxCO with a synthetic example (Fig. 2). Itemsets of ICD-10 codes for 10 patients are presented. For each ICD-10 code (F20, E11, I11, M17, I20, E66, J44) a set of pids is generated, i.e. $D_V$. We apply reduction for minsup = 3 and obtain $B = \{\langle \mathrm{M17}, \{2, 4\} \rangle, \langle \mathrm{E66}, \{10\} \rangle, \langle \mathrm{J44}, \{5\} \rangle\}$. The weakest item of the new set $D'_V$ is $w$ = I20. In the next step we partition $D'_V$ into two subsets by the projections $P_{\mathrm{I20}}(D'_V)$ and $\overline{P}_{\mathrm{I20}}(D'_V)$. We first process $P_{\mathrm{I20}}(D'_V)$ and apply reduction with $B' = \{\langle \mathrm{I11}, \{8\} \rangle\}$. The weakest item in the reduced set $P_{\mathrm{I20}}(D''_V)$ is $w$ = F20. We apply projection and obtain two subsets, $P_{\mathrm{F20}}(P_{\mathrm{I20}}(D''_V))$ and $\overline{P}_{\mathrm{F20}}(P_{\mathrm{I20}}(D''_V))$. Because no reduction is applied to $P_{\mathrm{F20}}(P_{\mathrm{I20}}(D''_V))$ and its cardinality is 1, we return the frequent itemset $M = \{\{\mathrm{F20}, \mathrm{E11}, \mathrm{I20}\}\}$, which contains the items of both projections, F20 and I20, and the only item left in that subset, E11. The subset $\overline{P}_{\mathrm{F20}}(P_{\mathrm{I20}}(D''_V))$ is empty and the algorithm finishes processing the subset $P_{\mathrm{I20}}(D'_V)$. We continue by processing $\overline{P}_{\mathrm{I20}}(D'_V)$ and update it with the reduced data from $B'$. No further reductions are applied to the updated set $\overline{P}_{\mathrm{I20}}(D''_V)$, because all subsets have support above minsup. The weakest item in $\overline{P}_{\mathrm{I20}}(D''_V)$ is $w$ = F20. We apply projection and obtain two subsets, $P_{\mathrm{F20}}(\overline{P}_{\mathrm{I20}}(D''_V))$ and $\overline{P}_{\mathrm{F20}}(\overline{P}_{\mathrm{I20}}(D''_V))$. For $P_{\mathrm{F20}}(\overline{P}_{\mathrm{I20}}(D''_V))$ no reduction is applied and its weakest item is $w$ = E11. We apply projection and obtain two subsets, $P_{\mathrm{E11}}(P_{\mathrm{F20}}(\overline{P}_{\mathrm{I20}}(D''_V)))$ and $\overline{P}_{\mathrm{E11}}(P_{\mathrm{F20}}(\overline{P}_{\mathrm{I20}}(D''_V)))$, and so on.
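The recursive split on the weakest item can be sketched as follows. This is a simplified illustration and not the authors' implementation of MIxCO: instead of the diffset complement and the update with the removed items, the branch without the weakest item simply keeps the full pidsets of the remaining items and filters out results already covered by the other branch. The function names and the toy vertical database are hypothetical.

```python
def reduce_items(vertical, minsup):
    """Drop items whose pidset has fewer than minsup patients."""
    return {c: pids for c, pids in vertical.items() if len(pids) >= minsup}


def maximal_frequent(vertical, minsup):
    """Return the maximal frequent itemsets of a vertical database.

    Simplification: the branch without w keeps full pidsets rather than
    the diffset-plus-update step described in the text.
    """
    vertical = reduce_items(vertical, minsup)
    if not vertical:
        return {frozenset()}  # only the empty itemset is left
    # Weakest item: smallest support, ties broken by item name.
    w = min(vertical, key=lambda c: (len(vertical[c]), c))
    rest = {c: pids for c, pids in vertical.items() if c != w}
    # Case 1: itemsets that contain w (pidsets intersected with p(w)).
    projected = {c: pids & vertical[w] for c, pids in rest.items()}
    with_w = {x | {w} for x in maximal_frequent(projected, minsup)}
    # Case 2: itemsets without w, kept only if no case-1 result covers them.
    without_w = {
        x
        for x in maximal_frequent(rest, minsup)
        if not any(x <= y for y in with_w)
    }
    return with_w | without_w


# Hypothetical vertical database: item -> pidset.
vertical_db = {
    "F20": {1, 2, 3, 4, 7},
    "E11": {1, 2, 3, 4, 5, 6, 7},
    "I20": {1, 2, 3},
    "I11": {5, 6, 7},
    "M17": {8},
}
result = maximal_frequent(vertical_db, minsup=3)
for itemset in sorted(result, key=sorted):
    print(sorted(itemset))
```

On this toy input the sketch returns the two itemsets {E11, F20, I20} and {E11, I11}. Starting from the weakest item keeps the first projected database small, which is the intuition behind processing low-frequency items first.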

Context information
Comorbidities need to be studied in the context where they occur, so we add semantic attributes to each event: patient demographics, age and gender, treatment, status etc.
We define a set of attributes of interest $A = \{a_1, a_2, \ldots, a_k\}$. The context $Q$ for a patient $p_i \in P$ is defined as the set of attribute-value pairs from the patient profile information: $Q(p_i) = \{\langle a_1, q_1 \rangle, \langle a_2, q_2 \rangle, \ldots, \langle a_k, q_k \rangle\}$. In order to decrease the number of possible attribute values we apply some aggregation of the data. For instance, the age value is categorized according to the World Health Organization (WHO) standard age groups (http://www.who.int/healthinfo/paper31.pdf). Data for body mass index (BMI) are also categorized according to the WHO standard classification: underweight, normal weight, overweight, obesity. Some demographic data, like region ID, have a large number of distinct values. For such data we also add properties describing background information for the region, e.g. whether it is south, north, west, east, central, northwest etc., and mountain, river, sea, thermal spring, urban region etc.
From $Q(p_i)$ we generate a feature vector $v(p_i) = (v_{1i}, v_{2i}, \ldots, v_{mi})$, where each attribute $a_j \in A$ with $N_j$ possible values is represented by $N_j$ consecutive positions in the vector. For the set of maximal frequent itemsets $M$ with cardinality $|M| = K$ we have $K$ classes of comorbidities. We apply multi-class classification in order to generate rules for each comorbidity class. We use SVM and optimization based on the block minimization method described by Yu et al. [14].
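The aggregation of a numeric attribute into a small number of categories can be sketched for BMI. The cut-offs below (18.5, 25 and 30 kg/m²) are the commonly cited WHO adult thresholds and are an assumption here, since the text only names the four categories; the function name and the sample context are hypothetical.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value (kg/m^2) to one of the four categories named in the text.

    The thresholds are assumed WHO adult cut-offs, not taken from the paper.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    return "obesity"


# Hypothetical raw context for one patient, before and after aggregation.
raw_context = {"gender": "female", "bmi": 27.3}
context = dict(raw_context, bmi=bmi_category(raw_context["bmi"]))
print(context)
```

After this step the BMI attribute has four possible values instead of a continuous range, which keeps the number of attribute-value pairs per patient small.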
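One way to read the feature vector construction is as a binary indicator (one-hot) encoding, where each attribute occupies one block of consecutive positions. The sketch below assumes that reading; the attribute names and value lists in `ATTRIBUTES` are hypothetical and much smaller than any real attribute set.

```python
# Hypothetical attributes and their possible values, in a fixed order.
ATTRIBUTES = {
    "gender": ["female", "male"],
    "bmi": ["underweight", "normal weight", "overweight", "obesity"],
    "region": ["north", "south", "east", "west"],
}


def feature_vector(context):
    """Encode a context as consecutive indicator blocks, one per attribute."""
    vector = []
    for attribute, values in ATTRIBUTES.items():
        block = [0] * len(values)  # N_j positions for this attribute
        value = context.get(attribute)
        if value in values:
            block[values.index(value)] = 1
        vector.extend(block)
    return vector


q = {"gender": "female", "bmi": "overweight", "region": "south"}
print(feature_vector(q))
```

The vector length is the sum of the block sizes (here 2 + 4 + 4 = 10), and a missing or unknown value leaves its block at zero. Vectors of this form, labelled with the comorbidity class of the patient, could then be passed to a multi-class classifier.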

MIxCO algorithm evaluation
Evaluation experiments were performed for the MIxCO and FPMax algorithms with two databases, A and B. The number of transactions in both collections is 11,345, but A is very dense whereas B is very sparse. The number of items in A is 4337, and in B it is 3412. Table 2 shows the execution time in milliseconds for a relative minsup between 0.01 and 0.05.
The evaluation results show that FPMax outperforms MIxCO for big sparse databases. In contrast, MIxCO shows better results for big dense databases.

Comorbidity identification
The term "comorbid" here means "indicating two or more medical conditions existing simultaneously regardless of their causal relationship". One comprehensive study of the possible relations between comorbid diseases is [15]. The authors describe 13 comorbid models, also known as "NK models", which make it possible to examine the etiology of the comorbidity between disorders and to predict mortality and other outcomes.
Our experiments for pattern search are made on five OR collections that are used as training and test corpora (Table 3). They contain data about patients suffering from Schizophrenia (ICD-10 code F20), Hyperprolactinaemia (ICD-10 code E22.1), and Diabetes Mellitus Type 2 (ICD-10 code E11). Schizophrenia and Diabetes Mellitus are chronic diseases with a variety of complications that are also chronic diseases. The collections are extracted by using a Business Intelligence Tool (BITool) [13] from the repository of ORs for approx. 5 million patients for a 3-year period.
The minsup value was set as a relative minsup, a function of the ratio between the number of patients and the number of ORs.
It ranges from approximately 0.005% for S1 to 0.015% for S2 and S3. This rather small minsup value guarantees coverage even of rarer chronic diseases, provided they have sufficient support.
The noise in the data is not taken into account. We do not discuss the correctness of the clinical data from a medical point of view. The average number of ORs per patient is distributed almost evenly in the collections S1-S3: 12.2 (S1), 9.85 (S2) and 14.5 (S3), and each patient has several visits each year. On the other hand, the collections are almost complete and cover the population of Bulgaria for this period.
The experimental collections were carefully selected. The association between Schizophrenia, Hyperprolactinemia, and Diabetes Mellitus Type 2 is well known, so it is easier to assess the novelty of discovered comorbidities corresponding to the extracted maximal frequent itemsets.
Comorbidity interpretation in psychiatric diseases has specific aspects because in mental health comorbidity does not necessarily imply the presence of multiple diseases. It is usually the result of imprecisely distinguished mental illnesses and the inability to supply a single diagnosis that accounts for all symptoms. For example, in collection S1 the support of the itemset {F20, F31} is 871, where F31 is Bipolar affective disorder and F20 is Schizophrenia. Despite this imperfection, we see that the longest maximal frequent itemsets overcome this problem. Table 4 contains diseases with ICD-10 codes I11 (Hypertensive heart disease with heart failure), I20 (Angina pectoris), I50 (Heart failure), I69 (Sequelae of cerebrovascular disease). The result is not surprising given the well-studied comorbidity between Schizophrenia and cardiovascular diseases [15]. Interesting and unexpected results were found in the set of maximal frequent itemsets of size 5 (Table 5): comorbidity with M17 Gonarthrosis (arthrosis of knee).
This correlation seems to be a new hypothesis: a PubMed search found only 3 papers referring to relations between delusions and physical diseases such as knee osteoarthritis. Even more interesting results were obtained after adding context information. The demographic data show some relation between the comorbidity {F20, M17} and the location of thermal springs in Bulgaria (Fig. 3). High BMI values would be expected for these patients, but most of them have normal BMI or are slightly overweight. Thus, by contextualizing the FPM findings, the proposed technology supports discovery and exploration of novel correlations between phenotypes and comorbidity.
The role of phenotype in the comorbidity of various diseases is known. For instance, the most common psychiatric disorder, depression, is comorbid with anxiety disorders, abuse of psychoactive substances, and alcohol and drug dependence. High comorbidity is established between depression and somatic dysfunctions as well, e.g. 22-33% of the patients hospitalised for treatment of somatic diseases have depressive disorders too [16]. It is accepted that the predisposition to develop a certain disease is due to the contribution of multiple genes with little effect. The correlation between the genetic fingerprint and the environment works in both directions: people with a genetic predisposition can develop a certain illness when they live in the respective environment; on the other hand, genes can change the individual sensitivity to environmental factors and contribute to the development of a predisposition [17].
The experiments presented here show that deeper understanding of the interrelations between comorbidity, phenotypes and environmental factors can be achieved by finer tuning of the classical data mining techniques in order to discover unknown correlations between data items in patient records and contextual information.

Conclusion and future work
This paper presents a novel algorithm, MIxCO, for MFI mining. The main advantage of MIxCO is that it can efficiently process big dense datasets for small relative minsup values. This is a bottom-up approach which eliminates at the beginning the most critical items, those that are highly likely to be reduced in the MFI. The expected application impact of MIxCO is significant. The explication of maximal frequent itemsets makes it possible to build hypotheses concerning the causal relationships among the exogenous and endogenous factors that trigger the formation of these sets. Mining of patterns is shown here, and mining of sequences is the next task on our agenda.
Future work also includes in-depth experiments with various OR subsets and evaluation of the effectiveness of MIxCO.
The diagnoses with several possible ICD-10 codes or similar diagnoses are also not interpreted in this model. This is an important issue and we plan further investigation of it in our future work.
Finally we note that the technology can be successfully used for explication of risk groups of patients that have predisposition to develop socially-significant disorders like diabetes. This is possible given the large repository of