Background

Patient healthcare trajectory is a recently emerging topic in the literature that encompasses broad concepts. Our research focused on the patient trajectory in terms of disease management and care, while also considering the medico-economic aspects of that management. We approached patient care trajectories through an example: the occurrence of a myocardial infarction (MI). As MI is treated in a health facility, we were able to trace patient trajectories through the national hospital financing system, using comprehensive hospital databases or registers that are routinely collected for billing purposes.

The first prospective payment system (PPS), based on diagnosis-related groups (DRG), was established in the United States in 1983. The objective of this system was to control the expenditures of health care institutions and streamline costs [1]. Thereafter, similar medical information systems were adopted in many other industrialised countries. Some, like France, also adopted an anonymised database with unique patient identifiers (for instance, through cryptographic hash functions) to facilitate the chaining of hospital stays [2–4]. In addition, the gradual increase in fee-for-service payment enhanced coding quality [5]. The introduction of these systems enabled new epidemiological and/or economic studies [6–9] using these databases, with temporal follow-up of patients allowing their trajectory of care to be traced. This review investigated how the trajectory concept is defined, how it is studied and what it achieves.

We carried out a literature search on PubMed using keywords related to trajectory, PPS and MI concepts. We then proceeded in two steps: (1) a non-a priori search with text mining techniques; and (2) a more standard analysis of a sub-selection of documents.

Similar systematic reviews [10] have been performed before, but without using automatic procedures. However, conducting an automatic search is of considerable interest for processing a large number of documents. Text mining allows better targeting for information retrieval and reduces the search time [11], while also enabling users to prioritise searches.

Our reviewing strategy is presented in the “Methods” section, which sets out the search questions that guided our review together with the various methods used to address them. The results are reported in the “Results” section. We end with the “Discussion” section, where we present answers to the search questions and comment on the results. To conclude that section, we discuss the different existing text mining techniques used in systematic reviews.

Methods

Search questions

Healthcare researchers currently explore the literature manually, and use statistical methods or models that require extreme a priori simplification of the processes. Data exploration methods such as text mining instead arrive at the interpretation of the processes through knowledge discovery rather than through a priori assumptions. We formulated practical questions to guide the review process (see Table 1). We identified seven types of non-a priori questions, expressed in general terms, that integrate thematic and medically oriented issues and satisfy the scientific and medical expectations of health care professionals. We also identified seven additional specific a priori questions requiring in-depth analysis.

Table 1 Search questions

Step 1: document retrieval

In PubMed, we searched for documents examining patient healthcare trajectories, as well as PPS and MI. The review selection process is summarised in Fig. 1. The trajectory concept can be expressed with different words such as “trajectory”, “pathway” or “path”. For the PPS theme, the keywords used were “Prospective Payment System” and “PMSI” (Programme de médicalisation du système d’information, the French PPS equivalent), in addition to “DRG” (Diagnosis-Related Group) databases. This theme also arises in the International Classification of Diseases (ICD), in activity-based pricing via the “fee-for-service” or “activity-based payment” expressions, and in the national health registry or hospital registry concepts. We conducted searches according to the themes and constraints summarised in Table 2.

Fig. 1 Flow diagram of study selection

Table 2 Keywords used in document retrieval

Step 2: first text mining approach

The strategy was as follows:

  (i) From the selection of articles gathered in step 1, we created a corpus of texts, divided into three parts, T1 to T3 (corresponding to the Table 2 topics), consisting of the titles and abstracts, from which we removed the keywords (see Table 2) in order to keep only the other terms;

  (ii) The three parts of the corpus were analysed separately with the IRaMuteQ software. This is an R interface for multidimensional analysis of texts and questionnaires [12], allowing statistical analysis of the text corpus [13];

  (iii) We applied the following pre-processing techniques: (a) lemmatization of texts; (b) dictionary enrichment: we lemmatized unrecognised terms with TreeTagger and added specific medical terms and well-known acronyms such as acute myocardial infarction (AMI). Subsequently, the analyses were conducted with the full forms (nouns, adjectives, adverbs and verbs);

  (iv) We carried out conventional textual analysis, then similarity analysis and finally clustering. The various tools used were as follows:

Word cloud This is a synthetic representation of the term distribution: the most frequent words are placed in the centre, with text size proportional to the number of occurrences. This kind of representation thus conveys, in order of importance, the concepts covered across all of the articles. This method provides an answer to Q1.
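As an illustration of the counting that underlies a word cloud, the following minimal Python sketch tallies term occurrences after removing the search keywords and scales a text size in proportion to each count. It is not the IRaMuteQ implementation: the keyword set, the stop list, the function names and the absence of lemmatization are all simplifying assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the Table 2 search keywords removed from the corpus.
SEARCH_KEYWORDS = {"trajectory", "trajectories", "pathway", "path"}
# Tiny illustrative stop list; a real analysis would lemmatise and keep full forms only.
STOP_WORDS = {"the", "of", "and", "a", "in", "for", "to", "with", "on"}


def term_frequencies(abstracts, keywords=SEARCH_KEYWORDS, stop_words=STOP_WORDS):
    """Count term occurrences over all texts, ignoring keywords and stop words."""
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in keywords and t not in stop_words)
    return counts


def text_sizes(counts, max_size=60):
    """Give each term a text size proportional to its number of occurrences."""
    if not counts:
        return {}
    top = max(counts.values())
    return {term: max_size * n / top for term, n in counts.items()}
```

The most frequent term receives `max_size` and every other term a proportionally smaller size, which is the ordering by importance that the word cloud displays.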

Similarity analysis This graph theory-based technique is conventionally used to describe social representations based on survey questionnaires [14]. Similarity analysis is applied to study the proximity and relationships between elements in a set, in the form of maximum trees. The objective is to reduce the number of links between items so as to obtain an acyclic connected graph. The maximum tree is therefore the tree formed by the strongest edges of the graph, where strength is measured by the co-occurrence of the linked terms. For each corpus, we selected the tree representation described in [15] and the algorithm in [16] to describe communities via the shortest path, thus highlighting the words most frequently associated in the same sentence or text. By linking important terms, the graph gives a more precise idea of the concepts and themes raised in the articles. This method provides answers to Q3, Q5 and Q7.
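The maximum tree idea can be sketched in a few lines of Python. The example below counts co-occurrences at the document level and then keeps the strongest edges with Kruskal's algorithm applied in decreasing order of weight. This is a hedged illustration only: it is not the tree representation of [15] or the algorithm of [16], the function names are hypothetical, and the community detection step is omitted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered pair of terms, the documents containing both."""
    pairs = Counter()
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


def maximum_tree(pairs):
    """Keep the strongest edges that do not close a cycle (Kruskal, descending)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Strongest edges first; ties broken alphabetically for reproducibility.
    for (a, b), weight in sorted(pairs.items(), key=lambda kv: (-kv[1], kv[0])):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree
```

If the co-occurrence graph is connected, the result has one edge fewer than the number of terms and contains no cycle; edge weights can then be drawn as line thickness.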

Text clustering Reinert clustering [17] is a form of divisive hierarchical clustering (DHC) that is carried out in several stages, offering a global approach to the corpus. It identifies statistically independent word classes after partitioning the corpus. These classes may be interpreted by their profiles, which are characterised by specific correlated words. DHC summarises this through a dendrogram. This analysis generates a complementary vision with regard to similarity analysis by clustering articles according to concepts, partly identified by similarity analysis, characterised by word groups. This method will supplement the answer to Q7, and address Q2, Q4 and Q6.

Step 3: thorough analysis of the selected articles

We used the sub-selection technique derived from Moher’s method described in [18], and crossed the sets of themes: T1 and T2, denoted T1∩T2, then T1 and T3, denoted T1∩T3. This selection was performed in the same manner as described in Table 2. To better target our study, we added a further constraint: we counted the number of occurrences, K, of the trajectory concept in each document and selected the documents for which K ≥ 2. Each appearance of the words “trajectories”, “trajectory” or “pathway” in the title or abstract of an article was counted.
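The crossing of themes and the K ≥ 2 constraint can be expressed compactly. The following Python sketch assumes that each theme is available as a set of article identifiers and that titles and abstracts are stored in a dictionary; these data structures and the function names are assumptions made for illustration, not a description of the tooling actually used.

```python
import re

# The three word forms counted in the titles and abstracts.
TRAJECTORY_PATTERN = re.compile(r"\b(trajectories|trajectory|pathway)\b", re.IGNORECASE)


def trajectory_count(title, abstract):
    """Return K, the number of occurrences of the trajectory words."""
    return len(TRAJECTORY_PATTERN.findall(f"{title} {abstract}"))


def sub_select(theme_a, theme_b, records, k_min=2):
    """Cross two themes (set intersection) and keep the articles with K >= k_min.

    theme_a, theme_b: sets of article identifiers.
    records: dict mapping an identifier to a (title, abstract) pair.
    """
    crossed = theme_a & theme_b
    return sorted(i for i in crossed if trajectory_count(*records[i]) >= k_min)
```

For example, `sub_select(t1_ids, t3_ids, records)` would return the identifiers retained for T1∩T3, where `t1_ids`, `t3_ids` and `records` are hypothetical inputs built from the step 1 retrieval.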

Our reading grid was based on that described in the PRISMA guidelines. We selected items that could be used to address the a priori search questions (see Table 1), Q8 to Q14: publication year, country of study, number of patients, observation period, methods and objectives. Other items that were irrelevant to our study were not kept. We added the following items: pathologies studied, databases used and definition of the trajectory concept.

Results

Some results, not listed in this paper, can be viewed at the following address: http://www.lirmm.fr/~pinaire/.

Step 1: document retrieval

The document retrieval resulted in a total of 33,514 articles.

Step 2: first text mining approach

We present the results obtained by our method which combined different approaches of lexicographic analysis (see below) following the flow diagram (Fig. 1).

Word cloud For T1 (see Fig. 2), the salient terms were “care”, “study”, “cancer”, “cell”, “treatment” and “increase”, while for T2 they were “study”, “registry”, “datum” and “cancer”, and for T3 they were “AMI”, “acute” and “hospital”.

Fig. 2 Word cloud: trajectory, PPS, MI

Similarity analysis In a maximum tree, only the strongest edges of the graph are kept. An edge symbolises the co-occurrence between vertices (i.e. terms), and its thickness represents the strength of the link. For instance, in Fig. 3, the link between “care” and “study” is thicker than the link between “significant” and “difference”. For T1, Fig. 3 comprises three hubs: the bottom part with a large network characterised by “care”, then a smaller contiguous network gathering the terms “clinical” and “outcome”. In the upper right part, there is a network encompassing genetic terms, with “cell”, “expression” and “gene”, then a smaller connecting network gathering the terms “increase”, “high” and “significantly”. The upper central part contains the terms “cancer”, “diagnosis” and “treatment”. Finally, the upper left part has a large network containing “study”, to which are attached several smaller clusters characterised by the terms “risk”, “disease”, “time”, “year” and finally “disease”. The most closely linked word communities are “genetics” with “cancer”, “cancer” with “study”, and “study” with “care”.

Fig. 3 Similarity analysis and communities: T1 trajectory

Text clustering Following this clustering, 80% of the T1 articles were distributed in 11 disjoint clusters, 86% for T2 in five clusters, and 98% for T3 in five clusters. We then performed a second clustering on the sub-corpus of each theme, consisting of the articles that were not clustered during the first analysis.

For T1, Fig. 4 shows, from right to left, two clusters pooling the concepts of genetic organisation (cluster 5), and signal organisation and cellular mediation (cluster 10). Cluster 8 pools concepts related to the immune system response in an inflammatory process. Cluster 1 pools dysfunctions related to diabetes and their consequences. Clusters 2 and 7 respectively represent time in the organisation of hospital stays, and time in the trajectory. Cluster 6 concerns questionnaires and psychometric scales relating to depression. Cluster 11 contains concepts related to medical imaging. In the last branch, cluster 9, medicine is described with regard to its financial and regulatory aspects. Cluster 4 concerns patient management, including practices. Cluster 3 groups together terms pertaining to the way information is conveyed.

Fig. 4 Dendrogram of the first clustering for T1 corpus: trajectory

We performed a more detailed examination of clusters 3 and 4, while studying the similarity tree. The two clusters pool 1645 items between them. For cluster 3, there are three nodes (see Fig. 5), i.e. for “care”, which is closely related to that for “study”, which in turn is closely related to that for “cancer”. For cluster 4 (see Fig. 5), there is only one node represented by “care”, from which there are several branches for “research” and “process”, and then higher there is a sub-node for “clinic”, connecting “trial”, “datum”, “bases” and “identify”.

Fig. 5 Similarity analysis on cluster 3 from the first T1 clustering corpus

In the second clustering of the 3160 non-clustered articles, we identified five clusters consisting of 99% of the articles. From right to left, cluster 1 pools the concept of studies from a methodological standpoint. Cluster 5 concerns end-of-life issues. Cluster 4 pools the macroscopic aspect of care with public support. Cluster 3 groups studies involving animal experiments. Finally, cluster 2 concerns genetic mutation and anomalies. Three articles could not be clustered due to a lack of information.

Step 3: thorough analysis of the selected articles

Through set crossing, we generated a sub-selection of 84 articles, including 53 for T1∩T2 and 31 for T1∩T3. After reading the abstracts, we eliminated 8 articles for T1∩T2, and 6 for T1∩T3 as they evaluated a protein or organisation trajectory rather than patient trajectory. For the majority of items, we created categories as detailed in Table 3 and the sources of associated references are indexed in Table 4.

Table 3 Description of the observed items and categories
Table 4 Review references sources by item reviewed

Generally, the authors used several sources and methods in their studies. The results for these items are summarised in Fig. 6.

Fig. 6 Databases and methods used in trajectories studies

We grouped the countries according to continent. For T1∩T2, we noted strong representation from Europe (55% of articles) and the Americas (29%). There were some studies from Oceania (9%) and Asian countries (7%). For T1∩T3, the article distribution was essentially between three continents: Europe (36%), the Americas (28%) and Asia (24%). Australasia was marginal, with 4% of articles. There were some atypical studies with data from multiple continents (8%).

We next considered publication year. For T1∩T2, the results highlighted activity that began developing in 2013. For T1∩T3, by contrast, we noted a peak of activity in 2004 and increasing activity in 2012.

We then analysed the number of patients involved, which ranged from 14 to 6.2 million for T1∩T2 (vs 20 to 30.20 million for T1∩T3), with a median of 859 and an interquartile range (IQ) of 3250 (vs 604.5 and IQ = 933.25 for T1∩T3), and with missing data for three articles (vs five for T1∩T3).

We also focused on the observation duration, measured in months, which was available in more than 85% of articles. Observation duration ranged from 5 to 180 months in T1∩T2 (vs 3 to 240 months in T1∩T3), with a median of 36 months and an IQ of 54 months (vs 12 and 99 months in T1∩T3).

Discussion

Our method is based on a semi-automatic text mining approach. We used terms and concepts that emerged from classification techniques rather than the simple presence of words. This approach was structured into two main steps, document retrieval and text mining, which preceded a thorough analysis of the selected articles.

Step 1: document retrieval

For document retrieval, we chose to focus our study on PubMed. The search results are entirely dependent on the choice of keywords, making this a particularly delicate task when definitions may vary between authors and countries. Indeed, we encountered this difficulty for T2. As presented in Table 2, the keywords used were “Prospective Payment System”, “PMSI”, “DRG”, “ICD”, “regional information system”, “fee for service system”, “registry” and “Activity-based Payment”. However, some documents used words that were not in our final selection, such as [19], which contains the term “national case-mix system”. Our objective was not to cover all publications exhaustively, but rather to define a general method of analysis. One way to improve our approach would be to implement an adaptive algorithm for keyword enrichment.

Step 2: first text mining approach

The lexicographic analysis was based on three combined tools.

For the word cloud approach, the occurrence of the terms “study” and “care” in all of the studied fields means that these articles cover the care concept and studies on topics such as diseases or drugs. For T1, the terms “treatment” and “increase” reflect a focus on patient healthcare trajectories. Thus, there are many studies on patient trajectories. Here we have answered Q1.

For the similarity analysis, the results showed that for T1, studies were closely related to care, disease and more specifically to cancer. In response to Q3, the studied diseases were those causing severe and chronic organ dysfunction: heart, kidneys, or lungs. We noted that the cancer concept was also closely related to that of genetics. We found here that the use of the keyword “pathway” highlighted all articles pertaining to cell signalling or gene pathways [20–29].

For T2, cancer was closely related to the registry data. This highlights the descriptive aspect of the data information, i.e. registry data describing the patient’s cancer history from its diagnosis. We noted that the study concept was related to the disease concept, i.e. cardiac or renal, but also to the various treatments and therapies. In response to Q5, T2 was thus related to research: in disease studies [30–37], to compare care and coding [38–40], but also in monitoring of patients over time and the survival rate [31, 41–44]. Survival rate forms part of a trajectory concept. This trajectory concept may also encompass the registry, i.e. a longitudinal concept containing many concepts related to longitudinality.

The T3 graph highlights two standpoints regarding MI studies: firstly that of clinicians who study MI, its risks and aggravating factors to gain insight into preventing and, if necessary, managing these patients. Secondly, that of patients with coronary symptoms, which could progress to incidents, which could then progress to AMI requiring hospitalisation and with high risk of mortality depending on the patient’s age. This partly answers Q7.

The text clustering enabled summarisation of the results in order to list the topics studied, asked in Q2, in these articles concerning patient trajectories. The first topic that we covered is disease with, for example, metabolic disorders such as diabetes and cardiovascular complications. Certain articles addressed patients’ feelings, anxieties and disease experience. In the patient trajectory, there was support from the patient’s immediate relations and family, but also health services, such as home nurses. Other articles focused on end-of-life situations, palliative care and processes set up to manage this last stage of the disease. Another topic was clinical research, involving developing cohorts, data collection, and methods used in different studies. Other studies concentrated on hospital organisation, various services, patient care staff, and associated costs. Other articles were focused on the health regulations and recommendations from guides of good practices.

As a response to Q4, our conclusion regarding the two T2 clusterings was that PPS is used in research primarily in the study of diseases, sometimes on disease onset, especially on disease management, associated costs, treatment and possible complications, but also in its coding. The studied diseases included neurological disorders, cancer, irregular heartbeat and cardiovascular diseases, the implantable medical devices to regulate these anomalies, traumas and wounds, infectious diseases, organ transplants, genetic and autoimmune diseases, and finally renal failure. Pregnancy and birth are also studied.

The T3 results, in reply to Q6 and to complete the response to Q7, showed that MI is studied from several aspects, with the first regarding the risk factors (socioeconomic, age, hypertension, diabetes). Then there are the biochemical and cardiocirculatory functioning aspects, the various mechanisms which lead to MI and genetic predisposition [45]. In addition, there is the psychological aspect of ill patients and the consequences. There is also emergency management before hospital admission, including transport and first aid. Then there is care at admission, medication management [44] and associated costs; here the trajectory concept emerges. There is also an aspect regarding the effectiveness of the measures implemented [46–52] and the different treatments [53–55]. Another investigated aspect concerns lifestyle, with regard to dietary habits, healthy lifestyle [56], comorbidities [43, 57] (smoking and/or alcohol), but also environmental factors like atmospheric pollution.

Step 3: thorough analysis of the selected articles

A thorough analysis of the selected articles was performed. Trajectory studies require, first and foremost, a definition of this concept, which is the focus of question Q8. The results showed that in most cases trajectory is characterised by care processes established for a specific disease to improve patient care, facilitate health planning within institutions, ensure prevention, predict the course of the disease and prevent the onset of symptoms.

In response to Q9, we found that interest in patient trajectory studies has increased in the last 5 years. The resurgence of studies in 2013 could be explained by the improvement in the quality of databases as of 2009 (ref), particularly in France, and the possibility of chaining hospital stays and reconstructing patient care trajectories throughout the country.

This interest in trajectories mainly stems from Europe and the Americas with 47 and 29% of studies, respectively. The PPS concept necessarily led to only including countries with a similar health system database organisation. This is a weakness of our study, since countries with a different information health system to the American model were not selected through this filter. Thus we have answered Q10.

We then sought to determine why these studies were conducted, in response to Q11. The six-category article distribution we defined showed that the aim of most of the studies was to compare treatments, techniques or care procedures. In each case, the aim was to reduce costs while improving the quality of care. Patient healthcare trajectory studies appeared beneficial in two ways: (1) the trajectory provides insight into the course of the disease following medical and surgical care; and (2) the trajectory may be highly informative regarding the medico-economic aspects, making it possible to streamline the patient’s care management and avoid treatment dispersion.

In addition, the methods used underpinned the rationale of comparative studies of care techniques, treatments or care processes. These methods (ANOVA, comparison tests, survival models, linear or logistic regression, etc.) are listed in the second part of Fig. 6, which answers question Q12.

We pursued this investigation by assessing the study characteristics, and answered Q13. In the studies, the number of recruited patients was estimated a priori so that the statistical analyses could be performed with sufficient power. However, we identified a few studies that were conducted on the entire population, without sampling.

Overall, the study time was short, not more than a few years, which could be explained by economic considerations or a lack of data. For retrospective studies, for example, it was sometimes hard to trace back several years because the information is deleted after a certain period of time.

Next, we investigated the origin of the data used. For T1∩T3, registry data were mostly used. For T1∩T2, hospital databases and hospitalisation billing databases were used, so the studies were mostly hospital-based. Moreover, apart from hospital databases, some studies took patients’ feelings into account via interviews and questionnaires. Some studies required additional information on, for instance, medication [31, 58] through pharmacy databases or non-hospital care [59, 60] with social security databases for complete patient care monitoring.

To supplement previous findings concerning the list of diseases studied, for T1∩T2, the patient trajectories were closely focused on different cancers. Note that this brings us back to the results that were highlighted in step 2. We thus resolved question Q14.

Methods analysis

Many different text mining techniques are constantly being developed for literature searches and systematic reviews [6]. In systematic reviews, text mining techniques are used for four purposes [61]: (1) automatic term recognition, to identify and extract terms automatically from texts [62]; (2) document classification, by generating subsets of documents focused on a specific topic [63–66]; (3) document clustering, to group documents into topics, where each topic is shared by all the documents in its group and by no other document in the collection [17, 67, 68]; and (4) drafting abstracts, by selecting sentences from each document based on the significance of its terms, which are combined via classification techniques [69]. Some authors used text mining for other purposes. For example, in [70] the authors created correspondence databases linking authors with their name abbreviations and performed a co-authorship analysis. In [71] the authors annotated abstracts in two ways, first with the gene or protein of interest, then with the protein interactions and/or gene functions. Ultimately, they categorised documents according to these annotations. Thus, combining text mining methods for systematic review is a hot topic [72–75].

While there is no consensus on a method for conducting a review with a huge number of documents, several text mining techniques are already used in various fields to explore text data [76–78]. Here, we wished to gain an overview of the document content in a recently developing field of inquiry, in order to provide general information and to respond to research questions. Our aim was to maximise recall to ensure a comprehensive study. We also aimed to better select publications by creating filters, and then reviewed them in a classical manner. With our method, searches are conducted based on the meaning of the words and concepts emerging from classification techniques rather than simply on the presence of a term. Thus, we conducted an in-depth study to explore the texts, starting by highlighting the keywords that were often used in the abstracts. Word cloud representation was most suited for this step, as it enabled a quick visual reading of the results. However, beyond the visual data display, word clouds do not provide much information.

One way to gain further insight is to highlight a lexical universe attached to those keywords. Thus, the same word may be interpreted differently depending on the terms associated with it. Similarity analysis best addresses this issue. Its tree construction approach connects highly co-occurring networks of terms and allows a better understanding of the most frequently discussed themes through the various items making up each corpus.

The last step in the exploration process is to determine whether it is possible to classify these articles into the topics highlighted by similarity analysis. We compared these results by using Reinert clustering because it has the advantage of respecting the text construction. It also offers more flexibility than latent Dirichlet allocation (LDA), for example, where the researcher has to pre-determine the number of clusters. Although some authors have proposed solutions for the “optimal” number of topics in topic modelling [79, 80], this number cannot be verified, making this method even harder to apply.

The text mining methods that we selected have proven to be effective in exploring the corpora without a priori assumptions and with open-ended questions, allowing us to quickly identify documents associated with subjects beyond genetics. This facilitated the filtering of articles, to which a priori methods were then applied to answer specific questions. Although existing methods for exploring text data to conduct rapid reviews are good, we hoped to validate a non-traditional methodology to conduct more extensive systematic reviews for future research.

Conclusion

In this article, a semi-automatic text mining methodology was applied to investigate patient healthcare trajectory. Patterns were extracted and identified semi-automatically from the articles published in PubMed. With text mining techniques we could analyse large amounts of text data, which would not have been possible otherwise. The originality of our approach lies in assisting a research review on the basis of a semantic approach, from research questions to targeted documents which are then thoroughly analysed. This method is well adapted to complex review questions or hard-to-define topics such as those addressed in public health, and more particularly in the context of the patient healthcare trajectory literature. Finally, our strategy enabled us to explore the concept of trajectory in the care domain.

We illustrated our search using a frequent cause of hospital stay, the occurrence of an MI. We chose to trace the follow-up of these MI patients through the PPS. We addressed open-ended questions by determining the topics covered in each area, to explore areas transversely, while highlighting studies dealing with patient trajectories with regard to MI, based on PPS data. This semantic approach proved to be well tailored to our issues.

Document retrieval on the patient trajectories was combined with two major themes, i.e. PPS databases and MI. The findings showed that this type of study is of interest in the biomedical community; for comparative trajectories of drug prescriptions and costs Sundberg et al. concluded [58] that: “Drug prescriptions and costs of analgesics increased following conventional care and decreased following integrative care, indicating potentially fewer adverse drug events and beneficial societal cost savings with integrative care”. Similarly, with regards to access to the appropriate treatment in time for cancer patients, Defossez et al. affirmed [81] that “There is in particular a need to describe and analyse cancer care trajectories and to produce waiting time indicators…The evaluation shows the ability of an integrated regional information system to formalise care trajectories and automatically produce indicators for time-lapse to care instatement, of interest in the planning of care in cancer.” Our study revealed that the trajectory concept, regardless of its form, is being explored, analysed and exploited, especially in oncology through the oncology communicative medical file and multidisciplinary meetings.

To complete this research, it would now be interesting to include studies on patient trajectories in electronic health records. Some recent studies have focused on the use of these new technologies in order to offer patients with mobility difficulties integrated care by pooling electronic records from patients, caregivers or healthcare teams as well as doctors’ follow-ups [82]. However, the implementation of such processes requires considerable organisation and adequate resources [83] and can lead to technical interoperability problems [84].

We are also studying patient trajectories in a health environment with MI. We obtained DRG sequences by chaining hospital stays. These sequences represent the chronological pattern of the hospital healthcare of patients, and we have characterised patient trajectories by such DRG sequences. We have applied sequential pattern mining techniques [85] to our trajectories in order to highlight frequent hospital trajectory patterns. To our knowledge, this is the first time that this type of approach, applying sequential patterns to hospital data or registry data, has been used. Our ultimate goal is to build a predictive model of MI trajectories to simulate disease progress in the coming years so as to help anticipate health needs.
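To make the notion of a frequent trajectory pattern concrete, the sketch below mines frequent subsequences from chained stay sequences using a simplified PrefixSpan-style recursion. It is an assumption-laden illustration and not necessarily the technique of [85]: each stay is reduced to a single label, the labels used in the usage note are hypothetical placeholders rather than real DRG codes, and `min_support` is an arbitrary threshold.

```python
def frequent_sequences(sequences, min_support):
    """Return {pattern: support} for every subsequence (order kept, gaps allowed)
    that appears in at least min_support of the input sequences."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, position just after the matched prefix).
        support = {}
        for sid, start in projected:
            # Count each label at most once per sequence.
            for item in set(sequences[sid][start:]):
                support[item] = support.get(item, 0) + 1
        for item, count in sorted(support.items()):
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project each sequence on the first occurrence of the new item.
            next_projected = []
            for sid, start in projected:
                seq = sequences[sid]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        next_projected.append((sid, pos + 1))
                        break
            mine(pattern, next_projected)

    mine((), [(sid, 0) for sid in range(len(sequences))])
    return results
```

Called on a handful of hypothetical trajectories such as `[["MI", "PCI", "REHAB"], ["MI", "REHAB"], ["MI", "PCI"]]` with `min_support=2`, the function returns the patterns shared by at least two patients together with the number of trajectories supporting each.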