Mapping of diseases from clinical medicine research—a visualization study

By employing bibliometric method, this study aimed to visualize the research hotspots and correlations among clinical medicine subjects. Literatures were retrieved from the PubMed database based on MeSH words and free-text phrases and screened based on inclusion and exclusion criteria. The disease themes were manually marked according to ICD-10. Co-word analysis and strategic diagram methods were applied to explore the hot topics and development trends of disease themes. 2551 articles were included after literature screening. The amount of paper showed an increasing trend and reached a peak in 2013. The subjects of adults and the elderly accounted for 45.0% and 27.0% respectively. The United States of America had the most publication, with Massachusetts and California being the most prevalent states, and Harvard University was the most prolific institution. Co-word analysis revealed that research hot topics of diseases were divided into 8 themes, among which the most was “disease of the circulatory system” and “injury, poisoning and certain other consequences of external causes”. The strategic diagram showed that the above two topics were mature but relatively independent, while the “physical fitness” topic was not mature enough but was closely related to the others. There are more and more data-driven studies in the field of medicine and health, while, huge development spaces in the full spectrum of the diseases do exist. Mining the published researches through bibliometrics and visualized methods could come up with valuable results to inform further study.


Introduction
Medicine is a systematic discipline from the prevention to the treatment of disease, also be divided into many different research directions and research fields (Iserson and Moskop 2007). Disease is the abnormal life process of the body under certain conditions, which is caused by the damage of etiology and is caused by the disorder of self-regulation (Miller and Chappel 1991). Disease category is one of the appropriate ways to distinguish different clinical fields in clinical medicine research.
A method named bibliometric analysis is to make quantitative analysis on appropriate literature sources by showing the growing trends and periodical hot topics of a subject in a visualization way (Shen et al. 2011a, b, c, d). In 2011, Prof. Shen, one of our group members, published a series of articles about visualization studies on evidence-based medicine domain knowledge, which has received widespread attention (Shen et al. 2011a, b, c, d). Meanwhile, medical scholars had adopted the bibliometric method to find theme trends and knowledge structures in different medical fields. For instance, Zhang visualized knowledge domain of patient adherence by co-word analysis and social network analysis (Zhang et al. 2012); Fu utilized bibliometric analysis to research malaria in China during 2004 to 2014 (Fu et al. 2015); Gu, Hsu, and Liao all visualized the big data research in medicine (Gu et al. 2017;Hsu and Li 2019;Liao et al. 2018); Huang analyzed the pelvic organ prolapse during 2007 to 2016 by bibliometric and social network analysis (Huang et al. 2018); Jin and Xin visualized the hotspots and trends of multimedia big data (Jin and Li 2018); Zhao used co-word analysis to study the theme trends and knowledge structure on choroidal neovascularization (Zhao et al. 2018); Saheb adopted bibliometric analysis to exploring IoT big data analytics in the healthcare industry (Saheb and Izadi 2019); Wang applied co-word analysis to investigate the professional-patient relations in the Internet era ; and Yang used the same method to identify the trends in research on Vitamin D (Yang et al. 2019).
However, all those articles were aimed at isolated disease without exploring the relationship among them. Investigating the hotspots and correlations of broad clinical medicine research subjects by bibliometrics method might provide references for clinical scholars and methodologists in their professional research. In this study, we conducted a relatively comprehensive exploration and collation of clinical medicine literature subjects, through manually labeling the disease theme according to the International Classification of Disease, 10th Revision (ICD-10), which was developed by the World Health Organization (WHO) in 1992 (Quan et al. 2005), and, applied bibliometrics and visualized methods to reveal hot topics and relationship among them, so as to inform further study.

Data sources and searching strategies
The clinical medicine researches based on medical data science with various methods were searched in the PubMed database on March 28th, 2018. When selecting research terms, besides main terms such as "Medicine", "Health" and "Disease", we focused on three topics in fields of medical health. The first topic was "data" because the object of our research aims at medical data science, and the search terms of "big data", "realworld", etc. were considered; the second topic was "method", such as "Data Mining", "Machine Learning", "Statistical Models", "Biostatistics" etc.; and the last topic was "health information" with the search terms being "Health Information Systems", "Electronic Health Records", etc.. A total of 25 search terms were used, including Medical Subject Headings (MeSH) terms (which contained vocabulary of biomedical terms and created by U.S. National Library of Medicine) and free-text phrases. Furthermore, we applied the retrieval strategy (Table 1) of the following three steps to obtain literature in initial research: (1) "OR" was used to combine the search terms with the same meaning or topic (#1-#6); (2) the retrieval results from #2 to #6 were connected as #7; (3) "AND" was used to combine #1 and #7 and filtered to remain papers from core clinical journal subsets, which is an optional choice to add journal restriction in the Pub-Med database searching (https ://www.ncbi.nlm.nih.gov/books /NBK37 99/#catal og.Searc hing_for_Journ als_in_the_NL).

Inclusion and exclusion criteria
In this study, we included articles published in English (Saheb and Izadi 2019) and studied in human with quantitative data methods, such as statistical models, machine learning, etc., focusing, especially, on clinical diseases, including etiology, mechanism, clinical manifestations, diagnosis, treatment, prevention, prognosis, etc.. Moreover, the following related articles were excluded (Haase et al. 2009;Huang et al. 2018;Wang et al. 2019): (1) Duplicated article; (2) Articles whose abstract cannot be imported into EndNote software (https ://endno te.com); (3) Articles that just apply descriptive methods; (4) Review articles or systematic review articles; (5) Articles with irrelevant topics. The screening step and results of the literature selection are shown in Fig. 1.

Paper screening
Based on inclusion and exclusion criteria, two researchers selected the literature back to back. In circumstances of inconsistency, they discussed and decided whether to include, if an agreement could not be reached after discussion, a senior researcher would be asked to make the decision.

Manual Labeling
Based on ICD-10, the experienced medical workers conducted a classification supplement of keywords for key diseases in the included literatures. We invited four experienced medical workers to manually label disease themes, the detailed steps and rules are as follows: (1) The medical workers were trained to classify the main research content based on ICD-10 categories by viewing the title, keywords and abstract; (2) To ensuring the label results' Articles were obtained through database retrieval and were imported into EndNote (n = 5,987) Articles were acquired after eliminating duplicated articles (n = 5,980) Primary screening (n = 5,980) Secondary screening (n = 2,774) Articles included in this study (n = 2,551)  consistency from different medical workers, we randomly chose 50 articles and each one should label the disease theme back to back. If there are two or more different label results from four medical workers in the same article, they would discuss in the group, and the training and test procedure would be carried out again as above, until the label results are consistent.
(3) After confirming the label results' consistency, we divided the rest of included articles into four groups and each group was in charge by one experienced medical worker; (4) If there were several different disease themes existed in one article, they would try to classify each article into two or fewer main disease themes; if some articles clearly indicate more than two main disease themes, they would discuss with others to sum up into two or fewer themes, however, more than two disease themes were allowed sometimes if the agreement could be reached after discussion. (5) After finishing marking the articles of one group, another medical worker would double-check them in this group. Let's take a simple example through one group of the articles which the paper Abdelsattar's (Abdelsattar et al. 2017) belonged to. Firstly, we could easily classify the article as "Neoplasms" disease theme based on its title "The impact of health insurance on cancer care in disadvantaged communities" and its keyword "Neoplasms/*epidemiology/mortality", then we had a quick view of its abstract or text to see if there was another disease theme related to, and finally another person did a transvaluation on this article. After manual labeling, the main classification types of disease themes were set as "Certain infectious and parasitic diseases" (CPD), "Neoplasms" (NPS), "Injury, poisoning and certain other consequences of external causes" (IPC), "Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism" (DBC), "Endocrine, nutritional and metabolic diseases" (ENM), "Mental and behavioral disorders" (MBD), "Diseases of the nervous system" (DNS), "disease of the circulatory system" (DCS), "physical fitness" (PSF), "Diseases of the eye and adnexa" (DEA), "Diseases of the ear and mastoid process" (DEM), "Diseases of the respiratory system" (DRS), "Diseases of the digestive system" (DDS), "Diseases of the genitourinary system" (DGS), and "Pregnancy, childbirth and the puerperium" (PCP).

Descriptive statistics
A histogram chart of the number of publications per year was presented. The proportion of age distribution of research objects was calculated by the frequency data set of keywords indicating age structure, such as "Adult", "Infant", "Newborn", "Child", "Preschool", "Aged", "80 or over", etc. (Shen et al. 2011a, b, c, d). The spacial distribution was presented through filtering out the "country" in the "Author Address" entry from EndNote and counting the word frequency of each "country" for the corresponding author, and the same "country" was merged and sorted by their frequency to obtain national data with significant results. And the same method was applied to acquire the spacial distribution at the state level in the top country, and the affiliation distribution through publication counts of the corresponding author's affiliations in the top three states/regions in the top country. The "maps" and "ggplot2" packages in R software (https ://www.r-proje ct.org) were used to realize data visualization for heat maps combined with geographic location information.

Association of clinical medicine researches by co-word analysis
Co-word analysis is to reflect the strength of the correlation between keywords by analyzing the co-word of words in literatures. We excluded subject headings like "age", "gender" and "nationality" from the total keyword list. Based on the keywords of our research, a word co-occurrence matrix representing the number of articles containing two specific keywords was established. The number of articles reflects the correlation strength between the keywords. The more the articles, the closer the connection between the two keywords. In this study, all keywords were extracted from EndNote with the label of 2,551 eligible articles, and we chose the co-occurrence frequency of keywords that is larger than 10 to finished the next steps. According to the word co-occurrence matrix, a common word network composed of those words can be formed, and the corresponding keywords selected in the co-occurrence analysis were treated as nodes, the co-occurrence between the two keywords as connection, and the co-occurrence frequency of keywords as connection weight. We used "igraph" (Csardi and Nepusz 2006) package in R software to visualize the mapping for the association of clinical medicine research, and applying "walktrap.community" function to classify the type of clinical medicine researches (Deitrick et al. 2013). The size of a node was formed by the value of degree of centrality, which is formed by the number of the connection edges that the node possesses (Borgatti 2005;Everett et al. 2005;Valente et al. 2008;Rodrigues 2019). The higher value of degree of centrality, the larger size of the node. The highest value of degree of centrality indicated the most important node with red color labeling in each cluster.
Law et al. put forward the strategic coordinates method in 1988 to describe the correlation between different fields and the correlation within one field (Law et al. 1988). In this study, we introduced a strategic diagram to present the current hotspots and trends of disease themes based on medical data science and to predict future developments. The strategic diagram is composed of centrality and density. Centrality is used to measure the strength of interrelation between a cluster and other clusters, and the high centrality indicates the hotspot of this research has extensive connections with other hotspots (Shen et al. 2011a, b, c, d). Density is used to measure the strength of interrelation between subject terms within the cluster, and the high density indicates that the research subjects are highly correlated and the research tends to be mature (Shen et al. 2011a, b, c, d). The four quadrants indicate the different combination of centrality and density: high centrality and density in quadrant I, high centrality and low density in quadrant II, low centrality and density in quadrant III, and low centrality and high density in quadrant IV.

Software
EndNote was utilized to filter literature and to make manual labels. R software and Microsoft Excel were used to clean and integrate data. All plots were visualized by R software.

The feature of clinical medicine researches
A total of 5,987 articles were obtained after initial search and finally, 2,551 were extracted after screening step by step (Fig. 1).
The number of publications in each year from 1970 to 2017 was counted, it suggests that the overall trend is increasing especially after 1990, with the most researches in 2013 (Fig. 2). Table 2 is depicting the age distribution of the subjects. It shows that the frequency of "adult" reached 1845, accounting for 45.0% of the total, and infants accounted for 6.5%, children for 9.0%, adolescents for 12.5% and aged group for 27.0%.

Mapping of literature publication location (spatial distribution)
The location of published articles in each country is visualized in Fig. 3 marking the top 6 countries which account for 69% (Table S1) of the total, the label color in the heat maps indicates the number of article publication, the deeper color of the label, the larger number of article publications. It is obvious that the United States of America (USA) is the country with the most publication and has been far more than that in other countries. Consequently, we did a further study on papers from the USA, not only counting the number of publications in each state (Table S2), but also exploring each affiliation within the top three states (Fig. 4). It shows that Massachusetts and California have the largest number of published articles compared with the others. And the affiliations with the most publication in Massachusetts, California, and New York were Harvard University, University of California Los Angeles, and Columbia University, respectively (Fig. 4). The largest cluster is DCS, with IPC as the second, suggesting these two are popular diseases that investigators would like to explore. Additionally, we observed that "Stroke" and DNS were included in the DCS cluster, which means that they are highly correlated with each other, and they would, normally, been investigated together. Furthermore, in Fig. 5, it is obvious to see that DRS, NPS, and PCP are three independent clusters, they do not have any edge between other clusters, meaning that these three disease themes are more independent. PSF also has a small cluster, but it has multiple edges with other clusters, such as DCS, MBD, and ENM, suggesting that PSF research is not mature, but it is often researched with other disease themes together.

Mapping of association between clinical medicine researches
We used the density and centrality value of each disease theme (Table 3) to visualize a strategic diagram. (Fig. 6). It shows that IPC possesses the largest density value, which indicates the nodes within the cluster are strongly related to one another and it is mature in the clinical medicine research subjects. While, it is obvious to see that the centrality value of IPC is much smaller than that of PSF, presenting that PSF has more ability to connect with other disease themes. Additionally, the disease themes of NPS and DRS are both located in Quadrant III, suggesting that these two disease themes were neither popular nor mature in clinical medicine research. It's worth noting that there is no disease theme in Quadrant I, indicating none disease theme falls into the development period of both mature in itself and as the center of the others (that is, extensively connected with other disease themes). So, there are still huge development spaces in medical data science in terms of disease domain research.

Discussion
With the development of medical data, more and more researchers are applying them in various fields, and most of the researches could be divided into two kinds: those of methods and those of applications. To inform clinical scholars and relevant methodologists, this study explored the hot spots and relationships among different disease themes to stimulate further applicational and methodological studies in clinical medicine. In our study, we clustered into eight disease subjects and it is found that DCS was the most frequently studied topic, followed by IPC, which informs us that these two topics attracted wide attentions and had great importance than the others in clinical datadriven researches, and further, the conclusions from these two disease themes might be reached through a variety of techniques, which is worth investigating in the perspective of methodology.
Results of the strategic diagram graph show that, in terms of the development relationships of the clusters, PSF has the highest centricity, suggesting its center role of the research network; and IPC has the highest density, suggesting its maturity itself among the research network. There are no cluster falls in Quadrant I area, that is, none disease theme is both central and developed, suggesting great improvement opportunities in the future. Recently, some studies focus on research in healthcare informatics based on big data. For example, Gu introduced the contribution of countries in healthcare informatics and the knowledge structure of healthcare big data research (Gu et al. 2017); Liao explored the situation of publication, country, institution, published journal of the medical big data research (Liao et al. 2018); Hsu investigated the research trends and relationship between knowledge clusters in medical big data (Hsu and Li 2019). All of them concluded the trend and correlation between knowledge clusters in healthcare big data research. Here, our research focuses on using the bibliometric analysis to find the hotspots and relationship of disease themes in data-driven clinical medicine researches, with a wide variety of the data volume. In addition, it is not found yet that there are other studies exploring relationships among disease themes labeled manually according to ICD-10, which might contribute a lot to inform scholars in this field.
It is found that the USA was the top publication country in the world, which is the same as other researches. For example, Fu pointed out that the USA is the top one country on the number of papers and citations in the 19 complementary and alternative medicine journals from the 1980 to 2000 s (Fu et al. 2011); Filser found that the USA is the most cited country in the literature on Lean Healthcare (Filser et al. 2017); Tran discovered that the USA is the most prolific country in Artificial Intelligence in Health/Medicine research (Tran et al. ); Churruca introduced that the USA is the most published country on complexity science applied to healthcare (Churruca et al. 2019). Furthermore, based on the addresses of the corresponding authors in the included studies, we investigated the publication in each state in the USA and the most prolific institutions in the top three states. our result is consistent with other articles (Gu et al. 2017;Liang 2010;Liao et al. 2018), pointing out that Harvard University is the most productive institution in medical research. These results might provide a clue for scholars when they need to choose cooperative members.
While, there are several limitations to this study. Firstly, as we are focusing on application level of the data-driven clinical researches, and, in consideration of practicable, our Mapping of association between disease themes. Each node represents a keyword, and the size of nodes indicates the word frequency. Each link represents the connection between two keywords, the thickness of the lines indicates the frequency of the co-word about highly-frequent MeSH/free-text words study included articles in the core journal of the PubMed database and did not expand the methodological aspect, resulting that the papers included are a subset of the whole; with the recent popularity of artificial intelligence, there are more and more scholars applying machine learning methods in clinical medicine research (Beam and Kohane 2018;Tran et al. 2019), and as the methodology system is getting more and more complete in the medical field, we are carrying out further research targeted specially on the methodological aspect, based on experiences of this study. Secondly, we counted the frequency of keywords (such as "Adult", "Infant", "Newborn", "Child", "Preschool", "Aged", "80 or over", etc.) to show the age distribution, by doing so, we are not able to reveal the whole situation, but trying to provide as much information as we could, because, as a matter of fact, not all the studies revealed the objects' age, even in the whole paper.
In conclusion, there are more and more data-driven studies in the field of medicine and health, while, huge development spaces in the full spectrum of the diseases do exist. Mining the published researches through bibliometrics and visualized methods could come up with valuable results to inform further study. Strategic diagram for disease themes. Centrality feedbacks the degree of connection to other disease themes, the higher value of centrality, the stronger connection ability; Density feedbacks the degree of correlation between each node in this disease theme, the higher value of density, the more mature disease theme Table 3 Density and centrality values in the strategic diagram NPS neoplasms, MBD mental and behavioral disorders, PCP pregnancy, childbirth, and the puerperium, ENM endocrine, nutritional and metabolic diseases, PSF physical fitness, DRS diseases of the respiratory system, IPC injury, poisoning and certain other consequences of external causes, DCS diseases of the circulatory system