Background

Over the past decade, research on the use of social media for healthcare has attracted considerable interest from the scientific community, as reflected in the year-on-year increase in research publications. The Internet is playing an increasingly important role as a source of information on public health issues [1]. Health-related information is actively searched for, shared, communicated, and discussed through social media. This kind of online information exchange benefits users through immediate access to information on health concerns [2], emotional and psychological support [3], and support for health-related decision making [4]. Furthermore, the development of digital social media provides relatively inexpensive and readily available means for collecting and storing large volumes of data [5].

In recent years especially, researchers have begun to explore how social media can be used in health and healthcare research [6], and a substantial body of work has accumulated. For example, based on a regression analysis of country-level HIV rates and the aggregate use of future-tense language, Ireland et al. [7] found that there were fewer HIV cases in countries with higher rates of future tense on Twitter. Similar work on sex-related topics includes HIV prevention among men who have sex with men [8] and the assessment of personal and environmental factors associated with premarital sex among adolescents [9]. Some researchers study specific diseases, e.g., breast cancer [10], testicular cancer [11], and prostate cancer [12], using social media content as analysis material, e.g., videos [13], Twitter messages [14], and publicly available user profiles [15]. Studies centering on drugs can also be found, e.g., online drug sales [16] and direct-to-consumer drug advertising [17]. The research field is thus growing fast and receiving increasing attention, and a systematic analysis of the existing publications is needed to understand the status of its recent development.

As an effective statistical method for evaluating scientific publications, bibliometric analysis has been widely applied in various fields [18, 19]. It has been especially applied in interdisciplinary research, e.g., artificial intelligence on electronic health records research [20], natural language processing empowered mobile computing research [21], natural language processing in medical research [22], text mining in medical research [23], technology enhanced language learning research [24], and event detection in social media research [25].

To that end, this study carries out a bibliometric analysis of research on utilizing social media for healthcare, based on research publications from Web of Science and PubMed during the years 2008–2017. The main aim is to develop a general approach for analyzing thematic change and evolution in the research field. For overall thematic detection, topic modelling analysis is conducted to identify major topics over the whole period. For thematic evolution, the approach combines performance analysis and science mapping to detect conceptual subdomains and to quantify and visualize the thematic evolution of the research field.

Methods

Data retrieval and preprocessing

In this study, bibliometric methodology is applied using data from Web of Science (WoS) and PubMed. WoS is the most authoritative citation database and has been widely applied for bibliometric analysis, while PubMed provides a wide coverage of medical-related publications.

The social media keywords are developed by domain experts after an extensive literature review. In the WoS Core Collection database, Topic Subject is used as the retrieval field. Publications indexed in the “Science Citation Index Expanded (SCI-EXPANDED)” and “Social Sciences Citation Index (SSCI)” are considered. Further, publications of the “Article” and “Proceedings paper” types indexed in research areas pertaining to healthcare are selected manually. In the PubMed database, Title and MeSH Terms are used as the two retrieval fields. Specific exclusion strategies are also applied to ensure high relevance of the retrieved publications. The specific search strategy is shown in Additional file 1. In total, 4361 unique publications are identified for analysis. Since no citation data are available in PubMed, we use Google Scholar citations as the measure of citation count for the 4361 publications.

The raw data are downloaded as plain text. Key elements, e.g., title, published year, abstract, and author address are automatically extracted. Author affiliations and countries are identified based on author addresses. Inconsistent expressions are standardized.

For the thematic analysis, in addition to author keywords, KeyWords Plus, and PubMed MeSH, we also include keywords from the title and abstract, extracted using a self-developed Python program with a natural language processing module based on syntactic tree analysis: 1) the singular and plural forms of all author keywords, KeyWords Plus, and PubMed MeSH terms are first stored in a database; 2) keywords in the title and abstract text that match the database are automatically extracted, separately for the title and the abstract; 3) notional words are then extracted from the remaining title and abstract text; 4) all keywords are merged and unified into singular form.

To improve the effectiveness of the thematic analysis, a duplication checking process is conducted following Cobo et al. [26]. Abbreviations are replaced by their full names using a mapping table, e.g., SMS is replaced by short message service, ADE by adverse drug event, and MSM by men who have sex with men. Keywords representing the same concept are grouped, e.g., diabete mellitus, type 2, type 2 diabete, type 2 diabete mellitus, etc. We also apply a weight of 0.4 to author keywords, KeyWords Plus, and PubMed MeSH, and weights of 0.4 and 0.2 to keywords from the title and abstract, respectively, based on our earlier experiment [22]. We then retain terms with TF-IDF ≥ 0.1 to exclude terms with low frequency as well as those occurring in too many publications.

Approach for thematic detection analysis

Proposed by Blei et al. [27], the Latent Dirichlet Allocation (LDA) model has been widely applied to topic detection in various domains. It is a Bayesian mixture model for discrete data that assumes topics are uncorrelated. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

A document is represented as a sequence of N words denoted by d = (w1, … , wN), where a word is an item from a vocabulary indexed by {1, … , V}. A corpus is a collection of M documents denoted by D = {d1, … , dM}. LDA assumes the following generative process. 1) The term distribution β of each topic is drawn as β ~ Dirichlet(δ), denoting the probability of a word occurring in a given topic; 2) the topic proportions θ of a document d are drawn as θ ~ Dirichlet(α); 3) for each word wi in the document d, a topic zi is chosen from zi ~ Multinomial(θ), and a word wi is then chosen from p(wi | zi, β). The log-likelihood for one document d ∈ D is given in Eq. (1), and Eq. (2) is the log-likelihood used for Gibbs sampling estimation with k topics.

$$ \ell \left(\alpha, \beta \right)=\log \left(p\left(d|\alpha, \beta \right)\right)=\log \int \left\{{\sum}_z\left[{\prod}_{i=1}^Np\left({w}_i|{z}_i,\beta \right)p\left({z}_i|\theta \right)\right]\right\}p\left(\theta |\alpha \right) d\theta $$
(1)
$$ \log \left(p\left(d|z\right)\right)=k\log \left(\frac{\varGamma \left( V\delta \right)}{\varGamma {\left(\delta \right)}^V}\right)+{\sum}_{K=1}^k\left\{\left[{\sum}_{j=1}^V\log \left(\varGamma \left({n}_K^{(j)}+\delta \right)\right)\right]-\log \left(\varGamma \left({n}_K^{(.)}+ V\delta \right)\right)\right\} $$
(2)
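
The generative process described above can be illustrated with a minimal sketch. The Python code below is an illustration only and is not the implementation used in this study: the function names, the toy sizes (3 topics, a vocabulary of 8 terms, 20 words per document), and the concentration values of 0.5 are assumptions chosen for readability, and Dirichlet draws are obtained by normalising independent Gamma variates.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, beta, rng):
    """Sample one document following steps 2) and 3) of the generative process."""
    num_topics, vocab_size = len(beta), len(beta[0])
    # Step 2: topic proportions theta ~ Dirichlet(alpha) for this document.
    theta = sample_dirichlet(alpha, num_topics, rng)
    # Step 3a: one topic assignment z_i ~ Multinomial(theta) per word position.
    topics = rng.choices(range(num_topics), weights=theta, k=n_words)
    # Step 3b: each word w_i is drawn from the term distribution of its topic.
    words = [rng.choices(range(vocab_size), weights=beta[z], k=1)[0] for z in topics]
    return theta, topics, words


rng = random.Random(0)  # fixed seed so the toy example is repeatable
NUM_TOPICS, VOCAB_SIZE = 3, 8  # hypothetical toy sizes
# Step 1: one term distribution beta ~ Dirichlet(delta) per topic.
beta = [sample_dirichlet(0.5, VOCAB_SIZE, rng) for _ in range(NUM_TOPICS)]
theta, topics, words = generate_document(20, alpha=0.5, beta=beta, rng=rng)
```

Fitting an LDA model inverts this process: only the words are observed, and θ, β, and the topic assignments are inferred.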

We use 10-fold cross-validation to evaluate model performance with 16 candidate topic numbers: 2–10, 15, 20, 30, 40, 50, 100, and 200. The perplexity criterion is used to select the optimal topic number [27]. The α for Gibbs sampling is set to the mean of the α values obtained across the 10 cross-validation folds when fitting the model using variational expectation-maximization (VEM) with the optimal topic number. With this α and the optimal topic number, we estimate the LDA model using both Gibbs sampling and the VEM method. The best matches between the topics estimated by the two methods are determined by the Hellinger distance in Eq. (3), in which P and Q are two probability measures.

$$ {H}^2\left(P,Q\right)=\frac{1}{2}\int {\left(\sqrt{dP}-\sqrt{dQ}\right)}^2 $$
(3)
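
For discrete term distributions, the integral in Eq. (3) reduces to a sum over terms. The sketch below shows this discrete form together with one possible way of pairing topics from two fitted models. The nearest-topic matching rule, the function names, and the toy distributions are assumptions made for illustration and are not taken from the study.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance H (the square root of Eq. (3)) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must be defined over the same terms")
    squared = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared)


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, return the index of the closest topic in topics_b."""
    return [
        min(range(len(topics_b)), key=lambda j: hellinger_distance(topic, topics_b[j]))
        for topic in topics_a
    ]


# Hypothetical term distributions over a three-term vocabulary.
gibbs_topics = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
vem_topics = [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]]
matches = best_matches(gibbs_topics, vem_topics)
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so smaller values indicate better matches.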

Further, we conduct a comparative analysis using the Affinity Propagation (AP) clustering method [28] based on keyword co-occurrence. In this analysis, only author keywords, KeyWords Plus, and PubMed MeSH are utilized. Keywords with a frequency of less than 40, or that do not reach a co-occurrence frequency of 40, are excluded, leaving 139 keywords that meet the thresholds. Based on the keyword co-occurrence matrix of these 139 keywords, a keyword correlation matrix is calculated using the Ochiai correlation coefficient in Eq. (4), where Oij represents the co-occurrence probability of two keywords, Ai and Aj represent the keyword frequencies, and Aij is the co-occurrence frequency of the two keywords. AP clustering is then conducted on the correlation matrix. The exemplars determined are used to represent and explain each cluster.

$$ {O}_{ij}={A}_{ij}/\sqrt{A_i{A}_j} $$
(4)
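
As a small illustration of Eq. (4), the sketch below computes Ochiai coefficients from per-publication keyword sets. The function name and the four toy publications are hypothetical, and the frequency thresholds and the subsequent AP clustering step are omitted.

```python
import math
from collections import Counter
from itertools import combinations


def ochiai_coefficients(keyword_sets):
    """Return {(keyword_i, keyword_j): O_ij} for every co-occurring keyword pair."""
    frequency = Counter()      # A_i: number of publications containing keyword i
    co_occurrence = Counter()  # A_ij: number of publications containing both i and j
    for keywords in keyword_sets:
        unique = sorted(set(keywords))
        frequency.update(unique)
        co_occurrence.update(combinations(unique, 2))
    return {
        (a, b): count / math.sqrt(frequency[a] * frequency[b])
        for (a, b), count in co_occurrence.items()
    }


# Hypothetical keyword lists for four publications.
publications = [
    ["facebook", "adolescent"],
    ["facebook", "adolescent", "internet"],
    ["facebook", "internet"],
    ["internet"],
]
coefficients = ochiai_coefficients(publications)
```

Here facebook and internet each occur three times and co-occur twice, giving a coefficient of 2/3, while adolescent (two occurrences) and facebook co-occur twice, giving 2/√6 ≈ 0.82.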

Approach for thematic evolution analysis

Science mapping or bibliometric mapping is a spatial representation of the relationship between disciplines, fields, and documents or authors [29]. It has been widely used in different research fields [30,31,32] to reveal hidden key elements such as topics.

Science mapping analysis is carried out with SciMAT [33], a science mapping software tool that integrates the majority of the advantages of available tools [34]. In this paper, we adopt the bibliometric approach defined by Cobo et al. [35], which is based on co-word analysis [36] and the H-index [37]. This approach establishes four stages to detect and visualize conceptual subdomains and the thematic evolution of a research field in a longitudinal framework:

1) Research themes detection

The research themes for each period are detected using co-word analysis [36]. Keywords are clustered into themes using the simple centers algorithm [38], a simple and well-known algorithm in the context of co-word analysis. The algorithm locates subgroups of strongly linked keywords that correspond to research interests or problems of great significance to the academic community. The similarity between keywords is measured by the equivalence index [39] defined in Eq. (5), where cij is the number of publications in which the two keywords i and j co-occur, and ci and cj are the numbers of publications in which each one appears.

$$ {e}_{ij}=\frac{c_{ij}^2}{c_i{c}_j} $$
(5)
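
A short sketch of Eq. (5) is given below. The function name, the input checks, and the counts are hypothetical and serve only to show how the index behaves at two extremes.

```python
def equivalence_index(c_ij, c_i, c_j):
    """Equivalence index e_ij = c_ij^2 / (c_i * c_j) for two keywords."""
    if c_i <= 0 or c_j <= 0:
        raise ValueError("each keyword must appear in at least one publication")
    if not 0 <= c_ij <= min(c_i, c_j):
        raise ValueError("co-occurrence count cannot exceed either keyword count")
    return c_ij ** 2 / (c_i * c_j)


# Two rare keywords that always appear together reach the maximum value of 1.
rare_pair = equivalence_index(c_ij=2, c_i=2, c_j=2)
# Two frequent keywords that co-occur in 90 of their 100 publications score lower.
frequent_pair = equivalence_index(c_ij=90, c_i=100, c_j=100)
```

The index ranges from 0 (the keywords never co-occur) to 1 (they always co-occur), and it equals the square of the Ochiai coefficient in Eq. (4) computed on the same counts.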
2) Research themes visualization

The detected networks can be characterized by two measures [39], i.e., Callon’s centrality and Callon’s density. Callon’s centrality measures the degree of interaction of a network with other networks and is defined in Eq. (6), with k a keyword belonging to the theme and h a keyword belonging to other themes. The internal strength of the network is measured by Callon’s density, defined in Eq. (7), with keywords i and j belonging to the theme and w the number of keywords in the theme.

$$ c=10\times \sum {e}_{kh} $$
(6)
$$ d=100\left(\sum {e}_{ij}/w\right) $$
(7)
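
Eqs. (6) and (7) can be sketched as follows. The representation of the network as a dictionary of undirected links, each stored once with its equivalence index, is an assumption made for this illustration, as are the function name, keywords, and link weights.

```python
def callon_measures(theme_keywords, links):
    """Return (centrality, density) of a theme following Eqs. (6) and (7).

    links maps an unordered keyword pair, stored once, to its equivalence index.
    """
    theme = set(theme_keywords)
    # Links with both ends inside the theme contribute to density.
    internal = sum(e for (a, b), e in links.items() if a in theme and b in theme)
    # Links with exactly one end inside the theme contribute to centrality.
    external = sum(e for (a, b), e in links.items() if (a in theme) != (b in theme))
    centrality = 10 * external
    density = 100 * internal / len(theme)
    return centrality, density


# Hypothetical network; the weights are made-up equivalence indices.
links = {
    ("facebook", "profile"): 0.5,
    ("facebook", "privacy"): 0.25,
    ("privacy", "profile"): 0.25,
    ("facebook", "twitter"): 0.125,
    ("policy", "privacy"): 0.125,
    ("tweet", "twitter"): 0.5,
}
centrality, density = callon_measures(["facebook", "profile", "privacy"], links)
```

In this toy network the theme has internal links summing to 1.0 over three keywords and external links summing to 0.25, so its density is about 33.3 and its centrality is 2.5. The link between tweet and twitter lies entirely outside the theme and affects neither measure.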

Based on the two measures, research themes can be mapped in a two-dimensional strategic diagram with four quadrants, with centrality on the horizontal axis and density on the vertical axis. Themes in the upper-right quadrant, known as motor-themes, are both well developed and important for structuring a research field. Themes in the upper-left quadrant have well-developed internal ties but unimportant external ties, and so are of only marginal importance for the field. Themes in the lower-left quadrant are both weakly developed and marginal; they mainly represent either emerging or disappearing themes. Themes in the lower-right quadrant are transversal and basic themes: they are important for the field but not well developed.

3) Thematic evolution discovery and performance analysis

A thematic area is a set of themes that have evolved across different subperiods. Suppose Tt is the set of detected themes of the subperiod t, and U ∈ Tt denotes each detected theme. Let V ∈ Tt + 1 be each detected theme in the next subperiod t + 1. A thematic evolution from theme U to theme V is considered to exist if there are keywords present in both associated thematic networks. Keywords k ∈ U ∩ V are considered to be a “thematic nexus”. The inclusion index [40] shown in Eq. (8) is used to weight the importance of a thematic nexus. It is worth noting that a theme may belong to a different thematic area, or may not come from any.

In a bibliometric map of thematic evolution over two periods, a solid line indicates that the linked themes share the same name, while a dotted line indicates that the themes share elements other than the theme names. The thickness of a line is proportional to the inclusion index, and the volume of a sphere is proportional to the publication count associated with the theme. Themes belonging to different thematic areas are shown in different colors. A theme in the first period that has no link to any theme in the second period is considered discontinued, while a theme in the second period that has no link to any theme in the first period is considered new.

$$ \mathrm{Inclusion}\kern0.20em \mathrm{index}=\frac{\#\left(U\cap V\right)}{\min \left(\#U,\#V\right)} $$
(8)
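
Eq. (8) can be illustrated with a short sketch. The function name and the keyword sets of the two themes below are hypothetical and are not taken from the detected thematic networks.

```python
def inclusion_index(theme_u, theme_v):
    """Return (thematic nexus, inclusion index) for two themes given as keyword collections."""
    u, v = set(theme_u), set(theme_v)
    if not u or not v:
        raise ValueError("both themes must contain at least one keyword")
    nexus = u & v  # keywords present in both thematic networks
    return nexus, len(nexus) / min(len(u), len(v))


# Hypothetical keyword sets for a theme in subperiod t and a theme in subperiod t + 1.
theme_u = {"message", "sms", "mobile-phone", "twitter"}
theme_v = {"tweet", "twitter", "message", "public-health", "surveillance"}
nexus, index = inclusion_index(theme_u, theme_v)
```

The two themes share two keywords and the smaller theme has four, so the inclusion index is 0.5. An index of 1 means that the smaller theme is entirely contained in the larger one, and an empty nexus means that there is no evolution link between the two themes.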

The analysis of the science mapping work-flow can be further enriched by a performance analysis with two kinds of bibliometric indicators, i.e., quantitative and qualitative ones. The quantitative indicators, e.g., publication count, author count, publication source count, and country count, measure the productivity of the detected themes and thematic areas. The qualitative indicators, e.g., citation count and H-index, measure the quality based on the bibliometric impact of those themes and thematic areas.
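
As an illustration of the qualitative indicator, the H-index of a theme is the largest h such that h of its publications have each received at least h citations. The sketch below uses a hypothetical function name and made-up citation counts.

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break  # counts are sorted, so no later publication can qualify
    return h


# Hypothetical citation counts for the publications of one theme.
theme_h = h_index([10, 8, 5, 4, 3])
```

For these counts four publications have at least four citations each, but there are not five with at least five, so the H-index is 4.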

Results

Performance bibliometric analysis

The publication counts and citation counts from 2008 to 2017 are presented in Fig. 1. Research on utilizing social media for healthcare is becoming more and more influential in the scientific community, as evidenced by the substantial growth in publications from the two databases, i.e., from 18 publications in 2008 to 1030 publications in 2017. A similar increasing trend can also be observed in the publication count in WoS. These results may be explained by increasing global concern about, and interest in, exploring the use of social media data for healthcare research. It is worth mentioning that there is a remarkable upsurge in research in 2010, with growth rates of up to 309% in WoS and 170% in PubMed. The citation count curve shows an increasing trend between 2008 and 2013, with publications from 2013 receiving the most citations. A decreasing trend is seen between 2014 and 2017, which may result from the fact that newer publications usually have fewer citations because they have had less time to accumulate them. On the whole, research on utilizing social media for healthcare has received growing attention over the last decade.

Fig. 1 Publication count and citation count

Research in the field has been published in a wide range of nearly one thousand publication sources. Some of these publication sources are highly relevant to the field, while others are partially related. Table 1 lists the top 20 publication sources ranked by publication count. According to both publication percentage and H-index, Journal of Medical Internet Research, PLoS One, and Cyberpsychology, Behavior and Social Networking are the most influential journals in the field.

Table 1 Prolific publication sources

The 4361 publications involve 3311 affiliations and 14,154 authors from 88 countries/regions. Of these, 18.18% of the countries/regions, 65.06% of the affiliations, and 84.41% of the authors contribute only one publication. Table 2 lists the top 20 most prolific countries/regions, affiliations, and authors.

Table 2 Prolific countries/regions, affiliations and authors

From the country/region perspective, the USA dominates the field with 2394 publications, accounting for 54.90% of the total. The USA also has the highest H-index (125), indicating the high quality of its publications. Other prolific countries/regions with more than 100 publications include England, Australia, Canada, China, Germany, and Spain.

Of the top 20 prolific affiliations, 15 are from the USA, with Harvard University (97 publications, H-index 30) and University of Washington (86 publications, H-index 30) ranking first and second. University of Toronto and University of British Columbia from Canada, as well as three affiliations from Australia (University of Melbourne, University of Sydney, and Monash University), also appear in the list.

The leading position of the USA in the research field is also reflected in the analysis of prolific authors. Most of the top 20 authors are from the USA, except Mowafa Househ from Saudi Arabia, King-Wa Fu from Hong Kong, and Luis Fernandez-Luque from Norway. Megan A. Moreno has the most publications as well as the highest H-index, indicating the high productivity and influence of this author's research.

Thematic detection analysis

With the optimal topic number of 20 and α initialized as 0.028204, the LDA model is estimated using Gibbs sampling for overall thematic detection. The 20 topics with their top 15 representative terms are shown in Table 3, along with their possible themes, e.g., YouTube analysis, Sex event, Web-based medical education, Students’ use of Facebook, and Twitter use.

Table 3 Top 15 most frequent terms for the 20 detected topics

The most frequent keywords used for the AP clustering analysis include social media (3484), human (2109), internet (1323), female (886), male (817), adolescent (694), adult (624), young adult (522), Facebook (473), and social networking (463). Figure 2 shows that the 139 keywords are classified into 28 clusters with exemplars, e.g., self concept, male, middle aged, internet, cancer, Youtube, and weight loss.

Fig. 2 AP clustering result for the publications during the years 2008–2017 (terms in bold italic type denote the exemplar of each cluster)

Thematic evolution analysis

For each time period, two kinds of strategic diagrams are generated to analyze the most prominent themes. The sphere size in the first diagram is proportional to the publication count associated with each theme, while in the second one, the sphere size is proportional to the citation count received by each theme. We split the 10 years into five periods, i.e., [2008–2009], [2010–2011], [2012–2013], [2014–2015], and [2016–2017]. The identified themes with their publication counts are reported in Table 4 and visualized using the strategic diagrams in Figs. 3, 4, 5, 6 and 7.

Table 4 Performance measures for the themes of each subperiod

Fig. 3 Strategic diagrams for the period 2008–2009

Fig. 4 Strategic diagrams for the period 2010–2011

Fig. 5 Strategic diagrams for the period 2012–2013

Fig. 6 Strategic diagrams for the period 2014–2015

Fig. 7 Strategic diagrams for the period 2016–2017

In the period 2008–2009, there are a total of 39 publications. According to the strategic diagrams (Fig. 3) and quantitative measures (Table 4), we can observe that the motor themes PROFILE and SOCIAL-NETWORKING have high citations and H-index scores. Theme MANAGEMENT has the highest H-index score, indicating that it has a higher impact.

In the period 2010–2011, there are a total of 240 publications. The motor-theme FACEBOOK is the most cited and has the highest impact. The other motor-themes, TECHNOLOGY and ADOLESCENT, also receive high citations and have high H-index scores. The themes MESSAGE and DATA-COLLECTION receive rather low citations and H-index scores.

In the period 2012–2013, a total of 729 publications are published. According to the performance measures, four themes can be highlighted: FACEBOOK, PATIENT, MESSAGE, and WEB-2. These research themes have considerable impact, achieving higher citations and H-index scores compared with the remaining themes. The motor-theme FACEBOOK receives the most citations and also has the highest H-index score. The basic and transversal theme SURVEY-AND-QUESTIONNAIRE receives rather low citations and a low H-index score.

In the period 2014–2015 with a total of 1385 publications, according to the strategic diagrams (Fig. 6) and quantitative measures (Table 4), motor-themes present the highest citations and impact scores. The following seven themes with high citations and H-index scores are highlighted: FACEBOOK, PATIENT, TWEET, TECHNOLOGY, PUBLIC-HEALTH, WEB, and SCHOOL.

A total of 1968 publications are published in the period 2016–2017. The strategic diagrams (Fig. 7) and quantitative measures (Table 4) also show that the motor-themes, i.e., FACEBOOK, PATIENT, TWITTER, PROGRAM, YOUNG-ADULT, and MEDIA, have the highest citations and impact scores. The theme NETWORK also receives high citations and has a high H-index score. The basic and transversal theme PERCEPTION receives rather low citations and a low H-index score.

The evolution of the themes detected in each period is analyzed by considering their keywords and their development across time, as shown in Fig. 8. Eight main thematic areas are identified, including FACEBOOK, PATIENT, TWEET, WEB, and SOCIAL-NETWORK. According to Fig. 8, research in this field shows strong cohesion, since the majority of the detected themes are grouped under a thematic area and derive from a theme existing in the previous period. Some thematic areas, such as FACEBOOK and PATIENT, are present over the five periods studied. Others, such as SOCIAL-NETWORK, appear in the later periods.

Fig. 8 Thematic evolution of the research field (2008–2017)

Discussion

Based on the 4361 research publications from Web of Science and PubMed during the years 2008–2017, a bibliometric analysis of research on utilizing social media for healthcare is conducted, with the aim of exploring thematic detection and evolution in the research field.

The first finding worth noting is that the research field has attracted more and more attention from the scientific community over the last ten years. The most prolific publication sources are Journal of Medical Internet Research, PLoS One, and Cyberpsychology, Behavior and Social Networking. The USA dominates the research with a comparatively higher publication count. Its dominant role can also be observed from the top prolific authors and affiliations, most of which are from the USA.

In the overall thematic detection, 20 topics are detected by topic modelling analysis, e.g., YouTube analysis, Sex event, Web-based medical education, Students’ use of Facebook, and Twitter use. Most topics identified are recognizable because they are generally major issues in the research field. We here provide interpretations for some representative topics. Topic 14 contains words such as YouTube, YouTube video, video recording, viewer, and viewed. Thus it pertains to YouTube analysis. As a video-sharing platform, YouTube is nowadays widely utilized to search, share and disseminate health-related information. Topic 18 discusses Sex event. It includes terms such as men who have sex with men, HIV, adolescent, sexual, youth, sex, prevention, and intervention. Most relevant studies are about sexually transmitted infections with HIV as the major research focus, e.g., HIV prevention, treatment, and testing, in which men who have sex with men are often the main focus. Topic 10 mainly focuses on Web-based medical education with terms such as student, learning, medical education, teaching, course, nursing student, web-2, and technology. Participatory web-based platforms, including social media, have been increasingly recognized as valuable learning tools in medical and health education.

Comparing the results of topic modelling and AP clustering, we find that for most of the identified groups, the representative terms within each group are more similar and easier to interpret in AP clustering. A possible reason is the choice of analysis units. In AP clustering, only author keywords, KeyWords Plus, and PubMed MeSH are used, on the consideration that too many analysis units may lead to poor performance when the selected frequent keywords are not of high quality. In topic modelling, by contrast, keywords from the title and abstract are used in addition to author keywords, KeyWords Plus, and PubMed MeSH, on the consideration that more analysis units may improve topic modelling performance. However, phrase extraction is a difficult task owing to the complexity of natural language text, so the extraction program developed here may extract keywords of low quality. Future work should therefore pay more attention to improving keyword extraction performance.

From the thematic evolution analysis, eight main thematic areas can be detected, e.g., FACEBOOK, PATIENT, TWEET, WEB, and SOCIAL-NETWORK. In general, the motor-themes have the highest citations and impact scores in each period. FACEBOOK, for instance, is a motor-theme in each of the last four periods, while PATIENT and TWEET are motor-themes in each of the last three periods, demonstrating their significant roles in the research field.

Specifically, the evolution of a given thematic area can be represented by a series of thematic networks, one for each period. Taking the thematic area TWEET in Fig. 9 as an example, the number of its themes first decreases and then increases. The thematic area originates in the themes MANAGEMENT and VIRTUAL-COMMUNITY in the period 2008–2009; these two themes evolve into MESSAGE in 2010–2011, which remains unchanged in the following period. In the period 2014–2015, it evolves into TWEET and PUBLIC-HEALTH, and finally into TWEET and MEDIA in the last period. Some thematic areas, such as FACEBOOK, evolve in a constant way, as shown in Fig. 10.

Fig. 9 The TWEET thematic area (2008–2017)

Fig. 10 The FACEBOOK thematic area (2011–2017)

Topic modelling analysis depicts the major research themes from a holistic perspective and does not take their evolution across different periods into consideration. The science mapping analysis fills this gap by making it possible to detect themes period by period and to trace how the detected themes evolve in a longitudinal framework. Comparing Tables 3 and 4 shows that topic modelling analysis detects more themes. For example, some significant themes such as Sex event, Alcohol & drug, Vaccine, and Exercise, food, and weight are not captured by the science mapping analysis. This may be because all the keywords selected by TF-IDF are used as analysis units in the topic modelling analysis, whereas not all of them are included in the science mapping analysis.

In the science mapping analysis, data reduction and network reduction are used to obtain networks and dendrograms of manageable size. On the one hand, data reduction uses a minimum frequency threshold to filter out infrequent keywords so that the networks are not too complex to interpret. On the other hand, as noted in [38], two keywords that appear infrequently in the corpus but always appear together usually have larger strength values than keywords that appear many times in the corpus and almost always together, so that possibly irrelevant or weak associations may dominate the network. SciMAT therefore allows the network to be filtered using a minimum edge-value threshold. The simple centers algorithm also has two parameters that limit the size of the detected themes: the minimum and maximum size of the networks. Although data reduction and network reduction are intended to present the most significant keywords and their relationships more clearly, some keywords with comparatively low frequency that are not taken into account may also be important. Thus, in future work, we will explore ways to study periodical thematic evolution that take every single word into consideration.

Conclusions

Aiming to understand the thematic change and evolution of research on utilizing social media for healthcare during the last decade, this paper presents a quantitative analysis of publications from Web of Science and PubMed. Topic modelling analysis is used to identify major areas from an overall perspective. A science mapping approach combined with performance analysis is applied to quantify and visualize the thematic evolution. This systematic mapping of research themes and research areas helps identify research interests and how they evolve across time, and provides insight into future research directions.