Basic statistics of the bibliographic record
Our query targeting the thematic intersection of obesity and policy retrieved n = 4340 documents. Some basic statistics for this record are reported in Table 1, which gives the average annual number of publications and journals for consecutive, equal-length periods, along with the standard deviation of these frequency distributions. Strikingly, both the mean number of publications and the mean number of publication venues (journals) go through an exponential-like growth, with a sudden increase in the last decade, as these periods show values an order of magnitude higher than the previous ones. This dynamic is also mirrored when the annual percentage share of the total is plotted against publication years (Fig. 1), where the rather steep slope from the 2000s on indicates this high growth rate in both publication output and journal composition.
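As an illustration of how period-level statistics of this kind can be derived, the following minimal sketch computes the mean and standard deviation of annual publication counts for consecutive, equal-length periods. The function name `period_stats`, the period width and the use of the population standard deviation are assumptions made for this example; they are not specified in the text.

```python
from collections import Counter
from statistics import mean, pstdev


def period_stats(pub_years, start, end, width):
    """Mean and standard deviation of annual publication counts per period.

    pub_years: one publication year per document.
    start, end: first and last year covered (inclusive).
    width: period length in years (hypothetical; the text only says
           "equal-length periods").
    Returns a list of (first_year, last_year, mean, std) tuples.
    """
    per_year = Counter(pub_years)
    rows = []
    for lo in range(start, end + 1, width):
        hi = min(lo + width - 1, end)
        # Years without any publication count as zero.
        counts = [per_year.get(y, 0) for y in range(lo, hi + 1)]
        # Assumption: population standard deviation (pstdev), not the sample one.
        rows.append((lo, hi, mean(counts), pstdev(counts)))
    return rows
```

The same routine can be applied to the annual number of distinct journals by passing one year per (journal, year) pair instead of one year per document.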
Table 1 Summary statistics of the publication output by time period
Inspecting the “high end” of the journal distribution, i.e., the most frequent journals occurring in the (inverted) first quartile per year, we find 9 titles on the list (Table 2), with a main focus on public health and related areas (nutrition, environment, health promotion), preventive medicine and pediatrics. From a time perspective (Fig. 2), it is striking that in the decade between the early 2000s and 2010, Public Health Nutrition (PUHN) and the American Journal of Preventive Medicine (AJOPM) were the leading venues, along with an increasing role of Pediatrics (PEDI), while from 2010 the BMJ Journal of Public Health (BMPH) became the most prominent publication venue, recently complemented by Plos One (PLOO) and the International Journal of Environmental Research and Public Health (IJOERAPH). As to the volume of the output, despite the high and growing number of journals, research seems to be concentrated in a few leading venues characteristic of consecutive periods. Some indication of growing and broadening scientific interest is also visible, in that more specialized journals (in nutrition or pediatrics) are gradually being complemented by generalist ones (such as Plos One).
Table 2 The most frequent journal titles with the total number of published papers
The statistics on field composition (Table 3), narrowed to the most frequent WoS Subject Categories in the usual first quartile, also convey a high concentration in a dominant category (Public, Environmental and Occupational Health). A first impression of the theoretical frameworks can be gained through the dynamic view of this composition (Fig. 3). Aside from this dominant category (omitted from the figure for better readability), some fields can be identified as leading and almost “monotonically” increasing their role in the last decade: Health Care Sciences and Services (HCS&S), Business and Economics (Bs&E) and Nutrition and Dietetics (Nt&D) are the most prominent ones, while Pediatrics (Pdtr) and General and Internal Medicine (G&IM), although significant on the whole, have fluctuated in their rank in the annual profiles. A similar fluctuation is present for Agriculture (Agrc), Food Science (FS&T) and Psychology (Psychl), with low annual shares, while Educational Research (E&ER) has been trending in recent years.
Table 3 The frequency of occurrence of Web of Science Subject Categories (type: n ≈ 240) in the corpus
The multidimensional characterization of thematic development
The results from the two clustering methods are presented in Tables 4 and 5, for the BC- and the Kw-based classification of our documents, respectively. The tables below report cluster IDs and labels, while the “Appendix” complements them with (1) the most characteristic title words from the core documents of the cluster, i.e. “core terms”, and (2) the most frequent title words from the whole cluster (Tables 9 and 10 for the BC- and Kw-based clustering, respectively). In the case of core terms, the frequency of individual words and phrases is also shown, since it is not only the number of occurrences that makes a core term characteristic. For small clusters, all papers have been taken into account for interpretation (collected in the “Appendix”, in the “Data and methods” section), so that no core documents and terms are reported. For brevity and better visibility, cluster IDs are also used in the alluvials to name clusters.
Table 4 Cluster IDs and labels of clusters resulting from the bibliographic coupling (BC; keyword profiles for the clusters are reported in the “Appendix”)
Table 5 Cluster IDs and labels of clusters resulting from the semantic similarity network (Kw; keyword profiles for the clusters are reported in the “Appendix”)
The size distribution of clusters is given by the margins of Table 6 (a confusion matrix of the two classifications, see below), where cluster #0 contains the “unclassified” part of the corpus due to the lack of either references or author keywords in the database. It can be seen that both techniques resulted in a reasonably balanced structure, in that most clusters consist of several hundred documents, while some small groups emerged from both procedures. Bibliographic coupling provided a much higher “coverage” of publications, i.e., far fewer unclassified cases, which in itself demonstrates the complementary roles of these techniques. Labelling the clusters demanded the joint interpretation of textual profiles (frequent title words) and core terms, given the significant overlap in title word sets, which, to some extent, foreshadowed the high thematic interrelatedness among clusters (see below). Also, title words were in several cases much less characteristic of the main theme than core terms and documents (cf. BC cluster #5). However, the most frequent title word(s) in most cases seemed congruent with the common topic characterized by core document titles (for the full list of core documents and small cluster items, cf. Tables 11, 12, 13 and 14 in the “Appendix”).
Table 6 Confusion matrix of the two clusterings (Kw vs. BC-clusters)
Our primary question (RQ1) is addressed by evaluating the joint distribution of publications along both clusterings plus the time dimension, using alluvials. The alluvial diagrams created for this purpose are shown in Fig. 4. For better readability, three versions of alluvials have been made. (A) The first one is the simplest, “two-way” diagram, which only shows the relationships between the two clusterings. Given only two dimensions, the joint distribution of publications can also be represented by a so-called confusion matrix, a cross-tabulation of the two clustering variables, where cell values contain the number of papers in the intersection of the corresponding clusters; for the first alluvial, this matrix is reported in Table 6, where the color code (shades of blue) highlights the strength of the relationship. (B) The second one is already of the “three-way” form proposed in this paper, incorporating the time dimension, that is, it shows the linkages between publication years and the two clusterings. (C) The third type of alluvial differs from the second only in the granularity of the time variable, as publication years are aggregated into approx. 5-year periods. The reason for this is to obtain a more robust picture of how topics are distributed along the timeline. Robustness was also the reason behind setting a constraint on which stripes (publication sets with a given publication year and cluster memberships) were left visible on the diagrams: a minimum of 10 papers per year was specified as a threshold. Beyond connections, block sizes also convey useful information, as these are proportional to the frequency of category values (cluster sizes or weight of publication years). As to terminology, a topic will refer to a connected set of BC- and Kw-clusters, i.e. a “multicluster”, within which individual stripes are considered subtopics.
In order to improve the clarity of the visualizations, we omitted cluster #0 from both clusterings, that is, the cluster containing the unclassified cases in the respective grouping.
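The data preparation behind the diagrams can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names and the representation of the corpus as three parallel lists (publication year, BC cluster ID, Kw cluster ID per paper) are hypothetical, and the actual tooling used to draw the alluvials is not reproduced here.

```python
from collections import Counter


def confusion_matrix(bc_labels, kw_labels):
    """Cross-tabulate two clusterings: (bc, kw) -> number of shared papers."""
    return Counter(zip(bc_labels, kw_labels))


def visible_stripes(years, bc_labels, kw_labels, min_papers=10):
    """Count papers per (year, bc, kw) stripe and keep the robust ones.

    Cluster #0 (unclassified papers) is dropped from both clusterings,
    and only stripes with at least `min_papers` papers are retained.
    """
    stripes = Counter(
        (y, b, k)
        for y, b, k in zip(years, bc_labels, kw_labels)
        if b != 0 and k != 0
    )
    return {s: n for s, n in stripes.items() if n >= min_papers}
```

For the aggregated version (C), the same function can be called with period labels in place of single publication years.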
Quantitatively speaking, both the two- and three-way diagrams exhibit a relatively high concordance between BC and Kw clusters: the largest proportion of BC-clusters or research traditions (i.e. the widest stripes, or the majority of stripes) is classified under a single corresponding Kw-cluster or semantic class. The concordance of the two classifications is also reflected by the coloring of the confusion matrix (Table 6), conveying the overlap between BC- and Kw-clusters in document coverage. The majority of documents in most major (sizable) clusters is concentrated in a single cell, indicating a basically (though not technically) one-to-one relationship between the respective clusters. In order to make this pattern more explicit, we constructed a two-mode similarity network of the two clusterings (Fig. 5), with two sets of nodes representing BC- and Kw-clusters, respectively, and edges showing an overlap between two such nodes. The width of an edge is proportional to the extent of the overlap (from the BC-perspective), and red links indicate that at least 40% of the papers in the BC-cluster are covered by the corresponding Kw-cluster. The visualization also reinforces the match between the clusterings, since red lines tend to exhibit a one-to-one assignment (BC_3 and Kw_3; BC_6 and Kw_1 etc.), as BC-clusters tend to connect strongly to one Kw-cluster. Kw-clusters, or “subtopics”, on the other hand, seem to be more common across BC-clusters, but, as to strong connections, still have an affinity to only one or two BC-groupings.
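The edge list of such a two-mode network can be sketched as below. This is an illustrative reading of the description above, not the original implementation: the function name `overlap_edges` is hypothetical, and we assume that the overlap "from the BC-perspective" is the share of a BC-cluster's papers falling into a given Kw-cluster, with the full BC-cluster size (including papers unclassified by keywords) as the denominator.

```python
from collections import Counter


def overlap_edges(bc_labels, kw_labels, strong=0.40):
    """Build edges between BC- and Kw-clusters from paper-level labels.

    Returns (bc_node, kw_node, share, is_strong) tuples, where `share` is
    the fraction of the BC-cluster's papers found in the Kw-cluster and
    `is_strong` marks the links that would be drawn in red (share >= 40%).
    """
    pairs = Counter(zip(bc_labels, kw_labels))
    # Assumption: the denominator is the whole BC-cluster, Kw #0 included.
    bc_sizes = Counter(bc_labels)
    edges = []
    for (b, k), n in sorted(pairs.items()):
        if b == 0 or k == 0:
            continue  # unclassified papers do not form nodes
        share = n / bc_sizes[b]
        edges.append((f"BC_{b}", f"Kw_{k}", share, share >= strong))
    return edges
```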
In order to provide overall quantitative evidence for the concordance, we calculated the Rand index between the two groupings. The Rand index (abbreviated here as R) is a statistical measure of the similarity between two clusterings of the same dataset (the technical definition is given in the “Appendix”), and its value lies in the interval [0, 1]. The closer its value is to the possible maximum, R = 1, the better the agreement between the two groupings. For our classifications, we obtained a value of R = 0.7 as the degree of similarity between the BC- and Kw-clustering, which, being close enough to unity, is generally considered to be a sign of good alignment.
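For readers who wish to reproduce such a comparison, a minimal sketch of the (unadjusted) Rand index is given here, using its standard pair-counting definition: the share of document pairs on which the two clusterings agree, i.e. pairs placed together in both or apart in both. The function name is ours, and whether unclassified papers (cluster #0) are included is left to the caller, since this choice is not detailed in the text.

```python
from collections import Counter
from math import comb


def rand_index(labels_a, labels_b):
    """Rand index of two clusterings given as parallel label lists."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    total = comb(n, 2)  # number of document pairs
    if total == 0:
        raise ValueError("at least two documents are needed")
    # Pairs grouped together in both clusterings (cells of the confusion matrix).
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together in each clustering separately (matrix margins).
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Pairs separated in both clusterings, by inclusion-exclusion.
    apart_both = total - same_a - same_b + same_both
    return (same_both + apart_both) / total
```

Working from the cell and margin counts avoids enumerating all pairs explicitly, which keeps the computation cheap for a corpus of several thousand documents.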
Beyond the correspondence between the classifications, the next important observation (via the three-way diagrams) is that almost all topics are concentrated in the last 10-year period, and that almost all topics show an increase (in size) in the last 5 years, so that an upward trend emerges for the whole topical structure. This is very much in line with the rising curves obtained for publication numbers for this period, showing that this increase is multidirectional, covering a diverse and balanced thematic composition of this output.
Qualitatively, or content-wise, in order to elicit the thematic development of our domain, we follow the strategy of interpreting each multicluster topic relying on the interrelations of BC and Kw clusters (i.e. interpreting the two-way alluvial diagram using the cluster labels), and then linking them to the time dimension (i.e. via the three-way alluvial diagrams). The most salient BC cluster, #3 on risk factors of childhood obesity, is mostly linked to the Kw cluster #3 on risk factors of overweight, children, and, to a smaller extent, to Kw #2 on public health policies and obesity prevention. The topic has been constantly present since 2007, but increases its weight in the last five years. The second biggest BC cluster, #6, focusing on physical activity and health promotion, has a similarly strong affiliation with a specific Kw cluster, #1 on physical activity, built environment and active living, but this topic is somewhat more stratified, as BC#6 is, to a lesser extent, also connected to Kw#2 (prevention) and Kw#3 (overweight). This topic is also abundant in the last decade, but reaches a higher volume in the recent period (with all subtopics). This high level of concordance can also be observed between BC#4 on economics of food policies (food policy, farm policy, taxation, public economic policy on consumption) and Kw#4 on food environment in school nutrition, where smaller proportions of the BC cluster are again linked to two further Kw blocks (Kw#2 on prevention and Kw#3 on overweight). Its subtopics mostly pertain to the last 5-year period, while the first appearance of the major subtopic (food environment) is in the previous interval.
Screening the diagrams further, BC#7 on food marketing and communication policies and consumer behavior tends to be divided between several Kw clusters, resulting in a semantically multifaceted topic; these include Kw#2 (prevention) and Kw#4 (food environment) as the more prevalent subtopics, as well as Kw#3 (overweight) and Kw#5 (Childhood obesity prevention, active living and community-based instruments), in smaller proportions. This rather complex topic is clearly a product of the last period, within which it is equally distributed between the consecutive years. Cluster BC#8 on School nutritional policies and instruments, competitive and healthy nutrition closely resembles BC#7 in its semantic composition, with an even more equal distribution among the same Kw clusters (Kw#2, Kw#4, Kw#5, Kw#3), and with its subtopics also concentrated in the most recent period, as a relatively new topic. The last two prominent BC clusters (similar in terms of size) each constitute a topic with one or two Kw clusters: BC#5, food environment, is practically aligned with Kw#4 (food environment in school nutrition), aside from some subtopics being classified under Kw#3 (overweight) and Kw#2 (prevention), whereas BC#2 on interventions and particular, community-based programmes (especially for childhood obesity) has weak connections to Kw#2 (prevention) and Kw#5 (childhood obesity prevention, active living and community-based instruments). Again, both topics, BC#5 and BC#2, expand in the last 5-year interval (being concentrated mostly in the most recent 3-year period).
We might summarize the findings concerning the dominant topical structure of our domain and its development as follows (Table 7; dominant subtopics are written in bold).
Table 7 Dominant topical structure of the domain: summary of the semantic concordances between BC-clusters and Kw-clusters
Upon this overview of the resulting structure, several observations can be made. Firstly, the two clusterings (topics and subtopics) show a fairly coherent thematic picture, with subtopics fitting into, and elaborating on, the meaning of general topics. Secondly, although the whole spectrum of factors and determinants of obesity, both at the individual (e.g. dietary patterns) and various community and economic levels (sociodemographic factors, schools, education, consumption etc.), can be seen as continuously trending, the dominant themes associated with most factors are those of food environment and school-related issues. These two general themes seem to be the main focus of recent approaches to obesity from policy perspectives. These themes, relating various BC-clusters, also provide evidence that the topics are densely overlapping. This interrelatedness is further confirmed by the other (semantically defined) subtopics being common to many research directions (BC-clusters). Finally, some small BC- and Kw-clusters are underrepresented in the alluvials, showing scarce connections (under the thresholds set for stripes to be visible), such as the quite coherent BC-theme on trade and globalization as a factor of obesity as a global disease (#9), or the very small but distinctive BC-cluster on internet behavior (#11). This fact underlines the role of the original clusterings in complementing the “dominant” structure outlined by cluster interrelations.
Key concepts connecting research fields: interdisciplinarity and its development
In order to address (RQ2), another type of three-way alluvial diagram has been applied. Similarly to the multicluster view, these diagrams connect the time dimension (in the aggregated form) with two further variables, key concepts and research fields (WoS Subject Categories), through their associations in the publication output under study. Since we are primarily interested in the patterns of interdisciplinarity, that is, the role of key concepts in connecting research areas, we designed the alluvials to highlight these interconnections. As many concepts with high connecting potential are widely distributed among research fields, but would be suppressed in the alluvials by concepts with higher frequencies in the related categories, two types of diagrams have been created: the first one reports the main trends, where only concept occurrences above 10 per category are made visible (Fig. 6), and the other conveys latent trends, where only occurrences below this threshold are visible (Fig. 7, “Appendix”). Concepts are sorted into different diagrams for better readability only. Among the author keywords of the corpus, about forty (n = 39) key concepts were identified, according to the procedure described in the “Data and methods” section, with a diversity threshold H > 1.5 and a frequency threshold F > 100 (based on the distribution of both values). These key concepts are listed in Table 8 (in their stemmed version), along with the indicator values applied for their selection. Concepts in the table are ordered by diversity/entropy (that is, their distribution over research fields), the leading concepts bearing the highest potential in connecting fields. To contrast the connecting role and the weight of key concepts, their frequency of occurrence is plotted against the entropy measure in the table.
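The selection step can be illustrated with the following minimal sketch, which filters concepts by the two thresholds quoted above. Several details are assumptions made for the example rather than the procedure of the “Data and methods” section: the diversity H is taken to be the Shannon entropy (natural logarithm) of a concept's distribution over subject categories, the frequency F is counted over (concept, category) assignments, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def shannon_entropy(counts):
    """Shannon entropy (natural log, an assumption) of a count distribution."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log(c / total) for c in counts)


def select_key_concepts(records, min_entropy=1.5, min_freq=100):
    """Select concepts with H > min_entropy and F > min_freq.

    records: iterable of (concept, subject_category) pairs, one per
             occurrence of a stemmed author keyword in a paper of that
             category (hypothetical input layout).
    Returns (concept, H, F) tuples ordered by decreasing entropy.
    """
    by_concept = {}
    for concept, field in records:
        by_concept.setdefault(concept, Counter())[field] += 1
    selected = []
    for concept, fields in by_concept.items():
        h = shannon_entropy(fields.values())
        f = sum(fields.values())
        if h > min_entropy and f > min_freq:
            selected.append((concept, h, f))
    return sorted(selected, key=lambda row: row[1], reverse=True)
```

A concept spread evenly over many categories obtains a high H and is therefore ranked first, whereas a frequent concept confined to a single category has H = 0 and is excluded regardless of its frequency.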
Table 8 The list and indicators of key concepts identified for the interdisciplinarity maps
In what follows, we will interpret the alluvial diagrams starting with the concept group contributing most to the multidisciplinarity of our domain (highest diversity values), which is practically the first column of Table 8. Filtering out the general themes (health, public health, nutrition, health policy etc.), we are left, in approx. the order of frequencies, with “obesity indicators” (body mass index, overweight), children as a distinguished target group (childhood obesity, children, parent, and we can assign adolescent to this thematic axis as well), food policies (food policy, food, diet), and some prevalent factors and risks or regions of interest (socioeconomic status and diabetes, and China, respectively). Concepts in the second column can be roughly assigned to these categories (see the detailed overview below), except physical activity, which is the most frequent key concept in the list, deserving a separate description.
According to the alluvials (Fig. 6), obesity indicators seem to serve as a common language among fields, connecting mainly Nutrition and Dietetics (through overweight but, interestingly, not the BMI) and Pediatrics with Business and Economics and the Biomedical Social Sciences, but with a stronger presence in the previous than in the most recent periods.
Childhood obesity and the related terms also link Business and Economics or Educational Sciences with Pediatrics, Nutrition and Health Care Sciences and Services, with a constant prevalence in the last decade. The connecting role of parent is more extensive and latent, encompassing most areas with low frequencies per field, as shown in Fig. 7, but increasing its presence in the most recent period. On the other hand, school, while being a bridge between educational, clinical, nutritional and pediatric fields, mostly appears between 2008 and 2013, which signals some shift of emphasis from institutional to family-related factors of childhood obesity.
Food policy as a term is rooted both in Business and Economics and in Agriculture, but seems more timely in connection with Nutrition and Dietetics, as the former stripes connect only to earlier periods. Fast food also appears as a less timely concept, mainly from Business and Economics. Food environment, however, as a bridge between Education studies and Nutrition, is a more recent framing of this perspective on obesity (linking to the latest period), while diet connects most fields (clinical, social and economic sciences) and is concentrated in the last time interval. Sugar-sweetened beverages, as a specificity of this category generating much attention, is again a constant and highly multidisciplinary topic.
Physical activity is a prevalent concept in terms of size, and links the educational, clinical and health-related categories with pediatric issues. Its presence is continuous over time up to the very recent years. On the other hand, physical education is only attached to Educational Research (at least in the main trend diagram) with a similar distribution over the timeline.
Prevalent factors and risks of obesity exhibit a more subtle role in connecting areas than the previous “massive” conceptual structures, in that many of them are only expressed in the diagrams with latent connections (Fig. 7). Socioeconomic status as a key concept is equally present in all fields throughout the last decade, though with a much lower frequency per subject area. Diabetes behaves the same way, somewhat increasing its weight only within Nutrition in the last period. Distinctive in this respect is the concept of health disparities, at the intersection of Educational Research, Health Care Sciences, Nutrition and Pediatrics, as it clearly connects these areas in the most recent period. In contrast, nutrition transition and environment link most areas but in a less timely fashion. Finally, as a single concept signalling joint attention to a geographic aspect of obesity policy, China has its related fields divided along the timeline, with an earlier contribution from Business and Economics, Agriculture and Biomedical Social Sciences, and with the most recent interest coming from clinical, health and nutritional sciences (G&IM, HCS&S, Nt&D).