Introduction

The ability to define research fields is one of several great challenges in information science (Chen 2016). Early efforts relied on classifying publication sources, such as journals, to define research fields. In addition to disciplinary journals, however, the literature databases Web of Science (WoS, Thomson Reuters) and Scopus (Elsevier) contain multi-disciplinary journals such as Science and Nature. In recent years, new journals which are not organized along disciplinary lines, have been added to the databases. PLoS ONE, for example, tends to disturb the existing classifications of journals (Leydesdorff and De Nooy, in press). In response to these changes, bibliometricians have begun to cluster the database at the level of documents instead of journals (e.g., Waltman and van Eck 2012; cf. Hutchins et al. 2016).

An alternative to clustering documents on the basis of direct citations could be to use databases that are more specialized than WoS and Scopus, but with professional indexing at the document level. The National Library of Medicine, for example, makes a huge investment to maintain a classification system of Medical Subject Headings (MeSH) as tags to the PubMed/MEDLINE database (which is publicly available at http://www.ncbi.nlm.nih.gov/pubmed/advanced).Footnote 1 The classification at the article level is elaborated in great detail (Agarwal and Searls 2009), with a hierarchical tree covering sixteen separate branches that can reach up to twelve levels of depth. Diseases, for example, are classified under C.

“Alzheimer’s disease” (AD) for example is classified as C10.228.140.380.100 under “Dementia,” as C10.574.945.249 under “Neurodegenerative diseases,” and as F03.615.400.100 under “Neurocognitive disorders” in the F-branch covering “Psychiatry and psychology.” Unlike other disciplinarily specialized databases such as Chemical Abstracts (Bornmann et al. 2009), the multiple tree-structure of the Index Medicus allows for mapping documents differently across heterogeneous domains (Leydesdorff et al. 2012; Rotolo et al. 2016). Unlike WoS or Scopus, Medline does not cover the full range of disciplines; but a large part of the scholarly literature in the life sciences is included even more exhaustively than in the more comprehensive databases (Lundberg et al. 2006).

A version of MEDLINE is integrated in the databases of Thomson Reuters. The advantage of this installation is that the “times cited” of each record (if the document is also available in the WoS Core Collection of the Citation Indices) is available on screen; but this field is not integrated when the records are downloaded. Rotolo and Leydesdorff (2015) provide software for integrating the “times cited” from the citation indices at WoS into the MEDLINE data. One technical advantage of the installation at PubMed is that the retrieval is not restrained. Using WoS, one can download only 500 records at a time and Scopus has a maximum of 2000 records.

The MeSH terms attributed to a paper can be considered as references to a body of knowledge stored as documents in a database. Whereas the cited references are provided by the authors themselves, the MeSH categories are attributed by professional indexers. Using MeSH terms as references, one can envisage a matrix of documents referencing MeSH comparable to the cited/citing matrix at the article level. Both cited references and MeSH terms can be considered as attributes of articles, and thus be combined and compared using various forms of multi-variate analysis. The two matrices can also be integrated into a 2-mode matrix of MeSH terms versus cited references. In this brief communication, we explore these options computationally and describe software that has been developed and made available for this purpose on the internet. We discuss the opportunities and the pros and cons of various approaches.

Methods

Data

At the professional suggestion of one of us (AS, the scientometrics editor of the Journal of Alzheimer’s Disease), we selected the amyloid cascade hypothesis in Alzheimer’s disease (AD) as a test case to develop software and routines to merge and analyze citation information from the Web of Science and MeSH. The amyloid cascade hypothesis in AD was formulated by Hardy and Allsop (1991) (cf. Hardy and Higgins 1992; Selkoe 1991). Reitz (2012: 1) summarized this hypothesis as follows:

Since 1992, the amyloid cascade hypothesis has played a prominent role in explaining the etiology and pathogenesis of Alzheimer’s disease (AD). It proposes that the deposition of β-amyloid (Aβ) is the initial pathological event in AD leading to the formation of senile plaques (SPs) and then to neurofibrillary tangles (NFTs), neuronal cell death, and ultimately dementia. While there is substantial evidence supporting the hypothesis, there are also limitations: (1) SP and NFT may develop independently, and (2) SPs and NFTs may be the products rather than the causes of neurodegeneration in AD. In addition, randomized clinical trials that tested drugs or antibodies targeting components of the amyloid pathway have been inconclusive.

For the purpose of this study, the search string ‘(“Alzheimer disease”[MeSH Terms] AND “amyloid beta-protein precursor”[MeSH Terms]) AND “mice, transgenic”[MeSH Terms])’ was proposed to encompass the relevant literature. This string provided us (on March 6, 2016) with a retrieval of 3558 records in both PubMed/MEDLINE and the MEDLINE version in WoS. Using PubMed Identifiers (PMID numbers), 3416 of these records could be retrieved in the WoS Core Collection. As noted, not all journals covered by PubMed/MEDLINE are also covered in the WoS Core Collection.

Methods

Two dedicated programs, MHNetw.exeFootnote 2 and CitNetw.exe,Footnote 3 have been developed to generate reference matrices using the PubMed/MEDLine and the WoS data, respectively. The matrices are provided in the Pajek format. CitNetw.exe generates the cited/citing matrix with the citing documents as units of analysis in the rows and the cited references as variables in the columns; MHNetw.exe generates a similar matrix, but with the MeSH in the columns. The number of citing documents is determined by the retrieval from PubMed/MEDLINE or Medline in WoS, respectively. Instructions for how to use the databases and routines are provided in Appendix 1.

The routine MHNetw.exe presumes that the data from WoS with the citation information is already organized (by CitNetw.exe) in the same folder so that the citation information can be retrieved locally and attributed to the MeSH categories. If this data is not yet present, the user is first prompted with a search string in the file “string.wos” that can be used at the advanced search interface of WoS.Footnote 4

Both MHNetw.exe and CitNetw.exe provide the following files:

  1. 1.

    “Mtrx.net” contains the reference matrix in the Pajek format; the Pajek format allows for virtually unlimited file sizes.

  2. 2.

    The SPSS syntax file “mtrx.sps” reads the reference matrix (“mtrx.txt”) into SPSS and saves this file as an SPSS systems file (“mtrx.sav”). MeSH terms are included as variable labels in the case of MHNetw.exe; in the case of CitNetw.exe, the cited references are the variable labels. The user can combine the two matrices using, for example, Excel.

MHNetw.exe additionally provides:

  1. (a)

    Cr_mh.net, which contains the 2-mode matrix of cited references (CR) in the rows and MeSH terms in the columns;

  2. (b)

    Jcr_mh.net, which simplifies cr_mh.net by using only the abbreviated journal names in the cited references in the rows and MeSH terms in the columns;

  3. (c)

    The file jcr_mh_a.net, which contains the same information (abbreviated journal names and MeSH categories), but organized differently: both CR and MeSH are attributed as variables to the documents under study as the cases (in the rows). Within Pajek, one can convert this matrix into an affiliations matrix (using Network > 2-Mode Network > 2-Mode to 1-Mode > Columns). One can also export this file (e.g., to SPSS) for cosine-normalization of the matrix.

CitNetw.exe, furthermore, provides a file “lcs.net” containing the cited/citing matrix for the bounded citation network of the citing documents under study. The bounded citation network corresponds with what was defined as the “local citation environment” in HistCite™ (Garfield et al. 1964, 2003). The cited references are matched against a string composed from the meta-data of the citing document using the standard WoS-format of the cited references: “Name Initial, publication year, abbreviated journal title, volume number, and page number” (e.g., “Zhang CL, 2002, CLIN CANCER RES, V8, P1234”). The matrix may be somewhat different from the one obtained from using HistCite™ because of different matching and disambiguation rules.

In order to proceed with main-path analysis in Pajek, the network has to be a-cyclical (de Nooy et al. 2011, pp. 244f.). If needed, one can make the network a-cyclical within Pajek by using the following steps in the order specified in Table 1.

Table 1 Main or critical path analysis using lcs.net

Note that the cited references are not disambiguated by these routines, but are used as they appear on the input file. The user may wish to disambiguate the references before entering this routine; for example, by using CRExplorer.EXE at http://www.crexplorer.net (Thor et al. 2016).

Results

Descriptive

Figure 1 shows the number of documents in the set over time and the development of the ratio of citations per publication (c/p). As noted, the research program under study was triggered by a paper in 1992 (Hardy and Higgins 1992). However, there are 11 papers in the set with publication dates in 1991 predating this formulation. In the first decade, the number of publications shows exponential growth; but over the full time span linear growth prevails. In other words, this line of research is no longer booming, but since around 2000 can be considered as “normal science.” The c/p ratio declines linearly with the subsequently shorter citation windows for more recent papers. However, the decline in this ratio may also indicate a diminishing attractiveness of this line of research (Hardy and Selkoe 2002). The sharp decline in the number of publications in the most recent years confirms this inference (Selkoe and Hardy 2016). Recently, Herrup (2015) concluded “that the time has come to face our fears and reject the amyloid cascade hypothesis,” albeit at the moment without an alternative explanation of Alzheimer’s Disease.

Fig. 1
figure 1

Number of yearly papers (diamond) and citations per paper ratio (triangle) over time

Table 2 tells us that the number of cited references in the papers under study (176,670) is almost three times that of the MeSH terms attributed (62,648). In terms of unique cited references (67,831) versus unique MeSH terms (3532), the ratio is further worsened. On a map, the citations would completely overshadow the MeSH terms. However, the number of referenced journals (5345) is of the same order as the unique MeSH terms.

Table 2 Some descriptive statistics of the data under study

Figure 2 provides a map which can be generated using the 2-mode matrix of 5345 abbreviated journal names in the references (red) versus 3482 MeSH terms (green).Footnote 5 (To generate this figure, the file jcr_mh.net was input into Pajek and from there into VOSviewer for the visualization). The figure shows the very central position of the Journal of Neuroscience among the references. Although there are more unique references to journals than to MeSH, their concentration indicates that the red-colored journals form a backbone structure with the MeSH terms spreading out as variations. This is the dominant structure in this data: the journals provide a core structure and the MeSH terms the variation. The journals are more concentrated than the MeSH terms (Table 3): the Gini coefficient of the journal distribution is 0.937 while it is 0.852 for the distribution of MeSH.

Fig. 2
figure 2

Visualization of the 2-mode matrix jcr_mh.net showing 5345 journals cited (upper case) and 3482 MeSH terms (Capital case) in 3558 documents. Layout with Kamada and Kawai (1989); visualization in VOSviewer. This map can be web-started at http://www.vosviewer.com/vosviewer.php?map=http://www.leydesdorff.net/software/mhnetw/jcr_mh_map.txt&label_size_variation=0.3&zoom_level=1&scale=0.9

Table 3 Ten most frequently cited journals and ten most frequently referenced MeSH

Analysis and decomposition

Whereas multivariate analysis (e.g., factor analysis) is limited by systems and software limitations, the new decomposition algorithms enable us to decompose large and even very large matrices. The above matrix (Fig. 2), for example, can robustly be decomposed into five clusters using the algorithm of Blondel et al. (2008); the modularity of the network is low (Q = 0.066).Footnote 6 Figure 3, for example, shows the fourth component consisting of 598 cited journals versus 326 MeSH terms focusing on techniques such as neuro-imaging. This cluster can be further subdivided into nine components (Q = 0.375).

Fig. 3
figure 3

Visualization of the fourth component of the 2-mode matrix jcr_mh.net showing 598 journals cited and 326 MeSH terms. Nine clusters are distinguished with modularity Q = 0.375 (Blondel et al. 2008). Layout using (Fruchterman and Reingold 1991) and visualization in VOSviewer. This map can be web-started at http://www.vosviewer.com/vosviewer.php?map=http://www.leydesdorff.net/software/mhnetw/comp4map.txt&network=http://www.leydesdorff.net/software/mhnetw/comp4net.txt&label_size_variation=0.2&zoom_level=1&scale=1.20&colored_lines&n_lines=10000&curved_lines

The file jcr_mh_a.net organizes the same information as a matrix of the 3558 documents under study as the cases and both the MeSH terms and abbreviated journal titles as variables in the columns. Using this file, one can normalize the variables or proceed to multivariate analysis. After normalization using the Jaccard index—available in UCInet—the highly centralized structure, indeed, has disappeared. The resulting 1-mode similarity matrix can be decomposed into approximately 70 components by the algorithm of Blondel et al. (2008) and into 61 by the algorithm of VOSviewer (Waltman and van Eck 2012). The modularity is an order of magnitude larger than in the previous case (Q = 0.577). After this normalization, however, journal names come even more to the fore on the map (Fig. 4),Footnote 7 indicating their structural role in this information.

Fig. 4
figure 4

First component of the Jaccard-normalized matrix: 1083 cited journals and 900 MeSH terms; subdivided into 11 clusters (Blondel et al. 2008; Q = 0.220); layout and visualization using VOSviewer

In summary, the abbreviated journal names in the references provide us with far greater access to the structure in the matrix than do the MeSH terms. Referenced journals reflect the archival knowledge base on which the new knowledge claims build, whereas MeSH terms position papers as variation (including novelty; Boudreau et al. 2016) at the research front. The MeSH terms are attributed from the perspective of hindsight. In other words, the MeSH classification which operates at the paper level may be less suited for the normalization of citations than journals or journal categories, which can reveal archival structures.

Main-path analysis

As noted, CitNetw.exe also generates a file “lcs.net” containing the bounded network of the papers under study with “local citation scores” (Garfield et al. 2003). Using the instruction provided in Appendix 2, one can generate a main path using Pajek. Figure 5, for example, shows the so-called “key-route main path” as the most recommended option for this analysis (Liu and Lu 2012). Forty of the 3416 documents downloaded from WoS (or slightly more than 1 %) are located on this main path.Footnote 8

Fig. 5
figure 5

Forty papers on the so-called “key route global main path” in the citations among the 3416 WoS documents under study. Decomposition using the Louvain algorithm in Pajek (Blondel et al. 2008; Q = 0.757); layout using Kamada and Kawai (1989)

It is beyond the scope of this paper to compare these results with other options for main-path or critical path analysis (Batagelj 2003; Hummon and Doreian 1989). A review of the various options is provided by Liu and Lu (2012), who suggest that a combination of the results of several algorithms into an integrated model can improve the quality of the main-path analysis (cf. Lucio-Arias and Leydesdorff 2008). The resulting main path can be further analyzed as a Pajek file; for example, the colors in Fig. 5 show the results of decomposition using the algorithm of Blondel et al. (2008).

The generation of a main path of forty articles for a line of investigation encompassing approximately 3500 papers is appealing due to the reduction by two orders of magnitude in the amount one would need to read to obtain an understanding of this subfield. However, a main path remains an algorithmic construct that one can use heuristically, but that otherwise requires validation. For example, the paper by Kawabata et al. (1991) published in December 1991 in Nature was retracted on March 19, 1992. This paper received 16 citations by other papers on the main path, thirteen of them in the years after the retraction. From an intellectual perspective, one might consider removing this article from the pool of candidate nodes before regenerating the main path.

The two main scientific awards within the field of AD research are the “Potamkin Prize for Research in Pick’s, Alzheimer’s, and Related Diseases” and the “MetLife Foundation Award for Medical Research in Alzheimer’s Disease.” Both prizes have been awarded since the late 1980s, thus capturing in full the time period of our analysis. Forty investigators have won both awards. The main path (as depicted in Fig. 5) includes one or more papers from twelve of these authors.

Conclusions

We have developed two routines that enable the researcher to generate matrices of citing versus cited documents and/or citing documents versus MeSH terms. The data from WoS and PubMed/Medline was integrated using the PubMed Identifier (PMID). Since the number of citing documents is (almost) the same in both cases, the two matrices can also be juxtaposed and then merged so that combinations of citations and MeSH terms can be analyzed. These combinations can perhaps be considered as hybrid indicators (e.g., Braam et al. 1991).

Aggregation of the cited references at the journal level reduces the number of variables by orders of magnitude; the resulting numbers are comparable to the numbers of MeSH categories attributed. Further analysis leads to the conclusion that the abbreviated journal names in the cited references indicate a core structure of the set,Footnote 9 whereas the MeSH are attributed regarding to their relevance to current research options. This classification therefore seems less suited for carrying the normalization of citations than journals or journal groups.

In the context of this study, main-path analysis provides another example of the research potential of organizing the data into primary matrices extracted from downloads of PubMed and WoS. As a perspective for further research, Hellsten and Leydesdorff (2016), for example, analyze translational research in medicine in terms of combinations of MeSH terms, institutional addresses, and journal names. By considering these and other (meta-)data as attributes of documents, one can merge matrices and combine dimensions in the data as we have done above for cited references and MeSH terms, but also beyond two dimensions in terms of n-mode arrays and therefore heterogeneous networks (Callon and Latour 1981; Law 1986).