Cited references and Medical Subject Headings (MeSH) as two different knowledge representations: clustering and mappings at the paper level
For the biomedical sciences, the Medical Subject Headings (MeSH) make available a rich feature which cannot currently be merged properly with widely used citing/cited data. Here, we provide methods and routines that make MeSH terms amenable to broader usage in the study of science indicators: using Web-of-Science (WoS) data, one can generate the matrix of citing versus cited documents; using PubMed/MEDLINE data, a matrix of the citing documents versus MeSH terms can be generated analogously. The two matrices can also be reorganized into a 2-mode matrix of MeSH terms versus cited references. Using the abbreviated journal names in the references, one can, for example, address the question whether MeSH terms can be used as an alternative to WoS Subject Categories for the purpose of normalizing citation data. We explore the applicability of the routines in the case of a research program about the amyloid cascade hypothesis in Alzheimer’s disease. One conclusion is that referenced journals provide archival structures, whereas MeSH terms indicate mainly variation (including novelty) at the research front. Furthermore, we explore the option of using the citing/cited matrix for main-path analysis as a by-product of the software.
The ability to define research fields is one of several great challenges in information science (Chen 2016). Early efforts relied on classifying publication sources, such as journals, to define research fields. In addition to disciplinary journals, however, the literature databases Web of Science (WoS, Thomson Reuters) and Scopus (Elsevier) contain multi-disciplinary journals such as Science and Nature. In recent years, new journals that are not organized along disciplinary lines have been added to the databases. PLoS ONE, for example, tends to disturb the existing classifications of journals (Leydesdorff and De Nooy, in press). In response to these changes, bibliometricians have begun to cluster the database at the level of documents instead of journals (e.g., Waltman and van Eck 2012; cf. Hutchins et al. 2016).
An alternative to clustering documents on the basis of direct citations could be to use databases that are more specialized than WoS and Scopus, but with professional indexing at the document level. The National Library of Medicine, for example, makes a huge investment to maintain a classification system of Medical Subject Headings (MeSH) as tags to the PubMed/MEDLINE database (which is publicly available at http://www.ncbi.nlm.nih.gov/pubmed/advanced). The classification at the article level is elaborated in great detail (Agarwal and Searls 2009), with a hierarchical tree covering sixteen separate branches that can reach up to twelve levels of depth. Diseases, for example, are classified under C.
“Alzheimer’s disease” (AD) for example is classified as C10.228.140.380.100 under “Dementia,” as C10.574.945.249 under “Neurodegenerative diseases,” and as F03.615.400.100 under “Neurocognitive disorders” in the F-branch covering “Psychiatry and psychology.” Unlike other disciplinarily specialized databases such as Chemical Abstracts (Bornmann et al. 2009), the multiple tree-structure of the Index Medicus allows for mapping documents differently across heterogeneous domains (Leydesdorff et al. 2012; Rotolo et al. 2016). Unlike WoS or Scopus, Medline does not cover the full range of disciplines; but a large part of the scholarly literature in the life sciences is included even more exhaustively than in the more comprehensive databases (Lundberg et al. 2006).
A version of MEDLINE is integrated in the databases of Thomson Reuters. The advantage of this installation is that the “times cited” of each record (if the document is also available in the WoS Core Collection of the Citation Indices) is available on screen; but this field is not included when the records are downloaded. Rotolo and Leydesdorff (2015) provide software for integrating the “times cited” from the citation indices at WoS into the MEDLINE data. One technical advantage of the installation at PubMed is that retrieval is not restricted in size. Using WoS, one can download only 500 records at a time, and Scopus has a maximum of 2000 records.
The MeSH terms attributed to a paper can be considered as references to a body of knowledge stored as documents in a database. Whereas the cited references are provided by the authors themselves, the MeSH categories are attributed by professional indexers. Using MeSH terms as references, one can envisage a matrix of documents referencing MeSH comparable to the cited/citing matrix at the article level. Both cited references and MeSH terms can be considered as attributes of articles, and thus be combined and compared using various forms of multi-variate analysis. The two matrices can also be integrated into a 2-mode matrix of MeSH terms versus cited references. In this brief communication, we explore these options computationally and describe software that has been developed and made available for this purpose on the internet. We discuss the opportunities and the pros and cons of various approaches.
Since 1992, the amyloid cascade hypothesis has played a prominent role in explaining the etiology and pathogenesis of Alzheimer’s disease (AD). It proposes that the deposition of β-amyloid (Aβ) is the initial pathological event in AD leading to the formation of senile plaques (SPs) and then to neurofibrillary tangles (NFTs), neuronal cell death, and ultimately dementia. While there is substantial evidence supporting the hypothesis, there are also limitations: (1) SPs and NFTs may develop independently, and (2) SPs and NFTs may be the products rather than the causes of neurodegeneration in AD. In addition, randomized clinical trials that tested drugs or antibodies targeting components of the amyloid pathway have been inconclusive.
For the purpose of this study, the search string ‘(“Alzheimer disease”[MeSH Terms] AND “amyloid beta-protein precursor”[MeSH Terms]) AND “mice, transgenic”[MeSH Terms]’ was proposed to encompass the relevant literature. This string provided us (on March 6, 2016) with a retrieval of 3558 records in both PubMed/MEDLINE and the MEDLINE version in WoS. Using PubMed Identifiers (PMID numbers), 3416 of these records could be retrieved in the WoS Core Collection. As noted, not all journals covered by PubMed/MEDLINE are also covered in the WoS Core Collection.
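The retrieval described above was done through the search interfaces. As a minimal sketch, assuming one prefers to script the PubMed side, the same search string could be sent to the NCBI E-utilities ESearch endpoint. The function name `esearch_url` and the `retmax` value are illustrative choices, not part of the routines described in this paper; the code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Search string from the text, written with plain ASCII quotes for use in a URL.
QUERY = ('("Alzheimer disease"[MeSH Terms] AND '
         '"amyloid beta-protein precursor"[MeSH Terms]) AND '
         '"mice, transgenic"[MeSH Terms]')

# NCBI E-utilities ESearch endpoint (returns PMIDs matching a query).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=5000, retstart=0):
    """Build an ESearch request URL for `term` (hypothetical helper)."""
    params = {"db": "pubmed", "term": term,
              "retmax": retmax, "retstart": retstart}
    return ESEARCH + "?" + urlencode(params)


url = esearch_url(QUERY)
print(url)
```

The PMIDs returned by such a request could then serve as the key for matching records against WoS.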
Two dedicated programs, MHNetw.exe and CitNetw.exe, have been developed to generate reference matrices using the PubMed/MEDLINE and the WoS data, respectively. The matrices are provided in the Pajek format. CitNetw.exe generates the cited/citing matrix with the citing documents as units of analysis in the rows and the cited references as variables in the columns; MHNetw.exe generates a similar matrix, but with the MeSH terms in the columns. The number of citing documents is determined by the retrieval from PubMed/MEDLINE or MEDLINE in WoS, respectively. Instructions for how to use the databases and routines are provided in Appendix 1.
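To make the layout of such a matrix concrete, the following sketch writes a small documents-by-attributes table as a Pajek 2-mode network. It is an illustration under stated assumptions, not the code of MHNetw.exe or CitNetw.exe: the function name `to_pajek_2mode` and the toy records are hypothetical, and the exact files written by the two programs may differ in detail.

```python
def to_pajek_2mode(doc_attrs):
    """Serialize {document: [attributes]} as a Pajek 2-mode network.

    The first mode (rows) holds the citing documents; the second mode
    (columns) holds the attributes, e.g. cited references or MeSH terms.
    """
    docs = list(doc_attrs)
    attrs = sorted({a for values in doc_attrs.values() for a in values})
    # Column vertices are numbered after the row vertices (1-based ids).
    col = {a: len(docs) + i + 1 for i, a in enumerate(attrs)}

    # "*Vertices N M": N vertices in total, the first M belong to mode one.
    lines = [f"*Vertices {len(docs) + len(attrs)} {len(docs)}"]
    lines += [f'{i + 1} "{d}"' for i, d in enumerate(docs)]
    lines += [f'{col[a]} "{a}"' for a in attrs]
    lines.append("*Edges")
    for i, d in enumerate(docs):
        for a in sorted(set(doc_attrs[d])):
            lines.append(f"{i + 1} {col[a]} 1")
    return "\n".join(lines) + "\n"


# Hypothetical toy records: document id -> attributed MeSH terms.
records = {
    "PMID1": ["Alzheimer Disease", "Mice, Transgenic"],
    "PMID2": ["Alzheimer Disease"],
}
pajek_text = to_pajek_2mode(records)
print(pajek_text)
```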
The routine MHNetw.exe presumes that the data from WoS with the citation information is already organized (by CitNetw.exe) in the same folder so that the citation information can be retrieved locally and attributed to the MeSH categories. If this data is not yet present, the user is first prompted with a search string in the file “string.wos” that can be used at the advanced search interface of WoS.
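The exact content of “string.wos” is not specified here. As a hedged sketch, assuming the string is built from PubMed Identifiers using the `PMID=` field tag and the `OR` operator of the WoS advanced search, it might be assembled as follows. The function name and the batch size are hypothetical.

```python
def wos_pmid_queries(pmids, batch_size=500):
    """Yield advanced-search strings of the form PMID=(a OR b OR ...).

    `batch_size` is an illustrative choice for keeping each string short;
    it is not a documented limit of the search interface.
    """
    pmids = [str(p) for p in pmids]
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        yield "PMID=(" + " OR ".join(batch) + ")"


# Hypothetical identifiers, split into batches of two for illustration.
queries = list(wos_pmid_queries([111, 222, 333], batch_size=2))
print(queries)
```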
“Mtrx.net” contains the reference matrix in the Pajek format; the Pajek format allows for virtually unlimited file sizes.
The SPSS syntax file “mtrx.sps” reads the reference matrix (“mtrx.txt”) into SPSS and saves this file as an SPSS systems file (“mtrx.sav”). MeSH terms are included as variable labels in the case of MHNetw.exe; in the case of CitNetw.exe, the cited references are the variable labels. The user can combine the two matrices using, for example, Excel.
The file “cr_mh.net” contains the 2-mode matrix of cited references (CR) in the rows and MeSH terms in the columns.
The file “jcr_mh.net” simplifies cr_mh.net by using only the abbreviated journal names from the cited references in the rows, again with MeSH terms in the columns.
The file “jcr_mh_a.net” contains the same information (abbreviated journal names and MeSH categories), but organized differently: both CR and MeSH are attributed as variables to the documents under study as the cases (in the rows). Within Pajek, one can convert this matrix into an affiliations matrix (using Network > 2-Mode Network > 2-Mode to 1-Mode > Columns). One can also export this file (e.g., to SPSS) for cosine-normalization of the matrix.
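The two operations mentioned above can be sketched in a few lines. The first function multiplies the documents-by-CR matrix (transposed) with the documents-by-MeSH matrix, so that each cell counts the documents in which a cited reference and a MeSH term co-occur. The second computes the cosine between two attribute columns. This is a minimal illustration on binary toy data; the function names and the data are hypothetical, and the routines described here may weight or normalize differently.

```python
from math import sqrt


def two_mode_product(doc_cr, doc_mh):
    """Return {(cited_ref, mesh): n}, the number of documents sharing both."""
    cells = {}
    for doc, refs in doc_cr.items():
        for ref in set(refs):
            for term in set(doc_mh.get(doc, [])):
                cells[(ref, term)] = cells.get((ref, term), 0) + 1
    return cells


def cosine(doc_attrs, a, b):
    """Cosine between two binary attribute columns of a document matrix."""
    docs_a = {d for d, attrs in doc_attrs.items() if a in attrs}
    docs_b = {d for d, attrs in doc_attrs.items() if b in attrs}
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / sqrt(len(docs_a) * len(docs_b))


# Hypothetical toy data: three documents with journal names and MeSH terms.
doc_cr = {"d1": ["NATURE", "J BIOL CHEM"], "d2": ["NATURE"], "d3": ["J BIOL CHEM"]}
doc_mh = {"d1": ["Mice, Transgenic"],
          "d2": ["Mice, Transgenic", "Plaque, Amyloid"],
          "d3": ["Plaque, Amyloid"]}

cells = two_mode_product(doc_cr, doc_mh)
cos = cosine(doc_mh, "Mice, Transgenic", "Plaque, Amyloid")
print(cells)
print(cos)
```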
CitNetw.exe, furthermore, provides a file “lcs.net” containing the cited/citing matrix for the bounded citation network of the citing documents under study. The bounded citation network corresponds with what was defined as the “local citation environment” in HistCite™ (Garfield et al. 1964, 2003). The cited references are matched against a string composed from the meta-data of the citing document using the standard WoS-format of the cited references: “Name Initial, publication year, abbreviated journal title, volume number, and page number” (e.g., “Zhang CL, 2002, CLIN CANCER RES, V8, P1234”). The matrix may be somewhat different from the one obtained from using HistCite™ because of different matching and disambiguation rules.
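The matching step can be illustrated as follows: a key in the WoS cited-reference format is composed from the meta-data of each citing document, and every cited reference that equals one of these keys yields an arc within the bounded network. This is a sketch of exact string matching only, under the assumption that records are available as simple dictionaries; the actual matching and disambiguation rules of CitNetw.exe are not reproduced, and all records other than the example string from the text are hypothetical.

```python
def wos_key(author, year, journal, volume, page):
    """Compose 'Name Initial, year, JOURNAL, Vvolume, Ppage' for a document."""
    return f"{author}, {year}, {journal.upper()}, V{volume}, P{page}"


def local_citation_arcs(documents):
    """Return (citing_id, cited_id) pairs among the documents themselves."""
    key_to_id = {wos_key(*doc["meta"]).upper(): doc_id
                 for doc_id, doc in documents.items()}
    arcs = []
    for doc_id, doc in documents.items():
        for ref in doc["refs"]:
            cited = key_to_id.get(ref.strip().upper())
            # Keep only references to other documents inside the set.
            if cited is not None and cited != doc_id:
                arcs.append((doc_id, cited))
    return arcs


# Hypothetical records; "meta" is (author, year, journal, volume, page).
documents = {
    "A": {"meta": ("Zhang CL", 2002, "Clin Cancer Res", 8, 1234), "refs": []},
    "B": {"meta": ("Doe J", 2005, "Nature", 435, 10),
          "refs": ["Zhang CL, 2002, CLIN CANCER RES, V8, P1234",
                   "Roe A, 1999, SCIENCE, V1, P1"]},
}
local_arcs = local_citation_arcs(documents)
print(local_arcs)
```

References that do not resolve to a document in the set (the second reference of "B" in this toy example) are simply ignored, which is what bounds the network.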
Main or critical path analysis using lcs.net
1. Extract the largest component from the network
a. Network > Create partition > Component > Weak
b. Operations > Network + Partition > Extract subnetwork > Choose cluster;
2. Remove strong components from the largest component
a. Network > Create partition > Component > Strong
b. Operations > Network + Partition > Shrink network > [use default values]
3. Remove loops
a. Network > Create new network > Transform > Remove > Loops
4. Create main path (or critical path)
a. Network > Acyclic network > Create weighted > Traversal > SPC
b. Network > Acyclic network > Create (Sub)Network > Main Paths
Note that the cited references are not disambiguated by these routines, but are used as they appear in the input file. The user may wish to disambiguate the references before entering this routine; for example, by using CRExplorer.EXE at http://www.crexplorer.net (Thor et al. 2016).
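For readers who want to see what step 4 computes, the following sketch derives search path count (SPC) weights on a small acyclic network and then selects the source-to-sink path with the largest summed weight. It is a simplified illustration, not Pajek's implementation: the arc direction (from the earlier to the later paper), the tie-breaking, and the function names are assumptions, and the recursion is only suitable for small networks.

```python
from collections import defaultdict


def spc_weights(arcs):
    """Search path count (SPC) for each arc of an acyclic network.

    The weight of arc (u, v) is the number of source-to-sink paths through
    it: (paths from any source to u) * (paths from v to any sink).
    """
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def count(node, nbrs, memo):
        # Number of paths from `node` to a terminal node along `nbrs`.
        if node not in memo:
            memo[node] = sum(count(n, nbrs, memo) for n in nbrs[node]) or 1
        return memo[node]

    to_sink, from_source = {}, {}
    for n in nodes:
        count(n, succ, to_sink)
        count(n, pred, from_source)
    return {(u, v): from_source[u] * to_sink[v] for u, v in arcs}


def heaviest_path(arcs):
    """Return (total SPC weight, path) of the heaviest source-to-sink path."""
    w = spc_weights(arcs)
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
    best = {}

    def heaviest(node):
        if node not in best:
            options = [(w[(node, s)] + heaviest(s)[0], [node] + heaviest(s)[1])
                       for s in succ[node]]
            best[node] = max(options, key=lambda o: o[0]) if options else (0, [node])
        return best[node]

    sources = {u for u, _ in arcs} - {v for _, v in arcs}
    return max((heaviest(s) for s in sources), key=lambda o: o[0])


# Hypothetical toy network with one source (A) and one sink (E).
toy_arcs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"), ("B", "E")]
weights = spc_weights(toy_arcs)
total, path = heaviest_path(toy_arcs)
print(weights)
print(total, path)
```

Steps 1 to 3 of the procedure matter here because SPC is defined only for acyclic networks; cycles (strong components) and loops must be removed before the weights are computed.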
[Table: Some descriptive statistics of the data under study. Rows: N of documents; unique MeSH terms; unique cited references.]
[Table: Ten most frequently cited journals and ten most frequently referenced MeSH terms. Surviving entries: the journals P Natl Acad Sci USA, J Biol Chem, and Am J Pathol; the MeSH terms “Disease Models, Animal” and “Amyloid beta-Protein Precursor.”]
Analysis and decomposition
In summary, the abbreviated journal names in the references provide us with far greater access to the structure in the matrix than do the MeSH terms. Referenced journals reflect the archival knowledge base on which the new knowledge claims build, whereas MeSH terms position papers as variation (including novelty; Boudreau et al. 2016) at the research front. The MeSH terms are attributed from the perspective of hindsight. In other words, the MeSH classification which operates at the paper level may be less suited for the normalization of citations than journals or journal categories, which can reveal archival structures.
It is beyond the scope of this paper to compare these results with other options for main-path or critical path analysis (Batagelj 2003; Hummon and Doreian 1989). A review of the various options is provided by Liu and Lu (2012), who suggest that a combination of the results of several algorithms into an integrated model can improve the quality of the main-path analysis (cf. Lucio-Arias and Leydesdorff 2008). The resulting main path can be further analyzed as a Pajek file; for example, the colors in Fig. 5 show the results of decomposition using the algorithm of Blondel et al. (2008).
The generation of a main path of forty articles for a line of investigation encompassing approximately 3500 papers is appealing due to the reduction by two orders of magnitude in the amount one would need to read to obtain an understanding of this subfield. However, a main path remains an algorithmic construct that one can use heuristically, but that otherwise requires validation. For example, the paper by Kawabata et al. (1991), published in Nature in December 1991, was retracted on March 19, 1992. This paper received sixteen citations from other papers on the main path, thirteen of them in the years after the retraction. From an intellectual perspective, one might consider removing this article from the pool of candidate nodes before regenerating the main path.
The two main scientific awards within the field of AD research are the “Potamkin Prize for Research in Pick’s, Alzheimer’s, and Related Diseases” and the “MetLife Foundation Award for Medical Research in Alzheimer’s Disease.” Both prizes have been awarded since the late 1980s, thus capturing in full the time period of our analysis. Forty investigators have won both awards. The main path (as depicted in Fig. 5) includes one or more papers from twelve of these authors.
We have developed two routines that enable the researcher to generate matrices of citing versus cited documents and/or citing documents versus MeSH terms. The data from WoS and PubMed/MEDLINE were integrated using the PubMed Identifier (PMID). Since the number of citing documents is (almost) the same in both cases, the two matrices can also be juxtaposed and then merged so that combinations of citations and MeSH terms can be analyzed. These combinations can perhaps be considered as hybrid indicators (e.g., Braam et al. 1991).
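Juxtaposing the two matrices amounts to a join on the PMID. A minimal sketch, assuming both matrices are held as dictionaries keyed by PMID, is given below; the function name, the column prefixes "CR:" and "MH:", and the toy data are hypothetical.

```python
def juxtapose(doc_cr, doc_mh):
    """Merge two {PMID: attributes} tables into one, prefixing the columns.

    Returns the merged table for documents present in both sources and the
    list of PMIDs found in only one of them.
    """
    merged = {}
    for pmid in sorted(set(doc_cr) & set(doc_mh)):
        merged[pmid] = ([f"CR:{r}" for r in doc_cr[pmid]] +
                        [f"MH:{m}" for m in doc_mh[pmid]])
    unmatched = sorted(set(doc_cr) ^ set(doc_mh))
    return merged, unmatched


# Hypothetical toy data keyed by PMID.
cr_table = {"100": ["NATURE"], "200": ["SCIENCE"]}
mh_table = {"100": ["Alzheimer Disease"], "300": ["Mice, Transgenic"]}

merged, unmatched = juxtapose(cr_table, mh_table)
print(merged)
print(unmatched)
```

Reporting the unmatched identifiers makes visible the records that exist in only one database, such as the PubMed/MEDLINE records that could not be retrieved in the WoS Core Collection.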
Aggregation of the cited references at the journal level reduces the number of variables by orders of magnitude; the resulting numbers are comparable to the numbers of MeSH categories attributed. Further analysis leads to the conclusion that the abbreviated journal names in the cited references indicate a core structure of the set, whereas the MeSH terms are attributed with regard to their relevance to current research options. This classification therefore seems less suited for carrying the normalization of citations than journals or journal groups.
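The aggregation to the journal level can be sketched by taking the third comma-separated field of each cited reference in the WoS format. This is an illustration only: the function name is hypothetical, the reference strings other than the example from the text are invented, and references without a journal field (for example, books) would need separate handling.

```python
from collections import Counter


def journal_of(cited_ref):
    """Return the abbreviated journal name from 'Author, year, JOURNAL, Vn, Pn'."""
    parts = [p.strip() for p in cited_ref.split(",")]
    return parts[2].upper() if len(parts) >= 3 else None


# Toy cited references; the last one lacks a journal field.
refs = [
    "Zhang CL, 2002, CLIN CANCER RES, V8, P1234",
    "Doe J, 1999, SCIENCE, V1, P1",
    "Roe A, 2001, SCIENCE, V2, P2",
    "Anonymous, 1990",
]
journals = Counter(j for j in map(journal_of, refs) if j)
print(journals.most_common())
```

Three distinct references collapse into two journal names in this toy example, which illustrates on a small scale the reduction in the number of variables.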
In the context of this study, main-path analysis provides another example of the research potential of organizing the data into primary matrices extracted from downloads of PubMed and WoS. As a perspective for further research, Hellsten and Leydesdorff (2016), for example, analyze translational research in medicine in terms of combinations of MeSH terms, institutional addresses, and journal names. By considering these and other (meta-)data as attributes of documents, one can merge matrices and combine dimensions in the data as we have done above for cited references and MeSH terms, but also beyond two dimensions in terms of n-mode arrays and therefore heterogeneous networks (Callon and Latour 1981; Law 1986).
The National Library of Medicine of the United States (NLM) has consistently received substantial funding to maintain and update its biomedical and health information services; for example, the 2015 budget for these services was $117 million (National Library of Medicine, 2015). This has enabled a relatively uniform application of the MeSH classification to publications by indexers over many years (Hicks and Wang 2011, at p. 292; Petersen et al. 2016).
One can also use this string for computing the Relative Citation Ratios at https://icite.od.nih.gov/analysis (Hutchins et al. 2016). However, this facility currently has a limit of 200 PubMed identifiers.
Fifty of the 3532 MeSH terms were not related in this case.
The decomposition algorithm of VOSviewer distinguishes more than one hundred clusters after symmetrizing the asymmetrical matrix internally by summing the cells (i,j) and (j,i).
Thirty-four of these documents are located on both the standard and critical main paths as a reduction to a single main path.
The results of a core/periphery analysis are not shown here, but can be web-started at http://www.vosviewer.com/vosviewer.php?map=http://www.leydesdorff.net/software/mhnetw/kcore_map.txt&network=http://www.leydesdorff.net/software/mhnetw/kcore_net.txt&label_size_variation=0.3&zoom_level=2&scale=0.9. The k-core analysis is based on relations with a value of ten or more and confirms that journal names are prevailing in the core set.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.