Generating clustered journal maps: an automated system for hierarchical classification

Journal maps and classifications for 11,359 journals listed in the combined Journal Citation Reports 2015 of the Science and Social Sciences Citation Indexes are provided at https://leydesdorff.github.io/journals/ and http://www.leydesdorff.net/jcr15. A routine using VOSviewer for integrating the journal mapping and their hierarchical clusterings is also made available. In this short communication, we provide background on the journal mapping/clustering and an explanation about and instructions for the routine. We compare journal maps for 2015 with those for 2014 and show the delineations among fields and subfields to be sensitive to fluctuations. Labels for fields and sub-fields are not provided by the routine, but an analyst can add them for pragmatic or intellectual reasons. The routine provides a means of testing one’s assumptions against a baseline without claiming authority; clusters of related journals can be visualized to understand communities. The routine is generic and can be used for any 1-mode network.


Introduction
Scholarly journals have been and remain the primary organizers of scientific communication.
The number of journals has increased over the centuries, at times showing exponential growth (Mabe, 2001;Mabe & Amin, 2003;Price, 1961, p. 166), but the journal form has remained remarkably stable in the social life of science.The intellectual development of the sciences and their organization, as well as growth of new specialties and disciplines, is organized, validated, and retained in scholarly journals.Ware & Mabe (2015) estimated that there were 28,100 peerreviewed journals published in English in 2015; in the Web of Science (WoS) in that year, 11,365 journals were indexed.The source journals can be expected to account for more than 90 percent of citations because of the skew in the distributions (Garfield, 1971;Seglen, 1992).
Specialties and new fields develop at a level above individual journals.Since journals relate to one another through citations and references (Price, 1965), perhaps the best way to identify networked communities is through the cross-referencing of these already aggregated citations and references into algorithmically significant clusters (Leydesdorff, 1987;Tijssen et al., 1987).This article describes a method of visualizing journal-to-journal connections to create 'macroepistemics' (Knorr-Cetina, 2007).New developments with validity can be expected to form new journals and journal clusters (van den Besselaar & Leydesdorff, 1996).
The classification of journals into disciplines is complicated by the many venues where one finds results.Multidisciplinary journals such as Science and Nature play important roles in scientific communication, especially in calling attention to advances in knowledge.More recently, open-access journals (e.g., PLoS ONE) have emerged which deliberately ignore disciplinary boundaries and thus tend to disturb the classification of journals.Some scholars suggest that the journal form may diminish in use in favor of archives and repositories (Harnad, 2001), although the majority of scholars view journals as increasingly important (e.g., Marbán, 1999).As Lavoie et al. (2014) detail, "the transition from print to a digital, networked environment likely means that decision-making around the scholarly record will have to become more consciously coordinated."Journals are classified into disciplinary groups by indexing services; the classifications serve a number of purposes.First, classification serves to facilitate the process of search and retrieval.
For pragmatic reasons, it has been considered "best practice" in evaluation studies to use the WoS Subject Categories (WCs)1 for the operationalization of fields of science even though these categories do not represent homogeneous sets (Leydesdorff & Bornmann, 2016).They are attributed to journals by manual indexing and have been elaborated incrementally for more than forty years by the providers of the database (Bensman & Leydesdorff, 2009;Pudovkin & Garfield, 2002, p. 1113).Journals can be attributed to more than one WC.
Beyond journal names and identity through sponsorship (e.g., by learned societies), articles can be classified in terms of co-citation, bibliographic coupling, or direct citation relations (Klavans & Boyack, 2016, in press).Clustering the database at the level of papers, however, requires access to large computing capacity and to entire copies of Scopus (Boyack et al., 2011) or the WoS (Waltman & van Eck, 2012).The problem of the validity of the delineations remains.As Schubert, Glänzel, & Braun (1989, at p. 7) have noted, "the field/subfield classification of papers is a neuralgic point of all kind of scientometric evaluations." Aware of the constraints of using WCs for evaluation purposes, Glänzel & Schubert (2003) developed a new journal classification system based on a pragmatic weighting of the results of algorithmic clustering of journals in terms of citation patterns against expert judgment.The Centre for Research and Development Monitoring ECOOM of the Catholic University of Leuven (Belgium) uses this classification system for evaluations.In the meantime, fast decomposition algorithms have been developed that can be used for classifications.Klavans & Boyack (in press, at p. 12, Table 3) list seven journal-based partitions of Scopus data currently in use.Rafols and Leydesdorff (2009) compared (i) the WCs and (ii) Glänzel & Schubert's (2003) alternative classification with two algorithmically generated ones: (iii) Newman & Girvan's (2004) algorithm applied to the matrix of 7,611 citing journals contained in the Journal Citation Reports (JCR) 2006; and (iv) a random-walk based algorithm submitted by Rosvall & Bergstrom (2008) that had been applied to 6,128 journals in the JCR 2004.The concordance between the four classifications was modest: in the 40-60% range (Rafols & Leydesdorff, 2009, Table 3, at p. 1828).This conclusion agrees with Boyack's estimate of 50% correct classifications for the WCs (Boyack, personal communication, 14 September 2008).However, most of the miscategorised journals appear to occur in areas within the close vicinity of categories indicated by the other classifications.In other words, the various decompositions are roughly consistent with each other, but imprecise.Despite the low correspondence, maps based on the different classifications can be rather similar (Leydesdorff & Rafols, 2009;Klavans & Boyack, 2009).
In summary, there are no unique or universally valid classifications of journals.Two runs of the same algorithmic decomposition may not provide the same results.Most algorithms begin by drawing a random number using the computer clock.However, Leydesdorff, Bornmann, & Zhou (2016, at p. 907) noted that VOSviewer-visualization software developed by CWTS and available free for download at http://www.vosviewer.com-cangenerate quasi-deterministic classifications when the seed number of the randomizer is set equal to a constant (the default is zero).Using this option, the decomposition can pragmatically be combined with visualizations in a hierarchical classification by using the output of each decomposition recursively as input to the further decomposition at a next-lower level (Waltman, van Eck, & Noyons, 2010).One begins at the top-level of the complete matrix and then extracts the clusters one by one; this process can be automated in a recursive loop if an option were added to VOSviewer for writing the output files to disk when running the program from the command line (Nees Jan van Eck, personal communication, 3 and 16 May 2016).
The most recent version 1.6.5 of VOSviewer (dated September 28, 2016), among other things, enables the user to run VOSviewer in a batch job from the command line.In this short communication, we report on generating such an automatic classification and visualization of the JCR-2015 data.The resulting classifications, visualizations, and routines are available at https://leydesdorff.github.io/journals/and http://www.leydesdorff.net/jcr15 .The website provides input files for journal maps for the more than 11,000 journals contained in the JCR-2015, at the various levels discussed above.
Although developed for JCR-data, the routines are formulated so that any 1-mode matrix can be decomposed similarly in terms of mappings using VOSviewer.Note that one can also export the clusters in the Pajek format so that the files can be used for other visualizations such as in Gephi.

Data
Using dedicated software, the JCR 2015 data for the Science and Social Sciences Citation Indexes was first organized into a matrix of citing versus cited journals.All 11,365 journals are included so that the matrix is 1-mode, albeit asymmetrical.VOSviewer symmetrizes the asymmetrical matrix internally by summing the cells (i,j) and (j,i).Six journals are not connected (Avian Res, EDN, Neuroforum, Austrian Hist Yearb, Curric Matters, and Policy Rev), and are therefore excluded from the analysis.Thus, we work with (11,365 -6 =) 11,359 journals.The results can be compared with results based on JCR-2014 data elaborated for a single branch in Leydesdorff, Bornmann, & Zhou (2016). 2 Table 1 provides descriptive statistics and network characteristics for the large components in 2015 and 2014.The intersection between these two years (using identical journal names) contains 11,009 journals.Table 1 shows that the network increases more in terms of links than nodes.However, the density and the clustering coefficients did not change.

-merge_small_clusters true"
VOSviewer is to be installed in the folder C:\vosviewer and one operates in the folder C:\temp .
The "minimum cluster size" is set to "two" in order to suppress isolates; repulsion is set to "zero" to optimize the visualizations.
The initial output is written to the files m0.txt for the map and n0.txt for the network (at level 1), respectively.The file m0.txt contains the clustering that is used by the routine for generating an input file for each of the clusters.This next round generates output files m1.txt, m2.txt, etc., as map files of VOSviewer which contain the information for drawing maps at the next-lower level (level 2).In a next round, each of these files is further decomposed into m1_1.txt,m1_2.txt,etc.

The global map based on JCR 2015 data
Figure 1 provides the global map based on 2015 data.One can compare this map, for example, with the 2014 map (Leydesdorff, Bornmann, & Zhou, 2016, at p. 906);5 the procedures for producing these two maps were virtually identical.However, the resulting delineations are notably different (Table 2; cf.Leydesdorff et al., 2016a, Table 4, at p. 907).Since the clustering is hierarchical, the extraction of different sets can sometimes become a trade-off among memberships of journals in different groups.For example, in 2014, as can be seen in Table 2, an eighth cluster of 343 neuroscience journals is distinguished.This same cluster is no longer visible in 2015; the same journals are split between a third cluster ("Medicine") and a fifth cluster ("Biomedical").In 2015, however, cluster seven (dark green in Figure 1) groups 583 journals into a "bio-agricultural" cluster.The extraction of this seventh set (before the extraction of the neuroscience group as the eighth set) changes the path of the decomposition so that a different sub-optimum is reached.Note that this different branching can be caused by relatively small differences in the data.
The program does not provide the disciplinary designations; labels can be added (subjectively) to the algorithmic artifacts by the analyst depending of the objectives of the study.As shown in Table 2, a tenth field of only six journals is identified in 2015.We have labeled this cluster "data analysis" based upon the titles of the six journals (Table 3) and their combined citation environment (Figure 2). 6While the larger environment of the cluster (indicated as "#Big Data-US") shows biomedicine, environmental sciences, chemistry, etc., the J Am Stat Assoc, a core journal in statistics, is mapped in close proximity to the cluster as is Commun ACM, a leading journal in computer science.The classification in 2015 (ten clusters) can be compared with the one in 2014 (nine clusters) for the 11,009 journals that are included in the JCR versions of both years.The chi-square of the cross-tabulation is highly significant (p<.001), but Cramer's V-a measure of the chi-square which varies between zero and one-is only 0.82.Ten percent of the journals are differently classified between 2014 and 2015.Although the clusters are reproducible within each year, the clustering is, in our opinion, not sufficiently reliable for comparisons across years.As noted, relatively small changes in numbers of citations can affect the order of the extractions in a hierarchical decomposition.

The social-sciences cluster
In both 2014 (with 3131) and 2015 (with 3274), cluster 1-representing the social sciences--is the largest group.This is not a homogenous cluster; but the citation patterns in the social sciences are statistically so different from those in natural sciences and engineering that they are sorted separately in the first pass of the decomposition (at level 1). Figure 3 provides the decomposition of this cluster at level 2; the information is summarized in Table 4 and compared with the corresponding table for 2014.Note that a light-brown cluster with 62 "library and information science" journals can be found in the middle of Figure 1 (indicated most clearly by the journal title "Scientometrics" circled).In Appendix 1, the two sets (for 2015 and 2014) are compared with the WC "information science and library science".Forty-three journals co-occur in all three lists; 49 co-occur in two of the three lists.The WC also includes 37 journals that belong mostly to the cluster of managementinformation-science journals (Leydesdorff & Bornmann, 2016).

Library and information sciences
As noted, the JCR-2015 set includes 12 journals that belong to a "statistics and methods" cluster.
Journals such as Social Networks are cited both in "information science" and in other fields such as "business & management" or "organization studies" (Leydesdorff et al., 2008).In 2014, for example, this journal is grouped with the J Artif Soc S in a cluster of 143 sociology journals, whereas Qual Quant is grouped among 335 economic journals.However, one is dis-advised to draw far-reaching conclusions on the basis of changes among two subsequent years (Leydesdorff & de Nooy, in press).
Further decomposition of the LIS set leads to six clusters which vary from three to 27 journals (Table 4).In other words, the relatively homogenous modules of the (social) sciences may be rather fine-grained.In our opinion, this leads to the question of whether one can use these clusters to normalize citation behavior above the level of individual journals.The delineations among "fields" and "subfields" (in terms of citation patterns) seem sensitive to weak fluctuations that might, from another perspective, be considered as noise (Leydesdorff, 2006).Patterns may be affected by specific events.For example, the publication of one or two special issues on the border between two specialisms may change the pattern and provide the impression of emerging new developments.From this perspective, one can question the suggestion made above that a new set of six journals were labeled as "data analysis" or "big data".This may be an over-interpretation on our side, influenced by the hype around this topic.Moreover, many articles about "big data" appear in journals other than the six journals listed in Table 3.

Discussion and Conclusions
The matrix of aggregated journal-journal citation relations represents a complex system of scientific communication that is both hierarchically layered and functionally differentiated in terms of scientific specialties and fields.Such a system cannot be decomposed unambiguously (Simon, 1973).The clusters can be related in other (e.g., methodological versus theoretical) dimensions; densities of communication in subsets can vary significantly.Referencing behavior norms may differ across fields.For example, an article in a biomedical specialty may contain more than forty references, while in other fields, such as mathematics, fewer than ten references is more common (Garfield, 1979;Moed, 2010).However, what is measured as "differences in citation behavior among fields" can also be an artifact of the different degrees of coverage of the field-specific literature by the database (Marx & Bornmann, 2015).Epistemically, references may function at research fronts to position the citing papers or acknowledge intellectual debt and/or credit to previous (that is, cited) publications (Leydesdorff, Bornmann, Comins, & Milojević, 2016).Bodies of specialist literature may interact in next-order-i.e., more generalist-layers carried by quality journals such as Science and Nature.
This complex interweaving of different dynamics is further complicated because all relevant distributions are heavily skewed (Seglen, 1992).Weak ties in one context can be strong ties from another perspective (Granovetter, 1973).As we have seen above, hierarchical decomposition follows a path downward so that the results are path-dependent and may lead to different suboptima.There is no objective yardstick to inform us how much better one representation is when compared with another (cf.Klavans & Boyack, 2016, in press).
In addition to the statistical quality of the distinctions, the groupings have to be labeled manually; this adds a subjective dimension of flexible interpretations with different meanings, since the labels are not provided by the decomposition itself.The labels are added by an analyst who, as a user of the system, may wish to mix pragmatic with intellectual considerations.Ex ante, one representation is as legitimate as another and no methodological prescription can be formulated.
Within this context of uncertainty and complexity, the proposed routine provides a means for testing one's assumptions without claiming authority; but with the advantage of reproducibility and the possibility of rich visualizations.The algorithm is semantically neutral: the routine will work on any 1-mode matrix and provide a purely algorithmic decomposition of the system into lower-level units in a series of layers.The advantages of using this decomposition and the quality of the visualizations will have to show their usefulness in bibliometric practices.The results may raise further questions and thus help to shape research ideas and agendas.

Figure 4
Figure 4 provides the map for the 62 LIS journals distinguished as a cluster in 2015.As in 2014,

Table 1 :
Network characteristics of the largest components of the matrix based on JCR 2015, compared with JCR 2014.
The two levels are attributed to individual journals at http://www.leydesdorff.net/jcr15/index.htm.Finally, all level- 3 files are run in VOSviewer in order to generate the classification at level 4.This classification is attributed to each journal as a hyperlink at http://www.leydesdorff.net/jcr15:by clicking on a journal name, one webstarts VOSviewer to generate a map of the citation environment of this journal at level 4. The user can save this map for further decomposition (at level 5; see below).

Table 2 :
Fields distinguished at the top level of JCR 2015 and 2014.

Table 3 :
Six journals in cluster 10

Table 4 :
Decomposition of the social-sciences cluster in 2014 and 2015. 7 7The overlap of journals (with the same name) between 2014 and 2015 contains 3,073 journals; Cramer's V between the two classifications is 0.78.