Citation-based clustering of publications using CitNetExplorer and VOSviewer

Clustering scientific publications is an important problem in bibliometric research. We demonstrate how two software tools, CitNetExplorer and VOSviewer, can be used to cluster publications and to analyze the resulting clustering solutions. CitNetExplorer is used to cluster a large set of publications in the field of astronomy and astrophysics. The publications are clustered based on direct citation relations. CitNetExplorer and VOSviewer are used together to analyze the resulting clustering solutions. Both tools use visualizations to support the analysis of the clustering solutions, with CitNetExplorer focusing on the analysis at the level of individual publications and VOSviewer focusing on the analysis at an aggregate level. The demonstration provided in this paper shows how a clustering of publications can be created and analyzed using freely available software tools. Using the approach presented in this paper, bibliometricians are able to carry out sophisticated cluster analyses without the need to have a deep knowledge of clustering techniques and without requiring advanced computer skills.


Introduction
Clustering techniques play a prominent role in bibliometric research. They are for instance used to identify groups of related publications, authors, or journals.
Clustering techniques have been developed mainly in fields such as statistics, computer science, and network science. Bibliometricians usually do not develop their own clustering techniques, but they use existing clustering techniques developed in other fields. They apply these techniques to bibliometric data sets, sometimes after adapting the techniques to the specific characteristics of bibliometric data.
When the number of objects to be clustered is relatively limited (e.g., at most a few hundred objects), analyzing and interpreting the results obtained from a clustering [...]

In the approach that we take in this paper, we first use CitNetExplorer to cluster publications based on their citation relations. For this purpose, CitNetExplorer employs a clustering technique that we have introduced in earlier papers (Waltman & Van Eck, 2012, 2013). We then use CitNetExplorer to analyze the resulting clustering solution at the level of individual publications. To facilitate the analysis of a clustering solution, the following features of CitNetExplorer are essential:

• Visualizing a citation network. CitNetExplorer can be used to visualize a citation network of publications, with publications shown along a time axis and with colors indicating the clusters to which publications belong. Using the visualization functionality of CitNetExplorer, we obtain an overview of the most frequently cited publications in a citation network, the citation relations between these publications, and the clusters to which the publications belong.
• Drilling down into a citation network. The drill down functionality of CitNetExplorer can be used to analyze a clustering solution at different levels of detail. We may for instance start with a visualization at the level of the entire citation network. We may then perform a drill down into one or more selected clusters, after which we are provided with a visualization at the level of the subnetwork consisting of the publications belonging to the selected clusters.
• Searching for publications. We can search for publications based on title, publication year, author name, and journal name. The search functionality of CitNetExplorer can be used to find publications that are of special interest, for instance all publications in a specific journal, and to find out to which clusters these publications belong.
VOSviewer is a software tool for constructing and visualizing bibliometric networks. In this paper, VOSviewer is used to complement CitNetExplorer.

Clustering technique
In this paper, we use the clustering technique that is available in the CitNetExplorer software tool. This section provides a discussion of this clustering technique. Subsection 2.1 explains how the relatedness of publications is determined, and Subsection 2.2 describes how publications are assigned to clusters. We refer to Waltman and Van Eck (2012, 2013) for a more extensive discussion of our clustering technique.

Determining the relatedness of publications
To cluster publications, we first need to determine the relatedness of publications. In the bibliometric literature, the most commonly used approaches to determine the relatedness of publications are based on either citation relations or word relations (for a more extensive discussion, see Van Eck & Waltman, 2014b). In the case of citation relations, a further distinction can be made between direct citation relations, bibliographic coupling relations, and co-citation relations (e.g., Boyack & Klavans, 2010; Klavans & Boyack, in press). In the case of word relations, shared words in the titles, abstracts, or full texts of publications serve as an indication of the relatedness of publications (e.g., Boyack et al., 2011; Janssens, Leta, Glänzel, & De Moor, 2006). Sometimes the relatedness of publications is determined using a combined approach that takes into account both citation relations and word relations (e.g., Boyack & Klavans, 2010; Janssens, Glänzel, & De Moor, 2008).
Our clustering technique determines the relatedness of publications based on direct citation relations. We prefer to use citation relations rather than word relations because the use of word relations involves some difficulties. Some words have a different meaning in different fields of science. These words may incorrectly indicate that publications from different fields are related to each other. Also, some words are very general and are used in many different fields. These words do not provide useful information on the relatedness of publications.
We prefer to use direct citation relations rather than bibliographic coupling relations (i.e., relations between publications that cite the same publication) or co-citation relations (i.e., relations between publications that are cited by the same publication) for two reasons. First, bibliographic coupling and co-citation relations are indirect relations, and we therefore expect them to provide less accurate information on the relatedness of publications than direct citation relations (Waltman & Van Eck, 2012). Second, there are many more bibliographic coupling or co-citation relations between publications than direct citation relations, and therefore the use of bibliographic coupling or co-citation relations may easily lead to computational problems. (This also applies to the use of word relations.)
Although we prefer the use of direct citation relations over the use of bibliographic coupling or co-citation relations, we acknowledge that the use of direct citation relations also has a disadvantage. Within the period of analysis, some publications may have no direct citation relations with other publications. When using direct citation relations, these publications cannot be properly assigned to a cluster. This problem is especially serious when the period of analysis is relatively short. When using bibliographic coupling relations rather than direct citation relations, one usually does not have this problem. We note that, in addition to our own work, the use of direct citation relations is also advocated in recent work by Klavans and Boyack (in press).
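To make the distinction between the three types of citation relations concrete, the following minimal Python sketch derives them from a list of (citing, cited) pairs. The function name `citation_relations` and the toy data are illustrative assumptions and are not part of CitNetExplorer or VOSviewer.

```python
from itertools import combinations


def citation_relations(citations):
    """Derive three sets of unordered publication pairs from (citing, cited) pairs."""
    cites = {}     # publication -> set of publications it cites
    cited_by = {}  # publication -> set of publications citing it
    for citing, cited in citations:
        cites.setdefault(citing, set()).add(cited)
        cited_by.setdefault(cited, set()).add(citing)

    # Direct citation: one publication cites the other.
    direct = {frozenset((a, b)) for a, refs in cites.items() for b in refs}
    # Bibliographic coupling: two publications cite the same publication.
    coupling = {frozenset(pair)
                for citers in cited_by.values()
                for pair in combinations(sorted(citers), 2)}
    # Co-citation: two publications are cited by the same publication.
    cocitation = {frozenset(pair)
                  for refs in cites.values()
                  for pair in combinations(sorted(refs), 2)}
    return direct, coupling, cocitation


# Hypothetical toy network: C and D cite A and B, and E cites A.
toy = [("C", "A"), ("C", "B"), ("D", "A"), ("D", "B"), ("E", "A")]
direct, coupling, cocitation = citation_relations(toy)
```

In this toy network, publication E is coupled to C and D through the shared reference A, even though E has no direct citation relation with either of them. Because every shared reference links all pairs of publications citing it, the number of bibliographic coupling relations can grow much faster than the number of direct citation relations in a large network.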

Clustering publications
After the relatedness of publications has been determined, our clustering technique assigns publications to clusters. Each publication is assigned to exactly one cluster. Hence, there is no overlap of clusters and there are no publications without a cluster assignment. It may be argued that there should be room for publications to be assigned to more than one cluster. However, allowing publications to be assigned to multiple clusters introduces significant technical challenges. For this reason, we prefer to assign publications to a single cluster only. For most publications, we believe that it is reasonable to assign them to just one cluster.
Publications are assigned to clusters by maximizing a quality function. The quality function that is used has been introduced in an earlier paper (Waltman & Van Eck, 2012). This quality function is a variant of the well-known modularity function of Newman and Girvan (2004) and Newman (2004) developed in the field of network science. The quality function is very similar to the quality function resulting from the so-called constant Potts model proposed by Traag, Van Dooren, and Nesterov (2011).
Our quality function has an important advantage over the popular modularity function. The modularity function suffers from a problem known as the resolution limit (Fortunato & Barthélemy, 2007). This problem causes the modularity function to yield counterintuitive results in certain situations. As shown by Traag et al. (2011), our quality function does not suffer from the resolution limit problem.
More specifically, our clustering technique assigns publications to clusters by maximizing the quality function

$$Q(x_1, \ldots, x_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} \delta(x_i, x_j) \left( a_{ij} - \frac{\gamma}{2n} \right), \qquad (1)$$

where n denotes the number of publications, a_ij denotes the relatedness of publication i with publication j, γ denotes a so-called resolution parameter, and x_i denotes the cluster to which publication i is assigned. The function δ(x_i, x_j) equals 1 if x_i = x_j and 0 otherwise. The relatedness of publication i with publication j is given by

$$a_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}}, \qquad (2)$$

where c_ij equals 1 if either publication i cites publication j or publication j cites publication i and c_ij equals 0 otherwise. Hence, if there is a direct citation relation between publications i and j, the relatedness of publication i with publication j is inversely proportional to the total number of direct citation relations of publication i. If there is no direct citation relation between publications i and j, the relatedness of the publications equals 0. Notice that our clustering technique ignores the direction of a citation (i.e., no distinction is made between publication i citing publication j and publication j citing publication i).
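The following Python sketch illustrates how the relatedness values and the quality function could be evaluated for a given assignment of publications to clusters. It assumes that the quality function has the constant-Potts-like form Q = Σ_i Σ_j δ(x_i, x_j)(a_ij − γ/(2n)) with a_ij = c_ij / Σ_k c_ik. The function names and the toy network are hypothetical and do not reflect the internals of CitNetExplorer.

```python
def relatedness(edges, n):
    """Return a[i][j] = c_ij / sum_k c_ik for an undirected 0/1 citation matrix."""
    c = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i != j:
            c[i][j] = c[j][i] = 1  # the direction of a citation is ignored
    a = []
    for i in range(n):
        total = sum(c[i])
        a.append([c[i][j] / total if total else 0.0 for j in range(n)])
    return a


def quality(a, clusters, gamma=1.0):
    """Evaluate the assumed quality function for a cluster assignment."""
    n = len(a)
    return sum(a[i][j] - gamma / (2 * n)
               for i in range(n)
               for j in range(n)
               if clusters[i] == clusters[j])


# Hypothetical toy network: two triangles joined by one citation relation.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
a = relatedness(edges, 6)
q_two = quality(a, [0, 0, 0, 1, 1, 1])     # one cluster per triangle
q_one = quality(a, [0, 0, 0, 0, 0, 0])     # everything in one cluster
q_single = quality(a, [0, 1, 2, 3, 4, 5])  # every publication on its own
```

With γ = 1, the assignment that places each triangle in its own cluster obtains a higher value than either the single-cluster or the all-singletons assignment. Increasing γ makes the penalty term γ/(2n) larger, which favors splitting publications over more clusters.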
The value of the resolution parameter γ in (1) should be chosen based on the purpose of the cluster analysis. Higher values of this parameter will yield a larger number of clusters. In other words, the higher the value of γ, the higher the level of detail of the clustering solution that will be obtained. In CitNetExplorer, the default value of γ is 1. However, we emphasize that there is no generally optimal value of γ. Our recommendation to users of our clustering technique is to try out different values of γ and to choose the value that seems to give the most useful results for the specific needs of a user.
In order to maximize the quality function in (1), our clustering technique uses the smart local moving algorithm introduced by Waltman and Van Eck (2013). This algorithm offers a more sophisticated alternative to the popular Louvain algorithm for modularity optimization (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). When the smart local moving algorithm and the Louvain algorithm are given a similar amount of computing time, the smart local moving algorithm typically identifies a clustering solution with a significantly higher value for the quality function. We refer to Waltman and Van Eck (2013) for an extensive comparison of the two algorithms.
Our clustering technique usually identifies a relatively limited number of larger clusters and a more substantial number of smaller clusters. Sometimes clusters are very small and for instance include only one or two publications. Because in many cases small clusters are of limited interest, a minimum cluster size parameter can be specified. Clusters that are too small can be either discarded or merged with other clusters. We refer to Waltman and Van Eck (2012) for a discussion of the approach that we take to merge small clusters with larger ones.
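To give an impression of what maximizing a quality function by moving publications between clusters looks like, the sketch below implements a plain greedy local moving heuristic in Python. It is deliberately much simpler than both the smart local moving algorithm and the Louvain algorithm and should not be read as an implementation of either. It assumes the quality function Q = Σ_i Σ_j δ(x_i, x_j)(a_ij − γ/(2n)); all function names are hypothetical.

```python
def relatedness(edges, n):
    """Return a[i][j] = c_ij / sum_k c_ik for an undirected 0/1 citation matrix."""
    c = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i != j:
            c[i][j] = c[j][i] = 1
    return [[c[i][j] / sum(c[i]) if sum(c[i]) else 0.0 for j in range(n)]
            for i in range(n)]


def local_moving(a, gamma=1.0, max_rounds=100):
    """Greedy heuristic: repeatedly move each publication to its best cluster."""
    n = len(a)
    clusters = list(range(n))  # start with every publication in its own cluster
    for _ in range(max_rounds):
        moved = False
        for i in range(n):
            # Contribution to Q of placing i in each cluster that is in use;
            # an empty cluster would contribute 0.
            gain = {}
            for j in range(n):
                if j != i:
                    label = clusters[j]
                    gain[label] = (gain.get(label, 0.0)
                                   + a[i][j] + a[j][i] - gamma / n)
            current = clusters[i]
            best, best_gain = current, gain.get(current, 0.0)
            for label, g in gain.items():
                if g > best_gain + 1e-12:
                    best, best_gain = label, g
            if best_gain < -1e-12:
                # Every non-empty cluster lowers Q, so open an unused cluster.
                used = set(clusters)
                best = next(lab for lab in range(n) if lab not in used)
            if best != current:
                clusters[i] = best
                moved = True
        if not moved:
            break
    return clusters


# Hypothetical toy network: two triangles joined by one citation relation.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels = local_moving(relatedness(edges, 6), gamma=1.0)
```

On this toy network the heuristic places each triangle in its own cluster. A heuristic of this kind can get stuck in poor local optima on realistic networks, which is one motivation for more sophisticated algorithms.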

Results
We now demonstrate how CitNetExplorer and VOSviewer can be used to cluster publications and to analyze the resulting clustering solutions. In our demonstration, we work with a large data set of publications in the field of astronomy and astrophysics. We emphasize that in this paper it is not our aim to assess the quality of our clustering solutions or to compare our clustering solutions with other alternative solutions. We do not have the domain knowledge required to provide an in-depth interpretation of our clusters and to assess their quality. For a comparison of our clustering solutions with other alternative solutions, we refer to the comparison paper by Velden, Boyack, Gläser, Koopman, Scharnhorst, and Wang (in press) in this special issue.

Data
We use the 'Astro data set' that is also used in other papers in this special issue. A general introduction to the data set is provided in the introductory paper by Gläser, Glänzel, and Scharnhorst (in press).
CitNetExplorer requires a citation network to be acyclic. When analyzing a citation network, CitNetExplorer will make sure that the network is acyclic by removing citation relations that cause the network to have cycles. CitNetExplorer will also remove citation relations for which the citing publication appeared in an earlier year than the cited publication (e.g., a publication from 2009 citing a publication from 2010). In the case of our data set, of the 929,364 citation relations between publications in the data set, 3,824 were removed by CitNetExplorer. Hence, the citation network analyzed using CitNetExplorer included 925,540 citation relations.
We note that the four clustering solutions do not have a hierarchical relationship with each other. For instance, a cluster in the most detailed clustering solution may overlap with more than one cluster in the second most detailed clustering solution.
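The two conditions on the citation network can be illustrated with a short Python sketch. It drops citation relations in which the citing publication is older than the cited one and then checks for remaining cycles using Kahn's topological sorting algorithm. How CitNetExplorer itself selects the citation relations to remove in order to break cycles is not described here, so the sketch only detects cycles. The function names and toy data are hypothetical.

```python
def drop_time_inconsistent(citations, year):
    """Keep only citations whose citing publication is not older than the cited one."""
    return [(citing, cited) for citing, cited in citations
            if year[citing] >= year[cited]]


def is_acyclic(citations):
    """Check for cycles with Kahn's algorithm (repeated removal of in-degree-0 nodes)."""
    successors = {}
    indegree = {}
    for citing, cited in citations:
        successors.setdefault(citing, []).append(cited)
        indegree[cited] = indegree.get(cited, 0) + 1
        indegree.setdefault(citing, 0)
    queue = [v for v, d in indegree.items() if d == 0]
    processed = 0
    while queue:
        v = queue.pop()
        processed += 1
        for w in successors.get(v, []):
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    # Publications on a cycle never reach in-degree 0 and are never processed.
    return processed == len(indegree)


# Hypothetical toy data: A (2009) cites B (2010), and B and C cite each other.
year = {"A": 2009, "B": 2010, "C": 2010}
citations = [("A", "B"), ("B", "C"), ("C", "B"), ("C", "A")]
kept = drop_time_inconsistent(citations, year)
```

In the toy data, the citation from A to B is dropped because A appeared in an earlier year than B. The mutual citations between B and C, which appeared in the same year, survive the year-based filter and still form a cycle, which is why a separate cycle removal step is needed.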
In the rest of this paper, our focus will be mainly on the level 1 clustering. To get an impression of the topics covered by the 22 level 1 clusters, Table 3 presents for each cluster the number of publications and five characteristic terms. The characteristic terms were extracted from the titles of the publications belonging to a cluster using the methodology described by Waltman and Van Eck (2012). A more extensive summary of the level 1 clusters is provided in Table A1 in the appendix. For each cluster, this table lists not only the number of publications and five characteristic terms but also the three journals with the largest number of publications and the most frequently cited publication. In addition, for each cluster, ten standardized terms are presented. These terms were selected using a standardized approach that has also been used in other papers in this special issue.

Using CitNetExplorer to analyze clustering solutions at the publication level
We first use CitNetExplorer to analyze the level 1 clustering. The analysis takes place at the level of individual publications. In the next subsection, we use VOSviewer to perform an analysis at an aggregate level.
For a given set of publications, CitNetExplorer can be used to get an overview of the most frequently cited publications, the citation relations between these publications, the temporal order of the publications, and the assignment of the publications to clusters. Suppose we are interested in getting a better understanding of the publications belonging to level 1 clusters 1, 2, 3, and 4 (i.e., the four largest level 1 clusters).
The visualization provided in Figure 1 is static. In the CitNetExplorer software tool, the same visualization is presented in an interactive way. This for instance makes it possible to zoom in on a specific area in the visualization and to explore in more detail the publications located in that area. Also, by hovering the mouse over a publication, bibliographic information on the publication is presented, for instance the authors, the title, and the journal in which the publication appeared.
What do we learn from the visualization provided in Figure 1? First of all, the visualization confirms that level 1 clusters 1, 2, 3, and 4 cover relatively independent bodies of literature. Most citation relations shown in the visualization are between publications belonging to the same cluster rather than between publications belonging to different clusters. In addition, the visualization reveals that clusters 1, 2, and 4 (shown in blue, green, and orange, respectively) are more strongly connected to each other than to cluster 3 (shown in purple), at least when focusing on the most highly cited publications in the different clusters. Of the four clusters, cluster 3 therefore appears to be the one that is most independent from the others.
A more detailed interpretation of the visualization presented in Figure 1 requires expert knowledge of the field of astronomy and astrophysics. Using the visualization, an expert in the field obtains a basic understanding of the topics covered by the different clusters and of the developments taking place within each cluster. An expert will probably be familiar with many of the publications shown in the visualization and will have some general idea of the role played by these publications in the development of the field of astronomy and astrophysics. By combining this expert knowledge with the information offered by the visualization, on the one hand an expert can provide an interpretation of the clusters and on the other hand the expert can deepen his or her understanding of the astronomy and astrophysics field.
Suppose next that we would like to explore level 1 cluster 2 in more detail. This can be done using the drill down functionality of CitNetExplorer. This functionality makes it possible to drill down into a specific subnetwork of a citation network. In this case, a drill down is performed into the subnetwork consisting of the publications belonging to cluster 2 and the citation relations between these publications. After drilling down, the visualization presented in Figure 2 is obtained. Of the 8,954 publications belonging to cluster 2, the visualization shows the 100 most frequently cited ones. As discussed in Subsection 3.2, publications were clustered at four levels of detail. In the visualization, the color of a publication is determined by the cluster to which the publication belongs in the level 3 clustering. As can be seen in the visualization, the most frequently cited publications in level 1 cluster 2 belong mostly to three different level 3 clusters. These clusters are indicated using the colors red, brown, and light blue in the visualization.
The visualization presented in Figure 2 provides insight into the subdivision of level 1 cluster 2 into smaller level 3 clusters. If a deeper understanding of the literature is required, one could perform a further drill down. In this way, a specific level 3 cluster could be explored in more detail. In a next step, another drill down could be performed to explore an even smaller level 4 cluster.
An analysis using CitNetExplorer takes place at the level of individual publications. In many cases, one may also want to analyze a clustering solution at an aggregate level. This is not possible using CitNetExplorer, but it can be accomplished using other software tools. In particular, VOSviewer can be used for this purpose, as discussed in the next subsection.

Using VOSviewer to analyze clustering solutions at an aggregate level
We now use VOSviewer to carry out a further analysis of the level 1 clustering.
The analysis is performed at an aggregate level and uses two visualizations. One visualization shows the level 1 clusters and the citation relations between these clusters. The other visualization uses a term map to indicate the topics that are covered by a level 1 cluster.
A visualization of the 22 level 1 clusters and their citation relations is provided in Figure 3. An interactive version of the visualization provided in Figure 3 is available online at http://goo.gl/968hLw. The interactive visualization offers additional information not visible in Figure 3. In particular, when the mouse is hovered over a cluster, more detailed information on the cluster is presented, similar to the information provided in Table 3.
Suppose now that we would like to get a better understanding of a specific level 1 cluster, for instance cluster 3. For this purpose, we use the term map visualization presented in Figure 4. To create this visualization, the titles and abstracts of the 7,998 publications belonging to cluster 3 were analyzed using natural language processing techniques (Van Eck & Waltman, 2011). For each publication, the terms occurring in the title and abstract of the publication were identified. Of all terms that were found in at least 15 publications, the 1,420 terms that seemed most relevant were algorithmically selected. These terms are shown in the term map visualization provided in Figure 4. Each term is represented by a circle, and some terms are also indicated by a label. (VOSviewer aims to avoid overlapping labels, and therefore labels are visible only for some of the terms.) The size of a term reflects the number of publications in which the term was found, and the distance between two terms offers an approximate indication of the relatedness of the terms. The relatedness of terms was determined based on co-occurrences. In other words, the larger the number of publications in which two terms were both found, the stronger the relation between the terms. Colors represent groups of terms that are relatively strongly related to each other. These groups were identified using the clustering technique of VOSviewer that was also mentioned above. In the visualization, the strongest relations between terms are also indicated using curved lines.
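The co-occurrence counting that underlies a term map can be sketched in a few lines of Python. The sketch assumes that the terms of each publication have already been identified. It does not reproduce the natural language processing or the relevance-based term selection of VOSviewer. The function name, the threshold value, and the example terms are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(term_sets, min_occurrences=1):
    """Count in how many publications each term and each pair of terms occurs."""
    occurrences = Counter(t for terms in term_sets for t in set(terms))
    kept = {t for t, k in occurrences.items() if k >= min_occurrences}
    pairs = Counter()
    for terms in term_sets:
        # Each unordered pair of retained terms is counted once per publication.
        for a, b in combinations(sorted(set(terms) & kept), 2):
            pairs[(a, b)] += 1
    return occurrences, pairs


# Hypothetical terms for three publications.
publications = [
    {"solar wind", "coronal mass ejection"},
    {"solar wind", "coronal mass ejection", "solar flare"},
    {"solar flare", "sunspot"},
]
occurrences, pairs = cooccurrence(publications, min_occurrences=2)
```

In this example, 'sunspot' occurs in only one publication and is therefore dropped by the threshold, while 'solar wind' and 'coronal mass ejection' co-occur in two publications and would be drawn as the most strongly related pair.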
What does the term map visualization tell us about the topics that are covered by level 1 cluster 3? Publications belonging to cluster 3 seem to study various types of solar phenomena. In the right area of the visualization, we observe terms dealing with the phenomenon of solar wind and the related phenomenon of coronal mass ejection. In the top area, terms related to the phenomenon of solar flares can be found. Terms related to the phenomenon of sunspots are located in the left area of the visualization. In the bottom area, we observe the term 'solar cycle'. Solar phenomena are often influenced by the solar cycle.
An interactive version of the term map visualization presented in Figure 4 is available online at http://goo.gl/sotbF1. In the interactive visualization, it is possible to zoom in on specific areas in the visualization. When zooming in, the labels of more and more terms become visible, making it possible to interpret a specific area in the visualization in more detail.

Conclusion
Data downloaded from the online Web of Science database can be provided directly as input to the software tools, without the need to preprocess the data. Of course, despite the ease of use of our tools, a basic understanding of clustering techniques remains essential to perform meaningful analyses and to avoid misinterpretations of the results that are obtained.
The clustering technique that we have used is based on recent developments in the fields of network science and bibliometrics (Traag et al., 2011; Waltman & Van Eck, 2012, 2013). In addition to our own work, this clustering technique has also been used in the work of other bibliometricians (Boyack & Klavans, 2014; Klavans & Boyack, in press; Small, Boyack, & Klavans, 2014). Our clustering technique determines the relatedness of publications based on direct citation relations. A major advantage of the use of direct citation relations is the possibility to efficiently cluster very large numbers of publications (e.g., tens of millions of publications). A disadvantage is that, due to a lack of direct citation relations, some publications cannot be properly assigned to a cluster. We note that, in addition to our clustering technique, other clustering techniques could also be considered for clustering publications based on direct citation relations. For instance, in a recent study (Šubelj, Van Eck, & Waltman, 2015), we found indications suggesting that the map equation technique, used together with the Infomap optimization algorithm (Bohlin, Edler, Lancichinetti, & Rosvall, 2014; Rosvall & Bergstrom, 2008), may give particularly good results.
While we use CitNetExplorer to analyze a clustering solution at the level of individual publications, we use VOSviewer to analyze a clustering solution at an aggregate level. Two visualizations provided by VOSviewer play an important role. The first visualization shows the clusters in a clustering solution and the citation relations between these clusters. The second visualization uses a so-called term map to indicate the topics that are covered by a cluster. This visualization shows the most important terms occurring in the publications belonging to a cluster and the co-occurrence relations between these terms.
This paper is organized as follows. Section 2 discusses the clustering technique that is used by CitNetExplorer to cluster publications based on their citation relations. Section 3 demonstrates the use of CitNetExplorer and VOSviewer to cluster publications and to analyze the resulting clustering solutions. CitNetExplorer is used to cluster more than 100,000 publications in the field of astronomy and astrophysics, and CitNetExplorer and VOSviewer are used together to analyze the resulting clustering solutions. Section 4 concludes the paper.
Figure 1 provides a CitNetExplorer visualization of the 100 most frequently cited publications in these four clusters. Each publication is indicated by a circle, and publications are labeled by the last name of the first author. The vertical dimension represents time, with publications in the top part of the visualization being older and publications in the bottom part being more recent. In the horizontal dimension, publications are positioned based on their relatedness in terms of citations. Publications that are strongly related in terms of citations, taking into account not only direct citation relations between publications but also indirect citation relations, tend to be located close to each other in the horizontal dimension. Publications that are only weakly related in terms of citations are located further away from each other. The curved lines between publications indicate citation relations, with the citing publication always being located below the cited publication. The darker lines represent direct citation relations, while the lighter lines represent indirect citation relations. There is an indirect citation relation from publication A to publication B if publication A does not directly cite publication B but if publication A for instance cites publication C and publication C in turn cites publication B. The color of a publication indicates the cluster to which the publication belongs, with blue, green, purple, and orange corresponding with, respectively, clusters 1, 2, 3, and 4.

Figure 1. CitNetExplorer visualization of the 100 most frequently cited publications in level 1 clusters 1, 2, 3, and 4. Colors indicate the level 1 cluster to which a publication belongs.

Figure 2. CitNetExplorer visualization of the 100 most frequently cited publications in level 1 cluster 2. Colors indicate the level 3 cluster to which a publication belongs.

In the visualization provided in Figure 3, the size of a cluster reflects the number of publications belonging to the cluster. Larger clusters include more publications. The distance between two clusters approximately indicates the relatedness of the clusters in terms of citations. Clusters that are located close to each other tend to be strongly related in terms of citations, while clusters that are located further away from each other tend to be less strongly related. The curved lines between the clusters also reflect the relatedness of clusters, with the thickness of a line representing the number of citations between two clusters. The horizontal and vertical axes have no special meaning.
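The aggregate-level network shown in such a visualization can be derived from publication-level data by counting publications per cluster and citations between clusters. The Python sketch below shows one way to do this. Preparing actual input files for VOSviewer involves additional steps that are not covered here, and the function name and toy data are hypothetical.

```python
from collections import Counter


def cluster_network(citations, cluster_of):
    """Aggregate publication-level citations into cluster sizes and cluster links."""
    sizes = Counter(cluster_of.values())
    links = Counter()
    for citing, cited in citations:
        a, b = cluster_of.get(citing), cluster_of.get(cited)
        # Skip unclustered publications and citations within a single cluster.
        if a is None or b is None or a == b:
            continue
        links[tuple(sorted((a, b)))] += 1  # the direction of a citation is ignored
    return sizes, links


# Hypothetical toy data: four publications assigned to three clusters.
cluster_of = {"p1": 1, "p2": 1, "p3": 2, "p4": 3}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p3"), ("p1", "p3")]
sizes, links = cluster_network(citations, cluster_of)
```

Here `sizes` would determine the size of each cluster in the visualization and `links` the thickness of the line between two clusters.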

Figure 3. VOSviewer visualization of the 22 level 1 clusters and their citation relations. An interactive version of the visualization is available online at http://goo.gl/968hLw.

Figure 4. VOSviewer term map visualization for level 1 cluster 3. The visualization shows 1,420 terms extracted from the titles and abstracts of the publications belonging to the cluster. The strongest co-occurrence relations between terms are shown as well. An interactive version of the visualization is available online at http://goo.gl/sotbF1.
We have demonstrated the capabilities of CitNetExplorer and VOSviewer for clustering publications and for analyzing the resulting clustering solutions. However, the combined use of the two software tools is somewhat laborious, and preparing the input data for VOSviewer based on the clustering results provided by CitNetExplorer is not entirely straightforward. In future research, we therefore plan to work on the development of a single integrated software tool in which many of the key features of CitNetExplorer and VOSviewer are brought together. We have in mind a tool that combines different types of interactive visualizations to support users in exploring the scientific literature. A technique for clustering publications based on direct citation relations, similar to the technique used in this paper, will be at the core of the new tool. Like in this paper, it will be possible to create clustering solutions at different levels of detail. The new tool will provide interactive functionality for browsing through a hierarchical structure of clusters, and the tool will use visualizations similar to the ones used in this paper to show citation relations between publications and between clusters and to indicate the topics covered by clusters. The dynamics of clusters, revealing for instance how interest in a topic has grown or declined over time, will be made visible as well.

Table 1. Statistics for the data set of astronomy and astrophysics publications.

Using CitNetExplorer to cluster publications
We clustered the publications in our data set using the clustering technique that is available in CitNetExplorer. We refer to Section 2 for a discussion of this clustering technique. Our citation network of 111,616 publications has a largest component that includes 101,828 publications. Only these 101,828 publications were included in the cluster analysis. The other 9,788 publications were not assigned to a cluster.
As already explained in Subsection 2.2, clustering solutions can be created at different levels of detail. The choice of the most suitable level of detail is not a technical one but instead depends on the purpose of the cluster analysis. Our recommendation is to create multiple clustering solutions at different levels of detail and to use the solution (or the solutions) that fits best with the needs one has.
In line with this idea, we used CitNetExplorer to create four clustering solutions, each providing a different level of detail. The clustering solutions are based on different values of the resolution parameter and the minimum cluster size parameter. Clusters that did not meet the minimum cluster size criterion were merged with larger clusters.
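Restricting a cluster analysis to the largest component of a citation network can be sketched as follows in Python. The sketch treats citation relations as undirected and uses a breadth-first search. The function name and the toy network are hypothetical, and the sketch is not taken from CitNetExplorer.

```python
from collections import deque


def largest_component(nodes, edges):
    """Return the largest connected component of an undirected network."""
    neighbors = {v: set() for v in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    best = set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collecting everything reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy network: publication 5 has no citation relations at all.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (3, 4)]
main = largest_component(nodes, edges)
excluded = set(nodes) - main
```

In the toy network, publications 3, 4, and 5 fall outside the largest component and would not be assigned to a cluster, in the same way as the 9,788 publications in our data set that lie outside the largest component of 101,828 publications.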

For each of the four clustering solutions, Table 2 reports the values (see Waltman & Van Eck, 2012) of the resolution parameter and the minimum cluster size parameter. The table also provides for each clustering solution a number of statistics. These are the number of clusters, the average number of publications per cluster, and the number of publications in the smallest and the largest cluster. As can be seen in Table 2, the clustering solution that provides the lowest level of detail, referred to as the level 1 clustering, includes 22 clusters. This clustering solution has an average cluster size of 4,629 publications and a maximum cluster size of almost 15,000 publications. On the other hand, the clustering solution that provides the highest level of detail, referred to as the level 4 clustering, includes 434 clusters. This clustering solution has an average cluster size of 235 publications and a maximum cluster size of somewhat more than 1,000 publications. The statistics reported in Table 2 make clear that, regardless of the level of detail of a clustering solution, the distribution of publications over clusters is quite skewed. This is a typical phenomenon when our technique for clustering publications is used (for more details, see Waltman & Van Eck, 2012).

Table 2. Parameters and statistics for the different clustering solutions.

Table 3. Brief summary of the 22 level 1 clusters.
We have demonstrated the use of CitNetExplorer and VOSviewer for clustering publications and for analyzing the resulting clustering solutions. Bibliometricians usually do not develop their own clustering techniques, but instead they apply existing clustering techniques in a bibliometric context. The approach presented in this paper is well suited for this purpose. The software tools that we have used are freely available. Using these tools, publications can be clustered without the need to have a deep knowledge of clustering techniques. In addition, no advanced computer skills are required.