Skip to main content
Log in

Memetic search for overlapping topics based on a local evaluation of link communities

  • Published:
Scientometrics Aims and scope Submit manuscript

Abstract

In spite of recent advances in field delineation methods, bibliometricians still don’t know the extent to which their topic detection algorithms reconstruct ‘ground truths’, i.e., thematic structures in the scientific literature. In this paper, we demonstrate a new approach to the delineation of thematic structures that attempts to match the algorithm to theoretically derived and empirically observed properties all thematic structures have in common. We cluster citation links rather than publication nodes, use predominantly local information and search for communities of links starting from seed subgraphs in order to allow for pervasive overlaps of topics. We evaluate sets of links with a new cost function and assume that local minima in the cost landscape correspond to link communities. Because this cost landscape has many local minima we define a valid community as the community with the lowest minimum within a certain range. Since finding all valid communities is impossible for large networks, we designed a memetic algorithm that combines probabilistic evolutionary strategies with deterministic local searches. We apply our approach to a network of about 15,000 Astronomy and Astrophysics papers published 2010 and their cited sources, and to a network of about 100,000 Astronomy and Astrophysics papers (published 2003–2010) which are linked through direct citations.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13

Similar content being viewed by others

Notes

  1. The same argument can be made for a term used in one paper. Exploiting this property alone and in combination with links to cited sources is a task for future work.

  2. The sum runs through all nodes but only boundary nodes of L contribute to it because for inner nodes of L we have \(k_i^\mathrm {out}(L) = 0\) and for nodes not attached to links in L we have \(k_i^\mathrm {in}(L)=0\).

  3. Applying Kirchhoff’s laws we obtain total resistance of all links of node i as \(1/k_i^\mathrm {in}(L) +1/k_i^\mathrm {out}(L) = [k_i^\mathrm {in}(L)+ k_i^\mathrm {out}(L)] / [k_i^\mathrm {in}(L) k_i^\mathrm {out}(L)]=k_i/[k_i^\mathrm {in}(L) k_i^\mathrm {out}(L)]\).

  4. cf. Evans and Lambiotte (2009) and for calculations see Havemann et al. (2015a, "Appendix").

  5. The central node is the boundary node for both triangles with equal internal and external degree of 2. This gives \(\sigma (L) = 1\) and \(\varPsi = 1 / 3\). All other partitions cut through two nodes and have higher values of \(\varPsi\).

  6. We only compare connected link sets because for a community L one can construct an unconnected link set \(L \cup L_s\) with \(\varPsi (L \cup L_s) < \varPsi (L)\) (which makes L invalid) by adding a small link set \(L_s\) to L if \(\varPsi (L_s)\) is small enough and \(L_s\) has no connection to L.

  7. Instead of the function \(\varPsi\) defined in Eq. (2), the following function was implemented: \((\sigma (L)/k_\mathrm {in}(L))(1-k_\mathrm {in}(L)/2m)\).

  8. See webpage http://researchdata.ibi.hu-berlin.de/comparison2010.htm for cluster description sheets. The whole clustering solution and more statistical data can be downloaded from http://researchdata.ibi.hu-berlin.de/AstronomyandAstrophysics/2010/.

  9. Coverage is decreased by this omission only by a negligible amount.

  10. See webpage http://researchdata.ibi.hu-berlin.de/comparison.htm for cluster description sheets. The whole clustering solution and more statistical data can be downloaded from http://researchdata.ibi.hu-berlin.de/Astronomy&Astrophysics/2003-2010/.

  11. The Salton index measures similarity of two sets and is defined as the ratio of the size of their intersection and the geometric mean of their sizes. It is also called Salton’s cosine because it can be calculated as the cosine of the angle between two vectors characterising the sets.

  12. cf. webpage http://researchdata.ibi.hu-berlin.de/comparison.htm with clusters c 18 and hd 25.

  13. We have put papers of each hd-community into the 2010-network and linked them via all their cited sources and calculated the sizes of their main components.

  14. There are also four small h-communities which have only h 1 as their supertopic. Two of them have only papers in solid-state physics, the other two are overlapping with h 2 and h 7.

  15. cf. also http://astrothesaurus.org/. and http://www.dataharmony.com/.

  16. cf. Subject Areas on this webpage of Physical Review D: http://journals.aps.org/prd/authors/editorial-policies-practices.

  17. E.g., 1050 of the 1175 particle-physics papers in Phys. Rev. D are among the 1272 full papers in h 10.

  18. Note that h-topics in Fig. 10 cover a smaller part of the their network than h d-topics in Fig. 13. Especially, a lot of papers about galaxies are missing in Fig. 10 (cf. next section).

  19. There are 21 subtopics of hd 3 among all 381 valid hd-communities, but we have deselected them because their \(\varPsi\)-value is above the threshold. Thus, hd 3 has substructures but only faint ones and their inclusion would increase coverage only by 4% resulting in 72%.

  20. We found the area of clustering solution h that is not occupied by communities but by h 1 largely coinciding with the 2010 proportion of the area not covered by clustering solution hd without the three largest communities (90% of 3013 papers with membership >1/2 not covered by the reduced h-solution are also in the set of 3881 papers not covered by the reduced hd-solution).

  21. For example, one evolution in which a large community (with nearly 18% of papers 2003–2010) is constructed run nearly six days on a machine with 16 CPUs.

References

  • Ahn, Y.-Y., Bagrow, J.P., & Lehmann, S. (2010). Link communities reveal multi-scale complexity in networks. Nature, 466, 761–764. Cf. arXiv preprint arXiv:0903.3178. Accessed December 1, 2016.

  • Amelio, A., & Pizzuti, C. (2014). Overlapping community discovery methods: A survey. Social Networks: Analysis and Case Studies, 105. Cf. arXiv preprint arXiv:1411.3935. Accessed December 1, 2016.

  • Amsterdamska, O., & Leydesdorff, L. (1989). Citations: Indicators of significance? Scientometrics, 15(5–6), 449–471.

    Article  Google Scholar 

  • Boyack, K. W. (2017a). Investigating the effect of global data on topic detection. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data–different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2297-y. Cf. preprint on  http://www.topic-challenge.info/.  

  • Boyack, K. (2017b). Thesaurus-based methods for mapping contents of publication sets. In Gläser, J., Scharnhorst, A. & Glänzel, W. (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2304-3.

  • Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404.

    Article  Google Scholar 

  • Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.

    Article  Google Scholar 

  • Chubin, D. E. (1976). The conceptualization of scientific specialties. Sociological Quarterly, 17(4), 448–476.

    Article  Google Scholar 

  • Clauset, A. (2005). Finding local community structure in networks. Physical Review E, 72(2), 26132.

    Article  Google Scholar 

  • Cozzens, S. E. (1985). Comparing the sciences: Citation context analysis of papers from neuropharmacology and the sociology of science. Social Studies of Science, 15(1), 127–153.

    Article  Google Scholar 

  • Edge, D. O., & Mulkay, M. J. (1976). Astronomy transformed. The emergence of radio astronomy in Britain. New York: Wiley.

    Google Scholar 

  • Evans, T. S., & Lambiotte, R. (2009). Line graphs, link partitions, and overlapping communities. Physical Review E, 80(1), 16105.

    Article  Google Scholar 

  • Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174.

    Article  MathSciNet  Google Scholar 

  • Gach, O., & Hao, J.-K. (2012). A memetic algorithm for community detection in complex networks. In C. A. C. Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, & M. Pavone (Eds.), Parallel problem solving from nature—PPSN XII, Number 7492 in Lecture Notes in Computer Science (pp. 327–336). Berlin: Springer.

  • Glänzel, W., & Thijs, B. (2015). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. In A.A. Salah, Y. Tonta, A.A.A. Salah, C. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015 Istanbul: 15th international society of scientometrics and informetrics conference (pp. 1085–1090).

  • Gläser, J. (2006). Wissenschaftliche Produktionsgemeinschaften: die soziale Ordnung der Forschung. New York: Campus Verlag.

    Google Scholar 

  • Gläser, J., Heinz, M., & Havemann, F. (2015). Epistemic Diversity as distribution of paper dissimilarities. In A. A. Salah, Y. Tonta, A. A. A. Salah, C. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015 Istanbul: 15th international society of scientometrics and informetrics conference (pp. 1006–1017).

  • Glenisson, P., Glänzel, W., Janssens, F., & Moor, B. D. (2005). Combining full text and bibliometric information in mapping scientific disciplines. Information Processing & Management, 41(6), 1548–1572.

    Article  Google Scholar 

  • Gong, M., Fu, B., Jiao, L., & Du, H. (2011). Memetic algorithm for community detection in networks. Physical Review E, 84(5), 056101.

    Article  Google Scholar 

  • Havemann, F., Gläser, J., & Heinz, M. (2015a). Detecting overlapping link communities by finding local minima of a cost function with a memetic algorithm. Part 1: Problem and method. Cf. arXiv preprint arXiv:1501.05139. Accessed December 1, 2016.

  • Havemann, F., Gläser, J., & Heinz, M. (2015b). A link-based memetic algorithm for reconstructing overlapping topics from networks of papers and their cited sources. In A. A. Salah, Y. Tonta, A. A. A. Salah, C. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015 Istanbul: 15th international society of scientometrics and informetrics conference (pp. 1054–1060).

  • Havemann, F., Heinz, M., Struck, A., Gläser, J. (2011). Identification of overlapping communities and their hierarchy by locally calculating community-changing resolution levels. Journal of Statistical Mechanics: Theory and Experiment 2011, P01023. Cf. arXiv preprint arXiv:1008.1004. Accessed December 1, 2016.

  • Healey, P., Rothman, H., & Hoch, P. K. (1986). An experiment in science mapping for research planning. Research Policy, 15(5), 233–251.

    Article  Google Scholar 

  • Hric, D., Darst, R. K., & Fortunato, S. (2014). Community detection in networks: Structural communities versus ground truth. Physical Review E, 90(6), 062805.

    Article  Google Scholar 

  • Katz, J. S. (1999). The self-similar science system. Research Policy, 28(5), 501–517.

    Article  Google Scholar 

  • Klavans, R., & Boyack, K. (2011). Using global mapping to create more accurate document-level maps of research fields. Journal of the American Society for Information Science and Technology, 62(1), 1–18.

  • Koopman, R., & Wang, S. (2017). Mutual Information based labelling and comparing clusters. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2305-2.

  • Lancichinetti, A., Fortunato, S., & Kertesz, J. (2009). Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics11, 033015. Cf. arXiv preprint arXiv:0802.1218. Accessed December 1, 2016.

  • Leydesdorff, L. (2004). Clusters and maps of science journals based on bi-connected graphs in the Journal Citation Reports. Journal of Documentation, 60(4), 371–427.

    Article  Google Scholar 

  • Leydesdorff, L., & Rafols, I. (2009). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2), 348–362.

    Article  Google Scholar 

  • Leydesdorff, L., & Rafols, I. (2012). Interactive overlays: A new method for generating global journal maps from Web-of-Science data. Journal of Informetrics, 6(2), 318–332.

    Article  Google Scholar 

  • Li, Z., Zhang, X.-S., Wang, R.-S., Liu, H., & Zhang, S. (2013). Discovering link communities in complex networks by an integer programming model and a genetic algorithm. PLoS ONE, 8(12), e83739.

    Article  Google Scholar 

  • Ma, L., Gong, M., Liu, J., Cai, Q., & Jiao, L. (2014). Multi-level learning based memetic algorithm for community detection. Applied Soft Computing, 19, 121–133.

    Article  Google Scholar 

  • Neri, F., Cotta, C., & Moscato, P. (Eds.). (2012). Handbook of memetic algorithms, volume 379 of studies in computational intelligence. Berlin: Springer.

    Google Scholar 

  • Peel, L., Larremore, D. B., & Clauset, A. (2016). The ground truth about metadata and community detection in networks. Cf. arXiv preprint arXiv:1608.05878. Accessed December 1, 2016.

  • Pizzuti, C. (2009). Overlapped community detection in complex networks. In Proceedings of the 11th annual conference on genetic and evolutionary computation (pp. 859–866). ACM.

  • Pizzuti, C. (2012). Boosting the detection of modular community structure with genetic algorithms and local search. In Proceedings of the 27th annual ACM symposium on applied computing (pp. 226–231).

  • Shi, C., Cai, Y., Fu, D., Dong, Y., & Wu, B. (2013). A link clustering based overlapping community detection algorithm. Data & Knowledge Engineering, 87, 394–404.

    Article  Google Scholar 

  • Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

    Article  Google Scholar 

  • van Eck, N. J., & Waltman, L. (2017). Citation-based clustering of publications using CitNetExplorer and VOSviewer. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2300-7. Cf. arXiv preprint arXiv:1702.03411. Accessed February 20, 2017.

  • van Raan, A. (1996). Advanced bibliometric methods as quantitative core of peer review based evaluation and foresight exercises. Scientometrics, 36(3), 397–420.

    Article  Google Scholar 

  • van Raan, A. F. J. (1991). Fractal geometry of information space as represented by co-citation clustering. Scientometrics, 20, 439–449.

    Article  Google Scholar 

  • Velden, T., Boyack, K., Gläser, J., Koopman, R., Scharnhorst, A., & Wang, S. (2017a). Comparison of topic extraction approaches and their results. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science. Special Issue of Scientometrics. doi:10.1007/s11192-017-2306-1. Cf. preprint on  http://www.topic-challenge.info/.

  • Velden, T., Yan, S., & Lagoze, C. (2017b). Infomap clustering of direct citation network and topic affinity analysis. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science. Special Issue of Scientometrics. doi:10.1007/s11192-017-2299-9. Cf. preprint on http://www.topic-challenge.info/.

  • Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392.

    Article  Google Scholar 

  • Whitley, R. (1974). Cognitive and social institutionalization of scientific specialties and research areas. In R. Whitley (Ed.), Social processes of scientific development (pp. 69–95). London: Routledge & Kegan Paul.

    Google Scholar 

  • Xie, J., Kelley, S., & Szymanski, B. K. (2013). Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys, 45(4), 43:1–43:35.

    Article  MATH  Google Scholar 

  • Yau, C.-K., Porter, A., Newman, N., & Suominen, A. (2014). Clustering scientific documents with topic modeling. Scientometrics, 100(3), 767–786.

    Article  Google Scholar 

  • Zhang, Z., Zhang, N., Zhong, C., & Duan, L. (2015). Detecting overlapping communities with triangle-based rough local expansion method. In D. Ciucci, G. Wang, S. Mitra, & W.-Z. Wu (Eds.), Rough sets and knowledge technology, number 9436 in Lecture Notes in Computer Science (pp. 446–456). Springer International Publishing. doi:10.1007/978-3-319-25754-9_39.

Download references

Acknowledgements

First, we want to thank all other authors of this special issue for doing this collaborative clustering exercise with us and for giving us many suggestions and hints. Special thanks to the members of the advisory board of our diversity project (http://141.20.126.172/~div/) who accompanied it throughout seven years. The work published here was partly funded by the German Research Ministry (01UZ0905). We thank Andreas Prescher for programming a fast C++-based R-package for parallel node-wise memetic search in the \(\varPsi\)-landscape and a package for parallel link-wise local search. We thank Theresa Velden for the giant component of the network of direct citations between papers published 2003–2010 and her cluster solution for the network of papers published 2010 obtained with Infomap. Thanks to Nees Jan van Eck for the cluster solutions for this network he had obtained with CitNetExplorer. We also would like to thank Kevin Boyack for additional data of his clustering. Kevin Boyack, Rob Koopman, and Shenghui Wang have determined keywords we used to characterise clusters and communities. We are grateful to Wolfgang Glänzel and Rob Koopman who commented on an earlier version of this paper, and to two anonymous reviewers whose critical remarks helped us to clarify our argument.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Frank Havemann.

Appendices

Appendix 1: Elements of memetics

In large networks exploring the cost landscape by adding or removing individual links is very time-consuming. Therefore, we begin the search with a coarse search phase in which the memetic algorithm adds or removes groups of links by adding or removing a node with all its links. Node-wise memetics is done with subgraphs induced by node sets C: mutation, crossover, and local search are based on node sets. The coarse search phase is followed by a fine search phase, namely a link-wise local search (cf. Algorithm 2).

Local search Local search (adaptation) in the cost landscape is done by a greedy algorithm for finding local cost minima that correspond to communities. The algorithm is called greedy because it always chooses the step in the cost landscape that brings the biggest decrease (or the smallest increase) of \(\varPsi\). Node-wise local search includes a neighbouring node of subgraph S (or excludes a boundary node) with all its links to nodes in S. Link-wise local search includes a neighbouring link of subgraph S or excludes a link attached to a boundary node. Local search can begin by a series of either inclusions or exclusions of nodes (links). If no neighbouring place in the landscape with lower cost can be reached the algorithm excludes or includes further nodes (links). This stops after a maximum number of steps without improvement. This maximum number is calculated from the relative resolution parameter. In all experiments described here we go as many steps as allowed by the rule that a community’s range should be at least one third of its size (Havemann et al. 2015a, cf. p. 6). When no further improvement can be achieved, the search switches from inclusion to exclusion or vice versa. Inclusion and exclusion are continued until no further improvement is possible. Exclusion can make the subgraph unconnected. Node-wise local search then proceeds with its main component. In the case of link-wise local search the exclusion process can go through places in the cost landscape which correspond to unconnected subgraphs. Connectedness is tested only after running the algorithm. If we obtain two or more components then we start new link-wise local searches using components as seeds.

Mutation We mutate a community C with mutation variance \(v\,<\,1\) by changing maximally a proportion v of its nodes. This is done by randomly selecting a node in C, to which neighbours in C are randomly added until a connected subgraph of a size larger than \((1 - v)|C|\) is reached. To this subgraph, further random neighbours are added, which now can also be outside C. This random extension continues until a connected subgraph of size |C| is reached. The mutant is then subjected to two local searches, one each starting with greedy inclusion and exclusion of nodes. Thereby we obtain two communities from each mutation (which can be identical).

Crossover From two parent communities we construct two new individuals by taking intersection and union of the communities as starting points for adaptive local searches. Adapting the union is started with exclusion of nodes, adapting the intersection is started with inclusion of nodes. Of course, crossing the parents has no effect when one of them is part of the other. We also do not cross disjoint parents because we are interested in connected subgraphs as solutions.

Selection From the old population and the results of mutations and crossovers we select the communities with lowest \(\varPsi\)-values, keeping the population size constant. A new best community is only included if it is inside the minimal range (defined by the resolution parameter) of the best community of the original population. Deselected new best communities can be used as seeds for other memetic searches.

Appendix 2: Running memetic search

Each seed was first adapted by a local search and then used to initialise the population of 16 different communities by mutating the adapted seed with a variance of 15%.

In the first round of memetic searches, up to five experiments were run with each seed. The standard mutation variance in each experiment was 2%, i.e., up to 2% of the nodes were randomly exchanged. The variance was increased to 15% if \(\varPsi\)-values did not improve for 10 generations (renewal of population). For each seed, the results of these experiments where used to initialise a population (of eight communities) for a second round of up to ten memetic experiments. Here we omitted renewal and led the mutation variance decrease after each generation for which no improved best community was obtained. This strategy was applied to find nearby better communities which could have been overseen with fixed mutation variance. After node-wise memetics, for each community we made a link-wise local search.

We tested whether link communities have a range above the minimum range of one third of the link set’s size. At this stage, we also considered subgraphs induced by complementary link-sets because cost function \(\varPsi (L)\) is the same for L and its complement. If the complement is unconnected then we tried to improve its main component by link-wise local search.

We implemented the whole procedure as R-scripts. For parallel node-wise memetic search in the \(\varPsi\)-landscape we applied a fast new C++-based R-Package PsiMin. For parallel link-wise local search we applied R-Package PsiMinL. Both (yet unpublished) packages were programmed for us by Andreas Prescher.

Appendix 3: Results

Table 5 Size data, cost, numbers of subtopics, of supertopics, and of other related topics of 50 largest link clusters 2010 (clustering h, relation data without h 1)
Table 6 Size data, cost, numbers of subtopics, of supertopics, and of related topics of 50 largest link clusters 2003–2010 (clustering hd, relation data without hd 1 and hd 2)

In Tables 5 and 6 we list data of the 50 largest communities in the networks with papers published in 2010 and in 2003–2010, respectively. The following definitions are used:

Full papers are papers which have no citation links to papers outside the subgraph induced by link set L.

Fractions of papers are membership grades of papers, i.e., the ratios of numbers of internal to numbers of all citation links. Thus, sum of fractions is determined as

$$\begin{aligned} \sum _{i=1}^n \frac{k_i^\mathrm {in}(L)}{k_i}, \end{aligned}$$

where \(k_i\) is the degree of paper i and \(k_i^\mathrm {in}(L)\) its internal degree with regard to L (the number of links attached to i which are in link set L).

Subtopics of community L are all other communities which have at least 95% of their links in link set L.

Supertopics of community L are all other communities which include at least 95% of links in link set L.

Other related communities of community L are all those communities which have links in common with link set L but are not sub- or supertopics of L.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Havemann, F., Gläser, J. & Heinz, M. Memetic search for overlapping topics based on a local evaluation of link communities. Scientometrics 111, 1089–1118 (2017). https://doi.org/10.1007/s11192-017-2302-5

Download citation

  • Received:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11192-017-2302-5

Keywords

Navigation