Experimental evaluation of parameter settings in calculation of hybrid similarities: effects of first- and second-order similarity, edge cutting, and weighting factors

Abstract

The ongoing discussion in the bibliometric community about the best similarity measures has led to diverse insights. Although these insights are sometimes contradictory, there is one very consistent conclusion: hybrid measures outperform their individual components applied alone. While this initially answers the question of which similarity measure is best, it also raises issues that have been resolved in part for conventional similarity measures. In this study we therefore investigate, in the context of hybrid similarities, the impact of the weighting factors, the level of edge cutting, the performance of first-order in contrast to second-order similarities, and the interaction of these three parameters. Building upon a dataset of over 8000 articles from the manufacturing engineering field and using different parameter settings, we calculated over 100 similarity matrices. For each matrix we determined several cluster solutions at different resolution levels, ranging from 100 to 1000 clusters, and evaluated them quantitatively with a textual coherence value based on the Jensen-Shannon divergence. We found that second-order hybrid similarity measures, calculated with a weighting factor of 0.6 for the citation-based similarity and reduced to only the strongest values, yield the best clustering results. Furthermore, we found the assessed parameters to be highly interdependent: for example, hybrid first-order similarity outperforms second-order similarity when no edge cutting is applied. Our results can thus serve the bibliometric community as a guideline for the appropriate application of hybrid measures.
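
The three parameters studied here can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the linear combination in `hybrid_similarity`, the per-row top-k rule in `cut_edges`, the cosine-of-profiles definition in `second_order`, and all function names are assumptions made for illustration. `jensen_shannon` computes the base-2 Jensen-Shannon divergence between two term distributions; how it is aggregated into the textual coherence value is not stated in the abstract and is left out.

```python
import numpy as np


def cosine_rows(m):
    """Cosine similarity between all pairs of rows of m."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zero vectors
    unit = m / norms
    return unit @ unit.T


def hybrid_similarity(sim_citation, sim_text, weight=0.6):
    """ASSUMPTION: a linear combination of the two first-order matrices.

    `weight` is applied to the citation-based part, 1 - weight to the text part.
    """
    return weight * np.asarray(sim_citation) + (1.0 - weight) * np.asarray(sim_text)


def cut_edges(sim, k):
    """ASSUMPTION: edge cutting keeps the k strongest values per document.

    The diagonal is ignored; an edge survives if either endpoint keeps it.
    """
    s = np.array(sim, dtype=float)
    np.fill_diagonal(s, 0.0)
    top = np.argsort(s, axis=1)[:, -k:]  # column indices of the k largest per row
    rows = np.arange(s.shape[0])[:, None]
    kept = np.zeros_like(s)
    kept[rows, top] = s[rows, top]
    return np.maximum(kept, kept.T)


def second_order(sim):
    """ASSUMPTION: second-order similarity as the cosine between the
    documents' first-order similarity profiles (rows of `sim`)."""
    return cosine_rows(sim)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two non-negative vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Toy data: 4 documents with made-up citation and term features.
    rng = np.random.default_rng(0)
    sim_citation = cosine_rows(rng.random((4, 6)))
    sim_text = cosine_rows(rng.random((4, 10)))
    hybrid = hybrid_similarity(sim_citation, sim_text, weight=0.6)
    print(second_order(cut_edges(hybrid, k=2)))
```

The order in which edge cutting and the second-order step are applied in the demo is also an assumption; the functions are kept separate so that either order, or no cutting at all, can be tried.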

Author information

Corresponding author

Correspondence to Fabian Meyer-Brötz.

About this article

Cite this article

Meyer-Brötz, F., Schiebel, E. & Brecht, L. Experimental evaluation of parameter settings in calculation of hybrid similarities: effects of first- and second-order similarity, edge cutting, and weighting factors. Scientometrics 111, 1307–1325 (2017). https://doi.org/10.1007/s11192-017-2366-2

  • DOI: https://doi.org/10.1007/s11192-017-2366-2

Keywords

  • Hybrid clustering
  • Bibliographic coupling
  • Textual coherence
  • Similarity measures
  • First- and second-order similarity