Scientometrics, Volume 90, Issue 2, pp 675–685

Experimental comparison of first and second-order similarities in a scientometric context

Abstract

The measurement of similarity between objects plays a role in several scientific areas. In this article, we deal with document–document similarity in a scientometric context. We compare experimentally, using a large dataset, first-order with second-order similarities with respect to the overall quality of partitions of the dataset, where the partitions are obtained on the basis of optimizing weighted modularity. The quality of a partition is defined in terms of textual coherence. The results show that the second-order approach consistently outperforms the first-order approach. Each difference between the two approaches in overall partition quality values is significant at the 0.01 level.
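The distinction between the two similarity orders can be illustrated with a minimal sketch. It assumes cosine similarity over binary document-by-feature vectors (for instance, cited references); the abstract does not state which measure or features were used, so `cosine`, `first_order`, `second_order`, and the toy data are illustrative choices only. Under this reading, first-order similarity compares two documents' feature vectors directly, while second-order similarity compares their first-order similarity profiles, that is, the rows of the first-order matrix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty vectors (an assumption)
    return dot / (norm_u * norm_v)


def first_order(doc_vectors):
    """Matrix of direct similarities between document feature vectors."""
    n = len(doc_vectors)
    return [[cosine(doc_vectors[i], doc_vectors[j]) for j in range(n)]
            for i in range(n)]


def second_order(first):
    """Matrix of similarities between rows of the first-order matrix.

    The diagonal entries are kept in each profile here; other treatments
    are possible and the abstract does not say which one was used.
    """
    n = len(first)
    return [[cosine(first[i], first[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: three documents over four features.
docs = [
    [1, 1, 0, 0],  # d0
    [0, 1, 1, 0],  # d1 shares one feature with d0 and one with d2
    [0, 0, 1, 1],  # d2 shares nothing with d0
]

first = first_order(docs)
second = second_order(first)

print(first[0][2])   # 0.0: d0 and d2 have no feature in common
print(second[0][2])  # approximately 0.2: both are similar to d1
```

In this toy case, d0 and d2 have zero first-order similarity but a positive second-order similarity, because both resemble d1. Either matrix could then serve as the edge weights of a document network whose partition is obtained by optimizing weighted modularity.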

Keywords

Bibliographic coupling; Cluster analysis; Document–document similarity; Science mapping; Similarity order; Textual coherence

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2011

Authors and Affiliations

  1. Department of Sociology, Inforsk, Umeå University, Umeå, Sweden
  2. Department of e-Resources, University Library, Stockholm University, Stockholm, Sweden
