Experimental comparison of first and second-order similarities in a scientometric context
Article
First Online:
Received:
- 367 Downloads
- 10 Citations
Abstract
The measurement of similarity between objects plays a role in several scientific areas. In this article, we deal with document–document similarity in a scientometric context. We compare experimentally, using a large dataset, first-order with second-order similarities with respect to the overall quality of partitions of the dataset, where the partitions are obtained on the basis of optimizing weighted modularity. The quality of a partition is defined in terms of textual coherence. The results show that the second-order approach consistently outperforms the first-order approach. Each difference between the two approaches in overall partition quality values is significant at the 0.01 level.
Keywords
Bibliographic coupling Cluster analysis Document–document similarity Science mapping Similarity order Textual coherenceReferences
- Ahlgren, P., & Colliander, C. (2009a). Document–document similarity approaches and science mapping: experimental comparison of five approaches. Journal of Informetrics, 3(1), 49–63.CrossRefGoogle Scholar
- Ahlgren, P., & Colliander, C. (2009b). Textual content, cited references, similarity order, and clustering: an experimental study in the context of science mapping. In Proceedings of the 12th International Conference on Scientometrics and Informetrics (Vol. 2, pp 862–873), Rio de Janeiro.Google Scholar
- Ahlgren, P., & Jarneving, B. (2008). Bibliographic coupling, common abstract stems and clustering: A comparison of two document–document similarity approaches in the context of science mapping. Scientometrics, 76(2), 273–290.CrossRefGoogle Scholar
- Ahlgren, P., Jarneving, B., & Rousseau, R. (2003). Requirements for a cocitation similarity measure, with special reference to Pearson’s correlation coefficient. Journal of the American Society for Information Science and Technology, 54(6), 550–560.CrossRefGoogle Scholar
- Arenas, A., Fernandez, A., & Gomez, S. (2008). Analysis of the structure of complex networks at different resolution levels. New Journal of Physics, 10, Article Number: 053039.Google Scholar
- Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Harlow, UK: Addison-Wesley.Google Scholar
- Bland, J. M., & Kerry, S. M. (1998). Statistics notes—Weighted comparison of means. British Medical Journal, 316(7125), 129.CrossRefGoogle Scholar
- Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics—Theory and Experiment, Article Number: P10008.Google Scholar
- Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404.CrossRefGoogle Scholar
- Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.CrossRefGoogle Scholar
- Boyack, K. W., Newman, D., Duhon, R. J., Klavans, R., Patek, M., Biberstine, J. R., et al. (2011). Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches. PLoS One, 6(3), Article Number: e18029.Google Scholar
- Cao, M., & Gao, X. (2005). Combining contents and citations for scientific document classification. AI 2005: Advances in artificial intelligence (pp. 143–152). Berlin: Springer.Google Scholar
- Cribbin, T. (2011). Discovering latent topical structure by second-order similarity analysis. Journal of the American Society for Information Science and Technology, 62(6), 1188–1207.CrossRefGoogle Scholar
- Egghe, L. (2009). New relations between similarity measures for vectors based on vector norms. Journal of the American Society for Information Science and Technology, 60(2), 232–239.CrossRefGoogle Scholar
- Egghe, L. (2010a). Good properties of similarity measures and their complementarity. Journal of the American Society for Information Science and Technology, 61(10), 2151–2160.CrossRefGoogle Scholar
- Egghe, L. (2010b). On the relation between the association strength and other similarity measures. Journal of the American Society for Information Science and Technology, 61(7), 1502–1504.CrossRefGoogle Scholar
- Egghe, L., & Leydesdorff, L. (2009). The relation between Pearson’s correlation coefficient r and Salton’s cosine measure. Journal of the American Society for Information Science and Technology, 60(5), 1027–1036.CrossRefGoogle Scholar
- Egghe, L., & Rousseau, R. (2006). Classical retrieval and overlap measures satisfy the requirements for rankings based on a Lorenz curve. Information Processing & Management, 42(1), 106–120.CrossRefMATHGoogle Scholar
- Fortunato, S., & Barthelemy, M. (2007). Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America, 104(1), 36–41.CrossRefGoogle Scholar
- Glenisson, P., Glänzel, W., & Persson, O. (2005). Combining full-text analysis and bibliometric indicators. A pilot study. Scientometrics, 63(1), 163–180.CrossRefGoogle Scholar
- Gmür, M. (2003). Co-citation analysis and the search for invisible colleges: A methodological evaluation. Scientometrics, 57(1), 27–57.CrossRefGoogle Scholar
- Hamers, L., Hemeryck, Y., Herweyers, G., Janssen, M., Keters, H., Rousseau, R., et al. (1989). Similarity measures in scientometric research— The Jaccard index versus Salton cosine formula. Information Processing & Management, 25(3), 315–318.CrossRefGoogle Scholar
- Janssens, F., Quoc, V. T., Glänzel, W., & Moor, B. D. (2006). Integration of textual content and link information for accurate clustering of science fields. In InSCit2006, Current Research in Information Sciences and Technologies: Multidisciplinary Approaches to Global Information Systems (Vol. I, pp. 615–619), Merida, Spain.Google Scholar
- Klavans, R., & Boyack, K. W. (2006). Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology, 57(2), 251–263.CrossRefGoogle Scholar
- Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 845–848.MathSciNetGoogle Scholar
- Leydesdorff, L. (2008). On the normalization and visualization of author co-citation data: Salton’s cosine versus the Jaccard index. Journal of the American Society for Information Science and Technology, 59(1), 77–85.CrossRefGoogle Scholar
- Lin, J. H. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151.CrossRefMATHGoogle Scholar
- Luukkonen, T., Tijssen, R. J. W., Persson, O., & Sivertsen, G. (1993). The measurement of international scientific collaboration. Scientometrics, 28(1), 15–36.CrossRefGoogle Scholar
- Newman, M. E. J. (2004). Analysis of weighted networks. Physical Review E, 70(5), Article Number: 056131.Google Scholar
- Peters, H. P. F., & Van Raan, A. F. J. (1993). Co-word-based science maps of chemical-engineering. Part 1: Representations by direct multidimensional-scaling. Research Policy, 22(1), 23–45.CrossRefGoogle Scholar
- Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.CrossRefGoogle Scholar
- Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.MATHGoogle Scholar
- Schneider, J. W., & Borlund, P. (2007a). Matrix comparison, part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results. Journal of the American Society for Information Science and Technology, 58(11), 1586–1595.CrossRefGoogle Scholar
- Schneider, J. W., & Borlund, P. (2007b). Matrix comparison, part 2: Measuring the resemblance between proximity measures or ordination results by use of the mantel and procrustes statistics. Journal of the American Society for Information Science and Technology, 58(11), 1596–1609.CrossRefGoogle Scholar
- Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Pearson Addison Wesley.Google Scholar
- van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635–1651.CrossRefGoogle Scholar
- Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge: Cambridge University Press.Google Scholar
- Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. (1999). KEA: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries, Berkeley, CA.Google Scholar
Copyright information
© Akadémiai Kiadó, Budapest, Hungary 2011