Abstract
Related records searching, now a common option within bibliographic databases, is applied to an individual result record as a secondary way of refining the retrieval set obtained from the primary subject search operation. In one approach, an individual result record is linked to other article records on the basis of the number of cited references they share, the theory being that two articles that cite many of the same sources are likely to be highly similar in subject content. Results of the secondary search are usually displayed in order of each item’s raw number of shared references. In the present paper we suggest an improved way of ranking the results, employing statistical significance tests. We suggest two approaches, one involving a statistical test previously unknown in bibliometric circles, the binomial index of dispersion, and the other employing the more familiar centralized cosine measure; these turn out to produce nearly identical results. An example demonstrating the application of these measures, and contrasting them with the use of raw totals, is provided. In the example, the rankings produced by the statistical measures and by raw totals are found to be only modestly (positively) correlated, suggesting that much information is lost to the user when raw totals alone are made the basis for ordering results.
Notes
The literature on bibliometrics in general is enormous; for quick accountings of the subject see Donohue (1974), Wolfram (2003), De Bellis (2009), and Haustein (2012). The same is true even for citation analysis alone; consult Borgman (1990), Moed (2005), and De Bellis (2009) for reviews. Importantly, however, much of the attention given to citation analysis has been in: (1) the scientometric context of the sociology of science: e.g., to identifying ways of establishing schools of scientific endeavor, roles of key figures, subject trends, etc., and (2) the evaluation of various kinds of database inconsistencies or omissions (especially as related to impact factors). Much less attention has been given to the investigation of user-focused needs, though this is beginning to change (see, for example, Zhao and Strotmann 2014).
In Web of Science, the command icon “View Related Records” is positioned on the right-hand side of each of the records obtained through the primary search; the pop-up descriptor linked to the command reads “View other records that share references with this one”. There is very little literature on the citation analysis form of related records searching beyond notices in product reviews (e.g., Wiley 1998), probably because the retrieval algorithms involved feature simple match counting.
Strictly speaking the order of two articles in each pair (i, j) is not important (i.e., the index is symmetrical), and we do not need to compare an article i with itself. Hence there are only n(n−1)/2 relevant indices.
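The pair count in this note can be illustrated with a minimal sketch. It assumes each article is represented simply as a set of cited-reference identifiers; the article labels, reference labels, and the function name `shared_reference_counts` are hypothetical and used only for illustration. The index computed here is the raw count of shared references, not the statistical measures proposed in the paper.

```python
from itertools import combinations

# Hypothetical toy data: article label -> set of cited-reference labels.
cited = {
    "A": {"r1", "r2", "r3"},
    "B": {"r2", "r3", "r4"},
    "C": {"r5"},
    "D": {"r1", "r3", "r5"},
}

def shared_reference_counts(cited):
    """Return {(i, j): shared-reference count} for each unordered pair i < j.

    Because the count is symmetrical and self-pairs are skipped,
    exactly n(n-1)/2 entries are produced for n articles.
    """
    return {
        (i, j): len(cited[i] & cited[j])
        for i, j in combinations(sorted(cited), 2)
    }

counts = shared_reference_counts(cited)
n = len(cited)
print(len(counts), n * (n - 1) // 2)  # both are 6 for n = 4
```

Using `combinations` rather than a double loop over all (i, j) makes the symmetry explicit: each pair is visited once and no article is compared with itself.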
A primary objective of the present work is to introduce database providers to the possibility of a new kind of tool, but ultimately it will be up to them to adapt the ideas to their own circumstances. The statistical approach itself may of course be applied to any setting, including humanities subjects, that meets the basic conditions of order within the data.
Also, given Eq. (6) and because CSC can take negative values, BID is not a monotonic function of CSC. Hence, the ranking produced by CSC is not preserved by BID over the full range of CSC values. In our data set, however, negative values of CSC always remain very close to zero, so BID and CSC generate the same ranking for relatively similar composers, that is, whenever CSC takes a positive value not too close to zero. Note that the ranking based on BID is equal to the ranking of the set of absolute values of CSC. The overall nature of this synonymy has led us to consider some related statistical and philosophical issues bearing on the measure and meaning of similarity and relatedness in data sets of the present type. We hope to explore this further in a future publication.
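A small sketch may help make the ranking argument concrete. It relies only on the statement in this note that the BID ranking equals the ranking of the absolute values of CSC, so `abs` stands in for BID here; Eq. (6) itself is not reproduced. The CSC values and the helper name `rank_order` are hypothetical.

```python
def rank_order(scores, key=lambda v: v):
    """Return item labels sorted from highest to lowest key(score)."""
    return sorted(scores, key=lambda name: key(scores[name]), reverse=True)

# Hypothetical CSC values: three clearly positive, two slightly negative.
csc = {"a": 0.62, "b": 0.35, "c": 0.08, "d": -0.02, "e": -0.05}

by_csc = rank_order(csc)           # ranking by signed CSC
by_abs = rank_order(csc, key=abs)  # stand-in for the BID ranking

print(by_csc)  # ['a', 'b', 'c', 'd', 'e']
print(by_abs)  # ['a', 'b', 'c', 'e', 'd']
```

In this toy case the two orderings agree for the items with positive CSC and differ only among the near-zero negative values, which is the situation the note describes.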
References
Bassanezi, R. B., Filho, A. B., Amorim, L., Gimenes-Fernandes, N., Gottwald, T. R., & Bové, J. M. (2003). Spatial and temporal analyses of citrus sudden death as a tool to generate hypotheses concerning its etiology. Phytopathology, 93(4), 502–512.
Borgman, C. L. (1990). Scholarly communication and bibliometrics. Newbury Park, CA: Sage Publications.
Bradford, S. C. (1934). Sources of information on specific subjects. Engineering: An Illustrated Weekly Journal (London), 137, 85–86.
Brandyberry, A., Rai, A., & White, G. P. (1999). Intermediate performance impacts of advanced manufacturing technology systems: An empirical investigation. Decision Sciences, 30(4), 993–1020.
Cheetham, A. H., & Hazel, J. E. (1969). Binary (presence–absence) similarity coefficients. Journal of Paleontology, 43(5), 1130–1136.
De Bellis, N. (2009). Bibliometrics and citation analysis: From the Science Citation Index to cybermetrics. Lanham, MD: Scarecrow Press.
Donohue, J. C. (1974). Understanding scientific literatures: A bibliometric approach. Cambridge, MA: MIT Press.
Duncan, O. D., & Duncan, B. (1955). A methodological analysis of segregation indexes. American Sociological Review, 20(2), 210–217.
Giller, G. L. (2012). The statistical properties of random bitstreams and the sampling distribution of cosine similarity. Giller Investments Research Notes (20121024/1). http://dx.doi.org/10.2139/ssrn.2167044. Accessed 17 January 2015.
Glänzel, W., & Czerwon, H. J. (1996). A new methodological approach to bibliographic coupling and its application to the national, regional, and institutional level. Scientometrics, 37(2), 195–221.
Haustein, S. (2012). Multidimensional journal evaluation: Analyzing scientific periodicals beyond the impact factor. Berlin: De Gruyter/Saur.
Hayek, L.-A. C. (1994). Analysis of amphibian biodiversity data. In W. R. Heyer, M. A. Donnelly, R. W. McDiarmid, L.-A. C. Hayek, & M. S. Foster (Eds.), Measuring and monitoring biological diversity: Standard methods for amphibians (pp. 207–269). Washington, DC: Smithsonian Books.
Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37(142), 547–579.
Lotka, A. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16(12), 317–324.
Moed, H. F. (2005). Citation analysis in research evaluation. Dordrecht: Springer.
Potthoff, R. F., & Whittinghill, M. (1966). Testing for homogeneity. I. The binomial and multinomial distributions. Biometrika, 53, 167–182.
Rogosa, D., Floden, R., & Willett, J. B. (1984). Assessing the stability of teacher behavior. Journal of Educational Psychology, 76(6), 1000–1027.
Sen, S. K., & Gan, S. K. (1983). A mathematical extension of the idea of bibliographic coupling and its applications. Annals of Library and Information Studies, 30(2), 78–82.
Smith, C. H. (2000). The classical music navigator. http://people.wku.edu/charles.smith/music/. Accessed 17 November 2014.
Smith, C. H., & Georges, P. (2014). Composer similarities through ‘The Classical Music Navigator’: Similarity inference from composer influences. Empirical Studies of the Arts, 32(2), 205–229.
Spósito, M. B., Amorim, L., Ribeiro, P. J. Jr., Bassanezi, R. B., & Krainski, E. T. (2007). Spatial pattern of trees affected by black spot in citrus groves in Brazil. Plant Disease, 91(1), 36–40.
Waller, L. A., & Gotway, C. A. (2004). Applied spatial statistics for public health data. Hoboken, NJ: Wiley.
Wiley, D. L. (1998). Cited references on the Web: A review of ISI’s ‘Web of Science’. Searcher, 6(1), 32–39, 57.
Wolfram, D. (2003). Applied informetrics for information retrieval research. Westport, CT: Libraries Unlimited.
Zhao, D., & Strotmann, A. (2014). In-text author citation analysis: Feasibility, benefits, and limitations. Journal of the Association for Information Science and Technology, 65(11), 2348–2358.
Cite this article
Smith, C.H., Georges, P. & Nguyen, N. Statistical tests for ‘related records’ search results. Scientometrics 105, 1665–1677 (2015). https://doi.org/10.1007/s11192-015-1610-x
DOI: https://doi.org/10.1007/s11192-015-1610-x