Statistical tests for ‘related records’ search results

Smith, Charles H.; Georges, Patrick; Nguyen, Ngoc

doi:10.1007/s11192-015-1610-x

Published: 02 June 2015

Volume 105, pages 1665–1677, (2015)

Scientometrics

Abstract

Related records searching, now a common option within bibliographic databases, is applied to an individual result record as a secondary way of refining the retrieval set obtained from the primary subject search operation. In one approach, an individual result record is linked to other article records on the basis of the number of cited references they share, the theory being that two articles that cite many of the same sources are likely to be highly similar in subject content. Results of the secondary search are usually displayed in the order of each item’s actual number of shared references. In the present paper we suggest an improved way of ranking the results, employing statistical significance tests. We suggest two approaches, one involving a statistical test previously unknown in bibliometric circles, the binomial index of dispersion, and the other employing the more familiar centralized cosine measure; these turn out to produce nearly identical results. An example demonstrating the application of these measures, and contrasting them with the use of raw totals, is provided. In the example the rankings of results are found to be only modestly (positively) correlated, suggesting that much information is lost to the user when raw totals alone are made the basis for ordering results.
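The contrast between raw shared-reference totals and a centered cosine score can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the "centralized cosine" is the cosine of mean-centered 0/1 citation vectors, and the record names, reference identifiers, and the size of the reference universe (`UNIVERSE`) are all hypothetical.

```python
from math import sqrt

def shared_reference_count(refs_a, refs_b):
    """Raw total: number of cited references two records have in common."""
    return len(set(refs_a) & set(refs_b))

def centered_cosine(refs_a, refs_b, universe_size):
    """Cosine of the mean-centered 0/1 citation vectors of two records.

    Assumption: each record is a 0/1 vector over `universe_size` candidate
    references. For binary vectors this equals the Pearson (phi) correlation.
    """
    a, b = set(refs_a), set(refs_b)
    n = universe_size
    na, nb, nab = len(a), len(b), len(a & b)
    numerator = nab - na * nb / n
    denominator = sqrt(na * (1 - na / n) * nb * (1 - nb / n))
    return numerator / denominator if denominator else 0.0

# Hypothetical data: a seed record and two candidate related records.
seed = {"r1", "r2", "r3", "r4"}
candidates = {
    # A shares three references but has a long reference list (20 items).
    "A": {"r1", "r2", "r3"} | {f"x{i}" for i in range(17)},
    # B shares two references and cites nothing else.
    "B": {"r1", "r2"},
}
UNIVERSE = 100  # assumed number of distinct references in the collection

by_raw = sorted(
    candidates,
    key=lambda k: shared_reference_count(seed, candidates[k]),
    reverse=True,
)
by_cosine = sorted(
    candidates,
    key=lambda k: centered_cosine(seed, candidates[k], UNIVERSE),
    reverse=True,
)
print("ranked by raw total:      ", by_raw)
print("ranked by centered cosine:", by_cosine)
```

In this toy case the raw total places A first, while the centered score places B first, because B's two shared references make up its entire reference list.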

Notes

The literature on bibliometrics in general is enormous; for quick accountings of the subject see Donohue (1974), Wolfram (2003), De Bellis (2009), and Haustein (2012). The same is true even for citation analysis alone; consult Borgman (1990), Moed (2005), and De Bellis (2009) for reviews. Importantly, however, much of the attention given to citation analysis has been directed at: (1) the scientometric context of the sociology of science (e.g., identifying ways of establishing schools of scientific endeavor, roles of key figures, subject trends, etc.), and (2) the evaluation of various kinds of database inconsistencies or omissions (especially as related to impact factors). Much less attention has been given to the investigation of user-focused needs, though this is beginning to change (see, for example, Zhao and Strotmann 2014).
In Web of Science, the command icon “View Related Records” is positioned on the right-hand side of each of the records obtained through the primary search; the pop-up descriptor linked to the command reads “View other records that share references with this one”. There is very little literature on the citation analysis form of related records searching beyond notices in product reviews (e.g., Wiley 1998), probably because the retrieval algorithms involved feature simple match counting.
Regarding biological measures of association, two well-known reviews are Cheetham and Hazel (1969) and Hayek (1994). As far as we can tell such measures have not been used in the past to contribute to a database user-oriented citation analysis mission.
Strictly speaking the order of two articles in each pair (i, j) is not important (i.e., the index is symmetrical), and we do not need to compare an article i with itself. Hence there are only n(n−1)/2 relevant indices.
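The count of relevant indices can be checked with a short sketch. The article labels below are hypothetical; the only point is that enumerating unordered pairs without self-pairs yields n(n−1)/2 items.

```python
from itertools import combinations

def unordered_pairs(items):
    """All pairs (i, j) with i before j: no self-pairs, order ignored."""
    return list(combinations(items, 2))

# Hypothetical article labels.
articles = ["a1", "a2", "a3", "a4", "a5"]
pairs = unordered_pairs(articles)

n = len(articles)
print(len(pairs), n * (n - 1) // 2)
```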
A primary objective of the present work is to introduce database providers to the possibility of a new kind of tool, but ultimately it will be up to them to adapt the ideas to their own circumstances. The statistical approach itself may of course be applied to any setting, including humanities subjects, that meets the basic conditions of order within the data.
Also, given Eq. (6) and because CSC can take negative values, BID is not a monotonic function of CSC. Hence, the ranking produced by CSC is not preserved by BID over the full range of CSC values. In our data set, where negative values of CSC always remain very close to zero, BID and CSC always generate the same ranking for relatively similar composers, that is, when CSC takes a positive value not too close to zero. Note that the ranking based on BID is equal to the ranking of the absolute values of CSC. The overall nature of this synonymy has led us to consider some related statistical and philosophical issues bearing on the measure and meaning of similarity and relatedness in data sets of the present type. We hope to explore this further in a future publication.
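The ranking behaviour described in this note can be illustrated with a small sketch. Eq. (6) is not reproduced here, so the code does not compute BID; it uses a hypothetical stand-in (the square of CSC) chosen only because it increases with the absolute value of CSC, which is the property the note relies on. The item labels and CSC values are invented.

```python
def bid_stand_in(csc):
    """Hypothetical stand-in for BID: any function increasing in |CSC|.

    This is NOT Eq. (6) of the paper; it only mimics the stated property
    that BID ranks items like the absolute value of CSC.
    """
    return csc ** 2

def rank(scores, key):
    """Item labels sorted from highest to lowest under the given key."""
    return sorted(scores, key=lambda item: key(scores[item]), reverse=True)

# Invented CSC values; "t" is strongly negative, "s" is barely negative.
csc = {"p": 0.62, "q": 0.35, "r": 0.04, "s": -0.02, "t": -0.30}

by_csc = rank(csc, key=lambda v: v)
by_abs = rank(csc, key=abs)
by_stand_in = rank(csc, key=bid_stand_in)

print("by CSC:      ", by_csc)
print("by |CSC|:    ", by_abs)
print("by stand-in: ", by_stand_in)
```

With these values the two orderings agree on the clearly positive items but diverge once a strongly negative CSC is present, which is why agreement in the paper's data depends on negative values staying close to zero.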

References

Bassanezi, R. B., Filho, A. B., Amorim, L., Gimenes-Fernandes, N., Gottwald, T. R., & Bové, J. M. (2003). Spatial and temporal analyses of citrus sudden death as a tool to generate hypotheses concerning its etiology. Phytopathology, 93(4), 502–512.
Borgman, C. L. (1990). Scholarly communication and bibliometrics. Newbury Park, CA: Sage Publications.
Bradford, S. C. (1934). Sources of information on specific subjects. Engineering: An Illustrated Weekly Journal (London), 137, 85–86.
Brandyberry, A., Rai, A., & White, G. P. (1999). Intermediate performance impacts of advanced manufacturing technology systems: An empirical investigation. Decision Sciences, 30(4), 993–1020.
Cheetham, A. H., & Hazel, J. E. (1969). Binary (presence–absence) similarity coefficients. Journal of Paleontology, 43(5), 1130–1136.
De Bellis, N. (2009). Bibliometrics and citation analysis: From the Science Citation Index to cybermetrics. Lanham, MD: Scarecrow Press.
Donohue, J. C. (1974). Understanding scientific literatures: A bibliometric approach. Cambridge, MA: MIT Press.
Duncan, O. D., & Duncan, B. (1955). A methodological analysis of segregation indexes. American Sociological Review, 20(2), 210–217.
Giller, G. L. (2012). The statistical properties of random bitstreams and the sampling distribution of cosine similarity. Giller Investments Research Notes (20121024/1). http://dx.doi.org/10.2139/ssrn.2167044. Accessed 17 January 2015
Glänzel, W., & Czerwon, H. J. (1996). A new methodological approach to bibliographic coupling and its application to the national, regional, and institutional level. Scientometrics, 37(2), 195–221.
Haustein, S. (2012). Multidimensional journal evaluation: Analyzing scientific periodicals beyond the impact factor. Berlin: De Gruyter/Saur.
Hayek, L.-A. C. (1994). Analysis of amphibian biodiversity data. In W. R. Heyer, M. A. Donnelly, R. W. McDiarmid, L.-A. C. Hayek, & M. S. Foster (Eds.), Measuring and monitoring biological diversity: Standard methods for amphibians (pp. 207–269). Washington, DC: Smithsonian Books.
Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37(142), 547–579.
Lotka, A. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16(12), 317–324.
Moed, H. F. (2005). Citation analysis in research evaluation. Dordrecht: Springer.
Potthoff, R. F., & Whittinghill, M. (1966). Testing for homogeneity. I. The binomial and multinomial distributions. Biometrika, 53, 167–182.
Rogosa, D., Floden, R., & Willett, J. B. (1984). Assessing the stability of teacher behavior. Journal of Educational Psychology, 76(6), 1000–1027.
Sen, S. K., & Gan, S. K. (1983). A mathematical extension of the idea of bibliographic coupling and its applications. Annals of Library and Information Studies, 30(2), 78–82.
Smith, C. H. (2000). The classical music navigator. http://people.wku.edu/charles.smith/music/. Accessed 17 November 2014
Smith, C. H., & Georges, P. (2014). Composer similarities through ‘The Classical Music Navigator’: Similarity inference from composer influences. Empirical Studies of the Arts, 32(2), 205–229.
Spósito, M. B., Amorim, L., Ribeiro, P. J. Jr., Bassanezi, R. B., & Krainski, E. T. (2007). Spatial pattern of trees affected by black spot in citrus groves in Brazil. Plant Disease, 91(1), 36–40.
Waller, L. A., & Gotway, C. A. (2004). Applied spatial statistics for public health data. Hoboken, NJ: Wiley.
Wiley, D. L. (1998). Cited references on the Web: A review of ISI’s ‘Web of Science’. Searcher, 6(1), 32–39, 57.
Wolfram, D. (2003). Applied informetrics for information retrieval research. Westport, CT: Libraries Unlimited.
Zhao, D., & Strotmann, A. (2014). In-text author citation analysis: Feasibility, benefits, and limitations. Journal of the Association for Information Science and Technology, 65(11), 2348–2358.

Author information

Authors and Affiliations

University Libraries, Western Kentucky University, 1906 College Heights Blvd., Bowling Green, KY, 42101, USA
Charles H. Smith
Graduate School of Public and International Affairs, University of Ottawa, Social Sciences Building, Room 6011, 120 University, Ottawa, ON, K1N 6N5, Canada
Patrick Georges
Department of Mathematics, Western Kentucky University, 1906 College Heights Blvd., Bowling Green, KY, 42101, USA
Ngoc Nguyen

Corresponding author

Correspondence to Charles H. Smith.

About this article

Cite this article

Smith, C.H., Georges, P. & Nguyen, N. Statistical tests for ‘related records’ search results. Scientometrics 105, 1665–1677 (2015). https://doi.org/10.1007/s11192-015-1610-x

Received: 04 February 2015
Published: 02 June 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s11192-015-1610-x
