# Using mutual information as a cocitation similarity measure


## Abstract

The debate over which similarity measure should be used in co-citation analysis has lasted for many years. The most debated measure is Pearson’s correlation coefficient *r*, which has served as a similarity measure in the literature since the technique emerged in the 1980s. However, some researchers have criticized the use of Pearson’s *r* because it does not fully satisfy the mathematical conditions of a good similarity metric and/or because it fails to meet some natural requirements a similarity measure should satisfy. Alternative measures such as the cosine measure and the chi-square measure have also been proposed and studied, which led to further controversy about which similarity measure to use in co-citation analysis. In this article, we put forth the hypothesis that researchers with high mutual information are closely related to each other, and that mutual information can therefore be used as a similarity measure in author co-citation analysis. Given two researchers, the mutual information between them can be calculated from their publications and their co-citation frequencies. A mutual information proximity matrix is then constructed. This proximity matrix meets the two requirements formulated by Ahlgren et al. (J Am Soc Inf Sci Technol 54(6):550–560, 2003). We conduct several experimental studies to validate our hypothesis and compare the results obtained with mutual information to those obtained with other similarity measures.
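To illustrate the idea, here is a minimal sketch of how such a proximity matrix could be built. The abstract does not give the authors' exact estimator, so this uses a generic plug-in estimate of mutual information: each author is treated as a binary indicator over a set of citing papers (a hypothetical papers-by-authors matrix `C`), and the MI between two authors is computed from their joint citation frequencies.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of mutual information (in bits) between two
    binary indicator vectors defined over the same set of citing papers."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((x == a) & (y == b))  # joint frequency
            p_x = np.mean(x == a)                # marginal frequencies
            p_y = np.mean(y == b)
            if p_xy > 0:  # 0 * log(0) contributes nothing
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def mi_proximity_matrix(C):
    """Symmetric MI proximity matrix from a binary papers-by-authors
    matrix C, where C[p, i] = 1 if paper p cites author i.
    (A hypothetical data layout, not necessarily the one used in the paper.)"""
    n_authors = C.shape[1]
    M = np.zeros((n_authors, n_authors))
    for i in range(n_authors):
        for j in range(i, n_authors):
            M[i, j] = M[j, i] = mutual_information(C[:, i], C[:, j])
    return M
```

Two sanity checks follow from the definition: authors cited independently of each other have MI near zero, while two authors with identical citation patterns share MI equal to the entropy of that pattern.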

## Keywords

- Author co-citation analysis
- Similarity measures
- Mutual information

## References

- Ahlgren, P., Jarneving, B., & Rousseau, R. (2003). Requirements for a cocitation similarity measure, with special reference to Pearson’s correlation coefficient. *Journal of the American Society for Information Science and Technology*, 54(6), 550–560.
- Bennasar, M., Hicks, Y., & Setchi, R. (2015). Feature selection using joint mutual information maximisation. *Expert Systems with Applications*, 42(22), 8520–8532.
- Berger, A., & Lafferty, J. (2017). Information retrieval as statistical translation. *ACM SIGIR Forum*, 51(2), 219–226.
- Cover, T. M., & Thomas, J. A. (2006). *Elements of information theory* (2nd ed.). Hoboken: Wiley.
- Egghe, L. (2010). Good properties of similarity measures and their complementarity. *Journal of the American Society for Information Science and Technology*, 61(10), 2151–2160.
- Fiedor, P. (2014). Networks in financial markets based on the mutual information rate. *Physical Review E*, 89(5), 052801.
- Gao, S., Ver Steeg, G., & Galstyan, A. (2015). Efficient estimation of mutual information for strongly dependent variables. In *Artificial intelligence and statistics* (pp. 277–286).
- Hausser, J., & Strimmer, K. (2014). entropy: Estimation of entropy, mutual information and related quantities. *R package version 1.1*.
- Leydesdorff, L. (2008). On the normalization and visualization of author co-citation data: Salton’s cosine versus the Jaccard index. *Journal of the American Society for Information Science and Technology*, 59(1), 77–85.
- Leydesdorff, L., & Vaughan, L. (2006). Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment. *Journal of the American Society for Information Science and Technology*, 57(12), 1616–1628.
- Megnigbeto, E. (2013). Controversies arising from which similarity measures can be used in co-citation analysis. *Malaysian Journal of Library & Information Science*, 18(2), 25–31.
- Paninski, L. (2003). Estimation of entropy and mutual information. *Neural Computation*, 15(6), 1191–1253.
- Shannon, C. E. (1948). A mathematical theory of communication. *Bell System Technical Journal*, 27, 379–423, 623–656.
- Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. *IEEE Transactions on Computers*, C-18(5), 401–409.
- Van Eck, N. J., & Waltman, L. (2008). Appropriate similarity measures for author co-citation analysis. *Journal of the American Society for Information Science and Technology*, 59(10), 1653–1661.
- White, H. D. (2003). Author cocitation analysis and Pearson’s r. *Journal of the American Society for Information Science and Technology*, 54(13), 1250–1259.
- White, H. D., & Griffith, B. C. (1981). Author cocitation: A literature measure of intellectual structure. *Journal of the American Society for Information Science*, 32, 163–171.
- Zhang, Z. (2012). Entropy estimation in Turing’s perspective. *Neural Computation*, 24(5), 1368–1389.
- Zhang, Z., & Zheng, L. (2015). A mutual information estimator with exponentially decaying bias. *Statistical Applications in Genetics and Molecular Biology*, 14(3), 243–252.