Using mutual information as a cocitation similarity measure

Zheng, Lukun

doi:10.1007/s11192-019-03098-9

Using mutual information as a cocitation similarity measure

Published: 13 April 2019

Volume 119, pages 1695–1713, (2019)
Cite this article

Scientometrics Aims and scope Submit manuscript

Lukun Zheng ORCID: orcid.org/0000-0001-8253-1267¹

524 Accesses
7 Citations
Explore all metrics

Abstract

The debate regarding to which similarity measure can be used in co-citation analysis lasted for many years. The mostly debated measure is Pearson’s correlation coefficient r. It has been used as similarity measure in literature since the beginning of the technique in the 1980s. However, some researchers criticized using Pearson’s r as a similarity measure because it does not fully satisfy the mathematical conditions of a good similarity metric and (or) because it doesn’t meet some natural requirements a similarity measure should satisfy. Alternative similarity measures like cosine measure and chi square measure were also proposed and studied, which resulted in more controversies and debates about which similarity measure to use in co-citation analysis. In this article, we put forth the hypothesis that the researchers with high mutual information are closely related to each other and that the mutual information can be used as a similarity measure in author co-citation analysis. Given two researchers, the mutual information between them can be calculated based on their publications and their co-citation frequencies. A mutual information proximity matrix is then constructed. This proximity matrix meet the two requirements formulated by Ahlgren et al. (J Am Soc Inf Sci Technol 54(6):550–560, 2003). We conduct several experimental studies for the validation of our hypothesis and the results using mutual information are compared to the results using other similarity measures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Paper Co-citation Analysis Using Semantic Similarity Measures

MACA: a modified author co-citation analysis method combined with general descriptive metadata of citations

Article 03 May 2016

Co-citation analysis between coupler authors of a scientific domain’s citation identity: a case study in scientometrics

Article 19 January 2024

References

Ahlgren, P., Javerning, B., & Rousseau, R. (2003). Requirements for a co-citation similarity measure, with special references to Pearson’s correlation coefficient. Journal of the American Society for Information Science and Technology, 54(6), 550–560.
Article Google Scholar
Bennasar, M., Hicks, Y., & Setchi, R. (2015). Feature selection using joint mutual information maximisation. Expert Systems with Applications, 42(22), 8520–8532.
Article Google Scholar
Berger, A., & Lafferty, J. (2017). Information retrieval as statistical translation. ACM SIGIR Forum, 51(2), 219–226.
Article Google Scholar
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken: Wiley.
MATH Google Scholar
Egghe, L. (2010). Good properties of similarity measures and their complementarity. Journal of the American Society for Information Science and Technology, 61(10), 2151–2160.
Article Google Scholar
Fiedor, P. (2014). Networks in financial markets based on the mutual information rate. Physical Review E, 89(5), 052801.
Article Google Scholar
Gao, S., Ver Steeg, G., & Galstyan, A. (2015). Efficient estimation of mutual information for strongly dependent variables. In Artificial intelligence and statistics (pp. 277–286).
Hausser, J., & Strimmer, K. (2014). Entropy: Estimation of entropy, mutual information and related quantities. R package version, 1(1).
Leydesdorff, L. (2008). On the normalization and visualization of author co-citation data: Salton’s Cosine versus the Jaccard index. Journal of the American Society for Information Science and Technology, 59(1), 77–85.
Article Google Scholar
Leydesdorff, L., & Vaughan, L. (2006). Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment. Journal of the American Society for Information Science and technology, 57(12), 1616–1628.
Article Google Scholar
Megnigbeto, E. (2013). Controversies arising from which similarity measures can be used in co-citation analysis. Malaysian Journal of Library & Information Science, 18(2), 25–31.
Google Scholar
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15(6), 1191–1253.
Article MATH Google Scholar
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(379–423), 623–656.
Article MathSciNet MATH Google Scholar
Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 100(5), 401–409.
Article Google Scholar
Van Eck, N. J., & Waltman, L. (2008). Appropriate similarity measures for author co-citation analysis. Journal of the American Society for Information Science and Technology, 59(10), 1653–1661.
Article Google Scholar
White, H. D. (2003). Author cocitation analysis and Pearson’s r. Journal of the American Society for Information Science and Technology, 54(13), 1250–1259.
Article Google Scholar
White, H. D., & Griffith, B. C. (1981). Author cocitation: A literature measure of intellectual structure. Journal of the American Society for Information Science, 32, 163–171.
Article Google Scholar
Zhang, Z. (2012). Entropy estimation in Turing’s perspective. Neural Computation, 24(5), 1368–1389.
Article MathSciNet MATH Google Scholar
Zhang, Z., & Zheng, L. (2015). A mutual information estimator with exponentially decaying bias. Statistical Applications in Genetics and Molecular Biology, 14(3), 243–252.
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Mathematics, Western Kentucky University, Bowling Green, KY, USA
Lukun Zheng

Authors

Lukun Zheng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lukun Zheng.

Appendices

The Estimation of MIPM

Assume that we have M authors, then the mutual information proximity matrix (MIPM) is an \(M\times M\) matrix consisting of \(M\times M-M\) off-diagonal entries and M diagonal entries. The off-diagonal entries are the mutual information, defined in Eq. (1), between all pairs from these M authors. The diagonal entries are the entropy, defined in Eq. (3), of all the M authors. These entries are as follows:

Off-diagonal entries \(\mathbb {MI}(X_s,\ X_t),\ s=1,\dots , M,\ t=1,\dots ,M,\ s\ne t\).
Diagonal entries \(\mathbb {MI}(X_m,X_m)=H(X_m),\ m=1\ldots , M\).

The mutual information \(\mathbb {M}(X_s,X_t)\) and entropy \(H(X_m)\) in the mutual information proximity matrix can be estimated using (2) and (4) respectively, given the corresponding co-citation frequency distribution tables (like Table 1) and citation frequency distribution tables (like Table 2).

For illustration purpose, we will show the procedure on how to estimate the off-diagonal entries and diagonal entries based on Tables 1 and 2, respectively.

Off-diagonal entries

Assume that the co-citation frequency distribution table between the first and second oeuvres is given by Table 1. We can estimate the mutual information \(\mathbb {M}(X_1,X_2)\) between them by the following procedure.

1.
Obtain the total co-citation counts: \(n=\sum _{i}\sum _{j}f_{ij}=230\).
2.
Calculate the joint and marginal relative frequencies: \(\hat{p}_{ij}=f_{ij}/230\), \(\hat{p}_{i,\cdot }=\sum _j\hat{p}_{ij}\), and \(\hat{p}_{\cdot ,j}=\sum _i\hat{p}_{ij}\).
3.
Plug in all these values in Eq. (2) (in “Methods’ section) and calculate the value of \(\widehat{\mathbb {M}}(X_1,X_2)=0.2197\).

Since MIPM is symmetric, we have \(\widehat{\mathbb {M}}(X_2,X_1)=\widehat{\mathbb {M}}(X_1,X_2)=0.2197\). Hence, given one co-citation frequency distribution table, we will obtain the estimates of two entries of the MIPM.

We can use the procedure above repeatedly to find the estimates of other off-diagonal entries of the MIPM, given the corresponding co-citation frequency distribution tables.

Diagonal entries

Assume that the citation frequency distribution table of the m-th oeuvre is given by Table 2. We can calculate the m-th diagonal entry of the MIPM by the following procedure.

1.
Obtain the total citation counts: \(n_m=\sum _{i}f_{mi}=92\).
2.
Calculate the relative frequencies: \(\hat{p}_{mi}=f_{mi}/92\).
3.
Plug in all these values in Eq. (4) (in “Methods’ section) and calculate the value of \(\hat{H}(X_m)=1.6802\).

Similarly, we can use the procedure above repeatedly to find the estimates of other diagonal entries of the MIPM, given the corresponding citation frequency distribution tables.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zheng, L. Using mutual information as a cocitation similarity measure. Scientometrics 119, 1695–1713 (2019). https://doi.org/10.1007/s11192-019-03098-9

Download citation

Received: 30 December 2018
Published: 13 April 2019
Issue Date: 15 June 2019
DOI: https://doi.org/10.1007/s11192-019-03098-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Using mutual information as a cocitation similarity measure

Abstract

Access this article

Similar content being viewed by others

Paper Co-citation Analysis Using Semantic Similarity Measures

MACA: a modified author co-citation analysis method combined with general descriptive metadata of citations

Co-citation analysis between coupler authors of a scientific domain’s citation identity: a case study in scientometrics

References

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendices

The Estimation of MIPM

Off-diagonal entries

Diagonal entries

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Using mutual information as a cocitation similarity measure

Abstract

Access this article

Similar content being viewed by others

Paper Co-citation Analysis Using Semantic Similarity Measures

MACA: a modified author co-citation analysis method combined with general descriptive metadata of citations

Co-citation analysis between coupler authors of a scientific domain’s citation identity: a case study in scientometrics

References

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendices

The Estimation of MIPM

Off-diagonal entries

Diagonal entries

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation