Improving co-authorship network structures by combining multiple data sources: evidence from Italian academic statisticians

Scientometrics


The aim of the present contribution is to merge bibliographic data for members of a bounded scientific community in order to derive a complete, unified archive covering both top international and nationally oriented production, as a new basis for network analysis of a unified co-authorship network. A two-step procedure is used to identify duplicate records and to disambiguate author names. For the second step, we drew heavily on a well-established unsupervised disambiguation method from the literature, which follows a network-based approach and requires only a restricted set of record attributes. Evidence from Italian academic statisticians is provided by merging data from three bibliographic archives. Non-negligible differences in network results were observed when comparing the disambiguated and non-disambiguated data sets, especially in network measures at the individual level.

  1. Two international databases, one general (WoS) and one thematic (Current Index to Statistics, CIS), were considered, together with bibliographic information retrieved from the Italian Ministry of University and Research (MIUR) database of nationally funded research projects (PRIN).

  2. As of December 2014, the population size was 722.

  3. Although PRIN funding was launched in 1996, information on funded projects has been released only since the year 2000.

  4. For a deeper discussion of this problem, see Bilenko et al. (2003).

  5. For instance, Lee et al. (2005) and Santana et al. (2015) assumed different weights according to the discriminative power of the attributes.

  6. A co-authorship network is derived from the matrix product \(\mathbf {Y}=\mathbf {A}\mathbf {A}'\), where \(\mathbf {A}\) is an \(n \times p\) affiliation matrix with elements \(a_{ik} = 1\) if author \(i \in \mathcal {N}\) authored publication \(k \in {\mathcal {P}}\), and 0 otherwise. The matrix \(\mathbf {Y}\) is the undirected, valued \(n \times n\) adjacency matrix whose element \(y_{ij}\) is greater than 0 if \(i,\,j \in \mathcal {N}\) co-authored one or more publications in \(\mathcal {P}\), and 0 otherwise. Our analysis used the binary version of \(\mathbf {Y}\), obtained by setting all entries greater than zero to 1.
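
As an illustration of note 6, the following is a minimal sketch (not the authors' code) of how the valued and binary adjacency matrices could be computed with NumPy. The function name `coauthorship_adjacency` and the toy affiliation matrix are hypothetical. Zeroing the diagonal is an assumption made here: the note does not say how the diagonal of \(\mathbf {Y}\), which counts each author's own publications, is treated.

```python
import numpy as np

def coauthorship_adjacency(A):
    """Return (Y, B) for an n x p author-by-publication affiliation matrix A.

    Y = A A' is the valued adjacency matrix; B is its binary version.
    """
    A = np.asarray(A, dtype=int)
    Y = A @ A.T                  # y_ij = number of publications shared by i and j
    B = (Y > 0).astype(int)      # binarise: any shared publication becomes 1
    np.fill_diagonal(B, 0)       # assumption: ignore self-ties on the diagonal
    return Y, B

# Hypothetical toy data: 4 authors (rows), 3 publications (columns).
A = np.array([
    [1, 1, 0],   # author 0 wrote publications 0 and 1
    [1, 0, 0],   # author 1 wrote publication 0
    [0, 1, 1],   # author 2 wrote publications 1 and 2
    [0, 0, 0],   # author 3 has no publications in the archive (isolate)
])

Y, B = coauthorship_adjacency(A)
print(Y)
print(B)
```

In this toy case author 0 is tied to authors 1 and 2, while author 3 remains an isolate. The diagonal of the valued matrix holds each author's publication count.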


  • Albert, R., & Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 47.

  • Baxter, R., Christen, P., & Churches, T. (2003). A comparison of fast blocking methods for record linkage. In ACM KDD Workshops (Vol. 3, pp. 25–27).

  • Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., & Fienberg, S. (2003). Adaptive name matching in information integration. IEEE Intelligent Systems, 5, 16–23.

  • Christen, P. (2008). Febrl: An open source data cleaning, deduplication and record linkage system with a graphical user interface. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1065–1068. ACM.

  • Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9), 1537–1555.

  • Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9), 1853–1870.

  • Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2–3), 81–227.

  • Cuxac, P., Lamirel, J.-C., & Bonvallot, V. (2013). Efficient supervised and semi-supervised approaches for affiliations disambiguation. Scientometrics, 97(1), 47–58.

  • de Carvalho, A. P., Ferreira, A. A., Laender, A. H., & Gonçalves, M. A. (2011). Incremental unsupervised name disambiguation in cleaned digital libraries. Journal of Information and Data Management, 2(3), 289.

  • De Stefano, D., Fuccella, V., Vitale, M. P., & Zaccarin, S. (2013). The use of different data sources in the analysis of co-authorship networks and scientific performance. Social Networks, 35(3), 370–381.

  • De Stefano, D., & Zaccarin, S. (2016). Co-authorship networks and scientific performance: An empirical analysis using the generalized extreme value distribution. Journal of Applied Statistics, 43(1), 262–279.

  • Domingo-Ferrer, J., & Torra, V. (2003). Disclosure risk assessment in statistical microdata protection via advanced record linkage. Statistics and Computing, 13(4), 343–354.

  • Dong, X., Halevy, A., & Madhavan, J. (2005). Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD international conference on management of data, pp. 85–96. ACM.

  • Durham, E., Xue, Y., Kantarcioglu, M., & Malin, B. (2012). Quantifying the correctness, computational complexity, and security of privacy-preserving string comparators for record linkage. Information Fusion, 13(4), 245–259.

  • Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.

  • Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. (2012). A brief survey of automatic methods for author name disambiguation. ACM Sigmod Record, 41(2), 15–26.

  • Goyal, S., Van Der Leij, M. J., & Moraga-González, J. L. (2006). Economics: An emerging small world. Journal of Political Economy, 114(2), 403–412.

  • Gurney, T., Horlings, E., & Van Den Besselaar, P. (2011). Author disambiguation using multi-aspect similarity indicators. Scientometrics, 91(2), 435–449.

  • Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Digital Libraries, 2004. Proceedings of the 2004 joint ACM/IEEE conference on, pp. 296–305. IEEE.

  • Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a k-way spectral clustering method. In Digital Libraries, 2005. JCDL’05. Proceedings of the 5th ACM/IEEE-CS joint conference on, pp. 334–343. IEEE.

  • Hernandez, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. ACM Sigmod Record, 24(2), 127–138.

  • Hicks, D. (1999). The difficulty of achieving full coverage of international social science literature and the bibliometric consequences. Scientometrics, 44(2), 193–215.

  • Imran, M., Gillani, S., & Marchese, M. (2013). A real-time heuristic-based unsupervised method for name disambiguation in digital libraries. D-Lib Magazine, 19(9), 1.

  • Kang, I.-S., Na, S.-H., Lee, S., Jung, H., Kim, P., Sung, W.-K., et al. (2009). On co-authorship for author disambiguation. Information Processing and Management, 45(1), 84–97.

  • Lee, D., On, B.-W., Kang, J., & Park, S. (2005). Effective and scalable solutions for mixed and split citation problems in digital libraries. In Proceedings of the 2nd international workshop on Information quality in information systems, pp. 69–76. ACM.

  • Li, G.-C., Lai, R., D’Amour, A., Doolin, D. M., Sun, Y., Torvik, V. I., et al. (2014). Disambiguation and co-authorship networks of the US patent inventor database (1975–2010). Research Policy, 43(6), 941–955.

  • Liseo, B., Montanari, G. E., & Torelli, N. (2006). Metodi statistici per l’integrazione di dati da fonti diverse (Vol. 412). Milan: FrancoAngeli.

  • Milojević, S. (2013). Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics, 7(4), 767–773.

  • Moody, J. (2004). The structure of a social science collaboration network: Disciplinary cohesion from 1963 to 1999. American Sociological Review, 69(2), 213–238.

  • Newman, M. E. (2004). Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences, 101(suppl 1), 5200–5205.

  • Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503–520.

  • Sadinle, M., Hall, R., & Fienberg, S. E. (2011). Approaches to multiple record linkage. In Proceedings of International Statistical Institute (Vol. 260).

  • Santana, A. F., Gonçalves, M. A., Laender, A. H., & Ferreira, A. A. (2015). On the combination of domain-specific heuristics for author name disambiguation: The nearest cluster method. International Journal on Digital Libraries, 16(3–4), 229–246.

  • Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 43(1), 1–43.

  • Strotmann, A., Zhao, D., & Bubela, T. (2009). Author name disambiguation for collaboration network analysis and visualization. Proceedings of the American Society for Information Science and Technology, 46(1), 1–20.

  • Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158.

  • Veloso, A., Ferreira, A. A., Gonçalves, M. A., Laender, A. H., & Meira, W. (2012). Cost-effective on-demand associative author name disambiguation. Information Processing and Management, 48(4), 680–697.

  • Ventura, S. L., Nugent, R., & Fuchs, E. R. (2015). Seeing the non-stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records. Research Policy, 44(9), 1672–1701.

  • Wu, H., Li, B., Pei, Y., & He, J. (2014). Unsupervised author disambiguation using Dempster–Shafer theory. Scientometrics, 101(3), 1955–1972.

  • Wu, J., & Ding, X.-H. (2013). Author name disambiguation in scientific collaboration and mobility cases. Scientometrics, 96(3), 683–697.

  • Yan, S., Lee, D., Kan, M.-Y., & Giles, L. C. (2007). Adaptive sorted neighborhood methods for efficient record linkage. In Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries, pp. 185–194. ACM.


The authors would like to thank Andreas Strotmann for providing details on the algorithm code adopted in Strotmann et al. (2009).

Author information

Corresponding author

Correspondence to Vittorio Fuccella.

Cite this article

Fuccella, V., De Stefano, D., Vitale, M.P. et al. Improving co-authorship network structures by combining multiple data sources: evidence from Italian academic statisticians. Scientometrics 107, 167–184 (2016).
