Generating automatically labeled data for author name disambiguation: an iterative clustering method

Scientometrics

Abstract

To train algorithms for supervised author name disambiguation, many studies have relied on hand-labeled truth data that are very laborious to generate. This paper shows that labeled data can be automatically generated using information features such as email address, coauthor names, and cited references that are available from publication records. For this purpose, high-precision rules for matching name instances on each feature are decided using an external-authority database. Then, selected name instances in target ambiguous data go through the process of pairwise matching based on the rules. Next, they are merged into clusters by a generic entity resolution algorithm. The clustering procedure is repeated over other features until further merging is impossible. Tested on 26 K instances out of the population of 228 K author name instances, this iterative clustering produced accurately labeled data with pairwise F1 = 0.99. The labeled data represented the population data in terms of name ethnicity and co-disambiguating name group size distributions. In addition, trained on the labeled data, machine learning algorithms disambiguated 24 K names in test data with performance of pairwise F1 = 0.90–0.92. Several challenges are discussed for applying this method to resolving author name ambiguity in large-scale scholarly data.
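The procedure summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature names, the `RULES` thresholds, the record layout, and the function names are all hypothetical, and the merge loop is a simple stand-in for the generic entity resolution algorithm the paper refers to. It also assumes all instances already belong to one co-disambiguating name group.

```python
from itertools import combinations

# Hypothetical high-precision rules: minimum number of shared values
# required for two clusters to match on each feature.
RULES = {"email": 1, "coauthors": 2, "references": 2}


def matches(a, b, feature, min_shared):
    """Two clusters match on a feature if they share enough values."""
    return len(a[feature] & b[feature]) >= min_shared


def merge(a, b):
    """Merge two clusters by taking the union of ids and feature values."""
    merged = {"ids": a["ids"] | b["ids"]}
    for feature in RULES:
        merged[feature] = a[feature] | b[feature]
    return merged


def iterative_clustering(instances):
    """Merge name instances feature by feature until nothing more merges."""
    clusters = [
        {"ids": {iid}, **{f: set(rec.get(f, ())) for f in RULES}}
        for iid, rec in instances.items()
    ]
    merged_any = True
    while merged_any:  # repeat over all features until a full pass merges nothing
        merged_any = False
        for feature, min_shared in RULES.items():
            i = 0
            while i < len(clusters):
                j = i + 1
                while j < len(clusters):
                    if matches(clusters[i], clusters[j], feature, min_shared):
                        clusters[i] = merge(clusters[i], clusters.pop(j))
                        merged_any = True
                        j = i + 1  # the enlarged cluster may now match others
                    else:
                        j += 1
                i += 1
    return [sorted(c["ids"]) for c in clusters]


def pairwise_f1(predicted, truth):
    """Pairwise F1: harmonic mean of precision and recall over instance pairs."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

    pred, gold = pairs(predicted), pairs(truth)
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy name instances (made-up values) for one ambiguous name.
instances = {
    1: {"email": {"a@x.edu"}, "coauthors": {"lee k", "park s"}, "references": {"r1", "r2"}},
    2: {"email": {"a@x.edu"}, "coauthors": {"choi m"}, "references": {"r3"}},
    3: {"email": set(), "coauthors": {"choi m", "park s"}, "references": {"r9"}},
    4: {"email": {"b@y.org"}, "coauthors": {"smith j"}, "references": {"r7", "r8"}},
    5: {"email": set(), "coauthors": {"jones t"}, "references": {"r7", "r8"}},
}

labeled = iterative_clustering(instances)
print(labeled)
print(pairwise_f1(labeled, [[1, 2, 3], [4, 5]]))
```

In this toy run, instance 3 shares only one coauthor with instance 1 and one with instance 2, so it matches neither on its own. Once instances 1 and 2 are merged through their shared email address, the merged cluster shares two coauthors with instance 3 and absorbs it, which illustrates why clustering is repeated across features.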

Notes

  1. This paper distinguishes meanings of author, name, and name instance. An author refers to a distinct entity, a name to a textual string representing the author, and a name instance to an individual occurrence of the name in data. For example, an author (the distinguished professor Mark E. J. Newman at the University of Michigan Department of Physics) can be represented by one or more names (Mark Newman, M. E. J. Newman, etc.) that appear hundreds of times (i.e., instances) through his publication records in bibliometric data.

  2. Other than these three types, a few studies have used synthetic labeled data (e.g., Milojević 2013). Another notable labeling approach is to use the intersection of the disambiguation results produced by multiple algorithms (Vogel et al. 2014).

  3. This does not imply that only the data of Han et al. (2004) contain flaws; no other labeled data have been scrutinized for errors as intensively.

  4. https://www.crossref.org/.

  5. https://europepmc.org/.

  6. Note that the cluster [#2|#5] is indexed as j = 2, not j = 3 because the prior merging removes cluster2 [#1|#4] from clusterList.

  7. https://clarivate.com/products/web-of-science/web-science-form/web-science-core-collection/.

  8. https://clarivate.com/products/journal-citation-reports/.

  9. https://static.aminer.org/lab-datasets/citation/dblp.v10.zip.

  10. https://figshare.com/articles/ORCID_Public_Data_File_2017/5479792/1.

  11. https://doi.org/10.13012/B2IDB-9087546_V1.

  12. https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt.

  13. https://tartarus.org/martin/PorterStemmer/.

  14. Classifiers were implemented with parameter settings as follows: L2 Regularization with class weight = 1 (LR), Gaussian Naïve Bayes with maximum likelihood estimator (NB), and 500 trees (after grid search) with Gini Impurity for split quality measure (RF). For more details, see http://scikit-learn.org/stable/index.html.

  15. The hierarchical agglomerative clustering algorithm and overall training-test procedure were implemented by modifying codes in Louppe et al. (2016).

  16. http://dblp.org/faq/17334571.

References

  • Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222. https://doi.org/10.1002/asi.23329.

  • Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. F. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9), 1853–1870. https://doi.org/10.1002/asi.21363.

  • D’Angelo, C. A., Giuffrida, C., & Abramo, G. (2011). A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments. Journal of the American Society for Information Science and Technology, 62(2), 257–269. https://doi.org/10.1002/asi.21460.

  • Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2012). A brief survey of automatic methods for author name disambiguation. SIGMOD Record, 41(2), 15–26.

  • Ferreira, A. A., Veloso, A., Gonçalves, M. A., & Laender, A. H. F. (2014). Self-training author name disambiguation for information scarce scenarios. Journal of the Association for Information Science and Technology, 65(6), 1257–1278. https://doi.org/10.1002/asi.22992.

  • Gomide, J., Kling, H., & Figueiredo, D. (2017). Name usage pattern in the synonym ambiguity problem in bibliographic data. Scientometrics, 112(2), 747–766. https://doi.org/10.1007/s11192-017-2410-2.

  • Haak, L. L., Fenner, M., Paglione, L., Pentz, E., & Ratner, H. (2012). ORCID: A system to uniquely identify researchers. Learned Publishing, 25(4), 259–264. https://doi.org/10.1087/20120404.

  • Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the fourth ACM/IEEE joint conference on digital libraries (pp. 296–305). https://doi.org/10.1145/996350.996419.

  • Han, H., Xu, W., Zha, H., & Giles, C. L. (2005a). A hierarchical naive Bayes mixture model for name disambiguation in author citations. In Proceedings of the 2005 ACM symposium on applied computing (SAC ’05), Santa Fe, New Mexico.

  • Han, H., Zha, H. Y., & Giles, C. L. (2005b). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE joint conference on digital libraries (pp. 334–343). https://doi.org/10.1145/1065385.1065462.

  • Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a large-scale test set for author disambiguation. Information Processing and Management, 47(3), 452–465. https://doi.org/10.1016/j.ipm.2010.10.001.

  • Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886. https://doi.org/10.1007/s11192-018-2824-5.

  • Kim, J., & Diesner, J. (2016). Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. Journal of the Association for Information Science and Technology, 67(6), 1446–1461. https://doi.org/10.1002/asi.23489.

  • Kim, J., & Kim, J. (2018). The impact of imbalanced training data on machine learning for author name disambiguation. Scientometrics, 117(1), 511–526. https://doi.org/10.1007/s11192-018-2865-9.

  • Lerchenmueller, M. J., & Sorenson, O. (2016). Author disambiguation in PubMed: Evidence on the precision and recall of author-ity among NIH-funded scientists. PLoS ONE, 11(7), e0158731. https://doi.org/10.1371/journal.pone.0158731.

  • Levin, M., Krawczyk, S., Bethard, S., & Jurafsky, D. (2012). Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology, 63(5), 1030–1047. https://doi.org/10.1002/asi.22621.

  • Liu, W., Islamaj Dogan, R., Kim, S., Comeau, D. C., Kim, W., Yeganova, L., et al. (2014). Author name disambiguation for PubMed. Journal of the Association for Information Science and Technology, 65(4), 765–781. https://doi.org/10.1002/asi.23063.

  • Louppe, G., Al-Natsheh, H. T., Susik, M., & Maguire, E. J. (2016). Ethnicity sensitive author disambiguation using semi-supervised learning. Knowledge Engineering and Semantic Web (KESW 2016), 649, 272–287. https://doi.org/10.1007/978-3-319-45880-9_21.

  • Milojević, S. (2013). Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics, 7(4), 767–773.

  • Müller, M. C., Reitz, F., & Roy, N. (2017). Data sets for author name disambiguation: an empirical analysis and a new resource. Scientometrics, 111(3), 1467–1500. https://doi.org/10.1007/s11192-017-2363-5.

  • Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 404–409. https://doi.org/10.1073/pnas.021544898.

  • Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

  • Qian, Y., Zheng, Q., Sakai, T., Ye, J., & Liu, J. (2015). Dynamic author name disambiguation for growing digital libraries. Information Retrieval Journal, 18(5), 379–412. https://doi.org/10.1007/s10791-015-9261-3.

  • Santana, A. F., Gonçalves, M. A., Laender, A. H. F., & Ferreira, A. A. (2015). On the combination of domain-specific heuristics for author name disambiguation: The nearest cluster method. International Journal on Digital Libraries, 16(3–4), 229–246. https://doi.org/10.1007/s00799-015-0158-y.

  • Schulz, C., Mazloumian, A., Petersen, A. M., Penner, O., & Helbing, D. (2014). Exploiting citation networks for large-scale author name disambiguation. EPJ Data Science. https://doi.org/10.1140/epjds/s13688-014-0011-3.

  • Schulz, J. (2016). Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses. Scientometrics, 107(3), 1283–1298. https://doi.org/10.1007/s11192-016-1892-7.

  • Shin, D., Kim, T., Choi, J., & Kim, J. (2014). Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics, 100(1), 15–50. https://doi.org/10.1007/s11192-014-1289-4.

  • Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 43, 287–313.

  • Song, M., Kim, E. H. J., & Kim, H. J. (2015). Exploring author name disambiguation on PubMed-scale. Journal of Informetrics, 9(4), 924–941. https://doi.org/10.1016/j.joi.2015.08.004.

  • Strotmann, A., & Zhao, D. Z. (2012). Author name disambiguation: What difference does it make in author-based citation analysis? Journal of the American Society for Information Science and Technology, 63(9), 1820–1833. https://doi.org/10.1002/asi.22695.

  • Torvik, V. I. (2015). MapAffil: A Bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib magazine: The magazine of the Digital Library Forum. https://doi.org/10.1045/november2015-torvik.

  • Torvik, V. I., & Agarwal, S. (2016). Ethnea: An instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. In Paper presented at the library of congress international symposium on science of science, Washington, DC, USA. http://hdl.handle.net/2142/88927.

  • Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data. https://doi.org/10.1145/1552303.1552304.

  • Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. In Proceedings of the 2009 ACM/IEEE joint conference on digital libraries (pp. 39–48).

  • Vogel, T., Heise, A., Draisbach, U., Lange, D., & Naumann, F. (2014). Reach for gold: An annealing standard to evaluate duplicate detection results. Journal of Data and Information Quality, 5(1–2), 1–25. https://doi.org/10.1145/2629687.

  • Wang, J., Berzins, K., Hicks, D., Melkers, J., Xiao, F., & Pinheiro, D. (2012). A boosted-trees method for name disambiguation. Scientometrics, 93(2), 391–411. https://doi.org/10.1007/s11192-012-0681-1.

  • Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active name disambiguation. In Paper presented at the 2011 IEEE 11th international conference on data mining. http://ieeexplore.ieee.org/document/6137284/.

  • Whang, S. E., Menestrina, D., Koutrika, G., Theobald, M., & Garcia-Molina, H. (2009). Entity resolution with iterative blocking. In Proceedings of the 2009 ACM SIGMOD international conference on management of data.

Acknowledgements

This work was supported by grants from the National Science Foundation (#1561687 and #1535370), the Alfred P. Sloan Foundation, and the Ewing Marion Kauffman Foundation. We would like to thank the anonymous reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Jinseok Kim.

Appendices

Appendix A: Construction of self-citation relation

If a paper cites another paper, the two are in a citing-cited relation. From this paper-level citation information, scholars have constructed author-level citation relations. In Fig. 6, Author A and Author B coauthor Paper 1, while Author C and Author D together write Paper 2. If Paper 2 cites Paper 1 (paper-level citation), the authors of Paper 2 are assumed to cite the authors of Paper 1. Thus, Author C is depicted as citing Author A and Author B, and Author D as citing Author A and Author B. If Author C is the same person as Author A, the two are in a self-citation relation.

Fig. 6 An illustration of construction of self-citation relation
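The construction described in Appendix A can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary layout, the function names, and the use of a plain string as an author identity are hypothetical and are not taken from the paper.

```python
from itertools import product


def author_citations(paper_authors, paper_citations):
    """Expand paper-level citations into (citing author, cited author) pairs."""
    edges = []
    for citing, cited in paper_citations:
        # Every author of the citing paper is assumed to cite
        # every author of the cited paper.
        for pair in product(paper_authors[citing], paper_authors[cited]):
            edges.append(pair)
    return edges


def self_citations(edges):
    """Keep only the pairs in which an author cites himself or herself."""
    return [(a, b) for a, b in edges if a == b]


# The situation in Fig. 6: Paper 2 (Authors C and D) cites Paper 1 (Authors A and B).
paper_authors = {"Paper 1": ["A", "B"], "Paper 2": ["C", "D"]}
paper_citations = [("Paper 2", "Paper 1")]
edges = author_citations(paper_authors, paper_citations)
print(edges)

# If Author C turns out to be the same person as Author A:
same_person = {"Paper 1": ["A", "B"], "Paper 2": ["A", "D"]}
print(self_citations(author_citations(same_person, paper_citations)))
```

In ambiguous data the author identities are not known in advance, so in practice the comparison `a == b` would be replaced by a name-matching rule; that rule is outside the scope of this sketch.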

Appendix B: Representativeness checks for ORCID-linked data

This section checks how well the ORCID-linked data (Methodology > Data and Pre-processing > ORCID-Linkage) represent the whole data in this study. Following the method described in Representativeness Checks of Results, the ratios of name ethnicity and block size in the ORCID-linked data are compared to those in the whole data. Figure 7 shows that in the ORCID-linked data, Chinese names are under-represented and Hispanic and Italian names are over-represented, while names of other ethnicities show similar ratios. This observation contrasts with that from Fig. 1, where Chinese names are slightly over-represented and English names are slightly under-represented.

Fig. 7 Ratios of name ethnicity in ORCID-linked data (ORCIDs) compared to whole data

Regarding the block size distribution in Fig. 8, the plot for ORCIDs starts higher on the y-axis (= ratio) than that for Random Data but falls below it as the x-value (= block size) increases. This means that the ORCID-linked data contain more small blocks and fewer large blocks than a randomly selected subset with the same number of name instances, whereas the automatically labeled data produce a block size distribution quite similar to that of random data (Fig. 2).

Fig. 8 Cumulative ratios of block size on log–log scale
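A block size comparison of this kind can be sketched as below. The exact definition behind Fig. 8 is not given in this section, so the sketch rests on assumptions: a block is taken to be the set of name instances sharing a surname and first forename initial, names are assumed to be formatted as "Surname, Forename", and the cumulative ratio is read as the share of blocks whose size is at most x. The function names are hypothetical.

```python
from collections import Counter


def block_key(name):
    """Assumed blocking rule: surname plus first forename initial."""
    surname, _, forename = name.partition(",")
    return f"{surname.strip().lower()}_{forename.strip()[:1].lower()}"


def block_size_ratios(names):
    """Return (ratio per block size, cumulative ratio up to each block size)."""
    block_sizes = Counter(block_key(n) for n in names)  # block -> number of instances
    size_counts = Counter(block_sizes.values())         # block size -> number of blocks
    total_blocks = sum(size_counts.values())
    ratios, cumulative, running = {}, {}, 0
    for size in sorted(size_counts):
        ratios[size] = size_counts[size] / total_blocks
        running += size_counts[size]
        cumulative[size] = running / total_blocks
    return ratios, cumulative


# Made-up name instances: three blocks of sizes 3, 2, and 1.
names = [
    "Newman, Mark", "Newman, M. E. J.", "Newman, Mary",
    "Kim, Jinseok", "Kim, J.",
    "Lee, Sun",
]
ratios, cumulative = block_size_ratios(names)
print(ratios)
print(cumulative)
```

Computing the same two dictionaries for a labeled subset and for a random subset of equal size gives the two curves that a figure such as Fig. 8 would compare.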

Cite this article

Kim, J., Kim, J. & Owen-Smith, J. Generating automatically labeled data for author name disambiguation: an iterative clustering method. Scientometrics 118, 253–280 (2019). https://doi.org/10.1007/s11192-018-2968-3

  • DOI: https://doi.org/10.1007/s11192-018-2968-3
