A boosted-trees method for name disambiguation

Wang, Jian; Berzins, Kaspars; Hicks, Diana; Melkers, Julia; Xiao, Fang; Pinheiro, Diogo

doi:10.1007/s11192-012-0681-1

Published: 29 February 2012

Scientometrics, Volume 93, pages 391–411 (2012)

Abstract

This paper proposes a method for separating the true papers of a set of focal scientists from the false papers of homonymous authors in bibliometric research. It directly addresses the problem of identifying papers that are not associated with a given author (“false” papers). The proposed method has four steps: name and affiliation filtering, similarity score construction, author screening, and boosted-trees classification. Because this is a methodological paper, we calculate error rates for our technique, which required ascertaining the correct attribution of each paper. To do this, we constructed a small dataset of 4,253 papers allegedly belonging to a random sample of 100 authors. We apply the boosted-trees algorithm to classify the papers of authors with a total false rate no higher than 30% (i.e., 3,862 papers of 91 authors). A one-run experiment achieves a testing misclassification error of 0.55%, a testing recall of 99.84%, and a testing precision of 99.60%. A 50-run experiment yields a median testing classification error of 0.78% and a mean of 0.75%. Among the 90 authors in the testing set (one author appeared only in the training set), the algorithm reduces the false rate to zero for 86 authors and misclassifies just one or two papers for each of the remaining four.
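The three test metrics reported above (misclassification error, recall, and precision) can be derived from a confusion matrix. The following is a minimal sketch, assuming that a "true" paper is the positive class; the function name and the toy labels are hypothetical and are not taken from the paper.

```python
def classification_metrics(actual, predicted):
    """Return (misclassification error, recall, precision).

    Labels are booleans: True marks a paper that belongs to the focal
    author (assumed positive class), False marks a homonym's paper.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and equal in length")
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)          # true papers kept
    fp = sum((not a) and p for a, p in pairs)    # false papers kept
    fn = sum(a and (not p) for a, p in pairs)    # true papers rejected
    error = (fp + fn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return error, recall, precision


# Toy data (made up): 8 papers, one false paper wrongly accepted.
actual = [True, True, True, True, True, True, False, False]
predicted = [True, True, True, True, True, True, True, False]
error, recall, precision = classification_metrics(actual, predicted)
```

In this toy case every true paper is retained, so recall is perfect, while the one wrongly accepted homonym paper lowers precision and raises the error.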

Acknowledgments

This work was supported through a U.S. National Science Foundation grant, “Women in Science and Engineering: Network Access, Participation, and Career Outcomes” (NSF Grant # REC-0529642). We thank the NSF for their support of this work, and the NETWISE team for its collective effort in the development of these data.

Author information

Authors and Affiliations

School of Public Policy, Georgia Institute of Technology, 685 Cherry Street, Atlanta, GA, 30332-0345, USA
Jian Wang, Kaspars Berzins, Diana Hicks, Julia Melkers, Fang Xiao & Diogo Pinheiro

Corresponding author

Correspondence to Jian Wang.

Appendix 1: real-world motivation and implementation

This study was motivated by a real-world problem confronting our research on the collaboration networks of American academic scientists. That research seeks to identify a clean publication set for each focal scientist and to link bibliometric data to survey data, thereby creating an extensive and rich dataset of academic scientists, their networks, and related productivity. We hope this methodological solution can inform similar problems faced by other researchers.

The data for this study come from a 2006–2009 NSF-funded national study of academic scientists and engineers in Research I universities in the United States (Women in Science and Engineering: Network Access, Participation, and Career Outcomes, Grant # REC-0529642). The project involved a two-stage online survey; collection of CV data for survey respondents, including a complete history of each focal scientist’s affiliations; and collection of lifetime bibliometric data for each focal author, which required a name and affiliation match. The objective of the name disambiguation method presented in this paper was to clean 54,853 papers of 1,315 focal scientists in five disciplines: biology (BIOL), chemistry (CHEM), computer science (CS), earth and atmospheric sciences (EAS), and electrical engineering (EE).

For the development dataset of 100 authors, we first excluded scientists with fewer than five publications, because the similarity score might be unstable if the focal author has very few papers. We then randomly sampled 100 authors from the remaining 1,255.
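The sampling procedure described above can be sketched as a filter followed by a simple random sample. This is only an illustration: the author identifiers, the synthetic paper counts, and the fixed seed are assumptions, and the paper does not say which tool was used for sampling.

```python
import random

MIN_PAPERS = 5      # authors with fewer papers are excluded (from the text)
SAMPLE_SIZE = 100   # size of the development sample (from the text)


def sample_development_authors(paper_counts, seed=0):
    """Drop authors with fewer than MIN_PAPERS papers, then sample at random.

    paper_counts maps a (hypothetical) author id to that author's paper count.
    """
    eligible = sorted(a for a, n in paper_counts.items() if n >= MIN_PAPERS)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(eligible, SAMPLE_SIZE)  # sampling without replacement


# Synthetic stand-in for 1,315 focal scientists, 60 of them below the threshold.
paper_counts = {f"author_{i}": (3 if i < 60 else 20) for i in range(1315)}
sample = sample_development_authors(paper_counts)
```

Sorting the eligible authors before sampling keeps the result independent of dictionary ordering, so the same seed always yields the same sample.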

The method presented in this paper was implemented to clean the 54,853 papers of 1,315 authors. Of these, 4,253 papers from 100 authors were manually checked for supervised learning. Of the remainder, the algorithm automatically classified 44,777 papers of 1,025 authors; 156 papers of 60 authors were left for manual checking because these authors had fewer than five papers; and 5,667 papers of 130 authors were left for manual checking because these authors were predicted to have high false rates. Overall, the algorithm reduced the labor required for manual data cleaning by about 80% (44,777 of 54,853 papers were cleaned automatically).
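As a quick consistency check on the figures above, the four groups should add up to the full dataset, and the automatically classified share should come out near the stated 80%. The short sketch below uses only the numbers reported in this appendix; the group labels are informal descriptions.

```python
# (papers, authors) for each group, as reported in the appendix.
groups = {
    "manually checked for supervised learning": (4253, 100),
    "classified automatically": (44777, 1025),
    "manual check: fewer than five papers": (156, 60),
    "manual check: high predicted false rate": (5667, 130),
}

total_papers = sum(papers for papers, _ in groups.values())
total_authors = sum(authors for _, authors in groups.values())

# Share of all papers that needed no manual cleaning.
auto_share = groups["classified automatically"][0] / total_papers

print(total_papers, total_authors, f"{auto_share:.1%}")
```

The groups sum to 54,853 papers and 1,315 authors, and the automatic share is a little under 82%, which agrees with the "about 80%" reduction in manual effort.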

About this article

Cite this article

Wang, J., Berzins, K., Hicks, D. et al. A boosted-trees method for name disambiguation. Scientometrics 93, 391–411 (2012). https://doi.org/10.1007/s11192-012-0681-1

Received: 11 October 2011
Published: 29 February 2012
Issue Date: November 2012
DOI: https://doi.org/10.1007/s11192-012-0681-1
