A boosted-trees method for name disambiguation

Scientometrics

Abstract

This paper proposes a method for distinguishing the true papers of a set of focal scientists from the false papers of homonymous authors in bibliometric research processes. It directly addresses the problem of identifying papers that are not associated with a given author (“false” papers). The proposed method has four steps: name and affiliation filtering, similarity score construction, author screening, and boosted trees classification. In this methodological paper we calculate error rates for our technique, which required us to ascertain the correct attribution of each paper. To do this we constructed a small dataset of 4,253 papers allegedly belonging to a random sample of 100 authors. We apply the boosted trees algorithm to classify the papers of authors with a total false rate no higher than 30% (i.e. 3,862 papers of 91 authors). A one-run experiment achieves a testing misclassification error of 0.55%, a testing recall of 99.84%, and a testing precision of 99.60%. A 50-run experiment shows that the median testing classification error is 0.78% and the mean is 0.75%. Among the 90 authors in the testing set (one author appeared only in the training set), the algorithm successfully reduces the false rate to zero for 86 authors and misclassifies just one or two papers for each of the remaining four authors.
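For readers less familiar with the reported metrics, the following minimal sketch shows how misclassification error, recall, and precision can be computed from a confusion matrix. It assumes that true papers are the positive class, which the abstract does not state explicitly. The counts are invented for illustration and are not the paper's data, and all names (`Confusion`, `example`, and the three functions) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts, assuming 'true paper' is the positive class."""
    tp: int  # true papers classified as true
    fp: int  # false papers classified as true
    fn: int  # true papers classified as false
    tn: int  # false papers classified as false


def misclassification_error(c: Confusion) -> float:
    # Share of all papers that were put in the wrong class.
    return (c.fp + c.fn) / (c.tp + c.fp + c.fn + c.tn)


def recall(c: Confusion) -> float:
    # Share of the truly "true" papers that the classifier kept.
    return c.tp / (c.tp + c.fn)


def precision(c: Confusion) -> float:
    # Share of the papers kept by the classifier that are truly "true".
    return c.tp / (c.tp + c.fp)


# Illustrative counts only (not taken from the paper).
example = Confusion(tp=90, fp=2, fn=1, tn=7)
print(misclassification_error(example), recall(example), precision(example))
```

When most papers are true, as in the screened dataset, a low error rate alone can hide poor detection of false papers, which may be why recall and precision are reported alongside it.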

Acknowledgments

This work was supported through a U.S. National Science Foundation grant, “Women in Science and Engineering: Network Access, Participation, and Career Outcomes” (NSF Grant # REC-0529642). We thank the NSF for their support of this work, and the NETWISE team for its collective effort in the development of these data.

Author information

Corresponding author

Correspondence to Jian Wang.

Appendix 1: real-world motivation and implementation

This study was motivated by a real-world problem in our research on the collaboration networks of American academic scientists. That research attempts to identify a clean publication set for each focal scientist and to link bibliometric data to survey data, thereby creating an extensive and rich dataset of academic scientists, their networks, and related productivity. We hope that this methodological solution can inform similar problems faced by other researchers.

The data for this study come from a 2006–2009 NSF-funded national study of academic scientists and engineers in Research I universities in the United States (Women in Science and Engineering: Network Access, Participation, and Career Outcomes, Grant # REC-0529642). The project involved a two-stage online survey; collection of CV data for survey respondents, including a complete history of each focal scientist’s affiliations; and collection of lifetime bibliometric data for each focal author, which required a name and affiliation match. The objective of the name disambiguation method presented in this paper was to clean 54,853 papers of 1,315 focal scientists in five disciplines: biology (BIOL), chemistry (CHEM), computer science (CS), earth and atmospheric sciences (EAS), and electrical engineering (EE).

For the development dataset of 100 authors, we first excluded scientists with fewer than five publications, because the similarity score might be unstable if the focal author has very few papers. We then randomly sampled 100 authors from the remaining 1,255 authors.
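The selection of the development dataset can be sketched as a filter followed by a simple random sample. This is an illustration under stated assumptions, not the procedure actually run for the study: the function name, the seed, and the synthetic `paper_counts` mapping are hypothetical, and only the five-paper minimum, the sample size of 100, and the author totals come from the text.

```python
import random


def sample_development_authors(paper_counts, min_papers=5, sample_size=100, seed=0):
    """Drop authors with fewer than `min_papers` papers, then sample at random.

    `paper_counts` maps an author identifier to that author's number of papers.
    The seed is only there to make the illustration reproducible.
    """
    eligible = sorted(a for a, n in paper_counts.items() if n >= min_papers)
    rng = random.Random(seed)
    return eligible, rng.sample(eligible, sample_size)


# Synthetic stand-in for the real data: 1,315 authors, 60 of whom have
# fewer than five papers (paper counts themselves are invented).
paper_counts = {f"author_{i}": (3 if i < 60 else 10) for i in range(1315)}
eligible, development_authors = sample_development_authors(paper_counts)
print(len(eligible), len(development_authors))
```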

The method presented in this paper was implemented to clean the 54,853 papers of 1,315 authors. Of these, 4,253 papers from 100 authors were manually checked for supervised learning. For the remaining papers, the algorithm automatically classified 44,777 papers of 1,025 authors; 156 papers of 60 authors were left for manual checking because these authors had fewer than five papers; and 5,667 papers of 130 authors were left for manual checking because these authors were predicted to have high false rates. Overall, the algorithm reduced the labor required for manual data cleaning by about 80% (44,777 out of 54,853 papers were automatically cleaned).
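The counts above can be cross-checked with a short tally, and the routing of authors into the three handling paths can be sketched as a simple rule. The group sizes are those reported in this appendix. The `route_author` function is a hypothetical illustration: it assumes the 30% false-rate threshold from the abstract is applied to a predicted false rate, and it does not reproduce how that prediction is made.

```python
# (papers, authors) per group, as reported in this appendix.
groups = {
    "manually checked development set": (4253, 100),
    "automatically classified": (44777, 1025),
    "manual: fewer than five papers": (156, 60),
    "manual: high predicted false rate": (5667, 130),
}

total_papers = sum(papers for papers, _ in groups.values())
total_authors = sum(authors for _, authors in groups.values())
auto_share = groups["automatically classified"][0] / total_papers


def route_author(n_papers, predicted_false_rate, min_papers=5, max_false_rate=0.30):
    """Hypothetical routing rule; thresholds are assumptions taken from the text."""
    if n_papers < min_papers:
        return "manual: fewer than five papers"
    if predicted_false_rate > max_false_rate:
        return "manual: high predicted false rate"
    return "automatically classified"


print(total_papers, total_authors, f"{auto_share:.1%}")  # 54853 1315 81.6%
```

The four groups sum to the stated totals of 54,853 papers and 1,315 authors, and the automatically classified share is about 81.6%, consistent with the "about 80%" figure.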

Cite this article

Wang, J., Berzins, K., Hicks, D. et al. A boosted-trees method for name disambiguation. Scientometrics 93, 391–411 (2012). https://doi.org/10.1007/s11192-012-0681-1

  • DOI: https://doi.org/10.1007/s11192-012-0681-1
