Large-scale name disambiguation of Chinese patent inventors (1985–2016)

  • Deyun Yin
  • Kazuyuki Motohashi
  • Jianwei Dang


This study presents the first systematic disambiguation of Chinese patent inventors in the State Intellectual Property Office of China patent database from 1985 to 2016. Using a list of 66,248 inventors with rare names and hand-labeled data on 1465 inventors, our supervised learning algorithm identified 3.99 million unique inventors from 1.84 million Chinese names referring to 14.68 million patent-inventor records. We developed a method for constructing high-quality training data from a third-party rare name list and provide evidence for its reliability in settings where large-scale, representative hand-labeled data is crucial but expensive to obtain. To optimize clustering results on a large-scale dataset with a highly unbalanced distribution, we also modified robust single linkage by constraining the maximum distance within the generated clusters. Depending on the training and testing data and on the clustering parameters, our algorithm yields F1 scores of up to 93.36% before clustering and 99.10% after clustering, with final splitting errors of 1.05–1.34% and lumping errors of 0.21–0.83%. We also applied this framework to standardize applicants’ names according to their text similarity and geographical information, based on high-resolution geocoding data for all addresses within mainland China.
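
To illustrate why a cap on the maximum within-cluster distance matters, the following is a minimal sketch and not the authors' implementation. It uses plain single linkage rather than robust single linkage, and the function name `constrained_single_linkage` and the parameters `merge_threshold` and `max_diameter` are hypothetical names introduced only for this example. Records are merged from the closest pair outward, but a merge is refused when the resulting cluster would contain two records farther apart than `max_diameter`.

```python
from itertools import combinations


def constrained_single_linkage(dist, merge_threshold, max_diameter):
    """Single-linkage clustering with a cap on cluster diameter.

    dist: symmetric n x n matrix (list of lists) of pairwise distances.
    merge_threshold: pairs farther apart than this never trigger a merge.
    max_diameter: largest pairwise distance allowed inside one cluster
                  (hypothetical parameter, assumed for this sketch).
    Returns a sorted list of clusters, each a sorted list of record indices.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member records
    label = list(range(n))                 # record index -> cluster id

    # Visit pairs from closest to farthest, as single linkage does.
    pairs = sorted(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])
    for i, j in pairs:
        if dist[i][j] > merge_threshold:
            break  # every remaining pair is at least this far apart
        a, b = label[i], label[j]
        if a == b:
            continue  # already in the same cluster
        merged = clusters[a] + clusters[b]
        # Constraint: largest pairwise distance inside the merged cluster.
        diameter = max(dist[u][v] for u, v in combinations(merged, 2))
        if diameter > max_diameter:
            continue  # refuse a merge that would chain distant records
        clusters[a] = merged
        for r in clusters.pop(b):
            label[r] = a
    return sorted(sorted(c) for c in clusters.values())


# Four records on a line, each 3 units from its neighbor.
points = [0, 3, 6, 9]
dist = [[abs(p - q) for q in points] for p in points]
print(constrained_single_linkage(dist, merge_threshold=4, max_diameter=7))
# prints [[0, 1, 2], [3]]
```

Without the cap (for example `max_diameter=float("inf")`), the same data collapses into one cluster through chaining, which is the kind of lumping a diameter constraint is meant to limit. Recomputing the diameter at every merge is quadratic in cluster size, so this sketch is suitable only for small blocks of records.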


Disambiguation · Patent · Inventor · Machine learning · Gradient boosting decision tree · Single linkage



This work is mainly supported by the Research Institute of Economy, Trade and Industry (RIETI) under the project Empirical Analysis of Innovation Ecosystems in Advancement of the Internet of Things (IoT), the National Natural Science Foundation of China (NSFC, Nos. 71704025; 71503123), and the Scientific Cooperation Program between NSFC and the Japan Society for the Promotion of Science (No. 71711540044). We also appreciate the editors’ diligent work and the insightful and inspiring comments from two anonymous reviewers, Dr. Kenta Ikeuchi, and Mr. Zhao An.



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2019

Authors and Affiliations

  1. Department of Technology Management for Innovation, School of Engineering, The University of Tokyo, Tokyo, Japan
  2. Shanghai International College of Intellectual Property, Tongji University, Shanghai, China
