Large-scale name disambiguation of Chinese patent inventors (1985–2016)
This study presents the first systematic disambiguation of Chinese patent inventors in the State Intellectual Property Office of China patent database from 1985 to 2016. Using a list of 66,248 inventors with rare names and a hand-labeled dataset of 1,465 inventors, our supervised learning algorithm identified 3.99 million unique inventors from 1.84 million Chinese names that refer to 14.68 million patent-inventor records. We developed a method for constructing high-quality training data from a third-party rare name list and provide evidence for its reliability in settings where large-scale, representative hand-labeled data is crucial but expensive to obtain. To optimize clustering results on a large-scale dataset with a highly unbalanced distribution, we also modified robust single linkage by constraining the maximum distance within the generated clusters. Across different training and testing data and different clustering parameters, our algorithm yields F1 scores of up to 93.36% before clustering and 99.10% after clustering, with final splitting errors of 1.05–1.34% and lumping errors of 0.21–0.83%. We also applied this framework to standardize applicants’ names according to their text similarity and geographical information, based on high-resolution geocoding data for all addresses within mainland China.
Keywords: Disambiguation; Patent; Inventor; Machine learning; Gradient boosting decision tree; Single linkage
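The clustering idea mentioned in the abstract, single linkage with a cap on the maximum distance inside each cluster, can be illustrated with a small sketch. This is not the authors' implementation: it uses plain single linkage rather than robust single linkage, and the function name `constrained_single_linkage`, the parameters `link_threshold` and `max_diameter`, and the toy distance matrix are all hypothetical. In a real pipeline the distances would presumably be derived from a pairwise classifier's output (for example, one minus the predicted probability that two records share an inventor), which is also an assumption here.

```python
from itertools import combinations


def constrained_single_linkage(dist, link_threshold, max_diameter):
    """Cluster records 0..n-1 from a symmetric distance matrix `dist`.

    Pairs are merged in order of increasing distance (single linkage), but a
    merge is skipped when it would place two records farther apart than
    `max_diameter` in the same cluster. Hypothetical illustration only.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member records
    owner = list(range(n))                 # record -> cluster id

    pairs = sorted(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])
    for i, j in pairs:
        if dist[i][j] > link_threshold:
            break  # all remaining pairs are at least this far apart
        a, b = owner[i], owner[j]
        if a == b:
            continue  # already in the same cluster
        # Diameter check: every cross-cluster pair must respect the cap.
        if any(dist[x][y] > max_diameter
               for x in clusters[a] for y in clusters[b]):
            continue
        for y in clusters[b]:
            owner[y] = a
        clusters[a].extend(clusters.pop(b))
    return sorted(sorted(members) for members in clusters.values())


# Toy data: five records on a line; distance is the absolute difference.
positions = [0, 1, 2, 3, 10]
dist = [[abs(p - q) for q in positions] for p in positions]

print(constrained_single_linkage(dist, link_threshold=1, max_diameter=2))
# [[0, 1, 2], [3], [4]]
print(constrained_single_linkage(dist, link_threshold=1, max_diameter=100))
# [[0, 1, 2, 3], [4]]
```

With a loose cap, records 0 to 3 chain into one cluster even though records 0 and 3 are three units apart; with the cap set to 2, the last link is refused. This chaining effect is the kind of lumping error that a within-cluster distance constraint is meant to limit.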
This work is mainly supported by the Research Institute of Economy, Trade and Industry (RIETI) under the project “Empirical Analysis of Innovation Ecosystems in Advancement of the Internet of Things (IoT)”, the National Natural Science Foundation of China (NSFC, Nos. 71704025 and 71503123), and the Scientific Cooperation Program between NSFC and the Japan Society for the Promotion of Science (No. 71711540044). We also appreciate the editors’ diligent work and the insightful and inspiring comments from two anonymous reviewers, Dr. Kenta Ikeuchi, and Mr. Zhao An.