Semantic fingerprints-based author name disambiguation in Chinese documents

Abstract

Author name disambiguation is an important problem in bibliometric analysis and tech mining. Many techniques have been proposed; however, most of them require a long run time or additional information. We present a new method based on semantic fingerprints that disambiguates author names without external data. A manually annotated dataset was built to test the effectiveness of the method. Separate experiments were conducted using co-author features, institution features, and text fingerprints. The first two methods had higher precision but low recall, whereas the text fingerprint method had higher recall and satisfactory precision. Based on these results, we integrated co-author features, institution features, and text fingerprints into semantic fingerprints for disambiguating author names, which achieved better performance on the F-measure.
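As a rough illustration of the idea summarized in the abstract, the following minimal sketch computes a Simhash-style fingerprint (Charikar 2002) from a record's co-author, institution, and text features and compares two records by Hamming distance. It is not the implementation used in the article: the record layout, the feature weights, the helper names `record_features`, `simhash`, and `hamming`, and the use of MD5 as the token hash are all assumptions made for this example.

```python
import hashlib


def simhash(weighted_tokens, bits=64):
    """Simhash-style fingerprint of a {token: weight} mapping, as an int."""
    totals = [0] * bits
    for token, weight in weighted_tokens.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def record_features(record, w_coauthor=3, w_institution=2, w_text=1):
    """Merge the three feature types into one weighted token bag.

    The weights are hypothetical; they are not taken from the article.
    """
    features = {}
    for name in record.get("coauthors", []):
        key = "coauthor:" + name
        features[key] = features.get(key, 0) + w_coauthor
    for inst in record.get("institutions", []):
        key = "institution:" + inst
        features[key] = features.get(key, 0) + w_institution
    for word in record.get("text", []):
        key = "text:" + word
        features[key] = features.get(key, 0) + w_text
    return features


# Two hypothetical records that share an ambiguous author name.
paper_a = {
    "coauthors": ["Li Wei", "Zhang Min"],
    "institutions": ["Institute A"],
    "text": ["name", "disambiguation", "fingerprint"],
}
paper_b = {
    "coauthors": ["Li Wei"],
    "institutions": ["Institute A"],
    "text": ["name", "disambiguation", "clustering"],
}

fp_a = simhash(record_features(paper_a))
fp_b = simhash(record_features(paper_b))
distance = hamming(fp_a, fp_b)  # smaller distance suggests the same author
```

In a sketch like this, records whose fingerprints fall within some Hamming-distance threshold would be grouped as the same author; the threshold itself would have to be tuned on annotated data.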

Notes

  1. https://www.sciencedirect.com/.

  2. http://link.springer.com/.

  3. http://isiknowledge.com/.

  4. http://www.wanfangdata.com/.

  5. http://www.researcherid.com/.

  6. http://orcid.org/.

  7. http://xueshu.baidu.com/scholarID/CN-BF738V7J.

  8. http://nlp.uned.es/web-nlp/.

  9. http://nlp.cs.nyu.edu/index.shtml.

  10. http://www.dev.patentsview.org/workshop/index.html#workshopoverview.

  11. http://www.dataharmony.com/services-view/semantic-fingerprinting/.

Acknowledgements

This work was mainly supported by the National Natural Science Foundation of China (Project 71473237), and partially supported by the National Key Technology R&D Program of China's 12th Five-Year Plan (2011–2015) (2015BAH25F01) and the Program of the China Knowledge Centre for Engineering Science and Technology (CKCEST-2016-2-10). The authors are grateful to the National Natural Science Foundation of China, the Ministry of Science and Technology of China, and the Chinese Academy of Engineering for the financial support that made this work possible.

Author information

Corresponding author

Correspondence to Hongqi Han.

About this article

Cite this article

Han, H., Yao, C., Fu, Y. et al. Semantic fingerprints-based author name disambiguation in Chinese documents. Scientometrics 111, 1879–1896 (2017). https://doi.org/10.1007/s11192-017-2338-6

  • DOI: https://doi.org/10.1007/s11192-017-2338-6

Keywords

  • Name disambiguation
  • Simhash
  • Semantic fingerprint