Abstract
Authorship identity has long been an Achilles’ heel in bibliometric analyses at the individual level. This problem appears in studies of scientists’ productivity, inventor mobility and scientific collaboration. Using the concepts of cognitive maps from psychology and approximate structural equivalence (ASE) from network analysis, we develop a novel algorithm for name disambiguation based on knowledge homogeneity scores. We test it on two cases, and the results show that this approach outperforms other common authorship identification methods: the ASE method is a relatively simple algorithm that yields higher accuracy with reasonable time demands.
Notes
The search was conducted on July 26, 2009.
Using the example of patent documents, Trajtenberg illustrates this problem with two questions: (1) Is “Manuel Trajtenberg” in one patent the same inventor as “Manuel Trajtenberg” in another record? And (2) is “Manuel Trajtenberg” the same inventor as “Manuel Trachtenberg”, and the same as “Manuel D. Trajtenberg”? (Trajtenberg et al. 2006).
The figures here are calculated based on a unique nanotechnology publication database developed by a Georgia Institute of Technology research group led by Dr. Philip Shapira. For a detailed description of this dataset, please refer to Porter et al. (2008).
These author names are taken directly from the nano publication dataset derived from WoS, without any cleaning. Although some of the “Kim, J” records may overlap with “Kim, J H”, the names are reported in this table as they appear in the data.
For example, in their report gauging the structure and competitiveness of China’s nanotechnology, Kostoff and his colleagues identify “Zhang, Y.” and “Li, Y.” as the top two most prolific Chinese nano authors in 2003 (Kostoff et al. 2006, p. 148), ignoring the fact that Zhang and Li are also the top two most common family names in China.
Text mining tool developed at Search Technology, Inc.
Academia Sinica also exists in Taiwan.
The Soundex algorithm transforms names into alphanumeric codes in order to handle variations in name spellings. This method was initially developed by the US Census in 1930. To apply this method, a full name and, preferably, state of residence must be known. For more details on this indexing method, please refer to http://www.archives.gov/genealogy/census/soundex.html.
In addition to the above methods, large digital libraries have also started to build their own author identification systems, such as the Distinct Author Identification System (DAIS) in WoS (http://science.thomsonreuters.com/training/wos/), Citeseer (http://citeseer.ist.psu.edu/), and the Research ID system (http://www.researcherid.com/). Unfortunately, our experiments show that their performance is rather poor, at least for the two common names we tested.
Please note that approximate structural equivalence (ASE) is different from co-citation, another similarity measure used in bibliometric analysis. The former focuses on records that share references, while the latter refers to the cited references themselves. In Fig. 2, AR1 and AR2 are ASE, while ref1 and refp are co-cited.
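The distinction can be illustrated with a minimal Python sketch. The toy data and the function names `shared_references` and `co_citations` are hypothetical and only mirror the labels used in the note; the sketch counts shared references and does not compute the knowledge homogeneity score.

```python
from itertools import combinations

# Hypothetical toy data: each record (AR) maps to the set of references it reports.
records = {
    "AR1": {"ref1", "refp"},
    "AR2": {"ref1", "refp", "refq"},
    "AR3": {"refz"},
}

def shared_references(records):
    """Record-level view: for each pair of records, the references both report."""
    out = {}
    for a, b in combinations(sorted(records), 2):
        common = records[a] & records[b]
        if common:
            out[(a, b)] = common
    return out

def co_citations(records):
    """Reference-level view: for each pair of references, the records citing both."""
    out = {}
    for rec, refs in records.items():
        for r1, r2 in combinations(sorted(refs), 2):
            out.setdefault((r1, r2), set()).add(rec)
    return out

record_pairs = shared_references(records)   # pairs of records linked by shared references
reference_pairs = co_citations(records)     # pairs of references linked by citing records
```

Here AR1 and AR2 are linked at the record level because they report the same references, whereas ref1 and refp are linked at the reference level because the same records cite both.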
In criminology and counter-terrorism, fingerprint tracing has been used to identify individuals (Chaski 2005), given that an individual has a unique fingerprint that distinguishes her/him from others. In the same vein, assuming an individual researcher has a fixed knowledge stock during a given period, tracing his/her bibliometric fingerprint in reported references should be useful for aiding authorship identification at a very large scale.
This is different from the Distinct Author Identification System (DAIS) in Web of Science. Our experiments in DAIS found that the same authors appear in different clusters, and different authors appear in the same cluster. This suggests either that transitivity was not adopted in their algorithm (otherwise the same author should appear in only one author cluster), or that the coverage of publications indexed in WoS is limited (so that common links with papers not covered by DAIS are missed). For more discussion on DAIS, see http://science.thomsonreuters.com/support/faq/wok3new/dais/ or the patent application number US20080275859 A1 (Griffith 2008).
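What transitivity implies for clustering can be sketched as follows. This is an illustrative union-find routine, not the authors' implementation; the function name `cluster_transitively` and the toy record labels are assumptions.

```python
def cluster_transitively(records, matched_pairs):
    """Merge pairwise matches transitively: if A~B and B~C, then A, B, C share a cluster."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical example: AR1 matches AR2, and AR2 matches AR3; AR4 matches nothing.
clusters = cluster_transitively(
    ["AR1", "AR2", "AR3", "AR4"],
    [("AR1", "AR2"), ("AR2", "AR3")],
)
```

Under transitivity, AR1 and AR3 end up in the same cluster even though they were never matched directly, and each record belongs to exactly one cluster.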
Please note that we use the wild-card character “*” instead of a middle initial to relax the format of reported author names. The “*” is important: without it, only 47 records were retrieved.
The removal process was conducted in VantagePoint. These 125 records were first clustered into three groups: Group 1 consists of articles written by “Walsh, J”; Group 2 includes articles written by “Walsh, J?P”; and Group 3 consists of those written by “Walsh, J?” with “?” not “P”. Groups 1 and 2 were then combined, which returns 69 articles.
Those records without references are letters, book reviews, etc., which are not the focus of this study.
The full author names are not viewable in the bibliographic data in Web of Science up to September 2006. To get those full names, we went back to the original full text of the articles. Since different organizations purchase different full-text coverage of WoS, for the verification step (which is not needed in the ASE algorithm) we in some cases had to look for hard copies or request interlibrary loans when the electronic version of the full text was not available.
In addition to citations received by each reference, we also tried the journal impact factor as the weighting factor. Among the 114 references in the first experiment (Walsh, J*), 23 could not be identified from ISI journal citation reports, suggesting this is less useful than citations for weighting references.
Two weights, W1 and W2, are coded based on the quartile distributions of, respectively, the visibility of references and the minimum number of reported references between each pair of targeted papers. For W1, in both the Walsh, JP and the Li, Y experiments, a reference in the first quartile of forward citations is given a weight of 8; in the second quartile, 3; in the third quartile, 2; and in the fourth quartile, 1. For W2, in the Walsh, JP experiment, if the number of references was in the first quartile of reference counts, the weight is set at 4; in the second quartile, 3; in the third quartile, 2; and in the fourth quartile, 1. In the Li, Y experiment, the first quartile of reference counts was given a weight of 8 in W2. The detailed raw data and coding of KHS variables are available on request.
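The quartile coding described in this note can be sketched as a simple lookup. Several things here are assumptions rather than details from the paper: that “first quartile” means the lowest values, that boundary values fall into the lower quartile, the cut points in the example, and all function and variable names. How the weights enter the knowledge homogeneity score is not shown.

```python
from bisect import bisect_left

# Weights by quartile as stated in the note (quartile 1 = first quartile).
W1_BY_QUARTILE = {1: 8, 2: 3, 3: 2, 4: 1}        # forward citations of a reference
W2_WALSH_BY_QUARTILE = {1: 4, 2: 3, 3: 2, 4: 1}  # reference counts, Walsh, JP experiment
# In the Li, Y experiment the first-quartile W2 weight is 8; the note does not
# restate the other quartiles, so no Li, Y table is defined here.

def quartile_of(value, cutpoints):
    """Return 1..4 given three ascending cut points (Q1, Q2, Q3).

    Assumption: a value equal to a cut point belongs to the lower quartile.
    """
    return bisect_left(cutpoints, value) + 1

def weight_for(value, cutpoints, table):
    """Look up the weight for a value from its quartile."""
    return table[quartile_of(value, cutpoints)]

# Hypothetical cut points for forward citations, for illustration only.
citation_cutpoints = [2, 10, 40]
rarely_cited = weight_for(1, citation_cutpoints, W1_BY_QUARTILE)
highly_cited = weight_for(100, citation_cutpoints, W1_BY_QUARTILE)
```

Under these assumptions, a rarely cited reference receives the largest W1 weight and a highly cited one the smallest.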
A close look shows that the misassignment occurs because AR10, an article reporting few references, shares a rarely cited reference with AR23. The ASE method judges the knowledge homogeneity score between AR10 and AR23 to be high enough and thus clusters them together.
Even in the US, Li has become a fairly common name, currently ranked 519th on the list of common names, up from 2084th in the 1990 census.
For a detailed description of this global nano database, please refer to Porter et al. (2008).
Singletons are still possible in the ASE method because of the citation weights and matching threshold rules.
For instance, two articles written by the same author, Li Yue at CAS Hefei, are wrongly separated because they share no references. A close examination shows that these two articles are in different research areas, as seen from their subject category codes: one is in “Chemistry, Physical; Materials Science, Multidisciplinary” and the other is in “Nanoscience & Nanotechnology, Polymer Science”. Another example is two articles authored by Li, Ying at the Chinese Academy of Sciences (CAS) Shenyang that are in two related but different subject categories: “Acoustics; Chemistry, Multidisciplinary” and “Chemistry, Applied; Engineering, Chemical; Materials Science, Textiles”.
This holds if the following three conditions are met: (1) researchers are not mobile; (2) researchers are not affiliated with multiple organizations; and (3) standardized affiliations are reported across different records.
I.e., ((13 − 3) + (24 − 13) * 0.5)/24 * 100%. The mis-assignments (two at Illinois at Chicago and one at the University of Tokyo) are within these 13 records with identifiable affiliation.
The formula is ((72 − 17) + (145 − 72) * 0.5)/145 * 100%, where 17 is the number wrongly assigned because different authors in the same organization have exactly the same English-translated name.
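The two formulas in the preceding notes share one structure, which the following sketch evaluates. The function and parameter names are illustrative, and reading the 0.5 factor as half credit for records without an identifiable affiliation is an interpretation of the formulas, not a statement from the notes.

```python
def accuracy_percent(total, identifiable, misassigned, credit=0.5):
    """Evaluate ((identifiable - misassigned) + (total - identifiable) * credit) / total * 100.

    total        -- all records in the case
    identifiable -- records with identifiable affiliation
    misassigned  -- identifiable records that are wrongly assigned
    credit       -- factor applied to the remaining records (0.5 in both notes)
    """
    correct = identifiable - misassigned
    remaining = total - identifiable
    return (correct + remaining * credit) / total * 100

first_case = accuracy_percent(total=24, identifiable=13, misassigned=3)
second_case = accuracy_percent(total=145, identifiable=72, misassigned=17)
print(f"{first_case:.1f}% and {second_case:.1f}%")
```

With the numbers given, the first formula evaluates to about 64.6% and the second to about 63.1%.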
It turns out a large proportion of records (50% in the Walsh, J case and 83% in Li, Y) that do not share citations are in fact singletons (unique authors).
The differences that weighting makes are even larger if only records with shared citations are considered.
All record downloads and data collection from Google Scholar and Scopus were completed November 25–30, 2009. These new measures of forward citations change the weighting of W1 and therefore the knowledge homogeneity scores used to calculate the clustering.
One record in Google Scholar and three records in Scopus are mis-assigned, compared to one mis-assignment in WoS. A closer examination indicates that the messier reference formats in Scopus and its inconsistent coverage of journals partially account for Scopus having a lower correlation in citations with WoS and Google Scholar. The detailed analyses and results are available on request.
To make the search comparable to the “Walsh, J” case indexed in WoS-SSCI, the search strategy in Scopus is also confined to social science using the query of “AUTHOR-NAME(walsh, j*) AND SUBJAREA(mult OR arts OR busi OR deci OR econ OR psyc OR soci) AND (PUBYEAR BEF 2009) AND (PUBYEAR AFT 2003)”. This returned 164 hits. After removing “Walsh, J?” where “?” is not P, 41 records were left, 37 of them report references, and 17 records share at least one reference with the others.
We did not record the time spent in the case of “Walsh, J” because of the testing and revising involved in this first prototype.
Some text mining software (such as VantagePoint) includes a name + affiliation function. We ran it in VantagePoint on the Li, Y case, using fuzzy matching on person names verified by organization name matching; the performance was very poor and, in spite of the automation, the run took significant time.
References
Abbasi, A., & Chun, H. (2006). Visualization authorship for identification. In: S. Mehrotra, et al. (Eds.), Proceedings of the IEEE international conference on intelligence and security informatics (LNCS 3975) (pp. 60–71). Berlin: Springer-Verlag.
Borgman, C. L., & Siegfried, S. L. (1999). Getty’s Synoname™ and its cousins: A survey of applications of personal name-matching algorithms. Journal of the American Society for Information Science, 43(7), 459–476.
Chaski, C. E. (2005). Who’s at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence, 4(1), 1–13.
Frietsch, R., Tang, L., & Hinze, S. (2008). Bibliometric data study: Assessing the current ranking of the People’s Republic of China in a set of research fields. Fraunhofer ISI Discussion Papers Innovation Systems and Policy Analysis, No 15. Karlsruhe: Fraunhofer ISI.
Garfield, E. (1969). British quest for uniqueness versus American egocentrism. Nature, 223(5207), 763.
Griffith, R. A. (2008). Method and system for disambiguating informational objects. Patent Application Number: US 20080275859 A1. USPTO.
Han, H., Giles, C. L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. Paper presented at the Proceedings of the ACM/IEEE Joint Conference on Digital Libraries.
Han, H., Xu, W., Zha, H., & Giles, C. L. (2005). A hierarchical naive Bayes mixture model for name disambiguation in author citations. Paper presented at the Proceedings of the 2005 ACM Symposium on Applied Computing.
Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In M. Marlino, T. Sumner, & F. M. Shipman III (Eds.), Proceedings of the 5th ACM/IEEE joint conference on digital libraries (pp. 334–343). Denver: ACM Press.
Hanneman, R. A., & Riddle, M. (2005). Introduction to social network methods. Riverside, CA: University of California, Riverside. http://faculty.ucr.edu/~hanneman/.
Houvardas, J., & Stamatatos, E. (2006). N-gram feature selection for authorship identification. In J. Euzenat & J. Domingue (Eds.), Proceedings of the 12th international conference on artificial intelligence: Methodology, systems, applications (AIMSA’06), LNCS 4183 (pp. 77–86). Berlin: Springer-Verlag.
Huang, J., Ertekin, S., & Giles, C. L. (2006). Fast author name disambiguation in CiteSeer. Working paper. http://www.cse.psu.edu/~sertekin/Papers/IST-TR_DisambiguationCiteseer.pdf.
Jacobs, L. F., & Schenk, F. (2003). Unpacking the cognitive map: The parallel map theory of hippocampal function. Psychological Review, 110(2), 285–315.
Jones, B., Wuchty, S., & Uzzi, B. (2008). Multi-university research teams: Shifting impact, geography, and stratification in science. Science, 322, 1259–1262.
Kang, I. S. (2009). On co-authorship for author disambiguation. Information Processing and Management, 45(1), 84–97.
Kostoff, R. (2008). Comparison of China/USA science and technology performance. Journal of Informetrics, 57, 1–10.
Kostoff, R., et al. (2006). The structure and infrastructure of Chinese science and technology. DTIC Technical Report, No. ADA 443315. http://www.onr.navy.mil/sci_tech/33/332/docs/060307_chinese_sci_tech.pdf.
Kuhn, T. (1970). The structure of scientific revolutions (2nd ed.). Chicago: University of Chicago Press.
Lai, R., D’amour, A., & Fleming, L. (2009). The careers and co-authorship networks of U.S. patent-holders since 1975. Working paper.
Lin, J. C. (1988). Chinese names containing non-Chinese given name. Cataloging & Classification Quarterly, 9(1), 69–81.
Lorrain, F., & White, H. C. (1971). Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, 1, 49–80.
MacRoberts, M. H., & MacRoberts, B. R. (1989). Problems of citation analysis: A critical review. Journal of the American Society for Information Science, 40, 342–349.
McCallum, A., & Wellner, B. (2003). Toward conditional models of identity uncertainty with application to proper noun coreference. Paper presented at the IJCAI Workshop on Information Integration.
Meho, L., & Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty: Web of science versus Scopus and Google Scholar. Journal of the American Society for Information Science & Technology, 58(13), 2105–2125.
Merton, R. (1973). The sociology of science: Theoretical and empirical investigations. Chicago: University of Chicago Press.
National Science Foundation (NSF). (2008). Science and engineering indicators. Washington: Government Printing Office.
O’Keefe, J., & Nadel, L. (1978). The hippocampus as a cognitive map. Oxford, England: Oxford University Press.
Pasula, H., Marthi, B., Milch, B., Russell, S., & Shpitser, I. (2004). Identity uncertainty and citation matching. Paper presented at the Advances in Neural Information Processing (NIPS).
Pauly, D., & Stergiou, K. I. (2005). Equivalence of results from two citation analyses: Thomson ISI’s citation index and Google’s scholar service. Ethics in Science and Environmental Politics, 5, 33–35.
Phelan, T. J. (1999). A compendium of issues for citation analysis. Scientometrics, 45(1), 117–136.
Pieters, R., Baumgartner, H., Vermunt, J., & Bijmolt, T. (1999). Importance and similarity in the evolving citation network of the International Journal of Research in Marketing. International Journal of Research in Marketing, 16(2), 113–127.
Porter, A. L., Youtie, J., Shapira, P., & Schoeneck, D. (2008). Refining search terms for nanotechnology. Journal of Nanoparticle Research, 10, 715–728.
Raffo, J., & Lhuillery, S. (2009). How to play the “names game”: Patent retrieval comparing different heuristics. Research Policy, 38(10), 1617–1627.
Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. In B. Cronin (Ed.), Annual review of information science and technology (Vol. 43). Maryland, USA: American Society for Information Science and Technology (ASIST).
Soler, J. M. (2007). Separating the articles of authors with the same name. Scientometrics, 72(2), 281–290.
Strotmann, A., Zhao, D., & Bubela, T. (2009). Author name disambiguation for collaboration network analysis. Working paper.
Tan, C. N. (1986). Chinese personal names. Library Association Record, 88, 551.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3), Article 11.
Trajtenberg, M., Shiff, G., & Melamed, R. (2006). The “names game”: Harnessing inventors’ patent data for economic research. NBER Working Paper No. 12479.
Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. Paper presented at the JCDL, Austin, Texas, USA.
Van Mechelen, I., Bock, H. H., & De Boeck, P. (2004). Two-mode clustering methods: A structured overview. Statistical Methods in Medical Research, 13(5), 363–394.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge: Cambridge University Press.
Wooding, S., Wilcox-Jay, K., Lewison, G., & Grant, J. (2005). Co-author inclusion: A novel recursive algorithmic method for dealing with homonyms in bibliometric analysis. Scientometrics, 66(1), 11–21.
Youtie, J., Shapira, P., & Porter, A. (2008). National nanotechnology publications and citations. Journal of Nanoparticle Research, 10(6), 981–986.
Zhao, D. Z., & Logan, E. (2002). Citation analysis using scientific publications on the web as data source: A case study in the XML research area. Scientometrics, 54(3), 449–472.
Zhou, P., & Leydesdorff, L. (2008). China ranks second in scientific publications since 2006. ISSI Newsletter, 13, 7–9.
Acknowledgments
The authors would like to thank Dr. Diana Hicks and participants in the Workshop for Original Policy Research at Georgia Tech and in the seminar series of the Research Institute of Economics, Technology and Industry, Tokyo, for their comments on an earlier draft; Dr. Juan D. Rogers for offering the R-class and useful discussion; Dr. Philip Shapira for the use of the global nano publication data in the study; and the anonymous reviewer for his useful suggestions. We would also like to thank the authors and co-authors of “Li, Y” for clarifying their publication lists.
Additional information
Both authors contributed equally to this work and are listed alphabetically.
Cite this article
Tang, L., Walsh, J.P. Bibliometric fingerprints: name disambiguation based on approximate structure equivalence of cognitive maps. Scientometrics 84, 763–784 (2010). https://doi.org/10.1007/s11192-010-0196-6
DOI: https://doi.org/10.1007/s11192-010-0196-6