Data sets for author name disambiguation: an empirical analysis and a new resource

Müller, Mark-Christoph; Reitz, Florian; Roy, Nicolas

doi:10.1007/s11192-017-2363-5

Open access
Published: 27 March 2017

Volume 111, pages 1467–1500, (2017)

Scientometrics

Abstract

Data sets of publication meta data with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements to future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.

Introduction

In this paper, we provide a comprehensive and detailed review of data sets used in computational author name disambiguation (AND) experiments.^{Footnote 1} AND data sets are basically collections of publication headers in which author names have been annotated with unique author identifiers. They are essential resources for current research in computational AND, which is characterized by empirical, evaluation-based approaches (Ferreira et al. 2012a). AND data sets are utilized in two ways: First, computational AND approaches based on supervised machine learning require them as training data during the development (Han et al. 2004; Treeratpituk and Giles 2009) or the parameter estimation phase (Santana et al. 2015). Second, they are also indispensable as test or reference data (often called gold standard or ground truth data) in AND system evaluation. Here the disambiguation decisions of the system under evaluation are compared to the correct disambiguations encoded manually in the data set, and the system performance is quantified on the basis of the number of correct and incorrect decisions.

When a new AND data set is created as part of a research project, the design of this data set will probably reflect some explicit or implicit assumptions about the task. Fan et al. is an example of a project where the main research interest is in co-author networks, and where an algorithm is presented that uses author information only (Fan et al. 2011). Accordingly, Fan et al. create a data set of approx. 760,000 publications which contains only author names. More importantly, Fan et al. systematically exclude from their data set all publications with only one author, because these are not accessible to co-author-based approaches. In contrast, Song et al. focus on disambiguating authors by using advanced semantic topic-modelling techniques (Song et al. 2007). They create a data set of more than 750,000 publications which contains author names, but also titles, abstracts, keywords, and the full text of each publication’s first page.

When the creation of a new data set is out of the scope of a project, the choice of existing data sets available for re-use will have an effect in terms of applicable methods, and, ultimately, outcomes. To give an example, as we will show in the section “Data set content analysis”, most data sets annotate only one, rarely two authors per publication with unique author identifiers, while the other authors remain unidentified and thus undisambiguated. For these authors, co-author network analysis, which is a cornerstone of many AND algorithms, simply has to assume that superficial, string-based name identity always implies identity of the author individual. Likewise, superficially different names will be treated as referring to different author individuals. Both assumptions, however, are obviously not valid. Co-author ambiguity, e.g., is present if the names of one or more unidentified authors in a publication are also used by different authors in other publications. It causes a disambiguation algorithm to incorrectly lump together these authors on the basis of matching names, thus producing incorrect connections in the resulting co-author network. In their discussion of open challenges for AND, Ferreira et al. explicitly refer to cases of co-author ambiguity as very ambiguous cases, also pointing out that the problem might be more pronounced for Asian names (Ferreira et al. 2012a). Shin et al. re-use data sets originally created by Han et al. (2005a, b) and Wang et al. (2011), where authors are also only partly identified (Shin et al. 2014). And in fact, in the error analysis of their co-author-based system, Shin et al. identify co-author ambiguity as one of three major sources of error. These examples show that there is a strong mutual interaction and dependency between current research in computational AND, and the AND data sets used in this research.

Our point of departure in this paper is the following: Current state-of-the-art AND systems like Nearest Cluster (Santana et al. 2015) or BatchAD+IncAD (Qian et al. 2015) report very good performance on distinct, but comparable, data sets: Santana et al., for their solely batch-based system, report a K score of 0.940 on the KISTI data set, and a K score of 0.917 on the BDBComp data set.^{Footnote 2} Likewise, the system of Qian et al., who use the \(\mathrm{B}^3\) evaluation measure (Bagga and Baldwin 1998), yields an F1 score of 86.83 when run in batch mode (BatchAD) on a similar, DBLP-derived data set. While these results are impressive, there are aspects to real-life AND which are simply not well represented in the respective data sets. Potential co-author ambiguity has already been mentioned above. Other aspects include

1. cases where one author appears under several names with non-trivial differences (as opposed to differences that are only due to abbreviated first or middle names),
2. cases where the actual author name is written in a non-western (e.g., Asian or Cyrillic) alphabet and appears in the publication header in some transliterated version, which in turn can give rise to instances of case 1,
3. cases of publications by less productive authors or authors with only a small number of collaborators, for which rich co-author information is not available, and
4. cases of publications from scientific fields or communities that generally tend to have smaller numbers of co-authors.
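To make the \(\mathrm{B}^3\) measure cited above more concrete, the following is a minimal sketch of its usual per-record formulation: for every authorship record, precision is the share of its predicted cluster that belongs to the same true author, and recall is the share of that author's records found in the predicted cluster. The function name `b_cubed` and the dictionary-based input format are assumptions made for illustration; this is not the evaluation code of any of the cited systems, and it reports values in the range 0 to 1 rather than as percentages.

```python
from collections import defaultdict


def b_cubed(predicted, gold):
    """Return (precision, recall, F1) under the B-cubed measure.

    predicted, gold: dicts mapping an authorship-record id to a cluster label.
    """
    if not gold or predicted.keys() != gold.keys():
        raise ValueError("both labelings must cover the same, non-empty record set")

    def members(labels):
        # Invert record -> label into label -> set of records.
        groups = defaultdict(set)
        for record, label in labels.items():
            groups[label].add(record)
        return groups

    pred_groups = members(predicted)
    gold_groups = members(gold)

    precision = recall = 0.0
    for record in gold:
        cluster = pred_groups[predicted[record]]  # records the system merged
        truth = gold_groups[gold[record]]         # records of the true author
        overlap = len(cluster & truth)            # at least 1: the record itself
        precision += overlap / len(cluster)
        recall += overlap / len(truth)

    n = len(gold)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: two true authors, but the system lumps all three records.
gold = {"r1": "A", "r2": "A", "r3": "B"}
pred = {"r1": "x", "r2": "x", "r3": "x"}
precision, recall, f1 = b_cubed(pred, gold)
```

In this toy case, lumping everything together keeps recall perfect while precision drops, which is the pattern one would expect from merging errors.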

The aim of this paper is two-fold: First, by means of an analysis of the most prominent data sets used in AND research so far, we want to identify and suggest new directions for research in AND. Second, we want to facilitate AND research by designing, creating, and making available a novel AND data set which complements existing ones. We do this by utilizing data and expertise available at the two major bibliographic data bases DBLP^{Footnote 3} and zbMATH.^{Footnote 4}

The rest of the paper is structured as follows: In the section “Background”, we briefly outline some key concepts of AND and provide some definitions that will be used throughout the paper. The section “Review of AND data sets” contains a detailed review of the most important AND data sets. To our knowledge, this is the first comprehensive overview of this kind. In the section “A new AND data set from the domain of mathematics”, we provide some background information on the real-life data that we employ for the creation of our own data set, SCAD-zbMATH,^{Footnote 5} and describe the quality assurance process. The paper ends with conclusions and an outlook in the section “Conclusion”.

Background

In author name disambiguation (AND), publication and author are two central concepts. The authors of a publication are denoted by their names, for which, in case of multi-author publications, the list position in the order of appearance in the publication header may also be relevant.

Each tuple of author name, author name position in author list, and unique publication identifier^{Footnote 6} constitutes an authorship record (Cota et al. 2010). Using this terminology, author name disambiguation can then be characterized as follows: Given a set of authorship records, AND tries to determine which of these refer to the same author entity. This task is very similar to the co-reference resolution task in Natural Language Processing (NLP), which tries to identify all expressions in a document that refer to (or mention) the same entity (Ng 2010).^{Footnote 7} AND is made difficult by two characteristics of person names:

Distinct individuals bear, and publish under, the same name, which gives rise to authorship records with matching author names, but distinct underlying author entities. This phenomenon is called name homography.^{Footnote 8} Failure to distinguish between different authors with identical names will cause a merging or Mixed Citation (Lee et al. 2005) error.

Likewise, different names can be used to refer to the same author entity, which produces authorship records with different author names, but relations to the same author entity. This phenomenon is known as name variability.^{Footnote 9} Failure to correctly merge these records will result in fragmentation (Esperidião et al. 2014) or Split Citation (Lee et al. 2005) errors. Note that, strictly speaking, the term disambiguation applies to cases of name homography only.

Due to the existence of name homography and name variability, author names as they appear in publication headers or publication meta-data are often not sufficient to uniquely identify and distinguish between authors.
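The terminology above can be illustrated with a small sketch. It models an authorship record as the tuple described in the text, plus an optional gold-standard author identifier, and counts the two error types that a purely string-based matcher would make. All names (`AuthorshipRecord`, `naive_name_matching_errors`) and the sample records are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class AuthorshipRecord:
    name: str                        # author name as printed in the header
    position: int                    # 1-based position in the author list
    publication_id: str              # unique publication identifier
    author_id: Optional[str] = None  # gold identifier; None if not disambiguated


def naive_name_matching_errors(records):
    """Errors made when identical name strings are taken to mean identical authors.

    Returns (merged, split): record pairs that would be wrongly merged
    (name homography) and pairs that would be wrongly kept apart
    (name variability). Records without a gold identifier are ignored.
    """
    labelled = [r for r in records if r.author_id is not None]
    merged, split = [], []
    for a, b in combinations(labelled, 2):
        same_name = a.name == b.name
        same_author = a.author_id == b.author_id
        if same_name and not same_author:
            merged.append((a, b))   # Mixed Citation error
        elif same_author and not same_name:
            split.append((a, b))    # Split Citation error
    return merged, split


# Hypothetical records: two distinct authors share "Y. Wang",
# and one of them also publishes as "Yi Wang".
records = [
    AuthorshipRecord("Y. Wang", 1, "pub-1", "author-1"),
    AuthorshipRecord("Y. Wang", 2, "pub-2", "author-2"),
    AuthorshipRecord("Yi Wang", 1, "pub-3", "author-1"),
]
merged, split = naive_name_matching_errors(records)
```

The pairwise loop is quadratic in the number of records, which is acceptable for a sketch but not for data sets of realistic size.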

It is worth noting that author name ambiguity which results from variability is, at least in part, a home-made problem: While name homography will always exist as the result of a limitation of available person names, name variability is sometimes simply due to a lack of consistency on the part of authors and publishers. Authors sometimes deliberately use different variants, including abbreviations, of their first and middle names, while publishers often abbreviate first names to initials. For example, of all signatures which were added to DBLP between 2011 and 2015, 12.8% were delivered with all first name components abbreviated. McKay et al. point out that some authors use name variations to separate different areas of research or to hide their gender. They also report that researchers might change their name on a publication to avoid confusion with authors who have a similar name (McKay et al. 2010). Already in 1995, Grossman and Ion identified this lack of consistency as a central problem for citation studies in the field of Mathematics, and made a plea to authors to use their complete names consistently for each publication (Grossman and Ion 1995). The degree of author name ambiguity, however, seems to be different in different languages. In some Asian languages, e.g., the name homography problem is very pronounced: In the Chinese language area, it is estimated that the top three surnames (“Wang”, “Zhang”, and “Li”) account for about 21% of the population (Jin-Zhong et al. 2011). In the Vietnamese language area, a mere one hundred family names are presumed to be in common use,^{Footnote 10} with the last name “Nguyen” accounting for up to 46% of family names.^{Footnote 11} These are examples of name homography arising from cultural or ethnological conditions in the respective language areas. 
On the other hand, name variability can be very pronounced for languages using a non-western alphabet where the author names have to be transliterated into standard characters in order to facilitate search using a standard international keyboard. This is true again for Asian languages, but also for those using a Cyrillic alphabet, or an alphabet with special characters or diacritics. For all these, there are often several ways in which a name can be represented. Consider the following name variants actually observed in zbMATH:

(Henryk) Żoła̧dek^{Footnote 12}: Żoła̧dek; Żołądek; Zoła̧dek; Ẓoła̧dek; Żolądek

Mefodij F. Raţă ^{Footnote 13}: Raţă, Mefodij F.; Rata, Mefodie; Ratsa, Metodie; Raţă, Metodie

(Ivan D.) Pukal’s’kyĭ ^{Footnote 14}: Pukal’s’kyĭ; Pukal’s’kij; Pukal’skii; Pukal’skij; Pukal’s’kyj; Pukal’s’kyi; Pukal’skyj; Pukals’kyj; Pukal’skyj; Pukal’skiĭ; Pukal’sky; Pukalskyi; Pukalskyj; Pukalsky
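As a rough illustration of why such variants are hard to reconcile, the sketch below folds diacritics using Unicode normalization from the Python standard library. The helper names `strip_diacritics` and `match_key`, and the small `EXTRA_FOLDS` table, are assumptions made for this example and not a method used by zbMATH or DBLP. Under these assumptions, folding collapses variants that differ only in diacritics, while variants that stem from different transliteration schemes remain distinct.

```python
import unicodedata

# Characters without a Unicode decomposition need an explicit mapping.
# This table is an illustrative assumption, not an exhaustive list.
EXTRA_FOLDS = str.maketrans({"\u0142": "l", "\u0141": "L", "\u2019": "'"})


def strip_diacritics(name):
    """Decompose characters (NFKD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(name):
    """Diacritic-insensitive, case-insensitive key for comparing name strings."""
    return strip_diacritics(name.translate(EXTRA_FOLDS)).casefold()


# The five surname variants listed above; "a\u0327" is "a" plus a combining cedilla.
zoladek_variants = [
    "\u017bo\u0142a\u0327dek",  # Z with dot above, l with stroke, a + cedilla
    "\u017bo\u0142\u0105dek",   # a with ogonek
    "Zo\u0142a\u0327dek",       # plain Z
    "\u1e92o\u0142a\u0327dek",  # Z with dot below
    "\u017bol\u0105dek",        # plain l
]
zoladek_keys = {match_key(v) for v in zoladek_variants}

# Three of the listed surname variants of the second author.
rata_keys = {match_key(v) for v in ["Ra\u0163\u0103", "Rata", "Ratsa"]}
```

Here all five variants of the first surname map to a single key, whereas "Ratsa" keeps its own key, so diacritic folding alone cannot resolve transliteration differences.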

Author name ambiguity poses a major problem for online bibliographic data bases, which typically organize and make accessible publication data on the basis of author names. In order to be able to perform an author-targeted query, i.e., to retrieve all and only those publications by a particular author, the authorship records for this author need to be disambiguated. This type of query has been shown to be predominant in the navigation patterns of users searching for scholarly material.^{Footnote 15} Without disambiguation of the bibliographic data base, precision and recall of this type of query are not guaranteed to be satisfactory (Salo 2009). But users of bibliographic data bases are not just researchers looking for other researchers: Scientific organizations and policy makers often rely on author-based statistics as a basis for critical action, while universities and research agencies often use publication statistics for their hiring and funding decisions. Weingart discusses the importance of bibliometrics for grant acquisition and the filling of positions (Weingart 2005). Frey and Rost, and the work referenced there, discuss the effects of publication-based ranking on scientific careers (Frey and Rost 2010). McKay et al. state that building a clean citation profile is a concern of many researchers (McKay et al. 2010). Finally, Diesner, Evans, and Kim, and Kim and Diesner, coming from a slightly different angle, provide evidence that naive, incorrect identification of authors based on name identity alone can have a distorting effect on scientometric analyses of both individual authors and entire scientific sectors, rendering the results of these analyses unreliable (Diesner et al. 2015; Kim and Diesner 2016).

All this makes author name ambiguity a relevant practical problem with far-reaching effects even outside the scholarly domain. As a consequence, online bibliographic data bases expend a lot of effort on author name disambiguation (cf. sections “Data curation at DBLP” and “Data curation at zbMATH”) in order to maintain the high quality of their author data, which is often stored in the form of disambiguated author profiles (Ley and Reuther 2006; Ley 2009). These efforts also include attempts to involve the author or user community (Mihaljevic-Brandt et al. 2014). The ever-growing number of scientific publications makes the task more and more difficult. Bornmann and Mutz, for example, report an exponential growth of publications by year for the period 1980–2012 (Bornmann and Mutz 2015). This tendency calls for automated methods, which in turn require data sets for their development and evaluation.

Review of AND data sets

The following review is based on our survey of the current research literature in computational AND. To our knowledge, it is the first review of its kind. Ferreira et al., in their survey of AND systems, only briefly mention some data sets, but do not give any details (Ferreira et al. 2012a). Kang et al. and Ferreira et al. provide more, and more detailed, information in their respective sections on related work, including some statistics (Kang et al. 2011; Ferreira et al. 2012b). We, in contrast, performed a comprehensive analysis of current, empirically-oriented publications on AND and identified what we think are the most important data sets. In order to be included in our review, the data sets had to be sufficiently identifiable. This was the case for all data sets that were either made publicly available by the respective authors, or that we could obtain otherwise. Another requirement was that the data sets had to be freely available for non-commercial purposes, without additional restrictions or obligations for the individual re-using the data.

It will become clear in the following that all data sets obtained this way cover the domain of computer science, and that most of them are somehow based on DBLP data. This, however, is not a bias of our selection, but it reflects a reality of the AND research community: While there are several AND projects for other domains (most notably the Author-ity project^{Footnote 16} for MEDLINE and PubMed), the data sets produced in these projects failed to reach a level of re-use in current projects that would have made them eligible for our review.^{Footnote 17}

In the following review, descriptive categories applied to each data set include the following:

Does the data set contain errors or structural ambiguities, and what is its overall quality?
Is the data set fully or only partially disambiguated? This property relates to the question whether or not all authorship records in the data set have a unique author identifier.
Are the author names given in their full or only in abbreviated form?
Is the author ordering of the original publication retained in the data set?
Was the data set created in a methodologically controlled manner?
Is the data set expandable by means of an external link to some other data source?

In order to facilitate analysis and comparison of the available data sets, we converted them from their various technical formats into a canonical XML representation. The main feature of this representation is that it puts the publication (and not the author) at its center. The author-centric perspective singles out one particular, featured author of a publication^{Footnote 18} by annotating it with a unique author identifier. In doing so, the featured author’s co-authors in this publication are reduced to mere string-valued attributes of the featured author’s authorship record. In a fully publication-centric data set, on the other hand, every author of every publication is uniquely identified. This is a prerequisite for the development and, in particular, the accurate evaluation of co-author-based disambiguation methods: In a publication-centric data set, co-author relations are no longer established on the basis of string-matching, but on the basis of previous decisions made by a disambiguation method which is equipped to correctly handle co-author ambiguity and variability. In other words: Publication-centric data sets give up the fixed distinction between featured author and co-author in favor of a dynamic setup in which every author in turn is disambiguated. This also better reflects the situation in a realistic, production AND setting, where every author (and not just those with highly ambiguous names) will eventually be subjected to disambiguation.^{Footnote 19}

It is important to distinguish the merely formal, technical aspect of the data set from its content: Converting a data set into our canonical representation alone does not render it publication-centric, unless complete author identifiers are added. Likewise, however, an author-centric data set which happens to contain disambiguated authorship records for more than one author of a publication is (at least to some degree) publication-centric.

Consider Fig. 1, which shows an excerpt from the original XML version of the KISTI-AD-E-01 data set. We chose this example because of its clarity; similar observations can be made in many of the other data sets.

The example shows two citation entries from different parts of the XML file. Each entry corresponds to one authorship record, which renders this data set author-centric. The dblpkey attribute is the publication identifier, and entries with the same value represent the same publication. Note that each citation entry also has a nameGroupID attribute, which groups entries belonging to one so-called same-name group.^{Footnote 20} We found that these groups appear (in one form or another) in many of the reviewed data sets, and that they are often (e.g., in the Han-DBLP, REXA-AND, and Wang-Arnetminer data sets) the major means of data set organization. In most cases, a same-name group is identified by a first-name initial and a full last name (Y. Han and Y. Wang in the example). In each citation entry in Fig. 1, a different author is disambiguated by means of an author identifier. Technically converting an author-centric data set like this into a publication-centric one involves merging individual authorship records on a common value in the publication identifier. An excerpt from the result of this conversion is shown in Fig. 2.

Note that in the example no author identifier has been added to the third author, as this manual disambiguation would constitute a non-trivial enhancement of the data set, which is outside the scope of this review.
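The merging step described above can be sketched as follows. The sketch works on plain dictionaries rather than on the actual XML, since the full schema is not reproduced here: only `dblpkey` is taken from the data set description, while the keys `authors`, `featured`, and `author_id`, the function name, and the sample values are assumptions made for illustration. It also assumes that the featured author can be located in the author list by exact name match.

```python
def to_publication_centric(citations):
    """Merge author-centric entries that share a publication identifier.

    Each entry is a dict with:
      'dblpkey'   - publication identifier
      'authors'   - author names in their original order
      'featured'  - name of the one author this entry disambiguates
      'author_id' - identifier assigned to the featured author
    Returns one dict per publication, with an identifier (or None) per author.
    """
    publications = {}
    for entry in citations:
        pub = publications.setdefault(
            entry["dblpkey"],
            {
                "dblpkey": entry["dblpkey"],
                "authors": [
                    {"name": name, "author_id": None} for name in entry["authors"]
                ],
            },
        )
        for author in pub["authors"]:
            if author["name"] == entry["featured"]:
                author["author_id"] = entry["author_id"]
    return list(publications.values())


# Hypothetical entries for one three-author publication in which
# two of the authors are featured in separate same-name groups.
citations = [
    {"dblpkey": "conf/example/1", "authors": ["Y. Han", "Y. Wang", "J. Smith"],
     "featured": "Y. Han", "author_id": "han-01"},
    {"dblpkey": "conf/example/1", "authors": ["Y. Han", "Y. Wang", "J. Smith"],
     "featured": "Y. Wang", "author_id": "wang-07"},
]
merged_publications = to_publication_centric(citations)
merged_ids = [a["author_id"] for a in merged_publications[0]["authors"]]
```

As in the conversion described in the text, authors who were never featured keep an empty identifier: merging changes the format but does not make the data set fully disambiguated.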

Data set content analysis

This section describes six AND data sets, whose main properties are provided in Table 18 in the appendix. The Han-DBLP^{Footnote 21} data set is one of the first and most influential data sets in AND. It was originally created and employed by Han et al. (2005a, b), with a previous version described in Han et al. (2004). As the name suggests, Han-DBLP is based on data from DBLP, which was obtained using the publicly available download function. The inclusion of authorship records into the data set was based on the degree of ambiguity of the respective author name. This was determined by clustering all author names according to their first-name initial and full last name, and ranking the clusters according to their in-cluster name variability, i.e., according to the number of distinct full author names, such that highly ambiguous combinations of first-name initial and last name ranked top. The top four clusters determined in this way were “J. Lee”, “S. Lee”, “Y. Chen”, and “C. Chen”. These were complemented by ten other highly ambiguous clusters, resulting in a total of 14 clusters.

The statistics of the original data set can be found in Han et al. (2005a, b); Han et al. (2005b, p. 338) contains more detailed statistics, including a break-down of the individual name clusters. For this data set, and for all other data sets in this review, unless otherwise noted, our own statistics were calculated on the basis of our converted version of the publicly available^{Footnote 22} data set. Due to structural ambiguities in the original data set, automatic identification of the featured author was not possible for some records.^{Footnote 23} This is the reason why the number of records with ID for Han-DBLP in Table 18 is actually lower than the total number of publications. Table 1 shows a sample of an authorship record from the “DJohnson” block of the Han-DBLP data set. In this case, the name of the featured author is given in short form only, while the names of the co-authors are given in the original form, which in most cases is more complete. Other records in this data set provide the complete names for all authors. The original publication ordering of the authors is not retained in the data. This sample record is typical of the Han-DBLP data set, as it contains one featured author only.
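The selection procedure described for Han-DBLP (clustering names by first-name initial and full last name, then ranking clusters by the number of distinct full names they contain) can be sketched roughly as follows. The function names and the sample names are hypothetical, and the sketch assumes whitespace-separated names in "first ... last" order; it is not the original procedure of Han et al.

```python
from collections import defaultdict


def block_key(full_name):
    """First-name initial plus full last name, e.g. 'Jae Lee' -> 'J. Lee'."""
    parts = full_name.split()
    return f"{parts[0][0].upper()}. {parts[-1]}"


def rank_blocks(full_names):
    """Group names into same-name blocks, most variable block first.

    Returns a list of (block key, sorted distinct full names) pairs,
    ordered by the number of distinct full names in each block.
    """
    blocks = defaultdict(set)
    for name in full_names:
        if name.strip():  # skip empty strings
            blocks[block_key(name)].add(name)
    ranked = sorted(blocks.items(), key=lambda item: len(item[1]), reverse=True)
    return [(key, sorted(names)) for key, names in ranked]


# Hypothetical author names; the repeated "Jae Lee" counts only once.
names = ["Jae Lee", "Jin Lee", "John Lee", "Sung Lee", "Yan Chen", "Yi Chen", "Jae Lee"]
ranking = rank_blocks(names)
```

A block with many distinct full names behind one initial-plus-surname key is a candidate for a highly ambiguous same-name group in the sense used above.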

Table 1 Han-DBLP sample record
