Strong Baselines for Author Name Disambiguation with and Without Neural Networks
Abstract
Author name disambiguation (AND) is one of the most vital problems in scientometrics, and it has become a great challenge with the rapid growth of academic digital libraries. Existing approaches for this task substantially rely on complex clustering-like architectures, and they usually assume the number of clusters is known beforehand or predict the number by applying another model, which leads to increasingly complex and time-consuming architectures. In this paper, we combine simple neural networks with two sets of heuristic rules to explore strong baselines for the author name disambiguation problem without any a priori knowledge or estimation of cluster size, which frees the model from unnecessary complexity. On the popular benchmark dataset AMiner, our solution significantly outperforms several state-of-the-art methods in both performance and efficiency, and it still achieves performance comparable to many complex models when using only a group of rules. Experimental results also indicate that gains from sophisticated deep learning techniques are quite modest in the author name disambiguation problem.
Keywords
Author name disambiguation · Heuristic rules · Clustering problem · Baseline methods

1 Introduction
There has been significant historic and recent interest in the author name disambiguation (AND) problem, which can be defined as the problem of clustering unique authors using the metadata of publication records (title, venue, keywords, author names and affiliations, etc.) [11, 19, 23]. With the fast growth of scientific literature, the disambiguation problem has become a pressing issue, since numerous downstream applications, such as information retrieval and bibliographic data analysis, are affected by its results [5, 13]. Unfortunately, AND is not an elementary problem, because distinct authors may share the same name. This is quite common for Asian, and especially Chinese, researchers [9], since different Chinese names can map to the same English form (e.g., two distinct Chinese names may both be romanized as Wei Wang).
The problem of disambiguating who is who dates back at least a few decades. It is typically viewed as a clustering problem and solved by various clustering models, and such models inevitably have to answer two questions: how to quantify similarity, and how to determine the number of clusters [8]. Much of the existing literature focuses on the first question, via feature-based methods [12, 13] and graph-based methods [3, 16, 20]. In practice, quite a few of them involve increasingly complex and time-consuming architectures that yield progressively smaller gains over the previous state-of-the-art. For the second question, most previous approaches assume the number of clusters is known beforehand or predict it with another model [25]. However, the former is unrealistic in real situations and the latter may lead to error propagation.
Lost in this push, we argue that author name disambiguation is not a typical clustering task. Given the nature of this problem, precision deserves more attention than recall, since once two clusters are merged incorrectly, re-splitting them is almost impossible. Cast in this light, many existing clustering models are not well suited to the author name disambiguation problem. Meanwhile, cost-effective blocking techniques [1] and lightweight rule-based methods [2, 22] are worth studying, as they have been shown to achieve convincing precision on this problem.
In line with existing research that aims to improve empirical rigor by focusing on insights and knowledge, as opposed to simply “winning” [17], we peel away unnecessary components until we arrive at the simplest model that works well without any a priori knowledge about cluster size; it consists only of simple neural networks and some heuristic rules. Furthermore, the hierarchical agglomerative clustering (HAC) algorithm is adopted as the guiding principle to cluster publications. On the benchmark dataset AMiner [25], we find that our proposed solution achieves significantly better performance than several state-of-the-art methods. Experiments on another public dataset show that such rules reflect natural regularities and are applicable to the author name disambiguation task as a whole rather than just the AMiner dataset. Experimental results also suggest that while complex models do contribute meaningful advances on this problem, some of them exhibit unnecessary complexity, and rules play a role that cannot be ignored in this task.
2 Problem Definition
Given a set of publication records \(\mathcal {P}=\{p_1, p_2, \ldots , p_l\}\) associated with the same ambiguous author name, the goal is to partition \(\mathcal {P}\) into disjoint clusters \(\{C_1, C_2, \ldots , C_m\}\) such that:

All the records in \(C_k\) belong to the same author \(\alpha _k\).

All the records in \(\mathcal {P}\) written by \(\alpha _k\) are in \(C_k\).
3 Methodology
In this section, we discuss the design and implementation of our solution in detail. Its design philosophy is based on the observation that the interests of researchers usually do not change too frequently, and in particular, a researcher tends to stay in the same institution for a relatively long time [3]. From this, we can infer that researchers usually have relatively stable sets of co-authors, and that the topics of publications belonging to a researcher should be close in the semantic space during a certain period. This is also in line with the laws of human social activity in the real world: the friends and interests of a person are usually relatively fixed [6].
With this in mind, we first scatter the publication records \(\mathcal {P}=\{p_1, p_2, \ldots , p_l\}\) into l sets, so that each initial set contains exactly one publication p. Next, a pre-merging strategy (PMS) is proposed to make preliminary merge decisions according to co-authors. Simple neural networks (SNN) are then employed to measure the semantic similarity between two clusters via publication titles, since titles naturally convey the main point of a publication. Finally, we introduce a post-blocking strategy (PBS) to determine the final clusters elegantly. Figure 1 shows the concrete process of our proposed approach.
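The three stages above can be sketched as a greedy agglomerative loop. This is a minimal illustration, not the released implementation: the callables `pms_merge`, `snn_similarity` and `pbs_blocked` are placeholders for the components of Sects. 3.1–3.3, and the 0.8 reject threshold is the optimal value reported in Sect. 4.4.

```python
def disambiguate(pubs, pms_merge, snn_similarity, pbs_blocked, threshold=0.8):
    """Pipeline sketch: singleton clusters -> rule-based pre-merging ->
    greedy agglomerative merging by semantic similarity, with the
    post-blocking rules able to veto any candidate merge."""
    clusters = [[p] for p in pubs]      # one publication per initial cluster
    clusters = pms_merge(clusters)      # preliminary merge (pre-merging strategy)
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = snn_similarity(clusters[i], clusters[j])
                # only merges above the reject threshold and not vetoed count
                if sim > threshold and not pbs_blocked(clusters[i], clusters[j]):
                    if best is None or sim > best[0]:
                        best = (sim, i, j)
        if best is None:                # no admissible merge remains
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
```

Because merges stop as soon as no pair clears the threshold, the number of clusters is never needed as an input.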
3.1 Pre-merging Strategy
This step aims to merge the initial publication sets preliminarily using the point-to-point and cluster-to-cluster rules. For convenience, we define an identity constraint \(\mathcal {M}(i, j) \in \{1, 0\}\) to indicate that i and j will (not) be merged into a cluster, where i and j refer to two publications or clusters (Fig. 2).

Point-to-Point: Given two publications \(p_i\) and \(p_j\), if \(|S_{n}(p_i) \cap S_{n}(p_j)| > \lambda _1\), or \(A_{\alpha }(p_i) = A_{\alpha }(p_j)\) and \(|S_{n}(p_i) \cap S_{n}(p_j)| > 1\), then \(\mathcal {M}(p_i, p_j)=1\). For a publication \(p_i\), \(S_{n}(p_i)\) and \(A_{\alpha }(p_i)\) denote the set of author names and the affiliation of the current author name \(\alpha \), respectively.
Cluster-to-Cluster: Given two clusters \(C_i\) and \(C_j\), if \(\mathcal {O}_n(C_i,C_j) > \lambda _2\) or \(\mathcal {O}_a(C_i,C_j) > \lambda _2\), then \(\mathcal {M}(C_i, C_j)=1\), where \(\mathcal {O}_x(C_i,C_j)\) denotes the overlap ratio of the two clusters with respect to x, and \( x \in \{n, a\}\) denotes the name or affiliation of authors. We define the overlap ratio \(\mathcal {O}_x(C_i, C_j)\) as
$$\begin{aligned} \mathcal {O}_x(C_i, C_j) = \frac{ \sum _{\bar{x}\in {(S_x(C_i)\cap S_x(C_j))}} (F_{\bar{x}}(C_i)+ F_{\bar{x}}(C_j)) }{ \min \big (\sum _{\bar{x}\in {S_x(C_i)}}F_{\bar{x}}(C_i),\; \sum _{\bar{x}\in {S_x(C_j)}}F_{\bar{x}}(C_j) \big ) } \end{aligned}$$
(1)
where \(F_{\bar{x}}(C_i)\) is the number of occurrences of \(\bar{x}\) in the cluster \(C_i\).
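The two rules and Eq. (1) can be written down directly. The sketch below assumes each publication exposes a set of author names plus the current author's affiliation, and each cluster exposes name and affiliation frequency counts; all function and field names are illustrative.

```python
from collections import Counter

def overlap_ratio(counts_i, counts_j):
    """Eq. (1): combined counts of shared items, normalized by the
    smaller cluster's total count. counts_* are Counters F_x(C)."""
    shared = set(counts_i) & set(counts_j)
    numer = sum(counts_i[x] + counts_j[x] for x in shared)
    denom = min(sum(counts_i.values()), sum(counts_j.values()))
    return numer / denom if denom else 0.0

def point_to_point(names_i, names_j, affil_i, affil_j, lambda1=2):
    """Merge two publications when they share more than lambda1 author
    names, or share the current author's affiliation plus > 1 name."""
    shared_names = len(names_i & names_j)
    return shared_names > lambda1 or (affil_i == affil_j and shared_names > 1)

def cluster_to_cluster(cluster_i, cluster_j, lambda2=0.5):
    """Merge two clusters when the name or the affiliation overlap
    ratio exceeds lambda2; clusters are dicts of Counters here."""
    return (overlap_ratio(cluster_i["names"], cluster_j["names"]) > lambda2
            or overlap_ratio(cluster_i["affils"], cluster_j["affils"]) > lambda2)
```

Note that the ratio can exceed 1 because the numerator sums counts from both clusters while the denominator uses only the smaller one, so \(\lambda _2 = 0.5\) is a permissive threshold.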
3.2 Simple Neural Networks
As mentioned above, it is a natural idea to determine whether two publications belong to the same author by their topic similarity, since topic reflects the interest and direction of a researcher. In order to quantify the similarity effectively, we design a simple model based on convolutional neural networks (CNN) to project publications into a low-dimensional latent common space.
For a given cluster \(C_i\) containing \(|C_i|\) publications, the cluster embedding is defined as \({\mathbf {c}}_i = \frac{1}{|C_i|}\sum _{j=1}^{|C_i|} \mathcal {R} (p_j)\), where \(\mathcal {R}(p_j)\) is the learned representation of publication \(p_j\). We choose the cluster with the highest similarity to \(C_i\) as its target merging cluster, denoted as \(C_j\); the similarity between these two clusters is measured by the cosine similarity between \({\mathbf {c}}_i\) and \({\mathbf {c}}_j\). Finally, \(C_i\) and \(C_j\) will be merged unless a post-blocking strategy intervenes, which we discuss in the next subsection.
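Given per-publication title vectors (here assumed to come from the CNN encoder; we treat them as plain arrays), the cluster embedding and target-cluster selection are straightforward:

```python
import numpy as np

def cluster_embedding(pub_vectors):
    """c_i: the mean of the publications' title representations R(p_j)."""
    return np.mean(pub_vectors, axis=0)

def nearest_cluster(i, embeddings):
    """Return the index and cosine similarity of the cluster whose
    embedding is closest to cluster i (its target merging cluster)."""
    c_i = embeddings[i]
    best, best_sim = None, -1.0
    for j, c_j in enumerate(embeddings):
        if j == i:
            continue
        sim = c_i @ c_j / (np.linalg.norm(c_i) * np.linalg.norm(c_j))
        if sim > best_sim:
            best, best_sim = j, sim
    return best, best_sim
```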
3.3 Post-blocking Strategy
Anchor-to-Anchor: if \(S_{n}(p_i^*) \cap S_{n}(p_j^*) = \{\alpha \} \) and \(S_{a \setminus \alpha }(p_i^*) \cap S_{a \setminus \alpha }(p_j^*) = \emptyset \), then \(\mathcal {M}(C_i, C_j)=0\). For an anchor publication \(p_i^*\), \(S_{a \setminus \alpha }(p_i^*)\) denotes its set of affiliations excluding that of the current author \(\alpha \).
The anchor-to-anchor rule can be interpreted as follows: if there is no intersection between the name sets or the affiliation sets of \(p_i^*\) and \(p_j^*\) other than the current author name \(\alpha \) and its affiliation, we do not consider \(C_i\) and \(C_j\) to belong to the same author. To illustrate this process intuitively, we describe an example in Fig. 1 (the third step). Although the similarity between {Pub1, Pub3} and {Pub4} is the highest, the merge operation is still blocked because the anchor-to-anchor rule is violated.
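As a sketch, the veto condition on a pair of anchor publications is a pair of set tests (argument names are ours; the affiliation sets are assumed to already exclude the current author's own affiliation):

```python
def anchor_blocked(names_i, names_j, affils_i, affils_j, current_name):
    """Anchor-to-anchor rule: block the merge when the two anchor
    publications share only the ambiguous name itself and have no
    other affiliation in common."""
    shared_names = names_i & names_j
    shared_affils = affils_i & affils_j
    return shared_names == {current_name} and not shared_affils
```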
4 Experiments
4.1 Dataset
We conduct our experiments on the widely used public benchmark dataset AMiner introduced in [25]^{1}, which is sampled from a well-labeled academic database. The labeling process is based on the publication lists on authors’ homepages and the affiliations and emails in web databases (e.g., Scopus, ACM Digital Library). The training set contains publications of 500 author names, and the test set has 100 author names. Each publication has five fields: title, keywords, venue, author names and corresponding affiliations. In this paper, we only use the title, author names and affiliations to develop our solution. Compared with existing benchmarks for name disambiguation, AMiner is significantly larger (in terms of the number of documents) and more challenging (since each candidate set contains many more clusters) [25].
4.2 Experiment Settings
Following popular choices, we tune our model using five-fold cross validation. For the pre-merging strategy (PMS), we set \(\lambda _1\) to 2 and \(\lambda _2\) to 0.5 experimentally. Beyond that, the CBOW model [14] with k = 100 is employed to learn initial word representations on the training set of AMiner. The simple neural networks (SNN) model is trained using the stochastic gradient descent (SGD) algorithm with an initial learning rate of 0.1 and a weight decay of 0.9; the batch size is 50 and the margin is 0.3. At the convolutional layer, the number of filter maps is 100 and the window size is 3. Dropout with p = 0.3 is applied after the input layer.
4.3 Comparison Methods
Basic Rules [25]: It constructs linkage graphs by connecting two publications when their coauthors, affiliations or venues are strictly equal. Results are obtained by simply partitioning the graph into connected components.
Fan et al. [3]: For each name, it constructs a graph by collapsing all the coauthors with identical names to one node. The final results are generated by affinity propagation algorithm and the distance between two nodes is measured based on the number of valid paths.
Louppe et al. [13]: It trains a pairwise distance function based on carefully designed similarity features, and uses a semi-supervised Hierarchical Agglomerative Clustering (HAC) algorithm to determine clusters.
Zhang and Al Hasan [24]: It constructs graphs for each author name based on co-author and document similarity. Embeddings are learned for each name and the final results are also obtained by HAC.

Table 1. Results of author name disambiguation on the AMiner benchmark dataset. \(\dag \) marks results reported in [25].

Model                              Precision  Recall  F1 Score
Basic Rules [25]\(^\dag \)         44.94      89.30   53.42
Fan et al. [3]\(^\dag \)           81.62      40.43   50.23
Louppe et al. [13]\(^\dag \)       57.09      77.22   63.10
Zhang and Al Hasan [24]\(^\dag \)  70.63      59.53   62.81
Zhang et al. [25]\(^\dag \)        77.96      63.03   67.79
PMS                                81.86      55.61   66.23
PMS+SNN                            73.90      61.97   67.41
PMS+SNN+PBS (PNP)                  76.92      64.54   70.19
Zhang et al. [25]: It introduces a representation learning framework that leverages both global supervision and local contexts, and also uses HAC as the clustering method; it is the latest approach on the dataset^{2}. Besides, it deploys recurrent neural networks to estimate the number of clusters.
Our method is indicated by PNP. In order to analyze the contribution of each component, we present results at each of the three stages described in Sect. 3.
4.4 Results
Table 1 shows the performance of different methods on the AMiner dataset. Following previous settings [25], we utilize pairwise Precision, Recall, and F1-score to evaluate all methods. Meanwhile, a macro-averaged score of each metric is calculated over all test names.
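For reference, the pairwise metrics treat every pair of publications placed in the same cluster as one prediction; a sketch (our own, following the standard definition of pairwise evaluation):

```python
from itertools import combinations

def pairwise_prf(pred_clusters, true_clusters):
    """Pairwise precision/recall/F1: a pair of publications counts as a
    true positive when both the predicted and the gold clustering put
    the two publications in the same cluster."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}
    pred, true = pairs(pred_clusters), pairs(true_clusters)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

This also makes the precision/recall asymmetry in Table 1 concrete: a conservative method that under-merges (like PMS) predicts few pairs, driving precision up and recall down.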
Table 2. Runtime and trainable parameter number of different models.

Model               Training       Testing        Trainable Parameters
Zhang et al. [25]   >24 h          \(\sim \)573 s    3,024,193
PMS                 –              \(\sim \)31 s     0
SNN                 \(\sim \)2 h     –              30,100
PBS                 –              \(\sim \)88 s     0
PMS+SNN+PBS (PNP)   \(\sim \)2 h     \(\sim \)119 s    30,100
In the bottom half of Table 1, incremental results of our method are presented. Specifically, PMS outperforms most baselines, which indicates the effectiveness of heuristic rules. PMS+SNN with the optimal reject threshold (0.8) yields better performance than PMS (+1.78% relative F1-score), which shows the advantage of SNN. PNP outperforms PMS+SNN by +4.12% relative F1-score and +4.09% relative precision, which verifies that incorporating PBS greatly enhances performance. Overall, we attribute these successes to the comprehensive consideration of rules and semantics based on the inherent characteristics of the author name disambiguation problem.
5 Analyses
5.1 Efficiency Analysis
We study the runtime and model size (excluding word embeddings) of our method as well as the state-of-the-art model [25] using its official implementation. For fairness, we run them on the same GPU server.
From Table 2, we find that Zhang et al. [25] is indeed computationally expensive, which is caused by the complex operations in modeling the local linkage with a graph auto-encoder and in estimating the number of clusters. In contrast, our PNP model is much simpler and faster, because it mainly relies on heuristic rules to model co-authors rather than embedding the local co-authorship into representations. Beyond that, our proposed model removes the need to know or estimate the cluster size beforehand, which is unrealistic or time-consuming. Overall, our approach is almost 5 times faster than the state-of-the-art model at test time and has a significant advantage in model size, which means that training our model requires far fewer computational resources and much less time.
5.2 Rule Sensitivity Analysis
By varying the value of \(\lambda _1\) across {0, 1, 2, 3, 4, 5}, we repeat the experiments and report results in Fig. 3(a). As observed, when \(\lambda _1\) increases, the F1-score first increases and then decreases; the best performance on both datasets is achieved when \(\lambda _1\) = 1. This is intuitive, because a person usually has a fixed partner, such as a mentor or leader. Furthermore, as shown in Fig. 3(b), 3(c) and 3(d), when fixing the value of \(\lambda _1\) and varying \(\lambda _2\), the two datasets show similar trends and peak at almost the same value of \(\lambda _2\), which strongly supports our claim that such rules reflect natural regularities and that the hyperparameters of the rules are relatively insensitive to the dataset. We hypothesize that this phenomenon is due to the particularity of the problem: the friends and affiliations of a person are usually relatively fixed.
We also reproduce the state-of-the-art model [25] on the OPEDAC 2018 dataset, where it achieves an F1-score of 50.4%, and our PNP model outperforms it by a substantial margin (+15.4%), which suggests the generalizability of our model^{3}. It is worth mentioning that the OPEDAC 2018 results in Fig. 3 should not be compared with other competitors on the leaderboard, because OPEDAC 2018 suffers from noise in the author lists. In fact, when combined with an additional denoising strategy, our PNP method finally ranked in the top 3% of the competition without any ensemble tricks.
5.3 Error Analysis
We analyze some of the errors made by our model on the AMiner dataset and find that the most common error is incorrect merging when publications have the same short and incomplete affiliations (e.g., Department of Computer Science). In other words, there might be two different people with the same name who happen to work in a department of computer science; if they do not belong to the same school, things become trickier.
To this end, we perform a supplemental experiment to explore the upper bound of precision: we merge two publications if and only if \(S_{n}(p_i)\) \(=\) \(S_{n}(p_j)\) and \(S_{a}(p_i)\) \(=\) \(S_{a}(p_j)\), which means that the name set and the affiliation set of the two publications are exactly the same. Experimental results show that the precision is only about 95%. For the remaining 5%, even humans cannot resolve them with confidence. After removing these indistinguishable samples, our pre-merging strategy attains 86% precision, which is quite acceptable for unsupervised heuristic rules.
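The upper-bound probe is the strictest possible merge criterion and can be stated in two lines (field names are illustrative):

```python
def exact_match_merge(pub_i, pub_j):
    """Upper-bound probe: merge only when both the author-name set and
    the affiliation set of the two publications are identical."""
    return (pub_i["names"] == pub_j["names"]
            and pub_i["affils"] == pub_j["affils"])
```

That even this criterion tops out near 95% precision shows that the residual errors stem from genuinely indistinguishable metadata, not from the rules themselves.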
6 Related Work
In many applications, author name disambiguation (AND) has been regarded as a challenging problem, dating back at least a few decades. With the growth of scientific literature, it has become ever more difficult and urgent to solve [4, 18, 21]. Depending on the scenario, the author name disambiguation problem can be divided into two subtasks: author name disambiguation from scratch (ANDS) [24, 25] and incremental author name disambiguation (IAND) [9, 10]; the former is generally a clustering problem, while the latter is a classification problem.
In this paper, we focus on the ANDS scenario, which is more challenging and practical than IAND. On the whole, state-of-the-art solutions for the task can be divided into two categories: feature-based and graph-based. Feature-based methods leverage a pairwise distance function to compare documents. Huang et al. [7] first use a blocking technique to group candidate documents with similar names together and employ DBSCAN to cluster documents. Louppe et al. [13] use a classifier to learn pairwise similarity and perform semi-supervised hierarchical clustering to generate results. Graph-based methods utilize graph topology and aggregate information from neighbors. Fan et al. [3] build a document graph for each name by co-authorship, and use a carefully designed similarity function and the affinity propagation algorithm to generate clustering results. Tang et al. [20] employ Hidden Markov Random Fields to model node and edge features in a unified probabilistic framework. Zhang and Al Hasan [24] learn graph embeddings from three constructed graphs based on document similarity and co-authorship. Moreover, Zhang et al. [25] combine the advantages of the above two lines of work by learning a global embedding using supervised metric learning and refining it using local linkage structures. In this push towards complexity, we do not believe baseline methods have been adequately explored, and thus it is unclear how much the various sophisticated techniques actually help.
7 Conclusion
In this paper, we take heuristic rules derived from real-world observations into consideration and propose a strong baseline for the author name disambiguation problem. The proposed model contains a pre-merging strategy, simple neural networks and a post-blocking strategy, and does not need any extra knowledge about cluster size. Experimental results verify the advantage of our method over state-of-the-art methods, and demonstrate that the proposed model is highly efficient and that the rules extend to other datasets, with many conclusions consistent with sociological phenomena. Beyond that, we further explore the upper bound of disambiguation precision and analyze the possible reasons, which we leave as future work. To conclude, we offer all data mining researchers a point of reflection, like some previous work [15]: it is important to consider baselines that do not involve complex architectures, as simple methods might lead to unexpected performance.
Acknowledgements
This research is supported by the National Key Research and Development Program of China (grant No. 2016YFB0801003) and the Strategic Priority Research Program of Chinese Academy of Sciences (grant No. XDC02040400).
References
1. Backes, T.: The impact of name-matching and blocking on author disambiguation. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM) (2018)
2. Caron, E., van Eck, N.J.: Large scale author name disambiguation using rule-based scoring and clustering. In: Proceedings of the International Conference on Science and Technology Indicators (STI) (2014)
3. Fan, X., Wang, J., Pu, X., Zhou, L., Lv, B.: On graph-based name disambiguation. J. Data Inf. Qual. (2011)
4. Ferreira, A.A., Gonçalves, M.A., Laender, A.H.: A brief survey of automatic methods for author name disambiguation. ACM SIGMOD Record (2012)
5. Han, D., Liu, S., Hu, Y., Wang, B., Sun, Y.: ELM-based name disambiguation in bibliography. World Wide Web (2015)
6. Hirsch, J.E.: An index to quantify an individual’s scientific research output that takes into account the effect of multiple co-authorship. Scientometrics 85, 741–754 (2010)
7. Huang, J., Ertekin, S., Giles, C.L.: Efficient name disambiguation for large-scale databases. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD) (2006)
8. Hussain, I., Asghar, S.: A survey of author name disambiguation techniques: 2010–2016. Knowl. Eng. Rev. (2017)
9. Hussain, I., Asghar, S.: Incremental author name disambiguation using author profile models and self-citations. Turkish J. Electric. Eng. Comput. Sci. (2019)
10. Kim, K., Rohatgi, S., Giles, C.L.: Hybrid deep pairwise classification for author name disambiguation. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM) (2019)
11. Levin, M., Krawczyk, S., Bethard, S., Jurafsky, D.: Citation-based bootstrapping for large-scale author disambiguation. J. Am. Soc. Inf. Sci. Technol. (2012)
12. Liu, J., Lei, K.H., Liu, J.Y., Wang, C., Han, J.: Ranking-based name matching for author disambiguation in bibliographic data. In: Proceedings of the 2013 KDD Cup 2013 Workshop (2013)
13. Louppe, G., Al-Natsheh, H.T., Susik, M., Maguire, E.J.: Ethnicity sensitive author disambiguation using semi-supervised learning. In: Proceedings of the 7th International Conference on Knowledge Engineering and Semantic Web (KESW) (2016)
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NeurIPS) (2013)
15. Mohammed, S., Shi, P., Lin, J.: Strong baselines for simple question answering over knowledge graphs with and without neural networks. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (2018)
16. Niu, F., Ré, C., Doan, A., Shavlik, J.: Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS. In: Proceedings of the Very Large Data Bases Endowment (VLDB) (2011)
17. Sculley, D., Snoek, J., Wiltschko, A., Rahimi, A.: Winner’s curse? On pace, progress, and empirical rigor. In: Workshop track of the 6th International Conference on Learning Representations (ICLR) (2018)
18. Shen, Q., Wu, T., Yang, H., Wu, Y., Qu, H., Cui, W.: NameClarifier: a visual analytics system for author name disambiguation. IEEE Trans. Visual. Comput. Graph. (2016)
19. Smalheiser, N.R., Torvik, V.I.: Author name disambiguation. Annual Rev. Inf. Sci. Technol. (2009)
20. Tang, J., Fong, A.C., Wang, B., Zhang, J.: A unified probabilistic framework for name disambiguation in digital library. IEEE Trans. Knowl. Data Eng. (2012)
21. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: ArnetMiner: extraction and mining of academic social networks. In: Proceedings of the 14th ACM International Conference on Knowledge Discovery & Data Mining (KDD) (2008)
22. Veloso, A., Ferreira, A.A., Gonçalves, M.A., Laender, A.H., Meira, W.: Cost-effective on-demand associative author name disambiguation. Inf. Process. Manage. Int. J. (2012)
23. Yoshida, M., Ikeda, M., Ono, S., Sato, I., Nakagawa, H.: Person name disambiguation by bootstrapping. In: Proceedings of the 23rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2010)
24. Zhang, B., Al Hasan, M.: Name disambiguation in anonymized graphs using network embedding. In: Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM) (2017)
25. Zhang, Y., Zhang, F., Yao, P., Tang, J.: Name disambiguation in AMiner: clustering, maintenance, and human in the loop. In: Proceedings of the 24th ACM International Conference on Knowledge Discovery & Data Mining (KDD) (2018)