1 Introduction

Scientific digital bibliographic repositories, such as DBLP [1], AMiner [2], and CiteSeerX [3], provide bibliographic information offering features that allow the identification of scientific research, authors, and their respective communities. Such repositories can list millions of bibliographic records presenting a vital source of information for academic communities and allowing relevant publication search in a centralized way [4]. In addition to the literature search facility, these digital libraries provide functional analysis and information used for decision-making by funding agencies and academic institutions for grants and individual promotion decisions [5].

Name ambiguity may arise in citation records when an author’s name is not accurately identified. This situation can occur when an author is listed in the bibliography using different names or when two or more authors have the same name.

This problem has many causes, including name changes for personal reasons, variations in the transliteration of non-Roman names, typographical errors, the absence of standard practices, and decentralized content generation, such as through automatic harvesting. These aspects have been discussed in previous studies [6,7,8].

Consequently, the effectiveness of the primary functions of digital bibliographic repositories, including searching, navigating, and suggesting content, can be significantly impacted by the uncertainty surrounding author names. These issues may significantly reduce the accuracy and reliability of citation records, which can affect the quality of scientific research [4].

However, developing effective methods for disambiguating author names is a challenging task. In the literature, there are different approaches to solving this problem, with techniques varying from heuristic-based to more complex methods that leverage artificial intelligence with supervised and unsupervised learning [9]. This set of methods forms an area of study known as Author Name Disambiguation (AND) [10].

With the growing scientific interest in AND, literature reviews presenting different methods are emerging. The review in [5] covers techniques available in the literature from 2010 to 2016, comparing them at an abstract level, discussing limitations, and classifying them into five categories. However, these categories only classify the techniques, unlike other reviews that also classify the type of evidence explored. The authors in [11] presented a brief survey of automatic AND methods and proposed a taxonomy for classifying the techniques, including the explored evidence types. In [4], the authors updated the review, emphasizing the previous taxonomy and sorting automatic techniques for AND in bibliographic repositories (2003–2020).

In [12], the literature review focuses on approaches that applied AND methods in the PubMed bibliographic repository until 2019. The authors proposed a new taxonomy with a subdivision of the category author grouping based on the taxonomy presented in [4, 11] with similar evidence types explored. A recent review analyzes the development of incremental AND methods using similarity comparison strategies from 2011 to 2020 [13].

In this literature review, we use the hierarchical taxonomy proposed by [4, 11] to classify AND approaches. The documents found during this review fit adequately in the taxonomy classifications according to the diagram shown in Fig. 1. The AND methods and techniques of this taxonomy are detailed below.

Fig. 1 The taxonomy used in this literature review [4, 11]

1.1 Type of approach

  • Author Grouping aims to group references of the same author using a type of similarity by analyzing the attributes of these references. Usually, these methods use clustering techniques, pre-defined similarity functions, or machine learning techniques, extracting information from co-authorship relationships or a set of heuristic rules.

  • Author Assignment methods assign each author record using the construction of a model that represents the author. These methods aim to directly attribute the authorship record to their respective authors, adopting some classification or clustering technique.
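
As an illustration of the Author Grouping idea only, the following minimal sketch groups references that carry the same ambiguous name by the overlap of their co-author sets. The Jaccard similarity, the single-link merging, the threshold value, and all names and records are assumptions chosen for the example; they are not taken from any of the reviewed methods.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of co-author names."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_references(references, threshold=0.3):
    """Group references whose co-author sets are similar (single-link merging)."""
    parent = list(range(len(references)))

    def find(i):
        # Follow parent links to the representative of the group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(references)), 2):
        sim = jaccard(references[i]["coauthors"], references[j]["coauthors"])
        if sim >= threshold:  # assumed threshold, not from the source
            parent[find(i)] = find(j)

    clusters = {}
    for i, ref in enumerate(references):
        clusters.setdefault(find(i), []).append(ref["id"])
    return sorted(clusters.values())


# Hypothetical references that all carry the same ambiguous author name.
refs = [
    {"id": "r1", "coauthors": {"A. Costa", "B. Lima"}},
    {"id": "r2", "coauthors": {"A. Costa", "C. Rocha"}},
    {"id": "r3", "coauthors": {"D. Wang", "E. Chen"}},
    {"id": "r4", "coauthors": {"E. Chen"}},
]
groups = group_references(refs)
print(groups)
```

An Author Assignment method would instead build a model per known author and assign each new record to one of those models; that variant is not sketched here.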

1.1.1 Explored evidence

  • Citation Information extracts information directly from the citation records, such as author and co-author names, paper titles, year of publication, and other information. These attributes are the most commonly used by the AND methods available in the literature. However, sometimes they do not provide enough information about the approaches used.

  • Web Information is extracted from the Web and used as supplementary information about an author’s publication profile. The information obtained is used as attributes to calculate the similarity between authorship records.

  • Implicit Evidence is inferred from visible attribute elements, such as the latent topics of a citation, expressed as the probability of each topic given a particular citation. This value is used as an attribute or evidence to calculate the similarity between authorship records.

Analyzing the reviews presented, we note they classify AND approaches, describe the main characteristics, and propose a taxonomy for classification. In contrast, our review poses questions focusing on the most cited works, most relevant authors, most-used approaches, and pursued lines of research. To answer such questions, we use meta-analysis based on statistics to summarize the work results. The Theory of Consolidated Meta-analytic Approach (Teoria do Enfoque Meta-analítico Consolidado - TEMAC [14], in Portuguese) is used to collect data on the AND area with quantitative information available in bibliographic repositories, such as the Web of Science (WoS) and Scopus. A systematic review, in turn, answers research questions by collecting and summarizing empirical evidence that fits pre-specified eligibility criteria, as suggested by [15, 16]. We think a systematic review would add investigation threads that complement our work rather than replace it. In addition, a meta-analytical AND review has not been conducted to date, so this work creates a foundation for forthcoming research that will contribute to the body of knowledge built upon prior review studies [5, 11, 12].

The following sections present the literature review methodology, results and analysis, and conclusions with future work.

2 Methodology

Traditionally, systematic literature reviews focus on accessing many different digital bibliographic repositories to enrich the findings [15, 16]. In this work, a consolidated meta-analytic approach was chosen to carry out the literature review because of its methodological advantage: its process reduces the need to access private scientific digital bibliographic repositories while minimizing bias and maximizing possible coverage. Specifically, Brazilian public universities have free access to major scientific repositories such as Scopus and WoS through the Periodicos portal of the Coordination for the Improvement of Higher Education Personnel (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES), an institution linked to the Ministry of Education in Brazil.

The TEMAC [14] emerged as an exploratory solution, supported by prior strategies and grounded in bibliometric principles, focusing on the need to unify various systematic methods within a meta-analytical framework, as illustrated by recent publications [17, 18]. Moreover, meta-analytical review studies of AND have not yet been conducted, so this work provides a starting point for future research, adding to the body of knowledge of existing systematic review studies.

The TEMAC consists of three steps: research preparation; data presentation and interrelation; and detailing, integrating model, and validation by evidence, as presented below.

2.1 Research preparation

The review preparation step is vital as wrong choices generate unsatisfactory results (e.g., inadequate search strings). The work of [19], using meta-analytic analysis, states that one of the essential stages in review studies is the reading of articles to define specific criteria to include and exclude studies. We defined specific criteria for inclusion (IC) and exclusion (EC) to guide the selection of academic works as follows:

  • IC-1: primarily addressing AND as an integral component of the study.

  • IC-2: published in peer-reviewed conferences or journals, available in major bibliographic repositories.

  • EC-1: works that are not available in online repositories.

  • EC-2: primarily associated with domains other than information systems, computer science, and engineering.

  • EC-3: published beyond the timeframe of 2003–2022.

These choices were substantiated by addressing four fundamental questions during the selection process:

  1. What is the search descriptor, string, or keyword? The string “author name disambiguation” is used without and/or connectives to include studies in different contexts and scientific bibliographic repositories.

  2. Which are the databases? The WoS and Scopus bibliographic repositories, esteemed within numerous academic communities, were chosen due to the fact that the works within these databases emanate from peer-reviewed conferences or journals. This decision was grounded in WoS’s extensive temporal coverage and Scopus’s comprehensive scope of science and technology journals. These databases are complementary and widely employed in literature reviews.

  3. What is the space-time field of the research? Temporal delineation is crucial, as databases have varying time coverage. These documents ranged from 2003, which marked the first work on AND on the web, to 2022.

  4. Which are the knowledge areas? We determined the knowledge domains after examining the documents included in the WoS and Scopus databases. After this examination, the domains were categorized into Computer Science, Social and Information Sciences, Medicine, Engineering, and Mathematics. Table 1 presents AND knowledge areas in both databases.

Using a meta-analytic review approach establishes a foundation for conducting an exploratory study, ensuring the inclusion of relevant works for constructing up-to-date knowledge of the AND research area.
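
To make the selection criteria concrete, the following minimal sketch screens a bibliographic record against IC-1, IC-2, and EC-1 to EC-3. The record fields (`year`, `venue_type`, `available_online`, `areas`, `addresses_and`) are hypothetical names introduced for the example, and the IC-1 judgment is assumed to be a flag set manually after reading the article; this is not the export format of WoS or Scopus.

```python
# Domains named in EC-2; works primarily outside them are excluded.
ALLOWED_AREAS = {"information systems", "computer science", "engineering"}
# Venue types accepted by IC-2 (peer-reviewed conferences or journals).
PEER_REVIEWED = {"journal", "conference"}


def passes_screening(record):
    """Return True when a record meets IC-1/IC-2 and triggers no EC."""
    about_and = record["addresses_and"]                      # IC-1 (manual flag)
    peer_reviewed = record["venue_type"] in PEER_REVIEWED    # IC-2
    online = record["available_online"]                      # EC-1
    in_domain = bool(ALLOWED_AREAS & set(record["areas"]))   # EC-2
    in_period = 2003 <= record["year"] <= 2022               # EC-3
    return all([about_and, peer_reviewed, online, in_domain, in_period])


# Hypothetical record used only to exercise the function.
example = {
    "year": 2015,
    "venue_type": "journal",
    "available_online": True,
    "areas": ["computer science"],
    "addresses_and": True,
}
print(passes_screening(example))
```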

Table 1 Knowledge areas in WoS and Scopus databases

2.2 Data presentation and interrelation

This section covers the data presentation and interrelation aspects of the TEMAC meta-analytical framework, including the laws and tools used in the exploratory study of the AND research area.

2.2.1 Laws

The TEMAC meta-analytical overall framework includes quantitative techniques and bibliometric aspects based on three laws.

  • Bradford’s law [20] allows finding the journals that publish the most on a topic. The scientific journals of an area are ordered by decreasing productivity, which yields a nucleus in which a few journals account for a high share of the total publications, while a large number of journals publish few articles in the area [21]. This law also measures bibliographic dispersion, that is, how much knowledge is dispersed across journals. Bradford’s law is computed from the core, that is, the journals that have published the most articles on the subject. As one moves away from the core, the number of journals needed in each subsequent zone to supply a comparable number of articles grows in the proportion \(1:n:n^{2}:n^{3}\), where n is a constant multiplier. In the context of this study, Bradford’s law helps identify a limited number of scientific journals in the AND area that collectively account for a substantial portion of the total publications.

  • The Elitism or Price law derives from Lotka’s law [22], one of the most discussed models in bibliometrics, which states that the number of authors making n contributions is about \(1/n^{2}\) of those making a single contribution. The Elitism law seeks to reveal the most important (most-cited) authors and papers by taking the square root of the total number of authors, which identifies what is considered the elite. If N represents the total number of authors, \(\sqrt{N}\) represents the elite of the studied area. In this study, the most cited authors reveal the most important authors and documents, responsible for more than half of the contributions in the AND area.

  • The 80/20 law (Pareto rule) [23] is inspired by information systems used in commerce and industry, where 80% of the information demand is satisfied by 20% of the information sources. In this work, this law is used to identify the most relevant journals, conferences, countries, and universities publishing in the AND area, and to choose the most representative keywords.
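
The three laws above reduce to simple arithmetic, which the following minimal sketch illustrates. The function names, the rounding of the square root, and the sample counts are assumptions made for the example; they do not reproduce the computations performed by the tools used in this review.

```python
import math
from collections import Counter


def bradford_zones(core_journals, multiplier, zones=3):
    """Journals per zone under Bradford's 1 : n : n^2 progression."""
    return [core_journals * multiplier ** k for k in range(zones)]


def lotka_expected_authors(single_paper_authors, n):
    """Lotka's law: authors with n papers ~ (authors with one paper) / n^2."""
    return single_paper_authors / n ** 2


def elite_size(total_authors):
    """Elitism law: the elite is about the square root of all authors."""
    return round(math.sqrt(total_authors))


def pareto_core(counts, share=0.8):
    """Top sources, in decreasing order, needed to cover `share` of documents."""
    total = sum(counts.values())
    core, covered = [], 0
    for source, count in Counter(counts).most_common():
        if covered >= share * total:
            break
        core.append(source)
        covered += count
    return core


# Hypothetical figures, used only to exercise the functions.
print(bradford_zones(core_journals=3, multiplier=4))
print(lotka_expected_authors(single_paper_authors=100, n=2))
print(elite_size(400))
print(pareto_core({"A": 50, "B": 30, "C": 10, "D": 10}))
```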

2.2.2 Tools

The data presentation and interrelation using the consolidated meta-analytic approach of TEMAC allows for a review of the most relevant authors and citations, journals, countries, organizations, or universities, and knowledge areas most related to the research field. To perform the data analysis review, we used the VOSviewer bibliometric tool [24], and the BiblioTools [25, 26].

The VOSviewer tool for visualizing and analyzing bibliometric networks provides insights into patterns, relationships, and trends in the research literature. The VOSviewer allowed the production of visual representations of bibliometric data. The co-authorship and co-citation networks’ visualization with research clusters, influential authors, and new research directions help reveal relationships between authors and documents in review works. In summary, the VOSviewer tool helps to explore large bibliometric datasets, contributing to understanding the landscape of research fields.

BiblioTools is a suite of Python scripts for bibliometric analysis that can be integrated with different digital repositories and offers numerous functions, such as data mining, data processing, data analysis, keyword visualization, and automated report generation. BiblioTools makes it possible to refine and clean raw data with a preprocessing script that prepares the dataset for analysis. We explored our data, producing a variety of co-occurrence networks, such as co-words, co-authors, and co-citations. BiblioTools also allows the visualization of bibliographic coupling networks and clusters, providing information about publications, authors, and research topic connections.

Using the BiblioTools and VOSviewer, we extracted information as follows.

  1. An analysis of journals and conferences with the largest number of documents on the topic;

  2. Journals with the largest number of documents;

  3. Publications in journals and conferences per year;

  4. Authors who published the most versus most cited authors;

  5. The countries that published the most;

  6. Organizations or Universities that published the most;

  7. Knowledge areas that most publish;

  8. Keyword frequency.

Fig. 2 Documents obtained in the Scopus, WoS, and the merged databases

Table 2 Document types in the databases

Table 3 Journals with the largest number of documents

Fig. 3 Evolution of journal and conference publications per year

2.3 Detailing, integrating model and validation by evidence

In the third step, deeper analyses allow a better understanding of the topic, selecting principal authors, approaches, lines of research, and validation by evidence with a comparison of results from the different databases.

This evidence is obtained with the analysis of co-citations and bibliographic coupling maps. The co-citation method connects different authors and documents based on their appearance together in the lists of references obtained in the bibliographic repositories. On the other hand, the bibliographic coupling method connects authors and documents based on the number of references they share. In other words, co-citation presents works frequently cited together and may show similarities between studies, whereas coupling relies on the premise that works citing the same articles have similar contexts and indicates the current research fronts using an up-to-date time frame.
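
The difference between the two measures can be seen in a minimal sketch that computes both from the same reference lists. The document and work identifiers are hypothetical, and the plain pair counts used here are an assumption for illustration; they are not the weighting or normalization applied by VOSviewer or BiblioTools.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often two cited works appear together in a reference list."""
    counts = Counter()
    for refs in reference_lists.values():
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts


def coupling_strength(reference_lists):
    """Count the references shared by each pair of citing documents."""
    strength = {}
    for d1, d2 in combinations(sorted(reference_lists), 2):
        shared = len(set(reference_lists[d1]) & set(reference_lists[d2]))
        if shared:
            strength[(d1, d2)] = shared
    return strength


# Hypothetical citing documents and the works they cite.
reference_lists = {
    "doc_a": ["w1", "w2", "w3"],
    "doc_b": ["w1", "w2"],
    "doc_c": ["w3", "w4"],
}
cocited = cocitation_counts(reference_lists)
coupled = coupling_strength(reference_lists)
print(cocited.most_common(1))
print(coupled)
```

In this toy input, the cited works w1 and w2 are linked by co-citation because two documents cite them together, while doc_a and doc_b are linked by coupling because they share two references.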

Co-citation and coupling analyses are commonly used in systematic reviews [27,28,29,30,31]. They fulfill the functions of revealing the main research approaches, establishing the fronts, and revealing future research directions [14]. By establishing a link between references (past) and the most prominent works (future), they fulfill the function of snowballing in an automated way. Thus, through co-citation, one can understand the main approaches of the past while the coupling identifies the primary current studies. Also, the keyword cloud is essential for revealing lines of research demonstrating the different applications in certain areas [19]. The keyword cloud is usually carried out using the frequency of keywords and can be enhanced by using the co-occurrence of these words.
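
As a minimal sketch of the keyword analysis described above, the following code counts keyword frequency per document and the co-occurrence of keyword pairs. Lowercasing as the only normalization and the sample keyword lists are assumptions for illustration; they are not the preprocessing used to produce the word cloud in this review.

```python
from collections import Counter
from itertools import combinations


def keyword_frequency(keyword_lists):
    """Number of documents in which each (lowercased) keyword appears."""
    counts = Counter()
    for keywords in keyword_lists:
        counts.update({k.lower() for k in keywords})
    return counts


def keyword_cooccurrence(keyword_lists):
    """Number of documents in which each pair of keywords appears together."""
    pairs = Counter()
    for keywords in keyword_lists:
        unique = sorted({k.lower() for k in keywords})
        pairs.update(combinations(unique, 2))
    return pairs


# Hypothetical keyword lists, one per document.
docs = [
    ["Clustering", "Similarity"],
    ["clustering", "Graph"],
    ["Clustering", "similarity", "graph"],
]
freq = keyword_frequency(docs)
cooc = keyword_cooccurrence(docs)
print(freq.most_common())
print(cooc.most_common())
```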

Fig. 4 Distribution of documents by knowledge area in the merged databases

Table 4 Number of Authors versus Number of Publications (merged database)

3 Results and analysis

The database search returned 197 documents in Scopus, of which 137 were also in the WoS. The data export included the complete work records, including the fields Author, Title, Abstract, Keywords, Addresses, and Cited References. For a broad literature review investigation, we merged the documents obtained from the WoS and Scopus databases, adding the 14 documents found only in the WoS. Figure 2 presents a Venn diagram with the 211 documents recovered from both bibliographic databases. Additional information on the literature review results and analysis using BiblioTools [25] is available (Footnote 1).
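
The merge arithmetic can be checked with a minimal set-based sketch. The identifiers are placeholders, and the figure of 60 Scopus-only documents is derived here as 197 minus 137 rather than stated in the text.

```python
# Placeholder identifiers: 137 documents shared by both databases,
# 60 found only in Scopus (197 - 137, derived), and 14 found only in the WoS.
shared = {f"shared-{i}" for i in range(137)}
scopus_ids = shared | {f"scopus-only-{i}" for i in range(60)}
wos_ids = shared | {f"wos-only-{i}" for i in range(14)}

merged = scopus_ids | wos_ids  # union removes the duplicated records

print(len(scopus_ids), len(wos_ids), len(scopus_ids & wos_ids), len(merged))
```

The resulting WoS total of 151 agrees with the sum of the WoS document types reported in Table 2.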

3.1 Data presentation and interrelation

In this step, we present the literature review interrelationship of data and quantitative information. Table 2 presents document types in WoS and Scopus databases. On the WoS, approximately 53% (81) are journal articles, 41% (62) conference articles, 2% (3) reviews, 2.6% (4) erratum and early access, and 0.6% (1) data paper. According to the Scopus database, approximately 45.6% (90) of documents are journal articles, 43.1% (85) conference papers, and the remainder divided into conference reviews 7.1% (14), review 2.5% (5), book chapter 0.5% (1), data paper 0.5% (1), and erratum 0.5% (1). Considering the documents obtained from the databases merged, 46.4% (98) are journal articles, 41.2% (87) conference articles, 6.6% (14) conference reviews, 2.3% (5) reviews, 3.3% (7) are divided into book chapters, data paper, erratum, and early access.

According to Table 3, most documents are of the journal type. Scientometrics, with an h-index of 123 and an SJR of 0.929, is the journal with the most publications in the AND area, with 21 publications in the WoS and the merged databases. The remaining journals together have 23 publications in the merged database. IEEE Access has the highest h-index (158) and an SJR of 0.927, but only four publications in the merged databases. It is also possible to verify that, in this set of documents, the journals related to the Information Science area are in the majority.

The document distribution among the databases is similar, and most documents are journal and conference articles. Figure 3 presents the evolution of publications in journals and conferences per year. Note that in all cases, journal and conference documents have alternated in predominance over the years. However, in the recent years from 2020 to 2022, the number of journal articles has increased.

As shown in Fig. 4, the knowledge areas of Computer Science (34.1%), Social Sciences (13.8%), and Engineering (10.8%) together account for more than half (58.7%) of the AND documents in the WoS and Scopus merged database. The leadership of Computer Science might be related to the fact that AND is an open problem in the area, triggering methods and approaches to solve it. Although some works use databases related to the Medicine domain (PubMed [32]), this area corresponds to only 3.8% of all documents.

Based on the literature review method, it is possible to identify the authors with the most publications and the most cited ones. Considering the WoS, Scopus, and merged datasets, we empirically tested which minimum number of publications would yield an appropriate citation index for filtering authors. As presented in Table 4, the number of authors decreases as the number of publications per author increases. Statistical observation showed that 97% of the authors have fewer than five publications. According to a well-established bibliometric principle (the Elitism law, presented in Sect. 2.2), a small number of authors contribute a large number of publications. Thus, by selecting authors with at least five publications, we focus on the most prolific and, presumably, the most influential authors in the field. With this filter, the WoS database returned eight authors, and the Scopus and merged databases returned 13. We analyzed these authors, comparing the number of documents and citations.
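
The author filter described above can be sketched as follows. The function names and the sample author lists are hypothetical, and the sketch assumes author names are already disambiguated, which is precisely the problem the reviewed methods address.

```python
from collections import Counter


def publications_per_author(author_lists):
    """Count documents per author, counting each author once per document."""
    return Counter(a for authors in author_lists for a in set(authors))


def prolific_authors(author_lists, min_publications=5):
    """Authors with at least `min_publications` documents."""
    counts = publications_per_author(author_lists)
    return {a: c for a, c in counts.items() if c >= min_publications}


def share_below(author_lists, min_publications=5):
    """Fraction of authors with fewer than `min_publications` documents."""
    counts = publications_per_author(author_lists)
    below = sum(1 for c in counts.values() if c < min_publications)
    return below / len(counts)


# Hypothetical author lists, one per document.
author_lists = [["A", "B"]] * 5 + [["A", "C"], ["D"]]
print(prolific_authors(author_lists))
print(share_below(author_lists))
```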

As shown in Table 5, the author with the highest number of documents in the WoS database was Jinseok Kim (12). However, the results for the Scopus and merged databases showed that M. A. Gonçalves was the author with the highest number of documents (13). This author was also the most cited in the WoS database (305). However, among the authors not identified in the WoS database but found in the Scopus and merged databases, Torvik’s works [33,34,35,36,37] have more citations (549). In addition, in the merged database, Torvik’s work [36] is the most cited one (189).

Table 5 Most cited authors and authors’ documents

Fig. 5 Documents by country

Fig. 6 Citations by country

As shown in Fig. 5, we measured the number of documents and citations by country using information from the WoS, Scopus, and merged databases. Following the specification of cited authors and document quantity per author (Table 5), we filtered the countries of the merged databases with five or more publications. Regarding documents by country, the USA, China, Germany, and Brazil lead the list of countries that publish the most, considering the three databases. In the WoS and Scopus merged databases, for example, the USA has 45 (21.3%) documents, China 41 (19.4%), Germany 28 (13.2%), and Brazil 18 (8.5%).

According to the graph in Fig. 6, the ranking by number of citations differs from the ranking by number of documents. The USA, Brazil, and China present the most citations. The USA leads comfortably with 1337 document citations. It is also possible to verify that Brazil (510) has more document citations than China (311), despite having fewer documents. China’s citation number is very close to the amount from Germany (51). But Germany is the third country that publishes the most.

Table 6 Documents and citations by organizations

We filtered the number of publications per year in the three databases studied. As proposed in the literature review method, the selected documents were from 2003 to 2022. Analyzing Fig. 3, which uses the WoS, Scopus, and merged databases, it is possible to observe an increase in publications in the AND area since 2003, with a decrease in 2015 following growth in 2014. It is also important to note that, even with the global pandemic in 2020 and 2021, publications grew compared to previous years. For 2022, we did not obtain the total number of publications, so a new survey in 2023 is required to validate the annual growth.

We conducted the same analysis of publications and citations for the organizations that publish studies in the AND area. We included organizations with more than 20 citations in each database (WoS and Scopus), as presented in Table 6. Considering the WoS and Scopus merged databases, we found that North American and Brazilian organizations regularly publish with good document citation scores. In Brazil, most publications are produced jointly by the Departamento de Ciência da Computação da Universidade Federal de Minas Gerais and the Departamento de Computação da Universidade Federal de Ouro Preto. These two organizations have 18 publications and 682 citations. Unlike Brazilian organizations, North American organizations usually do not publish together. However, each organization has many publications. The School of Information Sciences of the University of Illinois at Urbana-Champaign has 11 publications and 498 citations. The Institute for Research on Innovation & Science of the University of Michigan has 12 publications with 192 citations.

Using the document titles and abstracts in the WoS and Scopus merged databases, we generated a word cloud, as shown in Fig. 7. The cloud included words related to the AND problem and solving approaches, such as data, clustering, information, learning, similarity, publication, model, network, libraries, and graph. In the bibliographic co-citation and coupling analysis, such approaches validate the recurrent use of the methods in the AND area.

Fig. 7 Word cloud considering document titles and abstracts in the WoS and Scopus merged databases

3.2 Detailing, integrating model and validation by evidence

In this section, we present the third step of the review methodology, including the co-citation analysis, bibliographic coupling analysis, and overview of AND publication.

Table 7 provides a direct correspondence between the references shown in Figs. 8 and 10 and their representations in the text, which improves the readability and clarity of our results.

3.2.1 Co-citation analysis

In the co-citation analysis of Fig. 8, we identify a similarity between the authors’ contributions and their areas of study interest. There are four dark red spots representing co-citation cores. Below, we will detail the leading studies of each cluster and classify them according to the taxonomy used in this literature review (Fig. 1).

Fig. 8 Co-citation clusters under heat map analysis. The circular dotted lines with explicit numbered labels indicate each cluster

Table 7 Correspondence references cited in Figs. 8 and 10

Cluster 1

This cluster consists of two works, Shin et al. [38] and Ferreira et al. [40], which use co-authorship information for disambiguation. This similarity explains their proximity in the heat map and justifies the high co-citation of the cluster. However, their computational approaches to disambiguation differ. The work of [38] uses an approach based on graphs constructed from co-authorship relations to solve the AND problem using the DBLP and Arnetminer databases. According to the taxonomy, we can classify the type of approach as Author Grouping and the explored evidence as Citation Information and Web Information.

The work presented by Ferreira et al. [40] uses a three-step approach for AND. First, the citations are clustered using a heuristic based on co-authorship. In the second step, some of these clusters are selected, based on similarities, to become training data. In the third step, the selected clusters are fed into an associative name disambiguator with self-training capabilities. According to the taxonomy, this work is classified in the type of approach as Author Assignment and in the explored evidence as Citation Information and Web Information (i.e., data extracted from DBLP and BDBComp).

Cluster 2

While neither of the two works in this cluster suggests a direct solution for the AND problem, they do provide strategies that help resolution approaches. The cluster appearance is justified by their high co-citation in the literature, serving as a basis for other studies. Kim et al. [42] introduce a method for generating labeled data to compose machine learning approaches. With test runs, the proposal achieved high performance compared to works in the literature. Kim [43] implements a framework integrating five validation measures for AND approaches using clustering. This integration may help scholars in the AND area to compare the similarities and differences of the various validation measures before selecting the ones that best characterize the clustering performances of their AND methods.

Cluster 3

The authors in [35] present a brief literature review focusing on the definition and challenges of the AND problem. Ferreira et al. [11] conducted a literature review with approaches to AND resolution, suggesting a taxonomy for classifying these approaches. We observed that the two reviews are close compared to the whole heatmap, evidencing a large co-citation of these papers in the studied databases.

Two other works propose approaches to AND. First, Levin et al. [45] present a self-supervised algorithm that uses bootstrap techniques for clustering and a supervised training algorithm. The work uses information from the authors’ citations and other attributes such as email, authors’ names, and language. We classify this work in the type of approach as Author Grouping and the explored evidence as Citation Information.

The work presented by [46] uses a heuristic-based approach for AND with similarity functions of authorship evidence records extracted from DBLP and BDBComp. According to the taxonomy, the type of approach is Author Grouping and explored evidence Citation Information.

Cluster 4

This cluster contains two papers by the same authors. The first one [36] presents a probabilistic approach, named Authority, to solve the AND problem in the MEDLINE [32] database using information such as title, journal name, co-authorship, language, and other features. Authority computes the similarity between two articles by analyzing the authors’ names and emails. The model also presents ways of automatically generating training sets, methods to estimate the probability between author names, and an agglomerative clustering algorithm based on maximum likelihood to compute clusters of articles that represent the authors studied.

The second work [34] also uses a probabilistic model for AND, but it uses only authors’ names, discarding other information such as email addresses and affiliations. Thus, it is evident that the 2009 work is an evolution of the 2005 one. According to the taxonomy, both papers use Author Grouping as the approach type and Citation Information as the explored evidence. The other co-citation clusters presented by the heat map in Fig. 8 did not show patterns detected by this study.

Figure 9 presents another co-citation analysis, using cluster density with clustering among all the co-cited papers, which allows insight into other similarities among the documents in each group. Thus, we can analyze the other works that are not so evident in Fig. 8. Note there are three general clusters: green, blue, and red. A common characteristic is the time span shared by the documents of a cluster. The red cluster has papers from 2005 to 2010 and the blue one from 2009 to 2015. This temporal feature does not appear in the green cluster, as it is more diverse, with documents from 2004 to 2019. Note there are works on the cluster edges, uniting groups based on the date characteristic, such as [11], which links the blue and red clusters.

The green cluster is quite diverse as there are various types of approaches, such as cognitive maps and network analysis [47], probabilistic models [48], heuristic-based models [49], agglomerative hierarchical clustering [50], and supervised learning [51, 52].

The red cluster covers works using clustering approaches [53,54,55]. The study conducted by [56] investigates the influence of co-authorship attributes for solving the AND problem. The blue cluster groups research using cluster similarity and agglomerative clustering for AND [57, 58]. In contrast, Strotmann and Zhao [59] present research that indicates the influence of AND on citation and bibliographic base analysis studies.

Fig. 9 Co-citation clusters under cluster density analysis

3.2.2 Bibliographic coupling analysis

The bibliographic coupling analysis allows insights into the current state of the AND research area. We present in this section the AND research fronts, including works from 2019 to 2022. The works are classified considering the AND approach using the taxonomy presented in Fig. 1.
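As a minimal illustration of the measure behind this analysis, the coupling strength of two works is commonly taken as the number of references they share. The Python sketch below computes it for a toy collection. The function name `coupling_strength` and the identifiers in `references` are hypothetical, and the sketch does not reproduce the normalization or layout applied by a tool such as VOSviewer.

```python
from itertools import combinations


def coupling_strength(references):
    """Return {(work_a, work_b): number of shared references}.

    `references` maps each work identifier to the list of works it cites.
    Pairs with no shared reference are omitted.
    """
    strengths = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(set(references[a]) & set(references[b]))
        if shared:
            strengths[(a, b)] = shared
    return strengths


# Hypothetical toy data: three works and the references they cite.
references = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3", "R4"],
    "P3": ["R5"],
}

print(coupling_strength(references))  # {('P1', 'P2'): 2}
```

Works that are strongly coupled in this sense tend to be drawn close together in a coupling map, which is why coupling is read as a signal of shared research fronts.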

Figure 10 presents a bibliographic coupling of works using a heat map for the merged WoS and Scopus databases. Note there are three explicitly numbered clusters, highlighting the current AND research fronts. In the sequence, we present a summary of the works included in the three clusters.

Fig. 10 Bibliographic coupling in the WoS and Scopus merged databases. The circular dotted lines with explicit numbered labels indicate each cluster

Cluster 1

The authors in [39] used a graph node embedding approach to solve the AND problem. This type of solution is inspired by the word embedding model but adapted for a graph structure solution. A graph is constructed with co-authorship relationships, using the random walk method for learning graphs and assigning clusters to unique people in the real world. The approach used CiteSeerX data with results improved compared to similar approaches.
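The random-walk step of such a graph embedding approach can be sketched as follows. This is an illustrative example under stated assumptions, not the implementation of [39]: the toy co-authorship graph, the function name `random_walks`, and its parameters are hypothetical. The generated walks play the role of sentences that a skip-gram style model would later turn into node vectors; that training step is omitted here.

```python
import random


def random_walks(graph, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks over an undirected graph.

    `graph` maps each node to the set of its neighbors. Each walk starts
    at a node and repeatedly moves to a uniformly chosen neighbor.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(graph[walk[-1]])
                if not neighbors:  # isolated node: the walk stops early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical co-authorship graph: an edge means two authors share a paper.
coauthors = {
    "author_a": {"author_b", "author_c"},
    "author_b": {"author_a"},
    "author_c": {"author_a", "author_d"},
    "author_d": {"author_c"},
}

walks = random_walks(coauthors, walk_length=5, walks_per_node=2)
print(walks[0])
```

Nodes that co-occur often in the walks end up with similar vectors after embedding, so a clustering step can then group the records that likely belong to the same real-world person.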

The authors in [41] propose a hybrid pairwise classification method for estimating the probability that an author record is correct in a bibliographic repository. This solution uses global features extracted from text using supervised training on a dataset of an author’s citations. The text classification and the supervised training use word embedding methods such as Bag of Words and TF-IDF with data from PubMed and ArnetMiner. According to the taxonomy, [39, 41] can be classified as Author Grouping approaches because they use similarity calculation with training and machine learning. Both works use word embedding as the basis for the AND method, which explains the proximity of the studies observed in the heat map presented in Fig. 10.
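To make the TF-IDF representation concrete, the sketch below builds TF-IDF vectors for a few publication titles and compares them with cosine similarity, a value that could serve as one pairwise feature for a classifier. It is a hedged illustration and not the method of [41]: the titles, the function names, and the smoothed IDF variant used here are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical publication titles.
titles = [
    "author name disambiguation in digital libraries",
    "name disambiguation with graph embedding",
    "protein folding with deep learning",
]
vectors = tfidf_vectors(titles)
print(cosine(vectors[0], vectors[1]))  # shares "name" and "disambiguation"
print(cosine(vectors[0], vectors[2]))  # no shared term
```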

Cluster 2

In [33], the authors create a knowledge graph with information from the PubMed repository, extracting bio-entities from abstracts. In this work, the authors do not propose a new approach to solving the AND problem. Instead, they combine approaches already known in the literature, such as Authority (uses a graph approach) and Semantic Scholar (uses a binary training classifier to join pairs of author names and create author clusters). The constructed knowledge graph allowed the creation of links between biological entities, articles, authors, and affiliations. The AND step achieved an F1 score of 98.09%. We can classify the approach of this work as Author Grouping and the explored evidence as Citation Information and Web Information.

Cluster 3

The work of [44] created an automatic system for data availability declarations in bibliographic repositories using PubMed. The authors compare first and last names with string similarity techniques. It appears in the heat map of the literature because it cites several influential AND works.

With the bibliographic coupling analysis, we note the current use of established AND techniques based on Author Grouping and clustering methods. The first article found in the dataset of this review dates back to 2003. Since bibliographic coupling seeks to identify current research fronts, works from 2019 to 2022 were selected for the bibliographic coupling analysis. An analysis was also conducted to classify the most recent articles (2020–2022) according to the taxonomy, as presented in Table 8 and Fig. 11.

Based on the co-citation analysis performed across the studied time span (2003–2022), the works mainly use author grouping approaches. This result is consistent with the findings of [4]. Moreover, the bibliographic coupling analysis points to viable alternatives to address AND-related problems, indicating a wide range of research in the area.

3.2.3 Overview of AND publications

In this section, we present an overview of recent AND works published between 2020 and 2022, arranged by the taxonomy of Fig. 1 [4, 11]. The taxonomy is also used in the co-citation (Sect. 3.2.1) and coupling analysis (Sect. 3.2.2) sections. This time frame was chosen as earlier reviews covered works until 2019 [11, 12]. The coupling analysis includes some works presented in this section.

The overview of current research works is essential to complete the meta-analytic approach by citing the particular techniques, strategies, and emerging themes within the AND research area. To organize the overview concisely, we guided it by the book devoted to the study of AND in bibliographic repositories [4]. We present a synopsis of each work included in Table 8. The synopses present the AND approaches, with author grouping, mainly through agglomerative clustering, standing out as the prevalent method in recent studies.

The author grouping approach, as defined in Sect. 1, is especially appropriate for datasets with a large amount of co-authorship data. Large bibliographic datasets can benefit from this approach because, unlike the author assignment approach, it does not depend on time-consuming manual labeling of author annotations. Agglomerative clustering is a common choice in author grouping as it is simple to use and can produce hierarchical clusters to be analyzed at different granularities. This clustering method provides flexibility when creating author groups. Figure 11 shows how the works relate to each other in the used taxonomy.
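The granularity argument can be illustrated with a small agglomerative clustering sketch. It is a simplified example under stated assumptions rather than any reviewed method: the records, the average-linkage rule, the Jaccard similarity over co-author sets, and the thresholds are all hypothetical choices. Lowering the threshold lets the merging proceed further and yields coarser author groups.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agglomerative_clusters(items, similarity, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per item and repeatedly merges the most
    similar pair of clusters until no pair reaches `threshold`.
    """
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None  # (score, index_i, index_j) of the best pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [similarity(a, b) for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical records sharing one ambiguous author name; each record is
# described only by the set of its other co-authors.
records = {
    "p1": {"A. Silva", "B. Chen"},
    "p2": {"B. Chen", "C. Rossi"},
    "p3": {"A. Silva", "B. Chen", "C. Rossi"},
    "p4": {"D. Kim"},
    "p5": {"D. Kim", "E. Novak"},
}


def record_similarity(a, b):
    return jaccard(records[a], records[b])


fine = agglomerative_clusters(list(records), record_similarity, threshold=0.6)
coarse = agglomerative_clusters(list(records), record_similarity, threshold=0.4)
print(len(fine), fine)
print(len(coarse), coarse)
```

In this toy case the stricter threshold keeps four groups, while the looser one merges the records into two candidate authors. No labeled data is involved, which is the property that makes author grouping attractive for large repositories.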

We identified a set of five works using tree-based learning models, such as Gradient Boosting, Random Forest, and Decision Tree [60,61,62,63,64]. These works may combine the mentioned models with other Machine Learning techniques, such as Naive Bayes [62], Logistic Regression [60, 61], and Network Graphs [63].

Table 8 Classification of papers from 2020 to 2022 according to the taxonomy of Fig. 1 [4, 11]
Fig. 11 Sankey diagram of approaches used for AND from 2020 to 2022

Some works address distinct supervised techniques, such as [10], which uses transfer learning. The authors in [65] show that ORCID can validate the performance of supervised AND methods that use large-scale bibliographic data. The authors in [66] used the DBLP database with a neural network to learn the representations of coauthors and titles so that AND could consider the similarity between these attributes.

Li et al. [67] present an algorithm with multiple similarity strategies for AND, implemented using collaboration network calculations and the affiliation and publication attributes of authors. Another multi-strategy approach is presented by [68], using string comparison with Jaccard similarity, Levenshtein distance, and co-authorship network comparison. Waqas and Qadir [69] propose a multilayer heuristic with a clustering approach. The clustering uses attributes inherent to the author and publication, such as title, abstract, keywords, email, and affiliation. The Word2Vec word embedding is used to extract the attributes.
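The two string measures named above can be sketched in a few lines of Python. The code is a generic illustration, not the implementation of [68]: the function names, the length normalization in `name_similarity`, and the example names are assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Levenshtein distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a, b):
    """Jaccard similarity between two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(levenshtein("kitten", "sitting"))         # 3
print(name_similarity("J. Smith", "J. Smyth"))  # 0.875
print(jaccard({"A. Silva", "B. Chen"}, {"B. Chen", "C. Rossi"}))
```

A multi-strategy method would typically combine such scores, for example a name similarity with a Jaccard overlap of co-author sets, before deciding whether two records refer to the same author. How the scores are weighted is specific to each work and is not shown here.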

D’angelo and van Eck [70] use a rule-based scoring approach with author, publication, citation, and institution attributes. Clusters with meta-data allow indexing of a particular author to disambiguate. Zhang et al. [71] use heuristic rules combined with neural networks to analyze publication attributes, such as title and affiliation. An advantage of this method is the possibility of extending the method’s application to other datasets.

Mozafari [72] proposes a genetic algorithm for determining the similarity coefficient between two authors for AND. The algorithm determines the importance of the attributes in the publications, electing an optimal coefficient for comparison between authors.

Jinqi et al. [73] propose an algorithm to put entities and resources into a network graph to set the resource node capacity-based sharing degree. The network graph uses relationships between the author and publication nodes to calculate the flow capacity between nodes which allows clustering of the graph.

Zhang and Ban [64] use publication relationships to construct the graph, with the strongly related publications grouped, forming atomic clusters and reducing the graph size. At another stage, a rule-based similarity algorithm analyzes and combines the feature information from the publication graph to perform AND.

Zhou et al. [74] present an approach with five graphs formed by publication attributes, co-authorship, location, title, keywords, and affiliation. Each attribute creates the node where the edges are the similarity weights between publication pairs. A fusion graph of the attributes is built. A random walk algorithm is applied to the graph to determine paths that represent the local node structural information. Then, a multilayer perceptron algorithm is applied to the graph structure.

Santini et al. [93] propose a Knowledge Graph Embeddings (KGE) approach using information from the AMiner database. The approach has three parts: multimodal information extraction from the KGE, a blocking procedure, and hierarchical agglomerative clustering. Qiping et al. [92] use citation information to construct a heterogeneous information network. Representation learning is applied to cluster the authors for disambiguation, and cluster analysis with rule matching is then performed.

Ma et al. [76] propose incorporating a Word2Vec model into a Graph-Based approach. The algorithm extracts attributes and the relationships between publications, authors, and co-authors. Word2Vec serves to obtain these features, allowing the insertion of other features that may appear in the dataset. Subsequently, a graph with relationships between publications and authors is built. Then, an algorithm for clustering and similarity analysis between nodes and edges is applied.

The works of Pooja et al. present solutions that use a graph-based approach as a basis, associated with other computational techniques (e.g., clustering). The work [80] uses a graph-based clustering approach for AND. Jaccard and Cosine similarity characterize the relationships between authors and publications in the graphs, and Web information refines the results. In [85], the authors use graphs with publication attributes and a Word2Vec embedding model to create vectors that serve as input to hierarchical agglomerative clustering (HAC), which is widely used in AND studies. In [9], the authors use a graph-based approach configured with multi-hop neighborhoods and apply HAC for AND in the final step of the algorithm. The work in [95] uses graphs to build the network of authors and publications and uses clustering for disambiguation. The differential of this last approach is the ability to work with online information from digital bibliographic repositories.

The authors in [75, 79, 82, 84] use word embedding, graphs, and clustering. Ma et al. [77] use the same approach applied to robotic literature consultants.

The work of [78] proposes a technique with partial classification in three steps to solve the AND problem. The first step uses a probability propagation constraint to infer the distribution of a given author’s name. In the second step, a portion of the author names in the documents is linked to their respective authors when the model exhibits high confidence. In the last step, the initial classification algorithm parameters are updated.

Firdaus et al. propose two methods for AND. In the first work [81], the technique uses four steps for disambiguation: the data are labeled; publication attributes are extracted; classification is performed with a deep neural network, random forest, naive Bayes, and SVM; and the results are validated by comparing the classification techniques. In the second work, Firdaus et al. [87] extend the deep neural network classification with cost-sensitive learning, considering cost variation from unclassified data.

Manzoor et al. [90] use convolutional neural networks for unbalanced and balanced dataset classification. According to the authors, the solution is flexible because it learns the attributes without concatenating similarity measures. The method is also Single Citation Based, preprocessing the dataset efficiently and decreasing computational costs.

Farber and Ao [91] do not propose a new approach but use an unsupervised rule-based classification method. The method does not require training data and was adapted for the authors’ proposal.

Correia et al. [83] propose a crowd-systems-based prototype that allows the general public to interact and contribute through the Web, correcting name ambiguities, missing data, and incorrect references in a digital bibliographic repository. Backes and Dietze [89] present a technique for progressive AND with lattice structures for name inclusion. Waqas and Qadir [94] do not propose an AND method but present a dataset to assist developers: the labeled “CustAND” dataset, with 7886 publication records built from DBLP and Google Scholar data.

4 Conclusion

This article presented an AND literature review using the theory of the consolidated meta-analytic approach with the WoS, Scopus, and merged bibliographic repositories. A taxonomy was used to classify AND methods in the reviewed works. With the bibliometric laws of analysis, it was possible to present the most cited papers, authors, countries, organizations, knowledge areas, journals that publish the most documents, and the frequency of keywords, highlighting the evolution of AND from 2003 to 2022.

Summarising the key findings, we note that AND authors publish more in journals than in conferences and book chapters (Table 2). The journal with the largest number of documents is Scientometrics (Table 3). The number of journal and conference publications increases over the years from 2003 to 2022, with the growth accentuated from 2016 (Fig. 3). The Computer Science knowledge area presents the highest contribution in AND with 34%, followed by Social Sciences (\(\approx 14\%\)) and Engineering (\(\approx 11\%\)), which together correspond to more than half of the total works (Fig. 4). The countries that publish the most are the USA (21.3%), China (19.4%), Germany (13.2%), and Brazil (8.5%) (Fig. 5). Considering the number of citations in the merged databases (WoS and Scopus), the USA leads with 1337 document citations, followed by Brazil with 510 and China with 311 (Fig. 6). This result is reinforced by the organizations that regularly publish with more than 20 citations, which include North American and Brazilian ones (Table 6).

During the co-citation and bibliographic coupling analyses, we identified four and three clusters, respectively. In the co-citation, four clusters show that graph, supervised learning, and heuristic-based approaches with probability applications for solving the AND problem are used. Furthermore, co-citation indicates the prevalence and effectiveness of author grouping techniques in current AND literature, particularly in addressing issues associated with large bibliographic databases. The bibliographic coupling indicates current research for AND with word embedding and supervised learning. We note that most of the approaches use AMiner and DBLP as bibliographic bases for information extraction.

By presenting this literature review of the AND panorama, we intend to help researchers direct current studies toward the creation of new techniques to solve the problem. However, the meta-analytic approach used in this literature review presents some limitations. Its focus is an exploratory overview of the research area grounded in bibliometric principles; it does not include a protocol with specific research questions, as a systematic literature review does. Nevertheless, the results of the meta-analytic approach are complementary to systematic literature review methods, adding knowledge of existing studies in the AND research area. Another limitation is related to the merging of the WoS and Scopus databases, which was done using a Python script since the VOSviewer tool allows only one database at a time.

In future work, we can use new bibliographic databases and complement this study with other literature review approaches, such as a systematic review. Additional WoS and Scopus knowledge areas, such as multidisciplinary sciences, can also be used to enlarge the scope of research in the AND area.