Introduction

The new paradigm of open innovation has made collaboration increasingly necessary to create and disseminate knowledge and to develop technological innovation. Nowadays, organizations rarely carry out innovation in isolation. R&D investments often involve partners, producing a growing interest in collaborative R&D. Many researchers explore the concept of collaborative R&D, investigating the determinants behind the choice of partners (Diestre & Rajagopalan, 2012; Reuer & Devarakonda, 2017), the effects of collaboration on innovation performance (Hoang & Rothaermel, 2005, 2010), the dynamics of collaborative relationships (Tatarynowicz et al., 2015; Jakobsen et al., 2019), and many other facets of this phenomenon.

The availability of reliable data is necessary to obtain valuable results providing relevant policy implications. Nowadays, digital technologies make precise data gathering possible. However, scholars and practitioners usually recommend data cleaning and wrangling procedures to guarantee the usability of data (Endel & Piringer, 2015). Specifically, in a collaborative environment we need to avoid attributing the wrong information to different entities. Indeed, in many cases, working directly with original data can produce misleading results, since the same organization can appear under multiple labels, with its information then split among multiple entities. The most common reasons why various labels (i.e., the strings by which an organization is called) address the same organization are different languages, abbreviations, acronyms, punctuation, periphrases, linguistic equivalences providing a different order of words, and misspellings. At the same time, we must be careful not to aggregate the information of different organizations under the same entity.

In this work, we propose a novel methodology to disambiguate organization names, based on identifying correspondences between any organization and its possible distinct labels in a dataset, accounting for the most frequent sources of errors mentioned above. In more detail, we aim to standardize the possible ways of addressing the same organization in a dataset by determining equivalences between labels. At the end of the procedure, each organization is ideally associated with a distinct equivalence class. For instance, consider the two distinct labels "University of Rome La Sapienza" and "Sapienza Univ of Rome". Both refer to the same organization. The designed methodology should assign both labels to the same equivalence class (i.e., the two labels are equivalent), such that, in the end, the organization "Sapienza University of Rome" univocally corresponds to a distinct equivalence class. Previous studies have shown that a fully automated procedure may be inefficient, and that a manual inspection should be integrated. However, to reduce the manual part, the automated one must be as efficient as possible. In this respect, accurately pre-processing labels can ease the automatic identification of ambiguous cases. Different measures of string similarity have been tested in the literature, but there is no convergence on one method over the others. In any case, the selected method must be justified, and its advantages over the others must be presented for the specific case study. Finally, the methodology must be as generic and flexible as possible, to adapt to different contexts and datasets. Our approach combines features of supervised and unsupervised methods for organization name disambiguation. In particular, it draws elements from studies on organization name disambiguation and affiliation disambiguation to build a "hybrid" methodology that is neither fully automated nor completely manual. The proposed methodology also includes a thorough pre-processing phase. Moreover, we provide evidence in favor of selecting a similarity measure based on "cosine" distance to address the primary sources of errors we encountered. However, the methodology is designed to adapt easily to many datasets related to different research fields where high-quality data is missing. Thus, the flexibility and potential scalability of the methodology make this work a worthwhile contribution both to the scientific literature and in a policy context.

In order to show the effectiveness and efficiency of the proposed methodology, we provide an empirical application to the datasets of organizations taking part in European projects funded by the first three EU Framework Programmes (FPs). Data about participants in projects funded by the EU FPs represent one of the most interesting case studies in collaborative R&D at the European level. Nevertheless, raw datasets about EU FPs are not free from the above-mentioned errors.

Roediger-Schluga and Barber (2008) tackle this issue, disambiguating participant organizations through a labour-intensive procedure relying on extensive manual inspection of participants. In this way, the authors obtain a novel data source of higher quality than the original data, known as the "EUPRO database". Indeed, most studies dealing with European projects benefit from the EUPRO database, performing their analyses on this data source (Paier & Scherngell, 2011; Scherngell & Barber, 2011; Hoekman et al., 2013; Scherngell & Lata, 2013; Lepori et al., 2015; Crespo et al., 2016; Heringa et al., 2016; Uhlbach et al., 2017; Wanzenböck et al., 2020; Cavallaro & Lepori, 2021). However, our intent is not to present an alternative dataset to the EUPRO database. Rather, we aim to show the validity of the proposed methodology by comparing the dataset returned by our procedure to the EUPRO database, thus considering it as a benchmark. In particular, we measure several network properties of the collaborative networks generated from the different sets of data (i.e., the raw dataset, the refined dataset obtained through the application of the methodology, and the EUPRO database). Networks are indeed one of the most common tools to represent collaborative relationships, and the analysis of their structure sheds light on the research process from a systemic point of view. We chose this case study precisely because it offers a reliable benchmark against which to test the quality of our methodology. Notice that this is a conservative approach to evaluating the accuracy of the methodology, since in the EUPRO database organizations are disambiguated by integrating both linguistic and entity-based criteria (e.g., aggregating subsidiaries).

The paper is organized as follows. An overview of related works and existing methods addressing similar problems is reported in “Related works and approaches” section. “Proposed methodology” section describes the proposed methodology. “Application: the first three EU Framework Programmes” section introduces the case study, providing a general overview of the EU FPs and the customization of the methodology. “Results” section reports the results obtained by applying the methodology to the case study. Our main contributions and possible further developments are summarized in “Discussion and conclusions” section.

Related works and approaches

Identifying correspondences between the organizations and their possible distinct labels in a dataset is a critical problem in scientometrics literature, known as “organization name disambiguation”, “institution name disambiguation”, or “company name disambiguation”. In particular, the higher the amount of available information, the worse the problems due to misspellings, linguistic differences, and name changes (Wang et al., 2012). The approaches proposed in the literature to investigate this issue are similar to those addressing other topics, such as the disambiguation of patent applicants and inventors (Li et al., 2014; Balsmeier et al., 2015; Morrison et al., 2017; Yin et al., 2020), and author name disambiguation (Wu & Ding, 2013; Shin et al., 2014; Amancio et al., 2015; Santini et al., 2022).

In their study, Jonnalagadda and Topham (2010) focus on the biomedical domain, proposing a system to extract organization names from the affiliation strings listed in PubMed abstracts and normalize them to a canonical name. Although their dataset includes mainly English citations, they first pre-process institution names by removing blocking words and special characters. They then cluster organization names based on the distance between the centroid (i.e., the organization name closest to the other names included in the same cluster) of existing clusters and any new name entering the dataset, using the Levenshtein distance at the word level. At this point, they perform a recalculation phase, considering the obtained dataset as an undirected graph whose nodes are the identified clusters. The authors then look for connections among different nodes based on the value of the Extended Smith-Waterman Score. Finally, they clean the dataset manually, repeating the recalculation step. A similar pre-processing of organization names is performed in Rimmert et al. (2017) where, before allocating publications to the respective main institutions, the authors design a standardization procedure in which specific patterns are replaced by standardized forms (e.g., "Universitat", "University", "Universidad", and "Universität" are replaced by "UNIV").

Huang et al. (2014) adopt a rule-based approach to disambiguate institution names. Their methodology relies on a three-step procedure, starting with the extraction of author and institution names from bibliographic databases. Then, they build an author-institution table, where each block is screened to disambiguate institution names. The second step consists of a set of distinct rules applied according to different criteria, including the computation of shared words, the Jaccard distance, and the Jaro-Winkler algorithm. Finally, they cluster blocks based on the number of common subsets identified.

Zhang et al. (2012) are interested in finding tweets related to a given organization. They propose an adaptive method for organization name disambiguation based on Twitter information and external web sources. Specifically, the authors develop a general classifier with the training data and different web sources. Then, they use the general classifier to label unlabeled tweets of a specific organization. In turn, they train an adaptive classifier for the given organization with additional information from these tweets. Other interesting works address the issue of company name disambiguation on Twitter. Muñoz et al. (2012) introduce a new unsupervised approach to computing the similarity between the content of a tweet and the company profile. Similarly, Spina et al. (2013) propose a method to determine the connection between a company and a set of tweets, based on the identification of filter keywords. In particular, they show that filter keywords can efficiently attribute tweets to the right company.

Some other relevant contributions can be found in the literature on "affiliation disambiguation". This is a slightly different problem, concerning the attribution of the right institution to the authors in bibliographic databases; indeed, it crucially relies on author names before institution names are disambiguated. Nonetheless, some interesting aspects can be highlighted for our work. In order to address the problem of affiliation disambiguation, Jiang et al. (2011) develop an effective clustering method based on the normalized compression distance (NCD). According to this method, at each step the algorithm finds the most similar pair of clusters in the current clustering set based on their NCD (all distinct affiliations in the dataset represent the first clustering set). One of the main peculiarities of this approach is the ex-ante definition of the desired number of clusters to match different affiliations. Cuxac et al. (2013) propose two approaches to deal with affiliation disambiguation in bibliographic databases. The former exploits a supervised learning approach based on a Naive Bayes algorithm wherever a manually analyzed reference dataset is available. The latter is employed without a learning data source and consists of a semi-supervised approach, mixing clustering techniques and Bayesian learning. Although it is efficient, the authors state that this approach encounters some difficulties when processing highly unbalanced data. A combination of manual and automated steps is included in Rimmert (2018), which provides evidence about the reliability of Wikidata for disambiguating author addresses.

Finally, an interesting contribution comes from Akbaritabar (2021), who compares different techniques for publication author disambiguation by analyzing the structure of the related co-authorship networks.

Alongside scientific studies, we also consider two of the most popular existing tools and practices developed to address the same issues at the core of this paper. The first one is the ROR OpenRefine Reconciler. This service does not require any coding. It matches organization names in a project to ROR (Research Organization Registry) IDs through the ROR REST API. Starting from a file with a column related to organization names, the ROR OpenRefine Reconciler proposes a series of possible matchings that must be accepted or rejected by the user. One can also search manually for matchings that the tool does not identify directly. The second one is the AIDA system (Yosef et al., 2011), an online service aimed at matching mentions (i.e., entity names) in an input text with candidate entities in a reference database, following a graph-based approach. More specifically, a graph is constructed whose nodes correspond to mentions and candidate entities, while edges can be either between mentions and entities or between two entities. In particular, edges are weighted based on the similarity between the context in which the mentions appear (i.e., the input text) and keyphrases retrieved from Wikipedia articles related to the candidate entities. The outcome is a dense sub-graph where each mention is linked to a unique candidate entity.

After introducing our methodology in "Proposed methodology" section, we will point out its main similarities to, and differences from, the most important approaches described in this section.

Proposed methodology

In general, given a dataset of organizations, each label is associated initially with a distinct organization. However, as discussed in “Introduction” section, raw data can contain multiple ways in which the same organization is called, i.e., multiple labels can be equivalent. The rationale behind the proposed methodology is illustrated in Fig. 1.

Fig. 1 Rationale of the proposed methodology: \(l_i\) is the generic label i contained in the original dataset; \(O_i\) is the generic organization i included in the final dataset

All possible equivalences need to be checked to determine whether the associated labels correspond to the same organization. The proposed methodology is generalized so that the set of possible equivalences can be checked both automatically and by hand. However, in the presence of datasets with the previously mentioned types of errors, it is not recommended to implement a procedure that automatically assigns an outcome to all possible equivalences according to pre-defined criteria, for two main reasons. If the pre-defined criteria are too weak (i.e., the methodology tends to link merely similar labels), there is the risk of assigning wrong labels to the same equivalence class, thus introducing exogenous sources of errors into the final dataset (e.g., "Universitaet Gesamthochschule Siegen" and "Universitaet Gesamthochschule Essen" may be wrongly matched). On the contrary, if the pre-defined criteria are too strong (i.e., the methodology only links nearly identical labels), we risk missing most of the equivalences between labels, thus not obtaining a refined dataset (e.g., "Univ di Roma La Sapienza" and "Sapienza University of Rome" might remain unmatched). Checking equivalences by hand would therefore provide high-quality data while avoiding the risk of introducing such exogenous errors. Nevertheless, for large datasets, checking all possible equivalences between labels by hand rapidly becomes unaffordable. Indeed, the set of possible equivalences consists of all pairwise equivalences between labels, and its cardinality is \(\frac{n(n-1)}{2}\), where n is the number of distinct labels in a dataset. Thus, the number of possible equivalences grows proportionally to \(n^2\).

The proposed methodology aims to considerably reduce the number of equivalences to check by hand, by identifying the couples of labels that are most likely to be equivalent, based on suitably selected criteria. In the following part of the paper, we thus present a "hybrid" methodology that automatically assigns an outcome to most of the possible equivalences, while identifying a small set of equivalences to be assessed by hand, in order to produce a final dataset that is both refined and reliable. All possible equivalences are classified as true (if the two labels are automatically recognized as equivalent), false (if the two labels are automatically recognized as not equivalent), or to be checked (if the equivalence between the two labels needs to be assessed by hand). We also describe how to easily make the methodology fully automated in case the user is confident about this procedure.

We use the software R (R Core Team, 2014) to develop the algorithm behind the proposed methodology. Nonetheless, it can be easily adapted to other programming languages. Before applying the algorithm, two preliminary steps are implemented. First, the original dataset is pre-treated by comparing it with certified lists of organizations that already track name changes and multi-level structures. Then, a pre-processing of labels is needed. This step constitutes a data transformation, such that identifying equivalences between pre-processed labels is more straightforward than dealing with original labels.

Data pre-treatment through certified lists of organizations

Certified lists of distinct organizations are useful in a preliminary phase. Specifically, it is possible to directly disambiguate organization names by matching the distinct labels in the analyzed dataset with the corresponding labels that have already been mapped in other organizational registers. For instance, two notable solutions are the Research Organization Registry (ROR), publicly available from the ROR website, and the OrgReg interface, publicly accessible via the RISIS Core Facility. The use of external registers in the data pre-treatment allows us to address two different issues that may affect a dataset and that cannot be identified by linguistic methods, i.e., name changes and parent affiliations. While determining name changes is relevant independently of the research objectives, linking parent organizations with subsidiaries, or sub-institutes with the main research center, is strictly related to the research goals. Thus, the level of investigation of organizational registers depends on the specific research question, and this step must be designed accordingly.
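For illustration, a minimal R sketch of this direct matching step is reported below; the file name and column layout of the register dump ("id", "name", and a ";"-separated "aliases" field) are assumptions made for the example, not the actual export format of ROR or OrgReg.

```r
# Build an alias -> canonical-name lookup from a (hypothetical) registry dump.
ror <- read.csv("ror_dump.csv", stringsAsFactors = FALSE)

alias_list <- strsplit(ror$aliases, ";")               # aliases assumed ";"-separated
alias_map <- data.frame(
  alias     = trimws(unlist(alias_list)),
  canonical = rep(ror$name, lengths(alias_list)),
  stringsAsFactors = FALSE
)
alias_map <- alias_map[!is.na(alias_map$alias) & alias_map$alias != "", ]

# Direct disambiguation: raw labels registered as aliases are replaced
# by the canonical organization name; unregistered labels return NA.
lookup <- setNames(alias_map$canonical, tolower(alias_map$alias))
lookup[tolower("Massachusetts Inst of Technology")]
```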

Pre-processing of labels

Let us consider a generic dataset, whose rows correspond to different observations, while columns represent different variables. For our purpose, we consider datasets in which one column refers to organization names, so that cells in this column contain labels. Firstly, rows in which labels are not available are removed. Secondly, distinct labels in the dataset are determined. At this point, acronyms are disambiguated through an endogenous method, as reported in Veyseh et al. (2021). Specifically, acronyms are identified in correspondence with labels comprising a string of at least nine characters, associated with the organization name, and another string of at most five characters, supposed to be the acronym. Moreover, the potential acronym is expressed in upper case letters, and either included within parentheses or preceded (or followed) by a dash (-). All potential acronyms are then checked by hand, and they are replaced with the corresponding extended name if the associated label includes the specific acronym as a distinct string, and if the country of the candidate for replacement is the same as the country of the associated label. After acronym disambiguation, all labels are expressed in lower case letters, and every accent and special character is removed from the label, as well as additional white spaces. Furthermore, a series of articles, prepositions, and conjunctions are removed from labels, excluding one-letter words since they may cause problems with acronyms. The same action applies to words contained in specific recurring periphrases. Finally, sets of words with the same meaning are converted into unique equivalent keywords. To this aim, it is also possible to consult the list of abbreviations shared by Clarivate's Web of Science (WOS). For instance, considering the label "The Chancellor, Masters and Scholars of the University of Oxford", the pre-processed label becomes "00uni00 oxford".
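As an illustration, the following R sketch reproduces the spirit of this pre-processing on the example above; the stopword, periphrase, and keyword lists are small illustrative subsets, not the full lists used in our application.

```r
library(stringi)

preprocess <- function(labels) {
  x <- stri_trans_general(labels, "Latin-ASCII")  # strip accents
  x <- tolower(x)                                 # lower case letters
  x <- gsub("[[:punct:]]", " ", x)                # drop punctuation and special chars
  # Illustrative subset of the stopword and periphrase word lists:
  drop <- c("the", "and", "of", "for", "della", "degli", "di",
            "chancellor", "masters", "scholars")
  x <- gsub(paste0("\\b(", paste(drop, collapse = "|"), ")\\b"), " ", x)
  # Convert synonymous words into unique equivalent keywords:
  for (w in c("university", "universita", "universitaet", "universidad", "univ"))
    x <- gsub(paste0("\\b", w, "\\b"), "00uni00", x)
  gsub("\\s+", " ", trimws(x))                    # remove additional white spaces
}

preprocess("The Chancellor, Masters and Scholars of the University of Oxford")
# "00uni00 oxford" (under these illustrative lists)
```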

At the end of this step, we obtain m distinct pre-processed labels. From now on, when mentioning labels, we refer to distinct pre-processed ones.

Application of the algorithm

The algorithm to identify equivalences between labels is designed as a combination of three different methods. Each method alone behaves as follows:

  • The first method relies on algorithms we develop to compute the number of common words between any pair of labels. Then, all possible equivalences between any two labels are classified based on comparing the computed number of common words and a pre-fixed value.

  • The second method relies on algorithms we develop to compute the number of consecutive common characters between any pair of labels. Then, all possible equivalences between any two labels are classified based on comparing the computed number of consecutive common characters and a pre-fixed value.

  • The third method implements software packages and functions that calculate a similarity score between any pair of labels. A detailed definition of the similarity score is proposed in “Application: the first three EU Framework Programmes” section, where the application to the European projects is carried out. In general, the considered similarity score ranges between 0 (complete dissimilarity) and 1 (perfect similarity) (Van Der Loo, 2014). Then, all possible equivalences between any two labels are classified based on the comparison between the computed similarity score and a pre-defined threshold.

Specifically, the algorithm implements either the first or the second method, depending on specific conditions described below, with the similarity score computed at the end of the procedure. In more detail, the algorithm first evaluates equivalences according to either the number of common words or the number of consecutive common characters; then, equivalences are ultimately classified based on their similarity score.
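To fix ideas, the sketch below gives one possible R realization of the three measures for a pair of labels. The first two functions are simplified stand-ins for the algorithms we developed, while the similarity score relies on the stringdist package (Van Der Loo, 2014); the example labels are illustrative.

```r
library(stringdist)

# Number of distinct words shared by two labels.
common_words <- function(a, b) {
  length(intersect(strsplit(a, " ")[[1]], strsplit(b, " ")[[1]]))
}

# Number of consecutive common characters, i.e., the length of the
# longest common substring (simple dynamic-programming version).
consecutive_common_chars <- function(a, b) {
  ca <- strsplit(a, "")[[1]]; cb <- strsplit(b, "")[[1]]
  m <- matrix(0L, length(ca) + 1, length(cb) + 1)
  for (i in seq_along(ca)) for (j in seq_along(cb))
    if (ca[i] == cb[j]) m[i + 1, j + 1] <- m[i, j] + 1L
  max(m)
}

# Similarity score in [0, 1] based on q-grams (see Eq. (1) below).
similarity <- function(a, b) stringsim(a, b, method = "cosine", q = 1)

common_words("00uni00 roma sapienza", "sapienza 00uni00 roma")   # 3
consecutive_common_chars("00uni00 siegen", "00uni00 essen")       # 8
similarity("00uni00 roma sapienza", "sapienza 00uni00 roma")      # 1
```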

The proposed methodology also expects the definition of a vector of control variables. We define a "control variable" as a variable included in the analyzed dataset (different from the column with labels) related to a characteristic of an organization (e.g., country, region, business sector, activity type). Employing such variables improves the algorithm's reliability when comparing two labels. To this end, suppose that both the number of common words (or, alternatively, the number of consecutive common characters) and the similarity score between two labels exceed the respective pre-defined thresholds, so that the algorithm would classify the related equivalence as true. Also considering a vector of control variables makes the algorithm more robust, avoiding the matching of labels that refer to different organizations. For instance, the labels "agricultural 00uni00 athens" and "agricultural 00uni00 auw" may be classified as equivalent if we take into account only the number of consecutive common characters and the similarity score between them. However, if we consider their country as a control variable, the algorithm classifies this equivalence as false, since the two labels refer to a Greek and a Dutch organization, respectively.

The entire logical process followed by the algorithm when comparing two distinct labels is reported below.

  1. Compute the number of characters in the two labels. If they are both greater than 3, then go to Step 2; otherwise, the equivalence is classified as false.

  2. Compute the number of words in the two labels. If at least one of them has 1 word, then go to Step 3; otherwise, go to Step 4.

  3. Compare the vectors of control variables related to the two labels. If they are equal, then go to Step 5; if variables are not available for at least one label, then go to Step 6; if they differ, the equivalence is classified as false.

  4. Compare the vectors of control variables related to the two labels. If they are equal, then go to Step 5; if variables are not available for at least one label, then go to Step 7; if they differ, the equivalence is classified as false.

  5. Compute the number of consecutive common characters between the two labels. If it is greater than 3 (equal to 3 for labels with three characters), then go to Step 8; otherwise, the equivalence is classified as false.

  6. Compute the number of common words between the two labels. If it is greater than 0, then go to Step 8; otherwise, the equivalence is classified as false.

  7. Compute the number of common words between the two labels. If it is greater than 1, then go to Step 8; otherwise, the equivalence is classified as false.

  8. Compute the similarity score between the two labels. If it is greater than or equal to a pre-defined low threshold, then go to Step 9; otherwise, the equivalence is classified as false.

  9. If the similarity score between the two labels is greater than or equal to a pre-defined high threshold, then the equivalence is classified as true; otherwise, the equivalence is classified as to be checked by hand.

Notice that when we introduced the method of computing the similarity score between two labels, we specified that the similarity score needs to be compared to a pre-defined threshold. In the algorithm above, we compare it to two different thresholds at Step 8 and Step 9, called the "low threshold" and the "high threshold", respectively. These last two steps make the proposed methodology "hybrid", since equivalences between labels whose similarity score falls between the low and the high threshold need to be ultimately assessed by hand. In order to considerably reduce the number of equivalences to check by hand while not compromising the reliability of the final dataset, the two thresholds should be fixed appropriately, neither too distant from nor too close to each other (we provide an example in "Application: the first three EU Framework Programmes" section). It is also important to note that, if Step 9 is removed, the algorithm becomes fully automated. This last scenario is compared to the hybrid one, i.e., the proposed methodology, in Fig. 2.
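A compact R sketch of this decision logic, reusing the helper functions from the sketch above, could look as follows; it is a simplified rendering of Steps 1-9, where `cv_a` and `cv_b` are the control-variable vectors of the two labels (NA when unavailable).

```r
classify_pair <- function(a, b, cv_a, cv_b, low = 0.94, high = 0.99) {
  # Step 1: both labels must have more than 3 characters.
  if (nchar(a) <= 3 || nchar(b) <= 3) return("FALSE")
  # Step 2: does at least one label consist of a single word?
  one_word <- min(lengths(strsplit(c(a, b), " "))) == 1
  cv_known <- !anyNA(cv_a) && !anyNA(cv_b)
  # Steps 3-4: differing control variables classify the pair as false.
  if (cv_known && !identical(cv_a, cv_b)) return("FALSE")
  if (cv_known) {
    if (consecutive_common_chars(a, b) <= 3) return("FALSE")   # Step 5
  } else if (one_word) {
    if (common_words(a, b) < 1) return("FALSE")                # Step 6
  } else {
    if (common_words(a, b) < 2) return("FALSE")                # Step 7
  }
  s <- similarity(a, b)
  if (s < low)   return("FALSE")   # Step 8
  if (s >= high) return("TRUE")    # Step 9
  "CHECK"                          # to be assessed by hand
}

classify_pair("00uni00 roma sapienza", "sapienza 00uni00 roma", "IT", "IT")
# "TRUE": same country, long common substring, similarity score of 1
```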

Fig. 2 Definition of thresholds to classify equivalences between labels based on the similarity score: hybrid vs. fully automated

For the reader’s convenience, we also report the instructions of the algorithm when comparing two distinct labels through a flowchart in Fig. 3. This scheme illustrates the logical process described earlier. Before showing the flowchart, we need to define a glossary of elements in Table 1.

Table 1 Glossary of variables included in the flowchart

In light of the proposed methodology, in Table 2, we point out the main similarities and differences with the most relevant approaches analyzed in “Related works and approaches” section, in order to highlight the major novelties introduced in this work.

Table 2 Main similarities and differences with key references and services
Fig. 3 Flowchart of the algorithm: diamonds represent conditions assessed at each step; outcomes are "TRUE" if the equivalence is classified as true, "FALSE" if the equivalence is classified as false, and "CHECK" if the equivalence is classified as to be checked by hand. The replication code is available at https://github.com/andreaanconaphd/organization_name_disambiguation_ACV_methodology

Application: the first three EU Framework Programmes

One of the most interesting data sources in the field of collaborative R&D at the European level is represented by the list of projects funded by the European Framework Programmes (FPs). The FPs are multi-annual programmes providing funds mainly to EU member states, but also to associated countries, in order to promote long-term investments in several areas, under the establishment of a systematic Research and Technological Development (RTD) policy at the European level.

In particular, we decided to consider the first three EU FPs, i.e., FP1 (1984–1987), FP2 (1987–1991), and FP3 (1990–1994). This choice is driven by the fact that datasets about the first FPs are the most unbalanced and the least standardized among all FPs. Thus, our methodology is well-suited for these datasets. Data about EU FPs are publicly available on the Community Research and Development Information Service (CORDIS) website. CORDIS is the European Commission's primary source of results about projects funded by the EU FPs, offering a unique and structured repository including information about all projects financed from FP1 to Horizon Europe (i.e., the current FP), and about the related participant organizations.

Nevertheless, these datasets are not free from the errors we described in “Introduction” section, namely:

  • Use of different languages (e.g., “Università degli Studi di Roma La Sapienza” and “University of Rome La Sapienza”).

  • Use of abbreviations (e.g., “Università degli Studi di Roma La Sapienza” and “Univ di Roma La Sapienza”).

  • Use of acronyms (e.g., “Consiglio Nazionale delle Ricerche (CNR)” and “CNR”).

  • Use of different punctuation (e.g., “CEN/SCK” and “CEN - S.C.K.”).

  • Use of periphrases (e.g., “University of Cambridge” and “The Chancellor, Masters and Scholars of the University of Cambridge”).

  • Use of linguistic equivalences providing a different order of words (e.g., "University of Aarhus" and "Aarhus University").

  • Misspellings (e.g., "Telefonica Investigacion y Desarfolio" and "Telefonica Investigacion y Desarollo").

Applying our methodology to the CORDIS datasets allows us to uniquely determine the organizations taking part in one or more European projects funded by the first three EU FPs. In their work, Roediger-Schluga and Barber (2008) introduced a novel data source of higher quality than the CORDIS datasets, accounting for the first five EU FPs (i.e., the EUPRO database). In the following years, the EUPRO database has been integrated and managed as part of a European funded project, called "Research Infrastructure for Science and Innovation Policy Studies (RISIS)". In particular, the EUPRO database has been expanded to include all FPs from FP1 to H2020 (until the end of 2018), as well as other European programmes such as EUREKA, Joint Technology Initiatives (JTI), and European Cooperation in Science and Technology (COST) (Heller-Schuh et al., 2020). Nowadays, many researchers exploit the EUPRO database to analyze the mechanisms of the EU FPs. For this reason, we decided to consider it as a benchmark to test the quality of the proposed methodology. Thus, our intent is not to present an alternative dataset to the EUPRO database. Instead, we aim to show the validity of our procedure by comparing the obtained results to a high-quality dataset. We downloaded CORDIS data on October 1st, 2021. At the same time, we requested and obtained access to the EUPRO database, restricted to the first three EU FPs.

In both datasets (i.e., the raw data from CORDIS and the EUPRO database), each row represents an organization's participation in a project. Thus, the two main columns contain the organization name (what we call the "label") and the project id, while the other columns correspond to variables related to either the organization's or the project's characteristics. We clean the two datasets by removing rows with no organization name. After this process, the raw datasets contain 7,900 participations in FP1, 19,054 participations in FP2, and 31,348 participations in FP3. The EUPRO database contains 7,818 participations in FP1, 19,126 participations in FP2, and 30,732 participations in FP3. Notice that the numbers of participations in the raw datasets and in the EUPRO database almost coincide. Minor differences are due to the fact that CORDIS data about participation in projects (especially the oldest ones) occasionally include unspecified or redundant participations that may have been removed from the EUPRO database.

Collaborative networks

One of the most common ways to represent collaborative relationships is by using networks. Collaborative networks are a popular concept in the innovation (Nieto & Santamaría, 2007; Tsai, 2009) and R&D (Campos et al., 2013; König et al., 2019) literature. According to this representation, distinct organizations correspond to the network nodes, while edges stand for collaborative relationships between them. In our case study, two nodes are connected by an edge if the corresponding participant organizations are partners in the same project. Thus, participants in projects funded by a specific FP are ultimately represented by an undirected, weighted network, where edge weights account for possible multiple partnerships in different projects.
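As a sketch, such a weighted co-participation network can be built in R with the igraph package from a participation table; the data frame `participations` and its column names (`organization`, `project`) are illustrative, not the actual CORDIS field names.

```r
library(igraph)

# participations: one row per (organization, project) pair, as described above.
inc <- as.matrix(table(participations$organization, participations$project))
inc[inc > 1] <- 1                     # count each project at most once per organization

adj <- inc %*% t(inc)                 # shared-project counts between organizations
diag(adj) <- 0                        # no self-loops

g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
vcount(g); ecount(g); mean(degree(g)) # metrics of the kind reported in Table 4
```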

We show the effectiveness of the proposed methodology by reporting the obtained results in terms of popular network metrics. In more detail, we compare the collaborative networks generated from the raw datasets (called "raw networks" from now on) with the collaborative networks built from the refined datasets obtained through the application of the proposed methodology (called "refined networks" from now on). We expect the structure of connections in the two sets of networks to be highly different. In particular, the refined networks should tend to the benchmark model (i.e., the collaborative networks coming from the EUPRO database, called "EUPRO networks" from now on) in terms of structural network properties. To this end, a short glossary of network metrics is introduced in Table 3 (Wasserman & Faust, 1994; Barabási, 2016; Newman, 2018). Moreover, we compare the degree distribution functions (i.e., the empirical distribution function of the nodes' degree) of the refined networks with those of the raw networks. In this way, we assess whether the connections among organizations are distributed more plausibly in the former case than in the latter. More precisely, we verify whether the estimated parameters characterizing the degree distributions of the refined networks get closer to the EUPRO networks' parameters, moving away from the raw networks' ones. Analyzing the connection structure of collaborations through network metrics allows us to investigate the characteristics of the collaborative research process from a systemic point of view. In particular, if the structure of the refined networks is similar to the structure of the EUPRO networks, then collaborations are distributed reliably, and research projects are mapped coherently.

Table 3 Glossary of network metrics

Customization of the methodology

Before showing the results, we need to provide some details about the customization of the proposed methodology for the specific case study. We run the algorithm on the raw datasets downloaded from CORDIS. In more detail, we aggregate data about FP1, FP2, and FP3 into a single dataset to identify equivalent labels across all three FPs.

For what concerns the data pre-treatment, we compare the CORDIS dataset with the Research Organization Registry (ROR), which includes IDs and metadata for 102,742 organizations. ROR is a reliable, completely open register with a particular focus on affiliations of research organizations. Thus, it is the most appropriate source for this aim, given the specific case study. After downloading the dataset from the ROR website, we build a list of equivalences between the distinct organizations and the associated aliases included in the ROR dataset, identifying the ones that can be exploited in our case. Then, we put great effort into the pre-processing of labels. This step is essential to allow the algorithm to easily identify equivalences between labels whose similarity score would be significantly lower if they were not pre-processed. For instance, we convert all possible ways of addressing universities, institutes, centers, and others in the dataset into distinct equivalent keywords. We use the country of organizations as the selected control variable, as it is available for almost all organizations in all three FPs. The similarity score between labels is computed through the "cosine" method, which is based on the \(q\)-grams associated with each label. A \(q\)-gram is a string consisting of q consecutive characters. For instance, the 1-grams associated with the label "CNR" are "C", "N", and "R", while the 2-grams associated with the same label are "CN" and "NR". Specifically, the similarity score between two generic labels x and y based on the "cosine" distance is defined as follows (Van Der Loo, 2014):

$$sim_{\cos}(x, y; q) = \frac{v(x; q) \cdot v(y; q)}{\left\| v(x; q) \right\| \, \left\| v(y; q) \right\|},$$
(1)

where the numerator is the scalar product between two vectors, and the denominator is the product of the standard Euclidean norms of the same vectors. Thus, the similarity score based on the "cosine" distance can be interpreted as a measure of the angle between two vectors. More precisely, \(v(x;q)\) and \(v(y;q)\) are non-negative integer vectors of dimension \(|\Sigma^q|\) (where \(\Sigma^q\) is the whole set of distinct \(q\)-grams associated with x and y), whose components represent the number of occurrences of every possible \(q\)-gram in x and y, respectively. Clearly, we need an order of the elements of \(\Sigma^q\) to define the vectors \(v(x;q)\) and \(v(y;q)\); this order can be chosen arbitrarily. For instance, considering the labels "00uni00 oxford" and "oxford 00uni00", and keeping \(q=1\) as set by default in R, we have \(\Sigma^{1} = \{\)"0", "u", "n", "i", "o", "x", "f", "r", "d"\(\}\). Let us select the ordered 1-grams in \(\Sigma^1\) as the vector ("0", "u", "n", "i", "o", "x", "f", "r", "d"). Then, \(v(\)"00uni00 oxford"\(; 1) = (4,1,1,1,2,1,1,1,1)\) and \(v(\)"oxford 00uni00"\(; 1) = (4,1,1,1,2,1,1,1,1)\), i.e., they coincide. It follows that the similarity score between "00uni00 oxford" and "oxford 00uni00" based on the "cosine" distance is equal to 1, showing that this method is not affected by the order of words when \(q=1\). The latter property is one of the main reasons why the "cosine" distance appears to be the most appropriate to identify correspondences in the considered dataset, compared to other popular distance metrics such as "Hamming" (Hamming, 1950), "Levenshtein" (Levenshtein, 1966), and "Jaccard" (Jaccard, 1901). For instance, the "Hamming" distance requires strings to have the same number of characters; otherwise, it is set equal to infinity. Thus, it is not possible to implement it in our case. The "Levenshtein" distance is based on the number of edits necessary to turn a string into another one, hence it is inefficient in case of misspellings or unnecessary abbreviations (e.g., the similarity score between "55counc55 sup investiga cientificas" and "55counc55 superior investigacion cientifica" based on the "Levenshtein" distance is equal to 0.75, while the similarity score between the same labels based on the "cosine" distance is equal to 0.96). The "Jaccard" distance, instead, is equal to 1 minus the number of shared distinct \(q\)-grams over the number of total distinct \(q\)-grams between two strings. Thus, its value falls rapidly in the presence of additional words associated with the organization name (e.g., "00uni00 liege cedia" and "00uni00 liege", although differing by "cedia" only, have a similarity score based on the "Jaccard" distance equal to 0.70, while the similarity score between the same labels based on the "cosine" distance is equal to 0.94, considering 1-grams for both metrics).
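These comparisons can be checked directly in R; the short sketch below assumes the stringdist package (Van Der Loo, 2014), whose "cosine" and "jaccard" methods implement the \(q\)-gram measures discussed above.

```r
library(stringdist)

# Word order: the two labels share the same 1-gram counts,
# so the cosine similarity equals 1, as computed above.
stringsim("00uni00 oxford", "oxford 00uni00", method = "cosine", q = 1)

# Additional words: cosine degrades slowly, Jaccard drops faster
# (compare with the 0.94 vs. 0.70 values reported above).
stringsim("00uni00 liege cedia", "00uni00 liege", method = "cosine",  q = 1)
stringsim("00uni00 liege cedia", "00uni00 liege", method = "jaccard", q = 1)
```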

Finally, a low and a high threshold are defined according to the similarity scores we obtain by running the algorithm on a small sample of organizations belonging to the dataset. We identify reasonable values such that, in the external intervals represented in Fig. 2, we are confident enough to classify equivalences as true or false automatically. At the same time, the number of equivalences to check by hand is considerably reduced and restricted to pairs of labels that likely correspond to the same organization. Accordingly, we set the low and the high threshold equal to 0.94 and 0.99, respectively.

Results

In this section, we report the main results of applying the proposed methodology to the raw dataset including all participations in projects funded by the first three EU FPs. As introduced in "Collaborative networks" section, we provide empirical evidence about the effectiveness of the methodology by comparing the structural network properties of the raw networks with those of the refined networks, considering the EUPRO networks as benchmark models. Notice that, in this way, we obtain conservative results, since our approach is based on linguistic methods, while in the EUPRO database organizations are disambiguated by also considering entity-based criteria (e.g., aggregating subsidiaries).

We create three different networks for every dataset (i.e., the raw dataset, the refined dataset, and the EUPRO database), each related to a distinct FP. The overall number of distinct organizations in the raw dataset, the refined dataset, and the EUPRO database is equal to 17,105, 12,803, and 10,140, respectively. Notice that in the refined dataset and the EUPRO database distinct organizations are uniquely identified by construction. In the raw dataset, distinct organizations are represented by distinct "label-country" pairs, since the label alone may not be sufficient to determine a distinct organization (e.g., "Ministry of Agriculture" appears for both the Greek and the Swedish institution).

Therefore, applying the methodology considerably reduces the size of the original dataset, because the algorithm recognizes that several labels are equivalent, thus corresponding to the same organization. In this way, the obtained dataset moves away from the raw dataset, getting closer to a high-quality benchmark like the EUPRO database. The results reported in Table 4 highlight the advantages of applying the methodology, which yields a structure of connections in the refined networks very different from the raw networks, moving toward the benchmark models.

Table 4 Results in terms of network metrics

As can be noted, both the number of nodes (i.e., the distinct organizations) and the number of edges decrease in the refined networks compared to the raw networks. This result certifies that there are faulty connections in the raw networks: in many cases, different labels are not recognized as corresponding to the same organization, resulting in distinct nodes, each with its own links. The increase in mean degrees reveals that the information of organizations (in terms of participation in projects) is misallocated in the raw networks and shared among multiple nodes which are, in fact, the same organization. At the same time, the values of density increase in the refined networks, and the global clustering coefficients are much the same as in the EUPRO networks. This means that more cohesive groups emerge after applying the methodology and, specifically, more triadic connections (proxied by the clustering coefficient) are closed, pointing out the absence of important edges in the raw networks. The decrease in average shortest paths and diameters in the refined networks reveals the same aspect. Fewer intermediaries are needed to reach any organization in the networks, showing the identification of relevant paths that, once again, are missing in the raw networks.

A whole picture of the connection structure in the three scenarios is provided by the analysis of the respective degree distribution functions, which are represented in logarithmic scale in Fig. 4.

Fig. 4 Degree distribution functions: FP1 (top), FP2 (middle), FP3 (bottom)

All nine plots show a quasi-log-linear distribution that is well-approximated in network science by a power-law (Barabási, 2016). In a power-law distribution, P(x) is proportional to \(x^{-\alpha}\), where x is a positive number (in this case, it corresponds to the node's degree), and \(\alpha\) is a parameter greater than 0 (Newman, 2007; Clauset et al., 2009). Thus, the parameter \(\alpha\) uniquely characterizes this kind of distribution function, determining the slope of a power-law on a logarithmic scale. We then estimate \(\alpha\) for each degree distribution by fitting a power-law distribution function to our discrete data in R. The method for fitting parametrized models, such as power-law distributions, to observed data is the maximum likelihood method, which provides accurate parameter estimates in the limit of a large sample size (Clauset et al., 2009). The estimated parameters are reported in Table 5. Based on the Kolmogorov-Smirnov test, all values are significant.
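The fitting routine is not named in the text; a minimal sketch, assuming igraph's fit_power_law (a discrete maximum likelihood fit in the spirit of Clauset et al., 2009, which also reports a Kolmogorov-Smirnov goodness-of-fit test), is:

```r
library(igraph)

deg <- degree(g)           # g: one of the collaborative networks built above
fit <- fit_power_law(deg)  # discrete MLE ("plfit" implementation)
fit$alpha                  # estimated exponent, of the kind reported in Table 5
fit$KS.p                   # Kolmogorov-Smirnov p-value for the fitted tail
```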

Table 5 Estimated \(\alpha\) for degree distribution functions

As can be noticed, all EUPRO networks show a lower estimated value of \(\alpha\) for their degree distributions. In a certain sense, this result is in line with the fact that the EUPRO database represents a benchmark. A lower \(\alpha\) corresponds to a smaller slope of the curve, thus a more heterogeneous distribution of connections. Indeed, it is reasonable that misallocations of information have more impact on low-degree nodes than on the hubs of a network (i.e., the nodes with the highest number of connections). Consequently, differences in terms of connections between the two tails of the distribution function are amplified, resulting in a higher slope of the curve in the raw networks. Once again, implementing our methodology allows moving closer to the benchmark in comparison to the raw networks, particularly in correspondence with FP1 and FP2.

Error analysis

As the results summarized in Tables 4 and 5 show, some differences between the refined and the EUPRO networks persist. They are mainly due to the fact that not all organizations can be distinctly determined by identifying equivalences between labels. This is the case for research institutes that appear under their own name in the raw dataset but are considered part of a larger research center in the EUPRO database. There are also cases in which different organizations are involved in merger and acquisition operations, such that only one of the two labels survives in datasets related to subsequent FPs, or the organizations are aggregated under a new name that cannot be matched with previous labels.

In order to analyze more systematically how accurate the obtained results are in comparison with EUPRO, we compute the pairwise-Precision (pP) and the pairwise-Recall (pR) metrics (Kim, 2018) for the aggregated dataset (i.e., including all three FPs). These indices allow us to evaluate the accuracy of the disambiguated data (i.e., the refined dataset) against labeled data (i.e., the EUPRO database) at the pair level. To this aim, we need to identify a one-to-one matching between all labels in the raw data and all standardized labels in the EUPRO database. Since these matchings cannot be determined directly from the two datasets, we develop a methodology to associate each EUPRO label with the original label it is most likely to correspond to. In particular, the only common variable exploitable to automatically match the two datasets is the project record control number (rcn), which uniquely identifies the different projects. Starting from this variable, we reduce the complexity of matching identification by determining only associations between labels that are linked to the same project rcn. Then, pairwise correspondences between original and EUPRO labels are defined based on two criteria. First, if two labels are linked to the same country in the two different datasets, then they become candidates to refer to the same organization. If there are no other labels related to the specific project rcn linked to that country, then the association between the two candidates is confirmed. Otherwise, if there are more labels linked to that country, the association is made by maximizing the similarity score between candidate labels based on the "cosine" distance. Final associations between original and EUPRO labels are ultimately checked manually at a high level. Since residual errors may occur when matching the two datasets, the values of pP and pR might be underestimated. Indeed, they are defined as follows:

$$pP=\frac{|Pairs_{disambiguated}\cap Pairs_{labeled}|}{|Pairs_{disambiguated}|}$$
(2)
$$pR=\frac{|Pairs_{disambiguated}\cap Pairs_{labeled}|}{|Pairs_{labeled}|}$$
(3)

where \(Pairs_{disambiguated}\) represents all identified pairs in the refined dataset (i.e., the original labels that have been associated after applying our methodology), while \(Pairs_{labeled}\) corresponds to all pairs of equivalent labels in our benchmark (i.e., the original labels that have been associated in the EUPRO database). Thus, while both denominators are exact values, the numerators (i.e., the number of common pairs between the two datasets) can be underestimated because of possible wrong matchings between original and EUPRO labels. The estimated values of pP and pR confirm the accuracy of the methodology. In fact, \(pP=0.97\) and \(pR=0.82\). Therefore, \(97\%\) of disambiguated pairs are correct in comparison with EUPRO. Moreover, our methodology has been able to identify \(82\%\) of all labeled pairs, thus a substantial portion of them.
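For concreteness, a minimal R sketch of Eqs. (2) and (3) is given below; it assumes a data frame `matched` with one row per original label and its two cluster assignments, `disamb` (our refined dataset) and `truth` (the matched EUPRO label), where all names are hypothetical.

```r
# Enumerate all within-cluster pairs as canonical "a||b" keys.
pair_set <- function(ids, clusters) {
  unique(unlist(lapply(split(ids, clusters), function(members) {
    if (length(members) < 2) return(character(0))
    combn(sort(members), 2, paste, collapse = "||")
  })))
}

pairs_disambiguated <- pair_set(matched$label, matched$disamb)
pairs_labeled       <- pair_set(matched$label, matched$truth)

common <- length(intersect(pairs_disambiguated, pairs_labeled))
pP <- common / length(pairs_disambiguated)   # 0.97 in our application
pR <- common / length(pairs_labeled)         # 0.82 in our application
```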

In order to investigate the nature of unidentified matchings more deeply, we also compute the value of pR by country (considering the countries with the greatest number of organizations) and by activity type. The obtained results are reported in Tables 6 and 7, respectively. Here N represents the number of labeled pairs for the related category.

Table 6 Estimated pR by country; N is equal to the number of labeled pairs by country
Table 7 Estimated pR by activity type; N is equal to the number of labeled pairs by activity type

For what concerns countries, the lowest value of pR is observed in correspondence with Dutch organizations. This result stems from the fact that Dutch names are the most difficult to associate with the related international label (i.e., expressed in English) through linguistic similarity criteria. However, the value of pR can be considered satisfactory for all countries. More pronounced differences emerge when analyzing the value of pR by activity type. Research organizations, higher education institutions, and governmental bodies reveal a high level of pairwise-Recall. By contrast, half of the pairwise equivalences between labels related to private companies were not identified by the methodology. This result provides relevant indications about the main remaining sources of errors in the refined dataset. In the private context, indeed, organizations are frequently subject to rebrandings, mergers, and acquisitions. Identifying these dynamics requires knowledge of the history of each company, which is, however, a different research question, independent of linguistic properties. For instance, one of the most relevant companies in terms of connections emerging from the EUPRO database in FP1 is "BAE Systems PLC". "BAE Systems PLC" was created in 1999 from the merger of "British Aerospace" with "Marconi Electronic Systems", a subsidiary of "General Electric Company (GEC)". "British Aerospace", in turn, had acquired the defence units of "Siemens Plessey" in 1997. Thus, in the refined dataset, the participation in projects of "BAE Systems PLC" is distributed over different labels, i.e., "British Aerospace Plc", "GEC Marcon Ltd", "GEC-Marconi Materials Technology Ltd", "GEC Marconi Electronic Devices Ltd", "GEC Marconi Research Centre", "PLESSEY RESEARCH", "PLESSEY SIEMENS ELECTRONIC SYSTEMS LTD", and "GEC Plessey Telecommunications Ltd". Notice that the relevance of these unidentified matchings is strictly related to the research objectives. Indeed, in a punctual analysis of the projects funded by FP1, it could make sense to analyze organizations separately, as they effectively were at that moment, while it may be appropriate to consider mergers and acquisitions, as well as subsidiaries, in a dynamic analysis of a company's tangible and intangible assets.

Finally, one important result can be drawn from the size of the giant component (i.e., the largest set of mutually connected nodes) of the networks. Indeed, if considered as a percentage of the whole set of nodes in the network, the size of the giant component is almost the same in the refined networks as in the EUPRO networks (\(80.2\%\) and \(80.8\%\) in FP1, respectively; \(95.1\%\) and \(95.5\%\) in FP2, respectively; \(91.2\%\) and \(91.9\%\) in FP3, respectively), while it differs in the raw networks (\(77.3\%\) in FP1, \(94.0\%\) in FP2, \(90.6\%\) in FP3). This result reveals that unresolved errors are heterogeneously distributed among the nodes of the networks, and do not regard peripheral or core members only, thus allowing the refined networks to be structurally similar to the benchmark.
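Under the same illustrative setting as the network sketch above, the giant component share can be obtained with igraph as follows.

```r
comp <- components(g)        # connected components of a collaborative network
max(comp$csize) / vcount(g)  # giant component share, e.g., 0.802 for the refined FP1 network
```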

Efficiency of the methodology

Besides providing reliable and refined data, the proposed methodology aims at considerably reducing the number of equivalences between labels to check by hand, as described and motivated in "Proposed methodology" section. However, there is a difficult trade-off between the quality of the final dataset and computational cost. A manual inspection of all possible equivalences would guarantee the highest level of quality, but it rapidly becomes unaffordable as the size of the analyzed dataset grows. On the other hand, a fully automated procedure would provide benefits from a computational perspective, although with a high risk of wrongly matching labels. The efficiency of the proposed methodology lies in the fact that it is able to produce a well-refined dataset through a fast and easy procedure that is performed by hand only to a small extent.

More specifically, in our application, the algorithm returns 7,149 equivalences to check by hand, corresponding to \(0.005\%\) of all possible pairwise equivalences between labels in the raw dataset, which number 146,281,960. This number depends on the choice of the low and the high threshold defining the intervals of similarity score in which the algorithm classifies equivalences as true, false, or to be checked. In particular, all equivalences between labels whose similarity score ranges between the low and the high threshold need to be assessed by hand to be classified ultimately as true or false. In Fig. 5, we report the distribution of equivalences classified as true or false by hand, over discrete intervals of similarity score ranging between 0.94 and 0.99 (i.e., the low and the high thresholds we defined).
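The total number of candidate pairs follows directly from \(\frac{n(n-1)}{2}\) applied to the 17,105 distinct labels of the raw dataset; as a quick check in R:

```r
choose(17105, 2)          # 146,281,960 possible pairwise equivalences
7149 / choose(17105, 2)   # ~ 4.9e-05, i.e., roughly 0.005% checked by hand
```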

Fig. 5 Percentage of equivalences between labels classified as true or false by hand, over discrete intervals of similarity score

This plot confirms that implementing a fully automated procedure is not recommended. Indeed, wherever a single threshold is fixed within this interval, one either excludes several effective equivalences (green bars) or includes many wrong equivalences (red bars). Furthermore, this picture also supports our choice of the two thresholds. In fact, only \(4.3\%\) of the equivalences whose similarity score lies between 0.985 and 0.99 are classified as false. Thus, it is reasonable to automatically set the equivalences with a similarity score greater than the high threshold as true, since there is a marginal probability of including false equivalences. At the same time, \(18.6\%\) of the equivalences whose similarity score ranges between 0.94 and 0.945 are classified as true. Hence, a small portion of equivalences is probably not identified by automatically setting the equivalences with a similarity score lower than the low threshold as false. However, shifting the low threshold to the left would partially compromise the efficiency of the methodology, making the procedure longer and requiring additional time to check mostly false equivalences.

Finally, to show the appropriateness of the proposed methodology, in Fig. 6 we represent the cumulative percentage of equivalences classified as true by hand, within the interval between the low and the high threshold (outside this interval, equivalences are automatically classified as either true or false). As can be noted, the cumulative percentage of effective equivalences is positively related to the similarity score (i.e., the higher the similarity score, the greater the probability of classifying an equivalence as true rather than false), which thus turns out to be a well-suited proxy to uniquely identify organizations in the analyzed dataset. In support of this conclusion, we find that the similarity score between associated original and EUPRO labels is above 0.87 for \(80\%\) of all pairs. The cumulative distribution function of the similarity score over all pairs of corresponding original and EUPRO labels is shown in Fig. 7.

Fig. 6 Cumulative percentage of equivalences between labels, classified as true by hand, over different levels of similarity score

Fig. 7 Cumulative distribution function of the similarity score between corresponding original and EUPRO labels

Discussion and conclusions

In this work, we propose a novel methodology to disambiguate organization names in a dataset, addressing the most common sources of errors affecting datasets about organizations. The availability of reliable data is necessary to obtain valuable results in a scientific investigation. The proposed methodology limits the attribution of false information to different entities caused by the many possible ways in which the same organization can be called in a dataset. We demonstrate that a fully automated procedure to uniquely determine all distinct organizations in a dataset is not recommended, since it would carry a high probability of wrongly matching distinct labels or excluding many effective equivalences between them. At the same time, for datasets including thousands of distinct labels (very frequent in collaborative environments), it is impossible to check all pairwise equivalences by hand. Therefore, we propose a “hybrid" methodology which classifies almost all equivalences automatically and then returns a small portion of equivalences to assess by hand, in order to provide a final dataset of higher quality while reducing the pairs of labels to check manually to those most likely to be equivalent. Our procedure combines elements of supervised and unsupervised methods, and relies on an accurate pre-processing of labels that makes disambiguating organization names easier. The core of the proposed methodology is a rule-based approach computing the number of common words, the number of consecutive common characters, and the similarity score between any pair of labels. Unlike other existing methods, our methodology is not domain-specific and does not require a context to disambiguate. Moreover, it is designed to be efficient even in the presence of highly unbalanced data with a considerable number of organizations.
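To make the rule-based core concrete, the sketch below shows one possible instantiation of the three pairwise features named above, assuming labels that have already been pre-processed (e.g., lower-cased, punctuation removed). The character-trigram cosine score is one common way to realize a cosine-based measure and is an assumption for illustration, not necessarily our exact formulation.

```python
# A minimal sketch of the three pairwise features, under the stated
# assumptions; labels are pre-processed strings.
from collections import Counter
from math import sqrt

def common_words(a: str, b: str) -> int:
    """Number of distinct words shared by the two labels."""
    return len(set(a.split()) & set(b.split()))

def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive common characters."""
    best, prev = 0, [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def cosine_score(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between character n-gram count vectors."""
    grams = lambda s: Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ga, gb = grams(a), grams(b)
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0
```

For example, “sapienza university of rome" and “university of rome la sapienza" share four words and obtain a high trigram cosine score, whereas unrelated labels score near zero.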

We provide an empirical application of the methodology to the dataset of organizations taking part in projects funded by the first three EU FPs. These data represent a worthwhile source for empirical studies on collaborative projects at the European level, providing indications about the structure of EU FPs, as well as several policy implications. This choice is motivated by the fact that the literature on EU FPs can benefit from a reliable, high-quality dataset, i.e., the EUPRO database, introduced in 2008 by Roediger-Schluga and Barber, and then integrated in the following years by the RISIS core facility team. In this way, we test the quality of the proposed methodology by comparing the obtained results to the EUPRO database. More specifically, we chose to show the effectiveness of the methodology through the assessment of several network properties measured on the collaborative networks generated from the raw dataset, the refined dataset, and the EUPRO database, respectively. Networks are indeed one of the most popular tools to represent collaborative relationships, allowing connections among partners to be evaluated quantitatively through consolidated metrics from network theory.

Our results reveal that, through a procedure requiring limited time and manual effort, the dataset we obtain is considerably closer to the EUPRO database than the raw dataset. Indeed, the connection structure of the refined networks closely approximates that of the EUPRO networks. All the properties we test approach the benchmark values, and connections are distributed in a more heterogeneous and reliable way than in the raw networks, as shown by the estimated exponent \(\alpha\) characterizing the power-law degree distributions (\(P(x) \propto x^{-\alpha }\), where x corresponds to the node degree). The estimated values of pairwise-Precision and pairwise-Recall confirm the accuracy of the proposed methodology, which identifies \(82\%\) of all pairwise equivalences in the EUPRO database while returning just a small portion of unlabeled pairs (i.e., only \(3\%\) of disambiguated pairs are not included in the EUPRO database).
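For completeness, the sketch below illustrates generic formulations of the two quantities used in this assessment: a standard maximum-likelihood estimate of the power-law exponent \(\alpha\), and the pairwise-Precision/Recall of a disambiguation against a benchmark. Both are textbook formulations under the stated assumptions, not our exact code; here `gold` plays the role of the EUPRO assignments.

```python
# Generic sketches of the evaluation quantities; `assignment` maps each
# label to its equivalence class, and `gold` plays the role of EUPRO.
from itertools import combinations
from math import log

def estimate_alpha(degrees, x_min=1):
    """Continuous MLE of the power-law exponent for degrees >= x_min
    (an approximation for discrete data; assumes some degree exceeds x_min)."""
    xs = [d for d in degrees if d >= x_min]
    return 1 + len(xs) / sum(log(d / x_min) for d in xs)

def co_clustered_pairs(assignment):
    """Unordered label pairs placed in the same equivalence class."""
    classes = {}
    for label, cls in assignment.items():
        classes.setdefault(cls, []).append(label)
    return {pair for members in classes.values()
            for pair in combinations(sorted(members), 2)}

def pairwise_precision_recall(predicted, gold):
    """pairwise-Precision and pairwise-Recall of `predicted` against `gold`."""
    pred, true = co_clustered_pairs(predicted), co_clustered_pairs(gold)
    common = pred & true
    precision = len(common) / len(pred) if pred else 1.0
    recall = len(common) / len(true) if true else 1.0
    return precision, recall
```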

Nevertheless, the obtained results still differ from the EUPRO database. This is mainly because not all errors stem from the way in which an organization is called in the dataset. In order to investigate the nature of unidentified matchings, we estimate the values of pR by country and activity type. This further analysis provides insights into the remaining sources of errors and, consequently, into some limitations of the current approach. More specifically, a few languages are more difficult to associate with the corresponding international label. However, the main issues determining the difference between the refined dataset and the EUPRO database correspond to various dynamics mainly affecting private companies (pR for private companies is equal to 0.50). For instance, there are cases in which organizations merged or were rebranded, and our procedure cannot attribute the corresponding labels to the same organization, unlike the EUPRO database. To this end, recent initiatives to develop firm registers, such as FirmReg from the RISIS project, could offer a worthwhile contribution. However, the relevance of these unidentified matchings depends on the objectives of the analysis, as discussed in the previous section.

We also provide evidence of the efficiency of the proposed methodology, which produces a well-refined dataset while returning only a small set of equivalences between labels to check by hand, i.e., \(0.005\%\) of all pairwise equivalences. Moreover, we show the appropriateness of the methodology in identifying equivalences between distinct labels. The similarity score is indeed a good proxy to determine whether two labels are equivalent, being positively related to the percentage of equivalences classified as true by hand. Furthermore, \(80\%\) of corresponding original and EUPRO labels have a similarity score greater than 0.87; hence, associations in the EUPRO database that depend on non-linguistic criteria represent just a small portion.

It is important to note that the methodology we develop is intended for all those cases in which a high-quality dataset is not available in the literature. Furthermore, the methodology is designed to be adapted to the user’s needs. For instance, it would be possible to make the methodology fully automated through an easy step (as described in the “Proposed methodology” section), although Fig. 5 supports our argument that a fully automated procedure is not recommended. In the future, we aim to apply the methodology to other research fields where a high-quality dataset is not available.