Accounting for Quality in Data Integration Systems: a Completeness-aware Integration Approach

Ensuring the quality of integrated data is undoubtedly one of the main problems of integrated data systems. When focusing on multi-national and historical data integration systems, where the “space” and “time” dimensions play a relevant role, it is important to build the integration layer in such a way that the final user accesses a layer that is “by design” as complete as possible. In this paper, we propose a method for accessing data in multipurpose data infrastructures, such as data integration systems, which (i) relieves the final user from the need to access single data sources while, at the same time, (ii) maximizes the amount of information available to the user at the integration layer. Our method is based on a completeness-aware integration approach that gives the user ready access to the maximum information that can be obtained from the integrated data system, without having to carry out a preliminary data quality analysis on each of the databases included in the system. Our proposal of providing data quality information at the integrated level thus extends the functions of the individual data sources, opening the data infrastructure to additional uses. This may be a first step to move from data infrastructures towards knowledge infrastructures. A case study on the Research Infrastructure for the Science and Innovation Studies (RISIS) shows the usefulness of the proposed approach.


Introduction
In the current big data era, the problems of data integration, harmonization and, above all, data quality have increased rather than diminished (Ekbia et al., 2015). Paradoxically, in this context it appears more complex to identify criticalities in data and information, and to profile research infrastructures capable of showing the shortcomings and potential of the various existing data sources (Borgman, 2015). Information quality, which is more than simply accuracy, calls for increasing attention to other significant dimensions such as completeness, consistency, and currency (Batini and Scannapieco, 2016). The quality of data is context-dependent, and an appropriate quality of a single dataset, for a specific purpose, is not enough. The linkages between different datasets are relevant as well. The compatibility, interchangeability and connectability of a given dataset with other related data are fundamental aspects which need to be taken into account (Daraio and Glanzel, 2016). Quality is also a relevant dimension, a kind of overarching principle, to keep in mind when designing models of metrics (Daraio, 2017). Data integration is the activity of joining data located in diverse sources to offer the user a unified view of these data. According to Parent and Spaccapietra (2000), interoperability is the way in which heterogeneous databases talk to each other and exchange information in a meaningful way. Parent and Spaccapietra (2000) propose three levels of interoperability: (i) the lowest level, in which there is no integration; (ii) an intermediary level, in which the system does not assure consistency across database borders; (iii) a higher level, in which the goal is to develop a global system on top of existing databases, to deliver the wanted level of integration.
Different levels of conceptual interoperability have been proposed in the existing literature. Tolk and Muguira (2003) propose a detailed set of levels of conceptual interoperability that goes from the limiting case of no integration, Level (0), which corresponds to isolated systems (constituted by system-specific data), to Level (4), which corresponds to the maximum level of integration and is based on the existence of a common conceptual model (constituted by harmonized data and processes with a conceptual model). Intermediary levels include: Level (1), characterized by the existence of documentation of data and interfaces (basically documented data); Level (2), which corresponds to the use of common reference models/common ontology (consisting in aligning static data through Meta Data Management); and Level (3), which corresponds to the existence of a common system approach and/or open source code (consisting of aligned dynamical data). According to the quality framework of the OECD (2011), data quality is defined as "fitness for use" with respect to user needs, and it has seven dimensions: i) relevance grades the ability of data to address their purposes; ii) accuracy measures how correctly the data describe the features they are designed to assess; iii) credibility accounts for the confidence and trust of users in the data and their objectivity; iv) timeliness expresses the length of time between data availability and the phenomenon described by the data; v) accessibility gauges how readily the data can be located and accessed; vi) interpretability relates to the ease with which the user may understand and properly use and analyse the data; vii) coherence refers to the degree to which data are logically connected and mutually consistent. An important data quality aspect that is not explicitly reported in the OECD (2011) framework, but very often encountered in practical data analysis, is completeness.
For each variable, dimension and data set, completeness evaluates the number of missing values (with the meaning relevant to completeness, i.e. unavailable or temporarily unavailable) that are present. Data quality is a very complex topic, in which theory and practice often differ. In practice, data quality does play an important role in the design of data architectures. All data quality efforts must start from a solid understanding of high-priority use cases, and use that insight to navigate various trade-offs to optimize the quality of the final output. The following are trade-offs related to data quality. Should we select data for cleaning based on the cost of the cleaning effort, based on how frequently the data are used, or based on their relative importance within the data models consuming them? Or a combination of those factors? What sort of combination? Is it a good idea to improve data accuracy by getting rid of incomplete or erroneous data? While removing some data, how do we ensure that we do not introduce distortions or bias? Data integration systems are often the result of a huge effort that has to be paid to integrate highly heterogeneous data sources: schema harmonization, record linkage and historical data management are only some of the most common activities that these systems require in real application scenarios. Among such activities, ensuring the quality of integrated data is undoubtedly one of the main problems of integrated data systems.
To address the quality problem, some shared practices exist: for instance, ensuring data consistency at the integration layer is a mandatory approach in any sound data integration system. However, when it comes to data completeness, different solutions are possible, depending also on the "completeness" requirement of the users: while it is reasonable to say that no user would like to have inconsistent data, different degrees of completeness can be made available depending on how the data integration layer is built. When focusing on multi-national and historical data integration systems, where the "space" and "time" dimensions play a relevant role, it is important to build the integration layer in such a way that the final user accesses a layer that is "by design" as complete as possible. In this paper we address the relevance and challenges of the characterization of quality in a longitudinal and multinational data integration system. We propose a data quality approach, based on the maximization of the available information at the level of the integrated infrastructure, that could be a first step towards the building of a knowledge infrastructure.
The paper unfolds as follows. In the next section we describe the main goal of the paper and its contribution to the existing literature. Section 3 outlines existing studies related to the topic addressed in the paper, while Section 4 describes the proposed methodology. Section 5 illustrates the case study on the RISIS integrated data system, while Section 6 discusses the main results and concludes the paper.

Aim and contribution
The aim of this work is to propose a method to characterize the quality of the information contained in a multipurpose data infrastructure composed of historical and multinational heterogeneous data systems. We propose an approach that investigates the integration level of the overall system and is based on a completeness-aware method for maximizing the amount of information available in a data integration system. We choose completeness with respect to the target coverage defined by the integration layer because it is a fundamental data quality property that should be checked, and on which we can build further to extend the functionality of existing data systems integrated in a data infrastructure. The aim of this investigation at the integrated level is to highlight opportunities of data harmonization and exploitation that were not previously available to users of the individual databases. This investigation offers additional relevant information to the user and extends the functions of the individual data sources, opening the data infrastructure to additional uses not foreseen by the single data systems. Our approach may be considered a first step from data infrastructures towards knowledge infrastructures.
The existing literature on this topic, namely the analysis of the quality of an integrated system built on historical and multinational sources, is scant. However, these systems exhibit significant complexity: multi-nationality typically implies high heterogeneity, while historical data require that time consistency be carefully checked and ensured at the integration layer. We believe that the development of this approach may be of considerable importance, not only from a scientific point of view but also from an applied perspective, as it allows us to provide additional functionality indications for users of the integrated data system.
The methodology proposed will be applied in a case study on data coming from the platform on research, higher education and innovation, maintained and developed within the European project H2020 RISIS (Research Infrastructure for the Science and Innovation Studies).We will show the importance of considering data and information quality at the integrated level as an ingredient to move from a data infrastructure to a knowledge infrastructure.
The contribution of this work is a data quality analysis developed at the integrated level of the data infrastructure sources, providing a set of information that allows data users to decide which variables and levels of analysis present higher levels of quality, and under which conditions of use.

Related studies
The literature on the analysis of the quality of the integrated system built on historical and multinational sources is limited.
Quality-driven data integration systems are data integration systems that answer a global query posed on the integrated layer by explicitly taking into account the quality of the data provided by the local sources. Some relevant examples of such systems are briefly described below.
- FusionPlex (Motro et al., 2005) is a data integration system assuming instance inconsistency, meaning that the same instance of the real world can be represented differently in the various local sources due to errors. In order to deal with such instance-level inconsistencies, FusionPlex introduces a set of quality metadata, called features, about the sources to be integrated.
- DaQuinCIS (Scannapieco et al., 2004) is a framework with an underlying data integration system where the sources are characterized by quality metadata that are exploited in the query answering phase. User queries, posed to the integration layer, are processed so that the "best quality" answer is returned as a result, i.e. when retrieving data from the sources, data are compared and a best-quality copy can be either selected or constructed.
- QP-alg (Naumann et al., 1999) specifies the mapping between local sources and the global schema by means of Query Correspondence Assertions (QCAs). Three classes of data quality dimensions, called Information Quality criteria (IQ criteria), are defined: source-specific criteria, defining the quality of a whole source; QCA-specific criteria, defining the quality of specific query correspondence assertions; and user-query-specific criteria, measuring the quality of the source with respect to the answer provided to a specific user query. These criteria are used in the query answering phase.
Differently from the systems cited above, our approach does not base query answering on quality metadata specified as part of the data integration system; instead, the integration layer is built by design to maximize completeness. A detailed description of the proposed approach is reported in Section 4.2. It is based on an Ontology-Based Data Management (OBDM) approach, described at length in Section 4.1. Lenzerini and Daraio (2019) discuss the main challenges, approaches and solutions available for integrating data on research, higher education and innovation, consolidating existing research on the topic, including Daraio et al. (2016a), which introduces Sapientia, the ontology of multidimensional assessment of research, and Daraio et al. (2016b), which highlights and discusses the main advantages of an OBDM approach, residing in openness, interoperability and data quality. Recently, Angelini et al. (2020) showed the usefulness of Sapientia and OBDM combined with visual analytics to develop general models of performance indicators.

Method
In this section, we first illustrate the proposed method; later, we present the RISIS case study that shows the application of the method to a real case. In particular, we describe our proposal for building a data integration system with explicit quality annotations. We first give an overview of the adopted data integration approach, namely OBDM (Ontology-Based Data Management); then, we focus on our proposal to explicitly represent the data quality of the integration layer, so as to have full governance of the quality of the data provided by the data integration system.

Introduction to OBDM
Ontology-Based Data Management (OBDM) was introduced about a decade ago as a new way of modeling and interacting with a collection of data sources (see Lenzerini 2011). According to this paradigm, the client of the information system is freed from the need to know how data are structured in concrete resources (databases, software programs, services, etc.), and interacts with the system by expressing her queries and goals in terms of a conceptual representation of the domain of interest, called the ontology.
More precisely, an OBDM system is an information management system maintained and used by a given organization (or a community of users), whose architecture has the same structure as a typical data integration system, with the following components: an Integration Layer with an ontology, a Source Layer with a set of data sources, and a mapping between the two (see Figure 1). In particular:
- Integration Layer, with an ontology, i.e. a conceptual, formal description of the domain of interest of the organization, expressed in terms of relevant concepts, attributes of concepts, relationships between concepts, and logical assertions formally describing the domain knowledge.
- Source Layer, with the data sources, i.e. the repositories accessible by the organization where data concerning the domain are stored. In the general case, such repositories are numerous, heterogeneous, and each one is managed and maintained independently from the others.
- Mapping Layer, with the mapping, i.e. a precise specification of the correspondence between the data contained in the data sources and the elements of the ontology. Here, element means concept, attribute, or relationship.
We observe that the above three layers constitute a sophisticated knowledge representation system that can be managed and reasoned upon with the help of automated reasoning techniques. For example, suitable algorithms allow queries expressed over the ontology to be answered by automatically translating the query in terms of the data sources using the mapping (Calvanese et al. 2007). Although the problem of answering queries over the ontology has been the main focus in recent years, there are several other services that an OBDM system should provide. Data quality assessment (Batini and Scannapieco 2016) is one notable example.

A completeness-aware integration approach
When integrating multi-national and historical data sources, a relevant dimension to consider is completeness with respect to the target coverage defined by the integration layer.
We therefore introduce a new concept of completeness with respect to a coverage target defined at the integration level. This target may not be fully reached by integrating the sources and is, in general, dependent on the way in which the sources are integrated. Assuming that we would like an integrated system in which the completeness of the data available to the final users is maximized, we can reason about building the integrated system with this target in mind, as explained below.
Two intuitive examples of completeness are geographical completeness and time completeness.
Let the Integration Layer be defined as a set of relational tables {I1,…,Im}.
Let us assume for the sake of simplicity and without loss of generality that we are in a setting with only two sources, each one consisting of one relational table, namely: S={S1, S2}, with S1={R11} and S2={R21}.
Let us also assume, without loss of generality, that both R11 and R21, in the following referred to as R1 and R2 respectively for simplicity of notation, have one single attribute for the territorial dimension, Aspace (e.g. country), and one single attribute for the temporal dimension, Atime (e.g. year).
Example 1. In this setting, the Integration Layer can be defined so as to take the completeness dimension explicitly into account, in order to give the final users the possibility of accessing information at the integration layer while maximizing the amount of information they can access.

To this end, the Integration Layer will be composed of a set of relations {I, I1, I2}, such that:
1. I = (R1 ⋈ R2), the join of R1 and R2 on Aspace and Atime, which (i) consists of all the tuples present in both R1 and R2, and (ii) has Atime and Aspace defined on the intersection of the domains of the two attributes in the originating sources, namely R1(Atime) ∩ R2(Atime) and R1(Aspace) ∩ R2(Aspace).
2. I1 = (R1 − R2), which (i) consists of all the tuples present in R1 but not in R2, and (ii) has Atime and Aspace defined as in R1.
3. I2 = (R2 − R1), which (i) consists of all the tuples present in R2 but not in R1, and (ii) has Atime and Aspace defined as in R2.
We can now define the notion of completeness of the integration layer as follows:
- I_Completeness: the notion of completeness that provides the highest information value on a specific entity within a given space-time view. I_Completeness is maximum when the user queries the I relation of the Integration Layer.
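The partitioning of the integration layer described above can be sketched in a few lines of code. The snippet below is a minimal illustration, not part of any actual implementation: relations are modelled as Python dicts mapping a (space, time) key, e.g. (country, year), to the source-specific attributes, and all names and sample values are ours.

```python
# Minimal sketch of the completeness-aware partitioning of the
# Integration Layer into {I, I1, I2}. Relations are modelled as
# dicts mapping a (space, time) key, e.g. (country, year), to the
# source-specific attributes. All names here are illustrative.

def build_integration_layer(r1, r2):
    """Partition two sources into I = R1 join R2, I1 = R1 - R2, I2 = R2 - R1."""
    shared = r1.keys() & r2.keys()
    # I: tuples whose (space, time) key appears in both sources,
    # carrying the attributes of both R1 and R2.
    i = {k: {**r1[k], **r2[k]} for k in shared}
    # I1 / I2: tuples present in only one source, keeping the
    # attribute domains of that source.
    i1 = {k: r1[k] for k in r1.keys() - shared}
    i2 = {k: r2[k] for k in r2.keys() - shared}
    return i, i1, i2

if __name__ == "__main__":
    project_info = {("IT", 2015): {"projects": 12}, ("FR", 2015): {"projects": 7}}
    patent_info = {("IT", 2015): {"patents": 3}, ("DE", 2016): {"patents": 9}}
    i, i1, i2 = build_integration_layer(project_info, patent_info)
    print(i)   # {('IT', 2015): {'projects': 12, 'patents': 3}}
    print(i1)  # {('FR', 2015): {'projects': 7}}
    print(i2)  # {('DE', 2016): {'patents': 9}}
```

Within this sketch, each of I, I1 and I2 has I_Completeness equal to 1 by construction, while querying I together with I1 (respectively I2) recovers all the tuples of the first (respectively second) source.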
In the example in Figure 3, if the user is interested in all the information that the sources have on HEIs, then she has to access the relation I, which indeed contains both the ProjectInfo and the PatentInfo of the HEIs.
- S_Completeness: the notion of completeness that provides the highest information value on a specific entity with a given attribute selection of S1 (respectively, S2).
S_Completeness is maximum when the user accesses I ∪ I1 (respectively, I ∪ I2). In the example in Figure 3, if the user is interested only in the ProjectInfo of HEIs, by querying both relations I and I1 she is able to obtain the ProjectInfo for all the HEIs of S1.
Note 1. We focus on the space and time attributes of the sources as they are the ones that are typically and mandatorily shared by the sources; indeed, in order to perform a proper integration it is necessary to define the space-time scope of the population underlying the integrated datasets. Of course, it may be the case that other attributes are shared by the sources. In such a case, the approach shown can easily be extended to those attributes as well.
Note 2. The notion of S_Completeness allows characterizing the completeness of a source at the integration layer. The question could arise: why not access the source directly at the source layer? The answer is: because the user sees only the integration layer and benefits from a homogeneous representation of all the data at the sources according to a common global representation.

Experimental Validation of the Approach
The proposed methodology was applied to some of the RISIS project datasets presented above. In particular, we focus on databases containing information on Higher Education Institutions (HEIs), though the approach is general enough to be applied to other databases as well.
A conceptual integration scheme for HEIs is available in Appendix 1. To facilitate the reading of the schema, Appendix 1 reports in Fig. A1 the legend of the Graphol language, including predicate and constructor nodes (Console et al. 2014, Lembo et al. 2016 and 2018).

RISIS ETER and CWTS Integration
This integration task combines HEIs with related publications.
Starting from the source layers, the data integrations have been performed following the methodology presented in Section 4.2 (see Appendix 1 for the results), considering R1 as ETER and R2 as CWTS:
1) I = (R1 ⋈ R2): creation of the intersection of org_Id and years between datasets R1 and R2;
2) I1 = (R1 − R2): creation of the subtraction table of the first dataset with respect to the second, which will have Atime and Aspace defined as in R1 (R1 = ETER);
3) I2 = (R2 − R1): creation of the subtraction table of the second dataset with respect to the first, which will have Atime and Aspace defined as in R2 (R2 = CWTS).
From these data, applying the methodology described above, the following results were obtained:

Institutions in CWTS not present in ETER by Year and Country
The approach opposite to the proposed one involves the use of a single union table between the information in ETER and CWTS, which is composed of 10092551 rows.
Considering the total number of rows with complete information (6429051 rows) and the total number of rows of the table, we can calculate the I_Completeness: I_Completeness = 6429051 / 10092551 ≈ 0.64. This shows the relevance of our approach in maximizing completeness and relieving the final users from receiving partially empty tuples as results of their queries.

RISIS ETER and RISIS PATENT Integration
Starting from the source layers, the data integrations have been performed following the methodology presented in Section 4.2 (see Appendix 1 for the results), considering R1 as ETER and R2 as RISIS Patents:
1) I = (R1 ⋈ R2): creation of the intersection of org_Id and years between datasets R1 and R2 (2011-2016);
2) I1 = (R1 − R2): creation of the subtraction table of the first dataset with respect to the second, which will have Atime and Aspace defined as in R1 (R1 = ETER);
3) I2 = (R2 − R1): creation of the subtraction table of the second dataset with respect to the first, which will have Atime and Aspace defined as in R2 (R2 = RISIS Patents).

Figure 8 Representation of the integration scheme of ETER and RISIS PATENT
From these data, applying the methodology described above, the following results were obtained. Thanks to this approach, it is possible to highlight the completeness of the information. In each relation (I, I1 and I2) the I_Completeness is equal to 1; in particular, the relation I has the complete information from the two different sources for the period 2011-2016.
An approach alternative to the proposed one could involve the use of a single union table between the information in ETER and RISIS Patent, which is composed of 71291 rows.
Considering the total number of rows with complete information (32027 rows) and the total number of rows of the table, we can calculate the I_Completeness: I_Completeness = 32027 / 71291 ≈ 0.45. Hence, in this alternative approach the completeness value would be quite low.
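The completeness figures of the single-union-table alternative can be reproduced directly from the row counts reported in the text for the two case studies; the snippet below is a simple sketch of this computation (the function name is ours).

```python
# Sketch of the I_Completeness computation for a single union table,
# using the row counts reported in the text for the two case studies.

def i_completeness(complete_rows, total_rows):
    """Fraction of rows of the union table carrying complete information."""
    return complete_rows / total_rows

if __name__ == "__main__":
    # ETER + CWTS union table: 6429051 complete rows out of 10092551
    print(round(i_completeness(6429051, 10092551), 2))  # 0.64
    # ETER + RISIS Patent union table: 32027 complete rows out of 71291
    print(round(i_completeness(32027, 71291), 2))       # 0.45
```

The complements of these ratios are the incomplete rows the user would otherwise have to handle: 10092551 − 6429051 = 3663500 rows in the CWTS case and 71291 − 32027 = 39264 rows in the Patent case.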

Impact on user
Thanks to the results above, it is possible to highlight how the use of the proposed methodology has a considerable impact on the user. By dividing the information into the sub-relations I, I1 and I2, the information content is maximized, with an I_Completeness = 1 for each relation. The high value of completeness allows the user to know, even before each query, the amount of partial or complete information available. Besides, the proposed approach moves the workload of finding unlinked values, incomplete information or other data cleaning operations from the user to the database manager, making access to the data easier for the user.
The approach opposite to the one proposed shows, instead, a higher workload of data checking and cleaning operations for the user, who has no prior knowledge of the complete information contained in the dataset and must necessarily analyze the obtained dataset from this perspective.
Evidence for these claims is shown below by contextualizing them in the results of the two case studies presented above. In the case of CWTS and ETER, it is possible to estimate that only 64% of the rows are complete. As a consequence, the user, once the dataset is obtained, will have to analyze and/or eliminate the remaining 3663500 rows. In the case of RISIS Patent and ETER, the situation is even more interesting. The results show that 45% of the rows are complete, leading the final user to manipulate, according to their needs, 39264 rows, about 55% of the total.
It is important to specify that the proposed results and numbers may be subject to errors due to the quality of the datasets used. In particular, the presence of HEIs not mapped in ETER but mapped in the RISIS Patent and CWTS datasets is conceivable, as these datasets also contain HEIs that are not universities.

Discussion and Conclusions
The consideration of data quality is an extremely important and current topic in the present big data era, characterized by the paradox of an ever greater increase of available data which, however, is not accompanied by an adequate development of techniques capable of providing more information to users. Indeed, users are often overwhelmed by data and are unable, except with extreme difficulty and after several data cleaning and harmonization efforts, to understand what information is actually available for their empirical analyses.
In this paper, we propose an approach to account for quality in data integration systems. It is a completeness-aware integration approach that works at the integrated system level. The case study on European Higher Education Institutions data (included in the ETER database), integrated with bibliometric data (coming from the CWTS database) and patent data (included in the RISIS Patents database), shows the importance of the proposed approach for providing data with a high level of completeness, relieving final users from the need to post-process data in order to obtain adequate levels of data quality.
The proposed data quality approach offers different potentialities beyond the case study illustrated in the previous section that we briefly report below.
(i) Designing information quality-aware methods at the integrated system level. We proposed a data quality approach led by the maximization of the information available at the integrated system layer. Our integration approach is led by the maximization of completeness at the integrated layer and can be further extended to other data quality dimensions and applied to different databases.
(ii) Putting the users' needs at the center of the scene, providing useful knowledge.
We proposed a user-oriented approach that reduces the user's workload in data checking and cleaning operations and that allows the user to grasp the overall information available without any prior operations on the data contained in each dataset. Our approach moves the workload of finding unlinked values, incomplete information or other data cleaning operations from the user to the database manager, making it easier for the user to access the relevant information.
(iii) A first step from data infrastructure to knowledge infrastructure. Our approach contributes to extending the functions of the individual data sources, opening the data infrastructure to additional uses. This may be a first step to move from data infrastructures towards knowledge infrastructures.
The management of data at the integrated level is part of data governance and should also include a certain data literacy (Koltay, 2016). Most data can in principle be considered infrastructural resources, as they are "shared means to many ends" that satisfy all three criteria of infrastructure resources highlighted by Frischmann (2012). 1. Data are non-rivalrous goods that can in principle be consumed an unlimited number of times. While it is widely accepted that social welfare is maximised when a pure rivalrous good is consumed by the person who values it the most, and that the market mechanism is generally the most efficient means for rationing such goods and for allocating resources needed to produce such goods, this is not always true for non-rivalrous goods (Frischmann, 2012). Social welfare is not maximised when the good is consumed only by the person who values it the most, but by everyone who values it.
Maximising access to a non-rivalrous good will in theory maximise social welfare, as every additional private benefit comes at no additional cost. 2. Data are capital goods. Data are not a consumption good or an intermediate good.
In most cases, data can be classified as capital goods. The UN (2008) System of National Accounts (SNA) defines a consumption good or service as "one that is used […] for the direct satisfaction of individual needs or wants or the collective needs of members of the community". 3. Data are general-purpose inputs. As Frischmann (2012) explains, "infrastructure resources enable many systems (markets and non markets) to function and satisfy demand derived from many different types of users". They are not inputs that have been optimised for a special limited purpose; rather, "they provide basic, multipurpose functionality". Data may often be collected for a particular purpose, and in the case of personal data the ex-ante specification of the purpose is typically required. However, there is theoretically no limitation on the purposes data can be used for, and in fact many of the benefits of data sharing arise from the reuse of data in ways that were not or could not be anticipated when the data were collected. In addition, the reuse of data created in one domain may lead to further insights when applied in another (Edwards, 2010). Daraio and Bonaccorsi (2017) show that the intelligent integration of existing data may lead to an open linked-data platform which permits the construction of new indicators. The power of the approach derives from the ability to combine heterogeneous sources of data to generate indicators that address a variety of user requirements without the need to design indicators on a custom basis. The quality of data and of the related information is crucial to add value and improve the awareness and exploitation of the available data, enhancing data quality-aware empirical investigations when heterogeneous data sources, included in data infrastructures, have to be integrated into knowledge infrastructures. It has been observed that knowledge sharing has direct impacts and interaction effects, in combination with IT infrastructure, and enhances firms' ability to innovate (OECD, 2015a; Cassia et al. 2020).
Among the most urgent research questions recently discussed about knowledge infrastructures are the following: (i) Investing in knowledge infrastructures that enhance scholarly communication. Despite the political pressures and institutional requirements for university researchers to share and to retain their data, investments in knowledge infrastructures to sustain access to those data resources are relatively few. Scientific data are heterogeneous in type, volume, funding sources, instrumentation, standards, and other factors, making them difficult to sustain (Borgman, 2020). (ii) Developing more inclusive knowledge infrastructures by fostering opportunities for fair participation. User participation in the planning and design of tools/systems for sustainable infrastructure development has been discussed in Edwards et al. (2013). Extremely important are the different user participation/contribution models in existing Knowledge Infrastructures (KI), such as citizen science, community-based science, street science, and community research. However, the nature of that participation, the demands and abilities of marginalized populations, and methods to reflect inclusivity in the design and/or operationalization of KIs for knowledge creation should be further investigated. Many studies have already demonstrated how KIs can benefit and empower communities and citizens, especially when combined with open data initiatives through existing knowledge and data infrastructures, by providing access to new information and knowledge and teaching new technical skills. However, the literature has also pointed out how existing KIs did not help communities and citizens address their immediate community concerns and problems (Yoon, 2020). (iii) Maximizing the scientific return of archival data in the coming decades, especially with a fast-moving ecosystem of tools, technologies, and techniques for generating scientific knowledge (Smith, 2020). (iv) Urgent questions to address about KI include: How can parts of KI that are opposing, independent, and lagging be bridged? Which bridges facilitate success under these different circumstances? When in the life of a KI is bridging more or less successful? (Faniel, 2020).
We are well aware that the road to building knowledge infrastructures on top of existing data infrastructures is still long. The approach presented in this paper, and illustrated on the real case of RISIS, represents an encouraging first step along this path.


Figure 2: Example of source layer in a data integration system with specific space-time features.

Figure 4: Representation of the integration scheme of ETER and CWTS.

Relation I:
- The relation I has 6429051 records, which corresponds to the number of publications with information about the referenced institutions;
- I contains 3451451 different articles and 2199 unique organisations from 34 different countries.
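As a hedged illustration of how a relation such as I can be materialised, the following sketch joins a toy stand-in for ETER with a toy stand-in for the CWTS publication table on a shared organisation identifier and counts the distinct articles, organisations and countries it contains. All column names (org_id, article_id, country) and the data are invented for illustration; they are not the actual RISIS schema.

```python
import pandas as pd

# Illustrative stand-ins for the two sources; in RISIS these would be
# the ETER register and the CWTS publication database.
eter = pd.DataFrame({
    "org_id": ["E1", "E2", "E3"],
    "country": ["IT", "DE", "FR"],
})
cwts = pd.DataFrame({
    "org_id": ["E1", "E1", "E2", "E4"],
    "article_id": ["a1", "a2", "a1", "a3"],
})

# Relation I: publications whose referenced institution is present in
# both sources (inner join on the shared organisation identifier).
I = eter.merge(cwts, on="org_id", how="inner")

print(len(I))                      # records in I
print(I["article_id"].nunique())   # distinct articles
print(I["org_id"].nunique())       # distinct organisations
print(I["country"].nunique())      # distinct countries
```

On the real data the same distinct counts would yield the figures reported above for relation I.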

Figure 5: Number of institutions in I by year and country (institutions with information in ETER and CWTS, by year and country).

Relation I1:
- The relation I1 has 6522 records, which correspond to the number of institutions in ETER without information in CWTS (for the period 2011-2017);
- I1 contains information on 1006 different institutions from 37 different countries.

Figure 6: Number of institutions in I1 by year and country (institutions in ETER not present in CWTS, by year and country).

Figure 9: Number of institutions in I by year and country (ETER and RISIS Patent institutions).

Relation I1:
- The relation I1 has 14177 records (for the period 2011-2016);
- I1 contains information on 3060 different institutions.

Fig. A4: Patents mapped in RISIS Patent by year.

- Cheetah is a database featuring geographical, industry and accounting information on three cohorts of mid-sized firms that experienced fast growth during the periods 2008-2011, 2009-2012 and 2010-2013.
- The CIB/CinnoB (Corporate Invention and Innovation Boards) is a database about the largest R&D performers and their subsidiaries worldwide, providing patenting and other indicators.
- The CWTS publication database is a full copy of Web of Science (WoS) dedicated to bibliometric analyses, with additional information, e.g. on standardised organisation names and other enhancements.
- ESID is a comprehensive and authoritative source of information on social innovation projects and actors in Europe and beyond.
- EUPRO is a unique dataset providing systematic and standardized information on R&D projects of different European R&D policy programmes.
- JoREP2.0 is a database on European trans-national joint R&D programmes, storing a basic set of descriptors on the programmes and the agencies participating in them.
- MORE (Mobility Survey of the Higher Education Sector) is a comprehensive empirical study of researcher mobility in Europe.
- The Nano S&T dynamics database (Nano) collects publications and patents between 1991 and 2011 about Nano S&T.
- ProFile is a longitudinal study focusing on doctoral candidates and their postdoctoral professional careers at German universities and funding organisations.
- RISIS Patent offers an enriched and cleaned version of the PATSTAT database, with a focus on standardised organisation names and geolocalisation.
- RISIS-ETER represents an extension of the European Tertiary Education Register database by additional indicators in terms of research activities.
- The Science and Innovation Policy Evaluations Repository (SIPER) is a rich and unique database and knowledge source of science and innovation policy evaluations worldwide.
- VICO is a database comprising geographical, industry and accounting information on startups that received at least one venture capital investment in the period 1998-2014.

Besides the databases of RISIS, we also considered the public facility OrgReg, used by the RISIS project for the harmonization of the various institutions across the databases. OrgReg (https://risis-eter.orgreg.joanneum.at/about/data-download) is a public facility which provides a comprehensive register of public-sector research and higher education organizations in European countries. OrgReg covers organizations that are not exclusively market-oriented in all 27+1 (UK) European Union member states, EEA-EFTA countries (Iceland, Liechtenstein, Norway and Switzerland), as well as candidate countries (FYRM, Montenegro, Serbia and Turkey). It is a public resource whose main function is to allow integrating different RISIS datasets at the level of actors, through the definition of a common list of organizations and the use of organizational IDs (OrgReg_Id) that are used consistently in the RISIS datasets providing data at the level of organizational actors. Private (market-oriented) organizations are covered by a parallel firms register (FirmReg). - used to model the domain. Details of the used datasets are given below:
- ETER (see Appendix 2; Fig. A3 shows organizations in ETER by year and country), taking all the institutions' information in the dataset for the period 2011-2017 (full temporal coverage of the dataset). All institutions with org_Id within ETER are mapped geographically (the ETER_Countries entity in the scheme in Appendix 2). For more precise information, the geographical coverage of the data used is EU 27, UK, Montenegro,

Table I

Relation I2:
- In the I2 domain there are 3492623 publications without an org_Id in a certain year for a certain institution;
- the I2 articles come from 1380 institutions mapped in CWTS but not in ETER (institutions within the ETER countries group).

Number of articles in CWTS without Org_Id value, by year.

Thanks to this approach, it is possible to highlight the completeness of the information. In each relation (I, I1 and I2) the I_Completeness is equal to 1; specifically for I, the relation has the complete information from the two different sources.

Figure 7: Number of institutions in I2 by year and country (institutions in CWTS not present in ETER, by year and country).

In addition to these results, there are 164355 records from CWTS without an org_Id value, i.e. unmapped.
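The decomposition into I, I1 and I2, and the fact that I_Completeness equals 1 within each relation by construction, can be sketched with a full outer join whose merge indicator records which source(s) contributed each record. The tiny tables and column names (org_id, students, pubs) below are invented purely for illustration.

```python
import pandas as pd

# Toy stand-ins for two sources keyed by organisation id; in RISIS
# these would be ETER and the CWTS publication database.
eter = pd.DataFrame({"org_id": ["E1", "E2", "E3"], "students": [10, 20, 30]})
cwts = pd.DataFrame({"org_id": ["E1", "E4"], "pubs": [5, 7]})

# Full outer join; the merge indicator labels each record with the
# source(s) it came from.
m = eter.merge(cwts, on="org_id", how="outer", indicator=True)

I  = m[m["_merge"] == "both"]        # information from both sources
I1 = m[m["_merge"] == "left_only"]   # in ETER but not in CWTS
I2 = m[m["_merge"] == "right_only"]  # in CWTS but not in ETER

# Within I, every record carries values from both sources, so the
# share of records with a missing source-side attribute is 0 and
# completeness is 1 by construction.
i_completeness = 1 - I[["students", "pubs"]].isna().any(axis=1).mean()
print(i_completeness)
```

This is the sense in which the user at the integration layer sees, per relation, the maximum information available without having to profile each source separately.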
Relation I2:
- In the I2 domain there are 25087 project ids without an org_Id linked in ETER in a certain year for certain institutions;
- I2 refers to 664 institutions mapped in RISIS Patent but not in ETER (institutions within the ETER countries group).

Figure 10: Number of institutions in I1 by year and country (institutions in ETER not present in RISIS Patent, by year and country).

Figure 11: Number of institutions in I2 by year and country (institutions in RISIS Patent not present in ETER).

Edwards et al. (2013) defined knowledge infrastructures as "robust networks of people, artifacts, and institutions that generate, share, and maintain specific knowledge about the human and natural worlds." Nielsen (2012) argues that we are living at the dawn of the most dramatic change in science in more than 300 years. This change is being driven by powerful new cognitive tools, enabled by the internet, which are greatly accelerating scientific discovery. In his book "Reinventing Discovery", Nielsen describes an unprecedented new era of networked science. According to OECD (2015b), open data are "data that can be used by anyone without technical or legal restrictions. The use encompasses both access and reuse" (OECD, 2015b, p. 7). According to the same source, open science refers to "efforts by researchers, governments, research funding agencies or the scientific community itself to make the primary outputs of publicly funded research results (publications and the research data) publicly accessible in digital format with no or minimal restriction as a means for accelerating research; these efforts are in the interest of enhancing transparency and collaboration, and fostering innovation. […] Three main aspects of open science are: open access, open research data, and open collaboration enabled through ICTs. Other aspects of open science (post-publication peer review, open research notebooks, open access to research materials, open source software, citizen science, and research crowdfunding) are also part of the architecture of an open science system" (OECD, 2015b, p. 7). Vicente-Sáez and Martínez-Fuentes (2018), after a systematic review, propose the following broad definition of open science: "transparent and accessible knowledge that is shared and developed through collaborative networks".