1 Introduction

Modern science is becoming increasingly data driven and works with large amounts of data that are heterogeneous and distributed and that require special infrastructure for data collection, storage, processing, and visualization. Science digitalization, like industry digitalization, is facilitated by the explosive development of digital technologies as well as infrastructure technologies and services.

New large-scale scientific problems such as climate, global warming, genome, and fast response to pandemics require using modern Big Data, Cloud Computing, and Artificial Intelligence technologies. However, further research platform advancement requires new approaches to infrastructure services provisioning and management that could facilitate the essential research process and minimize the overhead of infrastructure provisioning and management.

Future digital science opens new possibilities of cross-domain/cross-sector integration and consolidation of resources and capacities. It will require a new type of infrastructure that would provide extended functionality to collect, store, distribute, process, exchange, and preserve research data to support common knowledge growth and exchange [1,2,3]. We will refer to this new infrastructure as Future Scientific Data Infrastructure (FutureSDI or FutureRI).

Recent European initiatives and projects such as the European Open Science Cloud (EOSC) [4] and Research Data Alliance (RDA) [5] facilitated implementation of the FAIR (Findable, Accessible, Interoperable, Reusable) data principles [6] that allow for effective data exchange and integration across scientific domains, making scientific data a valuable resource and a growth factor for the whole digital economy and society. To uncover the potential of future digital and data-driven science, the FutureSDI must provide a platform for effective use of scientific data by enabling the creation of specialized, consistent ecosystems that support the full cycle of value creation, from data collection to model creation and knowledge acquisition and exchange. Shifting the focus from infrastructure operation to value creation will require a new approach to FutureSDI design, operation, and evolution that responds to changing requirements and evolving technologies. Growing infrastructure complexity will require automation of infrastructure provisioning and operation, allowing researchers to focus on problem-solving.

This paper attempts to analyze current technology that can advance SDI development and support future digital science. Based on this analysis, the paper proposes a vision for the future RI Platform as a Service (PRIaaS) that incorporates recent digital technologies and enables platform and ecosystem model for future science.

The proposed analysis and PRIaaS architecture are based on the authors’ long-time involvement in numerous EU and national projects, studies, and initiatives on RI development, including the ongoing projects GEANT4 [7], FAIRsFAIR [8], and SLICES-DS [9], which deal with different modern Research Infrastructure and e-Infrastructure developments. The paper refers to the authors’ previous work on defining the Big Data Architecture Framework (BDAF) [10] and Scientific Data Infrastructure requirements [11] and on developing practical aspects of the cloud services network infrastructure [12], which provides a strong foundation for the current research.

The paper is organized as follows. Section 15.2 provides a short reference to recent regulations, initiatives, and projects in the European Research Area that drive future SDI and RI development. Section 15.3 provides an overview of the key technology developments that may facilitate FutureSDI development. Section 15.4 describes the main features of future digital science, analyzes the timeline of the European RI development, and proposes a vision for the key technologies that can shape the FutureSDI, marked as EOSC-2. Section 15.5 summarizes the general requirements for FutureSDI and describes the proposed PRIaaS architecture and its operation. Section 15.6 discusses important aspects of research data management to support FutureSDI requirements and PRIaaS functionality. Section 15.7 presents a conclusion and refers to ongoing and future developments.

2 European Research Area

2.1 European Research Infrastructures and ESFRI Roadmap

The European Research Area (ERA) is an important area of European policy development and funding that supports European science and ensures its competitiveness while facilitating European cooperation and integration. Research Infrastructures (RIs) are one of the pillars of ERA, designated to connect research, higher education, and innovation [13].

A European Research Infrastructure (RI) is a facility or (virtual) platform that provides the scientific community with resources and services to conduct top-level research in their respective fields. The research infrastructures can be single-sited or distributed or an e-infrastructure and can be part of a national or international network of facilities, or of interconnected scientific instrument networks.

An important instrument in defining European RI development and evolution is the ESFRI (European Strategy Forum on Research Infrastructures) Roadmap [14]. The new ESFRI Roadmap 2021 defines important priorities that include consolidating the landscape of European RIs and opening, interconnecting, and integrating RIs to develop the full potential of the data they generate and to increase the innovation potential of ERA/European science in its cooperation with industry [15]. Research Infrastructures constitute a powerful resource for industry and a prerequisite for collaboration between industry and academia.

To facilitate RI and science digitalization, the new ESFRI Roadmap includes a new DIGIT area whose focus is to support research on digital technologies.

e-Infrastructure is another area of the European policy and funding that is designated to support ESFRI and constitute the essential building block for ERA. e-Infrastructures address the needs of European researchers for digital services in terms of networking, computing, and data management. e-Infrastructures provide digital-based services and tools for data- and computing-intensive research in virtual and collaborative environments.

e-Infrastructures are key in the future development of research infrastructures, as activities go increasingly online and produce vast amounts of data. Current European e-Infrastructure capacity includes such Trans-European operational infrastructures as GÉANT, the high-capacity and high-performance communication network [16], and PRACE, European HPC services for European research [17].

2.2 European Open Science Cloud (EOSC)

The European Open Science Cloud (EOSC), started in 2016, is part of the “European Cloud Initiative - Building a competitive data and knowledge economy in Europe” [18, 19], which aims to capitalize on the data revolution. Under this initiative, EOSC federates existing and emerging e-Infrastructures to provide European science, industry, and public authorities with world-class data infrastructure connected to high-performance computers (HPC).

The EOSC goal is to enable the Open Science Commons [20]. At present, the EOSC projects have created the foundation for research data interoperability and integration for European RIs. The Minimum Viable EOSC (MVE) achieved by the end of 2021 will create a starting point for future EOSC development [21].

MVE defines the EOSC Core, which is designed to provide a federated data exchange environment for research projects and communities in which data comply with the FAIR principles. EOSC Core includes the following components/functionalities:

  • Shared Open Science policy framework.

  • Authentication and Authorization Interoperability framework.

  • Data Access framework.

  • Service Management and Access framework.

  • A minimum legal metadata framework.

  • An open metrics framework.

  • PID framework and service.

  • Portal providing web access to the EOSC services and offering Catalog and Marketplace services.

The further EOSC development based on MVE (which we can refer to as EOSC-1) will require designing a new type of infrastructure that can benefit from existing and emerging digital and infrastructure technologies.

3 Technology-Driven Science Transformation

3.1 Science Digitalization and Industry 4.0

Science digitalization is a demand of the time and is advised by the OECD report [1]. Science and industry digitalization make it easier to exchange technologies, solutions, and applications and to adopt recent industry trends such as Industry 4.0 [22] and platform-based ecosystems.

Industry 4.0 will bring tremendous changes to both business models and the way future factories will operate. The key Industry 4.0 elements that both empower new data economy and will be facilitated by the new business and consumer models include Cyber-physical systems; Internet of Things; Internet of services; Smart factories; Mobile technologies and highspeed access networks; Cloud Computing and distributed data processing; Big Data; Artificial Intelligence and Machine Learning; and Automation, Robotics, and Digital Twins.

The digital nature of the ongoing economic transformation opens an opportunity for faster exchange of technologies and solutions with science and research. Science can benefit from massive investments in industrial digital and data-driven technologies that can be directly used in digital science, in particular for experimental research automation and subsequent data processing and management. The scientific community should follow these developments and be open to wider use of technologies that are advanced by industry; in fact, all technologies powering Industry 4.0 can be effectively used both in the Future SDI and in domain-specific scientific applications.

3.2 Transformational Role of Artificial Intelligence

Similar to Industry 4.0, Artificial Intelligence will have a strong transformative effect on future science [23]. Benefits that AI can bring to scientific research and SDI include, but are not limited to:

  • Extending possibilities of research when working with big data.

  • Automating data preparation, processing, and analysis.

  • Smart infrastructure and tool operation and management.

  • AI-driven and Machine Learning-powered scientific discovery and decision support, digital model creation (Digital Twins).

  • AI-powered self-learning assistant to a researcher/scientist capable of creating domain-related intelligence; many research questions will be pursued semi-automatically [24].

  • Role of data will change: the learned model will replace data; theory becomes data for next-generation AI [24].

It is recognized that the effective operation of AI and ML technologies is critically dependent on the quality of data and their availability at all stages of the AI lifecycle [25]. This will impose specific requirements on the FutureSDI, including general compute and storage, distributed federated ML algorithms, edge computing, and high-speed access networks.

Consistent data management including FAIR compliance, quality assurance, data lineage, and privacy protection are general preconditions for successful AI implementation [26].

3.3 Promises of 5G Technologies

5G technologies promise to solve not only high-speed mobile communication for smart(phone) applications but also e2e land/terrestrial network communication. 5G architecture defines three main future use cases (or usage scenarios) that can be adopted by the FutureSDI [27]:

  • Enhanced Mobile Broadband (eMBB): this also covers IoT, robotics, and sensor network.

  • Massive Machine Type Communications (mMTC) to support HPC and large-scale distributed data processing.

  • Ultra Reliable and Low Latency Communications (URLLC): industry automation, process control, and real-time applications.

To address these use cases and corresponding requirements, 5G architecture offers e2e network slicing technology that allows provisioning isolated virtual overlay networks using Network Functions Virtualization (NFV) and the cloud native services deployment model and mechanisms. In addition to slice isolation, the 5G architecture also offers a consistent security model that enables a Trusted Execution Environment (TEE) [28] for running secure and trusted services by using the hardware Root of Trust (an idea that originates from the Trusted Computing Platform architecture [29]).

3.4 Adopting Platform and Ecosystems Business Model for Future SDI

The platform economy [30, 31] and digital ecosystems [32] are the two trends shaping ongoing transformation of the modern economy facilitated by digitalization. The wide adoption of the platform business and operational model (as an alternative to the pipeline model) facilitates the creation of the value chain between producers and consumers when using (composable) platform services powered by extended data collection and availability from the platform providers. This allows creating consistent business-oriented digital ecosystems as loose associations of stakeholders and capabilities instantiated on the platform provider facilities. An ecosystem has members that interact in the context of a defined set of services and offerings.

TeleManagement Forum (TMF) defines the Open Digital Architecture (ODA) [33] and the Digital Platform Reference Architecture (DPRA) [34], where the infrastructure provisioning component is defined as the Actualization Platform whose architecture is illustrated in Fig. 15.1.

Fig. 15.1 Main functional components of the TMForum Actualization Platform as a core part of DPRA (adopted from [34]). © TM Forum 2020

The Actualization Platform includes the following essential (group of) components:

  • Common infrastructure and platform services.

  • Data and digital content (media) services.

  • Integration and Lifecycle Management.

  • Integration, orchestration, and DevOps.

  • Security and Identity Management.

  • Core commerce services including Catalog, Accounting and Billing, Fulfillment Platform components, and customer/tenant facing services.

The Fulfillment Platform defined in DPRA “allows for user/service configuration and activation data to be sent for each individual component service, and also for fully composed product offers (of the customizable templates or design patterns). It allows a product creator to configure (fulfill) a service that is being composed into an e2e offer—this could involve adding an end-user (authorization credentials, establishing an account), or any other actions required for configuration management” [34].

ODA and DPRA have been adopted by many telecom providers, and FutureSDI can benefit from adopting them to create instant virtualized RIs and ecosystems for specific user communities.

3.5 Other Infrastructure Technologies and Trends

The following are recent technologies that can be adopted to build the Future SDI:

  • Cloud-based federated hyperconverged infrastructure allowing for provisioning on-demand secure private infrastructure [35]

  • IDSA architecture and IDS Trusted Connector enabled data exchange infrastructure [36, 37]

  • Infrastructure automation technologies and tools (virtualization, microservices, composability, containerization, code libraries, API).

  • DevOps and CI/CD, which tend to become integrated into the change management process to ensure the continuous evolution of the target system [38]

  • Data-centric models DataOps/MLOps (whose examples are services offered by Azure cloud platform) [39, 40]

  • Semantic Data Lakes as integrated data storage and data analytics platform (whose example is Azure Data Lake gen2 that offers storage for heterogeneous data and provides integrated data analytics) [41, 42]

  • Permissioned blockchain technologies that allow for traceable and policy enforceable data sharing and lineage [43]

  • Infrastructure-related security technologies that propose solutions for trust bootstrapping and creating secure trusted virtual execution environment for data processing (such as Confidential Computing or secure enclave computing [44, 45]).

4 Defining Future Scientific Data Infrastructure

4.1 Paradigm Change in Modern Data-Driven/Digital Science

Ongoing Science digitalization is powered by the rapid development of Cloud Computing, Big Data, Artificial Intelligence, and DevOps-based infrastructure automation technologies.

The FutureSDI should consolidate existing and future RIs focusing on specific scientific domains and minimize costs and efforts of creating specialized RI for different scientific communities. Achieving MVE/EOSC-1 will create a platform for FAIR data interoperability and sharing, a key step in the future digital transformation of science.

Here we summarize the main characteristics of the (future) digital science powered by recent advancement in data-driven technologies and AI (also refer to our previous analysis [10]):

  • Availability of Pan-European Research Infrastructure Platform as a Service (later defined as PRIaaS) that uses cloud-native technologies (S/P/IaaS) for on-demand provisioning of the fully operational infrastructure for end-to-end scientific research (both experiments and data processing) by using composable infrastructure and application design templates, supported by DevOps tools.

  • Automation of scientific experiments and all data handling processes, including data collection, storing, classification, pre-processing and curation, and provenance.

  • Adopting and leveraging DevOps and DataOps/MLOps technologies, which have found rapid adoption in industry and are supported by a variety of tools available with cloud-based infrastructure platforms such as AWS, Azure, and Google Cloud Platform, and from multiple vendors.

  • Digitizing existing artifacts and creating their digital twins, AI-assisted documenting and cataloging, building subject/domain knowledge base using self-learning algorithms.

  • The full adoption of the FAIR data principles, both prospective and retrospective, to ensure reusability of available data/datasets in the cross-domain and secondary research.

  • Adopting STREAM data properties and corresponding infrastructure to enable trusted multipurpose data sharing and exchange, including data trading as economic goods and enabling different economic models for data sharing.

  • Availability of new algorithms for distributed secure data processing such as federated machine learning, or blockchain-enabled policy-aware distributed data processing.

  • Global data availability and access over the network for cooperating groups of researchers, including wide public access to scientific data, subject to data sharing and access policies and, in particular, to GDPR.

  • Advanced security, access control, and identity management technologies that ensure the secure operation of the complex research infrastructures and scientific instruments and allow creating a trusted secure environment for cooperating groups and individual researchers.

The future SDI should support the whole data lifecycle and explore the benefit of data aggregation and provenance at a large scale and during a long/unlimited period of time.

Data security is not limited to secure and trusted storage but also requires a secure and trusted data processing environment that would allow data processing using proprietary algorithms. Demand for RI trustworthiness and security is increasing to address both personal data protection and the trustworthiness of the research process itself. Data infrastructure must ensure data security (integrity, confidentiality, availability, and accountability), trustworthiness, and, at the same time, data sovereignty, which includes both data ownership protection and control of data sharing and processing by data owners. It should be possible to enforce data/dataset policy (sharing, processing, derivative/secondary data) in the distributed data storage, sharing, and processing environment.

4.2 Timeline of the European RI Development/Evolution

In our research on the technologies for FutureSDI, we analyzed the development and evolution of the European RIs. Figure 15.2 below illustrates the timeline of the European RI evolution (based on the authors’ expertise and wide community discussions aligned with technology evolution and trends). It covers the past stages (Centralized, Interconnected, Distributed, and Federated), the current stage labeled EOSC-I (in effect implementing EOSC Core), and the foreseen future stage labeled EOSC-II. Table 15.1 provides extended details about technologies that are suggested to drive the transition from EOSC-I to EOSC-II.

Fig. 15.2 Timeline of RI evolution and SLICES positioning

Table 15.1 Details of the technologies used in current EOSC-1 and future EOSC-2

Past stages (before EOSC) delivered Federated Research Infrastructures supporting inter-organizational and interdomain cooperation and data sharing using well-defined metadata ensuring data interoperability, although in many cases limited to a single science domain. Examples of such RIs are EGI, EUDAT, GEANT, PRACE, and other landmark RIs as reviewed in the ESFRI Roadmap 2018 [46]. The European Open Science Cloud (EOSC) provides a basis for European RI integration and interoperability based on adoption of the FAIR principles both for data and for RIs themselves. The H2020 EOSC-hub project established and operates the EOSC Portal, which offers Catalog and Marketplace services that enable service and data findability, interoperability, and reusability based on published APIs [47].

Future progression and adoption of modern technologies such as Cloud and Edge Computing, Big Data, AI, IoT, and Digital Twins will enable a fully virtualized Pan-European RI Platform as a Service (PRIaaS) that will allow virtualized RI provisioning on demand for a specific scientific domain and community; advanced data management and processing technologies will allow full FAIR principle implementation and trusted data exchange, supporting the whole data lifecycle and value chain with the necessary infrastructure services. Adoption of 5G technologies is expected to begin with a preparatory phase during the EOSC-I stage in some individual projects and testbeds; in the future, 5G will become the main enabling technology for virtualizing/slicing network and RI, combined with the Virtual Private Cloud (VPC) [48] technologies supported by modern cloud platforms.

The envisioned PRIaaS definition leverages the TMForum DPRA concepts and principles that define the provider actualization platform as a way to enable provisioning customer-tailored services platform/ecosystem on demand.

The recently started SLICES-DS project [9] intends to bridge the current EOSC-I stage and the future EOSC-II stage by advancing infrastructure technologies toward fully virtualized, customized, domain-specific RI provisioning on demand. Many modern advanced and emerging technologies need to be tested, adopted, and prototyped to make them easily usable by different RIs and embedded into the PRIaaS platform (see Sect. 15.5 for PRIaaS architecture).

4.3 General Requirements to Future Data-Driven Research Infrastructures

From the overview above, we can specify the following general infrastructure requirements to the future Scientific Data Infrastructure:

  • Cloud-based platform for provisioning (on-demand) instant RIs, fully configured and functional including Virtual Organization for user management

  • Support of virtual scientist communities, addressing dynamic user group creation and management, federated identity management—to enable cooperation and support scientific workflows

  • Support FAIR data principles by providing necessary metadata services and data sharing facilities

  • Secure trusted data infrastructure, ensuring data sovereignty and trustworthiness, supporting STREAM data properties for effective and value-added data exchange [49]

  • Support long-running experiments and large data volumes generated at high speed

  • Trusted environment for data storage and processing

  • Support for data integrity, confidentiality, accountability, provenance, sovereignty

  • Mechanisms for policy binding to data to protect privacy, confidentiality, and IPR that ensure the policy is attached to data during the whole data lifecycle; mechanisms for policy provisioning and roaming as part of the provisioned infrastructure to ensure policy enforcement by design in a diverse heterogeneous environment.
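The policy-binding requirement in the last item can be illustrated with a minimal sketch. It assumes a hypothetical `PolicyBoundData` wrapper and a simple `DataPolicy` record; neither name comes from the cited work or from any existing library. The only point shown is that a derived dataset inherits the policy of its source, so the policy stays attached to the data throughout its lifecycle.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class DataPolicy:
    """Hypothetical policy record attached to a dataset."""
    owner: str
    allowed_purposes: FrozenSet[str]
    allow_derivatives: bool = True


class PolicyViolation(Exception):
    """Raised when a requested operation is not permitted by the policy."""


@dataclass
class PolicyBoundData:
    """Hypothetical wrapper that keeps payload, policy, and lineage together."""
    payload: List[float]
    policy: DataPolicy
    lineage: List[str] = field(default_factory=list)

    def process(self, purpose: str, step: str,
                fn: Callable[[List[float]], List[float]]) -> "PolicyBoundData":
        # Enforce the policy before any processing takes place.
        if purpose not in self.policy.allowed_purposes:
            raise PolicyViolation(f"purpose '{purpose}' not allowed")
        if not self.policy.allow_derivatives:
            raise PolicyViolation("derivative data not allowed")
        # The derived data carries the same policy and an extended lineage.
        return PolicyBoundData(fn(self.payload), self.policy,
                               self.lineage + [step])


raw = PolicyBoundData([1.0, 2.0, 3.0],
                      DataPolicy("lab-a", frozenset({"research"})))
scaled = raw.process("research", "scale-x2", lambda xs: [2 * x for x in xs])
```

In a real infrastructure the binding would have to be cryptographic or enforced by a trusted execution environment; the in-memory wrapper here only illustrates the propagation rule.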

5 Proposed PRIaaS Architecture Model

We propose the PRIaaS Architecture for FutureSDI as illustrated in Fig. 15.3. This model contains the three generalized layers:

Fig. 15.3 The proposed PRIaaS architecture

Virtualized Resources (VR): Virtualized general compute, storage and network resources that are composed to create infrastructure components and are used by other services and applications.

Actualization Platform: This is the main component and layer that enables provisioning, monitoring, and operating fully functional instant Virtual RIs for specific scientific domains, projects, or communities.

Virtual (Private) RI (VirtRI): Virtual RI provisioned on demand that contains a full set of services, resources, and policies needed to serve the target scientific community and create the full value chain of data handling. VirtRI is operated by the specific community and uses services provided by the Actualization Platform, including the possibility of cross-platform data sharing.

Users and external resources include researchers, developers and operators, and external datasets.

The Federation Access Infrastructure and Tenants Management (FAI&TM) layer serves as an interface layer enabling communication between distributed Actualization Platform resources and services and the generally distributed and multiorganizational VirtRI. FAI&TM is also the place where VirtRI and Actualization Platform policies are enforced and managed.

5.1 Actualization Platform Components

The PRIaaS Actualization Platform includes the following groups of services required to develop, deploy, manage, and operate the Virtual RI during its whole lifecycle, including resources and users that can be grouped into Virtual Organizations.

  • Core Infrastructure Services (IaaS & PaaS) including compute, storage, network, IoT&Edge, blockchain, Access Control and Federated Identity management, infrastructure security.

  • Data Services including directory, metadata/PID, lineage/provenance, FAIR &QA, semantic data lakes, data analytics, and AI tools.

  • Management and Operation including Service Catalog and Lifecycle Management, orchestration, and management.

  • Service provisioning and fulfillment including user provisioning, SLA management and policy provisioning.

  • Development Environment and Tools that support DevOps process related to platform and VirtRI development, provisioning, and operation; this group also maintains the repository of API, containers, and design templates that can facilitate VirtRI design and provisioning.

VirtRI provisioning process is based on well-known and commonly used DevOps tools and is supported by the Management and Operation functions. As the PRIaaS platform will progress, the repository of the design patterns, templates, and containerized applications and functions will grow. A starting point for such a repository can be the EOSC Catalog [47] that already contains information about API for applications and services offered by existing RIs and service providers.
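The template-based provisioning described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PRIaaS implementation: `DesignTemplate`, `VirtRI`, `provision_virtri`, and the catalog entries (including the `registry.example.org` references) are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DesignTemplate:
    """Hypothetical VirtRI design template: a named list of required services."""
    name: str
    services: List[str]


@dataclass
class VirtRI:
    """Hypothetical provisioning result: resolved services for one community."""
    community: str
    deployed: Dict[str, str]


def provision_virtri(community: str, template: DesignTemplate,
                     catalog: Dict[str, str]) -> VirtRI:
    """Resolve every service named in the template against the catalog."""
    missing = [s for s in template.services if s not in catalog]
    if missing:
        # Fail early rather than provision a partial infrastructure.
        raise LookupError(f"services not in catalog: {missing}")
    return VirtRI(community, {s: catalog[s] for s in template.services})


# Illustrative catalog; the artifact references are made up.
catalog = {
    "storage": "registry.example.org/storage:1.0",
    "metadata-pid": "registry.example.org/pid:2.1",
    "analytics": "registry.example.org/analytics:0.9",
}
template = DesignTemplate("basic-data-ri", ["storage", "metadata-pid"])
virtri = provision_virtri("climate-community", template, catalog)
```

Failing on the first unresolved service reflects the idea that a VirtRI is delivered fully configured and functional; a real orchestrator would additionally handle deployment, monitoring, and rollback.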

The policy provisioning, management, and enforcement are important functions of the Actualization Platform that can be attributed to the Fulfillment function. The policy that is defined by the target community is provisioned as a part of VirtRI provisioning. Policy management and enforcement infrastructure should support policy roaming and combination for the multi-domain distributed resources and tenants.
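The text does not prescribe how policies from several domains are combined. One common option is a deny-overrides rule, a combining algorithm known from XACML; the sketch below uses it purely as an assumption. The two policies and the request attributes (`role`, `action`, `region`) are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class Decision(Enum):
    PERMIT = "permit"
    DENY = "deny"
    NOT_APPLICABLE = "not_applicable"


# A policy is modeled as a function from a request to a decision.
Request = Dict[str, str]
Policy = Callable[[Request], Decision]


def deny_overrides(policies: Iterable[Policy], request: Request) -> Decision:
    """Any DENY wins; otherwise PERMIT if at least one policy permits."""
    result = Decision.NOT_APPLICABLE
    for policy in policies:
        decision = policy(request)
        if decision is Decision.DENY:
            return Decision.DENY
        if decision is Decision.PERMIT:
            result = Decision.PERMIT
    return result


def community_policy(req: Request) -> Decision:
    # Hypothetical VirtRI community rule: members may read or export data.
    if req.get("role") == "member" and req.get("action") in ("read", "export"):
        return Decision.PERMIT
    return Decision.NOT_APPLICABLE


def provider_policy(req: Request) -> Decision:
    # Hypothetical resource-provider rule: no export outside the EU.
    if req.get("action") == "export" and req.get("region") != "EU":
        return Decision.DENY
    return Decision.NOT_APPLICABLE


policies = [community_policy, provider_policy]
read_decision = deny_overrides(policies, {"role": "member", "action": "read"})
export_decision = deny_overrides(
    policies, {"role": "member", "action": "export", "region": "US"})
```

With deny-overrides, a restriction set by any resource provider cannot be relaxed by a more permissive community policy, which fits the data sovereignty requirement; other combining rules would trade this guarantee for flexibility.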

6 Research Data Management in the Future SDI

6.1 European-Wide and International Initiatives and Projects

The importance of data and research information sharing has been central to a number of European-wide initiatives and projects, such as Open Access, Open Data, Open Science, and Open Commons. The Research Data Alliance (RDA), created in 2012 jointly by the National Science Foundation of the USA (NSF) and the European Commission, became a key community coordination body to exchange and develop best practices in research data management. One of the important RDA developments was Persistent Identifiers (PID) for data objects, which enable data interoperability and findability [50].

To facilitate research data sharing and implementation of the FAIR principles, the European Commission started the Open Research Data (ORD) Pilot [51], and currently all EU-funded projects are required to develop and implement a Data Management Plan (DMP) at the initial stage of the project. Data produced in the project must be stored in openly available but secure repositories (operated by the project or using national or European data archive services). Metadata must be published and the quality of data ensured, in particular compliance with the FAIR principles.

6.2 From FAIR Data Principles to STREAM Data Properties

FAIR data principles are important for creating trusted research-friendly environment for data sharing. FAIR data is a key element/layer of the EOSC core. However, the data exchange infrastructure requires additional data properties that would allow trusted and economical data exchange, also supporting data value chain creation.

Data exchange and data trading/markets have long been an area of interest for industry, where data also represent companies’ intellectual property and companies want to remain in control of their data, which is defined as data sovereignty.

Data Sovereignty is a key principle of the industrial data exchange as defined by the International Data Spaces Association (IDSA) Reference Architecture Model (RAM) [36].

Data involved in industrial processes and business relations are becoming a part of the economic relations and added value creation process. However, data as economic goods are in many aspects different from the traditional economic goods and commodities. We refer to our research on data properties as economic goods as part of the RDA Interest Group on Data Economics (IG-DE) [52].

The emerging data-driven economy and modern Big Data technologies stimulate interest in making data a new economic value (data commoditization) and, consequently, in identifying new properties of data as economic goods. The STREAM data properties for industrial and business data have been proposed by the authors in [49]. To become economic goods and bring business value to data producers and data consumers, data must be [S] sovereign, [T] trusted, [R] reusable, [E] exchangeable, [A] actionable, and [M] measurable.
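As a minimal illustration, the six STREAM properties can be recorded as a per-dataset self-assessment. The `StreamProfile` class and its methods are hypothetical names introduced here; how each property would actually be verified is outside the scope of this sketch.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class StreamProfile:
    """Hypothetical self-assessment of a dataset against the STREAM properties."""
    sovereign: bool = False
    trusted: bool = False
    reusable: bool = False
    exchangeable: bool = False
    actionable: bool = False
    measurable: bool = False

    def missing(self) -> List[str]:
        """Names of the properties that are not yet satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def acronym(self) -> str:
        """Initial letter for each satisfied property, '-' otherwise."""
        return "".join(f.name[0].upper() if getattr(self, f.name) else "-"
                       for f in fields(self))


profile = StreamProfile(sovereign=True, trusted=True, reusable=True)
gaps = profile.missing()
summary = profile.acronym()
```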

Other data properties important for enabling data commoditization and allowing data trading and exchange for goods include quality, value, auditability/trackability, branding, and authenticity, as well as the original FAI(R) properties: findability, accessibility, interoperability, and reusability. Special features that must be managed in all data transfers and transformations are data ownership, IPR, and privacy. The property of non-rivalry, which originates from the digital form in which data exist, makes data exchange (copying, distribution) easy on the one hand but, on the other hand, creates a problem when protecting proprietary, private, or sensitive data or IPR.

7 Future Research and Development

In this paper, we presented an analysis of current trends in digital technologies that can be used to build the Future Scientific Data Infrastructure and, in particular, to advance the current EOSC infrastructure, and we proposed a common platform for future European RI integration. Further research will require a closer analysis of the typical use cases in ESFRI and EOSC projects. The presented research and the proposed PRIaaS are based on the authors’ long-time experience in infrastructure research and in developing/implementing practical solutions in a number of national and EU-funded projects such as EGEE, GEANT, and GEYSERS, as well as on standardization activity in such bodies as IETF, OGF, NIST, and CEN.

The proposed PRIaaS architecture and DPRA-inspired operational model require a variety of technologies to work together realizing data-centric data exchange and transformation to enable data-based applications and services and added value data service creation. New functionality and technology combinations will require re-thinking existing concepts and models, extending usage scenarios.

Further development of the proposed PRIaaS and its components will be done in the ongoing project SLICES-DS. This work also intends to contribute to the EOSC Architecture Working Group.