1 Introduction

Gaining a deep and factual understanding of the business and its underlying operations is crucial for enterprises to support their daily decisions and stay competitive. Despite the plethora of datasets, enterprises typically do not utilize all aspects of the available datasets due to poor dataset management, unclear governance, inaccessible data silos, and lack of ownership. As a result, the value of the datasets stored in enterprise data warehouses and lakes has yet to be fully realized. Moreover, the decision success rate is put at risk, and potential business opportunities are likely to be missed.

Datasets are constantly generated from various sources and stored in different structured and unstructured formats on scattered storage systems. For instance, Google reported in 2016 that its catalog contains over 26 billion datasets available to all Google employees, excluding datasets with restricted access; the total number across all dataset types is expected to be more than double that figure [1]. Thus, effective dataset management is critical for decision-making, answering business questions, dataset governance, and gaining a competitive advantage. In addition, it helps enterprises identify trends and spot problems early by making intelligent use of the available datasets.

It has been the traditional wisdom for enterprises to manage their datasets centrally [1,2,3,4,5]. This dataset management paradigm provides better control, reduces redundancy, and simplifies data management. However, these centralized approaches create an ineffective monolithic architecture as large enterprises’ datasets grow continuously. This issue may lead to the creation of data silos and duplication of effort. For example, a produced dataset may not be easily accessible, or two domains might build the same products, such as dashboards or machine learning models. This happens due to the absence of a clear communication channel within the enterprise as a whole.

Since datasets are naturally distributed, consumed and produced by various sectors across the enterprise, centralized solutions fail to meet the enterprise’s requirements. Datasets from multiple domains and sources are stored with minimal context. Teams that manage the datasets lack domain expertise and a strong sense of ownership. This issue may impact dataset quality, allow unauthorized access, and result in data silos and poor decision-making. Moreover, the tasks of ingesting, cleaning, processing, and providing datasets for consumers become the center of focus rather than the data itself. Scaling and managing these tasks is problematic. On one side, the proliferation of dataset sources increases the complexity of managing them centrally. On the other, the requirements and use cases of dataset consumers keep growing to meet the needs of the enterprise. Furthermore, fulfilling these requirements may require changes that impact the whole data pipeline and delay data request deliveries.

Recently, a new data management architecture called data mesh has been proposed [6, 7]. Specifically, data mesh comes into play to resolve the bottleneck of an overwhelmed centralized data team that must answer all the analytical questions coming from different domains. The architecture therefore decentralizes the team, moves the data closer to the domain owner, and treats the data as a product. Data governance is federated across business domains through mechanisms including global policies such as data formats, locations, and access control.

As data mesh is an emerging data management paradigm, existing research has focused on defining its concepts, suggesting guidelines, and providing reference frameworks [8, 9], with limited practical implementation efforts [10]. In particular, the inter-domain communication and governance challenges of bridging decentralized business domains have yet to be adequately addressed. Current mechanisms focus on generic communication and information-sharing aspects rather than providing an approach that fits the data mesh paradigm and its principles.

In this paper, we propose a blockchain-based framework, inspired by the data mesh architecture, to manage datasets and resolve the issues of centralized dataset management. The proposed framework helps the enterprise maximize the value of the datasets across its domains and govern the dataset-sharing process. Data mesh and blockchain synergize very well, as they are both distributed and decentralized. Blockchain technology fulfills several data mesh requirements and principles, as its features support the data mesh architecture in several ways. First, blockchain enables parties to communicate and share data efficiently and transparently. Second, blockchain’s smart contracts facilitate the data-as-a-product principle. Third, governance, especially across domains, is a significant issue in the data mesh that blockchain helps to address. We demonstrate this synergy by designing a blockchain-powered framework to manage enterprise datasets with a data mesh architecture.

More specifically, the proposed framework turns the datasets into digital assets and the business domains within the enterprise into peers in a decentralized network. The peers interact with each other through a smart contract that facilitates different functionalities such as dataset cataloging, dataset sharing, and tracking the dataset across the enterprise from creation to usage until archiving or destruction. In essence, a smart contract acts as a data contract that can be established between dataset producers and consumers in a governed procedure.

The main contributions of this paper are summarized as follows:

  1. Realizing data mesh architecture based on blockchain technology.

  2. Proposing a blockchain-powered framework for managing enterprise datasets efficiently with detailed qualitative analysis.

  3. Designing a generic smart contract with different transactions to support the data-driven enterprise functionalities.

  4. Developing a proof-of-concept with publicly available source code.

The rest of this paper is organized as follows: Sect. 2 provides a thorough background, and Sect. 3 reviews the related work. In Sect. 4, we explain the features of the proposed framework and detail its design aspects. Section 5 provides a qualitative analysis of the properties of our proposed framework. Section 6 details the implementation choices for a proof-of-concept of the proposed framework. In Sect. 7, we discuss some of the applications and challenges of our framework. Section 8 concludes the paper.

2 Background

We provide background information about data mesh and blockchain to establish the proper context for our proposed blockchain-powered data mesh framework to manage datasets. Additionally, we describe the data fabric concept, representing a centralized data management architecture, as a related alternative to data mesh.

2.1 Blockchain

Blockchain technology is a distributed ledger that securely stores data entries. It enables entities to exchange information and interact without relying on a centralized governing body [11]. The ledger consists of blocks that store data entries and are bundled together using cryptographic protocols to ensure their integrity [12]. The nodes in the blockchain employ consensus mechanisms to validate and agree upon transactions, ensuring efficiency, fairness, reliability, and security.

Blockchain networks possess several characteristics that make them well suited for several applications [13]; these characteristics include:

  • Decentralization: There is no central authority responsible for validating and approving ledger records in the blockchain.

  • Immutability: Records stored in the blockchain are permanent and cannot be changed, edited, or deleted by any network node.

  • Transparency: All nodes in the blockchain network maintain a complete and auditable copy of the transaction ledger.

  • Traceability: All transactions can be traced, enabling the retrieval of a comprehensive history for any given record.

Generally, blockchain networks can be classified into two types based on their accessibility and level of control: public and permissioned. Public blockchains, like Bitcoin and Ethereum [14, 15], allow anyone to join without restrictions. In contrast, permissioned blockchains (also known as private blockchains) restrict access to known participants. The characteristics of blockchain networks may differ between these two types. Public blockchains are often more complex due to their openness, necessitating careful design and consensus mechanisms that may impact scalability and performance. Additionally, public blockchains may not be suitable for sharing sensitive information, such as enterprise data, as shared records are visible to all participants [16]. Conversely, permissioned blockchains are better suited for sharing sensitive data and are less susceptible to attacks due to their restrictive access and the known identity of the network participants [17].

One significant aspect of blockchain technology is the concept of smart contracts [18]. Blockchain smart contracts are self-executing contracts where the agreement’s terms between two parties are directly encoded into code. These contracts and their agreements exist on a decentralized blockchain network. Similar to legal agreements, smart contracts contain predetermined terms and conditions that the contract parties have agreed upon. Smart contracts are automatically executed when the agreed-upon conditions are met, without the need for a central authority [19]. This automation results in a more efficient, secure, and transparent process, as the contract terms are recorded on the blockchain and can be audited and validated by any party on the network.
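To make the notion concrete, the toy Go sketch below (Go is also the language of our prototype’s smart contract in Sect. 6) models an agreement whose terms are encoded as a predicate over some state and whose action executes automatically once the terms hold. All names and structures here are illustrative only and are not part of any blockchain API.

```go
package main

import "fmt"

// Agreement models a smart contract in miniature: predetermined terms
// (a condition over observable state) plus an action that runs
// automatically, without a central authority, once the terms are met.
type Agreement struct {
	Terms  func(state map[string]int) bool // agreed-upon condition
	Action func()                          // effect executed on success
}

// Execute checks the terms against the current state and, if satisfied,
// triggers the action. On a real blockchain, this evaluation would be
// replicated and validated by every node in the network.
func (a Agreement) Execute(state map[string]int) bool {
	if a.Terms(state) {
		a.Action()
		return true
	}
	return false
}

func main() {
	// Toy agreement: release a payment once at least 100 units are escrowed.
	pay := Agreement{
		Terms:  func(s map[string]int) bool { return s["escrow"] >= 100 },
		Action: func() { fmt.Println("payment released to seller") },
	}
	pay.Execute(map[string]int{"escrow": 50})  // terms not met: no effect
	pay.Execute(map[string]int{"escrow": 120}) // terms met: action runs
}
```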

2.2 Emerging Data Management Paradigms

The evolving and continual challenges of data handling [20] have led to the emergence of innovative data management designs and architectures. Some of these architectures represent a paradigm shift from existing data management approaches in order to address their challenges. The following subsections describe two emerging data management paradigms: data mesh and data fabric. Data fabric comes from a data engineer mindset focusing on data processing aspects, such as data integration and lineage. In contrast, data mesh comes from a data owner mindset that focuses on data product aspects, such as data product deliveries. Since there is no one-size-fits-all solution, we briefly discuss their advantages and disadvantages to help organizations choose the best data management approach depending on their goals and circumstances.

2.2.1 Data Mesh

Data mesh is an emerging data management paradigm [6, 7] that provides a domain-driven design to address the issues of centralized data platforms. It focuses on data as products that are decentralized and distributed across independent and loosely coupled business domains. Data mesh is more suitable for large-scale enterprises or cases where data are mostly used within business domains, providing enhanced adaptability to evolving needs, superior scalability, and increased agility. However, there are potential drawbacks, such as inconsistent data practices, coordination and collaboration complexities, and reliance on standardization. We describe the data mesh principles to illustrate these aspects.

The data mesh approach stands on four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. These concepts drive a new logical view of the technical architecture and organizational structure. We provide an overview of each of these principles.

  • Domain-oriented decentralized data ownership and architecture:

    Domain-driven design is a method of decomposing components around business domains. It enables a modular design of software microservices in bounded contexts defined by business domains. This design generates components of microservices that are loosely coupled and scalable [21]. In the same spirit, data mesh decomposes data ownership and architecture into decentralized domain-oriented components. Defined domains become responsible for managing their own datasets and providing them to consumers. Data consumers may pull datasets from different domains. These domains become the architectural components of data management rather than data pipeline stages in a centralized data approach. It is important to note that data pipelines are still present in the data mesh architecture but are decentralized and distributed inside the producing and consuming domains as internal implementation.

  • Data as a product:

    Treating data as a product helps to resolve issues that prevent utilizing it properly. These issues relate to qualities such as understandability, discoverability, security, and trustworthiness. When considering data as a product, organizations allocate the roles and set the necessary key performance indicators (KPIs) to ensure that the data products are produced and consumed properly to serve the business needs. Data products may be facilitated with data contracts [22] to ensure proper usage and provide necessary guarantees from data owners. A data contract is a formal agreement between data producers and consumers, including product description, scope, access mechanism, constraints, compatibility issues, and service-level agreements (SLAs).

  • Self-serve data infrastructure as a platform:

    Similar to centralized data management platforms, data processing infrastructure is needed in a decentralized data mesh paradigm. In centralized approaches, data pipelines and infrastructure are the focus, with data being an outcome of them. In contrast, data pipelines and infrastructure are not the focus of the data mesh architecture. Data infrastructure is offered as a service for every domain. This design shift requires technology stacks and infrastructure that process the data with high-level abstract operations and simple provisioning. Offering self-serve data platform services eliminates the need for specialized developers and enables domain autonomy, which is not available yet in today’s data technology stacks.

  • Federated computational governance:

    The data mesh approach creates distributed and independent data products. However, data analytics use cases may require combining and linking data from various domains. For example, a use case may require combining datasets from the business services and manufacturing domains to optimize operations and enhance client satisfaction. Consequently, providing integration and interoperability operations through global standardization across all domains is essential for implementing the data mesh approach. These global decisions and standards should be managed by a federated governance model led by domain and data platform owners.

2.2.2 Data Fabric

Data fabric [23,24,25] focuses on centralization and unified data access. It creates a unified data integration layer by connecting existing data systems and components to provide a single source of truth for data, similar to the purpose of master data management [26]. It avoids the traditional problems of centralized systems by connecting data from different sources rather than moving the data itself to centralized storage. As a centralized architecture, data fabric simplifies data access, governance, standardization, and quality assurance.

Data fabric is ideal for enterprises with intertwined business domains or those that require a single source of truth for data. However, the centralized nature of the data fabric may result in potential bottlenecks, slower responsiveness to domain-specific needs, dependency on a centralized team, and scalability challenges. Furthermore, centralized data management can restrict innovation and experimentation, as teams may lack the autonomy to explore new technologies and approaches tailored to their specific domain requirements.

3 Related Work

We review the research work related to data mesh and discuss its limitations in Sect. 3.1. Next, we describe and compare related papers in the context of dataset management in Sect. 3.2.

3.1 Data Mesh

The data mesh paradigm, proposed by Zhamak Dehghani in 2019 [6, 7], emerged from industrial experiences in managing data and has so far received limited scientific study. Netflix and Zalando [27, 28] presented data mesh implementations in the industry. The authors of [29] described their industrial experience, justifying the migration from a centralized data warehouse to a decentralized data mesh architecture. In their case, data were primarily used within domains; inter-domain requests were served by a central metadata catalog, but no data-sharing mechanisms were provided. More recently, [30] conducted a gray literature survey of 114 industry articles to provide a comprehensive review of data mesh, and compared the service-oriented architecture to the data mesh due to their similarities. The survey answers research questions related to understanding the data mesh architecture, its benefits, architectural design choices, and challenges.

The authors of [8, 9] provided an overview of the data mesh approach and described its motivation and concepts. The work of [8] details different architectures for the data mesh paradigm. A follow-up work [10] focused on identifying potential technologies for implementation purposes and provided a proof-of-concept demonstration. They presented various software solutions and technologies that may serve the purposes of the data mesh architecture. For example, software products for data storage, processing, machine learning, and visualization were presented to support the self-serve data platform principle. For linking domains, the authors extended Apache Atlas to run as a catalog for the data mesh, created Slack channels for communication between data teams, and used Google Forums to request data access. Moreover, all teams follow a standard folder organization. These inter-domain communication methods are generic and ad hoc, which limits their usability in supporting the overall data mesh paradigm and its core principles.

Previous work fails to address the data mesh’s concerns of governing and bridging decentralized domains. The proposed solutions in that context focused on communication and information-sharing aspects rather than providing an approach that fits the data mesh paradigm and its principles, such as data as a product. Consequently, the components of the proposed solutions are not cohesively integrated. In contrast, our proposed blockchain-powered framework offers a practical solution that links and governs the data mesh’s independent domains and products. Both the data mesh and blockchain are fundamentally decentralized. This core similarity makes blockchain technology a natural fit to support the data mesh principles and solve its decentralization challenges.

3.2 Dataset Management

There has been a variety of research work focused on managing datasets. On a conceptual level, the authors of [31] envisioned an ecosystem, namely Agora, to provide access to data pipeline assets, such as data, algorithms, and computing resources, through marketplaces focusing on these assets as core components. Agora aims to make data science and its technologies more accessible to general users, as they currently require considerable technical proficiency. The Agora architecture is built around assets and consists of two seamlessly connected layers: the Asset Layer and the Execution Layer. The asset layer serves as an advanced ecosystem of multiple asset markets; it facilitates not just the discovery and offering of assets but also their intelligent combination using asset managers. The execution layer, in turn, is responsible for running and optimizing asset execution with respect to a given budget. In contrast, the data mesh architecture addresses enterprise data management challenges and focuses on data as core elements rather than data pipeline components.

More tangibly, a research paper [32] proposed a data grid architecture to manage large scientific datasets. It defined requirements and components to integrate computation and storage infrastructure to process and manage large datasets. For cloud computing, [33] focused on minimizing storage costs of datasets, and [34] built a platform to manage large biomedical datasets. These papers focused on providing computational and storage infrastructure for managing datasets.

Another direction of research focused on providing tools and functionalities to manage datasets. Inspired by software version control systems, [35] proposed two integrated systems, similar to git and GitHub [36, 37]. The first system manages versions of datasets and facilitates sharing them. The second system works on top of the first one to provide advanced interaction capabilities and data pipeline tools for tasks, such as data cleaning, version search, integration, differencing, and visualization. These systems may complement the data mesh architecture by providing useful tools for use within domains and/or as part of the self-serve data platform component. Our work differs by focusing on the interactions between the decentralized domains of data mesh.

Several papers focused on building centralized systems to solve dataset challenges, such as data quality [4], data discovery, and metadata collection [1,2,3,4,5]. This line of work focused on the systems aspect of managing datasets and addressed specific issues related to the traditional centralized data management architecture. In contrast, our paper uses the data mesh approach, which represents a paradigm shift by placing the data itself at the center of the architecture rather than its managing systems. For example, Google presented GOODS [1], a project for large-scale enterprises that collects metadata of datasets scattered across various storage systems. GOODS creates a global data catalog enabling users to monitor and annotate datasets. It collects metadata from various sources in the background in a non-intrusive manner, allowing parties to freely generate datasets that are added to the catalog afterward for others to find. Moreover, GOODS provides a dashboard for data producers to display all their datasets and enables browsing them by facets (e.g., owner, data center, schema). Equipped with monitoring capabilities, GOODS can detect unexpected changes in data size, value distribution, and availability, alerting owners accordingly. GOODS tracks the origin and impact of datasets: it shows which datasets were used to create others (upstream) and which ones rely on them (downstream). For consumers, GOODS offers a search engine with filters to find relevant and current datasets. However, GOODS focused on data cataloging without covering other aspects, such as data contracts and governance, which this paper addresses.

4 Framework Design

Data mesh and blockchain synergize very well, as they are both distributed and decentralized. Blockchain technology fulfills several data mesh requirements and principles, as its features support the data mesh architecture in many ways. Following the decentralized data ownership principle of data mesh, blockchain enables parties to communicate and share data efficiently and transparently. Thus, it can facilitate communication and data sharing between different data domains in the data mesh. Participants can share data and execute transactions transparently and traceably across domains, supporting inter-domain visibility and accountability. Moreover, the requirement for a global data catalog is served by a data ledger replicated across all domains.

Blockchain’s smart contracts facilitate data as a product, in line with the second data mesh principle. Smart contracts enable data providers and consumers to agree on specific terms and conditions for data requests. A smart contract may digitally implement and manage data contracts between consumers and producers. Consequently, smart contracts clearly manage and define data products and services. Smart contract transactions enable participants to request a data product, view its metadata, agree on the terms of use, and more.

Governance, especially across domains, is a significant issue in the data mesh that blockchain may address. Blockchain technology facilitates governance among participants via defined consensus protocols implementing the federated computational governance principle of data mesh. These consensus protocols govern the execution of transactions to determine their validity based on specified conditions and approval requirements. Thus, the federation of governance between domains may be achieved with blockchain technology.

We demonstrate this synergy by designing a blockchain-powered framework to manage enterprise datasets with a data mesh architecture. Business domains are participating entities in the blockchain network to share data and exchange value. Data producers add datasets as data products that are managed by smart contracts. Data consumers belonging to various domains issue smart contract transactions to use these datasets and interact with them.

4.1 Framework Functionalities

Our proposed framework can provide a range of functionalities to support the data mesh architecture, including dataset ownership, governance, dataset sharing, and dataset cataloging. In this subsection, we highlight the key functionalities of the blockchain-powered framework that are essential to realize the data mesh architecture: regulating dataset ownership, sharing datasets, establishing data contracts, and maintaining the enterprise’s dataset catalog. It is important to note that we use the blockchain to store the metadata of datasets and information capturing their usage. Storing the datasets themselves in the blockchain is impractical and inefficient due to their large size.

A dataset is typically owned by the domain that produces it. Our framework supports ownership rights once the dataset reference is added to the blockchain by the domain owner. As a result, the framework gives control to the domain owner, who can explicitly specify, via smart contracts, the rules and conditions under which access should be granted (see Sect. 4.2 for details). Governance policies are therefore enforced automatically through the framework. By enabling producer domains to establish rules and conditions through smart contracts that meet the owner’s requirements, our framework maintains a robust governance structure.

On the other hand, dataset consumers can establish a data contract with the dataset’s domain owner. The data contract facilitates secure and transparent dataset exchange between domains. The framework inherently tracks the datasets and data contracts, as traceability is one of the main features of blockchain technology. Additionally, dataset cataloging is supported naturally to improve dataset collaboration and reduce duplication of effort. This integral aspect significantly streamlines data discovery, enabling full use of the shared datasets while optimizing resource utilization within the enterprise ecosystem.

Table 1: A proposed design of smart contract transactions and their descriptions

4.2 Smart Contract Transactions

Different smart contracts can be developed to support the framework functionalities. A generic smart contract contains various transactions that can be executed by dataset producers and dataset consumers. A valid execution of any transaction is based on satisfying the predefined terms and conditions. With the smart contract in place, transactions can be executed autonomously to enhance efficiency and compliance. For instance, the Sales business domain could publish a dataset about the customers’ purchasing activities in the last month, and the Marketing domain could request this dataset to develop a new marketing strategy. If the producer domain (the Sales domain) makes the dataset freely available, the consumer domain (the Marketing domain) accesses the dataset immediately. However, if the owner puts restrictions on the dataset for privacy concerns, the request must be approved by the owner (the Sales domain).

The deployed smart contract facilitates and streamlines the transaction execution within the blockchain network. Table 1 shows a design of the smart contract transactions for our framework, along with their inputs, outputs, issuing party, and approval requirements. The smart contract may be extended to include new transactions or modify existing ones to meet the new enterprise requirements. Thus, the framework supports the flexible deployment of different smart contracts that can accommodate all the different domains’ needs.

More specifically, datasets are converted into digital assets represented by JSON objects. Each dataset has a unique identifier. Figure 1 shows an example of a dataset JSON object. The dataset JSON object provides information about the dataset, such as name, description, producer, and the dataset location (URL). The collection of all dataset JSON objects represents a clear catalog of all enterprise dataset assets. These assets are managed by the deployed smart contract on which the agreed transactions can be executed.

Fig. 1: Example of dataset JSON object

Similarly, the data contracts between the dataset producer and the consumer are converted into digital assets, each represented by a JSON object related to the given dataset. Figure 2 shows an example of a data contract JSON object. The data contract JSON object primarily provides information about the data contract, such as contract ID, dataset ID, consumer, contract status, and the purposes of using the dataset.

Fig. 2: Example of data contract JSON object
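As a concrete illustration of the two asset types in Figs. 1 and 2, the Go sketch below constructs a plausible dataset object and a matching data contract object and prints their JSON forms. The field names follow the text above, while the specific values (e.g., the contract ID “DC1001” and the storage URL) are hypothetical; the exact schemas may differ from the figures.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dataset and DataContract mirror the JSON assets of Figs. 1 and 2.
// Field names follow the text; the exact schemas may differ.
type Dataset struct {
	ID          string `json:"id"`
	Name        string `json:"name"`
	Description string `json:"description"`
	Producer    string `json:"producer"` // owning business domain
	URL         string `json:"url"`      // dataset location
}

type DataContract struct {
	ContractID string `json:"contractId"`
	DatasetID  string `json:"datasetId"`
	Consumer   string `json:"consumer"` // requesting business domain
	Status     string `json:"status"`   // e.g., Pending, Active, Terminated
	Purpose    string `json:"purpose"`  // intended use of the dataset
}

func main() {
	ds := Dataset{
		ID:          "DS5002",
		Name:        "monthly-purchases",
		Description: "Customer purchasing activities for the last month",
		Producer:    "Sales",
		URL:         "https://storage.example.com/sales/monthly-purchases",
	}
	dc := DataContract{
		ContractID: "DC1001",
		DatasetID:  ds.ID,
		Consumer:   "Marketing",
		Status:     "Pending", // awaits the producer's approval
		Purpose:    "Developing a new marketing strategy",
	}
	for _, asset := range []interface{}{ds, dc} {
		out, _ := json.MarshalIndent(asset, "", "  ")
		fmt.Println(string(out)) // the JSON object stored on the ledger
	}
}
```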

To illustrate the framework functionalities, consider a domain that produces a dataset. The dataset reference can be added to the ledger by invoking the addDataset transaction on the deployed smart contract. The dataset is associated with the producer domain as its owner. Other domains, the consumers, can request this dataset by invoking addContract, which establishes a data contract between the producer domain and the consumer domain. The producer domain receives the request, which is subject to approval or rejection by the dataset owner (approveContract). In case of approval, the contract takes effect, and its status changes to active. Afterward, the dataset is available to the consumer. Figure 3 illustrates the process.
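The sketch below shows how the addDataset, addContract, and approveContract transactions might look as Hyperledger Fabric chaincode in Go (the stack used for the proof-of-concept in Sect. 6). The compact asset structs, the approval rule matching the caller’s MSP ID against the dataset producer, and the error messages are our assumptions for illustration, not the exact prototype code.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/hyperledger/fabric-contract-api-go/contractapi"
)

// Compact versions of the asset types from Figs. 1 and 2.
type Dataset struct {
	ID       string `json:"id"`
	Name     string `json:"name"`
	Producer string `json:"producer"` // MSP ID of the owning domain
	URL      string `json:"url"`
}

type DataContract struct {
	ContractID string `json:"contractId"`
	DatasetID  string `json:"datasetId"`
	Consumer   string `json:"consumer"`
	Status     string `json:"status"` // Pending -> Active / Rejected
	Purpose    string `json:"purpose"`
}

type SmartContract struct {
	contractapi.Contract
}

// AddDataset records a dataset reference on the ledger; the calling
// domain (identified by its MSP ID) becomes the owner.
func (s *SmartContract) AddDataset(ctx contractapi.TransactionContextInterface,
	id, name, url string) error {
	producer, err := ctx.GetClientIdentity().GetMSPID()
	if err != nil {
		return err
	}
	bytes, _ := json.Marshal(Dataset{ID: id, Name: name, Producer: producer, URL: url})
	return ctx.GetStub().PutState(id, bytes)
}

// AddContract opens a pending data contract from the calling consumer
// domain for an existing dataset.
func (s *SmartContract) AddContract(ctx contractapi.TransactionContextInterface,
	contractID, datasetID, purpose string) error {
	if data, err := ctx.GetStub().GetState(datasetID); err != nil || data == nil {
		return fmt.Errorf("dataset %s not found", datasetID)
	}
	consumer, err := ctx.GetClientIdentity().GetMSPID()
	if err != nil {
		return err
	}
	bytes, _ := json.Marshal(DataContract{ContractID: contractID,
		DatasetID: datasetID, Consumer: consumer, Status: "Pending", Purpose: purpose})
	return ctx.GetStub().PutState(contractID, bytes)
}

// ApproveContract lets the dataset's producer activate a pending contract.
func (s *SmartContract) ApproveContract(ctx contractapi.TransactionContextInterface,
	contractID string) error {
	data, err := ctx.GetStub().GetState(contractID)
	if err != nil || data == nil {
		return fmt.Errorf("contract %s not found", contractID)
	}
	var dc DataContract
	if err := json.Unmarshal(data, &dc); err != nil {
		return err
	}
	dsBytes, err := ctx.GetStub().GetState(dc.DatasetID)
	if err != nil || dsBytes == nil {
		return fmt.Errorf("dataset %s not found", dc.DatasetID)
	}
	var ds Dataset
	if err := json.Unmarshal(dsBytes, &ds); err != nil {
		return err
	}
	caller, err := ctx.GetClientIdentity().GetMSPID()
	if err != nil {
		return err
	}
	if caller != ds.Producer { // only the owner may approve
		return fmt.Errorf("only producer %s may approve", ds.Producer)
	}
	dc.Status = "Active"
	bytes, _ := json.Marshal(dc)
	return ctx.GetStub().PutState(contractID, bytes)
}

func main() {
	cc, err := contractapi.NewChaincode(&SmartContract{})
	if err != nil {
		panic(err)
	}
	if err := cc.Start(); err != nil {
		panic(err)
	}
}
```

In Fabric, each such invocation is endorsed by peers and ordered into blocks, so an approved contract’s status change is replicated to every domain’s copy of the ledger.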

The sharing of the actual dataset may be implemented in several ways, such as API access or an enterprise storage share link. For sensitive datasets, access control may be required to prevent unauthorized access. The producer domain should grant access through the existing enterprise authentication and authorization system using, for example, a private API.

Fig. 3: Illustration of sharing a dataset between domains in a governed manner

Datasets can also be removed by the producer domain (removeDataset), and data contracts can be terminated by the participating domains (terminateContract). All domains can view the consumers of a given dataset (viewConsumers). In addition, any domain can retrieve the dataset catalog (viewCatalog) to view all the enterprise datasets.
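As an example of the read-only transactions, viewCatalog could be implemented as a ledger range query, as sketched below (a method on the SmartContract type from the previous sketch). The “DS” key-prefix convention used to separate dataset keys from contract keys is our assumption.

```go
package chaincode

import (
	"github.com/hyperledger/fabric-contract-api-go/contractapi"
)

// SmartContract as in the previous sketch.
type SmartContract struct {
	contractapi.Contract
}

// ViewCatalog returns every stored dataset object as raw JSON. Datasets
// are assumed to be keyed with a "DS" prefix, so a key-range query over
// ["DS", "DT") selects them without touching contract entries.
func (s *SmartContract) ViewCatalog(ctx contractapi.TransactionContextInterface) ([]string, error) {
	iter, err := ctx.GetStub().GetStateByRange("DS", "DT")
	if err != nil {
		return nil, err
	}
	defer iter.Close()

	var catalog []string
	for iter.HasNext() {
		kv, err := iter.Next()
		if err != nil {
			return nil, err
		}
		catalog = append(catalog, string(kv.Value)) // one JSON object per dataset
	}
	return catalog, nil
}
```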

4.3 Network Architecture

The network architecture consists of several components, including nodes, protocols, and consensus mechanisms. Sharing and exchanging datasets within the enterprise is an internal process. Therefore, a private blockchain network clearly fits our proposed framework for sharing data securely and efficiently across the enterprise domains, which are known in advance. This gives greater control over accessing and changing data on the blockchain and prevents disclosure of sensitive data beyond the enterprise boundaries.

Each business domain within the enterprise becomes a node of the blockchain network. The protocols that govern the validity of transactions are defined within the deployed smart contracts. The domains should agree on a consensus mechanism to reach agreement on the current state of the blockchain. More specifically, each domain has its own copy of the blockchain state and can propose, approve, and validate transactions. Domain members cooperatively request the execution and approval of transactions through a communication channel (see Fig. 4). These communications are streamlined by the deployed smart contracts to automate transaction execution, eliminating manual intervention and minimizing human error while maximizing framework efficiency.

The main aspects of the proposed framework are dataset sharing and governance. Each node, i.e., a business domain, can produce datasets of which the enterprise should be aware. This is done using the node’s producer identity, the deployed smart contract, and the Public Key Infrastructure (PKI) [38] of the producer domain. Upon the success of the dataset-sharing transaction, every other node updates the state of its blockchain copy, and the dataset information becomes available across the enterprise and can be requested using the same deployed smart contract. However, approval is based on the rules and conditions associated with the shared dataset. If the dataset is sensitive, formal approval must be granted by the producer node.

Blockchain networks offer a variety of consensus mechanisms designed to help nodes reach a consistent state. However, consensus mechanisms suitable for public blockchain networks, like proof of work [39] or proof of stake [40], may not be suitable in private settings. Given that the proposed framework operates within a private, permissioned blockchain network, Practical Byzantine Fault Tolerance (PBFT) [41] and its variants emerge as suitable consensus mechanisms among the nodes.

Fig. 4: Framework overview

5 Qualitative Analysis

In this section, we analyze the key properties of our framework from various qualitative perspectives. The following is a detailed analysis of our framework properties:

  • Decentralization: Both data mesh and blockchain are fundamentally decentralized. In our framework, dataset ownership and control are decentralized. By eliminating the central authority that maintains data or controls data access, our framework allows for direct peer-to-peer interaction and data exchange. This facilitates seamless collaboration, dataset sharing, and integration between different business domains, promoting innovation and cross-functional insights.

  • Data Accessibility: The design of our framework enables domain teams to easily share their data products and make them easily accessible to other business domains. Thus, it improves data collaboration and sharing, and eliminates redundancy and duplication of efforts. Using our framework, datasets will no longer be stored in isolated silos. Every domain has a replica of the datasets catalog and may easily view them and request access to any dataset by issuing smart contract transactions.

  • Transparency: Blockchain, by design, promotes transparency within a data mesh. Blockchain creates a record of data transactions and modifications that are viewable by all nodes in the blockchain. This transparency helps establish trust among different business domains and enables participants to verify the integrity and the history of the shared datasets.

  • Federated Governance and Control: Since our framework uses blockchain, it allows for establishing rules and policies governing dataset structure, access, and usage. By leveraging smart contract transactions in the blockchain, our framework can enforce data governance processes, ensuring that shared datasets follow the predefined rules and guidelines agreed upon by the participating domains.

  • Data Security and Privacy: Blockchain’s cryptographic protocols can enhance data security and privacy within a data mesh. The security of our framework is provided by the underlying blockchain platform we use (i.e., Hyperledger Fabric). Several studies have analyzed Hyperledger Fabric and its security features [42,43,44] and found it to be secure if implemented correctly. In addition, using the smart contract transactions described in Sect. 4, our framework grants access only to authorized consumers. Access to the datasets can also be revoked by terminating the contract using smart contract transactions. Our framework can be integrated with existing enterprise authentication and authorization systems to allow or deny access to sensitive data, as described in Sect. 4.

  • Scalability: Scalability heavily depends on the choice of blockchain network and the underlying consensus mechanism. The blockchain adopted in our framework is permissioned, which has been shown to be more scalable than public blockchain networks [45, 46]. Thus, it supports high transaction throughput due to the limited number of known participants.

  • Resilience: The decentralized nature of blockchain and data mesh reduces the risk of a single point of failure, making our framework more resilient against data access disruptions or attacks. This decentralization ensures that even if a certain data source fails or goes offline, datasets from other sources remain accessible, improving the overall system’s reliability.

6 Implementation

To implement a prototype of our proposed framework [47], we use Hyperledger Fabric [48] to create a permissioned blockchain network. The blockchain network consists of two business domains, namely, the Sales and Marketing domains. Each domain in the Hyperledger Fabric network has a Membership Service Provider (MSP) that uses public key infrastructure (PKI) to manage identities and authentication. We used the Hyperledger Fabric cryptogen utility tool to generate public and private keys for all entities in the network. We created two orderer instances to provide the ordering service for the Hyperledger Fabric network. The ordering service uses the Raft consensus protocol [49] to ensure that all nodes in the network have a consistent view of the blockchain ledger.

For each domain, we create a peer node to store a copy of the blockchain ledger. Multiple peer nodes can be created per domain for availability and durability. Peer nodes issue transactions on behalf of their domains and act as endorsers, which means that they must approve transactions before they are considered valid. We also develop a smart contract in Golang [50], implementing the transactions shown in Fig. 5.

Fig. 5: A list of the implemented transactions for the smart contract

In Fig. 6, we show the peer of the Sales domain executing the “addDataset” transaction with a JSON object as input. The JSON object represents the dataset to be added to the ledger, including the producer, the dataset description, and other information. The peer of the Marketing domain then executes the “addContract” transaction to establish a data contract requesting the dataset, with its status set to “Pending.” Next, the peer of the Sales domain executes either the “approveContract” or “rejectContract” transaction to approve or reject the contract, respectively, setting its status accordingly. Figure 7 shows the Sales domain peer approving the contract, changing its status to “Active”.

Fig. 6: A peer of the Sales domain issues “addDataset” to add the dataset reference to the blockchain

Fig. 7: A peer of the Sales domain issues “approveContract” to grant access of dataset “DS5002” to the Marketing domain

In Fig. 8, a peer of any domain executes the “viewCatalog” transaction to display the enterprise dataset catalog. Figure 9 shows the consumers of a given dataset when a peer of any domain executes the “viewConsumers” transaction. In Fig. 10, we demonstrate that any peer from any domain can view the history of a dataset by executing the “viewHistory” transaction. This transaction shows the history of the dataset, including all of the datasets that have been used to generate the given dataset; a sketch of such a history query follows the figures below.

Fig. 8: A peer of any domain issues “viewCatalog” to view all the datasets in the enterprise

Fig. 9: A peer of any domain issues “viewConsumers” to view all the consumers of dataset “DS5002”

Fig. 10: A peer of any domain issues “viewHistory” to show the history of dataset “DS5002”
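The history in Fig. 10 builds on the blockchain’s traceability. One plausible building block, sketched below as a method on the smart contract from the earlier sketches, is Fabric’s key-history API, which returns every ledger modification of the dataset record; reconstructing upstream lineage (the datasets used to generate a given one) would additionally rely on references stored in the dataset objects themselves. The output format here is our assumption, not the prototype’s exact rendering.

```go
package chaincode

import (
	"fmt"

	"github.com/hyperledger/fabric-contract-api-go/contractapi"
)

// SmartContract as in the previous sketches.
type SmartContract struct {
	contractapi.Contract
}

// ViewHistory returns one line per ledger modification of the given
// dataset key, using Fabric's built-in key-history index.
func (s *SmartContract) ViewHistory(ctx contractapi.TransactionContextInterface,
	datasetID string) ([]string, error) {
	iter, err := ctx.GetStub().GetHistoryForKey(datasetID)
	if err != nil {
		return nil, err
	}
	defer iter.Close()

	var history []string
	for iter.HasNext() {
		mod, err := iter.Next() // *queryresult.KeyModification
		if err != nil {
			return nil, err
		}
		history = append(history, fmt.Sprintf("tx=%s deleted=%t value=%s",
			mod.TxId, mod.IsDelete, string(mod.Value)))
	}
	return history, nil
}
```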

7 Discussion

Our framework facilitates the functionalities of various organizations, particularly data-centric ones. The framework presented here can be used by organizations that rely heavily on data science or artificial intelligence applications to better manage and share datasets. These applications are mainly built on prior datasets [51]. Therefore, the “viewCatalog” transaction can be used by data teams within an organization to determine which datasets belong to which domain. Moreover, since domains own their datasets, data teams will be aware of any new features added by those domains.

Additionally, the framework can be implemented by organizations without data teams specialized in data analytics and artificial intelligence, which hire third parties to manage their data-centric and artificial intelligence solutions, thus facilitating communication between internal organization domains as well as external third parties. Typically, domains continuously update their catalog to improve data quality using several features provided by our framework, including “updateDataset”, which enables domain experts to provide annotations and improve data quality for building advanced analytics solutions.

Our framework is also useful for organizations planning to implement a knowledge graph architecture for exchanging information across domains. Using our framework, data teams can integrate data points from different domains, which facilitates the creation of new knowledge and information search over the uploaded data sources [52].

Despite the various functionalities offered by the proposed framework, it faces a number of challenges ranging from cultural to technical. The discovery of datasets is an essential component of our framework. However, producer domains must continuously update the catalog of data products to reflect changes to existing or new data products in order to maintain discoverability and relevance. A sub-challenge may therefore arise for producer domains in describing their datasets and identifying which domains could benefit from their data products.

At the content level of the datasets, consumer domains may find it challenging to understand the semantics behind the features of the produced datasets. Thus, the producer domain should ensure that dataset documentation is produced that meets the needs of consumer domains within the organization and facilitates their understanding. The quality of content in producer datasets is another obstacle that may hinder their usability. Therefore, producer domains should monitor the usability of their datasets and address any related issues raised by consumer domains.

In terms of data sensitivity, datasets may contain sensitive information; therefore, for protection, producer domains must apply data anonymization to such data. Moreover, another challenge may arise when two or more domains own a dataset: in that case, data contracts added by consumers must be accepted by all producing domains, which can be time-consuming and challenging to apply.

8 Conclusion

We introduce a blockchain-powered framework that implements the data mesh architecture for managing datasets within the enterprise. The framework realizes the principles of data mesh architecture, including decentralizing data ownership, treating data as a product, and applying federated governance, through various functionalities implemented in a smart contract deployed on every domain of the enterprise. We have made the proof-of-concept implementation publicly available to allow the community to validate, build upon, and expand our proposed framework.