This section presents the proposed generic blockchain-based data provenance framework for the IoT. As mentioned in Section 1, the major benefits of a generic framework are interoperability between applications using the framework and the facilitated implementation of provenance concepts for new IoT use cases. Figure 3 displays the core architecture of the framework. It consists of three layers, all embedded within a blockchain smart contract platform. Each layer represents a different level of abstraction with a distinct set of responsibilities within the framework: the storage layer is primarily concerned with the low-level representation and storage of provenance data, the generic provenance layer provides general-purpose provenance functionality, and the specific provenance layer can be modified by use cases to fine-tune the framework to their specific requirements. To this end, Section 4.1 describes the data model used for the representation of provenance data within the framework, Section 4.2 explains the storage layer, and Section 4.3 and Section 4.4 describe the generic and specific provenance layers, respectively.
We provide a proof-of-concept implementation of the framework consisting of smart contracts written in Solidity, a smart contract language for the Ethereum Virtual Machine (EVM). As such, the prototype can be deployed on any EVM-compatible blockchain, e.g. public permissionless blockchains such as Ethereum and Ethereum Classic, or private permissioned blockchains such as Hyperledger Burrow [19]. However, the framework is not restricted to EVM-based blockchains: the described concepts can be implemented on any blockchain platform that provides sufficient scripting capabilities. The prototype is available as open-source software on GitHub.
Data model
Provenance data varies depending on the specific domain or application [7]. Since the IoT domain is characterised by heterogeneous applications and a wide range of possible use cases [1], a generalised provenance data model suitable for IoT applications is needed. As the underlying data model, our framework utilises a data provenance model specifically designed for the IoT by Olufowobi et al. [7]. Instead of taking a document-centric view on provenance, as models like the Open Provenance Model (OPM) [20] or the PROV data model [21] do by tracking agents performing actions on documents, the model by Olufowobi et al. focuses on the infrastructure of agents, such as sensors, devices, and analytics services, and the data they exchange. The authors argue that in IoT environments, provenance can be recorded in terms of creations or modifications of data by agents. No distinction between agents and actions is necessary, since a sufficiently fine-grained definition of agents already explains the action performed on the data. The model can therefore account for fine-grained provenance data while keeping a low overhead.
The model defines a data point as a uniquely identifiable and addressable piece of data. In the context of IoT systems, these can be sensor readings, complex analytics results derived from sensor readings, actuator commands, etc. The function addr(dp) denotes the address/ID of a data point dp. The function inputs(dp) refers to the set of input data points that have contributed to the creation or modification of a data point.
The model defines a provenance event as the moment an IoT system reaches a specific state which requires the collection of provenance data for a particular data point. When a provenance event occurs, the information about the state of the IoT system that is of interest for provenance needs to be recorded. The function context(dp) denotes this information. The context varies depending on the IoT application. An example of how the context parameter might be structured is displayed in Fig. 4. Agents (e.g. sensors, devices, persons, or organisations) create and/or modify data points. Agents are defined recursively, i.e. agents can contain other agents (such as devices containing several sensors). Information related to the specific provenance event, such as the events that triggered the creation or modification of the data point, is defined in the execution context. Furthermore, time and location information may be added to the context as well.
Finally, the model defines a provenance record as a tuple associating the address of a data point with the set of provenance records of its input data points and the specific context of the corresponding provenance event. The provenance function prov(dp), providing the provenance record for some data point dp, is defined as follows:
$$ {prov} : {dp} \mapsto \langle {addr} ({dp}), \{{prov} ({idp}) | \forall{idp} \in {inputs} ({dp})\}, {context} ({dp})\rangle $$
Note that this definition of provenance allows for the description of both the creation and the modification of data points. In the former case, the set of input provenance records is empty. In the latter case, it contains the provenance record of the data point before modification, i.e. \({prov}({dp}^{\prime }) = \langle {addr}({dp}^{\prime }), \{{prov}({dp}), ...\}, {context}({dp}^{\prime })\rangle \).
The described data provenance model combined with blockchain technology builds the basis for our IoT data provenance framework. This way, the framework can not only represent provenance data for various IoT use cases, but also provides integrity guarantees for the stored records.
Storage layer
This layer is responsible for the low-level storage of provenance records. It contains the generic representation for provenance records as defined by the data provenance model, as well as basic functionality to create, retrieve, update, and delete provenance records. Delete refers to the invalidation of provenance records, since truly deleting data from a blockchain is not possible. An invalidated provenance record cannot be used as input for subsequent provenance records.
Listing 1 displays an excerpt of the smart contract implementing this layer. The internal representation of provenance records (Lines 2–7) closely resembles the model discussed above with the field tokenId representing the ID of a data point. However, not only are data points addressable, but also the provenance records themselves. Addressable provenance records allow the storage layer to manage provenance records as a mapping from provenance IDs to provenance records: addr(prov)↦prov where the function addr(prov) represents the ID of a provenance record prov. The mapping is implemented via the mapping keyword (Line 8).
The contract exposes an API for creating, retrieving, updating, and deleting (i.e. invalidating) provenance records. Note that, while functionality for retrieving provenance records is exposed publicly via the public keyword (Lines 10ff), functionality for creating, updating, and deleting records is protected via the internal keyword (Lines 13ff). These functions cannot be accessed publicly, but are accessible from inheriting contracts such as the contract representing the generic provenance layer. Read functions are publicly accessible since these functions do not alter the state of the contract.
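Since the listing is only excerpted, the storage layer described above can be sketched as follows. The contract name, the record fields other than tokenId, and the exact function signatures are illustrative assumptions and may differ from the actual Listing 1.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Sketch of the storage layer; identifiers other than `tokenId`
// are hypothetical.
contract ProvenanceStorage {
    struct ProvenanceRecord {
        uint256 tokenId;              // ID of the data point, addr(dp)
        uint256[] inputProvenanceIds; // provenance records of input data points
        string context;               // use-case-specific context(dp)
        bool invalidated;             // "deleted" records are only invalidated
    }

    // addr(prov) -> prov
    mapping(uint256 => ProvenanceRecord) internal provenanceRecords;

    // Read access is public: it does not alter the contract state.
    function getProvenance(uint256 provId)
        public view
        returns (uint256, uint256[] memory, string memory, bool)
    {
        ProvenanceRecord storage rec = provenanceRecords[provId];
        return (rec.tokenId, rec.inputProvenanceIds, rec.context, rec.invalidated);
    }

    // Write access is internal: only inheriting contracts, such as the
    // generic provenance layer, may create records.
    function createProvenance(
        uint256 provId,
        uint256 tokenId,
        uint256[] memory inputProvenanceIds,
        string memory context
    ) internal {
        provenanceRecords[provId] =
            ProvenanceRecord(tokenId, inputProvenanceIds, context, false);
    }

    // Deletion is implemented as invalidation, since data cannot
    // truly be removed from a blockchain.
    function deleteProvenance(uint256 provId) internal {
        provenanceRecords[provId].invalidated = true;
    }
}
```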
Generic provenance layer
The generic provenance layer’s main purpose is to provide general-purpose provenance functionality on top of the storage layer; i.e. it provides features that are universally applicable for a wide range of provenance use cases. For this, it defines the ownership of data points (Section 4.3.1) and how to associate provenance records with data points (Section 4.3.2).
Ownership of data points
While blockchain technology can guarantee the integrity of data provenance records once those records have entered the system, mechanisms need to be in place to ensure that the records that enter the system are correct. As a first step, we aim to prevent the creation of provenance records by arbitrary clients, i.e. if client A generates some data point dp0, we want to make sure that only client A (or any client authorised by client A) is able to create provenance records for dp0. Thus, we introduce the notion of ownership of data points: Each data point belongs to a specific client of the system, and only the owner or a client authorised by the owner can create provenance records for it. If an unauthorised client tries to create a provenance record for a data point, the system raises an error.
The notion of ownership is closely related to so-called tokens, i.e. smart contracts deployed on a blockchain that represent a kind of digital asset [22]. Our framework leverages tokens to introduce ownership of data points. Each data point is represented by a single token and a single token identifies exactly one data point. This one-to-one mapping allows us to identify the owner of a data point by identifying the owner of a particular token.
Each token acts as an entry ticket to the provenance framework. To create provenance records for a particular data point, a client first has to become the owner of the corresponding token or be approved by its owner. The prototype uses the Ethereum token standard ERC721, which defines a common interface for non-fungible assets, including functions for transferring ownership. This has the advantage that our tokens (i.e. the data points) can be traded by any client implementing this standard, such as wallets or exchanges. By transferring ownership, data points can pass from owner to owner, leaving a trail of provenance records created by each owner along the way. This is useful for implementing provenance applications which aim at creating a lineage (see Section 3), e.g. in supply chain scenarios and business processes [23, 24]. The generic provenance contract complies with the standard by inheriting from an existing ERC721 implementation provided by OpenZeppelin.
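The token-based ownership check might look like the following sketch, assuming an OpenZeppelin 4.x-style ERC721 implementation with the `_isApprovedOrOwner` helper; the contract name, token name, and modifier are illustrative assumptions:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import "@openzeppelin/contracts/token/ERC721/ERC721.sol";

// Sketch: each data point is represented by exactly one ERC721 token;
// ownership of the token determines who may create provenance records
// for the data point.
contract GenericProvenance is ERC721 {
    constructor() ERC721("DataPointToken", "DPT") {}

    // Only the token owner, or a client approved by the owner,
    // passes this check.
    modifier onlyTokenOwner(uint256 tokenId) {
        require(
            _isApprovedOrOwner(msg.sender, tokenId),
            "caller is neither owner nor approved"
        );
        _;
    }
}
```

Because the contract inherits the standard ERC721 interface, ownership of a data point can be transferred with the standard's own functions (e.g. `safeTransferFrom`), which is what enables the owner-to-owner provenance trail described above.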
Associating provenance records with data points
The generic provenance layer further links data points and their respective provenance records. The framework provides information about associated provenance records of specific data points. Within the generic provenance layer, this is achieved by using a mapping from a data point ID to a set of associated provenance IDs:
$$ {addr} ({dp}) \mapsto \{{addr} ({prov}_{1} ({dp})), {addr} ({prov}_{2} ({dp})), ...\} $$
All associated provenance records (prov1, prov2, etc.) represent parallel, completely independent provenance traces for the same data point. For instance, one trace of provenance records might store the temperature history of a physical good, while another one stores its location.
Listing 2 displays an excerpt of the smart contract implementing this layer and demonstrates the general workflow for creating new provenance records. First, the contract verifies that a given token (i.e. data point) exists (Line 7) and that the token belongs to the sender of the message (Line 8). After validating the input provenance records (Line 9), the contract creates a new provenance ID (Line 10), adds it to the list of associated provenance (Line 11), and calls the storage contract’s createProvenance function (Line 12). Note that the createProvenance function of the generic contract is again internal, and thus not accessible publicly. This enables specific provenance contracts which implement concrete use cases to further adapt the functionality according to their needs.
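The described workflow can be sketched as follows. This is a self-contained approximation: the delegation to the storage contract is replaced by a local mapping, and all identifiers apart from createProvenance are assumptions rather than the actual Listing 2.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import "@openzeppelin/contracts/token/ERC721/ERC721.sol";

// Sketch of the generic provenance layer's record creation workflow.
// In the actual framework, record storage is delegated to the storage
// layer contract; here, a local mapping stands in for it.
contract GenericProvenance is ERC721 {
    mapping(uint256 => uint256[]) internal provenanceIdsOf; // addr(dp) -> prov IDs
    mapping(uint256 => bool) internal provKnown;            // known provenance IDs
    uint256 private nextProvId = 1;

    constructor() ERC721("DataPointToken", "DPT") {}

    // Internal, so that specific provenance contracts control access.
    function createProvenance(
        uint256 tokenId,
        uint256[] memory inputProvIds,
        string memory context
    ) internal returns (uint256) {
        require(_exists(tokenId), "data point does not exist");       // token exists
        require(ownerOf(tokenId) == msg.sender, "sender not owner");  // sender owns it
        for (uint256 i = 0; i < inputProvIds.length; i++) {           // validate inputs
            require(provKnown[inputProvIds[i]], "unknown input record");
        }
        uint256 provId = nextProvId++;          // create new provenance ID
        provenanceIdsOf[tokenId].push(provId);  // associate with data point
        provKnown[provId] = true;               // stand-in for the storage-layer call
        return provId;
    }
}
```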
Specific provenance layer
Smart contracts within this layer utilise the functionality provided by the generic provenance layer, but control a subset of parameters by themselves. This way, use cases can customise the provenance model according to their needs, and control access to the functionality provided by the storage and generic provenance layers. Listing 3 displays an excerpt of an exemplary smart contract implementing this layer.
The provenance model can be customised by defining the context parameter so that it presents the provenance data needed for a specific use case. For instance, in the case of vaccine supply chains, the context should contain temperature and location information about individual vaccines. Hence, a specific contract could define its own createProvenance function that requires parameters such as temperature and location (Lines 7ff), which are then combined to form the context passed on to the createProvenance function of the generic provenance contract (Lines 10ff).
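Context construction in the specific layer might look like the following sketch. The contract name, the string encoding of the context, and the stubbed generic-layer call are assumptions, not the actual Listing 3.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import "@openzeppelin/contracts/utils/Strings.sol";

// Sketch of a specific provenance contract for a vaccine supply chain.
// The generic layer is stubbed out; names and encoding are illustrative.
contract VaccineProvenance {
    event ProvenanceCreated(uint256 indexed tokenId, string context);

    function createProvenance(
        uint256 tokenId,
        uint256[] memory inputProvIds,
        uint256 temperature,    // e.g. tenths of a degree Celsius
        string memory location
    ) public {
        // Combine the domain-specific parameters into the generic
        // context parameter of the provenance model.
        string memory context = string(abi.encodePacked(
            "temperature=", Strings.toString(temperature),
            ";location=", location
        ));
        // In the full framework, this would call the generic provenance
        // contract's internal createProvenance(tokenId, inputProvIds, context).
        emit ProvenanceCreated(tokenId, context);
    }
}
```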
Access control happens on two levels. First, a specific contract defines which parts of the generic provenance layer’s API are exposed. For instance, even though the generic provenance layer could permit updating or deleting (i.e. invalidating) provenance records, this could be unwanted behaviour in the specific use case at hand. In this case, the contract in the specific provenance layer simply “hides” the functionality, i.e. does not expose it publicly.
Second, access control is relevant for controlling the ownership of data points. Since data point ownership is the decisive factor with regard to who can create provenance records for which particular data points, contracts in the specific provenance layer are responsible for actually assigning ownership, i.e. which tokens get assigned to which clients. For simplicity, the example shown in Listing 3 automatically assigns new tokens to requesting clients (Lines 2ff). However, ultimately, the most suitable approach is dictated by the specific use case at hand. Some further possibilities for assigning ownership are listed below:
-
One possibility is for clients to purchase tokens. This has the advantage that the framework is publicly available without the risk of spamming attacks since the acquisition of tokens incurs financial cost. Essentially, the token acquisition cost needs to be low enough for honest users to be willing and able to participate, but high enough to discourage spammers. Of course, such an approach cannot prevent malicious actors with sufficient purchasing power.
-
In another approach, the layer manages a white list of authorised clients that are allowed to request new tokens. This way, the list controls exactly who can participate in the provenance system. However, the question arises of who is responsible for managing the white list. As discussed above, the proposed framework could also be implemented on private blockchains. In this case, the white list approach is the most promising one, since all participants in the private blockchain could simply be white-listed.
-
Yet another approach is a list of pending requests. A client sends a transaction requesting tokens. An administrator gets notified about the new request and decides whether to accept or deny the request by assigning or not assigning corresponding tokens. While this approach allows a more fine-grained control over whose requests get accepted and whose requests get denied, it bears the risk of a high degree of centralisation due to the administrator having full control over distributing tokens.
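As one concrete example, the white-list approach could be sketched as follows; the contract and function names are assumptions, and the actual ERC721 minting is stubbed out.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Sketch of the white-list approach to assigning data point tokens.
// In the full framework, requestToken would mint an ERC721 token
// (e.g. via _safeMint); here it only returns a fresh token ID.
contract WhitelistedTokenAssignment {
    address public admin;
    mapping(address => bool) public whitelisted;
    uint256 private nextTokenId = 1;

    constructor() {
        admin = msg.sender; // the deployer manages the white list
    }

    function addToWhitelist(address client) public {
        require(msg.sender == admin, "only admin");
        whitelisted[client] = true;
    }

    function requestToken() public returns (uint256) {
        require(whitelisted[msg.sender], "client not white-listed");
        return nextTokenId++; // stand-in for _safeMint(msg.sender, tokenId)
    }
}
```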