Introduction

Digitalization is profoundly transforming business. For some time, digitalization focused on the automation of back-office and customer-oriented processes (Urbach et al., 2019). Today, digitalization increasingly encompasses the collection and usage of data, adding to global data growth (IDC, 2018). Monitoring and collecting data from Internet-of-Things (IoT) sensors, mobile devices, and business processes not only helps companies create innovative products and improve customer experience but also opens new ways of monetizing data as a product in itself. According to Forrester Research (2018), 48 percent of global data and analytics decision-makers commercialize the data they own by selling either the data itself or the insights derived from it. Data thereby becomes part of a company’s revenue model. Simultaneously, the demand for access to external data sources is growing (Forrester, 2018). Companies need diverse datasets to train machine learning models and enable improved decision-making. For example, sharing data between manufacturers can help to improve predictive maintenance algorithms and therefore increase machine performance (World Economic Forum, 2020). Additionally, making more data available for analysis supports tackling societal and environmental challenges (European Commission, 2020).

Data marketplaces have been emerging in recent years to match the supply and demand for data. Data marketplaces provide a digital platform enabling data providers to sell their data to data consumers. Concurrently, data consumers get access to otherwise inaccessible data sources. Although data marketplaces bring benefits, they also pose multiple challenges for data governance. These challenges manifest in a variety of issues within different data governance decision domains, including how to protect data, ensure data quality, and define and model data consistently. Regarding data security, companies fear a loss of control over their data, which could lead to a competitive disadvantage (Roman & Stefano, 2016; Spiekermann, 2019; van den Broek & van Veenstra, 2015). Concerning data quality, data consumers require insights into the quality of data products before they can use them for certain purposes (Janssen et al., 2012). If data products are further processed and used as the basis for producing other goods, a high level of data quality is required (Stahl et al., 2017). Issues with the quality of data products can hamper efforts to provide a marketplace service (Smith et al., 2016). Regarding data architecture, the lack of standards inhibits data sharing (European Commission, 2022). It also poses barriers to finding, analyzing, and processing published data (Smith et al., 2016). These challenges indicate that further research is required regarding data governance in the context of data marketplaces.

This paper adopts a comprehensive perspective by establishing a taxonomy of data governance decision domains in data marketplaces. A taxonomy structures and organizes the body of knowledge within a certain field (Glass & Vessey, 1995). It builds a foundation for future research by allowing researchers to determine relationships between the taxonomy’s dimensions and other variables of interest. A taxonomy is also helpful in identifying divergence in previous research findings (Sabherwal & King, 1995). We aim to create a taxonomy that describes the different mechanisms available to instantiate data governance decision domains in the context of data marketplaces. Hence, our study answers the following research question: how do data marketplaces instantiate data governance decision domains? Our study advances the body of knowledge on data governance in data marketplaces, which has to date been little researched (Koutroumpis et al., 2020). Our findings also contribute to the body of knowledge on the wider topic of data governance for ecosystems of public and private organizations (Tiwana et al., 2014).

The remainder of this paper is structured as follows. First, we present the theoretical background regarding data marketplaces and data governance decision domains. Second, we describe the taxonomy development method applied for the study. Third, we present our findings concerning the dimensions, subdimensions, and characteristics of the taxonomy. Fourth, we discuss our findings in the context of scientific literature and present the final taxonomy. In doing so, we highlight current limitations in the instantiation of data governance decision domains. We then conclude with a summary, describe the limitations of our study, and suggest avenues for future research.

Theoretical background

Data marketplaces

Data marketplaces are digital platforms that offer data or data-related services as primary goods (Stahl et al., 2016). Data goods encompass manually and automatically created personal and commercial data, such as age, gender, purchase history, and IoT sensor data. Data-related services comprise capabilities such as data aggregation, analysis, and visualization (Roman & Stefano, 2016; Spiekermann, 2019). In our study, we focus on data marketplaces that act as independent intermediaries connecting two or more market participants (Stahl et al., 2016). The main actors involved in data trades are data providers offering data, and data consumers buying data. Marketplace providers offer an infrastructure that allows these actors to upload, discover, buy, and sell data (Spiekermann, 2019; Stahl et al., 2016).

Data governance decision domains in data marketplaces

Data governance is a framework that provides structure and formalization for the management of data (Morabito, 2015; Rifaie et al., 2009; Weber et al., 2009). As part of data governance, organizations typically need to specify what must be governed, i.e., the scope of data (Abraham et al., 2019), who governs the data, i.e., the roles and governance bodies (Khatri & Brown, 2010; Otto, 2011), and what decisions must be made in data-related areas, i.e., the data governance decision domains (Abraham et al., 2019; Khatri & Brown, 2010; Lee et al., 2019). This study deals with the last component. Based on Khatri and Brown (2010) and Abraham et al. (2019), we distinguish between the following six data governance decision domains: (a) data quality; (b) data security; (c) data architecture; (d) metadata; (e) data lifecycle; (f) data storage and infrastructure. Based on Schreieck et al. (2016), we add (g) data pricing as a seventh data governance decision domain, since data marketplaces involve data trade and data valuation. In the remainder of this section, we describe the seven data governance decision domains that form the foundation of the taxonomy.

Data quality

Data quality refers to the ability of data to satisfy its usage requirements in a given context (de Abreu Faria et al., 2013; Khatri & Brown, 2010). Data quality is characterized by quality dimensions such as completeness, credibility, accuracy, timeliness, and consistency of data (DAMA International, 2009; Khatri & Brown, 2010). Scientific literature proposes both preventive and reactive measures to manage data quality (Otto et al., 2012). In the context of data marketplaces, preventive measures inhibit data providers from onboarding data products with insufficient quality. For example, data providers apply automated test scripts to examine the quality of their data products before making the products available on the data marketplace (Smith et al., 2016). Reactive measures aim to support the identification and reporting of data quality issues after data products have been made available on the data marketplace. Examples include rating systems that allow data consumers to rate and provide feedback on data products (Zuiderwijk et al., 2014) or data providers (Mišura & Žagar, 2016; Ramachandran et al., 2018).

Data security

Data security refers to the preservation of security requirements concerning the accessibility, authenticity, availability, confidentiality, integrity, privacy, and reliability of data (Carretero et al., 2017; de Abreu Faria et al., 2013; Donaldson & Walker, 2004; ISACA, 2013). In the context of data marketplaces, requirements concern the control of when, to whom, and to what extent data is being sold (Mišura & Žagar, 2016; Tzianos et al., 2019) and how and where data is being used (Otto & Jarke, 2019; Roman & Stefano, 2016). To store data confidentially, data marketplaces use encryption techniques (Roman & Stefano, 2016; Shaabany et al., 2016; Tzianos et al., 2019). To protect sensitive data during data usage, data marketplaces apply methods that only provide access to parts of the data or even fully restrict access to raw data. Examples include the utilization of anonymization techniques to hide identity data (Fung et al., 2010; Ha et al., 2019) and the application of homomorphic encryption to enable mathematical operations on encrypted data (Roman & Stefano, 2016). To control data usage and protect data ownership rights, data marketplaces apply data usage terms that describe the appropriate uses of data (Otto & Jarke, 2019; Truong et al., 2012; Tzianos et al., 2019). Similarly, data contracts help to negotiate and assure the authorizations, obligations, and prohibitions on data covered by the contract (Allen et al., 2014; Matteucci et al., 2012). They enable data providers to have a remedy against data consumers in case of contract infringements (Truong et al., 2012).

Data architecture

Data architecture is a set of data specifications, which is used to define data requirements and guide data integration (DAMA International, 2009). Data architecture also contains comprehensive data models on a conceptual, logical, and physical level (DAMA International, 2009; Watson et al., 2004). In the context of data marketplaces, data standards are often mentioned as being crucial to supporting interoperability and data exchange between data providers and data consumers (Lis & Otto, 2020; Spiekermann, 2019). However, data marketplaces apply different approaches regarding the data format to facilitate data exchange, ranging from standardized through proprietary to hybrid. Within the standardized approach, data marketplaces define standardized vocabularies and formats, to which all marketplace participants must adhere (Ito, 2016; Otto & Jarke, 2019). The proprietary approach allows data providers to offer their data products using their own proprietary data formats (Özyilmaz et al., 2018). The most convenient approach for both data providers and consumers is a hybrid approach, where data providers can offer data products in proprietary formats, which then are automatically normalized on the marketplace platform using a standardized data model (Nagorny et al., 2018).

Metadata

Metadata represents data about data (DAMA International, 2009; Were & Moturi, 2017). It gives meaning and context to data by providing a structured description of the content, quality, and other characteristics of data (Hovenga & Grain, 2013; Khatri & Brown, 2010). Within data marketplaces, rich contextual metadata is important for supporting data consumers in finding data of interest (Tzianos et al., 2019), determining the usefulness of data products (Ramachandran et al., 2018), and correctly interpreting and processing data (Zuiderwijk et al., 2014). Scientific literature provides two approaches regarding the metadata vocabulary in the context of data marketplaces. The first approach contains a marketplace-specific metadata vocabulary, which is used by data providers for describing and publishing metadata, and by data consumers for looking up and retrieving metadata (Otto & Jarke, 2019). The second approach comprises the application of standardized metadata vocabularies such as CERIF and DCAT (W3C, 2020; Zuiderwijk et al., 2014).

Data lifecycle

The data lifecycle represents the approach of defining, collecting, creating, using, maintaining, archiving, and deleting data (Khatri & Brown, 2010; Morabito, 2015). In the context of data marketplaces, the main data lifecycle phases are data onboarding, data discovery, data purchase, and data usage. During data onboarding, data providers capture, create, and store data, which is made available to the data marketplace (Otto & Jarke, 2019). Within the data discovery phase, data consumers try to find the right data for their purpose (Mišura & Žagar, 2016; Ramachandran et al., 2018). During the data purchase step, data consumers pay for data products, and data providers grant access to the purchased data (Musso et al., 2019; Tzianos et al., 2019). In the data usage phase, data consumers use the data, e.g., by enriching and aggregating it (Otto & Jarke, 2019).

Data storage and infrastructure

Data storage and infrastructure focus on information technology (IT) artifacts that enable effective data management (Dreibelbis et al., 2008; Tallon et al., 2014). In a data marketplace context, one major question concerns how the data should be stored. Spiekermann (2019) distinguishes between the centralized, decentralized, and hybrid storage approaches. With the centralized approach, data products are offered by data providers via a central location such as cloud infrastructure. With the decentralized approach, data products remain with data providers. The hybrid approach is a combination of both the centralized and the decentralized approaches.

Data pricing

With the trade of data between independent parties, the question of how to price data products becomes relevant. Data marketplaces often apply pay-per-use or subscription-based pricing models in line with their business models. Within a pay-per-use pricing model, data marketplaces charge data consumers for each consumed data product (Spiekermann, 2019; Truong et al., 2012). Within a subscription-based pricing model, data consumers have access to data products for a certain period. In addition to pay-per-use and subscription-based pricing models, data products can be provided free of charge. This often includes data from public authorities and non-profit organizations (Spiekermann, 2019). We also identified hybrid pricing models, such as the freemium pricing model, where data providers offer basic data products free of charge while charging a premium for more detailed data products (Thomas & Leiponen, 2016). Furthermore, data pricing includes determining the right price for data products (Truong et al., 2012). Apart from fixed prices for data products, data marketplaces apply more dynamic pricing strategies such as bidding (Maruyama et al., 2013; Parra-Arnau, 2018), progressive pricing (Spiekermann, 2019), the “pay what you want” approach (Zuiderwijk et al., 2014), and packaged pricing (Spiekermann, 2019).

Methodology

Taxonomy development method

Taxonomies support the systematic organization and description of the body of knowledge in a certain field (Glass & Vessey, 1995). As we aim to create a structured overview of mechanisms to instantiate data governance decision domains, a taxonomy is particularly well suited. The development of a taxonomy is a multistep process (Fiedler et al., 1996). We use the taxonomy development method by Nickerson et al. (2013), which comprises four main steps. The first step is the identification of the meta-characteristic. This is based on the purpose of the taxonomy and guides the choice of the remaining characteristics within the taxonomy. The second step comprises the selection of objective and subjective ending conditions. The third step initiates an iterative approach for the development of the taxonomy. Within each development cycle, the researcher can choose between an empirical-to-conceptual or a conceptual-to-empirical approach. The fourth step contains the validation of the taxonomy against the objective and subjective ending conditions. The taxonomy development ends if all objective and subjective ending conditions are fulfilled. Otherwise, the taxonomy development continues with the third step.
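The iterative part of this method (steps three and four) can be read as a loop that repeats development cycles until validation succeeds. The following Python sketch illustrates only this control flow. The function names `develop_taxonomy`, `toy_cycle`, and `toy_conditions`, the `max_cycles` guard, and the toy dimensions are assumptions introduced for illustration; they are not part of Nickerson et al. (2013).

```python
from typing import Callable, Dict, List, Tuple

Taxonomy = Dict[str, List[str]]  # dimension -> characteristics


def develop_taxonomy(
    run_cycle: Callable[[Taxonomy, int], Taxonomy],
    conditions_met: Callable[[Taxonomy], bool],
    max_cycles: int = 10,
) -> Tuple[Taxonomy, int]:
    """Repeat development cycles (step 3) until validation (step 4) succeeds."""
    taxonomy: Taxonomy = {}
    for cycle in range(1, max_cycles + 1):
        # Step 3: one empirical-to-conceptual or conceptual-to-empirical cycle.
        taxonomy = run_cycle(taxonomy, cycle)
        # Step 4: validate against the selected ending conditions.
        if conditions_met(taxonomy):
            return taxonomy, cycle
    raise RuntimeError("ending conditions not met within max_cycles")


# Hypothetical stand-ins: each cycle adds one dimension, and the only
# ending condition is that all three toy dimensions exist.
TOY_DIMENSIONS = ["data quality", "data security", "data pricing"]


def toy_cycle(taxonomy: Taxonomy, cycle: int) -> Taxonomy:
    updated = dict(taxonomy)
    updated[TOY_DIMENSIONS[cycle - 1]] = ["placeholder characteristic"]
    return updated


def toy_conditions(taxonomy: Taxonomy) -> bool:
    return len(taxonomy) == len(TOY_DIMENSIONS)


final_taxonomy, cycles_used = develop_taxonomy(toy_cycle, toy_conditions)
print(cycles_used, list(final_taxonomy))
```

Steps one and two (meta-characteristic and ending conditions) are represented here only indirectly, through the two callables passed to the loop.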

Data collection

The empirical-to-conceptual part of the taxonomy development method requires the researcher to select a sample of objects from which to derive characteristics (Nickerson et al., 2013). We considered objects for inclusion if they met the following two criteria: (a) the object represents a data marketplace or a data marketplace protocol enabling trading between data providers and data consumers; (b) the main products traded on the data marketplace are data products or data services. We selected multiple objects since evidence from multiple objects is considered more robust and generalizable than evidence from a single object (Herriott & Firestone, 1983).

We retrieved an initial list of 177 data marketplaces via the datarade.ai website. In a first round, we reviewed all 177 data marketplaces against the two inclusion criteria and reduced the list to 63 data marketplaces. In a second round, we reviewed the remaining 63 data marketplaces and excluded instances where the marketplace provider did not act as an independent intermediary but rather as an aggregator of different data providers. We also excluded instances which did not provide sufficient official and openly accessible information to be analyzed. In doing so, we reduced the list to 13 data marketplaces to be included in our study. The selected data marketplaces enabled trading personal data, corporate data, and IoT sensor data. Personal data represents data about natural persons such as gender, date of birth, place of residence, and personal interests. Corporate data comprises data about companies, such as company descriptions and financial market data. IoT sensor data encompasses data specifically collected from such IoT devices as sensors installed in cars, smart homes, and smart factories. Table 1 provides an overview of the examined data marketplaces.

Table 1 Overview of data marketplaces as of 24 June 2022

We consider the sample set of data marketplaces appropriate for studying the instantiation of data governance decision domains since the marketplaces support various types of data products and are located in diverse countries. Also, some data marketplaces enable the creation and sale of data analysis results, which requires the management of additional data stakeholders and complex data infrastructures. The primary source of evidence for our study was official publications from the sample set of data marketplaces. We collected data provided in whitepapers, reports, and on data marketplace websites.

Taxonomy development

Following the taxonomy development method by Nickerson et al. (2013), we started with the definition of the meta-characteristic. Since our purpose was to investigate how data marketplaces instantiated data governance decision domains, we chose data governance instantiation in data marketplaces as our meta-characteristic. We then determined the ending conditions for the validation of our taxonomy. We omitted the objective ending condition that at least one object was classified under every characteristic. We perceived characteristics solely derived from scientific literature as valid because they provided for differentiation and thus robustness of the taxonomy. Tables 2 and 3 provide an overview of the selected objective and subjective ending conditions. After each taxonomy development cycle, we checked to see whether the resulting taxonomy met all selected objective and subjective ending conditions. If the test result was negative, we initiated a new development cycle.

Table 2 Overview of objective ending conditions derived from Nickerson et al. (2013)
Table 3 Overview of subjective ending conditions derived from Nickerson et al. (2013)

In total, we conducted four development cycles to reach the final version of the taxonomy. In the first taxonomy development cycle, we used a conceptual-to-empirical approach and conceptualized dimensions, subdimensions, and characteristics taken from scientific literature. Based on Khatri and Brown (2010) and Abraham et al. (2019), we chose the following data governance decision domains as the initial dimensions: data quality, data security, data architecture, metadata, data lifecycle, and data storage and infrastructure. The resulting version of the taxonomy gave us a first impression of the taxonomy structure. During the second development cycle, we applied an empirical-to-conceptual approach. We used the open coding technique to turn collected raw data from the sample of objects into dimensions, subdimensions, and characteristics (Corbin & Strauss, 2015). We first assigned labels to key areas of text using a software tool for content analysis (Tallon et al., 2014). In doing so, we created 167 codes in total. Then, we reviewed the codes and filtered for those with a focus on the meta-characteristic, which reduced the number of codes to 80. We searched for common patterns among the codes and underlying raw data to derive generic characteristics. Afterward, we grouped related characteristics under dimensions and subdimensions. During this analysis step, we added data pricing as a new dimension, among others. We conducted a third development cycle because we had previously added the data pricing dimension. Also, the validation via objective ending conditions showed that the characteristics within the data lifecycle dimension were not mutually exclusive. We applied a conceptual-to-empirical approach to derive additional relevant subdimensions and characteristics from scientific literature for the data pricing dimension. Furthermore, we restructured characteristics within the data lifecycle dimension to make the characteristics mutually exclusive. Since we restructured characteristics, we did not meet all objective ending conditions. Therefore, we conducted a fourth empirical-to-conceptual development cycle and reclassified all 13 data marketplaces using the final version of the taxonomy. As we met all objective and subjective ending conditions this time, we concluded the taxonomy development process. Figure 1 shows an overview of the taxonomy development process.
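The mutual-exclusiveness problem reported for the data lifecycle dimension can be checked mechanically once every object has been classified. The sketch below is a minimal illustration under stated assumptions: the marketplace names, the characteristic labels, and the function `non_exclusive_dimensions` are hypothetical and do not reproduce the actual classification of the 13 data marketplaces.

```python
from typing import Dict, List, Set

# object -> dimension -> set of characteristics assigned to that object
Classification = Dict[str, Dict[str, Set[str]]]


def non_exclusive_dimensions(classification: Classification) -> List[str]:
    """Return dimensions in which at least one object has several characteristics."""
    offending = set()
    for dimensions in classification.values():
        for dimension, characteristics in dimensions.items():
            if len(characteristics) > 1:
                offending.add(dimension)
    return sorted(offending)


# Hypothetical classification with placeholder marketplaces.
example: Classification = {
    "marketplace_a": {
        "data lifecycle": {"data trading", "data usage"},
        "data storage": {"centralized"},
    },
    "marketplace_b": {
        "data lifecycle": {"data trading"},
        "data storage": {"decentralized"},
    },
}

violations = non_exclusive_dimensions(example)
print(violations)
```

A non-empty result signals that characteristics in the listed dimensions overlap and need restructuring before the ending conditions can be met.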

Fig. 1 Overview of the taxonomy development process

Findings

This section presents the findings from our analysis of 13 data marketplaces. We describe the results for each data governance decision domain. We also present selected quotations from our sources to substantiate our findings.

Data quality

Regarding data quality, we identified preventive and reactive measures among data marketplaces. On the preventive side, data marketplaces applied automated data validation methods during data onboarding. In doing so, they prevented flawed data products from being further processed and offered on the data marketplace. On the reactive side, we identified one data marketplace that offered a rating system based on consumer feedback to rank data providers. If data consumers detected and reported fake or incorrect data, the data marketplace reduced the rating of the data provider. Several data marketplaces planned to establish similar rating systems to rate data providers or data products.

“The data model is used within the data normalization process and plays a key role. It defines how values should be stored in the local data store and is used to identify rule violations, thus establishing a consistent level of quality and consistency. It enforces specific units, length and a structure on the stored data, making it possible to analyze the data. Only if the data is accurate, reliable, and formatted consistently, further processing will be possible.” (MADANA, 2018, p. 27) – Preventive measure

“In order to guarantee the validity of the data, Datapace employs several mechanisms - like seller reputation rating (…).” (Datapace, 2017, p. 4) – Reactive measure
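To make the two kinds of measures concrete, the following Python sketch shows a preventive check of a record against a data model (types, units, and length) and a reactive reduction of a provider rating. It is a sketch under stated assumptions, not the implementation of any analyzed marketplace: the data model, the field names, the `penalty` value, and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class FieldRule:
    expected_type: type
    unit: str          # required unit; empty string means "no unit"
    max_length: int    # maximum length of the value's string form


# Hypothetical data model for a temperature reading.
DATA_MODEL: Dict[str, FieldRule] = {
    "sensor_id": FieldRule(str, "", 16),
    "temperature": FieldRule(float, "celsius", 8),
}


def validate_record(record: Dict[str, Any], units: Dict[str, str]) -> List[str]:
    """Preventive measure: list rule violations; an empty list means the record passes."""
    violations = []
    for name, rule in DATA_MODEL.items():
        if name not in record:
            violations.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, rule.expected_type):
            violations.append(f"{name}: wrong type")
        if len(str(value)) > rule.max_length:
            violations.append(f"{name}: too long")
        if units.get(name, "") != rule.unit:
            violations.append(f"{name}: wrong unit")
    return violations


def updated_rating(rating: float, confirmed_reports: int, penalty: float = 0.5) -> float:
    """Reactive measure: lower a provider rating per confirmed report, floored at 0."""
    return max(0.0, rating - penalty * confirmed_reports)


good = validate_record({"sensor_id": "s-1", "temperature": 21.5}, {"temperature": "celsius"})
bad = validate_record({"sensor_id": "s-1", "temperature": "hot"}, {"temperature": "fahrenheit"})
print(good, bad, updated_rating(4.0, 2))
```

In this sketch, a record with violations would be rejected during onboarding, whereas the rating only changes after data consumers have reported a problem.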

Data security

Data marketplaces applied several mechanisms to ensure data security. Concerning the confidential storage of data, most data marketplaces applied encryption techniques such as public-key cryptography. Furthermore, we identified data fragmentation as a method whereby either the data payload was split into different fragments or the data providers’ identity information and the data payload were split. The fragments were stored in different physical storage locations. The data marketplaces we analyzed implemented data fragmentation as an additional mechanism to data encryption and therefore applied a hybrid approach.

“VETRI users will store their most sensitive data locally on their device by using state-of-the-art encryption techniques (…).” (VETRI, n.d., p. 9) – Data encryption

“Full Privacy: In this case data are stored in encrypted form, (…). Standard Privacy: Genetic data are stored as fragments making it impossible to identify the user.” (Zenome, 2017, p. 29) – Hybrid storage protection

“The data is fragmented to a number of unknown physical locations, and it is protected by strong encryption while in transit and in storage.” (Streamr, 2017, p. 19) – Hybrid storage protection
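The fragmentation idea can be illustrated with a short Python sketch that distributes a payload across several storage locations and reassembles it. The round-robin scheme, the function names, and the location labels are assumptions made for illustration. The sketch deliberately omits the encryption step that the analyzed marketplaces combined with fragmentation, and fragmentation of this simple kind offers no confidentiality guarantee on its own.

```python
from typing import List


def fragment(payload: bytes, n_locations: int) -> List[bytes]:
    """Distribute payload bytes round-robin across n_locations fragments."""
    if n_locations < 1:
        raise ValueError("n_locations must be at least 1")
    return [payload[i::n_locations] for i in range(n_locations)]


def reassemble(fragments: List[bytes]) -> bytes:
    """Interleave the fragments back into the original byte order."""
    n = len(fragments)
    total = sum(len(f) for f in fragments)
    out = bytearray(total)
    for i, frag in enumerate(fragments):
        out[i::n] = frag  # fragment i holds bytes i, i+n, i+2n, ...
    return bytes(out)


# Hypothetical payload and storage locations.
payload = b"genome-sample-0001"
stores = {f"location_{i}": frag for i, frag in enumerate(fragment(payload, 3))}
restored = reassemble(list(stores.values()))
print(stores, restored == payload)
```

In a hybrid setup as described above, each fragment would additionally be encrypted before being written to its location.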

In terms of data access control, data marketplaces provided the options to grant access instantly or based on consent. When applying instant access, data consumers received access to a data product immediately after purchase and without any explicit consent from data providers. Consent-based access enabled data providers to decide which data they wanted to sell to selected data consumers. We also identified two data marketplaces that applied a hybrid approach by allowing both authorization options.

“The user clients shows [sic] currently running projects requesting data access and users can control whether to give access or not based on their decision.” (Datum, 2017, p. 10) – Consent-based access

“Subscribing to streams can be restricted to certain users only, or be free to the public.” (Streamr, 2017, p. 13) – Hybrid access control measure
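The difference between instant and consent-based access can be expressed as a small decision rule. The sketch below is illustrative only: the class and attribute names are hypothetical, and it assumes that under consent-based access a consumer needs both a purchase and the provider's explicit consent. A hybrid marketplace, in this reading, simply lets each data product choose its mode.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class AccessMode(Enum):
    INSTANT = "instant"
    CONSENT = "consent-based"


@dataclass
class DataProduct:
    name: str
    mode: AccessMode
    purchasers: Set[str] = field(default_factory=set)
    consented: Set[str] = field(default_factory=set)  # consumers approved by the provider

    def has_access(self, consumer: str) -> bool:
        if consumer not in self.purchasers:
            return False
        if self.mode is AccessMode.INSTANT:
            return True  # access immediately after purchase
        return consumer in self.consented  # provider must approve explicitly


# Hypothetical products and consumer.
traffic = DataProduct("traffic stream", AccessMode.INSTANT, purchasers={"acme"})
survey = DataProduct("survey answers", AccessMode.CONSENT, purchasers={"acme"})

before_consent = survey.has_access("acme")
survey.consented.add("acme")
after_consent = survey.has_access("acme")
print(traffic.has_access("acme"), before_consent, after_consent)
```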

Regarding the confidential usage of data, we identified a few data marketplaces that only provided access to parts of the data by anonymizing the data. Two data marketplaces did not provide access to the raw data at all. Instead, they restricted data processing to the marketplace platform without providing data consumers with direct access to raw data.

“Data can be offered anonymously, so privacy is not violated.” (Streamr, 2017, p. 10) – Access to data parts

“Data will only be processed in secured environments and afterward deleted to minimize the risk of unwanted data breaches.” (MADANA, 2018, p. 17) – No access to data
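Both options for confidential data usage can be sketched in a few lines of Python. The first function exposes only part of each record by dropping identity fields; the second computes a result on the platform so that raw records are never returned to the consumer. The field names and functions are hypothetical, and removing direct identifiers as shown here is a simplification that does not by itself guarantee anonymity.

```python
from statistics import mean
from typing import Dict, Iterable

IDENTITY_FIELDS = {"name", "email"}  # hypothetical identity attributes


def anonymize(record: Dict[str, object]) -> Dict[str, object]:
    """Access to data parts: return the record without identity fields."""
    return {k: v for k, v in record.items() if k not in IDENTITY_FIELDS}


def platform_average(records: Iterable[Dict[str, object]], field_name: str) -> float:
    """No access to raw data: only the aggregated result leaves the platform."""
    return mean(float(r[field_name]) for r in records)


# Hypothetical personal data records.
records = [
    {"name": "A", "email": "a@example.org", "age": 30},
    {"name": "B", "email": "b@example.org", "age": 40},
]

shared_part = anonymize(records[0])
analysis_result = platform_average(records, "age")
print(shared_part, analysis_result)
```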

We also found evidence that data marketplaces supported the application of data contracts. These allowed data providers to determine the conditions under which their data products should be used and enabled data consumers to describe how they planned to use the data.

“Your rights to use the data are governed by a licence that has been drafted by the data provider. When you purchase data, you need to confirm that you accept the terms of the licence.” (Databroker, 2021) – Contract-based data usage control

Data architecture

Concerning data architecture, we identified standardized, proprietary, and hybrid approaches regarding the data format. Using the standardized approach, some data marketplaces required data providers to format data products according to a unified data model before publishing these products on the data marketplace. Applying the proprietary approach, data marketplaces allowed data providers to publish data products using proprietary data formats. Since the latter could have hindered data consumers from automatically interpreting data, some data marketplaces required the submission of the data payload format as part of the metadata. We also identified a data marketplace that applied a hybrid approach where data products in proprietary formats were automatically pre-processed and standardized based on standardized data models before storage.

“For each new sensor, we ask you to provide the following information: (…) Data Fields: The most essential part of the sensor configuration. Please provide information for every parameter that will be captured by the sensor and stored on the Tangle. (…) Parameter information consist [sic] of 3 fields: Field ID (…). Field Name (…). Field Unit (…).” (IOTA, 2020) – Standardized data format

“It is responsibility of data seller to provide a valid data source URL and give detailed description of the data stream and it’s [sic] format (it’s [sic] JSON schema) – so it can be easily consumed by data buyer.” (Datapace, 2017, p. 4) – Proprietary data format

“The normalization process builds on the interpretation of the data before the data is put into the local data store. The standardization process then reformats the data and creates a consistent data representation with fixed and discrete columns based on the data model. The advantage of standardization is that the conformity of the data guarantees simpler and more secure processing of the data.” (MADANA, 2018, p. 27) – Hybrid data format
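The hybrid approach can be pictured as a mapping step between a provider's proprietary fields and the marketplace's standardized data model. The following Python sketch is a minimal illustration under assumptions: the proprietary field names (`tempF`, `dev`), the target field names, and the `normalize` function are hypothetical and are not taken from any analyzed marketplace.

```python
from typing import Any, Callable, Dict, Tuple

# proprietary field -> (standard field, converter to the standard representation)
Mapping = Dict[str, Tuple[str, Callable[[Any], Any]]]

# Hypothetical mapping registered for one data provider.
PROVIDER_MAPPING: Mapping = {
    "tempF": ("temperature_celsius", lambda f: round((float(f) - 32) * 5 / 9, 2)),
    "dev": ("device_id", str),
}


def normalize(raw: Dict[str, Any], mapping: Mapping) -> Dict[str, Any]:
    """Rename and convert provider fields; reject fields without a mapping."""
    unknown = set(raw) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped fields: {sorted(unknown)}")
    return {
        target: convert(raw[source])
        for source, (target, convert) in mapping.items()
        if source in raw
    }


standardized = normalize({"tempF": "50", "dev": 42}, PROVIDER_MAPPING)
print(standardized)
```

Rejecting unmapped fields is one possible design choice; a marketplace could equally keep such fields as uninterpreted metadata.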

Metadata

The data marketplaces used metadata to support data providers in organizing their data products and to help data consumers discover relevant datasets. Most analyzed data marketplaces provided a marketplace-specific set of metadata fields to capture metadata when onboarding new data products. Common metadata fields comprised the description of the offered data product, the data owner, the price, access permissions, and the terms and conditions of data use.

“For each new sensor, we ask you to provide the following information: Device ID (…). Device Type (…). Company (…). Location (city/country) (…). GPS Coordinates (latitude/longitude) (…). Price of the data stream (…).” (IOTA, 2020) – Marketplace-specific vocabulary
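A marketplace-specific metadata vocabulary of this kind can be modeled as a fixed record type that providers fill in during onboarding and consumers query during discovery. The sketch below mirrors the field names from the quotation above; the Python class, the `discover` function, the example values, and the unit of the price are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SensorMetadata:
    device_id: str
    device_type: str
    company: str
    location: str              # city/country
    gps: Tuple[float, float]   # (latitude, longitude)
    price: float               # price of the data stream; unit is an assumption


def discover(catalog: List[SensorMetadata], device_type: str, max_price: float) -> List[str]:
    """Return the IDs of devices matching a type and a price ceiling."""
    return [
        m.device_id
        for m in catalog
        if m.device_type == device_type and m.price <= max_price
    ]


# Hypothetical catalog entries.
catalog = [
    SensorMetadata("dev-1", "weather", "Example Co", "Zurich/CH", (47.37, 8.54), 10.0),
    SensorMetadata("dev-2", "weather", "Example Co", "Bern/CH", (46.95, 7.45), 25.0),
    SensorMetadata("dev-3", "traffic", "Sample Ltd", "Basel/CH", (47.56, 7.59), 5.0),
]

matches = discover(catalog, "weather", max_price=15.0)
print(matches)
```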

Data lifecycle

Regarding the data lifecycle, we identified two types of data marketplaces. The first type of marketplace focused on data trading, encompassing the phases of data onboarding, data discovery, data purchase, and data offboarding. Most analyzed data marketplaces fell under this category. The second marketplace type contained an additional data usage phase. The data usage phase supported the processing of data within the marketplace platform and the provisioning of analysis results to data consumers.

“Via their gateway operator, the sensor owners place the data generated by their sensors up for sale (…), and buyers can discover and purchase access to the data using that same DTX token. (…) Data generated by the sensors of their clients is sent (…) to their dAPI which check who has purchased access and send the data directly on to the location specified by the buyer on purchasing.” (Databroker, n.d., p. 6) – Data trade-focused marketplace

“In case of a mobile survey all answers from all consumers worldwide are aggregated and visually presented on selectable charts and in table form. Since GPS position of each consumer is tracked, Opiria can display the location of each answer on a world map.” (Opiria, 2017, p. 13) – Data usage-focused marketplace
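The two marketplace types differ only in whether the lifecycle includes a data usage phase, which can be captured in a small enumeration. In the sketch below, the placement of the usage phase between purchase and offboarding is an assumption, as are the function names.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    ONBOARDING = "data onboarding"
    DISCOVERY = "data discovery"
    PURCHASE = "data purchase"
    USAGE = "data usage"          # only present in the second marketplace type
    OFFBOARDING = "data offboarding"


def lifecycle(supports_usage: bool) -> List[Phase]:
    """Phases of a trade-focused (False) or usage-focused (True) marketplace."""
    return [p for p in Phase if supports_usage or p is not Phase.USAGE]


def marketplace_type(phases: List[Phase]) -> str:
    """Classify a marketplace by whether its lifecycle contains a usage phase."""
    return "data usage-focused" if Phase.USAGE in phases else "data trade-focused"


trade_phases = lifecycle(supports_usage=False)
usage_phases = lifecycle(supports_usage=True)
print(marketplace_type(trade_phases), marketplace_type(usage_phases))
```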

Data storage and infrastructure

Regarding data storage, we identified the centralized and decentralized storage approaches within the data marketplaces. A few data marketplaces applied the centralized storage approach and used a central database or a central, cloud-based storage solution to store the data. However, most analyzed data marketplaces applied a decentralized storage approach, of which we identified three forms. The first form encompassed data being stored on the data provider’s device such as a mobile phone. The second form comprised a decentralized storage node architecture provided by the data marketplace. Independent storage nodes were paid for providing computing power and storage capacity to replicate and store the data in a distributed network. The third decentralized storage form entailed data being stored at a storage vendor of the data provider’s choice.

“oneTRANSPORT provides a cloud-based platform (…).” (oneTRANSPORT, 2017, p. 6) – Centralized storage

“VETRI users will store their most sensitive data locally on their device (…).” (VETRI, n.d., p. 9) – Decentralized storage

“A Storage Node receives the data and stores the data. The data is replicated to many other storage nodes.” (Datum, 2017, p. 14) – Decentralized storage

“Built in the dAPI, there are connectors to integrate with the leading IoT and bigdata [sic] storage vendors, leaving the buyer the choice on where their data needs to be sent.” (Databroker, n.d., p. 24) – Decentralized storage
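The second decentralized form, in which data is replicated across independent storage nodes, can be illustrated with a simplified sketch. It is not the implementation of any analyzed marketplace: `StorageNode`, `replicate`, and `fetch` are hypothetical names, the nodes are in-memory stand-ins, and selecting the first `copies` nodes is an assumption made for brevity.

```python
import hashlib


class StorageNode:
    """Hypothetical in-memory stand-in for an independent storage node."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blobs = {}

    def put(self, key, data):
        self.blobs[key] = data

    def get(self, key):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.blobs[key]


def replicate(data, nodes, copies=3):
    """Store `data` on the first `copies` nodes and return its content key."""
    if copies > len(nodes):
        raise ValueError("not enough storage nodes for the requested copies")
    key = hashlib.sha256(data).hexdigest()
    for node in nodes[:copies]:
        node.put(key, data)
    return key


def fetch(key, nodes):
    """Return the data from the first reachable node that holds the key."""
    for node in nodes:
        try:
            return node.get(key)
        except (ConnectionError, KeyError):
            continue  # node is offline or holds no replica; try the next one
    raise LookupError("no reachable replica")
```

Because several nodes hold a copy, `fetch` still succeeds when one of them is offline, which is the fault tolerance that replication is meant to provide.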

Data pricing

Regarding data pricing, most analyzed data marketplaces applied a pay-per-use or subscription-based pricing model. Some data marketplaces applied a hybrid pricing model where data providers could decide if they offered data products at a certain price per use, based on a subscription, or free of charge.

“Consumers that consent to provide their data would trigger a smart contract between the consumer and the company. On this basis the consumer is paid with PDATA tokens and the company receives the requested personal data.” (Opiria, 2017, p. 3) – Pay-per-use

“(…) Enigma creates a decentralized data marketplace that allows people, companies and organizations to contribute data (…), which users of the system can then subscribe to and consume.” (Enigma, 2020) – Subscription-based

“Data can be purchased as one-off, or on an on-going subscription basis.” (Datum, 2017) – Hybrid pricing model

“The Marketplace is filled with both paid and free products, offering data producers an opportunity to either monetise their data or make it freely available to everyone.” (Streamr, 2020) – Hybrid pricing model
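The pricing models described above can be contrasted in a short sketch. It assumes a hypothetical `Offer` record in which the data provider selects one model per data product, as in the hybrid case; the names and prices are illustrative and are not taken from any analyzed marketplace.

```python
from dataclasses import dataclass
from enum import Enum


class PricingModel(Enum):
    PAY_PER_USE = "pay-per-use"
    SUBSCRIPTION = "subscription-based"
    FREE = "free of charge"


@dataclass(frozen=True)
class Offer:
    """Hypothetical data product offer; the provider chooses the model."""
    model: PricingModel
    price: float = 0.0  # price per use or per billing period, depending on model


def total_cost(offer, uses=0, periods=0):
    """Cost for a data consumer under the pricing model chosen by the provider."""
    if offer.model is PricingModel.PAY_PER_USE:
        return uses * offer.price
    if offer.model is PricingModel.SUBSCRIPTION:
        return periods * offer.price
    return 0.0  # free of charge
```

For example, `total_cost(Offer(PricingModel.PAY_PER_USE, 2.0), uses=3)` charges per access, whereas a subscription offer is charged per billing period regardless of the number of accesses.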

For their pricing strategy, most analyzed data marketplaces applied a fixed price approach. Some data marketplaces applied dynamic pricing in the form of a bidding process, where data consumers offered a price to data providers, who then accepted or declined the offer or made a counteroffer.

“Price of the data stream: Here you can define the cost of the sensor data.” (IOTA, 2020) – Fixed price

“A Data Consumer declares interest to purchase the piece of data. (…) The User receives a data purchase request with the details such as purchaser and price offered. He can agree to the purchase request or counter offer with a modified proposal.” (Datum, 2017, p. 14) – Dynamic pricing
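The bidding process can be sketched as a single negotiation round. The decision rule used here (a reserve price and a walk-away price on the provider side, a budget on the consumer side) is an assumption for illustration; the analyzed documentation does not specify how providers decide between accepting, declining, and making a counteroffer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Response(Enum):
    ACCEPT = "accept"
    DECLINE = "decline"
    COUNTER = "counteroffer"


@dataclass(frozen=True)
class Reply:
    response: Response
    price: Optional[float] = None


def provider_reply(offered, reserve, walk_away):
    """Hypothetical provider policy for answering a purchase request."""
    if offered >= reserve:
        return Reply(Response.ACCEPT, offered)
    if offered >= walk_away:
        return Reply(Response.COUNTER, reserve)  # counter with the reserve price
    return Reply(Response.DECLINE)


def negotiate(offered, budget, reserve, walk_away):
    """Return the agreed price after one round, or None if no deal is reached."""
    reply = provider_reply(offered, reserve, walk_away)
    if reply.response is Response.ACCEPT:
        return reply.price
    if reply.response is Response.COUNTER and reply.price <= budget:
        return reply.price  # consumer accepts the counteroffer
    return None
```

A fixed price approach corresponds to the special case in which the provider only accepts offers equal to the listed price and never counters.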

Discussion

Our findings show that data marketplaces have multiple options to instantiate data governance decision domains. We observed certain tendencies, but also limitations of specific mechanisms. In the following, we discuss our main findings in the context of scientific literature and present our final taxonomy.

Regarding data security, our findings suggest that data marketplaces offer limited protection of sensitive data. Though data marketplaces apply anonymization techniques, Narayanan and Shmatikov (2008) have demonstrated that the de-anonymization of datasets is possible with little auxiliary information. Secured execution environments, which restrict direct access to raw data, promise a higher level of protection. Nevertheless, in cases where raw data is used to train machine learning algorithms, adversaries could identify the raw data by using model inversion attacks (Fredrikson et al., 2015). It is therefore essential for data providers to check thoroughly whether their data products contain sensitive data. Furthermore, in most analyzed data marketplaces, data providers transfer data products to data consumers. Thus, the main mechanism to control data usage and protect data ownership rights is the application of data contracts between data providers and consumers. However, the application of data contracts cannot fully prevent the illegal and malevolent copying and reselling of data products. In cases of unauthorized reselling, being able to prove data ownership is essential. Technology-based data usage control mechanisms such as watermarking could help to prove data ownership rights. By applying watermarking, data providers could embed watermark data such as data ownership information into data products (Agarwal et al., 2019; Vlachos et al., 2015).

Furthermore, our findings suggest that data quality management within data marketplaces is still at an early stage. Almost half of the data marketplaces did not actively address data quality. Where we identified measures, these were mainly rating systems consistent with those proposed by Mišura and Žagar (2016), Ramachandran et al. (2018), and Zuiderwijk et al. (2014). With these reactive approaches, the responsibility for identifying and reporting data quality issues often lies with data consumers. A hybrid approach comprising both preventive and reactive measures can help to overcome this drawback. For example, guided approaches such as LANG can help data consumers reactively identify data quality issues in datasets for which they have minimal control or knowledge of underlying rules (Zhang et al., 2019). Simultaneously, data marketplace providers can use LANG to prevent flawed data products from being onboarded on the marketplace. Another solution is the provision of warranties for data products: if data providers do not deliver data products at the expected level of quality, data consumers have the right to cancel the purchase and demand a refund. The terms and conditions could be stipulated either by law or by data marketplace providers, similar to guarantees provided by marketplaces such as Amazon (Amazon, 2022) and payment providers such as PayPal (PayPal, 2022).

In terms of data architecture, our findings do not reveal a clear tendency towards a specific data formatting approach. Instead, our findings suggest that the approaches described by Tzianos et al. (2019), Özyilmaz et al. (2018), and Nagorny et al. (2018) are valid in different contexts. Data marketplaces that aim to support data consumers in the automatic processing of data products are likely to provide a standardized data format for the data payload. This applies in particular to data products that are published regularly or in real time, such as data streams. Data marketplaces that want to keep adoption barriers low for data providers might enable the use of proprietary data formats. This might especially be the case for data marketplaces that are new to the market. Those data marketplaces that focus on convenience for both data providers and data consumers might apply a hybrid approach.

Regarding data storage and infrastructure, our findings show a tendency towards decentralized storage solutions. Preserving data within data providers’ storage systems offers data providers an increased level of control over their data assets (Spiekermann, 2019). Also, storing data using distributed storage nodes enables the scalability of storage and facilitates the availability of data and fault tolerance through data replication. These findings contrast with data governance in traditional companies where the inclusion of external IT infrastructure is negatively related to data governance maturity (Borgman et al., 2016). Also, the storage of data in several disparate databases often inhibits the accessibility and consistency of data (Cheong & Chang, 2007; Tallon et al., 2014). The high level of standardization and integration within the marketplace platform architectures could be reasons why a decentralized storage approach is successfully applied.

Concerning data pricing, our results confirm earlier findings showing that the pay-per-use model and the subscription-based model are the more common pricing models applied in data marketplaces (Truong et al., 2012). Most data marketplaces within our sample applied one of these two pricing models. However, we also identified data marketplaces that allowed data providers to offer data products using a pricing model of their choice. This approach might support marketplace adoption by attracting more data marketplace participants. Within the set of dynamic pricing strategies, we only identified bidding as an applied pricing strategy; we did not observe other dynamic pricing strategies such as progressive pricing, “pay what you want”, or packaged pricing. We suspect there are different reasons behind this result. Progressive pricing is applied where the dissemination of data products is to be restricted (Spiekermann, 2019). Given that data marketplaces are fairly novel, data providers likely have no incentive to restrict the dissemination of their data products. Furthermore, the commercial interest of data providers and data marketplaces pre-empts the application of a “pay what you want” pricing approach. However, increased adoption of data marketplaces could bring along the application of additional pricing strategies.

Figure 2 shows our resulting taxonomy of data governance decision domains in data marketplaces. Per characteristic, the number in the bottom right corner illustrates how often we found that characteristic within the analyzed data marketplaces. A hybrid characteristic indicates that a data marketplace applies a combination of the characteristics within the respective subdimension. The taxonomy meets all selected objective ending conditions. Each dimension of the taxonomy consists of mutually exclusive and collectively exhaustive characteristics. All objects were examined. During the last development cycle, we did not merge or split any objects. We also did not add, merge, or split any dimensions, subdimensions, or characteristics during the last development cycle. Every dimension, subdimension, characteristic, and combination of characteristics is unique within the taxonomy. In addition, the taxonomy meets all subjective ending conditions. The taxonomy is concise since the number of dimensions is in the proposed range of seven plus or minus two. The taxonomy is robust since the added characteristics provide for differentiation among objects. The taxonomy is comprehensive since we were able to classify all 13 data marketplaces. By adding dimensions and characteristics during development cycles, we were able to demonstrate that the taxonomy is extendible. Furthermore, the taxonomy is explanatory because the dimensions, subdimensions, and characteristics provide explanations of the nature of the objects.

Fig. 2 Taxonomy of data governance decision domains in data marketplaces
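Two of the checks described above, namely that every object is classified with exactly one valid characteristic per dimension and that occurrences are counted per characteristic, can be sketched as follows. The excerpt of the taxonomy and the three classified objects are hypothetical placeholders and do not reproduce the coding of the 13 analyzed data marketplaces.

```python
from collections import Counter

# Hypothetical excerpt of the taxonomy: subdimension -> allowed characteristics.
TAXONOMY = {
    "data storage": {"centralized", "decentralized"},
    "pricing model": {"pay-per-use", "subscription-based", "hybrid"},
}

# Hypothetical classification of three placeholder marketplaces.
CLASSIFICATION = {
    "marketplace A": {"data storage": "centralized", "pricing model": "subscription-based"},
    "marketplace B": {"data storage": "decentralized", "pricing model": "hybrid"},
    "marketplace C": {"data storage": "decentralized", "pricing model": "pay-per-use"},
}


def check_classification(taxonomy, classification):
    """Raise ValueError unless each object has one valid characteristic per dimension."""
    for obj, assigned in classification.items():
        if set(assigned) != set(taxonomy):
            raise ValueError(f"{obj}: dimensions do not match the taxonomy")
        for dimension, characteristic in assigned.items():
            if characteristic not in taxonomy[dimension]:
                raise ValueError(f"{obj}: unknown characteristic {characteristic!r}")


def frequencies(classification):
    """Count how often each characteristic occurs, per dimension."""
    counts = {}
    for assigned in classification.values():
        for dimension, characteristic in assigned.items():
            counts.setdefault(dimension, Counter())[characteristic] += 1
    return counts
```

Storing one characteristic per dimension for each object enforces mutual exclusivity by construction, while the validation step covers collective exhaustiveness for the classified objects.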

Conclusion

Data is at the center of digital transformation. Facilitating the exchange of data between independent market participants has the potential to generate significant economic, societal, and environmental benefits. However, the fear of unintentionally releasing sensitive data and the lack of control over data usage, alongside accessibility and interoperability issues, create trust-related and technical barriers to data sharing (World Economic Forum, 2020). To overcome these barriers, the trade and exchange of data should be accompanied by robust data governance practices, increasing the level of certainty and producing new opportunities. Hence, the research reported in this paper analyzed the emerging topic of data marketplaces from a data governance perspective. The following research question framed our study: how do data marketplaces instantiate data governance decision domains? We answered this question by developing a taxonomy comprising the subdimensions and characteristics of data governance decision domains in the context of data marketplaces.

Our study has the following limitations. As the primary source of evidence, we reviewed official documentation such as whitepapers, reports, and information provided on data marketplace websites. Future research should conduct case studies collecting data from interviews and observations to enhance our findings. In addition, we used a sample size of 13 data marketplaces. Future research should examine a larger sample size to validate the robustness and comprehensiveness of our taxonomy. Furthermore, our data did not allow for rigorous testing of how different types of data products influence the instantiation of data governance decision domains. Future research should analyze the possible configurations of data governance decision domains based on different types of data products. Also, future research should investigate which roles and governance bodies have the decision-making authority within each decision domain and which marketplace actor takes on which role.

The results of our study advance scientific literature. To the best of our knowledge, our study is the first to investigate data marketplaces through a data governance lens and thus combine these two research strands. The taxonomy of data governance decision domains offers a common terminology that can be used by researchers to share their findings with other members of the information systems community. Moreover, the taxonomy represents a foundation for further scientific investigation and theory-building. For example, the taxonomy can be used to study relationships between the taxonomy concepts and other variables of interest (Glass & Vessey, 1995). The taxonomy also enables researchers to understand divergence in previous research findings regarding data marketplaces (Sabherwal & King, 1995). Additionally, the taxonomy can be used to define data governance archetypes for data marketplaces. From a practitioner’s perspective, the taxonomy highlights relevant data governance decision domains and instantiation options in the context of data marketplaces. When trading and exchanging data with third parties, neglecting aspects such as data security, privacy, and data quality can have unforeseen consequences. Marketplace providers can use our findings to develop a data governance strategy in a structured and thoughtful manner. Our results can also be used by traditional companies aiming to implement an internal data marketplace.