1 Introduction

The motivation of governments and private organizations to open data has increased considerably over the last few years. Creating transparency and accountability, sustaining citizen engagement, and enabling business innovation are the main drivers for opening more data [1,2,3,4]. The disclosure of data is expected to improve decision-making by both government and society [3, 5]. Furthermore, opening data can improve an organization’s reputation by showing that it is an open institution [6].

However, although diverse types of datasets have already been opened [7,8,9], in reality many datasets are still not opened [10]. Data providers are reluctant to open datasets for several reasons, including: (1) barriers to implementing the systems [11, 12]; (2) risks such as inaccuracy, misuse, sensitivity, and inconsistency of the data [3, 10, 12,13,14,15,16,17]; and (3) inappropriate interpretation of the data resulting in an inadequate comprehension of the data [3]. Moreover, mistakes in interpreting data or misuse of data can jeopardize the reputation of data providers [11]. As a result, many datasets remain closed, although this might not be necessary.

The goal of this study is to develop a Fuzzy Multi-Criteria Decision Making (FMCDM) approach to analyze the risks and benefits of opening data and to determine the best alternative decision for a given dataset. Fuzzy set theory is used in this research to handle the decision-making problem of selecting among alternative statuses for a dataset. The alternatives are evaluated with an FMCDM based on the Fuzzy Analytic Hierarchy Process (FAHP) [18, 19]. The main function of fuzzy logic is to capture the expertise of open data experts and to express it in a computational form [20,21,22]. Fuzzy theory is based on intuitive reasoning that takes into account human subjectivity and imprecision, which are common in natural language [23]. Natural language is an intricate structure, both in human communication and in the way human beings think [23, 24].

Fuzzy theory is used in this paper to provide a mathematical basis for emulating the higher cognitive functions of human thought and perception associated with weighing the risks and benefits of opening data. The main function of the FMCDM is to assess the alternatives with respect to predetermined criteria for a single decision [25]. The suitability of each alternative with respect to the criteria, and the priority weight of each criterion, can be analyzed and computed using linguistic values expressed as fuzzy numbers [20, 26]. FAHP, furthermore, is used to determine the preference weights of the criteria from experts’ judgments [18, 27]. The scores for each criterion are summed to rank the alternatives [28, 29].

The FAHP technique used in this study consists of the following six steps [18, 19, 27]: (1) select the expert team; (2) determine the evaluation criteria and construct the hierarchy, including alternatives; (3) construct the pairwise comparison matrix and evaluate the relative importance of the criteria; (4) transform the linguistic terms into triangular fuzzy numbers; (5) calculate the fuzzy weight matrix and check the consistency of the pairwise comparison matrix; and (6) select the best alternative. A dataset of health patient records is used in the illustration to show how multiple risk and benefit criteria can be analyzed with the FMCDM approach. The four possible decisions are to open the dataset completely, maintain suppression, provide limited access, or keep the dataset closed. These are the alternatives for the FMCDM, and they are analyzed against four main criteria: data sensitivity and data ownership as the risk criteria, and data availability and data trustworthiness as the benefit criteria. Data sensitivity and data ownership are selected because they can represent privacy violation issues contained in the health patient records dataset. For example, in the case of data sensitivity, releasing the actual values of a patient’s name, date of birth, place of birth, home address, or insurance provider could lead to misuse by unauthorized users. Data availability and data trustworthiness are chosen because they can reflect the benefits of transparency and accountability in opening data. Each criterion has sub-criteria that further refine the risks and benefits. In Sect. 3.3, we explain the definitions of the sub-criteria and their relationships in more detail.

This paper consists of six sections. Section 1 presents the rationale of this research, and Sect. 2 discusses related work on decision-making for opening data. Section 3 describes the approach, including the proposed flow process, the alternatives, and the selection of criteria for the FMCDM, which is based on the FAHP method. Section 4 provides the illustration and results, and Sect. 5 presents some findings of the study. Finally, Sect. 6 concludes the paper.

2 Related Work

To present the current approaches to decision analysis in the domain of open data, we reviewed the literature, which is summarized in Table 1. We found only three works on decision-making analysis for opening data. They use the following methods: (1) a trade-off method that weighs the values and risks of open data through interview sessions with specific groups such as civil servants and archivists; (2) a decision-support framework used to develop a prototype based on the open data ecosystem for specific groups such as businesses and private organizations; and (3) an iterative method using Bayesian belief networks to weigh the risks and benefits of opening data.

Table 1. The previous methods of decision support for opening data

Yet none of these related works uses an FMCDM approach to measure and determine the best alternative when deciding on a single status for a dataset. Possible advantages of the FMCDM approach compared to the three other methods are: (1) the capability to take into account human subjectivity and imprecision in natural language [32]; (2) the assessment of alternatives with respect to predetermined criteria for a single decision [25]; and (3) its simplicity in evaluating multiple conflicting criteria, which is one of the most common decision-making problems addressed in the literature [25, 32].

3 Decision-Making Approach

In this section, we describe the decision-making approach for analyzing the risks and benefits of open data. It has four subsections: the flow process of the proposed method, the alternatives, the selection of criteria, and the FAHP technique.

3.1 Flow Process of Proposed Method

To describe how the FMCDM approach works, we use a decision-making flow with three main phases: data source, evaluation, and decision. The process starts with the selection of a dataset from the data source, which creates the input for the evaluation phase. The input data are then processed in the evaluation phase. The output of the evaluation feeds the decision phase, which suggests a decision by showing the ranking of decision priorities, as shown in Fig. 1.

Fig. 1. The flow process of the approach

The flow process consists of the data source, the evaluation of the input data, and the decision. Figure 1 illustrates the stages of analyzing the risks and benefits of opening data, which can be described as follows:

  • Data Source: First, we select the type of dataset. In this study, we have chosen health patient records, and specifically its Table 1 (Diagnosed Stage; see Fig. 3), as the object to be analyzed. To define the criteria and sub-criteria, an extensive literature review on the risks and benefits of opening data was carried out (see Sect. 2). In this study, we designed four criteria and eight sub-criteria of risks and benefits as the input data.

  • Evaluation: In the second stage, we use FMCDM to assess the alternatives against the criteria defined in the data source phase; the criteria are rated with linguistic values expressed as fuzzy numbers. The FMCDM relies on the Fuzzy AHP technique, which measures the relative importance of the defined criteria for the decision-making problem. To quantify the relative importance of the risks and benefits, we draw on experts’ judgment. The evaluation by the experts in AHP has two main steps [27, 33]. First, the experts rank the criteria in descending or ascending order of significance, determine the most important criterion, and compare it with the others. For example, an expert may judge that data sensitivity (C1) is essentially more important than data ownership (C2). Second, the experts determine the criteria weights by transforming the pairwise comparison matrix into triangular fuzzy numbers, as can be seen in Fig. 5.

  • Decision: Finally, the outcome of the flow process is the set of final weights of the alternatives; the alternative with the highest weight is the priority decision.
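The evaluation stage can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the names `SCALE`, `reciprocal`, and `build_matrix` are hypothetical, and of the scale entries only "equal" as (1, 1, 3) is stated in the text; the entry for "essential" is an assumption made for illustration. The sketch shows how an expert's judgments for the upper triangle of a comparison matrix determine the lower triangle through the reciprocal of a triangular fuzzy number.

```python
from typing import Dict, List, Tuple

TFN = Tuple[float, float, float]  # triangular fuzzy number (lower, middle, upper)

# Hypothetical linguistic scale. Only "equal" -> (1, 1, 3) is stated in the
# text; the "essential" entry is an assumption made for illustration.
SCALE: Dict[str, TFN] = {
    "equal": (1.0, 1.0, 3.0),
    "essential": (3.0, 5.0, 7.0),
}


def reciprocal(t: TFN) -> TFN:
    """Reciprocal of a positive TFN: (l, m, u)^-1 = (1/u, 1/m, 1/l)."""
    l, m, u = t
    return (1.0 / u, 1.0 / m, 1.0 / l)


def build_matrix(n: int, judgments: Dict[Tuple[int, int], str]) -> List[List[TFN]]:
    """Build an n x n fuzzy pairwise comparison matrix.

    `judgments` maps a pair (i, j) with i < j to a linguistic term.
    The diagonal is (1, 1, 1) and the lower triangle holds reciprocals.
    """
    matrix: List[List[TFN]] = [[(1.0, 1.0, 1.0)] * n for _ in range(n)]
    for (i, j), term in judgments.items():
        matrix[i][j] = SCALE[term]
        matrix[j][i] = reciprocal(SCALE[term])
    return matrix


# Two sub-criteria judged "equal", as in the C1.1 versus C1.2 example.
m = build_matrix(2, {(0, 1): "equal"})
```

Storing only the upper-triangle judgments keeps the matrix reciprocal by construction, so an expert never has to enter the same comparison twice.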

3.2 Alternatives

The four alternatives for opening data considered in this paper are: opening the dataset (A1), maintaining suppression (A2), providing limited access (A3), and keeping the dataset closed (A4). First, the alternative “opening the dataset” applies when publishing the dataset presents a low risk to the identity of an individual or organization, and/or the potential benefits of the dataset substantially outweigh the potential risks. Second, the alternative “maintaining suppression” means removing a data field and/or placing individual records into particular groups or generating unique characteristics so that personal identity is not revealed. In this alternative, data that might create significant risks are not opened in their actual form, as the potential benefits do not outweigh the possible risks. Third, the alternative “limited access” means that only a certain group is given access to the data, so the level of openness is limited. Often those who gain access have to sign a document that outlines the rules of access. This alternative applies when releasing the dataset would create a moderate risk, or when the potential benefits of the dataset do not outweigh the potential privacy risks. Fourth, the alternative “keeping the dataset closed” applies when publishing the dataset would generate a very high risk to an individual or organization that significantly outweighs the potential benefits.

3.3 Selection of Criteria

Figure 2 represents the hierarchy of the four criteria, eight sub-criteria, and four alternatives. The four criteria C1, C2, C3, and C4 denote data sensitivity, data ownership, data availability, and data trustworthiness, respectively. Data sensitivity (C1) comprises two sub-criteria: individual life-threatening (C1.1) and data identifiable (C1.2). Individual life-threatening (C1.1) is defined as a potential risk to an individual’s personal life arising from the possibility of recognizing sensitive values in the dataset. Data identifiable (C1.2) is defined as the potential leak of identifiable personal, organizational, business, or even government data, e.g. by combining several attributes of the dataset.

Fig. 2. The hierarchy of criteria and alternatives

The second criterion is data ownership (C2), which consists of two sub-criteria: metadata scanning (C2.1) and fake or misleading (C2.2). Metadata scanning (C2.1) refers to the possibility of discovering the properties and structure of the dataset. Fake or misleading (C2.2) refers to the possibility that a user changes or modifies the dataset, leading to unreliable and wrong decisions. Data availability (C3) is the third criterion and has two sub-criteria: data manageability (C3.1) and data recoverability (C3.2). Data manageability (C3.1) is defined as the opportunity to manage the availability and accessibility of the dataset. Data recoverability (C3.2) indicates that delivering a dataset can have a highly positive impact on recovering the availability of the data. The fourth criterion is data trustworthiness (C4), which consists of two sub-criteria: data traceability (C4.1) and data authenticity (C4.2). Data traceability (C4.1) is the possibility of tracing the source of the dataset. Data authenticity (C4.2) is the possibility of verifying the authenticity of the data.
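The hierarchy described above can be written down as a small data structure. The following is a sketch under assumptions: the names `HIERARCHY`, `ALTERNATIVES`, and `count_judgments` are hypothetical, and the count assumes that every level (criteria, sub-criteria within each criterion, and alternatives under each sub-criterion) is compared pairwise, which the text does not state explicitly.

```python
from typing import Dict, List

# Criteria and sub-criteria as named in Sect. 3.3.
HIERARCHY: Dict[str, List[str]] = {
    "C1 data sensitivity": ["C1.1 individual life-threatening", "C1.2 data identifiable"],
    "C2 data ownership": ["C2.1 metadata scanning", "C2.2 fake or misleading"],
    "C3 data availability": ["C3.1 data manageability", "C3.2 data recoverability"],
    "C4 data trustworthiness": ["C4.1 data traceability", "C4.2 data authenticity"],
}

# Alternatives as named in Sect. 3.2.
ALTERNATIVES: List[str] = [
    "A1 open the dataset",
    "A2 maintain suppression",
    "A3 limited access",
    "A4 keep the dataset closed",
]


def judgments(n: int) -> int:
    """Number of upper-triangle entries in an n x n pairwise comparison matrix."""
    return n * (n - 1) // 2


def count_judgments(hierarchy: Dict[str, List[str]], alternatives: List[str]) -> int:
    """Judgments one expert gives if every level is compared pairwise (assumption)."""
    total = judgments(len(hierarchy))  # criteria against each other
    for subs in hierarchy.values():
        total += judgments(len(subs))  # sub-criteria within one criterion
        total += len(subs) * judgments(len(alternatives))  # alternatives per sub-criterion
    return total


total_judgments = count_judgments(HIERARCHY, ALTERNATIVES)
```

Under that assumption, the count gives a rough idea of the elicitation effort required from each expert.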

3.4 Fuzzy AHP Technique

AHP is a quantitative method that deals with multi-attribute, multi-criteria, multi-period problems hierarchically [34]. AHP alone, however, cannot handle the fuzziness that arises during decision making [35]. Hence, this study uses Fuzzy AHP, an extension of the conventional AHP method that integrates fuzzy comparison ratios, for the multi-criteria analysis [18, 27, 34, 36]. It applies the triangular fuzzy numbers of fuzzy set theory directly in the pairwise comparison matrix of the AHP. The geometric mean method is used to generate fuzzy weights and performance scores [37]. The steps of the Fuzzy AHP can be summarized as follows:

  • Step 1. Select experts. The quality of the evaluation process depends on experts’ knowledge and experience. Hence the selection of experts is crucial.

  • Step 2. Determine the evaluation criteria and construct the hierarchy including alternatives.

  • Step 3: Construct pairwise comparison matrix and evaluate the relative importance of the criteria. The experts are expected to provide their judgment on the basis of their knowledge.

    For any expert, the comparison matrix is given by Eq. (1) as:

    1. (a)
      $$ \tilde{C}_{k} = \left[ {\begin{array}{*{20}c} 1 & {\tilde{c}_{12} } & \cdots & {\tilde{c}_{1n} } \\ {\tilde{c}_{21} } & 1 & \cdots & {\tilde{c}_{2n} } \\ \vdots & \vdots & \ddots & \vdots \\ {\tilde{c}_{n1} } & {\tilde{c}_{n2} } & \cdots & 1 \\ \end{array} } \right] $$
      (1)

      where n is the number of criteria and \( \tilde{C}_{k} \) is the pairwise comparison matrix of the kth expert, for k = 1, 2, …, K.

    2. (b)

      The arithmetic mean is used to aggregate the experts’ opinions, as given in Eq. (2).

      $$ \tilde{C} = \frac{1}{K}\left( {\tilde{C}_{1} + \tilde{C}_{2} + \cdots + \tilde{C}_{K} } \right) $$
      (2)
  • Step 4: Transform the linguistic terms into triangular fuzzy numbers. The following linguistic terms provided in Table 2 are utilized for the evaluation procedure.

    Table 2. The fuzzy linguistic scales (adapted from: [18])
  • Step 5: Calculate the fuzzy weight matrix using Eqs. (3) and (4).

    $$ \tilde{r}_{\text{i}} = \left( {\tilde{c}_{\text{i1}} \otimes \tilde{c}_{\text{i2}} \otimes \ldots \otimes \tilde{c}_{\text{in}} } \right)^{{\frac{1}{n}}} $$
    (3)
    $$ \tilde{w}_{i} = \tilde{r}_{i} \otimes \left( {\tilde{r}_{1} + \tilde{r}_{2} + \cdots + \tilde{r}_{n} } \right)^{ - 1} $$
    (4)

    where \( \tilde{r}_{i} \) is the geometric mean of the fuzzy comparison values and \( \tilde{w}_{i} \) is the fuzzy weight of the ith criterion.

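  • Step 6: Apply the normalization procedure in Eq. (5). As in the illustration in Sect. 4.2, each fuzzy weight is first reduced to a crisp value by summing its lower, middle, and upper values.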

    $$ w_{i} = \frac{{\tilde{w}_{i} }}{{\sum\nolimits_{j = 1}^{n} {\tilde{w}_{j} } }} $$
    (5)
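Steps 3 to 6 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, triangular fuzzy numbers are stored as (lower, middle, upper) tuples, fuzzy multiplication and addition are applied component-wise, the inverse of (l, m, u) is taken as (1/u, 1/m, 1/l), and Eq. (5) is read as in the illustration of Sect. 4.2, where each fuzzy weight is collapsed to the sum of its three values before normalization. The second expert's judgment is invented for the example.

```python
from math import prod
from typing import List, Sequence, Tuple

TFN = Tuple[float, float, float]  # (lower, middle, upper)
Matrix = List[List[TFN]]


def aggregate(matrices: Sequence[Matrix]) -> Matrix:
    """Eq. (2): element-wise arithmetic mean over the experts' matrices."""
    k = len(matrices)
    n = len(matrices[0])
    return [
        [
            tuple(sum(mat[i][j][p] for mat in matrices) / k for p in range(3))
            for j in range(n)
        ]
        for i in range(n)
    ]


def geometric_means(matrix: Matrix) -> List[TFN]:
    """Eq. (3): component-wise geometric mean of each row."""
    n = len(matrix)
    return [
        tuple(prod(cell[p] for cell in row) ** (1.0 / n) for p in range(3))
        for row in matrix
    ]


def fuzzy_weights(r: List[TFN]) -> List[TFN]:
    """Eq. (4): each r_i times the inverse of the fuzzy sum of all r_j."""
    total = tuple(sum(ri[p] for ri in r) for p in range(3))
    # Inverse of a positive TFN reverses the order of the bounds.
    inverse = (1.0 / total[2], 1.0 / total[1], 1.0 / total[0])
    return [tuple(ri[p] * inverse[p] for p in range(3)) for ri in r]


def normalize(w: List[TFN]) -> List[float]:
    """Eq. (5) as read from the illustration: crisp value l + m + u, then normalize."""
    crisp = [sum(wi) for wi in w]
    total = sum(crisp)
    return [c / total for c in crisp]


# Two hypothetical experts comparing two criteria.
expert_1: Matrix = [[(1, 1, 1), (1, 1, 3)], [(1 / 3, 1, 1), (1, 1, 1)]]
expert_2: Matrix = [[(1, 1, 1), (3, 5, 7)], [(1 / 7, 1 / 5, 1 / 3), (1, 1, 1)]]

weights = normalize(fuzzy_weights(geometric_means(aggregate([expert_1, expert_2]))))
```

Keeping each equation in its own function makes it straightforward to inspect the intermediate fuzzy values, which is useful when checking a hand calculation against the code.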

4 Illustration of FMCDM

In this section, we illustrate the FMCDM on a health patient records dataset using the Fuzzy AHP technique. This dataset was selected because it exhibits both typical benefits and typical risks. The benefits include the availability of hospital medical records that are accurate and up to date and that give users quick access to patient records. On the risk side, however, releasing attributes of the patient health records such as name_of_patient, date_of_birth, and place_of_birth could lead to the identification of individuals and thus to a privacy violation.

4.1 Data Source: Health Patient Records Dataset

In the scenario of the illustration, we assume that the government asks a Department of Health to release a dataset of patient medical records to the public so that individuals and organizations can access it and see the current trend of a disease [38, 39]. By doing so, for instance, the government is able to generate a location map of the disease landscape for certain regions or specific attributes. However, if the government decides to open the dataset with its actual values immediately, privacy issues for the patients contained in the dataset, such as misuse, inaccuracy, and identifiability of the data, might be very harmful [39, 40, 41]. Figure 3 shows the structure of the health patient records dataset that is analyzed using FMCDM in this study.

Fig. 3. Raw table of Health Patient Records (adapted from: [31, 43])

For the illustration, we analyze Table 1 of the dataset, namely Diagnosed Stage, which contains eight attributes/fields: Name_of_patient, Date_of_birth, Place_of_Birth, Gender, Race, Insurance, Stage, and TNM_staging.

4.2 Evaluation: Analyzing the Dataset

The following steps describe the FMCDM scenario. Figure 4 shows the hierarchy of criteria and alternatives used in the illustration of FMCDM.

Fig. 4. Hierarchy of criteria and alternatives for the illustration

  • Step 1. Establish an expert team. We drew on the knowledge and expertise of several experts. The experts were selected and interviewed on the basis of three considerations: (1) domain knowledge, where the educational backgrounds of the experts in this field should cover various specializations with partial overlap to ensure the completeness of the data and available information [42]; (2) functional knowledge, where the chosen experts are competent in the scope of the existing problems, the requirements of the process, and the proposed solution [42]; and (3) best practice, where the interviewees’ expertise and insight have to be outstanding to warrant the quality and validity of the information sources [43].

  • Step 2. Determine the evaluation criteria and construct the hierarchy including alternatives.

  • Step 3. Construct the pairwise comparison matrix and evaluate the relative importance of the criteria. The experts are asked to provide their judgments based on their knowledge and expertise. For simplicity, only the pairwise comparison matrix of expert one is given in Fig. 5 in this illustration. Before the experts started to quantify the criteria, we constructed a fuzzy linguistic evaluation scale for the weights, as presented in Table 2.

    Fig. 5. The pairwise comparison matrices of criteria and alternatives

  • Step 4: Transform the linguistic terms into triangular fuzzy numbers. The linguistic terms provided in Table 2 are utilized for the evaluation procedure.

  • Step 5: Calculate the fuzzy weight matrix using Eqs. (3) and (4). The final weights of the alternatives are calculated using Eqs. (3), (4), and (5), and fuzzy operational laws are used for the calculation [18, 27]. Illustrative examples for the weights of sub-criteria C1.1 and C1.2 are given as follows:

Calculating sub-criteria: The linguistic terms for the pairwise comparisons are taken from Fig. 5, and the corresponding fuzzy numbers are taken from Table 2. For example, the pairwise comparison of (C1.1, C1.2) is “Equal Important”, and the fuzzy number of this linguistic term is (1, 1, 3).

$$ \begin{aligned} \tilde{r}_{c1.1} = & \left( {\tilde{c}_{c1.1c1.1} \otimes \tilde{c}_{c1.1c1.2} } \right)^{{\frac{1}{2}}} \\ \tilde{r}_{c1.1} = & \left( {\left( {1,1,1} \right) \otimes \left( {1,1,3} \right)} \right)^{{\frac{1}{2}}} \\ \tilde{r}_{c1.1} = & \left( {1,1,1.73} \right) \\ \tilde{r}_{c1.2} = & \left( {\tilde{c}_{c1.2c1.1} \otimes \tilde{c}_{c1.2c1.2} } \right)^{{\frac{1}{2}}} \\ \tilde{r}_{c1.2} = & \left( {\left( {1/\left( {1,1,3} \right)} \right) \otimes \left( {1,1,1} \right)} \right)^{{\frac{1}{2}}} \\ \tilde{r}_{c1.2} = & \left( {0.57,1,1} \right) \\ \end{aligned} $$

Calculating weights: The weights are calculated using Eq. (4). The values of \( \tilde{r}_{c1.1} \) and \( \tilde{r}_{c1.2} \) obtained in the previous step are substituted into the following equation.

$$ \begin{aligned} \tilde{w}_{c1.1} = & \left( {0.36,0.5,1.10} \right) \\ \tilde{w}_{c1.2} = & \tilde{r}_{c1.2} \otimes \left( {\tilde{r}_{c1.1} + \tilde{r}_{c1.2} } \right)^{ - 1} \\ \tilde{w}_{c1.2} = & \left( {0.57,1,1} \right) \otimes \left[ {\left( {1,1,1.73} \right) + \left( {0.57,1,1} \right)} \right]^{ - 1} \\ \tilde{w}_{c1.2} = & \left( {0.2,0.5,0.63} \right) \\ \end{aligned} $$
  • Step 6: Apply normalization procedure.

Normalized weight values: To find the normalized weights of C1.1 and C1.2 we used Eq. 5.

$$ \begin{aligned} w_{c1.1} = & \frac{{\tilde{w}_{c1.1} }}{{\sum\nolimits_{j = 1}^{2} {\tilde{w}_{1j} } }} = \frac{{L_{c1.1} + M_{c1.1} + U_{c1.1} }}{{\tilde{w}_{c1.1} + \tilde{w}_{c1.2} }} \\ w_{c1.1} = & \frac{{\left( {0.36 + 0.5 + 1.10} \right)}}{{\left( {0.36 + 0.5 + 1.10 + 0.2 + 0.5 + 0.63} \right)}} = 0.59 \\ w_{c1.2} = & \frac{{\tilde{w}_{c1.2} }}{{\sum\nolimits_{j = 1}^{2} {\tilde{w}_{1j} } }} = \frac{{L_{c1.2} + M_{c1.2} + U_{c1.2} }}{{\tilde{w}_{c1.1} + \tilde{w}_{c1.2} }} \\ w_{c1.2} = & \frac{{\left( {0.2 + 0.5 + 0.63} \right)}}{{\left( {0.36 + 0.5 + 1.10 + 0.2 + 0.5 + 0.63} \right)}} = 0.40 \\ \end{aligned} $$
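The intermediate values behind these weights are not printed in the example. They can be reconstructed as follows, taking the comparison of C1.1 with C1.2 as (1, 1, 3), as stated in the text, and assuming that the inverse of a triangular fuzzy number (l, m, u) is (1/u, 1/m, 1/l). Values are carried to four decimals, so they agree with the two-decimal figures above only up to rounding of the intermediate results.

$$ \begin{aligned} \tilde{r}_{c1.1} + \tilde{r}_{c1.2} \approx & \left( {1,1,1.7321} \right) + \left( {0.5774,1,1} \right) = \left( {1.5774,2,2.7321} \right) \\ \left( {\tilde{r}_{c1.1} + \tilde{r}_{c1.2} } \right)^{ - 1} \approx & \left( {\frac{1}{2.7321},\frac{1}{2},\frac{1}{1.5774}} \right) \approx \left( {0.3660,0.5,0.6340} \right) \\ \tilde{w}_{c1.1} \approx & \left( {1,1,1.7321} \right) \otimes \left( {0.3660,0.5,0.6340} \right) \approx \left( {0.3660,0.5,1.0981} \right) \\ \tilde{w}_{c1.2} \approx & \left( {0.5774,1,1} \right) \otimes \left( {0.3660,0.5,0.6340} \right) \approx \left( {0.2113,0.5,0.6340} \right) \\ w_{c1.1} \approx & \frac{1.9641}{1.9641 + 1.3453} \approx 0.5935,\quad w_{c1.2} \approx \frac{1.3453}{1.9641 + 1.3453} \approx 0.4065 \\ \end{aligned} $$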

A similar calculation is applied to all pairwise comparisons. The final weights of the alternatives are provided in Table 3. An illustrative example for \( W_{A1} \) is given as follows:

Table 3. Final weights of the criteria and alternatives
$$ \begin{aligned} W_{A1} = & C1 \times C11 \times A1 + C1 \times C12 \times A1 + \cdots + C4 \times C41 \times A1 + C4 \times C42 \times A1 \\ W_{A1} = & 0.53 \times 0.59 \times 0.39 + 0.53 \times 0.40 \times 0.41 + \cdots + 0.07 \times 0.59 \times 0.44 + 0.07 \times 0.40 \times 0.35 \\ W_{A1} = & 0.34 \\ \end{aligned} $$
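The aggregation in the example above is a weighted sum over the hierarchy, which can be sketched as follows. The function name `alternative_weight` and the dictionary layout are hypothetical. Only the four terms printed in the example (those for C1 and C4) are used, because the terms for C2 and C3 are elided in the text; the result is therefore a partial sum, not the final weight of A1.

```python
from typing import Dict


def alternative_weight(
    criteria: Dict[str, float],
    sub_criteria: Dict[str, Dict[str, float]],
    scores: Dict[str, float],
) -> float:
    """Sum of criterion weight x sub-criterion weight x alternative score.

    Only sub-criteria that have an entry in `scores` contribute.
    """
    total = 0.0
    for criterion, w_criterion in criteria.items():
        for sub, w_sub in sub_criteria.get(criterion, {}).items():
            if sub in scores:
                total += w_criterion * w_sub * scores[sub]
    return total


# Weights printed in the example for C1 and C4; C2 and C3 are elided in the
# text and are deliberately left out here.
criteria = {"C1": 0.53, "C4": 0.07}
sub_criteria = {
    "C1": {"C1.1": 0.59, "C1.2": 0.40},
    "C4": {"C4.1": 0.59, "C4.2": 0.40},
}
a1_scores = {"C1.1": 0.39, "C1.2": 0.41, "C4.1": 0.44, "C4.2": 0.35}

partial_a1 = alternative_weight(criteria, sub_criteria, a1_scores)
```

With the complete weights from Table 3, the same function would return the final weight of an alternative.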

4.3 Decision: Recommendations

According to Table 3, the highest-priority decision for Table 1 (Diagnosed Stage) of the Health Patient Records is A2 (0.42), followed by A1 (0.34) and A3 (0.08), while A4 (0.06) ranks last. Based on this analysis, we recommend in this case that Table 1 (Diagnosed Stage) be released with suppression maintained, as the highest-priority recommendation in this illustration.
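The ranking step can be sketched as a simple sort over the final weights. The weights are those reported above; the variable names are hypothetical.

```python
# Final weights of the alternatives as reported in the text.
final_weights = {"A1": 0.34, "A2": 0.42, "A3": 0.08, "A4": 0.06}

# Alternative labels from Sect. 3.2.
labels = {
    "A1": "open the dataset",
    "A2": "maintain suppression",
    "A3": "limited access",
    "A4": "keep the dataset closed",
}

# Sort alternatives from highest to lowest weight.
ranking = sorted(final_weights.items(), key=lambda item: item[1], reverse=True)

for rank, (alternative, weight) in enumerate(ranking, start=1):
    print(f"{rank}. {alternative} ({labels[alternative]}): {weight:.2f}")

recommended = ranking[0][0]
```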

5 Findings

To present the recommendations based on the final results of the FMCDM analysis, we designed a graphical view to support decision-makers in deciding whether to release their dataset. Figure 6 shows how the Fuzzy AHP can help decision-makers better understand the comparison score of each alternative.

Fig. 6. Ranking of decision recommendations

Furthermore, to design the action plan for maintaining suppression, some possible procedures could be taken into account: (1) removing a data field or grouping individual attributes into particular groups of the data and replacing them with unique characteristics; (2) obscuring a data field by substituting precise data values with ranges to minimize the disclosure of personal identity; and (3) aggregating a data field by summarizing the data across records and presenting the values in statistical form such as graphs or charts.
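The three procedures can be illustrated with a short Python sketch. It is a sketch under assumptions, not a complete anonymization method: the function names are hypothetical, the field names follow the attributes listed in Sect. 4.1, the three records are invented, and the ten-year range for birth dates is an arbitrary choice.

```python
from collections import Counter
from typing import Dict, List, Set

Record = Dict[str, str]

# Invented example rows; field names follow the Diagnosed Stage attributes.
records: List[Record] = [
    {"Name_of_patient": "P. One", "Date_of_birth": "1961-04-02", "Gender": "F", "Stage": "II"},
    {"Name_of_patient": "P. Two", "Date_of_birth": "1958-11-23", "Gender": "M", "Stage": "II"},
    {"Name_of_patient": "P. Three", "Date_of_birth": "1983-07-15", "Gender": "F", "Stage": "I"},
]


def remove_fields(rows: List[Record], fields: Set[str]) -> List[Record]:
    """(1) Remove directly identifying fields from every record."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def birth_decade(date_of_birth: str) -> str:
    """(2) Replace a precise ISO date with a ten-year range (assumed width)."""
    year = int(date_of_birth[:4])
    start = year - year % 10
    return f"{start}-{start + 9}"


def obscure_birth_dates(rows: List[Record]) -> List[Record]:
    """Apply the range substitution to the Date_of_birth field."""
    return [{**row, "Date_of_birth": birth_decade(row["Date_of_birth"])} for row in rows]


def aggregate_by(rows: List[Record], field: str) -> Dict[str, int]:
    """(3) Publish only the number of records per value of a field."""
    return dict(Counter(row[field] for row in rows))


released = obscure_birth_dates(remove_fields(records, {"Name_of_patient"}))
stage_counts = aggregate_by(records, "Stage")
```

Each function returns new records and leaves the input untouched, so the original dataset can stay closed while only the transformed copy is released.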

6 Conclusion

In this paper, we presented the results of a study that uses Fuzzy AHP to analyze the risks and benefits of opening data in order to determine the best alternative for a health patient records dataset. A set of criteria and sub-criteria was designed and identified based on the literature review and experts’ judgment. The advantages of the FMCDM approach compared to the three other methods are: (1) the capability to transform human subjectivity and imprecision in natural language into weights for complex problems, and (2) an assessment method that ranks the selected alternatives for a single decision. However, a disadvantage of this approach is that, because fuzzy logic is a rule-based system, it needs enough rules to be accurate and expressive. The contribution of this paper is a decision-making model for analyzing the potential risks and benefits of opening data. A given dataset is evaluated by measuring and weighing the relative importance of multiple criteria.

Thus, the approach might help decision makers decide whether to open a dataset. In further research, we recommend refining this approach by adding more datasets, so that advice for (not) opening data can be generated without human involvement.