Identifying and managing data quality requirements: a design science study in the field of automated driving

Pradhan, Shameer Kumar; Heyn, Hans-Martin; Knauss, Eric

doi:10.1007/s11219-023-09622-8

Identifying and managing data quality requirements: a design science study in the field of automated driving

Open access
Published: 11 May 2023

Volume 32, pages 313–360, (2024)
Cite this article

Download PDF

You have full access to this open access article

Software Quality Journal Aims and scope Submit manuscript

Identifying and managing data quality requirements: a design science study in the field of automated driving

Download PDF

Shameer Kumar Pradhan¹,
Hans-Martin Heyn¹ &
Eric Knauss¹

1649 Accesses
2 Citations
Explore all metrics

Abstract

Good data quality is crucial for any data-driven system’s effective and safe operation. For critical safety systems, the significance of data quality is even higher since incorrect or low-quality data may cause fatal faults. However, there are challenges in identifying and managing data quality. In particular, there is no accepted process to define and continuously test data quality concerning what is necessary for operating the system. This lack is problematic because even safety-critical systems become increasingly dependent on data. Here, we propose a Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM) to systematically manage data quality and related requirements based on design science research. The framework is constructed based on an advanced driver assistance system (ADAS) case study. The study is based on empirical data from a literature review, focus groups, and design workshops. The proposed framework consists of four components: a Data Quality Workflow, a List of Data Quality Challenges, a List of Data Quality Attributes, and Solution Candidates. Together, the components act as tools for data quality assessment and maintenance. The candidate framework and its components were validated in a focus group.

Requirements Engineering for Automotive Perception Systems: An Interview Study

Requirements and software engineering for automotive perception systems: an interview study

Article Open access 24 January 2024

Guidelines for User-Centered Development of DAS

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Successful deep learning requires a large volume of data during the design and operation of such systems (Rusk, 2016). Data used for training and operation is crucial in achieving the desired behavior of a deep learning system (Sun et al., 2017). Consequently, there is a need to identify data quality challenges and systematically define relevant data quality attributes. However, there needs to be a systematic procedure to determine and manage data quality. Today, most of the data quality assessment information for the deep learning system is based on undocumented expert knowledge, especially during pre-processing of input data (Holstein et al., 2019).

An advanced driver assistance system (ADAS) is designed to make driving comfortable and safe by enabling drivers to make the right decisions (Ziebiński et al., 2017). The system assists in overtaking other vehicles, parking, and detecting obstacles. ADAS can also execute emergency braking and lane change independently. These systems are inherently safety-critical because they can intervene by braking and steering the vehicle. To enable all these functions, ADAS employs a perception system, which deploys deep learning and encounters a large volume of data during the design and operation phase (Fayyad et al., 2020). Such systems for ADAS include traffic sign recognition and road obstacle detection. Because of functional safety decomposition, the perception system will inherit functional safety requirements from ADAS. In turn, the deployed deep learning models in the perception system will also have to comply with functional safety requirements. Consequently, this means that the data used for training and testing the deep learning models must not compromise the safe function of the deep learning model.

This study aims to understand data quality requirements in the context of safety-critical systems like ADAS. Divergence from the expected system behavior can mean the difference between a safe journey and a fatal accident. The behavior of machine learning in general, and deep learning in particular, depends on the data, especially the quality of the data provided for training, validation, and inference at runtime. A lack of quality data might compromise the decision-making capabilities of the driver in the context of automated driving, which can result in a fatal accident. Thus, the data used for training the system should be appropriate for successfully operating in a real-world implementation. Similarly, data used for validation should be appropriate for determining whether the system will work as intended. Finally, during runtime, the inference must be based on data with a quality that resembles training and validation data quality; otherwise, it will be impossible to guarantee that the system is working within certain boundaries. Providing unsuitable data, i.e., data of poor quality, will lead to undesired system behavior and impact efficiency (Madnick et al., 2014; Challa et al., 2020).

1.1 Research questions and objectives

We formulate two research questions to guide our study:

Research question 1 (RQ1): What are the relevant data quality challenges in deep learning systems?
Research question 2 (RQ2): What constitutes a requirements framework for data quality management in deep learning systems?

Answering the first research question helps identify data quality challenges. Identification of such challenges can, in turn, help in devising solutions for those challenges. The second research question helps develop a series of components for a candidate framework whose goal is to help researchers, practitioners, and other stakeholders identify the data quality challenges, understand data quality attributes, and manage data quality overall.

The objectives of this study are as follows:

To identify challenges associated with data quality for deep learning systems such as that can be found in ADAS;
To understand data quality requirements for such systems;
To devise a set of solutions for identifying and mitigating data quality challenges.

The primary contributions of this study are the identification of relevant data quality challenges and the development of a series of artifact components that assist in the identification and reduction/mitigation of such challenges. By understanding the identified data quality challenges, we establish a candidate framework that could lead to a framework that supports stakeholders in identifying and maintaining data quality and requirements towards data. According to McMeekin et al. (2020), a methodological framework “provides structured practical guidance or a tool that supports its user through a process in a step-wise manner.”

We position the candidate framework devised in this paper as a stepping stone towards a comprehensive framework for understanding the data quality challenges and attributes for data-driven developments such as deep learning in ADAS.

The scope of the study is limited to establishing a candidate framework for data quality in the training and testing of deep learning models and, thus, does not relate to concrete data types produced by individual sensors. We study data quality requirements by exploring data quality challenges and attributes. The data collected for this study originates mainly from the past experiences of the experts. A candidate framework comprising various components is proposed based on data collected via interviews, focus groups, surveys, and literature review.

The remainder of this article is structured in the following manner. Background and related work are presented in Section 2. Similarly, Section 3 provides the study’s methodology and design using automated driving as a case study. Section 4 provides the result of the study in the form of a candidate framework, including a set of primary components and their evaluation. The resulting candidate framework and its implication to researchers and practitioners are discussed in Section 5. Finally, Section 6 concludes the article and provides potential future directions for this study.

2 Background and related work

With the rise of distributed systems, data soon became a key concern. Standards such as ISO 25010 on software and data quality can guide the handling of data quality aspects for software systems ISO (2011). However, the standard was drafted before the rise of machine learning in the late 2010s. It aims to guide software architecture decisions instead of data selection in data-centric applications (Haoues et al., 2017).

A data quality framework for distributed computing environment by Fletcher (1998) proposes a measure called Data Quality Risk Exposure Level (DQREL). DQREL is an attribute-dimensions matrix with eight data dimensions and three data attributes. As stated by the author, the DQREL matrix can be used to understand “data quality pitfalls” in a system.

A first step towards identifying data quality requirements is understanding the expectations for the final ML systems. Sandkuhl (2019) studied the expectations of two projects—one in financial industries and the other in ML and data science. The author devised a method component to understand the organizational context of ML, which can be used to conduct ML requirements analysis and, finally, analysis of data availability based on the elicited requirements towards the ML system.

That requirements towards the ML system directly result in requirements towards data quality, has been shown by Sessions and Valtorta (2006). The authors show that data quality impacts the effectiveness of machine learning algorithms. They devise procedures for developing robust and practical algorithms using data quality assessments. They evaluate the need for good data quality by developing and testing three Bayesian networks. However, assessing and managing the data quality of large datasets is a challenging task, as shown by Cai and Zhu (2015). The challenges of data quality they identify include difficulty in data integration, a large volume of data, fast-changing data, and a need for more data quality standards and frameworks. The authors propose a dynamic assessment process for data quality to identify these challenges. Another framework for data quality assessment and monitoring was developed by Batini et al. (2007). Based on the Basel II operational risk evaluation methods, the authors devised a data quality assessment methodology called ORME-DQ, which contains four phases for data quality risk prioritization, identification, measurement, and monitoring. The authors develop an architectural framework composed of five modules that support the phases of the assessment methodology.

The importance of such data quality assessment methods has also been shown by Fujii et al. (2020). The authors devised a set of guidelines for the quality assurance of AI. These guidelines connect data quality, model robustness, system quality, process agility, and customer expectation. They evaluated their proposed guidelines through a survey, with over 77% of the participants agreeing on their usefulness.

Among the five challenges in requirement engineering for ML-based applications identified by Vogelsang and Borg (2019), the elicitation of data required is one of them. The authors identified a gap between the tools used by data scientists to control data quality and requirement engineering connecting data quality requirements to customer expectations.

The Open Measured Data Management Working Group has developed a vendor-neutral platform called OpenMDM^{Footnote 1} to manage measured data. Automotive companies primarily use this platform to build in-house applications. It can, however, also be used to develop other solutions. It includes components and concepts that can be used to “compose applications for measured data management systems.” OpenMDM can manage measurement data, evaluation results, and descriptions.

Other data management frameworks, such as datasheets for datasets proposed by Gebru et al. (2021), do not explicitly connect data quality attributes to data requirements. The “dataset nutrition label framework” introduced by Holland et al. (2020) provides an extendible approach for data scientists to compare different datasets summarized as labels. However, the framework requires a list of relevant data quality attributes and needs to explain how data quality challenges can be solved.

We propose the contribution of this study as a blueprint toward a framework for identifying and managing data quality attributes. Unlike previous studies that mainly investigated individual aspects of data quality, this study provides a consolidated tool that includes data quality challenges, related attributes, and solution candidates to overcome the data quality challenges. The main difference from previous approaches is that this proposed framework is extendable to data quality challenges, attributes, and solutions. Based on a case study, this article will provide many examples of data quality challenges, attributes, and solutions entered into the proposed framework.

3 Research method

Design science research (DSR) was performed for this study. According to Hevner et al. (2004), DSR is a problem-solving process enabled by developing and evaluating novel artifacts as solutions to problems. The DSR methodology is applicable in various domains, including software, human-computer interaction, and system design.

The study was performed in three cycles, with each cycle focusing primarily on one of the stages of DSR, namely, problem identification, solution design, and evaluation. However, other tasks were also updated if new information or idea was generated, irrespective of the stage. Data quality challenges were identified during the problem identification stage with the help of a literature review and expert interviews. The framework was devised during the second cycle, the solution design stage. The identified challenges and the framework were evaluated in the third cycle, the evaluation stage. All stages are illustrated in Fig. 1.

3.1 A case study in data quality for automated driving

Automated driving is adopted as the case study for this research. In this research, we conduct a case study to evaluate data quality challenges in automated driving. The study was conducted in collaboration with a Swedish Tier 1 supplier of automotive systems for original equipment manufacturers (OEMs), which designs, manufactures, and sells software and hardware systems for occupant protection, ADAS, collaborative, and automated driving. These systems include vision, radar, lidar, thermal sensing, electronic controls, and human–machine interfaces. We argue that this company is a representative for developing systems for automated driving, as it has customer relations with several OEMs worldwide, and it is one of the largest Tier 1 suppliers for perception systems used in automated driving in Europe.

Sampling strategy

We employed a mixture of convenience sampling (Sedgwick, 2013) and purposeful sampling (Suri, 2011) techniques during the selection of the experts. The industry partner supported the selection of experts for this study and provided us with the experts based on our requirements regarding their expertise and area of work. We asked the company to provide us with experts with a wide variety of experiences and positions involved in the development of automated driving functions to obtain a broader perspective and receive more diverse feedback on our interview questions (Palinkas et al., 2015). Our main selection criterion was the active involvement in product development for ADAS functions that use some form of machine learning.

3.1.1 First cycle: problem identification

During problem identification, one investigates the research objective from different perspectives in sufficient detail to support the design of a solution (Peffers et al., 2007). While it makes sense to focus on problem identification in the first cycle, understanding the problem should be revisited iteratively even during the other cycles of the DSR (Knauss, 2021). Similarly, although the focus was on solution design during the second DSR cycle, problem identification and evaluation were also consciously considered. Feedback from the evaluation stage was also used to further refine the problem understanding and solution design.

The first cycle involved interviews and a literature review as the primary source for identifying data quality challenges. The interviews, which were recorded and transcribed, were conducted via Microsoft Teams, an online communication tool. The data quality challenges were segregated using data-driven thematic analysis.

Based on the previously formulated research question, we developed an interview guide as Farooq and de Villiers (2017) state that a well-developed interview guide helps devise a better structure for the interviews. Furthermore, feedback received from interviews can be helpful in further refining and rephrasing the interview questions. Based on the outcome of previous interviews, questions were tuned accordingly to fill the knowledge gap for other interviews.

The goal of the interviews in the first cycle was to identify data quality challenges. Interviewees A–E, listed chronologically in Table 1, are the interviewees during the first cycle. Interviewees F–H in the same table are the interviewees during the second cycle. Five interviewees are experts from the case company, two additional interviewees are experts from two partner companies of our case company, and one expert is a research partner of the case company within an EU Horizon 2020 research project. We chose to add additional experts outside the case company to check the validity and transferability of the answers we received from within the case company.

Table 1 List of interviewees

Full size table

The interviews were transcribed and thematically coded. Data-driven coding was used in the thematic analysis of the interviews of the first cycle, as described by Gibbs (2007). In such a technique, codes are based on the words used in the interviews.

A survey was conducted to understand the appropriate severity of the identified challenges. Interview participants from the first cycle and additional participants from a requirements engineering workgroup of a deep learning research project associated with the case company^{Footnote 2} participated in the survey.

While preparing the survey questionnaire, the identified challenges were divided into five categories. For each category, the survey participants were asked to rank the challenges by the level of severity. They were asked to rate the categories as well. The modified scale ranged from 1 to 6, with 1 being the least severe challenge and 6 being the most formidable challenge. A scale with an even number of alternatives was deliberately selected to induce the participants to “pick a side,” as suggested by Cox (1980).

An algorithm to calculate a metric called Challenge Score was developed. The algorithm uses the ranking of individual challenges in their respective categories and the Likert scale value given to those categories to calculate a Challenge Score. The value is normalized over the total number of challenges in the respective category and the number of survey participants. More details about the algorithm and associated formula can be found in the accompanying data package.^{Footnote 3}

3.1.2 Second cycle: solution design

After identifying the problem in the first DSR cycle, the primary focus of the second DSR cycle was on solution design. The artifact was designed to meet the stakeholder requirements and resolve the identified challenges by building on the early prototypes from the first cycle.

A series of artifact components, which collectively form the Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM), was designed as part of the solution design step. The components are explained in Sect. 4 of this article. Results from a literature review, the first round of interviews, the first survey, and the group brainstorming sessions between the researchers were used to devise the components and their content. We also conducted additional interviews in this cycle to verify the developed components. Furthermore, some of the questions asked during the interviews were open-ended to encourage brainstorming between the researchers and the interviewees.

The interviews of the second cycle were also thematically coded and analyzed. Unlike the thematic coding of the interviews of the first cycle, descriptive coding and analytic coding techniques were used to thematically code the interviews of the second cycle (Gibbs, 2007), (Skjott Linneberg & Korsgaard, 2019) because we were focusing on verifying the findings of the first cycle.

We used four deductive codes in this study. Those were confirmation of a pre-identified challenge, confirmation of a proposed solution, rejection of a pre-identified challenge, and rejection of the proposed solution.

3.1.3 Third cycle: evaluation

The third cycle of this study focuses primarily on the evaluation of the candidate framework. A preliminary evaluation was already conducted as part of the study’s first and second cycles. For example, the interviewees were presented with the artifact components and solutions in a preliminary design phase during the second cycle. The presentation was done to gather their feedback regarding those components and solutions.

The evaluation was primarily done using a focus group and a survey. A focus group session was conducted to validate the candidate framework components. The focus group participants included researchers and engineers from academia and industry with experience in automated driving development, deep learning, and data quality. The session was conducted for 2 h with five participants: two from academia and three from the industry. The participants were confronted with questions to brainstorm regarding the association between the challenges, the data quality attributes, and the candidate framework components. They also shared their ideas and thoughts through discussion.

Finally, a comprehensive survey questionnaire was sent to members of the VEDLIoT requirements engineering workgroup. Ten participants submitted a response. However, the participants’ identities could not be determined as the survey did not ask for their names to maintain anonymity. This survey aimed to validate the components of the candidate framework. It asked the participants to provide a Boolean response to the appropriateness of individual fields for the templates of the candidate framework components. In the same way, questions regarding data quality challenges, their association with data quality attributes, and their effect on deep learning models were asked in the survey.

3.1.4 Calculation of challenge score

During the first iteration, 27 data quality challenges were identified through interviews and a literature review. A way to rank the challenges was necessary for the effective analysis. Challenge Score ranks the identified challenges in terms of their severity, i.e., whether a challenge is more pressing or less.

The computation of the Challenge Score is based on the response from the survey conducted to rank the challenges. The survey contained two types of questions; one type of question asked the participants to provide a value of significance based on a Likert scale to five sets of challenges, and another type of question asked to rank individual challenges inside the five sets of challenges.

As there are two types of responses to two types of questions, their results need to be combined. The Challenge Score combines both types of responses in one final value. For each respondent, the value they provide for the comprehensive sets of challenges is recorded. The highest-ranked challenge in a challenge set is given the highest numerical value. Decreasing numerical values are assigned to remaining challenges in the particular challenge set. E.g., if there are four challenges in a challenge set, the highest-ranked challenge is given a value of 4, the second highest-ranked is given a value of 3, and so on.

For each challenge, the assigned numerical value is multiplied by the value given by that particular participant for the challenge set of that particular challenge. This process was repeated for all of the participants and challenges. The product values calculated for all participants for individual challenges are summed. The final Challenge Score is calculated by dividing this sum by the number of challenges in the particular challenge set and by dividing the result by the total number of participants, which is done to normalize the final value.

4 Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM)

This section presents the final artifact titled Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM), which was developed during the study. It includes four components listed in Table 2, namely a Data Quality Workflow, a List of Data Quality Challenges, a List of Data Quality Attributes, and Solution Candidates. Components here mean a series of tools that can be used, independently or in combination, to identify and manage data quality requirements. This section will outline each of the components. Furthermore, for each of the components, more details, implementations, and literature references are provided in the artifact package of this article.^{Footnote 4} In the following, we define attribute as “a concept providing qualitative information about a specific object” (Statistical Office of the EU, 2020).

Table 2 Candidate Framework components and their purpose (*refer to Fig. 2 for the related steps in the workflow component)

Full size table

4.1 Component I: Data quality workflow

This component presents a step-by-step workflow for assessing and managing data quality and requirements. It includes six steps, as shown in Fig. 2. Most of the steps can be performed in parallel, as depicted by the dotted line in Fig. 2. Loops indicate that the steps can be done iteratively. The components of CaFDaQAM can be associated with the different steps of the workflow, as depicted in Table 2. The workflow was developed through brainstorming with experts. Furthermore, it was presented to the industry practitioners working with the case study during the focus group session to collect feedback for its evaluation.

S1 Identify data quality challenges

In this step, challenges concerning data quality can be identified from several sources. Examples of primary sources of data collection are interviews, field studies, and surveys. Research papers and books can be used as second-hand sources as well. Furthermore, the collected challenges can be divided into different categories. In this study, they were categorized into five groups relating to data availability, data management, data source, data structure, and data trust.

S2 Collect and organize data quality attributes

In this step, data quality attributes can be identified from various sources. E.g., sources such as research papers, proceedings papers, books, standards, technical reports, Internet articles, and interviews can be used to identify the attributes. Data quality attributes can also be elicited from interviews. A single attribute can also represent differently phrased data quality attributes. E.g., understandability and ease of understanding attributes can be represented by the same attribute.

S3 Associate data quality challenges and data quality attributes

Data quality challenges and quality attributes can be associated with each other after their identification. The association means that a certain data quality challenge affects a certain data quality attribute. There is a many-to-many relationship between data quality challenge and data quality attribute, i.e., one challenge can affect more than one attribute, and one attribute can be affected by more than one challenge. For instance, accuracy (attribute) is affected by data drop, incomplete data, etc. (challenges); and data drop (challenge) can affect accuracy, completeness, etc. (attributes). However, there can be those data quality attributes that are not affected by any identified challenge and data quality challenges that do not affect any attribute.

S4 Define data quality attribute metrics

Metrics to measure data quality attributes are formulated in this step. The metrics help to put a quantitative value on the attributes. E.g., degree of accuracy (metric) helps to measure accuracy (attribute). It gives a quantifiable value for the attribute. Furthermore, formulae can be devised to calculate the metrics. E.g., the degree of accuracy can be calculated as a ratio of the number of correctly labeled data records and the total number of data records. The formulae are mostly dependent on the context of the application.

S5 Identify solutions for data quality challenges

A way of improving data quality attribute metrics, thus improving quality attributes, is to determine candidate solutions for the data quality challenges that affect the attributes. If the challenges can be mitigated or reduced, it will help improve the data quality attributes. For instance, finding a solution for data drop (challenge) and implementing it in the system process could result in lesser data being dropped, thus improving the completeness (attribute). Several sources, such as research papers, technical reports, and books, can identify solutions. Teams can also brainstorm and devise new solution candidates for the challenges. An effective way to validate solution candidates is to implement them as tests in part of a system.

S6 Present to stakeholders

As the final step, identified data quality challenges, attributes, and solution candidates should be presented to appropriate stakeholders. They could be higher management, other colleagues, or customers. A suitable form of presentation should also be decided.

4.2 Component II: List of data quality challenges

Table 3 presents the template of List of Data Quality Challenges component. It includes eight fields validated by the participants of the focus group as well as the second survey. The participants were asked to decide whether a certain field was required or not for a particular component. All participants responded that all fields except one (the source) apply to the component. The source field was agreed upon by 75% of the participants. The challenges are related to the case as they are identified by the experts from the case company. The challenges identified in the case study were entered into the template and are also provided in the artifact package.^{Footnote 5}

Table 3 Template for List of Data Quality Challenges artifact component

Full size table

In response to the first research question (RQ1), in total, and at the end of the study, 27 data quality challenges were identified from elicitation methods such as interviews and literature review. During the course of the study, ten challenges were identified in our literature review analysis and interview data. Nine other challenges were only found in interview data, without a matching report in related work. The remaining eight challenges were identified only in the literature review. Figure 3 depicts the number of challenges retrieved from various sources, such as interviews and literature reviews, as well as the methods employed to validate the identified challenges. The challenges are divided into five broad categories: data availability, data management, data source, data structure, and data trust. We will list the identified challenges here; a complete description of all challenges, including more details on each challenge, is available in the artifact package^{Footnote 6} accompanying this article. As an extract from the artifact package, the challenges under the category data availability challenges are detailed in Appendix A.

Full size table

Furthermore, the second survey sent out to the study participants attempted to validate the fields of the templates of the candidate framework components. For all components, every field was evaluated to be appropriate by all survey participants except for two. The field Sources in List of Data Quality Challenges and the List of Data Quality Attributes components are evaluated to be suitable by only 75% and 50% of the survey participants, respectively.

4.5.3 Focus group evaluation

A focus group session was conducted in the third cycle of this study. Five deep learning, data science, and requirement engineering experts participated in the session. Two experts were employed at the case company; three were members of the VEDLIoT research project. Two types of questions were presented during the session. The first type pertains to the ranking of the data quality challenges. The researchers of this thesis study wanted to understand if the experts would rank the challenges differently compared to the ranking of the first cycle of this study. The second question type relates to validating the association between data quality challenges and attributes.

Unlike in the surveys, the focus group session’s ranking portrays the challenges’ overall ranking without giving them individual weights and calculating the Challenge Score. One of the reasons behind the imposition of a different way is the use of a different tool for the focus group. Ranking of the challenge sets using a Likert scale is presented in Table 11. Ranking for challenges in each challenge set is presented in Table 12.

Table 11 Ranking of Challenge Sets

Full size table

Table 12 Ranking of Data Challenges for each Challenge Set based on Expert’s knowledge collected in a focus group

Full size table

107 data quality challenge-attribute associations were presented for validation during the focus group. The experts regarded only four challenge-attribute associations as not valid (i.e., the initial supposition that the challenges affect the attributes for four of the attributes is not valid in expert opinion). Similarly, for 30 challenge-attribute associations, there was unanimity (i.e., all of the experts in the focus group session regarded a particular challenge as affecting a particular attribute).

For 45 challenge-attribute associations, more than half, but not all, of the experts in the focus group regarding a particular challenge affect a particular attribute. Similarly, for 26 challenge-attribute associations, more than half, but not all, of the experts in the focus group regarding a particular challenge does not affect a particular attribute. Only for the Data Delay challenge, there were two challenge-attribute associations in which half of the experts regarded a particular challenge does affect a particular attribute, and the other half regarded a particular challenge does not affect a particular attribute. This anomaly in data is due to one of the focus group participants not answering the question regarding Data Delay.

All data regarding the focus group, including tables outlining the experts’ responses, can be found in the data package^{Footnote 11} accompanying this article.

5 Discussion

In this section, we discuss the implications and contributions of our study. We also provide potential threats to the validity of the study.

5.1 Implications and contributions

The study has implications for researchers and practitioners interested in data quality and methods to assess and manage data quality. First, the study provides a candidate framework, which can act as a repository of information regarding data quality. The Data Quality Workflow component provides a step-by-step guide of the tasks that could be performed for overall data quality management.

Similarly, the List of Data Quality Challenges component provides a tool that can be referred to when designing a system to understand the types of data challenges it could face. List of Data Quality Attributes component presents the interested parties with attributes they might want to emphasize more in their systems. For example, a system might prefer data availability more than completeness, or vice versa. The component would help them understand which challenges could affect those attributes. Similarly, interested parties could understand which metrics to focus on and which data to collect to calculate the metric values by using the metrics provided. Mainly, practitioners can record data and compute metrics, which could help them adapt and change their processes if needed.

Likewise, using the solution candidates component, they can identify and implement techniques for mitigating the challenges affecting the attributes they prefer most.

A comparison can be made between the candidate framework proposed in this study and the OpenMDM framework described in Sect. 2. A difference between the two is that OpenMDM provides workflow management of measurement data, whereas CaFDaQAM provides a workflow for overall data quality management. Furthermore, OpenMDM is an Eclipse IDE-based tool, whereas CaFDaQAM could be employed in a programming language-neutral and IDE-neutral fashion.

The candidate framework developed in this study combines various components to present a comprehensive collection of tools to assess and maintain data quality. Those tools include a data quality workflow, templates for identifying and recording data quality challenges and attributes, a list of identified data quality challenges, a list of data quality attributes and metrics, and a list of solution candidates to many data quality challenges. Prior studies (see Sect. 2) explored a single concept. For example, Cai and Zhu (2015) studied only the aspect of challenges of data quality, Batini et al. (2007) explored steps in data quality risk assessment, and Fletcher (1998) provided an attribute-dimensions matrix. Similarly, Fujii et al. (2020) focused on quality assurance of machine learning-based AI applications and did not touch upon data specifically. Unlike previous studies and frameworks, CaFDaQAM explores the overall data quality management process by explicitly proposing a data quality workflow and providing the necessary tools to apply that workflow. The proposed candidate framework also provides requirements for the individual components.

5.2 Answer to the research questions

In response to the first research question (RQ1), this study identified 27 data quality challenges through interviews and a literature review. We developed a method, Challenge Score ranking, to rank and understand the severity of the identified challenges. Furthermore, we verified the identified challenges using surveys and a focus group.

Furthermore, four components were derived, forming the Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM). A data quality workflow was derived. Similarly, tools were proposed in the form of templates and lists for identifying data quality challenges, identifying, quantifying, and managing data quality attributes, and developing solution candidates for data quality challenges. We validated the candidate framework components using a survey, thus ensuring that they correctly address the need of the stakeholders. Hence, the validated candidate framework components answer the second research question (RQ2).

5.3 Threats to validity

5.3.1 Internal validity

Internal validity is concerned with how different variables affect the result of an experiment. One such threat is researcher bias. We, as researchers, could have introduced biases about the topic of the study. The researchers, for example, could have been biased during collecting data and conducting interviews. In order to mitigate this, two researchers performed thematic coding separately using the same coding technique. They then combined them into a single final set of codes in a joint meeting through discussion.

Similarly, as stated earlier, some challenges were identified through literature review only. However, they were validated by conducting a focus group and surveys. Also, a predefined set of questions was used for the interviews, limiting the discussion during interview sessions. At the end of each interview, the interviewees were asked if any questions that should have been asked were missed. Efforts were made to reduce ambiguity in the questions as much as possible. However, there could still be confusion regarding the questions because of communication gaps.

Likewise, there were a limited number of participants in the interviews, the focus group, and the surveys. Most were from the automated driving sector, which could have skewed the study’s result. However, suppose researchers will conduct experiments in the future with the same questionnaire used in this study. In that case, the result could vary if only a few participants are used because those participants might have different experiences and expertise than those consulted during this study.

5.3.2 Reliability

Reliability is associated with the replicability of an experiment or other empirical study, which means future experiments designed in the same fashion as the first experiment should produce the same results as the first experiment. The different versions of interview questions are provided in a replication package so that researchers can track how the research questions evolved based on the participants’ responses. The interview questions help researchers to ask similar questions in the future. However, the responses by experts might be different despite being from the same domain and having similar years of experience, which is because they could have different backgrounds and experiences throughout their careers or simply because they can have different perspectives.

5.3.3 Conclusion validity

Conclusion validity deals with the reasonability of the results of an experiment. Because focus group sessions and surveys were conducted to evaluate the artifacts developed in the study, it can be stated that the conclusion of this study is valid. However, the researchers of this study have yet to validate the conclusion with other domain experts, such as healthcare, aerospace, or law enforcement. The artifacts have not been implemented in a real-world context. So, there is scope for future study regarding the real-world implementation of the artifact developed in this thesis.

5.3.4 Generalizability

The study was conducted for a specific sector—automated driving. While our findings and candidate framework can only be generalized beyond this scope with further research, we hope our work can inspire similar concerns in other domains. For instance, quality data is also crucial for critical systems such as healthcare or power grid applications. The candidate framework could be used as a template to identify data quality challenges and mitigate them in such systems. Albeit, modifications in the candidate framework and its components might be warranted for such generalization. Furthermore, we do not claim the generalizability of the identified challenges; we only claim the transferability of the concept that challenges exist in the defined categories.

6 Conclusion

In this study, we have identified data quality challenges that could arise in deep learning systems using an automated driving system case study, thus answering RQ1 of this study. We have identified, analyzed, and evaluated the data quality challenges using interviews and a focus group. The list of challenges acts as one of the components of the candidate framework devised in this study.

The proposed Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM), its components, and associated templates assist in comprehending data quality challenges, attributes, metrics, and solution candidates. The candidate framework can be used as a tool to improve data quality. It can be used to define data quality requirements for a given system. The proposed templates help create a reference point for identifying data quality issues and defining necessary data attributes. The candidate framework can help improve the performance of deep learning systems, make better predictions, and reduce the risks that insufficient data quality could pose. Using the information provided by the candidate framework, stakeholders can proactively identify and mitigate the challenges regarding data quality. The candidate framework supports RQ2 of this study.

As future work, researchers can use the candidate framework components as a baseline to further develop a framework. Additional challenges could be identified, or identified challenges could be broken into sub-challenges to explore in detail. In order to make the candidate framework developed in this study generalizable, it can be tested in other fields, such as healthcare. Additional data quality challenges, attributes, and solutions can be identified from different domains.

The candidate framework could also be adopted as an automated tool. Data can be passed through a pipeline in this tool, and different relevant quality aspects of the data can be assessed automatically. Then, quality information can be presented to appropriate stakeholders using various mediums and visualization techniques.

Data availability

https://doi.org/10.7910/DVN/Y6ORUV.

Code availability

N/A.

Supplementary information

The article is accompanied by a replication package which contains a data package and an artifact package. The replication package can be accessed through the Harvard Dataverse at https://doi.org/10.7910/DVN/Y6ORUV. The data package covers the following topics:

• Interview Guide

• Challenge Score

• Focus Group Data

• Survey 2 Data

The artifact package for the article lists and explains the components of the artifact developed in the study. It covers the following components:

• Data Quality Workflow

• Data Quality Challenges

• Data Quality Attributes

• Solution Candidates

Notes

https://openmdm.org
The group consists of participants of Work Package 2 of the Very Efficient Deep Learning in the IoT (VEDLIoT) research project in which the case company is actively involved. The research project aims to apply the proposed data quality framework for its use cases in distributed deep learning for automotive systems, home automation, and industrial IoT. See www.vedliot.eu for more details.
https://doi.org/10.7910/DVN/Y6ORUV
https://doi.org/10.7910/DVN/Y6ORUV
https://doi.org/10.7910/DVN/Y6ORUV
https://doi.org/10.7910/DVN/Y6ORUV
*: Found in interview data and literature; **: Found only in interview data; ***: Found only in literature.
https://doi.org/10.7910/DVN/Y6ORUV
Note: In Table 8, Expensive Procedure and Time Consuming challenges are not included as they were identified only during the second cycle.
Note: In Table 10, due to limitation on the number of options provided by the survey tool used (Microsoft Forms), Manual Data Collection and Manual Data Labeling challenges were combined into a single challenge named Manual Data Collection and Labeling for ranking. They are still regarded as separate challenges in the List of Challenges artifact component.
Due to a technical error, Regulatory Compliance was not included in the second cycle survey. Hence, the calculation of Challenge Score ranking disregards it. The disregard is only for calculation of the Challenge Score; the challenge is still included in the List of Challenges artifact component.
https://doi.org/10.7910/DVN/Y6ORUV
https://doi.org/10.7910/DVN/Y6ORUV
https://doi.org/10.7910/DVN/Y6ORUV

References

Batini, C., Barone, D., Mastrella, M., Maurino, A., & Ruffini, C. (2007). A framework and a methodology for data quality assessment and monitoring. In In Proceedings of the 12th International Conference on Information Quality (pp. 333–346). Cambridge, MA, United States: MIT.
Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41, 16:1-16:52.
Article Google Scholar
Bobrowski, M., Marré, M., & Yankelevich, D. (1998). A Software Engineering View of Data Quality. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.5713 &rep=rep1 &type=pdf.
Cai, L., & Zhu, Y. (2015). The Challenges of Data Quality and Data Quality Assessment in the Big Data Era. Data Science Journal, 14.
Challa, H., Niu, N., & Johnson, R. (2020). Faulty Requirements Made Valuable: On the Role of Data Quality in Deep Learning. In 2020 IEEE Seventh International Workshop on Artificial Intelligence for Requirements Engineering (AIRE) (pp. 61–69). Zurich, Switzerland: IEEE.
Corrales, D. C., Ledezma, A. I., & Corrales, J. C. (2016). A systematic review of data quality issues in knowledge discovery tasks. Revista Ingenierías Universidad de Medellín, 15, 125–149.
Article Google Scholar
Cox, E. P., III. (1980). The Optimal Number of Response Alternatives for a Scale: A Review. Journal of Marketing Research, 17, 407–422.
Article Google Scholar
Dama International. (2017). Dama-dmbok: Data management body of knowledge (2nd edition). Denville, NJ, USA: Technics Publications, LLC.
Data Management Association, Henderson, D., Earley, S., Sebastian-Coleman, L., Sykora, E., & Smith, E. (2017). DAMA-DMBOK: data management body of knowledge. Denville, NJ, United States: Technics Publications, LLC.
DQ. (2017). List of Conformed Dimensions of Data Quality | Conformed Dimensions of Data Quality. https://dimensionsofdataquality.com/alldimensions.
European Commission. Statistical Office of the European Union. (2020). European Statistical System handbook for quality and metadata reports: 2020 edition. Publications Office, LU. https://ec.europa.eu/eurostat/documents/3859598/10501168/KS-GQ-19-006-EN-N.pdf
Farooq, M. B., & de Villiers, C. (2017). Telephonic qualitative research interviews: when to consider them and how to do them. Meditari Accountancy Research, 25, 291–316.
Article Google Scholar
Fayyad, J., Jaradat, M. A., Gruyer, D., & Najjaran, H. (2020). Deep Learning Sensor Fusion for Autonomous Vehicle Perception and Localization: A Review. Sensors, 20, 4220.
Fletcher, F. (1998). A Framework for Addressing Data Quality in Distributed Computing Systems. In Proceedings of the 1998 International Conference on Information Quality. MIT.
Fox, C., Levitin, A., & Redman, T. (1994). The notion of data and its quality dimensions. Information Processing & Management, 30, 9–19.
Article Google Scholar
Fujii, G., Hamada, K., Ishikawa, F., Masuda, S., Matsuya, M., Myojin, T., Nishi, Y., Ogawa, H., Toku, T., Tokumoto, S., Tsuchiya, K., & Ujita, Y. (2020). Guidelines for Quality Assurance of Machine Learning-Based Artificial Intelligence. International Journal of Software Engineering and Knowledge Engineering, 30, 1589–1606.
Article Google Scholar
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Iii, H. D., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64, 86–92.
Article Google Scholar
Gibbs, G. R. (2007). Analyzing Qualitative Data. London, United Kingdom: SAGE Publications, Ltd.
Gilb, T. (2005). Competitive Engineering: A Handbook For Systems Engineering, Requirements Engineering, and Software Engineering Using Planguage. Burlington, MA, United States: Elsevier.
Haoues, M., Sellami, A., Ben-Abdallah, H., & Cheikhi, L. (2017). A guideline for software architecture selection based on iso 25010 quality related characteristics. International Journal of System Assurance Engineering and Management, 8, 886–909.
Google Scholar
Heravizadeh, M., Mendling, J., & Rosemann, M. (2009). Dimensions of Business Processes Quality (QoBP). In D. Ardagna, M. Mecella, & J. Yang (Eds.), Business Process Management Workshops (pp. 80–91). Berlin, Heidelberg: Springer.
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design Science in Information Systems Research. MIS Quarterly, 28, 75–105.
Article Google Scholar
Holland, S., Hosny, A., Newman, S., Joseph, J., & Chmielinski, K. (2020). The dataset nutrition label. Data Protection and Privacy, Volume 12: Data Protection and Democracy, 12, 1.
Holstein, K., Wortman Vaughan, J., Daumé, H., Dudik, M., & Wallach, H. (2019). Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1–16).
ISO. (2008). ISO/IEC 25012:2008. Technical Report International Organization for Standardization Geneva, Switzerland.
ISO. (2011). Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models. Geneva: International Organization for Standardization.
Knauss, E. (2021). Constructive Master’s Thesis Work in Industry: Guidelines for Applying Design Science Research. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET) (pp. 110–121).
Knight, S.-A., & Burn, J. (2005). Developing a Framework for Assessing Information Quality on the World Wide Web. Informing Science: The International Journal of an Emerging Transdiscipline, 8, 159–172.
Article Google Scholar
Kruse, C. S., Goswamy, R., Raval, Y. J., & Marawi, S. (2016). Challenges and Opportunities of Big Data in Health Care: A Systematic Review. JMIR Medical Informatics, 4.
Madnick, S., Wang, R., & Xian, X. (2014). The Design and Implementation of a Corporate Householding Knowledge Processor to Improve Data Quality. Journal of Management Information Systems, 20, 41–70.
Article Google Scholar
McGilvray, D. (2008). Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information (TM). Academic Press.
McMeekin, N., Wu, O., Germeni, E., & Briggs, A. (2020). How methodological frameworks are being developed: evidence from a scoping review. BMC Medical Research Methodology, 20, 173.
Palinkas, L. A., Horwitz, S. M., Green, C. A., Wisdom, J. P., Duan, N., & Hoagwood, K. (2015). Purposeful sampling for qualitative data collection and analysis in mixed method implementation research. Administration and Policy in Mental Health and Mental Health Services Research, 42, 533–544.
Article Google Scholar
Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A Design Science Research Methodology for Information Systems Research. Journal of Management Information Systems, 24, 45–77.
Article Google Scholar
Peralta, V. (2006). Data Quality Evaluation in Data Integration Systems. phdthesis Université de Versailles-Saint Quentin en Yvelines ; Université de la République d’Uruguay.
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45, 211–218.
Article Google Scholar
Rusk, N. (2016). Deep learning. Nature Methods, 13, 35.
Sandkuhl, K. (2019). Putting AI into Context - Method Support for the Introduction of Artificial Intelligence into Organizations. In 2019 IEEE 21st Conference on Business Informatics (CBI) (pp. 157–164). volume 01.
Sedgwick, P. (2013). Convenience sampling. BMJ, 347, f6304. Publisher: British Medical Journal Publishing Group Section: Endgames.
Sessions, V., & Valtorta, M. (2006). The effects of data quality on machine learning algorithms. In 11th International Conference on Information Quality (pp. 485–498). Cambridge, MA, United States: MIT.
Sidi, F., Shariat Panahy, P. H., Affendey, L. S., Jabar, M. A., Ibrahim, H., & Mustapha, A. (2012). Data quality: A survey of data quality dimensions. In 2012 International Conference on Information Retrieval Knowledge Management (pp. 300–304).
Skjott Linneberg, M., & Korsgaard, S. (2019). Coding qualitative data: a synthesis guiding the novice. Qualitative Research Journal, 19, 259–270.
Article Google Scholar
Statistical Office of the EU. (2020). European Statistical System handbook for quality and metadata reports: 2020 edition.. LU: Publications Office.
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision (pp. 843–852).
Suri, H. (2011). Purposeful Sampling in Qualitative Research Synthesis. Qualitative Research Journal, 11, 63–75. Publisher: Emerald Group Publishing Limited. https://doi.org/10.3316/QRJ1102063
Vogelsang, A., & Borg, M. (2019). Requirements Engineering for Machine Learning: Perspectives from Data Scientists. In 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW) (pp. 245–251).
Wang, R. Y., & Strong, D. M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12, 5–33.
Article Google Scholar
Ziebiński, A., Cupek, R., Grzechca, D., & Chruszczyk, L. (2017). Review of advanced driver assistance systems (ADAS). AIP Conference Proceedings, 1906.

Download references

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 957197.

Funding

Open access funding provided by University of Gothenburg. The research received funding as part of the European Union’s Horizon 2020 project “VEDLIoT.”

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Gothenburg and Chalmers, Göteborg, 40530, Sweden
Shameer Kumar Pradhan, Hans-Martin Heyn & Eric Knauss

Authors

Shameer Kumar Pradhan
View author publications
You can also search for this author in PubMed Google Scholar
Hans-Martin Heyn
View author publications
You can also search for this author in PubMed Google Scholar
Eric Knauss
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Shameer Kumar Pradhan: Conceptualization, Validation, Formal analysis, Investigation, Data Curation, Writing - Original Draft, Writing - Review and Editing. Hans-Martin Heyn: Conceptualization, Investigation, Methodology, Validation, Resources, Writing - Review and Editing, Supervision. Eric Knauss: Conceptualization, Methodology, Resources, Writing - Review and Editing, Supervision, Project administration, Funding acquisition

Corresponding author

Correspondence to Shameer Kumar Pradhan.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Ethics approval

N/A.

Consent to participate

N/A.

Consent for publication

N/A.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Example of a list of data quality challenges

Here we present an example of a list of data quality challenges. The list was created using the template provided in Table 3. The presented list contains challenges from the category data availability challenges. Similar organized lists of challenges for the categories data management, data sources, data structure, and data trust can be found in the accompanying artifact package.^{Footnote 12}

Survey 1 - 4.333 (Rank 1/31), Survey 2 - 3.750 (Rank 1/25^*)

Appendix B: List of data quality attributes

The following Table 13 demonstrates how the template for data quality attributes in Table 4 can be applied to create an organized list of data quality attributes. The following list of data quality attributes and relevant metrics has been compiled and validated with the case company. Table 14 provides the data quality attribute metrics.

Note:

NA: Not Applicable
The numbers in the brackets are the weighted average values for the challenge-attribute association calculated from the focus group session and survey 2.
The first number inside the brackets denotes the weighted average from the focus group results, and the second number denotes the weighted average from survey 2.
If there is no weighted average from either the focus group or survey, the space is left blank. E.g., (, 1) would mean that there is no weighted average from the focus group, but there is a weighted average from survey 2. In the same way, (1, ) means vice versa.
The meaning of weighted average is explained in the main article.

Table 13 List of Data Quality Attributes, Their Sources, Definitions, and Association with Data Quality Challenges

Full size table

Table 14 List of Data Quality Attribute Metrics

Full size table

Some data quality attributes do not have an applicable metric. The lack of metrics is that these attributes do not have a tangible numeric value. E.g., Comment does not have a numeric value that can be used in devising a metric.

Following is the list of the data quality attributes without a metric.

1.
Accessibility
2.
Amount of Data
3.
Auditability
4.
Authorization
5.
Believability / Credibility / Reputation
6.
Clarity / Interpretability / Unambiguous
7.
Coherence and Comparability
8.
Comment
9.
Conciseness / Concise Representation
10.
Consistency and Synchronization
11.
Consistent Representation / Representational Consistency
12.
Contact
13.
Definition / Documentation
14.
Ease of Manipulation
15.
Ease of Operation
16.
Ease of Use and Maintainability
17.
Elasticity
18.
Flexibility
19.
Free of Error
20.
Institutional Mandate
21.
Learnability
22.
Lineage
23.
Metadata
24.
Metadata Update
25.
Navigation
26.
Objectivity
27.
Portability
28.
Precision
29.
Presentation Quality
30.
Quality Management
31.
Readability
32.
Recoverability
33.
Reference Period
34.
Release Policy
35.
Representation
36.
Resiliency
37.
Safety
38.
Scalability
39.
Security
40.
Statistical Presentation
41.
Statistical Processing
42.
Structure
43.
Traceability
44.
Unambiguous
45.
Understandability / Ease of Understanding
46.
Unit of Measure
47.
Usability
48.
Validity
49.
Value Added

Appendix C. Example of a solution candidate

Here we provide an example of a solution candidate. The solution candidate Continuous Data Processing has been developed together with the case company. It demonstrates how the template for solution candidates (Table 5) can be applied in practice. Figure 4 shows the flowchart of the solution candidate Continuous Data Processing. Altogether 13 solution candidates have been derived in this study. The remaining solution candidates can be found in the supplement material.^{Footnote 13}

1.1 Continuous data processing

Mitigated Challenge:

Data Delay

Requirement Specifications:

1.
Add new fields for departure timestamp and arrival timestamp in the database,
2.
Determine an acceptable range of time for data arrival

Implementation Details:

First, above mentioned requirement specifications, should be completed.
Then, when the data arrives for processing, check if it is in the initial processing stage.
CHECK_PIPELINE: If it is, check if there is data in the data pipeline.
- If there is data in the pipeline, start processing that particular piece of data without waiting for the rest of the data.
- CHECK_END: If there is no data in the pipeline, check if it is the end of processing.
  - * If it is the end of processing, stop.
  - * If it is not the end of processing, identify that there is a data delay.
  - * Check if the data departure timestamp is there or not.
    - \(\cdot\) If data departure timestamp exists, compute the total time taken by finding the difference between arrival and departure times.
    - \(\cdot\) Check if the time taken is within the acceptable range.
    - \(\cdot\) If it is within the acceptable range, stop.
    - \(\cdot\) If it is not within the acceptable range, notify appropriate stakeholders about the data delay.
If it is not the initial stage of processing, check if the stage is mid-processing.
- If yes, continue from CHECK_PIPELINE.
If the stage is not mid-processing, continue from CHECK_END.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Pradhan, S.K., Heyn, HM. & Knauss, E. Identifying and managing data quality requirements: a design science study in the field of automated driving. Software Qual J 32, 313–360 (2024). https://doi.org/10.1007/s11219-023-09622-8

Download citation

Accepted: 21 February 2023
Published: 11 May 2023
Issue Date: June 2024
DOI: https://doi.org/10.1007/s11219-023-09622-8

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Identifying and managing data quality requirements: a design science study in the field of automated driving

Abstract

Similar content being viewed by others

Requirements Engineering for Automotive Perception Systems: An Interview Study

Requirements and software engineering for automotive perception systems: an interview study

Guidelines for User-Centered Development of DAS

1 Introduction

1.1 Research questions and objectives

2 Background and related work

3 Research method

3.1 A case study in data quality for automated driving

Sampling strategy

3.1.1 First cycle: problem identification

3.1.2 Second cycle: solution design

3.1.3 Third cycle: evaluation

3.1.4 Calculation of challenge score

4 Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM)

4.1 Component I: Data quality workflow

S1 Identify data quality challenges

S2 Collect and organize data quality attributes

S3 Associate data quality challenges and data quality attributes

S4 Define data quality attribute metrics

S5 Identify solutions for data quality challenges

S6 Present to stakeholders

4.2 Component II: List of data quality challenges

4.3 Component III: List of data quality attributes

4.4 Component IV: Solution candidates

4.5 In-Depth evaluation of the data quality challenges

4.5.1 First evaluation survey

4.5.2 Second evaluation survey

4.5.3 Focus group evaluation

5 Discussion

5.1 Implications and contributions

5.2 Answer to the research questions

5.3 Threats to validity

5.3.1 Internal validity

5.3.2 Reliability

5.3.3 Conclusion validity

5.3.4 Generalizability

6 Conclusion

Data availability

Code availability

Supplementary information

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethics approval

Consent to participate

Consent for publication

Additional information

Publisher’s Note

Appendices

Appendix A. Example of a list of data quality challenges

1.1 Data availability challenges

Name:

Reference:

Description:

Directly affects AI Functions:

Challenge Score:

Name:

Reference:

Description:

Directly affects AI Functions:

Challenge Score:

Name:

Reference:

Description:

Directly affects AI functions:

Challenge score:

Name:

Reference:

Description:

Directly affects AI functions: