1 Introduction

Successful deep learning requires a large volume of data during the design and operation of such systems (Rusk, 2016). The data used for training and operation is crucial for achieving the desired behavior of a deep learning system (Sun et al., 2017). Consequently, there is a need to identify data quality challenges and to systematically define relevant data quality attributes. However, a systematic procedure for determining and managing data quality is largely missing. Today, most of the data quality assessment information for deep learning systems is based on undocumented expert knowledge, especially during pre-processing of input data (Holstein et al., 2019).

An advanced driver assistance system (ADAS) is designed to make driving comfortable and safe by enabling drivers to make the right decisions (Ziebiński et al., 2017). The system assists in overtaking other vehicles, parking, and detecting obstacles. ADAS can also execute emergency braking and lane changes independently. These systems are inherently safety-critical because they can intervene by braking and steering the vehicle. To enable all these functions, ADAS employs a perception system, which deploys deep learning and encounters a large volume of data during the design and operation phases (Fayyad et al., 2020). Examples of such perception functions are traffic sign recognition and road obstacle detection. Because of functional safety decomposition, the perception system will inherit functional safety requirements from the ADAS. In turn, the deployed deep learning models in the perception system will also have to comply with functional safety requirements. Consequently, the data used for training and testing the deep learning models must not compromise the safe function of the deep learning model.

This study aims to understand data quality requirements in the context of safety-critical systems like ADAS. Divergence from the expected system behavior can mean the difference between a safe journey and a fatal accident. The behavior of machine learning in general, and deep learning in particular, depends on the data, especially the quality of the data provided for training, validation, and inference at runtime. A lack of quality data might compromise the decision-making capabilities of the driver in the context of automated driving, which can result in a fatal accident. Thus, the data used for training the system should be appropriate for successfully operating in a real-world implementation. Similarly, data used for validation should be appropriate for determining whether the system will work as intended. Finally, during runtime, the inference must be based on data with a quality that resembles training and validation data quality; otherwise, it will be impossible to guarantee that the system is working within certain boundaries. Providing unsuitable data, i.e., data of poor quality, will lead to undesired system behavior and impact efficiency (Madnick et al., 2014; Challa et al., 2020).

1.1 Research questions and objectives

We formulate two research questions to guide our study:

  • Research question 1 (RQ1): What are the relevant data quality challenges in deep learning systems?

  • Research question 2 (RQ2): What constitutes a requirements framework for data quality management in deep learning systems?

Answering the first research question helps identify data quality challenges. Identification of such challenges can, in turn, help in devising solutions for those challenges. The second research question helps develop a series of components for a candidate framework whose goal is to help researchers, practitioners, and other stakeholders identify the data quality challenges, understand data quality attributes, and manage data quality overall.

The objectives of this study are as follows:

  • To identify challenges associated with data quality for deep learning systems such as those found in ADAS;

  • To understand data quality requirements for such systems;

  • To devise a set of solutions for identifying and mitigating data quality challenges.

The primary contributions of this study are the identification of relevant data quality challenges and the development of a series of artifact components that assist in identifying and reducing or mitigating such challenges. By understanding the identified data quality challenges, we establish a candidate framework that could mature into a full framework supporting stakeholders in identifying and maintaining data quality and requirements towards data. According to McMeekin et al. (2020), a methodological framework “provides structured practical guidance or a tool that supports its user through a process in a step-wise manner.”

We position the candidate framework devised in this paper as a stepping stone towards a comprehensive framework for understanding the data quality challenges and attributes for data-driven developments such as deep learning in ADAS.

The scope of the study is limited to establishing a candidate framework for data quality in the training and testing of deep learning models and, thus, does not relate to concrete data types produced by individual sensors. We study data quality requirements by exploring data quality challenges and attributes. The data collected for this study originates mainly from the past experiences of the experts. A candidate framework comprising various components is proposed based on data collected via interviews, focus groups, surveys, and literature review.

The remainder of this article is structured in the following manner. Background and related work are presented in Section 2. Similarly, Section 3 provides the study’s methodology and design using automated driving as a case study. Section 4 provides the result of the study in the form of a candidate framework, including a set of primary components and their evaluation. The resulting candidate framework and its implication to researchers and practitioners are discussed in Section 5. Finally, Section 6 concludes the article and provides potential future directions for this study.

2 Background and related work

With the rise of distributed systems, data soon became a key concern. Standards such as ISO 25010 on software and data quality can guide the handling of data quality aspects for software systems (ISO, 2011). However, the standard was drafted before the rise of machine learning in the late 2010s. It aims to guide software architecture decisions rather than data selection in data-centric applications (Haoues et al., 2017).

A data quality framework for distributed computing environments by Fletcher (1998) proposes a measure called Data Quality Risk Exposure Level (DQREL). DQREL is an attribute-dimensions matrix with eight data dimensions and three data attributes. As stated by the author, the DQREL matrix can be used to understand “data quality pitfalls” in a system.

A first step towards identifying data quality requirements is understanding the expectations for the final ML system. Sandkuhl (2019) studied the expectations in two projects, one in the financial industry and one in ML and data science. The author devised a method component to understand the organizational context of ML, which can be used to conduct ML requirements analysis and, finally, to analyze data availability based on the elicited requirements towards the ML system.

Sessions and Valtorta (2006) showed that requirements towards the ML system directly result in requirements towards data quality. The authors show that data quality impacts the effectiveness of machine learning algorithms and devise procedures for developing robust and practical algorithms using data quality assessments. They evaluate the need for good data quality by developing and testing three Bayesian networks. However, assessing and managing the data quality of large datasets is a challenging task, as shown by Cai and Zhu (2015). The data quality challenges they identify include difficulty in data integration, a large volume of data, fast-changing data, and a lack of data quality standards and frameworks. The authors propose a dynamic data quality assessment process to address these challenges. Another framework for data quality assessment and monitoring was developed by Batini et al. (2007). Based on the Basel II operational risk evaluation methods, the authors devised a data quality assessment methodology called ORME-DQ, which contains four phases for data quality risk prioritization, identification, measurement, and monitoring. The authors develop an architectural framework composed of five modules that support the phases of the assessment methodology.

The importance of such data quality assessment methods has also been shown by Fujii et al. (2020). The authors devised a set of guidelines for the quality assurance of AI. These guidelines connect data quality, model robustness, system quality, process agility, and customer expectation. They evaluated their proposed guidelines through a survey, with over 77% of the participants agreeing on their usefulness.

The elicitation of data requirements is one of the five challenges in requirements engineering for ML-based applications identified by Vogelsang and Borg (2019). The authors identified a gap between the tools data scientists use to control data quality and requirements engineering practices that connect data quality requirements to customer expectations.

The Open Measured Data Management Working Group has developed a vendor-neutral platform called OpenMDM to manage measured data. Automotive companies primarily use this platform to build in-house applications. It can, however, also be used to develop other solutions. It includes components and concepts that can be used to “compose applications for measured data management systems.” OpenMDM can manage measurement data, evaluation results, and descriptions.

Other data management frameworks, such as the datasheets for datasets proposed by Gebru et al. (2021), do not explicitly connect data quality attributes to data requirements. The “dataset nutrition label framework” introduced by Holland et al. (2020) provides an extendible approach for data scientists to compare different datasets summarized as labels. However, the framework presupposes a list of relevant data quality attributes and does not explain how data quality challenges can be solved.

We propose the contribution of this study as a blueprint toward a framework for identifying and managing data quality attributes. Unlike previous studies that mainly investigated individual aspects of data quality, this study provides a consolidated tool that includes data quality challenges, related attributes, and solution candidates to overcome the data quality challenges. The main difference from previous approaches is that the proposed framework can be extended with new data quality challenges, attributes, and solutions. Based on a case study, this article provides numerous examples of data quality challenges, attributes, and solutions entered into the proposed framework.

3 Research method

Design science research (DSR) was performed for this study. According to Hevner et al. (2004), DSR is a problem-solving process enabled by developing and evaluating novel artifacts as solutions to problems. The DSR methodology is applicable in various domains, including software, human-computer interaction, and system design.

The study was performed in three cycles, with each cycle focusing primarily on one of the stages of DSR, namely problem identification, solution design, and evaluation. However, other tasks were also updated whenever new information or ideas emerged, irrespective of the stage. Data quality challenges were identified during the problem identification stage with the help of a literature review and expert interviews. The framework was devised during the second cycle, the solution design stage. The identified challenges and the framework were evaluated in the third cycle, the evaluation stage. All stages are illustrated in Fig. 1.

Fig. 1 Stages of design science research

3.1 A case study in data quality for automated driving

Automated driving is adopted as the case for this research: we conduct a case study to evaluate data quality challenges in automated driving. The study was conducted in collaboration with a Swedish Tier 1 supplier of automotive systems for original equipment manufacturers (OEMs), which designs, manufactures, and sells software and hardware systems for occupant protection, ADAS, collaborative driving, and automated driving. These systems include vision, radar, lidar, thermal sensing, electronic controls, and human–machine interfaces. We argue that this company is representative for the development of automated driving systems, as it has customer relations with several OEMs worldwide and is one of the largest Tier 1 suppliers of perception systems for automated driving in Europe.

Sampling strategy

We employed a mixture of convenience sampling (Sedgwick, 2013) and purposeful sampling (Suri, 2011) techniques during the selection of the experts. The industry partner supported the selection of experts for this study and provided us with the experts based on our requirements regarding their expertise and area of work. We asked the company to provide us with experts with a wide variety of experiences and positions involved in the development of automated driving functions to obtain a broader perspective and receive more diverse feedback on our interview questions (Palinkas et al., 2015). Our main selection criterion was the active involvement in product development for ADAS functions that use some form of machine learning.

3.1.1 First cycle: problem identification

During problem identification, one investigates the research objective from different perspectives in sufficient detail to support the design of a solution (Peffers et al., 2007). While it makes sense to focus on problem identification in the first cycle, understanding the problem should be revisited iteratively even during the other cycles of the DSR (Knauss, 2021). Similarly, although the focus was on solution design during the second DSR cycle, problem identification and evaluation were also consciously considered. Feedback from the evaluation stage was also used to further refine the problem understanding and solution design.

The first cycle involved interviews and a literature review as the primary sources for identifying data quality challenges. The interviews, which were recorded and transcribed, were conducted via Microsoft Teams, an online communication tool. The data quality challenges were extracted and categorized using data-driven thematic analysis.

Based on the previously formulated research question, we developed an interview guide, since Farooq and de Villiers (2017) state that a well-developed interview guide gives interviews a better structure. Furthermore, feedback received from interviews can be helpful in further refining and rephrasing the interview questions. Based on the outcome of previous interviews, the questions were therefore tuned for subsequent interviews to fill remaining knowledge gaps.

The goal of the interviews in the first cycle was to identify data quality challenges. Interviewees A–E, listed chronologically in Table 1, participated in the first cycle; interviewees F–H in the same table participated in the second cycle. Five interviewees are experts from the case company, two additional interviewees are experts from two partner companies of our case company, and one expert is a research partner of the case company within an EU Horizon 2020 research project. We added experts from outside the case company to check the validity and transferability of the answers we received from within the case company.

Table 1 List of interviewees

The interviews were transcribed and thematically coded. Data-driven coding was used in the thematic analysis of the interviews of the first cycle, as described by Gibbs (2007). In such a technique, codes are based on the words used in the interviews.

A survey was conducted to understand the severity of the identified challenges. Interview participants from the first cycle and additional participants from a requirements engineering workgroup of a deep learning research project associated with the case company participated in the survey.

While preparing the survey questionnaire, the identified challenges were divided into five categories. For each category, the survey participants were asked to rank the challenges by their level of severity. They were also asked to rate the categories themselves. The rating scale ranged from 1 to 6, with 1 being the least severe and 6 being the most severe. A scale with an even number of alternatives was deliberately selected to induce the participants to “pick a side,” as suggested by Cox (1980).

An algorithm to calculate a metric called Challenge Score was developed. The algorithm uses the ranking of individual challenges in their respective categories and the Likert scale value given to those categories to calculate a Challenge Score. The value is normalized over the total number of challenges in the respective category and the number of survey participants. More details about the algorithm and the associated formula can be found in the accompanying data package.

3.1.2 Second cycle: solution design

After identifying the problem in the first DSR cycle, the primary focus of the second DSR cycle was on solution design. The artifact was designed to meet the stakeholder requirements and resolve the identified challenges by building on the early prototypes from the first cycle.

A series of artifact components, which collectively form the Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM), was designed as part of the solution design step. The components are explained in Sect. 4 of this article. Results from a literature review, the first round of interviews, the first survey, and the group brainstorming sessions between the researchers were used to devise the components and their content. We also conducted additional interviews in this cycle to verify the developed components. Furthermore, some of the questions asked during the interviews were open-ended to encourage brainstorming between the researchers and the interviewees.

The interviews of the second cycle were also thematically coded and analyzed. Unlike the thematic coding of the first-cycle interviews, descriptive coding and analytic coding techniques were used to thematically code the second-cycle interviews (Gibbs, 2007; Skjott Linneberg & Korsgaard, 2019), because we were focusing on verifying the findings of the first cycle.

We used four deductive codes in this study: confirmation of a pre-identified challenge, confirmation of a proposed solution, rejection of a pre-identified challenge, and rejection of a proposed solution.

3.1.3 Third cycle: evaluation

The third cycle of this study focuses primarily on the evaluation of the candidate framework. A preliminary evaluation was already conducted as part of the study’s first and second cycles; for example, the interviewees were presented with the artifact components and solutions in a preliminary design stage during the second cycle in order to gather their feedback on those components and solutions.

The evaluation was primarily done using a focus group and a survey. A focus group session was conducted to validate the candidate framework components. The focus group participants included researchers and engineers from academia and industry with experience in automated driving development, deep learning, and data quality. The session lasted 2 h and had five participants: two from academia and three from industry. The participants were presented with questions and asked to brainstorm about the association between the challenges, the data quality attributes, and the candidate framework components. They also shared their ideas and thoughts through discussion.

Finally, a comprehensive survey questionnaire was sent to members of the VEDLIoT requirements engineering workgroup. Ten participants submitted a response. The survey did not ask for names in order to maintain anonymity, so the participants’ identities could not be determined. This survey aimed to validate the components of the candidate framework. It asked the participants to provide a Boolean response on the appropriateness of individual fields in the templates of the candidate framework components. In the same way, the survey asked questions about data quality challenges, their association with data quality attributes, and their effect on deep learning models.

3.1.4 Calculation of challenge score

During the first iteration, 27 data quality challenges were identified through interviews and a literature review. A way to rank the challenges was necessary for effective analysis. The Challenge Score ranks the identified challenges in terms of their severity, i.e., how pressing a challenge is.

The computation of the Challenge Score is based on the responses from the survey conducted to rank the challenges. The survey contained two types of questions: one type asked the participants to assign a Likert-scale significance value to each of the five sets of challenges, and the other type asked them to rank the individual challenges inside the five sets.

As there are two types of responses to two types of questions, their results need to be combined. The Challenge Score combines both types of responses in one final value. For each respondent, the value they provided for each challenge set is recorded. The highest-ranked challenge in a challenge set is given the highest numerical value, and decreasing numerical values are assigned to the remaining challenges in that challenge set. E.g., if there are four challenges in a challenge set, the highest-ranked challenge is given a value of 4, the second highest-ranked a value of 3, and so on.

For each challenge, the assigned numerical value is multiplied by the value given by that particular participant for the challenge set of that particular challenge. This process was repeated for all of the participants and challenges. The product values calculated for all participants for individual challenges are summed. The final Challenge Score is calculated by dividing this sum by the number of challenges in the particular challenge set and by dividing the result by the total number of participants, which is done to normalize the final value.
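To make the computation concrete, the following Python sketch implements the Challenge Score as described above. The nested-dictionary input layout, the function name, and the variable names are our own illustrative assumptions; the exact formula used in the study is the one provided in the accompanying data package.

def challenge_scores(set_ratings, rankings):
    """Compute normalized Challenge Scores.

    set_ratings: participant -> {challenge set -> Likert value (1-6)}
    rankings:    participant -> {challenge set -> list of challenges,
                                 ordered from most to least severe}
    Returns a mapping from challenge to its Challenge Score.
    """
    totals = {}
    participants = list(rankings)
    for participant in participants:
        for challenge_set, ordered in rankings[participant].items():
            set_weight = set_ratings[participant][challenge_set]
            n = len(ordered)
            for position, challenge in enumerate(ordered):
                rank_value = n - position  # top-ranked challenge receives n, the next n-1, ...
                totals[challenge] = totals.get(challenge, 0.0) + rank_value * set_weight
    # Normalize by the size of each challenge's set and by the number of participants.
    set_sizes = {challenge: len(ordered)
                 for per_participant in rankings.values()
                 for ordered in per_participant.values()
                 for challenge in ordered}
    return {challenge: total / (set_sizes[challenge] * len(participants))
            for challenge, total in totals.items()}

As a sanity check, with six participants who all give a four-challenge set the maximum Likert value of 6 and all rank the same challenge first, that challenge obtains a score of (6 × 4 × 6) / (4 × 6) = 6.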

4 Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM)

This section presents the final artifact, titled Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM), which was developed during the study. It includes four components listed in Table 2, namely a Data Quality Workflow, a List of Data Quality Challenges, a List of Data Quality Attributes, and Solution Candidates. Components here refers to a set of tools that can be used, independently or in combination, to identify and manage data quality requirements. This section outlines each of the components. Furthermore, for each component, more details, implementations, and literature references are provided in the artifact package of this article. In the following, we define attribute as “a concept providing qualitative information about a specific object” (Statistical Office of the EU, 2020).

Table 2 Candidate Framework components and their purpose (*refer to Fig. 2 for the related steps in the workflow component)

4.1 Component I: Data quality workflow

This component presents a step-by-step workflow for assessing and managing data quality and requirements. It includes six steps, as shown in Fig. 2. Most of the steps can be performed in parallel, as depicted by the dotted line in Fig. 2. Loops indicate that the steps can be done iteratively. The components of CaFDaQAM can be associated with the different steps of the workflow, as depicted in Table 2. The workflow was developed through brainstorming with experts. Furthermore, it was presented to the industry practitioners working with the case study during the focus group session to collect feedback for its evaluation.

S1 Identify data quality challenges

In this step, challenges concerning data quality can be identified from several sources. Examples of primary sources of data collection are interviews, field studies, and surveys. Research papers and books can be used as secondary sources as well. Furthermore, the collected challenges can be divided into different categories. In this study, they were categorized into five groups relating to data availability, data management, data source, data structure, and data trust.

S2 Collect and organize data quality attributes

In this step, data quality attributes can be identified from various sources, such as research papers, proceedings papers, books, standards, technical reports, Internet articles, and interviews. Several differently phrased data quality attributes can be represented by a single attribute; e.g., understandability and ease of understanding can be consolidated into one attribute.
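As a small illustration of this consolidation step, a mapping from differently phrased attribute names to one canonical attribute could be maintained alongside the attribute list; the sketch below is illustrative, and only the understandability example stems from the text above.

# Illustrative consolidation of differently phrased attribute names into one
# canonical data quality attribute (hypothetical entries apart from the
# understandability example mentioned in step S2).
canonical_attribute = {
    "understandability": "Understandability",
    "ease of understanding": "Understandability",
    "comprehensibility": "Understandability",
}

def normalize_attribute(raw_name: str) -> str:
    # Fall back to the raw name if no canonical form has been registered yet.
    return canonical_attribute.get(raw_name.strip().lower(), raw_name)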

S3 Associate data quality challenges and data quality attributes

Data quality challenges and quality attributes can be associated with each other after their identification. An association means that a certain data quality challenge affects a certain data quality attribute. There is a many-to-many relationship between data quality challenges and data quality attributes, i.e., one challenge can affect more than one attribute, and one attribute can be affected by more than one challenge. For instance, accuracy (attribute) is affected by data drop, incomplete data, etc. (challenges); and data drop (challenge) can affect accuracy, completeness, etc. (attributes). However, there can also be data quality attributes that are not affected by any identified challenge and data quality challenges that do not affect any attribute.
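A minimal sketch of how such a many-to-many association could be recorded is shown below; the two entries mirror the examples given in this step, and a complete mapping would be filled from the List of Data Quality Challenges and List of Data Quality Attributes components.

# Challenge -> affected attributes (association produced in step S3);
# entries are limited to the examples mentioned above.
challenge_to_attributes = {
    "Data Drop": ["Accuracy", "Completeness"],
    "Incomplete Data": ["Accuracy", "Completeness"],
}

# The inverse view (attribute -> challenges) makes the many-to-many
# relationship explicit and shows which challenges threaten an attribute.
attribute_to_challenges = {}
for challenge, attributes in challenge_to_attributes.items():
    for attribute in attributes:
        attribute_to_challenges.setdefault(attribute, []).append(challenge)

# attribute_to_challenges["Accuracy"] -> ["Data Drop", "Incomplete Data"]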

S4 Define data quality attribute metrics

Metrics to measure data quality attributes are formulated in this step. The metrics help to put a quantitative value on the attributes. E.g., the degree of accuracy (metric) helps to measure accuracy (attribute) by giving it a quantifiable value. Furthermore, formulae can be devised to calculate the metrics. E.g., the degree of accuracy can be calculated as the ratio of the number of correctly labeled data records to the total number of data records. The formulae are mostly dependent on the context of the application.
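As a sketch under the definition above, the degree-of-accuracy metric can be computed directly from label counts; the function name and the guard against empty datasets are our additions.

def degree_of_accuracy(correctly_labeled_records: int, total_records: int) -> float:
    # Degree of accuracy (step S4): ratio of correctly labeled data records
    # to the total number of data records.
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return correctly_labeled_records / total_records

# Illustrative numbers: 9,500 correctly labeled records out of 10,000
# give a degree of accuracy of 0.95.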

S5 Identify solutions for data quality challenges

A way of improving data quality attribute metrics, and thus the quality attributes themselves, is to determine candidate solutions for the data quality challenges that affect the attributes. If the challenges can be mitigated or reduced, the data quality attributes improve. For instance, finding a solution for data drop (challenge) and implementing it in the system process could result in less data being dropped, thus improving completeness (attribute). Solutions can be identified from several sources, such as research papers, technical reports, and books. Teams can also brainstorm and devise new solution candidates for the challenges. An effective way to validate solution candidates is to implement them as tests in part of a system.
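To illustrate the idea of validating solution candidates as tests, a threshold-style check of this kind could be added to a data pipeline’s test suite; the completeness threshold and record counts below are purely illustrative assumptions, not values from the study.

def degree_of_completeness(received_records: int, expected_records: int) -> float:
    # Share of expected data records that actually arrived in the pipeline.
    return received_records / expected_records

def test_no_excessive_data_drop():
    # In a real pipeline these counts would come from ingestion logs;
    # they are hard-coded here only for illustration.
    received, expected = 9_980, 10_000
    assert degree_of_completeness(received, expected) >= 0.99, \
        "Data drop detected: completeness below the tolerated level"

Run with a test runner such as pytest, the check fails as soon as a monitored batch loses more than the tolerated share of records, turning the solution candidate into an executable safeguard.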

S6 Present to stakeholders

As the final step, identified data quality challenges, attributes, and solution candidates should be presented to appropriate stakeholders. They could be higher management, other colleagues, or customers. A suitable form of presentation should also be decided.

Fig. 2 Data Quality Workflow artifact component

4.2 Component II: List of data quality challenges

Table 3 presents the template of the List of Data Quality Challenges component. It includes eight fields, validated by the participants of the focus group as well as the second survey. The participants were asked to decide whether a certain field was required for a particular component. All participants considered every field except one (the source) applicable to the component; the source field was agreed upon by 75% of the participants. The challenges are tied to the case, as they were identified by the experts from the case company. The challenges identified in the case study were entered into the template and are also provided in the artifact package.

Table 3 Template for List of Data Quality Challenges artifact component

In response to the first research question (RQ1), a total of 27 data quality challenges had been identified by the end of the study through elicitation methods such as interviews and a literature review. Ten challenges were identified in both our literature review analysis and the interview data. Nine other challenges were only found in the interview data, without a matching report in related work. The remaining eight challenges were identified only in the literature review. Figure 3 depicts the number of challenges retrieved from the various sources, such as interviews and the literature review, as well as the methods employed to validate the identified challenges. The challenges are divided into five broad categories: data availability, data management, data source, data structure, and data trust. We list the identified challenges here; a complete description of all challenges, including more details on each challenge, is available in the artifact package accompanying this article. As an extract from the artifact package, the challenges under the data availability category are detailed in Appendix A.

Fig. 3 Challenges identified from various sources and validated using different methods

Data availability challenges affect the data availability during processing by AI models. The challenges categorized under this challenge set are Data Delay*, Data Drop**, Incomplete Data*, and Low Labeled Data Volume**.

Data management challenges are related to data management and operations performed on them. The challenges categorized under this challenge set are Data Acquisition***, Data Ownership*, Expensive Procedure**, Imbalanced Dataset***, Improper Data Transfer*, Large Volume of Data***, Manual Data Collection***, Manual Data Labeling**, Redundant Data*, Regulatory Compliance***, Reliance on Suppliers to Raise Error**, and Time Consuming**.

Data source challenges are those caused by and due to the source of the data. The challenges categorized under this challenge set are Data Dependent on External Conditions**, Lack of Variety in Test Environment**, New Data Type*, and Wrongly Calibrated / Defective Sensors*.

Data structure challenges are related to the format and structure of the data. The challenges categorized under this challenge set are Fragmented Data***, Incompatible Data Format***, Outlier Data*, and Unstructured Data***.

Data trust challenges are caused by a lack of transparency in the data and its quality, which makes it difficult to extract meaningful information. The challenges categorized under this challenge set are Incorrect Labeling*, Lack of Good Data from Simulations**, and Noise*.

4.3 Component III: List of data quality attributes

Table 4 presents the template of the List of Data Quality Attributes component, which the participants of the focus group validated. It includes eleven fields. Altogether, 82 data quality attributes are presented in the concrete implementation of this component. A complete list of all data quality attributes is provided in Appendix B in Tables 13 and 14, and a full description of each item is available in the artifact package. Additionally, 30 metrics for the different data quality attributes are presented in the appendix, and a complete description of these metrics is also available in the artifact package. E.g., a metric to measure the timeliness of data is the degree of timeliness, the ratio between the number of data records received within an acceptable time and the total number of received data records. Furthermore, fields from Planguage, a quality-factors notation (Gilb, 2005), were also adapted for this component.
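Written out, the degree-of-timeliness metric mentioned above can be stated as follows, where \(R\) denotes the set of received data records and \(t_{\max}\) the acceptable delay (the notation is ours):

\[
\text{degree of timeliness} = \frac{\left|\{\, r \in R : \operatorname{delay}(r) \le t_{\max} \,\}\right|}{|R|}
\]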

Table 4 Template for List of Data Quality Attributes artifact component (italics: Fields from Planguage, not evaluated)

4.4 Component IV: Solution candidates

Table 5 presents the template of the Solution Candidates component. It includes four fields, which were validated by the focus group participants. In the concrete implementation of this component, 13 solution candidates are devised, as depicted in Table 6. Examples are Automated Labeling, which addresses the Low Labeled Data Volume and Manual Data Labeling challenges, and Corroboration of Data with a Central Data Repository, which addresses the Data Dependent on External Conditions challenge. It should be noted that a single solution can be suitable for solving more than one challenge.

Table 5 Template for solution candidates artifact component
Table 6 Concrete solution candidates

In this study, we explored and devised solution candidates. An example of a solution candidate definition is presented in Appendix C. Definitions for all depicted solution candidates can be found in the accompanying artifact package.

4.5 In-depth evaluation of the data quality challenges

Since identifying data quality challenges is one of the goals of this study and a fundamental aspect of the candidate framework, this section presents the results of an in-depth evaluation of the identified data quality challenges.

In order to verify the severity of the identified challenges and the validity of the data quality attributes, two surveys and a focus group were conducted. The participants were asked to assign a Likert scale value between 1 and 6 to each challenge set. They were also asked to rank the individual challenges within the challenge sets. The provided Likert scale values and the rankings were used to compute a Challenge Score. The higher the score, the more severe a challenge is compared to the other challenges. The Low Labeled Data Volume challenge (i.e., a lack of sufficient labeled data) had the highest score in both surveys and is hence regarded as the most severe challenge.

In addition, during both surveys, the participants were asked to rank the challenge sets on a similar scale. All challenge sets were deemed relevant and showed only minor differences in ranking.

4.5.1 First evaluation survey

Table 7 presents the values of the Likert scale selected for each challenge set by the survey participants. Here, S1-S6 are the six survey participants. The data is presented in alphabetical order of the challenge set.

Table 7 First survey - Ranking of Challenge Sets

Table 8 provides the ranking of data quality challenges given by participants of the first survey during the first cycle. Here, S1-S6 are the six survey participants, \(\sum\) depicts the sum of the products of rankings, and f is the final normalized Challenge Score.

Table 8 Ranking of data quality challenges through the first evaluation survey during the first cycle of the study

4.5.2 Second evaluation survey

Table 9 presents the values of the Likert scale selected for each challenge set by the second survey participants. Here, S7-S10 are the four survey participants. The data is presented in alphabetical order of the challenge set.

Table 9 Second survey - Ranking of Challenge Sets

The ranking of data quality challenges given by participants in the second survey during the third cycle is presented in Table 10. In the table, S7-S10 are the four survey participants, \(\sum\) is the sum of the products of rankings, and f is the final normalized Challenge Score.

Table 10 Ranking of data quality challenges through the second evaluation survey during the third cycle of the study

Furthermore, the second survey sent to the study participants also aimed to validate the fields of the templates of the candidate framework components. Every field except two was evaluated as appropriate by all survey participants: the Sources field in the List of Data Quality Challenges component and in the List of Data Quality Attributes component was evaluated as suitable by only 75% and 50% of the survey participants, respectively.

4.5.3 Focus group evaluation

A focus group session was conducted in the third cycle of this study. Five experts in deep learning, data science, and requirements engineering participated in the session. Two experts were employed at the case company; three were members of the VEDLIoT research project. Two types of questions were presented during the session. The first type pertains to the ranking of the data quality challenges; the researchers wanted to understand whether the experts would rank the challenges differently compared to the ranking obtained in the first cycle of this study. The second type relates to validating the association between data quality challenges and attributes.

Unlike the surveys, the focus group session produced an overall ranking of the challenges, without assigning individual weights and without calculating a Challenge Score. One reason for this different procedure is that a different tool was used for the focus group. The ranking of the challenge sets using a Likert scale is presented in Table 11, and the ranking of the challenges in each challenge set is presented in Table 12.

Table 11 Ranking of Challenge Sets
Table 12 Ranking of Data Challenges for each Challenge Set based on Expert’s knowledge collected in a focus group

A total of 107 data quality challenge-attribute associations were presented for validation during the focus group. The experts regarded only four challenge-attribute associations as not valid (i.e., in the experts’ opinion, the supposition that the challenge affects the attribute did not hold for these four associations). Similarly, for 30 challenge-attribute associations there was unanimity, i.e., all of the experts in the focus group session regarded the particular challenge as affecting the particular attribute.

For 45 challenge-attribute associations, more than half, but not all, of the experts in the focus group regarded the challenge as affecting the attribute. Similarly, for 26 challenge-attribute associations, more than half, but not all, of the experts regarded the challenge as not affecting the attribute. Only for the Data Delay challenge were there two challenge-attribute associations in which half of the experts regarded the challenge as affecting the attribute and the other half did not; this tie occurred because one of the focus group participants did not answer the question regarding Data Delay.

All data regarding the focus group, including tables outlining the experts’ responses, can be found in the data package accompanying this article.

5 Discussion

In this section, we discuss the implications and contributions of our study. We also provide potential threats to the validity of the study.

5.1 Implications and contributions

The study has implications for researchers and practitioners interested in data quality and methods to assess and manage data quality. First, the study provides a candidate framework, which can act as a repository of information regarding data quality. The Data Quality Workflow component provides a step-by-step guide of the tasks that could be performed for overall data quality management.

Similarly, the List of Data Quality Challenges component provides a tool that can be consulted when designing a system to understand the types of data challenges it could face. The List of Data Quality Attributes component presents interested parties with attributes they might want to emphasize in their systems; for example, a system might prioritize data availability over completeness, or vice versa. The component helps them understand which challenges could affect those attributes. Similarly, by using the metrics provided, interested parties can understand which metrics to focus on and which data to collect to calculate the metric values. In particular, practitioners can record data and compute metrics, which could help them adapt and change their processes if needed.

Likewise, using the solution candidates component, they can identify and implement techniques for mitigating the challenges affecting the attributes they prefer most.

A comparison can be made between the candidate framework proposed in this study and the OpenMDM framework described in Sect. 2. One difference is that OpenMDM provides workflow management of measurement data, whereas CaFDaQAM provides a workflow for overall data quality management. Furthermore, OpenMDM is an Eclipse-based tool, whereas CaFDaQAM can be employed in a programming language-neutral and IDE-neutral fashion.

The candidate framework developed in this study combines various components into a comprehensive collection of tools to assess and maintain data quality: a data quality workflow, templates for identifying and recording data quality challenges and attributes, a list of identified data quality challenges, a list of data quality attributes and metrics, and a list of solution candidates for many data quality challenges. Prior studies (see Sect. 2) each explored a single concept. For example, Cai and Zhu (2015) studied only the challenges of data quality, Batini et al. (2007) explored steps in data quality risk assessment, and Fletcher (1998) provided an attribute-dimensions matrix. Similarly, Fujii et al. (2020) focused on quality assurance of machine learning-based AI applications and did not touch upon data specifically. Unlike previous studies and frameworks, CaFDaQAM addresses the overall data quality management process by explicitly proposing a data quality workflow and providing the necessary tools to apply that workflow. The proposed candidate framework also provides requirements for the individual components.

5.2 Answer to the research questions

In response to the first research question (RQ1), this study identified 27 data quality challenges through interviews and a literature review. We developed a method, Challenge Score ranking, to rank and understand the severity of the identified challenges. Furthermore, we verified the identified challenges using surveys and a focus group.

Furthermore, four components were derived that together form the Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM): a data quality workflow, and tools in the form of templates and lists for identifying data quality challenges, for identifying, quantifying, and managing data quality attributes, and for developing solution candidates for data quality challenges. We validated the candidate framework components using a survey, thus ensuring that they correctly address the needs of the stakeholders. Hence, the validated candidate framework components answer the second research question (RQ2).

5.3 Threats to validity

5.3.1 Internal validity

Internal validity is concerned with how different variables affect the result of an experiment. One such threat is researcher bias: we, as researchers, could have introduced biases about the topic of the study, for example while collecting data and conducting interviews. To mitigate this, two researchers performed thematic coding separately using the same coding technique and then merged their codes into a single final set in a joint meeting through discussion.

Similarly, as stated earlier, some challenges were identified through literature review only. However, they were validated by conducting a focus group and surveys. Also, a predefined set of questions was used for the interviews, limiting the discussion during interview sessions. At the end of each interview, the interviewees were asked if any questions that should have been asked were missed. Efforts were made to reduce ambiguity in the questions as much as possible. However, there could still be confusion regarding the questions because of communication gaps.

Likewise, there was a limited number of participants in the interviews, the focus group, and the surveys. Most were from the automated driving sector, which could have skewed the study’s results. Moreover, if researchers conduct future experiments with the same questionnaire used in this study, the results could vary when only a few participants are involved, because those participants might have different experiences and expertise than the ones consulted during this study.

5.3.2 Reliability

Reliability is associated with the replicability of an experiment or other empirical study: future studies designed in the same fashion should produce the same results. The different versions of the interview questions are provided in a replication package so that researchers can track how the interview questions evolved based on the participants’ responses and can ask similar questions in future studies. However, the experts’ responses might differ even if they come from the same domain and have similar years of experience, because they could have different backgrounds and career experiences, or simply different perspectives.

5.3.3 Conclusion validity

Conclusion validity deals with the reasonableness of the results of an experiment. Because focus group sessions and surveys were conducted to evaluate the artifacts developed in the study, we argue that the conclusions of this study are valid. However, we have not yet validated the conclusions with experts from other domains, such as healthcare, aerospace, or law enforcement, and the artifacts have not been implemented in a real-world context. There is thus scope for future work on the real-world implementation of the artifacts developed in this study.

5.3.4 Generalizability

The study was conducted for a specific sector, automated driving. While our findings and candidate framework can only be generalized beyond this scope with further research, we hope our work can inspire similar investigations in other domains. For instance, quality data is also crucial for critical systems such as healthcare or power grid applications. The candidate framework could be used as a template to identify data quality challenges and mitigate them in such systems, although modifications of the candidate framework and its components might be warranted for such generalization. Furthermore, we do not claim generalizability of the identified challenges; we only claim the transferability of the concept that challenges exist in the defined categories.

6 Conclusion

In this study, we have identified data quality challenges that could arise in deep learning systems, using a case study on automated driving, thus answering RQ1 of this study. We have identified, analyzed, and evaluated the data quality challenges using interviews, surveys, and a focus group. The list of challenges acts as one of the components of the candidate framework devised in this study.

The proposed Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM), its components, and associated templates assist in comprehending data quality challenges, attributes, metrics, and solution candidates. The candidate framework can be used as a tool to improve data quality and to define data quality requirements for a given system. The proposed templates create a reference point for identifying data quality issues and defining the necessary data attributes. The candidate framework can thereby help deep learning systems perform better, make better predictions, and reduce the risks that insufficient data quality could pose. Using the information provided by the candidate framework, stakeholders can proactively identify and mitigate challenges regarding data quality. The candidate framework thus addresses RQ2 of this study.

As future work, researchers can use the candidate framework components as a baseline to further develop the framework. Additional challenges could be identified, or identified challenges could be broken into sub-challenges and explored in detail. In order to make the candidate framework more generalizable, it can be tested in other fields, such as healthcare, and additional data quality challenges, attributes, and solutions can be identified from different domains.

The candidate framework could also be turned into an automated tool: data could be passed through a pipeline, relevant quality aspects of the data could be assessed automatically, and the resulting quality information could be presented to appropriate stakeholders using various media and visualization techniques.