1 Introduction

Machine learning (ML)-based systems offer numerous benefits. For example, they provide software solutions to previously impossible functionalities, including autonomous driving, object recognition, and forecasting. Due to the criticality of the results, ML-based software must be developed in a reliable and trustworthy manner (Khomh et al., 2018; Ozkaya, 2020). However, achieving reliability and trustworthiness remains challenging due to the unique characteristics of ML-based software components (Martínez-Fernández et al., 2022; Sculley et al., 2015).

Development of ML components, more often referred to as ML models, is highly experimental and nondeterministic (Wan et al., 2021). Identifying an ML model version that satisfies the predefined quality described in the performance metrics requires experimentation with different parameters and datasets (Vogelsang & Borg, 2019). After deployment, continuous monitoring is necessary to detect drifts and keep the ML model relevant to the ever-changing operation domain via retraining (Xiang et al., 2023; Lima et al., 2022). Additionally, the characteristics of ML drive solutions that are uncommon in traditional software engineering. Dataset manipulation such as augmentation, reduction, or rebalancing is necessary to provide quality inputs to the training and testing process (Rahman et al., 2023). Manipulation of the neural network architecture (e.g., deep neural network (DNN) repair) is another example of an ML-specific solution that replaces traditional software debugging activities (Sotoudeh et al., 2021).

The decisions made during ML-specific development activities often occur in isolation from the requirements of the rest of the software system (Nahar et al., 2023). Analysis of ML model components considers only the ML perspective and ignores other aspects crucial for reliable and successful software system development, such as business and safety requirements (Wolf & Paine, 2020; Pereira & Thomas, 2020). Consequently, the reliability of the overall ML-based system cannot be demonstrated. This type of problem is not new for traditional software. The model-based approach overcomes this problem by facilitating traceability and consistency between different aspects of the software system (Batot et al., 2021). The term multi-view emphasizes the usage of different models to capture various aspects of the system (Reineke et al., 2019).

Several multi-view approaches have been proposed to accommodate ML characteristics (Villamizar et al., 2022; Nalchigar et al., 2021). However, these approaches are prone to inconsistency and a lack of traceability between the analysis and the implementation because decisions are implemented separately from the analysis. Other works have implemented model transformation approaches that integrate analysis models into ML model training, but they lack the capability to relate the model defining the ML training design to the other aspects necessary for a system-level analysis (Moin et al., 2022; Koseler et al., 2019). A different approach that combines the benefits of multi-view modeling and integrated implementation is necessary to achieve effective ML system analysis.

This paper extends our proposal of the Multi-view Modeling Framework for ML Systems (M3S) as a model-based framework that facilitates consistent and comprehensive analysis of ML systems (Husen et al., 2023). M3S analyzes the ML components and the overall system itself. The analysis includes integrating the modeling environment and the ML pipelines to facilitate the highly experimental characteristics of ML models, in which a series of training and evaluations are conducted with different solutions and configurations to identify an ML model version that satisfies all requirements. These integrations are based on a cohesive metamodel to ensure analysis consistency. The underlying approach of M3S supports a reliable and comprehensive analysis of the ML system while ensuring tight synchronization with the implementation. This synchronization provides the base for a feedback loop between the analysis and the implementation at both the ML model and the overall system levels.

Previously, we proposed a version of M3S and evaluated it with a limited case study (Husen et al., 2023). The early version featured an analysis framework without an integrated implementation of the decisions documented in the models. This paper extends the previous work with an improved version of the framework that integrates the modeling environment with ML training pipelines, along with an updated metamodel and process reflecting this improvement. Finally, a more comprehensive case study and a controlled experiment validate the capabilities of the updated M3S.

The contributions of this paper include:

  • Proposal of a multi-view modeling approach for an overall analysis of ML systems. Because a comprehensive analysis is necessary to achieve reliable ML systems as overall solutions rather than merely ML models, we validate the usefulness of the proposed approach.

  • Development of an integrated metamodel for ML system analysis. Integrating different modeling approaches into a single entity is a known limitation of the multi-view modeling approach. Herein we introduce an integrated metamodel that reflects the unique characteristics of ML systems and incorporates analysis approaches developed specifically for them (e.g., ML Canvas and the ML model pipeline).

  • An integrated tool connecting the modeling environment and the ML training pipeline. ML model development is highly experimental, and separating the analysis from ML model training and testing causes drift between them. To enhance the consistency between these two sides of different natures, we developed a tool that integrates the modeling environment and the ML pipeline through the definition and execution of ML training configurations.

The rest of this paper is structured as follows. Section 2 provides a motivating example. Section 3 summarizes related works. Section 4 presents M3S and its implementation as an integrated tool. Sections 5 and 6 present a case study and a controlled experiment of M3S, respectively. Section 7 discusses the benefits and limitations of M3S identified during the evaluation. Finally, Section 8 concludes this paper with future directions for M3S.

2 Motivating example

Figure 1 outlines the motivating example of our work. In a model-first ML system development, an ML model is developed using the standard experimental process, where an experiment management tool executes several runs. This generates a set of versions of a multi-class classification model. The performance of each version is tested and serves as the basis for deciding which version to deploy into the system.

Fig. 1 Motivating example

However, the system where the ML model will be deployed has its own requirements. At a minimum, a business-level decision dictates the success criteria of the overall project. Problems typically arise in the link between the performance of the selected ML model version and the requirements at a higher level. Not only must the relative importance of the ML performance for each class be traced to the business decisions, but the business decisions themselves may also align or contradict each other at the ML performance level. Hence, when the ML performance and business decisions conflict, they must be reconciled to maximize the achievement of the project goals.

Another problem emerges when training a new version of the ML model is an option. The approach used to train the new version must satisfy the business decisions, so an analysis based on those decisions must be employed to select a suitable approach. Otherwise, experimentation on training the new version of the ML model may pursue unachievable goals. It should be noted that business decisions may concern not only the functionality of the ML model but also other quality aspects such as safety and fairness. Hence, a trade-off may be necessary (Software engineering - Systems and software quality requirements and evaluation (SQuaRE) - Quality model for AI systems, 2023).

We aim to solve this problem using M3S. The multi-view approach should bridge the deterministic side of higher-level requirements and the decisions to implement training strategies inside the black-box ML training activities. An integrated metamodel guides which parts of the decisions are connected and synchronized. Finally, if a change in higher-level requirements is necessary, the traceability of the framework should provide reliable information on the limitations of the possible solutions from ML model training.

3 Related works

Several works have investigated reliable ML system development. Here, we classify them based on their similarity to M3S. In terms of model-based approaches for analyzing ML models, early examples include Bishop’s work on probabilistic graphical models (Bishop, 2013) and Infer.NET (Minka et al., 2018). Moin et al. proposed the integration of Model-Driven Software Engineering and Automated ML (AutoML) with automated source code and ML model generation (Moin et al., 2022) as well as a Model-Driven Engineering (MDE) approach for analytics and software modeling focused on ML, mainly in the Internet of Things (IoT) domain, with a prototype named ML-Quadrat (Moin et al., 2022). Kirchhof et al. conducted a comparative study using MontiAnna, a textual modeling framework, and ML-Quadrat to explain the potential of the MDE approach in the ML domain (Kirchhof et al., 2022). Koseler et al. defined a domain-specific modeling language as a metamodel for ML systems (Koseler et al., 2019). Langford et al. proposed MoDALAS to facilitate model-driven runtime monitoring of learning-enabled components (Langford et al., 2021). M3S differs from these studies in that it does not address a single aspect with a single model but instead integrates different aspects into a single workflow.

Some studies have connected ML performance to requirements on higher levels. Villamizar et al. proposed perspective-based ML task specification based on the classification of 45 ML concerns into five perspectives: objectives, user experience, infrastructure, model, and data (Villamizar et al., 2022). Takeuchi and Yamamoto devised an analysis method to construct a business-AI alignment model in ArchiMate (Takeuchi & Yamamoto, 2020). Chuprina et al. proposed an artefact-based requirements engineering approach, which divides the concerns into four layers: context, requirements, system, and data-centric (Chuprina et al., 2021). Nalchigar et al. proposed GR4ML, a conceptual modeling framework for ML that utilizes three perspectives: business, analytic design, and data preparation (Nalchigar et al., 2021). M3S follows the same concept but takes a different approach to abstracting the levels of requirements, and it extends the concept into the ML training pipeline.

Idowu et al. proposed the Experiment Management Meta-Model (EMMM) as an integrated metamodel of commonly used experiment management tools (Idowu et al., 2022) and a taxonomy of those tools (Idowu et al., 2023). The difference between our metamodel and EMMM lies in their focus. EMMM models experiment management tools, while our integrated metamodel aims to connect the elements of the analysis models to the artefacts inside the experiment management tool itself. In M3S, the metamodel of the pipeline is one part of the overall integrated metamodel.

The main difference between M3S and the aforementioned studies is the approach used to connect the analysis with its implementation inside ML training. M3S is not designed to generate the source code for model training. Instead, it connects the solutions, configurations, and results embedded in the ML pipeline with higher-level requirements to support the achievement of project requirements. Moreover, M3S facilitates a feedback loop between the ML pipeline and the system modeling through the integrated environment for modeling and training, which is not well supported by traditional model-driven approaches based on model-to-code generation.

4 Multi-view modeling framework for ML system

Figure 2 overviews M3S, which aims to integrate and synchronize the system modeling and ML pipeline sides. The key concept behind M3S is an integrated, traceable process between system models and the ML pipeline. It involves a multi-peak process where the analysis and implementation are iteratively refined from a highly abstract and speculative state into a more concrete and proven one. An integrated metamodel guides the connections between system models, ML pipelines, and configurations, linking both sides to ensure consistency and traceability. There are three use cases of M3S, distinguished by which side is developed first: the top-down, model-first, and parallel approaches. These use cases are based on common trajectories of ML system development (Nahar et al., 2022).

Fig. 2 Overview of M3S

The first is a top-down approach. This is like a traditional software process where analytical models are developed prior to the execution of ML pipelines. The results of ML testing are then used as the basis to refine the decisions before another ML pipeline is executed. This use case fits a product-first approach to ML-system development.

The second one is the model-first approach. In this case, early versions of the ML model are developed before the analysis starts, and the analysis is done in a reverse-engineering manner. First, the current capability of the ML model is connected to the higher-level requirements. Second, whether the model satisfies the goal is determined. Third, the requirements are retained or renegotiated based on the findings of the analysis. This approach is appropriate when the ML system process is based on exploring what kinds of ML tasks the existing data can support.

The last one is the parallel or hybrid approach. This use case starts by defining the highest-level requirements, followed by small-scale experimentation with whatever approaches the team can devise. In this case, the analysis and experimentation sides must communicate frequently to close the gaps as the analysis and ML performance become clearer over time. This use case makes the fullest use of M3S, but it is also the most challenging to implement correctly.

4.1 Multi-view modeling process

M3S comprises six views covering different aspects of a reliable ML system. Table 1 summarizes the views and the model responsible for each view. The views are selected based on the identification part of ISO/IEC Guide 51, which specifies analysis steps for the safety aspects of standards (Safety aspects - Guidelines for their inclusion in standards, 2014). ISO/IEC Guide 51 covers high-level use cases, functionalities, and failure analysis of the system itself. It was selected with the goal of aligning the framework with safety aspects by following the required iterative process for risk assessment. This allows M3S to be utilized for safety-critical ML systems, where a highly reliable analysis and argumentation of the development of the ML model and the overall system are required. We transformed the steps into aspects that must be covered and subsequently assigned a model to each step. Figure 3 summarizes the transformation and model assignment.

Table 1 Views of M3S
Fig. 3 Mapping of models into views and ISO/IEC Guide 51

Following the steps of ISO/IEC Guide 51, the modeling process becomes a systematic sequence of modeling activities. Back-and-forth adjustments between models may be necessary to correct and align the information between them. The decisions captured in the models are then implemented in a model training pipeline. In addition, a V-shaped process can be realized by connecting a validation process to each step of the modeling activities (Fig. 4). By utilizing a V-shaped process, the correctness of the decisions made during the analysis can be evaluated and traced for improvement and revision during the development process. Ultimately, a monitoring phase that detects performance degradation of the ML model caused by drift can be traced back to the main goal of the ML system.

Fig. 4 Modeling Process of M3S

The details of the process are as follows:

  1. Begin with the "Value" view. This view involves developing an AI Project Canvas to capture the project-level requirements (Thiée, 2021). This step defines the top business-level requirements, which include value propositions, potential users and other stakeholders, related aspects of the systems, and financial aspects of the project.

  2. Identify the common aspects of the ML task in the "ML Task" view using ML Canvas (Dorard, 2015). ML Canvas incorporates the information of AI Project Canvas (e.g., the value proposition, output, and data) into the requirements necessary for building an ML pipeline. ML Canvas defines the data collection and processing activities, the desired capability of the ML model, and the need for continuous monitoring.

  3. Develop the "Architecture" view using an architecture diagram made in SysML. This view overviews the workflow to integrate the ML models and traditional software components. Communications between the components, including sensors, controllers, and user interfaces, are defined here. The information on integration in AI Project Canvas acts as the baseline for developing this model.

  4. Model the "Goal" view using the KAOS Goal Model. This view details the ML task and the expected performance defined in ML Canvas. It also assigns the expected performance to each ML component described in the architecture diagram (Matulevičius & Heymans, 2007). The model decomposes the task and required performance into more detailed specifications. In the leaf nodes of the KAOS diagram, the details of the desired performance must be defined in a measurable form; a formal specification of the ML performance can also be used to give an unambiguous specification (Letier, 2001). A sketch of one possible machine-checkable form of such a leaf goal appears after this list.

  5. Employ STAMP/STPA as the model for the "Safety" view in M3S (Leveson, 2012). The model uses the architecture specified in the architecture diagram to model interactions and identify potential communication failures that may lead to accidents. Root cause analysis of each failure is followed by the definition of the countermeasures that need to be implemented to ensure the safety of the overall system.

  6. Assign the safety case as the model of the "Argumentation" view, which captures the solutions implemented in the ML pipeline and related components. The model specifies the goal of each implemented solution. The solutions cover data engineering, the training approach, safeguarding, and other aspects necessary for the argumentation. The ML pipeline implements the solutions and their configurations as reflected in the safety case.

  7. Execute the workflow of the training pipeline according to the solutions described in the "Argumentation" view. Continuous synchronization keeps the decisions to utilize the solutions consistent with their implementation. The result of the training pipeline then goes through several steps of testing.

  8. Execute unit tests in the form of ML performance metrics to signal whether the minimum performances specified in the "Goal" view are too optimistic or too pessimistic. This step shows the achievement of the ML performance requirements and decides whether a change to the analysis models is necessary. New solutions in the "Argumentation" view or an evolution of the ML performance requirements in the "Goal" view may be needed.

  9. Execute an integration test to validate whether the architecture designed in the "Architecture" view fulfills its purpose. As in the previous step, the architectural decisions in the "Architecture" view may need to be updated if the integration fails or does not demonstrate the desired quality.

  10. Implement continuous monitoring to detect possible drift and wrong specifications of the "ML Task" view. This step is important because the uncertainty of the ML model cannot be removed from the system. Detection through the Monitoring element described in ML Canvas notifies developers when drifts occur.

  11. Employ value monitoring to evaluate whether the values proposed in the "Value" view are achieved. Business judgments need to be monitored closely for misalignment, as a change in the environment may invalidate them and require a shift in the value proposition.
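
As an illustration of step 4, the desired performance at a leaf goal can be written in a machine-checkable form. The exact syntax accepted by the M3S plugin is not specified here; the following Python sketch shows one possible parsable description and check, where the requirement format, the parse_requirement helper, and the metric names are illustrative assumptions rather than the tool's actual interface.

```python
import re

# Hypothetical parsable leaf-goal description, e.g. attached to a KAOS leaf node.
# Assumed format: "<metric>(<class>) <op> <threshold>".
REQUIREMENT = 'recall(Speed Limit 60) >= 0.95'

def parse_requirement(text):
    """Split a requirement string into (metric, target class, operator, threshold)."""
    m = re.match(r'(\w+)\((.+?)\)\s*(>=|<=)\s*([\d.]+)', text)
    if not m:
        raise ValueError(f'Unparsable requirement: {text}')
    metric, target, op, threshold = m.groups()
    return metric, target, op, float(threshold)

def is_satisfied(requirement, test_results):
    """Check one requirement against per-class metrics fetched from ML testing."""
    metric, target, op, threshold = parse_requirement(requirement)
    value = test_results[metric][target]
    return value >= threshold if op == '>=' else value <= threshold

# Example per-class test result for one ML model version (values are made up).
results = {'recall': {'Speed Limit 60': 0.93}}
print(is_satisfied(REQUIREMENT, results))  # False -> the leaf goal is marked unfulfilled
```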

4.2 Integrated metamodel

To achieve consistency between the different views, we developed an integrated metamodel, which summarizes the relationships between the elements inside each view, using a metamodeling process. The metamodeling process focuses on identifying similar elements between different models and connecting the parts that lead to a comprehensive connection between all views. The integrated metamodel of M3S covers not only the models but also their communication with the ML pipeline. To achieve this, the integrated metamodel includes common concepts of the ML pipeline and the experiment management tool. Finally, the general concept of an "ML Solution" is added to describe the implementation of the solutions detailed in the safety case. The integrated metamodel is constructed and evaluated iteratively. The process begins with a metamodel for each utilized model. Then, the connections between all element pairs are evaluated to determine the connection type.

Connections are classified into four categories: same, similar, aggregation, and contribution (El Hamlaoui et al., 2018). "Same" connections denote exact similarity between two elements, whose descriptions mirror each other. "Similar" connections represent a generalization and specialization between two elements, where one element gives a higher-level explanation while the other provides a more specific description. "Aggregation" connections show one element as a subset of another. Most connections are classified as "contribution", which means the two elements are interdependent. To ensure the correctness of the metamodel, an iterative evaluation and correction process is implemented.
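
To make the taxonomy concrete, the sketch below represents cross-view connections as plain data. It is only an illustration of the four connection categories, not the representation used inside the M3S tool; the element names are drawn from the connection examples discussed below (Table 2) and should be read as assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class ConnectionType(Enum):
    SAME = "same"                  # descriptions of both elements must match
    SIMILAR = "similar"            # generalization / specialization
    AGGREGATION = "aggregation"    # one element is a subset of the other
    CONTRIBUTION = "contribution"  # the elements are interdependent

@dataclass
class Connection:
    source: str  # "<view>::<element>", e.g. "Value::Value Proposition"
    target: str
    kind: ConnectionType

# Illustrative connections mirroring the examples in Table 2 and Fig. 5.
connections = [
    Connection("Value::Value Proposition", "ML Task::Value Proposition", ConnectionType.SAME),
    Connection("Value::Output", "ML Task::Impact Simulation", ConnectionType.SIMILAR),
    Connection("ML Pipeline::Dataset", "ML Pipeline::Label", ConnectionType.AGGREGATION),
    Connection("ML Pipeline::ML Performance", "Goal::ML Requirement", ConnectionType.CONTRIBUTION),
]

for c in connections:
    print(f"{c.source} --{c.kind.value}--> {c.target}")
```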

Table 2 shows an example of connecting the elements between two views and their connection types. The value proposition of both AI Project Canvas for the "Value" view and ML Canvas for the "ML Task" view is the same, meaning that the description from one side should match the other. Output and Integration of the "Value" view show an example of a similar connection, where the Impact Simulation of the "ML Task" view and Component of the "Architecture" view should use these elements as a basis, respectively. The safety goal of the "Argumentation" view should have a more specialized description than the goals of the "Goal" view as part of their aggregation. For the contribution, the ML performance generated from the ML testing activities should contribute to the achievement of the ML requirements, which is a specialized type of goal in the "Goal" view.

Figure 5 depicts the entire integrated metamodel of M3S, including the examples shown in Table 2. It should be noted that the class diagram notation is used to model our integrated metamodel. The "same" connections are modeled as a single node of an element, such as the "Value Proposition" elements of the AI Project Canvas for the "Value" view and ML Canvas for the "ML Task" view. Aggregations indicate "aggregation" connections, for example, between the "Dataset" and "Label" elements of the ML Pipeline. Generalizations indicate "similar" connections, as shown between the "Goal" element and "Safety Goal". Associations indicate "contribution" connections, including the connection between the "Solution" element of the Safety Case for the "Argumentation" view and the "Countermeasure" of STAMP/STPA for the "Safety" view. A colored box symbolizes that an element may contribute to one or more views.

Table 2 Examples of connecting elements in the integrated metamodel
Fig. 5 M3S integrated metamodel

4.3 Extensibility

M3S was designed to be flexible. Figure 6 summarizes the modification process for the models in M3S. The process of extending the M3S framework is based on the evolutionary thinking approach of ISO/IEC/IEEE 14764:2022 (ISO/IEC/IEEE International Standard - Software engineering - Software life cycle processes - Maintenance, 2022). This extension process allows the adoption of M3S to be lightweight or more extensive, depending on case-specific needs.

Fig. 6 M3S extension process (Husen et al., 2023)

The first step in the extension process is to understand the analysis requirements of the ML-system development project. The analysis requirements include the ML concerns relevant to both the processes and the products. Information about existing multi-view model-based processes (e.g., the standard goal-oriented multi-view modeling process of M3S) serves as the baseline for extension into a more fitting process. Model responsibility mapping assigns the analysis models responsible for each aspect described in the analysis requirements. This step identifies any lack or excess of model utilization in covering all necessary views.

Each addition or reduction of the models creates a modification proposal. Each proposal is either accepted or rejected based on the potential impact of the modification. If a proposal is accepted, the integrated model is updated. Finally, continuous monitoring looks for potential concept drifts during the development and system operation. If a major concept drift is detected, another iteration of the extension is conducted to add or remove views. Iterative extensions based on an improved understanding gained from the development and operation should enhance the fitness of the list of utilized models over time.

4.4 Integrated modeling tool

We developed a prototype of an integrated ML system modeling environment to support the modeling and ML training process. The environment is implemented as a plugin for Astah* System Safety because this allows the reuse of existing Astah* System Safety features, such as SysML for the architecture diagram, Goal Structuring Notation (GSN) for the KAOS goal model and safety case, and STAMP/STPA modeling. Additionally, the hyperlink feature of Astah* System Safety helps connect two different views.

Figure 7 summarizes the extension of Astah* System Safety to support M3S. We created plugins that add ML Canvas and AI Project Canvas to Astah* System Safety to facilitate all the views. We also extended the SysML requirement object into canvas elements so that the elements of ML Canvas and AI Project Canvas can utilize the functionality offered by Astah* System Safety. Figure 8 shows examples of how the canvases work inside the Astah* System Safety environment. The modeling environment is integrated with the Data Version Control (DVC) experiment management tool via a set of plugins. The plugins work as a communication bridge to retrieve data from and send data to the other side. This integration is developed following the M3S metamodel.
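
The exact plugin-to-DVC protocol is not described here. As a rough illustration of the bridge's outward and inward directions, the following Python sketch triggers a pipeline run through the DVC command-line interface and reads the metrics back from a metrics file; the metrics file name and its contents are assumptions for illustration, not the actual project layout.

```python
import json
import subprocess

def run_training():
    """Outward direction: reproduce the DVC training pipeline.

    Assumes a dvc.yaml pipeline whose training stage writes its evaluation
    results to metrics.json (an illustrative file name).
    """
    subprocess.run(["dvc", "repro"], check=True)

def fetch_metrics(metrics_file="metrics.json"):
    """Inward direction: read the per-class metrics produced by the pipeline."""
    with open(metrics_file) as f:
        return json.load(f)

if __name__ == "__main__":
    run_training()
    metrics = fetch_metrics()
    # e.g. {"accuracy": 0.97, "recall": {"Speed Limit 60": 0.93, ...}}
    print(metrics)
```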

Fig. 7 Architecture of the Astah* System Safety and DVC integration

Fig. 8 Implementation of AI Project Canvas on Astah* System Safety

For inward communication, we developed the ML performance monitoring plugin. It supports the definition of ML performance requirements at the leaf nodes of the KAOS goal model and fetches the testing results to trace the impact of a version of the ML model on satisfying them. As a proof of concept, we monitored the accuracy, precision, recall, and misclassification rate of a classification model. Figure 9 shows the performance requirement setting, ML performance data fetching, and impact tracing features. The performance requirement setting allows the minimum ML performance to be specified. This minimum can also be set through goals written in parsable descriptions. A pop-up window summarizes all the ML performance requirements and provides a list of the ML model versions to be evaluated. Finally, the plugin evaluates the achievement of each requirement and traces any failure across all the models.

Fig. 9 Performance monitoring feature

For outward communication, we developed the DNN repair plugin. This plugin facilitates parameter configuration for the repair process and its execution from the modeling environment. The configuration is specified in the solution element of the "Argumentation" view and recorded in the safety case for argumentation purposes. Figure 10 overviews the flow of the configuration setting and repair execution. The solution is defined through a parsable description or GUI support. Another window acts as the execution point to select the version to be repaired and the name of the resulting version. The ML pipeline plugin provides a simple function for executing training of the ML model using several hyperparameters as inputs. A message containing the value of each hyperparameter is then sent as the trigger for a new execution of the ML training process. The result of the training process consists of the trained ML model, its metadata, and the validation result. The data is stored in the version control system for future access.
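
The format of the message exchanged between the plugin and the pipeline is not given above. The following is a small sketch of what such a trigger message could look like, where all field names (base model version, hyperparameters, per-class priority weights, resulting version name) are illustrative assumptions tied to the DNN repair example rather than the plugin's actual schema.

```python
import json

# Hypothetical repair-execution message assembled from the solution element
# of the "Argumentation" view; every field name here is illustrative only.
repair_request = {
    "solution": "dnn_repair",
    "base_model_version": "A",
    "result_version_name": "A-repaired-balanced",
    "hyperparameters": {
        "epochs": 10,
        "learning_rate": 1e-4,
    },
    # Balanced pattern: the prioritized classes share the same weight.
    "class_weights": {
        "Speed Limit 60": 1.0,
        "Speed Limit 100": 1.0,
    },
}

# The serialized message is sent to the pipeline side as the trigger for a new
# execution of the repair process; the result is versioned on the DVC side.
payload = json.dumps(repair_request, indent=2)
print(payload)
```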

Fig. 10 DNN repair feature

5 Case study

This case study aimed to evaluate the capability of M3S to facilitate comprehensive analysis. We chose a case study as the evaluation method for two reasons. First, it can evaluate the modeling side of M3S from the first step, "Value" view development, to the eighth step, the unit test of the trained ML models. Second, a case study is suitable given the extensive time that would be needed to evaluate the overall process with external participants.

5.1 Utilized case

Table 3 summarizes the case study, which is based on object classification ML models for an autonomous driving vehicle (ADV). Here, the scope of the classification task is limited to traffic sign classification. The inputs for the ML model are color images from an embedded camera system that serves as the sensor of the overall ADV system. The classification result is sent to the decision-making ML model as an input for the car's control system.

Table 3 Overview of the case study

The ADV is required to work at level three of vehicle autonomy, in which the autonomous system returns driving responsibility to the driver when the ADV is operated outside its preferred domain. To train and test the model, we used the publicly available German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2012, 2011). The GTSRB dataset consists of images of German traffic signs that fit the case study. Figure 11 shows samples for each of the 43 traffic sign classes.
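
For readers who want to reproduce the data side, GTSRB is available through common ML libraries. The snippet below is a minimal loading sketch, assuming a recent torchvision release that ships a GTSRB dataset class; the image size and storage root are arbitrary choices, not the configuration used in this case study.

```python
import torchvision
from torchvision import transforms

# GTSRB images vary in size, so resize them to a fixed shape before batching.
preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.GTSRB(root="data", split="train",
                                       transform=preprocess, download=True)
test_set = torchvision.datasets.GTSRB(root="data", split="test",
                                      transform=preprocess, download=True)

print(len(train_set), len(test_set))  # 43 traffic sign classes in both splits
```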

Fig. 11 Sample of images in the GTSRB dataset (Hosseinzadeh Kassani & Teoh, 2016)

The case study considers two operation domains (Fig. 12). The first one is the highway. Because this domain is free of pedestrians and bikes, traffic signs indicating such objects are non-existent. The second one is suburban roads, where pedestrians and bikes are more prevalent. The highway domain is prioritized from an economic standpoint, whereas the suburban road domain is preferable from the user standpoint. The JAMA framework (Japan Automobile Manufacturers Association, 2021) and Aurora's safety case framework for ADVs serve as the basis to ensure that the case study reflects the real world as much as possible. To fit the ML model to both domains, DNN repair may be utilized to improve the performance of the important classes. However, if no trained ML model version satisfies both domains, the highway domain is prioritized. The configuration of the repair process must reflect such concerns.

Fig. 12 Illustration of the difference between highway and urban road domain

5.2 Research questions

This case study aims to answer the following research questions (RQs):

  • RQ1. Does the integrated metamodel ensure consistency in the multi-view modeling process of M3S? RQ1 assesses the benefits and necessity of utilizing an integrated metamodel to facilitate the M3S process. This question should validate the integrated metamodel, which serves as the guideline for the M3S modeling process.

  • RQ2. Does the integrated modeling tool facilitate validating higher-level goals compared to existing ML performances? RQ2 evaluates the capability of the tool-supported M3S process to maintain and utilize backward traceability between the ML test result and the specified ML performance and other related requirements.

  • RQ3. Does the integrated modeling tool facilitate rationalizing ML-specific solutions and their impact? RQ3 examines the capability of the tool-supported M3S process to maintain traceability between ML-specific solutions, including the configuration and implementation of such solutions in the ML pipeline.

The process and results of the case study address RQ1. Evaluating the impact analysis function of the integrated tool answers RQ2. Finally, the evaluation of the solution integration function of the DNN repair answers RQ3.

5.3 Results

The case study followed the process steps described in Subsection 4.1. The numbering in this subsection reflects the numbering of the steps of the M3S process.

  1. We initially modeled the "Value" view using the AI Project Canvas based on the information from the case study. The result of the modeling can be seen at the top of Fig. 13.

  2. Figure 13 shows the result of the AI Project Canvas for the "Value" view and the ML Canvas for the "ML Task" view, along with the related metamodel part that guides the derivation of information from the AI Project Canvas to the ML Canvas. First, the value proposition developed in the "Value" view was copied directly into the "ML Task" view. Then the other elements in the "ML Task" view were derived based on the value proposition. For the elements where a specific connection was described in the metamodel (e.g., the outputs of the "Value" view and the impact simulations of the "ML Task" view), more detailed information was derived. At this point, the information necessary for training the initial versions of the ML models was complete. Various versions of ML models using different hyperparameter configurations could be trained using the information on the dataset and ML task in ML Canvas on the DVC side. In parallel, modeling continued for the "Architecture" and "Goal" views.

  3. The integration part of the AI Project Canvas dictated the development of an architecture diagram for the "Architecture" view (Fig. 14). Each piece of information in the integration element of the AI Project Canvas was translated into system requirements before being divided into specialized components inside the architecture diagram. The connections between the components were subsequently analyzed and specified to complete the architecture diagram.

  4. The value proposition and impact simulation defined in the "ML Task" view served as the basis for the higher-level goals in the KAOS goal model of the "Goal" view (Fig. 15). These goals were then further decomposed to obtain goals achievable by a single performance metric. With the required ML performance defined in a semi-formal way, the M3S modeling tool parsed the information and used it to check whether the requirements were satisfied. Figure 16 compares the ability of three different ML models to satisfy the requirements. The color-coded elements show that version A clearly outperformed the other two. However, no version completely satisfied all the requirements. Thus, we continued developing the "Safety" view with a clear understanding of the limitations of the available ML models.

  5. STAMP/STPA inside the "Safety" view indicated how hazards from the user perspective connect to the limitations of the ML model. The architecture diagram was translated into the control structure diagram of STAMP/STPA (Fig. 17). Unsafe control actions between the ML model and the other components of the ML system were analyzed. Countermeasures were then defined for each hazard causal factor of each unsafe control action, based on the limitations of the existing ML models.

  6. Figure 18 shows how the top goal of the goal model is connected to the top-level safety constraint and the countermeasures from STAMP/STPA in the safety case of the "Argumentation" view. For the solution that utilized DNN repair, we implemented repair strategies and patterns to improve the best-performing version of the ML model, which was version A. Figure 19 summarizes the patterns utilized and the difference between the repair results. The first pattern was a balanced approach, where both classes were treated equally with the same priority weighting. The other one prioritized fixing the worse-performing class.

  7. DNN repair processes were executed for both patterns using the configurations specified in the "Argumentation" view. The repair resulted in two new versions of the ML model, improved from version A, which was selected as the base model for the repair. All new versions of the ML model were stored on the DVC side.

  8. Although the execution of the repair improved the performance in both patterns, neither achieved the required ML performance. Further evaluation of the misclassified images showed that the test data quality might not be suitable for real-life situations. The test data in Fig. 20 was too extreme for the target operational domain, considering that the development goal only required level 3 self-driving capability. As such, further manipulation of the test data, such as the exclusion of extremely low-quality images (one hypothetical filter is sketched after this list), may be necessary to measure the capability of the ML model properly. Moreover, the quality of the sensors also came into question. The ability of the cameras to provide quality images is integral to ensuring that the ML model is not exposed to extreme cases. However, both solutions would increase development costs. Figure 21 shows how these solutions are reflected in the "Argumentation" view and how their associated costs appear in the AI Project Canvas of the "Value" view, following the part of the metamodel that specifies these interconnected changes. Hereafter, we adopted the balanced repair version of the ML model for further integration. However, the need for better-quality camera sensors is noted as a future improvement.
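
The case study does not define a concrete criterion for "extremely low-quality images". The sketch below is one purely hypothetical filter for step 8, dropping test images whose mean brightness or contrast falls below arbitrary thresholds; the thresholds and file names are assumptions, not values used in the study.

```python
import numpy as np
from PIL import Image

def is_acceptable(path, min_brightness=40.0, min_contrast=15.0):
    """Hypothetical quality filter: reject very dark or very flat (low-contrast) images."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    return img.mean() >= min_brightness and img.std() >= min_contrast

# Example: keep only acceptable images from a list of GTSRB test image paths.
test_images = ["00000.ppm", "00001.ppm"]  # illustrative paths
filtered = [p for p in test_images if is_acceptable(p)]
print(f"kept {len(filtered)} of {len(test_images)} images")
```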

Fig. 13 AI Project Canvas (top) and ML Canvas (bottom) with associated metamodel elements (middle)

Fig. 14 Derivation of AI Project Canvas’ Integration (top) into the Architectural Diagram’s components (bottom)

Fig. 15 ML Canvas (top) and part of the KAOS Goal Model (bottom) and its associated metamodel elements (middle)

Fig. 16 Comparison of the results from different ML model versions. Red means the performance requirement is not fulfilled

Fig. 17 Derivation of Architectural Diagram’s components (top) into STAMP/STPA’s Entities (bottom)

Fig. 18 Development of a safety case (top) from KAOS Goal Model’s Top Goal (snippet) and STAMP/STPA’s Countermeasures (bottom)

Fig. 19 Repair strategy patterns utilized in the case study

Fig. 20 Misclassified test data for each version of the ML model

Fig. 21 Inclusion of newly found solutions in the safety case (top) and the associated update on AI Project Canvas (bottom) based on the metamodel (middle)

5.4 Answers to research questions

Here, the research questions are answered. Each subsection is dedicated to one research question.

5.4.1 RQ1. Does the integrated metamodel ensure consistency in the multi-view modeling process of M3S?

The results highlight how the metamodel guides the development of new elements. The metamodel guides the consistency from business-level decisions down to the ML training aspects (Fig. 13). The information in the AI Project Canvas then serves as the basis for constructing the architecture diagram (Fig. 14), while the abstract description in the ML Canvas is decomposed into achievable ML performance requirements (Fig. 15). Finally, the architectural decisions form the foundation of the STAMP/STPA control structure (Fig. 17).

The metamodel also guides higher-level decisions to update the models during the feedback loop. The costs in the AI Project Canvas should be updated for each newly proposed decision (Fig. 21) because the metamodel must work not only during the initial development but also in later stages when decisions may need to be changed.

This case study demonstrates the capability of the metamodel. Hence, the integrated metamodel of M3S can ensure consistency in the multi-view modeling process.

The integrated metamodel of M3S ensures the consistency of the multi-view modeling process. Elements of different models can be traced and connected using the integrated metamodel, not only during the initial development but also as the analysis models evolve and new solutions update the elements of other views.

5.4.2 RQ2. Does the integrated modeling tool facilitate validating higher-level goals compared to existing ML performances?

The integrated modeling tool can be configured to monitor specific ML metrics through the goals of the KAOS goal model (Fig. 15). The color coding in Fig. 16 demonstrates the tool’s capability to communicate with the ML pipeline to fetch the test result and mark the achievement of the configured goal nodes and the associated elements of other models. This format visualizes the impact of the ML model on the achievement of the higher-level goals. The integrated modeling tool utilizes visualization techniques to support the traceability between models. Consequently, the integrated modeling tool can validate the achievement of higher-level goals from existing ML performances.

The integrated modeling tool of M3S successfully validates the achievement of higher-level goals using the traceability between the business and system-level goals and the lower-level ML performance goals. The fetched ML performance from the DVC server can be automatically traced to other views, such as the "Value" and "Architecture" views.

5.4.3 RQ3. Does the integrated modeling tool facilitate rationalizing ML-specific solutions and their impact?

Figure 18 demonstrates how the information of ML-specific solutions is captured inside the safety case of M3S. The solutions span from the data layer to ML training and the architectural decisions made. The metamodel supported the addition of solutions in other aspects. Moreover, the integrated DNN repair tool, which is used as an implementation example of an integrated solution, shows promising results. Solutions can be captured inside the analysis model and subsequently executed in a single flow, ensuring synchronization of decision-making and solution execution during the iterative addition of solutions. The impact of a solution on the ML performance and the higher-level goals can also be traced using the performance checking function (Fig. 16).

It should be noted that these findings are based on a single integrated solution. More implementations of integrated solutions, especially for different aspects such as data manipulation, are necessary to understand the full capabilities and limitations of integrating solutions into both the analysis and the model training, because such solutions may behave differently from DNN repair, which works directly on the ML model.

The integrated solutions of M3S allow decision-making and the execution of solutions to remain consistent during the iterative and experimental development process. However, this finding is limited to the context of DNN repair. Further implementation and evaluation of different types of solutions, such as data manipulation, are necessary to understand the benefits and limitations of integrated ML solutions.

5.5 Threats to validity

The main threat to our case study is the validity of the case itself. An unrealistic case would cast doubt on whether the results of the case study reflect a real situation. We implemented two strategies to ensure that the case study is suitable for evaluating M3S. First, the case study is based on reliable documents. We followed the JAMA framework for the required capabilities and possible failures. We also followed Aurora's safety case framework for ADVs to design the overall system. Second, we solicited input from industry practitioners. We continuously reviewed the case study and its results with industrial experts to verify the quality of the case and the analysis results.

6 Controlled experiment

The controlled experiment focused on evaluating the usability of M3S to execute the integrated pipeline. The experiment consisted of two parts. The first one evaluated the capability of M3S to facilitate impact analysis of the integrated pipeline. The second assessed the capability of M3S to execute an integrated solution, which in this experiment is represented in the form of a DNN repair.

6.1 Research questions

We aimed to answer the following research questions (RQs):

  • RQ4. Can M3S efficiently facilitate the impact analysis from ML model performance? RQ4 evaluates the time needed for M3S to complete the impact analysis task on ML performance testing. A lower time compared to the control group indicates a better efficiency compared to ad-hoc approaches.

  • RQ5. Can M3S efficiently facilitate the analysis of parameters for repair activities? RQ5 evaluates the time needed for M3S to fully configure the integrated solutions. A lower time compared to the control group indicates a better efficiency compared to ad-hoc approaches.

  • RQ6. Does M3S help users train a better ML model through integrated solutions? RQ6 evaluates the capability of M3S to help developers effectively incorporate solutions to train better ML models.

  • RQ7. How confident are users about the impact analysis result of M3S? RQ7 evaluates the users’ acceptance toward the result generated by M3S in analyzing the impact of the result of ML performance testing.

  • RQ8. How confident are users about the usability of the impact analysis of M3S? RQ8 evaluates users’ acceptance toward the support provided by the support tool for M3S in analyzing the impact of ML performance testing.

  • RQ9. How confident are users about the result of solution integration in M3S? RQ9 evaluates users’ acceptance toward the result generated by M3S in the configuration and execution of the integrated solution.

  • RQ10. How confident are users about the usability of the solution integration in M3S? RQ10 evaluates users’ acceptance toward the support provided by the support tool for M3S for the configuration and execution of the integrated solution.

The time spent by the participants finishing their tasks answers RQ4 and RQ5. The performance of the ML models repaired by the participants answers RQ6. RQ7, RQ8, RQ9, and RQ10 are addressed based on the participants' responses to the post-experiment questionnaire.

6.2 Experiment design

The experiment is designed to answer the research questions. Here, the design of the flow, participants, and data collection method for the experiment are detailed.

6.2.1 Experiment flow

Our experiment consists of two parts. The first part evaluates the effectiveness and efficiency of the M3S modeling-training pipeline integration for performance monitoring and tracing. The second part evaluates the effectiveness and efficiency of M3S modeling-DNN repair pipeline integration as a sample of modeling-solution integration.

The final goal for the participants is to identify the version of the ML model that best meets the requirements. The model can be an existing or a repaired model. If no version satisfies all the ML requirements, the participants must indicate the changes necessary for immediate deployment of the most suitable model. Additionally, the participants need to finish the tasks given by the proctors to achieve these goals within the time available for each part.

We separated the participants into two groups: the framework group and the control group. The framework group used M3S to achieve their goal, while the control group used an ad-hoc approach. The experiment employed a natural language specification as a control for comparison with the M3S multi-view models. To ensure similarity, the natural language specification was translated from the M3S models already specified for the framework group. For the integrated solution, the control group used the standard command-line interface (CLI) execution of the pipeline.

Figure 22 outlines the ML performance evaluation part of the experiment. Both groups started the experiment with a general briefing, which consisted of an introduction of the group members and an explanation of the goals and tasks. Then the groups were physically separated to work with their own approaches. Both groups started by understanding the requirements provided to their respective group (i.e., the multi-view models for the framework group or the natural language specification for the control group).

Fig. 22 Experiment flow for ML performance monitoring part

The next step was to execute a tool to work with the ML model. The framework group began by configuring the required performance for the leaf goals they wanted to monitor and then fetched the ML model performance from the pipeline. In contrast, the control group executed the pipeline using command lines in the CLI to determine the performance of each version of the ML model. Finally, each group decided which version of the ML model was the most suitable to satisfy the existing requirements, and if no existing version satisfied all requirements, they indicated which requirements were not satisfied.

Figure 23 overviews the second part of the experiment. It began with an explanation of how DNN repair works and which parameters need to be configured for the repair to work. Both groups were then tasked with repairing the version of the ML model they found most suitable in the first part of the experiment to fulfill as many requirements as possible. The framework group worked with the safety case model to specify the configuration, while the control group worked directly with the configuration file to set up the repair process. Both groups then executed the repair using their approach and evaluated the success of the repair process in a similar manner. Finally, they decided whether to use the repaired or the original version based on its suitability to satisfy the requirements. If they chose the repaired version and it did not satisfy all requirements, they had to mark which requirements were not satisfied.

Fig. 23 Experiment flow for the integrated DNN repair pipeline part

Afterward, the participants completed a post-experiment questionnaire. The experiment concluded with a short discussion session between both groups. During the discussion, a moderator captured the subjective opinions of the participants about what worked well and what could be improved.

6.2.2 Participants

Our experiment attracted thirteen participants from industry, academia, and graduate school with varying roles and experience levels. Table 4 summarizes the participants. For the practitioners and academics, we collected their experience based on how long they have worked in their role, whereas for the students we collected their year in graduate school. The personal identity of each participant was obscured to protect their privacy. The participants were divided into four groups: the practitioner control group, the practitioner framework group, the student control group, and the student framework group.

Table 4 Summary of the participants. The groups include practitioner control (C-P), practitioner framework (FW-P), student control (C-S), and student framework (FW-S)

We split the participants into practitioners and students before assigning them to the framework and control groups for two reasons. The first one was to isolate the experience levels to detect differences between the perspectives based on experience. The second one was for flexibility as students had more time available to test the tools. Although the experience and backgrounds were balanced between the framework and control groups, the participants were randomly assigned to a group. For example, two participants had 30 years of experience; one was assigned to each practitioner group, but it was random which one was in each group. Similarly, two students had limited industrial experience and were assigned to different student groups.

6.2.3 Questionnaire

A post-experiment questionnaire was employed to capture the subjective opinions of the participants in both groups. The questionnaire employed a Likert scale to capture the participants' impressions (Likert, 1932). Each question was designed to answer RQ7, RQ8, RQ9, or RQ10. A four-point Likert scale was used to reduce bias from selecting a neutral option. Table 5 shows the questions of the post-experiment questionnaire and their corresponding RQs.

Table 5 Post-experiment four-scale Likert questionnaire

The results of the questionnaire from the control and framework groups were compared, and the difference in the average between the groups was used to answer the related RQs. The analysis used a weighted calculation to separate the extreme options of 'Highly Disagree' and 'Highly Agree' from the 'Disagree' and 'Agree' options.
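
The paper does not state the exact weights used. As a purely hypothetical illustration of such a weighted calculation, the snippet below scores each response on a scale in which the extreme options count more than the moderate ones; the weight values and the sample responses are assumptions.

```python
# Hypothetical weights that separate the extreme options from the moderate ones.
WEIGHTS = {"Highly Disagree": -2, "Disagree": -1, "Agree": 1, "Highly Agree": 2}

def group_score(responses):
    """Average weighted score of one group's answers to a single question."""
    return sum(WEIGHTS[r] for r in responses) / len(responses)

framework = ["Agree", "Highly Agree", "Agree"]      # illustrative responses
control = ["Agree", "Disagree", "Highly Disagree"]  # illustrative responses
print(group_score(framework) - group_score(control))  # per-question group difference
```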

6.3 Results

6.3.1 Time for completion

Figure 24 summarizes the time required for each group to finish the ML performance monitoring part. The student framework group performed three iterations of monitoring during the experiment, with each iteration fulfilling the completion criteria. To evaluate the time properly, we divided their time into three parts to reflect each iteration. We also separated the time for discussion and tool operation. The discussion consisted of conversations about the tool, case, and solution. However, further classification was difficult since the topics were often mixed.

Fig. 24 Time needed for each group to finish the ML performance monitoring task

The framework groups tended to use the tool operation time more efficiently than the control groups. The practitioner and student control groups required more than 10 minutes and 13 minutes, respectively, whereas the practitioner and student framework groups each required about 3 minutes. Even when considering the multiple iterations executed by the student framework group, the time remained consistent. In contrast, the discussion time was longer in the framework groups when the results were compared by experience level (practitioners or students). For a given experience level, the discussion time of the framework group was almost twice that of the control group.

Figure 25 summarizes the time required for each group to finish the DNN repair part. Similar to the ML performance monitoring part, the student framework group completed three iterations of the task. The practitioner framework group completed two iterations. Following the same approach as the ML performance monitoring part, we divided their time to reflect the iterations. We also separated the tool operation and discussion time.

Fig. 25 Time needed for each group to finish the DNN repair task

The control group for a given experience level spent more time on tool operation. The student groups showed a significant difference; the discussion time of the control group was almost twice that of the framework group. For the practitioners, the control group took slightly longer than the framework group. The second iteration from the practitioner framework group was less than half the time for the single iteration of the practitioner control group. However, the learning curve effect must be considered in this comparison. The discussion time for the control groups from both sections was significantly higher than that for the framework groups. Overall, regardless of their experience level, the control groups took more time than the framework groups to finish the DNN repair part.

6.3.2 Repaired ML model performances

Table 6 summarizes the performances of the repaired ML models by group, along with the performance expectations. The aim of the experiment was for the ML models to satisfy all desired values. The ML models from both framework groups satisfied all desired values. In contrast, the control groups failed to satisfy the desired misclassification rate from the label “Speed Limit 60” to “Speed Limit 80” but satisfied the remaining desired values.

Table 6 Summary of the performance of the ML models trained by each group. Groups include practitioner control (C-P), practitioner framework (FW-P), student control (C-S), and student framework (FW-S). Underlined values indicate failure to satisfy the desired value

Except for the misclassification rates, both control groups produced ML models with better performance than their respective framework groups. In both the practitioner and student groups, the misclassification rate from “Speed Limit 60” to “Speed Limit 80” of the framework group’s ML model was half that of the control group’s ML model. A similar reduction was observed for the student groups in the case of “Speed Limit 100” to “Speed Limit 120”, but the difference between the practitioner groups was much smaller. To interpret these findings, the characteristics of the misclassification rate as a performance metric must be considered.

6.3.3 Questionnaire result

Here, the results are visualized using diverging bar charts (Heiberger & Robbins, 2014) because differences in perception are easy to identify in this form. Our analysis focused on the positivity or negativity of the sentiment of each group.

Figure 26 summarizes the answers to Q1. The framework group had a more positive sentiment (72.3%) than the control group (62.5%). This difference of 11.2% suggests that the control group felt their tracing results were more prone to mistakes.

Fig. 26 Summary of the answers to Q1

Figure 27 summarizes the answers to Q2, Q3, and Q4. As the tracing comprehensiveness increased, the sentiment of the control groups decreased from 85.7% positive in Q2 to 50% positive in Q3 and 12.5% positive in Q4. In contrast, the framework groups showed 100% positivity for Q2 and Q3, with only a slight drop to 88.8% positivity in Q4.

Fig. 27 Summary of the answers to Q2, Q3, and Q4

The answers to Q5 suggest that the M3S implementation of the DNN repair solution has major weaknesses (Fig. 28). The control group indicated 100% positivity, whereas the framework group showed only 62.5% positivity. The significant gap between the groups implies that something was not working well for the framework groups and needs to be fixed.

Fig. 28 Summary of the answers to Q5

Figure 29 shows the answers to Q6 and Q7. The 1.5% difference in positivity between the groups for Q6 suggests a slight disadvantage of M3S in terms of the efficiency of deciding the inputs of the DNN repair tools. However, the answers to Q7 show that M3S has a significant advantage for evaluating the output, as the framework group showed 38.8% higher positivity than the control group.

Fig. 29 Summary of the answers to Q6 and Q7

6.4 Answers to research questions

6.4.1 RQ4. Can M3S efficiently facilitate the impact analysis from ML model performance?

For tool operation, M3S is more efficient than the combination of a natural language specification and a CLI-based tool. However, the difference in discussion time must be considered to properly understand the overall time difference. This is particularly relevant for the practitioner framework group, which spent considerable time in discussion. An interesting finding from their post-experiment discussion is the comments about model correctness: the practitioner framework group took time to discuss whether the logical points of the models were correct, and none of the other groups engaged in this type of analysis. It is plausible that the time the practitioner framework group spent discussing correctness is due to the clarity of the connections between the elements rather than difficulties in finishing their task. With that in mind, we argue that the M3S approach is more efficient than the ad-hoc approach used by the control group. Moreover, the experience level did not affect the time required for completion.

The M3S approach is more efficient than the ad-hoc approach. This is consistent with the fact that the framework group needed much less time than the control group, regardless of experience level.

6.4.2 RQ5. Can M3S efficiently facilitate the analysis of parameters for repair activities?

Although the difference between the framework and control groups as a whole is not significant, the difference becomes clear when the groups are split by experience level. In both cases, the framework groups took less time to complete their task than the control groups, and the effect is more pronounced for the students: the operation time of the student control group is almost twice that of the student framework group. The participants’ comments further support this finding. One participant in the control group commented on the need for GUIs to complete all the DNN repair tasks.

Each group followed a similar format in their discussions: identify the important classes to repair, decide the exact configuration values, and evaluate the result of the DNN repair. Despite the similar format, the control groups had longer discussions. Comments from the practitioner control group emphasized their difficulties in deciding the configuration values and analyzing the side effects of the process.

The M3S approach more efficiently facilitates parameter analysis in the repair activity than a more ad-hoc approach. M3S supports a faster analysis to make and evaluate the decisions of the DNN repair activities. However, experience level also affects the analysis.

6.4.3 RQ6. Does M3S help users train a better ML model through integrated solutions?

Participants in the framework groups repaired the model to satisfy all desired ML performance values, whereas the control groups failed to satisfy the desired misclassification rate. It should be noted that the control groups’ models outperformed the framework groups’ models on the other performance metrics. This finding raises interesting questions about the nature of the requirements to be satisfied.

An important difference between the misclassification rate and the other performance metrics is its relationship with multiple labels. Accuracy, precision, and recall focus on the population of a single label, whereas the misclassification rate requires a deeper analysis of the connection between the label misclassified from and the label misclassified into. This leads to more complex decision-making when configuring the priority of labels in DNN repair. We argue that M3S helps guide decision-making in such complex situations. This argument is supported by the fact that the framework groups repaired the misclassification rate better while simultaneously satisfying all the desired values.
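To illustrate this difference, the following sketch contrasts the two kinds of metrics on a toy confusion matrix; the matrix values are invented for illustration and do not come from the experiment.

```python
import numpy as np

# Toy confusion matrix over three labels (rows = true label, columns = predicted label).
labels = ["Speed Limit 60", "Speed Limit 80", "Speed Limit 100"]
cm = np.array([
    [90, 8, 2],   # true "Speed Limit 60"
    [5, 92, 3],   # true "Speed Limit 80"
    [1, 4, 95],   # true "Speed Limit 100"
])

# Accuracy only needs the diagonal (correct predictions for each label).
accuracy = np.trace(cm) / cm.sum()

# The pairwise misclassification rate needs a specific off-diagonal cell:
# the fraction of true "Speed Limit 60" samples predicted as "Speed Limit 80".
i, j = labels.index("Speed Limit 60"), labels.index("Speed Limit 80")
misclass_60_to_80 = cm[i, j] / cm[i].sum()

print(f"accuracy = {accuracy:.3f}, misclassification 60->80 = {misclass_60_to_80:.3f}")
```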

The M3S approach helps the user train a better model in complex situations where the relationships between labels are important. However, in a straightforward situation where only a single metric matters, its efficacy is lower than that of the ad-hoc approach.

6.4.4 RQ7. How confident are users about the impact analysis result of M3S?

Participants in the framework groups were more confident in the results of their ML performance monitoring task than the control groups. This is supported by a positive comment about M3S from the practitioner framework group, which stated that the automatic detection of satisfied and unsatisfied elements helped them navigate the effect of ML model performance, even if they had to recheck it. In contrast, both control groups were concerned about their results in the event that the set of requirements were larger than the one used in the experiment.

The automatic impact signaling of M3S increases users’ confidence in the impact analysis result compared to a more ad-hoc approach because, with M3S, only the signaled results need to be rechecked instead of analyzing everything from scratch. The impact becomes more significant as the number of requirements increases.

6.4.5 RQ8. How confident are users about the usability of the impact analysis of M3S?

The answers to Q2, Q3, and Q4 show interesting results. The difference in sentiment becomes more significant as the scope of the requirements broadens. Both groups started with similar confidence levels, but the control group’s sentiment about their capability to evaluate the overall requirements became extremely negative, implying that the ad-hoc approach is highly unreliable in such cases. In contrast, the framework groups responded consistently across all cases. The participants who used M3S felt confident that they could work with all levels of requirements, not just the ML performance requirements, indicating that M3S has better usability for evaluating the impact than a more ad-hoc approach.

The M3S approach is more usable than ad-hoc approaches. This sentiment is more pronounced when impact analysis requires a more comprehensive approach. M3S should be more beneficial when the scope of the impact is broad.

6.4.6 RQ9. How confident are users about the solution integration in M3S?

The integration of DNN repair into our framework has some major issues, although the reason is unclear from the answers to Q5 alone. Both the practitioner and student framework groups completed the repair and satisfied all the ML performance requirements, whereas neither control group did. The post-experiment discussion may provide some insight: one participant noted that the DNN repair process is not visible from the modeling side, and another reported unfamiliarity with the tool and the need to see the detailed process of the DNN repair. Based on these comments, we assume that our integration is overly encapsulated and lacks the transparency of more traditional approaches. Additional information is necessary to properly answer this question.

Confidence in the results of the integrated DNN repair in M3S is low. This may be due to a lack of transparency, especially when users are unfamiliar with the solution. In the future, an in-depth evaluation of the internal process, especially the generality of the findings for different types of solutions, should be conducted.

6.4.7 RQ10. How confident are users about the usability of the solution integration in M3S?

M3S shows a slight disadvantage compared to the ad-hoc groups, but the difference may be caused by the weighting of the extreme options. Nevertheless, the problems reported by both groups should be examined to understand whether they experienced the same difficulties during the experiment. The control group noted that the ad-hoc feeling of deciding the configuration values made the task difficult; the framework group made no similar comment, suggesting that the groups encountered different problems. Moreover, a participant in the control group stated the need for a GUI-based approach to the repair, whereas the framework group noted the convenience of not having to write CLI commands manually. However, the framework group mentioned that the randomness of the ML solution is an issue. This is likely because both sections of the framework group conducted the repair more than once during the experiment and tried to make sense of the detailed effect of the configuration. Unfortunately, randomness is inherent to the ML training process, and addressing it is beyond the scope of this experiment.

Overall, the comments suggest that our approach has better usability than more ad-hoc approaches for deciding the configuration values, with the remaining difficulty, randomness, being a limitation of the ML solution itself. The answers to Q7 show that the sentiment of the framework group is significantly more positive than that of the control group, indicating that the solution integration implemented in M3S has higher usability than ad-hoc approaches.

Users have higher confidence in the usability of the solution integration in M3S than in the ad-hoc approach. The main reason is that the GUI support in M3S allows for more reason-based decision-making, especially when evaluating the solution’s successes and side effects.

6.5 Threats to validity

There are several threats to the validity of the experiment. One internal threat is the participants’ familiarity with the methods. To mitigate this threat, we randomized the groups while ensuring that both groups had similar experience levels. Another internal threat is the participants’ bias towards a particular method. We countered potential bias by not sharing the aim of this research with the participants.

An external threat is sampling bias. Our experiment included two different sections to improve the generality of the results. We also selected participants with differing levels of experience in ML and software development to ensure the quality of the sampling.

7 Discussion

This section addresses benefits, limitations, and other insights of M3S gained from the case study and controlled experiment.

Table 7 Comparison of M3S with other approaches

7.1 Benefits - comprehensive feedback loop

The case study highlighted that the repair process and the proposal to improve the camera sensors were both driven by feedback from previous actions; in particular, the sensor-improvement proposal came from understanding the limitations of the repaired ML model. In addition, the controlled experiment showed positive sentiment toward the impact analysis in M3S, and the framework users also produced better ML models. The development process in M3S drives informed decisions that are implemented in the next development step, creating a feedback loop between the analysis and the implementation.

Table 7 compares the scope of M3S to other approaches. Compared to other multi-view analysis approaches (Villamizar et al., 2022; Nalchigar et al., 2021), M3S covers more aspects, namely safety and argumentation. Additionally, M3S facilitates an integrated feedback loop between the analysis and the ML training and testing pipeline. Compared to pure analysis approaches, M3S supports a dynamic environment for continuous evaluation and improvement of the decisions behind ML system development through the traceability provided by the integration (Galvao & Goknil, 2007).

The analysis scope of M3S differs from that of model-transformation approaches (Moin et al., 2022; Koseler et al., 2019). Model-transformation approaches focus on ML model training without considering the other aspects required for an overall ML system. In contrast, M3S supports the development of more robust ML systems by integrating ML model analysis with the higher-level requirements necessary for successful large-scale ML system development. The integration facilitates the validation of potentially unrealistic expectations of the ML model’s capability and provides guidance for refining them into more realistic ones (Nahar et al., 2023).

7.2 Benefits - documented integrated solutions

An underlying principle of M3S is the integration of solutions that improve the quality of the ML model into the multi-view analysis. The concept of integration is not unique; Table 7 summarizes other works that have explored the idea of using model transformation to generate training pipeline source code. The distinction between M3S and these approaches lies in which part of the pipeline is generated through model transformation: M3S produces the configuration of the preferred solutions for improving ML model quality. The underlying benefit of this approach is that the experimentation with a proposed solution is tightly synchronized with the analysis, enhancing both traceability and reproducibility.

The experimentation using different approaches to DNN repair is documented directly inside the solution of the safety case (Fig. 19). This facilitates tracing the improvement or degradation of the ML model quality because the configuration of proposed solutions is well represented inside the models. Although the case study only evaluated a single solution, other solutions should show similar benefits. For example, the decision to experiment with data augmentation to balance data distribution can be properly reflected in the model with the configuration of the augmentation written as a description of the solution.
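For illustration only, the kind of information such a solution description might carry can be captured in a small structured record. The field names and values below are assumptions for this example, not part of the M3S metamodel.

```python
from dataclasses import dataclass

@dataclass
class SolutionDescription:
    solution: str              # e.g., "data augmentation" or "DNN repair"
    configuration: dict        # parameters used for this experiment iteration
    rationale: str             # why this solution/configuration was chosen
    observed_effect: str = ""  # filled in after the training run completes

# Hypothetical example of documenting a data augmentation trial as a solution description.
augmentation_trial = SolutionDescription(
    solution="data augmentation",
    configuration={"technique": "rotation", "max_degrees": 15,
                   "target_labels": ["Speed Limit 60"]},
    rationale="Rebalance under-represented samples contributing to the 60->80 misclassification.",
)
```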

The benefits of the M3S style of integration should be more apparent over longer loops of adding and removing solutions to the ML model training than with the model-transformation approaches summarized in Table 7. Instead of placing the ML pipeline design directly in the model, the decisions behind each experiment are reflected in the model. Given that monitoring for model degradation from drifts is prominent in ML systems (Bayram et al., 2022), the ability to retrace past decisions is highly beneficial.

7.3 Limitation - variation of machine learning tasks

One issue discovered while developing and validating M3S is the vast variety of possible ML tasks. The case study and controlled experiment only considered the multi-class classification task, as other ML tasks were beyond the scope of this research, which limits our findings to multi-class classification. Although related work suggests that M3S is versatile enough to handle different ML tasks, its generality across tasks should be validated in the future.

A comparison with other works that utilize the framework for different ML tasks suggests that M3S has the versatility necessary to handle various ML tasks. Our early work on framework extensibility showed promising results for named entity recognition (NER) problems in a rule-based text transformation (Takeuchi et al., 2023). An extension of M3S with activity-driven analysis demonstrated the ability to handle optical character recognition (OCR) tasks (Tanaka et al., 2023). By comparing the characteristics of the NER and OCR tasks with the image classification task in our case study, we can identify the features of M3S that would work differently from the case study presented here.

Performance metrics are one crucial aspect that differs from our case study. Image classification mainly relies on classical metrics such as accuracy, precision, and recall, whereas OCR is evaluated with character- and word-level error rates. The same holds for other ML tasks, such as the mean squared error (MSE) for regression and intersection-over-union (IoU) for semantic segmentation. While the analysis side of this difference can be handled by the Value, ML Task, and Goal views, the implementation in the integrated training pipeline will differ. A modification or extension to accommodate the variety of ML performance metrics is necessary to fit the requirements of individual projects.
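One possible direction is a pluggable metric interface between the analysis views and the training pipeline. The following is a minimal sketch under that assumption; the class and function names are hypothetical and not part of the current M3S implementation.

```python
from typing import Protocol, Sequence

class PerformanceMetric(Protocol):
    """A task-specific metric that the pipeline can evaluate against a desired value."""
    name: str
    def evaluate(self, y_true: Sequence, y_pred: Sequence) -> float: ...

class Accuracy:
    name = "accuracy"
    def evaluate(self, y_true, y_pred):
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

class MeanSquaredError:
    name = "mse"
    def evaluate(self, y_true, y_pred):
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def check_requirements(metrics, thresholds, y_true, y_pred):
    """Evaluate each registered metric against its desired value from the analysis model."""
    results = {}
    for name, desired in thresholds.items():
        value = metrics[name].evaluate(y_true, y_pred)
        # A ">= desired" convention is assumed here; error-style metrics such as MSE
        # would use "<= desired" instead.
        results[name] = value >= desired
    return results

# Hypothetical usage for a classification task:
metrics = {"accuracy": Accuracy()}
print(check_requirements(metrics, {"accuracy": 0.95}, [1, 0, 1, 1], [1, 0, 0, 1]))
```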

The same is true for the solutions available to improve the quality of the ML model. The DNN repair used in the case study works mainly for image classification problems and is only one variant among many possible DNN repair tools, each with its own parameters and processes. Additionally, different ML tasks may not have the same solutions available, and even the same type of solution, such as data augmentation, will have variations for the different types of data required by each ML task. Because exhaustively providing all possible solutions is expensive, an extension mechanism is necessary to accommodate the specific needs of each case.

In conclusion, the limitation in handling the myriad of ML tasks lies in the implementation. Although other works have explored the versatility of M3S on the analysis side, challenges on the implementation side have yet to be solved. In the future, a general, extensible interface between the modeling side and the training pipeline should be investigated to enable efficient extension of the integrated pipeline with different ML performance metrics and different configurations of integrated solutions.

7.4 Limitation - platform-agnosticism

The support tool in the case study and controlled experiment is based on Astah* System Safety and DVC. We selected this platform because it allowed us to reuse existing functions. However, this choice raises a concern about the platform-agnosticism of M3S and its usability with other software development processes and tools. The concern is important because developers commonly have existing processes and tools in place. This subsection addresses these concerns and guides future work in this direction.

For analysis and modeling, we suggest that any modeling tool that supports model analysis can be utilized. Since the modeling approaches are not developed in a platform-specific manner, they should work in any modeling environment. However, applying the metamodel and integrating with the training pipeline may be challenging. The metamodel-based modeling approach of M3S requires a modeling environment that supports such functions either natively or via custom functionality, and the integration needs a custom communication function, which requires an extensible modeling tool. Two common modeling environments provide these functions: Sparx Systems Enterprise Architect and the Eclipse Modeling Framework. Other modeling environments with these features could also support M3S.

Various ML training pipelines can be implemented as the integrated training pipeline. We argue that our metamodel facilitates the integration of the modeling part with the training pipeline, building on similar research by Idowu et al. (2022). The integrated metamodel of M3S works in a similar manner to Idowu’s general metamodel for experiment management tools, which aims at general integration: it maps artifacts generated by the pipeline and managed by experiment management tools onto the concepts inside the models. This allows the integrated pipeline to be implemented on different platforms, because such pipelines support the general ML training artifacts described in the integrated metamodel (Idowu et al., 2023).
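As an illustration of this idea, the following sketch links artifacts written by a pipeline (e.g., a metrics file produced by a DVC stage) back to model elements. The file name, element identifiers, and comparison convention are assumptions made for this example.

```python
import json
from dataclasses import dataclass

@dataclass
class ModelElementLink:
    element_id: str       # identifier of the goal/requirement element in the model
    metric_key: str       # key of the corresponding metric in the pipeline's output
    desired_value: float  # threshold recorded in the analysis model

def sync_metrics(metrics_path, links):
    """Read the pipeline's metric artifact and flag which model elements are satisfied."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    # A ">=" convention is assumed; the real mapping would follow the metamodel's semantics.
    return {link.element_id: metrics[link.metric_key] >= link.desired_value
            for link in links}

# Hypothetical usage, assuming the pipeline writes its metrics to "metrics.json":
links = [
    ModelElementLink("GoalView::OverallAccuracy", "accuracy", 0.95),
    ModelElementLink("SafetyCase::Recall_SpeedLimit60", "recall_speed_limit_60", 0.90),
]
# satisfied = sync_metrics("metrics.json", links)
```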

The present case study and controlled experiment did not consider the extensibility process of the framework, as described in subsection 4.3. A case study involving customization and extension has been discussed briefly (Husen et al., 2023). The extension process should allow developers to evaluate and customize M3S to their needs and constraints, including the use of existing processes and tools. Moreover, efforts have been made to generalize currently implemented solutions (Runpakprakun et al., 2023).

7.5 Limitation - internal solution uncertainty

Uncertainty due to the probabilistic nature of ML models is a crucial problem in ML system development. M3S aims to support decision-making and management by documenting the decisions made during the development and operation of ML systems. The comprehensive nature of M3S should help developers understand the impact of varying ML performance caused by these uncertainties and other factors, and M3S should also make it possible to evaluate whether previously adopted solutions remain relevant under new conditions.

However, the internal uncertainty of the solutions remains an issue in decision-making. This was especially true during the DNN repair process in the controlled experiment. The non-deterministic nature of the method lowered the confidence of the participants working with M3S: when they retried the repair, they found that the effect of their configuration was inconsistent with their expectations. Combined with the black-box nature of the integrated tool, their perception of how the solution works was too vague for the developers’ comfort.

8 Conclusion and future works

This paper proposed and evaluated M3S, an approach that facilitates a consistent and comprehensive analysis of ML systems. We elucidated the benefits of M3S through a case study and a controlled experiment. The evaluation demonstrated that M3S clarifies existing decisions and enhances the performance evaluation of ML models. The case study involved a series of guided decisions through different views, from the top business goals down to executable solutions and testable ML performance requirements, while the experiment confirmed the ease of evaluating the ML performance requirements and the related higher-level decisions.

Moreover, the consistency between decisions and implementation is managed efficiently during training. The decisions made in the models directly influence the execution of solutions implemented in the training pipeline (e.g., the solution part of the safety case). However, M3S provides limited assistance in navigating the internal uncertainty of solutions, resulting in a less positive response regarding the trustworthiness of solution implementation. This limitation may have several origins, such as the lack of transparency in the internal processes. Further improvements in the solution implementation are necessary.

Finally, the generality of the framework to handle different ML tasks remains unclear. Although there are some indications of generality, the evaluation is insufficient to draw a proper conclusion. In the future, experiments on the benefits and limitations of utilizing M3S for ML tasks other than classification are necessary to demonstrate the generality of the framework.

Other future tasks include exploring more views and the extensibility of M3S, as well as improving the implementation of integrated solutions. One direction is to generalize supported modeling, integrated solutions, and experimentation environments. A second direction is to provide a plug-and-play approach to the extensibility of M3S. A third direction is to evaluate the universality of the M3S framework, especially on different ML tasks, and to extend the case study to include continuous monitoring.