1 Introduction

Even though data analytics is a popular topic among industry practitioners and researchers alike, organisations struggle to implement data analytics solutions that align well with their IT infrastructure and analytics goals. In many cases, organisations outsource the design and development of data analytics solutions to external contractors. The resulting solutions carry significant technical debt and are hard to maintain, especially when new requirements emerge over time. Even a slight shift in organisational analytics requirements, such as adding a new algorithm or integrating a new data source, requires a complete re-engineering of the existing data analytics platforms and processes [29].

Existing literature [10] suggests that, compared to the amount of research dedicated to designing data analytics models, there are fewer studies that focus on software engineering issues associated with data analytics platform development, such as requirements engineering [23], software architecture design or analytics knowledge representations. Particularly, Khalajzadeh et al. [17] and Kim et al. [19] advocate the need for better technologies that manage and utilise the data analytics-related knowledge and resources in an organisation and provide means to rapidly develop or update user-friendly data analytics platforms.

Furthermore, we believe that addressing the unique difficulties of requirements engineering for data analytics platforms, such as their intrinsic uncertainty and unpredictability [16], requires model-driven requirements engineering, platform design and development approaches built on semantic-rich meta-models.

In an effort to address these challenges, this paper proposes the Data Analytics Solution Engineering (DASE) framework, a knowledge-driven requirements engineering and platform design approach that utilises semantic data models to improve data analytics platform design and development. The DASE framework is built on the premise that building and using data analytics platforms is a knowledge-intensive task that requires the expertise of software engineers, data scientists as well as domain specialists. The use of semantic web technologies gives the opportunity to build rich information models that can integrate knowledge from different spheres and support flexible software design.

A Data Analytics Platform as referred to in this paper is an interactive application that implements a generic analytics workflow for commonly known, frequently used and recurring business problems. Examples of such recurring problems include using an existing prediction service or using a model building service to train a machine learning model using a new dataset. A Data Analytics Platform will have an interactive GUI, backed by a Knowledge Base and a set of services to be used by the data analysts in an organisation.

The main contributions of this paper are as follows:

  • We propose the DASE framework that integrates knowledge from different spheres and supports flexible analytics platform software design and implementation. The DASE framework relies on a Knowledge Base that can accumulate analytics requirements as well as the knowledge about an organisation’s application domain including analytics and service related information integrated with open knowledge.

  • We propose the Analytics Requirements Ontology (ARO) as a meta-model that can represent data analytics platform requirements in an organisation.

  • We propose the DASE reference architecture upon which organisations can adapt their IT infrastructure to utilise the Knowledge Base, elicit data analytics platform requirements, define the analytics platform design elements and support the implementation of new data analytics platforms.

  • We evaluate the DASE framework by conducting two real-world case studies related to house price prediction and time-series data analytics, based on a prototype implementation of the DASE reference architecture.

Furthermore, we articulate a discussion on the limits of the DASE framework and how these limits can be overcome.

The remainder of the paper is structured as follows. The next section discusses the background and related work of the DASE framework, followed by a description of the main elements of the framework in Sect. 3. Section 4 provides an overview of the proposed architecture, and Sect. 5 presents the prototype we developed. Section 6 describes the evaluation process, followed by Sect. 7 which lists the limitations and threats to validity. The paper concludes in Sect. 8, discussing potential future work as well.

2 Background and related work

2.1 Model-driven requirements engineering and platform design

Utilising meta-models for requirements engineering is a widely researched area, particularly to enable the integration of requirements with software platform architecture to ensure traceability and change management.

For example, El Begger et al. [13] recently proposed a model-driven architecture framework to design data warehouse requirements and generate a multidimensional data schema. However, their UML-based model cannot be integrated with other meta-data associated with analytics such as business domain, process and data lineage.

Tropos4AS [22] is a framework for engineering requirements for adaptive software systems that can directly map its i*-based requirements model to software prototypes. Yet the scope of Tropos4AS is not suitable for modelling proactive behaviours in interactive data analytics platforms. Moreover, even though the Tropos4AS requirements model can be mapped to an application prototype, it does not support model-driven platform development.

2.2 Knowledge-driven requirements engineering and platform design

Knowledge-driven approaches, based on ontologies supported by semantic web technologies, as proposed by Berners-Lee et al. [8], have benefits in requirements engineering as well as platform development. The semantic web technology stack offers a well-developed set of standards and notations such as RDF, RDFS and OWL, supported by tools for modelling, storing, querying and inferencing the knowledge. Using these standards and tools for modelling semantic-rich information about software requirements and artefacts can advance the way organisations design and develop their software. Different communities have adapted semantic technologies to build standard ontologies related to their practices (e.g. Schema.org, DBpedia, SNOMED CT). While there are examples of leading technology companies (e.g. Google, Amazon and Facebook) exploiting the power of semantic web technologies for applications such as semantic search and knowledge graphs, many industry players are still largely unaware of the value that these approaches bring [7, 9, 21].

A recent survey in requirements engineering [11] investigates how ontologies can support requirements engineering activities, especially in reducing the ambiguity, inconsistency and incompleteness of requirements. It identifies four research areas that need further attention, given the high potential of ontologies in modelling the interplay between software requirements and architecture: (1) how to use ontologies in the requirements validation phase; (2) how ontologies can support the integration between requirements and software architecture; (3) how ontologies can improve the interoperability of requirements models; and (4) how ontologies can help generate code in the context of model-driven development (MDD).

Related to these research gaps, Pires et al. [25] propose an iterative and incremental model-driven requirements engineering process that combines ontologies with Controlled Natural Language to represent requirements from the perspectives of all stakeholders and development teams, and to integrate them with other models in the software engineering life-cycle. However, their work does not utilise the requirements model for model-driven platform design and development. Eito-Brun and Amescua [12] propose a semantic-technology-based solution to represent the context of process requirements when developing software in critical sectors such as aerospace and medical systems, by linking process requirements to activities, tasks and work products. However, their process-oriented meta-model does not easily translate into data analytics platform development and, due to its setup cost, is only suitable for long-term projects.

2.3 Knowledge-driven requirements engineering and platform design for data analytics

Knowledge-driven requirements engineering and platform design can be particularly beneficial in the scope of data analytics platform development. Multiple survey studies highlight the need for comprehensive knowledge repositories that can represent data analytics requirements, analytical models, datasets used, decision-making processes and relevant domain information that can assist future analytics operations [17, 19].

A recent systematic literature review [4] has studied existing research efforts related to the use of semantic technology in data analytics platform design and implementation. The majority of identified studies use semantic models to support isolated activities such as model generation [32] or data source selection [18]. Existing ontologies cover certain elements of data mining and knowledge discovery processes well (e.g. OntoDM [24]). Yet few research efforts have the primary objective of linking the multiple facets of expert knowledge related to the different phases of the platform development process (i.e. requirements gathering, design and implementation). The Research Variable Ontology (RVO) [6] addresses the issue of integrating organisational analytics knowledge with open knowledge in a way that can support the decision-making process of data analysts. Although RVO can be used to design an organisational analytics Knowledge Base, its practical usage in developing data analytics platforms that integrate well with an organisation’s analytics requirements and existing IT infrastructure has not been sufficiently explored in previous work [26].

Nalchigar et al. [23] report on GR4ML, a UML-based machine learning requirements modelling framework that facilitates requirements elicitation and cross-team communication. However, they do not offer a methodological approach to incorporating GR4ML into platform development, and the framework does not align requirements with business goals and IT infrastructure. GORE-MLOps [16] is another theoretical framework that aims to link requirements to ML operations in order to manage the uncertainty and unpredictability of ML-based systems. Yet the GORE-MLOps model is limited to a goal graph and does not support other associated meta-data. It also lacks an implementation prototype or tool that demonstrates its applicability and effectiveness in practical applications.

2.4 Motivation for DASE framework

In order to fill the gaps identified in the literature, this paper proposes the DASE Framework. By utilising a semantic-rich Knowledge Base to represent requirements and integrating them with other meta-data in analytics platforms (domain knowledge, analytics knowledge and IT infrastructure), we support the integration between requirements and software architecture, requirements models interoperability as well as model-driven platform design and development. The data analytics platforms developed through the DASE framework can be customised for specific organisation requirements. They are flexible enough to be rapidly modified and can reduce the cognitive burden on data analysts when used for analysing data in a new domain or adopting new approaches, resulting in a shorter learning curve.

3 DASE framework

3.1 Overview

Figure 1 shows the scope and the main building blocks of the DASE framework including the recommended steps of utilising analytics knowledge to generate a Data Analytics Platform.

Fig. 1 Scope of the DASE Framework

Within the scope of the DASE framework, we define Open Knowledge as all the publicly available knowledge related to data analytics represented in the Semantic Web as linked data through different ontologies. The Analytics Requirements Ontology (ARO) is a main contribution of our paper. It is a meta-model that can represent knowledge about Data Analytics Platform requirements in an organisation. As Fig. 1 shows, ARO, together with Open Knowledge, is used in Knowledge Modelling and Requirement Modelling to create the Organisational Analytics Knowledge Base: an integrated information repository reflecting the organisational analytics activities, resources and expertise.

Platform Design Modelling is the step where Organisational Analytics Knowledge Base and Organisational IT Infrastructure are utilised to create Platform Design Specification (PDS). As the DASE framework is based on service-oriented and workflow architecture design principles, organisations are free to leverage IT components (databases, middleware and machine learning models) from the existing infrastructure, and expose them as services to be utilised by the Data Analytics Platform. As transforming a PDS into an executable platform via Model-Driven Development (MDD) principles is a mature and independent field of research [1], the issue of automatically transforming a PDS into an executable Data Analytics Platform is left out of the scope of this paper.

3.2 Knowledge modelling

During the Knowledge Modelling step, an organisation develops the analytics Knowledge Base by accumulating and integrating all analytics-related knowledge and resources, such as domain knowledge, details about past analytics experiments and findings, and information about resources and services contained in the organisation’s IT infrastructure. Organisations can reuse Open Knowledge: publicly available taxonomies, ontologies and linked data. It is also possible to extend Open Knowledge and model organisation-specific data following semantic web and linked data standards.
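As a minimal illustration of this step, the sketch below expresses a piece of organisation-specific instance data as RDF (Turtle) held in an R string, reusing schema.org as Open Knowledge. The ‘org:’ namespace, the resource names and the usedByService property are hypothetical and serve only to show the linked data style, not the actual content of any Knowledge Base.

    # Illustrative sketch only: organisation-specific instance data as Turtle,
    # kept in an R string before being loaded into the Knowledge Base.
    # The 'org:' namespace and all resource/property names are hypothetical;
    # schema.org is reused as Open Knowledge.
    org_instance_ttl <- '
    @prefix org:    <http://example.org/analytics#> .
    @prefix schema: <http://schema.org/> .

    org:HouseSales2019 a schema:Dataset ;
        schema:name             "House sales 2019" ;
        schema:variableMeasured "SalePrice" ;
        org:usedByService       org:HousePricePredictionAPI .
    '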

3.3 Requirement modelling

The Analytics Requirements Ontology (ARO), illustrated in Fig. 2, is used in the Requirement Modelling step to help organisations identify and catalogue requirements related to a particular Data Analytics Platform. The purpose of ARO is to elicit requirements for the Data Analytics Platform, model those requirements in detail and integrate them with Open Knowledge to provide clarity and context. The ARO design and development process was based on the NeOn methodology [30] and the work by Gruninger and Fox [15]. Several design iterations were conducted to refine and improve concepts based on competency questions and feedback from domain experts.

3.3.1 Core concepts of ARO

We designed ARO by extending UML stereotype concepts and following Object Management Group standards. This way, organisations can easily translate and extend traditional requirement specifications into ARO. The UML concepts we extended are:

  • “UseCase”: a means to capture the requirements of systems, i.e. what systems are supposed to do. The behaviour of a “UseCase” can be described by a set of elements such as interactions (e.g. an activity, interaction or communication diagram).

  • “Activity”: a behaviour specified as sequencing of subordinate units, using a control and data flow model.

  • “Action”: the fundamental unit of behaviour specification in UML.

Fig. 2 Main concepts of ARO

These are the main concepts of ARO (Fig. 2):

  • AnalyticsUseCase is a stereotype of UML “UseCase” that anchors the requirements of a specific analytics platform by identifying the scenario performed by the analyst.

  • AnalyticsActivityFlow is a stereotype of UML “Activity” representing a common group of steps analysts will follow to perform an AnalyticsUseCase.

  • AnalyticsAction is a stereotype of UML “Action” that captures each unit step of an AnalyticsActivityFlow. The resulting data analytics platform needs to support these actions via the application workflow. An AnalyticsAction contains textual information and a sequence number that defines its order within the AnalyticsActivityFlow.

We identify three components of an AnalyticsAction that capture the low-level requirements necessary for platform design modelling; an illustrative instance is sketched after the following list.

  • QueryDefinition defines the nature of the query, with its inputs and outputs, that can retrieve relevant information from the Organisational Analytics Knowledge Base to fulfil the related AnalyticsAction.

  • GUIDefinition defines the type of a Graphical User Interface (GUI) element required to perform the related AnalyticsAction.

  • ServiceDefinition defines the characteristics of the service (i.e. API) that needs to be invoked by the Data Analytics Platform to access a service available in the organisational IT infrastructure.
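The sketch below illustrates, with hypothetical names and values, how a single AnalyticsAction for selecting a prediction model and its three component definitions could be captured as plain structured data (shown here as an R list purely for readability; in the prototype such instances are created in Protégé, as described in Sect. 5.1).

    # Illustrative sketch of one AnalyticsAction instance and its three
    # component definitions; all names and values are hypothetical.
    select_model_action <- list(
      type           = "AnalyticsAction",
      label          = "Select a Model",
      sequenceNumber = 1,
      partOf         = "ForecastFlow",        # hypothetical AnalyticsActivityFlow
      queryDefinition = list(
        description = "Retrieve all pre-deployed prediction models",
        inputs      = character(0),
        outputs     = c("modelName", "modelURI")
      ),
      guiDefinition = list(
        elementType = "drop-down list",
        label       = "Available models"
      ),
      serviceDefinition = NULL                # this action requires no service invocation
    )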

3.3.2 Utilising ARO for requirements modelling

To demonstrate how analytics requirements are modelled using ARO, we instantiate two scenarios as examples of the AnalyticsUseCase concept. The first AnalyticsUseCase in Fig. 3 is ‘Conducting new forecast using a pre-deployed model’. Its behaviour is represented by a very simple AnalyticsActivityFlow named ‘AAF-Forecast’, which assumes that the organisation has a repository of pre-deployed models. It contains three AnalyticsActions that define the sequence of steps a data analyst follows to fulfil the AnalyticsUseCase. Please note that many real-world AnalyticsActivityFlows will be much more complex, with a larger number of AnalyticsActions.

Fig. 3 AAF-Forecast: An instance of ARO for a simple AnalyticsActivityFlow for the AnalyticsUseCase ‘Conducting new forecast using a pre-deployed model’

Figure 4 provides an example of a more complex AnalyticsActivityFlow named ‘AAF-Train’ to capture the behaviour of ‘Training a new prediction model’ AnalyticsUseCase. This AnalyticsActivityFlow captures an exploratory prediction activity in an organisation, where analysts interact with the data analytics platform to define the context of analysis, and identify dependent and independent variables for the prediction. Furthermore, analysts can define the measures they want to relate to each variable and identify datasets within the organisation that contain selected measures. In AnalyticsAction 6, analysts can select a model type (e.g. regression, SVM), configure model parameters in AnalyticsAction 7 and use the previously selected dataset to train and create a model in the final AnalyticsAction.

Further examples of AnalyticsUseCases are provided in supplementary materials associated with this paper.

Fig. 4 AAF-Train: An instance of AnalyticsActivityFlow for the AnalyticsUseCase ‘Training a new prediction model’

In Tables 1 and 2, we present how the properties of an AnalyticsAction and associated text definitions are instantiated for ‘AAF-Forecast’ and ‘AAF-Train’ AnalyticsUseCases. These requirements were developed and refined in consultation with the data analysts (end-users) involved in the two case studies presented in Sect. 6.

Table 1 Example ARO-based Requirement Specifications for AAF-Forecast
Table 2 Example ARO-based Requirement Specifications for AAF-Train

3.4 Platform design modelling

Once an AnalyticsUseCase is modelled in ARO, its components can be instantiated as the Platform Design Specification, as described below, linked to the organisational IT infrastructure and the organisational analytics Knowledge Base. The Platform Design Specification has the following four components:

  • Query Management: Each QueryDefinition requirement from ARO is used as a basis to define a SPARQL query against the Organisational Analytics Knowledge Base to access information necessary to fulfil the AnalyticsAction.

  • GUI Design: Each AnalyticsAction needs a detailed graphical user interface (GUI) that satisfies all of its GUIDefinition requirements. Details of the user interface, such as which GUI elements to use (e.g. buttons, drop-down lists, sliders) and their colour, size and position, need to be defined.

  • Service Management: If an AnalyticsAction has a ServiceDefinition requirement associated with it, service invocation details need to be defined and managed. This includes information such as the input parameters required by the service and how to read and manage the service response.

  • Workflow Design: This is a realisation of the AnalyticsActivityFlow where Platform Design Specification components associated with AnalyticsActions are ordered in the specified sequence and composed into an implementable and machine-readable workflow model.
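To indicate what such a machine-readable workflow model can look like, the sketch below orders the design components of the three AAF-Forecast actions. All identifiers are illustrative and the middle step name is an assumption; in the prototype this model is composed in the Jalapeno Form-Flow interface rather than in code (see Sect. 5.3.4).

    # Illustrative workflow model for AAF-Forecast: each step references the
    # query, GUI and service artefacts of one AnalyticsAction. All identifiers
    # are hypothetical; the second step name is an assumption.
    aaf_forecast_workflow <- list(
      useCase = "Conducting new forecast using a pre-deployed model",
      steps = list(
        list(seq = 1, action = "Select a Model",
             gui = "model_dropdown", query = "q_list_models",  service = NULL),
        list(seq = 2, action = "Provide Input Values",
             gui = "input_form",     query = "q_model_inputs", service = NULL),
        list(seq = 3, action = "Run Model",
             gui = "run_button",     query = NULL,             service = "prediction_api")
      )
    )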

4 The DASE architecture

Fig. 5 Proposed DASE architecture

The proposed DASE architecture contains a number of architectural recommendations for organisations to implement and adapt the DASE framework. The DASE architecture is independent of the organisation’s application domain and the technology stack it uses. Its design has the flexibility to be extended or updated with any new module independently of the others.

There are five modules in the DASE architecture (see Fig. 5).

The Knowledge Base module contains instance data related to the Organisational Analytics Knowledge Base modelled using Open Knowledge (Sect. 3.2) and ARO concepts (Sect. 3.3).

The Platform Design Framework module facilitates the creation of all four components of the Platform Design Specification described in Sect. 3.4, referring to the IT infrastructure and the organisational analytics knowledge (through the Platform Design Interface). The IT infrastructure shown in Fig. 5 represents existing back-end modules that the Data Analytics Platform can access via service calls. This may include machine learning APIs, analytics tools, data sources, workflows and authentication APIs.

The Tools layer provides frontends for knowledge management and platform design operations. The Knowledge Management Interface is used for Knowledge Modelling and Requirement Modelling and manipulates the objects in the Knowledge Base. The Platform Design Interface allows engineers to define elements of PDS through Platform Design Framework. It is linked to the Knowledge Base through the Query Engine.

The Data Analytics Platform is the final product of the DASE framework, generated via a model-driven implementation of the PDS. It is connected to the Knowledge Base via the Query Engine and to the IT infrastructure via service invocations.

5 Prototype implementation

To maintain platform independence, in Sect. 4 we defined each component in the DASE architecture without referring to specific implementation details. For demonstration and evaluation purposes, a prototype implementation of the DASE architecture was realised with the following implementation choices:

  • Use of Protégé as the Knowledge Management Interface

  • Use of MarkLogic as the Knowledge Base

  • Use of Capsifi Jalapeno as the Platform Design Interface and Framework

  • Use of R Shiny Framework for Data Analytics Platform Implementation

5.1 Use of Protégé as the knowledge management interface

In our prototype implementation, we used the Protégé ontology editing tool as the Knowledge Management Interface for Knowledge Modelling and Requirement Modelling because of its simplicity and availability. Protégé provides a GUI for users to import ontologies available via a file or a URL, explore them and integrate them with local ontologies. Protégé also supports the creation of instance data that represents organisational analytics knowledge, the inference of new information, and the validation of consistency through its deductive classifiers.

Multiple domain ontologies were imported into Protégé from Open Knowledge to create the Organisational Analytics Knowledge Instance Data. This domain knowledge was organised around the Research Variable Ontology, following the approach proposed by Bandara et al. [6]. For requirements modelling, we imported ARO into Protégé and modelled the required AnalyticsUseCases and AnalyticsActivityFlows with their associated AnalyticsActions. Then we wrote QueryDefinitions, GUIDefinitions and ServiceDefinitions for all the AnalyticsActions, according to the schema of the organisational Knowledge Base.

5.2 Use of MarkLogic as the knowledge base

We used a MarkLogic triple store to implement the Knowledge Base module and its inbuilt MarkLogic Query API to implement the Query Engine.
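For illustration, the sketch below shows one way a client could submit a SPARQL query to a MarkLogic REST endpoint from R. The host, port, credentials, the ‘rvo’ namespace URI and the class name are assumptions for readability, not the prototype’s exact configuration.

    library(httr)

    # Illustrative only: submit a SPARQL query over HTTP to a MarkLogic
    # SPARQL endpoint. Host, port, credentials, the rvo namespace URI and
    # the rvo:Model class name are placeholders.
    sparql <- '
    PREFIX rvo: <http://example.org/rvo#>
    SELECT ?model WHERE { ?model a rvo:Model . }
    '
    res <- POST(
      url  = "http://localhost:8000/v1/graphs/sparql",
      body = sparql,
      content_type("application/sparql-query"),
      accept("application/sparql-results+json"),
      authenticate("user", "password", type = "digest")
    )
    models <- content(res, as = "parsed")  # parsed SPARQL JSON results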

5.3 Use of Capsifi Jalapeno as the platform design interface and framework

In our prototype, we use the Capsifi Jalapeno tool, which provides many of the required capabilities of both the Platform Design Interface and the Platform Design Framework module. Capsifi Jalapeno is a commercial tool that utilises semantic web technologies to model organisational business architecture. It provides high extensibility and the ability to integrate organisational knowledge with open knowledge. As a result, organisations can easily connect the Knowledge Base of the DASE architecture to the Jalapeno tool and conduct Platform Design Modelling.

The Digital Interaction (DI) Framework within the Jalapeno tool is specially designed to support rapid design and development of software applications that reflect knowledge-intensive and service-oriented processes in an organisation [5]. This framework is a dynamic composition of concrete services, a set of interactions and underlying information concepts which can be easily converted into execution-level workflow code. We used the Jalapeno tool to design all four components of the Platform Design Specification: GUI Design, Service Management, Query Management, and Workflow Design as detailed below.

5.3.1 GUI design

To conduct GUI Design for each AnalyticsAction related to an AnalyticsUseCase, we use the Form Builder Interface provided by the Jalapeno platform (Fig. 6). This interface has a drag-and-drop feature for designing web pages from generic component templates such as text fields, drop-downs and radio buttons. Further, we can link GUI components to the concepts defined in the Knowledge Base by referring to their ontology URIs. That way, once deployed, some of the GUI components can be dynamically populated with information fetched via semantic queries. The user-defined parameters can also be mapped automatically to values from the Knowledge Base.

As an example, Fig. 6 shows a snapshot of the Form Builder interface containing a GUI Design modelled following the GUIDefinition for the ‘Select a Model’ AnalyticsAction related to AAF-Forecast in Table 1.

Fig. 6 Jalapeno Form Builder interface containing GUI Definition for ‘Select a Model’ AnalyticsAction in AAF-Forecast

5.3.2 Service management

The Jalapeno Message interface (Fig. 7) is designed to model generic web services, with an API Fields table allowing users to define all input parameters, their data types and their location in the API request. We use it to conduct Service Management and to model any service available in the IT infrastructure based on the ARO ServiceDefinition.

As an example, Fig. 7 shows how the service related to the ‘Run Model’ AnalyticsAction in AAF-Forecast (Table 1) is defined in Jalapeno; an illustrative sketch of the underlying call is given after the figure.

Fig. 7 Service definition for ‘Run Model’ AnalyticsAction in AAF-Forecast, defined in Jalapeno Message interface
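At run time, the Data Analytics Platform invokes the modelled service as an ordinary HTTP call. The sketch below is a minimal illustration of such a call for the ‘Run Model’ action; the endpoint URL, parameter names and response field are hypothetical and do not reproduce the actual service definition of Fig. 7.

    library(httr)

    # Hypothetical sketch of the service call behind the 'Run Model'
    # AnalyticsAction: endpoint, parameter names and response structure
    # are assumptions for illustration only.
    run_model <- function(model_id, inputs) {
      res <- POST(
        url    = "http://analytics.example.org/api/predict",
        body   = list(model = model_id, features = inputs),
        encode = "json"
      )
      stop_for_status(res)
      content(res, as = "parsed")$prediction
    }

    # Example usage with illustrative values:
    # run_model("house-price-model-v2", list(bedrooms = 3, landSize = 450))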

5.3.3 Query management

Fig. 8 Queries written as part of the platform design specification of AAF-Forecast

When conducting design activities related to Query Management, we refer to each QueryDefinition in ARO to understand the query requirement, consult the schema of the organisational Knowledge Base and then write a SPARQL query through the Query Interface provided in the Jalapeno tool, associating it with the relevant GUI parameters when necessary.

As an example, Fig. 8 lists the three queries defined in the Jalapeno tool for the three AnalyticsActions of AAF-Forecast defined in Table 1. As we have used RVO [6] in the prototype Knowledge Base to integrate and represent organisational analytics knowledge, the queries in Fig. 8 refer to concepts in the RVO schema using the prefix ‘rvo’.
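To indicate how a QueryDefinition is associated with GUI parameters, the fragment below builds a query around the model URI selected by the analyst. The ‘rvo’ namespace URI and property names are placeholders and do not reproduce the exact queries of Fig. 8.

    # Illustrative only: bind a GUI parameter (the model selected by the
    # analyst) into a SPARQL query that retrieves the model's input variables.
    # The rvo namespace URI and property names are placeholders.
    build_model_inputs_query <- function(model_uri) {
      sprintf('
        PREFIX rvo:  <http://example.org/rvo#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?variable ?label WHERE {
          <%s> rvo:hasIndependentVariable ?variable .
          ?variable rdfs:label ?label .
        }', model_uri)
    }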

5.3.4 Workflow design

For Workflow Design, we use the Form-Flow interface of the Jalapeno tool. As shown in Fig. 9, it provides a canvas onto which the modelled GUIs and services can be dragged and dropped to design an executable workflow that reflects any AnalyticsActivityFlow. The workflow modelled in Fig. 9 represents the AAF-Forecast AnalyticsActivityFlow (Fig. 3). Each node and edge in the flow diagram can be edited to map to the required input and output parameters.

Fig. 9 Workflow definition for AAF-Forecast using the Jalapeno Form-Flow interface

5.4 Use of R Shiny framework for data analytics platform implementation

The ultimate goal of the DASE framework is to develop a Platform Design Specification that can be used to generate an executable Data Analytics Platform in a model-driven fashion. As this is out of the scope of our work, we manually implemented the Data Analytics Platforms using the Shiny package for the R programming language, in order to evaluate the comprehensiveness of the Platform Design Specification.

Shiny is designed to build and host interactive web applications and dashboards that utilise the computational capabilities of R. Shiny applications are easy to write and do not require web development skills, reducing the workload of the developer. Shiny applications are also extensible with CSS themes, HTML widgets and JavaScript actions to provide better visualisation capabilities and usability to the data analytics platform. The two example platforms we developed with the R Shiny package to evaluate the framework (see the snapshots in Figs. 10 and 11) are discussed in Sect. 6.
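To give a flavour of how such a platform is assembled, the sketch below is a heavily simplified Shiny application for the AAF-Forecast flow. The helper functions query_available_models() and run_model() stand in for the Query Engine and the prediction service and are stubbed here; all names and values are illustrative, and this is not the code of the case-study platforms.

    library(shiny)

    # Stub helpers standing in for the Query Engine and the prediction service;
    # in the actual platform these would issue a SPARQL query and a service call.
    query_available_models <- function() c("house-price-lm-v1", "house-price-rf-v2")
    run_model <- function(model, features) 1000 * features$bedrooms  # placeholder

    ui <- fluidPage(
      titlePanel("House Price Forecast (illustrative sketch)"),
      selectInput("model", "Select a Model", choices = NULL),
      numericInput("bedrooms", "Bedrooms", value = 3, min = 1),
      actionButton("run", "Run Model"),
      textOutput("prediction")
    )

    server <- function(input, output, session) {
      # Populate the drop-down dynamically, as the Knowledge Base query would
      updateSelectInput(session, "model", choices = query_available_models())

      result <- eventReactive(input$run, {
        run_model(input$model, list(bedrooms = input$bedrooms))
      })
      output$prediction <- renderText(paste("Predicted price:", result()))
    }

    shinyApp(ui, server)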

6 Evaluation

6.1 Introduction to case studies

In order to demonstrate and evaluate the DASE framework, we employed a case-study-based method. We conducted an artifact-based multiple-case study in which two data analytics platforms were implemented, and followed an observational method for data collection.

The artifact used for the case study is the prototype implementation of the DASE framework described in Sect. 5. We used it to design and develop two data analytics platforms in two application domains.

Case study 1 is related to engineering a data analytics platform for house price prediction. We developed a platform that supports two scenarios: training new house price prediction models (the AnalyticsUseCase realised by AAF-Train, see Fig. 4) and using pre-deployed prediction models to conduct new forecasts (the AnalyticsUseCase realised by AAF-Forecast, see Fig. 3). Components of the Platform Design Specification are presented in Figs. 6, 7, 8 and 9. A snapshot of the implemented platform is shown in Fig. 10. Please refer to Bandara et al. [6] and Rabhi et al. [27] for more details related to this case study.

Case study 2 is related to engineering a data analytics platform for time-series data processing. We developed a platform that can be used to acquire, transform and integrate financial market time-series data. Supplementary materials associated with this paper provide details of the AnalyticsUseCase as well as the components of the Platform Design Specification produced as a result of this case study. A snapshot of the implemented platform’s GUI is shown in Fig. 11. The details of this case study are available in Bandara et al.  [3].

By observing the process of developing these two platforms using the DASE framework, together with participant-based observations of their use, we evaluated the effectiveness of the DASE framework against four criteria:

  • how the framework facilitates end-to-end requirement-driven platform design and implementation;

  • how the Knowledge Base can support both data analytics platform engineering and analytics operations;

  • how maintainable the resulting data analytics platform is when requirements change;

  • how usable the resulting data analytics platform is.

Evidence from the observations was analysed through cross-case synthesis and explanation building [28, 33], and the results are presented here.

6.2 Facilitation of end-to-end requirement-driven platform design and implementation

We used the prototype implementation of the DASE architecture to observe how well the DASE framework can capture and use the analytics requirements related to two case studies for Data Analytics Platform design and implementation.

The first step in developing the Data Analytics Platforms in both case studies was identifying the AnalyticsUseCases that represent end-user requirements, and modelling the related AnalyticsActivityFlows and AnalyticsActions. This was done in consultation with end users and resulted in the ARO-based requirement specifications presented in Tables 1 and 2.

Then all the components of the Platform Design Specification necessary to realise those user requirements were modelled in a way that incorporates some of the existing IT infrastructure components. We manually implemented the two Data Analytics Platforms based on the Platform Design Specifications using the R Shiny framework. These activities were done independently, without involving the end users. Only the final products (the two Data Analytics Platforms) were demonstrated to end users to gather feedback for the evaluation.

Fig. 10 Snapshots of the analytics platform developed for house price prediction case study

Fig. 11 Snapshots of the analytics platform developed for time-series building case study

Our conclusion is that, when evaluated from the end-user’s perspective, these platforms were able to support all the required AnalyticsUseCase functionalities, confirming that the requirements captured by the DASE framework are sufficient for completing the Data Analytics Platforms.

One limitation of the DASE framework is that the AnalyticsActions supported by ARO are restricted by the information available in the Knowledge Base and the IT services provided by the organisation. In addition, the framework currently captures only relatively simple requirements: it cannot model complex user requirements or enforce non-functional requirements. Furthermore, in ARO, requirement definitions are limited to text, and an AnalyticsActivityFlow is represented as a sequence of steps without branching or loops. In the future, we plan to extend ARO with more attributes and complex workflow definitions, possibly adapting concepts from an existing workflow language or notation.

6.3 Knowledge base support for analytics platform engineering and analytics operations

First, we look at how well the PDS is driven by the organisational analytics knowledge accumulated in the Knowledge Base. In both case studies, when developing a PDS, the Knowledge Base is accessed via the Query Engine to map GUI components to the concepts defined in Open Knowledge, and to identify suitable services to fulfil the Service Management component as part of Platform Design Modelling (see Sect. 3.4). When implementing the platforms, the developer integrates the Query Engine with the platform so that each analytics activity conducted by the analyst is assisted by the recommendations provided by queries.

Secondly, we observe that once the data analytics platform is implemented, the majority of AnalyticsActions are assisted by the Knowledge Base. For example, the ‘Use a Model’ step in the AAF-Forecast implementation (Fig. 10) is supported by a drop-down list, populated via a query, that shows all available models to the analyst. Certain GUI components, such as the text boxes for entering the values of the independent variables in the AAF-Forecast implementation (Fig. 10), are dynamically created via queries. This enables access to the most relevant information in the Knowledge Base, based on previous decisions made by the analysts, and ensures that the activities are supported by the Knowledge Base.

We conclude that the Knowledge Base provided good support for analytics platform engineering and conducting analytics operations.

6.4 Maintainability of the resulting data analytics platform

We also evaluated how organisations can rapidly update the analytics platform, with the help of the Knowledge Base. To observe how the DASE framework can handle a change of requirements in the analytics platform, we conducted a few hypothetical changes as described below:

  • Changing the Knowledge Base: to reflect a change in the organisational Knowledge Base, we look at how the platform behaves when there is a requirement to add a new prediction model to the House Price Prediction platform. This can be done by updating the Knowledge Base with the new model via Protégé (or programmatically, as sketched after this list). As the platform fetches information dynamically via queries, the new model becomes immediately available to the analyst in the ‘Select a Model’ step of the GUI in Fig. 10.

  • Changing the IT infrastructure: to study this, we observed how the DASE framework handles a new requirement that adds an improved prediction service to the house price prediction platform. To realise it, the Service Management and Workflow Design components of the Platform Design Specification had to be updated in Jalapeno to use a new service instead of the current one. Then the platform source code had to be updated with new service invocations.

  • Changing the AnalyticsActivityFlow: to study this, we add a new step into AAF-Forecast for an analyst to define the expected prediction accuracy, before selecting a model. Then the Knowledge Base needs to be updated with a new AnalyticsActivityFlow in the place of the current AAF-Forecast. Following that, the Platform Design Specification needs to be updated, i.e. the GUI and query associated with Select a Model AnalyticsAction need to be updated to filter results according to the user-defined prediction accuracy value. All other components of the Platform Design Specification and existing platform software code can be reused.
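For instance, the first change above (registering a new prediction model) could also be applied programmatically instead of through Protégé, by issuing a SPARQL Update against the triple store. The endpoint, credentials, namespaces and names in the sketch below are assumptions for illustration.

    library(httr)

    # Hypothetical sketch: register a new prediction model in the Knowledge Base
    # via a SPARQL Update instead of editing it through Protégé. Endpoint,
    # credentials, namespaces and names are placeholders.
    update <- '
    PREFIX org:  <http://example.org/analytics#>
    PREFIX rvo:  <http://example.org/rvo#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    INSERT DATA {
      org:HousePriceModel-v3 a rvo:Model ;
          rdfs:label "House price model v3" .
    }
    '
    POST(
      url  = "http://localhost:8000/v1/graphs/sparql",
      body = update,
      content_type("application/sparql-update"),
      authenticate("user", "password", type = "digest")
    )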

Our conclusion is that the process of maintaining the data analytics platform and implementing a change of requirement is simple, resulting in traceable and well documented changes.

One drawback we observed is that there is a certain burden associated with implementing some of the changes, particularly in relation to updating the components of the Platform Design Specification, when changing the AnalyticsActivityFlow.

6.5 Usability of the resulting data analytics platform

We evaluated the usability of the analytics platforms generated by the DASE framework using the two Data Analytics Platforms implemented for the case studies.

First, we look at how these platforms reuse existing IT infrastructure and knowledge, as they were designed to follow service-oriented architecture (SOA) and workflow principles. In both case studies, once the analytics platforms were generated, they were able to invoke existing services and software packages in the organisation that are linked to the GUI elements as defined in the Platform Design Specification. When the organisation updates its underlying services, this modular design allows the developers to upgrade the analytics platform with minimum effort, reducing the technical debt.

Secondly, with the DASE framework, there is no need for the analyst to manually identify what process to follow. The platforms generated by the DASE framework are designed to support predefined AnalyticsActivityFlows linked to a rich repository of the most common analytics patterns in the Knowledge Base. This means that analysts are provided with a structured process within the application and a sequence of analysis steps to follow, saving the time they would otherwise spend establishing a process manually.

Lastly, the analytics platform is designed to reduce the cognitive burden on the analysts in the organisation. To use the Data Analytics Platforms in both case studies, the analyst does not need prior knowledge of the existing prediction models, data sources or other resources in the organisation. The analyst can learn and explore alternatives while conducting the analysis. As the platform interacts seamlessly with the Knowledge Base and provides relevant information to support the analyst’s decision-making, the analyst is not required to spend much time on background research or domain understanding. For example, in the AAF-Forecast implementation, when the analyst selects a model, its performance and structure details are visualised dynamically through the associated query (Fig. 10). In addition, the platform can filter the alternatives presented for each activity based on the previous inputs of the analyst. Furthermore, the DASE framework hides many technical details from the analyst; analysts do not need programming skills to use the resulting analytics platforms in either case study.

We conclude that the resulting data analytics platforms are end-user friendly and require minimal effort compared to the traditional ad hoc analytics platforms widely used today, as the DASE framework provides a structured application flow for analysts to follow and reduces their cognitive burden through knowledge-based recommendations.

7 Limitations and threats to validity

7.1 Limitations

One limitation we observed is that the data analytics platforms generated by the DASE framework can limit the freedom of data analysts: they are not able to develop and execute custom scripts or workflows in an unrestricted way. This means the DASE framework will be most useful for organisations whose frequently used analytics processes can be easily customised from a generic form.

Our work does not propose a methodology for converting the Platform Design Specification into executable software. This implementation was conducted manually for the two case studies using the R Shiny framework. Previous work [5] suggests that the Digital Interaction Framework in the Jalapeno tool captures sufficient information about each artifact of the data analytics platform for it to be automatically converted into an executable platform in a model-driven fashion. Yet further evaluation is necessary to validate the feasibility of, and the workload associated with, such an automatic model-driven implementation.

Furthermore, two research gaps identified in the literature were not addressed by the DASE framework. One of them is the analysis of cost in setting up the DASE framework within an organisation and the second one is how to utilise the DASE framework during the requirements validation phase [11].

7.2 Threats to validity

7.2.1 Construct validity

The DASE framework was developed following the design science method and best-practice ontology and architecture design principles to limit any threats to construct validity. The four criteria used for the evaluation (Sect. 6) focus on the quality, completeness and usefulness of the DASE framework and the resulting data analytics platform. The authors acknowledge that these criteria are subjective and qualitative in nature and can have different meanings for different researchers.

7.2.2 Internal validity

The framework evaluation was conducted with two case studies and multiple scenarios in order to limit any threats to the internal validity of the results. Yet, as the case studies are based on the creation of artefacts and direct observations, the authors acknowledge that the evaluation may have weaknesses in terms of the selectivity of case study scenarios and participants, the availability of artefacts and bias due to the investigators’ manipulation of events [33].

7.2.3 External validity

The DASE framework is not a silver bullet for model-driven requirements engineering and design of data analytics platforms. Data analytics platforms used in industry have varying levels of complexity and serve a diverse range of applications, from exploratory and ad hoc research-oriented platforms to simple applications and prediction services. Currently, the DASE framework is applicable to data analytics platforms designed to support iterative analytics processes that implement repetitively used machine learning workflows. Identifying and modelling an AnalyticsActivityFlow for new tasks is challenging and requires special modelling skills and an understanding of the domain. As an example, the AnalyticsActivityFlow in Fig. 4 does not include a step for hyperparameter tuning, which is a crucial step for many machine learning models. Also, the DASE framework is not suitable for ad hoc research tasks in data analytics, and its applicability in organisational settings at different scales is yet to be evaluated.

8 Conclusion

This paper presents the DASE framework, a novel approach that enables knowledge-driven requirements engineering and platform development for data analytics platforms. By utilising a semantic-rich Knowledge Base to represent requirements and integrating them with other meta-data in analytics platforms (domain knowledge, analytics knowledge and IT infrastructure), the DASE framework supports the integration between requirements and software architecture models as well as the process of model-driven platform design and development. The resulting data analytics platforms are knowledge-driven, user-friendly and easy to maintain. They contribute to reducing the required technical knowledge and the cognitive burden of the data analysts.

By implementing and evaluating the DASE framework, we identified multiple future research directions. Generating a Data Analytics Platform utilising the Model-Driven Development paradigm was not explored in this paper. The flow-logic definition in Jalapeno DI Framework can be used as the basis for such work in the future, combined with existing research that explores workflow automation utilising ontologies [14]. The resulting workflow definitions can be used with an appropriate Model-Driven Development engine to fully automate the Data Analytics Platform software generation.

Furthermore, the DASE framework needs improvements so that it can fully harness the power of semantic web technologies, such as semantic inference capabilities. This way, the resulting data analytics platforms can provide intelligent recommendations to the analysts, further reducing the cognitive burden associated with the analysis.

Once the DASE framework is implemented within an organisation, that organisation will accumulate a repository of analytics knowledge over time, together with meta-data on how this knowledge is utilised by the analysts. For example, organisations can generate statistics on how the same analytical model performs on different datasets, and on which analytics algorithms are the best performing or most popular. This information can be used to build value-added applications such as meta-learning and eXplainable Artificial Intelligence (XAI) systems specialised for a specific organisational context. Meta-learning is a research direction in which data analytics platforms are designed to learn and adapt with experience gained by exploiting meta-knowledge extracted in previous analytics tasks or from different domains or problems [20, 31]. XAI is about creating a suite of analytics techniques that produce more explainable machine learning models with high performance, enabling humans to understand, trust and manage AI systems [2]. By stressing the importance of having a Knowledge Base and a service-oriented modular architecture as integral parts of analytics and AI platforms, the DASE framework offers a practical approach to managing the organisational meta-knowledge and semantics related to meta-learning and XAI.