Introduction

The Internet of Things (IoT) has continued to grow significantly in recent years, and the overall data volume generated by smart devices is expected to reach 79.4 zettabytes [1]. Consequently, data analytics already plays a central role in a variety of application areas and is expected to become even more important in the future [2]. To gain insights into the data of the ever-increasing number of smart devices, different technologies have been adopted for the IoT, such as Big Data processing and machine learning (ML) algorithms.

Together with the possibilities, however, the challenges arising from this development must also be considered. In terms of the IoT, analytics platforms provide the tools to design, implement and execute analytics pipelines to gain meaningful insights from the data of smart devices, but also from other data sources, such as environmental data. Although there are already numerous architectural approaches for IoT data analytics, these are most commonly based on the concepts of processing Big Data from other domains of information systems research and practice. Looking at the peculiarities of the IoT, these approaches are only of limited use, since the IoT exposes its own set of architectural challenges that need to be addressed by analytics solutions [3].

However, the challenges that analytics solutions face must be addressed differently depending on the application domain. For example, in the smart home domain, the Big Data challenge has to be viewed from an IoT platform provider's point of view: although a single customer and their smart devices may not produce huge amounts of data, the challenge becomes more pressing with a growing customer base. Industrial IoT applications, on the other hand, are Big Data-driven because of the volume of data generated by the sensors of machines in manufacturing processes. Therefore, the challenges for analytics architectures, e.g., real-time capabilities, the scalability of data processing and the storage of all analyzed and raw data, have to be considered from a different point of view than in the smart home domain. In addition, data security and privacy present further challenges that are important but also different in each application area of the IoT. This results in the need for flexible analytics architectures that can accommodate the individual manifestation of these challenges, depending on the use case. From a technical standpoint, a promising concept to address the diverging architectural needs of analytics solutions in different fields of the IoT is fog computing. It allows offloading data processing closer to the source of the data, thereby enabling analytics architectures to become more versatile by adding flexibility to analytics pipeline deployments.

In our previous work, we proposed an IoT analytics platform that utilizes the fog computing paradigm and is designed to enable hybrid analytics pipeline deployments in user-centric application domains, such as smart home [4]. In this regard, a hybrid analytics pipeline comprises data processing tasks that are executed at different network levels. In this work, we extend the architectural proposal as well as the background of this research and describe how the architecture can be utilized in additional IoT application domains. Furthermore, we enhance the evaluation of the architectural proposal to include experiments for weather and environmental monitoring as well as predictive maintenance.

The remainder of this paper is structured as follows: In Sect. "Background", we describe the background of our research. Furthermore, we name the challenges to be addressed by the architecture we propose. Afterwards, we present an overview of related works in this field. In addition, we describe how these are not fully suitable regarding the problem space (Sect. "Related Work"). The main contribution of this paper, a proposal for an analytics architecture usable in different IoT environments, is described in Sect. "Solution Proposal". In Sect. "Prototype", we present a prototypical implementation of the architecture. The application of the architecture in different IoT domains is described in Sect. "Application Domains". The evaluation of our approach is presented in Sect. "Evaluation". Finally, we summarize our findings and provide starting points for further research in this field as well as our own (Sect. "Conclusion and Outlook").

Background

The number of connected IoT devices worldwide was 7.741 billion in 2019 and is expected to rise to 16 billion by 2025 [5]. This development is accompanied by an increase in the amount of data generated. The creation of meaningful insights from these data is a cornerstone of generating added value from smart devices. For this reason, it is necessary to develop suitable analytics architectures that, for example as part of an IoT platform, are able to cope with the various challenges involved. These architectures and the resulting implementations are the foundation for the creation and operation of analytics pipelines. In this context, an analytics pipeline refers to the approach of processing data as it moves from a source to a sink through a series of different transformation and analysis steps (see [6]). Previous research on this topic has already identified a number of challenges that should be addressed by such architectures. The specifics of these challenges vary from one application domain to the next and their significance within a data analytics architecture varies accordingly. According to [3], the challenges for analytics architectures in the IoT are:

  • Big Data: IoT-based data processing must be able to handle huge amounts of data arriving at high velocity and from a variety of sources.

  • (near) Real-time analytics: the nature of a multitude of IoT applications requires data to be processed in (near) real-time.

  • Privacy and security: IoT data has to be secure from external access and the derived insights have to comply with all stakeholders' privacy requirements.

  • Stream handling: since IoT data are predominantly time series, an analytics architecture has to have the ability to handle data streams.

  • Scalable data processing: an analytics solution must be able to process any number of data streams.

  • Integrating data from different sources: to gain meaningful insights, data streams from different sources have to be combined.

  • Data storage of all analytics and raw data: all IoT data, raw and processed, have to be stored and be ready to be accessed at all times.

  • Flexible extension of data processing: fast-changing requirements in IoT applications require the rapid addition and reconfiguration of analytics capabilities.

  • Personalization of analytics: data analytics in the IoT are highly individual, especially in user-centric domains, for instance smart home. Analytics architectures have to provide the tools to map this individuality.

  • Fault-tolerant data input: the input data sources of analytics pipelines in the IoT are not always reliably available.

  • High network usage: the rising number of overall IoT devices creates network pressure that has to be addressed by analytics architectures to reduce latency and increase service availability.

  • Share analytics capabilities/data: analytics capabilities and data have to be shared across internal and external applications.

  • Integration of historic and real-time data: the mixture of real-time and historic data may provide additional insights into IoT applications.

  • Data visualization: the visualization of all raw and processed data is necessary to gain insights.

  • Energy efficiency: the increasing number of IoT devices and the resulting huge volumes of data create the need for analytics architectures to enable energy-aware analytics pipeline deployments.

  • Limited computing resources: some IoT environments provide limited computing resources or allow data analytics only under resource-constrained framework conditions.

To address the aforementioned challenges, we have proposed a fog-based architecture in [7] that was extended in [4]. In this context, fog computing describes a paradigm promoting the computation, storage, etc. of data anywhere from the cloud to the edge of the network [8, 9]. It is therefore different from the sometimes interchangeably used term “edge computing” [10], which specifically excludes the cloud for computational as well as related tasks and is limited to only a small number of network layers [9]. Consequently, the fog computing paradigm aims at combining cloud computing, edge computing and the IoT in a new architectural paradigm [11]. In IoT use cases, the edge of the network is synonymous with local networks that include IoT devices [10]. Therefore, fog computing nodes may comprise edge devices, such as gateways, access points, etc., but not IoT devices, such as sensors and actuators [8, 10].

Past scientific literature has interpreted fog computing in the sense that data processing is primarily performed on fog nodes and the cloud performs information aggregation tasks for decision support. In contrast, the main goal of our solution proposal is to enable analytics pipeline deployments along the diverging constraints of different stakeholders. This includes hybrid analytics pipelines, in which some processing tasks run on fog nodes, while others are computed at the cloud layer.

The cloud-based parts of this architecture have already been discussed and evaluated in [12] and [13] and address several of the aforementioned challenges. In this paper, the components required to enable data processing at fog nodes are further specified and the components for message exchange between them and the cloud are added. Moreover, we extend the architecture to include components for importing data from external data sources. Additionally, we present three application scenarios relating to smart home, weather and environmental monitoring as well as predictive maintenance to show how our proposed architectural concept can be adopted in different IoT domains.

Related Work

In recent years, several architectural approaches for data analytics in IoT environments have been developed to facilitate different application scenarios. To provide an overview of this literature, we conducted a review following [14]. The results were analyzed regarding the utilized computing paradigm and application scenario.

In terms of architectural proposals without a specific application domain assigned, Rathore et al. present a cloud-based, real-time analytics architecture for IoT applications [15]. However, their approach does not address fog-related challenges, e.g., limited computing resources in IoT applications. Cao & Wachowicz propose an analytics architecture for real-time analytics use cases in fog environments [16]. Their architecture follows a top-down approach for defining and configuring analytics pipelines, thus reducing the flexibility of the solution in terms of fast-changing analytics pipeline requirements or the inclusion of new processing methods. Sun et al. highlight the benefits of task offloading in terms of time and energy savings and propose a fog analytics architecture [17]. Bhattacharjee et al. propose a Big Data-as-a-Service approach for conducting IoT analytics [18]. The approach is cloud-based and no specific application domain is given. Akbar et al. present a cloud-based analytics architecture that focuses on Big Data-related challenges [19]. It utilizes streaming technologies together with complex event processing but does not address the challenges related to edge processing, such as limited computing resources or network usage, in IoT applications. Marah et al. conceptualize and evaluate an edge-centric analytics architecture for IoT applications that focuses on rapid decision making and low latency [20]. However, the feasibility of this approach with respect to Big Data is not evaluated.

In the field of smart home analytics architectures, Bhole et al. present an architecture to be deployed on edge devices in smart homes [21]. Their field of application is home automation and they utilize several different ML-based algorithms. Constant et al. propose a fog-based architecture that centers around a fog gateway device [22]. They evaluate their approach using smart wearable data without limiting their approach to a specific field of application. Popa et al. also present a fog-based architecture for home automation, in which deep neural networks are trained and stored in the cloud and the resulting models are pulled and applied by an IoT agent on an edge device [23]. Furthermore, they present applications for non-intrusive load monitoring and energy load forecasting. Singh and Yassine describe a fog analytics architecture for energy management in households [24]. Their fog computing nodes' main functionality is the pre-processing of data. Hasan et al. and Al-Ali et al. propose cloud-based analytics architectures for energy management [25, 26]. Data analytics are supported by Big Data technologies in all approaches. Another cloud-based analytics architecture is presented by Fortino et al., whose main focus lies on activity recognition in smart homes [27]. Paredes-Valverde et al. propose a smart home analytics architecture that is not associated with a computing paradigm and revolves around the use case of energy consumption monitoring and forecasting [28]. Finally, Pathak et al. highlight a fog-based analytics architecture utilized for ML-based applications of data streams from surveillance cameras [29]. However, both approaches focus on a specific use case and therefore only consider a limited number of the identified challenges for analytics architectures.

Regarding the integration of external data sources into IoT analytics architectures for weather and environmental monitoring, Senožetnik et al. propose a solution that allows the retrieval of groundwater data from different sources for real-time queries and analysis [30]. However, their solution only offers limited analytics capabilities. Kwon et al. present an IoT framework that handles sensor and externally available data [31], but their solution lacks the ability to process historic data. Heidecker et al. try to optimize crop yield and water usage in agriculture using an IoT solution based on FIWARE [32]. Terroso-Saenz et al. follow a similar approach to predict the energy consumption of HVAC systems based on weather conditions and sensor data [33]. Using a framework like FIWARE, however, makes these solutions highly complex [34]. Luckner et al. introduce a smart city IoT application based on the Lambda architecture scheme [35]. Since their solution is based on the Lambda architecture, two separate processing logics for historic and real-time analytics are required.

Looking at architectural proposals for predictive maintenance applications, a comprehensive overview of different data processing approaches is given by Hafeez et al. in [36]. They found that available solutions are designed to utilize ML techniques and algorithms. In this regard, data preprocessing is carried out at edge devices while model training is done in the cloud. The actual model application is either done at edge devices or in the cloud. They also propose their own hybrid architecture in which data reduction is done at the edge layer and the predictive maintenance application is implemented in the cloud layer. However, they do not address any of the challenges presented in Sect. "Background" specifically. Sahal et al. propose an approach that employs different Big Data technologies in a cloud environment [37]. Still, they acknowledge the need to support edge processing of data. A similar solution is presented by Calabrese et al. in [38]. However, they do not provide components for data processing at edge devices. Kefalakis et al. introduce a fog-based analytics architecture that leverages edge devices for data integration and pre-processing [39]. They also allow user-defined analytical functions and flexible configuration of analytics pipelines. Gröger presents an IIoT analytics platform that leverages several Big Data technologies and integrates enterprise data sources, such as Enterprise Resource Planning [40]. The selected data processing technologies suggest that edge devices are not utilized.

We found that none of the analytics architectures analyzed support hybrid analytics pipeline deployments with regard to the challenges described in Sect. "Background". Furthermore, most of the solutions are based on rather static analytics pipelines that are defined a priori, thus missing flexibility in terms of changing requirements. Finally, almost all approaches focus on one or a handful of use cases or at best an entire application domain of the IoT. Consequently, their reusability in different application scenarios has not been proven.

Solution Proposal

In this section, the proposed IoT analytics architecture that comprehensively addresses the challenges described in Sect. "Background" is presented. It is based on previous research of ours that is published in [4]. The overall architecture is visualized in Fig. 1. It was designed according to the method of conceptual modeling [41] and follows the microservice principle. Consequently, the individual components of the overall architecture were created in such a way that they, ideally, have exactly one task and are therefore slim and easy to replace. Moreover, the proposed solution only includes a single processing layer for streaming data, thus promoting the advantages of the Kappa architecture approach that was first described in [42].

Fig. 1 Proposed architectural concept, showing components and interfaces, adapted from [4]

At the cloud layer, the architecture consists of an orchestration, streaming, import management and serving platform as well as a message broker, a messaging bridge and a cloud connector. The fog platform is deployed at network layers closer to the edge of the network and contains components that enable local data processing and forwarding. Depending on the application domain, the fog platform can be deployed on a variety of computational devices, e.g., single-board computers. The proposed architecture enables a flexible deployment of analytics pipelines at different network layers, including pipelines with processing tasks distributed over multiple network layers. Moreover, it allows the integration of external data sources into analytics pipelines.

The core component that enables hybrid analytics pipelines is the orchestration platform that consists of five microservices and a graphical designer component. The main purpose of the orchestration platform is to manage the deployment and configuration of analytics operators. These are single-purpose microservices that handle all analytics and data processing tasks in streaming fashion. Moreover, analytics operators can be deployed in the fog and the cloud stratum to provide data processing capabilities at every network level. Analytics operators are packaged using container technology (e.g., Docker) and deployed via container images that are stored in public or private repositories. An individual analytics operator is an encapsulated data processing step and has inputs, outputs and configuration values. In this regard, inputs are data structures that are expected by the analytics operator so that it can perform its processing task. Outputs describe the data structure that can be expected from the operator after processing a data point. The use of configuration values allows the tweaking of the data processing logic of an analytics operator. Information about inputs, outputs, configuration values and the container image name are stored in the operator repository (1).
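As an illustration, the metadata stored in the operator repository for a hypothetical analytics operator could take the following form; the field names and values are assumptions made for this sketch, expressed as a Python structure, and not the exact repository schema:

example_operator = {
    "name": "total-consumption-adder",                        # hypothetical operator name
    "image": "registry.example.org/operators/adder:1.0",      # container image in a (private) repository
    "inputs": [{"name": "value", "type": "float"}],            # data structure expected by the operator
    "outputs": [{"name": "sum", "type": "float"}],             # data structure produced by the operator
    "config_values": [{"name": "unit", "type": "string", "default": "kWh"}],  # tweakable processing logic
}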

One or more analytics operators can be composed into an analytics flow, thereby allowing the design of more complex processing logic. Analytics flows serve as blueprints for analytics pipelines and can be instantiated indefinitely. While the description of an analytics flow can be done manually, the proposed architecture includes the flow designer component that provides a graphical interface to enable users to model analytics flows. Furthermore, the flow designer uses a distinct modeling notation to visualize analytics flows. To this end, the flow designer uses the metadata about analytics operators from the operator repository (2). Designed analytics flows are stored in the flow repository (3).
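Conceptually, an analytics flow then wires operator outputs to operator inputs. A minimal, hypothetical flow description (again a sketch, not the notation used by the flow designer) could look like this:

example_flow = {
    "name": "energy-forecast-flow",
    "operators": [
        {"id": "adder", "operator": "total-consumption-adder"},
        {"id": "forecast", "operator": "consumption-forecast"},
    ],
    "edges": [
        {"from": "adder.sum", "to": "forecast.value"},         # output port -> input port mapping
    ],
}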

Analytics pipelines are instantiated from analytics flows by triggering the flow engine (4) via its interface. The flow engine obtains a unified flow description from the flow parser (5), which reads and parses a previously created analytics flow from the flow repository (6). Since the flow parser unifies the flow description, analytics flows may be stored in different modeling notations, extending the possible application domains of the overall architecture. Successfully started analytics pipelines are registered by the flow engine in the pipeline registry and deleted after termination (7). In the presented architecture model, the orchestration platform and streaming platform are operated in the cloud. Against this background, all cloud-based analytics operators are started by the flow engine via an interface to the underlying container orchestration platform, e.g., Kubernetes (8). Furthermore, cloud-based analytics operators of an analytics pipeline consume data from the log data store according to the analytics flow on which the analytics pipeline is based (9). Additionally, they write the processed data back to the log data store.
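To illustrate step (8), the following sketch shows how an analytics operator container could be started as a Kubernetes deployment using the official Python client. The prototype's flow engine is written in Golang and interfaces with Rancher, so this is purely an illustrative approximation; the operator name, image and namespace are assumptions:

from kubernetes import client, config

def deploy_operator(name: str, image: str, config_values: dict, namespace: str = "analytics") -> None:
    config.load_kube_config()  # assumes a local kubeconfig; in-cluster configuration is also possible
    container = client.V1Container(
        name=name,
        image=image,
        # configuration values of the analytics operator are passed as environment variables
        env=[client.V1EnvVar(name=key, value=str(value)) for key, value in config_values.items()],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(containers=[container]),
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels={"app": name}),
            template=template,
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(namespace=namespace, body=deployment)

deploy_operator("consumption-forecast", "registry.example.org/operators/forecast:1.0", {"TARGET": "end-of-day"})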

The central component of the streaming platform is the log data store that is managed by a message broker, such as Apache Kafka, to provide scalable access to all data of the platform. The streaming platform persists data from various sources, such as sensor data from IoT devices or data from web-based APIs. The data in the streaming platform is available to all cloud-based analytics operators as well as other applications. In this regard, third-party or visualization applications may process all analytics results or raw data by querying the serving platform that consumes and stores data from the log data store (10).

Raw IoT device data as well as data that were processed at the fog level are pushed to the cloud message broker by the fog platform (11). From there, they are pulled (12) and ingested into the streaming platform (13) by the cloud connector. The messaging bridge handles the communication in the opposite direction between the log data store and the cloud message broker (14, 15). As a consequence, data processed in the cloud can be sent to the fog message broker via the fog connector to be processed by fog-based analytics operators.

Data from web-based APIs is inserted and managed using the import management platform. In this context, external data sources provide data in various formats and make them available via a variety of protocols. Therefore, each data source requires a source-specific implementation that we refer to as an import type. An import type description consists of outputs, configuration values and a container image name and is stored in the import repository (16). Import types can be instantiated as imports using the import deployment service (17). The import deployment service retrieves the import type description from the import repository (18) and deploys the import at the streaming platform by interfacing with the underlying container management platform (19). The import collects data from the external data source (20), converts it into a unified format for further processing and inserts the data into the log data store (21).
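The following sketch outlines the core loop of such an import: it polls a web-based API, converts the payload into a unified format and writes it to the log data store. The URL, topic name and field names are illustrative assumptions, not the prototype's actual import implementations:

import json
import time
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                      # assumed address of the streaming platform
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

API_URL = "https://example.org/air-quality"              # placeholder for an external data source
TOPIC = "import.air-quality"                             # assumed topic naming scheme

while True:
    raw = requests.get(API_URL, timeout=10).json()
    unified = {                                           # unified platform format (assumed structure)
        "time": raw.get("timestamp"),
        "value": raw.get("pm10"),
        "meta": {"source": "air-quality-import"},
    }
    producer.send(TOPIC, unified)                         # insert the data point into the log data store
    time.sleep(900)                                       # poll every 15 min, matching the source's update interval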

When deploying analytics pipelines in a fog or hybrid configuration, communication and orchestration become major challenges. To address these, the proposed architecture includes message brokers and connector services in both the cloud and fog strata. These handle the bidirectional data transfer between the network layers. If an analytics pipeline is to be deployed in the fog stratum or in a hybrid manner, the flow engine sends the appropriate information about relevant analytics operators to the cloud message broker (22). These requests are then read by the fog connector (23) and published to the fog message broker within the fog platform (24) so that they can be consumed and processed by the fog master (25). The fog master checks the available fog agents and publishes the provisioning requests for analytics operators in fog agent-specific topics. The fog agent receives the request (26) and begins provisioning the analytics operator (27). For this to work, the fog agent must be deployed on a hardware platform that supports container execution. Fog agents monitor and report analytics operator resource usage to the fog master through the fog message broker. The fog master manages provisioning requests for analytics operators with respect to a fog agent's available resources. The fog stratum may contain one or more processing nodes, each running a fog agent.
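A minimal sketch of the fog agent's provisioning loop is given below: the agent subscribes to its agent-specific topic on the fog message broker and starts the requested analytics operator container. Topic names, the message format and the use of the Docker SDK for Python are assumptions made for illustration; the prototype's fog agent is implemented in Golang:

import json
import docker
import paho.mqtt.client as mqtt

AGENT_ID = "fog-agent-1"                                  # assumed agent identifier
docker_client = docker.from_env()

def on_message(client, userdata, msg):
    request = json.loads(msg.payload)                     # provisioning request published by the fog master
    docker_client.containers.run(
        request["image"],                                 # container image of the analytics operator
        detach=True,
        environment=request.get("config", {}),            # operator configuration values
        name=request.get("operator_id"),
    )

mqtt_client = mqtt.Client()                               # paho-mqtt 1.x style client
mqtt_client.on_message = on_message
mqtt_client.connect("fog-message-broker", 1883)
mqtt_client.subscribe(f"fog/agents/{AGENT_ID}/deploy")    # agent-specific provisioning topic
mqtt_client.loop_forever()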

Similar to their cloud-based counterparts, analytics operators deployed in the fog stratum consume data streams from the fog message broker according to the underlying analytics flow. Furthermore, they publish their processing results to the fog message broker (28). Hence, the data relayed through the fog message broker originates from other deployed fog-based analytics operators, the streaming platform in the cloud or from IoT devices (29). In the case of a hybrid analytics pipeline, the results of the fog-based analytics operators are forwarded from the fog connector to the cloud message broker (30).

Prototype

We implemented the proposed solution of Sect. "Solution Proposal" as a proof of concept in an integrated software prototype to evaluate the approach. All artifacts are released as open-source software. The prototype follows the microservice paradigm, therefore all components are fine-grained and loosely coupled. We encapsulated all microservices using the container technology Docker to improve their reusability and mobility. This is particularly important for components that are instantiated multiple times, such as analytics operators, imports or fog agents. Where possible, Alpine Linux was used as the base for the container images to reduce the image size, thus reducing download times and storage requirements. To automate the deployment of all containers, the Kubernetes container orchestration system was used. Furthermore, Rancher was utilized to manage the Kubernetes cluster. It was selected because it addresses several operational and security challenges that arise when running a Kubernetes cluster. In addition, the software is not dependent on any particular hardware infrastructure.

In the cloud stratum, the operator repository and flow repository were implemented in Python, while the flow engine, flow parser, import repository, import deployment service and pipeline registry were implemented in Golang. All services provide RESTful CRUD interfaces for interaction via HTTP. In contrast, the communication between the different network layers was realized utilizing the MQTT protocol. We opted for MQTT since it is a lightweight communication protocol that is widely adopted in IoT use cases [43]. The flow engine and the import deployment service use the API of the underlying container management platform, Rancher, to start, stop or restart analytics operators and imports, respectively. The operator, flow and import repositories as well as the pipeline registry persist their data in MongoDB instances. The document database was chosen because it is schema-less and thus well suited for prototyping services. Each service uses its own MongoDB instance, following the one-database-per-service design pattern. This promotes the resilience of the overall system, as the failure of one database does not affect other services. In addition, the databases can be scaled independently of each other.

To manage the services of the prototype and visualize data of experiments, we developed a frontend application using the Angular framework. It was chosen because it offers a wide range of functionality for enterprise applications and is supported by a large community as well as global internet companies. In this context, the flow designer was integrated into the frontend application. It was implemented on the basis of TypeScript and the JointJS graphics library. An example implementation of an analytics flow as a flowchart is shown in Fig. 2. Specifically, the model of experiment 1, which is presented in Sect. "Smart Home", is shown. An analytics flow model consists of nodes (gray) representing analytics operators and directed edges defining the data flow between the outputs and inputs (white circles) of analytics operators. The elements of an analytics flow can be added in the flow designer and combined in any way.

Fig. 2 Screenshot of an example analytics flow in the flow designer. The gray squares are analytics operators. The white circles on the left side of an analytics operator represent inputs, those on the right side represent outputs. Edges show the data flow between the analytics operators and define the mapping between output and input ports. Metadata about the analytics flow can be entered in the upper area of the form. The right side contains all analytics operators that can be used for modeling.

In the development of analytics operators, a distinction is made between cloud and fog variants. Cloud-based analytics operators interact with the streaming platform that was implemented in the prototype using Apache Kafka. Consequently, Kafka Streams was selected as the technological basis for the implementation of cloud-based analytics operators. Based on this, a Java library was developed to standardize the implementation of analytics operators. It also ensures the basic compatibility of developed analytics operators with the platform in terms of internal data formats and interfaces. The library enables the configuration of inputs, outputs and configuration values for an analytics operator, which can be found in the operator repository and used to create analytics flows. In addition, when an analytics operator is executed, the streaming platform data flows are automatically interconnected and data schemas are unified. As a consequence, any number of different data streams can be accessed in an analytics operator. With regard to fog-based analytics operators, a software library has also been created in Python, which, analogous to the library for cloud-based analytics operators, simplifies development. However, in this case, data is provided via the MQTT protocol. For this reason, this library was developed based on the Paho MQTT client, which is used in a large number of projects. The serving platform consists of an InfluxDB instance and a management service written in Golang that provides CRUD interfaces via HTTP. InfluxDB was chosen because it specializes in time series data and IoT data is almost exclusively temporally related. The serving platform management service utilizes the API of the underlying container orchestration platform to start, stop and restart worker services. These worker services are written in Python and transfer raw, imported and processed data from Apache Kafka to the InfluxDB.
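Based on this library concept, a fog-based analytics operator essentially subscribes to its input streams at the fog message broker, applies its processing step and publishes the result. The following minimal sketch (assumed topic names, paho-mqtt 1.x style, placeholder processing logic) illustrates this pattern without relying on the actual library:

import json
import paho.mqtt.client as mqtt

INPUT_TOPIC = "fog/analytics/adder/output"                # assumed input stream
OUTPUT_TOPIC = "fog/analytics/example/output"             # assumed output stream

def on_message(client, userdata, msg):
    data = json.loads(msg.payload)
    result = {"time": data["time"], "value": data["value"] * 2}   # placeholder processing step
    client.publish(OUTPUT_TOPIC, json.dumps(result))

client = mqtt.Client()
client.on_message = on_message
client.connect("fog-message-broker", 1883)
client.subscribe(INPUT_TOPIC)
client.loop_forever()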

The messaging bridge and cloud connector are both implemented in Golang and handle the data transfer between the streaming platform and the cloud message broker. In this regard, the messaging bridge implements a Kafka consumer that subscribes to topics in the streaming platform and publishes the data at the cloud message broker using the Paho MQTT client. The cloud connector has the same structure, with the transmission paths running in the opposite direction. A VerneMQ instance was used as the cloud message broker. This decision was driven by the out-of-the-box scalability capabilities of the software as well as its support for various authentication methods.
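The relay logic of the messaging bridge can be sketched as follows; this Python sketch only mirrors the described behavior (the actual bridge is written in Golang) and the topic names are assumptions:

from kafka import KafkaConsumer
import paho.mqtt.client as mqtt

mqtt_client = mqtt.Client()                               # paho-mqtt 1.x style client
mqtt_client.connect("cloud-message-broker", 1883)
mqtt_client.loop_start()

consumer = KafkaConsumer(
    "analytics.results",                                  # assumed topic in the streaming platform
    bootstrap_servers="kafka:9092",
)

for record in consumer:
    # Relay the raw payload; the fog connector on the other side forwards it to the fog message broker.
    mqtt_client.publish("cloud/to-fog/analytics.results", record.value)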

In the fog stratum, the fog agent and fog master were implemented in Golang. The fog connector was implemented in Python and serves as communication gateway between the different layers. During the development of the fog platform, we gave special attention to the resource availability constraints of this network layer. To reduce the system resource requirements, we used MQTT as the communication protocol between all fog components and Eclipse Mosquitto as the fog message broker.

Application Domains

IoT technologies are found in a large variety of industry and business domains, but also in user-centric application areas. As a result, analytics scenarios differ in terms of data structure, volume and velocity as well as the purpose of the analysis. Consequently, the challenges for analytics architectures described in Sect. "Background" change depending on the problem space. Nevertheless, the proposed architecture can be utilized for use cases of different domains. In the following, we describe possible applications in different domains using real-world scenarios. More specifically, we introduce three use cases from the areas of electricity consumption prediction in smart homes, weather and environmental monitoring, and predictive maintenance in Industrial IoT (IIoT) scenarios. For each use case, we describe the application scenario in detail as well as the specific embodiment of the challenges for analytics architectures presented in Sect. "Background".

Smart Home

The global number of smart homes is expected to rise to 478 million by the year 2025 [44]. This number highlights the growing significance of the smart home domain, in which the challenges for IoT analytics architectures, as provided in [3], are primarily influenced by the residents. Specifically, privacy and security aspects regarding IoT data are a major concern for them. Therefore, an analytics solution has to provide tools to enable measures that increase data security and privacy, e.g., anonymization techniques and data encryption. Furthermore, common smart homes only provide limited computing resources, thus requiring the ability to offload analytics tasks to cloud data centers. Another important challenge is fault-tolerant data input, since analytics pipelines need to keep running even if connectivity issues or similar problems occur. Since the smart devices available in smart homes usually differ in terms of vendors as well as numbers and types, personalization of analytics, meaning the individual composition and configuration of analytics pipelines, is another important issue to be addressed. Since most analytics services in smart home environments are provided as part of an IoT platform to many end-consumers, the architectural challenges are also driven by platform vendors. From their perspective, handling Big Data, scalable data processing as well as data storage are needed to engage the large number of data sources of their customers. Moreover, the high network usage of these smart devices needs to be addressed. From an analytical standpoint, the flexible extension of data processing as well as the integration of data from different sources and of historic and real-time data have to be possible to provide meaningful smart home analytics services. Additionally, data visualization is important for residents to gain insights into various consumption and monitoring scenarios in the smart home. The overall character of IoT data also demands real-time processing and consequently stream handling capabilities of smart home analytics architectures. Specifically, use cases revolving around alarm systems or smoke detectors have little tolerance for latency.

The combination of these challenges creates a number of constraints in the smart home domain that must be addressed by analytics architectures. Currently available approaches in this field are either cloud-based Big Data solutions or application-specific solutions that lack the required flexibility with regard to different consumer, legal or technical requirements and constraints. Additionally, analytics scenario requirements are not known beforehand by platform providers or may change rapidly, thus creating the need for flexible pipeline composition and deployment. In this context, we also define our motivational scenario for the smart home application, which revolves around forecasting the total energy consumption of a household and is the basis for the experiments conducted in Sect. "Evaluation". The total energy consumption is based on the individual consumption of various electrical consumers and is measured, in this scenario, via smart plugs or other sub-meters. From an analytical point of view, the composition of the associated analytics pipeline is based on the number of available devices and their device types. Furthermore, it should be possible to use different calculation methods, which, however, also have different requirements concerning available computing resources. In this regard, especially ML algorithms have shown promising results with respect to energy consumption forecasting accuracy [45]. Finally, there are the requirements of the residents who, for example, might want data processing to take place on their local devices so that their data cannot be used by third parties.

Weather and Environmental Monitoring

With global temperatures on the rise and the financial cost of natural disasters having almost doubled over the last 40 years [46], weather and environmental monitoring becomes increasingly important. For example, extreme temperatures have a large effect on mortality [47]. Looking at the increased urbanization on a global scale [48] and cities being particularly prone to high heat [49], this issue is expected to become even more pressing in the near future [50]. In the context of the IoT, weather and environmental monitoring is usually based on large numbers of different sensors that measure various environmental metrics ranging from temperature to particle density. As a result, analytics solutions in this domain have to be able to integrate and process large amounts of data (Big Data) and allow for scalable data processing due to the data volumes needed for accurate weather modeling. To detect trends in climate, the integration of historic and real-time data is also required. Weather and other climate data (e.g., air pollution data) are typically collected and provided by multiple government agencies. To enable the combined processing of all data, integrating data from different sources is required. Rapid changes in weather and environmental conditions call for stream handling and (near) real-time analytics, e.g., to alarm citizens in case of a natural disaster. Since individuals might be affected differently by specific environmental conditions (e.g., pollen is particularly important for persons suffering from allergies), personalization of analytics is an important aspect.

All these challenges need to be addressed by IoT analytics architectures. In this regard, we propose a use case in the application domain of weather and environmental monitoring: the monitoring of diverse conditions to detect optimal time frames for outdoor activities. To detect these optimal time frames, weather forecasts, air pollution data and pollen data have to be analyzed. Users may define the importance of each parameter by specifying thresholds and scores to adapt the analytics to their personal preferences or medical conditions. By referencing historic analytics results, users may also be able to detect when optimal conditions typically apply.

Predictive Maintenance

Predictive maintenance use cases are part of the IIoT domain. The practice of evaluating a machine's condition to predict if and when maintenance should take place has been part of industrial manufacturing for a long time [51]. With the emergence and broad adoption of IoT and ML technologies in manufacturing processes, new approaches for predictive maintenance have been introduced. In this context, various data types are used as input for the resulting applications. These include data concerning the machine state, e.g., temperature, pressure, energy consumption, vibration [52] or press-in curves, but also data from manufacturing process databases, e.g., machine and process settings or traceability information from components of products. Especially during the manufacturing processes of critical components, e.g., electronic control units of cars, huge amounts of data are created for traceability and legal reasons. Consequently, these data can also be utilized for predictive maintenance applications and shape the challenges presented in Sect. "Background". As a result, analytics architectures need to address the challenges of Big Data, integration of data from different sources and scalable data processing. These data are provided via embedded or attached sensors of machines that either emit them as a stream or create flat files. Based on this, analytics architectures have to provide stream handling capabilities as well as the personalization of analytics with regard to heterogeneous data sources and use case-dependent data transformation. The large volumes of data created also influence the challenges of high network usage and limited computing resources, since predictive maintenance approaches require extensive resources and companies do not limit their efforts to single machines, but rather entire production or assembly lines. At the operational level, data-driven predictive maintenance approaches are based on prediction models that are conceptualized and trained using historical data. Their application is carried out, depending on the use case, either in real-time [52] or at regular intervals, e.g., daily [38]. Consequently, analytics architectures in this field need to support (near) real-time analytics, but also the integration of historic and real-time data as well as the storage of all analytics and raw data. Regarding privacy and security, analytics architectures have to provide tools to ensure data integrity at all times. In terms of actions taken based on the results of predictive maintenance approaches, sharing analytical data for automated decision making as well as data visualization are needed in analytics architectures. Finally, the flexible extension of data processing is crucial in predictive maintenance use cases so that changed configurations of production lines or machines can be mapped seamlessly. Fault-tolerant data input and energy efficiency are challenges that are not as pronounced in predictive maintenance applications as in other IoT use cases. Although fault-tolerant data input is in principle a prerequisite for reliable analytics pipelines, it can be assumed that the non-availability of data from a machine is always related to its standstill or defect, therefore defeating the purpose of the application. Energy efficiency is rather seen as a result of the successful application of predictive maintenance.

In the context of the proposed architecture, the described challenges can be mapped according to their individual manifestation in a predictive maintenance use case. In this regard, we define the motivational scenario as the prediction of future defects of a machine for the production of special components. In this scenario, the machine owner does not possess the knowledge or resources to implement a predictive maintenance approach themselves and therefore utilizes the computing resources of an external IoT platform provider. The machine data is stored in several flat files with comma-separated values. Furthermore, it should be possible for the machine owner to use different predictive maintenance approaches that are provided by third-party vendors. Consequently, the resulting analytics pipeline has to merge and transform the input data files and make the resulting data streams accessible to external services to predict machine defects. The prediction is to be executed on a daily basis.

Evaluation

In this section, we describe the quantitative evaluation of the proposed architectural solution based on the prototype introduced in Sect. "Prototype". It serves as a proof of concept to show that the proposed analytics architecture can be used to process and analyze data in different real-world scenarios of different IoT domains. Furthermore, the purpose of the experiments is to show that the architectural proposal can also deploy analytics pipelines at fog nodes as well as in a hybrid manner. Moreover, we evaluate how different application scenarios influence the viability of the two deployment strategies by investigating the latency of data processing as well as the overall resource usage. In this regard, previous research of ours has already shown that the cloud components of the proposed architecture are able to handle Big Data problems in Smart Home environments that might also include ML technologies [13].

Experiments

Smart Home

To enable sufficient experimentation in a smart home setting as described in Sect. "Smart Home", we deployed the fog platform introduced in Sect. "Solution Proposal" in a real-world apartment. This apartment was equipped with 97 different smart home devices, thus providing sensing and actuating capabilities. The residents had previously agreed to participate in our IoT-related research. This experimental setting was set up in August 2020 and has been running ever since. In the context of this work, we utilized the real-world energy consumption data generated in this environment to conduct four field experiments with regard to [53].

Overall, these experiments revolved around the energy consumption scenario described in Sect. "Smart Home". The available energy consumption data sources in the household were:

  • 35 Gosund SP111 smart plugs that were flashed to use the Tasmota firmware version 8.3.1

  • 1 Fibaro wall plug Type F

  • 16 Fibaro Walli Switches

Each of the Gosund smart plugs sent its current energy consumption data every 10 s, the Fibaro devices every 30 s. The number of smart plugs and switches was needed to cover all electrical consumers and circuits of the apartment.

For each experiment, an analytics pipeline was deployed that forecasts the energy consumption of the household for the end of the day, the end of the month and the end of the year. The analytics pipelines comprised three analytics operators:

  • Adder-Operator (AO): calculates the total energy consumption by adding the energy consumption of individual sources. Additionally, it writes the current timestamp and an ID in the result message.

  • Forecast-Operator (FO): processes the messages from the AO and forecasts the total energy consumption value for different dates.

  • Latency-Operator (LO): calculates the latency between the start and the end of the processing of a message by consuming the messages of the FO.

To compare different deployment strategies as well as application scenarios, we implemented four versions of the FO: two as fog analytics operators (deployed at a fog node) and two as cloud analytics operators. For each deployment layer, one of the FOs was implemented using a simple, updating linear regression based on [54]. The other one was implemented utilizing the adaptive random forest regressor, an online ML algorithm based on the work of Gomes et al. [55]. Consequently, the result was four different experiments that corresponded to the following four analytics pipeline compositions:

  • Experiment 1 (E1): a fog-only analytics pipeline with updating linear regression forecasting,

  • Experiment 2 (E2): a fog-only analytics pipeline with adaptive random forest regressor forecasting,

  • Experiment 3 (E3): a hybrid analytics pipeline with updating linear regression forecasting,

  • Experiment 4 (E4): a hybrid analytics pipeline with adaptive random forest regressor forecasting.

During E1 and E2, all data were processed at the fog layer (Fig. 3). In contrast, during E3 and E4, the data were sent from the AO at the fog layer to the cloud layer, processed by the FO and relayed back to the LO at the fog layer. The design of the experiments with the LO at the end of the analytics pipeline simulates a possible actuator that is triggered based on the results of the FO.
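To illustrate the simpler of the two forecasting approaches, the following sketch shows an updating linear regression over (timestamp, total consumption) pairs, as used conceptually by the FO in E1 and E3; it is an illustration of the technique, not the exact operator code, and the example values are made up:

class UpdatingLinearRegression:
    # Incrementally fitted simple linear regression based on running sums.
    def __init__(self):
        self.n = self.sum_x = self.sum_y = self.sum_xx = self.sum_xy = 0.0

    def update(self, x: float, y: float) -> None:
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def predict(self, x: float) -> float:
        denominator = self.n * self.sum_xx - self.sum_x ** 2
        if denominator == 0:
            return self.sum_y / self.n if self.n else 0.0
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denominator
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope * x + intercept

# Learn from messages of the AO and extrapolate the total consumption at a future
# timestamp, e.g., the end of the day.
model = UpdatingLinearRegression()
model.update(1_600_000_000, 10.5)
model.update(1_600_000_010, 10.7)
forecast_end_of_day = model.predict(1_600_041_600)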

Fig. 3 Components and data flow of experiments 1–4

Weather and Environmental Monitoring

For the weather and environmental monitoring use case outlined in Sect. "Weather and Environmental Monitoring", we designed experiment 5 (E5). In this context, we implemented an analytics pipeline consisting of four imports and one analytics operator. The following imports have been implemented:

  • Weather forecast from Yr, a weather service by the Norwegian Meteorological Institute. This includes various weather forecast parameters such as air temperature, air pressure, wind speeds, relative humidity etc. The forecast is updated every 30 min.

  • Pollen data from the German weather service. The German weather service provides daily pollen density levels as well as forecasts for the next day. To achieve more accurate results, individual levels are provided on a regional level.

  • Precipitation data from the German weather service. The data is provided with a geographical precision of 1 km² by the combined analysis of radar data and ground stations (ombrometers). These precipitation sums are updated hourly.

  • Air quality data from the German Environment Agency. Depending on the station selected, different air quality indicators are measured. This includes particulate matter, ozone and nitrogen dioxide measurements, which are updated every 15 min.

The analytics operator consumed the imported data alongside temperature readings from a multisensor in the fog stratum and calculated a score based on user preferences. This score served as a metric to describe environmental conditions. In this regard, users define an upper threshold $t_i$ and a score $s_i$ that is added to the overall score $S$ if the value $v_i$ is below the threshold. The analytics operator calculates the overall score using the following formula, where $\mathbb{1}[\cdot]$ denotes the indicator function:

$$S = \sum_{i = 0}^{n} s_{i} \cdot \mathbb{1}\left[v_{i} < t_{i}\right]$$

In this experiment, we assumed a person suffering from allergies who is trying to detect the best time frames for outdoor sports activities. In order to detect these, the operator’s inputs, thresholds and scores were defined as described in Table 1. The components of the experiment are visualized in Fig. 4.
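A minimal sketch of the score calculation is given below. The parameter names, thresholds and scores are illustrative assumptions following the logic described above (four parameters with a score of 5 each yield the maximum score of 20), not necessarily the exact configuration of Table 1:

def environment_score(values: dict, thresholds: dict, scores: dict) -> int:
    # A parameter contributes its score s_i only if its value v_i is below the threshold t_i.
    return sum(scores[p] for p, v in values.items() if v < thresholds[p])

values = {"pollen": 1, "pm10": 12.0, "precipitation": 0.0, "temperature": 19.5}
thresholds = {"pollen": 2, "pm10": 20.0, "precipitation": 0.1, "temperature": 25.0}
scores = {"pollen": 5, "pm10": 5, "precipitation": 5, "temperature": 5}

S = environment_score(values, thresholds, scores)         # evaluates to 20 in this example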

Table 1 Inputs, data sources, thresholds and scores used in E5
Fig. 4 Components and data flow of experiment 5

With this configuration, a maximum score S of 20 can be achieved (indicating optimal conditions for outdoor sports activities). The dataset used in this experiment contained 442,672 measurements from the region of Leipzig, Germany. It includes data between 1 January 2021 and 8 September 2021. The majority of the measurements are precipitation levels, since we selected a large area of 28 km² and the data is frequently updated. In contrast, the pollen density levels are only updated once a day and thus only comparably few measurements were collected. For the air quality measurements, we selected the station Leipzig-Mitte, because it is located directly in the city center. Finally, the multisensor provided a temperature measurement every hour.

Predictive Maintenance

Regarding the predictive maintenance use case described in Sect. "Predictive Maintenance", we designed experiment 6 (E6) to evaluate the feasibility of the proposed architecture in this context. During preliminary studies, we observed that manufacturing machine data often comes in flat files. Therefore, we created a fog component that enables the conversion of multiple raw manufacturing machine data files into the required platform format. The data were forwarded to the cloud platform as described in Sect. "Solution Proposal".

To analyze the data, two analytics pipelines were used. The first analytics pipeline was responsible for the creation of the data files required for training a machine learning model based on random forest classification. The second analytics pipeline was responsible for using this machine learning model to perform the predictions. All components and the data flow of this experiment are visualized in Fig. 5.

Fig. 5 Components and data flow of experiment 6

The first analytics pipeline comprised three analytics operators. The first analytics operator used the machine data as input and combined multiple values into a single object structure. The second analytics operator converted array structures within the object into flat map-like structures and filled missing values with default values. The third analytics operator handled the creation of the final data files for the model training process.

The second analytics pipeline consisted of two analytics operators. The first analytics operator used the data already produced by the second analytics operator of the first analytics pipeline as input and merged the data into 1-h windows. The second analytics operator used the windowed data for the actual predictions. Machine learning models were trained by an externally hosted microservice and saved to a model store available to multiple analytics pipelines. Another externally hosted microservice used the model to provide predictions. If no model for a specific machine had been trained, an appropriate model was trained when first requested, using the data files provided by the first analytics pipeline. Since training a model requires additional time and training is performed infrequently, we assumed an already trained model for the experiment.
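A minimal sketch of the 1-h windowing performed by the first analytics operator of the second pipeline is shown below; the field names and the timestamp format are assumptions:

from collections import defaultdict
from datetime import datetime

def hour_window(timestamp: str) -> datetime:
    # Truncate an ISO 8601 timestamp to the start of its hour.
    ts = datetime.fromisoformat(timestamp)
    return ts.replace(minute=0, second=0, microsecond=0)

def merge_into_hourly_windows(data_points):
    windows = defaultdict(list)
    for point in data_points:
        windows[hour_window(point["time"])].append(point)
    return windows                                        # each window is passed on to the prediction operator

windows = merge_into_hourly_windows([
    {"time": "2021-09-08T10:15:00", "module": "press", "power": 1.2},
    {"time": "2021-09-08T10:45:00", "module": "press", "power": 1.4},
])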

For this evaluation, we used a sample dataset with 97,544 data points, representing 24 h of manufacturing machine data. Each data point contains the power draw of different modules of the manufacturing machine alongside status information and error codes for each module.

System Setup and Deployment

All components of the cloud layer, including cloud analytics operators, were deployed at a private cloud data center. The cluster comprises 18 virtual machines with 8 CPU cores, 64 GB RAM and 256 GB of solid-state storage each. The underlying hypervisors use Xeon E5 CPUs. For the experiments E1–E4, the cloud components ran as Docker containers on Kubernetes version 1.16.8 as the container orchestration platform with Rancher version 2.4.8 as the management frontend. For the experiments E5 and E6, the cloud platform was updated to Kubernetes version 1.20.6 and Rancher version 2.5.8.

The components of the fog platform were deployed on two Raspberry Pi 3 Model B Plus (Rev 1.3) single-board computers. The Docker version on both computers was 19.03.13. The fog agent was deployed solely on one of the computers so that the measurements were not distorted by further services. As a result, all fog analytics operators ran on this computer as well.

Metrics and Methodology

For the smart home experiments E1–E4, we measured the overall processing latency of an analytics pipeline as well as the CPU load and memory usage of the hardware on which the analytics operators were deployed. The processing latency was measured by adding the current timestamp to each message at the beginning of the processing task carried out by the AO. At the end of an analytics pipeline, the LO calculated the time difference by subtracting the initial timestamp from the current one. CPU load and memory usage of the fog analytics operators were gathered using the system performance analysis toolkit Performance Co-Pilot. Additionally, the cloud CPU and memory metrics were collected using the cluster monitoring tools of Rancher. Each of these experiments was performed for one hour. Due to the constant number of messages emitted by the devices, the behavior of the observed metrics did not change by prolonging the investigation period. This was confirmed by preliminary experiments. Due to the consistent character of the data, the first 15 min of the total data set were selected and examined. This time frame included around 3500 messages emitted by the smart devices for each experiment.

Since the experiments E1–E4 differ from the experiments E5 and E6 in terms of the processing topology, different metrics were gathered for the latter. Primarily, the average throughput of the analytics pipelines was measured. It was calculated by dividing the number of data points of the overall data set by the time frame starting at the consumption of the first message and ending when the last result message was published to the streaming platform. The data set size and the number of consumed messages in analytics pipelines with multiple analytics operators may differ if analytics operators do not emit an output message for each input message (as is the case for the caching operator in E6). Even though analytics tasks may be parallelized by running multiple instances of analytics operators, we only considered a single-instance setup of each analytics operator in these experiments. We examined the parallelization capabilities of the proposed architecture and prototype in previous experiments. For example, the experiments presented in Sect. "Weather and Environmental Monitoring" and "Predictive Maintenance" are comparable to the 1–1 instance configuration used in [12].
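Expressed as a formula, with $N$ denoting the number of data points in the data set, $t_{\mathrm{first}}$ the point in time at which the first message is consumed and $t_{\mathrm{last}}$ the point in time at which the last result message is published to the streaming platform:

$$\text{average throughput} = \frac{N}{t_{\mathrm{last}} - t_{\mathrm{first}}}$$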

The average throughput only considers running analytics pipelines and disregards the time required to configure and launch analytics operators. Therefore, we provide two additional metrics. The first metric, time to launch, measures how long the analytics platform takes to configure and launch an analytics operator; an operator is considered launched once its connection to the streaming platform has been established. The second metric, time to first result, measures the time the analytics pipeline takes to calculate the first result after it has launched. The sum of time to launch and time to first result reflects the time span a user needs to wait to receive the first data processing result after the pipeline request has been sent to the flow engine. Each experiment was executed five times to account for run-to-run variances.
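In the same notation, the waiting time perceived by the user after sending the pipeline request is

\[ t_{\text{wait}} = t_{\text{launch}} + t_{\text{first result}} \]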

Experimental Results

Smart Home

The results of all experiments of the smart home scenario regarding the overall processing latency are summarized in Table 2. The lowest mean (0.015 s) and median (0.015 s) latency were observed during E1. Experiment 3, in which the same processing algorithm was used but executed in the cloud, shows an average latency of 0.603 s with a median of 0.618 s. Experiment 2, using a more complex online ML algorithm, shows the highest latency of all experiments with an average of 20.81 s and a median of 3.709 s. The results of E4 correspond approximately to those of E3, with an average of 0.626 s and a median of 0.604 s.

Table 2 Summary of the measured processing latency for the first 15 min of experiments 1–4 in seconds, source: [4]

The results of all experiments of the smart home scenario with respect to the 5-min CPU load average and the memory usage of the Raspberry Pi are summarized in Tables 3 and 4. The highest resource usage was observed during E2, with an average CPU load of 0.976 (median = 1.03) and an average memory usage of 228 megabytes (MB) (median = 228 MB). The second highest resource usage was recorded during E1, with an average CPU load of 0.12 (median = 0.124) and an average memory usage of 179 MB (median = 179 MB).

Table 3 Summary of the measured 5-min CPU load average of the Raspberry Pi for the first 15 min of experiments 1–4, source: [4]
Table 4 Summary of the measured memory usage of the Raspberry Pi for the first 15 min of experiments 1–4 in megabytes, source: [4]

The 5-min CPU load of the Raspberry Pi averaged 0.107 (median = 0.1) for E3 and 0.065 (median = 0.06) for E4. The average memory usage was 162 MB (median = 161 MB) for E3 and 162 MB (median = 162 MB) for E4. Additionally, we gathered the 1-min CPU load and memory usage of the cloud operator, which averaged 0.014 (median = 0.014) and 156 MB (median = 155 MB) for E3. For E4, the cloud operator's average 1-min CPU load was 0.034 (median = 0.0314) and its memory usage 185 MB (median = 185 MB).

Regarding the overall processing latency, the quantile distance between Q0.05 (5% quantile) and Q0.95 (95% quantile) is 100 times larger for E2 than for the other experiments; for the 5-min CPU load, it is 3 to 6 times larger. This shows that latency and resource usage remained at a steady level throughout experiments 1, 3 and 4. During E2, the memory usage also stayed constant throughout the experiment, but CPU load and overall processing latency increased (Fig. 6). While the 5-min CPU load average increased from the beginning and then remained between 1.0 and 1.1, the processing latency only increased once the 5-min CPU load average exceeded 1.

Fig. 6: 5-min CPU load average (left) and overall processing latency (right) of the Raspberry Pi during experiment 2. The data point at which a 5-min CPU load average of 1 is reached is marked with a vertical line in both graphs, source: [4]

Weather and Environmental Monitoring

The results of the experiment concerning the weather and environmental monitoring use case are presented in Table 5. The single analytics operator used achieved a mean throughput of 2438.11 messages per second with a standard deviation of 24.95 messages per second, resulting in a coefficient of variation (CV) of 1.02%. The time to first result varied more between runs (CV = 12.94%) than the time to launch (CV = 5.56%). On average, the user had to wait 17.26 s to receive the first result after launching the analytics pipeline.

Table 5 Summary of the measured throughput (messages per second), time to launch (seconds) and time to first result (seconds) for experiment 5

Predictive Maintenance

The metrics collected for the predictive maintenance use case are shown in Table 6. On average, the analytics pipeline processed 166.2 messages per second (standard deviation = 2.5 messages per second, CV = 1.5%). The time to first result shows a higher variance (CV = 18.7%) than the time to launch (CV = 7.16%). In this experiment, the user had to wait 15.99 s on average before the first result was available after launching the analytics pipeline.

Table 6 Summary of the measured throughput (messages per second), time to launch (s) and time to first result (s) for experiment 6

Discussion

Overall, the conducted experiments highlight the proposed analytics architecture's ability to be utilized in different application domains of the IoT. Furthermore, the relevance of the approach is demonstrated by its successful implementation in a real-world environment, as carried out for the smart home scenario. In addition, the weather and environmental monitoring as well as the predictive maintenance use cases and experiments were derived from current research projects and discussed with experts in the respective fields, which supports their relevance to real use cases in these domains.

From a technical standpoint, the results of the smart home scenario experiments indicate that fog-only analytics pipeline deployments result in a lower overall processing latency at the cost of higher resource usage on the utilized edge devices. In contrast, more complex computations, for example ML-based algorithms, may require more resources than edge devices can provide. This is supported by the results of E2, which show an increase in processing latency once the CPU load crosses a certain threshold. Experiments 3 and 4 provide insights into the nature of hybrid deployments. Since the measured latencies are almost the same in both experiments, we conclude that the complexity of the overall analytics pipeline is a minor factor, because resource-intensive tasks may be offloaded to the cloud. Additionally, the smart home experiments demonstrate the approach's capabilities concerning hybrid analytics pipeline deployments.

Comparing the throughput of the weather and environmental monitoring and predictive maintenance experiments (E5 and E6), which ran exclusively at the cloud layer, shows that more demanding machine learning tasks result in a lower throughput than less complex aggregation tasks. Even though the task itself is more complex, the latency, as measured by the time to launch and time to first result, remains suitable for the described use case.

The results of the experiments may seem trivial at first. However, they show the practical relevance of a solution that previously existed only in theory. In this context, the experiments performed illustrate that the architecture is technically capable of executing analytics pipelines that result from different requirements. This can be applied to other types of requirements that may be imposed by consumers, legislation or businesses, as laid out in Sect. "Application Domains".

The application scenarios considered in this work combine a number of challenges that, in part, must be viewed from different angles. Moreover, there are challenges that have to be solved equally for all use cases. Specifically, the architecture's ability to deploy hybrid as well as fog-only analytics pipelines addresses the privacy and security challenge for IoT analytics architectures, as it provides the opportunity to deploy an analytics operator as a first control point for privacy-sensitive data [56]. Considering the fault-tolerant data input challenge, the presented fog-only experiments indicate the resilience of the architecture against network connection losses, as computation can continue without an active Internet connection. Moreover, the utilization of resilient communication protocols (such as MQTT) for data exchange between the fog nodes themselves as well as with the cloud reduces the susceptibility to interference. Additionally, by creating custom analytics operators, users are able to ensure fault-tolerant data input by filtering or altering malformed input data. The presented predictive maintenance use case illustrates this capability with an analytics operator that automatically fills missing values with default ones.
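Such an operator could look like the following minimal Python sketch; the default values and the message layout are illustrative assumptions and do not reflect the operator implemented in the prototype:

# Hypothetical defaults for the fields of a machine module message.
DEFAULTS = {"power_draw": 0.0, "status": "UNKNOWN", "error_code": 0}

def fill_missing(message: dict) -> dict:
    # Replace missing or empty fields with default values before further processing.
    cleaned = dict(message)
    for field, default in DEFAULTS.items():
        if cleaned.get(field) in (None, ""):
            cleaned[field] = default
    return cleaned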

Regarding the Big Data challenge, a conclusive evaluation is difficult. The reasons for this are, on the one hand, the lack of a uniform definition of when data becomes Big Data and, on the other hand, the lack of test data to precisely verify this challenge. Given the volume of available test data, it has to be noted that the experiments conducted in this work cannot be seen in the context of Big Data. However, previous publications on this approach illustrate the processing of much larger data sets and examine the resulting performance (see [12, 13]). A comparison with industry-standard Big Data solutions showed that the presented architecture is capable of achieving similar processing dimensions in terms of volume and velocity.

In the context of the (near) real-time analytics challenge, the term real-time needs to be defined. According to [57], it can be differentiated with regard to the underlying deadline of a processing task, which might be relative or absolute, resulting in a range from hard (no tolerance for delay) to soft (low tolerance for delay) real-time. However, since the operational limits of real-time processing are use case-dependent, a universal proof of this ability is not possible. Still, as demonstrated by the smart home-based experiments, the data processing latency achieved by the proposed analytics architecture ranges between milliseconds and seconds and can therefore be considered sufficient for most use cases in the IoT. Experiments 5 and 6 support this argument with their results for the time to launch and time to first result metrics. Stream handling is accomplished using stream protocols and technologies, namely MQTT and Kafka, for data ingestion and processing. Using those technologies alongside container virtualization also offers scalable data processing. Additionally, the architecture's capabilities concerning data processing parallelization are further investigated in [12].
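As a minimal sketch of this ingestion path (assuming the paho-mqtt and kafka-python client libraries; broker addresses and topic names are placeholders, not those of the prototype), device messages received via MQTT could be forwarded to the Kafka-based streaming platform as follows:

from kafka import KafkaProducer
import paho.mqtt.client as mqtt

producer = KafkaProducer(bootstrap_servers="kafka:9092")

def on_message(client, userdata, msg):
    # Forward the raw device payload to the Kafka topic backing the log data store.
    producer.send("device-data", value=msg.payload)

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker", 1883)
client.subscribe("devices/#")
client.loop_forever()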

The proposed solution addresses the challenge of integrating data from different sources by means of imports for ingesting data from external sources (such as web APIs) as well as device data ingested via the fog components, as described in Sect. "Solution Proposal". This capability is exemplified in Sect. "Weather and Environmental Monitoring" and demonstrated in experiment 5, in which both data sources are used jointly in an analytics pipeline. Data storage of all analytics and raw data is provided in two manners: data is stored in the log data store, where it is available for processing by streaming applications such as analytics pipelines, and it is made available for querying via the serving platform. Since analytics operators write the results of their data processing tasks back to the log data store, processed data can be accessed at any step of an analytics pipeline. Data may also be shared among stakeholders using the serving platform, which provides an API to access data and analytics functions. Consequently, the sharing of analytics capabilities and data is accomplished.
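To illustrate the import mechanism in a hedged way (the URL, topic name and polling interval are placeholders, and the sketch does not reflect the prototype's import implementation), an import could periodically poll an external web API and publish the result to the streaming platform:

import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Poll the external web API and publish the observation to the log data store,
    # where it can be joined with device data in analytics pipelines.
    observation = requests.get("https://example.org/api/weather/current", timeout=10).json()
    producer.send("import-weather", value=observation)
    time.sleep(60)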

In the proposed solution, the log data store retains all raw data indefinitely, therefore keeping the entire history of a data stream. This approach is supported by recent advancements in Big Data technology [58]. Together with the possibility to configure analytics pipelines to either consume all available data or only newly arriving data, the integration of historic and real-time data is achieved. This is further extended by the use of imports that may provide historic data from external sources. The conducted experiments highlight that analytics pipelines with different configurations can be deployed in every stratum. In addition, the software library for creating fog analytics operators facilitates the streamlined development of new analytics operators; therefore, the flexible extension of data processing is possible. The predictive maintenance use case and experiment demonstrate how intermediate results can be used by multiple analytics pipelines, further extending the flexibility of flow design and the resulting analytics pipeline configurations. Moreover, the prototype described in Sect. "Prototype" highlights the proposed solution's ability to map the requirements of different stakeholders. Together with the flexible analytics flow design and individual configuration values for analytics pipelines, the personalization of analytics is enabled, as shown by the weather and environmental monitoring use case and experiment.
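Returning to the choice between consuming all available data and only newly arriving data, a minimal sketch of this configuration, assuming Kafka as the underlying log data store (the topic and consumer group names are placeholders), could look as follows:

from kafka import KafkaConsumer

def create_consumer(process_history: bool) -> KafkaConsumer:
    # "earliest" replays all retained data, "latest" only consumes newly arriving messages.
    return KafkaConsumer(
        "device-data",
        bootstrap_servers="kafka:9092",
        group_id="analytics-pipeline-example",
        auto_offset_reset="earliest" if process_history else "latest",
    )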

The smart home experiments illustrate the ability of the architectural approach to offload resource-intensive tasks to its cloud components if fog node resources are insufficient, thereby addressing the challenge of limited computing resources. The frontend application described in Sect. "Prototype" also allows for data visualization by querying the serving platform. Furthermore, the proposed analytics architecture allows for data processing solely in the fog layer, thus reducing high network usage by transferring only relevant data. Although not explicitly evaluated, fog computing can lead to energy savings [59], therefore supporting increased energy efficiency for analytics applications carried out with the proposed solution.

Concerning the internal validity of our results, we recognize that the results of our experiments may differ based on several factors: the general network latency between cloud and fog nodes, local network usage, and the utilization of cloud and fog resources by other processes during the execution of the experiments. These uncertainties might also be responsible for the high variance in the time to launch and time to first result metrics. On the other hand, reproducibility was increased through the use of open-source software and a widely available and commonly used computing platform as the fog node. The high reproducibility is also reflected in the low variance of the throughput metric. Finally, the field experiments carry high external validity, since they were conducted in a real-world setting and the included IoT devices were consumer hardware.

Conclusion and Outlook

In this paper, we propose a fog-based analytics architecture that enables hybrid analytics pipeline deployments in different domains of the IoT. In this regard, we provide background information about the research area and motivational scenarios for the domains of smart home, weather and environmental monitoring as well as predictive maintenance. Furthermore, we illustrate the reasons behind conducting this research and highlight different challenges that have to be addressed by analytics architectures in the IoT. Additionally, an overview of similar works in the existing scientific literature is given.

The proposed solution is based on our previous works and utilizes the microservice paradigm to structure its architectural components. It comprises four loosely coupled main components, each a collection of microservices. The presented architecture is the technical basis for the implementation of individual analytics pipelines in different IoT environments that are based on the requirements of various stakeholders. The components of the proposed architecture as well as their interactions with each other are presented in a conceptual model that forms the basis for a prototypical implementation. This prototype serves as a proof of concept and was further utilized to perform six experiments based on the motivational scenarios we presented. The results of the experiments show that the architectural approach is capable of mapping different deployment scenarios and thus makes it possible to address the requirements of different stakeholders. Moreover, we show that the approach is able to address a multitude of challenges for IoT analytics architectures that differ depending on the application domain.

Future research in this field needs to investigate how the requirements of different interest groups in IoT environments, but also technical constraints, can be modeled, mapped and structured. This concerns conceptual considerations, such as the representation of stakeholder requirements in a machine-readable notation, but also aspects of software engineering, such as the underlying functional and non-functional requirements as well as technologies for implementation. In addition, the ways in which stakeholder requirements influence the configuration and composition of analytics pipelines should be investigated. Furthermore, deployment preferences may be derived from legal or social sources. The resulting data could serve as the foundation for self-learning analytics pipeline deployment systems. Finally, these concepts should yield software components that can be integrated into the presented architecture to enable deployment decision support.