1 Introduction

In the age of Artificial Intelligence (AI), the proliferation and ubiquity of data have taken on new dimensions. As demonstrated by several studies, Generative AI models are highly data dependent for the development of their intelligence capabilities [1, 2]. In this new Big Data paradigm, we can no longer depend solely on the batch processing strategies of the recent past; we must be able to capture and process diverse streams of data at scale. In our conceptualisation of the platform, we explore architectural and data processing strategies that trade off between real-time and batch-based processing paradigms, accommodating both low-latency and high-latency data streams. We adopt an event-based philosophy of data consumption and ingestion, in which incoming data streams are treated as streams of events mirroring occurrences in the environment, allowing for very high scale and speed of processing. This dual processing paradigm also allows the data we hold as the current version of truth to be constantly bolstered with the latest data as it arrives. In turn, the models and applications that rely on this data can generate insights in real time while augmenting and supporting those insights with new information as it becomes available [3, 4].

The rest of this paper is organized as follows. Section 2 explores related work in Big Data analytics platforms. Section 3 presents the proposed cloud-based analytics platform architecture for Self-Structuring AI (SSAI) and identifies an existing SSAI technique suitable for showcasing the platform. Experiments conducted using household energy consumption data generated from the SGSC initiative are reported in Sect. 4. Section 5 concludes the paper with a discussion of limitations and future work.

2 Related work

The proliferation of data in cloud systems and the recent advances in technologies to tackle Big Data environments have been discussed extensively in the literature. However, existing solutions do not cater to a holistic approach; open research questions remain in data staging, distributed storage, security and analysis [5,6,7]. The scale of data prevalent in the new Big Data paradigm demands non-conventional methods to process and generate insights from the data streams being generated. Al-Jarrah et al. have discussed data modelling on large datasets for machine learning from a theoretical and experimental perspective, toward optimising computational complexity [8]. It is necessary instead to look to non-traditional methods of harnessing Big Data’s potential [9]. An important consideration, though, is that invasive techniques that seek to structure Big Data are generally cumbersome and ultimately self-defeating in the face of the exponential growth we are seeing and predicting [10]. Harnessing the power of AI is one way the challenges posed by the new Big Data can be answered. O’Leary explores some of the ways Artificial Intelligence can be used to facilitate, process and analyse Big Data in this manner [11]. This intersection of Big Data analytics and AI is demonstrated in a number of recent studies, such as situational awareness from IoT data streams [12], intelligent detection of driver behaviour changes [13], human activity recognition [14] and emotion detection [15]. However, most such studies do not propose a cloud-based strategy, do not focus on the seamless integration of real-time and batch-processed data, and do not address the need for explainability of the insights generated.

As the world comes to depend further on machine learning models to generate insights, which in turn inform important decision making, it becomes even more important to formulate the explainability of such insights in terms of bias, transparency, safety, ethics, and causality [16, 17]. Significant research has gone into the challenges faced in applying Explainable AI and the workarounds and solutions that are available [18,19,20]. It is important that any platform that seeks to apply Artificial Intelligence for insights also builds into its architecture a framework that can include Explainable AI models and explainers, providing a reference to why those insights were generated.

Advances are also being explored for the class of Big Data that encompasses sensor data. Plageras et al. propose a sensor management system in which remote sensor deployments communicate with a cloud-based building management server [21]. This allows for centralised management of sensor data using the cloud, and brings the elastic nature of the cloud to the fore in managing the inconsistent surges in Big Data. Mavromoustakis et al. take this idea further with “edge cloudification”, distributing compute and storage in local networks so that small clouds operating at the network edge can accommodate the storage and compute needs of the local area [22]. This is an area that can be explored further in distributing AI along the cloud-to-edge continuum, with demonstrated potential in diverse applications such as energy [23, 24] and healthcare [25, 26].

3 The proposed cloud-based architecture for explainable Big Data analytics

3.1 Theoretical discussion

The proposed architecture is composed of three distinct layers, namely Process, Store and Serve. The Process layer serves as the gateway to the platform, taking care of discovery, registration, provision and acclimatisation of the data streams [27]. Given the variable nature of the data, the Process layer provides a frame of reference within which the platform can adapt and synchronise the data. This means the platform can handle data arriving at various velocities and sample it within a single frame of reference. By managing data stream discovery, registration and acclimatisation, the Process layer ensures data integrity and transparency, which act as building blocks for the explainability of the system.

The Store layer will provide the high-performance stack required for handling the data coming through the Process layer at scale. It will manage and make the data available to the platform in line with its freshness and frequency of utilisation, using both “hot” and “cold” paths of delivery. Partitioning the data in this manner also affords the ability to augment and mask the data in line with contextualisation and security considerations, allowing for the efficient participation of the data in any analytic activities. All metadata extracted and collected from each stream will be democratised within the platform and used to inform the processing and storage processes as well as the upstream layers. The Store layer’s data partitioning and management strategies facilitate data traceability, enabling insights to be linked back to their data sources; this enhances transparency and accountability, which in turn speak to the explainability of the system.
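As an illustration of the hot/cold delivery paths described above, the routing decision might be sketched as follows. This is a minimal sketch under our own assumptions: the 24-hour freshness window and the function names are illustrative, not part of the platform's specification.

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness threshold (an assumption for this sketch): records
# newer than this go to the "hot" (low-latency) path, older ones to "cold".
HOT_WINDOW = timedelta(hours=24)

def route_record(record_timestamp, now=None):
    """Return 'hot' or 'cold' depending on record freshness."""
    now = now or datetime.now(timezone.utc)
    return "hot" if now - record_timestamp <= HOT_WINDOW else "cold"

now = datetime(2014, 1, 2, tzinfo=timezone.utc)
fresh = datetime(2014, 1, 1, 12, tzinfo=timezone.utc)   # 12 h old
stale = datetime(2013, 12, 1, tzinfo=timezone.utc)      # a month old
print(route_record(fresh, now))  # hot
print(route_record(stale, now))  # cold
```

In a real deployment the same decision would also inform masking and augmentation policies per partition, as discussed above.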

The final layer of the platform is Serve, which will function as the wheelhouse of the platform; the framework will hold a suitable set of adapters able to retrofit different learning mechanisms. These learning mechanisms can operate in ensemble or individually. From an Edge perspective, the learning mechanisms will surface latent models that can be deployed to run in the low-capacity environments prevalent at the edge of the network.

As mentioned, the proposed platform operationalises SSAI, which addresses the limitations of conventional AI by adapting to the inherent structure of the data, incrementally learning and abstracting from this structure. The information processing functionality of the platform that supports SSAI is based on the principles of the lambda architecture [28], where we enable the processing of a real-time stream (Real-time processing layer) alongside a separate stream for higher-latency data processed in batches (Batch processing layer). One inherent problem in employing the principles of a lambda architecture is the duplication of function and code across the Speed and Batch layers; we overcome this by utilising a single repository of components shared across both layers. The platform is illustrated in Fig. 1.
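To make the shared-repository idea concrete, the following minimal sketch shows the same parsing and validation components serving both the speed and batch branches. The record format, function names and validity bounds are hypothetical stand-ins of our own choosing, not the platform's actual components.

```python
# Shared component repository: the same transformation callables are reused
# by the real-time (speed) branch and the batch branch, avoiding the code
# duplication a naive lambda architecture incurs.

def parse_reading(raw):
    """Parse a hypothetical 'meter_id,timestamp,kwh' record."""
    meter_id, ts, kwh = raw.split(",")
    return {"meter": meter_id, "ts": ts, "kwh": float(kwh)}

def is_valid(reading):
    """Reject physically implausible half-hourly loads (assumed bound)."""
    return 0.0 <= reading["kwh"] < 100.0

def run_pipeline(records):
    """Apply the shared components to any iterable of raw records."""
    return [r for r in map(parse_reading, records) if is_valid(r)]

# Speed branch: one event at a time.
speed_out = run_pipeline(["m1,2013-07-01T00:00,0.42"])
# Batch branch: a whole file of historical records; same code path.
batch_out = run_pipeline(["m1,2013-07-01T00:00,0.42",
                          "m1,2013-07-01T00:30,999.0"])  # implausible, dropped
print(len(speed_out), len(batch_out))  # 1 1
```

In the platform itself, these callables would be wrapped by the respective Spark Streaming and Spark batch jobs rather than invoked directly.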

Fig. 1 Cloud-based architecture for explainable Big Data analytics

In the proposed platform, we utilise Apache Kafka [29] as our Streaming Manager; it will be used to publish and subscribe to all streams entering the system, working as the de facto gatekeeper. Kafka provides a fast message bus and will be the delivery point for all event streams into the system, including data from low-latency 5G mMTC and URLLC networks. The Streaming Manager will work alongside the Apache Spark Streaming system [30] to implement the real-time processing branch. Other options here include Apache Storm [31] and Samza [32]; however, we have chosen Apache Spark Streaming for its machine learning, graphing and SQL-based querying capabilities, and for the cohesion of systems in the processing layers. In the Batch processing branch, Apache Spark fulfils this role; we could also have used services such as Hadoop, but again we have chosen Spark for the reasons mentioned above.

Both these branches of processing will load data into the Store layer, which will be implemented using Apache Druid, a column-oriented, high-performance, open-source, distributed data store [33]. It provides capabilities for flexible, highly available, low-latency queries and fast slice-and-dice analytics ("OLAP" queries) on large data sets.

The AI capabilities of the platform will be built on top of this high-performance data delivery stack. The different learning mechanisms will be instantiated using pipelines which sequence and apply machine learning capabilities. For the purpose of this article, we demonstrate the workings of the IPCL algorithm, which has been developed upon the principles of SSAI to incrementally characterise patterns in stream data and correlate them across time [24].

3.2 IPCL algorithm

Incremental Pattern Characterisation Learning (IPCL) is an unsupervised incremental learning algorithm in which existing learned knowledge is incrementally extended and updated as new data comes in. Incremental learning is a must for high-velocity, low-latency data streams, as they constantly evolve: new, unseen patterns will always appear, and previously seen patterns may reappear later, so learning techniques have to keep previously learned knowledge intact rather than discard it. The IPCL algorithm supports the four key characteristics of an incremental learning technique [24]:

  1. Learn additional information from new data.

  2. Not require access to the past data that it has already processed.

  3. Address catastrophic forgetting, thus preserving previously acquired knowledge.

  4. Accommodate new classes that may be introduced with new data.

IPCL self-learns a layered structure across time, generalising the knowledge embodied in data. Each layer learns from a buffered batch of data using the GSOM self-structuring technique [34]. IPCL preserves the acquired knowledge in a generalised form; therefore, it does not require access to the past data that it has already processed. The generalised version of the acquired knowledge from each layer (n) is used as the basis for the knowledge acquisition from the subsequent layer (n + 1), thus avoiding catastrophic forgetting of past knowledge. Moreover, while using the past acquired knowledge as the base, it incrementally acquires the new knowledge embodied in the incoming data. This incremental learning capability enables IPCL to handle high-velocity, low-latency data streams, as it does not need to look into past data again to learn the patterns.
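The layered incremental principle can be illustrated with a drastically simplified, pure-Python sketch. This is not the GSOM-based IPCL implementation of [24]: the nearest-prototype update, growth threshold and learning rate below are our own illustrative stand-ins for the self-structuring behaviour, shown only to convey how one layer's generalised knowledge seeds the next without revisiting past data.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def learn_layer(batch, prototypes, grow_threshold=1.0, lr=0.3):
    """Learn one layer: start from the previous layer's generalised
    prototypes, nudge the nearest prototype toward each sample, and
    grow a new prototype when no existing one is close enough."""
    protos = [list(p) for p in prototypes]
    for x in batch:
        if not protos:
            protos.append(list(x))
            continue
        nearest = min(protos, key=lambda p: dist(p, x))
        if dist(nearest, x) > grow_threshold:
            protos.append(list(x))            # unseen pattern: grow
        else:
            for i in range(len(nearest)):     # known pattern: refine
                nearest[i] += lr * (x[i] - nearest[i])
    return protos

# Layer n's output seeds layer n+1 -- past data is never revisited.
layer1 = learn_layer([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)], [])
layer2 = learn_layer([(5.1, 4.9), (10.0, 10.0)], layer1)
print(len(layer1), len(layer2))  # 2 3
```

Note how the second layer retains both patterns learned by the first while growing a new prototype for the genuinely new pattern, avoiding catastrophic forgetting.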

Algorithm 1 IPCL Algorithm

The learning outcomes determined by the IPCL algorithm are presented to the Orchestrator, which is responsible for choosing, applying and coordinating these pipelines based on the data. The required metadata for the pipelines will be augmented and delivered from the data factory. The Explainability Engine module at the end of the pipeline will be responsible for providing the explainability of the models generated and for creating the appropriate augmentations to the data to make the results human friendly. The reasoning function of the engine will employ an ensemble of explainers, including LIME [35] and SHAP [36] based techniques, for exploring results; these will be integrated with feature relevance to provide an Explainability Graph that rationalises the results delivered from the machine learning pipeline. The Explainability Engine interface will also allow for human input in consolidating and validating the reasoning provided, and will expose interfaces through which new models of reasoning can be added in the future. This forms the framework by which insights generated by the SSAI are presented hand in hand with the ability to interrogate them in terms of their explainability. The Serve layer, through the Explainability Engine, directly tackles explainability by providing understandable interpretations of AI decisions. The Process and Store layers provide explainability-enabling metadata, while the Serve layer provides actual explanations of model outputs using that metadata and XAI techniques. These layers collectively ensure that the architecture not only supports but enhances the explainability of AI-driven insights.
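The ensemble step of the reasoning function might look as follows in outline. The attribution values are mocked, since running actual LIME or SHAP explainers requires a trained model; this sketch only illustrates merging per-feature relevance from several explainers into a single ranking, and the feature names are hypothetical.

```python
def combine_explanations(explanations):
    """Average per-feature attributions from several explainers and
    rank features by combined relevance."""
    combined = {}
    for attributions in explanations:
        for feature, value in attributions.items():
            combined.setdefault(feature, []).append(value)
    averaged = {f: sum(v) / len(v) for f, v in combined.items()}
    return sorted(averaged.items(), key=lambda kv: -abs(kv[1]))

# Mocked attributions standing in for LIME / SHAP outputs.
lime_attr = {"occupants": 0.42, "income": 0.30, "has_pool": 0.05}
shap_attr = {"occupants": 0.38, "income": 0.36, "has_pool": -0.01}
ranking = combine_explanations([lime_attr, shap_attr])
print(ranking[0][0])  # occupants
```

In the engine itself, this combined ranking would feed the Explainability Graph rather than being returned directly.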

3.3 Comparative evaluation

Several studies have proposed cloud-based, scalable architectures for Big Data analytics, including [37, 38]; however, a shortcoming of these has been the inability to integrate a Self-Structuring AI model able to handle drift both in the data streams being processed and in the data itself. Nor do they couple this with a suitable set of reasoning capabilities. The Orchestrator and the Explainability Engine modelled in the Serve layer address this gap. The framework also provides the ability to serve the latent representations of the model out to the edge of the network. These latent representations lend themselves to execution on lower-capacity hardware, suiting them for use on the edge. In this way, most of the data can be processed on the edge and the resulting information ingested back into the framework through the Process layer. This is another notable point of difference from the conventional architectures that we evaluated; see Table 1 for this evaluation.

Table 1 Comparative evaluation of the features of the proposed architecture and aggregate of related work

4 Experiments and results

In the push for smart cities, smart grids are key components in understanding and addressing the burgeoning demand for energy, while fulfilling the requirements of energy efficiency, sustainable energy management, and carbon neutrality. The real-time monitoring capabilities of smart grids provide the ability to understand the demand for energy and allow energy providers to adjust rates accordingly to match supply and demand. In doing so, household and industrial consumers can change the behavioural and lifestyle factors that directly impact usage patterns to reduce the load on the grid. This setting of multiple stakeholders with varied objectives in a multi-layered composition is well positioned for the evaluation and demonstration of the proposed cloud-based architecture for explainable Big Data analytics. However, processing smart grid data streams from a metropolitan city is technically challenging, as they encompass frequent updates from thousands of household smart meters, exhibiting the qualities of a high-velocity, low-latency data stream of hundreds of millions of data points per day.

For this experiment, we utilised 30-min interval load data collected for the Smart Grid for Smart City (SGSC) project [39]. The dataset consists of 30-min interval readings from 78,000 households in New South Wales, Australia, from 2010 to 2014. This dataset is streamed to the proposed platform, emulating real-time smart-meter interval readings flowing to the platform. The processing layer of the platform receives the data stream and pushes it into the IPCL algorithm instantiated in the machine learning pipeline. Each IPCL layer is represented by readings from a 24-h period; given that the readings are half an hour apart, this creates a 48-dimension vector (24 × 2 half-hourly reads). The IPCL algorithm develops a columnar arrangement that represents patterns and maintains continuity in learning across each time period. As noted in the expansion of the algorithm in Sect. 3, the initial four aggregate nodes generated as part of the learning phase then expand in subsequent phases, incrementally growing and learning in this manner for all time periods. Nodes that do not grow in a certain phase will not be lost but will be retained, and learning will continue in subsequent phases when relevant. The streamlined nature of the platform allows the algorithm to run in near real time for processing and insights. The metadata and customer demographic data are processed through the batch processing layer and augmented to the results from the machine learning pipeline. The results are then run through the Explainability Engine to generate explained insights, such as the ontological representation shown in Fig. 2.
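The construction of the 48-dimension layer input described above can be sketched as follows. The readings are synthetic and the function name is our own; the sketch only shows the 24 h × 2 half-hourly arrangement.

```python
def daily_vector(readings):
    """Arrange one household-day of half-hourly kWh readings into the
    48-dimension vector consumed by an IPCL layer (24 h x 2 reads)."""
    if len(readings) != 48:
        raise ValueError("expected 48 half-hourly readings for a full day")
    return tuple(readings)

# One synthetic day: low overnight load, a morning peak and an evening peak.
day = [0.2] * 12 + [0.8] * 6 + [0.3] * 18 + [1.1] * 8 + [0.4] * 4
vec = daily_vector(day)
print(len(vec))  # 48
```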

Fig. 2 Explainability of unsupervised analysis of energy data

As discussed above, the platform clusters the energy consumption data over time into different Energy Consumption Pathways or Usage Profiles based on the characteristics learned by the algorithm. The execution resulted in 12 different profiles encapsulating the energy usage of the households. These profiles were then extrapolated and averaged into a representative energy consumption pattern visualised over time. Figure 3 shows the consumption of energy over the days of the month and the four segments of the day (the 30-min interval loads were split into 6-h segments from midnight to midnight).
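The segmentation used for this view can be sketched as below, collapsing a day's 48 half-hourly loads into four 6-h segment means; the load values are synthetic and the function is illustrative.

```python
def segment_means(vector48):
    """Collapse a 48-dimension daily vector into four 6-h segment means
    (midnight-6am, 6am-noon, noon-6pm, 6pm-midnight)."""
    assert len(vector48) == 48
    return [sum(vector48[i:i + 12]) / 12 for i in range(0, 48, 12)]

# Synthetic day with one constant load per 6-h segment.
day = [0.2] * 12 + [0.9] * 12 + [0.4] * 12 + [1.0] * 12
print([round(m, 2) for m in segment_means(day)])  # [0.2, 0.9, 0.4, 1.0]
```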

Fig. 3 All pathways over time by the Day of the Month

Studying the visualisation, we can see the peaks and troughs commonly displayed across peak and off-peak use (day and night), as well as slopes denoting weekday and weekend usage. We can also see usage characteristics of different user profiles: Cluster 11 (denoted by P24_11) shows a relatively flat and low consumption pattern in which energy usage does not vary much, whereas Cluster 3 (denoted by P24_3) shows significant peaks in usage. The data was then averaged over a 24-h period to take a closer look at the energy usage characteristics of the different Usage Profiles, shown in Fig. 4. The patterns are consistent with the views from Fig. 3. Here we can also see that Clusters 12 and 2 (denoted by P24_12 and P24_2) seem to show similar behaviour to Cluster 3, albeit on a lower energy consumption scale.

Fig. 4 All pathways over time by the Hour of the Day

If we consider the energy consumption profile for a day depicted in Fig. 4, we can see peaks in the morning between 8 a.m. and 9 a.m., followed by an immediate dip, where the occupants are possibly getting ready to leave for their day’s activities, be it school or work. Consumption gradually picks up from around 4 p.m. to 8 p.m., when people arrive home and follow through with activities through to dinner, after which they wind down for the night and we see subsequent dips in energy consumption. Interestingly, P24_3 and P24_1 both seem to show a decrease in energy consumption between 12 a.m. and 1 a.m., suggesting occupants who stay up later into the night.

Next, to delve deeper into the usage profiles and the patterns of energy usage they exhibit, the platform also cross-referenced the demographic profiles of the customers with the Energy Consumption Profiles derived by the platform. A median value for each demographic characteristic was established and cross-tabulated against the mean of each of the Energy Usage Profiles. The results are shown in Fig. 5. This allows more granular analysis of the Energy Pathways adopted by the different user groups and allows the plotting of cause-and-effect phenomena that can be observed.
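The cross-tabulation step might be sketched as follows. The household fields, values and cluster labels are illustrative stand-ins for the SGSC demographic attributes, not the actual dataset schema.

```python
from statistics import median, mean

def cross_tabulate(households, cluster_of):
    """For each usage profile (cluster), compare the cluster mean of each
    demographic characteristic against the population median."""
    pop_median = {k: median(h[k] for h in households)
                  for k in households[0] if k != "id"}
    by_cluster = {}
    for h in households:
        by_cluster.setdefault(cluster_of[h["id"]], []).append(h)
    return {c: {k: mean(h[k] for h in members) - pop_median[k]
                for k in pop_median}
            for c, members in by_cluster.items()}

households = [
    {"id": 1, "occupants": 4, "income": 120},
    {"id": 2, "occupants": 1, "income": 35},
    {"id": 3, "occupants": 4, "income": 110},
]
cluster_of = {1: "P24_10", 2: "P24_11", 3: "P24_10"}
delta = cross_tabulate(households, cluster_of)
print(delta["P24_11"]["occupants"])  # -3 (below the population median of 4)
```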

Fig. 5 Demographic characteristics for all pathways

Here we can see that Cluster 10 (denoted by P24_10) exhibits the largest consumption profile; a closer look at its demographics shows a high percentage of high-income families with children and a median of four occupants, living in individual houses. Utilities include split-system and ducted heating, and a good percentage also have pools. This validates the energy consumption curve, which is consistently high through the day. Cluster 3 (denoted by P24_3) also generally comprises high-income earners who own their houses, with high electricity usage, ducted or split-system air conditioning, and a median of four occupants including children and two refrigerators. This again bears out in the high energy usage and the pronounced peaks and troughs seen as occupants depart for work and school and return. The late-night energy consumption might suggest teenagers staying up later at night.

In contrast, if we consider Cluster 11 (P24_11), we can see that the energy profile is relatively flat through the day and the average energy usage is quite low. A view of the demographic profile shows a single occupant with no children who is at home during the day, living in a rented unit. There is an emphasis on gas for heating water and cooking, and the income profile shows a low income. This might suggest elderly pensioners living on their own. There is also a note of pool pumps and solar usage; this may mean that these units are part of a block of units in which the landlord has installed a pool or solar generation.

The Explainability Graph generated by the system to explain these insights further bears out the analysis that has been conducted.

Figure 6 illustrates household attributes across different 24-h profile segments, indicating diverse energy consumption patterns. Certain segments, such as Segment 2 and Segment 4, demonstrate a higher prevalence of both heating and air conditioning use, suggesting households with a high energy demand for climate control, which could be reflective of regions with more extreme temperatures. Segment 9 and notably Segment 10 present the highest attribute sums, particularly in technological connectivity and air conditioning use, hinting at segments with potentially larger or more technology-dependent households that have constant energy demands. Conversely, other segments display lower sums for features such as pool pumps and solar panels, possibly pointing to urban dwellings with less space for such amenities or more energy-efficient lifestyles. The variability across segments provides insights into the different energy needs and usage patterns.

Fig. 6 Usage attribute distribution for each segment

The results of the experiment were positive: we were able to successfully demonstrate the effectiveness of the proposed cloud-based platform architecture in hosting and applying a Self-Structuring AI (SSAI) model on streams of energy consumption data and presenting an explainable narrative to validate the model. The experiment also demonstrated the practicality of applying IPCL to the energy consumption data using the platform. With the results generated by the system, it will be possible for energy providers and regulators to correlate the grouping of attributes to the people groups of those demographic profiles and to attribute energy consumption patterns and load characteristics to those profiles. This will allow the most appropriate energy plans to be proposed, maximising the value and returns offered. Due to the flexible nature of the platform and the ongoing learning it enables, these results can be obtained incrementally and in near real time. This has far-reaching consequences for Smart Grid applications, including efficient load distribution, providing benefits for consumers and power generators.

4.1 Addressing the challenges of Big Data at the Edge

The proposed platform rationalises several of the challenges faced when working at the edge of the network. One is the availability of general compute resources on the edge; in practice, end nodes are not capable of handling analytical workloads. The platform’s ability to serve a minimal latent representation of the model, capable of being executed on the edge and providing back a reduced data footprint for further processing in the cloud, alleviates this issue. This also tackles the further issue of the large data load which would otherwise have to be passed on to the cloud. The final issue addressed is the limited energy budget available at the edge: the platform’s processing layer is able to handle various velocities of data, and coupled with the models that can be run on the edge, this means the edge devices can run on less demanding schedules, stretching the energy budget further. The inherent architecture of the platform also lends itself to addressing challenges such as additional security and service discovery through the Store and Process layers; however, these areas have been relegated to future work.

5 Conclusion

We proposed a cloud-based architecture for a platform which can encapsulate and execute a Self-Structuring AI algorithm. We were able to demonstrate the viability of the platform through the application of the IPCL algorithm to the SGSC household energy consumption data. The results showed distinct energy usage pathways for distinct demographic household profiles, including peaks and troughs in daily usage patterns, further attesting to the performance of the model within the platform. The system was also able to provide meaningful visualisation and a mechanism for explainability, which were validated in the results section.

Future work will involve correlating the pathways with temperature and other relevant data streams, utilising the data fusion capabilities of the IPCL. The platform will also be integrated with further Self-Structuring AI models, such as the DGSOM [12], to approach different use cases such as traffic and congestion. Further investigation will be undertaken into improving the visualisation and explainability provisions of the platform. Extension of the platform will also be considered in terms of decoupling its processing from the cloud, through the Edge, down to the IoT devices, providing end-to-end governance of the data stream, as well as exploring the distribution of processing and the provision of insights at varying levels at each stage of the platform.