Introduction

The smart grid combines the traditional power system with information technology, forming one of the most important modern cyber-physical systems. The smart grid is envisioned to fully integrate high-speed, two-way communication technologies into millions of pieces of power equipment to establish a dynamic and interactive infrastructure with new energy management capabilities, such as advanced metering infrastructure (AMI) and demand response. It relies heavily on information and communication technology to achieve efficient and reliable operation [1]. At the same time, smart grid big data has provided new opportunities for electric load forecasting, anomaly detection (e.g., power theft), and demand-side management. However, the high-dimensional and massive smart grid big data creates new challenges in data transmission, data storage, and data analysis. This paper addresses the problem of creating a benchmark for big data frameworks used in smart grid big data analysis. We also develop a realistic smart grid data generator for performance analysis under realistic conditions.

Motivation

The rapid growth of the smart grid, the deployment of modern information and communication technologies, and millions of newly deployed smart meters will generate large amounts of smart grid data. Smart grid big data analysis is considered to be the key to solving significant problems of this industry. With the exponential growth of data, how to utilize this data efficiently becomes a critical issue. Most recent work has focused on smart grid analysis and big data technology separately, without considering a benchmark for different big data frameworks. However, it is important to understand how different big data frameworks perform when analyzing data from millions of smart meters.

Contributions

The main contributions of this paper are as follows:

  1. A smart grid data generator is designed based on big data platforms, taking into account the practical concerns of a realistic smart grid.

  2. The data generator is developed and implemented using Spark and the HDFS file system.

  3. The four most efficient open source big data frameworks are selected and used to analyze smart grid big data.

  4. We propose a benchmark for evaluating smart grid big data frameworks in terms of criteria such as run time, throughput, and latency.

  5. Based on the proposed benchmark, we compare four big data frameworks and select the best one for smart grid big data analysis.

Paper structure

The rest of the paper is organized as follows. “Related work” section discusses related work. In “Methods” section, we explain the design of our scalable smart meter data generator. We also provide the required background on big data concepts and the four big data frameworks for smart grid big data analysis. In “Implementation and benchmarking of big data technology in smart grid” section we propose and deploy our benchmark for comparing smart grid big data frameworks. In “Results and discussion” section, we evaluate different frameworks for smart grid data analysis and present the benchmark results. Finally, “Conclusion” section provides some concluding remarks and outlines directions of future research.

Related work

A wide range of studies address the big data challenges in smart grids. In this section, we briefly cover the studies whose results are most closely related to our work.

Big data concepts in smart grid

Ye et al. in [2] illustrate big data analytics and cloud computing in a smart grid, with a focus on relevant issues in demand response and wide-area monitoring. To fulfill the potential of energy big data and obtain insights for smart energy management, a comprehensive study of big data-driven smart energy management is presented in [3]. Vasilakos et al. in [4] present a survey and discuss new findings and developments in smart grid big data analytics and security. A big data framework that can serve as a starting point for innovative research in smart grids, together with its implementation on a secure cloud-based platform, is presented in [5]. In [6], an overview of the state of the art of big data technology in the smart grid is presented. A comprehensive survey on the life cycle of data and data trading, which studies a variety of data pricing models, categorizes them into groups, and compares them, is presented in [7]. Kang et al. in [8] introduce some recent progress in the application of big data analytics in China’s electric power industry. A study on compression techniques for smart meter big data is presented in [9]. A smart grid big data ecosystem based on the state-of-the-art Lambda architecture, capable of performing parallel batch and real-time operations on distributed data, is proposed in [10]. Jiang et al. in [11] proposed a high-performance, privacy-preserving query scheme over encrypted multidimensional big metering data, addressing how to efficiently query such data when it is stored in an untrusted heterogeneous distributed system environment. In [12], a lightweight communication approach for big data traffic is presented that can prevent or quickly respond to network congestion in situations where computing power and bandwidth are limited. A comprehensive tutorial and survey highlighting research challenges on the vulnerabilities in smart grid big data is presented in [13]. Hou et al. in [14] propose a novel temporal, functional, and spatial big data computing framework for large-scale smart grids. The potential applications of big data analytics in electric grids are discussed in [15]. In [16], 0.1 billion data points collected by various smart Internet of Things (IoT) devices in the power system of China are studied to analyze the consumption characteristics of power users.

Big data analysis in smart grid

Here, we briefly survey papers that focus on using big data analysis technology in smart grid domains such as anomaly detection, marketing and electricity price forecasting, and fault tolerance. Liu et al. in [17] proposed a lambda system for detecting anomalous consumption patterns, aiming to assist decision making for smart energy management. A real-time anomaly detection framework that can be built upon smart meter data collected in the smart grid is proposed in [18]. Vimalkumar et al. in [19] designed a big data framework using various machine learning techniques to detect intrusions. An unsupervised progressive incremental data mining mechanism, applied to big energy consumption data from smart meters through frequent pattern mining, is presented in [20]. Wu et al. in [21] proposed a security situational awareness mechanism based on the analysis of big data in the smart grid. A novel electricity price forecasting model that handles huge volumes of price data in the smart grid is introduced in [22]. Ye et al. in [23] proposed a secure data learning scheme for big data applications in the information and communication technology infrastructure of the smart grid. In [24], a new model for optimal microgrid scheduling and load curtailment based on the big data concept is proposed. A data-driven method of association rule mining for transformer state parameters based on big data is proposed in [25] by combining the Apriori algorithm and a probabilistic graphical model. From a performance point of view, only [26] examines smart meter analytics from a software performance perspective. In addition, a smart meter data generator using Spark that can generate semi-realistic datasets is presented in [27].

Fig. 1

a Overall architecture of our smart grid big data framework, b smart grid big data analysis system model

Methods

Figure 1 presents the overall architecture of our smart grid big data framework and the data analysis system model based on this architecture. The system can be divided into three blocks: data generator, database, and data analysis. The data initialization module uses a smart meter dataset provided by the Irish Social Science Data Archive (ISSDA) [28]. This real data was collected during 2009 and 2010, with over 5000 Irish homes and businesses participating. The dataset includes over 150 million records from 5000 customers over 18 months. Each smart meter reports consumption at 30 min intervals. Figure 2 illustrates the typical power consumption of a customer from the ISSDA dataset. We can see a fixed pattern on workdays and small differences on weekends. In the following, we describe each part of the smart grid big data system model in detail.

Fig. 2

One week of smart meter readings for a customer

Smart grid data generator model

In order to investigate the performance and benchmarking of big data technologies in smart grids, we need to process petabytes of data for smart grid data management and analytics. Using real data for performance evaluation would be preferable, but obtaining datasets with records of millions of customers is challenging because of privacy issues and the lack of trustworthy data sources. Therefore, in this section, we first describe a comprehensive smart meter data generator based on big data frameworks that can generate semi-realistic datasets with a large number of customers. Our proposed data generator uses a supervised machine learning algorithm that can generate data based on small real datasets. As shown in Fig. 1, the data generator has three blocks: a pre-process module, a Spark module, and an HDFS file system module. The pre-process module first checks the format and values in each field of the seed data, and then clusters the customers in order to offer personalized energy-efficiency services [29]. It also prepares the time series of data to send to the Spark module. The engine of the data generator includes two modules, a training module and a data generation module, which are implemented using the memory-based distributed computing framework Spark. In the first stage, a Resilient Distributed Dataset (RDD) is built in RAM to store temporary data. The seed data is deseasonalized using a moving average time series model in order to flatten the periodic variations and create an adjusted time series. By smoothing the peak periods, the moving average removes the periodic variations from the actual time series; an auto-regressive (AR) model trained on the flattened time series is then used to predict meter readings. In the end, the periodic variations are added back. The training process takes the clustered seed data as input, creates the models, and writes the output to HDFS.
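The deseasonalize, train, and re-seasonalize steps described above can be sketched on a single machine as follows. This is only a minimal illustration under stated assumptions, not the Spark implementation used in the paper: it uses NumPy instead of RDDs, fits the AR model by ordinary least squares, and takes the periodic component to be the average per-slot difference between the readings and their moving average. All function names, the AR order, and the noise level are hypothetical choices made for the example.

```python
import numpy as np

PERIOD = 48  # readings per day at 30-minute resolution


def moving_average(series, window):
    """Trailing moving average; entry i is the mean of series[i:i + window]."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")


def deseasonalize(series, period=PERIOD):
    """Split a series into a flattened series and a per-slot periodic profile."""
    flattened = moving_average(series, period)
    aligned = series[period - 1:]  # reading at the end of each averaging window
    slots = np.arange(period - 1, len(series)) % period
    residual = aligned - flattened
    profile = np.array([residual[slots == s].mean() for s in range(period)])
    return flattened, profile


def fit_ar(series, order):
    """Least-squares AR(order) fit: lag coefficients followed by an intercept."""
    n = len(series)
    lags = [series[order - k - 1:n - k - 1] for k in range(order)]
    design = np.column_stack(lags + [np.ones(n - order)])
    coef, *_ = np.linalg.lstsq(design, series[order:], rcond=None)
    return coef


def generate(flattened, profile, coef, n_steps, start_slot, noise_std, rng):
    """Predict the flattened series step by step, then add the profile back."""
    order = len(coef) - 1
    history = list(flattened[-order:])
    readings = []
    for step in range(n_steps):
        recent = history[-1:-order - 1:-1]  # most recent value first
        value = float(np.dot(coef[:-1], recent) + coef[-1]
                      + rng.normal(0.0, noise_std))
        history.append(value)
        slot = (start_slot + step) % len(profile)
        readings.append(max(value + profile[slot], 0.0))  # no negative readings
    return np.array(readings)


# Hypothetical seed: 14 days of one customer's readings with a daily cycle.
rng = np.random.default_rng(0)
t = np.arange(14 * PERIOD)
seed = 1.0 + 0.5 * np.sin(2 * np.pi * t / PERIOD) + rng.normal(0.0, 0.05, t.size)

flattened, profile = deseasonalize(seed)
coef = fit_ar(flattened, order=3)
synthetic = generate(flattened, profile, coef, n_steps=2 * PERIOD,
                     start_slot=0, noise_std=0.01, rng=rng)
```

In a distributed setting, one would presumably fit one such model per customer cluster and run the generation step in parallel for each synthetic meter; the sketch leaves that orchestration out.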
Since the generated data has a stream-oriented nature, we propose using Apache Kafka [30] as the message queue of the big data framework in our model. The output is inserted into the NoSQL database through the Kafka queue. The generated data has four fields: meter ID, time, power consumption, and environment temperature. The meter ID ranges from 1000 to 5001000. The time is stored as a five-digit numeric field: the first three digits represent the day number and the last two digits represent the half-hour slot of the day, with values from 1 to 48, where the first slot starts at midnight (00:00). The power consumption field is numerical. All of the data is therefore structured and column-based, and the range of each field is fully specified.

NoSQL database

NoSQL provides new data management technologies designed to meet the defining features of big data, i.e., volume, velocity, and variety. It can store and retrieve data that is modeled by means other than the tabular relations used in relational databases [31]. In comparison to relational databases, NoSQL databases are more scalable and provide superior performance when dealing with big data. Based on our experience and the recommendations of domain experts, we select four open source NoSQL databases: Hbase, Cassandra, MongoDB and Elasticsearch. In the following, we describe each technology.

Hbase

Hbase is a column store based on Apache Hadoop and the concepts of BigTable. It is a NoSQL key/value store that runs on top of HDFS [32]. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns or column versions from the table. Although Hbase includes tables, a schema is only required for tables and column families, not for columns, and it includes increment/counter functionality. Hbase has a centralized architecture in which the Master server is responsible for monitoring all RegionServer instances in the cluster (which serve and manage regions) and is the interface for all metadata changes. Hbase is optimized for reads, supported by a single-write master and the resulting strict consistency model, as well as the use of ordered partitioning, which supports row scans. It provides CP (Consistency, Partition tolerance) from the CAP theorem. The most important features of Hbase are strong consistency and fast reads and writes with high scalability.

Cassandra

Cassandra is also a column-store distributed database management system, based on ideas from BigTable and Amazon's Dynamo. Unlike Hbase, Cassandra has a decentralized architecture in which each node can perform any operation; it provides AP (Availability, Partition tolerance) from the CAP theorem [33]. Cassandra has excellent single-row read performance as long as eventual consistency semantics are sufficient for the use case. Cassandra quorum reads, which are required for strict consistency, will naturally be slower than Hbase reads. Cassandra is well suited to supporting single-row queries or selecting multiple rows based on a column-value index [34].
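The trade-off between eventual consistency and quorum reads can be made concrete with a little arithmetic. The sketch below is illustrative only: a quorum is a majority of replicas, and a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The replication factor of 3 used in the example is an assumption; the paper states only that the cluster has three nodes. The function names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, read_replicas, write_replicas):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


RF = 3  # assumed replication factor, not stated in the paper

# Quorum reads combined with quorum writes: strict consistency, slower reads.
strict = reads_see_latest_write(RF, quorum(RF), quorum(RF))

# Single-replica reads and writes: fast, but only eventually consistent.
eventual = reads_see_latest_write(RF, 1, 1)
```

With a replication factor of 3, a quorum read has to wait for two replicas instead of one, which is the source of the extra read latency mentioned above.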

MongoDB

MongoDB is a document-oriented database. All data in MongoDB is handled in JSON/BSON format. It is a schema-less database that can scale to terabytes of data. It also supports master-slave replication for keeping multiple copies of data across servers, which makes the integration of data in certain types of applications easier and faster [31]. MongoDB combines the best of relational databases with the innovations of NoSQL technologies, enabling engineers to build modern applications. MongoDB provides the data model flexibility, elastic scalability, and high performance of NoSQL databases. As a result, engineers can continuously enhance applications and deliver them at almost unlimited scale on commodity hardware [35].

Elasticsearch

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library. Elasticsearch is a real-time distributed search and analytics engine. It is used for full-text search, structured search, analytics, and all three in combination [36]. It has a very flexible query API that supports filtering, sorting, pagination, and aggregations in the same query. Elasticsearch handles unstructured data automatically, meaning that JSON documents can be indexed without predefining the schema. It tries to infer the field mappings and adds new fields automatically. Tasks such as clustering, replication of data, and automatic fail-over are also supported out of the box and are completely transparent to the user.
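As an illustration of combining filtering, sorting, pagination, and aggregation in one request, the following sketch builds a search request body as a plain Python dictionary. The field names `meter_id`, `time`, and `power` and the helper `build_meter_query` are assumptions made for this example; the paper does not give its index mapping. The body would be sent to the search endpoint of an index by whatever client is in use.

```python
import json


def build_meter_query(meter_id, first_code, last_code, page=0, page_size=48):
    """Build one search request body for the readings of a single meter."""
    return {
        # Filtering: exact meter and a range of five-digit time codes.
        "query": {
            "bool": {
                "filter": [
                    {"term": {"meter_id": meter_id}},
                    {"range": {"time": {"gte": first_code, "lte": last_code}}},
                ]
            }
        },
        # Sorting: chronological order.
        "sort": [{"time": "asc"}],
        # Pagination: offset and page length.
        "from": page * page_size,
        "size": page_size,
        # Aggregation: average consumption over all matching readings.
        "aggs": {"avg_power": {"avg": {"field": "power"}}},
    }


body = build_meter_query(meter_id=1000, first_code=101, last_code=748, page=1)
print(json.dumps(body, indent=2))
```

Because the filter clauses sit in the `filter` context of the `bool` query, they do not contribute to relevance scoring, which suits exact lookups on structured meter data.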

Implementation and benchmarking of big data technology in smart grid

In this section the architecture of the platform is briefly introduced and principles for building real-time processing systems are presented. Furthermore, the advantages of using the architecture in smart grids are summarized.

Cluster specification

To implement our ecosystem, we use a cluster of 6 nodes. Each node is a server with 20 physical CPU cores (40 logical cores with hyper-threading), 128 GB of RAM, and a 24 TB HDD. We assign one node to Spark, one node to the HDFS file system, and one node to the Kafka queue. Each of the database technologies (i.e., Hbase, Cassandra, MongoDB, and Elasticsearch) is implemented on a cluster of three nodes.

Smart grid data generator implementation

The smart meter data generator is implemented on two nodes. The implementation consists of two functions: a data preparator and a data generator. The initial values for the training algorithm and the specifications of the generated data are summarized in Table 1.

Table 1 Initial and generated data description

The generated data is stored in HDFS, and each data file has three columns: the meter ID; a five-digit code composed of a day code in digits 1–3 (day 001 = 1 January 2009) and a time code in digits 4–5 (1–48, one per 30 min interval, with 01 = 00:00:00–00:29:59); and the electricity consumed during each 30 min interval (in kWh). With this implementation we generate a big dataset covering 5,000,000 home and commercial customers over 1.5 years, which can be used to evaluate the performance of big data platforms on smart grid data.
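The five-digit day and time code can be decoded into a timestamp as sketched below. The day origin (day 001 = 1 January 2009) and the slot convention (01 = 00:00:00 to 00:29:59) come from the description above; the whitespace-separated record layout and the names `decode_time_code` and `parse_record` are assumptions made for the example.

```python
from datetime import datetime, timedelta

DAY_ONE = datetime(2009, 1, 1)  # day code 001 according to the text


def decode_time_code(code):
    """Return the start of the 30-minute interval encoded by a five-digit code."""
    text = f"{int(code):05d}"
    day, slot = int(text[:3]), int(text[3:])
    if day < 1 or not 1 <= slot <= 48:
        raise ValueError(f"invalid day/time code: {code!r}")
    return DAY_ONE + timedelta(days=day - 1, minutes=30 * (slot - 1))


def parse_record(line):
    """Parse one record, assumed to be 'meter_id code kwh' separated by spaces."""
    meter_id, code, kwh = line.split()
    return int(meter_id), decode_time_code(code), float(kwh)


record = parse_record("1000 00102 0.25")
```

Storing the code as an integer drops leading zeros, so the decoder pads the value back to five digits before splitting it.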

Big data backend implementation

Hbase

In Hbase, read and write processes have two stages: the data is first transferred to a region of the data node's RAM called the memstore and is then copied to an HFile. Figure 3 illustrates the Hbase read and write process.

Fig. 3

Hbase data model

Cassandra

We implement our cluster with three nodes arranged as a ring. Figure 4 shows the mechanism of data processing in Cassandra. During data insertion, the change made by each write command is stored in the memtable in RAM and then in an SSTable. In addition, to protect data against faults and node failures, a copy of each change caused by a write command is stored temporarily in an optimized on-disk structure called the commitlog.

Fig. 4

Cassandra data model

MongoDB

In our MongoDB implementation we follow the process depicted in Fig. 5. The implementation has a mongos function that acts as a router and determines where the data is stored. Once mongos has determined the storage path, the data is written directly to the tables.

Fig. 5

MongoDB data model

Elasticsearch

The data model of Elasticsearch that we have implemented is shown in Fig. 6.

Fig. 6

Elasticsearch data model

Results and discussion

In this section, we evaluate the performance of each big data technology with its default configuration in terms of effectiveness and scalability. Effectiveness is evaluated by comparing WRITE and READ parameters. Scalability is assessed by measuring the execution performance of the full mechanism. Table 2 summarizes the performance comparison of the four technologies. We compare the big data technologies using the most widely used criteria: run time, throughput, number of operations, and latency.
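To make the relationship between these criteria explicit, the following sketch measures them for an arbitrary operation: run time is the wall-clock duration of the whole workload, throughput is the number of operations divided by the run time, and latency is the duration of each individual operation. It is a minimal single-threaded illustration, not the benchmarking tool used in the paper; `run_benchmark` and the in-memory dictionary standing in for a database write are hypothetical.

```python
import time


def run_benchmark(operation, records):
    """Apply `operation` to every record and report the benchmark criteria."""
    latencies_ms = []
    start = time.perf_counter()
    for record in records:
        t0 = time.perf_counter()
        operation(record)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    run_time_s = time.perf_counter() - start
    count = len(latencies_ms)
    return {
        "operations": count,
        "run_time_s": run_time_s,
        "throughput_ops_per_s": count / run_time_s if run_time_s > 0 else 0.0,
        "avg_latency_ms": sum(latencies_ms) / count if count else 0.0,
        "latencies_ms": latencies_ms,
    }


# Stand-in for a database WRITE: store (meter_id, time_code, kwh) in a dict.
store = {}


def write(record):
    store[(record[0], record[1])] = record[2]


records = [(1000 + i, 101, 0.25) for i in range(1000)]
result = run_benchmark(write, records)
```

A real measurement against a database cluster would replace `write` with a client call and would typically issue operations from several threads, so that throughput reflects the cluster rather than a single client loop.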

Table 2 The summarized performance criteria results

Figure 7 shows the run time of each big data database technology. Cassandra has the lowest run time in both the write and read stages and consequently the best performance.

Fig. 7

Run time of different technologies

Figure 8 compares the throughput of the technologies in ops/sec. Cassandra has the best throughput and MongoDB the worst.

Fig. 8

Throughput of different technologies

Figure 9 shows the percentile latency of the different technologies. If the A% latency of a technology is B, then A% of the read or write latencies of this technology are less than or equal to B. For example, Fig. 9 shows that the 95% latency of MongoDB is 2 ms, so 95% of the read/write latencies of MongoDB are less than or equal to 2 ms. Cassandra has the smallest average latency in both read and write processes. Based on these results, we observe that Cassandra has the best overall performance among the four open source big data database technologies used for the analysis of smart grid big data at real scale. As outlined in the “Methods” section, smart meter data is structured, column-based data, and in our implementation the smart meter ID and time fields are used as the partition key for storing it. Since Cassandra is one of the best column-based databases, writing to it is faster than writing to the other technologies. Cassandra read operations are also faster than those of the other technologies because of the key structure and because the data is sorted by time.
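The definition of percentile latency given above corresponds to the nearest-rank percentile, which can be computed as in the sketch below. The function name and the sample values are hypothetical; the sample is not data from the paper.

```python
import math


def percentile_latency(latencies_ms, percent):
    """Smallest latency B such that `percent`% of the samples are <= B."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the interval (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical sample of 20 latencies in milliseconds.
sample = [float(v) for v in range(1, 21)]
p95 = percentile_latency(sample, 95)
p50 = percentile_latency(sample, 50)
```

Percentile latencies are less sensitive to a few very slow operations than the maximum and more informative about tail behaviour than the average, which is why benchmarks usually report both a high percentile and the mean.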

Fig. 9

Comparison of the percentile latency of Elasticsearch, MongoDB, Hbase and Cassandra

Conclusion

In this paper we proposed a framework to analyze smart grid big data. First, we proposed a comprehensive architecture for the evaluation of smart grid big data. Then we introduced a scalable data generator to overcome the lack of access to real smart grid big data. Since many big data technologies are currently available, we selected four of the best open source big data NoSQL databases and integrated them into the architecture of our large-scale smart grid data generator. Based on our implementation, we evaluated the performance of Elasticsearch, MongoDB, Hbase and Cassandra in terms of different criteria. Our results show that, among these database technologies, Cassandra has the best overall performance and satisfies all requirements for smart grid big data analysis. In the future, we plan to perform cyber security analysis using smart grid big data based on Cassandra. We also plan to extend our data generator to generate data related to electricity theft and crypto-currency mining.