Interpretable multiple data streams clustering with clipped streams representation for the improvement of electricity consumption forecasting

Abstract

This paper presents ClipStream, a new interpretable approach to clustering multiple data streams in a smart grid, used to improve the forecasting accuracy of aggregated electricity consumption and to support grid analysis. Consumers' time series streams are compressed and represented by interpretable features extracted from the clipped representation. The proposed representation has low computational complexity and is incremental in the sense of the windowing method. Outlier consumers can be detected simply and quickly from the extracted features. The clustering phase consists of three parts: clustering the non-outlier representations, aggregating consumption within clusters, and applying an unsupervised change detection procedure to windows of the aggregated time series streams. ClipStream's behaviour and its improvement in forecasting accuracy were evaluated on four real datasets containing variable patterns of electricity consumption. The clustering accuracy of the proposed feature extraction method from the clipped representation was evaluated on 85 time series datasets from a large public repository. The experimental results demonstrated the stability of ClipStream in improving forecasting accuracy and showed the suitability of the proposed representation in many tested applications.
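To make the idea of features extracted from a clipped representation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that clipping means thresholding each window at its own mean, and it uses a small illustrative feature set based on run lengths. The abstract specifies neither the threshold nor the exact features, so the function names (`clip`, `run_lengths`, `clipped_features`, `windowed_features`) and the feature names are hypothetical.

```python
from itertools import groupby
from statistics import mean


def clip(window):
    """Clip a window to bits: 1 where the value exceeds the window mean, else 0.

    Thresholding at the window mean is an assumption made for this sketch.
    """
    mu = mean(window)
    return [1 if x > mu else 0 for x in window]


def run_lengths(bits):
    """Run-length encode a bit sequence as (bit, length) pairs."""
    return [(bit, sum(1 for _ in group)) for bit, group in groupby(bits)]


def clipped_features(window):
    """Summarise one window by a few illustrative run-length features."""
    runs = run_lengths(clip(window))
    ones = [n for bit, n in runs if bit == 1]
    zeros = [n for bit, n in runs if bit == 0]
    return {
        "max_run_above": max(ones, default=0),   # longest stretch above the mean
        "total_above": sum(ones),                # number of points above the mean
        "max_run_below": max(zeros, default=0),  # longest stretch at or below the mean
        "crossings": len(runs) - 1,              # how often the series crosses the mean
    }


def windowed_features(stream, window_size):
    """Extract features from consecutive non-overlapping windows of a stream.

    Each window is processed on its own, so earlier windows never need to be
    recomputed when a new one arrives.
    """
    return [
        clipped_features(stream[i:i + window_size])
        for i in range(0, len(stream) - window_size + 1, window_size)
    ]
```

For example, `clipped_features([1, 1, 5, 6, 1, 7, 1, 1])` clips the window at its mean of 2.875 and summarises the resulting bit sequence `0 0 1 1 0 1 0 0`. Each feature has a direct reading in terms of consumption (how long and how often a consumer stays above their average load), which is the sense in which such a representation can be called interpretable.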

Notes

  1. http://www.ucd.ie/issda/data/commissionforenergyregulationcer/.

  2. http://energyhack.sk/?lang=en.

  3. https://github.com/PetoLau/ClipStream.

  4. https://cran.r-project.org/package=TSrepr.

  5. https://www.openml.org/d/41060.

Acknowledgements

This work was partially supported by the Slovak Research and Development Agency, Grant Nos. APVV-16-0484 and APVV-16-0213, and the Scientific Grant Agency of The Slovak Republic, Grant No. VG 1/0458/18.

Author information

Corresponding author

Correspondence to Peter Laurinec.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Responsible editor: Jesse Davis, Elisa Fromont, Derek Greene, Bjorn Bringmann.

About this article

Cite this article

Laurinec, P., Lucká, M. Interpretable multiple data streams clustering with clipped streams representation for the improvement of electricity consumption forecasting. Data Min Knowl Disc 33, 413–445 (2019). https://doi.org/10.1007/s10618-018-0598-2

Keywords

  • Data streams clustering
  • Time series representations
  • Electricity consumption forecasting