A graph-based superframework for mixture model estimation using EM: an analysis of US wholesale electricity markets

A fully unsupervised graph-based superframework is proposed to handle the EM initialization problem for estimating mixture models on financial time series. Using a complex network approach that links time series and graphs, the graph-structured information derived from the observed data is exploited to produce a meaningful starting point for the EM algorithm. It is shown that structural information derived from complex graphs can effectively capture time series behavior and nonlinear relationships between different observations. The proposed methodology is employed to estimate Gaussian mixture models on US wholesale electricity market prices using two different configurations of the superframework. The results show that the proposed methodology performs better than conventional initialization methods, such as K-means-based techniques. The improvements are significant for the overall representation of the empirical distribution of log-returns and, in particular, for its first four moments. Moreover, this approach has a high degree of generalization and flexibility, exploiting graph manipulation and employing functional operating blocks that can be adapted to very different empirical situations.


Introduction
Network science is a well-established discipline that explores a broad variety of natural and social phenomena by describing actual relationships via complex network, or graph, structures [1]. There is a long history of effective applications of complex networks in various fields [2][3][4][5].
Several time series analysis techniques based on network science have been proposed in the previous decade, building on the huge body of research on network analysis and providing new insights and innovative perspectives on understanding the granular structure of time series [6,7]. Indeed, transforming a time series into a network deepens the analysis, leading to the discovery of non-trivial topological features of the time series [8,9] and providing new viewpoints and ideas for penetrating the probabilistic structure of time series [10,11].

Problem statement
Finite mixture models provide universal approximations to any continuous probability density and have proven to be very useful in describing observed data in many fields of application [12,13]. In particular, Gaussian mixture models (GMMs) are intriguing instances of mixture models that are frequently used as an effective tool for data analysis and modeling, especially in signal and information processing [14][15][16][17]. The methods for estimating the parameters of a mixture model are crucial to the model performance. Among others, the expectation-maximization (EM) algorithm, frequently used for training mixture models [18], provides an iterative approach for parameter estimation based on likelihood maximization. However, the initialization settings of the EM technique have a significant influence on the quality of the resulting solution [19]. Indeed, EM maximum likelihood estimation is subject to the local-optima problem, and choosing the right initial values is essential for getting accurate results [20][21][22]. Due to this predisposition toward local solutions, it is implausible that any given solution, regardless of the initialization strategy, would be globally optimal [22].

Related works
Clustering algorithms are used in the literature to provide meaningful solutions to the EM initialization problem. In particular, K-means [23] and random algorithms [24] have received great attention in recent years. In both these algorithms, the number of mixture components must be set exogenously. Moreover, cluster centers (centroids) are chosen randomly and, due to the high occurrence of local solutions, the estimation procedure must be re-initialized several times before a solution can be detected. In addition, no existing technique can determine when the number of initializations is sufficient to ensure a full examination of the likelihood function [22]. Other clustering algorithms for EM initialization, such as hierarchical clustering [25] or rough-enhanced-Bayes mixture estimation (REBMIX) [26], have been proposed in the literature. However, none of these strategies can be considered the best one [24]. Some authors investigated the feasibility of a fully unsupervised framework, i.e., one without any guesswork about the number of mixture components, obtained by mapping time series to complex networks, thus determining the optimal number of mixture model components and the vector of initial parameters [27]. That framework did an outstanding job of addressing the local-optima problem, providing accurate estimates of GMM parameters.

Purpose
The aim of this paper is to develop a fully unsupervised graph-based superframework to handle the EM initialization problem for estimating GMMs on financial log-return time series. In fact, although price time series are typically nonstationary, log-return time series, computed as the difference in log-prices between two subsequent observations, demonstrate better behavior [28]. We will show that the proposed approach has a high degree of generalization and flexibility, taking advantage of graph manipulation and employing functional operating blocks that can be adapted to very different empirical situations. This superframework is described in the end-to-end flowchart shown in Fig. 1.
The workflow is composed of three sections, namely the input, the distribution estimation, and the output. The input is the possibly preprocessed log-price time series. Indeed, in some cases time series of market prices need to be preprocessed to account for missing data imputation and trends [29]. The preprocessed log-price time series is then split into two branches, both entering the distribution estimation section: one goes into the initialization blocks (highlighted in red in Fig. 1), and the other goes directly into the GMM model block after a log-return transform. In the initialization blocks, the preprocessed time series undergoes three major transformations, namely graph encoding, graph compression, and graph partitioning, in order to determine the vector of initialization parameters. First, the graph associated with the preprocessed time series is created (graph encoding); then, by lowering the complexity of the graph (graph compression), log-return communities are identified (graph partitioning). Next, the number of GMM components is set equal to the number of detected communities, and the remaining initialization parameters of the EM algorithm are determined by log-return community membership, as will be explained in the text. In this way, all the inputs of the GMM block, i.e., the initialization parameters and the log-return time series, are assembled. Finally, the output section provides the parameters estimated through the EM algorithm.
The graph encoding step is realized by mapping the preprocessed log-price time series (hereinafter, log-price time series) into a quantile network (according to the taxonomy proposed in the literature [11]), using a Markov transition field (MTF) as the adjacency matrix [30][31][32][33]. The use of an MTF as adjacency matrix is particularly well suited to fully account for the transitions between the observed dynamic states [30].
Fig. 1 End-to-end workflow diagram. The red line highlights the initialization blocks
To reduce the complexity of the created graph, we employ two different graph compression techniques, namely graph coarsening [34] and graph embedding [35]. Graph coarsening is a model-driven technique used to reduce the size of a graph while maintaining its essential properties. Despite the plethora of literature on graph coarsening, model-driven methods are the most popular, as data-driven techniques are still under-explored [34]. Graph embedding, on the other hand, is based on a data-driven approach, transforming the graph into a multi-dimensional Euclidean vector space. The essential idea of graph embedding methods is to learn effective feature representations of network nodes in an unsupervised manner while preserving the graph's structural properties [35,36]. In this research, we employ a variety of embedding techniques using the Karate Club open source Python framework [37,38], which includes state-of-the-art and cutting-edge approaches for conducting unsupervised learning on graph-structured data. The classes of embedding listed in the Karate Club paper and suitable for our procedure are: (i) neighborhood preserving embedding; (ii) structural embedding; (iii) attributed embedding; (iv) meta-embedding. In particular, we used the Diff2Vec embedding method as representative of the first class [39], the GraphWave method for the second class [40], attributed social network embeddings (ASNE) for the third class [41], and network embedding update (NEU) for the fourth class [42].
The community detection task is performed in the graph partitioning block. We used two different techniques, the Louvain method [43] and a clustering technique based on topological data analysis (TDA) [44]. The Louvain method provides a technique of community discovery based on the optimization of the graph modularity. With a TDA-based clustering technique, instead, the community discovery process is performed directly on the graph embedding using the Topological Mode Analysis Tool (ToMATo) [45]. Compared to other current clustering methods, this latter strategy offers a distinct advantage: through the use of persistent homology, it provides an accurate unsupervised way of determining how many stable clusters the data contain [44,46,47,48].
We applied the proposed superframework to US wholesale electricity markets, investigating the behavior of daily electricity prices at Palo Verde (Southwest area), PJM (Northeast region), SP15 (Southern California) and Nepool (New England) in the time interval January 1, 2017 to December 31, 2021. US electricity price time series show irregular sampling (lack of daily data points) as well as trend and seasonality. Typically, electricity prices may be higher in winter and in summer, and the seasonal component must account for this semiannual periodicity. A trend must be included to account for expected inflation and possibly for a real power price escalation rate (positive or negative). For these reasons, a preprocessing step consisting of gap filling and seasonal-trend removal is first accomplished.
Within the superframework and focusing on the initialization blocks, we identify two different frameworks that combine the techniques described above: the first framework is composed of (i) graph encoding using the MTF, (ii) graph compression with graph coarsening techniques, and (iii) graph partitioning with the Louvain method; the second framework is composed of (i) graph encoding using the MTF, (ii) graph compression with graph embedding, and (iii) graph partitioning with ToMATo clustering. We call the first the graph approximation (GA) framework and the second the graph representation learning (GRL) framework. In a previous paper [27], we showed that the GA framework performs better than conventional initialization methods, such as K-means and random-based techniques. In this paper, we will see that the GRL framework provides even more accurate results than the GA framework. The higher likelihood values reached imply a stronger learning performance than the GA framework in the distribution estimation task. This is also because the GRL framework has a more conservative pipeline by design (less information loss in the graph compression phase), and because superior data-driven, unsupervised graph machine learning techniques reduce the need for custom feature engineering such as graph coarsening. In addition, embedding methods combined with TDA tools make feature extraction highly accurate and efficient.
To the best of our knowledge, this is the first study in which these embedding approaches are used with quantile graphs created with time series MTF encoding. In particular, we highlight that the combination of NEU meta-embedding applied to Diff2Vec, and the ASNE method stand out as especially effective embedding models, and their use with an MTF-encoded quantile graph is a completely new approach.
The rest of this study is structured as follows. The superframework is described in depth in Sect. 2. Section 3 focuses on the empirical analysis. Section 4 concludes. The code to reproduce our results is available in the GitHub repository at https://bit.ly/3zFtbQY.

Overview
This section illustrates the proposed superframework approach to the estimation of mixture models, outlining in detail the two distinct operational frameworks, namely the GA and the GRL frameworks, that will be used in the empirical analysis. Let us start, therefore, by briefly reviewing some basic facts about mixture models and the EM estimation algorithm.
Consider a possibly preprocessed log-price time series $\{x_t\}_{t=1}^{N}$ and its log-return transform $\{r_t\}_{t=1}^{N-1}$, where

$$r_t = x_{t+1} - x_t. \qquad (1)$$

Suppose that the values $r_i$, $i = 1, 2, \ldots, N-1$, are extracted in an IID manner from an underlying random model described by a probability density $p(r_i)$. We assume that $p(r_i)$ is a finite mixture distribution with $C$ components,

$$p(r_i) = \sum_{c=1}^{C} \alpha_c \, p_c(r_i \mid z_{ic} = 1, \theta_c). \qquad (2)$$

In Eq. (2), $z_i = \{z_{i1}, z_{i2}, \ldots, z_{iC}\}$, with $i = 1, 2, \ldots, N-1$, is a vector of $C$ unobservable latent binary random variables that are mutually exclusive and exhaustive (i.e., one and only one of the $z_{ic}$ is equal to 1, while the others are equal to 0). The vector $z_i$ plays the role of an indicator random variable representing the identity of the mixture component responsible for the generation of the outcome $r_i$; $p_c(r_i \mid z_{ic} = 1, \theta_c)$ denotes the density of mixture component $c$ with parameters $\theta_c$; $\alpha_1, \alpha_2, \ldots, \alpha_C$ are the mixture weights, i.e., positive numbers such that $\sum_{c=1}^{C} \alpha_c = 1$, with $\alpha_c$ representing the probability that the value $r_i$ was generated by component $c$; $\Theta = \{\alpha_1, \ldots, \alpha_C, \theta_1, \ldots, \theta_C\}$ is the complete set of mixture model parameters [49]. In the case of univariate Gaussian mixture models, the components of the model are described by univariate Gaussian densities, with parameters $\theta_c = \{\mu_c, \sigma_c^2\}$ denoting, respectively, the mean and the variance of the single-component distribution.
Mixture models can be efficiently estimated by maximum likelihood using the EM algorithm [18]. EM is an iterative procedure that begins with an initial estimate of the parameters, $\Theta = \Theta_0$, and then iteratively updates $\Theta$ until convergence is achieved. Each iteration consists of two steps, the E-step and the M-step.
In the E-step, given a set of parameters $\Theta$, the so-called membership weight of the data point $r_i$ in component $c$ is computed. It is defined as

$$w_{ic} = p(z_{ic} = 1 \mid r_i, \Theta). \qquad (5)$$

In this way, we obtain the $N \times C$ matrix of membership weights, in which each row sums to 1. From Bayes' rule, we can cast Eq. (5) in the following equivalent form,

$$w_{ic} = \frac{\alpha_c \, p_c(r_i \mid z_{ic} = 1, \theta_c)}{\sum_{k=1}^{C} \alpha_k \, p_k(r_i \mid z_{ik} = 1, \theta_k)}. \qquad (6)$$

In the M-step, the algorithm computes the parameter values that maximize the log-likelihood, starting from the values obtained by suitably aggregating the membership weights generated in the E-step. In the case of Gaussian mixture models, such an aggregation can be performed in the following way,

$$\alpha_c = \frac{N_c}{N}, \qquad (8)$$
$$\mu_c = \frac{1}{N_c} \sum_{i=1}^{N} w_{ic} \, r_i, \qquad (9)$$
$$\sigma_c^2 = \frac{1}{N_c} \sum_{i=1}^{N} w_{ic} \, (r_i - \mu_c)^2, \qquad (10)$$

where $N_c = \sum_{i=1}^{N} w_{ic}$ represents the effective number of data points assigned to component $c$. We notice that both the mean and the variance are computed in a way similar to the standard empirical mean and variance, but with fractional weights $w_{ic}$.
Once the new values of the parameters are obtained via maximum likelihood in the M-step, they are used in the subsequent iterations (composed of both the E-step and the M-step). The iterative procedure continues until convergence is achieved.
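The E- and M-steps above can be condensed into a short numpy sketch for a univariate GMM. This is a minimal illustration of Eqs. (6) and (8)-(10), not the paper's full estimation pipeline; the function name, iteration cap, and convergence tolerance are our own choices:

```python
import numpy as np

def em_gmm(r, alpha, mu, sigma2, n_iter=200, tol=1e-8):
    """Minimal EM for a univariate Gaussian mixture.

    r: 1-D array of log-returns; alpha, mu, sigma2: initial parameter
    vectors of length C (together forming the initial set Theta_0)."""
    r = np.asarray(r, dtype=float)
    loglik_old = -np.inf
    for _ in range(n_iter):
        # E-step: membership weights w_ic via Bayes' rule (Eq. (6))
        dens = np.exp(-0.5 * (r[:, None] - mu) ** 2 / sigma2) \
               / np.sqrt(2 * np.pi * sigma2)
        num = alpha * dens                       # N x C joint terms
        w = num / num.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters (Eqs. (8)-(10))
        Nc = w.sum(axis=0)                       # effective counts N_c
        alpha = Nc / len(r)
        mu = (w * r[:, None]).sum(axis=0) / Nc
        sigma2 = (w * (r[:, None] - mu) ** 2).sum(axis=0) / Nc
        # log-likelihood and convergence check
        loglik = np.log(num.sum(axis=1)).sum()
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return alpha, mu, sigma2, loglik
```

The E-step and M-step are exactly the two updates inside the loop; the stopping rule monitors the monotone increase of the log-likelihood.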
The EM algorithm can be started by selecting a set of initial parameters and then performing the E-step, or by selecting a set of initial weights and then conducting a first M-step. The initial parameters or weights can be chosen randomly or can be derived using some heuristic method [49]. Within the superframework, we identified two different operational frameworks to perform this initialization task, namely the GA and GRL frameworks. Figure 2 depicts the specific tasks performed in each framework, highlighting the major distinctions between the GA and GRL frameworks.
In both frameworks, the preprocessed log-price time series is encoded into a graph network using an MTF-based encoding. Then, the graph undergoes a compression phase, performed through graph coarsening techniques in the GA framework and graph embedding techniques in the GRL framework. Finally, community detection is accomplished by a partitioning procedure, performed using the Louvain method to optimize the modularity of the graph in the GA framework, and ToMATo multi-dimensional clustering of the graph embedding in the GRL framework. Table 1 summarizes the specific task of each block of Fig. 2.
The partitioning step serves as a feature extraction procedure for the distribution estimation, automatically generating the number of components of the mixture model and the initialization parameters. In fact, since each log-return value, $r_i = x_{i+1} - x_i$, can be uniquely associated with the value $x_i$, the detected communities can also be viewed as log-return communities. In both frameworks, the number of components is set equal to the number of detected communities, and the other initialization parameters, contained in the initializing vector $\Theta_0$, are computed through community membership (i.e., $w_{ic} = 1$ if the observation $r_i$ belongs to cluster $c$, and 0 otherwise) using Eqs. (8)-(10). Then, a first M-step is performed. Once the new values of the parameters are obtained via maximum likelihood in the M-step, they are used in the subsequent iterations (composed of both the E-step and the M-step) until convergence is obtained.
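The construction of $\Theta_0$ from detected communities amounts to a single application of Eqs. (8)-(10) with hard membership weights. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def init_from_communities(r, labels):
    """Build the initial GMM parameter set Theta_0 from detected
    log-return communities. labels[i] is the community of r_i; hard
    memberships w_ic in {0, 1} turn Eqs. (8)-(10) into per-community
    proportions, means, and (biased) variances."""
    r = np.asarray(r, dtype=float)
    labels = np.asarray(labels)
    comms = np.unique(labels)
    C = len(comms)                     # number of mixture components
    alpha = np.empty(C); mu = np.empty(C); sigma2 = np.empty(C)
    for c, comm in enumerate(comms):
        members = r[labels == comm]    # observations with w_ic = 1
        alpha[c] = len(members) / len(r)
        mu[c] = members.mean()
        sigma2[c] = members.var()      # biased variance, as in Eq. (10)
    return alpha, mu, sigma2
```

The number of components C is simply the number of distinct community labels, so no exogenous choice is needed.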

Graph encoding with Markov transition fields
This methodology is based on mapping the log-price time series into a graph. The purpose is to identify typical patterns or similar contexts with the aim of forming a network and detecting communities of nodes that may be associated with the components of a mixture model. In order to map a time series onto a complex network, the adjacency matrix must be defined. If the underlying dynamics is determined by a Markov process, the Markov transition matrix can be used as an adjacency matrix [11]. In fact, it allows us to properly account for the transitions among the different states of the observed dynamics. In such a case, the time series must first be quantized in order to define the dynamical states. This task can be accomplished in the following way. Let us consider on the real axis a certain number, say $Q$, of adjacent intervals obtained by positioning $Q+1$ cut points, $\{q_0, q_1, \ldots, q_Q\}$, that divide the log-price observations into contiguous intervals, hereinafter bins, with equal probabilities,

$$P(q_{j-1} < x \le q_j) = \frac{1}{Q}, \qquad j = 1, \ldots, Q.$$

In this way, we can group the observations into $Q$ bins, $B_1, B_2, \ldots, B_Q$, that identify the $Q$ states of the dynamics. In our analysis, both the empirical distribution (hereinafter, quantile binning) and the normal distribution fitted to the time series data (hereinafter, normal binning) are used. In the latter case, the whole real axis is divided into $Q$ intervals of equal probability. In both cases, each observation is mapped to the corresponding bin, $x_i \mapsto d_i$, where $d_i$ denotes the bin containing the value $x_i$. After assigning each $x_i$ to its corresponding bin $d_i$, we construct a $Q \times Q$ weighted matrix $X$ by counting transitions among bins in the manner of a first-order Markov chain. By transition we mean a transition between two consecutive observations, namely from $x_i$ to $x_{i+1}$, which identifies a transition from the bin (the state) $d_i$ to the bin (the state) $d_{i+1}$.
The Markov transition matrix is then obtained by identifying each element of $X$, namely $X_{h,k}$, with the relative frequency of transitions from bin $B_h$ to bin $B_k$ ($h, k = 1, \ldots, Q$). However, the information contained in the Markov matrix is too coarse. In order to refine the analysis, we introduce the so-called Markov transition field (MTF) [30,50], an $N \times N$ matrix whose elements are defined as follows,

$$M_{i,j} = X_{d_i, d_j}, \qquad i, j = 1, \ldots, N.$$

In a Markov transition field, the information contained in the Markov transition matrix is spread out along the whole time series. Indeed, $M_{i,j}$ represents the probability, as given in the Markov transition matrix $X$, of a transition from the bin $d_i$ containing $x_i$ to the bin $d_j$ containing $x_j$. The matrix $M$ results from spreading the matrix $X$ along the time axis and is intended to enhance analytical capabilities, enabling a comprehensive investigation of the associated network. The Markov transition field $M$ allows us to map a time series into a complex network. This can be done by using the matrix $M$ as the adjacency matrix of the graph $G$, mapping the vertices (nodes) $V$ to the row-column indices $(i, j)$ and the edge weights to $M_{i,j}$.
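The whole encoding (quantile binning, transition counting, and the spreading of $X$ into $M$) can be sketched in a few lines of numpy. This is an illustrative sketch: the function name and the handling of empty rows in the normalization are our assumptions, and the paper's implementation may treat edge cases differently:

```python
import numpy as np

def markov_transition_field(x, Q):
    """Encode a time series as an MTF: quantile binning into Q states,
    first-order transition counting, and M[i, j] = X[d[i], d[j]]."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Quantile binning: Q bins with (approximately) equal probability
    cuts = np.quantile(x, np.linspace(0, 1, Q + 1)[1:-1])
    d = np.searchsorted(cuts, x)          # d[i] = bin index of x[i]
    # Q x Q transition-count matrix X (first-order Markov chain)
    X = np.zeros((Q, Q))
    for i in range(N - 1):
        X[d[i], d[i + 1]] += 1
    # row-normalize to relative transition frequencies
    X /= np.maximum(X.sum(axis=1, keepdims=True), 1)
    # MTF: spread X along the time axis, M[i, j] = X[d[i], d[j]]
    M = X[np.ix_(d, d)]
    return M, d
```

The returned matrix `M` can then be used directly as the weighted adjacency matrix of the graph G.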

Graph compression
This section discusses the graph compression techniques employed, namely graph coarsening in the GA framework and unsupervised graph embedding in the GRL framework.

Graph coarsening
The goal of graph coarsening is to find a smaller graph $\widetilde{G}$ that is a good approximation of $G$ [51]. In the GA framework, to make the whole estimation procedure computationally more feasible and efficient, the size of the matrix $M$ is reduced by averaging its elements in each non-overlapping $g \times g$ sub-matrix with a kernel $1/g^2$, thus obtaining a reduced $S \times S$ square matrix, $\widetilde{M}$, to be used as the adjacency matrix for the generation of the coarsened graph $\widetilde{G}$.
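The block-averaging step can be sketched as follows (a minimal illustration; the function name is ours, and for simplicity any remainder rows and columns when g does not divide N are trimmed):

```python
import numpy as np

def coarsen_adjacency(M, g):
    """Reduce an N x N adjacency matrix by averaging each
    non-overlapping g x g sub-matrix (kernel 1/g^2), yielding an
    S x S matrix with S = N // g."""
    N = M.shape[0]
    S = N // g
    M = np.asarray(M, dtype=float)[:S * g, :S * g]   # trim remainder
    # reshape so axes 1 and 3 index positions inside each g x g block
    return M.reshape(S, g, S, g).mean(axis=(1, 3))
```

Each entry of the result is the mean edge weight between two groups of g consecutive nodes, which is what makes the coarsened matrix a sensible adjacency matrix for the reduced graph.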

Graph embedding
In the GRL framework, we make use of graph embedding, as a data-driven approach, instead of graph coarsening. Network embedding techniques have made significant contributions to the use of machine learning (ML) in network science [54]. Unsupervised feature extraction techniques for graph data are gaining popularity in the ML domain [35,55,56]. These methods automatically extract features that can be used as inputs for link prediction, node and graph classification, community detection, and a variety of other tasks in a wide range of real-world research and application contexts [35,55,57,58,59]. These graph mining technologies have contributed significantly to the advancement and development of ML [60,61]. In particular, node embeddings transform graph vertices to a Euclidean space, where nodes that are related according to a specific definition of closeness are close to each other. The Euclidean format, instead of the native graph, facilitates the use of conventional ML tools [37]. In this paper, we adopt graph embedding procedures belonging to the four classes of the Karate Club framework [37]: (i) neighborhood preserving embedding; (ii) structural embedding; (iii) attributed embedding; (iv) meta-embedding. While neighborhood preserving embeddings preserve the closeness of graph nodes, structural embeddings preserve the structural roles of nodes in the embedding space, and attributed embeddings maintain the neighborhood, the structure, and the generic attribute similarity of nodes. Meta-embeddings are designed to produce embeddings with a higher representation quality. To ensure generality and completeness, in the empirical analysis we use one method chosen from each class. The main characteristics of these methods are described below.
Diff2Vec is a neighborhood preserving embedding that uses diffusion processes on graphs to create node sequences with the aim of training a neural network. The weights of the trained neural network determine the embedding of the nodes. Experiments on community discovery revealed that this method detects high-quality communities [39].
GraphWave is a structural embedding that preserves the structural roles of nodes in the embedding space [40]. Different portions of a graph may contain nodes with comparable structural roles within their local network architecture. The discovery of such roles provides crucial information into the structure of networks. GraphWave computational complexity is proportional to the number of edges, enabling it to scale to huge networks.
Attributed social network embeddings (ASNE) enhance the previous approaches, which focus on neighborhood closeness and structural information [41]. Networks usually come with supplementary details, referred to as features or attributes, comprising not only the connectivity information (adjacency matrix) but also node features (attributes) related to the node context. In essence, node attributes have substantial effects on the organization of networks. By studying attribute homophily and network structure together, it is possible to learn informative node representations.
Network embedding update (NEU) belongs to the meta-embedding class [42]. It is an algorithm designed to improve the performance of any given network representation learning method. The running time of NEU is almost negligible. We will apply the NEU meta-embedding method to each of the previously listed methods.

Graph partitioning
This section describes the different strategies used to perform a graph mining task, namely community detection, in the GA and GRL frameworks. In general, a network is regarded as having a community structure if its nodes can be divided into groups whose internal connections are dense compared to the connections with external nodes. Communities can be classified as overlapping or non-overlapping, depending on whether a node can belong to more than one community or to only one; in our scenario, we adopted the latter option. In the GA framework, the Louvain method is applied to the coarsened graph; the GRL framework uses a clustering technique on the embedded multi-dimensional Euclidean space.

The Louvain method
To identify communities in the GA framework, we used the Louvain method [43]. The Louvain method provides an unsupervised technique of community discovery based on the optimization of the graph modularity that does not need the number of communities or their sizes as inputs.
Modularity compares the density of connections inside clusters with the density of connections between clusters. The choice of the Louvain algorithm is justified by the fact that, as well documented in the literature, this strategy performs very well on a variety of community detection benchmarks [62]. By partitioning the whole time series into communities, the Louvain method finds the number of mixture model components C by matching it to the number of detected communities.
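As a minimal illustration of the quantity the Louvain method optimizes, the modularity of a given partition can be computed directly from the weighted adjacency matrix. This sketch only evaluates the objective for a fixed partition (the function name is ours; real Louvain implementations optimize it greedily in multiple passes):

```python
import numpy as np

def modularity(A, labels):
    """Modularity of a partition of a weighted graph:
    Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j),
    where k_i are weighted degrees and 2m is the total edge weight."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                  # weighted node degrees
    two_m = k.sum()                    # total weight, 2m
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]   # delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m
```

Higher values indicate partitions whose intra-community edge weight exceeds what a random rewiring with the same degrees would produce.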

ToMATo clustering
Since the graph embedding generates a high-dimensional Euclidean representation of the graph, we have chosen TDA, an emerging field of research that aims to provide mathematical and algorithmic tools for understanding the topological and geometric structure of data, and that is particularly suited to high-dimensional data [63,64]. The ToMATo clustering technique is the unsupervised TDA tool that we used in the GRL framework for community detection.
Theoretically, this technique allows us to identify clusters that are stable under small perturbations of the input [44,65,66].


Empirical analysis
In this section, Gaussian mixture models are used to perform an experiment on US wholesale electricity price time series to assess the performance of the GA and GRL frameworks. Our data collection consists of daily prices observed in the time interval January 1, 2017 to December 31, 2021. The time series data can be freely downloaded from www.eia.gov/electricity/wholesale. The observed time series are shown in Fig. 3. Market price movements are characterized by volatility and extreme unpredictability, and are frequently accompanied by sharp spikes and jumps generated by changes in the supply-demand balance. All these time series show irregular sampling (as a result of weekends, holidays and other missing data due to market-specific reasons) as well as trend and seasonality. Typically, electricity prices may be higher in winter and in summer, and the seasonal component must account for this semiannual periodicity. A trend must be included to account for expected inflation and possibly for a real power price escalation rate (positive or negative). Data preprocessing was first conducted to fill in time series gaps and remove the observable trend and seasonality over time.

Data preprocessing
Let us denote by $\{y_{\tilde{t}}^{ob}\}$ the time series of daily prices defined on the incomplete observed raw time base $\{\tilde{t}\}$, and by $\{x_{\tilde{t}}^{ob}\}$ its natural logarithm transform, i.e., $x_{\tilde{t}}^{ob} = \log y_{\tilde{t}}^{ob}$. The time series $\{x_{\tilde{t}}^{ob}\}$ is sampled irregularly (not at evenly spaced time intervals), since weekends, holidays, and other sporadic missing days are not included in it. Imputation of missing data (gap filling) can be very informative and convey important knowledge [67], counteracting the phenomenon known as informative missingness [68]. In the presence of missing data, different time periods may have different information content. In fact, even when the market is closed, new information can emerge that can influence price dynamics [69,70].
The first step, therefore, involves the gap filling of the time series. To do this, we employ a complete daily grid and an imputation technique called missForest [71], an ML algorithm for data imputation that is completely agnostic about the data distribution. MissForest is employed to compute a value for each missing point. In this way, following the procedure proposed in [29], we extend the raw time series $\{x_{\tilde{t}}^{ob}\}$ onto a complete daily time base $\{t\}$ through an application $F$ that transforms $\{\tilde{t}\}$ into $\{t\}$ and computes the missing values using the missForest algorithm to fill the daily-complete grid, mapping $\{x_{\tilde{t}}^{ob}\}$ to $\{x_t^{f}\}$. In the second step, we look for the temporal trend and seasonality (hereinafter, trend) that must be eliminated in order to reveal the stochastic process driving the market dynamics. LOWESS (LOcally WEighted Scatterplot Smoothing) is used to perform this task [72,73]. LOWESS is a versatile method for removing the trend by fitting basic polynomial models to restricted portions of the data. The key benefit of LOWESS over other methods is that it does not need the specification of a global function or the assumption that the data must conform to a certain distribution shape [74]. Figure 4 shows the log-price time series, $\{x_t^{f}\}$, with the trend, $\{x_t^{tr}\}$, superimposed for each market under investigation.
Once detected, the trend can be eliminated from the filled time series, thus obtaining

$$x_t = x_t^{f} - x_t^{tr},$$

where we assumed, without loss of generality, that the complete daily time base is $\{t\} = \{1, \ldots, N\}$. Then, log-returns are computed as the difference between two successive daily log-prices, i.e., $r_t = x_{t+1} - x_t$.
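The preprocessing pipeline can be sketched end to end on a complete daily grid. This is a simplified stand-in for illustration only: linear interpolation replaces the missForest imputation and a centered moving average replaces the LOWESS trend fit; the function name and the window fraction `frac` are our own choices:

```python
import numpy as np

def preprocess(t_obs, x_obs, N, frac=0.3):
    """Sketch of the preprocessing pipeline on a complete daily grid
    0..N-1: gap filling, trend removal, and log-return computation.
    Stand-ins: np.interp instead of missForest, moving average
    instead of LOWESS."""
    grid = np.arange(N)
    # Step 1 (gap filling): map the raw series onto the daily grid
    x_f = np.interp(grid, t_obs, x_obs)
    # Step 2 (trend): centered moving average over an odd window of
    # roughly frac * N points (a crude proxy for a LOWESS fit)
    w = max(3, int(frac * N) | 1)
    kernel = np.ones(w) / w
    x_tr = np.convolve(np.pad(x_f, w // 2, mode="edge"),
                       kernel, mode="valid")
    x = x_f - x_tr                     # detrended log-prices
    r = np.diff(x)                     # log-returns r_t = x_{t+1} - x_t
    return x, r
```

In the paper's actual pipeline, missForest and LOWESS would replace the two stand-ins, but the data flow (fill, detrend, difference) is the same.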

Distribution estimation: the experimental setup
In this section, Gaussian mixture models are estimated on the market log-return time series, r_t, by maximum likelihood using the EM algorithm. The initialization is performed using both the GA and GRL frameworks. Let us focus on the initialization blocks. Figure 6 shows a magnified version of the distribution estimation section and summarizes the calibration procedure. The initialization blocks produce both the number of model components, C, and the initial parameter set, $\Theta_0$. Indeed, the number of components is set equal to the number of detected communities, and the initialization parameters are computed through community membership according to Eqs. (8)-(10). Then, a first M-step is performed and, subsequently, the EM algorithm is used to estimate the GMM parameter set, $\Theta$.
With respect to the GA framework, the GRL framework benefits from embedding techniques that make it possible to work directly on the full, non-reduced graph. In the GA framework, we performed the model estimation for each pair of values (Q, S) defined on a suitable grid. The number of bins, Q, varies from 2 to 100 in increments of 2 for both the quantile and normal binning strategies; the parameter S, which defines the dimension of the reduced adjacency matrix, varies from 5 to 400 in increments of 5. In the GRL framework, the parameters of the four embedding methods used in this study are the default parameters of the Karate Club framework. Moreover, the ToMATo clustering technique requires three parameters, namely the neighborhood information of the point cloud, the density estimator, and the merging parameter. We used the ToMATo clustering function contained in the Python library tomaster, which needs only the neighborhood information as input and automates the other two parameters. Since ToMATo relies heavily on neighborhood information, a popular choice to model the neighborhood is the K-nearest neighbors algorithm. As well documented in the literature, this ML algorithm performs well, recovering the correct clusters under an appropriate choice of the parameter K [44]. For the GRL framework, the model estimation is therefore performed by exploring a suitable two-dimensional grid for the pair (Q, K). In this grid, the number of bins, Q, varies from 2 to 100 in increments of 2 for both binning strategies; the parameter K varies from 2 to 100 in increments of 1. During the exploration of the grid, the embedding method always operates on the complete graph G, never using an approximate form like the coarsening method, which makes use of a reduced adjacency matrix of dimension S. This is one of the most important differences between the GA and GRL frameworks.
Table 3 reports the dimension of the Euclidean embedding space for each embedding method. This parameter is set to the default value provided by the library. In addition, Table 3 shows that two embedding methods, namely ASNE and NEU_ASNE, also receive the log-return time series as an additional node attribute. 3 As we will see in the next section, this feature can greatly improve the accuracy of the estimation procedure.

Results
This section outlines and discusses the empirical findings of the experiment. We analyze the outcomes of six embedding methods for the GRL framework (see Table 3) and one coarsening method for the GA framework. Since each method includes two different binning strategies (quantile and normal), the comparison involves 14 outcomes corresponding to 14 different combinations, namely 6 GRL-Q, 6 GRL-N, 1 GA-Q, and 1 GA-N. For each of these 14 combinations, we determined the best-performing model according to the BIC (Bayesian information criterion) [77,78]. By model we mean the combination of method and strategy, the grid parameters, and the number of Gaussian components determined by the partitioning technique. Each model therefore uniquely identifies the initialization point of the maximum likelihood estimation procedure. By varying the values of the grid parameters, the best-performing model is the one that minimizes the BIC value. We recall that the BIC resolves the model selection problem by introducing a penalty term for the number of parameters, handling the trade-off between the quality of fit and the simplicity of the model. Table 8 shows the estimation results obtained using a K-means based initialization technique. Our methodology outperforms this more conventional approach: for each market, the BIC value obtained with this initialization technique is greater than the maximum BIC value reported in the last row of Tables 4, 5, 6 and 7. Figure 7 shows, for each market, the log-return time series in which log-returns belonging to the communities used to initialize the EM algorithm for the top-performing model share the same color (left panel). In addition, a two-dimensional reduction of the multidimensional embedding of the top-performing model is also shown (right panel).
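The BIC-driven selection over a grid of candidate models can be sketched as follows. This is a simplified illustration, assuming univariate log-returns; scikit-learn's default (non-graph) initialization is used here only to demonstrate the criterion, not the superframework's own initialization:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_by_bic(r, candidate_components, seed=0):
    """Fit a GMM for each candidate number of components and keep the
    model minimizing the Bayesian information criterion."""
    X = np.asarray(r, dtype=float).reshape(-1, 1)
    best_bic, best_gmm = np.inf, None
    for C in candidate_components:
        gmm = GaussianMixture(n_components=C, random_state=seed).fit(X)
        bic = gmm.bic(X)        # log-likelihood plus a parameter-count penalty
        if bic < best_bic:
            best_bic, best_gmm = bic, gmm
    return best_bic, best_gmm
```

In the actual experiment, each grid point also fixes the initialization (and hence the number of components), so the loop runs over (Q, S) or (Q, K) pairs rather than over C directly.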
The reduction has been performed by t-distributed stochastic neighbor embedding (t-SNE), a technique well suited to the visualization of high-dimensional datasets [79]. The parameters of the top-performing models are provided in Table 9. The same table also shows the colors of the communities and the number of log-returns belonging to each community (count). Table 10 displays, for each market, a selection of three-component models chosen among the models that minimize the BIC value in the grid exploration. The model with the best BIC is, of course, listed in the first row of each sub-table. The models listed in the other rows were chosen to provide a trade-off between the BIC value and the value of the kurtosis error coefficient e4. The significance of this trade-off should be noted: the first row provides a better match according to the BIC, reproducing well the first three moments of the empirical log-return distributions, whereas the last row of each sub-table provides a very interesting compromise that also reproduces the fourth moment well. Finally, we remark that the NEU approach performs very well, appearing in several rows of Table 10, in particular when combined with the Diff2Vec embedding method. The ASNE approach, which exploits the extra information carried by the log-return time series as an additional attribute, also provides good results. However, the GA framework remains a plausible option, providing interesting results for the Paloverde and PJM log-return distribution estimates.
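The moment matching underlying this trade-off can be checked in closed form: the first four central moments of a univariate Gaussian mixture follow directly from the component weights, means, and variances. A minimal sketch (the function name is ours, and the e4 error coefficient itself, defined in the paper, is not reproduced here):

```python
import numpy as np

def mixture_moments(weights, means, variances):
    """Mean and second-to-fourth central moments of a univariate
    Gaussian mixture with the given component parameters."""
    w, mu, s2 = map(np.asarray, (weights, means, variances))
    m = np.sum(w * mu)                                    # mixture mean
    d = mu - m                                            # component offsets
    m2 = np.sum(w * (s2 + d**2))                          # variance
    m3 = np.sum(w * (3 * s2 * d + d**3))                  # third central moment
    m4 = np.sum(w * (3 * s2**2 + 6 * s2 * d**2 + d**4))   # fourth central moment
    return m, m2, m3, m4
```

Comparing these model-implied moments with the empirical moments of the log-returns is what distinguishes a model that fits the first three moments from one that also captures the kurtosis.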

Concluding remarks
We presented a superframework to handle the initialization problem of the EM algorithm for mixture models. Our method builds on a comparison with a prior study [27], which we extended into a general superframework based on graph machine learning. We applied the proposed methodology to estimate Gaussian mixture models on US wholesale electricity market prices using two different configurations of the superframework, namely the GA and GRL frameworks. Electricity market prices offer an excellent example of a complex system that transcends simple modeling. Using complex network approaches that link time series and graphs, we were able to exploit graph-structured information derived from electricity market data. We have shown that the GA framework performs better than conventional initialization methods, such as k-means and random-based techniques [27]. In this paper, we demonstrated that the GRL framework provides even more accurate results. Indeed, the GRL framework, incorporating data-driven and unsupervised graph embedding approaches in conjunction with TDA-based clustering, suffers less information loss and has a more conservative design than the GA framework. We found that structural information can effectively capture the behavior of time series and the nonlinear relationships between different observations. In particular, within the GRL framework, the combination of NEU meta-embedding applied to Diff2Vec and the ASNE method stand out as especially effective embedding models. The proposed superframework proves to be a very flexible analysis tool that can be further developed: its ability to incorporate a wide variety of methods within its functional building blocks makes it extremely adaptable to very different empirical situations.
From this point of view, our future research will follow three main directions: (i) since the output of the graph encoding block using MTF can be interpreted as an image [30,31], exploiting an image classification transfer learning technique [80] to generate the embedding in the graph compression step, using an image-pretrained VGG-16 deep convolutional neural network [81] for feature extraction; (ii) combining embedding and coarsening procedures in the graph compression block, as is done in other contexts [82,83], and testing the accuracy of the results; (iii) using the initialization blocks as a self-supervised learning framework to discover communities and assign them to different dynamics [84,85]. We leave these topics for future study.
Funding Open access funding provided by Università degli Studi G. D'Annunzio Chieti Pescara within the CRUI-CARE Agreement.
Data Availability The US wholesale electricity prices time series data that support the findings of this study are available from the repository Wholesale Electricity and Natural Gas Market Data, https://www.eia.gov/electricity/wholesale/.

Conflict of interest
The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.