1 Introduction

Many applications, including internet of things (IoT) networks, sensor networks, social networks, gene networks, customer consumption data in utility companies, financial data, regional temperature data, and brain computer interface measurements, result in rapidly growing data volumes. IoT networks are a rapidly emerging technology in which smart sensors are connected to easily enable data collection anywhere and anytime [1]. Figure 1 illustrates how an IoT network may involve collecting large volumes of multi-variate time series data that are connected in some abstract sense.

Fig. 1

Illustration of learning a network graph structure from IoT data

Storing and analyzing these huge data volumes is challenging due to the need for high computation and power resources. The emerging field of graph signal processing (GSP) [2, 3] simplifies the analysis of large data volumes by the use of graph theory, where each graph node represents a component of the system. Several applications can be analyzed by graphs, which can capture the underlying topology among different entities of the network. For example, a network of thermal sensors that measure temperatures in different regions can be modeled by a graph.

In some cases, the graph topology is not known a priori. Thus, one of the desired goals of the GSP framework is estimating the underlying topology given a set of measured data. Several studies have concentrated on directed topology estimation [4,5,6,7,8,9,10,11,12,13,14], where specific process models have been used which are suitable only for limited applications. However, the Gaussian model is a ubiquitous signal model which can be utilized for a wide range of applications. For data sets generated by Gaussian distributions, the solution of the topology inference problem usually leads to an undirected graph structure. The “covariance selection” proposed by Dempster [15] was one of the pioneering works to capture the connectivity underlying the Gaussian measurements. Later on, Banerjee et al. [16] proposed an ℓ1 regularized optimization problem to find a sparse precision matrix, which is the inverse of the covariance matrix. The precision matrix carries information about the partial correlations between the random variables. Friedman and his colleagues proposed a fast algorithm, called graphical Lasso, to implement inverse covariance estimation [17]. There are other works investigating inverse covariance matrix estimation with different approaches and different rates of convergence [18, 19]. However, none of these approaches can leverage the signal model to infer the underlying graph topology. In other words, they focus on pairwise relationships of entities to find their structure instead of applying the signal model information. Besides, these approaches are very sensitive to noise since they are based purely on the measurement values.

A state-of-the-art machine learning approach for estimating a graph topology from measurements generated by a Gaussian Markov Random Field process has been proposed in [20]. The main limitations of this algorithm are twofold. First, no low-cost implementation is proposed that scales efficiently with the number of nodes. Second, the effect of noisy measurements on the performance evaluation has not been considered. Kalofolias has proposed an efficient and scalable algorithm to learn a graph from noise-free observations where it is assumed that the node degrees are positive (they cannot be zero) [21]. A generalization of the methods in [20] and [21] for different Laplacian and structural constraints has been proposed in [22]. Using graph signal frequency domain analysis, the underlying network topology is inferred from spectral templates in [23]. The authors assumed that there exists a diffusion process in the graph shift operator (GSO) that can generate the given observations and that the GSO is jointly diagonalizable with the observation covariance matrix. This method was first applied to capture the underlying topology when the graph signals are generated as the output of a GSO filter fed by a white signal input. Then, this method was extended to colored inputs, generating non-stationary diffusion processes. Some other research works with similar ideas are presented for different stationary and non-stationary processes in [24,25,26,27,28]. Dictionary learning [29, 30] and transform learning [31] have also been used for inferring the graph topology. In [29,30,31], a specific relation between the Laplacian matrix and the dictionary atoms has been sought, and hence these algorithms are applicable when we have some knowledge about the signal representation in the transform domain. There are also many other papers investigating topology identification and graph learning methods from different perspectives [8, 32,33,34,35,36,37,38,39,40,41]. However, none of the above approaches has discussed graph signal recovery and topology learning when the measurements are corrupted by noise.

The inexpensive sensors found in a wide variety of applications distort the observed data samples. The resulting measurement errors are conventionally modeled as noise [42], and it is generally desirable to remove this noise in many applications [43, 44], which generally produces a more accurate learned topology. In a recent study [45], Liu and her colleagues proposed a noisy data cleansing algorithm to remove unwanted distortion from Industrial IoT (IIoT) sensor data in manufacturing. They showed the efficiency of their proposed method on multi-variate time-series data collected from a real IIoT-based four-stage compressor manufacturing plant. In [46], the effect of distortion on trading data has been discussed, where the stock market’s impact on this data is characterized by a graph [47]. There are also other studies in the GSP framework considering the noise effect on the observations, e.g. [48], but they assumed knowledge of the underlying graph and did not tackle graph topology inference and noise removal at the same time. A naive strategy is to find the graph topology first and then remove the noise from the graph signal, but this approach performs poorly due to the accumulation of errors in topology learning and signal estimation.

To solve this type of problem in a topology learning framework, an optimization problem is usually formulated with at least two objective terms. The first term is a data fidelity term controlling the energy of the signal recovery error. The second term takes care of the graph signal smoothness over the learned topology. A smooth signal has smooth variations on the underlying topology, or in other words, connected nodes in the estimated structure have similar signal values. To overcome over-fitting, the data fidelity term is regularized by a smoothing term, and an exhaustive search is applied to find the regularization parameter. In many applications, e.g. image restoration [49], the regularization parameter is directly related to the noise variance. Thus, estimating the noise variance not only provides a cleaner version of the signal for further analysis, but also helps to overcome the over-fitting problem. All things considered, the estimation of the noise variance is the key step in designing the filter for smoothing the signal and denoising the given measurements.

It is usually assumed that the measurement noise has a zero-mean Gaussian distribution and is independent of the desired signal. Thus, finding the noise variance is of interest for designing a filter for signal denoising. The noise variance estimation is not only useful from a signal recovery perspective, it is also important from a graph learning point of view. Since finding the underlying structure from a noise-free data set is more accurate than from noisy graph signals, signal denoising can contribute to estimating a more accurate graph topology. In this respect, Chepuri et al. [50] proposed an approach to find the connectivity graph and remove noise from the measurements. They proposed a non-convex, cardinality-constrained, Boolean optimization problem along with a convex relaxation. They assumed that the number of edges is known in advance, and the proposed method scales with the desired number of edges. Another approach has been proposed in [51], which estimates the topology and removes the noise from the measured signals simultaneously. They adopted a factor analysis model for the multi-variate signals and imposed a Gaussian prior on the latent variables that control the graph signals. By applying a Bayesian framework, the maximum a posteriori (MAP) estimate of the latent variable is investigated. Considering signal smoothness over the underlying graph, this procedure leads to an optimization problem over the graph topology and graph signals simultaneously. A limitation of this approach is that an exhaustive search is applied to find the regularization parameter or noise variance, which is not well suited to situations in which the measurement noise varies over time.

Our contribution Given a set of noisy multi-variate signal measurements, we propose a new algorithm to perform the underlying topology learning and graph signal noise removal. This graph topology characterizes the affinity relationship among the multi-variate signals. We propose a minimum mean square error (MMSE) estimation approach for signal recovery and topology learning and, unlike [51], we estimate the regularization parameters analytically rather than by using a grid search. We show that the regularization parameters are linked to the noise variance, which is usually unknown, and hence finding the noise variance helps to adjust the parameters precisely. Besides, it plays the main role in designing a filter to denoise the graph measurements. In other words, to compare the current paper with other similar ones, especially [51], two specific points must be considered: (1) understanding the relation between the amount of noise and the regularization parameters in the denoising procedure, and (2) a rigorous strategy to estimate the noise variance instead of assuming that it is known in advance. The MMSE procedure is utilized to estimate the optimal Laplacian matrix, whose eigenvalue matrix is the precision matrix of the Gaussian Markov Random Field process. This work is an extension of our previous paper [52], in which we studied the problem of joint graph signal recovery and topology learning using an off-the-shelf optimization toolbox to implement the algorithm. However, in this version, we propose a fast algorithm and prove its convergence analytically. Moreover, we provide simulation results on real-world data sets. To compare the results with state-of-the-art methods, we apply measures that evaluate performance from two distinct perspectives: signal recovery error and graph topology estimation accuracy. Therefore, we can compare our proposed method against other methods that employ signal denoising strategies, like the one proposed in [22]. A simple graph filter is applied based on the estimated noise variance, which can be used in many multi-variate measurement applications, including industrial Internet of Things (IIoT) applications. In the preprocessing stage of some IIoT applications, it is necessary to denoise the measurements, remove outlier data, detect anomalies, and estimate the missing values of some data. With the proposed method, the knowledge of the underlying data structure can be applied to implement all of these tasks. As an application of our idea in the area of IIoT networks, we apply our proposed method to estimate the missing values of sensor readings for a power consumption dataset.

The rest of the paper is organized as follows: in Sect. 2, some preliminaries about graph theory and signal processing on graphs are presented. Section 3 formulates the graph topology learning problem via a Bayesian framework and proposes a general algorithm for implementation, called Bayesian Topology Learning (BTL). In Sect. 4, a method is proposed to implement BTL efficiently via solving a proximal point algorithm, called BTL-PPA. The proof of convergence is presented in Sect. 5. Section 6 presents experimental results obtained from the proposed method for temperature sensor networks and IIoT applications by using both real and synthetically simulated data. Finally, Sect. 7 concludes the paper. Throughout the paper, lowercase regular letters, lowercase bold letters, and uppercase bold letters denote scalars, vectors, and matrices, respectively. The rest of the notation is presented in Table 1.

Table 1 Table of notation

2 Background on graph signal processing

Let \({\mathcal {G}}=({\mathcal {V}} ,{\mathcal {E}}, {\mathbf {W}} )\) be a graph with the vertices \(v_i\in {\mathcal {V}}\), the edge set \({\mathcal {E}}\), and the undirected weight matrix \({\mathbf {W}}\). Each edge \((v_i,v_j), 1\le i,j\le N\), has a weight \(w_{ij}\) in the corresponding entry of \({\mathbf {W}}\in {\mathbb {R}}^{N\times N}_+\). Thus, \(w_{ij}=0\) implies no connection and \(w_{ij}=w_{ji}>0\) quantifies the strength of the connection between the node \(v_i\) and the node \(v_j\). For simplicity of presentation, we assume that the diagonal entries of the weight matrix are zero, i.e. there is no self loop. The adjacency matrix \({\mathbf {A}}\) stores only the existence of the edges, regardless of their weights. In other words, if there is a connection or edge between the vertices \(v_i\) and \(v_j\), the adjacency matrix entry \(a_{ij}=1\), and zero otherwise. The degree matrix is defined as \({\mathbf {D}}=\text {diag}(d_{i}),1\le i\le N\), where \(\text {diag}(d_i)\) denotes a square diagonal matrix with the degrees \(d_1, d_2, \dots , d_N\) on its main diagonal. The degree \(d_{i}\) is the sum of the edge weights connecting node i to its neighbors. Then, \({\mathbf {D}}=\text {diag}({\mathbf {W}}\cdot {\mathbf {1}}_N)\), where \({\mathbf {1}}_N\) is the all-ones \(N\times 1\) vector. The combinatorial and normalized graph Laplacian matrices are defined as \({\mathbf {L}}={\mathbf {D}}-{\mathbf {W}}\) and \({\mathbf {L}}_{\text {norm}}={\mathbf {I}}_N-{\mathbf {D}}^{-\frac{1}{2}} {\mathbf {WD}}^{-\frac{1}{2}}\), respectively, where \({\mathbf {I}}_N\) is the identity matrix of size \(N\times N\). Since the Laplacian matrix is real and symmetric, its eigendecomposition is written as follows

$$\begin{aligned} {\mathbf {L}} = \varvec{\chi \varLambda }\varvec{\chi }^T, \end{aligned}$$
(1)

where \(\varvec{\varLambda }\) and \(\varvec{\chi }\in {\mathbb {R}}^{N\times N}\) are the eigenvalue and eigenvector matrices, respectively, and \((\cdot )^T\) denotes the matrix transpose operator. For the normalized graph Laplacian, all eigenvalues are between 0 and 2. Since the Laplacian matrix can uniquely characterize the graph structure, the graph topology learning problem can be formulated as a problem of graph Laplacian matrix learning. The kth graph signal is represented as follows

$$\begin{aligned} \begin{aligned}&{\mathbf {y}}[k] : {\mathcal {V}} \rightarrow {\mathbb {R}}^N, { v_i} \mapsto {y_i}[k]\\&{\mathbf {y}}[k]=\left({y_1[k], y_2[k], ..., y_{{N}}[k]}\right)^T \in {\mathbb {R}}^N, \end{aligned} \end{aligned}$$
(2)

An important concept in the GSP framework is signal smoothness with respect to the intrinsic structure of the graph. The local smoothness around vertex i at time k is given as [2]

$$\begin{aligned} \left\| \nabla _i{\mathbf {y}}[k]\right\| _2:=\left[\sum _{j\in {\mathcal {N}}_i}W_{ij}\left[y_j[k]-y_i[k]\right]^2\right]^{\frac{1}{2}}, \end{aligned}$$
(3)

where \({\mathcal {N}}_i\) is the neighborhood of node i. Then, to quantify the global smoothness (GS), we have [2]

$$\begin{aligned} \begin{aligned} GS_2({\mathbf {y}}[k])=&\sum _{i\in {\mathcal {V}}}\sum _{j\in {\mathcal {N}}_i}W_{ij}\left[y_j[k]-y_i[k]\right]^2\\ =&\sum _{(i,j)\in {\mathcal {E}}}W_{ij}\left[y_j[k]-y_i[k]\right]^2={\mathbf {y}}^T[k]{\mathbf {Ly}}[k]. \end{aligned} \end{aligned}$$
(4)
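To make these definitions concrete, the following minimal NumPy sketch (not part of the original derivation) builds \({\mathbf {D}}\), \({\mathbf {L}}\), and \({\mathbf {L}}_{\text {norm}}\) from a small hypothetical weight matrix and evaluates the global smoothness (4) for a smooth and a non-smooth signal.

```python
import numpy as np

# Hypothetical symmetric weight matrix W for a 4-node graph (zero diagonal, no self loops)
W = np.array([[0., 1., 0., 2.],
              [1., 0., 3., 0.],
              [0., 3., 0., 1.],
              [2., 0., 1., 0.]])
N = W.shape[0]

A = (W > 0).astype(float)                            # adjacency matrix: edge existence only
D = np.diag(W @ np.ones(N))                          # degree matrix D = diag(W 1_N)
L = D - W                                            # combinatorial Laplacian L = D - W
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt     # normalized Laplacian

Lam, chi = np.linalg.eigh(L)                         # eigendecomposition L = chi Lambda chi^T, eq. (1)

# Global smoothness (4): GS_2(y) = y^T L y; a signal with similar values on connected
# nodes gives a smaller value than a rapidly varying one
y_smooth = np.array([1.0, 1.1, 1.2, 1.05])
y_rough = np.array([1.0, -1.0, 1.0, -1.0])
print(y_smooth @ L @ y_smooth, y_rough @ L @ y_rough)
```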

Figure 2 compares the smoothness of a signal living in different topologies.

Fig. 2

The same graph signal residing on three different graph topologies. The blue and black bars show positive and negative signal values, respectively, and red circles denote graph vertices. The signal is smooth with respect to \({\mathcal {G}}_1\), less smooth with respect to \({\mathcal {G}}_2\), and likely to be non-smooth with respect to \({\mathcal {G}}_3\). The edge weights are all ones. These figures are generated by GSPBOX [53]

3 Bayesian Topology Learning

Bayes’ rule relates the probability of an event to the prior knowledge that we have about it and updates the belief accordingly. Here, in the same way, it is assumed that we have some knowledge about the latent variable which factorizes the graph signals and also relates to the underlying graph structure. By Bayesian inference, we investigate the posterior probability of this latent variable and use its probability distribution function to find the graph Laplacian matrix and denoise the observed graph signals.

Assume that we are given a data matrix \({\mathbf {X}}\) of size \(N\times K\), containing noisy graph signal measurements in its columns. The rows of the matrix \({\mathbf {X}}\) correspond to the vertices of the underlying graph and each column/measurement is as follows

$$\begin{aligned} {\mathbf {x}}[k]={\mathbf {y}}[k]+{\mathbf {e}}[k], \quad k=1,\dots ,K. \end{aligned}$$
(5)

Let us adopt a multi-variate Gaussian distribution for the measurement noise \({\mathbf {e}}[k]\), given by the following probability density function (pdf)

$$\begin{aligned} p({\mathbf {e}}[k]; \sigma )\sim {\mathcal {N}}({\mathbf {0}},\sigma {\mathbf {I}}_N) \end{aligned}$$
(6)

where \({\mathbf {0}}\in {\mathbb {R}}^N\) is the all-zero vector.

To find the underlying graph topology, first, we use the factor analysis model to leverage a representation matrix that can be linked to the graph Laplacian/topology directly. In this model, each graph signal is represented by \({\mathbf {y}}[k] = \varvec{\chi } {\mathbf {h}}[k]+{\mathbf {u}}[k],\quad \forall k\), where the unobserved latent variable \({\mathbf {h}}[k]\in {\mathbb {R}}^N\) controls each graph signal via the eigenvector matrix \(\varvec{\chi }\) [54] and \({\mathbf {u}}[k]\in {\mathbb {R}}^N\) is the mean of graph signal \({\mathbf {y}}[k]\). For simplicity and without loss of generality, hereafter, it is assumed that \(\forall k,{\mathbf {u}}[k]={\mathbf {0}}\). As discussed in [51], the motivation of using the factor analysis model is that each graph signal can contribute to the underlying graph structure estimation. Similar to [51], it is assumed that \({\mathbf {h}}[k]\) follows a degenerate zero mean multi-variate Gaussian distribution as follows

$$\begin{aligned} p \left({\mathbf {h}}[k]; \varvec{\varLambda }^{\dagger }\right)\sim {\mathcal {N}}({\mathbf {0}}, \varvec{\varLambda }^{\dagger }), \end{aligned}$$
(7)

where \((\cdot )^{\dagger }\) denotes the Moore–Penrose pseudo-inverse and \(\varvec{\varLambda }\) is the precision matrix. Considering all graph signals, we have

$$\begin{aligned} {\mathbf {X}} = \varvec{\chi }{\mathbf {H}}+{\mathbf {E}}, \end{aligned}$$
(8)

where \({\mathbf {H}}=\left[{\mathbf {h}}[1],\dots ,{\mathbf {h}}[K]\right]\) and \({\mathbf {E}}=\left[{\mathbf {e}}[1],\dots ,{\mathbf {e}}[K]\right]\). Equivalently, by using the Kronecker product \(\otimes\) property, all given data can be vectorized as follows

$$\begin{aligned} {\mathbf {x}} = \mathbf {Bh}+{\mathbf {e}}, \end{aligned}$$
(9)

where \({\mathbf {x}}=\text {vec}\left({{\mathbf {X}}}\right)\), \({\mathbf {B}}={\mathbf {I}}\otimes \varvec{\chi }\), \({\mathbf {h}}=\text {vec}\left({{\mathbf {H}}}\right)\), \({\mathbf {e}}=\text {vec}\left({{\mathbf {E}}}\right)\), and vec(\(\cdot\)) stacks all columns of the matrix in a column vector. Thus, we have the following pdf’s

$$\begin{aligned}&p \left({\mathbf {h}}; \varvec{\varLambda }^{\dagger }\right)\sim {\mathcal {N}}({\mathbf {0}}, {\mathbf {C}}_0^{\dagger }), \end{aligned}$$
(10)
$$\begin{aligned}&p \left({\mathbf {x}}\mid {\mathbf {h}};\varvec{\chi }, \sigma \right)\sim {\mathcal {N}} \left({\mathbf {B}} {\mathbf {h}},\sigma {\mathbf {I}}\right), \end{aligned}$$
(11)
$$\begin{aligned}&p \left({\mathbf {x}};\varvec{\varLambda }^{\dagger },\varvec{\chi },\sigma \right)\sim {\mathcal {N}}\left({\mathbf {0}},{\mathbf {BC}}_0^{\dagger }{\mathbf {B}}^T+\sigma {\mathbf {I}}\right), \end{aligned}$$
(12)

where \({\mathbf {C}}_0={\mathbf {I}}\otimes \varvec{\varLambda }\). By applying the eigenvector matrix identity \(\varvec{\chi }\varvec{\chi }^T={\mathbf {I}}\) and Kronecker product properties, we have

$$\begin{aligned} {\mathbf {BC}}_0^{\dagger }{\mathbf {B}}^T=\left({\mathbf {I}}\otimes \varvec{\chi }\right)\left({\mathbf {I}}\otimes \varvec{\varLambda }^{\dagger }\right)\left({\mathbf {I}}\otimes \varvec{\chi }\right)^T={\mathbf {I}}\otimes \varvec{\chi }\varvec{\varLambda }^{\dagger }\varvec{\chi }^T, \end{aligned}$$
(13)

and thus, the covariance matrix of \({\mathbf {x}}\) in (12) is simplified as follows

$$\begin{aligned} {\mathbf {BC}}_0^{\dagger }{\mathbf {B}}^T+\sigma {\mathbf {I}}={\mathbf {I}}\otimes \left({\mathbf {L}}^{\dagger }+\sigma {\mathbf {I}}\right), \end{aligned}$$
(14)

where \({\mathbf {L}}^{\dagger }= \varvec{\chi }\varvec{\varLambda }^{\dagger }\varvec{\chi }^T\).
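The Kronecker identities (13)–(14) can be checked numerically. The sketch below is only a sanity check on a small random Laplacian; the sizes and the tolerance for zero eigenvalues are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, sigma = 5, 3, 0.2

# Build a random valid Laplacian: symmetric non-negative W with zero diagonal
W = rng.random((N, N)); W = np.triu(W, 1); W = W + W.T
L = np.diag(W.sum(1)) - W
Lam_vals, chi = np.linalg.eigh(L)
Lam_pinv = np.diag([0.0 if v < 1e-10 else 1.0 / v for v in Lam_vals])

B = np.kron(np.eye(K), chi)                           # B = I (x) chi
C0_pinv = np.kron(np.eye(K), Lam_pinv)                # C0^dagger = I (x) Lambda^dagger
L_pinv = chi @ Lam_pinv @ chi.T                       # L^dagger = chi Lambda^dagger chi^T

lhs = B @ C0_pinv @ B.T + sigma * np.eye(N * K)       # covariance of x in (12)
rhs = np.kron(np.eye(K), L_pinv + sigma * np.eye(N))  # right-hand side of (14)
print(np.allclose(lhs, rhs))                          # True
```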

The posterior density of \({\mathbf {h}}\) is obtained by applying the Bayes rule as follows

$$\begin{aligned} p \left({\mathbf {h}}\mid {\mathbf {x}};\varvec{\varLambda }^{\dagger },\varvec{\chi },\sigma \right)=\frac{p \left({\mathbf {x}}\mid {\mathbf {h}};\varvec{\chi }, \sigma \right)p \left ({\mathbf {h}}; \varvec{\varLambda }^{\dagger }\right)}{p \left({\mathbf {x}};\varvec{\varLambda }^{\dagger },\varvec{\chi },\sigma \right)}, \end{aligned}$$
(15)

and the Minimum Mean Square Error (MMSE) estimator of the latent variable is defined as below [55]

$$\begin{aligned} \hat{{\mathbf {h}}}= \frac{1}{\sigma }{\mathbf {C}}_{\varvec{\epsilon }}{\mathbf {B}}^T{\mathbf {x}}, \end{aligned}$$
(16)

where the covariance matrix is

$$\begin{aligned} {\mathbf {C}}_{\varvec{\epsilon }}=\left({\mathbf {C}}_0+\frac{1}{\sigma }{\mathbf {B}}^T{\mathbf {B}}\right)^{-1} =\left({\mathbf {C}}_0+\frac{1}{\sigma }{\mathbf {I}}\right)^{-1} ={\mathbf {I}}\otimes \left(\varvec{\varLambda }+\frac{1}{\sigma }{\mathbf {I}}\right)^{-1} \end{aligned}$$
(17)

If \(\sigma\) increases, the distributions in (11) and (12) have larger uncertainties and thus the estimate in (16) is less accurate and has a larger MSE. Due to the linear relationship between \({\mathbf {y}}\) and \({\mathbf {h}}\), i.e. \({\mathbf {y}}={\mathbf {Bh}}\), the graph signals can be estimated by \(\hat{{\mathbf {y}}}={\mathbf {B}}\hat{{\mathbf {h}}}\), with \(\hat{{\mathbf {h}}}\) the posterior mean in (16), which can be simplified as follows

$$\begin{aligned} \hat{{\mathbf {y}}}=\left({\mathbf {I}}\otimes \left({\mathbf {I}}+\sigma {\mathbf {L}}\right)^{-1}\right){\mathbf {x}}, \end{aligned}$$
(18)

where \(\sigma\) and \({\mathbf {L}}\) will be estimated later. Equivalently, (18) can be represented in a matrix form as \(\hat{{\mathbf {Y}}}=\left({\mathbf {I}}+\sigma {\mathbf {L}}\right)^{-1}{\mathbf {X}}\), which is similar to the one presented in [51], where \(\hat{{\mathbf {Y}}}=\left({\mathbf {I}}+\alpha {\mathbf {L}}\right)^{-1}{\mathbf {X}}\) for a parameter \(\alpha\). In [51], \(\alpha\) is the regularization parameter corresponding to the smoothness of graph signals over the topology. In other words, the graph signals \({\mathbf {X}}\) are smoothed by the graph filter \(\left({\mathbf {I}}+\sigma {\mathbf {L}}\right)^{-1}\), and hence \(\sigma =\alpha\) can be called the filter coefficient, the regularization parameter, or the noise variance, interchangeably. If \(\sigma\) approaches zero, \(\hat{{\mathbf {Y}}}\) and \({\mathbf {X}}\) become identical, and for larger \(\sigma\), the effect of the graph Laplacian matrix on filtering the signal is larger. In other words, we have

$$\begin{aligned} \hat{{\mathbf {Y}}}=\left({\mathbf {I}}+\sigma {\mathbf {L}}\right)^{-1}\left({\mathbf {Y}}+{\mathbf {E}}\right)={\mathbf {Y}}+{\mathbf {E}}-\sigma {\mathbf {L}}\hat{{\mathbf {Y}}}, \end{aligned}$$
(19)

and thus for the true denoised version of \({\mathbf {Y}}\), i.e. \(\hat{{\mathbf {Y}}}={\mathbf {Y}}\), we have \({\mathbf {E}}=\sigma {\mathbf {L}}\hat{{\mathbf {Y}}}\). Given a fixed error \({\mathbf {E}}\), if \(\sigma\) is large, then \(\hat{{\mathbf {Y}}}\) is heavily smoothed and \({\mathbf {L}}\hat{{\mathbf {Y}}}\) will likely be small; this is the case where we make a big effort to denoise the observation \({\mathbf {X}}\). If \(\sigma\) is small, the opposite holds. This shows that, for a given noisy observation (where the amount of noise added to the clean signal is fixed), different \(\sigma\) estimation approaches (or, equivalently, different adjustments of \(\alpha\) in [51]) lead to different denoising effects. However, \(\sigma\) has not been investigated analytically in [51]; it was estimated by a grid search.

To make the importance of noise variance estimation clearer, we generate a graph with \(N=50\) nodes and \(K=100\) graph signals, i.e. \({\mathbf {Y}}\), via the procedure explained in Sect. 6.1. Then, the graph signals are contaminated by Gaussian noise with different variances to provide \({\mathbf {X}}\). Given \({\mathbf {X}}\) and the known Laplacian matrix \({\mathbf {L}}\), we try to denoise the graph signals via (18) with two different filters, i.e. employing the correct variance and an incorrectly estimated variance that is half of the true value. This procedure is repeated for 10 different variances, and the normalized mean square error between the true signals and the estimated ones is computed. The results in Fig. 3 show how we can improve our measurements by the filter in (18), especially when we have a better estimate of the noise variance. Since the Laplacian matrix estimation is performed in the next step based on these measurements, using denoised graph signals certainly helps in a better topology estimation.

Fig. 3

The effect of denoising the graph signals when using a correct noise variance estimator
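A minimal sketch of this experiment is given below, using only NumPy; the graph and signal generation are simplified toy stand-ins for the procedure of Sect. 6.1, and the comparison of the correct variance against a halved estimate mirrors the setting of Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, sigma_true = 20, 200, 0.5

# Toy ground truth: random sparse Laplacian and smooth signals y ~ N(0, L^dagger)
W = np.triu(rng.random((N, N)) * (rng.random((N, N)) < 0.2), 1); W = W + W.T
L = np.diag(W.sum(1)) - W
Lam, chi = np.linalg.eigh(L)
Lam_pinv = np.array([1.0 / v if v > 1e-10 else 0.0 for v in Lam])
Y = chi @ (np.sqrt(Lam_pinv)[:, None] * rng.standard_normal((N, K)))

X = Y + np.sqrt(sigma_true) * rng.standard_normal((N, K))   # noisy observations (5)

def graph_filter(X, L, sigma):
    """Denoise via (18): Y_hat = (I + sigma * L)^{-1} X."""
    return np.linalg.solve(np.eye(L.shape[0]) + sigma * L, X)

def nmse(Y, Y_hat):
    return np.linalg.norm(Y - Y_hat, 'fro')**2 / np.linalg.norm(Y, 'fro')**2

Y_correct = graph_filter(X, L, sigma_true)          # filter using the true noise variance
Y_half = graph_filter(X, L, 0.5 * sigma_true)       # filter using a mis-estimated variance
print(nmse(Y, X), nmse(Y, Y_correct), nmse(Y, Y_half))
```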

To estimate the parameters \(\sigma\) and \(\varvec{\varLambda }^{\dagger }\), an expectation maximization procedure is used which proceeds by optimizing the following log-likelihood function

$$\begin{aligned} Q \left(\varvec{\varLambda }^{\dagger },\varvec{\chi },\sigma \right)={\mathbb {E}}_{{\mathbf {h}}\mid {\mathbf {x}};{\hat{\sigma }},\hat{\varvec{\chi }},\hat{\varvec{\varLambda }}^{\dagger }}\left[\text {log }p \left({\mathbf {x}},{\mathbf {h}}\mid \sigma ,\varvec{\chi },\varvec{\varLambda }^{\dagger }\right)\right], \end{aligned}$$
(20)

where \({\mathbb {E}}(\cdot )\) denotes the expectation and \(\hat{\varvec{\varLambda }}^{\dagger }\), \(\hat{\varvec{\chi }}\), and \({\hat{\sigma }}\) are the estimators of \(\varvec{\varLambda }^{\dagger }\), \(\varvec{\chi }\), and \(\sigma\). Hereafter, for brevity of notation, \(\varvec{\varTheta }\) denotes all the parameters, i.e. \(\varvec{\varTheta }:=\left(\sigma ,\varvec{\chi },\varvec{\varLambda }^{\dagger }\right)=\left(\sigma ,{\mathbf {L}}\right)\). By using (10)–(12) and (17) and some manipulations, (20) is rewritten as follows (see Appendix 1 for more details)

$$\begin{aligned} \begin{aligned} Q \left(\varvec{\varTheta }\right)\,=\,&{\mathbb {E}}_{{\mathbf {h}}\mid {\mathbf {x}};\hat{\varvec{\varTheta }}} \left[\text {log }p \left({\mathbf {x}}\mid {\mathbf {h}}\right)\right]+{\mathbb {E}}_{{\mathbf {h}}\mid {\mathbf {x}};\hat{\varvec{\varTheta }}}\left[\text {log }p\left({\mathbf {h}}\right)\right]\\ \propto&-N\text { log }\sigma -\text {Tr}\left(\left(\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\right)+\text { log }|{\mathbf {L}}|-\text {Tr}\left({\mathbf {SL}}\right), \end{aligned} \end{aligned}$$
(21)

where \({\mathbf {S}}=\frac{1}{K}\hat{{\mathbf {Y}}}\hat{{\mathbf {Y}}}^T\) is the empirical covariance matrix of the graph signals and the pseudo-determinant \(|\cdot |\) is used due to the singularity of \(\varvec{\varLambda }\) and \({\mathbf {C}}_0\). The pseudo-determinant of a matrix is the product of its non-zero eigenvalues. To find the optimal parameters, Q must be optimized with respect to each argument iteratively until convergence. By taking the derivative with respect to \(\sigma\), we have

$$\begin{aligned} -\frac{N}{\sigma }+\text {Tr} \left(\left(\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\cdot {\mathbf {L}}\cdot \left (\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\right)=0, \end{aligned}$$
(22)

where the optimal \(\sigma\) is found by a numerical method, e.g. Newton–Raphson algorithm. Then, Q is maximized with respect to the Laplacian matrix as follows

$$\begin{aligned} \begin{aligned} \underset{{\mathbf {L}}}{{\text {argmax}}}\quad&\text {log }|{\mathbf {L}}|-\text {Tr} \left({\mathbf {SL}}\right)- \text {Tr}\left(\left(\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\right)\\ \text {s.t.} \quad&{\mathbf {L}}_{ij}={\mathbf {L}}_{ji},\quad {\mathbf {L}}_{ij}\le 0 \text { if } i\ne j,\quad {\mathbf {L}}\cdot {\mathbf {1}}={\mathbf {0}},\text { Tr}({\mathbf {L}})=c, \end{aligned} \end{aligned}$$
(23)

where the first three constraints ensure a valid Laplacian matrix and the last one avoids the trivial solution for \(c>0\). The second term of the objective function promotes the signal smoothness over the estimated topology [as explained in (4)]. The optimization problem in (23) is not easy to implement due to the first and third terms. To handle the pseudo-determinant term, we propose the following minimization problem

$$\begin{aligned} \begin{aligned} \underset{{\mathbf {L}}}{{\text {argmin}}}\quad&-\text {logdet }(\mathbf {L+{\mathbf {J}})}+\text {Tr}\left({\mathbf {SL}}\right)+\text {Tr}\left(\left(\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\right)\\ \text {s.t.} \quad&{\mathbf {L}}_{ij}={\mathbf {L}}_{ji},\quad {\mathbf {L}}_{ij}\le 0 \text { if } i\ne j,\quad {\mathbf {L}}\cdot {\mathbf {1}}={\mathbf {0}},\text { Tr}({\mathbf {L}})=c, \end{aligned} \end{aligned}$$
(24)

where \(\text {det }(\cdot )\) stands for the determinant and \({\mathbf {J}}=\frac{1}{N}{\mathbf {11}}^T\). The equivalence of \(\text {logdet }(\mathbf {L+{\mathbf {J}})}\) and \(\text {log}|{\mathbf {L}}|\) has been justified in [22]. The matrix inverse term, which requires high computational power, can also be replaced by a less computationally expensive term. Theorem 1 of [56] provides upper and lower bounds on the trace of the inverse of a symmetric positive definite matrix:

$$\begin{aligned}&\text {Tr}\left(\left(\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\right)\le N-\frac{c^2\sigma }{\sigma \left\| {\mathbf {L}}\right\| _F^2-c}, \end{aligned}$$
(25)
$$\begin{aligned}&\text {Tr}\left(\left(\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\right)\ge \frac{N}{2\sigma +1}-\frac{\frac{c^2\sigma -2cN\sigma +2N^2\sigma }{2\sigma +1}}{\sigma \left\| {\mathbf {L}}\right\| _F^2-c-2N-2c\sigma } \end{aligned}$$
(26)

and thus for a fixed \(\sigma\), minimizing \(\sigma \left\| {\mathbf {L}}\right\| _F^2\) results in the minimization of \(\text {Tr}\left(\left(\sigma {\mathbf {L}}+{\mathbf {I}}\right)^{-1}\right)\). To sum up, we propose to solve the following minimization problem for the Laplacian matrix estimation

$$\begin{aligned} \begin{aligned} \underset{{\mathbf {L}}}{{\text {argmin}}}\quad&-\text {logdet }(\mathbf {L+{\mathbf {J}})}+\text {Tr}\left({\mathbf {SL}}\right)+\sigma \left\| {\mathbf {L}}\right\| _F^2\\ \text {s.t.} \quad&{\mathbf {L}}_{ij}={\mathbf {L}}_{ji},\quad {\mathbf {L}}_{ij}\le 0 \text { if } i\ne j,\quad {\mathbf {L}}\cdot {\mathbf {1}}={\mathbf {0}},\text {Tr}({\mathbf {L}})=c, \end{aligned} \end{aligned}$$
(27)

where \(\left\| {\mathbf {L}}\right\| _F^2\) can also be considered as a control term for the distribution of the off-diagonal elements, i.e. the edge weights of the estimated graph. Since (27) is a convex optimization problem, any off-the-shelf convex solver may be used, e.g. YALMIP [57]. However, in the next section, a proximal point algorithm is proposed to solve (27) efficiently. To conclude this section, the three steps of the proposed method are presented in Algorithm 1.

Algorithm 1 Bayesian Topology Learning (BTL)
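As a concrete instance of the Laplacian update step of Algorithm 1, the following sketch solves (27) with CVXPY, a Python counterpart of the YALMIP toolbox mentioned above. The empirical covariance in the usage lines is random and purely illustrative; in Algorithm 1, \({\mathbf {S}}\) would come from the denoised signals and \(\sigma\) from the solution of (22).

```python
import cvxpy as cp
import numpy as np

def estimate_laplacian(S, sigma, c):
    """Solve the convex problem (27) for the graph Laplacian, given the
    empirical covariance S, the noise variance sigma, and the trace value c."""
    N = S.shape[0]
    J = np.ones((N, N)) / N
    L = cp.Variable((N, N), symmetric=True)
    objective = cp.Minimize(-cp.log_det(L + J) + cp.trace(S @ L)
                            + sigma * cp.sum_squares(L))
    constraints = [L - cp.diag(cp.diag(L)) <= 0,   # non-positive off-diagonal entries
                   L @ np.ones(N) == 0,            # zero row sums
                   cp.trace(L) == c]               # avoid the trivial solution
    cp.Problem(objective, constraints).solve()
    return L.value

# Hypothetical usage with a random empirical covariance (for illustration only)
rng = np.random.default_rng(0)
Yh = rng.standard_normal((10, 100))
L_hat = estimate_laplacian(Yh @ Yh.T / 100, sigma=0.1, c=10)
```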

4 The efficient implementation

In this section, we discuss the main steps needed to implement the proposed method with a fast and efficient algorithm. The proposed algorithm to solve (27) follows these steps: first, by Propositions 1 and 2, the last three constraints in (27) are rewritten in the form of inner products of two matrices, helping to form a simpler Lagrangian function. Then, the Lagrangian is iteratively maximized with respect to the dual variable and minimized with respect to the primal variable, i.e. the graph Laplacian matrix. For the minimization step, we apply a proximal point algorithm. The maximization can be done by Newton or quasi-Newton methods, where the derivative and Hessian with respect to the dual variable are obtained by Lemma 2 and Corollary 1. Theorem 1 gives the stopping criterion guaranteeing the convergence of the iterations between the maximization and minimization problems.

Proposition 1

Since \({\mathbf {L}}\) is a symmetric matrix, the summation over all entries of the i-th row (or i-th column) can be rewritten as

$$\begin{aligned} \sum _{j=1}^{N}L_{ij}=\frac{1}{2}\text {Tr}\left({\mathbf {L}}\left({\mathbf {U}}_i+{\mathbf {U}}_i^T\right)\right) \end{aligned}$$
(28)

where \({\mathbf {U}}_i\) is an \(N\times N\) matrix in which the i-th column is the all-ones vector and all other entries are zero.

Proposition 2

Given \({\mathbf {L}}\cdot {\mathbf {1}} = {\mathbf {0}}\), the constraint \({\mathbf {L}}_{ij}\le 0 \text { for } i\ne j\) in the problem (27) can be replaced by \(\left\| {\mathbf {L}}\right\| _1=2\text {Tr}({\mathbf {L}})\), where \(\left\| {\mathbf {L}}\right\| _1=\sum _{i,j}\text {abs}(L_{ij})\) and \({\mathbf {L}}\in {\mathcal {S}}_+^N\).

Proof

The weight matrix \({\mathbf {W}}\) is a matrix with zero diagonal entries as long as we have assumed that there is no self loop. If all edges are positive (and so \({\mathbf {L}}_{ij}\le 0\) for \(i\ne j\)), we have \(\left\| {\mathbf {D}}\right\| _1=\left\| {\mathbf {W}}\right\| _1\) and

$$\begin{aligned} \left\| {\mathbf {L}}\right\| _1=\left\| {\mathbf {D}}-{\mathbf {W}}\right\| _1=\left\| {\mathbf {D}}\right\| _1+\left\| {\mathbf {W}}\right\| _1=2\left\| {\mathbf {D}}\right\| _1=2\text {Tr}({\mathbf {D}})=2\text {Tr}({\mathbf {L}}). \end{aligned}$$
(29)

If there is only one edge with negative weight (or correspondingly \(\exists i\ne j\), \({\mathbf {L}}_{ij}>0\)), \(\left\| {\mathbf {D}}\right\| _1\ne \left\| {\mathbf {W}}\right\| _1\).

Note that if all edges are negative and thus \({\mathbf {L}}_{ij}\ge 0\) and \({\mathbf {L}}_{ii}\le 0\), this property is proved in the same way. However, this is not the case here as long as it is assumed that \({\mathbf {L}}\in {\mathcal {S}}_+^N\), forcing the diagonal entries of \({\mathbf {L}}\) to be non-negative. \(\square\)

By applying the Propositions 1 and 2, the minimization problem of (27) is rewritten in the following form

$$\begin{aligned} \begin{aligned} \underset{{\mathbf {L}}\in {\mathcal {S}}_+^N}{{\text {argmin}}}\quad&-\text {logdet }({\mathbf {L}}+{\mathbf {J}})+\text {Tr}\left(\mathbf {LS}\right)+\sigma \left\| {\mathbf {L}}\right\| _F^2\\ \text {s.t.} \quad&{\mathcal {B}}({\mathbf {L}})={\mathbf {a}}, \end{aligned} \end{aligned}$$
(30)

where \({\mathbf {a}}=[{\mathbf {0}};\quad c; \quad 2c]^T\), and \({\mathcal {B}}(\cdot ) : {\mathcal {R}}^{N\times N}\rightarrow {\mathcal {R}}^{(N+2)\times 1}\) is a matrix operator as follows

$$\begin{aligned} {\mathcal {B}}({\mathbf {L}})=\left[\frac{1}{2}\text {Tr}\left({\mathbf {L}}\left({\mathbf {U}}_1+{\mathbf {U}}_1^T \right)\right),\dots ,\frac{1}{2}\text {Tr}\left({\mathbf {L}}\left({\mathbf {U}}_N+{\mathbf {U}}_N^T\right)\right) , \text {Tr}({\mathbf {L}}), \quad \left\| {\mathbf {L}}\right\| _{1}\right]^T, \end{aligned}$$
(31)

Hereafter, we interchangeably use the inner product of two matrices instead of the trace operator, i.e. \(\text {Tr}({\mathbf {L}}_1\cdot {\mathbf {L}}_2)=\langle {\mathbf {L}}_1,{\mathbf {L}}_2\rangle\). It is assumed that the feasible set \({\mathcal {F}}=\{{\mathbf {L}}\in {\mathcal {S}}_+^N, \quad {\mathcal {B}}({\mathbf {L}})={\mathbf {a}}\}\) is not empty, and then, due to the convexity of (30), any locally optimal solution is also globally optimal. The Lagrangian function over the primal variable \({\mathbf {L}}\) and the dual variable \(\varvec{\nu }=[\nu _1,\dots , \quad \nu _{N},\quad \nu _{N+1}, \quad \nu _{N+2}]\) is given as below

$$\begin{aligned} {\mathcal {L}}\left({\mathbf {L}};\varvec{\nu }\right)=-\text {logdet }({\mathbf {L}}+{\mathbf {J}})+\text {Tr}\left(\mathbf {LS}\right)+\sigma \left\| {\mathbf {L}}\right\| _F^2+\varvec{\nu }^T({\mathbf {a}}-{\mathcal {B}}({\mathbf {L}})), \end{aligned}$$
(32)

Since the objective function of (30) is convex and there exists a strictly feasible point (in \({\mathcal {F}}\)), Slater’s condition holds and there is strong duality, or the saddle point property [58]. Strong duality means that it is possible to find the optimum of the dual problem instead of the primal one. Thus, \({\mathbf {L}}\) can be estimated via the following optimization problem

$$\begin{aligned} \hat{{\mathbf {L}}}=\underset{{\mathbf {L}}\in {\mathcal {S}}_+^N}{{\text {argmin}}}\quad g({\mathbf {L}}), \end{aligned}$$
(33)

where

$$\begin{aligned} g({\mathbf {L}})= \underset{\varvec{\nu }}{{\text {max}}}\quad {\mathcal {L}}\left({\mathbf {L}};\varvec{\nu }\right). \end{aligned}$$
(34)

However, it is difficult to implement (33) directly. Therefore, we propose to use the proximal algorithm, which is a standard tool for solving constrained and nonsmooth minimization problems [59]. The proximal operator of a closed proper convex function \(g:{\mathcal {R}}^n\rightarrow {\mathcal {R}}\cup \{+\infty \}\) is represented by \(\mathbf{prox} _g:{\mathcal {R}}^n\rightarrow {\mathcal {R}}^n\) and its scaled version with parameter \(\eta\) is given as below

$$\begin{aligned} \mathbf{prox} _{\eta g}({\mathbf {x}})=\underset{{\mathbf {v}}}{{\text {argmin}}}\quad \left(g({\mathbf {v}})+\frac{1}{2\eta }\left\| {\mathbf {x}}-{\mathbf {v}}\right\| _2^2\right). \end{aligned}$$
(35)

One of the tools for solving the general optimization problem \(\hat{{\mathbf {x}}}=\underset{{\mathbf {x}}}{{\text {argmin}}}\quad g({\mathbf {x}})\) is the proximal point algorithm, also called proximal iteration, given by

$$\begin{aligned} {\mathbf {x}}^{(t+1)}:=\mathbf{prox} _{\eta g}\left({\mathbf {x}}^{(t)}\right) \end{aligned}$$
(36)

where t is the iteration index. In this algorithm, \({\mathbf {x}}^{(t)}\) and \(g\left({\mathbf {x}}^{(t)}\right)\) converge to the set of minimizers of g and its optimal value, respectively [60]. The proximal operator provides a smoothed version of the function by adding the quadratic regularization term. The proximal operator can also be computed by [59]

$$\begin{aligned} \mathbf{prox} _{\eta g}\left({\mathbf {x}}\right)={\mathbf {x}}-\eta \nabla M_{\eta g}({\mathbf {x}}), \end{aligned}$$
(37)

where \(\nabla\) is the gradient operator and \(M_{\eta g}({\mathbf {x}})\) is the Moreau–Yosida regularization. The Moreau–Yosida regularization of \(g({\mathbf {L}})\) in (33) with parameter \(\eta >0\) is defined as [59, 61, 62]

$$\begin{aligned} M_{\eta g}({\mathbf {L}})&=\underset{{\mathbf {L}}'\in {\mathcal {S}}_+^N}{{\text {min}}}\left\{ g\left({\mathbf {L}}'\right)+\frac{1}{2\eta }\left\| {\mathbf {L}}-{\mathbf {L}}'\right\| _F^2\right\} \\&\overset{(a)}{=}\underset{\varvec{\nu }}{{\text {max}}}\,\underset{{\mathbf {L}}'\in {\mathcal {S}}_+^N}{{\text {min}}}\left\{ {\mathcal {L}}\left({\mathbf {L}}';\varvec{\nu }\right)+\frac{1}{2\eta }\left\| {\mathbf {L}}-{\mathbf {L}}'\right\| _F^2\right\} , \end{aligned}$$
(38)

where \(\overset{(a)}{=}\) follows from the Von Neumann–Fan minimax theorem [63, 64]. Like the proximal operator, the Moreau–Yosida regularization provides a smooth version of the function. Moreover, \(M_{\eta g}({\mathbf {L}})\) and \(g\left({\mathbf {L}}\right)\) have the same set of minimizers and thus we equivalently solve (38) instead of (34). In particular, we propose to find the minimum of \(M_{\eta g}({\mathbf {L}})\) by applying the proximal point algorithm in (36). Hereafter, \({\mathbf {P}}_{\eta }\left({\mathbf {L}};\varvec{\nu }\right)\) denotes the inner minimum in (38), i.e. \({\mathbf {P}}_{\eta }\left({\mathbf {L}};\varvec{\nu }\right)=\underset{{\mathbf {L}}'\in {\mathcal {S}}_+^N}{{\text {min}}}\left\{ {\mathcal {L}}\left({\mathbf {L}}';\varvec{\nu }\right)+\frac{1}{2\eta }\left\| {\mathbf {L}}-{\mathbf {L}}'\right\| _F^2\right\}\), so that \(M_{\eta g}({\mathbf {L}})=\underset{\varvec{\nu }}{{\text {max}}}\,{\mathbf {P}}_{\eta }\left({\mathbf {L}};\varvec{\nu }\right)\).
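Before specializing to \({\mathbf {P}}_{\eta }\), the following toy sketch illustrates (35)–(37) on the scalar function \(g(x)=|x|\): its proximal operator is soft-thresholding, the Moreau–Yosida envelope is approximated by direct minimization over a grid, and the gradient relation (37) is checked by finite differences. The choice of \(g\), the grid, and the step size are illustrative assumptions.

```python
import numpy as np

eta = 0.5

def g(v):                       # toy objective g(v) = |v|
    return np.abs(v)

def prox_abs(x, eta):           # prox_{eta g}(x) for g = |.| is soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

def moreau(x, eta, grid=np.linspace(-10, 10, 200001)):
    # Moreau-Yosida envelope (35)/(38): M_{eta g}(x) = min_v { g(v) + (x - v)^2 / (2 eta) }
    return np.min(g(grid) + (x - grid) ** 2 / (2 * eta))

x = 1.3
# Check (37): prox_{eta g}(x) = x - eta * dM/dx, with dM/dx by central finite differences
dM = (moreau(x + 1e-4, eta) - moreau(x - 1e-4, eta)) / 2e-4
print(prox_abs(x, eta), x - eta * dM)        # the two values agree

# Proximal iteration (36) drives x toward the minimizer of g (here, 0)
for t in range(5):
    x = prox_abs(x, eta)
print(x)
```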

4.1 Finding \({\mathbf {P}}_{\eta} \left({\mathbf {L}};{\varvec{\nu}}\right)\)

The adjoint operator of \({\mathcal {B}}\), called hereafter \({\mathcal {B}}_a\), is obtained as follows

$$\begin{aligned} {{\mathcal {B}}_a}(\varvec{\nu })=\frac{\partial \langle {\mathbf {L}}, {{\mathcal {B}}_a}(\varvec{\nu })\rangle }{\partial {\mathbf {L}}}\overset{(a)}{=}\frac{\partial \langle {\mathcal {B}}({\mathbf {L}}), \varvec{\nu }\rangle }{\partial {\mathbf {L}}}, \end{aligned}$$
(39)

where \(\frac{\partial }{\partial {\mathbf {L}}}\) denotes the partial derivative with respect to \({\mathbf {L}}\) and (a) follows from the definition of the adjoint under the matrix inner product. By using (31) and some simple manipulations, we have

$$\begin{aligned} {\mathcal {B}}_a(\varvec{\nu })= \frac{1}{2}\nu _1\left({\mathbf {U}}_1+{\mathbf {U}}_1^T\right)+\dots +\frac{1}{2}\nu _N \left({\mathbf {U}}_N+{\mathbf {U}}_N^T\right)+\nu _{N+1}{\mathbf {I}}+\nu _{N+2}\left(2 {\mathbf {I}}-N{\mathbf {J}}\right). \end{aligned}$$
(40)

To simplify (38), we introduce the change of variable \({\mathbf {T}}_\eta\) as

$$\begin{aligned} {\mathbf {T}}_\eta \left({\mathbf {L}};\varvec{\nu }\right)\triangleq \frac{1}{1+2\eta \sigma }\left({\mathbf {L}}-\eta \left({\mathbf {S}}-{\mathcal {B}}_a(\varvec{\nu })\right)\right), \end{aligned}$$
(41)

and then \({{\mathbf {P}}_{\eta}} \left({\mathbf {L}};{\varvec{\nu}}\right)\) can be rewritten as follows

$$\begin{aligned} {\mathbf {P}}_{\eta} \left({\mathbf {L}};\varvec{\nu }\right)= \varvec{\nu }^T{\mathbf {a}}+\frac{1}{2\eta }\left\| {\mathbf {L}}\right\| _F^2-\frac{1+2\eta \sigma }{2\eta }\left\| {\mathbf {T}}_\eta \left({\mathbf {L}};\varvec{\nu }\right)\right\| _F^2 + \underset{{\mathbf {L}}'}{{\text {min}}}\quad {\mathcal {J}}_\eta \left({\mathbf {L}}',{\mathbf {L}};\varvec{\nu }\right) \end{aligned}$$
(42)

where

$$\begin{aligned} {\mathcal {J}}_\eta \left({\mathbf {L}}',{\mathbf {L}};\varvec{\nu }\right)=-\text {logdet }({\mathbf {L}}') +\frac{1+2\eta \sigma }{2\eta }\left\| {\mathbf {L}}'-{\mathbf {T}}_\eta \left({\mathbf {L}};\varvec{\nu }\right)\right\| _F^2. \end{aligned}$$
(43)

To find the minimizer of \({\mathcal {J}}_\eta \left({\mathbf {L}}',{\mathbf {L}};\varvec{\nu }\right)\) and simplify (42), the following Lemma from [65] will be used.

Lemma 1

[65]: Let \({\mathbf {Z}}\in {\mathcal {S}}^N\) with eigenvalue decomposition \({\mathbf {Z}}={\mathbf {P}}\varvec{\varTheta }{\mathbf {P}}^T\) and \(\gamma >0\) where \(\varvec{\varTheta }=\text {diag}(\varvec{\theta })\) is the eigenvalue matrix. Assume two scalar functions \(\phi _\gamma ^+(x)\triangleq \frac{1}{2}\left(\sqrt{x^2+4\gamma }+x\right)\) and \(\phi _\gamma ^-(x)\triangleq \frac{1}{2}\left(\sqrt{x^2+4\gamma }-x\right)\) are defined and their matrix counterparts are as follows

$$\begin{aligned} \begin{aligned} {\mathbf {Z}}_1&=\phi _\gamma ^+({\mathbf {Z}})={\mathbf {P}}\text {diag}(\phi _\gamma ^+(\varvec{\theta })){\mathbf {P}}^T\\ {\mathbf {Z}}_2&=\phi _\gamma ^-({\mathbf {Z}})={\mathbf {P}}\text {diag}(\phi _\gamma ^-(\varvec{\theta })){\mathbf {P}}^T \end{aligned} \end{aligned}$$
(44)

Then,

  1.

    \({\mathbf {Z}}={\mathbf {Z}}_1-{\mathbf {Z}}_2\) and \({\mathbf {Z}}_1{\mathbf {Z}}_2=\gamma {\mathbf {I}}\)

  2.

    \(\phi _\gamma ^+\) is continuously differentiable and its derivative at \({\mathbf {Z}}\) for every \({\mathbf {H}}\in {\mathcal {S}}^N\) is given as

    $$\begin{aligned} \left. \frac{\partial \phi _\gamma ^+}{\partial x}\right| _{x={\mathbf {Z}}}\cdot [{\mathbf {H}}]={\mathbf {P}}\left(\varvec{\varOmega }\circ ({\mathbf {P}}^T\mathbf {HP})\right){\mathbf {P}}^T, \end{aligned}$$
    (45)

    where \(\circ\) denotes the Hadamard product and \(\varvec{\varOmega }\in {\mathcal {S}}^N\) is defined as follows

    $$\begin{aligned} \varOmega _{ij}=\frac{\phi _\gamma ^+(\theta _i)+\phi _\gamma ^+(\theta _j)}{\sqrt{\theta _i^2+4\gamma }+\sqrt{\theta _j^2+4\gamma }},\quad 1\le i,j\le N. \end{aligned}$$
    (46)
  3.
    $$\begin{aligned} \left. \frac{\partial \phi _\gamma ^+}{\partial x}\right| _{x={\mathbf {Z}}}\cdot [{\mathbf {Z}}_1+{\mathbf {Z}}_2]=\phi _\gamma ^+({\mathbf {Z}}). \end{aligned}$$
    (47)

Proof

See [65]. \(\square\)
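Part (1) of Lemma 1 can be verified numerically; the short sketch below applies the matrix functions (44) to a random symmetric matrix. The matrix size and \(\gamma\) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, gamma = 6, 0.3

Z = rng.standard_normal((N, N)); Z = (Z + Z.T) / 2        # random symmetric Z
theta, P = np.linalg.eigh(Z)                              # Z = P diag(theta) P^T

phi_plus = 0.5 * (np.sqrt(theta**2 + 4*gamma) + theta)    # scalar phi_gamma^+
phi_minus = 0.5 * (np.sqrt(theta**2 + 4*gamma) - theta)   # scalar phi_gamma^-

Z1 = P @ np.diag(phi_plus) @ P.T                          # matrix counterparts (44)
Z2 = P @ np.diag(phi_minus) @ P.T

print(np.allclose(Z, Z1 - Z2))                    # part (1): Z = Z1 - Z2
print(np.allclose(Z1 @ Z2, gamma * np.eye(N)))    # part (1): Z1 Z2 = gamma I
```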

To find the minimizer of \({\mathcal {J}}_\eta \left({\mathbf {L}}',{\mathbf {L}};\varvec{\nu }\right)\), the derivative with respect to \({\mathbf {L}}'\) is set to zero as follows

$$\begin{aligned} -\left({\mathbf {L}}'\right)^{-1}+\frac{1+2\eta \sigma }{\eta }\left({\mathbf {L}}'-{\mathbf {T}}_\eta \left({\mathbf {L}};\varvec{\nu }\right)\right)={\mathbf {0}}, \end{aligned}$$
(48)

and by some simple manipulations, the solution is \({\mathbf {L}}'=\phi ^+_{\gamma '}\left({\mathbf {T}}_\eta \left({\mathbf {L}};\varvec{\nu }\right)\right)\) where \(\gamma '=\frac{\eta }{1+2\eta \sigma }\). Then (42) is simplified as follows (see Appendix 2)

$$\begin{aligned} {\mathbf {P}}_{\eta} \left({\mathbf {L}};\varvec{\nu }\right)=\varvec{\nu }^T{\mathbf {a}}+\frac{1}{2\eta }\left\| {\mathbf {L}}\right\| _F^2-\text {logdet }\left(\phi ^+_{\gamma '}\left({\mathbf {T}}_\eta ({\mathbf {L}};\varvec{\nu })\right)\right) -\frac{1}{2\gamma '}\left\| \phi ^+_{\gamma '}\left({\mathbf {T}}_\eta ({\mathbf {L}};\varvec{\nu })\right) \right\| _F^2+N. \end{aligned}$$
(49)

4.2 Finding \(M_{\eta g}({\mathbf {L}})\) via (38)

Lemma 2

The derivative of \({\mathbf {P}}_{\eta} \left({\mathbf {L}};{\varvec{\nu}}\right)\) with respect to \(\varvec{\nu }\) is

$$\begin{aligned} \nabla _{\varvec{\nu }} {\mathbf {P}}_{\eta} \left({\mathbf {L}};\varvec{\nu }\right)={\mathbf {a}}-{\mathcal {B}}\left(\varvec{\varPhi }^+\right), \end{aligned}$$
(50)

where \(\varvec{\varPhi }^+:=\phi _{\gamma '}^+\left({\mathbf {T}}_\eta ({\mathbf {L}};\varvec{\nu })\right)\).

Proof

See Appendix 3. \(\square\)

By taking the derivative of \(\nabla _{\varvec{\nu }} {\mathbf {P}}_{\eta} \left({\mathbf {L}};\varvec{\nu } \right)\) with respect to \(\varvec{\nu }\) and applying (45), we obtain the following corollary.

Corollary 1

The Hessian of \({\mathbf {P}}_{\eta} \left({\mathbf {L}};\varvec{\nu }\right)\) with respect to \(\varvec{\nu }\) is

$$\begin{aligned} \nabla _{\varvec{\nu }\varvec{\nu }}^2{\mathbf {P}}_{\eta} =-\gamma' [&{\mathcal {B}}\left(\left(\varvec{\varPhi }^+\right)'\cdot [\frac{1}{2}\left({\mathbf {U}}_1+{\mathbf {U}}_1^T \right)]\right),\dots ,{\mathcal {B}}\left(\left(\varvec{\varPhi }^+\right)'\cdot [\frac{1}{2}\left({\mathbf {U}}_N+{\mathbf {U}}_N^T\right)]\right),\\ {}&{\mathcal {B}}\left(\left(\varvec{\varPhi }^+\right)'\cdot [{\mathbf {I}}]\right),{\mathcal {B}}\left(\left(\varvec{\varPhi }^+\right)'\cdot [2{\mathbf {I}}-{\mathbf {11}}^T]\right)], \end{aligned}$$
(51)

Using the first and second order derivatives of \({\mathbf {P}}_{\eta} ({\mathbf {L}};\varvec{\nu })\) with respect to \(\varvec{\nu }\), the unconstrained maximization problem of (38) can be solved by Newton or quasi-Newton methods, like L-BFGS. Let \(\varvec{\nu }_{\text {opt}}\) denote \(\underset{\varvec{\nu }}{{\text {argmax}}}\quad {\mathbf {P}}_{\eta} \left({\mathbf {L}};\varvec{\nu }\right)\) and by using (38), we have \(M_{\eta g}({\mathbf {L}})={\mathbf {P}}_{\eta} \left({\mathbf {L}};\varvec{\nu }_{\text {opt}}\right)\).

Lemma 3

The derivative of \(M_{\eta g}({\mathbf {L}})\) with respect to the graph Laplacian matrix is

$$\begin{aligned} \nabla M_{\eta g}({\mathbf {L}})=\frac{1}{\eta }\left({\mathbf {L}}-\phi ^+_{\gamma '}\left({\mathbf {T}}_\eta ({\mathbf {L}};\varvec{\nu }_{\text {opt}})\right)\right). \end{aligned}$$
(52)

Proof

The result follows by taking the derivative of (49) with respect to \({\mathbf {L}}\) and applying Lemma 1, part (3), similar to the proof of Lemma 2. \(\square\)

Considering (36), (37), and (52), the graph Laplacian matrix is estimated via the following iteration rule

$$\begin{aligned} {\mathbf {L}}^{(t+1)} = \phi ^+_{\gamma '}\left({\mathbf {T}}_\eta ({\mathbf {L}}^{(t)};\varvec{\nu }_{\text {opt}}^{(t+1)})\right). \end{aligned}$$
(53)

where (t) stands for the iteration index. All steps to update the Laplacian matrix are summarized in Algorithm 2. It is also possible to use \(\eta _t\) rather than a fixed \(\eta\), allowing the step size to be adjusted in each iteration for faster convergence of the objective variable \({\mathbf {L}}\). To find \(\eta _t\) in each iteration, exact or backtracking line search can be applied [58, 59].

Algorithm 2 BTL via the Proximal Point Algorithm (BTL-PPA)
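A compact NumPy/SciPy sketch of the BTL-PPA iteration is given below. It follows (31), (40), (41), (49), (50), and (53), with three simplifications that are our own assumptions rather than part of Algorithm 2: the inner maximization over \(\varvec{\nu }\) uses L-BFGS-B instead of an exact Newton step, the \(\ell _1\) entry of \({\mathcal {B}}(\cdot )\) is replaced by its linearization \(\text {Tr}({\mathbf {L}}(2{\mathbf {I}}-{\mathbf {11}}^T))\) (which coincides with \(\left\| {\mathbf {L}}\right\| _1\) on the feasible set and matches the adjoint (40)), and a fixed \(\eta\) with a simple iteration cap replaces the line search and the stopping rule of Theorem 1.

```python
import numpy as np
from scipy.optimize import minimize

def phi_plus(T, gamma):
    """Matrix function phi_gamma^+ from (44)."""
    theta, P = np.linalg.eigh(T)
    return P @ np.diag(0.5 * (np.sqrt(theta**2 + 4*gamma) + theta)) @ P.T

def calB(L):
    """Operator B(.) of (31), with the l1 entry linearized as Tr(L(2I - 11^T))."""
    return np.concatenate([L.sum(1), [np.trace(L), 2*np.trace(L) - L.sum()]])

def calB_adj(nu, N):
    """Adjoint operator B_a(.) of (40)."""
    v, t, s = nu[:N], nu[N], nu[N + 1]
    return 0.5*(v[:, None] + v[None, :]) + t*np.eye(N) + s*(2*np.eye(N) - np.ones((N, N)))

def btl_ppa(S, sigma, c, eta=0.1, n_iter=50):
    N = S.shape[0]
    a = np.concatenate([np.zeros(N), [c, 2*c]])       # right-hand side in (30)
    gamma_p = eta / (1 + 2*eta*sigma)                  # gamma' defined below (48)
    L = np.eye(N) * c / N                              # trace-feasible initialization
    nu = np.zeros(N + 2)
    for _ in range(n_iter):
        def T_eta(nu):                                 # change of variable (41)
            return (L - eta*(S - calB_adj(nu, N))) / (1 + 2*eta*sigma)

        def neg_P(nu):                                 # -P_eta(L; nu) from (49)
            Phi = phi_plus(T_eta(nu), gamma_p)
            val = (nu @ a + np.sum(L**2)/(2*eta) - np.linalg.slogdet(Phi)[1]
                   - np.sum(Phi**2)/(2*gamma_p) + N)
            return -val

        def neg_grad(nu):                              # -(a - B(Phi^+)) from (50)
            return -(a - calB(phi_plus(T_eta(nu), gamma_p)))

        nu = minimize(neg_P, nu, jac=neg_grad, method='L-BFGS-B').x
        L = phi_plus(T_eta(nu), gamma_p)               # proximal update (53)
    return L

# Hypothetical usage: in practice S is the empirical covariance of the denoised signals
rng = np.random.default_rng(0)
Y_hat = rng.standard_normal((10, 500))
L_hat = btl_ppa(Y_hat @ Y_hat.T / 500, sigma=0.1, c=10)
print(np.trace(L_hat), L_hat.sum())    # feasibility check: roughly c and 0 if converged
```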

5 Convergence analysis

Assume that the eigendecomposition of \({\mathbf {T}}_\eta \left({\mathbf {L}};\varvec{\nu }\right)\) is represented as \({\mathbf {T}}_\eta ={\mathbf {P}}\varvec{\varTheta }{\mathbf {P}}^T\). Hence, we have

$$\begin{aligned} \phi _\gamma ^+({\mathbf {T}}_\eta )={\mathbf {P}}\text {diag}(\phi _\gamma ^+(\varvec{\theta })){\mathbf {P}}^T. \end{aligned}$$
(54)

where the scalar function \(\phi _\gamma ^+(\cdot )\) is convex and bounded whenever its argument (the eigenvalues) is bounded. From Theorems B and C in [66], it follows that \(\phi _\gamma ^+(\varvec{\theta })\) is Lipschitz continuous. Thus,

$$\begin{aligned} \left\| \phi ^+_{\gamma '}\left({\mathbf {T}}_\eta ^{(t)}\right)-\phi ^+_{\gamma '}\left({\mathbf {T}}_\eta ^{(t-1)}\right)\right\| _F^2\le l_c \left\| {\mathbf {T}}_\eta ^{(t)}-{\mathbf {T}}_\eta ^{(t-1)}\right\| _F^2, \end{aligned}$$
(55)

where \(l_c\) is the Lipschitz constant and \({\mathbf {T}}_\eta ^{(t)}={\mathbf {T}}_\eta \left({\mathbf {L}}^{(t)};\varvec{\nu }_{\text {opt}}^{(t)}\right)\).

Lemma 4

The scalar function \(\phi _\gamma ^+(x)\) is Lipschitz continuous with the Lipschitz constant \(\frac{3}{2}\).

Proof

See Appendix 4. \(\square\)

Lemma 5

If \({\mathbf {Z}}\) is a real symmetric matrix with the eigenvalue matrix \(\varvec{\varTheta }_Z\), then \(\left\| {\mathbf {Z}}\right\| _F^2=\left\| \varvec{\varTheta }_Z\right\| _F^2\).

Proof

See Appendix 5. \(\square\)

Corollary 2

The matrix valued function \(\phi _\gamma ^+({\mathbf {Z}})\) is Lipschitz continuous with the constant \(l_c = \frac{9}{4}\).

Proof

The proof follows readily by applying the eigenvalue decomposition (44) and using Lemma 4 and 5. \(\square\)

Theorem 1

The proposed BTL-PPA converges when the following stopping criterion is set for the dual variable update

$$\begin{aligned} \left\| \varvec{\nu }^{(t)}-\varvec{\nu }^{(t_*)}\right\| _2^2\le \frac{1}{\eta ^2N^2}\left\| {\mathbf {L}}^{(t)}-{\mathbf {L}}^{(t_*)}\right\| ^2_F, \end{aligned}$$
(56)

where the optimum primal and dual variables are \({\mathbf {L}}^{(t_*)}\) and \(\varvec{\nu }^{(t_*)}\), respectively.

Proof

See Appendix 6. \(\square\)

6 Numerical results

The proposed algorithm is tested on simulated and real data for different scenarios. For topology learning performance, the results of our algorithm are compared to those of three existing algorithms: GL-SigRep [51], CGL [22], and the learning sparse graph algorithm in [20], called LSG here. In the synthetic data simulations, although the performance of the LSG algorithm is similar to that of BTL-PPA, it has a higher computational complexity than the other three algorithms. The results for signal recovery performance are only compared to those of GL-SigRep, because the other two methods have no policy for signal representation and recovery. To implement GL-SigRep, an optimal selection of the regularization parameters via exhaustive search is applied, and thus the appropriate \(\frac{\alpha }{\beta }\) ratio is found in order to maximize the performance of the algorithm in [51].

6.1 Synthetic data

The synthetic data is drawn from a Gaussian Markov Random Field process. First, an Erdős–Rényi graph is generated with \(N=40\) vertices and an edge probability of 0.2. The weight matrix and then the graph Laplacian matrix are computed, and the latter is normalized by its trace. Then, each data vector is sampled from an N-variate Gaussian distribution \({\mathbf {y}}[k]\sim {\mathcal {N}}({\mathbf {0}}, {\mathbf {L}}^{\dagger })\) and contaminated by independent and identically distributed Gaussian noise. The measurements are \({\mathbf {x}}[k]={\mathbf {y}}[k]+{\mathbf {e}}[k]\) for \(k=1,\dots ,4000\) and different values of the noise variance \(\sigma\). The measurements are stacked in the columns of the matrix \({\mathbf {X}}\), which is the given input to Algorithm 1. We set \(c=N\) [51] and initialize \(\sigma\) to 1. Finally, the simulation results are averaged over 100 different trials of experiments.
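A minimal sketch of this data-generation procedure is given below; the sizes are reduced for brevity, and the edge-weight distribution (uniform on (0, 1]) and the choice of trace normalization \(\text {Tr}({\mathbf {L}})=N\) are assumptions where the text does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, p, sigma = 40, 4000, 0.2, 0.1

# Erdos-Renyi topology; edge weights are assumed uniform in (0, 1] here
mask = np.triu(rng.random((N, N)) < p, 1)
W = np.where(mask, rng.random((N, N)), 0.0); W = W + W.T
L = np.diag(W.sum(1)) - W
L = N * L / np.trace(L)                       # normalize so that Tr(L) = c = N (assumed)

# GMRF samples y[k] ~ N(0, L^dagger): draw h ~ N(0, Lambda^dagger) and set y = chi h
lam, chi = np.linalg.eigh(L)
lam_pinv = np.array([1.0 / v if v > 1e-10 else 0.0 for v in lam])
Y = chi @ (np.sqrt(lam_pinv)[:, None] * rng.standard_normal((N, K)))

X = Y + np.sqrt(sigma) * rng.standard_normal((N, K))   # noisy measurements (5)
```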

The most important performance measures used to compare different algorithms in this literature are as follows [51, 67,68,69,70] (a small computational sketch of these measures is given after the list):

  • Normalized Mean Squared Deviation of graph topology estimation: \(\text {NMSD}=\frac{1}{N^2}\cdot \frac{\left\| {\mathbf {L}}-\hat{{\mathbf {L}}}\right\| ^2_F}{\left\| {\mathbf {L}}\right\| ^2_F}\), where \(\hat{{\mathbf {L}}}\) denotes the estimated Laplacian matrix,

  • Normalized Mean Squared Error of signal reconstruction: \(\text {NMSE}=\frac{1}{N\cdot K}\cdot \frac{\left\| {\mathbf {X}}-\hat{{\mathbf {X}}}\right\| ^2_F}{\left\| {\mathbf {X}}\right\| ^2_F}\),

  • \(\text {F-measure}=\frac{2\cdot \text {Precision}\cdot \text {Recall}}{\text {Precision}+\text {Recall}}=\frac{\text {2TP}}{\text {2TP+FN+FP}}\), where TP, FP and FN are the numbers of true positives, false positives, and false negatives, respectively. Also, precision is the ratio of the number of correctly recovered edges to the number of reconstructed edges in the estimated graph, and recall is the ratio of the number of correctly recovered edges to the number of edges in the ground-truth graph. This performance measure solely takes into account the support of the recovered graph while ignoring the weights,

  • Normalized Mutual Information: \(\text {NMI}({\mathbf {L}},\hat{{\mathbf {L}}}) =\frac{2\text {MI}({\mathbf {L}},\hat{{\mathbf {L}}})}{H({\mathbf {L}})+H(\hat{{\mathbf {L}}})}\), where MI is the mutual information between the set of edges in the estimated graph and the true graph and \(H({\mathbf {L}})\) is the entropy based on the probability distribution of edges, i.e. the probability of zeros and ones in the matrix \({\mathbf {L}}\) [70].
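A small computational sketch of these measures follows. Edges are taken to be the off-diagonal Laplacian entries below a small negative threshold, and the NMI term uses scikit-learn's normalized mutual information on the binarized edge indicators; both are our own concrete choices for realizing the definitions above.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def edge_set(L, tol=1e-4):
    """Binary upper-triangular edge indicator from a Laplacian (off-diagonals < -tol)."""
    iu = np.triu_indices(L.shape[0], 1)
    return (L[iu] < -tol).astype(int)

def nmsd(L_true, L_hat):
    N = L_true.shape[0]
    return (np.linalg.norm(L_true - L_hat, 'fro')**2
            / (N**2 * np.linalg.norm(L_true, 'fro')**2))

def nmse(X, X_hat):
    N, K = X.shape
    return (np.linalg.norm(X - X_hat, 'fro')**2
            / (N * K * np.linalg.norm(X, 'fro')**2))

def f_measure(L_true, L_hat):
    e, e_hat = edge_set(L_true), edge_set(L_hat)
    tp = int(np.sum((e == 1) & (e_hat == 1)))
    fp = int(np.sum((e == 0) & (e_hat == 1)))
    fn = int(np.sum((e == 1) & (e_hat == 0)))
    return 2 * tp / (2 * tp + fn + fp) if (2 * tp + fn + fp) else 0.0

def nmi(L_true, L_hat):
    return normalized_mutual_info_score(edge_set(L_true), edge_set(L_hat))
```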

Figure 4 shows that BTL-PPA outperforms GL-SigRep for different noise variances and its NMSE is lower than that of GL-SigRep. Considering the NMSD comparison, the BTL-PPA topology learning algorithm works very well compared to the other algorithms and its topology learning capability is robust across different noise powers. The experiments are run for different values of the noise variance, for a fixed signal power. Hence, the horizontal axes can also be linked to different signal-to-noise ratios (SNRs). As shown in Fig. 4, when the noise increases, the NMSD for the CGL algorithm decreases, which seems counter-intuitive at first glance. However, it is necessary to note that for the currently chosen threshold to remove weak edges, we get more edges, leading to a lower NMSD (due to less variance in edge weights), and at the same time, as shown in Fig. 5, the F-measure is decreasing due to an increasingly lower precision. Figure 5 also corroborates the strong performance of the proposed algorithm from the perspective of what percentage of edges is learned correctly.

Fig. 4

The capability of different algorithms in estimating the graph topology and denoising the signals. Top: the normalized mean square deviation of graph topology estimation; Bottom: the normalized mean square error of signal recovery (\(N=40\) and \(K=4000\))

Fig. 5

The capability of different algorithms in finding the correct edges; Top: F-measure; Bottom: Normalized Mutual Information

6.2 Signal inpainting in IoT applications

In the IoT framework, a challenging area of research is the preprocessing of data after sensing and before providing it for further processing (for more detail, please see [1], Figure 3). One of these preprocessing steps is to estimate the missing values, i.e. signal inpainting. These missing values may be due to faults in recording, in signal transmission to the central unit, or in storage.

Assume the sensor readings are recorded in the vector \({\mathbf {y}}=\left[{\mathbf {y}}_{\mathcal {M}}; {\mathbf {y}}_{\mathcal {U}}\right]\), where \({\mathbf {y}}_{\mathcal {M}}\) is the portion of the signal that is correctly recorded and known while \({\mathbf {y}}_{\mathcal {U}}\) includes missing values. Following the procedure explained in [71] but with our designed filter in (18), the signal is estimated as follows

$$\begin{aligned} {\mathbf {y}}^*=\left(\begin{bmatrix} {\mathbf {I}}_M &{} {\mathbf {0}} \\ {\mathbf {0}} &{} {\mathbf {0}} \end{bmatrix}+\sigma ^2{\mathbf {LL}}\right)^{-1}\begin{bmatrix} {\mathbf {y}}_{\mathcal {M}} \\ {\mathbf {0}} \end{bmatrix} \end{aligned}$$
(57)

where M is the cardinality of \({\mathbf {y}}_{\mathcal {M}}.\)
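A minimal sketch of the inpainting rule (57) is given below, assuming the Laplacian and the noise variance have already been estimated by Algorithm 2; indexing the recorded entries with a boolean mask (rather than reordering the vector as in (57)) is a notational convenience.

```python
import numpy as np

def inpaint(x_obs, known_mask, L, sigma):
    """Estimate missing entries via (57).

    x_obs      : length-N vector with observed values (entries where known_mask is False are ignored)
    known_mask : boolean length-N vector, True where the reading was recorded correctly
    """
    D = np.diag(known_mask.astype(float))       # selects the recorded entries (I_M block in (57))
    rhs = np.where(known_mask, x_obs, 0.0)      # [y_M; 0] with zeros at missing positions
    return np.linalg.solve(D + sigma**2 * (L @ L), rhs)   # sigma^2 L L as in (57)

# Hypothetical usage: y_hat = inpaint(x, mask, L_hat, sigma_hat), with L_hat and
# sigma_hat returned by Algorithm 2 and mask marking the correctly recorded samples
```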

To apply the proposed method to relevant IIoT data, the “Household Power Consumption” data set has been downloaded from [72]. This data set contains individual household electric consumption measurements collected via submeters over 5 years (2006–2010); we only use a subset of it here. The first \(K=100,000\) time samples were taken from the first block of data and then downsampled to 1000 samples. The data set contains 7 measurements, including the active power, reactive power, voltage, intensity, and three energy submeter recordings from different places in a house. The intensity sensor readings are chosen to be tested here, and all multiples of 50 among its time instances were deleted to model missing values, i.e. \(y_k\) for \(k\in \{50, 150, \dots , 950, 1000\}\) are the missing values. This dataset is then given to Algorithm 2 to find the underlying graph Laplacian matrix along with the estimated noise variance. The missing values are estimated by applying (57). Figure 6 shows the result of signal inpainting via our proposed filter. For a better illustration, only the missing values and the estimated ones have been shown.

Fig. 6

Signal inpainting for an IIoT application. The true values for 20 missing samples are compared with the estimated ones using the proposed graph filter. The original data is the time series of household power consumption readings [72]

The main reason for the good performance is that the designed filter considers each dimension of the multi-variate signal as an entity of the entire network, related to the others through a structure, and hence it uses this knowledge in the filtering process. In other words, based on the measured signals, the algorithm first learns the structure of the sensor readings and then uses this data structure to interpolate the missing values.

6.3 Temperature data

In this experiment, the daily temperatures of \(N = 48\) states of the USA mainland are stored for the years 2011 to 2014, i.e. \(K = 1461\) [73]. Here, the graph signals are average daily temperatures and the underlying topology describes the temperature relation between states. We do not have access to the ground-truth topology, but a geography-based graph may be used as a reference for comparison. A graph is considered where the nodes are the states and the edge weight between two states is computed by the Gaussian RBF of their physical distance.
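A small sketch of this geographical reference graph is given below, assuming a matrix of pairwise distances between state centroids is available; the RBF bandwidth and the threshold for dropping weak edges are illustrative choices not specified in the text.

```python
import numpy as np

def rbf_graph(dist, theta, eps=0.05):
    """Gaussian RBF weights w_ij = exp(-d_ij^2 / (2 theta^2)), with weak edges removed."""
    W = np.exp(-dist**2 / (2 * theta**2))
    np.fill_diagonal(W, 0.0)          # no self loops
    W[W < eps] = 0.0                  # drop weak edges (threshold eps is illustrative)
    L = np.diag(W.sum(1)) - W
    return W, L

# Hypothetical usage: dist is a 48 x 48 matrix of pairwise distances between state centroids
# W_ref, L_ref = rbf_graph(dist, theta=np.median(dist))
```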

First, the temperature data of 2011 is used to learn the underlying topology. Figure 7 compares different graph learning algorithms with respect to their capability to capture the underlying connections. It is shown that the proposed BTL-PPA can detect most of the edges. Figure 8 shows the topology learned by our proposed method over the real map of the United States. It can be inferred that a state’s weather influences that of the neighboring states, which is also corroborated by the physics of temperature propagation. There are more edges on the far right side of the map due to the higher density of nodes representing east coast regions. Moving from right to left across the map, there is a slight discontinuity among the edges, which may represent the blockage of weather propagation by the Appalachian mountains. The smaller number of connections in the middle of the figure can also be due to the Rocky Mountain chain, as also mentioned in [6]. In the same way, the few edges to the right of California corroborate the effect of the Sierra Nevada mountain range. The numerical results comparing the different algorithms’ performances are given in Table 2.

Fig. 7

Visual comparisons among a the ground-truth adjacency matrix and the adjacency matrices learned by b BTL-PPA, c GL-SigRep, d CGL, and e LSG. Here, a blue pixel shows that there is a connection between two states (an edge between two nodes in the graph)

Fig. 8

The learned graph topology via BTL-PPA algorithm from real temperature data in 2011: the background map is the USA mainland

Table 2 Performance measures for learned topology from average temperature for the US mainland in 2011

In the second experiment with real temperature data, the data set is divided into two groups, a training set and a testing set. The first half of the data, i.e. the average daily temperatures of 2011 and 2012 (\(k=1,\dots , 731\)), is utilized as the training data to learn the underlying topology. Then we consider this topology as the ground-truth graph, and the remaining data, i.e. the data from the years 2013 and 2014, is used to estimate the topology and compare it to the ground truth to check the consistency of the learning algorithms. In other words, a cross-validation procedure is done to verify how the topology learned from the given training data differs from the one learned from the test data.

Table 3 shows the results for the cross-validation scenario. The proposed BTL-PPA algorithm has the best consistency results among all algorithms with respect to all performance measures.

Table 3 Performance of different algorithms for learning the graph from the test data set, when compared to the one learned from the training data

7 Conclusion

Many information networks involve multiple interacting entities, and finding the topology connecting these entities is important for real-world applications. In the graph topology inference framework, we tried to estimate the structure underlying the data from multi-variate measurements. In other words, we investigated an algorithm to explore the link between the signal model and the graph topology. In this paper, a factor analysis model was used for signal representation in the graph domain, and a Bayesian inference method was applied to learn the Laplacian matrix (which can uniquely represent the graph topology) and to estimate the noise variance at the same time. To formulate the problem, we used a Bayesian framework and proposed a minimum mean square error estimation approach to denoise the measurements. Finally, a convex optimization problem over the graph Laplacian matrix was proposed and solved via a proximal point method to estimate the topology from a denoised version of the graph signals. The experimental results corroborate the performance of the proposed algorithm for a wide range of sensor networks and IoT applications.