Sublinear Computational Time Modeling by Momentum-Space Renormalization Group Theory in Statistical Machine Learning Procedures

We review sublinear computational time modeling using momentum-space renormalization group approaches in statistical machine learning algorithms. The modeling scheme was proposed, and its basic framework briefly explained, in a short note (Tanaka et al., J. Phys. Soc. Jpn. 87(8), Article ID 085001, 2018). Here we present the detailed formulations and some numerical experimental results of sublinear computational time modeling based on the momentum-space renormalization scheme.


Introduction
Probabilistic graphical models have useful applications in statistical machine learning theory, and many authors are investigating their use in artificial intelligence systems [1,2]. However, realizing statistical machine learning systems with probabilistic graphical models entails massive computational time.
To avoid this problem, several statistical-mechanical techniques have been introduced, for example, advanced mean-field methods including belief propagation [3][4][5][6][7][8]. Although such advanced mean-field methods can achieve linear computational time modeling, it is very difficult for them to reduce the computational time to sublinear in the system size. Recent research in the data sciences has made clear the necessity of treating high-dimensional state vectors, so novel sublinear computational time modeling schemes for probabilistic graphical models are needed. Sublinear computational time modeling reduces the computational time for solving a given problem to sublinear order with respect to the system size.
The renormalization group (RG) theory [9][10][11][12][13] is one of the most important theories in statistical mechanics. RG theory was introduced to investigate the universality classes of phase transitions in probabilistic graphical models by considering spatial coarse-graining procedures. RG schemes can be divided into two kinds: real-space renormalization group (RSRG) schemes and momentum-space renormalization group (MSRG) schemes. MSRG schemes are implemented in the spectral space of the Fourier transform of the probabilistic graphical model and have produced many successful results in combination with perturbation techniques.
Recently, the present authors have proposed applying RG techniques to achieve sublinear computational time modeling schemes in statistical machine learning theory. In physics, it is well known that RG schemes are typically used to capture critical phenomena characterized by diverging statistical quantities, particularly the correlation length. It might seem that achieving sublinear computational time modeling in statistical machine learning is far from the standard physical applications of RG schemes, because statistical models for the data sciences have finite sizes and no critical phenomena are expected. However, it has become necessary to treat high-dimensional state vectors in the recent data sciences, and it is important to reduce the computational time of statistical models containing such high-dimensional state vectors. The present authors have therefore considered introducing RG techniques to systematically reduce the system size of a statistical model while applying statistical machine learning schemes. One way to accomplish this is to apply the RSRG scheme to probabilistic image segmentation, in which a Potts model is employed as a prior probability [14]. Another is to introduce the momentum-space renormalization scheme into probabilistic noise reduction, in which a Gaussian graphical model is introduced as a prior probability [15]. The most important aspect of statistical machine learning algorithms is how the computational time of estimating hyperparameters from data vectors can be reduced. One familiar and practical scheme for estimating hyperparameters is the expectation-maximization (EM) algorithm [16,17]. However, the EM algorithm requires computations of statistical quantities in the massive probabilistic graphical model.
Two works by the present authors applied RG approaches to estimating the hyperparameters by means of the EM algorithm.
The present paper is not an original paper but a review of sublinear computational time modeling using MSRG approaches in the EM algorithm for statistical machine learning systems based on Gaussian graphical models, published as a short note [15], together with its background formulations. We provide the detailed formulations and some numerical experimental results of sublinear computational time modeling based on the momentum-space renormalization scheme; the essential part of the scheme and the most important numerical experimental results appeared in the above short note. In Sect. 2, we present the sublinear computational time modeling of statistical machine learning systems for probabilistic noise reduction by introducing momentum-space renormalization schemes. Section 3 shows practical schemes of the sublinear computational time EM algorithm. In Sect. 4, a statistical performance analysis of the hyperparameter estimation using the sublinear computational time EM algorithm is given. In Sect. 5, we give some concluding remarks.

Momentum-Space Renormalization Group Scheme of Gaussian Graphical Model
In this section, we review the fundamental formulations of momentum-space renormalization group approaches in Gaussian graphical models. Gaussian graphical models are also referred to as Gaussian-Markov random fields [4,[18][19][20][21][22][23][24], and are familiar probabilistic graphical models for Bayesian noise reduction systems with the EM algorithm. In noise reduction for image processing in particular, the graphical structures are usually square grid graphs, and unitary matrices based on discrete Fourier transformations are then useful. Although the scale transformations in MSRG approaches are carried out in the wave-number space of the discrete Fourier transformation, we emphasize that they differ from low-pass filters in conventional signal processing. We consider images on a square grid graph with periodic boundary conditions along the abscissa and the ordinate. The abscissa and ordinate of each pixel are denoted by x and y, respectively, so that the position vector of the pixel is (x, y). The periodic boundary conditions mean that x = M and y = N are identified with x = 0 and y = 0, respectively. We introduce the set of all pixels as V(M, N) = {(x, y) | x = 0, 1, …, M − 1; y = 0, 1, …, N − 1}. We remark that V(M, N) can be regarded as an ordered set by mapping the two-dimensional position vector (x, y) to a single index as in Eq. (1), where the notation ⌊a⌋ denotes the floor function defined by ⌊a⌋ ≡ a − (a mod 1).
We define the state variables f_{x,y} and g_{x,y} ((x, y) ∈ V(M, N)) of the light intensity at each pixel (x, y) in the original images and the degraded images, respectively. The state variables f_{x,y} and g_{x,y} take any real values in the interval (−∞, +∞). The state vectors f and g are defined by f ≡ (f_{x,y} | (x, y) ∈ V(M, N)) and g ≡ (g_{x,y} | (x, y) ∈ V(M, N)), respectively.
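The quadratic nearest-neighbour prior energy on the periodic grid can be sketched in a few lines of Python. The function name `prior_energy` and the symbol `alpha` for the interaction hyperparameter are our own illustrative choices, not notation taken from the paper:

```python
import numpy as np

def prior_energy(f, alpha=1.0):
    """Nearest-neighbour quadratic energy on an M x N periodic grid.

    Assumes a prior of the form P(f) proportional to
    exp(-(alpha/2) * sum of squared differences over nearest-neighbour
    pixel pairs), with periodic boundary conditions (x = M identified
    with x = 0, and likewise for y).
    """
    # np.roll implements the periodic boundary conditions.
    dx = f - np.roll(f, -1, axis=0)   # differences along the abscissa
    dy = f - np.roll(f, -1, axis=1)   # differences along the ordinate
    return 0.5 * alpha * np.sum(dx**2 + dy**2)

M, N = 8, 8
rng = np.random.default_rng(0)
f = rng.normal(size=(M, N))                  # original image f
g = f + rng.normal(scale=0.1, size=(M, N))   # degraded image g (additive white Gaussian noise)
print(prior_energy(f))
```

Note that a constant image has zero prior energy, reflecting the fact that the prior penalizes only differences between neighbouring pixels.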

We consider a prior probability density function for original images f, defined by Eq. (2). Here Z_{M,N} is the normalization constant of the prior, defined by Eq. (3). Degraded images g are generated from a given original image f according to the conditional probability density function in Eq. (4). We remark that the two parameters appearing in these densities are referred to as hyperparameters in statistical machine learning theory; the first is the strength of the interaction between nearest-neighbour pairs of pixels. Equation (4) means that additive white Gaussian noise is assumed as the degradation process, and the second hyperparameter corresponds to the inverse of the variance of the additive white Gaussian noise. The prior probability density function in Eq. (2) can be rewritten as Eq. (5). Here C(M, N) and I(M, N) denote MN × MN matrices whose (x, y | x′, y′)-components ((x, y) ∈ V(M, N), (x′, y′) ∈ V(M, N)) are defined by two-dimensional representations in terms of the wave-number set W(M, N). We remark that W(M, N) can also be regarded as an ordered set by mapping the two-dimensional wave-number vector (k, l) to a single index, by arguments similar to those for Eq. (1). Equations (10) and (11) can then be evaluated in the momentum space. We remark that the high-frequency modes are not ignored but marginalized: only the low-frequency modes are rescaled, or extended, to recover the same frequency range as that of the original probabilistic Gaussian graphical model. This marginalization is referred to as "tracing out" in the RG techniques of physics. We emphasize again that the present algorithm traces out and rescales the high-frequency modes rather than merely ignoring them; in this sense it is based on the standard RG.
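The diagonalization of the nearest-neighbour difference matrix C(M, N) by the two-dimensional discrete Fourier transformation can be verified numerically. The spectrum below, λ(k, l) = 4 − 2 cos(2πk/M) − 2 cos(2πl/N), is the standard eigenvalue formula for the periodic-grid Laplacian and is assumed here to correspond to the matrix C(M, N) of the text:

```python
import numpy as np

def laplacian_spectrum(M, N):
    """Eigenvalues of the periodic-grid nearest-neighbour difference
    matrix, indexed by wave numbers (k, l):
        lam(k, l) = 4 - 2 cos(2 pi k / M) - 2 cos(2 pi l / N).
    """
    k = np.arange(M).reshape(-1, 1)
    l = np.arange(N).reshape(1, -1)
    return (4.0 - 2.0 * np.cos(2.0 * np.pi * k / M)
                - 2.0 * np.cos(2.0 * np.pi * l / N))

def apply_C(f):
    """Apply the difference matrix in real space:
    4 f minus the sum of the four periodic nearest neighbours."""
    return (4.0 * f
            - np.roll(f, 1, 0) - np.roll(f, -1, 0)
            - np.roll(f, 1, 1) - np.roll(f, -1, 1))

# Check: applying C in real space equals multiplying by lam(k, l)
# in momentum space, because C is a circular convolution.
M, N = 6, 5
rng = np.random.default_rng(1)
f = rng.normal(size=(M, N))
lhs = np.fft.fft2(apply_C(f))
rhs = laplacian_spectrum(M, N) * np.fft.fft2(f)
print(np.max(np.abs(lhs - rhs)))   # numerically zero
```

The zero mode λ(0, 0) = 0 is the reason the pure smoothness prior is improper; it is the uniform shift of all light intensities, which the nearest-neighbour interaction cannot penalize.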
Now we take new positive integers m and n with m ≤ M and n ≤ N, respectively, and introduce a scale transformation from the state vector f in the space V(M, N) to the reduced space V(m, n). We assume that the prior probability density function of the renormalized state vector f is given accordingly, and we introduce the corresponding scale transformation for the data. Moreover, we introduce the conditional probability density function of the degraded image g ≡ (g_{x,y} | (x, y) ∈ V(m, n)), where the original image f, given in the reduced space V(m, n), is assumed to satisfy the relation in Eq. (32).
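The role of the reduction from V(M, N) to V(m, n) can be sketched as a momentum-space truncation: keep only the m × n lowest-frequency Fourier modes and return to real space on the reduced grid. This is our own minimal sketch of the role played by the matrix B(m, n | M, N); the paper's exact normalization convention is not reproduced here:

```python
import numpy as np

def renormalize(f, m, n):
    """Momentum-space scale transformation V(M, N) -> V(m, n):
    keep the m x n lowest-frequency Fourier modes and invert the
    transform on the reduced grid (a sketch, not the paper's exact
    operator B(m, n | M, N))."""
    M, N = f.shape
    F = np.fft.fft2(f)
    # fftshift centres the zero frequency so the low-frequency block
    # can be cut out; ifftshift restores the FFT ordering on m x n.
    keep = np.fft.fftshift(F)[(M - m) // 2:(M + m) // 2,
                              (N - n) // 2:(N + n) // 2]
    keep = np.fft.ifftshift(keep)
    # Rescale so that the mean light intensity survives the reduction.
    return np.fft.ifft2(keep).real * (m * n) / (M * N)

coarse = renormalize(np.ones((8, 8)), 4, 4)
print(coarse)   # a constant image keeps its value on the reduced grid
```

A constant image is pure zero-frequency mode, so it passes through the truncation unchanged; only higher-frequency structure is marginalized away.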

Momentum-Space Renormalization Group Approaches in EM Algorithm
In this section, we provide a practical scheme of the EM algorithm within the MSRG approach. The fundamental framework is based on maximization of the marginal likelihood. We first formulate the probability density function of g and regard it as a marginal likelihood function of the hyperparameters. Under the assumptions of the previous section, the marginal likelihood in the reduced space V(m, n) is defined by Eq. (33). Our proposed framework estimates the hyperparameters by maximizing this marginal likelihood: the estimates of the hyperparameters are determined so as to maximize the marginal likelihood of the renormalized data vector U†(m, n)B(m, n|M, N)U(M, N)g with respect to the hyperparameters. By considering the extremum conditions of this marginal likelihood with respect to the hyperparameters, simultaneous deterministic equations for the estimates of the hyperparameters are derived.
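The structure of such simultaneous equations can be illustrated with a toy model that factorizes over momentum-space modes. The sketch below is a caricature under our own simplifying assumptions (independent scalar modes, two hyperparameters named `alpha` and `gamma`), not the authors' exact update equations:

```python
import numpy as np

def em_hyperparameters(g, lam, n_iter=500):
    """EM fixed-point iteration for (alpha, gamma) in a Gaussian model
    that factorizes over momentum-space modes.

    Assumed toy model: mode i has latent coefficient f_i with prior
    variance 1/(alpha*lam[i]) and observation g[i] = f_i + noise,
    where the noise variance is 1/gamma.
    """
    alpha, gamma = 1.0, 1.0
    for _ in range(n_iter):
        # E-step: posterior mean and variance of each latent mode.
        prec = alpha * lam + gamma
        m = gamma * g / prec
        v = 1.0 / prec
        # M-step: closed-form hyperparameter updates.
        alpha = len(g) / np.sum(lam * (m**2 + v))
        gamma = len(g) / np.sum((g - m)**2 + v)
    return alpha, gamma

# Synthetic check: draw data from the model, then recover (alpha, gamma).
rng = np.random.default_rng(2)
lam = rng.uniform(0.5, 4.0, size=50_000)
alpha_true, gamma_true = 2.0, 5.0
f = rng.normal(scale=1.0 / np.sqrt(alpha_true * lam))
g = f + rng.normal(scale=1.0 / np.sqrt(gamma_true), size=lam.size)
print(em_hyperparameters(g, lam))   # should be close to (2.0, 5.0)
```

Because each iteration touches every mode exactly once, the cost per EM sweep is proportional to the number of retained modes, which is the point of performing the iteration in the renormalized (reduced) momentum space.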

Hyperparameter Estimation and Noise Reduction Algorithm
Step 1: Input a given data vector g and the pair of positive integers (m, n). Compute G(k, l), set initial values of the hyperparameters at step t = 0, and set t ⇐ 0.
Step 5: Repeat the following update rule until the restored image computed with the current hyperparameter estimates converges. Some restored results obtained by applying the above algorithm to the degraded images g* in Fig. 6 are shown in Figs. 10, 11, 12 and 13. Computational times of the hyperparameter estimation processes in the numerical experiments for the degraded images g* in Fig. 6 are shown in Fig. 14. Numerical experimental results for the signal-to-noise ratio (in decibels, 10 log_{10} of the corresponding ratio) for the degraded images g* in Fig. 6 are shown in Fig. 15. Here, Var[f*] is the variance of the light intensities of each colour (red, green and blue) over all pixels of the original image f*.
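The restoration step itself is a posterior-mean computation that is diagonal in momentum space. The following sketch uses the standard periodic-grid spectrum and two placeholder hyperparameters `alpha` and `gamma` standing in for the estimated values; it is a minimal Wiener-type illustration, not the paper's full colour-image algorithm:

```python
import numpy as np

def denoise(g, alpha, gamma):
    """Posterior-mean restoration for a Gaussian graphical model,
    computed mode by mode in momentum space.

    Posterior precision of mode (k, l): alpha*lam(k, l) + gamma;
    posterior mean: gamma * G(k, l) / (alpha*lam(k, l) + gamma).
    """
    M, N = g.shape
    k = np.arange(M).reshape(-1, 1)
    l = np.arange(N).reshape(1, -1)
    lam = (4.0 - 2.0 * np.cos(2.0 * np.pi * k / M)
               - 2.0 * np.cos(2.0 * np.pi * l / N))
    G = np.fft.fft2(g)
    F_hat = gamma * G / (alpha * lam + gamma)
    return np.fft.ifft2(F_hat).real

rng = np.random.default_rng(3)
f = np.zeros((32, 32)); f[8:24, 8:24] = 1.0       # toy original image
g = f + rng.normal(scale=0.3, size=f.shape)        # degraded image
f_hat = denoise(g, alpha=2.0, gamma=1.0 / 0.3**2)  # gamma = inverse noise variance
print(np.mean((g - f)**2), np.mean((f_hat - f)**2))  # the posterior mean lowers the MSE
```

The entire restoration costs two FFTs plus a per-mode division, i.e. O(MN log MN), and on the reduced grid V(m, n) the corresponding cost shrinks accordingly.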

Statistical Performance Estimation of Gaussian Graphical Model
In this section, we estimate the statistical performance of our framework for the maximization of the renormalized marginal likelihood in Eq. (36) with Eq. (33). In the present Bayesian inference method, we assume that the data vectors g are generated from the conditional probability density function in Eq. (4), under the assumption that a parameter vector f = f* is given. The estimates first follow the flow of the renormalization group transformation; for n/N < 1/4 and m/M < 1/4, they depart from the flow. In Fig. 9, it is seen that the noise in the reduced image g is hardly recognizable for n/N = 1/4 and m/M = 1/4. In fact, one hyperparameter corresponds to the inverse of the variance of the additive white Gaussian noise, and its estimate rapidly increases, although the other estimates remain near the renormalization flow, in the limit n/N → +0 and m/M → +0.

Concluding Remarks
In the present review paper, we have summarized the MSRG approaches for Gaussian graphical models on a finite-size square grid graph. We have provided the formulations and a practical algorithm for noise reduction applications in image processing using Bayesian sublinear computational time modeling. Moreover, we have also derived statistical performance analysis schemes for the above systems from the statistical-mechanical point of view. The estimated hyperparameters are first located along the flow of the MSRG transformations in the region mn/MN > (1/4)². However, the sublinear modeling cannot recognize the noise in the reduced image g in the region mn/MN < (1/4)², and the estimate of the hyperparameter corresponding to the inverse of the variance of the additive white Gaussian noise then goes to infinity as mn/MN decreases in that region. We have considered finite-size systems only, and our formulation stays within the discrete Fourier transformation. However, if we consider the infinite-size limit M → +∞ and N → +∞, the state vector f becomes composed of an infinite number of components. In this limit, f_{x,y} is a function of x and y, which take any real values in (−∞, +∞), and can then be rewritten as f(x, y). The prior probability density function is a functional of f = {f(x, y) | (x, y) ∈ (−∞, +∞)²} and is given by Eq. (56) instead of Eq. (2). For the probability density function in Eq. (56), the integral ∫(⋯)df means a functional integral. This is the starting point of statistical field theory in physics [25][26][27]. The big data sciences are characterized by extremely high-dimensional state vectors, and we expect that statistical field theory should provide powerful applications in such cases. Indeed, this is one of the most exciting research paradigms for sublinear computational time modeling in the data sciences.
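For orientation, one standard continuum counterpart of a nearest-neighbour quadratic prior is the Gaussian free field of statistical field theory; the expression below is that well-known form (with an assumed interaction hyperparameter α), sketched here rather than quoted from Eq. (56):

```latex
P[f \mid \alpha] \;\propto\; \exp\!\left(
  -\frac{\alpha}{2}
  \iint_{(-\infty,+\infty)^{2}}
  \bigl|\nabla f(x,y)\bigr|^{2}\,\mathrm{d}x\,\mathrm{d}y
\right)
```

The discrete sum of squared nearest-neighbour differences becomes the integral of the squared gradient in this limit, and the normalization constant becomes a functional integral over all configurations f(x, y).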