1 Introduction

With the rapid development of wireless communication technology, the demand for high-speed data transmission has grown quickly. Broadband communication is an effective way to provide high data rates. However, increasing the bandwidth makes the sampling period shorter than the channel delay spread, especially in multipath scenarios, which gives rise to frequency-selective channels. Carrier frequency offset due to oscillator mismatch, together with high relative mobility between the transmitter and receiver, causes the transmission channel to change rapidly in time, which is referred to as the time-selectivity of the channel. A channel with both frequency-selectivity and time-selectivity is called doubly selective. Orthogonal frequency division multiplexing (OFDM) is an attractive multi-carrier modulation scheme that divides the whole bandwidth into multiple overlapping narrowband subchannels to reduce the symbol rate. OFDM offers high spectral efficiency and a simple single-tap equalizer structure. The bandwidth of each subcarrier is small enough that each subcarrier can be considered to experience flat fading in a frequency-selective channel. The narrowband nature of the subcarriers makes the signal robust against frequency-selectivity, and inter-symbol interference (ISI) can be easily eliminated by inserting a cyclic prefix (CP) in front of each transmitted OFDM block. However, OFDM is relatively sensitive to the time-selectivity of the mobile channel. Time variation of the channel within an OFDM symbol duration results in intercarrier interference (ICI) (Wu and Fan 2016) and leads to an irreducible error floor in conventional receivers, which further degrades performance. OFDM is strongly robust in high delay spread environments, which can eliminate the need to equalize the delay spread effect. This feature allows for higher data rates and has led to the selection of OFDM as the standard for digital audio broadcasting, digital video broadcasting, some wireless local area networks, long-term evolution and next-generation cellular systems. It has also been proposed for high-speed train broadband communication systems (Yu et al. 2016). The performance of OFDM systems is affected by channel estimation, timing synchronization and mobility. The ICI may become more severe as the mobile speed, carrier frequency or OFDM symbol duration increases (Sheng et al. 2017). OFDM can be implemented using coherent or non-coherent detection. Coherent detection methods usually provide a higher signal-to-noise ratio gain than non-coherent methods because they use channel state information. This implies, however, that the receiver is more complex, because channel state information is usually obtained through channel estimation. To achieve acceptable reception quality for applications with high latency and Doppler spread, it is essential to optimize the design of channel estimators.

In highly selective multipath fading channels, the channel response exhibits complex nonlinearity, and the estimation accuracy of linear methods is reduced (Charrada and Samet 2016). It is therefore necessary to use a nonlinear channel estimation method to improve the estimation performance. Support vector regression (SVR), developed from the support vector machine (SVM), is suitable for regression of nonlinear systems. The SVM is based on statistical learning theory and Vapnik–Chervonenkis (VC) dimension theory, and it is a promising machine learning approach that has been adopted in classification and regression (Vapnik 1995). The kernel trick extends its application to nonlinear situations, and it has become one of the most powerful tools for pattern classification and regression (Xu et al. 2017). So far, some SVR algorithms have been used in wireless channel estimation. Matilde et al. (2004) developed a multiple-input multiple-output channel estimation method based on SVR, but the channel was assumed to be non-selective. Djouama et al. (2014) and Charrada and Samet (2016) proposed OFDM channel estimation methods based on SVR for different application scenarios. However, the methods mentioned above are based on the basic SVR and still have shortcomings in computational complexity and performance. With continued research, many modifications have been proposed in recent years. Peng (2010) proposed twin support vector regression (TSVR), which increases the computational speed by solving two small-size quadratic programming problems (QPPs). Xu et al. (2017) proposed asymmetric \(\nu\)-twin support vector regression, a kind of twin SVR suitable for dealing with asymmetric noise. Peng et al. (2014) introduced a pair of projection terms in the optimization problem, which embeds the structural information of the data into the learning process and reduces the empirical variance. Melki et al. (2017) studied multi-target regression and presented several models for problems with multiple outputs. Anand et al. (2019), Anagha et al. (2018) and Gupta and Gupta (2019) proposed improved SVR algorithms based on the pinball loss function, Balasundaram et al. (2014, 2016a) studied Lagrangian SVR, and Anand et al. (2018) proposed a generalized \(\varepsilon\)-loss function for regression.

However, most methods treat all training samples equally and give them the same penalties, which may degrade performance due to the influence of noise or outliers. It is useful to give the training samples different weights depending on their importance. Xu and Wang (2014) proposed K-nearest neighbor (KNN)-based weighted twin support vector regression, which uses local information of the data to improve prediction accuracy; however, KNN-based methods are suitable for regression of clustered samples, not for time series such as those in channel estimation. Recently, an efficient projection wavelet weighted twin support vector regression (PWWTSVR) was proposed in our work (Wang et al. 2019), which introduces a weight matrix based on the wavelet transform and is suitable for dealing with time-series data. PWWTSVR has good normalization performance and makes full use of data structure information. It is suitable for regression of nonlinear systems with sequence information as training samples. Applying it to wireless channel parameter prediction, a problem with strong nonlinearity, is expected to improve the estimation performance. The idea of combining TSVR with wavelet-transform-based weights was also applied to the regression of a steelmaking model (Gao et al. 2019).

Owing to the merits of PWWTSVR, the proposed channel estimation algorithm can achieve good performance. The contributions of this paper are summarized as follows.

  1.

    An improved TSVR is adopted for the first time to estimate doubly selective wireless channel parameters in OFDM systems. This method addresses the problem that the performance of most traditional estimation methods is degraded by linear assumptions.

  2.

    Targeting the nonlinearity and the time-series characteristics of fading channels in OFDM systems, we propose an improved TSVR, the PWWTSVR algorithm, based on the wavelet transform. Calculating the weight matrix by the wavelet transform offers a new preprocessing angle. The wavelet transform is a time-frequency representation of signals, so the proposed method based on wavelet theory is suitable for dealing with time-series samples such as channel parameters. Additionally, compared with TSVR, the PWWTSVR improves the regression performance by adding projection terms to the objective function.

  3.

    To make better use of the information in the received pilot signal and improve the channel estimation accuracy, a weight matrix is introduced into the objective functions for channel parameter regression. The weights are inserted into both the quadratic and first-degree terms to reduce the influence of outliers, which are likely to appear in a received pilot signal polluted by noise. Intuitively, the PWWTSVR can make full use of the prior information of the training channel response. The weight matrix and weight vector, which represent the distance between the noisy samples and their ‘real positions’, reflect the prior information of the training channel response. Larger weights are given to samples with smaller noise, and smaller weights to samples with larger noise.

This paper is organized as follows: Sect. 2 briefly describes the OFDM system model. Section 3 proposes the projection wavelet weighted TSVR channel estimation method. Experimental results are presented in Sect. 4 to investigate the validity of the proposed algorithm, and Sect. 5 ends the paper with concluding remarks.

2 System model

Consider an OFDM system with N subcarriers over a doubly selective fading channel. The sequence to be transmitted, X(k), drawn from a QPSK or QAM constellation, is parsed into blocks of N symbols and then transformed into a time-domain sequence using an N-point IFFT. To avoid inter-block interference (IBI), a cyclic prefix (CP) of length M that is equal to or larger than the channel order L (the number of discrete channel paths is \(L+1\)) is inserted at the head of each block. The time-domain signal x(n) is then transmitted serially over the fading channel and can be expressed as

$$\begin{aligned} x(n)=\frac{1}{\sqrt{N}}\sum \limits _{k=0}^{N-1}X(k)e^{j2\pi nk/N} \end{aligned}$$
(1)

where \(n=-M,\ldots ,N-1,\) \(k=0,1,\ldots ,N-1.\) After CP is removed at the receiver, the received signal in time domain y(n) can be expressed as

$$\begin{aligned} y(n)=\sum \limits _{l=0}^{L}h_{l}(n)x(n-l)+\eta (n) \end{aligned}$$
(2)

where \(\eta (n)\) is additive white Gaussian noise (AWGN) with zero mean and variance \(\sigma _{n}^{2}\), whose samples are mutually independent, i.e. \(E(\eta (n)\eta (m))=\delta (n-m)\sigma _{n}^{2}\). \(h_{l}(n)\) is the baseband-equivalent doubly selective channel impulse response of the lth path (\(l=0,1,\ldots ,L\)) at time n, which includes the physical channel as well as the filters at the transmitter and receiver.
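
The transmit and receive steps in (1) and (2) can be sketched numerically as follows. This is a minimal illustration, not the simulation setup of the paper: the block size, CP length, channel order, noise level and the linearly drifting tap model `h` are all hypothetical choices made only so that the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 64, 16, 3            # subcarriers, CP length (M >= L), channel order (assumed values)

# QPSK symbols X(k) with unit average power
bits = rng.integers(0, 2, size=(N, 2))
X = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)

# Eq. (1): unitary N-point IDFT, then prepend the last M samples as the cyclic prefix
x = np.fft.ifft(X, norm="ortho")
x_cp = np.concatenate([x[-M:], x])           # samples for n = -M, ..., N-1

# Hypothetical time-varying taps h_l(n): a random static part plus a small linear drift
n_idx = np.arange(-M, N)
h0 = (rng.standard_normal(L + 1) + 1j * rng.standard_normal(L + 1)) / np.sqrt(2 * (L + 1))
drift = 1e-3 * (rng.standard_normal(L + 1) + 1j * rng.standard_normal(L + 1))
h = h0[:, None] + drift[:, None] * n_idx[None, :]     # shape (L+1, N+M)

# Eq. (2): y(n) = sum_l h_l(n) x(n-l) + eta(n), kept for n = 0, ..., N-1 (CP removed)
sigma_n = 0.05                                         # assumed noise standard deviation
y = np.zeros(N, dtype=complex)
for n in range(N):
    for l in range(L + 1):
        y[n] += h[l, n + M] * x_cp[n + M - l]          # offset by M to index into x_cp
eta = sigma_n * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
y = y + eta
```

Because \(M\ge L\), every term \(x(n-l)\) needed for \(n\ge 0\) falls inside the prefixed block, which is why no samples from a previous block appear in `y`.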

The matrix form of (2) can be expressed as

$$\begin{aligned} \mathbf {y}=\mathbf {hF}^{H}\mathbf {X}+{\eta }=\mathbf {hx}+{\eta } \end{aligned}$$
(3)

where \(\mathbf {X}=[X(0),X(1),\ldots ,X(N-1)]^{T}\), \(\mathbf {x}=[ x(0),x(1),\ldots x(N-1)]^{T}\), \(\mathbf {y}=[y(0),y(1),\ldots y(N-1)]^{T}\), \({\eta }=[\eta (0),\eta (1),\ldots \eta (N-1)]^{T}\), (\(\cdot\))\(^{T}\) denotes the transpose operation; \(\mathbf {X},\mathbf {x},\mathbf {y},{ \eta }\in \mathbb {C}^{N}\), and the channel matrix \(\mathbf {h}\in \mathbb {C} ^{N\times N}\) can be expressed as

$$\begin{aligned} \mathbf {h}=\left[ \begin{array}{ccccc} h_{0}(0) &{} \cdots &{} h_{L}(0) &{} \cdots &{} h_{1}(0) \\ \vdots &{} \ddots &{} &{} \ddots &{} \vdots \\ \vdots &{} &{} &{} \mathbf {0} &{} h_{L}(L-1) \\ h_{L}(L+1) &{} \cdots &{} h_{0}(L+1) &{} &{} \mathbf {0} \\ &{} \ddots &{} &{} \ddots &{} \\ \mathbf {0} &{} &{} h_{L}(N-1) &{} \cdots &{} h_{0}(N-1) \end{array} \right] \end{aligned}$$
(4)

where \(\mathbf {F}^{H}\) is the N-point inverse discrete Fourier transform (IDFT) matrix with entries \([\mathbf {F}^{H}]_{n,k}=(1/\sqrt{N})\exp (j2\pi nk/N)\), and \((\cdot )^{H}\) denotes the conjugate transpose. The discrete Fourier transform (DFT) matrix \(\mathbf {F}\) is defined correspondingly, with \([\mathbf {F}] _{n,k} =(1/\sqrt{N})\exp (-j2\pi nk/N)\).

Applying the Fourier transform to both sides of (3), the following equations can be obtained,

$$\begin{aligned} \mathbf {Fy} &= \mathbf {FhF}^{H}\mathbf {X}+\mathbf {F}\eta \nonumber \\ \mathbf {Y} &= \mathbf {HX}+\mathbf {F}\eta \end{aligned}$$
(5)

where \(\mathbf {Y}=[Y(0),Y(1),\ldots ,Y(N-1)]^{T}\in \mathbb {C}^{N}\) is the received signal vector in the frequency domain, and \(\mathbf {H}=\mathbf {FhF} ^{H}\in \mathbb {C}^{N\times N}\) is the frequency-domain channel matrix with intercarrier interference (ICI) induced by time variations of the channel, the elements of which can be described as

$$\begin{aligned} \mathbf {H}(p,q)=\frac{1}{N}\sum \limits _{l=0}^{L}e^{-j2\pi ql/N}\sum \limits _{n=0}^{N-1}h_{l}(n)e^{-j2\pi (p-q)n/N} \end{aligned}$$
(6)

where \(p,q=0,1,\ldots ,N-1\). The off-diagonal elements of \(\mathbf {H}\) are the ICI response. \(\mathbf {H}\) can be divided into two parts: \(\mathbf {H }_{d}\) \(\in \mathbb {C}^{N\times N}\), which retains only the principal diagonal elements, and \(\mathbf {H}_{n}\) \(\in \mathbb {C}^{N\times N}\), which retains only the off-diagonal elements. Then (5) can be expressed as

$$\begin{aligned} \mathbf {Y} &= \mathbf {H}_{d}\mathbf {X}+\mathbf {H}_{n}\mathbf {X}+\mathbf {F}\eta \nonumber \\ &= {\text {diag}}(\mathbf {X})\mathbf {H}_{d}^{^{\prime }}+\mathbf {H}_{n}\mathbf {X}+\mathbf {F}\eta \end{aligned}$$
(7)

\(\mathbf {H}_{n}\mathbf {X}\) is the ICI component, \({\text {diag}}(\cdot )\) denotes the diagonal operators and \(\mathbf {H}_{d}^{^{\prime }}\in \mathbb {C}^{N}\) is a column vector, the element of which is taken from the principal diagonal element of \({\mathbf {H}}_{d}\).
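
The decomposition in (3)–(7) can be checked numerically with a short sketch. The helper names `channel_matrix` and `to_frequency_domain`, the block size and the drifting tap model are assumptions introduced only for illustration; the sketch builds the time-domain matrix of (4), forms \(\mathbf {H}=\mathbf {FhF}^{H}\), and splits it into its diagonal and off-diagonal parts.

```python
import numpy as np

def channel_matrix(taps):
    """Build the N x N time-domain channel matrix of Eq. (4); taps[l, n] = h_l(n)."""
    n_taps, N = taps.shape
    h = np.zeros((N, N), dtype=complex)
    for n in range(N):
        for l in range(n_taps):
            h[n, (n - l) % N] = taps[l, n]     # circular structure created by the CP
    return h

def to_frequency_domain(h):
    """Return H = F h F^H with the unitary DFT matrix F."""
    N = h.shape[0]
    F = np.fft.fft(np.eye(N), norm="ortho")    # [F]_{n,k} = exp(-j 2 pi n k / N) / sqrt(N)
    return F @ h @ F.conj().T

rng = np.random.default_rng(1)
N, L = 16, 2                                   # assumed small sizes for the illustration

# Hypothetical taps: a static part plus a linear drift over the block (time selectivity)
static = rng.standard_normal((L + 1, 1)) + 1j * rng.standard_normal((L + 1, 1))
slope = 0.05 * (rng.standard_normal((L + 1, 1)) + 1j * rng.standard_normal((L + 1, 1)))
taps = static + slope * np.arange(N)[None, :]

H = to_frequency_domain(channel_matrix(taps))
H_d_vec = np.diag(H)                           # H_d' : principal diagonal as a vector
H_n = H - np.diag(H_d_vec)                     # off-diagonal (ICI) part

X = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)
Y = H @ X                                      # noise-free Eq. (5)
Y_split = np.diag(X) @ H_d_vec + H_n @ X       # noise-free Eq. (7)

ici_ratio = np.sum(np.abs(H_n) ** 2) / np.sum(np.abs(H) ** 2)

# Time-invariant reference: the same taps frozen at n = 0
H_static = to_frequency_domain(channel_matrix(np.repeat(taps[:, :1], N, axis=1)))
```

With the frozen taps the time-domain matrix is circulant, so `H_static` is diagonal up to rounding and the ICI part vanishes; the drifting taps give a nonzero `ici_ratio`.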

Because of the Doppler effect, each subcarrier suffers ICI from adjacent subcarriers. Figure 1 shows the simulated ICI power contributed by adjacent subcarriers to a given subcarrier. The horizontal axis is the distance between an adjacent subcarrier and the given subcarrier, and the vertical axis is the average normalized ICI power. In this simulation, the number of subcarriers per OFDM symbol is \(N=64\), the carrier frequency is \(f_{c}=2.15\) GHz, and the mobile speed is 10, 120 or 350 km/h. It can be seen from Fig. 1 that most of the energy is concentrated on the diagonal and its neighbors, so the main ICI power is generated by adjacent subcarriers. The closer a subcarrier is to the given subcarrier, the greater its interference, and the higher the moving speed, the more severe the effect.
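
To give a sense of scale for the three speeds, the maximum Doppler shift can be computed from the standard relation \(f_{d}=vf_{c}/c\). This relation and the helper name `max_doppler_hz` are not stated in the text and are used here as assumptions; the subcarrier spacing is not given, so no normalized Doppler is computed.

```python
C_LIGHT = 299_792_458.0          # speed of light in m/s

def max_doppler_hz(speed_kmh, carrier_hz):
    """Maximum Doppler shift f_d = v * f_c / c for a speed given in km/h."""
    return (speed_kmh / 3.6) * carrier_hz / C_LIGHT

f_c = 2.15e9                     # carrier frequency used in the simulation of Fig. 1
doppler = {v: max_doppler_hz(v, f_c) for v in (10, 120, 350)}
for v, f_d in doppler.items():
    print(f"{v:>3} km/h -> f_d = {f_d:.1f} Hz")
```

The shift grows linearly with speed, roughly from 20 Hz at 10 km/h to 700 Hz at 350 km/h, which is in line with the stronger ICI reported for higher speeds.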

Fig. 1 ICI power influence of adjacent subcarriers

3 Projection wavelet weighted TSVR channel estimation

Linear interpolation methods, which are often adopted in typical channel estimation, are inadequate for time-varying fading channels, whereas TSVR is suitable for regression of nonlinear systems owing to its kernel mapping technique. We therefore use TSVR for nonlinear channel estimation. TSVR was proposed by Peng (2010) as an extension of the support vector machine, a classification tool, to regression applications; the goal of the regression problem is to learn the relationship between inputs and their corresponding outputs. On the basis of Peng’s work, many improved algorithms have been proposed (Huang et al. 2014; Balasundaram and Meena 2016b; Khemchandani et al. 2016; Parastalooi et al. 2016; Rastogi et al. 2017). In this section, a novel projection wavelet weighted TSVR is proposed for deep fading channel estimation, which inherits the advantages of TSVR for regression of nonlinear systems.

Given a training set \(\mathbf {Tr}=\{(\mathbf {t}_{1},z_{1}),(\mathbf {t} _{2},z_{2}),\ldots ,(\mathbf {t}_{m},z_{m})\}\), where \(\mathbf {t}_{i}\in \mathbb {R }^{2}\) and \(z_{i}\in \mathbb {R}\), \(i=1,2,\ldots ,m\), m is the number of training samples. Then the output vector of the training data can be denoted as \(\mathbf {Z}=(z_{1},z_{2},\ldots ,z_{m})^{T}\in \mathbb {R}^{m}\) and the input matrix as \(\mathbf {T}=(\mathbf {t}_{1},\mathbf {t}_{2},\ldots ,\mathbf {t} _{m})^{T}\in \mathbb {R}^{m\times 2}\). The first column of \(\mathbf {T}\) represents the time domain positions of training samples and the second column of \(\mathbf {T}\) represents the frequency domain positions of training samples. Let \(\mathbf {e}\) and \(\mathbf {I}\) be a ones vector and an identity matrix of appropriate dimensions, respectively.

In high-mobility wireless environments, the channel is selective in both the time and frequency domains, and the doubly selective channel exhibits very complex nonlinearity in the case of fast and deep fading. Therefore, linear channel estimation methods cannot achieve high performance. We adopt the nonlinear PWWTSVR algorithm to meet the estimation requirements of nonlinear channels, since TSVR performs well on nonlinear, small-sample and high-dimensional pattern recognition problems. Similar to the classical TSVR model (Peng 2010), the PWWTSVR is constructed from two nonparallel hyperplanes, the down-bound \(f_{1}(\mathbf {t})\) and the up-bound \(f_{2}(\mathbf {t})\); each hyperplane determines an \(\epsilon\)-insensitive bound regressor, and the end regressor is \(f(\mathbf {t})=\frac{1}{2}(f_{1}( \mathbf {t})+f_{2}(\mathbf {t}))\).

3.1 Channel estimation algorithm-projection wavelet weighted twin support vector regression

Initially, SVR was used to solve linear regression problems, where a hyperplane is fitted to the relationship between inputs and outputs using the training samples. However, the hyperplane method is only applicable to linear problems, whereas most real-world problems are nonlinear. Based on Vapnik's theory (Vapnik 1995), the algorithm can be extended to nonlinear cases by using kernel mapping. The kernel trick is adopted to map the input into a higher-dimensional feature space, and the following kernel-generated functions are considered: the down-bound \(f_{1}(\mathbf {t})=K(\mathbf {t},\mathbf { T}^{T})\mathbf {w}_{1}+b_{1}\) and the up-bound \(f_{2}(\mathbf {t})=K(\mathbf {t}, \mathbf {T}^{T})\mathbf {w}_{2}\) \(+b_{2}\), where \(\mathbf {w}_{1},\mathbf {w} _{2}\in \mathbb {R}^{m}\), m is the number of training points, \(b_{1},b_{2}\in \mathbb {R},\) and K is an appropriately chosen kernel. The end regressor is the average of \(f_{1}(\mathbf {t})\) and \(f_{2}(\mathbf {t}),\) i.e. \(f(\mathbf {t})=\frac{1}{2}(f_{1}(\mathbf {t})+f_{2}( \mathbf {t}))\). The optimization problems can be described as follows:

$$\begin{aligned} \begin{array}{ll} \underset{\mathbf {w}_{1},b_{1},{\xi }_{1}}{\min } &{} \begin{array}{l} L_{1}=\frac{1}{2}(\mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{1}+ \mathbf {e}b_{1}))^{T}\mathbf {D}(\mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T}) \mathbf {w}_{1}+\mathbf {e}b_{1}))+\frac{1}{2}c_{11}(\mathbf {w}_{1}^{T}\mathbf { w}_{1}+b_{1}^{2}) \\ +\frac{1}{2}c_{12}\hat{\mathbf {w}}_{1}^{T}{\Sigma }_{z}\hat{\mathbf {w} }_{1}+c_{13}\mathbf {e}^{T}\mathbf {D}\xi _{1} \end{array} \\ \text {s.t.} &{} \begin{array}{cc} \mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{1}+\mathbf {e} b_{1})\ge -\varepsilon _{1}\mathbf {e}-{\xi }_{1}, &{} {\xi }_{1}\ge 0\mathbf {e} \end{array} , \end{array} \end{aligned}$$
(8)

and

$$\begin{aligned} \begin{array}{ll} \underset{\mathbf {w}_{2},b_{2},{\xi }_{2}}{\min } &{} \begin{array}{l} L_{2}=\frac{1}{2}(\mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{2}+ \mathbf {e}b_{2}))^{T}\mathbf {D}(\mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T}) \mathbf {w}_{2}+\mathbf {e}b_{2}))+\frac{1}{2}c_{21}(\mathbf {w}_{2}^{T}\mathbf { w}_{2}+b_{2}^{2}) \\ +\frac{1}{2}c_{22}\hat{\mathbf {w}}_{2}^{T}{\Sigma }_{z}\hat{\mathbf {w}}_{2}+c_{23}\mathbf {e}^{T}\mathbf {D}\xi _{2} \end{array} \\ \text {s.t.} &{} \begin{array}{cc} (K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{2}+\mathbf {e}b_{2})-\mathbf {Y} \ge -\varepsilon _{2}\mathbf {e}-{\xi }_{2}, &{} {\xi } _{2}\ge 0\mathbf {e} \end{array} , \end{array} \end{aligned}$$
(9)

where \(c_{11},c_{12},c_{13},c_{21},c_{22},c_{23}>0\) are parameters chosen a priori by the user; \(\varepsilon _{1}\) and \(\varepsilon _{2}\) are insensitive parameters, \({\xi }_{1}\) and \({\xi }_{2}\) are slack vectors to measure the errors of samples outside the “\(\varepsilon\) tube”. \(\mathbf {D}\in \mathbb {R}^{m\times m}\) is a weighting matrix, which will be discussed later.

The first term in the objective function of (8) is the sum of weighted squared distances from the training points to the down-bound function, which is called the empirical risk. Minimizing it makes the function \(f_{1}(\mathbf {t})\) fit the training samples and avoids under-fitting. The second term is a regularization term, which makes \(f_{1}(\mathbf {t})\) as smooth as possible. Structural risk minimization is implemented by minimizing the regularization term \(\frac{1}{2}(\mathbf {w} _{1}^{T}\mathbf {w}_{1}+b_{1}^{2})\); a small value of this term corresponds to the function \(f_{1}(\mathbf {t})\) being flat. The third term, the data structure term, minimizes the empirical variance of the projected points on the down-bound function. The fourth term aims to minimize the sum of errors of the points lower than the down-bound \(f_{1}(\mathbf {t})\), which can possibly over-fit the training points. The relative importance of the four terms in the objective function of (8) can be adjusted through the choice of \(c_{11},c_{12},\) and \(c_{13}\). The optimization problem (9) can be interpreted similarly.

In the third term, the projection axis, \(\hat{\mathbf {w}}_{k}=[\mathbf {w} _{k};-1]\), \(k=1,2,\) is normal to the line of the bound regression functions. The projection axis is meant to make the projected zone or the variance of the projected noise as small as possible, and the following formula can be obtained.

$$\begin{aligned} \min \frac{1}{2}\hat{\mathbf {w}}_{k}^{T}\Sigma _{\mathbf {z}}\hat{\mathbf {w}} _{k}=\min \frac{1}{2}\mathbf {w}_{k}^{T}{\Sigma }_{\varphi (\mathbf {t} )}\mathbf {w}_{k}-\mathbf {w}_{k}^{T}{\Sigma }_{\varphi (\mathbf {t})y} \end{aligned}$$
(10)

where

$$\begin{aligned} {\Sigma }_{\mathbf {z}} &= \frac{1}{m}\sum \limits _{i=1}^{m}(\mathbf {z} _{i}-{\mu }_{\mathbf {z}})(\mathbf {z}_{i}-{\mu }_{\mathbf {z}})^{T}, \nonumber \\ {\Sigma }_{\varphi (\mathbf {t})} &= \frac{1}{m}\sum \limits _{i=1}^{m}( \varphi (\mathbf {t}_{i})-{\mu }_{\varphi (\mathbf {t})})(\varphi ( \mathbf {t}_{i})-{\mu }_{\varphi (\mathbf {t})})^{T}, \nonumber \\ {\Sigma }_{\varphi (\mathbf {t})y} &= \frac{1}{m}\sum \limits _{i=1}^{m} \varphi (\mathbf {t}_{i})y_{i}-{\mu }_{\varphi (\mathbf {t})}\mu _{y}, \end{aligned}$$
(11)

and \(\mathbf {z}_{i}=(\varphi ( \mathbf {t}_{i});\,y_{i})\), \(i=1,\ldots ,m\), is the training point, \(\mu _{\mathbf {z}}\) is the centroid of the \(\mathbf {z}_{i}\), \({\Sigma }_{\mathbf {z}}\) is the covariance matrix of the \(\mathbf {z}_{i}\), \({\Sigma }_{\varphi (\mathbf {t} )}\) is the empirical covariance matrix of the inputs, and \({\Sigma } _{\varphi (\mathbf {t})y}\) is the empirical correlation coefficient matrix between the inputs and responses. \(\mu _{\varphi ( \mathbf {t})}\) and \(\mu _{y}\) are the centroids of the inputs and outputs, respectively. \(\varphi (\mathbf {t})\) is the operation of mapping the input \(\mathbf {t}\) to a high-dimensional feature space using the kernel trick for nonlinear applications, i.e., \(\mathbf {t}\rightarrow \varphi (\mathbf {t})\). However, \(\varphi (\mathbf {t})\) lacks an explicit formulation due to the higher dimensions, which prevents the computation of \(\Sigma _{\varphi ( \mathbf {t})}\). In the work of Peng et al. (2014), the eigenvalue decomposition method is adopted to map explicitly to the empirical feature space. Let \(K(\mathbf {T}, \mathbf {T}^{T})\) denote an \(m\times m\) matrix of rank r, where K is an appropriately chosen kernel. Since \(K(\mathbf {T},\mathbf {T}^{T})\) is a symmetric positive-semidefinite matrix, it can be decomposed as \(K(\mathbf {T} ,\mathbf {T}^{T})=\mathbf {P}_{m\times r}{\varvec{\Lambda }} \mathbf {P}_{r\times m}^{T},\) where \({\varvec{\Lambda }}\) is a diagonal matrix containing only the r positive eigenvalues of \(K(\mathbf {T},\mathbf {T}^{T})\) in decreasing order, and \(\mathbf {P}_{m\times r}\) consists of the eigenvectors corresponding to the positive eigenvalues. The mapping from the input data space to the kernel space is expressed as

$$\begin{aligned} \mathbf {t}\rightarrow \varphi (\mathbf {t})=\Lambda ^{-1/2}\mathbf {P}^{T}K( \mathbf {t},\mathbf {T}^{T}). \end{aligned}$$
(12)
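
The empirical feature map in (12) can be sketched as follows. The Gaussian (RBF) kernel, its width `gamma`, the eigenvalue threshold `tol` and the helper names are assumptions made for illustration; the text only requires an appropriately chosen kernel.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B (assumed kernel)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def empirical_feature_map(T, gamma=1.0, tol=1e-10):
    """Return phi(.) of Eq. (12) together with the kept eigenvalues and eigenvectors."""
    K = rbf_kernel(T, T, gamma)
    eigval, eigvec = np.linalg.eigh(K)             # ascending order for symmetric K
    order = np.argsort(eigval)[::-1]               # reorder to decreasing eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval > tol * eigval[0]                # keep the numerically positive ones
    lam, P = eigval[keep], eigvec[:, keep]

    def phi(t_new):
        k = rbf_kernel(T, np.atleast_2d(t_new), gamma)   # shape (m, q)
        return (P.T @ k) / np.sqrt(lam)[:, None]         # Lambda^{-1/2} P^T K, shape (r, q)

    return phi, lam, P

rng = np.random.default_rng(2)
T = rng.uniform(0.0, 1.0, size=(12, 2))            # hypothetical (time, frequency) inputs
phi, lam, P = empirical_feature_map(T, gamma=1.0)
Phi = phi(T)                                       # columns are phi(t_i)
K = rbf_kernel(T, T, 1.0)
Sigma_phi = np.cov(Phi, bias=True)                 # empirical covariance with 1/m scaling
```

A quick sanity check is that inner products in the empirical feature space reproduce the kernel, i.e. `Phi.T @ Phi` matches `K` up to the discarded eigenvalues.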

To solve the optimization problems (8) and (9), we convert the constrained problems to a pair of unconstrained problems by introducing the plus function (\(\cdot\))\(_{+}\) and substitute (10) into (8) and (9) as follows.

$$\begin{aligned}&\begin{array}{ll} \underset{\mathbf {w}_{1},b_{1}}{\min }&L_{1}=\frac{1}{2}(\mathbf {Y}-(K( \mathbf {T},\mathbf {T}^{T})\mathbf {w}_{1}+\mathbf {e}b_{1}))^{T}\mathbf {D}( \mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{1}+\mathbf {e}b_{1}))+ \frac{1}{2}c_{11}(\mathbf {w}_{1}^{T}\mathbf {w}_{1}+b_{1}^{2}) \end{array} \nonumber \\ & \quad +\frac{1}{2}c_{12}(\mathbf {w}_{1}^{T}{\Sigma }_{\varphi (\mathbf {t} )}\mathbf {w}_{1}-\mathbf {w}_{1}^{T}{\Sigma }_{\varphi (\mathbf {t} )y})+c_{13}\mathbf {e}^{T}\mathbf {D}((K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{1}+ \mathbf {e}b_{1})-(\mathbf {Y}+\varepsilon _{1}\mathbf {e}))_{+} \end{aligned}$$
(13)

and

$$\begin{aligned}&\begin{array}{ll} \underset{\mathbf {w}_{2},b_{2}}{\min }&L_{2}=\frac{1}{2}(\mathbf {Y}-(K( \mathbf {T},\mathbf {T}^{T})\mathbf {w}_{2}+\mathbf {e}b_{2}))^{T}\mathbf {D}( \mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{2}+\mathbf {e}b_{2}))+ \frac{1}{2}c_{21}(\mathbf {w}_{2}^{T}\mathbf {w}_{2}+b_{2}^{2}) \end{array} \nonumber \\& \quad +\frac{1}{2}c_{22}(\mathbf {w}_{2}^{T}{\Sigma }_{\varphi (\mathbf {t} )}\mathbf {w}_{2}-\mathbf {w}_{2}^{T}{\Sigma }_{\varphi (\mathbf {t} )y})+c_{23}\mathbf {e}^{T}\mathbf {D}((\mathbf {Y}-\varepsilon _{2}\mathbf {e})-(K( \mathbf {T},\mathbf {T}^{T})\mathbf {w}_{2}+\mathbf {e}b_{2}))_{+} \end{aligned}$$
(14)

Plus functions \((\cdot )_{+}\) in (13) and (14) are not differentiable, but they can be replaced by smooth approximate functions \(\mathbf {p}(\cdot )\). In this paper, we adopt the sigmoid integral function as a smooth function; it is defined as

$$\begin{aligned} \mathbf {p}(\mathbf {t})=\mathbf {t}+\frac{1}{\alpha }\log (1+\exp (-\alpha \mathbf {t})). \end{aligned}$$
(15)

where \(\alpha\) is a positive real constant. Defining \(\mathbf {f}_{1}=(\mathbf { Y}+\varepsilon _{1}\mathbf {e})\), \(\mathbf {f}_{2}=(\mathbf {Y}-\varepsilon _{2} \mathbf {e})\), \(\mathbf {u}_{1}=[\mathbf {w}_{1}^{T},b_{1}]^{T}\), \(\mathbf {u} _{2}=[\mathbf {w}_{2}^{T},b_{2}]^{T}\), \(\mathbf {G}=[K(\mathbf {T},\mathbf {T} ^{T}),\mathbf {e}]\) and replacing the plus functions \((\cdot )_{+}\) in (13) and (14) by (15), we obtain

$$\begin{aligned}&\begin{array}{ll} \underset{\mathbf {w}_{1},b_{1}}{\min }&L_{1}=\frac{1}{2}(\mathbf {Y}-(K( \mathbf {T},\mathbf {T}^{T})\mathbf {w}_{1}+\mathbf {e}b_{1}))^{T}\mathbf {D}( \mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{1}+\mathbf {e}b_{1}))+ \frac{1}{2}c_{11}(\mathbf {w}_{1}^{T}\mathbf {w}_{1}+b_{1}^{2}) \end{array} \nonumber \\& \quad +\frac{1}{2}c_{12}(\mathbf {w}_{1}^{T}{\Sigma }_{\varphi (\mathbf {t} )}\mathbf {w}_{1}-\mathbf {w}_{1}^{T}{\Sigma }_{\varphi (\mathbf {t} )y})+c_{13}\mathbf {e}^{T}\mathbf {Dp}(\mathbf {Gu}_{1}-\mathbf {f}_{1}), \end{aligned}$$
(16)

and

$$\begin{aligned}&\begin{array}{ll} \underset{\mathbf {w}_{2},b_{2}}{\min }&L_{2}=\frac{1}{2}(\mathbf {Y}-(K( \mathbf {T},\mathbf {T}^{T})\mathbf {w}_{2}+\mathbf {e}b_{2}))^{T}\mathbf {D}( \mathbf {Y}-(K(\mathbf {T},\mathbf {T}^{T})\mathbf {w}_{2}+\mathbf {e}b_{2}))+ \frac{1}{2}c_{21}(\mathbf {w}_{2}^{T}\mathbf {w}_{2}+b_{2}^{2}) \end{array} \nonumber \\& \quad +\frac{1}{2}c_{22}(\mathbf {w}_{2}^{T}{\Sigma }_{\varphi (\mathbf {t} )}\mathbf {w}_{2}-\mathbf {w}_{2}^{T}{\Sigma }_{\varphi (\mathbf {t} )y})+c_{23}\mathbf {e}^{T}\mathbf {Dp}(\mathbf {f}_{2}-\mathbf {Gu}_{2}). \end{aligned}$$
(17)
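
The smoothing step can be illustrated with a short sketch of the plus function and its smooth replacement (15). The function names and the value of \(\alpha\) are hypothetical; `np.logaddexp` and the `tanh` form of the sigmoid are used only to avoid overflow for large arguments and are implementation choices, not part of the method.

```python
import numpy as np

def plus(t):
    """Plus function (t)_+ = max(t, 0), applied elementwise."""
    return np.maximum(t, 0.0)

def smooth_plus(t, alpha=5.0):
    """Eq. (15): p(t) = t + log(1 + exp(-alpha * t)) / alpha, computed stably."""
    return t + np.logaddexp(0.0, -alpha * t) / alpha

def smooth_plus_grad(t, alpha=5.0):
    """First derivative p'(t) = 1 / (1 + exp(-alpha * t)), the sigmoid of alpha * t."""
    return 0.5 * (1.0 + np.tanh(0.5 * alpha * t))

def smooth_plus_hess(t, alpha=5.0):
    """Second derivative p''(t) = alpha * s * (1 - s) with s = p'(t)."""
    s = smooth_plus_grad(t, alpha)
    return alpha * s * (1.0 - s)

alpha = 5.0
t = np.linspace(-3.0, 3.0, 601)
gap = smooth_plus(t, alpha) - plus(t)      # approximation error, largest at t = 0
```

The gap between \(\mathbf {p}(\cdot )\) and \((\cdot )_{+}\) is at most \(\log 2/\alpha\), attained at zero, so a larger \(\alpha\) gives a tighter approximation at the cost of a sharper transition.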

Note that \(L_{1}\) in (16) and \(L_{2}\) in (17) are convex (Wang et al. 2019). Global and unique solutions can be obtained, and the Newton iterative approach can be adopted to solve the minimization problems as follows.

The first- and second-order gradients of \(L_{1}\) in (16) and \(L_{2}\) in (17) are deduced as follows

$$\begin{aligned} \nabla L_{1}(\mathbf {u}_{1})=\mathbf {Q}_{1}\mathbf {u}_{1}-\mathbf {P} _{1}+c_{13}\mathbf {G}^{T}\mathbf {D}{\text {diag}}\left( \frac{1}{1+\exp (-\alpha ( \mathbf {Gu}_{1}-\mathbf {f}_{1}))}\right) \mathbf {e} \end{aligned}$$
(18)
$$\begin{aligned} \nabla ^{2}L_{1}(\mathbf {u}_{1})=\mathbf {Q}_{1}+\alpha c_{13}\mathbf {G}^{T} \mathbf {D}{\text {diag}}\left( \frac{\exp (-\alpha (\mathbf {Gu}_{1}-\mathbf {f}_{1}))}{ (1+\exp (-\alpha (\mathbf {Gu}_{1}-\mathbf {f}_{1})))^{2}}\right) \mathbf {G}, \end{aligned}$$
(19)

and

$$\begin{aligned} \nabla L_{2}(\mathbf {u}_{2})=\mathbf {Q}_{2}\mathbf {u}_{2}-\mathbf {P} _{2}-c_{23}\mathbf {G}^{T}\mathbf {D}{\text {diag}}\left( \frac{1}{1+\exp (-\alpha ( \mathbf {f}_{2}-\mathbf {Gu}_{2}))}\right) \mathbf {e} \end{aligned}$$
(20)
$$\begin{aligned} \nabla ^{2}L_{2}(\mathbf {u}_{2})=\mathbf {Q}_{2}+\alpha c_{23}\mathbf {G}^{T} \mathbf {D}{\text {diag}}\left( \frac{\exp (-\alpha (\mathbf {f}_{2}-\mathbf {Gu}_{2}))}{ (1+\exp (-\alpha (\mathbf {f}_{2}-\mathbf {Gu}_{2})))^{2}}\right) \mathbf {G}, \end{aligned}$$
(21)

where \(\mathbf {Q}_{1}=\left[ \begin{array}{ll} c_{11}\mathbf {I}+K(\mathbf {T},\mathbf {T}^{T})^{T}\mathbf {D}K(\mathbf {T}, \mathbf {T}^{T})+c_{12}{\Sigma }_{\varphi (\mathbf {t})} &{} K( \mathbf {T},\mathbf {T}^{T})^{T}\mathbf {De} \\ \mathbf {e}^{T}\mathbf {D}K(\mathbf {T},\mathbf {T}^{T}) &{} c_{11}+ \mathbf {e}^{T}\mathbf {De} \end{array} \right]\), \(\mathbf {P}_{1}=\left[ \begin{array}{l} K(\mathbf {T},\mathbf {T}^{T})^{T}\mathbf {DY}+c_{12}{\Sigma }_{\varphi ( \mathbf {t})y} \\ \mathbf {e}^{T}\mathbf {DY} \end{array} \right]\), \(\mathbf {Q}_{2}=\left[ \begin{array}{ll} c_{21}\mathbf {I}+K(\mathbf {T},\mathbf {T}^{T})^{T}\mathbf {D}K(\mathbf {T},\mathbf {T}^{T})+ c_{22}{\Sigma }_{\varphi (\mathbf {t})} &{} K(\mathbf {T},\mathbf {T} ^{T})^{T}\mathbf {De} \\ \mathbf {e}^{T}\mathbf {D}K(\mathbf {T},\mathbf {T}^{T}) &{} c_{21}+ \mathbf {e}^{T}\mathbf {De} \end{array} \right]\), and \(\mathbf {P}_{2}=\left[ \begin{array}{l} K(\mathbf {T},\mathbf {T}^{T})^{T}\mathbf {DY}+c_{22}{\Sigma }_{\varphi ( \mathbf {t})y} \\ \mathbf {e}^{T}\mathbf {DY} \end{array} \right] .\)

The iterative solutions of the minimization problems (16) and (17) can be obtained by adopting the Newton method and using (18)–(21), as follows:

$$\begin{aligned} \mathbf {u}_{1}^{k+1}=\mathbf {u}_{1}^{k}-(\nabla ^{2}L_{1}(\mathbf {u} _{1}^{k}))^{-1}(\nabla L_{1}(\mathbf {u}_{1}^{k})), \end{aligned}$$
(22)
$$\begin{aligned} \mathbf {u}_{2}^{k+1}=\mathbf {u}_{2}^{k}-(\nabla ^{2}L_{2}(\mathbf {u} _{2}^{k}))^{-1}(\nabla L_{2}(\mathbf {u}_{2}^{k})). \end{aligned}$$
(23)
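
The Newton update for the down-bound problem can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name `newton_down_bound`, the RBF kernel, all parameter values, unit weights for \(\mathbf {D}\), and the stand-in construction of the covariance terms from the rows of the kernel matrix are hypothetical. The sketch minimizes the quadratic part with gradient \(\mathbf {Q}_{1}\mathbf {u}_{1}-\mathbf {P}_{1}\) plus the smoothed penalty, with \(c_{12}\) multiplying both covariance terms.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def newton_down_bound(K, y, d, Sigma, Sigma_y, c11, c12, c13, eps1,
                      alpha=5.0, tol=1e-8, max_iter=50):
    """Plain Newton iteration for u1 = [w1; b1].

    K: (m, m) kernel matrix, y: (m,) targets, d: (m,) diagonal of D.
    Sigma (m, m) and Sigma_y (m,) are covariance terms supplied by the caller.
    """
    m = K.shape[0]
    G = np.hstack([K, np.ones((m, 1))])                 # G = [K, e]
    f1 = y + eps1                                       # f1 = Y + eps1 * e
    S = np.zeros((m + 1, m + 1))
    S[:m, :m] = Sigma                                   # no covariance term for the bias
    Q1 = G.T @ (d[:, None] * G) + c11 * np.eye(m + 1) + c12 * S
    P1 = G.T @ (d * y) + c12 * np.append(Sigma_y, 0.0)

    u = np.zeros(m + 1)
    for it in range(max_iter):
        s = sigmoid(alpha * (G @ u - f1))               # p'(G u - f1), elementwise
        grad = Q1 @ u - P1 + c13 * G.T @ (d * s)
        hess = Q1 + alpha * c13 * G.T @ ((d * s * (1.0 - s))[:, None] * G)
        step = np.linalg.solve(hess, grad)              # Newton direction
        u = u - step
        if np.linalg.norm(step) < tol:
            break
    return u[:m], u[m], it + 1

# Hypothetical training data: m points with (time, frequency) inputs and real outputs
rng = np.random.default_rng(4)
m = 30
T = rng.uniform(0.0, 1.0, size=(m, 2))
y = np.sin(2 * np.pi * T[:, 0]) + 0.5 * T[:, 1] + 0.05 * rng.standard_normal(m)
K = np.exp(-2.0 * ((T[:, None, :] - T[None, :, :]) ** 2).sum(-1))   # assumed RBF kernel

# Stand-in covariance terms built from the rows of K (an assumption for this sketch)
Sigma = np.cov(K, rowvar=False, bias=True)
Sigma_y = (K * y[:, None]).mean(axis=0) - K.mean(axis=0) * y.mean()

w1, b1, n_iter = newton_down_bound(K, y, np.ones(m), Sigma, Sigma_y,
                                   c11=0.1, c12=0.1, c13=0.1, eps1=0.1)
f1_train = K @ w1 + b1                                  # down-bound on the training inputs
```

The up-bound problem follows the same pattern with \(\mathbf {f}_{2}-\mathbf {Gu}_{2}\) as the argument of the smooth function and the sign of the penalty gradient reversed. No line search is included; for strongly nonquadratic settings a damped step may be needed.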

3.2 Training samples

SVR is a supervised machine learning method, which requires the inputs and outputs of training samples for parameter training. The inputs of the training samples are the positions of the estimated channel response in the time and frequency domains, and the outputs are the channel responses estimated from the received pilots. In the proposed method, the pilots are inserted equidistantly in both the time and frequency domains. OFDM symbols that carry pilots are called pilot symbols, and those without pilots are called data symbols. The positions of the pilot symbols are expressed as \([n\Delta t]\), \(n=0,1,\ldots ,N_{t}-1,\) where \(\Delta t\) is the pilot symbol interval in the time domain and \(N_{t}\) is the number of pilot symbols. Within an OFDM symbol, the transmitted pilot subcarrier positions are expressed as \([m\Delta f]\), \(m=0,1,\ldots ,N_{f}-1,\) where \(\Delta f\) is the pilot interval in the frequency domain and \(N_{f}\) is the number of pilots in an OFDM symbol. The pilot insertion scheme is shown in Fig. 2. Let the transmitted pilot matrix be \(\mathbf {X}_{P}={\text {diag}}(X(n\Delta t,m\Delta f))\in \mathbb {C} ^{N_{t}N_{f}\times N_{t}N_{f}}\); the channel frequency response estimated at the pilot subcarriers according to (7) can then be obtained as

$$\begin{aligned} \hat{\mathbf {H}}_{P}=(\mathbf {X}_{P})^{-1}\mathbf {Y}_{P} \end{aligned}$$
(24)

where \(\mathbf {Y}_{P}=\mathbf {Y}(n\Delta t,m\Delta f)\in \mathbb {C} ^{N_{t}N_{f}}\) is the received pilot vector and \(\hat{\mathbf {H}}_{P}= \hat{\mathbf {H}}(n\Delta t,m\Delta f)\in \mathbb {C}^{N_{t}N_{f}}\) is the estimated frequency response at pilot positions \((n\Delta t,m\Delta f)\).
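
Because \(\mathbf {X}_{P}\) is diagonal, (24) reduces to an element-wise division of each received pilot by the corresponding transmitted pilot. The following Python sketch illustrates this under that reading; the function names `pilot_positions` and `ls_pilot_estimate` and the numerical values are hypothetical and chosen only for illustration.

```python
def pilot_positions(delta_t, n_t, delta_f, n_f):
    """Grid of pilot positions (n*delta_t, m*delta_f), i.e. the SVR inputs."""
    return [(n * delta_t, m * delta_f) for n in range(n_t) for m in range(n_f)]


def ls_pilot_estimate(x_pilots, y_pilots):
    """Element-wise form of (24): H_P = X_P^{-1} Y_P with diagonal X_P."""
    if len(x_pilots) != len(y_pilots):
        raise ValueError("pilot vectors must have the same length")
    return [y / x for x, y in zip(x_pilots, y_pilots)]


# Hypothetical noise-free example: received pilot = channel * transmitted pilot.
x_p = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
h_true = [0.5 + 0.2j, 0.4 - 0.1j, 0.3 + 0.3j, 0.2 - 0.4j]
y_p = [h * x for h, x in zip(h_true, x_p)]

positions = pilot_positions(delta_t=2, n_t=2, delta_f=2, n_f=2)
h_hat = ls_pilot_estimate(x_p, y_p)
```

In this noise-free toy case the estimate recovers the channel exactly; with noise, each element of `h_hat` is perturbed by the noise divided by the pilot value, which is what motivates the sample weighting in Sect. 3.3.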

After interpolation, the frequency response at the data positions can be calculated, and the predicted frequency response of all subcarriers in an OFDM symbol can be expressed as

$$\begin{aligned} \tilde{\mathbf {H}}=f(\hat{\mathbf {H}}_{P}). \end{aligned}$$
(25)

Fig. 2 Pilot insertion scheme in OFDM system

3.3 Weighting parameters

The parameter \(\mathbf {D}\) mentioned above is the weighting matrix, which can be determined beforehand according to the importance of the training data. The samples used for training are all polluted by noise of different amplitudes. Intuitively, samples polluted by large noise should be given smaller weights, so every training sample is assigned a penalty weight through \(\mathbf {D}\) in the proposed algorithm. The weighting parameters represent the distance between a noisy sample and its 'real position' and thus reflect prior information about the training samples. In classical SVR algorithms such as TSVR, by contrast, no weights are applied, which is equivalent to giving every sample a weight of one. Points with too much noise, such as outliers, then degrade the performance of the regressors. Different training samples should therefore be given different weights, where a larger weight means that the sample is more important. Motivated by this idea, and noting that the Gaussian function reflects this trend, the weighting parameter \(\mathbf {D}\) is determined by the Gaussian function as follows.

$$\begin{aligned} \mathbf {D}={\text {diag}}\left( Ae^{(-\left| \mathbf {Y}-\hat{\mathbf {Y}} \right| ^{2}/\sigma ^{2})}\right) , \end{aligned}$$
(26)

where A is the amplitude, \(\sigma\) represents the standard deviation, and \(\hat{\mathbf {Y}}=[\hat{y}_{1},\hat{y}_{2},\ldots ,\hat{y}_{N_{t}N_{f}}]^{T}\) is the estimate of the output vector \(\mathbf {Y}=[y_{1},y_{2},\ldots ,y_{N_{t}N_{f}}]^{T}\). In this paper, a wavelet transform based method is adopted to calculate \(\hat{\mathbf {Y}}\). The wavelet transform is a time-frequency analysis method, and wavelet filtering is well suited to short-duration signals, which is characteristic of most real time series. It inherits and develops the localization idea of the short-time Fourier transform and overcomes the shortcoming that the window size does not change with frequency: it provides a time-frequency window that varies with frequency, which makes it a nearly ideal tool for time-frequency analysis and signal processing. In addition, since the inserted pilots are distributed in two dimensions over the time and frequency domains, two-dimensional wavelet filtering is adopted to calculate \(\hat{\mathbf {Y}}\) in three steps. First, the signal sequence is decomposed by the discrete wavelet transform (DWT), which uses discrete-time filter banks; the wavelet transform can be regarded as a form of time-frequency representation for signals and is thus related to harmonic analysis. Second, the decomposed sequence is processed by an appropriate algorithm to remove noise. In this paper, the denoising step simply sets the high-frequency part of the frequency-domain signal to zero. Third, each estimated output \(\hat{y}\) is reconstructed from the denoised sequence, which yields the estimate \(\hat{\mathbf {Y}}\). Substituting \(\hat{\mathbf {Y}}\) into (26) then gives the weighting vector \(\mathbf {d}\) and the matrix \(\mathbf {D}\).
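
The three steps and the weight computation in (26) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the filtering used in the paper: it applies a one-level, one-dimensional Haar wavelet rather than a two-dimensional Db-3 wavelet, and the names `haar_denoise` and `gaussian_weights` as well as the sample values are hypothetical.

```python
import math


def haar_denoise(samples):
    """One-level Haar DWT, zero the detail (high-pass) band, then invert.

    Assumption: Haar is used here only to keep the sketch short.
    """
    if len(samples) % 2:
        raise ValueError("this sketch needs an even number of samples")
    root2 = math.sqrt(2.0)
    denoised = []
    for i in range(0, len(samples), 2):
        approx = (samples[i] + samples[i + 1]) / root2  # step 1: low-pass coefficient
        detail = 0.0                                    # step 2: high-pass coefficient zeroed
        denoised.append((approx + detail) / root2)      # step 3: reconstruction
        denoised.append((approx - detail) / root2)
    return denoised


def gaussian_weights(y, y_hat, amplitude=1.0, sigma=1.0):
    """Diagonal of D in (26): A * exp(-|y - y_hat|^2 / sigma^2)."""
    return [
        amplitude * math.exp(-abs(a - b) ** 2 / sigma ** 2)
        for a, b in zip(y, y_hat)
    ]


# Hypothetical outputs; the last sample plays the role of an outlier.
y = [1.0, 1.0, 2.0, 2.0, 3.0, 9.0]
y_hat = haar_denoise(y)
d = gaussian_weights(y, y_hat, amplitude=1.0, sigma=2.0)
```

Samples that agree with their denoised values receive a weight close to A, while samples far from them are down-weighted. With the very short Haar filter used here, the neighbour that shares a coefficient with the outlier is down-weighted as well, which is one reason a longer wavelet and a two-dimensional transform may be preferable in practice.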

Weighting parameters have also been introduced in other improved TSVR methods. For example, Xu and Wang (2014) adopt the K-nearest neighbor (KNN) algorithm. One difference between the PWWTSVR and the method proposed by Xu and Wang (2014) lies in the kind of data being processed. The KNN algorithm is suitable for dealing with clustered samples: a sample point x is considered important if it has a large number of K-nearest neighbors, and unimportant if it is an outlier with few K-nearest neighbors. In contrast, the wavelet transform adopted in our proposed algorithm is a time-frequency representation for signals, so the proposed method is suitable for dealing with time series samples.

3.4 Algorithm summary

In this subsection, we illustrate the channel estimation algorithm.

Algorithm: Channel estimation based on PWWTSVR

figure a

3.5 Computational complexity analysis

Computational complexity is an important performance measure of an algorithm. Consider one calculation period in which the number of pilot symbols is \(N_{t}\) and the number of pilots in one pilot symbol is \(N_{f}\), so that the number of training samples is \(l=N_{t}N_{f}\). In step 1 of the algorithm described in the previous subsection, the calculation of \(\hat{\mathbf {H}}_{P}\) requires \(O(l)\) complex multiplications, and step 2 requires \(O(l)\) real multiplications. In step 3, since the input \(\mathbf {t}\) is known at the receiver, \(\varphi (\mathbf {t})\) in (12) can be pre-computed and stored at the receiver. Therefore the calculations of \({\Sigma }_{\varphi (\mathbf {t})}\) and \({\Sigma }_{\varphi (\mathbf {t})y}\) each require \(O(l)\) operations. Step 4 is an iterative process. Its main cost comes from the inversion of the two matrices \((\nabla ^{2}L_{1}(\mathbf {u}_{1}))^{-1}\) and \((\nabla ^{2}L_{2}(\mathbf {u}_{2}))^{-1}\) in (22) and (23) respectively, and one matrix inversion requires about \(O(l^{3})\) operations. Let the number of iteration steps be n (generally, \(n<10\)); then the computational complexity of this step is about \(O(2nl^{3})\). In addition, the proposed algorithm computes the wavelet transform based weighting matrix. A Db-3 wavelet with a length of 6 is used in this paper, so the complexity of the wavelet transform is less than 12l, which can be ignored in comparison with the cost of the matrix inversions. Step 5 also requires \(O(l)\) operations. Step 6 repeats the previous calculation. If we retain only the matrix inversions and ignore the other computations, the total computational complexity is about \(O(4nl^{3})\).
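
To give a sense of scale, the short Python sketch below compares the dominant inversion term \(4nl^{3}\) with the wavelet bound 12l. The values \(N_{t}=2\), \(N_{f}=32\) and \(n=10\) are hypothetical, and the counts are order-of-magnitude indications taken from the expressions above rather than exact operation counts.

```python
def training_size(n_t, n_f):
    """Number of training samples l = N_t * N_f."""
    return n_t * n_f


def inversion_cost(l, n_iter):
    """Dominant term 4 * n * l^3 from the matrix inversions."""
    return 4 * n_iter * l ** 3


def wavelet_cost_bound(l):
    """Upper bound 12 * l quoted for the length-6 Db-3 wavelet."""
    return 12 * l


# Hypothetical setting: 2 pilot symbols with 32 pilots each, 10 Newton steps.
l = training_size(n_t=2, n_f=32)
inv = inversion_cost(l, n_iter=10)
wav = wavelet_cost_bound(l)
ratio = inv / wav
```

Under these assumed values the inversion term exceeds the wavelet bound by several orders of magnitude, which is consistent with neglecting the wavelet cost in the total.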

4 Experimental results

In this section, we present simulation results for the proposed channel estimation algorithm based on the novel PWWTSVR. TSVR is effective for channel estimation, and its regression performance has been verified by the works of Charrada and Samet (2016) and Matilde et al. (2004). We therefore compare the performance of the proposed algorithm with the linear interpolation method and the TSVR (Peng 2010) based channel estimation algorithm. Consider an OFDM system with a doubly selective channel. The number of multipath components is \(L+1=6\), and the channel taps are assumed to be i.i.d. across paths and correlated in time according to Jakes’ model (Stuber 1996), with correlation function \(E[h(n_{1},l_{1})h^{*}(n_{2},l_{2})]=\sigma _{h}^{2}J_{0}(2\pi f_{\max }T_{s}(n_{1}-n_{2}))\delta (l_{1}-l_{2}),\) where \(E(\cdot )\) denotes the expected value, (\(\cdot\))\(^{*}\) denotes the complex conjugate, \(n_{k}\) is the time index, \(l_{k}\) is the channel path index, \(J_{0}\) is the zeroth-order Bessel function of the first kind, \(T_{s}\) is the sampling time interval, and \(\sigma _{h}^{2}\) denotes the variance of the channel. We consider \(N=64\) subcarriers and a CP of length \(L=5\). The sampling time interval is \(T_{s}=72\,\mu s\), the carrier frequency is \(f_{c}=2.15\) GHz, the mobile speeds are 120 and 350 km/h, and 16-QAM signaling is assumed.
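
The time correlation of Jakes’ model above can be evaluated numerically with the Python sketch below. It assumes the standard relation \(f_{\max }=vf_{c}/c\) for the maximum Doppler shift, which is not stated explicitly in the text, and it evaluates \(J_{0}\) by its power series, which is adequate for the small arguments that arise here. The function names are hypothetical.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def max_doppler_hz(speed_kmh, carrier_hz):
    """Maximum Doppler shift f_max = v * f_c / c (assumed standard relation)."""
    return speed_kmh / 3.6 * carrier_hz / C_LIGHT


def bessel_j0(x, terms=30):
    """Zeroth-order Bessel function of the first kind via its power series.

    J0(x) = sum_k (-1)^k (x/2)^(2k) / (k!)^2; suitable for moderate |x| only.
    """
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= -((x / 2.0) ** 2) / ((k + 1) ** 2)
    return total


def jakes_correlation(lag, f_max, t_s, sigma_h2=1.0):
    """sigma_h^2 * J0(2*pi*f_max*T_s*lag) for taps on the same path."""
    return sigma_h2 * bessel_j0(2.0 * math.pi * f_max * t_s * lag)


T_S = 72e-6   # sampling time interval from the simulation setup
F_C = 2.15e9  # carrier frequency from the simulation setup

f_max_120 = max_doppler_hz(120.0, F_C)
f_max_350 = max_doppler_hz(350.0, F_C)
rho_120 = jakes_correlation(1, f_max_120, T_S)
rho_350 = jakes_correlation(1, f_max_350, T_S)
```

Comparing `rho_120` with `rho_350` shows that the correlation between adjacent time samples decreases as the mobile speed increases, which is the time-selectivity that the estimator has to track.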

To demonstrate the effects of multipath and mobile speed on the channel frequency response, four scenarios are simulated. Figure 3 shows the channel frequency response at the subcarriers of an OFDM symbol for 2 and 5 multipath components and for mobile speeds of 0 and 350 km/h. From Fig. 3, we can see that multipath causes frequency-domain fading, and the larger the number of paths, the deeper the fading. At the same time, mobility causes ICI, and the faster the movement, the larger the ICI. According to (7), the ICI is influenced by the product of the channel response \(\mathbf {H}_{n}\) and the transmitted data \(\mathbf {X}\). Since the transmitted data are random, the ICI they cause behaves like noise, which is reflected in the simulations.

Fig. 3 Channel frequency response at subcarriers in an OFDM symbol under multipath number being 2

The computer simulations are implemented in a Matlab R2014a environment on a PC. Because the Gaussian kernel is an effective and frequently used kernel function in TSVR research, the Gaussian nonlinear kernel is adopted in this paper for both the proposed PWWTSVR and TSVR (Peng 2010), that is

$$\begin{aligned} K(\mathbf {x}^{T},\mathbf {y}^{T})=\exp \left(-\frac{\left\| \mathbf {x}-\mathbf {y} \right\| ^{2}}{\rho }\right), \end{aligned}$$
(27)

where \(\rho\) determines the width of the Gaussian function. The choice of parameters is essential for the performance of algorithms. In this paper, parameter values are chosen by the grid search method from the set of values \(\{10^{i}|i=-4,-3,\ldots ,5\}\).
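
A minimal Python sketch of the kernel in (27) and of the candidate set used in the grid search is given below. The helper names `gaussian_kernel`, `kernel_matrix` and `PARAM_GRID` are hypothetical, and the example inputs are assumed to be (time, frequency) pilot positions as described in Sect. 3.2.

```python
import math


def gaussian_kernel(x, y, rho):
    """K(x, y) = exp(-||x - y||^2 / rho), as in (27)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / rho)


def kernel_matrix(rows, cols, rho):
    """Matrix of kernel values between two sets of input points."""
    return [[gaussian_kernel(r, c, rho) for c in cols] for r in rows]


# Candidate parameter values {10^i | i = -4, -3, ..., 5} for the grid search.
PARAM_GRID = [10.0 ** i for i in range(-4, 6)]

# Hypothetical inputs: pilot positions on a 2 x 2 grid with spacing 2.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
k_mat = kernel_matrix(points, points, rho=10.0)
```

In a grid search, each value in `PARAM_GRID` would be tried for each tunable parameter and the combination with the best validation error retained; the selection criterion itself is not specified here and is left out of the sketch.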

Figure 4 illustrates the regression of PWWTSVR. In this simulation, pilots are inserted in the frequency and time domains with insertion intervals \(\Delta f=2\) and \(\Delta t=2\), at an SNR of 10 dB, with number of multipath \(L=5\) and mobile speed \(v=350\) km/h. We can see that the proposed PWWTSVR fits the channel frequency response while ignoring the outliers (pilot samples with too much noise), which shows the robustness of the proposed algorithm. The star points represent the channel response estimated at the pilot positions by the LS method, the solid line represents the interpolation obtained with the PWWTSVR algorithm, and the dashed line represents the perfect estimation result, which is not affected by noise in the simulation.

Fig. 4 Performance of PWWTSVR estimation method

The performance of channel estimation based on PWWTSVR and on the other methods is evaluated by the following criteria. The number of testing samples is denoted by l, \(y_{i}\) denotes the real value of a testing sample point \(x_{i}\), and \(\hat{y}_{i}\) denotes the predicted value of \(x_{i}\).

SNR: signal-to-noise ratio, defined as \({\text {SNR}}=10\log _{10}(\sigma _{x}^{2}/\sigma _{v}^{2})\) in dB, where the signal power is \(\sigma _{x}^{2}=E(\left| x(k)\right| ^{2})\) and \(\sigma _{v}^{2}\) is the variance of the AWGN.

SSE: sum squared error of the testing samples, defined as \({\text {SSE}}= \sum _{i=1}^{l}\left| y_{i}-\hat{y}_{i}\right| ^{2}\).

BER: bit error rate, defined as \({\text {BER}}=N_{e}/N\), where \(N_{e}\) and N are the number of erroneous bits and the total number of transmitted bits respectively.

SSE represents the fitting precision, but an SSE value that is too small may indicate overfitting, because the regressor is then fitting the noise.
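
The three criteria can be computed as in the following Python sketch, which follows the definitions above directly. The function names and the sample values are hypothetical.

```python
import math


def snr_db(signal_power, noise_variance):
    """SNR = 10 * log10(sigma_x^2 / sigma_v^2) in dB."""
    return 10.0 * math.log10(signal_power / noise_variance)


def sse(y_true, y_pred):
    """Sum squared error; abs() makes it valid for complex samples too."""
    return sum(abs(a - b) ** 2 for a, b in zip(y_true, y_pred))


def ber(tx_bits, rx_bits):
    """Fraction of received bits that differ from the transmitted bits."""
    if len(tx_bits) != len(rx_bits):
        raise ValueError("bit sequences must have the same length")
    errors = sum(1 for t, r in zip(tx_bits, rx_bits) if t != r)
    return errors / len(tx_bits)


# Hypothetical values for illustration.
example_snr = snr_db(10.0, 1.0)
example_sse = sse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_ber = ber([0, 1, 1, 0], [0, 1, 0, 0])
```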

To test the performance of different kernel functions, six kernel functions, including the Gaussian kernel adopted in the proposed method, are studied on the Sinc function (\(f(x)=\frac{\sin x}{x}\)). The kernel function expressions and the SSE performance are listed in Table 1. The Gaussian kernel achieves the best results.

Table 1 Comparisons of PWWTSVR on Sinc function with different kernel functions

Outliers may influence the regression. To test the robustness of the proposed algorithm to outliers, its performance is studied on the Sinc function polluted by additive Gaussian noise with zero mean and variance \(\varsigma ^{2}=0.1^{2}\). Figure 5 shows that the prediction performance of the proposed method is satisfactory.

Fig. 5 Regression performance of PWWTSVR on noised Sinc function with outliers

Figures 6, 7, 8 and 9 show the BER performance as a function of the SNR for different pilot spacings and mobile speeds. We compare the BER of the proposed algorithm with that of TSVR estimation (Peng 2014), linear interpolation estimation, K-nearest TSVR (Gupta 2017) and perfect estimation in the presence of additive Gaussian noise. Table 2 shows the SSE performance of the channel frequency response for the various algorithms. It can be seen that the proposed method outperforms the TSVR estimation and the linear interpolation estimation. The BER performance of both the proposed method and the TSVR estimation is better than that of the linear interpolation estimation, which indicates that SVR is suitable for regression problems based on training samples, especially in nonlinear cases. At the same time, the PWWTSVR estimation outperforms the TSVR estimation, which shows the effectiveness of introducing the weighting parameter \(\mathbf {D}\) in the objective functions.

The mobile speed v, the pilot spacing \(\Delta f\) in the frequency domain and the pilot spacing \(\Delta t\) in the time domain also affect the estimation performance. A lower mobile speed leads to better BER and SSE performance, and a smaller pilot spacing achieves better estimation quality at the cost of increased computational complexity. The simulation results listed in Table 2 show that the proposed method obtains the lowest SSE compared with the TSVR estimation and the linear interpolation estimation, which demonstrates its good fitting ability. Furthermore, we notice that there are special cases due to randomness. For example, for \(\Delta f=4\), \(\Delta t=4\) and SNR = 5 dB, the SSE for \(v=350\) km/h in Table 2 is less than that for \(v=120\) km/h.

In summary, the simulation results confirm that the proposed channel estimation algorithm performs better than the conventional methods, and that introducing the projection term and the weighting parameters calculated by the wavelet transform method improves the performance of the channel estimation.

Fig. 6 BER versus SNR for mobile speed \(v=120\) km/h, pilot spacing \(\Delta f=2\), \(\Delta t=2\)

Fig. 7 BER versus SNR for mobile speed \(v=120\) km/h, pilot spacing \(\Delta f=4\), \(\Delta t=4\)

Fig. 8 BER versus SNR for mobile speed \(v=350\) km/h, pilot spacing \(\Delta f=2\), \(\Delta t=2\)

Fig. 9 BER versus SNR for mobile speed \(v=350\) km/h, pilot spacing \(\Delta f=4\), \(\Delta t=4\)

Table 2 SSE values of channel response by three estimation methods

5 Conclusions

In this paper, a novel PWWTSVR based channel estimator is proposed. Pilots are inserted on a grid in the time and frequency domains of the OFDM system, so the channel estimation is performed in both domains using the grid-inserted pilots. First, the channel frequency response is calculated at the pilot positions, which provides the training samples for the PWWTSVR. Then the channel frequency response at the data positions is estimated by the proposed algorithm. Unlike in TSVR, samples at different positions in the proposed PWWTSVR model are given different weights according to the distance between the samples and the results preprocessed by the wavelet transform. Computational comparisons between PWWTSVR and other existing methods show the better performance of the PWWTSVR based channel estimation and demonstrate the effectiveness of the proposed method, especially in nonlinear cases. Furthermore, since one of the theoretical bases of the proposed method is wavelet theory, which is a powerful denoising tool for time series signals, the proposed method is suitable for dealing with time series datasets such as those arising in the estimation of channel parameters. Additionally, SVR is suitable for datasets with a small sample size, whereas a large number of training samples incurs a tremendous computational cost. Future work should study algorithms that insert a small proportion of pilots in the time and frequency domains, based on more accurate channel models.