Introduction

The deep integration of wireless communication technology, artificial intelligence (AI) and big data technology has led to the rapid development of information technology, driving the evolution of 5G towards 6G. In recent years, millimeter-wave massive multiple input multiple output (MIMO) wireless communication systems have attracted intense research interest in academia and industry as a promising technology, owing to their ability to support massive amounts of mobile data traffic and wireless connectivity [1,2,3]. To achieve a green and sustainable wireless communication network, reconfigurable intelligent surface (RIS) technology [4, 5] has emerged, which can be combined with millimeter-wave massive MIMO wireless communications to form a new RIS-assisted millimeter-wave massive MIMO system model with unique quasi-passive, software-programmable, easy-to-deploy, low-cost and low-energy characteristics. In [6], the authors considered a hybrid RIS-assisted MIMO-OFDM system operating in the millimeter-wave band and established a connection between RIS technology and millimeter-wave hybrid MIMO systems. In more recent RIS research, unlike the conventional RIS, [7] proposed an emerging scheme that integrates stacked intelligent metasurfaces (SIMs) with transceivers to support HMIMO communication for 6G and achieve considerable spatial gain, which also inspired the development of this work.

In vehicle-to-everything (V2X) millimeter-wave massive MIMO systems, millimeter waves face serious multipath effects and fading losses during transmission. Multipath effects lead to fading, signal fading limits the distance over which wireless signals can be transmitted, and reflection and refraction from large objects such as buildings are unavoidable [8]. Therefore, millimeter-wave communication is usually carried out using massive MIMO technology that transmits narrow beams. However, the transmission performance of a narrow beam depends on beam alignment, which requires frequent channel estimation, and it becomes more difficult to obtain accurate channel state information (CSI) and beam alignment in dynamic V2X scenarios. As an artificial electromagnetic surface structure with programmable electromagnetic properties developed from metamaterial technology, an RIS can be deployed on the surface of large buildings, roadside units and other objects in the wireless communication environment to build an intelligent programmable wireless environment, which is expected to break through the barriers of traditional wireless communication [9,10,11,12,13].

Recent works have discussed the potentials and challenges of RIS-assisted wireless communications (see, e.g., [9, 14, 15], and references therein). Among the several open issues, we highlight the acquisition of the transmit beamforming at the base station (BS) and the phase shifts on the RIS. In existing studies, the widely adopted communication metrics include achievable rate [16], spectral efficiency [17], and energy efficiency [18]. For reflecting RIS-assisted multiple input single output (MISO) systems, an alternating optimization method was proposed in [19] to jointly design locally optimal transmit beamforming at the BS and phase shifts on the RIS. In [20, 21], various optimization techniques were designed for a reflecting RIS-assisted MIMO system with only one legitimate receiver and one eavesdropper. To maximize the sum rate and energy efficiency, a zero-forcing (ZF) based algorithm employed at the BS was used in [22, 23] to obtain the phase shifts using stochastic gradient descent (SGD) search and sequential fractional programming, which in turn jointly optimized the transmit beamforming and phase shifts. Reference [24] divided the channel training phase into several periods and proposed a random configuration and a Euclidean distance maximization based configuration that can generate suitable RIS reflection coefficient training sets, then directly estimated the effective superposition channels of the direct links and all reflection links, constituting a low-complexity channel estimation and passive beamforming framework.

In wireless communications research over the past decade, tensor modeling has been successfully applied to solve various signal processing problems [25,26,27,28]. Tensor-based signal processing benefits from the powerful uniqueness properties of tensor decomposition while exploiting the multidimensional nature of the transmitted and received signals and of the communication channels. In [29], it is shown that the received signals in RIS-assisted MIMO communication systems can be written as a 3-way tensor admitting a canonical polyadic (CP) decomposition. In [30], the authors related the channel estimation problem for RIS-assisted MIMO systems to fitting the PARAFAC model to a 3-way tensor. They derived two simple pilot-assisted channel estimation algorithms (namely, KRF and BALS) and analytical expressions of the Cramér-Rao bound (CRB). Reference [31] extended the TRICE framework of [32] and proposed a TenRICE framework using the CP tensor decomposition method in the RIS-assisted millimeter-wave MIMO system. In [33], a double-RIS was deployed in a MIMO system, and the channel estimation and design problem was solved with an alternating least squares (ALS) approach using the tensor structure of the received signal. Exploiting channel sparsity in the angular and delay domains, a sparse Bayesian learning (SBL) model with tensor structure was introduced in [34] to reconstruct the complete cascade channel and derive the Cramér-Rao lower bound (CRLB).

Recently, researchers have argued that AI will be at the heart of future wireless communication systems (e.g., 6G and beyond) [35,36,37,38]. Its ability to solve mathematically intractable non-linear, non-convex problems, together with the explosive growth of massive data computation, has led scholars to use AI techniques to solve design and optimization problems in wireless communication systems. In [35, 36], deep learning (DL) was used to obtain the beamforming matrix for MIMO systems by establishing a mapping between the channel information and the precoding design. In addition, attention mechanisms are widely used in DL. An attention module can optimize the weighting of input features by minimizing recognition errors and enhancing important information. In [39], Woo et al. proposed a convolutional block attention module that implements channel attention and spatial attention to enhance the critical parts of the input features. Reference [40] proposed a time–frequency attention (TFA) mechanism that learns valuable channel, frequency and time features to improve automatic modulation recognition (AMR) performance. Deep reinforcement learning (DRL) has also been used in wireless communication systems to improve the learning speed of neural networks in DL [27, 29, 37,38,39,40]. Reference [38] maximized the SINR using DRL by jointly designing beamforming, power control, and interference coordination, which is a non-convex optimization problem. In [41], the transmit beamforming and reflection phase shifts were jointly designed via DRL to maximize the sum rate of MISO systems. To address the poor robustness of DRL to diverse channels and the excessive overhead of DRL's interaction with the actual environment, [42] proposed a DRL algorithm based on a location-aware imitation environmental network (IEN). Meanwhile, [43] provides further prospective considerations for multi-RIS-empowered wireless communication and leverages the independence between the system design parameter configuration and the future state of the wireless environment to propose an efficient multi-armed bandit approach for sum rate maximization. Unlike the millimeter-wave case, [44] uses multiple RISs in the terahertz band and applies DRL to counteract propagation losses, constituting a novel hybrid beamforming scheme for multi-hop RIS-assisted communication networks. In summary, DRL is very beneficial for wireless communication systems dealing with time variation. The remainder of this paper is organized as follows:

  1. In "System model and channel model", we introduce an RIS-assisted V2X millimeter-wave massive MIMO system utilizing a switchable RIS, in which a micro-controller controls the reflecting elements on the RIS depending on the wireless environment, and we give the system model and channel model.

  2. In "Tensor decomposition and problem formulation", CP tensor decomposition is utilized to link the system with the jointly optimized transmit beamforming at the BS and reflecting phase shifts on the RIS, and the optimization problem is given.

  3. In "Jointly beamforming based attention-DRL", an attention-DRL-based algorithm for jointly optimizing the transmit beamforming at the BS and the reflecting phase shifts on the RIS is proposed. A channel–frequency–time attention (CFTA) framework is proposed to extract critical temporal and frequency features in the channel, and a DRL-based algorithm is utilized for optimization.

  4. Numerical simulation results and conclusions are given in "Simulation results and analysis" and "Conclusion", respectively.

Notations: For a given matrix \({\mathbf{A}}\), \({\mathbf{A}}^{{\text{T}}}\) and \({\mathbf{A}}^{{\text{H}}}\) denote its transpose and conjugate transpose (Hermitian), respectively. The Khatri-Rao product is denoted by \(\diamondsuit\). Moreover, \({\text{diag}}(a)\) forms a diagonal matrix by placing the entries of the input vector \(a\) on its main diagonal. \({\mathbf{A}}^{(t)}\) is the value of \({\mathbf{A}}\) at time \(t\). \({\text{Re}} \{ {\mathbf{A}}\}\) and \({\text{Im}} \{ {\mathbf{A}}\}\) denote the real part and imaginary part of a complex matrix \({\mathbf{A}}\), respectively.

System model and channel model

In this section, the system model and channel models for the RIS-assisted V2X millimeter-wave massive MIMO system are introduced.

System model

In this system, we deploy a BS at the roadside, an RIS on the exterior surface of a tall building, and multiple vehicles, as shown in Fig. 1. The BS has \(N_{T}\) transmitting antennas and communicates with \(K\) vehicles, each of which has \(N_{R}\) receiving antennas, where \(N_{T} \ge N_{R}\), and the RIS has \(N_{S}\) passively reflecting elements and one micro-controller. It is assumed that the direct communication link from the BS to the vehicles suffers from severe signal blockage due to the wireless communication environment. Therefore, in this paper, we utilize the RIS as a relay, and the communication link of the system model consists only of a cascade channel from the BS to the RIS and from the RIS to the vehicles; the direct signal transmissions between the BS and the vehicles are assumed to be negligible. The RIS is essentially a passively reflecting array whose micro-controller controls the phase of the sub-arrays on the RIS depending on the wireless environment. We assume that the channel matrix is known for all vehicles.

Fig. 1

The considered RIS-assisted V2X millimeter-wave massive MIMO system, comprising a BS with \(N_{T}\) transmitting antennas that simultaneously communicates with \(K\) vehicles, each with \(N_{R}\) receiving antennas. The RIS, which is attached to a building’s facade, is equipped with \(N_{S}\) reflecting elements and one micro-controller

In a time-slotted transmission, we assume that the RIS adjusts its phase shifts as a function of the time index \(n = 1, \ldots ,T\). The received signal of the vehicles during the \(T\) time slots is formulated as

$$ y_{n} = {\mathbf{H}}_{{\mathbf{R}}} {\text{diag}}(s_{n} ){\mathbf{H}}_{{\mathbf{T}}} g_{n} x_{n} + w_{n} . $$
(1)

where \(y_{n} \in {\mathbb{C}}^{{N_{R} \times 1}}\) denotes the received signal at the time index \(n \in {\mathcal{N}}\), which is a positive integer. The phase shift vector at time \(n\) is defined as \(s_{n} \triangleq [s_{n}^{1} e^{{j\phi_{1,n} }} ,s_{n}^{2} e^{{j\phi_{2,n} }} ,...,s_{n}^{{N_{S} }} e^{{j\phi_{{N_{S} ,n}} }} ]^{\text{T}} \in {\mathbb{C}}^{{N_{S} \times 1}}\), where \(\phi_{{n_{S} ,n}} \in (0,2\pi ]\) is the phase shift of the \(n_{S}\)-th element and \(s_{n}^{{n_{S} }} \in \{ 0,1\}\) controls the ON–OFF state of the corresponding element of the RIS at time \(n\). \(x_{n}\) denotes the vector embodying the transmitted pilot symbol at time \(n\). \(g_{n} \in {\mathbb{C}}^{{N_{T} \times 1}}\) is the beamforming vector applied at the BS. \(w_{n}\) denotes the complex additive white Gaussian noise (AWGN) with zero mean and entries of variance \(\sigma_{n}^{2}\). The channel matrix from the BS to the RIS is denoted by \({\mathbf{H}}_{{\mathbf{T}}} \in {\mathbb{C}}^{{N_{S} \times N_{T} }}\), and the channel matrix from the RIS to the sensing target vehicles is denoted by \({\mathbf{H}}_{{\mathbf{R}}} \in {\mathbb{C}}^{{N_{R} \times N_{S} }}\).
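As an illustration of Eq. (1), the following minimal NumPy sketch generates one received vector. It rests on several assumptions that are not part of the system model above: the channels are placeholder i.i.d. complex Gaussian matrices rather than the models introduced later, the pilot \(x_{n}\) is taken as a scalar, and the helper names `received_signal` and `random_channel` as well as all sizes are hypothetical.

```python
import numpy as np

def received_signal(H_R, H_T, s_n, g_n, x_n, noise_var, rng):
    """Sketch of Eq. (1): y_n = H_R diag(s_n) H_T g_n x_n + w_n."""
    n_r = H_R.shape[0]
    # Complex AWGN whose entries have zero mean and variance noise_var.
    w_n = np.sqrt(noise_var / 2.0) * (
        rng.standard_normal(n_r) + 1j * rng.standard_normal(n_r)
    )
    # diag(s_n) @ v is the same as the element-wise product s_n * v.
    return H_R @ (s_n * (H_T @ (g_n * x_n))) + w_n

def random_channel(rows, cols, rng):
    """Placeholder i.i.d. complex Gaussian channel (assumption, not Eqs. (4)-(5))."""
    return (rng.standard_normal((rows, cols))
            + 1j * rng.standard_normal((rows, cols))) / np.sqrt(2.0)

rng = np.random.default_rng(0)
N_T, N_S, N_R = 8, 16, 4                      # illustrative sizes only
H_T = random_channel(N_S, N_T, rng)           # BS -> RIS
H_R = random_channel(N_R, N_S, rng)           # RIS -> vehicle
on_off = rng.integers(0, 2, size=N_S)         # ON-OFF state of each element
phases = rng.uniform(0.0, 2.0 * np.pi, N_S)   # phase shift of each element
s_n = on_off * np.exp(1j * phases)
g_n = np.ones(N_T, dtype=complex) / np.sqrt(N_T)
x_n = 1.0 + 0.0j                              # scalar pilot symbol (assumption)
y_n = received_signal(H_R, H_T, s_n, g_n, x_n, noise_var=0.01, rng=rng)
```

Elements whose ON–OFF state is 0 contribute nothing to the cascade, which is why the element-wise product with `s_n` is sufficient here.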

Different from the perfectly stable block fading model, time-varying fading channels are considered, where the fading coefficient varies from symbol to symbol. Then the received signal at time \(n\) of the \(k^{{{\text{th}}}}\) vehicle is represented as

$$ y_{k,n} = {\mathbf{H}}_{{k,{\mathbf{R}}}} {\text{diag}}(s_{k,n} ){\mathbf{H}}_{{\mathbf{T}}} g_{k,n} x_{k,n} + w_{k,n} , $$
(2)

where \(y_{k,n}\) denotes the received signal at the \(k^{{{\text{th}}}}\) vehicle and \({\mathbf{H}}_{{k,{\mathbf{R}}}}\) denotes the channel matrix from the RIS to the \(k^{{{\text{th}}}}\) vehicle.

To facilitate the optimization, we stack \(\{ y_{k,n} ,n = 1,...,T\}\) into a matrix. The received signals can then be written as

$$ {\mathbf{Y}}_{k} = {\mathbf{H}}_{{k,{\mathbf{R}}}} D_{k} ({\mathbf{S}}){\mathbf{H}}_{{\mathbf{T}}} {\mathbf{GX}}_{k} + {\mathbf{W}}_{k} . $$
(3)

where \({\mathbf{Y}}_{k} = [y_{k,1} ,y_{k,2} , \ldots ,y_{k,n} ,...,y_{k,T} ] \in {\mathbb{C}}^{{N_{R} \times T}}\), \(D_{k} ({\mathbf{S}}) \triangleq {\text{diag}}(s_{k,n} )\) denotes a diagonal matrix built from the RIS phase shift matrix \({\mathbf{S}}\), and \({\mathbf{S}} \triangleq [s_{k,1} ,s_{k,2} , \ldots ,s_{k,n} ,...,s_{k,T} ] \in {\mathbb{C}}^{{N_{S} \times T}}\). The pilot symbol matrix is \({\mathbf{X}} \triangleq [x_{k,1} ,x_{k,2} , \ldots ,x_{k,n} ,...,x_{k,T} ]^{\text{T}} \in {\mathbb{C}}^{{T \times N_{T} }}\), \({\mathbf{W}}_{k} \triangleq [w_{k,1} ,w_{k,2} , \ldots ,w_{k,n} ,...,w_{k,T} ] \in {\mathbb{C}}^{{N_{R} \times T}}\) denotes the AWGN matrix, and the beamforming matrix is \({\mathbf{G}} \in {\mathbb{C}}^{{N_{T} \times T}}\).

Channel model

Due to the high mobility of vehicles in the RIS-assisted V2X millimeter-wave massive MIMO system, the fading components are distinct for each symbol index \(n\), and the Doppler frequency determines the correlation function across different symbols in the time domain. The challenging aspect of the time-varying system model considered here is that the coherence time is confined to the duration of a single symbol [45, 46]. As shown in Fig. 1, the BS, RIS and vehicles are each marked with 3D coordinates, denoted by \((x_{B} ,y_{B} ,z_{B} )\), \((x_{R} ,y_{R} ,z_{R} )\) and \((x_{V} ,y_{V} ,z_{V} )\), respectively.

In this subsection, we suppose that the cascade link from the BS to the RIS and from the RIS to the vehicles contains only a line-of-sight (LoS) link. In addition, the frequency offset of the LoS path is induced by the movement of the vehicles. Then, the channel models \({\mathbf{H}}_{{\mathbf{T}}}\) from the BS to the RIS and \({\mathbf{H}}_{{\mathbf{R}}}\) from the RIS to the vehicles can be formulated separately as

$$ {\mathbf{H}}_{{\mathbf{T}}} = \frac{1}{{\sqrt {L_{T} } }}\sum\limits_{l = 1}^{{L_{T} }} {g_{T}^{l} \exp (j2\pi \Delta f_{T} )a_{r} (\theta^{t} ,\varphi^{t} )} , $$
(4)
$$ {\mathbf{H}}_{{\mathbf{R}}} = \frac{1}{{\sqrt {L_{R} } }}\sum\limits_{l = 1}^{{L_{R} }} {g_{R}^{l} \exp (j2\pi \Delta f_{R} )a_{r} (\theta^{r} ,\varphi^{r} )^{\text{H}} } . $$
(5)

where \(L_{T}\) and \(L_{R}\) denote the total numbers of paths from the BS to the RIS and from the RIS to the vehicles, respectively. \(g_{T}^{l}\) and \(g_{R}^{l}\) denote the LoS path gains of the \(l^{th}\) path of these channels, and \(\Delta f_{T}\) and \(\Delta f_{R}\) denote the frequency offsets of \({\mathbf{H}}_{{\mathbf{T}}}\) and \({\mathbf{H}}_{{\mathbf{R}}}\), respectively, which can be formulated as \(\Delta f_{T} = f_{d} \cos (\phi_{0}^{T} )\) and \(\Delta f_{R} = f_{d} \cos (\phi_{0}^{R} )\). The normalized maximum Doppler frequency of the cascade link is given by \(f_{d} = \frac{{vf_{c} }}{{cf_{s} }}\), where \(v\) represents the velocity of the sensing target vehicles, \(c\) represents the speed of light, and \(f_{c}\) and \(f_{s}\) represent the carrier frequency and the symbol rate, respectively. The angles between the LoS path and the direction of movement of the vehicles are evaluated by \(\phi_{0}^{T} = \arctan \frac{{y_{B} - y_{R} }}{{x_{B} - x_{R} }}\) and \(\phi_{0}^{R} = \arctan \frac{{y_{R} - y_{V} }}{{x_{R} - x_{V} }}\), respectively.
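To make the Doppler-related quantities concrete, the short sketch below evaluates \(f_{d}\), \(\Delta f_{T}\) and \(\Delta f_{R}\) for one set of values. The coordinates, the velocity, the 28 GHz carrier and the symbol rate are illustrative assumptions, the function names are hypothetical, and the speed of light is rounded to \(3 \times 10^{8}\) m/s.

```python
import math

SPEED_OF_LIGHT = 3.0e8  # m/s (rounded value)

def normalized_doppler(v, f_c, f_s, c=SPEED_OF_LIGHT):
    """Normalized maximum Doppler frequency f_d = v * f_c / (c * f_s)."""
    return v * f_c / (c * f_s)

def los_angle(p_a, p_b):
    """phi_0 = arctan((y_a - y_b) / (x_a - x_b)) from the x and y coordinates."""
    dx, dy = p_a[0] - p_b[0], p_a[1] - p_b[1]
    if dx == 0:
        # Limit of the arctan; cos(+-pi/2) is 0 either way, and this avoids 1/0.
        return math.pi / 2
    return math.atan(dy / dx)

def frequency_offset(f_d, p_a, p_b):
    """Delta f = f_d * cos(phi_0)."""
    return f_d * math.cos(los_angle(p_a, p_b))

# Illustrative 3D coordinates (x, y, z) in meters.
bs, ris, vehicle = (0.0, 0.0, 10.0), (10.0, 10.0, 20.0), (40.0, 0.0, 1.5)
f_d = normalized_doppler(v=30.0, f_c=28e9, f_s=1e6)
delta_f_T = frequency_offset(f_d, bs, ris)       # uses (B - R) differences
delta_f_R = frequency_offset(f_d, ris, vehicle)  # uses (R - V) differences
```

Under these assumed values, \(f_{d}\) is on the order of \(10^{-3}\), so the phase rotation per symbol is small but not negligible.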

Note that the azimuth and elevation AoAs are evaluated from the 3D coordinates for the BS-to-RIS channel matrix \({\mathbf{H}}_{{\mathbf{T}}}\) as \(\theta^{t} = \arccos \frac{{z_{B} - z_{R} }}{{d^{BR} }}\) and \(\varphi^{t} = \arctan \frac{{y_{B} - y_{R} }}{{x_{B} - x_{R} }}\), where the distance from the BS to the RIS is \(d^{BR} = \sqrt {(x_{B} - x_{R} )^{2} + (y_{B} - y_{R} )^{2} + (z_{B} - z_{R} )^{2} }\). The azimuth and elevation AoDs are evaluated from the 3D coordinates for the RIS-to-vehicle channel matrix \({\mathbf{H}}_{{\mathbf{R}}}\) as \(\theta^{r} = \arccos \frac{{z_{V} - z_{R} }}{{d^{RV} }}\) and \(\varphi^{r} = \arctan \frac{{y_{V} - y_{R} }}{{x_{V} - x_{R} }}\), where the distance from the RIS to the vehicles is \(d^{RV} = \sqrt {(x_{V} - x_{R} )^{2} + (y_{V} - y_{R} )^{2} + (z_{V} - z_{R} )^{2} }\).
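The angle and distance expressions above can be evaluated directly from the coordinates. The following sketch does so for illustrative positions, assuming the two points are distinct; the helper name `angles_and_distance` and the coordinate values are hypothetical.

```python
import math

def angles_and_distance(p_a, p_b):
    """Return (theta, phi, d) computed from the coordinate differences p_a - p_b.

    theta = arccos(dz / d), phi = arctan(dy / dx),
    d = sqrt(dx^2 + dy^2 + dz^2). Assumes p_a != p_b.
    """
    dx, dy, dz = (a - b for a, b in zip(p_a, p_b))
    d = math.sqrt(dx * dx + dy * dy + dz * dz)
    theta = math.acos(dz / d)
    # Guard the arctan against a zero denominator.
    phi = math.atan(dy / dx) if dx != 0 else math.pi / 2
    return theta, phi, d

# Illustrative 3D coordinates (x, y, z) in meters.
bs, ris, vehicle = (0.0, 0.0, 10.0), (10.0, 10.0, 20.0), (40.0, 0.0, 1.5)
theta_t, phi_t, d_br = angles_and_distance(bs, ris)       # (B - R) differences
theta_r, phi_r, d_rv = angles_and_distance(vehicle, ris)  # (V - R) differences
```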

Furthermore, the RIS is composed of \(N_{S}\) passively reflecting elements arranged as a uniform planar array (UPA) with \(N_{v}\) elements horizontally and \(N_{h}\) elements vertically, where \(N_{S} = N_{v} \cdot N_{h}\). The array response vector is formulated as Eq. (6), where \(1 \le n_{v} \le (N_{v} - 1)\), \(1 \le n_{h} \le (N_{h} - 1)\), the wavelength is \(\lambda = \frac{c}{{f_{c} }}\), and the antenna spacing is \(d = \frac{\lambda }{2}\).

Tensor decomposition and problem formulation

In this section, tensor decomposition is used to perform a four-dimensional decomposition over the transmit-antenna spatial characteristics, the receive-antenna spatial characteristics, the temporal characteristics and the frequency characteristics of the BS, RIS and vehicles, and the joint optimization problem of the beamforming matrix at the BS and the phase shift matrix on the RIS is formulated.

$$ \begin{gathered} \alpha_{UPA} (\theta ,\varphi ) \hfill\\ \quad = \left[ {1,...,\exp \left\{ {j\frac{2\pi }{\lambda }d[n_{v} \sin \theta \cos \varphi + n_{h} \sin \varphi ]} \right\},...,} \right. \\ \quad \left. {\exp \left\{ {j\frac{2\pi }{\lambda }d[(N_{v} - 1)\sin \theta \cos \varphi + (N_{h} - 1)\sin \varphi ]} \right\}} \right]^{{\text{T}}} . \\ \end{gathered} $$
(6)
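A minimal sketch of the UPA array response in Eq. (6) is given below. The ordering of the \(N_{v} N_{h}\) entries inside the vector is an assumption (here the \(n_{h}\) index varies fastest), and the function name `upa_response` is hypothetical. The indices start at 0 so that the first entry equals 1, as in Eq. (6).

```python
import numpy as np

def upa_response(theta, phi, n_v, n_h, d_over_lambda=0.5):
    """UPA response with entries exp{j 2*pi*(d/lambda)*[n_v sin(theta)cos(phi) + n_h sin(phi)]}."""
    idx_v = np.arange(n_v)[:, None]   # 0, ..., N_v - 1
    idx_h = np.arange(n_h)[None, :]   # 0, ..., N_h - 1
    phase = 2.0 * np.pi * d_over_lambda * (
        idx_v * np.sin(theta) * np.cos(phi) + idx_h * np.sin(phi)
    )
    # Flatten the N_v x N_h grid into a length N_v * N_h vector (ordering assumed).
    return np.exp(1j * phase).reshape(-1)

a = upa_response(theta=np.pi / 3, phi=np.pi / 6, n_v=4, n_h=4)
```

With half-wavelength spacing, `d_over_lambda` equals 0.5, so the wavelength itself never has to be computed.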

Tensor representation of the received signal

To simplify the exposition of the received signal, we reformulate Eq. (3) as follows. Defining \({\mathbf{Z}} \triangleq {\mathbf{GX}}_{k} \in {\mathbb{C}}^{{T \times N_{S} }}\), the received signal of the vehicles can be rewritten as

$$ {\mathbf{Y}}_{k} = {\mathbf{H}}_{{k,{\mathbf{R}}}} D_{k} ({\mathbf{S}}){\mathbf{H}}_{{\mathbf{T}}} {\mathbf{GX}}_{k} + {\mathbf{W}}_{k} , $$
(7)

Then, according to the proof in [47], the received signals of the vehicles \({\mathbf{Y}}_{k}\) have a tensor structure and can be recast as a 4-way tensor \({\mathbf{\mathcal{Y}}} \in {\mathbb{C}}^{{N_{R} \times N_{T} \times T \times K}}\). Admitting a canonical polyadic (CP) decomposition [48] and using the definition of the outer product, the \((n_{R} ,n_{T} ,n,k) {-} {\text{th}}\) entry of the noise-free received signal tensor \({\overline{\mathbf{\mathcal{Y}}}}\) can be approximated as

$$ \overline{y}^{{n_{R} ,n_{T} ,n,k}} \approx \sum\limits_{{n_{S} = 1}}^{{N_{S} }} {h_{{k{\text{,R}}}}^{{n_{R} ,n_{S} }} h_{{\text{T}}}^{{n_{T} ,n_{S} }} z^{{n,n_{S} }} s^{{k,n_{S} }} } , $$
(8)

where \(h_{{k{\text{,R}}}}^{{n_{R} ,n_{S} }} \triangleq [{\mathbf{H}}_{{k, {\mathbf{R}}}} ]^{{n_{R} ,n_{S} }}\), \(h_{{\text{T}}}^{{n_{T} ,n_{S} }} \triangleq [{\mathbf{H}}_{{\mathbf{T}}} ]^{{n_{T} ,n_{S} }}\), \(z^{{n,n_{S} }} \triangleq [{\mathbf{Z}}]^{{n,n_{S} }}\) and \(s^{{k,n_{S} }} \triangleq [{\mathbf{S}}]^{{k,n_{S} }}\). The tensor can be written in the compact notation \({\overline{\mathbf{\mathcal{Y}}}} \approx [{\mathbf{H}}_{{k,{\mathbf{R}}}} ,{\mathbf{H}}_{{\mathbf{T}}} ,{\mathbf{Z}},{\mathbf{S}}]\). Consequently, the noise-free CP decomposition tensor representation is

$$ {\overline{\mathbf{\mathcal{Y}}}} = \Im_{4,M} \times_{1} {\mathbf{H}}_{{k,{\mathbf{R}}}} \times_{2} {\mathbf{H}}_{{\mathbf{T}}} \times_{3} {\mathbf{Z}} \times_{4} {\mathbf{S}}, $$
(9)

where \(\Im_{4,M} \in {\mathbb{C}}^{{N_{S} \times N_{S} \times N_{S} \times N_{S} }}\) is the 4-way super-diagonal tensor.

Adding noise to \({\overline{\mathbf{\mathcal{Y}}}}\), the tensor form of the complete received signal can be expressed as

$$ {\mathbf{\mathcal{Y}}} = \Im_{4,M} \times_{1} {\mathbf{H}}_{{k,{\mathbf{R}}}} \times_{2} {\mathbf{H}}_{{\mathbf{T}}} \times_{3} {\mathbf{Z}} \times_{4} {\mathbf{S}} + \mathbf{\mathcal{W} } $$
(10)

where \(\mathbf{\mathcal{W}}\) is the AWGN tensor.

Exploiting the structure of the slices of the CP tensor decomposition, the \(m - {\text{mode}}\) unfoldings of the received signal tensor \({\mathbf{\mathcal{Y}}}\), where \(m \in \{ 1,2,3,4\}\), can be written in the following four matrix forms [49, 50]:

$$ [{\mathbf{\mathcal{Y}}}]_{(1)} = {\mathbf{H}}_{{k,{\mathbf{R}}}} ({\mathbf{S}}\, {\diamondsuit }\,{\mathbf{Z}}\,\diamondsuit \,{\mathbf{H}}_{{\mathbf{T}}} )^{\text{T}} + [\mathbf{\mathcal{W}}]_{(1)} , $$
(11)
$$ [{\mathbf{\mathcal{Y}}}]_{(2)} = {\mathbf{H}}_{{\mathbf{T}}} ({\mathbf{S}}\, {\diamondsuit }\,{\mathbf{Z}}\,\diamondsuit \,{\mathbf{H}}_{{k,{\mathbf{R}}}} )^{\text{T}} + [\mathbf{\mathcal{W}}]_{(2)} , $$
(12)
$$ [{\mathbf{\mathcal{Y}}}]_{(3)} = {\mathbf{Z}}({\mathbf{S}}\, {\diamondsuit }{\mathbf{H}}_{{\mathbf{T}}} \,\diamondsuit \,{\mathbf{H}}_{{k,{\mathbf{R}}}} )^{\text{T}} + [\mathbf{\mathcal{W}}]_{(3)} , $$
(13)
$$ [{\mathbf{\mathcal{Y}}}]_{(4)} = {\mathbf{S}}({\mathbf{Z}}\,\diamondsuit \,{\mathbf{H}}_{{\mathbf{T}}} \,\diamondsuit \,{\mathbf{H}}_{{k,{\mathbf{R}}}} )^{\text{T}} + [\mathbf{\mathcal{W}}]_{(4)} . $$
(14)

Here, \([{\mathbf{\mathcal{Y}}}]_{(m)}\) and \([\mathbf{\mathcal{W}}]_{(m)}\), for \(m \in \{ 1,2,3,4\}\), denote the \(m{\text{ - mode}}\) unfoldings of the received signal tensor and of the AWGN tensor, respectively. The alternating least squares (ALS) technique is usually used for channel estimation after the tensor decomposition of the received signal, i.e., to estimate the factor matrices of the tensor.
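The relationship between the entry-wise CP model in Eq. (8) and the unfoldings in Eqs. (11)–(14) can be checked numerically. The sketch below builds a noise-free 4-way tensor from four random factor matrices, indexed as in Eq. (8) with \(N_{S}\) columns each, and compares every unfolding with the corresponding Khatri-Rao expression. The random factors stand in for \({\mathbf{H}}_{{k,{\mathbf{R}}}}\), \({\mathbf{H}}_{{\mathbf{T}}}\), \({\mathbf{Z}}\) and \({\mathbf{S}}\); the sizes, the helper names and the column ordering chosen in `unfold` are assumptions made for illustration.

```python
import numpy as np
from functools import reduce

def khatri_rao(X, Y):
    """Column-wise Kronecker product of X (p x R) and Y (q x R), giving (p*q x R)."""
    return (X[:, None, :] * Y[None, :, :]).reshape(-1, X.shape[1])

def unfold(tensor, mode):
    """Mode unfolding whose column order matches the Khatri-Rao order of Eqs. (11)-(14)."""
    others = [m for m in range(tensor.ndim - 1, -1, -1) if m != mode]
    return np.transpose(tensor, [mode] + others).reshape(tensor.shape[mode], -1)

rng = np.random.default_rng(1)
N_R, N_T, T, K, N_S = 3, 4, 5, 2, 6   # illustrative sizes only

def random_factor(rows):
    """Random complex factor matrix with N_S columns (placeholder values)."""
    return rng.standard_normal((rows, N_S)) + 1j * rng.standard_normal((rows, N_S))

# Factor order follows Eq. (8): receive, transmit, time, vehicle.
factors = [random_factor(N_R), random_factor(N_T), random_factor(T), random_factor(K)]

# Entry-wise CP model of Eq. (8): sum over the N_S RIS elements.
Y_bar = np.einsum("ir,jr,nr,kr->ijnk", *factors)

max_err = 0.0
for mode in range(4):
    # Remaining factors in descending mode order, as in Eqs. (11)-(14).
    others = [factors[m] for m in range(3, -1, -1) if m != mode]
    model = factors[mode] @ reduce(khatri_rao, others).T
    max_err = max(max_err, float(np.max(np.abs(unfold(Y_bar, mode) - model))))
```

After the loop, `max_err` is at the level of floating-point round-off, which confirms that the four matrix forms describe the same tensor.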

In this paper, the problem we want to solve is that the beam from the BS is difficult to align with dynamic vehicles in urban road environments. Previous works [20, 51, 52] typically use alternating optimization to maximize or minimize the objective function: in each iteration, \({\mathbf{S}}\) is first fixed to solve for a suboptimal \({\mathbf{G}}\), and the resulting \({\mathbf{G}}\) is then fixed to derive \({\mathbf{S}}\), until the algorithm converges.

Nevertheless, the ALS technique is mainly advantageous for optimization problems with small sample sizes. For optimization problems with large sample sizes, it is more advantageous to adopt gradient descent. In the following subsection, we describe the optimization problem to be solved in this paper.

Problem formulation

To address the problem described in "Tensor representation of the received signal", following [53], we minimize the upper bound on the bit error rate (BER) by jointly optimizing the beamforming matrix \({\mathbf{G}}\) and the RIS phase shift matrix \({\mathbf{S}}\). As seen from (4), the RIS-assisted V2X millimeter-wave massive MIMO system is equivalent to adding a mirror to a conventional millimeter-wave massive MIMO system. When the communication link is interrupted by the complexity of the wireless communication environment, the role of the RIS is to replace the direct link between the BS and the vehicles with a cascade link from the BS to the RIS and from the RIS to the vehicles, and to passively reflect the signal by modulating the phase of the reflecting elements via the micro-controller attached to the RIS. Moreover, as the RIS only reflects passively, no AWGN is introduced by the RIS reflections; the AWGN is only present in the wireless communication environment. Because communication over the cascade links is more complicated than over the direct links and can be more prone to significant signal degradation, a power constraint needs to be considered so that the maximum average transmit power at the BS is respected. Assuming that \({\mathbf{Z}}\) satisfies the maximum average transmit power, the power constraint is denoted as

$$ \begin{aligned} {{\mathbb{E}} }[\left\| {\mathbf{Z}} \right\|_{2}^{2} ] & = {{\mathbb{E}} }[\text{tr}({\mathbf{ZZ}}^{\text{H}} )] \\ & = {{\mathbb{E}} }[\text{tr}({\mathbf{GXX}}^{\text{H}} {\mathbf{G}}^{\text{H}} )] \le P_{t} , \\ \end{aligned} $$
(15)

Without loss of generality, assuming that the pilot symbol matrix \({\mathbf{X}}\) is normalized, i.e., \({{\mathbb{E}} }[\left\| {\mathbf{X}} \right\|_{2}^{2} ] = 1\), Eq. (15) can be rewritten as

$$ {{\mathbb{E}} }[\left\| {\mathbf{Z}} \right\|_{2}^{2} ] = {{\mathbb{E}} }[\text{tr}({\mathbf{GG}}^{\text{H}} )] \le P_{t} . $$
(16)
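One simple way to keep a candidate beamforming matrix within the power budget of Eq. (16) is to rescale it whenever \(\text{tr}({\mathbf{GG}}^{\text{H}} )\) exceeds \(P_{t}\). The sketch below illustrates this rescaling; it is an assumption made for illustration and is not claimed to be how the constraint is enforced in the proposed algorithm. The function names are hypothetical.

```python
import numpy as np

def transmit_power(G):
    """tr(G G^H), which equals the squared Frobenius norm of G."""
    return float(np.real(np.trace(G @ G.conj().T)))

def scale_to_power_budget(G, p_t):
    """Return G unchanged if tr(G G^H) <= p_t, otherwise rescale it onto the budget."""
    power = transmit_power(G)
    if power <= p_t:
        return G
    return G * np.sqrt(p_t / power)

rng = np.random.default_rng(2)
G = rng.standard_normal((8, 10)) + 1j * rng.standard_normal((8, 10))
G_feasible = scale_to_power_budget(G, p_t=1.0)
```

The rescaling only changes the magnitude of \({\mathbf{G}}\), so the beam directions encoded in its columns are preserved.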

Following [53], we adopt the minimization of the upper bound on the BER as our optimization objective, and the upper bound on the BER of the whole system can be described as

$$ P_{e} \le \overline{P}_{e} = \frac{1}{Lu}\sum\limits_{l = 1}^{L} {\sum\limits_{{\hat{l} = 1,\hat{l} \ne l}}^{L} {D_{{{\text{HD}}}} P_{{{\text{PEP}}}} } } . $$
(17)

where \(D_{{{\text{HD}}}}\) indicates the Hamming distance between the two received signals and \(P_{{{\text{PEP}}}}\) indicates the pairwise error probability, which can be represented by \(P_{{{\text{PEP}}}} = Q(\sqrt {\frac{{D_{{{\text{ED}}}}^{2} }}{{2\sigma^{2} }}} )\). Here, \(D_{{{\text{ED}}}}^{2}\) indicates the squared Euclidean distance between two noise-free received signal vectors and is given in Eq. (18). The objective of jointly optimizing \({\mathbf{G}}\) and \({\mathbf{S}}\) is to make the noise-free received signal vectors at the receiver as far apart as possible.
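The structure of the bound in Eq. (17) can be illustrated with a small numerical sketch. Since \(L\) and \(u\) are not defined in this section, the code assumes that \(L\) is the number of candidate noise-free received vectors and \(u\) is the number of bits carried by each candidate; the bit labels stand in for the mapping rule, and the function names are hypothetical. The Q-function is evaluated through the complementary error function as \(Q(x) = \frac{1}{2}{\text{erfc}}(x/\sqrt 2 )\).

```python
import math
import numpy as np

def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def ber_upper_bound(signals, labels, noise_var, u):
    """Union-type bound of Eq. (17): (1 / (L u)) * sum_{l} sum_{l' != l} D_HD * P_PEP.

    signals: list of noise-free received vectors (one per candidate).
    labels:  list of integer bit labels (assumed mapping rule).
    u:       bits per candidate (assumed meaning of u).
    """
    L = len(signals)
    total = 0.0
    for l in range(L):
        for m in range(L):
            if l == m:
                continue
            d_hd = bin(labels[l] ^ labels[m]).count("1")  # Hamming distance
            d_ed2 = float(np.sum(np.abs(signals[l] - signals[m]) ** 2))
            total += d_hd * q_function(math.sqrt(d_ed2 / (2.0 * noise_var)))
    return total / (L * u)

# Two antipodal candidates carrying one bit each (toy example).
signals = [np.array([1.0 + 0.0j]), np.array([-1.0 + 0.0j])]
labels = [0, 1]
bound = ber_upper_bound(signals, labels, noise_var=1.0, u=1)
```

In this toy case the bound reduces to a single Q-function term, and moving the two candidates further apart lowers it, which is the intuition behind maximizing the Euclidean distance.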

In this paper, our goal is to make the signals received by the receiver as separable as possible under the condition of perfect CSI. From (17), the upper bound on the BER is determined jointly by the Hamming distance \(D_{{{\text{HD}}}}\) and the Euclidean distance \(D_{{{\text{ED}}}}\). The Hamming distance \(D_{{{\text{HD}}}}\) between two received signals is determined by the mapping rule, i.e., \(\Gamma\), while, as can be seen from (18), the Euclidean distances \(D_{{{\text{ED}}}}\) are determined jointly by the phase shift matrix of the RIS, the beamforming matrix and the pilot symbol. Based on this, we adopt the deep deterministic policy gradient (DDPG) method in DRL to find the optimal \({\mathbf{G}}\) and \({\mathbf{S}}\) by minimizing the upper bound on the BER. Hence, the optimization problem can be formulated as

$$ \begin{aligned} \mathop {D_{{{\text{ED}}}}^{2} }\limits_{\min }& = \left\| {y^{{n_{R} ,n_{T} ,n,k}} - \hat{y}^{{n_{R} ,n_{T} ,n,k}} } \right\|_{2}^{2} \\ &= \left\| {{\mathbf{H}}_{{k,{\mathbf{R}}}} {\text{diag}}(s_{k,n} ){\mathbf{H}}_{{\mathbf{T}}} g_{k,n} x_{k,n}}\right. \\& \quad \left.{ - {\mathbf{H}}_{{\hat{k},{\mathbf{R}}}} {\text{diag}}(s_{{\hat{k},\hat{n}}} ){\mathbf{H}}_{{\mathbf{T}}} g_{{\hat{k},\hat{n}}} x_{{\hat{k},\hat{n}}} } \right\|_{2}^{2} . \\ \end{aligned} $$
(18)
$$\begin{aligned} & ({\mathbf{P1}}): \quad {\text{given}}: \quad \quad {\mathbf{H}}_{{\mathbf{T}}} ,{\mathbf{H}}_{{k,{\mathbf{R}}}} ,{\mathbf{X}}_{k} ,\Gamma ,k = 1, \cdots ,K \\ & \qquad \qquad {\text{find}}: \quad {\mathbf{G}}_{o} ,{\mathbf{S}}_{o} \\ & \qquad \qquad \min :\quad \quad \bar{P}_{e} \\ & \qquad \qquad {\text{s.t.}}: \qquad \text{tr}({\mathbf{GG}}^{\text{H}} ) \le P_{t} \\ & \quad \quad \quad \quad \{ |s_{{k,n}}^{{n_{S} }} | = 1\} .\forall n_{S} = 1,2,...,N_{S} \\ \end{aligned} $$

Jointly beamforming based attention-DRL

The problem \(({\mathbf{P1}})\) is non-convex, and it is usually challenging to obtain an optimal solution. To address this difficulty, an attention-DRL-based joint beamforming algorithm is proposed in this section to determine the beamforming search directions more accurately. Crucial temporal and frequency features in the channel are extracted using an attention mechanism, and the transmit beamforming and phase shifts are jointly optimized using a DRL model.

Channel–frequency–time attention mechanism

In this subsection, inspired by [54], we adopt attention mechanisms to extract the critical temporal and frequency features in the channel.

The channel-refined feature mapping is first generated by the channel attention module and is used as the input to the frequency attention module and the time attention module. These two attention mechanisms generate the frequency-refined feature mapping and the temporal-refined feature mapping, respectively. The two refined feature mappings are then concatenated, and convolutional layers are finally used to pick out the more pivotal channel information in the wireless communication environment. These three attention mechanisms are collectively referred to in this paper as CFTA. An overview of the proposed CFTA is shown in Fig. 2.

Fig. 2

The overview of CFTA, which contains three sub-modules: the channel attention mechanism module, frequency attention mechanism module and time attention mechanism module

We assume that the channels from the BS to the RIS and from the RIS to the vehicles have been captured by the central controller in real time, i.e., that these channels are known. Next, the basic principle of attention mechanisms and the three attention modules are introduced individually.

(1) Attention mechanism

The objective of attention mechanisms is to assign more weight to the parts that need attention, which is done by highlighting only the most prominent information for processing while neglecting the remaining content, thus improving the efficiency of the neural network. Because the neural network can only take real numbers as input, we extract the real and imaginary parts of the channels \({\mathbf{H}}_{{\mathbf{T}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}\), respectively, as follows:

$$ {\mathbf{h}}_{{\mathbf{T}}} = {\text{Re}} \{ {\mathbf{H}}_{{\mathbf{T}}} \} + {\text{Im}} \{ {\mathbf{H}}_{{\mathbf{T}}} \} , $$
(19)
$$ {\mathbf{h}}_{{k,{\mathbf{R}}}} = {\text{Re}} \{ {\mathbf{H}}_{{k,{\mathbf{R}}}} \} + {\text{Im}} \{ {\mathbf{H}}_{{k,{\mathbf{R}}}} \} . $$
(20)

As the dimension \(d\) of the matrices increases, the variance of the inner product of the key matrices and the query matrices also increases. Hence, we scale the product by \(\sqrt d\).

The attention matrices are obtained by multiplying the transpose of the key matrices with the query matrices, dividing by \(\sqrt d\), and then applying the Softmax function to each column for normalization, which is denoted as

$$ {\mathbf{A}}_{{\mathbf{T}}} = {\text{Softmax}} \left( {\frac{{{\mathbf{K}}_{{\mathbf{T}}}^{\text{T}} {\mathbf{Q}}_{{\mathbf{T}}} }}{{\sqrt {d_{{\mathbf{T}}} } }}} \right), $$
(23)
$$ {\mathbf{A}}_{{\mathbf{R}}} = {\text{Softmax}} \left( {\frac{{{\mathbf{K}}_{{\mathbf{R}}}^{\text{T}} {\mathbf{Q}}_{{\mathbf{R}}} }}{{\sqrt {d_{{\mathbf{R}}} } }}} \right). $$
(24)

where \({\mathbf{A}}_{{\mathbf{T}}}\) and \({\mathbf{A}}_{{\mathbf{R}}}\) are the attention matrices of the channel matrices \({\mathbf{H}}_{{\mathbf{T}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}\), respectively, and \(d_{{\mathbf{T}}}\) and \(d_{{\mathbf{R}}}\) denote the dimensions of the channel matrices \({\mathbf{H}}_{{\mathbf{T}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}\), respectively.

Then, the value matrices \({\mathbf{V}}_{{\mathbf{T}}}\) and \({\mathbf{V}}_{{\mathbf{R}}}\) are multiplied by the attention matrices \({\mathbf{A}}_{{\mathbf{T}}}\) and \({\mathbf{A}}_{{\mathbf{R}}}\), which yields a sum over all inputs weighted by the attention weights. The resulting output matrices of the attention mechanisms are expressed as Eqs. (25) and (26).
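As a minimal sketch of the scaled attention described above, the following NumPy code computes a column-wise Softmax attention matrix and the weighted output. It assumes that the query, key and value matrices are linear projections of a real-valued input (Q = W^q h, K = W^k h, V = W^v h) and that \(d\) is the row dimension of the input; the function names, shapes and random weights are illustrative assumptions, not part of the original scheme.

```python
import numpy as np

def column_softmax(x):
    """Softmax applied independently to each column of x."""
    shifted = x - x.max(axis=0, keepdims=True)  # subtract column max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def attention(h, w_q, w_k, w_v):
    """Scaled dot-product attention with a column-wise Softmax.

    h: real-valued input of shape (d, n) (assumed layout).
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical weights).
    """
    d = h.shape[0]
    q, k, v = w_q @ h, w_k @ h, w_v @ h
    a = column_softmax(k.T @ q / np.sqrt(d))  # attention matrix, shape (n, n)
    return v @ a, a                           # output O = V A, shape (d, n)

rng = np.random.default_rng(0)
d, n = 4, 6
h = rng.standard_normal((d, n))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
o, a = attention(h, w_q, w_k, w_v)
```

Each column of `a` sums to one, so every column of `o` is a weighted combination of the columns of the value matrix.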

(2) Channel attention module


The role of the channel attention module is to identify the channels containing pivotal information about the sensed target vehicles and to extract prominent information, such as frequency and time information, from those key channels. Given the input feature channels \({\mathbf{H}}_{{\mathbf{T}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}\), the channel attention module performs global max pooling and global average pooling on the input feature mapping. The two resulting context descriptors are fed into a shared multi-layer perceptron (MLP), and the outputs of the shared network are summed and passed through the sigmoid function to generate the channel attention mappings \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}}\), where \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}} = {\mathbf{M}}_{{\mathbf{c}}} ({\mathbf{H}}_{{\mathbf{T}}} ) \otimes {\mathbf{H}}_{{\mathbf{T}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}} = {\mathbf{M}}_{{\mathbf{c}}} ({\mathbf{H}}_{{k,{\mathbf{R}}}} ) \otimes {\mathbf{H}}_{{k,{\mathbf{R}}}}\). The channel attention operations are formulated as Eqs. (27) and (28).
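The pooling, shared-MLP and sigmoid steps of the channel attention module can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the input is a real-valued feature map of shape (C, F, T), the shared MLP has two layers with a ReLU in between, and the weights `w1` and `w2` are random placeholders rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Channel attention on a real feature map x of shape (C, F, T).

    w1: (C_hidden, C) and w2: (C, C_hidden) are the shared MLP weights
    (hypothetical sizes; the ReLU nonlinearity is an assumption).
    """
    avg = x.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = x.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # shared two-layer MLP

    m_c = sigmoid(mlp(avg) + mlp(mx))        # one weight per channel
    return m_c[:, None, None] * x, m_c       # refined features and weights

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8, 10))
w1 = rng.standard_normal((2, 4))
w2 = rng.standard_normal((4, 2))
x_c, m_c = channel_attention(x, w1, w2)
```

The same MLP weights are applied to both pooled descriptors, which is what makes the network "shared".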


(3) Frequency attention module


The role of the frequency attention module is to identify the channels containing pivotal frequency information. The channel attention mappings generated by the channel attention module are used as the input to the frequency attention module. The frequency attention module focuses on the frequency axis and averages \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}}\) along the time axis. Max pooling and average pooling are then performed, and the concatenated result is passed through a convolutional layer and a sigmoid function to generate the final frequency attention mappings \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{f}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{f}}}\), where \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{f}}} = {\mathbf{M}}_{{\mathbf{f}}} ({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}} ) \otimes {\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{f}}} = {\mathbf{M}}_{{\mathbf{f}}} ({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}} ) \otimes {\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}}\). The frequency attention operations are formulated as Eqs. (29) and (30).


(4) Time attention module


The role of the time attention module is to identify the channels containing pivotal time information. As in the frequency attention module, the channel attention mappings generated by the channel attention module are used as the input to the time attention module. The time attention module focuses on the time axis and averages \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}}\) along the frequency axis. The subsequent steps are the same as in the frequency attention module, resulting in the time attention mappings \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{t}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{t}}}\), where \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{t}}} = {\mathbf{M}}_{{\mathbf{t}}} ({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}} ) \otimes {\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{t}}} = {\mathbf{M}}_{{\mathbf{t}}} ({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}} ) \otimes {\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}}\). The time attention operations are formulated as Eqs. (31) and (32).

$$ {\mathbf{O}}_{{\mathbf{T}}} = {\mathbf{V}}_{{\mathbf{T}}} {\mathbf{A}}_{{\mathbf{T}}} = {\mathbf{W}}^{v} {\mathbf{H}}_{{\mathbf{T}}} {\text{Softmax}} \left( {\frac{{{\mathbf{H}}_{{\mathbf{T}}} {\mathbf{W}}^{{k^{\text{T}} }} {\mathbf{W}}^{q} {\mathbf{H}}_{{\mathbf{T}}} }}{{\sqrt {d_{{\mathbf{T}}} } }}} \right), $$
(25)
$$ {\mathbf{O}}_{{\mathbf{R}}} = {\mathbf{V}}_{{\mathbf{R}}} {\mathbf{A}}_{{\mathbf{R}}} = {\mathbf{W}}^{v} {\mathbf{H}}_{{k,{\mathbf{R}}}} {\text{Softmax}} \left( {\frac{{{\mathbf{H}}_{{k,{\mathbf{R}}}} {\mathbf{W}}^{{k^{\text{T}} }} {\mathbf{W}}^{q} {\mathbf{H}}_{{k,{\mathbf{R}}}} }}{{\sqrt {d_{{\mathbf{R}}} } }}} \right). $$
(26)
$$ {\mathbf{M}}_{{\mathbf{c}}} ({\mathbf{H}}_{{\mathbf{T}}} ) = \sigma ({\text{MLP}}({\text{AvgPool}}({\mathbf{H}}_{{\mathbf{T}}} )) + {\text{MLP}}({\text{MaxPool}}({\mathbf{H}}_{{\mathbf{T}}} ))),$$
(27)
$$\begin{aligned} {\mathbf{M}}_{{\mathbf{c}}} ({\mathbf{H}}_{{k,{\mathbf{R}}}} ) &= \sigma ({\text{MLP}}({\text{AvgPool}}({\mathbf{H}}_{{k,{\mathbf{R}}}} ))\\ &\quad + {\text{MLP}}({\text{MaxPool}}({\mathbf{H}}_{{k,{\mathbf{R}}}} ))).\end{aligned}$$
(28)
$$ {\mathbf{M}}_{{\mathbf{f}}} ({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}} ) = \sigma (f_{3}^{3 \times 3} [{\text{AvgPool}}({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{f}}} );{\text{MaxPool}}({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{f}}} )]), $$
(29)
$$ {\mathbf{M}}_{{\mathbf{f}}} ({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}} ) = \sigma (f_{3}^{3 \times 3} [{\text{AvgPool}}({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{f}}} );{\text{MaxPool}}({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{f}}} )]). $$
(30)
$$ {\mathbf{M}}_{{\mathbf{t}}} ({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{c}}} ) = \sigma (f_{3}^{3 \times 3} [{\text{AvgPool}}({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{t}}} );{\text{MaxPool}}({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{t}}} )]), $$
(31)
$$ {\mathbf{M}}_{{\mathbf{t}}} ({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{c}}} ) = \sigma (f_{3}^{3 \times 3} [{\text{AvgPool}}({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{t}}} );{\text{MaxPool}}({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{t}}} )]). $$
(32)

Finally, \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{f}}}\) and \({\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{t}}}\), as well as \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{f}}}\) and \({\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{t}}}\), are concatenated, respectively, and a convolutional layer with 1 × 1 kernels generates the channels \({\hat{\mathbf{H}}}_{{\mathbf{T}}} = f^{1 \times 1} [{\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{f}}} ;{\mathbf{H}}_{{\mathbf{T}}}^{{\mathbf{t}}} ]\) and \({\hat{\mathbf{H}}}_{{k,{\mathbf{R}}}} = f^{1 \times 1} [{\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{f}}} ;{\mathbf{H}}_{{k,{\mathbf{R}}}}^{{\mathbf{t}}} ]\), which carry the more critical information in the wireless communication environment. These more pivotal channels are then used as the input to the DRL model presented below, so that the optimal beamforming can be obtained more efficiently. The process of implementing the attention mechanism is shown in Algorithm 1.
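The frequency and time attention steps and the final 1 × 1 fusion can be sketched in NumPy as below. Several simplifications are assumed and should be read as such: the input is a real-valued channel-refined map of shape (C, F, T), average and max pooling are taken across the channel dimension, and a size-3 one-dimensional convolution with zero padding stands in for the 3 × 3 convolution of Eqs. (29) to (32). All kernels and weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def axis_attention(x_c, kernel, axis):
    """Attention along one axis of x_c with shape (C, F, T).

    axis=1 gives frequency attention (averaging over time first);
    axis=2 gives time attention (averaging over frequency first).
    kernel: (2, 3) weights of a size-3 convolution (simplifying assumption).
    """
    other = 2 if axis == 1 else 1
    profile = x_c.mean(axis=other)                  # (C, L)
    desc = np.stack([profile.mean(axis=0),          # average pooling over channels
                     profile.max(axis=0)])          # max pooling over channels -> (2, L)
    padded = np.pad(desc, ((0, 0), (1, 1)))         # zero padding keeps length L
    length = desc.shape[1]
    logits = np.array([(padded[:, i:i + 3] * kernel).sum() for i in range(length)])
    weights = sigmoid(logits)                       # (L,)
    shape = [1, 1, 1]
    shape[axis] = length
    return x_c * weights.reshape(shape)             # broadcast along the chosen axis

def fuse(x_f, x_t, w):
    """1x1 convolution over concatenated channels; w has shape (C_out, 2C)."""
    stacked = np.concatenate([x_f, x_t], axis=0)    # (2C, F, T)
    return np.einsum("oc,cft->oft", w, stacked)

rng = np.random.default_rng(2)
x_c = rng.standard_normal((3, 5, 7))
k_f, k_t = rng.standard_normal((2, 3)), rng.standard_normal((2, 3))
w = rng.standard_normal((3, 6))
x_f = axis_attention(x_c, k_f, axis=1)
x_t = axis_attention(x_c, k_t, axis=2)
h_hat = fuse(x_f, x_t, w)
```

A 1 × 1 convolution mixes channels independently at every frequency-time position, which is why it reduces to a single `einsum` over the channel index.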

The proposed attention-DRL-based joint beamforming

In the system model considered in this paper, the amplitude factor of the intelligent reflecting surface is set to the constant \(\{ |s_{n}^{{n_{S} }} | \, = 1\} |_{{n_{S} = 1}}^{{N_{S} }}\). The optimization problem discussed in "Problem formulation" is then equivalent to the joint optimization of the transmit beamforming at the BS and the phase shifts on the RIS. Because of the mobility of the vehicles in the wireless communication environment, the most challenging aspect of this optimization problem is to find the optimal beam over the continuous state space and action space of the vehicles. Therefore, in this subsection, an attention-based DRL scheme that adopts the DDPG algorithm is proposed to solve the joint transmit beamforming and RIS phase shift problem in the RIS-assisted millimeter-wave massive MIMO V2X system, with the objective of achieving the minimum BER.

figure a

The model schematic is shown in Fig. 3, which includes the attention mechanism modules and the actor-critic modules. We assume that the inputs of the Attention-DRL model are the channels \({\hat{\mathbf{H}}}_{{\mathbf{T}}}\) and \({\hat{\mathbf{H}}}_{{k,{\mathbf{R}}}}\), and the final output is the optimal action under the best policy, i.e., the beamforming matrix and the RIS phase shift matrix.

Fig. 3

The proposed DRL algorithm structure

Before describing the DDPG algorithm used in this paper, it is essential to understand reinforcement learning, which aims to find a policy that guides a machine to choose the best action in different states. The learner and decision maker is referred to as the Agent, and the part that interacts with the Agent is referred to as the environment. The policy (or strategy) is the mapping from the state of the environment to the action, or the probability of each action, of the Agent. By interacting with the environment, the Agent chooses an action based on its current strategy; the environment then moves to the next state according to a state transition probability and gives the Agent a reward that reflects how good or bad the state is at that moment. Based on this feedback from the environment, the Agent continuously adjusts its strategy and gradually arrives at its optimal action, which maximizes the reward obtained from the environment.

The DDPG algorithm has two primary networks. The first is the Actor network, also known as the policy network, which learns the mapping from the state to the optimal action. The input dimension of the Actor network is the state dimension, and its output is the continuous action of the Agent. The second is the Critic network, which evaluates the policy of the Agent. The input dimension of the Critic network is the sum of the dimensions of the state and the continuous action, and its output is the Q-value of the state-action pair.

Both the Actor network and the Critic network adopt fully connected neural networks with the same structure and the same optimizer, Adam, which speeds up convergence. The tanh function is selected as the activation function in the hidden layers to handle negative inputs. In addition, based on the power constraint of Eq. (19) and the restriction on the range of the phase shift coefficients, we apply absolute value normalization and range mapping at the output layer of the Actor network. This satisfies the constraints, ensures that the signal does not consume power when it is reflected by the RIS, and converts the action into the data format required by Eq. (20) to calculate the achievable BER.
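One possible way to realize the normalization and range mapping at the Actor output is sketched below. The layout of the raw output vector (real parts followed by imaginary parts for each block), the rescaling of the beamforming matrix to a power budget `p_max`, and the mapping of the RIS coefficients to unit modulus through their phase angle are all assumptions of this sketch, not the exact procedure of the paper.

```python
import numpy as np

def map_actor_output(raw, n_t, k, n_s, p_max):
    """Convert a bounded real actor output into (G, s).

    raw: vector of length 2*n_t*k + 2*n_s (hypothetical layout).
    G: complex (n_t, k) beamforming matrix rescaled so that ||G||_F^2 = p_max.
    s: complex RIS coefficients with unit modulus, so reflection adds no power.
    """
    g_len = n_t * k
    g = (raw[:g_len] + 1j * raw[g_len:2 * g_len]).reshape(n_t, k)
    s_raw = raw[2 * g_len:2 * g_len + n_s] + 1j * raw[2 * g_len + n_s:]

    norm = max(np.linalg.norm(g), 1e-12)   # guard against an all-zero output
    g = g * np.sqrt(p_max) / norm          # power normalization

    phase = np.angle(s_raw)                # range mapping to (-pi, pi]
    s = np.exp(1j * phase)                 # unit-modulus phase shifts
    return g, s

rng = np.random.default_rng(3)
n_t, k, n_s, p_max = 4, 2, 6, 10.0
raw = np.tanh(rng.standard_normal(2 * n_t * k + 2 * n_s))
g, s = map_actor_output(raw, n_t, k, n_s, p_max)
```

Mapping through the phase angle keeps every RIS coefficient on the unit circle regardless of the magnitude the network produces.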

This subsection continues with a detailed description of the algorithm, mainly covering the construction of the actions, states and immediate rewards. Given the channels \({\hat{\mathbf{H}}}_{{\mathbf{T}}}^{(t)}\) and \({\hat{\mathbf{H}}}_{{k,{\mathbf{R}}}}^{(t)}\) at time \(t\), the state at this moment is \(s^{(t)}\), which describes a set of observations of the environment. During the learning process, the Agent acts step by step: it takes action \(a^{(t)}\) under policy \(\pi\), and the environment transitions from \(s^{(t)}\) to the next state \(s^{(t + 1)}\). As a result, the Agent receives a reward \(r^{(t)}\), which is used to evaluate how good the action \(a^{(t)}\) is for the given state \(s^{(t)}\).

In the initial phase of the algorithm, we initialize the experience replay memory \({\mathcal{E}}\) with size \(D\), the action pair consisting of the beamforming matrix \({\mathbf{G}}\) and the phase shift matrix \({\mathbf{S}}\), the training parameters \(\vartheta_{a}^{(train)}\) and target parameters \(\vartheta_{a}^{(target)}\) of the Actor network, and the training parameters \(\vartheta_{c}^{(train)}\) and target parameters \(\vartheta_{c}^{(target)}\) of the Critic network. Meanwhile, to establish the initial state, we define the phase shift coefficients of the RIS in the initial action \(a^{(0)}\) as all zeros and the transmitter power allocation as all ones.

The state \(s^{(t)}\) is composed of four main components. The first component is the channels at the current moment, \({\hat{\mathbf{H}}}_{{\mathbf{T}}}^{(t)}\) and \({\hat{\mathbf{H}}}_{{k,{\mathbf{R}}}}^{(t)}\). As described in "Channel-Frequency-time attention mechanism", the neural network can only be fed with real numbers, so we separate these channels into real and imaginary parts, where \({\hat{\mathbf{h}}}_{{\mathbf{T}}} = {\text{Re}} \{ {\hat{\mathbf{H}}}_{{\mathbf{T}}} \} + {\text{Im}} \{ {\hat{\mathbf{H}}}_{{\mathbf{T}}} \}\) and \({\hat{\mathbf{h}}}_{{k,{\mathbf{R}}}} = {\text{Re}} \{ {\hat{\mathbf{H}}}_{{k,{\mathbf{R}}}} \} + {\text{Im}} \{ {\hat{\mathbf{H}}}_{{k,{\mathbf{R}}}} \}\); the total size of this component is \((N_{T} N_{S} + KN_{S} ) + (N_{R} N_{S} + KN_{S} )\). The second and third components are the transmit power of the BS \(\left\| {{\mathbf{G}}_{k} } \right\|^{2} = \left| {{\text{Re}} \{ {\mathbf{G}}_{k}^{{\text{H}}} {\mathbf{G}}_{k} \} } \right|^{2} + \left| {{\text{Im}} \{ {\mathbf{G}}_{k}^{{\text{H}}} {\mathbf{G}}_{k} \} } \right|^{2}\) and the received power of the vehicles \(\left\| {{\mathbf{H}}_{{k,{\mathbf{R}}}} } \right\|^{2} = \left| {{\text{Re}} \{ ({\mathbf{H}}_{{k,{\mathbf{R}}}} )^{{\text{H}}} {\mathbf{H}}_{{k,{\mathbf{R}}}} \} } \right|^{2} + \left| {{\text{Im}} \{ ({\mathbf{H}}_{{k,{\mathbf{R}}}} )^{{\text{H}}} {\mathbf{H}}_{{k,{\mathbf{R}}}} \} } \right|^{2}\), whose sizes are \(2K\) and \(2K^{2}\), respectively. The last component is the action \(a^{(t - 1)}\) obtained by feeding the state \(s^{(t - 1)}\) into the Actor network at the previous moment, and its total size is \((N_{T} K + N_{S} ) + (N_{R} K + N_{S} )\). In summary, the dimension of the state space is \(D_{s} = 2K + 2K^{2} + (N_{T} + N_{R} )(N_{S} + K) + 2(K + 1)N_{S}\).

The action \(a^{(t)}\) is determined by the transmit beamforming matrix \({\mathbf{G}}\) and the phase shift matrix \({\mathbf{S}}\). At each moment \(t\), the Agent observes the state \(s^{(t)}\) and feeds it into the Actor network. For input into the neural network, \({\mathbf{G}}\) and \({\mathbf{S}}\) are likewise separated into real and imaginary parts, where \({\mathbf{G}} = {\text{Re}} \{ {\mathbf{G}}\} + {\text{Im}} \{ {\mathbf{G}}\}\) and \({\mathbf{S}} = {\text{Re}} \{ {\mathbf{S}}\} + {\text{Im}} \{ {\mathbf{S}}\}\). The dimension of the action space is \(D_{a} = N_{T} K + N_{R} K + 2N_{S}\), and the action \(a^{(t)}\) is determined from the output of the Actor network.
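The state and action dimensions stated above can be checked with a few lines of arithmetic. The sketch below adds up the four state components and compares the total with the closed-form \(D_{s}\); the numerical values of \(N_{T}\), \(N_{R}\), \(N_{S}\) and \(K\) are hypothetical and chosen only for illustration.

```python
def state_components(n_t, n_r, n_s, k):
    """Sizes of the four state components as given in the text."""
    channels = (n_t * n_s + k * n_s) + (n_r * n_s + k * n_s)
    tx_power = 2 * k
    rx_power = 2 * k ** 2
    prev_action = (n_t * k + n_s) + (n_r * k + n_s)
    return channels, tx_power, rx_power, prev_action

def state_dim(n_t, n_r, n_s, k):
    """Closed-form state dimension D_s."""
    return 2 * k + 2 * k ** 2 + (n_t + n_r) * (n_s + k) + 2 * (k + 1) * n_s

def action_dim(n_t, n_r, n_s, k):
    """Action dimension D_a."""
    return n_t * k + n_r * k + 2 * n_s

# Hypothetical system sizes, for illustration only.
n_t, n_r, n_s, k = 8, 4, 50, 2
parts = state_components(n_t, n_r, n_s, k)
d_s = state_dim(n_t, n_r, n_s, k)
d_a = action_dim(n_t, n_r, n_s, k)
print(parts, d_s, d_a)
```

For these values the components are 800, 4, 8 and 124, which sum to \(D_{s} = 936\), and the previous-action component equals \(D_{a} = 124\), as expected.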

Based on the action \(a^{(t)}\) at the current moment and the instantaneous channels \({\hat{\mathbf{H}}}_{{\mathbf{T}}}^{(t)}\) and \({\hat{\mathbf{H}}}_{{k,{\mathbf{R}}}}^{(t)}\) at this moment, the reward \(r^{(t)}\) is obtained from the minimum BER in Eq. (20), together with the new state \(s^{(t + 1)}\).

The above experience sample \((s^{(t)} ,a^{(t)} ,r^{(t + 1)} ,a^{(t + 1)} )\) is stored in the experience replay memory \({\mathcal{E}}\). The \(Q\)-value function is obtained from the critic network, where \(Q(s^{(t)} ,a^{(t)} ) \triangleq q(\vartheta_{c}^{{\text{(train)}}} |s^{(t)} ,a^{(t)} )\); then a batch of randomly selected samples \((s^{{(t^{^{\prime}} )}} ,a^{{(t^{^{\prime}} )}} ,r^{{(t^{^{\prime}} + 1)}} ,a^{{(t^{^{\prime}} + 1)}} )|_{{t^{^{\prime}} \ne t}}\) of size \({\mathcal{M}}\) is drawn from \({\mathcal{E}}\). After that, we construct the loss function \(\mathcal{L}(\vartheta_{c}^{{\text{(train)}}} )\) of the training critic network, update the training critic network \(\vartheta_{c}^{{\text{(train)}}}\) and the training actor network \(\vartheta_{a}^{{\text{(train)}}}\), and then update the target critic network \(\vartheta_{c}^{{\text{(target)}}}\) and the target actor network \(\vartheta_{a}^{{\text{(target)}}}\). Finally, the input state of the network becomes \(s^{(t + 1)}\).

The update process of the above networks is described in detail as follows. The parameter update of the training critic network is expressed as Eqs. (33) and (34), where \(\eta_{c}\) is the learning rate of the training critic network, \(\gamma\) is the discount factor, \(a^{^{\prime}}\) is the action output by the target actor network, and \(\nabla_{{\vartheta_{c}^{{\text{(train)}}} }} \mathcal{L}(\vartheta_{c}^{{\text{(train)}}} )\) represents the gradient of the loss function \(\mathcal{L}(\vartheta_{c}^{{\text{(train)}}} )\) with respect to the training critic network parameters \(\vartheta_{c}^{{\text{(train)}}}\).

The parameter update of the training actor network is expressed as Eq. (35), where \(\eta_{a}\) is the learning rate of the training actor network, \(\nabla_{a} Q(s^{(t)} ,a;\vartheta_{a}^{{\text{(target)}}} )\) is the gradient of the target critic network with respect to the current input action, and \(\nabla_{{\vartheta_{a}^{{\text{(train)}}} }} \pi (s^{(t)} ;\vartheta_{a}^{{\text{(train)}}} )\) is the gradient of the training actor network with respect to its parameters \(\vartheta_{a}^{{\text{(train)}}}\).

$$ \vartheta_{c}^{(t + 1)} = \vartheta_{c}^{(t)} - \eta_{c} \nabla_{{\vartheta_{c}^{{\text{(train)}}} }} \mathcal{L}(\vartheta_{c}^{{\text{(train)}}} ), $$
(33)
$$ \mathcal{L}(\vartheta_{c}^{{\text{(train)}}} ) = (r^{(t)} + \gamma Q(s^{(t + 1)} ,a^{^{\prime}} ;\vartheta_{c}^{{\text{(target)}}} ) - Q(s^{(t)} ,a^{(t)} ;\vartheta_{c}^{{\text{(train)}}} ))^{2} . $$
(34)
$$ \vartheta_{a}^{(t + 1)} = \vartheta_{a}^{(t)} - \eta_{a} \nabla_{a} Q(s^{(t)} ,a;\vartheta_{a}^{{\text{(target)}}} )\nabla_{{\vartheta_{a}^{{\text{(train)}}} }} \pi (s^{(t)} ;\vartheta_{a}^{{\text{(train)}}} ). $$
(35)

The target critic network and target actor network parameters are updated with the following equations, respectively:

$$ \vartheta_{c}^{{\text{(target)}}} \leftarrow \tau_{c} \vartheta_{c}^{{\text{(train)}}} + (1 - \tau_{c} )\vartheta_{c}^{{\text{(target)}}} , $$
(36)
$$ \vartheta_{a}^{{\text{(target)}}} \leftarrow \tau_{a} \vartheta_{a}^{{\text{(train)}}} + (1 - \tau_{a} )\vartheta_{a}^{{\text{(target)}}} . $$
(37)

where \(\tau_{c}\) and \(\tau_{a}\) represent the update factors for the target critic network and the target actor network, respectively. The overall training algorithm for the entire DDPG network is shown in Algorithm 2. The purpose of the algorithm is not to train the neural network for online processing, but to obtain the optimal transmit beamforming matrix \({\mathbf{G}}_{o}\) and phase shift matrix \({\mathbf{S}}_{o}\) using DRL.
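The bookkeeping behind the replay memory, the critic loss and the soft target updates of Eqs. (36) and (37) can be illustrated with a short sketch. It assumes the squared temporal-difference error that is standard for DDPG critics, represents network parameters as plain NumPy arrays, and uses a hypothetical `ReplayMemory` class; no actual neural network is trained here.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size experience replay memory (hypothetical helper)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def critic_loss(r, q_next_target, q_current, gamma):
    """Mean squared TD error over a mini-batch (assumed loss form)."""
    td_target = r + gamma * q_next_target
    return np.mean((td_target - q_current) ** 2)

def soft_update(target, train, tau):
    """Soft target update: target <- tau * train + (1 - tau) * target."""
    return tau * train + (1.0 - tau) * target

memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.store((step, step + 1))            # placeholder transitions
batch = memory.sample(2)

loss = critic_loss(np.array([1.0]), np.array([2.0]), np.array([2.5]), gamma=0.5)
new_target = soft_update(np.array([0.0, 0.0]), np.array([1.0, 2.0]), tau=0.1)
```

A small update factor moves the target parameters only slightly toward the training parameters at each step, which is what stabilizes the TD target.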

figure b

Simulation results and analysis

In this section, we present several numerical results to evaluate the performance of the proposed Attention-DRL algorithm for the RIS-assisted V2X millimeter-wave massive MIMO system. The proposed algorithm is compared with the following methods:

  1. The weighted minimum mean square error (WMMSE) algorithm in [54].

  2. The iterative algorithm based on fractional programming (FP) with the zero-forcing (ZF) beamforming in [12].

  3. The jointly optimal transmit beamforming with phase shift matrix based on DRL algorithm in [41].

Both the WMMSE algorithm and the FP algorithm with ZF require full, up-to-date cross-cell CSI, and both are centralized and iterative in their original forms. A detailed description and pseudo-code of the former can be found in Algorithm 1 in [54], and those of the latter can be found in Algorithm 3 in [12]. Meanwhile, the DRL algorithm assumes that the large-scale path loss and shadowing effects have already been compensated; its detailed description and pseudo-code can be found in Algorithm 1 in [41].

The key hyper-parameters used in our proposed algorithm for the RIS-assisted V2X millimeter-wave massive MIMO system are summarized in Table 1. Two different vehicle speeds, \(v = 45{\text{ km/h}}\) and \(v = 90{\text{ km/h}}\), are selected for the dynamic V2X system, where the normalized Doppler frequency is given by \(f_{d} = \frac{{vf_{c} }}{{cf_{s} }}\). The performance of the Attention-DRL-based algorithm compared with these three benchmarks is presented in this section.
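To illustrate the normalized Doppler formula, the sketch below evaluates \(f_{d}\) for the two vehicle speeds. It assumes that \(f_{c}\) is the carrier frequency, \(f_{s}\) the sampling frequency and \(c\) the speed of light; the values of 28 GHz and 1 MHz are hypothetical placeholders and are not taken from Table 1.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def normalized_doppler(v_kmh, f_c, f_s, c=SPEED_OF_LIGHT):
    """Normalized Doppler frequency f_d = v * f_c / (c * f_s)."""
    v = v_kmh / 3.6             # convert km/h to m/s
    return v * f_c / (c * f_s)

# Hypothetical carrier and sampling frequencies, for illustration only.
f_c, f_s = 28e9, 1e6
f_d_45 = normalized_doppler(45.0, f_c, f_s)
f_d_90 = normalized_doppler(90.0, f_c, f_s)
print(f_d_45, f_d_90)
```

Because \(f_{d}\) is linear in \(v\), doubling the speed from 45 km/h to 90 km/h doubles the normalized Doppler frequency.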

Table 1 Summary of hyper-parameters

We consider the relationship between SNR and convergence performance of the described Attention-DRL-based algorithm for two vehicle speeds, \(v = 45{\text{ km/h}}\) and \(v = 90{\text{ km/h}}\), and three numbers of RIS reflecting elements, \(N_{S} = 50,100,150\). In addition, we compare the average runtime versus SNR of the proposed Attention-DRL-based algorithm with that of the three other algorithms described at the beginning of "Simulation results and analysis".

The SNR (in dB) is defined as

$$ {\text{SNR}} = 10 {\text{log}_{10}} \left({\left\| {{\overline{\mathbf{\mathcal{Y}}}}} \right\|_{F}^{2}} /{\left\| {\mathbf{\mathcal{W}}} \right\|_{F}^{2}} \right). $$
(38)

where \({\overline{\mathbf{\mathcal{Y}}}}\) is the noise-free received signal tensor according to Eq. (9), and \(\mathbf{\mathcal{W}}\) is the AWGN tensor.
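The SNR definition in Eq. (38) can be evaluated directly from the two tensors. The following sketch computes the SNR in dB from the squared Frobenius norms and, as an illustrative addition, rescales a noise tensor to reach a target SNR; the tensor shapes and the helper `scale_noise_to_snr` are assumptions made for this example.

```python
import numpy as np

def snr_db(y_clean, w):
    """SNR in dB from the squared Frobenius norms of signal and noise tensors."""
    signal_power = np.sum(np.abs(y_clean) ** 2)
    noise_power = np.sum(np.abs(w) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

def scale_noise_to_snr(y_clean, w, target_db):
    """Rescale w so that snr_db(y_clean, scaled_w) equals target_db (hypothetical helper)."""
    signal_power = np.sum(np.abs(y_clean) ** 2)
    noise_power = np.sum(np.abs(w) ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (target_db / 10.0)))
    return scale * w

y = np.ones((2, 3, 4), dtype=complex)          # toy noise-free tensor
w = 0.1 * np.ones((2, 3, 4), dtype=complex)    # toy noise tensor
base_snr = snr_db(y, w)

rng = np.random.default_rng(4)
noise = rng.standard_normal((2, 3, 4)) + 1j * rng.standard_normal((2, 3, 4))
noise_10db = scale_noise_to_snr(y, noise, 10.0)
```

In the toy case the power ratio is 100, so the SNR is 20 dB.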

To better understand the proposed attention-DRL-based approach, we investigate the average reward versus time step for different SNR values, as shown in Fig. 4. The SNR has a significant impact on both the convergence rate and the achieved performance. In the low SNR scenario, the achieved average reward is much lower than in the high SNR scenario, but convergence is faster and more stable. Similarly, we study the average reward versus time step for four scenarios with \(N_{S} = 100,150\) and vehicle speeds of \(v = 45{\text{ km/h}}\) and \(v = 90{\text{ km/h}}\); the results indicate that the algorithm remains robust to variations in the system parameter settings. As shown in Fig. 5, the average reward increases with the number of elements \(N_{S}\) and with vehicle speed, and convergence is more robust at \(v = 45{\text{ km/h}}\) than at \(v = 90{\text{ km/h}}\), which indirectly reflects that the high speed of the vehicle affects the system performance.

Fig. 4

Average rewards versus time steps under different SNR = {0 dB, 10 dB, 20 dB, 30 dB}

Fig. 5

Average rewards versus time steps under different \(N_{S}\) and \(v\)

Figure 6 depicts the number of iterations required for convergence versus the SNR for the Attention-DRL-based algorithm. For the same SNR, a vehicle speed of \(v = 90{\text{ km/h}}\) requires more iterations than a vehicle speed of \(v = 45{\text{ km/h}}\), which indicates that the high-speed movement of the vehicle creates a more complex communication environment for the system. It can also be seen from Fig. 6 that the number of iterations required grows with the number of reflecting elements \(N_{S}\) in the RIS, which shows the significant impact of \(N_{S}\) on the system. Moreover, the number of iterations required for convergence does not change significantly across the cases in the high SNR range, whereas the convergence rate is more sensitive to \(N_{S}\) in the low SNR range.

Fig. 6

The SNR versus the number of iterations required for convergence curve for the Attention-DRL-based algorithm

We also compare the average runtime of the proposed Attention-DRL-based algorithm with that of the other three algorithms. The results are depicted in Fig. 7. We choose a small number of reflecting elements for the RIS, \(N_{S} = 50\), and a low vehicle speed, \(v = 45{\text{ km/h}}\). The results confirm the robustness of the proposed Attention-DRL-based algorithm over the WMMSE algorithm, the FP algorithm with ZF and the DRL algorithm. In the low SNR range, all four algorithms are more sensitive to the SNR, and their average runtime decreases as the SNR increases. In the high SNR range, the average runtime of all four algorithms flattens out.

Fig. 7

The performance of average runtime between Attention-DRL-based algorithm and the other three algorithms

We also consider, for the Attention-DRL algorithm, the effect of changes in vehicle position at two different speeds on the normalized power and the power-efficiency gain provided by the RIS. As shown in Figs. 8 and 9, we simulate the transmit power \(P_{t}\) versus the vehicle position for \({\text{BER = 10}}^{ - 4} \;{\text{dB}}\), which is recorded in the upper subfigures. The lower subfigures show the relationship between the power-efficiency gain provided by the RIS and the vehicle position.

Fig. 8

The performance of the normalized \(P_{t}\) and the power-efficiency gain by RIS due to vehicles position with \(v = 45\;{\text{km/h}}\)

Fig. 9

The performance of the normalized \(P_{t}\) and the power-efficiency gain by RIS due to vehicles position with \(v = 90\;{\text{km/h}}\)

The vehicle speed is \(v = 45\;{\text{km/h}}\) in Fig. 8 and \(v = 90\;{\text{km/h}}\) in Fig. 9, and the number of reflecting elements in the RIS is set to \(N_{S} = 100\). As can be seen from the upper subfigures of Figs. 8 and 9, the RIS has little effect on the normalized power \(P_{t}\) when the vehicles are positioned closer to the BS. In contrast, the normalized power reaches a plateau when the vehicles are positioned closer to the RIS. As the distance from the vehicles to the RIS and the BS continues to increase, the normalized \(P_{t}\) still increases with distance, as expected, but it also tends to level off. Similarly, as seen in the lower subfigures of Figs. 8 and 9, the power-efficiency gain provided by the RIS is smaller when the vehicles are positioned closer to the BS. However, as the vehicles move closer to the RIS, the power-efficiency gain increases significantly, and it tends to decrease as the vehicles move away from the RIS and the BS.

Comparing Figs. 8 and 9, we see that, owing to the higher vehicle speed in Fig. 9, the variation of the normalized transmit power of the BS is less favorable than at the lower vehicle speed, and its value is higher. For the power-efficiency gain, the minimum normalized power in both figures corresponds to a high power-efficiency gain.

In addition, we compare the BER versus the number of reflecting elements in the RIS for the proposed Attention-DRL-based algorithm and the other three algorithms described at the beginning of "Simulation results and analysis". We also explore the effect of \(P_{t}\) on the BER for \(N_{S} = 100,150\) reflecting elements.

In Fig. 10, the vehicle speed is \(v = 45\;{\text{km/h}}\) and \(P_{t} = 20\;{\text{dB}}\). As can be seen, the proposed algorithm achieves a lower BER than the benchmark algorithms. Moreover, the BER decreases with an increasing number of elements in the RIS under all algorithms considered, indicating that the number of reflecting elements in the RIS also has a significant effect in this system.

Fig. 10

The performance of the BER and the number of reflecting elements in RIS between attention-DRL-based algorithm with the other three algorithms

Another simulation experiment was carried out with two sets of system parameters, as shown in Fig. 11. In the upper part of the figure, the number of reflecting elements in the RIS is set to \(N_{S} = 100\), and in the lower part it is set to \(N_{S} = 150\). Again, the vehicle speed is \(v = 45\;{\text{km/h}}\). It is observed that, for a given number of reflecting elements in the RIS, the BER decreases as \(P_{t}\) increases, and that a larger number of reflecting elements yields a greater BER improvement at a fixed \(P_{t}\), which makes the system performance more robust.

Fig. 11

The effect of \(P_{t}\) on BER with \(N_{S} = 100,150\)

Conclusion

In this work, we investigate a RIS-assisted V2X millimeter-wave massive MIMO system and propose an attention-DRL-based algorithm that jointly optimizes the transmit beamforming at the BS and the reflecting phase shifts on the RIS to minimize the BER of the system. Unlike traditional alternating optimization methods, integrating DRL into the design of the RIS-assisted V2X millimeter-wave massive MIMO system allows the agent to observe immediate rewards, learn from the environment and improve its policy, and ultimately obtain solutions to high-dimensional optimization problems. In addition, we propose a CFTA framework that effectively extracts frequency and time features from the channel, which improves the effectiveness of the optimization model. In particular, we take full advantage of the low-rank nature of the millimeter-wave channel and the tensor structure of the received signal, not only to increase accuracy but also to reduce the training overhead and complexity. Simulation results show that the proposed scheme achieves a lower BER than the WMMSE, FP with ZF and DRL benchmarks.