Traditional electrocardiogram (ECG) analysis usually requires doctors to diagnose and treat patients based on ECG waveform information. However, ECGs recorded by wearable devices are commonly contaminated by various kinds of noise, especially motion artifacts [MA: muscle artifact (ma) and electrode motion artifact (em)]. This contamination produces a large number of poor-quality signals, which seriously hinders doctors’ diagnoses and delays patients’ timely treatment. To make matters worse, some MA frequency content overlaps with the band of the ECG signal, which limits frequency-domain filtering methods, and some MA has a morphology similar to that of ECG waveforms, which limits time-domain filtering methods [1]. It is therefore challenging to eliminate this noise without distorting the clinical features [2].

In general, there are two ways to solve this problem. The first is to use denoising techniques [3,4,5,6], which are effective against baseline wander, high-frequency noise, etc., but have difficulty removing the MA noise mentioned above. The second is to discard signals heavily contaminated by MA through signal quality assessment (SQA) [7, 8]. Currently, the mainstream SQA methods can be roughly divided into two categories. The first category is based on traditional machine learning and signal quality indicators (SQIs) [9,10,11,12,13,14]. For example, Xia et al. proposed an ECG SQA method based on a support vector machine (SVM) and multi-feature fusion of waveform attributes, power spectrum, R-wave detection, and other characteristics [9]. Behar et al. employed indicators such as kSQI, sSQI, pSQI, basSQI, bSQI, pcaSQI, and rSQI, and trained an SVM model to evaluate the quality of ECG signals and reduce false alarms [10]. Satija et al. calculated SQIs through signal loss detection, baseline mutation extraction, and high-frequency noise detection and extraction to evaluate the clinical acceptability of ECG signals [11]. Zhang et al. adopted waveform feature-based methods (including lead-off features, baseline wander features, power spectral features, and nonlinear features) to train random forest and SVM models for SQA [12]. Shahriari et al. used the structural similarity measure (SSIM) to compare ECG images obtained from two ECGs at standard scales: a representative subset of ECG images is selected from the training set as templates by a clustering method, and the SSIM between each image and all templates is then used as the feature set to train a linear discriminant analysis classifier for SQA [13]. Holzinger et al. provided a taxonomy of entropy methods, describing in more detail approximate entropy, sample entropy, fuzzy entropy, and particularly topological entropy for finite sequences; they also state that entropy measures have been tested successfully for analyzing short, sparse, and noisy time series data [14]. These hand-crafted features have the advantage of interpretability and can reflect specific ECG characteristics to a certain extent. However, because these SQIs rely on human-specified properties of clean signals, they are inherently limited in expressing latent features of signal quality. Moreover, they rarely consider effective ECG feature extraction under MA interference.

The second category is deep learning-based methods [15,16,17,18], which usually utilize abstract features extracted by deep learning techniques or combine them with hand-crafted features to implement SQA. For instance, Liu et al. proposed a method that combines deep learning-based Stockwell Transform (S-Transform) spectrogram features with hand-crafted statistical features to achieve SQA [15]. Huerta et al. combined convolutional neural networks and the wavelet transform to robustly identify high-quality ECG segments in the challenging setting of single-lead recordings with alternating sinus rhythm, atrial fibrillation episodes, and other rhythms [16]. Seeuws et al. used an unsupervised deep learning model to derive a data-driven quality metric that outperformed some traditional metrics (kSQI, sSQI, IOR, pSQI, basSQI, bSQI, and pcaSQI) and highlighted its consistently superior performance across different tasks [17]. Zhang et al. designed a comprehensive feature set (covering spectral distribution, signal complexity, horizontal and vertical variations of waves, etc.) and utilized two long short-term memory (LSTM) layers to learn time-dependent features automatically [18]. Compared with hand-crafted features, the abstract features extracted by deep learning methods describe ECG recordings from a different perspective. However, these methods seldom offer effective solutions to MA interference that has a morphology similar to, and a frequency band aliased with, some ECG signals. In addition, they rarely address the interpretability of these features or the relationships between them.

Here, we mainly address two problems: (1) noise such as MA, whose morphology resembles some ECG waveforms and whose frequency band aliases with theirs, can easily deceive machine learning methods, resulting in low SQA accuracy; (2) hand-crafted features require substantial human intervention and cannot express signal quality comprehensively. We propose a novel SQA method that fuses depth local dual-view (DLDV) features with a dual-input Transformer (DI-Transformer) framework to improve the recognition of MA-contaminated ECG. Specifically, we extract the first three intrinsic mode functions (FT-IMF) of the signal through empirical mode decomposition (EMD) [19] and then employ the fast Fourier transform (FFT) [20] to explore the deeper local amplitude and phase angle features of the FT-IMF. The resulting DLDV features are reduced in dimension by kernel principal component analysis (KPCA) [21] and used to identify subtle differences between MA and ECG signals. At the same time, we analyze the central tendency and dispersion of the FT-IMF and combine the result with the dimensionality-reduced DLDV features to form augmented features (FT-IMF\(_\mathrm{all}\)). Finally, FT-IMF\(_\mathrm{all}\) is fused with the temporal relational features extracted from the Raw ECG by the proposed DI-Transformer framework to train the SQA model. In particular, the phase angle features we extract contain the contribution of each time sample point, so they can quantify subtle changes in the ECG at individual sample points and thereby distinguish the nuances of ECG and MA. As far as we know, no previous work has extracted DLDV features (phase angle and amplitude–frequency features) from the FT-IMF to achieve SQA. Only Lee et al. calculated the mean, variance, and Shannon entropy of the first IMF (F-IMF) obtained by EMD and used them for SQA [22]. These indicators reflect the signal’s central tendency and dispersion but cannot fully reflect the deeper local features needed to distinguish MA noise, because the feature information computed by their method loses the temporal features. In this paper, the DLDV features extracted from the FT-IMF not only address the inability of traditional methods to capture the characteristic features of MA but also have the advantage of interpretability. We also verify the accuracy and robustness of the DLDV features on four traditional classifiers and provide an accurate and efficient SQA scheme based on K-Nearest Neighbor (KNN). In addition, our proposed DI-Transformer model is based on the transformer [23] architecture, whose multi-head attention module can be executed in parallel and can capture the temporal relationships in the ECG signal. Combining our strategy with the transformer model overcomes the heavy reliance of traditional machine learning on human intervention while accurately distinguishing MA noise from ECGs. The contributions of this study can be summarized as follows:

  • The proposed DLDV features can identify subtle differences between MA and ECG signals through depth local amplitude and phase angle features, which provides a practical and novel solution for identifying MA-contaminated ECGs.

  • The proposed DI-Transformer can focus on the temporal relationship between sample points and reflect the subtle local changes in the signal sequence, which can effectively improve the model’s ability to identify MA-contaminated ECG.

  • The strategy of fusing the DLDV features with the DI-Transformer’s temporal relational features extracted from the Raw ECG significantly improves the accuracy of MA noise recognition and is applicable to settings such as wearable ECG monitoring devices.

  • For the first time, we propose the DLDV features to solve the ECG SQA problem and achieve an accuracy of 94.27% on G-SVM and 93.32% on KNN, and the result outperforms six traditional SQIs. More importantly, we obtain the best accuracy (99.62%) on the proposed DI-Transformer, which outperforms other state-of-the-art SQA methods.

This paper is organized as follows: “Methodology” presents the data used in the experiments and the details of our method. “Experiments and results” demonstrates the experimental results. Finally, we discuss and conclude our work in “Discussion” and “Conclusion”.


We present the overall framework of the proposed SQA method in Fig. 1. It consists of three main parts: data preprocessing, DLDV feature extraction with KPCA, and the DI-Transformer framework. The DI-Transformer framework in turn consists of two parts: the transformer encoder layer and the classification layer. We describe each part in detail in the following sections.

Fig. 1 The overall framework of the proposed SQA model. Note that this paper focuses on SQA of MA-contaminated long-term ECGs

The DLDV features extraction and KPCA

DLDV features extraction

We start our DLDV feature extraction method from EMD [19]. EMD can effectively process nonlinear and non-stationary time series such as ECG signals. Unlike the FFT and the discrete wavelet transform (DWT) [24], EMD reveals the inherent features of a signal through the IMFs into which it decomposes the signal. It represents a signal as a combination of multiple IMF components, ordered from high to low frequency, and different IMFs reflect the signal and the noise to different degrees.

Fig. 2 The process of DLDV feature extraction and dimensionality reduction. Note that we process the ECG in two stages (EMD and FFT) and then focus on the depth local features of the secondary components of the ECG through the amplitude value and phase angle; hence the name depth local dual-view feature

In general, some MA noise has a morphology similar to, and a frequency band overlapping with, some ECG signals, so traditional denoising methods cannot effectively eliminate it. We find, however, that the local nuances between them can be expressed by the IMFs. Therefore, we design a dedicated method to obtain the DLDV features of MA-contaminated ECGs. Figure 2 shows the architecture of the proposed DLDV feature extraction method. The light green areas represent the key module of the method, which we call the DLDV feature extraction module (DLDV-FEM); it is composed of a stack of \(N = 3\) identical modules. Each module has two sub-modules: an FFT-based sub-module and a statistical analysis-based sub-module (SA-based sub-module). After performing EMD on x[n], we obtain its FT-IMF components (F-IMF: the first IMF, S-IMF: the second IMF, and T-IMF: the third IMF). When we feed the F-IMF to the DLDV-FEM through the “Input” pipeline, the FFT-based sub-module obtains its amplitude value and phase angle in the frequency domain through the FFT [20] (denoted FT-IMF\(_\mathrm{f}\)). Meanwhile, the SA-based sub-module obtains its central tendency and degree of dispersion (denoted FT-IMF\(_\mathrm{t}\)). FT-IMF\(_\mathrm{t}\) and FT-IMF\(_\mathrm{f}\) are then output together to FT-IMF\(_\mathrm{F}\) through the lavender pipeline. When the remaining S-IMF and T-IMF pass through the DLDV-FEM in turn, we get two further output components (S-IMF\(_\mathrm{S}\) and T-IMF\(_\mathrm{T}\)). The FT-IMF\(_\mathrm{f}\) parts of these three output components are concatenated to form our FT-IMF\(_\mathrm{freq}\) (DLDV) features, and the three FT-IMF\(_\mathrm{t}\) parts are concatenated to form our FT-IMF\(_\mathrm{time}\) features. Finally, the output features (FT-IMF\(_\mathrm{all}\)) of the entire module are obtained by concatenating FT-IMF\(_\mathrm{freq}\) and FT-IMF\(_\mathrm{time}\). We now describe the feature extraction process in detail:

Let \(X \in {\mathbb {R}}^{12 \times \ell }\) denote a multi-lead ECG signal and \(X_\mathrm{f} \in {\mathbb {R}}^{1 \times \ell }\) its f-th lead, where \(f \in [1, \ldots , 12]\) is the lead index and \(\ell \) is the length of the ECG segment. After performing the EMD operation according to [19], we obtain the IMFs as follows:

$$\begin{aligned} \mathrm{IMF}_{\mathrm{f}, p}[n] = \left\{ \begin{array}{ll} X_{\mathrm{f}}[n]-r_{\mathrm{f}, p}[n], &amp; p = 1 \\ X_{\mathrm{f}}[n]-\sum _{q = 1}^{p-1} \mathrm{IMF}_{\mathrm{f}, q}[n]-r_{\mathrm{f}, p}[n], &amp; p>1, \end{array}\right. \end{aligned}$$

where n is the sample index within the ECG segment, \(\mathrm{IMF}_{\mathrm{f},p}[n]\) represents the p-th IMF of the f-th lead, \(p \in [1,2,\ldots ,N]\), N is the total number of IMF layers (here, \(N = 3\) and \(f = 1\)), and \(r_{\mathrm{f},p}[n]\) is the residual signal generated after the f-th lead signal passes through the p-th layer of EMD. Note that this paper mainly uses the FT-IMF (F-IMF, S-IMF, and T-IMF) components of EMD, because the first IMFs behave as though the signal had been passed through a high-pass filter [25]. Hence, it is not surprising that the FT-IMF contains dynamics associated with noise for any well-sampled data [26].
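For reference, the layer-by-layer relation behind this definition can be written as a recursion. This is the usual EMD bookkeeping rather than a formula taken from this paper, and the symbol \(r_{\mathrm{f},0}\) is introduced here only for illustration:

$$\begin{aligned} r_{\mathrm{f},0}[n] = X_{\mathrm{f}}[n], \qquad r_{\mathrm{f},p}[n] = r_{\mathrm{f},p-1}[n]-\mathrm{IMF}_{\mathrm{f},p}[n], \qquad X_{\mathrm{f}}[n] = \sum _{q = 1}^{N} \mathrm{IMF}_{\mathrm{f},q}[n]+r_{\mathrm{f},N}[n]. \end{aligned}$$

Under this reading, with \(N = 3\) the FT-IMF and the final residual together reconstruct the original lead.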

Fig. 3 The FT-IMF of clean and noisy ECG signals. The light purple areas mark the features of em and ma in each IMF component. a The clean ECG signal and its FT-IMF components. b The bw-contaminated signal and its FT-IMF components. c The ma-contaminated signal and its FT-IMF components. d The em-contaminated signal and its FT-IMF components

Figure 3 shows the FT-IMF of the clean signal and of the signals contaminated by baseline wander (bw), ma, and em, respectively. We find several interesting phenomena: (1) the amplitude values of the IMFs of the noise-contaminated ECG signals are significantly lower than those of the clean signal. (2) The FT-IMF components contain almost no bw noise (there is almost no difference between the corresponding IMF components in Fig. 3a, b) but reflect the inherent features of em and ma noise well (the FT-IMF of the noisy signals in Fig. 3c, d reflect the noise to varying degrees). (3) R peaks have high amplitude values in every IMF component, while em or ma noise resembling R peaks has different amplitude values in different IMFs. The differences caused by ma artifacts in each IMF component are marked in light purple in Fig. 3c, where ma is manifest to different degrees in all three components. In Fig. 3d, the differences caused by em artifacts are likewise marked in light purple, and em shows obvious characteristics in the T-IMF. These phenomena indicate that the FT-IMF contains features beneficial to recognizing MA-contaminated ECG. Therefore, we utilize the FFT-based sub-module to extract the amplitude value and phase angle of the F-IMF, S-IMF, and T-IMF in the frequency domain, and concatenate the features obtained from the three components:

$$\begin{aligned} {\text {FT-IMF}}_{\mathrm{freq}} = \mathrm{Concat}\left\{ \begin{array}{l} \mathrm{angle}(\mathrm{fft}({\text {F-IMF}})),\Vert \mathrm{fft}({\text {F-IMF}})\Vert \\ \mathrm{angle}(\mathrm{fft}({\text {S-IMF}})),\Vert \mathrm{fft}({\text {S-IMF}})\Vert \\ \mathrm{angle}(\mathrm{fft}({\text {T-IMF}})),\Vert \mathrm{fft}({\text {T-IMF}})\Vert \end{array}\right\} , \end{aligned}$$

where FT-IMF\(_\mathrm{freq} \in {\mathbb {R}}^{3 \times 2\ell }\), \(\Vert \cdot \Vert \) denotes the absolute value (modulus) operation, \({\text {angle}}(\cdot )\) computes the phase angle, \({\text {fft}}(\cdot )\) denotes the FFT, and \({\text {Concat}}(\cdot )\) denotes concatenation. Simultaneously, we utilize the SA-based sub-module to analyze the central tendency and dispersion of the FT-IMF in the time domain, and concatenate the features obtained from the three components:

$$\begin{aligned} {\text {FT-IMF}}_\mathrm{time } = \mathrm{Concat}\left\{ \begin{array}{l} \mathrm{mean}({\text {F-IMF}}), \mathrm{var}({\text {F-IMF}}) \\ \mathrm{mean}({\text {S-IMF}}), \mathrm{var}({\text {S-IMF}}) \\ \mathrm{mean}({\text {T-IMF}}), \mathrm{var}({\text {T-IMF}}) \end{array}\right\} , \end{aligned}$$

where the \({\text {mean}}(\cdot )\) is the averaging operation, \({\text {var}}(\cdot )\) represents the operation of calculating variance, and FT-IMF\(_\mathrm{time} \in {\mathbb {R}}^{3 \times 2}\).
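The two feature definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `dldv_features` is hypothetical, the three IMFs are assumed to be precomputed by some EMD routine, and three pure sinusoids stand in for them so that the example stays self-contained.

```python
import numpy as np

def dldv_features(imfs):
    """Return (FT-IMF_freq, FT-IMF_time) for an array of shape (3, L).

    Rows are assumed to hold the F-IMF, S-IMF and T-IMF of one ECG segment.
    """
    imfs = np.asarray(imfs, dtype=float)
    spectrum = np.fft.fft(imfs, axis=1)        # complex spectrum, shape (3, L)
    phase = np.angle(spectrum)                 # phase angle of each frequency bin
    amplitude = np.abs(spectrum)               # amplitude of each frequency bin
    ft_imf_freq = np.concatenate([phase, amplitude], axis=1)                # (3, 2L)
    ft_imf_time = np.stack([imfs.mean(axis=1), imfs.var(axis=1)], axis=1)   # (3, 2)
    return ft_imf_freq, ft_imf_time

# Stand-in "IMFs" (an assumption for illustration): 6 s sinusoids sampled at 500 Hz.
fs = 500
t = np.arange(6 * fs) / fs
toy_imfs = np.stack([np.sin(2 * np.pi * f * t) for f in (40.0, 15.0, 5.0)])

freq_feat, time_feat = dldv_features(toy_imfs)
print(freq_feat.shape, time_feat.shape)
```

For a 6 s segment at 500 Hz the frequency-view block has 3 × 6000 entries, which is why a dimensionality reduction step follows.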

Fig. 4 An example visualization of the amplitude value and phase angle features of the FT-IMF. a Amplitude–frequency features of the FT-IMF of the em-contaminated ECG. b Phase angle of the FT-IMF of the em-contaminated ECG. c Amplitude–frequency features of the FT-IMF of the ma-contaminated ECG. d Phase angle of the FT-IMF of the ma-contaminated ECG

Figure 4 shows an example of the features extracted from the em- and ma-contaminated signals. Figure 4a shows the amplitude–frequency features of the em-contaminated ECG, and Fig. 4b the corresponding phase angle features. Figure 4c shows the amplitude–frequency features of the ma-contaminated ECG, and Fig. 4d the corresponding phase angle features. It can be seen that when the frequency component of an intermediate quantity (IMF) decomposed from the em- or ma-contaminated ECG is nonzero, the corresponding phase angle is also nonzero and shows no obvious periodicity, whereas the phase angle of the clean signal is periodic, in line with the periodic nature of the ECG signal. In addition, the phase angle can reflect the local change of the signal waveform at a certain moment [27], so the depth features extracted in this way can capture the subtle differences between signal and noise. Finally, we obtain FT-IMF\(_\mathrm{freq}\) and FT-IMF\(_\mathrm{time}\); we also refer to FT-IMF\(_\mathrm{freq}\) as \(X_\mathrm{DLDV}\).

DLDV feature dimension reduction

Principal component analysis (PCA) [28] is one of the essential methods for linear dimensionality reduction. Each principal component is a projection of the data onto a certain direction, and the variance along each direction is given by the corresponding eigenvalue. In the dimensionality reduction process, the eigenvalues are sorted from large to small, and the projections onto the eigenvectors corresponding to the first k eigenvalues are used as the dimensionality-reduced features. However, the data we need to process are nonlinear and non-stationary ECG signals. Therefore, this paper adopts kernel principal component analysis (KPCA) [21]. KPCA implicitly maps the data into a higher-dimensional (Hilbert) space and performs PCA there. The advantage is that, for nonlinear data points that are difficult to separate in the lower-dimensional space, an effective projection direction may be found in the higher-dimensional space. Because the DLDV features (nonlinear features) are very high-dimensional and contain some features that hardly contribute to classification (as reflected in Fig. 4), we use KPCA to reduce their dimensionality.

For PCA, let \({\textbf {X}}_\mathrm{DLDV} = \left[ x_{1}, x_{2}, \ldots , x_{n}\right] \in {\mathbb {R}}^{n \times d}\), where n is the number of sequences in \({\textbf {X}}_\mathrm{DLDV}\) and d is the dimension of each sequence. After performing PCA, we get the following decomposition model:

$$\begin{aligned} {\textbf {X}}_\mathrm{DLDV} = {\textbf {S}}_{1} {\textbf {U}}_{1}^{T}+{\textbf {S}}_{2} {\textbf {U}}_{2}^{T}+\cdots +{\textbf {S}}_{d} {\textbf {U}}_{d}^{T}, \end{aligned}$$

where \({\textbf {S}}_{t}\) and \({\textbf {U}}_{t}\) (\(1 \le t \le d\)) represent the t-th principal component vector and the corresponding projection vector, respectively. Since the \({\textbf {U}}_{t}\) are orthonormal, the principal component can be expressed as \({\textbf {S}}_{t} = {\textbf {X}}_\mathrm{DLDV} {\textbf {U}}_{t}\), and the projection vector \({\textbf {U}}_{t}\) can be calculated by solving the eigenvalue problem:

$$\begin{aligned} \gamma _{t} {\textbf {U}}_{t} = \frac{1}{n-1} {\textbf {X}}_\mathrm{DLDV}^{T} {\textbf {X}}_\mathrm{DLDV} {\textbf {U}}_{t}. \end{aligned}$$
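The linear PCA relations above can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrix `X` is random stand-in data rather than real DLDV features, and the columns are mean-centered first, a step the text does not spell out but which the covariance form with \(1/(n-1)\) implies.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))            # hypothetical stand-in for X_DLDV (n = 50, d = 8)
Xc = X - X.mean(axis=0)                 # assumption: columns are mean-centered

cov = Xc.T @ Xc / (Xc.shape[0] - 1)     # (1 / (n - 1)) X^T X
gamma, U = np.linalg.eigh(cov)          # eigh returns eigenvalues in ascending order
order = np.argsort(gamma)[::-1]         # sort from large to small
gamma, U = gamma[order], U[:, order]

S = Xc @ U                              # column t holds the principal component S_t = X U_t
X_rebuilt = S @ U.T                     # sum over t of S_t U_t^T
print(np.allclose(X_rebuilt, Xc))
```

Keeping only the first k columns of `S` gives the reduced representation; the variance of each retained column equals the corresponding eigenvalue.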

For KPCA, we define a mapping: \({\textbf {X}} _\mathrm{DLDV} \in {\mathbb {R}}^{n \times d} \rightarrow \varvec{\mathbb {\aleph }}\left( \mathrm {{\textbf {X}}}_\mathrm{DLDV}\right) \in {\mathbb {R}}^{n \times p}\), the \(\varvec{\mathbb {\aleph }}(\cdot )\) denotes a nonlinear mapping function which is to map the signal to the Hilbert functional space (\(\varvec{\beth }\)), and p represents the dimension of the feature space. We denote the mapping function of \({\textbf {X}}_\mathrm{DLDV}\) to the \(\varvec{\beth }\) space as:

$$\begin{aligned} \varvec{\mathbb {\aleph }}\left( {\varvec{X}}_\mathrm{DLDV}\right) = {\varvec{S}}_{1} {\varvec{U}}_{1}^{T}+{\varvec{S}}_{2} {\varvec{U}}_{2}^{T}+\cdots +{\varvec{S}}_{p} {\varvec{U}}_{p}^{T}. \end{aligned}$$

For the nonlinear case, it is difficult to solve for \({\textbf {U}}_{t}\) by simply replacing \({\textbf {X}}_\mathrm{DLDV}\) with \(\varvec{\mathbb {\aleph }}({\textbf {X}}_\mathrm{DLDV} )\) in (6), because the mapping function \(\varvec{\mathbb {\aleph }}(\cdot )\) is unknown. To address this problem, we introduce the kernel trick to develop the KPCA model. Following [29], \({\varvec{U}}_{t}\) can be expanded in the feature space as \({\textbf {U}}_{t} = \varvec{\mathbb {\aleph }}^{T}\left( {\varvec{X}}_\mathrm{DLDV}\right) \varvec{\beta }_{t}\), where \(\varvec{\beta }_{t}\) is a linear transformation vector. Thus, formula (6) is transformed into:

$$\begin{aligned} \gamma _{t} \varvec{\beta }_{t} = \frac{1}{n-1}\left( \varvec{\mathbb {\aleph }}\left( {\varvec{X}}_\mathrm{DLDV}\right) \varvec{\mathbb {\aleph }}^{T}\left( {\varvec{X}}_\mathrm{DLDV}\right) \right) \varvec{\beta }_{t}, \end{aligned}$$

where \(K = \varvec{\mathbb {\aleph }}({\textbf {X}}_\mathrm{DLDV} ) \varvec{\mathbb {\aleph }}^{T} ({\textbf {X}}_\mathrm{DLDV} )\) is the kernel matrix, whose elements are calculated by the Gaussian kernel function \(k(x, y) = e^{-\frac{\left\| x-y\right\| ^{2}}{w}}\), and w represents the bandwidth of the Gaussian kernel.
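A small numerical sketch may help make the kernel eigenproblem concrete. It assumes the standard Gaussian (RBF) form \(e^{-\Vert x-y\Vert ^{2}/w}\) for the kernel, uses random stand-in data, and follows the uncentered eigenproblem as written above; practical KPCA implementations usually also center the kernel matrix in feature space, and the unit-norm scaling of \(\varvec{\beta }_{t}\) returned by the solver is likewise an assumption. The helper name `gaussian_kernel_matrix` is hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, w):
    """k(x, y) = exp(-||x - y||^2 / w) for every pair of rows of A and B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / w)   # clip tiny negative round-off

rng = np.random.default_rng(1)
X_dldv = rng.normal(size=(30, 5))       # hypothetical stand-in features (n = 30, d = 5)
w = 5.0                                 # illustrative bandwidth, not a value from the paper

K = gaussian_kernel_matrix(X_dldv, X_dldv, w)
gamma, beta = np.linalg.eigh(K / (len(X_dldv) - 1))   # gamma_t beta_t = K beta_t / (n - 1)
gamma, beta = gamma[::-1], beta[:, ::-1]              # sort from large to small

# Kernel principal components of one vector: k(x_j, X) beta_t for the first 3 components.
x_j = X_dldv[:1]
S_j = gaussian_kernel_matrix(x_j, X_dldv, w) @ beta[:, :3]
print(S_j.shape)
```

The point of the kernel trick is visible here: only pairwise kernel values are needed, and the mapping into the feature space is never computed explicitly.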

For a given test vector \({\varvec{X}}_\mathrm{DLDV}^{j} \in {\mathbb {R}}^{d}\), the j-th DLDV feature vector, the corresponding kernel principal component can be calculated by [30,31,32]:

$$\begin{aligned} {\varvec{S}}_{t}\left( {\varvec{X}}_\mathrm{DLDV}^{j}\right) &amp;= \varvec{\mathbb {\aleph }}\left( {\varvec{X}}_\mathrm{DLDV}^{j}\right) \varvec{\mathbb {\aleph }}^{T}\left( {\varvec{X}}_\mathrm{DLDV}\right) \varvec{\beta }_{t} \\ &amp;= k\left( {\varvec{X}}_\mathrm{DLDV}^{j}, {\varvec{X}}_\mathrm{DLDV}\right) \varvec{\beta }_{t}, \end{aligned}$$

where \(t = 1,2,\ldots , k\) indexes the first k components retained after dimensionality reduction, that is, \({\varvec{S}}_{t}\left( {\varvec{X}}_\mathrm{DLDV}\right) \in {\mathbb {R}}^{1 \times k}\). Here, we determine the value of k by the cumulative contribution rate of the principal components. Usually, a cumulative contribution rate (P) of 80–90% for the first k principal components means that they contain most of the information of all measurement indicators. To retain more information while still reducing dimensionality, we choose the smallest k for which \(P\ge 95\%\):

$$\begin{aligned} P = \frac{\sum _{i = 1}^{k} \gamma _{i}}{\sum _{i = 1}^{d} \gamma _{i}}. \end{aligned}$$

After DLDV feature extraction and dimensionality reduction for 6 s ECG signals, the minimum k that satisfies Eq. (10) is \(k = 2124\) (\(354\times 6\)). Finally, we combine FT-IMF\(_\mathrm{time}\) with the low-dimensional result to obtain FT-IMF\(_\mathrm{all} \in {\mathbb {R}}^{1 \times (k+6)}\):

$$\begin{aligned} {\text {FT-IMF}}_\mathrm{all} = \left[ {\text {FT-IMF}}_\mathrm{time},{\textbf {S}}_{t}\left( {\textbf {X}}_\mathrm{DLDV}\right) \right] . \end{aligned}$$
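The choice of k and the final concatenation can be sketched as follows. The eigenvalues, scores, and time-view values are made-up numbers chosen only to make the arithmetic easy to follow, and `smallest_k` is a hypothetical helper name.

```python
import numpy as np

def smallest_k(gamma, threshold=0.95):
    """Smallest k whose cumulative contribution rate P reaches the threshold."""
    gamma = np.sort(np.asarray(gamma, dtype=float))[::-1]   # large to small
    ratio = np.cumsum(gamma) / gamma.sum()                  # P for k = 1, 2, ...
    return int(np.searchsorted(ratio, threshold) + 1)

# Made-up eigenvalues: cumulative rates are 0.50, 0.80, 0.90, 0.96, 0.99, 1.00.
gamma = np.array([50.0, 30.0, 10.0, 6.0, 3.0, 1.0])
k = smallest_k(gamma)

scores = np.arange(6, dtype=float)      # hypothetical kernel principal components of one segment
ft_imf_time = np.zeros((3, 2))          # hypothetical mean/variance of the three IMFs
ft_imf_all = np.concatenate([ft_imf_time.ravel(), scores[:k]])   # length k + 6
print(k, ft_imf_all.shape)
```

With these toy numbers the first three components reach only 90%, so the fourth is needed to pass the 95% threshold.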

Proposed dual-input transformer model

Deep learning-based approaches can automatically extract abstract features of samples. However, their complex convolutional and recurrent structures give the hidden layers a large number of front-to-back dependencies, which leads to low parallelism. The Transformer is the first sequence transduction model based entirely on attention; it replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention [23]. Existing studies have shown that the transformer can handle not only translation problems but also the classification of temporal sequences [23], such as ECG sequences [33, 34]. For the first time, we propose a DI-Transformer model for ECG SQA; its overall structure is shown in Fig. 5. Our DI-Transformer model mainly comprises the transformer encoder layer and the classifier layer. Furthermore, the features produced by DLDV feature extraction and KPCA are plugged into our model as augmented features. Note that the transformer encoder layer is formed by stacking six attention modules, each of which includes a multi-head attention block with six heads; the specific composition of the multi-head attention mechanism is given in [23, 35, 36]. Since ECG classification does not require a translation (decoding) process, we replace the decoder part of the transformer with a fully connected layer. We describe the DI-Transformer in detail as follows.

Fig. 5 The overall structure of the proposed DI-Transformer model

Transformer encoder layer

Input embedding and positional encoding: The input embedding of the sequential signal is similar to methods in most natural language processing (NLP) architectures [37]. To get the embedding for each point, the Raw ECG or FT-IMF\(_\mathrm{all}\) is mapped to the \(d_\mathrm{model}\) dimensional space through 1D convolution. It should be noted that we must ensure the consistency of the sequence length before and after convolution through well-designed padding and kernel size. That is, we must ensure the dimension of the embedding output is also \(d_\mathrm{model}\). In addition, we choose the sinusoidal version [23, 36] to provide positional embedding for our input sequence.
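The sinusoidal positional encoding mentioned above can be sketched as below. This follows the widely used formulation from [23] with base 10000 and assumes an even \(d_\mathrm{model}\); the function name and the short toy sequence length are illustrative choices, not settings from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)   # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)    # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=100, d_model=512)
print(pe.shape)
```

The encoding is added element-wise to the embedded sequence, which is why the embedding must keep the sequence length unchanged and produce \(d_\mathrm{model}\) channels.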

Attention module: We stack the attention module six times; each module consists of two parts (the multi-head attention block and the feed-forward network). The former comprises six parallel attention heads, and its internal structure is shown in Fig. 6. After the “input embedding and positional encoding” operation on the Raw ECG, the input vector U of the transformer encoder layer is obtained. Then, we define three transformation matrices: \(\mathrm {W}_{e}^{\mathrm {Q}}\in {\mathbb {R}}^{d_\mathrm{model}\times d_{k}}\), \(\mathrm {W}_{e}^{\mathrm {K}}\in {\mathbb {R}}^{d_\mathrm{model}\times d_{k}}\) and \(\mathrm {W}_{e}^{\mathrm {V}}\in {\mathbb {R}}^{d_\mathrm{model}\times d_{v}}\), \(e = \{1,2,\ldots ,6\}\), and use them to perform three linear transformations on U to get the query (\(Q_{e}\)), key (\(K_{e}\)), and value (\(V_{e}\)). Finally, the e-th head is calculated from \(Q_{e}\), \(K_{e}\), and \(V_{e}\):

$$\begin{aligned} h_{e} = {\text {softmax}}\left( \frac{Q_{e} K_{e}^\mathrm{T}}{\sqrt{d_{k}}}\right) V_{e}, \end{aligned}$$

where T represents the matrix transpose. To combine the results of all \(h_{e}\), we define the transformation matrix \(W^{P}\) and obtain the output of the multi-head attention block through a linear mapping:

$$\begin{aligned} \mathrm{MHAB}(Q,K,V) = \mathrm{Concat}\left( h_{1},h_{2},\ldots ,h_{6}\right) W^{P}, \end{aligned}$$

where \(W^{P}\in {\mathbb {R}}^{6 d_{v}\times d_\mathrm{model}}\) [23]. A residual connection and a layer normalization are then applied to MHAB(Q, K, V) in the “Add & Norm” block. The result is passed to the feed-forward network (the second part of the attention module), which consists of two fully connected layers with a rectified linear unit (ReLU) activation. The output of each attention module is denoted \(X_\mathrm{attention}\). Note that we use layer normalization rather than batch normalization. Again, a residual connection and a layer normalization are applied around the feed-forward network. This yields the output of the transformer encoder layer, which is either used as the input of the next transformer encoder layer or fused with FT-IMF\(_\mathrm{all}\) and fed to the classification layer to determine the final output category.
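The multi-head attention computation can be written compactly in NumPy. The sketch below is a toy illustration: the weights are random, the dimensions are deliberately tiny (\(d_\mathrm{model} = 12\), \(d_{k} = d_{v} = 2\)) and are not the paper's settings, and the residual connection, layer normalization, and feed-forward network are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(U, W_q, W_k, W_v, W_p):
    """U: (seq_len, d_model); W_q, W_k, W_v: one projection matrix per head."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = U @ Wq, U @ Wk, U @ Wv
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # scaled dot-product scores
        heads.append(weights @ V)                           # h_e
    return np.concatenate(heads, axis=-1) @ W_p             # Concat(h_1, ..., h_6) W^P

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_k, d_v = 10, 12, 6, 2, 2       # toy sizes (assumptions)
U = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_k))
W_k = rng.normal(size=(n_heads, d_model, d_k))
W_v = rng.normal(size=(n_heads, d_model, d_v))
W_p = rng.normal(size=(n_heads * d_v, d_model))

out = multi_head_attention(U, W_q, W_k, W_v, W_p)
print(out.shape)
```

Because each head only depends on `U` and its own projection matrices, the loop body can run in parallel, which is the parallelism advantage referred to in the text.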

Dual-input features fusion and classification

In the model initialization phase, we extract the FT-IMF\(_\mathrm{time}\) and FT-IMF\(_\mathrm{freq}\) features through the proposed method and perform KPCA on FT-IMF\(_\mathrm{freq}\). They are then concatenated and used as the second-channel input feature (FT-IMF\(_\mathrm{all}\)) of the DI-Transformer. In the training phase, the Raw ECG of the first channel is divided into mini-batches, position-encoded, and then fed into the transformer encoder layer. For each iteration, we randomly select 6 s of data from each Raw ECG sample (follow-up experiments show that a length of 6 s is optimal). After the Raw ECG passes through the transformer encoder layer, the extracted feature map is flattened and concatenated with the FT-IMF\(_\mathrm{all}\) features prepared in the model initialization phase:

$$\begin{aligned} X_\mathrm{hidden} = \left[ \mathrm{concat}\left( X_\mathrm{attention}^{1}, \ldots X_\mathrm{attention }^{6}\right) , {\text {FT-IMF}}_\mathrm{all}\right] . \end{aligned}$$

The \(X_\mathrm{hidden}\) then goes through a linear layer (a 1D fully connected layer with input dimension \(d_\mathrm{in}\)) followed by a softmax function. The softmax scores are compared with the corresponding input labels to calculate the cross-entropy loss. Finally, the classification layer outputs a vector \(V = (v_{1},v_{2})\), where \(v_{i}\) denotes the probability that the segment belongs to class i (good quality or bad quality).
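The fusion and classification step can be illustrated with a small NumPy sketch. Everything here is a stand-in under stated assumptions: a single random feature map plays the role of the encoder output, the weights are random rather than trained, and the helper names `classify` and `cross_entropy` are hypothetical.

```python
import numpy as np

def classify(x_attention, ft_imf_all, W, b):
    """Flatten the encoder output, append FT-IMF_all, apply a linear layer and softmax."""
    x_hidden = np.concatenate([x_attention.ravel(), ft_imf_all])   # fused feature vector
    logits = x_hidden @ W + b
    logits = logits - logits.max()                                 # numerical stability
    return np.exp(logits) / np.exp(logits).sum()                   # V = (v_1, v_2)

def cross_entropy(probs, label):
    """Cross-entropy loss of one sample for an integer class label."""
    return float(-np.log(probs[label]))

rng = np.random.default_rng(0)
x_attention = rng.normal(size=(4, 3))        # toy encoder feature map (assumption)
ft_imf_all = rng.normal(size=5)              # toy augmented features (assumption)
d_in = x_attention.size + ft_imf_all.size    # input dimension of the linear layer
W = rng.normal(size=(d_in, 2)) * 0.1
b = np.zeros(2)

probs = classify(x_attention, ft_imf_all, W, b)
loss = cross_entropy(probs, label=0)
print(probs, loss)
```

The two entries of `probs` correspond to the good-quality and bad-quality classes; which index maps to which class is a labeling convention not fixed by the text.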

Fig. 6 Multi-head attention blocks

Experiments and results

ECG database and experimental setting

ECG database

This paper employs the Physionet Computing in Cardiology Challenge 2011 (PCCC) [38] database to test the proposed SQA method. The PCCC includes 1500 10 s standard 12-lead ECG recordings with a sampling rate of 500 Hz, and it contains two subsets: set-a includes 1000 12-lead 10 s recordings, and set-b includes 500 12-lead 10 s recordings. This paper employs set-a, which contains 9276 (\(773\times 12\)) 10 s good quality (“acceptable”) ECGs and 2700 (\(225\times 12\)) 10 s bad quality (“unacceptable”) ECGs. In addition, we select 500 single-lead good quality records and 500 single-lead bad quality records from the PCCC to form a test set (test-a). Then, we randomly select em or ma noise after oversampling and use it to contaminate one of the 500 selected good quality records according to the method in [39]; repeating this process 500 times generates 500 records contaminated with em and ma noise. Finally, the 500 generated bad quality records and the 500 good quality records selected from the PCCC are combined into a second test set (test-b). The details of each dataset are described in Table 1. Figure 7 shows good quality and bad quality segments randomly selected from set-a. In addition, each 10 s record of all datasets is normalized with the Z-score, calculated as follows:

Fig. 7 An example of good and bad quality segments selected from set-a and their corresponding heatmaps

Table 1 Details of the datasets used in this paper
$$\begin{aligned} {z} = \frac{x-\mu }{\sigma }, \end{aligned}$$

where x denotes the signal segments, \(\mu \) and \(\sigma \) are the mean value and standard deviation of the signal segments, respectively.
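As a minimal sketch of this normalization, the following NumPy function applies the Z-score to one 10 s record. The synthetic sine record and the guard for a zero standard deviation (for example a flat-line record) are assumptions added for illustration and are not described in the paper.

```python
import numpy as np

def zscore(record):
    # z = (x - mu) / sigma, computed over a single record.
    record = np.asarray(record, dtype=float)
    sigma = record.std()
    if sigma == 0:
        # Assumed guard: a constant record carries no shape information.
        return np.zeros_like(record)
    return (record - record.mean()) / sigma

fs = 500                        # sampling rate of the PCCC recordings (Hz)
t = np.arange(10 * fs) / fs     # 10 s time axis
record = 0.8 * np.sin(2 * np.pi * 1.2 * t) + 0.3  # synthetic stand-in signal
z = zscore(record)
```

After normalization the record has zero mean and unit standard deviation, regardless of its original offset and amplitude.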

Experimental setting

Model parameter settings: The key parameters of the DI-Transformer model are shown in Table 2. Owing to the physiological characteristics of the human body, ECG signal strength is limited to a certain range, so the numerical difference between peaks and troughs is small; the \(d_\mathrm{model}\) is therefore set to 512 [33]. In addition, to converge rapidly and prevent oscillation near a local minimum, the learning rate is dynamically adjusted during training.

The whole method is developed and trained using TensorFlow and PyTorch. Our experiments are performed on a computer with an Intel(R) Core(TM) i5-7640X CPU @ 4.00 GHz and two GeForce GTX 1080 Ti GPUs, each with 11 GB of memory.

Performance evaluation: To evaluate the performance of the proposed method for SQA, we adopt five-fold cross-validation. Set-a is randomly divided into five equal subsets; each subset is used as the test set in turn, and the remaining four subsets are used for training. However, less than a quarter of the data is labeled as bad quality. It is well known that building classifiers on an unbalanced dataset causes bias and results in poor generalization. When prior probabilities (and Bayesian training paradigms) are not used, one way to overcome this problem is to balance the dataset. Therefore, we balance the dataset by adding real noise [em and ma noise from NSTDB [40] and additive Gaussian white noise (AGWN)] to the good quality segments to generate additional bad quality data. Note that we oversample the em and ma noises to 500 Hz before adding them to the training subset; the sampling rate of the AGWN is also 500 Hz. The method of balancing the dataset is described in [39]. For each cross-validation task, we balance the training subset (containing \(7421\approx 9276/5 \times 4\) 10 s good quality segments and \(6838\approx 2700/5\times 4+4678\) 10 s bad quality segments) but keep the test subset unchanged (containing \(1855\approx 9276/5\) 10 s good quality segments and \(540\approx 2700/5\) 10 s bad quality segments).
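The fold sizes quoted above can be checked with a few lines of arithmetic. This sketch only reproduces the counts given in the text; the variable names are illustrative, and rounding to the nearest integer is an assumption about how the approximate values were obtained.

```python
n_good, n_bad, k = 9276, 2700, 5
added_bad = 4678  # generated bad quality segments per training subset (from the text)

train_good = round(n_good / k * (k - 1))             # four of five folds
train_bad = round(n_bad / k * (k - 1)) + added_bad   # original plus generated
test_good = round(n_good / k)                        # held-out fold, unchanged
test_bad = round(n_bad / k)

# Share of bad quality segments before and after balancing the training subset.
bad_share_before = n_bad / (n_good + n_bad)
bad_share_train = train_bad / (train_good + train_bad)
```

Under these assumptions, the share of bad quality segments rises from roughly 23% in set-a to roughly 48% in each balanced training subset, while the test subset keeps the original class ratio.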

In addition, we employ multiple indicators to evaluate the performance of the proposed method: sensitivity (Se), specificity (Sp), precision (\(P_{+}\)), accuracy (Acc), \(F_{1}\) and area under the curve (AUC) [41]. Note that for extremely unbalanced data (e.g., a low prevalence or incidence of a disease in the total population), the ROC curve and AUC are only partially meaningful. Carrington et al. [42] give an effective solution to this problem; here, we balance the training set. The indicators are defined as follows:

$$\begin{aligned} {\text {Se}}(\%)= & {} \frac{\mathrm{T P}}{\mathrm{T P}+\mathrm{F N}} \times 100 \%, \end{aligned}$$
$$\begin{aligned} {\text {Sp}}(\%)= & {} \frac{\mathrm{TN}}{\mathrm{TN} +\mathrm{FP}} \times 100 \%, \end{aligned}$$
$$\begin{aligned} \mathrm {P}_{+}(\%)= & {} \frac{\mathrm{T P}}{\mathrm{T P}+\mathrm{F P}} \times 100 \%, \end{aligned}$$
$$\begin{aligned} {\text {Acc}}(\%)= & {} \frac{\mathrm{T P}+\mathrm{T N}}{\mathrm{T P}+\mathrm{T N}+\mathrm{F P}+\mathrm{F N}} \times 100 \%, \end{aligned}$$
$$\begin{aligned} \mathrm {F}_{\text{1 }}(\%)= & {} \frac{2 P_{+} \times \mathrm{S e}}{P_{+}+\mathrm{S e}} \times 100 \%, \end{aligned}$$

where TP is true positives, TN is true negatives, FP is false positives and FN is false negatives.
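The definitions above translate directly into code. The following sketch computes the five indicators from confusion-matrix counts; the example counts are made up for illustration, and \(F_{1}\) is computed from \(P_{+}\) and Se as fractions before the final scaling to percent, which is an assumption about how the formula is meant to be read.

```python
def sqa_metrics(tp, tn, fp, fn):
    # All ratios are computed as fractions and scaled to percent at the end.
    se = tp / (tp + fn)                    # sensitivity
    sp = tn / (tn + fp)                    # specificity
    p_plus = tp / (tp + fp)                # precision
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    f1 = 2 * p_plus * se / (p_plus + se)   # harmonic mean of P+ and Se
    values = {"Se": se, "Sp": sp, "P+": p_plus, "Acc": acc, "F1": f1}
    return {name: 100 * value for name, value in values.items()}

# Hypothetical counts, not results from the paper.
m = sqa_metrics(tp=90, tn=80, fp=20, fn=10)
```

Which class is treated as positive changes Se, Sp and \(P_{+}\), so the positive class has to be fixed before results are compared across methods.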

Table 2 The parameter setting of the proposed DI-Transformer model

Experimental results

Table 3 The main parameter settings of each model in the Python development environment

Performance evaluation of DLDV features

To evaluate the performance of the DLDV features extracted by our method, we employ four traditional classifiers, namely Gaussian Kernel Support Vector Machines (G-SVM) [43], Logistic Regression (LR) [44], Random Forests (RF) [45], and K-Nearest Neighbors (KNN) [46] (the parameter settings of each classifier are shown in Table 3), and six time-frequency dependent SQIs [10, 47, 48]: sSQI, kSQI, pSQI, LpSQI, MpSQI, and HpSQI. Table 4 shows the binary classification results of ECG signal quality obtained with each feature on the four traditional classifiers. Figure 8 shows the confusion matrices obtained with the DLDV features (FT-IMF\(_\mathrm{freq}\)) on the four classifiers. Table 4 shows that our DLDV features outperform the six traditional SQIs on G-SVM, LR, RF and KNN. Our DLDV features perform best on G-SVM, where Se, \(P_{+}\) and Acc reach 93.42, 97.85 and 93.32%, respectively. Among the six comparison SQIs, sSQI performs best, on KNN, with Se, \(P_{+}\) and Acc of 89.91, 93.27 and 87.92%, respectively. Even so, its Acc is still 5.40 percentage points lower than that of our method. These results show that the DLDV features outperform the six comparison SQIs.

Table 4 Average results of five-fold cross-validation performed on set-a using four classifiers for each SQI
Fig. 8 Confusion matrices obtained by our DLDV feature on the four classifiers (G-SVM, LR, RF and KNN). a Confusion matrix on G-SVM. b Confusion matrix on LR. c Confusion matrix on RF. d Confusion matrix on KNN

To further test the performance of the proposed method, instead of randomly combining SQIs to train the classification model, we generate new combinations of SQIs in order of decreasing average accuracy of the six SQIs on the four classifiers. These combinations are then compared with DLDV and FT-IMF\(_\mathrm{all}\), and the results on each classifier are shown in Table 5. The combination of all six SQIs has the highest Acc among all combinations, but it is still lower than the Acc of DLDV and FT-IMF\(_\mathrm{all}\). This shows that our features perform better than the six traditional SQIs. Furthermore, our DLDV feature performs best on G-SVM (Acc = 93.32%), which benefits from both the DLDV features and the strong performance of the SVM classifier with a Gaussian kernel function. The results obtained on KNN (Acc = 92.98%) are slightly inferior to those on G-SVM. Our features perform less well on LR (Acc = 87.76%), even lower than SQI\(_\mathrm{features}\) on KNN (Acc = 89.98%), but still slightly ahead of the results for the combination of all six SQIs. Overall, our method outperforms these six traditional SQIs in quality classification.

Comparison of our DI-Transformer and four traditional classifiers

This section compares our DI-Transformer with four traditional methods (G-SVM, LR, RF and KNN). Four features (SQI\(_\mathrm{features}\), FT-IMF\(_\mathrm{time}\), FT-IMF\(_\mathrm{freq}\) and FT-IMF\(_\mathrm{all}\)) are used to build five categories of classifiers, and the results on the test set are shown in Table 6. Among the four traditional models built with SQI\(_\mathrm{features}\), KNN achieves the highest accuracy (Acc = 89.98%), but this is still lower than the result of DI-Transformer (Acc = 91.26%). The classification models built with FT-IMF\(_\mathrm{all}\) generally perform better than those built with SQI\(_\mathrm{features}\). With FT-IMF\(_\mathrm{all}\), the result on G-SVM (Acc = 94.27%) is better than that on KNN (Acc = 93.64%), although Table 7 and Fig. 13b show that KNN (AUC = 0.962) outperforms G-SVM (AUC = 0.921) in terms of AUC. More importantly, combined with FT-IMF\(_\mathrm{all}\), our DI-Transformer achieves the best overall performance (Acc = 99.62% and AUC = 0.993). The p values in Table 8 show that the difference between the proposed DI-Transformer and these four traditional classifiers in expressing signal quality is statistically significant.

Table 5 Average Acc of fivefold cross-validation performed on balanced set-a using four classifiers for the different combinations of SQIs

Ablation study on DI-Transformer model

In this section, we design a series of ablation experiments to comprehensively evaluate the performance of the proposed DI-Transformer. Experiment A uses only the FT-IMF\(_\mathrm{freq}\) feature as the input to train the transformer-based model. Building on experiment A, experiment B uses both FT-IMF\(_\mathrm{freq}\) and FT-IMF\(_\mathrm{time}\) as the input to train the transformer-based model. Experiment C uses only Raw ECG as the input to train the transformer model. Building on C, experiment D treats FT-IMF\(_\mathrm{time}\) as an augmented feature, which is concatenated with the output of the transformer encoder layer and fed to the classification layer. Experiment E encodes the Raw ECG as the input of the transformer and uses the dimension-reduced FT-IMF\(_\mathrm{freq}\) as an augmented feature, which is fed into the classification layer along with the output of the transformer encoder layer (see Fig. 5). Building on experiment E, experiment F treats both FT-IMF\(_\mathrm{freq}\) and FT-IMF\(_\mathrm{time}\) as augmented features, which are concatenated with the output of the transformer encoder layer and fed to the classification layer. Note that, whereas experiments A, B and C use a single-input structure, experiments D, E and F augment the features through a dual-input structure, whose main advantage is that it can fully utilize the depth local dual-view features.

Table 6 The Acc values obtained by different features on five classifiers
Table 7 The AUC values obtained by different features on five classifiers
Table 8 The p values of AUC between DI-Transformer and traditional methods by using different features
Table 9 Results of ablation studies performed on DI-Transformer with five-fold cross-validation
Fig. 9 Confusion matrix corresponding to each ablation experiment. a–f represent the confusion matrices corresponding to ablation experiments A–F, respectively

Table 9 shows the results of the ablation experiments, and Fig. 9 shows the six corresponding confusion matrices. As shown in Table 9, the Acc of the transformer-based model reaches 95.49% in experiment A, and the Acc of experiment C reaches 92.57%. Compared with experiment C, the Acc of experiment E (DI-Transformer model) increases by 6.01%. This shows that, as an augmented feature, FT-IMF\(_\mathrm{freq}\) significantly improves the performance of the model. Comparing experiments A and C, we find that feeding FT-IMF\(_\mathrm{freq}\) into the transformer improves classification performance more effectively than feeding Raw ECG into it directly. In experiment B, the Acc of the transformer-based model reaches 97.70% and the \(F_{1}\) reaches 98.51%. Comparing experiments A and B shows that adding FT-IMF\(_\mathrm{time}\) also improves the classification performance of the model, but its contribution is not as significant as that of FT-IMF\(_\mathrm{freq}\). Experiment F maximizes the performance of the proposed DI-Transformer method: its Se, Sp, \(P_{+}\) and Acc values reach 99.68, 99.44, 99.83 and 99.62%, respectively. As shown in Fig. 9f, only 0.25% of the good quality data are misclassified as bad quality data. These results show that our DI-Transformer performs much better than G-SVM and KNN.

Performance of each model to recognize the MA noise

First, we select the four traditional classification models trained with the SQI\(_\mathrm{features}\) and FT-IMF\(_\mathrm{all}\) features, respectively. The performance of these models is then tested on artificial test sets with a progressively increasing proportion of MA-contaminated ECG segments. To test the ability of each model to identify MA-contaminated ECG, we generate a series of test sets with a fixed total number of samples (1000) by adjusting the proportion of data drawn from test-a and test-b. We take data from test-a and test-b at ratios of 8:2, 6:4, 4:6 and 2:8, and denote the generated test sets as test-ab1, test-ab2, test-ab3 and test-ab4, respectively. The results of the four traditional classifiers trained with SQI\(_\mathrm{features}\) on each test subset are shown in Table 10. As the proportion of MA-contaminated ECG segments increases, the Acc of all four classifiers decreases to different degrees. Under the same proportion, KNN performs better than the other three classifiers. Figure 10a shows the results obtained on test-ab1 and test-ab4. It can be seen that these classifiers are sensitive to MA noise. The results of the five classifiers trained with FT-IMF\(_\mathrm{all}\) on each test subset are shown in Table 11. As the proportion of MA-contaminated ECG segments increases, the accuracy of all five classifiers also decreases to different degrees, but the decrease is much smaller than that in Table 10. As shown in Fig. 10b, the results on test-ab1 and test-ab4 confirm this. The results in Tables 12 and 13 illustrate that the contribution of our FT-IMF\(_\mathrm{all}\) features to identifying MA noise is significant at the p = 0.05 level, and that our DI-Transformer based on FT-IMF\(_\mathrm{all}\) outperforms the four conventional classifiers across the board in recognizing MA noise.
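The construction of the mixed test sets can be sketched as an index-sampling step. The text does not state how records are drawn from test-a and test-b, so uniform random sampling without replacement, the fixed seed, and the helper name `mix_indices` are assumptions for illustration only.

```python
import numpy as np

def mix_indices(n_a, n_b, ratio_a, ratio_b, total=1000, seed=0):
    # Draw record indices from test-a and test-b at the ratio ratio_a:ratio_b.
    rng = np.random.default_rng(seed)
    take_a = total * ratio_a // (ratio_a + ratio_b)
    take_b = total - take_a
    idx_a = rng.choice(n_a, size=take_a, replace=False)
    idx_b = rng.choice(n_b, size=take_b, replace=False)
    return idx_a, idx_b

ratios = {
    "test-ab1": (8, 2),
    "test-ab2": (6, 4),
    "test-ab3": (4, 6),
    "test-ab4": (2, 8),
}
# test-a and test-b each hold 1000 records (500 good and 500 bad quality).
mixes = {name: mix_indices(1000, 1000, a, b) for name, (a, b) in ratios.items()}
```

Each mixed set keeps the total at 1000 records while the share drawn from test-b, which holds the MA-contaminated records, grows from 20% to 80%.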

Table 10 Acc of SQI\(_\mathrm{features}\) on the four traditional classifiers
Fig. 10 The ability of FT-IMF\(_\mathrm{all}\) to recognize signal segments containing MA (em and ma)-contaminated ECG. a The results on classification models built with SQI\(_\mathrm{features}\) features. b The results on classification models built with FT-IMF\(_\mathrm{all}\) features

Optimal data length and computational time

To find the optimal segment length (\(N_\mathrm{seg}\)) for SQA, we repeat experiment F ten times on set-a with \(N_\mathrm{seg}\) varying from 1 to 10 s in increments of 1 s. Throughout the experiment, we change only \(N_\mathrm{seg}\); the relationship between \(N_\mathrm{seg}\) and the accuracy of SQA is shown in Fig. 11a. As \(N_\mathrm{seg}\) increases, the accuracy of quality classification of our model also increases. However, when \(N_\mathrm{seg}\) is greater than 6 s, the accuracy hardly improves, which shows that a 6 s segment already covers most of the features required for signal quality classification. In addition, Fig. 11b shows the relationship between segment length and training and testing times. For \(N_\mathrm{seg}\) up to 5 s, the training and testing times increase slowly; beyond 6 s, they rise rapidly. Combining the results of Fig. 11a, b and weighing classification accuracy against computational cost, we choose \(N_\mathrm{seg} = 6\) s as the optimal signal segment length.
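The effect of \(N_\mathrm{seg}\) on input size can be sketched as follows. The text does not say which part of each 10 s record is kept, so taking the first \(N_\mathrm{seg}\) seconds is an assumption, and `crop` is a hypothetical helper.

```python
import numpy as np

FS = 500  # sampling rate of the PCCC recordings (Hz)

def crop(record, n_seg_s, fs=FS):
    # Keep the first n_seg_s seconds of a record (assumed cropping rule).
    n = int(n_seg_s * fs)
    if n > len(record):
        raise ValueError("segment length exceeds record length")
    return record[:n]

record = np.zeros(10 * FS)  # placeholder for one 10 s record
lengths = {n_seg: len(crop(record, n_seg)) for n_seg in range(1, 11)}
```

At 500 Hz a 6 s segment contains 3000 samples, compared with 5000 samples for the full 10 s record.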

Table 11 Acc of FT-IMF\(_\mathrm{all}\) on all five classifiers
Table 12 p values of Acc between SQI\(_\mathrm{features}\) and FT-IMF\(_\mathrm{all}\) on the four traditional classifiers
Table 13 p values of Acc between the four traditional classifiers and our DI-Transformer using FT-IMF\(_\mathrm{all}\) features
Fig. 11 The relationship between segment length and classification accuracy and computation time. a The relationship between the segment length and the accuracy. b The relationship between the segment length and the training and testing times

Performance comparison

Table 14 Performance comparison with other methods

This paper employs the PCCC [38] database, which has also been used in other studies. Table 14 lists several well-performing methods evaluated on this database. Albaba et al. [49] constructed an SQA pipeline by combining multiple time-frequency domain features with multiple traditional classifiers, and obtained good results with the Medium Gaussian SVM (MG-SVM) classifier. The method achieves an accuracy of 93.00% on MG-SVM, which is comparable to the result obtained by our FT-IMF\(_\mathrm{all}\) features on G-SVM (Acc = 94.27%), but still much lower than that of our DI-Transformer (Acc = 99.62%). Shahriari et al. [13] used the SSIM to compare ECG images obtained from two ECGs at standard scales, and then trained a linear discriminant analysis classifier for SQA using the SSIM between each image and all templates as feature vectors. Compared with the other methods, theirs obtained a lower accuracy. Behar et al. [10] employed indicators such as kSQI, sSQI, pSQI, basSQI, bSQI, pcaSQI, and rSQI, and trained an SVM model to evaluate the quality of ECG signals to reduce false alarms, achieving an accuracy of 99.30%. This result is higher than that of our G-SVM based on FT-IMF\(_\mathrm{all}\) but slightly lower than that of our DI-Transformer. It is worth noting that our methods have a strong ability to recognize MA noise, whereas [10] targeted ordinary noisy signals and did not consider the interference of MA noise. Therefore, although their performance metrics are high, they are not entirely comparable with ours. The methods in [13, 49] also hardly consider MA-contaminated ECG. In addition, the proposed methods have good interpretability and can achieve accurate ECG SQA even for data containing a large amount of MA noise.


Analyzing the performance of DLDV features

This paper uses EMD and FFT to extract the DLDV features of ECG signals. Four traditional classifiers (G-SVM, LR, RF, and KNN) are then employed to evaluate the performance of the extracted DLDV features. We also employ six traditional time-frequency related SQI metrics as references. In general, the larger the difference in signal quality, the larger the difference in SQI value. For example, as shown in Fig. 12, because the probability density distributions of signals of different quality differ clearly, the kurtosis (kSQI) and skewness (sSQI) provide effective information for distinguishing good quality signals from bad quality signals. In addition, the other four time-frequency related SQIs are all valid SQA indicators that have been verified by researchers and have achieved good results in practical SQA [4, 5, 43, 44]. Therefore, this paper selects them as references to evaluate the reliability of our DLDV features for SQA.
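To make the two distribution-based reference indicators concrete, the sketch below computes skewness and kurtosis as the third and fourth standardized moments of a segment. This is the textbook definition; whether the cited works use plain or excess kurtosis is not stated here, so the non-excess form is an assumption, and the test signals are synthetic.

```python
import numpy as np

def s_sqi(x):
    # Skewness: third standardized moment of the sample distribution.
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / x.std() ** 3)

def k_sqi(x):
    # Kurtosis: fourth standardized moment (equals 3 for a Gaussian).
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 4) / x.std() ** 4)

fs = 500
t = np.arange(10 * fs) / fs
clean = np.sin(2 * np.pi * 1.0 * t)  # symmetric, smooth stand-in signal
spiky = clean.copy()
spiky[100] += 20.0                   # one large artifact-like spike

clean_k, spiky_k = k_sqi(clean), k_sqi(spiky)
clean_s = s_sqi(clean)
```

A single large spike makes the sample distribution heavy-tailed and raises the kurtosis sharply, which illustrates why these statistics shift when the signal quality changes.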

Fig. 12 The probability density function of kSQI and sSQI on set-a (9276 good quality and 2700 bad quality) before normalization

Fig. 13 ROC curves of several classification models on set-a. a ROC curves obtained on four traditional classifiers based on the SQI\(_\mathrm{features}\) feature. b ROC curves obtained on five classifiers based on the FT-IMF\(_\mathrm{all}\) feature

Table 4 and Fig. 8 show the classification results and confusion matrices of the six traditional SQIs and the DLDV features on the four classifiers. The DLDV features outperform these traditional SQI metrics on all four classifiers; even on LR, where accuracy is lowest, the SQI\(_\mathrm{features}\) still score below our DLDV features. Our method comprehensively outperforms the six traditional SQIs because the features it extracts not only express the central tendency and dispersion of the signal segment, but also use the phase angle and amplitude–frequency values to express the transient changes of the signal.
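The amplitude–frequency values and phase angles mentioned above can be obtained from an FFT. The sketch below shows only this generic step on a synthetic component; it omits the EMD decomposition and does not reproduce the DLDV feature construction defined earlier in the paper. The function name and the test signal are assumptions for illustration.

```python
import numpy as np

def amplitude_and_phase(component, fs=500):
    # One-sided FFT of a real-valued component (for example one IMF).
    spectrum = np.fft.rfft(component)
    freqs = np.fft.rfftfreq(len(component), d=1.0 / fs)
    amplitude = np.abs(spectrum) / len(component)  # amplitude-frequency values
    phase = np.angle(spectrum)                     # phase angle in radians
    return freqs, amplitude, phase

fs = 500
t = np.arange(10 * fs) / fs
# Synthetic stand-in for an IMF: a 5 Hz cosine with a phase offset of pi/4.
component = np.cos(2 * np.pi * 5.0 * t + np.pi / 4)

freqs, amp, phase = amplitude_and_phase(component, fs)
peak = int(np.argmax(amp))
```

For this synthetic component the spectral peak lies at 5 Hz and its phase angle recovers the offset of \(\pi /4\), so the phase carries timing information that the amplitude alone does not.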

Analyzing the performance of each model to recognize the MA noise

We also design experiments to test the ability of the proposed method to recognize MA-contaminated ECGs. Our DLDV features work well for MA-contaminated ECGs, as confirmed by Fig. 10 and Tables 10, 11, 12 and 13. Table 10 reflects the ability of SQI\(_\mathrm{features}\) to express MA noise: as the MA noise increases, the accuracy of all four classifiers decreases, and the smallest decrease is 6.82%. Table 11 shows that, under the same conditions, the accuracy of all four classifiers also decreases, but the largest decrease is only 1.08%. Tables 12 and 13 give the results of the statistical analysis of Tables 10 and 11. The p values show that the difference between SQI\(_\mathrm{features}\) and FT-IMF\(_\mathrm{all}\) in expressing MA noise is statistically significant. The results in Fig. 10a show that SQI\(_\mathrm{features}\) are limited in expressing MA noise. Because these metrics are based on human-defined desirable properties of clean signals, they have inherent limitations in expressing latent features of signal quality [17]. In addition, it is difficult to manually specify the features of MA noise that resembles the ECG signal, so it is not surprising that such features are hard to extract with SQI\(_\mathrm{features}\). In contrast to Fig. 10a, the results obtained by each classifier in Fig. 10b on the two test sets are very close, with an average difference of 0.76%. This shows that classifiers built with our features identify general noise well and, more importantly, also perform strongly in identifying MA noise. Furthermore, our DI-Transformer structure achieves high accuracy on test-ab4. This high accuracy is due not only to the dual-input structure but, more importantly, to the transformer’s self-attention module, which captures the temporal relationships of the signal and combines them with the DLDV features to improve the model’s ability to recognize MA noise. Note that we do not use the FT-IMF\(_\mathrm{time}\) feature in this test because it can only express the central tendency and dispersion of the signal and cannot fully reflect its transient changes.

Analyzing the performance of proposed DI-Transformer

The effectiveness and robustness of our FT-IMF\(_\mathrm{all}\) feature for SQA are verified on traditional classifiers (G-SVM, LR, RF, and KNN). We also propose a DI-Transformer SQA method based on the FT-IMF\(_\mathrm{all}\) features. Table 9 presents the ablation experiments for the proposed DI-Transformer method, and Fig. 9 shows the confusion matrix corresponding to each experiment. The results of experiments C, D, E and F show that the contribution of FT-IMF\(_\mathrm{freq}\) to SQA is much more significant than that of FT-IMF\(_\mathrm{time}\). The results of experiments C and E show that the proposed dual-input structure significantly improves the model’s classification performance. Feeding FT-IMF\(_\mathrm{freq}\) to the transformer as input (experiment A) is much better than feeding it the Raw ECG directly (experiment C), which shows that DLDV features help the transformer model learn the quality features more easily. This is because the phase angle features represent the transient changes of the signal well and, combined with the amplitude features, allow these changes to be quantified. We also observe that the Se value of experiment E is higher than that of experiment C, and that the accuracy of experiment F is the best. This indicates that experiment F tends to identify more signal segments as good quality, with the advantage of not missing valuable signals in subsequent processing stages, which is also demonstrated by the confusion matrix in Fig. 9f. From this point of view, the abstract features automatically extracted by the transformer from Raw ECG are complementary to the FT-IMF\(_\mathrm{freq}\) features. Comparing A with E and B with F, we find that the DI-Transformer combines the advantages of DLDV features and transformer-based abstract features, and achieves higher Se, Sp and Acc values. It obtains more effective signal quality features than the single-input structure (A, B and C).

We also compare the proposed DI-Transformer with four traditional classifiers. Table 6 shows that the result on SQI\(_\mathrm{features}\) is inferior to that on our FT-IMF\(_\mathrm{all}\) but higher than that on our FT-IMF\(_\mathrm{time}\), because FT-IMF\(_\mathrm{time}\) does not focus on the nuances of signal and noise. The AUC values in Table 7 show that our DI-Transformer exhibits the best performance on all features, followed by KNN combined with FT-IMF\(_\mathrm{all}\). Furthermore, the p values in Table 8 show that the difference between the method based on SQI\(_\mathrm{features}\) and the method based on FT-IMF\(_\mathrm{all}\) is statistically significant. Such good results are not surprising, because our method rarely considers the morphology of the Raw ECG and instead mines the depth local features of the signal. We extract not only the transient amplitude features of the intermediate components of the signal (IMFs), but also the transient phase angle features that can express the subtle difference between the signal and the noise (especially MA noise). Equally important, among the traditional classifier-based methods, although the accuracy of the FT-IMF\(_\mathrm{all}\) features on G-SVM is higher than that on KNN, the receiver operating characteristic (ROC) curve of each model in Fig. 13 shows that DI-Transformer performs best (AUC = 0.993). Therefore, the DI-Transformer-based model built with FT-IMF\(_\mathrm{all}\) can provide a new practical solution for SQA. In addition, Fig. 13b shows that, among the traditional classifiers, the KNN model built with FT-IMF\(_\mathrm{all}\) exhibits the best performance (AUC = 0.962), followed by RF (AUC = 0.948). If a user builds the signal quality classifier with a traditional method, the KNN or RF method based on FT-IMF\(_\mathrm{all}\) is therefore preferable under the same conditions.


In summary, we present a novel ECG SQA method that fuses the proposed DLDV features with the DI-Transformer framework to improve the recognition of MA-contaminated ECG. For the first time, we combine DLDV features and a transformer to handle the ECG SQA problem. Specifically, we use EMD and FFT to extract DLDV features of Raw ECG in the time-frequency domain. The extracted DLDV features can identify subtle differences between MA and ECG signals through depth local amplitude and phase angle features. When they are fused with the temporal relationship features extracted by DI-Transformer, accuracy improves significantly compared with the method based on traditional SQIs. Experiments on SQA tasks show that the proposed method outperforms the state-of-the-art SQA methods. In addition, our method can not only identify common types of noise in noise-contaminated ECGs but, more importantly, can also effectively identify MA-contaminated ECG. In the future, we will improve the proposed method and adapt it to SQA of other physiological signals, such as the electroencephalogram and electromyogram.