Fusing depth local dual-view features and dual-input transformer framework for improving the recognition ability of motion artifact-contaminated electrocardiogram

Electrocardiograms (ECGs) recorded by wearable devices for heart health monitoring are often contaminated by various noises to varying degrees. Using signal quality indicators (SQIs) to perform signal quality assessment (SQA) is among the most promising ways to solve this problem, but the performance of SQIs in expressing the quality features of ECGs contaminated by motion artifact (MA) noise remains disappointing. Here, we present a novel SQA method that fuses the proposed depth local dual-view (DLDV) features with a dual-input transformer (DI-Transformer) framework to improve the recognition of MA-contaminated ECGs. The DLDV features identify subtle differences between MA and ECG through depth local amplitude and phase angle features. When they are fused with the temporal relationship features extracted by the DI-Transformer, the accuracy improves significantly compared with SQI-based methods. We also verify the robustness and accuracy of the DLDV features on four traditional classifiers. Finally, we conduct experiments on two datasets. On the PhysioNet/Computing in Cardiology Challenge dataset, the DLDV features (Acc = 95.49%) outperform the combination of six SQI features (Acc = 91.26%). Combined with our DI-Transformer, they deliver an accuracy of 99.62%, outperforming state-of-the-art SQA methods. On the artificial test set constructed with MA noise, our DI-Transformer outperforms four traditional methods and delivers an accuracy of 97.69%.


Introduction
Traditional electrocardiogram (ECG) analysis usually requires doctors to diagnose and treat based on the patient's ECG wave information. However, ECGs recorded by wearable devices are commonly contaminated by various noises. Contamination by motion artifacts [MA: muscle artifact (ma) and electrode motion artifact (em)] in particular results in a large number of poor-quality signals, which seriously hinders doctors' diagnosis and delays patients' timely treatment. To make matters worse, some MA frequency components overlap with the band of the ECG signal, which limits filtering methods in the frequency domain, and some MA have a morphology similar to that of certain ECG signals, which limits filtering methods in the time domain [1]. It is challenging to eliminate these noises without distorting the clinical features [2]. In general, there are two ways to solve this problem. The first is to use denoising techniques [3][4][5][6], which work well on baseline wander, high-frequency noise, etc., but have difficulty removing the MA noise mentioned above. The other is to discard signals heavily contaminated by MA through signal quality assessment (SQA) [7,8]. Currently, the mainstream SQA methods can be roughly divided into two categories. The first category is based on traditional machine learning and signal quality indicators (SQIs) [9][10][11][12][13][14]. For example, Xia et al. proposed an ECG SQA method based on a support vector machine (SVM) and multi-feature fusion with waveform attributes, power spectrum, R-wave detection, and other characteristics [9]. Behar et al. employed indicators such as kSQI, sSQI, pSQI, basSQI, bSQI, pcaSQI, and rSQI, and trained an SVM model to evaluate the quality of ECG signals to reduce false alarms [10]. Satija et al. calculated the SQIs through signal loss detection, baseline mutation extraction, and high-frequency noise detection and extraction to evaluate the clinical acceptability of ECG signals [11].
Zhang et al. adopted waveform feature-based methods (including lead-off features, baseline wander features, power spectral features, and nonlinear features) to train random forest and SVM models for SQA [12]. Shahriari et al. used a structural similarity measure (SSIM) to compare ECG images obtained from two ECGs at standard scales. A representative subset of ECG images is then selected from the training set as templates by a clustering method. Finally, the SSIM between each image and all templates is used as the feature set to train a linear discriminant analysis classifier for SQA [13]. Holzinger et al. provided a taxonomy of various entropy methods, describing in more detail approximate entropy, sample entropy, fuzzy entropy, and particularly topological entropy for finite sequences. They also state that entropy measures have been tested successfully for analyzing short, sparse and noisy time series data [14]. These hand-crafted features have the advantage of interpretability and can reflect specific descriptions of ECG features to a certain extent. However, these SQIs rely on human-defined desirable properties of clean signals, which leads to inherent limitations in expressing the latent features of signal quality. Moreover, they rarely consider how to extract effective ECG features under MA interference.
The second category is deep learning-based methods [15][16][17][18], which usually utilize abstract features extracted by deep learning techniques or combine them with hand-crafted features to implement SQA. For instance, Liu et al. proposed a new method that combines deep learning-based Stockwell Transform (S-Transform) spectrogram features and hand-crafted statistical features to achieve SQA [15]. Huerta et al. combined convolutional neural networks and wavelet transform to robustly identify high-quality ECG segments in the challenging setting of single-lead recordings of alternating sinus rhythms, atrial fibrillation episodes, and other rhythms [16]. Seeuws et al. used an unsupervised deep learning model to derive a data-driven quality metric that outperformed some traditional metrics (kSQI, sSQI, IOR, pSQI, basSQI, bSQI, and pcaSQI) and highlighted the consistently superior performance of their metric across different tasks [17]. Zhang et al. designed a comprehensive feature set (covering spectral distribution, signal complexity, horizontal and vertical variations of waves, etc.) and utilized two long short-term memory (LSTM) layers to learn time-dependent features automatically [18]. Compared with hand-crafted features, the abstract features extracted by deep learning methods describe ECG recordings from a different, data-driven perspective. However, they seldom offer effective solutions to MA interference whose morphology resembles, and whose frequency band overlaps with, that of some ECG signals. In addition, they rarely address the interpretability of these features or the relationships between them.
Here, we mainly address two problems: (1) noise such as MA, whose morphology is similar to and whose frequency bands alias with those of some ECGs, can easily deceive machine learning methods, resulting in low SQA accuracy; (2) hand-crafted features require substantial human intervention and cannot express signal quality comprehensively. We propose a novel SQA method that fuses depth local dual-view (DLDV) features and a dual-input Transformer (DI-Transformer) framework to improve the recognition of MA-contaminated ECGs. Specifically, we extract the first three intrinsic mode functions (FT-IMF) of the signal through empirical mode decomposition (EMD) [19] and then employ the fast Fourier transform (FFT) [20] to further explore the deeper local amplitude and phase angle features of the FT-IMF. The DLDV features are then dimensionally reduced by kernel principal component analysis (KPCA) [21] and employed to identify subtle differences between MA and ECG signals through depth local amplitude and phase angle features. At the same time, we analyze the central tendency and dispersion degree of the FT-IMF and combine the result with the dimensionality-reduced DLDV features to form augmented features (FT-IMF_all). Finally, FT-IMF_all is fused with the temporal relational features extracted from the Raw ECG by the proposed DI-Transformer framework to train the SQA model. In particular, the phase angle features we extract contain the contribution of each time sample point, so they can quantify subtle changes in ECGs at the level of individual samples and, in turn, distinguish the nuances of ECGs and MA. As far as we know, there is no literature on extracting DLDV features (phase angle and amplitude-frequency features) from the FT-IMF to achieve SQA. Only Lee et al. calculated the mean, variance, and Shannon entropy from the first IMF (F-IMF) obtained by EMD, and then used them for SQA [22].
These indicators can reflect the signal's central tendency and dispersion degree but cannot fully reflect the deeper local features used to distinguish MA noise, because the feature information computed by their method loses the temporal features. In this paper, the DLDV features extracted through the FT-IMF not only solve the problem that traditional methods cannot obtain the characteristic features of MA, but also have the advantage of interpretability. We also verify the accuracy and robustness of the DLDV features on four traditional classifiers and provide an accurate and efficient SQA scheme based on K-Nearest Neighbor (KNN). In addition, our proposed DI-Transformer model is based on the transformer [23] architecture, whose multi-head attention module can be executed in parallel and can capture the temporal relationships of the ECG signal. Our strategy of combining the features with the transformer model overcomes the need of traditional machine learning for extensive human intervention while accurately distinguishing MA noise from ECGs. The contributions of this study can be summarized as follows:
• The proposed DLDV features can identify subtle differences between MA and ECG signals through depth local amplitude and phase angle features, which provides a practical and novel solution for identifying MA-contaminated ECGs.
This paper is organized as follows: "Methodology" presents the data used in the experiments and the details of our method. "Experiments and results" demonstrates the experimental results. Finally, we discuss and conclude our work in "Discussion" and "Conclusion".

Methodology
We present the overall framework of the proposed SQA method in Fig. 1. It mainly consists of three parts: data preprocessing, DLDV feature extraction with KPCA, and the DI-Transformer framework. The DI-Transformer framework in turn consists of two parts: the transformer encoder layer and the classification layer. We describe each part in detail in the following sections.

DLDV features extraction
We start our DLDV feature extraction method from EMD [19]. EMD can effectively process non-linear and non-stationary time-series signals, such as ECG signals. Unlike the FFT and the discrete wavelet transform (DWT) [24], EMD reveals the inherent features of a signal through the IMFs into which it decomposes the signal. It represents a signal as a combination of multiple IMF components whose characteristic distribution ranges from high to low frequency. Different IMFs reflect the feature information of signal and noise to different degrees.
In general, some MA noise has a morphology similar to, and a frequency band overlapping with, that of some ECG signals, so traditional denoising methods cannot effectively eliminate such noise. However, we find that the local nuances between them can be expressed by the IMFs. Therefore, we design a dedicated method to obtain the DLDV features of these MA-contaminated ECGs. Figure 2 shows the architecture diagram of the proposed DLDV feature extraction method. The light green areas represent the key modules of the proposed method, which we name the DLDV feature extraction module (DLDV-FEM); it is composed of a stack of N = 3 identical modules. Each module has two sub-modules. The first is an FFT-based sub-module, and the second is a statistical analysis-based sub-module (SA-based sub-module). After performing the EMD operation on x[n], we obtain its FT-IMF components (F-IMF: the first IMF, S-IMF: the second IMF, and T-IMF: the third IMF). When we feed F-IMF to the DLDV-FEM through the "Input" pipeline, the FFT-based sub-module obtains its amplitude value and phase angle in the frequency domain through the FFT [20] operation (denoted as FT-IMF_f). Meanwhile, the SA-based sub-module obtains its central tendency and degree of dispersion (denoted as FT-IMF_t). Then, FT-IMF_t and FT-IMF_f are output together to FT-IMF_F through the lavender pipeline. When the remaining S-IMF and T-IMF pass through the DLDV-FEM module in turn, we get two more output components (S-IMF_S and T-IMF_T). The FT-IMF_f of these three output components are then concatenated to form our FT-IMF_freq (DLDV) features, and the three FT-IMF_t of these components are concatenated to form our FT-IMF_time features. Finally, the output features (FT-IMF_all) of the entire module are obtained by concatenating FT-IMF_freq and FT-IMF_time.
Next, we describe the feature extraction process in detail. Let $X \in \mathbb{R}^{12 \times l}$ denote a multi-lead ECG signal and $X_f \in \mathbb{R}^{1 \times l}$ its f-th lead, where $f \in [1, \ldots, 12]$ indexes the leads and l is the length of the ECG segment. After performing the EMD operation according to [19], we obtain the IMFs as follows:

$$X_f[n] = \sum_{p=1}^{N} \mathrm{IMF}_{f,p}[n] + r_{f,N}[n]$$

where n is the sample index within the ECG segment, $\mathrm{IMF}_{f,p}[n]$ represents the p-th IMF of the f-th lead, $p \in [1, 2, \ldots, N]$, N is the total number of IMF layers (here, N is 3 and f is 1), and $r_{f,p}[n]$ is the residual signal left after the f-th lead signal passes through the p-th layer of EMD. Note that this paper mainly uses the FT-IMF (F-IMF, S-IMF, and T-IMF) components of EMD, because the dynamics of the FT-IMF behave as though they have been passed through a high-pass filter [25]. Hence, it is not surprising that the FT-IMF contains dynamics associated with noise for any well-sampled data [26]. Figure 3 shows the FT-IMF of a clean signal, a bw-contaminated signal, an ma-contaminated signal and an em-contaminated signal, respectively. We find several interesting phenomena: (1) the amplitude values of the IMFs of the noise-contaminated ECG signals are significantly lower than those of the clean signals.

Fig. 1 The overall framework of the proposed SQA model. Note that this paper focuses on SQA of MA-contaminated long-term ECGs
Fig. 2 The process of DLDV feature extraction and dimensionality reduction. Note that we process the ECG through two stages (EMD and FFT), we then focus on the depth local features of the secondary component of ECG through the amplitude value and phase angle, so it is called the depth local dual-view feature
(2) The FT-IMF components of EMD contain almost no bw noise (there is almost no difference between the corresponding IMF components in Fig. 3a, b), but reflect the inherent features of em and ma noise well (see the FT-IMF of the noisy signals in Fig. 3c, d). In Fig. 3c, it can be seen that the ma is manifest to different degrees in all three components. In Fig. 3d, the differences caused by em artifacts in each IMF component are marked in light purple, and it can be seen that em has obvious characteristics in T-IMF. These phenomena indicate that the FT-IMF contains features that help recognize MA-contaminated ECGs. Therefore, we utilize the FFT-based sub-module to extract the amplitude value and phase angle of F-IMF, S-IMF and T-IMF in the frequency domain, and concatenate the features obtained from the three components:

$$\text{FT-IMF}_{freq} = \mathrm{Concat}_{p=1}^{3}\big(\,|\mathrm{fft}(\mathrm{IMF}_{f,p})|,\ \mathrm{angle}(\mathrm{fft}(\mathrm{IMF}_{f,p}))\,\big)$$

where $\text{FT-IMF}_{freq} \in \mathbb{R}^{3 \times 2l}$, $|\cdot|$ denotes the absolute value operation, angle(·) computes the phase angle, fft(·) is the FFT, and Concat(·) is the concatenation operation. Simultaneously, we utilize the SA-based sub-module to analyze the central tendency and dispersion degree of the FT-IMF in the time domain, and concatenate the features obtained from the three components:

$$\text{FT-IMF}_{time} = \mathrm{Concat}_{p=1}^{3}\big(\mathrm{mean}(\mathrm{IMF}_{f,p}),\ \mathrm{var}(\mathrm{IMF}_{f,p})\big)$$

where mean(·) is the averaging operation, var(·) computes the variance, and $\text{FT-IMF}_{time} \in \mathbb{R}^{3 \times 2}$. Figure 4 shows an example of the feature extraction of the em- and ma-contaminated signals at each stage. Figure 4a shows the amplitude-frequency features of the em-contaminated ECG, and Fig. 4b its corresponding phase angle features. Figure 4c shows the amplitude-frequency features of the ma-contaminated ECG, and Fig. 4d its corresponding phase angle features.
It can be seen that when the frequency of the intermediate quantity decomposed from the em- or ma-contaminated ECG is not 0, the corresponding phase angle is also not 0 and has no obvious periodic characteristics, whereas the phase angle feature of the clean signal is periodic, in line with the periodic nature of the ECG signal. In addition, the phase angle can reflect the local change of the signal waveform at a certain moment [27], so the depth features extracted in this way capture the subtle differences between signal and noise well. Finally, we obtain FT-IMF_freq and FT-IMF_time; we also refer to FT-IMF_freq as X_DLDV.
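As a rough illustration of the two sub-modules described above, the following sketch computes the amplitude and phase-angle features (FT-IMF_freq) and the mean and variance features (FT-IMF_time) with NumPy. It assumes the three IMFs have already been produced by some EMD routine; the function name `dldv_features` and the synthetic sinusoids standing in for F-IMF, S-IMF and T-IMF are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dldv_features(imfs):
    """Return (FT-IMF_freq, FT-IMF_time) for an array of IMFs.

    imfs: array of shape (3, l) holding F-IMF, S-IMF and T-IMF,
          assumed to be precomputed by an EMD routine.
    """
    imfs = np.asarray(imfs, dtype=float)
    spectrum = np.fft.fft(imfs, axis=1)      # FFT of each IMF
    amplitude = np.abs(spectrum)             # amplitude view
    phase = np.angle(spectrum)               # phase-angle view
    ft_imf_freq = np.concatenate([amplitude, phase], axis=1)   # shape (3, 2l)
    ft_imf_time = np.stack([imfs.mean(axis=1), imfs.var(axis=1)], axis=1)  # shape (3, 2)
    return ft_imf_freq, ft_imf_time

# Hypothetical stand-ins for the three IMFs of a 6 s segment sampled at 500 Hz.
fs, seconds = 500, 6
t = np.arange(fs * seconds) / fs
fake_imfs = np.stack([np.sin(2 * np.pi * 40 * t),
                      np.sin(2 * np.pi * 15 * t),
                      np.sin(2 * np.pi * 5 * t)])
ft_freq, ft_time = dldv_features(fake_imfs)
print(ft_freq.shape, ft_time.shape)
```

With l = 3000 samples, the frequency view has 2l = 6000 values per IMF, which is why a dimensionality reduction step is applied before classification.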

DLDV feature dimension reduction
Principal component analysis (PCA) [28] is one of the essential methods for linear dimensionality reduction. Each principal component is a projection of the data in a certain direction, and the variance in each direction is determined by the corresponding eigenvalue. In the dimensionality reduction process, the eigenvalues are sorted from large to small, and the eigenvectors corresponding to the first k eigenvalues define the dimensionality-reduced features that express the information we are interested in. However, the data we need to process are nonlinear and non-stationary ECG signals. Therefore, this paper adopts kernel principal component analysis (KPCA) [21] to deal with these data. KPCA performs PCA in a higher-dimensional space (Hilbert space). The advantage is that, for nonlinear data points that are difficult to classify in a lower-dimensional space, an effective projection direction for classification may be found in the higher-dimensional space. The dimensionality of the DLDV features (nonlinear features) is too high, and they contain some features that hardly contribute to classification (as reflected in Fig. 4), so we reduce their dimensionality. For PCA, given $X_{DLDV} = [x_1, x_2, \ldots, x_n]$, $X_{DLDV} \in \mathbb{R}^{n \times d}$, n is the number of sequences in $X_{DLDV}$ and d is the dimension of each sequence. After performing PCA, we get the following decomposition model:

$$X_{DLDV} = \sum_{t=1}^{d} S_t U_t^{T}$$

where $S_t \in \mathbb{R}^{n}$ and $U_t \in \mathbb{R}^{d}$ represent the principal component vector and the corresponding projection vector, respectively. Since the $U_t$ form a set of orthonormal vectors, the principal component $S_t$ can be expressed as $S_t = X_{DLDV} U_t$. The projection vector $U_t$ can therefore be calculated by solving the eigenvalue problem:

$$\frac{1}{n} X_{DLDV}^{T} X_{DLDV}\, U_t = \lambda_t U_t \qquad (6)$$

where $\lambda_t$ is the t-th eigenvalue. For KPCA, we define a mapping $X_{DLDV} \in \mathbb{R}^{n \times d} \rightarrow \aleph(X_{DLDV}) \in \mathbb{R}^{n \times p}$, where $\aleph(\cdot)$ denotes a nonlinear mapping function that maps the signal to the Hilbert functional space, and p represents the dimension of the feature space.
We denote the mapping of $X_{DLDV}$ to the feature space as:

$$\aleph(X_{DLDV}) = [\aleph(x_1), \aleph(x_2), \ldots, \aleph(x_n)]$$

For the nonlinear case, it is difficult to solve for $U_t$ by simply replacing $X_{DLDV}$ with $\aleph(X_{DLDV})$ in (6), because the mapping function $\aleph(\cdot)$ is unknown. To address this problem, we introduce the kernel trick to develop the KPCA model. Following [29], $U_t$ can be expanded in the feature space as $U_t = \aleph^{T}(X_{DLDV})\,\beta_t$, where $\beta_t$ is a linear transformation vector. Thus, formula (6) is transformed into:

$$\frac{1}{n}\,\aleph(X_{DLDV})\,\aleph^{T}(X_{DLDV})\,\beta_t = \lambda_t \beta_t$$

where $K = \aleph(X_{DLDV})\,\aleph^{T}(X_{DLDV})$ is the kernel matrix of the kernel function. The elements of the kernel matrix are calculated by the Gaussian kernel function, and w represents the bandwidth of the Gaussian kernel.
For a given test vector $X^{j}_{DLDV} \in \mathbb{R}^{d}$, which represents the j-th DLDV feature vector, the corresponding kernel principal component can be calculated by [30][31][32]:

$$S_t(X^{j}_{DLDV}) = \aleph(X^{j}_{DLDV})\,U_t = \aleph(X^{j}_{DLDV})\,\aleph^{T}(X_{DLDV})\,\beta_t$$

where $t = [1, 2, \ldots, k]$ indexes the first k vectors retained after dimensionality reduction, that is, $S(X^{j}_{DLDV}) \in \mathbb{R}^{1 \times k}$. Here, we determine the value of k by the cumulative contribution rate of the principal components. Usually, if the cumulative contribution rate (P) of the first k principal components reaches 80-90%, the first k principal components are considered to contain the main information of all measurement indicators. To retain as much information as possible while still reducing the dimensionality, we use a stricter threshold and keep the first k principal components such that P ≥ 95%:

$$P = \frac{\sum_{t=1}^{k} \lambda_t}{\sum_{t} \lambda_t} \ge 95\% \qquad (10)$$

where the sum in the denominator runs over all eigenvalues. After DLDV feature extraction and dimensionality reduction for 6 s ECG signals, the minimum value of k that satisfies Eq. (10) is k = 2124 (354 × 6). Finally, we combine FT-IMF_time with the low-dimensional result to obtain $\text{FT-IMF}_{all} \in \mathbb{R}^{1 \times (k+6)}$:

$$\text{FT-IMF}_{all} = \mathrm{Concat}\big(S(X_{DLDV}),\ \text{FT-IMF}_{time}\big)$$
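The dimensionality reduction step can be sketched as follows. This is a minimal NumPy illustration under several assumptions that the text does not specify: the Gaussian kernel is taken as exp(-||x_i - x_j||^2 / w), the kernel matrix is centered in feature space, and the names `gaussian_kernel` and `kpca_fit_transform` as well as the random stand-in data are hypothetical. Only the selection rule (smallest k whose cumulative contribution rate reaches 95%) is taken from the description above.

```python
import numpy as np

def gaussian_kernel(a, b, w):
    # Assumed kernel form: exp(-||a_i - b_j||^2 / w); other bandwidth conventions exist.
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / w)

def kpca_fit_transform(x, w, threshold=0.95):
    """Return (scores, k, cumulative_ratio) for the training matrix x of shape (n, d)."""
    n = x.shape[0]
    k_mat = gaussian_kernel(x, x, w)
    # Centering in feature space (an assumption; the text does not mention it).
    ones = np.full((n, n), 1.0 / n)
    k_centered = k_mat - ones @ k_mat - k_mat @ ones + ones @ k_mat @ ones
    # Solve (1/n) K beta = lambda beta and sort eigenvalues from large to small.
    eigvals, eigvecs = np.linalg.eigh(k_centered / n)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    # Smallest k whose cumulative contribution rate reaches the threshold.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, threshold)) + 1, n)
    # Scale beta so that each projection vector U_t has unit norm.
    betas = eigvecs[:, :k] / np.sqrt(n * eigvals[:k])
    scores = k_centered @ betas   # kernel principal components of the training data
    return scores, k, ratio

rng = np.random.default_rng(0)
x_dldv = rng.normal(size=(40, 8))   # random stand-in for DLDV feature vectors
scores, k, ratio = kpca_fit_transform(x_dldv, w=8.0)
print(scores.shape, k)
```

In practice a library implementation of kernel PCA would normally be preferred; the sketch is only meant to make the eigenvalue problem and the 95% rule concrete.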

Proposed dual-input transformer model
Deep learning-based approaches can automatically extract abstract features of samples. However, their complex convolutional and recurrent structures give a series of hidden layers a large number of front-to-back dependencies, which leads to low parallelism of the model. The Transformer is the first sequence transduction model based entirely on attention; it replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention [23]. Existing studies have shown that the transformer can handle not only translation problems but also the classification of temporal sequences [23], such as ECG sequences [33,34]. We propose, for the first time, a DI-Transformer model to deal with the problem of ECG SQA; its overall structure is shown in Fig. 5. Our DI-Transformer model mainly includes the transformer encoder layer and the classifier layer. Furthermore, the feature extraction and KPCA are plugged into our model to provide augmented features. Note that the transformer encoder layer is formed by stacking six attention modules, each of which includes a multi-head attention block with six heads; the specific composition of the multi-head attention mechanism is given in [23,35,36]. Since ECG classification does not require a standard translation process, we replace the decoder part of the transformer with a fully connected layer. We describe the DI-Transformer in detail as follows.

Transformer encoder layer
Input embedding and positional encoding: The input embedding of the sequential signal is similar to methods in most natural language processing (NLP) architectures [37].
To get the embedding for each point, the Raw ECG or FT-IMF_all is mapped to the d_model-dimensional space through 1D convolution. It should be noted that we must ensure the consistency of the sequence length before and after convolution through well-designed padding and kernel size. That is, we must ensure the dimension of the embedding output is also d_model. In addition, we choose the sinusoidal version [23,36] to provide positional embedding for our input sequence.
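The two ingredients of this step can be sketched in a few lines. The positional encoding below follows the sinusoidal formulation of [23]; the helper names `same_padding` and `conv1d_output_length`, the kernel size of 3, and the use of a 3000-sample (6 s at 500 Hz) sequence are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in [23]; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

def same_padding(kernel_size):
    """Padding that keeps the sequence length for stride 1 and an odd kernel size."""
    if kernel_size % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    return (kernel_size - 1) // 2

def conv1d_output_length(seq_len, kernel_size, padding, stride=1):
    """Output length of a 1D convolution without dilation."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1

pe = sinusoidal_positional_encoding(seq_len=3000, d_model=512)
out_len = conv1d_output_length(3000, kernel_size=3, padding=same_padding(3))
print(pe.shape, out_len)
```

The length check makes explicit what "well-designed padding and kernel size" has to guarantee: the convolutional embedding must return one d_model-dimensional vector per input sample so that the positional encoding can be added element-wise.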

Attention module:
We stack the attention module six times; each module consists of two parts (the multi-head attention block and the feed-forward network). The former comprises six parallel attention heads, and its internal structure is shown in Fig. 6. After the "input embedding and positional encoding" operation on the raw ECG, the input vector U of the transformer encoder layer is obtained. Then, following [23], we define three transformation matrices $W_e^{Q}$, $W_e^{K}$ and $W_e^{V}$, $e \in \{1, \ldots, 6\}$, and use them to perform three linear transformations on U to get the query ($Q_e = U W_e^{Q}$), key ($K_e = U W_e^{K}$) and value ($V_e = U W_e^{V}$). Finally, the e-th head is calculated from $Q_e$, $K_e$ and $V_e$:

$$h_e = \mathrm{softmax}\left(\frac{Q_e K_e^{T}}{\sqrt{d_k}}\right) V_e$$

where T represents the matrix transpose and $d_k$ is the dimension of the key vectors [23]. To connect the results of all $h_e$, we define the transformation matrix $W^{P}$ and then get the output of the multi-head attention block through a linear mapping operation:

$$\mathrm{MHAB}(Q, K, V) = \mathrm{Concat}(h_1, h_2, \ldots, h_6)\, W^{P}$$

where $W^{P} \in \mathbb{R}^{6 d_v \times d_{model}}$ [23]. A residual connection and a layer normalization are then performed in the "Add&Norm" block on MHAB(Q, K, V). The result is passed to the feed-forward network (the second part of the attention module), which consists of two fully connected layers with a rectified linear unit (ReLU), followed again by a residual connection and layer normalization. Note that we use layer normalization rather than batch normalization. The output of each attention module is represented as X_attention. In this way we finally obtain the output of the transformer encoder layer. This output is used either as the input of the next transformer encoder layer or, after fusion with FT-IMF_all, as the input to the classification layer that determines the final output category.

Fig. 5 The overall structure of the proposed DI-Transformer model
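To make the data flow through one multi-head attention block concrete, here is a small NumPy sketch. It assumes the standard scaled dot-product form of [23] for each head; the function names, the toy dimensions (sequence length 10, d_model = 12, d_k = d_v = 2) and the random weights are illustrative assumptions and do not reflect the trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(u, w_q, w_k, w_v):
    """One head: scaled dot-product attention on the encoder input u."""
    q, k, v = u @ w_q, u @ w_k, u @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (seq_len, seq_len)
    return weights @ v, weights

def multi_head_attention_block(u, w_qs, w_ks, w_vs, w_p):
    """Concatenate all heads and map them back to d_model with w_p."""
    heads = [attention_head(u, wq, wk, wv)[0] for wq, wk, wv in zip(w_qs, w_ks, w_vs)]
    return np.concatenate(heads, axis=-1) @ w_p

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v, n_heads = 10, 12, 2, 2, 6   # toy sizes, not the paper's
u = rng.normal(size=(seq_len, d_model))
w_qs = [rng.normal(size=(d_model, d_k)) for _ in range(n_heads)]
w_ks = [rng.normal(size=(d_model, d_k)) for _ in range(n_heads)]
w_vs = [rng.normal(size=(d_model, d_v)) for _ in range(n_heads)]
w_p = rng.normal(size=(n_heads * d_v, d_model))

mhab = multi_head_attention_block(u, w_qs, w_ks, w_vs, w_p)
_, first_head_weights = attention_head(u, w_qs[0], w_ks[0], w_vs[0])
print(mhab.shape)
```

Because the six heads do not depend on each other, they can be evaluated in parallel, which is the property the text highlights; the loop here is only for readability.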

Dual-input features fusion and classification
In the model initialization phase, we extract the FT-IMF_time and FT-IMF_freq features through the proposed method and perform KPCA on FT-IMF_freq. They are then concatenated and used as the second-channel input feature (FT-IMF_all) of the DI-Transformer. In the training phase, the Raw ECG of the first channel is divided into mini-batches, position-encoded, and then fed into the transformer encoder layer. For each iteration, we randomly select 6 s of data from each Raw ECG sample (we show in follow-up experiments that a length of 6 s is optimal). After the Raw ECG passes through the transformer encoder layer, the extracted feature map is flattened and concatenated with the FT-IMF_all features prepared in the model initialization phase:

$$X_{hidden} = \mathrm{Concat}\big(\mathrm{Flatten}(X_{attention}),\ \text{FT-IMF}_{all}\big)$$

X_hidden then goes through a linear layer (a 1D fully connected layer whose input dimension is d_in), which is followed by a softmax function. The softmax scores are compared with the corresponding input labels to calculate the cross-entropy loss. Finally, the classification layer outputs a vector V = (v_1, v_2), where v_i denotes the probability that the segment belongs to class i (good quality or bad quality).
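The fusion and classification step amounts to a flatten, a concatenation, one linear layer and a softmax. The sketch below shows this with NumPy on toy shapes; the function names, the sizes (a 20 × 8 encoder output and a 10-dimensional FT-IMF_all) and the random weights are hypothetical and chosen only so the example runs quickly.

```python
import numpy as np

def fuse_and_classify(x_attention, ft_imf_all, weight, bias):
    """Flatten the encoder output, append FT-IMF_all, and apply a linear layer with softmax."""
    x_hidden = np.concatenate([x_attention.ravel(), ft_imf_all.ravel()])
    logits = x_hidden @ weight + bias
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()
    return x_hidden, probs

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample with an integer class label."""
    return float(-np.log(probs[label]))

rng = np.random.default_rng(0)
x_attention = rng.normal(size=(20, 8))   # toy encoder output (seq_len x d_model)
ft_imf_all = rng.normal(size=(1, 10))    # toy augmented features (k + 6 with k = 4)
d_in = x_attention.size + ft_imf_all.size
weight = rng.normal(scale=0.01, size=(d_in, 2))
bias = np.zeros(2)

x_hidden, probs = fuse_and_classify(x_attention, ft_imf_all, weight, bias)
loss = cross_entropy(probs, label=0)     # label 0 is an arbitrary choice for the demo
print(x_hidden.shape, probs, loss)
```

The input dimension d_in of the linear layer is fixed by the flattened encoder output plus the length of FT-IMF_all, so both must have a constant size across samples.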

ECG database
This paper employs the PhysioNet Computing in Cardiology Challenge 2011 (PCCC) [38] database to test the proposed SQA method. The PCCC includes 1500 standard 12-lead ECG recordings of 10 s each, sampled at 500 Hz, and contains two subsets: set-a includes 1000 12-lead 10 s recordings, and set-b includes 500 12-lead 10 s recordings. This paper employs set-a, which contains 9276 (773 × 12) 10 s good quality ("acceptable") ECGs and 2700 (225 × 12) 10 s bad quality ("unacceptable") ECGs. In addition, we select 500 single-lead good quality records and 500 single-lead bad quality records from the PCCC to form a test set (test-a). Then, we randomly select em or ma noise after oversampling and use it to contaminate any one of the 500 selected good quality records according to the method in [39]; repeating this process 500 times generates 500 records contaminated by em and ma noise. Finally, the 500 generated bad quality records and the 500 good quality records selected from the PCCC are combined into a second test set (test-b). The details of each database are described in Table 1. Figure 7 shows good quality and bad quality segments randomly selected from set-a. It should also be noted that the Z-score is used to normalize each 10 s record of all datasets, which is calculated as follows:

$$x' = \frac{x - \mu}{\sigma}$$

where x denotes the signal segment, and μ and σ are the mean value and standard deviation of the signal segment, respectively.
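The Z-score normalization of a single record can be sketched as below. The function name `z_score`, the guard against a constant (zero-variance) record, and the synthetic 10 s record are assumptions added for illustration.

```python
import numpy as np

def z_score(record):
    """Normalize one record to zero mean and unit standard deviation."""
    record = np.asarray(record, dtype=float)
    sigma = record.std()
    if sigma == 0.0:
        # A flat record cannot be scaled; how such records are handled is not stated in the text.
        raise ValueError("record has zero variance")
    return (record - record.mean()) / sigma

# Synthetic stand-in for a 10 s record sampled at 500 Hz.
rng = np.random.default_rng(0)
raw = 0.3 + 1.7 * rng.normal(size=5000)
normalized = z_score(raw)
print(normalized.mean(), normalized.std())
```

Normalizing each record separately removes differences in offset and gain between recordings before they are fed to the feature extraction and the transformer.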

Model parameters settings:
The key parameters set for the DI-Transformer model are shown in Table 2. It should be noted that due to the physiological characteristics of the human body, ECG signal strength will be limited within a certain range, which means there will not be much numerical difference between peaks and troughs, so the d model is set to 512 [33]. In addition, to achieve the goal of rapid convergence and prevent oscillation near the local minimum, the learning rate is dynamically adjusted during the model's training.
The whole method is developed and trained using TensorFlow and PyTorch. Our experiments are performed on a computer with an Intel(R) Core(TM) i5-7640X CPU @ 4.00 GHz, equipped with two GeForce GTX 1080 Ti GPUs with 11 GB RAM.
Performance evaluation: To evaluate the performance of the proposed method for SQA, we adopt five-fold cross-validation. Set-a is randomly divided into five equal subsets; each subset is selected as the test set in turn, and the remaining four subsets are used for training. However, less than a quarter of the data is classified as bad quality. It is well known that using an unbalanced dataset to build classifiers causes bias and results in poor generalization ability of the classification models. When prior probabilities (and Bayesian training paradigms) are not used, one way to overcome this problem is to balance the dataset. Therefore, we balance the dataset by adding real noise [em and ma noise from NSTDB [40] and additive Gaussian white noise (AGWN)] to good quality segments to generate additional bad quality data. Note that we oversample the em and ma noises to 500 Hz before adding them to the training subset, and the sampling rate of the AGWN is also 500 Hz. The method of balancing the dataset is described in [39]. For each cross-validation task, we balance the training subset (containing 7421 ≈ 9276/5 × 4 10 s good quality segments and 6838 ≈ 2700/5 × 4 + 4678 10 s bad quality segments) but keep the test subset unchanged (containing 1855 ≈ 9276/5 10 s good quality segments and 540 ≈ 2700/5 10 s bad quality segments).
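The subset sizes quoted above follow directly from the five-fold split and the number of generated noisy segments. The short check below reproduces the arithmetic; the variable names are illustrative, and the counts are the ones stated in the text.

```python
# Counts stated in the text for set-a and for the generated bad quality segments.
good_total, bad_total, folds = 9276, 2700, 5
generated_bad = 4678

train_good = round(good_total / folds * (folds - 1))                # four of five folds
train_bad = round(bad_total / folds * (folds - 1)) + generated_bad  # real plus generated
test_good = round(good_total / folds)                               # one fold, unchanged
test_bad = round(bad_total / folds)

print(train_good, train_bad, test_good, test_bad)
print(f"bad share before balancing: {bad_total / (good_total + bad_total):.3f}")
print(f"bad share in balanced training subset: {train_bad / (train_good + train_bad):.3f}")
```

The check confirms that bad quality data make up less than a quarter of set-a, and that the training subset is close to balanced after the generated segments are added.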
In addition, we employ multiple indicators to evaluate the performance of the proposed method, namely sensitivity (Se), specificity (Sp), precision (P+), accuracy (Acc), F1 and area under the curve (AUC) [41]. It should be noted that for extremely unbalanced data (i.e., a low prevalence or incidence of a disease in the total population), the ROC curve and AUC are only partially meaningful. For this problem, Carrington et al. [42] give an effective solution. Here, we balance the training set. The definitions of these indicators are as follows:

$$Se = \frac{TP}{TP + FN}, \quad Sp = \frac{TN}{TN + FP}, \quad P^{+} = \frac{TP}{TP + FP}$$

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \quad F_1 = \frac{2 \times P^{+} \times Se}{P^{+} + Se}$$

where TP is true positives, TN is true negatives, FP is false positives and FN is false negatives.
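For reference, these indicators can be computed from the four confusion-matrix counts as in the sketch below, which uses the standard textbook definitions. The function name `sqa_metrics` and the example counts are hypothetical; which class is treated as positive is not specified here and is left to the caller.

```python
import math

def sqa_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    p_plus = tp / (tp + fp)                  # precision
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    f1 = 2 * p_plus * se / (p_plus + se)     # harmonic mean of precision and sensitivity
    return {"Se": se, "Sp": sp, "P+": p_plus, "Acc": acc, "F1": f1}

# Hypothetical counts, not results from the paper.
metrics = sqa_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

AUC is not included because it requires the ranked classifier scores rather than a single confusion matrix.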

Performance evaluation of DLDV features
To evaluate the performance of the DLDV features extracted by our method, we employ four traditional classifiers, namely Gaussian Kernel Support Vector Machines (G-SVM) [43], Logistic Regression (LR) [44], Random Forests (RF) [45], and K-Nearest Neighbors (KNN) [46] (the parameter settings of each classifier are shown in Table 3), and six time-frequency dependent SQIs [10,47,48]: sSQI, kSQI, pSQI, LpSQI, MpSQI and HpSQI. Table 4 shows the binary classification results of ECG signal quality using a series of features on the four traditional classifiers. Figure 8 shows the confusion matrices obtained from the DLDV features (FT-IMF_freq) on the four classifiers. To further test the performance of the proposed method, instead of randomly combining SQIs to train the classification model, we generate new combinations of SQIs in decreasing order of the average accuracy of the six SQIs on the four classifiers. These combinations are then compared with DLDV and FT-IMF_all, respectively, and the results on each classifier are shown in Table 5. It can be seen that the Acc of the combination of all six SQIs is the highest among all combinations, but still lower than the Acc of DLDV and FT-IMF_all. This shows that our features perform better than the six traditional SQIs. Furthermore, our DLDV features perform best on G-SVM (Acc = 93.32%), which benefits from both the DLDV features and the superior performance of the SVM classifier based on the Gaussian kernel function. The results obtained on KNN (Acc = 92.98%) are slightly inferior to those on G-SVM. In addition, our features perform poorly on LR (Acc = 87.76%), even lower than the SQI features on KNN (Acc = 89.98%), but still slightly ahead of the results for the combination of all six SQIs. This indicates that our method outperforms these six traditional SQIs in quality classification.

Comparison of our DI-Transformer and four traditional classifiers
This section compares our DI-Transformer with four traditional methods (G-SVM, LR, RF and KNN). Four feature sets (SQI features, FT-IMF_time, FT-IMF_freq and FT-IMF_all) are used to build five categories of classifiers, and the results on the test set are shown in Table 6. Among the classification models built with SQI features, KNN achieves the highest accuracy of the four traditional models (Acc = 89.98%), but this is still lower than the result of DI-Transformer (Acc = 91.26%). The performance of the classification models built with FT-IMF_all is generally better than that of the models built with SQI features. With FT-IMF_all, the accuracy on G-SVM (Acc = 94.27%) is higher than that on KNN (Acc = 93.64%), but Table 7 and Fig. 13b show that KNN (AUC = 0.962) outperforms G-SVM (AUC = 0.921) in terms of AUC. More importantly, combined with FT-IMF_all, our DI-Transformer achieves the best overall performance (Acc = 99.62% and AUC = 0.993). The p values in Table 8 show that the difference in expressing signal quality between the proposed DI-Transformer and these four traditional classifiers is statistically significant.
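That a classifier can rank higher on Acc yet lower on AUC, as observed for G-SVM and KNN, follows from the two metrics measuring different things: Acc depends on a single decision threshold, whereas AUC measures how well the scores rank positives above negatives. The Python sketch below illustrates this with a pairwise (rank-based) AUC. The scores, the 0.5 threshold, and the model names are hypothetical toy values, not results from the paper.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative; ties count as one half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = good quality, 0 = bad quality.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.90, 0.45, 0.42, 0.40, 0.30, 0.20]  # perfect ranking, poor threshold
model_b = [0.90, 0.80, 0.70, 0.95, 0.10, 0.20]  # one badly ranked negative

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, round(accuracy(y_true, scores), 3), round(auc(y_true, scores), 3))
```

In this toy case model B has the higher accuracy while model A has the higher AUC, which is why both metrics are reported side by side.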

Ablation study on DI-Transformer model
In this section, we design a series of ablation experiments to comprehensively evaluate the performance of the proposed DI-Transformer. Experiment A uses only the FT-IMF_freq feature as the input to train the transformer-based model. Building on experiment A, experiment B uses FT-IMF_freq and FT-IMF_time as the input to train the transformer-based model. Experiment C uses only the Raw ECG as the input to train the transformer model. Building on experiment C, experiment D treats FT-IMF_time as augmented features, which are concatenated with the output of the transformer encoder layer and fed to the classification layer. Experiment E encodes the Raw ECG as the input of the transformer, and the dimension-reduced FT-IMF_freq is used as an augmented feature, which is fed into the classification layer along with the output of the transformer encoder layer (see Fig. 5). Building on experiment E, experiment F treats FT-IMF_freq and FT-IMF_time as augmented features, which are concatenated with the output of the transformer encoder layer and fed to the classification layer.
Notice that, compared with experiments A, B and C, which use a single-input structure, experiments D, E and F augment the features with a dual-input structure, whose main advantage is that it can fully utilize the depth local dual-view features. Table 9 shows the results of the ablation experiments, and Fig. 9 shows the six confusion matrices for the corresponding experiments. As shown in Table 9, the Acc of the transformer-based model reaches 95.49% in experiment A, and the Acc of experiment C reaches 92.57%. Compared with experiment C, the Acc of experi- [...] Fig. 9f, only 0.25% of the good quality data are misclassified as bad quality data. Such results show that the performance of our DI-Transformer is much better than that of G-SVM and KNN.
Note: Bold indicates that, compared with other combinations of SQIs, FT-IMF_all has the best performance on G-SVM, followed by KNN. DLDV and FT-IMF_freq represent the same feature.
Note: Bold indicates that several classes of features achieve the best performance on the classifier built by DI-Transformer.
Note: Bold indicates that several classes of features achieve the highest AUC on the classifier built by DI-Transformer compared with the other four classifiers.
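The dual-input idea used in experiments D, E and F (concatenating augmented features with the encoder output before the classification layer) can be sketched in a few lines of Python. Everything below is a simplified assumption for illustration: the encoder is replaced by mean pooling over hypothetical per-token vectors, the classification layer is a single linear unit with a sigmoid, and all numbers are toy values. It is not the DI-Transformer architecture itself.

```python
import math


def pool_encoder_output(token_vectors):
    """Mean-pool per-token vectors into one vector.
    Stand-in (assumption) for the transformer encoder output."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]


def fuse(encoded, augmented):
    """Dual-input fusion: concatenate the encoder output with the
    augmented (hand-crafted) feature vector."""
    return list(encoded) + list(augmented)


def classify(fused, weights, bias):
    """One linear unit plus sigmoid; returns a score in (0, 1)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values: 2 tokens x 2 dims from the Raw ECG branch,
# and 3 dimension-reduced augmented features.
tokens = [[0.2, 0.4], [0.6, 0.0]]
augmented = [1.0, -1.0, 0.5]
fused = fuse(pool_encoder_output(tokens), augmented)
weights = [0.5, -0.5, 1.0, 0.25, 0.0]
prob = classify(fused, weights, bias=0.0)
print(fused, round(prob, 4))
```

The point of the sketch is only the data flow: the classification layer sees both the learned representation and the augmented features, so neither branch has to reproduce what the other already encodes.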

Performance of each model to recognize the MA noise
First, we select the four traditional classification models trained with the SQI features and FT-IMF_all features, respectively. The performance of these models is then tested on artificial test sets with a progressively increasing proportion of MA-contaminated ECG segments. We generate a series of test sets with an unchanged total number of samples (1000) to test the ability of each model to identify MA-contaminated ECG by adjusting the proportion of data obtained from test-a and test-b. We take data from test-a and test-b at the ratios of 8:2, 6:4, 4:6 and 2:8, respectively, and we denote the generated test sets as test-ab1, test-ab2, test-ab3 and test-ab4 in turn. The results of the four traditional classifiers trained with SQI features on each test subset are shown in Table 10. As the proportion of MA-contaminated ECG segments increases, the Acc of all four classifiers decreases to different degrees. Under the same proportion, KNN performs better than the other three classifiers. Figure 10a shows the results obtained on test-ab1 and test-ab4. It can be seen that these classifiers are sensitive to MA noise. The results of the five classifiers trained with FT-IMF_all on each test subset are shown in Table 11. As the proportion of MA-contaminated ECG segments increases, the accuracy of all five classifiers decreases to different degrees, but the decrease is much smaller than that in Table 10. As shown in Fig. 10b, the results on test-ab1 and test-ab4 also con-
Note: Bold indicates that, among the various ablation experiments, experiment F achieves the best performance.
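The construction of the mixed test sets can be sketched as follows. This is a minimal Python illustration under stated assumptions: samples are drawn at random without replacement, test-b is taken to be the pool whose share grows as MA contamination increases, and the pools here are placeholder tuples rather than ECG segments. The function name `build_mixed_test_set` and the fixed seed are hypothetical.

```python
import random


def build_mixed_test_set(test_a, test_b, ratio_a, ratio_b, total=1000, seed=0):
    """Draw `total` samples from two pools at the ratio ratio_a:ratio_b,
    without replacement, and shuffle the result."""
    n_a = total * ratio_a // (ratio_a + ratio_b)
    n_b = total - n_a
    if n_a > len(test_a) or n_b > len(test_b):
        raise ValueError("a pool is too small for the requested ratio")
    rng = random.Random(seed)
    mixed = rng.sample(test_a, n_a) + rng.sample(test_b, n_b)
    rng.shuffle(mixed)
    return mixed


# Placeholder pools; each item is (source, index), not real ECG data.
test_a = [("a", i) for i in range(1000)]
test_b = [("b", i) for i in range(1000)]

ratios = {
    "test-ab1": (8, 2),
    "test-ab2": (6, 4),
    "test-ab3": (4, 6),
    "test-ab4": (2, 8),
}
mixed_sets = {
    name: build_mixed_test_set(test_a, test_b, ra, rb)
    for name, (ra, rb) in ratios.items()
}
for name, samples in mixed_sets.items():
    n_from_b = sum(1 for src, _ in samples if src == "b")
    print(name, len(samples), n_from_b)
```

Keeping the total fixed at 1000 means that any change in accuracy across the four sets reflects the changing mix of the two pools rather than a change in sample size.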

Optimal data length and computational time
To find the optimal segment length (N_seg) for SQA, we repeat experiment F ten times on set-a with N_seg varying from 1 to 10 s in increments of 1 s. Throughout the experiment, we change only N_seg, and the relationship between N_seg and the accuracy of SQA is shown in Fig. 11a. As N_seg increases, the quality classification accuracy of our model also increases. However, when N_seg is greater than 6 s, the accuracy hardly improves. This shows that a 6 s segment covers most of the features required for signal quality classification. In addition, Fig. 11b shows the relationship between segment length and the training and testing times. As N_seg increases, the training and testing times of the model increase slowly up to 5 s. Beyond 6 s, the curve shows a rapid upward trend. Weighing classification accuracy against computational cost based on Fig. 11a, b, we finally choose the optimal signal segment length as N_seg = 6 s.
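To make the role of N_seg concrete, the sketch below cuts a record into non-overlapping segments of N_seg seconds. The non-overlapping scheme, the dropping of a trailing partial segment, and the sampling rate and record length used in the example are assumptions chosen for illustration; the source does not specify how segments were cut.

```python
def split_into_segments(signal, fs, n_seg):
    """Split `signal` (a sequence of samples) into non-overlapping segments
    of n_seg seconds at sampling rate fs (Hz). A trailing partial segment
    is dropped."""
    seg_len = int(round(fs * n_seg))
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    last_start = len(signal) - seg_len
    return [signal[i:i + seg_len] for i in range(0, last_start + 1, seg_len)]


# Hypothetical record: 10 s at an assumed sampling rate of 500 Hz.
fs = 500
record = list(range(10 * fs))

for n_seg in range(1, 11):
    segments = split_into_segments(record, fs, n_seg)
    print(n_seg, len(segments), len(segments[0]))
```

Longer segments give the model more context per sample but also increase the input length, which is consistent with the trade-off between accuracy and computation time described above.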

Performance comparison
This paper employs the PCCC [38] database, which has also been used in other studies. Table 14 lists other well-performing methods that use this database. Albaba et al. [49] constructed an SQA pipeline by combining multiple time-frequency domain features with multiple traditional classifiers, and obtained good results on the Medium Gaussian SVM (MG-SVM) classifier. Their method achieves an accuracy of 93.00% on MG-SVM, which is comparable to the result obtained by our FT-IMF_all features on G-SVM (Acc = 94.27%), but still much lower than that of our DI-Transformer (Acc = 99.62%). Shahriari et al. [13] used the SSIM to compare ECG images obtained from two ECGs at standard scales. They then trained a linear discriminant analysis classifier for SQA, using the SSIM between each image and all templates as feature vectors. Compared with the others, their method obtained a lower accuracy. Behar et al. [10] employed indicators such as kSQI, sSQI, pSQI, basSQI, bSQI, pcaSQI, and rSQI, and trained an SVM model to evaluate the quality of ECG signals to reduce false alarms, achieving an accuracy of 99.30%. This result is higher than that of our G-SVM based on FT-IMF_all but slightly inferior to that of our DI-Transformer. It is worth noting that our methods have a strong MA noise recognition ability, whereas [10] targeted ordinary noisy signals and did not consider the interference of MA noise. Therefore, even though their performance metrics are high, they are not entirely comparable. The methods in [13,49] also hardly consider the case of MA-contaminated ECG. In addition, the proposed methods have good interpretability and can achieve accurate ECG SQA even for data that include a large amount of MA noise.
Fig. 11 The relationship between segment length and classification accuracy and computation time. a The relationship between the segment length and the accuracy. b The relationship between the segment length and the training and testing times

Analyzing the performance of DLDV features
This paper uses EMD and FFT to extract the DLDV features of ECG signals. Four traditional classifiers (G-SVM, LR, RF, and KNN) are then employed to evaluate the performance of the extracted DLDV features. Meanwhile, we also employ six traditional time-frequency-related SQIs as references to evaluate the performance of our DLDV features. In general, the larger the span of signal quality, the more significant the difference in SQI values. For example, as shown in Fig. 12, because the probability density distributions of signals of different quality differ markedly, the kurtosis (kSQI) and skewness (sSQI) can provide effective information for distinguishing good quality signals from bad quality signals. In addition, the other four time-frequency-related SQIs are all valid SQA indicators verified by researchers and have achieved good results in actual SQA [4,5,43,44]. Therefore, this paper selects them as references to evaluate the confidence of our DLDV features for SQA. Table 4 and Fig. 8 show the classification results and confusion matrices of the six traditional SQIs and DLDV features employed in this paper on the four classifiers. The DLDV features outperform these traditional SQIs on the four classifiers; even on LR, where the accuracy is lowest, the SQI features still score lower than our DLDV features. The reason our method comprehensively outperforms the six traditional SQIs is that the extracted features not only express the central tendency and dispersion of the signal segment, but also use the phase angle and amplitude-frequency values to express the transient changes of the signal.
Fig. 12 The probability density function of kSQI and sSQI on set-a (9276 good quality and 2700 bad quality) before normalization
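For reference, kSQI and sSQI are the kurtosis and skewness of the samples in a segment, that is, its fourth and third standardized moments. The Python sketch below computes both. It assumes population (biased) moments and non-excess kurtosis (a Gaussian gives a value near 3); the exact estimator used in the cited SQI literature may differ, and the two toy segments are not ECG data.

```python
import math


def standardized_moment(x, order):
    """Population standardized moment of the given order."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        raise ValueError("constant segment: moment is undefined")
    std = math.sqrt(var)
    return sum(((v - mean) / std) ** order for v in x) / n


def s_sqi(x):
    """Skewness of the segment (third standardized moment)."""
    return standardized_moment(x, 3)


def k_sqi(x):
    """Kurtosis of the segment (fourth standardized moment, not excess)."""
    return standardized_moment(x, 4)


# Toy segments: a symmetric square-like wave and a flat line with one spike.
symmetric = [-1.0, 1.0, -1.0, 1.0]
spiky = [0.0] * 9 + [10.0]

print(s_sqi(symmetric), k_sqi(symmetric))
print(round(s_sqi(spiky), 4), round(k_sqi(spiky), 4))
```

Both statistics summarize the amplitude distribution of the whole segment, which is why they separate signals whose probability density functions differ clearly but carry no information about where in the segment a transient occurs.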

Analyzing the performance of each model to recognize the MA noise
We also design experiments to test the proposed method's ability to recognize MA-contaminated ECGs. Our DLDV features work well for MA-contaminated ECGs, as confirmed in Fig. 10 and Tables 10, 11, 12 and 13. Table 10 reflects the ability of SQI features to express MA noise. As the MA noise increases, the accuracy of all four classifiers decreases, and the minimum decrease reaches 6.82%. Table 11 shows that, under the same conditions, the accuracy of all four classifiers also decreases, but the maximum decrease is only 1.08%. Tables 12 and 13 give the results of statistical analysis for Tables 10 and 11. The p values show that the difference between SQI features and FT-IMF_all in expressing MA noise is statistically significant. The results in Fig. 10a show that the SQI features have limitations in expressing MA noise. Because these metrics are based on human-defined desirable properties of clean signals, they have inherent limitations in expressing the potential features of signal quality [17]. In addition, it is difficult to manually specify features for MA noise that resembles the ECG signal, so it is not surprising that such feature information is hard to extract using the SQI features. Compared with the results in Fig. 10a, the results obtained by each classifier in Fig. 10b on the two test sets are very close, with an average difference of 0.76%. This shows that the classifiers constructed with our features can identify general noise well. More importantly, they also offer strong performance in identifying MA noise. Furthermore, our DI-Transformer structure achieves high accuracy on test-ab4. Such high accuracy is due not only to the design of the dual-input structure but, more importantly, to the transformer's self-attention module, which captures the temporal relationships of the signal and combines them with the DLDV features to improve the model's ability to recognize MA noise. Note that we do not use the FT-IMF_time feature in this test experiment because this feature can only express the central tendency and dispersion of the signal and cannot fully reflect the transient changes of the signal.

Analyzing the performance of proposed DI-Transformer
The effectiveness and robustness of our FT-IMF_all feature for SQA are verified on traditional classifiers (G-SVM, LR, RF, and KNN). Furthermore, we also propose a DI-Transformer SQA method based on the FT-IMF_all features. Table 9 presents a series of ablation experiments for the proposed DI-Transformer method. Figure 9 shows the confusion matrix corresponding to each ablation experiment. The results of experiments C, D, E and F show that the contribution of FT-IMF_freq to SQA is much more significant than that of FT-IMF_time. The results of experiments C and E show that the proposed dual-input structure significantly improves the model's classification performance. Feeding FT-IMF_freq to the transformer as input data (experiment A) is much better than feeding it the Raw ECG directly (experiment C), which shows that DLDV features help the transformer model learn the quality features more easily. This benefits from the fact that the phase angle features can represent the transient changes of the signals well and, combined with the amplitude features, these transient changes can be quantified. We also observe that the Se value of experiment E is higher than that of experiment C, and that the accuracy of experiment F is the best. This shows that experiment F tends to identify more signal segments as good quality, with the advantage of not missing valuable signals in subsequent processing stages, which is also demonstrated in the confusion matrix in Fig. 9f. From this point of view, the abstract features automatically extracted by the transformer from the Raw ECG are complementary to the FT-IMF_freq features. Comparing A with E and B with F, we find that the DI-Transformer combines the advantages of DLDV features and transformer-based abstract features, and has higher Se, Sp and Acc values. It can obtain more effective signal quality features than the single-input structure (A, B and C).
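The amplitude and phase angle views mentioned above can be illustrated on a single component with a discrete Fourier transform. The Python sketch below is an assumption-laden simplification: it skips the EMD step entirely, uses a naive O(N^2) DFT in place of an FFT so that it needs only the standard library, and analyzes a synthetic cosine instead of an IMF. It shows what "amplitude" and "phase angle" mean per frequency bin, not how the DLDV features are actually assembled.

```python
import cmath
import math


def dft_amplitude_phase(x):
    """Naive DFT of a real sequence. Returns, for bins 0..N/2, the
    amplitude |X[k]| / N and the phase angle of X[k] in radians."""
    n = len(x)
    amplitudes, phases = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        amplitudes.append(abs(coeff) / n)
        phases.append(cmath.phase(coeff))
    return amplitudes, phases


# Synthetic component: 2 cycles over 16 samples with a phase shift of pi/4.
n = 16
component = [math.cos(2 * math.pi * 2 * t / n + math.pi / 4) for t in range(n)]
amplitudes, phases = dft_amplitude_phase(component)

peak_bin = max(range(len(amplitudes)), key=lambda k: amplitudes[k])
print(peak_bin, round(amplitudes[peak_bin], 4), round(phases[peak_bin], 4))
```

The amplitude locates how much energy a frequency carries, while the phase angle records where the oscillation sits in time; two components with the same spectrum magnitude can therefore still be told apart by phase. Phase values in bins with near-zero amplitude are numerically meaningless and should be ignored.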
We also compare the proposed DI-Transformer with four traditional classifiers. Table 6 shows that the results with SQI features are inferior to those with our FT-IMF_all but higher than those with our FT-IMF_time, because FT-IMF_time does not focus on the nuances between signal and noise. The AUC values in Table 7 show that our DI-Transformer exhibits the best performance on all features, followed by KNN combined with FT-IMF_all. Furthermore, the p values in Table 8 show that the difference between the method based on SQI features and the method based on FT-IMF_all is statistically significant. It is not surprising that we obtain such good results, because our method rarely considers the morphology of the Raw ECG and instead mines the depth local features of the signal. We extract not only the transient amplitude features of the intermediate components of the signal (IMFs), but also the transient phase angle features that can express the subtle difference between the signal and the noise (especially MA noise). Equally important, among the traditional classifier-based methods, although the accuracy of FT-IMF_all features on G-SVM is higher than that on KNN, the receiver operating characteristic (ROC) curve of each model in Fig. 13 shows that the performance of DI-Transformer is the best (AUC = 0.993). Therefore, the DI-Transformer-based model constructed with FT-IMF_all can provide a new practical solution for SQA. In addition, Fig. 13b shows that, among the traditional classifiers, the KNN model built with FT-IMF_all exhibits the best performance (AUC = 0.962), followed by RF (AUC = 0.948). If a traditional method is used to build the signal quality classifier, the KNN or RF method based on FT-IMF_all is preferable under the same conditions.

Conclusion
In summary, we present a novel ECG SQA method that fuses the proposed DLDV features and the DI-Transformer framework to improve the recognition of MA-contaminated ECG. For the first time, we combine DLDV features and a transformer to handle the ECG SQA problem. Specifically, we use EMD and FFT to extract DLDV features of the Raw ECG in the time-frequency domain. The extracted DLDV features can identify subtle differences between MA and ECG signals through depth local amplitude and phase angle features. When they are fused with the temporal relationship features extracted by the DI-Transformer, the accuracy is significantly improved compared with methods based on traditional SQIs. Experiments on SQA tasks show that the proposed method outperforms the state-of-the-art SQA methods. In addition, our method can identify not only the common types of noise in noise-contaminated ECGs but also, more importantly, MA-contaminated ECG. In the future, we will improve the proposed method and make it suitable for SQA of other physiological signals, such as electroencephalogram and electromyogram.