PhysFormer++: Facial Video-Based Physiological Measurement with SlowFast Temporal Difference Transformer

Remote photoplethysmography (rPPG), which aims at measuring heart activities and physiological signals from facial video without any contact, has great potential in many applications (e.g., remote healthcare and affective computing). Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields, which neglect the long-range spatio-temporal perception and interaction needed for rPPG modeling. In this paper, we propose two end-to-end video transformer based architectures, namely PhysFormer and PhysFormer++, to adaptively aggregate both local and global spatio-temporal features for rPPG representation enhancement. As key modules in PhysFormer, the temporal difference transformers first enhance the quasi-periodic rPPG features with temporal difference guided global attention, and then refine the local spatio-temporal representation against interference. To better exploit the temporal contextual and periodic rPPG clues, we also extend PhysFormer to the two-pathway SlowFast based PhysFormer++ with temporal difference periodic and cross-attention transformers. Furthermore, we propose label distribution learning and a curriculum learning inspired dynamic constraint in the frequency domain, which provide elaborate supervision for PhysFormer and PhysFormer++ and alleviate overfitting. Comprehensive experiments are performed on four benchmark datasets to show our superior performance on both intra- and cross-dataset testing. Unlike most transformer networks, which need pretraining on large-scale datasets, the proposed PhysFormer family can be easily trained from scratch on rPPG datasets, which makes it promising as a novel transformer baseline for the rPPG community.


Introduction
Physiological signals such as heart rate (HR), respiration frequency (RF), and heart rate variability (HRV) are important vital signs to be measured in many circumstances, especially for healthcare or medical purposes. Traditionally, electrocardiography (ECG) and photoplethysmography (PPG), or blood volume pulse (BVP), are the two most common ways of measuring heart activities and the corresponding physiological signals. However, both ECG and PPG/BVP sensors need to be attached to body parts, which may cause discomfort and are inconvenient for long-term monitoring. To counter this issue, remote photoplethysmography (rPPG) [Yu et al., 2021b, Chen et al., 2018, Liu et al., 2021c] methods have been developing fast in recent years, which aim to measure heart activity remotely without any contact.

Fig. 1 The trajectories of rPPG signals around t1, t2, and t3 share similar properties (e.g., trends with a rising edge first and a falling edge later, and relatively high magnitudes) induced by skin color changes. This inspires the long-range spatio-temporal attention (e.g., the blue tube around t1 interacting with red tubes from intra- and inter-frames) according to their local temporal difference features for quasi-periodic rPPG enhancement. Here 'tube' indicates the same region across short-time consecutive frames.
Besides, there are a few color subspace transformation methods [De Haan and Jeanne, 2013, Wang et al., 2017] which utilize all skin pixels for rPPG measurement. Based on the prior knowledge from traditional methods, a few learning based approaches [Hsu et al., 2017, Qiu et al., 2018, Niu et al., 2018, Niu et al., 2019a] are designed in non-end-to-end fashions. ROI based preprocessed signal representations (e.g., the time-frequency map [Hsu et al., 2017] and the spatio-temporal map [Niu et al., 2018, Niu et al., 2019a]) are generated first, and then learnable models capture rPPG features from these maps. However, these methods need a strict preprocessing procedure and neglect the global contextual clues outside the pre-defined ROIs. Meanwhile, more and more end-to-end deep learning based rPPG methods [Špetlík et al., 2018, Chen and McDuff, 2018, Yu et al., 2019a, Yu et al., 2019b, Liu et al., 2020] are developed, which treat facial video frames as input and predict rPPG and other physiological signals directly. However, pure end-to-end methods are easily influenced by complex scenarios (e.g., with head movement and various illumination conditions), and rPPG-unrelated features cannot be ruled out during learning, resulting in huge performance drops [Yu et al., 2020] on realistic datasets (e.g., VIPL-HR [Niu et al., 2019a]).
Recently, due to its excellent long-range attentional modeling capacity in solving sequence-to-sequence issues, the transformer [Lin et al., 2021a, Han et al., 2020] has been successfully applied to many artificial intelligence tasks such as natural language processing (NLP) [Vaswani et al., 2017], image [Dosovitskiy et al., 2021] and video [Bertasius et al., 2021] analysis. Similarly, rPPG measurement from facial videos can be treated as a video-sequence-to-signal-sequence problem, where long-range contextual clues should be exploited for semantic modeling. As shown in Fig. 1, rPPG clues from different skin regions and temporal locations (e.g., signal trajectories around t1, t2, and t3) share similar properties (e.g., trends with a rising edge first and a falling edge later, and relatively high magnitudes), which can be utilized for long-range feature modeling and enhancement. However, different from most video tasks aiming at semantic motion representation, facial rPPG measurement focuses on capturing subtle skin color changes, which makes global spatio-temporal perception challenging. Besides, the rPPG measurement task usually relies on periodic hidden visual dynamics, and existing deep end-to-end models are weak in representing such clues. Furthermore, video-based rPPG measurement is usually a long-time monitoring task, and it is challenging to design and train transformers with long video sequence inputs.
Motivated by the discussions above, we propose two end-to-end video transformer architectures, namely PhysFormer and PhysFormer++, for remote physiological measurement. On the one hand, the cascaded temporal difference transformer blocks in PhysFormer benefit rPPG feature enhancement via global spatio-temporal attention based on fine-grained temporal skin color differences. Furthermore, the two-pathway SlowFast temporal difference transformer based PhysFormer++ with periodic- and cross-attention is able to efficiently capture the temporal contextual and periodic rPPG clues from facial videos. On the other hand, to alleviate the interference-induced overfitting issue and complement the weak temporal supervision signals, elaborate supervision in the frequency domain is designed, which helps the PhysFormer family learn more intrinsic rPPG-aware features.
This paper is an extended version of our prior work [Yu et al., 2022] accepted by CVPR 2022. The main differences from the conference version are as follows: 1) besides the temporal difference transformer based PhysFormer, we propose the novel SlowFast video transformer architecture PhysFormer++ for the rPPG measurement task; 2) based on the temporal difference transformer, the temporal difference periodic transformer and the temporal difference cross-attention transformer are proposed to enhance the rPPG periodic perception and cross-tempo rPPG dynamics, respectively; 3) a detailed overview of the traditional, non-end-to-end learning based, and end-to-end learning based rPPG measurement methods is discussed in the related work; 4) more elaborate experimental results, visualizations, and efficiency analysis are given for the PhysFormer family. To sum up, the main contributions of this paper are listed:

• We propose the PhysFormer family, i.e., PhysFormer and PhysFormer++, which mainly consists of a powerful video temporal difference transformer backbone.

In the rest of the paper, Section 2 provides the related work about rPPG measurement and vision transformers. Section 3 first introduces the detailed architectures of PhysFormer and PhysFormer++, and then formulates the label distribution learning and curriculum learning guided dynamic supervision for rPPG measurement. Section 4 introduces the four rPPG benchmark datasets and evaluation metrics, provides rigorous ablation studies and visualizations, and evaluates the performance of the proposed models. Finally, a conclusion is given in Section 5.

Related Work
In this section, we provide a brief discussion of the related facial rPPG measurement approaches. As shown in Table 1, these approaches can be generally categorized into traditional, non-end-to-end learning, and end-to-end learning based methods. We also briefly review the transformer architectures for vision tasks.

rPPG measurement
Traditional approaches. An early study of rPPG-based physiological measurement was reported in [Verkruysse et al., 2008]. Plenty of traditional hand-crafted approaches have been developed in this field since then. Compared with coarsely averaging an arbitrary color channel from the detected full face region, selectively merging information from different color channels [Poh et al., 2010b, Poh et al., 2010a] and from different ROIs [Lam and Kuno, 2015, Li et al., 2014] with adaptive temporal filtering [Li et al., 2014] is proven to be more efficient for subtle rPPG signal recovery. To improve the signal-to-noise ratio of the recovered rPPG signals, several signal decomposition methods such as independent component analysis (ICA) [Poh et al., 2010b, Poh et al., 2010a, Lam and Kuno, 2015] and matrix completion [Tulyakov et al., 2016] are also proposed. To alleviate the impacts of skin tone and head motion, several color space projection methods (e.g., chrominance subspace [De Haan and Jeanne, 2013] and skin-orthogonal space [Wang et al., 2017]) are developed. Despite remarkable early-stage progress, these approaches have the following limitations: 1) they require empirical knowledge to design the components (e.g., hyperparameters in signal-processing filtering); 2) there is a lack of supervised learning models to counter data variations, especially in challenging environments with serious interference.
Non-end-to-end learning approaches. In recent years, deep learning based approaches have dominated the field of rPPG measurement due to their strong spatio-temporal representation capabilities. One representative framework is to learn robust rPPG features from the facial ROI-based spatio-temporal signal map (STmap). The STmap [Niu et al., 2018, Niu et al., 2019b] or its variants (e.g., the multi-scale STmap [Niu et al., 2020, Lu et al., 2021] and the chrominance STmap [Lu and Han, 2021]) are first extracted from predefined facial ROIs in different color spaces, and then a classical convolutional neural network (CNN) (e.g., ResNet [He et al., 2016]) and recurrent neural network (RNN) (e.g., GRU [Cho et al., 2014]) are cascaded for rPPG feature representation. The STmap-based non-end-to-end learning framework focuses on learning an underlying mapping from the input feature maps to the target rPPG signals. With dense raw rPPG information and fewer irrelevant elements (e.g., face-shape attributes), these methods usually converge faster and achieve reasonable performance against head movement, but need explicit and exhaustive preprocessing.
End-to-end learning approaches. Besides learning upon handcrafted STmaps, end-to-end learning from facial sequences directly is also popular. Both spatial 2DCNN networks [Špetlík et al., 2018, Chen and McDuff, 2018] and spatio-temporal models [Yu et al., 2019a, Yu et al., 2019b, Yu et al., 2020, Liu et al., 2020, Liu et al., 2021b, Nowara et al., 2021, Gideon and Stent, 2021] are developed for rPPG feature representation. Yu et al. [Yu et al., 2019a] investigate recurrent methods (PhysNet-LSTM, PhysNet-ConvLSTM) for rPPG measurement. However, such CNN+LSTM based architectures are good at long-range sequential modeling via the LSTM but fail to explore long-range intra-frame spatial relationships using CNNs with local convolutions. In contrast, with a spatial transformer backbone and a temporal shift module, EfficientPhys [Liu et al., 2021b] is able to explore long-range spatial but only short-term temporal relationships. In other words, existing end-to-end methods only consider the spatio-temporal rPPG features from local neighbors and adjacent frames but neglect the long-range relationships among quasi-periodic rPPG features.
Compared with the non-end-to-end learning based methods, end-to-end approaches are less dependent on task-related prior knowledge and handcrafted engineering (e.g., STmap generation), but rely on diverse and large-scale data to alleviate the problem of overfitting. To enhance the long-range contextual spatio-temporal representation capacities and alleviate the data-hungry requirement of deep rPPG models, we propose the PhysFormer and PhysFormer++ architectures, which can be easily trained from scratch on rPPG datasets with the elaborate supervision recipe.

Methodology
We will first introduce the architectures of PhysFormer and PhysFormer++ in Sec. 3.1 and 3.2, respectively. Then we will introduce label distribution learning for rPPG measurement in Sec. 3.3, and at last present the curriculum learning guided dynamic supervision in Sec. 3.4.

PhysFormer
As illustrated in Fig. 2, PhysFormer consists of a shallow stem E_stem, a tube tokenizer E_tube, N temporal difference transformer blocks E^i_trans (i = 1, ..., N), and an rPPG predictor head. Inspired by the study in [Xiao et al., 2021], we adopt a shallow stem to extract coarse local spatio-temporal features, which benefits fast convergence and clearer subsequent global self-attention. Specifically, the stem is formed by three convolutional blocks with kernel sizes (1×5×5), (3×3×3) and (3×3×3), respectively. Each convolution operator is cascaded with batch normalization (BN), ReLU and MaxPool. The pooling layers only halve the spatial dimensions. Therefore, given an RGB facial video input X ∈ R^(3×T×H×W), the stem output is X_stem = E_stem(X), where X_stem ∈ R^(D×T×H/8×W/8), and D, T, W, H indicate channel, sequence length, width and height, respectively. Then X_stem is partitioned into spatio-temporal tube tokens X_tube ∈ R^(D×T′×H′×W′) via the tube tokenizer E_tube. Subsequently, the tube tokens are forwarded through N temporal difference transformer blocks to obtain the global-local refined rPPG features X_trans, which have the same dimensions as X_tube. Finally, the rPPG predictor head temporally upsamples, spatially averages, and projects the features X_trans to the 1D signal Ŷ ∈ R^T.
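The shape arithmetic of the stem and tube tokenizer can be sketched as below. The function names and the concrete input sizes (T=160, H=W=128) are illustrative assumptions, not values fixed by the text; only the downsampling factors follow the description above.

```python
# Shape bookkeeping for the stem and tube tokenizer described above.

def stem_output_shape(T, H, W, D=96):
    # Three conv blocks, each followed by a MaxPool that halves only the
    # spatial dimensions -> H/8 x W/8; the temporal length T is preserved.
    return (D, T, H // 8, W // 8)

def tube_token_shape(stem_shape, Ts=4, Hs=4, Ws=4):
    # Non-overlapping 3D conv with kernel = stride = tube size Ts x Hs x Ws.
    D, T, H, W = stem_shape
    return (D, T // Ts, H // Hs, W // Ws)

stem = stem_output_shape(T=160, H=128, W=128)   # (96, 160, 16, 16)
tokens = tube_token_shape(stem)                 # (96, 40, 4, 4)
```

With a 160-frame 128×128 clip, the tokenizer yields 40×4×4 tube tokens of dimension 96, which is the sequence the transformer blocks operate on.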
Tube tokenization. Here the coarse feature X_stem is partitioned into non-overlapping tube tokens via E_tube(X_stem), which aggregates the spatio-temporal neighbor semantics within each tube region and reduces the computational cost of the subsequent transformers. Specifically, the tube tokenizer consists of a learnable 3D convolution with the same kernel size and stride (non-overlapping setting) as the targeted tube size Ts×Hs×Ws. Thus, the resulting tube token map X_tube ∈ R^(D×T′×H′×W′) has length T′ = T/Ts, height H′ = H/(8·Hs), and width W′ = W/(8·Ws). Please note that there are no positional embeddings after the tube tokenization, as the stem with cascaded convolutions and poolings at the early stage already captures relative spatio-temporal positional information [Hassani et al., 2021].

Temporal difference multi-head self-attention (TD-MHSA). In self-attention [Vaswani et al., 2017, Dosovitskiy et al., 2021], the relationship between the tokens is modeled by the similarity between the projected query-key pairs, yielding the attention score. Instead of point-wise linear projection, we utilize temporal difference convolution (TDC) [Yu et al., 2020, Yu et al., 2021d] for query (Q) and key (K) projection, which can capture fine-grained local temporal difference features for subtle color change description. TDC with learnable weights w can be formulated as

TDC(p0) = Σ_{pn∈R} w(pn)·x(p0 + pn) + θ·(−x(p0)·Σ_{pn∈R′} w(pn)),

where R indicates the sampled local (3×3×3) spatio-temporal receptive field cube for 3D convolution in both the current (t0) and adjacent time steps (t−1 and t1), while R′ only indicates the local spatial regions in the adjacent time steps (t−1 and t1). The hyperparameter θ ∈ [0, 1] trades off the contribution of the temporal difference: the higher the value of θ, the more important the temporal difference information (e.g., trends of the skin color changes). Specially, TDC degrades to vanilla 3D convolution when θ = 0.
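A minimal 1D sketch of the TDC mechanism, assuming a 3-tap kernel over time steps (t−1, t0, t1) with made-up weights; the actual module is a 3D convolution, but reducing it to one temporal dimension keeps the role of θ visible.

```python
# Simplified 1D temporal difference convolution (TDC) sketch.

def tdc_1d(x, w, theta):
    """x: signal, w: 3-tap weights for (t-1, t0, t1), theta in [0, 1]."""
    out = []
    for t in range(1, len(x) - 1):
        # Vanilla convolution term over the local temporal neighborhood R.
        vanilla = w[0] * x[t - 1] + w[1] * x[t] + w[2] * x[t + 1]
        # Difference term: R' covers only the adjacent steps (t-1, t1).
        diff = -x[t] * (w[0] + w[2])
        out.append(vanilla + theta * diff)
    return out

x = [0.0, 1.0, 4.0, 9.0, 16.0]
w = [0.2, 0.5, 0.3]
vanilla_only = tdc_1d(x, w, theta=0.0)  # degrades to a plain convolution
with_diff = tdc_1d(x, w, theta=0.7)     # emphasizes temporal color changes
```

Setting `theta=0.0` reproduces the vanilla convolution output exactly, mirroring the degradation property stated above.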
Then the query and key are projected via unshared TDC and BN as

Q = BN(TDC_Q(X_tube)),  K = BN(TDC_K(X_tube)).   (3)

For the value (V) projection, point-wise linear projection without BN is utilized. Then Q, K, V ∈ R^(D×T′×H′×W′) are flattened into sequences and separated into h heads (D_h = D/h for each head).
For the i-th head (i ≤ h), the self-attention (SA) can be formulated as

SA_i = Softmax(Q_i K_i^T / τ) V_i,   (4)

where τ controls the sparsity. We find that the default setting τ = √D_h in [Vaswani et al., 2017, Dosovitskiy et al., 2021] performs poorly for rPPG measurement. According to the periodicity of rPPG features, we use a smaller τ to obtain sparser attention activations. The corresponding study can be found in Table 7. The output of TD-MHSA is the concatenation of the SA from all heads followed by a linear projection U ∈ R^(D×D):

TD-MHSA = Concat(SA_1; SA_2; ...; SA_h)U.   (5)

As illustrated in Fig. 2, residual connection and layer normalization (LN) are conducted after TD-MHSA.
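The effect of the temperature τ can be seen in a toy example. The query/key values and dimensions below are fabricated for illustration; only the mechanism (dividing scores by a smaller τ sharpens the softmax) follows the text.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_weights(q, keys, tau):
    # Scaled dot-product attention scores for one query against all keys.
    scores = [sum(qi * ki for qi, ki in zip(q, kj)) / tau for kj in keys]
    return softmax(scores)

q = [1.0, 0.5]
keys = [[1.0, 0.4], [0.2, 0.1], [0.9, 0.5]]

w_default = attention_weights(q, keys, tau=math.sqrt(96 / 4))  # tau = sqrt(D_h)
w_sparse = attention_weights(q, keys, tau=2.0)                 # smaller tau
```

The smaller τ concentrates more attention mass on the best-matching key (a peakier, sparser distribution), which is the behavior the text relies on for quasi-periodic rPPG features.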
Spatio-temporal feed-forward (ST-FF). The vanilla feed-forward network consists of two linear transformation layers, where the hidden dimension D′ between the two layers is expanded to learn a richer feature representation. In contrast, we introduce a depthwise 3D convolution (with BN and a nonlinear activation) between these two layers, with a slight extra computational cost but remarkable performance improvement. The benefits are two-fold: 1) as a complement to TD-MHSA, ST-FF can refine the local inconsistency and parts of the noisy features; 2) richer locality provides TD-MHSA with sufficient relative position cues.

PhysFormer++
In PhysFormer, the temporal length Ts of the tube tokens is fixed. However, a fixed value of Ts might be sub-optimal for robust rPPG feature representation: a larger Ts reduces the temporal redundancy but loses fine-grained temporal clues, and vice versa for a smaller Ts. To alleviate this issue, we design the temporally enhanced version PhysFormer++ (see Fig. 3), consisting of two-stream SlowFast pathways with large and small Ts, respectively. Similar to the SlowFast concept in [Feichtenhofer et al., 2019, Kazakos et al., 2021], the slow pathway has high channel capacity with a low framerate, and reduces the temporal redundancy. In contrast, the fast pathway operates at a fine-grained temporal resolution with a high framerate. Furthermore, two novel transformer blocks, the temporal difference periodic transformer and the temporal difference cross-attention transformer, are designed for the slow and the fast pathway, respectively. The former encodes contextual rPPG periodicity clues for the slow pathway, while the latter introduces efficient SlowFast interactive attention for the fast pathway. The SlowFast architecture is able to adaptively mine richer temporal rPPG contexts for robust rPPG measurement.
As illustrated in Fig. 3, with the detailed architecture in Fig. 4, different from PhysFormer using a single tube tokenizer, two tube tokenizers E^fast_tube and E^slow_tube are adopted in PhysFormer++ to form the spatio-temporal tube tokens X^fast_tube ∈ R^(D_fast×T_fast×H′×W′) and X^slow_tube ∈ R^(D_slow×T_slow×H′×W′), respectively. The default settings D_slow = D = 2·D_fast and T_fast = 2T′ = 2·T_slow are used as a computational tradeoff. Here we set the temporal scale to two considering that there are many low-framerate videos in the VIPL-HR dataset [Niu et al., 2019a]; higher scales would result in pulse rhythm incompletion/artifacts for high HR values (e.g., >120 bpm). We will investigate more scales for higher-framerate videos in the future. Subsequently, the tube tokens from the slow pathway are forwarded through N′ = 3N temporal difference periodic transformer blocks, while the tube tokens from the fast pathway pass through N temporal difference transformer and 2N temporal difference cross-attention transformer blocks. Specifically, the feature interactions between the SlowFast pathways are two-fold: 1) all semantic mid- and high-level features from the slow pathway are cross-attended with those from the fast pathway; and 2) the last mid-level features from the two pathways, X^fast-mid_tube and X^slow-mid_tube, are laterally connected and then aggregated for the high-level propagation in the slow pathway. The lateral connection and aggregation can be formulated as

X̃^slow-mid_tube = X^slow-mid_tube + Conv2(Conv1(X^fast-mid_tube)),

where Conv1 is a temporal convolution with kernel size 3×1×1, stride 2×1×1 and padding 1×0×0, while Conv2 denotes a point-wise convolution with D output channels. The lateral connection adaptively transfers the mid-level fine-grained rPPG clues from the fast pathway to the slow pathway, and provides complementary temporal details for the slow pathway to alleviate information loss, especially in high-HR scenarios (e.g., after exercise).
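A shape-level sketch of the lateral connection, under the simplifying assumption of single-channel 1D token sequences (the paper operates on D-channel spatio-temporal token maps, and Conv2's channel projection collapses to the identity here). The point is that Conv1's temporal stride of 2 halves T_fast so that the fast-pathway features can be fused with the slow-pathway features.

```python
# Lateral connection sketch: fast pathway (length T_fast) -> slow pathway.

def conv1_temporal(x, w=(0.25, 0.5, 0.25)):
    """Kernel 3, stride 2, padding 1 along time (cf. Conv1 in the text)."""
    padded = [0.0] + list(x) + [0.0]
    return [w[0] * padded[t] + w[1] * padded[t + 1] + w[2] * padded[t + 2]
            for t in range(0, len(x), 2)]

def lateral_connect(x_fast_mid, x_slow_mid):
    # Conv2 (point-wise channel projection) is the identity in this
    # single-channel sketch; fusion is element-wise addition.
    lateral = conv1_temporal(x_fast_mid)
    return [s + l for s, l in zip(x_slow_mid, lateral)]

x_fast = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # T_fast = 8
x_slow = [0.5, 0.5, 0.5, 0.5]                        # T_slow = 4
fused = lateral_connect(x_fast, x_slow)              # length T_slow = 4
```

The stride-2 convolution maps the 8-step fast sequence onto the 4-step slow sequence, matching the T_fast = 2·T_slow relation stated above.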
Finally, the refined high-level rPPG features from the fast and (upsampled) slow pathways are concatenated and forwarded to the rPPG predictor head with temporal aggregation, upsampling, spatial averaging, and projection to the 1D signal Ŷ ∈ R^T.
Temporal difference multi-head cross- and self-attention. Compared with the slow pathway, the fast pathway has more fine-grained features but conducts inefficient and inaccurate self-attention due to temporal redundancy/artifacts. To alleviate this weak self-attention issue in the fast pathway, we propose the temporal difference multi-head cross- and self-attention (TD-MHCSA) module, which is cascaded with the ST-FF module to form the temporal difference cross-attention transformer. With TD-MHCSA, the features in the fast pathway can be refined not only by their own self-attention but also by the cross-attention between the SlowFast pathways.
The structure of TD-MHCSA is illustrated in Fig. 5. The features from the fast pathway X^fast_tube are first projected to query and key via

Q^fast = BN(TDC(X^fast_tube)),  K^fast = BN(TDC(X^fast_tube)).

For the value (V^fast) projection, point-wise linear projection without BN is utilized. Then Q^fast, K^fast, V^fast ∈ R^(D_fast×T_fast×H′×W′) are flattened into sequences and separated into h heads (D^fast_h = D_fast/h for each head). For the i-th head (i ≤ h), the self-attention can be formulated as

SA^fast_i = Softmax(Q^fast_i (K^fast_i)^T / τ) V^fast_i.

Similarly, the features from the slow pathway X^slow_tube are projected to the key K^slow via BN(TDC(X^slow_tube)), as well as the value (V^slow) via point-wise linear projection. Then K^slow, V^slow ∈ R^(D_slow×T_slow×H′×W′) are flattened into sequences and separated into h heads. For the i-th head (i ≤ h), the cross-attention (CA) can be formulated as

CA_i = Softmax(Q^fast_i (K^slow_i)^T / τ) V^slow_i.

Thus, the combined cross- and self-attention (CSA) is formulated as CSA_i = CA_i + SA^fast_i. The output of TD-MHCSA is the concatenation of the CSA from all heads followed by a linear projection U^fast ∈ R^(D_fast×D_fast):

TD-MHCSA = Concat(CSA_1; CSA_2; ...; CSA_h)U^fast.

Temporal difference multi-head periodic- and self-attention (TD-MHPSA). For the slow pathway, the self-attention in each head is complemented with a periodic attention S, and the combined periodic- and self-attention can be formulated as

PSA_i = λ·CP_i + SA_i,

where λ trades off the CP and SA. Here we follow the memory-efficient implementation in [Huang et al., 2019] for the S calculation. Despite richer positional periodicity clues, the predicted periodic attention S might be easily influenced by some rPPG-unrelated clues (e.g., light changes and dynamic noise). To alleviate this issue, we propose a periodicity constraint to supervise the periodic representation S. As shown in the top left of Fig.
6, the approximate peak map PM can be obtained via 1) first extracting the binary peak signal P ∈ R^T from the ground-truth BVP signal Y ∈ R^T via

P_t = 1 if t ∈ R_peak, and P_t = 0 otherwise,   (12)

where R_peak denotes the 1D region of peak locations; and then 2) calculating the auto-correlation of the peak signal P via PM = P P^T. Finally, the periodic-attention loss L_atten can be calculated as the binary cross-entropy (BCE) loss between the adaptive-spatial-pooled periodic attention maps S̃ ∈ R^(T′×T′) (from each head and each TD-MHPSA module) and the subsampled binary peak maps PM′ ∈ R^(T′×T′):

L_atten = BCE(S̃, PM′).

We also tried supervision with an L1 regression loss instead of the BCE loss, but it led to poorer performance.
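The periodicity supervision target can be sketched as follows. The peak locations, the ±1 extension radius, and the mean-reduced BCE are illustrative assumptions (the paper extends peaks by ±3 neighbors over longer signals); the construction PM = P P^T follows the text.

```python
import math

def peak_signal(T, peak_locs, radius=1):
    # Binary peak signal P: 1 within a small neighborhood of each peak.
    P = [0.0] * T
    for p in peak_locs:
        for t in range(max(0, p - radius), min(T, p + radius + 1)):
            P[t] = 1.0
    return P

def peak_map(P):
    # Outer product P P^T: entry (i, j) is 1 iff both i and j are peaks.
    return [[pi * pj for pj in P] for pi in P]

def bce(pred, target, eps=1e-7):
    # Mean binary cross-entropy between an attention map and PM.
    total, n = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            p = min(max(p, eps), 1 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            n += 1
    return total / n

P = peak_signal(T=12, peak_locs=[2, 8])
PM = peak_map(P)
```

An attention map that matches PM incurs a near-zero loss, while a uniform map is penalized, which is how the constraint steers S toward the ground-truth pulse periodicity.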
Relationship between PhysFormer and PhysFormer++. PhysFormer++ can be treated as an upgraded version of PhysFormer toward better performance at a higher computational cost. With similar temporal difference transformers, PhysFormer can be seen as a slow-pathway-only version of PhysFormer++, which is more lightweight and efficient. In contrast, PhysFormer++ is designed on a dual-pathway SlowFast architecture with complex cross-tempo interactions, which is more robust to head motions and less sensitive to the video framerate, but comes with a heavier computational cost (see Table 11 for efficiency analysis).

Label Distribution Learning
Similar to the facial age estimation task [Geng et al., 2013, Gao et al., 2018], where faces at close ages look quite similar, facial rPPG signals with close HR values usually have similar periodicity. Inspired by this observation, instead of considering each facial video as an instance with one label (HR), we regard each facial video as an instance associated with a label distribution. The label distribution covers a certain number of class labels, representing the degree to which each label describes the instance. In this way, one facial video can contribute to both the targeted HR value and its adjacent HRs.
To exploit the similarity information among HR classes during the training stage, we model the rPPG-based HR estimation problem as a specific L-class multi-label classification problem, where L = 139 in our case (each integer HR value within [42, 180] bpm as a class). A label distribution p = {p_1, p_2, ..., p_L} ∈ R^L is assigned to each facial video X. Each entry of p is a real value in the range [0, 1] such that Σ^L_{k=1} p_k = 1. We consider the Gaussian distribution function, centered at the ground-truth HR label Y_HR with standard deviation σ, to construct the corresponding label distribution p.
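Constructing such a Gaussian label distribution can be sketched as below. The value of σ is a free hyperparameter here (this passage does not fix it); the class range [42, 180] bpm and the normalization follow the text.

```python
import math

def hr_label_distribution(hr_gt, lo=42, hi=180, sigma=1.0):
    # Gaussian centered at the ground-truth HR over L = hi - lo + 1 classes,
    # normalized so that the entries sum to 1.
    raw = [math.exp(-((k - hr_gt) ** 2) / (2 * sigma ** 2))
           for k in range(lo, hi + 1)]
    z = sum(raw)
    return [r / z for r in raw]

p = hr_label_distribution(hr_gt=75)  # peak at the 75 bpm class
```

The distribution peaks at the ground-truth class while giving non-zero mass to adjacent HR classes, which is exactly how one video contributes to neighboring labels.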
The label distribution loss can be formulated as L_LD = KL(p ‖ Softmax(p̂)), where KL(·‖·) denotes the Kullback-Leibler (KL) divergence [Gao et al., 2017], and p̂ is the power spectral density (PSD) of the predicted rPPG signal.
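A sketch of this loss on a toy 5-class problem. The PSD values are fabricated for illustration; the structure (KL between the label distribution and the softmaxed PSD) follows the formulation above.

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) over discrete distributions; skips zero-probability classes.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

p = [0.05, 0.25, 0.40, 0.25, 0.05]      # Gaussian-like label distribution
psd_good = [0.1, 1.0, 1.6, 1.0, 0.1]    # PSD peaking at the correct class
psd_bad = [1.6, 1.0, 0.1, 0.1, 1.0]     # PSD peaking elsewhere

l_good = kl_divergence(p, softmax(psd_good))
l_bad = kl_divergence(p, softmax(psd_bad))
```

A predicted PSD whose energy concentrates at the ground-truth HR class yields a lower L_LD than one peaking at the wrong frequency, which is the gradient signal the loss provides.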
Please note that the previous work [Niu et al., 2017] also considers distribution learning for HR estimation. However, it is quite different from our work: 1) the motivation in [Niu et al., 2017] is to smooth the temporal HR outliers caused by facial movements across continuous video clips, while our work is more generic, aiming at efficient feature learning across adjacent labels under limited-scale training data; 2) the technique in [Niu et al., 2017] is applied as a post-processing step after HR estimation from handcrafted rPPG signals, while our work designs a reasonable supervision signal L_LD for the PhysFormer family.

Curriculum Learning Guided Dynamic Loss
Curriculum learning [Bengio et al., 2009], a major machine learning regime with the philosophy of an easy-to-hard curriculum, is utilized to train PhysFormer. In the rPPG measurement task, supervision signals from the temporal domain (e.g., mean square error loss [Chen and McDuff, 2018], negative Pearson loss [Yu et al., 2019a, Yu et al., 2019b]) and the frequency domain (e.g., cross-entropy loss [Niu et al., 2020, Yu et al., 2020], signal-to-noise ratio loss [Špetlík et al., 2018]) provide different extents of constraints for model learning. The former gives signal-trend-level constraints, which is straightforward for model convergence but prone to overfitting afterwards. In contrast, the latter enforces strong constraints in the frequency domain so that the model learns periodic features within the target frequency bands, which is hard to converge well due to realistic rPPG-irrelevant noise. Inspired by curriculum learning, we propose dynamic supervision to gradually enlarge the frequency constraints, which alleviates the overfitting issue and gradually benefits the intrinsic rPPG-aware feature learning. Specifically, an exponential increment strategy is adopted, and a comparison with other dynamic strategies (e.g., linear increment) will be shown in Table 10. The dynamic loss L_overall can be formulated as

L_overall = α·L_time + β·(L_CE + L_LD),  with  β = β_0·η^(E_cur/E_max),

where the hyperparameters α, β_0 and η equal 0.1, 1.0 and 5.0, respectively, and E_cur and E_max denote the current epoch and the total number of training epochs. The negative Pearson loss [Yu et al., 2019a, Yu et al., 2019b] and the frequency cross-entropy loss [Niu et al., 2020, Yu et al., 2020] are adopted as the temporal loss L_time and the frequency loss L_CE, respectively. Comprehensive experiments are conducted on four benchmark datasets (VIPL-HR [Niu et al., 2019a], MAHNOB-HCI, MMSE-HR, and OBF [Li et al., 2018]). Besides, comprehensive ablations about PhysFormer and PhysFormer++ are also investigated on the VIPL-HR dataset.
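The exponential increment of the frequency-loss weight can be sketched as below. The exact exponent form (η raised to the fraction of training completed) is an assumption consistent with the stated hyperparameters β_0 = 1.0 and η = 5.0, under which β grows from 1 to 5 over training.

```python
# Dynamic supervision sketch: the frequency-loss weight beta rises
# exponentially from beta0 to beta0 * eta across training epochs.

def beta_schedule(epoch, total_epochs, beta0=1.0, eta=5.0):
    return beta0 * (eta ** (epoch / total_epochs))

def overall_loss(l_time, l_ce, l_ld, epoch, total_epochs, alpha=0.1):
    # L_overall = alpha * L_time + beta * (L_CE + L_LD)
    beta = beta_schedule(epoch, total_epochs)
    return alpha * l_time + beta * (l_ce + l_ld)

betas = [beta_schedule(e, 25) for e in range(26)]  # monotonically rising 1 -> 5
```

Early epochs are dominated by the easy signal-trend constraint (small β), while the hard frequency constraints are phased in gradually, mirroring the easy-to-hard curriculum.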

Datasets and Performance Metrics
MMSE-HR [Tulyakov et al., 2016] is a dataset including 102 RGB videos from 40 subjects, with a raw resolution of 1040×1392 for each video. OBF [Li et al., 2018] is a high-quality dataset for remote physiological signal measurement. It contains 200 five-minute-long RGB videos recorded at 60 fps from 100 healthy adults. Example video frames from these four rPPG datasets are illustrated in Fig. 7.
For MAHNOB-HCI, as there is no available BVP ground truth, we first smooth the sharp ECG signals (with a 10-point averaging strategy) into pseudo BVP signals as ground truth. Specifically, to alleviate the incorrect synchronization between videos and ground truth signals in the MAHNOB-HCI, OBF, and VIPL-HR datasets, we first extract coarse green-channel signals by averaging the segmented facial skin in each frame. Then, we calculate the cross-correlation between the coarse green rPPG signals and the (pseudo) BVP signals, and use the maximum-correlation phase to calibrate/compensate the phase bias. Furthermore, we remove the samples with HR > 180 bpm in the VIPL-HR and MMSE-HR datasets because the ground truths of these samples are unreliable due to poor sensor contact (resulting in very noisy and fluctuating HRs).
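The phase-calibration step above can be sketched as a cross-correlation lag search. The synthetic sinusoids, the 30 fps rate, and the search window are illustrative assumptions; a real pipeline would feed in the extracted green-channel signal and the (pseudo) BVP ground truth.

```python
import math

def best_lag(a, b, max_lag):
    """Return the lag (in samples) by which b lags a, via cross-correlation."""
    def corr_at(lag):
        pairs = [(a[t], b[t + lag]) for t in range(len(a))
                 if 0 <= t + lag < len(b)]
        return sum(x * y for x, y in pairs)
    # Maximum-correlation phase over the search window.
    return max(range(-max_lag, max_lag + 1), key=corr_at)

fps, shift = 30, 5
green = [math.sin(2 * math.pi * 1.2 * t / fps) for t in range(150)]          # coarse rPPG
bvp = [math.sin(2 * math.pi * 1.2 * (t - shift) / fps) for t in range(150)]  # delayed GT
lag = best_lag(green, bvp, max_lag=10)  # recovers the 5-sample phase bias
```

Once the lag is found, shifting the ground-truth signal by that many samples compensates the synchronization error between the video and the physiological recording.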
In terms of evaluation metrics, the average HR estimation task is evaluated on all four datasets, while the HRV and RF estimation tasks are evaluated on the high-quality OBF [Li et al., 2018] dataset. Specifically, we follow existing methods [Yu et al., 2019b, Niu et al., 2020, Lu et al., 2021] and report the low frequency (LF), high frequency (HF), and LF/HF ratio for HRV and RF estimation. We report the most commonly used performance metrics for evaluation, including the standard deviation (SD), mean absolute error (MAE), root mean square error (RMSE), and Pearson's correlation coefficient (r).

Implementation Details
Both PhysFormer and PhysFormer++ are implemented in PyTorch. For each video clip, the MTCNN face detector [Zhang et al., 2016] is used to crop the enlarged face area in the first frame, and the region is fixed through the following frames. The videos in MAHNOB-HCI and OBF are downsampled to 30 fps for efficiency. The number of temporal difference transformer blocks N = 12, number of transformer heads h = 4, channel dimension D = 96, and hidden dimension in ST-FF D′ = 144 are used for PhysFormer, while the temporal difference coefficient θ = 0.7 and attention sparsity τ = 2.0 are used for TD-MHSA. λ = 0.5 is utilized in the TD-MHPSA. The targeted tube size Ts×Hs×Ws equals 4×4×4. For the R_peak calculation in Eq. (12), the 'findpeaks()' function in Matlab is used for BVP peak detection, and the detected peak locations are then extended with their successive ±3 neighbors.

Intra-dataset Testing
In this subsection, two datasets (VIPL-HR and MAHNOB-HCI) are used for intra-dataset testing on HR estimation while the OBF dataset is used for intra-dataset HR, HRV and RF estimation.
HR estimation on VIPL-HR.Here we follow [Niu et al., 2019a] and use a subject-exclusive 5-fold cross-validation protocol on VIPL-HR.
As shown in Table 2, all three traditional methods (Tulyakov2016 [Tulyakov et al., 2016], POS [Wang et al., 2017] and CHROM [De Haan and Jeanne, 2013]) perform poorly due to the complex scenarios (e.g., large head movements and various illumination conditions) in the VIPL-HR dataset. In terms of deep learning based methods, the existing end-to-end learning based methods (e.g., PhysNet [Yu et al., 2019a]) perform worse than the non-end-to-end learning based methods (e.g., RhythmNet [Niu et al., 2019a], ST-Attention [Niu et al., 2019b], NAS-HR [Lu and Han, 2021], CVD [Niu et al., 2020], and Dual-GAN [Lu et al., 2021]). Such a large performance margin might be caused by the coarse and overfitted rPPG features extracted by the end-to-end models. In contrast, all five non-end-to-end methods first extract fine-grained signal maps from multiple facial ROIs, and then more dedicated rPPG clues are extracted via the cascaded models. Without the strict and heavy preprocessing procedures in [Niu et al., 2019a, Niu et al., 2019b, Lu and Han, 2021, Niu et al., 2020, Lu et al., 2021], the proposed PhysFormer and PhysFormer++ can be trained from scratch on facial videos directly, and achieve better or on-par performance with the state-of-the-art non-end-to-end learning based method Dual-GAN [Lu et al., 2021]. This indicates that PhysFormer and PhysFormer++ are able to learn intrinsic and periodic rPPG-aware features automatically. It can also be seen from Table 2 that the proposed PhysFormer family outperforms the VideoTransformer [Revanur et al., 2022] by a large margin, indicating the importance of local and global spatio-temporal physiological propagation.
In order to further check the correlations between the predicted HRs and the ground-truth HRs, we plot the HR estimation results against the ground truths in Fig. 8(a). From the figure we can see that the predicted HRs from PhysFormer++ and the ground-truth HRs are well correlated over a wide HR range from 47 bpm to 147 bpm.
Besides HR estimation, we also conduct experiments for three types of physiological signals, i.e., HR, RF, and HRV measurement, on the OBF [Li et al., 2018] dataset. Following [Yu et al., 2019b, Niu et al., 2020], we use a 10-fold subject-exclusive protocol for all experiments. All results are shown in Table 4.
It is clear that the proposed PhysFormer and PhysFormer++ outperform the existing state-of-the-art traditional (ROI green [Li et al., 2018], CHROM [De Haan and Jeanne, 2013], POS [Wang et al., 2017]) and end-to-end learning (rPPGNet [Yu et al., 2019b]) methods by a large margin on all evaluation metrics for HR, RF and all HRV features. The proposed PhysFormer and PhysFormer++ also give more accurate estimates of HR, RF, and LF/HF than the preprocessed-signal-map based non-end-to-end method CVD [Niu et al., 2020]. These results indicate that the PhysFormer family can not only handle the average HR estimation task but also give promising rPPG signal predictions for RF measurement and HRV analysis, showing its potential in many healthcare applications.
We also check the short-time HR estimation performance in the post-exercise scenario on OBF, in which the subject's HR decreases rapidly. Two examples are given in Fig. 8(b). It can be seen that PhysFormer++ follows the trend of HR changes well, indicating that the proposed model is robust in scenarios with significant HR changes. We further check the predicted rPPG signals of PhysFormer++ for these two examples in Fig. 8(c). From the results, we can see that the proposed method gives an accurate prediction of the inter-beat intervals (IBIs), and thus a robust estimation of RF and HRV features.
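Deriving IBIs and simple HRV-style statistics from detected peak locations can be sketched as below; the feature definitions (mean HR, SDNN, RMSSD) are standard ones, and the paper's exact HRV pipeline may differ:

```python
import numpy as np

def ibi_features(peak_frames, fps=30.0):
    """Compute inter-beat intervals (IBIs) from rPPG peak frame
    indices and derive basic HRV-style statistics."""
    ibis = np.diff(peak_frames) / fps              # seconds between beats
    hr_bpm = 60.0 / ibis.mean()                    # average heart rate
    sdnn = ibis.std(ddof=1)                        # overall IBI variability
    rmssd = np.sqrt(np.mean(np.diff(ibis) ** 2))   # beat-to-beat variability
    return {"hr_bpm": hr_bpm, "sdnn": sdnn, "rmssd": rmssd}
```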

Cross-dataset Testing
Besides the intra-dataset testings on the VIPL-HR, MAHNOB-HCI, and OBF datasets, we also conduct cross-dataset testing on MMSE-HR [Tulyakov et al., 2016] following the protocol of [Niu et al., 2019a]. The models trained on VIPL-HR are directly tested on MMSE-HR. The results of the proposed PhysFormer family and the state-of-the-art methods are shown in Table 5. It is clear that PhysFormer and PhysFormer++ generalize well to unseen domains (e.g., skin tone and lighting conditions). It is worth noting that PhysFormer++ achieves the lowest SD (5.09 bpm), MAE (2.71 bpm) and RMSE (5.15 bpm) as well as the highest r (0.93) among the traditional, non-end-to-end learning and end-to-end learning based methods.

Ablation Study
Here we provide the results of ablation studies for HR estimation on Fold-1 of the VIPL-HR [Niu et al., 2019a] dataset. Specifically, we first evaluate the impacts of the architecture configurations for PhysFormer in terms of 'Tube Tokenization', 'TD-MHSA' and 'ST-FF'. Then, based on the optimal configuration of PhysFormer, the impacts of the architecture configurations of PhysFormer++ with 'TD-MHPSA' and the 'SlowFast architecture' are studied. Finally, we study the transformer configurations ('θ in TDC' and 'layer/head numbers') and the training recipes ('label distribution learning' and 'dynamic supervision') for the whole PhysFormer family (i.e., PhysFormer and PhysFormer++).

Impact of tube tokenization in PhysFormer.
In the default setting of PhysFormer, a shallow stem cascaded with tube tokenization is used. In this ablation, we consider four other tokenization configurations with or without the stem. It can be seen from the first row in Table 6 that the stem helps PhysFormer see better [Xiao et al., 2021]: the RMSE increases dramatically (+3.06 bpm) without the stem. We then investigate the impacts of the spatial and temporal domains in tube tokenization. The result in the fourth row with full spatial projection is quite poor (RMSE=10.61 bpm), indicating the necessity of spatial attention. In contrast, tokenization with a smaller temporal size (e.g., 2×4×4) or a smaller spatial input (e.g., 160×96×96) reduces performance only slightly. Based on these observations, tokenizations of 4×4×4 and 2×4×4 are adopted as the default settings for the slow and fast pathways in PhysFormer++, respectively.
Impact of TD-MHSA and ST-FF in PhysFormer. As shown in Table 7, both TD-MHSA and ST-FF play vital roles in PhysFormer. The result in the first row shows that the performance degrades sharply without spatio-temporal attention. Moreover, it can be seen from the last two rows that, replacing TD-MHSA/ST-FF with vanilla MHSA/FF, PhysFormer obtains 10.43/8.27 bpm RMSE. Thus, we can conclude that vanilla MHSA in the transformer cannot provide rPPG performance gains, although it captures long-term global spatio-temporal physiological features. In contrast, the proposed TD-MHSA benefits rPPG measurement via long-term global spatio-temporal aggregation guided by local spatio-temporal physiological clues. One important finding in this research is that the temperature τ strongly influences the MHSA. When τ = sqrt(D/h) as in previous ViT works [Dosovitskiy et al., 2021, Arnab et al., 2021], the predicted rPPG signals are unsatisfactory (RMSE=9.51 bpm). Regularizing τ with a smaller value enforces sparser spatio-temporal attention, which is effective for the quasi-periodic rPPG task.
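The effect of the temperature τ can be illustrated with a minimal single-head attention sketch; the shapes and the absence of multi-head splitting and temporal difference convolution are simplifications:

```python
import numpy as np

def attention_with_temperature(q, k, v, tau=2.0):
    """Temperature-controlled self-attention as in TD-MHSA: a fixed
    temperature tau (2.0 in the paper) replaces the usual sqrt(D/h)
    scaling, giving sparser (peakier) attention maps."""
    scores = q @ k.T / tau                        # token-to-token similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ v, attn
```

Because a smaller τ shrinks the denominator of each softmax row less than its maximum term, the attention distribution concentrates on fewer key tokens, matching the sparsity argument above.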
Impact of TD-MHPSA for different pathways in PhysFormer++. Based on the TD-MHSA in PhysFormer, PhysFormer++ further extends the slow pathway with the more periodic TD-MHPSA modules. From the results in Table 8 we can see that the TD-MHPSA with L_atten benefits periodic rPPG clue mining in the slow pathway while having limited effect on the fast pathway. This may be because the attention loss calculated from the periodic maps with higher temporal resolution in the fast pathway is inefficient at back-propagating rPPG-aware information. Thus, we only apply the TD-MHPSA in the slow pathway as the default setting for PhysFormer++.
Impact of the SlowFast architecture in PhysFormer++. Table 9 illustrates the ablations of the SlowFast two-pathway architecture in PhysFormer++. From the results of the first two rows we can see that such naive SlowFast rPPG models even achieve inferior performance (7.78/7.58 vs. 7.56 bpm RMSE) compared with the single-pathway PhysFormer. The unsatisfactory results might be caused by the lack of efficient rPPG feature interaction between the two pathways. We also conduct experiments with lateral connections at different levels and cross-attention based TD-MHCSA in the fast pathway. From Table 9 we can clearly find that both lateral connections and TD-MHCSA improve the performance remarkably. This is because the former brings more temporally fine-grained clues back to the slow pathway to alleviate rPPG information loss, while the latter leverages cross-attention features to refine the redundant rPPG features in the fast pathway. As shown in the last four rows of Table 10, although L_LD alone performs slightly worse (+0.12 and +0.13 bpm RMSE for PhysFormer and PhysFormer++, respectively) than L_CE, the best performance is achieved using both losses, indicating the effectiveness of explicit distribution constraints for alleviating extreme-frequency interference and propagating adjacent label knowledge. It is interesting to find from the last two rows for both PhysFormer and PhysFormer++ that using the real PSD distribution from ground-truth PPG signals as p leads to inferior performance, due to the lack of an obvious peak in the distribution and partial noise. We can also find from Fig. 9(a) that σ values ranging from 0.9 to 1.2 for L_LD achieve good performance.
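The label distribution target behind L_LD can be sketched as a Gaussian over discrete HR bins; the bin range [40, 180] bpm and the default σ here are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def hr_label_distribution(hr_gt, sigma=1.0, lo=40, hi=180):
    """Instead of a one-hot HR class, the target is a Gaussian over
    discrete 1-bpm HR bins centred at the ground-truth HR, spreading
    supervision to adjacent labels."""
    bins = np.arange(lo, hi + 1)
    p = np.exp(-0.5 * ((bins - hr_gt) / sigma) ** 2)
    return p / p.sum()                            # normalised distribution
```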
Impact of dynamic supervision for the PhysFormer family. Fig. 11 illustrates the testing performance of PhysFormer and PhysFormer++ on Fold-1 of VIPL-HR when training with fixed and dynamic supervision. It is clear that with the exponentially increased frequency loss, the models in the blue curves converge faster and achieve smaller RMSE. We also compare several fixed and dynamic strategies in Table 10.
The results in the first four rows indicate that 1) using a fixed higher β leads to poorer performance due to convergence difficulty; and 2) models with exponentially increased β perform better than those with linear increment.
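An exponentially increasing β schedule can be sketched as geometric interpolation between a start and an end weight; the constants are illustrative, as the text only specifies exponential increase:

```python
def dynamic_beta(epoch, total_epochs, beta0=0.1, beta_max=1.0):
    """Curriculum-style dynamic frequency supervision: the
    frequency-loss weight beta grows exponentially from beta0 at the
    first epoch to beta_max at the last epoch."""
    ratio = epoch / max(1, total_epochs - 1)
    return beta0 * (beta_max / beta0) ** ratio    # geometric interpolation
```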

Efficiency Analysis
Here we also investigate the computational cost compared with the baselines. The number of parameters and the multiply-accumulate operations (MACs) are shown in Table 11.

Visualization and Discussion
Visualization of the self-attention map. We visualize the attention maps from the last TD-MHSA module of PhysFormer (left) and the last TD-MHCSA module in the fast pathway of PhysFormer++ (right) in Fig. 12. The x and y axes of the attention map indicate the attention confidence from key and query tube tokens, respectively. From the attention maps activated on the video sample with limited head movement in Fig. 12(a), we can easily find periodic or quasi-periodic responses along both axes, indicating the periodicity of the intrinsic rPPG features from PhysFormer and PhysFormer++. To be specific, given the 530th tube token (in blue) from the forehead (spatial face domain) and a peak (temporal signal domain) location as a query, the corresponding key responses are illustrated by the blue line in the attention map. On the one hand, it can be seen from the key responses that the dominant spatial attentions focus on the facial skin regions and discard the unrelated background. On the other hand, the temporal localizations of the key responses are around peak positions in the predicted rPPG signals. All these patterns are reasonable: 1) the forehead and cheek regions [Verkruysse et al., 2008] have richer blood volume for rPPG measurement and are also reliable since these regions are less affected by facial muscle movements (e.g., from expressions and talking); and 2) the temporal positions around signal peaks carry the most salient quasi-periodic rPPG clues. We also visualize the attention maps from another video sample with serious head movement in Fig. 12(b). It can be observed from the left subfigure that the attentional response of PhysFormer is inaccurate (e.g., focusing on the neck region) when the head moves to the left. Another issue is that, due to the large temporal token size (Ts=4) in the tokenization stage, the temporal rPPG clues might be partially discarded, resulting in sensitivity to head movement and biased rPPG prediction (i.e., huge IBI gaps between the predicted rPPG and ground-truth BVP signals). In contrast, it can be seen from the right subfigure in Fig. 12(b) that the attentional response and the predicted rPPG signal from PhysFormer++ are reliable, indicating the effectiveness of the SlowFast architecture and advanced attention modules.
Overall, two limitations of the spatio-temporal attention can be concluded from Fig. 12. First, there are still some unexpected responses (e.g., continuous query tokens with similar key responses) in the attention map, which might introduce task-irrelevant noise and damage the performance. Second, the temporal attentions are not accurate under serious head movement scenarios, and some are coarse with phase shifts.
Visualization of the periodic attention map.
We also visualize the periodic attention map from the last TD-MHPSA module of PhysFormer++ in Fig. 13. It is interesting to find that the periodic attention maps from PhysFormer++ 1) trained without L_atten are more arbitrary and easily influenced by large head movement; and 2) trained with L_atten are more regular and keep their periodicity even under scenarios with serious head movement. In other words, the proposed TD-MHPSA with attention loss L_atten enforces PhysFormer++ to learn more periodic and robust attentional features from face videos.
Evaluation under serious motion, video compression, and low resolution. In real-world scenarios, large head movement, high video compression rates and low face resolution usually introduce serious motion noise, compression artifacts and blurriness, respectively. All these corruptions and quality degradations make rPPG measurement challenging. Here we evaluate the performance under these challenging scenarios. First, we evaluate the PhysFormer family under scenarios of large head movement (i.e., the 'v2' and 'v9' samples) on the VIPL-HR dataset. PhysFormer and PhysFormer++ achieve RMSEs of 11.46 bpm and 10.25 bpm, respectively. In other words, with richer temporally contextual rPPG clues, the two-pathway SlowFast architecture in PhysFormer++ is more motion-robust. Note that there is still a performance gap compared with the non-end-to-end method RhythmNet [Niu et al., 2019a] (RMSE=9.4 bpm).
Second, we evaluate the PhysFormer family on OBF with high compression rates (250/500/1000 kb/s) using the x264 codec. The corresponding HR measurement results are illustrated in Fig. 14(a).
Compared with rPPGNet [Yu et al., 2019b], the PhysFormer family performs significantly better when the bitrate equals 500 or 1000 kb/s. This might be because the spatio-temporal self-attention mechanism helps filter out the compression artifacts. However, all three methods perform poorly under the extremely high compression situation (i.e., bitrate=250 kb/s). Finally, we evaluate the PhysFormer family on VIPL-HR with different low-resolution settings to mimic the long-distance rPPG monitoring scenario. Specifically, bilinear interpolation is used to downsample the face frames to 16×16/32×32/64×64 first, and then upsample them back to 128×128. The HR measurement results are illustrated in Fig. 14(b). Training with fewer samples. Since end-to-end deep models (e.g., CNNs and transformers) are data hungry, here we investigate three methods (AutoHR [Yu et al., 2020], PhysFormer and PhysFormer++) under conditions of fewer training samples. As shown in Table 12, when training with only 10% or 50% of the samples, all three methods obtain poor RMSE performance (>10 bpm). Another observation is that, compared with the pure CNN-based AutoHR, the proposed PhysFormer++ still achieves on-par or better performance with fewer training samples. This indicates that the proposed transformer architectures can learn CNN-comparable rPPG representations even with limited data.
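The low-resolution protocol described above (bilinear down-sampling then up-sampling) can be sketched as follows; a minimal NumPy bilinear resize is used for self-containment, while the paper's exact resampling implementation is unspecified beyond 'bilinear interpolation':

```python
import numpy as np

def simulate_low_resolution(frame, low_size=16, out_size=128):
    """Mimic long-distance monitoring: bilinearly downsample a 2D face
    frame to low_size x low_size, then upsample back to out_size."""
    def resize(img, size):
        h, w = img.shape[:2]
        ys = np.linspace(0, h - 1, size)
        xs = np.linspace(0, w - 1, size)
        y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
        y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
        wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
        top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
        bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
        return top * (1 - wy) + bot * wy
    return resize(resize(frame, low_size), out_size)
```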

Conclusions
In this paper, we propose two end-to-end video transformer architectures, namely PhysFormer and PhysFormer++, for remote physiological measurement. With temporal difference transformers and elaborate supervision, the PhysFormer family achieves superior performance on benchmark datasets in both intra- and cross-dataset testings. Comprehensive ablation studies as well as visualization analysis demonstrate the effectiveness of the proposed methods. In the future, it is promising to explore more accurate yet efficient spatio-temporal self-attention mechanisms, especially for long-sequence rPPG monitoring. Besides the rPPG measurement task, we will investigate the effectiveness of the proposed temporal difference transformer for broader fine-grained or periodic video understanding tasks in computer vision (e.g., video action recognition and repetition counting).

Fig. 2 Framework of the PhysFormer. It consists of a shallow stem, a tube tokenizer, several temporal difference transformers, and an rPPG predictor head. The temporal difference transformer is formed from the Temporal Difference Multi-head Self-attention (TD-MHSA) and Spatio-temporal Feed-forward (ST-FF) modules, which enhance the global and local spatio-temporal representations, respectively. 'TDC' is short for temporal difference convolution [Yu et al., 2020, Yu et al., 2021d].

Fig. 3 Framework of the PhysFormer++ with two-stream SlowFast pathways. Different from PhysFormer, which uses only the slow pathway, PhysFormer++ extracts and fuses attentional features from both slow and fast pathways. Moreover, temporal difference periodic transformer blocks are used in the slow pathway. The information between the two pathways interacts via temporal difference cross-attention transformer blocks and lateral connections.

Fig. 4 Architectures of PhysFormer++. Inside the brackets are the filter sizes and feature dimensionalities. 'Conv' denotes vanilla 3D convolution. All convolutional layers (except the Tokenizer) have stride=1 and are followed by a BN-ReLU layer, while 'MaxPool' layers have stride=1×2×2.

Fig. 8 (a) Scatter plot of the ground-truth HR_gt and the predicted HR_pre via PhysFormer++ for all face videos in the VIPL-HR dataset. (b) Two examples of short-time HR estimation from PhysFormer++ for face videos with significantly decreasing HR. (c) Two example curves of the predicted rPPG signals from PhysFormer++ and the ground-truth ECG signals used to calculate the HRV features.

Fig. 10 Ablation of the (a) layers and (b) heads in PhysFormer and PhysFormer++.
Fig. 10 shows that the default layer and head numbers perform best, while fewer heads lead to sharp performance drops. Impact of label distribution learning for the PhysFormer family. Besides the temporal loss L_time and the frequency cross-entropy loss L_CE, ablations w/ and w/o the label distribution loss L_LD are shown in the last four rows of Table 10.

Fig. 11 Testing results of fixed and dynamic frequency supervisions for (a) PhysFormer and (b) PhysFormer++ on the Fold-1 of VIPL-HR.
Fig. 12 Visualization of the attention maps from (left) the 1st head in the last TD-MHSA module of PhysFormer and (right) the 1st head in the last TD-MHCSA module of the fast pathway in PhysFormer++. Given the 530th and 276th tube tokens (in blue) as queries for the video samples with (a) limited head movement and (b) serious head movement, representative key responses are illustrated (the brighter, the more attentive). The predicted downsampled rPPG signals as well as the ground-truth BVP signals are shown for temporal attention understanding.
Fig. 13 Visualization of the periodic attention maps from the 1st head in the last TD-MHPSA module of the slow pathway in PhysFormer++. The top row shows the periodic attention map for a facial video with limited head movement, while the bottom row shows one with serious head movement.
Fig. 14 HR results with different (a) compression bitrates on OBF, and (b) resolutions on VIPL-HR.
Despite performance drops at lower face resolutions for both AutoHR [Yu et al., 2020] and the PhysFormer family, PhysFormer++ still achieves RMSE=9.58 bpm at the lowest (16×16) resolution setting.

Table 1
Summary of the representative rPPG measurement methods in terms of traditional, non-end-to-end learning, and end-to-end learning categories.
are adopted as L_time and L_CE, respectively. With the dynamic supervision, PhysFormer and PhysFormer++ can perceive the signal trend better at the beginning, and such a warm-up facilitates the gradually stronger frequency knowledge learning later.

Table 2
Intra-dataset testing results on the VIPL-HR dataset. Traditional, non-end-to-end learning based and end-to-end learning based methods are marked with different symbols. Best results are in bold and second-best are underlined.

Table 3
Intra-dataset results on the MAHNOB-HCI dataset.

Table 4
Performance comparison of HR and RF measurement as well as HRV analysis on the OBF dataset.

Table 5
Cross-dataset results on the MMSE-HR dataset.

Table 6
Ablation of Tube Tokenization of PhysFormer.The three dimensions in tensors indicate length× height×width.

Table 7
Ablation of TD-MHSA and ST-FF in PhysFormer.

Table 8
Ablation of TD-MHPSA for the single pathway configuration in PhysFormer++.

Table 9
Ablation of SlowFast two-pathway based architecture in PhysFormer++.

Table 10
Ablation of dynamic loss in the frequency domain for PhysFormer and PhysFormer++. The temporal loss L_time has fixed α=0.1 here. 'CE' and 'LD' denote cross-entropy and label distribution, respectively.
Towards efficient mobile-level rPPG applications, the computational cost of the proposed PhysFormer family is still unsatisfactory. One potential future direction is to design a more lightweight PhysFormer with advanced network quantization [Lin et al., 2021b] and binarization [Qin et al., 2022] techniques.

Table 12
HR results (RMSE (bpm)) when training with different proportions of samples on VIPL-HR.