Background

Heart rate variability (HRV) provides useful physiological parameters derived from the beat-to-beat intervals (R–R intervals, RRI) obtained from cardiac pulse signals. Several publications have validated that abnormalities of HRV are related to cardiological and noncardiological diseases, e.g., myocardial infarction, diabetic neuropathy, and myocardial dysfunction [1]. The conventional way to obtain the cardiac pulse is electrocardiography (ECG), which senses the electrical activity of the heart over a period of time through electrodes attached to the surface of the skin. ECG provides clean and accurate pulse signals; however, it is prone to interference from the electrical activity produced by skeletal muscles near the electrodes.

Another way to obtain the cardiac pulse is photoplethysmography (PPG), which was first described in the 1930s [2]. PPG detects the optical absorption variations of the human skin due to blood volume variations. Both ECG and PPG require contact with the skin, which makes them unsuitable for cases of extreme sensitivity, e.g., neonates and skin-damaged patients, or when a non-contact property is required (surveillance, fitness, etc.). Peng et al. [3] proposed an alternative method for extracting the PPG signal through a smart-phone camera and computing the HRV from it. However, this method still requires the subjects to put a finger on the smart-phone camera and keep themselves static, so it shares the disadvantages of the traditional PPG device. Recent works have shown that the cardiac pulse rate can be measured in a non-contact way, also known as remote PPG (rPPG) [4–12]. These works obtain pulse signals under ambient light with only one camera, which is low-cost, simple, and effective.

The main idea of rPPG is that the blood volume variations can be captured during video recording. The earlier works [4, 5] first obtain the mean intensity of the skin region and then perform frequency analysis (Fourier or wavelet transform) to estimate the pulse rate. Recent works [6–12] estimate the pulse rate using a regular color video camera. The first step of these methods is to locate the region of interest by manual selection or automatic face detection, followed by different analysis algorithms to extract the pulse signals, e.g., difference of RGB [6], source separation [7–9], chrominance [10, 12], and motion magnification [11].

Poh et al. [7] proposed an algorithm for heart rate (HR) measurement. They first detected the face in every frame and extracted the mean RGB color values to form a three-dimensional time series. They then applied independent component analysis (ICA) [13, 14] to separate the independent sources of these RGB signals, which may contain the pulse signal, followed by an FFT; the frequency with the maximum amplitude in the spectrum of the component with the highest spectral peak is selected as the HR. Later on, the authors proposed a similar method in [8] to extract the R–R intervals by finding the peaks of the pulse signal. The peaks of the pulse signal are treated as the R waves of an ECG signal, and the peak intervals are treated as R–R intervals. Alternatively, one may apply PCA, as shown in [9], instead of ICA to separate the pulse signal from the RGB time series. Wu et al. [11] proposed an Eulerian-based motion magnification to magnify subtle motions or color changes in the temporal domain using a Laplacian pyramid. This method is able to obtain a clean pulse signal if the subjects are almost static. Haan and Jeanne [10] proposed a chrominance-based remote PPG (denoted “C-rPPG” in the rest of this paper) which takes different factors into account to model the colors captured by the camera. The pulsatility as a function of wavelength exhibits a strong peak in green and dips in red [15, 16]. To exploit this fact and to reduce the specular reflection problem mentioned in [17], they proposed a model using differences of weighted color channels to obtain chrominance signals. This method is robust to different skin tones and adaptive to non-white illumination. Moreover, the authors showed impressive HR estimation results for scenarios with subjects exercising on a stationary bike. Wang et al. [12] proposed an algorithm exploiting the spatial redundancy of the image sensor together with the chrominance idea to improve the robustness to motion.

In indoor scenes, the lighting sources are usually above the subjects, i.e., on the ceiling. The color intensities or brightness of the skin captured by the camera therefore vary across positions. Different angles between the lighting sources and the camera also result in intensity variations. These periodic or non-periodic variations produce artifacts which severely influence most rPPG algorithms. The methods based on source separation [7–9] may separate the artifacts rather than the true pulse signal. Moreover, the component with the highest spectral amplitude is not necessarily the true pulse signal. The Eulerian motion magnification [11] requires the subject to be as stationary as possible; otherwise, any motion in the specific frequency band will be magnified accordingly. Hence motion magnification is not appropriate for general scenarios. The C-rPPG [10] improved the robustness to motion and performs much better in HR estimation with non-static subjects. However, we found that noises and artifacts in the C-rPPG signal produce false peaks and severely influence the accuracy of the R–R intervals. The work in [12] further improves the motion robustness of the C-rPPG algorithm by adaptively combining local PPG signals, and improves the SNR using an adaptive band-pass filter. This more elaborate C-rPPG concept, however, leads to an increase in computational complexity, which we consider less attractive for a mobile platform.

This paper proposes a non-contact method to estimate accurate HRV metrics from 30 fps video clips captured by the frontal camera of off-the-shelf smart phones. The face of the subject is located in every frame, and all the skin pixels are averaged to form the RGB time series. The RGB time series are then used to compute the pulse signal with the C-rPPG [10] algorithm. We propose a denoising method based on the continuous wavelet transform (CWT) to increase the robustness to interference. The R–R intervals are obtained by computing the intervals of successive peaks in the denoised signal. To demonstrate the performance of HRV measurement, we recorded 12 video clips covering different scenarios with static and non-static subjects and used an ECG-accurate chest band to obtain the R–R intervals as the ground truth. Compared with existing approaches, the absolute errors of the HRV metrics generated by the proposed approach are relatively low. For the video clips with static subjects, the mean absolute error of the HRV metrics obtained by our method is only 3.53 ms.

Methods

Overview

Fig. 1 The processing flow of the proposed algorithm

Figure 1 shows the overall flow chart of the proposed algorithm. For each frame of the video, we first detect the face based on the nose position to increase the robustness to non-frontal faces. Once the face is located, we perform skin detection in the YCbCr color space and then average the RGB channels of the skin pixels in the face region to form the time series. To increase the fineness of the time grid, we upsample the time series by a factor of eight, i.e., the sampling rate increases from 30 to 240 Hz. After the data acquisition stage, we compute the C-rPPG signal to extract the pulse signal. Next, we apply our denoising technique based on the CWT. Finally, the peaks in the denoised signal are detected to compute the R–R intervals.
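To make the flow concrete, a minimal end-to-end MATLAB sketch is given below. The helper functions detectFaceRegion, meanSkinRGB, crppgSignal, and cwtMaxDenoise are hypothetical names for the steps detailed in the following subsections, and the toolbox dependencies and parameter values are our assumptions rather than fixed choices of the method.

```matlab
% End-to-end sketch of Fig. 1 (assumes the Computer Vision, Image Processing,
% Signal Processing, and Wavelet Toolboxes; helper names are hypothetical).
vr  = VideoReader('clip.mp4');                 % 30 fps frontal-camera clip
rgb = [];
while hasFrame(vr)
    frame = readFrame(vr);
    bbox  = detectFaceRegion(frame);           % nose-based localization, Eq. (1)
    rgb(end+1, :) = meanSkinRGB(frame, bbox);  % skin detection + averaging, Eq. (2)
end
rgbUp = [interp(rgb(:,1), 8), interp(rgb(:,2), 8), interp(rgb(:,3), 8)];  % 30 -> 240 Hz
s = crppgSignal(rgbUp, 240);                   % chrominance pulse signal, Eqs. (4)-(8)
y = cwtMaxDenoise(s, 240, 20);                 % CWT-MAX with T = 20 s, Eq. (13)
[~, locs] = findpeaks(y, 'MinPeakDistance', round(0.25 * 240));
rri = diff(locs) / 240 * 1000;                 % R-R intervals in ms, Eq. (14)
```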

Data acquisition

First of all, we locate the face in every frame to extract the color signals. There are plenty of face detection works and surveys [18–21]. For simplicity and convenience, one may apply the face detector proposed by Viola and Jones [18], which is effective and efficient at locating faces in frames. However, this face detector fails if the faces in the video are non-frontal. We found that detecting the nose is more stable than detecting the face, so we exploit the nose position to derive an appropriate face region. We use the object detection toolbox (vision.CascadeObjectDetector) built into MATLAB to detect the nose in every frame. The region of the face is then determined by

$$\begin{aligned} w_f &= 2w_n \\ h_f &= 3h_n \\ x_f &= x_n + w_n/2 - w_f/2 \\ y_f &= y_n + h_n/2 - h_f/1.5 \end{aligned}$$
(1)

where w and h are the width and height, and (x, y) is the top-left coordinate of the bounding box. The subscripts n and f represent the “nose” and the “face”, respectively.
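For illustration, a minimal MATLAB sketch of this step follows. The 'Nose' classification model is a built-in option of vision.CascadeObjectDetector; keeping only the first detection and rounding the resulting box are our assumptions.

```matlab
% Derive the face bounding box from the detected nose box via Eq. (1).
detector = vision.CascadeObjectDetector('Nose');
bb = step(detector, frame);        % one [x y w h] row per detection
if ~isempty(bb)
    bb = bb(1, :);                 % assumption: keep the first detection
    wf = 2 * bb(3);                % w_f = 2 w_n
    hf = 3 * bb(4);                % h_f = 3 h_n
    xf = bb(1) + bb(3)/2 - wf/2;   % x_f = x_n + w_n/2 - w_f/2
    yf = bb(2) + bb(4)/2 - hf/1.5; % y_f = y_n + h_n/2 - h_f/1.5
    faceBox = round([xf, yf, wf, hf]);
end
```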

Next, we use a simple skin color detection to ensure that the processed data are obtained from skin pixels. There are many works on skin detection one may refer to, e.g., the method proposed in [22]. We take into account only the Cb and Cr components to detect the skin color. A pixel is classified as a skin pixel if it satisfies the following conditions:

$$\begin{aligned} \begin{array}{l} 98 \le Cb \le 142 \\ 133 \le Cr \le 177 \end{array} \end{aligned}$$
(2)

After skin detection, we record the averaged RGB values of the skin pixels in the ROI to form the time series. Finally, we upsample the time series by a factor of eight.
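A possible MATLAB sketch of the skin masking and channel averaging for one cropped face region is shown below; it assumes uint8 input for rgb2ycbcr, which matches the Cb/Cr ranges used in (2), and factor-8 upsampling with the Signal Processing Toolbox function interp.

```matlab
% Skin detection in YCbCr (Eq. (2)) and RGB averaging for one face crop.
ycc  = rgb2ycbcr(face);                        % face: uint8 RGB crop
cb   = double(ycc(:,:,2));
cr   = double(ycc(:,:,3));
skin = cb >= 98 & cb <= 142 & cr >= 133 & cr <= 177;
meanRGB = zeros(1, 3);
for c = 1:3                                    % mean of R, G, B over skin pixels
    chan = double(face(:,:,c));
    meanRGB(c) = mean(chan(skin));
end
% After all frames: upsample each 30 Hz channel by 8 to reach 240 Hz, e.g.
% rgbUp(:, c) = interp(rgbSeries(:, c), 8);
```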

Computing C-rPPG

Inspired by previous works, we apply the chrominance-based method in our algorithm due to its better performance in extracting the true pulse signal rather than false ones. We apply the model \(X_s - \alpha Y_s\) proposed in [10], which is briefly reviewed in the following.

In [10], the intensity of a given pixel in the i-th frame in color channel \(C \in \{R,G,B\}\) registered by the camera is modeled as

$$\begin{aligned} C_i = I_{C_i}(\rho _{C_{dc}} + \rho _{C_i} + s_i) \end{aligned}$$
(3)

where \(I_{C_i}\) is the intensity of the light source integrated over the exposure time of the camera, \(\rho _{C_{dc}}\) is the stationary part of the reflection coefficient of the skin, \(\rho _{C_i}\) is the zero-mean time-varying fraction caused by the pulsation of the blood volume, and \(s_i\) is the additive specular reflection contribution.

The RGB data are normalized using the following formula

$$\begin{aligned} C_{ni} = \frac{C_i}{\mu (C_i)}, C \in \{R,G,B\} \end{aligned}$$
(4)

where \(\mu (C_i)\) is a moving average centered around frame index i. The chrominance signals are defined as follows

$$\begin{aligned} \begin{array}{l} X_s = 3R_n-2G_n \\ Y_s = 1.5R_n + G_n - 1.5B_n \end{array} \end{aligned}$$
(5)

Finally, the pulse signals can be extracted by

$$\begin{aligned} S = X_f - \alpha Y_f \end{aligned}$$
(6)

with

$$\begin{aligned} \alpha = \frac{\sigma (X_f)}{\sigma (Y_f)} \end{aligned}$$
(7)

where \(\sigma (\cdot )\) is the standard deviation of a signal, and the subscript f denotes the band-pass filtered version. We can further rewrite (6) as follows

$$\begin{aligned} S = 3(1-\frac{\alpha }{2})R_f - 2(1+\frac{\alpha }{2})G_f + \frac{3\alpha }{2}B_f \end{aligned}$$
(8)
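The following MATLAB sketch illustrates Eqs. (4)–(8) on the upsampled RGB series (an N × 3 matrix); the moving-average window length and the Butterworth band-pass design are our assumptions, since they are not fixed above.

```matlab
% C-rPPG pulse extraction sketch, Eqs. (4)-(8).
fs  = 240;
win = round(1.6 * fs);                        % assumed window for the moving mean
Rn  = rgbUp(:,1) ./ movmean(rgbUp(:,1), win); % Eq. (4): normalize each channel
Gn  = rgbUp(:,2) ./ movmean(rgbUp(:,2), win);
Bn  = rgbUp(:,3) ./ movmean(rgbUp(:,3), win);
Xs  = 3*Rn - 2*Gn;                            % Eq. (5): chrominance signals
Ys  = 1.5*Rn + Gn - 1.5*Bn;
[b, a] = butter(3, [0.75 4] / (fs/2), 'bandpass');  % assumed filter design
Xf = filtfilt(b, a, Xs);
Yf = filtfilt(b, a, Ys);
alpha = std(Xf) / std(Yf);                    % Eq. (7)
S = Xf - alpha * Yf;                          % Eq. (6): the pulse signal
```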

CWT-based denoising method

The CWT transforms a time series into a time-frequency representation and has been used to denoise PPG signals in several works [23–25]. The CWT uses the inner product to measure the similarity between a signal and a specific analysis function; it outperforms the Fourier transform and the short-time Fourier transform because its multi-scale representation can detect rapid changes in frequency. We briefly review the theory of the CWT and describe our CWT-based denoising method in the following.

The CWT correlates a signal x(t) with child wavelets \(\psi _{\tau ,s}(t)\), which are scaled and translated versions of the mother wavelet \(\psi (t)\),

$$\begin{aligned} X_w(\tau ,s) = \int _{-\infty }^{\infty } x(t) \psi _{\tau ,s}(t) dt \end{aligned}$$
(9)

\(X_w(\tau ,s)\) represents the similarity between the signal x(t) and a child wavelet scaled by s and translated by \(\tau\), which is defined as follows:

$$\begin{aligned} \psi _{\tau ,s}(t) = \frac{1}{\sqrt{|s|}}\psi \left( \frac{t-\tau }{s} \right) \end{aligned}$$
(10)

There are many standard mother wavelets available in the CWT literature. We selected the Morlet wavelet in our algorithm since it has been used to analyze PPG signals in [25]. The signal can be reconstructed from the wavelet transform by the inverse formula (11):

$$\begin{aligned} x(t) = \frac{1}{C_{\psi }} \int _{0}^{\infty } \int _{-\infty }^{\infty } \frac{1}{s^2} X_w(\tau ,s)\frac{1}{\sqrt{|s|}}\psi \left( \frac{t-\tau }{s} \right) d\tau ds \end{aligned}$$
(11)

where \(C_{\psi }\) is the admissibility constant of the wavelet transform. Let \(\hat{\psi }(\xi )\) denote the Fourier transform of \(\psi (t)\); the admissibility constant is defined as follows:

$$\begin{aligned} C_{\psi } = \int _{0}^{\infty }\frac{|\hat{\psi }(\xi )|^2}{|\xi |} d\xi < \infty \end{aligned}$$
(12)

One may retain the coefficients of the scales corresponding to the observed frequency band (0.75, 4) Hz [(45, 240) bpm], set the others to zero, and apply the inverse transform, which is equivalent to band-pass filtering (we call it “CWT-BP”). However, motion artifacts usually lie in the same frequency band; hence false peaks produced by motion artifacts remain in the reconstructed signal.

Assuming the pulse signal is the most significant component of the C-rPPG signal, our goal is to select a representative scale to reconstruct the pulse signal. We compute the sum of the magnitudes of the CWT coefficients at each scale within a time interval, and then select the scale with the maximal sum, i.e.,

$$\begin{aligned} s^* = \arg \max _s \sum _{\tau } |X_w(\tau , s)| \end{aligned}$$
(13)

where \(s^*\) is the optimal scale to reconstruct the pulse signal. The CWT coefficients belonging to the scale \(s^*\) are retained and the others are set to zero. In practice, we must take computational efficiency into account, so we divide the CWT coefficients into non-overlapping time intervals of length T (seconds) and select a representative scale for every time interval. Another factor to consider is the non-stationary nature of cardiac activity, hence the value of T should be carefully selected. A smaller T can track the variation of cardiac activity but is less robust to strong interference such as motion artifacts, and vice versa. We suggest setting T in the range of 10–30 s. After selecting the optimal scale for every time interval, the pulse signal is reconstructed by the inverse CWT. We denote this method as “CWT-MAX” in the following; a sketch is given below.
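A compact MATLAB sketch of CWT-MAX follows; it assumes the cwt/icwt interface of the Wavelet Toolbox (R2016b or later) with the analytic Morlet wavelet ('amor'), and restricting the scale search to the (0.75, 4) Hz band is our reading of the discussion above.

```matlab
% CWT-MAX denoising sketch: keep one representative scale per window (Eq. (13)).
function y = cwtMaxDenoise(s, fs, T)
    [wt, f] = cwt(s, 'amor', fs);          % coefficients: scales x time
    band   = f >= 0.75 & f <= 4;           % observed pulse band, 45-240 bpm
    mask   = zeros(size(wt));
    winLen = round(T * fs);                % window length in samples
    for k = 1:winLen:numel(s)
        idx = k:min(k + winLen - 1, numel(s));
        e = sum(abs(wt(:, idx)), 2);       % per-scale magnitude sum, Eq. (13)
        e(~band) = -Inf;                   % search only inside the band
        [~, sStar] = max(e);               % representative scale for this window
        mask(sStar, idx) = 1;              % retain only that scale
    end
    y = icwt(wt .* mask, 'amor');          % inverse CWT of the retained coefficients
end
```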

Fig. 2 Examples of using the CWT to detrend and denoise. a The original C-rPPG signal. b The CWT coefficients of the original signal; the black solid line denotes the representative frequencies (scales) of the pulse signal computed by (13). c The zoomed-in part of the original signal. d The zoomed-in part of the signal denoised by CWT-BP. e The zoomed-in part of the signal denoised by CWT-MAX

Figure 2 shows an example to demonstrate our approach. The original signal shown in Fig. 2c is noisy, with many false peaks. After applying the CWT to the original signal, we obtain the CWT coefficients shown in Fig. 2b. The black line represents the coefficients of the optimal scales of every time interval. The CWT-BP retains all the coefficients in the observed band [(0.75, 4) Hz] and sets the others to zero, which smooths and denoises the original signal as shown in Fig. 2d; however, it still retains some false peaks which may degrade the accuracy of the R–R intervals. In contrast, the CWT-MAX retains only the coefficients of the representative scale of each time interval and sets the others to zero. The signal reconstructed by CWT-MAX is much cleaner, as shown in Fig. 2e. Therefore, we applied the CWT-MAX in our algorithm.

Peak detection and R–R intervals

Fig. 3 R–R intervals of the example in Fig. 2. The red dotted line shows the R–R intervals measured by an ECG-accurate chest band. The blue dotted line shows the R–R intervals computed by our method

After the CWT denoising, the proposed approach detects the peaks in the denoised pulse signal to compute the R–R intervals. One may simply use the findpeaks function built into MATLAB, or a customized peak-finding algorithm. Since the CWT-denoised signal is almost noise-free, the choice of peak-finding algorithm does not play a crucial role in our results. After peak detection, let \(p_k\) be the time instance of the k-th peak in the signal; the R–R intervals can be calculated by

$$\begin{aligned} RRI_k = p_k-p_{k-1} \end{aligned}$$
(14)

which are generally expressed in milliseconds (ms). Figure 3 shows the R–R intervals of the example in Fig. 2 computed by our method together with the R–R intervals measured by an ECG-accurate chest band.
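As a minimal illustration, the R–R intervals can be obtained from the denoised signal y as follows; the 0.25 s minimum peak distance corresponds to the 240 bpm upper bound of the observed band and is our assumption.

```matlab
% Peak detection and R-R interval computation, Eq. (14).
fs = 240;                                    % sampling rate after upsampling
[~, locs] = findpeaks(y, 'MinPeakDistance', round(0.25 * fs));
rri = diff(locs) / fs * 1000;                % successive peak intervals in ms
```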

Experimental setup

Table 1 Descriptions of the video clips in each category

We recorded a total of 12 one-minute video clips to evaluate the performance of R–R interval extraction. The clips were recorded with the frontal camera of a smart phone (Sony Xperia Z1) at a 30 fps frame rate and a resolution of 640 × 480. We simulated the scenario in which the video records the subjects while they use a smart phone or tablet to monitor their cardiac physiology. The video clips are classified into four categories: “static subjects”, “static subject with makeup”, “occasional motion”, and “frequent motion”. Each category has 2–4 clips with different subjects or slightly different conditions. Note that the word “static” here means the subjects kept their bodies static, although slight movements (e.g., talking, facial expressions, slight shaking) were allowed. The detailed descriptions of the video clips are listed in Table 1.

Table 2 Descriptions of all the subjects

A total of six subjects in the 22–28 age range were involved in the experiments. There are four subjects (two males and two females) in the “static subjects” category, one female subject in the “static subject with makeup” category, and one male subject in both the “occasional motion” and “frequent motion” categories. For more details on the subjects, see Table 2. The makeup products used in the experiments are a CC cream (CLINIQUE Moisture Surge CC cream hydrating colour corrector broad spectrum SPF30), a powder foundation (DiorSnow Sublissime SPF30 PA+++), and a blush powder (Christian Dior Diorshow Powder Backstage Makeup Color in a flash loose powder 0.17oz/5g 003 Catwalk Pink). This study was approved by the China Medical University and Hospital Research Ethics Committee. All the subjects signed an informed consent allowing the authors to publish their HRV data.

We also used an ECG-accurate chest band (R1 Blue Comfortex+, made by Sigma Sport) during video recording to obtain the ground truth of cardiac activity, and exported the R–R intervals for the following comparisons. Because we aim at implementations suitable for smart-phone applications, we only compared the proposed method with algorithms of similar computational cost, namely the ICA-based method [8] and the original C-rPPG [10]. Since the authors did not release their source code, we implemented the algorithms described in their papers as faithfully as we could. We implemented the same band-pass filter as in [8] for the original C-rPPG signal. For fair comparisons, we applied the same peak detection function (findpeaks) built into MATLAB to all the methods in the following experiments. Note that we do not post-process the R–R intervals, whether they are reasonable or not. All the algorithms are implemented in MATLAB.

Results and discussions

Quantitative evaluation

To show the accuracy of HRV estimation, this paper makes comparisons with existing works using well-known HRV metrics [26]. The scatter plot of R–R intervals is a good tool to show the relationship between \(RRI_n\) and \(RRI_{n+1}\) and thereby evaluate HRV. The scatter plot is a 2-D \((RRI_n, RRI_{n+1})\) plot, in which the eigenvalues of the covariance of the points are useful in the following comparisons. The square root of an eigenvalue describes the standard deviation along the direction of the corresponding eigenvector. In this paper we denote by SD1 the square root of the smallest eigenvalue and by SD2 that of the largest in our HRV comparisons. The time-domain HRV metrics used here are the standard deviation of R–R intervals (SDNN), the root mean square of successive differences (RMSSD), and the standard deviation of successive differences (SDSD). All the HRV metrics mentioned above are in milliseconds (ms).
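For reference, a minimal MATLAB sketch of these metrics from a column vector of R–R intervals (in ms) is given below; computing SD1 and SD2 from the eigenvalues of the covariance of the \((RRI_n, RRI_{n+1})\) pairs is the standard Poincaré-plot construction.

```matlab
% Time-domain and Poincare HRV metrics from a column vector rri (ms).
function m = hrvMetrics(rri)
    d = diff(rri);
    m.SDNN  = std(rri);                    % std of R-R intervals
    m.RMSSD = sqrt(mean(d.^2));            % RMS of successive differences
    m.SDSD  = std(d);                      % std of successive differences
    P  = [rri(1:end-1), rri(2:end)];       % 2-D (RRI_n, RRI_{n+1}) points
    ev = sort(eig(cov(P)));                % covariance eigenvalues, ascending
    m.SD1 = sqrt(ev(1));                   % spread along the minor axis
    m.SD2 = sqrt(ev(2));                   % spread along the major axis
end
```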

Results and discussions of each category

Static subjects

Table 3 The HRV metrics estimated by different methods in the “static subjects” category

The HRV metrics of the “static subjects” clips estimated by the chest band and by the different methods are listed in Table 3. In general, the pulse signal extracted by C-rPPG [10] performs better than the one extracted by ICA [8]. Our method builds on C-rPPG, and its HRV metrics are very close to the ones measured by the chest band (see the absolute errors). Ideally, clips with static subjects have no motion artifacts. However, as mentioned above, the HRV metrics are computed from R–R intervals, which are very sensitive to false peaks in noisy signals. The proposed CWT-based denoising method removes most of the interference; hence, the R–R intervals are reliable and close to the ground truth even when the subjects keep making facial expressions.

Static subjects with makeup

Table 4 The HRV metrics estimated by different methods in the “static subjects with makeup” category

In this category, we experimented on cases in which the subject had different kinds of makeup on her face. Table 4 shows the HRV results estimated by the different methods. The ICA-based method and C-rPPG deviated from the ground truth, while our method still achieved much lower errors. In these experiments, we observe that makeup may interfere with the pulse signal extraction. However, this interference does not degrade the results of our method because our technique successfully removes the noise and artifacts.

Occasional motion

Fig. 4 The illumination changes of the face in the “motion_O3” clip

Table 5 The HRV metrics estimated by different methods in the “occasional motion” category

In the clips of this category, the subjects moved their body or head fewer than three times, similar to the regular motions we make in daily life. Table 5 shows the results of this category. Our method deviated severely from the ground truth only in the “motion_O3” clip. To explain this result, we computed the average illumination (grayscale) on the face in “motion_O3”, as shown in Fig. 4. We found that the illumination changes significantly due to the auto-exposure function of the camera. The camera changed the exposure automatically when the subject turned the head, while the other two clips (“motion_O1” and “motion_O2”) have no such illumination changes. Therefore, our method still obtained HRV metrics close to the ground truth in “motion_O1” and “motion_O2”.

Frequent motion

Fig. 5 The face positions and illumination in the “motion_F1” clip. a The face position (x-axis). b The face position (y-axis). c The illumination (grayscale) of the face

Table 6 The HRV metrics estimated by different methods in the “frequent motion” category

Table 6 shows the HRV metrics of the “frequent motion” video clips. This category is extremely challenging since the subjects kept moving during the video recording. Both the ICA [8] and C-rPPG [10] methods severely deviated from the ground truth. Although our method was affected by the large motion artifacts, it still obtained reasonable HRV metrics close to the ground truth. Figure 5 shows the face positions and the illumination changes on the face in “motion_F1”. The face position changes periodically due to the continuous shaking of the head. We can observe that the illumination changed with the motion of the face rather than with exposure changes. The results show that our method can deal with motion artifacts, even when the subject kept shaking his head during the video recording, as long as the camera exposure is almost fixed.

Conclusion

In this paper, we have analyzed the problems of camera-based PPG and proposed an algorithm to extract accurate R–R intervals using a 30 fps camera. We first extract the pulse signal using the chrominance-based method (C-rPPG), followed by a denoising method based on the CWT. The R–R intervals are computed by finding the peaks in the denoised signals. The experimental video clips were recorded by the frontal camera of a smart phone (Sony Xperia Z1) held by the subjects in different situations. The experiments have shown that our method extracts much more accurate results than the related works. The mean absolute error of the HRV metrics obtained by our method is only 3.53 ms in the “static subjects” and “static subjects with makeup” categories. This shows the potential of our method for remote health monitoring of patients in an easy and comfortable way in daily life.

Note that measurements of HRV for clinical use should conform to professional recommendations (e.g., [1, 26]), and our method might not meet those requirements. However, it can be useful for informal applications, for instance, monitoring the physiological status of tablet users and warning users who may have potential health problems.

Although the proposed method is able to alleviate the interference of motion artifacts, there is still room for improvement in dealing with the artifacts caused by significant exposure changes due to the auto-exposure function of the camera. In addition, as a proof of concept, this paper validates our work with six subjects, which might not be enough to show convincing, statistically significant evidence of efficacy. In the future, we aim to refine our algorithm to be robust to artifacts generated by built-in functions of smart phones, and to conduct experiments with a larger number of subjects.