1 Introduction

Although the spread of COVID-19 has slowed considerably, the situation remains unstable. Alongside measures to avoid human contact, body temperature measurement based on remote biometric technology has become an increasingly popular tool for fever detection and infection prevention.

Recently, facial imaging with visible and infrared light has become a common application for the remote sensing of vital signs. As the face is lined with capillaries, facial color and skin temperature fluctuate based on variations in cutaneous blood flow. Facial visible images can be considered a short-term indicator of hemodynamics because facial color contains pulse wave components that fluctuate over a short period. These images have been used to evaluate heart rate variability [1], respiration rate [2], blood pressure [3], and subjective feelings related to health [4]. Facial thermal images can be considered long-term indicators of hemodynamics because skin temperature fluctuates over a long period. Facial skin temperature has been used to evaluate stress [5], subjective feelings related to comfort [6], and drowsiness level [7].

Fig. 1 A concept of triplet loss

In contrast, our research group has attempted to estimate physiological and psychological states from facial skin temperature and color distributions using deep-learning algorithms. Adachi et al. constructed a drowsiness estimation model based on facial skin temperature using a deep-learning algorithm [8]. Previous methods constructed a separate estimation model for each target, such as a model for estimating sleepiness or one for estimating stress. However, because various forms of poor physical condition occur in daily life, such as hangovers, fatigue, and lack of sleep, there is a limit to constructing a dedicated estimation model for each target.

Hence, an “anomaly detection” method was devised that uses a single model to detect anomalous conditions such as drowsiness, stress, and poor health. The following two types of anomaly detection models were proposed in our laboratory: (i) models based on facial skin temperature built using a variational autoencoder and a deep-learning algorithm [9], and (ii) models combining facial visible and thermal images using triplet loss, a type of deep metric learning [10]. Triplet loss is widely used in image processing for vision tasks such as person re-identification, face recognition, and image retrieval. The authors hypothesized that the discriminative capability of triplet loss in face recognition could be applied to anomaly detection in facial images. In the previous study (ii), a blood pressure increase experiment was conducted in which subjects held their breath to raise blood pressure and induce an anomaly state; an attempt was then made to classify that state. Facial color and skin temperature fluctuated as blood flow increased, suggesting that the anomaly states could be classified. This further indicates the possibility of using facial images as physiological indicators of hemodynamics. However, in these studies, the subjects were photographed under conditions in which factors such as meals and sleep were controlled. Extracting features based on subjective health feelings from facial images is difficult in an uncontrolled environment, which is susceptible to seasonal variations and disturbances such as ambient light.

In this study, triplet loss, a type of deep metric learning, was used to extract features related to subjective health feelings from facial images and to evaluate whether a relationship exists between subjective health feelings and facial images. Classification of subjective health feelings related to poor physical condition based on these features was then attempted. To obtain the data, an experiment was conducted for approximately one year in which facial visible and thermal images and subjective feelings related to physical condition were measured. The facial visible images were transformed into the perceptually uniform L*a*b* color space. Anomaly levels were set based on the measured subjective feelings related to physical condition and were defined as poor physical conditions. Facial visible and thermal images captured in poor physical condition were treated as anomaly data, and those captured in good physical condition as normal data. Facial visible and thermal images were then applied to the trained model to quantitatively evaluate the classification accuracy for anomaly states related to subjective health feelings, thereby confirming the association between subjective health feelings and facial images.

2 Mechanism of triplet loss

Deep metric learning technology is used in several computer vision tasks, such as face recognition [11, 12], re-identification [13], and object tracking [14]. Among these approaches, metric learning with triplet loss as the loss function is widely used. In this study, the authors attempted to create anomaly detection models with triplet loss using facial visible and thermal images obtained over a long period.

A concept of the triplet loss is illustrated in Fig. 1. The triplet loss was learned over a set {\(x_{i}^{a}\), \(x_{i}^{p}\), \(x_{i}^{n}\)}, composed of \(x_{i}^{a}\) (anchor), which is the reference; \(x_{i}^{p}\) (positive), belonging to the same class as \(x_{i}^{a}\); and \(x_{i}^{n}\) (negative), belonging to a different class from \(x_{i}^{a}\). Each input image was placed in the feature space using features extracted by MobileNetV2, a convolutional neural network [15]. The authors hypothesized that MobileNetV2, with its high classification accuracy, can accurately extract features related to subjective health feelings from facial images.

The triplet loss function \(L_\textrm{tripletloss}\) is expressed as

$$\begin{aligned} L_\textrm{tripletloss} = \sum _{i} \max \left[ d(x_{i}^{a}, x_{i}^{p}) - d(x_{i}^{a}, x_{i}^{n}) + \alpha ,\ 0\right] , \end{aligned}$$
(1)

where \(d(x_{i}^{a}, x_{i}^{p})\) is the Euclidean distance between the anchor and the positive, \(d(x_{i}^{a}, x_{i}^{n})\) is the Euclidean distance between the anchor and the negative, and \(\alpha\) is the margin, set to 0.2 as in a previous study [11].

Triplet loss was learned such that the anchor-positive distance \(d(x_{i}^{a}, x_{i}^{p})\) became smaller than the anchor-negative distance \(d(x_{i}^{a}, x_{i}^{n})\). Facial images of normal and anomaly data were then input to the trained model to perform binary classification.
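As a concrete illustration, Eq. (1) can be sketched in plain Python; the two-dimensional embeddings below are hypothetical stand-ins for the MobileNetV2 feature vectors, with \(\alpha\) = 0.2 as above.

```python
import math

def euclidean(u, v):
    # d(x, y): Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchors, positives, negatives, alpha=0.2):
    # Eq. (1): sum_i max(d(a_i, p_i) - d(a_i, n_i) + alpha, 0)
    return sum(
        max(euclidean(a, p) - euclidean(a, n) + alpha, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    )

# Toy triplet: the positive lies near the anchor and the negative far away,
# so the hinge term is below zero and the loss vanishes.
anchors = [[0.0, 0.0]]
positives = [[0.1, 0.0]]
negatives = [[1.0, 0.0]]
print(triplet_loss(anchors, positives, negatives))  # 0.0
```

Training drives the embeddings toward exactly this regime: every anchor-negative distance exceeds the anchor-positive distance by at least the margin.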

3 Experiment

Table 1 Subjective health score on a 4-point scale

3.1 Experiment system and condition

In this study, an experiment was conducted for approximately one year (December 2020–November 2021) to measure facial visible and thermal images and subjective health feelings. Thirty-three healthy adult males (21–51 years) participated as subjects, and the experiments were conducted at temperatures ranging from 25.3 \(^\circ\)C to 30.9 \(^\circ\)C. The measurement environment is shown in Fig. 2. A visible camera (HD 1080p, Logicool) and an infrared thermography camera (Boson, FLIR) were installed in front of the subject at a distance of 1.0 m to acquire facial visible and thermal images. The sizes of the visible and thermal images were 1080 \(\times\) 1920 pixels and 256 \(\times\) 320 pixels, respectively. The temperature resolution was 0.1 \(^\circ\)C. To obtain subjective health feelings, an iPad Air (3rd generation, Apple, California, USA) was placed next to the cameras for answering questionnaires regarding the subjects’ physical condition. The questionnaire used a 4-point scale assessing the subject’s wakefulness, physical condition, comfort, and energy. Table 1 presents the health status scale; a lower score indicates better health status.

3.2 Protocol

The experimental system was installed at the entrance of the laboratory, and measurements were obtained at the discretion of each subject. After the facial images were measured, subjects answered a questionnaire regarding their physical condition. Measurements were performed actively on days when subjects were sleep-deprived or fatigued. Meal and sleeping times were not controlled because this study assumes deployment in real-life settings such as commercial facilities. This study was approved by the Life Science Committee of the Department of Science and Engineering at Aoyama Gakuin University (approval number: H17-M13-3).

Fig. 2 Experiment environment

4 Analysis method

4.1 L*a*b* color space

As the RGB color space is a color model designed for output onto a display, the perception of changes in its color values varies among individuals. Considering that we may subjectively judge the health status of others from their complexion, evaluations should be based on a color space that is perceptually uniform with respect to changes in color values. In this study, the authors therefore applied the perceptually uniform L*a*b* color space to the visible images.

The L*a*b* color space represents color by lightness L* and chromaticity a* and b*, which together indicate hue and saturation. Larger values of L* indicate white and smaller values black; positive values of a* indicate red and negative values green; and positive values of b* indicate yellow and negative values blue [16]. In this study, the RGB color space was first converted to the XYZ color space and then to the L*a*b* color space [17]. The lighting environment was assumed to be the standard illuminant D65 defined by the International Commission on Illumination (CIE) [18]. Only a* and b* were used for the visible images, to reduce the effect of the brightness of the shooting environment.
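For a single pixel, the conversion chain (sRGB to linear RGB to XYZ to L*a*b* under D65) can be sketched as follows; this is a generic implementation of the standard CIE formulas, not the authors’ specific code.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE L*a*b* (D65 white point)."""
    # 1) Undo the sRGB gamma to obtain linear RGB in [0, 1]
    def linear(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = linear(r), linear(g), linear(b)

    # 2) Linear RGB -> XYZ (sRGB matrix, D65)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b

    # 3) XYZ -> L*a*b*, normalized by the D65 white point
    xn, yn, zn = 0.95047, 1.0, 1.08883
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

For whole images, a vectorized routine such as `skimage.color.rgb2lab` (which also assumes D65 by default) would normally be used instead; only the a* and b* outputs are retained in this study.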

4.2 Definition of training and test data

Fig. 3 Flow of creating datasets

In this study, the training and test data for anomaly detection were classified according to the questionnaire scores listed in Table 1. Anomaly levels were set by the number of question items in Table 1 answered with (3), to verify how anomaly detection accuracy varies with the anomaly level. Level 1 (58 data points) comprised data with one or more items answered (3), Level 2 (21 data points) data with two or more items answered (3), Level 3 (10 data points) data with three or more items answered (3), and Level 4 (7 data points) data with four items answered (3). From the data in which every question was answered (0) or (1), the same amount of data as the anomaly data was extracted and defined as normal data. The remaining such data were defined as anchor-positive training data: 212 anchor-positive data points at anomaly level 1, 249 at level 2, 260 at level 3, and 263 at level 4. Negative data (171 data points) were defined as data for which (3) was not the answer to all questions; these data were not used as normal data. In this study, the number of anchor-positive and negative data points was increased by factors of two and five, respectively, through data augmentation that randomly shifted the 68 facial feature points by a few pixels in the same direction, as described in the next section. The data augmentation ratio and other factors will be examined in future work.
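The level assignment and the normal/anomaly split described above can be sketched as follows; the questionnaire records are hypothetical four-tuples of the 4-point answers in Table 1.

```python
def anomaly_level(answers):
    """Anomaly level of one questionnaire record: the number of
    question items answered with (3)."""
    return sum(1 for a in answers if a == 3)

def split_dataset(records, level):
    """Anomaly data: records with at least `level` items answered (3).
    Normal candidates: records answered only (0) or (1) on every item."""
    anomaly = [r for r in records if anomaly_level(r) >= level]
    normal = [r for r in records if all(a <= 1 for a in r)]
    return anomaly, normal

# Hypothetical records: (wakefulness, physical condition, comfort, energy)
records = [(3, 3, 3, 0), (0, 1, 0, 1), (3, 0, 0, 0), (2, 2, 1, 0)]
anomaly, normal = split_dataset(records, level=2)
print(len(anomaly), len(normal))  # 1 1
```

In the actual study, the normal set is then subsampled to match the size of the anomaly set at each level.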

Fig. 4 Image combination method

4.3 Construction of anomaly detection model

A general model was constructed using data from all subjects. The flow for creating the facial visible and thermal image datasets is shown in Fig. 3. For the facial thermal images, 68 facial feature points were extracted based on the methods in our previous study [19], and spatially normalized standard faces were generated [20]. For the facial visible images, spatially normalized standard faces were generated from the feature points obtained by Dlib’s 68-point facial feature point extraction method [21]. The authors used spatially normalized standard faces to adjust for the effects of individual differences in facial shape. The spatially normalized images, sized 150 \(\times\) 150 pixels, were then analyzed. For the spatially normalized visible images, mosaic processing was performed using nearest-neighbor interpolation with a reduction ratio of 0.08. The a* and b* pixel values in the facial visible images were normalized to maximum and minimum values of one and zero, respectively. For the facial thermal images, the mean and variance were standardized to zero and one, respectively.
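The two scalings described above can be sketched as follows; this is a generic min-max and z-score implementation, applied here to a hypothetical flattened list of pixel values.

```python
from statistics import mean, pstdev

def minmax_normalize(values):
    """Scale values to [0, 1], as applied to the a* and b* channels."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Zero-mean, unit-variance scaling, as applied to the thermal images."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

pixels = [34.5, 35.0, 36.2, 33.8]  # toy thermal readings [deg C]
z = standardize(pixels)            # now mean 0, variance 1
```

In practice these operations run per image over the full pixel arrays; the intent is only to put the color and temperature channels on comparable scales before they enter the network.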

Classification accuracy was compared among models constructed using only facial visible images (VI), only facial thermal images (FTI), and both facial visible and thermal images (FTI–VI). To combine the facial visible and thermal images, the authors created a three-channel dataset consisting of the facial thermal image and the a* and b* channels of the facial visible image, as shown in Fig. 4.
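The channel combination in Fig. 4 amounts to stacking the three normalized channels pixel by pixel; a minimal sketch with hypothetical 2 \(\times\) 2 channels follows.

```python
def combine_channels(thermal, a_star, b_star):
    """Stack the thermal image and the a*/b* channels of the visible
    image into one H x W x 3 array (nested lists), as in Fig. 4."""
    h, w = len(thermal), len(thermal[0])
    assert len(a_star) == h and len(b_star) == h
    return [
        [[thermal[i][j], a_star[i][j], b_star[i][j]] for j in range(w)]
        for i in range(h)
    ]

# Toy 2 x 2 channels (values already normalized as in Sect. 4.3)
thermal = [[0.1, 0.2], [0.3, 0.4]]
a_star = [[0.5, 0.6], [0.7, 0.8]]
b_star = [[0.9, 1.0], [0.0, 0.1]]
combined = combine_channels(thermal, a_star, b_star)
print(combined[0][0])  # [0.1, 0.5, 0.9]
```

The resulting three-channel image is shaped like an ordinary RGB input, so it can be fed to MobileNetV2 without architectural changes.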

The accuracy of anomaly detection was evaluated using the area under the curve (AUC) obtained from the ROC curve [22]. The ROC curve plots the rate of anomaly data correctly recognized as anomaly (true positive rate, TPR) on the vertical axis against the rate of normal data incorrectly recognized as anomaly (false positive rate, FPR) on the horizontal axis. The AUC is the area under the ROC curve and indicates high accuracy between 0.9 and 1.0, medium accuracy between 0.7 and 0.9, and low accuracy between 0.5 and 0.7 [23]. The true negative rate (TNR), the rate at which normal data are correctly determined to be normal, was calculated from the FPR. The model was evaluated at the threshold that maximized the sum of TPR and TNR.
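The evaluation described above can be sketched as follows: AUC via the rank formulation and the threshold maximizing TPR + TNR. The distance scores are hypothetical; higher scores are treated as more anomalous.

```python
def roc_auc(normal_scores, anomaly_scores):
    """AUC via the rank (Mann-Whitney) formulation: the probability that
    a random anomaly sample scores higher than a random normal sample."""
    wins = sum(
        1.0 if a > n else 0.5 if a == n else 0.0
        for a in anomaly_scores for n in normal_scores
    )
    return wins / (len(anomaly_scores) * len(normal_scores))

def best_threshold(normal_scores, anomaly_scores):
    """Threshold maximizing TPR + TNR, as used to evaluate the model."""
    best, best_sum = None, -1.0
    for t in sorted(set(normal_scores + anomaly_scores)):
        tpr = sum(a >= t for a in anomaly_scores) / len(anomaly_scores)
        tnr = sum(n < t for n in normal_scores) / len(normal_scores)
        if tpr + tnr > best_sum:
            best, best_sum = t, tpr + tnr
    return best

normal = [0.1, 0.2, 0.3, 0.4]    # toy distances for normal data
anomaly = [0.35, 0.5, 0.6, 0.7]  # toy distances for anomaly data
print(roc_auc(normal, anomaly))  # 0.9375
```

The threshold selection is equivalent to maximizing Youden’s J statistic (TPR + TNR − 1) over all candidate cut points.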

4.4 Feature analysis

To identify which parts of the face were extracted as features, the feature maps obtained when the anomaly image farthest from the anchor and the normal image closest to the anchor were input to the model were analyzed using principal component analysis (PCA). The average feature value of each feature map, \(w_{k}\), obtained from an input image was calculated using Eq. (2).

$$\begin{aligned} w_{k} = \frac{1}{Z}\sum _{i}\sum _{j} A_{ij}^{k}, \end{aligned}$$
(2)

where k denotes the filter index, A denotes the feature map, (i, j) denotes the spatial index, and Z is the spatial resolution of the feature map. The average features calculated for one input image were sorted in order of size, and the top half of the feature maps was selected. PCA was performed on the selected feature maps, and the first principal component, with a cumulative contribution of 90%, was used to identify the relevant facial regions.
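Equation (2) and the top-half selection can be sketched as follows; the feature maps are hypothetical nested lists standing in for the model’s activations.

```python
def channel_means(feature_maps):
    """Eq. (2): w_k = (1/Z) * sum_ij A_ij^k for each filter k."""
    return [
        sum(sum(row) for row in a) / (len(a) * len(a[0]))
        for a in feature_maps
    ]

def top_half(feature_maps):
    """Keep the half of the maps with the largest average activation;
    this subset is then passed to PCA."""
    w = channel_means(feature_maps)
    order = sorted(range(len(w)), key=lambda k: w[k], reverse=True)
    return [feature_maps[k] for k in order[: len(w) // 2]]

# Four toy 2 x 2 feature maps
maps = [
    [[0, 0], [0, 0]],   # w = 0.0
    [[1, 1], [1, 1]],   # w = 1.0
    [[0, 2], [0, 2]],   # w = 1.0
    [[4, 4], [4, 4]],   # w = 4.0
]
print(channel_means(maps))  # [0.0, 1.0, 1.0, 4.0]
```

The PCA step itself (extracting the first principal component across the selected maps) is omitted here; any standard implementation can be applied to the retained maps.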

Fig. 5 Distance distribution between normal and anomaly data when combining visible and thermal images at anomaly level 3

5 Result

5.1 Discrimination result by each model

The distance distribution between the normal and anomaly data for the combined facial visible and thermal images at anomaly level 3 is shown in Fig. 5. In Fig. 5, the dotted line indicates the threshold, the red stars indicate the anomaly data, and the blue circles indicate the normal data. The results for AUC, TPR, and TNR are summarized in Tables 2, 3, and 4.

Table 2 AUC results for varying anomaly levels
Table 3 TPR results for varying anomaly levels [%]
Table 4 TNR results for varying anomaly levels [%]

Figure 5 illustrates that most normal data were classified below the dotted line and most anomaly data above it. The results in Tables 2, 3, and 4 demonstrate that the FTI–VI condition detected anomaly states with high accuracy, and Tables 2 and 3 show that anomaly detection accuracy was high for anomaly levels 3 and 4. This suggests that when the anomaly level is low, classification accuracy is low because such states are difficult to distinguish from good physical condition. When the anomaly level is high, classification accuracy is good because the corresponding features are readily expressed on the face. Cutaneous blood flow changes as the anomaly level increases, which may have facilitated discrimination of the anomaly state. The simultaneous use of facial visible and thermal images may also have exploited an interaction between short- and long-term indicators.

Fig. 6 Feature maps based on PCA for each anomaly level when facial visible and facial thermal images are used together

5.2 Feature analysis

The feature maps based on PCA for each anomaly level when facial visible and thermal images were used simultaneously are shown in Fig. 6. Figure 6 presents the image of the first principal component from PCA; the brightest areas are the most characteristic features. For levels 1 and 2, the lower anomaly levels, features were observed around the eyes and lips. Levels 3 and 4, the higher anomaly levels, demonstrated common features at the eye sockets and the sides of the nose. The face is lined with capillaries, and arteries run along the sides of the nose and the inner corners of the eyes in particular. The sides of the nose may be more susceptible to hemodynamic effects at higher anomaly levels because the skin there is thinner than on the cheeks and other parts of the face. Together with the results in Sect. 5.1, this suggests that there may be a relationship between subjective health feelings and facial images, with the sides of the eyes and nose being important sites for detecting poor physical condition at high anomaly levels.

6 Conclusion

This study experimentally investigated whether there is an association between interday subjective health feelings and facial images. At higher anomaly levels, the combination of facial visible and thermal images classified subjective health feelings with moderate accuracy. Moreover, the inner corners of the eyes and the sides of the nose were extracted as features in the facial regions that may be related to subjective health feelings. This suggests that there may be a relationship between interday subjective health feelings and facial images. In the future, the authors plan to further improve accuracy by tuning hyperparameters, analyzing the questionnaire data related to physical condition, and considering practical applications in daily life.