1 Introduction

Although the spread of COVID-19 has slowed considerably, the situation remains unstable. Alongside measures to avoid human contact, body temperature measurement based on remote biometric technology has become an increasingly popular tool for fever detection and infection prevention.

Recently, facial imaging with visible and infrared light has become a common application for the remote sensing of vital signs. As the face is lined with capillaries, facial color and skin temperature fluctuate based on variations in cutaneous blood flow. Facial visible images can be considered a short-term indicator of hemodynamics because facial color contains pulse wave components that fluctuate over a short period. These images have been used to evaluate heart rate variability [1], respiration rate [2], blood pressure [3], and subjective feelings related to health [4]. Facial thermal images can be considered long-term indicators of hemodynamics because skin temperature fluctuates over a long period. Facial skin temperature has been used to evaluate stress [5], subjective feelings related to comfort [6], and drowsiness level [7].

Fig. 1 A concept of triplet loss

In contrast, our research group has attempted to estimate physiological and psychological states from facial skin temperature and color distributions using deep-learning algorithms. Adachi et al. constructed a drowsiness estimation model based on facial skin temperature using a deep-learning algorithm [8]. Previous methods constructed a separate estimation model for each target, such as a model for estimating sleepiness or one for estimating stress. However, because various forms of poor physical condition occur in daily life, such as hangovers, fatigue, and lack of sleep, there is a limit to constructing a dedicated estimation model for each target.

Hence, an “anomaly detection” method was devised that uses a single model to detect anomalous conditions such as drowsiness, stress, and poor health. The following two types of anomaly detection models were proposed in our laboratory: (i) models based on facial skin temperature built using a variational autoencoder and a deep-learning algorithm [9], and (ii) models combining facial visible and thermal images using triplet loss, a type of deep metric learning [10]. Triplet loss is widely used in image processing for vision tasks such as person re-identification, face recognition, and image retrieval. The authors hypothesized that the discriminative capability of triplet loss in face recognition could be applied to anomaly detection in facial images. In the previous study (ii), a blood pressure increase experiment was conducted in which subjects held their breath to raise blood pressure and induce an anomaly state; an attempt was then made to classify that state. Facial color and skin temperature fluctuated as blood flow increased, suggesting that the anomaly states could be classified. This further indicates the possibility of using facial images as physiological indicators of hemodynamics. However, in these studies, the subjects were photographed under conditions in which factors such as meals and sleep were controlled. Extracting features based on subjective health feelings from facial images is difficult in an uncontrolled environment, which is susceptible to seasonal variations and disturbances such as ambient light.

In this study, triplet loss, a type of deep metric learning, was used to extract features related to subjective health feelings from facial images and to evaluate whether a relationship exists between subjective health feelings and facial images. Classification of subjective health feelings related to poor physical condition based on these features was then attempted. To obtain the data, an experiment was conducted for approximately one year in which facial visible and thermal images and subjective feelings related to physical condition were measured. The facial visible images were transformed into the perceptually uniform L*a*b* color space. Anomaly levels were set based on the measured subjective feelings related to physical condition and were defined as poor physical conditions. Facial visible and thermal images captured in poor physical condition were treated as anomaly data, and those captured in good physical condition as normal data. Facial visible and thermal images were then applied to the trained model to quantitatively evaluate the classification accuracy for anomaly states related to subjective health feelings, thereby confirming the association between subjective health feelings and facial images.

2 Mechanism of triplet loss

Deep metric learning technology is used in several computer vision tasks, such as face recognition [11, 12], re-identification [13], and object tracking [14]. Among these approaches, metric learning with triplet loss as the loss function is widely used. In this study, the authors attempted to create anomaly detection models with triplet loss using facial visible and thermal images obtained over a long period.

A concept of the triplet loss is illustrated in Fig. 1. The triplet loss was learned over a set {\(x_{i}^{a}\), \(x_{i}^{p}\), \(x_{i}^{n}\)}, composed of \(x_{i}^{a}\) (anchor), which is the reference; \(x_{i}^{p}\) (positive), belonging to the same class as \(x_{i}^{a}\); and \(x_{i}^{n}\) (negative), belonging to a different class from \(x_{i}^{a}\). Each input image was placed in the feature space using features extracted by MobileNetV2, a convolutional neural network [15]. The authors hypothesized that MobileNetV2, with its high classification accuracy, can accurately extract features related to subjective health feelings from facial images.

The triplet loss function \(L_\textrm{tripletloss}\) is expressed as

$$\begin{aligned} L_\textrm{tripletloss} = \sum _{i} \max \left[ d(x_{i}^{a}, x_{i}^{p}) - d(x_{i}^{a}, x_{i}^{n}) + \alpha ,\ 0\right] , \end{aligned}$$
(1)

where \(d(x_{i}^{a}, x_{i}^{p})\) is the Euclidean distance between the anchor and the positive, \(d(x_{i}^{a}, x_{i}^{n})\) is the Euclidean distance between the anchor and the negative, and \(\alpha\) is the margin, set to 0.2 as in a previous study [11].

Triplet loss was learned such that the anchor-positive distance \(d(x_{i}^{a}, x_{i}^{p})\) became smaller than the anchor-negative distance \(d(x_{i}^{a}, x_{i}^{n})\). Facial images of normal and anomaly data were then input to the trained model to perform binary classification.
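As a concrete illustration, Eq. (1) can be sketched in plain Python; the two-dimensional embeddings below are hypothetical stand-ins for the MobileNetV2 feature vectors, with \(\alpha\) = 0.2 as above.

```python
import math

def euclidean(u, v):
    # d(x, y): Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchors, positives, negatives, alpha=0.2):
    # Eq. (1): sum_i max(d(a_i, p_i) - d(a_i, n_i) + alpha, 0)
    return sum(
        max(euclidean(a, p) - euclidean(a, n) + alpha, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    )

# Toy triplet: the positive lies near the anchor and the negative far away,
# so the hinge term is below zero and the loss vanishes.
anchors = [[0.0, 0.0]]
positives = [[0.1, 0.0]]
negatives = [[1.0, 0.0]]
print(triplet_loss(anchors, positives, negatives))  # 0.0
```

Training drives the embeddings toward exactly this regime: every anchor-negative distance exceeds the anchor-positive distance by at least the margin.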

3 Experiment

Table 1 Subjective health score on a 4-point scale

3.1 Experiment system and condition

In this study, an experiment was conducted for approximately one year (December 2020–November 2021) to measure facial visible and thermal images and subjective health feelings. Thirty-three healthy adult males (21–51 years) participated as subjects, and the experiments were conducted at temperatures ranging from 25.3 \(^\circ\)C to 30.9 \(^\circ\)C. The measurement environment is shown in Fig. 2. A visible camera (HD 1080p, Logicool) and an infrared thermography camera (Boson, FLIR) were installed in front of the subject at a distance of 1.0 m to acquire facial visible and thermal images. The sizes of the visible and thermal images were 1080 \(\times\) 1920 pixels and 256 \(\times\) 320 pixels, respectively. The temperature resolution was 0.1 \(^\circ\)C. To obtain subjective health feelings, an iPad Air (3rd generation, Apple, California, USA) was placed next to the cameras for answering questionnaires regarding the subjects’ physical condition. The questionnaire used a 4-point scale assessing the subject’s wakefulness, physical condition, comfort, and energy. Table 1 presents the health status scale; a lower score indicates better health status.

3.2 Protocol

The experimental system was installed at the entrance of the laboratory, and measurements were obtained at the discretion of each subject. After the facial images were measured, subjects answered a questionnaire regarding their physical condition. Measurements were performed actively on days when subjects were sleep-deprived or fatigued. Meal and sleeping times were not controlled because this study assumes deployment in real-life settings such as commercial facilities. This study was approved by the Life Science Committee of the Department of Science and Engineering at Aoyama Gakuin University (approval number: H17-M13-3).

Fig. 2 Experiment environment

4 Analysis method

4.1 L*a*b* color space

As the RGB color space is a color model designed for output onto a display, the perception of changes in its color values varies among individuals. Considering that we may subjectively judge the health status of others from their complexion, evaluations should be based on a color space that is perceptually uniform with respect to changes in color values. In this study, the authors therefore applied the perceptually uniform L*a*b* color space to the visible images.

The L*a*b* color space represents color by lightness L* and chromaticity a* and b*, which together indicate hue and saturation. Larger values of L* indicate white and smaller values black; positive values of a* indicate red and negative values green; and positive values of b* indicate yellow and negative values blue [16]. In this study, the RGB color space was first converted to the XYZ color space and then to the L*a*b* color space [17]. The lighting environment was assumed to be the standard illuminant D65 defined by the International Commission on Illumination (CIE) [18]. Only a* and b* were used for the visible images, to reduce the effect of the brightness of the shooting environment.
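For a single pixel, the conversion chain (sRGB to linear RGB to XYZ to L*a*b* under D65) can be sketched as follows; this is a generic implementation of the standard CIE formulas, not the authors’ specific code.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE L*a*b* (D65 white point)."""
    # 1) Undo the sRGB gamma to obtain linear RGB in [0, 1]
    def linear(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = linear(r), linear(g), linear(b)

    # 2) Linear RGB -> XYZ (sRGB matrix, D65)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b

    # 3) XYZ -> L*a*b*, normalized by the D65 white point
    xn, yn, zn = 0.95047, 1.0, 1.08883
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

For whole images, a vectorized routine such as `skimage.color.rgb2lab` (which also assumes D65 by default) would normally be used instead; only the a* and b* outputs are retained in this study.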

4.2 Definition of training and test data

Fig. 3 Flow of creating datasets

In this study, the training and test data for anomaly detection were classified according to the questionnaire scores listed in Table 1. Anomaly levels were set by the number of question items in Table 1 answered with (3), to verify how anomaly detection accuracy varies with the anomaly level. Level 1 (58 data points) comprised data with one or more items answered (3), Level 2 (21 data points) data with two or more items answered (3), Level 3 (10 data points) data with three or more items answered (3), and Level 4 (7 data points) data with four items answered (3). From the data in which every question was answered (0) or (1), the same amount of data as the anomaly data was extracted and defined as normal data. The remaining such data were defined as anchor-positive training data: 212 anchor-positive data points at anomaly level 1, 249 at level 2, 260 at level 3, and 263 at level 4. Negative data (171 data points) were defined as data for which (3) was not the answer to all questions; these data were not used as normal data. In this study, the number of anchor-positive and negative data points was increased by factors of two and five, respectively, through data augmentation that randomly shifted the 68 facial feature points by a few pixels in the same direction, as described in the next section. The data augmentation ratio and other factors will be examined in future work.
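The level assignment and the normal/anomaly split described above can be sketched as follows; the questionnaire records are hypothetical four-tuples of the 4-point answers in Table 1.

```python
def anomaly_level(answers):
    """Anomaly level of one questionnaire record: the number of
    question items answered with (3)."""
    return sum(1 for a in answers if a == 3)

def split_dataset(records, level):
    """Anomaly data: records with at least `level` items answered (3).
    Normal candidates: records answered only (0) or (1) on every item."""
    anomaly = [r for r in records if anomaly_level(r) >= level]
    normal = [r for r in records if all(a <= 1 for a in r)]
    return anomaly, normal

# Hypothetical records: (wakefulness, physical condition, comfort, energy)
records = [(3, 3, 3, 0), (0, 1, 0, 1), (3, 0, 0, 0), (2, 2, 1, 0)]
anomaly, normal = split_dataset(records, level=2)
print(len(anomaly), len(normal))  # 1 1
```

In the actual study, the normal set is then subsampled to match the size of the anomaly set at each level.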

Fig. 4 Image combination method

4.3 Construction of anomaly detection model

A general model was constructed using data from all subjects. The flow for creating the facial visible and thermal image datasets is shown in Fig. 3. For the facial thermal images, 68 facial feature points were extracted based on the methods in our previous study [19], and spatially normalized standard faces were generated [20]. For the facial visible images, spatially normalized standard faces were generated from the feature points obtained by Dlib’s 68-point facial feature point extraction method [21]. The authors used spatially normalized standard faces to adjust for the effects of individual differences in facial shape. The spatially normalized images, sized 150 \(\times\) 150 pixels, were then analyzed. For the spatially normalized visible images, mosaic processing was performed using nearest-neighbor interpolation with a reduction ratio of 0.08. The a* and b* pixel values in the facial visible images were normalized to maximum and minimum values of one and zero, respectively. For the facial thermal images, the mean and variance were standardized to zero and one, respectively.
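The two scalings described above can be sketched as follows; this is a generic min-max and z-score implementation, applied here to a hypothetical flattened list of pixel values.

```python
from statistics import mean, pstdev

def minmax_normalize(values):
    """Scale values to [0, 1], as applied to the a* and b* channels."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Zero-mean, unit-variance scaling, as applied to the thermal images."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

pixels = [34.5, 35.0, 36.2, 33.8]  # toy thermal readings [deg C]
z = standardize(pixels)            # now mean 0, variance 1
```

In practice these operations run per image over the full pixel arrays; the intent is only to put the color and temperature channels on comparable scales before they enter the network.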

Classification accuracy was compared among models constructed using only facial visible images (VI), only facial thermal images (FTI), and both facial visible and thermal images (FTI–VI). To combine the facial visible and thermal images, the authors created a three-channel dataset consisting of the facial thermal image and the a* and b* channels of the facial visible image, as shown in Fig. 4.
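The channel combination in Fig. 4 amounts to stacking the three normalized channels pixel by pixel; a minimal sketch with hypothetical 2 \(\times\) 2 channels follows.

```python
def combine_channels(thermal, a_star, b_star):
    """Stack the thermal image and the a*/b* channels of the visible
    image into one H x W x 3 array (nested lists), as in Fig. 4."""
    h, w = len(thermal), len(thermal[0])
    assert len(a_star) == h and len(b_star) == h
    return [
        [[thermal[i][j], a_star[i][j], b_star[i][j]] for j in range(w)]
        for i in range(h)
    ]

# Toy 2 x 2 channels (values already normalized as in Sect. 4.3)
thermal = [[0.1, 0.2], [0.3, 0.4]]
a_star = [[0.5, 0.6], [0.7, 0.8]]
b_star = [[0.9, 1.0], [0.0, 0.1]]
combined = combine_channels(thermal, a_star, b_star)
print(combined[0][0])  # [0.1, 0.5, 0.9]
```

The resulting three-channel image is shaped like an ordinary RGB input, so it can be fed to MobileNetV2 without architectural changes.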

The accuracy of anomaly detection was evaluated using the area under the curve (AUC) obtained from the ROC curve [22]. The ROC curve plots the rate of anomaly data correctly recognized as anomaly (true positive rate, TPR) on the vertical axis against the rate of normal data incorrectly recognized as anomaly (false positive rate, FPR) on the horizontal axis. The AUC is the area under the ROC curve and indicates high accuracy between 0.9 and 1.0, medium accuracy between 0.7 and 0.9, and low accuracy between 0.5 and 0.7 [23]. The true negative rate (TNR), the rate at which normal data are correctly determined to be normal, was calculated from the FPR. The model was evaluated at the threshold that maximized the sum of TPR and TNR.
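The evaluation described above can be sketched as follows: AUC via the rank formulation and the threshold maximizing TPR + TNR. The distance scores are hypothetical; higher scores are treated as more anomalous.

```python
def roc_auc(normal_scores, anomaly_scores):
    """AUC via the rank (Mann-Whitney) formulation: the probability that
    a random anomaly sample scores higher than a random normal sample."""
    wins = sum(
        1.0 if a > n else 0.5 if a == n else 0.0
        for a in anomaly_scores for n in normal_scores
    )
    return wins / (len(anomaly_scores) * len(normal_scores))

def best_threshold(normal_scores, anomaly_scores):
    """Threshold maximizing TPR + TNR, as used to evaluate the model."""
    best, best_sum = None, -1.0
    for t in sorted(set(normal_scores + anomaly_scores)):
        tpr = sum(a >= t for a in anomaly_scores) / len(anomaly_scores)
        tnr = sum(n < t for n in normal_scores) / len(normal_scores)
        if tpr + tnr > best_sum:
            best, best_sum = t, tpr + tnr
    return best

normal = [0.1, 0.2, 0.3, 0.4]    # toy distances for normal data
anomaly = [0.35, 0.5, 0.6, 0.7]  # toy distances for anomaly data
print(roc_auc(normal, anomaly))  # 0.9375
```

The threshold selection is equivalent to maximizing Youden’s J statistic (TPR + TNR − 1) over all candidate cut points.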

4.4 Feature analysis

To identify which parts of the face were extracted as features, the feature maps obtained when the anomaly image farthest from the anchor and the normal image closest to the anchor were input to the model were analyzed using principal component analysis (PCA). The average feature value of each feature map, \(w_{k}\), obtained from an input image was calculated using Eq. (2).

$$\begin{aligned} w_{k} = \frac{1}{Z}\sum _{i}\sum _{j} A_{ij}^{k}, \end{aligned}$$
(2)

where k denotes the filter index, A denotes the feature map, (i, j) denotes the spatial index, and Z is the spatial resolution of the feature map. The average features calculated for one input image were sorted in order of size, and the top half of the feature maps was selected. PCA was performed on the selected feature maps, and the first principal component, with a cumulative contribution of 90%, was used to identify the relevant facial regions.
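Equation (2) and the top-half selection can be sketched as follows; the feature maps are hypothetical nested lists standing in for the model’s activations.

```python
def channel_means(feature_maps):
    """Eq. (2): w_k = (1/Z) * sum_ij A_ij^k for each filter k."""
    return [
        sum(sum(row) for row in a) / (len(a) * len(a[0]))
        for a in feature_maps
    ]

def top_half(feature_maps):
    """Keep the half of the maps with the largest average activation;
    this subset is then passed to PCA."""
    w = channel_means(feature_maps)
    order = sorted(range(len(w)), key=lambda k: w[k], reverse=True)
    return [feature_maps[k] for k in order[: len(w) // 2]]

# Four toy 2 x 2 feature maps
maps = [
    [[0, 0], [0, 0]],   # w = 0.0
    [[1, 1], [1, 1]],   # w = 1.0
    [[0, 2], [0, 2]],   # w = 1.0
    [[4, 4], [4, 4]],   # w = 4.0
]
print(channel_means(maps))  # [0.0, 1.0, 1.0, 4.0]
```

The PCA step itself (extracting the first principal component across the selected maps) is omitted here; any standard implementation can be applied to the retained maps.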

Fig. 5 Distance distribution between normal and anomaly data when combining visible and thermal images at anomaly level 3

5 Result

5.1 Discrimination result by each model

The distance distribution between the normal and anomaly data for the combined facial visible and thermal images at anomaly level 3 is shown in Fig. 5. In Fig. 5, the dotted line indicates the threshold, the red stars indicate the anomaly data, and the blue circles indicate the normal data. The results for AUC, TPR, and TNR are summarized in Tables 2, 3, and 4.

Table 2 AUC results for varying anomaly levels
Table 3 TPR results for varying anomaly levels [%]
Table 4 TNR results for varying anomaly levels [%]

Figure 5 illustrates that most normal data were classified below the dotted line and most anomaly data above it. The results in Tables 2, 3, and 4 demonstrate that the FTI–VI condition detected anomaly states with high accuracy, and Tables 2 and 3 show that anomaly detection accuracy was high for anomaly levels 3 and 4. This suggests that when the anomaly level is low, classification accuracy is low because such states are difficult to distinguish from good physical condition. When the anomaly level is high, classification accuracy is good because the corresponding features are readily expressed on the face. Cutaneous blood flow changes as the anomaly level increases, which may have facilitated discrimination of the anomaly state. The simultaneous use of facial visible and thermal images may also have exploited an interaction between short- and long-term indicators.

Fig. 6 Feature maps based on PCA for each anomaly level when facial visible and facial thermal images are used together

5.2 Feature analysis

The feature maps based on PCA for each anomaly level when facial visible and thermal images were used simultaneously are shown in Fig. 6. Figure 6 presents the image of the first principal component from PCA; the brightest areas are the most characteristic features. For levels 1 and 2, the lower anomaly levels, features were observed around the eyes and lips. Levels 3 and 4, the higher anomaly levels, demonstrated common features at the eye sockets and the sides of the nose. The face is lined with capillaries, and arteries run along the sides of the nose and the inner corners of the eyes in particular. The sides of the nose may be more susceptible to hemodynamic effects at higher anomaly levels because the skin there is thinner than on the cheeks and other parts of the face. Together with the results in Sect. 5.1, this suggests that there may be a relationship between subjective health feelings and facial images, with the sides of the eyes and nose being important sites for detecting poor physical condition at high anomaly levels.

6 Conclusion

This study experimentally investigated whether there is an association between interday subjective health feelings and facial images. At higher anomaly levels, the combination of facial visible and thermal images classified subjective health feelings with moderate accuracy. Moreover, the inner corners of the eyes and the sides of the nose were extracted as features in the facial regions that may be related to subjective health feelings. This suggests that there may be a relationship between interday subjective health feelings and facial images. In the future, the authors plan to further improve accuracy by tuning hyperparameters, analyzing the questionnaire data related to physical condition, and considering practical applications in daily life.