1 Introduction

In recent years, a considerable number of applications have benefited from the use of the third dimension [47]. 3D is currently used successfully in several research fields: safety, such as autonomous driving [22]; orthopedics, for both diagnosis and treatment planning [106]; surgery, where 3D model reconstruction makes it possible to organize medical equipment [99], assist the surgeon during the intervention and support the post-operative evaluation of the results [70]; and 3D printing applications [21], including facial prostheses [67], dental implants [30] and pelvic prostheses [13].

The ambition of accelerating the evolution of cities into interconnected communities brings out other application areas as candidates for heavy 3D usage: land surveying [90], architecture [59], archaeology [42] for research and tourism purposes, and security. Smart cities, urban areas equipped with interconnected sensors that collect data used to manage products and services [3], aim to benefit from the spread of face recognition technology and deep learning techniques to solve problems such as quickly finding missing children, identifying criminals [105] or monitoring public places such as airports [83]. The geometry of surfaces acquired with sensors capable of capturing depth information can be used for more accurate face reconstruction [110], to build 3D aging models [82], and for face manipulation [44] and landmarking [31]. As explained in the next section, facial applications are a clear example of this, since face acquisition can be performed under different conditions depending on the usage scenario.

3D techniques have a higher computational cost than 2D methods [1], especially if the 3D face model has to be reconstructed from multi-view images [18] or through 3D morphable models obtained from 2D images and 3D scans, or even without 3D data [98]. Nonetheless, the robustness gained from being able to operate in critical lighting conditions [109], in the presence of occlusions [103, 28] and regardless of the orientation of the subject [102] makes a 3D approach preferable.

The literature on 3D is varied and fragmented, owing to the lack of a shared methodology for analyzing the field and developing new applications in the face of a growing number of RGB-D cameras on the market. This survey has been conducted to converge on a common standard and to provide a baseline for the design of the following real-time 3D facial applications: Face Detection, Face Authentication, Face Identification and Face Expression Recognition.

The total time a facial application needs to run is the sum of the acquisition time and the processing time. The former is the time required to obtain the RGB-D information and depends on the 3D camera; the latter is the time needed to process the acquired depth information and produce a result. Since the processing time does not depend on the 3D camera but on other elements of the application framework, such as the face analysis algorithms, it is not analyzed in the present work, which focuses on acquisition hardware.

This work aims to be a guide for choosing the right RGB-D camera for the facial application to be implemented. The focus is on camera technologies able to provide RGB images and depth maps, namely images in which each pixel has a value representing the distance from the camera. 3D scanners have not been taken into consideration because they do not work in real time: they require a minimum technical time to complete the scan.

The study is structured as follows. Section 2 focuses on facial applications and 3D sensor technologies, and explains the methodologies used for the investigation. Survey results are presented in Section 3, and conclusions are drawn in the final section.

2 Methodological analysis

This survey has been carried out through a two-step analysis. First, desk research has been performed to qualitatively investigate two aspects of 3D: the technologies available for computing depth and the facial applications developed so far that can benefit from 3D. Desk research is a complete review of the literature, including articles and datasheets, which is indispensable for analyzing in depth the functioning and potential of 3D sensors [100]. Second, QFDs (Quality Function Deployments) [66] have been used to quantitatively examine the relationships between two orthogonal dimensions, namely the qualitative requirements typical of each facial application and the technical specifications of 3D acquisition technologies. Both dimensions have been obtained from the results of the desk research.

2.1 Desk research on facial applications

The possibility of understanding and extracting information from the human face has interested many researchers in past decades, giving rise to a discipline called “Face Perception” [107]. The human brain is able to infer characteristics such as identity, age, sex and mood [15, 65, 29], a skill that infants possess from birth [69] and that develops during growth [40].

Since the recent spread of Computer Vision results has highlighted that technologies able to emulate human behavior are desirable, the idea of automating the face perception process has emerged. Nevertheless, the functioning of the human brain is highly complex, and the possibility of building a model able to replicate its behavior is currently remote. This explains why, in the literature, applications related to the automatic recognition of specific features of the human face are studied individually.

In this paper, the main facial applications have been considered: Face Detection; Face Recognition, in its two variants of Face Authentication and Face Identification; and Face Expression Recognition (Fig. 1).

Fig. 1 3D facial applications considered in the present work: Face Detection [86, 27, 46, 63], Face Authentication [41], Face Identification [79, 17, 49] and Face Expression Recognition [36, 37]

2.1.1 Face detection

Face Detection [68] aims to detect a face shape inside an image, or inside a frame in the case of a video stream. It is often used to crop the image for further processing, typically by another facial application, so that algorithms can focus on the region of interest. Nonetheless, Face Detection can be used stand-alone in applications such as counting the number of people in a room [111], automatically selecting a region of interest containing a face in order to tag it (as on Facebook), and preventing gatherings or otherwise monitoring crowds [61].

There exist various 2D techniques [46, 108] that aim to achieve a good trade-off between accuracy and speed. Common operations are localizing and discarding the background, which improves computational speed by focusing on the area of the image that carries the relevant information; normalizing the image with rotation and scaling operations to avoid false negatives; and finally extracting the facial features necessary for Face Detection [86].

3D techniques benefit from the intrinsic advantages of depth information, such as independence from lighting, pose and occlusions, and perform the detection through the analysis of surface curvature [27] or other geometrical features [63]. This is the family of methods considered in this paper for the sake of robustness, an essential characteristic for real-time video data streams; sensors should therefore provide high-quality depth maps over a wide operating range.

2.1.2 Face authentication

A clarification of the taxonomy is needed before discussing Face Recognition applications in more depth. Face Recognition aims to recognize a face detected in an image or in a frame by comparing it with another face or with a set of faces contained in a database.

Face Authentication belongs to biometric systems, which are solutions that control access to a private area using specific features of individuals [7]. Information obtained from different biometric systems is often fused to further improve robustness [85].

Fingerprints [57] and iris scans [43] are two of the best-known biometric systems for recognizing a person, but Face Authentication is becoming an increasingly common solution for identity certification on personal devices, especially laptops and smartphones [41], and for authorizing payments. The high degree of security needed to protect a personal device implies high accuracy in the recognition process. Consequently, 3D cameras must provide images of the best possible quality, so that the facial authentication algorithm can retrieve as many features as possible from the input depth maps and minimize false positives and false negatives.

In recent years, the spread of personal mobile devices equipped with RGB-D cameras has increased the use of the face for user authentication to such an extent that a new term has been coined, selfie biometrics [81].

2.1.3 Face identification

In this article, Face Identification refers to the variety of applications that perform Face Recognition [88] without the authentication purposes described in the previous section. Examples can be found in security, for criminal identification [79]; in marketing, to target specific customers or at least some of their features, such as age and gender [17]; and in healthcare, for health monitoring through a comparison between the current status of a patient and an image of the same patient in good health [49]. Some applications benefit from technological developments in portability to recognize criminals [38], patients [32] or other individuals [80].

In Face Identification applications, images or frames must be accurate enough to compare different facial features, and the result must be provided in a reasonable time. When working on a video data flow, the frame rate should be high enough to detect all the people in the camera field of view (FOV), especially those in motion, and the operating range should be adequately wide.

Face recognition algorithms working on 2D data must be used with caution as stand-alone solutions because of their vulnerability to spoofing attacks; additional methods, such as liveness detection, must be added to obtain a reliable face recognition technique. Furthermore, technological improvement has made the use of 3D data promising, since depth map details are increasingly refined and robust to spoofing attacks [2].

2.1.4 Face expression recognition

Face Expression Recognition [4] aims to identify the face within a frame and understand human emotions by observing different parts of the face and analyzing the Action Units proposed by Paul Ekman in his works [36, 37].

The need for such an application stems from the spread of human-computer interaction in a variety of fields [9]: marketing [91], smart TV [62], videogames [64], psychiatry [16], and the evaluation of users’ engagement [73, 72]. One of the most important fields of application is robotics, since the capability to automatically understand a human’s mood [11] significantly improves the safety of the human operator during the interaction.

Face Expression Recognition is a critical task, since some expressions are ambiguous and difficult to recognize even for a human observer. Geometrical analysis is the basis of this application, so input images should be detailed enough to identify expressions and to perform further analysis. Recent research shows that landmarks and facial units can be the starting point for detecting facial emotions [51], and that geometrical descriptors can be used as input to a CNN [74].

2.2 Desk research on RGB-D camera technologies

Interest in the applications mentioned above has received a further boost since the advent of low-cost 3D sensors, i.e. devices able to detect the third dimension. The release of the Microsoft Kinect in 2010 is one of the milestones in the diffusion of these devices. This sensor was designed and developed for the specific purpose of recognizing human body actions, enabling an original type of human-machine interaction aimed at controlling the movements of characters, vehicles or any other object inside a videogame.

Several types of 3D sensors have been released in recent years, and the underlying technology is the most suitable characteristic for grouping sensors according to the similarity of their main parameters (Fig. 2).

Fig. 2 3D Technologies [14]

All the 3D sensors mentioned above are also known as RGB-D cameras, because they provide two types of data: RGB and D (depth). RGB refers to the color model in which every color is displayed using the three primary colors red, green and blue; in other words, it identifies the color images. Depth information is retrievable through depth maps, images in which each pixel has a value representing the distance from the camera. This type of data is more reliable than 2D data and is suitable for real-time applications. Indeed, it is possible to analyze the depth map without building a mesh: every 3D object is described by x, y coordinates and depth values instead of a set of vertices, edges and faces. The result is a more responsive acquisition system at the cost of accuracy. The present work focuses on 3D sensors because it is necessary to understand which technology can preserve high-quality depth data while working in real time. This choice reflects a focus on technologies and data that will be widely adopted in the near future, when the accuracy of the third dimension will be exploited for several purposes and real-time data analysis will be central to most acquisition systems [24].
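To illustrate how a depth map can be used directly, without building a mesh, the following minimal Python sketch converts each valid pixel into an (x, y, z) point. It assumes a pinhole camera model; the function name `depth_map_to_points`, the intrinsic parameters `fx`, `fy`, `cx`, `cy`, and the convention that a depth of 0 marks a missing measurement are illustrative assumptions and are not taken from any surveyed sensor.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def depth_map_to_points(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project a depth map into 3D points (pinhole model, assumed).

    depth  : rows of depth values in metres; values <= 0 are treated
             as missing measurements (an assumed convention).
    fx, fy : focal lengths in pixels (hypothetical intrinsics).
    cx, cy : principal point in pixels (hypothetical intrinsics).
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:
                continue                   # skip holes in the depth map
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Each pixel is processed independently, which is what keeps this representation lightweight compared with a mesh of vertices, edges and faces.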

Some of the applications mentioned above can have a considerable computational cost. 3D cameras, and the devices that may integrate them, must be able to acquire information in real time, but the processing can be performed by remote systems. This solution can be planned at design time, before implementing a facial application, so that the application is not constrained by the processing capabilities of the device; the device must still sustain the 3D camera frame rate and stay connected to the remote system.

The way each technology provides the depth map is described in the following paragraphs.

2.2.1 Passive stereoscopy

Passive stereo requires the presence of at least two cameras for acquiring different images of the same object or environment from different points of view [93, 20, 35, 71, 84, 34].

To determine the distance of each point detected by this type of camera, triangulation (or computational stereopsis) must be performed, which requires solving the so-called correspondence problem. Given the calibrated camera parameters, the conjugate points must be found, i.e. the two pixels, one in each acquired frame, that represent the same point of the scene.
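Once a pair of conjugate points has been found, depth follows from their disparity. The sketch below uses the standard relation Z = f · B / d, which holds under the assumption (not stated in the text) of a rectified stereo pair with parallel optical axes; the function and parameter names are hypothetical.

```python
def depth_from_disparity(
    focal_length_px: float,
    baseline_m: float,
    disparity_px: float,
) -> float:
    """Depth in metres of a scene point seen by a rectified stereo pair.

    focal_length_px : focal length in pixels (assumed equal for both cameras)
    baseline_m      : distance between the two camera centres, in metres
    disparity_px    : horizontal offset between the conjugate points, in pixels
    """
    if disparity_px <= 0:
        # No valid correspondence (occluded or featureless region),
        # or a point at infinity: depth cannot be triangulated.
        raise ValueError("disparity must be positive to triangulate depth")
    return focal_length_px * baseline_m / disparity_px
```

Because depth is inversely proportional to disparity, a small matching error at small disparities (distant points) translates into a large depth error.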

The main drawback of stereo cameras is the need for a scene free of occlusions, so that the shape of the object can be detected by both cameras. This is not trivial, since the object geometry can be complex enough that some parts are visible to one camera and hidden from the other. An example is the alae, the two points that lie on the right and left of the nose and are commonly used as landmarks for computing nose width [101]. In addition, the scene must not be featureless, since the correspondence problem can be solved only if the same features can be found by both cameras.

The price of these cameras ranges from 150 $ to 700 $, depending on both their features and their release date.

2.2.2 Structured light

Structured light depth cameras have been studied to overcome the issue of reliability of correspondences [54, 56, 6, 8, 39, 75, 77, 95]. If there are two or more cameras filming an object, however close they may be, they will frame different parts of the object and not all the points of the object will be visible from all the cameras. Furthermore, if cameras are too close to each other, disparity will not be large enough to make the triangulation process possible.

The technology consists in projecting a pattern onto the object using a transmitter and then evaluating the deformation of the pattern on the object as detected by a receiver. This solution allows the transmitter and receiver to be placed close to each other, since the distance is computed without relying on disparity, and consequently the occlusion issue is minimized.

The projected pattern can assume different configurations to perform the correspondences estimation according to design concepts. Adopted strategies are wavelength multiplexing, range multiplexing, temporal multiplexing, and spatial multiplexing [87].

This type of camera can be considered quite cheap compared to the other technologies: price is usually not higher than 200 $ with a few exceptions.

2.2.3 Time-of-flight (ToF)

ToF cameras were considered professional-grade only until Microsoft released the second version of the Kinect, commonly referred to as Kinect v2 or Kinect One because it was developed for the Microsoft Xbox One console, whereas the Kinect v1 was developed for the Xbox 360.

This technology relies on knowledge of the speed of light in air. Distances are evaluated by projecting an electromagnetic wave onto the scene and measuring the time it takes to reach the receiver.
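The principle can be sketched as a direct time-to-distance conversion. The example assumes that the measured time is the round trip from transmitter to scene and back, so the distance is half the travelled path; the function name and the approximate refractive index of air (1.0003) are illustrative assumptions. Many commercial ToF cameras estimate this time indirectly, from the phase shift of a modulated signal, which is not modelled here.

```python
SPEED_OF_LIGHT_VACUUM_M_S = 299_792_458.0  # exact, by definition of the metre


def tof_distance(round_trip_s: float, refractive_index: float = 1.0003) -> float:
    """Distance in metres from a round-trip time of flight.

    round_trip_s     : time between emission and reception, in seconds
    refractive_index : refractive index of the medium; 1.0003 is an
                       approximate value for air (assumption)
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    speed_in_medium = SPEED_OF_LIGHT_VACUUM_M_S / refractive_index
    # The wave travels to the scene and back, hence the division by two.
    return speed_in_medium * round_trip_s / 2.0
```

At these speeds a face at 0.5 m corresponds to a round trip of only a few nanoseconds, which gives an idea of the timing precision such sensors require.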

A remarkable advantage of this technology is that the transmitter and receiver can be placed even closer together than in structured light depth cameras. Moreover, ToF sensors can reach high frame rates, making them suitable for real-time applications [92, 50, 89, 10, 96, 97, 33].

On average, ToF cameras are the most expensive on the market since they were born for industrial applications. Nonetheless prices cover a very wide range: from 80 $ to thousands of dollars.

2.2.4 Active stereoscopy

Active stereo is a vision technique in which stereo and structured light, or laser, are combined to benefit from the advantages of both technologies [19, 55, 52, 53]. A 3D sensor built according to this technology is equipped with two cameras set apart from each other and a projector between them, usually working in the IR spectrum. This solution improves accuracy in 3D detection and, above all, extends the operating range [12].

Active stereoscopy cameras are characteristic of Intel, which offers them at a cost between 130 $ and 400 $. The most recent devices cost 150 $ to 200 $.

2.3 Benchmarking

The 3D sensor technologies have been benchmarked by evaluating the parameters available both in the literature and in datasheets. The parameters taken into consideration are:

  • Resolution: horizontal and vertical number of pixels

  • Frame rate: number of images captured in one second (FPS, Frames Per Second)

  • Minimum distance: the shortest distance at which the sensor operates

  • Maximum distance: the longest distance at which the sensor operates

  • Range: difference between minimum distance and maximum distance

  • Field of view (FOV): this parameter indicates the part of the scene visible through the sensor

  • Size: sensor dimensions.
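The parameters listed above can be collected in a small record, with the range derived from the two distances rather than stored separately. The following Python sketch is only one possible representation; the class name, field names and the example values are hypothetical and do not describe any of the sensors analyzed in this work.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SensorSpecs:
    """Benchmark parameters of an RGB-D sensor (illustrative structure)."""

    name: str
    resolution_px: Tuple[int, int]       # (horizontal, vertical) pixels
    frame_rate_fps: float                # images captured per second
    min_distance_m: float                # shortest operating distance
    max_distance_m: float                # longest operating distance
    fov_deg: Tuple[float, float]         # (horizontal, vertical) field of view
    size_mm: Tuple[float, float, float]  # sensor dimensions

    @property
    def range_m(self) -> float:
        """Range: difference between maximum and minimum distance."""
        return self.max_distance_m - self.min_distance_m

    def covers(self, distance_m: float) -> bool:
        """True if a subject at distance_m lies inside the operating range."""
        return self.min_distance_m <= distance_m <= self.max_distance_m


# Made-up values, for illustration only.
example = SensorSpecs(
    name="hypothetical-sensor",
    resolution_px=(640, 480),
    frame_rate_fps=30.0,
    min_distance_m=0.3,
    max_distance_m=3.0,
    fov_deg=(60.0, 45.0),
    size_mm=(90.0, 25.0, 25.0),
)
```

With these made-up values, `example.covers(0.4)` holds while `example.covers(0.2)` does not, which is the kind of check needed when a usage scenario fixes the distance between user and camera.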

Twenty-six sensors belonging to the four categories explained above have been analyzed to identify strengths and weaknesses of each 3D detection technology (Table 1).

Table 1 RGB-D cameras considered in this work

2.3.1 Passive stereoscopy

Six passive stereo sensors have been considered (Table 2). Stereo cameras have quite good operating ranges, thanks to maximum distance values that make most of them suitable for acquisition at over 3 m, but a poor minimum operating distance. The minimum distance values reported in this work, which are also taken directly from sensor datasheets, are often misleading: at that distance it is possible to acquire a depth map, but its quality is very poor, especially for facial applications. This is a technological problem: passive stereoscopy uses the disparity between two cameras to retrieve depth information. If the camera is close to the subject, many points will be present in only one of the images because of occlusions, making the images very difficult to merge. The resulting depth images contain holes too large for the data to be usable. For this reason, datasheets often report a second minimum value, the optimal minimum distance, which is usually greater than 50 cm.

Table 2 Passive stereoscopy sensors specs

By contrast, resolution is excellent, while frame rate has quite different nominal values (3, 15, 30, 45 FPS).

2.3.2 Structured light

Eight structured light sensors have been analyzed (Table 3). Minimum distance is undoubtedly the strength of this technology: the minimum operating distance of several sensors is between 20 cm and 40 cm. Frame rate is remarkable too, as almost all sensors work at 30 or 60 FPS. Maximum distance and operating range are the weaknesses of this technology, since most sensors have a suggested upper limit between 1.5 m and 2.5 m. Resolution is remarkable for short range, since only one sensor is 320 × 240 and the others are 640 × 480 or above.

Table 3 Structured light sensors specs

2.3.3 Time-of-flight

Among the 8 ToF sensors analyzed (Table 4), just one can be considered suitable for facial applications. The other sensors in this category have an excellent maximum distance (greater than 4 m), a decent frame rate (20–30 FPS), but a poor minimum distance (0.5 m) and resolution (640 × 480 is the only remarkable value; all the others are below).

Table 4 ToF sensors specs

Values in the table are strongly influenced by a single sensor built with the specific purpose of working at close distance. For this reason, the comments reported above are very important for understanding the considerations in the “Results and discussion” section.

2.3.4 Active stereoscopy

The four active stereoscopy sensors considered (Table 5) are the most recent on the market, launched in 2015 or later.

Table 5 Active stereoscopy sensors specs

They can be considered the best trade-off between all the parameters, with good minimum distance (around 30 cm except for the worst one), maximum distance (up to 10 m), 30 FPS frame rate and good resolution (two of them reach 1280 × 800).

A special mention goes to the best minimum distance found during the desk research (0.11 m). However, the minimum operating distance of all the other sensors exceeds 0.3 m, so structured light sensors must still be considered the state of the art for minimum operating distance.

Sensor datasheets report the size including the chassis and the support. Consumer-grade sensors can be integrated into personal devices such as smartphones, tablets and laptops without chassis and support, so it is useful to understand the physical space that each technology requires. Passive stereo and active stereo need more space because of the two separate cameras used to detect the third dimension through disparity, while for structured light and ToF technologies the size can be limited by placing the transmitter and receiver as close together as possible.

A brief recap of main advantages and disadvantages for each technology can be found in Table 6.

Table 6 Advantages and disadvantages of analyzed technologies

2.4 Quality function deployment (QFD)

Once the desk research has been completed, the QFD has been used to integrate two orthogonal dimensions, namely sensors’ technical specifications and facial applications requirements. The aim of this stage is to identify their interconnections evaluating how much each technical specification is important in relation to a certain application requirement.

QFD [66] is a method for transforming qualitative user demands into quantitative parameters, and its basic design tool is the house of quality. On the vertical axis there are the user desires (What’s); on the horizontal axis there are the technical requirements (How’s) that may be useful to satisfy them. A weight between 1 and 5 is given to each user desire according to the final application to be designed. In the other cells of the table, a score of 1, 3 or 9 [26] is given according to the contribution that each technical requirement makes to each user desire, meaning “weak”, “moderate” and “strong” respectively. A value of 0 is given if there is no relationship. The scores attributed to the relationships can vary according to different ways of building a QFD [58]. In this case, 0, 1, 3 and 9 have been chosen because they best reflect how people perceive the correlation process and because they reward strong correlations.

Four QFDs have been drawn up, one for each facial application previously explained, and they are structured as follows: qualitative application requirements, namely the main characteristics that an application should have, are listed on the first column and the importance of each qualitative requirement is listed on the second column. On the first row there are the technical specifications (How’s), and contrary to the qualitative requirements, that are slightly different between the applications, the technical requirement list is the same for each of the four QFDs.

The technical specifications considered are the depth sensor parameters extracted from the desk research. Specifically, the technical requirements are the resolution, the frame rate, the minimum and maximum operating distances, the range, the FOV, the dimensions and the technology used to build the sensor. A brief remark on resolution is in order: a high value means not only that there are more pixels in the same image, and consequently higher accuracy, but also that the resolution can be reduced to increase the frame rate in applications where real-time operation is critical.
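As a rough illustration of trading resolution for speed, the sketch below reduces a depth map in software by keeping one pixel out of every `factor` in each direction. This is only an assumed, simplified strategy: the function name is hypothetical, and actual sensors may instead expose native lower-resolution modes with their own frame rates.

```python
from typing import List, Sequence


def downsample_depth_map(
    depth: Sequence[Sequence[float]], factor: int
) -> List[List[float]]:
    """Keep every `factor`-th row and column of a depth map.

    The output has roughly 1 / factor**2 of the original pixels,
    so every later processing step has less data to handle.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(row[::factor]) for row in depth[::factor]]
```

Plain subsampling discards detail without averaging, so it suits a sketch better than applications where fine facial geometry matters.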

In the final row the relative total score of each technical specification is specified. Relative total score is a percentage of how much a technical requirement is important compared to the others. Its values are computed as follows:

  1. For each technical requirement, a total score is computed as a sum of products between the application requirement weights and the corresponding evaluation scores given to the technical requirements.

  2. For each total score obtained at point 1 the percentage is computed considering the sum of all the total scores as 100%.
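The two steps above can be sketched in Python as follows. Rows of the relationship matrix correspond to application requirements and columns to technical requirements; the function names and the numbers in the example are hypothetical and are not taken from the QFDs built in this work.

```python
from typing import List, Sequence

ALLOWED_SCORES = {0, 1, 3, 9}  # none, weak, moderate, strong


def total_scores(
    weights: Sequence[int], relationships: Sequence[Sequence[int]]
) -> List[int]:
    """Step 1: weighted sum of relationship scores per technical requirement."""
    if not relationships or len(weights) != len(relationships):
        raise ValueError("one non-empty row of scores is needed per weight")
    n_specs = len(relationships[0])
    for weight, row in zip(weights, relationships):
        if not 1 <= weight <= 5:
            raise ValueError("weights must be between 1 and 5")
        if len(row) != n_specs or any(s not in ALLOWED_SCORES for s in row):
            raise ValueError("rows must have equal length and scores in {0, 1, 3, 9}")
    return [
        sum(w * row[j] for w, row in zip(weights, relationships))
        for j in range(n_specs)
    ]


def relative_scores(
    weights: Sequence[int], relationships: Sequence[Sequence[int]]
) -> List[float]:
    """Step 2: each total score as a percentage of the sum of all totals."""
    totals = total_scores(weights, relationships)
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("all relationships are zero; percentages are undefined")
    return [100.0 * t / grand_total for t in totals]


# Hypothetical example: two application requirements, three technical requirements.
example_weights = [5, 3]
example_matrix = [
    [9, 1, 0],
    [3, 9, 1],
]
example_totals = total_scores(example_weights, example_matrix)
example_relative = relative_scores(example_weights, example_matrix)
```

In this made-up example the first technical requirement obtains 5 · 9 + 3 · 3 = 54 points out of 89, so it dominates the ranking even though the second one has a strong relationship with the lower-weighted requirement.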

3 Results and discussion

Generic raw data have been translated into the values entered in the QFDs (9–3–1–0 scores) after a discussion held by a focus group. The focus group proved essential for accurately evaluating the technical requirements, thanks to the involvement of researchers from several areas. It is composed of eleven people, five women and six men: four are computer science engineers whose research field involves computer vision and RGB-D cameras; three are management engineers; two are biomedical engineers, experts in face analysis; one is an electronic engineer; and one is a mathematical engineer whose competences involve facial feature extraction.

The focus group also assigned weights and scores to each of the requirements as a result of a discussion among all participants, so that everyone has intervened in the debate giving a contribution linked to the specific area of expertise, and the final value has been unanimously assigned.

Results are presented in the following sections.

3.1 Face detection

Even if accuracy matters in all contexts, this constraint is less strict for stand-alone Face Detection applications than for other facial applications: once the face is detectable, details of the facial surface are not required. This does not mean that accuracy is irrelevant. A trade-off between accuracy and resources (computational and storage) is always necessary; in Face Detection applications, however, the balance can be set closer to resources than in Face Authentication, Face Identification and Face Expression Recognition applications. Moreover, flexibility should be a strength of this application, so that it can work across all range, light, pose and occlusion conditions (Table 7). Qualitative requirements are:

  • Real-time: faces should be detected when an individual enters the camera field-of-view [5].

  • Wide operating range: faces should be detectable whether an individual is approaching the camera or moving away.

  • Accurate at close distance: faces should be detectable if an individual is close to the camera.

  • Accurate at far distance: faces should be detectable if an individual is far from the camera.

  • Able to discriminate faces among other elements in the environment: the core of the application, if a face is present in the scene, then it should be detected.

  • Integrable into a smartphone: sensors should be suitable for integration into a smartphone, a tablet, or a laptop to perform Face Detection.

  • Portable: this requirement suggests having a sensor small enough to be easily carried by the user.

  • Small output data: the detected face should be reported without spending too many resources in terms of memory, for reasons of storage and computational speed. Nonetheless, it is mandatory to preserve a level of accuracy that allows faces to be detected.

  • Robust to light: faces should be detected whatever the light conditions are (e.g. in the dark, on a sunny day…).

  • Head pose invariant: faces should be detected whatever the individual's orientation relative to the camera is.

  • Robust to occlusions: faces should be detected in the presence of occlusions (e.g. glasses, scarves…).

The relative importance of the sensor parameters is shown in Fig. 3. The radar chart shows that resolution is the most important parameter, followed by the maximum operating distance, since Face Detection applications must detect subjects that do not necessarily position themselves in front of the camera.

Fig. 3 Specs relative importance for Face Detection

3.2 Face authentication

Face Authentication requires the lowest possible error rate. Users are aware of the sensitivity of this application, so real-time operation is not strictly required, but speed should be high enough to compete with other types of authentication (for instance, entering a PIN code). Nevertheless, speed must not come at the expense of accuracy, which is the main requirement for Face Authentication (Table 8). Qualitative requirements are:

  • Fast enough to unlock a device: this application does not require real-time operation, but the unlocking time should not be annoying for the user.

  • Accurate at close distance: faces should be recognized at the short distance typical of smartphone, tablet or laptop use.

  • Able to detect facial features: facial landmarks for face analysis must be detected.

  • Integrable into a smartphone: sensors should be suitable for integration into a smartphone, a tablet, or a laptop to perform Face Authentication.

  • Robust to light: faces should be recognized whatever the light conditions are (e.g. in the dark, on a sunny day…).

  • Robust to occlusions: faces should be detected in the presence of small occlusions (e.g. glasses).

Table 7 Face Detection

The relative importance of the sensor parameters is shown in Fig. 4. The radar chart shows that resolution and minimum operating distance are the most important technical requirements, consistent with the most common usage scenario: a user who must unlock a personal device. Frame rate and dimensions are also influential, since a user must not wait too long to be authenticated (otherwise another authentication method would be preferable) and the system should be integrable into personal devices such as smartphones, tablets and laptops.

Fig. 4

Specs relative importance for Face Authentication

3.3 Face identification

This application requires reconciling the accuracy needed for face analysis with the robustness needed to work under varying range, lighting, pose and occlusion conditions. Close distance is less relevant here, since Face Identification differs from Face Authentication, as previously explained (Table 8).

Table 8 Face Authentication

Qualitative requirements about Face Identification are:

  • Real-time: a subject should be identified before leaving the field of view of the camera [23].

  • Wide operating range: faces should be identified whether an individual is approaching the camera or moving away from it.

  • Accurate at close distance: faces should be identified if an individual is close to the camera.

  • Accurate at far distance: faces should be identified if an individual is far from the camera.

  • Able to detect facial features: facial landmarks for face analysis must be identified.

  • Integrable into a smartphone: sensors should be small enough to be embedded into a smartphone, a tablet or a laptop to perform Face Identification.

  • Portable: this requirement suggests having a sensor small enough to be easily carried by the user.

  • Robust to light: faces should be identified whatever the lighting conditions are (e.g. in the dark, on a sunny day…).

  • Head pose invariant: faces should be identified whatever the orientation of the individual relative to the camera is.

  • Robust to occlusions: faces should be identified in the presence of occlusions (e.g. glasses, scarves…).

  • Robust to different face expressions: faces should be identified whatever the individual's mood is.

The relative importance of the sensor parameters is shown in Fig. 5. The radar chart confirms resolution as the most important technical requirement: recognizing features is mandatory in order to apply facial algorithms. All the technical requirements linked to the operating distance appear right after resolution in the ranking, since the sensor should be able to recognize subjects at varying distances from the camera. This result is significantly different from Face Authentication and supports the choice of splitting Face Recognition applications into Face Authentication and Face Identification.

Fig. 5

Specs relative importance for Face Identification
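The relative importance plotted in the radar charts can be illustrated with a minimal sketch. It assumes the conventional QFD weighted sum (each requirement weight multiplied by the strength of its relationship with a technical specification, then normalized to percentages), which this section does not spell out. The function name, the weights and the 0/1/3/9 relationship scores are hypothetical placeholders, not the values assigned by the focus group.

```python
# Hypothetical QFD sketch: weights and relationship scores are illustrative
# placeholders, not values taken from the survey.

def relative_importance(weights, relationships):
    """Return the percentage share of each technical specification.

    weights: {requirement: importance weight}
    relationships: {requirement: {spec: strength}} on an assumed 0/1/3/9 scale
    """
    totals = {}
    for requirement, weight in weights.items():
        for spec, strength in relationships[requirement].items():
            totals[spec] = totals.get(spec, 0) + weight * strength
    grand_total = sum(totals.values())
    return {spec: 100 * value / grand_total for spec, value in totals.items()}


weights = {
    "able to detect facial features": 5,
    "wide operating range": 4,
    "real-time": 3,
}
relationships = {
    "able to detect facial features": {"resolution": 9, "max distance": 1, "frame rate": 0},
    "wide operating range": {"resolution": 3, "max distance": 9, "frame rate": 0},
    "real-time": {"resolution": 1, "max distance": 0, "frame rate": 9},
}

shares = relative_importance(weights, relationships)
ranking = sorted(shares, key=shares.get, reverse=True)
print(ranking)
```

With these made-up inputs, resolution takes the largest share and the distance-related specification follows, which is the ordering described for Face Identification; real scores would come from the filled-in QFD.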

3.4 Face expression recognition

Qualitative requirements about Face Expression Recognition are very similar to the Face Identification ones since the operating conditions are almost the same (Table 9):

  • Real-time: individuals' expressions should be recognized whenever an event associated with what they are attending is triggered [76].

  • Wide operating range: individuals’ expressions should be recognized whether an individual is approaching the camera or moving away from it.

  • Accurate at close distance: individuals’ expressions should be recognized if an individual is close to the camera.

  • Accurate at far distance: individuals’ expressions should be recognized if an individual is far from the camera.

  • Able to detect facial features: facial landmarks for face analysis must be recognized.

  • Integrable into a smartphone: sensors should be small enough to be embedded into a smartphone, a tablet or a laptop to perform Face Expression Recognition.

  • Portable: this requirement suggests having a sensor small enough to be easily carried by the user.

  • Robust to light: individuals’ expressions should be recognized whatever the lighting conditions are (e.g. in the dark, on a sunny day…).

  • Head pose invariant: individuals’ expressions should be recognized whatever the orientation of the individual relative to the camera is.

  • Robust to occlusions: individuals’ expressions should be recognized in the presence of occlusions (e.g. glasses, scarves…).

Table 9 Face Identification

The relative importance of the sensor parameters is shown in Fig. 6. The radar chart is very similar to the Face Identification one, which is not surprising. In both cases resolution must be excellent in order to discriminate between different features in the resulting images. Data should be retrievable whether the subject is close to or far from the camera and, regarding frame rate, data should be available several times per second (a requirement satisfied by the vast majority of the analyzed sensors). Finally, dimensions and field of view carry less weight, because sensors need not be portable and can be placed in strategic locations to avoid FOV issues.

Fig. 6

Specs relative importance for Face Expression Recognition

A comparison between the specs of the facial applications is reported in Fig. 7. In addition to the comments already made, resolution can be universally recognized as the most important parameter, followed by the technical requirements linked to the operating distance (minimum, maximum and range), depending on the facial application. The relative importance of frame rate varies from 10% to 15%, which can be explained as follows: nowadays real-time operation is a mandatory requirement for facial applications, but the bottleneck is not the choice of the sensor but the computationally demanding techniques, so the focus must shift to the choice (and the implementation) of a suitable algorithm.

Fig. 7

Specs comparison for the different facial applications

Now that technical specifications and facial applications have been analyzed, the most suitable 3D detection technology can be identified (Table 11).
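The technology selection step can be sketched as a weighted score, assuming that each technology receives a rating per technical specification and that ratings are combined using the relative importance of the specifications. The importance shares and the 1 to 5 ratings below are hypothetical placeholders chosen only to illustrate the mechanics; they are not the figures behind Table 11.

```python
# Hypothetical sketch: importance shares and 1-5 ratings are illustrative
# placeholders, not values from the survey.

def rank_technologies(importance, ratings):
    """Return (scores, ranking), with technologies sorted best first."""
    scores = {
        technology: sum(importance[spec] * rating for spec, rating in specs.items())
        for technology, specs in ratings.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return scores, ranking


# Assumed relative importance of three specs for a single facial application.
importance = {"resolution": 0.5, "max distance": 0.3, "frame rate": 0.2}

# Assumed ratings of each technology on each spec (5 = best).
ratings = {
    "passive stereo": {"resolution": 5, "max distance": 4, "frame rate": 3},
    "active stereo": {"resolution": 4, "max distance": 4, "frame rate": 3},
    "structured light": {"resolution": 4, "max distance": 2, "frame rate": 3},
    "ToF": {"resolution": 1, "max distance": 5, "frame rate": 4},
}

scores, ranking = rank_technologies(importance, ratings)
print(ranking)
```

The sketch shows why a technology with an outstanding value on a lightly weighted specification can still rank last: the weighted sum is dominated by the specifications with the highest relative importance.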

ToF cameras are the best in terms of long-range operation [48, 94], but this strength is of little use for facial applications, and they are weaker than the other technologies in terms of resolution; this is why ToF is the worst choice for the considered facial applications.

Passive stereo technology has proved to be the most suitable choice for Face Detection applications, thanks to its trade-off between high resolution and a remarkable maximum operating distance [60, 25]. It is followed by active stereo technology and, in third position, by structured light cameras, which are penalized by their poor maximum operating distance.

Face Detection has been taken into account throughout the evaluation process not only as a stand-alone application, but also as a preliminary step of Face Authentication, Face Identification and Face Expression Recognition.

Scores of these facial applications have been given from a global point of view.

In particular, when the focus group gathered for the evaluation, the main steps of each facial application were taken into account: the group discussed the face detection step as well as the subsequent steps, such as feature extraction or analysis with neural networks.

Face Detection requirements in stand-alone applications are different from Face Detection requirements as a preliminary step. The requirements of Face Authentication, Face Identification and Face Expression Recognition do consider Face Detection as a part of their algorithms, but some of the requirements may change depending on the application in which it is incorporated.

In more detail, if included in a Face Authentication application, Face Detection can accept a higher response time than Face Detection as a stand-alone application such as counting people in a room. Besides, the range need not be wide, because Face Authentication use cases are at short distance.

Open access funding provided by Politecnico di Torino within the CRUI-CARE Agreement.

Considering Face Identification and Face Expression Recognition, the shapes of the radar charts for these applications are very similar to each other, and the Face Detection one is not too different. This shows that Face Detection requirements played a role in the evaluation of the requirements of these facial applications. The requirements are not overturned when Face Detection is considered as stand-alone rather than integrated; nonetheless, there are some differences. In terms of relative importance, frame rate has a greater value in stand-alone Face Detection applications, while minimum distance gains importance in Face Identification and Face Expression Recognition.

The situation is inverted in Face Authentication. Since minimum distance is the most important parameter, together with resolution, structured light technology, with its excellent minimum operating distance, has proved to be the best for this application [78, 104]. It must be noted that one active stereoscopy sensor appears to be the best at close range, but this does not hold across the broader set of sensors. Since active stereoscopy is the most recent technology, it is wise to bear this result in mind, but the time is not yet ripe to claim that it is the best technology for close-range applications and, consequently, for Face Authentication.

Face Identification and Face Expression Recognition have proved to be similar in terms of qualitative requirements; in fact, the shapes of their radar charts for the relative importance of technical specifications are very close to each other. Active stereoscopy is the most suitable technology for these applications [45], because it offers good resolution at both close and long operating distances. Passive stereoscopy is the second-best choice, thanks to its very high resolution and its ability to operate at long distance, which is more relevant here than close distance. For the same reason structured light is in third position: its poor maximum operating distance penalizes this sensor category.

The key technical specifications used to analyze 3D sensor technologies are linked more strongly to accuracy than to acquisition time. The datasheet analysis showed that all the considered 3D sensors can provide several frames per second in single-shot acquisition; since all of them can satisfy the real-time requirement, it was necessary to focus on other technical specifications to discriminate between RGB-D cameras and to evaluate 3D acquisition technologies.
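The reasoning above, that a specification met by every candidate cannot help choose between them, can be illustrated with a short sketch. The sensor names, datasheet values and minimum thresholds are hypothetical and serve only to show the check; they do not correspond to any sensor analyzed in the survey.

```python
# Hypothetical datasheet values and thresholds, for illustration only.
sensors = {
    "sensor_a": {"frame_rate_fps": 30, "max_distance_m": 1.2},
    "sensor_b": {"frame_rate_fps": 60, "max_distance_m": 8.0},
    "sensor_c": {"frame_rate_fps": 15, "max_distance_m": 3.5},
}
minimums = {"frame_rate_fps": 10, "max_distance_m": 3.0}


def discriminating_specs(sensors, minimums):
    """Return the specs whose minimum is not met by every sensor."""
    return [
        spec
        for spec, minimum in minimums.items()
        if not all(values[spec] >= minimum for values in sensors.values())
    ]


useful_specs = discriminating_specs(sensors, minimums)
print(useful_specs)
```

In this made-up example every sensor clears the frame-rate threshold, so only the distance specification separates the candidates.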

4 Conclusions

In this paper a survey has been conducted to understand which 3D sensor technology best fits different facial applications. The qualitative requirements for the most common face applications and the sensor specifications considered in the present survey are the result of desk research on 3D facial applications and 3D sensor technologies.

A focus group filled in four QFDs to identify the main features involved in each application and to determine the most suitable technology for depth detection. Results show that passive stereoscopy is the best technology choice for Face Detection applications, structured light is the most suitable sensor technology for Face Authentication applications and active stereo is the most suitable technology for Face Identification and Face Expression Recognition (Table 10).

Table 10 Face Expression Recognition
Table 11 Technology ranking for facial application

Future work consists of performing an empirical analysis of 3D sensors to validate the theoretical results presented in this survey. Furthermore, a 3D QFD will be presented to further guide the choice of technology for facial applications, introducing a new orthogonal dimension in addition to qualitative requirements and technical specifications.