1 Introduction

Face recognition technology uses a computer, rather than a human observer, to extract effective identifying information from a face image and determine the subject's identity [1, 2]. Facial features are difficult to forge, cannot be lost, are always carried with the user, and are easy to use, so the technology overcomes the shortcomings of traditional identity authentication and provides a more secure and reliable authentication mechanism. As a basic application of artificial intelligence, face recognition has therefore been widely adopted in public safety and enterprise management, where it has reduced costs, improved service efficiency and management, and gained broad recognition across many industries [3].

As human-computer interaction develops, users increasingly accept interaction modes that are more practical, more natural, and more intelligent, and applying interaction technology to graphics and images ultimately pursues a better user experience. Face detection can reduce the number of manual operations a user performs and offers a new interaction channel distinct from touching a screen, truly freeing the hands and making interaction more convenient and less burdensome [4]. Nevertheless, the mainstream interactive devices remain the traditional keyboard, mouse, and touch screen, all of which have clear limitations. A keyboard supports only text input; a mouse supports only cursor movement, clicking, pointing, and other simple operations, and cannot convey richer, higher-level semantic information. A touch screen lets the user manipulate the interface directly with the fingers, and this WYSIWYG interaction greatly improves efficiency compared with mouse and keyboard; even so, finger touch still suffers from cumbersome operation and occlusion of the display.

On the other hand, face detection on mobile devices is often too slow to meet real-time requirements. Zhenyu et al. [5] proposed a real-time face detection method based on optical flow estimation for video face detection on mobile devices; its accuracy approaches that of convolutional neural network (CNN) [6] methods and its speed basically satisfies real-time monitoring, but it targets medium- and low-end devices and cannot handle high resolutions. Approximately 90% of the face detection devices on the market capture images at the front end while running detection on a back-end server. Such devices depend heavily on the network, which slows detection down and makes them unsuitable for places with poor network conditions, worsening the application effect [7]. Yu's [8] face terminal identity recognition simulation technology for mobile-device network security neither specifies which mobile devices it suits nor has been verified on Microsoft HoloLens devices.

At present, face detection on Microsoft HoloLens can only be achieved by remotely calling a face detection application programming interface (API), which is restricted by the network, so detection is slow and fails to meet real-time demands. This paper introduces a face detection algorithm for the Microsoft HoloLens holographic glasses. The algorithm upgrades the classical Viola-Jones algorithm [9,10,11,12] with an expanded Haar-like [13] rectangle feature set and is accelerated through two-dimensional convolution separation and image re-sampling. In addition, the HoloLens depth camera is used for 3D face detection and localization, so face detection runs locally on the device. As the experimental results show, the algorithm outperforms the existing Microsoft Azure Face API [14] in both detection accuracy and speed. HoloLens supplements and superimposes real and virtual information, creating an environment that blends the real and the virtual; equipped with face detection, it not only improves the user experience but also contributes to a smarter and easier life, with applications in fields such as social interaction, public security, and business management.

2 HoloLens overview

Microsoft HoloLens is the first holographic smart glasses that operates untethered, free of cables and of a controlling computer, enabling interaction between the user and digital data, and between the user and holographic images embedded in the real world [15, 16]. Figure 1 shows the HoloLens appearance.

Fig. 1 HoloLens appearance

As a mixed reality (MR) device, HoloLens offers unique human-computer interaction modes, namely gaze, gesture, and voice [17,18,19,20], GGV for short. Working together, the three modes let the user operate freely in the MR environment. Figure 2 illustrates GGV.

Fig. 2 GGV illustration

HoloLens blends holographic scenes with the real world and scales virtual objects as if they were real, so the user perceives the holographic scenes as part of the environment. Figure 3 shows the details of the HoloLens hardware. HoloLens is equipped with an inertial measurement unit, an ambient light sensor, and four environment-sensing cameras along with a depth-sensing camera, which portray and scan the current space in real time, recognizing planes, walls, desks, and other larger objects. HoloLens also contains a custom holographic processing unit for real-time scanning, massive data processing, tracking, and spatial anchoring.

Fig. 3 Details of HoloLens hardware

Currently, face detection on HoloLens can only be performed by remotely calling the Microsoft Azure Face API; the resulting speed is low and the detection accuracy limited, which makes practical application inconvenient.

3 Classical Viola-Jones face detection method

The classical Viola-Jones algorithm combines statistical models of shape and edge, facial features, and template matching with AdaBoost. First, Haar-like rectangle features describe the facial structure, and feature evaluation is accelerated with the integral image [21,22,23,24,25,26]; then the AdaBoost algorithm [27,28,29] trains weak classifiers, combines them into strong classifiers, and chains these into a screening cascade classifier [30, 31] that rejects non-face windows and improves accuracy.

3.1 Haar-like rectangle feature

As shown in Fig. 4, Haar features fall into four categories: edge features (two adjacent rectangles), line features (three adjacent rectangles), center features, and diagonal features (four adjacent rectangles), which are combined into feature templates.

Fig. 4 Characteristic matrix

The feature template contains white and black rectangles, and the template's feature value is defined as the sum of the pixels in the white rectangles minus the sum of the pixels in the black rectangles. A Haar feature value therefore reflects local grayscale changes in the image. Some facial structures can be described simply by rectangle features; as shown in Fig. 5, the eyes are darker than the cheeks, the sides of the nose are darker than the bridge of the nose, and the mouth is darker than its surroundings. Judging by features is both more robust and faster than judging by raw pixels. However, rectangle features respond only to simple structures such as edges and line segments, so they can only describe structures at specific orientations (horizontal, vertical, diagonal).

Fig. 5 Facial feature
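To make the feature definition concrete, the following is a minimal C# sketch (C# being the implementation language used with Unity in Sect. 6) of a two-rectangle edge feature evaluated naively, without the integral image of Sect. 3.2; the window size and coordinates are illustrative only.

```csharp
using System;

static class HaarFeatureDemo
{
    // Naive two-rectangle (edge) feature: the white rectangle sits above the
    // black one, and the feature value is (sum of white pixels) - (sum of black pixels).
    // gray is a [height, width] grayscale image; (x, y) is the feature's top-left
    // corner; w and h are the size of EACH of the two stacked rectangles.
    static int EdgeFeature(byte[,] gray, int x, int y, int w, int h)
    {
        int white = 0, black = 0;
        for (int row = 0; row < h; row++)
            for (int col = 0; col < w; col++)
            {
                white += gray[y + row, x + col];       // upper (white) rectangle
                black += gray[y + h + row, x + col];   // lower (black) rectangle
            }
        return white - black;
    }

    static void Main()
    {
        var img = new byte[24, 24];   // dummy 24x24 detection window
        Console.WriteLine(EdgeFeature(img, 4, 4, 8, 4));
    }
}
```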

3.2 Integral image

To compute a Haar-like feature, all pixels in each rectangular region must be summed. The Viola-Jones face detection algorithm therefore uses the integral image: the integral image value at any point equals the sum of all pixels above and to the left of that point. As shown in Fig. 6, once the integral image has been built in a single traversal, the pixel sum of any region in the image can be obtained directly, which greatly improves the efficiency of feature evaluation. Let SAT(x, y) be the integral image value at point (x, y) and I(x', y') the gray value of any pixel (x', y'); then:

$$ \mathrm{SAT}\left(x,y\right)=\sum \limits_{x'\le x,\,y'\le y}I\left(x',y'\right) $$
(1)
Fig. 6 Integral image

The following recurrence is obtained by traversing the image from left to right and from top to bottom:

$$ \mathrm{SAT}\left(x,y\right)=\mathrm{SAT}\left(x,y-1\right)+\mathrm{SAT}\left(x-1,y\right)+I\left(x,y\right)-\mathrm{SAT}\left(x-1,y-1\right) $$
(2)
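A minimal C# sketch of Eq. (2): the table is filled in a single left-to-right, top-to-bottom pass, with out-of-range neighbors treated as zero.

```csharp
using System;

static class IntegralImage
{
    // Builds the summed-area table of Eq. (2) in one pass. sat[y, x] holds
    // the sum of all pixels in the rectangle from (0, 0) to (x, y) inclusive;
    // neighbors outside the image contribute 0.
    public static long[,] Build(byte[,] gray)
    {
        int h = gray.GetLength(0), w = gray.GetLength(1);
        var sat = new long[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                long above = y > 0 ? sat[y - 1, x] : 0;
                long left  = x > 0 ? sat[y, x - 1] : 0;
                long diag  = (x > 0 && y > 0) ? sat[y - 1, x - 1] : 0;
                sat[y, x] = above + left + gray[y, x] - diag;
            }
        return sat;
    }
}
```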

In the same way, the pixel sum of any rectangular region in the image can be obtained. As shown in Fig. 7, let the upper-left corner of the rectangle be (x, y) and its width and height be w and h, written rectangle (x, y, w, h). Its pixel sum follows from four integral image lookups:

$$ {\displaystyle \begin{array}{l}\mathrm{Sum}\left(x,y,w,h\right)=\mathrm{SAT}\left(x,y\right)+\mathrm{SAT}\left(x+w,y+h\right)\\ {}-\mathrm{SAT}\left(x,y+h\right)-\mathrm{SAT}\left(x+w,y\right)\end{array}} $$
(3)
Fig. 7 Integral image of a certain rectangular region
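The four-lookup rectangle sum of Eq. (3) can be sketched as follows. One bookkeeping difference: Eq. (3) indexes corner points of the rectangle, while this sketch indexes pixels inclusively, so each lookup is shifted by one; an index of −1 denotes an empty sum.

```csharp
static class RectangleSum
{
    // Sum of the w x h rectangle whose top-left pixel is (x, y), computed with
    // four lookups into the table from IntegralImage.Build (Eq. (3)).
    public static long Sum(long[,] sat, int x, int y, int w, int h)
    {
        long At(int col, int row) => (col < 0 || row < 0) ? 0 : sat[row, col];

        return At(x + w - 1, y + h - 1) - At(x - 1, y + h - 1)
             - At(x + w - 1, y - 1)     + At(x - 1, y - 1);
    }
}
```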

3.3 AdaBoost algorithm

AdaBoost (Adaptive Boosting) performs feature selection and classifier training at the same time. It is an iterative algorithm whose core idea is to train different weak classifiers on the same training set and then combine them into a stronger final classifier (a strong classifier). The algorithm works by changing the data distribution: after each round, the weight of every sample is updated according to whether it was classified correctly and to the overall accuracy of the previous round, and the reweighted data set is passed to the next weak learner. Finally, the weak classifiers obtained in all rounds are fused into the final decision classifier. In this way AdaBoost discards uninformative features and concentrates training effort on the decisive ones.

An input window is judged to contain a face when its feature value exceeds the classifier's threshold, so training the optimal weak classifier amounts to finding the appropriate threshold for that classifier.
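The following hedged C# sketch shows one form of this training loop, using decision stumps over a single feature column; a real Viola-Jones trainer searches every Haar feature over every training window, so treat this purely as an illustration of the reweighting idea.

```csharp
using System;

static class AdaBoostSketch
{
    // A weak classifier: predict +1 when polarity * (value - threshold) >= 0.
    struct Stump { public double Threshold; public int Polarity; public double Alpha; }

    // x[i] is a feature value, y[i] is +1 (face) or -1 (non-face).
    static Stump[] Train(double[] x, int[] y, int rounds)
    {
        int n = x.Length;
        var w = new double[n];
        for (int i = 0; i < n; i++) w[i] = 1.0 / n;          // uniform start

        var stumps = new Stump[rounds];
        for (int t = 0; t < rounds; t++)
        {
            // Pick the threshold/polarity pair with the lowest weighted error.
            var best = new Stump(); double bestErr = double.MaxValue;
            foreach (double thr in x)
                for (int pol = -1; pol <= 1; pol += 2)
                {
                    double err = 0;
                    for (int i = 0; i < n; i++)
                    {
                        int pred = pol * (x[i] - thr) >= 0 ? 1 : -1;
                        if (pred != y[i]) err += w[i];
                    }
                    if (err < bestErr)
                    { bestErr = err; best = new Stump { Threshold = thr, Polarity = pol }; }
                }

            // Classifier vote: low error => large weight in the strong classifier.
            best.Alpha = 0.5 * Math.Log((1 - bestErr) / Math.Max(bestErr, 1e-10));
            stumps[t] = best;

            // Reweight: boost misclassified samples, then renormalize.
            double sum = 0;
            for (int i = 0; i < n; i++)
            {
                int pred = best.Polarity * (x[i] - best.Threshold) >= 0 ? 1 : -1;
                w[i] *= Math.Exp(-best.Alpha * y[i] * pred);
                sum += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= sum;
        }
        return stumps;
    }
}
```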

In ordinary images, face regions occupy only a small part of the whole image. If every local window had to be evaluated against all features, the computation would be extremely heavy and time-consuming. To save computing time, obvious non-faces should be rejected early so that effort is spent only on promising candidate windows.

In the cascade classifier architecture, each stage contains one strong classifier. All rectangle features are divided into groups, one group per stage of the cascade. Each stage decides whether the input region may be a face; if not, the region is discarded immediately. Only regions judged as possible faces are passed to the next stage, where they are examined by progressively more complex classifiers. The flowchart is shown in Fig. 8, and a minimal sketch of the early-reject loop follows the figure.

Fig. 8 Filter process
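In this sketch, each stage is modelled as a delegate standing in for a trained strong classifier; the early return is where the cascade gets its speed.

```csharp
using System;

static class CascadeSketch
{
    // Early-reject cascade evaluation: a window at (x, y) survives only if
    // every stage's strong classifier accepts it. Most non-face windows fail
    // in the first stages and are never examined further.
    public static bool IsFace(Func<long[,], int, int, bool>[] stages,
                              long[,] sat, int x, int y)
    {
        foreach (var stage in stages)
            if (!stage(sat, x, y))
                return false;   // rejected: stop immediately
        return true;            // accepted by every stage: report a face
    }
}
```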

4 Algorithm improvement

4.1 Haar-like rectangle feature expansion

Building on the basic features of Fig. 4, Lienhart further expanded the Haar-like rectangle bank by adding 45°-rotated rectangle features [32] (Fig. 9).

Fig. 9 Expanded matrix

For a 45°-rotated rectangle, we define RSAT(x, y) as the integral image value over the 45° wedge extending to the upper left and lower left of point (x, y), i.e., the shaded region shown in Fig. 10, where I(x', y') is the gray value of any pixel (x', y') within that region.

Fig. 10 Integral image example

According to the definition of integral image, then:

$$ \mathrm{RSAT}\left(x,y\right)=\sum \limits_{x'\le x,\,x'\le x-\left|y-y'\right|}I\left(x',y'\right) $$
(4)

Similarly, traversing from left to right and from top to bottom yields the following recursion, so the whole shaded region of the figure can be accumulated in a single pass:

$$ {\displaystyle \begin{array}{l}\mathrm{RSAT}\left(x,y\right)=\mathrm{RSAT}\left(x-1,y-1\right)+\mathrm{RSAT}\left(x-1,y+1\right)\\ {}-\mathrm{RSAT}\left(x-2,y\right)+I\left(x,y\right)\end{array}} $$
(5)
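To make the wedge definition of Eq. (4) concrete, the following is a deliberately brute-force C# sketch that evaluates RSAT at a single point directly from the definition; the recursion of Eq. (5) computes the same table in one pass and is what an implementation would actually use.

```csharp
using System;

static class RotatedIntegral
{
    // Direct (unoptimized) evaluation of Eq. (4): the rotated summed-area
    // table at (x, y) accumulates every pixel (x', y') lying in the 45-degree
    // wedge defined by x' <= x and x' <= x - |y - y'|.
    static long RsatAt(byte[,] gray, int x, int y)
    {
        int h = gray.GetLength(0), w = gray.GetLength(1);
        long sum = 0;
        for (int yp = 0; yp < h; yp++)
            for (int xp = 0; xp < w; xp++)
                if (xp <= x && xp <= x - Math.Abs(y - yp))
                    sum += gray[yp, xp];
        return sum;
    }
}
```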

As shown in Fig. 11, let a 45°-rotated rectangle be written (x, y, w, h, 45°), where (x, y) is its topmost vertex and w and h measure the horizontal and vertical extent toward the rightmost vertex, respectively. The pixel sum Sum(x, y, w, h, 45°) of the rotated rectangle can then be derived analogously; see Eq. (6).

Fig. 11 Arbitrary integral image example

$$ {\displaystyle \begin{array}{l}\mathrm{Sum}\left(x,y,w,h,{45}^{{}^{\circ}}\right)=\mathrm{RSAT}\left(x+w,y+w\right)+\mathrm{RSAT}\left(x-h,y+h\right)\\ {}-\mathrm{RSAT}\left(x,y\right)-\mathrm{RSAT}\left(x+w-h,y+w+h\right)\end{array}} $$
(6)

With the above formula, the integral image of a rotated rectangle can be computed rapidly, so face detection feature values can be evaluated for rectangles in different orientations.

4.2 Acceleration by two-dimensional convolution separation

The algorithm proposed in this paper screens out candidate rectangles whose edge density exceeds a threshold. The image is first converted to grayscale; next, the vertical and horizontal gradients are computed with the two-dimensional convolution separated into two one-dimensional passes; the Sobel operator then extracts the image edges, from which an integral image of edge density is built so that the edge density of any rectangle can be computed later. The procedure is detailed in Fig. 12.

Fig. 12 Acquisition of image edge

For an M × N image and a P × Q filter, direct convolution requires MNPQ multiply-accumulate operations. When the filter is separated into two one-dimensional passes, the first and second passes cost MNP and MNQ operations respectively, i.e., MN(P + Q) in total, a speed-up of PQ/(P + Q). With a 3 × 3 filter, convolution separation therefore speeds edge detection up by a factor of 1.5.
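As an illustration, a separable horizontal Sobel pass might look like the sketch below: the 3 × 3 kernel factors into a 1 × 3 row pass and a 3 × 1 column pass, exactly the P + Q versus PQ saving described above. Border pixels are left at zero for brevity.

```csharp
static class SeparableSobel
{
    // Horizontal Sobel Gx = [1 2 1]^T (outer product) [-1 0 1]: instead of one
    // 3x3 convolution (9 multiply-adds per pixel), run a 1x3 row pass followed
    // by a 3x1 column pass (3 + 3 = 6 multiply-adds per pixel).
    public static int[,] GradientX(byte[,] gray)
    {
        int h = gray.GetLength(0), w = gray.GetLength(1);
        var tmp = new int[h, w];
        var outp = new int[h, w];

        // Pass 1: each row convolved with [-1 0 1].
        for (int y = 0; y < h; y++)
            for (int x = 1; x < w - 1; x++)
                tmp[y, x] = gray[y, x + 1] - gray[y, x - 1];

        // Pass 2: each column convolved with [1 2 1]^T.
        for (int y = 1; y < h - 1; y++)
            for (int x = 0; x < w; x++)
                outp[y, x] = tmp[y - 1, x] + 2 * tmp[y, x] + tmp[y + 1, x];

        return outp;
    }
}
```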

4.3 Acceleration by image re-sampling technique

The image captured through the PhotoCapture API of HoloLens measures 2048 × 1152 pixels, which yields a huge number of candidate feature rectangles and is therefore unsuitable for direct detection. Bilinear re-sampling is used to derive a low-resolution image from the high-resolution capture, halving the image in each dimension; with a quarter of the pixels remaining, detection is theoretically accelerated four-fold.
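A sketch of the half-resolution re-sampling follows. At a scale factor of exactly 1/2 with block-centered sample points, bilinear interpolation reduces to averaging each 2 × 2 block, which keeps the sketch short.

```csharp
static class HalfSample
{
    // Bilinear resampling at exactly half resolution degenerates to averaging
    // each 2x2 block, e.g. 2048x1152 -> 1024x576: a quarter of the pixels,
    // hence the theoretical four-fold detection speed-up.
    public static byte[,] Downsample(byte[,] gray)
    {
        int h = gray.GetLength(0) / 2, w = gray.GetLength(1) / 2;
        var small = new byte[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                small[y, x] = (byte)((gray[2 * y, 2 * x] + gray[2 * y, 2 * x + 1]
                                    + gray[2 * y + 1, 2 * x] + gray[2 * y + 1, 2 * x + 1]) / 4);
        return small;
    }
}
```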

As a result, when convolution separation and image re-sampling work together, the theoretical detection speed with a 3 × 3 filter increases six-fold (1.5 × 4).

5 Realization of face detection based on HoloLens

The algorithm proceeds as follows: capture an image with the HoloLens device; shrink it by re-sampling; compute the gray value of each pixel; accelerate edge detection with separable convolution; decide by template comparison whether the detected rectangle is rotated; compute the integral image value of the detected rectangle; and accumulate feature values stage by stage following the Viola-Jones cascade, comparing the running sum with each threshold. If the feature value falls below the threshold, the window is judged non-face and the stage returns false; if it exceeds the threshold, evaluation continues at the next stage, and a window that passes every stage of the cascade is judged to be a face and returns true. The complete procedure is shown in Fig. 13, followed below by a hedged end-to-end sketch.

Fig. 13 Realization procedures of face detection based on HoloLens
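The end-to-end sketch promised above chains the earlier fragments together; CascadeAccepts and the edgeDensityMin pre-screening threshold are hypothetical stand-ins for the trained cascade and its tuning, not the paper's actual trained model.

```csharp
using System;
using System.Collections.Generic;

static class DetectionPipeline
{
    public struct Rect { public int X, Y, W, H; }

    // Hypothetical stand-in for the trained screening cascade of Sect. 3.3:
    // always rejects here; replace with the real trained stages.
    static bool CascadeAccepts(long[,] sat, Rect r) => false;

    // Flow of Fig. 13: half-sample (Sect. 4.3), separable Sobel edge pass
    // (Sect. 4.2), integral images for pixel sums and edge density, then a
    // sliding window pre-screened on edge density before the cascade runs.
    public static List<Rect> DetectFaces(byte[,] gray, int window = 24,
                                         int step = 4, long edgeDensityMin = 1000)
    {
        byte[,] half = HalfSample.Downsample(gray);
        int[,] gx = SeparableSobel.GradientX(half);

        // Integral image of absolute gradients => edge density of any
        // rectangle in four lookups, exactly as for plain pixel sums.
        int h = half.GetLength(0), w = half.GetLength(1);
        var absEdges = new byte[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                absEdges[y, x] = (byte)Math.Min(255, Math.Abs(gx[y, x]));
        long[,] satPix = IntegralImage.Build(half);
        long[,] satEdge = IntegralImage.Build(absEdges);

        var faces = new List<Rect>();
        for (int y = 0; y + window <= h; y += step)
            for (int x = 0; x + window <= w; x += step)
            {
                if (RectangleSum.Sum(satEdge, x, y, window, window) < edgeDensityMin)
                    continue;   // cheap edge-density reject before the cascade
                var r = new Rect { X = x, Y = y, W = window, H = window };
                if (CascadeAccepts(satPix, r)) faces.Add(r);
            }
        return faces;
    }
}
```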

Figure 14 shows a face in the scene detected by HoloLens running the proposed algorithm, together with the displayed detection information; HoloLens marks the detected face model with a 3D mesh.

Fig. 14 Detection example

After the above detection, the face is located as (x, y, width, height) in the 2D image. The center point of the rectangle is then computed, a ray is cast through it into the scene, and the ray's collision with the environment yields the 3D coordinates of the face in the real world. In this way the 2D image coordinates are converted from camera coordinates into world coordinates in space, as detailed in Fig. 15; a minimal Unity sketch follows the figure.

Fig. 15 3D face detection procedures
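A minimal Unity C# sketch of this 2D-to-3D step is shown below, casting a ray through the rectangle center against the spatial-mapping mesh. The layer mask is an assumption, and for simplicity the sketch uses the render camera; a production version would map pixel coordinates through the photo camera's own matrices (e.g., PhotoCaptureFrame.TryGetCameraToWorldMatrix), since the photo and render cameras differ.

```csharp
using UnityEngine;

public class FaceLocator : MonoBehaviour
{
    // Assumed layer mask for the spatial-mapping mesh HoloLens builds of the room.
    public LayerMask spatialMeshMask;

    // Turns a detected 2D face rectangle into a 3D world position: the
    // rectangle center is cast as a ray from the camera and intersected
    // with the room's spatial-mapping mesh.
    public bool TryLocate(Rect face2D, out Vector3 worldPos)
    {
        // Center of the face rectangle in screen/pixel coordinates.
        var centre = new Vector3(face2D.x + face2D.width / 2f,
                                 face2D.y + face2D.height / 2f, 0f);

        Ray ray = Camera.main.ScreenPointToRay(centre);
        if (Physics.Raycast(ray, out RaycastHit hit, 10f, spatialMeshMask))
        {
            worldPos = hit.point;   // 3D coordinates of the face in the room
            return true;
        }
        worldPos = Vector3.zero;
        return false;
    }
}
```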

Figure 16 shows a 3D face detection result, visualized as a wireframe cube in 3D space.

Fig. 16 Example of space coordinates

6 Results and discussion

The proposed algorithm is evaluated on the commonly used ORL, Yale, AR, Stanford, JAFFE, and CIT face image databases, comparing the results before and after acceleration and optimization as well as against the Microsoft Azure Face API. The databases cover feature variations in expression, illumination with expression and glasses, illumination with expression and scarf occlusion, face calibration, expressions of Asian women, and color illumination. Image sizes range from 60 × 60 to 179 × 118 pixels, and the number of face images per database from 165 to 2600. The analysis and comparison focus on the number of missed faces and on detection time. For the experiment, a face detection program based on the proposed algorithm was first developed with the Unity game engine and Visual Studio C# scripts, then deployed with its data to the HoloLens headset over the LAN; the experimenter then wore the HoloLens for live face detection. The experimental results are shown in Table 1. Since the Microsoft Azure Face API accepts a detection request only about once every 3 s on average, the total time of network detection, once the request interval is included, is far longer than that of local detection.

Table 1 Experimental results

The experimental results show the following:

1. Acceleration and optimization of the algorithm do not affect detection accuracy while speeding detection up: on the six databases, detection time is shortened by factors of 3.5, 3.8, 4.0, 5.1, 3.7, and 3.9, respectively, a 4.0-fold average.

2. On a database with massive data, such as the AR face image database, and on a database with scaled-down images, such as the Yale face image database, Microsoft Azure Face API network detection takes far longer, 464 and 221 times the local detection time, respectively.

3. Even disregarding the Face API request interval, local detection is generally 4.1 times as fast as network detection, rising to 9.8 times for large image sets and 20 times for scaled-down images.

4. Moreover, the miss rate of the proposed algorithm is lower than that of Microsoft Azure Face API network detection, and its detection accuracy is higher by 12% on average.

7 Conclusion

In recent years, with the rapid progress of science and technology, face detection has moved from science fiction into everyday life on all fronts, in applications such as identity verification, access control, personal computer unlocking, and retail payment. This paper combines a face detection algorithm with the HoloLens holographic glasses by improving the classical Viola-Jones algorithm through Haar-like rectangle feature expansion, two-dimensional convolution separation, and image re-sampling. Building on the improved Viola-Jones algorithm, face detection runs locally on HoloLens with enhanced speed and accuracy, and the HoloLens depth camera is used for 3D detection and spatial localization of the face. The experimental results show that, compared with the Face API, local detection is generally 4.1 times as fast as network detection, rising to 9.8 times for large image sets and 20 times for scaled-down images.

HoloLens-based face detection, which supplements and superimposes real and virtual information, not only improves the user experience but also contributes to a smarter and easier life, and it can be applied in fields such as social interaction, public security, and business management.