1 Introduction

Due to the growing demand in today’s entertainment industry, such as Hollywood blockbuster movies, high-quality facial capture has become a prominent topic in both the industrial and academic communities. Numerous technologies have been developed and have achieved remarkable success. A typical digital head asset comprises high-fidelity geometry, variable texture maps with high resolution, and other individual components like beards and teeth. While many methods focus on automating the capture of face geometry and appearance maps, only a few address an important characteristic: facial hair, including beards and eyebrows. Hair reconstruction is considered one of the most challenging research areas due to the micro-scale structure and significant self-occlusions. Moreover, capturing subjects with beards or facial hair presents additional challenges as it can hinder the tracking of the underlying face geometry, resulting in a distorted surface. On one hand, traditional pipelines for facial capture rely on the local plane assumption, which tends to exclude hair structures. The reconstructed geometry often appears smooth and lacks small-scale details, not to mention the accurate representation of protruding hair strands. On the other hand, algorithms specifically designed for hair reconstruction focus on the statistical properties of hair but struggle to capture strand-accurate hair fibers while accurately separating them from the skin regions. Integrating a strand-accurate facial hair capture within the scanning pipeline is essential. It not only aids in the reconstruction of the global face geometry but also enhances the overall realism, enabling the creation of more lifelike digital human avatars with improved visual fidelity and believability.

Fig. 1 We present a strand-accurate facial hair reconstruction and tracking method in both active and passive illumination settings

In this paper, we introduce a method targeting the modeling of facial hair in facial performance capture. We follow the strand-based hair modeling paradigm and present a method to reconstruct the detailed hair structure from multi-view imagery. Our method is compatible with standard multi-view photogrammetry systems regardless of the illumination settings, as shown in Fig. 1. A specialized line-plane map estimation method is designed to simultaneously reconstruct the rough facial skin and the hair point cloud. The facial hair strands are then extracted using multi-view segment matching and further refined. Given a performance capture sequence, we reconstruct the reference facial hair in the initial neutral frame and propagate it to the subsequent frames using a hybrid propagation scheme. To achieve this, we propose a hair strand deformation optimization method that considers both the shape and motion of facial hair between two time frames. The underlying face mesh of each frame is then refined using the captured facial hair roots. By addressing the challenges associated with facial hair capture and incorporating strand-accurate reconstruction methods, we achieve a significant advancement in high-quality facial capture. This advancement contributes to the creation of more immersive and realistic digital human avatars, meeting the increasing demands of the entertainment industry.

In summary, the main contributions of this paper include:

  1.

    A 3D line-plane multi-view stereo method for facial hair cloud generation that outperforms previous methods in both accuracy and completeness;

  2.

    A strand-accurate facial hair registration method implemented as a space-time optimization;

  3.

    A facial hair reconstruction and tracking method compatible with both active and passive illumination-based multi-view camera systems.

2 Related work

Facial Performance Capture. Over the last decades, digital human capture has gained considerable attention in the computer vision community. Many excellent works have been proposed, ranging from static facial geometry reconstruction [1, 2] and skin appearance acquisition [3, 4] to dynamic performance capture [5,6,7]. State-of-the-art facial reconstruction pipelines are mostly built with multiple calibrated and synchronized cameras, making multi-view stereo (MVS) algorithms [8, 9] the first choice for verifiable and accurate geometry reconstruction. Recently, advances in deep learning have made it possible to infer facial information [10, 11] from just a few or even a single “in-the-wild” image. Lattas et al. [12] combined 3D Morphable Model fitting with image translation using Generative Adversarial Networks; the proposed method produces high-quality facial geometry as well as high-resolution reflectance maps from an unconstrained face image. Gafni et al. [13] first introduced the implicit representation to dynamic 4D facial reconstruction. They presented an approach to learn dynamic neural radiance fields that represent the changing surface and appearance of a human face from a monocular video sequence.

Scalp Hair Reconstruction. Besides full head reconstruction, hair modeling is a significant and challenging task in virtual human digitization. Researchers have proposed various techniques to capture and digitize hairstyles, such as the system described by Paris et al. [14, 15] that uses structured light and multiple cameras. Luo et al. [16] improved hair capture for complex hairstyles using a novel graph data structure and global optimization. Xu et al. [17] presented a dynamic hair capture system that predicts hair strands from 2D motion paths and optimizes the static hair strands in a spatiotemporally coherent way. Zhang et al. [18] introduced a data-driven approach to reconstruct hair from four sparse views. Most work has focused on capturing the overall shape of hair wisps or the hairstyle [19] due to the wide range of shape variations and severe self-occlusion. Nam et al. [20, 21] proposed a line-based patch MVS with a slanted line assumption to achieve strand-accurate hair capture. Recently, deep learning-based methods have been proposed to recover hair geometry from single-view images [22, 23] and dynamic hair from monocular videos [24]. The key idea is a pixel-wise hair feature parameterization using position and curvature; specially designed convolutional neural networks are then trained to estimate hair strands from unconstrained images. Sun et al. [25] developed a hair inverse rendering framework to capture hair geometry and appearance from multi-view images using controllable lighting.

Fig. 2 Overview of the proposed strand-based facial hair reconstruction (top) and tracking (bottom) pipeline. In the reconstruction stage, the algorithm estimates the initial 3D information of the skin mesh and facial hairs separately. Hair strands are then reconstructed and refined, and their roots are used to refine the skin mesh. In the tracking stage, we introduce a space-time optimization for strand deformation between two frames and extend it to strand tracking across a sequence using hybrid propagation

Facial Hair Capture. While there has been a significant amount of work on scalp hair capture, few papers have specifically focused on reconstructing facial hair, such as beards, mustaches, and eyebrows. Fyffe et al. [26] derived a specialized stereo matching cost function to recover particle-based facial hair. Beeler et al. [27] proposed to reconstruct the face geometry and sparse facial hair in a coupled fashion; the method designed a custom pipeline to detect 2D hair segments and reconstruct the 3D fibers of each hair segment. Recently, Winberg et al. [28] extended this work to track facial hair in a facial performance capture pipeline. The proposed method first reconstructs a neutral facial hairstyle using the approach in [27] and then tracks the reference hairstyle throughout the following frames. Although impressive, the approach is limited by the performance of the static facial hair reconstruction, and challenges remain for thick facial hair regions and flying strands. There have also been attempts at deep-learning-based facial hair editing. Olszewski et al. [29] proposed an interesting application that can interactively synthesize realistic facial hair, such as beards, directly in an image using generative models. Xiao et al. [30] captured eyelashes based on a specifically designed fluorescent labeling system and built a dataset for eyelash matting.

3 Strand-based facial hair performance capture

To model facial hair in facial performance capture, we present a method that recovers the 3D structures of facial hair, characterized by multiple strands, and the time-varying motion of each strand. Our method can be easily adapted to work with standard multi-view facial performance capture systems [6, 7, 31] in either active or passive illumination settings. The overall pipeline is depicted in Fig. 2 and consists of two main components: hair strand reconstruction and per-strand motion tracking. We elaborate on the proposed solution in the following subsections.

3.1 Facial hair reconstruction

This subsection first describes our method for strand-accurate facial hair reconstruction. We initiate the process by integrating the reconstruction of rough skin and hair point clouds through a unified Multi-View Stereo (MVS) solution. This step facilitates the acquisition of a comprehensive 3D representation of the face, encompassing both the skin and hair regions. Subsequently, we extract hair strands using a segment matching technique, incorporating initial depth and direction estimations. Finally, we refine the skin mesh, enhancing both the accuracy and alignment of the underlying facial surface in relation to the reconstructed hair strands.

3.1.1 3D line-plane map estimation

We first generate the per-frame 3D facial geometries using a coarse-to-fine multi-view stereo [7]. However, the resulting raw mesh exhibits severe shrink-warp surfaces over facial hair regions due to the local plane assumption. To address this issue, we draw inspiration from the line-based MVS (LMVS) [20] and develop a method to simultaneously estimate 3D line parameters for facial hair pixels. To ensure uniformity, we compute a 3D line-plane map \(\Pi _p({\textbf {d}},z)\) that stores the depth value at p together with the local 3D direction/normal. Hair pixels are represented with a 3D line \(\Pi _p=\mathcal {L}_p\), and skin surface pixels with the local 3D plane parameters \(\Pi _p=\pi _p\). We pre-compute hair confidence and orientation maps using Gabor filters at the raw image resolution, which are then resized to each image pyramid level using nearest-neighbor interpolation. The cost function \(\zeta (p,\Pi _p)\) varies depending on the pixel mask.
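
As an illustration of this pre-processing step, the following sketch shows how per-pixel hair confidence and orientation maps can be computed from a bank of oriented Gabor filters. The kernel parameters, the variance-based confidence measure, and the function name are illustrative assumptions rather than the exact settings of our pipeline.

```python
import cv2
import numpy as np

def hair_confidence_orientation(gray, n_orients=32, ksize=17,
                                sigma=2.0, lambd=4.0, gamma=0.5):
    """Per-pixel hair confidence and orientation from a bank of Gabor filters.

    Returns:
        conf:   HxW map, high where a strongly oriented structure exists
        orient: HxW dominant filter orientation in radians, in [0, pi)
    """
    gray = gray.astype(np.float32) / 255.0
    thetas = np.linspace(0.0, np.pi, n_orients, endpoint=False)
    responses = []
    for theta in thetas:
        kern = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kern))
    responses = np.stack(responses, axis=0)          # (n_orients, H, W)

    orient = thetas[np.argmax(responses, axis=0)]    # dominant orientation per pixel
    conf = responses.var(axis=0)                     # variance across orientations
    conf /= conf.max() + 1e-8                        # normalize to [0, 1]
    return conf, orient
```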

For a hair pixel, the cost function \(\zeta (p,\mathcal {L}_p)\) consists of a geometric and a photometric cost term, similar to LMVS [20]:

$$\begin{aligned} \zeta (p,\mathcal {L}_p)&= \lambda _g E_{\textrm{geo}} + \lambda _p E_{\textrm{pho}} \nonumber \\ E_{\textrm{geo}}&= \sum _{c\in \mathcal {C}}\sum _{x\in \tilde{S}_p} \mathbb {C}^c(x^{'}) \parallel \mathbb {O}^c(x^{'}) - l^c_p \parallel \bigg / \sum _{x\in S_p^c}\mathbb {C}^c(x^{'}) \nonumber \\ E_{\textrm{pho}}&= \sum _{c\in \mathcal {C}}\sum _{x\in \tilde{S}_p} \mathbb {C}^c(x^{'}) \parallel \mathbb {I}^c(x^{'}) - \tilde{\mathbb {I}}(x) \parallel \bigg / \sum _{x\in S_p^c}\mathbb {C}^c(x^{'}) \nonumber \\ x^{'}&= H^c_{\mathcal {L}}\cdot x \end{aligned}$$
(1)

where \(\mathbb {C},\mathbb {O},\mathbb {I}\) are the confidence, orientation and grayscale image maps, respectively, \(H^c_{\mathcal {L}}\) is the line-induced homography related to the 3D line parameters, \(\mathcal {C}\) is the collection of neighbor views, and \(\tilde{S}_p\) represents the pixels on the reference view \(\tilde{c}\) sampled along the 3D line \(\mathcal {L}_p\).

For a skin surface pixel, we compute the intensity cost of sampled pixel correspondences across views. The plane-induced homography \(H_{\pi }\) is computed from the 3D plane parameters \(\pi _p\):

$$\begin{aligned} \zeta (p,\pi _p) = E_{\textrm{pho}} \quad , \quad x^{'}=H_\pi \cdot x \end{aligned}$$
(2)

During the spatial propagation step, we update the 3D parameters while considering the pixel masks. For the line-plane case, only the depth value is replaced. For the line-line case, the line parameter is replaced following LMVS. Finally, for the plane-plane case, we directly replace the plane parameter. After several iterations of spatial propagation and refinement, we obtain the 3D line-plane map for each view. Given the large skin regions, the hybrid 3D line-plane map takes full advantage of both the plane- and line-based parameterizations and improves the overall quality, particularly the 3D line estimation, as shown in Fig. 3.
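
The mask-aware propagation rule can be summarized in a short sketch. The hypothesis layout (a 3-vector that is a line direction for hair pixels or a plane normal for skin pixels, plus a depth), the scan pattern, and the `eval_cost` callback standing in for Eqs. (1)–(2) are simplifying assumptions of this illustration.

```python
import numpy as np

def propagate(hypo, mask, eval_cost, iters=3):
    """Mask-aware PatchMatch-style spatial propagation of the line-plane map.

    hypo: (H, W, 4) per-pixel hypothesis [dx, dy, dz, depth]; the 3-vector is a
          line direction for hair pixels and a plane normal for skin pixels.
    mask: (H, W) bool, True for hair pixels, False for skin pixels.
    eval_cost(p, h): scalar cost of hypothesis h at pixel p (Eq. 1 or Eq. 2).
    """
    H, W, _ = hypo.shape
    cost = np.array([[eval_cost((y, x), hypo[y, x]) for x in range(W)]
                     for y in range(H)])

    for it in range(iters):
        # alternate scan directions, as in standard PatchMatch
        ys = range(H) if it % 2 == 0 else range(H - 1, -1, -1)
        xs = range(W) if it % 2 == 0 else range(W - 1, -1, -1)
        dy, dx = (-1, -1) if it % 2 == 0 else (1, 1)
        for y in ys:
            for x in xs:
                for ny, nx in ((y + dy, x), (y, x + dx)):
                    if not (0 <= ny < H and 0 <= nx < W):
                        continue
                    cand = hypo[ny, nx].copy()
                    if mask[y, x] != mask[ny, nx]:
                        # line-plane case: only the depth value is propagated
                        cand = hypo[y, x].copy()
                        cand[3] = hypo[ny, nx][3]
                    # line-line and plane-plane cases propagate the full hypothesis
                    c = eval_cost((y, x), cand)
                    if c < cost[y, x]:
                        hypo[y, x], cost[y, x] = cand, c
    return hypo
```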

Fig. 3 Hair cloud generation comparison. a Reference image; b LMVS [20] reconstructs a very sparse point cloud; c Beeler et al. [27] fails in some regions, especially for flying strands; d our line-plane MVS performs better in both accuracy and completeness

3.1.2 Segments matching across views

Next, we split the output 3D line-plane map using facial hair masks. While the 3D line map \(\mathcal {L}_p\) provides plausible depth and direction estimations, the corresponding points are noisy and sporadically distributed, making it impossible to extract meaningful 3D strands. To refine the depth estimation, we resolve the depth matching matrix with the 2D connectivity information, similar to the method proposed by Beeler et al. [27].

For a given 2D hair segment in the reference view, we compute the centroid depth using the estimations from Sect. 3.1.1 and sample candidate depth values within a limited depth range. For every segment pixel \(p_i\) at every depth sample \(z_j\), we construct a new line \(\mathcal {L}_{ij}({\textbf {d}}, z_j)\) and compute the matching cost \(\zeta _{ij}(p_i, \mathcal {L}_{ij})\). To prevent drifting, we add an extra regularization term in the form of the Euclidean distance between \(\mathcal {L}_p\) and \(\mathcal {L}_{ij}\). Compared to the 3D segment matching method proposed by Beeler et al. [27], the proposed matching cost function involves both depth and direction, yielding more robust and accurate reconstructions. Additionally, the reference depth regularizer makes it possible to reconstruct flying strands.
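
A possible realization of this per-segment solve is sketched below, interpreting the depth matching matrix with 2D connectivity as a dynamic-programming problem along the segment. The solver choice, sample count, depth range, and weights are assumptions; `match_cost` stands in for the multi-view cost \(\zeta \).

```python
import numpy as np

def match_segment_depths(seg_pixels, z0, line_dirs, match_cost, z_init,
                         z_range=2.0, n_samples=32, lam_reg=0.1, lam_smooth=1.0):
    """Assign a depth to every pixel of one 2D hair segment.

    seg_pixels: pixel coordinates along the segment, in 2D-connected order.
    z0:         centroid depth from the line-plane map of Sect. 3.1.1.
    line_dirs:  per-pixel 3D line directions d from the line-plane map.
    match_cost(p, d, z): multi-view matching cost of the line L(d, z) at pixel p.
    z_init:     per-pixel reference depths used by the drift regularizer.
    """
    zs = np.linspace(z0 - z_range, z0 + z_range, n_samples)      # candidate depths
    N = len(seg_pixels)
    # data term: matching cost + regularization towards the reference line depth
    D = np.array([[match_cost(p, d, z) + lam_reg * abs(z - zr) for z in zs]
                  for p, d, zr in zip(seg_pixels, line_dirs, z_init)])

    # dynamic programming along the segment favours smooth depth changes
    acc = D.copy()
    back = np.zeros((N, n_samples), dtype=int)
    pair = lam_smooth * np.abs(zs[None, :] - zs[:, None])        # transition costs
    for i in range(1, N):
        total = acc[i - 1][:, None] + pair                       # (prev, cur)
        back[i] = total.argmin(axis=0)
        acc[i] += total.min(axis=0)

    # backtrack the optimal depth path
    idx = np.zeros(N, dtype=int)
    idx[-1] = acc[-1].argmin()
    for i in range(N - 2, -1, -1):
        idx[i] = back[i + 1, idx[i + 1]]
    return zs[idx]
```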

3.1.3 Strands generation and growth

After merging the 3D line maps from all views, we obtain a dense point cloud with 3D position and direction information, represented by \(\Gamma =\{P_{\textrm{pos}}, P_{\textrm{dir}}\}\). This point cloud is clustered using the mean-shift algorithm and connected to generate hair strands S through the forward Euler method. The output strands are defined as a sequence of 3D points with a regular separation distance of \(s=0.5\) mm. To refine and extend the strands, we revisit the multi-view confidence and orientation maps. Starting from the two end points of a strand segment, we define a three-dimensional growing cone along the direction \({\textbf {d}}\) with growth length s and opening angle \(\gamma = 15^{\circ }\). Each potential segment of the cone is measured using the multi-view cost function \(\zeta \), and we add the end point of the potential segment with the lowest score to the current strand. This process is repeated until the score exceeds the threshold \(\tau \).
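
The growing step can be sketched as follows. The number of candidate directions sampled inside the cone and the random sampling strategy are assumptions, and `score_fn` stands in for the multi-view cost \(\zeta \) evaluated on a candidate segment.

```python
import numpy as np

def grow_strand(strand, score_fn, s=0.5, gamma=np.deg2rad(15.0),
                n_cand=16, tau=1.0, max_steps=50):
    """Extend one strand (list of 3D points) from its last end point.

    score_fn(p): multi-view cost of a candidate end point p.
    s: growth length in mm; gamma: opening angle of the growing cone.
    """
    for _ in range(max_steps):
        p_end, p_prev = strand[-1], strand[-2]
        d = p_end - p_prev
        d /= np.linalg.norm(d)                          # current growth direction

        # build an orthonormal frame (d, u, v) around the growth direction
        u = np.cross(d, [0.0, 0.0, 1.0])
        if np.linalg.norm(u) < 1e-6:
            u = np.cross(d, [0.0, 1.0, 0.0])
        u /= np.linalg.norm(u)
        v = np.cross(d, u)

        # sample candidate end points inside the cone of opening angle gamma
        angles = np.random.uniform(0, 2 * np.pi, n_cand)
        tilts = np.random.uniform(0, gamma, n_cand)
        cands = [p_end + s * (np.cos(t) * d +
                              np.sin(t) * (np.cos(a) * u + np.sin(a) * v))
                 for a, t in zip(angles, tilts)]

        scores = [score_fn(c) for c in cands]
        best = int(np.argmin(scores))
        if scores[best] > tau:                          # stop once the cost exceeds tau
            break
        strand.append(cands[best])
    return strand
```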

3.2 Facial hair tracking

We now turn to the method for tracking the facial hair strands throughout the facial performance. After obtaining the reference facial hair shape, represented by multiple 3D poly-lines (strands), we seek to register the reference 3D hair to subsequent frames by estimating the time-varying 3D motion of each strand. The registration is formulated as a space-time optimization for a locally rigid deformation field that matches the motion and changes induced by facial expressions and head movement over time. We first describe the initialization, which computes a rough hair-free skin mesh performance capture. Then we describe the fully 3D hair registration, which optimizes a per-strand deformation field. Finally, we introduce the skin mesh refinement using the updated hair space information.

3.2.1 Skin mesh tracking

We utilize the method of Li et al. [7] for creating and tracking the hair-free skin mesh. To achieve this, we discard all 3D points in facial hair regions after applying the Line-Plane Multi-View Stereo (LPMVS) and register the template face mesh to the resulting hair-free facial point cloud. However, large empty areas may remain for dense hairstyles, which can affect the robustness of the skin mesh tracking. Therefore, in the initialization stage, we also apply the static facial hair reconstruction described in the previous section and add the estimated hair root points to the hair-free facial point cloud. For each hair strand, we identify the root point as the end closest to the landmark-driven deformed template mesh and add it to the point cloud. The combined 3D point cloud then serves as the target for nearest-point correspondences during the mesh deformation. In the subsequent tracking stage, the skin mesh points beneath hair regions are deformed only with the As-Rigid-As-Possible constraint.

3.2.2 Hair strand deformation optimization

We now describe the method to warp the facial hair shape from the reference time frame to the current time frame in alignment with the multi-view motion fields. The optimization considers flow maps and nearest hair maps adopted from the hair distance fields (HDFs) [28]. We encode the nearest hair map with the directional vector from each pixel position to the closest hair particle. Compared to the numerical distance representation in [28], the directional vector provides better guidance during the optimization and reduces the time cost. The flow map encodes the motion path of each pixel position between two time frames. To compute it, we first warp the image to the target frame based on the tracked face mesh and compute the optical flow between the warped image and the target image. We then further refine the dense pixel matching using a block-matching algorithm: each pixel x moved by the optical flow is matched to its best correspondence in the target image within a given search window. The match is guided by combining photometric consistency and the hair confidence map using a normalized cross-correlation cost function. The overall motion field is the concatenation of the warp vector, the optical flow field and the matching motion. Figure 4 shows the improvement of the complete motion field compared to pure optical flow.
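
A rough sketch of this composition is given below, using OpenCV's Farneback optical flow and template matching as stand-ins for the actual flow and block-matching components. The parameter values, the sparse sampling of refined pixels, and the confidence weighting are illustrative assumptions.

```python
import cv2
import numpy as np

def motion_field(ref_img, tgt_img, mesh_warp, hair_conf, win=7, search=5):
    """Composite motion field: mesh-induced warp + optical flow + NCC block matching.

    ref_img, tgt_img: uint8 grayscale images of the reference and target frames.
    mesh_warp: (H, W, 2) per-pixel displacement induced by the tracked face mesh.
    hair_conf: (H, W) hair confidence map of the target frame, in [0, 1].
    """
    H, W = ref_img.shape
    grid_y, grid_x = np.mgrid[0:H, 0:W]

    # 1. warp the reference image with the mesh-induced displacement
    map_x = (grid_x + mesh_warp[..., 0]).astype(np.float32)
    map_y = (grid_y + mesh_warp[..., 1]).astype(np.float32)
    warped = cv2.remap(ref_img, map_x, map_y, cv2.INTER_LINEAR)

    # 2. dense optical flow between the warped image and the target image
    flow = cv2.calcOpticalFlowFarneback(warped, tgt_img, None,
                                        0.5, 4, 21, 5, 7, 1.5, 0)

    # 3. refine by NCC block matching within a small search window,
    #    weighted by the hair confidence map (sparse sampling for brevity)
    refine = np.zeros_like(flow)
    r = win // 2
    for y in range(r + search, H - r - search, 2):
        for x in range(r + search, W - r - search, 2):
            xi = int(round(x + flow[y, x, 0]))
            yi = int(round(y + flow[y, x, 1]))
            if not (r + search <= xi < W - r - search and
                    r + search <= yi < H - r - search):
                continue
            patch = warped[y - r:y + r + 1, x - r:x + r + 1]
            area = tgt_img[yi - r - search:yi + r + search + 1,
                           xi - r - search:xi + r + search + 1]
            ncc = cv2.matchTemplate(area, patch, cv2.TM_CCOEFF_NORMED)
            ncc *= hair_conf[yi - search:yi + search + 1,
                             xi - search:xi + search + 1]
            dy, dx = np.unravel_index(int(ncc.argmax()), ncc.shape)
            refine[y, x] = (dx - search, dy - search)

    # total motion is the concatenation of warp, optical flow and matching motion
    return mesh_warp + flow + refine
```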

Fig. 4 Motion flow between the reference images (a, b) produces the warped images (b, f). We apply the optical flow (c, g) and the final motion flow (d, h) to produce the warped confidence map to show the better matching. The warped confidence map, shown in yellow, is overlaid with the target confidence map, shown in white. A zoomed-in view is presented in (c_1, d_1, g_1, h_1)

Notation. A hair strand is represented as a polyline \(\xi (i)=\{p_0,\ldots ,p_i,\ldots ,p_N\}\) with a mean separation of 0.5 mm, where the root point \(p_0\) is the end closest to the face mesh. Inspired by [17], we parameterize the strand with \(\xi (\phi ),\phi \in [0,1]\) and compute the real-valued \(\phi \) from the pixel positions covered by the 2D projected hair strand in camera view k. Concretely, given the projection matrix \(Q_k\), we compute the projected pixel positions \(\{(x,y) \in Q_k(\xi )\}\) and the parameter value \(\phi \) of each pixel. The corresponding 3D strand point of the parameterized strand can be written as the interpolation of the succeeding \(\lceil \phi \rceil \) and preceding \(\lfloor \phi \rfloor \) strand points:

$$\begin{aligned} p_{\phi } = (\lceil \phi \rceil -\phi ) p_{\lfloor \phi \rfloor } + (\phi -\lfloor \phi \rfloor )p_{\lceil \phi \rceil } \end{aligned}$$
(3)
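
In code, this interpolation is a short helper; the sketch below treats \(\phi \) directly as a real-valued index between neighboring polyline points and omits the normalization to [0, 1].

```python
import numpy as np

def strand_point(strand, phi):
    """Interpolated strand point p_phi for a real-valued parameter phi (Eq. 3)."""
    strand = np.asarray(strand)            # (N+1, 3) polyline points p_0 .. p_N
    lo = int(np.floor(phi))
    hi = min(lo + 1, len(strand) - 1)
    w = phi - lo                           # fractional part
    return (1.0 - w) * strand[lo] + w * strand[hi]
```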

The optimization is performed by minimizing an energy function given by the motion estimate and hair detection in each view. We predict the hair strand shape at the current frame, \(\hat{\xi }(i)=\{\hat{p}_i\}\), by minimizing the distance between the 2D projections and the target positions given by the motion estimate. We first solve for a rigid deformation of the hair strand with a 3D translation \({\textbf {t}}\) and a rotation \({\textbf {q}}\) parameterized by unit quaternions: \(\hat{p}_i = R({\textbf {q}})p_i + {\textbf {t}}\). Then we refine the individual hair strand points. The optimization problem is solved in the following steps:

  1.

    Initialize per-strand transformations \([{\textbf {q}}|{\textbf {t}}]\) from the tracked face mesh;

  2.

    Optimize \(({\textbf {q}}|{\textbf {t}})\) with motion field constraints and neighborhood regularizer:

    $$\begin{aligned} \min _{{\textbf {q}},{\textbf {t}}}&\sum _{k,\phi } {s_k(\xi ) \parallel Q_k(R({\textbf {q}})p_{\phi } + {\textbf {t}}) - F_k(x_{\phi },y_{\phi }) \parallel }^2 \nonumber \\ +&\sum _{j\in N_{ri}} \parallel (\hat{p}_0 - \hat{p}_{j0}) \parallel ^2 \end{aligned}$$
    (4)

    where \(F_k\) is the motion field, \(F_k(x_{\phi },y_{\phi })\) is the predicted position at the current frame, and \(s_k\) is a weight function that combines the motion confidence and strand visibility in camera view k. We add a neighborhood regularization term to preserve the overall hair shape during the deformation. Let \(N_{ri}\) be the set of neighboring hair strands within a search radius. Instead of defining the smoothness energy directly on the transformation \([{\textbf {q}}|{\textbf {t}}]\), we use the strand root points \(p_0\), which achieves a more robust result. A code sketch of this rigid stage is provided after this list.

  3.

    Iteratively optimize \(({\textbf {q}}|{\textbf {t}})\) with nearest hair map constraints

    $$\begin{aligned} \arg \min _{{\textbf {q}},{\textbf {t}}} \sum _{k}\sum _{\phi \in \Phi _k} {s_k(\xi ) \parallel Q_k(\hat{\xi }(\phi )) - H_k(\hat{x}_{\phi }, \hat{y}_{\phi }) \parallel }^2 \end{aligned}$$
    (5)

    where \(\hat{x}_j\) is the projection of the current hair strand position onto camera view k, \(\hat{x}_j=Q_k(\hat{p}_j)\). We solve the minimization problem iteratively by finding the closest point \(H_k(\hat{x}_j)\) for the current estimate of the strand transformation in each iteration. In practice, we found that 5 iterations are sufficient.

  4.

    Iteratively optimize individual points of each hair strand

    $$\begin{aligned} \arg \min _{\hat{p}} \sum _{k,j} ({s_k(\hat{p}_j) \parallel Q_k(\hat{p}_j) - H_k(Q_k(\hat{p}_{j})) \parallel }^2+E_{\textrm{reg}})\nonumber \\ \end{aligned}$$
    (6)

    Following the per-strand transformation, the final step directly refines the positions of individual points along the hair strand. To preserve the strand geometry during deformation, we add regularization terms on strand position, length and local Laplacian coordinates:

    $$\begin{aligned} E_{\textrm{reg}}&= \sum _j{\parallel \hat{p}_{j} \parallel }^2 + \sum _j{\parallel \hat{p}_{j+1}-\hat{p}_j \parallel }^2 \nonumber \\&\quad + \sum _j{\parallel 2\hat{p}_{j}-\hat{p}_{j-1}-\hat{p}_{j+1} \parallel }^2. \end{aligned}$$
    (7)
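
As referenced in step 2, the rigid stage can be prototyped with an off-the-shelf least-squares solver. The sketch below is a simplified illustration rather than our exact solver: the deformed roots of neighboring strands are held fixed (e.g., taken from the previous iteration), the projections \(Q_k\) are passed in as callables, and the weighting details are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def fit_rigid_strand(points, targets_2d, weights, project, neighbor_roots,
                     lam_nei=1.0, q0=None, t0=None):
    """Per-strand rigid fit (step 2, Eq. 4): find (q, t) so that projected strand
    points match their motion-field targets, while keeping the deformed root
    close to the deformed roots of neighboring strands.

    points:         (N, 3) reference strand points p_i (points[0] is the root p_0)
    targets_2d:     per-view list of (N, 2) target pixel positions F_k(x, y)
    weights:        per-view list of (N,) weights s_k (confidence * visibility)
    project:        per-view list of callables mapping (M, 3) points to (M, 2) pixels
    neighbor_roots: (J, 3) already-deformed root points of neighboring strands
    """
    if q0 is None:
        q0 = np.array([0.0, 0.0, 0.0, 1.0])        # identity quaternion (x, y, z, w)
    if t0 is None:
        t0 = np.zeros(3)

    def residuals(x):
        q, t = x[:4] / np.linalg.norm(x[:4]), x[4:]
        deformed = Rotation.from_quat(q).apply(points) + t
        res = []
        for Qk, tgt, w in zip(project, targets_2d, weights):
            res.append((np.sqrt(w)[:, None] * (Qk(deformed) - tgt)).ravel())
        # neighborhood regularizer on the deformed root position
        res.append(np.sqrt(lam_nei) * (deformed[0] - neighbor_roots).ravel())
        return np.concatenate(res)

    sol = least_squares(residuals, np.concatenate([q0, t0]))
    q = sol.x[:4] / np.linalg.norm(sol.x[:4])
    return q, sol.x[4:]
```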

3.2.3 Skin mesh refinement

After capturing the facial hair for each frame, we refine the rough estimate of the underlying skin mesh. We set the tracked facial strand roots as the control points of the As-Rigid-As-Possible deformation. The target positions of skin mesh points in facial hair regions are computed using the barycentric coordinates of the nearest strand roots.
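
One way to realize this target computation is sketched below; using the three nearest roots and least-squares barycentric weights is an assumption about the exact scheme.

```python
import numpy as np
from scipy.spatial import cKDTree

def hair_region_targets(verts_ref, roots_ref, roots_cur):
    """Target positions for skin-mesh vertices under facial hair.

    Each vertex is expressed in barycentric coordinates of its three nearest
    strand roots in the reference frame; the same coordinates applied to the
    tracked roots of the current frame give the deformation target.
    """
    tree = cKDTree(roots_ref)
    _, idx = tree.query(verts_ref, k=3)             # three nearest roots per vertex
    targets = np.empty_like(verts_ref)
    for i, (v, tri) in enumerate(zip(verts_ref, idx)):
        a, b, c = roots_ref[tri]
        # least-squares barycentric coordinates that sum to one
        A = np.vstack([np.column_stack([a, b, c]), np.ones(3)])
        w, *_ = np.linalg.lstsq(A, np.append(v, 1.0), rcond=None)
        targets[i] = w @ roots_cur[tri]
    return targets
```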

4 Facial hair performance capture in different illumination settings

Our facial performance capture method is compatible with most multi-view camera systems regardless of the illumination settings. We describe the two most common scenarios: multiple static scans of various specific expressions captured under active illumination, and video-based facial performance capture under flat passive illumination. In the first scenario, we seek to build a complete facial hair model by accumulating static reconstructions from different expressions, while in the latter, we focus on the space-time consistency of the tracked hair strands.

Fig. 5 We conduct experiments on three standard setups for facial capture based on Li et al. [7] (left and middle) and Yang et al. [32] (right). Left: 24 cameras under active illumination for appearance capture. Middle: 24 video cameras under uniform lighting for facial performance capture. Right: 17 cameras under uniform lighting for 3D face scanning

4.1 Active illumination

In the task of creating a personalized face rig of an actor, e.g., as a parametric PCA model or a set of blendshapes, the preliminary work usually includes the reconstruction of a set of scanned static expressions. Our facial hair reconstruction and tracking method can be used to capture a complete, temporally consistent, high-fidelity facial hair model serving as a secondary facial feature. To accomplish this, we compute the directional facial hair point cloud of each expression instance using the method described in Sect. 3.1. The facial hair strands of the neutral expression instance are reconstructed and set as the reference facial hair. For the subsequent expressions, the reference facial hair is registered to the target directional point cloud with an extra iterative closest point (ICP) constraint:

$$\begin{aligned} \arg \min _{{\textbf {q}},{\textbf {t}}} \sum _{j} { \parallel [{\textbf {q}}|{\textbf {t}}]p_{j} - p^{'}_{j} \parallel }^2 \end{aligned}$$
(8)

where \(p^{'}_j\) is the best-matching point of strand point \(p_j\) in the target point cloud. We measure the match in terms of both position distance and orientation difference within a search radius of 5 mm. The ICP constraint is combined with the nearest hair map constraint in step (2) of Sect. 3.2.2.
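
A possible implementation of this directional matching is sketched below; the relative weighting of position and orientation and the handling of unmatched points are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_correspondences(strand_pts, strand_dirs, cloud_pts, cloud_dirs,
                        radius=5.0, w_dir=1.0):
    """Best-match point p'_j in the target directional point cloud for every
    strand point p_j (Eq. 8), scoring candidates by position distance and
    orientation difference inside a 5 mm search radius."""
    tree = cKDTree(cloud_pts)
    matches = np.full(len(strand_pts), -1, dtype=int)   # -1 marks unmatched points
    for j, (p, d) in enumerate(zip(strand_pts, strand_dirs)):
        idx = tree.query_ball_point(p, r=radius)
        if not idx:
            continue
        cand_p, cand_d = cloud_pts[idx], cloud_dirs[idx]
        pos_cost = np.linalg.norm(cand_p - p, axis=1)
        # orientation difference is sign-agnostic for line directions
        dir_cost = 1.0 - np.abs(cand_d @ d)
        matches[j] = idx[int(np.argmin(pos_cost + w_dir * dir_cost))]
    return matches
```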

Next, we update the reference facial hair by growing the existing hair strands and new hair strands from the unmatched points using the method described in Sect. 3.1.3. We apply the registration and update procedure for each static reconstruction in the scanned expression sequence and incrementally build a denser reference facial hair of the subject. The final complete facial hair is again registered to each expression instance to build the blendshape models with facial hair.

Fig. 6 F-score between projected hair strands (yellow) and the detected confidence map (white). We compute the precision and recall for a distance threshold of 1 pixel. Our method achieves relatively high precision and completeness values on different subjects and under different illumination settings

4.2 Passive illumination

In the task of video-based performance capture, the goal of facial hair tracking is to propagate the reference hair strands to the subsequent frames. We borrow the hybrid propagation scheme of Li et al. [7] during the tracking to avoid drifting and preserve the 3D facial hair geometry. For the current frame, we compute the deformed strands from both the reference frame and the previous frame, and set the final deformation to the weighted sum of the two deformations.
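
A weighted blend of the two deformations could look like the following minimal sketch; the blending weight of 0.5 is purely illustrative.

```python
def hybrid_propagate(strands_from_ref, strands_from_prev, w_ref=0.5):
    """Blend the two per-strand deformations of the hybrid propagation scheme.

    strands_from_ref / strands_from_prev: lists of (N, 3) numpy arrays holding
    the strand points deformed from the reference frame and the previous frame.
    """
    return [w_ref * s_ref + (1.0 - w_ref) * s_prev
            for s_ref, s_prev in zip(strands_from_ref, strands_from_prev)]
```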

5 Experiments and results

Our facial hair reconstruction method is suitable for standard multi-view photogrammetry systems designed for facial performance capture. To evaluate our method, we conducted experiments using two different systems: one described in the work of Li et al. [7] and another from Yang et al. [32]. The setups of both systems are illustrated in Fig. 5. The system described by Yang et al. [32] consists of 68 cameras with varying resolution settings. For our evaluation, we selected 17 cameras that captured images at the same resolution. The system described by Li et al. [7] not only provides flat lighting but also offers an additional active illumination setting for facial reflectance capture. This active illumination setup involves capturing multi-view images under rapidly varying flash directions (Fig. 5a). Regarding the running time of the entire procedure, reconstructing facial hair strands takes several minutes, with the bulk of the time spent on per-view depth estimation. The tracking procedure requires less time, yet there remains a gap to achieve real-time performance. In the future, we will continue to enhance the overall performance of the proposed method to achieve real-time capabilities.

5.1 Quantitative evaluation and ablations

To evaluate the quality of the reconstructed facial hairs, we employ the F-score metric [33], which is commonly used in point cloud reconstruction tasks. Specifically, we compute the precision and recall values between the projected strand positions and the detected hair confidence map. To calculate the F-score, we use a threshold of 1 pixel for the projected strands of subjects captured under all three setups. The results, shown in Fig. 6, indicate that our method reconstructs the 2D detections with high accuracy. While there may be a slight decrease in performance for the active illumination case, the overall reconstruction quality remains satisfactory.
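
The 2D F-score used here can be computed as sketched below; how detected hair pixels are thresholded from the confidence map is an assumption of this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

def fscore_2d(strand_px, conf_map, thresh_px=1.0, conf_thresh=0.5):
    """F-score between projected strand pixels and the detected hair confidence map.

    Precision: fraction of projected strand pixels within thresh_px of a detected
    hair pixel. Recall: fraction of detected hair pixels within thresh_px of a
    projected strand pixel.
    """
    det = np.argwhere(conf_map > conf_thresh)     # detected hair pixels (y, x)
    proj = np.asarray(strand_px)                  # projected strand pixels (y, x)

    d_proj_to_det, _ = cKDTree(det).query(proj)
    d_det_to_proj, _ = cKDTree(proj).query(det)
    precision = np.mean(d_proj_to_det <= thresh_px)
    recall = np.mean(d_det_to_proj <= thresh_px)
    return 2 * precision * recall / (precision + recall + 1e-8)
```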

We also conduct ablations on the strand deformation optimization method used to track facial hair. Figure 7 shows intermediate deformed strands during the optimization of Sect. 3.2.2. Our method gradually aligns the reference hair strands to the target frame, producing smooth and coherent tracking results. In Fig. 8, we examine the impact of the neighborhood regularization term in the rigid deformation optimization (Eq. 4). We overlay the projected strands (shown in yellow) on the reference image and compute the F-score of the tracked strands against the static reconstructions of the target frame. Our method outperforms the neighborhood regularizer \(E_{nei}\) introduced in the work of Winberg et al. [28] in terms of alignment accuracy. We observed that many hair strands were not perfectly rooted in the facial surface, leading to significant variations in the rotation and translation transformations, particularly in dense hair regions, when using neighborhood strands computed from the root positions. In contrast, our method employs the deformed root positions, which helps preserve the local hair structure and results in more accurate and coherent alignment during the optimization process.

Fig. 7 Step results of the strand deformation optimization

Fig. 8 Analysis of the neighborhood regularization term. We compute the F-score of the tracked facial hairs against the static reconstructions for a threshold of 0.5 mm. a Reference image; b without the neighborhood regularizer; c direct constraint on the rotation and translation \([{\textbf {q}}|{\textbf {t}}]\) as used in Winberg et al. [28]; d our neighborhood regularization term on the deformed strand root positions

Fig. 9 Two examples of facial hair reconstruction under active illumination, showing the reference images, the projected hair strands overlaid on the image and the 3D geometries

Fig. 10 Frames of a performance sequence with tracked face meshes and facial hairs, showing the reference images, the projected hair strands overlaid on the image and the 3D geometries

5.2 Qualitative evaluation and comparison

Figure 9 displays two subjects captured in the active illumination setting, where the images exhibit varying intensities and shadows due to the changing lighting directions. Estimating the depth solely from intensity information, as done in [27], becomes challenging in this scenario. However, our segment matching method, which incorporates 2D orientation, improves the robustness of the reconstruction under such illumination conditions. In Fig. 10, we present the results of reconstructed and tracked facial hairs under a more common setup with flat illumination. We set the neutral expression as the reference frame and reconstructed the hair strands. Subsequently, we tracked the deformation of each strand in the following frames with different expressions. The results demonstrate that the hair strands are transformed in a spatiotemporally coherent manner as the expressions change. The hybrid propagation scheme introduced in Sect. 4.2 ensures robust tracking even in the presence of extreme expressions and enables reliable tracking in long sequences. After capturing the facial hairs, we proceed to refine the underlying face surface to align with the strand root positions. Previous performance capture methods [7, 32] directly align the template face mesh to the shrink-warp surface. We instead first reconstruct the skin and hair points separately and then align the face mesh to the episurface, which combines the skin points and hair roots. The result is a mesh that closely matches the true clean-shaven surface, as seen in Fig. 11, with notable improvements in the jaw outline alignment. This refinement step significantly enhances the accuracy of the overall facial reconstruction.

Fig. 11 Face mesh refinement using captured hair strands, compared to the FaceScape result [32] and Li et al. [7]

Fig. 12 Qualitative comparison between our hair reconstruction and that of Winberg et al. [28]. Our method reconstructs more accurate hair strands in areas of dense hair, such as a thick beard and the eyebrows

Fig. 13 Extreme regions may be excluded

5.3 Comparison and limitation

Figure 12 presents a comparison between our method and the approach proposed by Winberg et al. [28]. It should be noted that, owing to the absence of the raw data necessary for reproduction, we resorted to the next best alternative and present a rough visual comparison between the results of Winberg et al. [28] and our method. Their method is an extension of their previous work, which primarily focuses on single-shot facial reconstruction [27]. Despite incorporating data from multiple frames to achieve denser reconstruction, their approach is limited in capturing complete eyebrows, as shown in Fig. 12. In contrast, our method outperforms theirs in most regions, particularly the eyebrow region. Our approach provides more accurate and complete reconstructions of facial hair, resulting in superior overall performance compared to the method proposed by Winberg et al. [28]. Although the comparison is made on different inputs, the results in Fig. 12 demonstrate the effectiveness and robustness of our method in capturing facial hair strands to a certain degree. Finally, in Fig. 13, we present the reconstructions in “outlier” regions, specifically the eyelashes and sideburns, which were manually excluded from our facial hair tracking process. The eyelashes are typically visible only when the eyes are closed and are challenging to track due to their sparse and unpredictable movement. As a result, these regions were excluded from our tracking procedure.

6 Conclusions

In our work, we make significant contributions to the advancement of facial performance capture pipelines by incorporating detailed and realistic representations of facial hair. Our strand-accurate facial hair reconstruction and tracking method can be easily adapted to various multi-view systems with different illumination setups and camera configurations. Compared to previous methods, our method exhibits superior performance in both accuracy and completeness, achieving strand-accurate reconstruction of facial hair in areas including the eyebrows, mustache, and cheek hair. Our method enables applications such as creating high-fidelity faces with facial hair from static scans, capturing facial performances in the presence of facial hair in videos, and recovering the dynamics of facial hair strands. However, there are still some limitations to our method that could be addressed in future research. In regions with few visible views, such as the basis mandibulae (lower jaw), our method may struggle to reconstruct accurate hair strands. Additionally, the strand deformation optimization relies on accurate optical flow estimation, which may fail for very extreme facial expressions or in cases with motion blur. We acknowledge these limitations and consider them areas for future improvement. In addition, we plan to incorporate deep learning techniques to address some of these challenges and improve the overall performance toward real-time capture.