HDR-Net-Fusion: Real-time 3D dynamic scene reconstruction with a hierarchical deep reinforcement network

Reconstructing dynamic scenes with commodity depth cameras has many applications in computer graphics, computer vision, and robotics. However, due to the presence of noise and erroneous observations from data capturing devices and the inherently ill-posed nature of non-rigid registration with insufficient information, traditional approaches often produce low-quality geometry with holes, bumps, and misalignments. We propose a novel 3D dynamic reconstruction system, named HDR-Net-Fusion, which learns to simultaneously reconstruct and refine the geometry on the fly with a sparse embedded deformation graph of surfels, using a hierarchical deep reinforcement (HDR) network. The latter comprises two parts: a global HDR-Net which rapidly detects local regions with large geometric errors, and a local HDR-Net serving as a local patch refinement operator to promptly complete and enhance such regions. Training the global HDR-Net is formulated as a novel reinforcement learning problem to implicitly learn the region selection strategy with the goal of improving the overall reconstruction quality. The applicability and efficiency of our approach are demonstrated using a large-scale dynamic reconstruction dataset. Our method can reconstruct geometry with higher quality than traditional methods.


Introduction
3D reconstruction is a key technique in computer graphics with various applications in virtual and augmented reality and animation. In recent years, many advances have been made in both reconstruction quality and speed. Since the early success of KinectFusion [1], scanning with a commodity RGB-D camera and reconstructing the captured geometry in an online fashion have become commonplace. Subsequent work has either improved system scalability to support larger scenes and finer details by introducing new persistent data structures [2][3][4], or focused on enhancing reconstruction quality through accurate frame-to-model registration [5][6][7].
While research on reconstructing and modeling static indoor scenes [8] has matured in the past few years, reconstructing dynamic objects (e.g., humans, animals, and other freely moving objects) still remains an open problem in both the graphics and robotics communities (referred to as dynamic SLAM [9][10][11][12][13]). Given an input sequence recording a non-rigid deforming object, the goal of dynamic reconstruction is to recover the moving object's underlying shape in a canonical pose as well as the deformation field for each frame so that the geometry at each instant of time can be recovered. The seminal work of DynamicFusion [14] described a general pipeline adopted by many other algorithms: by parameterizing the per-frame deformation as a warp field defined on a sparse set of transformation nodes skinned from the full geometry, the underlying shape can be registered to the depth observations at a particular time by solving a nonrigid iterative closest point (NR-ICP) problem [15], yielding the transformation for every node. To further improve robustness, many industrial and academic solutions use dedicated hardware [16,17], or exploit a common deformable template [18,19] as a prior to regularize the final result. On the other hand, multiview reconstruction systems like Fusion4D [5] and FusionMLS [20] leverage more complete observations from a large number of cameras to reconstruct the geometry with higher quality.
However, reconstructing scene dynamics is inherently an ill-posed problem because the solution space for occluded regions which are not observed by any camera can be infinitely large [21]. Various regularization terms, such as as-rigid-as-possible constraints, are used to tackle this problem to some extent, but they are not always appropriate to real-world scenarios. Another challenge is the low-quality output provided by the capture device, which tends to contain noise and erroneous depth observations, resulting in artifacts in the final reconstructed model such as holes and bumps. To address the above challenges, we pursue a data-driven framework based on state-of-the-art deep learning techniques, which can be easily integrated into an existing dynamic reconstruction pipeline, to enhance the fusion quality.
Deep neural networks (DNN) have shown their applicability to a wide range of graphics applications such as shape completion [22,23], geometric registration [24,25], and flow/correspondence estimation [26,27]. Recently, 3D deep learning has also gained ever more attention in reconstruction applications [28]. However, there are few attempts to embed deep models directly into reconstruction systems, primarily due to efficiency and generalization considerations. Furthermore, in online systems where succeeding frames rely directly on previous fusion results, deep models directly operating over the already fused geometry [23,29] cannot utilize intermediate fusion results, and it is impossible to recover from any catastrophic tracking failure in the reconstruction system.
In order to maintain system efficiency while exploiting the power of deep learning models, we present HDR-Net-Fusion, a highly-efficient dynamic 3D reconstruction system based on surfel representation [30], which can reconstruct and refine dynamic scenes simultaneously with a hierarchical deep reinforcement network, called HDR-Net. The core of HDR-Net consists of two parts: a Global-HDR-Net and a Local-HDR-Net. The global net first considers the overall geometric structure of the current model, and determines those local patches which may have poor quality, potentially leading to bad registration results for future frames. Then, the local net fixes such detected regions, performing patch-based geometric refinement using a data-driven neural network. We formulate the training of the global HDR-Net as a reinforcement learning problem: the optimal region selection strategy is implicitly learned to minimize the overall reconstruction error, considering both short-term and long-term loss during the fusion process. Our system is empirically shown to be accurate, robust, and efficient. As far as we know, this is the first work to integrate deep neural networks into reconstruction using a reinforcement learning approach.
In brief, this paper makes the following contributions:
• the first efficient hierarchical deep reinforcement network integrated with real-time dynamic multiview 3D reconstruction,
• a reinforcement learning model for efficient and progressive selection of the regions to be fixed, and
• a deep neural network for high-quality local reconstruction refinement.

Dynamic reconstruction
Inferring dynamic scene geometry remains an open research topic. Some work [18,31–34] adopts strong semantic scene priors (e.g., a human body or hand template) to facilitate accurate correspondence and registration. Other methods [14,35–37] instead choose to aggregate and denoise geometry in a canonical static space, and only track the per-frame deformation field over time, without knowledge of the reconstructed scene beforehand. To tackle the inherent ambiguity of the deformation field, and achieve better reconstruction fidelity, Refs. [5,6,36] introduce sparse image feature tracking, silhouette constraints, and albedo inference into the non-linear optimization to make tracking more robust, while Refs. [38,39] bypass the correspondence estimation stage by imposing divergence constraints over the entire deformation vector field, and Refs. [17,40] give dedicated hardware designs for obtaining cleaner and more complete depth and texture information.
Readers are referred to Ref. [41] for a comprehensive literature review. Our approach introduces deep neural models to efficiently learn geometric priors from data for higher fusion quality.

Point set deep networks
For our surfel-based representation of the reconstructed geometry, we apply deep networks which directly consume point clouds. PointNet [43] and its variants [44][45][46] are a standard choice for encoding point set features while providing a good description of multiscale details. The work in Ref. [47] is the first point set decoder combining fully-connected and deconvolution layers. In order to enforce a uniform structure onto the generated point set, FoldingNet [48] and AtlasNet [49] use one or more uniform grids to condition the shape descriptor for shape generation. The designs of various deep point set networks support a variety of applications in both graphics and vision, such as point upsampling [29,50] and shape completion [23,51].

Deep reinforcement learning
Traditional reinforcement learning aims to learn from past experience and make better decisions in a principled way. The successful combination of deep neural networks and reinforcement learning algorithms is capable of dealing with higher-dimensional state and action spaces which were previously intractable [52]. Deep reinforcement learning has various applications in video games [42], generating animation [53], and indoor navigation [54]. A prominent approach is provided by the Deep Q-Network (DQN) [42] and its variants [55,56], which approximate value functions with off-policy learning. Another line of approaches is based on policy gradients or actor-critics, where the model directly learns a stochastic policy [57,58]. Our work formulates dynamic reconstruction as a Markov decision process and applies DQN to learn how to achieve minimum reconstruction error. We believe this is the first application of deep reinforcement learning in dynamic reconstruction pipelines.

Overview
Our HDR-Net-Fusion takes sequential depth maps captured using several commodity RGBD cameras as input and progressively reconstructs the geometry of the dynamic scene for every frame. As shown in Fig. 2, during the testing phase of our algorithm, for each incoming frame, the warping field which best aligns the current depth observations and the reference geometry is first found, and then a traditional fusion process is applied by our basic reconstruction system (Section 5). After that, the Global-HDR-Net is applied to the embedded deformation nodes to compute an expected reward for each node (Section 6.3). The node with the highest expected reward is selected and the local surfel patch surrounding that deformation node is fed into the Local-HDR-Net, which locally refines the patch geometry and completes missing areas (Section 6.2). The refined patch is then integrated into the reference geometry maintained by the reconstruction system to improve the quality of reconstruction and assist future tracking and registration.

Fig. 2
Overview of our reconstruction system. During testing, the deformation nodes $G_t$ of the live geometry $S_t$ are fed into the Global-HDR-Net; the local patch $\hat{S}^m_t$ with the highest expected reward is fed into the Local-HDR-Net for refinement. The refined patch $\tilde{S}^m_t$ replaces the original geometry and is fused into the whole model, which is used for registering the next incoming frame. To train the global and local hierarchical networks, we first supervise Local-HDR-Net with groundtruth full patches, and then we fix its weights and train the Global-HDR-Net, represented as a point-set-based DQN [42].

Notation and scene representation
The dynamically reconstructed scene is represented by a set of deformation nodes $G = \{g^m \in \mathbb{R}^3\}$ and a set of surfels with neighbourhoods $S = \{s^i, N^i\}$. $S$ is a dense reconstruction of the entire scene, while the nodes in $G$ are scattered sparsely over the surface represented by the surfels. Each surfel $s^i = (p^i, n^i, r^i)$ is represented by its center $p^i \in \mathbb{R}^3$, normal $n^i \in \mathbb{R}^3$, and radius $r^i \in \mathbb{R}$. A neighbourhood set $N^i \subset G$ is attached to each surfel, initialized as the $K$ nearest neighbours of $s^i$ among all deformation nodes $g^m$ in $G$. Similarly, a neighbourhood set $N^m \subset G$ can be built for each deformation node to establish the spatial relationships among the nodes.
For each frame $t$ we compute a warp field $W_t = \{q^m_t \in SE(3)\}$ defined at each node in $G$, where $q^m_t$ is the transformation applied to $g^m$ in frame $t$; it is represented using dual quaternions [59]. Let $G_t = \{g^m_t \in \mathbb{R}^3\}$ be the transformed version of $G$, where $g^m_t = q^m_t \cdot g^m$. The surfels are skinned to the deformation nodes, and the transformation for each surfel is found by interpolating nearby node transformations as $\bar{q}^i_t = \sum_{m \in N^i} w^{im} q^m_t$, where $w^{im} = \exp(-\|p^i - g^m\|_2^2 / \sigma^2)$ and $\sigma$ is the node sampling distance [37]. We denote the transformed version of the surfels at frame $t$ by $S_t = \{s^i_t\}$ with $s^i_t = (p^i_t, n^i_t, r^i)$, where $p^i_t$ is the center $p^i$ after applying $\bar{q}^i_t$ and $n^i_t$ is the normal $n^i$ transformed by the rotation part of $\bar{q}^i_t$ only.
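The skinning step can be illustrated with a minimal sketch. It is not the system's implementation: the function names are hypothetical, the weights are normalised to sum to one (an assumption; the text gives only the unnormalised Gaussian weight), and only translations are blended, whereas the system blends full rigid transformations as dual quaternions [59].

```python
import math

def nearest_nodes(p, nodes, k):
    """Indices of the k deformation nodes closest to surfel centre p."""
    order = sorted(range(len(nodes)), key=lambda m: math.dist(p, nodes[m]))
    return order[:k]

def skinning_weights(p, nodes, neighbours, sigma):
    """Gaussian weights exp(-||p - g_m||^2 / sigma^2), normalised to sum to 1.

    Normalisation is an assumption of this sketch so that the blend below
    is a convex combination of the node motions.
    """
    raw = [math.exp(-math.dist(p, nodes[m]) ** 2 / sigma ** 2) for m in neighbours]
    total = sum(raw)
    return [w / total for w in raw]

def blend_translations(weights, neighbours, translations):
    """Weighted average of per-node translation vectors (rotation ignored)."""
    return tuple(
        sum(w * translations[m][axis] for w, m in zip(weights, neighbours))
        for axis in range(3)
    )

# Toy data: four nodes, all moving by the same small translation.
nodes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
translations = [(0.1, 0.0, 0.0)] * 4
p = (0.1, 0.0, 0.0)
nbrs = nearest_nodes(p, nodes, k=2)
w = skinning_weights(p, nodes, nbrs, sigma=1.0)
moved = blend_translations(w, nbrs, translations)
```

Because the weights decay with squared distance, the closest node dominates the blend, which keeps the interpolated motion local to each surfel.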

Design
The design of our basic reconstruction system is inspired by Ref. [37] and illustrated in Fig. 3. The initial surfels $S$ are called the reference geometry and the up-to-date surfels $S_t$ are called the live geometry. For each frame, $W_t$ is determined and the new surfels introduced in the current frame are appended to the live geometry; matched surfels are updated according to the running mean integration protocol [30]. The live geometry is then warped back to the reference geometry, which provides a canonical shape representation.

Fig. 3
Pipeline of the basic reconstruction system. For each incoming frame, we first forward warp the reference geometry into live geometry according to the current warp field, and then a new deformation field is found to align the reference geometry and depth observation. After that, the depth observation is fused with the new live geometry. Finally, we warp the live geometry back to the reference geometry.

Energy function
A key step in dynamic reconstruction is to find the per-frame warp field $W_t$, which is solved by minimizing the following energy, consisting of a data term, a correspondence term, and a regularization term:
$$E(W_t) = E_{\mathrm{data}}(W_t) + \lambda_c E_{\mathrm{corr}}(W_t) + \lambda_r E_{\mathrm{reg}}(W_t),$$
where $\lambda_c$ and $\lambda_r$ are balancing weights. The data term $E_{\mathrm{data}}$ is a depth-to-plane ICP error summed across all $V$ input views per frame, where $V^v_t$ is the visible surfel index set for the current $W_t$ from the $v$-th view, and $d^{i,v}_t$ is the corresponding depth observation of the $i$-th surfel, found by re-projecting $s^i_t$ into the $v$-th camera view and transformed by the camera extrinsics. The correspondence term $E_{\mathrm{corr}}$ is a distance between sparsely matched point pairs found by the global patch collider [60], where $C^v_t$ is the correspondence set containing tuples of matched surfels $s^i_t$ and pixels. The regularization term $E_{\mathrm{reg}}$ is an as-rigid-as-possible constraint encouraging nearby nodes to share the same transformation.
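For orientation, one common way to write the three terms of such an energy (data term, plus the correspondence term weighted by $\lambda_c$, plus the regularization term weighted by $\lambda_r$) is sketched below. This is a sketch under stated assumptions rather than the exact formulation used by the system: the names $E_{\mathrm{data}}$, $E_{\mathrm{corr}}$, $E_{\mathrm{reg}}$, the symbol $c^{j,v}_t$ for the 3D point back-projected from a matched pixel, the use of the warped surfel normal in the point-to-plane residual, and the squared norms are all assumptions.

```latex
\begin{aligned}
E_{\mathrm{data}} &= \sum_{v=1}^{V} \sum_{i \in V^v_t}
    \left( (n^i_t)^{\top} \left( p^i_t - d^{i,v}_t \right) \right)^2, \\
E_{\mathrm{corr}} &= \sum_{v=1}^{V} \sum_{(i,j) \in C^v_t}
    \left\| p^i_t - c^{j,v}_t \right\|_2^2, \\
E_{\mathrm{reg}} &= \sum_{m} \sum_{n \in N^m}
    \left\| q^m_t \, g^n - q^n_t \, g^n \right\|_2^2 .
\end{aligned}
```

In this form the regularizer is zero exactly when each node's transformation moves its neighbours to the same place as the neighbours' own transformations, which is the as-rigid-as-possible behaviour described in the text.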

Challenges
Generally, our basic reconstruction system works well on simple datasets; special cases like topology changes and tracking failure can be fixed by re-initialization [37]. However, without any prior knowledge about the dynamic scene structure, it is still very challenging to track fast motions, and much information is lost during re-initialization. The situation becomes worse when there are erroneous depth observations or when parts of the dynamic structure are occluded in all views.

Concept
To address the challenges faced by the traditional reconstruction system discussed in Section 5, the erroneous and occluded parts must be repaired while the efficiency of the reconstruction system is maintained. We thus propose a hierarchical deep reinforcement network (HDR-Net) that first finds the regions to be fixed efficiently using reinforcement learning (Global-HDR-Net) and then fixes these regions using a deep neural network (Local-HDR-Net).

Network architecture
For each frame, given a deformation node $g^m_t$ selected by Global-HDR-Net, we gather all surfels influenced by that node as $\hat{S}^m_t := \{s^i_t \mid g^m \in N^i\}$ and feed that local patch into the Local-HDR-Net. Its job is to generate $\tilde{S}^m_t$, a completed and de-noised version of $\hat{S}^m_t$. Two design requirements exist for our model: (i) as the network is applied to the reconstructed geometry on a per-frame basis, the model should be lightweight, requiring little additional computation, and (ii) in order to resolve the inherent ambiguity of point set completion, knowledge of the entire scene geometry should be taken into consideration.
We therefore propose a hybrid encoder-decoder structure using the order-agnostic PointNet [43] as the backbone, as shown in Fig. 4. To integrate global geometric knowledge, we use $G_t$ as a summary of the current coarse shape: it gives a good global shape approximation which can effectively summarize the overall scene structure to the network. In the encoder part of our model, $G_t$ and $\hat{S}^m_t$ are first encoded separately, extracting features with respective point-shared MLPs. From the encoding of $G_t$, we take its globally aggregated feature vector as well as the per-point feature vector of $g^m_t$. These two feature vectors are then concatenated with the aggregated per-point feature of $\hat{S}^m_t$. The overall aggregated latent representation of the local region now contains information summarizing the patch geometry in its global context.
In the decoder, we find that using a classic combination of fully connected and deconvolution layers [47] generates the best results while still allowing real-time processing. Deformation-based decoders [48] easily lead to over-smoothed surfaces lacking detail, while implicit-function-based decoders [22] involve heavy sampling computations during inference. The direct output of our model is simply the offset $\tilde{o}^i_t$ of each 3D surfel center from the selected node, which is easier to learn than the surfel normal, given its spatial continuity.

Loss function
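We use the earth mover's distance (EMD) as the loss function to train the network:
$$L_{\mathrm{EMD}} = \min_{\phi} \sum_{i} \left\| \tilde{p}^i_t - \bar{p}^{\phi(i)} \right\|_2,$$
where $\tilde{p}^i_t = \tilde{o}^i_t + g^m_t$ is the center position of each output surfel in world coordinates, $\bar{p}$ denotes a groundtruth surfel center, and $\phi$ is a bijection between output and groundtruth surfels. The best linear assignment expressed by the min operator can be computed efficiently using the approach in Ref. [47].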
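To make the role of the bijection concrete, the following sketch evaluates the EMD exactly for tiny point sets by trying every assignment. This brute-force search is for illustration only and is an assumption of this example; it is factorial in the patch size and is not the efficient approach of Ref. [47] used for training. All names are hypothetical.

```python
import itertools
import math

def emd_bruteforce(pred, gt):
    """Exact EMD between equal-size point sets by enumerating every bijection."""
    if len(pred) != len(gt):
        raise ValueError("EMD with a bijection needs equal-size point sets")
    best = math.inf
    for perm in itertools.permutations(range(len(gt))):
        cost = sum(math.dist(pred[i], gt[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best

# Output centres are predicted offsets added to the selected node position.
node = (1.0, 1.0, 1.0)
offsets = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [tuple(o + g for o, g in zip(off, node)) for off in offsets]
gt = [(2.0, 1.0, 1.0), (1.0, 1.0, 1.5)]
loss = emd_bruteforce(pred, gt)  # 0.5: pred[0] pairs with gt[1], pred[1] with gt[0]
```

Note that the optimal pairing is not the index-wise one, which is why a loss that compared points in storage order would penalise an output that is in fact close to the ground truth.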

Problem formulation
For each frame $t$, Global-HDR-Net aims to select the node $g^m_t$ for the subsequent local patch refinement operation described in Section 6.2. One could choose to refine more than one node or, in the extreme case, all nodes in a single frame. However, too many passes of network inference would drastically reduce the system's real-time performance. In fact, as the output reward is not expected to vary much given subtle changes in the input node positions, and the capture frame rate is high, all nodes with high reward will eventually be selected over time. By instead performing inference on one node at a time, we distribute the computation across the entire session, so that speed is guaranteed while no node is prevented from being chosen.

Fig. 5
Reinforcement learning for training Global-HDR-Net. The reconstruction system and Local-HDR-Net serve as the environment, while Global-HDR-Net is the agent whose task is to select, for each frame, a deformation node in $G_t$ to be refined by the Local-HDR-Net.
One simple strategy for this module could be to always select the patch $\hat{S}^m_t$ with the worst geometric quality. However, such a greedy algorithm will not necessarily lead to a globally optimal result, as it considers neither possible future registration error nor the empirical performance of Local-HDR-Net.
We instead pursue an algorithm that is aware of both short-term and long-term reconstruction quality and takes the properties of both the underlying dynamic reconstruction system and Local-HDR-Net into account.
We solve this problem using ideas from reinforcement learning (RL), which implicitly models the environment using experience gained through trial and error. A natural analogy can be made between Global-HDR-Net and a reinforcement learning agent. The dynamic reconstruction system and the local net serve as the environment, which receives an action (a deformation node $g^m_t$) from the network, performs internal fusion and local patch refinement, and emits the reconstructed result as the new observation. The reward for the action performed can be modeled by a score of reconstruction quality. By choosing different actions at each time step, the Global-HDR-Net agent influences the internal state of the system and all succeeding reconstruction steps.
The target of RL is to learn an optimal policy which can be later executed during inferencing. The optimal policy maximizes the expected return along the state transition path, which, in our case, effectively minimizes reconstruction error over all time steps.
From a theoretical point of view, two key propositions have to be met for the above formulation to be meaningful. Firstly, the reconstruction system should obey the Markov property, where the state of the current step is solely dependent on the previous step's state. Secondly, an appropriate choice of the deformation node can be made solely from the configuration of $G_t$. The first proposition is naturally satisfied because for each frame, the depth observation is integrated only with the fused geometry from the previous frame. Also, we have found that regions with poor reconstruction quality often have highly complicated or mostly occluded parts, which to a certain degree justifies us in assuming the second proposition to be true.

Learning algorithm and network architecture
We employ DQN [42], which uses an efficient off-policy, value-function-based approach, as our reinforcement learning algorithm. DQN aims to learn the Q-function (the expected reward given state and action) through past experience, and approximates $Q(s_t, \cdot)$ using a deep neural network (i.e., the Q-network) to model the high-dimensional state and action space. Here we use $s_t$ and $a_t$ to denote the state and action for frame $t$. Specifically, $s_t$ represents the positions of the global nodes $G_t$ up to frame $t$ and $a_t$ is the integer index $m$ of the selected deformation node in $G_t$. By enforcing the Bellman equality and minimizing the temporal difference error $\delta_t$, the Markov process of the environment can be precisely modeled by the Q-function, and our final policy can be greedily selected as $\pi^*(s_t) := \arg\max_a Q(s_t, a)$, so that in each frame $t$ we maximize $\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $\gamma > 0$ is the discount factor, $r_{t'}$ is the reward for frame $t'$, and $T$ is the total number of frames. Following Ref. [42], we define the temporal difference error as
$$\delta_t(\Theta) := Q(s_t, a_t; \Theta) - \left( r_t + \gamma \max_a Q(s_{t+1}, a; \Phi) \right), \quad (6)$$
where $\Theta$ denotes the parameters of the policy network and $\Phi$ the parameters of the target network. During training, we execute the reconstruction system with Local-HDR-Net several times and gather $(s_t, a_t, r_t, s_{t+1})$ tuples. Here, the actions $a_t$ are chosen using an $\epsilon$-greedy policy, which interpolates between the best policy found so far, $\pi^*$, and a completely random policy with factor $\epsilon$. It can be proved that this strategy converges to an optimal policy, balancing exploration and exploitation in state space. We store multiple state-action-reward tuples across different episodes in a common replay memory. Mini-batches are then sampled from the replay memory to train the policy network parameters $\Theta$ using back-propagation, so that $\delta_t^2(\Theta)$ is minimized. $\Phi$ is usually fixed and updated to $\Theta$ only every few episodes to guarantee stable training.
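The two ingredients described above, the temporal difference error and the exploration policy, can be sketched in a few lines. This is an illustration only: `q_policy` and `q_target` are hypothetical callables standing in for the networks with parameters Θ and Φ, each returning one Q-value per deformation node, and the state is left opaque.

```python
import random

def td_error(q_policy, q_target, s, a, r, s_next, gamma):
    """delta = Q(s, a; Theta) - (r + gamma * max_a' Q(s_next, a'; Phi))."""
    return q_policy(s)[a] - (r + gamma * max(q_target(s_next)))

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random node, otherwise the argmax node."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Toy stand-ins for the policy and target networks over three nodes.
q_policy = lambda s: [0.0, 1.0, 0.5]
q_target = lambda s: [0.2, 0.4, 0.1]

delta = td_error(q_policy, q_target, s=None, a=1, r=-0.3, s_next=None, gamma=0.5)
greedy_action = epsilon_greedy(q_policy(None), epsilon=0.0)
explored_action = epsilon_greedy(q_policy(None), epsilon=1.0, rng=random.Random(0))
```

Keeping the target network separate from the policy network means the regression target does not shift with every gradient step, which is the stability argument given in the text for updating Φ only occasionally.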
We choose PointNet++ [44] as our Q-network. It takes as input the point set $G_t$ at frame $t$ together with the number of surfels in each local patch $\hat{S}^m_t$ (one patch per node $g^m_t \in G_t$), and predicts the Q-value, i.e., the expected reward, for each point (possible action).
The reward $r_t$ is taken as the negative chamfer distance $-D(S_p, S_g)$, where $S_p$ denotes the surfel positions of the current geometry produced by the reconstruction system and $S_g$ is the groundtruth reference geometry.
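As a concrete reading of this reward, the sketch below computes a symmetric chamfer distance and negates it. The particular variant (unsquared Euclidean distances, averaged in each direction and summed) is an assumption of this example, as are the function names; other chamfer variants differ only in squaring or normalisation.

```python
import math

def chamfer_distance(sp, sg):
    """Symmetric chamfer distance: mean nearest-neighbour distance both ways."""
    def one_way(a, b):
        return sum(min(math.dist(x, y) for y in b) for x in a) / len(a)
    return one_way(sp, sg) + one_way(sg, sp)

def reward(sp, sg):
    """Reward for a frame: higher (closer to zero) means better geometry."""
    return -chamfer_distance(sp, sg)

reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
groundtruth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
r = reward(reconstructed, groundtruth)  # -1.0 for this toy pair
```

The two directions penalise different failures: the reconstruction-to-groundtruth direction punishes spurious geometry such as bumps, while the groundtruth-to-reconstruction direction punishes holes.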

Results and discussions
In this section, we introduce the experimental setup for implementing our system, give results and comparisons, and validate the design of our method.

Dataset
We tested our algorithm on sequences from the Human10 dataset [4]. This dataset contains 10 long sequences of several human actors performing various actions, of which 9 are publicly usable. In Human10, each sequence was recorded using 4 fixed-position RGB-D cameras with 512 × 512 resolution, distributed uniformly around a 360° viewing circle. The frustum of each depth camera covers a partial view of the entire human body. The limited sensor quality, which leads to severe depth error and loss, together with many fast, large motions and topology changes, makes the data very challenging for the reconstruction system and results in very frequent tracking loss and re-initialization. To measure reconstruction quality, the dataset provides a groundtruth 3D mesh reconstructed using a free-viewpoint-video [61] capture system. To verify the generalization of the network, we split the 9 sequences into training sequences (human1/2/4/6/9) for HDR-Net and testing sequences (human0/3/7/8).
Weights for both Local-HDR-Net and Global-HDR-Net were learned and cross-validated solely from the training frames. In order to train Local-HDR-Net, patch-level surfel and deformation node data were generated. We first generated surfels and nodes from the depth observations of every single frame without warping, to simulate the artifacts caused by re-initialization and camera quality. For the ground truth, we used Poisson disk sampling to sample equally spaced surfels over the groundtruth mesh. Using a ball query around each node, we then gathered the surrounding surfels from both the reconstruction result and the ground truth, forming a pair of partial and complete patches for the supervised learning of patch completion. Additionally, we balanced the distribution of training pairs by their completeness score, defined as the proportion of groundtruth surfels that lie closer than a certain threshold to their nearest neighbour in the partial surfel patch (extracted from the input depth map). Without balancing, most training pairs would be almost complete; empirically, we found that better overall system performance can be achieved with the balanced dataset.
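The completeness score used for balancing can be sketched as follows. The function name and the choice to return 0 for an empty partial patch are assumptions of this example, and the threshold value in the usage lines is arbitrary; the text does not state the threshold used.

```python
import math

def completeness(gt_patch, partial_patch, threshold):
    """Fraction of groundtruth surfels with a partial-patch surfel within threshold."""
    if not gt_patch or not partial_patch:
        return 0.0  # assumption: nothing observed counts as fully incomplete
    covered = sum(
        1 for g in gt_patch
        if min(math.dist(g, p) for p in partial_patch) < threshold
    )
    return covered / len(gt_patch)

gt_patch = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
partial_patch = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.01)]
score = completeness(gt_patch, partial_patch, threshold=0.05)  # half is covered
```

Training pairs can then be binned by this score and sampled evenly across bins, so that nearly complete patches do not dominate training.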

Training protocol
We use a common supervised training strategy to optimize the Local-HDR-Net using the dataset described above: an AdaGrad optimizer is used with a learning rate of $10^{-3}$. The training of Global-HDR-Net is based on 200 randomly selected consecutive frames from each episode. For each frame, we randomly sample a batch of state-action-reward tuples from the replay memory and optimize $\Theta$ with the RMSprop optimizer. The network weight $\Phi$ is updated to $\Theta$ every 3 episodes. During execution of the $\epsilon$-greedy policy, we start with a 90% probability of selecting random nodes and decrease this probability exponentially to 5% with a 200-frame decay rate. The discount factor is set to $\gamma = 0.999$. In total, we train for around 160 episodes to reach reasonable convergence. The loss curve is shown in Fig. 6.
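One schedule consistent with this description is an exponential decay from 0.9 towards 0.05 with a time constant of 200 frames. Interpreting the "200 frame decay rate" as the time constant of the exponential is an assumption of this sketch, as is the function name.

```python
import math

def epsilon_at(frame, start=0.9, end=0.05, decay=200.0):
    """Exploration probability after `frame` frames of training."""
    return end + (start - end) * math.exp(-frame / decay)

eps_first = epsilon_at(0)        # starts at the initial exploration probability
eps_after_200 = epsilon_at(200)  # one time constant later
eps_late = epsilon_at(10_000)    # approaches the floor of 0.05
```

Under this schedule the agent explores heavily during the first few hundred frames and then relies mostly on its learned Q-values, while the 5% floor keeps a small amount of exploration throughout training.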
To ensure fair evaluation of system performance, none of the test input frames is seen by either network during training. Specifically, only four sequences (human0/3/8/9) contain RGB information and can be tracked well in the multi-view setting; of these, we choose human9 for training Global-HDR-Net. Because the training of Global-HDR-Net depends on the performance of Local-HDR-Net, Global-HDR-Net would learn non-generalizable policies if Local-HDR-Net were too familiar with the sequence on which Global-HDR-Net is trained. Therefore, we only take human1/2/4/6 for Local-HDR-Net training and evaluation, excluding human9.

Implementation details
We implemented our multi-view reconstruction system in C++/CUDA. The training code for both Local-HDR-Net and Global-HDR-Net is written using PyTorch. We interfaced the reconstruction system with the deep networks so that training and inference are tightly coupled and the networks can be trained effectively end-to-end. The algorithm was tested on a workstation with an Nvidia Titan RTX graphics card, running the reconstruction pipeline, Local-HDR-Net, and Global-HDR-Net simultaneously.
In practice, we find that Local-HDR-Net does not guarantee perfectly smooth output, which may degrade overall system performance. Hence we adopt a separate post-processing step to directly reject poor local net outputs $\tilde{S}^m_t$. This post-rejection step finds all nearest-neighbour pairs in $\tilde{S}^m_t$ and computes the dot product of the normal vectors of each pair. The generated surfel patch is rejected if the mean value of all dot products is less than ε, which ensures the smoothness of the surfels we add.
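The rejection test can be sketched as below. The sketch assumes that unit-length normals are already available for the generated surfels (how they are obtained is outside this example), uses a brute-force nearest-neighbour search for clarity, and treats patches with fewer than two surfels as rejected; the function names and that last choice are assumptions.

```python
import math

def mean_normal_agreement(points, normals):
    """Mean dot product between each surfel's normal and its nearest neighbour's."""
    total = 0.0
    for i, p in enumerate(points):
        j = min(
            (k for k in range(len(points)) if k != i),
            key=lambda k: math.dist(p, points[k]),
        )
        total += sum(a * b for a, b in zip(normals[i], normals[j]))
    return total / len(points)

def accept_patch(points, normals, eps=0.9):
    """Keep a refined patch only if neighbouring normals agree on average."""
    if len(points) < 2:
        return False  # assumption: too few surfels to judge smoothness
    return mean_normal_agreement(points, normals) >= eps

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
flat = [(0.0, 0.0, 1.0)] * 4
jagged = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
keep_flat = accept_patch(points, flat)      # normals agree, patch is kept
keep_jagged = accept_patch(points, jagged)  # normals alternate, patch is rejected
```

A rejected patch simply leaves the fused geometry unchanged for that frame, so the check trades a missed refinement for protection against injecting noisy surfels into the model.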
In addition, the number of fixed nodes in each frame can be adjusted in practice to achieve better quality. Since the Global-HDR-Net can output the expected reward of all nodes, we can select several nodes within an acceptable range instead of the best one and fix them one by one so that we can refine a necessary number of nodes every frame while still enabling real-time performance.
In the presence of fast motion, the tracking module may fail. We detect this abnormality by checking the residuals of the registration solver and perform re-initialization [5,37].

Parameter selection
Parameter choice for the reconstruction system is application dependent. To effectively track and recover human geometry, we empirically set the reconstruction parameters in Section 5 to $\lambda_r = 2.3$ and $\lambda_c = 1.3$, and the node sampling distance [37] to $\sigma = 0.04$ cm. The post-processing rejection threshold ε is set to 0.9.
In terms of network structure, our Local-HDR-Net encodes the surfels in $\hat{S}^m_t$ using a shared MLP with 32, 64, and 256 channels in sequence, encodes each point in $G_t$ with 32, 64, and 256 channels, and transforms the concatenated latent feature into two patches of 256 3D points, one with FC layers and one with DeConv layers. The Global-HDR-Net downsamples the input deformation nodes into 256, 64, and 16 points sequentially with set abstraction layers, and the local feature is interpolated using a feature propagation layer. The input nodes $G_t$ are padded to a minimum size of 400.

Overall performance
We present some qualitative results for our entire HDR-Net-Fusion framework in Fig. 7. Compared to results without HDR-Net, incorporating our hierarchical network effectively completes and refines missing or inaccurate regions of the fused model. By leveraging the geometric prior of the underlying scenes through our carefully designed deep network, a plausible completion can be generated, filling in holes in occluded regions (e.g., body parts partially hidden by moving arms) or wrongly observed regions (e.g., regions with dark hair whose depth cannot be accurately measured by the sensor). During re-initialization caused by large registration errors, most of the fused model is deleted; our model can quickly repair it, minimizing subsequent registration artifacts. Table 1 compares the number of re-initializations required by systems with and without HDR-Net. When there are frequent re-initializations caused by large motions, our repair can prevent further registration artifacts, leading to a significant reduction in the number of re-initializations.

Fig. 7
Selected frames demonstrating the overall reconstruction quality of HDR-Net-Fusion. Top to bottom: human0, human3, human8 from Human10. Each pair shows the reconstructed result without (blue) and with (red) HDR-Net. Our algorithm successfully identifies missing or noisy regions and refines them reasonably.
A close-up of the reconstructed geometry is shown in Fig. 8. The regions of the actor's head and shoulders contain holes as a result of the erroneous observation depth, while the region behind his left arm is empty because of a recent re-initialization. Both of the artifacts can be effectively fixed by our method.
The behaviour of our algorithm can be further analyzed by visualizing the expected reward computed by the global net, answering the questions of what the Global-HDR-Net has learned and why it is useful in our setting. As shown in Fig. 9, Global-HDR-Net can find places with holes and bumps efficiently and accurately. In addition, it tends to repair regions at the model's boundaries such as the shoulders and feet. This is reasonable in the sense that these parts are more likely to move rapidly later and need to be refined to make the tracking more reliable. Otherwise, it will be harder to track a broken arm, as we can see in Fig. 7, leading to frequent re-initialization as shown in Table 1. Presumably, as the input to our global net only contains the deformation nodes for the current frame, the model implicitly learns to predict the potential node motions given the static pose and jointly considers both spatial and temporal cues when making decisions. Again, the policy is implicitly learned for lower reconstruction error, which is hard for a hand-crafted heuristic to imitate, as demonstrated in Section 7.5.

Table 1
Number of re-initializations during reconstruction by our framework without and with HDR-Net. Our method reduces re-initialization, especially when re-initialization is frequent.
Columns: Sequence; Without HDR-Net; With HDR-Net

Speed
Our reconstruction system takes about 25 ms per frame for a single-view sequence of Human10. Image pre-processing takes additional time because multi-view sequences are used, but it can be parallelized if multiple processors are available. The average inference time is about 2 ms for Local-HDR-Net and 4.5 ms for Global-HDR-Net, adding little overhead to the underlying reconstruction. This is due to the lightweight design of our deep models and the scalable surfel representation. Overall, our system can reach 25 Hz with 5 nodes fixed per frame. Running Local-HDR-Net on the selected nodes in parallel could make the process even faster.
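The reported timings can be combined into a rough per-frame budget. The sketch below is only a back-of-the-envelope check, not a measurement. It assumes that the 25 ms reconstruction time excludes network inference, that Global-HDR-Net runs once per frame, and that Local-HDR-Net runs once per selected node. The function names and the `parallel_local` flag are hypothetical.

```python
def frame_time_ms(recon_ms, global_ms, local_ms, nodes_per_frame,
                  parallel_local=False):
    """Estimated time per frame in milliseconds.

    Assumption: costs simply add up. If parallel_local is True, all local
    patches are assumed to be refined concurrently at the cost of one pass.
    """
    local_total = local_ms if parallel_local else local_ms * nodes_per_frame
    return recon_ms + global_ms + local_total


def frames_per_second(frame_ms):
    """Convert a per-frame time in milliseconds to a frame rate in Hz."""
    return 1000.0 / frame_ms


# Timings quoted in the text: 25 ms reconstruction, 4.5 ms global net,
# 2 ms local net per node, 5 nodes per frame.
serial_ms = frame_time_ms(25.0, 4.5, 2.0, 5)
parallel_ms = frame_time_ms(25.0, 4.5, 2.0, 5, parallel_local=True)

print(f"serial:   {serial_ms:.1f} ms, {frames_per_second(serial_ms):.1f} Hz")
print(f"parallel: {parallel_ms:.1f} ms, {frames_per_second(parallel_ms):.1f} Hz")
```

Under these assumptions the serial estimate is 39.5 ms per frame, which is consistent with the reported rate of about 25 Hz. The parallel estimate is an idealized upper bound.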
Fig. 9 Expected reward for the deformation nodes computed by Global-HDR-Net (superimposed on the surfels). Since the range of Q-values differs between sequences, we use relative coloring: nodes in blue and yellow have relatively lower and higher values, respectively.
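The relative coloring used for the expected-reward visualization can be illustrated with a per-sequence min-max normalization. This is a minimal sketch under assumptions: the exact color map is not specified in the text, so a linear blend from blue to yellow is used here, and the function name `relative_colors` is hypothetical.

```python
def relative_colors(q_values):
    """Map Q-values to RGB colors relative to the range of this sequence.

    Assumption: the lowest value maps to blue (0, 0, 255), the highest to
    yellow (255, 255, 0), with linear interpolation in between.
    """
    lo, hi = min(q_values), max(q_values)
    span = hi - lo
    colors = []
    for q in q_values:
        # If all values are equal, place every node at the midpoint.
        t = 0.5 if span == 0 else (q - lo) / span
        red = green = round(255 * t)
        blue = round(255 * (1.0 - t))
        colors.append((red, green, blue))
    return colors


# The same relative pattern yields the same colors even when the absolute
# Q-value range differs between sequences.
print(relative_colors([1.0, 2.0, 3.0]))
print(relative_colors([11.0, 12.0, 13.0]))
```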

Comparison with the traditional method
To demonstrate the advantage of our framework over traditional reconstruction methods, we also reconstructed the test data with SurfelWarp [37] from each single viewpoint for comparison with our method. Figure 10 presents several selected frames from the sequences reconstructed by both SurfelWarp and our system. The results show that our method can repair holes and bumps effectively, which is impossible for traditional methods without correct input. Therefore, our system gives higher-quality reconstructions than traditional methods when there is heavy noise and erroneous depth observations. In addition, broken or disconnected legs and arms are quickly completed, confirming our conclusion in Section 7.2.

Global-HDR-Net comparison
To test the performance of Global-HDR-Net and to show that our network actually learns effective information during its training, we set up two competing agents executing different policies.
• Random: nodes are uniformly sampled from G_t.
• Heuristic: we first remove all candidate nodes not satisfying the following criteria:
  - the number of surfels related to the node should be greater than 20;
  - the mean Euclidean distance from each surfel to the node should be smaller than 0.08;
  - the surfel confidence maintained within the reconstruction system, representing the point's stability and reliability, should be greater than 2.0.
  The above criteria ensure Local-HDR-Net acquires sufficient information for inference. Then the eligible node with the fewest surfels is selected by this policy, since patches with fewer surfels should be given higher priority for refinement.
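The two baseline policies can be sketched as follows. This is an illustrative reading of the criteria above, not the authors' implementation. The `Node` record and its field names are hypothetical, and the sketch assumes the confidence criterion is applied to a single aggregated confidence value per node.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical per-node statistics gathered from the surfel model.
    node_id: int
    surfel_count: int            # number of surfels related to the node
    mean_surfel_distance: float  # mean Euclidean surfel-to-node distance
    confidence: float            # aggregated surfel confidence (assumption)


def is_eligible(node, min_surfels=20, max_mean_dist=0.08, min_conf=2.0):
    """Apply the three filtering criteria of the heuristic baseline."""
    return (node.surfel_count > min_surfels
            and node.mean_surfel_distance < max_mean_dist
            and node.confidence > min_conf)


def heuristic_policy(nodes):
    """Return the eligible node with the fewest surfels, or None."""
    eligible = [n for n in nodes if is_eligible(n)]
    if not eligible:
        return None
    return min(eligible, key=lambda n: n.surfel_count)


def random_policy(nodes, rng):
    """Return a node sampled uniformly from all candidates."""
    return rng.choice(nodes)


nodes = [
    Node(0, 15, 0.05, 3.0),  # rejected: too few surfels
    Node(1, 40, 0.05, 3.0),  # eligible
    Node(2, 25, 0.05, 3.0),  # eligible, fewest surfels
    Node(3, 22, 0.10, 3.0),  # rejected: surfels too far from the node
    Node(4, 21, 0.05, 1.5),  # rejected: low confidence
]
print(heuristic_policy(nodes).node_id)
print(random_policy(nodes, random.Random(0)).node_id)
```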
The performance of the policies is compared using the quality of the reconstructed model, computed using the two-way chamfer distance defined in Eq. (7). Figure 11 shows both qualitative and quantitative comparisons over two of the Human10 sequences. Results show that the manually designed heuristic policy leads to better reconstruction quality than the random policy most of the time, but the effect is not strong or particularly stable. Interestingly, we find that the random policy can sometimes lead to worse results than the simple reconstruction system without Local-HDR-Net. This is because some randomly selected nodes may contain too much noise or have a low completeness score, so their corresponding complete geometry is too challenging for Local-HDR-Net to recover, generating many noisy outliers. In contrast, our policy provides an effective refinement to the geometry. Clearly, choosing the right node is as important as the geometry refinement itself and needs careful handling.
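For readers unfamiliar with the metric, a two-way chamfer distance between two point sets can be sketched as below. This is one common form, assumed here for illustration: the sum of the mean nearest-neighbor distances in both directions. The exact normalization in Eq. (7), for example squared distances, may differ. The brute-force search is only practical for small point sets.

```python
import math


def _mean_nearest_distance(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_two_way(points_a, points_b):
    """Symmetric chamfer distance (assumed form, see the note above)."""
    return (_mean_nearest_distance(points_a, points_b)
            + _mean_nearest_distance(points_b, points_a))


reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(chamfer_two_way(reconstructed, reference))
```

The two-way form penalizes both spurious geometry in the reconstruction and reference geometry that the reconstruction fails to cover, which is why noisy outliers from a poor node choice raise the error.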
Compared to our policy, which outperforms all baselines and is learned with the direct goal of minimizing reconstruction error, the heuristic policy is the closest competitor but is unaware of the behaviour of Local-HDR-Net and the underlying reconstruction system. It is non-trivial to manually build a spatiotemporally aware criterion as analyzed in Section 7.2.

Local-HDR-Net comparison
Local-HDR-Net mainly focuses on completing and refining local patch geometries. We compare it with three baselines:
• FoldingNet [48], whose decoder is designed by concatenating sampled points on a uniform grid with the global feature vector. The network learns how to deform such a uniform grid to the desired shape; surface smoothness is guaranteed.
• The Point Completion Network [23], employing a coarse-to-fine completion strategy and aggregating multiple deformable grids to assemble the final completed shape.
• A variant of Local-HDR-Net, lacking the branch taking in the scene deformation nodes G_t. This variant is used to test the utility of the scene structural guidance.
All baseline models and Local-HDR-Net were trained for 200 epochs. Network hyper-parameters, including model architecture and learning strategies, were separately tuned for each model to give the best cross-validation performance.
For evaluation we use the earth mover's distance (Eq. (5)) between predicted and ground-truth surfel positions as our metric. Distance error is plotted for different ranges of patch completeness (from 0.0 to 1.0) in Fig. 12. As there will be much noise and error in the actual reconstruction, we also tested the performance of the networks when Gaussian noise with a standard deviation of 5 mm is added along the normal direction of each input surfel. For patches with low completeness (< 0.1), most methods find it challenging to infer missing fine details due to loss of information. In addition, patches with too much noise also lead to poor results. This justifies our strategy of rejecting patches with too few surfels and returning a negative reward: this explicitly discourages the global net from choosing overly challenging patches for the local net.
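The noise model used in this robustness test can be sketched as follows. This is a minimal illustration assuming that surfel positions are expressed in metres (so 5 mm corresponds to 0.005) and that normals are unit length. The function name `add_normal_noise` is hypothetical.

```python
import random


def add_normal_noise(positions, normals, sigma=0.005, rng=None):
    """Displace each surfel along its own normal by Gaussian noise.

    positions, normals: lists of (x, y, z) tuples of equal length.
    sigma: standard deviation of the displacement (assumed to be metres).
    """
    rng = rng or random.Random()
    noisy = []
    for (px, py, pz), (nx, ny, nz) in zip(positions, normals):
        offset = rng.gauss(0.0, sigma)  # signed offset along the normal
        noisy.append((px + offset * nx, py + offset * ny, pz + offset * nz))
    return noisy


positions = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
noisy = add_normal_noise(positions, normals, rng=random.Random(42))
print(noisy)
```

Because the displacement is restricted to the normal direction, each surfel moves only along its own normal, which is how the noise is described in the text.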
Compared to the baselines, our Local-HDR-Net yields much smaller distance errors in most cases and exhibits better stability in the case of noisy input. Figure 13 renders results qualitatively in the form of surfels. FoldingNet [48] can generate smooth surfaces, but its uniform grid parameterization leads to distorted boundaries and the deformation in complex areas is unnatural. The Point Completion Network [23] assembles the surface from many smaller patch grids, resulting in overlaps and uneven distribution of surfels. It is unsuitable for small-scale geometry refinement. Our baseline without nodes completely discards the scene structural guidance, i.e., the global context, which is very important when the patches' information is very limited or contains much noise.

Generalization
To demonstrate the ability of our system to generalize to real-world scenes, we also selected some sequences from the DeepDeform dataset [7] and sequence human7 from the Human10 dataset [4] to test our method. Results are shown in Fig. 14. Our method can fix artifacts and achieve better results than a traditional method [37]. Furthermore, in addition to the human body, our system can fix artifacts in other objects as well.

Fig. 14 Reconstruction results of sequences from the DeepDeform dataset [7] and the Human10 dataset [4] (bottom-left two). Compared to a traditional method [37] (blue), our method (red) can complete objects other than the human body and real scenes including interaction of people and things like a bag or ball.

Limitations
There are two typical limitations of our method. Firstly, as shown in Fig. 15(a), although our Local-HDR-Net can provide good refinement for most model parts, it still remains challenging to learn features of complicated and subtle structures like hands. A possible reason is that it is hard for the network to represent all features of such a large batch. Narrowing the range of a single patch when the structure is complicated may provide a better result. Furthermore, a more powerful deep neural network with self-attention mechanisms [62] could be adopted to learn more discriminative features for point cloud completion. Secondly, as shown in Fig. 15(b), selecting a constant number of nodes per frame can lead to problems since the model's completeness is continuously changing during reconstruction. For models that are already sufficiently complete, Global-HDR-Net will still select some completed patches to refine, resulting in computational inefficiency and even leading to worse results. A more adaptive node selection strategy could be applied by rejecting previously chosen nodes or selecting nodes by considering their predicted completeness.
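One possible reading of the adaptive selection suggested above is sketched below. It is purely illustrative and not part of the presented system: the function name, the per-node completeness estimate, and the threshold of 0.9 are all hypothetical choices.

```python
def select_nodes(q_values, completeness, recently_chosen,
                 max_nodes=5, completeness_threshold=0.9):
    """Pick up to max_nodes nodes, skipping repeats and complete patches.

    q_values: dict mapping node id to expected reward.
    completeness: dict mapping node id to predicted completeness in [0, 1].
    recently_chosen: set of node ids refined in recent frames.
    """
    candidates = [
        node_id for node_id in q_values
        if node_id not in recently_chosen
        and completeness.get(node_id, 0.0) < completeness_threshold
    ]
    # Prefer nodes with the highest expected reward.
    candidates.sort(key=lambda node_id: q_values[node_id], reverse=True)
    return candidates[:max_nodes]


q_values = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.2}
completeness = {0: 0.95, 1: 0.4, 2: 0.3, 3: 0.2}
print(select_nodes(q_values, completeness, recently_chosen={2}))
```

With such a rule the number of refined nodes can drop below the fixed budget, or to zero, once the model is sufficiently complete.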

Conclusions
This paper has presented HDR-Net-Fusion, a novel dynamic reconstruction system using a hierarchical deep reinforcement network to improve reconstruction quality. Its applicability and effectiveness have been experimentally demonstrated using a large-scale dynamic fusion dataset. Our approach formulates the global selection of a local geometric patch for refinement in terms of reinforcement learning and uses a point-based neural network to complete and improve the local geometry. We hope this work can inspire future work pursuing better dynamic reconstruction quality using powerful deep learning and reinforcement learning algorithms.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.