1 Introduction

Photorealistic rendering of dynamic real-world scenes such as moving persons or interactions of people with surrounding objects plays a vital role in 4D content generation and has numerous applications including augmented reality (AR) and virtual reality (VR), advertisement, or entertainment. Traditional approaches typically capture such scenarios with professional well-calibrated hardware setups [10, 19] in a controlled environment. This way, high-fidelity reconstructions of scene geometry, material properties, and surrounding illumination can be obtained. Recent advances in neural scene representations and, in particular, the seminal work in neural radiance fields (NeRF) [39] marked a breakthrough toward synthesizing photorealistic novel views. Unlike in previous approaches, highly detailed renderings of complex static scenes can be generated only from a set of posed multi-view images recorded by commodity cameras. Several extensions have subsequently been developed to alleviate the limitations of the original NeRF approach which led to significant reductions in the training times [41] or acceleration of the rendering performance [8, 18].

Further approaches explored the application of NeRF to dynamic scenarios but still suffer from slow rendering speed [20, 27, 32, 51]. Among these, Fourier PlenOctrees (FPO) [60] offer an efficient representation and compression of the temporally evolving scene while at the same time enabling free viewpoint rendering in real time. In particular, they join the advantages of the static PlenOctree representation [72] with a discrete Fourier transform (DFT) compression technique to compactly store time-dependent information in a sparse octree structure. Although this elegant formulation enables a high runtime performance, the Fourier-based compression results in a low-frequency approximation of the original data. This estimate is susceptible to artifacts in both the reconstructed geometry and color of the model which often persist and cannot be fully resolved even after an additional fine-tuning step. Strong priors like a pretrained generalizable NeRF [62] may mitigate these artifacts and are applied in FPOs for a more robust initialization. However, when considering recent state-of-the-art techniques to boost the training of the static per-frame neural radiance fields without requiring prior knowledge, obtaining a suitable compressed model remains challenging.

In this paper, we revisit the frequency-based compression of Fourier PlenOctrees in the context of volume rendering and investigate the characteristics and behavior of the involved time-dependent density functions. Our analysis reveals that they exhibit beneficial properties after the decompression that can be exploited via the implicit clipping behavior in terms of an additional Rectified Linear Unit (ReLU) operation applied for rendering that enforces non-negative values. Based on these observations, we aim to find efficient strategies that retain the compact representation of FPOs without introducing significant additional complexity or overhead while eliminating artifacts and even further accelerating the high rendering performance. We particularly focus on flexible approaches that allow for interchanging components to leverage recent advances and, thus, model the Fourier-based compression as an explicit step in the training process, instead of investigating end-to-end trainable systems. To this end, we derive an efficient density encoding consisting of two transformations, where (1) a component-dependent encoding counteracts the under-estimation of values inherent to the reconstruction with a reduced set of Fourier coefficients, and (2) a further logarithmic encoding facilitates the reconstruction from Fourier coefficients and the fine-tuning process by putting higher attention to small values in the underlying \( L_2 \)-minimization. While it is tempting to learn such an encoding in an end-to-end fashion using, e.g., small MLPs, our handcrafted encoding directly incorporates the gained insights and only adds negligible computation overhead, which is especially beneficial for fast rendering.

In summary, our key contributions are:

  • We perform an in-depth analysis of the compression artifacts induced in the Fourier PlenOctree representation when recent state-of-the-art techniques for training the individual per-frame NeRF models are employed.

  • We introduce a novel density encoding for the Fourier-based compression that adapts to the characteristics of the transfer function in volume rendering of NeRF.

2 Related work

2.1 Neural scene representations

With the rise of machine learning, neural networks have become increasingly popular and attracted significant interest for reconstructing and representing scenes [5, 23, 40, 43, 50, 52, 70, 71]. In this context, the seminal work on NeRF [39] demonstrated impressive results for synthesizing novel views using volume rendering of density and view-dependent radiance that are optimized in an implicit neural scene representation using multi-layer perceptrons (MLP) from a set of posed images. Further work focused on addressing several of its limitations such as relaxing the dependence on accurate camera poses by jointly optimizing extrinsic [29, 61, 64] and intrinsic parameters [61, 64], the dependence on sharp input images by a simulation of the blurring process [36, 61], representing unbounded scenes [3, 74] and large-scene environments [54], or reducing aliasing artifacts by modeling the volumetric rendering with conical frustums instead of rays [2,3,4].

2.2 Acceleration of NeRF training and rendering

A further major limitation of implicit neural scene representations is the high computational cost during both training and rendering. In order to speed up the rendering process, several techniques were proposed that reduce the amount of required samples along the ray [25, 31, 42] or subdivide the scene and use smaller and faster networks for the evaluation of the individual parts [47, 48]. Some approaches represented the scene using a latent feature embedding where feature vectors are stored in voxel grids [53, 67] or octrees [33]. Another strategy for accelerating rendering relies on storing precomputed features efficiently into discrete representations such as sparse grids with a texture atlas [21], textured polygon meshes [8], or caches [18] and inferring view-dependent effects by a small MLP. Furthermore, PlenOctrees [72] use a hierarchical octree structure of the density and the view-dependent radiance in terms of spherical harmonics (SH) to entirely avoid network evaluations.

Improving the convergence of the training process has also been investigated by using additional data such as depth maps [11] or a visual hull computed from binary foreground masks [24] as an additional guidance. Furthermore, meta learning approaches allow for a more effective initialization compared to random weights [55]. Similar to the advances in rendering performance, discrete scene representations were also leveraged to boost the training. Instant-NGP [41] incorporated a multi-resolution hash encoding to significantly accelerate the training of neural models including NeRF. Plenoxels [15] stored SH and opacity values within a sparse voxel grid, and TensoRF [7] factorized dense voxel grids into multiple low-rank components. However, all the aforementioned methods and representations are limited to static scenes only and do not take dynamic scenarios like motion into account.

2.3 Dynamic scene representations

Although novel views of scenes containing motions can be directly synthesized from the individual per-frame static models, significant effort has been spent into more efficient representations for neural rendering such as subdividing the scene into static and dynamic parts [30, 68], using point clouds [68], mixtures of volumetric primitives [35], deformable human models [45], or encoding the dynamics with encoder–decoder architectures [34, 38]. Due to the success and representation power of neural radiance fields, these developments also inspired recent extensions of NeRF to dynamic scenes. Some methods leveraged the additional temporal information to perform novel-view synthesis from a single video of a moving camera instead of large collections of multi-view images [12, 16, 28, 44, 46, 56, 65, 69]. Among these, the reconstruction of humans also gained increasing interest where morphable [16] and implicit generative models [69], pre-trained features [59], or deformation fields [44, 65] were employed to regularize the reconstruction. Furthermore, TöRF [1] used time-of-flight sensor measurements as an additional source of information and DyNeRF [27] learned time-dependent latent codes to constrain the radiance field. Another way of handling scene dynamics is through the decomposition into separate networks where each handles a specific part such as static and dynamic content [17], rigid and non-rigid motion [65], new areas [51], or even only a single dynamic entity [73]. Similarly, some methods reconstruct the scene in a canonical volume and model motion via a separate temporal deformation field [13, 20, 32] or residual fields [26, 58]. Discrete grid-based representations [7] applied for accelerating static scene training have also been extended to factorize the 4D spacetime [6, 14, 22, 49]. In this context, Fourier PlenOctrees (FPO) [60] relaxed the limitation of ordinary PlenOctrees [72] to only capture static scenes in a hierarchical manner by combining it with the Fourier transform which enables handling time-variant density and SH-based radiance in an efficient way.

3 Preliminaries

In this section, we revisit the method of representing a dynamic scene using a Fourier PlenOctree [60], which extends the model-free, static, explicit PlenOctree representation [72] for real-time rendering of NeRFs. Given a set of T individual PlenOctrees each corresponding to a frame in a dynamic time sequence, the construction of an FPO consists of two parts: (1) a structure unification of the T static models, and (2) the computation of the DFT-compressed octree leaf entries, which will be discussed in more detail in Sects. 3.1 and 3.2, respectively.

In order to render an image of a scene at time step \( { t \in \{0,\dots ,T-1\} } \), the color \(\hat{\varvec{C}}\) of a pixel is accumulated along the ray \( { \varvec{r}(\tau ) = \varvec{o} + \tau \cdot \varvec{d} \in {\mathbb {R}}^3 } \) with origin \( { \varvec{o} \in {\mathbb {R}}^3 } \) at the camera, viewing direction \( { \varvec{d} \in {\mathbb {R}}^3 } \) as well as step length \( \tau \in {\mathbb {R}}_{\ge 0} \). The ray \(\varvec{r}\) is taken from the set of all rays \( {\mathcal {R}} \) cast from the input images. The accumulation is performed analogously to PlenOctrees [72]:

$$\begin{aligned} \hat{\varvec{C}}(\varvec{r}, t) = \sum _{i=1}^{N} T_i(t) \, \left( 1 - \exp (-\sigma _i(t) \, \delta _i)\right) \, \varvec{c}_i(t), \end{aligned}$$
(1)

where N is the number of octree leaves hit by \(\varvec{r}\), \( { \delta _i {=} \tau _{i + 1} - \tau _i } \) the distance between voxel borders and

$$\begin{aligned} T_i(t) = \exp \left( -\sum _{j=1}^{i-1} \sigma _j(t) \, \delta _j \right) \end{aligned}$$
(2)

is the accumulated transmittance from the camera up to the leaf node i. Hereby, \(\left( 1 - \exp (-\sigma _i(t) \, \delta _i)\right) \) can be considered as the transfer function from densities to transmittance in this node.

The rendering procedure of an FPO is analogous to PlenOctrees [72], with the addition of passing the time step t to the renderer. The time-dependent density \( { \sigma _i(t) \in {\mathbb {R}} } \) is reconstructed using the inverse discrete Fourier transform (IDFT) for time step t applied to the values stored in the FPO in leaf node i. Similarly, the time- and view-dependent color \( { \varvec{c}_i(t) \in {\mathbb {R}}^3 } \) is obtained by first applying the IDFT to the FPO entries of the respective SH-coefficients \( { \varvec{z}_i(t) \in {\mathbb {R}}^{Z \times 3} } \) with Z SH-coefficients per color channel and then querying \(\varvec{z}_i(t)\) for the given viewing direction \(\varvec{d}\). Finally, the sigmoid function is applied to \( \varvec{c}_i(t)\) for normalization. In the following, we omit the subscript i for brevity as all computations are performed per leaf. Since all operations are differentiable with respect to the octree leaves, the compressed representation can be directly fine-tuned based on the rendered images using the following image loss function [60]:

$$\begin{aligned} {\mathcal {L}} = \sum _{t = 0}^{T - 1} \sum _{\varvec{r} \in {\mathcal {R}}} \left\| \varvec{C}(\varvec{r}, t) - \hat{\varvec{C}}(\varvec{r}, t)\right\| _2^2. \end{aligned}$$
(3)

3.1 PlenOctree structure unification

To construct an FPO, time-dependent SH coefficients and densities from all PlenOctrees are merged into a single data structure. The sparse octree structures are first unified to obtain the structure of the FPO. The static PlenOctrees contains leaves with maximum resolution only where the scene is non-empty. Identifying these regions over all time steps and refining the structure of all PlenOctrees accordingly yields the sparse octree structure for the dynamic representation [60].

3.2 Time-variant data compression

After creating the structure of the FPO, the SH coefficients and densities of all leaves and time steps are compressed by converting them into the frequency domain using the DFT. Each SH coefficient and density value is compressed independently for each octree leaf, where only \( K_{\sigma } \) components for the transformed density functions and \( K_{\varvec{z}} \) components for the SH coefficients are kept and stored. Thereby, K components correspond to \( 0.5 \cdot (K + 1) \) frequencies and omitted components correspond to higher frequencies in the frequency domain, so a low-frequency approximation of the data is computed. Thus, the entries of the Fourier PlenOctree are calculated as

$$\begin{aligned} \omega _k = \sum _{t=0}^{T-1} x(t) \cdot \text {DFT}_k(t) \end{aligned}$$
(4)

with

$$\begin{aligned} \text {DFT}_k(t) = {\left\{ \begin{array}{ll} \frac{1}{T} \cos \left( \frac{k \pi }{T} t\right) &{} \text {if}\ k\ \text {is even,}\\ \frac{1}{T} \sin \left( \frac{(k + 1) \pi }{T} t\right) &{} \text {if}\ k \ \text {is odd.} \end{array}\right. } \end{aligned}$$
(5)

Here, x represents either the density \( \sigma \) or a component of the SH coefficients \( \varvec{z} \), and \( \omega _k \) is the k-th Fourier coefficient for the density or the specific SH coefficient. Rendering remains completely differentiable and the time-dependent densities and SH coefficients can be reconstructed using the IDFT:

$$\begin{aligned} x(t) = \sum _{k=0}^{K - 1} \omega _k \cdot \text {IDFT}_k(t) \end{aligned}$$
(6)

with

$$\begin{aligned} \text {IDFT}_k(t) = {\left\{ \begin{array}{ll} \cos \left( \frac{k \pi }{T} t\right) &{} \text {if}\ k \ \text {is even,}\\ \sin \left( \frac{(k + 1) \pi }{T} t\right) &{} \text {if} \ k \ \text {is odd.} \end{array}\right. } \end{aligned}$$
(7)

4 FPO analysis

Upon investigation of the FPO representations of a dynamic scene, we especially notice geometric reconstruction errors that are visible as ghosting artifacts of scene parts from other time steps. While the DFT in general is able to represent arbitrary discrete sequences using \(K=2T-1\) Fourier coefficients, we observe that cutting off high frequencies for the purpose of compression leads to artifacts that are equally distributed across the entire signal. These artifacts persist even after fine-tuning which implies that the lower-dimensional representation of the signal cannot capture the crucial characteristics of the original values at each time step from the static reconstructions. However, especially the density functions always exhibit the same properties that upon analysis lead to two key observations.

Fig. 1
figure 1

Density over time of a single octree leaf. The leaf is marked in red in the respective images taken from the same view at different time steps t. Although the opacity is similar in the views, highly varying densities are observed over time, except for \(t=16\) where there is empty space in the tree leaf

When dealing with static PlenOctrees that have been optimized independently, it is possible that leaf entries are highly varying in terms of the estimated density and color, even though the rendered results are similar. The reason for this effect lies in the underlying volume rendering which involves the exponential function in Eqs. 1 and 2 to compute the observed color and transmittance based on \(\sigma \). For large input values, this function saturates, which can lead to large differences where scene content appears similarly opaque, as shown in Fig. 1.

Fig. 2
figure 2

Two exemplary density functions over time (left) reconstructed with different number of coefficients \(K_\sigma \). The original function of T time steps is equal to its reconstruction with \(K_\sigma = 119\) Fourier coefficients. The falloff of the marked peaks relative to its original value (right) is depending on \(K_\sigma \) and follows the linear scaling function \( s(K_\sigma ) \)

Compressing a time-dependent function with a reduced amount of frequencies in Fourier space returns a smoothed approximation of the original function. Figure 2 shows this effect for different settings of the number of components \(K_\sigma \) kept to represent the signal. Fewer frequencies thereby result in smoother functions and higher reconstruction error, especially visible with sharp peaks. Density values with a higher absolute difference to the average \({ {\bar{\sigma }} = 1/T \sum _{t=0}^{T-1}\sigma (t) }\) are reconstructed with higher error. This interferes with faithfully reconstructing areas with low or zero-densities, such as empty space or transparent and semi-opaque surfaces. However, large positive values do not need to be reconstructed as precisely. The saturation property of the transfer function allows for higher reconstruction errors of large positive values, which is visualized in Fig. 3. The reconstruction with only the compressed IDFT exhibits large errors, whereas after applying the transfer function, high densities are still approximated well. Scaling down the range of values by applying for instance a logarithmic function automatically allows for a higher approximation error of high densities after the inverse transformation.

In addition to the compressed IDFT, the \(L_2\)-minimization of the fine-tuning process treats approximation errors of all input values equally. This is not necessary for large densities in opaque areas and leads to the conclusion that the density reconstruction needs to be concentrated on low and zero-density areas.

Fig. 3
figure 3

A density function and its reconstruction \(\pi _{K_\sigma }\) using the DFT and IDFT with \({K_\sigma =31}\) (top left) and the same function and its reconstruction after applying our logarithmic encoding \(e_{\log }\) (center left). Their full and compressed Fourier representations (top right, center right) show that a logarithmically scaled function contains less high-frequency information that gets lost during compression. Applying the transfer function to the reconstructions (bottom left) shows that the logarithmic version can better represent the original one

During rendering an FPO, zero-densities are interpreted as free space and color computations are skipped for these locations. Negative densities generally cannot be interpreted in a meaningful way. However, during rendering and fine-tuning, colors only need to be evaluated for existing geometry, which is represented with positive densities. Negative values are ignored and can be interpreted to represent free space. Thus, an implicit clipping via the ReLU function lifts the restriction that free space has to be represented as a zero value. With this observation, we can grant more freedom to the representation of zero-density values and, thus, also to the DFT approximation.

Fig. 4
figure 4

Reconstruction of a density function (Orig.) using only DFT and IDFT as proposed for FPOs [60] and additionally in combination with our component-dependent (comp.) and logarithmic (log.) encoding on top of the DFT and IDFT

5 Density encoding

Based on the insights of the aforementioned analysis, we propose an encoding for the densities to facilitate the reconstruction of the original \(\sigma \). During the FPO construction, we perform the compression on encoded densities

$$\begin{aligned} \sigma ' = e_{\text {comp}}(e_{\log }(\sigma )) \end{aligned}$$
(8)

where the encoding consists of a component-dependent and a logarithmic part. We use the latter also during rendering and fine-tuning the FPO, while we apply the former only as an initialization during construction.

Figure 4 shows the differences in the reconstruction of a density function using both or only one part of the encoding. The encoding allows for a better reconstruction of the original densities without any fine-tuning than can be achieved with only the DFT and IDFT.

5.1 Logarithmic density encoding

We use the observation that high density values can have a larger approximation error without impairing the rendered result to improve the reconstruction with the IDFT. To weigh the values according to their importance in reconstruction and focus the approximation on densities near or equal to zero, larger values should be mapped closer together, while smaller values should stay almost the same.

This property is satisfied when encoding the individual non-negative density values \(\sigma \) logarithmically using

$$\begin{aligned} e_{\log }(\sigma ) = \log (\sigma +1) \end{aligned}$$
(9)

before applying the DFT. We choose the shift by 1 so that \(e_{\log }\) remains a non-negative function for non-negative input densities with \(e_{\log }(0) = 0\). During rendering, we apply the inverse of Eq. 9 after the IDFT to project the densities back to their original range. The effect of this encoding can be seen in Fig. 4.

The encoded density sequences are easier to approximate with a low-frequency Fourier basis exploiting the properties of the transfer function. Furthermore, the \(L_2\)-error in fine-tuning is allowed to be larger for encoded high values than without logarithmic encoding, and we focus the optimization on the more important parts of the reconstruction.

5.2 Component-dependent encoding

With the DFT, an approximation of the original function is reconstructed, where low \(\sigma \) values are increased, while high \(\sigma \) values are reduced. This leads to an under-estimation of its variations over time.

Intuitively, using fewer components leads to this under-estimation as fewer frequencies are summed up to reconstruct the original function. The heights of the peaks in the function are correlated with the ratio

$$\begin{aligned} s(K_\sigma )= 0.5 \cdot (K_\sigma +1) / T \end{aligned}$$
(10)

between the number of frequencies \( K_\sigma \) and the number of time steps T, as shown in Fig. 2. Amplitudes of the density function are, however, not smaller compared to zero but relative to the average \({\bar{\sigma }}\). Smaller than average values thus need to be reduced further.

We shift the densities by \(\sigma _{\text {shift}}\) and then scale them with the inverse ratio \(1 / s(K_\sigma )\) before the DFT during FPO construction for a better approximation:

$$\begin{aligned}&e_{\text {comp}}(\sigma ) = \frac{1}{s(K_\sigma )} \cdot (\sigma - \sigma _{\text {shift}}) + \sigma _{\text {shift}} \end{aligned}$$
(11)
$$\begin{aligned}&\sigma _{\text {shift}} = {\left\{ \begin{array}{ll} {\bar{\sigma }} &{} \text {if } \exists \,t \in \{0,\dots ,T-1\} :\sigma (t) = 0, \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(12)

In octree leaves that only contain positive \(\sigma \) for all t, applying \(e_{\text {comp}}\) with a shift by the mean value \({\bar{\sigma }}\) can lead to undesired non-positive values, where positive \(\sigma \) can be accidentally pushed below zero and cause holes in the reconstructed model. Thus, the shifting is only applied if empty space and, in turn, at least one zero-density is encountered. While such non-positive \(\sigma \) may still be introduced to the reconstruction, most cases that would lead to significant errors in the reconstruction can be handled faithfully this way.

This scaling can lead to higher amplitudes in the reconstruction than desired. However, this is not problematic following the two key observations: Both large and small values can be larger or smaller, respectively, to achieve the same result. We can largely remove geometric artifacts from incorrectly reconstructed zero values using this technique. The effect of the scaling is shown in Fig. 4. We apply \(e_{\text {comp}}\) only during FPO construction for a better initialization of the Fourier components in the octree leaves, so its inverse and the involved values of \({\bar{\sigma }}\) do not have to be used and stored for rendering or fine-tuning. Since the component-dependent encoding introduces negative densities, it needs to be applied after \(e_{\log }\).

6 Experimental results

6.1 Data sets

We use synthetic data sets of the Lego scene from the NeRF synthetic data set [39] and of a walking human model (Walk) generated from motion data from the CMU Motion Capture Database [9] using a human model [37]. Each data set includes 125 inward-facing camera views with a resolution of \({800\,\times \,800}\) pixels per time step anchored to the model, where \(20\,\%\) are used for testing purposes. The real-world NHR data set consisting of four scenes (Basketball, Sport 1, Sport 2 and Sport 3) including corresponding masks [68] is used for evaluation on real scenes. Basketball includes 72 views of \({1024\,\times \,768}\) and \({1224\,\times \,1024}\) resolution, where 7 views are withheld for testing purposes, whereas the Sport data sets each contain 56 views with 6 views withheld for testing.

6.2 Training

Since the reference implementation of the original Fourier PlenOctrees [60] is unfortunately not publically available and also relies on a generalizable NeRF [62] that has been fine-tuned on the commercial Twindom data set [57], we evaluate our approach against a reimplementation, which is further referred to as FPO-NGP. In particular, we employed Instant-NGP [41] instead of a generalizable NeRF and further removed floater artifacts [66].

Fig. 5
figure 5

Renderings of FPOs of different dynamic scenes without and with our logarithmic and component-dependent encoding after 10 epochs of fine-tuning

To obtain the static PlenOctrees, a set of \({T=60}\) NeRF models is trained first. The networks use the multiresolution hash encoding and NeRF network architecture of Instant-NGP [41] but produce view-independent SH coefficients instead of view-dependent RGB colors as output, analogous to NeRF-SH [72]. The training images are scaled down by a factor of two.

The PlenOctrees are extracted from the trained implicit reconstructions [72] using 9 SH coefficients per color channel on a grid of size \(512^3\). The PlenOctree bounds are set to be constant over time with varying center positions to enable representing larger motions. Fine-tuning of the static PlenOctrees is performed for 5 epochs with training images at full resolution.

We choose the same parameters for the Fourier approximation as in the original approach [60]: For the density, \(K_\sigma =31\) Fourier coefficients are stored in the FPO, while \(K_{\varvec{z}}=5\) components are used for the SH coefficients of each color channel. The FPO is fine-tuned for 10 epochs on the randomized training images of all time steps at full resolution. We also augment the time sequence by duplicating the first and last frame to avoid ghosting artifacts but exclude these additional two frames from the fine-tuning and the evaluation.

All training is performed on NVIDIA GeForce RTX 3090 and RTX 4090 GPUs, where frame rates and training times are listed here for the RTX 4090 GPU. The process to obtain a fine-tuned FPO with our un-optimized implementation takes around 6 h for training one set of the static NeRF models and around 10–30 min per epoch of fine-tuning. Any other steps in the training require a few minutes each.

Further details of the training procedure are provided in the supplemental material.

6.3 Evaluation

Figure 5 shows renderings of the baseline and our enhanced FPO. The baseline exhibits artifacts stemming from the geometry of other time steps that are visible as floating structures and are introduced by the Fourier-based compression. In comparison with FPO-NGP, the amount of these artifacts is significantly reduced with our method. In the case of fast moving scene content such as the legs in the Walk scene, artifacts are mostly removed and much less apparent. Even without fine-tuning, the geometry is reconstructed well, as can be seen in Fig. 6.

Fig. 6
figure 6

Visual comparison of the ground truth data with the reconstructions of the scenes Lego, Walk, Basketball, and Sport 1 with combinations of the logarithmic (log.) and component-dependent (comp.) encoding before and after fine-tuning for 10 epochs

Table 1 Comparison of achieved metrics averaged over all data sets with different combinations of logarithmic encoding (log.) and component-dependent encoding (comp.), both before and after fine-tuning for 1 and 10 epochs

In the enhanced FPO before fine-tuning gray artifacts are still visible on the reconstruction. These stem from approximation errors in the SH coefficient functions, which are not altered by our approach. Leaves that are empty most of the time contain a default value of zero in most static PlenOctrees, which result in the gray coloration. The fine-tuning process however ensures a realistic reconstruction of RGB colors for all time steps.

Table 1 provides an overview of the achieved PSNR, SSIM [63] and LPIPS [75] values. Considering our baseline reimplementation FPO-NGP, we observe a lower performance both qualitatively and quantitatively in comparison with the results of the original reference implementation [60]. However, our results are consistent with the evaluations reported in their supplemental material when the generalizable NeRF [62], which has been specifically fine-tuned on the commercial Twindom data set, is not employed. Additional comparisons can be found in the supplementary material. Besides these observations, our method achieves much better results than the baseline even with only a single epoch of fine-tuning. Similarly for the case when no further optimization is involved, our method achieves higher metrics than the baseline due to a better initialization of the geometry. Due to this fact, the creation process of an FPO representation of a dynamic scene is accelerated indirectly, as less time needs to be spent on fine-tuning.

Table 2 Comparison of the frame rate [1/s] averaged over all data sets with different combinations of logarithmic encoding (log.) and component-dependent encoding (comp.), both before and after fine-tuning for 1 and 10 epochs

Since our proposed encoding only adds a few additional computation operations, its impact on the performance of the optimization and rendering is minimal. In fact, we still achieve real-time frame rates, as can be seen in Table 2. The resulting FPS are even increased as free space, which previously exhibited artifacts due to the incorrect positive densities, is now correctly identified. Because free space does not require any computation of color, this step in rendering is now skipped which accelerates rendering significantly. Furthermore, our encoding does not change the required memory for storing an FPO, which is 2.4 GiB on average across all tested scenes.

6.4 Ablation studies

In Fig. 6, we present a more detailed overview of the effects of each part of our density encoding. Table 1 lists the corresponding metrics for different scenes.

Especially the logarithmic part improves the reconstruction significantly, since the low-frequency approximation can reconstruct the density functions much better than without it and most geometric artifacts are removed or become barely visible. The DFT and optimization process puts more focus on the reconstruction of lower values and changes between free space and positive densities requiring a smaller error for good results.

The component-dependent part shows to be beneficial for a better initialization of the geometry reconstruction before fine-tuning. Gray artifacts are removed in most places at most time steps, as zero-densities are represented with negative values and are thus interpreted as free space. In combination with the logarithmic part, the quality of the initial geometry reconstruction is increased even further. After fine-tuning, the FPO initialized with only the component-depending part of the encoding achieves better results than the baseline FPO, but also in this case, our full encoding further improves the overall quality.

Both parts of the encoding result in a significant speed-up in rendering before and after fine-tuning, see Table 2. While the reconstruction using only the logarithmic part of the encoding shows improved visual quality over the baseline, the rendering speed is decreased without fine-tuning. Here, zero-densities are better approximated but are still assigned small positive values. Since higher densities also exhibit smaller values, more values need to be accumulated along the ray to reach the termination criterion which is determined by the transmittance. Our full encoding consisting of both parts yields the highest frame rate.

We provide additional renderings in the supplemental video and further ablation studies on the number of Fourier coefficients and on the underlying NeRF models in the supplementary material.

6.5 Limitations

Similar to other approaches, our method also has some limitations. The primary focus of our encoding lies in transforming the density functions to make it easier to compress via the Fourier-based signal representation. SH coefficient functions, however, show different properties and are currently solely compressed using the DFT, which introduces similar artifacts as for the density function. While fine-tuning allows to improve the reconstruction, obtaining a realistic representation of the colors is also important and a challenging problem. Furthermore, our method inherits several limitations of the original FPO approach. Only the provided data are compressed, so generalization of the scene dynamics in terms of extrapolation and interpolation of the motion is not directly possible.

7 Conclusion

In this paper, we revisited Fourier PlenOctrees as an efficient representation for real-time rendering of dynamic neural radiance fields and analyzed the characteristics of its compressed frequency-based representation. Based on the gained insights of the artifacts that are introduced by the compression in the context of the underlying volume rendering when state-of-the-art NeRF techniques are employed, we derived an efficient density encoding that counteracts these artifacts while retaining the compactness of FPOs and avoiding significant additional complexity or overhead. Our method showed a superior reconstructed quality as well as a substantial further increase of the real-time rendering performance, and we believe that our insights will also be beneficial for further Fourier-based methods [58].