1 Introduction

Virtual Reality (VR) has been constantly evolving since Sutherland’s work in 1965 (Sutherland 1965). Over the last few years, with the latest advances in hardware, this technology has gained attention from both the scientific community and the software industry. Its interaction paradigm makes it a suitable technology for education, science, and research applications.

With the emergence of the new generation of head-mounted displays (HMD), immersive virtual reality applications and games are gaining popularity. HMD systems have been shown to improve the user’s sense of self-presence when compared to desktop VR applications, which translates into better user performance in navigation and spatial inspection tasks (Shu et al. 2018).

Even though immersive VR systems have been shown to improve user engagement and the sense of presence in the scene (Loup et al. 2016; Reiners et al. 2014), they also have some limitations compared to desktop setups. Immersive VR applications must be highly responsive, since low performance may cause symptoms such as discomfort or nausea. Moreover, HMD systems require computing the scene twice (once for each eye), which doubles the rendering workload and increases the performance requirements.

Most VR applications and games represent the environment using 3D mesh-based models. These models often require a certain degree of processing, whereas point clouds can be used as raw, unfiltered measurements straight from a LIDAR scanner.

There are some fields that require accurate representations of a real environment, such as heritage or industrial applications, where the user needs to take precise measurements. Once the environment has been 3D scanned, it becomes possible to perform these measurements, eliminating the need to travel again to the facility.

3D laser scanning and photogrammetry are the main technologies for capturing as-built models of indoor and outdoor environments. They result in massive point clouds that model the real environment down to millimetre accuracy. Although the point clouds can be converted to lighter mesh-based models, the process can be difficult in complex scenes and may result in a loss of information. Therefore, it is often desirable to directly visualize point clouds. Nonetheless, since points are infinitesimal primitives, the user may notice holes on a surface in a point cloud when observing it from a short distance. This is the reason why even small scenes require millions of points for a proper visualization. In order to reduce the memory footprint of massive point clouds, data must be compressed. This high density of points makes it crucial to render the scene in a continuous level of detail (CLOD) to maximise the visual quality of closer surfaces while limiting the number of points rendered for more distant objects.

Most point cloud visualization methods focus on outdoor data for GIS applications, since airborne scans are one of the most popular point cloud data sources available. In these applications the scene is commonly displayed from a top-view perspective, as the user typically wants an overall view of the terrain. In this case, rendering the full set of points from a large point cloud can significantly compromise performance without contributing to the final render, since most points may not be visible in the final image (e.g. after applying frustum culling or discretization operations).

As opposed to outdoor scenes, indoor scenes typically feature a high number of occlusions, since structural elements hide anything that may be behind them. One of the main problems of existing point cloud rendering methods is that they do not take this high degree of occlusion into account. To overcome this issue, we can benefit from the static nature of indoor environments by rendering only the subset of points that are visible from a given position. To do so, one must compute the occlusions between all surfaces from a specific point of view. This process can be demanding and may compromise real-time performance.

1.1 Related work

Rendering point clouds with hundreds of millions of points can be an extremely processing-intensive task. In order to improve application performance, many approaches have been proposed. These methods typically involve building optimised data structures to access multiple resolution levels of the geometry at render time. Many works, such as QSplat (Rusinkiewicz and Levoy 2000) and Layered Point Clouds (Gobbetti and Marton 2004), are able to render large point clouds by introducing such data structures. Multi-resolution acceleration structures are commonly based on binary trees that recursively subdivide 3D space. In these trees, the root node stores a low-resolution version of the scene and, with each level, the resolution is gradually increased, adding more and more detail to the area it covers. When rendering the scene, the level of detail (LOD) at which geometry is displayed must balance image quality and performance.

Point cloud models require a large number of points in order to represent a scene properly. It is possible to implement compression techniques to reduce the memory requirements by, for example, quantising the data (coordinates, colour and other attributes) of each node in the hierarchical structure to a specific size, as described in (Schnabel and Klein 2006). Additionally, (Botsch et al. 2002; Peng and Kuo 2005) suggest also compressing the octree structure by encoding the occupancy of child nodes. In (Schütz et al. 2019) the authors also propose the use of vertex data quantisation in their continuous level of detail (CLOD) method. In this work, points are stored in a single vertex buffer that is iterated repeatedly by a compute shader in order to produce a downsampled version at runtime based on the camera position and orientation, thus reducing the presence of popping artifacts.

As opposed to the above-mentioned methods, some works introduced new strategies that focus on deferred rendering. Some authors (Futterlieb et al. 2016; Tredinnick et al. 2016) suggested rendering approaches that preserve the rendered image from the previous frame if the camera is not moving and later render additional details from the scene. While the previously mentioned methods still rely on hierarchical structures to select which voxels need to be rendered, (Schütz et al. 2020) introduced a progressive rendering method for data with no hierarchical structure. These methods improve the rendering performance for models with a high amount of occlusions, such as indoor scenes. Most previous methods would render all surfaces that are inside the view frustum without taking into account whether other points that are closer to the camera occlude them, therefore leading to a lower performance with no real contribution to the final image.

Whereas progressive rendering techniques are able to render large point clouds with high performance, they need the camera to be static in order to refine the final image by reusing the points rendered in previous frames. However, VR users constantly rotate their heads when exploring the environment and the headset position is rarely stationary. This makes it necessary to constantly render new points, because the final image refinement algorithm rarely converges when working with massive datasets. For this reason, these methods often suffer from the appearance of holes and fading corners.

As an alternative, it is possible to improve direct rendering methods by pre-computing the set of voxels that are visible from a point of view. By doing this, the application only has to render the set of voxels stored in the list corresponding to the current user position. This list is known as the potentially visible set (PVS) and it is commonly used for mesh-based rendering in video games and walkthrough applications (Teller and Séquin 1991; Cohen-Or et al. 2003). This pre-computation step typically takes some time, but it significantly reduces the rendering overhead, and the visualization does not suffer from the image refinement convergence issues that deferred rendering techniques do.

The current literature on computing the visibility of a scene can be classified into point-based methods and from-region methods, as proposed in (Cohen-Or et al. 2003). The first group computes the visibility with respect to the current viewpoint in real time, whereas from-region methods use visibility information computed for whole regions. The latter usually require more pre-processing time, because they store the potentially visible set (PVS) for each region, which is usually valid for several sequential frames while the user stays inside the same region.

Point-based methods compute the visibility using either image-precision or object-precision approaches. The first group works on the discrete representation of the rendered scene. The simplest method consists of ray tracing: a ray is cast from the eye through each pixel towards the scene in order to render the closest object intersected by the ray (Bala et al. 1999; Parker et al. 2005).

An alternative approach consists of building a Hierarchical Z-Buffer (HZB) (Greene et al. 1993) that contains sub-sampled versions of the Z-buffer. In order to decide whether a node in the octree is visible, all its faces are tested hierarchically against the Z-pyramid. In a similar fashion, other works render the scene occluders into an occlusion map or a hierarchy of occlusion maps (HOM) (Zhang et al. 1997). Scene objects are then tested against these occlusion maps to decide whether they are visible or not.

Object-precision methods, in turn, operate on the geometrical space of the scene. Some methods (Coorg and Teller 1997; Hudson et al. 1997; Bittner et al. 1998; Naylor 1992) test scene objects against the shadow frustum volume of a set of occluders. In (Wonka et al. 2001) the authors present an online occlusion culling system that is able to compute visibility at render time.

From-region methods compute the scene visibility inspired by the idea of cells and portals from Jones (Jones 1971), which makes them suitable for indoor scenes. These methods (Airey et al. 1990; Teller and Séquin 1991) assume that the space can be divided into cells that are partly bounded by object surfaces and connected through portals. This assumption makes it possible to store the pre-computed PVS for each cell and retrieve it when needed, instead of constantly computing the visibility of the scene in real time as point-based methods do.

On the other hand, outdoor scenes are usually rendered using a 2.5D approach (Koltun et al. 2000) that, instead of computing the PVS of each cell, computes the set of virtual occluders that compactly represents the aggregate occlusion for a given cell. Other algorithms for general 3D scenes have been proposed, such as (Durand et al. 2000).

Whereas many methods have been presented for computing (Katz et al. 2007; Mehra et al. 2010; Katz and Tal 2015; Pintus et al. 2011) and analysing (Peters et al. 2015; Alsadik et al. 2014) the direct visibility of point clouds, to the best of our knowledge there is still no published work on pre-computing a PVS for point clouds in order to improve the rendering performance of models with massive amounts of points.

In order to render point clouds larger than the GPU memory capacity, it is necessary to implement out-of-core structures and algorithms. Some works (Wimmer and Scheiblauer 2006; Pajarola et al. 2005; Richter and Döllner 2010; Goswami et al. 2013) introduced out-of-core versions of these hierarchical structures. Nonetheless, these approaches must optimise data transfer rates by asynchronously accessing the data structure while rendering, because input–output interruptions can compromise real-time performance when rendering massive datasets.

1.2 Overview

In this work, we introduce a new in-core real-time rendering method for the visualization of massive indoor point clouds with hundreds of millions of points in virtual reality. Indoor environments are mostly cluttered and feature a high number of occluders. We benefit from the static nature of indoor environments and pre-compute the visibility in the scene in order to cull all regions that are occluded given a camera position.

The idea is, given a point cloud and its hierarchical representation, to pre-compute a visibility map (V-Map), which contains the potentially visible set (PVS) of spatial regions for every position in the walkthrough-feasible space of the scene (the navigable space). This is achieved by performing several off-screen renders from each position, in a similar way to how cubemaps are rendered, and storing the list of octree nodes that are projected into the buffer.

For every pixel in the resulting buffers, we retrieve the corresponding 3D point by computing its re-projection based on the known camera information, and then we retrieve the hierarchical structure node that contains that 3D point. This pre-processing step is fairly heavy in terms of computational power and requires a significant amount of time. Nevertheless, once the V-Map is built, the scene can be rendered continuously by displaying only the octree nodes from the PVS that is nearest to the current user position in the V-Map.

The main contributions of this work can be summarised as follows.

  • To the best of our knowledge, this is the first work to pre-compute a visibility map (V-Map) for rendering large point cloud models, dealing with the issues inherent to indoor models with a high amount of occlusion. By doing this, the algorithm only renders the potentially visible set (PVS) of octree nodes that corresponds to the user’s position.

  • We implemented a new hierarchical structure for rendering point clouds in a voxel-wise continuous level of detail (CLOD) by lowering the point resolution rendered for distant regions and, therefore, achieving lower rendering times. This allows us to further improve the rendering performance when pre-computing the V-Map is not enough.

  • Our experimental results show that our work outperforms other state-of-the-art methods on several indoor datasets. In addition, our approach achieved competitive results on outdoor datasets.

Our proposed method allows the rendering of any point cloud that fits into GPU memory. Nonetheless, it would be possible to implement an out-of-core version of our algorithm for rendering datasets larger than the system’s GPU memory. In addition, our approach requires a pre-processing step for building the required acceleration structures and pre-computing the V-Map. Although this pre-processing can be slow, we propose computing the navigable space in order to reduce it. Once these data structures are built, our method is able to render massive point clouds in real time for virtual reality environments with very high performance.

1.3 Organization of the article

The rest of this article is organized as follows: Sect. 2 describes our proposed method in detail, including the pre-processed data structures and the real-time rendering algorithms. Sect. 3 presents the experiments for evaluating the performance of our pipeline, and Sect. 4 draws the conclusions.

2 Method

We propose a new in-core real-time rendering method for the visualization of massive point clouds in virtual reality. Our approach is divided into a pre-processing step and a real-time rendering algorithm. The pre-processing stage is responsible for building the acceleration structures used to access the point cloud data and for pre-computing a visibility map (V-Map), which stores the potentially visible set (PVS) of octree nodes for each sampled point of view. This pre-processing step requires a significant amount of time, but once the V-Map is built, the rendering overhead is reduced significantly.

When rendering the scene, our algorithm selects all the visible space regions from the V-Map cell nearest to the current user position and determines the number of points that will be rendered depending on their distance to the camera, in order to visualize the environment at a continuous level of detail. In every frame, the render loop retrieves and renders the points selected by this algorithm. Moreover, we perform several optimizations in order to reduce the number of draw calls.

2.1 Data structures

Fig. 1

Diagram showing the spatial distribution of the octree and the data distribution into the vertex array. Every node in the octree stores the indexing data for its corresponding points in the vertex array. On the left we can see how the node \(A^1\) on L1 covers the ranges in the array that correspond to its child nodes \(A^2\), \(B^2\), \(C^2\), and \(D^2\). Similarly, \(A^0\) covers the ranges that correspond to \(A^1\), \(B^1\), \(C^1\), and \(D^1\)

Most state-of-the-art methods that rely on hierarchical structures render the scene in discrete levels of detail. Distant spatial areas are rendered with a lower level of detail (LOD) in order to reduce the number of points rendered every frame. The LOD resolution is usually pre-defined for each level of detail, which makes the boundaries between regions with different LODs noticeable. To mitigate this effect, we propose a node-wise continuous level of detail (CLOD) rendering method, which makes it possible to render any region at a given continuous resolution.

The hierarchical data structure is divided into an octree and a vertex array that stores the model vertices. Given an input point cloud with N points, we build an octree with L levels by recursively subdividing the space into a three-dimensional grid until the 3D cells reach a desired minimum size, e.g. 1 m per side. The octree does not contain the points of the model; these are stored in GPU memory. Instead, it contains the indexing data (starting and ending index, as illustrated in Fig. 1) for retrieving the points from the vertex array. This array sequentially stores the points of all nodes in the octree.

When populating the vertex array, we store all the points of each leaf node in the octree. Then, the starting and ending index positions are stored in the corresponding leaf node of the indexing octree. Once the vertex array is populated, the octree is updated with the associated indices for the remaining hierarchical levels: the index range of any node spans the ranges of all its child nodes, as shown in Fig. 1. This way, any intermediate node can be rendered at any given continuous LOD by rendering random points from its child nodes.
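
The following C++ sketch illustrates one possible layout for this indexing structure; the type and field names are ours and do not correspond to the authors’ actual implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <memory>

// Illustrative sketch of the indexing octree: nodes never own point data,
// they only reference a contiguous [begin, end) range of the shared vertex
// array, which lives in GPU memory in the actual renderer.
struct OctreeNode {
    uint64_t begin = 0;      // first vertex-array index of this node's points
    uint64_t end = 0;        // one past the last index
    float density = 0.0f;    // pre-computed density factor (see Sect. 2.3)
    std::array<std::unique_ptr<OctreeNode>, 8> children;

    bool isLeaf() const {
        for (const auto& c : children)
            if (c) return false;
        return true;
    }
};

// Once every leaf has written its points into the vertex array, propagate the
// index ranges upwards: a parent's range spans the ranges of all its
// children, so any inner node can be drawn directly from the array.
void propagateRanges(OctreeNode& node) {
    if (node.isLeaf()) return;
    node.begin = UINT64_MAX;
    node.end = 0;
    for (auto& c : node.children) {
        if (!c) continue;
        propagateRanges(*c);
        node.begin = std::min(node.begin, c->begin);
        node.end = std::max(node.end, c->end);
    }
}
```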

There are other works in the literature, such as (Schütz et al. 2019), where a single vertex buffer is used for storing point cloud data. While those authors suggest storing points ordered by their octree level (from lowest to highest), we propose to store them sequentially based on the spatial position of their octree node, thus eliminating the need to encode the LOD level into a vertex attribute.

Our multi-resolution data structure makes it possible to render sub-sampled versions of the model depending on the level of detail for each octree node. When the LOD percentage for an octree node is increased, the number of points that are sampled from its corresponding segment in the vertex array is also increased.

We apply random sampling in order to obtain sub-sampled versions of a node, assuming that the points are uniformly distributed in space. Sampling points randomly from the vertex array during real-time rendering may lead to results that are not consistent between frames. Therefore, during the pre-processing step we shuffle all points in every leaf node before adding them to the vertex array. This way, we ensure that we obtain n random samples in real time simply by accessing the first n points of the node’s segment in the vertex array sequentially.

In order to obtain sub-sampled versions of intermediate-level nodes of the octree, we cannot simply render the first n points of their range in the vertex array, since that range stores all their child nodes sequentially. By doing so, only the first child nodes would be rendered at the maximum resolution level and the rest would remain invisible. Instead, assuming that the input point cloud has a uniform distribution, we can solve this by retrieving uniformly strided segments in order to render the node uniformly, as sketched below.
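
A minimal C++ sketch of both sampling strategies, assuming the shuffled per-leaf layout described above; the real renderer would issue draw calls over (strided) ranges of the GPU vertex buffer rather than build explicit index lists.

```cpp
#include <cstdint>
#include <vector>

// Which vertex-array indices to draw for a node rendered at a given LOD ratio
// in [0, 1]. Leaf segments were shuffled at pre-processing time, so their
// first n entries are already a random sample. Inner-node segments
// concatenate several children, so we sample them with a uniform stride.
std::vector<uint64_t> sampleNode(uint64_t begin, uint64_t end,
                                 double lod, bool isLeaf) {
    const uint64_t total = end - begin;
    const auto n = static_cast<uint64_t>(lod * static_cast<double>(total));
    std::vector<uint64_t> indices;
    if (n == 0) return indices;
    indices.reserve(n);

    if (isLeaf) {
        // Prefix of the shuffled segment: already a uniform random sample.
        for (uint64_t i = 0; i < n; ++i)
            indices.push_back(begin + i);
    } else {
        // Uniform stride across the concatenated child segments.
        const double stride = static_cast<double>(total) / static_cast<double>(n);
        for (uint64_t i = 0; i < n; ++i)
            indices.push_back(begin + static_cast<uint64_t>(i * stride));
    }
    return indices;
}
```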

In addition to the indexing data, the octree also stores a density factor for each node. This value is pre-computed as the sum of the number of points of all leaf nodes that are inside a sphere of arbitrary size (typically with a radius of 15 m) centred on the selected node. This density value will be used at render time as a parameter for adjusting the level of detail in the scene, as we will further describe in Sect. 2.3.
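
A possible C++ sketch of this pre-computation. Note that Sect. 2.3 uses the density factor as a value d in [0, 1], but the text does not specify the normalisation, so the normalisation mentioned in the final comment is an assumption of this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct LeafInfo { double cx, cy, cz; std::size_t pointCount; };

// Raw per-node density: the total number of points of all leaf nodes whose
// centres lie inside a sphere of the chosen radius around the node's centre.
double rawDensity(double cx, double cy, double cz,
                  const std::vector<LeafInfo>& leaves, double radius = 15.0) {
    double count = 0.0;
    for (const auto& l : leaves) {
        const double dx = l.cx - cx, dy = l.cy - cy, dz = l.cz - cz;
        if (std::sqrt(dx * dx + dy * dy + dz * dz) <= radius)
            count += static_cast<double>(l.pointCount);
    }
    return count;
}
// Assumed normalisation (not specified in the text): e.g. divide each raw
// density by the maximum over all nodes so the factor lies in [0, 1].
```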

Our index tree has a low memory footprint compared to the vertex array. Nevertheless, it is still possible to maximise the number of points that can fit into the vertex array. For that purpose we implemented a quantization method, inspired by previous works on storing vertex data in GPU memory and decompressing it in the vertex shader (Calver 2002; Purnomo et al. 2005). Our method only requires 12 bytes per vertex: 3 bytes for colour and 9 for position. As a result, we are able to render 3D scenes of up to 16,777 m per axis with a 1 mm resolution.
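
The exact bit layout is not given in the text; the following C++ sketch assumes a straightforward 24-bit fixed-point encoding per axis (1 mm units relative to the model origin) plus one byte per colour channel, which matches the 12-byte budget and the 16,777 m range stated above.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Assumed 12-byte vertex layout: 24-bit fixed-point position per axis at
// 1 mm resolution (up to 2^24 mm ~= 16,777 m per axis) plus 3 bytes of RGB.
struct PackedVertex {
    std::array<uint8_t, 9> pos;   // x, y, z as little-endian 24-bit integers
    std::array<uint8_t, 3> rgb;
};

static void pack24(uint32_t v, uint8_t* out) {
    out[0] = v & 0xFF;
    out[1] = (v >> 8) & 0xFF;
    out[2] = (v >> 16) & 0xFF;
}

static uint32_t unpack24(const uint8_t* in) {
    return uint32_t(in[0]) | (uint32_t(in[1]) << 8) | (uint32_t(in[2]) << 16);
}

// Quantise a position given in metres, relative to the model's minimum corner.
PackedVertex quantise(double x, double y, double z,
                      uint8_t r, uint8_t g, uint8_t b) {
    PackedVertex v{};
    pack24(static_cast<uint32_t>(std::llround(x * 1000.0)), &v.pos[0]);
    pack24(static_cast<uint32_t>(std::llround(y * 1000.0)), &v.pos[3]);
    pack24(static_cast<uint32_t>(std::llround(z * 1000.0)), &v.pos[6]);
    v.rgb = {r, g, b};
    return v;
}

// The inverse mapping, analogous to what the vertex shader does on the GPU.
void dequantise(const PackedVertex& v, double& x, double& y, double& z) {
    x = unpack24(&v.pos[0]) * 0.001;
    y = unpack24(&v.pos[3]) * 0.001;
    z = unpack24(&v.pos[6]) * 0.001;
}
```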

2.2 Pre-computed visibility

In order to deal with the high depth complexity of many point cloud datasets, we must know which points are visible from a given position at render time. Therefore we must compute the occlusions between all surfaces from a specific point of view. This process can be quite slow and cannot be done in real time, since it would compromise the rendering performance. Instead, we pre-compute the scene visibility in order to store the potentially visible sets of nodes (PVS). However, computing the PVS for the infinitely many continuous positions in which the camera may be placed is not feasible. Instead, we set up a discrete grid of cells covering the walkthrough-feasible space of the scene (the navigable space).

When pre-computing the visibility map (V-Map), we first build a navigable space grid with a specific resolution (e.g. 1 m). Every cell centre defines a sampling position for computing its PVS. The navigable space is computed following the method described in (Sanchez-Belenguer et al. 2016), which can be summarized in the following steps.

  1. Detect floors in the 3D model by analysing point normals.

  2. Remove any holes in the detected surfaces by using region-growing and region-closing algorithms.

  3. Build a three-dimensional grid with empty cells of the specified size, bounded by the detected floors and a given height.

Once the navigable space is computed, the V-Map building process can take place. This is done by performing several low-resolution off-screen renders: we place the camera at the central position of each navigable space cell with a 90-degree view frustum and perform six off-screen renders, one facing towards the positive Z axis, another towards the negative Z axis, and four more looking at the horizon with a 90-degree angle between orientations, in a similar way to how cubemaps are generated in computer graphics. In Algorithm 1 we summarize the steps required for pre-computing the visibility map.

The result is six small buffers onto which the scene is mapped, as shown in Fig. 2. For every pixel in the six resulting buffers, we retrieve the corresponding 3D point by computing its re-projection. Then we determine which octree leaf node contains the retrieved 3D point. When all the child nodes of a higher-level node are visible, we substitute them with the parent node in the PVS, thus reducing the size of the list and the number of data access calls required to render them.
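
The following C++ sketch outlines this pre-computation loop; it is not the paper’s Algorithm 1. The off-screen rendering and re-projection machinery is abstracted behind callbacks whose signatures are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <set>
#include <vector>

struct Vec3 { double x, y, z; };

// Assumed callback signatures:
//   renderFacePixels(eye, face, res): for every pixel of one 90-degree cube
//       face, the re-projected 3D hit point (or no value for empty pixels).
//   findLeaf(point): id of the octree leaf node containing the point.
using FaceRender = std::function<std::vector<std::optional<Vec3>>(
    const Vec3& eye, int face, int resolution)>;
using LeafLookup = std::function<int(const Vec3& point)>;

// For every navigable-space cell, render the six cube faces at low resolution
// and collect the ids of the octree leaves hit by each pixel; the union of
// those ids is the cell's PVS.
std::vector<std::set<int>> buildVMap(const std::vector<Vec3>& cellCentres,
                                     int resolution,
                                     const FaceRender& renderFacePixels,
                                     const LeafLookup& findLeaf) {
    std::vector<std::set<int>> vmap(cellCentres.size());
    for (std::size_t c = 0; c < cellCentres.size(); ++c) {
        for (int face = 0; face < 6; ++face) {          // +X, -X, +Y, -Y, +Z, -Z
            const auto hits = renderFacePixels(cellCentres[c], face, resolution);
            for (const auto& hit : hits)
                if (hit) vmap[c].insert(findLeaf(*hit));
        }
        // The actual method then collapses fully visible sibling groups into
        // their parent node to shrink the PVS.
    }
    return vmap;
}
```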

Fig. 2

Off-screen rendered cubemap corresponding to an example position in the navigable space grid

This process is quite demanding in terms of GPU and CPU resources, having a temporal cost of \(O(G \cdot B^2 \cdot L)\), where G is the number of positions in the navigable space list, B is the rendering buffer size, and L is the number of leaf nodes in the octree. However, this pre-processing step only needs to be executed once.

The resolution of these off-screen buffers directly affects the number of nodes in the PVS. Due to the infinitesimal nature of points, a high-resolution buffer leads to the appearance of holes in the rendered image and makes it possible to see through occluders, as can be seen in Fig. 3 (left). In contrast, performing low-resolution off-screen renders mitigates this effect.

Fig. 3

A comparison of an identical off-screen render performed into frame buffers with 512x512px and 64x64px resolution

However, using such small buffers for pre-computing the visibility has some drawbacks. Points from distant surfaces may fall into the same pixel in the final buffer and thus not be added to the visible node list, as shown in Fig. 3, leading to missing areas when rendering the scene with the camera in that position. In addition, points corresponding to small artifacts in the scanned model may occlude surfaces behind them. In order to solve this issue, we can use a region-growing algorithm to make the potentially missing areas visible.

We select the cubemap resolution according to two parameters: (1) the spacing, f, of the point cloud and (2) the minimum distance, d, where we want to ensure that a uniformly sampled plane with the given spacing effectively occludes what is behind. Considering that we perform six renders using a FOV of 90 degrees, we set the cubemap resolution, res, as \(\frac{\pi / 2}{ atan\left( \frac{f}{d} \right) }\).
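
A small C++ helper implementing this formula; the example values in main are illustrative only and are not taken from any dataset in this article.

```cpp
#include <cmath>
#include <cstdio>

// res = (pi / 2) / atan(f / d): the number of pixels per 90-degree cube face
// needed so that points spaced f metres apart still cover adjacent pixels
// (and therefore occlude what lies behind them) up to a distance of d metres.
int cubemapResolution(double spacing_f, double minDistance_d) {
    const double pi = std::acos(-1.0);
    return static_cast<int>(
        std::ceil((pi / 2.0) / std::atan(spacing_f / minDistance_d)));
}

int main() {
    // Illustrative values: a 1 cm spacing with occlusion guaranteed up to
    // 0.4 m yields a face resolution of 63 pixels.
    std::printf("%d\n", cubemapResolution(0.01, 0.4));
    return 0;
}
```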

The size of the V-Map cells determines the final file size of the V-Map: the finer the resolution, the larger the amount of memory required for the visibility map. Moreover, a smaller V-Map cell size implies that the PVS will be updated more frequently when navigating through the scene. This results in a more consistent visualization, since the visibility approximation is more precise.

The V-Map resulting from the pre-processing step can be refined with a region-growing algorithm, as shown at the end of Algorithm 1. For each navigable space cell, this function retrieves all the neighbours (up to a given spatial distance limit) of the nodes at the spatial boundaries of its set of visible nodes and adds all candidates (i.e. nodes that are not visible yet) to the PVS of the current cell. By doing this, the resulting PVS increases in size and the amount of missing areas is reduced with every iteration.
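
A simplified C++ sketch of this refinement pass as we understand it; the neighbour query is abstracted behind a callback whose signature is an assumption.

```cpp
#include <functional>
#include <set>
#include <vector>

// Assumed neighbour query: returns the ids of the octree nodes lying within
// maxDistance of the given node.
using NeighbourQuery =
    std::function<std::vector<int>(int nodeId, double maxDistance)>;

// For each navigable cell, nodes neighbouring its current set of visible
// nodes (within the spatial distance limit) are added to the PVS if they are
// not visible yet. Each iteration grows the PVS slightly, trading a larger
// set for fewer missing areas in the final render.
void refinePVS(std::vector<std::set<int>>& vmap,
               const NeighbourQuery& neighboursOf,
               double maxDistance,
               int iterations) {
    for (int it = 0; it < iterations; ++it) {
        for (auto& pvs : vmap) {
            std::set<int> candidates;
            for (int nodeId : pvs)
                for (int n : neighboursOf(nodeId, maxDistance))
                    if (pvs.count(n) == 0)
                        candidates.insert(n);
            pvs.insert(candidates.begin(), candidates.end());
        }
    }
}
```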

Algorithm 1 Pre-computation of the visibility map (V-Map)

2.3 Rendering

Since our method is mostly focussed on walkthrough applications, we aim to render the model with the highest level of detail possible. These high-detail requirements must be balanced by limiting the number of points rendered for more distant objects, i.e. rendering the scene at a continuous level of detail (CLOD).

To achieve a continuous level of detail when rendering the model, we use a LOD (level of detail) function as described in (Strugar 2009), which introduced an algorithm for rendering terrain height-maps. Strugar’s work proposes a LOD function based on the precise three-dimensional distance between the observer and the terrain. This function is used across the whole rendered mesh and is able to handle smooth transitions between LOD levels. In our case, we employ a LOD function based on the distance from a node’s centre to the user camera position and a pre-computed spatial density factor. Given the following equation:

$$\begin{aligned} f = \left( \mathrm{dist}(node, camera) - U \right) \cdot d^2 + \epsilon \end{aligned}$$

We can define our LOD function with range [0, 1]:

$$\begin{aligned} lod = {\left\{ \begin{array}{ll} 1, &{} \text {if } f \le 0\\ \min \left( 1, \frac{d}{f} \right) , &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where

  • \(d \in [0, 1]\) is a pre-computed density factor. This factor is computed while building the data structures in the pre-processing step and takes into account the number of points in a sphere of arbitrary size (e.g. 15 m radius) around the centre of the node.

  • U is a positive value for a spatial threshold. Any node that is closer to the camera than U metres will be rendered at full resolution to fulfil the visual requirements in walkthrough applications. By default we set this value to 1.5 m.

  • \(\epsilon\) is \(1 \times 10^{-5}\).

In each iteration, our algorithm selects which nodes in the scene are visible based on the user’s position, by querying the PVS of the nearest pre-computed V-Map cell, and assigns each of them a sampling ratio based on this function. This ratio determines the number of points \({q_i}\) that must be rendered for each node i.
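
The function above maps directly to code; a short C++ transcription with the default U = 1.5 m and \(\epsilon = 1 \times 10^{-5}\) from the text:

```cpp
#include <algorithm>

// Direct transcription of the CLOD function above. 'distance' is the distance
// from the node centre to the camera in metres, 'density' is the pre-computed
// factor d in [0, 1], U is the full-resolution radius and epsilon avoids a
// division by zero.
double lodFactor(double distance, double density,
                 double U = 1.5, double epsilon = 1e-5) {
    const double f = (distance - U) * density * density + epsilon;
    if (f <= 0.0) return 1.0;            // closer than U: full resolution
    return std::min(1.0, density / f);   // otherwise decay with distance
}
// The resulting ratio determines the per-node point budget q_i
// (e.g. by scaling the number of points stored for node i).
```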

When displaying the selected PVS, we sequentially retrieve the points of each visible octree node from the vertex buffer until the target number of points \(q_i\) is reached. By doing this, the render loop is able to render a sub-sampled version of the visible nodes simply by accessing sequential fragments of the vertex buffer.

Notice that when displaying an inner node, instead of accessing the vertex array sequentially, we define a stride in order to sample n points uniformly from the segment of the buffer that corresponds to the selected octree node. This is due to the fact that all the points of the leaf nodes are stored sequentially in the vertex buffer, and their parent node stores the starting index of the first child node and the ending index of the last. If we selected the first n points (assuming that n is smaller than the number of points that correspond to the parent node), we would be rendering only the first child nodes. Therefore, by defining a stride, the render loop is able to render a uniformly sub-sampled version of inner nodes.

3 Evaluation

For evaluating our method we used several indoor models, illustrated in Fig. 4. For completeness, we also used an outdoor point cloud model. The chosen datasets vary in number of points, spatial dimensions, number of occlusions and spatial resolution. Table 1 shows the key parameters of the datasets evaluated in this article.

Fig. 4

Datasets used in this work. Office and Cyclo datasets are indoor scenes, whereas Melbourne is an outdoor model. The Melbourne model was scaled down by a factor of 9 for faster navigation in the scene

Table 1 Datasets used in this work. The first four columns show the name of the models, their class, number of points, sampling space and scene dimensions. The next three columns indicate the leaf node size used while building the hierarchical structure and V-Map spatial structure we use in this article. The last column shows the file size of the compressed version we use for storing each model

The following datasets were used in this work:

  • Office: indoor model of an office, publicly available from Navvis.Footnote 1 Lidar scan data was acquired using a Navvis M6 mobile mapping system.

  • Cyclo: indoor model of the Cyclotron of the JRC-Ispra, obtained with ZF 5016 static scanners. We have worked with different resolutions: 1 cm (135 M points), 3 mm (459 M points), 2.5 mm (579 M points) and 2 mm (778 M points).

  • Melbourne: outdoor model of the City of Melbourne.Footnote 2 This model was scaled down by a factor of 9 for faster navigation in the scene. Lidar scan data was acquired using an airborne scanner. We decided to include an outdoor model in our experiments in order to prove the generality of our method: even when the V-Map does not significantly contribute to the final rendering, the CLOD by itself makes our results competitive.

All tests were performed on a VIVE Pro MV headset at 90 Hz with a rendering resolution of 1424x1584 pixels, unless another resolution is specified. The following systems were used when measuring the results in this article:

  • 1070: A desktop computer running Windows 10 (64 bits) with a GeForce GTX 1070 with 8GB GPU memory, an Intel Core i7-8700K CPU, and 32GB RAM.

  • 2080: A desktop computer running Windows 10 (64 bits) with a GeForce RTX 2080 with 8GB GPU memory, an Intel Core i7-9700K CPU, and 32GB RAM.

3.1 V-Map

First, we assess the effects of the pre-computed V-Map. For this purpose, we computed the V-Map for each dataset and estimated the average culling ratio in the scene by dividing the number of octree leaf nodes that are not included in the PVS of each V-Map cell by the total number of octree leaves in the scene.

Table 2 Comparison of the percentage of scene leaves that are culled by the pre-computed V-Map

This metric does not take the camera orientation into account, but it still gives us an upper bound on the number of rendered nodes: at render time we also apply frustum culling, which may reduce the number of rendered nodes even further.

In Table 2 we can observe that the proportion of culled nodes is larger in bigger scenes, since the occluders hide a larger percentage of the scene. That is why the culling ratio for the Office dataset is lower than for the Cyclo model. In the latter dataset, our method benefits significantly from the occlusions by evaluating, on average, only 9.97% of the scene. Notice that this problem-size reduction is due only to our pre-processing step and does not require any further computation at rendering time. It is worth noting that, when we mention the results for the Cyclo dataset, we refer to all different-resolution versions, since the occlusions in the scene are exactly the same.

In indoor environments the size of the scene directly affects the culling ratio. This is due to the cluttered topology of these models: the bigger the model, the lower the relative number of visible nodes that have to be rendered, because structural elements and other occluders hide most of the scene. However, the occlusion ratio in the Melbourne outdoor dataset is much lower. This can be explained by the fact that most of the terrain is flat and only the buildings near the coastline constitute significant occluders in the scene.

Another important consideration about the V-Map is the amount of time it takes to compute when pre-processing the input point cloud. In order to reduce this processing time, we employ a pre-computed navigable space, which reduces the number of required off-screen renders. In Table 3 we show the speedup ratio after computing this navigable space.

Table 3 Pre-processing step speedup for the indoor datasets resulting from the pre-computation of the navigable space

The Office dataset has the biggest speedup ratio since it is an L-shaped building: if we considered the whole bounding box of the model, we would spend a huge amount of time and resources computing the visibility from outside the building. Figure 5 shows a couple of slices of the pre-computed V-Map for the Office and the Cyclo-2 mm datasets. The colour scale represents the number of octree nodes visible from each position. If the navigable space is not available for the model, the V-Map has to be computed for the whole scene.

Although our method is focussed on indoor walkthrough visualization, the user can also navigate through a scene in free-flight mode. When the user leaves the space for which the V-Map is available, we can no longer rely on it. In that case we only use our CLOD strategy in order to reduce the number of points that are rendered, similarly to what the other methods do.

Fig. 5

Example slices of a visibility map. On the left, a rendered slice of the Office model and its corresponding visibility map at 2.25 m height. On the right, a rendered slice of the Cyclo-1 cm model and its V-Map slice at 0.70 m height. The colour scale represents the number of octree cells that are visible from that point in space. This is why wide spaces and deep corridors have a warmer colour than small rooms

3.2 Real-time performance

The method we propose in this work is able to render massive point clouds that fit into GPU memory. If we assume 8GB of available GPU memory, our approach is able to render point clouds with up to 667 million points. With 10GB available, this limit is raised to 894 million; and with 20GB, to 1.79 billion points. This limit is defined as \(\frac{\text {GPU memory (bytes)}}{12}\). In the demo video (Online Resource 1) it is possible to see some walkthrough navigation tests with the proposed method.

We compared our method against a progressive rendering method (Schütz et al. 2020) and the Arena4D ProfessionalFootnote 3 trial version by navigating around the scene following a predefined trajectory for every dataset with each application. For measuring real-time performance we used SteamVR’s time logging system and FrapsFootnote 4 as an independent FPS (frames per second) benchmarking tool. Unless specified otherwise, all tests were performed with desktop mirroring enabled.

When testing Schütz’ method, we use their proposed compressed BIN file format and set the drawing point quota (i.e. the average number of points rendered per frame) to 10 M and 30 M points. For testing Arena4D Professional, we set the detail level to the second lowest resolution option, the rendering point size to 1, and the quality parameter (0-150) to 100.

Table 4 Rendering performance (measured in FPS) on the GTX 1070 system rendering with a resolution of 1424x1584 pixels per eye. Performance was limited to a maximum of 90.0 FPS due to the refresh rate of the device

In Table 4 we compare the real-time frame rate results obtained with each of the three applications running on the GTX 1070 system. Our method introduces an overhead by querying the V-Map in every frame. However, in larger models the effect of this overhead is lower. This way, the difference in performance between our method and the rest is higher for larger models such as Cyclo-2.5 mm, Cyclo-3 mm and Cyclo-2 mm, where the number of points is much bigger.

Although pre-computing the visibility map has been shown to improve the rendering performance in large indoor datasets, the decrease in performance for the Melbourne dataset can be explained by the low occlusion ratio in the scene. In this scene there are not as many occluders as in the indoor datasets, thus the effects of our pre-computed V-Map are smaller.

For the tested outdoor model Melbourne, we can see how the progressive rendering method shows comparable performance results with a low point rendering quota of 10 M points. Nevertheless, we can still compare our CLOD approach with Schütz’ progressive rendering strategy in outdoor datasets. By considering that our method renders an average of 17 M points in every frame for the Melbourne dataset, we set a 17 M point quota for Schütz’ method. By doing this, our CLOD rendering strategy is able to render a similar amount of points at 54 frames per second, while the progressive rendering method runs at 37 FPS.

We can also observe that the Cyclo-2 mm dataset shows a significant decrease in performance. In this case, the point data does not fit into the available GPU memory, thus making it necessary for part of the vertex buffer data to be stored in shared graphics memory. Nonetheless, as shown in Table 4, our method is still able to deliver good results when rendering on a GeForce GTX 1070 GPU.

Table 5 Rendering performance results (measured in FPS) on the RTX 2080 system rendering with a resolution of 1424x1584 pixels per eye. Performance was limited to a maximum of 90.0 FPS due to the refresh rate of the device

Table 5 shows the rendering performance on the RTX 2080 system. The added GPU rendering power increases our method’s rendering performance significantly, even doubling it in the case of Cyclo-2 mm. As we can observe, the average performance of our method is around 85.4 frames per second, which outperforms the rest in all indoor datasets while achieving comparable results with Arena4D on the Melbourne dataset.

Fig. 6

Rendering performance with the RTX 2080 system with a resolution of 2468x2740 pixels per eye. Frames per second measurements obtained in 1 min sessions where the user followed a specified path by walking around the scene. During these tests the scene was rendered both in the head-mounted display and the desktop screen

The difference in rendering performance between our method and the rest becomes more noticeable when rendering at higher resolutions. In Fig. 6 we show the real-time frame rates for rendering at a resolution of 2468x2740 on the RTX 2080 system. In this case, our method clearly outperforms the rest, especially on the largest datasets.

Table 6 Rendering performance of our method running on the RTX 2080 system with a resolution of 2468x2740 pixels per eye with no desktop mirroring

Finally, we performed the same tests by disabling desktop mirroring in order to assess the real-time performance while rendering only on the HMD. In Table 6 we can see that this results in a noticeable increase in performance in the largest datasets, such as Cyclo-2 mm, leading to results close to 90 FPS.

Besides offering high real-time rendering performance, our method also features much better visual quality when the scene is explored in virtual reality. Instead of drawing randomly sampled points from the model in every frame up to a specified quota, we draw only the points that lie inside the octree nodes that are visible from the current user position. This means that our method renders all visible points in every frame, whereas progressive rendering methods require some time for the image to converge.

Fig. 7

Visual quality comparison on the Cyclo-2 mm dataset. The first two columns contain screenshots obtained in real-time with the progressive rendering method by Schütz et al. The first column uses a 10 M point budget, while the second uses 30 M. In the last column we show two screenshots obtained in real-time with our method

We compared our method with Schütz’ method by capturing screenshots from a given position while leaving the headset completely still. As can be seen in Fig. 7, our approach is able to render the model with much higher detail.

One of the main benefits of the V-Map is that points are culled effectively at render time, thus avoiding their representation in the final rendered images. We designed an experiment to assess these visual quality results by accounting for pixels in the final image that should not have been represented.

Fig. 8

Visual quality results of our experiments, comparing the renders obtained with and without the V-Map against the image rendered using only the points that belong to the currently visited room

We first isolated one of the central rooms of the model. Then, we randomly selected different points of view inside the room and performed three different renders: (1) using the isolated room with a brute-force strategy, (2) using the full model without V-Map culling and (3) using the full model with V-Map culling. The first one gives us the ground truth points that should be rendered, as the room was manually isolated. The second and third methods were compared to the first one.

In Fig. 8 we can observe that the use of the V-Map effectively culls the points belonging to other rooms that should be occluded by walls or other surfaces, therefore producing images much more similar to the ground truth image. This becomes more noticeable when rendering in higher resolutions or when the user’s point of view is close to the boundaries of the room. Figure 9 shows a top-down view of the rendered points in Fig. 8 (highlighted in red). It can be noticed how the V-Map effectively culls the points belonging to the neighbouring rooms that are inside the camera frustum.

Fig. 9

Top-down view showing the rendered points in Fig. 8

We measured the excess of displayed points with respect to the ground truth render. On average, when using the V-Map, the number of incorrect points (i.e. points that should be occluded but were displayed) is 1.56% of the total displayed points, while the algorithm that does not use the V-Map produces 13.45% incorrect points.

3.3 Ablation tests

In this work, we introduce several strategies to improve the final rendering performance. First, we introduce a hierarchical structure for accelerating the rendering of the scene. Second, we quantise all vertex attributes to make it possible to fit larger point clouds into GPU memory. Lastly, we pre-compute the visibility in the scene to avoid drawing regions of the model that are not visible from the current user’s position. To demonstrate the contribution of each strategy to the final performance, we describe a set of tests in which we compare the average frame rate obtained with each version of the application.

In Fig. 10, we can observe that our final approach, including the hierarchical structure, the vertex buffer compression method and the pre-computed V-Map, significantly outperforms the other configurations. In the case of the Cyclotron model, we increased the frame rate more than three times with respect to using only CLOD with compression. We can also observe that this configuration is eight times faster than rendering the full point cloud.

It is also worth noticing the effects of vertex data compression. When we compress vertex data, the vertex shader has to unpack the data for every vertex, which slightly lowers the rendering performance. This performance loss is offset by our CLOD rendering technique, and the compression enables our approach to fit bigger models into memory.

Fig. 10

Rendering performance with the GeForce GTX 1070 system on the Cyclo-1 cm dataset. From top to bottom: (Naive) brute force rendering of the whole PLY model; (CLOD) rendering using the hierarchical structure; (CLOD Compressed) adding vertex buffer compression to the previous test; (CLOD Compressed + V-Map) adding the pre-computed visibility to the previous test

3.4 Memory footprint

In Table 7 we compare the file sizes for every approach. The main difference between the original PLY model and our files is due to the hierarchical structure information and the pre-computed V-Map. In both the Cyclo-1 cm and Cyclo-3 mm datasets the V-Map takes up the same amount of space (approx. 2 GB), since we use the same octree parameters and navigable space for both models.

Table 7 File size comparison. (Naive) raw point cloud file. (CLOD) hierarchically structured file. (CLOD Compressed) compressed version of the hierarchically structured file. (CLOD Compressed + V-Map) compressed hierarchical file that includes a pre-computed visibility map

Although the V-Map takes up a significant amount of space in the file, it is never uploaded to the GPU. Instead, it is loaded into RAM for our rendering algorithm to fetch the list of visible octree nodes in real time. This way, only vertex data needs to be uploaded to the GPU and can be compressed with the quantization method we show in this article, thus increasing the number of points we can fit into memory.

4 Conclusions

In this paper we have presented a technique for rendering massive indoor point clouds in Virtual Reality devices. The two main challenges we faced are related to (1) the non-stationary position of the view point and (2) the amount of data to store and process in real-time.

The first challenge is addressed by developing a one-shot rendering strategy that evaluates the full model for each frame in a deterministic way. For the second one, we have introduced two pre-computed strategies that benefit from the static nature of indoor point clouds: a continuous level of detail data structure (CLOD) and a visibility map (V-Map).

As the results show, the proposed CLOD hierarchical structure allows boosting performance by more than two times with respect to rendering the full cloud. Additionally, with a compression strategy based on data quantization, our approach can fit up to 894 million points in a 10GB GPU.

Our CLOD implementation is a general-purpose structure that can be used in any kind of environment. The frame rates achieved during the experiments clearly show that outdoor environments can also benefit from this technique. On the other hand, the V-Map explicitly targets the use case of indoor environments: taking into consideration the cluttered topology of buildings, it effectively reduces the number of points to evaluate at execution time by pre-computing visibility, therefore making it possible to render the point cloud without the need for any kind of splatting or other runtime shader-based strategies.

Using the V-Map can drastically boost real-time performance depending on the shape of the environment (i.e. the amount of disjoint spaces inside the building). In our case, for a medium-size building with large rooms such as the Cyclotron, we increased the frame rate more than three times with respect to using only CLOD with compression (and eight times faster than rendering the full cloud).

The combination of both strategies outperforms state-of-the-art techniques when rendering large indoor environments, whilst remaining comparable on smaller problems. Additionally, relying on the V-Map lowers the number of points to be processed, which improves the visual experience by rendering nearby spaces at full resolution and preventing the rendering of contiguous occluded spaces.

Future work will focus on an out-of-core extension of our rendering algorithm that will be capable of rendering point clouds that do not fit into GPU memory. This algorithm must fetch and load into GPU memory the octree nodes that are most likely to become visible before they are required for rendering. For that purpose, we may retrieve the nodes from the neighbouring V-Map cells that are not present in the cell for the current user position, thus reducing the size of the problem.