1 Introduction

Urban environments form a central and growing aspect of our lives—more than half the world's population lives in urban areas and eighty percent of the global GDP is generated in cities. Understanding and developing these environments requires the ability to quickly and accurately synthesize 3D virtual cityscapes at scale. Such models are important for applications ranging from the generation of training data for self-driving cars and environments for video games to urban studies and city planning. However, manually constructing urban 3D models is time-consuming, requiring artists proficient in architectural modeling or engineers to build parametric models. Our system, NeuroSculpt (Neural Sculptor), learns to synthesize large-scale and detailed cityscapes as 3D surfaces (Fig. 1) suitable for a wide range of urban applications.

Fig. 1

From a dataset of urban mesh (left, pink and purple) exemplars, we learn to synthesize novel urban meshes (yellow). Our system, NeuroSculpt, synthesizes large scale, highly detailed urban 3D meshes by drawing from a normal distribution. Bottom: a synthesized large scale mesh, with the three shown areas highlighted; it contains 2 million polygons and spans 1.6 km

Fig. 2

System overview. NeuroSculpt draws from the normal distribution and uses neural networks to generate a large-scale coarse heightmap (i), which is used to deform a tessellated plane (ii) to create a coarse mesh (iii). The fine network takes many different viewpoints on this mesh and, for each, suggests refinements by pushing and pulling the mesh towards or away from the view. The final mesh (v) is the combination of these coarse and fine sculpting operations. Below: a final output mesh with the detailed section outlined

The traditional approach to generating 3D models is manual creation by 3D artists. However, such manual modeling has several drawbacks. It is time consuming and requires highly skilled artists. Further, manual 3D modeling relies on artists or (in the case of procedural geometry systems) engineers to match the appearance and distribution of a generated model to the real world. Such approaches introduce a human bias—e.g., during the manual 3D modeling or in the design of parametric programs. For many applications which require realism (e.g., self-driving cars), it is desirable to accurately match the statistical distribution of urban models to the real world. NeuroSculpt generates kilometer-scale scenes by learning from exemplar training data in an unbiased fashion. Our learning and synthesis process can also be largely automated, considerably reducing the cost of urban mesh generation.

Data-driven 3D mesh synthesis is challenging because the memory and compute requirements for machine learning in 3D are high. Further, the synthesis of a mesh's discrete components—vertices and edges—remains a challenging problem at the cutting edge of machine learning due to the difficulty of learning over heterogeneous data. Inspired by artists' sculpting of clay objects, we introduce a pipeline which learns to sculpt digital clay. To learn these sculpting actions, we exploit the robustness, scalability, and memory efficiency of homogeneously structured 2D CNNs (Convolutional Neural Networks) to iteratively sculpt the surface of our digital clay into intricate 3D urban meshes (Fig. 2). By starting with coarse features and progressing to fine details we are able to synthesize highly detailed, concave, large-scale models. The input to NeuroSculpt is a random latent vector drawn from a Gaussian distribution, and the output is a cityscape represented by a 3D mesh of arbitrarily large extent. The complete pipeline is shown in Fig. 3. NeuroSculpt makes the following contributions:

  • A multi-resolution neural sculpting pipeline which allows for the generation of arbitrarily large meshes.

  • A linear system and set of neural networks trained for the generation of sculpting operations.

  • Trained networks and example generations for English towns.

Fig. 3

NeuroSculpt's input is a random latent code, which is used to generate an infinite-sized depth map (a) with Alis [1]. This top view is then used to sculpt the coarse mesh (b). Next, the coarse depth (d) and normal map (c) are rendered from the coarse mesh. The CUT neural network then translates the coarse normal map to a fine normal map (e). The coarse depth map and fine normal map are used by ND2D (Sect. 3.3) to generate the fine depth map (f), which is used to sculpt a patch of the coarse mesh into the final fine mesh (g). This sculpting (c–g) continues over many patches

Source code and model weights are provided at https://github.com/Misaliet/Learning-to-Sculpt-Neural-Cityscapes.

2 Related work

2.1 Pre-neural generative modeling

Virtual urban environments have become ubiquitous for city planning, video game environments, and mapping. Following this trend there are increasing numbers of 3D city model sources—both for reconstruction [2,3,4] of existing environments and synthesis of designed, planned, or fictional cityscapes. Synthesis techniques for urban geometry may be manual or procedural. In a manual pipeline, designers build meshes using specialised software [5] or guided by images [6]. Such manual approaches can be slow for massive urban areas. Another manual approach is to sculpt a deformable, digital-clay-like surface, as in commercial tools such as ZBrush [7] or Mudbox [8], although these techniques are typically applied to modeling organic forms. A faster approach is procedural modeling with approaches such as Split Grammars [9,10,11]—these models can be driven by various goals such as optimisation [12], inverse procedural modeling [13, 14], or directly from real-world data [15]. Procedural languages are able to quickly generate large urban environments but must be written by programmers—a highly skilled and time-consuming process, prone to human bias. Further, procedural grammars are not data-driven and cannot learn from example data.

Reconstruction techniques approximate the city-as-built from observations. We discuss highlights here and refer the reader to summaries from the graphics [2] and remote sensing [16] communities for additional depth. Reconstruction from 2D cartographic data sources is simple, but limited. Typical pipelines simply extrude building footprints [17] and create a roof [18], although modern cartographic standards such as CityGML [19, 20] are rapidly developing more nuance. With modern Geographic Information Systems (GIS) it is possible to create huge 3D urban meshes with these approaches [21], although mostly untextured. Point clouds (typically from Lidar or structure-from-motion systems) are converted to 3D mesh representations using techniques such as Poisson reconstruction [22, 23], planar patch identification and fusion [24,25,26], or semantic segmentation [27]. Reconstruction approaches vary—large-area aerial Lidar showing only roof heights [28] requires different approaches to that of dense, street-side Lidar capturing detailed facades [29]. Photogrammetry can be used to recover 3D information from aerial [3, 4] and street-level photography [30, 31]. Various models have been used to fit meshes to raw images, including procedural [32, 33], unwrappable [34], Markovian [35], or deep [36] models of 2D facades.

2.2 2D image synthesis

Texture synthesis is the generation of 2D texture images from exemplars or lower-dimensional representations; in this paper we apply such textures to describe sculpting actions over surfaces. Early, non-neural methods learn to generate textures with local and stationary properties, using techniques such as Markov fields to generate pixels [37, 38] or patches [39, 40] from an exemplar image. Improvements to these approaches generate larger, more varied textures [41]. Procedural languages to create [42] or edit images [43] are well adapted to 2D urban domains.

Texture synthesis using neural networks has become possible as the size of datasets has grown, along with the available computing capacity. Typical systems use convolutional neural networks (CNNs) and apply techniques including gradient descent in feature space [44], disentangling latent representations [45, 46], and GANs [47]. Another approach to texture generation is to transform an input image to a different style or domain. Such image-to-image translation [48, 49] has been applied to generate terrains [50] and building facades [15, 51]. The resolution of such networks is typically limited by available memory; a number of works attempt to create arbitrarily sized images by searching for matches [52], removing seams [53], synthesis over implicit neural functions [54], or interpolating over latent encodings [1].

Where there are no datasets of paired input and output images, unsupervised image-to-image networks use novel losses to generate images and textures. A typical approach is to use a cycle loss [55] which learns to translate from one image domain to a second, as well as from the second back to the first, at the same time. This reconstruction loss is applied alongside other losses, such as adversarial, for unsupervised texture learning. This technique has been extended to two or more image domains [56, 57], smaller/faster patches [58], and domain specific problems such as urban image [59] or video generation [60].

2.3 3D object synthesis

CNNs can create 3D objects by replacing their 2D kernels with 3D ones at the cost of significantly increased memory consumption. Techniques such as GANs [61] and recurrent networks [62] can be applied; applications include learning from point clouds [63] and reconstruction from 2.5D sketches [64]. To attempt to reduce the memory usage, many techniques have been applied including octrees [65], generation from 2D silhouettes which can carve volumetric representations [66, 67], and the use of sparse voxel representations [68]. Despite these techniques, synthesizing objects over 3D arrays of voxels has high memory requirements and can only model small areas.

An alternative to directly generating voxels is to learn to generate an implicit surface [69]. Here the network learns to transform a coordinate to a value—typically a signed distance [70], representing whether the point is inside or outside of the object, but variations such as binary occupancy [71] are also in use. These implicit surfaces can be converted to surface meshes using approaches such as marching cubes [72, 73] or tetrahedra [74]. Recent variations disentangle the mesh from its associated textures [75], use multi-resolution surface networks [76], or also generate normals [70].

Methods using voxels and implicit fields for 3D have seen more recent success than the direct generation of 3D meshes consisting of vertices and faces. Early work uses graph neural networks (GNNs) to generate meshes from single [77, 78] or multi-view [79, 80] images; such approaches typically work by deforming, rather than creating, meshes. Recent memory-intensive techniques, such as transformers, have been applied to directly generate vertex and face coordinates [81]. Other approaches have used a variety of higher-level modeling techniques and primitives, such as convex decomposition [82], topology autoencoders [83], parametric surface elements [84], or individual parts with contact-based reasoning [85]. The memory and compute intensity of these methods has meant that they are only used for single-object generation (e.g., a chair or single building).

2.4 2.5D terrain synthesis

DEMs (Digital Elevation Models) are used for environment generation in geospatial and cartography applications. They lift 2D data to a 3D terrain by providing an elevation at each location. Traditional methods include procedural modeling, simulation of erosion, and example-based methods [86]. Because a heightfield stores only a single elevation per location, representing and generating fully 3D landforms from it is effectively an ill-posed problem. Existing methods either use an implicit surface model [87] or model in the gradient domain instead of the elevation domain [88]. Further work studies how to optimize the 2D heightfield to achieve better control [89], generate higher resolution [90], or introduce different styles [91].

2.5 Large-scale urban environment generation

The above 2.5D terrain synthesis methods can generate high-quality landscape models, but often struggle with more complex 3D urban environments. BungeeNeRF [92] uses Neural Radiance Fields (NeRFs) [93] as the basic model and introduces progressive growing to achieve large-scale, multi-resolution 3D urban environment reconstruction. However, it is a reconstruction rather than a synthesis method. Furthermore, it does not generate 3D urban meshes directly; using algorithms such as marching cubes [94] to extract a mesh from BungeeNeRF's neural scene gives lower quality than its rendered views. CityDreamer [95] also uses NeRF to create cityscapes, but performs generation rather than the reconstruction of BungeeNeRF. CityDreamer uses a VQVAE [96] as a layout generator to produce an unbounded semantic map and height field, from which it creates a corresponding neural city scene. InfiniCity [97] and CityGen [98] generate infinite-sized 3D models of cityscapes using infinite depth, normal, or semantic maps generated by deep generative models. However, they do not refine the 3D model details, instead using image-to-image systems to render the final fine results, so the underlying 3D urban model remains of very low quality. More recently, BlockFusion [99] uses a diffusion model [100] to denoise a tri-plane representation [101] to create infinite neural scenes of villages, but the generation process is very slow. Our goal is to directly generate usable 3D urban meshes of arbitrary size in a way that balances efficiency and quality.

3 Method

To create large-scale cityscapes in 3D it is necessary to limit the memory requirements of our mesh synthesis. More memory-intensive approaches, such as voxel or transformer architectures, are unable to scale to such massive urban areas. To address this, we take inspiration from a human sculptor who is able to use 2D vision and many small sculpting actions (i.e., adding or removing material over a surface) to create complex 3D models. Digital sculpting is used to create organic models (e.g., humans or clothing) in existing manual tools such as ZBrush [7]; we extend this paradigm to learn to synthesize large-scale cityscapes with mixed organic (e.g., trees and vegetation) and man-made (e.g., buildings and streets) features over large urban areas.

Following the sculptor paradigm, we work in a coarse-to-fine fashion, using two neural networks to refine a tessellated mesh (Fig. 2), first to a coarse mesh, and then to a fine mesh. NeuroSculpt's sculpting operations are introduced in Sect. 3.1. By using 2D CNNs to describe these refinement operations, we are able to exploit existing work [1] to create massive seamless meshes (Sect. 3.2) and existing unsupervised refinement techniques [58] (Sect. 3.3). In order to apply the sculpting paradigm to man-made urban structures with many flat faces, we learn to generate sculpting actions through both depth-map and normal-map generation, which are fused using a linear system in Sect. 3.3.1. We present our results in Sect. 4 and evaluate them in Sect. 5.

3.1 The sculpting operation

NeuroSculpt is a sculpting pipeline to create large-scale urban environment 3D meshes. From synthesized depth maps and camera locations we sculpt the mesh into the expected shape by pushing and pulling mesh vertices towards or away from the camera. The creation of these depth maps is described in Sects. 3.2 and 3.3.

Given a set of vertices, a camera position, and a depth map, there is a projective mapping between the 2D map/image and the 3D mesh which can be computed from the camera parameters and vertex locations, as in Fig. 4. To sculpt, we modify the vertex positions by applying the depth offset specified by the generated depth map. The camera in NeuroSculpt is used to compute normal and depth maps from 3D meshes as well as locate vertices and directions (towards or away from the camera plane) when performing sculpting operations with new depth maps. The values in the depth maps are relative to the mesh surface.

Fig. 4

The sculpting operation. A camera (top left, grey) captures depth values (dashed lines) of the mesh (black solid lines). It is then used as a reference to move vertices given the sculpting operation. Vertices that are visible to the camera (those whose normal faces the camera and which are not occluded; red) are pulled towards or pushed away from the camera to the distance specified by the depth map. Vertices which have a back-facing normal (green) or are hidden (blue) are not moved

The pixel values in the generated depth map are interpreted as the new depth distances from the camera plane to the corresponding vertices. The procedure of pushing or pulling a vertex to its new depth values along the camera direction and creating a new mesh is the sculpting operation.

We use a camera with a scale appropriate to the network resolution in the NeuroSculpt pipeline; an orthogonal projection is used to reduce variance for the learning-aspects of the pipeline. The sculpting operation can either be performed over the whole mesh at once, or tile-by-tile to cover all areas in a piecemeal fashion.

For each sculpting operation over a single tile, we select the vertices within the camera frustum which are visible and whose normals are oriented towards the camera, as in Fig. 4. Their current depth values from the camera plane are calculated. We then measure the difference between each vertex's current depth value and the corresponding value in the generated depth map; the latter is computed from the nearest four pixels in the depth map with an edge-preserving linear interpolation filter. We move the selected vertices by this distance along the camera's view direction, thus sculpting the mesh towards the target depth map.
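For illustration, the sketch below applies a single tile's sculpting operation under an orthographic camera. It is a simplified sketch rather than our exact implementation: it uses plain bilinear sampling in place of the edge-preserving filter, omits occlusion testing, and the function and parameter names (e.g., sculpt_tile, max_move) are illustrative.

```python
import numpy as np

def sculpt_tile(vertices, normals, target_depth, cam_pos, view_dir,
                right, up, ortho_scale, max_move=5.0):
    """Push/pull front-facing vertices towards the generated depth map.

    Simplified sketch: bilinear sampling stands in for the edge-preserving
    filter, occlusion testing is omitted, and max_move is an illustrative
    clamp on the movement distance.

    vertices, normals : (N, 3) arrays; view_dir, right, up : unit 3-vectors
    target_depth      : (H, W) generated depth map (distance from camera plane)
    """
    H, W = target_depth.shape
    rel = vertices - cam_pos
    u, v = rel @ right, rel @ up          # image-plane coordinates (metres)
    depth = rel @ view_dir                # current distance from camera plane

    # Select vertices inside the frustum whose normals face the camera.
    facing = normals @ view_dir < 0.0
    inside = (np.abs(u) < ortho_scale / 2) & (np.abs(v) < ortho_scale / 2)
    sel = facing & inside

    # Map to pixel coordinates and sample the target depth bilinearly.
    px = (u[sel] / ortho_scale + 0.5) * (W - 1)
    py = (0.5 - v[sel] / ortho_scale) * (H - 1)
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    fx, fy = px - x0, py - y0
    d = (target_depth[y0, x0] * (1 - fx) * (1 - fy)
         + target_depth[y0, x1] * fx * (1 - fy)
         + target_depth[y1, x0] * (1 - fx) * fy
         + target_depth[y1, x1] * fx * fy)

    # Move along the view direction by the (clamped) depth difference.
    offset = np.clip(d - depth[sel], -max_move, max_move)
    out = vertices.copy()
    out[sel] += offset[:, None] * view_dir
    return out
```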

Fig. 5

Left: a mesh is shown after a sculpting operation. Right: after the two post-processing methods. Obvious spikes are removed from the mesh structure, as well as long skinny triangles. Note that other vertices are not impacted by this processing

3.2 Coarse 3D mesh sculpting and optimizations

The system employs a generative network based on a 2D CNN that is designed for consistent synthesis at scale. This network converts a latent vector (drawn from a normal distribution) into a depth map, which is used to sculpt a flat plane into a coarse urban cityscape.

The initial sculpting operation is applied using a vertical camera positioned over a mesh plane. This plane is subdivided to have one vertex for each pixel in the depth map; that is, the first sculpting operation is the application of a heightmap generated by a neural architecture.

We choose the Alis [1] architecture to learn to generate these coarse depth maps from our training dataset. It allows NeuroSculpt to create arbitrarily large-scale 3D meshes from these depth maps. It also serves as a template to ensure large-scale consistency over the mesh in the later, fine, sculpting stages. The Alis model generates seamless tile images between continuous noise vectors in latent space using StyleGAN2-ADA [46] as its backbone architecture. It modifies the AdaIN (Adaptive Instance Normalization) [102] mechanism in StyleGAN2-ADA to SA-AdaIN (Spatially Aligned AdaIN) to interpolate adjacent latent codes from the mapping network.

We observed that while CNNs are proficient at generating images of organic objects, generating depth maps with straight lines was problematic. We applied CoordConv [103] to remedy this. A CoordConv layer feeds a convolutional layer with additional input channels that give it access to its own spatial coordinates. Previous research [104] shows that CoordConv layers have a positive effect on the structural details of generated images, so we add a CoordConv layer to Alis to improve performance. We add one CoordConv layer before the input layer of the discriminator; this provides the generator with coordinate information during training through backpropagation of the gradient. Figure 6, bottom, illustrates the architecture of Alis after adding the CoordConv layer. This significantly improved the quality of our initial depth map synthesis.
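For illustration, a CoordConv-style layer can be written as below; this is a generic sketch of the idea in [103], with example hyperparameters, rather than the exact layer used in our modified Alis discriminator.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Convolution whose input is augmented with normalized (x, y)
    coordinate channels (generic CoordConv sketch, not the Alis code)."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # Two extra input channels carry the x and y coordinate grids.
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([xx, yy]).expand(b, -1, -1, -1)  # (B, 2, H, W)
        return self.conv(torch.cat([x, coords], dim=1))

# Example: a first discriminator layer for 1-channel depth/height maps.
# layer = CoordConv2d(1, 64, kernel_size=3, padding=1)
```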

Fig. 6

The generation process of the infinite-size heightmap image with Alis and the CoordConv layer is shown in the top part. The bottom part shows how we concatenate x and y coordinate information to the generator (pink block) output and where we add the CoordConv layer in the discriminator (blue block)

The training data for this coarse stage are urban depth maps rendered from a top view of the 3D urban meshes in a photogrammetry dataset of English towns. After training on this data, Alis draws from the normal distribution to generate arbitrarily sized depth maps; in the coarse stage, we generate a depth map with a resolution of \(256\times 4096\) pixels, or \(100 \times 1600\) meters. Given a synthesized depth map, we apply the first sculpting operation to create the coarse mesh. This nadir (top-down) map is used to vertically displace the vertices of a regular 2D grid to create a 3D terrain mesh.
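For illustration, the sketch below builds such a displaced grid mesh from a heightmap; the cell size assumes the coarse map scale of roughly 100 meters per 256 pixels, and the triangulation shown is one of several valid choices.

```python
import numpy as np

def heightmap_to_mesh(heightmap, cell_size=100.0 / 256.0):
    """Turn a nadir heightmap into a displaced grid mesh (sketch).

    Returns vertices V of shape (H*W, 3) and triangle indices F of shape
    (2*(H-1)*(W-1), 3); cell_size is an assumed ground sampling distance.
    """
    H, W = heightmap.shape
    xx, yy = np.meshgrid(np.arange(W) * cell_size, np.arange(H) * cell_size)
    V = np.stack([xx.ravel(), yy.ravel(), heightmap.ravel()], axis=1)

    # Two triangles per grid cell.
    i, j = np.meshgrid(np.arange(H - 1), np.arange(W - 1), indexing="ij")
    v00 = (i * W + j).ravel()
    v01, v10, v11 = v00 + 1, v00 + W, v00 + W + 1
    F = np.concatenate([np.stack([v00, v10, v11], axis=1),
                        np.stack([v00, v11, v01], axis=1)])
    return V, F
```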

After this process, the output mesh contains two types of artefacts. First, some pixels in the generated heightmap are outliers relative to the surrounding pixels. This noise leads to highly visible “spikes” in the 3D mesh, pointing towards the camera. Additionally, large vertex movements stretch the 3D mesh into long, skinny triangles. Not having enough vertices in such areas makes later operations ineffectual, while the unbalanced vertex density degrades the mesh quality.

NeuroSculpt uses two mesh post-processing methods to resolve the artefacts above and to improve the performance of later stages. For spike removal, we use an algorithm similar to that of Centin et al. [105]—we detect spike vertices which have at least one incident edge whose dihedral angle is larger than a threshold (\(90^{\circ }\) in our experiments). Each spike vertex is assigned a new position based on the mean of the neighboring vertices' positions. We run ten iterations of this algorithm after each sculpting operation.
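For illustration, the relaxation step can be sketched as follows; the one-ring neighbours and per-vertex maximum dihedral angles are assumed to be supplied by the surrounding mesh library, and a full implementation would recompute the dihedral angles between iterations.

```python
import numpy as np

def remove_spikes(vertices, neighbors, max_dihedral_deg,
                  threshold=90.0, iterations=10):
    """Move spike vertices to the mean of their one-ring neighbours (sketch).

    neighbors[i] is an index array of vertex i's one-ring; max_dihedral_deg[i]
    is the largest dihedral angle over its incident edges (both assumed to be
    precomputed; a full implementation recomputes angles every iteration).
    """
    V = vertices.copy()
    spikes = np.where(max_dihedral_deg > threshold)[0]
    for _ in range(iterations):
        if spikes.size == 0:
            break
        V[spikes] = np.stack([V[neighbors[i]].mean(axis=0) for i in spikes])
    return V
```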

The system resamples the mesh to remove long skinny triangles; this is implemented by isotropic surface remeshing [106], which regularizes triangle sizes over a threshold (0.5 meters in our experiments). This remeshing also increases our mesh resolution if required, and has the advantage of not moving other vertices. Figure 5 shows the results after two mesh post-processing techniques have been applied.

3.3 Fine sculpting

The coarse mesh is large-scale and contains many interesting features at a variety of scales. However, as can be seen in the top row of Fig. 10, there are some limitations—man-made urban features (e.g., building walls and roofs) are blobby rather than flat, the mesh is convex because we have only moved vertices in a single direction, and much fine detail is missing, particularly on vertical faces. In the fine sculpting stage of NeuroSculpt we apply repeated sculpting operations from a variety of angles to address these shortcomings.

3.3.1 Linear optimization for fine depth maps

Depth synthesis networks often have difficulty sculpting perfectly flat surfaces. To address this flatness issue when sculpting urban spaces with CNNs, we compute the depth map for the fine sculpting operation from a synthesized normal map (rather than directly synthesizing depth maps). In our experiments, we found it easier for the network to learn the local derivative (the normal), and in particular a constant normal over a flat patch, than to learn a depth that varies with constant gradient. We hypothesise this is because learning a kernel that outputs a constant normal is simpler than one that outputs a changing depth. Therefore, we synthesize fine normal maps instead of depth maps; however, we must convert these to depth maps for the sculpting operation. To achieve this, we take inspiration from [107] and construct a 2D linear system whose solution is a fine depth map. The system, ND2D (Normal and Depth to Depth), is constructed from a rendered depth map of the coarse mesh and a synthesized normal map from the CNN (Sect. 3.3.2).

We solve a linear system to find the depth of every pixel, \(\Delta _{i,j}\), in the fine depth map by minimizing an objective function. To compute these we take an input depth map, d, of resolution \(p \times q\) with pixels \(d_{i,j}\), and a target normal map, n, of the same resolution, with pixels \(n_{i,j}\) that can be converted to 3D vectors representing normal values.

Fig. 7

Top row: The normal map used for Linear Optimization. The normal vector (black arrow) can be sampled from the pixel value on the top grid which represents a normal map. We show the central pixel’s normal decomposition in 3D space and its decomposed normal vector (red and yellow arrows) in the YZ and XZ plane on its right. We use these two decomposed normal values as optimization goals for the linear optimization over two directions in 2D image space (shown with red and yellow arrows). The central row shows a coarse depth map and normal calculation from it. The same normal (red arrow) on the XZ plane in 2D space is shown on its right. This normal value of the red arrow is calculated by the adjacent pixel’s grayscale value difference. Bottom row: after performing the linear optimization to create the fine depth map, we observe corrected normals—similar to that in the normal map, but with minimal changes to the depth value

We solve for the pixels of the depth map to constrain the output normals, nc, to the target normals: \(n = nc\). We decompose the target 3D normal vector n into two vectors: \(n_{yz}\) and \(n_{xz}\) in 2D. \(n_{yz}\) is the projection of n on the YZ plane in 3D and \(n_{xz}\) is the projection of n on the XZ plane. These two decomposed normal vectors can also be calculated from the depth map as \(nc_{yz}\) and \(nc_{xz}\) in the vertical and horizontal directions in 2D; this is done by taking the difference with the adjacent pixel value vertically and horizontally. Figure 7 illustrates this decomposition and calculation. Assume \(n(i, j) = (a_{i,j}, b_{i,j}, c_{i,j})\) and \(\Delta _{i,j}\) is the value of pixel \((i, j)\) in the fine depth map.

$$\begin{aligned} nc_{yz}(i, j)&= \left( \Delta _{i,j}-\Delta _{i+1,j},\ 1\right) \\ nc_{xz}(i, j)&= \left( \Delta _{i,j}-\Delta _{i,j+1},\ 1\right) \\ n_{yz}(i, j)&= \left( b_{i,j}/c_{i,j},\ 1\right) \\ n_{xz}(i, j)&= \left( a_{i,j}/c_{i,j},\ 1\right) \end{aligned}$$
(1)

The output fine depth map should satisfy the normal constraints in both the YZ and XZ planes. We therefore add the following horizontal and vertical terms, summed over all p rows and q columns; these ensure that the differences between \(nc_{yz}\), \(nc_{xz}\) and \(n_{yz}\), \(n_{xz}\) are small after the optimization.

$$\begin{aligned} \text {Goal}_h = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |nc_{yz}(i,j) - n_{yz}(i,j)|\end{aligned}$$
(2)
$$\begin{aligned} \text {Goal}_v = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |nc_{xz}(i,j) - n_{xz}(i,j)|\end{aligned}$$
(3)

One normal map could correspond to many different depth maps. Therefore, we add a term to promote stability by reducing the change in depth; that is, the difference between the original depth map \(d_{i,j}\) and the fine depth map \(\Delta _{i,j}\) should be as small as possible. This is added as an objective term in the ND2D system:

$$\begin{aligned} \text {Goal}_d = \text {min} \sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |d_{i,j} - \Delta _{i,j}|\end{aligned}$$
(4)

The view direction of the camera leads to discontinuities in the 2D depth maps, for example along the edge between a roof and ground that is occluded from the camera. Pixels near such edges are not adjacent in 3D space, since the camera cannot capture the occluded areas, and their differences do not correspond to surface normals. Therefore, we create edge images to detect discontinuous edges in the depth maps. These edge images are created using a Sobel filter which identifies the pixels adjacent to depth discontinuities. We can then remove the \(\text {Goal}_h\) and \(\text {Goal}_v\) constraints for every pixel that is adjacent to, or on, such edges.

The horizontal and vertical constraints are updated as:

$$\begin{aligned} \text {Goal}_h = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |\left( nc_{yz}(i,j) - n_{yz}(i,j)\right) * E(i, j)|\end{aligned}$$
(5)
$$\begin{aligned} \text {Goal}_v = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |\left( nc_{xz}(i,j) - n_{xz}(i,j)\right) * E(i, j)|\end{aligned}$$
(6)

where \(E(i, j)\) is 0 when pixel \((i, j)\) lies on, or is adjacent to, such an edge, and 1 otherwise.
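For illustration, such an edge image can be computed from the coarse depth map with a Sobel filter as sketched below; the gradient threshold and the one-pixel dilation used to flag adjacent pixels are illustrative choices rather than tuned values.

```python
import numpy as np
from scipy import ndimage

def depth_edge_mask(depth, threshold=1.0):
    """Return True where a pixel lies on, or adjacent to, a depth
    discontinuity (so E = 0 there); the threshold and dilation radius
    are illustrative choices."""
    gx = ndimage.sobel(depth, axis=1)
    gy = ndimage.sobel(depth, axis=0)
    edges = np.hypot(gx, gy) > threshold
    # Also flag pixels adjacent to an edge pixel.
    return ndimage.binary_dilation(edges, iterations=1)
```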

Finally, we weight the three groups of terms with different lambda values. The objective function of the linear system is then:

$$\begin{aligned} \text {Goal} = \lambda _d \times \text {Goal}_d + \lambda _h \times \text {Goal}_h + \lambda _v \times \text {Goal}_v \end{aligned}$$
(7)

By minimizing this function, we find the value of every pixel \(\Delta _{i,j}\) of the fine depth map. We use Gurobi [108], a fast solver for mathematical optimization, to solve our linear system.
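For illustration, the sketch below solves a simplified least-squares variant of the ND2D objective with SciPy's sparse LSQR solver, in place of the L1 objective we solve with Gurobi; the variable names, the epsilon guarding against grazing normals, and the weighting scheme are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def nd2d_least_squares(coarse_depth, normal_map, edge_mask,
                       lam_d=0.1, lam_yz=1.0, lam_xz=1.0):
    """Least-squares sketch of ND2D (the L1/Gurobi formulation is replaced
    by LSQR; lam_yz and lam_xz play the role of the lambda weights, which
    are both 1.0 in our experiments).

    coarse_depth: (p, q) depths d_{i,j}
    normal_map:   (p, q, 3) unit normals n = (a, b, c)
    edge_mask:    (p, q) bool, True on/next to a depth discontinuity (E = 0)
    """
    p, q = coarse_depth.shape
    n = p * q

    def diff_op(axis):
        # Rows encode Delta[i,j] - Delta[i+1,j] (axis 0) or Delta[i,j] - Delta[i,j+1] (axis 1).
        rows, cols, vals, r = [], [], [], 0
        for i in range(p):
            for j in range(q):
                i2, j2 = (i + 1, j) if axis == 0 else (i, j + 1)
                if i2 >= p or j2 >= q:
                    continue
                rows += [r, r]
                cols += [i * q + j, i2 * q + j2]
                vals += [1.0, -1.0]
                r += 1
        return sp.csr_matrix((vals, (rows, cols)), shape=(r, n))

    Dv, Dh = diff_op(0), diff_op(1)
    a, b = normal_map[..., 0], normal_map[..., 1]
    c = np.clip(normal_map[..., 2], 1e-6, None)   # avoid division by zero
    tv = (b / c)[:-1, :].ravel()                  # target vertical differences
    th = (a / c)[:, :-1].ravel()                  # target horizontal differences

    # Zero-weight differences that touch a discontinuity pixel (E = 0).
    keep_v = ~(edge_mask[:-1, :] | edge_mask[1:, :]).ravel()
    keep_h = ~(edge_mask[:, :-1] | edge_mask[:, 1:]).ravel()
    Wv = sp.diags(lam_yz * keep_v.astype(float))
    Wh = sp.diags(lam_xz * keep_h.astype(float))

    A = sp.vstack([Wv @ Dv, Wh @ Dh, lam_d * sp.identity(n)]).tocsr()
    rhs = np.concatenate([Wv @ tv, Wh @ th, lam_d * coarse_depth.ravel()])
    delta = lsqr(A, rhs, atol=1e-8, btol=1e-8)[0]
    return delta.reshape(p, q)
```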

After the linear optimization, the fine depth maps are generated. We use these maps in sculpting operations to deform the coarse mesh into the final fine mesh.

3.3.2 Fine normal map generation

To achieve details on all sides of the mesh, the fine sculpting operation is repeated from different directions. For each direction, we position the camera, render the required inputs, compute the fine depth map, and then apply the sculpting operation.

To position the camera for fine normal map generation, the system uses preset camera parameters. The camera's pitch angle is \(45^{\circ }\) down from the horizontal. This value minimizes occlusions while allowing the sides and roofs of buildings to be sculpted, creating concave and overhanging features. The camera's yaw angle switches between \(0^{\circ }\) and \(180^{\circ }\); in our experiments, these two angles gave the best balance between result quality, efficiency, and artifacts. The camera's roll angle is fixed to \(0^{\circ }\). The height of the camera (its Z value) is fixed; the camera moves along the X axis to cover successive tiles, while its Y position is constant for each of the two view directions. This is shown in Fig. 8.

Fig. 8

Top: our camera (yellow and red parallelogram) positions and directions for the front (first iteration) and back (second iteration) fine depth map generation. The bottom part shows camera (yellow and red rectangle) positions over the whole coarse mesh; our camera positions are calculated to span the coarse mesh without gaps or repeats on either side

Fig. 9

Results for fine depth and fine normal maps. Column a shows the original depth maps from coarse meshes; b illustrates the 3D mesh result of a sculpting operation on (a); c shows the normal maps from coarse meshes; d gives the processed fine normal maps generated by CUT; e demonstrates the fine depth maps generated by our ND2D linear system; f shows the final 3D mesh after directly sculpting a plane with (e)

During runtime, the fine sculpting operation is performed multiple times over a coarse mesh (shown in the bottom part of Fig. 8). The front pass, with a \(0^{\circ }\) camera yaw angle, is performed first. We set the camera's Y position so that it captures a whole tile based on the short edge length of the coarse mesh. The camera is then moved along the X axis to ensure all tiles on the mesh are modified by sculpting operations. Next, we move the camera to the other side of the mesh and rotate it to a \(180^{\circ }\) yaw angle for the second row of sculpting operations along the back of the coarse mesh. The camera positions and movements of the two passes are shown at the bottom of Fig. 8 as yellow rectangles. After the front and back deformations, the final fine mesh is complete. Using this technique there is a chance that some vertices may be moved twice, but we find the network is robust to such situations. Further implementation details are discussed in Sect. 3.5.

To generate the fine normal maps, the system must learn to translate poor-quality coarse normal maps to high-quality fine normal maps. However, as there is no paired dataset matching our coarse meshes, we must learn from unpaired data. Therefore, we generate fine normal maps with CUT [58], an unpaired image-to-image translation neural network. CUT takes as input rendered normal maps of the mesh (Fig. 9, column c) and outputs clean normals (Fig. 9, column d). The training data for CUT are rendered from our coarse 3D meshes and the ground-truth English town meshes. As above, for each direction, the output of a CUT execution is processed by ND2D to create a depth map which is applied to the mesh as a sculpting operation.
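For illustration, the overall fine-sculpting sweep can be organised as in the sketch below; the callables stand in for the rendering, CUT translation, ND2D solve, and per-tile sculpting steps described above, and none of the names are taken from our code.

```python
def fine_sculpt(mesh, tile_width, num_tiles,
                render_maps, cut_translate, nd2d, sculpt_tile):
    """Sweep a 45-degree camera along X for front (yaw 0) and back (yaw 180)
    passes (orchestration sketch; all callables are placeholders)."""
    for yaw in (0.0, 180.0):
        for t in range(num_tiles):
            cam = dict(x=t * tile_width, yaw=yaw, pitch=-45.0, roll=0.0)
            coarse_depth, coarse_normal = render_maps(mesh, cam)
            fine_normal = cut_translate(coarse_normal)     # CUT network
            fine_depth = nd2d(coarse_depth, fine_normal)   # linear system
            mesh = sculpt_tile(mesh, cam, fine_depth)      # push/pull vertices
    return mesh
```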

3.4 Complete NeuroSculpt pipeline

To summarize, we first generate a coarse heightmap image which is applied with a single sculpting operation to a tessellated plane to create our coarse mesh. This is refined by a number of fine sculpting operations from different directions. For each direction, normal maps are rendered and processed by a network to refine and improve them; these are processed by ND2D to create a depth map which can be applied. Once all these fine sculpting operations have been applied to the mesh, NeuroSculpt is complete and outputs the finished mesh. The whole procedure of NeuroSculpt is shown in Fig. 3.

3.5 Dataset, training, and engineering details

Our training data comes from a photogrammetric source of an English town, as shown in the top left of Fig. 1. We render single-channel grayscale heightmap images from the top view of these 3D meshes at a \(256 \times 256\) resolution to train the coarse network, Alis [1]. The heightmap images are normalized to a range of 0 to 60 meters. We enable augmentation with the parameter “bg” for Alis and train it for 6 days on 2 NVIDIA V100 GPUs. The normal maps for CUT [58] are rendered directly from our coarse mesh results using Blender [109]. The ground-truth normal maps are rendered from the English town meshes. These images are also \(256 \times 256\) resolution. We render 16-bit depth maps for the ND2D linear system optimization; the maximum depth is 131.0 meters and the resolution is \(256 \times 256\). We use the default parameters to train CUT for 2 days (400 epochs) on a single NVIDIA Titan V GPU. The fine depth maps are created with lambda values \(\lambda _d = 0.1\), \(\lambda _h = 1.0\), and \(\lambda _v = 1.0\) for ND2D.

Fig. 10

Our final outputs, shown in the bottom row, span more than one kilometer. The middle row highlights the details inside the black rectangles, and the top row shows the corresponding coarse meshes. Note that the vertical (height) resolution of these meshes is substantially improved during the fine sculpting phase. In addition, minor repetitive artifacts introduced by the CNN at the coarse stage are corrected, and the quality of flat vertical surfaces, such as walls, is notably enhanced

Because fine depth maps for sculpting are generated multiple times, from different directions over the same areas, our ND2D model cannot guarantee that they are continuous across adjacent tiles. To address this potential seam problem after sculpting, we retain some overlapping area when rendering the coarse mesh depth and normal maps, and use linear alpha interpolation over the overlapping areas to ensure the continuity of the final result. This removes the most obvious seams in NeuroSculpt; one example is shown in (c) and (d) of Fig. 11.
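For illustration, the blend over the overlap between two horizontally adjacent depth tiles can be sketched as below; the overlap width and the direction of the linear ramp are assumptions.

```python
import numpy as np

def blend_overlap(left_tile, right_tile, overlap):
    """Blend two horizontally adjacent depth tiles sharing `overlap` columns
    with a linear alpha ramp across the shared region (sketch)."""
    alpha = np.linspace(1.0, 0.0, overlap)[None, :]   # 1 -> 0, left to right
    seam = alpha * left_tile[:, -overlap:] + (1 - alpha) * right_tile[:, :overlap]
    return np.concatenate(
        [left_tile[:, :-overlap], seam, right_tile[:, overlap:]], axis=1)
```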

For all images, the camera uses an orthogonal camera scale of 80 meters. We compute the approximate height of the ground plane from the thousand lowest vertices in the mesh and do not move these vertices, which avoids obvious discontinuities in silhouetted areas. In addition, after the first fine sculpting iteration, we use a stricter condition: we only move vertices whose normals lie inside the semicircle facing the camera direction. This avoids changing parts that have already been sculpted by earlier sculpting operations. Because the precision of the image is limited even for 16-bit depth maps, vertices at different positions can project into the same pixel; the system therefore performs linear interpolation when sampling the fine depth maps. Finally, a maximum movement distance is used to prevent vertices from being moved to an inappropriate position.

4 Results

Our intermediate 2D results are shown in Fig. 9. Depth and normal maps rendered from coarse meshes are shown in columns (a) and (c). Column (b) shows the corresponding coarse 3D results of sculpting with the depth map. Column (d) shows the normal maps output by the fine CUT neural network, with the processed depth maps generated by our ND2D linear system in column (e). Finally, column (f) gives the final 3D result after sculpting with the fine depth map.

Figure 10 illustrates our 3D results. The synthesized 3D meshes, with a scale of \(100 \times 1600\) meters, are shown at the bottom of this figure. It shows four areas in detail, with both their coarse and fine stage results. The improvement made by our complete pipeline can be seen by comparing the two stages: the shapes of trees, woods, and houses in the final result are much closer to the training data, and the smoothness of all house surfaces is significantly improved. NeuroSculpt successfully introduces concave regions and overhangs to vertical surfaces during the fine sculpting stage. Such concave details can be seen in the top left part of the tiles in the third column of Fig. 10. Trees in the final results do not contain trunks since trees in the training data do not contain trunks. Detailed results showing trees, houses, walls, and bushes are presented in Fig. 11; other details can be found in the additional materials.

Fig. 11

The images inside the green rectangle illustrate the improvements of the fine stage results (row b) over the coarse stage results (row a). From left to right, trees, houses, and bushes are shown. The images inside the blue rectangle show the removal of a seam by our blending of adjacent depth maps in Sect. 3.5; we show the coarse mesh including a seam (c; light blue rectangle) caused by Alis, and the fine mesh which removes the seam entirely (d). The red rectangle (row e) gives examples of artifacts created by the pipeline. From left to right: low-quality results caused by Alis that cannot be fixed by NeuroSculpt; an uneven roof caused by CUT that cannot be repaired; a height difference in one house's roof caused by sculpting operations from two different directions; regular cuboid-type buildings that cannot be handled well

5 Evaluation

Evaluating the quality of generative models is difficult as our goal is the synthesis of statistically similar outputs, rather than verbatim reproduction of the exemplars. Because we are doing 3D generation rather than reconstruction, it is not appropriate to use direct comparison to a mesh ground truth such as Hausdorff or chamfer distance. Therefore, our evaluation uses statistical similarities to compare our results to the entire ground truth dataset and a user study to assess the quality of our pipeline.

To evaluate the NeuroSculpt pipeline, we compute the statistical similarity to the ground-truth meshes using FID (Fréchet Inception Distance) [110] and LPIPS (Learned Perceptual Image Patch Similarity) [111] scores. Because FID and LPIPS process 2D features we capture 3D meshes’ features as images. To do this, we create normal maps from 3D meshes, both from the two stages of our system and for the ground-truth meshes. The normal map is well suited to capture the 3D meshes’ characteristics, 3D object shapes, and surface flatness or textures.

To avoid introducing bias, we use random camera positions and directions to render the normal maps for the FID and LPIPS calculations. During the rendering process, we add some restrictions (e.g., the camera position cannot be too far from the mesh) to ensure that blank scenes are not captured. This is illustrated in Fig. 12. We use a 20 meter orthogonal camera scale and the normal map patches are rendered at \(64 \times 64\) pixel resolution. We render 10,000 patches for each data domain and calculate FID and LPIPS scores with these patches.
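For illustration, the scores can be computed over the rendered patches with the torchmetrics and lpips packages as sketched below; pairing real and generated patches one-to-one for LPIPS is a simplification, and the preprocessing shown may differ from our evaluation code.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
import lpips

def evaluate_patches(real_patches, fake_patches):
    """FID and mean LPIPS over rendered normal-map patches (sketch).

    real_patches, fake_patches: uint8 tensors of shape (N, 3, 64, 64).
    """
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_patches, real=True)
    fid.update(fake_patches, real=False)
    fid_score = fid.compute().item()

    # LPIPS expects float images scaled to [-1, 1]; patches are paired 1:1 here.
    lpips_fn = lpips.LPIPS(net="alex")
    to_float = lambda x: x.float() / 127.5 - 1.0
    with torch.no_grad():
        dists = lpips_fn(to_float(real_patches), to_float(fake_patches))
    return fid_score, dists.mean().item()
```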

Fig. 12

Bottom: Camera configuration during the evaluation. We show three cameras (yellow rectangles) at random positions with rotations given by yaw (blue) and pitch (green). Top: Eight examples of rendered normal patches

To give context for comparison, we also evaluate against another large-scale 3D urban mesh dataset: the LoD 2.2 geometry from the Netherlands, 3D BAG [112]. A representative example is shown in Fig. 13. This dataset was selected because, while the urban content is similar, the style of the mesh is significantly different as it was generated from a cartographic database of building footprints. The dataset also omits non-building objects, such as trees and streets. Comparing the quality of the Netherlands dataset and our training dataset, the algorithmically generated LoD model is a cleaner mesh with fewer polygons than the photogrammetry model. We emphasize, however, that our goal is to generate cityscapes as close to the training data as possible, and we use the Netherlands dataset as an alternative data source with a distinct style against which to evaluate our approach.

Table 1 shows the results of FID evaluation of different mesh datasets against our training data - English town photogrammetry. The FID value compares the learned activations of a network to give a statistical estimate of image similarity. We can infer from the table that even though the coarse mesh is not of high quality, it still has a lower FID than the Netherlands dataset which means it is statistically closer to the ground-truth meshes. Further, as our pipeline progresses to the fine sculpting stage, the distance decreases, showing improved similarity to the training data. Table 2 shows the results of LPIPS evaluation which provides an estimate of human perceptual similarity between the meshes. We observe similar patterns to the FID measure - that our coarse mesh more closely matches the training data than the Netherlands’ dataset, and the fine stage mesh improves further.

Fig. 13

One example tile of the Netherlands 3D dataset

Table 1 FID scores for comparison against the ground-truth dataset
Table 2 LPIPS scores for comparison against the ground-truth dataset

We also conduct a user study to evaluate our results through direct human comparison. Our user study evaluates the perceived similarity to the training data of our generated results as well as of the Netherlands dataset. We designed a questionnaire asking users to choose which of three alternative images has the most similar appearance and buildings to a sample from the ground-truth 3D models. Specifically, randomly chosen ground-truth meshes from random viewpoints were rendered as the reference images. The experiment then provides three rendered images: one from the Netherlands dataset, one from our coarse stage results, and one from our fine stage results. These are presented in a random order and users are prompted to choose which one is most similar to the given reference. See the additional materials for example study screens. Comparison trials are performed by 30 participants over 20 triplets of images.

At the start of the user study, we collected information on participants' self-assessed experience of 3D graphics by asking about their viewing, generating, and editing of 3D models. Among our participants, 16.7% create or edit 3D models more than twice a month, 30% less than twice a month, while the remaining 53.3% have no relevant experience. 6.7% of participants have a professional level of experience with 3D graphics, 16.7% are studying graphics or related disciplines, 40% are interested amateurs, while 36.6% do not have any 3D graphics knowledge.

Figure 14 shows the results of our user study. Our fine stage results are chosen as the most similar to the ground-truth distribution in the majority of trials. Moreover, our fine stage advantage is particularly strong against the Netherlands dataset and our coarse stage results. The results of our coarse stage also beat the Netherlands dataset by a small margin.

Fig. 14

User study results. Our fine stage results are selected by participants to be the most similar to the ground-truth distribution

6 Limitations and discussion

There are several limitations to our NeuroSculpt pipeline. First, our height maps are obtained from Alis [1] using a modified StyleGAN2-ADA [46] generator, which means that the quality is limited by these architectures. One limitation is that Alis can generate infinite-size images, but only in the horizontal direction; this means that our final output has an extreme aspect ratio. Our work is not focused on solving this issue of Alis, but in future work we hope to adjust the interpolation of the latent vector in Alis to bi-linear interpolation in both the horizontal and vertical directions. Generally, Alis has no guarantee that there will never be visible seams, and this is true of our system too; a seam caused by Alis is shown in row (c) in Fig. 11. Further, we observed that the generated shapes in the height maps are not always reasonable, and some strangely shaped buildings in our results are not fixed by our pipeline (first and last images in row (e) in Fig. 11). We noted similar issues with the CUT network (second image in row (e) in Fig. 11); there is a small chance it will generate fine normal maps that are uncharacteristic of the target data. These can be seen in the meshes provided in the additional materials.

Another limitation of our architecture is that fine normal maps generated from different views may not perfectly correspond to each other, because our neural network cannot recognise the same object in the same area from two different viewpoints. This sometimes causes the results of two sculpting operations to conflict. Our Gurobi optimization model has the same issue. This is shown in the third image in row (e) in Fig. 11.

Our pipeline is slow because the optimization steps for dense meshes require extensive computation. The sculpting operation is also not perfect. Because, as mentioned above, the fine depth maps used in the two different sculpting directions are not guaranteed to correspond to each other, the second sculpting iteration sometimes destroys parts that were deformed during the first iteration and creates visible artifacts. We add several restrictions to the sculpting process to avoid these situations, but we cannot entirely prevent them from happening.

The accuracy of our sculpting pipeline can also be improved. Real depth values are quantized to the minimum pixel value unit (even with 16-bit bitmaps), and many vertices are projected into the same pixel, which causes artifacts. Our sculpting operations are applied to dense meshes, which amplifies this inaccuracy.

Additionally, our NeuroSculpt pipeline can create overhangs and trees, but it can only model genus-zero surfaces. We hope to use the work of [113] to address this in the future.

Finally, we face the general difficulty of evaluating generative systems against ground truths. NeuroSculpt is a generative system whose input is a latent vector drawn from a Gaussian distribution. This prevents us from controlling the generated content and makes it impossible to directly compare the output with a ground truth. We can only evaluate our results by measuring the similarity of the final results through FID, LPIPS, and user studies.

In future work, we would like to add a super-fine stage for an extra level of detail in our results. To make the outputs of all stages consistent, we also want to engineer a general framework that integrates more complex neural networks with the whole pipeline, for example by backpropagating gradients through the Gurobi solve. Complementary colour texture generation is another avenue of improvement we wish to try. Finally, we wish to extend our work to on-demand generation as in [114] and combine NeRF [115] and SLAM (Simultaneous Localization and Mapping) approaches to synthesize cityscapes in real time.

7 Conclusion

We have presented our NeuroSculpt pipeline for sculpting large-scale 3D urban environment meshes by learning from photogrammetry. This is achieved by computing 3D surface deformations using 2D image generation and optimization, which reduces memory costs and computational complexity. Our results demonstrate high-quality meshes at scale, showing realistic generation of complex concave meshes.