1 Introduction

Urban environments form a central and growing aspect of our lives—more than half the world's population lives in urban areas and eighty percent of the global GDP is generated in cities. Understanding and developing these environments requires the ability to quickly and accurately synthesize 3D virtual cityscapes at scale. Such models are important for applications ranging from the generation of training data for self-driving cars and environments for video games to urban studies and city planning. However, manually constructing urban 3D models is time-consuming, requiring artists proficient in architectural modeling or engineers to build parametric models. Our system, NeuroSculpt (Neural Sculptor), learns to synthesize large-scale and detailed cityscapes as 3D surfaces (Fig. 1) suitable for a wide range of urban applications.

Fig. 1

From a dataset of urban mesh (left, pink and purple) exemplars, we learn to synthesize novel urban meshes (yellow). Our system, NeuroSculpt, synthesizes large scale, highly detailed urban 3D meshes by drawing from a normal distribution. Bottom: a synthesized large scale mesh, with the three shown areas highlighted; it contains 2 million polygons and spans 1.6 km

Fig. 2

System overview. NeuroSculpt draws from the normal distribution and uses neural networks to generate a large-scale coarse heightmap (i), which is used to deform a tessellated plane (ii) to create a coarse mesh (iii). The fine network takes many different viewpoints on this mesh and, for each, suggests refinements by pushing and pulling the mesh towards or away from the view. The final mesh (v) is the combination of these coarse and fine sculpting operations. Below: a final output mesh with the detailed section outlined

The traditional approach to generating 3D models is manual creation by 3D artists. However, such manual modeling has several drawbacks. It is time consuming and requires highly skilled artists. Further, manual 3D modeling relies on artists or (in the case of procedural geometry systems) engineers to match the appearance and distribution of a generated model to the real world. Such approaches introduce a human bias—e.g., during the manual 3D modeling or in the design of parametric programs. For many applications which require realism (e.g., self-driving cars), it is desirable to accurately match the statistical distribution of urban models to the real world. NeuroSculpt generates kilometer-scale scenes by learning from exemplar training data in an unbiased fashion. Our learning and synthesis process can also be largely automated, considerably reducing the cost of urban mesh generation.

Data-driven 3D mesh synthesis is challenging because the memory and compute requirements for machine learning in 3D are high. Further, the synthesis of a mesh's discrete components—vertices and edges—remains a challenging problem at the cutting edge of machine learning due to the difficulty of learning over heterogeneous data. Inspired by artists' sculpting of clay objects, we introduce a pipeline which learns to sculpt digital clay. To learn these sculpting actions, we exploit the robustness, scalability, and memory efficiency of homogeneously structured 2D CNNs (Convolutional Neural Networks) to iteratively sculpt the surface of our digital clay into intricate 3D urban meshes (Fig. 2). By starting with coarse features and progressing to fine details we are able to synthesize highly detailed, concave, large-scale models. The input to NeuroSculpt is a random latent vector drawn from a Gaussian distribution, and the output is a cityscape represented by a 3D mesh of arbitrarily large extent. The complete pipeline is shown in Fig. 3. NeuroSculpt makes the following contributions:

  • A multi-resolution neural sculpting pipeline which allows for the generation of arbitrarily large meshes.

  • A linear system and set of neural networks trained for the generation of sculpting operations.

  • Trained networks and example generations for English towns.

Fig. 3

NeuroSculpt's input is a random latent code, which is used to generate an infinite-sized depth map (a) with Alis [1]. This top view is then used to sculpt the coarse mesh (b). Next, the coarse depth (d) and normal map (c) are rendered from the coarse mesh. The CUT neural network then translates the coarse normal map to a fine normal map (e). The coarse depth map and fine normal map are used by ND2D (Sect. 3.3) to generate the fine depth map (f), which is used to sculpt a patch of the coarse mesh into the final fine mesh (g). This sculpting (c–g) continues over many patches

Source code and model weights are provided at https://github.com/Misaliet/Learning-to-Sculpt-Neural-Cityscapes.

2 Related work

2.1 Pre-neural generative modeling

Virtual urban environments have become ubiquitous for city planning, video game environments, and mapping. Following this trend there are increasing numbers of 3D city model sources—both for reconstruction [2,3,4] of existing environments and synthesis of designed, planned, or fictional cityscapes. Synthesis techniques for urban geometry may be manual or procedural. In a manual pipeline, designers build meshes using specialised software [5] or guided by images [6]. Such manual approaches can be slow for massive urban areas. Another manual approach is to sculpt a deformable, digital-clay-like surface, as in commercial tools such as ZBrush [7] or Mudbox [8], although these techniques are typically applied to modeling organic forms. A faster approach is procedural modeling with approaches such as Split Grammars [9,10,11]—these models can be driven by various goals such as optimisation [12], inverse procedural modeling [13, 14], or directly from real-world data [15]. Procedural languages are able to quickly generate large urban environments but must be written by programmers—a highly skilled and time-consuming process, prone to human bias. Further, procedural grammars are not data-driven and cannot learn from example data.

Reconstruction techniques approximate the city-as-built from observations. We discuss highlights here and refer the reader to summaries from the graphics [2] and remote sensing [16] communities for additional depth. Reconstruction from 2D cartographic data sources is simple, but limited. Typical pipelines simply extrude building footprints [17] and create a roof [18], although modern cartographic standards such as CityGML [19, 20] are rapidly developing more nuance. With modern Geographic Information Systems (GIS) it is possible to create huge 3D urban meshes with these approaches [21], although mostly untextured. Point clouds (typically from Lidar or structure-from-motion systems) are converted to 3D mesh representations using techniques such as Poisson reconstruction [22, 23], planar patch identification and fusion [24,25,26], or semantic segmentation [27]. Reconstruction approaches vary—large-area aerial Lidar showing only roof heights [28] requires different approaches to that of dense, street-side Lidar capturing detailed facades [29]. Photogrammetry can be used to recover 3D information from aerial [3, 4] and street-level photography [30, 31]. Various models have been used to fit meshes to raw images, including procedural [32, 33], unwrappable [34], Markovian [35], or deep [36] models of 2D facades.

2.2 2D image synthesis

Texture synthesis is the generation of 2D texture images from exemplars or lower-dimensional representations; in this paper we apply such textures to describe sculpting actions over surfaces. Early, non-neural methods learn to generate textures with local and stationary properties, using techniques such as Markov fields to generate pixels [37, 38] or patches [39, 40] from an exemplar image. Improvements to these approaches generate larger, more varied textures [41]. Procedural languages to create [42] or edit images [43] are well adapted to 2D urban domains.

Texture synthesis using neural networks has become possible as the size of datasets has grown, along with the available computing capacity. Typical systems use convolutional neural networks (CNNs) and apply techniques including gradient descent in feature space [44], disentangling latent representations [45, 46], and GANs [47]. Another approach to texture generation is to transform an input image to a different style or domain. Such image-to-image translation [48, 49] has been applied to generate terrains [50] and building facades [15, 51]. The resolution of such networks is typically limited by available memory; a number of works attempt to create arbitrarily sized images by searching for matches [52], removing seams [53], synthesis over implicit neural functions [54], or interpolating over latent encodings [1].

Where there are no datasets of paired input and output images, unsupervised image-to-image networks use novel losses to generate images and textures. A typical approach is to use a cycle loss [55] which learns to translate from one image domain to a second, as well as from the second back to the first, at the same time. This reconstruction loss is applied alongside other losses, such as adversarial, for unsupervised texture learning. This technique has been extended to two or more image domains [56, 57], smaller/faster patches [58], and domain specific problems such as urban image [59] or video generation [60].

2.3 3D object synthesis

CNNs can create 3D objects by replacing their 2D kernels with 3D ones at the cost of significantly increased memory consumption. Techniques such as GANs [61] and recurrent networks [62] can be applied; applications include learning from point clouds [63] and reconstruction from 2.5D sketches [64]. To attempt to reduce the memory usage, many techniques have been applied including octrees [65], generation from 2D silhouettes which can carve volumetric representations [66, 67], and the use of sparse voxel representations [68]. Despite these techniques, synthesizing objects over 3D arrays of voxels has high memory requirements and can only model small areas.

An alternative to directly generating voxels is to learn to generate an implicit surface [69]. Here the network learns to transform a coordinate to a value—typically a signed distance [70], representing whether the point is inside or outside of the object, but variations such as binary occupancy [71] are also in use. These implicit surfaces can be converted to surface meshes using approaches such as marching cubes [72, 73] or tetrahedra [74]. Recent variations disentangle the mesh from its associated textures [75], use multi-resolution surface networks [76], or also generate normals [70].

Methods using voxels and implicit fields for 3D have seen more recent success than the direct generation of 3D meshes consisting of vertices and faces. Early work uses graph neural networks (GNNs) to generate meshes from single [77, 78] or multi-view [79, 80] images; such approaches typically work by deforming, rather than creating, meshes. Recent memory-intensive techniques, such as transformers, have been applied to directly generate vertex and face coordinates [81]. Other approaches have used a variety of higher-level modeling techniques and primitives, such as convex decomposition [82], topology autoencoders [83], parametric surface elements [84], or individual parts with contact-based reasoning [85]. The memory and compute intensity of these methods has meant that they are only used for single-object generation (e.g., a chair or single building).

2.4 2.5D terrain synthesis

DEMs (Digital Elevation Models) are used for environment generation in geospatial and cartography applications. They lift 2D data to a 3D terrain by providing an elevation at each location. Traditional methods include procedural modeling, simulation of erosion, and example-based methods [86]. Because a heightfield stores only a single elevation per location, representing and generating fully 3D landforms from it is effectively an ill-posed problem. Existing methods either use an implicit surface model [87] or model in the gradient domain instead of the elevation domain [88]. Further work studies how to optimize the 2D heightfield to achieve better control [89], generate higher resolution [90], or introduce different styles [91].

2.5 Large-scale urban environment generation

The above 2.5D terrain synthesis methods can generate high-quality landscape models, but often struggle with more complex 3D urban environments. BungeeNeRF [92] uses Neural Radiance Fields (NeRFs) [93] as the basic model and introduces progressive growing to achieve large-scale, multi-resolution 3D urban environment reconstruction. However, it is a reconstruction rather than a synthesis method. Furthermore, it does not generate 3D urban meshes directly; using algorithms such as marching cubes [94] to extract a mesh from BungeeNeRF's neural scene gives lower quality than its rendered views. CityDreamer [95] also uses NeRF to create cityscapes, but performs generation rather than the reconstruction of BungeeNeRF. CityDreamer uses a VQVAE [96] as a layout generator to produce an unbounded semantic map and height field, from which it creates a corresponding neural city scene. InfiniCity [97] and CityGen [98] generate infinite-sized 3D models of cityscapes using infinite depth, normal, or semantic maps generated by deep generative models. However, they do not refine the 3D model details, instead using image-to-image systems to render the final fine results, so the underlying 3D urban model remains of very low quality. More recently, BlockFusion [99] uses a diffusion model [100] to denoise a tri-plane representation [101] to create infinite neural scenes of villages, but the generation process is very slow. Our goal is to directly generate usable 3D urban meshes of arbitrary size in a way that balances efficiency and quality.

3 Method

To create large-scale cityscapes in 3D it is necessary to limit the memory requirements of our mesh synthesis. More memory-intensive approaches, such as voxel or transformer architectures, are unable to scale to such massive urban areas. To address this, we take inspiration from a human sculptor who is able to use 2D vision and many small sculpting actions (i.e., adding or removing material over a surface) to create complex 3D models. Digital sculpting is used to create organic models (e.g., humans or clothing) in existing manual tools such as ZBrush [7]; we extend this paradigm to learn to synthesize large-scale cityscapes with mixed organic (e.g., trees and vegetation) and man-made (e.g., buildings and streets) features over large urban areas.

Following the sculptor paradigm, we work in a coarse-to-fine fashion, using two neural networks to refine a tessellated mesh (Fig. 2), first to a coarse mesh, and then to a fine mesh. NeuroSculpt's sculpting operations are introduced in Sect. 3.1. By using 2D CNNs to describe these refinement operations, we are able to exploit existing work [1] to create massive seamless meshes (Sect. 3.2) and existing unsupervised refinement techniques [58] (Sect. 3.3). In order to apply the sculpting paradigm to man-made urban structures with many flat faces, we learn to generate sculpting actions through both depth-map and normal-map generation, which are fused using a linear system in Sect. 3.3.1. We present our results in Sect. 4 and evaluate them in Sect. 5.

3.1 The sculpting operation

NeuroSculpt is a sculpting pipeline to create large-scale urban environment 3D meshes. From synthesized depth maps and camera locations we sculpt the mesh into the expected shape by pushing and pulling mesh vertices towards or away from the camera. The creation of these depth maps is described in Sects. 3.2 and 3.3.

Given a set of vertices, a camera position, and a depth map, there is a projective mapping between the 2D map/image and the 3D mesh which can be computed from the camera parameters and vertex locations, as in Fig. 4. To sculpt, we modify the vertex positions by applying the depth offset specified by the generated depth map. The camera in NeuroSculpt is used to compute normal and depth maps from 3D meshes as well as locate vertices and directions (towards or away from the camera plane) when performing sculpting operations with new depth maps. The values in the depth maps are relative to the mesh surface.

Fig. 4

The sculpting operation. A camera (top left, grey) captures depth values (dashed lines) of the mesh (black solid lines). It is then used as a reference to move vertices given the sculpting operation. Vertices that are visible to the camera (those whose normal faces the camera and which are not occluded; red) are pulled towards or pushed away from the camera to the distance specified by the depth map. Vertices which have a back-facing normal (green) or are hidden (blue) are not moved

The pixel values in the generated depth map are interpreted as the new depth distances from the camera plane to the corresponding vertices. The procedure of pushing or pulling a vertex to its new depth values along the camera direction and creating a new mesh is the sculpting operation.

We use a camera with a scale appropriate to the network resolution in the NeuroSculpt pipeline; an orthogonal projection is used to reduce variance for the learning-aspects of the pipeline. The sculpting operation can either be performed over the whole mesh at once, or tile-by-tile to cover all areas in a piecemeal fashion.

For each sculpting operation over a single tile, we select the vertices within the camera frustum which are visible and whose normals are oriented towards the camera, as in Fig. 4. Their current depth values from the camera plane are calculated. We then measure the difference between each vertex's current depth value and the corresponding value in the generated depth map; the latter is computed from the nearest four pixels in the depth map with an edge-preserving linear interpolation filter. We move the selected vertices by this distance along the camera's view direction, thus sculpting the mesh towards the target depth map.
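For illustration, the sketch below applies a single tile's sculpting operation under an orthographic camera. It is a simplified sketch rather than our exact implementation: it uses plain bilinear sampling in place of the edge-preserving filter, omits occlusion testing, and the function and parameter names (e.g., sculpt_tile, max_move) are illustrative.

```python
import numpy as np

def sculpt_tile(vertices, normals, target_depth, cam_pos, view_dir,
                right, up, ortho_scale, max_move=5.0):
    """Push/pull front-facing vertices towards the generated depth map.

    Simplified sketch: bilinear sampling stands in for the edge-preserving
    filter, occlusion testing is omitted, and max_move is an illustrative
    clamp on the movement distance.

    vertices, normals : (N, 3) arrays; view_dir, right, up : unit 3-vectors
    target_depth      : (H, W) generated depth map (distance from camera plane)
    """
    H, W = target_depth.shape
    rel = vertices - cam_pos
    u, v = rel @ right, rel @ up          # image-plane coordinates (metres)
    depth = rel @ view_dir                # current distance from camera plane

    # Select vertices inside the frustum whose normals face the camera.
    facing = normals @ view_dir < 0.0
    inside = (np.abs(u) < ortho_scale / 2) & (np.abs(v) < ortho_scale / 2)
    sel = facing & inside

    # Map to pixel coordinates and sample the target depth bilinearly.
    px = (u[sel] / ortho_scale + 0.5) * (W - 1)
    py = (0.5 - v[sel] / ortho_scale) * (H - 1)
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    fx, fy = px - x0, py - y0
    d = (target_depth[y0, x0] * (1 - fx) * (1 - fy)
         + target_depth[y0, x1] * fx * (1 - fy)
         + target_depth[y1, x0] * (1 - fx) * fy
         + target_depth[y1, x1] * fx * fy)

    # Move along the view direction by the (clamped) depth difference.
    offset = np.clip(d - depth[sel], -max_move, max_move)
    out = vertices.copy()
    out[sel] += offset[:, None] * view_dir
    return out
```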

Fig. 5

Left: a mesh is shown after a sculpting operation. Right: after the two post-processing methods. Obvious spikes are removed from the mesh structure, as well as long skinny triangles. Note that other vertices are not impacted by this processing

3.2 Coarse 3D mesh sculpting and optimizations

The system employs a generative network based on a 2D CNN that is designed for consistent synthesis at scale. This network converts a latent vector (drawn from a normal distribution) into a depth map, which is used to sculpt a flat plane into a coarse urban cityscape.

The initial sculpting operation is applied using a vertical camera positioned over a mesh plane. This plane is subdivided to have one vertex for each pixel in the depth map; that is, the first sculpting operation is the application of a heightmap generated by a neural architecture.

We choose the Alis [1] architecture to learn to generate these coarse depth maps from our training dataset. It allows NeuroSculpt to create arbitrarily large-scale 3D meshes from these depth maps. It also serves as a template to ensure large-scale consistency over the mesh in the later, fine, sculpting stages. The Alis model generates seamless tile images between continuous noise vectors in latent space using StyleGAN2-ADA [46] as its backbone architecture. It modifies the AdaIN (Adaptive Instance Normalization) [102] mechanism in StyleGAN2-ADA to SA-AdaIN (Spatially Aligned AdaIN) to interpolate adjacent latent codes from the mapping network.

We observed that while CNNs are proficient at generating images of organic objects, generating depth maps with straight lines was problematic. We applied CoordConv [103] to remedy this. A CoordConv layer feeds a convolutional layer with additional input channels that give it access to its own spatial coordinates. Previous research [104] shows that CoordConv layers have a positive effect on the structural details of generated images, so we add a CoordConv layer to Alis to improve performance. We add one CoordConv layer before the input layer of the discriminator; this provides the generator with coordinate information during training through backpropagation of the gradient. Figure 6, bottom, illustrates the architecture of Alis after adding the CoordConv layer. This significantly improved the quality of our initial depth map synthesis.
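For illustration, a CoordConv-style layer can be written as below; this is a generic sketch of the idea in [103], with example hyperparameters, rather than the exact layer used in our modified Alis discriminator.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Convolution whose input is augmented with normalized (x, y)
    coordinate channels (generic CoordConv sketch, not the Alis code)."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # Two extra input channels carry the x and y coordinate grids.
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([xx, yy]).expand(b, -1, -1, -1)  # (B, 2, H, W)
        return self.conv(torch.cat([x, coords], dim=1))

# Example: a first discriminator layer for 1-channel depth/height maps.
# layer = CoordConv2d(1, 64, kernel_size=3, padding=1)
```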

Fig. 6

The generation process of the infinite-size heightmap image with Alis and the CoordConv layer is shown in the top part. The bottom part shows how we concatenate x and y coordinate information to the generator (pink block) output and where we add the CoordConv layer in the discriminator (blue block)

The training data for this coarse stage are urban depth maps rendered from a top view of the 3D urban meshes in a photogrammetry dataset of English towns. After training on this data, Alis draws from the normal distribution to generate arbitrarily sized depth maps; in the coarse stage, we generate a depth map with a resolution of \(256\times 4096\) pixels, or \(100 \times 1600\) meters. Given a synthesized depth map, we apply the first sculpting operation to create the coarse mesh. This nadir (top-down) map is used to vertically displace the vertices of a regular 2D grid to create a 3D terrain mesh.
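For illustration, the sketch below builds such a displaced grid mesh from a heightmap; the cell size assumes the coarse map scale of roughly 100 meters per 256 pixels, and the triangulation shown is one of several valid choices.

```python
import numpy as np

def heightmap_to_mesh(heightmap, cell_size=100.0 / 256.0):
    """Turn a nadir heightmap into a displaced grid mesh (sketch).

    Returns vertices V of shape (H*W, 3) and triangle indices F of shape
    (2*(H-1)*(W-1), 3); cell_size is an assumed ground sampling distance.
    """
    H, W = heightmap.shape
    xx, yy = np.meshgrid(np.arange(W) * cell_size, np.arange(H) * cell_size)
    V = np.stack([xx.ravel(), yy.ravel(), heightmap.ravel()], axis=1)

    # Two triangles per grid cell.
    i, j = np.meshgrid(np.arange(H - 1), np.arange(W - 1), indexing="ij")
    v00 = (i * W + j).ravel()
    v01, v10, v11 = v00 + 1, v00 + W, v00 + W + 1
    F = np.concatenate([np.stack([v00, v10, v11], axis=1),
                        np.stack([v00, v11, v01], axis=1)])
    return V, F
```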

After this process, the output mesh contains two types of artefacts. First, some pixels in the generated heightmap are outliers relative to the surrounding pixels. This noise leads to highly visible “spikes” in the 3D mesh, pointing towards the camera. Additionally, large vertex movements stretch the 3D mesh into long, skinny triangles. Not having enough vertices in such areas makes later operations ineffectual, while the unbalanced vertex density degrades the mesh quality.

NeuroSculpt uses two mesh post-processing methods to resolve the artefacts above and to improve the performance of later stages. For spike removal, we use an algorithm similar to that of Centin et al. [105]—we detect spike vertices which have at least one incident edge whose dihedral angle is larger than a threshold (\(90^{\circ }\) in our experiments). Each spike vertex is assigned a new position based on the mean of the neighboring vertices' positions. We run ten iterations of this algorithm after each sculpting operation.
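For illustration, the relaxation step can be sketched as follows; the one-ring neighbours and per-vertex maximum dihedral angles are assumed to be supplied by the surrounding mesh library, and a full implementation would recompute the dihedral angles between iterations.

```python
import numpy as np

def remove_spikes(vertices, neighbors, max_dihedral_deg,
                  threshold=90.0, iterations=10):
    """Move spike vertices to the mean of their one-ring neighbours (sketch).

    neighbors[i] is an index array of vertex i's one-ring; max_dihedral_deg[i]
    is the largest dihedral angle over its incident edges (both assumed to be
    precomputed; a full implementation recomputes angles every iteration).
    """
    V = vertices.copy()
    spikes = np.where(max_dihedral_deg > threshold)[0]
    for _ in range(iterations):
        if spikes.size == 0:
            break
        V[spikes] = np.stack([V[neighbors[i]].mean(axis=0) for i in spikes])
    return V
```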

The system resamples the mesh to remove long skinny triangles; this is implemented by isotropic surface remeshing [106], which regularizes triangle sizes over a threshold (0.5 meters in our experiments). This remeshing also increases our mesh resolution if required, and has the advantage of not moving other vertices. Figure 5 shows the results after two mesh post-processing techniques have been applied.

3.3 Fine sculpting

The coarse mesh is large-scale and contains many interesting features at a variety of scales. However, as can be seen in the top row of Fig. 10, there are some limitations—man-made urban features (e.g., building walls and roofs) are blobby rather than flat, the mesh is convex because we have only moved vertices in a single direction, and much fine detail is missing, particularly on vertical faces. In the fine sculpting stage of NeuroSculpt we apply repeated sculpting operations from a variety of angles to address these shortcomings.

3.3.1 Linear optimization for fine depth maps

Depth synthesis networks often have difficulty sculpting perfectly flat surfaces. To address this flatness issue when sculpting urban spaces with CNNs, we compute the depth map for the fine sculpting operation from a synthesized normal map (rather than directly synthesizing depth maps). In our experiments, we found it easier for the network to learn the local derivative (the normal), and in particular a constant normal over a flat patch, than to learn a depth that varies with constant gradient. We hypothesise this is because learning a kernel that outputs a constant normal is simpler than one that outputs a changing depth. Therefore, we synthesize fine normal maps instead of depth maps; however, we must convert these to depth maps for the sculpting operation. To achieve this, we take inspiration from [107] and construct a 2D linear system whose solution is a fine depth map. The system, ND2D (Normal and Depth to Depth), is constructed from a rendered depth map of the coarse mesh and a synthesized normal map from the CNN (Sect. 3.3.2).

We solve a linear system to find the depth of every pixel, \(\Delta _{i,j}\), in the fine depth map by minimizing an objective function. To compute these we take an input depth map, d, of resolution \(p \times q\) with pixels \(d_{i,j}\), and a target normal map, n, of the same resolution, with pixels \(n_{i,j}\) that can be converted to 3D vectors representing normal values.

Fig. 7

Top row: The normal map used for Linear Optimization. The normal vector (black arrow) can be sampled from the pixel value on the top grid which represents a normal map. We show the central pixel’s normal decomposition in 3D space and its decomposed normal vector (red and yellow arrows) in the YZ and XZ plane on its right. We use these two decomposed normal values as optimization goals for the linear optimization over two directions in 2D image space (shown with red and yellow arrows). The central row shows a coarse depth map and normal calculation from it. The same normal (red arrow) on the XZ plane in 2D space is shown on its right. This normal value of the red arrow is calculated by the adjacent pixel’s grayscale value difference. Bottom row: after performing the linear optimization to create the fine depth map, we observe corrected normals—similar to that in the normal map, but with minimal changes to the depth value

We solve for the pixels of the depth map to constrain the output normals, nc, to the target normals: \(n = nc\). We decompose the target 3D normal vector n into two vectors: \(n_{yz}\) and \(n_{xz}\) in 2D. \(n_{yz}\) is the projection of n on the YZ plane in 3D and \(n_{xz}\) is the projection of n on the XZ plane. These two decomposed normal vectors can also be calculated from the depth map as \(nc_{yz}\) and \(nc_{xz}\) in the vertical and horizontal directions in 2D; this is done by taking the difference with the adjacent pixel value vertically and horizontally. Figure 7 illustrates this decomposition and calculation. Assume \(n(i, j) = (a_{i,j}, b_{i,j}, c_{i,j})\) and \(\Delta _{i,j}\) is the value of pixel \((i, j)\) in the fine depth map.

$$\begin{aligned} nc_{yz}(i, j)&= \left( \Delta _{i,j}-\Delta _{i+1,j},\ 1\right) \\ nc_{xz}(i, j)&= \left( \Delta _{i,j}-\Delta _{i,j+1},\ 1\right) \\ n_{yz}(i, j)&= \left( b_{i,j}/c_{i,j},\ 1\right) \\ n_{xz}(i, j)&= \left( a_{i,j}/c_{i,j},\ 1\right) \end{aligned}$$
(1)

The output fine depth map should satisfy the normal constraints in both the YZ and XZ planes. We therefore add the following horizontal and vertical terms, summed over all p rows and q columns; these ensure that the differences between \(nc_{yz}\), \(nc_{xz}\) and \(n_{yz}\), \(n_{xz}\) are small after the optimization.

$$\begin{aligned} \text {Goal}_h = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |nc_{yz}(i,j) - n_{yz}(i,j)|\end{aligned}$$
(2)
$$\begin{aligned} \text {Goal}_v = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |nc_{xz}(i,j) - n_{xz}(i,j)|\end{aligned}$$
(3)

One normal map could correspond to many different depth maps. Therefore, we add a term to promote stability by reducing the change in depth; that is, the difference between the original depth map \(d_{i,j}\) and the fine depth map \(\Delta _{i,j}\) should be as small as possible. This is added as an objective term in the ND2D system:

$$\begin{aligned} \text {Goal}_d = \text {min} \sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |d_{i,j} - \Delta _{i,j}|\end{aligned}$$
(4)

The view direction of the camera leads to discontinuities in the 2D depth maps, for example along the edge between a roof and ground that is occluded from the camera. Pixels near such edges are not adjacent in 3D space, since the camera cannot capture the occluded areas, and their differences do not correspond to surface normals. Therefore, we create edge images to detect discontinuous edges in the depth maps. These edge images are created using a Sobel filter which identifies the pixels adjacent to depth discontinuities. We can then remove the \(\text {Goal}_h\) and \(\text {Goal}_v\) constraints for every pixel that is adjacent to, or on, such edges.

The horizontal and vertical constraints are updated as:

$$\begin{aligned} \text {Goal}_h = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |\left( nc_{yz}(i,j) - n_{yz}(i,j)\right) * E(i, j)|\end{aligned}$$
(5)
$$\begin{aligned} \text {Goal}_v = \text {min}\sum _{i=0}^{p-1} \sum _{j=0}^{q-1} |\left( nc_{xz}(i,j) - n_{xz}(i,j)\right) * E(i, j)|\end{aligned}$$
(6)

where \(E(i, j)\) is 0 when pixel \((i, j)\) lies on, or is adjacent to, such an edge, and 1 otherwise.
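For illustration, such an edge image can be computed from the coarse depth map with a Sobel filter as sketched below; the gradient threshold and the one-pixel dilation used to flag adjacent pixels are illustrative choices rather than tuned values.

```python
import numpy as np
from scipy import ndimage

def depth_edge_mask(depth, threshold=1.0):
    """Return True where a pixel lies on, or adjacent to, a depth
    discontinuity (so E = 0 there); the threshold and dilation radius
    are illustrative choices."""
    gx = ndimage.sobel(depth, axis=1)
    gy = ndimage.sobel(depth, axis=0)
    edges = np.hypot(gx, gy) > threshold
    # Also flag pixels adjacent to an edge pixel.
    return ndimage.binary_dilation(edges, iterations=1)
```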

Finally, we weight the three groups of terms with different lambda values. The objective function of the linear system is then:

$$\begin{aligned} \text {Goal} = \lambda _d \times \text {Goal}_d + \lambda _h \times \text {Goal}_h + \lambda _v \times \text {Goal}_v \end{aligned}$$
(7)

By minimizing this function, we find the value of every pixel \(\Delta _{i,j}\) of the fine depth map. We use Gurobi [108], a fast solver for mathematical optimization, to solve our linear system.
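For illustration, the sketch below solves a simplified least-squares variant of the ND2D objective with SciPy's sparse LSQR solver, in place of the L1 objective we solve with Gurobi; the variable names, the epsilon guarding against grazing normals, and the weighting scheme are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def nd2d_least_squares(coarse_depth, normal_map, edge_mask,
                       lam_d=0.1, lam_yz=1.0, lam_xz=1.0):
    """Least-squares sketch of ND2D (the L1/Gurobi formulation is replaced
    by LSQR; lam_yz and lam_xz play the role of the lambda weights, which
    are both 1.0 in our experiments).

    coarse_depth: (p, q) depths d_{i,j}
    normal_map:   (p, q, 3) unit normals n = (a, b, c)
    edge_mask:    (p, q) bool, True on/next to a depth discontinuity (E = 0)
    """
    p, q = coarse_depth.shape
    n = p * q

    def diff_op(axis):
        # Rows encode Delta[i,j] - Delta[i+1,j] (axis 0) or Delta[i,j] - Delta[i,j+1] (axis 1).
        rows, cols, vals, r = [], [], [], 0
        for i in range(p):
            for j in range(q):
                i2, j2 = (i + 1, j) if axis == 0 else (i, j + 1)
                if i2 >= p or j2 >= q:
                    continue
                rows += [r, r]
                cols += [i * q + j, i2 * q + j2]
                vals += [1.0, -1.0]
                r += 1
        return sp.csr_matrix((vals, (rows, cols)), shape=(r, n))

    Dv, Dh = diff_op(0), diff_op(1)
    a, b = normal_map[..., 0], normal_map[..., 1]
    c = np.clip(normal_map[..., 2], 1e-6, None)   # avoid division by zero
    tv = (b / c)[:-1, :].ravel()                  # target vertical differences
    th = (a / c)[:, :-1].ravel()                  # target horizontal differences

    # Zero-weight differences that touch a discontinuity pixel (E = 0).
    keep_v = ~(edge_mask[:-1, :] | edge_mask[1:, :]).ravel()
    keep_h = ~(edge_mask[:, :-1] | edge_mask[:, 1:]).ravel()
    Wv = sp.diags(lam_yz * keep_v.astype(float))
    Wh = sp.diags(lam_xz * keep_h.astype(float))

    A = sp.vstack([Wv @ Dv, Wh @ Dh, lam_d * sp.identity(n)]).tocsr()
    rhs = np.concatenate([Wv @ tv, Wh @ th, lam_d * coarse_depth.ravel()])
    delta = lsqr(A, rhs, atol=1e-8, btol=1e-8)[0]
    return delta.reshape(p, q)
```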

After the linear optimization, the fine depth maps are generated. We use these maps in sculpting operations to deform the coarse mesh into the final fine mesh.

3.3.2 Fine normal map generation

To achieve details on all sides of the mesh, the fine sculpting operation is repeated from different directions. For each direction, we position the camera, render the required inputs, compute the fine depth map, and then apply the sculpting operation.

To position the camera for fine normal map generation, the system uses preset camera parameters. The camera's pitch angle is \(45^{\circ }\) down from the horizontal. This value minimizes occlusions while allowing the sides and roofs of buildings to be sculpted, creating concave and overhanging features. The camera's yaw angle switches between \(0^{\circ }\) and \(180^{\circ }\); in our experiments, these two angles gave the best balance between result quality, efficiency, and artifacts. The camera's roll angle is fixed to \(0^{\circ }\). The height of the camera (its Z value) is fixed; the camera moves along the X axis to cover successive tiles, while its Y position is constant for each of the two view directions. This is shown in Fig. 8.

Fig. 8

Top: our camera (yellow and red parallelogram) positions and directions for the front (first iteration) and back (second iteration) fine depth map generation. The bottom part shows camera (yellow and red rectangle) positions over the whole coarse mesh; our camera positions are calculated to span the coarse mesh without gaps or repeats on either side

Fig. 9

Results for fine depth and fine normal maps. Column a shows the original depth maps from coarse meshes; b illustrates the 3D mesh result of a sculpting operation on (a); c shows the normal maps from coarse meshes; d gives the processed fine normal maps generated by CUT; e demonstrates the fine depth maps generated by our ND2D linear system; f shows the final 3D mesh after directly sculpting a plane with (e)

During runtime, the fine sculpting operation is performed multiple times over a coarse mesh (shown in the bottom part of Fig. 8). The front pass, with a \(0^{\circ }\) camera yaw angle, is performed first. We set the camera's Y position so that it captures a whole tile based on the short edge length of the coarse mesh. The camera is then moved along the X axis to ensure all tiles on the mesh are modified by sculpting operations. Next, we move the camera to the other side of the mesh and rotate it to a \(180^{\circ }\) yaw angle for the second row of sculpting operations along the back of the coarse mesh. The camera positions and movements of the two passes are shown at the bottom of Fig. 8 as yellow rectangles. After the front and back deformations, the final fine mesh is complete. Using this technique there is a chance that some vertices may be moved twice, but we find the network is robust to such situations. Further implementation details are discussed in Sect. 3.5.

To generate the fine normal maps, the system must learn to translate poor-quality coarse normal maps to high-quality fine normal maps. However, as there is no paired dataset matching our coarse meshes, we must learn from unpaired data. Therefore, we generate fine normal maps with CUT [58], an unpaired image-to-image translation neural network. CUT takes as input rendered normal maps of the mesh (Fig. 9, column c) and outputs clean normals (Fig. 9, column d). The training data for CUT are rendered from our coarse 3D meshes and the ground-truth English town meshes. As above, for each direction, the output of a CUT execution is processed by ND2D to create a depth map which is applied to the mesh as a sculpting operation.
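For illustration, the overall fine-sculpting sweep can be organised as in the sketch below; the callables stand in for the rendering, CUT translation, ND2D solve, and per-tile sculpting steps described above, and none of the names are taken from our code.

```python
def fine_sculpt(mesh, tile_width, num_tiles,
                render_maps, cut_translate, nd2d, sculpt_tile):
    """Sweep a 45-degree camera along X for front (yaw 0) and back (yaw 180)
    passes (orchestration sketch; all callables are placeholders)."""
    for yaw in (0.0, 180.0):
        for t in range(num_tiles):
            cam = dict(x=t * tile_width, yaw=yaw, pitch=-45.0, roll=0.0)
            coarse_depth, coarse_normal = render_maps(mesh, cam)
            fine_normal = cut_translate(coarse_normal)     # CUT network
            fine_depth = nd2d(coarse_depth, fine_normal)   # linear system
            mesh = sculpt_tile(mesh, cam, fine_depth)      # push/pull vertices
    return mesh
```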

3.4 Complete NeuroSculpt pipeline

To summarize, we first generate a coarse heightmap image which is applied with a single sculpting operation to a tessellated plane to create our coarse mesh. This is refined by a number of fine sculpting operations from different directions. For each direction, normal maps are rendered and processed by a network to refine and improve them; these are processed by ND2D to create a depth map which can be applied. Once all these fine sculpting operations have been applied to the mesh, NeuroSculpt is complete and outputs the finished mesh. The whole procedure of NeuroSculpt is shown in Fig. 3.

3.5 Dataset, training, and engineering details

Our training data comes from a photogrammetric source of an English town, as shown in the top left of Fig. 1. We render single-channel grayscale heightmap images from the top view of these 3D meshes at a \(256 \times 256\) resolution to train the coarse network, Alis [1]. The heightmap images are normalized to a range of 0 to 60 meters. We enable augmentation with the parameter “bg” for Alis and train it for 6 days on 2 NVIDIA V100 GPUs. The normal maps for CUT [58] are rendered directly from our coarse mesh results using Blender [109]. The ground-truth normal maps are rendered from the English town meshes. These images are also \(256 \times 256\) resolution. We render 16-bit depth maps for the ND2D linear system optimization; the maximum depth is 131.0 meters and the resolution is \(256 \times 256\). We use the default parameters to train CUT for 2 days (400 epochs) on a single NVIDIA Titan V GPU. The fine depth maps are created with lambda values \(\lambda _d = 0.1\), \(\lambda _h = 1.0\), and \(\lambda _v = 1.0\) for ND2D.

Fig. 10

Our final outputs, shown in the bottom row, span more than one kilometer. The middle row highlights the details inside the black rectangles, and the top row shows the corresponding coarse meshes. Note that the vertical (height) resolution of these meshes is substantially improved during the fine sculpting phase. In addition, minor repetitive artifacts introduced by the CNN at the coarse stage are corrected, and the quality of flat vertical surfaces, such as walls, is notably enhanced

Because fine depth maps for sculpting are generated multiple times, from different directions over the same areas, our ND2D model cannot guarantee that they are continuous across adjacent tiles. To address this potential seam problem after sculpting, we retain some overlapping area when rendering the coarse mesh depth and normal maps, and use linear alpha interpolation over the overlapping areas to ensure the continuity of the final result. This removes the most obvious seams in NeuroSculpt; one example is shown in (c) and (d) of Fig. 11.
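For illustration, the blend over the overlap between two horizontally adjacent depth tiles can be sketched as below; the overlap width and the direction of the linear ramp are assumptions.

```python
import numpy as np

def blend_overlap(left_tile, right_tile, overlap):
    """Blend two horizontally adjacent depth tiles sharing `overlap` columns
    with a linear alpha ramp across the shared region (sketch)."""
    alpha = np.linspace(1.0, 0.0, overlap)[None, :]   # 1 -> 0, left to right
    seam = alpha * left_tile[:, -overlap:] + (1 - alpha) * right_tile[:, :overlap]
    return np.concatenate(
        [left_tile[:, :-overlap], seam, right_tile[:, overlap:]], axis=1)
```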

For all images, the camera uses an orthogonal camera scale of 80 meters. We compute the approximate height of the ground plane from the thousand lowest vertices in the mesh and do not move these vertices, which avoids obvious discontinuities in silhouetted areas. In addition, after the first fine sculpting iteration, we use a stricter condition: we only move vertices whose normals lie inside the semicircle facing the camera direction. This avoids changing parts that have already been sculpted by earlier sculpting operations. Because the precision of the image is limited even for 16-bit depth maps, vertices at different positions can project into the same pixel; the system therefore performs linear interpolation when sampling the fine depth maps. Finally, a maximum movement distance is used to prevent vertices from being moved to an inappropriate position.

4 Results

Our intermediate 2D results are shown in Fig. 9. Depth and normal maps rendered from coarse meshes are shown in columns (a) and (c). Column (b) shows the corresponding coarse 3D results of sculpting with the depth map. Column (d) shows the normal maps output by the fine CUT neural network, with the processed depth maps generated by our ND2D linear system in column (e). Finally, column (f) gives the final 3D result after sculpting with the fine depth map.

Figure 10 illustrates our 3D results. The synthesized 3D meshes, with a scale of \(100 \times 1600\) meters, are shown at the bottom of this figure. It shows four areas in detail, with both their coarse and fine stage results. The improvement made by our complete pipeline can be seen by comparing the two stages: the shapes of trees, woods, and houses in the final result are much closer to the training data, and the smoothness of all house surfaces is significantly improved. NeuroSculpt successfully introduces concave regions and overhangs to vertical surfaces during the fine sculpting stage. Such concave details can be seen in the top left part of the tiles in the third column of Fig. 10. Trees in the final results do not contain trunks since trees in the training data do not contain trunks. Detailed results showing trees, houses, walls, and bushes are presented in Fig. 11; other details can be found in the additional materials.

Fig. 11

The images inside the green rectangle illustrate the improvements of the fine stage results (row b) over the coarse stage results (row a). From left to right, trees, houses, and bushes are shown. The images inside the blue rectangle show the removal of a seam by our blending of adjacent depth maps in Sect. 3.5; we show the coarse mesh including a seam (c; light blue rectangle) caused by Alis, and the fine mesh which removes the seam entirely (d). The red rectangle (row e) gives examples of artifacts created by the pipeline. From left to right: low-quality results caused by Alis that cannot be fixed by NeuroSculpt; an uneven roof caused by CUT that cannot be repaired; a height difference in one house's roof caused by sculpting operations from two different directions; regular cuboid-type buildings that cannot be handled well

5 Evaluation

Evaluating the quality of generative models is difficult as our goal is the synthesis of statistically similar outputs, rather than verbatim reproduction of the exemplars. Because we are doing 3D generation rather than reconstruction, it is not appropriate to use direct comparison to a mesh ground truth such as Hausdorff or chamfer distance. Therefore, our evaluation uses statistical similarities to compare our results to the entire ground truth dataset and a user study to assess the quality of our pipeline.

To evaluate the NeuroSculpt pipeline, we compute the statistical similarity to the ground-truth meshes using FID (Fréchet Inception Distance) [110] and LPIPS (Learned Perceptual Image Patch Similarity) [111] scores. Because FID and LPIPS process 2D features we capture 3D meshes’ features as images. To do this, we create normal maps from 3D meshes, both from the two stages of our system and for the ground-truth meshes. The normal map is well suited to capture the 3D meshes’ characteristics, 3D object shapes, and surface flatness or textures.

To avoid introducing bias, we use random camera positions and directions to render the normal maps for the FID and LPIPS calculations. During the rendering process, we add some restrictions (e.g., the camera position cannot be too far from the mesh) to ensure that blank scenes are not captured. This is illustrated in Fig. 12. We use a 20 meter orthogonal camera scale and the normal map patches are rendered at \(64 \times 64\) pixel resolution. We render 10,000 patches for each data domain and calculate FID and LPIPS scores with these patches.
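For illustration, the scores can be computed over the rendered patches with the torchmetrics and lpips packages as sketched below; pairing real and generated patches one-to-one for LPIPS is a simplification, and the preprocessing shown may differ from our evaluation code.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
import lpips

def evaluate_patches(real_patches, fake_patches):
    """FID and mean LPIPS over rendered normal-map patches (sketch).

    real_patches, fake_patches: uint8 tensors of shape (N, 3, 64, 64).
    """
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_patches, real=True)
    fid.update(fake_patches, real=False)
    fid_score = fid.compute().item()

    # LPIPS expects float images scaled to [-1, 1]; patches are paired 1:1 here.
    lpips_fn = lpips.LPIPS(net="alex")
    to_float = lambda x: x.float() / 127.5 - 1.0
    with torch.no_grad():
        dists = lpips_fn(to_float(real_patches), to_float(fake_patches))
    return fid_score, dists.mean().item()
```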

Fig. 12

Bottom: Camera configuration during the evaluation. We show three cameras (yellow rectangles) at random positions with rotations given by yaw (blue) and pitch (green). Top: Eight examples of rendered normal patches

To give context for comparison, we also evaluate against another large-scale 3D urban mesh dataset: the LoD 2.2 geometry from the Netherlands, 3D BAG [112]. A representative example is shown in Fig. 13. This dataset was selected because, while the urban content is similar, the style of the mesh is significantly different as it was generated from a cartographic database of building footprints. The dataset also omits non-building objects, such as trees and streets. Comparing the quality of the Netherlands dataset and our training dataset, the algorithmically generated LoD model is a cleaner mesh with fewer polygons than the photogrammetry model. We emphasize, however, that our goal is to generate cityscapes as close to the training data as possible, and we use the Netherlands dataset as an alternative data source with a distinct style against which to evaluate our approach.

Table 1 shows the results of FID evaluation of different mesh datasets against our training data - English town photogrammetry. The FID value compares the learned activations of a network to give a statistical estimate of image similarity. We can infer from the table that even though the coarse mesh is not of high quality, it still has a lower FID than the Netherlands dataset which means it is statistically closer to the ground-truth meshes. Further, as our pipeline progresses to the fine sculpting stage, the distance decreases, showing improved similarity to the training data. Table 2 shows the results of LPIPS evaluation which provides an estimate of human perceptual similarity between the meshes. We observe similar patterns to the FID measure - that our coarse mesh more closely matches the training data than the Netherlands’ dataset, and the fine stage mesh improves further.

Fig. 13

One example tile of the Netherlands 3D dataset

Table 1 FID scores for comparison against the ground-truth dataset
Table 2 LPIPS scores for comparison against the ground-truth dataset

We also conduct a user study to evaluate our results through direct human comparison. Our user study evaluates the perceived similarity to the training data of our generated results as well as of the Netherlands dataset. We designed a questionnaire asking users to choose which of three alternative images has the most similar appearance and buildings to a sample from the ground-truth 3D models. Specifically, randomly chosen ground-truth meshes from random viewpoints were rendered as the reference images. The experiment then provides three rendered images: one from the Netherlands dataset, one from our coarse stage results, and one from our fine stage results. These are presented in a random order and users are prompted to choose which one is most similar to the given reference. See the additional materials for example study screens. Comparison trials are performed by 30 participants over 20 triplets of images.

At the start of the user study, we collected information on participants' self-assessed experience of 3D graphics by asking about their viewing, generating, and editing of 3D models. Among our participants, 16.7% create or edit 3D models more than twice a month, 30% less than twice a month, while the remaining 53.3% have no relevant experience. 6.7% of participants have a professional level of experience with 3D graphics, 16.7% are studying graphics or related disciplines, 40% are interested amateurs, while 36.6% do not have any 3D graphics knowledge.

Figure 14 shows the results of our user study. Our fine stage results are chosen as the most similar to the ground-truth distribution in the majority of trials. Moreover, our fine stage advantage is particularly strong against the Netherlands dataset and our coarse stage results. The results of our coarse stage also beat the Netherlands dataset by a small margin.

Fig. 14

User study results. Our fine stage results are selected by participants to be the most similar to the ground-truth distribution

6 Limitations and discussion

There are several limitations to our NeuroSculpt pipeline. First, our height maps are obtained from Alis [1] using a modified StyleGAN2-ADA [46] generator, which means that the quality is limited by these architectures. One limitation is that Alis can generate infinite-size images, but only in the horizontal direction; this means that our final output has an extreme aspect ratio. Our work is not focused on solving this issue of Alis, but in future work we hope to adjust the interpolation of the latent vector in Alis to bi-linear interpolation in both the horizontal and vertical directions. Generally, Alis has no guarantee that there will never be visible seams, and this is true of our system too; a seam caused by Alis is shown in row (c) in Fig. 11. Further, we observed that the generated shapes in the height maps are not always reasonable, and some strangely shaped buildings in our results are not fixed by our pipeline (first and last images in row (e) in Fig. 11). We noted similar issues with the CUT network (second image in row (e) in Fig. 11); there is a small chance it will generate fine normal maps that are uncharacteristic of the target data. These can be seen in the meshes provided in the additional materials.

Another limitation of our architecture is that fine normal maps generated from different views may not perfectly correspond to each other, because our neural network cannot recognise the same object in the same area from two different viewpoints. This sometimes causes the results of two sculpting operations to conflict. Our Gurobi optimization model has the same issue. This is shown in the third image in row (e) in Fig. 11.

Our pipeline is slow because the optimization steps for dense meshes require extensive computation. The sculpting operation is also not perfect. Because, as mentioned above, the fine depth maps used in the two different sculpting directions are not guaranteed to correspond to each other, the second sculpting iteration sometimes destroys parts that were deformed during the first iteration and creates visible artifacts. We add several restrictions to the sculpting process to avoid these situations, but we cannot entirely prevent them from happening.

The accuracy of our sculpting pipeline can also be improved. Real depth values are quantized to the minimum pixel value unit (even with 16-bit bitmaps), and many vertices are projected into the same pixel, which causes artifacts. Our sculpting operations are applied to dense meshes, which amplifies this inaccuracy.

Additionally, our NeuroSculpt pipeline can create overhangs and trees, but it can only model genus-zero surfaces. We hope to use the work of [113] to address this in the future.

Finally, we face the general difficulty of evaluating generative systems against ground truths. NeuroSculpt is a generative system whose input is a latent vector drawn from a Gaussian distribution. This prevents us from controlling the generated content and makes it impossible to directly compare the output with a ground truth. We can only evaluate our results by measuring the similarity of the final results through FID, LPIPS, and user studies.

In future work, we would like to add a super-fine stage for an extra level of detail in our results. To make the outputs of all stages consistent, we also want to engineer a general framework that integrates more complex neural networks with the whole pipeline, for example by backpropagating gradients through the Gurobi solve. Complementary colour texture generation is another avenue of improvement we wish to try. Finally, we wish to extend our work to on-demand generation as in [114] and combine NeRF [115] and SLAM (Simultaneous Localization and Mapping) approaches to synthesize cityscapes in real time.

7 Conclusion

We have presented our NeuroSculpt pipeline for sculpting large-scale 3D urban environment meshes by learning from photogrammetry. This is achieved by computing 3D surface deformations using 2D image generation and optimization, which reduces memory costs and computational complexity. Our results demonstrate high-quality meshes at scale, showing realistic generation of complex concave meshes.