Exploring in the Latent Space of Design: A Method of Plausible Building Facades Images Generation, Properties Control and Model Explanation Base on StyleGAN2

GANs have been widely applied in research on architectural image generation. However, the quality and controllability of the generated images and the interpretability of the models still leave room for improvement. In this paper, a StyleGAN2 model is trained to generate plausible building façade images without conditional input. In addition, by applying GANSpace to analyze the latent space, high-level properties can be controlled for both generated images and novel images from outside the training set. Finally, the generation and control processes are visualized with image embedding and PCA projection, which enables unsupervised classification of the generated images and helps to explain the correlation between the images and their latent vectors.


Introduction
With the emergence of Generative Adversarial Network (GAN) based image generation methods in recent years, many attempts have been made to apply GANs to the generation of architectural images and drawings [1]. However, for the task of generating realistic building façade images, most attempts have faced challenges in the quality and controllability of the generated images and in the interpretability of the model. These challenges stem from various limitations, such as the performance of the selected GAN model, the size of the training dataset and the limited understanding of the latent space. In this paper, by training the state-of-the-art GAN-based image generation model StyleGAN2 [2] on a high-resolution building façade image dataset, and by exploring its latent space with PCA and GANSpace analysis, we address these challenges to varying extents [3].
In summary, the main contributions of this paper are: 1. A StyleGAN2 model instance that generates plausible building façade images without conditional input. 2. The introduction of GANSpace and an image embedding method to visualize the correlation between the generated building façade images and their corresponding latent vectors, which achieves unsupervised classification and high-level property control of both generated and novel images.

Image Generation Research via GAN in Computer Science
Generative Adversarial Networks (GANs) are a neural network architecture consisting of a generator and a discriminator, and have shown the potential to generate novel image instances from the learned distribution of a training set [1]. Recently, research derived from GANs has become a focus of image generation in computer vision. This research can be broadly divided into supervised and unsupervised learning structures. Supervised GANs require conditional input in both training and inference; examples include Pix2Pix, Pix2Pix HD and GauGAN (which require paired training sets) and CycleGAN (which requires unpaired training sets, but each set must have similar content) [4][5][6][7]. Because supervised GANs require relatively small training sets and fewer training resources, and achieve high-quality output when the inputs are appropriate, most architectural image generation research is based on them. However, their performance and application in practical workflows are limited because conditional inputs are required. By contrast, unsupervised GAN models, such as DCGAN, BigGAN and StyleGAN, require much larger training sets (normally in the millions) and more training resources, and have therefore been used in less research [2][8][9][10][11]. However, because unsupervised GAN models can generate diverse outputs without conditional input, they have more potential for application to real tasks. In addition, the properties of their latent spaces make further model explanation and semantic editing of generated images possible [3,11].

Plan Drawing Generation Research
Most research on generating architectural plan drawings is based on supervised GAN models. Hao Zheng was one of the early researchers in this area. In 2018, he applied a conditional GAN, Pix2Pix, to show that building plans, urban plans and satellite images of cities could be generated from conditional inputs such as footprints or colour pattern images [12]. In subsequent research, he generated plausible apartment plans and explained the working principles [13][14][15]. In 2019, Stanislas Chaillou developed ArchiGAN, also based on Pix2Pix, which could generate whole apartment building plans from building footprints [16]. In 2021, this line of research was extended to large-scale plan drawings. Liu et al. applied Pix2Pix to generate campus layouts from a given campus boundary and the surrounding roads [17]. Pan et al. applied GauGAN to generate community plans from similar conditional input [18]. The output images of this research were generally convincing when appropriate conditional input was provided. However, plan drawings are relatively simple to generate compared with complex building façade images.

Building Facades and Other Perspective Architectural Images Generation Research
As with plan drawing generation, most previous research on generating building façades and other perspective architectural images required conditional inputs. In 2017, in the original Pix2Pix paper, Isola et al. generated novel street scene and building façade images, but required street views and refined colour labels as paired image inputs [4]. In 2019, Kyle Steinfeld developed GAN Loci on both Pix2Pix and StyleGAN; the Pix2Pix version required a depth map as conditional input, while the StyleGAN version was only trained as an unrefined 512-pixel-square instance because of limited computing resources and data [19]. Kelly et al. proposed FrankenGAN, which could generate 3D building models with detailed façade textures, but required mass 3D models as input [20]. Mohammad et al. attempted to generate novel building elevation designs from AI-generated datasets, but obtained only low-resolution grayscale images [21]. In 2020, Chan et al. attempted to generate building façade images from hand sketches, but obtained only low-resolution output because of a small dataset and a small GAN architecture [22]. In contrast to this research, in 2019 Bachl et al. developed City-GAN to synthesize novel city images from random input by learning from a large street view dataset. City-GAN was built on the unsupervised DCGAN model, with additional label information fed in to control the style of the generated city images and to allow simple interpolation between different styles. Nevertheless, the generated images were still of limited quality and resolution [23]. Chen et al. proposed another unsupervised model, embedGAN, which attempted to explore the properties of the latent space [24]. They embedded an interior image into the latent space as a starting point, and then guided the latent walk with a pretrained classification network to regenerate the image with different decoration materials and styles. However, only images from the training set could be used, and the image quality was not good enough.

Methodology
In this paper, the state-of-the-art GAN-based image generation model StyleGAN2 is used for the experiment [2], with a training set of 9772 building façade images at 1024 × 1024 resolution. Because StyleGAN2 generates images from vectors randomly sampled in a high-dimensional latent space, dimensionality reduction, clustering and image embedding are applied to explore and visualize the relations between the generated building façade images and their corresponding latent vectors. Specifically, by applying principal component analysis (PCA) to the intermediate latent space W of the StyleGAN2 model [3], this paper achieves high-level property control of generated building façade images. In addition, although StyleGAN2 has no encoder network, novel building façade images (from outside the training set) are projected into the existing latent space with a pre-trained VGG16 perceptual network. This method locates the latent vector that generates the image most similar to the target image [25]. Once the projection is complete, the novel image can be controlled in the same way as a generated one.

Introduction of StyleGAN2
StyleGAN2 is a state-of-the-art (SOTA) GAN-based image generation model, upgraded from StyleGAN and proposed by Nvidia in 2020 [2,11]. Its generator structure differs from that of most GAN models, which provides better model performance and interpretability. Many GAN models, such as Pix2Pix and CycleGAN, have an image encoder that encodes the input image as a latent vector, which is used as the direct input of the image synthesis network (decoder) [4,7]. This structure requires images as input to generate other images and potentially limits model performance, because the distribution of the input images may not fit that of the output images [11].
The style-based generator of StyleGAN2 avoids using an image as input. The synthesis network g of the style-based generator begins with a learned constant and passes through 18 layers to output a 1024-pixel-square image. In each layer, noise and a latent vector w are injected to adjust the style and content of the generated image. The latent vector w is an intermediate output in a 512-dimension latent space W; it is mapped from a vector z, randomly sampled in another 512-dimension space Z, by an 8-layer trainable fully connected network [11].
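The mapping from Z to W described above can be illustrated with a minimal NumPy sketch. It is not the StyleGAN2 implementation: the weights are random rather than trained, and the input normalization, leaky ReLU slope and initialization scale are assumptions made for illustration. Only the two 512-dimension spaces and the 8 fully connected layers come from the description above.

```python
import numpy as np

LATENT_DIM = 512  # dimensionality of both Z and W, as stated in the text
NUM_LAYERS = 8    # depth of the mapping network, as stated in the text


def init_mapping_network(rng, dim=LATENT_DIM, num_layers=NUM_LAYERS):
    """Create random (untrained) weights; a trained model would load learned ones."""
    scale = np.sqrt(2.0 / dim)  # assumed initialization scale
    return [(rng.standard_normal((dim, dim)) * scale, np.zeros(dim))
            for _ in range(num_layers)]


def leaky_relu(x, slope=0.2):
    # The slope value is an assumption for this sketch.
    return np.where(x > 0, x, slope * x)


def map_z_to_w(z, layers):
    """Map a batch of z vectors to intermediate latent vectors w."""
    # Assumed normalization: rescale each z to unit mean squared value.
    x = z / np.sqrt(np.mean(z ** 2, axis=-1, keepdims=True) + 1e-8)
    for weight, bias in layers:
        x = leaky_relu(x @ weight + bias)
    return x


rng = np.random.default_rng(0)
layers = init_mapping_network(rng)
z = rng.standard_normal((4, LATENT_DIM))  # four random samples from Z
w = map_z_to_w(z, layers)                 # corresponding vectors in W
print(w.shape)
```

In the real model, each w would then be fed to every layer of the synthesis network together with per-layer noise; that part is omitted here.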
The improved style-based generator in StyleGAN2 brings the following features, which are the foundation of the further research in this paper [11].
1. State-of-the-art image generation performance. 2. Image generation from a random vector without conditional input. 3. A latent space W that is free from restriction and has the potential to be disentangled. 4. Unsupervised separation of high-level attributes of generated images. 5. Adjustment of the style and content of generated images by manipulating latent vectors.

Training Process
In this paper, an open-source architectural style image dataset was first combined with a further 6000 building façade photos and renderings downloaded from the internet [26]. Second, duplicated, non-architectural and low-resolution images were removed. Finally, all the images were converted to JPG format with RGB channels at 1024 × 1024 pixels. The final training set included 9772 architectural façade images in various styles. Training used config-f (1024 × 1024 resolution) with automatic mirror augmentation. It ran on a single NVIDIA Tesla V100 with 16 GB of memory for about 816 h, until 12240K images had been processed.
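As a rough check on the scale of this training run, the figures above imply the following throughput and number of passes over the dataset. The variable names are illustrative, and the calculation assumes that "12240K images" counts images processed during training, which is an interpretation rather than something stated explicitly.

```python
# Figures quoted in the training description above.
total_images_processed = 12_240_000  # "12240K images" (assumed meaning: images processed)
dataset_size = 9_772                 # images in the final training set
training_hours = 816                 # reported training time

training_days = training_hours / 24
images_per_second = total_images_processed / (training_hours * 3600)
passes_over_dataset = total_images_processed / dataset_size

print(f"training time: {training_days:.1f} days")            # 34.0 days
print(f"throughput: {images_per_second:.2f} images/s")        # about 4.17 images/s
print(f"passes over the dataset: {passes_over_dataset:.0f}")  # about 1253
```

Each training image was therefore seen on the order of a thousand times, which is consistent with the later remark that the dataset is relatively small.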

Generation Examples
After training was complete, this StyleGAN2 model instance could generate plausible building façade images at 1024 × 1024 resolution from random seeds (Fig. 1). The generated images were similar to the training set images, but showed features mixed from different examples rather than simply repeating them (Fig. 2).
However, some details in the generated images were still blurry, mismatched or missing. This may be because of the relatively small dataset, the insufficient original resolution of some training images (which were enlarged) and insufficient training time.

Visualizing High-Dimension Latent Space by PCA
The style-based generator requires an intermediate latent vector w as input, located in the intermediate latent space W. W is remapped from a 512-dimension randomly sampled latent space Z by 8 layers of a trainable fully connected network. To visualize the distributions of W and Z, principal component analysis (PCA) was used to reduce the dimensionality and project the vectors of both spaces onto a 2D figure. PCA first analyzes the distribution of the vectors in the high-dimensional latent space and then projects them orthogonally onto the principal axes, giving a low-dimensional space that presents the main features of the high-dimensional space [27].
To explore both latent spaces, 2000 vectors were randomly sampled in Z, remapped into W and finally used to generate building façade images. The vector distributions in W and Z were projected by PCA onto 2D figures (Fig. 3). Because the vectors in Z were randomly sampled, their distribution was almost spherical. By contrast, the distribution in W had a distinct shape, which may reflect the distribution of features in the image content.
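The PCA projection used here can be sketched in a few lines of NumPy. The sketch below is illustrative only: `pca_project` is a hypothetical helper, and the W samples are a synthetic anisotropic stand-in (Gaussian samples with decaying per-dimension scale), not outputs of the trained mapping network.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their first principal axes.

    Returns the projected coordinates, the principal axes (as rows) and the
    fraction of total variance explained by each returned axis.
    """
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are orthonormal principal axes, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]
    explained = singular_values[:n_components] ** 2 / np.sum(singular_values ** 2)
    return centered @ axes.T, axes, explained


rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 512))  # 2000 random samples from Z

# Synthetic stand-in for W (assumption): variance decays across dimensions.
w = z / (1.0 + np.arange(512))

z_2d, _, z_explained = pca_project(z)
w_2d, _, w_explained = pca_project(w)
print("variance explained in Z:", z_explained.sum())
print("variance explained in W stand-in:", w_explained.sum())
```

For the isotropic Z samples, two axes explain only a small fraction of the variance, so the 2D projection looks like a round cloud. For a structured distribution, the first axes capture much more, which is why the projection of W can show a distinct shape.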

Explanation of StyleGAN2 Model: Images Embedding and Clustering in Latent Space
To test this hypothesis and visualize the correlation between the images and their latent vectors w in space W, the 2000 generated image examples were embedded at the projected locations of their corresponding w dots (Fig. 4). To avoid too much overlap, only about 10% of the image thumbnails are shown. In addition, the unsupervised K-means clustering algorithm was applied to cluster the w vectors into 4 types, which are marked by the colours of the dots and of the thumbnail frames. It can be observed that images in the same cluster share similar features. Moreover, some features change linearly along certain directions. For example, the height of the buildings decreases from the top to the bottom of Fig. 4. These observations suggest the hypothesis that a generated image's high-level properties could be controlled by moving its vector w along a certain principal axis.
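The clustering step can be illustrated with a plain NumPy version of K-means (Lloyd's algorithm). This is a sketch under assumptions: the w vectors are synthetic blobs rather than vectors from the trained model, the function names are hypothetical, and the text does not state whether clustering was run on the full 512-dimension vectors or on their 2D projections; the full vectors are used here.

```python
import numpy as np


def assign_labels(points, centers):
    """Return the index of the nearest center for every point."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k=4, num_iters=50, seed=0):
    """Cluster points into k groups with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(num_iters):
        labels = assign_labels(points, centers)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign_labels(points, centers), centers


# Synthetic stand-in for 2000 w vectors: four Gaussian blobs in 512 dimensions.
rng = np.random.default_rng(0)
true_centers = 3.0 * rng.standard_normal((4, 512))
w_vectors = np.concatenate(
    [center + rng.standard_normal((500, 512)) for center in true_centers]
)

labels, centers = kmeans(w_vectors, k=4)
print(np.bincount(labels, minlength=4))  # number of vectors in each cluster
```

The resulting labels could then be used to colour the projected dots and thumbnail frames, as in Fig. 4.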

GANSpace Method
This hypothesis has been confirmed by the GANSpace research [3]. In that research, the principal axes Vn are first computed by analyzing the latent space W via PCA (the maximum number of axes Vn equals the dimension of the latent space, 512). The modified vector w' is then computed from the original vector w by Eq. 1, where x is the scale parameter set by the user [3]:

$$w' = w + x V_n \tag{1}$$

High-Level Property Control
In this paper, the control process along principal axis 0 was visualized by setting a series of equally spaced parameter values (Fig. 5). Because axis 0 is taken to be the foremost axis of latent space W, it should present the most significant variation in the training set, which ranged from high-rise modern buildings to low-rise traditional residential houses. More examples controlled by different principal axes can be seen in Fig. 6. Ideally, each axis would control one significant feature. However, these features were still only partly disentangled in this experiment. This may be because of the insufficient quantity and diversity of the images in the training set.
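A sweep of this kind can be sketched as follows, assuming the GANSpace-style edit takes the single-axis form w' = w + x·Vn with unit-length principal axes Vn. The w samples are again a synthetic stand-in, and `principal_axes` and `edit_along_axis` are hypothetical helper names; in the real workflow each edited vector would be passed to the synthesis network to render an image.

```python
import numpy as np


def principal_axes(w_samples):
    """Return the principal axes of the samples as unit-length rows."""
    centered = w_samples - w_samples.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt


def edit_along_axis(w, axes, axis_index, x):
    """Move w by x units along one principal axis (assumed form w' = w + x * V_n)."""
    return w + x * axes[axis_index]


rng = np.random.default_rng(0)
# Synthetic stand-in for sampled w vectors (assumption, not model output).
w_samples = rng.standard_normal((2000, 512)) / (1.0 + np.arange(512))
axes = principal_axes(w_samples)

w = w_samples[0]
steps = np.linspace(-2.0, 2.0, 5)  # equally spaced scale parameters x
edited = np.stack([edit_along_axis(w, axes, 0, x) for x in steps])
print(edited.shape)  # one edited latent vector per value of x
```

Because the axes are orthonormal, moving along axis 0 leaves the coordinates of w on every other principal axis unchanged. Whether the rendered image changes in only one visible property is a separate question, and depends on how well the features are disentangled.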
S. Meng   Fig. 6. Examples of high-level property control along principal axes 2 and 5.

The Projection Method
An image's latent vector w in the StyleGAN2 model is required in order to control its high-level properties. However, because StyleGAN2 has no encoder network, a novel image (from outside the training set) cannot be directly encoded as w. To solve this problem, the Image2StyleGAN algorithm employs a pretrained VGG16 perceptual model. In detail, a latent walk first starts from the average w; the VGG16 model is then applied to compute the loss between the output image of the present w and the target image. Finally, gradient descent guides the latent walk, within the latent space W of the existing model instance, towards the w that generates the output image most similar to the target novel image [25].
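The optimization loop described above can be illustrated with a toy example. The sketch below replaces both networks with hypothetical linear stand-ins (a matrix A for the generator and a matrix B for the perceptual feature extractor) so that the gradient can be written analytically. A real projection would use the trained StyleGAN2 generator, VGG16 features and automatic differentiation. The dimensions, step size rule and zero starting vector are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_DIM, FEATURE_DIM = 16, 64, 32  # toy sizes, not StyleGAN2's

# Hypothetical linear stand-ins for the generator and the perceptual network.
A = rng.standard_normal((IMAGE_DIM, LATENT_DIM))
B = rng.standard_normal((FEATURE_DIM, IMAGE_DIM))
M = B @ A  # features of a generated image as a function of w


def generator(w):
    return A @ w


def features(image):
    return B @ image


def perceptual_loss(w, target_features):
    diff = features(generator(w)) - target_features
    return float(diff @ diff)


def loss_gradient(w, target_features):
    # Analytic gradient of ||M w - f||^2 with respect to w.
    return 2.0 * M.T @ (M @ w - target_features)


def project(target_image, w_start, num_steps=200):
    """Walk from w_start towards the w whose output best matches the target."""
    target_features = features(target_image)
    step_size = 0.5 / np.linalg.norm(M, 2) ** 2  # small enough for stable descent
    w = w_start.copy()
    losses = [perceptual_loss(w, target_features)]
    step_lengths = []
    for _ in range(num_steps):
        update = step_size * loss_gradient(w, target_features)
        w = w - update
        losses.append(perceptual_loss(w, target_features))
        step_lengths.append(float(np.linalg.norm(update)))
    return w, losses, step_lengths


w_average = np.zeros(LATENT_DIM)  # stand-in for the average w
# A "novel" target: close to, but not exactly, something the generator can produce.
target_image = generator(rng.standard_normal(LATENT_DIM)) + 0.1 * rng.standard_normal(IMAGE_DIM)

w_projected, losses, step_lengths = project(target_image, w_average)
print(losses[0], losses[-1])
print(step_lengths[0], step_lengths[-1])
```

In this toy setting the loss falls and the steps of the latent walk shorten as the walk approaches the best-matching w, because each step is proportional to the gradient of a loss that is flattening out.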

Projection and Control of Novel Image
In this paper, a white vacation house from outside the training set was projected into the W space, and the whole process can be observed through PCA projection (Fig. 7). The step length of the projected vector kept decreasing, because the loss minimized by gradient descent was decreasing progressively. The final projected image was not exactly the same as the input target image, but very close. After that, the high-level properties of the projected image could be controlled along the principal axes, just as for the generated images (Fig. 8).

Conclusion
This paper aims to remove obstacles to applying GAN-based image generation models in generative design workflows. By integrating a series of SOTA methods from computer vision, this research has improved the quality of generated building façade images and visualized the correlation between generated images and their feature vectors in latent space. In addition, by analyzing and manipulating the latent space of the trained model, high-level property control has been achieved for both generated and novel images.
However, some details of the generated images were still blurry or mismatched, and the property control did not achieve complete disentanglement. Both are possibly due to the insufficient quantity, quality and diversity of the training set images. The present training set contains fewer than 10K images, some of which were enlarged, whereas the original StyleGAN2 research used 200K full high-resolution training images. Better performance may be achieved with a larger training set and longer training time.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.