1 Introduction

Urban perception refers to people’s feelings about the urban visual environment, that is, their esthetic judgment of the urban scene (Weber, Schnier, & Jacobsen, 2008). In recent years, urban perception studies have developed rapidly, and their results have been applied to many aspects of urban construction, such as urban space esthetics, urban safety, and urban public health (Harvey et al., 2015; Helbich et al., 2019; Weber et al., 2008). The Chinese government has put forward the idea of people-oriented urban planning (Blaxland, Shang, & Fisher, 2014), and people’s perceptions of the city are of great value in future urban planning and design (Been et al., 2016; Cheng et al., 2017; Ozkan, 2014). Therefore, it is important to study methods of measuring urban perception to support decisions in urban management, urban planning and policymaking.

Traditionally, perception data are collected by social scientists through field investigation (Liu et al., 2015; Sampson, 2012). This method is time-consuming and expensive, and it covers only a small study area. Street view imagery objectively depicts the real urban landscape and has been proven to be effective and reliable data for measuring the urban environment (Long & Liu, 2017; Zhang et al., 2018a, b, c). In recent years, perception prediction based on street view images has received increasing attention. Salesses, Schechtner, and Hidalgo (2013) found that street view images can be used to assess the social and economic impact of the urban environment. Naik et al. (2014) proposed a method of scene perception prediction based on support vector regression; however, this method relies on predefined feature mapping. With the popularity of deep learning, many scholars have begun to introduce it into the study of urban perception (Dubey et al., 2016; Liu et al., 2017a, b). Porzi et al. (2015) showed that deep learning is superior to traditional feature description in predicting human perception. Naik, Raskar, and Hidalgo (2016) and Zhang et al. (2018a, b, c) used computer vision to simulate individuals and quantify urban perception scores.

At present, the dataset used in urban perception studies is mainly the Place Pulse dataset provided by MIT (Ordonez & Berg, 2014; Porzi et al., 2015). This urban perception dataset contains street view images from many cities around the world but lacks training samples for the cities of mainland China. There are obvious differences in architectural styles and town planning between the East and the West (Ashihara, 1983), so an urban perception model derived from this global dataset may not be applicable to the cities in mainland China. Therefore, Yao et al. (2019) proposed building a dedicated urban perception dataset for China. They constructed a human-machine adversarial scoring framework based on deep learning to support the perception study of cities and regions in mainland China and obtained interpretable results.

The above work shows that urban perception based on street view images and deep learning is the main research direction, which can deepen people’s understanding of large-scale urban scenes in a more automatic and effective way. There are two ways to obtain urban perception scores through deep learning. The first is to learn deep features automatically with a deep learning model and fit the urban perception scores. For example, Dubey et al. (2016) used an end-to-end model to directly extract high-dimensional features of street view images to predict urban perception scores. Zhang et al. (2018a, b, c) used street view images as the input and the collected scores as the output, and predicted the perception scores of the images in six dimensions.

The second is to obtain features constructed from expert knowledge (object proportions) with a deep learning model for scene semantics and then fit them with machine learning. For example, the FCN + RF-based model proposed by Yao et al. (2019) first uses a fully convolutional network (FCN) to identify the ground objects and then obtains the urban perception score from the proportions of ground objects with a random forest (RF). This method has been used in the field of public health (Wang et al., 2019a, b), and the results are reasonable and interpretable. Compared with the black-box mechanism of the CNN perception model, the FCN + RF-based model proposed by Yao et al. (2019) uses hand-crafted features derived from expert knowledge and has better interpretability.

The two deep learning methods obtain features and fit urban perception scores on different principles, which raises the question of how the two models differ. Therefore, this study poses two questions for further discussion: 1) What are the advantages and disadvantages of the two methods in the urban perception prediction task? 2) Can the results of automatic feature extraction by the deep learning model be reasonably explained? At present, there is no relevant study on these questions.

To address the above problems, this study constructs a typical model (CNN-based model) that can automatically learn deep features and compares its scores with those of the FCN + RF-based model (Yao et al., 2019). We train the CNN-based model on the China urban perception dataset (Yao et al., 2019). Then, we use street view images that we collected by video to explore the differences in perception scores between the two methods and analyze the models’ scene suitability. To verify the interpretability of the CNN-based model, we use points of interest (POI) and OpenStreetMap (OSM) data to explore the drivers that affect different perceptions.

2 Methodology

Figure 1 displays the flow chart of this study. First, we develop a mobile app that captures driving scenes and records real-time longitude and latitude data at the same time. Using the app’s time-series data, we process the mobile video files to obtain the image dataset of the study area. Second, we use the China urban perception dataset (Yao et al., 2019) as the training dataset to train six perception models (beautiful, wealthy, depressing, lively, safety, boring) and quantify the street view images of the study area. Third, based on the street view images collected while driving, we evaluate the accuracy of the model results and analyze the scene suitability of the CNN-based model and the FCN + RF-based model. Finally, POI and OSM road network data are used to analyze the driving factors of different urban perceptions and to verify the interpretability of the CNN-based model.

Fig. 1 Workflow of assessing street-level local urban perception

2.1 Mobile video access to street view images

From the time-series data collected by GPS, the timestamps of adjacent points are calculated. The street view images of the study area are then extracted from the video file by using this timestamp information. The code is freely available at https://github.com/Leitast/LocationListener. In this study, a mobile phone is placed in a vehicle for video recording. The bottom of each frame, which includes the vehicle, is cropped to ensure the experiment’s reliability.
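The frame-extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the repository above: the function names, the frame rate, and the fraction of the frame that is cropped are hypothetical values chosen for the example.

```python
from typing import List, Tuple


def frame_indices(gps_times: List[float], video_start: float, fps: float) -> List[int]:
    """Map each GPS fix timestamp (in seconds) to the nearest video frame index.

    Fixes recorded before the video started are skipped.
    """
    indices = []
    for t in gps_times:
        offset = t - video_start
        if offset < 0:
            continue  # this fix has no corresponding frame
        indices.append(round(offset * fps))
    return indices


def crop_box(width: int, height: int, bottom_fraction: float) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the region kept after removing
    the bottom strip of the frame that shows the vehicle."""
    keep_height = int(height * (1.0 - bottom_fraction))
    return (0, 0, width, keep_height)
```

For example, with an assumed frame rate of 30 frames per second, a GPS fix recorded 2 s after the start of the video maps to frame 60. The crop box can be passed to any image library that accepts a (left, top, right, bottom) rectangle.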

2.2 CNN-based urban perception model

The depth of the network is critical to the performance of the model. As the number of network layers increases, the network can extract more complex features (Simonyan & Zisserman, 2014). To extract sufficient image features, we construct an end-to-end CNN model to represent urban perception, following the structure of VGGNet (Simonyan & Zisserman, 2014). The parameters and structure of that model have proved reliable in many image feature extraction tasks (Ha et al., 2018; Liu et al., 2018; Lu et al., 2017). However, increasing the network’s depth may lead to degradation, that is, a decrease in both training and testing accuracy (Monti, Tootoonian, & Cao, 2018). Considering the small number of training samples in this study, the network depth is reduced to avoid degradation. The urban perception model is constructed by repeatedly stacking 3 × 3 convolution kernels and 2 × 2 max pooling layers. Batch normalization is used to speed up the convergence of the model and to mitigate vanishing gradients; it also acts as a regularizer and, in some cases, reduces the need for dropout (Ioffe & Szegedy, 2015).

Traditional CNN models perform well in image classification (He et al., 2016; Krizhevsky, Sutskever, & Hinton, 2012; Simonyan & Zisserman, 2014). In the field of urban study, CNNs have been used for the semantic segmentation of urban traffic scenes and land-use change analysis (Deng et al., 2017; Zhai et al., 2020). As shown in Fig. 2, to preserve the image feature structure, we replace the softmax layer, which is responsible for multiclass classification, with a fully connected layer that has a single output, so that the model regresses a score end to end. Given a street view image as input, the model extracts the topological features of the image and outputs the urban perception score.

Fig. 2 The computational framework of the CNN model
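To illustrate how stacking 3 × 3 convolutions and 2 × 2 max pooling layers shapes the feature maps that feed the single-output regression layer, the following sketch traces the tensor sizes block by block. The input size, the number of blocks, and the channel widths are assumptions made for the example; the text does not state them. The size rule used is the standard one, out = floor((in + 2p - k) / s) + 1, with kernel size k, padding p and stride s.

```python
from typing import List, Tuple


def trace_shapes(input_size: int, channels: List[int],
                 convs_per_block: int = 2) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each conv ... conv -> pool block.

    Assumes 3x3 convolutions with stride 1 and padding 1 (size preserved)
    and 2x2 max pooling with stride 2 (size halved, rounded down).
    """
    size = input_size
    shapes = []
    for out_channels in channels:
        for _ in range(convs_per_block):
            size = (size + 2 * 1 - 3) // 1 + 1  # 3x3 conv, padding 1, stride 1
        size = (size - 2) // 2 + 1              # 2x2 max pool, stride 2
        shapes.append((out_channels, size, size))
    return shapes


def flattened_features(shapes: List[Tuple[int, int, int]]) -> int:
    """Number of inputs seen by the final fully connected layer."""
    c, h, w = shapes[-1]
    return c * h * w
```

With an assumed 224 × 224 input and four blocks of 32, 64, 128 and 256 channels, the spatial size shrinks to 112, 56, 28 and 14, so the final fully connected layer would map 256 × 14 × 14 = 50176 features to one perception score.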

2.3 FCN + RF-based urban perception model

The FCN + RF-based model (Yao et al., 2019) is used in this study for comparative analysis. The FCN predicts the semantic class of each pixel in the image (such as sky, road, car, and building) and thereby generates object-level segmentation results (Badrinarayanan, Kendall, & Cipolla, 2017; Cordts et al., 2016; Long, Shelhamer, & Darrell, 2015). Yao et al. (2019) selected the annotated images in the ADE20K scene parsing and segmentation database (Zhou et al., 2017; Zhou et al., 2019). With a street view image as the input of the FCN, the proportion of each of the 151 categories in the image is obtained. Finally, these 151 proportions are used as the input of the RF to obtain the score for each urban perception. The model has been applied in public health (Wang et al., 2019a, b) and proved to be reliable and effective. For the detailed structure of the model, refer to Yao et al. (2019). The software can be found at http://www.urbancomp.net/2020/08/03/semantic-segmentation-software-for-visual-images-based-on-fcn.
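The feature-construction step between the FCN and the RF can be sketched as below. This is a minimal illustration rather than the implementation of Yao et al. (2019): it assumes the segmentation output is available as a two-dimensional grid of integer class labels in the range 0 to 150, and the function name is hypothetical.

```python
from collections import Counter
from typing import List

NUM_CLASSES = 151  # number of categories stated in the text


def class_proportions(label_map: List[List[int]],
                      num_classes: int = NUM_CLASSES) -> List[float]:
    """Convert a per-pixel label map into a vector of class proportions."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("label map is empty")
    if any(not 0 <= label < num_classes for label in counts):
        raise ValueError("label outside the expected class range")
    # One entry per class; classes absent from the image get proportion 0.
    return [counts.get(c, 0) / total for c in range(num_classes)]
```

The resulting 151-element vector sums to 1 and would serve as one row of the feature matrix on which a random forest regressor is fitted for each perception.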

2.4 Interpretability analysis of urban perception based on POI and OSM data

POI data have been applied to urban functional area identification (Yuan, Zheng, & Xie, 2012). Palczewska et al. (2014) demonstrated the correlation between urban function and urban perception using random forest. Yao et al. (2019) also verified the interpretability of the FCN + RF-based model from the perspective of urban function based on POI or OSM and Random Forest. Therefore, we get the urban perception distribution using the CNN-based model, analyze the correlation between the urban functional areas and the simulation results, and explore the model results’ interpretability through the feature importance function of random forest.

2.5 Accuracy assessment

When comparing the models and when using POI and OSM data to analyze the correlation between urban function and urban perception, this study uses the mean absolute error (MAE), root mean squared error (RMSE) and Pearson correlation coefficient (Pearson R) to quantify the agreement between the predictions and the ground-truth values. The MAE, RMSE and Pearson R are given by Eqs. (1) to (3), respectively.

$$ MAE=\frac{1}{n}\sum \limits_{i=1}^n\left|{y}_i-{\hat{y}}_i\right| $$
(1)
$$ RMSE=\sqrt{\frac{1}{n}\sum \limits_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2} $$
(2)
$$ Pearson\ R=\frac{\sum_{i=1}^n\left({y}_i-\overline{y}\right)\left({\hat{y}}_i-\overline{\hat{y}}\right)}{\sqrt{\sum_{i=1}^n{\left({y}_i-\overline{y}\right)}^2}\sqrt{\sum_{i=1}^n{\left({\hat{y}}_i-\overline{\hat{y}}\right)}^2}} $$
(3)

where \( n \) is the number of samples, \( {y}_i \) is the ground-truth value, \( {\hat{y}}_i \) is the predicted result, \( \overline{y}=\frac{1}{n}{\sum}_{i=1}^n{y}_i \) is the mean of the ground-truth values, and \( \overline{\hat{y}}=\frac{1}{n}{\sum}_{i=1}^n{\hat{y}}_i \) is the mean of the predictions.
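For reference, Eqs. (1) to (3) can be computed as in the following sketch, which uses only the Python standard library. The function names are chosen for the example and are not taken from the study's code.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, Eq. (1)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, Eq. (2)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation coefficient, Eq. (3)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("Pearson R is undefined for constant input")
    return cov / (math.sqrt(var_t) * math.sqrt(var_p))
```

Note that MAE and RMSE are in the units of the scores, whereas Pearson R is unitless and insensitive to a constant offset or scale in the predictions, which is why the three measures are reported together.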

3 Study area and data

As the largest political, economic and cultural center in Central China (Sun, Chen, & Niu, 2016), Wuhan has become one of the most rapidly developing cities in China. As shown in Fig. 3, this study selects two streets, LuMo Road and Gaoxin Avenue, which run through the suburbs and urban center of Wuhan. In Fig. 3(c), the red line represents LuMo Road, which connects the school and the business center, and the yellow line represents Gaoxin Avenue, which connects the satellite city (Zuoling community) and the central city. From east to west, Gaoxin Avenue passes through undeveloped suburbs, satellite cities, science and technology industrial parks newly planned by the government, a mature commercial district, and a residential area. Therefore, the street view images collected from these two streets cover a variety of urban functional areas in Wuhan and are well suited to verifying the models in the comparative analysis.

Fig. 3 The study area and its location in Hubei Province, China: (a) the geographic location of Wuhan City in Hubei; (b) the study area; (c) LuMo Road (red lines) and Gaoxin Avenue (yellow lines). The white lines are the main roads in the study area obtained from openstreetmap.org

We obtain the street view images of 3592 sample points for perception analysis. Among them, 1154 sample points are collected from LuMo Road, and 2798 sample points are collected from Gaoxin Avenue.

Figure 4 shows the street view images used in this study. (A), (B), (C) and (D) are sample images of the China urban perception dataset (Yao et al., 2019) used for model training. Each image in the dataset is scored on six perceptions (wealthy, safety, lively, beautiful, boring and depressing) in the range of 0–100 by volunteers who have a good understanding of the local socio-economic background. The images in the dataset are geographically close to the study area, so the dataset is well suited to the perception analysis in the study area. (E) and (F) are street view images in the study area collected through mobile video.

Fig. 4 Street view images involved in this study: (A)-(D) are samples of street view images from the training dataset; (E1)-(E4) and (F1)-(F4) are street view images obtained by a mobile phone. (E1)-(E4) are images of the Wuhan Optical Valley (CBD area), and (F1)-(F4) are images of the campus of the China University of Geosciences

POI and OSM data are also used in our study. Gaode, the largest online map service provider in China, has complete POI resources (http://amap.com). We obtain 24 categories of POI data in the study area (https://lbs.amap.com/api/webservice/download). To facilitate the follow-up analysis, we merge similar classes and calculate the kernel density of each (Fig. 5). The resulting 12 POI categories and the OSM road network data can be used to describe and analyze the social economy and infrastructure (Liu et al., 2017a, b; Yao et al., 2018).

Fig. 5 Distribution of Gaode POIs and OSM in the study area using kernel density. POI categories: traffic facilities (TRA), tourist attractions (TOU), life service (LF), shopping (SHP), residential communities (RES), public facilities (PUB), medical service (MHS), education facilities (EDU), financial services (FIN), government (GOV), entertainment facilities (ENT), commercial and business (COM). ROAD indicates the density of the road network
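As an illustration of the kernel density calculation applied to the POI categories, the sketch below evaluates a two-dimensional Gaussian kernel density at a query location. The kernel type, the bandwidth, and the use of planar coordinates (for example, metres in a projected coordinate system) are assumptions; the text does not specify the settings actually used.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]


def gaussian_kernel_density(query: Point, points: Iterable[Point],
                            bandwidth: float) -> float:
    """Gaussian kernel density at `query`, in points per unit area.

    Each POI contributes exp(-d^2 / (2 h^2)) / (2 pi h^2), where d is its
    distance to the query location and h is the bandwidth.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    qx, qy = query
    total = 0.0
    for px, py in points:
        d2 = (qx - px) ** 2 + (qy - py) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return total / (2.0 * math.pi * bandwidth ** 2)
```

Evaluating this function at each street view sample point, once per POI category, would give one density feature per category; a larger bandwidth produces a smoother surface in which distant POIs carry more weight.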

4 Results

4.1 Comparison of model accuracy based on street view dataset

In this study, the China urban perception dataset (Yao et al., 2019) is divided into a training set (80% of the data) and a test set (20% of the data). We use the training set to fit the models and the test set to evaluate their predictions. It should be noted that the parameters of the FCN module in the FCN + RF-based model are set by referring to Long et al. (2015), while the parameters of the RF part are tuned by grid search. These models and parameters have been widely used in image semantic segmentation (Zhou et al., 2016), segmentation of street view images (Middel et al., 2019), remote sensing image classification (Han et al., 2020; Piramanayagam et al., 2016), and public health (Zamani Joharestani et al., 2019), and have proved to be effective. For a fair comparison, we set the hyperparameters of the CNN-based model by referring to existing models (Bulat & Tzimiropoulos, 2016; Simonyan & Zisserman, 2014).
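The 80/20 division can be sketched as a seeded random split. The shuffling strategy and the seed are assumptions made for reproducibility of the example; the text does not state how the samples were assigned to the two sets.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(samples: Sequence[T], train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples with a fixed seed and split them into two lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed keeps the two sets identical across runs, so that both models are trained and evaluated on the same images.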

Table 1 shows the accuracy results of the six perception models. The Pearson R of all perceptions is greater than 0.9, which shows that the CNN-based model proposed in this study fits the perception scores well. The results of the FCN + RF-based model are shown in Table 2. We find that the fitting results of the CNN-based model (average RMSE across perceptions of around 6.5) are more accurate than those of the FCN + RF-based model (average RMSE across perceptions of approximately 9.1). The CNN can automatically extract features from images (Sun, Li, & Huang, 2017). These features represent the color, contour, texture, and spatial structure of the objects in images (Hu et al., 2015; Jiao et al., 2017; Kim & Pavlovic, 2016; Sahiner et al., 1996). Compared with the method of calculating the proportion of ground objects, the CNN-based model can also learn highly representative and hierarchical image features from sufficient training data (Shin et al., 2016).

Table 1 Testing accuracy of the urban perception estimation via CNN
Table 2 Testing accuracy of the urban perception estimation via FCN + RF

4.2 Comparison of model results based on real application environment

According to Tables 1 and 2, the accuracy of the CNN-based model is higher than that of the FCN + RF-based model. However, the perception scores simulated by the two models have different characteristics, so it is necessary to analyze the differences and similarities between the model scores in a specific application environment. Using the street view data collected by the app, we simulate the real environment and compare the two models. Figure 6 shows the distribution of the six perceptions in the study area obtained from the two models.

Fig. 6 The distribution of urban perception results in the study area. (a1)-(f1) are the perception results of the CNN-based model, and (a2)-(f2) are the perception results of the FCN + RF-based model. The scores (range 0–100) are divided into five equal intervals. The cube height represents the perception score. [0,20] is dark blue, (20,40] is sky blue, (40,60] is yellow, (60,80] is orange, and (80,100] is brown
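The five-interval color scheme of Fig. 6 can be written as a small lookup, shown here as a sketch with a hypothetical function name. The intervals and colors follow the caption, with each interval closed on its upper bound.

```python
# Upper bound of each interval and its color, as listed in the Fig. 6 caption.
BINS = [(20, "dark blue"), (40, "sky blue"), (60, "yellow"),
        (80, "orange"), (100, "brown")]


def score_color(score: float) -> str:
    """Return the Fig. 6 color class for a perception score in [0, 100]."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in the range 0-100")
    for upper, color in BINS:
        if score <= upper:  # intervals are closed on the right
            return color
    raise AssertionError("unreachable: score already validated")
```

Under this scheme a score of exactly 20 falls in the first class (dark blue), whereas any score just above 20 falls in the second (sky blue).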

From Fig. 6, we find that the score distributions of the two methods are strongly similar across the study area (except for the beautiful scores). On the left side of the study area, the wealthy, lively and safety scores are significantly higher than those on the right. The boring scores are high across the whole study area. Under the city overpass (the brown line area on the left in Fig. 6 (e1) and (e2)), the depressing scores increase significantly.

The urban streetscape reflects the spatial distribution of landscape elements in a given area. The spatial distribution of the landscape is potentially related to regional land use and functional heterogeneity (Zhang et al., 2018a, b, c). Therefore, urban perception simulated from street view images is also affected, to a certain extent, by the land use and urban heterogeneity pattern. To further analyze the perception differences in real scenes, we select several scenes with different land use and heterogeneity patterns (Fig. 7).

Fig. 7 The comparison of the perception scores by the CNN-based model and FCN + RF-based model for the samples in case study areas: (a1)-(a10) are scenes with strong spatial heterogeneity; (b1)-(b10) are scenes with weak spatial heterogeneity

Urban central areas usually have high scene complexity, with diverse object types and mixed urban functions, and therefore a high heterogeneity pattern (Deng et al., 2020; Irwin & Bockstael, 2007). In contrast, suburban areas usually have low scene complexity, with relatively homogeneous objects and simple urban functions (Zhou, Pickett, & Cadenasso, 2017), and therefore a low heterogeneity pattern (Irwin & Bockstael, 2007). The two models differ evidently in their perception scores between the commercial center and the suburbs.

In the commercial center, the FCN + RF-based model is more consistent with the real scene in the perception of beautiful, lively and wealthy. However, the CNN-based model achieves more reasonable results in the suburbs. Green plants and sky in the street scene are positive visual elements, which have a positive impact on the feelings of beautiful, quiet and happy (Kaufman & Lohr, 2002; Quercia, O'Hare, & Cramer, 2014) but are negatively correlated with depressing (Helbich et al., 2019; Zhang et al., 2018a, b, c). (a1), (a3), (a4) and (a5) in Fig. 7 are mixed scenes of complex landscape and natural landscape. Because of the trees and other natural landscapes, people have a higher beautiful perception and a lower depressing perception. In these scenes, the FCN + RF-based model gives a higher beautiful score, whereas the CNN-based model gives a significantly abnormal depressing score in Fig. 7 (a4). In contrast to (a3), (a4) and (a5) in Fig. 7, (b1) and (b4) in Fig. 7 are under an overpass on suburban roads, which blocks the sky and other elements, indicating that large-scale human-made features can have a negative impact (Zhang et al., 2018a, b, c). By extracting the image’s spatial structure and color features, the CNN-based model gives a high depressing score, which is significantly higher than that of the FCN + RF-based model.

There is a close relationship between traffic/crowd flow and the liveliness of the environment (Yao et al., 2019). Downtown areas tend to have higher traffic and crowd flow (Zhang et al., 2018a, b, c; Zhang et al., 2019). For example, (a3), (a6), (a7) and (a9) in Fig. 7 are located on urban trunk roads with dense traffic flow; (a5) and (a10) in Fig. 7 are located on the campus with dense pedestrian flow. The lively scores of the FCN + RF-based model in these scenes are higher than those of the CNN-based model. In the suburbs, however, the objects are homogeneous, and people and vehicles are rare. For scenes such as (b2) and (b3) in Fig. 7, which were captured while driving on a suburban highway, the FCN + RF-based model gives a high lively score, which is inconsistent with the real scene experience. Building density has a positive impact on the wealthy perception (Yao et al., 2019). The FCN + RF-based model gives higher wealthy scores in (a2) and (a8) in Fig. 7, which are complete landscape scenes. This is because they are located in the city center’s commercial area, where the prosperity of buildings is higher than that of other areas. In these scenes, the wealthy score of the FCN + RF-based model is more reasonable.

4.3 Interpretability analysis of the model

There is a specific correlation between urban function and urban perception, and POI data can reflect urban functional areas (Hu & Han, 2019; Zhang, Du, & Wang, 2017). Therefore, based on POI and OSM data, we analyze the interpretability of the models’ results from the perspective of urban function. Tables 3 and 4 show the urban perception results of the CNN-based model and the FCN + RF-based model, respectively, fitted by POI and OSM data. The scores of both models are fitted well for each perception (R² > 0.89, Pearson R > 0.94). Our results show that there is a strong nonlinear relationship between urban perception and urban function. POI and OSM data can accurately estimate the distribution of urban perception and provide an effective means of evaluating it. By comparing the results of Tables 3 and 4, we find that the fitting accuracy for the CNN-based model is better than that for the FCN + RF-based model (RMSE of 0.016 for the CNN versus 0.021 for FCN + RF).

Table 3 Testing accuracy of the urban perceptions (CNN) based on the POI and OSM densities via RF
Table 4 Testing accuracy of the urban perceptions (FCN + RF) based on the POI and OSM densities via RF

Yao et al. (2019) showed that the FCN + RF-based model has good interpretability by using POI data and the feature weight ranking of random forest. This study follows Yao et al. (2019) to verify the interpretability of the CNN-based model. The weight relationship between urban perception and POI or OSM categories is shown in Table 5.

Table 5 Fitting weights between perceptions and POI or OSM categories: government (GOV), life service (LF), medical service (ME), public facilities (PUB), residential communities (RES), traffic facilities (TRA), financial services (FIN), entertainment facilities (ENT), tourist attractions (TOU), road (RO), commercial and business (COM), shopping (SHP), and education facilities (EDU). A gradient background color from blue to yellow to red indicates a gradual increase in value

Through the analysis of the results, we find that residential communities and roads are the most critical factors affecting urban perception in the case study area. In Yao et al. (2019), the six perceptions are greatly affected by EDU and GOV. This study finds that different urban perceptions are affected by different urban functions. ENT has a considerable weight for the wealthy perception, consistent with the general recognition that entertainment venues are located in developed areas. Because depression is closely related to academic pressure, education areas have a great impact on the depressing perception (Ang & Huan, 2006). The safety and lively perceptions have very strong relationships with residential communities. This is because communities have frequent flows of people and reasonable security measures, which bring great comfort to people (Holmberg, 2005).

5 Discussion

In this study, an end-to-end urban perception evaluation model based on street view images is proposed by building a multilayer CNN. We choose the FCN + RF-based model as the comparison method and analyze the difference in urban perception results between the two models in the study area. By combining POI data and OSM data, we calculate the weight of the driving factors that affect the urban perception, and analyze the interpretability of the results based on the CNN-based model.

Both models achieved good prediction accuracy (Pearson R > 0.78). However, the CNN-based model is slightly more accurate than the FCN + RF-based model: its RMSE is 2.6 lower. In addition, compared with the method of using semantic segmentation first and then using random forest to determine the perception scores, the CNN-based model obtains the perception scores directly from the input images, which is faster and easier.

By extracting high-dimensional features, the CNN-based model can capture a highly nonlinear correlation between urban perception and urban functions. Through the study of the correlation and the weight between urban function and urban perception, we find that there is spatial similarity between the distribution of urban perception and the distribution of urban functional areas. POI data and OSM data can accurately estimate the distribution of urban perception (the average RMSE for the six perceptions is 0.016). The case study quantitatively determines the impact of urban functions on urban perception and shows that the CNN-based model proposed in this study is more reasonable for evaluating local urban perception. The FCN + RF-based model ignores the spatial topological features of the ground objects, which leads to its lower correlation with urban function than the CNN-based model (the average RMSE for the six perceptions is 0.021).

The CNN-based model is more suitable for scenes with weak spatial heterogeneity, such as small cities or suburbs in central China. The FCN + RF-based model has more advantages for urban areas with strong spatial heterogeneity, such as the developed metropolises in central China. The CNN-based model can extract the features of the ground objects, such as color, texture and density (Hu et al., 2015; Jiao et al., 2017; Kim & Pavlovic, 2016; Sahiner et al., 1996). These detailed features are more reasonable in representing scenes with weak spatial heterogeneity. The FCN + RF-based model is suitable for scenes with strong spatial heterogeneity. When scenes contain strong spatial heterogeneity, the features of the ground objects become fuzzy, and the impact on urban perception will decrease. By directly calculating the ratio of the ground objects, a more accurate score can be obtained. Therefore, the two models have their own advantages. The method needs to be chosen according to the actual spatial heterogeneity of the urban environment to obtain a more accurate urban perception score.

There are still some deficiencies and many opportunities for future study. First, the mobile video is shot along the city streets. Considering that the front view plays a leading role in commuting, this study only uses the front view from the car and does not take the left or right views into account. However, the areas beside the road, such as parks and communities, also have a certain impact on people’s cognition (Abkar et al., 2010; Oetzel et al., 2011). Because the purpose of this study is to analyze the applicability of the two models, front-view images are sufficient to explain the rationality of the urban perception results. In future studies, we will consider adding the two side views and more street views from different blocks, as well as extending our study to actual application scenarios.

Second, urban perception is related not only to street-view but also to other factors in the city, such as season, temperature, humidity and noise (Gunnarsson et al., 2017; Hong & Jeon, 2015). Therefore, in future work, more evaluation objectives will be considered in the study of urban perception, and an end-to-end perception model will be adopted to obtain higher accuracy and stronger interpretability.

Third, there is currently no official Chinese urban perception dataset. Therefore, when comparing different methods, we can only compare and analyze the existing small-scale urban data. In the future, we will collect street view images of other regions (urban, suburban, rural) using the collection method proposed in this study and construct a Chinese perception dataset to make the comparison more persuasive.

6 Conclusion

To examine the consistency and interpretability of the prediction results of two different urban perception models, this study extracts street view images of the study area from mobile video, constructs a CNN-based model and an FCN + RF-based model by using the China urban perception dataset, and compares the results of the two models on two urban roads in Wuhan, China. Both models reach good prediction accuracy (the Pearson R of the CNN-based model is 12% higher than that of the FCN + RF-based model, and the RMSE of the CNN-based model is 2.6 lower than that of the FCN + RF-based model). This shows that the proposed CNN-based model is effective for urban perception assessment. By using POI data and OSM data for auxiliary analysis, we find that there is a strong nonlinear correlation between urban function and urban perception. By extracting high-dimensional features, the CNN-based model has more advantages in predicting urban perception and has a higher degree of correlation with urban function. The scene suitability of the two models is different: the CNN-based model is suitable for scenes with weak spatial heterogeneity, and the FCN + RF-based model is suitable for scenes with strong spatial heterogeneity. The approach can accurately and quickly identify the perceptions to which residents are exposed and will encourage urban planners to integrate the concept of urban perception into planning practice. The results will provide decision support for government managers in urban planning to achieve more sustainable and human-oriented urban development.