Introduction

The prediction of urban areas and the corresponding growth [4, 9] utilising machine learning [1, 12, 22, 23] is a vital topic in built environments. This work, as a first pilot study, focuses on the prediction of demographic structures, more specifically the distribution of university graduates, in urban areas, as they can be used as a proxy for the wealth distribution of cities as well as for the attractiveness of neighbourhoods [6, 13, 20]. Satellite images can assist in deriving visual characteristics (e.g., whether a residential area is generally attractive) without the assistance of any other (demographic) statistical data, distance measures, points of interest (POIs) or any other relevant GIS data. Such visual characteristics include the material of the buildings in a residential area and the immediate neighbourhood or vegetation in the area, as the presence of vegetation, street trees, parks, forests, open spaces, and bodies of water usually enhances residential areas [24, 30]. We assume that the presence of the aforementioned types of land cover is an indicator of a wealthy neighbourhood and thus also correlates with the settlement of university graduates.

Satellite images and the capabilities of today’s computer vision techniques, in combination with machine learning, play an increasingly important role in economic evaluation. Some examples of statistical economic variables that can be predicted using computer vision are the gross domestic product (GDP) [5], economic growth [17] and poverty [51]; computer vision can also be used in detecting, estimating and monitoring socioeconomic dynamics such as urbanisation, population and economic activity [3] using nighttime luminosity. The focus of our research is on predicting the university graduate ratio (GR) in an urban population (the city of Vienna, Austria) using satellite images that each depict a 250 m × 250 m area, which is the smallest population grid cell size available in Austria. Figure 1 shows example satellite images from our dataset with different densities of university graduates and the related visual characteristics.

Fig. 1

Where do university graduates live? Four example areas from our dataset are represented by high-resolution satellite images (4285 × 4285 pixels). From left to right: Image (a), with a high graduate ratio of 51.2%, in the city centre of Vienna. Image (b), with a high graduate ratio of 32.4%, in the outskirts of the city. Image (c), with a graduate ratio of 10.0%, near an industrial zone. Image (d), which displays panel buildings with a graduate ratio of 3.3%

Our approach enables us to constantly monitor economic data in a populated area and to circumvent the modifiable areal unit (MAU) problem, which arises due to administrative borders and other spatial limitations [41, 47]. The MAU problem arises when spatially aggregated data are analysed, as the size and scale of the aggregation district can lead to statistical bias. The MAU problem is resolved when the trained neural network can be applied to arbitrary locations, independent of the size and scale of any predefined grid. Part of the increased accuracy needed to circumvent MAU problem distortions is the use of the smallest possible spatial area when predicting sociodemographic data [50]. For a review of the MAU problem and suggested solutions, see [8, 40].

Additionally, most evaluations conducted in the field are executed at a highly aggregated level (e.g., 1 km × 1 km satellite image grid cells [21]) to predict economic variables. Our approach differs in the sense that we focus on a much more fine-grained analysis of the images, which can enable a more precise economic analysis in the future. This small-scale analysis together with the free positioning of the satellite image grid can address problems and provide possible solutions (e.g., varying the shape and size of the investigated area) [8, 40, 52]. Furthermore, we do not make any assumptions about which visual features are relevant, and our work is fully data driven. This means that characteristic visual patterns are discovered and learned autonomously and do not have to be predefined.

The overall aim of this article is to combine methodology from the disciplines of urban economics and computer vision to realise innovative services for urban analysis. We enable the analysis of an important economic variable, namely the settlement of university graduates, with a low-resource technology, assisting in urban policy analysis. An advantage of this approach is that a properly trained computer vision model can be applied to any other raster of satellite images. This makes the proposed methodology applicable to any size and position of a corresponding raster and therefore relaxes the dependency on the existence of suitable GIS data. As a computer vision-based prediction with a previously trained model is a time-saving procedure, the efficiency of urban planning and development processes can be increased.

Related Work

Computer vision has gained great importance in recent years due to the increasing availability of geo-data [49]. It includes various techniques for detecting and monitoring the physical properties of an area by photographing or measuring its reflected and emitted radiation at a large distance from the Earth’s surface [43]. Computer vision in this area has thus far been used for land-use classification and segmentation (e.g., detecting and segmenting meadows, forests, and roads) [11, 35, 38, 53], building (footprint) detection [34, 42], building detection and classification (e.g., differentiating residential, commercial, single-family houses, and apartments) [27, 32, 45], building roof analysis (e.g., to estimate the potential for solar power systems) [10, 33] and 3D city modelling [15, 25]. For a comprehensive review of computer vision and neural network applications in the real estate sector, urban systems and built environments, see [26] and [14]. We contribute to this research in the sense that we establish a novel, visually grounded approach to predict the spatial distribution of residents according to their education level from satellite images.

The relevance of satellite images in economic prediction was assessed in [7], where the authors provided proof of a relationship between economic development and deforestation utilising satellite images of forest cover. They found that the relation between forest cover and GDP is the key determinant of per capita income differences among countries.

The authors of [21] utilised daytime satellite images to predict socioeconomic variables for consumption expenditure and asset wealth by employing a CNN. Trained separately for each country, the model explains 37 to 55% of the variation in the average household consumption of the four countries and 55 to 75% of the household asset wealth variation across all five compared countries. For their investigation, the researchers used 1 km × 1 km daytime satellite images with up to 10 km of noise in the ground-truth data to protect the survey respondents. In our work, we focus on much smaller areas for a more fine-grained analysis and employ an accurate high-resolution ground truth. Similar to their methodology, we employ transfer learning to increase the speed of learning meaningful visual features from the satellite images. Research in urban economics leveraging computer vision and machine learning does not only focus on the prediction of economic variables. The authors of [39], e.g., predicted the perceived safety of US cities by analysing street view images. They found that the visual appearance of a neighbourhood can influence the liveability for the neighbourhoods’ inhabitants. This task is related to our article, as university graduates tend to agglomerate in areas with higher quality of life and thus in safer neighbourhoods [6, 13, 20].

The cost of living plays a negligible role in the location choice, in contrast to the prevalent wage levels [31]. This is in line with research that suggests that rural population growth is reduced by schooling, as highly educated individuals will migrate into urban areas, where they can expect a higher return on their education [19, 36].

The agglomeration of human capital or knowledge has been addressed in the theoretical foundations of the new economic geography, with its core-periphery model focusing on the spatial concentration and specialisation of production factors [28]. A derivation of the core-periphery model yields theoretical proof of the agglomeration of skilled workers [37], where education is among the determinants of skill (see Footnote 1).

Reference [46] follows a similar idea as our work, as they predict the socioeconomic profile of a city using satellite images and a neural network. They find proof of visual patterns that correlate with the socioeconomic classes of the inhabitants. Nonetheless, the methodology employed differs significantly from our approach, as the authors use a different social group and a more complex preparation of the ground truth; they focus their predictions on the presence of certain objects (e.g., swimming pools). Thus, our approach has broader applicability, as it is not dependent on the presence of predefined objects.

Recently, satellite images have become an increasingly popular source of data in the field of urban economics. The migration and agglomeration of production factors, in this case education or skill, is a prevalent topic in the field of economics that has been examined in different ways (e.g., [19, 36] as well as [37] on a theoretical level). Overall, the question of whether academic agglomeration is reflected visually in satellite images is currently open. We examine the agglomeration of graduates on a fine scale by analysing small grids of satellite images together with population statistics.

Methodology

For our investigation, we employ population data together with satellite images of the city of Vienna in Austria. The objective is to predict the spatial distribution of university graduates on a grid cell level using a convolutional neural network (CNN). Our approach is based on the assumption that satellite images of residential areas contain visual indicators that allow an estimation of the proportion of university graduates in a defined residential area.

An overview of our approach is depicted in Fig. 2 and described below (the numbers in the description are linked to the figure). A detailed description is given in Sections 3.1, “Data Preparation and Pre-Processing”, and 3.2 “Training and Prediction”.

Fig. 2

Flow chart summarising the proposed method. Our approach consists of three main parts: “Data Preparation” (alignment of meta-data and satellite images), “Pre-Processing” (detection of inhabited areas) and “Training and Prediction” (prediction of graduate ratios)

First, we obtain the regional statistical grid data (1) for the study location. Next, we extract the grid coordinates to collect the respective satellite images (2). Then, we align the statistical grid data (population count and university graduate count per grid cell) with the satellite images. As a result, we obtain cell-based satellite images together with the computed graduate ratios that serve as the target variables (the ground truth) for our experiment (3). The ground truth is necessary to train and evaluate our model, which aims at learning the relationship between the visual modality (images) and the statistical measure (density of graduates).

In the next step, we identify grid cells with no or very little population density. This can be performed automatically by training a model for the detection of low-population-density areas or by using information from the ground truth, i.e., using a certain threshold (20 inhabitants, in this paper) to differentiate residential and non-residential cells (4). The non-residential cells are excluded from further analysis in our approach.

For the training of our density prediction network, we employ a five-fold cross-validation protocol (5), where we obtain five independent predictions of graduate ratios that together cover the populated area of the city. After predicting all five folds, we obtain a prediction of the graduate ratios for the entire city. Combining the predictions with the ground truth, we compute confusion matrices to analyse misclassifications and to estimate the overall performance (6).

Data Preparation and Pre-Processing

Demographic data

For our investigation, we use 250 m × 250 m regional statistical grid data (see Footnote 2), which are laid out across the entire federal territory and are made publicly available by Statistik Austria, the statistical office of the Republic of Austria. The grid is independent of administrative boundaries and therefore allows for a more subject-related delimitation of territories, which also solves the aforementioned MAU problem [47]. In future research, the determination of grids can be done independently on the basis of size. The statistical grid data with the corresponding satellite images are needed only for training and not for prediction. The dataset includes the grid cell coordinates with the population count as well as the count of university graduates.

Satellite data

To obtain suitable image data, we need high-resolution satellite images for the study location. We retrieve satellite images from the Google Tile Server (see Footnote 3) of size 4285 × 4285 pixels (at a resolution of 1 pixel ≈ 5.8 centimetres) and resize them to 224 × 224 pixel images that match the regional statistical grid data (at a resolution of 1 pixel ≈ 111.6 centimetres). Resizing is necessary, as the employed neural network processes 224 × 224 pixel images.
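The two resolutions quoted above follow directly from the cell size and the image side lengths. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative and not taken from the paper.

```python
# Ground distance covered by one pixel of a grid-cell image,
# before and after resizing to the network input size.
CELL_SIDE_M = 250.0   # side length of one statistical grid cell (from the text)
SOURCE_PX = 4285      # side length of the retrieved satellite image in pixels
NETWORK_PX = 224      # side length of the resized network input in pixels


def pixel_size_cm(cell_side_m: float, side_px: int) -> float:
    """Return the ground distance covered by one pixel, in centimetres."""
    return cell_side_m * 100.0 / side_px


source_pixel_cm = pixel_size_cm(CELL_SIDE_M, SOURCE_PX)    # roughly 5.8 cm
network_pixel_cm = pixel_size_cm(CELL_SIDE_M, NETWORK_PX)  # roughly 111.6 cm
downscale_factor = SOURCE_PX / NETWORK_PX                  # roughly 19 per axis
```

Resizing therefore discards detail by a factor of about 19 along each axis, which is worth keeping in mind when interpreting which visual structures the network can still resolve.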

Population and graduate ratio

The statistical grid data with the matching satellite images consist of 6,632 grid cell data points. The dataset contains absolute numbers for the overall population as well as absolute numbers of university graduates for every cell. For machine learning, however, a normalised value range of the target variable is beneficial (see Fig. 3 for the distribution of the absolute and relative data). Furthermore, absolute numbers add a population size bias; thus, using proportions yields less biased results. To convert the data to relative numbers, we calculate the graduate ratio (GR) for every grid cell by dividing the absolute number of university graduates by the total population of the grid cell. This ratio facilitates the interpretation and comparability of the distribution of graduates in the investigated areas. To obtain the classes for prediction, we separate the dataset at the 20% percentiles (quintiles) of the GR. Reference [46] also uses five classes in their prediction model based on a neural network. Their approach differs in the sense that the determination of classes is made according to the presence or absence of certain objects in the images, which is decided a priori by the researchers. An equal distribution of classes is beneficial in machine learning. The result is a set of five classes, from class 1, containing the grid cells with the lowest graduate ratios, up to class 5, containing the grid cells with the highest graduate ratios. In Appendix A, we show sample satellite images for all the classes, from low (class 1) to high (class 5) graduate ratios.
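The conversion from counts to ratios and the percentile-based class assignment can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the nearest-rank rule used to pick the boundaries is an assumption, since the text does not state how the percentiles were computed.

```python
from bisect import bisect_left


def graduate_ratio(graduates: int, population: int) -> float:
    """Graduate ratio (GR) of one grid cell as a fraction in [0, 1]."""
    if population <= 0:
        raise ValueError("population must be positive")
    return graduates / population


def quintile_boundaries(ratios: list[float]) -> list[float]:
    """Upper boundaries of classes 1 to 4 at the 20/40/60/80% ranks.

    Assumption: simple nearest-rank percentiles on the sorted ratios.
    """
    ordered = sorted(ratios)
    n = len(ordered)
    return [ordered[max(0, (n * k) // 5 - 1)] for k in range(1, 5)]


def assign_class(ratio: float, boundaries: list[float]) -> int:
    """Map a ratio to class 1 (lowest GR) up to class 5 (highest GR).

    A ratio equal to an upper boundary is assigned to the lower class.
    """
    return bisect_left(boundaries, ratio) + 1
```

With the upper boundaries reported in the caption of Fig. 3 (6.4%, 10.8%, 16.2% and 26.2%), `assign_class(0.216, [0.064, 0.108, 0.162, 0.262])` returns class 4.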

Fig. 3

Graph (a) depicts the absolute graduate count per grid cell on the x-axis over the population count per grid cell on the y-axis. Graph (b) depicts the calculated graduate ratio per grid cell on the x-axis and the population count per grid cell on the y-axis. Graph (c) depicts the grid cell-based graduate ratio on the x-axis and the frequency of the corresponding ratio on the y-axis. The coloured lines depict the upper 20% percentile boundaries for the five graduate ratio classes in ascending order, with the numerical boundaries in brackets (black: null ratio [0.0%], red: class 1 [6.4%], orange: class 2 [10.8%], yellow: class 3 [16.2%], green: class 4 [26.2%], blue: class 5 [60.7%])

Pre-filtering

The employed categorisation scheme is to some degree sensitive to changes in absolute numbers, especially in sparsely populated grid cells, where small changes in the absolute numbers can strongly impact the resulting ratios. To mitigate this sensitivity, we filter out the sparsely inhabited and completely uninhabited grid cells. For our experiments on graduate density prediction, we define a threshold of 20 inhabitants. This (i) excludes sparsely populated cells that are counterproductive for our analysis (artificially increase the accuracy) and (ii) counteracts unstable class assignments for sparsely populated cells and thus improves the robustness of classes. After filtering, our ground truth consists of 3,314 grid cells. For the filtering of the dataset, we tested an automatic approach. For further details, see Section 4.1.
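The sensitivity argument can be made concrete: in a cell with a given population, one graduate more or fewer shifts the ratio by 100 divided by the population, in percentage points. The sketch below pairs this with the threshold filter; the names are illustrative, and keeping cells with exactly 20 inhabitants is inferred from the caption of Fig. 5, which labels cells with fewer than 20 inhabitants as uninhabited.

```python
MIN_INHABITANTS = 20  # population threshold stated in the text


def is_residential(population: int, threshold: int = MIN_INHABITANTS) -> bool:
    """Keep a grid cell only if it reaches the population threshold."""
    return population >= threshold


def shift_per_graduate(population: int) -> float:
    """Change of the graduate ratio, in percentage points, caused by
    one graduate more or fewer in a cell with the given population."""
    return 100.0 / population


# Example populations (made up for illustration) and the cells that survive.
populations = [0, 5, 19, 20, 97, 839]
kept = [p for p in populations if is_residential(p)]
```

A cell with 5 inhabitants changes by 20 percentage points per graduate, whereas a cell with 839 inhabitants changes by roughly 0.12, which illustrates why sparsely populated cells produce unstable class assignments.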

Training and Prediction

Once the uninhabited areas have been filtered out (via our population threshold), we train a CNN for the prediction of graduate ratio classes using satellite images. To model the relationships between the visual information from the satellite images and the five target classes, we build on VGG-16 [44], which is a pre-trained CNN, and apply transfer learning to adapt it to our requirements. Prior to the selection of VGG-16 as the network model, we evaluated a number of promising alternative CNN architectures, namely DenseNet201 [18] (a CNN that is 201 layers deep with connections between each layer and subsequent layers, preserving features in previous layers and giving the model more flexibility in multi-scale modelling) and VGG-19 [44] (a deeper version of VGG-16). We trained all networks on the graduate density data from Vienna, where VGG-16 showed the best accuracy on the test set compared to VGG-19 and DenseNet201 and was thus selected for all further experiments.

Network architecture

VGG-16 is a feed-forward neural network architecture that builds upon a stack of convolutional filters followed by several dense (fully-connected) layers (see Fig. 4 for an overview and Table 1 for details on all hyperparameters of the architecture).

Fig. 4

The adapted network architecture for graduate density estimation from satellite images based on the VGG-16 network

Table 1 VGG-16 network architecture: The adapted network architecture for graduate density estimation based on the VGG-16 network (138M network parameters), with zero-centre normalised satellite images as input

The inputs to the network are three-channel RGB images of size 224 × 224 pixels (i.e. 224×224×3 tensors) covering one grid cell of 250 m × 250 m. The network consists of two major parts. The first part is a stack of convolutional layers that aims at learning a hierarchical (multi-scale) visual representation from satellite images. In each of the five convolutional layer groups, image filters are learned for different image scales. The pooling layers after each layer group reduce the resolution of the representation by half. The filters in the early layers (e.g., Conv 1-1 and Conv 1-2) represent very basic and generic small-scale image structures (usually edges). The intermediate layers (e.g., Conv 3-1 to 3-3) represent larger-scale structures (e.g., image textures). The higher layers (Conv 5-1 to 5-3) capture visual structures at the largest scale, related to buildings and building parts. The hierarchical representation stack is followed by a stack of dense layers, which can be considered a non-linear classifier. We employ two dense layers, as in the original VGG implementation, and replace the third dense layer (i.e., the output layer) by a smaller layer with five nodes, where each node corresponds to one density class. The neuron activation functions throughout the entire network are rectified linear unit (ReLU) functions of the form:

$$ f(x) = \max(0,x) $$

where x is the current activation fed into the activation function. After the last dense layer, we position a softmax layer. The softmax layer re-scales the outputs \(x_{j}\) of the network to obtain class probabilities (for the five density classes) that sum to one overall:

$$ y_{j}=\frac{e^{x_{j}}}{{\sum}_{i=1}^{N} e^{x_{i}}}, $$

where \(y_{j}\) represents the normalised output for the respective un-scaled network output \(x_{j}\) (\(y_{j} > 0\)) and N is the number of network outputs \(x_{1},...,x_{N}\) (N = 5 in our case).
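Both functions are easy to check numerically. The following Python sketch implements them for plain lists of floats; subtracting the maximum before exponentiation is a common numerical safeguard added here for illustration, and it leaves the result of the formula unchanged.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def softmax(outputs: list[float]) -> list[float]:
    """Re-scale un-normalised network outputs to probabilities summing to one."""
    shift = max(outputs)  # cancels in the ratio, avoids overflow in exp
    exps = [math.exp(x - shift) for x in outputs]
    total = sum(exps)
    return [e / total for e in exps]


# Five hypothetical un-scaled outputs, one per graduate ratio class.
probabilities = softmax([1.2, 0.3, -0.5, 2.0, 0.0])
predicted_class = probabilities.index(max(probabilities)) + 1  # classes are 1-based
```

The predicted class is the one with the highest probability, which for the made-up outputs above is class 4.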

Training

Prior to training, we normalise the input images by subtracting the average red, green and blue values from each image. As a result, the colour channels of all images become zero-centred. Normalisation is recommended to accelerate the optimisation process during training (gradient descent). To estimate the model quality, we employ the categorical cross-entropy loss to measure the match between the network predictions and the ground truth during training. Categorical cross-entropy is commonly used for multi-class classification problems and is defined as follows:

$$ \text{Loss} = -\sum\limits_{i=1}^{N} y_{i} \cdot \log \hat{y_{i}}, $$

where \(\hat {y_{i}}\) is the i-th scalar value in the model output, \(y_{i}\) is the corresponding target value, and N is the number of scalar values in the model output (N = 5 classes in our case). The loss function estimates how well the network outputs correspond to the desired target outputs and serves as the objective function that is minimised during training.
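As a worked example, the sketch below evaluates the standard categorical cross-entropy, i.e. the negative sum of the target values multiplied by the logarithm of the predicted values, for a one-hot target. The function name and the small clipping constant `eps` are assumptions made for illustration.

```python
import math


def categorical_cross_entropy(
    target: list[float], predicted: list[float], eps: float = 1e-12
) -> float:
    """Negative sum of target_i * log(predicted_i) over all classes.

    Predictions are clipped at eps so that log(0) cannot occur.
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(target, predicted))


# One-hot target: the cell belongs to class 3 of 5.
target = [0.0, 0.0, 1.0, 0.0, 0.0]

# An uninformed prediction assigns 20% to each class.
loss_uniform = categorical_cross_entropy(target, [0.2] * 5)

# A confident, correct prediction yields a much smaller loss.
loss_confident = categorical_cross_entropy(target, [0.02, 0.03, 0.9, 0.03, 0.02])
```

For five classes, the uninformed prediction gives a loss of log(5), about 1.61, which is a useful reference value when reading training curves.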

The percentile-based classes obtained from the statistical data serve as the target variables for training. Training is performed via mini-batch gradient descent with a batch size of 32 images. The SGD optimizer is used to minimise the loss function. To avoid over-fitting the network, we freeze the first 10 network layers during training. This means that the network weights for those layers remain unchanged. Only the higher layers are fine-tuned and adapted to the current task. For training the neural network, we use the MATLAB framework MatConvNet from [48]. The experiments were performed on a workstation with an Ubuntu 18 OS, 64 GB RAM and an NVIDIA GTX 1080Ti. Retraining (fine-tuning) is performed for 50 epochs with a learning rate of 0.0001, momentum of 0.9 and weight decay of 0.0005. We use classification accuracy as the performance measure in our experiments. To monitor overfitting during training, a validation set is employed (see below).
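To illustrate how the three optimiser settings interact, the sketch below performs a single SGD step with momentum and weight decay in plain Python. It uses one common textbook formulation of the update; the exact rule implemented in MatConvNet may differ in detail, so this should be read as an assumption rather than a description of the framework. Frozen layers would simply be skipped by such an update.

```python
LEARNING_RATE = 0.0001  # values stated in the text
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005


def sgd_step(weights, grads, velocity,
             lr=LEARNING_RATE, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY):
    """One SGD update with momentum and L2 weight decay (textbook form).

    v_new = momentum * v - lr * (grad + weight_decay * w)
    w_new = w + v_new
    """
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# A single trainable weight with a made-up gradient and zero initial velocity.
weights, velocity = sgd_step([1.0], [2.0], [0.0])
```

With these settings each step is small, which suits fine-tuning, where the pre-trained weights should only be adjusted gently.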

Prediction and evaluation

For training, we employ five-fold cross-validation. The motivation for using cross-validation is to make the best use of the limited amount of data (number of grid cells) that is available for our experiment. We train five networks using five independent training partitions from the ground-truth dataset. The partitions are composed of randomly assigned grid cells to avoid location dependencies, i.e., bias from different locations in the city or from similar characteristics of neighbouring cells. In each iteration, three parts (60%) of the available data are used for training, one part (20%) is used for validation and the fifth part (20%) is used for predicting and testing. The assignments of the parts to the training, validation and test data vary across all five iterations.
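The rotation of the five parts can be sketched as follows. The text only states that the assignment varies across iterations, so choosing the part after the test part as the validation part is an assumption, as are the function name and the fixed random seed.

```python
import random


def five_fold_partitions(n_cells: int, seed: int = 0, n_parts: int = 5):
    """Randomly split cell indices into five parts and rotate their roles.

    In fold k, part k is the test set, the next part is the validation set
    (assumed rotation), and the remaining three parts form the training set.
    """
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)  # random assignment, no spatial order
    parts = [indices[k::n_parts] for k in range(n_parts)]

    folds = []
    for k in range(n_parts):
        val_k = (k + 1) % n_parts
        train = [i for j, part in enumerate(parts)
                 if j not in (k, val_k) for i in part]
        folds.append({"train": train, "validation": parts[val_k], "test": parts[k]})
    return folds


folds = five_fold_partitions(100)
```

Because every cell lands in exactly one test set, the five test predictions together cover the whole dataset once.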

After the application of all five networks, we obtain a prediction for the entire study area, since the five test sets, each comprising 20% of the ground truth, jointly cover all grid cells. These predictions are evaluated against the ground truth and summarised in the aggregated confusion matrix in Section 4.2, “Prediction of Graduate Ratios”. Since all the networks are trained independently on different data and there is no optimisation of a hyperparameter over all networks, their results can be considered independent. Thus, their joint predictions provide a reasonable performance estimate for the prediction of the target variable (the graduate ratio distribution) over the entire study area (i.e., the whole grid of Vienna). As a performance measure, we employ the accuracy rate (the proportion of correctly classified grid cells), which is justified due to the balanced class sizes in the experiment.

Note that we do not use the cross-validation approach to select the model or training parameters (e.g., the network architecture, learning rate or loss function), i.e., to optimise our approach. This is important to avoid over-fitting and biased (i.e., overly optimistic) results. For all five folds, the same hyperparameters are used.

Results

Below, we first state the results for the automated differentiation of inhabited and non-inhabited areas and then present the results of the graduate ratio prediction. We conclude the presentation of the results with an analysis of the graduate class deviations.

Prediction of Inhabited Grid Cells

In our analysis, we focus on inhabited areas only. In a preliminary study, we investigate whether we can automatically differentiate inhabited from uninhabited grid cells to provide automatic pre-filtering (see Footnote 4). The results show that the trained CNN correctly predicts 95.3% of the inhabited and uninhabited areas in the test dataset. This shows that data pre-processing can be almost fully automated in future work.

Nevertheless, for the following experiments, we manually split the inhabited and non-inhabited areas using ground-truth information with a manually defined threshold of 20 residents per cell. The reason for enforcing this separation is to assure that the subsequent experiment is completely based on error-free data.

Prediction of Graduate Ratios

After filtering out all uninhabited areas, 3,313 populated grid cells remain for the subsequent analysis. In what follows, we investigate the prediction accuracy of the distribution of graduates obtained by our approach. For each fold, we compute the accuracy rate on the independent test set and identify false detections. The aggregated (summed) confusion matrix of all five cross-validation runs in Table 2 shows the correctly predicted grid cells on its main diagonal (absolute numbers and percentages of the respective classes). With an overall accuracy rate (AR) of 40.5%, we are able to correctly predict twice as many grid cells as a random approach (which would yield a classification accuracy of 20% due to the five equally likely classes), and we obtain an overall accuracy rate that is 10.5 percentage points higher than that in reference [46] (30.0% overall accuracy), which employs a similar prediction model for a sociodemographic variable. Furthermore, 78.3% of the predicted density estimates deviate by no more than one class from the true class (random approach: 52%). This indicates that the model learned visual patterns that correlate with the graduate density in the grid cells.
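The two random baselines quoted above (20% and 52%) can be reproduced from a uniform confusion matrix: of the 25 equally likely combinations of true and predicted class, 5 lie on the diagonal and 13 lie on the diagonal or directly next to it. The Python sketch below computes both measures for any square confusion matrix; the function names are illustrative.

```python
def accuracy_rate(cm: list[list[float]]) -> float:
    """Share of observations on the main diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def within_one_class_rate(cm: list[list[float]]) -> float:
    """Share of observations whose predicted class deviates by at most
    one class from the true class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    near = sum(cm[i][j] for i in range(n) for j in range(n) if abs(i - j) <= 1)
    return near / total


# Random guessing with five equally likely classes: every cell of the
# confusion matrix receives the same share of observations.
uniform = [[1] * 5 for _ in range(5)]
random_accuracy = accuracy_rate(uniform)            # 5 / 25
random_within_one = within_one_class_rate(uniform)  # 13 / 25
```

Both measures are symmetric in the two axes, so they do not depend on whether rows hold the true or the predicted classes.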

Table 2 Aggregated confusion matrix for Vienna

Comparing the prediction accuracy for the individual graduate density classes (the diagonal of Table 2), we observe that the model performs best for the lowest and highest graduate ratio classes (i.e., class 1 and class 5; the same evidence is used as in reference [46]). The weakest performance is achieved for class 2, where 34.1% (true class 2: 669 observations) of the grid cells belonging to class 2 are assigned to class 1 (compared to 20.3% or 136 grid cells of the data that are correctly predicted as class 2). This may be due to the narrow width of the corresponding class boundaries of classes 1 and 2. The upper boundary of class 1 is a graduate ratio of 6.4%, and class 2 exhibits an upper boundary of 10.8%, resulting in a margin of only 4.4 percentage points. Thus, class 2 spans a narrow range of ratios, which can explain the difficulties in robustly predicting the grid cells.

Next, we investigate whether there is a bias in the misclassifications towards higher or lower densities. To this end, we sum the misclassifications above the diagonal in the confusion matrix (the sum of upper diagonals, SUD) and the sum below the diagonal (the sum of lower diagonals, SLD). The similar values for the SUD, 29.5%, and SLD, 30.0%, indicate that there is no bias towards higher or lower densities.
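The two sums can be computed as shown in the sketch below. The orientation of the confusion matrix is an assumption here (rows hold the true classes and columns the predicted classes), in which case entries above the diagonal correspond to overestimated densities and entries below it to underestimated densities. The example matrix is made up for illustration.

```python
def diagonal_sums(cm: list[list[float]]) -> tuple[float, float]:
    """Return (SUD, SLD): the shares of observations above and below the
    main diagonal of a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    upper = sum(cm[i][j] for i in range(n) for j in range(n) if j > i)
    lower = sum(cm[i][j] for i in range(n) for j in range(n) if j < i)
    return upper / total, lower / total


# Made-up 3 x 3 example; rows are assumed to be true classes.
example = [
    [5, 2, 0],
    [1, 4, 1],
    [0, 3, 4],
]
sud, sld = diagonal_sums(example)
```

Together with the accuracy rate, the two sums always add up to 100%, which matches the reported values (40.5% + 29.5% + 30.0%).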

To further analyse a potential bias, we investigate the five confusion matrices obtained for the five folds. In Table 3, the accuracy rates of the five CNN runs and the respective SUDs and SLDs are listed. The confusion matrices for all five runs can be found in Appendix B. From Table 3, we can conclude that the performance is at a comparable level across the whole study area (approx. 37-45% accuracy), and thus, the dependency on the training data selection is rather low. Looking at the accuracy rates as well as the SUDs and SLDs, we do not see significant outliers in the deviations from the aggregated results in Table 2, which indicates that our models perform equally well over all five test datasets.

Table 3 Accuracy rates (ACR) and sums of upper diagonals (SUDs) as well as sums of lower diagonals (SLDs) for all five runs of cross-validation

After analysing the aggregated results, we examine the data on the grid level by generating a heat map that depicts the true and predicted class distributions. In Fig. 5, we show the city area partitioned into the statistical regional grid employed. Utilising the ground truth data, we construct a grid map of the distribution of the true graduate ratio for the city area, depicted in image (a), and we plot the predicted grid cells of our model in image (b). Both maps show similar trends and patterns. The differences are mostly on a local level. Therefore, the predictions of our model are consistent.

Fig. 5

True distribution of graduates in Vienna according to the ground truth (a) and the prediction of our model (b). The red cells indicate a grid cell with a low graduate ratio. The blue grid cells show a high graduate ratio. The white cells are uninhabited areas (fewer than 20 inhabitants)

To evaluate our results, we test the sensitivity of our approach in two ways:

  i) To evaluate the generalisation ability of our methodology, we evaluate our approach on other (yet unseen) cities for which adequate ground-truth or reference data are available. We select the Austrian cities of Graz, Linz and Salzburg for testing and predict the spatial distribution of the graduate ratios. Our network, when trained on the Vienna ground truth, achieves 28.7% accuracy in Graz, 32.7% accuracy in Linz and 27.8% accuracy in Salzburg. The lower accuracy rates are mainly because the populations of these three cities are considerably smaller than that of Vienna (by a factor of approx. 10).

  ii) Furthermore, we tested the prediction accuracy for Vienna with a different number of graduate ratio classes. Using two graduate ratio classes, the CNN achieves a prediction accuracy of 74.1%; three classes yield an accuracy of 48.7%, and four classes yield a prediction accuracy of 43.9%. In summary, these accuracy rates are substantially higher than those of random guessing (50%, 33% and 25%, respectively), which further confirms that the CNN finds patterns in the satellite images that correlate with graduate density.

Analysis of Class Deviations

To further investigate the performance of our machine learning approach, we create an additional heat map, shown in Fig. 6, which depicts the deviation of the predicted classes from the true class. This enables us to analyse the spatial distribution of the misclassifications, which in turn can help to better understand the performance of the model. We compute the signed deviation between the prediction and ground truth for each cell. Blue indicates an underestimation of graduate density, while red indicates an overestimation of the density compared to the true value per cell. All grid cells with correct predictions are coloured in green.
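The signed deviation is simply the predicted class minus the true class, so that positive values indicate an overestimation and negative values an underestimation. The sketch below applies this convention, which is assumed here because the text does not state the sign explicitly, to the four mismatch examples listed in the caption of Fig. 8.

```python
def signed_deviation(true_class: int, predicted_class: int) -> int:
    """Predicted minus true class: positive means the graduate density was
    overestimated, negative means underestimated, zero is a match."""
    return predicted_class - true_class


# (true class, predicted class) pairs taken from the caption of Fig. 8.
mismatches = {"a": (1, 5), "b": (5, 1), "c": (1, 2), "d": (3, 2)}
deviations = {name: signed_deviation(t, p) for name, (t, p) in mismatches.items()}
```

The magnitude of the deviation corresponds to the colour saturation in the heat map: the larger the absolute value, the larger the estimation error.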

Fig. 6

Heat map showing the deviation of the model predictions from the ground truth. Green indicates a match of the predicted and true graduate ratios. The reddish and blueish colours indicate an under- and overestimation of the density, respectively. More saturated colours indicate larger errors in estimation

Figure 6 shows that large areas are correctly predicted (green). False predictions are distributed across the entire area, and no large spatial clusters can be observed. There is, however, a tendency towards more false predictions as we move into suburban regions away from the city centre. A comparison with the overall population density (see Appendix C for the heat map) suggests a possible link between prediction accuracy and population density: the population count decreases towards the periphery, which may explain the difficulties of the prediction model in suburban areas and the higher number of misclassifications there.
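One way to examine such a link is to group cells by population count and compute the accuracy per group. The sketch below is only an illustration of that procedure: the function name and the bin edges (100 and 500 inhabitants) are assumptions, and the input consists of the eight hand-picked example cells from Figs. 7 and 8, which are not a representative sample.

```python
from collections import defaultdict


def accuracy_by_population_bin(cells, bin_edges):
    """Return {bin index: accuracy} for cells given as
    (population, true_class, predicted_class) tuples.

    Bin i contains the cells whose population is at least bin_edges[i - 1]
    and below bin_edges[i]; bin 0 lies below the first edge.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for population, true_class, predicted_class in cells:
        bin_index = sum(population >= edge for edge in bin_edges)
        totals[bin_index] += 1
        hits[bin_index] += int(true_class == predicted_class)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Example cells from the captions of Figs. 7 and 8 (illustrative only).
example_cells = [
    (97, 4, 4), (839, 5, 5), (985, 1, 1), (86, 2, 2),    # Matches (a)-(d)
    (40, 1, 5), (173, 5, 1), (456, 1, 2), (181, 3, 2),   # Mismatches (a)-(d)
]

# Hypothetical bin edges: below 100, 100 to 499, 500 and above.
per_bin = accuracy_by_population_bin(example_cells, bin_edges=[100, 500])
for bin_index, accuracy in per_bin.items():
    print(f"population bin {bin_index}: accuracy {accuracy:.2f}")
```

Because the example cells were selected by hand as matches and mismatches, the resulting values say nothing about the model; applying the same grouping to all grid cells would be needed to test the suspected relationship.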

Examples of matching (“Match”) and mismatching (“Mismatch”) predictions of our CNN are marked on the city map in Fig. 6. Satellite images of the four example matches and mismatches (a)-(d) are shown in Figs. 7 and 8. Pairs (a) and (b) show strongly deviating images, where the CNN predicts a much higher or lower GR class. Pairs (c) and (d) show minor deviations compared to the true class.

Fig. 7

Matches. Example satellite images (in the 224 × 224 pixel resolution employed for the CNN) of correctly matched predictions. The images correspond to Match (a) (GR: 21.6% with a population of 97), Match (b) (GR: 28.1% with a population of 839), Match (c) (GR: 3.0% with a population of 985) and Match (d) (GR: 8.1% with a population of 86)

Fig. 8

Mismatches. Example satellite images (in the 224 × 224 pixel resolution employed for the CNN) of false predictions. The images correspond to Mismatch (a) (true class: 1, predicted class: 5, GR: 5.0% with a population of 40), Mismatch (b) (true class: 5, predicted class: 1, GR: 32.4% with a population of 173), Mismatch (c) (true class: 1, predicted class: 2, GR: 5.7% with a population of 456), and Mismatch (d) (true class: 3, predicted class: 2, GR: 15.5% with a population of 181)

Match (a) is located in the suburban area of Vienna and corresponds to class 4 (a graduate ratio of 21.6%). The CNN correctly predicts the satellite image as class 4. Mismatch (a) is in the same suburban area and neighbours the cell of Match (a). The CNN assigns it to the incorrect class; i.e., the ground truth indicates class 1 (a graduate ratio of 5%), while the CNN classifies it as class 5 (highest graduate ratio). By human visual judgement, both pictures generally look similar; one difference is that Mismatch (a) depicts fewer streets and buildings than Match (a). Both pictures show vegetation and generally look like attractive residential areas. Thus, given the attractiveness of the neighbourhood, it seems plausible to us that the CNN assigns Mismatch (a) to a class that is too high.

Match (b) is located in the city centre of Vienna. The true graduate ratio is 28.1%, i.e., class 5. Our prediction model correctly predicts this class. The neighbouring Mismatch (b) has a graduate ratio of 32.4% according to the ground truth and thus falls into class 5 as well. The model, however, fails to predict the true class and assigns class 1. Comparing the two pictures, we see similar scenes with residential housing of different densities. An obvious difference is the square-shaped houses (panel buildings) in the wrongly predicted cell. We observe that panel buildings frequently correlate with lower graduate ratios (see, e.g., Match (c)). We assume that the network has recognised that this pattern frequently accompanies a low graduate density and thus assigns the wrong GR class to the cell.

Match (c) is located on the outskirts of Vienna and shows numerous panel buildings. The true graduate ratio of the grid cell is 3.0% and is correctly predicted as class 1 by our approach. The neighbouring Mismatch (c) is falsely predicted as class 2 but actually belongs to class 1 (a graduate ratio of 5.7%). The two cells do not significantly deviate from each other visually, as both are only sparsely populated and show considerable areas covered by vegetation, especially trees. One difference is that Match (c) contains more panel buildings, whereas Mismatch (c) contains more individual buildings surrounded by greenery. This may incline the model towards predicting a higher graduate density class, which is generally in line with the ground truth (there is a higher graduate ratio in Mismatch (c), 5.7%, than in Match (c), 3.0%). The network seems to overestimate the ratio, and thus, the result falls into the higher density class.

Match (d) is again in a suburban area and corresponds to a grid cell with a graduate ratio of 8.1% (correctly classified as class 2). The neighbouring cell, labelled as Mismatch (d), has a graduate ratio of 15.5% according to the ground truth and thus falls into class 3. Our model predicts it as a class 2 image. Match (d) depicts a rural neighbourhood with areas of farmland. Mismatch (d) displays a denser settlement with fewer single-family houses. This might be the reason why our approach underestimates the graduate density.

Overall, when looking at the example matching and mismatching predictions, we can draw the conclusion that a high proportion of the chosen pictures are difficult to assess even with human judgement. This may be a reason for the difficulties of the CNN in accurately modelling the classes and can also explain class confusions (especially between neighbouring classes). Regarding the prediction of graduate settlement, we are aware that the CNN could also predict variables such as rent or housing/apartment prices instead of the intended variable of graduate class. In that regard, one could claim that graduates are agglomerating in desirable areas [6, 13, 20] and therefore increasing rents or vice versa. Such correlations in the data are worth closer examination and represent an important direction of our follow-up research.

Conclusion

In this article, we show that a CNN can predict the spatial distribution of university graduates in a city using only satellite images. Our research hypothesis is that visual features exist in satellite images that correlate with the settlement of graduates. To investigate this hypothesis, we leverage the rich capabilities of machine learning to extract useful data from satellite images (250 m × 250 m small-scale city grid cells) and to link it to statistical population data. We split the statistical population data into five equally balanced classes with a wide range of graduate ratios. We train five neural networks on the ground-truth data and achieve an overall accuracy rate of 40.5% (random baseline: 20%) in predicting the five graduate density classes for the study site of the Austrian city of Vienna. We also show that we can differentiate inhabited and uninhabited areas with a probability of 96% by purely visual features using machine learning.

Our findings show that computer vision (i) has great potential for future examinations in urban economics (socio-economic and demographic studies), (ii) can mitigate the MAU problem or can serve as the basis for a solution and (iii) can be used in economic fields where no (statistical reference) data are available or the data are outdated, as computer vision can be deployed independently of the availability of statistical data and the metadata derived from them. Computer vision therefore opens up a so far underestimated but extremely useful information source for economic analyses.

Future research will analyse how the network recognises the distribution of graduates in detail. Colours, contours, textures, arrangements of buildings, etc. can play a role. Another step will be to investigate to what extent a network trained on one site (e.g., Vienna) generalises to another site and whether the findings and patterns are consistent.

Overall, we see a broad applicability for our prediction model in future research and practice. The investigation of the predictability of graduate settlement in a metropolitan area could enhance future urban planning and guide urban development in the sense that it is controlled for human capital agglomeration. Especially in regard to urban governance, our findings can add a new dimension to city planning if future research is able to extract the visual characteristics that increase graduate agglomeration in certain city areas.