Measuring the Wisdom of the Crowd: How Many is Enough?

The idea of the wisdom of the crowd is that integrating multiple estimates of a group of individuals provides an outcome that is often better than most of the underlying estimates or even better than the best individual estimate. In this paper, we examine the wisdom of the crowd principle on the example of spatial data collection by paid crowdworkers. We developed a web-based user interface for the collection of vehicles from rasterized shadings derived from 3D point clouds and executed different data collection campaigns on the crowdsourcing marketplace microWorkers. Our main question is: how large must be the crowd in order that the quality of the outcome fulfils the quality requirements of a specific application? To answer this question, we computed precision, recall, F1 score, and geometric quality measures for different crowd sizes. We found that increasing the crowd size improves the quality of the outcome. This improvement is quite large at the beginning and gradually decreases with larger crowd sizes. These findings confirm the wisdom of the crowd principle and help to find an optimum number of the crowd size that is in the end a compromise between data quality, and cost and time required to perform the data collection.


Introduction
The term crowdsourcing was introduced by Howe (2006) and is a neologism of the terms crowd and outsourcing. In contrast to outsourcing, where employers outsource tasks to known and well-defined third parties, crowdsourcing involves the outsourcing of tasks to unknown workers (crowdworkers) on the Internet. This gives employers access DGPF * Volker Walter volker.walter@ifp.uni-stuttgart.de 1 Institute for Photogrammetry (ifp), University of Stuttgart, Geschwister-Scholl-Str. 24D, 70174 Stuttgart, Germany to a large number of workers who would otherwise not be available.
Many crowdsourcing projects rely on the work of unpaid volunteers, such as Wikipedia (www. wikip edia. org) or Zooniverse (www. zooni verse. org). The voluntary collection of geodata is called Volunteered Geographic Information (VGI) (Goodchild 2007). The most known VGI project is OpenStreetMap (OSM-www. opens treet map. org), a project to create a map of the world that can be edited by anyone (Haklay and Weber 2008). VGI projects require an active community that is intrinsically motivated to participate. The main motivating factor for users who voluntarily collect OSM data is that their work benefits other users, which in the case of OSM is freely available digital map data, but there are also other motivating factors, such as gaining reputation in the community, fun and inspiration, or gaining new knowledge. Budhathoki and Haythornthwaite (2012) presented a comprehensive discussion of motivational factors for VGI. If such motivations are not available, other (extrinsic) incentives must be provided. A problem of volunteered spatial data collection can be that the number of volunteers in the area to be mapped is often limited, hindering scalability and making it difficult to map large areas within a specific period (Maddalena et al. 2020).
The most common extrinsic motivation for crowdworkers to perform tasks and the one that leads to the fastest results is monetary incentive (Haralabopoulos et al. 2019). In paid crowdsourcing, tasks are published on online marketplaces (Mao et al. 2013) such as microWorkers (www. micro worke rs. com) (Hirth et al. 2011) or Amazon Mechanical Turk (MTurk-www. mturk. com) (Ipeirotis 2010). MicroWorkers, for example, has access to over 2,700,000 workers worldwide (according to their website-accessed March 2022). The research in this article is based on paid crowdsourcing. Paid crowdsourcing has been proven a powerful tool for very different applications. It can be used for practically all tasks that can be solved online with a computer. Some examples that show the variety of possible applications: Nguyen et al. (2012) describe an Amazon MTurk based application to detect polyps associated with colorectal cancer in images generated through computed tomographic colonography. Redi and Povoa (2014) compare the reliability in scoring the aesthetic appeal of images by paid and volunteered crowdworkers. Gao et al. (2015) propose a paid crowdsourcing approach that creates translations at much lower cost than hiring professional translators. Lans et al. (2018) developed a technology for a museum app and created content for over 80 museums worldwide with paid crowdsourcing. Koita and Suzuki (2019) propose a method to count traffic without actually being at the survey site by presenting images extracted from videos to paid crowdworkers.
There exists much research on spatial data collection based on the input of volunteers (Goodchild and Li 2012;Antonio and Skopeliti 2015;Albuquerque et al. 2016;See et al. 2016;Fonte et al. 2017;Seratne et al. 2017; Pinhero and Davis 2018) but so far only a limited number of publications on spatial data collection relying on paid crowdsourcing. Estesa et al. (2016) describe a platform for the mapping of crop fields in South Africa realized with Mechanical Turks' Human Intelligence Tasks (HITs). Walter and Soergel (2018) discuss the collection of buildings, forests and streets from aerial images with microWorker campaigns. Koelle et al. (2020) discuss a system for the classification of 3D point clouds where a machine-learning algorithm iteratively improves its performance by learning from paid crowdworkers. Walter et al. (2020) discuss the collection of trees from 3D LiDAR point clouds based on paid crowd campaigns. Maddalena et al. (2020) implemented a system to collect coordinates of points of interest from Street View images by paid crowdworkers. Koelle et al. (2021b) evaluated which 3D data representation (point cloud vs. mesh) is best suited for presenting to paid crowdworkers in context of coupled semantic segmentation of both an ultra-high-resolution point cloud and the derived 3D textured mesh. A well-founded basis for this approach of learning with only a few samples is presented in Koelle et al. (2021a), who also discuss the impact of erroneous answers from the crowd and possible countermeasures. Walter et al. (2020) present a two-level approach for the collection of vehicles from 3D point clouds by paid crowdworkers.
In both paid and free crowdsourcing, quality control is a challenge (Leibovici et al. 2017;Liu et al. 2018) because the quality of crowdsourced work varies widely (Vaughan 2017). The crowd consists of people with unknown and different skills (Daniel et al., 2018). Another problem, especially in paid crowdsourcing, is dishonest workers who try to maximise their income by submitting as many tasks as possible and delivering incomplete or poor results (Hirth et al. 2011). In addition, there may be adversarial workers who compromise the quality of the results (Zhang et al. 2016).
Obtaining high-quality ground truth from noisy data collected by non-experts is a major challenge in crowdsourcing (Zhou et al. 2012).
In general, there are two methods for controlling and improving the quality of crowd-sourced data (Zhang et al. 2016): (i) Quality Control on Task Designing and (ii) Quality Improvement after Data Collection. The first approach stands for methods that guide crowdworkers to provide highquality data. There are various methods for this purpose, such as skill testing, reputation systems, task assignment, task and workflow optimization, training, real-time quality control or quality checkpoints. An overview of these techniques is presented in (Daniel et al. 2018).
All these methods can improve the quality of the data. In the end, however, they are only partial solutions. Even if they manage to improve the average quality, the quality of individual datasets can still be heterogenous and low quality results cannot be completely eliminated. Many crowdworkers are not familiar with geospatial data collection standards or may not feel the need to follow them (Hashemi and Abbaspous 2015). Even when spatial data are collected by experts, the results can be heterogeneous (Walter and Soergel 2018). When spatial data are collected by non-experts, this effect is even stronger because people with very different expertise are working together (Senaratne et al. 2017). Methods based on Quality Control on Task Designing are not included in the research presented in this paper. However, they can be combined with our proposed methods for further quality improvement.
In the second approach (Quality Improvement after Data Collection), additional procedures are used to improve the quality of the data after it has been collected. A common idea for this is repeated data collection by different crowdworkers. After data collection, mechanisms are applied to filter out noisy data and to infer the truth. The process of estimating the truth from noisy multiple collected data is called truth inference (Zhen et al. 2017). Surowiecki (2004) has shown on many examples from very different fields that averages of multiple guesses are often better than the best individual guess. Large groups of people are smarter and can solve complex problems even better than experts. To do so, the same data must be collected multiple times (which can be realized without problems with paid crowdsourcing but would be hard to realize with volunteered data collection) and an aggregation rule (such as averaging) to derive a solution (Simons 2004).
Groups can be very smart when their aggregated results are compared with the results of individuals (Lorenz et al. 2010). A similar principle is swarm intelligence where the skills of individuals and the power of the masses are used to solve problems (Krause et al. 2010). This behaviour is also known from groups of insects, fishes, birds, mammals and primates that aggregate their individual decisions into group decisions for various tasks (Zuni and Eckstein 2017). However, swarm intelligence is only partially comparable to crowdsourcing, as the behaviour of participants in a swarm often arises from the behaviour of their physical neighbours (e.g., in the flight of flocks of birds), while participants in a crowdsourcing campaign are usually unaware of the existence of other crowdworkers (again, there are exceptions, such as in collaborative crowdsourcing, where a number of workers form groups and work together (Ikeda 2016)).
The crowd can comprise of any number of individuals (the larger the better) who neither need to know each other nor need to be aware what others are doing (Brown 2015). The wisdom of the crowd is a statistical phenomenon and relies only on mathematical aggregation methods (Hosio et al. 2016).
It is widely acknowledged that Charles Darwin's cousin Sir Francis Galton first observed the wisdom of the crowd in 1907 (Galton 1907). He found out that the average guess of the weight of an ox in a weight-judging contest at a farmer fair outperformed the accuracy of expert opinions (butchers). The average judgement nearly converged to an optimum result. In average, the fair audience estimated the weight of the ox was 1197 pounds. The real weight of the ox was 1198 pounds. Since then, many other researchers verified similar findings (Surowiecki 2004).
Simple majority voting is a widespread mechanism to exploit the wisdom of the crowd (Juni and Eckstein 2017). This mechanism is based on the idea that the majority of the workers are trusted (Kazemi et al. 2013). The employer assigns a task multiple times to many crowd workers. The result with the majority is considered the correct answer (Hirth et al. 2013). Majority voting is a simple but effective method (Zhang et al. 2016) and has been proven useful for different spatial labelling tasks. Salk et al. (2016) used majority vote classification for the identification of croplands in remote sensing images. Hecht et al. (2018) used majority voting to evaluate the results of crowdsourced classification of building footprint data. Herfort et al. (2018) used majority voting for the detection of trees in 3D point clouds by crowdworkers. Liu et al. (2018) applied weighted majority voting as aggregation technique for the crowd-based building of accessibility maps. The weights were derived from the reputation scores of the crowdworker. Koelle et al. (2020) used majority voting for the crowd-based labelling of 3D LiDAR point clouds.
While majority voting can be easily used for labelling tasks, it cannot be applied to the collection of the geometry of spatial objects. Labels are often classified into a finite number of classes (i.e., forming a categorization task) whereas the geometry of spatial objects can have in principle any shape. If different crowdworkers collect the geometry of an object, all resulting representations will be more or less different, which means that it is not possible to determine a majority representation.
If the objects are represented by simple geometric primitives, such as points, circles or rectangles, we can use the average instead of the majority as aggregation rule. This was successfully adapted by Walter et al. (2020) on the example of crowd-based collection of trees from 3D point clouds by means of minimum enclosing cylinders. Each crowdjob was duplicated 10 times. Integrated cylinders were calculated by averaging the centres and the heights of the individual cylinders. The quality of the integrated cylinders was significantly higher compared to the average quality of the individual cylinders.
If the objects are not represented by simple geometric primitives but by free-shaped polygons, the calculation of an average geometry is more difficult because polygons are not defined by a fixed list of parameters that can be averaged but can have any number of intermediate points at any position. The typical approach to calculate an average representation of multiple polygons is that first corresponding parts of the polygons are identified (matching) and then the actual integration is performed (conflation) (Walter and Fritsch 1999). In the past decades, many algorithms for the matching and conflation of geospatial data were introduced. An overview is presented in (Xavier et al. 2016).
Three factors affect the quality of the results of groups of crowdworkers that work together on a problem: The diversity of the group, the expertise of the crowdworkers, and the number of crowdworkers in the group.
If we crowdsource a task via a crowdsourcing marketplace, we have limited influence to the diversity of the workers. For example, on the microWorkers marketplace, it is possible to filter the crowdworkers based on the country of origin, the individual score of a crowdworker based on feedback from employers, the total payment received (top earners) and based on some specific skills some crowdworkers might have (programming, writing, blogging, caption writing, designing, data collecting, image annotating). However, the employer has no detailed control over the diversity of the group since the job assignment is an open process and every registered worker who belongs to the selected group is allowed to perform the tasks.
However, we can exactly control the number of workers. The higher the number of workers in the group, the more likely it is that the diversity of the group is also high. Krause et al. (2011) have shown that high diversity can be more beneficial to the quality than adding expertise. However, also expertise can improve the quality: King et al. (2011) demonstrated that in particular work configurations, experts could improve the quality and prevent significantly inaccurate results. The expertise of the crowdworkers could be checked with qualification tests (qualification-based preselection) (Geiger et al. 2011). However, expertise checks and expertise improvement are not part of our current research. Van Dijk et al. (2020) investigated the improvement of crowdsourced data by aggregating it and examined the relationship between the number of contributions, the parameters of the aggregation method and the resulting quality.
In this paper, our focus is on evaluating the impact of the number of crowdworkers on the quality of the aggregated results. For this, we implemented a graphical user interface for the collection of vehicles from 2D rasterized shadings derived from 3D LiDAR point clouds by means of line segments reaching from the front to the back of the vehicles. Each data collection task was assigned to multiple crowdworkers. For different group sizes, we calculated all possible combinations of the individual results and integrated them with a DBSCAN clustering (Ester et al. 1996). The results are then compared with reference data. The purpose of this work is to answer two questions: (1) How does the number of crowdworkers influence the completeness and correctness? (2) How does the number of crowdworkers influence the geometric quality?
The remainder of this paper is organized as follows. In Sect. 2, we introduce the graphical user interface for the collection of vehicles from rasterized shadings from LiDAR point clouds. The used datasets are presented in Sect. 3. In Sect. 4, we discuss the individual results of the different crowdsourcing campaigns. In Sect. 5, we present our data integration strategy and evaluate the wisdom of the crowd principle. We conclude our findings in Sect. 6 by discussing our results and formulating further research questions.

Crowd-Based Collection of Vehicles
Vehicle detection from remote sensing data has gained increasing interest in recent years (Wu et al. 2020) and can be important for many applications, e.g. autonomous driving, traffic management, traffic monitoring, urban planning, parking lot analysis, etc. Numerous approaches have been proposed for vehicle detection. The majority of these approaches can be divided into two classes: (i) detection of vehicles from vehicle-mounted sensors, such as in-vehicle LiDAR (Feng et al. 2019) or in-vehicle cameras (Ponn et al. 2020) and (ii) detection of vehicles from airborne sensors, such as airborne LiDAR (Eum et al. 2017) or airborne cameras (Yang et al. 2018). In recent years, deep learning algorithms have become powerful tools for the automated detection of vehicles from all kind of sensing data and have achieved remarkable results (Wang et al. 2019). Deep learning systems require large amounts of annotated data to implement interpretation concepts and crowdsourcing offers an effective method to provide such data.
Our study is based on an approach that we presented in (Walter et al. 2021). It is a two-level method for the crowdbased acquisition of vehicles from 3D point clouds by means of minimum bounding boxes. We subdivide the approach into two steps to make the data collection task as simple as possible because most crowdworkers have no expert knowledge in the field of geospatial data collection or have worked with 3D point clouds before.
Crowdsourcing tasks are designed in such way, that complex problems are subdivided into smaller sub-problems that can be solved quickly and easily. Paid crowdsourcing tasks take typically only a few minutes to complete and the payment is only several cents (Hirth et al. 2011;Hitlin 2016). For spatial data, subdividing large problems can be understood as geographically dividing the working area into small tiles and assigning these tiles to individual crowdworkers. However, this is only reasonable if the objects to be collected are homogeneously distributed, which is generally not the case for geospatial data. For example, in rural landscapes, the occurrence of vehicles is rather scarce. In this case, we would produce many tiles that contain no vehicles at all. Therefore, we subdivide the working area into large strips (so that absence of vehicles is very unlikely) and present these strips as 2D rasterized shadings to the crowdworkers in which they must first identify the positions of all vehicles within a strip. The advantage of strips over tiles is that crowdworkers only have to navigate in two directions in the user interface. This makes it less likely, that they will miss certain areas.
In the second step, these positions are used as preliminary information to cut out small parts from the 3D point cloud for each identified vehicle. Each of these small 3D point clouds is then presented to a crowdworker, which is asked to approximate the vehicle with a minimal bounding box. This has the advantage that each crowdworker only needs to download a small portion of the point cloud, which reduces the required bandwidth.
In Walter et al. (2021), we have shown that we can achieve a high degree of completeness and geometric quality in both levels of this approach by multiple data collection and subsequent averaging. With tenfold acquisition, a completeness of 93.3% was achieved. The geometric deviation between the centres of the crowd-based collected minimum bounding boxes and the reference was less than 1 m for more than 95% of the vehicles.
The number of how often the data is collected influences the quality of the results. However, a higher number of redundancies does not necessarily increase the quality. For example, if an object is collected only twice (with different quality characteristics), the quality of the integrated object would be poorer than the quality of the original object with the higher quality. If, however, an object is collected by n crowdworkers, we expect that for a specific number n > n 0 the quality of the integrated object will be higher than the quality of the individual objects or at least very close to the quality of the best object.
Higher quality can be achieved through larger number of redundancies, but this also leads to higher cost. If the cost is to be reduced, the number of redundancies can be reduced, but this has a negative effect on the quality. In the end, the optimal number of redundancies is a compromise between data quality versus cost. In this paper, we examine these numbers in more detail to gain a better understanding of the relationship between the number of repeated acquisitions and the results obtained by integrating them.
We investigate the impact of the crowd size on the data quality for the first step of the approach described in (Walter et al. 2021), where crowdworkers are asked to identify vehicles using line segments reaching from the front to the back of the vehicle or vice versa. An example is shown in Fig. 1.

Datasets
As test area, we rely on the newly introduced Hessigheim LiDAR dataset (H3D) (Koelle et al. 2021c). Hessigheim is located in the south of Germany. The point cloud was collected with a RIEGL VUX-1LR LiDAR sensor combined with two Sony Alpha 6000 oblique cameras using the RIEGL RiCopter octocopter. The mean laser pulse density is about 800 points/m 2 . The ranging accuracy is 10 mm (Riegl 2018).
This area was subdivided in 15 east-west-oriented strips of 250 × 50 m (see Fig. 3), from which we selected the 6 southernmost strips. We did not use the entire data set, but only the part that contains an urban area where vehicles are located. The rest of the data mainly contains agricultural land with very sparse occurrence of vehicles. An example of one strip is shown in Fig. 4. To avoid artefacts of truncated vehicles at the strip boundaries, a second set of overlapping strips (overlap = 50%) is used. Therefore, we have a total of 11 strips.
We use data from two epochs to be able to verify our results on two different dataset and different crowdworkers: Dataset 1 was collected in November 2018 and dataset 2 in March 2019. Figure 5 shows data of both epochs from a part of the study area. The two data sets are very similar due to the use of the same sensor and similar recording conditions, so that similar results are to be expected. If the data sets were more different (e.g., a different sensor, or a different location), it would be expected that the results would also be more different. The main purpose of using a second dataset is to show that running a campaign with a different group of crowdworkers leads to comparable results. The groups of crowdworkers are not completely different but still very distinct. An examination of the worker IDs shows that in the second campaign, 78.2% of the crowdjobs were performed by new crowdworkers who had not already participated in the first campaign.
Whenever mapping 3D data (i.e., point clouds) to 2D raster images, we inevitably lose information due to dimensionality reduction. Especially if an object of interest (i.e., vehicle) is occluded by other structures such as vegetation, simply using the maximum z-value per grid cell is counterproductive, since airborne laser scanning can penetrate such structures. Therefore, we derive the data to be annotated by crowdworkers as follows.
First, we derive a Digital Terrain Model based on filtered ground points, so that for every 3D point an individual height above ground can be calculated. Afterwards we derive two height models: one using the maximum z-value per grid cell (see Fig. 6a) and one only using LiDAR points with a maximum height over ground of 3 m (see Fig. 6b). Due to the elimination of all points that are 3 m above ground, there may be raster cells which contain no 3D point at all (e.g., at buildings). In these cases, we use the maximum z-value of the points inside the corresponding cells to avoid holes in the shading (see Fig. 6c). Following this strategy, we can efficiently filter vegetation points. Please note that vegetation can also be filtered by only considering last echo points, but due to multi-path effects on the vehicle's surface and due to the moving nature of some vehicles during data capturing, the resulting areas in the height model would be extremely noisy and unsuitable for presenting to crowdworkers. All calculations have been carried out using the Opals software (Pfeifer et al. 2014).

Crowd-Based Data Collection
For each of the two datasets, we created one crowd campaign in which each strip was collected 100 times. Since we have 11 strips (see Sect. 3), this results in 1100 crowd jobs per campaign. We paid $0.10 per crowdjob, resulting in a total cost of $110 per campaign. The average time to collect all vehicles in a strip was 335.6 s in the first campaign and 311.3 s in the second campaign. This calculates to an average hourly wage of $1.07 for campaign 1 and $1.15 for campaign 2. The time required to execute all 1100 crowdjobs was approximately 10 h for both campaigns. Table 1 shows the distribution of origin of the crowdworkers sorted by top 10 countries. The top three countries are Bangladesh, India, and Venezuela.
To evaluate the quality of the data, we have very carefully collected reference data ourselves. We developed a MATLAB tool with additional zoom functionalities to collect the vehicles as accurately as possible. Table 2 shows an evaluation of the reference data. The 11 strips of the first campaign contain 309 vehicles and of the second campaign 274 vehicles. Due to the overlap of the strips, this results in 172 unique vehicles in the first campaign and 153 unique vehicles in the second campaign. A total of 26,752 vehicles were collected in all 1100 crowdjobs together in campaign 1 and 24,303 vehicles in campaign 2. Compared to the reference, this results in an average number of vehicles collected per crowdworker relative to the number of vehicles in the reference of (26,752/1100)/(309/11) = 0.868 (Campaign 2: (24,303/1100)/(274/11) = 0.887). This means that if only one crowdworker collects the data, we would have to expect an average completeness of less than 90 percent. Figure 7 shows a typical example of data collection from 100 crowdworkers. It can be seen that although there are individual outliers, the majority of the collected lines can be clearly assigned to a vehicle. Figure 8 shows a problem that occurred sporadically in the data. Vehicles moving while being scanned by the laser scanner may be distorted. In the figure, a vehicle can be seen that is compressed in length (due to opposing movement of the car and the scanner platform). Only one crowdworker collected this vehicle. Dataset 1 contains 11 and dataset 2 15 distorted vehicles. In these cases, it is difficult to decide whether to include the vehicles in the reference, as well as difficult for the crowdworkers to interpret the data correctly. Figure 9 shows an area where two vehicles are hidden by vegetation. This problem occurs whenever there is dense vegetation close to the ground (within our considered range of 3 m above ground level, see Sect. 3) such as low-hanging  Only one crowdworker detected one of them. The data in this area are so difficult to interpret that it is very hard to decide whether there are actually vehicles or other objects (at least by only relying on 2D shadings). This problem occurs very rarely, but it shows that it is very difficult even for an expert to achieve a data collection accuracy of 100 percent.

Clustering and Integration
We cluster the centre points of all collected vehicles with Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Ester et al. 1996). DBSCAN requires two parameters: ɛ specifies the maximum distance between two points in a cluster, and MinPts specifies the minimum number of points in each cluster. The value of epsilon results from our application considering the minimum distance of the centres of two neighbouring cars in parking lots. Based on this domain knowledge and visual inspections, we use ɛ = 2 m. With smaller ɛ, it can happen that one single vehicle is divided into several clusters (see Fig. 10). With larger ɛ, it can happen that several vehicles parked next to each other merge into one cluster, so that a cluster no longer represents only one vehicle, but several vehicles (see Fig. 11). The parameter MinPts can be used to control whether vehicles that are difficult to detect appear in the result or not. A vehicle must be detected by at least MinPts crowdworkers. Thus, with high MinPts, the result would contain only those vehicles that can be detected very reliably. If, on the other hand, vehicles that are difficult to detect should also be included in the result, MinPts should be set low. At the same time, MinPts must not become too small, since otherwise falsely collected data would be accepted as vehicles. The appropriate choice of MinPts depends on n. The larger n, the larger MinPts should be chosen. A detailed examination of this relationship is given in Sect. 5.2. Figure 12a shows the 2D coordinates of the centres of the collected vehicles of a part of our study area on the example n = 10. The outliers detected with DBSCAN are marked with red colour. The remaining points are subdivided into clusters (Fig. 12b). The final positions of the clusters are calculated by averaging the x-and y-positions of the collected vehicles within each cluster (Fig. 12c).
In a refinement step, we focus on the individual lines within each cluster and iteratively eliminate all lines that deviate strongly from the averaged value. We perform this step to improve the geometric accuracy of the integrated results. This step has no effect on Precision, Recall and F1 score because it is computed after the cluster are calculated. The elimination of the strongly deviating lines mainly improves the accuracy of the length and has only little effect on position accuracy. A similar approach was used in ) for eliminating outliers from Minimum Bounding Cylinders, which represent trees collected by crowdworkers. We proceed as follows: FOR all clusters: • Compute two clusters with k-means (k = 2) (i.e., derive one cluster for the start and one cluster for the end point, no matter which one is actually start and end point) • Implicitly calculate an integrated line by averaging the points within the two clusters • Compute orthogonal distances of all points (i.e., start and end points) to integrated line • Omit all lines with a length difference to the integrated line greater 2 m  Figure 13 shows the elimination of strongly deviating lines within a cluster by means of an example. It can be seen that this step eliminates especially short lines that do not go from the front to the back of the vehicle or vice versa, but rather from one side to the other.

Quality Evaluation
To describe the completeness and correctness of the integrated multiple collected data, we evaluate: A precision value of 1.0 means that every vehicle collected by the crowd is also in the reference (but says nothing about whether all vehicles were collected). A recall value of 1.0 means that all vehicles in the reference were collected by the crowd (but says nothing about how many false vehicles were collected). The F1 score is the harmonic mean of precision and recall. The highest value of F1 score is 1, representing perfect precision and recall. The lowest value is 0, when either precision or recall is zero.
We divide the quality evaluation into two parts depending on the number n of multiple acquisitions: in the first part we discuss n ∈ [1..16] and in the second part n = 20, 40, 60, 80, 100.
For the first part, we collected the test area 16 times. For each n ∈ [1..16], we computed all possible 16!/((16−n)!*n!) combinations and calculated the mean of each quality measure. In this way, we avoid that the results depend on the selection of individual observations. This combinatorial approach provides more meaningful values than performing the test multiple times with an increasing number of crowdworkers (Test 1: one crowdworker, test 2: two crowdworkers, …). This is especially true for small n. Consider for example n = 2: if we repeat the test multiple times with an increasing number of crowdworkers, the result for n = 2 would be highly dependent on the two crowdworkers in that test. If one or even both crowdworkers deliver exceptionally good or poor results, this would lead to non-representative results. In the combinatorial approach with 16 crowdworkers, for n = 2: 16!/((16-2)!*2!) = 120, different combinations of crowdworkers are calculated and afterwards their results are averaged. Each individual crowdworker is involved only in 15 of the 120 combination. The combinatorial approach avoids outliers even if some crowdworkers deliver exceptionally good or bad results.
In the second part of the quality evaluation, we did not calculate all combinations, but selected the first n out of 100 observations for n = 20, 40, 60, 80, 100, since a combinatorial evaluation is no longer numerically feasible in a reasonable time. Therefore, the results of the second part depend on the selection of observations, but since we now have larger n, this effect is alleviated.
In the following, we discuss the evaluation of dataset 1 (see Sect. 3). The evaluation of dataset 2 led to very similar results. Therefore, we have moved the figures of the evaluation of dataset 2 to the Appendix for better readability of this paper. Figure 14 shows the evaluation of TP, FN, FP, Precision, Recall, and F1 score of dataset 1 for n ∈ [1..16]. The quality metrics depend (i) on n and (ii) on the definition of MinPts. The larger n, the higher TP and the lower FN. At the same time, it can be seen that FP increases with increasing n, which leads to declining precision and increasing recall. The reason for this is that the more crowdworkers are in the group, the more vehicles that are difficult to recognize will be detected (recall increases). At the same time, vehicles are now also detected in places where there are none at all (precision declines). This behaviour can be controlled by MinPts. If the focus is on high precision, MinPts should be chosen large. This leads to the result that only such vehicles are accepted, which were recognized by enough crowdworkers. If the focus is on high recall, MinPts should be chosen smaller, so that vehicles that are difficult to recognize are also included in the result. Figure 15 shows the evaluation of TP, FN, FP, Precision, Recall, and F1 score of dataset 1 for n = 20, 40, 60, 80, 100. It can be seen that for large n TP and FP no longer change significantly, but FP increases, resulting in decreasing precision. Compared to n ∈ [1..16], it can be seen that the F1 score does not improve anymore, but actually decreases. Therefore, with regard to completeness and correctness, it is not reasonable to collect the data more than 20 times.

Evaluation of the Geometric Quality
To describe the geometric quality, we evaluate the position difference and the length difference between integrated lines and reference. The position difference is calculated by the distance between the centre points of the lines. Based on our evaluation in the previous section, we have defined MinPts so that F1 score is maximized (compare Fig. 14), resulting in a section-by-section definition: • MinPts = 1 for n < = 4 • MinPts = 2 for 4 < n < = 8 • MinPts = 3 for 8 < n < = 12 • MinPts = 4 for n > 12  Figure 16 shows the evaluation of the position difference for n ∈ [1..16] of dataset 1. When the group consists of only one crowdworker, the mean value of the position difference is on average 0.275 m with a standard deviation of 0.182 m. As n increases, the mean and the standard deviation improve continuously. With 16 crowdworkers in the group, we achieve a position difference of 0.175 m and a standard deviation of 0.106 m. Figure 17 shows the evaluation of the length difference for n ∈ [1..16] for dataset 1. When the group consists of only one crowdworker, the mean value of the length difference is on average 0.89 m with a standard deviation of 1.02 m. There is an improvement of the mean value only up to n = 5. After that, the mean value deteriorates and then stagnates at around 0.82 m. However, it can be seen that the standard deviation decreases monotonically. This eliminates coarse outliers as n increases.
It is noticeable that the form of the distributions of the position difference and the length difference are different. For small n, there are few position differences that are close to zero, while this is the case for the length differences. The reason for this is that it is difficult to define the exact centre of a line by specifying the beginning and the end of the line. However, entering the exact line length is easier because also lines that are displaced from the reference can still have exactly the same length as the reference. This effect disappears with higher n, since a smoothing effect occurs due to the averaging. Figure 18 shows the evaluation of the position difference for n = 20, 40, 60, 80, 100 of dataset 1. We have defined MinPts = 0.25 * n so that F1 score is maximized (compare Fig. 15). It can be seen that with increasing n, the mean value of the position difference continues to improve, whereas the standard deviation remains about the same.  Figure 19 shows the evaluation of the length difference for n = 20, 40, 60, 80, 100 of dataset 1. The mean value of the length difference remains approximately constant. Similarly, the standard deviation hardly changes. The reason that the mean length difference does not improve further is that some of the crowdworkers systematically collect the data too short. Figure 20 shows a typical acquisition example. It can be seen that the majority of the lines do not span from edge to edge of the vehicles but are shorter than the reference. A mean value close to zero is therefore not to be expected even with a large n.

Significance Test
To evaluate whether increasing the number of crowdworkers improves our quality measures in a statistically significant sense, we applied a one-tailed Student's t test, which is formulated to answer the question: "Does the result of more crowdworkers yield to a significant decrease of accuracy errors?" Hence, we compare each distribution against all others and consider the 95% confidence interval as critical threshold to accept or refuse the hypothesis of a significant improvement. The results confirm the observations already made. Figure 21 shows the evaluation of the position difference of dataset 1 for n ∈ [1..16]. The position error becomes significantly smaller with increasing n. From n = 15, we run  into saturation. The smallest position error is obtained at n = 16. Figure 22 shows the evaluation of the length difference of dataset 1 for n ∈ [1..16]. A significant improvement of the length error occurs only up to n = 3. After that, an increase is no longer worthwhile, since the length error does not improve significantly. However, the standard deviation decreases with increasing n (see Fig. 17). This eliminates coarse outliers as n increases. Figure 23 shows the evaluation of the position difference of dataset 1 for n = 20, 40, 60, 80, 100. Between n = 20 and n = 40, there is no significant improvement. Thereafter, an increase of n up to n = 80 leads to significantly better results. Figure 24 shows the evaluation of the length difference of dataset 1 for n = 20, 40, 60, 80, 100. As discussed earlier, increasing n in the range between 20 and 100 does not result in smaller length errors.

Conclusion
This research proves the effectiveness of the idea of the wisdom of the crowd once more and again demonstrates that it is also applicable for very special tasks such as the collection of spatial objects from remote sensing data. We tested our approach on two datasets depicting the same area but taken at different times. The results of the quality evaluations of both data sets are very similar.
We have shown that increasing the crowd size improves the quality of the data. This effect occurs especially with small n. From about n = 20, the improvements are only very small and rather not worthwhile in relation to the costs since the costs grow linearly with n. For n greater than 20, even a deterioration of the F1 score can be observed, which is mainly explained by the increase of FP. To prevent the increase of FP, MinPts would have to be chosen larger, which, however, would again have a negative effect on TP and FN. Hence, MinPts is an important parameter to control the quality of the results. If data are to be collected where correctness is more important than completeness, MinPts   Fig. 21 One-tailed Student's t test for the position difference of dataset 1 for n ∈ [1..16], green = significant improvement, red = no significant improvement, grey = not evaluated

Fig. 22
One-tailed Student's t test for the length difference of dataset 1 for n ∈ [1..16], green = significant improvement, red = no significant improvement, grey = not evaluated Fig. 23 One-tailed Student's t test for the position difference of dataset 1 for n = 20, 40, 60, 80, 100], green = significant improvement, red = no significant improvement, grey = not evaluated Fig. 24 One-tailed Student's t test for the position difference of dataset 1 for n = 20, 40, 60, 80, 100], green = significant improvement, red = no significant improvement, grey = not evaluated should be set rather high. If the focus is on completeness and less on correctness, a low MinPts is the better choice.
The position difference is the only measure showing an improvement up to n = 100. This is mainly due to the fact that the geometric accuracy can only be calculated for TP and thus an increase of FP has no negative effect to this measure. FP cannot degrade the results because they cannot be matched to a reference and therefore no position difference and no length difference can be measured. It is likely that with even larger n, even more accurate positions can be expected, although improvements would be very slow. For the length difference, no improvement could be observed up from n = 20, which is related to the fact that some of the crowdworkers systematically collect the vehicles too short compared to the reference data. This problem could be solved, for example, by training the crowdworkers before the actual task.
Although the collection of the reference data was carried out very carefully, there are some areas that were very difficult to interpret also impacting our quality measures. In these cases, differences between crowd-based collection and reference must always be expected. This problem is of fundamental nature and cannot be solved by increasing n either. Even if we would replace the crowdworkers with experts, this problem would not be completely solved.
We currently see the main application for spatial data collection by paid crowdworkers in providing training data for deep learning methods. The approach presented uses ultra-high-resolution UAV data, which are currently not widely available. Lower-resolution airborne laser data on the other hand are widely available, but vehicle detection in such data sets is limited. However, it is conceivable to apply our approach to image data if we use orthophotos, which are available over large areas in high resolution, instead of the 2D shadings. Another possible data source would be terrestrial laser scanners such as vehicle-mounted laser scanners.

Limitations
We tested our method on two different campaigns to show that we achieve comparable results with different groups of crowdworkers. Both campaigns were conducted on mircoWorkers. It is conceivable that different results would be obtained if we switched to a different crowdsourcing marketplace due to a deviating pool of workers with different abilities (for instance, the majority of workers on Amazon Mechanical Turk are US citizen, while the ones of microWorkers are mostly situated in Asia).
The two datasets we used are very similar due to the use of the same sensor and similar flight conditions. If the data were more different, it is also to be expected that the results would be more different.
The parameters determined in this project are valid for the task discussed in conjunction with the used data and cannot be directly applied to other tasks or other data. Thus, it is to be expected that for tasks that are more difficult to solve or for data that is more difficult to interpret, more observations are necessary to increase the quality. Similarly, there may be other systematic effects than in our investigation.
We therefore see the approach described here as a guide to how the Wisdom of the Crowd principle can be evaluated qualitatively and quantitatively. The research presented here can be used as a blueprint for other applications.

Future Work
For further quality improvement, the integration of the individual collected data could be optimized. Currently we use simple averaging as aggregation rule. This could be extended using weights for each individual worker. The weights could be derived from the score that is available for each crowdworker on the crowdsourcing marketplace. Furthermore, the wisdom of the crowd principle is only one way to improve the quality of data from paid crowdworkers. Many other approaches from the fields of psychology and work science are described in the literature. These approaches can be combined with the method described in this paper to achieve further quality improvement.