1 Introduction

Social media enables members of the public to post real-time text messages, videos and photographs describing events taking place close to them. While many posts may be extraneous or misleading, social media nonetheless provides streams of up-to-date information across a wide area. For example, after the Haiti 2010 earthquake, Ushahidi gathered thousands of text messages that provided valuable first-hand information about the disaster situation [14]. An effective way to extract information from large unstructured datasets such as these is to employ crowds of non-expert annotators, as demonstrated by Galaxy Zoo [10]. Besides social media, crowdsourcing provides a means to obtain geo-tagged annotations from other unstructured data sources such as imagery from satellites or unmanned aerial vehicles (UAV).

In scenarios such as disaster response, we wish to infer the situation across a region of interest by combining annotations from multiple information sources. For example, we may wish to determine which areas are currently flooded, the level of damage to buildings in an earthquake zone, or the type of terrain in a specific area from a combination of SMS reports and satellite imagery. The situation across an area of interest can be visualised using a heatmap (e.g. the Google Maps heatmap layer), which overlays colours onto a map to indicate the intensity or probability of phenomena of interest. Probabilistic methods have been used to generate heatmaps from observations at sparse, point locations [1, 8, 9], using a Bayesian treatment of Poisson process models. However, these approaches model the rate of occurrence of events, so are not suitable for classification problems. Instead, a Gaussian process (GP) classifier can be used to model a class label that varies smoothly over space or time. This uses a latent function over input coordinates, which is mapped through a sigmoid function to obtain probabilities [16]. However, standard GP classifiers are unsuitable for heterogeneous, crowdsourced data since they do not account for the differing relevance, error rates and bias of individual information sources and annotators.

A key challenge in exploiting crowdsourced information is to account for its unreliability and to combine it with trusted data as it becomes available, such as reports from experienced first responders in a disaster zone. For regression problems, differing levels of accuracy can be handled using sensor fusion approaches such as [12, 25]. The approach of [25] uses heteroskedastic GPs to produce heatmaps that account for sensor accuracy through variance scaling. This method could be applied to spatial classification by mapping GPs through a softmax function. However, such an approach cannot handle label bias or accuracy that depends on the true class. Recently, [11] proposed learning a GP classifier from crowdsourced annotations, but their method uses a coin-flipping noise model that would suffer from the same drawbacks as adapting [25]. Furthermore, they train the model using a maximum likelihood (ML) approach, which may incorrectly estimate reliability when data for some workers is insufficient [7, 17, 20].

For classification problems, each information source can be modelled by a confusion matrix [3], which quantifies the likelihood of observing a particular annotation from an information source given the true class label. This approach naturally accounts for bias toward a particular answer and varying accuracy depending on the true class, and has been shown to outperform techniques such as majority voting and weighted sums [7, 17, 20]. Recent extensions following the Bayesian treatment of [7] can further improve results: by identifying clusters of crowd workers with shared confusion matrices [13, 23]; accounting for the time each worker takes to complete a task [24]; additionally modelling language features in text classification tasks [4, 21]. However, these methods depend on receiving multiple labels from different workers for the same data points, or, in the case of [4, 21], on correlations between text features and target classes. None of the existing confusion matrix-based approaches can model the spatial distribution of each class, and therefore, when reports are sparsely distributed over an area of interest, they cannot compensate for the lack of data at each location.

In this paper, we propose a novel Bayesian approach to aggregating sparse, geo-tagged reports from sources of varying reliability, which combines independent Bayesian classifier combination (IBCC) [7] with a GP classifier to infer discrete state values across an area of interest. Our model, HeatmapBCC, assumes that states at neighbouring locations are correlated, allowing us to fuse neighbouring reports and interpolate between them to predict the state at locations with no reports. HeatmapBCC uses confusion matrices to model the error rates, relevance and bias of each information source, permitting the use of non-expert crowds providing heterogeneous annotations. The GP handles the uncertainty that arises from sparse spatial data in a principled Bayesian manner, allowing us to incorporate prior information, such as physical models of disaster events like earthquakes, and visualise the resulting posterior distribution as a spatial heatmap. We derive a variational inference method that is able to learn the reliability model for each information source without the need for ground truth training data. This method learns full distributions over latent variables that can be used to prioritise locations for further data gathering using an active learning approach. The next section presents the HeatmapBCC model in detail, together with our efficient approximate inference algorithm. The following section then provides an empirical evaluation of our method on both synthetic and real-world problems, showing that HeatmapBCC can outperform rival methods. We make our code publicly available at https://github.com/OxfordML/heatmap_expts.

2 The HeatmapBCC Model

Our goal is to classify locations of interest, e.g. to identify them as “flooded” or “not flooded”. We can then choose locations in a grid over an area of interest and plot the classifications on a map as a spatial heatmap. The task is to infer a vector \(\varvec{t}^*\in \{1,\ldots ,J\}^{N^*}\) of target state values at \(N^*\) locations \(\varvec{X}^*\), where J is the number of state values or classes. Each row \(\varvec{x}_i\) of matrix \(\varvec{X}^*\) is a coordinate vector that specifies a point on the map. We observe a matrix of potentially unreliable geo-tagged reports, \(\varvec{c}\in \{1,\ldots ,L\}^{N\times S}\), with L possible discrete values, from S different information sources at N training locations \(\varvec{X}\).

HeatmapBCC assumes that each report label \(c_{i}^{(s)}\), from information source s, at location \(\varvec{x}_i\), is drawn from \( c_{i}^{(s)} | t_{i}, \varvec{\pi }^{(s)} \sim \mathrm {Categorical}(\varvec{\pi }_{t_{i}}^{(s)})\). The target state, \(t_{i}\), selects the row, \(\varvec{\pi }_{t_{i}}^{(s)}\), of a confusion matrix [3, 20], \(\varvec{\pi }^{(s)}\), which describes the errors and biases of s as a dependency between the report labels and the ground truth state, \(t_{i}\). As per standard IBCC [7], the reports from each information source are conditionally independent of one another given target \(t_i\), and each row of the confusion matrix is drawn from \( \varvec{\pi }_j^{(s)} | \varvec{\alpha }_{0,j}^{(s)} \sim \mathrm {Dirichlet}(\varvec{\alpha }_{0,j}^{(s)})\). The hyperparameters \(\varvec{\alpha }_{0,j}^{(s)}\) encode the prior trust in s.
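To make the report model concrete, the generative process for a single report can be sketched in a few lines of stdlib Python (a toy illustration with arbitrary parameter values, not the released implementation):

```python
import random

def sample_dirichlet(alpha):
    """Draw from a Dirichlet distribution via normalised Gamma samples."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(p):
    """Draw an index from a categorical distribution with probabilities p."""
    u, acc = random.random(), 0.0
    for idx, prob in enumerate(p):
        acc += prob
        if u < acc:
            return idx
    return len(p) - 1

# Confusion matrix for one source s: row j gives p(report label | true class j).
# A strong diagonal in alpha_0 encodes prior trust in the source.
alpha_0 = [[10.0, 1.0], [1.0, 10.0]]
pi_s = [sample_dirichlet(row) for row in alpha_0]

t_i = 1                              # true state at location x_i
c_i = sample_categorical(pi_s[t_i])  # observed report label from source s
```

Each row of the sampled confusion matrix sums to one, and the report label is drawn from the row selected by the true state, exactly as in the Categorical likelihood above.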

We assume that state \(t_{i}\) at location \(\varvec{x}_i\) is drawn from a categorical distribution, \( t_{i} | \varvec{\rho }_{i} \sim \mathrm {Categorical}(\varvec{\rho }_{i})\), where \(\rho _{i,j} = p(t_{i}=j | \varvec{\rho }_{i} ) \in [0,1]\) is the probability of state j at location \(\varvec{x}_i\). The generative process for state probabilities, \(\varvec{\rho }\), is as follows. First, draw latent functions for classes \(j\in \{1,\ldots ,J\}\) from a Gaussian process prior: \(f_j \sim \mathcal {GP}(m_j, k_{j, \varvec{\theta }}/\varsigma _j)\), where \(m_j\) is the prior mean function, \(k_j\) is the prior covariance function, \(\varvec{\theta }\) are hyperparameters of the covariance function, and \(\varsigma _j\) is the inverse scale. Map latent function values \(f_j(\varvec{x}_i)\in \mathbb {R}\) to state probabilities: \(\varvec{\rho }_{i} = \sigma (f_1(\varvec{x}_i),\ldots ,f_J(\varvec{x}_i))\in [0,1]^J\). Appropriate functions for \(\sigma \) include the logistic sigmoid and probit functions for binary classification, and softmax and multinomial probit for multi-class classification. We assume that \(\varsigma _j\) is drawn from a conjugate gamma hyperprior, \(\varsigma _j \sim \mathcal {G}\left( a_{0}, b_{0} \right) \), where \(a_{0}\) is a shape parameter and \(b_{0}\) is the inverse scale.
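The mapping \(\sigma \) can be illustrated with the softmax, one of the choices listed above (the latent values here are arbitrary):

```python
import math

def softmax(f):
    """Map latent function values f_1..f_J at one location to a probability simplex."""
    m = max(f)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [e / total for e in exps]

f_at_x = [0.5, -1.2, 2.0]             # latent values f_j(x_i) for J = 3 classes
rho = softmax(f_at_x)                 # state probabilities rho_i at location x_i
```

The largest latent value receives the highest state probability, and the output always lies on the simplex, as required for the categorical distribution over \(t_i\).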

While the reports, \(c_{i}^{(s)}\), are modelled in the same way as standard IBCC [7], HeatmapBCC introduces a location-specific state probability, \(\varvec{\rho }_{i}\), to replace the global class proportions, \(\varvec{\kappa }\), which IBCC [20] assumes are constant for all locations. Using a Gaussian process prior means the state probability varies reasonably smoothly between locations, thereby encoding correlations in the distribution over states at neighbouring locations. The covariance function is chosen to suit the scenario we wish to model and may be tailored to specific spatial phenomena (the geo-spatial impact of an earthquake, for example). The hyperparameters, \(\varvec{\theta }\), typically include a length-scale, l, which controls the smoothness of the function. Here, we assume a stationary covariance function of the form \( k_{j,\varvec{\theta }}\left( \varvec{x}, \varvec{x}'\right) = k_j \left( |\varvec{x} - \varvec{x}'|, l\right) \), where k is a function of the distance between two points and the length-scale, l. The joint distribution for the complete model is:
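As a concrete example of a kernel with the required stationary form \(k_j(|\varvec{x}-\varvec{x}'|, l)\), a squared-exponential covariance can be written as follows; this is an illustrative choice only (the experiments later use a Matérn \(\frac{3}{2}\) kernel, which has the same stationary structure):

```python
import math

def sq_exp_kernel(x, x_prime, length_scale):
    """Stationary covariance: depends only on |x - x'| and the length-scale l."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))
    return math.exp(-0.5 * (dist / length_scale) ** 2)

# Covariance of a point with itself is 1; it decays with distance at rate l.
k_same = sq_exp_kernel([0.0, 0.0], [0.0, 0.0], length_scale=20.0)
k_far = sq_exp_kernel([0.0, 0.0], [40.0, 0.0], length_scale=20.0)
```

Larger length-scales produce smoother latent functions, and hence state probabilities that vary more slowly across the map.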

$$\begin{aligned}&p\left( \varvec{c}, \varvec{t}, \varvec{f}_1,..., \varvec{f}_J, \varvec{\varsigma }_1,..., \varvec{\varsigma }_J, \varvec{\pi }^{(1)} ,...,\varvec{\pi }^{(S)} | \varvec{\mu }_1,...,\varvec{\mu }_J, \varvec{K}_1,...,\varvec{K}_J, \varvec{\alpha }_0^{(1)},...,\varvec{\alpha }_0^{(S)}\right) = \\&\,\,\prod _{i=1}^N \left\{ \rho _{i,t_i} \prod _{s=1}^S \pi ^{(s)}_{t_i,c_i^{(s)}} \right\} \prod _{j=1}^J \left\{ p\left( \varvec{f}_j | \varvec{\mu }_j, \varvec{K}_j/\varsigma _j \right) p\left( \varsigma _j | a_0, b_0 \right) \prod _{s=1}^S p\left( \varvec{\pi }^{(s)}_j | \varvec{\alpha }^{(s)}_{0,j}\right) \right\} , \end{aligned}$$

where \(\varvec{f}_j=\left[ f_j(\varvec{x}_1), ..., f_j(\varvec{x}_N) \right] \), \(\varvec{\mu }_j = \left[ m_j(\varvec{x}_1),..., m_j(\varvec{x}_N) \right] \), and \(\varvec{K}_j\in \mathbb {R}^{N\times N}\) with elements \(K_{j,n, {n'}}=k_{j,\varvec{\theta }}(\varvec{x}_n,\varvec{x}_{n'})\).

3 Variational Inference for HeatmapBCC

We use variational Bayes (VB) to efficiently approximate the posterior distribution over all latent variables, allowing us to handle streaming data reports online by restarting the VB algorithm from the previous estimate as new reports are received. To apply variational inference, we replace the exact posterior distribution with a variational approximation that factorises into separate latent variables and parameters:

$$\begin{aligned} p(\varvec{t}, \varvec{f}, \varvec{\varsigma }, \varvec{\pi }^{(1)},...,\varvec{\pi }^{(S)} | \varvec{c}, \varvec{\mu }, \varvec{K}, \varvec{\alpha }_0^{(1)},...,\varvec{\alpha }_0^{(S)}) \approx q(\varvec{t}) \prod _{j=1}^J \left\{ q(\varvec{f}_j)q(\varsigma _j)\prod _{s=1}^S q\left( \varvec{\pi }_j^{(s)}\right) \right\} . \end{aligned}$$

We perform approximate inference by optimising the variational posterior using Algorithm 1. In the remainder of this section we define the variational factors q(), expectation terms, variational lower bound and prediction step required by the algorithm.
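The overall structure of the algorithm is a standard VB coordinate ascent; a schematic sketch is given below, where the update callables stand in for the factor updates derived in the rest of this section (the toy factor used here is purely for illustration):

```python
def run_vb(updates, lower_bound, tol=1e-4, max_iter=200):
    """Generic VB loop: cycle over factor updates until the lower bound converges."""
    prev = -float("inf")
    for it in range(max_iter):
        for update in updates:        # e.g. q(t), q(pi), q(f), q(varsigma)
            update()
        lb = lower_bound()
        if abs(lb - prev) < tol:      # convergence test on successive bounds
            return it + 1, lb
        prev = lb
    return max_iter, prev

# Toy stand-in factor: the "lower bound" approaches a fixed point at 2.0.
state = {"x": 0.0}
def toy_update():
    state["x"] = 0.5 * (state["x"] + 2.0)

iters, lb = run_vb([toy_update], lambda: state["x"])
```

In the full model, the loop body would cycle over the variational factors defined below, and the convergence test uses the lower bound of Eq. 11.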


Variational Factor for Targets, t :

$$\begin{aligned} \log q(\varvec{t}) = \sum _{i=1}^N \left\{ \mathbb {E}[\log \rho _{i,t_{i}}] + \sum _{s=1}^S \mathbb {E}\left[ \log \pi ^{(s)}_{t_{i},c_{i}^{(s)}}\right] \right\} + \mathrm {const}. \end{aligned}$$
(1)

The variational factor \(q(\varvec{t})\) further factorises into individual data points, since the target value, \(t_i\), at each input point, \(\varvec{x}_i\), is independent given the state probability vector \(\varvec{\rho }_i\), giving \(r_{i,j} := q(t_{i}=j)\) where \(q(t_{i}=j) = q(t_{i}=j,\varvec{c}_{i})/ \sum _{\iota \in J}q(t_{i}=\iota ,\varvec{c}_{i})\) and:

$$\begin{aligned} q(t_{i}=j,\varvec{c}_{i}) = \exp \left( \mathbb {E}\left[ \log \rho _{i,j}\right] + \sum _{s=1}^S \mathbb {E}\left[ \log \pi ^{(s)}_{j,c_{i}^{(s)}}\right] \right) . \end{aligned}$$
(2)

Missing reports in \(\varvec{c}\) can be handled simply by omitting the term \(\mathbb {E}\left[ \log \pi ^{(s)}_{j,c_{i}^{(s)}}\right] \) for information sources, s, that have not provided a report \(c^{(s)}_i\).
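This update, including the normalisation and the omission of missing reports, can be sketched as follows (toy expected-log values throughout; not the released implementation):

```python
import math

def update_q_t(e_log_rho_i, e_log_pi, reports_i):
    """Compute r_{i,j} = q(t_i = j) from expected log terms, as in Eq. 2.

    e_log_rho_i: list over classes j of E[log rho_{i,j}]
    e_log_pi:    e_log_pi[s][j][l] = E[log pi^{(s)}_{j,l}]
    reports_i:   dict source -> observed label; missing sources are simply absent
    """
    J = len(e_log_rho_i)
    log_q = []
    for j in range(J):
        val = e_log_rho_i[j]
        for s, label in reports_i.items():   # missing reports: term omitted
            val += e_log_pi[s][j][label]
        log_q.append(val)
    m = max(log_q)                           # stabilise before exponentiating
    unnorm = [math.exp(v - m) for v in log_q]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two classes; one source, which reported label 1 at location i.
e_log_rho_i = [math.log(0.5), math.log(0.5)]
e_log_pi = {0: [[math.log(0.8), math.log(0.2)],
                [math.log(0.3), math.log(0.7)]]}
r_i = update_q_t(e_log_rho_i, e_log_pi, reports_i={0: 1})
```

With these numbers, \(q(t_i=j,\varvec{c}_i)\propto (0.5\cdot 0.2,\; 0.5\cdot 0.7)\), so the report shifts the posterior toward the second class.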

Variational Factor for Confusion Matrix Rows, \(\varvec{\pi }_j^{(s)}\) :

$$\begin{aligned} \log q\left( \varvec{\pi }_j^{(s)}\right) = \mathbb {E}_{\varvec{t}}\left[ \log p\left( \varvec{\pi }^{(s)}|\varvec{t},\varvec{c}\right) \right] = \sum _{l=1}^L N^{(s)}_{j,l}\log \pi ^{(s)}_{j,l} + \log p\left( \varvec{\pi }^{(s)}_j|\varvec{\alpha }_{0,j}^{(s)}\right) + \mathrm {const}.,&\end{aligned}$$

where \(N^{(s)}_{j,l}= \sum _{i=1}^N r_{i,j}\delta _{l,c_{i}^{(s)}}\) are pseudo-counts and \(\delta \) is the Kronecker delta. Since we assumed a Dirichlet prior, the variational distribution is also a Dirichlet, \(q(\varvec{\pi }^{(s)}_j) = \mathcal {D}(\varvec{\pi }^{(s)}_j | \varvec{\alpha }_{j}^{(s)})\), with parameters \(\varvec{\alpha }_{j}^{(s)} = \varvec{\alpha }_{0,j}^{(s)}+\varvec{N}^{(s)}_j\), where \(\varvec{N}^{(s)}_{j} =\left\{ N^{(s)}_{j,l} | l \in [1,...,L] \right\} \). Using the digamma function, \(\varPsi ()\), the expectation required for Eq. 2 is therefore:

$$\begin{aligned} \mathbb {E}\left[ \log \pi ^{(s)}_{j,l}\right] = \varPsi \left( \alpha _{j, l}^{(s)}\right) - \varPsi \left( \sum _{\iota =1}^L \alpha _{j, \iota }^{(s)}\right) . \end{aligned}$$
(3)
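The Dirichlet update and the expectation in Eq. 3 are cheap to compute. The sketch below uses a simple pure-Python digamma approximation (recurrence plus a truncated asymptotic series) purely to stay self-contained; in practice a library routine such as scipy.special.digamma would be used:

```python
import math

def digamma(x):
    """Psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:                    # shift argument up for series accuracy
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 / 120)

def confusion_row_update(alpha_0_row, pseudo_counts):
    """Posterior Dirichlet parameters and E[log pi_{j,l}] for one confusion-matrix row."""
    alpha = [a0 + n for a0, n in zip(alpha_0_row, pseudo_counts)]
    psi_sum = digamma(sum(alpha))
    e_log_pi = [digamma(a) - psi_sum for a in alpha]
    return alpha, e_log_pi

# Row j for source s: prior alpha_0 plus soft counts N_{j,l} accumulated from Eq. 2.
alpha, e_log_pi = confusion_row_update([2.0, 1.0], [14.3, 4.7])
```

The expected log terms are then substituted directly into Eq. 2 when updating \(q(\varvec{t})\).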

Variational Factor for Latent Function: The variational factor \(q(\varvec{f})\) factorises between target classes, since \(t_i\) at each point is independent given \(\varvec{\rho }\). Using the fact that \(\mathbb {E}_{t_i}[\log \mathrm {Categorical}([t_i=j] | \rho _{i,j})] = r_{i,j}\log \sigma (\varvec{f})_{j,i}\), the factor for each class is:

$$\begin{aligned} \log q(\varvec{f}_j) = \sum _{i=1}^N r_{i,j} \log \sigma (\varvec{f})_{j,i} + \mathbb {E}_{\varsigma _j}\left[ \log \mathcal {N}(\varvec{f}_{j} | \varvec{\mu }_{j}, \varvec{K}_{j}/\varsigma _j) \right] + \mathrm {const}. \end{aligned}$$
(4)

This variational factor cannot be computed analytically, but can itself be approximated using a variational method based on the extended Kalman filter (EKF) [18, 22] that is amenable to inclusion in our overall VB algorithm. Here, we present a multi-class variant of this method that applies ideas from [5]. We approximate the likelihood \(p(t_{i}=j | \varvec{\rho }_{i,j}) = \varvec{\rho }_{i,j}\) with a Gaussian distribution, using \(\mathbb {E}[\log \mathcal {N}([t_i=j] | \sigma ( \varvec{f})_{j,i}, v_{i,j})] = \log \mathcal {N}\left( r_{i,j} | \sigma ( \varvec{f})_{j,i}, v_{i,j}\right) \) to replace Eq. 4 with the following:

$$\begin{aligned} \log q(\varvec{f}_j) \approx \sum _{i=1}^N \log \mathcal {N}(r_{i,j} | \sigma ( \varvec{f})_{j,i}, v_{i,j}) + \mathbb {E}_{\varsigma _j}[ \log \mathcal {N} \left( \varvec{f}_{ j} | \varvec{\mu }_j, \varvec{K}_j/\varsigma _j \right) ] + \mathrm {const}, \end{aligned}$$
(5)

where \(v_{i,j}=\rho _{i,j}(1 - \rho _{i,j})\) is the variance of the binary indicator variable \([t_i=j]\) given by the Bernoulli distribution. We approximate Eq. 5 by linearising \(\sigma ()\) using a Taylor series expansion to obtain a multivariate Gaussian distribution \(q(\varvec{f}_j) \approx \mathcal {N}\left( \varvec{f}_j| \hat{\varvec{f}}_{j},\varvec{\varSigma }_j\right) \). Consequently, we estimate \(q\left( \varvec{f}_j\right) \) using EKF-like equations [18, 22]:

$$\begin{aligned} \hat{\varvec{f}}_{j}= & {} \varvec{\mu }_{j} + \varvec{W} \left( \varvec{r}_{.,j} - \sigma (\hat{\varvec{f}})_{j} + \varvec{G} (\hat{\varvec{f}}_{j} - \varvec{\mu }_{j}) \right) \end{aligned}$$
(6)
$$\begin{aligned} \varvec{\varSigma }_j= & {} \hat{\varvec{K}}_j - \varvec{W} \varvec{G}_j \hat{\varvec{K}}_j \end{aligned}$$
(7)

where \(\hat{\varvec{K}}_j^{-1} = \varvec{K}_j^{-1}\mathbb {E}[\varsigma _j]\) and \(\varvec{W}=\hat{\varvec{K}}_j \varvec{G}_j^T \left( \varvec{G}_j \hat{\varvec{K}}_j \varvec{G}_j^T + \varvec{Q}_j\right) ^{-1}\) is the Kalman gain, \(\varvec{r}_{.,j}=\left[ r_{1,j},...,r_{N,j}\right] \) is the vector of probabilities of target state j computed using Eq. 2 for the input points, \(\varvec{G}_j \in \mathbb {R}^{N \times N}\) is the diagonal sigmoid Jacobian matrix and \(\varvec{Q}_j\in \mathbb {R}^{N\times N}\) is a diagonal observation noise variance matrix. The diagonal elements of \(\varvec{G}_j\) are \(\varvec{G}_{j, i, i} = \sigma (\hat{\varvec{f}}_{.,i})_j (1 - \sigma (\hat{\varvec{f}}_{.,i})_j )\), where \(\hat{\varvec{f}}=\left[ \hat{\varvec{f}}_1,...,\hat{\varvec{f}}_J\right] \) is the matrix of mean values for all classes.

The diagonal elements of the noise covariance matrix are \(Q_{j,i,i} = v_{i,j}\), which we approximate as follows. Since the observations are Bernoulli distributed with an uncertain parameter \(\rho _{i,j}\), the conjugate prior over \(\rho _{i,j}\) is a beta distribution with parameters \(\sum _{j'=1}^J \nu _{0,j'} - \nu _{0,j}\) and \(\nu _{0,j}\). This can be updated to a posterior beta distribution \(p\left( \rho _{i,j} | r_{i,j}, \varvec{\nu }_{0} \right) = \mathcal {B} \left( \rho _{i,j} | \nu _{\lnot j}, \nu _j \right) \), where \(\nu _{\lnot j} = \sum _{j'=1}^J \nu _{0,j'} - \nu _{0,j} + 1 - r_{i,j}\) and \(\nu _j = \nu _{0,j} + r_{i,j}\). We now estimate the expected variance:

$$\begin{aligned}&v_{i,j}\approx \hat{v}_{i,j} = \int \left( \rho _{i,j} - \rho _{i,j}^2\right) \mathcal {B}\left( \rho _{i,j} | \nu _{\lnot j}, \nu _{j}\right) \mathrm {d}\rho _{i,j} = \mathbb {E}[\rho _{i,j}] - \mathbb {E}\left[ \rho _{i,j}^2\right] \end{aligned}$$
(8)
$$\begin{aligned} \mathbb {E}[\rho _{i,j}] = \frac{\nu _j}{\nu _j + \nu _{\lnot j}}\qquad \qquad \mathbb {E}\left[ \rho _{i,j}^2\right] = \mathbb {E}[\rho _{i,j}]^2 + \frac{\nu _j \nu _{\lnot j} }{(\nu _j + \nu _{\lnot j})^2(\nu _j + \nu _{\lnot j} + 1)} . \end{aligned}$$
(9)
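The moments in Eqs. 8 and 9 reduce to a few arithmetic operations; a small numeric check with toy counts:

```python
def expected_bernoulli_variance(nu_j, nu_not_j):
    """v-hat = E[rho] - E[rho^2] under Beta(rho | nu_not_j, nu_j), as in Eqs. 8-9."""
    total = nu_j + nu_not_j
    e_rho = nu_j / total
    e_rho_sq = e_rho ** 2 + (nu_j * nu_not_j) / (total ** 2 * (total + 1))
    return e_rho - e_rho_sq

# Uniform Beta(1, 1): E[rho] = 1/2, E[rho^2] = 1/3, so v-hat = 1/6.
v_hat = expected_bernoulli_variance(1.0, 1.0)
```

Note that uncertainty in \(\rho _{i,j}\) shrinks the variance estimate below the plug-in value \(\mathbb {E}[\rho ](1-\mathbb {E}[\rho ])\): for the uniform prior, \(\hat{v} = 1/6 < 1/4\).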

We determine values for the prior beta parameters, \(\nu _{0,j}\), by moment matching with the prior mean \(\hat{\rho }_{i,j}\) and variance \(u_{i,j}\) of \(\rho _{i,j}\), found using numerical integration. Since \(\varphi (\varvec{Q}) = \left( \varvec{G}_j \varvec{K}_j \varvec{G}_j^T + \varvec{Q}\right) ^{-1}\) is a convex function of \(\varvec{Q}\), Jensen’s inequality gives \(\varphi (\mathbb {E}[\varvec{Q}]) \le \mathbb {E}\left[ \varphi (\varvec{Q})\right] \), so evaluating \(\varphi \) at the expected noise covariance yields a lower bound on \(\mathbb {E}[\varphi (\varvec{Q})]\). Thus our approximation provides a tractable estimate of the expected value of \(\varvec{W}\).

The calculation of \(\varvec{G}_j\) requires evaluating the latent function at the input points, \(\hat{\varvec{f}}_j\). Further, Eq. 6 requires \(\varvec{G}_j\) to approximate \(\hat{\varvec{f}}_j\), causing a circular dependency. Although we could fold our expressions for \(\varvec{G}_j\) and \(\hat{\varvec{f}}_j\) directly into the VB cycle and update each variable in turn, we found that solving for \(\varvec{G}_j\) and \(\hat{\varvec{f}}_j\) within each VB iteration led to faster inference. We use the following iterative procedure to estimate \(\varvec{G}_j\) and \(\hat{\varvec{f}}_j\):

  1. Initialise \(\sigma (\hat{\varvec{f}}_{.,i}) \approx \mathbb {E}[\varvec{\rho }_{i}]\) using Eq. 9.
  2. Estimate \(\varvec{G}_j\) using the current estimate of \(\sigma (\hat{f}_{j,i})\).
  3. Update the mean \(\hat{\varvec{f}}_{j}\) using Eq. 6, inserting the current estimate of \(\varvec{G}_j\).
  4. Repeat from step 2 until \(\hat{\varvec{f}}_{j}\) and \(\varvec{G}_j\) converge.
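For a single class with a logistic \(\sigma \), this fixed-point iteration over Eqs. 6 and 7 can be sketched with NumPy; this is a simplified toy illustration, not the full multi-class implementation, and all inputs below are arbitrary:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def fit_latent_function(mu, K_hat, r, Q, max_iter=100, tol=1e-6):
    """Iterate: re-estimate G from f-hat (step 2) and f-hat from G (step 3, Eq. 6)."""
    f_hat = mu.copy()
    for _ in range(max_iter):
        rho = sigmoid(f_hat)
        G = np.diag(rho * (1.0 - rho))                        # diagonal sigmoid Jacobian
        W = K_hat @ G.T @ np.linalg.inv(G @ K_hat @ G.T + Q)  # Kalman gain
        f_new = mu + W @ (r - rho + G @ (f_hat - mu))         # Eq. 6
        if np.max(np.abs(f_new - f_hat)) < tol:               # step 4: convergence
            f_hat = f_new
            break
        f_hat = f_new
    Sigma = K_hat - W @ G @ K_hat                             # Eq. 7
    return f_hat, Sigma

# Toy problem: 3 input points, smooth prior, soft targets r from Eq. 2.
K_hat = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]])
mu = np.zeros(3)
r = np.array([0.9, 0.8, 0.2])
Q = np.diag(np.full(3, 0.25))                                 # v_{i,j} from Eq. 8
f_hat, Sigma = fit_latent_function(mu, K_hat, r, Q)
```

Points with target probabilities above 0.5 receive positive latent means, and the posterior covariance remains symmetric by construction of Eq. 7.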

The latent means, \(\hat{\varvec{f}}\), are then used to estimate the terms \(\log \rho _{i,j}\) for Eq. 2:

$$\begin{aligned} \mathbb {E}[ \log \rho _{i,j}] = \hat{f}_{j,i} - \mathbb {E}\left[ \log \sum _{j'=1}^J \exp (f_{j', i}) \right] . \end{aligned}$$
(10)

When inserted into Eq. 2, the second term in Eq. 10 cancels with the denominator, so need not be computed.

Variational Factor for Inverse Function Scale: The inverse covariance scale, \(\varsigma _j\), can also be inferred using VB by taking expectations with respect to \(\varvec{f}\):

$$\begin{aligned} \log q\left( \varsigma _j\right) = \mathbb {E}_{\varvec{f}_j}[\log p(\varsigma _j | \varvec{f}_j)] = \mathbb {E}_{\varvec{f}_j}[\log \mathcal {N}(\varvec{f}_{j} | \varvec{\mu }_{j}, \varvec{K}_j/\varsigma _j)] + \log p(\varsigma _j | a_{0}, b_{0} ) + \mathrm {const} \end{aligned}$$

which is a gamma distribution with shape \(a=a_0 + \frac{N}{2}\) and inverse scale \(b = b_0 + \frac{1}{2} \mathrm {Tr}\left( \varvec{K}_j^{-1} \left( \varvec{\varSigma }_j + \hat{\varvec{f}}_j \hat{\varvec{f}}_j^T - 2 \varvec{\mu }_{j}\hat{\varvec{f}}_j^T + \varvec{\mu }_{j}\varvec{\mu }_{j}^T \right) \right) \). We use these parameters to compute the expected latent model precision, \(\mathbb {E}[\varsigma _j] = a/b\), in Eq. 7, and for the lower bound described in the next section we also require \(\mathbb {E}_q[\log (\varsigma _j)] = \varPsi (a) - \log (b)\).
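This update is cheap to compute, since the quadratic term inside the trace is just the second moment \(\mathbb {E}[(\varvec{f}_j-\varvec{\mu }_j)(\varvec{f}_j-\varvec{\mu }_j)^T] = \varvec{\varSigma }_j + (\hat{\varvec{f}}_j-\varvec{\mu }_j)(\hat{\varvec{f}}_j-\varvec{\mu }_j)^T\). A NumPy sketch with toy matrices:

```python
import numpy as np

def update_inverse_scale(a0, b0, K, Sigma, f_hat, mu):
    """Gamma posterior for varsigma_j: a = a0 + N/2, b from the expected quadratic term."""
    N = len(f_hat)
    d = f_hat - mu
    quad = Sigma + np.outer(d, d)     # E[(f - mu)(f - mu)^T]
    a = a0 + 0.5 * N
    b = b0 + 0.5 * np.trace(np.linalg.solve(K, quad))
    return a, b, a / b                # a/b = E[varsigma_j], used in Eq. 7

K = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma = 0.1 * np.eye(2)
a, b, e_varsigma = update_inverse_scale(1.0, 1.0, K, Sigma,
                                        f_hat=np.array([0.5, -0.5]),
                                        mu=np.zeros(2))
```

Since the quadratic term is positive semi-definite, b can only grow from \(b_0\), keeping the expected precision finite.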

Variational Lower Bound: Due to the approximations described above, we are unable to guarantee an increased variational lower bound for each cycle of the VB algorithm. We test for convergence of the variational approximation efficiently by comparing the variational lower bound \(\mathcal {L}(q)\) on the model evidence calculated at successive iterations. The lower bound for HeatmapBCC is given by:

$$\begin{aligned}&\mathcal {L}(q) = \mathbb {E}_q \left[ \log p\left( \varvec{c}|\varvec{t},\varvec{\pi }^{(1)},...,\varvec{\pi }^{(S)}\right) \right] + \mathbb {E}_q\left[ \log \frac{p(\varvec{t}|\varvec{\rho })}{q(\varvec{t})}\right] + \sum _{j=1}^J \Bigg \{ \\&\left. \mathbb {E}_q\left[ \log \frac{p\left( \varvec{f}_j | \varvec{\mu }_j, \varvec{K}_j/\varsigma _j\right) }{q(\varvec{f}_j)}\right] + \mathbb {E}_q \left[ \log \frac{p(\varvec{\varsigma }_j | a_0, b_0)}{q(\varvec{\varsigma }_j)} \right] \right. \left. + \sum _{s=1}^S \mathbb {E}_q \left[ \log \frac{p\left( \varvec{\pi }_j^{(s)}|\varvec{\alpha }_{0,j}^{(s)}\right) }{q\left( \varvec{\pi }_j^{(s)}\right) }\right] \right\} . \nonumber \end{aligned}$$
(11)

Predictions: Once the algorithm has converged, we predict target states, \(\varvec{t}^*\) and probabilities \(\varvec{\rho }^*\) at output points \(\varvec{X}^*\) by estimating their expected values. For a heatmap visualisation, \(\varvec{X}^*\) is a set of evenly-spaced points on a grid placed over the region of interest. We cannot compute the posterior distribution over \(\varvec{\rho }^*\) analytically due to the non-linear sigmoid function. We therefore estimate the expected values \(\mathbb {E}[\varvec{\rho }_j^*]\) by sampling \(\varvec{f}_j^*\) from its posterior and mapping the samples through the sigmoid function. The multivariate Gaussian posterior of \(\varvec{f}_j^*\) has latent mean \(\hat{\varvec{f}}^*\) and covariance \(\varvec{\varSigma }^*\):

$$\begin{aligned}&\hat{\varvec{f}}^*_{j} = \varvec{\mu }^*_j + \varvec{W}_j^*\left( \varvec{r}_{.,j} - \sigma (\hat{\varvec{f}}_j) + \varvec{G}_j(\hat{\varvec{f}}_j - \varvec{\mu }_j )\right) \end{aligned}$$
(12)
$$\begin{aligned}&\varvec{\varSigma }_j^{*} = \hat{\varvec{K}}^{**}_j -\varvec{W}_j^* \varvec{G}_j \hat{\varvec{K}}^*_j, \end{aligned}$$
(13)

where \(\varvec{\mu }^*_j\) is the prior mean at the output points, \(\hat{\varvec{K}}^{**}_j\) is the covariance matrix of the output points, \(\hat{\varvec{K}}^{*}_j\) is the covariance matrix between the input and the output points, and \(\varvec{W}_j^* = \hat{\varvec{K}}_j^* \varvec{G_j}^T \left( \varvec{G}_j \hat{\varvec{K}}_j \varvec{G_j}^T + \varvec{Q}_j \right) ^{-1}\) is the Kalman gain. The predictions for output states \(\varvec{t}^*\) are the expected probabilities \(\mathbb {E}\left[ t_{i,j}^*\right] = r^*_{i,j} \propto q(t_i=j,\varvec{c})\) of each state j at each output point \(\varvec{x}_i \in \varvec{X}^*\), computed using Eq. 2. In a multi-class setting, the predictions for each class could be plotted as separate heatmaps.
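The final step, estimating \(\mathbb {E}[\varvec{\rho }^*]\) by sampling \(\varvec{f}^*_j\) and pushing the samples through the sigmoid, can be sketched as follows (single-class logistic case, with an arbitrary toy predictive posterior):

```python
import numpy as np

def predict_rho(f_star_mean, Sigma_star, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[rho*]: sample f* ~ N(f-hat*, Sigma*), map through
    the sigmoid, and average the samples."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(f_star_mean, Sigma_star, size=n_samples)
    return (1.0 / (1.0 + np.exp(-samples))).mean(axis=0)

# Toy predictive posterior at 3 grid points of the heatmap.
f_star = np.array([2.0, 0.0, -2.0])
Sigma_star = 0.5 * np.eye(3)
rho_star = predict_rho(f_star, Sigma_star)
```

The resulting expected probabilities lie in (0, 1) and can be plotted directly as heatmap intensities over the grid \(\varvec{X}^*\).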

4 Experiments

We compare the efficacy of our approach with alternative methods on synthetic data and two real datasets. In the first real-world application we combine crowdsourced annotations of images in the aftermath of a disaster, while in the second we aggregate crowdsourced labels assigned to geo-tagged text messages to predict emergencies in the aftermath of an earthquake. All experiments are binary classification tasks where reports may be negative (recorded as \(c^{(s)}_{i}=1\)) or positive (\(c^{(s)}_{i}=2\)). In all experiments, we examine the effect of data sparsity using an incremental train/test procedure:

  1. Train all methods on a random subset of reports (initially a small subset).
  2. Predict states \(\varvec{t}^*\) at grid points in an area of interest. For HeatmapBCC, we use the predictions \(\mathbb {E}[t_{i,j}^*]\) described in Sect. 3.
  3. Evaluate predictions using the area under the ROC curve (AUC) or cross-entropy classification error.
  4. Increment the subset of training labels at random and repeat from step 1.
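The AUC used in the evaluation step can be computed directly from pairwise ranks via the Mann-Whitney U statistic; a minimal implementation for illustration (the experiments would typically use a library routine):

```python
def auc(scores, labels):
    """Area under the ROC curve: fraction of positive-negative pairs ranked
    correctly, with ties receiving half credit."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking of positives above negatives gives AUC = 1.0.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to chance-level ranking, which is why the figures below report improvements in AUC rather than absolute values.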

Specific details vary in each experiment and are described below. We evaluate HeatmapBCC against the following alternatives: a kernel density estimator (KDE) [15, 19], which is a non-parametric technique that places a Gaussian kernel at each observation point, then normalises the sum of Gaussians over all observations; a GP classifier [18], which applies a Bayesian non-parametric approach but assumes reports are equally reliable; IBCC with VB [20], which performs no interpolation between spatial points, but is a state-of-the-art method for combining unreliable crowdsourced classifications; and an ad-hoc combination of IBCC and the GP classifier (IBCC+GP), in which the output classifications of IBCC are used as training labels for the GP classifier. This last method illustrates whether the single VB learning approach of HeatmapBCC is beneficial, for example, by transferring information between neighbouring data points when learning confusion matrices. For the first real dataset, we include additional baselines: SVM with radial basis function kernel; a K-nearest neighbours classifier with \(n_{neighbours}=5\) (NN); and majority voting (MV), which defaults to the most frequent class label (negative) in locations with no labels.

4.1 Synthetic Data

We ran three experiments with synthetic data to illustrate the behaviour of HeatmapBCC with different types of unreliable reporters. For each experiment, we generated 25 binary ground truth datasets as follows: obtain coordinates at all 1600 points in a \(40 \times 40\) grid; draw latent function values \(f_{\varvec{x}}\) from a multivariate Gaussian distribution with zero mean and Matérn\(\frac{3}{2}\) covariance with \(l=20\) and inverse scale 1.2; apply sigmoid function to obtain state probabilities, \(\rho _{\varvec{x}}\); draw target values, \(t_{\varvec{x}}\), at all locations.

Fig. 1.

Synthetic data, noisy reporters: median improvement of HeatmapBCC over alternatives over 25 datasets, against number of crowdsourced labels. Shaded areas show inter-quartile range. Top-left: AUC, 25% noisy reporters. Top-right: AUC, 50% noisy reporters. Bottom-left: AUC, 75% noisy reporters. Bottom-right: NLPD of state probabilities, \(\varvec{\rho }\), with 50% noisy reporters.

Noisy reporters: The first experiment tests robustness to error-prone annotators. For each of the 25 ground truth datasets, we generated three crowds of 20 reporters. In each crowd, we varied the number of reliable reporters between 5, 10 and 15, while the remainder were noisy reporters with high random error rates. We simulated reliable reporters by drawing confusion matrices, \(\varvec{\pi }^{(s)}\), from beta distributions with parameter matrix set to \(\varvec{\alpha }^{(s)}_{jj}=10\) along the diagonals and 1 elsewhere. For noisy workers, all parameters were set equally to \(\varvec{\alpha }^{(s)}_{jl}=5\). For each proportion of noisy reporters, we selected reporters and grid points at random, and generated 2400 reports by drawing binary labels from the confusion matrices \(\varvec{\pi }^{(1)}, ..., \varvec{\pi }^{(20)}\). We ran the incremental train/test procedure for each crowd with each of the 25 ground truth datasets. For HeatmapBCC, GP and IBCC+GP the kernel hyperparameters were set as \(l=20\), \(a_0=1\), and \(b_0=1\). For HeatmapBCC, IBCC and IBCC+GP, we set confusion matrix hyperparameters to \(\varvec{\alpha }^{(s)}_{j,j}=2\) along the diagonals and \(\varvec{\alpha }^{(s)}_{j,l}=1\) elsewhere, assuming a weak tendency toward correct labels. For IBCC we also set \(\nu _0=[1, 1]\).

Figure 1 shows the median differences in AUC between HeatmapBCC and the alternative methods for noisy reporters. Plotting the difference between methods allows us to see consistent performance differences when AUC varies substantially between runs. More reliable workers increase the AUC improvement of HeatmapBCC. With all proportions of workers, the performance improvements are smaller with very small numbers of labels, except against IBCC, as none of the methods produces a confident model with very sparse data. As more labels are gathered, there are more locations with multiple reports, and IBCC is able to make good predictions at those points, thereby reducing the difference in AUC as the number of labels increases. For the other three methods, however, the difference in AUC continues to increase, because they improve more slowly as more labels are received. With more than 700 labels, using the GP to estimate the class labels directly is less effective than using IBCC classifications at points where we have received reports, hence the poorer performance of GP and IBCC+GP.

In Fig. 1 we also show the improvement in negative log probability density (NLPD) of the state probabilities, \(\varvec{\rho }\). We compare HeatmapBCC only against the methods that place a posterior distribution over their estimated state probabilities. As more labels are received, the IBCC+GP method begins to improve slightly, as it begins to identify the noisy reporters in the crowd. The GP is much slower to improve due to the presence of these noisy labels.

Fig. 2.

Synthetic data, 50% biased reporters: median improvement of HeatmapBCC compared to alternatives over 25 datasets, against number of crowdsourced labels. Shaded areas show the inter-quartile range. Left: AUC. Right: NLPD of state probabilities, \(\varvec{\rho }\).

Biased reporters: The second experiment simulates the scenario where some reporters choose the negative class label overly frequently, e.g. because they fail to observe the positive state when it is present. We repeated the procedure used for noisy reporters but replaced the noisy reporters with biased reporters generated using the parameter matrix \(\varvec{\alpha }^{(s)} = \left[ {\begin{matrix}7 &{} 1\\ 6 &{} 2\end{matrix}}\right] \). We observe similar performance improvements to the first experiment with noisy reporters, as shown in Fig. 2, suggesting that HeatmapBCC is also better able to model biased reporters from sparse data than rival approaches. Figure 3 shows an example of the posterior distributions over \(t_{\varvec{x}}\) produced by each method when trained on 1500 random labels from a simulated crowd with \(50\%\) biased reporters. We can see that the ground truth appears most similar to the HeatmapBCC estimates, while IBCC is unable to perform any smoothing.

Fig. 3. Synthetic data, 50% biased reporters: posterior distributions. The histogram of reports shows the difference between positive and negative label frequencies at each grid square.

Continuous report locations: In the previous experiments we drew reports from discrete grid points, so that multiple reporters produced noisy labels for the same target, \(t_{\varvec{x}}\). The third experiment tests the behaviour of our model with reports drawn from continuous locations, with 50% noisy reporters drawn as in the first experiment. In this case, our model receives only one report for each object \(t_{\varvec{x}}\) at the input locations \(\varvec{X}\). Figure 4 shows that the difference in AUC between HeatmapBCC and the other methods is significantly reduced, although still positive. This may be because we rely on \(\rho \) to make classifications, since we have not observed any reports at the exact test locations \(\varvec{X}^*\); if \(\rho _{\varvec{x}}\) is close to 0.5, the prediction for the class label \(t_{\varvec{x}}\) is uncertain. However, the improvement in NLPD of the state probabilities \(\varvec{\rho }\) is less affected by using continuous locations, as seen by comparing Fig. 1 with Fig. 4, suggesting that HeatmapBCC remains advantageous when there is only one report at each training location. In practice, reports at neighbouring locations may be intended to refer to the same \(t_{\varvec{x}}\), so if all reports are treated as relating to separate objects, they could bias the state probabilities. Grouping reports into discrete grid squares avoids this problem and means we obtain a state classification for each square in the heatmap. We therefore continue to use discrete grid locations in our real-world experiments.
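The grouping of continuous report locations into discrete grid squares can be sketched as follows (a minimal illustration; helper names are ours):

```python
import numpy as np

def snap_to_grid(coords, origin, cell_size, grid_shape):
    """Map continuous (x, y) report coordinates to discrete grid-square
    indices, so that nearby reports become noisy labels of one target t_x.
    (Illustrative helper; names are ours, not the paper's code.)"""
    idx = np.floor((np.asarray(coords) - origin) / cell_size).astype(int)
    # clip so that reports on the outer boundary stay inside the grid
    return np.clip(idx, 0, np.array(grid_shape) - 1)

coords = [[0.12, 0.97], [0.14, 0.95], [0.85, 0.25]]
cells = snap_to_grid(coords, origin=np.array([0.0, 0.0]),
                     cell_size=0.1, grid_shape=(10, 10))
# The first two reports land in the same grid square, so they are aggregated
# as two labels for a single target rather than two separate objects.
```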

Fig. 4. Synthetic data, 50% noisy reporters, continuous report locations: median improvement of HeatmapBCC compared to alternatives over 25 datasets, against the number of crowdsourced labels. Shaded areas show the inter-quartile range. Left: AUC. Right: NLPD of state probabilities, \(\varvec{\rho }\).

4.2 Crowdsourced Labels of Satellite Images

We obtained a set of 5,477 crowdsourced labels from a trial run of the Zooniverse Planetary Response Network project. In this application, volunteers labelled satellite images showing damage to Tacloban, Philippines, after Typhoon Haiyan/Yolanda. The volunteers’ task was to mark features such as damaged buildings, blocked roads and floods. For this experiment, we first divided the area into a \(132 \times 92\) grid. The goal was then to combine the crowdsourced labels to classify grid squares according to whether they contain buildings with major damage or not. We treated cases where a user observed an image but did not mark any features as a set of multiple negative labels, one for each of the grid squares covered by the image. Our dataset contained 1,641 labels marking buildings with major structural damage, and 1,245 negative labels. Although this dataset does not contain ground truth annotations, it contains enough crowdsourced annotations that we can confidently determine labels for most of the region of interest using all of the data. The aim is to test whether our approach can replicate these results using only a subset of the crowdsourced labels, thereby reducing the workload of the crowd by allowing for sparser annotations. We therefore defined gold-standard labels by running IBCC on the complete set of crowdsourced labels, then extracting the IBCC posterior probabilities for the 572 data points with \(\ge 3\) crowdsourced labels where the posterior probability of the most probable class was \(\ge 0.9\). The IBCC hyperparameters were set to \(\varvec{\alpha }^{(s)}_{0,j,j}=2\) along the diagonals, \(\varvec{\alpha }_{0,j,l}^{(s)}=1\) elsewhere, and \(\nu _0=[100, 100]\).
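The gold-standard selection rule can be sketched as follows, assuming arrays of IBCC posteriors and per-point label counts (illustrative names and data, not the paper's code):

```python
import numpy as np

def select_gold_standard(posteriors, n_labels, min_labels=3, min_conf=0.9):
    """Return indices of grid squares that qualify as gold standard: at least
    `min_labels` crowdsourced labels and an IBCC posterior for the most
    probable class of at least `min_conf`. (Illustrative helper only.)"""
    confident = posteriors.max(axis=1) >= min_conf
    well_covered = n_labels >= min_labels
    return np.where(confident & well_covered)[0]

posteriors = np.array([[0.95, 0.05],   # confident and well covered -> gold
                       [0.60, 0.40],   # too uncertain
                       [0.08, 0.92]])  # confident but only 2 labels
n_labels = np.array([4, 5, 2])
gold_idx = select_gold_standard(posteriors, n_labels)
```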

We ran our incremental train/test procedure 20 times with initial subsets of 178 random labels. Each of these 20 repeats required approximately 45 minutes of runtime on an Intel i7 desktop computer. The length-scales l for HeatmapBCC, GP and IBCC+GP were optimised at each iteration using maximum likelihood II, i.e. by maximising the variational lower bound on the log likelihood (Eq. 11), as described in [16]. The inverse scale hyperparameters were set to \(a_0=0.5\) and \(b_0=5\), and the other hyperparameters were set as for gold label generation. We did not find a significant difference in results when varying the diagonal confusion matrix hyperparameters \(\varvec{\alpha }^{(s)}_{0,j,j}\) from 2 to 20.

Fig. 5. Planetary Response Network, major structural damage data: median values over 20 repeats against the number of randomly selected crowdsourced labels. Shaded areas show the inter-quartile range. Left: AUC. Right: cross entropy error.

In Fig. 5 (left) we can see how AUC varies as more labels are introduced, with HeatmapBCC, GP and IBCC+GP converging close to our gold-standard solution. HeatmapBCC performs best initially, potentially because it can learn a more suitable length-scale with less data than GP and IBCC+GP. SVM outperforms GP and IBCC+GP with 178 labels, but is outperformed when more labels are provided. Majority voting, nearest neighbour and IBCC produce much lower AUCs than the other approaches. The benefits of HeatmapBCC can be more clearly seen in Fig. 5 (right), which shows a substantial reduction in cross entropy classification error compared to alternative methods, indicating that HeatmapBCC produces better probability estimates.

4.3 Haiti Earthquake Text Messages

Here we aggregate text reports written by members of the public after the Haiti 2010 earthquake. The dataset we use was collected and labelled by Ushahidi [14]. We selected 2,723 geo-tagged reports that were sent mainly by SMS and were categorised by Ushahidi volunteers. The category labels describe the type of situation reported, such as “medical emergency” or “collapsed building”. In this experiment, we aim to predict a binary class label, “emergency” or “no emergency”, by combining all reports. We model each category as a different information source; if a category label is present for a particular message, we observe a value of 1 from that information source at the message’s geo-location. This application differs from the satellite labelling task because many of the reports do not explicitly report emergencies and may be irrelevant. In the absence of ground truth data, we establish a gold-standard test set by training IBCC on all 2,723 reports, placed into 675 discrete locations on a \(100 \times 100\) grid, giving approximately four reports per occupied grid square. We set the IBCC hyperparameters to \(\varvec{\alpha }^{(s)}_{0,j,j}=100\) along the diagonals, \(\varvec{\alpha }^{(s)}_{0,j,l}=1\) elsewhere, and \(\varvec{\nu }_0=[2000, 1000]\).
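The mapping from category labels to information-source observations can be sketched as follows (category names and report data below are illustrative, not drawn from the dataset):

```python
import numpy as np

# One information source per category: a report tagged with category c yields
# a positive observation from source c at the message's grid location.
categories = ["medical emergency", "collapsed building", "food shortage"]
reports = [
    {"loc": (12, 34), "cats": ["medical emergency"]},
    {"loc": (12, 35), "cats": ["collapsed building", "medical emergency"]},
]

observations = []  # rows of (source_id, x, y, value)
for r in reports:
    for c in r["cats"]:
        observations.append((categories.index(c), r["loc"][0], r["loc"][1], 1))

obs = np.array(observations)
# Each row is one positive (value = 1) observation from the information
# source corresponding to the message's category, at the message location.
```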

Since the Ushahidi dataset contains only reports of emergencies, and no reports stating that no emergency is taking place, we cannot learn the length-scale l from this data and must rely on background knowledge. We therefore selected another dataset from the Haiti 2010 earthquake that has gold-standard labels, namely the building damage assessment provided by UNOSAT [2]. We expect this data to have a similar length-scale, because the underlying cause of both the building damage and the medical emergencies was an earthquake affecting built-up areas where people were present. We estimated l using maximum likelihood II optimisation, giving an optimal value of \(l=16\) grid squares, and transferred this point estimate to the model of the Ushahidi data. Our experiment repeated the incremental train/test procedure 20 times with hyperparameters set to \(a_0=1500\), \(b_0=1500\), \(\varvec{\alpha }^{(s)}_{0,j,j}=100\) along the diagonals, \(\varvec{\alpha }^{(s)}_{0,j,l}=1\) elsewhere, and \(\varvec{\nu }_0=[2000, 1000]\).
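The maximum likelihood II step can be sketched as a one-dimensional search over candidate length-scales. Here `lower_bound` is a toy surrogate peaking at the optimum reported above; in the real model, the variational lower bound is evaluated by running inference at each candidate l:

```python
import numpy as np

def lower_bound(l):
    """Toy stand-in for the variational lower bound (Eq. 11), constructed
    to peak at l = 16; the real bound requires running model inference."""
    return -(np.log(l) - np.log(16.0)) ** 2

# Search candidate length-scales on a log-spaced grid and keep the maximiser.
candidates = np.exp(np.linspace(np.log(1.0), np.log(100.0), 500))
l_opt = candidates[np.argmax([lower_bound(l) for l in candidates])]
# l_opt lies close to 16 grid squares for this surrogate.
```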

Fig. 6. Haiti text messages. Left: cross entropy error against the number of randomly selected crowdsourced labels; lines show the median over 25 repeats, with shaded areas showing the inter-quartile range. The gold standard was defined by running IBCC with 675 labels on a \(100 \times 100\) grid. Right: heatmap of emergencies for part of Port-au-Prince after the 2010 earthquake, from high probability (dark orange) to low probability (blue). (Color figure online)

Figure 6 shows that HeatmapBCC is able to achieve low error rates when the reports are sparse. The IBCC and HeatmapBCC results do not quite converge, due to the interpolation performed by HeatmapBCC, which can still affect the results when there are several reports per grid square. The gold-standard predictions from IBCC also contain some uncertainty, so the cross entropy does not reach zero, even with all labels. The GP alone is unable to determine the different reliability levels of each report type, so while it can interpolate between sparse reports, HeatmapBCC and IBCC detect the more reliable information sources and therefore produce different predictions as more labels are supplied. In summary, HeatmapBCC produces predictions with 439 labels (65%) that have an AUC within 0.1 of the gold-standard predictions produced using all 675 labels, and reduces the cross entropy to 0.1 bits with 400 labels (59%), showing that it is effective at predicting emergency states from a reduced number of Ushahidi reports. Using an Intel i7 laptop, HeatmapBCC inference over 675 labels required approximately one minute.
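The cross entropy metric used here can be computed as follows, given gold-standard and predicted probabilities of the positive class (an illustrative helper, reported in bits as in the text; the data below are made up):

```python
import numpy as np

def cross_entropy_bits(gold, pred, eps=1e-12):
    """Mean cross entropy, in bits, between gold-standard probabilities of
    the positive class and predicted probabilities. (Illustrative helper;
    not the paper's code.)"""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(gold * np.log2(p) + (1.0 - gold) * np.log2(1.0 - p)))

# Predictions close to confident gold labels give a small error:
gold = np.array([1.0, 0.0, 1.0])
pred = np.array([0.97, 0.05, 0.90])
err = cross_entropy_bits(gold, pred)  # roughly 0.09 bits
```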

We use HeatmapBCC to visualise emergencies in Port-au-Prince, Haiti, after the 2010 earthquake, by plotting the posterior class probabilities as the heatmap shown in Fig. 6. Our example shows how HeatmapBCC can combine reports from trusted sources with crowdsourced information. The blue area shows a negative report from a simulated first responder, with confusion matrix hyperparameters set to \(\varvec{\alpha }^{(s)}_{0,j,j}=450\) along the diagonals, so that the negative report was highly trusted and had a stronger effect than the many surrounding positive reports. Uncertainty in the latent function \(f_j\) can be used to identify regions where information is lacking and further reconnaissance is necessary. Probabilistic heatmaps therefore offer a powerful tool for situation awareness and planning in disaster response.
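Rendering the posterior probabilities as a blue-to-orange colour scale, as in Fig. 6, can be sketched without a plotting library; the RGB endpoint values below are illustrative, not those used to produce the figure:

```python
import numpy as np

def prob_to_rgb(rho):
    """Map posterior probabilities in [0, 1] to colours interpolated
    linearly from blue (low probability) to dark orange (high probability),
    mimicking the heatmap colour scheme (RGB endpoints are illustrative)."""
    rho = np.asarray(rho, dtype=float)
    blue = np.array([0.0, 0.3, 1.0])
    orange = np.array([0.9, 0.4, 0.0])
    return (1.0 - rho)[..., None] * blue + rho[..., None] * orange

rgb = prob_to_rgb(np.array([[0.0, 0.5, 1.0]]))
# rgb[0, 0] is pure blue and rgb[0, 2] is dark orange; intermediate
# probabilities blend linearly between the two endpoints.
```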

5 Conclusions

In this paper we presented a novel Bayesian approach to aggregating unreliable discrete observations from different sources to classify the state across a region of space or time. We showed how this method can combine noisy, biased and sparse reports, and interpolate between them to produce probabilistic spatial heatmaps for applications such as situation awareness. Our experiments demonstrated the advantages of integrating a confusion matrix model, which captures the unreliability of different information sources, with a Gaussian process, which shares information between sparse report locations. In future work we intend to improve the scalability of the GP using stochastic variational inference [6], and to investigate clustering confusion matrices using a hierarchical prior, as per [13, 23], which may improve the ability to learn confusion matrices when data for individual information sources are sparse.