1 Introduction

Vagueness is ubiquitous in natural language, but it is unclear what practical role, if any, it plays in our communication. For example, is the vagueness of adjective definitions an efficiency enhancing feature of the way in which we represent concepts, or is it an unfortunate, if perhaps inevitable, side-effect of the way in which language is acquired, or has evolved (O’Connor 2013)? A fundamental difficulty encountered by any general attack on this problem relates to the breadth of the concept of vagueness itself. Vagueness is a multi-faceted phenomenon and although it is clearly different from ambiguity and imprecision there are still differing opinions as to exactly what linguistic phenomena come under its umbrella. Keefe and Smith (2002) identify three interrelated properties of vague predicates; (1) borderline cases (2) blurred boundaries and (3) susceptibility to sorites paradoxes. There is a subtle but important distinction between (1) and (2) which suggests that we may need to look at different aspects of communication in order to understand the possible utility of these different properties of vagueness. Explicit borderline cases are those which are neither members of a given category nor of its complement. Proposed models of this characteristic either permit truth gaps (Fine 1975), i.e. statements which are neither true nor false, or introduce a third truth-value to represent ‘borderline’ (Kleene 1952). van Deemter (2009a) has identified a number of communication scenarios in which vagueness can play a positive role including, for example, by mitigating the risk associated with making predictions or promises. Lawry and Tang (2012) suggested that borderline cases may indeed have a positive role to play in this form of risk management. The underlying intuition is that the presence of borderline cases provides additional flexibility within a payoff model when there is uncertainty about the possible outcomes. 
For instance, we might assume that the payoff from making a forecast that turns out to be borderline will lie somewhere between the payoffs from a forecast which turns out to be false and one which turns out to be true. This extra flexibility allows agents to balance the vagueness of assertions against their uncertainty so as to maximise the expected payoff from making a forecast or a promise. Blurred boundaries, on the other hand, arise from a type of uncertainty about where exactly the boundary of a category lies, and we will argue below that this can be modelled probabilistically. In this paper we focus on the utility of blurred boundaries and, by adopting a probabilistic interpretation, we attempt to describe a communication scenario in which stochastic behaviour, resulting from vague definitions of adjectives along a continuous scale, is on average better than the optimal Boolean, i.e. non-vague, alternative.

Signalling games (Lewis 1969) have provided a common formalism in which to study the utility of vagueness, and in particular blurred boundaries, in communication [see van Deemter (2009b) for an overview of recent work]. Such games typically involve two agents, a sender and receiver, with a shared vocabulary consisting of a finite set of words, used to describe an underlying reality of which the sender but not the receiver has direct knowledge. Each agent then adopts transmission and interpretation strategies so as to maximize their respective utilities. For example, De Jaegher (2003) investigates the role of vagueness in signalling games in which the sender and receiver have different and possibly conflicting utilities. From an alternative perspective Franke et al. (2011) suggest that vagueness is a natural property for boundedly rational agents. In particular, they consider the cases in which agents have bounded rationality due to memory limitations and also due to random error, i.e. noise. Perhaps the most compelling study of vagueness in communication, however, is the still unpublished work of Lipman (2009) in which signalling is studied assuming the Gricean maxim (1975) that both sender and receiver aim to communicate as effectively as possible. Lipman’s result shows that for rational agents, using vague definitions is always sub-optimal in comparison to a Boolean alternative. More specifically, vagueness is associated with the use of mixed strategies, these being probability distributions over pure strategies (Footnote 1). Informally stated, Lipman’s main result is that no non-trivial mixed strategy is ever strictly better performing than any of the pure strategies to which it allocates non-zero probability. In other words, strict Nash equilibria will only contain pure strategies. One feature which is common to all of these studies is that in any signalling game there is only one sender.
This immediately rules out the possibility of any form of information aggregation on the part of the receiver. We will now argue that it is exactly as part of such an aggregation process that labels with blurred boundaries may have some utility. We begin by considering a simple example.

There has been a street robbery in central Bristol. Around midday, a robber has approached a member of the public and stolen some money and their mobile phone. Due to the location and time at which the robbery took place, there are many witnesses, each able to provide a good description of the robber. The police officer in charge of the investigation takes formal statements in which the witnesses are asked to describe different characteristics of the robber including about their height. Now we have a clear intuition that the police officer benefits from having multiple statements, and to some extent, the more the better. This is no doubt partly because the different witnesses bring different perspectives, fill in the gaps left by others, and hence together provide a more complete overall picture of events. However, in addition, we suggest that an element of randomness on the part of witnesses in their choice of words can also provide the police officer with additional information. Furthermore, we suggest that the blurred boundaries or gradedness of vague words can be a natural source of this type of stochasticity. Suppose for simplicity that height is only describable using the two labels, short and tall, then if all witnesses describe the robber as short, then the police officer might infer that they are likely to be a prototypical short person. On the other hand, a 50–50 split between those witnesses who say short and those who say tall is more likely to suggest a person of intermediate height. Now notice that if instead of making stochastic assertions based on some form of graded concept definition, the witnesses were simply applying Boolean definitions of short and tall, then inference of this form would not be possible. To see this, suppose that all the witnesses share the same Boolean definitions of short and tall, according to which all heights less than a threshold \(\theta \) are classified as short, and all heights greater than \(\theta \) as tall. 
In this case, if we assume a noise free model in which everyone receives the same information, then no matter what the robber’s height, either all the witnesses would describe him as short or all as tall. Of course, in practice the witnesses are likely to differ, even in the case that they all adopt the same Boolean model. For example, there would be natural variation in their perceptions and in the conditions and locations where they each saw the robbery take place e.g. witnessing it from different angles and in different light. However, we suggest that in addition to this natural variation there can be a positive role to play for stochasticity directly induced by the blurred boundaries of vague categories.

2 The Uncertain Threshold Model of Vagueness

Probabilistic approaches to vagueness have a history dating back to Black (1937), and include work by Loginov (1966), Hisdal (1988), Edgington (1997) and more recently Lawry (2008) and Lassiter (2011). These models tend to be strongly interrelated, see Dubois and Prade (1997), and for graded adjectives a common formulation is in terms of an uncertain threshold value defined on a particular measurement scale (Cresswell 1976). Consider, for example, the adjective short defined on a height scale corresponding to the positive real numbers. As outlined in Sect. 1, a simple Boolean model is characterised by a threshold value \(\theta \), all heights below which are classified as being short. In the case of vague concepts it is then proposed that blurred category boundaries result from uncertainty about the exact value of \(\theta \). Lawry (2008) refers to this as semantic uncertainty and argues that it can be naturally quantified in terms of subjective probabilities. Both Lawry (2008) and Lassiter (2011) suggest that semantic uncertainty is a likely consequence of the empirical way in which language is acquired. In this paper we propose that it may also underlie stochastic assertion decisions which can play a positive role in communication scenarios where some form of aggregation is involved. For instance, suppose that for a witness in our robbery example her uncertainty about the threshold \(\theta \), defining the adjective short, is quantified by the probability density f (Footnote 2). The probability that this witness would classify a robber of height x metres as being short then corresponds to the probability that the threshold value \(\theta \) is at least x. This provides a natural definition for the membership degree of x in the category short as follows:

$$\mu _{short}(x)=P(\theta \ge x)=\int\limits_{x}^\infty f(\theta ) \, {\text{d}}\theta =1-F(x)$$

where F is the cumulative distribution function of f. Applying a stochastic assertion model the witness would then describe the robber as being short with probability \(\mu _{short}(x)\) and as being tall with probability \(\mu _{tall}(x)=1-\mu _{short}(x)\). In the following section we propose a simple stochastic communication channel involving vague labels defined in terms of uncertain thresholds. We show that in such a channel, by aggregating the varying signals from sufficiently many stochastic senders, a receiver can on average obtain a better estimate of the input being described, than by using an optimal Boolean model.
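The membership definition and the stochastic assertion rule can be illustrated with a short Python sketch. It assumes, purely for illustration, that the witness's threshold density f is a normal distribution with mean 1.70 m and standard deviation 0.05 m; these parameter values, the constant `THRESHOLD` and the function names are hypothetical and do not come from the text.

```python
import random
from statistics import NormalDist

# Hypothetical threshold distribution for "short": theta ~ Normal(1.70 m, 0.05 m).
# The parameter values are illustrative assumptions, not taken from the text.
THRESHOLD = NormalDist(mu=1.70, sigma=0.05)


def mu_short(x: float) -> float:
    """Membership of height x in 'short': P(theta >= x) = 1 - F(x)."""
    return 1.0 - THRESHOLD.cdf(x)


def mu_tall(x: float) -> float:
    """Membership of height x in 'tall', the complement of 'short'."""
    return 1.0 - mu_short(x)


def assert_label(x: float, rng: random.Random) -> str:
    """One witness's stochastic assertion: 'short' with probability mu_short(x)."""
    return "short" if rng.random() < mu_short(x) else "tall"


if __name__ == "__main__":
    rng = random.Random(0)
    for height in (1.60, 1.70, 1.80):
        labels = [assert_label(height, rng) for _ in range(1000)]
        share = labels.count("short") / len(labels)
        print(f"height={height:.2f}  mu_short={mu_short(height):.3f}  share={share:.3f}")
```

Under these assumptions the share of witnesses saying short tracks \(\mu _{short}(x)\), which is the information a receiver can exploit when aggregating several statements.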

The uncertain threshold model has clear similarities to the epistemic theory of vagueness as expounded by Williamson (1992, 1994), although there are also subtle but important differences. Williamson proposes that there is a precise but unknown, and possibly unknowable, boundary between the extension of a vague concept and that of its negation. From this perspective vagueness can be captured within the framework of classical logic, with properties such as the law of excluded middle and the law of non-contradiction being preserved. The model we propose, though sharing with the epistemic theory the basic premise that vagueness can be understood in terms of precise but uncertain boundaries, makes a fundamentally different assumption regarding the nature of these boundaries and how the uncertainty about them arises. In particular, the epistemic theory would seem to assume the existence of some objectively correct boundary threshold between, for example, short and not short. This assumption lies at the heart of one of the main criticisms of epistemicism in the literature, that it does not provide a satisfactory account of the relationship between the semantics and the use of language (Keefe and Smith 2002; Smith 2008). That is, it seems clear that the meanings of vague concepts are in large part determined by their use over time by a diverse population of communicators. But the role of the individual within the epistemic theory appears to be that of learning the meaning of already fixed boundaries, a task at which, according to Williamson (1994), they can only hope to have at best partial success. In contrast, following Lawry (2008) and D’Odorico and Bennett (2013), we propose a model in which individuals adopt an epistemic stance by assuming the existence of precise boundary thresholds about which they are uncertain, and where they quantify this uncertainty using probability.
However, the epistemic stance is understood to be a modelling assumption on the part of language users, and there is no implication that precise thresholds have an independent existence beyond the models. From this perspective there is a clear account of how language use determines semantics through an emergent process resulting from multiple interactions between individuals, each adopting the epistemic stance and updating their semantics by conditioning within a probabilistic representational model as outlined above. Indeed there is a growing literature on agent-based simulation studies in which simple probabilistic models of concepts are shown to converge across a population (Steels 1997; Steels and Belpaeme 2005; Eyre and Lawry 2014). Nonetheless, one might ask of such an approach, why do individuals choose to adopt the epistemic stance, as opposed to an alternative representational model, given that, as admitted, there is no claim as to the objective existence of precise boundaries? A pragmatic response would be to claim that, faced with the challenge of deciding what to assert and of interpreting the assertions of others in a variety of contexts, individuals simply find it useful as part of a decision-making and learning strategy to assume that there is a clear divide between those labels which are and those which are not appropriate to assert. This is consistent with Lassiter’s view that, rather than language being a simple precise entity, there are in fact a number of precise interpretations that can be employed in a given context (Lassiter 2011). Adopting a probabilistic approach in which individuals attempt to take account of their prior knowledge of language conventions and their models of other language users in order to choose between these various interpretations, could then be a natural way of bringing to bear already established tools for dealing with epistemic uncertainty when deciding between competing possible assertions.
In this paper we furthermore propose that probabilistic definitions can also be exploited by communicating agents as a mechanism for generating stochastic uncertainty which then has a positive role to play in the aggregation of information from different signals (Footnote 3).

We should note that for some, even this pragmatic epistemic approach to vagueness may still be unpalatable. Hence, in the context of the current paper it is worth pointing out that the stochastic channels proposed in the sequel are also relevant for probabilistic but non-epistemic theories and even for non-probabilistic degree-based treatments of vagueness. To make the case for the former we consider the non-epistemic probabilistic approaches to vagueness as proposed by Borel, discussed and developed by Egré and Barberousse (2014), Egré (2016), and Kamp (1975). Borel applies statistical methods after identifying two main sources of variation in the way that individuals apply vague terms. For example, suppose that a witness’ decision as to whether or not to describe the robber as short depends both on her perception of his height and on a precise threshold, with the term short being used provided that the former is less than the latter (Footnote 4). Variation in the responses of different witnesses then occurs as a result both of differences in their perceptions of the height of the robber and in the height thresholds they apply. Certainly the former would tend to naturally occur due to the inherent imperfection of human perception and other environmental effects e.g. differences in relative position, lighting etc. For the latter Egré (2016) suggests that decision thresholds might be based on a representative value for the reference or context class, so for the robber example this could be the mean height of UK males, but with subjective differences between individuals about how exactly the threshold is derived from this value. Given this set-up we can try to understand the use of the term short in this particular context by running a controlled experiment in which a sample of individuals (witnesses) are shown a number of suspects with varying heights and asked whether or not they would describe them as being short, yes or no.
From the resulting data statistical methods can then be employed so as to estimate a probability function quantifying the probability that the adjective short will be applied to describe someone of a given height. Now the fundamental difference between this statistical approach and the probabilistic model we have outlined above is that in Borel’s approach probability describes the macro-level use of vague predicates across a population, capturing natural variations between individuals (Footnote 5), whilst we have proposed that each individual adopts a probabilistic model when deciding whether or not a vague term can be applied. However, stochastic channels as described below are agnostic as to the exact source of the variations between senders. Indeed for Borel’s statistical model the main claim of our paper can be reformulated as follows: variation in the application of vague predicates, with certain overall probabilistic profiles, can be a positive benefit in multi-sender channels.

A different non-epistemic probabilistic approach is proposed by Kamp (1975) as an extension of the supervaluation theory of vagueness (Fine 1975) in which a probability measure is introduced to weight the different admissible precisifications of a predicate. The membership of an element in the extension of the predicate is then taken to be the measure of the set of precisifications that contain it. Now clearly this model can also act as a source of stochasticity if, for example, when deciding whether or not to describe the robber as short, each witness picks a precisification at random according to the probability weighting and then checks if the robber’s height is contained in the particular extension of short that they have chosen. Finally, a general degree-based view of vagueness defines the membership of the extension of a predicate as a function into [0, 1], but where there is no probabilistic interpretation of this membership function (Smith 2008). Even for this non-probabilistic model, stochastic channels can still be relevant provided that assertion decisions are made by employing a threshold on membership functions. For example, a witness will assert that ‘the robber is short’ provided that the robber’s value in the witness’ membership function for short exceeds some threshold \(\theta \). If \(\theta \) is chosen stochastically then the type of signal aggregation proposed below can still be applied, but where the information conveyed over the channel relates to the robber’s membership value in short rather than to a direct estimate of their height.

An outline of the remainder of the paper is as follows: Sect. 3 introduces the optimal Boolean binary channel as well as a simple vague channel involving the aggregation of stochastic signals. Section 4 then compares these two channels in terms of the expected squared error between the actual input and the receiver’s estimate of it, under the assumption that inputs are uniformly distributed on [0, 1]. In Sect. 5 we consider both Boolean and vague channels involving multiple labels. Section 6 investigates the robustness of vague channels to transmission error. In Sect. 7 we consider the situation in which the input distribution is unknown so that the channels cannot be optimised for a particular prior. In particular, we compare how both channels perform under a range of different input distributions. Section 8 considers optimal vague channels for different numbers of senders and, in particular, will show that S-curve membership functions perform well for channels with relatively low numbers of senders. Finally, in Sect. 9 we give some discussion and conclusions.

3 Boolean and Vague Channels

We now introduce a simple model of binary communication involving aggregation, as exemplified by the robber story from Sect. 1. An input value x is drawn at random from the normalised scale [0, 1] according to a uniform distribution. Each of a number of senders then selects a label from the message set \({\mathcal{M}}=\{L_1,L_2\}\) which they judge to be an appropriate description of x, and transmits this to a single receiver. The receiver then aggregates these signals in order to determine an estimate y of the value of x. We assume that all agents, senders and receiver, share the same definition of the labels in \({\mathcal{M}}\). Furthermore, we adopt Grice’s assumption (1975) that all the senders aim to describe x in such a way as to enable the receiver to determine the best possible estimate. We now consider the two cases in which the labels in \({\mathcal{M}}\) are defined according to the standard Boolean model and according to an uncertain threshold based vague model.

3.1 The Optimal Boolean Channel

For binary Boolean channels we adopt a general fixed threshold model in which \(L_1\) corresponds to the interval \([0,\theta )\) and \(L_2\) to \([\theta ,1]\), for some threshold value \(\theta \) in [0, 1]. That is, any value \(x<\theta \) is always described as \(L_1\) and any \(x \ge \theta \) is always described as \(L_2\). As discussed in Sect. 1, in such cases the receiver does not benefit from multiple signals since, given a shared Boolean model, all senders will assert identical descriptions of x (Footnote 6). Consequently we can simplify any such Boolean channel so as to consist of only one sender S and a receiver R. The sender transmits either a 0 (i.e. \(S=0\)) to stand for \(L_1\) or a 1 (i.e. \(S=1\)) to stand for \(L_2\). The receiver then estimates x to be \(y_0\), a typical \(L_1\) value, if they receive a 0 and to be \(y_1\), a typical \(L_2\) value, if they receive a 1. Assuming that x is uniformly distributed on [0, 1] we can measure the accuracy of this channel by evaluating the expected value of \((x-y)^2\), which we denote by \({\mathbb{E}}^{\mathbf{B}}((x-y)^2)\). Unsurprisingly this value is minimal when \(\theta ={\frac{1}{2}}, y_0={\frac{1}{4}}\) and \(y_1={\frac{3}{4}}\).

Theorem 1

For a Boolean channel, if \(L_1\) is defined as the interval \([0,\theta )\) and \(L_2\) as the interval \([\theta ,1]\) and

$$\begin{aligned} y={\left\{ \begin{array}{ll}y_0:R=0\\ y_1:R=1 \end{array}\right.}, \end{aligned}$$

and assuming that x is uniformly distributed on [0, 1], then \({\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) is minimal when \(\theta ={\frac{1}{2}}, y_0={\frac{1}{4}}\) and \(y_1={\frac{3}{4}}\).
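Theorem 1 can be checked numerically. The sketch below evaluates the expected squared error of a binary Boolean channel exactly, using the antiderivative \((x-c)^3/3\) of \((x-c)^2\), and then searches a coarse grid of candidate values. The function names and the grid resolution are illustrative choices, and a grid search is only a sanity check, not a proof.

```python
from fractions import Fraction


def boolean_error(theta, y0, y1):
    """Exact E[(x - y)^2] for x ~ Uniform[0, 1], with y = y0 if x < theta else y1."""
    theta, y0, y1 = Fraction(theta), Fraction(y0), Fraction(y1)
    left = ((theta - y0) ** 3 + y0 ** 3) / 3          # integral over [0, theta)
    right = ((1 - y1) ** 3 - (theta - y1) ** 3) / 3   # integral over [theta, 1]
    return left + right


def best_on_grid(steps=20):
    """Exhaustive search over theta, y0, y1 in {0, 1/steps, ..., 1}.

    Returns the tuple (error, theta, y0, y1) with the smallest error.
    """
    grid = [Fraction(i, steps) for i in range(steps + 1)]
    return min(
        (boolean_error(t, a, b), t, a, b)
        for t in grid for a in grid for b in grid
    )


if __name__ == "__main__":
    print(best_on_grid())
```

On a grid with step 1/20 the search returns the threshold 1/2 with estimates 1/4 and 3/4, in agreement with the theorem.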

3.2 A Multiple Sender Vague Channel

We now propose a multiple sender vague channel in which signals from a number of stochastic senders are aggregated by a receiver so as to estimate the input variable. In contrast to the Boolean channel, in the vague channel all senders and the receiver adopt a probabilistic interpretation of the labels in \({\mathcal{M}}\) as described in Sect. 2. More formally, there are \(n+1\) agents corresponding to n senders \(S_1, \ldots , S_n\) and a receiver R. Given the same input \(x \in [0,1]\) each sender independently selects a message from the set \({\mathcal{M}}=\{L_1,L_2\}\) and transmits either a 0 (i.e. \(S_j=0\)) standing for \(L_1\) or a 1 (i.e. \(S_j=1\)) standing for \(L_2\). All agents adopt the same shared probabilistic definition of \({\mathcal{M}}\) in which \(L_1\) is \([0,\theta )\) and \(L_2\) is \([\theta ,1]\) and where \(\theta \) is an uncertain threshold which we assume to be uniformly distributed on [0, 1] (Footnote 7). This results in the membership functions \(\mu _{L_1}(x)=1-x\) and \(\mu _{L_2}(x)=x\). We then assume that for each sender \(S_j\) the choice of signal, either 0 or 1, is stochastic with \(P(S_j=0|x)=\mu _{L_1}(x)=1-x\), and \(P(S_j=1|x)=\mu _{L_2}(x)=x\) (see Fig. 1). R receives an n-bit sequence of 1’s and 0’s from the different senders, where \(R_j\) denotes the signal received from sender \(S_j\). R then aggregates these signals in order to obtain an estimate y of the input x (see Fig. 2). We initially adopt the simple frequency estimator:

$$y={\frac{T}{n}} \quad {\text{where}} \;\; T=\sum _{j=1}^n R_j$$

Fig. 1 Probabilities for sending a 0 or a 1 given x, derived from a vague definition of labels \(L_1\) and \(L_2\)

Fig. 2 A multiple sender vague channel
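The channel of this section is simple enough to simulate directly. The following Python sketch is one possible rendering, assuming error free transmission so that \(R_j=S_j\); the function names are illustrative only.

```python
import random


def send(x, rng):
    """One sender: transmit 1 (label L2) with probability x, otherwise 0 (label L1)."""
    return 1 if rng.random() < x else 0


def receive(signals):
    """Frequency estimator y = T / n, where T is the number of 1s received."""
    return sum(signals) / len(signals)


def vague_channel(x, n, rng):
    """Pass input x through n independent stochastic senders and aggregate."""
    return receive([send(x, rng) for _ in range(n)])


def mean_estimate(x, n, trials, rng):
    """Average of the receiver's estimate over repeated uses of the channel."""
    return sum(vague_channel(x, n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for x in (0.1, 0.5, 0.9):
        print(x, mean_estimate(x, n=8, trials=10000, rng=rng))
```

Averaged over many trials the estimate settles close to x, reflecting the fact that T is binomial with parameters n and x.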

4 A Comparison of Boolean and Vague Binary Channels

Assuming that x is uniformly distributed on [0, 1] we can use elementary statistics to evaluate the expected squared error for the vague channel described in Sect. 3.2, denoted \({\mathbb{E}}^{\mathbf{V}}((x - y)^2)\), as follows:

$${\mathbb{E}}^{\mathbf{V}}((x - y)^2)= \int\limits_0^1 {\mathbb{E}}^{\mathbf{V}}((x - y)^2|x) \, {\text{d}}x$$

Given input x, T is distributed according to a binomial distribution with parameters n and x. Hence, \({\mathbb{E}}(T|x)=n x\) and \({\mathbb{E}}^{\mathbf{V}}(y|x)=x\). Therefore,

$${\mathbb{E}}^{\mathbf{V}}(({\mathbb{E}}^{\mathbf{V}}(y|x) - y)^2|x )= {\mathbb{V}}^{\mathbf{V}}(y|x)={\mathbb{V}}\left( {\frac{T}{n}}|x\right) = {\frac{1}{n^2}} {\mathbb{V}}(T|x)={\frac{x(1-x)}{n}}$$

From this we obtain that:

$${\mathbb{E}}^{\mathbf{V}}((x - y)^2) =\int\limits_0^1 {\frac{x(1-x)}{n}}\, {\text{d}}x={\frac{1}{6n}}$$

For the optimal Boolean channel we have instead that the expected squared error is given by:

$${\mathbb{E}}^{\mathbf{B}}((x-y)^2)=\int\limits_0^{\frac{1}{2}} \left( x-{\frac{1}{4}}\right) ^2 \, {\text{d}}x +\int\limits_{\frac{1}{2}}^1 \left( x-{\frac{3}{4}}\right) ^2 \, {\text{d}}x={\frac{1}{48}}$$

Now trivially, \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)\) is a strictly decreasing function of n (see Fig. 3) and hence \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) provided that \(n \ge 8\).
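The comparison can be reproduced in a few lines of Python. The sketch uses exact rational arithmetic for the two closed forms above and adds a Monte Carlo estimate as an independent check; the function names, trial count and seed are arbitrary illustrative choices.

```python
import random
from fractions import Fraction

BOOLEAN_ERROR = Fraction(1, 48)   # optimal Boolean channel, from the integral above


def vague_error(n):
    """Exact expected squared error 1/(6n) of the n-sender linear vague channel."""
    return Fraction(1, 6 * n)


def min_senders():
    """Smallest n for which the vague channel is at least as accurate."""
    n = 1
    while vague_error(n) > BOOLEAN_ERROR:
        n += 1
    return n


def simulated_vague_error(n, trials, rng):
    """Monte Carlo estimate of E[(x - y)^2] with x ~ Uniform[0, 1] and y = T/n."""
    total = 0.0
    for _ in range(trials):
        x = rng.random()
        t = sum(1 for _ in range(n) if rng.random() < x)
        total += (x - t / n) ** 2
    return total / trials


if __name__ == "__main__":
    print("minimum senders:", min_senders())
    print("simulated error, n=8:", simulated_vague_error(8, 50000, random.Random(0)))
    print("Boolean error:", float(BOOLEAN_ERROR))
```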

Fig. 3 \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)\) and \({\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) as functions of the number of senders n

At this point we might be tempted to argue that a lower bound of 8 on the required number of senders does not make a strong case for the utility of vagueness in communication. After all how often do we have the luxury of aggregating assertions from that many different independent sources? However, note that we have not yet attempted to optimise the vague channel as we have done for the Boolean channel. We return to this issue in Sect. 8 where we show that there are vague channels that outperform the optimal Boolean channel when there are 2 or more senders. Initially, however, we investigate the behaviour of the linear vague channel described above (Fig. 1) as the number of labels in \({\mathcal{M}}\) increases and in this case show that the number of senders required to outperform the Boolean channel also decreases significantly. Furthermore, we then consider the robustness of the vague channel to noise and to ignorance about the underlying distribution on x.

5 Multiple Labels Channels

In this section we consider channels in which there are multiple labels so that \({\mathcal{M}}=\{L_1, \ldots , L_k\}\) for \(k \ge 2\). We are thinking of these labels as representing higher granularity descriptions of values on some common underlying scale. For example, instead of simply describing the robber as being either short or tall, witnesses might instead choose between the three labels; short, medium and tall, or perhaps between the five labels; very short, short, medium, tall and very tall. As the number of labels increases then each label refers to a more and more specific range on the scale.

We assume that Boolean labels are defined in terms of \(k+1\) fixed threshold values \(0=\theta _0 \le \theta _1 \le \ldots \le \theta _{k-1} \le \theta _k=1\) such that the label \(L_i\) corresponds to the interval \([\theta _{i-1},\theta _i)\) for \(i=1, \ldots , k-1\) and \(L_k\) corresponds to \([\theta _{k-1},\theta _k]\). As in Sect. 3.1, the Boolean nature of this channel and the fact that the same label definitions are shared by all agents mean that we need only assume one sender and a receiver. The sender transmits a value in \(\{0, \ldots , k-1\}\), where \(S=i-1\) stands for \(L_i\); on receiving \(i-1\), the receiver estimates the value of x to be a typical value of \(L_i\), denoted by \(y_{i-1}\). This form of channel fits within the general framework of quantization in multi-sensor platforms proposed by Gubner (1993). Gubner’s model is more general in that, for example, it allows for different sensor readings from the different senders resulting from sensory noise and other environmental variations. From the following theorem we see that the expected squared error for this channel is minimal when the threshold values are regularly spaced between 0 and 1 and the typical values are the midpoints of each interval.

Theorem 2

For a Boolean channel, if \(L_i\) is defined as the interval \([\theta _{i-1},\theta _i)\) for \(i=1, \ldots , k-1\) and \(L_k\) is defined as \([\theta _{k-1},\theta _k]\) where \(0=\theta _0< \theta _1< \ldots< \theta _{k-1} <\theta _k=1,\) and

$$y=\left\{ y_i:R=i \quad {\text{for}}\; i=0, \ldots , k-1 \right.$$

then, assuming that x is uniformly distributed on \([0,1]\), \({\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) is minimal when \(\theta _i={\frac{i}{k}}\) and \(y_{i-1}={\frac{\theta _{i-1}+\theta _i}{2}}\) for \(i=1, \ldots , k\).
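The regularly spaced channel of Theorem 2 can be evaluated exactly. In the sketch below the helper names are illustrative, and the value \(1/(12k^2)\) mentioned afterwards is a reconstruction from the definitions (k intervals of width 1/k, each contributing \(w^3/12\) with \(w=1/k\)) rather than a formula quoted from the text.

```python
from fractions import Fraction


def boolean_k_error(thresholds, estimates):
    """Exact E[(x - y)^2] for x ~ Uniform[0, 1] in a k-label Boolean channel.

    thresholds = [theta_0 = 0, ..., theta_k = 1]; estimates = [y_0, ..., y_{k-1}],
    where y_{i-1} is the receiver's estimate for the interval [theta_{i-1}, theta_i).
    """
    total = Fraction(0)
    for lo, hi, y in zip(thresholds, thresholds[1:], estimates):
        lo, hi, y = Fraction(lo), Fraction(hi), Fraction(y)
        total += ((hi - y) ** 3 - (lo - y) ** 3) / 3   # integral of (x - y)^2 on [lo, hi]
    return total


def regular_channel(k):
    """Equally spaced thresholds i/k with interval midpoints as estimates."""
    thresholds = [Fraction(i, k) for i in range(k + 1)]
    estimates = [(a + b) / 2 for a, b in zip(thresholds, thresholds[1:])]
    return thresholds, estimates


if __name__ == "__main__":
    for k in range(2, 7):
        print(k, boolean_k_error(*regular_channel(k)))
```

For k = 2 this recovers the value 1/48 obtained in Sect. 4, and in general the regular channel gives \(1/(12k^2)\).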

For the vague channels with multiple labels we assume that the label \(L_i\) corresponds to the interval \([\theta _{i-1},\theta _i)\) for \(i=1, \ldots , k-1\) but where each of the thresholds is uncertain (Footnote 8). There are many possible joint distributions on these \(k-1\) thresholds satisfying the constraints that \(\theta _{i-1}<\theta _i\), but here we adopt a simple formulation in which \(\theta _i=\theta +{\frac{i-1}{k-1}}\) for \(i=1, \ldots , k-1\), where the parameter \(\theta \) is uniformly distributed on the interval \((0,{\frac{1}{k-1}})\). The memberships for the labels are then as follows (see Fig. 4):

$$\begin{aligned} \mu _{L_i}(x)={\left\{ \begin{array}{ll} (k-1)x-(i-2): x \in \left[{\frac{i-2}{k-1}},{\frac{i-1}{k-1}}\right)\\ i-(k-1)x : x \in \left[{\frac{i-1}{k-1}},{\frac{i}{k-1}}\right) \\ 0:{\text{otherwise}} \end{array}\right. }\quad {\text{for}}\; i=1, \ldots , k \end{aligned}$$

Each of the n senders then stochastically transmits a value from \(\{0, \ldots , k-1\}\) where \(P(S_j=i-1|x)=\mu _{L_i}(x)\). R then receives an n-length sequence of numbers from \(\{0, \ldots , k-1\}\) which they aggregate using the frequency estimator:

$$y={\frac{T}{n(k-1)}} \quad {\text{where}}\;\; T=\sum _{j=1}^n R_j$$
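A minimal Python sketch of the k-label vague channel follows. It takes the membership functions to be triangular with peak at \((i-1)/(k-1)\), writing the rising edge as \((k-1)x-(i-2)\), which is what the threshold construction \(\theta _i=\theta +{\frac{i-1}{k-1}}\) yields and which makes the memberships sum to one at every x. Function names are illustrative and transmission is assumed error free.

```python
import random


def membership(i, x, k):
    """mu_{L_i}(x) for labels i = 1..k on [0, 1]: triangular, peak at (i-1)/(k-1)."""
    s = (k - 1) * x              # position measured in units of 1/(k-1)
    if i - 2 <= s < i - 1:
        return s - (i - 2)       # rising edge
    if i - 1 <= s < i:
        return i - s             # falling edge
    return 0.0


def send(x, k, rng):
    """One sender: transmit i-1 with probability mu_{L_i}(x)."""
    weights = [membership(i, x, k) for i in range(1, k + 1)]
    return rng.choices(range(k), weights=weights)[0]


def receive(signals, k):
    """Frequency estimator y = T / (n (k - 1)) with T the sum of received values."""
    return sum(signals) / (len(signals) * (k - 1))


def vague_channel(x, n, k, rng):
    """Pass input x through n independent k-label senders and aggregate."""
    return receive([send(x, k, rng) for _ in range(n)], k)


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 10000
    for x in (0.2, 0.5, 0.85):
        mean = sum(vague_channel(x, 3, 5, rng) for _ in range(trials)) / trials
        print(x, round(mean, 3))
```

At any x at most two adjacent labels have non-zero membership, so each sender effectively chooses between two neighbouring values.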

This form of multiple label linear vague channel is a special case of the model of probabilistic quantization proposed by Xiao et al. (2006), who determine an upper bound on estimation error for a sensor fusion platform employing linear probabilistic quantization, assuming that each sender is prone to independent noise drawn from a distribution with mean zero and a known standard deviation. Here, however, we focus on a direct comparison between stochastic channels of this kind and the optimal Boolean channel. The following results show that the minimal number of senders required for the vague channel with multiple labels to be on average at least as accurate as the comparable Boolean channel is a non-increasing function of the number of labels k (see Fig. 5). The underlying bound \({\frac{2k^2}{(k-1)^2}}\) is strictly greater than 2 for all k and tends to 2 in the limit as k tends to infinity, so at least 3 senders are always required. In fact, for channels with 6 or more labels exactly 3 senders suffice for the vague channel to be at least as accurate as the Boolean channel.

Fig. 4 Definition of vague and Boolean labels for a k label channel

Fig. 5 The minimum number of senders n required such that \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) plotted as a function of the number of labels k

Lemma 3

Let \(n_i=|\{j:S_j=i\}|\) for \(i=0, \ldots , k-1.\) If \(x \in [{\frac{i-1}{k-1}},{\frac{i}{k-1}})\) then

$$y={\frac{{\frac{n_i}{n}}+i-1}{k-1}} ={\frac{{\frac{n_i}{n_{i-1}+n_i}}+i-1}{k-1}}$$

Furthermore, \({\mathbb{E}}^{\mathbf{V}}(y|x)=x.\)

Theorem 4

If x is uniformly distributed on [0, 1] then \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) if and only if \(n \ge \left\lceil {\frac{2k^2}{(k-1)^2}}\right\rceil\).Footnote 9
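The bound in Theorem 4 can be tabulated with exact rational arithmetic. The sketch below is illustrative only. The closed-form errors \({\frac{1}{6n(k-1)^2}}\) and \({\frac{1}{12k^2}}\) are our own working under the assumptions that x is uniform on [0, 1] and that the comparable Boolean channel splits [0, 1] into k equal intervals and estimates their midpoints; they are consistent with the theorem's bound but are not quoted from it.

```python
from fractions import Fraction
from math import ceil

def vague_error(n, k):
    """Expected squared error of the linear k-label vague channel (assumed form)."""
    return Fraction(1, 6 * n * (k - 1) ** 2)

def boolean_error(k):
    """Expected squared error of a Boolean channel with k equal-width labels (assumed form)."""
    return Fraction(1, 12 * k ** 2)

def min_senders(k):
    """Theorem 4: least n with vague error <= Boolean error."""
    return ceil(Fraction(2 * k ** 2, (k - 1) ** 2))

for k in range(2, 9):
    n = min_senders(k)
    # n senders suffice, and n - 1 senders do not.
    assert vague_error(n, k) <= boolean_error(k) < vague_error(n - 1, k)
    print(k, n)
```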

6 Robustness to Errors

It is commonly argued that systems which employ categories with fuzzy or blurred boundaries are inherently tolerant of errors due to the gradedness of category membership.Footnote 10 In our context we now investigate how tolerant binary vague channels are to transmission errors, i.e. when \(S_j \ne R_j\). For example, such errors could be due to the receiver mishearing the speaker in a noisy environment, or in our robbery example, information from a witness being misreported or misrecorded. Throughout this analysis we will compare the expected squared error of the vague channel to that of the error free optimal Boolean channel. For vague channels we consider the simple case in which there is a fixed probability \(\alpha \) of an error occurring for each of the n channels to be aggregated. In other words:

$$P(R_j=1|S_j=0)=P(R_j=0|S_j=1)=\alpha \quad {\text{for}}\;\; j=1, \ldots , n$$

The following result shows that, provided the transmission error probability \(\alpha \) is less than \({\frac{1}{4}}\), vague channels can compensate for errors by increasing the number of senders so as to still perform as well as the error free Boolean channel (see Fig. 6). As \(\alpha \) tends to \({\frac{1}{4}}\) from below this minimum number of required senders tends to infinity. However, compensating for a 10% error rate, for example, requires only a relatively modest increase from 8 to 12 senders. Indeed, for small error probabilities of up to 0.045 only one additional sender is needed.

Fig. 6 The minimum value of n such that \({\mathbb{E}}^{\mathbf{V}}((x-y)^2|\alpha ) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) plotted as a function of the channel error probability \(\alpha \). a \(\alpha \) ranging from 0 to \({\frac{1}{4}}\). b \(\alpha \) ranging from 0 to 0.1

Theorem 5

If x is uniformly distributed on [0, 1], and \(\alpha <{\frac{1}{4}}\) then \({\mathbb{E}}^{\mathbf{V}}((x-y)^2|\alpha ) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) if and only if \(n \ge \left\lceil {\frac{8(2\alpha -2\alpha ^2+1)}{1-16\alpha ^2}}\right\rceil \). If \(\alpha \ge {\frac{1}{4}}\) then \({\mathbb{E}}^{\mathbf{V}}((x-y)^2|\alpha ) \ge {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) for all \(n \ge 1.\)
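As a quick illustration of Theorem 5, the bound can be evaluated exactly for a few error rates. This is a sketch only: the function name `min_senders_with_noise` is ours, and error rates are passed as decimal strings so that they convert to exact fractions.

```python
from fractions import Fraction
from math import ceil

def min_senders_with_noise(alpha):
    """Theorem 5 bound: least n for the noisy vague channel to match the
    error free Boolean channel, or None when alpha >= 1/4."""
    a = Fraction(alpha)
    if a >= Fraction(1, 4):
        return None  # no number of senders is enough
    return ceil(8 * (2 * a - 2 * a ** 2 + 1) / (1 - 16 * a ** 2))

for alpha in ("0", "0.045", "0.1", "0.2", "0.24"):
    print(alpha, min_senders_with_noise(alpha))
```

Setting \(\alpha =0\) recovers the error free binary case, and the growth as \(\alpha \) approaches \({\frac{1}{4}}\) comes from the factor \(1-16\alpha ^2\) in the denominator.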

7 Robustness to Ignorance

In the previous sections we have assumed that the distribution of the inputs x is known to be uniform on [0, 1]. Instead, we now consider the situation in which the distribution on inputs is unknown prior to communication so that it is not possible to a priori optimise the design of the channels in order to minimize expected squared error.Footnote 11 In the face of such ignorance we assess how the Boolean and vague channels introduced in Sect. 3 perform in different possible realities, i.e. given different distributions on x. In the first instance we suppose that the world turns out to be such that inputs are symmetrically distributed about \({\frac{1}{2}}\). To model this scenario we evaluate the expected squared error for both channels assuming that x is distributed according to a symmetric beta distribution with parameter s, i.e. with density function \({\frac{x^{s-1}(1-x)^{s-1}}{\beta (s,s)}}\) (see Fig. 7). The following result gives an expression for the minimal number of senders required for the vague channel to be at least as accurate as the Boolean channel as a function of the symmetric beta distribution parameter s. In the limit as s tends to infinity the required number of senders tends to 4. Furthermore, from Fig. 8 we can see that across all s the maximal number of required senders is 11. In other words, provided that the vague channel has at least 11 senders, we can be sure that it will be at least as accurate as the Boolean channel no matter what value of s characterises the true input distribution.

Fig. 7 Density functions for symmetric beta distributions with \(s=0.2, s=0.5, s=1, s=2\) and \(s=5\)

Fig. 8 The minimum number of senders n required such that \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\), assuming that x is distributed according to a symmetric beta distribution with parameter s, plotted as a function of s

Theorem 6

If x is distributed according to a symmetric beta distribution with parameter \(s>0,\) then \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) if and only if

$$n \ge \left\lceil {\frac{8 s^2 \beta (s,s)}{s(2s+5)\beta (s,s)-\left({\frac{1}{2}}\right)^{2s+1}16(2s+1)}}\right\rceil$$
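The expression in Theorem 6 is easy to evaluate numerically. The following is a minimal sketch using only the standard library; the helper names `beta_fn` and `min_senders_symmetric` are ours, and floating point is used, so values of s for which the bound sits extremely close to an integer could in principle round the wrong way.

```python
from math import ceil, exp, lgamma

def beta_fn(a, b):
    """Beta function B(a, b), computed via log-gamma."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def min_senders_symmetric(s):
    """Theorem 6 bound for a symmetric beta(s, s) input distribution."""
    b = beta_fn(s, s)
    numerator = 8 * s ** 2 * b
    denominator = s * (2 * s + 5) * b - 0.5 ** (2 * s + 1) * 16 * (2 * s + 1)
    return ceil(numerator / denominator)

# s = 1 is the uniform distribution, so the binary-channel value is recovered.
for s in (0.2, 0.5, 1, 2, 5, 50):
    print(s, min_senders_symmetric(s))
```

Scanning a grid of s values with this function is one way to reproduce the shape of the curve described for Fig. 8.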

The assumption that inputs will turn out to be symmetrically distributed is of course a strong one, and may well be unrealistic. In order to investigate asymmetric input distributions we now evaluate the expected squared error for both channels assuming that inputs follow a general beta distribution with parameters s and t, i.e. with density function \({\frac{x^{s-1}(1-x)^{t-1}}{\beta (s,t)}}\). The following result gives an expression for the minimum number of senders required for the vague channel under this distribution, as a function of the beta parameters s and t. From this we can infer that no matter what values of s and t characterise the actual distribution of inputs there is always a minimum number of senders for which the vague channel is at least as accurate as the Boolean channel. Unfortunately, this minimal number of senders is unbounded as s and t vary. To see this consider the case where \(s=3t\). Figure 9b shows the beta density functions in this case for different values of t, all of which have an expected value of \({\frac{s}{s+t}}={\frac{3}{4}}\). Furthermore, as t increases these density functions become increasingly peaked at \({\frac{3}{4}}\). Now clearly the Boolean channel will tend to be well suited to any such reality since the sender would be highly likely to transmit a 1, given which the receiver will estimate the value \(y_1={\frac{3}{4}}\). Indeed Fig. 9a suggests that the minimum number of senders required for the vague channel given this family of skewed distributions is an unbounded strictly increasing function of t.

Fig. 9 The case in which x is distributed according to an asymmetric beta distribution with parameters 3t and t. a The minimum value for n for which \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)\le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\). b Beta distribution with parameters 3t and t for \(t=10, t=50\) and \(t=100\)

Theorem 7

If x is distributed according to a beta distribution with parameters \(s>0\) and \(t>0,\) then \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) if and only if

$$n \ge \left\lceil {\frac{16 \beta (s,t) st}{8 \beta \left( {\frac{1}{2}};s,t\right) (s-t) (s+t+1)-16\left( {\frac{1}{2}}\right) ^{s+t}(s+t+1) +(9t^2+9t-6st+s^2+s)\beta (s,t)}}\right\rceil$$
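Rather than coding the closed form in Theorem 7, the comparison can also be checked by integrating the two expected squared errors directly. The sketch below does this under assumptions that should be read as ours: the binary vague channel uses the frequency estimator, so its error is \({\mathbb{E}}(x(1-x))/n\); the Boolean channel is the one designed for uniform inputs, with threshold \({\frac{1}{2}}\) and estimates \({\frac{1}{4}}\) and \({\frac{3}{4}}\); and \(s, t \ge 1\) so that the density is bounded. All function names are hypothetical.

```python
from math import ceil, exp, lgamma

def beta_pdf(x, s, t):
    """Density of the beta(s, t) distribution on [0, 1], for s, t >= 1."""
    log_norm = lgamma(s + t) - lgamma(s) - lgamma(t)
    return exp(log_norm) * x ** (s - 1) * (1 - x) ** (t - 1)

def simpson(f, a, b, m=4000):
    """Composite Simpson rule with m (even) sub-intervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for j in range(1, m):
        total += (4 if j % 2 else 2) * f(a + j * h)
    return total * h / 3

def errors(s, t):
    """Return (E[x(1-x)], Boolean expected squared error) under beta(s, t)."""
    spread = s * t / ((s + t) * (s + t + 1))  # closed form for E[x(1-x)]
    low = simpson(lambda x: (x - 0.25) ** 2 * beta_pdf(x, s, t), 0.0, 0.5)
    high = simpson(lambda x: (x - 0.75) ** 2 * beta_pdf(x, s, t), 0.5, 1.0)
    return spread, low + high

def min_senders_beta(s, t):
    """Least n with E[x(1-x)] / n <= Boolean error (numerical, not exact)."""
    spread, boolean = errors(s, t)
    return ceil(spread / boolean)

# A hypothetical family with mean s/(s+t) = 3/4, increasingly peaked as t grows.
for t in (1, 10, 50):
    print(t, min_senders_beta(3 * t, t))
```

Because the result comes from numerical integration, parameter values for which the ratio is very close to an integer may be rounded up by one.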

8 Optimal Vague Channels

Up to this point we have focused on comparing a simple linear vague channel with the optimal Boolean channel for languages of different sizes, as well as under channel noise and when both senders and receivers are ignorant about the underlying distribution of the input values. In this section we investigate the optimal vague channel for a fixed number of senders. To make a precise comparison between the optimal vague and Boolean channel we initially need to clarify what exactly we mean by vague channel in this more general context. From the discussion of the threshold model of vagueness in Sect. 2, we consider the labels \(L_1=[0,\theta )\) and \(L_2=[\theta ,1]\) where \(\theta \) is a random variable with probability density function f and associated cumulative distribution F. We then have that \(S_j\) sends a 0 or 1 according to the generator function F as follows:

$$P(S_j=0|x)=P(x < \theta )=1-F(x) \quad {\text{and}}\quad P(S_j=1|x)=P(x \ge \theta )=F(x)$$

Now if we allow for the possibility that \(f(x)=\delta (x-{\frac{1}{2}})\), i.e. the Dirac delta function at \({\frac{1}{2}}\), then this class of channels will also include the optimal Boolean channel. Hence, to make a clear distinction between vague and Boolean channels we insist that for vague channels f is a continuous function on [0, 1]. Given this requirement it follows that for channels with only one sender all vague channels have a strictly higher expected error than the optimal Boolean channel.

Theorem 8

There is no vague channel with only one sender such that \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\).

In contrast, for \(n \ge 2\) it is always possible to find a vague channel of this more general form which outperforms the optimal Boolean channel. However, the optimal distribution on the threshold \(\theta \) will be different for different numbers of senders. To see this, consider a vague channel with n senders and threshold cumulative distribution F. Assuming that x is uniformly distributed on [0, 1], the error minimizing estimator of x from T is given by:

$$y={\mathbb{E}}(x|T)= \int\limits_0^1 x P(x|T) \, {\text{d}}x = {\frac{\int _0^1 x P(T|x) \, {\text{d}}x}{\int _0^1 P(T|x) \, {\text{d}}x}}= {\frac{\int _0^1 x F(x)^T(1-F(x))^{n-T} \, {\text{d}}x}{\int _0^1 F(x)^T(1-F(x))^{n-T} \, {\text{d}}x}}$$
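For a generator function F without a convenient closed form, the two integrals in this estimator can be approximated numerically. The sketch below is one way to do so, assuming a uniform prior on x and using a composite Simpson rule; the function name `posterior_mean` and the grid size `m` are our choices, and any F mapping [0, 1] into [0, 1] can be passed in.

```python
def posterior_mean(T, n, F, m=2000):
    """Approximate y = E(x | T) for n senders with generator function F.

    Uses composite Simpson integration on [0, 1] with m (even) sub-intervals.
    """
    def likelihood(x):
        p = F(x)
        return p ** T * (1 - p) ** (n - T)

    h = 1.0 / m
    num = den = 0.0
    for j in range(m + 1):
        x = j * h
        w = 1 if j in (0, m) else (4 if j % 2 else 2)  # Simpson weights
        value = likelihood(x)
        num += w * x * value
        den += w * value
    return num / den  # the common factor h/3 cancels

def linear(x):
    """Generator function of the linear vague channel, F(x) = x."""
    return x

print([round(posterior_mean(T, 3, linear), 4) for T in range(4)])
```

The binomial coefficient in \(P(T|x)\) does not depend on x, so it cancels between numerator and denominator and is omitted.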

For example, if \(\theta \) is uniformly distributed as in Sect. 3 then the error minimizing estimator of x corresponds to Laplace’s rule so that \(y={\frac{T+1}{n+2}}\). In this case we obtain that \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)={\frac{1}{6(n+2)}}\) and hence, by using this estimator in place of the frequency \(y={\frac{T}{n}}\) the minimum number of senders for which \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) \le {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) decreases from 8 to 6. More generally, we can also consider optimising the choice of threshold distribution F so as to minimise the expected error of the vague channel when applying the error minimizing estimator of x. Here we consider a parametrised family of density functions f in the form of normal distributions with mean \({\frac{1}{2}}\) and standard deviation \(\sigma \), normalised so that all values of \(\theta \) are between 0 and 1. In this case the cumulative distribution F has the following form:

$$F(x)={\frac{1}{2}}\left( 1+{\text{erf}}\left( {\frac{x-{\frac{1}{2}}}{\sigma {\sqrt{2}}}}\right) \right) +\left( x-{\frac{1}{2}}\right) \left( 1+{\text{erf}}\left( {\frac{-{\frac{1}{2}}}{\sigma {\sqrt{2}}}}\right) \right)$$
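The following sketch evaluates this cumulative distribution with the standard library's `erf`. It reads the formula as the normal distribution function plus a linear correction, which amounts to spreading the probability mass that falls outside [0, 1] uniformly over [0, 1]; that interpretation, and the function name `threshold_cdf`, are ours. With this reading \(F(0)=0\), \(F({\frac{1}{2}})={\frac{1}{2}}\) and \(F(1)=1\) for every \(\sigma >0\).

```python
from math import erf, sqrt

def threshold_cdf(x, sigma):
    """F(x) for a normal threshold density with mean 1/2 and s.d. sigma,
    normalised so that the threshold lies in [0, 1]."""
    normal_part = 0.5 * (1 + erf((x - 0.5) / (sigma * sqrt(2))))
    tail_mass = 1 + erf(-0.5 / (sigma * sqrt(2)))  # total mass outside [0, 1]
    return normal_part + (x - 0.5) * tail_mass

for sigma in (0.01, 0.075, 0.5, 100.0):
    print(sigma, [round(threshold_cdf(x, sigma), 4) for x in (0.0, 0.3, 0.5, 0.7, 1.0)])
```

Small values of `sigma` give a curve close to a step at \({\frac{1}{2}}\), while large values give a curve close to the identity.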

Here we can view \(\sigma \) as a vagueness parameter such that as \(\sigma \rightarrow 0\) then F(x) tends to the step function so that the vague channel converges to the Boolean channel, whereas as \(\sigma \rightarrow \infty \) then F(x) tends to x giving the linear vague channel already investigated in this paper. Figure 10 shows \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)\) for the error minimizing vague channel with two senders compared to \({\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) as \(\sigma \) varies. The optimal two sender vague channel for this parametrised family of distributions is at \(\sigma \approx 0.07532\) but the error minimising vague channel outperforms the optimal Boolean channel for \(\sigma \le 0.1482\). Note that the optimal distribution function is different for channels with different numbers of senders n. For example, Fig. 11 shows the optimal cumulative distributions and Fig. 12 shows the corresponding optimal values of \(\sigma \) for the channels with \(n=1, \ldots , 10\) senders. This suggests that vaguer channels are optimal when there is a larger number of senders, but that the gradient of this increasing trend in vagueness decreases with n. In terms of a direct comparison between vague and Boolean channels, for \(n \ge 6\) adopting the error minimizing estimator of x ensures that \({\mathbb{E}}^{\mathbf{V}}((x-y)^2) < {\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) for all \(\sigma >0\). For example, Fig. 13a, b shows the expected error for channels with 6 and 8 senders respectively plotted against \(\sigma \) and compared to the Boolean channel error.

Fig. 10 \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)\) and \({\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) plotted against \(\sigma \) for a channel with 2 senders

Fig. 11 Cumulative distribution F for the optimal channel for \(n=1, \ldots , 10\) senders

Fig. 12 Optimal values of \(\sigma \) for channels with \(n=1,\ldots ,10\) senders

Fig. 13 \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)\) and \({\mathbb{E}}^{\mathbf{B}}((x-y)^2)\) plotted against \(\sigma \) for channels with 6 and 8 senders. a Expected errors for a channel with 6 senders. b Expected errors for a channel with 8 senders

9 Discussion and Conclusions

In this paper we have attempted to make the case for vague categories with blurred boundaries playing a positive role in a certain type of communication scenario in which a receiver aggregates signals from multiple senders. We have compared a simple vague channel with linear membership functions and frequency based aggregation to the optimal Boolean channel. Unsurprisingly, for error free channels in which the input distribution is a priori known to be uniform, the expected squared error for the vague channel is a strictly decreasing function of the number of senders. Since Boolean channels do not gain from having multiple senders, we can always identify a minimum number of senders above which the vague channel will be on average more accurate, in terms of expected squared error, than the comparable Boolean channel. Our focus has then been on identifying the minimal number of senders in different scenarios where there are multiple labels, channel error or prior ignorance about the input distribution, and also when optimal vague channels are considered. This is motivated by the intuition that the lower bound on the number of senders required by vague channels directly influences the strength of our case for the efficacy of blurred boundaries.

The plausibility of our argument that the blurred boundaries of vague predicates have a useful role to play as a natural source of stochastic assertion decisions depends in large part on the extent to which aggregation, of the form exemplified by our robbery story, is a common part of natural language communication. We note that for one sender and one receiver channels our results are entirely consistent with those of Lipman (2009), albeit formulated differently. Stochastic channels of the form we have proposed are undoubtedly suboptimal in such cases (see Theorem 8). For our argument in favour of vagueness to be in any way convincing it would need to hold that some level of aggregation is a common part of linguistic communication, indeed even more common than one-on-one interactions of the type modelled by signalling games. We do not attempt to directly make this case here, neither are we aware of any empirical studies which look specifically into this claim. Instead, as we emphasised earlier, our goal is only to identify a possible scenario in which vagueness can be useful. However, it nonetheless seems clear that the larger the number of senders required for the vague channel to at least match the accuracy of the Boolean channel, the less compelling is the case for stochastic aggregation being a common feature of language. In this respect our result for multiple label vague channels (Theorem 4) and our study of optimal vague channels (Sect. 8) are both encouraging. For the former we have shown that the number of senders required for the linear vague channel to outperform the Boolean channel decreases rapidly as the number of labels increases (see Fig. 5). 
For the latter we have shown that by adopting the error minimizing estimator of x and then by selecting the distribution on \(\theta \) from a parametrised family with mean \({\frac{1}{2}}\), we can identify a unique channel which minimizes the value of \({\mathbb{E}}^{\mathbf{V}}((x-y)^2)\) for any fixed number of senders. Furthermore, provided that \(n \ge 2\), vague channels can be found with a lower expected error than the optimal Boolean channel. Note that different vague channels are optimal for aggregating different numbers of senders, with more vague label definitions being preferred for larger n (see Fig. 12). Tantalisingly, the S-curves reported in recent experimental studies on scalar adjectives (Lassiter and Goodman 2013; Qing and Franke 2014) are similar in form to those cumulative distribution functions that are optimal for channels with relatively low numbers of senders (see Fig. 11). This would then be consistent with the form of limited aggregation that one might expect to find in natural language where senders are scarce resources and where normally there will only be a small number of them. Certainly, the accuracy gained by using vague channels could potentially confer a significant advantage to both senders and receivers. For instance, in multi-label communication with 7 labels and 3 senders the expected squared error for the vague channel is around 11% lower than that of the Boolean channel with the same number of senders. Message sets with around 7 labels are not unrealistic, being consistent with the famous magic number theory of Miller (1956) which proposes bounds on the number of gradations on a numerical scale based on the limitations of human memory. Furthermore, a vague channel optimised for 2 labels and for only 2 senders has an error around 13% lower than the Boolean channel.

From a game theory perspective and with reference to Lipman (2009) we might wonder how the multi-sender games described in this paper can demonstrate the utility of stochasticity in communication and hence escape the general result that mixed strategies are always suboptimal relative to pure strategies. In order to reconcile Lipman’s observation with our results we must first clarify the type of games being played by the Boolean and vague channels respectively. For instance, for the Boolean channel only two strategies are available to senders: transmit 0 or transmit 1. In contrast, for the vague channel we should think of the n senders as a compound aggregated sender \({\mathbf{S}}\) who can choose between signals \(0,1,\ldots , n\), i.e. the possible values of T, and whose set of available strategies is the set of all the binomial distributions on \(\{0,\ldots ,n\}\). Hence, for \(n>1\) one explanation for the superior performance of vague channels is that the sender simply has more strategies to choose from than in the Boolean channel. The question remains, however, why is a pure strategy not also optimal for the vague channel? The reason for this lies in the restricted set of strategies available to \({\mathbf{S}}\). In a mixed-strategy game \({\mathbf{S}}\) would be allowed to choose any strategy from \(\Delta \), the set of all probability distributions on \(\{0,\ldots ,n\}\) (Osborne and Rubinstein 1994). However, the set of binomial distributions is a non-convex strict subset of \(\Delta \). In particular, it does not include any pure strategy of the form \({\mathbf{S}}=T\), where \(T \in \{1, \ldots , n-1\}\). However, in the case that \(x \in (0,1)\) it is exactly such a pure strategy, i.e. where \({\mathbf{S}}=\left\lceil nx\right\rceil \), that is optimal in the full mixed-strategy game. 
On the other hand, permitting this optimal strategy would be hard to justify in the context of natural language communication, since it would require that the n senders collaborate so as to transmit the best n-bit approximation to x i.e. any combination of signals in which the number of ones is exactly \(\left\lceil nx\right\rceil \). Essentially this would then be equivalent to a single n-bit channel, rather than n 1-bit channels. However, in natural language scenarios such as the robbery example in which the descriptions of a number of independent witnesses are aggregated, it is the latter which would seem to provide the more appropriate model.Footnote 12

It is unrealistic to assume error free channels for which the input distribution is completely known prior to communication. However, we have shown that vague channels are robust to reasonable levels of transmission error, i.e. with error probability less than \({\frac{1}{4}}\). In such cases by increasing the number of senders a vague channel can compensate for transmission error so as to still be more accurate than the error free Boolean channel. Indeed, to compensate for an error rate less than 4.5% requires only one additional sender. Regarding robustness to ignorance concerning the input distribution our results are rather more mixed. If reality is well modelled by the family of symmetric beta distributions then across all possible parameter values there is an upper bound on the minimal number of senders required by the vague channel. On the other hand, no such upper bound exists for the general family of beta distributions. This is mainly because an asymmetric model of this kind allows for the case that reality may turn out to be particularly favourable for the Boolean channel. For example, this is the case if the distribution on inputs is heavily peaked at either \({\frac{1}{4}}\) or \({\frac{3}{4}}\).

In the current paper we have focussed on vague labels with fixed definitions. However, a common feature of adjectives in natural language is that they are context dependent. For example, the description short has a different meaning when applied to the restricted class of basketball players than to the general class of potential suspects in the Bristol robbery. One potential mechanism by which relative descriptors of this kind could be incorporated into the current model would be for both speakers and listeners to employ a form of context dependent scaling. For instance, suppose that z is the underlying variable to be communicated, e.g. unscaled height in the robbery example, and further suppose that for a reference class C, z has the distribution function \(F_C\). If both senders and receivers have sufficient knowledge of z on class C to have a good estimate of \(F_C\), then channels of the following form can be defined by employing rescaling. Senders evaluate the scaled variable \(x=F_C(z)\), which is uniformly distributed on [0, 1] provided that inputs are restricted to the class C. This can then be transmitted using vague channels of the form proposed above, with the receiver obtaining an estimate y of x, which they then rescale according to \(F_C^{-1}(y)\) in order to give an estimate of z. In this case the production function for the input z is \(P(S=1|z)=\mu _{L_2}(F_C(z))\), and Fig. 14 illustrates this scaling process for the reference classes ‘UK males’, perhaps the reference class for the robbery example, and ‘Basketball players’. In particular, Fig. 14c shows the membership functions for tall in the two different contexts. Additional work is required to investigate the efficacy of this approach from a communication perspective.Footnote 13
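The rescaling step can be illustrated with the standard library's `statistics.NormalDist`. The sketch below is purely illustrative: it assumes heights in each reference class are normally distributed, the means and standard deviations are made-up values rather than data from the paper, and the helper names are ours.

```python
from statistics import NormalDist

def to_scaled(z, reference):
    """Sender side: x = F_C(z), the quantile of z within reference class C."""
    return reference.cdf(z)

def from_scaled(y, reference):
    """Receiver side: map an estimate y of x back to the original scale.

    inv_cdf requires 0 < y < 1, so an estimator that can return exactly
    0 or 1 (such as T/n) would need clamping before this step.
    """
    return reference.inv_cdf(y)

# Hypothetical reference classes (heights in cm); parameters are invented.
uk_males = NormalDist(mu=175, sigma=7)
basketball = NormalDist(mu=200, sigma=8)

z = 190
for name, ref in (("UK males", uk_males), ("Basketball players", basketball)):
    x = to_scaled(z, ref)
    print(name, round(x, 3), round(from_scaled(x, ref), 1))
```

The same height maps to a high quantile in one class and a low quantile in the other, which is the sense in which a single membership function on x yields different meanings for tall in different contexts.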

Fig. 14 Context scaling for the reference classes ‘UK males’ and ‘Basketball players’. a Distributions of two reference classes. b Optimal membership for \(L_2\) given 8 senders. c Scaled production functions \(P(S=1|z)=\mu _{L_2}(F_C(z))\) for the two reference classes. These correspond to the membership functions for tall in the two different contexts

In addition to signalling errors as discussed in Sect. 6, there are two additional sources of noise that will naturally occur for the type of communication channel we have proposed. Firstly, assuming a distributed learning model in which individuals infer the meanings of labels from repeated experiences of language use, it is inevitable that there will be variations in definitions between individuals. Secondly, we have assumed that all senders are describing the same input value. In reality this sensory data is likely to be subject to noise from a variety of sources. A future challenge is then to undertake a comparative study of vague and Boolean channels in the presence of both types of noise.

In summary, the results presented in this paper suggest that vagueness, acting as a source of randomness in assertion decisions, can be useful in communication scenarios where the number of relevant description words is moderately large and when there is aggregation of signals from several senders. However, the extent to which such scenarios occur in natural language, and whether or not they are sufficiently common to explain the ubiquity of vague terms, remains very much an open question.