Most quantifiers have many meanings

Logical theories of meaning assume that function words, such as natural language quantifiers, have a fixed meaning expressed by their truth conditions. In this study, we challenge this view by showing that there are systematic individual differences in semantic representations of quantifiers. Using computational modeling, we separated three sources of individual differences (truth condition, vagueness, and response error) and mapped them onto different model parameters. We selected five natural language quantifiers (few, fewer than half, many, more than half, and most), which we expected to differ in the model parameters. We collected response data in an online experiment and fitted a Bayesian three-parameter logistic regression model. By applying the k-means clustering algorithm to the model's parameters, we found three subgroups of participants with different semantic representations of quantifiers and a different organization of the mental line of quantifiers. Moreover, we found an asymmetry between positive and negative quantifiers in response error and vagueness. These findings support the view that logical words, like content words, are sensitive to individual differences, and hence challenge the logical theories of meaning.


Introduction
Needless to say, humans differ in their cognitive abilities. Similar to other cognitive domains, individual differences are also present in natural language processing (Kidd, Donnelly, & Christiansen, 2018). In this paper, we investigate individual differences in natural language quantifier representations. Natural language quantifiers make an excellent case study as they have drawn the attention of researchers from different fields ranging from logic (Barwise & Cooper, 1981; Mostowski, 1957) to formal semantics (Keenan & Paperno, 2012; Szabolcsi, 2010) to cognitive science (Ramotowska, Steinert-Threlkeld, van Maanen, & Szymanik, 2020b; see Szymanik, 2016). Quantifiers, such as many, few, most, some, and at least 5, are used to express quantities. They belong to the closed class of function words. They have been studied mostly in the verification paradigm (e.g., Deschamps, Agmon, Loewenstein, & Grodzinsky, 2015; Hackl, 2009; Pietroski, Lidz, Hunter, & Halberda, 2009; Schlotterbeck, Ramotowska, van Maanen, & Szymanik, 2020), in which participants have to decide if a sentence containing quantifiers is true in a given context.
Individual differences in quantifiers may come from three different sources. The first source is differences in general cognitive abilities, e.g., working memory (Just & Carpenter, 1992; Kidd et al., 2018) or executive functions (Kidd et al., 2018). For example, the accuracy and speed of verification of proportional quantifiers depend on working memory capacities (Steinert-Threlkeld, Munneke, & Szymanik, 2015; Zajenkowski & Szymanik, 2013; Zajenkowski, Szymanik, & Garraffa, 2014) and cognitive control (Zajenkowski & Szymanik, 2013; Zajenkowski et al., 2014). The second source of individual differences could lie in the choice of verification strategies (Talmina et al., 2017). For example, Talmina et al. (2017) showed that some participants prefer to use a precise strategy while verifying most, and others choose an approximate strategy. Moreover, strategy preference depends on the context (Register, Mollica, & Piantadosi, 2018). Finally, the third source of individual differences could be different semantic representations of quantifiers, where individuals assign different truth values to the same sentence (Spychalska, Kontinen, & Werning, 2016). Spychalska et al. (2016) divided participants into two groups based on their truth value evaluation of the underinformative sentence "Some As are B" when in fact all As were B. The group of so-called pragmatic responders judged the underinformative sentence as false, and the logical responders judged it as true.
While the first two sources of individual differences are compatible with the formal semantics perspective on language, the last one contradicts the intuition that language users have to agree on the truth conditions of sentences in order to communicate. At first glance, it seems that rational subjects cannot assign different meanings to logical words such as quantifiers. Nevertheless, in this paper, we show that the last option is tangible. We aim to answer three questions regarding individual differences in quantifiers. First, how many subgroups of participants with different meanings can we identify? Second, how are the meanings of quantifiers interrelated at the subject level? Third, we want to separate behavioral and semantic sources of individual variation in quantifier representations. We considered the truth conditions (the quantifier's threshold) and vagueness as the semantic sources. To account for behavioral sources, we included a response error parameter representing mistakes that participants made during the verification process. Response errors could happen due to attentional lapses or difficulties in processing complex quantifiers, but they are unrelated to vagueness or thresholds. The third question regarding individual differences is therefore: How are the parameters in our model interrelated?
To answer these questions, we analyzed data from a quantifier verification task, in which participants were asked to judge the truth of a quantified sentence based on information about proportion. We modeled the choices using a logistic regression model and estimated three model parameters corresponding to threshold, vagueness, and response errors. Then, we clustered participants based on the parameter estimates. Computational modelling has previously been successfully applied to test competing semantic theories (van Tiel, Franke, & Sauerland, 2021) and to distinguish between different sources of individual differences in language processing (Vasishth, Nicenboim, Engelmann, & Burchert, 2019; Waldon & Degen, 2020). Moreover, computational modelling allows the investigation of qualitatively different effects in experimental data (Haaf & Rouder, 2019; Donzallaz, Haaf, & Stevenson, 2021; Kolvoort, Davis, van Maanen, & Rehder, 2021; Miletić & van Maanen, 2019; Ramotowska, Steinert-Threlkeld, van Maanen, & Szymanik, 2020a). Our work continues the tradition of using computational modeling to better understand cognitive representations. In the following section, we explain the reasons for each of our questions and modeling choices.
1.1 How many subgroups of participants with different meanings can we identify?
The logical theory of meaning (e.g., Generalized Quantifier Theory, Barwise & Cooper, 1981; Mostowski, 1957) analyses the meaning of quantifiers in terms of truth conditions. The natural language quantifier's truth condition specifies a threshold above or below which the quantifier is true. For example, the quantifier most in the sentence "Most of the As are B" is true (most(A, B) = 1) if the cardinality of the intersection of sets A and B (|A ∩ B|) is greater than the cardinality of the intersection of sets A and not B (|A ∩ ¬B|). Example 1.1 shows truth conditions for the quantifiers most, more than half, fewer than half, many, and few. Some quantifiers, like at least 5, have clear truth conditions with a threshold equal to 5. Other quantifiers, like many, have various thresholds depending on the context (Schöller & Franke, 2016). Moreover, many and few are ambiguous between cardinal and proportional readings (Partee, 1989). According to the cardinal reading, the threshold is a fixed number, e.g., "Many students passed the exam" means that more than 40 students passed. The proportional reading of many, in turn, refers to many as more than some proportion, e.g., "Many of the students passed the exam" means that more than 40% of the students passed. In this paper, we focus only on proportional readings of few and many.
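The truth conditions discussed above can be written compactly. The following is a sketch, not a reproduction of Example 1.1: the condition for most follows the text, while the proportional thresholds \theta_{many} and \theta_{few} are notational assumptions introduced here to stand for context-dependent (and possibly speaker-dependent) values.

```latex
\begin{align*}
\textit{most}(A,B) = 1 &\iff |A \cap B| > |A \cap \neg B| \\
\textit{more than half}(A,B) = 1 &\iff |A \cap B| > \tfrac{1}{2}\,|A| \\
\textit{fewer than half}(A,B) = 1 &\iff |A \cap B| < \tfrac{1}{2}\,|A| \\
\textit{many}(A,B) = 1 &\iff \frac{|A \cap B|}{|A|} > \theta_{many} \quad \text{(proportional reading)} \\
\textit{few}(A,B) = 1 &\iff \frac{|A \cap B|}{|A|} < \theta_{few} \quad \text{(proportional reading)}
\end{align*}
```

Under these conditions, most and more than half are truth-conditionally equivalent for finite sets, which is why differences between them in the data are informative.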
Individual differences seem likely in context-dependent quantifiers such as many and few. Yildirim, Degen, Tanenhaus, and Jaeger (2016) showed that different speakers have different meanings of these quantifiers. More surprisingly, Ramotowska et al. (2020b) found individual differences in the quantifier most within an experimental paradigm that downplays the role of context. This finding questions the underlying assumption of many studies (Hackl, 2009; Pietroski et al., 2009) that participants have a dominant representation of most. In the current paper, we performed a cluster analysis to systematically investigate the subgroups of participants.
1.2 How are the meanings of quantifiers interrelated at the subject level?
The meanings of the quantifiers considered here overlap considerably. They constitute sets of alternatives for each other. The first studies that looked into the order of quantifiers on a scale tried to link quantifiers with proportions for psychometric purposes (Hammerton, 1976; Newstead, Pollard, & Riezebos, 1987). They found that participants were less consistent in the usage of some quantifiers than others. For example, low-magnitude quantifiers were more context-dependent than high-magnitude quantifiers (Newstead et al., 1987). Recently, Pezzelle, Bernardi, and Piazza (2018) have shown that quantifiers can be ordered on the mental number line. However, the distance between meaning representations does not have to be equal (see also van Tiel et al., 2021). For example, low-magnitude quantifiers (e.g., few, almost none) were more separated from each other and had sharper representations than high-magnitude quantifiers (almost all, most, many). They also showed that some quantifiers are semantically more similar than others. For example, many is more similar to most than to few. Moreover, a change in the meaning representation of one quantifier (e.g., many) affects the threshold of the polar opposite quantifier (e.g., few; Heim et al., 2015). This effect is present in the reinforcement learning paradigm (Heim et al., 2015) and arises via adaptation during exposure (Heim, Peiseler, & Bekemeier, 2020).
The above studies did not account for the individual differences in quantifier meaning representation. In contrast, we investigated the relationship between quantifier meanings taking into account the between-subjects variability in thresholds to shed more light on how quantifiers are represented on the mental number line on the individual level.
1.3 How are the parameters of our model interrelated?

Vagueness
Quantifiers such as many and few are vague, which means that their meaning boundaries depend on the situation (Newstead & Coventry, 2000; Solt, 2011). Another characteristic of vagueness concerns the borderline cases. If we agree that the sentence "Many of the students failed the exam" is true when 20% of students failed, we will also probably agree that the sentence is true when 19% failed. Thus, the threshold for accepting a statement as true for many and few is fuzzy even given a fixed context (Solt, 2011).
Some studies showed that the quantifier most is also vague (Denić & Szymanik, 2020; Solt, 2011). Solt (2016) claimed that most and more than half are represented on different underlying scales. More than half has to be represented on the ratio scale, while most requires only the semiordered scale. The latter scale allows less precise comparisons, and, therefore, the meaning of most is more variable. Moreover, Denić and Szymanik (2020) showed that participants were less consistent about their threshold for most than for more than half. Taken together, context dependency is not the only factor that might change the quantifier threshold. In a fixed context, some quantifiers can have variable truth condition assignments due to vagueness. Therefore, we included a separate parameter in our model to test the effect of vagueness independently of the threshold.
Moreover, previous studies (Hackl, 2009) argued that the same overall proportion of errors in the verification task for most and more than half speaks in favor of the same truth conditions of these quantifiers. In contrast, another study (Kotek, Sudo, & Hackl, 2015) showed that the accuracy for most is lower than for more than half when the proportion is slightly above 50%. Kotek et al. (2015) interpreted this asymmetry as a difference in quantifier pragmatics rather than truth conditions. Finally, Denić and Szymanik (2020) showed that the accuracy for most is lower than for more than half relative to their estimated thresholds. These studies show that the response error is a crucial measure of participants' performance. However, its interpretation is not unequivocal. We included the additional response error parameter in our model to account for differences in accuracy between negative and positive quantifiers and to disentangle the measure of error from the measures of threshold and vagueness.
To summarize, even though the above discussion suggests that vagueness, threshold, and error may be interrelated, as far as we know, this relationship has not been systematically investigated on an individual level. For example, we can imagine that participants may have the same truth conditions for most and more than half and yet perform worse while verifying most because of other reasons. Moreover, participants may make more errors when verifying vague quantifiers. Response errors and vagueness, in turn, can lead to variability in thresholds. These interdependencies might lead to confounds when interpreting the experimental data. Therefore, we applied a model with three different parameters to capture these three aspects.

Current study
To test the individual differences in quantifier representations and the relationship between the meanings of different quantifiers, we asked participants to judge the truth of a sentence involving a quantifier against the proportion given as a number between 1% and 99%. We chose proportional quantifiers from three groups: quantifiers with sharp meaning boundaries (fewer than half and more than half); vague and context-dependent quantifiers (few and many); and one quantifier that falls between these groups (most). After fitting a computational model to the response data to estimate these parameters for every quantifier and participant, we performed a cluster analysis on the threshold parameter to establish the subgroups of participants with different meanings. We predicted that all participants would have the same threshold for fewer than half and more than half because these quantifiers already refer to the threshold, namely half. In contrast, we predicted that we would find between-cluster variability in thresholds for vague quantifiers like most, many, and few. We also hypothesized that only vague quantifiers would contribute to clustering on the threshold.
Moreover, to address our second research question, we explored how the meaning of one quantifier relates to other quantifiers. Firstly, we tested the correlations between thresholds on the group level to see if the thresholds between quantifiers are interrelated. In contrast to previous studies (Hammerton, 1976; Heim et al., 2015; Newstead et al., 1987; Pezzelle et al., 2018; van Tiel et al., 2021), we also looked into the order of quantifiers on a mental scale on the individual level within the clusters of participants.
Finally, we tested the relationship between model parameters. We wanted to separate the between-participants variability in truth conditions (thresholds) from vagueness and response error by introducing three parameters into our model. We tested whether the model parameters were correlated. We did not have specific predictions about the direction of these correlations. This analysis was exploratory in nature. Nonetheless, we predicted a higher value of the vagueness parameter for vague quantifiers and that participants would make more mistakes while verifying the negative quantifiers. In addition to clustering on threshold parameters, we performed a cluster analysis on vagueness and response error to see which quantifiers contributed to clustering. We expected that few, many, and most would contribute to clustering on vagueness and negative quantifiers to clustering on response error.
Before running the computational model, we explored the effects of the three parameters on potential data patterns. In particular, we wanted to separate vagueness and response error effects because they both lead to response variability. Response errors are a result of additional cognitive processes and should therefore occur after the participants compare the proportion given in the experimental trial to their internal threshold. As such, response errors are independent of proportion. In contrast, vagueness adds noise to the decision process. The noise is greater around the participants' threshold. As a result, the internal threshold shifts from trial to trial. As such, vagueness depends on the proportion. Figure 1 presents how we conceptualized threshold, response error, and vagueness parameters. We chose the quantifier more than half for illustration. For the ideal responder, the proportion of 'true' responses below 50% is zero, and above 50% is one. The logistic curve has a sharp shape indicating a rapid shift from false to true responses at the threshold. When the responses are affected by vagueness, the perceived threshold varies from trial to trial, and the logistic curve increases gradually. The response error, in turn, does not change the shape of the response curve. Instead, it lowers the probability of the true response above the threshold and increases the probability of the true response below the threshold equally for all proportions. We also plotted the combined effect of response errors, vagueness, and threshold.
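The qualitative patterns described above (a sharp step for the ideal responder, a gradual curve under vagueness, and shifted asymptotes under response error) can be illustrated with a small sketch. The functional form used here is an assumption consistent with the verbal description, not a formula quoted from the paper, and the function name `p_true` and the parameter values are hypothetical.

```python
import math


def p_true(percent, threshold=50.0, vagueness=1.0, error=0.0):
    """Probability of a 'true' response for a positive quantifier.

    Assumed form (illustrative, not quoted from the paper):
        p = error + (1 - 2 * error) * logistic((percent - threshold) / vagueness)
    """
    z = (percent - threshold) / vagueness
    # Numerically stable logistic function.
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    # Response error pulls both asymptotes toward 0.5 by the same amount.
    return error + (1.0 - 2.0 * error) * s


# Hypothetical parameter settings for the four response profiles.
profiles = {
    "ideal": dict(vagueness=0.1, error=0.0),
    "vague": dict(vagueness=8.0, error=0.0),
    "error": dict(vagueness=0.1, error=0.1),
    "combined": dict(vagueness=8.0, error=0.1),
}

for name, params in profiles.items():
    row = [round(p_true(p, **params), 2) for p in (10, 40, 49, 51, 60, 90)]
    print(f"{name:9s}", row)
```

In this sketch, vagueness changes the slope only near the threshold, whereas response error changes the curve equally at every proportion, which is the distinction the two parameters are meant to capture.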

Participants
We recruited 90 participants via the online recruitment platform Amazon Mechanical Turk. We excluded 19 participants based on three exclusion criteria. Firstly, we excluded 11 participants for whom 50% or more of the reaction times were faster than 300 ms. Secondly, we excluded 7 participants who failed to obey the monotonicity of quantifiers. We defined the monotonicity criterion in the following way: for positive quantifiers (many, most, and more than half), we expected the probability of providing the true response to increase with increasing proportion. The opposite effect should hold for negative quantifiers. To apply the monotonicity criterion, we fitted a generalized linear model to participants' response data with the proportion as a predictor and with by-subject random intercept and slope for proportion (glmer R function, Kuznetsova, Brockhoff, & Christensen, 2017). We excluded participants who had a negative slope for positive quantifiers or a positive slope for negative quantifiers. Finally, we excluded 1 participant who had taken part in a similar experiment. These exclusions meant that we included 71 participants (47 male, age M = 35, range: 22-59) in the final sample.
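The first two exclusion criteria can be sketched as follows. This is a simplified illustration and not the procedure used in the study: instead of the mixed-effects logistic model, it uses the sign of an ordinary least-squares slope computed separately for each participant and quantifier. All function names are hypothetical.

```python
POSITIVE = {"many", "most", "more than half"}


def too_fast(rts_ms, cutoff_ms=300, max_share=0.5):
    """True if at least `max_share` of the reaction times are below the cutoff."""
    fast = sum(1 for rt in rts_ms if rt < cutoff_ms)
    return fast / len(rts_ms) >= max_share


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (xs must not all be identical)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def violates_monotonicity(quantifier, percents, responses):
    """Responses are coded 1 for 'true' and 0 for 'false'."""
    slope = ols_slope(percents, responses)
    # Positive quantifiers should show an increasing share of 'true'
    # responses with increasing proportion; negative ones a decreasing share.
    return slope < 0 if quantifier in POSITIVE else slope > 0


print(too_fast([120, 250, 640, 900]))
print(violates_monotonicity("most", [10, 30, 70, 90], [1, 1, 0, 0]))
```

A per-participant slope ignores the partial pooling that a random-slope model provides, so the two approaches can disagree for participants with noisy data.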

Experimental Design and Procedure
In our experiment, participants had to indicate whether the sentence with the quantifier: most, many, few, fewer than half, or more than half was true or false based on the sentence containing a proportion ranging from 1% to 99% (excluding 50%). We did not include the proportion 100%, because Ariel (2003) showed that most has an upper bound on meaning and using it with 100% proportion is not accepted, although it is highly accepted with 99%. The upper bound of most could cause a divergence in the logistic function which we used in our model. We did not include 50%, because this proportion could be confusing for more than half and fewer than half.
While most, more than half, and fewer than half have a proportional interpretation (Hackl, 2009), as explained above, many and few are ambiguous between cardinal and proportional readings (Partee, 1989). For example, many could mean more than a certain number (cardinal reading) or more than a certain proportion (proportional reading, see Example 1.1). We used the explicit partitive 'of the' and presented proportions as percentages for all quantifiers to ensure the proportional reading and avoid confounds for ambiguous quantifiers. Moreover, by using the percentage format, we enforced a precise comparison between the proportion and the threshold. In this way, we minimized the differences between quantifiers in verification strategies. For example, in some experimental paradigms most is verified using an approximate strategy (Pietroski et al., 2009), while in others a mixture of strategies is used (Talmina et al., 2017).
The experiment started with a short training block to familiarize participants with the procedure. Next, participants completed 250 trials (50 per quantifier) in randomized order. At the end of the experiment, participants provided basic demographic information. Each trial of the experiment consisted of two sentences displayed on separate screens. The first sentence, containing the quantifier, was of the form "[Most/Many/Few/More than half/Fewer than half] of the gleerbs are fizzda." To read this sentence, participants had to press the down arrow key and keep it pressed. When they advanced to the next screen, they read a sentence containing a proportion, e.g., "20% of the gleerbs are fizzda." Participants had to provide a response by pressing the right or left arrow key, corresponding to a true or false judgment (counterbalanced between participants).
In our experiment, we used pseudowords generated from 50 English six-letter nouns and adjectives using Wuggy software (Keuleers & Brysbaert, 2010). We used pseudowords to avoid pragmatic effects associated with quantifiers. The original words were controlled for frequency (Zipf value 4.06, van Heuven, Mandera, Keuleers, & Brysbaert, 2014). A native English speaker assessed the pseudowords in terms of how well they imitated English words.

Data pre-processing
We excluded trials with response times shorter than 300 ms or longer than 2500 ms (similar cut-offs to Ratcliff & McKoon, 2018). Altogether, we excluded 6% of trials. To be able to fit the same logit model to all quantifiers, we flipped the true and false responses for few and fewer than half.
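The two pre-processing steps can be sketched as below. The trial format (a dictionary with the keys `quantifier`, `percent`, `response`, and `rt_ms`) and the function name `preprocess` are hypothetical; only the cut-offs and the flipping rule come from the text.

```python
NEGATIVE = {"few", "fewer than half"}


def preprocess(trials, rt_min_ms=300, rt_max_ms=2500):
    """Drop trials outside the response-time window and flip responses
    for negative quantifiers so that one increasing curve fits all."""
    cleaned = []
    for trial in trials:
        if not (rt_min_ms <= trial["rt_ms"] <= rt_max_ms):
            continue  # too fast or too slow
        response = trial["response"]  # 1 = 'true', 0 = 'false'
        if trial["quantifier"] in NEGATIVE:
            response = 1 - response
        cleaned.append({**trial, "response": response})
    return cleaned


trials = [
    {"quantifier": "few", "percent": 20, "response": 1, "rt_ms": 800},
    {"quantifier": "most", "percent": 80, "response": 1, "rt_ms": 250},
    {"quantifier": "many", "percent": 30, "response": 0, "rt_ms": 1200},
    {"quantifier": "fewer than half", "percent": 60, "response": 0, "rt_ms": 2600},
]
cleaned = preprocess(trials)
print(cleaned)
```

After flipping, a "true" judgment of few at 20% is coded as 0, so the recoded responses for negative quantifiers increase with proportion just as they do for positive quantifiers.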

Computational Model
The logistic regression model is suitable for modelling the threshold variability (Ramotowska et al., 2020b). The model assumes that the probability that participants verify a statement as true or false depends on the proportion that was presented on a particular trial and on the values of three logistic function parameters: asymptote, midpoint, and scale.

To accommodate individual differences and differences between quantifiers in the model, we used a three-parameter logistic regression model inspired by Item Response Theory (IRT). IRT determines the relationship between an individual's trait and the probability of providing a correct response for a given item (Hambleton, Swaminathan, & Rogers, 1991; Ligia et al., 2013). This relationship is expressed by the Item Response Function, which maps the IRT parameters (difficulty, discrimination, and guessing) onto the logistic function. The three-parameter model has a difficulty parameter, which determines the level of an individual trait necessary to provide a correct response (midpoint), a discrimination parameter, which determines the steepness of the logistic curve (scale), and a guessing parameter, which adjusts the asymptotes of the logistic curve.
In our model, the threshold corresponds to the difficulty parameter, vagueness to the discrimination parameter, and response error to the guessing parameter from the IRT model. We used a hierarchical Bayesian model to estimate the parameters for each participant-quantifier combination. To fit the model, we used the rstan package in R (Stan Development Team, 2017) with 6 chains, 750 warm up iterations per chain and 2500 iterations per chain.
The model was specified in the following way. Let i indicate participants, i = 1, ..., I, j indicate the quantifier, j = 1, ..., 5, and k indicate the trial for each quantifier, k = 1, ..., Kij. Then Yijk is the i-th participant's response to the j-th quantifier in the k-th trial, with Yijk = 1 if the participant indicated true and Yijk = 0 if the participant indicated false. We model Yijk as a Bernoulli random variable with success probability πijk, using the logit link function so that the probability space of π maps onto µ.
The additional parameter γij determines the probability of making a response error on either side of the threshold, namely erroneously saying true or erroneously saying false. Each participant-quantifier combination has its own response error parameter estimate. The parameter µijk has a linear model specification in terms of cijk, the percentage centered at 50%; βij, the threshold; and αij, the vagueness of the quantifier.
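One way to write a model matching this verbal description is given below. It is a hedged reconstruction, not the authors' typeset equations: the way γ enters the success probability and the placement of α as a divisor (so that larger α means a shallower curve and hence more vagueness) are assumptions chosen to agree with the surrounding text.

```latex
\begin{align*}
Y_{ijk} &\sim \text{Bernoulli}(\pi_{ijk}), \\
\pi_{ijk} &= \gamma_{ij} + (1 - 2\gamma_{ij})\,\text{logit}^{-1}(\mu_{ijk}), \\
\mu_{ijk} &= \frac{c_{ijk} - \beta_{ij}}{\alpha_{ij}} .
\end{align*}
```

Under this form, the response curve runs from \gamma_{ij} to 1 - \gamma_{ij}, crosses 0.5 at c_{ijk} = \beta_{ij}, and flattens as \alpha_{ij} grows.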
We defined prior probabilities on the response error (γ), threshold (β), and vagueness (α) parameters. The hierarchical nature of the distributions for αij and βij indicates that we estimated the effect of threshold and vagueness for each participant under the assumption that they had a common mean and variance. The vagueness and threshold priors were fairly uninformative. Vagueness (αij) came from a log-normal distribution to ensure only positive estimates. Its mean (νj) had a normal distribution, and its variance (σ²αj) was drawn from an Inverse-Gamma distribution, as this distribution is typically used to model variance. For the thresholds (βij), we used a normal distribution with a common, normally distributed mean (δj) and the same variance distribution (σ²j) as for αij. The response error (γij) came from a more informed distribution with most of its mass below an error rate of 20% for each true and false response.
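The hierarchical structure described above can be summarized as follows. Only the distribution families named in the text are shown; the hyperparameters (m, s, a, b, with subscripts) are placeholders introduced here because their values are not given in this section, and the family of the response error prior is left unspecified for the same reason.

```latex
\begin{align*}
\alpha_{ij} &\sim \text{Log-Normal}(\nu_j, \sigma^2_{\alpha_j}), &
\nu_j &\sim \text{Normal}(m_\nu, s_\nu^2), &
\sigma^2_{\alpha_j} &\sim \text{Inverse-Gamma}(a, b), \\
\beta_{ij} &\sim \text{Normal}(\delta_j, \sigma^2_{j}), &
\delta_j &\sim \text{Normal}(m_\delta, s_\delta^2), &
\sigma^2_{j} &\sim \text{Inverse-Gamma}(a, b), \\
\gamma_{ij} &\sim p(\gamma), & & & &\text{with most mass on } \gamma_{ij} < 0.2 .
\end{align*}
```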

Cluster analysis
We ran the exploratory cluster analysis for threshold, vagueness and response errors separately, estimating the clusters using the K-means clustering method (kmeans function in R, Hartigan & Wong, 1979). We determined the optimum number of clusters by using the elbow plots and Silhouette width.
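The procedure of fitting k-means for several values of k and comparing the within-cluster sum of squares (for the elbow plot) with the mean Silhouette width can be sketched as below. This is a pure-Python illustration using Lloyd's algorithm with random restarts, not the Hartigan-Wong algorithm behind R's kmeans function, and the two-dimensional toy data are synthetic rather than the study's parameter estimates.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, n_init=10, max_iter=100, seed=0):
    """Lloyd's algorithm with restarts; returns (labels, centers, wss)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        centers = rng.sample(points, k)
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
            if new_labels == labels:
                break  # assignments stable: converged
            labels = new_labels
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old center if a cluster empties
                    centers[c] = tuple(sum(col) / len(members)
                                       for col in zip(*members))
        wss = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
        if best is None or wss < best[2]:
            best = (labels, list(centers), wss)
    return best


def mean_silhouette(points, labels):
    k = max(labels) + 1
    scores = []
    for i, p in enumerate(points):
        sums, counts = [0.0] * k, [0] * k
        for j, q in enumerate(points):
            if i != j:
                sums[labels[j]] += dist(p, q)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            scores.append(0.0)  # singleton cluster: width defined as 0
            continue
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c]  # mean distance to nearest other cluster
                for c in range(k) if c != own and counts[c] > 0)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Synthetic toy data: three well-separated groups of 15 points each.
data_rng = random.Random(1)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in true_centers for _ in range(15)]

results = {}
for k in range(2, 6):
    labels, centers, wss = kmeans(points, k)
    results[k] = (wss, mean_silhouette(points, labels))
    print(k, round(wss, 2), round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
print("k with the largest mean Silhouette width:", best_k)
```

The elbow criterion looks for the k after which the within-cluster sum of squares stops dropping sharply, while the Silhouette criterion picks the k with the largest mean width; as in the analyses reported here, the two need not agree on real data.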

Linear Discriminant Analysis
To assess the contribution of the model estimates to the clustering, we performed a linear discriminant analysis (LDA). We used a stepwise procedure based on Wilks' lambda (greedy.wilks function in R package klaR; Roever et al., 2015) to determine which variables contributed significantly to cluster formation. Next, we ran the LDA (lda function in R package MASS) to test how accurately the selected variables could predict the clusters. To validate the LDA, we ran a leave-one-out cross-validation.

Estimated parameters
The estimated model parameters are shown in Table 1. Figure 2 shows the estimated item response curves for each participant-quantifier combination; the overall response curves for the quantifiers are represented by the bold, colored lines. We found greater individual variation in thresholds for most, many, and few, compared to more than half and fewer than half. At the group level, quantifier thresholds were ordered as follows (Friedman test χ²(4) = 134, p < 0.001, moderate effect size W = 0.47): few had the lowest threshold, followed by many, then fewer than half and more than half, and most had the highest threshold (pairwise comparisons, Wilcoxon Signed Rank Test with Bonferroni correction).
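The reported effect size can be checked against the test statistic. Assuming that W is Kendall's coefficient of concordance and that all N = 71 participants and k = 5 quantifiers entered the Friedman test, the standard relation between the two quantities gives:

```latex
\[
W = \frac{\chi^2}{N\,(k - 1)} = \frac{134}{71 \times 4} = \frac{134}{284} \approx 0.47
\]
```

This agrees with the reported value of W = 0.47.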
The quantifiers fewer than half and more than half were the least vague, as indicated by the steep response curves in Figure 2. Moreover, few was more vague than fewer than half (V = 2556; p < 0.001), many was more vague than more than half (V = 2556; p < 0.001), many was more vague than most (V = 2556; p < 0.001), and most was more vague than more than half (V = 2556; p < 0.001; p-values based on the Wilcoxon Signed Rank Test). We also found that fewer than half had a greater response error than more than half (V = 2323; p < 0.001), and few had a greater response error than many (V = 1809; p = 0.002; p-values based on the Wilcoxon Signed Rank Test). As predicted, the vague quantifiers had a higher value of the vagueness parameter, and the negative quantifiers had a higher value of the response error parameter.

In the next step, we studied the associations between model parameters across quantifiers to reveal potentially systematic patterns (see Figure 3). Figure 3a shows the correlations between thresholds. These correlations were negligible or weak. This finding motivates the cluster analysis, because the lack of correlation might be caused by different relationships between thresholds in the subgroups. It also suggests that clusters of participants could have different representations and orderings of quantifiers on the mental line. Figure 3b shows the correlations for vagueness, and Figure 3c shows those for response errors. The correlations for vagueness were also weak, suggesting that this parameter is quantifier-specific and not domain-general. In contrast, the correlations for response error varied, ranging from a strong correlation between few and fewer than half (r = 0.75) to the weakest correlation between more than half and many (r = 0.24, see Figure 3c). The strongest correlation was significantly higher than the weakest (Steiger's test, z = 4.72, p < 0.001). This suggests that response error may partly reflect general cognitive ability.
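The Wilcoxon statistics reported above can be put in context with a short sketch. It assumes that V is the sum of the ranks of the positive paired differences (the convention used by R's wilcox.test) and that all 71 participants entered each test; the function name `signed_rank_v` and the demonstration data are hypothetical.

```python
def signed_rank_v(x, y):
    """Sum of ranks of positive paired differences.

    Zero differences are dropped and tied absolute differences
    receive their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = average_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)


n = 71
max_v = n * (n + 1) // 2  # largest possible V for n untied pairs

# Demonstration: every first value exceeds its paired second value.
x = [float(i + 1) for i in range(n)]
y = [0.0] * n
print(max_v, signed_rank_v(x, y))
```

Under these assumptions, V = 2556 is the largest value the statistic can take with 71 pairs, which would mean that every participant showed the difference in the same direction for the four vagueness comparisons, whereas the response error comparisons (V = 2323 and V = 1809) included some participants with the opposite pattern.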
To test the interrelationship between vagueness, threshold, and response error, we correlated the model parameters for each quantifier (Figure 4). This correlation analysis was exploratory in nature. We wanted to test whether there were any systematic patterns across quantifiers. We found a significant negative correlation between threshold and vagueness for few (r = -0.33) and many (r = -0.31). We also found correlations between threshold and response error for fewer than half (r = -0.32), and between response error and vagueness for many (r = 0.53) and most (r = 0.52). In general, the correlations did not reveal systematic patterns. The lack of systematic correlations between the vagueness and response error parameters gives additional support to the choice to model these parameters as two separate mechanisms.

Figure 4: Correlations of parameters for each quantifier (significance level *** 0.001, ** 0.01, * 0.05). The p values were adjusted using the Bonferroni correction.

Threshold
The methods to determine the optimum number of clusters for threshold gave ambiguous results. The elbow plot indicated 3 or 4 clusters, while the Silhouette method preferred 5 clusters. We chose the simplest solution, comprising 3 clusters, because the additional clusters consisted of only 4 participants, making interpretation difficult. The three clusters were indistinguishable for the quantifiers fewer than half and more than half, but differed substantially in thresholds for the quantifiers few, many, and most. Figure 5 shows the individual estimates for threshold, vagueness, and response error parameters for the quantifiers few, many, and most, with color indicating cluster membership.
The first cluster (N = 13) consisted of participants with a higher mean threshold for most, the second cluster (N = 34) included participants who had thresholds for all quantifiers close to 50%, and the last cluster (N = 24) consisted of participants who had a similar mean threshold for few and many (see Table 2). In addition, we found that participants in Cluster 3 had a higher tendency to make errors, with this tendency especially visible for few (see Figure 5). Because we did not find a systematic relationship between thresholds of different quantifiers (see Figure 3a), we investigated this relationship in the clusters (see Figure 6). We supposed that the lack of correlations between thresholds could be explained by the different relationships between quantifiers in subgroups. Specifically, we wanted to test whether all participants would have the same order of vague quantifiers on a mental line and whether the distance between quantifiers would be different in clusters. Figure 6a shows that all participants had a lower or equal threshold for many than for most. However, the distance between thresholds was greater in Cluster 3 than in the other clusters. Figure 6b shows that the vast majority had a higher threshold for many than for few. The greatest distance between thresholds was in Cluster 1, while the smallest was in Cluster 3. Figures 6c and 6d show that all participants in Cluster 3 had a lower threshold for many than for more than half and fewer than half.

Figure 6: (a) The difference between the threshold for many and most for each participant. (b) The difference between the threshold for many and few for each participant. (c) The difference between the threshold for many and more than half for each participant. (d) The difference between the threshold for many and fewer than half for each participant. Colors are used to indicate cluster membership: Cluster 1 is indicated in green (N = 13), Cluster 2 in orange (N = 34), and Cluster 3 in purple (N = 24). The error bars indicate the 95% credible intervals.

Vagueness
The elbow plot and Silhouette method agreed that the two-cluster solution was optimal, identifying one cluster (N = 24) with high vagueness for many, and a second cluster (N = 47) with lower vagueness for many (Table 3). We expected the polar opposite quantifiers few and many to make comparable contributions to clustering on vagueness. Instead, we observed an asymmetry between many and few. Figure 7 shows that participants with higher vagueness for many tended to make more mistakes and had a lower threshold, while participants with lower vagueness for many had a threshold concentrated around 50% and made fewer errors.

Figure 7: Relationship between threshold, vagueness, and response error for many, indicating two clusters based on vagueness. Cluster 1 (N = 24) with higher vagueness for many is indicated in green, and Cluster 2 (N = 47) with lower vagueness for many in orange.

Response error
The elbow plot suggested that either two or three clusters would be optimal, but the Silhouette method indicated the two-cluster solution. Assuming two clusters, we found a cluster of participants with few response errors (N = 64) and a cluster with more response errors (N = 7) across quantifiers (see Table 4). This means that the majority of participants had a low response error rate. The difference in response error between clusters was most prominent for negative quantifiers. Figure 8 shows the relationship between model parameters based on response error clustering for few and fewer than half. For few, we did not observe that participants with a high response error had a tendency toward more extreme thresholds or vagueness, while for fewer than half some participants who made more errors also had a lower threshold.

Figure 8: Relationship between threshold, vagueness, and response error for few (8a) and fewer than half (8b), indicating two clusters based on response error. Cluster 1 (N = 7) with a high response error is indicated in green, and Cluster 2 (N = 64) with a lower response error in orange.

Threshold
For thresholds, as expected, we found that only vague quantifiers contributed to the clustering: many (λ = 0.42, p < 0.001), few (λ = 0.24, p < 0.001), and most (λ = 0.16, p < 0.001). Figure 9 shows the combined effect of the three quantifiers on the clustering. The LDA accuracy in classification into Clusters 1 to 3 based on thresholds for many, few, and most was 97%, and the leave-one-out cross-validation accuracy was 94%.

Figure 9: Three clusters for threshold based on few, many, and most parameters. The threshold values for the three quantifiers (few, many, and most) that contributed to clustering are plotted against each other. Colors are used to indicate cluster membership: Cluster 1 (N = 13) is indicated in green, Cluster 2 (N = 34) in orange, and Cluster 3 (N = 24) in purple.
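To give an intuition for the λ statistic, the sketch below computes Wilks' lambda for a single variable, where it reduces to the ratio of the within-cluster to the total sum of squares: values near 0 indicate that the variable separates the clusters well, and values near 1 indicate no separation. This univariate form is an assumption made for illustration; the values reported above may come from a multivariate or stepwise procedure, and all numbers and names in the code are hypothetical.

```python
# Toy sketch: univariate Wilks' lambda = within-cluster SS / total SS.
# Data and cluster labels are hypothetical, not the study's estimates.

def wilks_lambda(values, labels):
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_within = 0.0
    for cluster in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == cluster]
        cluster_mean = sum(members) / len(members)
        ss_within += sum((v - cluster_mean) ** 2 for v in members)
    return ss_within / ss_total

clusters = [1, 1, 2, 2, 3, 3]
# A variable whose mean differs strongly between clusters ...
separating = [0.50, 0.52, 0.47, 0.49, 0.33, 0.35]
# ... and one whose distribution is identical in every cluster.
flat = [0.50, 0.51, 0.50, 0.51, 0.50, 0.51]

lam_separating = wilks_lambda(separating, clusters)
lam_flat = wilks_lambda(flat, clusters)
print(round(lam_separating, 3), round(lam_flat, 3))
```

In this toy case the first variable yields a lambda close to 0 and the second a lambda of 1, mirroring the contrast between quantifiers that did and did not contribute to the clustering.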

Vagueness
For the vagueness parameter, we expected vague quantifiers to contribute to the clustering. We found that only many contributed significantly to the clustering (λ = 0.29, p < 0.001). The LDA achieved 94% accuracy in classification of participants into clusters based on vagueness parameters for many, and the leave-one-out cross validation accuracy was 94%.
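The classification check described above can be sketched for the single-predictor case. The following is a minimal illustration, assuming a one-dimensional LDA with a pooled variance and class priors estimated from the training data; the vagueness values, labels, and function names are hypothetical and are not the study's data or code.

```python
import math

# Toy sketch: one-dimensional LDA with leave-one-out cross-validation.
# All values are hypothetical vagueness estimates for two clusters.

def fit_lda_1d(xs, ys):
    """Estimate class means, priors, and the pooled within-class variance."""
    classes = sorted(set(ys))
    means, priors = {}, {}
    for c in classes:
        members = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(members) / len(members)
        priors[c] = len(members) / len(xs)
    pooled_var = sum((x - means[y]) ** 2 for x, y in zip(xs, ys))
    pooled_var /= len(xs) - len(classes)
    return means, priors, pooled_var

def predict_lda_1d(model, x):
    """Return the class with the largest linear discriminant score."""
    means, priors, var = model
    def score(c):
        return x * means[c] / var - means[c] ** 2 / (2 * var) + math.log(priors[c])
    return max(means, key=score)

def accuracy(xs, ys):
    """Resubstitution accuracy: fit and evaluate on the same participants."""
    model = fit_lda_1d(xs, ys)
    return sum(predict_lda_1d(model, x) == y for x, y in zip(xs, ys)) / len(xs)

def loo_accuracy(xs, ys):
    """Leave-one-out accuracy: refit without each participant, then predict it."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit_lda_1d(train_x, train_y)
        hits += predict_lda_1d(model, xs[i]) == ys[i]
    return hits / len(xs)

vagueness = [0.30, 0.34, 0.28, 0.36, 0.08, 0.10, 0.12, 0.09, 0.11, 0.07]
cluster = ["high"] * 4 + ["low"] * 6

acc = accuracy(vagueness, cluster)
loo = loo_accuracy(vagueness, cluster)
print(acc, loo)
```

Comparing the two accuracies indicates how much the classification depends on any single participant; when they coincide, as reported above for vagueness, the cluster boundary is stable under removal of individual observations.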

Response error
We expected the response error parameter for negative quantifiers to contribute more to clustering. In line with this hypothesis, the Wilks test showed a significant contribution of response error parameters for few (λ = 0.32, p < 0.001) and fewer than half (λ = 0.25, p < 0.001), but not for many, most and more than half. Figure 10 shows the combined effect of the two quantifiers on clustering. Participants who made more errors while verifying few also made more errors for fewer than half. We used the LDA to predict the cluster membership for each participant based on response error parameters for few and fewer than half. The LDA achieved 99% accuracy, and the leave-one-out cross validation accuracy was 99%.

Discussion
Previous studies showed that quantifiers are organized on a mental scale (Hammerton, 1976;Pezzelle et al., 2018) and that participants use their internal threshold to verify proportional quantifiers (Shikhare, Heim, Klein, Huber, & Willmes, 2015). However, little has been known about the individual differences in the organization of quantifiers on the mental line. The main goal of this study was to identify the subgroups of participants with different meanings of quantifiers. We investigated how quantifiers are organized on the mental line within the subgroups. Firstly, we examined the correlations between quantifiers for each parameter of our model. We found that only the response errors correlated across quantifiers. The lack of significant correlations for other parameters further motivated the analysis of the subgroups. We ran a cluster analysis on threshold parameters of quantifiers. We identified three groups of participants with different mean thresholds and relationships between the meaning of quantifiers. As initially predicted, quantifiers with sharp meaning boundaries, like fewer than half and more than half, did not contribute to clustering, and they had similar thresholds in all groups. In contrast, thresholds for many, few, and most varied considerably between clusters. In all groups, most had the highest threshold. However, the mean threshold varied between clusters. In the first cluster, the mean threshold was 60%, and in the second and third clusters, the mean thresholds were just slightly above 50%, at 51% and 52%, respectively. For few, participants in the first and third clusters had mean thresholds equal to 35%, and in the second cluster, the mean threshold was 45%. The mean threshold for many was the most diverse between groups. It ranged from 51% in the first cluster to 48% in the second cluster to 34% in the third cluster.
The subsequent goal of this paper was to look into the relationship between threshold, vagueness, and response error. As predicted, we found that quantifiers with broad meaning boundaries had a higher vagueness value and that negative quantifiers had a higher response error value. We investigated the correlations across parameters for all quantifiers. However, we failed to find systematic patterns. Finally, we clustered participants based on vagueness and response error. We found two clusters with high and low vagueness for many and two clusters with high and low response error for few and fewer than half. We will discuss the implications of these findings in more detail in the following subsections.

Order of quantifiers on the mental line
Because we failed to find correlations at the group level between thresholds of different quantifiers, we zoomed into the mental line of the subgroups of participants. We observed that the clusters differed in the range of the mental line and the order of quantifiers on it. Participants in the first cluster had the most stretched mental line, ranging between 35% and 60%, with a clear order of thresholds, where few was the lowest and most was the highest. In contrast, the second group had the most shrunk mental line, ranging between 45% and 51%. The mental scale of the last group stretched between 34% and 52%. We further looked into the relationship between vague quantifier pairs: few and many (the polar opposites), and many and most. Hammerton (1976) found that although participants assigned different numerical equivalence to quantifiers, they were consistent about the order of the quantifiers. Our findings indicate that participants were consistent about the order of some quantifiers, but not all. For example, we found an asymmetry between many and few with regard to their positioning on the mental scale. The position of many on the mental scale was more flexible than the position of few. In the second and third clusters, the mean threshold for many was lower than for more than half and fewer than half, but in the first cluster, it was higher (see Figure 6). The second asymmetry between many and few was that only many contributed to clustering based on vagueness.
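The ranges quoted above follow directly from the cluster means reported in the Discussion. As a small worked example, the sketch below derives, for each cluster, the order of the vague quantifiers and the span of the mental line from those rounded means (in percent). The dictionary layout and variable names are illustrative choices; note that the ordering here is computed on rounded cluster means, so it need not match the ordering of every individual participant.

```python
# Worked example: span and order of the mental line per cluster, computed
# from the rounded mean thresholds (in %) reported in the Discussion.

mean_threshold = {
    "Cluster 1": {"few": 35, "many": 51, "most": 60},
    "Cluster 2": {"few": 45, "many": 48, "most": 51},
    "Cluster 3": {"few": 35, "many": 34, "most": 52},
}

summary = {}
for name, means in mean_threshold.items():
    order = sorted(means, key=means.get)               # lowest to highest mean
    span = max(means.values()) - min(means.values())   # width in percentage points
    summary[name] = (order, span)
    print(name, " < ".join(order), "span:", span)
```

The spans of 25, 6, and 18 percentage points correspond to the most stretched line in Cluster 1, the most compressed line in Cluster 2, and the intermediate line in Cluster 3.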

Many vs. few
The flexibility of many on the mental scale cannot be explained by its context-dependency. Firstly, in our experiment, we used an artificial context by introducing pseudowords. There was no reason for participants to have different expectations about the context. Secondly, based on the literature (Newstead et al., 1987), we predicted the opposite pattern of results. The low-magnitude quantifiers, such as few, are more context-dependent than high-magnitude quantifiers (Newstead et al., 1987). Moreover, they can change their threshold depending on the reference set (Newstead et al., 1987) and they are more separated from each other on the mental scale than high-magnitude quantifiers (Pezzelle et al., 2018).
We attribute the asymmetries in our study to competition between quantifiers. While few was less than 50% for all clusters and most was more than 50%, many had to compete with both quantifiers for a place on the mental line. As a result of this competition, many had a greater variation in threshold and was more vague, at least for some participants. We observed two tendencies concerning the threshold of many (see Figure 6). The first tendency was to either keep the threshold for few and many close together (Clusters 2 and 3) or far apart (Cluster 1, Figure 6b). The second tendency was to either keep the threshold for many close to most (Cluster 2, and to some extent 1) or far from most (Cluster 3, see Figure 6a). Despite these tendencies, almost all participants had a higher threshold for many than for few, and all participants had a higher threshold for most than for many. Altogether, this finding shows that the position of many on the mental line is more flexible than the position of few and it explains the membership of the clusters. Nonetheless, in all clusters participants treated few as less than many, and many as less than most.

Many vs. most
Previous studies and linguistic analysis (Hackl, 2009;Pezzelle et al., 2018) stressed similarities between most and many. Firstly, Hackl (2009) analyzed most as a superlative of many (many+est). This analysis predicts that most has to be more than many. Our data support this prediction. We showed not only that the mean threshold for many was lower than for most in all clusters, but also that all participants had a higher threshold for most than for many, regardless of cluster membership (see Figure 6a). While all participants treated most as the superlative of many, the distance between the thresholds of these quantifiers differed between clusters. The greatest distance was in the third subgroup.
Secondly, Pezzelle et al. (2018) showed substantial overlap in the production of most and many. Both quantifiers cover comparable proportions on the mental scale. In contrast, our results show individual differences in the distance on the mental line between most and many. For example, in the third cluster, the mean threshold for many was considerably lower than the mean threshold for most, while in the second cluster, both thresholds were close to 50%.
Lastly, Pezzelle et al. (2018) found that many is used less frequently than most. We think that the quantifier's vagueness could be one of the sources of the difference in frequency. The high perceived vagueness of many lowers its usefulness. The more vague the quantifier, the less information it conveys. However, participants try to be as informative as possible (Grice, 1975) and therefore avoid the usage of uninformative quantifiers with very flexible meanings. This explanation generates a new prediction to test in future work: participants who perceive many as vaguer should also use it less often in the production experiment.

Relationship between model parameters
Finally, we tested the relationships between vagueness, thresholds, and response error. We did not find significant correlations for threshold and vagueness between quantifiers, indicating that these parameters were quantifier-specific. In contrast, the response error parameter was significantly correlated across almost all quantifiers. The correlations were, however, stronger between negative than positive quantifiers because of the greater variation in response error in negative quantifiers. Due to this variation, only negative quantifiers contributed to clustering on response error. Response error, thus, reflects a combination of general task performance ability and specific difficulty in verification for negative quantifiers (Deschamps et al., 2015;Just & Carpenter, 1971;Schlotterbeck et al., 2020). We noted that the cluster with a higher rate of response error was small (N = 7), probably because the task was generally easy. It would be worth testing whether the response error parameter contributes more to clustering in a more challenging task, for example, with visual displays instead of sentences.
With regard to correlation between parameters for each quantifier, we did not find systematic patterns for the whole group of participants (see Figure 4). The only significant correlation between threshold and response error was for fewer than half. This correlation was, however, strongly affected by the outlier participants with a low threshold for fewer than half (see Figure 11 in Appendix A). This finding shows that the variation in thresholds reflects variation in the semantic representations and it is not an artefact of task performance.
The response error correlated positively with the vagueness parameter. However, the correlation was only significant for many and most. Moreover, participants from the cluster with higher vagueness for many also had a higher response error (Figure 7). As one could expect, the vaguer the quantifier, the more difficult it is to perform the task. In addition, the lack of systematic correlations between vagueness and response error shows that they correspond to two different processes that should be modeled as separate parameters in the model. The response error reflects the general cognitive mechanism and is affected by a quantifier's difficulty, while vagueness is a semantic property, which may correlate with the verification difficulties of a quantifier (in this study, e.g., most and many), but it cannot be equated with the number of errors (cf. Denić & Szymanik, 2020).
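The distinction between vagueness and response error can be made concrete with a schematic response function. The sketch below is an assumption for illustration only: the exact parameterization of the fitted model is not given in this section, so the functional form, the function name `p_true`, and all parameter values are hypothetical. It treats the threshold as the location of a logistic curve, vagueness as its width, and response error as a lapse rate that bounds the curve away from 0 and 1.

```python
import math

# Schematic three-parameter response function (an illustrative assumption,
# not the fitted model): threshold = location, vagueness = width of the
# logistic curve, error = lapse rate bounding the curve away from 0 and 1.

def p_true(proportion, threshold, vagueness, error, positive=True):
    """Probability of judging the sentence true at a given proportion."""
    direction = 1.0 if positive else -1.0  # negative quantifiers flip the curve
    z = direction * (proportion - threshold) / vagueness
    logistic = 1.0 / (1.0 + math.exp(-z))
    return error + (1.0 - 2.0 * error) * logistic

# Same threshold and lapse rate, different vagueness: only the steepness
# around the threshold changes, not the asymptotes.
for p in (0.30, 0.45, 0.50, 0.55, 0.70):
    sharp = p_true(p, threshold=0.5, vagueness=0.02, error=0.05)
    vague = p_true(p, threshold=0.5, vagueness=0.10, error=0.05)
    print(p, round(sharp, 3), round(vague, 3))
```

Under this assumed form, increasing vagueness flattens the transition around the threshold, whereas increasing the response error lowers accuracy even at proportions far from the threshold, which is why the two can be estimated as separate parameters.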
Finally, we found significant correlations between vagueness and threshold for many and few, but, importantly, not for most. This finding challenges the explanation proposed by Solt (2016), according to which participants verify most using the approximate strategy (Pietroski et al., 2009). Consequently, the verification of most is noisy around 50%. To reduce the noise, participants prefer thresholds significantly greater than 50%. This theory predicts that participants with higher thresholds for most will perceive it as a vaguer quantifier than subjects with lower thresholds. In our model, we captured the noisiness of verification in the vagueness parameter. The lack of significant correlation between vagueness and threshold for most does not support Solt's explanation. Instead, it suggests that some participants assigned different truth conditions to most and more than half.

Sources of individual differences
Our starting point for considering the individual differences in meaning representations of natural language quantifiers was the observation that language users can have different truth conditions for logical words. For example, previous studies (e.g., Spychalska et al., 2016) showed that two groups of speakers have different interpretations of the quantifier some. In this spirit, we demonstrated that this phenomenon is not limited to just one quantifier or to pragmatics. We showed that there are three subgroups of participants with different meaning representations for many, few, and most.
We argue that individual differences are not due to the various verification strategies used by participants. We think that this explanation is unlikely because the task design limited possible strategy choices. Participants verified the sentence with a quantifier by comparing their threshold to the proportion given as a number. Although the Approximate Number System (Dehaene, 1997) could have interfered with the precise number system, it is rather unlikely that participants were unable to precisely compare proportions. In our task, there was no time pressure on the decision and the proportions were displayed on the screen for an unlimited period of time. We feel confident in rejecting the explanations based on the variability in verification strategies as a source of observed individual differences in thresholds.
The individual differences in thresholds are also unlikely to be a result of the different cognitive abilities of our participants. We did not measure the working memory or executive function performance of participants, but our task was relatively easy and did not require much working memory or other cognitive function resources. Moreover, we included a response error parameter in our model, which accounted for variability in task performance (e.g., attention lapses or mistakes). We found that the majority of participants belonged to a low response error cluster, indicating that they performed the task at a similar level of accuracy. Altogether, we conclude that the differences in thresholds between groups are due to different representations of the truth conditions of quantifiers.

Conclusions
In the current study, we identified three clusters of participants assigning different meanings to vague quantifiers such as most, many, and few. We showed that these quantifiers have different positions on the mental scale in subgroups of participants. Moreover, we separated individual semantic differences in meaning representations, such as vagueness and threshold, from general cognitive abilities reflected in a response error parameter. Our findings are consistent with the claim that logical words can have various semantic representations for different speakers. We believe that our approach could be helpful for studying individual differences in the representation of not only quantifiers but also other function or content words.

Appendix A
Figure 11 illustrates how relationships between model parameters for each quantifier are affected by influential observations. We computed the Cook's distance using the ols_plot_cooksd_bar R function in the package olsrr (Hebbali, 2020).

Figure 11: The scatter plots illustrate the relationships between model parameters (Resp. error: response error) for each quantifier. The influential observations according to Cook's distance are indicated in red.
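For readers unfamiliar with Cook's distance, the sketch below computes it from first principles for a simple linear regression with one predictor. It is written in Python for illustration and is not the olsrr implementation; the data are hypothetical, and the 4/n cutoff used to flag influential observations is a common rule of thumb that is assumed here rather than taken from the analysis.

```python
# Toy sketch: Cook's distance for a simple linear regression y = a + b * x.
# Hypothetical data: five typical participants and one with a low threshold
# and a high response error.

def cooks_distance(xs, ys):
    n, p = len(xs), 2  # p = number of regression coefficients (intercept, slope)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    distances = []
    for x, e in zip(xs, residuals):
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx
        distances.append(e * e / (p * s2) * leverage / (1.0 - leverage) ** 2)
    return distances

threshold = [0.48, 0.49, 0.50, 0.51, 0.52, 0.30]
resp_error = [0.02, 0.03, 0.02, 0.03, 0.02, 0.20]

distances = cooks_distance(threshold, resp_error)
cutoff = 4 / len(threshold)  # assumed rule-of-thumb cutoff
influential = [i for i, d in enumerate(distances) if d > cutoff]
print([round(d, 3) for d in distances], influential)
```

In this toy data set the last observation combines high leverage with a strong pull on the fitted line, so it dominates the Cook's distances, which is the kind of pattern that can drive a correlation such as the one between threshold and response error for fewer than half.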