Feature inference with uncertain categorization: Re-assessing Anderson’s rational model

Konovalova, Elizaveta; Le Mens, Gaël

doi:10.3758/s13423-017-1372-y

Theoretical Review
Published: 18 September 2017

Volume 25, pages 1666–1681, (2018)
Psychonomic Bulletin & Review

Abstract

A key function of categories is to support predictions about unobserved features of objects. At the same time, humans are often in situations where the categories of the objects they perceive are uncertain. In an influential paper, Anderson (Psychological Review, 98(3), 409–429, 1991) proposed a rational model for feature inferences with uncertain categorization. A crucial feature of this model is the conditional independence assumption: it assumes that the within-category feature correlation is zero. In prior research, this model has been found to provide a poor fit to participants’ inferences. This evidence, however, is restricted to task environments inconsistent with the conditional independence assumption. Currently available evidence thus provides little information about how this model would fit participants’ inferences in a setting with conditional independence. In four experiments based on a novel paradigm and one experiment based on an existing paradigm, we assess the performance of Anderson’s model under conditional independence. We find that this model predicts participants’ inferences better than two competing models. The first assumes that inferences are based on just the most likely category. The second is insensitive to categories but sensitive to overall feature correlation. The performance of Anderson’s model is evidence that inferences were influenced not only by the more likely category but also by the other candidate category. Our findings suggest that a version of Anderson’s model which relaxes the conditional independence assumption will likely perform well in environments characterized by within-category feature correlation.

According to J. Anderson, ‘The basic goal of categorization is to predict the probability of various inexperienced features of objects’ (Anderson, 1991). At the same time, humans often find themselves in situations where the categories of the objects they perceive are uncertain. How do people make predictions about unobserved features of an object when the category of that object is uncertain?

A highly influential answer to this question is J. Anderson’s ‘rational model’ (Anderson, 1991). Consider a setting where an individual observes a feature of an object and makes a prediction about an unobserved feature of that object. The values of the two features are denoted by X (first feature) and Y (second feature). It is assumed that the individual has organized her knowledge of the domain in a set of categories $\mathcal {C}$.^{Footnote 1} According to Anderson’s model (AM), the probability that the value of the second feature is y when the individual knows that the value of the first feature is x is given by

$$ P(y \mid x)= \sum\limits_{c\in \mathcal{C}} P(c \mid x)P(y \mid c), $$

(1)

where P(c∣x) is the subjective probability that the object comes from category c given the observed feature value x and P(y∣c) is the probability that the second feature has value y given that the object belongs to category c. An important qualitative prediction of this model is that people take into account all the candidate categories when making an inference about the unobserved feature Y on the basis of the value of the observed feature X = x.

A large amount of empirical work has focused on testing this prediction. Existing findings are mixed. Some experimental evidence suggests that participants’ inferences are the same as those implied by a model that relies only on the most likely category given the observed feature (the ‘target category’) (Chen, Ross, & Murphy, 2014a, b; Malt, Ross, & Murphy, 1995; Murphy & Ross, 1994, 2010a; Murphy, Chen, & Ross, 2012; Ross & Murphy, 1996; Verde, Murphy, & Ross, 2005). Other experiments suggest that participants rely on more than just the target category (Chen et al., 2014a; Hayes & Chen, 2008; Hayes & Newell, 2009; Murphy & Ross, 2010a; Newell, Paton, Hayes, & Griffiths, 2010; Verde et al., 2005). Finally, still other experiments suggest that participants do not pay attention to categories at all but instead are sensitive to the overall feature correlation (Griffiths, Hayes, Newell, & Papadopoulos, 2011; Griffiths, Hayes, & Newell, 2012; Hayes, Ruthven, & Newell, 2007; Papadopoulos, Hayes, & Newell, 2011). Several recent papers have attempted to uncover the conditions under which people are more likely to rely on multiple categories or on just the target category. For example, Murphy and Ross (2010b) found that participants were more likely to use multiple categories when the most likely category gives an ambiguous inference, and less likely to do so when the most likely category gives an unambiguous inference. Chen et al. (2014a) found that participants’ inferences were likely to be influenced by multiple categories when the inference was implicit, whereas they were likely to be influenced by just the target category when the inference was explicit. Griffiths et al. (2012) found that participants’ inferences were more likely to be influenced by a single category when participants had been trained to classify stimuli before the feature induction task.

Despite the diversity of findings, the studies that analyzed the performance of Anderson’s model converge in showing that it provides a poor fit to experimental data. Central to this model is an assumption about the structure of the environment: it assumes that the within-category feature correlations are equal to 0 (this is the ‘conditional independence’ assumption). We believe this model can be seen as a ‘rational model’ only to the extent that this assumption is consistent with the structure of the actual task environment. We reviewed all prior experiments on feature inference with uncertain categorization (reported in the papers cited above) to check whether the task environments of these experiments were characterized by conditional independence. We found that this was the case in none of the previously published experiments.^{Footnote 2}

The poor performance of Anderson’s model in an environment without conditional independence suggests that people do not make this assumption in such environments (a point made by Murphy & Ross, 2010a). Yet, currently available evidence provides little information about how this model would fit participants’ inferences in a setting where conditional independence is satisfied. How well would Anderson’s model (AM) predict participants’ inferences in a task environment consistent with the conditional independence assumption?

At first sight, this question might seem moot. After all, Murphy and Ross (2010a) noted that there are many environments in which this assumption is not satisfied. For example, they argued that within-category feature correlation can result from large category differences. One example is sexual dimorphism in animals (Murphy & Ross 2010a, p. 14). Male deer are larger and have different coloration than females. Therefore, these features are correlated within the category ‘deer’. Similar feature correlations are present in consumer goods categories like books or computers. There is also evidence that people are aware of some within-category correlations (Malt and Smith 1984).

However, even if there are possibly few naturally occurring environments that satisfy conditional independence, it is important to assess the performance of Anderson’s model in such settings. This is because there is currently no rational model for environments where the conditional independence assumption does not hold. If Anderson’s model performs well under conditional independence (that is, when it can be seen as a ‘rational model’), this will suggest that an extension of this model to settings without conditional independence needs to be developed. Such a model is likely to perform well.

We analyzed the performance of Anderson’s model in a task environment characterized by conditional independence, consistent with this key assumption of the model. In five experiments, we found that the model performed better than other competing models. This finding is important because it suggests that people’s inferences can be influenced by several categories when making inferences under uncertain categorization. Although there already exists some evidence that this can be the case (e.g., Chen et al., 2014a; Griffiths et al., 2012; Murphy & Ross, 2010b), we explain below that such evidence is based on a design that does not allow distinguishing between two possible interpretations of the data: that participants ignore categories altogether or that categories influence inferences in a fashion close to what would be predicted by application of Bayes’ theorem. The results reported in this paper suggest the latter interpretation.

In the following, we describe the existing experimental paradigm that has been used by most of the literature on feature inference under uncertain categorization. We explain how its reliance on discrete-valued features limits its usefulness for assessing the performance of Anderson’s model. Then we introduce our adaptation of Anderson’s model to continuous environments and describe competing models. Subsequently, we report the performance of Anderson’s model in four experiments based on a novel paradigm with continuous features and one experiment based on the existing paradigm with discrete features. Finally, we discuss how our findings relate to prior research.

Existing paradigm - discrete features

In the experimental paradigm used in the vast majority of experiments that focused on feature prediction with uncertain categorization, participants are shown a set of items of various shapes and colors divided into a small number of categories, typically four (Murphy & Ross, 1994). They are then told that the experimenter has a drawing of a particular shape and are asked to predict its likely color (or are asked similar questions about the probability of an unobserved feature given an observed feature). An important characteristic of this paradigm is that the categories are shown graphically to the participants. The idea is to avoid complications related to memory and category learning by participants.

Suppose the two features are X and Y and there are four categories. Participants are asked to estimate P(y∣x), the proportion of items with Y = y out of items with X = x. There is some evidence that participants’ predictions are the same as those implied by a model that focuses on just the ‘target’ category, that is, the most likely category given the observed feature (Murphy & Ross, 1994). There is also some evidence that participants sometimes make predictions that are the same as those implied by a model that takes into account multiple categories (Murphy & Ross 2010a). Still, other experiments have found evidence that participants do not pay attention to categories at all but instead are sensitive to the overall feature correlation (Hayes et al. 2007; Papadopoulos et al. 2011; Griffiths et al. 2012).

A limitation of this paradigm pertains to the fact that the features are discrete-valued. This implies that a model that ignores categories altogether and a model that makes optimal use of the categories make exactly the same predictions. This is a consequence of the law of total probability. In this case, we have

$$ P(y \mid x) = \sum\limits_{c=1}^{4} P(c \mid x) P(y \mid c x), $$

(2)

where P(c∣x) is the proportion of items belonging to c out of all the items such that X = x, and P(y∣c x) is the proportion of items with Y = y out of the items that both are in c and have X = x.

In settings where there is conditional independence, we have P(y∣c x) = P(y∣c) and thus the above equation can be rewritten as:

$$ P(y \mid x) = \sum\limits_{c=1}^{4} P(c \mid x) P(y \mid c). $$

(3)

In order to estimate P(y∣x), a participant that would ignore the categories would consider all objects with X = x and would respond with the proportion of objects with y among all objects with x. A participant that would consider all four categories would compute the proportion of items with y among the items with x in each category and then would compute the weighted average by multiplying each of these numbers by her estimates of P(c∣x). The responses given by the two participants would be exactly the same. It is therefore difficult to assess whether the participants use multiple categories (but see Murphy & Ross, 2010a for an attempt to do so using post-prediction questions). When features are continuous, however, the predictions of these two strategies differ.
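The equivalence described above can be checked on a small example. The sketch below uses a hypothetical set of items (the categories, shapes, and colors are invented for illustration and are not taken from any experiment) and exact fractions. It computes the answer of a participant who ignores categories, the answer given by Eq. 2, and, for contrast, the answer given by Eq. 3 when the items do not satisfy conditional independence.

```python
from fractions import Fraction

# Hypothetical toy items as (category, x, y); not taken from any experiment.
ITEMS = [
    ("A", "square", "red"), ("A", "square", "red"), ("A", "circle", "blue"),
    ("B", "square", "blue"), ("B", "circle", "blue"), ("B", "square", "red"),
    ("C", "circle", "red"), ("C", "square", "blue"),
]

def proportion(items, predicate):
    """Exact fraction of `items` that satisfy `predicate`."""
    return Fraction(sum(1 for it in items if predicate(it)), len(items))

def p_ignoring_categories(x, y):
    """Proportion of items with Y = y among all items with X = x."""
    with_x = [it for it in ITEMS if it[1] == x]
    return proportion(with_x, lambda it: it[2] == y)

def p_all_categories(x, y):
    """Eq. 2: sum over c of P(c | x) * P(y | c, x)."""
    with_x = [it for it in ITEMS if it[1] == x]
    total = Fraction(0)
    for c in sorted({it[0] for it in with_x}):
        in_c_with_x = [it for it in with_x if it[0] == c]
        p_c_given_x = Fraction(len(in_c_with_x), len(with_x))
        total += p_c_given_x * proportion(in_c_with_x, lambda it: it[2] == y)
    return total

def p_conditional_independence(x, y):
    """Eq. 3: sum over c of P(c | x) * P(y | c), ignoring x within each category."""
    with_x = [it for it in ITEMS if it[1] == x]
    total = Fraction(0)
    for c in sorted({it[0] for it in with_x}):
        in_c = [it for it in ITEMS if it[0] == c]
        p_c_given_x = proportion(with_x, lambda it: it[0] == c)
        total += p_c_given_x * proportion(in_c, lambda it: it[2] == y)
    return total

print(p_ignoring_categories("square", "red"))       # 3/5
print(p_all_categories("square", "red"))            # 3/5
print(p_conditional_independence("square", "red"))  # 1/2
```

In this toy set the first two quantities coincide, as the law of total probability requires, whereas Eq. 3 gives a different value because shape and color are correlated within categories.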

Below, we describe a version of Anderson’s model adapted to a continuous environment and report four experiments designed to test this model. We return to the discrete environment setup in Experiment 5 and the General Discussion section.

Rational feature inferences in a continuous environment

Representing mental categories

We depart from the prior literature on feature inference with uncertain categorization by focusing on a setting with continuously valued (as opposed to discrete) features. Following recent work, we model mental categories using probability distribution functions (pdfs) on the feature space (Ashby & Alfonso-Reese, 1995; Sanborn, Griffiths, & Shiffrin, 2010). Let $c \in \mathcal {C}$ be a category. We denote by f(x,y∣c) the value of the associated pdf at position (x,y) in the feature space, where x denotes the value of the first feature and y denotes the value of the second feature. This pdf denotes the prior belief of the individual over positions given that she knows that an object is from category c.

For simplicity, in what follows we assume there are two relevant categories ($\mathcal {C}\,=\,\{1,2\}$), each represented by a bivariate normal distribution (Ashby and Alfonso-Reese 1995):

$$ \left( \begin{array}{c} X_{c} \\ Y_{c} \end{array}\right) \sim N \left( \left( \begin{array}{c} \mu_{xc} \\ \mu_{yc} \end{array}\right); \left( \begin{array}{cc} \sigma^{2}_{xc} & 0 \\ 0 & \sigma^{2}_{yc} \end{array}\right)\right), $$

(4)

where $\mu_{xc}$ and $\mu_{yc}$ are the category means for the two features, and $\sigma_{xc}$ and $\sigma_{yc}$ are the standard deviations. Consistent with the conditional independence assumption, the within-category feature correlation is zero. See Fig. 1 for an example.

Anderson’s rational model (AM)

By adapting Eq. 1 to this continuous setting, we express the posterior on the second feature given the value of the first feature:

$$ f(y \mid x)= \sum\limits_{c\in \mathcal{C}} P(c \mid x)f(y \mid c), $$

(5)

where P(c∣x) is the subjective probability that the object comes from category c given that the first feature is observed to have value x and f(y∣c) is the marginal distribution of the second feature, conditional on the fact that the object is a c.

Anderson’s model assumes that the subjective probabilities of the candidate categories are given by Bayes’ theorem:

$$ P(c \mid x) = \frac{P(c)f(x \mid c)}{f(x)}= \frac{P(c){\int}_{v} f(x,v \mid c)dv}{{\sum}_{c\in \mathcal{C}} P(c){\int}_{v} f(x,v \mid c)dv}, $$

(6)

where P(c) is the prior on the category.

In the special case with two categories and normally distributed category pdfs, we have:

$$ f(y \mid x)= P(c_{1}\mid x) f_{\mu_{y1},\sigma_{y1}}(y) + P(c_{2}\mid x)f_{\mu_{y2},\sigma_{y2}}(y) , $$

(7)

where $f_{\mu _{y},\sigma _{y}}$ denotes the density of a normal distribution with mean $\mu_{y}$ and standard deviation $\sigma_{y}$, $P(c_{2}\mid x) = 1 - P(c_{1}\mid x)$, and

$$ P(c_{1}\mid x)=\frac{1}{1+e^{ax^{2}-bx+c}}, $$

(8)

with

$$\begin{array}{@{}rcl@{}} a&=&\frac{\sigma_{x2}^{2}-\sigma_{x1}^{2}}{2\sigma_{x2}^{2}\sigma_{x1}^{2}}, \\ b&=&\frac{\sigma_{x2}^{2}\mu_{x1}-\sigma_{x1}^{2}\mu_{x2}}{\sigma_{x2}^{2}\sigma_{x1}^{2}}, \\ c&=&\frac{\sigma_{x2}^{2}\mu_{x1}^{2}-\sigma_{x1}^{2}\mu_{x2}^{2}}{2\sigma_{x2}^{2}\sigma_{x1}^{2}}+ \log\frac{\sigma_{x1}}{\sigma_{x2}}+\log\frac{P(c_{2})}{P(c_{1})}. \end{array} $$

We assume that the priors on the two categories, $P(c_{1})$ and $P(c_{2})$, are both equal to 0.5.
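As an illustration of the two-category case, the following sketch computes P(c₁∣x) directly from Bayes’ theorem (Eq. 6) and then the mixture posterior of Eq. 7. The category means and standard deviations are hypothetical values chosen for illustration only; they are not the parameters used in the experiments. Under conditional independence, the integral over the second feature in Eq. 6 reduces to the marginal normal density of x, which is what the code uses.

```python
import math

def normal_pdf(v, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical category parameters (illustrative only, not the paper's values).
CAT1 = {"mu_x": 55.0, "sigma_x": 8.5, "mu_y": 75.0, "sigma_y": 3.0, "prior": 0.5}
CAT2 = {"mu_x": 75.0, "sigma_x": 8.5, "mu_y": 55.0, "sigma_y": 3.0, "prior": 0.5}

def posterior_c1(x, c1=CAT1, c2=CAT2):
    """Eq. 6 for two categories: P(c1 | x) from the priors and marginal x-densities."""
    w1 = c1["prior"] * normal_pdf(x, c1["mu_x"], c1["sigma_x"])
    w2 = c2["prior"] * normal_pdf(x, c2["mu_x"], c2["sigma_x"])
    return w1 / (w1 + w2)

def am_density(y, x, c1=CAT1, c2=CAT2):
    """Eq. 7: mixture of the two y-marginals weighted by P(c | x)."""
    p1 = posterior_c1(x, c1, c2)
    return (p1 * normal_pdf(y, c1["mu_y"], c1["sigma_y"])
            + (1.0 - p1) * normal_pdf(y, c2["mu_y"], c2["sigma_y"]))

# Category weight and posterior density at y = 75 for a few observed x values.
for x in (55.0, 65.0, 70.0):
    print(x, round(posterior_c1(x), 3), round(am_density(75.0, x), 4))
```

With these assumed parameters the two categories are equally likely at x = 65, and the posterior on y is bimodal for x values near that point.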

Competing models

Prior literature suggests that people frequently focus on the most likely category and that they sometimes ignore categories altogether but are sensitive to the overall feature correlation. We describe translations of these perspectives to the continuous environment.

Single category - independent features (SCI)

We refer to the most likely category given the observed feature (x) as the ‘target’ category (this is category 1 if $P(c_{1}\mid x) > .5$, as given by Eq. 8). The posterior has the same structure as in Anderson’s model, but with all the weight on the target category ($c^{*}$). In this case, $ f(y \mid x)= f_{c^{*}}(y \mid x), $ where $f_{c^{*}}= f_{\mu _{y1},\sigma _{y1}}$ if the target category is category 1, and $f_{c^{*}}= f_{\mu _{y2},\sigma _{y2}}$ otherwise. The ‘switch’ is situated where x is such that $P(c_{1}\mid x) = .5$. In the rest of the paper, we refer to this value as the ‘boundary’.

Linear model (LM)

Prior literature considered the ‘feature conjunction’ approach as a model that is sensitive to the overall statistical association between the two features across objects, independently of categorical boundaries. This model simply computes the empirical probability of the unobserved feature given the observed feature based on all the data, ignoring categorical boundaries. A direct analogue in the continuous setting does not exist because the agent might have to infer Y conditional on an x value to which she has never been exposed. This observation implies that a model that ‘regularizes’ the available observations is in order. This could be a parametric model or a non-parametric exemplar model that weights prior observations based on their similarity to the stimulus (Ashby and Alfonso-Reese 1995; Nosofsky 1986). For the sake of simplicity, we analyze a linear model. This is the simplest model that takes into account the overall feature correlation:

$$ f(y \mid x)= f_{a_{0}+a_{1}x,\sigma_{l}}(y), $$

(9)

where $f_{a_{0}+a_{1}x,\sigma _{l}}(y)$ denotes a normal pdf with mean $a_{0} + a_{1}x$ and standard deviation $\sigma_{l}$. The parameters are the coefficients of the best-fitting linear model based on the observed samples from the two categories.
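A minimal sketch of how the parameters of Eq. 9 could be obtained is given below. It fits an ordinary least-squares line to pooled exemplars, ignoring category labels. The data points are hypothetical, and the choice of the maximum-likelihood residual standard deviation (dividing by n) is an assumption, since the estimator is not specified here.

```python
import math

def fit_linear_model(xs, ys):
    """Ordinary least squares for y = a0 + a1 * x, plus the residual SD."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
    # Assumption: maximum-likelihood residual SD (divide by n, not n - 2).
    sigma_l = math.sqrt(sum(r * r for r in residuals) / n)
    return a0, a1, sigma_l

def lm_density(y, x, a0, a1, sigma_l):
    """Eq. 9: normal density in y with mean a0 + a1 * x and SD sigma_l."""
    z = (y - (a0 + a1 * x)) / sigma_l
    return math.exp(-0.5 * z * z) / (sigma_l * math.sqrt(2.0 * math.pi))

# Hypothetical pooled exemplars from both categories (category labels dropped).
xs = [50.0, 55.0, 60.0, 70.0, 75.0, 80.0]
ys = [76.0, 74.0, 75.0, 56.0, 54.0, 55.0]
a0, a1, sigma_l = fit_linear_model(xs, ys)
print(round(a0, 2), round(a1, 3), round(sigma_l, 2))
print(lm_density(65.0, 65.0, a0, a1, sigma_l))
```

Because one category has low x and high y while the other has high x and low y, the fitted slope is negative even though the features are uncorrelated within each category.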

Decision rule

The outputs of all three models, as described above, are posterior distributions: subjective probability distributions over the value of the second feature (Y) given the observed feature (X). To make empirical predictions about human inferences, we need to specify how this posterior distribution translates into responses. In analyses of our experimental results, we will assume that the response is a random draw from the posterior distribution—this is a ‘probability matching’ decision rule. Other decision rules are theoretically possible. They would lead to different model predictions. We return to this issue in the General Discussion section of the paper.
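The probability-matching rule can be sketched as follows for a two-component posterior such as the one in Eq. 7. The weights, means, and standard deviations below are hypothetical values used only for illustration. Drawing from the mixture amounts to first picking a category with probability equal to its weight and then drawing from that category’s normal marginal.

```python
import random

def draw_response(p_c1, mu_y1, sigma_y1, mu_y2, sigma_y2, rng):
    """One probability-matching response: a random draw from the posterior
    p_c1 * N(mu_y1, sigma_y1) + (1 - p_c1) * N(mu_y2, sigma_y2)."""
    if rng.random() < p_c1:
        return rng.gauss(mu_y1, sigma_y1)
    return rng.gauss(mu_y2, sigma_y2)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical values: category 1 has weight 0.2 and a typical y of 75,
# category 2 has weight 0.8 and a typical y of 55; both SDs are set to 3.
draws = [draw_response(0.2, 75.0, 3.0, 55.0, 3.0, rng) for _ in range(10_000)]
share_near_c2 = sum(1 for y in draws if y < 65.0) / len(draws)
print(round(share_near_c2, 2))
```

Setting the weight to 0 or 1 gives the corresponding rule for the single category model, and a single call to `rng.gauss` with the mean and standard deviation of Eq. 9 gives the rule for the linear model.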

Experiment 1

Participants faced a feature inference task that closely matches the setting of the previous section. Following standard practice in the study of feature inference with uncertain categorization, we used a ‘decision-only’ paradigm: participants were provided with a graphical depiction of the categories which remained visible when they made inferences about the second feature on the basis of the value of the first feature. We adopted this design to avoid issues related to memory.

Design

Our experiment used artificial categories to avoid the influence of domain-specific prior knowledge. We asked the participants to assume they were biochemists who studied the levels of two hormones in blood samples coming from two categories of animals (e.g., Kemp, Shafto, & Tenenbaum, 2012). The hormones were called ‘Rexin’ and ‘Protropin’ and the two categories of animals were ‘Mouse’ and ‘Rat.’ We provided the participants with visual representations of the categories in the form of scatter plots of exemplars of the two categories (see Fig. 1). In addition, participants went through a learning procedure designed to familiarize them with the position of the categories in feature space (see Supplementary Material). In the judgment stage, participants were asked to infer, without feedback, the likely level of Protropin, based on the level of Rexin, for 48 blood samples that did not indicate the animal they came from (the category was thus uncertain). The question was ‘What is the likely level of Protropin in this blood sample?’ Participants answered using a slider scale with minimal value 40, maximal value 90, and increments of 1 unit.

Thirty participants recruited via Amazon Mechanical Turk completed the experiment for a flat participation fee.^{Footnote 3}

Model predictions

Figure 2 depicts the posterior distributions, f(y∣x), implied by the three competing models. The posterior for AM is based on Eq. 7 and the Bayesian category weights given by Eq. 8. The posterior for SCI is based on Eq. 7 and the all-or-nothing category weights. The parameters are the coefficients used to generate the categories (see the legend of Fig. 1). The posterior for LM is based on Eq. 9. The parameters are the coefficients of the best-fitting linear model based on all the dots depicted in Fig. 1 (irrespective of their categories). With the stimuli used in the experiment, we have $a_{0} = 94.8$, $a_{1} = -0.45$, and $\sigma_{l} = 5.7$.

The crucial difference between the predictions of AM and SCI lies in the region around the x value at which both categories are equally likely (Rexin level of 65). Consider a Rexin level of 60. According to SCI, only high levels of Protropin (close to 75) are likely (the ones corresponding to the ‘Mouse’ category). According to AM, however, both high (close to 75) and low (close to 55) levels of Protropin are likely.

Results

Parameter-free model comparison

Here we assume that the model parameters for AM and SCI are the coefficients used to generate the categories and that the parameters for LM are those of the best fitting regression line, just as in Fig. 2. We computed the log-likelihood fit of each model on a participant-by-participant basis.^{Footnote 4} Anderson’s model (AM) is the best fitting model for the majority of participants (74% of them, see Table 1). Figure 3 shows the inferences of all participants as well as the log-likelihood of the three models for each participant.
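A hedged sketch of such a participant-level comparison is shown below. The boundary (65), the typical Protropin levels (75 and 55), and the linear-model coefficients come from the text. The within-category standard deviation `SIGMA_Y`, the logistic `SLOPE`, and the responses in `trials` are hypothetical stand-ins, and the equal-variance form of P(Rat∣x) is an assumption. The exact likelihood computation used in the paper may differ from this plain sum of log densities.

```python
import math

def normal_pdf(v, mu, sigma):
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# From the text: boundary at 65, typical Protropin of 75 (Mouse) and 55 (Rat).
# Hypothetical: SIGMA_Y, and SLOPE, chosen so that P(Rat | x = 70) = 0.8
# under the assumption of equal x-variances in the two categories.
BOUNDARY, MU_MOUSE, MU_RAT, SIGMA_Y = 65.0, 75.0, 55.0, 3.0
SLOPE = math.log(4.0) / 5.0

def p_rat(x):
    """P(Rat | x): Eq. 8 with equal variances reduces to a logistic curve."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (x - BOUNDARY)))

def am(y, x):
    """Anderson's model: both categories weighted by their probability."""
    w = p_rat(x)
    return w * normal_pdf(y, MU_RAT, SIGMA_Y) + (1.0 - w) * normal_pdf(y, MU_MOUSE, SIGMA_Y)

def sci(y, x):
    """Single category model: all weight on the more likely category."""
    return normal_pdf(y, MU_RAT if p_rat(x) > 0.5 else MU_MOUSE, SIGMA_Y)

def lm(y, x):
    """Linear model with the coefficients reported in the text."""
    return normal_pdf(y, 94.8 - 0.45 * x, 5.7)

def log_likelihood(model, trials):
    """Sum of log densities of the responses y given the shown x values."""
    return sum(math.log(model(y, x)) for x, y in trials)

# Hypothetical responses of one participant: (Rexin shown, Protropin inferred).
trials = [(50.0, 76.0), (58.0, 74.0), (62.0, 57.0), (68.0, 73.0), (72.0, 54.0), (80.0, 56.0)]
scores = {name: log_likelihood(m, trials) for name, m in (("AM", am), ("SCI", sci), ("LM", lm))}
best = max(scores, key=scores.get)
print(best, {name: round(v, 1) for name, v in scores.items()})
```

In this invented example two responses fall on the ‘wrong’ side of the boundary, which the single category model penalizes heavily while the mixture in Anderson’s model accommodates them.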

Table 1 Percentage of participants whose feature predictions were best fit by each of the candidate models

Comparison of models with parameters estimated participant-by-participant

The comparison of the parameter-free models implicitly assumes that the participants perceived the categories accurately (i.e., the parameters of their category pdfs were exact). This might not have been the case, however. For example, participants might have misjudged the position of the point where categories are equally likely (x = 65). Inspection of Fig. 3 reveals that the perceived position of this ‘boundary’ is essential to the performance of the single category model (SCI). A slight error in the perceived boundary leads to a strong log-likelihood penalty, even when the participant did not use multiple categories. For example, participant #4 made predictions that are clearly indicative of a focus on just one category, since the predictions correspond to the median y-level for ‘Mouse’ (the category on the left) when x is low and to the median y-level for ‘Rat’ (the category on the right) when x is high. But the participant did not switch between categories exactly at the ‘boundary’ of x = 65. This implies a strong penalty to the likelihood of the single category inference (SCI) model. A strict version of the SCI model discussed in the prior literature (e.g. Murphy & Ross, 1994) is thus a poor performer in our task. To give a better chance to the SCI model and account for possible misperception of the categories, we estimated the parameters of each model on a participant-by-participant basis (by maximizing the likelihood).^{Footnote 5}

Table 2 reports the mean estimated parameter values (the mean was estimated across participants for whom the focal model is the best). The average parameter estimates are close to the true values for both the rational and the single category model. This suggests that participants used the categories we intended them to use. Models were compared in terms of the BIC criterion. For 60% of the participants, AM provides the best fit while SCI provides the best fit for the rest of the participants.
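For reference, the BIC trades off fit against the number of free parameters. The sketch below applies the standard formula, BIC = k ln(n) − 2 ln(L), with n = 48 judgments per participant. The maximized log-likelihoods and the parameter counts are hypothetical placeholders, since the per-model parameter counts are not restated here.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

N_OBS = 48  # judgments per participant, as stated in the Design section
# Hypothetical maximized log-likelihoods and parameter counts for one participant.
fits = {"AM": (-150.0, 6), "SCI": (-158.0, 5), "LM": (-170.0, 3)}
scores = {name: bic(ll, k, N_OBS) for name, (ll, k) in fits.items()}
best = min(scores, key=scores.get)
print(best, {name: round(v, 1) for name, v in scores.items()})
```

A model with more parameters is preferred only if its gain in log-likelihood exceeds half of ln(n) per extra parameter, which is about 1.94 for n = 48.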

Table 2 Estimated model parameters

Analyses of the ‘switching’ behavior of the participants

A crucial prediction of Anderson’s rational model (AM) is that in the area where the ‘target’ category is uncertain (around the boundary at x = 65), there are oscillations between the typical level of Protropin (y-axis) for mice (about 75) and rats (about 55). Consider one participant in the experiment. Suppose the participant has to make an inference about a blood sample with a level of Rexin of 70. The probability that this sample is from a Rat is about 0.8. Anderson’s model predicts that in 80% of the cases, a participant facing this situation will give a response close to 55 (typical Protropin level for a Rat sample) and that in about 20% of the cases, she will give a response close to 75 (typical Protropin level for a Mouse sample). If we collect many such judgments in the area where the ‘target’ category is uncertain, we should expect that some inference values will be close to 55 and others close to 75. The top row of Fig. 4 shows the inferences of ten simulated participants who follow Anderson’s model. All these simulated participants show oscillations between Protropin levels around 55 and 75 (the x values used for the simulation are the same as those used in the experiments, without any instance of x = 65).

By contrast, no such oscillation is implied by the single category model (SCI). In this case, there is just one ‘switch’ at the boundary (x = 65). The bottom row of Fig. 4 shows the inferences of ten simulated participants who are assumed to follow the single category model. Inferences are close to 75 for Rexin levels lower than 65 and close to 55 for Rexin levels higher than 65.

Instances of the two distinct inference patterns can clearly be seen on the graphs depicting the inferences of the participants in the experiment (Fig. 3). For example, participant 4 switched exactly once at the ‘boundary’ whereas participant 8 switched many times between the two modal Protropin levels of 55 and 75. Our participant-by-participant model estimations identified this difference since the single-category model provides the best fit to the inferences of Participant 4 whereas Anderson’s model provides the best fit to the inferences of Participant 8.

In Experiment 1 there are 12 (40%) participants with exactly one switch. This is very close to the number of participants best fit by the single category model (with parameters estimated participant-by-participant – see Table 1).
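One simple way to count switches is sketched below. The rule is an assumption made for illustration (the exact counting procedure is not spelled out here): responses are ordered by Rexin level, each response is labeled ‘high’ or ‘low’ relative to the midpoint between the two typical Protropin levels, and a switch is any change of label between consecutive responses. The response sequences are hypothetical.

```python
def count_switches(trials, midpoint=65.0):
    """Count changes between 'high' and 'low' responses as x increases.

    `trials` is a list of (x, y) pairs. Assumption: a response counts as 'high'
    (Mouse-like) when y exceeds `midpoint`, the value halfway between the typical
    Protropin levels of 75 and 55, and as 'low' (Rat-like) otherwise. This
    midpoint on the y-axis happens to equal the boundary value on the x-axis.
    """
    ordered = sorted(trials)  # sort by x (ties broken by y)
    labels = [y > midpoint for _, y in ordered]
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)

# Hypothetical response sequences as (Rexin shown, Protropin inferred).
single_category_like = [(50, 76), (58, 74), (63, 75), (67, 56), (72, 54), (80, 55)]
oscillating = [(50, 76), (58, 55), (63, 75), (67, 74), (72, 54), (80, 55)]
print(count_switches(single_category_like), count_switches(oscillating))  # 1 3
```

A participant with exactly one switch produces the step-like pattern implied by the single category model, whereas several switches correspond to the oscillations implied by Anderson’s model.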

Discussion

Most participants’ inferences are better explained by Anderson’s rational model (AM) than by the single category model (SCI) and the linear model (LM). In this experiment, we provided participants with a visual representation of the categories. One might wonder if this design captures the psychological process that underlies inferences with uncertain categorization when such a graphical representation is not available at the time of the inference. It could be that participants engaged in some elaborate form of curve fitting on the basis of the graphs we showed them. We address this potential concern in Experiments 2 & 3.

Experiments 2 & 3

The experiments follow a design similar to Experiment 1. We recruited 30 participants via Amazon Mechanical Turk for each experiment.

Design of Experiment 2

The only change in comparison to Experiment 1 is that we removed the graphical representation of the categories (i.e., the graph of Fig. 1) on the screens on which participants made judgments (during learning and test stages). The graph was shown in the instructions and before every judgment, but not on the judgment screen.

Design of Experiment 3

In this experiment, participants never saw any graphical representation of the data. They learnt the categories from experience by first seeing 40 exemplars of both categories (Rexin and Protropin values), then making within-category inferences (of Protropin level based on Rexin level) with feedback and categorizations of blood samples as Rat or Mouse, based on Rexin level (see Supplementary Material).

Results Experiments 2 & 3

Parameter-free model comparison

Removing the graphical depiction of the data did not drastically change the pattern of results. Just as in Experiment 1, Anderson’s model is the best for the majority of the participants in both experiments (Table 1).

Comparison of models with parameters estimated participant-by-participant and switching behavior

In comparisons based on the BIC, Anderson’s model (AM) is by far the best fitting model (Table 2). As in Experiment 1, the single-category model (SCI) performs better in this comparison than in the comparison of parameter-free models, but worse than Anderson’s model.

In Experiments 2 & 3 there are 4 (13%) and 8 (28%) participants with exactly one switch (see also participant-by-participant inferences in Figures S3 & S4 in the Supplementary Material). These numbers closely reflect the performance of the single-category model (with parameters estimated participant-by-participant).

Discussion of Experiments 1–3

Taken together, Exp. 1–3 show that Anderson’s model (AM) provides a better fit to the data than the single category model (SCI). The linear model provides a very poor fit to the data. This pattern of results is consistent across experiments, in which we varied the information available about the categories. Whether participants saw a graphical representation of the categories in feature space at the time of inference (Exp. 1), saw it in the learning stage but not in the inference stage (Exp. 2), or never saw it (Exp. 3), the patterns of feature inferences were similar.

We would like to claim that good performance of Anderson’s model is evidence for the integration of information across categories. In other words, we would like to claim that people use a cognitive algorithm of the following kind:

1. observe X = x;
2. compute the posterior distribution f(y∣x) according to Eq. 7;
3. provide an estimate of Y by generating a random draw from the posterior.

Because the posterior depends on the marginal distributions of the unobserved features for both the target and the non-target category, we call this cognitive algorithm ‘AM non-target’.

The evidence gathered so far does not unequivocally show that participants used this kind of cognitive algorithm. The reason is that the results of Exps. 1–3 are also compatible with a noisy version of the single category inference (SCI) model. Suppose a participant uses SCI, but is uncertain about the location of the boundary at which the ‘Rat’ category becomes more likely than the ‘Mouse’ category. Let β denote the uncertain position of the boundary on the x-axis. Suppose inferences about y are produced by the following cognitive algorithm:

1. observe X = x and estimate the position of the boundary β;
2. evaluate if x < β or if x > β;
3. if x < β, select the ‘Mouse’ category; else select the ‘Rat’ category. Denote the selected category by $c^{*}$;
4. provide an estimate of Y given X = x by producing an intuitive estimate of the mode of the posterior distribution conditional on the selected category, $f(y \mid c^{*})$.

Assume, moreover, that the uncertainty is such that the participant’s belief about the location of this boundary is represented by a probability density $g(\beta )=\frac {\partial P(Rat \mid \beta )}{\partial \beta }$. In this case, the inferences produced by this algorithm are compatible with Anderson’s rational model: P(Rat∣β) follows Eq. 8, assuming c₁ = Rat and x = β (see Supplementary Material for an explicit formulation of g(β)). Figure 5 depicts the density of the uncertain boundary, g(β). It is a unimodal symmetric distribution centered at the mid-point between the two categories (x = 65). We will refer to this algorithm as ‘SCI with uncertain boundary’.
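The compatibility claim can be illustrated with a small simulation of the category-selection steps of this algorithm. The sketch assumes equal x-variances in the two categories, so that Eq. 8 reduces to a logistic curve in x, and it sets the slope so that P(Rat∣x = 70) = 0.8, the value quoted in Experiment 1. Both choices are assumptions made for illustration. Because g(β) is the derivative of P(Rat∣β), the cumulative distribution function of β is P(Rat∣β) itself, and β can be drawn by inverting it.

```python
import math
import random

BOUNDARY = 65.0
# Assumption: equal x-variances, so Eq. 8 is logistic in x; the slope is
# chosen so that P(Rat | x = 70) = 0.8, the value quoted in the text.
SLOPE = math.log(4.0) / 5.0

def p_rat(x):
    """P(Rat | x) under the assumptions above."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (x - BOUNDARY)))

def draw_boundary(rng):
    """Draw beta from g(beta) by inverting its CDF, which is P(Rat | beta)."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return BOUNDARY + math.log(u / (1.0 - u)) / SLOPE

def select_category(x, rng):
    """Steps 1-3 of the algorithm: estimate beta, then pick the category."""
    beta = draw_boundary(rng)
    return "Mouse" if x < beta else "Rat"

rng = random.Random(1)  # fixed seed so the sketch is reproducible
x = 70.0
n = 20_000
share_rat = sum(select_category(x, rng) == "Rat" for _ in range(n)) / n
print(round(share_rat, 3), round(p_rat(x), 3))
```

Across many simulated judgments, the share of ‘Rat’ selections at a given x approaches P(Rat∣x), which is the category weight used by Anderson’s model. This is why the two algorithms are hard to tell apart from the response patterns alone.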

If participants whose inferences are best fit by Anderson’s model rely on the ‘SCI with uncertain boundary’ cognitive algorithm, then eliminating the uncertainty about the boundary should reduce the fit of Anderson’s model as compared to SCI. This should not happen if people integrate information from both categories in inferring the unobserved feature value, that is, if they use the ‘AM non-target’ algorithm. We designed an experiment that relied on these predictions.

Experiment 4

Experiment 4 used a design identical to Exp. 1, with one change: we provided participants with information that ruled out subjective uncertainty about the boundary at which one category becomes more likely than the other. Consider the following two hypotheses:

H1: People whose inferences are best fit by Anderson’s rational model use the ‘SCI with uncertain boundary’ cognitive algorithm.
H2: People whose inferences are best fit by Anderson’s rational model use the ‘AM non-target’ cognitive algorithm.

If hypothesis H1 is true, then removing the subjective uncertainty about the boundary should lead to inferences consistent with the SCI model. Therefore, under this hypothesis, the SCI model should provide a better fit to participants’ inferences than in Exp. 1. If hypothesis H2 is true, removing subjective uncertainty about the boundary should not lead to a relative increase in the performance of the SCI model.