A rational model of function learning

Lucas, Christopher G.; Griffiths, Thomas L.; Williams, Joseph J.; Kalish, Michael L.

doi:10.3758/s13423-015-0808-5


Theoretical Review
Published: 03 March 2015

Volume 22, pages 1193–1215, (2015)

Psychonomic Bulletin & Review


Abstract

Theories of how people learn relationships between continuous variables have tended to focus on two possibilities: one, that people are estimating explicit functions, or two that they are performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, which provide a probabilistic basis for similarity-based function learning, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a rational model of human function learning that combines the strengths of both approaches and accounts for a wide variety of experimental results.



Every time we get into a rental car, we have to learn how hard to press the gas pedal for a given amount of acceleration. Solving this problem—which is an important part of driving safely—requires learning a relationship between two continuous variables. Over the past 50 years, several studies of function learning have shed light on how people come to understand continuous relationships (Carroll 1963; Brehmer 1971; 1974; Koh and Meyer 1991; Busemeyer et al. 1997; DeLosh et al. 1997; Kalish et al. 2004; McDaniel and Busemeyer 2005). It has become clear that people can learn and recall a wide variety of relationships, but demonstrate certain systematic biases that tell us about the mental representations and implicit assumptions that humans employ when solving function learning problems. For example, people tend to expect that relationships will be linear when extrapolating to novel examples (DeLosh et al. 1997), and find it more difficult to learn relationships that change direction than those that do not (Brehmer 1974; Byun 1995).

Several models have been developed to understand the cognitive mechanisms behind function learning. These models tend to fall into two different theoretical camps. The first includes rule-based theories (e.g., Carroll, 1963; Brehmer, 1974; Koh and Meyer, 1991), which suggest that people learn an explicit function from a given family, such as polynomials (Carroll 1963; McDaniel and Busemeyer 2005) or power-law functions (Koh and Meyer 1991). This approach attributes rich representations to human learners, but has traditionally given limited treatment to how such representations could be acquired. A second approach includes similarity-based theories (e.g., DeLosh et al., 1997; Busemeyer et al., 1997), which focus on the idea that people learn by forming associations: if x is used to predict y, observations with similar x values should also have similar y values. This approach can be straightforwardly implemented in a connectionist architecture and thus gives an account of the underlying learning mechanisms, but faces challenges in explaining how people generalize so broadly beyond their experience. Most recently, hybrids of these two approaches have been proposed (e.g., Kalish et al., 2004; McDaniel and Busemeyer, 2005), with an associative learning process that acts on explicitly represented functions.

Almost all past research on computational models of function learning has been oriented towards understanding the psychological processes that underlie human performance, or the steps by which people update and deploy their mental representations of continuous relationships. In this paper, we take a different approach, presenting a rational analysis of function learning in the spirit of Anderson (1990), Marr (1982), and Shepard (1987). Specifically, we start with an abstract representation of the problem to be solved and a handful of additional assumptions about the nature of continuous relationships, and then explore optimal solutions to the problem in light of these assumptions with the goal of shedding light on human behavior. This rational analysis provides a way to understand the relationship between the rule- and similarity-based approaches that have dominated previous work and suggests how they might be combined. Whereas hybrid models apply similarity-based learning to explicit rules, we offer a single foundation that supports both approaches, using a common set of commitments about learning and representation.

To understand the abstract problem that a function learner faces, we can turn to machine learning and statistics, where prediction in continuous domains—a problem familiarly known as regression—has been studied extensively. There are a variety of solutions to regression problems, but we focus on methods related to Bayesian linear regression (e.g., Bernardo and Smith, 1994), which allow us to make and test explicit claims about learners’ expectations, using probability distributions. Bayesian linear regression is also directly related to a nonparametric approach known as Gaussian process prediction (e.g., Williams, 1998), in which predictions about the values of an output variable are based on the similarity between values of an input variable. We use this relationship to connect the two traditional approaches to modeling function learning, as it shows that learning rules that describe functions and specifying the similarity between stimuli for use in associative learning are not mutually exclusive alternatives, but rather two views of the same solution. We exploit this fact to define a rational model of human function learning that incorporates the strengths of both approaches.

The plan of this paper is as follows. First, we review several sets of empirical phenomena in function learning, both to provide background and to establish criteria by which different theories of function learning can be judged. We then review past models of function learning, dividing them into rule-based, similarity-based, and hybrid approaches. Next, we introduce a new perspective on function learning in which rules and similarity can be expressed in a common framework, and describe a model that follows from this perspective. Finally, we evaluate different variations on our model against one another and previous models.

Phenomena in function learning

Past studies have taken diverse approaches to understanding how people learn relationships between continuous variables, but we will focus on four kinds of empirical phenomena that have been used in previous tests of function learning models (e.g., McDaniel and Busemeyer, 2005), or explicitly measure what kinds of relationships people implicitly believe to be more or less likely (Kalish et al. 2007), or challenge many models of function learning (Kalish et al. 2004). Our decision to focus on the following phenomena is also motivated by their being relatively comparable, coming from similar experimental designs involving randomly ordered, sequentially presented training stimuli, in the absence of informative cover stories or contextual information. In this section, we review these four kinds of phenomena, which we will later use to evaluate our own approach to explaining and understanding function learning.

Interpolation and learning difficulty

Some kinds of relationships are easier to learn than others. For example, increasing linear relationships tend to be easier to learn than decreasing linear relationships (Brehmer 1971; 1976). Similarly, linear relationships are typically easier to learn than non-linear ones (Brehmer 1974; Brehmer et al. 1985; Byun 1995; see Koh and Meyer (1991) for a possible counterexample). Among non-linear relationships, people have more difficulty learning those that change direction (Brehmer 1974; Brehmer et al. 1985; Byun 1995). Cyclic relationships are especially difficult, but not impossible, to learn (Bott and Heit 2004; Byun 1995; Kalish 2013). These systematic differences suggest that some relationships are subjectively simpler, more common, or more straightforwardly represented than others, and the patterns given above dovetail with explicit human judgments about the probabilities of different kinds of relationships (Brehmer 1974).

If the difficulty of learning a relationship reflects its mental representation, one can evaluate a model of function learning by comparing its average error rates to those of humans across several kinds of relationships. More precisely, if one orders several relationships by the average magnitude of errors that humans make when predicting y for x values that fall between past examples, i.e., interpolating, a good model should show the same ordering in its prediction error. For humans, these errors are influenced by many factors, such as the match or mismatch of cover stories to the available data, the number of training points, and presentation order (Byun 1995), but we will focus on properties of the relationships themselves, which provide a simple basis for evaluating different theories of function learning. For instance, relationships in which y increases as a function of x tend to be easier to learn than functions in which y decreases as a function of x, which are in turn easier to learn than non-monotonic functions. For a summary of some qualitative properties of functions that contribute to differential learning difficulty for humans, see Busemeyer et al. (1997). In our own evaluation, we will use data from several studies that were gathered by McDaniel and Busemeyer (2005) and are summarized in Table 1.

Table 1. Difficulty of learning results based on experiments reviewed in McDaniel and Busemeyer (2005)


Extrapolation

Studies that measure interpolation errors allow relationships to be ranked by how easy they are to learn, with implications for those relationships’ subjective probability and consistency with humans’ mental representations. Unfortunately, quite different models can show similar patterns of errors (given a limited set of relationship types), which constrains the amount one can learn from this approach. This and other limitations of interpolation-error studies have led some researchers to focus on how people extrapolate, or make judgments about points that are distant from those seen before. This approach gives a greater share of influence to learners’ prior beliefs, and makes it possible to uncover patterns that are not reflected in interpolation error rates. To date, extrapolation-based studies of function learning are comparatively sparse, but have revealed several biases in human learners. For example, people’s extrapolation judgments follow linear patterns (DeLosh et al. 1997; but see Kalish et al. 2004), and more specifically tend toward functions with a positive slope and an intercept of zero (Kwantes and Neal 2006). In one instance of this bias, when people are trained using data from a quadratic function, their average predictions fall between the true function and straight lines fitted to the closest training points.

Learning multiple relationships

The term “function learning” suggests that relationships between continuous variables (or at least the representations that people form of them) are functions, in that for a given value of the predictor x, there is a single valid prediction, or at least a range of predictions with a single most-likely value or mode. In reality, this is not always the case. For example, dose responses for drugs might have two or more patterns, depending on unobserved genetic factors or patient histories, and some hybrid cars have different relationships between pressure on the accelerator and the car’s real acceleration, depending on whether or not the combustion engine is active. The world abounds with hidden mediators that can change the relationship between observable variables, and one might expect humans to be able to make judgments that reflect the presence of multiple underlying relationships. Consistent with this intuition, Lewandowsky, Kalish, and Ngang (2002) found that fire fighters learn two distinct relationships between wind speed, ground slope, and the rate at which a fire spreads, depending on whether the fire is labeled as a standard forest fire, or a “back burn” fire set to mitigate damage from future fires. Lewandowsky et al. refer to this phenomenon as “knowledge partitioning”, based on the idea that participants’ knowledge of the relationship at hand is partitioned into distinct subsets based on context.

More recently, Kalish, Lewandowsky, and Kruschke (2004) conducted three experiments showing that people make judgments that demonstrate an implicit belief in the presence of multiple overlapping linear relationships, even when no contextual information was present, and in circumstances where the training data could be explained using a single non-linear relationship (see Fig. 1 for examples).

Iterated learning

Iterated learning is an experimental method that was first developed for studying language evolution (Kirby 2001), but it has more recently been applied to other phenomena, including function learning. In an iterated learning experiment, there are chains of learners where the first learner in each chain receives data, makes some inference on the basis of those data, and uses that inference to provide new data to the next learner in the chain. The data produced by each learner are the product of the data he or she receives and his or her inductive biases or expectations about the underlying relationship, item, or event. As the chain of learners grows longer, the influence of the learners’ shared expectations eventually washes out the information carried by the data provided to the first learner. After enough iterations, the data carried forward in the chain reflect human expectations about what relationships are likely, rather than the data the first learner in the chain sees, providing useful information about how people represent and reason about the phenomena at hand (Kalish et al. 2007).

Figure 2 shows the results of a set of iterated function learning experiments conducted by Kalish et al. (2007). There were four conditions that differed in what data were given to the first participants in the chains. The positive linear (A) chains started with a linear relationship with a slope of one and an intercept of zero, the negative linear (B) chains started with a linear relationship with a slope of negative one and an intercept of zero, the U-shaped (C) chains started with data from a U-shaped relationship, and the random (D) chains started with a disorganized collection of points without any apparent underlying regularity. Kalish et al. (2007) found that the judgments of later participants tended to converge to a positive linear relationship with a slope of one and an intercept of zero regardless of the initial data. While these convergence results dovetail with past findings indicating that positive linear relationships are easier to learn, the intermediate states of the chains provide a more detailed view of function learning. For example, learners tended to preserve negative linear relationships, consistent with the idea that people think these relationships are likely or plausible. Further, many learners were quick to infer the presence of multiple overlapping relationships, as when some participants interpreted noisy data as evidence for a negative linear relationship superimposed on a positive one.
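The logic of an iterated learning chain can be illustrated with a small simulation. The sketch below is not the procedure or model used by Kalish et al. (2007); it assumes a hypothetical learner who fits a straight line with a Gaussian prior centered on an intercept of zero and a slope of one, and who passes noise-free predictions to the next learner. The prior variance, noise variance, stimulus values, and the starting data (a decreasing line on a unit interval) are all illustrative assumptions.

```python
def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def learner_estimate(xs, ts, prior_mean, prior_var, noise_var):
    """Posterior mean of (intercept, slope) under an isotropic Gaussian prior."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    st = sum(ts)
    sxt = sum(x * t for x, t in zip(xs, ts))
    p = 1.0 / prior_var  # prior precision on each weight
    precision = [[p + n / noise_var, sx / noise_var],
                 [sx / noise_var, p + sxx / noise_var]]
    rhs = [p * prior_mean[0] + st / noise_var,
           p * prior_mean[1] + sxt / noise_var]
    cov = inv2(precision)
    return (cov[0][0] * rhs[0] + cov[0][1] * rhs[1],
            cov[1][0] * rhs[0] + cov[1][1] * rhs[1])


def run_chain(first_ts, xs, generations, prior_mean=(0.0, 1.0),
              prior_var=0.1, noise_var=0.1):
    """Each learner's predictions become the next learner's training data."""
    ts = list(first_ts)
    history = []
    for _ in range(generations):
        intercept, slope = learner_estimate(xs, ts, prior_mean,
                                            prior_var, noise_var)
        history.append((intercept, slope))
        # Simplifying assumption: responses are noise-free posterior-mean predictions.
        ts = [intercept + slope * x for x in xs]
    return history


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
negative_start = [1.0 - x for x in xs]  # hypothetical decreasing initial data
history = run_chain(negative_start, xs, generations=50)
print("first learner:", history[0])
print("last learner:", history[-1])
```

Under these assumptions the estimates drift from the initial data toward the prior mean as the chain lengthens, which is the qualitative sense in which later generations reveal the learners' shared expectations. A learner who sampled responses from the posterior, rather than reporting its mean, would be a closer analogue of a noisy human participant.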

Models of human function learning

The phenomena described in the previous section have inspired several theories and models of function learning, which can be organized into three classes: those based on rules or explicit functions, those based on associative or similarity-based learning, and hybrids that use explicit representations and associative learning. In this section, we review each class in turn, before discussing the extent to which each is consistent with the empirical results described above.

Representing functions with rules

Some of the earliest research into function learning postulates that people learn continuous relationships using explicitly represented functions (Carroll 1963). Carroll proposed that people assume a particular class of functions (such as polynomials of degree k) and use the available observations to estimate the parameters of those functions. The resulting representation allows people to generalize beyond the observed values of the variables involved. Consistent with the version of this hypothesis that Carroll advanced, people learned linear and quadratic functions better than random pairings of values for two variables, and extrapolated appropriately. Similar assumptions have guided subsequent work, which has explored the ease with which people learn different kinds of functions (e.g., Brehmer, 1974), and examined how well human responses are described by different forms of nonlinear regression (e.g., Koh and Meyer, 1991).

The advent of rule-based models precedes most of the empirical results we consider, so it may be unsurprising that these models face some difficulty in explaining those results. Rule-based models do not show the flexibility in interpolation that human learners exhibit, and tend not to predict the order-of-difficulty found in interpolation studies (McDaniel and Busemeyer 2005). Similarly, there is evidence that rule-based models (such as that of Koh and Meyer (1991)) make extrapolation predictions that diverge from human judgments (DeLosh et al. 1997). Purely rule-based models make no provision for multiple overlapping relationships, and thus cannot account for knowledge partitioning effects (Kalish et al. 2004). By extension, their ability to explain Kalish, Griffiths, and Lewandowsky’s (2007) iterated learning results is limited: while rule-based models might be able to explain long-run convergence to positive linear relationships, they do not anticipate participants’ multimodal judgments.

Similarity and associative learning

Associative learning models propose that people do not learn relationships between continuous variables by explicitly learning rules, but instead forge associations between observed events and generalize based on the similarity of new variable values to old. The first model to implement this approach was the Associative Learning Model (ALM; DeLosh et al., 1997, Busemeyer et al., 1997), in which input and output arrays are used to represent a range of values for the variables between which the functional relationship holds. Presentation of an input activates input nodes close to that value, with activation falling off as a Gaussian function of distance, implementing a theory of similarity in the input space.

Learned weights determine the activation of the output nodes, which is a linear function of the activation of the input nodes. Weights are learned by gradient descent, where the local relationship between weights and errors is used to find new weights that reduce the squared error of the model’s predictions. This process is repeated until the error can no longer be reduced. In practice, this approach performs well when interpolating between observed values, but poorly when extrapolating beyond those values, as it does not capture humans’ ability to extrapolate in systematic, structured ways. As a consequence, DeLosh et al. introduced the EXAM model, which constructs a linear approximation to the output of the ALM when selecting responses.
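The ALM learning scheme described above can be sketched in a few lines. This is a simplified illustration rather than the published model: the Gaussian form of the teaching signal on the output array, the activation-weighted response rule, the node spacing, the learning rate, and the fixed number of training passes are all assumptions made here for concreteness, and the function names are hypothetical.

```python
import math


def gaussian_activation(value, nodes, width):
    """Activation of each node, falling off as a Gaussian of distance from value."""
    return [math.exp(-((value - node) / width) ** 2) for node in nodes]


def output_activation(x, weights, in_nodes, width):
    """Output activations are a linear function of the input activations."""
    a = gaussian_activation(x, in_nodes, width)
    return [sum(w * a_i for w, a_i in zip(row, a)) for row in weights]


def alm_predict(x, weights, in_nodes, out_nodes, width=1.0):
    """Assumed response rule: activation-weighted average of output node values."""
    o = output_activation(x, weights, in_nodes, width)
    total = sum(o)
    if abs(total) < 1e-12:
        return sum(out_nodes) / len(out_nodes)  # no learned association yet
    return sum(o_j * y_j for o_j, y_j in zip(o, out_nodes)) / total


def alm_train(data, in_nodes, out_nodes, width=1.0, rate=0.1, epochs=500):
    """Gradient descent on squared error between output and target activations."""
    weights = [[0.0] * len(in_nodes) for _ in out_nodes]
    for _ in range(epochs):  # fixed number of passes instead of a stopping rule
        for x, t in data:
            a = gaussian_activation(x, in_nodes, width)
            target = gaussian_activation(t, out_nodes, width)  # assumed teaching signal
            for j, row in enumerate(weights):
                error = target[j] - sum(w * a_i for w, a_i in zip(row, a))
                for i, a_i in enumerate(a):
                    row[i] += rate * error * a_i
    return weights


nodes = [float(v) for v in range(11)]            # input and output arrays on 0..10
training = [(float(v), float(v)) for v in range(1, 10)]  # hypothetical data, y = x
weights = alm_train(training, nodes, nodes)
print(alm_predict(4.5, weights, nodes, nodes))   # interpolation between trained points
```

Because each prediction depends only on input nodes near the probe value, a network of this kind has nothing systematic to say about inputs far from the training range, which is the limitation that motivated EXAM.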

Similarity-based models have seen mixed success in explaining the range of empirical phenomena we describe above. In studies of interpolation and learning difficulty, similarity-based models show similar patterns of interpolation errors to those of humans (McDaniel and Busemeyer 2005). ALM does not address extrapolation, but EXAM was developed with extrapolation results in mind: it effectively captures the human bias toward linearity and predicts human extrapolations over a variety of relationships (McDaniel and Busemeyer 2005), although it does not account for the human capacity for non-linear extrapolation (Bott and Heit 2004). Like rule-based models, similarity-based models make unimodal predictions for any given x, and thus fail to account for knowledge partitioning results. This limitation also prevents EXAM from capturing some of the intermediate patterns that people produce in the iterated learning experiment.

Hybrid approaches

Several studies have explored methods for combining rule-like representations of functions with associative learning. One example of such an approach is the set of models explored in McDaniel and Busemeyer (2005). These models used the same kind of input representation as ALM and EXAM, with activation of a set of nodes similar to the input value. However, the models also feature a set of hidden units, where each hidden unit corresponds to a different parameterization of a rule from a given class, including polynomial, Fourier, and logistic functions. The values of the hidden units—corresponding to the values of the rules they instantiate—are combined linearly to obtain output predictions, with the weight of each hidden node being learned through gradient descent.

Another instance of a hybrid approach is the POLE model (Kalish et al. 2004), in which hidden units represent different linear functions and the weights from inputs to hidden nodes indicate which linear function should be used to make predictions for particular input values. Using this representation, the model can learn non-linear functions by identifying a series of local linear approximations, and can even model situations in which people seem to learn different functions in different parts of the input space. As a result, it is unique among the models we have discussed in its ability to match the bimodal response distributions discovered by Kalish et al. (2004).

Hybrid rule- and similarity-based models form a more heterogeneous group than similarity- and rule-based models, with representatives including POLE (Kalish et al. 2004) and McDaniel and Busemeyer’s (2005) connectionist implementations of rule-based models. POLE is set apart from the other models we have discussed by its ability to capture knowledge partitioning effects, and it demonstrates a similar ordering of error rates to those of human learners (McDaniel et al. 2009). In its extrapolation predictions, however, there is evidence that it deviates from human performance (McDaniel et al. 2009). In an iterated learning design, POLE showed both convergence to positive linear relationships and some of the qualitative patterns that human learners demonstrate (depicted in Fig. 3II), including transitional states with overlapping positive and negative linear relationships. McDaniel and Busemeyer’s hybrid polynomial model, which performed better than the alternative hybrid models they considered, demonstrates an ordering of interpolation errors on different functions that aligns only roughly with human judgments (see Table 1), but its extrapolation predictions are consistent with human judgments from McDaniel and Busemeyer’s studies (McDaniel and Busemeyer 2005). Like rule-based models, this model offers unimodal predictions, and thus cannot account for knowledge partitioning phenomena, and has not been evaluated against iterated learning results.

Summary

We have reviewed a diverse set of models that accurately predict a variety of empirical phenomena in function learning. Despite their different commitments about how humans learn continuous relationships, a common theme of these models is an emphasis on the process by which function learning occurs. In the next section, we will take a fundamentally different view, focusing on the abstract problem of function learning and the forms that good solutions to that problem should take, rather than the process. This view complements past models rather than supplanting them, and we will demonstrate that it provides a common framework with which to understand and unify rule- and similarity-based approaches.

Rational solutions to regression problems

The models outlined in the previous section all aim to describe the psychological processes involved in human function learning. In this section, we consider the abstract computational problem underlying this task, using optimal solutions to this problem to shed light on both previous models and human learning. Viewed abstractly, the computational problem behind function learning is to use a set of real-valued observations $\mathbf{x}_{n} = (x_{1}, \ldots, x_{n})$ and $\mathbf{t}_{n} = (t_{1}, \ldots, t_{n})$ to predict what $y_{n+1}$ goes with a new $x_{n+1}$. Here, the y-values correspond to the underlying relationship, and the t-values are observations of y that have been obscured by additive noise, so $y_{n+1} = \mathbb {E}[t_{n+1}]$. Following much of the literature on human function learning, we consider only one-dimensional relationships, but this approach generalizes naturally to the multi-dimensional case. In machine learning and statistics, this is referred to as a regression problem. In this section, we discuss how regression problems can be solved using Bayesian statistics, and how the result of this approach is related to Gaussian processes, a formalism with close ties to associative learning. Our presentation follows that in Williams (1998). See Appendix A for a more thorough treatment of the mathematical details.

Bayesian linear regression

Ideally, we would seek to solve our regression problem by using not just the observations of x and t, but some prior beliefs about the probability of encountering different kinds of functions f(⋅) in the world. We can do this by applying Bayes’ rule, with

$$ p(f|\mathbf{x}_{n}, \mathbf{t}_{n}) = \frac{p(\mathbf{t}_{n}|f, \mathbf{x}_{n})p(f)}{{\int}_{\mathcal{F}} p(\mathbf{t}_{n}|f, \mathbf{x}_{n})p(f) df}. $$

(1)

Knowledge of which functions in the space of possibilities $\mathcal {F}$ are more likely to be the true function is captured by $p(f)$, the prior distribution. The probability of observing the values of $\mathbf{t}_{n}$ if $f$ were the true function is given by the likelihood function $p(\mathbf{t}_{n}|f, \mathbf{x}_{n})$, and the probability that $f$ is the true function given the observations $\mathbf{x}_{n}$ and $\mathbf{t}_{n}$ is the posterior distribution $p(f|\mathbf{x}_{n}, \mathbf{t}_{n})$. In most regression models, the likelihood is defined by assuming that any deviation from the true function is due to many independent sources of noise; more specifically, that $t_{i}$ is Gaussian with mean $y_{i} = f(x_{i})$ and variance $\sigma^{2}_{t}$. Predictions about the value of the function $f$ for a new input $x_{n+1}$ can be made by integrating over all functions in the posterior distribution,

$$ p(y_{n+1}|x_{n+1},\mathbf{t}_{n}, \mathbf{x}_{n}) = {\int}_{f} p(y_{n+1}|f, x_{n+1})p(f|\mathbf{x}_{n}, \mathbf{t}_{n}) df $$

(2)

where $p(y_{n+1}|f, x_{n+1})$ is a delta function placing all of its mass on $y_{n+1} = f(x_{n+1})$. Performing the integration outlined above can be challenging, but it becomes straightforward if we limit the hypothesis space to certain specific classes of functions. If we take $\mathcal {F}$ to be all linear functions of the form $y = b_{0} + x b_{1}$, then our problem takes the familiar form of linear regression. To perform Bayesian linear regression, we need to define a prior $p(f)$ over all linear functions. Since these functions are identified by the parameters $b_{0}$ and $b_{1}$, it is sufficient to define a prior over $\mathbf{b} = (b_{0}, b_{1})$. We can do this by assuming that $\mathbf{b}$ follows a multivariate Gaussian distribution, which results in a posterior distribution over $\mathbf{b}$ that is also a multivariate Gaussian (see Bernardo and Smith (1994)). Linear transformations of Gaussian distributions are also Gaussian, so the predictive density in Eq. 2 is also Gaussian, and the noise introduced between true values $y$ and observations $t$ simply adds to the variance of this distribution.
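As a concrete illustration of the Gaussian case, the sketch below computes the posterior over $\mathbf{b} = (b_{0}, b_{1})$ and the resulting predictive distribution using the standard conjugate update for a Gaussian prior and Gaussian noise. The function names, the example data, and the particular prior and noise settings are assumptions chosen for illustration; nothing here is taken from the authors' implementation.

```python
def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def posterior(xs, ts, prior_mean, prior_cov, noise_var):
    """Posterior mean and covariance of b = (b0, b1) for y = b0 + x * b1."""
    p = inv2(prior_cov)  # prior precision matrix
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    st = sum(ts)
    sxt = sum(x * t for x, t in zip(xs, ts))
    # Posterior precision = prior precision + X^T X / noise_var, rows of X are [1, x].
    precision = [[p[0][0] + n / noise_var, p[0][1] + sx / noise_var],
                 [p[1][0] + sx / noise_var, p[1][1] + sxx / noise_var]]
    cov = inv2(precision)
    # Right-hand side = prior precision * prior mean + X^T t / noise_var.
    rhs = [p[0][0] * prior_mean[0] + p[0][1] * prior_mean[1] + st / noise_var,
           p[1][0] * prior_mean[0] + p[1][1] * prior_mean[1] + sxt / noise_var]
    mean = [cov[0][0] * rhs[0] + cov[0][1] * rhs[1],
            cov[1][0] * rhs[0] + cov[1][1] * rhs[1]]
    return mean, cov


def predict(x_new, mean, cov, noise_var):
    """Predictive mean, variance of y, and variance of the noisy observation t."""
    y_mean = mean[0] + mean[1] * x_new
    y_var = cov[0][0] + 2.0 * x_new * cov[0][1] + x_new ** 2 * cov[1][1]
    return y_mean, y_var, y_var + noise_var


# Hypothetical data generated from t = 2 + 3x, with a very broad prior.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [2.0, 5.0, 8.0, 11.0]
mean, cov = posterior(xs, ts, prior_mean=[0.0, 0.0],
                      prior_cov=[[1e6, 0.0], [0.0, 1e6]], noise_var=0.01)
print(mean)
print(predict(4.0, mean, cov, noise_var=0.01))
```

The last value returned by `predict` makes the final point of the paragraph explicit: the predictive variance for an observation $t_{n+1}$ is the variance for $y_{n+1}$ plus the noise variance.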

While considering only linear functions might seem overly restrictive, linear regression actually gives us the basic tools we need to solve this problem for more general classes of functions. Many classes of functions can be described as linear combinations of a small set of basis functions. For example, all kth degree polynomials are linear combinations of functions of the form $1$ (the constant function), $x, x^{2}, \ldots, x^{k}$. Letting $\phi^{(1)}, \ldots, \phi^{(k)}$ denote a set of functions, we can define a prior on the class of functions that are linear combinations of this basis by expressing such functions in the form $f(x) = b_{0} + \phi^{(1)}(x) b_{1} + \ldots + \phi^{(k)}(x) b_{k}$ and defining a prior on the vector of weights $\mathbf{b}$. As long as the prior over weights is Gaussian, the same results apply as in the simple linear case.
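To make the basis-function idea concrete, here is a minimal sketch for the polynomial case. It evaluates the basis at a point, computes $f(x)$ as a weighted sum, and draws a weight vector from an independent zero-mean Gaussian prior. The helper names and the choice of an independent prior with a common standard deviation are assumptions for illustration only.

```python
import random


def polynomial_basis(x, degree):
    """Return [1, x, x**2, ..., x**degree], the basis functions evaluated at x."""
    return [x ** p for p in range(degree + 1)]


def evaluate(weights, x):
    """f(x) = b_0 + b_1 * phi_1(x) + ... + b_k * phi_k(x) for a polynomial basis."""
    features = polynomial_basis(x, len(weights) - 1)
    return sum(b * phi for b, phi in zip(weights, features))


def sample_prior_function(degree, weight_sd=1.0, rng=random):
    """Draw one weight vector from an independent zero-mean Gaussian prior."""
    return [rng.gauss(0.0, weight_sd) for _ in range(degree + 1)]


# A quadratic with weights (1, 0, 1) evaluated at x = 2.
print(evaluate([1.0, 0.0, 1.0], 2.0))

# One function drawn from the prior over cubic polynomials, evaluated on a grid.
b = sample_prior_function(3, rng=random.Random(0))
print([round(evaluate(b, x / 2.0), 3) for x in range(5)])
```

The function is non-linear in $x$ but still linear in the weights, which is why the Gaussian results from the simple linear case carry over unchanged.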

Gaussian processes

Another approach to regression problems is to forgo any explicit representation of the underlying function and focus on making predictions. If our goal is merely to predict $y_{n+1}$ using $x_{n+1}$, $\mathbf{t}_{n}$, and $\mathbf{x}_{n}$, we might simply define a joint distribution on $\mathbf{t}_{n+1}$ given $\mathbf{x}_{n+1}$ and, after conditioning on $\mathbf{t}_{n}$, find the expected value of $t_{n+1}$, which is equal to $y_{n+1}$:

$$ p(t_{n+1}|x_{n+1},\mathbf{x}_{n},\mathbf{t}_{n}) = \frac{p(\mathbf{t}_{n+1}|x_{n+1},\mathbf{x}_{n})}{p(\mathbf{t}_{n}|x_{n+1},\mathbf{x}_{n})}. $$

(3)

This equation expresses the problem of regression in very general terms, and may, at first glance, seem daunting to compute: it involves defining a joint distribution over all of the points observed so far, as well as the joint distribution including the new, unknown point. Further, if we want to predict $y_{n+1}$, we must be able to take the expectation of this quotient. However, in some circumstances, the probability of $t_{n+1}$ given $x_{n+1}$, $\mathbf{x}_{n}$, and $\mathbf{t}_{n}$ has a straightforward analytical solution, and an easily computed expectation. One such case, which will be our focus here, is when all of the values in $\mathbf{t}_{n+1}$ are jointly Gaussian. In other words, $\mathbf{t}_{n+1}$ is distributed according to a single multivariate Gaussian, with dimensionality corresponding to the number of points under consideration. This distribution is determined by its mean and covariance matrix, and once these are specified, we have a solution for Eq. 3: the quotient has a closed form for multivariate Gaussians (see Rasmussen and Williams, 2006, for details). As we will see, assuming a jointly Gaussian distribution is not a strong constraint, and we can express a very broad set of relationships through our choice of means and covariances.

Both the mean vector and the covariance matrix are determined by the values of x. Broadly speaking, the mean vector captures expectations about how the function looks in the absence of data, and the covariance matrix (or the kernel function that generates it) captures expectations about how points relate to one another. The covariance between any pair of t-values $(t_{i}, t_{j})$ is given by a function $K(x_{i}, x_{j})$, and the full covariance matrix adds a diagonal matrix capturing the noisy relationship between the underlying values $y_{i}$ and the observations $t_{i}$. Using this covariance matrix, we can obtain the distribution of $t_{n+1}$ conditional on $\mathbf{t}_{n}$. The function $K(\cdot,\cdot)$, called the kernel function, can be chosen arbitrarily as long as the covariance matrix it produces is valid (symmetric and positive semi-definite).
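The conditioning step can be written out directly. The sketch below uses the standard closed form for a zero-mean Gaussian process: with $\mathbf{C}$ the covariance matrix of the observed points (kernel values plus noise on the diagonal) and $\mathbf{k}$ the vector of kernel values between the new input and the observed inputs, the predictive mean is $\mathbf{k}^{T}\mathbf{C}^{-1}\mathbf{t}_{n}$ and the predictive variance of $y_{n+1}$ is $K(x_{n+1}, x_{n+1}) - \mathbf{k}^{T}\mathbf{C}^{-1}\mathbf{k}$. The zero mean function, the small hand-written linear solver, the function names, and the example kernel are assumptions made for illustration.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def gp_predict(x_new, xs, ts, kernel, noise_var):
    """Predictive mean and variance of y at x_new for a zero-mean Gaussian process."""
    # Covariance of the observed t-values: kernel plus noise on the diagonal.
    c = [[kernel(xi, xj) + (noise_var if i == j else 0.0)
          for j, xj in enumerate(xs)]
         for i, xi in enumerate(xs)]
    k = [kernel(x_new, xi) for xi in xs]
    alpha = solve(c, ts)                       # C^{-1} t
    mean = sum(k_i * a_i for k_i, a_i in zip(k, alpha))
    v = solve(c, k)                            # C^{-1} k
    var_y = kernel(x_new, x_new) - sum(k_i * v_i for k_i, v_i in zip(k, v))
    return mean, var_y


def example_kernel(xi, xj):
    """Illustrative similarity-based kernel; any valid kernel could be used."""
    return math.exp(-(xi - xj) ** 2)


xs = [0.0, 1.0, 2.0]
ts = [0.0, 1.0, 4.0]  # hypothetical observations
print(gp_predict(1.5, xs, ts, example_kernel, noise_var=0.01))
```

Note that no explicit function is ever represented: predictions are weighted combinations of past observations, with weights determined by how the kernel relates the new input to the old ones.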

One common kind of kernel is a radial basis function, e.g.,

$$ K(x_{i},x_{j}) = {\theta_{1}^{2}}\exp(-\frac{1}{{\theta_{2}^{2}}}(x_{i}-x_{j})^{2}) $$

(4)

which leads to t values that are more strongly correlated when their corresponding x values are more similar, with the parameters 𝜃 ₁ and 𝜃 ₂ determining how quickly the correlation falls off as differences in x values increase. Other kernels are possible, including periodic functions such as

$$ K(x_{i},x_{j}) = {\theta_{3}^{2}}\exp\left({\theta_{4}^{2}}\left(\cos\left(\frac{2\pi}{\theta_{5}}[x_{i}-x_{j}]\right)\right)\right) $$

(5)

indicating that values of y for which values of x are close relative to the period 𝜃 ₅ are likely to be highly correlated.
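As a concrete illustration of Eqs. 4 and 5, the two kernels can be written as short functions. This is a minimal sketch rather than the implementation used for the simulations reported here; the function names and the default parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def rbf_kernel(xi, xj, theta1=1.0, theta2=1.0):
    """Radial basis kernel of Eq. 4 (parameter defaults are assumed)."""
    return theta1**2 * np.exp(-((xi - xj) ** 2) / theta2**2)

def periodic_kernel(xi, xj, theta3=1.0, theta4=1.0, theta5=2.0):
    """Periodic kernel of Eq. 5; theta5 sets the period (defaults are assumed)."""
    return theta3**2 * np.exp(theta4**2 * np.cos(2 * np.pi / theta5 * (xi - xj)))

# The RBF covariance falls off as the x values move apart.
near = rbf_kernel(0.0, 1.0)
far = rbf_kernel(0.0, 2.0)

# The periodic covariance returns to its maximum one full period away
# and is smallest half a period away.
same_phase = periodic_kernel(0.0, 2.0)
opposite_phase = periodic_kernel(0.0, 1.0)
```

Under the radial basis kernel, covariance depends only on distance, whereas under the periodic kernel points separated by a whole number of periods covary as strongly as identical points.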

This approach to prediction, in which a kernel function applied to x defines a normal distribution on t-values, is called a Gaussian process. A wide variety of kernel functions are possible, corresponding to varied commitments about which x values are likely to lead to similar t-values, making Gaussian processes a flexible way to solve regression problems.
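The conditioning step described above can be sketched in a few lines. The example assumes a zero prior mean, a radial basis kernel with unit parameters, and a noise variance of 0.01; these values, and the names `rbf_matrix` and `gp_predict`, are illustrative assumptions rather than settings taken from the model.

```python
import numpy as np

def rbf_matrix(xa, xb, theta1=1.0, theta2=1.0):
    """Matrix of radial basis kernel values between two 1-D arrays of x values."""
    diff = xa[:, None] - xb[None, :]
    return theta1**2 * np.exp(-(diff**2) / theta2**2)

def gp_predict(x_train, t_train, x_new, kernel, noise_var=0.01):
    """Mean and variance of t at x_new given the observations (zero prior mean)."""
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = kernel(x_train, x_new)                        # shape (n, m)
    k_new = kernel(x_new, x_new) + noise_var * np.eye(len(x_new))
    mean = k_star.T @ np.linalg.solve(K, t_train)
    cov = k_new - k_star.T @ np.linalg.solve(K, k_star)
    return mean, np.diag(cov)

x_train = np.array([0.0, 1.0, 2.0, 3.0])
t_train = np.sin(x_train)                                  # placeholder targets
mean, var = gp_predict(x_train, t_train, np.array([1.5, 10.0]), rbf_matrix)
```

At x = 1.5, which lies between training points, the predictive variance shrinks. At x = 10, far from all observations, the prediction reverts to the prior mean of zero with roughly the prior variance.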

Two views of regression

Bayesian linear regression and Gaussian processes appear to be quite different approaches. In Bayesian linear regression, a hypothesis space of functions is identified, a prior on that space is defined, and predictions are formed by averaging over the posterior distribution of y, while Gaussian processes simply use the similarity between different values of x, as expressed through a kernel, to predict correlations in values of y. It might thus come as a surprise that these approaches are equivalent.

Showing that Bayesian linear regression corresponds to Gaussian process prediction is straightforward. The assumption of linearity means that the vector y _n+1 is equal to X _n+1 b. Given normally distributed weights, it follows that p(y _n+1|x _n+1) is a multivariate Gaussian distribution with mean zero and covariance matrix $\mathbf {X}_{n+1} \mathbf {\Sigma }_{b} \mathbf {X}_{n+1}^{T}$. Bayesian linear regression thus corresponds to prediction using Gaussian processes, with this covariance matrix playing the role of K _n+1 above (i.e., using the kernel function K(x _i,x _j)=[1 x _i][1 x _j]^T). Using a richer set of basis functions corresponds to taking $\mathbf {K}_{n+1} =\mathbf {\Phi }_{n+1} \mathbf {\Sigma }_{b} \mathbf {\Phi }_{n+1}^{T}$, i.e.,

$$ K(x_{i},x_{j}) = \left[ 1 \ \phi^{(1)}(x_{i}) \ {\ldots} \ \phi^{(k)}(x_{i})\right] \left[ 1 \ \phi^{(1)}(x_{j})\ {\ldots} \ \phi^{(k)}(x_{j})\right]^{T}, $$

(6)

where ϕ ^(1...k) are k arbitrary functions of x (Williams 1998). It is also possible to show that Gaussian process prediction can always be interpreted as Bayesian linear regression, albeit with a potentially infinite number of basis functions. Just as we can express a covariance matrix in terms of its eigenvectors and eigenvalues, we can express a given kernel K(x _i,x _j) in terms of its eigenfunctions ϕ and eigenvalues λ, with

$$ K(x_{i},x_{j}) = \sum\limits_{k=1}^{\infty} \lambda_{k} \phi^{(k)}(x_{i}) \phi^{(k)}(x_{j}) $$

(7)

for any x _i and x _j (Minh et al. 2006). Thus, any kernel can be viewed as the result of performing Bayesian linear regression with a set of basis functions corresponding to its eigenfunctions, and a prior with covariance matrix Σ _b=diag(λ).
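A finite analogue of Eq. 7 can be checked numerically by decomposing a kernel matrix evaluated on a grid of x values. This is only a sketch of the matrix case, not the eigenfunction expansion itself; the grid, the unit-parameter radial basis kernel, and the choice of how many components to keep are all assumptions made for illustration.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 25)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)      # RBF kernel, unit parameters (assumed)

eigvals, eigvecs = np.linalg.eigh(K)             # K is symmetric, so eigh applies

def rebuild(indices):
    """Sum of lambda_k * v_k v_k^T over the chosen eigen-components."""
    return (eigvecs[:, indices] * eigvals[indices]) @ eigvecs[:, indices].T

order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
full_error = np.linalg.norm(K - rebuild(order))
err_top2 = np.linalg.norm(K - rebuild(order[:2]))
err_top8 = np.linalg.norm(K - rebuild(order[:8]))
```

Using every component recovers the matrix up to rounding, and keeping more leading components can only reduce the reconstruction error, which is the sense in which a small number of basis functions may approximate a smooth kernel.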

These results establish an important duality between Bayesian linear regression and Gaussian processes: for every prior on functions, there exists a corresponding kernel, and for every kernel, there exists a corresponding prior on functions. Bayesian linear regression and prediction with Gaussian processes are thus just two views of the same solution to regression problems.
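The duality can be verified numerically for a small basis set. The sketch below computes the same predictions twice: once from the posterior mean of the weights b, and once from the kernel $\mathbf {\Phi } \mathbf {\Sigma }_{b} \mathbf {\Phi }^{T}$. The basis [1, x, x²], the prior covariance, the noise variance, and the synthetic training data are assumptions chosen for illustration.

```python
import numpy as np

def features(x):
    """Basis functions [1, x, x^2] (an illustrative choice of phi)."""
    return np.column_stack([np.ones_like(x), x, x**2])

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 8)
t_train = 0.5 + 2.0 * x_train - x_train**2 + 0.1 * rng.standard_normal(8)
x_new = np.array([-1.5, 0.3, 2.0])

sigma_b = np.diag([1.0, 2.0, 0.5])    # prior covariance on the weights (assumed)
noise_var = 0.01                      # observation noise variance (assumed)
Phi, Phi_new = features(x_train), features(x_new)

# View 1: Bayesian linear regression (posterior mean of the weights).
post_precision = Phi.T @ Phi / noise_var + np.linalg.inv(sigma_b)
b_mean = np.linalg.solve(post_precision, Phi.T @ t_train / noise_var)
blr_pred = Phi_new @ b_mean

# View 2: Gaussian process with kernel Phi Sigma_b Phi^T.
K = Phi @ sigma_b @ Phi.T + noise_var * np.eye(len(x_train))
k_star = Phi @ sigma_b @ Phi_new.T
gp_pred = k_star.T @ np.linalg.solve(K, t_train)

max_gap = np.max(np.abs(blr_pred - gp_pred))
```

The two prediction vectors agree up to numerical error, including at the extrapolation points outside the training range.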

Combining rules and similarity through Gaussian processes

The results outlined in the previous section suggest that, in the context of regression, learning using rules (as expressed in a Bayesian linear regression model) and generalizing based on similarity (as expressed in a Gaussian process’s kernel function) are mutually compatible points of view. In this section, we briefly describe how previous accounts of function learning connect to these statistical models, and then use this insight to define a model of human function learning that combines the strengths of both approaches.

Reinterpreting previous accounts of human function learning

The idea of human function learning as a kind of statistical regression connects directly to Bayesian linear regression. Many rule-based models (e.g., Koh and Meyer, 1991; Carroll, 1963) can be framed in terms of Bayesian linear regression while retaining all of their basic commitments and predictions. Similarly, the basic ideas behind Gaussian process regression (with a standard radial-basis kernel function) lie at the heart of similarity-based models such as ALM. In particular, ALM and the associative-learning component of EXAM implement cubic spline approximation (McDaniel and Busemeyer 2005), which can be represented using Gaussian processes (Rasmussen and Williams 2006). Similarly, neural network approaches to similarity-based generalization are directly related to Gaussian processes, with some networks having a perfect mapping to a corresponding Gaussian process (Neal 1994). Gaussian processes with radial-basis kernels can thus be viewed as implementing a simple kind of similarity-based generalization, predicting similar y values for stimuli with similar x values. The hybrid approach to rule learning taken by McDaniel and Busemeyer (2005) is also closely related to Bayesian linear regression. The rules represented by the hidden units serve as a basis set that specifies a class of functions, and applying penalized gradient descent on the weights assigned to those basis elements serves as an online algorithm for finding the function with highest posterior probability (MacKay 1995).

Mixing functions in a Gaussian process model

The relationship between Gaussian processes and Bayesian linear regression suggests that we can define a single model that exploits both similarity and rules in forming predictions. We can do this by choosing a hypothesis space that covers a broad class of functions, including both those consistent with a radial basis kernel and those taking simple parametric forms. This is equivalent to modeling y as being produced by a Gaussian process with a kernel corresponding to one of a small number of types. Specifically, we assume that observations are generated by a function that is linear with positive slope, linear with negative slope, quadratic, or nonlinear but generally smooth. Figure 4 depicts samples from these individual kernels, i.e., functions that are likely under each of them. This combination is one way to express the total prior over functions in Eqs. 1 and 2, with $p(f) = {\sum }_{k} p(f|k)P(k)$, where k represents a particular kernel in the set of four we have mentioned.

We do not claim that the specific kernels compose an exhaustive account of the relationships that people can learn and extrapolate from. Rather, we believe that people find these relationships especially easy to learn, and especially plausible or likely as explanations of data in the face of uncertainty, based on the results of Brehmer (1971), DeLosh et al. (1997), and Kalish et al. (2007).

A more complete account would include kernels that permit a wide variety of extrapolation patterns (e.g., Bott and Heit, 2004), but for the data we will consider such an expansion would add to the complexity of our models without substantially changing our predictions (see Lucas et al. (2012) for a demonstration of how Gaussian process models can be used to predict a variety of non-linear extrapolations). The probabilities of the different relationship types are defined by the vector π. The relevant kernels are introduced in the previous sections (where “Nonlinear” corresponds to the radial basis kernel), with the positive and negative linear kernels having different means in their distributions over weights b, taking mean intercepts and slopes of [0 1] and [1 −1], respectively. Using this Gaussian process model allows a learner to simultaneously make inferences about the overall type and specific form of the function from which their observations are drawn.
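Inference about the relationship type can be sketched by scoring the observations under each kernel and combining those scores with the prior π. The code below is a minimal illustration, not the procedure described in Appendix C. The noise variance, the weight variance of the linear kernels, the unit-parameter radial basis kernel, and the zero means for the quadratic and nonlinear kernels are assumptions. The weight means [0 1] and [1 −1] and the prior proportions 8, 1, 0.1, and 0.01 are the values reported in the text.

```python
import numpy as np

def log_gaussian(t, mean, cov):
    """Log density of t under a multivariate normal with the given mean and covariance."""
    resid = t - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (quad + logdet + len(t) * np.log(2 * np.pi))

def kernel_posterior(x, t, prior, noise_var=0.01, weight_var=0.1):
    """P(k | x, t) over the four relationship types (hyperparameters are assumed)."""
    n = len(x)
    lin = np.column_stack([np.ones(n), x])
    quad = np.column_stack([np.ones(n), x, x**2])
    rbf = np.exp(-(x[:, None] - x[None, :]) ** 2)
    # Each entry: (prior mean of y at the observed x values, prior covariance of y).
    models = {
        "positive": (lin @ np.array([0.0, 1.0]), weight_var * lin @ lin.T),
        "negative": (lin @ np.array([1.0, -1.0]), weight_var * lin @ lin.T),
        "quadratic": (np.zeros(n), quad @ quad.T),
        "nonlinear": (np.zeros(n), rbf),
    }
    noise = noise_var * np.eye(n)
    log_post = {k: np.log(prior[k]) + log_gaussian(t, m, c + noise)
                for k, (m, c) in models.items()}
    top = max(log_post.values())                 # subtract the max for stability
    unnorm = {k: np.exp(v - top) for k, v in log_post.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

weights = {"positive": 8.0, "negative": 1.0, "quadratic": 0.1, "nonlinear": 0.01}
prior = {k: w / sum(weights.values()) for k, w in weights.items()}

x = np.linspace(0.0, 1.0, 6)
post_increasing = kernel_posterior(x, x, prior)          # t = x
post_decreasing = kernel_posterior(x, 1.0 - x, prior)    # t = 1 - x
```

Predictions under the full model would then average the per-kernel Gaussian process predictions, weighted by these posterior probabilities.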

In developing this kind of model and selecting this particular set of priors (reflected in our choice of kernel functions), we are making explicit commitments about the inductive biases that shape human function learning. These include what types of relationships are more subjectively probable than others, and the more specific forms that relationships of a given type are likely to take. Our model does not, however, commit to any specific process by which those biases shape people’s inferences, which might resemble, for example, the associative mechanisms present in POLE or EXAM or an elaboration of the hypothesis-testing framework offered by Brehmer.^{Footnote 1}

Basic tests of the Gaussian process model

In the remainder of the paper, we will evaluate our Gaussian process approach to function learning using each of the empirical phenomena we discussed earlier. First, following the approach taken in McDaniel and Busemeyer’s (2005) review of computational models of function learning, we look at two quantitative tests of Gaussian processes as an account of human function learning: reproducing the order of difficulty of learning functions of different types, and extrapolation performance. As indicated earlier, there is a large literature consisting of both models and data concerning human function learning, and these simulations are intended to demonstrate the potential of the Gaussian process model rather than to provide an exhaustive test of its performance. See Appendix B for a summary of the parameters in our model, and Appendix C for a description of the procedures used to generate model predictions.

Difficulty of learning

As discussed above, one important measure of a theory of human function learning is its ability to account for the relative difficulty people have in learning different kinds of relationships. Table 1 is an augmented version of results presented in McDaniel and Busemeyer (2005), which compared several models’ prediction errors to humans’ errors when learning a range of functions. Each entry in the table is the mean absolute deviation (MAD) of human or model responses from the actual value of the function, evaluated over the stimuli presented in training. The MAD provides a measure of how difficult it is for people or a given model to learn a function. The data reported for each set of studies are ordered by increasing MAD (corresponding to increasing difficulty). In addition to reproducing the MAD for the models in McDaniel and Busemeyer (2005), the table has been expanded to contain the MADs exhibited by seven Gaussian process (GP) models trained on the target functions.
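For reference, the MAD is simply the average of the absolute differences between responses and true function values. The sketch below uses made-up numbers purely to show the computation; the function name is an assumption.

```python
import numpy as np

def mean_absolute_deviation(responses, targets):
    """Average absolute difference between responses and true function values."""
    responses = np.asarray(responses, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(np.abs(responses - targets)))

# Illustrative values only, not data from any study.
mad = mean_absolute_deviation([1.0, 2.5, 2.0], [1.0, 2.0, 3.0])
```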

The seven GP models incorporated different collections of kernel functions by adjusting their prior probabilities. The most comprehensive model includes the {Positive Linear,Negative Linear,Quadratic,Nonlinear} set of kernel functions, assigning them prior probabilities proportional to 8, 1, 0.1, and 0.01, respectively.^{Footnote 2} Six other GP models were examined by assigning certain kernel functions zero prior probability and re-normalizing the remainder so that the prior probabilities summed to one. The seven distinct GP models presented in Table 1 are labeled by the kernel functions to which they assign non-zero probability, under the header “Model 1”. Models 2 and 3, which are extensions that account for knowledge partitioning phenomena, are discussed below. The kernels include Linear (including both positive and negative linear functions), Quadratic (second-order polynomial functions), RBF (nonlinear relationships, fit by a radial basis function kernel), LQ (linear and quadratic), LR (linear and RBF), QR (quadratic and RBF), and LRQ (linear, quadratic, and RBF). The MAD for each function from McDaniel and Busemeyer (2005) is reported for each model in Table 1, along with human MADs. The last three rows of Table 1 give the correlations between human and model performance across functions, expressing quantitatively how well each model captured the pattern of human function learning behavior. All of the GP models perform well, with every model (except for the Linear and LQ models) providing a closer match to the human data than any of the models considered by McDaniel and Busemeyer (2005).
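The construction of the seven priors can be sketched as follows: start from the weights 8, 1, 0.1, and 0.01, set the excluded kernels to zero, and renormalize. The dictionary keys and the helper name `restricted_prior` are illustrative assumptions; the weights and model labels are those given above.

```python
BASE_WEIGHTS = {
    "positive_linear": 8.0,
    "negative_linear": 1.0,
    "quadratic": 0.1,
    "nonlinear": 0.01,
}

LINEAR = ["positive_linear", "negative_linear"]
MODELS = {
    "Linear": LINEAR,
    "Quadratic": ["quadratic"],
    "RBF": ["nonlinear"],
    "LQ": LINEAR + ["quadratic"],
    "LR": LINEAR + ["nonlinear"],
    "QR": ["quadratic", "nonlinear"],
    "LRQ": LINEAR + ["quadratic", "nonlinear"],
}

def restricted_prior(active):
    """Zero out excluded kernels and renormalize the rest to sum to one."""
    total = sum(BASE_WEIGHTS[k] for k in active)
    return {k: (BASE_WEIGHTS[k] / total if k in active else 0.0)
            for k in BASE_WEIGHTS}

priors = {name: restricted_prior(active) for name, active in MODELS.items()}
```

Because the weights differ by orders of magnitude, any model that includes the linear kernels places most of its prior mass on positive linear relationships.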

Extrapolation performance

Predicting and explaining people’s capacity for extrapolation to novel stimuli is another key criterion for judging models of function learning. In Table 2, we compare mean human predictions for linear, exponential, and quadratic functions (from DeLosh et al. (1997)) to those of several models described in McDaniel and Busemeyer (2005), as well as each of the Gaussian process models we describe above and two model extensions that we will describe below. While none of the GP models produce quite as high a correlation as EXAM on all three functions, all but the Linear and LR models make predictions that correspond closely with human judgments. It is notable that this performance is achieved with the same parameters that were used for the difficulty of learning data (see Appendix B for details), while the predictions of EXAM were the result of optimizing two parameters for each of the three functions.

Table 2 Linear correlations between human and model predictions for extrapolation regions

Figure 5 displays mean human judgments for each of the three functions, along with the predictions of an extended Gaussian process model we discuss below, which incorporates Linear, Quadratic, and Nonlinear kernel functions. The regions to the left and right of the solid black lines represent extrapolation regions, containing input values for which neither people nor the model were trained. Both people and the model extrapolate nearly optimally on the linear function, and reasonably accurately for the exponential and quadratic functions. However, there is a bias towards a linear slope in the extrapolation of the exponential and quadratic functions, with extreme values of the quadratic and exponential functions being overestimated. Quantitative measures of extrapolation performance are shown in Table 2, which gives the correlation between human and model predictions for EXAM (DeLosh et al. 1997; Busemeyer et al. 1997) and the seven GP models.
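The correlation measure reported in Table 2 can be sketched as a Pearson correlation restricted to stimuli outside the training range. The training bounds, the function name, and the curves below are placeholders chosen for illustration, not values or data from the experiments.

```python
import numpy as np

def extrapolation_correlation(x, human, model, train_low, train_high):
    """Pearson correlation over x values outside [train_low, train_high]."""
    x, human, model = (np.asarray(a, dtype=float) for a in (x, human, model))
    outside = (x < train_low) | (x > train_high)
    return float(np.corrcoef(human[outside], model[outside])[0, 1])

# Placeholder curves, not data from any experiment.
x = np.linspace(0.0, 100.0, 21)
human = 0.02 * x**2
model_a = 3.0 * human + 5.0     # an affine transform of the "human" curve
model_b = -human                # the same curve with its sign flipped

r_a = extrapolation_correlation(x, human, model_a, 30.0, 70.0)
r_b = extrapolation_correlation(x, human, model_b, 30.0, 70.0)
```

Because correlation is invariant to positive affine transformations, a model can correlate perfectly with human judgments while still differing in scale or offset, which is why correlations are best read alongside the plotted predictions.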

Summary

We have shown that our model accounts well for the relative difficulty with which people learn different kinds of relationships, and how they extrapolate from limited training data. More complex phenomena, such as knowledge partitioning and the multiple overlapping relationships it entails, require more complex models. The next section addresses these phenomena, and describes a straightforward extension of our Gaussian process model to accommodate the possibility of multiple relationships while still explaining human interpolation and extrapolation behavior.

Extending the Gaussian process model beyond single relationships

In most models of function learning, including the Gaussian process-based models described above, it is assumed that people learn a single relationship between a variable and its predictors. There might be a complex, non-linear relationship between x and f(x), but for a single value of x, f(x) is always unimodal and relationships are never compositions of other relationships. We have mentioned that this assumption fails to describe many real relationships, and, as knowledge partitioning results show, it also fails to explain human behavior.

Of the models we have described, only the POLE model (Kalish et al. 2004) makes predictions that are consistent with knowledge partitioning phenomena, doing so by appealing to the mental representations and processes people use when learning functions. We will show that a rational analysis of function learning leads to a similar set of predictions. In many real-world situations, two variables x and y will relate to one another in different ways, depending on context. If y depends on w in addition to x, i.e., the true function is y=f(x,w), and w is not observable, the apparent relationship between x and y may have discontinuities, and it may not be a function at all, having multiple values of y for a given x. We previously discussed examples of such relationships, including acceleration in hybrid cars and dose-response curves in a patient population. Other examples of hidden mediators include the relationship between brake pressure and acceleration, mediated by surface slipperiness, and the relationship between the temperature of a material and its malleability, mediated by its unobserved crystal structure, as with the temper of a piece of metal. With these intuitions in hand, we will now describe how our model may be extended to reflect them.
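A toy simulation makes the point about hidden mediators concrete. The function y = 2x + 3w and the binary context variable w are assumptions invented for this illustration; once w is hidden, each x value is paired with two different y values.

```python
import numpy as np

x = np.tile(np.linspace(0.0, 1.0, 5), 2)    # each x value appears twice
w = np.repeat([0, 1], 5)                    # unobserved context variable
y = 2.0 * x + 3.0 * w                       # true function y = f(x, w), assumed form

# Group the observed y values by x, ignoring w.
y_by_x = {float(xv): sorted(y[x == xv].tolist()) for xv in np.unique(x)}
values_per_x = {xv: len(ys) for xv, ys in y_by_x.items()}
```

A learner who assumes a single function of x would have to average the two branches, whereas a learner who allows multiple relationships can recover both.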

Mixtures of Gaussian process experts

We extended our Gaussian process model (Model 1) to capture the assumption that each point belongs to one of an unknown number of underlying relationships. Clearly, there is no fixed bound on the number of relationships that might obtain between x and y, but one would expect that fewer relationships should be more plausible than more, as a matter of simplicity or parsimony (Chater and Vitanyi 2003). There are multiple ways to express this intuition formally, but one obvious choice is to allow points to be divided into arbitrary partitions, assigning each partition a probability using a Chinese Restaurant Process prior (Aldous 1985), which has previously been used in rational analyses of categorization (Anderson 1991; Sanborn et al. 2010).

Under this prior, the likelihood that a new (x,y) pair will be assigned to an existing relationship is proportional to the number of other points that participate in that relationship, and the likelihood that it will be assigned to a new relationship is proportional to a parameter α. More precisely, the probability that the i ^th point’s relationship r _i will be k is

$$\begin{array}{@{}rcl@{}} \Pr(r_{i} = k) & {} = \left\{\begin{array}{ll} \frac{n_{k}}{i+\alpha} & \text{if }n_{k} > 0, \\\\ \frac{\alpha}{i+\alpha} & \text{if }n_{k} = 0 \end{array}\right. \end{array} $$

(8)

where n _k is the number of points already participating in relationship k. The likelihood of the data under a given partition is determined by how likely the ensemble of y values is, given the nature of the relationships they participate in and their corresponding x values. This conceptually straightforward extension from Gaussian processes to a mixture of Gaussian processes will be called Model 2. We might also wish to capture the intuition that (x,y) pairs that have similar x values are more likely to participate in the same relationships; in other words, relationships tend to be locally smooth and unimodal. This expectation can be built into the model by assuming that the likelihood that a point belongs to a partition is determined in part by its closeness to current members, represented using the x-value’s likelihood under a Gaussian distribution based on existing members. This last model, Model 3, is an example of a mixture of experts (Jacobs et al. 1991; Erickson and Kruschke 1998; Kalish et al. 2004), an approach that has been applied to Gaussian processes in the past (Rasmussen and Ghahramani 2002; Meeds and Osindero 2006). As with Model 1, Models 2 and 3 can be interpreted in terms of Bayesian linear regression or Gaussian processes, where every Gaussian process kernel for every expert can be represented as a linear regression model, albeit, as before, with a potentially infinite number of features. See Fig. 6 for samples of the kinds of relationships that the mixture of Gaussian process experts (henceforth Model 3) favors.
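The two assignment rules can be sketched as follows. The first function gives prior probabilities proportional to n _k for existing relationships and to α for a new one, normalized over the points already assigned (treating the index in Eq. 8 this way is an assumption). The second adds the proximity term of Model 3. The floor on the standard deviation, the broad base distribution for a new relationship, and all function names are assumptions introduced for illustration, not details taken from the model specification.

```python
import numpy as np

def crp_probabilities(counts, alpha):
    """Prior over assignments: one entry per existing relationship, then a new one."""
    weights = np.append(np.asarray(counts, dtype=float), alpha)
    return weights / weights.sum()

def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def proximity_probabilities(x_new, members, alpha,
                            base_mean=0.0, base_sd=10.0, min_sd=0.5):
    """CRP prior reweighted by how close x_new is to each relationship's members."""
    prior = crp_probabilities([len(m) for m in members], alpha)
    like = [gaussian_pdf(x_new, np.mean(m), max(np.std(m), min_sd)) for m in members]
    like.append(gaussian_pdf(x_new, base_mean, base_sd))   # new relationship (assumed base)
    post = prior * np.array(like)
    return post / post.sum()

members = [[0.0, 0.5, 1.0], [9.0, 10.0]]     # x values in two existing relationships
prior_only = crp_probabilities([len(m) for m in members], alpha=1.0)
with_proximity = proximity_probabilities(9.5, members, alpha=1.0)
```

In this example the prior alone favors the larger relationship, but once proximity in x is taken into account the new point at x = 9.5 is assigned to the smaller, nearby relationship with high probability.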

Knowledge partitioning

Before applying Models 2 and 3 to knowledge partitioning phenomena, we evaluated them against the same difficulty-of-learning and extrapolation results with which we assessed our original Gaussian process models. As with the earlier models, we used the same parameters for all of the experiments, and obtained close fits to human judgments, summarized in Tables 1 and 2 (see Appendix B for details of parameters and fits). We also plotted predictions for Model 3 against mean human judgments in the extrapolation experiments in Fig. 5. In general, Models 2 and 3 performed as well as any other model, and better than the majority of the alternatives.

To gauge the extent to which the models’ predictions are consistent with knowledge partitioning phenomena, we obtained individual predictions from twelve participants in Kalish et al.’s (2004) studies, four per experiment.^{Footnote 3} Each experiment included training points and interpolation regions that were designed to elicit multiple modes in y for a given x. For example, in Experiment 1, there was a gap between two partial linear functions with the same slope and different intercepts. Many participants made judgments in the gap that matched both functions, leaving a bimodal response distribution. Like Kalish et al., we focus on showing that our model captures the bimodal responses of the participants, and gives a posterior distribution that matches the distribution of actual judgments.

The results are summarized in Fig. 7, which compares the probabilities that Models 1, 2, and 3 assign to different y values with the judgments given by participants. Model 1 predicts the aggregate trend in Kalish et al.’s Experiment 1, but cannot explain the discontinuities exhibited by two of the participants (shown in Fig. 1) or the multiple modes evident in participants’ judgments for Experiments 2 and 3. In contrast, Models 2 and 3 predict that multiple relationships will be inferred. Model 3, being sensitive to the proximity of points, is more likely than Model 2 to group points into local relationships, as is apparent in its predictions for Experiment 1. We used a single prior distribution across the different experiments and participants, but the individual differences in Fig. 1 are readily explained in terms of different participants having different inductive biases. Future work, with more extensive within-subjects data, would permit us to test our model as a framework for understanding how inductive biases vary between individuals.