A rational model of function learning

Every time we get into a rental car, we have to learn how hard to press the gas pedal for a given amount of acceleration. Solving this problem—which is an important part of driving safely—requires learning a relationship between two continuous variables. Over the past 50 years, several studies of function learning have shed light on how people come to understand continuous relationships (Carroll 1963; Brehmer 1971; 1974; Koh and Meyer 1991; Busemeyer et al. 1997; DeLosh et al. 1997; Kalish et al. 2004; McDaniel and Busemeyer 2005). It has become clear that people can learn and recall a wide variety of relationships, but demonstrate certain systematic biases that tell us about the mental representations and implicit assumptions that humans employ when solving function learning problems. For example, people tend to expect that relationships will be linear when extrapolating to novel examples (DeLosh et al. 1997), and find it more difficult to learn relationships that change direction than those that do not (Brehmer 1974; Byun 1995).

Several models have been developed to understand the cognitive mechanisms behind function learning. These models tend to fall into two different theoretical camps. The first includes rule-based theories (e.g., Carroll 1963; Brehmer 1974; Koh and Meyer 1991), which suggest that people learn an explicit function from a given family, such as polynomials (Carroll 1963; McDaniel and Busemeyer 2005) or power-law functions (Koh and Meyer 1991). This approach attributes rich representations to human learners, but has traditionally given limited treatment to how such representations could be acquired. A second approach includes similarity-based theories (e.g., DeLosh et al. 1997; Busemeyer et al. 1997), which focus on the idea that people learn by forming associations: if x is used to predict y, observations with similar x values should also have similar y values. This approach can be straightforwardly implemented in a connectionist architecture and thus gives an account of the underlying learning mechanisms, but faces challenges in explaining how people generalize so broadly beyond their experience. Most recently, hybrids of these two approaches have been proposed (e.g., Kalish et al. 2004; McDaniel and Busemeyer 2005), with an associative learning process that acts on explicitly represented functions.

Almost all past research on computational models of function learning has been oriented towards understanding the psychological processes that underlie human performance, or the steps by which people update and deploy their mental representations of continuous relationships. In this paper, we take a different approach, presenting a rational analysis of function learning in the spirit of Anderson (1990), Marr (1982), and Shepard (1987). Specifically, we start with an abstract representation of the problem to be solved and a handful of additional assumptions about the nature of continuous relationships, and then explore optimal solutions to the problem in light of these assumptions with the goal of shedding light on human behavior. This rational analysis provides a way to understand the relationship between the rule- and similarity-based approaches that have dominated previous work and suggests how they might be combined. Whereas hybrid models apply similarity-based learning to explicit rules, we offer a single foundation that supports both approaches, using a common set of commitments about learning and representation.

To understand the abstract problem that a function learner faces, we can turn to machine learning and statistics, where prediction in continuous domains—a problem familiarly known as regression—has been studied extensively. There are a variety of solutions to regression problems, but we focus on methods related to Bayesian linear regression (e.g., Bernardo and Smith, 1994), which allow us to make and test explicit claims about learners’ expectations, using probability distributions. Bayesian linear regression is also directly related to a nonparametric approach known as Gaussian process prediction (e.g., Williams, 1998), in which predictions about the values of an output variable are based on the similarity between values of an input variable. We use this relationship to connect the two traditional approaches to modeling function learning, as it shows that learning rules that describe functions and specifying the similarity between stimuli for use in associative learning are not mutually exclusive alternatives, but rather two views of the same solution. We exploit this fact to define a rational model of human function learning that incorporates the strengths of both approaches.

The plan of this paper is as follows. First, we review several sets of empirical phenomena in function learning, both to provide background and to establish criteria by which different theories of function learning can be judged. We then review past models of function learning, dividing them into rule-based, similarity-based, and hybrid approaches. Next, we introduce a new perspective on function learning in which rules and similarity can be expressed in a common framework, and describe a model that follows from this perspective. Finally, we evaluate different variations on our model against one another and previous models.

Phenomena in function learning

Past studies have taken diverse approaches to understanding how people learn relationships between continuous variables, but we will focus on four kinds of empirical phenomena that have been used in previous tests of function learning models (e.g., McDaniel and Busemeyer, 2005), or explicitly measure what kinds of relationships people implicitly believe to be more or less likely (Kalish et al. 2007), or challenge many models of function learning (Kalish et al. 2004). Our decision to focus on the following phenomena is also motivated by their being relatively comparable, coming from similar experimental designs involving randomly ordered, sequentially presented training stimuli, in the absence of informative cover stories or contextual information. In this section, we review these four kinds of phenomena, which we will later use to evaluate our own approach to explaining and understanding function learning.

Interpolation and learning difficulty

Some kinds of relationships are easier to learn than others. For example, increasing linear relationships tend to be easier to learn than decreasing linear relationships (Brehmer 1971; 1976). Similarly, linear relationships are typically easier to learn than non-linear ones (Brehmer 1974; Brehmer et al. 1985; Byun 1995; see Koh and Meyer (1991) for a possible counterexample). Among non-linear relationships, people have more difficulty learning those that change direction (Brehmer 1974; Brehmer et al. 1985; Byun 1995). Cyclic relationships are especially difficult, but not impossible, to learn (Bott and Heit 2004; Byun 1995; Kalish 2013). These systematic differences suggest that some relationships are subjectively simpler, more common, or more straightforwardly represented than others, and the patterns given above dovetail with explicit human judgments about the probabilities of different kinds of relationships (Brehmer 1974).

If the difficulty of learning a relationship reflects its mental representation, one can evaluate a model of function learning by comparing its average error rates to those of humans across several kinds of relationships. More precisely, if one orders several relationships by the average magnitude of errors that humans make when predicting y for x values that fall between past examples, i.e., interpolating, a good model should show the same ordering in its prediction error. For humans, these errors are influenced by many factors, such as the match or mismatch of cover stories to the available data, the number of training points, and presentation order (Byun 1995), but we will focus on properties of the relationships themselves, which provide a simple basis for evaluating different theories of function learning. For instance, relationships in which y increases as a function of x tend to be easier to learn than functions in which y decreases as a function of x, which are in turn easier to learn than non-monotonic functions. For a summary of some qualitative properties of functions that contribute to differential learning difficulty for humans, see Busemeyer et al. (1997). In our own evaluation, we will use data from several studies that were gathered by McDaniel and Busemeyer (2005) and are summarized in Table 1.

Table 1 Difficulty of learning results based on experiments reviewed in McDaniel and Busemeyer 2005

Extrapolation

Studies that measure interpolation errors allow relationships to be ranked by how easy they are to learn, with implications for those relationships’ subjective probability and consistency with humans’ mental representations. Unfortunately, quite different models can show similar patterns of errors (given a limited set of relationship types), which constrains the amount one can learn from this approach. This and other limitations of interpolation-error studies have led some researchers to focus on how people extrapolate, or make judgments about points that are distant from those seen before. This approach gives a greater share of influence to learners’ prior beliefs, and makes it possible to uncover patterns that are not reflected in interpolation error rates. To date, extrapolation-based studies of function learning are comparatively sparse, but have revealed several biases in human learners. For example, people’s extrapolation judgments follow linear patterns (DeLosh et al. 1997; but see Kalish et al. 2004), and more specifically tend toward functions with a positive slope and an intercept of zero (Kwantes and Neal 2006). In one instance of this bias, when people are trained using data from a quadratic function, their average predictions fall between the true function and straight lines fitted to the closest training points.

Learning multiple relationships

The term “function learning” suggests that relationships between continuous variables (or at least the representations that people form of them) are functions, in that for a given value of the predictor x, there is a single valid prediction, or at least a range of predictions with a single most-likely value or mode. In reality, this is not always the case. For example, dose responses for drugs might have two or more patterns, depending on unobserved genetic factors or patient histories, and some hybrid cars have different relationships between pressure on the accelerator and the car’s real acceleration, depending on whether or not the combustion engine is active. The world abounds with hidden mediators that can change the relationship between observable variables, and one might expect humans to be able to make judgments that reflect the presence of multiple underlying relationships. Consistent with this intuition, Lewandowsky, Kalish, and Ngang (2002) found that fire fighters learn two distinct relationships between wind speed, ground slope, and the rate at which a fire spreads, depending on whether the fire is labeled as a standard forest fire, or a “back burn” fire set to mitigate damage from future fires. Lewandowsky et al. refer to this phenomenon as “knowledge partitioning”, based on the idea that participants’ knowledge of the relationship at hand is partitioned into distinct subsets based on context.

More recently, Kalish, Lewandowsky, and Kruschke (2004) conducted three experiments showing that people make judgments that demonstrate an implicit belief in the presence of multiple overlapping linear relationships, even when no contextual information was present, and in circumstances where the training data could be explained using a single non-linear relationship (see Fig. 1 for examples).

Fig. 1

Training data and four participants’ judgments for Experiments 1–3 in Kalish et al. (2004). Predictor variable values are plotted on the x-axes, with predicted variable values plotted on the y-axes

Iterated learning

Iterated learning is an experimental method that was first developed for studying language evolution (Kirby 2001), but it has more recently been applied to other phenomena, including function learning. In an iterated learning experiment, there are chains of learners where the first learner in each chain receives data, makes some inference on the basis of those data, and uses that inference to provide new data to the next learner in the chain. The data produced by each learner is the product of the data he or she receives and his or her inductive biases or expectations about the underlying relationship, item, or event. As the chain of learners grows longer, the influence of the learners’ shared expectations eventually washes out the information carried by the data provided to the first learner. After enough iterations, the data carried forward in the chain reflect human expectations about what relationships are likely, rather than the data the first learner in the chain sees, providing useful information about how people represent and reason about the phenomena at hand (Kalish et al. 2007).

Figure 2 shows the results of a set of iterated function learning experiments conducted by Kalish et al. (2007). There were four conditions that differed in what data were given to the first participants in the chains. The positive linear (A) chains started with a linear relationship with a slope of one and an intercept of zero, the negative linear (B) chains started with a linear relationship with a slope of negative one and an intercept of zero, the U-shaped (C) chains started with data from a U-shaped relationship, and the random (D) chains started with a disorganized collection of points without any apparent underlying regularity. Kalish et al. (2007) found that the judgments of later participants tended to converge to a positive linear relationship with a slope of one and an intercept of zero regardless of the initial data. While these convergence results dovetail with past findings indicating that positive linear relationships are easier to learn, the intermediate states of the chains provide a more detailed view of function learning. For example, learners tended to preserve negative linear relationships, consistent with the idea that people think these relationships are likely or plausible. Further, many learners were quick to infer the presence of multiple overlapping relationships, as when some participants interpreted noisy data as evidence for a negative linear relationship superimposed on a positive one.

Fig. 2

Plots of results from Kalish et al. (2007). A Positive linear initial data; B Negative linear initial data; C U-shaped initial data; D Random initial data

Models of human function learning

The phenomena described in the previous section have inspired several theories and models of function learning, which can be organized into three classes: those based on rules or explicit functions, those based on associative or similarity-based learning, and hybrids that use explicit representations and associative learning. In this section, we review each class in turn, before discussing the extent to which each is consistent with the empirical results described above.

Representing functions with rules

Some of the earliest research into function learning postulates that people learn continuous relationships using explicitly represented functions (Carroll 1963). Carroll proposed that people assume a particular class of functions (such as polynomials of degree k) and use the available observations to estimate the parameters of those functions. The resulting representation allows people to generalize beyond the observed values of the variables involved. Consistent with the version of this hypothesis that Carroll advanced, people learned linear and quadratic functions better than random pairings of values for two variables, and extrapolated appropriately. Similar assumptions have guided subsequent work, which has explored the ease with which people learn different kinds of functions (e.g., Brehmer, 1974), and examined how well human responses are described by different forms of nonlinear regression (e.g., Koh and Meyer, 1991).

The advent of rule-based models precedes most of the empirical results we consider, so it may be unsurprising that these models face some difficulty in explaining those results. Rule-based models do not show the flexibility in interpolation that human learners exhibit, and tend not to predict the order-of-difficulty found in interpolation studies (McDaniel and Busemeyer 2005). Similarly, there is evidence that rule-based models (such as that of Koh and Meyer (1991)) make extrapolation predictions that diverge from human judgments (DeLosh et al. 1997). Purely rule-based models make no provision for multiple overlapping relationships, and thus cannot account for knowledge partitioning effects (Kalish et al. 2004). By extension, their ability to explain Kalish, Griffiths, and Lewandowsky’s (2007) iterated learning results is limited: while rule-based models might be able to explain long-run convergence to positive linear relationships, they do not anticipate participants’ multimodal judgments.

Similarity and associative learning

Associative learning models propose that people do not learn relationships between continuous variables by explicitly learning rules, but instead forge associations between observed events and generalize based on the similarity of new variable values to old. The first model to implement this approach was the Associative Learning Model (ALM; DeLosh et al., 1997, Busemeyer et al., 1997), in which input and output arrays are used to represent a range of values for the variables between which the functional relationship holds. Presentation of an input activates input nodes close to that value, with activation falling off as a Gaussian function of distance, implementing a theory of similarity in the input space.

Learned weights determine the activation of the output nodes, which is a linear function of the activation of the input nodes. Weights are learned by gradient descent, where the local relationship between weights and errors is used to find new weights that reduce the squared error of the model’s predictions. This process is repeated until the error can no longer be reduced. In practice, this approach performs well when interpolating between observed values, but poorly when extrapolating beyond those values, as it does not capture humans’ ability to extrapolate in systematic, structured ways. As a consequence, DeLosh et al. (1997) introduced the EXAM model, which constructs a linear approximation to the output of the ALM when selecting responses.
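The ALM learning scheme described above can be sketched in a few lines. This is a minimal illustration rather than the published specification: the node grids, the width parameter `gamma`, the learning rate, the number of epochs, and the mean-of-output-nodes response rule are all assumptions chosen to keep the example small, and the EXAM linear-approximation step is not included.

```python
import numpy as np

def gaussian_activation(value, nodes, gamma):
    """Activation of each node, falling off as a Gaussian function of distance."""
    return np.exp(-gamma * (value - nodes) ** 2)

def train_alm(xs, ys, in_nodes, out_nodes, gamma=800.0, lr=0.5, epochs=200):
    """Learn input-to-output weights by gradient descent on squared error (delta rule)."""
    W = np.zeros((out_nodes.size, in_nodes.size))
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            a = gaussian_activation(x, in_nodes, gamma)        # input layer
            target = gaussian_activation(y, out_nodes, gamma)  # teaching signal
            o = W @ a                                          # linear output layer
            W += lr * np.outer(target - o, a)                  # step that reduces squared error
    return W

def predict_alm(x, W, in_nodes, out_nodes, gamma=800.0):
    """Respond with the activation-weighted mean of the output node values (assumed rule)."""
    o = W @ gaussian_activation(x, in_nodes, gamma)
    return float(out_nodes @ o / o.sum())

# Hypothetical setup: 21 evenly spaced nodes on [0, 1], training on y = x.
in_nodes = np.linspace(0.0, 1.0, 21)
out_nodes = np.linspace(0.0, 1.0, 21)
train_x = np.linspace(0.2, 0.8, 13)
train_y = train_x.copy()
W = train_alm(train_x, train_y, in_nodes, out_nodes)
alm_prediction = predict_alm(0.5, W, in_nodes, out_nodes)
```

Within the training range the sketch reproduces the trained relationship. Far outside it, few input nodes are active and the output activations shrink toward zero, which is one way to see why a purely associative learner of this kind extrapolates poorly.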

Similarity-based models have seen mixed success in explaining the range of empirical phenomena we describe above. In studies of interpolation and learning difficulty, similarity-based models show similar patterns of interpolation errors to those of humans (McDaniel and Busemeyer 2005). ALM does not address extrapolation, whereas EXAM was developed with extrapolation results in mind. EXAM effectively captures the human bias toward linearity and predicts human extrapolations over a variety of relationships (McDaniel and Busemeyer 2005), but it does not account for the human capacity for non-linear extrapolation (Bott and Heit 2004). Like rule-based models, similarity-based models make unimodal predictions for any given x, and thus fail to account for knowledge partitioning results. This limitation also prevents EXAM from capturing some of the intermediate patterns that people produce in the iterated learning experiment.

Hybrid approaches

Several studies have explored methods for combining rule-like representations of functions with associative learning. One example of such an approach is the set of models explored in McDaniel and Busemeyer (2005). These models used the same kind of input representation as ALM and EXAM, with activation of a set of nodes similar to the input value. However, the models also feature a set of hidden units, where each hidden unit corresponds to a different parameterization of a rule from a given class, including polynomial, Fourier, and logistic functions. The values of the hidden units—corresponding to the values of the rules they instantiate—are combined linearly to obtain output predictions, with the weight of each hidden node being learned through gradient descent.

Another instance of a hybrid approach is the POLE model (Kalish et al. 2004), in which hidden units represent different linear functions and the weights from inputs to hidden nodes indicate which linear function should be used to make predictions for particular input values. Using this representation, the model can learn non-linear functions by identifying a series of local linear approximations, and can even model situations in which people seem to learn different functions in different parts of the input space. As a result, it is unique among the models we have discussed in its ability to match the bimodal response distributions discovered by Kalish et al. (2004).

Hybrid rule- and similarity-based models form a more heterogeneous group than similarity- and rule-based models, with representatives including POLE (Kalish et al. 2004) and McDaniel and Busemeyer’s (2005) connectionist implementations of rule-based models. POLE is set apart from the other models we have discussed by its ability to capture knowledge partitioning effects, and it demonstrates a similar ordering of error rates to those of human learners (McDaniel et al. 2009). In its extrapolation predictions, however, there is evidence that it deviates from human performance (McDaniel et al. 2009). In an iterated learning design, POLE showed both convergence to positive linear relationships and some of the qualitative patterns that human learners demonstrate (depicted in Fig. 3, panel II), including transitional states with overlapping positive and negative linear relationships. McDaniel and Busemeyer’s hybrid polynomial model (which performed better than the alternative hybrid models they considered) demonstrates an ordering of interpolation errors on different functions that aligns only roughly with human judgments (see Table 1), but its extrapolation predictions are consistent with human judgments from McDaniel and Busemeyer’s studies (McDaniel and Busemeyer 2005). Like rule-based models, this model offers unimodal predictions and thus cannot account for knowledge partitioning phenomena, and it has not been evaluated against iterated learning results.

Fig. 3

Model predictions for iterated learning data. A–D denote positive linear, negative linear, U-shaped, and random initial data, respectively. (I) Predictions from EXAM; (II) Predictions from POLE; (III) Mean function estimates from our Model 3, removing noise

Summary

We have reviewed a diverse set of models that accurately predict a variety of empirical phenomena in function learning. Despite their different commitments about how humans learn continuous relationships, a common theme of these models is an emphasis on the process by which function learning occurs. In the next section, we will take a fundamentally different view, focusing on the abstract problem of function learning and the forms that good solutions to that problem should take, rather than the process. This view complements past models rather than supplanting them, and we will demonstrate that it provides a common framework with which to understand and unify rule- and similarity-based approaches.

Rational solutions to regression problems

The models outlined in the previous section all aim to describe the psychological processes involved in human function learning. In this section, we consider the abstract computational problem underlying this task, using optimal solutions to this problem to shed light on both previous models and human learning. Viewed abstractly, the computational problem behind function learning is to use a set of real-valued observations \(\mathbf{x}_{n} = (x_{1}, \ldots, x_{n})\) and \(\mathbf{t}_{n} = (t_{1}, \ldots, t_{n})\) to predict which \(y_{n+1}\) goes with a new \(x_{n+1}\). Here, the y-values correspond to the underlying relationship, and the t-values are observations of y that have been obscured by additive noise, so \(y_{n+1} = \mathbb{E}[t_{n+1}]\). Following much of the literature on human function learning, we consider only one-dimensional relationships, but this approach generalizes naturally to the multi-dimensional case. In machine learning and statistics, this is referred to as a regression problem. In this section, we discuss how regression problems can be solved using Bayesian statistics, and how the result of this approach is related to Gaussian processes, a formalism with close ties to associative learning. Our presentation follows that in Williams (1998). See Appendix A for a more thorough treatment of the mathematical details.

Bayesian linear regression

Ideally, we would seek to solve our regression problem by using not just the observations of x and t, but some prior beliefs about the probability of encountering different kinds of functions f(⋅) in the world. We can do this by applying Bayes’ rule, with

$$ p(f|\mathbf{x}_{n}, \mathbf{t}_{n}) = \frac{p(\mathbf{t}_{n}|f, \mathbf{x}_{n})p(f)}{{\int}_{\mathcal{F}} p(\mathbf{t}_{n}|f, \mathbf{x}_{n})p(f) df}. $$
(1)

Knowledge of which functions in the space of possibilities \(\mathcal{F}\) are more likely to be the true function is captured by \(p(f)\), the prior distribution. The probability of observing the values of \(\mathbf{t}_{n}\) if f were the true function is given by the likelihood function \(p(\mathbf{t}_{n}|f, \mathbf{x}_{n})\), and the probability that f is the true function given the observations \(\mathbf{x}_{n}\) and \(\mathbf{t}_{n}\) is the posterior distribution \(p(f|\mathbf{x}_{n}, \mathbf{t}_{n})\). In most regression models, the likelihood is defined by assuming that any deviation from the true function is due to many independent sources of noise; more specifically, that \(t_{i}\) is Gaussian with mean \(y_{i} = f(x_{i})\) and variance \({\sigma^{2}_{t}}\). Predictions about the value of the function f for a new input \(x_{n+1}\) can be made by integrating over all functions in the posterior distribution,

$$ p(y_{n+1}|x_{n+1},\mathbf{t}_{n}, \mathbf{x}_{n}) = {\int}_{f} p(y_{n+1}|f, x_{n+1})p(f|\mathbf{x}_{n}, \mathbf{t}_{n}) df $$
(2)

where \(p(y_{n+1}|f, x_{n+1})\) is a delta function placing all of its mass on \(y_{n+1} = f(x_{n+1})\). Performing the integration outlined above can be challenging, but it becomes straightforward if we limit the hypothesis space to certain specific classes of functions. If we take \(\mathcal{F}\) to be all linear functions of the form \(y = b_{0} + x b_{1}\), then our problem takes the familiar form of linear regression. To perform Bayesian linear regression, we need to define a prior \(p(f)\) over all linear functions. Since these functions are identified by the parameters \(b_{0}\) and \(b_{1}\), it is sufficient to define a prior over \(\mathbf{b} = (b_{0}, b_{1})\). We can do this by assuming that \(\mathbf{b}\) follows a multivariate Gaussian distribution, which results in a posterior distribution over \(\mathbf{b}\) that is also a multivariate Gaussian (see Bernardo and Smith (1994)). Linear transformations of Gaussian distributions are also Gaussian, so the predictive density in Eq. 2 is also Gaussian, and the noise introduced between the true values y and the observations t simply adds to the variance of this distribution.
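To make the linear case concrete, the following sketch computes the Gaussian posterior over the two weights and the resulting Gaussian prediction at a new input. It assumes a zero-mean prior and a known noise variance; the function names, the prior covariance, and the toy data are illustrative choices, not values taken from our analysis.

```python
import numpy as np

def blr_posterior(x, t, prior_cov, noise_var):
    """Posterior mean and covariance of b = (b0, b1) under a zero-mean Gaussian prior."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with rows [1, x_i]
    precision = np.linalg.inv(prior_cov) + X.T @ X / noise_var
    post_cov = np.linalg.inv(precision)
    post_mean = post_cov @ X.T @ t / noise_var
    return post_mean, post_cov

def blr_predict(x_new, post_mean, post_cov, noise_var):
    """Predictive mean of y at x_new, and the variance of a noisy observation t there."""
    phi = np.array([1.0, x_new])
    mean = phi @ post_mean
    var_y = phi @ post_cov @ phi     # uncertainty about the function value itself
    return mean, var_y + noise_var   # observation noise adds to the variance

# Toy data generated from y = 1 + 2x (noise-free, for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x
post_mean, post_cov = blr_posterior(x, t, prior_cov=100.0 * np.eye(2), noise_var=0.01)
pred_mean, pred_var = blr_predict(4.0, post_mean, post_cov, noise_var=0.01)
```

With a broad prior such as this one, the posterior mean stays close to the least-squares estimate; a tighter prior would pull the weights toward zero.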

While considering only linear functions might seem overly restrictive, linear regression actually gives us the basic tools we need to solve this problem for more general classes of functions. Many classes of functions can be described as linear combinations of a small set of basis functions. For example, all kth degree polynomials are linear combinations of functions of the form 1 (the constant function), \(x, x^{2}, \ldots, x^{k}\). Letting \(\phi^{(1)}, \ldots, \phi^{(k)}\) denote a set of functions, we can define a prior on the class of functions that are linear combinations of this basis by expressing such functions in the form \(f(x) = b_{0} + \phi^{(1)}(x) b_{1} + \ldots + \phi^{(k)}(x) b_{k}\) and defining a prior on the vector of weights \(\mathbf{b}\). As long as the prior over weights is Gaussian, the same results apply as in the simple linear case.

Gaussian processes

Another approach to regression problems is to forgo any explicit representation of the underlying function and focus on making predictions. If our goal is merely to predict \(y_{n+1}\) using \(x_{n+1}\), \(\mathbf{t}_{n}\), and \(\mathbf{x}_{n}\), we might simply define a joint distribution on \(\mathbf{t}_{n+1}\) given \(\mathbf{x}_{n+1}\), condition on \(\mathbf{t}_{n}\), and find the expected value of \(t_{n+1}\), which is equal to \(y_{n+1}\):

$$ p(t_{n+1}|x_{n+1},\mathbf{x}_{n},\mathbf{t}_{n}) = \frac{p(\mathbf{t}_{n+1}|x_{n+1},\mathbf{x}_{n})}{p(\mathbf{t}_{n}|x_{n+1},\mathbf{x}_{n})}. $$
(3)

This equation expresses the problem of regression in very general terms, and may, at first glance, seem daunting to compute: it involves defining a joint distribution over all of the points observed so far, as well as the joint distribution including the new, unknown point. Further, if we want to predict \(y_{n+1}\), we must be able to take the expectation of this quotient. However, in some circumstances, the probability of \(t_{n+1}\) given \(x_{n+1}\), \(\mathbf{x}_{n}\), and \(\mathbf{t}_{n}\) has a straightforward analytical solution, and an easily computed expectation. One such case, which will be our focus here, is when all of the values in \(\mathbf{t}_{n+1}\) are jointly Gaussian. In other words, \(\mathbf{t}_{n+1}\) is distributed according to a single multivariate Gaussian, with dimensionality corresponding to the number of points under consideration. This distribution is determined by its mean vector and covariance matrix, and once these are specified, we have a solution for Eq. 3: the quotient has a closed form for multivariate Gaussians (see Rasmussen and Williams, 2006, for details). As we will see, assuming a jointly Gaussian distribution is not a strong constraint, and we can express a very broad set of relationships through our choice of means and covariances.

Both the mean vector and the covariance matrix are determined by the values of x. Broadly speaking, the mean vector captures expectations about how the function looks in the absence of data, and the covariance matrix (or the kernel function that generates it) captures expectations about how points relate to one another. The covariance between any pair of t-values \((t_{i}, t_{j})\) is given by a function \(K(x_{i}, x_{j})\), and a diagonal matrix capturing the noisy relationship between the underlying values \(y_{i}\) and the observations \(t_{i}\) is added to the resulting matrix. Using this covariance matrix, we can obtain the distribution of \(t_{n+1}\) conditional on \(\mathbf{t}_{n}\). The function \(K(\cdot,\cdot)\), called the kernel function, can be chosen arbitrarily as long as the covariance matrix it produces is valid.
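As a worked version of this conditioning step, suppose for illustration that the mean vector is zero. The symbols below are introduced only for this example and may differ from the notation in Appendix A: \(\mathbf{C}_{n}\) is the covariance matrix of \(\mathbf{t}_{n}\) (kernel values plus \({\sigma^{2}_{t}}\) on the diagonal), \(\mathbf{k}\) is the vector with entries \(K(x_{i}, x_{n+1})\) for \(i = 1, \ldots, n\), and \(\kappa = K(x_{n+1}, x_{n+1}) + {\sigma^{2}_{t}}\). The standard result for conditioning a multivariate Gaussian then gives a Gaussian distribution for \(t_{n+1}\) with

$$ \mathbb{E}[t_{n+1}|\mathbf{t}_{n}] = \mathbf{k}^{T}\mathbf{C}_{n}^{-1}\mathbf{t}_{n}, \qquad \text{Var}[t_{n+1}|\mathbf{t}_{n}] = \kappa - \mathbf{k}^{T}\mathbf{C}_{n}^{-1}\mathbf{k}. $$

The predicted value is thus a weighted combination of the observed t-values, with weights that depend only on the kernel evaluated at the observed and new inputs.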

One common kind of kernel is a radial basis function, e.g.,

$$ K(x_{i},x_{j}) = {\theta_{1}^{2}}\exp(-\frac{1}{{\theta_{2}^{2}}}(x_{i}-x_{j})^{2}) $$
(4)

which leads to t values that are more strongly correlated when their corresponding x values are more similar, with the parameter \(\theta_{1}\) setting the overall scale of the covariance and \(\theta_{2}\) determining how quickly the correlation falls off as differences in x values increase. Other kernels are possible, including periodic functions such as

$$ K(x_{i},x_{j}) = {\theta_{3}^{2}}\exp\left({\theta_{4}^{2}}\left(\cos\left(\frac{2\pi}{\theta_{5}}[x_{i}-x_{j}]\right)\right)\right) $$
(5)

indicating that values of y for which values of x are close relative to the period \(\theta_{5}\) are likely to be highly correlated.

This approach to prediction, in which a kernel function applied to x defines a normal distribution on t-values, is called a Gaussian process. A wide variety of kernel functions are possible, corresponding to varied commitments about which x values are likely to lead to similar t-values, making Gaussian processes a flexible way to solve regression problems.
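The sketch below illustrates Gaussian process prediction with the two kernels in Eqs. 4 and 5. It assumes a zero mean function and a known noise variance, and the parameter values, function names, and training points are placeholders chosen for illustration rather than settings used in our simulations.

```python
import numpy as np

def rbf_kernel(xa, xb, theta1=1.0, theta2=1.0):
    """Radial basis function kernel of Eq. 4, evaluated for all pairs of inputs."""
    diff = xa[:, None] - xb[None, :]
    return theta1**2 * np.exp(-(diff**2) / theta2**2)

def periodic_kernel(xa, xb, theta3=1.0, theta4=1.0, theta5=1.0):
    """Periodic kernel of Eq. 5; theta5 is the period."""
    diff = xa[:, None] - xb[None, :]
    return theta3**2 * np.exp(theta4**2 * np.cos(2.0 * np.pi * diff / theta5))

def gp_predict(x_train, t_train, x_new, kernel, noise_var):
    """Posterior mean and variance of y at x_new, conditioning on noisy observations."""
    C = kernel(x_train, x_train) + noise_var * np.eye(x_train.size)  # covariance of t_n
    k = kernel(x_train, x_new)                                       # shape (n, m)
    mean = k.T @ np.linalg.solve(C, t_train)
    var = np.diag(kernel(x_new, x_new)) - np.sum(k * np.linalg.solve(C, k), axis=0)
    return mean, var

# Toy example: three observations of sin(x), predicting near and far from the data.
x_train = np.array([-1.0, 0.0, 1.0])
t_train = np.sin(x_train)
x_new = np.array([1.0, 10.0])
gp_mean, gp_var = gp_predict(x_train, t_train, x_new, rbf_kernel, noise_var=1e-4)
```

Near the training data the prediction follows the observations closely and the variance is small. Far from the data the radial basis function kernel provides no information, so the prediction reverts to the zero prior mean with the prior variance. Swapping in `periodic_kernel` changes which inputs count as similar without changing the prediction code.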

Two views of regression

Bayesian linear regression and Gaussian processes appear to be quite different approaches. In Bayesian linear regression, a hypothesis space of functions is identified, a prior on that space is defined, and predictions are formed by averaging over the posterior distribution of y, while Gaussian processes simply use the similarity between different values of x, as expressed through a kernel, to predict correlations in values of y. It might thus come as a surprise that these approaches are equivalent.

Showing that Bayesian linear regression corresponds to Gaussian process prediction is straightforward. The assumption of linearity means that the vector y n+1 is equal to X n+1 b. Given normally distributed weights, it follows that p(y n+1|x n+1) is a multivariate Gaussian distribution with mean zero and covariance matrix \(\mathbf {X}_{n+1} \mathbf {\Sigma }_{b} \mathbf {X}_{n+1}^{T}\). Bayesian linear regression thus corresponds to prediction using Gaussian processes, with this covariance matrix playing the role of K n+1 above (i.e., using the kernel function K(x i ,x j )=[1 x i ][1 x j ]T). Using a richer set of basis functions corresponds to taking \(\mathbf {K}_{n+1} =\mathbf {\Phi }_{n+1} \mathbf {\Sigma }_{b} \mathbf {\Phi }_{n+1}^{T}\), i.e.,

$$ K(x_{i},x_{j}) = \left[ 1 \ \phi^{(1)}(x_{i}) \ {\ldots} \ \phi^{(k)}(x_{i})\right] \left[ 1 \ \phi^{(1)}(x_{j})\ {\ldots} \ \phi^{(k)}(x_{j})\right]^{T}, $$
(6)

where ϕ (1...k) are k arbitrary functions of x (Williams 1998). It is also possible to show that Gaussian process prediction can always be interpreted as Bayesian linear regression, albeit with a potentially infinite number of basis functions. Just as we can express a covariance matrix in terms of its eigenvectors and eigenvalues, we can express a given kernel K(x i ,x j ) in terms of its eigenfunctions ϕ and eigenvalues λ, with

$$ K(x_{i},x_{j}) = \sum\limits_{k=1}^{\infty} \lambda_{k} \phi^{(k)}(x_{i}) \phi^{(k)}(x_{j}) $$
(7)

for any x i and x j (Minh et al. 2006). Thus, any kernel can be viewed as the result of performing Bayesian linear regression with a set of basis functions corresponding to its eigenfunctions, and a prior with covariance matrix \(\mathbf {\Sigma }_{b} = \text{diag}(\lambda)\).

These results establish an important duality between Bayesian linear regression and Gaussian processes: for every prior on functions, there exists a corresponding kernel, and for every kernel, there exists a corresponding prior on functions. Bayesian linear regression and prediction with Gaussian processes are thus just two views of the same solution to regression problems.
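The duality can be checked numerically. The sketch below is an illustration under assumed values (the simulated data, the noise variance, and the prior covariance `Sigma_b` are made up): it computes predictions once in weight space and once in function space with the kernel built from the same features, and then shows the reverse direction on a finite grid, where the eigenvectors of a radial basis Gram matrix play the role of the eigenfunctions in Eq. 7.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
t = 0.3 + 0.8 * x + 0.05 * rng.standard_normal(10)   # made-up training data
x_new = np.array([0.25, 1.5])
noise_var = 0.05**2              # assumed known observation noise
Sigma_b = np.diag([1.0, 2.0])    # assumed prior covariance on [intercept, slope]

def features(x):
    """Linear basis [1, x], one row per input."""
    return np.column_stack([np.ones_like(x), x])

# View 1 (weight space): posterior mean of b under a zero-mean Gaussian prior.
Phi = features(x)
A = Phi.T @ Phi / noise_var + np.linalg.inv(Sigma_b)
b_mean = np.linalg.solve(A, Phi.T @ t / noise_var)
pred_weights = features(x_new) @ b_mean

# View 2 (function space): Gaussian process with K = Phi Sigma_b Phi^T.
K = Phi @ Sigma_b @ Phi.T + noise_var * np.eye(len(x))
k_star = Phi @ Sigma_b @ features(x_new).T
pred_gp = k_star.T @ np.linalg.solve(K, t)

# Reverse direction on a finite grid: eigenvectors of a kernel matrix act as
# basis functions, with the eigenvalues as prior variances on their weights.
grid = np.linspace(0, 1, 20)
K_rbf = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / 0.1)
eigvals, eigvecs = np.linalg.eigh(K_rbf)
K_rebuilt = (eigvecs * eigvals) @ eigvecs.T   # sum_k lambda_k phi_k phi_k^T
```

The two prediction vectors agree up to floating-point error, and the rebuilt matrix matches the original Gram matrix, which is the finite-dimensional analogue of the eigenfunction expansion.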

Combining rules and similarity through Gaussian processes

The results outlined in the previous section suggest that, in the context of regression, learning using rules (as expressed in a Bayesian linear regression model) and generalizing based on similarity (as expressed in a Gaussian process’s kernel function) are mutually compatible points of view. In this section, we briefly describe how previous accounts of function learning connect to these statistical models, and then use this insight to define a model of human function learning that combines the strengths of both approaches.

Reinterpreting previous accounts of human function learning

The idea of human function learning as a kind of statistical regression connects directly to Bayesian linear regression. Many rule-based models (e.g., Carroll, 1963; Koh and Meyer, 1991) can be framed in terms of Bayesian linear regression while retaining all of their basic commitments and predictions. Similarly, the basic ideas behind Gaussian process regression (with a standard radial-basis kernel function) lie at the heart of similarity-based models such as ALM. In particular, ALM and the associative-learning component of EXAM implement cubic spline approximation (McDaniel and Busemeyer 2005), which can be represented using Gaussian processes (Rasmussen and Williams 2006). Similarly, neural network approaches to similarity-based generalization are directly related to Gaussian processes, with some networks having a perfect mapping to a corresponding Gaussian process (Neal 1994). Gaussian processes with radial-basis kernels can thus be viewed as implementing a simple kind of similarity-based generalization, predicting similar y values for stimuli with similar x values. The hybrid approach to rule learning taken by McDaniel and Busemeyer (2005) is also closely related to Bayesian linear regression. The rules represented by the hidden units serve as a basis set that specifies a class of functions, and applying penalized gradient descent on the weights assigned to those basis elements serves as an online algorithm for finding the function with highest posterior probability (MacKay 1995).

Mixing functions in a Gaussian process model

The relationship between Gaussian processes and Bayesian linear regression suggests that we can define a single model that exploits both similarity and rules in forming predictions. We can do this by choosing a hypothesis space that covers a broad class of functions, including both those consistent with a radial basis kernel and those taking simple parametric forms. This is equivalent to modeling y as being produced by a Gaussian process with a kernel corresponding to one of a small number of types. Specifically, we assume that observations are generated by a function that is linear with positive slope, linear with negative slope, quadratic, or nonlinear but generally smooth. This combination is one way to express the total prior over functions in Eqs. 1 and 2, with \(p(f) = {\sum }_{k} p(f|k)P(k)\), where k represents a particular kernel in the set of four we have mentioned. For examples of functions that are likely under each of the different kernels, see Fig. 4.

Fig. 4 Samples from the four kernels that are combined in our models, reflecting the kind of relationship that each kernel favors

We do not claim that these specific kernels compose an exhaustive account of the relationships that people can learn and extrapolate from. Rather, we believe that people find these relationships especially easy to learn, and especially plausible or likely as explanations of data in the face of uncertainty, based on the results of Brehmer (1971), DeLosh et al. (1997), and Kalish et al. (2007).

A more complete account would include kernels that permit a wide variety of extrapolation patterns (e.g., Bott and Heit, 2004), but for the data we will consider such an expansion would add to the complexity of our models without substantially changing our predictions (see Lucas et al. (2012) for a demonstration of how Gaussian process models can be used to predict a variety of non-linear extrapolations). The probabilities of the different relationship types are defined by the vector π. The relevant kernels were introduced in the previous sections (where “Nonlinear” corresponds to the radial basis kernel), with the positive and negative kernels having different means in their distributions over weights b, taking mean intercepts and slopes of [0 1] and [1 −1], respectively. Using this Gaussian process model allows a learner to simultaneously make inferences about the overall type and specific form of the function from which their observations are drawn.
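Inference about the overall type of relationship amounts to weighting each kernel's marginal likelihood by its prior probability. The following sketch illustrates that computation under loudly stated assumptions: the uniform prior `pi`, the noise variance, the weight variances, the length scale, and the zero means for the quadratic and nonlinear kernels are hypothetical choices for the example, not the parameter values reported in Appendix B. Only the mean weights [0 1] and [1 −1] for the two linear kernels come from the text.

```python
import numpy as np

def log_marginal(t, mean, cov):
    """Log density of t under a multivariate normal with the given mean and covariance."""
    resid = t - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (quad + logdet + len(t) * np.log(2 * np.pi))

def kernel_posterior(x, t, prior, noise_var=0.01, weight_var=0.1, length=0.2):
    """Posterior probability of each relationship type given (x, t) pairs."""
    n = len(x)
    lin = np.column_stack([np.ones(n), x])
    quad = np.column_stack([np.ones(n), x, x**2])
    noise = noise_var * np.eye(n)
    rbf = np.exp(-((x[:, None] - x[None, :]) ** 2) / length**2)
    models = {
        # Mean intercept and slope of [0, 1] and [1, -1], as stated in the text.
        "positive": (lin @ np.array([0.0, 1.0]), weight_var * lin @ lin.T + noise),
        "negative": (lin @ np.array([1.0, -1.0]), weight_var * lin @ lin.T + noise),
        "quadratic": (np.zeros(n), quad @ quad.T + noise),   # zero mean assumed
        "nonlinear": (np.zeros(n), rbf + noise),             # zero mean assumed
    }
    log_post = np.array([np.log(prior[k]) + log_marginal(t, m, c)
                         for k, (m, c) in models.items()])
    post = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
    return dict(zip(models, post / post.sum()))

x = np.linspace(0, 1, 10)
pi = {"positive": 0.25, "negative": 0.25, "quadratic": 0.25, "nonlinear": 0.25}
post_up = kernel_posterior(x, x, pi)          # data on the line y = x
post_down = kernel_posterior(x, 1 - x, pi)    # data on the line y = 1 - x
```

Because the more flexible kernels spread their probability over many functions, data that fall on a line with the expected slope and intercept favor the matching linear kernel even under a uniform prior over types.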

In developing this kind of model and selecting this particular set of priors, reflected in our choice of kernel functions, we are making explicit commitments about the inductive biases that shape human function learning. These include what types of relationships are more subjectively probable than others, and the more specific forms that relationships of a given type are likely to take. Our model does not, however, commit to any specific process by which those biases shape people’s inferences, which might resemble, for example, the associative mechanisms present in POLE or EXAM or an elaboration of the hypothesis-testing framework offered by Brehmer (see Footnote 1).

Basic tests of the Gaussian process model

In the remainder of the paper, we will evaluate our Gaussian process approach to function learning using each of the empirical phenomena we discussed earlier. First, following the approach taken in McDaniel and Busemeyer’s (2005) review of computational models of function learning, we look at two quantitative tests of Gaussian processes as an account of human function learning: reproducing the order of difficulty of learning functions of different types, and extrapolation performance. As indicated earlier, there is a large literature consisting of both models and data concerning human function learning, and these simulations are intended to demonstrate the potential of the Gaussian process model rather than to provide an exhaustive test of its performance. See Appendix B for a summary of the parameters in our model, and Appendix C for a description of the procedures used to generate model predictions.

Difficulty of learning

As discussed above, one important measure of a theory of human function learning is its ability to account for the relative difficulty people have in learning different kinds of relationships. Table 1 is an augmented version of results presented in McDaniel and Busemeyer (2005) which compared several models’ prediction errors to humans’ errors when learning a range of functions. Each entry in the table is the mean absolute deviation (MAD) of human or model responses from the actual value of the function, evaluated over the stimuli presented in training. The MAD provides a measure of how difficult it is for people or a given model to learn a function. The data reported for each set of studies are ordered by increasing MAD (corresponding to increasing difficulty). In addition to reproducing the MAD for the models in McDaniel and Busemeyer (2005), the table has been expanded to contain the MADs exhibited by seven Gaussian process (GP) models trained on the target functions.
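For concreteness, the MAD measure can be computed as follows. The stimuli, target function, and responses here are made-up values for illustration only, and the function name is our own.

```python
import numpy as np

def mean_absolute_deviation(responses, true_values):
    """Mean absolute deviation of responses from the true function values."""
    responses = np.asarray(responses, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    return float(np.mean(np.abs(responses - true_values)))

x = np.linspace(0, 1, 5)                  # hypothetical training stimuli
true_y = 2 * x                            # hypothetical target function
responses = true_y + np.array([0.1, -0.1, 0.2, 0.0, -0.2])   # made-up responses
mad = mean_absolute_deviation(responses, true_y)
```

A larger MAD over the training stimuli indicates that the function was harder to learn, for a person or for a model.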

The seven GP models incorporated different collections of kernel functions by adjusting their prior probabilities. The most comprehensive model includes the {Positive Linear,Negative Linear,Quadratic,Nonlinear} set of kernel functions, assigning them prior probabilities proportional to 8, 1, 0.1, and 0.01, respectively (see Footnote 2). Six other GP models were examined by assigning certain kernel functions zero prior probability and re-normalizing the remainder so that the prior probabilities summed to one. The seven distinct GP models presented in Table 1 are labeled by the kernel functions to which they assign non-zero probability, under the header “Model 1”. Models 2 and 3, which are extensions that account for knowledge partitioning phenomena, are discussed below. The models include Linear (including both positive and negative linear functions), Quadratic (second-order polynomial functions), RBF (nonlinear relationships, fit by a radial basis function kernel), LQ (linear and quadratic), LR (linear and RBF), QR (quadratic and RBF), and LRQ (linear, quadratic, and RBF). The MAD for each function from McDaniel and Busemeyer (2005) is reported for each model in Table 1, along with human MADs. The last three rows of Table 1 give the correlations between human and model performance across functions, expressing quantitatively how well each model captured the pattern of human function learning behavior. All of the GP models perform well, with every model (except for the Linear and LQ models) providing a closer match to the human data than any of the models considered by McDaniel and Busemeyer (2005).
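The construction of the restricted models can be sketched as below. The proportional weights are those given in the text; the function name `restricted_prior` and the dictionary representation are assumptions made for the example.

```python
import numpy as np

KERNELS = ["Positive Linear", "Negative Linear", "Quadratic", "Nonlinear"]
FULL_WEIGHTS = np.array([8.0, 1.0, 0.1, 0.01])   # proportional weights from the text

def restricted_prior(include):
    """Give excluded kernels zero probability and renormalize the rest to sum to one."""
    mask = np.array([name in include for name in KERNELS], dtype=float)
    weights = FULL_WEIGHTS * mask
    return dict(zip(KERNELS, weights / weights.sum()))

full = restricted_prior(KERNELS)                                      # LRQ model
linear_only = restricted_prior(["Positive Linear", "Negative Linear"])  # Linear model
```

Under this scheme the Linear model, for example, places probability 8/9 on positive linear relationships and 1/9 on negative linear ones.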

Extrapolation performance

Predicting and explaining people’s capacity for extrapolation to novel stimuli is another key criterion for judging models of function learning. In Table 2, we compare mean human predictions for linear, exponential, and quadratic functions (from DeLosh et al. (1997)) to those of several models described in McDaniel and Busemeyer (2005), as well as each of the Gaussian process models we describe above and two model extensions that we will describe below. While none of the GP models produce quite as high a correlation as EXAM on all three functions, all but the Linear and LR models make predictions that correspond closely with human judgments. It is notable that this performance is achieved with the same parameters that were used for the difficulty of learning data (see Appendix B for details), while the predictions of EXAM were the result of optimizing two parameters for each of the three functions.

Table 2 Linear correlations between human and model predictions for extrapolation regions

Figure 5 displays mean human judgments for each of the three functions, along with the predictions of an extended Gaussian process model we discuss below, which incorporates Linear, Quadratic, and Nonlinear kernel functions. The regions to the left and right of the solid black lines represent extrapolation regions, containing input values for which neither people nor the model were trained. Both people and the model extrapolate nearly optimally on the linear function, and reasonably accurately for the exponential and quadratic functions. However, there is a bias towards a linear slope in the extrapolation of the exponential and quadratic functions, with extreme values of the quadratic and exponential functions being overestimated. Quantitative measures of extrapolation performance are shown in Table 2, which gives the correlation between human and model predictions for EXAM (DeLosh et al. 1997; Busemeyer et al. 1997) and the seven GP models.

Fig. 5 Extrapolation performance, with mean predictions on linear, exponential, and quadratic functions for human participants from DeLosh et al. (1997) and a mixture of Gaussian process experts (Model 3; see text). Training data were presented in the region spanned by a solid black line, and extrapolation performance was evaluated outside this region, with the true function represented by dashed lines

Summary

We have shown that our model accounts well for the relative difficulty with which people learn different kinds of relationships, and how they extrapolate from limited training data. More complex phenomena, such as knowledge partitioning and the multiple overlapping relationships it entails, require more complex models. The next section addresses these phenomena, and describes a straightforward extension of our Gaussian process model to accommodate the possibility of multiple relationships while still explaining human interpolation and extrapolation behavior.

Extending the Gaussian process model beyond single relationships

In most models of function learning, including the Gaussian process-based models described above, it is assumed that people learn a single relationship between a variable and its predictors. There might be a complex, non-linear relationship between x and f(x), but for a single value of x, f(x) is always unimodal and relationships are never compositions of other relationships. We have mentioned that this assumption fails to describe many real relationships, and, as knowledge partitioning results show, it also fails to explain human behavior.

Of the models we have described, only the POLE model (Kalish et al. 2004) makes predictions that are consistent with knowledge partitioning phenomena, doing so by appealing to the mental representations and processes people use when learning functions. We will show that a rational analysis of function learning leads to a similar set of predictions. In many real-world situations, two variables x and y will relate to one another in different ways, depending on context. If y depends on w in addition to x, i.e., the true function is y=f(x,w), and w is not observable, the apparent relationship between x and y may have discontinuities, and it may not be a function at all, having multiple values of y for a given x. We previously discussed examples of such relationships, including acceleration in hybrid cars and dose-response curves in a patient population. Other examples of hidden mediators include the relationship between brake pressure and acceleration, mediated by surface slipperiness, and the relationship between the temperature of a material and its malleability, mediated by its unobserved crystal structure, as with the temper of a piece of metal. With these intuitions in hand, we will now describe how our model may be extended to reflect them.

Mixtures of Gaussian process experts

We extended our Gaussian process model (Model 1) to capture the assumption that each point belongs to one of an unknown number of underlying relationships. Clearly, there is no fixed bound on the number of relationships that might obtain between x and y, but one would expect that fewer relationships should be more plausible than more, as a matter of simplicity or parsimony (Chater and Vitanyi 2003). There are multiple ways to express this intuition formally, but one obvious choice is to allow points to be divided into arbitrary partitions, assigning each partition a probability using a Chinese Restaurant Process prior (Aldous 1985), which has previously been used in rational analyses of categorization (Anderson 1991; Sanborn et al. 2010).

Under this prior, the likelihood that a new (x,y) pair will be assigned to an existing relationship is proportional to the number of other points that participate in that relationship, and the likelihood that it will be assigned to a new relationship is proportional to a parameter α. More precisely, the probability that the i th point’s relationship r i will be k is

$$ \Pr(r_{i} = k) = \left\{\begin{array}{ll} \frac{n_{k}}{i+\alpha} & \text{if }n_{k} > 0, \\ \frac{\alpha}{i+\alpha} & \text{if }n_{k} = 0 \end{array}\right. $$
(8)

where n k is the number of points already participating in relationship k. The likelihood of the data under a given partition is determined by how likely the ensemble of y values is, given the nature of the relationships they participate in and their corresponding x values. This conceptually straightforward extension from Gaussian processes to a mixture of Gaussian processes will be called Model 2. We might also wish to capture the intuition that (x,y) pairs that have similar x values are more likely to participate in the same relationships; in other words, relationships tend to be locally smooth and unimodal. This expectation can be built into the model by assuming that the likelihood that a point belongs to a partition is determined in part by its closeness to current members, represented using the x-value’s likelihood under a Gaussian distribution based on existing members. This last model, Model 3, is an example of a mixture of experts (Jacobs et al. 1991; Erickson and Kruschke 1998; Kalish et al. 2004), an approach that has been applied to Gaussian processes in the past (Rasmussen and Ghahramani 2002; Meeds and Osindero 2006). As with Model 1, Models 2 and 3 can be interpreted in terms of Bayesian linear regression or Gaussian processes, where every Gaussian process kernel for every expert can be represented as a linear regression model, albeit, as before, with a potentially infinite number of features. See Fig. 6 for samples of the kinds of relationships that the mixture of Gaussian process experts (henceforth Model 3) favors.
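The partition prior and the proximity-based weighting can be illustrated with a short sketch. Several choices here are assumptions rather than details taken from the model specification: the code normalizes by the number of previously assigned points plus α so that the probabilities sum to one (how this maps onto the index i in Eq. 8 is our reading), and the proximity weighting in `proximity_weighted`, including the diffuse Gaussian for a new relationship and the minimum standard deviation, is a hypothetical simplification of the mechanism described for Model 3.

```python
import numpy as np

def crp_probabilities(counts, alpha):
    """Assignment probabilities: one entry per existing relationship, last entry for a new one."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum() + alpha
    return np.append(counts, alpha) / total

def sample_partition(n_points, alpha, rng):
    """Assign points one at a time to relationships under the CRP prior."""
    assignments, counts = [], []
    for _ in range(n_points):
        probs = crp_probabilities(counts, alpha)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):          # a new relationship was chosen
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments

def gaussian_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def proximity_weighted(x_new, member_xs, alpha, prior_mean=0.5, prior_sd=1.0, min_sd=0.05):
    """CRP probabilities reweighted by how close x_new is to each relationship's members."""
    base = crp_probabilities([len(xs) for xs in member_xs], alpha)
    dens = [gaussian_pdf(x_new, np.mean(xs), max(np.std(xs), min_sd)) for xs in member_xs]
    dens.append(gaussian_pdf(x_new, prior_mean, prior_sd))   # diffuse density for a new relationship
    weights = base * np.array(dens)
    return weights / weights.sum()

rng = np.random.default_rng(0)
partition = sample_partition(50, alpha=0.1, rng=rng)
base_probs = crp_probabilities([3, 1], alpha=1.0)
local_probs = proximity_weighted(0.12, [[0.1, 0.15, 0.2], [0.8, 0.85, 0.9]], alpha=1.0)
```

With two equally sized relationships, the CRP prior alone treats them identically, whereas the proximity-weighted version assigns a new point at x = 0.12 almost entirely to the relationship whose members lie nearby.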

Fig. 6 Samples from Model 3. The left plots show samples drawn from an infinite mixture of experts with α=.1, favoring a small number of distinct relationships. The right plots show samples drawn from a mixture with α=10, favoring a large number of distinct relationships. Because of the diffuse prior over locations in the x-axis in Model 3, randomly drawn samples tend not to concentrate in x and thus look similar to samples drawn from Model 2

Knowledge partitioning

Before applying Models 2 and 3 to knowledge partitioning phenomena, we evaluated them against the same difficulty-of-learning and extrapolation results with which we assessed our original Gaussian process models. As with the earlier models, we used the same parameters for all of the experiments, and obtained close fits to human judgments, summarized in Tables 1 and 2 (see Appendix B for details of parameters and fits). We also plotted predictions for Model 3 against mean human judgments in the extrapolation experiments in Fig. 5. In general, Models 2 and 3 performed as well as any other model, and better than the majority of the alternatives.

To gauge the extent to which the models’ predictions are consistent with knowledge partitioning phenomena, we obtained individual predictions from twelve participants in Kalish et al.’s (2004) studies, four per experiment (see Footnote 3). Each experiment included training points and interpolation regions that were designed to elicit multiple modes in y for a given x. For example, in Experiment 1, there was a gap between two partial linear functions with the same slope and different intercepts. Many participants made judgments in the gap that matched both functions, leaving a bimodal response distribution. Like Kalish et al., we focus on showing that our model captures the bimodal responses of the participants, and gives a posterior distribution that matches the distribution of actual judgments.

The results are summarized in Fig. 7, comparing the predicted probabilities of different y values under Models 1, 2, and 3 to the judgments given by participants. Model 1 predicts the aggregate trend in Kalish et al.’s Experiment 1, but cannot explain the discontinuities exhibited by two of the participants (shown in Fig. 1) or the multiple modes evident in participants’ judgments for Experiments 2 and 3. In contrast, Models 2 and 3 predict that multiple relationships will be inferred. Model 3, being sensitive to the proximity of points, is more likely than Model 2 to group points into local relationships, as is apparent in its predictions for Experiment 1. We used a single prior distribution across the different experiments and participants, but the individual differences in Fig. 1 are readily explained in terms of different participants having different inductive biases. Future work, with more extensive within-subjects data, would permit us to test our model as a framework for understanding how inductive biases vary between individuals.

Fig. 7 Plots comparing human judgments in Experiments 1–3 of Kalish et al. (2004) to the predictions of Models 1, 2, and 3. The points represent individual human judgments, aggregated over four individuals for whom data were available, while the colors represent log probability densities, with hotter colors representing higher probabilities

Iterated learning

As a final measure of Gaussian process models of function learning, we compared their predictions to human judgments in the iterated learning experiments of Kalish et al. (2007). As mentioned earlier, iterated learning designs involve a chain of learners in which each individual observes data, makes inferences from those data, and uses those inferences to provide data to the next learner in the chain. For function learning specifically, each observation is an (x,y) pair, and the data that a learner passes forward are a subset of his or her y-predictions for new x-values. Ideally, these judgments would reflect samples from the inferred underlying function, with variance attributable only to uncertainty about that function, and, potentially, inferred noise around that function. In practice, however, participants’ judgments are subject to errors in perception and in recording their judgments, as well as varying degrees of motivation and attention. Rather than attempting to model these factors, which are underdetermined, we chose to apply our mixture-of-experts model as-is to the same tasks that humans faced, looking for the same qualitative patterns that human learners demonstrated. As in Kalish et al.’s experiments, we ran chains in which the first iteration’s observations, or the initial data, were drawn from four functions, including positive linear, negative linear, U-shaped, and random functions. For each subsequent round, the model used 50 predictions generated from the previous round, like the human learners.
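The structure of an iterated learning chain can be sketched with a much simpler learner than Model 3. In the illustration below, every learner is a Bayesian linear regression model whose prior on the weights is centered on a positive linear relationship (intercept 0, slope 1); the prior variance, noise variance, random seed, and the absence of any handling for out-of-range responses are assumptions made for the example, not features of the simulations reported here.

```python
import numpy as np

def posterior(x, y, prior_mean, prior_var, noise_var):
    """Gaussian posterior over [intercept, slope] given (x, y) pairs."""
    Phi = np.column_stack([np.ones_like(x), x])
    precision = Phi.T @ Phi / noise_var + np.eye(2) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ y / noise_var + prior_mean / prior_var)
    return mean, cov

def run_chain(x0, y0, n_generations, rng, n_points=50,
              prior_mean=np.array([0.0, 1.0]), prior_var=0.01, noise_var=0.01):
    """Each learner infers a function from the previous learner's responses."""
    x, y, slopes = x0, y0, []
    for _ in range(n_generations):
        mean, cov = posterior(x, y, prior_mean, prior_var, noise_var)
        b = rng.multivariate_normal(mean, cov)     # the learner's inferred function
        slopes.append(float(b[1]))
        # The learner's noisy predictions at new x values become the next learner's data.
        x = rng.uniform(0.0, 1.0, size=n_points)
        y = b[0] + b[1] * x + rng.normal(0.0, np.sqrt(noise_var), size=n_points)
    return slopes

rng = np.random.default_rng(1)
x0 = rng.uniform(0.0, 1.0, size=50)
slopes = run_chain(x0, 1.0 - x0, n_generations=30, rng=rng)   # negative linear initial data
```

Starting from negative linear data, the inferred slope in this toy chain begins negative and drifts toward the positive slope favored by the prior over successive generations, which is the qualitative signature that iterated learning designs are meant to expose.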

The human learners’ judgments revealed several broad patterns, shown in Fig. 2, which we used as the basis for our evaluation, including: (1) given positive linear initial data, judgments were consistently positive linear over successive rounds; (2) a shift toward positive linear functions for the negative linear, U-shaped, and random initial data, with transitional states reflecting uncertainty or inferences to high noise or multiple overlapping relationships—in almost all chains, there are intermediate states that deviate from any simple, well-formed function; and (3) greater stability and slower transitions in the negative linear case than in the U-shaped and random cases.

Figure 3III shows that Model 3 demonstrates each of these features. Like many human learners and the POLE and EXAM models (Fig. 3I and II), it preserved positive linear relationships, with small deviations from a 0–intercept 1–slope relationship that are due to our treatment of out-of-range samples: when the model samples y values that are greater than 1 or less than 0, those values are resampled, leading to a slight flattening of the slope. A policy of converting out-of-range samples by replacing them with the most extreme value would reduce this effect. Like human chains and the POLE model’s predictions, but not EXAM’s, iterations following U-shaped initial data included cases of overlapping positive and negative relationships. As with several human learners, random initial data led the GP model to offer overlapping, weakly sloped linear relationships before shifting towards a single positive linear relationship. Finally, like human learners, the GP model tended to preserve the negative linear relationships more than U-shaped and disordered relationships. The most salient difference between the GP Model 3 and human learners is its slower convergence to positive linear relationships.

There are several ways in which we might account for this difference in convergence rates. First, our priors over types of relationships were not fitted to human behavior, and one more strongly favoring positive linear relationships—or a lower variance in the distribution of slopes—would naturally lead to faster convergence. Second, a more nuanced view of noise would be consistent with the differences in convergence rates. For example, our model assigns a very low probability to “random” relationships, in which points have very high variance, whereas participants might expect that some points are anomalies, analogous to equipment failures. Third, the rapid convergence of human chains might be explained in part by differences between individual human learners. For example, specific individuals might have stronger expectations that relationships are positive and linear, and believe more strongly that their observations are only noisy reflections of the underlying relationship. As with individual differences in knowledge partitioning, all of these possibilities could be explored using within-subjects data.

General discussion

Function learning is one of the core inductive problems that we encounter every day, arising whenever we need to learn the relationship between two continuous variables. Models of function learning have explained the human ability to solve this problem in terms of different cognitive mechanisms, such as inducing rules or generalizing on the basis of similarity. We have shown that these different cognitive mechanisms correspond to different strategies for solving the abstract computational problem of regression, and that both can be expressed as special cases of a Bayesian solution to this problem based on Gaussian processes. This perspective helps to reveal the commonalities between these different mechanisms, and to define models that combine their strengths. The resulting models provide a good fit to human data, performing similarly to the best mechanistic accounts, and provide a way to transparently identify the inductive biases that guide human learners in function learning tasks.

In our introduction, we stated that our model is intended to complement, rather than replace, existing accounts of function learning: we focus on the inductive biases that shape function learning, rather than the processes by which it occurs. In the remainder of this paper, we will discuss the relationship between these levels of analysis, the project of identifying human inductive biases, and some of the ways in which our work could be extended.

The roles of models at different levels of analysis

Our focus in this paper has been on understanding human function learning by identifying the underlying computational problem and the assumptions that seem to yield parallels between optimal solutions to this problem and human behavior. This approach is in the spirit of the approach of rational analysis laid out by Anderson (1990), yielding an explanation of behavior that lies at what Marr (1982) termed the “computational level”. The results of this investigation are quite different from those yielded by a more traditional modeling approach operating at what Marr (1982) termed the “algorithmic level” and focusing on identifying the cognitive mechanisms underlying human behavior. The previous models of function learning we have discussed in this paper are defined at this level, making claims about the aspects of human memory and reasoning that contribute to their performance on function learning tasks.

The focus on the computational level establishes a clear set of goals for our model. First, we are not trying to define the single best model of human performance on function learning tasks, because our computational-level model is not in competition with algorithmic-level models. It is entirely possible for our computational-level analysis to be correct, and for it to be executed at the algorithmic level by cognitive mechanisms that resemble existing psychological process models. In this case, we would expect both kinds of models to fit well (and possibly the process models to fit better, since they will capture idiosyncrasies of behavior due to the way in which the computational-level solution is carried out). Our goal is to show that the computational-level solution we have proposed does a good job of capturing human behavior, and existing algorithmic-level models provide a good yardstick against which to measure this performance.

Second, a key part of our contribution is theoretical. We have shown that algorithmic-level mechanisms that seem quite different can in fact be captured in a single theoretical framework at the computational level, and that this leads to new ways of thinking about combining the strengths of these approaches. This kind of contribution has a precedent in other work examining aspects of cognition at different levels of analysis: Ashby and Alfonso-Reese (1995) showed that exemplar and prototype models of categorization could both be viewed as strategies for solving the problem of density estimation that arises when categorization is viewed from the perspective of Bayesian inference. This demonstration of a common underlying computational-level problem (and connections to ideas in statistics) provides the foundation for recent work on rational models of categorization that can interpolate between exemplar and prototype representations (Sanborn et al. 2010). We view our analysis as making a similar contribution for the case of function learning, providing an explicit link between existing cognitive models and ideas from statistics that leads to new ways of understanding human behavior. A probabilistic approach also provides a basis for understanding a broader range of phenomena, beyond patterns of interpolation and extrapolation judgments. For example, one can explain the influence of linguistic and contextual information on function learning (Byun 1995) in terms of priors, and understand how people search for new information (Borji and Itti 2013) or benefit from different kinds of instruction (Lindsey et al. 2013).

Capturing human inductive biases

In inductive problems, such as function learning, the right answer is underdetermined by the available data. This means that doing a good job of solving the problem requires having good inductive biases—those factors other than the data that lead a learner to favor one hypothesis over another (Mitchell 1997). When viewed from the abstract computational level, the key challenge in explaining human inductive inference is characterizing our inductive biases. Bayesian models of cognition make this task particularly clear, as the inductive biases of these models are expressed through the choice of hypothesis space and the prior on hypotheses.

In function learning, the characterization of the inductive biases of a learner is particularly clear: it corresponds to a prior distribution on functions. As we have discussed, defining a prior distribution on functions is challenging, since there are uncountably many possible functions, dependent on an unbounded number of latent variables. The Gaussian process models we have explored provide a succinct way of expressing priors on functions that is nonetheless extremely flexible in the range of distributions that it allows, and thus provide a powerful tool for exploring human inductive biases for function learning. We can express the assumptions behind this prior in terms of a kernel function, which captures the similarity between stimuli, in terms of a set of basis functions, which express a representation of these stimuli, or through samples from the resulting distribution over functions, providing three different ways to indicate the inductive biases that a learner has.
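The three views described above can be illustrated with a minimal sketch. The specific kernel forms, length scale, and input grid below are illustrative assumptions, not the parameter settings used in our simulations. The sketch checks that a linear kernel and a two-element basis (a constant and the stimulus value) define the same prior, and then draws sample functions from a linear and a radial basis prior.

```python
import numpy as np

def linear_kernel(x1, x2):
    # Kernel view: k(x, x') = 1 + x * x' (illustrative form, unit prior variances).
    return 1.0 + np.outer(x1, x2)

def linear_basis(x):
    # Basis-function view: phi(x) = [1, x], with weights drawn from N(0, I).
    return np.column_stack([np.ones_like(x), x])

def rbf_kernel(x1, x2, length_scale=0.5):
    # Radial basis kernel; the length scale is an assumed value.
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_prior(K, n_samples, rng, jitter=1e-6):
    # Sample view: draw f ~ N(0, K) using a Cholesky factor of K.
    # The jitter term keeps the factorization numerically stable.
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return (L @ rng.standard_normal((len(K), n_samples))).T

x = np.linspace(-2.0, 2.0, 25)
Phi = linear_basis(x)

# The kernel and basis views agree: K = Phi Phi^T.
views_agree = np.allclose(linear_kernel(x, x), Phi @ Phi.T)

rng = np.random.default_rng(0)
# Samples from the linear prior, generated through the basis view.
lin_samples = rng.standard_normal((3, 2)) @ Phi.T
# Samples from the radial basis prior, generated through the kernel view.
rbf_samples = sample_prior(rbf_kernel(x, x), 3, rng)
```

Each row of `lin_samples` is a straight line, whereas each row of `rbf_samples` is a smooth non-linear curve. Plotting the rows gives a direct picture of the inductive bias each kernel encodes.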

Being able to characterize human inductive biases in terms of a probability distribution over functions also makes it straightforward to make automated learning systems that are guided by the same inductive biases. We can easily take the prior assumed by our Gaussian process models and use it as a component of Gaussian process models used in machine learning or statistics. This provides a natural bridge between human and machine learning, and an opportunity to explore whether using human inductive biases improves the operation of automated systems as well as to develop automated systems that make inferences that are more comprehensible to human users.

Limitations and future directions

The models we have explored cover a wide range of results from the literature on human function learning, but there are still phenomena that they cannot capture and aspects of human performance that lie outside the considerations that normally inform a computational-level analysis. Addressing these limitations creates some interesting directions for future work.

A basic omission in the formulation of our model is that it is unable to learn cyclic functions. Since these functions are learnable by people (although with significant difficulty; Bott and Heit 2004; Byun 1995), this is a weakness that should be addressed. It is straightforward to incorporate a capacity to learn cyclic functions by including a periodic kernel in the mixture of kernels. Incorporating this additional kernel, with an appropriately low mixture weight, would not appreciably change the predictions of the model for non-cyclic functions. In the present work, however, we judged the corresponding increase in the complexity of the model to outweigh the value of capturing these additional phenomena.
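A minimal sketch of this extension follows. It is an assumption-laden illustration rather than part of the model we evaluated: the periodic kernel uses a standard exp-sine-squared form, the period and length scale are fixed by hand, and the prior weights (0.95 linear, 0.05 periodic) and noise variance are hypothetical. The sketch computes the posterior probability of each kernel from its Gaussian process marginal likelihood.

```python
import numpy as np

def linear_kernel(x1, x2):
    # Illustrative linear kernel: k(x, x') = 1 + x * x'.
    return 1.0 + np.outer(x1, x2)

def periodic_kernel(x1, x2, period=1.0, length_scale=1.0):
    # Standard exp-sine-squared kernel; period and length scale are assumed.
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

def log_marginal_likelihood(K, y, noise_var=0.01):
    # log N(y | 0, K + noise_var * I), computed via a Cholesky factor.
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))

def kernel_posterior(x, y, kernels, prior_weights):
    # Posterior over kernels: prior weight times marginal likelihood, normalized.
    log_post = np.array([np.log(w) + log_marginal_likelihood(k(x, x), y)
                         for k, w in zip(kernels, prior_weights)])
    log_post -= log_post.max()  # subtract the maximum for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

x = np.linspace(0.0, 4.0, 40, endpoint=False)
kernels = [linear_kernel, periodic_kernel]
prior_weights = [0.95, 0.05]  # hypothetical low weight on the periodic kernel

# Posterior kernel weights for a linear and a cyclic training function.
post_linear_data = kernel_posterior(x, 0.5 * x, kernels, prior_weights)
post_cyclic_data = kernel_posterior(x, np.sin(2.0 * np.pi * x), kernels, prior_weights)
```

Under these assumptions, the periodic kernel receives negligible posterior weight when the training data are linear, so predictions for non-cyclic functions are essentially unchanged, while it dominates the posterior when the data are cyclic.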

The fact that people can learn cyclic functions raises another interesting question: Can we build an exhaustive summary of the kinds of relationships that people can learn? In the context of our models, this becomes a question of what kinds of functions have support in people’s prior distributions, or what set of kernels should be included in the mixture. Existing results support the inclusion of a relatively small set of kernels—essentially, those that we consider plus a periodic kernel for cyclic functions.

Another issue, also related to our prior over kernels, is that we chose a distribution strongly favoring linear relationships. Is this prior consistent with the idea that a rational analysis should use diffuse priors that capture the statistical structure of the environment (Anderson 1990)? It is a shortcoming of the current work that we cannot be certain, but we believe that a linearity-biased prior is better than alternatives. In function learning, it is not realistic to directly measure the statistical structure of the environment, i.e., what functions are truly more or less common: doing so would depend on knowing what combinations of variables are salient to human observers over long periods of time, including, perhaps, our evolutionary history. Further, any census of functions would reflect the cognitive and attentional biases of the people who would conduct it. In the absence of ground truth about the frequencies of functions, we believe that the best approach is to look at what relationships people think are more common, using both direct and indirect measures. Previous studies, including many that we have not evaluated here (see Busemeyer et al. 1997, for a summary, and Little and Shiffrin 2009, for evidence that people infer linear relationships given very noisy data), support the idea that linear relationships are thought to be more common. Among these are results showing that people say that linear relationships occur much more frequently than non-linear ones, and showing that people tend to offer linear relationships when prompted in the absence of data or informative context (Brehmer 1974). Even if we set aside these results, it seems a case can be made that linear functions are indeed very common in situations that matter to humans. Under usual (e.g., non-relativistic) conditions, relationships between mass, force, acceleration, velocity, distance, and time can be expressed as collections of linear relationships, and many physical objects have broadly similar shapes at different scales, implying that an object’s height is a roughly linear function of its width, for example.

Our focus on the abstract computational-level problem underlying function learning and the nature of ideal solutions to that problem means that there are aspects of human performance that our models cannot capture. For example, our models assume that people have perfect memory for the stimuli and exact recall of the values of the variables presented on each trial. These assumptions are clearly false, and a more realistic treatment of memory and perception might make it possible to tease apart the assumptions in our model that are due to these factors (e.g., high noise parameters) from those that capture human inductive inference (e.g., the set of kernels appearing in the mixture). There are going to be aspects of human performance that cannot be captured by the kind of computational-level models we have considered, such as sensitivity to the order in which stimuli are presented, that may be candidates for identifying algorithmic-level implementations of these ideal solutions (similar to the role of order effects in categorization; Anderson 1991; Sanborn et al. 2010). As a starting place, it may be worth drawing inspiration from efforts in the machine learning literature to overcome the difficulty of scaling Gaussian processes to large data sets in the face of limited memory (e.g., Hensman et al. 2013).

If future work is to provide a deeper understanding of function learning, including the roles played by priors (and free parameters more generally), learners’ limited cognitive resources, and individual differences, it will be necessary to go beyond the evaluation methods that have become standard in function learning, in at least two respects. First, it is now increasingly feasible and important to examine not just overall error rates, or aggregate correlations between model predictions and human judgments, but the accuracy with which a model can predict human judgments for individual points given the values and order of previous training and test points. In addition to making it possible to assess how well a model can account for the process and dynamics of function learning—which include order effects as described above, as well as other phenomena, like the tacit belief that a relationship might be changing over time or trials (Speekenbrink and Shanks 2010)—such an approach is more robust to aggregation artifacts (Navarro et al. 2006). Second, we have taken a common approach to fitting and testing cognitive models—finding global parameters or priors that give low error or a high likelihood of the experimental data—but this approach has drawbacks beyond the simple risk of overfitting. Perhaps the most serious of these, in cases where we are interested in the priors that people tacitly use, is that this approach licenses only coarse-grained conclusions about what priors are likely or plausible given the experimental data. In the future, we hope that cheaper computational resources and increasingly efficient algorithms will make it feasible to conduct a Bayesian analysis of our model and others, which would provide a clearer picture of the priors that are consistent with group-level tendencies as well as individual differences (Hemmer et al. 2014).

Finally, a key question for any Bayesian model of cognition is the origins of the inductive biases that are expressed in the prior distribution. Having established a picture of adult inductive biases at the start of an experiment, we can begin to explore questions related to the development of these inductive biases. Within the Bayesian framework, it is possible to make inferences at the level of prior distributions by using hierarchical Bayesian models (Tenenbaum et al. 2006). In the case of our Gaussian process model, people could learn the set of kernels or parameter distributions for flexible kernel types (for work related to these ideas, see Wilson and Adams 2013; Duvenaud et al. 2013), the probabilities assigned to those kernels, and other parameters of the model. The predictions of this account of the origins of human inductive biases for function learning can be evaluated by comparing the performance of children and adults in function learning tasks and by conducting transfer learning experiments that examine how people’s inductive biases change through experience. This is an exciting direction for future research.

Conclusions

We have presented a rational account of human function learning, drawing on ideas from machine learning and statistics to show that the two approaches that have dominated previous work—rules and similarity—can be interpreted as two views of the same kind of optimal solution to this problem. Our Gaussian process models combine the strengths of both approaches, using a mixture of kernels to allow systematic extrapolation as well as sensitive non-linear interpolation. Tests of the performance of this model on benchmark datasets show that it can capture some of the basic phenomena of human function learning, and is competitive with existing process models. The result is a clear characterization of human inductive biases for function learning, and a new set of links between human learning and ideas in statistics and machine learning.