Inference and Search on Graph-Structured Spaces

Wu, Charley M.; Schulz, Eric; Gershman, Samuel J.

doi:10.1007/s42113-020-00091-x

Original Paper
Open access
Published: 02 November 2020

Volume 4, pages 125–147, (2021)

Computational Brain & Behavior

Abstract

How do people learn functions on structured spaces? And how do they use this knowledge to guide their search for rewards in situations where the number of options is large? We study human behavior on structures with graph-correlated values and propose a Bayesian model of function learning to describe and predict their behavior. Across two experiments, one assessing function learning and one assessing the search for rewards, we find that our model captures human predictions and sampling behavior better than several alternatives, generates human-like learning curves, and also captures participants’ confidence judgements. Our results extend past models of human function learning and reward learning to more complex, graph-structured domains.

Introduction

On September 15th, 1835, Charles Darwin and the crew of the HMS Beagle arrived in the Galapagos Islands. As part of a 5-year journey to study plants and animals along the coast of South America, Darwin collected specimens of Galapagos finches, which would become an important keystone for his theory of evolution. Back in England, Darwin began to study the geographical distribution of the birds, particularly the relationship between their features and their habitat. He noticed that while finches on nearby islands had similar beaks (e.g., the vegetarian tree finches and the large insectivorous tree finches with their broad and stout beaks), finches on more distant islands were more dissimilar (e.g., the cactus ground finch with its long and spike-like beak). From these observations, Darwin concluded that these finches all originally derived from the same finch and then gradually adapted to the conditions of the Islands. Since nearby islands had similar conditions, finches on these islands had more similar beaks.

Darwin’s historical insight is an example of function learning, where a function represents a mapping from some input space to some output space. In Darwin’s case, the hypothesis was a function mapping a bird’s habitat to the characteristics of its beak (e.g., size). Function learning has traditionally been studied with continuous input spaces, but functions can also be defined over discrete input spaces such as graphs. While the geography of habitats can sometimes be described by a Cartesian coordinate system (latitude and longitude), the Galapagos is structured as a chain of islands, where the Euclidean distance within an island can be larger than the distance between islands. Since finches from the same island tend to be similar, the relevant metric for function learning may be topological rather than Euclidean distance, where the chain of islands can be described as a graph.

Function learning on graph-structured input spaces is not restricted to scientific epiphanies; it also applies ubiquitously to daily life. For example, the spread of disease, ideas, and cultural products from interpersonal contact can be understood as functions defined over social graphs. We can learn to predict which of our friends will like a piece of music after observing the music preferences of other friends in our social network. Similarly, as many parents of toddlers know, the appearance of a sickness in daycare is highly predictive of who will get sick next. Beyond social graphs, the flow of individuals in a transportation network and the distribution of food resources in patchy environments can likewise be described using graph-structured functions.

Despite the ubiquity of graph-structured functions, most studies of function learning (as we review below) have examined only continuous input spaces. In addition, reinforcement learning in discrete state spaces can also be interpreted as a form of graph-structured function learning, but relatively little work has examined patterns of generalization beyond very simple graph structures (e.g., Wimmer et al. 2012; Gershman and Niv 2015). Can similar computational principles of inference and search that describe human behavior in continuous spaces also apply to discrete, graph-structured spaces?

In this paper, we investigated how people learn graph-structured functions and use this knowledge to guide the search for rewards. In Experiment 1, we studied how people infer the values of nodes on complex graphs (corresponding to the number of passengers on a virtual subway map), where values were correlated by the connectivity structure, such that connected nodes had similar values. This is a discrete analogue of traditional function learning tasks on continuous input spaces, where we hypothesized that people would be able to make accurate predictions by taking into account the connectivity structure of the graph. We tested this hypothesis by analyzing performance and comparing different computational models in their ability to predict participants’ judgments and confidence ratings. In Experiment 2, we studied how people search for rewards on complex graphs, tantamount to a 64-armed bandit problem, where each arm of the bandit corresponded to a node on a graph and rewards were similarly correlated based on connectivity. Here, we hypothesized people would be able to leverage the structure of the environment to explore efficiently and rapidly acquire better rewards, using the same computational principles of function learning for inferring value. We tested this hypothesis through both behavioral analyses and computational modeling, where we compared models that differed in how they generalized about novel stimuli and in their exploration strategies.

Our results indicate that people learn and search for rewards in a manner consistent with a Bayesian model of function learning, implemented using Gaussian process (GP) regression with a diffusion kernel. Our diffusion kernel GP model outperformed various alternatives in predicting inferences and uncertainty judgements and, when combined with an optimistic sampling strategy (upper confidence bound sampling), also performed best in predicting sampling decisions on a 64-armed bandit problem with graph-structured rewards. This model builds on past studies using Gaussian process regression to describe human function learning on continuous spaces (Lucas et al. 2015; Schulz et al. 2017), but uses a prior over functions designed for discrete spaces (Kondor and Lafferty 2002). Not only do we find strong empirical evidence for our model, but it also provides new theoretical connections to past research on human function learning, sample-efficient exploration, and classic theories of generalization and learning.

Function Learning in Continuous Spaces

Research on human function learning was pioneered by Carroll (1963), who studied how participants learned to predict the length of a line (output) based on the horizontal position of a “V” shaped marking (input). Unknown to participants, the relationship between the inputs and outputs was governed by either a positive linear, a quadratic, or a random function. Carroll’s (1963) study was motivated by the goal of showing that people could extrapolate functions to generate novel predictions about outcomes that had never before been observed. In contrast to classical theories of generalization (Shepard et al. 1961), Carroll’s work provided evidence for a mechanism of generalization that went beyond merely predicting the same outcome as that of the most similar previous experiences. Aside from showing that function learning was an important feature of human inference, Carroll (1963) also discovered that some functions, such as linear ones, were easier to learn than others, such as nonlinear ones. Subsequent studies of human function learning built on Carroll’s initial insight and further investigated which types of functions were more difficult to learn (Brehmer 1974; Koh and Meyer 1991; Busemeyer et al. 1997), finding that linear functions with positive slopes are the most learnable, and that both nonlinear functions and linear functions with negative slopes are more difficult to learn.

A problem with many of these early studies was the inflexibility of their models. Likely inspired by timely advances in statistical methods of least-squares estimation, they assumed that participants used a specific parametric model, for example, linear regression, and then learned by optimizing the parameters to explain the data. Yet the parametric classes of functions used in these studies were insufficiently flexible to account for human function learning. Instead of only adapting a specific class of functions to a particular set of observations, people seem to adapt the model itself when encountering novel data. Brehmer (1976) tried to explain some of these effects with a sequential hypothesis testing model of functional rule learning, according to which participants adapt the complexity of their model by performing sequential hypothesis tests and pivoting between parametric forms if necessary. However, this model still required a pre-determined set of parametric rules that could be compared, so it cannot explain the ability to learn almost any function given enough data. Thus, these earlier, rule-based models of human function learning could not easily explain the full range of human function learning abilities; more flexible models were needed.

To overcome the weaknesses of rule-based models of human function learning, several researchers proposed a novel class of similarity-based models of function learning. These models operated under the assumption that similar input points will produce similar outputs and used neural networks to model behavior (McClelland et al. 1986). Not only could these models theoretically learn nearly any function, but they were also able to capture the effect that linear functions are easier to learn than nonlinear functions.

An important distinction in the literature on function learning (and machine learning more generally) is between interpolation (i.e., predictions for points nested between training examples) and extrapolation (i.e., predictions outside the convex hull of training inputs). Whereas similarity-based models can explain order-of-difficulty effects in interpolation tasks, they have trouble explaining how people extrapolate. Specifically, people tend to make linear predictions with a positive slope and an intercept of zero when extrapolating functions (Kwantes and Neal 2006). This linearity bias holds true even when the underlying function is nonlinear; for example, when trained on a quadratic function, average predictions fall between the true function and straight lines fit to the closest training points (Kalish et al. 2004).

Since traditional similarity-based models of function learning could not easily explain these extrapolation patterns, the class of function learning models had to be extended even further. This led to the development of so-called hybrid models of function learning, which contain an associative learning process that acts on explicitly represented rules. One such hybrid model is the Extrapolation-Association Model (DeLosh et al. 1997), which uses similarity-based interpolation, but extrapolates using a simple linear rule. The model effectively captured the human bias towards linearity, and could predict human extrapolations for a variety of functions, but without accounting for non-linear extrapolation (Bott and Heit 2004).

More recently, another class of models was developed using Gaussian process (GP; Rasmussen and Williams 2006) regression to model function learning based on the principles of Bayesian inference. The GP framework describes a prior over functions, which given a set of observed data points, can be used to infer a posterior distribution over functions. Importantly, GP regression is a non-parametric model (Schulz et al. 2018; Gershman and Blei 1), meaning that it adapts its complexity to the encountered data rather than assuming a fixed level of complexity. Griffiths et al. (2009) and Lucas et al. (2015) were the first to show that GP regression provides a rational model of human function learning, and that it replicates most of the observed empirical phenomena of human function learning. Importantly, GP regression performs posterior inference in a way that can be understood as both similarity-based (because the kernel provides a similarity metric between data points) and rule-based (because the kernel can be expressed as a linear weighted sum), providing a further unification of rule-based and similarity-based theories (Lucas et al. 2015).

Using Function Learning to Guide Search

Learning a function is not only useful for making explicit generalizations about novel situations, but can also be used to guide adaptive behavior by leveraging functional structure to predict unobserved rewards in the environment. For example, in reinforcement learning tasks where options had inversely correlated rewards (Wimmer et al. 2012) or with rewards structured as a linear function (i.e., linearly increasing rewards from option 1 to option N; Schulz et al. 2019), participants were able to rapidly learn this structure and leverage it to facilitate better performance, even without having been explicitly told about the underlying structure.

In tasks with a large number of options, it becomes important to be able to learn efficiently, for instance by using features of the task to predict rewards (Farashahi et al. 2017; Radulescu et al. 2019). One approach is to learn an implicit value function mapping features onto rewards (Schulz et al. 2017), which can be used to guide efficient exploration even in infinitely large problem spaces. Previous work has successfully used a GP model of function learning to predict human search behavior in a variety of both spatially and conceptually correlated reward environments (Wu et al. 2018; Wu et al. 2020; Schulz et al. 2018), where the number of options vastly outnumbered the sampling horizon.

In transitioning from a pure function learning paradigm to a reward learning paradigm, the demands of the task change from pure information maximization to a balance between exploration and exploitation (Cohen et al. 2007; Mehlhorn et al. 2015; Schulz and Gershman 2019). Typically studied in multi-armed bandit tasks, the exploration-exploitation dilemma requires an agent to trade off sampling novel options to acquire potentially useful information about the structure of rewards (exploration) against sampling options known to have high-value payoffs (exploitation). With too little exploration, the agent could get stuck in a local optimum, while with too little exploitation, the agent never reaps the rewards it has discovered.

Since optimal solutions in such tasks are intractable for all but the most simplistic scenarios (Whittle 1980), a variety of heuristic algorithms are commonly used. One such algorithm is upper confidence bound sampling, which adds an “uncertainty bonus” to each option’s value (Auer 2002). Since this corresponds to a weighted sum of the expected reward and its uncertainty, this algorithm explicitly encodes the trade-off between exploration and exploitation. Although earlier studies produced mixed evidence for an uncertainty bonus in human decision making (Daw et al. 2006; Payzan-LeNestour and Bossaerts 2011), many recent studies have shown that humans do engage in uncertainty-guided exploration (Gershman 2018a; Wilson et al. 2014; Knox et al. 2012; Gershman 2019; Speekenbrink and Konstantinidis 2015; Wu et al. 2018).
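To make the weighted sum concrete, the following is a minimal sketch of upper confidence bound sampling. The function names, the weight `beta`, and the use of the posterior standard deviation as the uncertainty term are illustrative assumptions and are not taken from the original text.

```python
import numpy as np

def ucb_values(mean, variance, beta=0.5):
    """Expected reward plus an uncertainty bonus weighted by beta (assumed form)."""
    mean = np.asarray(mean, dtype=float)
    std = np.sqrt(np.asarray(variance, dtype=float))
    return mean + beta * std

def choose_option(mean, variance, beta=0.5):
    """Pick the option with the highest upper confidence bound."""
    return int(np.argmax(ucb_values(mean, variance, beta)))

# Hypothetical beliefs about three options
means = [0.6, 0.5, 0.2]
variances = [0.01, 0.25, 0.04]

greedy = choose_option(means, variances, beta=0.0)      # beta = 0: pure exploitation
optimistic = choose_option(means, variances, beta=1.0)  # larger beta: favors uncertain options
```

With `beta = 0` the rule reduces to choosing the highest expected reward, whereas a larger `beta` shifts choices toward options whose value is still uncertain.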

A key component for performing uncertainty-guided exploration is being able to estimate the uncertainty of one’s predictions. Since GP regression is a Bayesian model of function learning, uncertainty is quantified by the posterior distribution. In contrast, a model that makes only point estimates of expected reward does not have access to uncertainty-guided exploration. Instead, less efficient random exploration strategies must be used (e.g., softmax exploration). A combined model of GP regression with upper confidence bound sampling has proved to be an effective model in a wide range of contexts, describing how people explore different food options based on real-world data (Schulz et al. 2019), predicting whether or not people will try out novel options (Stojić et al. 2020), and explaining developmental differences between how children and adults search for rewards (Schulz et al. 2018; Meder et al. 2020).

Function Learning in Graph-Structured Spaces

In the current work, we examine whether principles of function learning can be used to model human inference and search for rewards in graph-structured spaces (see Fig. 1). Studying these environments greatly expands the scope of classical function learning models, and addresses an important gap in our understanding of how people reason about structured environments (e.g., social graphs or subway networks) that are ubiquitous in our daily lives.

In what follows, we will first introduce the GP regression framework, and then specialize it to the problem of function learning on graphs. The key mathematical tool that we employ is the diffusion kernel (Kondor and Lafferty 2002), which offers one of the simplest ways to define priors over functions on graphs. We will show how the diffusion kernel naturally connects to past models of human function learning. We will then put this model to an empirical test, presenting two experiments studying how people make inferences and search for rewards on graph structures. In Experiment 1, participants were shown a series of artificially generated subway maps and asked to predict the number of passengers at unobserved stations. In Experiment 2, participants played a graph-structured multi-armed bandit task, where arms correspond to nodes in the graph, and the payoffs are correlated via the connectivity structure.

Gaussian Process Regression

A GP (Rasmussen and Williams 2006) defines a distribution over functions $\mathbf {f}: \mathcal {S} \rightarrow \mathbb {R}$ that map the input space $s \in \mathcal {S}$ (e.g., nodes on a graph; Fig. 1a) to real-valued scalar outputs (e.g., rewards). Intuitively, for any finite set of inputs {s₁, s₂,...s_N}, we can express the output of a function as a vector of finite length f = {f₁, f₂,…,f_N}. Each function vector f can be modeled as a random draw from a multivariate normal distribution:

$$ \mathbf{f} \sim \mathcal{GP}\left( m(s),k(s,s^{\prime})\right), $$

(1)

where $m(s) = \mathbb {E}[f(s)]$ is a mean function^{Footnote 1} specifying the expected output of the function given input s, and $k(s,s^{\prime })$ is the kernel function (see below) defining the covariance between outputs for a given input pair $(s,s^{\prime })$.

We can think of each function as a potential hypothesis, relating each node $s \in \mathcal {S}$ to some function value f(s), where the GP describes a distribution over functions and the kernel encodes inductive biases about how smoothly the function varies across the input space. Thus, the distributional nature of the GP captures uncertainty across different potential functions (Fig. 1). In the next section, we will define a kernel specialized for graph-structured functions.

We model a scenario in which an observer measures observations y = f(s) + 𝜖, where $\epsilon \sim \mathcal {N}(0,\sigma ^{2})$ is Gaussian noise added to the output value. Given a data set of observations $\mathcal {D}=\{ \mathbf {s},\mathbf {y}\}$ containing N input-output pairs (Fig. 1g–h), we can use a GP to compute the posterior predictive distribution $p(f(s_{\ast })|\mathcal {D})$ for any target state s_∗ (e.g., an unobserved node; Fig. 1g). This posterior is a Gaussian distribution $p(f(s_{\ast })|\mathcal {D})=\mathcal {N}(m_{\ast },v_{\ast })$, where the mean m_∗ and variance v_∗ are defined as:

$$ \begin{array}{@{}rcl@{}} m_{\ast}&=& \mathbf{k}_{\ast}^{\top} (\mathbf{K}+\sigma^{2}\mathbf{I})^{-1}\mathbf{y} \end{array} $$

(2)

$$ \begin{array}{@{}rcl@{}} v_{\ast}&=& k(s_{\ast},s_{\ast})-\mathbf{k}_{\ast}^{\top}(\mathbf{K}+\sigma^{2}\mathbf{I})^{-1}\mathbf{k}_{\ast}. \end{array} $$

(3)

K is the N × N covariance matrix evaluated at each pair of observed inputs, and k_∗ = [k(s₁, s_∗),…,k(s_N, s_∗)] is the covariance between each observed input and the target input s_∗. Thus, as illustrated in Fig. 1g, the GP posterior allows us to make Bayesian predictions about the expected output (m_∗) and uncertainty (v_∗) for any unobserved node s_∗ on the graph.
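Equations 2 and 3 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name, the toy kernel matrix, and the noise variance are assumptions, and the outputs are treated as already centered on the prior mean.

```python
import numpy as np

def gp_posterior(K_full, obs_idx, y, target_idx, noise_var=0.1):
    """Posterior mean (Eq. 2) and variance (Eq. 3) at one target node.

    K_full    : kernel matrix over all nodes (any valid kernel)
    obs_idx   : indices of the observed nodes
    y         : observed outputs, assumed centered on the prior mean
    noise_var : sigma^2, the observation noise variance (illustrative value)
    """
    obs_idx = np.asarray(obs_idx)
    y = np.asarray(y, dtype=float)
    K = K_full[np.ix_(obs_idx, obs_idx)]   # N x N covariance of observed inputs
    k_star = K_full[obs_idx, target_idx]   # covariance between observed inputs and target
    A = K + noise_var * np.eye(len(obs_idx))
    mean = k_star @ np.linalg.solve(A, y)
    var = K_full[target_idx, target_idx] - k_star @ np.linalg.solve(A, k_star)
    return mean, var

# Hypothetical kernel over three nodes; nodes 0 and 2 are observed, node 1 is the target
K_toy = np.array([[1.00, 0.50, 0.25],
                  [0.50, 1.00, 0.50],
                  [0.25, 0.50, 1.00]])
m_star, v_star = gp_posterior(K_toy, obs_idx=[0, 2], y=[2.0, 1.0], target_idx=1)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice; it computes the same quantities as the equations above.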

As pointed out by Lucas et al. (2015), we can draw a connection between GP regression and similarity-based models of function learning. In particular, the posterior predictive mean (Eq. 2) can alternatively be expressed as a similarity-weighted sum:

$$ m_{\ast}=\sum\limits_{n=1}^{N} w_{n} k(s_{n},s_{\ast}), $$

(4)

where each s_n is a previously observed input, and the weights are given by $\mathbf {w} = \left [\mathbf {K}+\sigma ^{2} \mathbf {I} \right ]^{-1}\mathbf {y}$. Intuitively, this means that GP regression is equivalent to a linearly weighted sum of similarities between the target input and the observed input (see Schulz et al. 2018, for a tutorial).

The Diffusion Kernel

We now introduce a kernel function that is specialized for graph-structured input spaces. A graph $G=(\mathcal {S}, \mathcal {E})$ consists of nodes $s \in \mathcal {S}$ and edges $e \in \mathcal {E}$ (Fig. 1a). As a concrete example, a subway map describes a graph structure, where nodes correspond to stations and edges correspond to connections between stations. For now, we assume that all edges are undirected, so that probabilistic dependencies between any two adjacent nodes are symmetric.

The diffusion kernel (DF; Kondor and Lafferty 2002) defines a similarity metric $k(s,s^{\prime })$ between any two nodes based on the matrix exponentiation^{Footnote 2} of the graph Laplacian:

$$ \mathbf{K} = \exp(-\alpha \mathbf{L}), $$

(5)

where the graph Laplacian L captures the transition structure of the graph based on the difference between the adjacency matrix A and degree D:

$$ \mathbf{L} = \mathbf{D}-\mathbf{A} $$

(6)

Each element A_ij of the adjacency matrix A is 1 when nodes i and j are connected, and 0 otherwise, while D is a diagonal matrix computed from the row sums of A and describes the number of connections of each node. Returning to our subway example, when there exists a route between stations i and j, then A_ij = 1 while L_ij = − 1. In addition, for any station i, both D_ii and L_ii indicate the number of connected stations. The graph Laplacian can also describe graphs with weighted edges, where we can substitute the weighted adjacency matrix W for A, where each element W_ij describes the edge weight between nodes i and j, and the weighted degree of each node is expressed in the diagonals of D.
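As a small worked example of Eq. 6, the sketch below builds the Laplacian of a three-station line (0 – 1 – 2). The function name and the example graph are illustrative assumptions.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Graph Laplacian L = D - A, with D the diagonal matrix of row sums of A."""
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))
    return D - A

# Three stations on a line: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L = graph_laplacian(A)
# L has the degrees (1, 2, 1) on the diagonal and -1 wherever two stations are connected
```

The same function accepts a weighted adjacency matrix W in place of A, in which case the diagonal holds the weighted degrees.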

Intuitively, the graph Laplacian can be understood as a measure of the “flux” between nodes, for instance, the flow of passengers along different sections of a subway network. Flux between nodes i and j is not only influenced by whether they are connected, but is also affected by other connected nodes. For instance, if two train stations are connected to many other stations, then there is a relatively low probability that a randomly selected commuter will transit between them, compared with when the two stations have few alternative connections.

The diffusion kernel uses this intuition to define a similarity metric over discrete graph-structured spaces (Fig. 1a), by assuming that output values diffuse along the edges of a graph, similar to a heat diffusion process. The free parameter α models the rate of diffusion, where $\alpha \rightarrow 0$ assumes complete independence between nodes, and $\alpha \rightarrow \infty $ assumes all nodes are perfectly correlated. Thus, closely connected nodes are assumed to have similar output values, where the covariance between nodes decays monotonically as a function of graph distance (Fig. 1c).
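The behavior of α can be checked numerically. The sketch below computes Eq. 5 through an eigendecomposition of the symmetric Laplacian, which is one standard way to evaluate a matrix exponential for undirected graphs; the function name and the four-node path graph are illustrative assumptions.

```python
import numpy as np

def diffusion_kernel(adjacency, alpha=1.0):
    """K = exp(-alpha * L), computed via the eigendecomposition of L = D - A."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)  # valid because L is symmetric (undirected graph)
    # U diag(exp(-alpha * lambda)) U^T
    return (eigvecs * np.exp(-alpha * eigvals)) @ eigvecs.T

# Path graph with four nodes: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

K = diffusion_kernel(A, alpha=1.0)          # covariance with node 0 shrinks with path distance
K_small = diffusion_kernel(A, alpha=1e-9)   # close to the identity: nodes nearly independent
K_large = diffusion_kernel(A, alpha=1e3)    # close to a constant matrix: nodes nearly perfectly correlated
```

On this graph, the first row of `K` decreases from node 0 to node 3, matching the monotonic decay with graph distance described above.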

Connecting Spatial and Structured Generalization

The GP framework allows us to relate similarity-based generalization on graphs to theories of generalization in continuous domains (Fig. 1 bottom row). Consider the case of an infinitely fine lattice graph (i.e., a grid-like graph with equal connections for every node and with the number of nodes and connections approaching infinity). Following Kondor and Lafferty (2002) and using the diffusion kernel defined by Eq. 5, this limit can be expressed as

$$ k(s,s^{\prime})=\frac{1}{\sqrt{(4\pi\alpha)}}\exp{\left( \frac{-|s-s^{\prime}|^{2}}{4\alpha}\right)}, $$

(7)

which is equivalent to the radial basis function (RBF) kernel. The RBF kernel provides a similarity metric in continuous spaces based on Euclidean distance between data points (Fig. 1b), where similarity decreases with distance. In comparison, the diffusion kernel models similarity based on the dynamics of diffusion, where transitions are restricted by the graph structure. The RBF kernel can be understood as a special case of the diffusion kernel, when the environment is symmetric and transitions are unrestricted. The diffusion kernel is therefore able to offer a broader framework for modeling function learning and search, which subsumes past research on human behavior in spatial and conceptual input spaces (Wu et al. 2018; Wu et al. 2020).

Experiment 1: Subway Prediction Task

In our first experiment, participants were shown various graph structures described as subway maps (Fig. 2), and were asked to make predictions about unobserved nodes. For each prediction, participants also gave confidence judgments, which we use as an estimate of their (inverse) uncertainty. We used a GP parameterized with a diffusion kernel as a model of function learning in this task, which we compared with several alternative models.

Methods

Participants

We recruited 100 participants (M_age = 32.7; SD = 8.4; 28 female) on Amazon Mechanical Turk (requiring a 95% approval rate and 100 previously completed HITs) to perform 30 rounds of a graph prediction task. The experiment was approved by the Harvard Institutional Review Board (IRB15-2048).

Procedure

On each graph, numerical information was provided about the number of passengers at 3, 5, or 7 other stations (along with a color aid), from which participants were asked to predict the number of passengers at a target station (natural numbers from 0 to 50) and provide a confidence judgment (Likert scale from 1 to 11). The subway passenger cover story was used to provide intuitions about graph-correlated functions, similar to our example from the introduction. The color aid was generated through a continuous, linear mapping (similar to Wu et al. 2018; Schulz et al. 2019; Meder et al. 2020), with both hue and brightness changing monotonically with value. Additionally, participants observed 10 fully revealed graphs to familiarize themselves with the task and completed a comprehension check before starting the task.

Participants were paid a base fee of US$2.00 for participation with an additional performance contingent bonus of up to US$3.00. The bonus payment was based on the mean absolute judgement error weighted by confidence judgments: $R_{bonus} ={\text {US}}\$3.00 \times (25 - {\sum }_{i} \tilde {c}_{i} \epsilon _{i})/25$ where $\tilde {c}_{i}$ is the normalized confidence judgment $\tilde {c}_{i} = \frac {c_{i}}{\sum c_{j}}$ and 𝜖_i is the absolute error for judgment i. On average, participants completed the task in 8.09 min (SD = 3.7) and earned US$3.87 (SD = US$0.33).
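As a worked example of the bonus formula, the sketch below computes the payment for three hypothetical judgments. The function and argument names are illustrative, and the source does not state how a confidence-weighted error above 25 was handled, so no floor is applied here.

```python
def bonus_payment(confidences, abs_errors, max_bonus=3.00, scale=25.0):
    """Bonus = max_bonus * (scale - confidence-weighted mean absolute error) / scale."""
    total_conf = sum(confidences)
    # Normalized confidence c_i / sum_j c_j weights each absolute error
    weighted_error = sum(c / total_conf * e for c, e in zip(confidences, abs_errors))
    return max_bonus * (scale - weighted_error) / scale

# Hypothetical participant: confidence ratings and absolute errors on three judgments
example_bonus = bonus_payment(confidences=[10, 5, 5], abs_errors=[2, 10, 6])
# Normalized weights are 0.5, 0.25, 0.25, so the weighted error is 1 + 2.5 + 1.5 = 5
# and the bonus is 3.00 * (25 - 5) / 25 = 2.40
```

Because errors are weighted by normalized confidence, a large error made with high confidence lowers the bonus more than the same error made with low confidence.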

All participants observed the same set of 40 graphs that were sampled without replacement for the 10 fully revealed examples in the familiarization phase and for the 30 graphs in the prediction task. We generated the set of 40 graphs by iteratively building 3 × 3 lattice graphs (also known as mesh or grid graphs), and then randomly pruning 2 out of the 12 edges. In order to generate the functions (i.e., number of passengers), we sampled a single function from a GP prior over the graph, where the diffusion parameter was set to α = 2.
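The generation procedure can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the sketch does not check whether pruning disconnects the graph, and it omits any rescaling of sampled values to the 0 to 50 passenger range, since neither detail is specified above.

```python
import numpy as np

def lattice_edges(rows=3, cols=3):
    """List the edges of a rows x cols lattice (12 edges for a 3 x 3 lattice)."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                edges.append((node, node + 1))      # horizontal edge
            if r + 1 < rows:
                edges.append((node, node + cols))   # vertical edge
    return edges

def pruned_lattice_adjacency(rng, rows=3, cols=3, n_prune=2):
    """Adjacency matrix of a lattice after removing n_prune randomly chosen edges."""
    edges = lattice_edges(rows, cols)
    keep = rng.choice(len(edges), size=len(edges) - n_prune, replace=False)
    A = np.zeros((rows * cols, rows * cols))
    for idx in keep:
        i, j = edges[idx]
        A[i, j] = A[j, i] = 1.0
    return A

def sample_function(A, alpha, rng):
    """Draw one function from a zero-mean GP prior with a diffusion kernel."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)
    z = rng.standard_normal(len(A))
    # f = U diag(exp(-alpha * lambda / 2)) z has covariance exp(-alpha * L)
    return eigvecs @ (np.exp(-0.5 * alpha * eigvals) * z)

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
A = pruned_lattice_adjacency(rng)
f = sample_function(A, alpha=2.0, rng=rng)
```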

Modeling

We compared the predictive performance of the GP with two heuristic models that use a nearest neighbors averaging rule (see below). Models were fit to each individual participant by using leave-one-round-out cross-validation to iteratively compute the maximum likelihood estimates on a training set, and then make out-of-sample predictions on the held-out round. We repeated this procedure for all rounds and compared the predictive performance (see Appendix 2) over all held-out rounds.

The two heuristic strategies for function learning on graphs make predictions about the output values of a target state s_∗ based on a simple nearest neighbors averaging rule. The k-nearest neighbors (kNN) strategy averages the values of the k nearest nodes (including all nodes with the same shortest path distance as the k-th nearest), while the d-nearest neighbors (dNN) strategy averages the values of all nodes within path distance d. Both kNN and dNN default to a prediction of 25 (i.e., the median value in the experiment) when the set of neighbors is empty.
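The two averaging rules can be sketched as below. The function names and data layout are illustrative assumptions, and the sketch assumes that "nearest nodes" refers to the nearest observed nodes, measured by shortest path distance from the target.

```python
from collections import deque

def path_distances(adjacency, source):
    """Breadth-first search: shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr, connected in enumerate(adjacency[node]):
            if connected and nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def dnn_predict(adjacency, observed, target, d=1, default=25.0):
    """Average all observed values within path distance d of the target."""
    dist = path_distances(adjacency, target)
    vals = [v for node, v in observed.items() if node in dist and dist[node] <= d]
    return sum(vals) / len(vals) if vals else default

def knn_predict(adjacency, observed, target, k=1, default=25.0):
    """Average the k nearest observed values, including ties with the k-th nearest."""
    dist = path_distances(adjacency, target)
    reachable = sorted((dist[n], v) for n, v in observed.items() if n in dist)
    if not reachable:
        return default
    cutoff = reachable[min(k, len(reachable)) - 1][0]
    vals = [v for node_dist, v in reachable if node_dist <= cutoff]
    return sum(vals) / len(vals)

# Five stations on a line (0 - 1 - 2 - 3 - 4) with three observed passenger counts
adj = [[0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 1, 0]]
observed = {0: 10.0, 2: 30.0, 4: 50.0}

dnn_1 = dnn_predict(adj, observed, target=1, d=1)   # stations 0 and 2 are within distance 1
knn_1 = knn_predict(adj, observed, target=1, k=1)   # stations 0 and 2 tie as nearest
knn_3 = knn_predict(adj, observed, target=1, k=3)   # all three observations are used
```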

Both the dNN and kNN heuristics approximate the local structure of a graph with the intuition that nearby nodes have similar output values. While they sometimes make the same predictions as the GP model while having lower computational demands, they fail to capture the full connectivity structure of the graph. Thus, they are unable to learn directional trends (e.g., increasing function values from one end of the graph to the other) or asymmetric influences (e.g., a central hub exerting relatively larger influence on sparsely connected neighbors). Additionally, they only make point-estimate predictions, and thus do not capture the underlying uncertainty of a prediction, which we use to model confidence judgments.

Results and Discussion

All code and data necessary to replicate the analyses in this manuscript are publicly available at https://github.com/charleywu/graphInference. Figure 3 shows the behavioral and model-based results of the experiment. We applied Bayesian mixed effects regression to estimate the effect of the number of observed nodes on participant prediction errors, with participants as a random effect (see Table 1 in Appendix 1 for details). Participants made systematically lower errors in their predictions as the number of observations increased (b_numNodes = − 0.60, 95% HPD: [− 0.79,− 0.41], BF₁₀ = 1.1 × 10⁷; Table 1 in Appendix 1; Fig. 3a). Repeating the same analysis but using participant confidence judgments as the dependent variable, we found that confidence increased with the number of available observations (b_numNodes = 0.23, 95% HPD: [0.17,0.30], BF₁₀ = 4.7 × 10⁸; Table 1 in Appendix 1; Fig. 3b). Finally, participants were also able to calibrate confidence judgments to the accuracy of their predictions, with higher confidence predictions having lower error (b_confidence = − 0.66, 95% HPD: [− 0.83,− 0.49], BF₁₀ = 4.0 × 10⁸; Table 1 in Appendix 1; Fig. 3c). We found no effect of round number on prediction error (b_round = 0.01, 95% HPD: [0.02,− 0.03], BF₁₀ = 0.06), suggesting that the familiarization phase and cover story were sufficient for providing intuitions about graph-correlated structures.

Figure 3d shows the model comparison results. We evaluated the relative performance of models using the protected exceedance probability (pxp), a Bayesian estimate of the probability that a particular model is more frequent in the population than all the other models under consideration, corrected for chance (see Appendix 1; Stephan et al. 2009; Rigoux et al. 2014). The GP with diffusion kernel was overwhelmingly the best model, with pxp(GP) ≈ 1. Overall, 68 out of 100 participants were best predicted by the GP, 21 by the dNN, and 11 by the kNN (Fig. 3e; see Fig. 6 for additional comparisons between model predictions and participant judgments).

Figure 3f shows individual parameter estimates of each model. The estimated diffusion parameter α was substantially lower than the ground truth of α = 2 (t(99) = − 31.3, p < .001, d = 3.1, BF₁₀ = 4.4 × 10²⁹)^{Footnote 3}, replicating previous findings that have shown undergeneralization to be a prominent feature of human behavior (Wu et al. 2018). Estimates for d and k were highly clustered around the lower limit of 1, suggesting that averaging over larger portions of the graph was not consistent with participant predictions.

Lastly, an advantage of the GP is that it produces Bayesian uncertainty estimates for each prediction. While the dNN and kNN models make no predictions about confidence, the GP’s uncertainty estimates correspond to participant confidence judgments, which we validated using a Bayesian mixed model regressing the uncertainty estimates of the GP onto participant confidence judgments (b_{gpUncertainty} = − 1.8, 95% HPD: [− 2.5,− 1.1], BF₁₀ = 1.2 × 10⁵; Table 1 in Appendix 1; Fig. 3g).

The results of this experiment demonstrate that a GP with a diffusion kernel can successfully model human function learning on graphs, in particular the empirical pattern of predictions and confidence ratings. Our model extends existing theories of human function learning in continuous spaces, where the RBF kernel (commonly used in continuous domains) can be seen as a special limiting case of the diffusion kernel.

Experiment 2: Graph Bandit

In our next experiment, we tested the suitability of the diffusion kernel as a model of search, using a multi-armed bandit task with structured rewards (see also Wu et al. 2018). In particular, extending our previous work on spatially and conceptually correlated multi-armed bandits (Wu et al. 2018; Wu et al. 2020), we constructed a task where rewards were defined by the connectivity structure of a graph (Fig. 4). In this task, participants searched for rewards by clicking nodes on a graph. As in Experiment 1, the output values (rewards) were generated by a function drawn from a GP with a diffusion kernel. This induced a graph-correlated reward structure, allowing for similarity-based generalization to aid in search, but where similarity was defined based on connectivity rather than perceptual features or Euclidean distances between options as in our previous work.