Decision makers typically gather information before making a choice or inference. Imagine someone who wants to find a reliable used car and chooses between, say, a 2008 Honda and a 2012 Toyota. To facilitate the choice, the person needs to gather relevant information, which usually takes the form of attributes of the alternatives, such as the mileage, model year, and accident histories of the cars. These attributes are often referred to as cues (e.g., Stewart, 1988). How do people use this information (i.e., cues) to make decisions? One approach is to posit that people are equipped with a variety of strategies that they can select adaptively to solve the decision problems they face (e.g., Beach & Mitchell, 1978; Gigerenzer & Selten, 2002; Lieder & Griffiths, 2017; Payne et al., 1993), and the selection of a particular strategy depends on many factors, such as time pressure (Payne et al., 1996; Rieskamp & Hoffrage, 2008), information cost (Beach & Mitchell, 1978; Bröder, 2000), feedback on decision outcomes (Rieskamp & Otto, 2006), and the cognitive cost associated with each strategy (Fechner et al., 2018). Given this repertoire of possible strategies, identifying which strategy or strategies people use in a given task has been a challenge for researchers.

Some strategy identification methods use only individuals’ choices to infer the strategy they may have applied (e.g., Bröder & Schiffer, 2003; Bröder, 2003; Hilbig & Moshagen, 2014; Lee, 2016). Other methods consider additional data, such as confidence ratings, decision time, and eye-tracking measures (e.g., Glöckner, 2009; Rieskamp & Hoffrage, 2008). Compared to relying on choices alone, strategies can be identified more reliably and accurately when a variety of behavioral measures are taken into account (e.g., Glöckner, 2009; Lee et al., 2019; Riedl et al., 2008).

The present study proposes a machine learning method for strategy identification that takes both choices and other behavioral data into consideration. Compared to existing methods, this method has fewer constraints on the form and the amount of behavioral data required, and it can identify strategies on a trial-by-trial basis. Thus, it can detect dynamic changes in strategy selection that elude other methods. The remainder of this article is structured as follows. We start by describing the decision strategies that we use to illustrate the machine learning approach. We then outline methods of strategy identification applied frequently in the literature, followed by an introduction to the novel machine learning approach that we call machine learning strategy identification (MLSI). After that, we describe the performance of MLSI in three experiments. To conclude, we discuss some limitations of MLSI and future directions.

Decision strategies

The set of decision strategies we focus on in this study has been studied extensively within the simple heuristics research framework, which posits that people are equipped with a toolbox of strategies that they can apply adaptively in different task environments (Gigerenzer & Goldstein, 1996; Gigerenzer & Selten, 2002; Gigerenzer et al., 1999). Consider the used car example shown in Fig. 1, in which five cues are available and are ranked according to their validities. In a paired-comparison task, where one must infer which alternative of a pair has the larger criterion value, a cue’s validity is defined as the probability that it leads to a correct inference given that the cue discriminates, that is, when the cue values differ between the two alternatives being compared. Knowing the cue information (i.e., cue values and cue validities), how do people use that information to make a decision? In other words, what are the decision strategies they may use? We describe four below.

Fig. 1

Two used cars described by five cues. Durability is the criterion variable whose values are not known to the decision maker. The decision maker needs to infer from the cues which car will be more durable. The cues are ordered by validity, a measure of a cue’s quality

A person using take-the-best decides by searching for a cue that discriminates between the alternatives (Gigerenzer et al., 1991). Specifically, take-the-best searches cues in the order of their validities. If a cue discriminates, it chooses the alternative that has a cue value associated with a higher criterion value. If the cue values of the alternatives are the same, then take-the-best moves on to the next cue. If no cue discriminates, it selects one of the alternatives randomly. Table 1 shows how take-the-best, as well as the three other strategies, would make the used-car decision in Fig. 1.

Table 1 Examples of strategy applications for the decision problem shown in Fig. 1

Δ-inference works similarly to take-the-best (Luan et al., 2014). However, instead of stopping search when the cue values differ between the alternatives, Δ-inference stops search and makes a decision when the cue value of one alternative exceeds that of the other by a threshold Δ. Take-the-best can be considered a special case of Δ-inference in which Δ is set to zero for all cues. For take-the-best and Δ-inference, a cue-wise information search is expected; that is, one would inspect both alternatives’ values on a cue before moving on to the next cue or making a decision. Because the two strategies are so similar, take-the-best and Δ-inference often lead to the same choices, posing a challenge to strategy identification. This challenge is especially pronounced when Δ is small.

The third strategy is weighted-additive (WADD), in which one weights an alternative’s cue values by each cue’s importance, adds up the weighted values for an overall score, and selects the alternative with the highest score (Payne et al., 1988). In a task with binary cues, cue values can be weighted by each cue’s validity. When cues are not binary, cue dichotomization may simplify the weighting-and-adding process. Specifically, one may first dichotomize a cue with a threshold, treating the higher or more favorable values as “1” and others as “0.” A weighted score for each alternative is then calculated based on the dichotomized cue values and cue validities.

Tallying is a special case of WADD in which one treats all cues as equally important (Payne et al., 1988). In doing so, tallying can reduce the amount of computation substantially. For WADD and tallying, an alternative-wise search is expected; that is, one would inspect all cue values of one alternative before checking those of the other alternative.

These four strategies can be grouped into two general categories based on how they use cue information. Take-the-best and Δ-inference are examples of non-compensatory strategies, in which favorable or unfavorable values on lower-ranked cues cannot compensate for favorable or unfavorable values on higher-ranked cues, and thus cannot overrule decisions made by higher-ranked cues. For example, one insists on buying a four-wheel drive Jeep no matter what discount the dealer offers on a two-wheel drive model. In contrast, WADD and tallying are compensatory strategies that allow trade-offs among cues, so that favorable values on lower-ranked cues can compensate for unfavorable values on higher-ranked cues.
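To make the four strategies concrete, the sketch below implements each of them for a single paired comparison. It is our own illustration rather than code from the cited work; it assumes that cue values have already been recoded so that larger values are more favorable, that cue lists are ordered by validity, and that WADD dichotomizes cues at given thresholds as described above.

```python
# A minimal sketch of the four strategies for one paired comparison with
# continuous cues. Cues are assumed to be recoded so that larger values are
# more favorable and listed in order of validity; returns 0 for the first
# alternative and 1 for the second.
import random

def take_the_best(a, b):
    """Decide by the first cue whose values differ between the alternatives."""
    for va, vb in zip(a, b):
        if va != vb:
            return 0 if va > vb else 1
    return random.choice([0, 1])          # no cue discriminates: guess

def delta_inference(a, b, deltas):
    """Like take-the-best, but a cue decides only if the values differ by more than its delta."""
    for va, vb, d in zip(a, b, deltas):
        if abs(va - vb) > d:
            return 0 if va > vb else 1
    return random.choice([0, 1])

def wadd(a, b, validities, thresholds):
    """Dichotomize each cue at its threshold, weight by validity, add, compare."""
    score_a = sum(w * (va >= t) for va, w, t in zip(a, validities, thresholds))
    score_b = sum(w * (vb >= t) for vb, w, t in zip(b, validities, thresholds))
    if score_a == score_b:
        return random.choice([0, 1])
    return 0 if score_a > score_b else 1

def tallying(a, b, thresholds):
    """WADD with equal weights: simply count favorable cue values."""
    return wadd(a, b, [1] * len(a), thresholds)
```

Setting every Δ to zero makes delta_inference behave exactly like take_the_best, which illustrates the special-case relation noted above.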

Inferring which strategies people use to make decisions, such as which used car to buy, is a problem that decision researchers have tried to solve for decades (e.g., Bröder & Schiffer, 2003; Glöckner, 2009; Hilbig & Moshagen, 2014; Lee et al., 2019). Before introducing the new method that we tested in this study, we review the main existing methods that have been applied to the problem of determining what strategies people are using. We contrast outcome-based methods, which use only the choices made, with process-based methods, which incorporate behavioral data leading up to the choice.

Outcome-based methods

Choice outcomes are frequently used to infer decision-making processes because they are readily observed. For instance, in structural modeling, researchers run multiple regressions between cues and choice outcomes, and take regression weights as indications of how heavily a person relies on each cue (e.g., Brehmer, 1994; Stewart, 1988). Because it only describes the statistical relations between cues and choice outcomes, this approach does not reveal much about the actual process of how a decision is made (Bröder, 2000).

In comparative model fitting, a metric, such as maximum likelihood, is calculated to gauge how well a strategy describes choice outcomes. For example, Bröder and Schiffer (2003) compared an individual’s choice outcomes with the predictions of several strategies and treated the strategy with the highest estimated likelihood as the one most likely to have produced the observed choice pattern. This method assumes that an individual uses only one strategy and applies it with a constant error rate across different combinations of alternatives, which they referred to as item types. For the method to work well, researchers need to carefully design a set of item types so that the strategies make markedly different outcome predictions across trials (Jekel et al., 2010). Comparative model fitting is not unique in this regard; diagnostic items are required for all identification methods to a greater or lesser extent. Hilbig and Moshagen (2014) further developed this method by using a multinomial processing tree formalism, which allows error rates to vary across item types (i.e., each type can have its own error rate) instead of assuming a fixed error rate for all item types. The error rates are further decomposed into random application errors and systematic errors associated with an item type, which helps increase the accuracy of strategy identification.
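To illustrate the logic of comparative model fitting, the following sketch computes, for each candidate strategy, the likelihood of a participant’s choices under a single constant application-error rate and returns the best-fitting strategy. It is a simplified illustration of the idea, not the procedure of Bröder and Schiffer (2003); the function names and the maximum-likelihood estimate of the error rate are our own choices.

```python
# A minimal sketch of likelihood-based strategy classification with one
# constant error rate per strategy. `choices` holds the observed decisions
# (0 or 1); `predictions[s]` holds strategy s's predicted decision per trial.
import numpy as np

def log_likelihood(choices, predicted, epsilon=None):
    choices, predicted = np.asarray(choices), np.asarray(predicted)
    mismatches = np.sum(choices != predicted)
    n = len(choices)
    if epsilon is None:                   # maximum-likelihood error rate;
        epsilon = mismatches / n          # in practice often constrained below .5
    epsilon = min(max(epsilon, 1e-6), 1 - 1e-6)
    return mismatches * np.log(epsilon) + (n - mismatches) * np.log(1 - epsilon)

def best_fitting_strategy(choices, predictions):
    """Return the strategy whose predictions maximize the likelihood of the choices."""
    return max(predictions, key=lambda s: log_likelihood(choices, predictions[s]))
```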

The comparative model fitting approach assumes that individuals use the same strategy over time, and it identifies strategies based on the overall choice outcomes in a task. However, studies have found that people may use a mixture of strategies over a sequence of decisions (e.g., Davis-Stober & Brown, 2011; Scheibehenne et al., 2013) and adapt strategies in response to environmental changes (Lee et al., 2014; Lieder & Griffiths, 2017; Rieskamp & Otto, 2006). A latent mixture model can accommodate these findings, because it allows for the possibility of incorporating multiple strategies into a single mixture model. Such models are amenable to various statistical methods, including Bayesian methods. For example, Scheibehenne et al. (2013) used a Bayesian framework to infer the strategy or combination of strategies most likely to have produced the outcome data by comparing the posterior probabilities of a single-strategy model and a multiple-strategy model (see also Lee, 2011, 2016).

Process-based methods

Outcome-based methods draw inferences about strategies from observed decisions. In contrast, process-based methods infer strategies from dynamic process data associated with each strategy. Process data can be collected while information is acquired, integrated, and evaluated, using methods such as mouse tracing, verbal reports, brain imaging, and eye tracking. Researchers analyze these process measures to infer individuals’ cognitive processes and, thereby, the decision strategies used. Various process-tracing techniques have been developed to collect data about how people acquire information (see Schulte-Mecklenbeck et al., 2017b, for a recent overview).

For example, information boards display cues and cue values in a matrix and are commonly used to track how individuals acquire information in an experiment (e.g., Bettman et al., 1990; Johnson, et al., 2008; Payne et al., 1993). In most computer-based information matrices (e.g., MouseLab), information about the cue values is hidden behind boxes. At the beginning of a decision trial, all boxes in the matrix are closed. As a participant moves the mouse over or clicks on a box, it opens to reveal the cue value. Figure 2 shows an example of how participants may move their mouse on an information board display when applying take-the-best.

Fig. 2

An example of how participants might move their mouse as they use take-the-best to make a decision based on cues displayed in an information board. In step 1, a participant clicks on the box of the highest-ranked cue (i.e., the mileage cue) for car A to reveal the cue value (i.e., 33,000), and then in step 2, clicks on the same cue for car B. Because values of the mileage cue discriminate between the cars, the participant in step 3 chooses car B, which is inferred to be the more durable car

A variety of process measures indicative of different decision strategies can be constructed from mouse movements on an information board. These measures include the total time spent on a trial, the proportion of information searched, the variability in the amount of information searched per alternative, and the ratio of cue-wise to alternative-wise transitions, to name a few (for a comprehensive list, see Riedl et al., 2008). These measures have also been extended to eye-tracking studies (Krol & Krol, 2017; Schulte-Mecklenbeck et al., 2017a).
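As an illustration, the sketch below computes two such measures from the sequence of opened cells on an information board: an index contrasting alternative-wise and cue-wise transitions, and the proportion of information searched. The function name and data format are our own; the index follows the familiar logic of comparing the two transition types.

```python
# A minimal sketch of two process measures derived from an acquisition
# sequence. Each acquisition is a (cue, alternative) pair in the order it was
# opened. The index ranges from -1 (purely cue-wise search) to +1 (purely
# alternative-wise search).
def search_measures(acquisitions, n_cues, n_alternatives):
    alt_wise = cue_wise = 0
    for (cue1, alt1), (cue2, alt2) in zip(acquisitions, acquisitions[1:]):
        if alt1 == alt2 and cue1 != cue2:
            alt_wise += 1          # same alternative, different cue
        elif cue1 == cue2 and alt1 != alt2:
            cue_wise += 1          # same cue, different alternative
    total = alt_wise + cue_wise
    index = (alt_wise - cue_wise) / total if total else 0.0
    proportion_searched = len(set(acquisitions)) / (n_cues * n_alternatives)
    return index, proportion_searched

# e.g., a cue-wise search over the two highest-ranked cues of both alternatives
print(search_measures([(1, "A"), (1, "B"), (2, "A"), (2, "B")], 5, 2))
```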

The conventional approach to strategy identification compares the observed process measures with the canonical process patterns of each candidate strategy (e.g., Day, 2010). Sometimes, analyses of process measures are used to bolster conclusions from outcome-based methods. For instance, Rieskamp and Hoffrage (2008) found that participants who were identified as using take-the-best or WADD based on their choices also differed significantly on six process measures. When different strategies arrive at the same choices, process measures provide additional information to assist strategy identification.

Combining outcome-based and process-based methods

Strategies make predictions not only about the decisions people make, but also about the information search and cognitive processing they undertake before reaching those decisions. Researchers have stressed the importance of combining outcome and process-tracing data to uncover human decision processes more accurately than either type of data allows alone (Costa-Gomes et al., 2001; Harte & Koele, 2001; Glöckner, 2009; Lee et al., 2019). For example, Riedl et al. (2008) developed a decision tree, called DecisionTracer, with three process measures and one outcome-based measure as the decision nodes. Glöckner (2009, see also Jekel et al., 2010) developed the multiple-measure maximum likelihood (MM-ML) method, which integrates choice outcomes, decision times, and confidence ratings to identify strategies based on the Bayesian information criterion. We use MM-ML as a benchmark in the present study because it relies on data that are typically collected in decision experiments.

Incorporating multiple sources of data may decrease the number of decisions required to identify the strategies people are using. Several recent attempts have been made to identify strategy switches, because decision makers can use different strategies over time, even when facing the same kind of decisions (Lee & Gluck, 2020). Brusovansky et al. (2018) proposed a model that deploys strategies in a trial-by-trial stochastic manner by using a probabilistic switching parameter. Lee et al. (2019) incorporated decision outcomes, verbal report data, and search behavior into a Bayesian hierarchical model to infer when individuals may have changed strategies and how often they did so. These existing methods infer strategy switches from observations across multiple trials. In many situations outside the laboratory, however, people do not make repeated decisions. Here, we propose machine learning techniques that can identify strategies on a trial-by-trial basis by integrating the process and outcome data collected in a single decision trial.

Machine learning strategy identification (MLSI)

The problem of strategy identification entails inferring individuals’ strategies based on behavioral data, such as choice outcomes and process measures. These data help differentiate strategies because each strategy is presumably associated with a signature pattern of data. Machine learning (ML) techniques, specifically supervised learning, solve similar classification problems. The goal of supervised learning applied to strategy identification is to find a good classifier that can distinguish between strategies based on the combination of outcome and process data in data sets where strategy labels are known.Footnote 1 The resulting classifier is then applied to assign strategy labels in data sets where individuals’ strategies are unknown. We next explain the detailed workflow of this method (see Fig. 3 for an overview).

1. The strategy identification problem

Fig. 3

The process of identifying strategies using machine learning algorithms

The ML method aims to find a function that maps behavioral traces (i.e., all data collected during an experiment) to strategies. We look for this function by training an ML algorithm on a data set with known trace–label associations. The trained model can then assign a strategy label, such as take-the-best or WADD, to a given behavioral trace.

2. Collect labeled data

The labeled data are relevant behavioral data engendered by a strategy, and their forms depend on the means of data collection (e.g., mouse movements or eye tracking). To increase the efficacy of trained ML algorithms, the labeled data should be representative of the unlabeled data that will need to be classified later.

3. Construct features

Even a single decision can produce a large amount of raw data. Using the raw data to differentiate strategies is, however, not ideal, because these data typically contain much noise and irrelevant information. Moreover, the high dimensionality of the raw data can result in computational inefficiency. Therefore, when building a classification model, it is crucial to derive a set of informative features from the raw data. For example, the process-based approaches we reviewed above suggest some potentially useful features, such as the time spent on reading information and the proportion of information searched. Furthermore, using meaningful features can help improve the interpretability of the resulting classifier and, in turn, provide a better understanding of the strategies people use to make decisions. With these goals in mind, we constructed features from the raw data.

4. Divide data into training and testing sets

Labeled data are divided into a training set and a testing set. An ML model is trained on the training set, while its performance is evaluated on the testing set. A model’s performance in the testing set is one way to measure its ability to generalize to unseen data, and one common metric to evaluate model performance is its identification accuracy on the testing set.

Some ML models have hyperparameters that control the learning process (e.g., the maximum depth of trees in random forest). These hyperparameters need to be tuned to optimize a model’s performance. A common approach is to test all combinations of hyperparameters in a predefined search space through K-fold cross-validation. Specifically, the training set is split evenly into K subsets. A model is trained on K − 1 subsets and evaluated on the remaining subset; the procedure is repeated for each subset, and the hyperparameter combination with the best average performance is selected.
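A minimal sketch of this tuning step uses scikit-learn’s GridSearchCV with ten-fold cross-validation on a random forest; the feature matrix, labels, and search space shown here are placeholders rather than the settings used in our experiments.

```python
# A minimal sketch of hyperparameter tuning via K-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 26))        # placeholder feature matrix
y_train = rng.integers(0, 3, size=200)      # placeholder strategy labels

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10, 20],         # maximum depth of trees
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,                                  # K = 10 folds within the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)
best_model = search.best_estimator_         # refit on the full training set
print(search.best_params_)
```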

5. Select ML algorithms

To classify decision strategies, we use classic supervised learning algorithms, including K-nearest neighbors (KNN), random forest (RF), decision trees (DT), support vector machines (SVM), and multilayer perceptron (MLP). The algorithms are implemented in Python using the scikit-learn library (Pedregosa et al., 2011). A comprehensive review of these and other ML algorithms can be found elsewhere (e.g., Friedman et al., 2001; Kotsiantis et al., 2007).

The performance of an ML model is determined mainly by how accurately it classifies the testing set. If the performance does not meet preset criteria, we will return to the previous stage, trying to improve the diagnosticity of the features or include more ML algorithms (see Fig. 3). One criterion is the relative performance of a model compared to that of other models. We also consider the interpretability of a model and the ease of collecting the required behavioral data.
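The sketch below illustrates this comparison step with scikit-learn: the five algorithms named above are trained and then scored on a held-out testing set. The synthetic data stand in for the feature matrices described later and are not taken from our experiments.

```python
# A minimal sketch of comparing the five classifiers on a held-out testing set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))              # placeholder feature matrix
y = rng.integers(0, 3, size=500)            # placeholder strategy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "RF": RandomForestClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
}
for name, model in models.items():
    model.fit(X_train, y_train)                       # train on the training set
    print(name, round(model.score(X_test, y_test), 3))  # accuracy on the testing set
```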

A worked example

Here, we present an example of how the ML approach works for strategy identification, using KNN and an SVM with a linear kernel (linear SVM) as the classification models (see Fig. 4). We first simulated 40 take-the-best and 40 WADD trials. Each trial was labeled with the strategy that generated its data, and two features were recorded: total decision time and the proportion of information searched. The 80 trials were divided so that 60% were used for training and 40% for testing. The ML models learned the decision boundaries that separated the labeled trials in the training set, and each trial in the testing set was labeled according to the side of the boundary on which it fell. The learned decision boundary can be linear or nonlinear, depending on the ML algorithm. In our case, linear SVM builds a linear decision boundary, whereas KNN produces a nonlinear one.

Fig. 4

Simulated data for take-the-best (red dots) and WADD (weighted-additive; blue dots), and the accuracy of two machine learning algorithms. Each dot represents data from one decision trial. In the two panels in the leftmost column, the x-axis is the decision time in seconds, and the y-axis is the proportion of information searched; these were the two features processed by the machine learning algorithms. To test the accuracy of the algorithms, 60% of all data were used for training, and 40% for testing. The leftmost column shows data from the training set (top) and the testing set (bottom). The upper panels in the right two columns show the classification boundaries constructed by K-nearest neighbors and linear SVM, respectively, based on the training set data, and each model’s training accuracy is shown in the lower right corner. The lower panels in these columns show model predictions in the testing set data, and the identification accuracy is shown in the lower-right corner. The color shadings indicate confidences of model identifications, with darker shadings representing higher levels of confidence
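The structure of this worked example can be reproduced with a few lines of scikit-learn code. The simulated feature distributions below are illustrative assumptions and do not reproduce the exact data behind Fig. 4.

```python
# A minimal sketch of the worked example: simulated take-the-best and WADD
# trials described by decision time and proportion of information searched,
# a 60/40 train/test split, and KNN and linear SVM classifiers.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# take-the-best: shorter decision times, less information searched (assumed values)
ttb = np.column_stack([rng.normal(5, 1.5, 40), rng.normal(0.4, 0.10, 40)])
# WADD: longer decision times, most information searched (assumed values)
wadd = np.column_stack([rng.normal(12, 2.5, 40), rng.normal(0.9, 0.05, 40)])
X = np.vstack([ttb, wadd])
y = np.array([0] * 40 + [1] * 40)           # 0 = take-the-best, 1 = WADD

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="linear").fit(X_train, y_train)
print("KNN test accuracy:", knn.score(X_test, y_test))
print("Linear SVM test accuracy:", svm.score(X_test, y_test))
```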

In sum, we have reviewed strategy identification methods that rely on choice outcomes, or process measures, or both, and introduced a new machine learning strategy identification (MLSI) method with a worked example. Next, we explore the potential of MLSI in three experiments.

Experiment overview

We evaluate the MLSI method by investigating how well it identifies the decision strategies individuals use in multi-attribute decision tasks. In Experiment I, we applied MLSI in a task environment where the cues took on continuous values. Such an environment represents a broad range of real-life decision tasks, in which Δ-inference and take-the-best often lead to the same decisions. In Experiment II, we compared the strategy identification accuracy of MLSI and the MM-ML (multiple-measure maximum likelihood) method. In Experiment III, we adapted the ML model that performed best in Experiment I to analyze participants’ decision processes in an environment that favors non-compensatory strategies. The goal was to examine whether and how participants might learn to adopt strategies that yield the highest rewards in such a task.

Experiment I

We explore the potential of the MLSI method in an environment in which cue values are continuous, consistent with how information is often presented in the real world. In the first part of the experiment, we taught participants how to use take-the-best, tallying, or Δ-inference. In the second part, we asked participants to make decisions using the strategy they had been taught, thus obtaining the labeled data needed to evaluate the MLSI method.

Participants

We recruited 180 undergraduate participants from the participant pool of the psychology department at Syracuse University. Participants were randomly assigned to one of the three strategy conditions with 60 participants in each. Informed consent was obtained before the experiment. This study was approved by the university’s institutional review board (IRB). The experimental session took approximately 30 min, and participants were given credits to fulfill a course research requirement. The data for ten participants in the tallying condition, two participants in the Δ-inference condition, and two participants in the take-the-best condition were excluded because their decisions deviated from the ones predicted by the strategy they had been taught in more than 20% of the trials. Overall, there were 58, 58, and 50 participants in the take-the-best, Δ-inference, and tallying conditions, respectively.

Materials

Participants were asked to take the role of a college student interested in purchasing one of two used cars, with the goal of choosing the car that would last longer. Each car was described by five cues: the car’s mileage, model year, the number of accidents, the number of previous owners, and the frequency of maintenance, as shown in Fig. 2. The values of these five cues for the two alternatives (cars) were displayed in a computerized information board, a matrix of five rows and two columns. The cue validities were shown in parentheses to the right of the cue names. The thresholds used to dichotomize cues were presented between the cue boxes for the tallying participants, and the Δ values were shown between the cue boxes for Δ-inference participants. Participants looked up cue values by moving the mouse over the corresponding box and clicking, and indicated their choices by clicking a button at the bottom of the screen. The experiment program recorded mouse locations at 5-ms intervals.

Decision trials varied in terms of search requirements for take-the-best. One to five cues had to be inspected before a cue discriminated between the alternatives and a decision could be made. The five different search patterns were repeated eight times, resulting in a total of 40 decision trials per participant. Cue configurations were constructed such that take-the-best and Δ-inference predicted different choices than did tallying on 20 trials. Take-the-best and Δ-inference made identical predictions on 30 trials. The presentation of the trials was randomized for each participant.

Procedure

Each participant went through a tutorial to learn a particular strategy according to the condition they were in. After the tutorial, they did five practice trials where they were given feedback on each of their decisions. A “correct” decision was one that matched the decision their assigned strategy would make. Participants needed to make correct decisions on at least 80% of the practice trials to proceed to the next step; otherwise, they repeated the tutorial from the beginning. After completing the tutorial, participants proceeded to the decision task that consisted of 40 trials where they were supposed to apply the strategy they had learned. There was no feedback during the decision task on whether their decisions matched their assigned strategy. During both the tutorial and the decision-making phases, participants were alerted when their behavior suggested that they were not on task. Such behavior included checking only the cues for one of the alternatives or making a choice without opening any box.

Results

We first report descriptive statistics about the participants’ overall performance (Table 2), before reporting the results of the MLSI analysis.

Table 2 Descriptive statistics for Experiment I

The descriptive results shown in Table 2 were in line with our expectations for the strategies. A Bayesian one-way ANOVA provided strong evidence that total decision time and the proportion of information searched differed among the strategies (BFs > 1000).

Feature selection

The goal of the feature selection step is to construct features from the labeled raw data, so that the ML algorithms can classify trials as coming from users of take-the-best, tallying, or Δ-inference. Participants interacted with the experiment interface as shown in Fig. 2. Their search behaviors were recorded as mouse coordinates every five milliseconds. Figure 5 shows the mouse traces of three representative trials in which take-the-best, tallying, and Δ-inference were applied to the same decision pair. The traces illustrate the cue-wise search typical of Δ-inference and take-the-best and the alternative-wise search expected from tallying. We then encoded the mouse trace data as features that are indicative of the strategies.

Fig. 5

Examples of mouse movement paths for participants trained to use take-the-best, Δ-inference, and tallying, respectively. In the experiment, participants had to click on boxes to see cue values (see the upper left panel and Fig. 2). The numbers on the boxes displayed in the upper left panel are used to identify features based on the mouse movement paths (see Table 3); they were not shown to the participants. Colors of a movement trace indicate how much time had elapsed since the trial started. The total time participants spent on the trial is shown next to the labels of the trained strategies

Table 3 summarizes the features that were selected. A primary feature is the decision time in each trial (\({x}_{1}\)). As discussed previously, non-compensatory strategies tend to search for less information, resulting in shorter decision times. Compensatory strategies integrate all cue values, which potentially takes longer. The proportion of cues searched (\({x}_{2}\)) reflects the number of boxes that have been opened. Participants using non-compensatory strategies typically need to open fewer boxes than those using compensatory strategies.

Table 3 Feature set for machine learning models in Experiment I

Each box on the information board was assigned a number (see Fig. 5). We recorded the total time taken to process each box, yielding ten features (i.e., \({x}_{3}\) to \({x}_{12}\)) for a five-cue decision task. Ten more features (\({x}_{13}\) to \({x}_{22}\)) represent the search order, that is, the order in which the boxes were opened. We recorded search order by entering the box number of each opened box into one of these ten features. If fewer than ten boxes were opened, zeros were recorded in the remaining search-order features. Non-compensatory and compensatory strategies are associated with cue-wise and alternative-wise search, respectively. An example search order for take-the-best is (1, 6, 2, 7, 0, 0, 0, 0, 0, 0), and one for tallying is (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).

Participants’ choices were compared with the predictions of each strategy (\({x}_{23}\) to \({x}_{25}\)). If a participant’s decision was consistent with a particular strategy, the corresponding feature was coded as “1,” whereas an inconsistent decision was coded as “0.” For example, if a participant’s final choice is consistent with take-the-best and Δ-inference but not with tallying, the feature vector is (1, 1, 0). An additional feature, \({x}_{26}\), is designed to differentiate take-the-best from Δ-inference. It is a binary feature that indicates whether the two alternatives had the same values on the cue inspected just before the discriminating cue. If those values differed, the participant was likely using Δ-inference, and the feature is coded as “0”; otherwise, the participant was more likely using take-the-best, and the feature is coded as “1.” The intuition behind this feature is that when the cue values for both alternatives differ on the preceding cue, the participant could not have used take-the-best, because take-the-best would have stopped and decided on that cue already.
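The sketch below shows how one trial’s raw acquisitions could be encoded into the feature vector of Table 3. The input format, function name, and example values are our own; the choice-consistency flags and \({x}_{26}\) are assumed to have been computed beforehand.

```python
# A minimal sketch of building one trial's 26-element feature vector.
# `opens` is a list of (box_number, seconds_open) events in the order they
# occurred, with boxes numbered 1-10 as in Fig. 5.
def build_features(opens, decision_time, choice_flags, x26, n_boxes=10):
    x = [decision_time]                                   # x1: total decision time
    opened = {box for box, _ in opens}
    x.append(len(opened) / n_boxes)                       # x2: proportion of boxes opened
    time_per_box = [0.0] * n_boxes
    for box, dur in opens:                                # x3-x12: time spent on each box
        time_per_box[box - 1] += dur
    x.extend(time_per_box)
    order = [box for box, _ in opens][:n_boxes]           # x13-x22: search order,
    x.extend(order + [0] * (n_boxes - len(order)))        # zero-padded
    x.extend(choice_flags)                                # x23-x25: choice consistency
    x.append(x26)                                         # x26: TTB vs. delta marker
    return x

# e.g., a cue-wise search that stopped after the first cue discriminated
features = build_features([(1, 1.2), (6, 1.5)], decision_time=4.3,
                          choice_flags=[1, 1, 0], x26=1)
print(len(features))   # 26
```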

MLSI analysis

We randomly selected five participants from each strategy group to serve as the testing set, yielding 600 trials in total. We then used K-fold cross-validation (K = 10) to train the ML algorithms on the remaining participants’ trials (the training set) and to search for optimal hyperparameters.Footnote 2 Lastly, we trained the five ML models on the training set and evaluated their performance on the testing set. Table 4 shows the identification accuracy of the five trained ML models on the testing set. The best-performing model is random forest, with an identification accuracy of 93.8%. The data, the Python code for the analysis, and the confusion matrices can be found in the Supplementary Material.
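Because the testing set consists of whole participants rather than randomly sampled trials, the train/test split is made at the participant level. Below is a minimal sketch of such a split, assuming the trials are stored in a pandas DataFrame with participant, strategy, and feature columns (a data layout of our own choosing).

```python
# A minimal sketch of a participant-level train/test split: hold out
# n_test_per_group participants per strategy group as the testing set.
# `trials` is assumed to be a pandas DataFrame with one row per trial and
# columns "participant", "strategy", plus the feature columns.
import numpy as np

def split_by_participant(trials, n_test_per_group=5, seed=0):
    rng = np.random.default_rng(seed)
    test_ids = []
    for _, group in trials.groupby("strategy"):
        ids = group["participant"].unique()
        test_ids.extend(rng.choice(ids, size=n_test_per_group, replace=False))
    is_test = trials["participant"].isin(test_ids)
    return trials[~is_test], trials[is_test]    # training trials, testing trials
```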

Table 4 Strategy identification accuracy of machine learning models on test participants

The classification results for each test participant’s decision trials are plotted in Fig. 6. Tallying, the compensatory strategy in this experiment, is perfectly identified. The identification accuracy for take-the-best and Δ-inference is 91.5% and 89.5%, respectively. The misclassified Δ-inference trials are exclusively classified as take-the-best trials, and vice versa. This pattern was expected, because take-the-best and Δ-inference are both non-compensatory strategies that compare alternatives cue-wise.

Fig. 6

Results of strategy identification for test participants in Experiment I. Strategies were identified by random forest, the best-performing machine learning algorithm. The five participants at the top were trained to use take-the-best (TTB), the middle five Δ-inference (Delta), and the bottom five tallying. The colored boxes indicate strategies predicted by random forest for each participant in each decision trial, with blue for TTB, orange for Delta, and green for tallying. The column on the right, outlined in grey, shows the overall strategy identified for each participant

Feature importance

It is informative to have not only an accurate model, but also an interpretable one. In strategy identification, we also want to know which features were important for distinguishing the strategies. A better understanding of the model’s logic can help verify that the model behaves sensibly and potentially improve it by selecting the most relevant features. The random forest algorithm can be difficult to interpret because of its randomized nature, but it is possible to learn which features matter most to it.

Gini importance (or mean decrease in impurity) is a measure of feature importance for a random forest model. Random forest constructs a set of decision trees, and each tree has its own internal nodes and leaves. Each internal node uses a feature to split the data into two subsets, each more homogeneous in its responses. We can therefore measure how much each feature decreases the impurity of the splits and rank the features accordingly. Gini importance represents how much each feature decreases impurity, averaged over all trees in the forest (Archer & Kimes, 2008).
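With scikit-learn, Gini importances are available directly from a fitted random forest. The sketch below extracts and ranks them; the placeholder data and the feature names (x1 to x26, following Table 3) are ours.

```python
# A minimal sketch of extracting and ranking Gini importances.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 26))                      # placeholder features
y_train = rng.integers(0, 3, size=200)                    # placeholder strategy labels
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

feature_names = [f"x{i}" for i in range(1, 27)]           # x1 ... x26
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(5))   # most informative features
print(round(importances.sum(), 6))                        # Gini importances sum to 1
```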

Figure 7 shows the Gini importance of the 26 features (Table 3). The feature designed to differentiate between Δ-inference and take-the-best (\({x}_{26}\)) is the most important one. The second most important feature (\({x}_{14}\)) codes which box was opened second. If participants applied cue-wise search, the second box opened should correspond to the highest validity cue of the second alternative (box 6). In contrast, an alternative-wise search suggests that the second box opened should correspond to the second highest validity cue of the first alternative (box 2). Moreover, as suggested by the analysis of decision times, total decision time (\({x}_{1}\)) is also an important feature. The time to read the second alternative’s highest validity cue (\({x}_{8}\)) could help distinguish Δ-inference from take-the-best too, because it generally takes longer to assess whether two cue values differ by a Δ than simply to determine whether they are different.

Fig. 7

The Gini importance of each feature in Experiment I. Gini importance measures how much random forest relies on a particular feature in strategy identification. The sum of Gini importance is 1

Discussion

In this experiment, we tested MLSI in an environment with continuous cues. Random forest, the best-performing ML model, classified participants’ strategies on a trial-by-trial basis with an overall accuracy of 93.8%. The compensatory strategy tallying was perfectly differentiated from the non-compensatory strategies. Most of the misclassifications were confusions between take-the-best and Δ-inference, because both strategies search information cue-wise and usually lead to the same decisions. It is perhaps because their decisions and search patterns are so similar that random forest relied heavily on feature \({x}_{26}\), which is effective in differentiating take-the-best from Δ-inference.

Experiment II

In Experiment II, we compare MLSI with MM-ML in terms of strategy identification accuracy. Participants were trained to use take-the-best, WADD, or tallying, and then made a series of decisions using the strategy they were taught. We used the stimuli from Glöckner (2009) to ensure that the strategies make different predictions on the dependent variables, thereby making MM-ML feasible. Δ-inference was not included because all cue values in this experiment were binary, and take-the-best and Δ-inference behave identically in such a task environment.

Participants

Sixty undergraduates were recruited from the participant pool of the psychology department at Syracuse University. They were randomly allocated to one of the three strategy conditions with 20 participants in each. Informed consent was obtained before the experiment. The experimental session took approximately 30 min, and participants were given research credits to fulfill a course research requirement. Three participants in the WADD condition did not finish the experiment. After removing them from the analysis, we were left with data from 57 participants.

Materials

Participants were asked to take the role of a college student interested in purchasing a used car, with the aim of choosing the car that would last longer. Each cue took either a favorable (1) or an unfavorable (0) value, with favorable values associated with greater durability. Each car was described by four cues: mileage, model year, the number of previous owners, and the number of accidents. The cues were ranked by their validities, which were shown to the participants. MM-ML requires that different strategies make different predictions on multiple dependent variables. To meet this requirement, we used stimuli from Glöckner (2009, Table 1) that consist of six stimulus types. The six types were repeated ten times each, resulting in 60 trials for each participant. The order of decision trials was randomized for each participant.

Procedure

The procedure was identical to that in Experiment I. Specifically, participants were trained to apply a certain strategy and made decisions using a computerized information board. Every participant went through a tutorial on how to use one of the strategies and was given five practice trials, with feedback provided on each. Participants needed to make correct decisions on at least 80% of the practice trials to proceed to the decision task; otherwise, they repeated the tutorial from the beginning. After successfully learning the strategy, participants engaged in a 60-trial decision task in which their decision outcomes and mouse movements were recorded. A “correct” decision was defined as a choice in agreement with the strategy they were trained on. Participants were given one point for each correct decision, and their goal was to maximize total points. No monetary reward was given to the participants.

Results

The descriptive statistics shown in Table 5 were in line with our expectations for the strategies. A Bayesian one-way ANOVA provided strong evidence that total decision time and the proportion of information searched differed among the strategies (BFs > 1000).

Table 5 Descriptive statistics for Experiment II

Feature selection

Building on the feature set from Experiment I, we selected the 21 features shown in Table 6. The first two features (\({x}_{1}\), \({x}_{2}\)) are the decision time and the proportion of cues searched. Because there were four cues in this experiment, fewer features were needed for the time spent reading each box (\({x}_{3}\) to \({x}_{10}\)) and for the search order (\({x}_{11}\) to \({x}_{18}\)). Three features (\({x}_{19}\) to \({x}_{21}\)) record whether the final choice was consistent with each strategy. We removed the feature used in Experiment I to differentiate take-the-best from Δ-inference.

Table 6 Feature set for machine learning models in Experiment II

MLSI analysis

We first analyzed the performance of MLSI at the trial-by-trial level. We then compared the performance of MLSI and MM-ML at the individual level.

Strategy identification at the trial-by-trial level

We randomly selected five participants from each strategy group to form the testing set. For the remaining 42 participants, we used ten-fold cross-validation to train the ML models and search for optimal hyperparameters. Each participant made 60 decisions; therefore, there were 2520 training trials and 900 testing trials. We trained the five ML models on the training set and evaluated their performance on the testing set. Table 4 shows the identification accuracy of these ML models on the testing set. MLP performed best, with an identification accuracy of 91.8%. The data, the Python code for the analysis, and the confusion matrices can be found in the Supplementary Materials.

We report the results of random forest in more detail, because random forest is easier to interpret in terms of feature importance, and its identification accuracy of 91.3% is very close to that of MLP. We plot the trial-by-trial identification results of random forest in Fig. 8. It shows that random forest was best at discriminating take-the-best from tallying and WADD, yielding perfect identification accuracy for take-the-best. The identification accuracy for WADD and tallying was 85.6% and 88.3%, respectively. Because the search patterns of WADD and tallying are similar, the majority of the misclassifications were between these two strategies. The identification accuracy also differed among participants. For example, the trained random forest model misclassified 26 of participant 9’s trials but classified all trials correctly for six participants (participants 8, 11, 12, 13, 14, and 15).

Fig. 8

Results of strategy identification for test participants in Experiment II. Strategies were identified by random forest, the best-performing machine learning algorithm. The five participants at the top were trained to use take-the-best (TTB), the middle five weighted-additive (WADD), and the bottom five tallying. The colored boxes indicate strategies predicted by random forest for each participant in each decision trial, with blue for TTB, purple for WADD, and green for tallying. The two columns on the right show the overall strategy identified by the machine learning strategy identification (MLSI) method and the multiple-measure maximum likelihood (MM-ML) method, respectively, for each participant

Feature importance

Figure 9 shows the Gini importance of the 21 features. The time needed to read the bottom boxes of the first alternative (box 4, \({x}_{6}\)) and the second alternative (box 8, \({x}_{10}\)), and the total decision time (\({x}_{1}\)) were important in differentiating take-the-best, tallying, and WADD. A likely explanation is that WADD participants generally spent more time at the bottom boxes, because they would need to take some time to integrate cue values and calculate the overall score of an alternative. The search order features (\({x}_{11}\dots {x}_{18}\)) were also important, because they indicate whether a participant used cue-wise or alternative-wise search.Footnote 3

Fig. 9

The Gini importance of each feature in Experiment II. Gini importance measures how much random forest relies on a particular feature in strategy identification. The sum of Gini importance is 1

Classification by participants

We aggregated the trial-by-trial classification results for each participant to classify that participant as being best described by one of the three strategies. For MLSI, each participant was assigned the strategy under which the majority of their trials were classified. Figure 8 shows that random forest classified all fifteen test participants correctly at the individual level.
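A minimal sketch of this majority-vote aggregation, using hypothetical participant labels:

```python
# Aggregate trial-level strategy labels to an individual-level classification
# by majority vote. `trial_labels` maps each participant to the list of
# strategies identified for that participant's trials.
from collections import Counter

def classify_participants(trial_labels):
    return {p: Counter(labels).most_common(1)[0][0]
            for p, labels in trial_labels.items()}

# e.g., a participant whose trials were mostly identified as WADD
print(classify_participants({"p9": ["WADD"] * 34 + ["tallying"] * 26}))
```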

MM-ML classification

We used the R code provided by Jekel et al. (2010) to conduct the MM-ML analysis. Using a combination of decision outcomes, decision times, and confidence judgments, MM-ML estimates the likelihood that a participant used a particular strategy. We ran MM-ML with participants’ choice outcomes and decision times. With these two types of data as inputs, MM-ML classified 86.7% of the participants correctly. Figure 8 shows that of the 15 test participants, two tallying participants (i.e., participants 3 and 5) were misclassified as using take-the-best. MM-ML never misclassified a take-the-best participant as using WADD, and vice versa. MM-ML’s classification performance might have been even better had we also collected confidence ratings.

Discussion

We have shown that MLSI can identify the strategies participants used on each trial with a high level of accuracy. The most frequently misclassified trials were between WADD and tallying, because both are compensatory strategies and produce similar search patterns. The identification accuracy varied among participants; for example, it was relatively low for participant 9. A plausible reason is that we trained a model on data from a group of participants and applied that model to the idiosyncratic behavior of new participants. A future study could investigate whether a participant’s identification accuracy improves when the model is trained on that participant’s own data. At the individual level, MLSI classified every test participant correctly, and MM-ML accurately classified 86.7% of the test participants. That said, each approach has its own strengths and weaknesses. MM-ML classifies participants at the individual level using carefully designed stimuli and does not need training data, whereas MLSI can classify at both the trial and the individual level for a broad range of stimuli but does require training data.

Experiment III

In Experiments I and II, we trained the ML models on data from participants who were taught to use specific decision strategies and evaluated the performance of MLSI on test participants taught the same strategies. In this experiment, we assess how well MLSI can classify participants who have not been trained on any strategy. When applying MLSI to untrained participants, it is not possible to assess its identification accuracy directly, because there are no strategy labels for these participants. Instead, we evaluate identification performance by having participants make decisions under conditions that favor some strategies over others. Studies have shown that decision makers adapt to their environment by adopting strategies that offer higher rewards (Lieder & Griffiths, 2017; Rieskamp & Otto, 2006). Building on this finding, participants in this experiment were rewarded for responses that were in accordance with take-the-best. With this design, we examine to what extent MLSI detects the expected shifts in participants’ strategies over the course of the experiment. Moreover, we constructed the stimuli for this experiment from a used car website to make the task more ecologically valid.

Participants

Undergraduate students at Syracuse University participated in the experiment to fulfill a course requirement. The experiment took approximately 60 min, and the study was approved by the IRB office of Syracuse University.

Procedure and materials

We extracted information on 16 cars from cars.com and constructed 120 comparison pairs from these cars. Each car was described by five cues: mileage, model year, miles per gallon (MPG), engine size, and the average review of the dealership. Participants were instructed to take the role of a college student interested in purchasing a used car and to choose the more durable car in each pair. In addition to the cue values, they were also given the cue validities, whose values reflected how predictive the cues were in the data set. The order of decision trials was randomized for each participant. Participants received feedback each time they made a decision and earned a point when their decision was consistent with the decision take-the-best would make.

Results

Participants learned from the feedback. For the first 20 trials, they received 69.7% of the available points, and their performance improved throughout the experiment such that by the last 20 trials, they earned 90.1% of the available points.

Our analysis of the participants’ strategies relied on the random forest model developed for Experiment I. The model’s parameters were kept the same as in Experiment I. We ignored, however, whether the choices the participants made were consistent with a certain strategy (i.e., features \({x}_{23}\), \({x}_{24}\), and \({x}_{25}\); see Table 3). The reason is that the predictions of tallying depend on how participants dichotomize the continuous values, and the predictions of Δ-inference depend on how participants set the Δ’s for each cue; without knowing how participants dichotomized cues or set the Δ’s, we do not know which choices tallying and Δ-inference would make. Thus, we ignored the choices participants made by setting the three outcome features to 0 and relying on the other 23 features described in Table 3 for strategy identification.

The trained random forest model identified strategies on a trial-by-trial basis. As shown in Fig. 10, participants used all three strategies at the beginning of the experiment. Δ-inference was used most frequently, but some participants (e.g., 20 to 25) started using take-the-best consistently after a dozen or so trials. Only three participants (i.e., 5, 7 and 16) used tallying more than three times, and none used tallying after around trial 60. The computational difficulty associated with weighting or tallying continuous cue values (e.g., Payne et al., 1993) could be the reason why there were so few tallying trials.

Fig. 10

Results of strategy identification for participants in Experiment III. The colored boxes indicate strategies identified for each participant in each decision trial, with blue for take-the-best (TTB), orange for Δ-inference (Delta), and green for tallying. The participants are sorted from top to bottom, in descending order, by the proportion of trials in which they are identified as using take-the-best

The majority of the participants switched strategies multiple times. Most consistently used take-the-best in the latter part of the experiment but differed in when they made the switch. A couple of participants (i.e., 1 and 2) used Δ-inference throughout the experiment. A plausible explanation for the continued use of Δ-inference is that reinforcing decisions consistent with take-the-best, to a large extent, also reinforces decisions consistent with Δ-inference. Overall, the results of this experiment are in line with findings from previous studies (e.g., Rieskamp & Otto, 2006), showing that decision makers tend to try out a variety of strategies and examine more cues in a new task environment before settling on a particular strategy that is adaptive for that environment.

General discussion

We have demonstrated that MLSI can effectively identify strategies on a trial-by-trial basis. MLSI achieves this level of identification fidelity by relying on well-established ML (machine learning) algorithms to integrate multiple types of behavioral data. After first providing an overview of the main findings and contributions, we discuss some limitations and future directions.

Overview of main findings

The effectiveness of MLSI is demonstrated in three experiments. Because supervised ML algorithms require labeled data, participants were instructed to use specific strategies. After the participants’ behavioral data were processed into a set of features, the ML algorithms learned the relations between the features and the strategy labels. The trained ML models then classified test data that they had not been trained on. In Experiment I, which included the challenging task of distinguishing take-the-best, Δ-inference, and tallying, all the ML models performed well, led by random forest with an accuracy rate of 93.8%.

The goal of Experiment II was to compare MLSI with MM-ML, which we used as a benchmark method. The two methods classified whether participants were using take-the-best, tallying, or WADD when choosing between stimuli designed to work well with MM-ML. At the individual level, both methods achieved high levels of accuracy in identifying participants’ strategies. With respect to identification accuracy at the trial level, only MLSI was considered, because MM-ML was not designed to classify strategies at that level. MLP performed best, but the performance of the other ML models, especially random forest, was not far behind.

In Experiment III, participants received feedback according to the decisions that take-the-best would have made. We analyzed the data in this experiment with a random forest model trained on the data from Experiment I. The MLSI analysis shows that participants adapted to the task environment according to the feedback they received. Participants started by trying a variety of strategies, but most of them eventually converged to the non-compensatory strategies take-the-best and Δ-inference, which in this environment often lead to the same decisions.

With MLSI, we are able to identify strategies on a trial-by-trial basis. Some strategy identification approaches require the strong assumption that decision makers use one strategy throughout a sequence of decision tasks (e.g., Bröder, 2003). With a rich set of features that reflect the characteristics of different strategies on a single trial, and with ML algorithms’ ability to learn the relationship between these features and the strategies, we can identify strategies in a single decision trial based on a decision maker’s search behavior and choice. We expect that the ability to identify strategies on a trial-by-trial basis will help researchers investigate factors that influence strategy selection, such as cue correlations and information search costs, and lead to a better understanding of how decision makers adapt their strategies to the characteristics of task environments.

A classification method that exploits multiple sources of behavioral data

The trained ML models discriminate between different decision strategies based on mouse-tracking and choice outcome data. MLSI can accommodate a wide array of features and makes no distributional assumptions about the features. For example, the features can be quantitative, ordinal, categorical, or Boolean. They need not be statistically independent, and can come from any distribution. Many of the features used in the current study were derived from the mouse coordinates in information search. These features were bundled into vectors representing participants’ search patterns (e.g., the order in which they opened the boxes and how long the boxes were opened), along with participants’ choices and other characteristics of the task. Moreover, the features we selected were tailored to discriminate between possible strategies. Because the selection of features drives classification results, the process of transforming raw data into informative features is critical to the development of any ML application. As Locklin (2014) puts it, “Much of the success of machine learning is actually success in engineering features that a learner can understand.” The process of selecting good features is so important that there is a growing machine learning literature that is dedicated to feature engineering (e.g., Heaton, 2016; Zheng & Casari, 2018). Can discriminating features be automatically discovered based on the raw behavioral data alone? Although a full exploration of this question is beyond the scope of this paper, we next report some exploratory findings that show the promise of automatic feature learning.

Automatic feature learning

Feature learning algorithms help classification when useful features cannot be identified easily. Examples of such cases are image classification, object detection, and speech recognition (e.g., Deng & Li, 2013; Loussaief & Abdelkrim, 2016; Rawat & Wang, 2017). These tasks are complex and generally involve large amounts of unstructured data. For example, in a speech recognition task, the input is a time series of hundreds of phoneme segments; similarly, the mouse data in the current study are a time series of mouse coordinates. Automatic feature learning methods can discover hidden patterns in such data without human intervention, combine them, and build efficient classification rules. Some patterns that are not captured in the feature sets we constructed may thus be found by automatic feature learning methods.

Two widely used feature extraction methods are principal component analysis and linear discriminant analysis. Both extract useful information from high-dimensional data and are also used for dimensionality reduction. We applied principal component analysis to the mouse coordinate data, ignoring the choice outcomes for demonstration purposes. The extracted principal components then served as features for training the ML models. The models trained on these components classified the test trials from Experiment I as resulting from either an alternative-wise search strategy (i.e., tallying) or a cue-wise strategy (i.e., take-the-best or Δ-inference). Based only on features extracted from the mouse coordinates, a multilayer perceptron model was correct on 86.6% of the trials, demonstrating the potential of automatic feature learning to support strategy identification. One direction for future work is a systematic investigation of feature representation, encoding more raw data (e.g., choice outcomes and cue values) for automatic feature learning.
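
A pipeline of this kind can be sketched with off-the-shelf scikit-learn components. In the sketch below, the data are random placeholders standing in for resampled mouse trajectories, and the numbers of trials, components, and hidden units are illustrative assumptions rather than the settings used in our analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Placeholder data: 500 trials, each trajectory resampled and flattened to
# 200 time points x 2 coordinates = 400 values per trial.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 400))
y = rng.integers(0, 2, size=500)   # 0 = alternative-wise, 1 = cue-wise search

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    PCA(n_components=20),                     # principal components become features
    MLPClassifier(hidden_layer_sizes=(50,),   # multilayer perceptron classifier
                  max_iter=1000, random_state=0),
)
clf.fit(X_train, y_train)
print("trial-level accuracy:", clf.score(X_test, y_test))
```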

Deep neural networks (i.e., deep learning) have gained popularity partly due to their remarkable ability to learn and discover features (Zhong et al., 2016). The networks consist of multiple layers of interconnected nodes, with activation functions connecting the input layer to the output layer through intermediate hidden layers. These hidden layers compose lower-level feature representations into more complex ones. For example, the output of a hidden layer that is activated by edges and lines in an image may feed into subsequent layers that identify faces. To apply deep neural networks to Experiment I, we fed the raw mouse coordinate data to a network, which encoded the learned features in the weights of its hidden layers. When classifying whether a decision was the result of a compensatory or a non-compensatory strategy, the network’s identification accuracy was 86.4%.
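
A minimal sketch of such a network, written here in PyTorch, is shown below. The placeholder data, layer sizes, and training settings are assumptions for illustration, not the architecture we actually trained.

```python
import torch
from torch import nn

# Placeholder tensors standing in for raw, fixed-length mouse-coordinate vectors.
torch.manual_seed(0)
X = torch.randn(500, 400)            # 500 trials x 400 resampled coordinate values
y = torch.randint(0, 2, (500,))      # 0 = compensatory, 1 = non-compensatory

model = nn.Sequential(
    nn.Linear(400, 128), nn.ReLU(),  # lower-level feature representations
    nn.Linear(128, 32), nn.ReLU(),   # composed into higher-level ones
    nn.Linear(32, 2),                # two strategy classes
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy: {accuracy:.3f}")
```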

Enriching the feature sets

Another approach to improve identification accuracy is to enrich the feature sets. We currently use mouse data to derive our process measures, but the types of useful data could be expanded. Many studies have shown that eye-tracking measures are helpful in identifying the cognitive processes underlying decision making (e.g., Krol & Krol, 2017; Schulte-Mecklenbeck et al., 2017a, b). In addition to scan paths, fixations and gazes can also be informative in showing how decision makers process information. The development of webcam eye-tracking techniques (e.g., Papoutsaki et al., 2016) promises to make eye tracking a ubiquitous technology that will help researchers collect large amounts of visual attention data cheaply and rapidly. These data should facilitate the training of ML models.

Neurological data are another source of process data. Machine learning techniques have been applied to fMRI data to classify patients with schizophrenia (Bleich-Cohen et al., 2014) and major depression (Sato et al., 2015), among other conditions (e.g., Pereira et al., 2009). Incorporating such data into ML models has the potential to further improve the models’ performance. Studies have identified correlations between neural activation patterns and the strategies people use (Dimov et al., 2019; Fechner et al., 2016; Khader et al., 2016). For example, Volz et al. (2006) measured brain activity with fMRI while participants did and did not use the recognition heuristic, and found increased activation within the anterior frontomedian cortex (aFMC) when the heuristic was used. These findings indicate that brain activations could serve as features to facilitate ML-based strategy identification.

Limitations and future directions

One future line of work is to increase the number of candidate strategies. We assumed that participants could use four strategies—take-the-best, Δ-inference, tallying, and WADD—that are often considered in studies of simple heuristics. Still, there are more strategies that people may use, such as the extensions of take-the-best (Heck et al., 2017) and WADD (Hilbig & Moshagen, 2014). With more strategies being tested, it is critical to construct a rich set of features to describe and distinguish these strategies.

We only considered strategies that are not too difficult for participants to learn and execute. Some strategies are more cognitively demanding, and participants may find them challenging to learn and use. For example, in Experiment III, an environment with continuous cues, it could be difficult for participants to apply WADD, as its execution may exceed the working memory capacity of most participants (Fechner et al., 2018). As a consequence, labeled training data for some strategies in some environments may be hard to come by, hindering strategy identification. On the other hand, strategies that are difficult to learn are probably used less frequently than other strategies (Marewski & Schooler, 2011).

In addition, decision makers in real-world environments may not use exactly the same strategies that participants were taught in our study. For example, people using a particular strategy may deviate from its idealized search pattern; take-the-best users, for instance, may sometimes look up more cues even after finding a discriminating cue. In this case, the search pattern for take-the-best could mimic that of Δ-inference, adding ambiguity to strategy identification. The ML models can tolerate, and perhaps even benefit from, a certain amount of noise and error in the training data, but too much could hinder a model’s ability to learn the right parameters. The question is to what extent the ML models can handle such variability. The identification accuracy in practice may not be as high as what we have achieved here. That said, even if we cannot make fine-grained distinctions among all the strategies, our results suggest that MLSI can reliably identify whether people are using a compensatory or a non-compensatory strategy.

Our analyses do not include a guessing strategy, where “each alternative is chosen with probability one-half on each trial, independent of the cue information” (Lee & Gluck, 2020). A guessing strategy defined across trials is incompatible with MLSI, because MLSI is applied on a trial-by-trial basis. Here we propose a working definition of guessing that could be used to classify single trials as guesses: a decision is classified as a guess when the probabilities MLSI assigns to all candidate strategies fall below a threshold. For example, suppose that for a certain trial MLSI assigned probabilities of 34%, 33%, and 33% to take-the-best, Δ-inference, and tallying, respectively. Without a guessing threshold, the current version of MLSI would identify take-the-best as the strategy used on this trial, even though the assigned probabilities of all the strategies are nearly the same. With a threshold (e.g., 50%), the model would instead classify this trial as a guess. This working definition of guessing, like the definition adopted by Lee and Gluck (2020), leaves open the question of what processes actually underlie decisions that are classified as “guesses.”
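
A minimal sketch of how such a threshold could be layered on top of a fitted classifier is shown below. It assumes a scikit-learn model with a predict_proba method (e.g., a random forest); the function name and the 50% threshold are illustrative.

```python
import numpy as np

def classify_with_guess_threshold(clf, X_trials, strategy_names, threshold=0.5):
    """Return the most probable strategy for each trial, or 'guess' when no
    strategy's predicted probability reaches the threshold. strategy_names must
    follow the order of clf.classes_; the 0.5 default is the illustrative value
    from the text, not a recommendation."""
    probs = clf.predict_proba(X_trials)                           # (n_trials, n_strategies)
    labels = np.array(strategy_names, dtype=object)[probs.argmax(axis=1)]
    labels[probs.max(axis=1) < threshold] = "guess"               # below-threshold trials
    return labels
```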

Lastly, the core methodologies in MLSI can be applied to study cognitive processes beyond multi-attribute choice, such as risky choice, cognitive development, function learning, and perceptual categorization. In each domain, either simulated or real behavioral data generated under candidate theories or models can be used to train ML models, whose predictions then indicate which theory or model provides a better account of participants’ data. In risky choice, for instance, the method could be adapted to test whether an expected-value-based model, such as prospect theory (Tversky & Kahneman, 1992), or the lexicographic priority heuristic (Brandstätter et al., 2006) better describes how people process and integrate information about probabilities and consequences, and which pieces of information matter most in determining choice outcomes.

Outlook

Our results demonstrate that off-the-shelf ML models are up to the task of strategy identification. Currently, we favor random forest, because the Gini importance analyses make its results more interpretable than those of the other models we investigated and show how we might further tune the feature sets. Our preliminary explorations with a few off-the-shelf automatic feature construction algorithms suggest that researchers may not even need to develop the initial feature sets. Relying too heavily on automatic feature construction, however, runs the risk of making the ML models uninterpretable. Rahwan et al. (2019) argue that the behavior of ML algorithms can be difficult to analyze formally; they therefore call for investigating ML models with experimental and behavioral methods, much as these methods are used to study the behavior of humans and other animals. Going forward, careful experimentation will be key to improving the MLSI paradigm and generalizing it to other cognitive domains. Our supplementary materials include examples of how to run the key analyses presented in this paper; they should help researchers examine whether the MLSI approach could work for their data and extend the approach to address other questions related to strategy identification and discovery.
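
As an illustration of the interpretability analysis mentioned above, the sketch below reads Gini importances from a fitted scikit-learn random forest. The data, feature names, and hyperparameters are placeholders, not those used in our experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data and feature names; a real analysis would use the
# trial-level feature vectors and strategy labels described in this paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)
feature_names = ["n_boxes_opened", "prop_cuewise_transitions",
                 "prop_altwise_transitions", "mean_open_duration", "choice"]

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")   # Gini importance of each feature
```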