To improve performance in complex tasks, experts develop and use task-specific strategies (Schunn et al., 2005). A strategy is a sequence of actions performed to solve a problem or accomplish a task. Strategy use and changes in strategy use have been examined in areas such as skill acquisition, problem-solving, and decision-making. However, one impediment to investigations of strategy development is that many of the tasks used to examine strategies are either too simple or too complex. On one end of the spectrum are simple tasks that permit only a few strategies and lack the potential to investigate how people explore large spaces of strategies. On the other end are tasks with sufficient complexity to support a larger space of possible strategies, but in which it is difficult to identify the strategies that people are using and when they are using them. Here we provide a method for identifying strategies that uses machine learning classifiers to analyze behavior in a task and determine the key features guiding the strategy-driven choices that people make.

The general theoretical framework guiding this work on identifying strategies is that people develop strategies by selecting features from the task environment to drive their decisions about what to do next (Lovett & Schunn, 1999). A task that presents several features that can be used in combination permits a large space of strategies to be developed and evaluated. Strategy choice is then based on the success, or utility, of that strategy as applied to similar problems in the past (Anderson et al., 2004; Lovett & Anderson, 1996; Lovett & Schunn, 1999). If the success rates of strategies are low enough, then the task may be re-represented to include additional features to compose new strategies. In simple tasks, there are few features to choose from, so selecting additional features can be straightforward. However, in more complex tasks, the act of identifying additional features and forming strategies based on those features is a significant problem that has received little attention.

Searching through the space of strategies can be considered a problem-solving activity where the search for a new strategy operates in a secondary problem space separate from the original task. These kinds of dual space searches have been proposed to account for rule induction and scientific reasoning (Klahr & Dunbar, 1988; Simon & Lea, 1974). In a recent category induction study, Prezenski et al. (2017) presented evidence that the search for a category rule was systematic in that participants appeared to use a heuristic to generate simpler one-feature rules before more complex multi-feature rules. However, the space of possible category rules in this research was relatively small.

In prior research, the most common approach to examining strategies is to identify a critical task performance measure that can discriminate between two previously identified strategies that, in some cases, have been explicitly taught to participants. For example, in an isomorph of Luchins’ water jug problems called the Building Sticks Task, the first move characterizes which of two strategies the participant is using (Lovett & Anderson, 1996; Schunn et al., 2001). In a study of Space Fortress strategy adaptation, researchers first taught and trained one flight control strategy before modifying the environment and examining the impact on the flight control strategy (Moon et al., 2013). In this case, the proportion of the time spent in a particular region of the screen determined if participants continued to use the original strategy or adopted a modified strategy.

Another approach to examine strategies is to generate data from cognitive models performing a task using various strategies and examining how human data match these models (Chen et al., 2015; Zhang & Hornof, 2014). However, developing cognitive models can require a great deal of time and task-specific knowledge. These approaches are therefore costly and are not likely to generalize well to other tasks without building a new model for each possible task strategy. Further, if a participant uses a novel strategy not implemented within the model, then this procedure cannot identify the participant’s strategy.

A related line of work on the strategies that people use in decision-making tasks has led to multiple algorithms for tracking strategy use in these tasks. Many of these decision-making strategies focus on simplifying the decision, using strategies referred to as heuristics (Gigerenzer & Gaissmaier, 2011). For example, a take-the-best heuristic would only examine the most valid, or predictive, feature and ignore the rest. Several techniques have been developed that can analyze a set of decisions among a pair of alternatives and determine the heuristic being used (Hilbig & Moshagen, 2014; Lee, 2016; Lee & Newell, 2011). Most recently, a Bayesian approach has been put forward that uses multiple sources of information to identify the decision heuristic being used and can further identify heuristic changes (Lee et al., 2019).

However, there are two characteristics that differentiate these decision-making heuristic approaches from the one we describe in this paper. First, the stimuli in these decision-making studies are carefully designed to discriminate between decision-making heuristics based on the choice made on a pair of stimuli in a two-alternative choice task (e.g., Walsh & Gluck, 2016). The machine-learning approach described in this paper is tested on data that occurs naturally in a complex task in which participants select from several stimuli on any trial, and there is no guarantee it is possible to discriminate between strategies based on the choice a participant makes on any given trial because multiple strategies could yield the same choice. Second, the decision-making research focuses on identifying which of a small set of well-known heuristics such as take-the-best, weighted-additive, or tally is being used by a participant. These heuristics are focused on how participants make use of available features in a decision but not on which features are used. For example, these methods would identify the participant as using the take-the-best heuristic but are not concerned with which feature is the best one being used because the stimuli have been designed in such a way that the researcher knows which cue is the most predictive cue.

Here we describe a method that determines which combination of the many task features is incorporated in a participant’s strategy while simultaneously identifying whether higher or lower values of a feature are preferred in the strategy. For example, instead of reporting that the participant is using a weighted-additive heuristic, our method reports that the participant used feature1 as the primary feature and feature2 as a secondary feature. Furthermore, increasing values on feature1 may make an option more likely to be selected, while increasing values on feature2 may make an option less likely to be selected; this direction of influence is referred to here as the valence of the feature. Depending on the task, determining the actual features used and how they affect decisions are both important for understanding how people search for an effective strategy. For these reasons, previous strategy identification methods are not applicable to the problem we address here.

The methods described here produce a list of feature valences ordered by importance. This list of features and their valences does not specify the exact process by which someone combines multiple features to make a decision. In that sense, it is not a perfect description of the strategy being used. However, measuring strategies in this manner defines an abstract space of feature/valence combinations in which many aspects of the similarity of strategies are captured.

Complex tasks that allow for a large space of strategies require a method for tracking the features that people are using in their strategies. This paper presents a method using machine-learning classifiers, along with a more task-specific algorithm, for tracking a participant’s strategy over time as they perform a task. The process begins by training machine-learning classifiers to predict the action that a participant will take; the trained classifier is then analyzed to identify the features used in making that prediction. The predictive features are then assumed to be the same ones that the person was using.

The goal of the present set of studies was to simulate strategies that people might use in a task with a complex space of strategies and then to evaluate the algorithms’ ability to recover those strategies. The first three studies use simulations performing a task to examine whether these methods are successful at recovering the strategy features that the simulations are known to be using. The fourth study examines how these methods perform on human data. Note that only the first three studies permit accuracy values to be computed. We cannot be certain what strategies people are using as they perform the task, but we can examine how well the identified strategy explains the observed behavior in the task.

We used a modified version of the Abstract Decision Making (ADM) task (Joslyn & Hunt, 1998). The original ADM research demonstrated that it predicted performance on air traffic control and emergency dispatch tasks (Joslyn & Hunt, 1998). A modified version of the task has also been used to examine individual differences in a multitasking situation with interruptions (Bai et al., 2014). The variant used in the current research has been modified to increase the space of possible strategies that participants can use to select the next subtask to work on, and this variant is referred to as the strategic ADM (sADM).

In all variants of the ADM task, choices made in the past influence the current set of alternatives to choose from, but random factors influence the evolving task state as well. To ensure that our classifiers were accurately tracking task strategies, an ACT-R model (Anderson et al., 2004) was developed to perform the sADM task. Within the ACT-R model, a range of strategies was implemented. The ACT-R model applies its strategy consistently and therefore provides data with a known strategy for comparison with the output of the strategy-tracking algorithms. The complexity of the model’s strategy can be controlled, and it is also possible to insert a controlled amount of noise into the model’s behavior to examine the performance of the strategy-tracking algorithms in the presence of noise. This approach provides a means to evaluate the algorithms and their ability to identify the underlying strategy that generated the data.

sADM task description

The sADM task comprises two interleaved activities: selecting an object to work on and processing that object by sorting it into a bin based on its attributes. This structure mimics real-world tasks such as emergency dispatch, where there are multiple tasks one could work on and each task requires a set of actions to complete it before moving on to the next task (Joslyn & Hunt, 1998). In the sADM, an object is selected from a queue and processed by querying its attributes one at a time and then sorting it into one of four bins based on those attributes. Additional objects appear in the queue either after an object has been sorted or during the sorting process (i.e., interrupting the flow of the sorting process). Figure 1 summarizes the basic structure of the task.

Fig. 1
figure 1

A conceptual depiction of the sequence of actions in the sADM task. An object is selected for processing from the object queue. Each object has features shown in the queue that affect the performance score. Object attributes must be queried, and the results held in working memory. The object must be sorted into the bin that matches its attributes. Finally, new objects may appear during or after sorting, and the process repeats until the time limit is reached

The sADM task requires participants to sort objects into one of four different bins, depending on the object’s attributes. Prior to beginning work on the task, participants memorize the attributes associated with each of the four bins so they can correctly sort objects. For example, bin 1 might accept only large, yellow triangles, and bin 2 might accept only small, orange octagons. The interface is text-based and controlled via five keys on the keyboard. Each object is identified by an arbitrary CVC name, and the participant must execute a series of keystrokes to query the attributes of the object before sorting. For example, a participant might select the object DAX from the example queue of objects shown in Fig. 2. After selecting DAX, the participant presses keys to query the object’s color (yellow), then its size (large), and finally its shape (triangle). Based on these object attributes, which must be held in working memory, the participant now knows that the object belongs in bin 1 and can execute a series of key presses to sort it. Some objects require querying and sorting on one set of attributes (visual-based attributes) and others require querying and sorting on two sets of attributes (sound-based attributes in addition to visual-based attributes). Objects that require two levels of querying and sorting therefore require about twice as long to process.

Fig. 2
figure 2

Sample queue showing object features available in the sADM task. The last two columns do not appear explicitly in the queue that participants see; they are implicit in each object’s position in the queue

Each task block lasts 6 min, and the goal is to score as many points as possible. In addition to the attributes that must be queried to enable sorting, objects also have a set of performance features that affect the current score. These features include point values, penalties, and deadlines, and they are shown directly in the queue in the task interface, as shown in Fig. 2. Sorting the object correctly awards the participant the point value of the object. Sorting the object into the wrong bin subtracts the object’s point value from the current score. Every object has a deadline and a timer that counts down from that deadline to zero. Every time the timer reaches zero, it resets to the deadline and the penalty value for that object is deducted from the current score. This penalty for elapsed deadlines applies to all objects in the queue, including the object that is being processed by the participant. Therefore, performance depends on utilizing strategies that take into account multiple, dynamically changing factors (e.g., points, deadline, time until the object’s next deadline).

The queue of objects initially starts with 1–3 objects, and each time an object is successfully sorted, 0–3 new objects appear probabilistically. The task adjusts the probability of new objects arriving so that the queue grows to contain approximately 7–12 objects. Objects can also occasionally arrive, as an interruption, while the participant is processing an object. When this interruption occurs, the participant must choose whether to continue processing the current object or switch to the interrupting object.

The best way to maximize performance is to sort as many high-point objects as possible before their deadlines and to prevent high-penalty objects from remaining in the queue and accumulating penalties. By manipulating the distribution of points, penalty values, and deadlines in the object queue, the task can be configured such that distinct strategies are needed to maximize a participant’s total score. Selection strategies used to select objects from the queue are therefore critical to performance. In the queue, the participant can see the object’s name, deadline, point value, penalty value, and how many seconds remain before the deadline elapses again. These object features, along with the position of the object in the queue, are the features that can guide one’s selection strategy. Determining a participant’s selection strategy for the sADM task is the primary focus of the strategy identification algorithms discussed here.

Strategy tracking algorithms

We developed and compared three different algorithms for extracting strategies from the sADM object-selection data. Two algorithms use standard machine-learning classifiers implemented using the scikit-learn library (Pedregosa et al., 2011): a linear support vector machine (SVM) and a decision tree (DT) classifier. A third algorithm, the Every Strategy (ES) algorithm, was implemented to compare these task-general classifiers to one that has more knowledge of the sADM queue structure. A general description of the approach taken with each of these algorithms is presented here, but the full details can be found in the Python code available at https://osf.io/qfxpr/. For a thorough but approachable introduction to machine-learning classifiers and the scikit-learn library, which we recommend for anyone adapting the provided code to different tasks, see Géron (2019).

These selection strategy classification algorithms take as input a list of the objects on each queue presented to a participant, including all the object features, and whether each object was selected from the queue. Each sADM task block yields a list of objects that appeared in the queue for each selection that a participant made and nine features for each object: points, deadline, time until deadline, penalty value, the number of sorting levels required, whether work on the object was interrupted, position in the queue, selection distance, and the queue number. When a participant begins object selection, the middle object in the queue is initially highlighted and they move the selector up or down using key presses before selecting an object. The selection distance is the minimum number of times the selector has to be moved to reach an object.

The queue number indicates which queue the object was on when a selection was made (i.e., first, second, etc.), and it is a feature that is only used to group together all objects that appeared on the same queue. For example, there may have been nine objects present on the queue when a participant made their fourth selection. These nine objects would all have a queue number of four. The category label that the classification algorithms are being trained to predict is a binary value (selected/unselected), and the algorithms are trained to predict whether a given object would be selected or not based on the object’s features. Note that these classifiers do not explicitly identify the strategy being used for object selection. A second step is needed to extract the strategy embodied in each trained classifier.
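To make this input format concrete, the sketch below shows one way the selection data could be organized, assuming a pandas DataFrame with one row per object state; the column names are illustrative placeholders rather than the exact names used in the code available at the OSF link.

```python
import pandas as pd

# Hypothetical layout: one row per object state, grouped by queue_number.
# The 'selected' column is the binary label the classifiers are trained to predict.
data = pd.DataFrame({
    "queue_number":       [4, 4, 4, 5, 5],
    "points":             [300, 150, 525, 410, 95],
    "deadline":           [45, 20, 60, 30, 75],
    "time_to_deadline":   [12, 3, 41, 22, 70],
    "penalty":            [-50, -120, -35, -90, -60],
    "sort_levels":        [1, 2, 1, 2, 1],
    "interrupted":        [0, 0, 0, 1, 0],
    "queue_position":     [1, 2, 3, 1, 2],
    "selection_distance": [2, 1, 0, 3, 2],
    "selected":           [0, 0, 1, 1, 0],
})
```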

After the classifier was trained to predict which objects a participant selected, it was analyzed to determine the features present in that participant’s strategy. A strategy is represented as a series of features ordered by importance and their valence. Valence here means whether higher or lower values on that feature were more likely to be selected (e.g., higher point values were more likely to be selected). For example, a strategy represented as [Points+, Deadline-] means that a higher point value was the most important feature, but the participant also preferred lower deadline values. Valence is considered because across participants or even task blocks within a participant, we have found evidence that the same feature was used with opposite valence.

Machine learning classifiers

A discussion of the details of SVM and DT classifiers is beyond the scope of the current paper, but a general characterization of how the algorithms make a classification decision is presented because of the implications for the types of selection strategies that might best be tracked by each algorithm. The two classifiers use different techniques to learn to categorize instances of selected and unselected objects from the sADM selection data.

Because objects persist from one selection to the next (with the exception of objects selected and correctly sorted), classifiers are given distinct object states that represent the characteristics of each object as it appeared on a specific queue when a selection was made. Even though the same object name may be present in the queues for multiple selections that the participant makes, each of these is a distinct object state because the values of the object’s features relative to the feature values of other objects in a queue may change from selection to selection. Each object state includes information about all features for the object (e.g., points, deadline, time in queue) that define a point in a multidimensional feature space.

Note that the object’s absolute feature values are not very useful in determining whether the object will be selected or not. For example, an object with a point value of 300 might be the lowest point value in one queue or the highest in another. To simplify the learning process, features are contextualized within each queue by scaling all features within a given queue onto a 0 to 1 scale. The object in a queue with the highest point value will have a 1 after scaling, while the lowest will have a value of 0. This scaling is illustrated by the set of queues presented on the left side of Fig. 3. This method of scaling is important because these classifiers treat each object as a separate piece of data to be trained on and classified. They do not make a selection of one object from a queue of objects. Instead, each object from each queue is a separate instance to be labeled as either selected or unselected.
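A minimal sketch of this per-queue scaling is shown below, assuming the DataFrame layout sketched earlier; the code in the repository may implement it differently.

```python
import pandas as pd

FEATURES = ["points", "deadline", "time_to_deadline", "penalty",
            "sort_levels", "interrupted", "queue_position", "selection_distance"]

def scale_within_queue(data: pd.DataFrame) -> pd.DataFrame:
    """Rescale every feature to a 0-1 range within each selection queue."""
    def minmax(col):
        span = col.max() - col.min()
        # A feature with no variation within a queue carries no information
        # for that queue, so map it to 0.
        return (col - col.min()) / span if span > 0 else col * 0.0
    scaled = data.copy()
    scaled[FEATURES] = data.groupby("queue_number")[FEATURES].transform(minmax)
    return scaled
```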

Fig. 3
figure 3

A visual example of applying SVM and DT algorithms to three sample queues. On the left are three queues of objects from three consecutive selections; the object selected on a prior queue is not present on later queues, and one or more new objects appear on each subsequent queue. The queues only include the points and deadline features for the sake of illustration. The final two columns present these two features after scaling all the features in a queue to a 0–1 scale. The two graphs plot these scaled features and illustrate how the SVM and DT algorithms divide the data into selected and unselected regions

The SVM classifier divides up this multidimensional feature space by separating the selected and unselected objects with a hyperplane. The DT classifier builds a hierarchical set of rules to classify objects as selected or unselected (e.g., if the object has the highest point value, then it is selected, if not then if it has the lowest deadline then it is selected, otherwise it is not selected). Therefore, the DT classifier represents a participant’s strategy as a sequence of binary decisions, while the SVM classifier represents the strategy as a hyperplane. The components of the DT or the location of the SVM hyperplane in the feature space can be analyzed to determine the important features of a participant’s strategy. Figure 3 presents an example of this process applied to three queues in which the difference between the results of applying the SVM and DT classifiers can be seen.

Given that each queue has several objects and only one will be selected, there are more unselected than selected objects in the training data. A trivial solution is to classify all objects as ‘unselected.’ This solution accurately classifies all unselected objects and only makes errors on the smaller number of selected objects. To avoid this trivial solution, both the DT and SVM algorithms include mechanisms to balance the contributions of selected and unselected objects to the resulting trained classifier using the class_weight parameter in the scikit-learn library for these classifiers. Conceptually, setting this class_weight parameter to ‘balanced’, as was done here, weights each classification error in inverse proportion to the frequency of the object’s class. Because there are more unselected than selected objects, the classifier is penalized more for classifying an object as unselected if it was selected by the participant than for classifying an object as selected when it was not actually selected. In other words, this weighting avoids the trivial solution of classifying all objects as unselected because the classifier is heavily penalized for classifying selected objects as unselected.
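In scikit-learn this weighting amounts to passing class_weight='balanced' when constructing the classifiers, roughly as below; the other hyperparameter values shown are placeholders, and the exact SVM class used in the repository may differ (a linear-kernel SVC is assumed here).

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 'balanced' weights errors in inverse proportion to class frequency, so the
# rare 'selected' class cannot simply be ignored by the classifier.
dt = DecisionTreeClassifier(class_weight="balanced", min_impurity_decrease=0.05)
svm = SVC(kernel="linear", C=1.0, class_weight="balanced", probability=True)
```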

Following this feature scaling, the data are split into three cross-validation folds such that the classifier is trained on two-thirds of the data and produces a prediction accuracy on the other third. The basic unit of data is the queue number (i.e., a queue of objects). All objects from a given queue are semi-randomly placed into a single cross-validation fold so that the objects from one selection queue are never spread across folds. The process is semi-random because the selection queues from a block are divided up into those that occur in the first, second, and third 2-min portion of the 6-min block. Each of the three cross-validation folds will contain one third of the data from each third of the block. This constraint was included so that any strategy differences that occurred over time in the block would be represented in each cross-validation fold. This process is illustrated with the training and testing folds shown in Fig. 4.
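One way to implement this constrained fold assignment is sketched below; the helper function and its arguments are hypothetical, but the grouping by queue and the balancing across 2-min thirds follow the description above.

```python
import numpy as np

def assign_folds(queue_ids, selection_times, n_folds=3, seed=0):
    """Map each selection queue to a cross-validation fold so that every fold
    contains roughly one third of the queues from each third of the block.

    queue_ids       : identifiers of the queues in the block
    selection_times : time (s) within the 6-min block at which each selection was made
    """
    rng = np.random.default_rng(seed)
    thirds = np.minimum(np.asarray(selection_times) // 120, 2).astype(int)
    fold_of = {}
    for third in range(3):
        queues = [q for q, t in zip(queue_ids, thirds) if t == third]
        rng.shuffle(queues)
        for i, q in enumerate(queues):
            fold_of[q] = i % n_folds  # spread this third's queues across folds
    return fold_of
```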

Fig. 4
figure 4

In this example, a task block containing nine queues worth of object selection data is split into three cross-validation folds, with consecutive queues labeled Q1, Q2, and so on. Note that each queue contains multiple objects. Shading indicates which third of the block the data come from. Three separate classifiers are trained on a training fold and then tested on the testing fold, which contains data that were not used for training the classifier. A range of values for key hyperparameters was searched, and the best-performing classifier, based on a combination of accuracy on the test fold and agreement among the three classifier instances, was selected as the best representation of the strategy for that task block

Traditional machine learning applications of these classifiers have the primary goal of maximizing prediction accuracy on novel data. However, our goal is to identify the decision-making strategy that a participant was using. Therefore, the cross-validation process was not used to maximize accuracy, but it was instead used to tune hyperparameters of the classifiers. These hyperparameters control how complex the classifier is allowed to be, which in these data translates into the number of object features used to classify the data. The more features the classifier uses, the more object features the resulting strategy includes.

A grid search is performed for each hyperparameter, and the weighted average of prediction accuracy and strategy agreement is used to select the best-performing hyperparameters. Strategy agreement is determined by comparing the features extracted from each cross-validation fold, with perfect agreement occurring when all folds yield the same strategy. Again, strategies are represented as an ordered list of features and their valence. The primary purpose of this approach is to allow the complexity of the strategy to vary so long as all three folds yield the same strategy. This approach allows the most complex strategy supported by the data to be extracted.
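The scoring used to compare hyperparameter candidates can be sketched as follows; the agreement measure and the weighting constant shown here are simplified placeholders for whatever is implemented in the repository code.

```python
from collections import Counter

def strategy_agreement(fold_strategies):
    """Proportion of folds whose extracted strategy (an ordered tuple of
    feature/valence pairs) matches the most common strategy across folds."""
    counts = Counter(tuple(s) for s in fold_strategies)
    return counts.most_common(1)[0][1] / len(fold_strategies)

def candidate_score(fold_accuracies, fold_strategies, accuracy_weight=0.5):
    """Weighted average of mean test-fold prediction accuracy and strategy agreement."""
    mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
    return (accuracy_weight * mean_accuracy
            + (1 - accuracy_weight) * strategy_agreement(fold_strategies))
```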

A good representation of a participant’s strategy should lead to a prediction about which object that participant will select from a queue of objects. Because each classifier simply classifies each object in each queue as selected or not selected, the classifier might report that none of the objects were selected or that multiple objects were selected. To test the classifiers’ ability to predict selection from a queue, testing accuracy was calculated by predicting which object would be selected from a queue as opposed to allowing the classifier to individually classify each object as selected or not selected. Both the DT and SVM classes in the scikit-learn library have a method that allows for a probability to be generated instead of a binary classification. For all the objects in a queue, the classifiers rank ordered the objects according to their predicted probability of being selected. A rank order accuracy score was calculated by the formula: (queue_size - rank) / (queue_size - 1). Here queue_size is the number of objects in the queue and rank is the rank order assigned to the object that the participant picked. This rank accuracy score has a maximum value of 1 when the participant’s selection matches the classifier’s top ranked object and has a minimum of 0 when the participant picks the object ranked last by the classifier. The expected value of this rank accuracy score if the classifier assigned ranks randomly would be 0.5.

This rank accuracy score was used instead of a binary accuracy score so that the accuracy score would retain some sensitivity to a participant’s underlying strategy even if the participant did not always pick the optimal object under a strategy. For example, a participant might pick the second highest point value because of an error or because it was close enough to the maximum point value object (i.e., satisficing behavior).
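A sketch of the rank accuracy computation is shown below; it assumes a fitted scikit-learn classifier whose positive class (index 1 in classes_) corresponds to ‘selected’.

```python
import numpy as np

def rank_accuracy(classifier, queue_features, picked_row):
    """Rank-order accuracy for a single queue.

    queue_features : scaled feature matrix, one row per object in the queue
    picked_row     : row index of the object the participant actually selected
    """
    # Predicted probability of being selected, assuming classes_ == [0, 1].
    p_selected = classifier.predict_proba(queue_features)[:, 1]
    # Rank 1 is the object the classifier considers most likely to be selected.
    ranks = np.empty(len(p_selected), dtype=int)
    ranks[np.argsort(-p_selected)] = np.arange(1, len(p_selected) + 1)
    queue_size = len(p_selected)
    return (queue_size - ranks[picked_row]) / (queue_size - 1)
```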

Finally, the prediction accuracy, using the rank accuracy score, was compared to the accuracy expected if the algorithm had selected randomly, and a strategy is only reported if the predictive accuracy is significantly above chance levels at α = .05. This mechanism limits the reporting of a strategy when there is little evidence for one.
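One simple way to implement this check, assuming a one-sample t-test of the per-queue rank accuracy scores against the chance level of 0.5 (the repository code may use a different test), is:

```python
from scipy import stats

def strategy_supported(per_queue_rank_scores, alpha=0.05):
    """Report a strategy only if mean rank accuracy is reliably above chance (0.5)."""
    result = stats.ttest_1samp(per_queue_rank_scores, 0.5, alternative="greater")
    return result.pvalue < alpha
```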

DT details

The DT classifier was trained on each cross-validation fold with all features available for incorporation into the tree. An additional level of the tree can only be added if it would improve the purity of the leaf nodes by at least N (the value of the min_impurity_decrease parameter in the scikit-learn implementation). This parameter essentially controls how much of an improvement in classification accuracy must occur to justify adding an additional decision node to the tree. This parameter was the only hyperparameter of the DT classifier, and its value was determined as described above with a grid search ranging from values of 0.005 to 0.25. Values closer to 0 yield more complex trees. This parameter has a maximum value of 0.5, but a value of 0.25 was used for the upper bound because values above 0.25 were never optimal on any of the data (simulated or human) this approach was applied to.

The resulting DT was analyzed by examining both the importance of the object features it used for classification and the valence of those features. The importance of the features was determined using the existing feature_importances_ attribute of the DT object. The valence is determined by analyzing the node in the tree where the feature appears and checking whether higher or lower values are associated with more selected than unselected objects. If a feature appears in more than one node on the tree, then the valence is marked as ambiguous because determining the overall impact of the feature is much more complex when it appears more than once.
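A rough sketch of this analysis is shown below; it reads the feature importances and inspects each split to infer valence, marking any feature that appears at more than one node as ambiguous. The exact logic in the repository code may differ.

```python
import numpy as np

def dt_strategy(dt, feature_names):
    """Return an importance-ordered list of (feature, valence) pairs from a
    fitted DecisionTreeClassifier."""
    tree = dt.tree_
    valence = {}
    for node in range(tree.node_count):
        f = tree.feature[node]
        if f < 0:                        # leaf node, no split
            continue
        name = feature_names[f]
        if name in valence:
            valence[name] = "ambiguous"  # feature used at more than one node
            continue
        left, right = tree.children_left[node], tree.children_right[node]
        # tree.value holds the class totals [unselected, selected] per node;
        # compare the selected share below vs. above the split threshold.
        share = lambda n: tree.value[n][0][1] / tree.value[n][0].sum()
        valence[name] = "+" if share(right) > share(left) else "-"
    importances = dt.feature_importances_
    order = np.argsort(-importances)
    return [(feature_names[i], valence[feature_names[i]])
            for i in order if importances[i] > 0]
```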

SVM details

Because the SVM classifier does not have the same type of complexity-based hyperparameter as the DT, a recursive feature elimination (RFE) approach was taken. In RFE, all features are first used to train the SVM on the training set of a cross-validation fold, then the least important feature is dropped, and the process continues until there is only one feature remaining. At each iteration of the RFE process, the initial training data is again split using a three-fold cross-validation process so that a testing accuracy measure can be returned for each iteration of RFE. The classifier that is returned from this process is the one with the highest testing accuracy. For example, if the RFE process found that two features (e.g., points and deadline) led to the highest test set accuracy, then the SVM using these two features is the resulting classifier.

One problem with this RFE process is that accuracy will increase for each feature that was part of the participant’s strategy, but as additional features are added, accuracy will plateau (not decrease). To obtain a pattern in accuracies with a clear peak, a complexity penalty was added to the testing method by subtracting a constant from the accuracy score for each feature added to the model. This complexity penalty parameter was used in a grid search with a range of 0.01 to 0.10. SVMs also have a C parameter that balances the width of the margin around the hyperplane against the misclassification rate. This parameter was also included in the grid search with a range of .01 to 100. As described earlier, the goal of this grid search was to identify the classifier that produced the highest weighted average of predicted accuracy on the test set and agreement of strategy features across the cross-validation folds.
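A simplified version of this RFE loop, using plain classification accuracy in place of the rank accuracy score and placeholder values for the penalty and C, could look like the following.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def rfe_with_penalty(X, y, feature_names, penalty=0.03, C=1.0):
    """Recursive feature elimination with a per-feature complexity penalty."""
    remaining = list(range(len(feature_names)))
    best_score, best_features = -np.inf, None
    while remaining:
        svm = SVC(kernel="linear", C=C, class_weight="balanced")
        accuracy = cross_val_score(svm, X[:, remaining], y, cv=3).mean()
        score = accuracy - penalty * len(remaining)  # penalize larger feature sets
        if score > best_score:
            best_score, best_features = score, list(remaining)
        # Drop the feature with the smallest absolute hyperplane coefficient.
        svm.fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(svm.coef_[0])))
        remaining.pop(weakest)
    return [feature_names[i] for i in best_features]
```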

The resulting SVM was analyzed to determine the important strategy features. The absolute magnitudes of the coefficients defining the hyperplane were used to rank order the importance of the features, and the signs of the coefficients were used to determine the valence of the features. As a two-dimensional example, in Fig. 3, the slope of the separating line provides information on the relative weighting of the points and deadline features in the strategy captured by that SVM classifier.
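In code, this coefficient analysis reduces to a few lines, assuming a fitted linear SVM in which class 1 corresponds to ‘selected’.

```python
import numpy as np

def svm_strategy(svm, feature_names):
    """Order features by absolute hyperplane coefficient; the sign gives valence."""
    coefs = svm.coef_[0]
    order = np.argsort(-np.abs(coefs))
    return [(feature_names[i], "+" if coefs[i] > 0 else "-") for i in order]
```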

ES algorithm

The DT and SVM classifiers can only be trained to classify whether a given object will be selected; testing them therefore requires using predicted probabilities to pick an object from a queue, as described earlier. The ES algorithm was developed as an alternative approach that learns to select one object from a queue of objects and therefore has more task specificity built into it.

Even though the sADM task provides the ability to explore a large strategy space, it is possible that people use fairly simple strategies. The ES algorithm uses a brute force method to determine selection strategy, but it limits the potential combinatorial problem by only considering strategies with at most two features and weighting each feature equally. For every object selection decision, the ES algorithm ranks all objects in the queue based on every possible strategy that includes one or two features. If a strategy contains two features, then the algorithm sums the rankings for the individual features to create a final ranking. The algorithm then selects the object with the highest rank. If two or more objects are tied for the highest rank, then the algorithm selects one of them randomly. As with the other classifiers, the test accuracy for each strategy is then calculated by comparing the objects selected by the algorithm with the objects selected by the participant.
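The core of the ES ranking procedure can be sketched as follows; the array layout and helper names are illustrative rather than taken from the repository code.

```python
import itertools
import numpy as np

def es_predict(queue_features, strategy, rng):
    """Predict the selected object on one queue under a one- or two-feature strategy.

    queue_features : array with one row per object and one column per feature
    strategy       : list of (column, valence) pairs, e.g. [(POINTS, +1), (DEADLINE, -1)]
    """
    total_rank = np.zeros(len(queue_features))
    for column, valence in strategy:
        values = valence * queue_features[:, column]
        # Higher rank means more preferred under this feature and valence.
        total_rank += np.argsort(np.argsort(values))
    best = np.flatnonzero(total_rank == total_rank.max())
    return int(rng.choice(best))        # ties are broken randomly

def every_strategy(n_features):
    """Enumerate all one- and two-feature strategies with either valence."""
    singles = [[(f, v)] for f in range(n_features) for v in (+1, -1)]
    pairs = [[(f1, v1), (f2, v2)]
             for f1, f2 in itertools.combinations(range(n_features), 2)
             for v1 in (+1, -1) for v2 in (+1, -1)]
    return singles + pairs
```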

The same cross-validation approach used with the machine-learning classifiers is used with the ES algorithm. When determining which strategy was used by the participant, the ES algorithm reports which strategy had the highest rank order accuracy for the training set. This algorithm was included to see if it performed better with smaller amounts of data in some of the simulations in which noise was added.

Three parameter recovery studies were conducted to compare the algorithms. The first study examines both simple and more complex strategies using multiple features. The second study examines the performance of the algorithms when noise is added to the selection process, and the third study examines the possibility of tracking strategy changes as they happen during performance of a task.

Study 1: Evaluation of ability to detect simple and more complex strategies

This first set of simulations and analyses was conducted to assess the ability of each of the three algorithms to detect the selection strategy used by different simulations. A set of simulations was implemented using a range of strategies including single-feature strategies, a strategy that used a feature non-linearly, and two different methods of combining two features in a selection strategy. Two primary questions are addressed with these simulations. First, do all the algorithms accurately identify the strategy? Second, how much data is necessary to identify strategies in this range of complexity? This second question is relevant to establishing a lower bound on how quickly a strategy could be identified if the goal was to track which strategies participants were using while performing the task. An additional issue explored in this study is the influence of the hyperparameter values on the accuracy of the algorithms.

Method

For all the simulations, the objects that appear in the sADM task were set so that their feature values were drawn from uniform distributions, with points ranging from 50 to 550, deadlines from 20 to 75 s, and penalties from −150 to −35. In addition, half of the objects required one level of sorting and half required two levels. All other features of an object depend on these initial features and the dynamics of how the task unfolds as the simulated participant interacts with it. For example, time until deadline is a dynamic feature that depends on the deadline and how long the object has been in the queue.

Five single-feature variants, based on the same fundamental ACT-R model, were developed. Four of these variants always picked the highest-ranked object based on a single feature: highest points, lowest deadline, lowest time until deadline, and lowest penalty value. The fifth single-feature strategy picked the first object highlighted by the interface when going to select an object (i.e., the object in the middle of the queue).

Besides these single-feature strategies, three more complex strategy models were also implemented. First, a model that used the time until deadline feature non-linearly was implemented. If the time until deadline was 8 s or less, then lower values were preferred, but if the time until deadline was greater than 8 s then higher values were preferred. This strategy allows for objects that are nearest to their deadline to be sorted, but if the object has plenty of time until its deadline, then longer times would be preferred to allow time for handling interrupting objects if they occur during sorting. While the simulated participant always ignored interrupting objects for simplicity of implementation, human participants will often switch to interrupting objects before resuming sorting the original object. This strategy was included as a potentially effective strategy that used a feature in a nonlinear manner.

Two other complex strategies were implemented that combined the points and deadline features in either a weighted or thresholded manner. These combination strategies were selected based on an analysis of heuristics reported in the multi-attribute decision-making literature (e.g., weighted additive, take the best) (Gigerenzer & Gaissmaier, 2011). A set of weighted combination strategies was used that produces a weighted sum of the point and deadline values. Accounting for the distribution of point and deadline values noted above, three different weighted strategies were implemented. The first weighted points and deadline equally and is referred to as the deadline=points weighted strategy. The second weighted points more than deadline, in an approximately 60/40 weighting, and is referred to as the points+deadline weighted strategy; the third reversed this to a 40/60 weighting and is referred to as the deadline+points weighted strategy. A range of weightings was explored to ensure that the classifiers were sensitive to a range and not one specific weighting. More extreme weightings that greatly prefer one feature over another (e.g., 80/20 points over deadline) are not likely to be distinguishable from a single-feature points strategy because of the limited number of selections in which a single-feature strategy would lead to a different selection than a strategy heavily weighted toward the points feature.

A set of threshold combination strategies was also implemented similarly to the weighted strategies where the deadline=points thresholded strategy has roughly equal contributions of both features, the deadline+points thresholded strategy has a greater contribution of the deadline feature, and the points+deadline thresholded strategy has a greater contribution of the points feature. The deadline=points strategy selected the highest point value if there was an object over 300 points (300 is the mean point value from the uniform distribution of 50–550), and if there were no objects above that threshold, then the lowest deadline object was selected. The deadline+points strategy selected the highest point value if there was an object over 250 points, and if there were no objects above that threshold, then the lowest deadline object was selected. The points+deadline strategy selected the highest point value if there was an object over 400 points, and if there were no objects above that threshold, then the lowest deadline object was selected. Just as in the weighted strategies, there are some thresholded strategies that could be indistinguishable from a single feature strategy. For example, a strategy that selected based on points as long as there was an object worth more than 100 points on the queue would almost always select the object with the highest point value. Given that each object has a value ranging from 50 to 550 points, the probability that all objects in the queue fall in the 50-to-100-point range is small.
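To make the two combination rules concrete, the sketches below approximate the selection logic in plain Python; the actual strategies were implemented inside the ACT-R model, and the within-queue normalization used in the weighted rule here is an assumption.

```python
import numpy as np

def weighted_choice(points, deadlines, points_weight=0.6):
    """Weighted combination: prefer higher points and lower deadlines,
    combining the two after scaling each to 0-1 within the queue."""
    p = (points - points.min()) / max(np.ptp(points), 1e-9)
    d = (deadlines - deadlines.min()) / max(np.ptp(deadlines), 1e-9)
    return int(np.argmax(points_weight * p + (1 - points_weight) * (1 - d)))

def threshold_choice(points, deadlines, point_threshold=300):
    """Thresholded combination: take the highest-point object if any object
    exceeds the threshold, otherwise take the object with the lowest deadline."""
    if points.max() > point_threshold:
        return int(np.argmax(points))
    return int(np.argmin(deadlines))
```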

These variants were examined to ensure that the strategy identification techniques could correctly recognize distinct strategies at different levels of complexity. All strategies were simulated for 60 6-min blocks of the sADM task. Each block was an independent set of data to detect the selection strategy being used by the model.

Results

The classifiers were assessed on two primary measures. First, as described earlier, a feature importance analysis can be done on each classifier to determine the features and their valence (i.e., whether higher or lower values of a feature are preferred). A simple binary scoring measure was used to assess the accuracy of the classifiers in determining the actual features used in the strategy. If the classifier feature analysis returned only the correct features with their correct valences, then the strategy was scored as correctly identified; otherwise, it was not. The proportion of blocks for which the correct strategy was identified was the primary measure used to examine the classifiers’ ability to determine the correct strategy and is referred to as strategy feature accuracy.

A second measure was the mean prediction accuracy of the classifier on the test set during the three-fold cross-validation process used to train the algorithms. While the strategy feature accuracy measure described in the last paragraph focuses on whether the underlying features present in the strategy are identified, this prediction accuracy measure assesses whether the object that was selected from the queue of objects can be accurately predicted by the classifier. For all the objects in a queue in the test set, the classifiers rank ordered the objects according to their predicted probability of being selected. This rank order score, defined earlier, was the measure of predictive accuracy and is the only measure available when applying the classifiers to human data because their strategy is not known.

The mean prediction accuracy and proportion of time the correct strategy was identified for each strategy for each algorithm are shown in Table 1. For the single-feature strategies, all three classifiers performed at ceiling. Prediction accuracy was occasionally below 100% because there were infrequent ties in the data (e.g., two objects have the same point value). For the two-feature strategies, the DT performed the best for the strategy that involved combining the two features with a threshold (e.g., if there is an object over 400 points, then use points; otherwise use deadline). The SVM performed better than the DT for the weighted combinations of features. Finally, the DT performed best for the nonlinear points strategy where extreme high and low point values were preferred over mid-range point values. Based on these results and the underlying compatibility in the implementation of the DT and SVM classifiers, the outputs of these classifiers can be combined and the classifier with the highest prediction accuracy can be chosen to determine the strategy. The last column of the table shows strategy feature accuracy when the outputs of these two classifiers are combined. This combined algorithm can sometimes yield feature accuracy values above or below the performance of the DT or SVM algorithms alone because, for each block, the algorithm with the highest predictive accuracy is selected rather than the algorithm with the highest feature recovery accuracy; only the predictive accuracy would be available for human data. For example, the combined classifier generally does better at detecting both features of the weighted strategies than the SVM or DT alone.

Table 1 Mean rank accuracy and accuracy at recovering the strategy features for each algorithm

The proportion of time the correct strategy was identified and predictive accuracy both focus on assessing the performance of the classifiers for the different strategies tested. A secondary aim of this set of simulations was an assessment of how much data would be needed to accurately identify the strategy a hypothetical participant was using in the task. The approach used here was to run the classifiers only on the first N object selections of each block of the task. The total number of selections in a block of data ranged from 36 to 39; therefore, N was varied from 15 to 35 in increments of five, and whether the correct strategy features were identified was examined for each value of N.

For the single-feature strategies, all classifiers extracted the correct features 95–100% of the time with only the first 15 selections and all reached 100% with 20 selections. For the more complex strategies, Table 2 shows the proportion of time the correct strategy was identified for each amount of data for each classifier. The nonlinear strategy is not shown in Table 2 for the ES or SVM because they never identified the correct strategy. From Table 2, it appears that all classifiers benefit from increasing amounts of data on the more complex strategies, with performance leveling off around 30–35 selections.

Table 2 Strategy feature accuracy by number of selections included in the data

The final analysis examined the sensitivity of the DT and SVM classifiers to changes in hyperparameters. In particular, the algorithms use a grid search over a space of hyperparameters to find the values that maximize prediction accuracy and agreement on the strategy features across cross-validation folds. Because the eventual goal is to examine human data with these algorithms and strategy feature accuracy is not available on human data from the task, this analysis focused on how prediction accuracy was affected by the hyperparameter values. Strategy feature accuracy is also shown to demonstrate that these two measures are often correlated. For the DT classifier, the minimum impurity hyperparameter value is plotted against accuracy for the simple strategies in Fig. 5 and the more complex strategies in Fig. 6. There appears to be an optimal range for this hyperparameter between 0.025 and 0.125. The SVM had two hyperparameters (C and penalty). The SVM accuracy results are plotted for these hyperparameter values in Fig. 7 for the simple strategies and Fig. 8 for the complex strategies. These figures show that some hyperparameter values do not perform well, but there is a large space of hyperparameters that performs well for both classifiers.

Fig. 5
figure 5

Mean prediction accuracy for the DT hyperparameter for the single-feature strategies

Fig. 6
figure 6

Mean prediction accuracy for the DT hyperparameter for the more complex strategies

Fig. 7
figure 7

Mean prediction accuracy for the SVM hyperparameters for the single-feature strategies

Fig. 8
figure 8

Mean prediction accuracy for the SVM hyperparameters for the more complex strategies

Discussion

For simple, monotonic, single-feature strategies, all the classifiers performed well, requiring minimal data and showing little sensitivity to hyperparameter values. However, significant differences emerged when slightly more complex strategies were examined. Two methods of combining multiple features were examined: threshold and weighted. For the threshold strategy, the DT classifier performed the best, and for the weighted strategy, the SVM classifier performed the best. This result is consistent with how the two underlying data structures represent the classification process.

The DT creates a tree of if-then rules to correctly classify an sADM object as selected or not, while the SVM separates selected and unselected objects with a hyperplane. The hyperplane in a linear SVM in two dimensions is a line that is equivalent to a weighting between two features. This finding supports the expectation that the match between the strategy and the type of classifier is important. The more similar the classifier’s representation of the strategy is to the strategy in the simulation, the better the classifier captures the strategy.

The nonlinear strategy was also designed to be something that a linear SVM or the ES strategy should not be able to capture, and the only classifier that performed well on this strategy was the DT. It is possible that a nonlinear SVM might do well on this strategy, but then the problem becomes analyzing the SVM to provide a human-readable representation of the strategy.

Given these results, it should be possible to take a multi-classifier approach to strategy identification. For example, the DT and SVM classifiers could both be used on a set of data, and then the cross-validated prediction accuracy could determine which of the two classifiers should be analyzed to identify the strategy. In the data examined in this study, the combination of the DT and SVM classifiers performed well on all strategies examined.

One limitation of this study is that the ACT-R model was perfectly consistent in its adherence to a given strategy. However, people are likely to deviate from this level of consistency, introducing noise into the data. The second study addresses this possibility by examining two different types of noise that can be introduced into the data to determine how sensitive the classifiers are to noise.

Study 2: Strategy detection in the presence of noise

This second set of simulations examines how well the strategy detection algorithms handle noise in the selection data. People may not always select the object that perfect execution of a selection strategy would demand. Lapses of attention or pressing the wrong key will lead to selections that deviate from the ideal selection strategy. Two different types of noise were simulated to examine the performance of the algorithms in less-than-ideal circumstances.

First, random noise was added to a single-feature strategy at varying levels by having the model select according to the strategy on some selections and randomly on others. Levels of randomness were varied from 10 to 90% of selections being random. In addition, random selections were added to the more complex two-feature selection strategies with levels of randomness ranging from 10 to 50% of selections. A second type of noise was also examined in which the model simulated a form of satisficing. For example, under a points strategy, instead of always selecting the highest point value object, the satisficing version of the strategy simply picked an object that had a sufficiently high point value (e.g., within 50 points of the highest point value in the queue).

Method

All aspects of the simulation method were identical to Study 1 except for introducing noise into the simulated selection strategies. Random noise was added to the points selection strategy used in Study 1 at varying levels by manipulating the utility of two productions controlling the choice between the points strategy and a strategy that randomly selected an object. The result was that it was possible to control how often the model selected randomly or selected based on the points strategy. Simulations were run with random selections being made 10, 30, 50, 70, or 90% of the time. In addition, a model that used only the random selection strategy was run to determine whether the classifiers would correctly report that no strategy was being used.

Random noise was also added to the deadline=points weighted and deadline+points threshold strategies from Study 1. For these two strategies, selections were random 10, 30, or 50% of the time. Higher levels of noise were not examined with these more complex strategies because it was expected that even moderate amounts of noise would make the two-feature strategies difficult to identify. In addition, given the lack of large differences between different weightings or thresholds in Study 1, only these two examples of the complex strategies were used in this study.

Finally, a satisficing version of the points strategy was used in which the model selected an object within 50 points of the highest object in the queue. For example, if the queue contained objects worth 100, 200, 325, and 350 points, then this satisficing strategy would have selected either of the latter two objects with equal likelihood. A similar satisficing strategy was also implemented for the deadline=points weighted and deadline+points threshold strategies. For the thresholded combination, the satisficing object was within 50 points if points was used or 5 s if deadline was used. For the weighted combination, the satisficing object was within 10 of the weighted combination of the features.
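The satisficing selection rule itself is simple; as a rough Python illustration (the actual implementation is part of the ACT-R model):

```python
import numpy as np

def satisficing_points_choice(points, tolerance=50, rng=None):
    """Pick, with equal probability, any object whose point value is within
    `tolerance` points of the best object currently in the queue."""
    rng = rng or np.random.default_rng()
    good_enough = np.flatnonzero(points >= points.max() - tolerance)
    return int(rng.choice(good_enough))
```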

Results

The accuracies for detecting the correct features for the simple and complex strategies with noise are shown in the left half of Fig. 9. The predictive accuracies using the rank accuracy metric are shown in Fig. 10. For the simple strategy, most of the classifiers performed well until the noise level exceeded 70%. However, they did not fare as well on the more complex strategies with noise. Performance fell into the 60–70% range even with only 10% random noise. The ES and SVM performed the best on the weighted strategy, while the DT performed better on the threshold combination strategy. When the classifiers did not identify both features correctly, the majority of the time (> 75%) they identified one of the two features. The combination of the DT and SVM classifiers performed as well as either of the DT or SVM alone on all strategies.

Fig. 9
figure 9

Strategy feature accuracy with varying levels of random noise. For the threshold strategy, the DT and combined lines are identical

Fig. 10
figure 10

Predictive rank accuracy with varying levels of random noise

The lower accuracies might be due to needing more data to detect the strategy in the presence of noise. To test this possibility, the data were aggregated into 30 pairs of blocks instead of 60 single blocks, and the classifiers were run on these pairs of blocks. The results in the right half of Figs. 9 and 10 show higher accuracies for pairs of blocks, which is consistent with the idea that the classifiers could still detect the more complex strategies with some random noise given additional data.

While random noise interfered with strategy detection, the classifiers did about the same with satisficing strategies as they did with the smallest amount of random noise. These results are presented in Fig. 11, and the predictive accuracies using the rank accuracy metric are shown in Fig. 12. Again, the right half of the figures show that with more data, the classifiers perform at higher levels, approaching performance on the complex strategies without noise.

Fig. 11
figure 11

Mean strategy feature accuracy with satisficing strategies

Fig. 12
figure 12

Mean predictive rank accuracy with satisficing strategies

On the random strategy, all classifiers correctly reported that there was no apparent strategy between 95 and 100% of the time. This range is to be expected because a strategy is only reported when there is evidence that prediction accuracy is above chance at p < .05.

An analysis of the influence of the hyperparameters was also carried out, as in Study 1. The DT accuracy results were similar, showing the best accuracies for values of the minimum impurity decrease parameter between .025 and .125. The SVM results were also similar. For the simple strategy, the penalty term decreased accuracy at low values of C but increased accuracy at high values of C. For the complex strategies, accuracy was above zero only at values of C of 10 or less, with higher accuracy at lower values of the penalty.
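
For readers reproducing this kind of sweep, the search can be expressed in a scikit-learn style as sketched below. The parameter values reflect the ranges discussed in the text, but the estimator settings and data variables (X, y) are placeholders rather than the released implementation.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

dt_search = GridSearchCV(
    DecisionTreeClassifier(),
    {"min_impurity_decrease": [0.025, 0.05, 0.075, 0.1, 0.125]},
    cv=3)
svm_search = GridSearchCV(
    SVC(kernel="linear"),
    {"C": [0.1, 1.0, 10.0]},
    cv=3)
# dt_search.fit(X, y); svm_search.fit(X, y)
# where X holds one row of scaled feature values per queue object and
# y indicates whether that object was selected.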

Discussion

For a single-feature strategy, most of the classifiers performed well even with up to 70% of the selections being made randomly. The more complex two-feature strategies were more difficult to identify even with relatively low levels of random noise. However, a more realistic form of “noise” in human selection data may arise from satisficing. In the satisficing models, a “good enough” object consistent with the strategy can be selected even though it may not have been the optimal object. This kind of behavior might arise in people from taking a satisficing approach or by making an error in picking the optimal object. Under these satisficing strategies, the combined DT/SVM classifier always identified the correct single-feature strategy and did so about 70% of the time with two-feature strategies.

The classifiers all benefitted from having more selection data when identifying the two-feature strategies under the presence of noise. However, this was only true for strategies that the classifiers can represent. As noted before, the SVM and ES classifiers cannot represent the threshold version of the two-feature strategy well, so providing additional selection data did not help in these cases.

Another finding was that a similar range of hyperparameters yielded the best-performing classifiers in both this study and the first study. This result has the practical effect that a smaller range of parameters could be examined in the grid search, and this reduction would lead to reduced computation time in running the classifiers. This computational consideration would be most important when the classifiers were used to identify strategies in real time during performance of the task. For example, it would be possible to identify a participant’s strategy and modify the task or intervene in some other way, depending on the purpose of the study.

Based on the results presented so far, it is possible to formulate some general recommendations for use of this method. The combined DT/SVM classifier generally performs as well as or better than any of the other approaches. As can be seen in Figs. 9 and 11, there are a few instances where a single classifier does perform better, but the combined classifier is always within 5% of the maximum strategy feature accuracy in these cases. Therefore, this combined classifier, as described here and documented in the available code, is the recommended approach. Another recommendation is to apply the method to tasks with at least 30–40 decision trials. If strategies using two or more features in the presence of significant selection noise or satisficing are anticipated, then increasing the number of decision trials will likely yield better results (e.g., using pairs of blocks as in the results shown in Figs. 9 and 11).

Cutoffs should be established based on the classification accuracy reported by the classifier so that a researcher can be reasonably confident that participant strategies have been identified accurately. Figure 13 shows a comparison between the classification accuracy of the combined DT/SVM classifier and the proportion of time all features for a tested strategy were correctly identified. This figure includes all strategies tested in Studies 1 and 2. The classifier is highly accurate without noise, and a cutoff of 0.80 accuracy results in at least 67% of the blocks having all features correctly identified. Below this cutoff, it is more likely that the classifier only identified some of the features in the strategy.

Fig. 13. Proportion of blocks where the combined DT/SVM classifier correctly identified all features in a strategy by the classifier's cross-validated accuracy. When applying the classifier to human data, only the cross-validated accuracy is known. The dashed line corresponds to a recommended threshold for estimating when the strategy has been correctly identified

This approach to examining strategies based on task data will be successful to the degree that the strategies people are using are simple and consistent enough to be captured by the amount of data available in the task. One form of inconsistency arises because people might change their strategy within a block of data. The next study examines the possibility of running a modified version of the SVM and DT classifiers over the time course of one block in which one or more strategy switches take place.

Study 3: Strategy detection over time

This set of simulations examined modified versions of the SVM and DT classifiers that continuously predict future selections as new selections occur over time. This allows a determination of whether the classifiers can detect a change of selection strategy within a block of the sADM task. Another purpose is to explore the possibility of running these strategy identification algorithms concurrently with the task in order to allow for interventions in the task based on the detected strategy. For example, it might be useful to prompt a participant that the strategy being used is not a high-performing strategy or that a strategy change just made will lead to worse performance.

The primary issues investigated in this study are the complexity of the strategy that can be identified and the frequency of strategy switches that can be tracked. As shown in the prior studies, detection of a single-feature strategy is relatively easy and requires a minimal amount of data. However, a two-feature strategy requires more data, and it will therefore be more difficult to track more complex strategies if frequent strategy switches occur. In this study, we examined 1–3 strategy switches during a block of the sADM task with one- and two-feature strategies to address how strategy complexity and switching frequency impact the ability to identify a strategy.

Method

The DT and SVM classifiers were modified to identify strategies based on selections that occurred over time. For every selection within a block, the classifiers would attempt to identify the strategy from prior selections within the block and would then predict the next few selections. Instead of the three-fold cross-validation process used in the first two studies, the training fold was composed of past selections and the testing fold was composed of the next five selections. Five testing selections were used as a compromise between being able to detect strategy shifts quickly and being able to provide some limited range of prediction accuracy values to be used in identifying the best-performing set of features.

As an example of this tradeoff, consider the point in a block after 20 selections have occurred, with a strategy change occurring on the 21st selection. With a testing window of five samples at the 20th selection, the first 15 selections can be used to train the classifier and the next five for testing. These selections all used the same strategy, so the classifier would have a very good chance of detecting a single-feature strategy with 15 selections to train on. The classifier's performance is assessed on selections 16–20 and performs well, so a stable strategy is reported. However, starting on the 21st selection, a strategy change has occurred. The classifier now has 16 training selections and five testing instances (one of which will not be predicted well by the prior strategy). Fast-forwarding to the 25th selection, the classifier now has 20 training selections with the old strategy and five testing selections with the new strategy. Prediction accuracy will be very low, and a strategy change could be identified because of the decrease in accuracy. With a shorter testing window (e.g., three selections), the strategy change could have been identified earlier. However, the classifier might often perform reasonably well on three selections just by chance, making it more difficult to have confidence in the strategy reported by the classifier. Conversely, with a testing window of ten selections, it will take longer to identify a strategy switch, and a switch occurring at the end of a block might not be detected at all. For this reason, a testing window of five selections was used as a reasonable tradeoff between these considerations.
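
A minimal sketch of this rolling train/test scheme is given below, assuming per-selection feature rows X and selection labels y for one block and any scikit-learn-style classifier factory. For simplicity it trains on all prior selections, whereas the actual procedure also searches over a bounded training window, as described next.

def rolling_predictions(X, y, make_classifier, test_window=5, min_train=5):
    # At each selection, train on the prior selections in the block and
    # score the classifier on the next `test_window` selections.
    scores = []
    for t in range(min_train, len(y) - test_window):
        clf = make_classifier()
        clf.fit(X[:t], y[:t])  # past selections
        scores.append(clf.score(X[t:t + test_window], y[t:t + test_window]))  # next five
    return scores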

As in the earlier studies, a grid search was used to identify the best-performing classifier parameters. The search included the classifier-specific hyperparameters (e.g., minimum impurity decrease for the DT) and the training window size (5, 10, 15, 20, 25, and 30 past selections). Larger training windows increase the training data, which should increase the odds of identifying the correct strategy if the strategy did not change. However, after a strategy change, a smaller training window should perform better because it is less likely to contain selections from two strategies. The same ranges of hyperparameters were used as in the prior two studies. All of the details can be found in the code for these simulations on the OSF site at https://osf.io/qfxpr/.

In Studies 1 and 2, the hyperparameters were selected based on a weighted average of strategy agreement across cross-validation folds and prediction accuracy. However, without multiple folds, strategy agreement across folds cannot be computed, so the classifiers instead maximized test set accuracy minus a 5% complexity penalty for every feature included after the first. This complexity penalty helped to prevent overfitting to the training data by ensuring that any feature added to the strategy had to lead to an increase in prediction accuracy of at least 5%.
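
The penalized selection of features can be illustrated as follows. This is a sketch only: evaluate is a hypothetical callable that fits a classifier restricted to the given features on the training window and returns its accuracy on the next five selections; the released code organizes this search differently.

from itertools import combinations

def select_strategy_features(features, evaluate, penalty=0.05):
    # Maximize test-set accuracy minus a 5% penalty per feature beyond the first,
    # so each added feature must buy at least a 5% gain in prediction accuracy.
    best_score, best_subset = float("-inf"), None
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = evaluate(subset) - penalty * (len(subset) - 1)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score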

The strategies implemented in the simulations included one, two, or three strategy switches at equally spaced intervals, all between single-feature strategies. For example, the points-deadline switch model switched, at the midpoint of the block, from a points-driven selection strategy to one where the object with the lowest deadline is selected. A points-deadline-penalty strategy was examined for the two-switch case, and a points-deadline-points-deadline strategy was examined for the three-switch case. In addition, another model was implemented that started with a points+deadline weighted strategy and switched to a points strategy. This final model was used to examine whether a two-feature strategy could be identified with only a small set of training instances prior to a strategy switch. Based on the results of Study 1, it was expected that this would be challenging to do with little training data. Each of the strategies was simulated for 60 blocks of sADM task performance, as in the prior studies. Given that the ES did not perform better than the SVM or DT classifiers, it was not modified to identify strategy changes over time.
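
The switching models can be thought of as piecewise strategies, as in the sketch below. The function and strategy names are hypothetical; the actual models were implemented within the ACT-R simulation described earlier.

def switching_strategy(queue, selection_index, switch_points, strategies):
    # Apply whichever strategy is active at this point in the block;
    # switch_points are the selection indices at which the next strategy takes over.
    segment = sum(selection_index >= s for s in switch_points)
    return strategies[segment](queue)

# e.g., a single mid-block switch from a points strategy to a deadline strategy:
# switching_strategy(queue, t, switch_points=[midpoint],
#                    strategies=[select_highest_points, select_lowest_deadline])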

Results

To investigate the frequency of strategy switches that could be identified, data from three different strategy switching frequencies were simulated. These strategies were always simple single-feature strategies. The principal outcome measure was whether the feature was correctly identified at each selection within a block. The top left panel of Fig. 14 shows that all classifiers did well at identifying a single strategy switch by correctly picking up on the correct strategy feature both before and after the switch. As in the prior studies, a combination of the DT and SVM classifiers was also tested by selecting which classifier had the highest prediction accuracy, and this combined classifier also performed well.

Fig. 14. Strategy feature accuracy for three different strategy switching frequencies and a complex-to-simple strategy switch

As shown in the top right panel of Fig. 14, all classifiers also performed well with two strategy switches. Finally, with three strategy switches in one block, a strategy change is being made after every nine selections. The bottom left panel of Fig. 14 shows that the classifiers perform best on the initial strategy, but then accuracy drops off to 50% or below on the second and third strategies. Here the SVM classifier did better than the DT, including identifying the final strategy shift. There are barely enough selections after the third strategy switch to detect the fourth strategy.

From the results of Study 1, it was expected that the classifiers would struggle to identify a two-feature strategy. The bottom right panel of Fig. 14 shows that the classifiers identified the two-feature strategy about 50–75% of the time. In the middle of the block, this model switched to a one-feature strategy, at which point the classifiers both did well. The SVM did better than the DT at detecting the complex strategy. The complex strategy was a weighted combination of the points and deadline features, which is consistent with the performance of the SVM on this strategy in Studies 1 and 2.

The final aspect of the classifiers examined was whether the grid search over possible training windows showed the expected pattern: using the maximum allowable window when the strategy had been stable for a period of time and then dropping to a smaller training window after a strategy change. This adaptive training window should allow the classifier to avoid contaminating the training data with multiple strategies because it selects the window size that maximizes prediction of upcoming selections. Figure 15 shows the training window size selected by the classifiers for each of the strategy switch scenarios whose strategy feature accuracy is shown in Fig. 14. Following a strategy switch, the training window does drop and then increases as the new strategy is used consistently for subsequent selections. The pattern is less clear in the bottom right panel, where the classifiers struggle to identify the more complex two-feature strategy initially. In particular, the SVM seemed to benefit from a longer training window after switching from the complex weighted strategy to the single-feature strategy. This may be because the single feature was a component of the weighted strategy and was therefore already captured by the SVM's underlying representation.

Fig. 15. Optimal training window for the different strategy switching simulations

Discussion

The results show that it is possible to use this strategy identification approach to dynamically identify strategy shifts. However, there are limits on the complexity of the strategy that can be identified with frequent strategy switches that consequently reduce the amount of relevant data for training. If people switch strategies too frequently or use more complicated selection strategies even at low levels of strategy switching, then this approach will probably not yield good data on strategies.

Therefore, it is important to consider the nature of the task when deciding whether this approach to strategy identification will yield good results. This approach will perform best when the ratio of training data to strategy switches is relatively high (e.g., above 30 selections for every strategy switch in the sADM task). Alternatively, this approach works well if strategies are constrained to be based on a single feature, in which case the ratio of training data to switches can be lower (e.g., 10 in the sADM). The sADM task is a good candidate for this approach because a high degree of consistent strategy use over time is required to evaluate the effectiveness of one's strategy. Because the choice to select one object means other objects are not selected and potentially accrue a penalty, the full impact of a strategy is only known in the task over the next few minutes. Also, time-pressured decision-making, as in the sADM, should limit the complexity of the decision-making strategy used (Oh et al., 2016). To the degree that other tasks share these properties, this approach should also be successful at identifying strategy switches in those tasks.

This approach to examining strategy switches can be used to illuminate strategies beyond classifying full blocks of trials as in Studies 1 and 2. In some blocks where the classifier reported only a partial strategy or no strategy for the entire block, it could be that participants used multiple strategies. To examine this possibility, the combined SVM/DT algorithm can be used. This algorithm is trained and makes predictions about selections over time so that it is possible to determine strategy changes if they do not occur too frequently. As in the top row of Fig. 14 for the simulated data, it is possible to identify periods of performance where the classifier returns a consistent strategy before shifting to a different one. A criterion of four or more consecutive time points with the same strategy is recommended as an indication that a stable strategy had been identified. This criterion is recommended to reduce the chance that the identification of many strategy shifts would simply be caused by noise or lack of variability in the attributes of the objects on the queue during a set of selections. For those blocks identified as having more than one strategy and a single-block classifier accuracy lower than 0.8, the strategies identified by the time-based classifier could be used provided they yield a higher classifier accuracy. Complex strategies with multiple features could show up as strategy switches using the time-based classifier, and therefore the single-block classifier would be better at identifying those. However, if a participant was really using multiple features in a single strategy over the entire block, then the non-time-based classifier would be expected to perform well. This approach for combining the single-block and time-based classifiers was adopted in the examination of human data in Study 4.
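
The run-length criterion can be implemented simply, as in the sketch below (an illustration of the criterion, not the released code): it scans the sequence of strategies reported at each time point and keeps only runs of at least four consecutive identical reports.

from itertools import groupby

def stable_strategy_runs(reported_strategies, min_run=4):
    # Return (strategy, start_index, run_length) for every run of at least
    # `min_run` consecutive time points with the same reported strategy.
    runs, start = [], 0
    for strategy, group in groupby(reported_strategies):
        length = len(list(group))
        if length >= min_run:
            runs.append((strategy, start, length))
        start += length
    return runs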

Study 4: Examining performance on human data

The prior three studies used simulations so that the strategy being used was known. However, the intent is to apply these algorithms to identify strategies from human data. Given the success of the algorithms in these parameter recovery studies, if the algorithms also show high predictive accuracy on human data, then they are likely accurately identifying the strategies that participants are using. This study illustrates the utility of these algorithms on human data.

A subset of data from a multi-session study with the sADM task was examined using the combined DT/SVM classifier and the time-based version of this same classifier. Because the strategies that participants were using are not known, a criterion for performance of the algorithm had to be selected, as discussed earlier. In this study, the value recommended in Fig. 13 was used (i.e., a classifier accuracy of 0.8). Predictive accuracy values above this level are likely to capture participants' full strategies, whereas values below it likely capture only a portion of the task features participants are using.

Method

Participants

There were 64 participants (aged 18–33 years, mean = 21.0 years, 44 females) who were students or staff from Mississippi State University and were paid for participation. The research was approved by the Mississippi State University Institutional Review Board.

Procedure

Participants completed two sessions of the sADM task over 2 days. During the first session, participants completed a task tutorial followed by a practice block of the task in which they had to successfully sort three objects in order to proceed. They then completed eight blocks of the sADM task. In the second session conducted 2–5 days later, there was a brief set of instructions reminding them about how the task worked followed by 12 blocks of the sADM task. The primary focus of the larger study was on how participants’ strategies changed with practice on the task, including how they responded to changes in the nature of the task. The data reported here include the first eight blocks from the second session, each lasting 345 s. The final four blocks of this second session (i.e., blocks 9–12 of this session) included manipulations that promoted strategy shifts within a block, so they were not used to evaluate the performance of the algorithms.

Results

The combined DT/SVM classifier was used for this study because of its higher accuracy relative to the other classifiers in the prior studies. There were 512 blocks of sADM performance data, and a histogram of predictive accuracy is shown in Fig. 16. As a reminder, this classifier determines a single strategy for the entire block. Sixty-seven percent of the blocks had a strategy with predictive accuracy above 0.8, indicating a good likelihood of determining the strategy used by the participant in that block. For 8% of blocks, no strategy could be determined (i.e., the classifier reported that predictive accuracy was not significantly different from chance). For the remaining 25% of the blocks, the strategy determined by the classifier may only capture a portion of the features used by the participant. In addition, the mean number of features identified for each strategy was 1.29 (SD = 0.61).

Fig. 16. Distribution of cross-validation predictive accuracy for the combined DT/SVM classifier on 512 blocks of human sADM task performance. A value of .5 (chance performance for the rank order metric used) is assigned for blocks for which no strategy could be determined

The approach for combining time-based and single-block classifiers described earlier was used to examine whether blocks for which the single-block classifier reported low accuracy might be blocks in which participants switched strategies. The mean number of strategies identified per block by the time-based classifier was 1.41 (SD = 0.82). The proportion of blocks identified as having a single strategy was 56%, which is somewhat lower than the 67% of blocks estimated to have a complete strategy identified by the single-block classifier. For the blocks in which more than one strategy was identified, the predictive accuracy of the time-based classifier was compared to that of the single-block classifier. As shown in Fig. 17, for these multi-strategy blocks, the time-based classifier that allows for multiple strategies generally had higher predictive accuracy as the predictive accuracy of the single-block classifier decreased below the 0.8 threshold.

Fig. 17. A comparison of predictive accuracy for the time-based and non-time-based classifier for blocks where the time-based classifier indicated multiple strategies were used. A line with a slope of 1 is included so that points falling above the line indicate better performance by the time-based classifier

For those blocks identified as having more than one strategy and a single-block classifier accuracy lower than 0.8, the predictive accuracy of the single-block classifier was replaced with that of the time-based classifier. This approach was taken because complex strategies with multiple features could show up as strategy switches using the time-based classifier. However, if a participant was really using multiple features in a single strategy over the entire block, then the single-block classifier would be expected to perform well. With that change, the percent of blocks likely to have the full strategy identified increased to 75% (compared to 67% assuming a single strategy), with another 18% likely to have a partially identified strategy (compared to 25%), and 7% having no strategy identified (compared to 8%). Being able to identify some strategy changes therefore produces a modest improvement in the ability to identify the strategies that participants are using.

It may be that many of the blocks with lower predictive accuracies come from a subset of participants because of an inconsistent use of the strategy, more satisficing behavior, or frequent shifting between strategies. Figure 18 shows that most of the blocks for which no strategy could be identified come from a small set of participants. Another subset of participants has a complete strategy identified in all blocks, and about two thirds of the participants have a complete strategy identified on 75% or more of the blocks.

Fig. 18. An examination of how many blocks from each participant had a complete, partial, or no strategy identified by the DT/SVM classifiers. Each vertical bar is one participant, and they are ordered from left to right by increasing mean predictive accuracy

The sADM is designed such that employing a consistent strategy allows one to evaluate the performance of that strategy. Because of the time it takes to accumulate information about how well a strategy minimizes the penalties that accrue, it should be difficult to select a high-performing strategy if one switches between strategies too frequently. Therefore, consistent strategy use is good both for task performance and for the classifiers being used to identify strategies. If this line of reasoning is true, then there should be a positive correlation between the predictive accuracy of the classifiers and performance in the sADM. This correlation was .32 (p = .01), supporting the hypothesis that some portion of the participants on the left half of Fig. 18 may be inconsistent strategy users (or frequent strategy shifters) who perform less well on the sADM.

Discussion

The analysis of the human data in this study indicates that the algorithms that were developed are likely to be useful in examining human strategy use in tasks such as the sADM. The fact that the sADM encourages consistent strategy use for good task performance is certainly a factor in considering the other types of tasks that it would be possible to examine with this approach. However, having some measure of strategy use determined by task behavior is a prerequisite to being able to examine a number of interesting questions about strategy development, selection, and adaptation in complex tasks.

General discussion

The results presented in these four studies show that using machine learning classifiers to identify the strategies that participants are using is promising. The first study showed that one- and two-feature strategies could be examined within a complex task with high accuracy using the amount of data that a participant would generate in 6 min of task performance. The second study showed that the classifiers handled random noise and satisficing behavior well when given twice as much task performance data. The third study showed that it would be possible to track strategy shifts across time using the same approach. The key condition necessary for this approach to work is having enough task performance data to identify strategies at the expected level of complexity. If participants are routinely using more than two features in their strategy, then additional task data would be required. We did not test more complex strategies in this paper because we have found no evidence that human participants in the sADM consistently use such complex strategies. For example, the mean number of features identified by the algorithms in Study 4 was 1.29, with less than 5% of identified strategies including more than two features. This is likely because the time-pressured nature of the sADM encourages simpler strategies, as has been found in prior research on time-pressured decision-making (Oh et al., 2016). It may also be possible to gain some insight into the strategies that participants are using by asking them to report the features they are using, but the degree to which participants can accurately report these details is itself an open question. Therefore, the approach taken here was to use simulations with known strategies to examine the degree to which this approach could recover the features used in the simulated strategies.

The general approach is to train a classifier to maximize predictive accuracy and strategy feature agreement across cross-validation folds. The results show that maximizing a weighted average of these values identifies the strategy being used in the simulated data. One potential concern with these studies is that the results will only generalize to human data to the degree that the strategies that have been simulated match the kinds of strategies that people use. The strategies that were simulated were selected based on an analysis of heuristics reported in the multi-attribute decision-making literature (e.g., weighted additive, take the best) (Gigerenzer & Gaissmaier, 2011). Many of the strategies used were implemented directly in Lisp code inside ACT-R, so the limitations on computation that can occur in ACT-R productions are not an issue. The ACT-R model was used because it existed for other purposes, and the constraints of this cognitive architecture did not play a role in which strategies were selected or implemented.

The results of Study 4 demonstrate that the method has both strengths and limitations in identifying human strategies. First, for most participants, the algorithm likely captured all or some features being used in their strategies. This study also found that several participants were likely using multiple selection strategies within a block of the task, even though this strategy switching was not advantageous on the sADM task. Prior work on the development of strategies in children has found significant trial-by-trial variability in strategy use even if there is a dominant strategy used in most trials (Siegler, 1991). It is likely that the approach described here would identify the dominant strategy but show lower predictive accuracy, similar to the results of Study 2, where random noise was added to the selection data. In Study 3, identification of a strategy even with the classifier that operates over trials required some number of consecutive trials on which the same strategy was used. Whether a task encourages consistent strategy use from trial to trial is therefore an additional consideration in whether this classifier-based method is appropriate.

The underlying implementation of the classifiers also played a role in the ability to accurately identify different strategies. For example, the SVM performs well on the weighted additive strategies because of its separation of selected and unselected objects using a hyperplane: perfect adherence to a weighted additive strategy generates data that is perfectly separable by a hyperplane. The DT is more suited to strategies where participants may use more of an if-then threshold to select objects. In our results, capturing the best of both classifiers was relatively straightforward: the output of whichever classifier had the higher predictive accuracy was used to identify the strategy.
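
To make the framing concrete, the toy example below casts selections as a binary classification problem: each row is one object on the queue with (hypothetically scaled) points and deadline values, and the label marks whether it was chosen. The data and scaling are invented for illustration; only the general framing follows the approach described here.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.9, 0.2],   # object A (chosen)
              [0.4, 0.8],   # object B
              [0.1, 0.5],   # object C
              [0.7, 0.3]])  # object D (chosen on another selection)
y = np.array([1, 0, 0, 1])

svm = SVC(kernel="linear").fit(X, y)
print(svm.coef_)  # signs/magnitudes indicate each feature's role and valence
                  # (for this toy data: high points and low deadlines are preferred)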

Other types of machine learning algorithms may also be advantageous to consider in this type of approach. For example, recurrent neural networks have been shown to enable prior context or state information to inform future predictions of the network (e.g., Elman, 1990). In the sADM task, the queue at the current selection is usually similar to the queue at the prior selection: typically only one object has been removed and a few new objects added, in addition to other task-related changes that occur with the passage of time needed to sort an object. Algorithms that can track state-related information over time may be useful in these kinds of tasks, and participants may partially plan future selections as well. By contrast, the classifiers used here treat each selection as an independent event. Future work could explore more complex algorithms such as recurrent networks, but the decision to start with SVM and DT classifiers was made in order to have the ability to produce a description of the strategy as an ordered list of features with valences. Accomplishing this goal with many machine learning techniques is itself a challenging research problem related to the concept of explainable artificial intelligence (Barredo Arrieta et al., 2020).

The final point of discussion is the degree to which this method can be used with other tasks. All that should be required is that the task has an outcome that is based on some strategy, and that all the features that could be used in making the decision are logged along with the result of the decision. While different tasks may need to scale the data differently than was done in the case of the sADM, after this step the algorithms should be able to report the features that were used in the decision along with their valence. Therefore, the DT and SVM approach should be fairly general, with minimal work needed to function on data from another task.
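
For concreteness, a minimal log format that would satisfy these requirements is sketched below; the column names are hypothetical, and the scaling step mentioned above would be applied to the feature columns before running the classifiers.

# One row per candidate option per decision, with that option's feature values
# and whether it was the chosen option.
decision_log = [
    {"decision": 1, "option": "A", "feature_1": 0.9, "feature_2": 0.2, "chosen": 1},
    {"decision": 1, "option": "B", "feature_1": 0.4, "feature_2": 0.8, "chosen": 0},
    {"decision": 2, "option": "C", "feature_1": 0.7, "feature_2": 0.3, "chosen": 1},
    {"decision": 2, "option": "D", "feature_1": 0.6, "feature_2": 0.6, "chosen": 0},
]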

There are several recommendations for use of this method. As discussed at the end of Study 2, the combined DT/SVM classifier consistently led to results that were often better than either the DT or SVM classifier alone and in other cases closely approximated the performance of the individual classifiers. Using a cutoff of 0.8 on the combined classifier's reported classification accuracy should provide reasonable confidence that the strategy's features and valences were correctly identified. For classification accuracy below this level, using the time-based DT/SVM classifier to examine evidence for strategy switches may allow for the identification of those switches, as in Study 4. Hyperparameter values were explored in Studies 1–3, and our recommendation is to use the range of values documented in this paper and in the available code. The algorithm searches within this range, so it is not necessary for the researcher to specify or customize these values. The only benefit of modifying these values would be to reduce the space searched over to save processing time. However, parameters in the code parallelize the computation such that, on modern multi-core/multi-thread processors, there is unlikely to be a large time savings from customizing these hyperparameter values. We have used the combined DT/SVM classifier to track strategies and have found that processing a block of sADM data takes about 20–30 s on a six-core processor. These estimates are, of course, dependent upon the underlying hardware.

In conclusion, this approach to identifying strategies from data generated during strategic decision-making in a complex task has been demonstrated to work reasonably well on simulated data. This method should enable the examination of strategy selection and strategy modification behavior in tasks with a reasonably complex space of potential strategies. These developments have the potential to open new empirical opportunities to further develop theories about how people create and adapt their strategies in complex environments.