1 Introduction

This study investigates the possible contribution of machine learning techniques to the coding of natural language transcripts from experiments. The aim is to evaluate whether simple tools from Natural Language Processing (NLP) and machine learning (ML) provide valid and economically viable assistance to the manual coding approach, even when complex concepts are coded.

In recent years, the analysis of communication has become an increasingly important element of many studies in economics. Communication transcripts are consulted to understand behavior beyond what can be inferred from choice data and to obtain insights into team deliberation processes (e.g. Cooper and Kagel 2005; Burchardi and Penczynski 2014; Goeree and Yariv 2011; Penczynski 2016a). Computerized experiments make the collection of communication data very easy, and communication data are potentially very informative about reasoning processes. This strength, however, comes with the natural disadvantage that the coding of text—which is usually done manually—is time-intensive and based entirely on human judgment.

Enabling computers to assist in the processing of natural language is the aim of the many research fields of NLP, such as machine translation, question answering and speech recognition. A basic judgment of texts can be made with the help of simple statistics, such as message counts, word counts and word ranks. Moellers et al. (2017) fruitfully use these concepts when they experimentally investigate communication in vertical markets. More automated approaches like the Linguistic Inquiry and Word Count program (LIWC) group words into semantic classes such as positive or negative emotions, money, past tense, etc. Abatayo et al. (2017) analyse communication in cooperation experiments with the help of such software. This automation comes at the cost that “the semantic classes may or may not fit the theory being investigated” (Crowston et al. 2012, p. 526). A closer fit with a specific economic theory and a higher level of automation can be achieved when statistical techniques such as ML use manually coded examples to build models of linguistic phenomena, an approach that I follow here.

Machine learning—or statistical learning—is a way of obtaining statistical models for prediction in large datasets. Due to the increasing importance of Big Data and variable selection, ML is making its way into the toolbox of econometricians and applied economists (Varian 2014). For example, its strong out-of-sample prediction capabilities support causality studies by estimating policy implementation and counterfactuals (Mullainathan and Spiess 2017). The computational handling of text data leads to datasets with many variables and makes these techniques appropriate.

Across the sciences, text analysis with the help of ML has become more popular in recent years. Physicians classify suicide notes and observe that the trained computer model outperforms experienced specialists in suicide predictions (Pestian et al. 2010). Linguists use ML to sift Twitter for useful information during mass emergencies (Verma et al. 2011). Based on large volumes of text such as party programs and speeches, political scientists use ML to locate politicians and parties in the political space, for example in the left-right spectrum (Benoit et al. 2009). Similarly, economists have used it to quantify the slant of media (Gentzkow and Shapiro 2010) or the consequences of transparency rules for central banks (Hansen et al. 2017). To my knowledge, this is the first study to investigate this technique’s usefulness for experimental text data. A facilitating feature of experimental data is that the topic of the chat conversation is usually known, which simplifies the machine learning analysis.

The communication transcripts studied here are obtained from implementations of Burchardi and Penczynski’s (2014) intra-team communication design in beauty contest, hide and seek, social learning and asymmetric-payoff coordination games. Among the applications in experimental work, the classification of reasoning in terms of the level-k model is certainly one of the more ambitious tasks.

Still, the results are clearly positive and show that the out-of-sample computer classification is able to replicate many results of the human classification. They suggest that in similar or easier classification tasks, computer classification can be a valid option to reduce the additional effort that comes with communication analyses, especially large ones. The following sections will introduce the data and the machine learning techniques that are used. Afterwards, results will be presented for three different applications. The technical appendix introduces the computational method based on an example code.

2 Data

All communication transcripts in this study are generated by the intra-team communication protocol introduced in Burchardi and Penczynski (2014). Teams of two subjects play as one entity and exchange arguments as follows. Both subjects individually make a suggested decision and write a justifying message. Upon completion, this information is exchanged simultaneously and each subject individually enters a final decision. The computer randomly draws one final decision to be the team’s action in the game. The protocol has the advantage of recording the arguments of the individual player at the time of decision making. Furthermore, each subject has an incentive to convince the team partner of his reasoning, as the partner determines the team action with 50% probability.

In the original communication analyses, two research assistants (RAs)—usually PhD or Master’s students—classify the messages according to a standard procedure of content analysis. The authors of the study provide them with written instructions as to which concepts to look for in the text. Initially, the RAs code the messages individually in order not to be influenced by the opinion of the other. Afterwards, they meet or are informed about disagreements and have the chance to reconcile their classifications. Finally, only the coding that the two RAs agree upon enters the data analysis of the messages.

In all analyses of this study, the RAs looked for similar concepts described in the level-k model of strategic reasoning (Nagel 1995; Stahl and Wilson 1995). RAs were asked to indicate the lower and upper bound of the level of reasoning and, in some cases, the characteristics of the level-0 belief. Because messages can be ambiguous with respect to the level of reasoning, the lower and upper bounds determine the interval within which the level of reasoning is likely to lie.

Here, three datasets will be used to investigate the usefulness of machine learning for the classification. Note that the studies were not chosen based on the particular characteristics of the games, but rather on the kinds of results to be replicated and the content extracted from the text, namely levels of reasoning and level-0 belief characteristics.

First, to see the general features of the computerized level classification, I unite observations from the beauty contest game in Burchardi and Penczynski (2014) with observations from the hide and seek game (Penczynski 2016b). This dataset is referred to as BCHS.

The second, larger dataset is from a study of social learning (SL, Penczynski 2017) and allows me to investigate whether one of the main results of the paper, namely that the mode behavior is level-2 (or “naïve inference” as in Eyster and Rabin 2010), can be found via the computer classification. It features scenarios from the standard social learning framework as introduced by Anderson and Holt (1997).

Finally, the third and largest dataset is from a study of asymmetric-payoff coordination games (APC) as investigated in van Elten and Penczynski (2018) based on games introduced by Crawford, Gneezy, and Rottenstreich (2008, CGR). Beyond the out-of-sample replication of the result that the incidence of level-k reasoning is low in symmetric, pure coordination games and high in asymmetric, “battle of the sexes”-type coordination games, this dataset allows me to go one step further and investigate the classification of level-0 beliefs. Specifically, it can be tested whether the computer classification replicates differences in the relevance of label and payoff salience between symmetric and asymmetric games.

3 Technique

The classification method studied here combines techniques of Natural Language Processing (NLP, Sect. 3.1) and machine learning (ML, Sect. 3.2). Appendix A provides further technical details and annotated example code in the software language R.

3.1 Natural language processing

In order to transform a set of natural language messages—a text corpus—into a computer-friendly dataset, the text of each message is represented by a bag-of-words model as a multiset of its words, abstracting from grammar and word order. Specifically, in a process of tokenization, the messages of a corpus are broken down into single strings of letters, numbers, or marks that are divided by a space. Each of the M messages can then be represented by a vector of the frequencies of the T unique tokens. This way, the set of messages is converted into a highly sparse \(T\times M\)-dimensional, so-called document-feature matrix. Denote the frequency of token t in message m as \(x_t^m\) and the vector generated by message m as \({\mathbf {x}}^m\).
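As an illustration, a minimal sketch in R could build such a document-feature matrix from two invented example messages as follows. It assumes the quanteda package; the annotated code in Appendix A may use different tools.

    library(quanteda)
    # Two invented example messages standing in for the experimental corpus
    messages <- c("I think the team saw the same signal as we did",
                  "just pick one at random, any guess works")
    toks  <- tokens(messages, remove_punct = TRUE)  # break each message into tokens
    dfmat <- dfm(toks)                              # document-feature matrix of token counts
    x     <- as.matrix(dfmat)                       # rows: messages m, columns: tokens t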

Some measures can be taken to usefully reduce the number of features T. Here, this is done by a) removing so-called stopwords, common words that are not indicative of the text content, b) reducing inflected words to their stem so that, for example, “team”, “teams” and “teamed” all appear under “team”, and c) dropping tokens that appear rarely in the whole corpus (\(\sum _m x^m_t<5\)). For simplicity and objectivity, I did not correct obviously mistyped words, although this could further strengthen the results.
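Under the same assumption of the quanteda package, the three reduction steps could be sketched as follows, continuing the example above.

    dfmat <- dfm_remove(dfmat, stopwords("english"))    # a) remove stopwords
    dfmat <- dfm_wordstem(dfmat, language = "english")  # b) reduce inflected words to their stem
    dfmat <- dfm_trim(dfmat, min_termfreq = 5)          # c) drop tokens with corpus frequency < 5
    x     <- as.matrix(dfmat)                           # updated count matrix of messages by tokens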

3.2 Machine learning

Due to the large number of independent variables T and the possibly nonlinear relationship between word frequencies and levels of reasoning, standard linear regression approaches cannot be used. The statistical method of choice should feature a selection of variables and the ability to represent highly nonlinear relationships. The field of machine learning offers a large variety of algorithms for various purposes. Precedents of text analysis with random forests (Agrawal et al. 2013), the ease of their implementation and their general usefulness (Varian 2014) led me to choose the random forest technique (Breiman 2001; Hastie et al. 2008, henceforth HTF). It does not require prior calibration and has shown good accuracy and little overfitting across applications.

Machine learning is generally used for out-of-sample prediction, in our case for the prediction of reasoning characteristics based on word counts in messages. The out-of-sample performance can be easily and precisely measured and is therefore the deciding measure of the usefulness of a model and guides many if not all of the choices of algorithms and parameters. It is thus indispensable to split the data into two separate sets for training and testing of the model.

For initial analyses and for a very simple linear model that relates the count of a particular token \(x_t^m\) to the level of reasoning \(y^m\) in message m, \(f(x^m_t)=\beta \cdot x_t^m\), I use 70% of the observations to fit the model in-sample (“train”) and the remaining 30% of observations to test the model out-of-sample.
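A minimal sketch of this split and the bivariate model is given below, with x now assumed to be the document-feature matrix of the full dataset, level the RA-coded levels of reasoning (one per message), and “think” a column of x; all names are illustrative.

    set.seed(1)
    n     <- nrow(x)
    train <- sample(n, size = round(0.7 * n))        # 70% of observations in-sample ("train")
    # level: numeric vector of RA-coded levels of reasoning, one entry per message (assumed)
    dat   <- data.frame(level = level, think = x[, "think"])
    fit   <- lm(level ~ think, data = dat[train, ])  # bivariate linear model in the token count
    pred  <- predict(fit, newdata = dat[-train, ])
    cor(pred, dat$level[-train])                     # out-of-sample correlation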

The in-depth evaluation of the random forest results will make use of cross-validation. Ten consecutive times, a distinct 10% subset of the dataset is held out for testing and the remaining 90% are used for training. The advantage of this more involved procedure is that eventually all observations will have been predicted by a model trained exclusively on other observations. In all analyses, the in-sample vs. out-of-sample split is balanced across treatments/games so that results do not vary due to differences in the number of training observations from particular treatments/games.
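A sketch of this cross-validation is shown below, with folds stratified by a hypothetical game factor so that each fold contains a balanced share of every treatment/game; the randomForest package is assumed.

    library(randomForest)
    set.seed(1)
    # game: factor indicating the treatment/game of each message (assumed)
    folds <- ave(seq_len(n), game,
                 FUN = function(i) sample(rep(1:10, length.out = length(i))))
    pred_cv <- rep(NA_real_, n)
    for (k in 1:10) {
      test          <- which(folds == k)                                # hold out fold k
      rf            <- randomForest(x = x[-test, ], y = level[-test], ntree = 500)
      pred_cv[test] <- predict(rf, newdata = x[test, ])                 # out-of-sample prediction
    }
    cor(pred_cv, level)  # every message is predicted by a model trained on other messages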

As in nature, a forest is based on trees. Trees partition the space spanned by the independent variables into subspaces. The splits are performed sequentially, dividing a dimension t at a split point \(s_t\) into two subspaces, as shown in the illustrative tree and variable space in Fig. 1. For example, one could divide messages into those with less than one token “team”, \(x_{\mathrm {team}}< 1\), and messages with more instances of “team”, \(x_{\mathrm {team}}\ge 1\). The first subspace, \(x_{\mathrm {team}}< 1\), could be split again by \(x_{\mathrm {urn}}< 1\) and \(x_{\mathrm {urn}}\ge 1\), the second by \(x_{\mathrm {saw}}< 1\) and \(x_{\mathrm {saw}}\ge 1\). The online appendix A.4 gives details on how the trees are grown in random forests. To each subspace, one can now associate a level of reasoning \({\hat{y}}^{\mathbb {R}_n}\), as is done illustratively in Fig. 1a.
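A single tree in the spirit of Fig. 1a can be sketched with the rpart package; the tokens “team”, “urn” and “saw” follow the example above and are assumed to be columns of x. The trees inside the random forest are grown differently (see online appendix A.4), so this is only an illustration of the partitioning idea.

    library(rpart)
    d    <- data.frame(level = level,
                       team = x[, "team"], urn = x[, "urn"], saw = x[, "saw"])
    tree <- rpart(level ~ team + urn + saw, data = d, method = "anova",
                  control = rpart.control(maxdepth = 2))
    print(tree)  # reports the split points s_t and the prediction in each subspace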

Fig. 1 Exemplary decision tree

Models in machine learning are fundamentally different depending on the nature of the dependent variable. With numerical dependent variables, for which differences and means are defined, such as levels of reasoning, one speaks of a “regression model”. When the dependent variable takes a limited number of non-ordered values—“discrete variables” in economics—one speaks of a “classification model”.

A simple regression model reflects the response as a constant \(c_n\) in each of the subspaces \(\mathbb {R}_n\). The dependent variable y is predicted by

$$\begin{aligned} f({\mathbf {x}}^m) = \sum _{n} c_n \mathbb {1}({\mathbf {x}}^m \in \mathbb {R}_n), \end{aligned}$$
(1)

and the model error criterion is the mean squared error \(Q_{mse}=\frac{1}{M}\sum _m (y^m-f({\mathbf {x}}^m))^2\). In the case at hand, the random forest algorithm grows 500 trees. In regression, the prediction of the collection of 500 trees for a message m is the average over all trees’ predictions, \(f({\mathbf {x}}^m)=\frac{1}{500}\sum _{b=1}^{500} f^b({\mathbf {x}}^m)\).
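Continuing the sketch from above, a regression forest with 500 trees and its averaged out-of-sample prediction could be obtained as follows (randomForest package assumed).

    rf_reg <- randomForest(x = x[train, ], y = level[train], ntree = 500)  # numeric y: regression
    pred   <- predict(rf_reg, newdata = x[-train, ])  # average over the 500 trees' predictions
    mean((level[-train] - pred)^2)                    # out-of-sample mean squared error Q_mse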

In classification models, the mean cannot be used to aggregate outcomes in the subspaces. The mode outcome can, however, so the aggregation works like a ballot: each of the randomly generated trees casts one vote for its predicted category. The winner of the ballot becomes the model prediction for the message. In each subspace \(\mathbb {R}_n\), the proportion of class d messages is \({\hat{p}}_{nd}=\frac{1}{N_n} \sum _{m: \,{\mathbf {x}}^m\in \mathbb {R}_n} \mathbb {1}(y^m=d)\). The majority class d(n) in \(\mathbb {R}_n\) determines the response that the tree model attributes to a message, that is,

$$\begin{aligned} f({\mathbf {x}}^m) = d(n: {\mathbf {x}}^m\in \mathbb {R}_n)=\arg \max _d ({\hat{p}}_{nd}: {\mathbf {x}}^m\in \mathbb {R}_n). \end{aligned}$$
(2)

With 500 trees, the majority class d over all 500 trees is the prediction for \({\mathbf {x}}^m\).
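In the sketch, a classification forest is obtained by passing the level as a factor; the predicted class is then the majority vote over the 500 trees.

    rf_cls   <- randomForest(x = x[train, ], y = factor(level[train]), ntree = 500)
    pred_cls <- predict(rf_cls, newdata = x[-train, ])   # majority vote over the trees
    table(human = level[-train], computer = pred_cls)    # tabulation in the spirit of Table 2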

In classification, various error criteria can be conceived. The misclassification error counts the number of misclassified messages and is thus intuitive but not differentiable. I will report the Gini impurity, which gives the error rate not for majority classification, but for a mixture model of classifying a randomly chosen observation in \(\mathbb {R}_n\) of category d into category \(d'\) with a probability that corresponds to the proportion \({\hat{p}}_{nd'}\): \(Q_{Gini}=\sum _{d\ne d'} {\hat{p}}_{nd} {\hat{p}}_{nd'}\). This criterion measures dispersion in the categorization and is 0 if all messages in \(\mathbb {R}_n\) fall into one category.
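For concreteness, a small illustrative helper computes the criterion: with proportions summing to one, \(\sum _{d\ne d'} {\hat{p}}_{nd} {\hat{p}}_{nd'} = 1-\sum _d {\hat{p}}_{nd}^2\).

    # Gini impurity of a subspace from the class counts of the messages it contains
    gini <- function(counts) {
      p <- counts / sum(counts)
      1 - sum(p^2)               # equals the sum over d != d' of p_d * p_d'
    }
    gini(c(8, 1, 1))   # mixed subspace: positive impurity
    gini(c(10, 0, 0))  # pure subspace: impurity 0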

In random forests, many uncorrelated trees are grown and then aggregated. “They can capture complex interaction structures in the data, and if grown sufficiently deep, have relatively low bias. Since trees are notoriously noisy, they benefit greatly from the averaging.” (HTF, p. 587f.).

While a single tree as in Fig. 1a is quite transparent about the modelled relationships, a forest clearly is not. Still, the structure of the model can be represented by the so-called variable importance, which tracks over all trees the improvement in the model error attributable to each variable. The higher the reduction in the model error, the more important the variable is for the prediction of the model.
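In the sketch, the variable importance can be extracted from the fitted forest with the randomForest package; for the classification forest it is reported as the mean decrease in Gini impurity per token.

    imp <- importance(rf_cls)                                     # one row per token
    head(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE), 30)  # 30 most important tokens
    varImpPlot(rf_cls, n.var = 30)                                # plot akin to Fig. 4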

While the level-0 characteristics are discrete variables and hence treated in classification models, the level of reasoning can be treated in either regression or classification models. Given my understanding of levels of reasoning, I would probably see them as categories rather than typical numerical variables. However, in order to also illustrate regression models and results in this paper, I will report both regression and classification results for the levels of reasoning.

4 Results

4.1 Beauty contest and hide and seek games

The beauty contest game (Nagel 1995) requires players to indicate an integer between 0 and 100; the winner is the player closest to 2/3 of the average indicated number. In the hide and seek game, hiders hide a treasure at one of four positions, labelled ABAA (Rubinstein and Tversky 1993). Seekers can search for the treasure at one position. Whoever holds the treasure at the end wins a prize. The BCHS dataset contains 78 BC and 98 HS messages. I use the rounded average of the agreed-upon lower and upper bounds in the hide and seek game and—for robustness—the rounded average of more than 40 level classifications of the BC dataset obtained on Amazon Mechanical Turk (Eich and Penczynski 2016).

English stopwords, numbers between 0 and 100, and, due to the game frames, the tokens “a”, “b”, “a’s”, “b’s”, “two”, “third”, “two-third”, “thirds”, “two-thirds”, “half” are excluded from the analysis. Word clouds illustrate the quantified tokens nicely as they indicate more frequent tokens in larger font size. The tokens in the dataset are represented in Fig. 2.
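The word clouds can be produced from the column sums of the document-feature matrix, for example with the wordcloud package; the paper’s own plotting code is not shown here.

    library(wordcloud)
    freqs <- colSums(x)   # token frequencies over all messages
    wordcloud(words = names(freqs), freq = freqs, min.freq = 5)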

Fig. 2 Message tokens in the BCHS dataset. \(M=176\), \(T=98\), \(\sum _t x_t=1605\), \(x_{\mathrm {think}}=127\)

In Fig. 3, splitting the dataset by the level of reasoning as classified by the RAs gives a first idea of whether the content in terms of tokens differs across levels and is potentially predictive of the level. Indeed, Fig. 3a shows that for level-0 the words “just” and “one” are most frequent and others such as “random”, “chance”, or “guess” come up often. In contrast, higher levels feature words such as “think” and “will” more and more prominently and show fewer instances of “guess” or “random”.

Fig. 3 Message tokens in the BCHS dataset by level

In the BCHS dataset, the frequency of one single token is significantly correlated with the level of reasoning both in- and out-of-sample: “think”. Table 1 reports the correlation coefficients as well as the parameters of the linear model. The \(R^2\) indicates that the word alone accounts for around 48% of the variation in levels.

Table 1 Bivariate correlations and linear regression between token count and level of reasoning in BCHS

In a random forest model all tokens are considered. For the two kinds of random forest models, regression and classification, Table 2 tabulates the human classification against the computer model’s out-of-sample prediction from cross-validation.

Table 2 Human classification versus computer prediction from cross-validation in BCHS

In both cases, the computer prediction correlates significantly with the human classification and explains around 71% and 80% of the variation, respectively. The numbers of correctly classified messages, 105 (60%) and 91 (52%), are also considerable. In order to test whether the numbers of correctly classified messages could have possibly been obtained by chance, I randomly permute the levels in the training set and observe the number of correctly classified messages 2000 times (Random permutation test, Golland et al. 2005). For both regression and classification, the numbers 105 and 91, respectively, are above the 99.9th percentile in the resulting distribution. Hence, chance success is rejected with \(p<0.001\).
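A stylized sketch of this permutation test is given below; for brevity it uses a single train/test split rather than the full cross-validation, and it reuses the objects defined in the sketches of Sect. 3.

    set.seed(1)
    obs_hits  <- sum(predict(rf_cls, newdata = x[-train, ]) == as.character(level[-train]))
    perm_hits <- replicate(2000, {
      y_perm <- sample(level[train])                                          # permute training labels
      rf_p   <- randomForest(x = x[train, ], y = factor(y_perm), ntree = 500)
      sum(predict(rf_p, newdata = x[-train, ]) == as.character(level[-train]))
    })
    mean(perm_hits >= obs_hits)  # share of permutations reaching the observed success count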

The structure of the random forest model is illustrated by the importance of the explanatory variables. Figure 4 shows the 30 most important tokens in the dataset. Between the two models, the ranking of the most important words is fairly correlated, with the tokens “think”, “will”, “obvious”, and “averag” appearing in the top 4 of both models. Looking back at Fig. 3, the latter two are indeed quite discriminatory, since “obvious” appears mainly in level-3 and “averag” is strong in level-1.

Fig. 4 Variable importance in the BCHS dataset

Overall, this first analysis of a small and diverse dataset shows that the method can work. The computer classification is not perfect, but it shows promise for larger datasets. In the machine learning literature, the BCHS dataset would be deemed quite small and in the range where more training data points have a positive impact on prediction performance (HTF).

4.2 Social learning

The social learning dataset is taken from Penczynski (2017) and studies the framework introduced by Anderson and Holt (1997). Subjects sequentially receive binary signals (“white”, “black”) about the binary state of the world, A or B, and can observe the decisions of their predecessors in the sequence. Their aim is to match the state of the world with their decision. The private signals are correct with probability 2/3. The dataset contains \(M=348\) messages and their agreed level of reasoning classification from two RAs. The messages feature \(T=115\) unique tokens after stemming and disregarding common and rare words.

Figure 5 illustrates the token clouds by level of reasoning. As before, a transition can be noticed, from words such as “choose”, “random”, and “select” in level-0, via “urn” and “ball” in level-1, to a predominant occurrence of considerations including the token “team” in levels 2 and 3. Figure 5 inspired the exemplary decision tree in Fig. 1a.

Fig. 5 Message tokens in the SL dataset by level

In this dataset, there is no single token whose frequency in a message correlates with the level of reasoning both in- and out-of-sample. The strongest correlation and \(R^2\) can be observed for the token “team”. Similarly to the previous dataset, this token accounts for 37% of the outcome variation (Table 3).

Table 3 Bivariate correlations and linear regression between token counts and level of reasoning in SL

In the random forest analyses, the token “team” turns out to be the most important one in both regression and classification, as Fig. 6 shows. Further, the tokens “just”, “chance”, and “chose” appear among both models’ top 10 important tokens.

Fig. 6 Variable importance in the SL dataset

One of the major results of the original study is the observation that the level of reasoning of the large majority of subjects is 2. In the prediction of the random forest model from cross-validation, as shown in Table 4, the same conclusion would be drawn from the computer classification. In both the regression and the classification model, the mode level of reasoning is 2, far ahead of level-1 and level-0.

Here, both models again lead to a significant correlation \(\rho\) and explain 85% and 88% of the variation, respectively. The numbers of correctly classified messages, 219 (63%) and 239 (69%), are higher than in the BCHS dataset. The random permutation test rejects chance success of that magnitude with \(p<0.001\) in the regression and \(p=0.004\) in the classification.

Table 4 Human classification versus computer prediction from the cross-validation in SL. \(\rho\) gives the correlation coefficient

4.3 Asymmetric-payoff coordination games

The final dataset in this study results from asymmetric-payoff coordination games (APC) as investigated by Crawford et al. (2008) and van Elten and Penczynski (2018). The challenge here is not only to replicate the result that, roughly speaking, symmetric coordination games lead to significantly lower levels of reasoning than asymmetric ones, but also to test whether characteristics such as level-0 features can be classified. In particular, the analysis of van Elten and Penczynski (2018) showed that asymmetric, “battle of the sexes”-type games predominantly led to payoff salience in the level-0 belief, while symmetric, pure coordination games were mostly approached with reference to the salience of the labels.

The dataset consists of \(M=851\) messages and \(T=311\) unique tokens. The analysis uses the agreed-upon classification of the lower bounds of the level of reasoning. Similar results are obtained for the upper bounds or averaged bounds. Table 5 describes the 4 X-Y games and 4 Pie games. In contrast to payoff-symmetric games (in bold), payoff-asymmetric games feature a higher coordination payoff \(\pi\) for one of the two players, depending on the action on which they coordinate. The miscoordination payoff is 0 for both players. The choice is between the letters X and Y in the X-Y games and between 3 pie slices (L, R, B) in the Pie games, which are identified by ($, #, §) and of which B is uniquely white.

Table 5 Payoff structure of coordination games

4.3.1 Levels of reasoning

As before, Fig. 7 shows the most common tokens by the level of reasoning of the containing message. The experimental communication is in German. Again, one can see a characteristic transition from level-0 to level-3. While take (“nehm”), white (“weiss”), same (“gleich”) and first (“erst”) are some of the most common tokens in level-0, levels 1 and 2 most prominently feature “team” and that (“dass”). The incidence of think (“denk”) rises steadily through levels 1 and 2, becoming the most common token in level-3.

Fig. 7 Message tokens in the APC dataset by level

Table 6 shows the 5 out of the 100 most frequent tokens whose frequencies in messages correlate significantly with the level of reasoning in- and out-of-sample. Among them are two related ones, “denk” and “denkt”, which surprisingly are not pooled during stemming. Again, for objectivity, I do not correct for this manually. The correlations and \(R^2\) reach similar levels as in the earlier data and suggest that the token counts can again help predict the level of reasoning. Figure 8 shows that these tokens are among the most important variables for the random forest models.

Table 6 Bivariate correlations and linear regressions between word counts and level of reasoning

Fig. 8 Variable importance in the APC dataset

Table 7 shows the predicted levels for the random forest models. While the correlation between human and computer classification is high and above 0.5, the \(R^2\) is lower than in the previous analyses. The reason is that the computer has difficulties identifying level-2 or higher players, recognizing only 41 and 54, respectively, out of 122. Both models feature a number of correctly identified messages, 568 (67%) and 536 (63%), similar to that in the SL dataset.

Probably due to the numerical nature of the dependent variable and the role of averaging, the regression model identifies many more level-1 players than the classification model or the human classification. A similar but smaller effect can be seen in the SL dataset. I choose the classification model for the following analysis.

Table 7 Human classification versus computer prediction from cross-validation

To conclude the analysis of the level of reasoning, let us take a look at the level predictions by game. Table 8 shows the average level of reasoning in the human and computer classifications and the difference \(\Delta\) between the two. The reduced ability of the computer to identify level-2 players shows most strongly in the asymmetric games, where the difference \(\Delta\) is on average \(-0.19\). Importantly, however, the ranking of games in terms of level averages is very similar between human and computer classification. Both feature lower absolute levels in the symmetric games SL and S1 on the one side and higher levels in the asymmetric games on the other. Despite the reduced identification of higher-level players, the computer classification indicates qualitatively similar level differences between games.

Table 8 Level averages of human and computer classifications by APC game

4.3.2 Level-0 salience

The level-0 salience in the APC games can be divided into payoff and label salience. For both, I use the classification model of the random forest method since the attitudes towards salience are non-numerical categories. Payoff salience implies that subjects mention a belief as to how their opponent reacts to the asymmetric payoffs. Figure 9 shows the most frequently used tokens for the two most important categories, “no salience” and “high payoff”. There are no striking differences across categories; in both, the token “team” is most frequent, although it appears more often in “high payoff”.

Fig. 9 Message tokens in the APC dataset by payoff salience

Table 9 illustrates the prediction of the classification model based on the 5 payoff-asymmetric games. Out of 534 observations, 353 are classified correctly (66%), a substantial share.

Table 9 Human payoff salience classification versus computer prediction from cross-validation

The important tokens for the classification model are illustrated in Fig. 10a. Compared to the important tokens in the model for the level of reasoning, the notable difference lies in the importance of more (“mehr”), egoistic (“egoist”) and taler (“tal”), which is plausible for payoff salience. The token “team” stays relevant since payoff salience is correlated with higher-level messages, which feature this token more often than lower-level messages.

Fig. 10 Variable importance in the APC dataset. Classification model with Gini criterion

Label salience implies that participants are attracted to or averse to actions due to a salient label in the game, which improves the coordination probability. Figure 11a, b illustrates the most frequently used tokens for the X-Y games by label salience category. It is telling that the “label salience on X” category (\(X \succ Y\)) features the token first (“erst”) most frequently, a term that alludes to the first position of the X in the displayed action space (Fig. 11b). Similarly, for the Pie games in Fig. 11c, d, the latter category features white (“weiss”) most prominently.

Fig. 11 Message tokens in the APC dataset by label salience

In terms of the prediction of label salience, Table 10 shows for the example games SL and ALL that differences between games can be detected in the computer classification. While in the symmetric game SL 37 subjects are classified as holding a belief of preference for X (Table 10a), only 3 are classified as holding such a belief in the asymmetric game ALL (Table 10b). In both games, the computer classification is close to the human classification, with 74 out of 105 (70%) on the diagonal in SL and 99 out of 104 (95%) in ALL. Recall that the model is not trained in a game-specific way, but with a balanced number of observations from all games.

In Fig. 10b, the important tokens for a joint model in X-Y and Pie games clearly relate to the level-0 label salience: white, first, and field. I conclude that the computer classification is indeed able to indicate differences in level-0 belief characteristics.

Table 10 Human classification versus computer prediction from cross-validation

5 Economic viability

An important aspect of the presented coding exercise is its economic viability for a research project. What would be the costs and benefits of implementing machine learning?

Regarding the costs, the requirement of a training dataset implies that the manual coding effort cannot be fully substituted. For small projects of the size of those treated here, the cost of human coding is moderate. Ultimately, the necessary size of the training set relates to the complexity and the quality of the machine coding. For the largest dataset, APC, Table 11 shows to what extent the quality of machine coding achieved with a training set of 762 observations (90% of the full dataset) can also be achieved with smaller training sets.
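A sketch of the training-size analysis behind such a table is given below, reusing the objects from the sketches in Sect. 3: the classification forest is refit on increasing shares of the data and the out-of-sample share of correctly coded messages is recorded (the grid of training shares is illustrative).

    set.seed(1)
    shares <- c(0.1, 0.25, 0.5, 0.75, 0.9)   # training shares to compare
    acc <- sapply(shares, function(s) {
      tr <- sample(n, size = round(s * n))
      rf <- randomForest(x = x[tr, ], y = factor(level[tr]), ntree = 500)
      mean(predict(rf, newdata = x[-tr, ]) == as.character(level[-tr]))
    })
    data.frame(training_share = shares, accuracy = acc)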

Unfortunately, this result does not readily generalize since the determinants of the quality of machine coding cannot be identified at this point. Statistical theory points to the number of independent variables, which relates to the number of tokens and thus the variety of words used in the corpus. One can speculate further that the concepts to be looked for, their perceptibility in a given context, and the language have a bearing on the model’s performance as well. One possible predictor of performance might be the agreement rate between human coders. Across the three datasets, APC featured the lowest pre-reconciliation agreement rate of 60%. The machine coding might therefore perform less well than in other datasets, holding constant the number of training observations.

Table 11 Coding performance of regression and classification models in APC (\(N=851\)) depending on the size of the training set

For the sake of a conservative cost calculation, let us assume that, for larger projects, the time and money spent on manual coding will not exceed the effort of coding 1000 messages. The extrapolated cost of coding 1000 messages is, at the time of writing, about 180 Euros and 12 RA student hours. With experimental datasets becoming larger as scientific standards improve and the costs of experiments decrease—due to platforms such as Amazon MTurk—the mentioned cap can be valuable. Coding 10,000 messages would have resulted in a cost of 1800 Euros and 120 RA student hours, a significant dent in a project’s money and time budget.

Beyond the availability of a training dataset, the costs of implementing machine learning as I present it here are relatively low. The software environment R as well as the required packages are freely available. Machine learning methods are quickly absorbed by quantitatively trained economists. Based on the exposition and references here as well as the example in Appendix A, I estimate that 3-5 researcher hours are enough to generate a first computer coding output. The statistical training of the model implies that the expertise of a linguist or NLP-trained analyst is not needed (Crowston et al. 2012).

For large projects and for researchers who work frequently with text, these numbers suggest that the investment in machine learning expertise is highly economical. Beyond reduced labor costs and a reduced time of analysis, the computer approach has the additional potential to improve consistency where extended coding or the use of multiple coders jeopardizes consistency. Some future developments might shift these numbers further in favor of the investment.

6 Outlook

Economists work with a finite set of concepts to be looked for in text. Linguists have developed off-the-shelf tools like sentiment analysis, which do not need further training data and thus work without human coding. It is thus conceivable to eventually have enough training data and validated models for off-the-shelf tools that code strategic sophistication, lying aversion, social preferences, etc. Already now, the body of coded text and messages is considerable and could be used as manually coded training data.

Certainly, more research is required to understand the scope of applications and research questions that can be investigated in this way. Since the present study investigates a rather complex phenomenon of strategic sophistication and aspires to code the degree of this sophistication, I view it as a relatively strong test of the feasibility of machine coding. The estimates given in the context of economic viability should be applicable in other coding tasks and possibly understate the benefits. Other concepts that have been studied with communication such as strategicness in Cooper and Kagel (2005) or the extent of social conversation in Abatayo et al. (2017) are probably more easily coded in general, both manually and by machine coding.

An important facilitator in the current study is the researcher’s knowledge of the topic of discussion. In studies with field data, the topic of a text needs to be identified first (Hansen et al. 2017), which imposes further costs of analysis. The control in laboratory studies means that the topic of conversation in a given text is generally set by the game and thus known to the experimenter. This control makes it particularly simple for experimentalists to use the method shown here for coding.

More work is also required to understand possible differences between human and machine classification. Certainly, the conversion of text data into, here, a document-feature matrix risks losing information that is relevant for the theory at hand. Further, since the machine learning coding cannot easily be reconstructed and intuitively understood, it will require more studies and the input of linguists to clearly see the possibilities and limitations of machine coding.

Further, machine coding might not substitute for but rather complement human classification. Existing or to-be-established off-the-shelf models for the coding of specific economic concepts can add evidence or a new perspective beyond the manual classification, as is done, for example, in Moellers et al. (2017) and Abatayo et al. (2017). Finally, the establishment of standard methods has the potential to improve the acceptance of rigorous qualitative analyses by sceptical, quantitatively minded economists.