This chapter discusses the empirical results, presents additional findings, and relates the derived insights to other scientific conclusions from the domain of behavioral economics. The first subchapter summarizes what the results of the various hypotheses reveal about agent behavior and adds further results from statistical analyses. The second subchapter discusses strengths, weaknesses, opportunities and threats of the scientific methods used. The third subchapter provides an overview of all limitations, and the fourth subchapter suggests potential methodological variations and recommendations for future research.

7.1 Discussion of Experimental Results

Evaluating individual expertise on the well-defined problem “Tower of Hanoi” by the number of “perfectly solved” games, and filtering by “failing no more than one game”, has proven to categorize participants very reliably by their logic deviation. This is not only true for the well-defined problem-solving stage. For the ill-defined problem-solving stages, where ToE has to be played, Kruskal-Wallis H shows significant differences by individual expertise regarding ToE total (H(33, 16, 38) = 7.775, p = 0.021) and regarding ToE parts1 (H(33, 16, 38) = 10.692, p = 0.005). The individual expertise difference only fails to show clear significance in the “chaotic” ill-defined stages (H(33, 16, 38) = 4.526, p = 0.104). Still, overall the expertise categories show significant differences in the ill-defined stages. Correlating expertise with all ill-defined logic proportions shows significance at the 0.01 level for ToE total (Spearman rho, p = 0.005), significance at the 0.01 level for the “metastable” ill-defined stages ToE parts1 (Spearman rho, p = 0.001), and significance at the 0.05 level for the “chaotic” ill-defined stages ToE parts2 (Spearman rho, p = 0.038).
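
The nonparametric tests reported here can, in principle, be reproduced with standard statistics libraries. The following Python sketch, based on scipy, illustrates how a Kruskal-Wallis H test across the three expertise groups and a Spearman rank correlation between expertise level and a logic proportion could be computed; the variable names and example values are invented placeholders, not the thesis’s data.

```python
from scipy.stats import kruskal, spearmanr

# Placeholder logic proportions for the low, medium and high expertise groups
# (group sizes in the thesis were 33, 16 and 38; the values below are invented).
low = [0.42, 0.51, 0.38, 0.47]
medium = [0.55, 0.49, 0.61, 0.58]
high = [0.66, 0.72, 0.59, 0.70]

# Kruskal-Wallis H test for differences among the expertise groups
h_stat, p_value = kruskal(low, medium, high)
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")

# Spearman rank correlation between expertise level (0/1/2) and logic proportion
expertise = [0] * len(low) + [1] * len(medium) + [2] * len(high)
proportions = low + medium + high
rho, p_corr = spearmanr(expertise, proportions)
print(f"rho = {rho:.3f}, p = {p_corr:.3f}")
```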

Figure 7.1: Boxplot results of logic proportion during “ill-defined” stages over expertise levels (Source: own source).

Agents with higher expertise in the well-defined problem-solving stages also behaved less “randomly” in the ill-defined stages, at least from the perspective of the methodological model. Kruskal-Wallis H shows highly significant differences in logic marker proportions among the expertise levels (H(33, 16, 38) = 18.835, p = 0.000), and the Spearman rho correlation between well-defined problem-solving expertise and logic marker proportions proves to be significant at the 0.01 level (p = 0.000). The logic marker is an index representing the proportion of an agent’s ToE actions that do not fall into any known logic category. In addition, as shown in figure 7.1, the higher the expertise level, the more actions during the ill-defined stages conform to the routine logic. Expertise levels are measured by skillful puzzle-solving of the well-defined ToH stages, where the routine strategy is defined. The ToE tot variable represents the proportion of actions that are part of the routine strategy. In other words, the higher the individual expertise in the well-defined stages, the less participants left their routine strategy path during ill-defined stages.
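
Conceptually, the logic marker is a simple proportion. The following sketch assumes that each ToE action has already been assigned to a logic category, or to None if no category matched; the category labels used here are placeholders rather than the thesis’s actual category names.

```python
from typing import Optional, Sequence

# Placeholder labels standing in for the thesis's known logic categories.
KNOWN_LOGIC_CATEGORIES = {"routine", "mirrored", "alternative"}

def logic_marker(categorized_actions: Sequence[Optional[str]]) -> float:
    """Proportion of an agent's ToE actions that do not fall into any
    known logic category, following the definition given in the text."""
    uncategorized = sum(
        1 for category in categorized_actions
        if category not in KNOWN_LOGIC_CATEGORIES
    )
    return uncategorized / len(categorized_actions)

# Example: 2 of 5 actions could not be assigned to any known logic category.
print(logic_marker(["routine", None, "routine", "mirrored", None]))  # -> 0.4
```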

Therefore, problem-solving expertise, as measured in this thesis, relates not only to well-defined problem-solving performance, but also to ill-defined problem-solving behavior. Agents with high well-defined problem-solving expertise deviated less from their routine strategy and also behaved less randomly during the ill-defined problem-solving stages.

Figure 7.2: Boxplot results of logic marker proportions over expertise levels (Source: own source).

This is shown in figure 7.2: higher individual expertise levels led to fewer participant actions that were not part of any category and were thus considered “random” actions. This correlation links the logic marker variable, which represents the proportion of actions that do not fall into known logic categories, with expertise, which represents skillful puzzle-solving of the well-defined ToH stages. In other words, the higher the individual expertise in the well-defined stages, the less randomly individuals behaved during ill-defined stages. As expected, the environmental change of the goal rod position influenced well-defined problem-solving performance significantly. Individual expertise can be linked to those agents who did not fall for the goal rod change and immediately shifted their routine strategy. Of 33 low expertise agents, only 8 managed to start ToH level 4 with an ideal action. Of 16 medium expertise agents, only 4 managed to start ToH level 4 with an ideal action. Of 36 high expertise agents, 30 managed to start ToH level 4 with an ideal action. Those who made a mistake on the first move of ToH level 4, where the goal rod was changed, are more likely to be found in the “low” or “medium” expertise categories. Individual expertise significantly correlates with agents avoiding this mistake on the first action of level 4. Spearman rho shows the two-sided correlation between expertise and this mistake to be significant at the 0.01 level (p = 0.000), and Mann-Whitney U shows the difference in expertise between agents who made the mistake and agents who did not to be highly significant (U(45, 42) = 436.000, z = −4.673, p = 0.000).
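
The Mann-Whitney U comparison reported above can likewise be sketched with scipy. The expertise scores below are invented placeholders; only the grouping into agents who made the first-move mistake and agents who did not mirrors the analysis described in the text.

```python
from scipy.stats import mannwhitneyu

# Invented expertise scores for agents who made the first-move mistake at
# ToH level 4 and for agents who did not (not the thesis's data).
mistake = [2, 1, 3, 2, 1, 2]
no_mistake = [5, 6, 4, 6, 5, 7]

u_stat, p_value = mannwhitneyu(mistake, no_mistake, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```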

During metastable stages, Kruskal-Wallis H showed the expected states deviation to differ significantly at the 0.1 level (p = 0.063) among the 5 information conditions (ToE X parts 1), as shown in figure 7.3. This shows that agent experience regarding feedback differed depending on the information condition, yet expertise remained a reliable predictor of consistent behavior.

Figure 7.3: Boxplot results of expected states proportion during “metastable” stages over information conditions: 0 = N-IC, 1 = G-IC, 2 = D-IC, 3 = R-IC, 4 = C-IC (Source: own source).

This insight adds another important property to the significance of the expertise categories. Agents with high expertise were significantly more likely than agents with medium or low expertise to adapt to a visual environmental change that influenced their strategy performance.

Regarding all logic proportion analyses, behavior in terms of routine logic deviation was most surprising. Contrary to anticipation, agents did not deviate strongly from their routine strategy and, in fact, more or less stuck to it. Rather, behavior in the no information condition and in the dissolution information condition showed the pattern that was expected to be measured in the routine information condition. Therefore, all anticipated orders of logic proportions were roughly observed to be turned “upside down”.

A significant difference in routine logic proportion was found during the metastable ill-defined stages, where behavior in the routine information condition proved to deviate least from its routine logic, while behavior in the dissolution information condition deviated the most. When logic proportions were analyzed over all ill-defined stages, including the “chaotic” stages, this statistical significance vanished. Differences in routine proportions were especially insignificant when only the “chaotic” ill-defined stages were observed (Kruskal-Wallis H, H(18, 24, 15, 15, 15) = 1.440, p = 0.837).

The proportion of individually experienced expected outcomes was shown to correlate with individual logic proportions at the 0.01 level. Experienced expected outcome proportions, in turn, only differed among the information conditions in the metastable ill-defined stages (figure 7.3), with weak significance (p = 0.063). Overall differences between the information conditions regarding experienced logical feedback were not significant (p = 0.312), especially during the “chaotic” stages (p = 0.423). In other words, all agents experienced comparable levels of “chaotic feedback” and did not differ much in their behavior. Only for the metastable ill-defined stages can meaningful statements be made regarding behavior and experience. Here, behavior in the routine information condition deviated least from its routine strategy, and feedback was the least “chaotic”. During the ill-defined and instable ToE stages, no significant difference in deviation from the routine strategy (ToE parts 2) among the information conditions was found, as shown in figure 7.4. In other words, agent behavior regarding logic deviation was comparable during stages that provided more chaotic feedback.

Figure 7.4: Boxplot results of logic proportion during “instable” stages over information conditions: 0 = N-IC, 1 = G-IC, 2 = D-IC, 3 = R-IC, 4 = C-IC (Source: own source).

Random agent behavior, expressed by a high logic marker, did not differ significantly among the conditions (Kruskal-Wallis H, H(18, 24, 15, 15, 15) = 5.714, p = 0.222), but was shown to correlate with experiencing “chaotic” feedback across all ill-defined stages. As chaotic feedback was comparable among all conditions, this result was no surprise. In addition, routine consistency did not differ significantly among the information conditions either. Routine consistency describes how many of the performed actions followed the routine strategy category. High individual expertise was found to correlate significantly with low random behavior, and also to correlate at the 0.01 level with high routine consistency (Spearman rho, p = 0.002). The difference in routine consistency proportions among the individual expertise levels was found to be highly significant (Kruskal-Wallis H, H(33, 16, 38) = 9.844, p = 0.007).

The higher the individual expertise in well-defined problem solving, the more routine strategy actions were performed; in other words, the higher the individual expertise, the higher the routine consistency, as can be seen in figure 7.5.

Figure 7.5: Boxplot results of routine consistency over expertise levels (Source: own source).

Game-group performance was found to rely heavily on agents agreeing which disk to move, which significantly enhances the chance to beat randomness. In order to know how many moves are required to solve ToE when actions are chosen randomly, five bot groups played 6 ill-defined ToE settings, with the goal rod changing at the fourth level, just as in the main experiment. The bot groups required more than 166 steps on average to solve a ToE game with the goal rod positioned at the center, and more than 113 steps on average to solve a ToE game with the goal rod positioned right. The minimum number of steps to solve any ToE stage randomly was 25, the maximum was 727. The bots required more than 139 steps on average to solve any ToE stage. At the time of measurement, the bot game group was implemented in such a way that all three bots received identical random input, therefore always having a fundamental index of “1”. For this reason, the bot groups did not behave perfectly randomly, as all bots agreed on disk and distance. Of all 29 groups observed, only two did not outperform randomness, requiring more than 139 steps to solve all ToE stages. Due to unreliable variables, it remained unclear whether a game group finished a ToE stage by solving it properly in time or by failing to solve it within the time limit. The time in seconds required per game was saved, but was also deemed unreliable. For this reason, no statement about group performance can be made.
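
The random baseline described above can be approximated with a small simulation. The following Python sketch draws uniformly from all legal moves of a Tower-of-Hanoi-like puzzle until all disks rest on the goal rod and averages the required steps; the disk count, goal rod index and number of runs are illustrative assumptions and do not reproduce the exact ToE bot setup.

```python
import random

def random_baseline(n_disks=4, goal_rod=2, n_runs=1000, max_steps=10_000):
    """Average, minimum and maximum number of uniformly random legal moves
    needed to move all disks onto the goal rod (illustrative sketch only)."""
    results = []
    for _ in range(n_runs):
        rods = [list(range(n_disks, 0, -1)), [], []]  # largest disk at the bottom of rod 0
        steps = 0
        while len(rods[goal_rod]) < n_disks and steps < max_steps:
            # All legal moves: a top disk onto an empty rod or onto a larger disk.
            legal = [(src, dst)
                     for src in range(3) if rods[src]
                     for dst in range(3)
                     if dst != src and (not rods[dst] or rods[dst][-1] > rods[src][-1])]
            src, dst = random.choice(legal)
            rods[dst].append(rods[src].pop())
            steps += 1
        results.append(steps)
    return sum(results) / len(results), min(results), max(results)

print(random_baseline())  # (average, minimum, maximum) step counts
```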

The correlation between group expertise and ToE logic proportions was significant at the 0.05 level for the N-IC and at the 0.01 level for the G-IC. Analysis with Kruskal-Wallis H was significant in all but the C-IC condition. Statistical analysis thus showed enough correlation between group expertise and logic deviations to confirm hypothesis 10.

Gender effects were tested in detail, and while some small deviations between female and male behavior were found, in general the existence of convincing gender effects was disregarded. Some small differences in goal rod change strategy adaptation performance were found, where female participants outperformed male participants. Random behavior by female participants was also more framed by the experiment’s model than was male behavior. Aside from these two minor differences, gender effects were not visible. This is in line with research regarding NPS performance, where no gender effects were visible for any age or country of origin (Strunz, 2019; Strunz & Chlupsa, 2019).

After thorough analyses, the most promising independent variable was individual expertise. Agents with high expertise not only performed well during the well-defined problem-solving stages, adapting their strategy instantly to environmental change, but also showed less routine logic deviation in the ill-defined stages and displayed less random and volatile behavior when solving ill-defined games.

7.2 Methodological Analysis

The transfer from offline to online experimental analysis was a success, as interpersonal communication between agents was avoided. In addition, the online functionality enabled experiments to be completed in a matter of minutes. Experiments running on CuriosityIO can be modified quickly if required, and CuriosityIO enables live observation of each agent. By implementing bots and time limitations, and by kicking inactive players automatically, ethical payment was preserved, as agents played 31 minutes on average for a pay of 6.10 USD. It took dozens of iterations to structure the multi-agent experiment in such a way that the average completion time could be anticipated. As a safety net, Amazon Mechanical Turk workers (MTurks) should be informed that submitting incomplete data will not lead to a rejection if a certain time threshold is exceeded, in this case 50 minutes. Otherwise, MTurks tend to cancel the experiment without submitting the data in order to avoid a rejection. For MTurks, avoiding rejections is more important than the financial loss, as the HIT approval rate is the most common filter for experiments on Amazon Mechanical Turk and usually lies between 95 and 99 %.

When a large experiment fails, for example due to a server crash, it is better to have MTurks submit incomplete data quickly, as compensating MTurks who did not submit their data comes with problems. In such cases, individual “fake” experiments or “compensation HITs” have to be started for each agent, which can lead to an enormous amount of organizational work. MTurks who failed to submit due to the server crash of the experiment with 330 participating MTurks were partially compensated via PayPal; however, paying MTurks via PayPal is a violation of Amazon Mechanical Turk’s terms of service.

Also, live support via email during large online multi-agent experiments is mandatory. Participants need to receive answers within less than 2 minutes in order to feel guided. Many questions arise during any online experiment, leading to dozens or even hundreds of emails that have to be answered in very short timeframes. The experimenter should prepare experiments accordingly to avoid being overwhelmed by organizational work due to compensation or support requirements. As experimenters are rated online and MTurks are well connected, experimenters should take ethical payment and a sound experiment structure seriously.

All in all, the way Amazon Mechanical Turk works does not seem ideal for conducting multi-agent experiments under uncertainty. In order to avoid bots or low-quality data, the required HIT approval rate should be greater than or equal to 99 %. However, with such a strict filter, not enough participants will join within a short time span. Many agents had to join within a short amount of time so that game groups were not automatically filled with bots, in order to gain enough meaningful data. A bot had to be implemented so that MTurks would not have to wait longer than a couple of minutes until the experiment started. This was mandatory for ethical payment, as for any HIT the time limit for a participant has to be pre-set. If a participant fails to finish a HIT (a paid task such as this experiment) within that pre-set amount of time, the MTurk will not be able to submit and the experimenter has a hard time compensating them. However, pre-setting the number of minutes is mandatory in order for MTurks to calculate and anticipate their earnings. When the required HIT approval rate is lower than 99 %, the experimenter risks lower individual data quality but enables more participants to join within a short time span. Paradoxically, with a lower approval rate threshold, data quality rises for this particular experiment, as more data becomes meaningful; with a threshold that is too low, however, individual data quality becomes less valuable. For the main experiment, a HIT approval rate of “greater than 95 %” was chosen, and it is recommended that the experimenter take into consideration the perspectives of the MTurks via online communication channels such as Reddit. There, the author gained enough insight from MTurks to find the ideal HIT approval rate threshold for the experiment.
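
For readers who want to reproduce such a setup, the following Python sketch shows how a HIT with a pre-set time limit and an approval-rate qualification could be created via the boto3 MTurk client. Title, description, reward, URL and most numeric values are illustrative assumptions; only the approval-rate threshold of 95 % mirrors the filter discussed above.

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# External question pointing to the experiment website (placeholder URL).
external_question = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/curiosityio-experiment</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Multi-agent puzzle experiment (approx. 31 minutes)",
    Description="Play a series of puzzle games together with two other participants.",
    Keywords="puzzle, game, group, decision making",
    Reward="6.10",                         # USD, paid on approval
    MaxAssignments=180,                    # number of participants sought
    AssignmentDurationInSeconds=50 * 60,   # pre-set time limit per worker
    LifetimeInSeconds=2 * 60 * 60,         # how long the HIT stays visible
    AutoApprovalDelayInSeconds=24 * 60 * 60,
    Question=external_question,
    QualificationRequirements=[{
        # System qualification "Worker_PercentAssignmentsApproved"
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
        "ActionsGuarded": "Accept",
    }],
)
print(hit["HIT"]["HITId"])
```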

Even though many studies exist about the behavior of MTurks and the data quality gained by conducting experiments with them, not much can be known about each participant in reality. More information about each individual MTurk would have to be obtainable for the experimenter to enable higher quality experiments. An additional feature that would enhance data quality is an online lobby where MTurks could idle without losing time and money. Such features would have to be implemented for multi-agent experiments to become more effective, ethical and efficient. As most freelancers working on Amazon Mechanical Turk are from either India or the United States, alternatives to Amazon Mechanical Turk should be considered if participants from, for example, Europe are required. The Amazon Mechanical Turk business model is gaining popularity, and many alternatives are currently being developed that offer more information about the individual freelancers as well as enough European participants for more diverse country-of-origin experiments.

As response times are valuable predictors of behavior, online experiments should run on stable infrastructure in order to ensure that the saved response times are not erroneous. Even after one year of optimizing both infrastructure and software performance, response times were deemed not reliable enough for statistically meaningful analyses. In addition, any server running multi-agent experiments should be equipped with far more memory capacity than anticipated to be required. While it was suggested that a server with 1 GiB of working memory would certainly suffice for an experiment with 330 agents, in reality the server with such a setup crashed. Even a server with 32 GiB of working memory showed a CPU load of 55 % while running an experiment with only 180 agents. The author recommends at least 128 GiB of working memory for experiments with a four-digit number of participants. In addition, at least one stress test with a couple of hundred non-simulated participants should be conducted beforehand.

7.3 Limitations

Participants were confronted with the cognitive puzzle game “Tower of Hanoi” and its multiplayer version “Tower of Europe”. As this puzzle game might be an undefined or a well-defined task from the very beginning for some participants, ex-ante expertise can lead to a fast learning curve in the well-defined problem-solving stages. In addition, some participants self-reported having encountered the experiment before and might have had some a-priori knowledge. However, none of the participants who self-reported having encountered the experiment before could have played the ill-defined stages, as the first experiment, run with 1 GiB of server memory, crashed right after the well-defined stages. Still, statistical analysis did not treat these participants differently during the second, successful experiment, which was equipped with 32 GiB of memory.

After the well-defined stages, the experiment transitions to an ill-defined problem, with the first three games being “metastable” and the last three games representing “chaotic” decision-making circumstances. The order of the experiment’s problem-solving stages (well-defined; ill-defined and metastable; ill-defined and chaotic) models real-world experiences and challenges, but is also a limitation in itself, as in real-world decision-making any order of problem categories or decision system states might occur.

Participants received information on the outcome of their individual action, but no further details about how the hidden ruleset works, i.e. how their decision influenced the outcome. Participants therefore received simple rather than rich feedback, and learning was limited accordingly.

No analysis including response times was conducted, due to as yet unreliable data. Group performance could not be evaluated due to still missing variables that could clearly indicate whether an ill-defined stage was solved by performing the right actions. Some statistical evaluations would clearly benefit from a larger pool of participant data; however, software efficiency and stability would have to be tweaked further to enable experiments with more than one thousand participants. It is estimated that, in order to derive insights from sound statistical analyses of inter-group differences with five conditions, at least 2,700 participants would be required for nonparametric statistics. With a data dropout rate of about 50 %, the number of recruited participants would have to be in the thousands in order to ensure sufficient data quantity. This thesis relies on 87 data points derived from a pool of 180 participants; therefore, all insights are limited in their statistical validity.
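
Based on the figures above, the required recruitment volume can be estimated with a simple back-of-the-envelope calculation: with roughly 2,700 complete data points needed and a dropout rate of about 50 %, about 5,400 participants would have to be recruited.

```python
required_complete = 2700      # complete data points estimated to be needed
dropout_rate = 0.5            # observed data dropout of about 50 %

to_recruit = required_complete / (1 - dropout_rate)
print(to_recruit)             # -> 5400.0 participants to recruit
```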

7.4 Future Outlook

The Tower of Hanoi is both thoroughly researched and widely used for behavioral experiments. Flag Run and Tower of Europe are experimental novelties, which might benefit from scientific insights regarding insight problem solving, working memory capacity and cultural uncertainty avoidance, and from conducting the experiment with different models of learning environments. Multiple learning environments could be simulated by altering the content of the instructions or by implementing rich feedback. Experiments which differ in their visual representation, yet still run on the identical logic of Flag Run or Tower of Europe, such as an interactive stock exchange game, could be designed. Interpersonal communication could be included via chat windows offering a list of predefined text passages to choose from. The algorithm ensures that, in any case, a multi-agent group decision-making domain is created, where each individual decision influences the outcome while not necessarily having an impact on the group decision output. The algorithm is fair and unbiased, and even if its rules are known, it can only be taken advantage of when agents can agree on their order of action input; in other words, when agents are able to synchronize their actions. However, the algorithm can be made arbitrarily complex, so that even if communication between agents were enabled and agents communicated their order of inputs, they would not be able to take full control over the outcome. Therefore, stable, metastable and chaotic decision-making environments can easily be simulated. An arbitrary number of agents per group can be used, and the algorithm can also be applied to games with multi-dimensional decisions.
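
As an illustration of the kind of aggregation rule such an algorithm could use (the actual ToE rule set is not reproduced here), the following Python sketch combines the submitted moves of all agents into one group move. Every input shifts which agent’s move is executed, so each decision influences the outcome without necessarily determining the group output, and the rule can only be exploited if agents synchronize their inputs.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (disk, distance) as submitted by one agent

def group_move(moves: List[Move]) -> Move:
    """Illustrative aggregation rule, not the actual ToE algorithm.

    The sum of all submitted disks and distances selects whose move is
    executed. Every input shifts the selection index, so each agent
    influences the outcome, yet no single agent controls the group move
    unless all agents coordinate their inputs."""
    selector = sum(disk + distance for disk, distance in moves) % len(moves)
    return moves[selector]

# Example: three agents submit (disk, distance) pairs; one of them is executed.
print(group_move([(1, 2), (3, 1), (2, 2)]))
```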