3.1 Introduction

If researching current trends already involves real challenges, predicting future trends involves even greater ones. Predictions are usually difficult to make and inherently uncertain. However, the Delphi method makes it possible to quantify this uncertainty by gathering many expert opinions on a future issue. We can then easily see how much these assessments differ from one another or resemble each other. The method also provides the possibility of reducing the uncertainty, at least in principle, by asking the experts in a Delphi for their assessments not just once but repeatedly. The process confronts them with the statistical results from the previous survey round and asks them to reassess their previous responses in light of these results. The experts can then either stick to their earlier assessment or change it, and we can see whether and how much the expert opinions converge across survey rounds. Thus, the feedback between rounds facilitates informed decision-making in the process (Linstone & Turoff, 2011). In a conventional Delphi, this process can occur across several survey rounds; in a real-time Online-Delphi, one such round is sufficient to obtain an assessment along with one reassessment, which is the approach the present study pursues.

A Delphi has no group discussion in a socio-psychological sense. All experts remain anonymous during the entire Delphi process, and the absence of interaction rules out any group dynamics that could affect the expert opinions—a great advantage. However, this same design feature also precludes achieving through discussion a shared understanding of the target in question. Thus, designing and wording the survey questions to ensure comparability of answers in even the most complex subject matter is of utmost importance. The present study accomplishes this by establishing a sequence for each of five complex scenarios (Engel & Schultheis, 2021). First, we ask respondents for a response to a complex scenario that consists of multiple dimensions; then, we use follow-up scales to assess each such dimension separately. While the complex scenario helps to convey to the respondent a realistic idea of the imagined future situation and a basis for empathizing with it, the follow-up scales help to ensure precise and comparable responses.

The appropriate selection of experts is an extremely important task for a Delphi. It must ensure the competent assessment of all relevant aspects of the subject matter in question. We describe this selection for the present study in the next section. Its basis is the preference for a larger rather than a smaller selection of experts, as well as for a sufficiently heterogeneous group of experts able to cover the topic of “AI and society” from the point of view of various relevant groups. Finally, affiliation with the Free Hanseatic City of Bremen as a scientific location is an eligibility criterion for all participating scientists.

In the past, Delphi surveys have been the method of choice to forecast future societal changes in a variety of research agendas. For several decades in Germany, Delphi surveys have provided decision-makers with valuable insights into ongoing and future trends. For example, the German Federal Ministry of Education and Research (BMBF) first conducted Delphi studies in 1992, 1993, and 1998 to assess societal trends and challenges for science and technology. Their results culminated in policy recommendations, participative studies, and the still ongoing “Foresight” cycles that focus on development trends with a view toward the year 2030.

The “Foresight” study, a large-scale multinational Delphi survey and part of the BOHEMIA study recently carried out for the European Commission in preparation for the next research framework program, Horizon Europe 2021–2027 (European Commission, 2018), offered only a first rough orientation for the development of scenarios for the present study. A detailed study of human–robot interaction required developing a completely new, precise, and detailed instrument for the present Bremen AI Delphi study.

3.2 Sample and Survey Design

3.2.1 Delphi Survey of Scientists and Stakeholders

As outlined elsewhere (Engel & Dahlhaus, 2022), we invited 1826 experts from two different backgrounds to participate in the Bremen AI Delphi Study, namely 1359 members of Bremen’s scientific community and a diverse group of 467 people from Bremen’s political landscape. The expert group from the Bremen scientific community included scientists affiliated with one of Bremen’s public or private universities at the time of the survey. The prerequisite for participation was holding a doctorate or a professorship.

Disciplines from the social sciences included economics, sociology, political science, health science/public health, cultural science, pedagogy, media and communication science, linguistics, psychology, philosophy, and history. Professionals in engineering, mathematics, robotics, and computer science represented the STEM disciplines. The natural sciences included physics, chemistry, biology, and earth science. The group of political experts, including officials and stakeholders, comprised members of the Bremen Parliament (all party affiliations) and officials serving in senate departments. The group of stakeholders included union representatives, executives of organizations of employer representation, and pastors of Bremen’s Catholic and Protestant parishes. The Delphi sample achieved a response rate of 17.8% (n = 297).

3.2.2 Population Survey

Following a quasi-randomization approach (Elliott & Valliant, 2017), the overall sample consisted of a combined probability and nonprobability sample. A probability sample of residents aged 18+ was drawn from the population register of the municipality of Bremen, weighted for unit-nonresponse, and used as a reference sample for the estimation of inclusion probabilities for an analogous volunteer sample. The response rate of the probability sample was 2.5%. Sample sizes were 108 cases each, so the overall N was 216 people, aged 18 and over. Throughout this book, all analyses of the population survey are based on weighted frequency distributions to balance nonresponse. The weighting approach devised for this is detailed in a recent freely accessible handbook chapter (Engel & Dahlhaus, 2022, pp. 356–357).
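To illustrate the general logic of such a quasi-randomization step, the following Python sketch stacks a reference probability sample and a volunteer sample, fits a propensity model for membership in the volunteer sample, and converts the predicted propensities into pseudo-weights. All data, covariates, and variable names are synthetic placeholders; the study’s actual weighting procedure, including the nonresponse weighting of the reference sample, is documented in Engel and Dahlhaus (2022).

```python
# A minimal sketch of quasi-randomization weighting (Elliott & Valliant, 2017):
# model membership in the volunteer sample on auxiliary variables shared with a
# reference probability sample, then use inverse predicted propensities as
# pseudo-weights for the volunteer cases. Data and covariates are synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two samples (108 cases each, as in the study)
reference = pd.DataFrame({"age": rng.integers(18, 90, 108),
                          "female": rng.integers(0, 2, 108),
                          "is_volunteer": 0})
volunteer = pd.DataFrame({"age": rng.integers(18, 75, 108),
                          "female": rng.integers(0, 2, 108),
                          "is_volunteer": 1})
combined = pd.concat([reference, volunteer], ignore_index=True)

# Propensity model: probability of belonging to the volunteer sample
model = LogisticRegression().fit(combined[["age", "female"]], combined["is_volunteer"])
propensity = model.predict_proba(combined[["age", "female"]])[:, 1]

# Inverse predicted propensities serve as pseudo-weights for the volunteer cases
mask = combined["is_volunteer"].to_numpy() == 1
volunteer = volunteer.assign(pseudo_weight=1.0 / propensity[mask])
print(volunteer["pseudo_weight"].describe())
```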

3.2.3 Fieldwork

Fieldwork for the two surveys took place from 25 November to 15 December 2019. Invitations to the Delphi were sent via personalized e-mail to the recipient’s professional address, and if no response was received within 2 weeks, reminders were sent. Invitations to the population survey were conveyed via personalized postal letters including a link to the web questionnaire (probability sample), as well as via an advertisement in the local newspapers and an announcement on the homepage of the University of Bremen website (volunteer sample).

3.3 Questionnaire Design

3.3.1 The Scenarios for the Reference Year of 2030

A Delphi rests on processing the answers to the survey questions so that the statistical findings can be presented back to the respondents in the succeeding round. In a conventional Delphi, this procedure takes place in separate survey rounds; in a real-time Delphi, within one such round. This reduces the otherwise likely three-to-five survey rounds to a test–retest design, thus limiting the capability of mapping the process of opinion formation and assessing it for possible convergence. However, it offers the great advantage of providing the respondents with quick feedback.

Test–Retest Agreement

In the present study, we implemented the survey of scientists and stakeholders as a real-time Delphi. The related survey questions specifically concern the five Delphi scenarios embodying the themes at the intersection of AI and society: competition, wealth, communication, conflict, and assistance. For each of these scenarios, the model for phrasing questions was: “What do you expect: Will this scenario become a reality?” Then, the standard response scale used consistently throughout the surveys was: “not at all,” “probably not,” “possibly,” “quite probable,” and “quite certain.” Participants first responded to the question, then received access to a text field to optionally explain their response. Next in the sequence was the presentation of the frequency distribution of all assessments to that point in the Delphi sample, followed again by the standard response scale for a renewed rating. We observed an average agreement of κ = 0.8 across the five scenarios. The definition of the weighted kappa

$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$

is the proportion of observed matches (p_o) minus the proportion of matches expected by chance (p_e), divided by one minus the proportion of expected matches. Thus, κ expresses the excess of observed over expected agreement as a share of the maximum possible excess. Accordingly, a weighted κ of 0.8 indicates that the observed agreement of assessment and reassessment exceeds the expected agreement by 80% of the maximum possible excess.
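As a concrete illustration, a weighted kappa can be computed directly from paired assessment and reassessment vectors, for instance with scikit-learn. The example ratings below and the choice of a linear weighting scheme are assumptions for illustration only; they do not reproduce the study’s data or its exact weighting.

```python
# A minimal sketch: weighted kappa for test–retest agreement on the five-point
# standard response scale (1 = "not at all" ... 5 = "quite certain").
# Ratings and the linear weighting scheme are illustrative assumptions.
from sklearn.metrics import cohen_kappa_score

first_rating = [2, 3, 3, 4, 5, 2, 3, 4, 4, 3]   # assessment before feedback
second_rating = [2, 3, 4, 4, 5, 2, 3, 4, 3, 3]  # reassessment after feedback

kappa = cohen_kappa_score(first_rating, second_rating, weights="linear")
print(round(kappa, 2))
```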

The Larger Fictitious Context and Its Single Situational Dimensions

Each scenario refers to the reference year of 2030 and follows a clear structure. In an initial block, the scenario is deliberately pictured as a larger context and not as a narrowly defined situation. This construction should make it easier for the respondents to empathize with this fictional future situation before answering. However, this desired multidimensionality of the picture requires specific follow-up questions about the single dimensions inherent in the broader picture. In a subsequent block, each respondent is asked to rate these dimensions, employing the standard response scale. Chapter 2 demonstrates this using the competitive scenario as an example.

3.3.2 Assessing the Future Without Primary Experience

The future-oriented nature of the survey questions precludes the respondents from already having relevant application experience with the technology. The ideas about robots and AI on which the answers may be based are, therefore, highly relevant. Quite conceivably, the cognitive processing of the interview questions itself helps to develop such ideas. Therefore, we asked a survey question on the expected influence of AI on one’s quality of life, once immediately before and once immediately after a block of questions about communication between humans and robots, to see whether and how the answers changed under the impression from this block of questions.

Table 3.1 Expected influence of AI on one’s perceived quality of life

Table 3.1 shows such a response effect. Even if the pre-/post-distributions differ only slightly and, thus, indicate a high level of aggregate stability, this stability goes hand-in-hand with a substantial change in individual responses. Only 71.2% of the respondents stayed with their original answers (sum of the percentages on the main diagonal). In contrast, 13% corrected an initially positive expectation toward “no influence,” “negative influence,” or “don’t know” (sum of percentages in the upper triangle) while, at the same time, 15.8% changed their assessment in the opposite direction (lower subdiagonal triangle). The weighted κ of 0.57 indicates that the observed agreement of assessment and reassessment exceeds the expected agreement by only 57% of the maximum possible excess.
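To make the decomposition of such a turnover table transparent, the following Python sketch sums the main diagonal and the two off-diagonal triangles of a pre/post cross-tabulation expressed in percentages of all cases. The example matrix is made up for illustration and does not reproduce Table 3.1.

```python
# A minimal sketch: decompose a pre/post cross-tabulation (cell entries are
# percentages of all cases) into stable responses (main diagonal), changes in
# one direction (upper triangle), and changes in the opposite direction
# (lower triangle). The matrix is illustrative, not the data of Table 3.1.
import numpy as np

# Rows: pre-question categories; columns: post-question categories
table = np.array([[30.0,  4.0,  2.0],
                  [ 6.0, 25.0,  3.0],
                  [ 4.0,  5.0, 21.0]])

stable = np.trace(table)             # sum of the main diagonal
upper = np.triu(table, k=1).sum()    # sum of the upper triangle
lower = np.tril(table, k=-1).sum()   # sum of the lower triangle
print(stable, upper, lower)          # 76.0 9.0 15.0
```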

3.3.3 Randomized Sequence of Items

The presentation of unordered sets of categories should consider the “primacy” vs. “recency” distinction that Krosnick and Alwin (Weisberg, 2005, p. 108f.) identify. Accordingly, the impact of response order is presumed to depend on the mode of presenting the question: “With visual presentation, primacy effects will predominate; with auditory presentation, recency effects” (Tourangeau et al., 2000, p. 252). “Primacy effects” means that respondents tend to prefer options at the beginning of the list over those at the end; “recency effects” means the opposite tendency, to prefer options at the end of the list over those at the beginning (Engel, 2020, p. 248). A solution to such effects of the response order is the presentation of the list to each respondent in randomized order. This is the solution also implemented throughout the present study whenever lists of unordered items were presented to respondents, as sketched below.
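As a simple illustration of this solution, the following Python snippet draws a fresh random presentation order of an unordered item list for each respondent; the item labels are placeholders and not the study’s actual items.

```python
# A minimal sketch: present an unordered item list in a fresh random order for
# each respondent to neutralize primacy and recency effects.
# Item labels are placeholders only.
import random

items = ["Item A", "Item B", "Item C", "Item D"]

def presentation_order(item_list):
    order = item_list.copy()   # leave the master list untouched
    random.shuffle(order)      # independent random order per respondent
    return order

print(presentation_order(items))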

3.3.4 The Standard Response Scale

A scale consistently employed throughout the two involved surveys, the Delphi and the population survey, ensured comparable responses. Following recommended practice from survey methodology (Schnell, 2012, p. 91f.), this standard scale rates the degree of belief in the validity of statements on an ordinal level. The scale maps numbers to verbal labels as follows:

  • 1 = “not at all”

  • 2 = “probably not”

  • 3 = “possibly”

  • 4 = “quite probable”

  • 5 = “quite certain”

Using these numbers, we computed interpolated quartiles for such frequency distributions. Throughout the book, these are the first (Q1), second (Q2), and third (Q3) quartiles, which provide a suitable estimate of central tendency (Q2 = median) and the bounds of the interquartile range covering the middle 50% of responses. In the chapters of this volume, we often use a standard instrument, the box plot, to graph this kind of information. In addition, the ordinal scale level is taken into account by specifying probit regression equations; this concerns essentially the confirmatory factor analyses in the chapters of this volume.

Fig. 3.1 Interpolation scheme using as example the percentages for the “competitive scenario” detailed in Chap. 2

Figure 3.1 illustrates the interpolation scheme in the case of quartile computations. The calculation is based on two auxiliary assumptions: (1) that a scale point represents the midpoint of a surrounding interval and (2) that the responses are evenly distributed within each such interval. On this basis, one determines the share of the class width that must be added to the lower bound of the relevant interval to reach the sought quartile.

In the present case, this class width is equal to 1. Using the percentages for the competitive scenario as an example, 9.3% have an observed score of 1, and 56.9% have an observed score of 2 or less.

$$ Q_{2(\mathrm{interpolated})} = 1.5 + \left(\frac{50.0 - 9.3}{47.6}\right) \times 1 = 1.5 + 0.86 \times 1 = 2.4. $$

These cumulative percentages imply that the median (second quartile) falls in the interval from 1.5 to 2.5, because this interval is the first to contain the cumulative 50% corresponding to the median; in fact, it contains more than the required 50% of respondents, namely 56.9%. Interpolation then simply means calculating the portion of the class width needed to reach the theoretical value of 50%. In the present case, this portion is 0.86 (i.e., 86% of the whole class width of 1 is added to the lower interval bound).
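The following Python sketch expresses this interpolation logic as a small function. The argument names are chosen for illustration only; the example call uses the cumulative percentages given above for the competitive scenario.

```python
# A minimal sketch of the interpolation scheme of Fig. 3.1: a quantile is the
# lower bound of its interval plus the share of the class width needed to
# reach the target cumulative percentage within that interval.
def interpolated_quantile(lower_bound, cum_below, pct_within, target, class_width=1.0):
    """Interpolated quantile of an ordinal frequency distribution.

    lower_bound : lower bound of the interval containing the quantile
    cum_below   : cumulative percentage below that interval
    pct_within  : percentage of responses within that interval
    target      : target cumulative percentage (50.0 for the median)
    """
    return lower_bound + (target - cum_below) / pct_within * class_width

# Median (Q2) for the competitive scenario: 9.3% score 1, 56.9% score 2 or less
q2 = interpolated_quantile(lower_bound=1.5, cum_below=9.3,
                           pct_within=56.9 - 9.3, target=50.0)
print(round(q2, 2))  # about 2.36, i.e. 2.4 after rounding as in the text
```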