Social scientists spend much of their time constructing hypotheses about statistical associations between measurements. For instance, do political ideologies correlate with health behaviors, or does poor sleep predict suicidal intent? When producing and revising such hypotheses, researchers often cast a wide net for inspiration, drawing on written scholarship, anecdotes, and intricate theories. Here, we examine the ability of GPT-3, OpenAI’s large language model (LLM) and one of the most well-read entities in the world (Brown et al., 2020), to hypothesize about the directionality of correlations between social science measures.

LLMs have allowed for the automation of increasingly complex tasks, including some chores of social scientists (Argyle et al., 2022; Aher et al., 2022; Hansen & Hebart, 2022; see also Yeung et al., 2022). Researchers execute a large range of complex tasks that seem far from being automated, including conducting original research, teaching, university administration, and public outreach. Here, we focus on original research. A substantial part of conducting research consists of reading scholarship within (and often beyond) the researcher’s field of expertise. Mechanically, this step is not challenging for LLMs, which are routinely fed scientific texts too plentiful for a single person to read. However, developing a deep understanding, or encoding, of the scholarship—allowing for abstraction, hypothesis generation, and real-world application—is substantially more challenging and possibly beyond the capabilities of current LLMs. Groups of human experts, and even laypeople, can provide fairly accurate predictions about which study outcomes will materialize (often replicate), especially when their predictions are aggregated within competitions or prediction markets (e.g., Gordon et al., 2021; Hoogeveen et al., 2020). Thus, comparing GPT-3’s performance to a human accuracy baseline can reveal how difficult the current task is and whether GPT-3 can be helpful for hypothesis generation and testing.

If LLMs can provide accurate hypotheses about the empirical correlations in social science datasets, they could inspire and support social scientific research in the future. Currently, it is challenging for researchers to gain awareness of the wide range of interrelated social science constructs in their field, and exponentially more time-consuming to formulate hypotheses about the masses of inter-construct relationships, for instance when constructing large network graphs (e.g., McGlashan et al., 2016). An example is the recent phenomenon of the pandemic eliciting profound shifts in networks of political attitudes (Albarracín & Shavitt, 2018). Each potential relationship between two attitudes requires an integration of existing scholarship and current affairs. Fittingly, awareness of interrelated, real-world objects appears to be one of the strong suits of language models, whose parameters can encode encyclopedic knowledge from large text corpora. However, whether these models can accurately anticipate how measurements will covary in empirical studies is yet to be examined.

The current work focuses on the social scientific context of ideological attitudes as they steer many human behaviors and cognitions and are relevant across many social sciences (Albarracín et al., 2014; Bosnjak et al., 2020). Specifically, the challenges given to the LLM in the current work revolve around predicting correlations between two attitudes (e.g., towards feminism vs. towards Donald Trump). In psychological research, attitudes constitute a personal evaluation of a target object (Albarracín et al., 2014; Briñol et al., 2019). They are important for virtually all life contexts as they underlie what behaviors we engage in, how we relate to others, and what we deem moral or immoral (e.g., Maio et al., 2018).

Entire ideologies and belief systems are often conceptualized as networks of interrelated attitudes (Brandt & Sleegers, 2021; Dalege et al., 2016). For instance, there are typical patterns of correlations between people’s attitudes towards Donald Trump, Barack Obama, gun control, and immigration. Knowing a person’s attitudes towards these attitude objects defines part of a person’s overall political ideology and enables predictions of adjacent attitudes in the belief network (e.g., towards Hillary Clinton) and behaviors (e.g., voting in the next election; Dalege et al., 2017). Many social scientists examine relationships between attitudes or entire attitude networks. Examples include work on cognitive dissonance (Starzyk et al., 2009), planned behavior (Ajzen, 1991), or inter-group relations (Brewer et al., 1985)—all of which have attitudes at their core.

1 Can LLMs Forecast Empirical Correlations Between Attitudes?

The GPT-3 model family, and LLMs in general, are neural networks with sophisticated architectures (Vaswani et al., 2017) that were trained on large datasets and can be used for a variety of text processing tasks (Akhbardeh et al., 2021; Brown et al., 2020; Chowdhery et al., 2022; Dzendzik et al., 2021; Jordan et al., 2015; Lin et al., 2021; Stahlberg, 2020). Their inner parameters (i.e., regression weights) are often optimized to anticipate the next (or left out) words in a given text, essentially auto-completing any prompt they are given. Many current approaches transform input words into numerical vectors, recombine and transform these vectors (as done with predictor variables in logistic regression), and adjust the numerical weights to maximize word prediction accuracy for texts within the training data (for details, see Vig & Belinkov, 2019; Wolf et al., 2020).

While the idea to automate parts of the scientific process is not new (e.g., King et al., 2009), using LLMs for this purpose was only recently suggested (Krenn et al., 2022). Despite a current wave of enthusiasm around AI applications (cf., Gozalo-Brizuela & Garrido-Merchan, 2023), mechanical predictions of social science phenomena have had mixed levels of success in the past, often leading to the conclusion that human behavior outside the lab is too complex to be accurately predicted. For instance, Salganik and colleagues (2020) observed that many central life outcomes remain difficult to predict even when large teams of scientists have access to ostensibly high-value predictors. Similar observations were made when machine learning methods were used to predict romantic attraction between people who had yet to meet (Joel et al., 2017). Additionally, poorly performing models can inflict harm in applied settings, especially when self-reinforcing (e.g., in some cases of predictive policing; Berk, 2021).

Some encouraging findings have been made when predicting social science constructs from textual data. For instance, personality traits are sometimes predictable from a person’s social media posts (e.g., Christian et al., 2021; Schwartz et al., 2013; but see Sumner et al., 2012). When it comes to political attitudes—the context of the current work—previous work found that political affiliations can be predicted quite well from certain social media texts, even without relying on LLMs (Ullah et al., 2021). Given the plentiful political texts on which LLMs are trained, they can be a promising tool for predicting political phenomena like ideological attitudes (Argyle et al., 2022; Jiang et al., 2022). However, such highly relevant training data come with the caveat of ideological biases potentially distorting outputs (e.g., Liang et al., 2021). Another caveat of such large data dumps is the difficulty of determining whether a specific task is truly “new” for the model or can simply be reproduced from the training data (Rae et al., 2021).

Additionally, LLMs are argued to relate poorly to human interaction partners (Montemayor, 2021), to be restricted to a superficial learning strategy (identifying relationships between words; Lake & Murphy, 2021), and to therefore have no “understanding” of their own outputs (Tamkin et al., 2021; van der Maas et al., 2021). In short, most researchers still evaluate their capabilities as being far from human-level intelligence (Floridi et al., 2020; but see Wei et al., 2022), or argue that machines, albeit powerful, remain fundamentally different from humans. Accordingly, it is sometimes suggested that trying to re-engineer human-like intelligence might be a poor strategy for maximizing machine performance (Korteling et al., 2021).

Thus, when reviewing the literature, there are notable arguments for and against the idea that LLMs could accurately predict the outcome of social science studies about political attitudes. Furthermore, this task diverges quite far from those that LLMs are typically scored on, such as text translation, summarization, or simple information extraction (e.g., Akhbardeh et al., 2021; Barrault et al., 2019; Lai et al., 2017). Somewhat more relevant is the common task of extracting implied information from a text (e.g., an event’s duration and starting time are mentioned and the model is prompted to name the ending time; Dua et al., 2019; Chowdhery et al., 2022). Sometimes, such logical deduction tasks include human behaviors and cognitions. The following example task is taken from Chowdhery et al. (2022, p. 14):

“Input: Students told the substitute teacher they were learning trigonometry. The substitute told them that instead of teaching them useless facts about triangles, he would instead teach them how to work with probabilities. What is he implying? (a) He believes that mathematics does not need to be useful to be interesting. (b) He thinks understanding probabilities is more useful than trigonometry. (c) He believes that probability theory is a useless subject.

Answer: (b) He thinks understanding probabilities is more useful than trigonometry”.

For such multiple-choice tasks, current LLMs approach the best performance of human participants. Importantly, human performance is usually estimated with small batches of crowd-sourced efforts instead of work from dedicated experts (e.g., Chowdhery et al., 2022). Furthermore, it must be noted that these tasks usually remain substantially easier than developing novel psychological theories from complex academic texts.

As daunting as the scientific literature is, the reading and writing level of LLMs has increased very rapidly in recent years, and the barriers to automatic article processing and thesis generation might fall rather soon. In fact, a GPT-3-based startup is showing promising results when prompted to answer users’ research questions by providing summaries of relevant research papers (Elicit, 2022). As the application was designed to support literature reviews, it lacks the interface for generating new hypotheses or answering research questions that are yet to be addressed, but it showcases the abilities of current LLMs (specifically GPT-3) to process academic texts.

In the current work, we test the ability of GPT-3 to freely hypothesize about research questions. If successful, the model could participate in a crucial step in the work of an empirical scientist: predicting the outcome of a study. Below, we give GPT-3 exactly this task, pre-register its predictions, run the study, test the accuracy of the predictions, and evaluate whether it can serve as a useful hypothesis-generation machine.

2 Study 1

The goal of Study 1 is to quantify the accuracy with which GPT-3 can predict correlations between attitudes. We chose the topic of attitudes as they are immensely important across the social sciences. Similarly, we chose the challenge of predicting correlations as it is very common in scientific work.

The chosen attitude objects all pertain to US American politics and ideologies as GPT-3’s training data includes various texts about these objects, and scientific work on attitude networks is often conducted in this specific context. We estimate the difficulty of the task as fairly low compared to the many text-based tasks of social scientists (but not compared to typical LLM tasks). In order to assess this assumption, we contrast the performance of GPT-3 with the performance of 30 empirical scientists predicting the correlations.

We only had weak a priori hypotheses about GPT-3’s performance as the current task is not common in research on LLMs. Specifically, we simply expected the performance to lie between a chance baseline and 100% accuracy (uniform prior). The pre-registrations of the prediction and validation procedure can be found here: bit.ly/3ym9cWJ.

3 Method

We utilized 36 people, organizations, behaviors, and abstract concepts that are central to political ideologies and frequently discussed in a US American context. Of these attitude objects, 18 are associated with left-wing ideologies and 18 with right-wing ideologies. This split was chosen to ensure the presence of at least some strong attitude correlations (e.g., attitudes towards Donald Trump and Barack Obama are likely negatively associated). The full list is provided in the supplementary materials. The 36 objects were arranged into all 630 possible binary pairs, and GPT-3 was prompted to estimate whether the correlation between attitudes towards the paired objects was going to be positive or negative.
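As a concrete illustration, the pairing step can be reproduced in a few lines of Python; the object names below are placeholders for the actual list in the supplementary materials.

```python
from itertools import combinations

# Placeholder labels; the actual 36 attitude objects (18 left-leaning,
# 18 right-leaning) are listed in the supplementary materials.
attitude_objects = [f"object_{i}" for i in range(1, 37)]

# All unordered pairs of the 36 objects: 36 * 35 / 2 = 630 pairs.
pairs = list(combinations(attitude_objects, 2))
assert len(pairs) == 630
```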

3.1 GPT-3 and Prompt Specifications

Given that GPT-3’s performance can depend on the prompt in unanticipated ways, we always provided seven zero-shot prompts (see supplementary materials) with very different phrasings for binary correlation predictions. The seven individual predictions were averaged to arrive at the final prediction; performances of the individual prompts are reported as well. An example prompt is:

“If there was a correlation between being a fan of <object1> and being a fan of <object2>, would this correlation be positive or negative?”

We used the Python interface from OpenAI to access GPT-3. The selected engine was “davinci-2”, the most advanced engine at the time of writing this text. All other settings are listed in the pre-registration. GPT-3’s responses did not give a clear prediction about the direction of the correlation coefficient in 0.05% of cases. These cases all came from the same prompt and were coded as missing values.
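To make the procedure more tangible, a minimal sketch of the prediction loop is shown below. It uses the legacy (pre-1.0) openai Python package; the engine string, decoding settings, second prompt template, and tie-breaking rule are our own simplifying assumptions rather than the pre-registered specifications.

```python
import openai  # legacy (<1.0) package exposing the Completion endpoint

openai.api_key = "YOUR_API_KEY"

# Two illustrative paraphrases of the seven pre-registered zero-shot templates
# (the full set is in the supplementary materials).
PROMPT_TEMPLATES = [
    ("If there was a correlation between being a fan of {a} and being a fan "
     "of {b}, would this correlation be positive or negative?"),
    ("Do people who hold a positive attitude towards {a} tend to hold a "
     "positive or a negative attitude towards {b}?"),
]

def predict_sign(obj_a, obj_b, template):
    """Query GPT-3 once and map the completion onto +1, -1, or None."""
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed to correspond to "davinci-2"
        prompt=template.format(a=obj_a, b=obj_b),
        max_tokens=5,
        temperature=0,              # decoding settings: our assumption
    )
    text = response["choices"][0]["text"].lower()
    if "positive" in text:
        return 1
    if "negative" in text:
        return -1
    return None  # unclear answers are treated as missing values

def aggregate_prediction(obj_a, obj_b):
    """Average the per-template signs into one final prediction."""
    votes = [predict_sign(obj_a, obj_b, t) for t in PROMPT_TEMPLATES]
    votes = [v for v in votes if v is not None]
    # Sign of the mean vote; ties broken towards "positive" in this sketch.
    return 1 if sum(votes) >= 0 else -1
```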

3.2 Accuracy Scoring

After preregistering GPT-3’s predictions, we collected actual attitude measurements for all 36 objects from 600 US American adults (300 female, 300 male; mean age = 42.25 years, SD = 13.33) through prolific.com. All attitudes were measured with the item “What is your attitude towards <object>?”, and responses were given on a seven-point scale (0 = very negative; 7 = very positive). All 630 inter-attitude Pearson correlation coefficients were computed and treated as the ground-truth labels for GPT-3’s previous predictions. Due to measurement error, these binary directionality labels are unlikely to be perfect for very small correlation coefficients, and the unbiased accuracy ceiling could therefore be below 100%.
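The scoring step can be summarized with a short pandas sketch; the file name and column layout are hypothetical.

```python
import pandas as pd
from itertools import combinations

# One row per participant (n = 600), one column per attitude object
# (hypothetical file and column names).
responses = pd.read_csv("attitude_ratings.csv")

corr = responses.corr(method="pearson")  # 36 x 36 correlation matrix

# Ground-truth sign for each of the 630 pairs.
ground_truth = {
    (a, b): 1 if corr.loc[a, b] > 0 else -1
    for a, b in combinations(responses.columns, 2)
}

# Accuracy = share of pairs where GPT-3's pre-registered sign matches, e.g.:
# accuracy = sum(predicted[pair] == sign for pair, sign in ground_truth.items()) / len(ground_truth)
```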

3.3 Performance of Human Experts

In order to contextualize the performance of GPT-3 in predicting correlations, we asked 30 experts to complete the same task. All experts either held or were currently working on a PhD in political science research or research methodology. All experts were currently conducting empirical research themselves as part of their employment at a research institution in the USA, Germany, or the Netherlands. Each expert was presented with 93 random attitude pairings (e.g., attitude towards Hillary Clinton and attitude towards smoking weed) and had to indicate whether the correlation between these attitudes in an adult US sample would be positive or negative. The number of 93 pairings per person was primarily dictated by the printed version of the questionnaire: 93 pairs fit on an A4 sheet (front and back), and more might have been perceived as overtaxing. Participants were allowed to use all resources available to them to come to their predictions, but based on anecdotal interviews, most people relied on their intuition, occasionally supported by a short Google query. Notice that the collection of expert predictions was suggested during the revision of the current work and is therefore not included in the original preregistration.

4 Results

The accuracy of GPT-3’s prediction of correlation signs was 78.25% across all predictions (95% CI [76, 82]; see Fig. 1). A direct replication study run in March 2023 with 600 participants reflecting the age, gender, and ethnicity distributions in the USA achieved an accuracy of 77.62% (95% CI [74, 81]; for details and plots see supplementary materials).

Fig. 1 Results of Study 1. Top left: estimated accuracy of GPT-3. Top right: distribution of empirical correlation coefficients (i.e., prediction targets) and mispredicted coefficients of individual prompts. Bottom left: prediction accuracies when excluding small correlation coefficients. Bottom right: isolated performances of prompts. Whiskers signify 95% credible intervals. The accuracy prior followed a uniform distribution.
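The credible intervals reported here and in the following studies can be reproduced with a standard beta-binomial posterior; a minimal sketch with a uniform Beta(1, 1) prior (as in Fig. 1) and illustrative counts is shown below.

```python
from scipy.stats import beta

def accuracy_credible_interval(n_correct, n_total, a=1.0, b=1.0, level=0.95):
    """Credible interval for accuracy under a Beta(a, b) prior."""
    posterior = beta(a + n_correct, b + n_total - n_correct)
    return (posterior.ppf((1 - level) / 2),
            posterior.ppf(1 - (1 - level) / 2))

# Illustrative call: roughly 78% of 630 signs correct under a uniform prior.
print(accuracy_credible_interval(493, 630))  # approximately (0.75, 0.81)
```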

The baseline accuracy of always predicting “negative correlation” would have led to an accuracy of 50%. The baseline of always predicting positive associations between same-ideology (e.g., left-wing) concepts and negative associations between different-ideology concepts would have led to an accuracy of 98.57% (621 of the 630 signs), with all nine mistakes involving people’s attitudes towards John McCain and liberal concepts/people. The confusion matrix below (see Table 1) shows that GPT-3 erred mostly on the side of predicting negative correlations.

Table 1 Confusion matrix for Study 1

Figure 1 summarizes the central findings of Study 1. Empirical correlations showed a bi-modal distribution (likely given ideological polarization in the USA). The magnitude of the true correlation coefficients was associated with an increased probability of correct predictions by GPT-3. Some prompts performed better than others while the aggregated predictions were about as accurate as the best-performing prompt. The largest residuals occurred for the associations between Barack Obama and Hillary Clinton (true: r = 0.76; predicted: “negative”), and between abortion and gay marriage (true: r = 0.69; predicted: “negative”).

Human experts showed an average prediction accuracy of 92% (95% CI [89, 94]). As indicated in Fig. 2, no expert made flawless predictions, while some appeared inattentive, dragging the average human performance down slightly.

Fig. 2 Thin lines indicate the posterior distribution of each expert’s accuracy, while the thick line integrates the results of all experts. The prior distribution was uniform between 0.8 and 1, leading to the sharp cut-off on the left.

Our suspicion of inattentiveness is supported by examining the most error-prone attitude objects. While “John McCain” and “Joe Biden” understandably elicited relatively many mispredictions, given that their careers have led to cross-party reputations, the most common error source was “gun control,” which has a fairly established position on the political spectrum. This suggests that some participants might simply have inverted the meaning of the term when providing their correlation predictions.

5 Discussion

On the task of predicting correlations between attitudes, GPT-3 performed substantially above chance, but far below the highly accurate baseline “predict correlations based on same vs. other ideological group.” The task was relatively easy in reference to common prediction scenarios in the social sciences as it merely involved predicting bivariate relationships between polarized, prominent objects, although none of the human experts performed flawlessly either. Conversely, the task was relatively complex and uncommon in reference to typical reasoning tasks that LLMs are tested on. Nonetheless, relating ideological attitude objects to each other should have played into the strengths of GPT-3 as it can largely rely on an encyclopedic knowledge of political concepts and partisanship, which is a luxury that it would not have for many social science questions.

It is important to note that the results above were collected with an out-of-the-box GPT-3 pipeline. That is, all predictions were generated in a zero-shot format (i.e., no labeled, task-clarifying examples were shown to the model). Furthermore, the model was not instructed to solve the reasoning task in a step-by-step fashion (i.e., chained reasoning), which can increase the performance of LLMs (Wei et al., 2022).

In sum, the out-of-the-box performance of GPT-3 was not high enough to be useful to human experts, who outperformed the model by a substantial margin. In Study 2, we complement this first finding by giving the model a much better shot at achieving a high performance by customizing it to the task at hand.

6 Study 2: Top-Down Pipeline Customization and Few-Shot Learning

In Study 2, we optimize the prediction pipeline in two ways. First, we change the previous zero-shot setup to a few-shot setup, and second, we utilize a chained prompting pipeline. We will briefly discuss both additions below.

The essence of few-shot learning is that the model is presented with a few example instructions and correct responses within the prompt and can thereby recognize the nature of the presented NLP task, whereas zero-shot learning does not include exemplary responses. Through the added context, few-shot prompting often increases model performance across various NLP tasks (e.g., Wei et al., 2021). In our current study, few-shot learning consists of providing examples of correctly predicting correlations to the model before requesting the prediction of the targeted correlation.

The concept of chained prompting consists of breaking down a complex reasoning task into smaller steps that the model has to solve consecutively (cf., chain-of-thought prompting, Wei et al., 2022). If set up favorably, the procedure of stepwise reasoning can allow the solver to avoid learning unknown, complex procedures in favor of multiple simpler procedures (Wu et al., 2022). An intuitive example for humans would be to solve the task “3 × 3.5” by consecutively computing “3 + 3 + 3” and “0.5 + 0.5 + 0.5” and adding the results. For LLMs, an example would be to transform a complex reasoning task into consecutive look-up tasks. For the current context, this is achieved by breaking down the task of predicting correlations into multiple steps where GPT-3 first labels attitude objects as either left-wing or right-wing objects and then integrates this additional insight into a consecutive prompt to predict attitude correlations.

For Study 2, we expect the simple classification of left vs. right-wing objects to be completed with a high accuracy (> 90%) as it constitutes a simple look-up task which was a strong suit of GPT-3 in the past (Brown et al., 2020). Furthermore, we expect the accuracy for subsequently predicting correlations to improve from Study 1 (> 80%; pre-registration bit.ly/3ym9cWJ). Notice that customizing the prediction pipeline in this way can potentially improve model performance but entails that the model would no longer be useful across different social scientific tasks. Furthermore, an improved performance would not constitute evidence of a more intelligent model, as part of the thinking is done beforehand by the researcher (“use the concept of ideological polarization to predict attitudes”). Whereas Study 1 explored GPT-3’s performance in a virtually unassisted manner, Study 2 provides strong assistance to the model through a heavily customized prediction pipeline.

7 Method

First, we prompted GPT-3 to categorize all 36 objects into left-wing and right-wing objects (pre-registration and method details: bit.ly/3ym9cWJ). Subsequently, we explicitly integrated the generated labels into the prompts for predicting correlations. Furthermore, we continued to use an integration of seven prompts, but now prefacing each prompt with five examples of correct correlation predictions. An example of a single prompt (gun control vs. Hillary Clinton) is:

“Given that Donald Trump is endorsed by the politically [prediction:] right and Barack Obama is endorsed by the politically [prediction:] left, what do you think is the correlation between people’s attitude towards Donald Trump and Barack Obama? Please answer positive or negative.

negative

Given that the Bible is endorsed by the politically [prediction:] right and the NRA is endorsed by the politically [prediction:] right, what do you think is the correlation between people’s attitude towards the Bible and the NRA? Please answer positive or negative.

positive

[…]

Given that gun control is endorsed by the politically [prediction:] left and Hillary Clinton is endorsed by the politically [prediction:] left, what do you think is the correlation between people’s attitude towards gun control and Hillary Clinton? Please answer positive or negative.”

The attitude objects referred to in the five labeled examples were randomly sampled from all objects except for the two objects in the current prediction target. The engine specifications as well as the performance analyses remained the same as in Study 1 (pre-registration: bit.ly/3ym9cWJ).
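A sketch of how such a chained few-shot prompt might be assembled is shown below. The template wording paraphrases the example above, and the left/right labels are assumed to come from the first prompting step of the chain.

```python
import random

def build_chained_prompt(target_a, target_b, side_lookup, solved_examples, k=5):
    """Assemble a few-shot prompt whose examples embed GPT-3's own
    left/right labels from the preceding step of the chain.

    side_lookup: {object: "left" or "right"}, produced by the labeling prompt.
    solved_examples: [((obj_a, obj_b), "positive" or "negative"), ...] derived
        from the partisan-split heuristic.
    """
    template = (
        "Given that {a} is endorsed by the politically {sa} and {b} is "
        "endorsed by the politically {sb}, what do you think is the "
        "correlation between people's attitude towards {a} and {b}? "
        "Please answer positive or negative."
    )
    # Exclude examples that mention either target object, then sample k shots.
    pool = [ex for ex in solved_examples
            if target_a not in ex[0] and target_b not in ex[0]]
    shots = random.sample(pool, k)

    lines = []
    for (a, b), answer in shots:
        lines.append(template.format(a=a, b=b, sa=side_lookup[a], sb=side_lookup[b]))
        lines.append(answer)
    lines.append(template.format(a=target_a, b=target_b,
                                 sa=side_lookup[target_a], sb=side_lookup[target_b]))
    return "\n\n".join(lines)
```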

8 Results

The simple categorization into left-wing and right-wing concepts was accurate for 35 out of the 36 concepts. The concept of “targeted policing” was predicted to be a left-wing policy. This might have been due to our phrasing of the concept being less politicized than, for instance, “stop-and-frisk,” but we still scored it as a mistake: most current forms of policing can be expected to be less popular among the politically left (e.g., Silver & Pickett, 2015), and all real attitude correlations between targeted policing and the other concepts were consistent with a right-wing categorization. Note that the mistake in this intermediate step was consciously carried forward in the pipeline before predicting attitude correlations.

The accuracy of GPT-3’s prediction of correlation signs was 93.97% across all predictions (95% CI [92, 95]). A direct replication study run in March 2023 with 600 participants reflecting the age, gender, and ethnicity distributions in the USA achieved an accuracy of 93.02% (95% CI [91, 95]; for details and plots see supplementary materials).

Table 2 shows that GPT-3 still slightly erred on the side of predicting negative correlations.

Table 2 Confusion matrix for Study 2

Figure 3 summarizes the central statistics of Study 2. The magnitude of the true correlation coefficients was again associated with an increased probability of correct predictions by GPT-3. Prompts performed similarly well and were marginally outperformed by the aggregated predictions.

Fig. 3 Results of Study 2. Top left: estimated accuracy of GPT-3. Top right: distribution of empirical correlation coefficients (i.e., prediction targets) and mispredicted coefficients of individual prompts. Bottom left: prediction accuracies when excluding small correlation coefficients. Bottom right: isolated performances of prompts. Whiskers signify 95% credible intervals. The accuracy prior followed a beta distribution with a = 4 and b = 2.

In total, GPT-3 made 38 incorrect predictions, depicted as edges in Fig. 4. The true absolute correlation coefficients that were mispredicted ranged from 0.01 to 0.33 with a mean of 0.19. One source of confusion quite obviously resulted from the previous stage in the prediction pipeline, where GPT-3 mislabeled “targeted policing” as a concept endorsed by the political left. The other errors primarily mirror those of the simple “partisan-split” baseline: correlations between attitudes towards John McCain and left-wing concepts were also mispredicted.

Fig. 4 All edges represent correlations (gray = positive; red = negative; size = magnitude) that were mispredicted by GPT-3 (i.e., if the edge is gray/positive, GPT-3 predicted a negative correlation, and vice versa).

9 Discussion

The accuracy when using a few-shot approach and a highly customized prompting procedure is much improved compared to Study 1. It now lies within the range of performances that were observed among human experts. This mirrors previous findings that deliberate prompt engineering and chaining is a very potent method for tackling narrow problems (Hansen & Hebart, 2022). Such an approach is well suited to repetitive tasks that allow for the reuse of highly task-specific prompts (e.g., Tanana et al., 2021). However, it is apparent that much of the cognitive work in such an approach is put on the researcher, who has to explicitly guide GPT-3 to consider specific aspects or strategies when generating hypotheses. In short, the researcher takes over much of the I from the AI. Furthermore, through the high amount of customization of the prompt chain, this pipeline can no longer function as a general hypothesis generator and will remain restricted to correlations within the context of political partisanship. In Study 3, we test whether these issues can be addressed through data-based modifications to the model.

10 Study 3: Finetuning

Instead of researcher-driven improvements of the model performance, Study 3 focuses on data-driven improvements. Specifically, we expand on the previous few-shot setup by retraining the GPT-3 model through OpenAI’s finetuning API. Thus, human researchers no longer provide context-specific information in the prompt. Rather, the model must extract relevant information (here: “political sidedness matters for ideological attitude correlations”) autonomously from training data. This way, the procedure of auto-generating hypotheses can be generalized to other research contexts.

The details of OpenAI’s finetuning procedure, as well as details of their commercial models, are not publicly accessible. However, the general practice of finetuning transformer models is very common (e.g., Bhatia & Richie, 2021; Gutiérrez et al., 2022; Liu, 2019). Often, task-specific data is used to retrain (parts of) the original network and/or a new layer of regression weights. In past studies, finetuning transformer models has led to performance increases across tasks, and many prominent transformer models are developed specifically to be finetuned later on (e.g., Devlin et al., 2018; López-Úbeda, 2021; Liu, 2019). GPT-3 is one of the models that were explicitly devised for finetuning (OpenAI, 2022). The strategy of finetuning is intuitive, as seeing more task-relevant data and focusing model outputs has the potential to improve any statistical model, albeit at the risk of overfitting on the newly provided data (Xu et al., 2021).

For Study 3, we expect the accuracy of GPT-3 to lie in between Study 1 (zero-shot) and Study 2 (few-shot + customized prompting chain). The reason is that we expect top-down customization to have contributed more to the performance increase from Study 1 to Study 2 than the bottom-up customization (few-shot instead of zero-shot). Given that Study 3 removes the top-down prompt adjustment to improve the generalizability of the prediction pipeline, we tentatively expect a performance decrease compared to Study 2.

11 Method

We again examined the performance of GPT-3 on predicting correlations between the ideological target objects from Studies 1 and 2. We utilized the same seven prompts from Study 1, requiring us to finetune GPT-3 separately for each prompt. We utilized 46 new ideological concepts (i.e., their 1035 intercorrelations) as finetuning data, which can be found here: bit.ly/3ym9cWJ. Three examples are Nancy Pelosi, tough love, and focusing on the future.

The desired responses (i.e., finetuning labels) were generated through a simple partisanship split (same-side objects are positively correlated and vice versa). Notice that we know at this point that these labels are noisy, as there are exceptions to this rule (e.g., pro-Biden and pro-McCain attitudes). However, the sidedness heuristic is highly accurate (cf. Study 1) and might even allow us to speculate about the reason for errors in Study 3 (e.g., overreliance on the sidedness heuristic from training data). The dataset of 1035 rows was uploaded and the models were finetuned through OpenAI’s command-line interface. The returned model IDs are included in the analysis scripts and can be utilized by anyone (bit.ly/3ym9cWJ). Method decisions, finetuning data, and hypotheses were pre-registered under the same link.
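A minimal sketch of how such a finetuning file might be assembled is shown below; the record layout follows OpenAI’s prompt-completion convention for the legacy finetuning endpoint, and the template, lookup table, and file path are placeholders rather than the pre-registered specifications.

```python
import json

def write_finetune_file(pairs, side_lookup, template, path="finetune_data.jsonl"):
    """Write one prompt-completion record per object pair, labeled with the
    simple partisanship split (same side -> "positive", else "negative")."""
    with open(path, "w") as f:
        for obj_a, obj_b in pairs:
            label = "positive" if side_lookup[obj_a] == side_lookup[obj_b] else "negative"
            record = {
                "prompt": template.format(a=obj_a, b=obj_b),
                # A leading space in the completion follows OpenAI's finetuning guidance.
                "completion": " " + label,
            }
            f.write(json.dumps(record) + "\n")

# The resulting file is then uploaded and one model is finetuned per prompt
# template, e.g., via the legacy command-line interface (roughly):
#   openai api fine_tunes.create -t finetune_data.jsonl -m davinci
```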

12 Results

The accuracy of GPT-3’s prediction of correlation signs was 97.46% across all predictions (95% CI [96, 98]). A direct replication study run in March 2023 with 600 participants reflecting the age, gender, and ethnicity distributions in the USA achieved an accuracy of 96.51% (95% CI [95, 98]; for details and plots see supplementary materials).

Table 3 shows that GPT-3 again overpredicted negative correlations. In fact, it never erroneously predicted positive correlations.

Table 3 Confusion matrix for Study 3

Figure 5 summarizes the central statistics of Study 3.

Fig. 5 Results of Study 3. Top left: estimated accuracy of GPT-3. Top right: distribution of empirical correlation coefficients (i.e., prediction targets) and mispredicted coefficients of individual prompts. Bottom left: prediction accuracies when excluding small correlation coefficients. Bottom right: isolated performances of prompts. Whiskers signify 95% credible intervals. The accuracy prior followed a beta distribution with a = 4 and b = 2 on the interval from 0.5 to 1.

The magnitude of the true correlation coefficients was again associated with an increased probability of correct predictions by GPT-3. Prompts performed about as well as the aggregated predictions. The true absolute correlation coefficients that were incorrectly predicted ranged from 0.01 to 0.46 with a mean of 0.23. Two central nodes again account for most of the errors depicted in Fig. 6.

Fig. 6 All edges represent correlations that were mispredicted by GPT-3. All errors were positive correlations predicted to be negative.

As in Study 2, correlations between attitudes towards John McCain and various left-wing objects were erroneously predicted to be negative. Furthermore, “defunding the police” was the most error-eliciting object, as many of its correlations with left-wing objects were erroneously predicted to be negative. Notice that a similar pattern emerged in Study 2 with the object “targeted policing.”

13 General Discussion

Large AI models are commonly trained on texts featuring social and psychological phenomena and can, by encoding word associations, produce statements about the relationships between these phenomena (Hansen & Hebart, 2022; y Arcas, 2022). Furthermore, they can produce outputs that appear creative or even unique, which is currently changing the landscape of artistic work (Anantrasirichai & Bull, 2022; Stevenson et al., 2022). Does this mean that such models can automate the creative reasoning task of hypothesis generation in the social and behavioral sciences?

The current work demonstrates that GPT-3 can, at least sometimes, accurately predict the direction of correlations between ideological attitudes. While zero-shot prompting elicited poor results, prompting with a highly task-specific, multi-step procedure or finetuning the model with relevant training data led to accuracies around 95% (where real-world study outcomes were considered the ground truth). A replication study collecting the same attitude data from a representative sample of 600 US American adults showed the same results (see supplementary materials). These performances were above the average performance of our expert sample, which supports previous work showing that LLMs can do useful autonomous work in the social and behavioral sciences (Hansen & Hebart, 2022). We want to emphasize that the expert performance observed here is not equal to the peak performance that researchers would show when generating hypotheses about their own empirical studies. Here, some experts participated very casually or conducted only marginally related research themselves (sometimes focused on other geographical areas). Still, the current work encourages the testing of more general study prediction models. Most study outcomes within the social sciences are more difficult to predict than the ones used here. Especially when outcomes have to be predicted at the individual level, rather than the collective level as in the current work, prediction accuracy often plummets (Joel et al., 2017; Salganik et al., 2020).

If AI models remained competitive across research contexts, they could function as a “second opinion” before conducting studies, a reference point for power analyses, an indicator of how intuitive new study results are, or a creative thinking tool for eliciting probable hypotheses in a given field. Note however that a few glaring errors remained in all configurations and that the gradual extension and retesting of these models in research outside of political attitudes remains a task for future endeavors. The question of how general such models should finally be (one model for all social sciences or specific to subfields) requires empirical testing. Naturally, the training corpus for new models should consist of results from empirical studies, similarly to the simulated training data we used for finetuning in Study 3. Efforts to make publications machine readable or to scrape them into a standardized format can directly support this goal (Lakens & DeBruine, 2021; Rosenbusch et al., 2020).

While the accumulation and pre-processing of such a training corpus constitutes a large project, we assume that it is generally superior to task-specific prompt engineering (cf. Study 2), which always remains dependent on the human researcher and cannot easily extend beyond its original scope of application (cf., the research robot “Adam,” King et al., 2009). However, if task-specific pipeline customization were to become the primary strategy for auto-generating research hypotheses, meta-methods like PromptChainer (Wu et al., 2022) might start playing a central role (Chase, 2022).

While encouraging for AI-supported research, we want to emphasize that even an LLM which is 100% accurate in predicting study outcomes is not “doing social science.” Specifically, we highlight three key limitations that can be at least partly addressed in future studies.

First, making a prediction about a study’s outcome is not the same as developing a theory about whether and how study outcomes will come about. The job of a social scientist is not just to state how attitudes towards A and B will correlate, but to justify their hypotheses, consider the impact of their results, review the methods used to generate findings, integrate findings into a body of literature, and many more activities. Thus, even if an AI were to reliably outperform researchers in predicting study outcomes, it would be far from being an autonomous scientist.

Second, in the current study, we utilized ideological concepts that are well documented in the LLM’s training corpus. If they had been absent from the data (say, a politician who started their career after model training, or a newly suggested psychological construct), this would have effectively prevented the generation of valid predictions. Similarly, bivariate correlations are among the simplest statistics to predict. Had the task instead been to, for instance, anticipate complex network structures, model performance would very likely have decreased substantially (as does the success rate of human hypothesizers).

Third, all current LLMs should be expected to regularly violate research ethics. Most prominently, models are likely to generate discriminatory hypotheses based on the many biases in their training data (Weidinger et al., 2021). Current efforts explore possibilities to counteract such biases by adjusting training data, model architecture, and training procedures, an intervention that is not feasible for human hypothesizers. However, especially for ethically loaded hypotheses, an LLM based on word associations in large internet scrapes should not be given much credibility (cf., Bender et al., 2021; Corbett-Davies & Goel, 2018).

These limitations together bring forward the question of what a machine-generated hypothesis would need to look like to be given equal consideration as a human-generated hypothesis. We assume that a convincing, coherent rationale justifying the hypothesis would be sufficient, even if it arises mechanically and without human-like understanding, as supervising researchers would primarily care about the argument behind a mechanically generated hypothesis (Krenn et al., 2022). Given that the production of such rationales is an independent challenge, machine-generated predictions of study outcomes might become empirically superior to human hypotheses without being able to provide much-needed theoretical explanations. In addition, LLMs are only exposed to knowledge in the form of written text, whereas human knowledge is gained and captured in many alternative modes. The near future will show how AI products and the research community come together to address these issues. A fruitful next step might be to enter hypothesis generators into study prediction contests and replication markets (Liu et al., 2020).

In conclusion, the journey towards an artificial intelligence winning a Nobel Prize for scientific discovery has begun (Kitano, 2016). The road towards a universally persuasive hypothesis engine remains long, and new tasks often take a while to be mastered in the field of machine learning (Grace et al., 2018). However, the current work highlights the possibility of automatically generating hypotheses for simple questions. The promising next steps are to illuminate the philosophical and ethical implications of accurate hypotheses generated by machines that have no built-in incentive to tell the truth (Sobieszek & Price, 2022).