1 Why a(nother) social media observation project?

Regardless of whether social media is a (distorting) mirror, an amplifying sounding board, or a root cause of social developments, it takes place in, connects, and shapes the public sphere(s) of today (cf. Bruns and Highfield 2016). In particular, the polarization of opinions and, thus, its potential impact on political and social processes was among the earliest phenomena researched on social media platforms (cf. Yardi and Boyd 2010). Acknowledging this very special role of social media communication, the project Social Media Observatory (SMO) started its work in the summer of 2020 as a virtual infrastructure provider for the newly funded German Research Institute Social Cohesion (RISC). The RISC is a decentralized, networked research organization spanning eleven institutions in Germany that study and monitor phenomena related to social cohesion from a diverse set of disciplinary and methodological perspectives. Its strong roots in the social sciences pose a challenge to the goal of developing a focus on social media research with big datasets and long-term monitoring, which, as an interdisciplinary endeavor in itself, requires both computational and social analytics skills. Instead of building up the required knowledge and analytical skills separately in the widely dispersed parts of the institute, it appeared reasonable to combine efforts for this task in one unit.

The contribution of this article is to describe how the SMO, as a centralized infrastructure within the RISC, provides a facilitated entry point into large-scale analysis for social media research. Our mission is to support scholars in using our infrastructure services in a do-it-yourself (DIY) fashion, as summarized in Fig. 1, enabling them to build their own solutions. Based on three years of experience following that mission, we describe how the SMO supports typical research design decisions, data collection, and analysis steps throughout the social media research process.

Fig. 1

Structure of this paper (section numbers are given in brackets) along principal services of the SMO virtual infrastructure

1.1 Predecessors and role models

The relevance of social media research for the RISC is obvious. However, the question remains, why do we need a dedicated research unit like the SMO that, rather than answering thematic questions, focuses on infrastructure and knowledge creation towards the goal of social media observation in general? There is a long list of existing centers pursuing this general goal that could serve as role models as well as partners in this endeavor:

  • the Digital Methods Initiative (DMI) at the University of Amsterdam, offering ready-to-use internet research web apps, workshops, and summer schools,

  • the Observatory on Social Media (OSOME), known for its web apps and widely used Application Programming Interfaces (API), also engages in graduate and undergraduate student education at its host organization, Indiana University,

  • the Virtual Observatory for the Study of Online Networks (VOSON) Lab at the Australian National University, with a similar profile in education, focusing on R‑packages and R‑Shiny apps for the collection and analysis of network data,

  • the Penelope platform, aiming to modularize and open the web app model with the possibility to contribute with open APIs to the platform (cf. Willaert et al. 2020),

  • the Queensland University of Technology (QUT) Digital Observatory (DO), spun out of the Digital Media Research Centre (DMRC), contributing to the Python-based Twitter data collection tool Twarc and hosting a continuous collection of tweets by Australian Twitter users (cf. Bruns et al. 2017), or

  • the European Digital Media Observatory (EDMO), with a focus on establishing a network of hubs to help “identify disinformation, uproot its sources or dilute its impact, support fact checking and quality information, and connect related expert communities” (EDMO’s Vision And Mission, EDMO 2022).

This list is not comprehensive and omits more general-purpose data centers. Nevertheless, these distinct undertakings support our conviction that another Social Media Observatory is a worthwhile idea. Despite the years-long existence of projects facilitating digital media observation, challenges remain that may always need answers from a multitude of perspectives.

1.2 Remaining challenges, unmet demands, and guiding principles

As there are many parallel efforts to collect, explore and analyze social media data, one could argue that these efforts should be as centralized as possible. However, from our perspective, redundancy to some extent is a necessity to produce triangulated scientific knowledge and to maintain an innovative and resilient ecosystem of social media research. Past and looming ‘APIcalypses’ (Bruns 2019), in an industry that tends to ‘move fast and break things’—as the takeover of Twitter by Elon Musk in fall 2022 or the lay-off of science-support personnel at Facebook have illustrated—are less threatening to a research environment that has more than one answer to them. The field of social media observation is also still lopsided towards the English-speaking world, which should be addressed by initiatives that can contextualize other language spaces, such as the German one in our case.

Furthermore, one of the main reasons the SMO and other observatories exist is arguably a demand for centralization. This demand is created, inter alia, by the projectification of academic knowledge production (cf. Ylijoki 2016), particularly the requirement of contemporary funding models to bundle research into packages of a few years or less, with single-purpose-bound budgets and most researchers working on fixed-term contracts (cf. Bahr et al. 2022). The ephemeral nature of such work environments (at least on the timescales of scientific knowledge creation) often neither allows assembling the infrastructure and skills for large-scale online media research at the beginning of a project nor incentivizes the sustainable continuation of such structures. The RISC, by contrast, offers comparably good conditions, with an initial funding period of four years and a subsequent five-year period planned, which made the SMO a logical next step.

With this in mind, the SMO was founded as a digital innovation unit (DIU) (cf. Barthel et al. 2021) within the RISC at the Leibniz-Institute for Media Research | Hans-Bredow-Institut (HBI), which also hosts other RISC projects. It is embedded in the Media Research Methods Lab (MRML), which anchors the field of computational social science within the HBI. The organizational design choice to create the SMO as its own DIU responds to a further challenge: the still lacking anchorage of computational methods in the curricula of the social sciences that constitute the disciplinary core of the RISC. This challenge is reflected in the presence of training resources at most of the SMO’s predecessors and role models. Training postgraduates and established researchers through workshops and individual consulting is, consequently, one pillar of our mission.

One of the challenges that comparable programs face, and one of the main reasons for parallel efforts in social media observation, is the ethical and legal sharing of data. Even in the absence of legislation as rigorous as the General Data Protection Regulation (GDPR) in the European Union (EU), ethical and research-strategic considerations hinder the exchange of social media research data. While the SMO fully acknowledges and promotes strong ethical boundaries for the sharing of data, and actively provides resources to help researchers draw these boundaries (see Sect. 2), we are committed to open science as a strategic goal. This creates a tension that we try to mitigate in two ways:

First, our approach to the data-sharing problem is rooted in our standpoint that reliable and sustainable social media research, and research in general, necessitates a networked approach to competence sharing. Therefore, the SMO’s goal is to provide an infrastructure that supports decentralized initiatives in building social media research projects and groups themselves. For this, it provides them with the base data and further knowledge they need to collect more sensitive data themselves instead of solely relying on shared data. To share and continuously update methodical knowledge and experience on social media research designs, we curate a publicly available and editable wiki (cf. Münch et al. 2021a). This SMO wiki has already proved a useful resource for our own training activities and for consulting new projects on how to set up their data collection.

Second, the SMO strives to offer and curate information on high-quality, reusable datasets as open as ethically and legally possible. We focus particularly on datasets that are prohibitively hard to compile for shorter-lived projects but are needed as base data to collect or benchmark case-based data. Examples include, but are not limited to:

  • account lists of relevant public actors, e.g., parliamentarians (see also van Vliet 2021), election candidates (Schmidt et al. 2022), politicians, media outlets, journalists, or other public speakers (Schmidt et al. 2023b),

  • large topical corpora of posts, e.g., about the Russian invasion of Ukraine in 2022 (Münch and Kessling 2022), or

  • subscription and activity networks across social media platforms on a national to global scale, such as the Australian (Bruns et al. 2017; Bruns and Moon 2019), Dutch (Geenen et al. 2016), German (Hammer 2020; Münch et al. 2021b), Korean (Guan et al. 2022), or Norwegian (Bruns and Enli 2018) Twitterspheres.

If open access cannot be provided, we aim to share the data upon (vetted) request, in collaboration, or to make the collection as reproducible as possible. For instance, instead of publicly sharing full datasets, we may publish only ID lists of contents on a certain platform along with a software tool that re-collects the full dataset given the IDs. For this purpose, we develop our tools as free and open-source software (FOSS). These tools are built under the design paradigm of closing gaps in the toolchains of existing software for use in comparable research labs instead of offering one all-purpose product. This allows us and other researchers a flexible mix-and-match approach that has greater chances of adapting to the fast-changing environment of social media platforms and encouraging innovation beyond the SMO.

The remainder of this article is organized by the main services of the SMO virtual infrastructure, as shown in Fig. 1. We continue with the introduction of our ethical framework (Sect. 2), which can serve as a starting point for any social media research project. In Sect. 3, we explain how to use well-curated account lists and keyword lists to conduct social media data collections systematically. In Sect. 4, we describe our technical approach to continuous monitoring of social media communication, resulting in a large, diachronic database that allows for insights into the current state of the public debate as well as into the development of long-term trends. Sect. 5 highlights publications of analysis resources from our project as part of a DIY infrastructure approach, enabling other researchers not only to use our data and pre-computed results but also to develop their own projects. Finally, we conclude with our vision of knowledge collaboration: jointly extending and curating the knowledge base on computational social media analysis.

2 Ethical and legal foundations of social media research

Social media research is subject to ethical and legal issues which need to be addressed by every project. The SMO provides guidance to address these challenges before, during, and after any online data-based research.

Prominently debated and criticized studies and research projects, such as those around Cambridge Analytica (cf. Venturini and Rogers 2019), the Facebook emotional contagion study (cf. Kramer et al. 2014), and the attempt to detect sexual orientation online (cf. Resnick 2018), illustrate the sensitivity of social media data and its potential for abuse. At the same time, the relative newness of the research field results in a lack of established norms as well as in ethical and legal uncertainties related to research with this type of data (cf. Franzke et al. 2020). In an ethically pluralistic environment, people from different racial, ethnic, and cultural backgrounds may have very different views of what constitutes ethical research. Our understanding of ethical research aims to protect the fundamental rights of human dignity, autonomy, protection, and safety, as well as to maximize benefits and minimize harm, i.e., to uphold respect for persons, justice, and beneficence. This is in line with the Association of Internet Researchers (AoIR) Ethical Guidelines (cf. Franzke et al. 2020, p. 10) and based on the Belmont Report (Ryan et al. 1979). It is important to highlight that some research may be illegal even if it seems morally acceptable, and vice versa. This makes it necessary to assess both the ethical and the legal aspects of the research process.

The guiding principle should be that “greater risks to fundamental rights of participants (all persons affected by the research) and greater risks of harm to participants, researchers, and society must be accompanied by greater countermeasures to mitigate these risks—or, if this is not possible, to refrain from following through with a project at all” (Rau et al. 2021, p. 2). Reflections, decisions, and actions to be taken should be documented carefully.

2.1 Ethical challenges around social media monitoring

Avoiding harm is one of the most central elements of ethical research (cf. Franzke et al. 2020, p. 17). Researchers must consider how the data they gather and process, the insights they gain, and the dissemination of both may harm research subjects. This becomes even more important when vulnerable groups are involved, such as members of the LGBTQ+ community who might not want their identities made public. At the same time, the researcher’s obligation to protect their research subjects must not depend on their political proximity to them. The researcher’s responsibility to consider detrimental outcomes for their research subjects extends to everyone, even those who could be viewed as “problematic actors” (e.g., extremists).

Another cornerstone of research ethics in the context of processing personal data is informed consent (cf. CUREC 2021, p. 6). Informed consent aims to ensure that participants can enter and exit research voluntarily, with full knowledge of what participation entails for them, giving their consent before taking part in the study. Obtaining this form of agreement often becomes prohibitively challenging in large-scale social media data research. Considerations in determining whether explicit consent is required include the public interest in and benefit from the research, the public availability of the data, the public interest in the research subjects, the anticipated expectations of the research subjects towards their data being processed, and the type of processing (e.g., analysis vs. dissemination) (cf. Franzke et al. 2020, p. 10; CUREC 2021, p. 6).

Inconsiderate amplification of certain phenomena, for example, suicides, mass shootings, or extremist ideology, can have harmful consequences (cf. Gould 2006; Kwan 2019). For instance, revealing extremist communities and accounts in research data and publications can encourage more people to join them. It is vital to deliberate whether and how findings and data should be shared and to consider further security measures.

Protecting the health and safety of researchers in online research is crucial, too (cf. Franzke et al. 2020, p. 11). Working on issues like terrorism, extremism, jihadism, racism, antisemitism, misogyny, hostility towards queer identities and others, but also simply the researcher’s identity (e.g., ethnicity, minority identity, sexual identity, political activism, etc.), can trigger uncivil ideological reactions. This includes verbal abuse, doxing, death threats, or even physical attacks. Appropriate safeguards, starting with personal data security and crisis communication plans, need to be in place (cf. Marwick et al. 2016). Also, the research itself, for example reviewing videos of beheadings and other acts of violence, abuse, or self-harm, can affect the psychological health of researchers and should be mitigated by countermeasures like psychological support.

2.2 Legal challenges around social media monitoring

Legal obligations for research differ depending on the respective legal system. The following section focuses on legal aspects relevant to social media research in the European Union and Germany.

The General Data Protection Regulation (2016) (GDPR), together with the national implementations of its opening clauses (in our case, the German Federal Data Protection Act [BDSG] 2017), regulates the processing of personal data in the European Union. Art. 9 (2) lit. j GDPR provides a legal basis for collecting social media data for scientific purposes without informed consent under certain conditions, such as proportionality to the aim pursued and the implementation of measures to protect the interests of data subjects. For this, the GDPR mandates that personal data must be processed lawfully, fairly, and transparently; be collected only for specified, explicit, and legitimate purposes, and not be processed in any manner incompatible with those purposes; be adequate, relevant, and limited to what is necessary in relation to the purposes for which it is processed; be accurate and, where necessary, kept up to date; not be kept in identifiable form for longer than necessary for the purposes concerned; and be processed securely.

Social media platforms’ or social networking apps’ Terms and Conditions (T&C) or Terms of Service (ToS) sometimes include clauses that restrict automated data gathering. In a German context, Golla and Schönfeld (2019, p. 17) point out that if the data is publicly available and can therefore be gathered without agreeing to the ToS, such a limitation might be a unilateral and non-binding condition. Following this argument, web scraping of publicly available data for research still entails several legal ambiguities. However, it can be regarded as legal if it is restricted to specific portions of a platform and required for the research.

Social media data may also be protected by copyright regardless of its public availability. Analogous to data protection regulation, it is impossible to obtain the consent of every single content creator on social media. Instead, Sections 50, 60c, and 60d of the German Act on Copyright and Related Rights (1965) provide a legal basis for scientific, non-commercial research, too. These sections allow for the processing of small-scale works, which applies to most social media posts, and for data mining for large-scale automated text analysis.

3 Systematic data collection and sampling in social media

Users of social media platforms generate myriads of data points every second, which raises the question of what data should even be collected for what purpose. The SMO provides guidance to address challenges of systematic data collection for valid social media research.

Foremost, we restrict ourselves to public data for technical reasons (only a fraction of the data generated by social media platforms is accessible) and ethical reasons (private communication and its traces can hardly be justified as a subject of research without consent). Moreover, we adhere to a concept of publicity that includes public interest and intend to investigate data with a high potential for representing and influencing the public discourse in Germany.

3.1 Sampling strategies

Technically, some platforms, such as Twitter through its streaming API endpoint, allow for very large quasi-random samples of user-generated content in real time (cf. Morstatter et al. 2013). However, access to such samples may not be free, they may be difficult to process due to their size, and they largely do not meet the relevance criterion of our collection. In contrast to random sampling, there are two ways to retrieve relevant data from social media platforms to study public discourses more systematically: account-based sampling and thematic sampling.

For account-based sampling, researchers define a population of entities, such as persons or organizations, within their research scope and search for their accounts on the targeted social media platform. In our data collection, we distinguish account lists based on two modes of creation: a) closed lists referring to pre-existing knowledge, and b) lists open to extension with new entities of interest. The curation of the first type is comparatively easy since we know our base population and merely have to find its digital representations on the targeted platforms. This approach is most useful for base populations with clearly defined entities (or affiliations), ideally with digitally accessible membership records, and groups of manageable size, such as members of a parliament or private broadcasters licensed by the federal media authorities. It becomes less useful for base populations with fuzzier definitions and less clear rules for membership (e.g., far-right extremists), or at larger scale (e.g., the climate movement). The second account list type is characterized by the circumstance that we do not know the entirety of our base population upfront, or that only a theoretical notion of that base population exists. To create such lists, we recommend exploring potentially relevant accounts and extending from there with one of four prototypical strategies:

  • Community linkage: Based on platform data such as retweet/forward or follower relations, a network of accounts can be created in which clusters of highly connected accounts can be detected that fit the theoretical relevance criterion and, thus, approximate the imagined base population.

  • Network metrics: In addition to the linkage, relevant accounts can be identified by network statistics such as node degree, k‑coreness, page rank, or clustering coefficients.

  • Account metadata: Metadata is also useful, such as posting in a targeted language, above a certain frequency, or with a minimum threshold of popularity. This last criterion, typically measured by the number of followers/subscribers, is useful not only for detecting influential accounts but also for arguing for the public character of their communication.

  • Content filter: A fourth selection strategy checks whether accounts talk about relevant issues associated with the base population. A simple but effective operationalization is to look for a (repeated) presence of certain keywords in their post content.

With all four strategies, researchers are required to follow clearly defined inclusion criteria. These need not rely on platform data only but can also include additional data. Depending on the selection strategy and the size of the base population, this research step can be time-consuming and error-prone. Errors result from missing existing accounts of relevant entities, from confusing real with fake accounts, or from falsely including accounts in the base population. However, carefully curated account lists are, precisely because of this effort, a valuable resource for research. Account lists have therefore become a vital subject of data resource publications (e.g., König et al. 2022) as well as of research data infrastructures (cf. HBI 2022). The hope is that sharing account collections across research teams avoids redundant work and improves data quality. By publishing its account collections of public speakers (cf. Schmidt et al. 2023b), the SMO contributes to this emerging sharing practice among social media researchers.
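The network-based strategies above can be sketched with common graph libraries. The following Python snippet, using networkx on hypothetical retweet relations, combines degree, k-coreness, and PageRank thresholds (account names and all threshold values are purely illustrative) to shortlist candidate accounts:

```python
import networkx as nx

# Hypothetical retweet relations: (retweeting account, retweeted account)
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"), ("c", "b"),
    ("d", "b"), ("d", "c"), ("e", "b"), ("f", "g"),
]
G = nx.DiGraph(edges)

# Network metrics as inclusion criteria
degree = dict(G.degree())             # in- plus out-degree
core = nx.core_number(nx.Graph(G))    # k-coreness on the undirected projection
pagerank = nx.pagerank(G)

# Shortlist accounts exceeding the illustrative thresholds
candidates = [
    n for n in G.nodes
    if degree[n] >= 2 and core[n] >= 2 and pagerank[n] > 1 / len(G)
]
print(sorted(candidates))
```

In practice, such automatically derived shortlists would only feed into the manual validation against the theoretically defined inclusion criteria described above.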

For thematic sampling, researchers need to define key terms to identify relevant content for a given research topic. These queries are then input to a search endpoint of a social media platform, such as the Twitter search API, which responds with matching content items. Key term queries can comprise single terms or lists of terms that represent a certain semantic concept or thematic field. Although less tedious than the compilation of account lists, the creation of valid key term queries is still an elaborate task, comparable to the development of valid dictionaries in content analysis (cf. Wiedemann 2013, p. 344 ff.). The goal of carefully curating a set of key terms is to maximize the number of true positives, i.e., relevant results, while keeping the number of false positives as low as possible and not missing relevant items, i.e., false negatives. This step can be facilitated by a process called query expansion, in which a researcher either manually or automatically identifies new key term candidates in the results of an initial query (cf. de Vries 2022, p. 433). Suggestions from automatic approaches need to be validated carefully, as automatically suggested terms tend to increase recall at the cost of significantly lower precision. Precision, however, is usually favored over recall when sampling social media data, to lower the risk of invalid observations from noisy data, while recall can hardly be assessed since the number of truly relevant items is usually unknown.
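A minimal sketch of query expansion, assuming invented post texts, seed terms, and a stopword list: terms co-occurring with the seed terms in the results of an initial query are counted as expansion candidates, which a researcher then validates before extending the query:

```python
from collections import Counter

# Hypothetical posts matched against an initial seed query
posts = [
    "the heating law debate continues in parliament",
    "new heating law rules for heat pumps announced",
    "heat pumps and insulation cut energy costs",
    "cute cat videos all day",
]
seed_terms = {"heating", "law"}
stopwords = {"the", "in", "for", "and", "all", "new"}

# Retrieve posts matching the seed query
matched = [p for p in posts if seed_terms & set(p.split())]

# Count co-occurring terms as expansion candidates for manual validation
candidates = Counter(
    tok
    for p in matched
    for tok in p.split()
    if tok not in seed_terms | stopwords
)
print(candidates.most_common(5))
```

Here, a candidate such as “pumps” would, once validated and added to the query, retrieve the third post that the seed terms alone had missed, illustrating the recall gain (and precision risk) of expansion.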

On some platforms, thematic collection for specific debates can be facilitated by the observation of hashtags (cf. Bruns et al. 2016). Using established hashtags as search queries has the major advantage that users themselves mark content relevant for a certain debate with a vocabulary that is controlled to some degree. At the same time, users participating in a specific social media debate are not required to use a certain hashtag to mark their contributions, and other users with an interest to promote unrelated content are incentivized to use popular hashtags (cf. Bruns and Burgess 2015). Thus, thematic collections solely relying on hashtags still need to check carefully for false negatives of missed debate contributions as well as for false positives such as hashtag spam.

Account-based and thematic collections can be extended with related content and metadata. To acquire more thematically relevant content, one interesting strategy is to make use of a platform’s conversational structure. Twitter, for example, allows users to quote or reply to tweets, which creates conversational structures (cf. Sousa et al. 2010) that are potentially about the same issue without requiring the same vocabulary. Twitter’s API can be used to retrieve such discussion threads. Another line of research deals with the geolocation of accounts to perform analyses across a geographical dimension (e.g., Nguyen et al. 2022 locate around one fifth of German Twitter accounts).

For the goal of long-term monitoring, account-based sampling poses a major challenge compared to thematic collections. While thematic collections only need to assess periodically whether the users’ vocabulary of key terms reflecting a certain debate has changed, actor-based lists require much more curation effort. Lists of parliamentarians change after every election (and, to a lesser extent, also during a legislative period), media organizations persistently create new formats or discontinue existing ones, and actor networks such as far-right extremists develop so dynamically that curated actor lists represent a valid base population only for a certain point in time. Thus, long-term monitoring requires systematic revision procedures to guarantee up-to-date lists, along with well-documented version control and dataset release cycles, a service that can be maintained much more easily by an infrastructure such as the SMO than by a typical short-lived research project.

3.2 Platform and data selection

The SMO performs both account-based and thematic collections on multiple social media platforms as well as from online news sites. We decided on platforms that a) appeared to be of special importance for public debates in Germany due to their size and audience, and b) were accessible for research. At the time of writing, these platforms are Facebook, Twitter, Instagram, and Telegram. These platforms provide three major data types for social media analysis: account data, network data, and content data.

Account data usually comprise usernames and IDs, demographic metadata, and statistical information. Dynamic metrics such as the numbers of posts or followers/subscribers, as well as interactions such as shares/retweets/forwards and reactions such as likes/favorites, can be monitored to study the activity and reputation of accounts over time. The analysis of such metrics can be realized straightforwardly with basic utilities of statistical software or a spreadsheet program.
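As one sketch of such a metric analysis, the following snippet computes the absolute follower growth per account over an observation window with pandas; the account names and numbers are invented:

```python
import pandas as pd

# Invented daily follower counts for two monitored accounts
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"] * 2),
    "account": ["alice"] * 3 + ["bob"] * 3,
    "followers": [100, 110, 125, 500, 498, 510],
})

# Absolute follower growth per account over the observation window
growth = (
    df.sort_values("date")
      .groupby("account")["followers"]
      .agg(lambda s: s.iloc[-1] - s.iloc[0])
)
print(growth)
```

The same computation could be done equally well in R or a spreadsheet; the point is that account metrics require no specialized tooling.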

For network data, networks of entities can be constructed from structural information observable on a social media platform. Entities are often user accounts, groups, or channels. Links represent relationships between two of them, such as followership, forwarding from one to another, or joint mentions in one or more posts. Entities could also be hashtags that occur jointly in posts, which can be used to visualize thematic networks of key terms in a debate. Acquiring large networks usually requires vast numbers of requests, such that tools for automated scraping and elaborate sampling strategies become necessary (cf. Münch et al. 2021b). Network data can best be analyzed with specialized tools such as Gephi (cf. Bastian et al. 2009) and with libraries available for scientific programming languages such as R or Python (e.g., igraph, networkx, graph-tool, or networkit).
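A co-hashtag network of the kind described above can be sketched as follows with networkx; the hashtag sets are hypothetical, and the resulting graph is exported in a format readable by Gephi:

```python
from itertools import combinations

import networkx as nx

# Hypothetical hashtag sets extracted from individual posts
posts = [
    {"climate", "energy"},
    {"climate", "protest"},
    {"climate", "energy", "protest"},
    {"football"},
]

# Link hashtags that occur jointly in a post; edge weight = co-occurrence count
G = nx.Graph()
for tags in posts:
    G.add_nodes_from(tags)  # keep hashtags that never co-occur
    for a, b in combinations(sorted(tags), 2):
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

# Export for visual exploration, e.g. in Gephi
nx.write_gexf(G, "cohashtags.gexf")
```

The same pattern applies to account networks, with follower or forwarding relations instead of joint hashtag occurrences as edges.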

For the analysis of content data, social media postings may include texts, URLs, and multimedia entities such as photos, videos, or audio files. The sheer size of the data, especially multimedia content, requires large amounts of resources for downloading, storage, and processing. For text collections above a certain size, the use of database technologies and full-text indexes becomes essential. Automatic content analysis is possible with a wide range of text mining and natural language processing methods (cf. Wiedemann and Fedtke 2021, pp. 381–382), many of which are implemented in convenient and mature packages for Python and R (see, for instance, Wiedemann 2022 for an overview of topic modeling in R). For image and video analysis, automatic approaches are actively developed in the field of machine learning. Pretrained models for object recognition or for gender, face, facial expression, or body pose detection allow for automatic categorization that may serve research interests in the communication and social sciences (cf. Araujo et al. 2020). For more specific recognition demands, researchers must train new models with their own training data (see Zhan and Pan 2019 for an example). Due to these technical challenges, and despite the success of platforms such as TikTok or YouTube, most content analyses are still based on textual data.

4 The technical architecture of the SMO

For the SMO, we are currently developing a technical setup based on open-source applications and self-programmed tools to perform three monitoring tasks on the selected social media platforms:

  • a long-term collection of activity metrics and content data of a continuously curated account list of public speakers,

  • a containerized service architecture to perform ad-hoc thematic collections of content data along with standardized analysis options, and

  • a set of tools and best practices to perform large social network surveys such as the state of the German Twittersphere at certain points in time.

This technical infrastructure can be differentiated into four major parts: (1) data collection, (2) storage and computing, (3) processing and analysis, and (4) visualization and export of data (Fig. 2). In the following, we briefly outline our current and planned solutions to these parts.

Fig. 2

SMO’s technical infrastructure (dashed lines indicate planned work)

Data collection requires access to platform data. Some platforms provide special academic access via APIs. An API provides a systematic, automatable way to access functions and data of another computer (program). Platform APIs allow the retrieval of some platform data, such as posts or user information, via well-defined queries. Since 2020, Twitter, for instance, provided access to its entire historic archive through its Academic API v2. Accredited researchers were able to download up to 10 million tweets per month, including interaction metrics (views, likes, retweets, replies). On June 23, 2023, Twitter shut down free access to its Academic API. Currently, the SMO team, like many other Twitter researchers, is looking for alternatives, either through paid API access,Footnote 12 data scraping,Footnote 13 or data donation.Footnote 14 There are also hopes that the European Union’s Digital Services Act (DSA) will enforce renewed academic access in 2024. Moreover, Twitter’s diminishing importance in favor of newer, more accessible platforms may eventually eliminate this problem.
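Most platform APIs deliver results in pages linked by continuation tokens; the retrieval loop can be sketched as follows, with `fetch_page` as a stand-in stub for a real, authenticated API call:

```python
def fetch_page(query, next_token=None):
    """Stand-in for a real API call; returns (posts, next_token).

    The query parameter is ignored here; a real client would send it
    along with authentication headers.
    """
    pages = {
        None: (["post1", "post2"], "tok1"),
        "tok1": (["post3"], None),  # last page: no continuation token
    }
    return pages[next_token]

def collect_all(query):
    """Follow continuation tokens until the API signals the last page."""
    posts, token = [], None
    while True:
        page, token = fetch_page(query, token)
        posts.extend(page)
        if token is None:
            return posts

print(collect_all("ukraine"))  # ['post1', 'post2', 'post3']
```

A production collector would additionally handle rate limits, retries, and persistence of raw responses, but the token-following loop stays the same.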

For a long time, Facebook, too, provided an unrestricted API to its data. In 2018, it was shut down almost completely after the Cambridge Analytica scandal. As an alternative, the SMO relies on the service CrowdTangle (CT) to access public profiles on Meta’s two major platforms, Facebook and Instagram.Footnote 15 CrowdTangle only provides access to public post data and is organized by lists of usernames. For instance, for a curated list of Facebook politician profiles, we can download their entire post history including user interaction metrics. Unfortunately, CT appears to be drying up under its new owner, Meta, as no research access applications have been approved since spring 2022. It seems likely that Meta will eventually shut down CT and replace it with another, potentially less useful and more restrictive access path for researchers.

As a fourth platform, we currently monitor public channels and group chats on Telegram. It does not provide specific researcher access, but most of its public data is accessible via a general API. Since this API lacks a full-text search function for content, researchers are limited to account-based collections. Accounts, or in Telegram’s terms, channels and groups, can be searched for via key terms in their names. Then, the complete history of posts in the selected channels and groups can be downloaded. Forwarding posts from other channels is a common way to spread information on Telegram. Thus, we can also iteratively sample networks of thematically related channels by looking through the message history of already known channels.
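This iterative sampling amounts to a breadth-first traversal of the forward graph; in the sketch below, the channel names and the `FORWARDS` lookup table are invented stand-ins for what would come from Telegram API calls:

```python
from collections import deque

# Hypothetical forward graph: channel -> channels it forwarded posts from
FORWARDS = {
    "seed_channel": ["news_a", "news_b"],
    "news_a": ["news_b", "fringe_c"],
    "news_b": [],
    "fringe_c": ["news_a"],
}

def snowball(seeds, max_channels=100):
    """Breadth-first discovery of channels via observed forwards."""
    known, queue = set(seeds), deque(seeds)
    while queue and len(known) < max_channels:
        channel = queue.popleft()
        for source in FORWARDS.get(channel, []):
            if source not in known:
                known.add(source)
                queue.append(source)
    return known

print(sorted(snowball(["seed_channel"])))
```

In practice, each lookup in `FORWARDS` would be a download of a channel’s message history, and a cap such as `max_channels` keeps the sample from drifting thematically.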

Our plans for the remaining first project phase include data collection for Mastodon and YouTube (also via a research API). Other important platforms such as TikTok have only recently opened dedicated academic access for researchers in the United States and Europe.Footnote 16 With the entry into force of the Digital Services Act (DSA) on November 16, 2022, more large platforms in Europe will be obliged to provide research access. We hope that they will fulfill this obligation to an extent that facilitates and standardizes data access.

Storage and computing for social media analytics have become challenging issues for large collections. As a rule of thumb, content collections of up to one million text posts can be processed on a performant notebook. But long-term account-based and thematic collections can easily grow to hundreds of millions of posts. Handling such big data collections requires storage capacity and computing power that exceed standard hardware. At the same time, resource demands are low at first and only grow over time, which calls for a flexibly scaling infrastructure. Therefore, we rely on cloud computing for larger projects. Cloud computing can be obtained comparatively easily by research institutions in Europe through the Open Clouds for Research Environments (OCRE) project.Footnote 17 OCRE provides a framework contract for the National Research and Education Network (NREN) member countries, allowing services such as Amazon Web Services (AWS), Google Cloud, or Telekom Cloud to be purchased without complicated procurement procedures. For the SMO, we decided to build our infrastructure on AWS, using EC2 instances as virtual servers to run our data collection processes and analysis scripts. Specific services, such as raw data requests from APIs and machine learning components, will be wrapped in Docker containersFootnote 18 to allow flexible integration into different cloud environments and analysis workflows. Raw data is stored in Elastic File System (EFS) volumes that grow automatically with the size of the data collection. Preprocessed data is pushed into a managed RDS PostgresFootnote 19 database on AWS. A managed column store (e.g., Google BigQuery) and vector databases (e.g., MilvusFootnote 20) for contextualized embeddings (cf. Wiedemann and Fedtke 2021, p. 379) will be added soon to improve our capabilities for big data analytics and semantic search.
In general, we try to rely on open-source software for cost reasons and to keep our setup reproducible for the scientific community. Up to now, most of our data collection and processing jobs are time-scheduled or started manually. One of our current main tasks is the automation, orchestration, and monitoring of our data processing pipelines with the workflow management framework Apache AirflowFootnote 21.
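For illustration, such a containerized setup could be wired together with a Docker Compose configuration; the service names and paths in this minimal sketch are invented and do not reflect our actual deployment:

```yaml
# Minimal sketch: one collector service writing into a Postgres database.
# Service names, build paths, and credentials handling are illustrative only.
services:
  telegram-collector:
    build: ./collectors/telegram   # hypothetical collector image
    environment:
      - DB_HOST=postgres
    depends_on:
      - postgres
  postgres:
    image: postgres:15
    volumes:
      - pgdata:/var/lib/postgresql/data   # data survives container restarts
volumes:
  pgdata:
```

The same container images can then be reused unchanged on a notebook, a self-hosted server, or a cloud environment, which is the main point of wrapping the services in containers.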

Processing and analysis of large social media datasets can hardly avoid tailored programming. Specific analysis requirements as well as distinctive features of individual data collections make adapted ways of data (pre‑)processing a necessary step for most research projects, which renders off-the-shelf programs less useful. Instead, development in computational social science is moving towards libraries (packages) for programming languages such as R or Python. These libraries provide specialized functionality to perform a certain analysis task and operate with common input and output formats such as tabular data frames. Researchers integrate them into their processing pipeline, i.e., a sequence of program scripts that converts raw input data into the targeted analysis outcome. Splitting complex analysis workflows into smaller packages with well-documented input-output interfaces allows for easy publication and sharing of scientific code and drastically facilitates its reuse. For our base infrastructure, we integrate machine learning-based Python components for topic modeling (e.g., with BERTopic, Grootendorst 2022) and text classification (e.g., with Flair NLP, Akbik et al. 2019). The former allows for a thematic clustering of our collected social media content data as well as an extraction of relevant keywords. For the latter, several pretrained models exist to perform, for instance, the detection of hate speech, offensive language, or sentiment. We also plan to integrate active learning-based classifier training and prediction processes for new, project-specific codebooks (cf. Wiedemann 2019, p. 149). Labels from automatic predictions are written into our databases as additional metadata dimensions for the analysis step.
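The pipeline idea of small steps sharing a common record format can be sketched as follows; the steps and records are toy examples, not our actual components:

```python
# A pipeline is a sequence of small, well-documented steps that share a
# common record format (here: a list of dicts, one dict per post).

def clean(posts):
    """Preprocessing step: normalize whitespace in the raw text."""
    return [{**p, "text": " ".join(p["text"].split())} for p in posts]

def label_length(posts):
    """Toy analysis step: add a derived metadata field per post."""
    return [{**p, "is_long": len(p["text"]) > 20} for p in posts]

def run_pipeline(posts, steps):
    """Apply each step in order; every step consumes and returns records."""
    for step in steps:
        posts = step(posts)
    return posts

raw = [{"id": 1, "text": "  short post "},
       {"id": 2, "text": "a considerably longer example post"}]
result = run_pipeline(raw, [clean, label_length])
print(result)
```

Because every step has the same input and output shape, steps can be published, swapped, or reused independently, which is exactly what makes the package-based division of labor attractive.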

Visualization and export of processed data is one outcome of our infrastructure. The database can be queried for metadata (e.g., time periods or user account sets), content (e.g., full-text queries on posts), network node and edge lists (e.g., subscription, forward, or mention networks), or results of machine learning processes (e.g., topics or offensive posts). Query results can be exported as CSV or JSON for further processing in subsequent project-specific analyses or in qualitative coding tools such as MAXQDA for mixed-method research designs. As a fast and up-to-date way to access basic statistics of the collected data, we also provide interactive dashboards. For this, we decided on the software Apache SupersetFootnote 22 (see Fig. 3 for an example), which allows for a flexibly configurable and appealing presentation of statistics and content samples in selectable time ranges directly from our databases. For projects that require more tailored solutions, we rely on R ShinyFootnote 23 and Plotly DashFootnote 24 as frameworks to program small, interactive visualization applications.
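A minimal sketch of the query-and-export step, using Python’s built-in sqlite3 as a stand-in for the Postgres database and invented sample posts:

```python
import csv
import io
import sqlite3

# In-memory sqlite3 as a stand-in for the project's Postgres database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts (id INTEGER, account TEXT, text TEXT)")
con.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                [(1, "alice", "on energy policy"),
                 (2, "bob", "weekend plans")])

# Content query (here: a simple substring match instead of a full-text index)
rows = con.execute(
    "SELECT id, account FROM posts WHERE text LIKE ?", ("%energy%",)
).fetchall()

# Export the result as CSV for downstream tools
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "account"])
writer.writerows(rows)
print(buf.getvalue())
```

The same pattern, a parameterized SQL query followed by a serializer, covers the CSV and JSON export paths into subsequent analysis or coding tools.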

5 The SMO as a DIY infrastructure

As outlined in the introduction, our understanding of infrastructure does not aim at developing, hosting, and maintaining a monolithic, centralized, technical setup for others to use. Instead, we want to enable researchers to replicate and adapt (parts of) our infrastructure by following our documentation, drawing on our experience for their decisions, consulting our suggestions for good practices, and picking useful items from our tool and data publications. In this section, we briefly report on our previous and planned outcomes and published resources that can be useful along the subsequent steps of a typical research workflow—project design, data collection, and data analysis.

For project designs working with digital trace data, researchers need to draw heavily from knowledge and experience created by peers and from external sources. To facilitate the entry into the process, we curate the SMO Wiki (Münch et al. 2021a), a collaborative knowledge base that supports researchers with an overview of options and tools for scraping, mining, and analyzing data on various social media platforms; with examples and good-practice advice for specific analysis steps; with (pointers to) training material for self-learning; and with advice for setting up computational development and analysis environments on different operating systems.Footnote 25 It can be edited by anyone with a GitHub account in its draft (‘edge’) version but is regularly reviewed to incorporate changes into a more validated (‘pretty’) version. Moreover, we developed a “Starter Workshop” training series, with its first workshop on Twitter held in 2022 and follow-ups planned for other platforms in the subsequent years.Footnote 26

To support researchers (including ourselves) in maneuvering the ethical and legal challenges of social media research, the SMO provides the “Social Media Research Assessment Template for Ethical Scholarship (SOCRATES): Your politely asking data ethics guide”Footnote 27 (cf. Rau et al. 2021). SOCRATES aims at giving practical advice on how to conduct legal and ethical research using social media data. It does so by asking researchers a range of questions addressing ethical and legal issues (see Sect. 3) which aim to make researchers aware of the different challenges connected to their research. At the same time, SOCRATES provides information and links to external resources to support the researchers in outlining how they intend to deal with these challenges.

For data collection, the SMO develops and maintains a range of Python scripts, packages, and CLI tools that facilitate the collection of (social) media data from various platforms. We publish these in a well-documented way to allow for reuse in various project settings. Currently, there are four tools that perform ongoing data collections given account ID lists as input: twacapic for tweets from the Twitter Academic API,Footnote 28 factli for Facebook and Instagram posts from CrowdTangle,Footnote 29 tegracli for posts from public Telegram channels and group chats,Footnote 30 and newsfeedback, a scraper for new articles (header and teaser info) from RSS feeds and homepages of online news sites.Footnote 31 As a tool to organize data flows from raw collections into databases and archives, we develop dabapush,Footnote 32 which can read the output of the aforementioned collection tools and push it to a configured output stream such as a Postgres database or a CSV file format.
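The reader/writer idea behind such a tool can be illustrated with a toy sketch; the interfaces below are hypothetical and do not reflect dabapush’s actual API:

```python
import json

def json_lines_reader(lines):
    """Toy reader: yield one record per JSON line."""
    for line in lines:
        yield json.loads(line)

class ListWriter:
    """Toy writer that collects records in memory.

    A real writer would target a configured output stream such as a
    Postgres table or a CSV file.
    """
    def __init__(self):
        self.records = []

    def persist(self, record):
        self.records.append(record)

def push(reader, writer):
    """Wire a reader to a writer: stream every record through."""
    for record in reader:
        writer.persist(record)

raw = ['{"id": 1, "text": "hello"}', '{"id": 2, "text": "world"}']
writer = ListWriter()
push(json_lines_reader(raw), writer)
print(len(writer.records))  # 2
```

Separating readers from writers is what makes such a tool configurable: any collector output format can be combined with any storage backend without changing the pipeline logic in between.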

For data analysis, the SMO publishes a range of curated datasets that can serve as input for various research projects. Our primary resource to assist account-based collections of social media data is the “Datenbank Öffentlicher Sprecher*innen” (“database of public speakers”, DBöS), a comprehensive set of actors with relevance for the public discourse in Germany. It consists of several well-defined types of individuals (politicians, journalists), media outlets (e.g., newspapers, news agencies, or broadcasting stations), and organizations (e.g., political parties, universities and research institutions, or bodies of the state and federal administration). It aims to collect up-to-date information on the basic population of each type (e.g., a list of all public service broadcasters and the private broadcasters licensed by the federal media authorities), as well as their accounts (if existing) on social media platforms. A snapshot of this database was published in June 2023 (Schmidt et al. 2023b) and will be updated regularly. In addition to the DBöS, the SMO collected and published a dataset of all 6211 candidates for the federal election 2021, including their accounts on Twitter and Facebook (Schmidt et al. 2022). A second account resource we are preparing to publish is a set of the most popular German Twitter accounts, determined by German language use and follower metrics. As an example of thematic collections, between February 2022 and June 2023 we collected all tweets mentioning “Ukraine” in the four languages German, English, Ukrainian, and Russian (Münch and Kessling 2022). Tweet IDs were published daily via GitHub to allow other researchers easy access to the data via “rehydration” with tools such as TwarcFootnote 33.
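Rehydration tools look up the full tweet objects by ID in batches (e.g., up to 100 IDs per request for Twitter’s lookup endpoint); a minimal sketch of the batching step, with generated stand-in IDs:

```python
def batched(ids, size=100):
    """Split a list of tweet IDs into API-sized batches."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# Stand-in for one day's published ID file (250 invented IDs)
tweet_ids = [str(n) for n in range(250)]
batches = batched(tweet_ids)
print(len(batches), [len(b) for b in batches])  # 3 [100, 100, 50]
```

Each batch would then be sent to the platform’s lookup endpoint, so sharing only IDs keeps the published dataset small and lets the platform’s deletion and privacy state apply at retrieval time.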

Planned for the remaining project years is the development and publication of dashboards that allow easy access to aggregated content data and account statistics, such as reputation metrics, for our daily collections. These dashboards provide basic information that can help answer various research questions. Figure 3 shows an example of our current Twitter dashboard with a filter for politicians’ accounts and the third quarter of 2022 as the selected time period. We can observe major topics among politicians, such as the war in Ukraine and related issues such as energy policy and financial relief measures for the population. We can also observe that far-right actors are among the most active on the platform, while one far-right and one far-left politician accumulate by far the most likes right after posting. Beyond datasets, we also plan to publish pretrained machine learning models for text classification to be used in our further analyses (e.g., argument stances and aspects as in Ruckdeschel and Wiedemann 2022). Complex analysis flows will be documented as Jupyter notebooksFootnote 34 or Quarto documentsFootnote 35 for others to reproduce and adapt.

Fig. 3

Screenshot of the SMO Twitter dashboard (subset: politicians, 3rd quarter 2022)

To show how our data collections and tools can serve to answer specific research questions, we also conduct research driven by topical issues, resulting in scientific publications. As a first outcome, Schmidt et al. (2023a) report on the prevalence and activity of Twitter and Facebook accounts among candidates of the German federal election 2021. Kessling et al. (2023) study the phenomenon of deleted tweets of those candidates before and after the election. Planned publications investigate the visibility of political actors on different platforms and the interactions with other users, such as likes or comments, that they trigger through their posting behavior. For instance, we want to learn which parties, actors, and topics spark the most emotional, uncivil, or controversial user discussions and whether these correlate with certain communication patterns. We further want to test to what extent far-right actors’ content gained more visibility, in terms of followers or views of their posts, on Twitter after Elon Musk took over the platform. The project NOTORIOUS,Footnote 36 which originated from the SMO, uses its data to research diffusion patterns of misinformation across platforms through linked accounts of celebrity actors present on multiple platforms. This variety of examples demonstrates that a curated base population of accounts such as the DBöS, combined with continuous data collection and a readily available, scalable computing infrastructure, can monitor the digital ecosystem of our everyday social media communication for a wide range of research questions.

6 Conclusion

The SMO is conceptualized as a DIY observatory that offers mix-and-match solutions to empower newcomers with knowledge and provide experts with the means to build their own research infrastructure as self-sufficiently as possible, enabling federation of access and resources and thereby allowing independent research. We try to build sustainable competence in networks instead of hoarding centralized ‘business secrets’ and data treasures. We understand our infrastructure as a hardware store rather than a furniture shop, a package manager rather than an app store, and a support offer that helps solve many problems rather than a solution for one specific issue.

Our strategy to achieve this has been described in the preceding sections: starting with the encouragement and support of ethical and legal scholarship with this data (Sect. 2); the curation and collection of base data from the largest and most relevant social media platforms of the German public sphere (Sect. 3); the exploration, development, and testing of tool-chains to collect, analyze and publish social media data (Sect. 4); and the transparent documentation of our infrastructure, distribution of data, code packages, and containers together with the sharing of best practices to enable others to reproduce and build on our work (Sect. 5). This way we hope to support social media researchers at every step of their research process, especially since research capabilities in the field continue to change, for example, due to the opening and closing of research APIs.

After three years of work, we regularly support projects within the RISC and beyond. We plan to consolidate our resources and streamline our processes to enable more data collections, public dashboards, and regular reports, and to sustainably archive the collected data and insights. This will also enable us to conduct topical social cohesion research within the SMO in the coming years, for instance on online language borders, extremism on fringe platforms, gender discourses in online media, or antisemitism in everyday online life. In line with our networked approach to infrastructure, we cordially invite interested researchers to collaborate with us—by following our work, raising issues, or contributing to our wikiFootnote 37 and code repositories.Footnote 38
