In 2017, Clifford Lynch published a broad treatise on the challenges that algorithmic-intensive systems pose for archivists and the near futures of digital preservation in networked environments. In this vibrant call to action, Lynch makes a point of connecting the preservation mandates of traditional professional archival duties with the haziness of what is to come for stewardship in a society underwritten by collections of data and driven by an impulse to collect:
“If archivists will not create, capture, curate the “Age of Algorithms,” then we must quickly figure out who will undertake this task, and how to get the fruits of their work into the custody and safety of our memory organizations for long-term preservation. Traditional archivists seem most comfortable dealing with the outcomes of the work of various types of documenters, rather than creating the testimony: this is a professional constraint that needs to be explicitly recognized, considered, and if appropriate clarified and affirmed if it is to be the case going forward” (Lynch 2017).
Application programming interfaces (hereafter, APIs) are both gateways to data and, in their own right, artifacts of our algorithmically driven information age. They are technologies of custody that enable the extraction of, and access to, data. Our connected world is increasingly saturated with networked mobile computing devices that allow platform users to create data at tremendous rates that drive new markets, technologies, and social structures. Information policy, privacy, and law scholars have examined the functional sovereignty that these platforms now exert over society, democracy, and economies of scale by collecting personal data, providing access for third parties, and repurposing it for algorithms, personalization, and advertising technology (Pasquale 2016). However, most of this work is concerned with present-day access to data collections and its near-term implications; it does not concern long-term preservation contexts or future archives of data extracted from platforms with APIs. As archival scholars concerned with the future of preservation and digital cultural memory, we worry that platform APIs, as access points and technologies of custody, will significantly affect our ability to preserve documentation from the algorithmic age and, as Lynch has described, to create testimony about those impacts, because they impose prohibitive controls and constraints on long-term access to human activity data.
Since the early days of social media platforms, extracting data for secondary reuse, research, and auditing has been difficult. Social activity streams, as they are now called, are difficult to capture because of their content, context, and form (“Activity streams: specifications” 2019; Snell and Prodromou 2017a). As their name suggests, these streams of social media are not static documents. Platforms frequently update social activity streams with features and algorithms, and the streams accumulate new elements of data containing many layers of context, multimedia objects, and engagement metadata. Often these social activity streams, in whole or in part, remain locked into dynamic platforms whose technical and legal gateways determine what can be extracted. In the early 2010s, social platforms themselves identified this extraction “problem” as an opportunity to create a new marketplace in third-party access to social media data, primarily by providing access to user data (and metadata) through a range of APIs (John and Nissenbaum 2019).
APIs give third parties (known as “developers”) the ability to query and gain access to portions of the social activity streams that users generate in using and experiencing social platforms. APIs specify the rules by which pieces of software talk to each other, articulating which elements can be queried, how frequently, and how the results appear. Different APIs allow for purpose-driven access and extraction. For example, a social network API would allow a dating app to use information about user profiles to match people for dates. A content publishing API would allow a newspaper to promote breaking news articles on a social network’s newsfeed, pushing content directly to specific users. Or an advertising API could allow a small business to promote its new menu on a restaurant rating and review platform. While APIs are often hidden from or unknown to social media users themselves, they are part of the software development ecology of the social network infrastructures that drive our experiences of numerous platforms and apps.
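To make the three kinds of rules concrete (which elements can be queried, how frequently, and how results appear), the following is a minimal sketch of an API acting as a gatekeeper. It is a toy, in-memory illustration and not any real platform's interface: the class `ToyPlatformAPI`, the record `POST`, and field names such as `ranking_score` are all hypothetical assumptions introduced here.

```python
from dataclasses import dataclass

# Hypothetical record held by the platform. It contains more context
# than the API will ever expose to a developer.
POST = {
    "id": "1",
    "text": "hello",
    "author": "ana",
    "reply_to": None,
    "ranking_score": 0.87,  # internal platform context (assumed field)
}


@dataclass
class ToyPlatformAPI:
    """Toy gatekeeper: fixes what can be asked, how often, and in what shape."""

    queryable_fields: frozenset = frozenset({"id", "text", "author"})
    max_calls_per_window: int = 2
    calls_made: int = 0

    def get_post(self, fields):
        # Rule 1: how frequently. Every attempt counts against the quota.
        if self.calls_made >= self.max_calls_per_window:
            return {"status": 429, "error": "rate limit exceeded"}
        self.calls_made += 1

        # Rule 2: which elements can be queried.
        denied = sorted(set(fields) - self.queryable_fields)
        if denied:
            return {"status": 403, "error": f"fields not queryable: {denied}"}

        # Rule 3: how results appear (a fixed response envelope).
        return {"status": 200, "data": {name: POST[name] for name in fields}}


api = ToyPlatformAPI()
first = api.get_post(["id", "text"])     # allowed fields, within quota
second = api.get_post(["ranking_score"]) # field the platform does not expose
third = api.get_post(["id"])             # quota for this window is used up
```

In this sketch the developer never sees `reply_to` or `ranking_score`, and a third request in the same window is refused regardless of who is asking or why, which mirrors the uniform treatment of API users discussed in this paper.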
Presently, access to much of the social media data from platforms is governed through this API-driven gatekeeping model, set up by platform owners to establish the rules by which all third parties accessing data must abide (Bruns 2019). These models, while unique to each platform, are rigid in their permissions and restrictions on API users, failing to distinguish between different kinds of data brokers, such as developers who are researchers, journalists, or digital archivists. Not all developers have the same reasons or motivations for using APIs and accessing social media data, and not all API users are concerned with the long-term authentication, fixity, or reproducibility of results. The one-size-fits-all approach to developer access has become increasingly problematic for social platforms when providing access, measuring impact, and enforcing governance over developers’ collections from APIs (Acker and Donovan 2019; Driscoll and Walker 2014). A series of gaps exists between the range of users that APIs are intended to serve, the account holders or creators who generate the data accessed through APIs, and third-party data brokers who use API-extracted data in agreement with the platform’s aim to keep users engaged. Indeed, the problem of extracted social media data collections is not just a concern for researchers and scholarly institutional repositories. As intermediaries between many kinds of users with different motivations, social platforms themselves grapple with issues of control over the records that users create, managing user data archives and the types of users who leverage them for data access (Glassman 2019).
Social platform APIs from Twitter, Facebook, Instagram, and YouTube have changed not only the experience of the web for users who use and create social media, but also research methods in computational social science that allow researchers to create new models of instrumentation to gather social media data (Bruns 2013; Bruns and Weller 2014; Hargittai and Sandvig 2015). Just as platforms have changed instrumentation and research methods for scholars studying behavior online, they have affected our ability to collect and access research data (Freelon 2018). Thus, APIs also create new preservation and access issues for digital archivists, research data managers, scholarly communication repositories, and digital curation initiatives. The ascendancy of API access over other data extraction techniques such as web scraping has led to new models of digital collection, accessioning, and preservation in research archives and web archiving (Littman et al. 2016).
Digital preservation techniques, policy, and social and technical constraints have always been shaped by our ability to access information and bring it into custody, as with proprietary formats or unique playback hardware (e.g., Faniel and Yakel 2011; Hedstrom and Lampe 2001; Rimkus et al. 2014). Platforms depend on large collections of aggregated user data, yet the current API access regime that governs access to and reuse of these collections is not typically managed according to, or oriented toward, archival principles (Glassman 2019). Few preservation perspectives or non-commercial use cases have been subsumed into the development of platforms as part of their market strategies as for-profit corporations (Conway 2010; John and Nissenbaum 2019). Despite these preservation and access barriers, as collections of social media data grow (internally or through extraction), platforms are rapidly becoming sites of post-custodial archives. In post-custodial theory of archives, the power of the archivist over records is ceded to the contexts where records originate and circulate, thereby reorienting the role of traditional archival institutions to that of a “‘participant’ in a (re)allocation of power very much [sic] at odds with traditional praxis” (Kelleher 2017, p. 23). Archival scholars have argued that post-custodial approaches democratize custody, power, and agency over archives by prioritizing context, allowing creators self-determination through the archival autonomy of asserting provenance, ownership, context, and function of records creation (Cook 1994; Evans et al. 2015). But platforms, as data enablers and intermediaries that provide access through APIs, can severely limit the archival autonomy of creators, researchers, digital stewards, and archivists by stripping context from activity streams as they provide access to platform data.
Social media data collections from platform APIs are a new post-custodial challenge for archivists because they result in a proliferation of decontextualized social activity streams as records and evidence of user data (Walker 2017). Digital archivists have commented on the drawbacks: context from the platform itself is lost, data can be disconnected from the communities that created it, machine-readable structured text is favored over dynamic or visual content, and collectors over-rely on platforms to shape scope and acquisition through developers’ APIs (Jules 2018; Littman 2019; Summers 2019). Another challenge is that while APIs present a means of access, there may be many different types of API users with different motivations for access who are all governed by the same terms of service.
Moreover, the design of social platform APIs follows changes that the platforms themselves make to serve users who create content, and it has knock-on effects that spread into different areas of data access and reuse. While researchers, journalists, and stewards can gain access to platform data through APIs, the conditions of access strip context in important technical, social, and moral ways. And so, the lack of accurate reproducibility is not just a problem for researchers using developers’ APIs; it poses a threat to cultural memory stewards, such as archives and libraries, aiming to capture accurate slices of social media experiences from individuals, communities and filial groups, and society more broadly. Archivists, stewards, and repositories that collect social media data archives using APIs (as many now do) then regain some control as custodians of extracted social media data, but lose valuable context that the platform contains, arguably the rich context typically called for by post-custodial orientations. And further, the mechanism by which these data are harvested and collected, the API, is itself constantly changing, and little is known about the long-term impact and unintended consequences of API rollbacks, updates, or their ever-changing terms of service. If the extraction and decontextualization of social media data are one kind of technical challenge, the developers’ API that allows “one-size-fits-all” access remains another, larger ontological challenge for archivists who make use of APIs to assemble these collections.
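The reproducibility problem can be illustrated with a fixity check, a standard digital preservation technique in which a cryptographic digest is recorded for an object and recomputed later. The following is a minimal sketch, assuming extracted records arrive as JSON-like objects; the function `fixity_digest`, the sample records, and the field `like_count` are hypothetical and are not drawn from any platform's documentation.

```python
import hashlib
import json


def fixity_digest(record: dict) -> str:
    """Return a SHA-256 digest over a canonical JSON form of the record."""
    # Sorting keys and fixing separators makes the digest independent of
    # the order in which an API happens to return fields.
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same hypothetical post harvested twice. Between harvests the
# engagement metadata changed, as it does in a live activity stream.
first_harvest = {"id": "1", "text": "hello", "like_count": 3}
second_harvest = {"like_count": 4, "text": "hello", "id": "1"}

# A copy of the first harvest with fields in a different order.
reordered_copy = {"like_count": 3, "text": "hello", "id": "1"}

unchanged = fixity_digest(first_harvest) == fixity_digest(reordered_copy)
drifted = fixity_digest(first_harvest) != fixity_digest(second_harvest)
```

Here `unchanged` is true because only field order differs, while `drifted` is true because the record itself changed on the platform between harvests. A digest can show that two extractions differ, but it cannot say which one is the authentic record or restore the platform context that extraction removed.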
In this paper, we examine features of social media data from platforms and discuss their long-term preservation consequences by focusing on the current landscape of data access through APIs. We begin the next section by introducing the preservation problems that occur with user data extracted from developers’ APIs, and how these fit with existing models of archives and digital repository development. Then, we define and analyze the range of possible users concerned with extracting social media data from platforms. We make a distinction between platform users (such as account holders and creators of content) and API users, who may also be platform users but who hold API keys to make use of extracted data as “developers.” This is because users who have social media platform accounts and developers who use platform APIs have separate terms of service, rights, and responsibilities when using social media platforms and developer APIs. Then, we discuss how platforms govern possibilities for access, and how the current access regime promotes persistent problems: difficulty stewarding personally identifying information, no guarantee of the reproducibility or fixity of content, and incredible amounts of energy use and resource consumption because of bottleneck redundancies. We finish by surveying early models for access to social media data archives, including community-driven not-for-profit archives, university research repositories, and early industry–academic partnerships, primarily in the USA. We argue for applying a platform perspective in exploring the rich problem space that social platforms and their APIs present for efforts to collect social media data archives and manage them over the long term as digital cultural memory artifacts.