Once we accept the nature of data as socially constructed, we have a basis for understanding the challenges faced by current approaches to the creation of big data infrastructure for the study of history. This paper will discuss three of these in further detail. The first, the complexity of humanistic data, has long been recognised. To increase the scale of activity, however, this element will need to be revisited and potentially assigned a new position in the developer’s hierarchy of concerns. Second, we must develop more robust representational norms for the hidden data implicated by the contents of a digital system. To not do so is to go against some of the most deep-seated impulses of the historical researcher, and to undermine the utility of digital methodologies for knowledge creation. Finally, there are great gains to be made not only in deepening our understanding of humanistic research activities (as captured in participatory or user-centred design processes), but also in digging more deeply into the cognitive and social elements of the epistemics of historical and humanistic research. Only through such an investigation can both the use and the reuse of data become more strongly conceptualised and more widely applied.
3.1 Revisiting the Complexity of Humanistic Data
One of the foundational challenges of humanities research lies in the nature of its research objects: human beings, their languages, cultures and the records of their activities. Cultural signals (which, according to Manovich, constitute their own distinct level within new media alongside the computational) can be ambiguous and are often conflicting and self-contradictory. This is true even in ‘low context’ cultures, where a greater cultural permeability is facilitated by explicitness in the communication and day-to-day deployment of cultural norms and practices, as inscribed most visibly in language, but also in personal interactions, in religious practices, and in artistic production.
In order to transform culture into something recognisable as data, its elements – like all phenomena being reduced to data – have to be classified, divided, and filed into taxonomies and ontologies. Even at their best, these processes rely on the ability to turn human communication into a set of rules for transactions, rules that are very often overturned or made more complex by fine nuances of tone, gesture, or reference. The stereoscopic world must be rendered lenticular, the narratives must become data. But the historian remembers or records what she discards in creating her interpretation, or at least remains aware of what she discards. The computational system does not, or at least does not generally do so in a manner transparent to the user. This lack of transparency presents a dilemma to historians considering digital methods and tools, reducing the scholar’s mastery of the methodological vehicle by which data is turned into knowledge.
The tendency of technology is to turn its users into consumers rather than experts: for example, many of the most adept users of technical tools could not aspire to reconstructing the code behind them. But the black box is not an acceptable paradigm for research environments. A scholar needs to know when a result is underpinned by less robust algorithms or by smaller bases for the statistical modelling, leading to less reliable results. For example, in large-scale, multilingual environments (like Google Translate), variations in system reliability between languages and expressions are not communicated to the user. For historians to harness big data, the black boxes will need to become glass boxes – but how we present this richer contextual information in a user-friendly fashion remains a challenge.
Investigating competing theories and definitions of data will only take us so far, as will superficial observations of our users. The CENDARI project deployed a suite of four different measures over the course of the project’s active development to harvest and integrate historians’ perspectives into the system development: participatory design sessions, prototyping on research questions, a trusted user group and weekly testing cycles. Each of these mechanisms uncovered further layers of activity and requirement (including an early facilitated discussion to agree what was meant from different perspectives by the term ‘data’). This process revealed that to understand how and why the data processing functions of computer scientists and historians differ, we need to dig more deeply into those processes, but also to develop a more robust definition of the characteristics and qualities of data from a humanistic/cultural perspective as well as from a computational perspective. For example, provenance is a key concept for historians and collections management professionals: indeed, a source loses its authority utterly if its provenance is not clear. But in big data systems, provenance data is more likely to be looked upon as noise than signal. This is not to downplay the good work of teams like the W3C provenance working group, which has established a solid model for the representation of provenance. It is merely to say that modelling of uncertainty and complexity under these protocols would be labour intensive at best, and impossibly convoluted at worst: in particular as the standard itself is not designed to model uncertainty (though possible extensions to make this possible have been proposed). To give an example, let us consider the collection of Roger Casement’s papers held in the County Clare archives in Ireland. Here is an excerpt from the description of the papers (already an anomaly among more traditional archival fonds):
Personal papers relating to the Irish patriot, Roger Casement were kept under lock and key in Clare County Council’s stores since the late 1960s. The papers were presented to the council by the late Ignatius M. Houlihan in July 1969. The Ennis solicitor had received them as a gift from “a member of one of the noble families of Europe.” …The papers, mainly letters, cover the last two years of Casement’s life before he was executed by the British for his role in smuggling arms into Ireland for the 1916 rising. The last letter on file is one from Casement, dated April 4, 1916, just 11 days before his departure for Ireland on a German U-boat, which landed him at Banna Strand in Co. Kerry on Good Friday, 1916.
“I came across the papers during an inventory of the council’s archives. At first, I did a double take, I wasn’t expecting something so exciting. I instantly recognised the value of them and their importance for Clare and I was anxious to make them accessible as soon as possible,” explained Ms. [Roisin] Berry [archivist]. “They date from Casement’s arrival in Germany in 1914 to the very month he leaves Germany in 1916 on the U-19 bound for Ireland. The documents address a range of different subjects including the enlisting of Irishmen in the First World War, the appointment of an envoy from England to the Vatican, the Findlay affair, the work of Fr. Crotty in German prison camps, writing articles for the press, keeping a diary and the desire for peace.”
This excerpt (and it is only an excerpt) brings out a number of highly interesting examples of the potential complexity of historical sources. No fewer than three previous owners of the papers are referenced (one of whom is known only by his or her status as a member of the aristocracy). Their place in Casement’s life (and indeed his own place in Irish history) is explained, chronologically and in terms of his thematic interests. The material status of the collection is given, including the fact that it consists of ‘mainly’ (but not exclusively?) letters. A surprising anecdote is relayed regarding how the archive came to realise they held such a significant collection, which illustrates how the largely tacit knowledge of the archivist enabled their discovery and initial interpretation. This example is not an exceptional one. How is this level of uncertainty, irregularity and richness to be captured and integrated, without hiding it ‘like with like’ alongside archival runs with much less convoluted narratives of discovery? Who is to say what in this account is ‘signal’ and what ‘noise’? Who can judge what critical pieces of information are still missing? These are perhaps more questions of “documentation” than “cataloguing” (to borrow Suzanne Briet’s canonical distinction between the two), but while Briet proposed that documentation approaches could be differentiated according to each discipline, the granularity she was proposing was far less detailed than anything that would be required for historical enquiry. Indeed, the focus of the documentation required would vary not only for each historian, but quite likely also according to each of their individual research questions – a consequence of historians’ research and epistemic processes that greatly raises the bar for description within digital resources.
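To make the difficulty concrete, the following is a minimal sketch (in plain Python; all type and field names are hypothetical, not drawn from any standard) of how the custody chain implied by the Casement description might be recorded as structured data, with an explicit certainty marker per link – precisely the kind of field that generic provenance models do not carry natively:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CustodyEvent:
    """One link in a provenance chain, with an explicit certainty marker."""
    holder: str              # who held the papers
    acquired: Optional[str]  # when custody began, if known
    source_note: str         # free-text evidence for this link
    certain: bool            # False where the record is anecdotal or anonymous

# The custody chain implied by the archival description of the Casement papers:
chain = [
    CustodyEvent("a member of one of the noble families of Europe",
                 None, "identified only by social status", certain=False),
    CustodyEvent("Ignatius M. Houlihan, Ennis solicitor",
                 None, "received the papers as a gift", certain=True),
    CustodyEvent("Clare County Council",
                 "July 1969", "presented by Houlihan; kept under lock and key",
                 certain=True),
]

# For a historian, one uncertain link weakens the authority of the whole chain;
# a data model without a certainty field would silently flatten this nuance.
authoritative = all(event.certain for event in chain)
```

Even this toy model shows the trade-off: the narrative richness of the archivist’s account (the double take, the anonymous aristocrat) survives only as free-text notes, while the single boolean `certain` is a crude proxy for a spectrum of evidentiary confidence.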
Unfortunately, another key aspect of what historians seek in their data is completeness. In spite of the often fragmentary nature of analogue sources, digital sources are held by them to a higher standard, and expected to include all relevant material. This expectation has been tested again and again, and the same insight emerges: “Researchers are wary of digital resources that are either incomplete or highly-selective.” “One concern of humanities users … is the extent of the resource: whether the whole of the physical collection is digitized or not.” “Two key concerns for digital archives in general…are the desire to be: authoritative and of known quality [and] complete, or at least sampled in a well-controlled and well-documented manner.” This perception results from a somewhat outdated paradigm of the digital resource (that its only value is in the access it provides), and places a particular burden on resource creators given the often hidden nature of many sources (discussed below).
A further key issue in the ecosystem is the relationship between metadata and the objects it represents, as well as its changing place in the research process: as reminders from a pre-digital age of physical catalogues; as the most common data to be found in digital systems of cultural data; as research objects that are seldom the focus of modern historical research in themselves; as structured data of a sort that is easy to aggregate; as a draw on the resources of the institutions that must create it; and as marks of human interpretation and occasional error. In the words of Johanna Drucker: “Arguably, few other textual forms will have greater impact on the way we read, receive, search, access, use, and engage with the primary materials of humanities studies than the metadata structures that organize and present that knowledge in digital form.” We will also, however, need to look into how emerging computational approaches, such as ultra large system approaches and deep learning, may be disrupting the need for the production of such metadata, removing the human investment and replacing it with a proxy that may or may not serve quite the same function.
3.2 Dealing with ‘hidden’ Data
According to the 2013 ENUMERATE Core 2 survey, only 17 % of the analogue collections of European heritage institutions had at that time been digitised. Although great progress was expected by the respondent institutions in the near future, this number actually represents a decrease from the findings of their 2012 survey (almost 20 %). The survey also reached only a limited number of respondents: 1400 institutions across 29 countries, a sample that surely captures the major national institutions but not local or specialised ones. Although the ENUMERATE Core 2 report does not break down these results by country, one also has to imagine that there would be large gaps in the availability of data from some countries compared to others (an assumption borne out by the experiences of research infrastructure projects).
Is this something that historians are unaware of? Of course not. Does it have the potential to affect the range of research questions that are proposed and pursued by modern historians? Absolutely. Modern historians often pride themselves on being “source-led” and characterise the process by which they define research questions as one of finding a “gap” in the current research landscape. Because digital sources are more readily accessible, and can be browsed speculatively without the investment of travel to the source, they have the potential to lead (as the ‘grand narratives’ of history once did before them) or at least incentivise certain kinds of research based on certain kinds of collections. The threat is high that our narratives of history and identity will thin out, coming to rest on only the most visible sources, places and narratives. Source material that has not been digitised, and indeed may not even be represented in an openly accessible catalogue, remains ‘hidden’ from potential users. This may have always been the case, as there have always been inaccessible collections, but in a digital world, the stakes and the perceptions are changing. The fact that so much material is available online, and in particular that an increasing proportion of the most well-used and well-financed cultural collections are, means that the novice user of these collections will likely focus on what is visible, an allocation of attention that may crystallise into a tacit assumption that what cannot be found does not exist. In the analogue age, this was less likely to happen, as collections would be available only as objects physically contextualised with their complements: their materiality would speak of the scale of collections, and extension into less well-trodden territory would require only an incremental increase in time or insight, rather than a potentially wasted research journey.
Sources are not only hidden from the aggregated, online view because they have not been digitised, however. Increasingly, users are becoming frustrated with digital silos. The current paradigm is not that a user visits a number of news or information sites, but that he channels his content through an intermediary, such as Facebook or Twitter. The increase in the use of APIs and other technologies (including personalisation and adaptation algorithms) evidences this preference. Cultural heritage institutions (CHIs) have adapted to this paradigm shift by establishing their own curated spaces within these channels, but in spite of this ‘pushing out’ response, the vast majority of their data cannot yet be ‘pulled in’ by developers wanting to feature cultural content. The biggest exception to this rule in Europe is Europeana, which has a very popular API and makes the metadata it delivers available under an open CC0 reuse license. Most national, regional or local institutions hesitate to do the same, however, in part because of technical or resource barriers, but also to a great extent because they do not trust the intermediaries and reuse paradigms that are emerging. These institutions have developed over centuries to protect the provenance of items in their care, and to prevent their destruction or misuse. Not enough is known about how the digital age impacts upon this mission, and whether the hesitation to release data into shared platforms is merely risk-aversion, or whether it can tell us something critical about our current conceptions of data, and our current data sharing environment. This is not an issue of copyright: it is one of trust and social contracts. It is also not an issue of putting all historical material online, or even indeed of ensuring it all is digitised: it is a challenge of ensuring that data can be used outside of the silos that were designed to hold them, and that what is not online can be clearly signposted alongside cognate collections.
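To make the ‘pull in’ paradigm concrete, here is a minimal sketch of how a developer would address the Europeana Search API. The endpoint and parameter names follow Europeana’s public documentation (though the host and path have changed over time, so the current documentation should be checked); the API key is a placeholder, and only the request URL is constructed – no request is sent:

```python
from urllib.parse import urlencode

# Europeana Search API endpoint (v2). The host/path has varied across API
# versions; verify against the current Europeana developer documentation.
BASE_URL = "https://api.europeana.eu/record/v2/search.json"

def europeana_search_url(query: str, api_key: str, rows: int = 12) -> str:
    """Build a Search API request URL; no network request is made here.
    Metadata returned by this API is available under the CC0 waiver."""
    params = {"wskey": api_key, "query": query, "rows": rows}
    return BASE_URL + "?" + urlencode(params)

# A hypothetical query a developer might issue against the aggregated metadata:
url = europeana_search_url("Roger Casement letters", "YOUR_API_KEY")
```

The simplicity of this call is precisely the point: a single, openly licensed endpoint lets cultural content flow into third-party applications, whereas most institutional silos offer no equivalent.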
Complex as they may be, these particular problems must be solved before transnational digital approaches to the study of the modern era can become possible, let alone widespread.
The following excerpt from one of the CENDARI project user scenarios (documented in the project’s Domain Use Cases report) provides an illustration of the challenges a transnational research question can pose in a dispersed source landscape based upon national silos.
My project examines how the rural-urban divide shaped Habsburg Austrian society’s experience of the war from about 1915 (when food and food shortages became increasingly politicized) and to what extent that divide shaped the course of the Habsburg Monarchy’s political dissolution in the fall of 1918. I will focus on provinces with large multiethnic urban centers that experienced food crises: Lower Austria (Vienna), Bohemia (Prague), Moravia (Brno), the Littoral (Trieste), and Galicia (Krakow). … transcended the urban-rural divide—also grew sharper over the course of the war. I want to answer the following questions: How did the administration and realities of rationing vary between cities on the one hand, and between urban centers and the rural areas of their provinces on the other? How did food protests—and other grassroots demonstrations without party-political leadership—vary between these selected provincial capitals and within their largely rural provinces? To what extent were protesters’ grievances cast in terms of urban-rural divides or in terms of other fault lines and antagonisms? How did inhabitants of these cities and their rural hinterlands experience and perceive the political dissolution of the monarchy in different ways, i.e. in terms of expectations and demands? To what extent did successor states—Austria, Czechoslovakia, Poland, Yugoslavia, and Italy—overcome, institutionalize, or exacerbate rural-urban divides?
This researcher’s work covers four current national systems and at least as many languages. Because the work encompasses rural and urban contexts, it is likely that some of the required source material will be held in smaller regional or local archives (which usually have far inferior infrastructure to their flagship national equivalents). The work is looking at events, perceptions and interpretations that may not have been captured in the official records, and which indeed may only be measurable through proxy data or personal accounts. Even among the successor states listed, two have since dissolved. This scholar is setting out on a rich transnational research trajectory, to be sure, but the formal finding aids will offer very little support for wayfinding or knowledge creation, and there is very little this individual will be able to do to progress such an ambitious project within the current landscape of digital resources, where countries such as Hungary are particularly poorly represented, in spite of the centrality of the Austro-Hungarian empire’s legacy for understanding the development of European structures and identities after that empire’s fall.
3.3 Knowledge Organisation and Epistemics of Data
The nature of humanities data is such that even within the digital humanities, where research processes are better optimised toward the sharing of digital data, sharing of ‘raw data’ remains the exception rather than the norm.
There are a number of reasons for this. First of all, in many cases, ownership of the underlying input data used by humanists is unclear, and therefore the question of what can be shared or reused is one that the individual researcher cannot independently answer. There are deeper issues, however, based in the nature of the epistemic processes of the humanities, that act as further barriers to the reuse of humanities data. Very little research exists on this topic to date, although barriers to the reuse of digital humanities projects do provide an interesting baseline for starting an investigation. For example, the Log Analysis of Digital Resources in the Arts and Humanities (or LAIRAH) project pointed toward a number of key issues leading to a lack of reuse of digital data prepared by research projects. In particular, the lack of an early conceptualisation of who the future user of the data might be, and how they might use it, was a key deterrent to future use. While this lack may be seen as a weakness from a reuse standpoint, it is likely that the organisation of data or the curation of resources chosen in such projects was driven by the research questions in the mind of the original researcher, and that this organisational model was key to their epistemic process. As the yet-to-be-published results of a research project at Trinity College Dublin have demonstrated, the ‘instrumentation’ of the humanities researcher consists of a dense web of primary, secondary and methodological or theoretical inputs, which the researcher traverses and recombines to create knowledge. This synthetic approach makes the nature of the data, even at its ‘raw’ stage, quite hybrid, and already marked by the curatorial impulse that is preparing it to contribute to insight.