CLARIN-D: An IT-Based Research Infrastructure for the Humanities and Social Sciences

The paper discusses the idea of bridging the gap between computer sciences and the humanities by referring to an e-humanities infrastructure that provides tools and services for well-defined and frequently encountered tasks. The main goal of this infrastructure is to enable researchers in the humanities and social sciences to better exploit their potential by reusing available digital resources, and thus to increase the efficiency of e-humanities projects. CLARIN-D is an example of such a research infrastructure. The paper provides a brief overview of the basic principles and services of the CLARIN-D infrastructure, such as metadata harvesting, federated content search, and chaining Web services.


Introduction
To date, computer science and the humanities have taken different approaches to working methodologies, rather than focusing on the potential synergies. However, recent advances in digitizing historical texts, and the search and text-mining technologies for processing these data, indicate an area of overlap that bears great potential. For the humanities, the use of computer-based methods may lead to more efficient research (where possible) and raise new questions that could not have been dealt with otherwise. For computer science, turning to the humanities as an area of application may pose new problems that require rethinking the approaches hitherto favored by computer science. As a result, new solutions may develop that help to advance computer science in other areas of media-oriented application. At present, most of these solutions are restricted to individual projects and do not allow the digital humanities community to benefit from other advances in computer science, like service engineering. Hence, in this paper we attempt to spell out in detail the idea of an infrastructure for e-humanities. Focusing on the notion of reusability of data and algorithms such as morphological annotation and part-of-speech (POS) tagging, we sketch how a loosely coupled infrastructure based on Web services and a service-oriented architecture (SOA) can help the humanities to better exploit their potential by reusing available digital resources, and thus increase the efficiency of e-humanities projects. As an example, we present a rough overview of Common Language Resources and Technology Infrastructure D (CLARIN-D), a Web-based research infrastructure for the humanities and social sciences.

The Impact of Digitization in the Humanities: From Digital Humanities to E-Humanities
As applications of computer science have consistently replaced analog media and processes with digital ones, digital media and processing models are having an increasing impact on traditional workflows based on analog media in the humanities and social sciences. The interdisciplinary combination of methods from computer science and traditional humanities with large amounts of digital data and advanced tools for processing these is commonly known as e-humanities (cf. McCarty 2005). Although there is no standard definition of terms yet, e-humanities in a broader sense are concerned with the intersection of computing and the humanities in the eScience paradigm, and thus pertain to any digitized data that are subject to investigation in the humanities and the social sciences, such as text, images, and objects (e.g., in archeology). For the humanities, the use of computer-based methods may lead to more efficient research (where possible) and raise new questions that could not have been dealt with otherwise. For computer science, turning to the humanities as an area of application may pose new problems that lead to rethinking approaches hitherto favored by computer science. As a result, new solutions may develop that help to advance computer science in other areas of media-oriented application. By focusing on text as the main data type in the humanities, we can highlight the benefit that can be gained from the combination of digital document collections and new analysis tools from computer science, mainly derived from information retrieval and text mining. In this way, all disciplines that work with historical or present-day texts and documents are enabled to ask completely new questions and deal with text in a new manner.
These methods have an impact in the following ways:
• qualitative improvement of the digital sources (standardization of spelling and spelling correction, unambiguous identification of authors and sources, marking of quotes and references, temporal classification of texts, etc.);
• the quantity and structure of sources that can be processed (processing of very large amounts of text, structuring by time, place, authors, contents and topics, comments from colleagues and other editions, etc.);
• the kind and quality of the analysis (broad data-driven studies, strict bottom-up approach using text-mining tools, integration of community networking approaches, etc.).
At present, most of these solutions are restricted to individual projects and do not allow the scientific community in the e-humanities to benefit from advances in other areas of computer science. We therefore wish to distinguish between two important aspects of e-humanities:
1. creation, dissemination, and use of digital repositories;
2. computer-based analysis of digital repositories using advanced computational and algorithmic methods.
While the first was originally triggered by the humanities and is commonly known as digital humanities, the second implies a dominance of computational aspects and might thus be called computational humanities.
A practical consequence of this distinction in organizational terms would be to set up research groups in both scientific communities: computer science and the humanities. The degree of mutual understanding of research issues, technical feasibility, and scientific relevance of research results will be much higher in the area of overlap between computational and digital humanities than with any intersection between computer science and the humanities.
To empower the humanities to enter into a substantial and mutually beneficial dialog with computer science, however, a research infrastructure is needed that enables researchers in the e-humanities to reuse distributed digitized data and tools for their analysis as much as possible. To use such computational methods, an individual researcher can proceed by employing two strategies, depending on his or her own degree of computer literacy. One strategy is the individual software approach. Given a selection of digital text data, the research question is transferred into a set of issues and methods that can be dealt with by a number of individual programs. This approach allows for highly dynamic and individual development of research issues. It requires, however, a high degree of software engineering know-how. The other approach is to use standard software. For well-defined and frequently encountered tasks, an e-humanities infrastructure will offer solutions that provide the users with data and analysis tools that are well understood, have already delivered convincing results, and can be learned without too much effort (cf. Boehlke et al. 2013).
Both approaches are interdependent. Good solutions in one domain of the text-oriented humanities can probably be transferred to other domains simply by applying them to different kinds of text. A good infrastructure must be capable of making such solutions accessible as best practices.

CLARIN-D: An Infrastructure for Text-Oriented Humanities
Research infrastructures are concerned with the systematic and structured acquisition, generation, processing, administration, presentation, reuse, and publication of content. Content services make available the resources and programs needed for that. Public digital text and data resources are linked together and made accessible by common standards. New software architectures integrate digital resources and processing tools to develop new and better access to digital contents. CLARIN-D 1 is part of CLARIN Europe, which recently 2 became an independent legal entity according to the ERIC 3 statutes. CLARIN-D is primarily designed as a distributed, center-based project (cf. Wittenburg et al. 2010). This means that centers are at the heart of an infrastructure that aims at providing consistent data services. Different types of resource centers form the backbone of the infrastructure, provide access to data and metadata, and/or run infrastructure services. Access to data, metadata, and infrastructure services is usually (but not solely) based on Web services and Web applications. The protocols and formats of infrastructure services (like persistent identifiers or metadata systems and standards that are of interest to the CLARIN initiative on the European level) have been agreed upon in the preparatory phase of the project. Additional infrastructure or discipline-specific services are built upon those basic infrastructure services. The usage of general services like registering and resolving persistent identifiers is not limited to CLARIN itself. Other infrastructure initiatives can and do use such services. Important metadata on CLARIN centers (for example, technical access points, standards, and contact information) is stored in a centralized centers registry that acts as a starting point for service users and enables the automation of various procedures, such as monitoring and visualizing the state of all infrastructure services.

Metadata, Citation, and Search
In CLARIN, metadata is usually represented in a component metadata infrastructure (CMDI). 4 The underlying technology of CMDI is XML-Schema (components, profiles), XML (instances), and REST (component registry). CMDI addresses the problem of various specialized metadata standards used for specific purposes by different research communities. Instead of introducing yet another standard, CMDI aims at describing and reusing, and (when used in combination with ISOcat 5 ) interpreting and supporting the integration of existing metadata standards. CMDI components act as basic building blocks that define groups of field definitions. These components can be combined into profiles that define the syntax and semantics of a certain class of resources and act as blueprints for metadata instances describing items of this class. These components are managed in a component registry, which allows users to archive and share existing components, thus enabling their reuse (see Fig. 1). Through this approach, CMDI supports the free definition and usage of metadata standards dedicated to specific use cases. As long as metadata is stored in XML, CMDI is able to "embrace" other standards. By combining the data itself with semantic information stored in the ISOcat data-category registry, CMDI forms a solid basis for using sophisticated exploration and search algorithms.
Fig. 1 Components, profiles, and component registry
1 http://de.clarin.eu
2 http://ec.europa.eu/research/index.cfm?pg=newsalert&lg=en&year=2012&na=na-290212-1
3 http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=eric
4 https://www.clarin.eu/content/component-metadata
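The relationship between components, profiles, and the component registry described above can be illustrated with a small in-memory sketch. Real CMDI components and profiles are defined in XML Schema and served by a REST registry; the class names, field names, and example values below are hypothetical and only mirror the reuse idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A reusable building block: a named group of field definitions."""
    name: str
    fields: tuple


@dataclass(frozen=True)
class Profile:
    """A blueprint for metadata instances of one class of resources."""
    name: str
    components: tuple

    def field_names(self):
        # Qualify each field with its component name, e.g. "Creator.name".
        return [f"{c.name}.{f}" for c in self.components for f in c.fields]

    def missing_fields(self, instance):
        # Fields the profile expects but the instance does not provide.
        return [n for n in self.field_names() if n not in instance]


class ComponentRegistry:
    """Stores components by name so that several profiles can share them."""

    def __init__(self):
        self._components = {}

    def register(self, component):
        self._components[component.name] = component

    def get(self, name):
        return self._components[name]


registry = ComponentRegistry()
registry.register(Component("Creator", ("name", "organisation")))
registry.register(Component("Language", ("iso_code",)))

# Two hypothetical profiles reuse the same "Creator" component.
corpus_profile = Profile(
    "TextCorpus", (registry.get("Creator"), registry.get("Language"))
)
tool_profile = Profile("WebService", (registry.get("Creator"),))

# A metadata instance, reduced here to a flat mapping of field to value.
instance = {"Creator.name": "Example Center", "Language.iso_code": "de"}
print(corpus_profile.missing_fields(instance))
```

Because both profiles draw the "Creator" component from the registry, a change to its definition would be shared, which is the reuse effect the component registry is meant to provide.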
Metadata is the backbone of the infrastructure and is publicly available in CLARIN from the resource centers (cf. Boehlke et al. 2012) via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). 6 The openness of metadata is important to CLARIN since it guarantees high visibility of the provided resources in the research community.
OAI-PMH is a well-established standard and is supported by numerous repository systems like DSpace 7 and Fedora. 8 The protocol is based on REST and XML and serves two purposes: it offers full access to the metadata provided by the resource centers, and it allows for selective harvesting of metadata (see Fig. 2) for search portals like the Virtual Language Observatory (VLO). The VLO enables users to perform a faceted search on the metadata that was harvested from the repositories of all CLARIN centers. By using the information stored in the ISOcat data-category registry (cf. Kemps-Snijders et al. 2008) and the CMDI profiles (see Fig. 3) associated with the CMDI metadata instances, the VLO maps the information stored in these instances onto a predefined set of facets (see Fig. 4). The VLO also supports the extraction and usage of additional, CLARIN/CMDI-specific metadata.
CLARIN also provides support for content-based search. The CLARIN-D federated content search (FCS) 9 is based on Search/Retrieval via URL (SRU) and Contextual Query Language (CQL) and allows users to perform a CLARIN-wide search over all repositories that offer an FCS interface by using a simple Web application. This Web application and external applications send a request to an aggregator service. This service first queries a repository registry and searches for compatible interfaces. The initial query is then [...]
Web services in CLARIN are also described via CMDI (which may very well contain a link to a WSDL file). If more specific metadata is provided (i.e., the information enforced by a certain CMDI profile is given), these Web services can be used in a workflow system called WebLicht (cf. Hinrichs et al. 2010). WebLicht allows users to build and execute chains of Web services by analyzing the metadata available for each service and ensuring that the format of the data is compatible; that is, that the output of a predecessor service satisfies the specification of a successor service.
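Both the metadata harvesting and the content search described above are carried out through plain HTTP GET requests. The following minimal sketch builds such request URLs. The query parameter names follow the OAI-PMH and SRU 1.2 specifications, whereas the base URLs, the function names, and the metadata prefix value "cmdi" are assumptions made for illustration and would have to be replaced by the values a concrete center actually publishes.

```python
from urllib.parse import urlencode


def oai_pmh_list_records_url(base_url, metadata_prefix, set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    Passing set_spec and/or from_date restricts the request, which is
    what makes harvesting selective.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    if from_date is not None:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.2 searchRetrieve request URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoints; real ones would come from the centers registry.
harvest_url = oai_pmh_list_records_url(
    "https://repo.example.org/oai", "cmdi", from_date="2012-01-01"
)
search_url = sru_search_url("https://repo.example.org/sru", "Haus")

print(harvest_url)
print(search_url)
```

An aggregator in the sense of the text would, under these assumptions, issue one such SRU request per compatible repository and merge the responses.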
For interchanging natural language processing (NLP) data like text, there are several established standards defining how texts can be encoded and how annotations like POS tags may be added. WebLicht supports these standardization efforts. An interface definition of a Web service compatible with WebLicht could therefore state the following:
• the format used is TCF (or TEI 10 P5, etc.);
• the document contains German text and is annotated with POS tags;
• the POS tags are encoded according to the STTS 11 tagset.
A complete interface definition of a WebLicht Web service consists of two identically structured specifications, one for input and one for output. Each specification defines the format of the document that is used to represent the data. In addition, each specification contains a set of parameter-type/standard pairs: in the input specification, these pairs are mandatory to invoke the service; in the output specification, they are computed and added by the service. Each parameter type is thus bound to a standard that defines a standardized encoding of the information.
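The structure of such an interface definition can be written down as a small data model. This is only a sketch: the class names and the concrete parameter-type and standard labels (for example "tokens" and "tcf-tokens") are hypothetical and do not reproduce the actual WebLicht or CMDI vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specification:
    """One side (input or output) of an interface definition."""
    format: str            # document format, e.g. "TCF"
    parameters: frozenset  # set of (parameter_type, standard) pairs


@dataclass(frozen=True)
class ServiceInterface:
    """Two identically structured specifications for input and output."""
    name: str
    input: Specification
    output: Specification


# Hypothetical POS tagger: needs tokenized German text, adds STTS POS tags.
pos_tagger = ServiceInterface(
    name="pos-tagger",
    input=Specification(
        format="TCF",
        parameters=frozenset({("language", "de"), ("tokens", "tcf-tokens")}),
    ),
    output=Specification(
        format="TCF",
        parameters=frozenset({("postags", "STTS")}),
    ),
)

print(sorted(pos_tagger.input.parameters))
print(sorted(pos_tagger.output.parameters))
```

Keeping the pairs in a set makes the later compatibility check a simple subset test.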
Tables 1 and 2 give example input and output specifications of a POS tagger Web service. This service consumes documents that contain German text that was split into tokens encoded in an imaginary format. It produces a document of the same format by adding POS tags based on the STTS tagset. The chaining algorithm of WebLicht (cf. Boehlke 2010) is based on the idea that NLP services usually consume a document of a well-defined standard and will also return such a document. The successful invocation of a service for an input document hence depends on which information is available in that document. A POS tagger Web service may only work if sufficient information on sentence and token boundaries is available, while a named entity recognizer (NER) requires appropriate POS tags. Therefore, the standard used for the input document needs to allow for a representation of this kind of information, and, of course, this information needs to be present in the input document itself. This fact is also represented in the interface definition. Thus, for service chaining to work, it must be ensured that this information is available by using a type checker on each step of a chain.
This check can be done when building the chain, since all the necessary information is already available. Based on a formal Web service description according to the proposed structure, a chaining algorithm, which is basically a type checker, can be implemented. A service can be executed if the previous services in the chain meet the following constraints:
• the format specified in the output specification of the preceding service is equal to the format specified in the input specification of the service;
• every parameter-type/standard pair defined in the input specification is one of the pairs in the output specifications of services that have already been executed (or, when checking at build time, that are scheduled for execution earlier in the chain).
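The two constraints above can be turned into a short build-time check. The sketch below is not the WebLicht implementation; it is a simplified type checker under the assumptions that the initial input document is described like an output specification and that all names and labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Spec:
    format: str
    pairs: frozenset  # (parameter_type, standard) pairs


@dataclass(frozen=True)
class Service:
    name: str
    input: Spec
    output: Spec


def first_incompatible(document: Spec, chain: list) -> Optional[str]:
    """Return the name of the first service that cannot run, or None."""
    current_format = document.format
    available = set(document.pairs)
    for service in chain:
        # Constraint 1: the formats must match.
        if service.input.format != current_format:
            return service.name
        # Constraint 2: every required pair must already be available.
        if not service.input.pairs <= available:
            return service.name
        # The service output extends what later services may rely on.
        current_format = service.output.format
        available |= service.output.pairs
    return None


document = Spec("TCF", frozenset({("language", "de"), ("encoding", "UTF-8")}))
tokenizer = Service(
    "tokenizer",
    Spec("TCF", frozenset({("language", "de"), ("encoding", "UTF-8")})),
    Spec("TCF", frozenset({("sentences", "tcf-sentences"), ("tokens", "tcf-tokens")})),
)
pos_tagger = Service(
    "pos-tagger",
    Spec("TCF", frozenset({("language", "de"), ("tokens", "tcf-tokens")})),
    Spec("TCF", frozenset({("postags", "STTS")})),
)

print(first_incompatible(document, [tokenizer, pos_tagger]))
print(first_incompatible(document, [pos_tagger]))
```

The first call accepts the chain because the tokenizer supplies the token information the tagger requires; the second call reports the tagger, since token boundaries are missing from the initial document.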
These two constraints are of course a simplification. But in many simple cases, an algorithm like this will be sufficient. A short and simplified example of the chaining logic is given in Figs. 7 and 8, which show part of a chain consisting of Web services A (a tokenizer) and B (a POS tagger). In Fig. 7, Service A can be executed since all constraints defined in its input specification are met. The format of the input document is compatible and its content fulfills the requirements because it contains German text encoded in UTF-8. The tokenizer segments the text into sentences and tokens. After its execution, this information is added to the resulting output document. Service B is checked against this updated knowledge about the content of the output document of Service A (see current metadata in Fig. 8). Service B is compatible since all of its input requirements, format and parameters, are available in the output document of Service A.

Summary and Conclusion
Research infrastructures for the humanities can help to share digital resources and content services. In particular, they can help researchers in the digital humanities to save time and effort when developing software to deal with specific research issues, while the development of such infrastructures and their key software components is a software engineering task that increasingly poses interesting and challenging research problems for computer scientists. In this paper, we have presented the European Strategy Forum on Research Infrastructures (ESFRI) project CLARIN and some of its key elements as a research infrastructure for the humanities. In detail, we have presented component metadata infrastructure as a means for unifying metadata descriptions of linguistic resources in the humanities. Based on these metadata, we have also shown how Web services can be built that share data and algorithms in the research infrastructure. Both aspects are closely related: The content-driven use of digitized data and software tools in a specific application scenario in the humanities, and the software and service engineering issues relating to an efficient research infrastructure in the humanities. These two aspects, content and service, clearly need to complement each other in order to establish a culture of best practice in the e-humanities.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.