This section introduces GATE Teamware, an open-source, web-based collaborative text annotation and curation environment designed to meet all six key requirements. It supports the training and involvement of unskilled annotators, which can lower the overall cost of corpus annotation projects. Further cost reductions can be achieved through automatic pre-annotation services, where these exist for the target domain and language.
GATE Teamware is based on the GATE framework (Cunningham et al. 2011a), which provides selected user interface components, reusable automatic text annotation components, and support for linguistic annotation standards.
GATE Teamware’s novelty is in being a generic, reusable, web-based framework for collaborative text annotation. Unlike other tools (see Sect. 3), GATE Teamware provides the required multi-role methodological support, as well as the necessary tools to enable the successful management of distributed annotation projects. It has a service-based architecture which is parallel, distributed, and also scalable (via service replication) (see Fig. 1). Each section of the architecture diagram will be explained in more detail below, from the bottom up.
The services layer includes the GATE document service, which serves the data structures used in GATE Teamware, and the GATE annotation services, which coordinate the computational tasks. Each is discussed in detail below.
GATE document service
The document storage service provides a distributed data store for corpora, documents, and annotation schemas. Input documents can be in all major formats (e.g., XML, HTML, PDF, ZIP), based on GATE's comprehensive format support. In all cases, when a document is uploaded to GATE Teamware, its format is analysed and the content is converted into a single unified, graph-based annotation model: that of the GATE NLP framework. This internal annotation format is then used for data exchange between the service layer, the executive layer, and the UI layer. The main export format for annotations is currently stand-off XML, including XCES (Ide et al. 2000). Multilinguality is supported via Unicode and other Java-supported text encodings.
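The offset-based, stand-off annotation model can be illustrated with a minimal sketch (hypothetical classes for illustration, not GATE's actual API): the document text is stored once, and each annotation records only its character offsets, type, and features.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A stand-off annotation: character offsets into the text, a type, and features."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)

@dataclass
class Document:
    """A document holds its raw text plus named sets of stand-off annotations."""
    text: str
    annotation_sets: dict = field(default_factory=dict)

    def annotate(self, set_name, start, end, ann_type, **features):
        ann = Annotation(start, end, ann_type, features)
        self.annotation_sets.setdefault(set_name, []).append(ann)
        return ann

doc = Document("Bank of England raised rates.")
doc.annotate("annotator1", 0, 15, "Organization")
assert doc.text[0:15] == "Bank of England"
```

Because annotations never modify the text itself, multiple annotation sets (e.g., one per annotator, plus automatic pre-annotations) can coexist over the same document, which is what makes stand-off export formats such as XCES straightforward.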
Since some corpus annotation tasks require ontologies, these are made available from a dedicated ontology service. This wraps the OWLIM (Kiryakov 2006) semantic repository, which is needed for reasoning support and consequently justifies having a specialised ontology service, instead of storing ontologies together with documents and schemas.
GATE annotation services
GATE Annotation Services (GAS) provide distribution of compute-intensive NLP tasks over multiple processors; how many machines are actually used to execute a particular service is transparent to the external user. GAS provides a straightforward mechanism for running applications created with the GATE framework as web services that carry out various NLP tasks. In practical applications we have tested a wide range of services, such as named entity recognition (based on the freely available ANNIE system; Cunningham et al. 2002), ontology population (Maynard et al. 2009), patent processing (Agatonovic et al. 2008), and automatic adjudication of multiple annotation layers in corpora.
The GAS architecture comprises two types of components: the web service endpoint, which accepts requests from clients and queues them for processing, and one or more workers, which take the queued requests and process them.
The two sides communicate using the Java Message Service (JMS), a framework for reliable messaging between Java components. If a particular service is heavily loaded, it is a simple matter to add extra worker nodes to spread the load, and workers can be added or removed dynamically without needing to shut down the web services. The configuration and wiring together of these components is handled using the Spring Framework.
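The endpoint/worker split can be sketched in miniature (illustrative Python, with an in-process queue standing in for the JMS broker):

```python
import queue
import threading

# A shared request queue stands in for the JMS message broker:
# the "endpoint" enqueues requests, and "workers" consume them.
requests = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # Each worker repeatedly takes a queued request and processes it.
    while True:
        req = requests.get()
        if req is None:           # sentinel: shut this worker down
            requests.task_done()
            break
        with results_lock:
            results.append(f"processed:{req}")
        requests.task_done()

# Workers can be added (or removed) without touching the endpoint.
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for doc_id in ["doc1", "doc2", "doc3", "doc4"]:  # endpoint side: queue requests
    requests.put(doc_id)
for _ in workers:                                # one shutdown sentinel per worker
    requests.put(None)
requests.join()
for w in workers:
    w.join()

assert sorted(results) == ["processed:doc1", "processed:doc2",
                           "processed:doc3", "processed:doc4"]
```

The design point this illustrates is that scaling is a deployment decision, not a code change: the number of worker threads (nodes, in the real system) can vary freely while the endpoint and the clients stay untouched.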
Annotation pipelines, installed in GATE Teamware as a GAS, are used in projects to prepare data. GATE Teamware includes a number of pre-packaged GASes to perform common functions, such as moving and copying annotations. Managers and administrators can view and edit GASes.
The executive layer
Firstly, the executive layer implements authentication and user management, including role definition and assignment. In addition, administrators can define which UI components are accessible to which user roles (the defaults are shown in Fig. 1).
The second major part is the workflow manager, which is based on JBoss jBPM and has been developed to meet most of the requirements discussed in Sect. 2.4 above. Firstly, it provides dynamic workflow management: create, read, update, and delete (CRUD) operations on workflow definitions and workflow actions. Secondly, it supports business process monitoring, i.e., it measures how long annotators take and how good they are at annotating, and reports the overall progress and costs. Thirdly, there is a workflow execution engine which runs the actual annotation projects. As part of the execution process, the project manager selects the number of annotators per document, the annotation schemas, the set of annotators involved in the project, and the corpus to be annotated.
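The core of the execution step, assigning each document to a fixed number of annotators, can be sketched as follows (a hypothetical round-robin scheme for illustration; the actual jBPM-based engine is more elaborate):

```python
import itertools

def assign_documents(documents, annotators, annotators_per_doc):
    """Assign each document to `annotators_per_doc` distinct annotators, round-robin."""
    assert annotators_per_doc <= len(set(annotators))
    pool = itertools.cycle(annotators)
    assignments = {}
    for doc in documents:
        chosen = []
        while len(chosen) < annotators_per_doc:
            candidate = next(pool)
            if candidate not in chosen:     # never assign a document twice to one person
                chosen.append(candidate)
        assignments[doc] = chosen
    return assignments

# Three documents, three annotators, two annotators per document:
plan = assign_documents(["d1", "d2", "d3"], ["ann1", "ann2", "ann3"], 2)
assert all(len(chosen) == 2 for chosen in plan.values())
```

Cycling through the annotator pool keeps the workload roughly balanced, which matters for the per-annotator time and cost reporting described above.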
The user interfaces
The GATE Teamware user interfaces run in a web browser and do not require prior installation. After the user logs in, the system checks their role(s) and access privileges, to determine which interface elements they are shown. Annotators only see the annotation interfaces, whereas managers see the project management and adjudication interfaces. GATE Teamware administrators have access to all user interfaces, including a dedicated administration interface.
Annotation user interface
Annotators carry out manual annotation, either from scratch or by correcting automatic annotations generated by the GATE processing resources. When human annotators log into GATE Teamware, they see a very simple web page with one link to their user profile data and another to start annotating documents.
The generic schema-based annotator UI is shown in Fig. 2. The annotation editor dialog shows the annotation types (or tags/categories) valid for the current project and optionally their features (or attributes). These are generated automatically from the annotation schemas assigned to the project by its manager. Annotation schemas define the acceptable range of annotations and attributes and thus allow the user interface to be customised, in a manner similar to other tools, such as Callisto (Day et al. 2004) and MMAX2 (Müller and Strube 2006).
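The way a schema constrains the acceptable annotations can be sketched as follows (a hypothetical, simplified in-memory representation for illustration; GATE's real annotation schemas are XML documents):

```python
# Hypothetical schema: each annotation type lists its permitted
# features and, for each feature, the permitted values.
schema = {
    "Organization": {"orgType": {"company", "bank", "government"}},
    "Person": {"gender": {"male", "female", "unknown"}},
}

def validate(ann_type, features, schema):
    """Reject annotations whose type or feature values fall outside the schema."""
    if ann_type not in schema:
        return False
    allowed = schema[ann_type]
    return all(feat in allowed and value in allowed[feat]
               for feat, value in features.items())

assert validate("Organization", {"orgType": "bank"}, schema)
assert not validate("Location", {}, schema)          # type not in this project's schema
assert not validate("Person", {"gender": "n/a"}, schema)  # value outside permitted set
```

Driving the editor dialog from such a declarative description is what lets the same generic UI be reused across projects: only the schema changes, not the interface code.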
The annotation editor also supports the modification of annotation boundaries, either through mouse clicks or keyboard shortcuts. In addition, advanced users can define regular expressions to annotate multiple matching strings simultaneously.
To add a new annotation, one selects the text with the mouse (e.g., “Bank of England”) and then clicks on the desired annotation type in the dialog (e.g., Organization). Existing annotations are edited by hovering over them, which shows their current type and features in the editor dialog.
The annotation editor has comprehensive multilingual support through Unicode, an evolution of the tools first described in Tablan et al. (2002). Since the annotation data model underlying GATE Teamware is based on character offsets, users can select any sequence of glyphs, i.e., markables are not required to be separated by white space. This is advantageous for languages, such as Thai, in which some glyphs comprise more than one Unicode character (the base character plus a tone marker). The editor also supports right-to-left as well as left-to-right languages, through the default Java implementation.
Annotators can also control which annotation types are highlighted in the text, by selecting the corresponding check-boxes, shown at the top right side of Fig. 2. By default, all types are visible, but this functionality allows users to focus on one category at a time, if required.
The toolbar at the top of Fig. 2 shows all other actions which can be performed. The first button requests a new document to annotate. When pressed, a request is sent to the workflow manager, which checks whether there are any pending documents that can be assigned to this annotator. The second button signals task completion, which saves the annotated document as completed on the data storage layer and enables the annotator to ask for a new one (via the first button). The third (save) button stores the document without marking it as completed in the workflow. This can be used for saving intermediate annotation results, or if an annotator needs to log off before they have completed a document. The next time they log in and request a new task, they will be given this document to complete first.
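The document lifecycle behind these three buttons can be sketched as a small state machine (illustrative only; in the real system these transitions are managed by the workflow engine):

```python
# Hypothetical task states mirroring the three toolbar actions.
PENDING, IN_PROGRESS, COMPLETED = "pending", "in progress", "completed"

class AnnotationTask:
    def __init__(self, doc_id):
        self.doc_id, self.state = doc_id, PENDING

    def request(self):    # first button: a pending document is assigned
        self.state = IN_PROGRESS

    def save(self):       # third button: persist work without completing;
        pass              # the state stays "in progress", so the same document
                          # is offered again when the annotator next logs in

    def complete(self):   # second button: mark as done and release the annotator
        self.state = COMPLETED

task = AnnotationTask("doc42")
task.request()
task.save()
assert task.state == IN_PROGRESS   # saved but not completed: still assigned
task.complete()
assert task.state == COMPLETED
```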
Ontology-based document annotation is supported in a similar fashion, but instead of having a flat list of types on the right, the annotator is shown the type hierarchy and when they select a particular type (or class), they can then optionally choose an existing instance or add a new one.
In terms of interface design, many of the annotator interface components are reused from GATE Developer, which makes it easier for users to switch between the web-based annotation tools of GATE Teamware and the stand-alone desktop environment of GATE Developer. This also minimises implementation effort, since the same code is reused in both applications. The only downside of this approach is that the GATE Teamware annotation interface, being built on Java Web Start and Swing, is ergonomically less similar to other commonly used web applications than it could have been.
As discussed in Sect. 2.1, project managers carry out quality assurance tasks. The tools available include IAA metrics (including F-measure and kappa) to identify whether there are differences between annotators; a visual annotation comparison tool to see quickly where the differences are, per annotation type; and an editor for reconciling annotations, either manually (i.e., adjudication) or by using external automatic services.
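As an illustration of the IAA metrics mentioned above, Cohen's kappa for two annotators labelling the same items can be computed as follows (a generic sketch of the standard formula, not GATE Teamware's implementation):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["ORG", "PER", "ORG", "ORG"]
b = ["ORG", "PER", "PER", "ORG"]
# Observed agreement is 3/4; chance agreement from the marginals is 1/2,
# giving kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5.
assert abs(cohens_kappa(a, b) - 0.5) < 1e-9
```

Kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it is a more honest benchmark of annotator quality than raw percentage agreement.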
The key part of the manual adjudication UI is shown in Fig. 3: the UI also shows the full document text above the adjudication panel, and lists all annotation types on the right, so the project manager can select which one they want to work on. In our example, the manager has chosen to adjudicate Date annotations created by two annotators and to store the results in a new consensus annotation set. Arrows at the top of the adjudication panel allow managers to jump from one difference to the next, reducing the required effort. The relevant text snippet is shown, with the two annotators' annotations below it. The manager can easily see the differences and correct them, e.g., by dragging the correct annotation into the consensus set. Annotation differences can also be resolved using the annotation diff interface, shown in Fig. 4. There, the Statistics tab shows the IAA metrics, whereas the Adjudication tab (shown in focus in the figure) can be used by managers to produce the final ground-truth annotations.
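The automatic part of this comparison, separating the annotations both annotators agree on from those needing a manual decision, can be sketched as follows (illustrative; annotations simplified to span/type triples):

```python
def diff_annotation_sets(set_a, set_b):
    """Split two annotators' annotations into agreed (identical span and type)
    and disputed (present in only one set, or typed differently)."""
    a, b = set(set_a), set(set_b)
    agreed = a & b                 # candidates for the consensus set
    disputed = (a - b) | (b - a)   # left for the manager to adjudicate
    return agreed, disputed

# Each annotation is (start, end, type); the annotators agree on the first
# span but assign different types to the second.
ann1 = {(0, 15, "Organization"), (20, 24, "Date")}
ann2 = {(0, 15, "Organization"), (20, 24, "Location")}
consensus, to_adjudicate = diff_annotation_sets(ann1, ann2)
assert consensus == {(0, 15, "Organization")}
assert to_adjudicate == {(20, 24, "Date"), (20, 24, "Location")}
```

Only the disputed annotations need the manager's attention, which is what makes the jump-to-next-difference arrows effective on largely agreeing documents.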
Project management interfaces
Apart from adjudication, project managers are responsible for defining annotation guidelines and schemas. They choose the relevant automatic services with which to pre- or post-process the data, benchmark annotator performance and monitor the project progress. Project managers define annotation workflows, manage annotators, and liaise with the system administrators.
The project management web UI provides the front-end to the executive layer (see Sect. 4.2). In a nutshell, managers upload documents and corpora; define the annotation schemas, specifying the legal annotation types and attributes; and choose, configure, and execute workflows on a chosen corpus. Workflows may be as simple as passing the documents to n human annotators, or more complex, for example, preprocessing the documents to produce automatic annotations, passing each document to three annotators, and then adjudicating the differences. The workflow wizard facilitates this step, as shown in Fig. 5. The management console also provides project monitoring facilities, e.g., the number of documents annotated, in progress, and yet to be completed, as shown in Fig. 6. Per-annotator statistics (time spent per document, overall time worked, average IAA) are also available, as well as per-document statistics.
Administration user interface
Administrators can create, delete and suspend accounts, and can also use the GATE Teamware Bulk Upload feature to quickly add new user accounts from an Excel worksheet. Administrators can monitor processes and tasks that are created by GATE Teamware when projects are run.