1 Introduction

Authority files [18], vocabularies (e.g., ULANFootnote 1), and actor ontologies (e.g., FOAFFootnote 2, RELFootnote 3, BIOFootnote 4, schema.org [5]) are used (1) for identifying people, groups, and organizations and (2) for representing data about them. They constitute a central resource for cataloging and information management in museums, libraries, and archives, but they also pose a challenge for data linking due to alternative names, homonyms, spelling variations, different languages, transliteration rules, and changes over time. Although actor ontologies play an essential part in modeling historical information, very few scientific articles have been published on the subject.

Historical military units and personnel are a particularly challenging domain for creating an actor ontology: the structures of units are large and change rapidly, different codes can be used for actors in order to confuse the enemy, and people come and go due to the violent actions of war. For example, during the phases of WW2 in Finland (the Winter War, the Continuation War, and the Lapland War) different units used the same name, and during the Winter War the names of major units were changed just to bluff the enemy. Furthermore, the data about the actors is often incomplete and uncertain, involving many “unknown soldiers” of whom little is known.

From a Linked Data viewpoint this poses two major problems: (1) Data linking (based on named entity linking [2, 6]) is difficult, because it has to be done in changing and vague domain-specific contexts [7]. For example, to tell whether the mentions captain Smith and colonel Smith can refer to the same person, and to which Smith in the first place, data about the different Smiths and their rank histories is needed. (2) It is difficult to aggregate and enrich data about actors that come from different sources and in different documentary forms, such as death records, diaries, magazine articles, or photographs, and to compile and present the global biographical history of the actors to the end users [9].

We argue that to address the problems above, a semantically rich spatio-temporal model for representing actors in relation to the events of the war is needed. This paper contributes to the state of the art by presenting such an ontological actor model for historical military units and personnel. The model is in use in the end-user application perspectives of the semantic portal WarSampoFootnote 5, where the idea is to automatically reassemble the biographical war history of individual soldiers and units. The model enables disambiguation of names in spatio-temporal contexts as well as combining contents from various sources and publishing them in a harmonized format. The ontology and related data have been published as a Linked Open Data serviceFootnote 6 that has been used in digital humanities research as well as for developing online portals. For example, the community portal SotapolkuFootnote 7, provided by a commercial company, makes use of the WarSampo actor data.

The work is done as part of the WarSampo projectFootnote 8, and builds upon our previous publications [7, 9, 11, 12], which focus on the architecture, named entity linking, and end-user views of the application. In contrast, this paper presents in detail the underlying ontology model and dataset regarding army units and people, as well as the actor-related application perspectives in use.

The paper is structured as follows: First, the ontology model for representing army units and military personnel is presented. After this, the collection of the WarSampo actor dataset is described, and the person and unit perspectives of the WarSampo portal are briefly reviewed. In conclusion, the contributions of the work are summarized and some directions for further research are suggested.

2 Use Case and Datasets

The use case for our work is the WarSampo semantic portalFootnote 9 [9]. It provides the end user with richly interlinked data about WW2 in Finland via application perspectives in the Sampo model [8]. The WarSampo datasets are illustrated in Fig. 1. In total, the WarSampo data cloud contains data of more than a dozen different types (e.g. casualty data, photographs, events, war diaries, and historical maps) from an even larger pool of sources (e.g. the National Archives, the Defense Forces, and scanned books, from which part of the data has been extracted semi-automatically).

The actor dataset contains ca. 100 000 soldiers, and ca. 16 100 army units. The data is enriched with ca. 488 000 links from events to actors. Actors have furthermore been linked to external resources in the LOD cloud databases on the Web.

Fig. 1. Linkage in the actor-event based dataset

3 Actor Ontology Model

The ontology of actors is based on the CIDOC CRMFootnote 10 [4] model, where the resources of actors are essentially described in terms of the spatio-temporal events they participate in. An event represents any change of status that divides the timeline into periods before and after the event. Using the actor-event model facilitates reconstructing the status of an actor at a specified moment. One main reason for adopting the model is that the information regarding a single actor varies a lot in both form and amount; in some cases we may have access to a very detailed description of the actor’s biography, while in other cases only sparse pieces of information exist. All this data can be harmonized into a sequence of events. The applied actor-event model also allows us to easily add new event types to the schema and new events to the database.
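
As a minimal sketch of this idea (not the WarSampo implementation), the status of an actor at a given moment can be reconstructed by selecting the latest event of a given type that took place on or before that moment. The Event class, the function status_at, the identifiers, and all dates below are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    kind: str    # event type, e.g. "Promotion" or "PersonJoining"
    actor: str   # identifier of the actor the event concerns
    when: date   # date on which the event took place
    value: str   # the status that holds after the event (a rank, a unit, ...)


def status_at(events: List[Event], actor: str, kind: str, moment: date) -> Optional[str]:
    """Return the value set by the latest `kind` event of `actor` on or before `moment`."""
    relevant = [e for e in events
                if e.actor == actor and e.kind == kind and e.when <= moment]
    if not relevant:
        return None  # nothing is known about the actor at that moment
    return max(relevant, key=lambda e: e.when).value


# Hypothetical events; the dates are invented for the example.
EVENTS = [
    Event("Promotion", "person_1", date(1938, 5, 1), "lieutenant"),
    Event("PersonJoining", "person_1", date(1939, 11, 30), "squadron_A"),
    Event("Promotion", "person_1", date(1941, 8, 1), "captain"),
    Event("Promotion", "person_1", date(1943, 11, 1), "major"),
]

print(status_at(EVENTS, "person_1", "Promotion", date(1942, 1, 1)))
```

Because every piece of information is an event, adding a new kind of status only requires a new value for kind; the lookup itself does not change.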

Fig. 2. Ontology schema of actors and events

Table 1. Namespaces and prefixes used in actor ontologies

The schema of the ontology is illustrated in Fig. 2. The schema is available at http://ldf.fi/schema/warsa, and the namespaces and prefixes in use are listed in Table 1. The actor superclass crm:E39_Actor Footnote 11 is shown at the top center. There is one subclass for people, and two for groups. For various types of events there are 19 classes with the superclass :Event Footnote 12.

The biographical representation of a person was modeled with events of birth (:Birth) and death (:Death), and his military career with events like promotion (:Promotion), serving in an army unit (:PersonJoining), participating in battles (:Battle), or being awarded a medal of honor (:MedalAwarding). Furthermore, there are classes for getting wounded (:Wounding) or disappearing (:Disappearing), which represent the data fields in the Casualties database. The schema includes supporting classes for representing military ranks, war diary entries, medals of honor, documentation, and data sources.

An example of a person resourceFootnote 13 (:Person) is shown in Table 2. The principle is to represent only constant information in a person resource; it has the full name as a primary title, and the family and first names as separate fields. The property owl:sameAs links to a corresponding resource in external databases, and foaf:page to external web pages.

Examples of related events are shown in Table 3. During the war, the person in the example was promoted from lieutenant first to captain and finally to major. When the Winter War started in 1939 he served as a commander in an air force squadron, and shot down an enemy aircraft soon after.

In the literature, military personnel are often referred to using a combination of current military rank and family name (e.g. Captain Karhunen or Colonel Talvela). So, to describe a person in detail, an ontology of military ranks was needed. The rank ontology is based on the Muninn Military Ontology [19]. The hierarchy of ranks was constructed by interlinking the instances to equal and lower ranks. The :Rank instances in the datasets were enriched with additional information (e.g. countries or service branches in which the rank has been used, or categories like officer or non-commissioned officer). The event :Promotion was used to attach a rank to a person. Due to the variations in the amount of available data, a promotion event was created in all cases, even if a person is known to have had only a single rank with no specific date of promotion.
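
The following sketch illustrates, under simplifying assumptions, how a rank hierarchy built from links to lower ranks and a set of promotion events (some without a date) could be used to resolve a mention such as "Captain Smith" at a given date. The rank excerpt, the people, the dates, and the function names are all hypothetical; this is not the linking method evaluated in [7].

```python
from datetime import date

# Hypothetical, simplified excerpt of a rank hierarchy:
# each rank points to the next lower rank.
LOWER_RANK = {
    "colonel": "lieutenant colonel",
    "lieutenant colonel": "major",
    "major": "captain",
    "captain": "lieutenant",
    "lieutenant": None,
}


def outranks(higher, lower):
    """True if `lower` is reachable from `higher` by following lower-rank links."""
    current = LOWER_RANK.get(higher)
    while current is not None:
        if current == lower:
            return True
        current = LOWER_RANK.get(current)
    return False


# (person id, family name, rank, promotion date, or None if the date is unknown)
PROMOTIONS = [
    ("p1", "Smith", "lieutenant", date(1939, 6, 1)),
    ("p1", "Smith", "captain", date(1941, 3, 1)),
    ("p2", "Smith", "colonel", None),
    ("p3", "Jones", "captain", date(1940, 1, 1)),
]


def rank_on(person, day):
    """Rank held by `person` on `day`; an undated promotion is used as a fallback."""
    dated = [(d, r) for p, _, r, d in PROMOTIONS
             if p == person and d is not None and d <= day]
    if dated:
        return max(dated)[1]
    undated = [r for p, _, r, d in PROMOTIONS if p == person and d is None]
    return undated[0] if undated else None


def candidates(rank, family_name, day):
    """People with the given family name who held `rank` on `day`."""
    people = sorted({p for p, name, _, _ in PROMOTIONS if name == family_name})
    return [p for p in people if rank_on(p, day) == rank]


print(candidates("captain", "Smith", date(1942, 1, 1)))
```

Creating a promotion event even when the date is unknown, as described above, is what allows the undated colonel in this example to remain a candidate for any mention date.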

Table 2. Properties of a resource describing pilot Jorma Karhunen
Table 3. Examples of events describing pilot Jorma Karhunen
Table 4. Properties of a resource describing 24th Fighter Squadron
Table 5. Examples of events describing 24th Fighter Squadron

An example of an RDF resource of a military unit is shown in Table 4; the resource is also available in Turtle formatFootnote 14. Just like in the case of a person, the properties describe only constant information like the unit’s preferred name and abbreviation, description, the conflicts it participated in, and links to LOD cloud resources. The events (Table 5) describe the unit’s position in the army hierarchy and the military activities it was involved in. The lifespan of a unit spans from its formation (:UnitFormation) to its dissolution (:Dissolution). Changes of the unit name were modeled as :UnitNaming events. The army hierarchy, including the temporal changes made in it, was also modeled using the event schema: the hierarchy was represented as a tree graph where the army units are the nodes and the events of joining a superior unit (:UnitJoining) form the edges. The events also included the military activities undertaken (e.g. movements, :TroopMovement, and battles, :Battle). The event :PersonJoining was used to connect a person to the unit in which he served. The event could also state a role in the unit (e.g. being a commander or a squadron pilot).
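
To illustrate how a time-dependent hierarchy can be read from joining events, the sketch below stores each :UnitJoining event as a (subordinate, superior, date) triple and walks upwards from a unit to obtain its chain of superior units on a given day. The unit identifiers, dates, and function names are hypothetical, and the sketch assumes that the latest joining event on or before the day determines the superior unit.

```python
from datetime import date

# (subordinate unit, superior unit, date of the joining event); all values hypothetical
UNIT_JOININGS = [
    ("battalion_A", "regiment_1", date(1939, 10, 1)),
    ("regiment_1", "division_X", date(1939, 10, 15)),
    ("regiment_1", "division_Y", date(1941, 6, 20)),
]


def superior_at(unit, day):
    """Superior unit according to the latest joining event on or before `day`."""
    joins = [(d, sup) for sub, sup, d in UNIT_JOININGS if sub == unit and d <= day]
    return max(joins)[1] if joins else None


def chain_of_command(unit, day):
    """Superior units of `unit` on `day`, ordered from the nearest upwards."""
    chain = []
    seen = {unit}  # guards against cycles in inconsistent data
    current = superior_at(unit, day)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = superior_at(current, day)
    return chain


print(chain_of_command("battalion_A", date(1940, 1, 1)))
print(chain_of_command("battalion_A", date(1941, 7, 1)))
```

The same battalion thus appears under different divisions depending on the date, without any change to the constant unit resources.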

4 WarSampo Actor Data

Currently the actor dataset contains ca. 100 000 people. The data has been collected from various sources: lists of generals and commanders, lists of recipients of honorary medals, the Casualties databaseFootnote 15, Finnish National BiographyFootnote 16, photographers mentioned in Finnish Wartime Photograph ArchiveFootnote 17, WikidataFootnote 18, and Wikipedia. Besides military personnel, an extract of 580 Finnish or foreign civilians from the National Biography database and Wikidata was included. This set consisted of people with political or cultural significance.

The unit dataset consists of over 16 100 Finnish wartime units, including Land Forces, Air Forces, Navy, Medical Corps, stations of Anti-Aircraft Warfare and Airwarning, Finnish White Guard, and Foreign Volunteer Corps. At this stage Soviet and German troops were excluded. The main sources of information have been the War Diaries, Army Postal Code listFootnote 19, and Organization Cards, all of which provided the information as datasheets in CSV format.

In general, the method used to produce the data depended on the format of the data source. The biographies of the National Biography and the Casualties database had been transformed into LOD in our earlier projects, and therefore the information extraction process consisted of converting the existing data into new actor entries and related events. The transformation was mostly done using specific SPARQL CONSTRUCT queries. More than 95 000 entries were generated from the Casualties database into the actor dataset [12].

The organization cards were written by the Finnish Defense Forces shortly after WW2. The cards cover the major part of the units in the Finnish Army, but unfortunately not those of the Navy and the Air Force. An example of an organization card is shown in Fig. 3. The proper name and abbreviation of the unit are shown in the upper left corner (a), in this case Jalkaväkirykmentti 7 (7th Infantry Regiment), abbreviated as JR 7. The regiment was part of 3. divisioona (3rd Division), which is stated in the upper right corner (b). The card provides further information about the foundation (c) and the military district (d) of the unit. Changes concerning the unit, like different names, are shown in part (e). During the Winter War JR 7 participated in four battles (f). The three columns on each line show the location or a short description of the battle, the battle’s duration, and the name of the commanding officer.

The organization cards were provided as scanned booklets in PDF format, and the conversion to RDF had several steps. First, each page of a PDF booklet was written as an individual PNG image. The images were preprocessed by adjusting the contrast and image rotation, and by removing compression artifacts. Next, an optical character recognition (OCR) process was applied. The resulting text, however, contained many errors, and plenty of post-processing was required. The structured format of the cards and the recurring use of military terms in the vocabulary nevertheless eased the automated error fixing. From the resulting text, the fields a-f (in Fig. 3) were extracted and converted into RDF. The produced resources consisted of military units (:MilitaryUnit), their commanders (:Person) with ranks (:Promotion), and events like unit formations (:UnitFormation), joinings of units (:UnitJoining), movements (:TroopMovement), renamings (:UnitNaming), and battles (:Battle).
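
One way in which a recurring vocabulary can ease OCR error fixing is fuzzy matching of each recognized token against a list of known terms. The sketch below uses the difflib module of the Python standard library for this; it is an assumed technique shown for illustration, not a description of the post-processing actually used, and the vocabulary, the cutoff value, and the function names are hypothetical.

```python
import difflib

# Hypothetical vocabulary of recurring terms; a real list would be much longer.
VOCABULARY = ["Jalkaväkirykmentti", "divisioona"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace `token` with its closest vocabulary term if it is similar enough."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line):
    """Correct every whitespace-separated token of an OCR output line."""
    return " ".join(correct_token(token) for token in line.split())


print(correct_line("Jalkavakirykmentti 7"))
print(correct_line("3. divisloona"))
```

Tokens that do not resemble any vocabulary term, such as numbers or place names, pass through unchanged, so a high cutoff keeps the risk of overcorrection low at the cost of leaving heavily garbled words for manual review.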

Fig. 3. Information on an organization card

Although Wikipedia may not be considered the most reliable source of information, it provided a way to connect the data with the external LOD cloud databases Wikidata, DBpediaFootnote 20, and VIAFFootnote 21. Material regarding personnel was widely available, but for units, especially those of the Finnish Army during WW2, the information was sparse. Information was extracted from the Wikipedia pages of e.g. Finnish high-ranking officers, politicians, wartime casualties, and foreign volunteers. Wikipedia pages follow a structured layout, which facilitated extracting the information. In the case of military units, detailed information about events like unit foundations, troop movements, and battles, as well as the names of commanding officers, was available. In total, 2500 people and 480 units with 5000 events were generated from the corresponding Wikipedia pages.

Characteristic sentences picked from Wikipedia were descriptions like “1st Artillery Group was founded in Pori with Captain Paavo Suominen as the first commander”, “10th July 1941 Regiment was moved to Kitee, from where it begun attacking towards Lake Ladoga”, or “Regiment participated in the occupation of Prääsä September 7–8, 1941”. Each sentence was converted to an event, and the named entities of personnel, places, and dates were recognized and linked to database resources. The data retrieval was done using Python scripts utilizing the MediaWiki APIFootnote 22 and the Wikipedia API for PythonFootnote 23. Entity linking was done with the ARPA service [13].

The datasets of conflicts, war diaries, medals, and ranks are in separate graphs. ConflictsFootnote 24 contain four main periods of WW2 in Finland. The War Diary graphFootnote 25 has 26 400 entries. There are 200 medal typesFootnote 26 and 200 rank entriesFootnote 27. The data includes the ranks used by the Finnish military and the most common German and Soviet ranks, along with some civil titles (e.g. the ones used by the women’s voluntary association Lotta Svärd) [9].

5 The WarSampo Portal

The perspectives of the WarSampo portalFootnote 28 visualize the linkage between the various datasets (e.g. military units, personnel, casualties, events, and places) [9, 11]. The WarSampo portal is a Rich Internet Application (RIA), where all functionality is implemented on the client side using JavaScript with the AngularJS framework; only data is fetched from the server-side SPARQL endpoints.

5.1 The Person Perspective

The WarSampo person perspective applicationFootnote 29 is illustrated in Fig. 4. A typical use case is someone searching for information about a relative who served in the army. On the left, the page has an input field (a) for a search by a person’s name. The matching query results are shown in the text field (b) below the input. After a selection has been made, information about the person is shown at the center top of the page (c). The tabs (d) allow the user to switch between this information page and a map-timeline application. In the example case, the page shows a description of the person (e), a photograph gallery (f), lists linking to related events (g), military units (h), battles (i), ranks (j), medals (k), related people (l), places (m), a Wikipedia page (n), related Kansa Taisteli magazine articles (o), and a Finnish National Biography widget (p).

Fig. 4. Information on person perspective

As an example of a SPARQL query, the query fetching related peopleFootnote 30 defines a similarity measure between two people. The more events, medals, and units the two share, and the higher the ranks they have in common, the higher the similarity becomes. The list of related people (l) shows the results sorted in descending order of similarity.
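
The actual measure is defined in the SPARQL query referenced above. Purely as an illustration of the principle, the following sketch counts shared events, medals, and units and adds a weight for each shared rank that grows with the level of the rank. The data, the weights in RANK_LEVEL, and the function names are hypothetical assumptions and do not reproduce the portal's scoring.

```python
# Hypothetical weights: a higher rank contributes more to the similarity.
RANK_LEVEL = {"private": 1, "captain": 2, "major": 3}

# Hypothetical people with the sets of resources linked to them.
PEOPLE = {
    "a": {"events": {"e1", "e2"}, "medals": {"m1"}, "units": {"u1"},
          "ranks": {"captain", "major"}},
    "b": {"events": {"e1"}, "medals": {"m1"}, "units": {"u1"},
          "ranks": {"captain"}},
    "c": {"events": {"e2"}, "medals": set(), "units": set(),
          "ranks": {"major"}},
    "d": {"events": set(), "medals": set(), "units": set(),
          "ranks": {"private"}},
}


def similarity(first, second, rank_level=RANK_LEVEL):
    """Number of shared events, medals, and units plus the levels of shared ranks."""
    shared = sum(len(first[key] & second[key]) for key in ("events", "medals", "units"))
    rank_score = sum(rank_level[rank] for rank in first["ranks"] & second["ranks"])
    return shared + rank_score


def related_people(target, people=PEOPLE):
    """Other people with a non-zero similarity to `target`, most similar first."""
    scored = [(similarity(people[target], data), name)
              for name, data in people.items() if name != target]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [name for score, name in ranked if score > 0]


print(related_people("a"))
```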

5.2 The Military Unit Perspective

The WarSampo military unit perspective applicationFootnote 31 is illustrated in Fig. 5. In a typical use case someone searches for information about an army unit in which perhaps an older relative served during the war. On the left there is an input field (a) for a search by the unit’s name. The matching results are shown in the text field (b) below the input. The map (c) depicts the known locations of the unit. The heatmap shows the casualties of the unit, and the timeline (d) shows the events (e), e.g. dates of unit foundations, troop movements, and durations of battles fought. On the right there are unit names and abbreviations (f), a description (g), and a collection of related photographs (h). Three lists of related units are shown: larger groups of which the unit has been a member (i), subdivisions that are parts of the unit (j), and units at the same hierarchical level (k). Below there are fields for related battles (l), links to Kansa Taisteli magazine articles (m), the Wikipedia page (n), and War Diaries (o). The number of casualties during the specified time is shown at the bottom of the page (p).

Fig. 5. Information on unit perspective

5.3 The Kansa Taisteli Magazine Perspective

Kansa Taisteli is a magazine that was published by Sanoma Ltd and the Sotamuisto association between 1957 and 1986. The magazine articles cover the memoirs of WW2 from the point of view of Finnish military personnel and civilians. The articles contain mentions of people, military units, and places. Of these, the military units and personnel have been linked to the actor ontology. The magazine perspectiveFootnote 32 can be used for searching and browsing articles relating to WW2. Military units and personnel are used as separate facets to search for articles. In addition, the writers have been linked to the actor ontology as well.

Fig. 6. The Contextual Reader interface targeting the Kansa Taisteli magazine articles

The purpose of the perspective is two-fold: (1) to help a user find articles of interest using faceted semantic search, and (2) to provide context to the found articles by extracting links to related WarSampo data from the texts. The start page of the magazine article perspective is a faceted search browser. Here, the facets allow the user to find articles by filtering them based on author, issue, year, related place, army unit, or keyword. Some of the underlying properties, such as the year and issue number, are hierarchical and represented using SKOS. The hierarchy is visualized in the appropriate facet, and can be used for query expansion: by selecting an upper category in the facet hierarchy one can perform a search using all its subcategories.
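
The query expansion described above can be sketched as a traversal of the facet hierarchy: the selected category is expanded into itself and all of its subcategories, and an article matches if it belongs to any of them. The dictionary NARROWER stands in for a SKOS-like hierarchy of years and issues; its contents, the article identifiers, and the function names are hypothetical.

```python
# Hypothetical hierarchy: a year has its issues as narrower categories.
NARROWER = {
    "1957": ["1957/1", "1957/2"],
    "1958": ["1958/1"],
}

# Hypothetical articles, each assigned to one issue.
ARTICLES = [
    ("article_1", "1957/1"),
    ("article_2", "1957/2"),
    ("article_3", "1958/1"),
]


def expand(category):
    """Return the category followed by all of its direct and indirect subcategories."""
    result = [category]
    for child in NARROWER.get(category, []):
        result.extend(expand(child))
    return result


def search(category):
    """Articles whose category is the selected one or any of its subcategories."""
    wanted = set(expand(category))
    return [article for article, assigned in ARTICLES if assigned in wanted]


print(search("1957"))
```

Selecting a year thus returns the articles of all its issues, while selecting a single issue narrows the result to that issue only.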

After the user has found an article of interest, she can click on it, and the article appears on the screen in the CORE Contextual Reader interface [14]. Depicted in Fig. 6, CORE is able to automatically and in real time annotate PDF and HTML documents with recognized keywords and named entities, such as army units, places, and person names. These are then encircled with colored boxes indicating the linked data source. By hovering the mouse over a box, data is shown to the user, providing contextual information for an enhanced reading experience. In Fig. 6, for example, detailed data are shown about Raymond August Ericsson, one of the battalion commanders discussed in the article.

Solving the technical issues, however, still left the problem of semantic disambiguation; in this case it concerned recognizing the correct people and military units. The identification was tuned by customizing the SPARQL queries, the order of the queries, and the use of article metadata. In each magazine article, references to people were searched for first. The identification of people was done using the name and possibly a rank. Secondly, military units were linked from the remaining text. The article metadata was also used to identify the war to which the events of the article relate. The military unit mentions were then linked to the corresponding units of that war. A detailed description and evaluation of the process is available in [17].
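
The ordering of the steps above can be sketched as follows: person mentions (rank followed by family name) are linked first and removed from the text, after which unit mentions are looked up in the remaining text among the units of the war given by the article metadata. The lookup tables, the identifiers, and the regular expressions are hypothetical simplifications; the process evaluated in [17] is based on SPARQL queries and is not reproduced here.

```python
import re

# Hypothetical lookup tables standing in for the actor ontology.
PEOPLE = {("captain", "karhunen"): "person_1"}
UNITS_BY_WAR = {
    "winter_war": {"JR 7": "unit_jr7_winter_war"},
    "continuation_war": {"JR 7": "unit_jr7_continuation_war"},
}


def link_article(text, war):
    """Link people first, then units of the given war from the remaining text."""
    links = []
    remaining = text
    # Step 1: people, identified by rank and family name.
    for (rank, name), person_id in PEOPLE.items():
        pattern = re.compile(rf"\b{re.escape(rank)}\s+{re.escape(name)}\b",
                             re.IGNORECASE)
        if pattern.search(remaining):
            links.append(person_id)
            remaining = pattern.sub(" ", remaining)  # do not reuse this span
    # Step 2: units, restricted to the war known from the article metadata.
    for label, unit_id in UNITS_BY_WAR.get(war, {}).items():
        if re.search(rf"\b{re.escape(label)}\b", remaining):
            links.append(unit_id)
    return links


print(link_article("Captain Karhunen visited JR 7 in December.", "winter_war"))
```

Restricting the unit lookup to one war reflects the observation that different units used the same name in different phases of the war.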

5.4 Photographs

WarSampo contains a dataset of the metadata of ca. 160 000 historical photographs taken by Finnish soldiers during WWII. The data contains e.g. captions of the photographs. The actor ontology was used to automatically disambiguate and link people and military units mentioned in the metadata. Information in the actor ontology was used extensively in linking: For example, when disambiguating people, names, ranks, promotion dates, military units, sources, medals, and death dates were used to rank otherwise ambiguous mentions in the photograph captions [7].

The results of the linking can be seen in the person and unit perspectives of the WarSampo portal, as well as in the photograph perspective itselfFootnote 33 which provides a faceted search interface for the photographs.

6 Related Work and Discussion

There are several projects publishing linked data about World War I on the web, such as Europeana Collections 1914–1918Footnote 34, 1914–1918 OnlineFootnote 35, WW1 DiscoveryFootnote 36, Out of the TrenchesFootnote 37, Muninn [19], and WW1LOD [15]. Only a few works apply the Linked Data approach to World War II, such as [1, 3], Defence of BritainFootnote 38, and Open Memory ProjectFootnote 39. The main focus of our work is on representing an actor as a biographical life story, unlike databases like Getty ULAN or the Smithsonian American Art Museum [16] that have actor vocabularies.

Our research group, Semantic Computing Research Group (SeCo), has produced several projects with highly interlinked actor ontologies: The National Biography, CultureSampoFootnote 40, BookSampoFootnote 41, and Norssit—High School Alumni [10] datasets. Bio CRM modelFootnote 42 is developed to facilitate and harmonize the representation of an actor in semantic web, and therefore deals with the same problematics as the WarSampo actor ontology.

We have considered combining different datasets, like articles and photographs, with the actor ontology as one of the use cases of the ontology. The evaluation of the ontology and the actor dataset has been work- and data-driven, i.e. the ontology has been developed according to the needs of representing the data semantically and of rendering the data in the end-user portal. 94% of the portal users come from Finland and 25% of them are returning visitors. We have received feedback via the user interface, and we have taken the users' comments into account, e.g. on misidentified people.

The main requirement for the ontology was to represent changes in a spatio-temporal context, as described in the Introduction. Constant actor resources are enriched with events marking the changes in spatio-temporal continuity, adding details to the semantic biographical representation, and connecting the otherwise separate datasets of personnel, units, places, articles, photographs, etc. The unit model had to be capable of representing even more dynamic changes than the person model; identifiers like name and abbreviation may change over time. The army hierarchy is represented as a tree graph where the groups are connected by the events of joining.

The actor ontology is based on the CIDOC CRM standard, which provides a clear framework and basis for the actor-event schema. The Muninn Military Ontology offered an example of modeling military concepts semantically. Nevertheless, there was no single obvious basis for the ontology. Instead, it was constructed by combining the principles of several solutions, all serving different needs.

The WarSampo project has collected historical wartime information from Finland. There is an abundance of information about WW2 in different countries, written in local languages and published in various formats, often even with divergent points of view. Collecting this data and publishing it as LOD is a tremendous amount of work, but it aims at constructing a comprehensive, worldwide database. In the events of history, individual people and groups are at the focal center; it is from their point of view that we build our notion of history.

The ontology model presented in this article may not be suitable for all purposes, but we hope it encourages and inspires researchers to develop the ideas further.