A massively parallel corpus: the Bible in 100 languages

We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.

Most parallel corpora exist in a small number of languages or in common languages pairs (e.g. the English-French Hansards corpus by Germann 2001). There are however, a few corpora that contain multiple languages: The Europarl corpus (Koehn 2005) contains parallel translations of European Parliament proceedings in 21 languages; the Joint Research Centre of the European Commission has released multiple corpora in more than 20 languages, including the sentence-aligned JRC-Acquis (22 languages, Steinberger et al. 2006) and the paragraph-aligned DGT-Acquis (23 languages); the InterCorp corpus (Č ermák and Rosen 2012), a collection of texts in Czech and 27 other European languages. To our knowledge, the most multilingual corpus currently available is the OPUS collection Tiedemann 2012 which contains 90 languages in various parallel corpora. However, comparatively few of the possible language pairs are available with parallel text. 1 In an attempt to access parallel material from as many and as diverse languages as possible, a very widely translated text is needed. In this work we will be following Resnik et al. (1999) in creating a massively parallel corpus based on Bible translations (cf. Bird 2010, 2011). According to United Bible Societies (2013) there are at least 2,527 translations of parts of the Bible and 475 full translations. These numbers exceed by far the translations of any other work of literature-according to Wikipedia (2013) the next most translated work of literature is 'Pinocchio' with 260 languages. Resnik et al. (1999) used 13 different translations of the Bible; we will increase the number of languages to 100. By having 100 different languages on the same corpus we can get 4,950 unique language pairs 2 -although not all translations contain the entire Bible as we shall see later-making this by far the largest number of bitexts available: in comparison, DGT-Acquis contains 253 pairs; InterCorp, 351; and the OPUS collection contains 3,800 pairs (Tiedemann 2012), but not all pairs contain the same amount of text.
2 The Bible as a corpus 2.1 Current and potential uses of the corpus As we mentioned in the previous section, most parallel corpora are created for SMT training purposes. While the relatively small size of the present corpus makes it rather unsuitable for the creation of full-scale SMT systems across the 4,950 language pairs, we believe that it can be used to tune the probability distributions of an existing SMT system for a phylogenetically similar language. Alternatively it can be used as a source of bi/multi-lingual dictionaries in emergency situations where 1 There are other, non-parallel but comparable corpora that exist in multiple languages (like Wikipedia with more than 10,000 articles in 121 languages) but their use is limited to a few approaches (e.g. Cohen et al. (2011), mentioned above). 2 The number of unique language pairs among n languages irrespective of the order in each pair is  (Lewis 2010) or Japan (Neubig et al. 2011)]. Steinberger et al. (2013) list a number of potential uses of parallel corpora in NLP. These include: annotation projection for co-reference resolution, discourse analysis; checking translation consistency automatically; testing and benchmarking alignment software (for sentences, words, etc.); producing multilingual lexical and semantic resources such as dictionaries and ontologies; annotation projection across languages for Named Entity Recognition (NER, Ehrmann et al. 2011), sentiment analysis , multi-document summarization (Turchi et al. 2010); cross-lingual plagiarism detection (Potthast et al. 2011); multilingual and cross-lingual document classification (Wei et al. 2008); creation of multilingual semantic space in Lexical Semantic Analysis (LSA, Landauer and Littman 1991) and Kernel Canonical Correlation Analysis (KCCA, Vinokourov et al. 2002). We believe that, despite some disadvantages (e.g. the lack of modern named entities and other issues discussed in Sect. 2.3), the Bible is an excellent resource for NLP, especially for low-resource languages.
Multilingual corpora are also ideal for typological or comparative language analysis, especially when a large number of languages can be collected. Indeed the present corpus has already been used for cross-linguistic induction and comparison of syntactic categories (Christodoulopoulos 2013, pp. 143-159). Similarly, we believe that parallel corpora can be invaluable to the whole area of Digital Humanities (e.g. Dipper and Schultz-Balluff 2013).

Advantages
There are a number of advantages to using the Bible as a corpus. Not only has it been translated into numerous languages; it has also been translated into a much more diverse set of languages than any other book. This is mostly due to the efforts of missionary linguists such as the Summer Institute of Linguistics (SIL, Brend and Pike 1977) that combine anthropological and linguistic research with missionary expeditions in remote locations and, as a result, produce Bible translations.
Another advantage of the Bible is the size of the text. The complete canonical 66 books contain around 800k words in English. This might seem small compared to modern (parallel) corpora-like, for instance, the Canadian Hansards corpus (Germann 2001) with *19 M words, and the Europarl (*60 M words on average per language); however it is much bigger than any single work of literature: for instance, the size of the average fiction novel is about 100k words, while 'Pinocchio' is *45 k.
The Bible also is unique as a text since every verse is uniquely identified by a book, chapter and verse number. This allows for an automatic, unambiguous alignment at the verse level across every language (with minor exceptions that will be discussed in Sect. 3).
A final advantage is that the Bible translations collected here are either public domain, or-as in the case of the King James Version-free to use for research purposes. 3

Translation methods
As with every translation work, one important question concerns the style and fidelity of translation. There are two competing translation methods: word-for-word (or formal equivalence), in which the literal meaning of each words as well as the syntactic structure is preserved where possible; and sense-for-sense translation (or dynamic equivalence), in which the 'spirit' or emotional effect of the text is kept. The former method is more appropriate for the type of analysis required here and has been put forward as the preferred method by the Catholic Church (2001), among others. However, some of the translation guides used by the missionary linguists follow the latter method. For instance Nida and Taber (1969) provide a theoretical framework as well as a set of principles for Bible translations, in which they advise: -Content is to have priority over style.
-Contextual consistency is to have priority over verbal consistency.
-Long, involved sentences are to be broken up on the basis of receptor-language usage. -Nouns expressing events should be changed to verbs whenever the results would be more in keeping with receptor-language usage. (Nida and Taber 1969, p. 182) This does not imply that every Bible translator has followed these principles, but given that the goal of the missionary linguists was to convey the message of the Bible, it makes sense that they would choose a more content-sensitive approach to their translations. Finally, we should keep in mind that it is not always desirable to have a formally equivalent translation 4 : for instance in MT, when translating the title of Stig Larsson's third book, a translation system should return ''The Girl Who Kicked the Hornet's Nest'' instead of the literal translation of the Swedish ''Luftslottet som sprängdes'' which would be ''The air castle that was exploded''. However, from a computational linguistics perspective, it is usually more helpful to have formally equivalent translations.

Other issues
A major issue that is relevant to the use of the Bible as a parallel corpus is the writing style; in particular, the use of antiquated language. This is especially problematic in languages (mostly Western European) where Bible translations were created hundreds of years in the past. Even if modern translations exist, often the editors would choose a more archaic style of writing to match the earlier text and to give the appropriate gravity to the material. Some exceptions exist, at least in English. As Resnik et al. (1999) showed, the New International Version (NIV) covers a significant variety of present-day terms as found in Longman Dictionary of Contemporary English (LDOCE, Proctor 1978) and in the Brown Corpus (Francis 1964).
For many translations, it is an open question whether the writing style of the Bible is representative of present-day language, but given the limited availability of written sources in some languages, and the breadth of available translations, the Bible corpus represents the best resource for cross-linguistic analysis. Indeed there have been a number of projects that used Bible translations as either a primary or secondary source of material (Resnik et al. 1999;Yarowsky and Ngai 2001;Wierzbicka 2001;Kanungo et al. 2005).
A final limiting factor is the fact that the alignment information is limited to verses (rather than sentences as is the case in the JRC-Acquis corpus for instance). While it is often the case that a verse corresponds to a whole sentence, there are verses that span more than two sentences, or are limited to sub-sentence phrases. The exact number varies depending on what is considered to be sentence-final punctuation. When counting only '.' and '?', out of the *30,000 verses, only 4,000 contain multiple sentences. However, this number increases to 10,000 if we include ';' and more than half the verses if we add ':' as a sentence-final marker. To make things worse, as we can see in the following example, different translations use different punctuation schemas which means that they contain significantly different numbers of sentences.
( 3 Acquiring and converting source material

Corpus collection
Despite the great number of translations, many Bible translations exist only in audio form. This is reasonable, since some of the translated languages exist only in verbal form, and even if an alphabet exists, most speakers of that language may be illiterate. Furthermore, even when textual resources have been available for years, electronic copies are hard to obtain. This means that there is a limited availability of machine-readable bibles online. In English, for instance, one of the most widespread Bibles, the King James Version, is not made available in electronic form by the official licensing body in Scotland (the Scottish Bible Board) even though the text is free to use for research purposes. Instead, we have had to rely on third-party sources, like the ones mentioned in the next paragraph. When multiple versions of the Bible were available-since the aim of this project was breadth instead of depth-we opted for a single translation, usually the oldest available one (e.g. the King James Version for English). We believe that this will lead to a more coherent corpus, as older translations tend to be more literal, but we acknowledge that this brings up the issue of diachronic language change. As discussed in Sect. 1, this problem is not as severe as initially perceived; however we are also open to the idea of adding multiple versions of the same language in the future.
There are a few websites that offer access to public domain, machine-readable versions of the Bible in multiple languages. The four main sources used here were the Bible Database, the Unbound Bible, GospelGo and the Bible Gateway websites. Each one offered the Bible in different formats, some containing HTML and others plain text. Figure 1 presents a comparison of the different versions.
In order to unify all the different styles of annotation under a well-defined universal format, we followed Resnik et al. (1999) in using the Corpus Encoding Standard (CES, Ide 1998), conforming to the level 1 annotation guidelines. Practically, this means that each Bible was formatted as an XML file, containing nested <div> elements corresponding to books and chapters, and <seg> elements that corresponded to verses. Each of the verses was marked with a serial ID. Figure 2 shows the same two verses of Fig. 1 as formatted by custom scripts.

Conversion problems
The most common issue we encountered when converting and formatting our corpus was the inconsistency in the formatting of the online sources. Some of the more common ones included incorrect HTML: unclosed <span> or <p> tags, inconsistent use of capitalisation (e.g. <SPAN> Verse Text </span>); errors in verse numbering (e.g. ''missing'' verses were actually included in previous or subsequent verses marked by text instead of HTML tags); character rendering errors (e.g. ž in Croatian rendered as ?); missing characters (e.g. final character in each verse of the Thai and Latin translations). In most cases the errors were systematic and could be corrected semi-automatically; in other cases (like the missing characters) we had to find multiple sources of the same translation. If neither option was available, the errors were left in the final version of the corpus. Overall, the whole process took about two-to-three person/months.
Finally, when dealing with machine-readable multilingual texts, character encoding can cause difficulties. This is especially true for languages that do not have a strong international presence and the need to adopt an encoding standard is low. However, we did not encounter such problems during the creation of this corpus; all languages included in the corpus have been encoded using the Universal Character Set (UCS, Allen et al. 2012), specifically the UCS Transformation Format-8-bit (UTF-8).

Parallel corpus information
The full corpus contains 100 languages from across the world (see Table 1 for the names of the languages). As Table 2 shows, the majority are non-Indo-European languages and 39 of the languages are spoken by fewer than 1 million speakers. Figure 3 presents a geographical distribution of the languages (data from Dryer and Haspelmath 2013) that cover almost all the continents, and Appendix A contains detailed linguistic information about every language as well as the approximate date of translation (data from Lewis et al. 2014). Table 3 contains statistics about the average size and variability of the lexicon of the whole corpus: we include total number of tokens, 5 standardised type-token ratio (STTR), average verse length and the percentage of the corpus covered by the 1,000 most frequent words. We also present STTR and average verse length information for each individual language in Fig. 4.
In order to normalise over the overall size of each corpus, we computed STTR by calculating a macro-average over successive measurements of the token-type ratio (# unique word types/# all tokens) of a fixed amount of tokens. This fixed amount corresponded to the smallest number of tokens (678 tokens in Gaelic). We also include the specific numbers for the English translation as well as number from other corpora for comparison: the Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al. 1993), George Orwell's 1984 novel and a corpus of childdirected speech (CHILDES; MacWhinney 2000).
We can see that although the average type-token ratio of the corpus is close to that of both WSJ and 1984, the English translation has far fewer unique word types. Following Resnik et al. (1999), we also compared the vocabulary of the English translation with that of the other English corpora we had available. As Table 4 shows, even though the language King James' Version of the Bible is more archaic than the New International Version (used in Resnik et al's comparisons), it still covers a significant portion of the most frequent words in all three corpora.
For a qualitative view of the omissions, we present the 10 most frequent words of each corpus that are missing from the KJV Bible: From the WSJ corpus the words are mostly market-related: million, Mr, says, billion, Corp, inc, shares, president, Co, sales; in the case of 1984 they are words related to the story of the novel: winston, party, o'brien, telescreen, big, human, don't, merely, oceania, minutes; and finally from CHILDES they are mostly informal, spoken constructions: yeah, does, huh, alright, ya, okay, gonna, mhm, big, baby. It is interesting to note that words like 'big' and 'human' are in these lists. This is due to stylistic differences as well as actual diachronic language changes.

Remaining problems in the parallel corpus
As Table 2 shows, 45 out of the 100 languages contain only partial texts. In most cases this means that only the New Testament was available for that language, but in a few cases even less text exists. This means that if we want to use all 100 languages, we are limited to the smallest amount of text contained in any of them. A further problem is the fact that not all the canonical verses (i.e. verses that appear in the original Greek, Hebrew and Aramaic) are present even in the official translations. One possible explanation is that the missing verses are contained in the verses that come before, or after them. This is a reasonable assumption, since in some languages it might not be easy to follow the sentence structure of the original text (e.g. a sentence that is split across two verses). For instance, in the Turkish text, verses no. 2 and 3 of chapter 7 of the Book of Genesis are combined: 6 GEN.7.2: Yeryüzünde soyları tükenmesin diye, yanına temiz sayılan hayvanlardan erkek ve dişi olmak üzere yedişer çift, kirli sayılan hayvanlardan birer çift, kuşlardan yedişer çift al.
[Gloss] Extinction on earth, lest next clean counted seven pairs of animals, including male and female, a pair of unclean animals, birds take seven pairs. GEN.7.2: Of every clean beast thou shalt take to thee by sevens, the male and his female: and of beasts that are not clean by two, the male and his female. GEN.7.3: Of fowls also of the air by sevens, the male and the female; to keep seed alive upon the face of all the earth.
In fact, the most commonly missing verse in the New Testament is 2 Corinthians 13:14 (missing from 33 languages where the median is 2) which is a known versification difference. 7 Of course an alternative explanation would be that some verses were completely omitted, either intentionally or unintentionally (see footnote 7). This seems to be the case in the Swedish translation, where verse no. 29 of chapter 28 in the Book of Acts is missing and the text does not appear in either verse no. 28 or 30.
[Gloss] Be it known therefore know the pagans wore this salvation of God is sent; they will also hearken. There are other known omissions of verses whose authenticity has been doubted; see http://en.wikipedia.org/wiki/List_of_Bible_verses_not_included_in_modern_translations for a list. After examining the number of languages where each verse is missing, the verses listed here are indeed missing in an above-average number of languages: most of them are in the 0.6 percentile and some in the 0.9 percentile (e.g. MAR.9.46, ACT.8.37). Thanks to an anonymous reviewer for pointing us to these cases.

Fig. 3
A map of the distribution of languages in the Bible Corpus. Each pin represents the country or territory where the language originates or is primarily used (e.g. there is only one pin for English in Britain). Location coordinates were acquired from Dryer and Haspelmath (2013) A massively parallel corpus: the Bible in 100 languages 385 ACT.28.30: I två hela år bodde han sedan kvar i en bostad som han själv hade hyrt. Och alla som kommo till honom tog han emot; [Gloss] For two whole years he lived then left in a residence that he had rented. And everyone who came to him, he received; ACT.28.28: Be it known therefore unto you, that the salvation of God is sent unto the Gentiles, and that they will hear it.
ACT.28.29: And when he had said these words, the Jews departed, and had great reasoning among themselves.
ACT.28.30: And Paul dwelt two whole years in his own hired house, and received all that came in unto him, There are cases where multiple verses are omitted like, for instance in the Marathi translation: the first verse of the first chapter in the Book of Ezekiel is verse no. 5, with no information about the previous four verses. However neither the single nor the continuous omissions are very frequent. When we examine all the translations that contain the full text, there are on average 19.38 single verse omissions and 9.69 continuous ones. This amounts to 0.06 and 0.03 % of the total number of verses.
One way to deal with these omissions would be to ignore verses in all languages where text is missing even in one of the languages in the corpus. 8 Even with this drastic strategy, the overall loss of text across languages may be found to be tolerable: on average, each full bible translation contains about 643,000 words: after the elimination of all non-shared verses, we found the average word count to be about 549,000-only a 14.7 % reduction.   It should be noted, however, that the corpus presented here contains all available verses in all the languages (each with a unique ID as shown in Fig. 2), meaning that, depending on which subset of languages chosen, the limitations described above might not apply. Researchers are encouraged to choose their own methods to deal with these occasional unilateral omissions, whose detection is a precondition to finer-grain sentence-and wordlevel alignment of the kind proposed by Bird (2010, 2011).

Conclusion
This paper described the creation of a massively parallel corpus, consisting of translations of the Bible in 100 languages. We discussed some of the problems arising from the nature of the texts as well as the process of gathering and annotating the online material. The texts in each language were aligned up to the level of verse in compliance with the CES guidelines. While a few more Bible translations exist in a machine-readable form (as well as a number of different translations for some languages), we believe this set of 100 languages is significantly large for an initial release. We expect to add more languages if the resource is used, and we encourage such additions by other researchers. We have released code to allow users to add more languages to the corpus as well as process the existing ones, and together with the annotated XML files, they are published under a Creative Commons license and can be found at the following address: http://groups.inf.ed.ac.uk/ccg/corpora.html.
Acknowledgments This material is based on research partially supported by DARPA under agreement number FA8750-13-2-0008. The U.S. Government is authorised to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. This research was partially supported by ERC Advanced Fellowship 249520 GRAMPLUS.Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Appendix A
See Table 5.