1 Introduction

Although Wikipedia exists in 287 languages, its content is unevenly distributed. The content of the most under-resourced Wikipedias is maintained by a limited number of editors, who cannot curate the same volume of articles as the large Wikipedia communities. Part of this problem has been addressed by Wikidata, the knowledge base (KB) supporting Wikipedia with structured data in a cross-lingual manner. Recently, Wikimedia introduced ArticlePlaceholders [12] in order to integrate Wikidata’s knowledge into the Wikipedias of underserved languages and help reduce the language gap. ArticlePlaceholders display Wikidata triples in tabular form in the target Wikipedia language and are currently deployed to 11 underserved Wikipedias. When a user searches for a topic on Wikipedia that has a Wikidata item, but no Wikipedia article yet, they are led to the ArticlePlaceholder on the topic. Compared to stub articles, ArticlePlaceholders have the advantage of being dynamically updated in real time to accommodate information changes in Wikidata. This means less maintenance for small communities of editors. Since Wikidata is one central, language-independent place to edit information and each item or property has to be translated only once, any contribution in Wikidata has an impact on the ArticlePlaceholders. For example, an editor speaking only English can connect the existing item Q1299 (The Beatles) with the item Q145 (United Kingdom) via the property P495 (country of origin). This will automatically add the same triple with its Esperanto labels: The Beatles eldonit/ata en Unuiĝinta Reĝlando. Nonetheless, ArticlePlaceholders currently only display information in the form of tables.

In this paper, we propose an automatic approach to enrich ArticlePlaceholders with textual summaries that can serve as a starting point for Wikipedia editors to write their articles. The summaries resemble the first sentence of a Wikipedia article, which gives a reader an overview of the topic. We pose the following research questions:

RQ1. Given the challenges concerning underserved languages, can we generate textual summaries that match the quality and style of Wikipedia content?

RQ2. Can we generate summaries that are useful for Wikipedia editors of underserved language communities?

We adapt an end-to-end trainable model, which generates a monolingual textual summary (i.e. only in English) given a set of KB triples as input, for multilingual support. To this end, we introduce a new “property placeholders” feature and put the placeholders under distant supervision in order to enable our system to verbalise even rare or “unseen” entities. Since the summaries are generated explicitly from the input triples, any change in the respective triples is reflected immediately in the textual content of the summary without a translation step in the loop. Furthermore, since we do not transfer any information from a source language, our model learns to generate Wikipedia content that captures the linguistic peculiarities of our target underserved Wikipedias.

We apply our model to two languages that have a severe lack of both editors and articles on Wikipedia: Esperanto and Arabic. Esperanto is a constructed language that is easy to acquire, which makes it a suitable starting point to explore the challenges of our task. Arabic, on the other hand, is a morphologically rich language with a significantly larger vocabulary. Arabic is the 5th most spoken language in the world [8]; however, as shown in Table 1, the Arabic Wikipedia suffers from a severe lack of content compared to the English one.

We propose a novel evaluation framework that assesses the usefulness of the summaries via a multitude of metrics, computed against strong baselines and involving readers and editors of underserved Wikipedias. We start our evaluation by measuring how close our synthesized summaries are to actual summaries in Wikipedia. We compare our model to two strong baselines of different natures: MT and a template-based solution. Our model substantially outperforms the baselines in all evaluation metrics in both Esperanto and Arabic. In addition, we developed three studies with the Wikipedia community, in which we ask for their feedback about the generated summaries, in terms of their fluency, appropriateness for Wikipedia, and engagement with editors. We believe that given the promising results achieved in the automatic and human evaluations, our approach along with the datasets, the baselines, and the experimental design of the human evaluation can serve as a starting point for the research community to further improve and assist in solving this critical task. Our code and experiments are available: https://github.com/pvougiou/Mind-the-Language-Gap.

Table 1. Recent page statistics and number of unique words (vocab. size) of Esperanto, Arabic and English Wikipedias in comparison with Wikidata.

2 Related Work

Multilingual Text Generation. Many existing techniques for text generation and RDF verbalization rely on templates. These templates are generated using linguistic features such as grammatical rules [26], or are hand-crafted [7]. These approaches face many challenges when scaling to a language-independent system, as templates need to be fine-tuned for every new language they are ported to. This is especially difficult for the few editors of underserved Wikipedias, who would have to create and maintain templates when this time could instead be invested in the creation of actual articles. Recognizing this problem, the authors of [5, 6] introduce a distant-supervised approach to verbalize triples, in which the templates are learned from existing Wikipedia articles. This makes the approach more suitable for language-independent tasks. However, templates assume that items will always have the appropriate triples to fill the slots of the template, and this assumption does not always hold. In our experiments, we implement a template-learning baseline and show that adapting to the varying triples available can achieve better performance.

Text Generation for Wikipedia. Pochampally et al. and Sauper et al. proposed the generation of Wikipedia summaries by harvesting sentences from the Internet [20, 23]. Existing Wikipedia articles are used to automatically derive templates for the topic structure of the summaries, and the templates are afterwards filled using Web content. Such approaches are limited to only one or two domains and only to English. The lack of Web resources for underserved languages prevents these approaches from scaling to underserved languages in multiple domains [16]. Meanwhile, KBs have been used as a resource for NLG [2, 5, 19, 25]. These techniques leverage linguistic information from KBs to build a dataset of triples aligned with equivalent sentences from Wikipedia. This alignment is used at subsequent steps to train NLG systems.

The works most relevant to our proposed model are the recent approaches by Lebret et al. [15], Chisholm et al. [2], and Vougiouklis et al. [25], who all propose adaptations of the general encoder-decoder neural network framework [3, 24]. They use structured data from Wikidata and DBpedia as input and generate one-sentence summaries that match the Wikipedia style in English in only a single domain. The first sentences of Wikipedia articles in a single domain exhibit a relatively narrow range of language in comparison to other text generation tasks such as translation. However, Chisholm et al. [2] show that this task is still challenging and far from being solved. In contrast with these works, we extend this line of research to open-domain, multilingual summaries.

Evaluating Text Generation. Evaluating generated text is challenging, and different approaches have been proposed in the literature. Automatic scores [15], expert evaluation and crowdsourcing [2, 14] have been employed. Additionally, similar to Sauper and Barzilay [23], we extend our evaluation to the usefulness of the summaries for Wikipedia editors by measuring the amount of reuse of the generated summaries. This concept has been widely investigated in fields such as journalism [4] and plagiarism detection [21].

3 Methods

We use a neural network in order to understand the impact of adding automatically generated text to ArticlePlaceholders in underserved language Wikipedias.

3.1 Our System

Our system is adapted from our encoder-decoder architecture introduced in [25] that has already been used on a similar text generative task. The architecture of the generative model is displayed in Fig. 1. The encoder is a feed-forward architecture which encodes an input set of triples into a vector of fixed dimensionality. This is used at a later stage to initialise the decoder. The decoder is an RNN that uses Gated Recurrent Units (GRUs) [3] to generate the textual summary one token at a time.

An example is presented in Table 2. The ArticlePlaceholder provides our system with a set of triples about the Wikidata item of Floridia (i.e. Q490900 (Floridia) is either the subject or the object of the triples in the set). Figure 1 displays how our model generates a summary from those triples, \(f_1\), \(f_2\), and \(f_3\). A vector representation \(h_{f_1}\), \(h_{f_2}\), and \(h_{f_3}\) for each of the input triples is computed by processing their subject, predicate and object. These vector representations are used to compute a vector representation for the whole input set \(h_{F_E}\). \(h_{F_E}\), along with the special start-of-summary \(\mathtt{{<}start{>}}\) token, are used to initialise the decoder that sequentially predicts tokens (“[[Q490900, Floridia]]”, “estas”, “komunumo” etc.).

Table 2. The ArticlePlaceholder provides our system with a set of triples about Floridia, in each of which the item of Floridia is either the subject or the object. Subsequently, our system summarizes the input set of triples as text. We train our model using the summary with the extended vocabulary.

Fig. 1. The triple encoder computes a vector representation for each one of the three input triples from the ArticlePlaceholder, \(h_{f_1}\), \(h_{f_2}\) and \(h_{f_3}\). Subsequently, the decoder is initialized using the concatenation of the three vectors, \([h_{f_1};h_{f_2};h_{f_3}]\). The purple boxes represent the tokens of the generated summary. Each summary starts and ends with the respective start-of-summary \(\mathtt{{<}start{>}}\) and end-of-summary \(\mathtt{{<}end{>}}\) tokens. (Color figure online)

Formally, let \(F_E\) be the set of triples provided by the ArticlePlaceholder for the item E (i.e. item E is either the subject or the object of the triples in the set). Our goal is to learn a model that generates a summary \(Y_E\) about E. We regard \(Y_E\) as a sequence of T tokens such that \(Y_E=y_1, y_2, \ldots , y_{T}\) and compute the conditional probability \(p(Y_E | F_E)\):

$$\begin{aligned} p(Y_E | F_E) = \prod _{t=1}^{T}p(y_t|y_1, \ldots y_{t-1}, F_E) . \end{aligned}$$
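As an illustration of this factorisation, the following minimal Python sketch accumulates the per-token conditionals in log space. The function `next_token_probs` and the toy probability table are hypothetical stand-ins for the decoder's output distribution; the numbers are invented for the example and do not come from the trained model.

```python
import math


def sequence_log_prob(tokens, next_token_probs):
    """Return log p(Y_E | F_E) as the sum of the per-step log conditionals."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        history = tuple(tokens[:t])
        # Hypothetical model call: distribution p(. | y_1..y_{t-1}, F_E)
        distribution = next_token_probs(history)
        log_p += math.log(distribution[token])
    return log_p


def toy_model(history):
    # Invented probabilities standing in for the decoder's softmax output.
    table = {
        (): {"Floridia": 0.5, "estas": 0.5},
        ("Floridia",): {"estas": 0.8, "komunumo": 0.2},
        ("Floridia", "estas"): {"komunumo": 0.5, "<end>": 0.5},
    }
    return table[history]


log_p = sequence_log_prob(["Floridia", "estas", "komunumo"], toy_model)
probability = math.exp(log_p)  # product of 0.5, 0.8 and 0.5
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow for longer summaries.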

3.1.1 Generating a Summary

Our model learns to predict the next token by using the negative cross-entropy criterion. We define a maximum number of triples per summary. Input sets with fewer triples are padded with zero vectors, which are consistently ignored by the encoder. During training, our architecture predicts the sequence of tokens that make up the summary. During testing, the ArticlePlaceholder provides our model with a set of unknown triples. After the vector representation \(h_{F_E}\) for the unknown set of triples is computed, we use it, together with the special start-of-summary \(\mathtt{{<}start{>}}\) token, to initialize the decoder.

We adopt a beam-search decoder [15, 24, 25] which provides us with B-most-probable summaries for each triple set \(F_E\).
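The beam-search procedure can be sketched as follows. This is a generic illustration under stated assumptions, not the decoder used in our experiments: `next_token_probs` and the toy transition table are hypothetical, the probabilities are invented, and no length normalisation is applied.

```python
import math


def beam_search(next_token_probs, beam_size, max_len,
                start="<start>", end="<end>"):
    """Return up to `beam_size` hypotheses as (tokens, log-probability) pairs."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, p in next_token_probs(tuple(tokens)).items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the B most probable expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses if max_len was reached
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


def toy_decoder(history):
    # Invented distribution that depends only on the previous token.
    table = {
        "<start>": {"Floridia": 0.9, "komunumo": 0.1},
        "Floridia": {"estas": 0.7, "<end>": 0.3},
        "estas": {"komunumo": 0.6, "urbo": 0.4},
        "komunumo": {"<end>": 1.0},
        "urbo": {"<end>": 1.0},
    }
    return table[history[-1]]


summaries = beam_search(toy_decoder, beam_size=2, max_len=5)
best_tokens, best_score = summaries[0]
```

With a beam of size 2, the sketch keeps the two highest-scoring partial summaries at each step instead of committing to the single most probable next token.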

3.1.2 Vocabulary Extensions

Each summary consists of words and mentions of named entities. Mapping those entities to words is hard since an entity can have several surface forms and the system may face rare or unseen entities at prediction time. We adopt the concept of surface form tuples to learn a number of different verbalisations of the same entity in the summary [25]. In Table 2, [[Q490900, Floridia]] in the vocabulary extended summary is an example of a surface form tuple where the entity Q490900 is associated with the surface form of “Floridia”.

Additionally, we address the problem of learning embeddings for rare entities in text [17] by training our model to match the occurrence of rare entities in the text to the corresponding triple. To this end, we introduce property placeholders. The property placeholders are inspired by the property-type placeholders [25]. However, their applicability is much broader since they do not require any instance type-related information about the entities that appear in the triples. In the vocabulary extended summary of Table 2, [[P17]] is an example of a property placeholder. If it is generated by our model, it is replaced with the label of the object of the triple with which it shares the same property (i.e. Q490900 (Floridia) P17 (ŝtato) Q38 (Italio)).
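The post-processing of a generated summary can be sketched in Python as follows. The token formats are taken from the examples above; the function name, the choice to keep the first triple when several share a property, and the example token sequence are assumptions made for illustration.

```python
import re

SURFACE_FORM = re.compile(r"\[\[(Q\d+), (.+)\]\]")   # e.g. [[Q490900, Floridia]]
PROPERTY_PLACEHOLDER = re.compile(r"\[\[(P\d+)\]\]")  # e.g. [[P17]]


def realise_summary(tokens, triples, labels):
    """Turn vocabulary-extended tokens into plain words.

    triples: list of (subject, property, object) identifiers.
    labels:  mapping from item identifier to its label in the target language.
    """
    object_by_property = {}
    for _subject, prop, obj in triples:
        # Assumption: if several triples share a property, keep the first one.
        object_by_property.setdefault(prop, obj)

    words = []
    for token in tokens:
        surface = SURFACE_FORM.fullmatch(token)
        placeholder = PROPERTY_PLACEHOLDER.fullmatch(token)
        if surface:
            words.append(surface.group(2))
        elif placeholder and placeholder.group(1) in object_by_property:
            words.append(labels[object_by_property[placeholder.group(1)]])
        else:
            words.append(token)
    return words


# Illustrative generated sequence (hypothetical model output).
generated = ["[[Q490900, Floridia]]", "estas", "komunumo", "de", "[[P17]]", "."]
triples = [("Q490900", "P17", "Q38")]
labels = {"Q38": "Italio"}
words = realise_summary(generated, triples, labels)
```

Because the placeholder is resolved against the input triples at generation time, the model never needs an embedding for the rare entity itself.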

Further details regarding the fundamental components of our neural architecture, such as the triples encoder and the surface form tuples, can be found in our previous work [25].

4 Training and Automatic Evaluation

In this section, we describe the dataset that we built for our experiments along with the results of the automatic evaluation of our neural network architecture against the baselines.

4.1 Dataset

In order to train and evaluate our system, we created a new dataset for text generation from KB triples in a multilingual setting. We wish to explore the robustness of our approach to variable datasets with respect to language complexity and size of available training data. Consequently, we worked with two linguistically distinct Wikipedias of different sizes (see Table 1) and different language support in Wikidata [13].

This dataset aligns Wikidata triples with the first, introductory sentence of their corresponding Wikipedia articles. For each Wikipedia article, we extracted and tokenized the first sentence using a multilingual Regex tokenizer from the NLTK toolkit [1]. Afterwards, we retrieved the Wikidata item corresponding to the article and queried all triples where the item appeared as a subject or an object in the Wikidata truthy dump.

In order to create the surface form tuples (see Sect. 3.1.2), we identify occurrences of entities in the text along with their verbalisations. Due to the lack of reliable entity linking tools for underserved languages, we rely on keyword matching against Wikidata labels in the corresponding language.

For the property placeholders (described in more detail in Sect. 3.1.2), we use the distant-supervision assumption for relation extraction [18]. After identifying the rare entities that participate in relations with the main entity of the article, we replace them in the introductory sentence with their corresponding property placeholder tag (e.g. [[P17]] in Table 2). During testing, any property placeholder token that is generated by our system is replaced by the label of the entity of the relevant triple (i.e. the triple with the same property as the generated token).
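A minimal sketch of this tagging step is given below. It assumes exact token matching against labels and a hypothetical `frequent_entities` set that decides which entities count as rare; neither the matching details nor the rarity threshold are specified here, so both are illustrative choices.

```python
def tag_rare_entities(tokens, main_item, triples, labels, frequent_entities):
    """Replace mentions of rare related entities with [[Pxx]] placeholder tags."""
    tagged = list(tokens)
    for subject, prop, obj in triples:
        if main_item not in (subject, obj):
            continue
        other = obj if subject == main_item else subject
        if other in frequent_entities or other not in labels:
            continue
        label_tokens = labels[other].split()
        if not label_tokens:
            continue
        n = len(label_tokens)
        i = 0
        while i + n <= len(tagged):
            # Distant supervision: a label match is assumed to express `prop`.
            if tagged[i:i + n] == label_tokens:
                tagged[i:i + n] = ["[[" + prop + "]]"]
            i += 1
    return tagged


# Illustrative first sentence and triple for Q490900 (Floridia).
sentence = "Floridia estas komunumo de Italio .".split()
triples = [("Q490900", "P17", "Q38")]
labels = {"Q38": "Italio"}
tagged = tag_rare_entities(sentence, "Q490900", triples, labels,
                           frequent_entities=set())
```

Passing `frequent_entities={"Q38"}` would leave the sentence unchanged, since only rare entities are replaced.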

4.2 Automatic Evaluation

To evaluate how well our system generates textual summaries for Wikipedia, we compared the summaries generated by our system and by two baselines against their original counterparts from Wikipedia. We use a set of evaluation metrics for text generation: BLEU 1, BLEU 2, BLEU 3, BLEU 4, METEOR and ROUGE\(_\text {L}\). BLEU calculates n-gram precision multiplied by a brevity penalty, which penalizes short sentences to account for word recall. METEOR is based on the combination of uni-gram precision and recall, with recall weighted over precision. It extends BLEU by including stemming, synonyms and paraphrasing. ROUGE\(_\text {L}\) is a recall-based metric which calculates the length of the longest common subsequence between the generated summary and the reference.
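To make the building blocks of these metrics concrete, the sketch below computes clipped n-gram precision, the brevity penalty and an LCS-based recall for a single candidate and reference. It is a simplified illustration, not the evaluation scripts used for Table 4: full BLEU combines the precisions of several n-gram orders, METEOR is omitted, and the example sentences are invented.

```python
import math
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(candidate, reference):
    """Penalise candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_recall(candidate, reference):
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


reference = "Floridia estas komunumo de Italio".split()
candidate = "Floridia estas urbo de Italio".split()
p1 = ngram_precision(candidate, reference, 1)   # 4 of 5 unigrams match
p2 = ngram_precision(candidate, reference, 2)   # 2 of 4 bigrams match
bp = brevity_penalty(candidate, reference)      # equal lengths, no penalty
recall = rouge_l_recall(candidate, reference)   # LCS has 4 tokens
```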

4.3 Baselines for Automatic Evaluation

Due to the variety of approaches for text generation, we demonstrate the effectiveness of our system by comparing it against two baselines of different nature. Both baselines are reproducible and the code is provided in the GitHub repo.

Machine Translation (MT). For the MT baseline, we used Google Translate on English Wikipedia summaries. Those translations are compared to the actual target language’s Wikipedia entry. This limits us to articles that exist in both English and the target language. In our dataset, the concepts in Esperanto and Arabic that are not covered by English Wikipedia account for 4.3% and 30.5% respectively. This indicates the content coverage gap between different Wikipedia languages [10].

Template Retrieval (TP). Similar to template-based approaches for text generation [6, 22], we build a template-based baseline that retrieves an output summary from the training data based on the input triples. First, the baseline encodes the list of input triples that corresponds to each summary in the training/test sets into a sparse vector of TF-IDF weights [11]. Afterwards, it performs LSA [9] to reduce the dimensionality of that vector. Finally, for each item in the test set, we employ the K-nearest neighbors algorithm to retrieve the vector from the training set that is the closest to this item. The summary that corresponds to the retrieved vector is used as the output summary for this item in the test set. We provide two versions of this baseline. The first one (TP) retrieves the raw summaries from the training dataset. The second one (TPext) retrieves summaries with the special tokens for vocabulary extension. A summary can act as a template after replacing its entities with their corresponding Property Placeholders (see Table 2).
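The retrieval step of this baseline can be sketched as follows. To keep the example dependency-free, it is a simplification under stated assumptions: each triple set is represented only by its property identifiers, the LSA dimensionality reduction is omitted, the nearest neighbour is found by cosine similarity with K = 1, and the training data is invented. The code in the repository should be consulted for the actual baseline.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return sparse TF-IDF vectors (dicts) and the IDF table."""
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    # Smoothed IDF; the exact weighting scheme is an assumption.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in documents]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_summary(test_features, train_features, train_summaries):
    """Return the training summary whose triple set is closest to the test item."""
    train_vectors, idf = tfidf_vectors(train_features)
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(test_features).items()}
    best = max(range(len(train_vectors)), key=lambda i: cosine(query, train_vectors[i]))
    return train_summaries[best]


# Invented training data: property identifiers per item and a template each.
train_features = [["P17", "P31", "P131"], ["P27", "P31", "P106"]]
train_summaries = ["<template about a place>", "<template about a person>"]
retrieved = retrieve_summary(["P17", "P31"], train_features, train_summaries)
```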

Table 3. Participation numbers: total number of participants (P), total number of sentences (S), number of P that evaluated at least 50% of S, and average number of S evaluated per P.

5 Community Study

Automatic measures of text quality such as BLEU can give an indication of how close a generated text is to the reference summary. Complementarily, working with humans is generally more trusted when it comes to quality evaluation of generated text, and captures the direct response of the community. We ran a community study for a total of 15 days to answer our research questions. To address the question of whether the textual summaries can match the quality of Wikipedia (RQ1), we define text quality as fluency and appropriateness. Fluency describes the quality in terms of understandability and grammatical correctness. Appropriateness describes how well a summary fits into Wikipedia, i.e. whether a reader can identify it as part of a Wikipedia article. We assess editors’ reuse to answer whether we can generate summaries that are useful for Wikipedia editors (RQ2). Our evaluation targets two different communities: (1) readers: any speaker of Arabic or Esperanto who reads Wikipedia, independent of their activity on Wikipedia, and (2) editors: any active contributor to the Arabic or Esperanto Wikipedia. Readers were asked to fill in one survey combining fluency and appropriateness. Editors were also asked to fill in an additional survey. To sample only participants with previous activity on Wikipedia, we asked them about their reading and editing activity on Wikipedia. The survey instructions and announcements were translated into Arabic and Esperanto.

Recruitment. For the recruitment of readers, we wanted to reach fluent speakers of the language. For Arabic, we got in contact with Arabic-speaking researchers from research groups working on Wikipedia-related topics. For Esperanto, as there are fewer speakers and they are harder to reach, we promoted the survey on social media such as Twitter and Reddit using the researchers’ accounts. For the recruitment of editors, we posted on the editors’ mailing lists. Additionally, for Esperanto we posted on the Wikipedia discussion page. The Arabic editors’ survey was also promoted at WikiArabia, the conference for the Arabic-speaking Wikipedia community. The participation numbers for all surveys can be found in Table 3.

Table 4. Automatic evaluation of our model against all other baselines using BLEU 1–4, ROUGE and METEOR on the validation and the test set for both Arabic and Esperanto.

Fluency. We answer whether we can generate summaries that match the quality and style of Wikipedia content in a study with 54 Wikipedia readers from two different Wikipedia languages. We created a corpus consisting of 60 summaries, of which 30 are generated through our approach, 15 are from news, and 15 are Wikipedia summaries from the training dataset. For news in Esperanto, we chose introduction sentences of articles in the Esperanto version of Le Monde Diplomatique. For news in Arabic, we chose introduction sentences from the RSS feed of BBC Arabic. Each participant was asked to assess the fluency of the text. We employ a scale from 0 to 6, where: (6) Excellent: the given sentence has no grammatical flaws and the content can be understood with ease; (3) Moderate: the given sentence is understandable, but has minor grammatical issues; (0) Non-understandable: the given sentence cannot be understood. For each sentence, we calculate the mean quality given by all participants and then average over all summaries in each corpus.
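The two-step averaging can be illustrated with a short Python sketch. The ratings below are invented for the example, and the function name is hypothetical.

```python
from statistics import mean


def corpus_fluency(ratings_by_sentence):
    """Mean over sentences of the mean rating each sentence received."""
    return mean(mean(ratings) for ratings in ratings_by_sentence.values())


# Invented ratings on the 0-6 scale; sentences may have different rater counts.
ratings = {"sentence_1": [6, 5, 4], "sentence_2": [3, 3]}
score = corpus_fluency(ratings)                               # mean of 5 and 3
pooled = mean(r for rs in ratings.values() for r in rs)       # mean of all 5 ratings
```

Averaging per sentence first gives every summary the same weight, whereas pooling all ratings would favour sentences that were rated by more participants.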

Appropriateness. As we used the same survey for both fluency and appropriateness, participants answered questions regarding appropriateness over the same set of sentences. They were asked to assess whether the displayed sentence could be part of a Wikipedia article. We test whether a reader can tell from just one sentence whether a text is appropriate for Wikipedia, using the news sentences as a baseline. This gives us an insight into whether the text produced by the neural network “feels” like Wikipedia text. Participants were asked not to use any external tools for this task. Readers had just two options to choose from (Yes and No).

Table 5. Results for fluency and appropriateness.

Editors Reuse. We randomly chose 30 items from our test set. For each item, each editor was offered the generated summary and its corresponding set of triples and was asked to write a paragraph of 2 or 3 sentences. Editors had the freedom to copy from the generated summary, or to work completely from scratch. We assessed how editors used our generated summaries in their work by measuring the amount of text reuse. To quantify the amount of reuse in text we use the Greedy String-Tiling (GST) algorithm [27]. GST is a substring matching algorithm that computes the degree of reuse or copying between a source text and a dependent one. GST is able to deal with cases where a whole block is transposed, unlike other algorithms such as the Levenshtein distance, which treats such a transposition as a sequence of single insertions or deletions rather than a single block move. Given a generated summary \(S = s_1, s_2,\ldots \) and an edited one \(D = d_1, d_2,\ldots \), each consisting of a sequence of tokens, GST will identify a set of disjoint longest sequences of tokens in the edited text that exist in the source text (called tiles) \(T = \{t_1, t_2,\ldots \}\). It is expected that there will be common stop words appearing in both the source and the edited text. However, we are rather interested in knowing how much of the real structure of the generated summary is being copied. Thus, we set the minimum match length factor \(mml = 3\) when calculating the tiles, s.t. \(\forall t_i \in T : t_i \subseteq S \wedge t_i \subseteq D \wedge |t_i| \ge mml \) and \(\forall t_i,t_j \in T | i \ne j: t_i \cap t_j = \emptyset \). This means that copied sequences of one or two words will not count in the calculation of reuse. We calculate a reuse score gstscore by summing the lengths of the detected tiles and normalizing by the length of the generated summary.

$$\begin{aligned} gstscore(S, D) = \frac{\sum _{t_i \in T} |t_i|}{\left| S \right| } \end{aligned}$$

We classify each of the edits into three groups according to the gstscore as proposed by [4]: (1) Wholly Derived (WD): the summary structure has been fully reused in the composition of the editor’s text (\(gstscore \ge 0.66\)); (2) Partially Derived (PD): the summary has been partially used (\( 0.66 > gstscore \ge 0.33\)); (3) Non Derived (ND): The summary has been changed completely (\( 0.33 > gstscore\)).
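The reuse computation can be sketched in Python as follows. This is a straightforward, unoptimised rendering of greedy string tiling written for illustration, not the implementation used in the study; the example sentences are invented.

```python
def greedy_string_tiling(source, edited, mml=3):
    """Return disjoint tiles (source index, edited index, length), length >= mml."""
    marked_s = [False] * len(source)
    marked_d = [False] * len(edited)
    tiles = []
    while True:
        max_match = mml
        matches = []
        for i in range(len(source)):
            for j in range(len(edited)):
                k = 0
                while (i + k < len(source) and j + k < len(edited)
                       and source[i + k] == edited[j + k]
                       and not marked_s[i + k] and not marked_d[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        if not matches:
            break
        for i, j, k in matches:
            # Skip matches occluded by a tile created earlier in this round.
            if not any(marked_s[i:i + k]) and not any(marked_d[j:j + k]):
                for offset in range(k):
                    marked_s[i + offset] = True
                    marked_d[j + offset] = True
                tiles.append((i, j, k))
    return tiles


def gstscore(source, edited, mml=3):
    if not source:
        return 0.0
    tiles = greedy_string_tiling(source, edited, mml)
    return sum(length for _, _, length in tiles) / len(source)


def classify_reuse(score):
    if score >= 0.66:
        return "WD"
    if score >= 0.33:
        return "PD"
    return "ND"


generated = "Floridia estas komunumo de Italio .".split()
edited = "Floridia estas komunumo de Italio en Sicilio .".split()
score = gstscore(generated, edited)   # one tile of 5 tokens over 6 source tokens
category = classify_reuse(score)
```

In this example the final full stop appears in both texts but does not contribute to the score, because a match of a single token is shorter than \(mml = 3\).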

Table 6. Percentage of summaries in each category of reuse. A generated summary (top) and the same summary after it was edited (bottom). Solid lines represent reused tiles, while dashed lines represent overlapping sub-sequences not contributing to the gstscore.

6 Results and Discussions

In this section, we will report and discuss our experimental findings with respect to the two research questions.

6.1 Automatic Evaluation

As displayed in Table 4, our model shows a significant enhancement compared to our baselines across the majority of the evaluation metrics in both languages. We achieve a 3.01 and 5.11 enhancement in BLEU 4 score in Arabic and Esperanto respectively over TPext, the strongest baseline. MT of English summaries is not competitive. We attribute this result to the differences in the way of writing across different Wikipedia languages – this inhibits MT from being sufficient for Wikipedia document generation. The results show that generating language directly from the knowledge base triples is a much more suitable approach.

6.2 Community Study

We present the results of the community study in order to find whether we could generate textual summaries that match the quality and style of Wikipedia (RQ1) and can support editors (RQ2).

Fluency (Table 5). Overall, the quality of our generated summaries is high (4.7 points on average in Arabic, 4.5 in Esperanto). In Arabic, 63.3% of the summaries were rated at least 5 (out of 6) on average. In Esperanto, 50% of the summaries have an average quality of at least 5 (out of 6), with 33% of all summaries given a score of 6 by all participants. This means the majority of our summaries are highly understandable and grammatically correct. Furthermore, participants consider our generated summaries to have an average quality similar to that of Wikipedia summaries and news from widely read media organizations.

Appropriateness (Table 5). 77% (resp. 69%) of the generated Arabic (resp. Esperanto) summaries were categorized as being part of Wikipedia. In comparison, news sentences were more often judged not to fit: in only 35% (Arabic) and 52% (Esperanto) of cases did readers mistake them for Wikipedia sentences. Wikipedia sentences were clearly recognized as such (77% and 84%), with scores that closely match those of the summaries generated by our model. Wikipedia has a certain writing style that seems to differ clearly from news. Our summaries are able to reflect this writing style, being more likely to be evaluated as Wikipedia sentences than the news baseline; we can therefore expect the generated summaries to blend seamlessly with other Wikipedia content.

Editors Reuse (Table 6). Our summaries were highly reused. 79% of the Arabic generated summaries and 93% of the Esperanto generated summaries were either wholly (WD) or partially (PD) reused by editors. For the wholly derived edits, editors tended to copy the generated summary with minimal modifications, such as subsequences A and B in Arabic or subsequence G in Esperanto in Table 6. One of the common things that hampers full reusability is the presence of “rare” tokens, () in Arabic and (mankas vorto) in Esperanto. Usually, these tokens are produced when the output word is not in the model vocabulary, i.e. it has not been seen frequently by our model, as is the case for names in different languages. As can be seen in tiles E and D in the Arabic examples in Table 6, editors prefer in those cases to adapt the generated sentences. This can go as far as the editor deleting the whole subsentence if it contains a high number of such tokens (subsequence H in Table 6). By examining our generated summaries, we find that such missing tokens are 2.2 times more likely to appear in Arabic than in Esperanto. The higher observed reusability by editors of the Esperanto generated summaries (78.98% WD) in comparison to Arabic (45.45% WD) can be attributed to this, which can be explained as follows. First, the significantly larger vocabulary size of Arabic lowers the probability of a word having been seen by the Arabic model. Second, since the majority of rare tokens are named entities mentioned in foreign languages and since the Latin script of Esperanto is similar to that of many other languages, the Esperanto model has an advantage over the Arabic one when capturing words representing named entities.

7 Conclusions

We introduce a system that extends Wikipedia’s ArticlePlaceholder with multilingual summaries automatically generated from Wikidata triples for underserved languages on Wikipedia. We show that the encoder-decoder architecture that we propose is able to perform better than strong baselines of different natures, including MT and a template-based baseline. We ran a community evaluation study to measure to what extent our summaries match the quality and style of Wikipedia articles, and whether they are useful in terms of reuse by Wikipedia editors. We show that members of the targeted language communities rank our text close to the expected quality standards of Wikipedia, and are likely to consider the generated text as part of Wikipedia. Lastly, we found that editors are likely to reuse a large portion of the generated summaries, thus emphasizing the usefulness of our approach to its intended audience.