Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction

We propose a language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopedia's category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph model reaches an average precision of 84% on in-domain articles, outperforming an alternative model based on information retrieval techniques. As manual evaluations are costly, we introduce the concept of domainness and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with human judgments, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities.


Introduction
Different natural language processing (NLP) and information retrieval (IR) tasks require large amounts of domain-specific text with different levels of parallelism. With such data, one can obtain in-domain lexicons and semantic representations of concepts, or train specialised machine translation engines and question answering systems. A common strategy to gather multilingual domain-specific material is crawling the Web; e.g., looking for different language editions of a website (Resnik and Smith, 2003; Esplá-Gomis and Forcada, 2009). Nowadays, one of the largest controlled sources for this kind of text at the fingertips is the Wikipedia -an online encyclopaedia with millions of topic-aligned articles in multiple languages.1 Wikipedia is not only comparable, but some fragments, and even full articles, are parallel across languages due to cross-language (CL) text re-use.2 In this paper, we explore the value of the Wikipedia as a source for domain-specific comparable text with a practical perspective. We present a methodology to extract in-domain articles by taking advantage of Wikipedia's categorisation mark-up and its graph structure. The multilingual aspect of the resource facilitates the extraction of multilingual counterparts.
In our experiments, we extract collections with different systems in 10 languages and 743 domains, and manually evaluate the adequacy to the domain for a subset of the collections. Nevertheless, manual evaluations are expensive both in terms of time and money. An automatic evaluation is problematic in this area since, to our knowledge, there is no accepted way to measure how well a collection represents a domain. To this end, we define the concept of domainness as a combination of the representativity and cohesion of texts. We introduce several automatic metrics that model the occurrence, co-occurrence, and distribution of the characteristic vocabulary, and the semantic similarity among the articles.
We release the implementation of our architectures and the quality metrics within WikiTailor, a Java toolkit designed to extract and analyse corpora from Wikipedia in any language and domain. Both a stand-alone executable and the source code are available.3 As a result, the generation of comparable resources becomes relatively easy. We also make available some of the collections generated in our analysis. We share the domain-specific term vocabularies and the identifiers of the articles obtained with our best model for all the domains in the languages under study -English, French, Spanish, German, Arabic, Romanian, Catalan, Basque, Greek, and Occitan.4 Notice that, contrary to domain names, vocabularies are not parallel but can be useful for other cross-language or multilingual applications.
The rest of the paper is distributed as follows. Section 2 overviews comparable corpora acquisition methods, with special focus on the categorisation and multilinguality of the Wikipedia. The relevance of Wikipedia for NLP and IR is also highlighted. Section 3 summarises some related work and points out similarities and differences with our study. Section 4 presents our models for the automatic extraction of (multilingual) in-domain corpora. Section 5 describes the experimental setting, analyses the characteristics of the collections extracted, and reports the results of our manual evaluation to assess their quality. In Section 6, we define the concept of domainness and introduce several automatic evaluation metrics, and in Section 7 we use them to quantify the quality of the produced collections and analyse the correlation with human judgements. We summarise the work and draw our conclusions in Section 8. We include a glossary with Wikipedia-specific terms in Appendix A and detail the crowdsourcing experiment that leads to our manual evaluation in Appendix B.

Comparable Corpora and the Wikipedia
Multiple kinds of Web contents have been used as a source for the acquisition of comparable corpora. Usually, the first stage consists of acquiring the documents in the required languages (Resnik and Smith, 2003; Talvensaari et al., 2008; Aker et al., 2012; Plamada and Volk, 2013). The second stage is usually alignment; i.e., identifying pairs of comparable documents (Pouliquen et al., 2003; Tao and Zhai, 2005; Munteanu and Marcu, 2005; Vu et al., 2009; Gamallo Otero and González López, 2010). Among them, Plamada and Volk (2013) and Gamallo Otero and González López (2010) are especially relevant to this work since they use Wikipedia as a corpus. In this case, and up to the limitations we discuss later, alignment is close to trivial due to the existing links between articles in different languages.
In general, three properties make the Wikipedia particularly suitable as a source of comparable and parallel data: (i) it contains editions in a large number of languages,5 (ii) articles covering the same topics in different language editions are connected via inter-language links, also called langlinks, and (iii) articles have categories whose purpose is both to describe the topic covered and to group together related articles.
Nevertheless, it also presents drawbacks. First, the inter-language links (as many other characteristics) are subject to inconsistencies because most often they are manually created by volunteers. A volunteer may make mistakes linking non-equivalent concepts; there are even cases in which an article in one edition is linked from two or more articles in another one (Hecht and Gergle, 2010). Second, an article can belong to multiple categories and it is even possible to construct loops with categories; i.e., no strict tree hierarchy is in place (Zesch and Gurevych, 2007). Given that categories are built collaboratively, they are arbitrary at times, many articles are not associated to the categories they should objectively belong to, and one can observe the phenomenon of overcategorization.6 Consequently, the Wikipedia category graph (WCG) and the links between languages must be considered carefully in order to extract topic-aligned articles across multiple language editions.
Moreover, the intersection across languages tends to be relatively small. In general, smaller Wikipedia editions are not subsets of the larger ones. In the dumps considered for this study, only 0.4% of the articles are common across all 10 editions, which are within the top-100 according to their size. Among the largest four editions, which represent relatively close cultures (English, French, Spanish, German), the number only grows to 4.8%. Hecht and Gergle (2010) called this effect context diversity. According to their analysis, the articles in the intersection correspond to "globally relevant concepts", whereas the singletons show cultural diversity. Recent studies show that the level of diversity across languages remains, both for text and for images (He et al., 2018). Consequently, the importance and presence of different topics depends on the language. So, one should expect to obtain comparable corpora more easily for topics associated to the globally relevant concepts.
Even with all this noise, which must be acknowledged and taken into account, the Wikipedia has been widely and successfully used in (CL)-NLP and (CL)-IR. For example, it has been used for terminology and bilingual dictionary extraction (Erdmann et al., 2008; Yu and Tsujii, 2009; Prochasson and Fung, 2011; Chu et al., 2014; Jakubina and Langlais, 2016). In most of these models, Wikipedia's inter-language links are crucial to obtain an aligned comparable corpus.
The value of the Wikipedia as a source of highly comparable and parallel sentences was soon observed too (Adafre and de Rijke, 2006; Yasuda and Sumita, 2008; Smith et al., 2010; Plamada and Volk, 2012; Ştefănescu et al., 2012; Skadiņa et al., 2012; Barrón-Cedeño et al., 2015). With the rise of deep learning for NLP and the need of large amounts of clean data, the use of Wikipedia has grown exponentially not only for parallel sentence extraction and machine translation (Varga, 2017; Harsha Ramesh and Prasad Sankaranarayanan, 2018; Ruiter et al., 2019; Schwenk et al., 2019), but also for training models to obtain semantic representations of words and sentences.
Word and contextual embeddings have been trained on it, so that the resources are nowadays at hand for more than 100 languages. Examples include fastText (Bojanowski et al., 2017; Grave et al., 2018) and MUSE word embeddings (Lample et al., 2018), BERT multilingual embeddings (Devlin et al., 2019), and LASER sentence embeddings (Artetxe and Schwenk, 2019).
Semantic representations can also be obtained via explicit semantic analysis (ESA) (Gabrilovich and Markovitch, 2007) and have been widely used in IR to compute the semantic relatedness of concept vectors. CL-ESA (Hassan and Mihalcea, 2009; Potthast et al., 2008) is a cross-language extension of the explicit semantic analysis model which allows for computing this semantic relatedness across languages. Compared to neural network-based embeddings, CL-ESA representations are less sensitive to the amount of training data and to differences in sizes among languages (see Section 6.2), and therefore they are adequate within the multilingual setting we present in this work.
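To make the CL-ESA idea concrete, the following sketch (our own illustration, not the implementation used in this work; the inverted-index format and the `langlinks` mapping are assumptions) represents a text as a tf-idf-weighted vector over Wikipedia concepts and compares two texts across languages by projecting one vector through the inter-language links:

```python
import math
from collections import Counter

def esa_vector(tokens, inverted_index, n_docs):
    """Map a token list to a weighted vector over Wikipedia concepts.
    inverted_index: token -> {concept_id: term_frequency} (assumed format)."""
    vec = Counter()
    for tok, tf_query in Counter(tokens).items():
        postings = inverted_index.get(tok, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for concept, tf in postings.items():
            vec[concept] += tf_query * tf * idf
    return vec

def cl_esa_similarity(vec_l1, vec_l2, langlinks):
    """Cosine similarity after projecting L1 concepts onto L2 concepts
    through the inter-language links (langlinks: L1 id -> L2 id)."""
    projected = Counter()
    for concept, weight in vec_l1.items():
        if concept in langlinks:  # keep only concepts aligned across editions
            projected[langlinks[concept]] += weight
    dot = sum(w * vec_l2.get(c, 0.0) for c, w in projected.items())
    n1 = math.sqrt(sum(w * w for w in projected.values()))
    n2 = math.sqrt(sum(w * w for w in vec_l2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Because the concept space is the set of linked Wikipedia articles rather than learned parameters, the representation degrades gracefully for smaller editions: only the dimensions without an inter-language link are lost.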

Related Work
Gamallo Otero and González López (2010) proposed the basis to extract different kinds of comparable material from the Wikipedia by exploiting its metadata (category tags) and the WCG. They distinguished among collections built with: (i) non-aligned articles, defined as those which belong to the same topic merely because they are assigned the same category; (ii) strongly-aligned articles, those which are connected by an inter-language link and both belong to the same category; and (iii) softly-aligned articles, those which are connected by an inter-language link but do not necessarily share the same category. These extractions are implemented in CorpusPedia.7 The tool is designed to extract comparable corpora from the Wikipedia by considering a pair of languages and a category. Given a category, the tool generates the three kinds of comparable corpora by considering every article belonging to it and its sub-categories one level deep.
Our work shares many similarities with theirs. We also consider the Wikipedia as the corpus and exploit its metadata. For the alignment, we opt for the first type: we retrieve all the articles -even if they are not linked- in two or more languages which belong to the same domain, extending the method to deal with complete domains instead of individual categories. In order to extract the domains, we explore the WCG and, as a result of avoiding their "strict" strategy based on the exact category, we are able to extract more articles. This idea was first sketched in Barrón-Cedeño et al. (2015), where we also extracted parallel sentences from the comparable corpora in Computer science, Science and Sport to successfully domain-adapt a machine translation system.
The WCG is close to a taxonomy structure (Zesch and Gurevych, 2007) and can be easily explored, but the exploration might be slow given the size of some Wikipedia editions, the high density of the graph, and the existence of loops. Several works facilitate the task. PetScan8 is an online utility that retrieves all the articles departing from the root category up to a desired depth. Aspert et al. (2019) introduced a graph database structure -in such management systems traversing the graph and performing breadth-first search is very efficient- and provide the database for the English Wikipedia with monthly updates. Differently to us, these utilities, as CorpusPedia, also expect the user to input the depth up to which to define the traversal for a root category.
In an approach completely unrelated to graphs, Plamada and Volk (2012, 2013) proposed a model for retrieving Wikipedia articles associated to a domain based on a typical search engine. Their final purpose was to retrieve parallel sentences for domain-specific statistical machine translation. The authors processed the collection of Wikipedia articles as follows. Given two Wikipedia editions in languages L and L′, they (i) identify the subset of articles in language L for which a corresponding article exists in language L′ (i.e., an inter-language link connects them), and (ii) index the resulting documents. In order to retrieve the relevant articles, the index is queried with 100 in-domain keywords, the most frequent ones in an external in-domain corpus. In this case, the information about the Wikipedia structure is not used at all and the selection of in-domain articles fully depends on their contents. Due to the completely different nature of this system with respect to our approach, we use it throughout our work for comparison purposes. Plamada and Volk (2012) also showed the difficulties of using Wikipedia categories for the extraction of articles in the Alpine domain. In their experiments, they found that some articles within the main namespace lack a category tag and that the categories assigned to the same article in different languages do not overlap.
Full projects have also been devoted to the topic. The ACCURAT project9 released a toolkit for multi-level alignment and information extraction from comparable corpora. The toolkit (Pinnis et al., 2012) operates on different levels: (i) alignment of comparable documents, (ii) extraction of parallel sentences, (iii) extraction of terminologies, and (iv) extraction of named entities. The toolkit can be applied on the Wikipedia to extract a general-domain comparable corpus, and it retrieves the documents by analysing comparable segments in the candidates. A series of similarity metrics is applied to determine the level of comparability of a pair of documents. The approach and aim of the tool are completely different to ours. In their case, the main purpose is the comparability of corpora. Our focus is the domain; the comparability is a direct consequence given that (i) at corpus level, if the languages cover the same domain, the corpora are comparable and (ii) at document level, comparability can be established using the inter-language links.10
Linguatools11 released three Wikipedia-derived corpora in 23 different languages: a monolingual corpus with more than 5 billion tokens; a comparable corpus with more than 41 million bilingually-aligned Wikipedia articles for 253 language pairs; and two parallel corpora, one with bilingual titles, extended with redirects and textlinks, with almost 500 M parallel segments, and the other one with 7 k sentence pairs extracted from bilingual English-German quotations. Unfortunately, neither the tool nor the methodology for the extraction are available in this case. Still, similarly to the corpus that can be obtained with the ACCURAT toolkit, these comparable corpora do not belong to a specific domain but to the whole Wikipedia.
As said in the introduction, we release the WikiTailor toolkit and 7,430 in-domain collections. WikiTailor further allows for extracting the intersection and union of collections in multiple languages at the same time and for the extraction of multilingual (in-domain) titles in several languages. Parallel titles can also be obtained with a tool12 from LTI/CMU, but we extend this functionality to go beyond only two languages.

Models for Domain-based Article Selection
We tackle the automatic extraction of domain-specific comparable corpora with two alternative models. Both are language independent -as far as the tools to perform standard pre-processing are at hand- and can be applied to any domain without a priori information.

Graph-based Model
In this approach we take advantage of the user-generated categories associated to most Wikipedia articles. As aforementioned, even if these categories are imperfect, they offer important hints on the domain an article belongs to. Ideally, the categories and sub-categories should compose a category tree, and one could traverse the tree to extract the related categories hanging from a specific domain (root category).13 Nevertheless, the categories in the Wikipedia compose a densely-connected graph G and the traversal is not trivial. Figure 1 is an example of the difficulties inherent to the WCG topology (although this particular example comes from the Wikipedia in English, similar phenomena can be observed in other editions). Firstly, the paths from two unrelated categories, Space and Language, converge in common nodes early in the graph: in category Geometric measurement at depths 2 and 7, respectively. As a result, not only would Geometric measurement be considered a sub-category of both Space and Language, but so would all its descendants. Notice also that the topic of the root category gets diluted as we go deeper into the graph and can change to another topic. The 6th level departing from Language in this path already talks about physics. Secondly, G contains cycles, as observed in the sequence Space → Geometry → Geometric measurement → Dimension → Space. The exploration of the graph is therefore non-trivial.
The previous example shows that one cannot consider Wikipedia's category pseudo-tree from a root category to its leaves to define a domain. Therefore, we designed a strategy to walk through the category graph departing from a user-defined root category up to the level that most likely represents an entire knowledge domain. We tailor the Wikipedia to fit our purpose; that is, to build a well-formed tree representing a domain. Figure 2 shows the two modules of our graph-based model, which we describe below. The input consists of the domain of interest and the pre-existing full category graph.

Module 1: Vocabulary Definition. The objective of this module is building the characteristic domain vocabulary V. It consists of four submodules. Root category identification: we select the root category c_r that best matches the desired domain (e.g., Sport). Our vocabulary definition and the graph exploration process depart from such selected node. Seed articles selection: next, we identify every article which belongs to category c_r. The resulting set of articles is the seed for the in-domain vocabulary generation. If the resulting number of seed articles is small (< 10 in our experiments), we include those articles associated to the children categories as well. Concatenation and pre-processing: the resulting set of articles is concatenated into one single document and we apply the following pre-processing operations: tokenisation, stopword removal, removal of numbers, diacritics and punctuation marks, and stemming (Porter, 1980). In order to further reduce noise, we discard tokens shorter than four characters (we threshold at three for Arabic, as most roots in this language are triliteral (Darwish and Magdy, 2014, p. 4)). Ranking: we compute term frequency and rank the terms accordingly. The output of this step consists of the top-n tf-ranked terms.
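The vocabulary-building submodules above can be sketched as follows (a simplified illustration, not the WikiTailor code: the stopword list is a stub and `stem` is a trivial suffix-stripping placeholder for the Porter stemmer used in the paper):

```python
import re
import unicodedata
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "a", "is"}  # tiny placeholder list

def normalise(token):
    # Strip diacritics, keeping the base letters.
    token = unicodedata.normalize("NFKD", token)
    return "".join(c for c in token if not unicodedata.combining(c))

def stem(token):
    # Placeholder for the Porter stemmer (Porter, 1980) used in the paper.
    return re.sub(r"(ing|ed|es|s)$", "", token)

def characteristic_vocabulary(seed_articles, top_n=100, min_len=4):
    """Concatenate the seed articles, preprocess, and keep the
    top-n tf-ranked stems (min_len=3 would be used for Arabic)."""
    text = " ".join(seed_articles).lower()
    tokens = re.findall(r"[^\W\d_]+", text)  # drops numbers and punctuation
    counts = Counter(
        stem(normalise(t)) for t in tokens
        if t not in STOPWORDS and len(t) >= min_len
    )
    return [term for term, _ in counts.most_common(top_n)]
```

Stemming before counting merges inflected variants, so a domain word appearing as, e.g., "sport", "sports", and "sporting" contributes a single high-frequency vocabulary term.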
Module 2: Graph-based Article Selection. The second module explores the category graph to find those categories which are likely to belong to the desired domain and extracts the associated articles. The input for this step is the root category c_r, which represents the ceiling of the domain to be retrieved (e.g., Sport), and the produced characteristic vocabulary V. We perform a breadth-first search departing from node c_r. Different criteria can be considered to stop the search in order to avoid exploring practically the entire graph. Our stopping criterion is inspired by the classification tree-breadth first search model by Cui et al. (2008). The objective is scoring the explored categories in order to assess their likelihood of actually belonging to the desired domain. Our strategy assumes that a category belongs to the domain only if its title contains at least one of the words in the previously defined vocabulary. Nevertheless, many categories exist that may not include any of the words in the vocabulary. A naïve but efficient solution is to consider subsets of categories according to their depth with respect to the root, and include or exclude the full subset (level). Therefore, we traverse G and score each tree level by measuring the percentage of its categories that are associated to the domain by means of containing at least one term of the vocabulary in the title. The process stops when less than k% of the categories are related to the vocabulary. In the example represented in Figure 2, both categories in the first level fulfill the constraints. Two out of three do in the second level and three out of five do in the third one. In the fourth level, only four out of nine categories include a characteristic term in their titles. Assuming a threshold of 50%, that level in the tree is discarded and all the articles associated to the categories of the tree, up to the third level, compose the output of this process.

Figure 3: The IR-based in-domain article selection pipeline. Notice that vocabulary definition is identical to the one in the graph-based approach (cf. Figure 2). Orange rounded blocks represent processes. Green rectangles represent outcomes.
This article selection method has two free parameters: the size of the vocabulary and the percentage of categories with an in-domain term in the title that we require to include a level in the extraction. Section 5 describes the characteristics of the extractions according to these parameters.

IR-based Model
For comparison with the graph-based model, we include one based on standard IR techniques. A model for retrieving Wikipedia articles associated to a domain based on a typical search engine was proposed in Plamada and Volk (2013) (see Section 3). Here, we implement a similar method that consists of three steps, as depicted in Figure 3.

Module 0: Article Indexing. As an offline preliminary process, we index every Wikipedia edition and set up a search engine (right-hand side of the bottom block in Figure 3). For this, we use the Apache Lucene open-source search engine14 and perform a pre-processing pipeline identical to the one in the graph-based model.
Module 1: Vocabulary Definition. Again, we apply the same pre-processing and ranking strategy to define the necessary domain vocabulary.
Module 2: IR-based Article Selection. Finally, we query the search engine with the vocabulary and retrieve the set of articles that presumably belong to the domain of interest. The quality of the vocabulary is even more relevant in this case, as an overly loose term list could result in retrieving almost the full document collection.
The IR-based article selection method has two free parameters: the size of the vocabulary and the threshold for the relevance of the articles. Section 5 describes the characteristics of the extractions according to these parameters.
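The selection step can be approximated in a few lines (a sketch only: a plain term-frequency overlap stands in for Lucene's ranking function, and the index is a simple article-to-tokens mapping):

```python
from collections import Counter

def ir_article_selection(index, query_terms, score_fraction=0.1):
    """Query an index with the domain vocabulary and keep articles whose
    relevance is at least score_fraction of the maximum score: 0.1 mimics
    IR10, 0.01 mimics IR100, and 0 mimics IRall."""
    query = set(query_terms)
    scores = {
        article: sum(tf for term, tf in Counter(tokens).items() if term in query)
        for article, tokens in index.items()
    }
    scores = {a: s for a, s in scores.items() if s > 0}  # drop non-matches
    if not scores:
        return []
    cutoff = score_fraction * max(scores.values())
    return sorted(a for a, s in scores.items() if s >= cutoff)
```

Because the cutoff is relative to the maximum score, the same fraction can select rather different numbers of articles depending on how the query terms are distributed over the edition.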

In-Domain Collection Extraction
In this section we explore the collections obtained when applying the two described models but, first, we describe the experimental framework in which they are evaluated.

Framework and Domains Definition
We select ten Wikipedia editions that serve as archetypes for different development levels, both in terms of number of articles and richness of contents: English, French, Spanish, German, Arabic, Romanian, Catalan, Basque, Greek, and Occitan. The set also covers different language families, including Germanic, Romance, and Semitic. We use dumps15 and preprocess them with JWPL (Zesch et al., 2008).16
We use only the subset of content articles in the dumps -those that belong to the main namespace- and discard redirection and disambiguation pages.17 Table 1 summarises the main figures of the resulting collections.
We define a set of root categories in order to choose the domains in our study. The root categories should mimic the choice of topics or domains that a user would be interested in. Following Hecht and Gergle (2010), we look for the globally relevant concepts and assume that a category represents a general domain if it appears in all ten languages -even if those ten languages do not cover all the majority cultures in the world. Applying this constraint results in a pool of 2,081 categories (cf. Table 1). We further eliminate categories starting with the same word, keeping only one of the family in any of the languages. The purpose is gathering a more heterogeneous and general set.18 We eliminate categories that begin with a digit as well, for similar reasons. This cleanup results in a final collection of 741 categories. Categories used in previous research are included -if not already present- for comparison purposes: Archaeology, Linguistics, Physics, Biology, and Sport (Gamallo Otero and González López, 2011); Mountaineering (Plamada and Volk, 2013); and Computer Science (Barrón-Cedeño et al., 2015). Observe that Computer Science does not exist in the Greek edition nor Mountaineering in the Occitan one. With these additions, we finally consider 743 core domains.
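The pool construction described above can be sketched as follows (an illustration under the assumption that category titles have already been mapped to a single pivot language through the inter-language links):

```python
def candidate_domains(categories_per_edition):
    """Build the domain pool: keep category titles present in every edition,
    drop titles starting with a digit, and keep a single representative per
    leading word. categories_per_edition: one set of titles per language."""
    common = set.intersection(*categories_per_edition)
    pool, seen_first_words = [], set()
    for title in sorted(common):
        if title[0].isdigit():
            continue
        first_word = title.split()[0].lower()
        if first_word in seen_first_words:  # keep one category per family
            continue
        seen_first_words.add(first_word)
        pool.append(title)
    return pool
```

The intersection step implements the "globally relevant concepts" constraint, while the two filters yield the more heterogeneous and general set described above.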
For IR, we query the engine with the top 100 or 50 terms. The first threshold allows for a direct comparison with Plamada and Volk (2013). In their case, the characteristic vocabulary is defined as the 100 most frequent words (not terms) in an external corpus. Our IR model is clearly inspired by theirs, but we try to keep all the requirements fulfilled inside the Wikipedia itself, hence we avoid using external corpora. In the experiments, we build the collection with all the retrieved articles (IRall), those with a relevance score higher than a hundredth of the maximum (IR100), or those with a relevance higher than a tenth of the maximum (IR10). The combined nomenclature is equivalent to the WT models: 100-IR10, 100-IR100, 100-IRall, 50-IR10, 50-IR100 for individual models, and a wildcard indicates groups.

Characteristic Vocabulary
The first step in both architectures involves the extraction of the characteristic vocabulary of the domain. Following the pipeline described in Section 4.1, we extract the vocabularies for different language editions in the 743 categories (domains).
Table 2 shows statistics on the number of articles and the size of the vocabularies. As a general trend, the number of root articles diminishes with the size of the Wikipedia edition. This is true for all the languages but German and Arabic. Notice that even for English -the largest edition- the mean number of root articles is 99, but the mode is as low as 2. Therefore, in many domains the root articles are not enough to obtain a large enough vocabulary. This is partly solved by also including the articles in the subcategories when there are fewer than 10 articles in the root. In general, there is a chain relation: the larger the edition, the larger the number of articles in the root category. This results in more terms and larger vocabularies, potentially inducing noisy vocabularies for large editions or for editions such as the German one, with lots of root articles. Since the quality of this vocabulary is a core factor in our methods, we explore several alternatives in our experiments.
Taking only the top 10% of the terms, the size of the vocabulary is completely language-dependent. A similar thing happens with 500 elements, since the cut only affects the major languages. For the last configuration, with a maximum of 100 elements, the size of the vocabulary is the same, at least on average, for all the languages.
We can now study the distribution of this vocabulary along the graph. We consider that a category belongs to the desired domain if it has an in-vocabulary term in its title. Figure 4 depicts the evolution of the percentage of categories supposedly associated to the Sport domain (top plot) and to Astronomy (bottom plot) in the ten Wikipedia editions under study. As expected, the farther the level from the root, the lower the proportion of associated categories (but also the larger the number of elements). Peaks at deeper levels can appear due to the noisy category structure of the Wikipedia, which means that, after departing from the original domain, a path can return to it (e.g., the peak at level 13 for Sport in Occitan or at level 12 for Astronomy in German). Nevertheless, the distribution is rough and, at the lowest levels, the small number of articles can lead to artificial canyons in the curves (e.g., the canyon at level 2 for Sport in Basque). This effect depends on the domain and the language. In this work, we deal with more than 7,000 domains (743 domains times 10 languages), so, on average, the effect is not important and the whole process is done fully automatically. However, in order to obtain a corpus in a concrete language and domain, a visual inspection of the shape of this curve helps to determine the stopping point of the method.

Collections Characteristics
WikiTailor automatically determines the depth from the root up to which it should extract articles according to the percentage of in-vocabulary categories, and this is a crucial point for the extraction. Different percentages lead to different stopping points and consequently different collection sizes. Looking at the specific numbers in Table 3, we see that the threshold depth seems to be directly proportional to the size of the characteristic vocabulary and the number of categories. In general, the more categories in a Wikipedia edition, the more levels are used to describe a root category. These two features are more important than the choice between taking levels with 50% or 60% of positives. For a given language, the most relevant feature is the size of the vocabulary, especially for small editions: smaller vocabularies imply smaller threshold depths. For Romanian, Catalan, Basque, and Greek, systems with 50% of positives select a mean boundary depth of 3 for WT100 and 4 for WTall. The change is less significant for the systems with 60%. However, if one considers large editions, the change is striking in both cases. In English, systems with 50% of positives select a mean threshold depth of 6 for WT100 and 12 for WTall (5 and 11 for the 60% systems). So, for the editions with more articles, we also extract all the articles from a larger sub-tree, and that favours even more the extraction of huge in-domain corpora for English and more modest ones for the other languages. As before, Arabic and German seem to be out of place. If we rank the editions according to the number of categories, Arabic has a higher than expected mean selected depth per domain, given its position in the ranking; German has a lower one. All differences among languages are reduced for small and similar vocabularies (*-WT100).
The top rows of Table 4 show the size of the collections extracted with the WT model. The size for every system and language is a direct consequence of the aforementioned. Except for Arabic and German, the larger the edition, the larger the extracted collection of in-domain articles, but for small vocabularies the differences among languages are less extreme. The loss in the number of articles in English for small vocabularies with respect to *-WTall is remarkable (from 1 M in 50-WTall to 50 k in 50-WT100). This is not the case for German (5 k vs. 3 k), although its initial vocabulary for 50-WTall was even larger than the English one.
The bottom rows of Table 4 describe the in-domain corpora extracted with the IR model. In general, IR retrieves larger collections than WT, up to the point that for queries with 100 terms and without any threshold on the relevance score (IRall) the extracted corpus can be almost the full Wikipedia. Also notice that the number of extracted articles in the IR models is proportional to the size of the collection rather than to the number of categories, as happens with WT and small vocabularies (the numbers to the left of the language in the table rank the editions according to the number of articles; the actual order in the table is by number of categories). As expected, queries with fewer elements (50-IR* vs. 100-IR*) retrieve smaller collections. Some exceptions appear for Basque and Greek. This occurs when one does not look at the collection with all the hits (IRall) but at those recovering a percentage of the maximum score. Since the maximum score changes between 100-term and 50-term queries, so can the number of retrieved elements.
Our two methods, WT and IR, build very different corpora, especially in content. WT collections, which are smaller, are not a subset of the IR ones (except when the reference is IRall, the system that selects almost the whole Wikipedia for a given domain). For example, if one compares 50-WT100 and 100-IR10, two similar collections in terms of size, only 20-60% of the WT articles and 5-15% of the IR ones appear in the intersection between the corresponding extractions by both models. The common articles cover a larger percentage of the WT collections because their size is smaller. The ranges in the previous figures describe the behaviour for the different languages. Large editions have a lower percentage of common articles (for example, 23% and 56% for WT in English and Greek respectively, and 8% and 4% for IR in the same languages). Prior to evaluating the quality of the collections, building a collection from the union of the different systems seems a way to enlarge the amount of data, especially for small editions, where it can be more useful.
As a final remark, notice that these results correspond to the monolingual scenario. A multilingual comparable corpus is just the set of collections of the same domain for each language. We can increase the degree of comparability (Gamallo Otero and González López, 2011; Su and Babych, 2012) by selecting a subset of equivalent articles in a straightforward way thanks to Wikipedia's inter-language links. Once the monolingual corpora have been retrieved, the union or intersection of their linked articles builds the final comparable corpus for the desired domain and languages.
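This last step amounts to simple set operations. A minimal sketch, assuming the inter-language links are available as a dictionary from article ids in one language to their counterparts in the other (all names hypothetical):

```python
def comparable_corpus(ids_l1, ids_l2, langlinks_l1_to_l2, mode="intersection"):
    """Combine two monolingual in-domain extractions into a comparable
    corpus using inter-language links. `langlinks_l1_to_l2` maps L1
    article ids to their L2 counterparts. Returns (l1_id, l2_id) pairs."""
    linked = {(a, langlinks_l1_to_l2[a])
              for a in ids_l1 if a in langlinks_l1_to_l2}
    if mode == "intersection":
        # keep only pairs identified as in-domain in both languages
        return {(a, b) for (a, b) in linked if b in ids_l2}
    # union: keep every linked pair found in-domain in at least one language
    reverse = {b: a for a, b in langlinks_l1_to_l2.items()}
    extra = {(reverse[b], b) for b in ids_l2 if b in reverse}
    return linked | extra
```

The intersection yields the high-precision/low-recall corpus and the union the low-precision/high-recall one.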

Comparison to Similar Systems
Gamallo Otero and González López (2010, 2011) obtained comparable corpora in Spanish, English and Portuguese in the domains of Archaeology, Linguistics, Physics, Biology, and Sport, based also on Wikipedia's categorisation. The comparison is not direct, since the Wikipedia editions differ. Of course, the accuracy of CorpusPedia will be much higher, but for some tasks the size of the corpus would not be enough. Notice that at this point we are talking about the size of the collections and not about their quality.
Plamada and Volk (2013) used a method very similar to IR to extract parallel articles in the Alpine domain for German and French. We can compare their results with the ones we obtain for Mountaineering with our IR model but, again, the Wikipedia editions differ. They index only aligned documents according to the inter-language links, since their main purpose is to extract parallel sentences and they assume these are mostly found in aligned (parallel) documents. Their methodology retrieves 40,000 parallel articles, while our most flexible version with the same number of terms for the query (IRall with 100-term queries) retrieves almost the full Wikipedia (1,182,465 French articles and 1,460,036 German articles). The conservative version (IR100 with 100-term queries) retrieves 225,422 in-domain French articles and 305,200 German ones. We can extract the subset of parallel articles from this comparable corpus via the intersection or the union of the articles. For the intersection, we use the articles that have been identified as in-domain simultaneously in German and French. For the union, we expand the set to include all the articles that have been identified as in-domain in one of the languages together with the equivalent article in the other one, in case it exists. Using the intersection, we obtain a high-precision/low-recall parallel set with 55,551 articles, and with the union we gather a low-precision/high-recall corpus with 205,913 articles.

Manual Evaluation
We have generated several in-domain document collections, but we have not determined how well these documents represent the domain. In this section, we are interested in determining whether the documents in a corpus belong to a particular domain or not. For this manual study, we select two representative systems, 50-WT100 and 100-IR10, and manually judge their articles in three domains in all ten languages: Astronomy, Software, and Sport. The evaluation set for each language, domain and system consists of 200 articles: 100 articles exclusive to each system and 100 articles common to both. The articles are sampled evenly within each subset. In three cases, the number of articles in the collection is smaller than 200 and so is the evaluation set (see Table 5).
We manually annotate the 8,600 articles with three assessments each. We use the Figure Eight platform (https://www.figure-eight.com/) to crowdsource this task. All the details on setting up the experiment and instructing the annotators are in Appendix B.
Table 5 shows the manually-judged precision results. We calculate the precision of the extracted collections under two settings: (i) hard precision, when there is full agreement among the three annotators in assigning a domain, and (ii) soft precision, when an article is assigned to a domain by two out of three annotators. For the three domains, the quality of the WT extractions is much better than that of the IR ones. Even in the hard-precision setting, the mean value is 0.74±0.14 for WT and 0.43±0.12 for IR, and values per domain are close to these figures. The average values for soft precision go up to 0.84±0.13 for WT and 0.50±0.14 for IR. Focusing on the language factor, the IR system does especially well for German, suggesting that the quality of the extracted characteristic vocabulary is better. This is an indication that the quality of the characteristic vocabulary is less important in the WT models than in the IR ones, as WT averages over all the articles in a level before extracting the whole level. On the other hand, WT's weakest performance comes with Arabic, with a mean soft precision over domains of 0.57±0.11. Arabic collections are built after considering a low depth (3.6±2.3, with a mode as low as 1; cf. Table 3). Nevertheless, the three evaluated domains are built upon a higher depth (5 for Astronomy, 8 for Software, and 6 for Sport), meaning that perhaps too many articles are extracted, increasing the coverage but damaging the precision. The outcome is still better than for its IR counterpart.
The difference between the WT and IR systems becomes more evident when looking into the distribution of their resulting collections. As explained before, we built the subsets to evaluate by ensuring that half of the articles in a collection are common to both systems and the other half is exclusive to each of them. That allows us not only to save in manual assessments, but also to get a clear idea of the distribution of the articles in a collection. The third vertical block of Table 5, "100-element subset", shows the results. As expected, the articles that are common to both systems (∩_only) are those with the highest precision (0.79±0.15 for hard precision and 0.89±0.15 for soft precision on average). The articles extracted only by the WT system (WT_only) are very close in quality, with an average of 0.70±0.17 for hard precision and 0.80±0.17 for soft precision. The precision values are very low for articles retrieved only by the 100-IR10 system (mean of 0.11±0.16 for hard precision and 0.16±0.20 for soft precision). The only exception is again German, where the IR_only subcollection has a hard precision of 0.50±0.24 and a soft precision of 0.61±0.19.
The last column of Table 5 shows the inter-annotator agreement for more than two raters, Fleiss' kappa (κ_Fleiss) (Fleiss, 1971). Annotators agreed the most when discriminating between Sport and other domains, with an average κ = 0.88±0.07. The lowest agreements occurred in the Software domain: κ = 0.74±0.11. Astronomy lies in the middle with 0.81±0.12. Regarding the language dimension, annotators of Basque …

Table 5: Results of the manual evaluation. The number of articles selected for the manual assessments is shown in Set_WT and Set_IR. "Complete set" shows the precision obtained under the hard and soft criteria for the 50-WT100 (WT) and 100-IR10 (IR) systems. "100-element subset" analyses the distribution of the sets (see text). The last column shows the inter-annotator agreement measured by Fleiss' kappa.

Notice that for most of the evaluations (28 out of 30) we obtain either substantial agreement (0.61 < κ < 0.80) or almost perfect agreement (0.81 < κ < 1.00) as defined in Landis and Koch (1977), and we can conclude that system 50-WT100 is significantly better than 100-IR10. However, a manual evaluation is always expensive, and one would like to be able to quantify automatically how adequate a collection is with respect to the desired domain for each experiment. The next section introduces the concept of domainness and addresses this issue.
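For completeness, κ_Fleiss can be computed from per-article category counts; a minimal sketch (the function name is ours):

```python
def fleiss_kappa(ratings):
    """ratings: one row per item with the number of annotators that chose
    each category, e.g. [3, 0] means all three raters picked category 0.
    Every item must carry the same total number of assessments."""
    n = len(ratings)          # number of items
    r = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # number of categories
    # proportion of all assignments that went to each category
    p_cat = [sum(item[j] for item in ratings) / (n * r) for j in range(k)]
    # observed agreement per item, then averaged
    p_items = [(sum(c * c for c in item) - r) / (r * (r - 1)) for item in ratings]
    p_bar = sum(p_items) / n
    # chance agreement
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)
```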

Domainness Characterisation
We are still interested in determining whether the documents in a collection belong to a particular domain or not. Nevertheless, describing corpora is a difficult and subjective task, and the answer should not be binary but a continuous score, especially if it is quantified automatically. Here, we define domainness as the degree of cohesion and representativity of a corpus with respect to a domain.

Concept Intuition
The idea behind the definition of domainness builds on the intuition that a collection should be heterogeneous but cohesive at the same time. For illustrative purposes, Figure 5 illustrates the concept of cohesion, and Figure 6 shows another example to illustrate the concept of representativity within a collection. Whereas collections C_1 and C_3 both correspond to the Physics domain, C_1 should receive a higher domainness score because its articles seem to be purely about physics (C_3 contains articles in the intersection of physics and math). However, when measuring the domainness of the collections with respect to the Science domain, C_3 should have a higher value because it has more diversity, i.e. it holds a higher representativity of the domain. From this configuration, one cannot say which of C_2 or C_3 should have a higher domainness score for Science.
To the best of our knowledge, no specific measures exist to quantify this concept. In the next section, we propose several automatic metrics to measure the domainness of a collection and afterwards, in Section 7, we apply them to determine the quality of our in-domain collections of Wikipedia articles and study their correlation against the manual evaluation performed in Section 5.6.

Figure 6: Illustrative example of representativity and cohesion to define the domainness of a collection of documents C_i. C_1 has the highest domainness for Physics, whereas C_3 and C_2 have higher domainness for Science since they have a major representativity.

Domainness Metrics
Although there is no predefined scale to quantify domainness, we intend to measure whether a corpus represents a domain better than another one, and how, or whether, it degrades when being enlarged. With that in mind, and in order to come up with a more affordable evaluation framework, we define four families of automatic metrics inspired by the work of Kilgarriff (2001) on corpus analysis and the work of Newman et al. (2010) on topic coherence. The first three families intend to measure the representativity of the corpus and characterise a domain on the basis of its characteristic vocabulary. Quite differently, the fourth family intends to measure the cohesion of the collection without the requirement of characterising the domain.
Family 1: Density of terms. We begin with the assumption that a corpus describes a domain better the higher the density of terms it contains belonging to the domain's characteristic vocabulary. Obtaining this vocabulary is straightforward when using the Wikipedia as a corpus. Since root articles belong to the domain by definition, the characteristic vocabulary can be obtained as the most frequent terms in this subcorpus (as assumed in our models). The density of these terms should be a measure of the representativity of the collections. We propose two densities based on two different term frequency estimations (Salton and Buckley, 1988). The first one is the term frequency of all in-domain terms w_i in the collection, c_terms = Σ_{w_i} counts(w_i), normalised by the number of articles N:

C_terms/N = (1/N) Σ_{w_i} counts(w_i).

The second one is the augmented frequency of in-domain terms for each article, normalised by the number of articles:

ĉ_terms = (1/N) Σ_{a=1}^{N} Σ_{w_i} (K + (1 − K) counts_a(w_i)/c_max(a)),

where c_max are the counts of the most frequent term in each document and the optimum value of K is 0 in our experiments.
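Both densities reduce to simple token counting. A sketch under the assumption that articles are already tokenised (with K = 0, as in the experiments):

```python
from collections import Counter

def density_metrics(collection, vocabulary):
    """collection: list of tokenised articles; vocabulary: the domain's
    characteristic terms. Returns (C_terms/N, augmented c_terms), K = 0."""
    n = len(collection)
    c_terms = 0
    c_aug = 0.0
    for tokens in collection:
        counts = Counter(tokens)
        c_max = max(counts.values()) if counts else 1
        # raw frequency of in-domain terms in this article
        c_terms += sum(counts[w] for w in vocabulary)
        # augmented frequency: each term weighted by the article's top term
        c_aug += sum(counts[w] / c_max for w in vocabulary if w in counts)
    return c_terms / n, c_aug / n
```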
Family 2: Mutual Information.The evaluation of the quality of a corpus regarding domainness is somehow related to the evaluation of topic models.
In the first case, we have a collection of texts and we want to evaluate how well they describe a domain that might or might not be characterised by a set of keywords. In the second case, we are given a set of keywords and we want to evaluate how well they describe the topic (domain) of a collection. Newman et al. (2010) introduced the concept of coherence of a topic as the coherence or interpretability of its keywords. They measure it with the average or median of pointwise mutual information (PMI) between the topic keywords. Later works use NPMI (Bouma, 2009), a normalised version of PMI, for the same purpose:

PMI(w_i, w_j) = log( (p(w_i, w_j) + ε) / (p(w_i) p(w_j)) ),
NPMI(w_i, w_j) = PMI(w_i, w_j) / (−log(p(w_i, w_j) + ε)),

where w_i and w_j are the keywords describing a topic (the terms in the characteristic vocabulary in our case), ε is a smoothing constant, and p stands for the frequentist probability. For topic modelling, the median over the pairs showed better correlation with human judgments than the mean because it is less sensitive to outliers (Newman et al., 2010). We apply the two measures and two variants to evaluate domainness, assuming that the vocabulary we use perfectly describes the domain, so that the loss in the value of (N)PMI gives information about the background collection. We expect in-domain collections to have a high density of in-domain terms (p(w_i) and p(w_j) values higher than in general collections), but we still expect co-occurrences of terms to be representative. Computationally, the main difference with the original usage is how to estimate term co-occurrence frequencies to compute the probabilities. In topic modelling, co-occurrences are sampled from the full collection or from an external source, such as the Wikipedia or Google n-grams, with a sliding window of length m words. Here, we always use the full in-domain collection and consider as window an entire article of the domain: (N)PMI_art. Notice that with this definition the window has a variable length. In order to study whether this difference is relevant, we define a second variant, (N)PMI_col, where we estimate a probability as
the sum of the probabilities over all the articles of the collection, instead of simply the counts per article as in the original version.

Family 3: Correlations. In his in-depth study, Kilgarriff (2001) quantifies the similarity among corpora by measuring word frequencies and cross-entropies. Here we adapt the measure he evaluated as the one best fitting our problem, the Spearman correlation, and add Kendall's τ correlation as well for a better generalisation. Spearman's ρ (and Kendall's τ) is a non-parametric rank correlation. It measures the difference in rank order between two distributions:

ρ = 1 − 6 Σ pd² / (n (n² − 1)),

where pd are the pairwise distances of the ranks of the terms w_i and w_j, and n is the number of terms.
For Kendall, we have:

τ = (c − d) / √((c + d + t) (c + d + u)),

where c is the number of concordant pairs, d is the number of discordant pairs, t is the number of times the terms w_i are tied, and u is the number of times the terms w_j are tied.
In our particular case, we measure the difference in rank order of n terms in two corpora: an extracted collection of articles of a given domain, and the subset of its root articles.Terms are defined as before; since the important feature of a term is its rank and not its absolute frequency, this measure can be used for corpora of varying size.
To compute the correlation, one needs to find the n most frequent common terms.These are obtained as the union of the first m terms for every corpus.The terms that do not appear in the other corpus have frequency zero and are therefore ranked at the bottom of the other corpus' list.Some heuristics are considered to build the vectors: (i) At most 1000 terms from the top 10% (if available) for every collection are used, therefore the maximum number of common elements is 2000; (ii) terms with frequency 1 are not considered within the 1000; and (iii) correlations are not estimated with less than 5 points.
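A simplified sketch of this construction with Spearman's ρ (alphabetical tie-breaking instead of proper tie handling, and without heuristics (ii) and (iii)):

```python
def spearman_on_terms(freq_a, freq_b, top_m=1000):
    """freq_a/freq_b: term -> frequency dicts for the two corpora. Terms
    absent from one corpus get frequency 0 and sink to the bottom of that
    corpus' ranking, as described in the text."""
    def top(freq):
        return [t for t, _ in sorted(freq.items(), key=lambda kv: -kv[1])[:top_m]]

    # union of the most frequent terms of each corpus
    terms = set(top(freq_a)) | set(top(freq_b))

    def ranks(freq):
        ordered = sorted(terms, key=lambda t: (-freq.get(t, 0), t))
        return {t: i + 1 for i, t in enumerate(ordered)}

    ra, rb = ranks(freq_a), ranks(freq_b)
    n = len(terms)
    d2 = sum((ra[t] - rb[t]) ** 2 for t in terms)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In practice one would use a library implementation with full tie handling (e.g. a Kendall τ-b routine) rather than this simplified formula.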
Both Spearman and Kendall correlations measure monotonic relationships. Although we checked that in most cases the two statistics lead to the same conclusions, Kendall's τ has been shown to be more robust, more appropriate for small samples and, given its definition, better at dealing with ties and outliers (Croux and Dehon, 2010), so it is the one we use as the representative of this family.
Family 4: Cohesion. In this case, our objective is to assess the distance between the articles pertaining to a given domain according to our models. The lower the distance between such articles, the more cohesive they are, and the more likely that they actually belong to the domain; i.e. the better the model works. In order to come up with a single number to compare across different models, we compute the average distance between all the article pairs in the domain. Considering standard vector-space models to represent the texts could result in measures sensitive to length and vocabulary differences between the pairs of articles. Document embeddings obtained with doc2vec (Le and Mikolov, 2014) could solve this issue, but their quality would still depend on the language, because resource-poorer languages have less data from which to estimate the embeddings. On top of these factors, in this work we focus on multilinguality. As a result, we opt for a high-dimensional concept-based representation, ESA.
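Whatever representation is chosen, the cohesion measure itself reduces to an average angular distance to the collection centroid. A minimal sketch over precomputed article vectors (the angle is left in radians here; the paper's exact dist_θ normalisation may differ):

```python
import math

def cohesion(vectors):
    """Mean angular distance between each article vector and the centroid
    of the collection; vectors are equal-length lists of floats, e.g.
    ESA concept vectors. Lower values mean a more cohesive collection."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    def angular(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        # clamp to guard against floating-point drift outside [-1, 1]
        return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

    return sum(angular(v, centroid) for v in vectors) / len(vectors)
```

Computing the distance to the centroid instead of over all article pairs is the efficiency shortcut used in the text: it needs N distance computations instead of N(N−1)/2.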
The purpose of ESA is to represent texts, regardless of their length, in a high-dimensional concept-based space. The space is built on top of the term-document matrix generated from a large collection D of documents using tf-idf weighting. The representation of a text is then built by comparing it against the collection, resulting in a |D|-dimensional vector. For efficiency reasons, the average distance is computed with respect to the centre of the collection as

d_ESA = (1/N) Σ_a dist_θ(a_ESA, c_ESA),

where a_ESA is the vector representing article a, c_ESA is the centroid of all the vectors in the corpus, and dist_θ refers to the angular distance between the two vectors.

Domainness Evaluation

Now we inspect the numbers obtained for the different metrics when analysing the collections extracted by the WT and IR models in all languages and domains. Figure 7 summarises the results with some representative measures of the four families of metrics. We plot the mean and standard deviation of six measures, C_terms/N, ĉ_terms, PMI_art, PMI_col, τ, and d_ESA, for the ten languages under study and the 10 systems analysed. For comparison purposes, we also choose a representative model of every family (50-WT100 and 100-IR10) and compare it against a subcollection of the other family gathered to have the same size. Although we do not include the corresponding figures, those outcomes are also discussed.
For the representativity measures (Families 1, 2 and 3), the size of the characteristic vocabulary used in the experiments is 100 terms, i.e. 5049 term pairs.In all cases, the collections on which probabilities are estimated are preprocessed as explained in Section 4.1 so that the format of the articles matches the terms.Points represent the arithmetic mean over the 743 selected domains.
Family 1. By design, IR systems are the ones with a larger number of in-domain terms. The density is expected to be highest in the smallest collections, *-IR10, because they contain the top-ranked articles retrieved according to these terms. In WT systems, the terms have a high density in the root articles, also by definition of the model, but there is no expectation of a high number of in-domain terms in the rest of the collection. The output of ĉ_terms and especially of C_terms/N reflects this (cf. top-left plot in Figure 7). Differences between WT systems do not seem to be significant under these metrics. In general, differences appear for large editions, where the vocabulary size varies notably from system to system. The best WT system is 60-WT100, the most restrictive one and the one with the fewest articles per collection, with a mean across languages of C_terms/N = 49.7 and ĉ_terms = 4.1. However, 60-WTall has a higher density of in-domain terms than any of the 50-* systems for some editions (those with fewer categories), even if the obtained corpora are larger.
As expected from its definition, IR systems with the smallest collections (*-IR10) are clearly the best ones according to C_terms/N; the normalisation in ĉ_terms smooths the effect and brings the systems closer to each other. Since IR collections grow significantly after allowing for lower retrieval scores, there are large differences between IR models. According to these metrics, *-IR10 systems are better in quality than any WT model, especially for large editions, with the additional benefit that they gather larger collections. This effect is more pronounced when comparing equal-size collections, but disappears for the less constrained configurations, where WT models are better. If we analyse the results per edition, Greek is the language on which both models perform best. There is no clear trend for the other editions, although English and Arabic perform poorly in contrast with the others. This is one of the differences with respect to the correlation family of metrics (Family 3), for which English, Greek and Spanish are the editions with the best results. This is a first indication that both metrics are not equally valid for assessing the quality of the extractions.
Family 2. Contrary to in-domain terms, there is no requirement on the number of co-occurrences of terms when building the systems, neither for IR nor for WT. The plots in the middle row of Figure 7 show the mean and standard deviation of PMI_art and PMI_col. One would expect positive PMIs for related terms, meaning that they occur together more frequently than if they were independent in a general collection, but we obtain negative values for most collections. The reason is the high density of in-domain terms in all the documents, which causes co-occurrences to have comparatively less weight than in general collections.
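The article-window estimation behind PMI_art can be sketched as follows; this is a simplification in which probabilities come from article-presence counts and the median over term pairs is returned:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_article(collection, vocabulary, eps=1e-12):
    """Median PMI over vocabulary term pairs, counting a co-occurrence
    whenever two terms appear in the same article (the 'window')."""
    n = len(collection)
    occ = Counter()   # articles containing each term
    cooc = Counter()  # articles containing each term pair
    for tokens in collection:
        present = vocabulary & set(tokens)
        occ.update(present)
        cooc.update(combinations(sorted(present), 2))
    scores = []
    for (wi, wj), c in cooc.items():
        p_ij = c / n
        p_i, p_j = occ[wi] / n, occ[wj] / n
        scores.append(math.log((p_ij + eps) / (p_i * p_j)))
    scores.sort()
    return scores[len(scores) // 2] if scores else float("nan")
```

The (N)PMI_col variant described above would instead accumulate per-article probabilities over the whole collection before taking the logarithm.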
Since we want to indirectly evaluate the collection and not the terms, we just compare the values of the different models. Within a family of systems, WT or IR, the scores completely depend on the size of the collection: the larger the collection, the better the evaluation. When comparing the two systems, WT systems are better than IR systems even if IR collections tend to be larger. For instance, PMI_art = −1.1±1.0 for the 50-WT100 English collection, with a mean of 50,514 documents per domain, and PMI_art = −2.8±0.3 for 100-IR10, with a mean of 64,239 documents per domain. The values of PMI_col for these collections are −0.2±0.3 and −1.2±0.4. We observe the same trends with PMI_art and PMI_col, but the scores with PMI_col tend to be higher. When we estimate the normalised PMIs, differences among models become smaller, but the main conclusions hold.
If we look at differences across languages, we see that the scores are almost independent of the language for IR systems, whereas for WT systems the English collections are the best ones and the Romanian and Occitan ones the worst. Besides, Romanian, Basque and Occitan have large deviations, especially in WT systems. In IR systems, these languages have the smallest collections, but this is not the case for WT. The uncertainties for these languages, which range from ±4 to ±8, are not shown in Figure 7 for clarity.

Family 3. As observed in the bottom-left plot of Figure 7, correlation measures show a clear preference for the WikiTailor model. Kendall's τ lies in the range [0.2, 0.5] for WT systems and [−0.1, 0.2] for IR systems. Results are equivalent with Spearman's ρ, although with higher scores: within [0.3, 0.6] for WT systems and [−0.1, 0.3] for IR systems. For different variations of a model, the results are consistent with those seen with the measures related to the density of terms: smaller and more constrained collections are always evaluated better. However, the standard deviation is too big to make statistically significant statements when comparing models within the same family. In general, for WT systems the quality increases for Wikipedia editions that have fewer categories, whereas there is no specific trend for IR systems. Large editions correlate less because their domains have more articles; when only domains with more than 100 articles are considered, correlations diminish for those languages where this matters, such as Occitan, Greek, or Basque, and the scores per language become more homogeneous. When we compare IR and WT collections up to an equal size, we confirm that WT models are better than the IR ones according to ρ and τ and that, the smaller the edition, the more evident the difference becomes.
Family 4. Following the original ESA proposal, and in consistency with this work, we use the Wikipedia as our reference text collection D for the cohesion-oriented metric. The size of D for each of the languages under study is 12,539, as this is the size of the intersection among the top nine Wikipedia language editions. Gottron et al. (2011) showed the convergence of the method with approximately 10,000 articles, so we discard the tenth edition, the Wikipedia in Occitan, because including it would reduce the number of articles in D too much. The Occitan models are therefore not evaluated with this measure.
Trends similar to those seen with the previous metrics can be observed with d_ESA, even if its nature is different. In this case, lower values imply collections with a higher cohesion, irrespective of the domain they belong to. The results are shown in the bottom-right plot of Figure 7. Since WT collections include the root articles of the desired domain and IR systems retrieve only articles that contain the vocabulary of the domain, we can assume that a large cohesion implies a large domainness. As happens with ρ and τ, d_ESA clearly picks WT models (d_ESA ≈ 0.85) over IR ones (d_ESA ≈ 1.00). The best (worst) collections are obtained for Greek (German). Again, mean averages do not allow us to establish preferences among the different configurations within the same family of models in a statistically significant way, but the models with the smallest set of terms (*-IR10 and *-WT100) are preferred; i.e. more constrained collections have a larger cohesion.
All the metrics we have defined clearly differentiate the quality of WikiTailor and IR systems when we study the average over all the domains, but they only show trends within the different models of the same family. In general, the most constrained configuration of each family (60-WT100 and 50-IR10) obtains the most in-domain collection, but the difference is sometimes minimal with respect to another configuration that, on the other hand, might have retrieved many more articles. We are comparing 7,430 collections for 10 different models but, in practice, a standard user will be dealing with only a few of them. In that case, it might be more fruitful to decide which collection to use according to the scores but also according to size and domain representativity requirements. Notice also that the density metrics (Families 1 and 2) behave differently from the correlation (Family 3) and cohesion (Family 4) measures when dealing with the most constrained collections.
The human judgments from Section 5.6 also allow us to estimate the quality of the automatic evaluation metrics. We calculate the Pearson correlation r_P between the crowdsourced precisions and the scores given by the automatic metrics on the same subcollections, considering 200 articles per system and language in three domains (settings in Section 5.6).
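r_P itself is straightforward to compute; a minimal sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired score lists,
    e.g. crowdsourced precisions vs. an automatic domainness metric."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```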
A visual inspection of the data is a first good clue to understand the behaviour of the metrics. Figure 8 shows the relation against soft precision of six metrics: C_terms/N, ĉ_terms, PMI_col, τ, d_ESA, and a full measure of domainness, Dom. The first aspect worth noticing is that in all cases, as the graphical counterpart of Table 5, the points corresponding to the 50-WT100 system (green bullets) are located towards higher precision values than those corresponding to the 100-IR10 system (orange diamonds). We plot 60 points per figure, corresponding to two systems applied on ten languages × three domains. The exceptions are d_ESA and Dom, for which only nine languages × three domains are shown: we discard the collections with fewer than 200 articles from the correlation estimation (Astronomy and Software for Occitan, and Sport for Basque; cf. Table 5).
Family 1. Counterintuitively, the metric with the highest (and negative) correlation is the density of terms C_terms/N, with r_P = −0.716. The high value is just an artifact of the different composition of the WT and IR collections. By construction, the IR system retrieves articles with many in-domain terms, whereas the dependence is lower for WT models. Since the quality of WT is better, there is a clear anticorrelation between the density of terms and the precision. If we look at what happens only within WT or IR instances (i.e. only with the green or the orange points independently), we obtain worse correlation values: r_P = −0.18 for WT and r_P = −0.23 for IR; still negative in both cases, but closer to zero. The fact that these values are not positive invalidates the assumption we made in order to use this family of metrics to measure domainness. The results show that the density of the characteristic vocabulary of the domain is neither a sufficient nor a necessary condition to obtain in-domain corpora. It can be a good estimator of the representativity of the corpus, but if the cohesion is low, the domainness is also low.
The additional normalisation included in the augmented frequency ĉ_terms rules out the metric as a global measure. The Pearson correlation for ĉ_terms when all the data are used together is r_P = −0.08: these two variables do not correlate. Since the frequency of terms is now normalised by the most frequent term, its importance is lower and, therefore, WT and IR behave similarly, with slightly higher values for IR than for WT. The reason is the same as before, hence the anticorrelation with the precision scores. However, when looking into the two systems separately, the correlation increases, especially for WT: r_P = 0.63 for WT and r_P = 0.36 for IR. So, within a system, we have a positive correlation of ĉ_terms vs. precision, which indicates that ĉ_terms is a good barometer of the quality of a WT-extracted in-domain corpus.

Family 2. Metrics related to mutual information or co-occurrence show a clear positive trend with respect to precision. Even with negative PMI values, human judgments show how the best collections have higher PMI values. The score that correlates best with precision is PMI_col with r_P = 0.57. The metric with the standard probability calculation, PMI_art, is close with r_P = 0.55. The variable-size sliding window that we use, an article, does not affect the results. The normalised versions are slightly below these values because the effect of the normalisation is to smooth differences among points (NPMI_art has r_P = 0.41; NPMI_col has r_P = 0.55). We also observe that in our setting the median of (N)PMI is a better estimator than the average.
If we compare the subsets of points belonging to WT and IR, the correlation is lower than the global one in both cases, but especially for IR, where we observe no correlation between the metric and the observations (PMI_art has r_P = 0.44 for WT and r_P = 0.08 for IR). Notice that the different nature of WT and IR allows us to state that a high density of in-domain terms in an article does not imply that it belongs to the domain, as concluded from the fact that C_terms/N and ĉ_terms for the IR system are above their WT equivalents. However, a higher number of co-occurrences of the domain vocabulary does (PMI for WT is larger than PMI for IR).
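The article-as-window idea behind this family can be sketched as follows: probabilities are estimated from document frequencies over the collection, PMI is computed for each pair of vocabulary terms, and the median over pairs is kept, which in our setting is a better estimator than the average. The helper and data below are an illustrative simplification of the actual computation:

```python
import math
from itertools import combinations
from statistics import median

def median_pmi(articles, vocab):
    """Median pairwise PMI of the domain vocabulary, with each article
    acting as the co-occurrence window. Probabilities are document
    frequencies over the collection (a sketch, not the exact estimator)."""
    docs = [set(a) & set(vocab) for a in articles]
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    scores = []
    for t1, t2 in combinations(vocab, 2):
        both = sum(1 for d in docs if t1 in d and t2 in d)
        if both and df[t1] and df[t2]:
            # PMI = log p(t1,t2) / (p(t1) p(t2)); the 1/n factors cancel to n
            scores.append(math.log((both * n) / (df[t1] * df[t2])))
    return median(scores) if scores else float("-inf")
```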
Family 3. The next family of metrics, ρ and τ, measures the rank correlation between the terms of an extracted in-domain collection and a collection of Wikipedia root articles in the same domain. The correlation with soft precision is in this case r_P = 0.31 for ρ and r_P = 0.34 for τ. As the plot for τ in Figure 8 shows, the dispersion of the WT points is larger, but their subset has a higher correlation than the IR one (r_P = 0.25 vs. r_P = 0.02). For the IR subset, the metric is a very poor measure of the quality of the extraction but, contrary to the augmented term frequency ĉ_terms, it performs better in the global setting than within the subsets.
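The rank-correlation idea can be sketched with a small, tie-free implementation of Spearman's ρ over two term frequency vectors (an illustrative helper; in practice, library implementations of ρ and τ that handle ties should be preferred):

```python
def spearman(x, y):
    """Spearman's rho between two scorings of the same terms (no-ties case):
    1 - 6 * sum(d^2) / (n (n^2 - 1)), where d is the rank difference.
    A sketch of the Family-3 idea: compare term ranks in the extracted
    collection against term ranks in the domain's root articles."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# identical rankings give rho = 1; fully reversed rankings give rho = -1
```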
Family 4. Measuring the cohesion of the corpus through ESA distances results in a good estimator. With a global correlation of r_P = −0.60 and subset correlations of r_P = −0.41 (WT) and r_P = −0.13 (IR), d_ESA is the best individual metric to estimate the domainness of a collection in general, but ĉ_terms is the best metric when we focus on WikiTailor extractions. ĉ_terms is not bounded: its range is [0, ∞), where high densities imply good quality. However, due to the lack of an upper bound, it is useful to compare collections, but no clear interpretation exists in terms of an absolute number. In terms of ease of use, both d_ESA and ĉ_terms rely on the Wikipedia. ĉ_terms comes for free with a WT extraction because we estimate the characteristic vocabulary in our models. d_ESA performs better globally, but at the cost of having to define a reference collection, which can differ across languages. PMI_col alleviates this problem and is also language-independent, but its quality as a metric is slightly lower.
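The cohesion idea behind d_ESA can be sketched as the mean pairwise cosine distance between article vectors; here plain dense vectors stand in for the actual ESA concept vectors:

```python
import math
from itertools import combinations

def mean_pairwise_distance(vectors):
    """Cohesion of a collection as the mean pairwise cosine distance between
    its article vectors: the lower the mean distance, the more cohesive the
    collection. (A sketch with toy dense vectors in place of ESA vectors.)"""
    def cosine_distance(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return 1 - dot / (norm_u * norm_v)
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# two identical articles plus one orthogonal outlier
cohesion = mean_pairwise_distance([[1, 0], [1, 0], [0, 1]])
```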
Finally, we estimate the domainness as the combination of the most promising metrics for representativity and cohesion:

Dom = ½ (PMÎ_col + (1 − d̂_ESA)),

where the hats on PMI_col and d_ESA represent a normalisation of the data points to [0,+1]. As expected, we obtain the largest global correlation with the combination, as representativity and cohesion are two orthogonal features. Dom reaches a correlation of r_P = 0.71 when all 60 data points are used. At system level, with two sets of 30 data points, Dom has r_P = 0.55 for WT and r_P = 0.27 for IR, showing that the more homogeneous a collection of points is, the less important the combination of aspects becomes. In that case, the correlation is slightly worse than that given by the simple augmented term frequency ĉ_terms, as seen before.
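A sketch of this combination, with min-max normalisation standing in for the mapping to [0,+1]; the equal-weight average is an illustrative choice of this sketch:

```python
def dom_scores(pmi_col, d_esa):
    """Combine representativity (PMI_col, higher is better) and cohesion
    (d_ESA, a distance, lower is better) into one domainness score per
    collection. Both metrics are min-max normalised to [0, 1] across the
    collections; the equal-weight average is an assumption of this sketch."""
    def minmax(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    pmi_hat = minmax(pmi_col)
    d_hat = minmax(d_esa)
    # invert the normalised distance so that higher always means better
    return [0.5 * (p + (1 - d)) for p, d in zip(pmi_hat, d_hat)]
```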

Summary and Conclusions
Several multilingual applications benefit from in-domain corpora, but gathering them usually requires a considerable amount of work. We therefore designed a system to extract such corpora from the Wikipedia, a multilingual online encyclopaedia in which information on the domain of the articles is encoded in their category tags. The WikiTailor system explores Wikipedia's category graph and performs a breadth-first search departing from the category associated with the desired domain. From this point, it extracts all the articles belonging to its children categories down to an estimated optimal depth. We compared the performance of WikiTailor with a standard IR system based on querying the Wikipedia with a set of keywords that describe the domain. The keywords, or in-domain vocabulary, were extracted in the same way for the two architectures as the most frequent terms in the root articles; that is, the articles belonging to the top category. The two methods are very different in nature and generate complementary collections: WT collections, which are smaller, are not in general a subset of the IR ones. The experimental analysis on 10 languages and 743 domains showed that both automatic and manual evaluations favour the WT models over the IR ones.
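The core extraction step can be sketched as a depth-limited breadth-first search over a toy category graph; the dictionaries below are illustrative stand-ins for Wikipedia's category and article tables, and the depth threshold is assumed to be estimated beforehand:

```python
from collections import deque

def collect_articles(children, articles_of, root, max_depth):
    """Breadth-first traversal of a (toy) category graph: starting from the
    root category of the domain, visit all subcategories down to max_depth
    and collect the articles attached to them. `children` maps a category to
    its subcategories; `articles_of` maps a category to its articles."""
    seen = {root}
    collected = set(articles_of.get(root, []))
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue
        for child in children.get(category, []):
            if child not in seen:  # the real WCG may contain cycles
                seen.add(child)
                collected.update(articles_of.get(child, []))
                queue.append((child, depth + 1))
    return collected
```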
The manual evaluation was carried out on three domains (Astronomy, Software, and Sport) on one model for WT and one for IR. Turkers on Figure Eight were asked to indicate whether an article belonged to the domain or not, for a total of 200 articles per language and system. Precision was afterwards used to evaluate the quality of each collection. With an average precision of P_WT = 0.84±0.13 and P_IR = 0.50±0.14, WikiTailor proved statistically better than the IR system.
The lack of metrics to measure the domainness of a corpus made an automatic evaluation more complicated. Therefore, we first defined the concept as a combination of the representativity and coherence of the texts in a corpus and, afterwards, introduced several metrics to account for it. Representativity is measured on the basis of the characteristic vocabulary of the intended domain (density, co-occurrence or correlations between distributions of terms) and coherence on the basis of the distance between the articles of the collection. Via the correlation with human judgments, we showed that the density of the characteristic vocabulary of the domain is neither a sufficient nor a necessary condition for in-domain corpora. IR systems, with a higher density of in-domain terms by construction, are worse for all languages and domains in our manual evaluation. On the other hand, distances between the documents of a collection as measured by ESA representations outperform term-based measures and show a moderate correlation with the observations. We also introduced Dom, a metric defined as a normalised linear combination of the best representativity metric (PMI_col) and the distance-based coherence metric (d_ESA). This combination shows a strong correlation with human evaluations, 0.71. In summary, d_ESA is the best individual metric to estimate the quality of a collection in general, when comparing collections as heterogeneous in nature as the ones we explore. However, it only measures the coherence between the documents, and its performance improves when combined with a measure of the importance of in-domain term co-occurrences. Within a single system, the conclusions change. WT systems extract the articles without any requirement on the number of in-domain terms that the documents contain, and within these collections the occurrences and co-occurrences of terms are relevant. For homogeneous collections (WT or IR), ĉ_terms is the best metric. For heterogeneous collections (WT and IR), d_ESA and Dom are the best options, meaning that coherence is more important when the discrepancies in the amount of in-domain vocabulary are not huge.
All the metrics and the WT and IR systems are freely available in the WikiTailor package.

A Wikipedia-Specific Concepts
Category Tag present in a set of articles that are grouped together because they cover similar topics.
Dump Snapshot of an edition in the form of wikitext source and metadata embedded in XML.
Edition Each one of the Wikipedias for a specific language.
Inter-language link/langlink A link in a Wikipedia article towards an equivalent entry in a different language.
Main namespace The namespace in the Wikipedia containing the actual contents: the articles. Other namespaces are user, help, or category.
WCG Wikipedia category graph.Directed acyclic graph formed by the category tags.

B Crowdsourcing Settings
Setting up the Figure Eight crowdsourcing annotation involves four steps: (i) the selection of Turkers, (ii) their instruction, (iii) setting the task itself and (iv) a quality control of the annotation.
The selection of the Turkers was made according to their language knowledge. We opted for three different criteria, based on language capabilities or region, to determine the population annotating each language. No language or geographical limitation was set for English, our most flexible configuration. For Arabic, French, and German we selected the corresponding language on the platform interface. Such a setting was not available for the rest of the languages; hence we opted for a geographical configuration. Table 6 summarises the geographical configurations, set according to four criteria: countries where the language is official (e.g., Spain for Spanish), countries with official languages from the same family (e.g., France for Catalan), neighbouring countries (e.g., Bulgaria for Greek), and countries with a high rate of immigration of native speakers (e.g., Germany for Greek).
We set the job as a binary classification task where Turkers had to assess if a Wikipedia article matches the domain displayed in the interface or not.
Instruction: Task - Identify the category a given Wikipedia article belongs to. It either belongs to domain d or to other, where d can be Astronomy, Software, or Sport.
The Turkers had to scroll through an actual Wikipedia article, which we framed into the interface, in order to judge it.
After a pilot experiment, we wrote additional specific guidelines for each of the three domains, aiming at clarifying how some ambiguous cases should be handled by the annotators:

Astronomy
- The biography of an astronomer should be considered within the Astronomy domain.
- Articles about Physics should not always be considered as Astronomy, even if atoms, particles or orbits are involved.

Software
- Concepts which are in essence software (e.g., video games, matchboxes) belong to the Software domain.

Sport
- The biography of a sportsman should be considered within the Sport domain.
- An article on a location with a section on Sport does not belong to the domain Sport.
We paid 0.06 USD per HIT, each of which consisted of 10 binary annotations, and set a minimum working time of 120 seconds. For quality control, we manually annotated 10% of the instances and required an annotation accuracy of 80%. Each item was judged three times.

Figure 1 :
Figure 1: Slice of the English WCG as of May 2020, departing from the categories Space and Language. Both graphs meet at the Geometric measurement category, at depths 2 and 7 respectively. Notice also the cycle around Space.

Figure 2 :
Figure 2: The two modules of the graph-based in-domain article selection pipeline: vocabulary definition and article selection. Orange rounded blocks represent processes. Green rectangles represent outcomes; pctge. refers to the percentage of positive categories at a given tree level.

Figure 4 :
Figure 4: Percentage of categories associated with the domain Sport (top) and Astronomy (bottom) according to the criterion described in Section 4.1, as a function of the distance to the root category.

Figure 5 :
Figure 5: Example of three intersecting domains: Sport, Games, Videogames (orange boxes) and articles within them (in gray).
Figure 5 shows three domains and five Wikipedia articles within them. The article Basketball clearly belongs to the domain Sport, whereas Tetris clearly does not. An article such as NBA 2K18 lies within all three domains (Sport, Games and Videogames), as it represents them all. Yet the membership of NBA 2K18 in the Sport domain is subjective, unless a more detailed description of the domain is given. At collection level, a collection with the previous three documents is less representative of Sport than a collection including the articles Basketball, Soccer and Chess, which is more cohesive. Again, to what extent remains subjective; we need a measure to quantify the difference.

Figure 7 :
Figure 7: Automatic evaluation of the in-domain collections for the different systems and languages under study with six measures which are representative of the four families introduced in Section 6. Points represent the arithmetic mean over the 743 selected domains.

Figure 8 :
Figure 8: Relation between six domainness measures and the precision given by human judgments (see text for correlations). Points correspond to the scores for the 10 languages in the three manually evaluated domains; some examples are highlighted.

Table 1 :
Statistics of the ten Wikipedia editions considered in this work in terms of number of articles and categories.Editions ranked according to their number of categories.The cumulative intersection is measured with respect to all the languages below a given row.

Table 2 :
Number of articles per category used to build the domain vocabularies (mean x̄, standard deviation σ_x and mode m) for the ten Wikipedia editions used and the 743 domains. Only for categories with fewer than 10 articles in the root are the first children also considered. The last two columns show the number of elements of the vocabulary when the top 10% of the terms are considered.

Table 3 :
Selected depth threshold per category (mean x, standard deviation σ x and mode m) for the ten Wikipedia editions used and the 743 domains.

Table 4 :
Mean N and standard deviation σ_N of the number of articles per domain for the WikiTailor model (top) and the IR-based model (bottom). We show five systems with different values for the two free parameters in both cases (cf. Section 4 for a description). Left-most numbers indicate the ranking of the edition in number of articles (cf. Table 1).

Table 6 :
Geographical settings for the Figure Eight workers selection (when a language-based filtering was not available).