
1 Introduction

Various NLP tasks such as Automatic Summarization, Machine Translation or Dialogue Systems benefit from precision linguistic resources (e.g. grammars, lexicons, semantic representations). Alas, these can hardly be obtained automatically without some loss of quality (in terms of supported linguistic phenomena or structural correctness), and building hand-crafted precision resources is very costly. As an illustration, let us consider syntax. Building resources describing the syntax of natural languages such as French or English (that is, electronic grammars) can take many person-years (see e.g. [1, 25]). A common way to reduce this cost consists in using description languages to semi-automatically generate these precision grammars. These description languages provide users (e.g. linguists) with means to define abstractions over linguistic structures in order to capture redundancy. Users thus no longer have to describe actual linguistic structures (e.g. grammar rules), but abstractions over these. Such abstractions are then processed automatically to generate the full set of redundant structures.

Many description languages were successfully used over the past decades to generate linguistic resources ranging from small-size lexicons to real-size grammars (see e.g. [5, 10, 21, 23, 24]). Each of these description languages was tailored for handling specific linguistic objects. For instance, LKB [5] was designed for describing typed feature-structures, while LexOrg [24] was designed for representing syntactic trees. Furthermore, description languages were extended with features specific to a target grammar formalism (e.g. HPSG’s Head Feature Principle [19] for LKB), making these languages formalism-dependent. Users end up with a plethora of specific description languages (and corresponding implementations, i.e. compilers).

These description languages often support a single linguistic framework (e.g. grammar formalism)Footnote 1 and a limited number of levels of description (mainly syntax). Depending on the target linguistic objects, the user chooses the most adequate tool (which provides a description language together with a compiler for this language), and uses it to describe and produce an actual precision linguistic resource. This has two consequences in particular: (i) there is little information sharing between precision linguistic resources (can abstractions defined for a given target formalism be applied to other formalisms?), and (ii) should several levels of description be needed (e.g. syntax and morphology for morphologically rich languages), several tools have to be learned and combined (if possible), or a single tool has to be tinkered with.

The work presented here aims at changing this by offering a common framework which makes it possible for users to define description languages in a modular and extensible way. We build on previous work on modularity made within the eXtensible MetaGrammar (XMG) description language [6]. The paper is organized as follows. In Sect. 2, we introduce XMG and show how it laid down the bases for extensible and customizable description languages. In Sect. 3, we present (i) the concept of assembling description languages underlying XMG 2, and (ii) a compilation architecture based on logic programming which permits a modular and extensible description (and meta-compilation)Footnote 2 of description languages (hereafter called Domain Specific Languages, DSLs, for consistency with the terminology used in compilation theory). In Sect. 4, we show how to use XMG 2 to dynamically assemble the original XMG language while adding a morphological layer, so that it can be used to generate not only syntactic trees or flat semantic representations but also morphological representations (i.e. inflected forms). Finally, in Sect. 5, we compare our approach with related work, and in Sect. 6 we conclude and present future work.

2 Compiling Extensible Metagrammars

In this section, we present the eXtensible MetaGrammar (XMG) framework [6], on which this work builds. XMG refers to both a description language used to describe tree grammars and a compiler for this language. In the XMG approach, the description of the linguistic grammar is seen as a formal grammar. XMG users thus describe grammar rules by writing a formal grammar (so-called metagrammar). The metagrammar is in our case a Definite Clause Grammar (DCG) [17] (i.e. a logic program), which is compiled and executed by the XMG compiler to produce an actual tree grammar (i.e. a set of trees). Let us briefly define what an XMG metagrammar is, and explain how it is processed (compiled) to generate syntactic trees.

Metagrammars as Logic Programs. Intuitively, an XMG metagrammar consists of (conjunctive and/or disjunctive) combinations of reusable tree fragments. Hence the XMG language provides means to define abstractions over tree fragments, along with operators to combine these abstractions conjunctively or disjunctively. These three concepts (abstraction, conjunction, disjunction) are already available within DCGs, which are formally defined as follows:

$$\begin{aligned} \textit{Clause} \qquad&{:}{:}{=}\qquad \textit{Name} \rightarrow \textit{Goal} \end{aligned}$$
(1)
$$\begin{aligned} \textit{Goal}\qquad&{:}{:}{=}\qquad \textit{Description}\ \mid \ \textit{Name}\ \mid \ \textit{Goal} \vee \textit{Goal} \ \mid \ \textit{Goal}\wedge \textit{Goal} \end{aligned}$$
(2)

Indeed, clauses make it possible to associate goals (e.g. descriptions) with names, hence providing abstraction. Goals can be made of conjunctions or disjunctions of goals. In DCGs, descriptions usually refer to facts. In our case, descriptions correspond to formulas of a tree description logic (TDL) defined as follows:

$$\begin{aligned} \textit{Description}&\ {:}{:}{=}\ \ x\rightarrow y\ \mid \ x\rightarrow ^{*} y\ \mid \ x\prec y\ \mid \ x\prec ^{+} y\ \mid \ x[f{:}E]\ \end{aligned}$$

where x and y range over node variables, \(\rightarrow \) represents immediate dominance, \(\rightarrow ^{*}\) its reflexive transitive closure, \(\prec \) immediate precedence, and \(\prec ^{+}\) its transitive closure. x[f : E] constrains feature f on node x.

Finally, axioms of the DCG indicate clauses which correspond to complete tree descriptions:

$$\begin{aligned} \textit{Axiom} \qquad&{:}{:}{=}\qquad \textit{Name} \end{aligned}$$
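To make the clause/goal machinery of (1) and (2) concrete, here is a minimal Python sketch (not XMG's actual implementation, which is a logic program): each derivation of the axiom accumulates one conjunction of descriptions, and disjunction enumerates alternative derivations. The clause names and description strings are invented for illustration.

```python
# Goals are thunks returning the list of derivations; each derivation is
# the list of descriptions accumulated along one branch of the disjunctions.

def desc(d):
    """A leaf goal: a single description."""
    return lambda: [[d]]

def conj(g1, g2):
    """Conjunction: concatenate the accumulations of every pair of derivations."""
    return lambda: [a + b for a in g1() for b in g2()]

def disj(g1, g2):
    """Disjunction: union of the derivations of both goals."""
    return lambda: g1() + g2()

# Clauses associate names with goals (abstraction).
clauses = {}
clauses["Subject"] = disj(desc("np-canonical"), desc("np-wh"))
clauses["Active"] = conj(clauses["Subject"], desc("verb-active"))

# Each derivation of the axiom yields one set of descriptions to solve.
print(clauses["Active"]())
# -> [['np-canonical', 'verb-active'], ['np-wh', 'verb-active']]
```

In XMG, each such derivation is later handed to a tree description solver; here the descriptions remain opaque strings.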

Grammars as Logic Program Executions. The compilation of the input metagrammar is summarized below:

figure a

The metagrammar is first tokenized, then it is parsed to produce an abstract syntax tree (AST). This AST is processed to unfold the statements composing the metagrammar and produce flat structures (instructions for a kernel language). These flat structures are then interpreted by a code generator to produce instructions for a virtual machine (in our case code for a Prolog interpreter). The code is finally executed by the interpreter to produce terms. These terms are accumulations of descriptive constraints. More precisely, code execution outputs sets of conjunctions of input TDL formulas (where logic variables have been unified according to the clause instantiations defined in the metagrammar). There is one such set per derivation of the axiom of the metagrammar. To get actual trees instead of tree description logic formulas, the latter need to be solved using a tree description solver such as [8]. The interpreter is thus enriched with such a solver as a post-processor. The result of this metagrammar compilation and execution is, for each axiom, a set of trees (minimal models of the input tree description).
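The chain above can be pictured as a simple composition of stages. The following sketch uses stand-in stage functions (not XMG's code) purely to show the data flow from source text to generated target code:

```python
# Schematic compilation chain: each stage consumes the previous stage's output.

def compile_mg(metagrammar, stages):
    data = metagrammar
    for stage in stages:
        data = stage(data)
    return data

stages = [
    lambda src: src.split(),                          # tokenizer
    lambda toks: ("ast", toks),                       # parser -> AST (stand-in)
    lambda ast: [("instr", t) for t in ast[1]],       # unfolder -> flat kernel instructions
    lambda ins: ["code(" + i[1] + ")" for i in ins],  # code generator -> target code
]

print(compile_mg("node x node y", stages))
# -> ['code(node)', 'code(x)', 'code(node)', 'code(y)']
```

In the real architecture the last stage emits Prolog code, which is then executed and post-processed by a tree description solver.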

The architecture above makes it possible to define declarative and concise descriptions of tree grammars. Still, it supports a single level of description, namely syntax. In order to allow users to describe other levels such as semantics, the XMG language has been extended by using DCGs with multiple accumulators (so-called Extended Definite Clause Grammar, EDCG [22]). In (2) above, Description is replaced with:

figure b

Depending on the dimension being used, there are two distinct description languages: one for describing syntactic trees (dimension syn) and one for describing semantic predicate structures (dimension sem).
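The effect of multiple accumulators can be sketched as follows (a toy dict-of-lists stand-in for EDCG's accumulators; the description strings are illustrative): descriptions tagged with different dimensions are accumulated separately during one derivation.

```python
from collections import defaultdict

# One accumulator per dimension, as with EDCG's multiple accumulators.
accumulators = defaultdict(list)

def accumulate(dimension, description):
    accumulators[dimension].append(description)

# One derivation may contribute to both levels of description.
accumulate("syn", "x -> y")        # a tree description
accumulate("sem", "l:love(a, b)")  # a semantic predicate
accumulate("syn", "x [cat:s]")

print(dict(accumulators))
# -> {'syn': ['x -> y', 'x [cat:s]'], 'sem': ['l:love(a, b)']}
```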

XMG thus offers some extensibility insofar as it supports two distinct levels of description.Footnote 3 Still, XMG does not allow users to describe linguistic structures other than trees or predicates. Nevertheless, XMG’s modular architecture and the multiple accumulators provided by EDCG offer an adequate backbone for on-demand composition of description languages by assembling elementary description languages (hereafter called language bricks).

In the next section, we will present the concept of assembling language bricks and show how to extend the XMG architecture so that users can define their own description language (or Domain Specific Language, DSL) and compile the corresponding compiler. The output of this meta-compilation is a metagrammar compilerFootnote 4 (i) whose architecture follows XMG’s architecture introduced above, and (ii) which can be used by linguists to describe actual language resources.

3 Assembling a Domain Specific Language

The foundational philosophy of XMG 2 is that the tool should be easily customizable to the user’s specific requirements in expressivity, rather than forcing the user to cast her intuitions in terms of a rigidly predefined framework not necessarily well suited to the task.

Thus XMG 2 aims at facilitating the definition of DSLs for building linguistic resources such as grammars or lexicons. Typically a DSL allows for describing some data structure using a concrete syntax. However, even if the same data structure may be used, under the hood, for different applications, different DSLs may still be desirable: for example, a decorated tree is a very versatile data structure and could potentially be used for tree-based description of syntax and for representing agglutinative morphology; yet the two tasks have quite different requirements and attempt to capture intuitions and generalizations of dissimilar nature.

3.1 Defining a Modular DSL by Assembling Bricks of Language

Our approach is predicated on assembling DSLs by composing bricks from an extensible library. A brick binds together a fragment of context-free syntax with some underlying data structure and some processing instructions to operate on it.

A brick can be viewed as a module: it exports a non-terminal which is the axiom of its language fragment, and it defines sockets which are non-terminals for which rules may be provided by other bricks.

Let us illustrate this brick-based description of linguistic resources by considering feature structures. Feature structures are elements that are used in many grammatical formalisms. The rules describing feature structures would consequently be added to the context free grammar of all DSLs that would be designed to describe these formalisms. This would lead to some redundancy as CFG rules would be repeated several times.

To avoid this redundancy, we propose to divide description languages into reusable and composable fragments called language bricks. For example, feature structures (also called Attribute-Value Matrices, AVM for short) use the following concrete syntax:

$$\begin{aligned} \textit{AVM}\quad {:}{:}{=}\quad&[ \textit{Feats} ] \end{aligned}$$
(3)
$$\begin{aligned} \textit{Feats}\quad {:}{:}{=}\quad&\textit{Feat} \quad |\quad \textit{Feat}{} \texttt {,} \textit{Feats} \end{aligned}$$
(4)
$$\begin{aligned} \textit{Feat}\quad {:}{:}{=}\quad&\textit{id} = {\_Value} \end{aligned}$$
(5)

The axiom of this brick is the non-terminal AVM. Note that the brick provides no production for the non-terminal _Value: this is what we call an external non-terminal and serves as the socket mentioned earlier. A production for _Value is obtained by plugging the axiom of another brick into this socket. To this end, let us consider a Value brick:

$$\begin{aligned} \textit{Value}\quad {:}{:}{=}\quad&\textit{id} \quad |\quad {{{\mathbf {\mathtt{{bool}}}}}} \quad |\quad {{{\mathbf {\mathtt{{int}}}}}} \quad |\quad {{{\mathbf {\mathtt{{string}}}}}} \quad |\quad {\_Else} \end{aligned}$$
(6)

The external non-terminal _Else makes it possible to plug in additional kinds of values. Now we can plug the Value axiom into the AVM brick’s _Value socket (to define admissible feature values) and the AVM brick’s axiom into the Value brick’s _Else socket (to allow for recursive AVMs, that is, AVMs whose feature values can be AVMs). Plugging an axiom into a socket is realized by adding a production of the following form:

$$\begin{aligned} {\_Value}\quad {:}{:}{=}\quad&\textit{Value} \\ {\_Else}\quad {:}{:}{=}\quad&\textit{AVM} \end{aligned}$$

An external non-terminal may have any number of connections, including none. One production is added for each connection: if there are many, then the external non-terminal has alternative expansions; if there are none, then it has no expansion and does not contribute to the generated language. This process can be illustrated graphically as follows (only bricks’ axioms and sockets are displayed):

figure c

There is a cycle in this graph because we have assembled the concrete syntax for an inductively defined type. It is possible to create several instances of the same brick and to connect each instance differently. The method that we propose to instantiate and connect bricks is concretely based on a configuration file using the YAMLFootnote 5 syntax. For our last example, the configuration file would contain the following code:

figure d

where avm and value are instances of the language bricks presented earlier (multiple instances can be distinguished using a suffix). For each one, we give the list of connections, as defined above. The context free grammar generated by this construction is the following:

$$\begin{aligned} \textit{AVM}\quad {:}{:}{=}\quad&[ \textit{Feats} ] \\ \textit{Feats}\quad {:}{:}{=}\quad&\textit{Feat} \quad |\quad \textit{Feat} , \textit{Feats} \\ \textit{Feat}\quad {:}{:}{=}\quad&\textit{id} = {\_Value}\\ {\_Value}\quad {:}{:}{=}\quad&\textit{id} \quad |\quad {{{\mathbf {\mathtt{{bool}}}}}} \quad |\quad {{{\mathbf {\mathtt{{int}}}}}} \quad |\quad {{{\mathbf {\mathtt{{string}}}}}} \quad |\quad {\_Else}\\ {\_Else}\quad {:}{:}{=}\quad&{AVM}\\ \end{aligned}$$
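The assembly step itself can be sketched as follows (a hypothetical data layout, not XMG 2's internals): each brick exports an axiom and a CFG fragment with external non-terminals (sockets, prefixed with "_"), and a configuration maps each socket to the axioms connected to it, adding one production per connection.

```python
# Brick = axiom + CFG fragment; right-hand sides are lists of symbols.
avm_brick = {
    "axiom": "AVM",
    "rules": {"AVM": [["[", "Feats", "]"]],
              "Feats": [["Feat"], ["Feat", ",", "Feats"]],
              "Feat": [["id", "=", "_Value"]]},
}
value_brick = {
    "axiom": "Value",
    "rules": {"Value": [["id"], ["bool"], ["int"], ["string"], ["_Else"]]},
}

# The YAML configuration, as a Python dict: socket -> connected axioms.
config = {"_Value": ["Value"], "_Else": ["AVM"]}

def assemble(bricks, config):
    """Merge the brick fragments and add one production per connection."""
    grammar = {}
    for brick in bricks:
        grammar.update(brick["rules"])
    for socket, axioms in config.items():
        grammar[socket] = [[axiom] for axiom in axioms]
    return grammar

grammar = assemble([avm_brick, value_brick], config)
print(grammar["_Else"])   # -> [['AVM']]  (recursive AVMs enabled)
```

A socket with several connections would simply receive several alternative productions, and a socket absent from the configuration would receive none, exactly as described above.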

3.2 Meta-Compiling a Modular DSL

Now that we have defined a way to assemble a DSL from a single configuration file, let us see how to assemble the whole processing chain for this DSL from this file.

Let us first have a look at XMG 2’s architecture, which is given in Fig. 1.

Fig. 1.
figure 1

Architecture of XMG 2.

XMG 2 can be used by three different types of users (hereafter called profiles). The first profile, called User in Fig. 1, corresponds to a linguist whose aim is to write a description of a linguistic resource (that is, a metagrammar) and feed it to a metagrammar compiler (the so-called Meta Executor). This tool compiles and executes the input metagrammar to generate the corresponding linguistic resource. Concretely, XMG 1 is an instance of Meta Executor.

A new step towards modularity is that new instances of Meta Executors can be easily assembled, by an Assembler. This type of user writes simple specifications using reusable bricks as shown previously, and a tool called Meta Compiler automatically produces the whole processing chain (that is, the Meta Executor) for the corresponding assembled DSL.

Bricks used for this assembly are picked from a brick library, which can be extended by a Programmer. This profile is the only one which requires programming skills. Creating a new brick consists in giving the context free grammar of the DSL and implementing the processing chain for it.

Let us now see how brick assembly and meta-compilation work in practice. As shown in Fig. 1, the processing chain is divided into two main parts, compilation and generation (performed by the executor), whose modular construction we will now detail.

Assembling the Compiler. The compilers we want to assemble transform a program written in a DSL into a logic program. The processing chain has the following shape:

figure e

This chain is closely related to XMG’s architecture introduced in Sect. 2. The process starts with lexical and syntactic analysis of the metagrammar (tokenizer and parser), creating its abstract syntax tree (AST). Then, unlike XMG, a type-inferrer ensures the consistency of the data types inside this AST. The next step, accomplished by the unfolder, rewrites the metagrammar using a minimal set of flat instructions of our kernel language. Finally, these instructions are translated into the target language (Prolog) by the code generator. To sum up, the compiler deals with three different languages: the DSL is used for the input, the kernel language for the output of the unfolder, and Prolog for the output of the code generator.

We will now see how to create these five modules for each brick assembly. As we will see, the tokenizer and parser can be automatically generated using standard compilation techniques. For the three other modules, we will see how treatments can be distributed over the bricks. But first, let us present two devices of which we make heavy use, both in the code written for the bricks and in the generated code: extended DCGs and attributed variables.

Extended DCG. In a compiler, global and modifiable data structures are necessary (e.g. tables modeling the context). This is problematic here because Prolog does not offer such structures. The classic way to address this need in logic programming is to use additional pairs of arguments inside predicates, one to represent the structure before the application of the predicate, the other one to represent it after the application. Such pairs are usually called an accumulator. DCGs offer syntactic sugar to make handling accumulators easier. In case an arbitrary number of accumulators is needed, one can use Extended DCG [22]. This is what is done in XMG 2, which contains a library based on EDCG where one can declare accumulators, associate them with predicates, and trigger actions on the accumulations inside these predicates.

Accumulators are accessible by their identifier, and are manipulated by the application of the operations defined for them. For example, acc::add(H) applies the operation add with the argument H to the accumulator acc: the element H is added to the structure acc.
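The idea of named accumulators with declared operations, in the spirit of acc::add(H), can be sketched as follows (a toy Python stand-in; in XMG 2 this is an EDCG-based Prolog library, and the operation names here are invented):

```python
class Accumulator:
    """A named, mutable state manipulated only through declared operations."""
    def __init__(self, initial, operations):
        self.state = initial
        self.operations = operations  # name -> state-transforming function

    def apply(self, op, *args):
        # Conceptually the before/after pair of an EDCG accumulator.
        self.state = self.operations[op](self.state, *args)

# Declaring an accumulator "constraints" with an enqueue operation.
constraints = Accumulator([], {"enq": lambda s, h: s + [h]})
constraints.apply("enq", "x -> y")
constraints.apply("enq", "x [cat:s]")
print(constraints.state)   # -> ['x -> y', 'x [cat:s]']
```

In Prolog there is no mutable state, so EDCG instead threads a before/after argument pair through every predicate associated with the accumulator; the class above simulates the net effect.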

Attributed Variables. In Prolog, terms have a fixed arity, which is a limitation in our case: we wish to manipulate structures that can be constrained incrementally during compilation. This is for instance the case for feature structures, whose size can be augmented by open unification. For this reason, the feature structures brick includes a module containing a dedicated library. This library uses the concept of attributed variables [14], which make it possible to associate a Prolog variable with a set of attributes, and to define a dedicated unification algorithm over these attributes, triggered when two variables of the same type are unified.

In practice, to handle attributed variables, we use the YAPFootnote 6 library atts, which provides two predicates to modify the attribute of a variable or to consult it (put_atts and get_atts). Two other predicates are defined by the user: verify_attributes, which is called during unification, and attribute_goal, which converts an attribute into a goal. In our module for feature structures, a variable is associated with an attribute, this attribute being the list of attribute-value pairs composing the structure.
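The behaviour this gives to feature structures, namely open unification over variable-arity records, can be sketched as follows (a minimal Python stand-in for what verify_attributes implements over attribute lists):

```python
def unify_fs(fs1, fs2):
    """Open unification of two feature structures (dicts).
    Returns the merged structure, or None on an atomic clash."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat not in result:
            result[feat] = val            # open: the structure grows
        elif isinstance(result[feat], dict) and isinstance(val, dict):
            sub = unify_fs(result[feat], val)
            if sub is None:
                return None
            result[feat] = sub            # recursive unification of sub-AVMs
        elif result[feat] != val:
            return None                   # conflicting atomic values
    return result

print(unify_fs({"num": "sg"}, {"pers": "3"}))   # -> {'num': 'sg', 'pers': '3'}
print(unify_fs({"num": "sg"}, {"num": "pl"}))   # -> None
```

Unlike fixed-arity Prolog terms, the merged structure can acquire new features at each unification step, which is exactly why attributed variables are needed.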

Lexical and Syntactic Analyses. Every brick used to build a DSL includes a file named lang.def, which contains context free rules as those previously shown. Each such rule is associated with a semantic action. This semantic action specifies which Prolog term will be built when this rule is parsed. For instance, the brick for feature structures comes with the following lang.def file:

figure f

The second line means that when a Feat is parsed, an avm:feat/2 term is created, its two arguments being the results of parsing id and _Value respectively. The parser for the DSL can be created from the lang.def file similarly to what is done by the parser generator Yacc, where the rules for a LALR parser are inferred from the language definition. The tokenizer for the language is also created by extending a generic tokenizer with the punctuation and keywords specific to the brick.

Type Inference. The type inference of the DSL program has to cope with the particular context of our tools, that is to say constrained data structures (structures with partial information). A central problem is the typing of feature structures, for which we follow the ideas of [16] for the typing of records, adapting them for variable arity. The modular type inference of XMG 2 is based on two predicates, xmg:type_stmt(Stmt,Type) and xmg:type_expr(Expr,Type). A brick needs to provide new clauses for these two predicates, one for each of its local constructors.

As an illustration, the following clause is given by the brick for AVMs:

figure g

where avm:feat is a constructor of the AVM brick, Attr and Value are variables representing a feature name and value respectively. TypeDef is the type of the feature, which is used to check the corresponding value. The predicate extend_type updates the feature type (for instance in case it refers to an AVM).

Unfolding. The kernel language is a minimal set of flat instructions (terms of depth one), which can easily be translated into the target language of the compilation. The unfolder rewrites abstract syntax trees (terms of arbitrary depth) into instructions of the kernel language. Modularity is achieved in the same way as for the type checker, thanks to two predicates, xmg:unfold_stmt(Stmt) and xmg:unfold_expr(Expr, Var), for which every brick has to provide new clauses. The following clause gives the support for the unfolding of the avm constructor, as provided by the avm brick:

figure h

where Target is the variable which will represent the feature structure, constraints is the accumulator where the instructions are accumulated, enq is the enqueuing operation on this accumulator, and unfold_feats a local predicate handling the unfolding of the features.
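The flattening performed by the unfolder can be sketched as follows (invented instruction names, not XMG 2's actual kernel language): a nested AVM term of arbitrary depth is rewritten into depth-one instructions, with a fresh variable standing for each sub-structure.

```python
counter = 0
def fresh():
    """Return a fresh variable name V1, V2, ..."""
    global counter
    counter += 1
    return f"V{counter}"

def unfold_expr(expr, instrs):
    """Return a variable denoting expr, accumulating flat instructions."""
    if isinstance(expr, dict):               # an AVM
        var = fresh()
        instrs.append(("mk_avm", var))
        for feat, val in expr.items():
            v = unfold_expr(val, instrs)     # recurse on the feature value
            instrs.append(("set_feat", var, feat, v))
        return var
    var = fresh()
    instrs.append(("mk_const", var, expr))   # atomic value
    return var

instrs = []
unfold_expr({"cat": "np", "agr": {"num": "sg"}}, instrs)
print(instrs)
```

Running this yields one mk_avm per AVM, one mk_const per atomic value, and one set_feat per feature, i.e. only terms of depth one, which a code generator can translate line by line.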

Code Generation. We follow the same pattern for code generation. A brick must provide clauses of the predicate xmg:generate_instr(Instr) for every instruction of its kernel language. This predicate triggers the accumulation of Prolog code. The following example shows the implementation of the code generation rules for feature structures:

figure i

where decls is the table associating every variable identifier in the kernel language with a Prolog variable, tget is the operation for accessing variables in this table, and code::enq is the accumulation operation for the generated instructions.

Assembling the Executor. As shown in Fig. 1, the execution phase is handled by three components: the Generator, which transforms Prolog code into a linguistic resource; the base virtual machine (VM) of Prolog, on which the Generator runs; and the Specific VM, which extends the base VM to fit the linguistic description task (e.g. performing additional treatments when solving linguistic descriptions).

Concretely, the non-deterministic program created by the Compiler is first executed. Each successful execution of this program produces a set of accumulations of constraints inside the dimensions. For each dimension which requires solving the accumulated descriptions to obtain linguistic structures, the task of the executor is to extract all valid models described in the corresponding accumulation.

Note that, for a given execution, each accumulation (that is, each dimension) is handled by a specific solver. Solving may produce zero, one or more solutions for each execution. The solutions, expressed as terms, are given to the externalizer (still specific to the dimension), which translates them into the target language (XML or JSON) for storage or display.

Solving. Solving descriptions relies on the following sub processing chain:

figure j

Each accumulation created by the execution is given to the preparer, which transforms the accumulation into a constraint satisfaction problem. The set of constraints is then translated into executable code (Prolog code with bindings to the C++ Gecode libraryFootnote 7) by the solver. The extractor computes all the solutions to the problem and translates them into a term.

Note that solvers are also packaged in bricks (but these do not provide a lang.def file since the input language they deal with is the language of constraints which can be accumulated in a dimension). Note also that these solvers can be extended by defining plug-ins to apply specific constraints on the models being computed (e.g. natural language-dependent constraints such as clitic order in French).

4 Application: Designing a Language to Describe Syntax, Semantics and/or Morphology

As an illustration of the DSL assembly and meta-compilation techniques introduced above, let us see how the XMG language [6] presented in Sect. 2 can be assembled and enriched with another level of description, namely morphology, so that one can use the same framework to describe various dimensions of language.

4.1 Defining Language Bricks for Describing Tree Grammars

As mentioned above, XMG was designed to describe tree grammars such as Feature-Based Tree-Adjoining Grammars (FB-TAG) (see e.g. [11]). Basically, a FB-TAG is made of elementary trees whose nodes are labelled with feature-structures. These feature-structures associate features with either values or unification variables. The latter can be shared between syntactic and semantic representations.

Recall from (1) and (2) defined on page 3, that an XMG description is made of clauses containing either (i) Descriptions or (ii) conjunctive or disjunctive combinations of these Descriptions. In our case, Descriptions belong to a Dimension which is either syntax or semantics. Syntactic descriptions are tree fragments (defined as formulas of a tree description logic), and semantic descriptions formulas of a predicate logic.

Descriptions. Let us first define a language brick syn for defining syntactic descriptions (that is, syntactic statements). Such a brick contains both the definition of the syntax of the language (that is, a CFG), and instructions for processing (compiling) statements belonging to this language.

$$\begin{aligned} \textit{SynStmt}\quad {:}{:}{=}\quad&{{{\mathbf {\mathtt{{node}}}}}}\ \, \textit{id} \quad |\quad {{{\mathbf {\mathtt{{node}}}}}}\ \, \textit{id}\ \, {\_AVM} \\ |\quad&\textit{id} \quad {{{\mathbf {\mathtt{{->}}}}}} \quad \textit{id} \quad |\quad \textit{id} \quad {{{\mathbf {\mathtt{{->*}}}}}} \quad \textit{id} \quad |\quad \textit{id} \quad {{{\mathbf {\mathtt{{<}}}}}} \quad \textit{id} \quad |\quad \textit{id} \quad {{{\mathbf {\mathtt{{<+}}}}}} \quad \textit{id} \end{aligned}$$

A syntactic statement (SynStmt) is either the definition of a node (identified by the value id), the definition of a node labelled with some feature-structure (_AVM), or the definition of a relation between node identifiers (\({{{\mathbf {\mathtt{{->}}}}}}\) for dominance, \({{{\mathbf {\mathtt{{<}}}}}}\) for precedence)Footnote 8. An AVM is described using the bricks defined by (3), (4), (5) and (6) page 5. Finally, semantic descriptions are defined using the following brick sem:

$$\begin{aligned} \textit{SemStmt} \quad {:}{:}{=}\quad \ell {{{\mathbf {\mathtt{{:}}}}}} p {{{\mathbf {\mathtt{{(}}}}}} id,\ldots ,id {{{\mathbf {\mathtt{{)}}}}}} \quad \mid \quad id \quad {{{\mathbf {\mathtt{{<<}}}}}} \quad id \end{aligned}$$

where p refers to a predicate, id to unification variables representing p’s arguments, \(\ell \) to a predicate label (these are all identifiers), and \({{{\mathbf {\mathtt{{<<}}}}}}\) to a scope constraint.Footnote 9

Combinations. Combinations of descriptions are realized by means of a parameterized brick constructor Dim\(_{X}\) defined as follows (X is a lexical keyword, here syn or sem):

This brick constructor is used to instantiate bricks of the following form:Footnote 10

$$\begin{aligned} \textit{Stmt}\quad {:}{:}{=}\quad&{\_Stmt} \quad |\quad \textit{Stmt}\ {{{\mathbf {\mathtt{{;}}}}}}\ {\_Stmt} \quad |\quad \textit{Stmt}\ {{{\mathbf {\mathtt{{||}}}}}}\ {\_Stmt} \end{aligned}$$

which allow to describe conjunctive or disjunctive combinations of statements. The external non-terminal _Stmt makes it possible to connect either syntactic or semantic statements (which are thus accumulated separately):

$$\begin{aligned} \textit{Dim}_{{\texttt {syn}}}.{\_Stmt}\quad {:}{:}{=}\quad&\textit{SynStmt} \\ \textit{Dim}_{{\texttt {sem}}}.{\_Stmt}\quad {:}{:}{=}\quad&\textit{SemStmt} \end{aligned}$$

4.2 Assembling Language Bricks for Describing Tree Grammars

From the language bricks defined above, it is possible to assemble the XMG compiler. Concretely, this amounts to defining the needed assembly, that is, to writing the YAML configuration file which defines which bricks to load and how they interact.

The YAML file (named compiler.yaml) defining how to automatically assemble the XMG compiler is given below:

figure k

Concretely, what this YAML file says is the following. The target metagrammatical DSL (that is, the XMG language) corresponds to a brick named mg where statements are defined in the brick combination. The brick combination contains statements of type either dim_syn or dim_sem (these are parameters of the combination brick). The dim_syn brick contains statements defined in the brick syn, introduced by the keyword (tag) "syn", and solved using a solver named "tree". Statements of the brick dim_sem are introduced by the keyword "sem" and defined in the brick sem. Semantic statements do not need to be solved (hence the absence of any solver feature). The brick syn contains AVMs defined in the brick avm and expressions defined in value. The brick avm is parameterized by value and vice versa, as mentioned in Sect. 3.

4.3 Adding a Morphological Layer

So far, we showed how to assemble the DSL corresponding to XMG from a library of language bricks using a configuration file in YAML format. From this file, the XMG compiler is automatically built. Let us now see how to add morphological descriptions to this DSL (and recompile the corresponding meta-executor).

The morphological descriptions we will consider here are inspired by work on Ikota, a Bantu language spoken in Gabon [7]. The idea is to describe inflected verbal forms as (i) concatenations of ordered morphological fields (namely subject, tense, root, aspect, active and proximal) and (ii) morphological features associated with these fields (e.g. person, number, tense, verbal class, etc.).

The metagrammar of verbal forms contains for each field alternative possible realizations (that is, a disjunction of elementary descriptions). A verb is then described as the conjunction of all morphological fields. The metagrammar compiler will compute all combinations of values of these fields (that is, all elements of the cartesian product \(subject \times tense \times root \times aspect \times active \times proximal\)) and keep those where there is no unification failure between morphological features. As an illustration, consider the successful combination for the inflected form (you will eat) below, where fields are numbered from 1 to 6.Footnote 11

figure n
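The computation performed by the metagrammar compiler here can be sketched as follows (illustrative field realizations and features, not the actual Ikota data): take the cartesian product of the alternatives for each field, and keep only the combinations whose morphological features unify.

```python
from itertools import product

def unify(f1, f2):
    """Merge two flat feature sets; None on conflict."""
    merged = dict(f1)
    for k, v in f2.items():
        if k in merged and merged[k] != v:
            return None
        merged[k] = v
    return merged

# Each field: a list of (form, features) alternatives (illustrative values).
fields = [
    [("o", {"pers": "2", "num": "sg"}), ("to", {"pers": "2", "num": "pl"})],  # subject
    [("bé", {"tense": "fut"})],                                               # tense
    [("dz", {"cls": "1"})],                                                   # root
    [("á", {"tense": "fut"})],                                                # aspect
]

forms = []
for combo in product(*fields):
    feats = {}
    for _, f in combo:
        feats = unify(feats, f)
        if feats is None:
            break                       # unification failure: discard combo
    if feats is not None:
        forms.append("".join(form for form, _ in combo))

print(forms)   # -> ['obédzá', 'tobédzá']
```

A combination whose aspect marker carried, say, tense:past would be filtered out by the same unification check, which is how the metagrammar rules out ill-formed field sequences.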

To extend the DSL defined above with such descriptions, we need a language brick allowing to define morphological fields, to associate them with a lexical form (and potentially also features), and to order them:

$$\begin{aligned} \textit{MorphStmt}\quad {:}{:}{=}\quad&{{{\mathbf {\mathtt{{field}}}}}}\ \, \textit{id}\ \, \textit{id} \quad |\quad {{{\mathbf {\mathtt{{field}}}}}}\ \, \textit{id}\ \, \textit{id}\ \, {\_AVM} \quad |\quad \textit{id} \quad {{{\mathbf {\mathtt{{>>}}}}}} \quad \textit{id} \end{aligned}$$

Note that this brick reuses the avm and value (e.g. for identifiers) bricks already defined above. To assemble this DSL, the YAML configuration file from Sect. 4.2 needs to be extended as illustrated below:

figure o

Basically, combinations no longer only contain syntactic or semantic statements, but also statements defined in the brick dim_morph. These statements are introduced by the keyword "morph" and are of type morph. Finally, AVMs contained in morphological statements are of type avm. From this extended compiler.yaml file, a new XMG-like meta-executor can be compiled. This tool can be used to describe (within the same metagrammar or not) not only tree grammars with flat semantic representations, but also inflected forms.

Note that the DSL assembly and meta-compilation techniques introduced here do not have as a main goal to provide users with means to design metagrammatical DSLs which would fit several dimensions of language at once (even if it may be technically possible). This extension of the XMG DSL is given for illustration purposes. The motivation underlying XMG 2 is that, depending on the target linguistic resource, one should be able to easily define and use appropriate DSLs. XMG 2 should thus provide users with means to easily assemble and build compilers for such DSLs (no matter which and how many of these are needed).

5 Related Work

The meta-compilation architecture presented here exhibits two particularly interesting properties in the context of linguistic resource engineering, namely modularity and extensibility. It is inspired by previous work on compilation, illustrated by systems such as LISA [13], JastAdd [9], or Neverlang [3]. All these systems allow users to relatively easily extend compilers by defining modules.

Still, their methodology differ from ours. They all aim at offering software designers with means to develop their own DSL by defining formal language specifications. These specifications are often complex (for instance, in Neverlang, assembling elementary bricks requires to solve graph dependencies), while XMG 2 allows for easy configuration using the YAML format.

Furthermore, these systems are used to extend or recreate existing substantial compilers. As an illustration, both JastAdd and Neverlang were used to build extensible Java Virtual Machines. We are mainly interested in providing users with easy-to-assemble dedicated specific languages. XMG 2’s philosophy is to provide adequate DSLs, that fit the linguistic description tasks. In particular, users should be able to reconfigure their DSL according to their needs, without having to support a large machinery.

Also, as mentioned above, XMG 2 provides three user profiles which make it different from other approaches. Indeed, the expected JastAdd, LISA or Neverlang users are skilled programmers. In XMG 2, contributing to an assembly of language bricks is such an easy task that users who do not know programming can define their own DSL.

Finally, unlike previous approaches, XMG 2 is based on logic programming, which makes it particularly appropriate for describing linguistic structures since these often use unification variables.

Apart from these approaches, to our knowledge, there are very few attempts (in particular in the NLP community) at providing users with such an extensible and modular linguistic description framework. One may cite work on cross-formalism language description by [4]. In their approach, the authors use a metagrammar compiler primarily designed for Tree-Adjoining Grammars (TAG) to derive both TAG and Lexical-Functional Grammar (LFG) rules from a single linguistic description. Their work was made possible by the fact that TAG and LFG both relies on syntactic trees. Should users be interested in describing less related structures within the same framework, their approach would not permit this.

Another interesting approach is that of Grammatical Framework (GF) [18]. GF is a system for designing grammars for various languages. It is based on an abstract syntax which can be mapped to several concrete syntaxes (hence languages). A GF grammar is modular, and can be interpreted by the GF system to parse or generate sentences. That is, GF provides a modular and extensible way to design a specific type of linguistic resources (multi-lingual morpho-syntactic grammars), while XMG 2 tries to its best to remain agnostic regarding the linguistic structures it describes.

6 Conclusion

In this paper, we showed how to design Domain Specific Languages for describing linguistic resources, by assembling elementary language bricks. We then presented how to concretely implement a meta-compiler which would take as input a library of language bricks together with a configuration file defining how to assemble a given DSL, and would produce automatically the corresponding compiler.

This meta-compilation from a DSL specification has been used for instance to produce a Prolog version of the XMG compiler. The resulting compiler was successfully used to (re)compile existing large scale tree grammars for French, English and German [6] (the generated resources are identical to the ones produced by the original XMG). The development of new syntactic resources was also initiated using XMG 2’s modular architecture (including works on São Tomense [20] and Arabic [2]).

XMG 2’s extensibility made it possible to create new language bricks, and thus new compilers. These were used for various description tasks, including the development of morphological resources (lexicon of inflected forms) for Ikota [7], the definition of new syntax-semantics and morpho-semantics interfaces (see e.g. [15]). This work paved the way for new uses of description languages. Future work includes extending the library of language bricks (and corresponding solvers) to support these new uses.

Several paths remains to be explored in the context of this work, both on the theoretical and on the practical side. First, the expressive power of certain types of DSL needs to be further studied. As an example, we saw that DSLs are used to describe tree-based grammars in the TAG formalism. In our case, formulas of a dominance-based tree description logic are used and solved using Constraint Satisfaction techniques. Alternatively formulas of a monadic second order (MSO) logic could be considered and solved using automaton-based techniques. More generally, the link between the input metagrammatical description language (DSL) and the target grammar formalism requires more attention. Interesting questions include the definition of the class of grammar formalisms (resp. of formal languages) which can be captured by XMG-like metagrammatical descriptions.

Second, on the practical side this approach to linguistic resource production made the relation between software engineering and linguistic resource design clearer. Designing precision resource is very close to designing software (one has to deal with relatively complex formal statements and expressions) and should thus benefit from the same kind of integrated development environments. Future work on metagrammar meta-compilation must take this analogy into account by providing user with facilities such as debuggers, regression tests, etc.