1 Introduction

A syntax definition formalism is a formal language to describe the syntax of formal languages. At the core of a syntax definition formalism is a grammar formalism in the tradition of Chomsky’s context-free grammars [14] and the Backus-Naur Form [4]. But syntax definition is concerned with more than just phrase structure, and encompasses all aspects of the syntax of languages.

In this paper, we give an overview of the syntax definition formalism SDF3 and its tool ecosystem that supports the multi-purpose interpretation of syntax definitions. The paper does not present any new technical contributions, but it is the first paper to give a (high-level) overview of all aspects of SDF3 and serves as a guide to the literature. SDF3 is the third generation in the SDF family of syntax definition formalisms, which were developed in the context of the ASF+SDF [5], Stratego/XT  [10], and Spoofax [38] language workbenches.

The first SDF [23] supported modular composition of syntax definitions, a direct correspondence between concrete and abstract syntax, and parsing with the full class of context-free grammars enabled by the Generalized-LR (GLR) parsing algorithm [44, 56]. Its programming environment, as part of the ASF+SDF MetaEnvironment [40], focused on live development of syntax definitions through incremental and modular scanner and parser generation [24, 25, 26] in order to provide fast turnaround times during language development.

The second generation, SDF2, encompassed a redesign of the internals of SDF without changing the surface syntax. The front-end of the implementation consisted of a transformation pipeline from the rich surface syntax to a minimal core (kernel) language [58] that served as input for parser generation. The key change of SDF2 was its integration of lexical and context-free syntax, supported by Scannerless GLR (SGLR) parsing [60, 61], enabling composition of languages with different lexical syntax [12].

SDF3 is the latest member of the family and inherits many features of its predecessors. The most recognizable change is to the syntax of productions, which should make it more familiar to users of other grammar formalisms. Further, it introduces new features in order to support multi-purpose interpretations of syntax definitions. The goals of the design of SDF3 are (1) to support the definition of the concrete and abstract syntax of formal languages (with an emphasis on programming languages), (2) to support declarative syntax definition so that there is no need to understand parsing algorithms in order to understand definitions [39], (3) to make syntax definitions readable and understandable so that they can be used as reference documentation, and (4) to support execution of syntax definitions as parsers, but also for other syntactic operations, i.e. to support multi-purpose interpretation based on a single source. The focus on multi-purpose interpretation is driven by the role of SDF3 in the Spoofax language workbench [38].

In this paper, we give a high-level overview of the features of SDF3 and how they support multi-purpose syntax definition. We give explanations by means of examples, assuming some familiarity of the reader with grammars. We refer to the literature for formal definitions of the concepts that we introduce. Figure 1 presents the complete syntax definition of a tiny functional language (inspired by OCaml [42]), which we will use as a running example without (necessarily) referring to it explicitly.

Fig. 1. Syntax of a small functional language in SDF3 and an example program.

2 Phrase Structure

A programming language is more than a set of flat sentences. It is the structure of those sentences that matters. Users understand programs in terms of structural elements such as expressions, functions, patterns, and modules. Language designers, and the tools they build to implement a language, operate on programs through their underlying (tree) structure. The productions in a context-free grammar create the connection between the tokens that form the textual representation of programs and their phrase structure  [14]. Such productions can be interpreted as parsing rules to convert a text into a tree. But SDF3 emphasizes the interpretation of productions as definitions of structure  [39].

A sort (also known as non-terminal) represents a syntactic category such as expression (Exp), pattern match case (Case), or pattern (Pat). A production defines the structure of a language construct. For example, the production

Exp.Add = Exp "+" Exp

defines that an addition expression is one alternative for the Exp sort and that it is the composition of two expressions. A production makes the connection with sentences by means of literals in productions. In the production above, the two expressions making an addition are separated by a + operator. Finally, a production defines a constructor name for the abstract syntax tree structure of a program (Add in the production above). The pairs consisting of sort and constructor names should be unique within a grammar and can be used to identify productions. (Such explicit constructor names are new in SDF3 compared to SDF2.) A set of such productions is a grammar.

The productions of a grammar generate a set of well-formed syntax trees. For example, Fig. 2 shows a well-formed tree over the example grammar. The language defined by a grammar is the set of sentences obtained by taking the yields of those trees, where the yield of a syntax tree is the concatenation of its leaves. Thus, the sentence corresponding to the tree in Fig. 2 is obtained by reading the leaves of that tree from left to right.
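The notion of yield can be illustrated with a minimal Python sketch. The tree encoding (a `Node` with a constructor name and a list of children, where leaves are plain strings) and the `Var` constructor are assumptions made for illustration; they are not part of SDF3 or of Fig. 2.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Node:
    """A syntax tree node; leaves are plain strings (hypothetical encoding)."""
    constructor: str
    children: List[Union["Node", str]]


def tree_yield(tree: Union[Node, str]) -> str:
    """Concatenate the leaves of a syntax tree from left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(tree_yield(child) for child in tree.children)


# A tree for an addition; layout between tokens is kept as explicit leaves.
example = Node("Add", [Node("Var", ["a"]), " ", "+", " ", Node("Var", ["b"])])
```

Under these assumptions, `tree_yield(example)` gives back the sentence `a + b`.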

The grammars of programming languages frequently feature lists, including lists of statements in a block, lists of field declarations in a class, and lists of parameters of a function. SDF3 supports direct expression of such list sorts by means of Kleene star and plus operators on sorts. In Fig. 1 the formal parameters of a Fun is defined as ID*, a list of zero or more identifiers. Other kinds of list include A+ (one or more As), {A sep}* (zero or more As separated by seps), and {A sep}+ (one or more As separated by seps). Lists with separators are convenient to model, for example, the arguments of a function as {Exp ","}*, i.e. a list of zero or more expressions separated by commas.

Fig. 2. Concrete syntax tree

Fig. 3. Abstract syntax tree

Abstract Syntax. Concrete syntax trees contain irrelevant details such as keywords, operator symbols, and parentheses (as identified by the bracket attribute on productions). These details are irrelevant since the constructor of a production of a node uniquely identifies the language construct concerned. Thus, from a concrete syntax tree we obtain an abstract syntax tree by omitting such irrelevant details. Figure 3 shows the abstract syntax tree obtained from the concrete syntax tree in Fig. 2. Abstract syntax trees can be represented by means of first-order terms in which a constructor is applied to a (possibly empty) sequence of sub-terms. For example, the abstract syntax tree of Fig. 3 is represented by the term.

figure c

Note that lists are represented by sequences of terms between square brackets.
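The mapping from concrete to abstract syntax described above can be sketched in Python as follows. The classes `Lit`, `Lex`, `Node`, and `ListNode`, the shape of the `Fun` production, and the constructor names `Var` and `Paren` are hypothetical; the sketch only shows how literals and bracket productions disappear while constructors, lexemes, and lists remain.

```python
from dataclasses import dataclass


@dataclass
class Lit:
    """A literal such as a keyword or operator symbol (dropped in the AST)."""
    text: str


@dataclass
class Lex:
    """A lexeme such as an identifier (kept as a string in the AST)."""
    text: str


@dataclass
class Node:
    cons: str
    children: list
    bracket: bool = False  # True for productions with the bracket attribute


@dataclass
class ListNode:
    items: list


def to_term(tree) -> str:
    """Render the abstract syntax term of a concrete syntax tree."""
    if isinstance(tree, Lex):
        return '"' + tree.text + '"'
    if isinstance(tree, ListNode):
        return "[" + ", ".join(to_term(item) for item in tree.items) + "]"
    kids = [child for child in tree.children if not isinstance(child, Lit)]
    if tree.bracket:
        return to_term(kids[0])  # parentheses leave no trace in the AST
    return tree.cons + "(" + ", ".join(to_term(kid) for kid in kids) + ")"


def var(name):
    return Node("Var", [Lex(name)])


add = Node("Add", [var("x"), Lit("+"), var("y")])
paren = Node("Paren", [Lit("("), add, Lit(")")], bracket=True)
fun = Node("Fun", [Lit("fun"), ListNode([Lex("x"), Lex("y")]), Lit("->"), paren])
```

For this hypothetical tree, `to_term(fun)` produces `Fun(["x", "y"], Add(Var("x"), Var("y")))`.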

Signatures. A grammar is a schema for describing well-formed concrete and abstract syntax trees. That is, we can check that a tree is well-formed by checking that the subtrees of a constructor node have the right sort according to the corresponding production, and a parser based on a grammar is guaranteed to produce such well-formed trees. To further process trees after parsing, we can work on a generic tree representation such as XML or ATerms  [6], or we can work with a typed representation. The schemas for such typed representations can be derived automatically from a grammar. For example, the Statix language for static semantics specification  [3] uses algebraic signatures to describe well-formed terms. The following signature in Statix defines the algebraic signature of a selection of the constructors of the example language:

figure d

The SDF3 compiler automatically generates signatures for Statix  [3], Stratego  [10], and DynSem  [57].

3 Declarative Disambiguation

Multiple trees over a grammar can have the same yield. Or, vice versa, a sentence in the language of a grammar can have multiple trees. If this is the case, the sentence, and hence the grammar, is ambiguous.

One strategy to disambiguate a grammar is to transform it to an unambiguous grammar that describes the same language, but has exactly one tree per sentence in the language. However, this may not be easy to do, may distort the structure of the trees associated with the grammar, and changes the typing scheme associated with the grammar. SDF3 supports the disambiguation of an ambiguous grammar by means of declarative disambiguation rules. In this section we describe disambiguation by means of associativity and priority rules. In the next section we describe lexical disambiguation rules.

Disambiguation by Associativity and Priority Rules. Many language reference manuals define the disambiguation of expression grammars by means of priority and associativity tables. SDF3 formalizes such tables as explicit associativity and priority rules over the productions of an ambiguous context-free grammar. While grammar formalisms such as YACC also define associativity and priority rules, these are defined in terms of low-level implementation details (e.g. choosing sides in a shift/reduce conflict). SDF3 associativity and priority rules have a direct formal semantics that is defined independently of a particular implementation [53]. The semantics is defined by means of subtree exclusion, that is, disambiguation rules are interpreted by rejecting trees that match one of the subtree exclusion patterns generated by a set of disambiguation rules. If a set of rules is sound and complete (there is a rule for each pair of productions), then disambiguation is sound and complete, i.e. assigns a single tree to a sentence. (Read the fine print in [53].)

A priority rule A.C1 > A.C2 defines that (the production identified by the constructor) A.C1 has higher priority than (the production identified by the constructor) A.C2. This means that (a tree with root constructor) A.C2 cannot occur as a left, respectively right, recursive child of (a tree node with constructor) A.C1 if A.C2 is right, respectively left, recursive. A left associativity rule defines that A.C1 and A.C2 are mutually left associative. This means that A.C2 cannot occur as a right recursive child of A.C1. (Right associativity is defined symmetrically.)

Figure 1 defines the disambiguation rules for the example language. According to these rules, Sub and Add are left associative and have higher priority than Eq, and Add has higher priority than Match; expressions that combine these constructs are parsed accordingly.

The semantics of priority shown above is particularly relevant for prefix and postfix operators. A prefix operator (such as Match) may occur as right child of an infix operator (such as Sub), even if it has lower priority, since such a combination of productions is not ambiguous. For example, an expression in which a Match occurs as the right operand of a Sub has only one abstract syntax tree.

Fig. 4. Concrete syntax trees for the example expression.

This semantics is safe, i.e. it does not reject any sentences that are in the language of the underlying context-free grammar. However, with the rules defined so far the semantics is not complete. As an example consider two of the trees for the sentence in Fig. 4. Both these trees are conflict free according to the rules above; a Match may occur as right-hand child of a Sub, and Sub and Add are left associative. The problem is that the conflict between Match as a left child of Add is hidden by the Sub tree. To capture such deep conflicts, the priority rule involving Add, Sub and Match is amended to require that a right-most occurrence of a production A.C2 in the left recursive argument of a production A.C1 is rejected if A.C1 > A.C2. (And symmetrically for left-most occurrences in right recursive arguments.) Thus, the priority rules of Fig. 1 select the left tree of Fig. 4.

The longest-match attribute of the Match production is a shorthand for deep priority conflicts for lists. The Match construct gives rise to nested pattern match clauses such as the following

figure n

The longest-match attribute disambiguates such nested lists by associating trailing cases with the nearest match statement.

Afroozeh et al. [1] showed that the semantics of disambiguation in SDF2 [7, 61] was not safe. They define a safe interpretation of disambiguation rules by means of a grammar transformation. Amorim and Visser [53] define a direct semantics of associativity and priority rules by means of subtree exclusion including prefix and postfix operators, mixfix productions, and indirect recursion. They show that the semantics is safe and complete for safe and complete sets of disambiguation rules for expression grammars without overlap. They also discuss the influence of overlap on disambiguation of expression grammars. For example, in Fig. 1, the productions Min, Sub, and App have overlap: an expression in which a minus sign appears between two operands can be parsed as a Sub or as an App whose second argument is a Min. This is not an ambiguity that can be solved by means of safe associativity and priority rules. The indexed priority rule of Fig. 1 solves this ambiguity by forbidding the occurrence of Min as second argument of App. (The index is 0 based.)

Amorim et al. show that deep conflicts are not only an artifact of grammars, but do actually occur in the wild, i.e. that they do occur in real programs  [52]. One possible implementation of disambiguation with deep conflicts is by means of data dependent parsers. Amorim et al. show that such parsers can have near zero overhead when compared to disambiguation by grammar rewriting  [55].

Parenthesization. In the previous section we saw that parentheses, i.e. productions annotated with the bracket attribute, are omitted when transforming a concrete syntax tree to an abstract syntax tree (Fig. 3). Furthermore, by using declarative disambiguation, the typing scheme for abstract syntax trees allows arbitrary combinations of constructors in well-formed abstract syntax trees. This is convenient, since it allows transformations on trees to create new trees without regard for disambiguation rules. Before formatting such trees (Sect. 5), parentheses need to be inserted in order to prevent creating a sentence that has a different (abstract) syntax tree when parsed. That is, we want the equation \(\mathsf {parse}(\mathsf {format}(t)) = t\) to hold for any well-formed abstract syntax tree.

The advantage of declarative disambiguation rules is that they can be interpreted not only to define disambiguation during parsing, but can also be interpreted to detect trees that need disambiguation. For example, without parenthesization, a tree in which an Eq node occurs as a direct argument of an Add node would be formatted as a sentence that is parsed as a different tree. Parenthesization recognizes that the first tree has a priority conflict between Add and Eq and inserts parentheses around the equality expression, such that the formatted sentence has the original tree as its abstract syntax tree. The implementation of SDF3 in Spoofax supports parenthesization following the disambiguation semantics of Amorim and Visser [53].
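The round-trip property parse(format(t)) = t can be checked on a toy scale with the following Python sketch. Everything in it is an assumption for illustration: terms are tuples such as `("Add", left, right)` and `("Var", "a")`, the operator symbols (in particular `==` for Eq) are hypothetical, and `parse` is a small precedence-climbing parser standing in for the SDF3 parser, not the SGLR algorithm. The point is only that the same priority and associativity tables drive both the insertion of parentheses and the check for conflicts.

```python
import re

SYMBOL = {"Add": "+", "Sub": "-", "Eq": "=="}  # hypothetical operator symbols
PRIORITY = {("Add", "Eq"), ("Sub", "Eq")}      # (higher, lower) pairs
LEFT_ASSOC = {(a, b) for a in ("Add", "Sub") for b in ("Add", "Sub")}
LEVEL = {"==": ("Eq", 1), "+": ("Add", 2), "-": ("Sub", 2)}


def needs_parens(parent, child, side):
    """A child needs parentheses if it would otherwise form an excluded subtree."""
    if child[0] not in SYMBOL:
        return False
    if (parent, child[0]) in PRIORITY:
        return True
    return side == "right" and (parent, child[0]) in LEFT_ASSOC


def fmt(term):
    """Format a term, inserting parentheses only where a conflict would arise."""
    if term[0] == "Var":
        return term[1]
    cons, left, right = term
    parts = []
    for side, child in (("left", left), ("right", right)):
        text = fmt(child)
        parts.append("(" + text + ")" if needs_parens(cons, child, side) else text)
    return parts[0] + " " + SYMBOL[cons] + " " + parts[1]


def parse(text):
    """Stand-in parser (precedence climbing) that ignores layout."""
    tokens = re.findall(r"[a-z]+|==|[-+()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            inner = expr(1)
            pos += 1  # skip the closing parenthesis
            return inner
        return ("Var", token)

    def expr(min_level):
        nonlocal pos
        left = atom()
        while peek() in LEVEL and LEVEL[peek()][1] >= min_level:
            cons, level = LEVEL[peek()]
            pos += 1
            left = (cons, left, expr(level + 1))
        return left

    return expr(1)
```

For instance, the term `("Add", ("Eq", ("Var", "a"), ("Var", "b")), ("Var", "c"))` is formatted as `(a == b) + c`, and parsing that text gives back the same term.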

4 Lexical Syntax

The lexical syntax of a language concerns the lexemes, words, or tokens of the language and typically includes identifiers, numbers, strings, keywords, operators, and delimiters. In traditional parsers and parser generators, parsing is divided into a lexical analysis (or scanning) phase in which the characters of a program are merged into tokens, and a context-free analysis phase in which a stream of tokens is parsed into phrase structure. Inspired by Salomon and Cormack  [45], SDF2 adopted character-level grammars using the single formalism of context-free productions to define lexical and context-free syntax, supported by scannerless parsing  [60]. SDF3 has inherited this feature.

Character-Level Grammars. In character-level grammars, the terminals of the grammar are individual characters. In SDF3, characters are indicated by means of character classes. For example, the definition of identifiers uses the character class [a-zA-Z0-9], comprising lower and upper case letters and digits. Tokens are defined using the same productions that we use for context-free phrase structure, except that it is not required to associate a constructor with a lexical production. For example, the syntax of identifiers is defined using a production stating that an identifier starts with a letter, which is followed by zero or more letters or digits. In a production with quoted literals, it appears that the literals are terminals. However, SDF3 defines such literals by means of a lexical production in which the literal acts as a sort, which is defined in terms of character classes. Thus, the use of a literal implies a production that spells out the literal as a sequence of single-character classes. SDF3 also supports case-insensitive literals; in this case, the implied production uses character classes that contain both the lower and the upper case variant of each letter.

Lexical Disambiguation. Just as phrase structure, lexical syntax may be ambiguous, requiring lexical disambiguation. The root cause of lexical ambiguity is overlap between lexical categories. For example, an identifier ab overlaps with the prefix of a longer identifier abc and let may be an identifier or a keyword. The two common lexical disambiguation policies are (1) prefer longest match, and (2) prefer one category over another. In scanner specification languages such as LEX  [43] these policies are realized by (1) preferring the longest match and by (2) ordering the definitions of lexical rules and selecting the first rule that applies. This works well when recognizing tokens independent of the context in which they appear.

In a character-level grammar that approach does not work, since tokenization may depend on the phrase structure context (see also the discussion on language composition below), and due to modularity of a syntax definition, there is no canonical order of lexical rules. Thus, lexical disambiguation is defined analogously to subtree exclusion for phrase structure in the previous section, by defining what is not allowed using follow restrictions and reject productions. We discuss an example of each. The expression ab can be a single identifier or the application of a to b. This ambiguity is solved by means of a follow restriction, which states that an identifier cannot be followed directly by a letter or digit. An expression that starts with if can be an if-then expression, or it can be the application of the variable if to some other variables, i.e.,

figure ai

This ambiguity is solved by means of reject productions that forbid the use of the keywords if and else as identifiers.
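The effect of the two mechanisms can be sketched in Python by first enumerating every way to read an identifier at a given position and then filtering the candidates. The function names are hypothetical and the sketch only models the identifier sort; it is not how SDF3 implements these rules.

```python
import string

LETTERS = set(string.ascii_letters)
ID_CHARS = LETTERS | set(string.digits)
KEYWORDS = {"if", "else"}  # keywords rejected as identifiers


def identifier_candidates(text, start):
    """All prefixes of text[start:] of the form letter (letter | digit)*."""
    candidates = []
    if start < len(text) and text[start] in LETTERS:
        end = start + 1
        candidates.append(text[start:end])
        while end < len(text) and text[end] in ID_CHARS:
            end += 1
            candidates.append(text[start:end])
    return candidates


def satisfies_follow_restriction(text, start, lexeme):
    """An identifier may not be followed directly by a letter or digit."""
    following = start + len(lexeme)
    return following >= len(text) or text[following] not in ID_CHARS


def identifiers(text, start):
    """Candidates that survive the follow restriction and the reject rule."""
    return [
        lexeme
        for lexeme in identifier_candidates(text, start)
        if satisfies_follow_restriction(text, start, lexeme)
        and lexeme not in KEYWORDS
    ]
```

Without disambiguation, `identifier_candidates("ab", 0)` contains both `a` and `ab`; after filtering, `identifiers("ab", 0)` keeps only `ab`, and `identifiers("if x", 0)` is empty because `if` is rejected.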

Layout. Another aspect of lexical syntax is the whitespace characters and comments that can appear between tokens, which are known as ‘layout’ in SDF. The definition of layout is a matter of lexical definition as that of any other lexical category. Module lex in Fig. 1 defines layout as whitespace, multi-line comments (delimited by /* and */), and single-line comments (starting with //). The multi-line comments can be nested to enable commenting out code with comments. This is not supported by scanner generators based on regular expressions. Note the use of follow restrictions to ensure that an asterisk within a multi-line comment is not followed by a slash (which should be parsed as the end of the comment), and to characterize end-of-file as the empty string that is not followed by any character (which is in turn defined as the complement of the empty character class).

Fig. 5. Normalized syntax and restrictions for a selection of productions from Fig. 1.

What is special about layout is that it can appear between any two ordinary tokens. In a scanner-based approach layout tokens are just skipped by the scanner, leaving only tokens that matter for the parser. A character-level grammar needs to be explicit about where layout can appear. This would result in boilerplate code as illustrated by the following explicit version of the Fun production:

figure al

To avoid such boilerplate, the SDF3 compiler applies a transformation to productions in context-free syntax sections in order to inject optional layout [61]. Figure 5 shows the result of that normalization to a small selection of productions from Fig. 1. Note that in lexical productions (such as for ID-LEX) no layout is injected, since the characters of tokens should not be separated by layout. Note the use of -LEX and -CF suffixes on sorts to distinguish lexical sorts from context-free sorts. (This transformation is currently applied to the entire grammar, which may hinder grammar composition between modules specifying different layout.)
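The layout injection step can be pictured with a small Python sketch. The representation of productions as lists of symbol strings and the exact spelling of the injected symbol (`LAYOUT?-CF`) are assumptions for illustration; the normalized form actually produced by the SDF3 compiler is the one shown in Fig. 5.

```python
def normalize_cf(sort, constructor, symbols):
    """Inject optional layout between the symbols of a context-free production."""
    def rename(symbol):
        # Quoted literals stay as they are; sorts get the -CF suffix.
        return symbol if symbol.startswith('"') else symbol + "-CF"

    body = []
    for index, symbol in enumerate(symbols):
        if index > 0:
            body.append("LAYOUT?-CF")
        body.append(rename(symbol))
    return f"{sort}-CF.{constructor} = " + " ".join(body)


def normalize_lex(sort, symbols):
    """Lexical productions get the -LEX suffix and no injected layout."""
    return f"{sort}-LEX = " + " ".join(symbols)
```

For example, `normalize_cf("Exp", "Add", ["Exp", '"+"', "Exp"])` yields `Exp-CF.Add = Exp-CF LAYOUT?-CF "+" LAYOUT?-CF Exp-CF`, while a lexical production for identifiers is left free of layout.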

Fig. 6. Layout-sensitive disambiguation of dangling-else.

Fig. 7. Layout-sensitive disambiguation of longest match for nested match cases.

Layout Sensitive Syntax. In Sect. 3 we showed how associativity and priority rules can be used to disambiguate an ambiguous grammar. For example, we saw how longest match for Match ensures that a match case is always associated with the nearest match. Similarly, Fig. 1 disambiguates the dangling-else ambiguity between IfT and IfE such that an else branch is always associated with the closest if.

An alternative approach to disambiguation is to take into account the layout of a program. For that purpose, SDF3 supports the use of layout constraints, which pose requirements on the two-dimensional shape of programs [17, 54]. We illustrate layout constraints with layout-sensitive disambiguations of the IfE and Match productions in Figs. 6 and 7, respectively.

The layout constraints in Fig. 6 require that the if and else keywords of the IfE production are aligned. The examples in the figure show how the else branch can be associated with either if by choosing the layout. In addition, the indent constraints require that the conditions and branches of the IfT and IfE constructs appear to the right of the if and else keywords. Figure 7 disambiguates the association of the match cases with a match by requiring that the cases are aligned. Thus, one can obtain the non-longest match (second example) without using parentheses.

Fig. 8. Syntax-aware editor for the fun-query language with syntax highlighting, parse error recovery, error highlighting, and syntactic completion.

Syntax Highlighting. The Spoofax language workbench  [38, 64] generates a syntax-aware editor from a syntax definition. Based on the lexical syntax, it derives syntax highlighting for programs by assigning colors to the tokens in the syntax tree as illustrated in Fig. 8. The default coloring scheme assigns colors to lexical categories such as keywords, identifiers, numbers, and strings. The coloring scheme can be adjusted in a configuration file by associating colors with sorts and constructors.

Language Composition. SDF3 supports a simple module mechanism, which allows large syntax definitions to be divided into a collection of smaller modules and allows libraries of reusable definitions to be defined. For example, the lex module provides a collection of common lexical syntax definitions. A module may extend the definition of syntactic categories of another module. This can be used, for example, to organize the syntax definition for a language as a collection of components (such as variables, functions, booleans, numbers) that each introduce constructs for a common set of syntactic categories (such as types and expressions).

Another application of the module mechanism is to compose the syntax definitions of different languages into a composite language. For example, Fig. 9 defines a tiny query language in module query and its composition with the fun language of Fig. 1. The composition introduces the use of a query as an expression, and a quoted expression as a query identifier. The languages have a different lexical syntax, i.e. the keywords of the fun language are not reserved in the query language, and vice versa. Thus, from can be used as a variable in a fun expression, while it is a keyword in a query (see Fig. 8). Language composition with SDF2/3 has been used for the embedding of domain-specific languages  [12], for the embedding of query and scripting languages  [9], and for the organization of composite languages such as AspectJ  [11] and WebDSL  [27, 62].

A consequence of merging productions for sorts with the same name and injecting layout between the symbols of a production is that the layout of composed languages is unified. It is future work to preserve the layout of composed languages.

Fig. 9. Composition of languages with different lexical syntax.

5 Formatting

Formatting is the process of mapping abstract syntax trees to text. This can be used to improve the layout of a manually written program, or it can be used to turn a generated or transformed abstract syntax tree into a program text. Formatting is preceded by parenthesization to correctly insert parentheses such that parsing the formatted text preserves the tree structure (see Sect. 3).

Template Productions. Formatting comes in two levels. The basic level of formatting, also known as ugly-printing, is concerned with inserting the ‘irrelevant’ notational details that were removed in the translation to abstract syntax. After ugly-printing, parsing the generated text should produce the original abstract syntax tree. This translation can be obtained from a grammar mechanically. For example, the Stratego/XT transformation tool suite featured a ‘pretty-print’ table generator [35] that formalized for each constructor a mapping to these notational details.

The second level of formatting, also known as pretty-printing, is concerned with producing white space to make the generated program text readable. The Box language  [8, 34] provides abstractions for horizontal and vertical composition and horizontal (e.g. indentation) and vertical (line breaks) spacing. This is a useful intermediate representation for formatting, which allows the pretty-printer writer to abstract from an actual pretty-print algorithm. (Libraries for pretty-printing are built on the same principle  [29].) Still, a mapping from abstract syntax trees to Box expressions requires human judgement and cannot be derived mechanically from a grammar. The pretty-print table generator mentioned above featured heuristics for associating Box expressions with language constructs. However, in many cases, it was necessary to edit the table to produce useful results, creating a bidirectional update problem to reflect changes to the grammar. SDF3 solves this problem by means of template productions, originally motivated to support syntactic completion (see below)  [63]. (Template productions are a signature feature of SDF3, as they changed the syntax of productions from defined non-terminal on the right in SDF and SDF2, to defined non-terminal on the left, and the template quotes have a distinct influence on the typography of syntax definitions.)

A regular context-free grammar production (Sect. 2) such as

figure am

combines sorts and literals. Sorts are identifiers referring to other productions and become the sub-terms of an abstract syntax tree node. Literals are quoted strings and are removed in the mapping to abstract syntax, needing to be restored during pretty-printing. Sorts and literals are implicitly separated by layout as discussed in Sect. 4.

Fig. 10. Template production

In a template production the usual quotation is inverted. Consider the template version of the IfE production in Fig. 10. The outer quotes (<if ...>) quote a literal piece of text. The inner quotes (<Exp>) are escapes to sorts. A template not only captures literals and sorts, but also captures a two-dimensional shape. For the purposes of parsing this shape is ignored. That is, whitespace between symbols is turned into optional layout analogous to the transformation discussed in Sect. 4. (For the purpose of layout-sensitive parsing it would be interesting to interpret the layout in a template as layout constraints, but it is not easy to distinguish which layout should be enforced, and which layout is incidental.)

Fig. 11. Separator layout

For the purpose of pretty-printing, the two-dimensional shape is interpreted as horizontal and vertical composition and spacing. That is, newlines are interpreted as vertical space and spaces are interpreted as indentation (with respect to the first non-whitespace character of the template). The template in Fig. 11 shows how the spacing of list elements can be configured with whitespace in the separator.

Templates are translated to a transformation from abstract syntax terms to Box expressions. Thus, after every change to the grammar, the pretty-printer is automatically regenerated and up-to-date, without requiring a bidirectional update process. Plain productions with quoted literals can also be obtained automatically from template productions.
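The idea of going from abstract syntax terms to text via a box representation can be illustrated with a deliberately simplified Python sketch. The three box forms (`H` for horizontal composition, `V` for vertical composition, `I` for indentation) are a reduced stand-in for the Box language [8, 34], and the template shape assumed for IfE (keyword lines with indented branches) is hypothetical rather than the one in Fig. 10.

```python
from dataclasses import dataclass


@dataclass
class H:
    """Horizontal composition; items are separated by a single space."""
    items: list


@dataclass
class V:
    """Vertical composition; each item starts on a new line."""
    items: list


@dataclass
class I:
    """Indentation of a box by a number of spaces."""
    width: int
    body: object


def render(box):
    """Render a box to a list of text lines (boxes are assumed non-empty)."""
    if isinstance(box, str):
        return [box]
    if isinstance(box, I):
        return [" " * box.width + line for line in render(box.body)]
    if isinstance(box, V):
        return [line for item in box.items for line in render(item)]
    lines = render(box.items[0])
    for item in box.items[1:]:
        part = render(item)
        pad = " " * (len(lines[-1]) + 1)
        lines = lines[:-1] + [lines[-1] + " " + part[0]] + [pad + l for l in part[1:]]
    return lines


def to_box(term):
    """Map a term to a box, following a hypothetical template for IfE."""
    if term[0] == "Var":
        return term[1]
    _, condition, then_branch, else_branch = term  # ("IfE", c, t, e)
    return V([
        H(["if", to_box(condition), "then"]),
        I(2, to_box(then_branch)),
        "else",
        I(2, to_box(else_branch)),
    ])
```

Joining `render(to_box(term))` with newlines gives the formatted program; nested IfE terms are indented one level deeper per nesting, since each branch is wrapped in an `I` box.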

The formatters derived from SDF3 templates have some limitations, which are partly due to (the interpretation of) the Box intermediate representation. First, formatting is fairly rigid. It does not take into account the composition and size of expressions, but formats a language construct always in the same manner. Furthermore, it is not customizable with user preferences, as is customary in integrated development environments such as Eclipse. When formatting manually written programs to improve their layout, or when formatting a program after applying some transformation (e.g. a refactoring), it can be important to preserve the layout (comments and/or whitespace) of the original program. De Jonge and Visser [32] developed a layout preserving formatting algorithm with heuristics for moving comment blocks. This algorithm is currently not integrated in the SDF3 tool suite.

Completion. Formatting is also an issue when proposing and inserting syntactic completions in an editor. The first version of Spoofax [38] featured syntactic completion templates instructing the editor what to do on particular triggers, which redundantly specified syntactic patterns. Vollebregt et al. [63] introduced template productions with the goal of automatically generating completion templates and supporting a program completion workflow in the style of structured editors. Amorim et al. [51] generate explicit placeholder syntax for all syntactic sorts in order to explicitly represent incomplete programs. Syntactic completion becomes a matter of generating completion proposals for placeholders based on the productions of the grammar. The resulting editor behaves like a combination of text editor and structure editor as illustrated in Fig. 8.

6 Parsing

Finally, we discuss the parsing strategy of SDF3. Character-level grammars do not fit in restricted grammar classes such as LL or LR grammars; deciding which alternative to take may require an unbounded number of characters of lookahead  [61]. Furthermore, only the full class of context-free grammars is closed under composition  [28], i.e. the composition of two LL or LR grammars is not necessarily an LL or LR grammar. Thus, SDF3 uses a generalized parsing algorithm that can deal with the full class of context-free grammars.

Lazy Parse Table Generation. The SDF3 compiler first transforms a modular syntax definition to a monolithic and normalized syntax definition, which makes layout and deep priority conflicts explicit in the grammar [53, 61]. A static analysis checks whether all used sorts are defined and warns about missing associativity and priority rules. A parser generation algorithm is used to generate a shift/reduce parse table from the normalized grammar. The algorithm is based on SLR parse table generation [28] adapted to deal with shallow priority conflicts [59]. Follow restrictions are implemented by restricting the follow set of non-terminals in the parse table. Follow restrictions that are longer than one character are added as dynamic checks. The resulting table may contain shift/reduce conflicts.

LR parse table generation is a non-local operation that requires the entire grammar, which implies that separate compilation is not possible. If one module of the syntax definition is changed, the parse table for the entire definition needs to be regenerated. This is a disadvantage for scenarios that depend on language extension [12, 16]. Bravenboer and Visser developed a representation and algorithm for parse table composition that realized a form of separate compilation for syntax definitions [13]. However, the algorithm did not support cross-module priority declarations and was not adopted in practice. As a more pragmatic approach, Amorim et al. [52] adopted lazy parse table generation [26], which starts with an empty parse table and only generates those states that are needed at parse time. This ensures fast turnaround times during development of syntax definitions.
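The principle of generating states only on demand can be illustrated with a small sketch. It uses plain LR(0) item sets over a hypothetical four-production grammar, not the SLR-based algorithm with priorities described above, and the names `PRODUCTIONS`, `closure`, and `LazyTable` are assumptions made for the example.

```python
# Hypothetical toy grammar: (left-hand side, right-hand side) pairs.
PRODUCTIONS = (
    ("S", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("x",)),
)
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}


def closure(items):
    """LR(0) closure: add initial items for every nonterminal that follows a dot.

    An item is a pair (production index, dot position).
    """
    result, work = set(items), list(items)
    while work:
        prod, dot = work.pop()
        rhs = PRODUCTIONS[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for i, (lhs, _) in enumerate(PRODUCTIONS):
                if lhs == rhs[dot] and (i, 0) not in result:
                    result.add((i, 0))
                    work.append((i, 0))
    return frozenset(result)


class LazyTable:
    """Computes a target state only when a transition is first requested."""

    def __init__(self):
        self.start = closure({(0, 0)})
        self.transitions = {}  # (state, symbol) -> state or None, filled on demand

    def goto(self, state, symbol):
        key = (state, symbol)
        if key not in self.transitions:
            kernel = {
                (p, d + 1)
                for p, d in state
                if d < len(PRODUCTIONS[p][1]) and PRODUCTIONS[p][1][d] == symbol
            }
            self.transitions[key] = closure(kernel) if kernel else None
        return self.transitions[key]
```

A parser driving such a table only pays for the states that the parsed input actually reaches, so a small test input touches a small part of a large grammar.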

Fig. 12. Sentence and abstract syntax tree with (shared) ambiguities.

Scannerless Generalized LR Parsing with Error Recovery. The shift/reduce parse tables generated from SDF3 definitions are not necessarily deterministic, i.e. they may contain shift/reduce conflicts due to proper ambiguities or unbounded lookahead. To handle both these cases, SDF3 uses a Scannerless Generalized-LR (SGLR) parsing algorithm [60].

The GLR algorithm handles conflicts in the parse table by forking off separate parsers for each alternative of a conflict  [44]. If the parser has encountered a genuine ambiguity, the parallel parsers will eventually end up in the same parse state, and the branches give rise to alternative parse trees. The result of parsing is a parse forest, a compact representation of all possible parse trees. A language engineer using SDF3 can inspect the ambiguities of a grammar by inspecting the (abstract) syntax trees with ambiguities, instead of inspecting shift/reduce conflicts. Figure 12 shows an abstract syntax tree with ambiguities for an expression in the example language using a syntax definition without disambiguation rules.
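The shape of such a result, with ambiguous sub-trees grouped under a shared node, can be illustrated without implementing GLR itself. The sketch below swaps in a memoized enumeration over input spans for the ambiguous grammar E -> E "+" E | identifier; the function name `parse_forest` and the `"amb"` node label are assumptions made for the example.

```python
from functools import lru_cache


def parse_forest(tokens):
    """Return a tree for `tokens`, with an ("amb", alternatives) node wherever
    a span has more than one parse. Returns None if there is no parse.

    This is not GLR: it enumerates split points per span and memoizes each
    span, so equal sub-trees are shared between alternatives.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def exp(i, j):
        alternatives = []
        if j - i == 1 and tokens[i] != "+":
            alternatives.append(("Var", tokens[i]))
        for k in range(i + 1, j - 1):
            if tokens[k] == "+":
                left, right = exp(i, k), exp(k + 1, j)
                if left is not None and right is not None:
                    alternatives.append(("Add", left, right))
        if not alternatives:
            return None
        if len(alternatives) == 1:
            return alternatives[0]
        return ("amb", tuple(alternatives))

    return exp(0, len(tokens))
```

For the tokens `a + b + c` the result is a single `amb` node with a right-nested and a left-nested `Add` alternative, which is the kind of tree a language engineer would inspect when disambiguation rules are missing.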

Another reason for shift/reduce conflicts is the limited lookahead of the parser generator. For example, consider parsing the expression . After reading the identifier b, the parser can reduce to create or it can shift, expecting to eventually parse some sub-expression of Eq, i.e. resulting in a term of the form . This decision can only be made when parsing the + operator. But before the parser sees that operator, it first needs to process the comment. Forking the parser allows delaying the decision. Eventually only one of the parsers will survive and produce a tree without ambiguities.

Fig. 13. Testing longest match disambiguation of the match-with expression.

A GLR parser becomes a scannerless parser by reading characters as tokens and handling lexical disambiguation such that invalid forks are prevented or killed as early as possible  [60]. Follow restrictions are handled by means of a dynamic lookahead check on reductions. Reject productions are implemented by rejecting states that are reached with a reject production. That requires postponing the reduction from rejectable states until it is certain no reject productions will appear.
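A dynamic lookahead check for a follow restriction can be sketched as follows. The example assumes a hypothetical restriction stating that an identifier may not be directly followed by another identifier character; the names `FOLLOW_RESTRICTIONS`, `may_reduce`, and `identifier_ends` are illustrative and not taken from the SGLR implementation.

```python
import string

# Hypothetical follow restriction: an ID may not be directly followed by a
# letter or digit, which enforces longest match for identifiers.
ID_CHARS = set(string.ascii_letters + string.digits)
FOLLOW_RESTRICTIONS = {"ID": ID_CHARS}


def may_reduce(sort, text, end):
    """Allow a reduction to `sort` ending at offset `end` only if the next
    character is not in the restricted set for that sort."""
    restricted = FOLLOW_RESTRICTIONS.get(sort, set())
    return end >= len(text) or text[end] not in restricted


def identifier_ends(text, start):
    """Offsets at which a character-level parser may reduce an ID that
    begins at `start`, after applying the lookahead check."""
    ends = []
    end = start
    while end < len(text) and text[end] in ID_CHARS:
        end += 1
        if may_reduce("ID", text, end):
            ends.append(end)
    return ends
```

Without the check, a character-level parser reading `abc+d` could reduce an identifier after `a`, `ab`, or `abc` and would fork for each; with the check only the reduction after `abc` survives.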

The SGLR algorithm is extended to support parse error recovery and produce a parse tree even if a program text contains syntactic errors [30, 33, 36]. This is important in interactive settings such as editors in an integrated development environment, in order to enable editor services such as syntax highlighting and type analysis for programs with errors, as arise during program development. Error recovery is realized by an extension of SDF3 with recovery productions, which are only used in case normal parsing fails. There are two main categories of recovery rules. Inspired by island grammars [31], so-called water productions turn normal tokens into layout, which allows skipping some tokens when they cannot be parsed otherwise. The second category consists of productions that allow the insertion of missing literals (or complete sorts). The SDF3 normalizer automatically generates a permissive grammar with recovery rules, but such rules can also be added manually. Error recovery is the basis for reporting syntax errors. Improving the localization and explanation of error messages is a topic for future work.
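The effect of skipping unparsable tokens as water can be sketched on a token-level toy language. This is a greedy illustration under stated assumptions, not the SGLR recovery algorithm: the statement form NAME "=" NUMBER ";" and the functions `parse_statement` and `parse_with_recovery` are made up for the example.

```python
def parse_statement(tokens, pos):
    """Parse  NAME "=" NUMBER ";"  starting at `pos`.

    Returns (tree, next_pos) on success and None on failure.
    """
    if pos + 4 <= len(tokens):
        name, eq, num, semi = tokens[pos:pos + 4]
        if name.isidentifier() and eq == "=" and num.isdigit() and semi == ";":
            return ("Assign", name, int(num)), pos + 4
    return None


def parse_with_recovery(tokens):
    """Parse a statement list; a token is treated as water (skipped) only
    when no statement can be parsed at its position."""
    trees, water, pos = [], [], 0
    while pos < len(tokens):
        result = parse_statement(tokens, pos)
        if result is None:
            water.append((pos, tokens[pos]))  # remember what was skipped
            pos += 1
        else:
            tree, pos = result
            trees.append(tree)
    return trees, water
```

The list of skipped tokens gives the positions at which syntax errors can be reported, while the recovered trees remain available to editor services.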

An extension of SGLR to support incremental parsing based on the work of Wagner et al. [65] is under development  [49].

Testing. Testing SDF3 syntax definitions is supported by the Spoofax Testing (SPT) language, a domain-specific language for testing various aspects of language definitions, including parsing  [37]. An SPT test quotes a language fragment and specifies a test expectation. For testing syntax, the expectations are parse succeeds, parse fails, and parse to a specific term structure. Figure 13 illustrates the testing of disambiguation in SPT by specifying the disambiguated expression as parse result.

7 Related Work

We have referred to previous and related work throughout this paper. The papers that we cited about particular aspects of SDF3 provide extensive discussions of related technical work, which is beyond the scope of this paper. Here we provide a couple of high-level pointers to related efforts.

The design and implementation of SDF3 is motivated by its use in the Spoofax language workbench [38, 64]. Erdweg et al.  [18, 19] give an overview of general concerns of the design and implementation of language workbenches.

SDF3 is bootstrapped, i.e. the syntax of SDF3 has been defined in SDF3. Other significant applications of SDF3 are the NaBL2  [2] and Statix  [3] languages for type system specification, the IceDust language for data modeling  [20,21,22], and the FlowSpec language for data-flow analysis specification  [50]. Many languages originally developed with SDF2 are being ported to SDF3, including the Stratego transformation language  [10].

Several syntax definition languages share aims with SDF3, in particular regarding the support for language composition. The syntax definition sub-language of the RASCAL meta-programming language [41] has a common root in SDF2. RASCAL has adopted Generalized LL (GLL) parsing [48] instead of GLR parsing. The syntax definition language of the Silver [66] attribute grammar system takes a different approach to language composition. Instead of relying on scannerless generalized parsing, it relies on context-aware scanners and restrictions on grammars in order to guarantee absence of ambiguities in composed grammars [46]. Based on these restrictions it can support parse table composition for language composition [47]. The Eco editor [15] supports language composition using language boxes, where the editor keeps track of transitions between languages, avoiding the composition of grammars.

8 Conclusion

In this paper we have presented SDF3, a mature language for the definition of syntax. The design and implementation of SDF3 are based on many years of research and engineering, fed by the experience of numerous researchers, developers, and students. The multi-purpose interpretation of SDF3 specifications allows quick prototyping of language designs and enables testing these designs in a full-fledged environment with a syntax-aware editor.