The Synthesizer Generator:
Specifying an Editing Environment in EBNF
This is a preliminary
description of a tool that has not yet been released.
Table of Contents
Introduction
Extended BNF (EBNF) is a well-known notation for specifying the syntax
of languages. The EBNF Editor is a tool for creating language-based
systems from EBNF specifications.
The EBNF Editor generates a language specification in SSL, the
meta-language of the Synthesizer Generator. If the editing environment
generated from the generated SSL specification is adequate, you are
done. Otherwise, you may choose to refine the SSL directly. SSL
provide fexibility at the expense of succinctness. In contrast, EBNF
provides a succinct notation for specifying a language-based editing
environment with default behavior.
An Example
Here is an EBNF specification for a typical toy programming
language:
program => +;
statement =>
id := exp |
WHILE exp DO +; OD |
IF exp THEN +; ELSE +; FI
exp =>
exp + exp |
exp - exp |
exp * exp |
exp / exp |
( exp ) |
{ [a-zA-Z][a-zA-Z0-9_]* } |
{ [0-9]+ }
id => { [a-zA-Z][a-zA-Z0-9_]* }
|
This specification consists of four rules that can be paraphrased
as follows:
- A program consists of one or more statement's
separated by semicolons.
- A statement has one of three alternative forms:
assignment statement, WHILE-statement, or IF-statement.
- An exp has one of seven forms: sum, difference, product,
quotiant, parenthesized expression, identifier, or number.
- An id is begins with an upper or lower case letter, and
is followed by zero or more letters, digits, or underscores.
The EBNF Specification Language
The example illustrates the following general rules for EBNF
specifications:
- Identifers consisting of only lower-case letters are
nonterminal symbols. (In general, nonterminal symbols can also
contain dashes and underscores, e.g., hi-ho and
if_then_else are valid nonterminals. Dashes and underscores
are indistinguishable. Thus, hi-ho and hi_ho are
the same nonterminal.)
- Each rule consists of a left and a right hand side separated by
an arrow (=>). The left hand side is a nonterminal. The
right hand side is a list of one or more alternatives separated
by bars (|).
- Each alternative consists of one or more pieces. There
are five kinds of pieces:
- Nonterminals, e.g., id and exp in
id := exp, the alternative for assignment statements.
- Keywords, e.g., WHILE and IF. A
keyword is designated by an identifier consisting of only upper-case
letters.
- Tokens, e.g., := and +. A token is a
sequence of one or more characters. Because certain characters and
character combinations are reserved as EBNF meta-symbols, an
escape mechanism using double quotes (") is provided. Thus,
for example, the token consisting of the two characters => is
designated by "=>".
- Lexemes, e.g., { [a-zA-Z][a-zA-Z0-9_]* }. A
lexeme consists of a regular expression enclosed in curly braces. The
regular expression must begin and end with at least one blank, and may
constain no nested unescaped blanks.
- Embedded rules, e.g., <statement>+;. Much of the
power of EBNF comes from the embedded rules. There are three
sorts of embedded rules:
- Iterations, e.g., <statement>+;. There are
four kinds of iterations:
- One or more separated items, e.g., <statement>+;
designates one or more statement separated by semicolons.
- Zero or more separated items, e.g., <exp>*,
designates zero or more exp separated by commas.
- One or more unseparated items, e.g., <name>+
designates one or more name.
- Zero or more unseparated items, e.g.,
<elsif-part>* designates zero or more elsif-part.
The examples so far may have misled you to believe that angle brackets
are part of the designation of a nonterminal, but they are not. An
arbitrary list of one or more alternatives separated by bars may
appear between angle brackets. Thus, for example, each of the
following is a valid iteration:
- <RED | WHITE | BLUE>+,
- <declaration | statement>*
- <label :>*
- <ELSEIF <stmt>+;>*
Any token or keyword can be the separator of an iteration. For
example, <foo>+BAR specifies a list of one or more
exp separated by occurrences of the keyword BAR.
- Optionals, e.g., [declaration-part]. As with
iterations, a list of one or more alternatives may appear between the
square brackets. For example, [LONG | SHORT] designates an
optional occurrence of either the keyword LONG or the keyword
SHORT.
- Nested alternatives, e.g., <+|-|*|/>. Exactly
one of the listed alternatives must occur.
Additional details about EBNF notation are described later.
The Specified Language
The language defined by an EBNF specification consists of a
sequence of text fragments separated by whitespace. By
default, whitespace consists of spaces, newlines, and tabs. By
default, the defined language is case-insensitive. Thus, regardless of
the case of characters in the EBNF specification, either case
is acceptable in the language.
The Generated Environment
The Synthesizer Generator has distinct meta-linguistic mechanisms for
specifying different aspects of a language-based system:
- abstract syntax, i.e., the structure used to represent
hierarchical objects,
- concrete syntax, i.e., scanning and parsing rules that
convert strings of text into abstract syntax trees,
- unparsing rules, i.e., display rules that convert
abstract syntax trees into formatted text,
- templates, i.e., rules for inserting predefined
subtrees,
- placeholders, i.e., labelled slots that characterize the
form of phrase required in the given field.
- transforms, i.e., rules for performing predefined
source-to-source translations,
- attribute equations, i.e., rules defining derived
information about abstract syntax trees.
These distinct mechanisms provide fexibility at the expense of
succinctness. In contrast, EBNF provides a single succinct notation
for specifying a language. As a result, in each category, a default
choice is made. As a result, the concrete syntax may result in
LALR(1) parsing conflicts where a hand taylored grammar would not, the
unparsing rules may result in a less pleasing layout than hand written
would produce, templates have default names, etc. There is no support
at the moment for attribute equations.