Sunday, April 29, 2012

Adding state support to BNFC

The BNF Converter is a pretty cool tool.  The idea is simple:  provide a grammar written in Labelled BNF, and it spits out a lexer, parser, pretty-printer, and abstract syntax tree.  The very nice part is that the generated code can target several backends: Alex/Happy/Haskell, flex/bison/C/C++, or JLex/CUP/Java.  It's especially convenient when you're developing a language, as you can make changes as you go and automatically keep your AST, parser, and printers in sync.

BNFC takes as input a file written in Labelled BNF, which is essentially a BNF grammar with a label attached to each production rule.  Brackets "[ ]" indicate lists rather than optional items, and aside from pragmas to define lexer tokens and positions, there are few extensions to minimal BNF.
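As a small illustration (the rule names here are invented for the example, but the shape is standard LBNF: label, nonterminal, "::=", right-hand side), a fragment might look like:

```
-- Each rule's label becomes a constructor in the generated AST.
EAdd.   Exp  ::= Exp "+" Exp1 ;
EInt.   Exp1 ::= Integer ;
coercions Exp 1 ;
-- Brackets denote lists: [Stm] is a list of statements.
SBlock. Stm  ::= "{" [Stm] "}" ;
terminator Stm ";" ;
```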

Writing one grammar file instead of four separate modules is very attractive.  Unfortunately, BNFC has a serious limitation: only context-free grammars are supported.

LBNF is sufficient to represent context-free grammars, and within this domain the system works very well.  But despite the claims of the LBNF Report, many interesting languages are not context-free, and they still need to be lexed and parsed.  The most notable example is probably C, in which a string can be lexed as an identifier or a type name depending on the typedefs currently in scope.  I present here a small set of extensions that pass state between the lexing and parsing stages, enabling a wider variety of parsers to be constructed from LBNF.

A darcs repository of BNFC, with state support as described here, is available from http://www.tiresiaspress.us/haskell/bnfc/.  These features are only supported in the Haskell/Alex backend (Alex-3 is required).

An initial examination

The approach I took is founded on two simple ideas: first, the lexer needs access to a state table to determine the correct token, and second, the parser needs to update the state table in response to certain rules.

Lexing rules

In LBNF, tokenizing rules are specified like this:
token Ident (alpha [alpha|digit|'_']*) ;
where "Ident" is the type of the token produced, and the remainder is a regular expression to determine valid "Ident" strings.

A stateful token rule is similar.  The implementation supports multiple state sets, so the rule must specify which set the tokenizer should check.
state token TypedefName typedefNameSet [alpha|digit|'_']+ ;
First the lexer matches the regular expression against the input string.  It then checks whether the matched string is present in "typedefNameSet".  If so, a token of type "TypedefName" is produced; otherwise an "Ident" is produced (1).
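The token-selection step can be sketched as a small Haskell function.  This is a minimal illustration, assuming the regular expression has already matched; the names (Token, classify) are mine, not the generated lexer's actual code:

```haskell
import qualified Data.Set as Set

data Token = TIdent String | TTypedefName String
  deriving (Eq, Show)

-- Look the matched string up in the state set: a hit yields a
-- TypedefName, a miss falls back to an ordinary Ident.
classify :: Set.Set String -> String -> Token
classify typedefNames s
  | s `Set.member` typedefNames = TTypedefName s
  | otherwise                   = TIdent s
```

For example, with the set containing "size_t", the string "size_t" classifies as a TypedefName while "foo" remains an Ident.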

Parsing Rules

Standard production rules are extended with pragmas that specify state update annotations:
-- after this rule, the final Ident will be lexed as a TypedefName
Etypedef.  Declaration ::= "typedef" TypeName Ident <% state typedefNameSet add 3 %> ;
The number "3" specifies that a token string derived from the 3rd production (Ident) is to be added to the set "typedefNameSet".(2)

Similarly, names can be removed from a state set:
-- after this rule, the second token will be lexed as an Ident
UnTydeclarator. Direct_declarator ::= TypeName TypedefName <% state remove 2 %> ;
When removing a token from a set, it's unnecessary to specify the set the token is being removed from.
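The two parser-side updates can be sketched as pure functions over a table of named sets.  The names and types here are illustrative, not BNFC's generated code, and I assume that "add" targets one named set while "remove" (which names no set) deletes the token from every set:

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- The parser threads a table mapping state-set names to their members.
type StateTable = Map.Map String (Set.Set String)

-- <% state typedefNameSet add 3 %>: insert the matched token string
-- into the named set, creating the set if it doesn't exist yet.
stateAdd :: String -> String -> StateTable -> StateTable
stateAdd setName tok = Map.insertWith Set.union setName (Set.singleton tok)

-- <% state remove 2 %>: no set is named, so drop the token everywhere.
stateRemove :: String -> StateTable -> StateTable
stateRemove tok = Map.map (Set.delete tok)
```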

Extended support

Initial Values

Initial members of the state sets can be specified in the grammar via the pragma TODO.

Scoping

The implementation maintains a stack of state values in order to support lexical scoping of changes.  Two additional pragmas, "state push" and "state pop", provide access to stack manipulations.  A common usage would be:
PushParen.  PushParen ::= "(" <% state push %> ;
PopParen.   PopParen  ::= ")" <% state pop %> ;
It is an error to pop the last frame off the state stack.
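The stack discipline can be sketched as follows.  For brevity a frame here holds a single state set; the real implementation keeps one stack shared by all sets, so a frame would snapshot every set at once:

```haskell
import qualified Data.Set as Set

type Frame = Set.Set String
type Stack = [Frame]  -- invariant: never empty

-- "state push": duplicate the top frame so inner changes stay local.
statePush :: Stack -> Stack
statePush (top : rest) = top : top : rest
statePush []           = error "statePush: empty stack"

-- "state pop": discard the innermost frame, restoring the outer scope.
-- Popping the last frame is an error, as noted above.
statePop :: Stack -> Stack
statePop (_ : rest@(_ : _)) = rest
statePop _                  = error "statePop: cannot pop the last frame"
```

A push followed by a pop restores the original stack, so any additions made inside the parentheses are forgotten.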

The implementation only supports one stack for all state sets.  It is not currently possible to represent multiple state sets with differing scoping rules.

A drawback of the current implementation is that production rules such as PushParen and PopParen are no longer terminals.  The implementation therefore creates abstract representations for them, although they convey no more meaning than if they had been omitted entirely.  I hope to address this in the future.

Future Work

Deriving the state string

A major question is how to determine which token should be added to the state set for a given production.  The above cases are simple.  In a production rule like:
Etypedef.  Declaration ::= "typedef" TypeName Ident <% state typedefNameSet add 3 %> ;
"Ident" is a token, so it's simple to derive the string that should be added to the set.  However, it's frequently desirable to modify state based upon a non-terminal rule, such as:
Rule.  SomeProd ::= Declaration <% state stateset add 1 %> ;
How should the token be determined from the Declaration?

The current implementation only supports state adjustments based upon tokens and lists of tokens, e.g. [Ident].  Code generated from rules like the above "Rule" currently results in a compile-time error due to missing instances.  The instances can be manually provided in a separate file, although creating this file can be tedious.  I plan to add an additional pragma to the LBNF grammar that would allow for the necessary instances to be automatically generated.
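The hand-written instances might look something like this.  The class name (HasStateToken) and the simplified AST types are hypothetical, standing in for whatever class and types BNFC actually generates:

```haskell
newtype Ident    = Ident String    deriving (Eq, Show)
newtype TypeName = TypeName String deriving (Eq, Show)

data Declaration = Etypedef TypeName Ident
  deriving (Eq, Show)

-- Extract the string that a state pragma should record.
class HasStateToken a where
  stateToken :: a -> String

-- Tokens are the easy base case.
instance HasStateToken Ident where
  stateToken (Ident s) = s

-- For a non-terminal like Declaration, the user must say which token
-- inside it carries the name; this is the instance one currently has
-- to write by hand in a separate file.
instance HasStateToken Declaration where
  stateToken (Etypedef _ ident) = stateToken ident
```

A generated pragma could then produce these instances mechanically instead of requiring the tedious manual file.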

Notes

(1) - The astute reader will notice that the TypedefName regex matches strictly more strings than the built-in "Ident" regex.  This match therefore provides a "backdoor" into Ident: a string that fails the set lookup is produced as an "Ident" even if it doesn't match the "Ident" regex.  I find this behavior useful, and in any case it can be avoided simply by reusing the Ident regex in state tokenizing rules.  Another possibility would be to fall back to a user-provided alternative when the match fails.

(2) - An alternative notation would omit the numbers and place the pragma adjacent to the production.