Parsing

Once the document has been successfully turned into blocks, each of those can be individually parsed, as the Story deems fit.

Each individual parser is designed to read one input line and determine what, if anything, the content might mean. There are three possible outcomes: a valid parse tree for an interpretation; a set of apparent errors relative to a plausible interpretation; or "I cannot understand this".

It's important to understand that each parser is just one small piece of the overall puzzle and that all the pieces need to work together to provide a complete solution. In particular, the parsers do not need to worry about other lines or scopes (the Blocker's job); semantics or types (which come later in the game); resolving names (dealt with later during scoping); any syntax they don't understand (another parser can deal with that); or how the blocks fit together (which is the Stories' job).

The one common thread across all parsers is the interface. Each parser is presented with a Tokenizable object - essentially a character buffer representing the current continued line after concatenation (it actually does more than that, including mapping characters back to their original input lines) - and returns an Object which is one of the three possible outcomes: null indicates that the parser was unable to interpret the input; an ErrorResult object indicates that there were syntax errors in a plausible interpretation; and anything else is considered a valid syntax tree.
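As a concrete illustration, the contract might be sketched like this in Python; the ErrorResult and try_parsing names stand in for the real types, and the toy package rule is invented purely for the example:

```python
# Hypothetical sketch of the parser contract described above; the real
# Tokenizable and ErrorResult types are considerably richer than these stand-ins.

class ErrorResult:
    """Syntax errors found in an otherwise plausible interpretation."""
    def __init__(self, errors):
        self.errors = errors

def try_parsing(line):
    """Return None ("I cannot understand this"), an ErrorResult,
    or anything else as a valid parse tree."""
    tokens = line.split()
    if not tokens or tokens[0] != "package":
        return None                      # not our kind of line at all
    if len(tokens) != 2:
        return ErrorResult(["package declaration takes exactly one name"])
    return ("package", tokens[1])        # a (trivial) valid parse tree
```

A caller can then dispatch on the result: null-equivalent means "try another parser", an ErrorResult means "report these errors", and anything else is a tree to hand on.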

Internally, the parser obtains tokens from the input stream and uses those to attempt its interpretation. In the same fashion as the other components, there are many tokenizers, each intended for a specific task: for instance, there is a tokenizer that recognizes potential keywords (which must be lower case, starting with a letter, with numbers allowed afterwards); there is one that recognizes tokens that would be valid in pattern matching; and there is one that recognizes valid expression tokens.
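A minimal sketch of one such special-purpose tokenizer, assuming the keyword rule above (lower case, starting with a letter, numbers allowed afterwards); the function name and buffer/position interface are invented for illustration:

```python
import re

# Hedged sketch: one of many special-purpose tokenizers. This one only
# accepts potential keywords: lower case, a letter first, then letters
# or digits.
KEYWORD = re.compile(r"[a-z][a-z0-9]*")

def next_keyword(buffer, pos):
    """Return (token, new_pos) if a keyword token starts at pos, else None."""
    m = KEYWORD.match(buffer, pos)
    if m is None:
        return None
    return m.group(0), m.end()
```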

There are a number of reasons for this, but the main one is again the desire to have multiple different "styles" of languages within the same environment, each of which has its own rules of syntax and tokenization, and therefore needs its own rules as to how to group characters into tokens. As a simple example, a package name (in the package declaration) wants to take a name in which dots are valid characters but may not be surrounded by spaces; in an expression, dot is a field or method operator which certainly may be surrounded by spaces - or even start a continuation line.
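To make the contrast concrete, here is a sketch of the two different treatments of the dot character; the function names and regular expressions are assumptions for illustration, not the compiler's actual rules:

```python
import re

# Illustrative only: two tokenization rules for the same character.

PACKAGE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*(\.[A-Za-z][A-Za-z0-9]*)*")

def tokenize_package(buffer):
    """A package name swallows dots, with no surrounding spaces allowed."""
    m = PACKAGE_NAME.match(buffer)
    return m.group(0) if m else None

def tokenize_expression(buffer):
    """In an expression, '.' is its own operator token and spaces are fine."""
    return re.findall(r"[A-Za-z][A-Za-z0-9]*|\.", buffer)
```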

Internally, of course, each parser is entitled to do anything it wants to obtain the results it needs (including not really taking advantage of the Tokenizable), but given the relatively constrained size and significance of the input, a fairly simple parsing scheme such as LL(1) should be appropriate. Having said that, the most complex of the parsers I've written so far - the one that processes the expressions on the right hand side of function definitions - uses a simple tokenization scheme to read a list of tokens and then reprocesses those in two steps: first to group all the direct function calls together, and then to apply an operator precedence algorithm. It's worth noting that this again takes advantage of the knowledge of the limited size of the input to read all of it first and then reprocess it.

Parsers can call each other if desired. For instance, the function declaration parser calls the expression parser to parse the right hand side, together with the pattern parser (multiple times) to process all the arguments.
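That composition might look something like this sketch, with trivial stand-ins for the pattern and expression parsers (all names invented for illustration):

```python
# Hypothetical composition: a function-declaration parser delegating to a
# pattern parser (once per argument) and an expression parser for the body.
def parse_pattern(token):
    return ("var", token)            # stand-in pattern parser

def parse_expression(tokens):
    return ("expr", tokens)          # stand-in expression parser

def parse_function(line):
    lhs, rhs = line.split("=", 1)
    name, *args = lhs.split()
    patterns = [parse_pattern(a) for a in args]   # pattern parser, many times
    body = parse_expression(rhs.split())          # expression parser, once
    return ("function", name, patterns, body)
```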

Parsers are not bound to return a specific type of parse tree on success. For example, the function declaration parser can return just an "intro" (the left hand side of the declaration), in which case the nested block will be expected to contain one or more right hand sides; or it can return a complete definition, in which case any nested block would consist of scoped-in definitions.
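A sketch of the two success shapes, with invented names and a deliberately naive test for which case applies:

```python
# Illustrative only: the same parser returning two different tree shapes.
def parse_declaration(line):
    if "=" in line:
        lhs, rhs = line.split("=", 1)
        # complete definition: any nested block holds scoped-in definitions
        return ("definition", lhs.split(), rhs.split())
    # just an "intro": the nested block must supply the right hand side(s)
    return ("intro", line.split())
```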