Glossary

Parsing terminology gets very confusing very quickly. Here is a glossary that can help us all communicate more effectively.

token: the smallest meaningful unit (the “words”) of your language

lexer: a function that converts a sequence of characters to a sequence of tokens (converts a sequence of letters and spaces to a sequence of words)

string: a sequence of tokens

concatenation: joining two strings by appending one to the other

language: a set of strings

production rule: a set of strings specified as a sequence of symbols, such that the rule matches a string if it is a concatenation of strings matched by the respective symbols

symbol: a generic term for a member of a production rule, either a nonterminal or a terminal

nonterminal: a symbol that specifies a set of other production rules that match

terminal: a symbol that directly specifies a set of tokens that match

context-free grammar: a set of production rules that together specify a language

context-free language: a language that can be specified by a context-free grammar

Backus-Naur Form: a set of conventions for describing and typesetting context-free grammars

recognizer: a function that takes a grammar and a string and sees if the grammar matches the string (it just says “yes” or “no”)

parser: a function that takes a grammar and a string and returns a derivation of that string from that grammar

derivation: a way to apply the production rules of a grammar recursively to obtain a string

parse tree: a representation of a derivation as a tree

abstract syntax tree (AST): a version of a parse tree that has undergone some postprocessors to simplify it (for example, an AST is likely to omit whitespace)

preprocessor: the dialect of JavaScript targeted by nearleyc; for example, you can emit TypeScript code instead of plain JavaScript (not to be confused with postprocessor)

postprocessor: a function associated with a production rule, whose purpose is to transform a parse tree into an AST, making simplifications along the way — for example, by omitting whitespace (not to be confused with preprocessor)

ambiguity: a situation where there exist more than one derivations for a single string — unlike most parser libraries, nearley returns all possible derivations

Earley algorithm: a parsing algorithm developed by Jay Earley in 1968, which can efficiently parse all context-free grammars

Earley table: the intermediate data structure created by the Earley algorithm, where each new token creates a new column

Leo optimization: a trick we can use to optimize right-recursion, making it just as efficient to parse as left-recursion

left-recursion: a situation where a nonterminal refers to a production rule whose first symbol matches the same nonterminal

epsilon: the empty production rule, matching only the empty string

nullable rule: a production rule that matches the empty string, even though it is not necessarily equal to the epsilon rule (for example, the concatenation of epsilon with epsilon)

nearley: a parser that parses context-free languages, along with several additional utilities for building languages