Tokenizers

By default, nearley splits the input into a stream of characters. This is called scannerless parsing.

A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
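
For instance, a toy tokenizer for this kind of input can be a single regular expression (a throwaway sketch to illustrate the idea, not anything nearley does internally):

const tokens = "512 + 10".match(/[0-9]+|\+/g);
console.log(tokens); // [ "512", "+", "10" ] — whitespace is skipped, digits are grouped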

Using a tokenizer has many benefits. It…

- …often makes your parser much faster, since it handles far fewer, larger units.
- …allows you to write cleaner, less ambiguous grammars, because details like whitespace and keywords are dealt with before parsing.
- …gives you lexical information such as line numbers for each token, which makes for better error messages.

Lexing with Moo

The @lexer directive instructs nearley to use a lexer you’ve defined inside a JavaScript block in your grammar.

nearley supports and recommends Moo, a super-fast lexer. Construct a lexer using moo.compile.
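
For example, here is a three-rule lexer and the token stream it produces (a small sketch; the rule names are arbitrary):

const moo = require("moo");

const lexer = moo.compile({
  ws:     /[ \t]+/,
  number: /[0-9]+/,
  plus:   "+"
});

lexer.reset("512 + 10");
for (const token of lexer) {
  console.log(token.type, JSON.stringify(token.value));
}
// number "512"
// ws " "
// plus "+"
// ws " "
// number "10"

Note that moo does not discard whitespace for you: it shows up as ws tokens, and your grammar decides what to do with them (the grammar below matches them explicitly with %ws).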

When using a lexer, there are two ways to match tokens: use %type to match any token of that type (for example, %number), or use a literal string like "sin" to match a token whose text is exactly that string.

Here is an example of a simple grammar:

@{%
const moo = require("moo");

const lexer = moo.compile({
  ws:     /[ \t]+/,
  number: /[0-9]+/,
  // moo.keywords re-types a matched word: the word "x" becomes a times token
  word:   { match: /[a-z]+/, type: moo.keywords({ times: "x" }) },
  times:  /\*/
});
%}

# Pass your lexer object using the @lexer option:
@lexer lexer

expr -> multiplication {% id %} | trig {% id %}

# Use %token to match any token of that type instead of "token":
multiplication -> %number %ws %times %ws %number {% ([first, , , , second]) => first * second %}

# Literal strings now match tokens with that text:
trig -> "sin" %ws %number {% ([, , x]) => Math.sin(x) %}

Have a look at the Moo documentation to learn more about writing a tokenizer.

You use the parser as usual: call parser.feed(data), and nearley returns the parsed results.
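
For example, assuming the grammar above has been compiled with nearleyc (the filename grammar.js below is illustrative):

const nearley = require("nearley");
const grammar = require("./grammar.js"); // output of `nearleyc grammar.ne -o grammar.js`

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
parser.feed("4 x 12");
console.log(parser.results); // [ 48 ]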

Custom lexers

nearley recommends using a moo-based lexer. However, you can use any lexer that conforms to the following interface:

- next() returns a token object, which may include fields such as line numbers. Crucially, a token object must have a value attribute.
- save() returns an info object that describes the current state of the lexer. nearley places no restrictions on this object.
- reset(chunk, info) sets the internal buffer of the lexer to chunk, and restores its state to a state returned by save().
- formatError(token) returns a string with an error message describing a parse error at that token (for example, the string might contain the line and column where the error was found).
- has(tokenType) returns true if the lexer can emit tokens with that name. This is used to resolve %-specifiers in compiled grammars.
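
For illustration, here is a minimal sketch of a hand-written lexer satisfying that interface. It emits a word token for every whitespace-separated chunk; the name wordLexer and the single token type are invented for this example:

const wordLexer = {
  buffer: "",
  index: 0,
  reset(chunk, info) {
    // nearley calls reset() with each chunk passed to parser.feed()
    this.buffer = chunk || "";
    this.index = 0;
    return this;
  },
  next() {
    // Skip whitespace between tokens
    while (this.index < this.buffer.length && /\s/.test(this.buffer[this.index])) {
      this.index++;
    }
    if (this.index === this.buffer.length) return undefined; // end of input
    const start = this.index;
    while (this.index < this.buffer.length && !/\s/.test(this.buffer[this.index])) {
      this.index++;
    }
    // A token must at least have a value; the type lets %word match it
    return { type: "word", value: this.buffer.slice(start, this.index) };
  },
  save() {
    return {}; // this toy lexer keeps no state between chunks
  },
  formatError(token) {
    return "Unexpected " + JSON.stringify(token && token.value);
  },
  has(tokenType) {
    return tokenType === "word";
  }
};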

Note: if you need a lexer that supports indentation-aware grammars (as in Python), you can still use moo. See this example or the moo-indentation-lexer module.

Custom token matchers

Aside from the lexer infrastructure, nearley provides a lightweight way to parse arbitrary streams.

Custom matchers can be defined in two ways: literal tokens and testable tokens. A literal token matches a JS value exactly (with ===), while a testable token runs a predicate that tests whether or not the value matches.

Note that in this case, you would feed a Parser instance an array of objects rather than a string! Here is a simple example:

@{%
const tokenPrint = { literal: "print" };
const tokenNumber = { test: x => Number.isInteger(x) };
%}

main -> %tokenPrint %tokenNumber ";;"

# parser.feed(["print", 12, ";", ";"]);
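
Concretely, usage might look like this (again assuming a compiled grammar.js; the filename is illustrative):

const nearley = require("nearley");
const grammar = require("./grammar.js");

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
parser.feed(["print", 12, ";", ";"]); // an array of arbitrary JS values, not a string
console.log(parser.results);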