Wednesday, June 5, 2013

Tokenize This

As a man of my word, I have reinvented the wheel once again. I wrote a simple lexer in JavaScript.

A lexer (or scanner) is a tool in programming language design that is responsible for breaking an input string into individual tokens. These tokens can then be handed to a parser for further processing. For example, consider the following statement in a hypothetical language:

double := 2 * x   //double the number

Here, our lexer might give back the following tokens:

ID (double), ASSIGN, NUMBER (2), TIMES, ID (x)

Note that the lexer discarded the whitespace and the comment; everything else in the input was matched to one of the predefined token types.

Most lexers use regular expressions to specify the matching rules: each rule pairs a regular expression with the token to emit when that expression matches. Here is how our example looks in Token.JS:

var lexer = new TokenJS.Lexer(
  'double := 2 * x   //double the number', {
  root: [
    [/\/\/.*/, TokenJS.Ignore],  //ignore comments
    [/\s+/, TokenJS.Ignore],  //ignore whitespace
    [/[a-zA-Z]+/, 'ID'],   //identifiers
    [/[0-9]+/, 'NUMBER'],  //integer literals
    [/:=/, 'ASSIGN'],      //assignment operator
    [/\*/, 'TIMES']        //multiplication
  ]
});

var tokens = lexer.tokenize();
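
Token.JS handles the mechanics for you, but the core loop of a regex-driven lexer is small enough to sketch. The version below is a plain-JavaScript illustration (not Token.JS's actual implementation): it anchors each rule at the current position, takes the first rule that matches, and either emits a token or discards the text.

function tokenize(input, rules) {
  var tokens = [];
  var pos = 0;
  while (pos < input.length) {
    var matched = false;
    for (var i = 0; i < rules.length; i++) {
      //anchor the rule's regex at the current position
      var anchored = new RegExp('^(?:' + rules[i][0].source + ')');
      var result = anchored.exec(input.slice(pos));
      if (result && result[0].length > 0) {
        var token = rules[i][1];
        if (token !== null) {  //null marks rules whose matches are discarded
          tokens.push({ token: token, text: result[0] });
        }
        pos += result[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error("Unexpected character '" + input.charAt(pos) + "' at position " + pos);
    }
  }
  return tokens;
}

var tokens = tokenize('double := 2 * x   //double the number', [
  [/\/\/.*/, null],     //comments
  [/\s+/, null],        //whitespace
  [/[a-zA-Z]+/, 'ID'],
  [/[0-9]+/, 'NUMBER'],
  [/:=/, 'ASSIGN'],
  [/\*/, 'TIMES']
]);
//tokens[0] is { token: 'ID', text: 'double' }, and so on

Because the first matching rule wins here, order matters: more specific rules (like a reserved word) should come before more general ones (like an identifier).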

Even if you're not writing a compiler, lexing has many other uses. For starters, imagine you wanted to write a syntax highlighter for a language. This would be trivial with a lexer. You would simply specify a regular expression for identifiers, literals, reserved words, etc. in the given language. The 'token' in each case could be a color to highlight the match with.
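
Sticking with the rule format from the example above, a toy highlighter for a JavaScript-like language might use color names as its tokens (a sketch: the sample input, rule set, and colors are illustrative, not a complete grammar):

var highlighter = new TokenJS.Lexer(
  'var x = 42;   //the answer', {
  root: [
    [/\/\/.*/, 'green'],                      //comments
    [/\b(var|function|return)\b/, 'purple'],  //reserved words, before identifiers
    [/[0-9]+/, 'blue'],                       //number literals
    [/[a-zA-Z_$][\w$]*/, 'black'],            //identifiers
    [/\s+|./, 'black']                        //anything else, unhighlighted
  ]
});

Each (color, match) pair that comes back could then be wrapped in a styled span to render the highlighted source.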

In general, string processing is a big part of many applications, so it's good to know what tools you have at your disposal.