Parsers

Parser turns a sequence of tokens (usually produced by the Lexer|lexer) into a tree representation called a syntax tree (not be to be confused with AST).

A parser relies on a grammar. Regular expressions can't express operator precedence, so Context-Free Grammar is needed. CFG grammars have a single nonterminal symbol on the left side of a production rule. They are often specified in a Backus-Naur Form.

Parsing can happen top-down (recursive descent, LL) or bottom-up (LR). LL parsers generate leftmost derivation and LR rightmost in reverse.

LL(1) means: Left-to-right, leftmost derivation, single lookahead token. LL and recursive descent can't handle left recursion in production rules, e.g. Sum ::= Sum '+' Int | Int. In practice it can be solved by using a loop in a recursive descent parser.

Many industrial compilers use recursive descent parsers.

Representing expressions in BNF is not straight forward. Recursive descent parser is often paired with a bottom-up Pratt parser to efficiently parse operator precedence.

Bison, Yacc, and ANTLR are examples of parser generators. They usually generate bottom-up parsers, like LR.

Recursive descent code example

std::unique_ptr<BlockNode> parseBlock() {
  expect(Token::Kind::LeftCurly);
  getNextToken();

  std::vector<std::unique_ptr<Node>> statements;
  while (true) {
    if (auto statement = parseStatement()) {
      statements.push_back(std::move(statement));
    } else {
      break;
    }
  }

  expect(Token::Kind::RightCurly);
  getNextToken();

  return std::make_unique<BlockNode>(std::move(statements));
}

Links