Parsers
Parser turns a sequence of tokens (usually produced by the Lexer|lexer) into a tree representation called a syntax tree (not be to be confused with AST).
A parser relies on a grammar. Regular expressions can't express operator precedence, so Context-Free Grammar is needed. CFG grammars have a single nonterminal symbol on the left side of a production rule. They are often specified in a Backus-Naur Form.
Parsing can happen top-down (recursive descent, LL) or bottom-up (LR). LL parsers generate leftmost derivation and LR rightmost in reverse.
LL(1) means: Left-to-right, leftmost derivation, single lookahead token. LL and recursive descent can't handle left recursion in production rules, e.g. Sum ::= Sum '+' Int | Int
. In practice it can be solved by using a loop in a recursive descent parser.
Many industrial compilers use recursive descent parsers.
Representing expressions in BNF is not straight forward. Recursive descent parser is often paired with a bottom-up Pratt parser to efficiently parse operator precedence.
Bison, Yacc, and ANTLR are examples of parser generators. They usually generate bottom-up parsers, like LR.
Recursive descent code example
std::unique_ptr<BlockNode> parseBlock() {
expect(Token::Kind::LeftCurly);
getNextToken();
std::vector<std::unique_ptr<Node>> statements;
while (true) {
if (auto statement = parseStatement()) {
statements.push_back(std::move(statement));
} else {
break;
}
}
expect(Token::Kind::RightCurly);
getNextToken();
return std::make_unique<BlockNode>(std::move(statements));
}
Links
- Grammar Online
- BNF Playground
- Simple but Powerful Pratt Parsing
- Pratt Parsing - YouTube
- TL;DR - Arena-based parsers
AST (Abstract Syntax Trees)
- The life and times of an Abstract Syntax Tree | Trail of Bits Blog β on AST design