Lexer
Lexer reads the input source code and produces and stream of tokens. Each token is a pair of token class and the substring itself (called lexeme).
The correspondence between a set of strings and a token class can be defined through regular expressions and implemented using finite automata. NFA (Non-deterministic Finite Automata) can be converted to DFA (Deterministic Finite Automata) and that can be used to generate an efficient table-driven implementation.
Tools like Flex can generate lexer code.
Links
- Fast Tokenizers with StringScanner | Tenderlove Making – using perfect hash to optimize keyword lookup