A Brief Primer on Compilers

Compilers was my favorite class in college. I came in expecting a dry, theoretical slog and left genuinely excited about programming languages in a way I hadn’t been before. There’s something satisfying about understanding the machinery underneath the tools you use every day. This post is my attempt to give a quick tour of the major phases of a compiler. We’ll use OCaml for the code examples, which is a fitting choice given how naturally it handles the kinds of tree transformations compilers are built around.

The classic model of a compiler breaks down into four big phases: lexing, parsing, semantic analysis, and code generation. Each phase takes the output of the previous one and transforms it into a higher-level representation, progressively closing the gap between raw text and something a machine can execute.

Lexing

The very first thing a compiler does is turn a raw string of characters into a flat list of tokens. This phase is called lexing (or scanning, or tokenization - people use these interchangeably). A token is just a meaningful unit: a number, a keyword, an operator, a parenthesis. The lexer throws away whitespace and comments and hands the parser something much cleaner to work with.

Here’s a token type for a tiny arithmetic language, plus a simple lexer:

type token =
  | INT of int
  | PLUS
  | MINUS
  | TIMES
  | LPAREN
  | RPAREN
  | EOF

let lex input =
  let i = ref 0 in
  let len = String.length input in
  let advance () = incr i in
  let rec go acc =
    if !i >= len then List.rev (EOF :: acc)
    else match input.[!i] with
    | ' ' | '\t' | '\n' -> advance (); go acc
    | '+' -> advance (); go (PLUS :: acc)
    | '-' -> advance (); go (MINUS :: acc)
    | '*' -> advance (); go (TIMES :: acc)
    | '(' -> advance (); go (LPAREN :: acc)
    | ')' -> advance (); go (RPAREN :: acc)
    | c when c >= '0' && c <= '9' ->
      let start = !i in
      while !i < len && input.[!i] >= '0' && input.[!i] <= '9' do advance () done;
      let n = int_of_string (String.sub input start (!i - start)) in
      go (INT n :: acc)
    | c -> failwith (Printf.sprintf "Unexpected character: %c" c)
  in
  go []

Running lex "1 + (2 * 3)" would give you [INT 1; PLUS; LPAREN; INT 2; TIMES; INT 3; RPAREN; EOF]. Nothing fancy, but it’s a crucial first step - the parser would have a much harder time if it had to operate character by character.

Parsing

The parser takes that flat token list and builds a tree out of it. This tree is called an abstract syntax tree, or AST. It’s “abstract” because it captures the structure of the program without the syntactic noise - parentheses are gone, the tree shape itself encodes precedence and grouping.

type expr =
  | Num of int
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr

let parse tokens =
  let pos = ref 0 in
  let peek () = List.nth tokens !pos in
  let consume () =
    let t = peek () in
    incr pos; t
  in
  let rec parse_expr () =
    let left = parse_term () in
    match peek () with
    | PLUS  -> ignore (consume ()); Add (left, parse_expr ())
    | MINUS -> ignore (consume ()); Sub (left, parse_expr ())
    | _     -> left
  and parse_term () =
    let left = parse_atom () in
    match peek () with
    | TIMES -> ignore (consume ()); Mul (left, parse_term ())
    | _     -> left
  and parse_atom () =
    match consume () with
    | INT n  -> Num n
    | LPAREN ->
      let e = parse_expr () in
      (match consume () with
       | RPAREN -> e
       | _ -> failwith "Expected ')'")
    | _ -> failwith "Unexpected token"
  in
  parse_expr ()

This is a recursive descent parser, which is one of the most readable ways to write a parser by hand. Each grammar rule maps to a function. The separate parse_expr and parse_term functions encode operator precedence - multiplication binds tighter than addition, and the nesting in the call stack reflects that!

For 1 + 2 * 3, you’d get Add (Num 1, Mul (Num 2, Num 3)), which is exactly the right tree. No ambiguity.

Semantic Analysis

Once you have an AST, you need to check that it actually means something. Syntactically valid programs can still be semantically broken - using an undefined variable, calling a function with the wrong number of arguments, passing a string where an integer is expected. Semantic analysis is where you catch all of that.

For a language with types, this phase is usually called type checking. The idea is to walk the AST and assign a type to every expression, flagging anything that doesn’t add up. Here’s a small type checker for our arithmetic language, extended with booleans:

type typ = TInt | TBool

type expr =
  | Num  of int
  | Bool of bool
  | Add  of expr * expr
  | Eq   of expr * expr   (* produces a bool *)
  | If   of expr * expr * expr

let rec typecheck = function
  | Num  _ -> TInt
  | Bool _ -> TBool
  | Add (l, r) ->
    (match typecheck l, typecheck r with
     | TInt, TInt -> TInt
     | _ -> failwith "Type error: '+' requires integers")
  | Eq (l, r) ->
    (match typecheck l, typecheck r with
     | TInt, TInt -> TBool
     | _ -> failwith "Type error: '=' requires integers")
  | If (cond, then_, else_) ->
    (match typecheck cond with
     | TBool ->
       let t1 = typecheck then_ and t2 = typecheck else_ in
       if t1 = t2 then t1
       else failwith "Type error: branches of 'if' must match"
     | _ -> failwith "Type error: condition must be a bool")

Pattern matching makes this kind of structural traversal feel very natural in OCaml. Real compilers do much more here - inferring types, resolving names, checking mutability - but the basic shape is the same: walk the tree, accumulate information, raise errors early.

Code Generation

The final phase takes the verified, annotated AST and produces output - either machine code, bytecode, or sometimes another high-level language (this last case is called transpilation). For our tiny language, let’s generate instructions for a simple stack machine.

type instruction =
  | Push of int
  | Add
  | Sub
  | Mul

let rec codegen = function
  | Num n       -> [Push n]
  | Add (l, r)  -> codegen l @ codegen r @ [Add]
  | Sub (l, r)  -> codegen l @ codegen r @ [Sub]
  | Mul (l, r)  -> codegen l @ codegen r @ [Mul]

let eval instructions =
  let stack = Stack.create () in
  List.iter (function
    | Push n -> Stack.push n stack
    | Add    -> let b = Stack.pop stack in let a = Stack.pop stack in Stack.push (a + b) stack
    | Sub    -> let b = Stack.pop stack in let a = Stack.pop stack in Stack.push (a - b) stack
    | Mul    -> let b = Stack.pop stack in let a = Stack.pop stack in Stack.push (a * b) stack
  ) instructions;
  Stack.pop stack

For Add (Num 1, Mul (Num 2, Num 3)), codegen gives you [Push 1; Push 2; Push 3; Mul; Add], and eval on that returns 7. It works!

Real code generation targets things like LLVM IR or x86 assembly, and has to deal with register allocation, calling conventions, and a lot of other platform-specific concerns. But the fundamental idea - traverse the AST and emit something executable - stays the same.

There’s obviously a lot more to compilers than what’s covered here. Optimizations, intermediate representations, linking, and runtime systems are all rich topics on their own. But if you ever felt like the compiler was a black box, hopefully this gives a useful mental model of what’s going on inside it. Taking that course remains one of the better decisions I made in college!