ExamplesBy LevelBy TopicLearning Paths
165 Advanced

Keyword Parser

Functional Programming

Tutorial

The Problem

Keywords like if, else, while, and return must be distinguished from identifiers. A naive string parser would match "if" at the start of "ifstream" — failing to recognize that "ifstream" is an identifier containing the prefix "if". Keyword parsers enforce word boundary checking: the keyword must not be followed by an identifier character. This is the difference between lexer-level and parser-level thinking in grammar design.

🎯 Learning Outcomes

  • • Understand the word boundary problem: why tag("if") alone is insufficient for keywords
  • • Implement keyword as tag followed by a negative lookahead (no identifier character)
  • • Learn how keyword parsers interact with identifier parsers in a complete lexer
  • • See the "longest match" rule as the standard resolution for keyword/identifier ambiguity
  • Code Example

    fn keyword<'a>(kw: &str) -> Parser<'a, &'a str> {
        let kw_owned = kw.to_string();
        Box::new(move |input: &'a str| {
            if !input.starts_with(&kw_owned) {
                return Err(format!("Expected \"{}\"", kw_owned));
            }
            let rest = &input[kw_owned.len()..];
            match rest.chars().next() {
                Some(c) if is_ident_char(c) => Err(format!("not a complete keyword")),
                _ => Ok((&input[..kw_owned.len()], rest)),
            }
        })
    }

    Key Differences

  • Lookahead: Both use non-consuming lookahead (peek); the implementation detail is language-specific but the semantics are identical.
  • Negative lookahead: Both express "keyword not followed by identifier char"; Rust checks the remainder string; OCaml uses peek_char.
  • Reserved words: Some parsers pre-collect keywords into a HashSet and check identifiers against it; this is an alternative to individual keyword parsers.
  • Error messages: Failing keyword parsers should report "expected keyword 'X'"; reporting "expected identifier" would mislead the user.
  • OCaml Approach

    In OCaml's angstrom, the common pattern is:

    let keyword kw =
      string kw *> peek_char >>= function
      | Some c when is_ident_char c -> fail ("expected end of keyword '" ^ kw ^ "'")
      | _ -> return kw
    

    peek_char looks at the next character without consuming it, providing the lookahead. This is equivalent to Rust's approach and equally necessary for correct keyword parsing.

    Full Source

    #![allow(clippy::all)]
    // Example 165: Keyword Parser
    // Keywords with word boundary checking
    
    type ParseResult<'a, T> = Result<(T, &'a str), String>;
    type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;
    
    fn is_ident_char(c: char) -> bool {
        c.is_ascii_alphanumeric() || c == '_'
    }
    
    // ============================================================
    // Approach 1: keyword — match string + word boundary
    // ============================================================
    
    fn keyword<'a>(kw: &str) -> Parser<'a, &'a str> {
        let kw_owned = kw.to_string();
        Box::new(move |input: &'a str| {
            if !input.starts_with(&kw_owned) {
                return Err(format!("Expected \"{}\"", kw_owned));
            }
            let rest = &input[kw_owned.len()..];
            match rest.chars().next() {
                Some(c) if is_ident_char(c) => Err(format!(
                    "\"{}\" not a complete keyword (followed by '{}')",
                    kw_owned, c
                )),
                _ => Ok((&input[..kw_owned.len()], rest)),
            }
        })
    }
    
    // ============================================================
    // Approach 2: keyword mapped to a token type
    // ============================================================
    
    #[derive(Debug, Clone, PartialEq)]
    enum Token {
        If,
        Then,
        Else,
        Let,
        In,
        Fn,
    }
    
    fn keyword_token<'a>(kw: &str, tok: Token) -> Parser<'a, Token> {
        let kw_owned = kw.to_string();
        Box::new(move |input: &'a str| {
            if !input.starts_with(&kw_owned) {
                return Err(format!("Expected \"{}\"", kw_owned));
            }
            let rest = &input[kw_owned.len()..];
            match rest.chars().next() {
                Some(c) if is_ident_char(c) => Err(format!("\"{}\" not a complete keyword", kw_owned)),
                _ => Ok((tok.clone(), rest)),
            }
        })
    }
    
    // ============================================================
    // Approach 3: any_keyword — try multiple, longest first
    // ============================================================
    
    fn any_keyword<'a>(keywords: Vec<(&str, Token)>) -> Parser<'a, Token> {
        let mut kws: Vec<(String, Token)> = keywords
            .into_iter()
            .map(|(s, t)| (s.to_string(), t))
            .collect();
        // Sort longest first to avoid prefix ambiguity
        kws.sort_by(|a, b| b.0.len().cmp(&a.0.len()));
    
        Box::new(move |input: &'a str| {
            for (kw, tok) in &kws {
                if input.starts_with(kw.as_str()) {
                    let rest = &input[kw.len()..];
                    match rest.chars().next() {
                        Some(c) if is_ident_char(c) => continue,
                        _ => return Ok((tok.clone(), rest)),
                    }
                }
            }
            Err("Expected keyword".to_string())
        })
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_keyword_match() {
            assert_eq!(keyword("if")("if x"), Ok(("if", " x")));
        }
    
        #[test]
        fn test_keyword_boundary() {
            assert!(keyword("if")("iffy").is_err());
        }
    
        #[test]
        fn test_keyword_paren() {
            assert_eq!(keyword("if")("if("), Ok(("if", "(")));
        }
    
        #[test]
        fn test_keyword_eof() {
            assert_eq!(keyword("if")("if"), Ok(("if", "")));
        }
    
        #[test]
        fn test_keyword_token() {
            assert_eq!(
                keyword_token("let", Token::Let)("let x"),
                Ok((Token::Let, " x"))
            );
        }
    
        #[test]
        fn test_any_keyword() {
            let p = any_keyword(vec![
                ("if", Token::If),
                ("in", Token::In),
                ("let", Token::Let),
            ]);
            assert_eq!(p("let x"), Ok((Token::Let, " x")));
            assert_eq!(p("in "), Ok((Token::In, " ")));
        }
    
        #[test]
        fn test_any_keyword_none() {
            let p = any_keyword(vec![("if", Token::If)]);
            assert!(p("hello").is_err());
        }
    
        #[test]
        fn test_keyword_no_match() {
            assert!(keyword("else")("elseif").is_err());
        }
    }
    ✓ Tests Rust test suite
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_keyword_match() {
            assert_eq!(keyword("if")("if x"), Ok(("if", " x")));
        }
    
        #[test]
        fn test_keyword_boundary() {
            assert!(keyword("if")("iffy").is_err());
        }
    
        #[test]
        fn test_keyword_paren() {
            assert_eq!(keyword("if")("if("), Ok(("if", "(")));
        }
    
        #[test]
        fn test_keyword_eof() {
            assert_eq!(keyword("if")("if"), Ok(("if", "")));
        }
    
        #[test]
        fn test_keyword_token() {
            assert_eq!(
                keyword_token("let", Token::Let)("let x"),
                Ok((Token::Let, " x"))
            );
        }
    
        #[test]
        fn test_any_keyword() {
            let p = any_keyword(vec![
                ("if", Token::If),
                ("in", Token::In),
                ("let", Token::Let),
            ]);
            assert_eq!(p("let x"), Ok((Token::Let, " x")));
            assert_eq!(p("in "), Ok((Token::In, " ")));
        }
    
        #[test]
        fn test_any_keyword_none() {
            let p = any_keyword(vec![("if", Token::If)]);
            assert!(p("hello").is_err());
        }
    
        #[test]
        fn test_keyword_no_match() {
            assert!(keyword("else")("elseif").is_err());
        }
    }

    Deep Comparison

    Comparison: Example 165 — Keyword Parser

    keyword with boundary

    OCaml:

    let keyword (kw : string) : string parser = fun input ->
      match tag kw input with
      | Error e -> Error e
      | Ok (matched, rest) ->
        if String.length rest > 0 && is_ident_char rest.[0] then
          Error (Printf.sprintf "\"%s\" is not a complete keyword" kw)
        else Ok (matched, rest)
    

    Rust:

    fn keyword<'a>(kw: &str) -> Parser<'a, &'a str> {
        let kw_owned = kw.to_string();
        Box::new(move |input: &'a str| {
            if !input.starts_with(&kw_owned) {
                return Err(format!("Expected \"{}\"", kw_owned));
            }
            let rest = &input[kw_owned.len()..];
            match rest.chars().next() {
                Some(c) if is_ident_char(c) => Err(format!("not a complete keyword")),
                _ => Ok((&input[..kw_owned.len()], rest)),
            }
        })
    }
    

    Token enum

    OCaml:

    type token = If | Then | Else | Let | In | Fun
    

    Rust:

    #[derive(Debug, Clone, PartialEq)]
    enum Token { If, Then, Else, Let, In, Fn }
    

    Exercises

  • Build a keywords combinator that tries a list of keywords in order: keywords(["if", "else", "while"]).
  • Write a parser that correctly distinguishes "for" (keyword) from "formula" (identifier) and test both.
  • Implement reserved_word(words: &[&str]) -> impl Fn(&str) -> bool as a fast lookup for keyword exclusion in identifier parsing.
  • Open Source Repos