152 Advanced

Character Parsers

Functional Programming

Tutorial

The Problem

All parsers ultimately reduce to reading individual characters. Primitive character parsers (match this specific character, match any character, match a character not in this set, match any of these characters) form the atomic vocabulary from which all other parsers are constructed. Getting these primitives right (correct UTF-8 handling, informative error messages, correct slicing of the remaining input) is essential for building correct higher-level parsers.

🎯 Learning Outcomes

  • Implement the fundamental character parsers: char_parser, any_char, none_of, one_of
  • Understand correct UTF-8 character slicing using char::len_utf8()
  • Learn how error messages should name what was expected vs. what was found
  • See how these primitives combine to form digit, letter, and alphanumeric parsers

Code Example

    fn char_parser<'a>(expected: char) -> Parser<'a, char> {
        Box::new(move |input: &'a str| {
            match input.chars().next() {
                Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
                Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
                None => Err(format!("Expected '{}', got EOF", expected)),
            }
        })
    }
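
The learning outcomes mention digit, letter, and alphanumeric parsers, which the listing does not show. One possible way to build them is sketched here, assuming a predicate-based helper named satisfy. The names satisfy, digit, letter, and alphanumeric are illustrative assumptions rather than part of the original source; the type aliases are repeated from the full source so the block stands alone.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

// Hypothetical helper (not in the original source): accept one character for
// which `pred` returns true. `label` names the expectation in error messages.
fn satisfy<'a>(label: &'static str, pred: fn(char) -> bool) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        // Advance by the character's encoded length, as char_parser does.
        Some(c) if pred(c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected {}, got '{}'", label, c)),
        None => Err(format!("Expected {}, got EOF", label)),
    })
}

// ASCII digits only: '0'..='9'.
fn digit<'a>() -> Parser<'a, char> {
    satisfy("a digit", |c| c.is_ascii_digit())
}

// Any Unicode alphabetic character.
fn letter<'a>() -> Parser<'a, char> {
    satisfy("a letter", |c| c.is_alphabetic())
}

// Any Unicode alphabetic or numeric character.
fn alphanumeric<'a>() -> Parser<'a, char> {
    satisfy("a letter or digit", |c| c.is_alphanumeric())
}
```

Passing the predicate as a plain fn pointer keeps the sketch short; a generic closure parameter would work as well if the predicate needs to capture state.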

    Key Differences

  • UTF-8 safety: Rust must use len_utf8() to advance correctly; OCaml's string model (bytes or Uchar) similarly requires care at byte boundaries.
  • Ownership: Rust returns &'a str slices without copying; OCaml typically returns new strings or offsets into a buffer.
  • Error messages: Both should include the expected character name in the error; this is convention rather than enforcement in both languages.
  • Performance: Rust's chars().next() decodes one codepoint from UTF-8 efficiently; OCaml's equivalent is String.get_utf_8_uchar.
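
To make the UTF-8 point concrete, the following sketch contrasts advancing by the character's encoded length with advancing by a fixed one byte. The helper names split_first_char and naive_rest are hypothetical and exist only for this illustration.

```rust
// Hypothetical helper: split off the first character, advancing by its
// encoded length in bytes (1 to 4) rather than by a fixed 1.
fn split_first_char(input: &str) -> Option<(char, &str)> {
    let c = input.chars().next()?;
    Some((c, &input[c.len_utf8()..]))
}

// Hypothetical counter-example: always skip exactly one byte.
// `str::get` returns None when the index is not on a character boundary;
// the indexing form `&input[1..]` would panic in the same situation.
fn naive_rest(input: &str) -> Option<&str> {
    input.get(1..)
}
```

For an input such as "école", the first character occupies two bytes, so naive_rest yields None while split_first_char yields the character together with "cole". For pure ASCII input the two approaches agree.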

    OCaml Approach

    In OCaml's angstrom library, char 'a' and satisfy is_alpha (where is_alpha is a user-supplied predicate) are the primitives, and any_char accepts any single character. These parsers operate on OCaml's byte-sized char; the Uchar module handles Unicode code points. angstrom internally works on Bigstring for performance. The pattern matches are structurally identical to Rust's, but without lifetime annotations.

    Full Source

    #![allow(clippy::all)]
    // Example 152: Character Parsers
    // Parse single characters: char_parser, any_char, none_of, one_of
    
    type ParseResult<'a, T> = Result<(T, &'a str), String>;
    type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;
    
    // ============================================================
    // Approach 1: Parse a specific character
    // ============================================================
    
    fn char_parser<'a>(expected: char) -> Parser<'a, char> {
        Box::new(move |input: &'a str| match input.chars().next() {
            Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
            None => Err(format!("Expected '{}', got EOF", expected)),
        })
    }
    
    // ============================================================
    // Approach 2: Parse any character
    // ============================================================
    
    fn any_char<'a>() -> Parser<'a, char> {
        Box::new(|input: &'a str| match input.chars().next() {
            Some(c) => Ok((c, &input[c.len_utf8()..])),
            None => Err("Expected any character, got EOF".to_string()),
        })
    }
    
    // ============================================================
    // Approach 3: Parse char NOT in set / IN set
    // ============================================================
    
    fn none_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
        Box::new(move |input: &'a str| match input.chars().next() {
            Some(c) if !chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Unexpected character '{}'", c)),
            None => Err("Expected a character, got EOF".to_string()),
        })
    }
    
    fn one_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
        Box::new(move |input: &'a str| match input.chars().next() {
            Some(c) if chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Character '{}' not in allowed set", c)),
            None => Err("Expected a character, got EOF".to_string()),
        })
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_char_parser_match() {
            let p = char_parser('a');
            assert_eq!(p("abc"), Ok(('a', "bc")));
        }
    
        #[test]
        fn test_char_parser_no_match() {
            let p = char_parser('a');
            assert!(p("xyz").is_err());
        }
    
        #[test]
        fn test_char_parser_empty() {
            let p = char_parser('a');
            assert!(p("").is_err());
        }
    
        #[test]
        fn test_any_char_success() {
            let p = any_char();
            assert_eq!(p("hello"), Ok(('h', "ello")));
        }
    
        #[test]
        fn test_any_char_single() {
            let p = any_char();
            assert_eq!(p("x"), Ok(('x', "")));
        }
    
        #[test]
        fn test_any_char_empty() {
            let p = any_char();
            assert!(p("").is_err());
        }
    
        #[test]
        fn test_none_of_allowed() {
            let p = none_of(vec!['x', 'y', 'z']);
            assert_eq!(p("abc"), Ok(('a', "bc")));
        }
    
        #[test]
        fn test_none_of_blocked() {
            let p = none_of(vec!['a', 'b']);
            assert!(p("abc").is_err());
        }
    
        #[test]
        fn test_one_of_match() {
            let p = one_of(vec!['a', 'b', 'c']);
            assert_eq!(p("beta"), Ok(('b', "eta")));
        }
    
        #[test]
        fn test_one_of_no_match() {
            let p = one_of(vec!['x', 'y']);
            assert!(p("abc").is_err());
        }
    
        #[test]
        fn test_unicode_char() {
            let p = char_parser('é');
            assert_eq!(p("école"), Ok(('é', "cole")));
        }
    }
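
The error messages of none_of and one_of above report what was found but not which set was expected. As one possible refinement, the sketch below lists the allowed set in the error. The name one_of_described is an assumption made for this illustration, and the matching logic is otherwise the same as one_of.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

// Hypothetical variant of one_of: identical matching, but both error cases
// name the expected set using the Debug formatting of Vec<char>.
fn one_of_described<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected one of {:?}, got '{}'", chars, c)),
        None => Err(format!("Expected one of {:?}, got EOF", chars)),
    })
}
```

With the set ['x', 'y'] and the input "abc", this variant reports: Expected one of ['x', 'y'], got 'a'. Debug formatting is used here only for brevity; a production parser might prefer a hand-written label.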

    Deep Comparison

    Comparison: Example 152 - Character Parsers

    char_parser

    OCaml:

    let char_parser (c : char) : char parser = fun input ->
      match advance input with
      | Some (ch, rest) when ch = c -> Ok (ch, rest)
      | Some (ch, _) -> Error (Printf.sprintf "Expected '%c', got '%c'" c ch)
      | None -> Error (Printf.sprintf "Expected '%c', got EOF" c)
    

    Rust:

    fn char_parser<'a>(expected: char) -> Parser<'a, char> {
        Box::new(move |input: &'a str| {
            match input.chars().next() {
                Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
                Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
                None => Err(format!("Expected '{}', got EOF", expected)),
            }
        })
    }
    

    any_char

    OCaml:

    let any_char : char parser = fun input ->
      match advance input with
      | Some (ch, rest) -> Ok (ch, rest)
      | None -> Error "Expected any character, got EOF"
    

    Rust:

    fn any_char<'a>() -> Parser<'a, char> {
        Box::new(|input: &'a str| {
            match input.chars().next() {
                Some(c) => Ok((c, &input[c.len_utf8()..])),
                None => Err("Expected any character, got EOF".to_string()),
            }
        })
    }
    

    none_of

    OCaml:

    let none_of (chars : char list) : char parser = fun input ->
      match advance input with
      | Some (ch, rest) ->
        if List.mem ch chars then Error (Printf.sprintf "Unexpected '%c'" ch)
        else Ok (ch, rest)
      | None -> Error "Expected a character, got EOF"
    

    Rust:

    fn none_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
        Box::new(move |input: &'a str| {
            match input.chars().next() {
                Some(c) if !chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
                Some(c) => Err(format!("Unexpected character '{}'", c)),
                None => Err("Expected a character, got EOF".to_string()),
            }
        })
    }
    

    Exercises

  • Implement upper_case_char() -> Parser<'a, char> and lower_case_char() -> Parser<'a, char> using char::is_uppercase and char::is_lowercase.
  • Write ascii_parser(c: char) -> Parser<'a, char> that panics at creation time if c is not ASCII (to catch programming errors early).
  • Benchmark parsing a 1 MB string character by character using any_char and measure throughput.