152 Advanced

Character Parsers

The Problem

All parsers ultimately reduce to reading individual characters. Primitive character parsers — match this specific character, match any character, match a character not in this set, match any of these characters — form the atomic vocabulary from which all other parsers are constructed. Getting these primitives right (correct UTF-8 handling, informative error messages, correct remaining input slicing) is essential for building correct higher-level parsers.
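
To make the slicing concern concrete, here is a minimal sketch comparing character-wise and byte-wise advancing. The helper names advance and advance_one_byte are assumptions for illustration and do not appear in the example's Rust source.

```rust
/// Hypothetical helper: split off the first character and return it
/// together with the remaining input.
fn advance(input: &str) -> Option<(char, &str)> {
    let c = input.chars().next()?;
    // `len_utf8` is how many bytes this char occupies in UTF-8 (1 to 4),
    // so the slice below always starts on a character boundary.
    Some((c, &input[c.len_utf8()..]))
}

/// Byte-wise variant for contrast: `str::get` returns None instead of
/// panicking when byte index 1 falls inside a multi-byte character.
fn advance_one_byte(input: &str) -> Option<&str> {
    input.get(1..)
}
```

For "abc" the two agree on the remainder "bc". For "école", where 'é' occupies two bytes, advance yields ('é', "cole") while advance_one_byte yields None, and the plain index expression &input[1..] would panic.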

🎯 Learning Outcomes

• Implement the fundamental character parsers: char_parser, any_char, none_of, one_of

• Understand correct UTF-8 character slicing using char::len_utf8()

• Learn how error messages should name what was expected vs. what was found

• See how these primitives combine to form digit, letter, and alphanumeric parsers
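
The last outcome can be sketched roughly as follows. The names digit, letter, and alphanumeric are assumptions (the source only names the parser categories), the character classes are restricted to ASCII for simplicity, and one_of is repeated so the block stands alone.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

// Same shape as the tutorial's `one_of`, repeated so the sketch is self-contained.
fn one_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Character '{}' not in allowed set", c)),
        None => Err("Expected a character, got EOF".to_string()),
    })
}

// Assumed name: one ASCII decimal digit.
fn digit<'a>() -> Parser<'a, char> {
    one_of(('0'..='9').collect())
}

// Assumed name: one ASCII letter, lower or upper case.
fn letter<'a>() -> Parser<'a, char> {
    one_of(('a'..='z').chain('A'..='Z').collect())
}

// Assumed name: try the digit parser first, then fall back to the letter parser.
fn alphanumeric<'a>() -> Parser<'a, char> {
    let (d, l) = (digit(), letter());
    Box::new(move |input: &'a str| d(input).or_else(|_| l(input)))
}
```

When both alternatives fail, alphanumeric reports the letter parser's error because or_else discards the first one; a fuller design would merge the two expectations into a single message.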

Code Example

Rust:

fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| {
        match input.chars().next() {
            Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
            None => Err(format!("Expected '{}', got EOF", expected)),
        }
    })
}

OCaml:

let char_parser (c : char) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) when ch = c -> Ok (ch, rest)
  | Some (ch, _) -> Error (Printf.sprintf "Expected '%c', got '%c'" c ch)
  | None -> Error (Printf.sprintf "Expected '%c', got EOF" c)

Key Differences

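UTF-8 safety: Rust must advance by c.len_utf8() bytes so that the remaining slice starts on a character boundary. OCaml's char is a single byte, so the advance helper in this example steps one byte at a time and is correct for ASCII input only; Unicode-aware OCaml code has to decode Uchar values instead.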

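Ownership: Rust returns &'a str slices that borrow from the input without copying. The OCaml version in this example calls String.sub, which allocates a new string for the remainder on every step; OCaml parsers that care about performance typically track an offset into a shared buffer instead.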

Error messages: Both versions should name the expected character in the error. This is a convention; neither language enforces it.

Performance: Rust's chars().next() decodes only the first code point and does not scan the rest of the string; OCaml's closest equivalent is String.get_utf_8_uchar (standard library, OCaml 4.14 and later).
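
The ownership point can be checked directly. In the sketch below, is_suffix_slice is a hypothetical helper (not part of the source) that compares addresses to confirm the returned remainder points into the original buffer and is not a copy; any_char is repeated from the example so the block stands alone.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

// Repeated from the example so the sketch is self-contained.
fn any_char<'a>() -> Parser<'a, char> {
    Box::new(|input: &'a str| match input.chars().next() {
        Some(c) => Ok((c, &input[c.len_utf8()..])),
        None => Err("Expected any character, got EOF".to_string()),
    })
}

/// Hypothetical check: true when `rest` occupies the tail of `original`'s
/// own buffer, i.e. it is a borrowed suffix and not a fresh allocation.
fn is_suffix_slice(original: &str, rest: &str) -> bool {
    let start = original.as_ptr() as usize;
    let rest_start = rest.as_ptr() as usize;
    rest_start >= start && rest_start + rest.len() == start + original.len()
}
```

Applying any_char to a String holding "héllo" gives a remainder "éllo" for which is_suffix_slice returns true, whereas a separately allocated String with the same contents fails the check.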

OCaml Approach

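In OCaml's angstrom library, char 'a' and satisfy is_alpha are the primitives, where is_alpha stands for a user-supplied predicate. OCaml's Uchar module handles Unicode; angstrom works internally on Bigstring buffers for performance. Angstrom also provides any_char directly; take 1 is similar but returns a one-byte string rather than a char. The pattern matches in the hand-written OCaml version are structurally identical to Rust's, but without lifetime annotations.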
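
For comparison, a predicate-based primitive in the spirit of angstrom's satisfy could be written against this example's Parser type roughly as follows. Both the Rust function satisfy and its what label parameter are assumptions made for this sketch; the example's Rust source does not define them.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

/// Assumed helper (not in the original source): accept one character for
/// which `pred` returns true. `what` names the expectation in error messages.
fn satisfy<'a, F>(pred: F, what: &'static str) -> Parser<'a, char>
where
    F: Fn(char) -> bool + 'a,
{
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if pred(c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected {}, got '{}'", what, c)),
        None => Err(format!("Expected {}, got EOF", what)),
    })
}
```

A call such as satisfy(char::is_alphabetic, "a letter") then plays the role of satisfy is_alpha, and the label keeps the error message in the "expected vs. found" form used by char_parser.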

Full Source

#![allow(clippy::all)]
// Example 152: Character Parsers
// Parse single characters: char_parser, any_char, none_of, one_of

type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

// ============================================================
// Approach 1: Parse a specific character
// ============================================================

fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
        None => Err(format!("Expected '{}', got EOF", expected)),
    })
}

// ============================================================
// Approach 2: Parse any character
// ============================================================

fn any_char<'a>() -> Parser<'a, char> {
    Box::new(|input: &'a str| match input.chars().next() {
        Some(c) => Ok((c, &input[c.len_utf8()..])),
        None => Err("Expected any character, got EOF".to_string()),
    })
}

// ============================================================
// Approach 3: Parse char NOT in set / IN set
// ============================================================

fn none_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if !chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Unexpected character '{}'", c)),
        None => Err("Expected a character, got EOF".to_string()),
    })
}

fn one_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Character '{}' not in allowed set", c)),
        None => Err("Expected a character, got EOF".to_string()),
    })
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_char_parser_match() {
        let p = char_parser('a');
        assert_eq!(p("abc"), Ok(('a', "bc")));
    }

    #[test]
    fn test_char_parser_no_match() {
        let p = char_parser('a');
        assert!(p("xyz").is_err());
    }

    #[test]
    fn test_char_parser_empty() {
        let p = char_parser('a');
        assert!(p("").is_err());
    }

    #[test]
    fn test_any_char_success() {
        let p = any_char();
        assert_eq!(p("hello"), Ok(('h', "ello")));
    }

    #[test]
    fn test_any_char_single() {
        let p = any_char();
        assert_eq!(p("x"), Ok(('x', "")));
    }

    #[test]
    fn test_any_char_empty() {
        let p = any_char();
        assert!(p("").is_err());
    }

    #[test]
    fn test_none_of_allowed() {
        let p = none_of(vec!['x', 'y', 'z']);
        assert_eq!(p("abc"), Ok(('a', "bc")));
    }

    #[test]
    fn test_none_of_blocked() {
        let p = none_of(vec!['a', 'b']);
        assert!(p("abc").is_err());
    }

    #[test]
    fn test_one_of_match() {
        let p = one_of(vec!['a', 'b', 'c']);
        assert_eq!(p("beta"), Ok(('b', "eta")));
    }

    #[test]
    fn test_one_of_no_match() {
        let p = one_of(vec!['x', 'y']);
        assert!(p("abc").is_err());
    }

    #[test]
    fn test_unicode_char() {
        let p = char_parser('é');
        assert_eq!(p("école"), Ok(('é', "cole")));
    }
}

(* Example 152: Character Parsers *)
(* Parse single characters: char_parser, any_char, none_of, one_of *)

type 'a parse_result = ('a * string, string) result

type 'a parser = string -> 'a parse_result

(* Helper to advance input by one character *)
let advance input =
  if String.length input > 0 then
    Some (input.[0], String.sub input 1 (String.length input - 1))
  else
    None

(* Approach 1: Parse a specific character *)
let char_parser (c : char) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) when ch = c -> Ok (ch, rest)
  | Some (ch, _) -> Error (Printf.sprintf "Expected '%c', got '%c'" c ch)
  | None -> Error (Printf.sprintf "Expected '%c', got EOF" c)

(* Approach 2: Parse any character *)
let any_char : char parser = fun input ->
  match advance input with
  | Some (ch, rest) -> Ok (ch, rest)
  | None -> Error "Expected any character, got EOF"

(* Approach 3: Parse any character NOT in the given set *)
let none_of (chars : char list) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) ->
    if List.mem ch chars then
      Error (Printf.sprintf "Unexpected character '%c'" ch)
    else
      Ok (ch, rest)
  | None -> Error "Expected a character, got EOF"

(* one_of: parse any character IN the given set *)
let one_of (chars : char list) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) when List.mem ch chars -> Ok (ch, rest)
  | Some (ch, _) -> Error (Printf.sprintf "Character '%c' not in allowed set" ch)
  | None -> Error "Expected a character, got EOF"

(* Tests *)
let () =
  (* char_parser tests *)
  assert (char_parser 'a' "abc" = Ok ('a', "bc"));
  assert (Result.is_error (char_parser 'a' "xyz"));
  assert (Result.is_error (char_parser 'a' ""));

  (* any_char tests *)
  assert (any_char "hello" = Ok ('h', "ello"));
  assert (any_char "x" = Ok ('x', ""));
  assert (Result.is_error (any_char ""));

  (* none_of tests *)
  assert (none_of ['x'; 'y'; 'z'] "abc" = Ok ('a', "bc"));
  assert (Result.is_error (none_of ['a'; 'b'] "abc"));

  (* one_of tests *)
  assert (one_of ['a'; 'b'; 'c'] "beta" = Ok ('b', "eta"));
  assert (Result.is_error (one_of ['x'; 'y'] "abc"));

  print_endline "✓ All tests passed"

Deep Comparison

Comparison: Example 152 — Character Parsers

char_parser

OCaml:

let char_parser (c : char) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) when ch = c -> Ok (ch, rest)
  | Some (ch, _) -> Error (Printf.sprintf "Expected '%c', got '%c'" c ch)
  | None -> Error (Printf.sprintf "Expected '%c', got EOF" c)

Rust:

fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| {
        match input.chars().next() {
            Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
            None => Err(format!("Expected '{}', got EOF", expected)),
        }
    })
}

any_char

OCaml:

let any_char : char parser = fun input ->
  match advance input with
  | Some (ch, rest) -> Ok (ch, rest)
  | None -> Error "Expected any character, got EOF"

Rust:

fn any_char<'a>() -> Parser<'a, char> {
    Box::new(|input: &'a str| {
        match input.chars().next() {
            Some(c) => Ok((c, &input[c.len_utf8()..])),
            None => Err("Expected any character, got EOF".to_string()),
        }
    })
}

none_of

OCaml:

let none_of (chars : char list) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) ->
    if List.mem ch chars then Error (Printf.sprintf "Unexpected character '%c'" ch)
    else Ok (ch, rest)
  | None -> Error "Expected a character, got EOF"

Rust:

fn none_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| {
        match input.chars().next() {
            Some(c) if !chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Unexpected character '{}'", c)),
            None => Err("Expected a character, got EOF".to_string()),
        }
    })
}
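
To show how the (value, rest) pairs returned by these parsers are meant to be threaded together, here is a small sketch that parses a single-quoted character such as 'x'. The function quoted_char is hypothetical and not part of the source; char_parser and none_of are repeated from the example so the block stands alone.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
        None => Err(format!("Expected '{}', got EOF", expected)),
    })
}

fn none_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if !chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Unexpected character '{}'", c)),
        None => Err("Expected a character, got EOF".to_string()),
    })
}

/// Hypothetical example: opening quote, one non-quote character, closing quote.
fn quoted_char<'a>(input: &'a str) -> ParseResult<'a, char> {
    let quote = char_parser('\'');
    let body = none_of(vec!['\'']);
    let (_, rest) = quote(input)?; // consume the opening quote
    let (c, rest) = body(rest)?; // any single character except a quote
    let (_, rest) = quote(rest)?; // consume the closing quote
    Ok((c, rest))
}
```

Each step feeds the previous remainder into the next parser, and the ? operator stops at the first error, so the caller sees the message from whichever primitive failed.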

Exercises

Implement upper_case_char() -> Parser<'a, char> and lower_case_char() -> Parser<'a, char> using char::is_uppercase and char::is_lowercase.

Write ascii_parser(c: char) -> Parser<'a, char> that panics at construction time if c is not ASCII, so that programming errors are caught early.

Benchmark parsing a 1 MB string character by character using any_char and measure the throughput.
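
As a starting point for the benchmarking exercise, one possible harness using only the standard library is sketched below. The function names count_chars and throughput_chars_per_sec are assumptions, and a single unwarmed timing like this is noisy; a dedicated benchmarking crate would give steadier numbers.

```rust
use std::time::Instant;

type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

// Repeated from the example so the sketch is self-contained.
fn any_char<'a>() -> Parser<'a, char> {
    Box::new(|input: &'a str| match input.chars().next() {
        Some(c) => Ok((c, &input[c.len_utf8()..])),
        None => Err("Expected any character, got EOF".to_string()),
    })
}

/// Hypothetical harness: consume `input` one character at a time and
/// return how many characters were parsed before reaching EOF.
fn count_chars(input: &str) -> usize {
    let p = any_char();
    let mut rest = input;
    let mut n = 0;
    while let Ok((_, next)) = p(rest) {
        rest = next;
        n += 1;
    }
    n
}

/// Time one pass over 1 MiB of ASCII and report characters per second.
fn throughput_chars_per_sec() -> f64 {
    let input = "a".repeat(1 << 20);
    let start = Instant::now();
    let n = count_chars(&input);
    let secs = start.elapsed().as_secs_f64();
    n as f64 / secs
}
```

Build with optimizations (for example cargo run --release) before reading anything into the result, since the boxed closure call dominates in debug builds.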

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.
