Character Parsers
Tutorial
The Problem
All parsers ultimately reduce to reading individual characters. Primitive character parsers (match this specific character, match any character, match a character not in this set, match any of these characters) form the atomic vocabulary from which all other parsers are constructed. Getting these primitives right (correct UTF-8 handling, informative error messages, correct slicing of the remaining input) is essential for building correct higher-level parsers.
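To make the UTF-8 concern concrete, here is a minimal sketch of why the remainder has to be sliced by the encoded width of the decoded character rather than by a fixed 1. The helper name `split_first_char` is hypothetical and exists only for this illustration.

```rust
/// Splits off the first character and returns it with the rest of the input.
/// Hypothetical helper for illustration; it is not part of the tutorial's source.
fn split_first_char(input: &str) -> Option<(char, &str)> {
    // Decode one Unicode scalar value from the front of the string.
    let c = input.chars().next()?;
    // Advance by the character's encoded width in bytes (1 to 4), so the
    // remaining slice always starts on a character boundary.
    Some((c, &input[c.len_utf8()..]))
}
```

For an input that starts with 'é' (two bytes in UTF-8), slicing with `&input[1..]` would panic at runtime because byte 1 is not a character boundary; slicing with `c.len_utf8()` avoids that.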
🎯 Learning Outcomes
- char_parser, any_char, none_of, one_of
- char::len_utf8()
Code Example
fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| {
        match input.chars().next() {
            Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
            None => Err(format!("Expected '{}', got EOF", expected)),
        }
    })
}
Key Differences
- Rust relies on len_utf8() to advance correctly; OCaml's string model (bytes or Uchar) similarly requires care at byte boundaries.
- Rust returns &'a str slices without copying; OCaml typically returns new strings or offsets into a buffer.
- Rust's chars().next() decodes one codepoint from UTF-8 efficiently; OCaml's equivalent is String.get_utf_8_uchar.
OCaml Approach
In OCaml's angstrom library, char 'a' and satisfy is_alpha are the primitives. OCaml's Uchar module handles Unicode; angstrom internally works on Bigstring for performance. angstrom's any_char behaves like take 1, except that it returns a char rather than a string. The pattern matches are structurally identical to Rust's, but without lifetime annotations.
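For comparison, a predicate-based primitive in the spirit of angstrom's satisfy could be sketched in Rust as follows. The name `satisfy` and its `description` parameter are assumptions made for this illustration; the tutorial's Rust source does not define them.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

/// Succeeds when the first character satisfies `pred`.
/// `satisfy` is an assumed name borrowed from angstrom, and `description`
/// is a hypothetical parameter used only to build the error message.
fn satisfy<'a, F>(pred: F, description: &'a str) -> Parser<'a, char>
where
    F: Fn(char) -> bool + 'a,
{
    Box::new(move |input: &'a str| match input.chars().next() {
        // Same slicing rule as the other primitives: advance by len_utf8().
        Some(c) if pred(c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected {}, got '{}'", description, c)),
        None => Err(format!("Expected {}, got EOF", description)),
    })
}
```

With this sketch, `satisfy(char::is_alphabetic, "a letter")` plays the role of satisfy is_alpha, and one_of or none_of could be expressed by passing a closure that checks membership in a set.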
Full Source
#![allow(clippy::all)]
// Example 152: Character Parsers
// Parse single characters: char_parser, any_char, none_of, one_of
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;
// ============================================================
// Approach 1: Parse a specific character
// ============================================================
fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
        None => Err(format!("Expected '{}', got EOF", expected)),
    })
}
// ============================================================
// Approach 2: Parse any character
// ============================================================
fn any_char<'a>() -> Parser<'a, char> {
    Box::new(|input: &'a str| match input.chars().next() {
        Some(c) => Ok((c, &input[c.len_utf8()..])),
        None => Err("Expected any character, got EOF".to_string()),
    })
}
// ============================================================
// Approach 3: Parse char NOT in set / IN set
// ============================================================
fn none_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if !chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Unexpected character '{}'", c)),
        None => Err("Expected a character, got EOF".to_string()),
    })
}
fn one_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Character '{}' not in allowed set", c)),
        None => Err("Expected a character, got EOF".to_string()),
    })
}
#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_char_parser_match() {
        let p = char_parser('a');
        assert_eq!(p("abc"), Ok(('a', "bc")));
    }
    #[test]
    fn test_char_parser_no_match() {
        let p = char_parser('a');
        assert!(p("xyz").is_err());
    }
    #[test]
    fn test_char_parser_empty() {
        let p = char_parser('a');
        assert!(p("").is_err());
    }
    #[test]
    fn test_any_char_success() {
        let p = any_char();
        assert_eq!(p("hello"), Ok(('h', "ello")));
    }
    #[test]
    fn test_any_char_single() {
        let p = any_char();
        assert_eq!(p("x"), Ok(('x', "")));
    }
    #[test]
    fn test_any_char_empty() {
        let p = any_char();
        assert!(p("").is_err());
    }
    #[test]
    fn test_none_of_allowed() {
        let p = none_of(vec!['x', 'y', 'z']);
        assert_eq!(p("abc"), Ok(('a', "bc")));
    }
    #[test]
    fn test_none_of_blocked() {
        let p = none_of(vec!['a', 'b']);
        assert!(p("abc").is_err());
    }
    #[test]
    fn test_one_of_match() {
        let p = one_of(vec!['a', 'b', 'c']);
        assert_eq!(p("beta"), Ok(('b', "eta")));
    }
    #[test]
    fn test_one_of_no_match() {
        let p = one_of(vec!['x', 'y']);
        assert!(p("abc").is_err());
    }
    #[test]
    fn test_unicode_char() {
        let p = char_parser('é');
        assert_eq!(p("école"), Ok(('é', "cole")));
    }
}
Deep Comparison
Comparison: Example 152 - Character Parsers
char_parser
OCaml:
let char_parser (c : char) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) when ch = c -> Ok (ch, rest)
  | Some (ch, _) -> Error (Printf.sprintf "Expected '%c', got '%c'" c ch)
  | None -> Error (Printf.sprintf "Expected '%c', got EOF" c)
Rust:
fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| {
        match input.chars().next() {
            Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
            None => Err(format!("Expected '{}', got EOF", expected)),
        }
    })
}
any_char
OCaml:
let any_char : char parser = fun input ->
  match advance input with
  | Some (ch, rest) -> Ok (ch, rest)
  | None -> Error "Expected any character, got EOF"
Rust:
fn any_char<'a>() -> Parser<'a, char> {
    Box::new(|input: &'a str| {
        match input.chars().next() {
            Some(c) => Ok((c, &input[c.len_utf8()..])),
            None => Err("Expected any character, got EOF".to_string()),
        }
    })
}
none_of
OCaml:
let none_of (chars : char list) : char parser = fun input ->
  match advance input with
  | Some (ch, rest) ->
    if List.mem ch chars then Error (Printf.sprintf "Unexpected '%c'" ch)
    else Ok (ch, rest)
  | None -> Error "Expected a character, got EOF"
Rust:
fn none_of<'a>(chars: Vec<char>) -> Parser<'a, char> {
    Box::new(move |input: &'a str| {
        match input.chars().next() {
            Some(c) if !chars.contains(&c) => Ok((c, &input[c.len_utf8()..])),
            Some(c) => Err(format!("Unexpected character '{}'", c)),
            None => Err("Expected a character, got EOF".to_string()),
        }
    })
}
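As a rough illustration of how these primitives compose and why returning a slice matters, the sketch below runs two character parsers in sequence by threading the remainder by hand. The helper `pair_by_hand` is hypothetical and not part of the tutorial's source; `char_parser` and the type aliases are repeated from the full source so the block stands alone.

```rust
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

fn char_parser<'a>(expected: char) -> Parser<'a, char> {
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if c == expected => Ok((c, &input[c.len_utf8()..])),
        Some(c) => Err(format!("Expected '{}', got '{}'", expected, c)),
        None => Err(format!("Expected '{}', got EOF", expected)),
    })
}

/// Runs `first`, then runs `second` on whatever `first` left over.
/// Hypothetical helper for illustration only.
fn pair_by_hand<'a>(
    first: &Parser<'a, char>,
    second: &Parser<'a, char>,
    input: &'a str,
) -> ParseResult<'a, (char, char)> {
    // `?` stops at the first failure and returns its error message.
    let (a, rest) = first(input)?;
    let (b, rest) = second(rest)?;
    Ok(((a, b), rest))
}
```

Calling `pair_by_hand(&char_parser('a'), &char_parser('b'), "abc")` yields both characters plus the remainder "c". That remainder is a slice into the original input buffer, so no string data is copied along the way.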
Exercises
- Implement upper_case_char() -> Parser<char> and lower_case_char() -> Parser<char> using char::is_uppercase and char::is_lowercase.
- Write ascii_parser(c: char) -> Parser<char> that panics at creation time if c is not ASCII (to catch programming errors early).
- any_char and measure throughput.