String Unicode
Tutorial
The Problem
Unicode defines multiple ways to represent the same visual character: é can be a single precomposed codepoint (U+00E9) or a base letter e (U+0065) followed by a combining accent (U+0301). These two sequences look identical but compare unequal as byte strings. Web forms, databases, and search engines must normalise Unicode before comparison. Emoji occupy 4 bytes in UTF-8 (U+1F600 = \u{1F600}) — naive len() returns 4, not 1. Correct Unicode handling requires understanding: NFC/NFD normalisation, grapheme clusters, and the difference between bytes, codepoints, and user-perceived characters.
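The byte/codepoint mismatch described above can be verified with the standard library alone. A minimal sketch, using no external crates (the byte counts follow from the UTF-8 encoding rules):

```rust
fn main() {
    let precomposed = "caf\u{00E9}";        // é as one codepoint (NFC form)
    let decomposed = "caf\u{0065}\u{0301}"; // e + combining acute (NFD form)
    assert_ne!(precomposed, decomposed);    // byte-wise comparison fails
    assert_eq!(precomposed.len(), 5);       // "caf" (3 bytes) + é (2 bytes)
    assert_eq!(decomposed.len(), 6);        // "cafe" (4 bytes) + U+0301 (2 bytes)
    assert_eq!(precomposed.chars().count(), 4); // codepoints, not bytes
    assert_eq!(decomposed.chars().count(), 5);
    println!("ok");
}
```

Normalising both strings to the same form (for example with the unicode-normalization crate's nfc() iterator) makes them compare equal.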
🎯 Learning Outcomes
- eq_ignore_ascii_case for case-insensitive ASCII comparison without allocation
- str::is_ascii() for testing whether a string is pure ASCII
- The unicode-normalization crate is needed for correct Unicode comparison

Code Example
#![allow(clippy::all)]
// 482. Unicode normalization and graphemes
#[cfg(test)]
mod tests {
    #[test]
    fn test_nfc_nfd() {
        assert_ne!("caf\u{00E9}", "caf\u{0065}\u{0301}");
    }
    #[test]
    fn test_ascii_eq() {
        assert!("hello".eq_ignore_ascii_case("HELLO"));
    }
    #[test]
    fn test_is_ascii() {
        assert!("hello".is_ascii());
        assert!(!"café".is_ascii());
    }
    #[test]
    fn test_emoji() {
        let e = "\u{1F600}";
        assert_eq!(e.len(), 4);
        assert_eq!(e.chars().count(), 1);
    }
}

Key Differences
- char::is_alphabetic() uses the Unicode Alphabetic property; OCaml's Char.is_alpha is ASCII-only.
- Normalisation: Rust needs the unicode-normalization crate; OCaml delegates to uunf. Neither language includes it in the standard library.
- eq_ignore_ascii_case: Rust has this in the standard library; OCaml needs String.lowercase_ascii + compare.
- Counting: for the emoji above, Rust's .chars().count() and OCaml's Uutf both yield 1 codepoint; grapheme cluster counting requires unicode-segmentation and Uuseg respectively.

OCaml Approach
OCaml's standard library has no Unicode normalisation. String.equal is byte equality. Case-insensitive comparison requires String.lowercase_ascii (ASCII-only) or Uucp.Case.fold (full Unicode):
String.equal
  (String.lowercase_ascii "Hello")
  (String.lowercase_ascii "HELLO") (* true *)

(* Unicode normalisation via uunf + uutf *)
let nfc s =
  let buf = Buffer.create (String.length s) in
  let norm = Uunf.create `NFC in
  (* drain the normaliser's pending output into buf *)
  let rec add v = match Uunf.add norm v with
    | `Uchar u -> Uutf.Buffer.add_utf_8 buf u; add `Await
    | `Await | `End -> ()
  in
  (* feed codepoints decoded from s, then signal end of input *)
  Uutf.String.fold_utf_8
    (fun () _ -> function
      | `Uchar _ as u -> add u
      | `Malformed _ -> add (`Uchar Uutf.u_rep)) () s;
  add `End;
  Buffer.contents buf
Full Source
#![allow(clippy::all)]
// 482. Unicode normalization and graphemes
#[cfg(test)]
mod tests {
    #[test]
    fn test_nfc_nfd() {
        assert_ne!("caf\u{00E9}", "caf\u{0065}\u{0301}");
    }
    #[test]
    fn test_ascii_eq() {
        assert!("hello".eq_ignore_ascii_case("HELLO"));
    }
    #[test]
    fn test_is_ascii() {
        assert!("hello".is_ascii());
        assert!(!"café".is_ascii());
    }
    #[test]
    fn test_emoji() {
        let e = "\u{1F600}";
        assert_eq!(e.len(), 4);
        assert_eq!(e.chars().count(), 1);
    }
}
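The codepoint-versus-grapheme gap from the Key Differences list can also be observed without any crates. A small std-only sketch (the flag emoji is chosen here purely for illustration):

```rust
fn main() {
    // char::is_alphabetic() uses the Unicode Alphabetic property, not ASCII:
    assert!('é'.is_alphabetic());
    assert!(!'é'.is_ascii_alphabetic());

    // One user-perceived character, two codepoints, eight UTF-8 bytes:
    let flag = "\u{1F1EB}\u{1F1F7}"; // 🇫🇷 = regional indicators F + R
    assert_eq!(flag.len(), 8);           // each codepoint is 4 bytes in UTF-8
    assert_eq!(flag.chars().count(), 2); // codepoints, not grapheme clusters
    println!("ok");
}
```

Counting the flag as a single user-perceived character requires grapheme segmentation, e.g. unicode-segmentation's graphemes(true) iterator.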
Exercises
1. Write unicode_eq(a: &str, b: &str) -> bool that normalises both strings to NFC (using unicode-normalization) before comparing.
2. Write count_emoji(s: &str) -> usize that counts characters with Unicode category So (Other Symbol), filtering out ASCII with char::is_ascii() and classifying the rest with the unicode-properties crate.
3. Use the caseless crate to implement case_fold_eq(a: &str, b: &str) -> bool that handles the Turkish dotless-i and other Unicode case-folding edge cases.
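For the case-folding exercise, it helps to see why plain lowercasing is not enough. A std-only sketch of the gap that full case folding (e.g. via the caseless crate) is meant to close:

```rust
fn main() {
    // Full Unicode lowercasing handles many scripts:
    assert_eq!("ÉCOLE".to_lowercase(), "école");
    // ...but it is not case folding: ß uppercases to "SS",
    assert_eq!("ß".to_uppercase(), "SS");
    // yet naive lowercase comparison misses the ß / "ss" equivalence:
    assert_ne!("ß".to_lowercase(), "ss".to_lowercase());
    println!("ok");
}
```

Unicode case folding maps ß and "ss" to the same folded form, which is why case-insensitive equality needs folding rather than lowercasing.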