482 Fundamental

String Unicode

Functional Programming

Tutorial

The Problem

Unicode defines multiple ways to represent the same visual character: é can be a single precomposed codepoint (U+00E9) or a base letter e (U+0065) followed by a combining acute accent (U+0301). The two sequences render identically but compare unequal byte for byte, so web forms, databases, and search engines must normalise Unicode before comparing. Length is a second trap: the emoji U+1F600 (written "\u{1F600}" in Rust) occupies 4 bytes in UTF-8, so str::len() returns 4, not 1. Correct Unicode handling requires understanding NFC/NFD normalisation, grapheme clusters, and the difference between bytes, codepoints, and user-perceived characters.
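
The gap between bytes and codepoints can be checked with the standard library alone. The sketch below is illustrative only: sizes is a hypothetical helper name, and it deliberately stops short of grapheme clusters, which need an external crate.

```rust
/// Returns (UTF-8 byte length, number of codepoints) for a string.
/// `sizes` is a hypothetical helper used only for this illustration.
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{00E9}"; // é as a single codepoint
    let decomposed = "\u{0065}\u{0301}"; // e followed by a combining acute accent
    let emoji = "\u{1F600}";

    println!("{:?}", sizes(precomposed)); // (2, 1)
    println!("{:?}", sizes(decomposed)); // (3, 2)
    println!("{:?}", sizes(emoji)); // (4, 1)

    // Same rendering, different bytes: plain == compares the UTF-8 bytes.
    println!("{}", precomposed == decomposed); // false
}
```

Neither count matches what a reader sees for the decomposed form: it is 3 bytes and 2 codepoints, yet one user-perceived character.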

🎯 Learning Outcomes

  • Understand that NFC and NFD representations of the same character compare unequal
  • Use eq_ignore_ascii_case for case-insensitive ASCII comparison without allocation
  • Check ASCII-only strings with str::is_ascii()
  • Understand emoji encoding: 4 UTF-8 bytes, 1 char, 1 grapheme cluster
  • Recognise when the unicode-normalization crate is needed for correct comparison

    Code Example

    #![allow(clippy::all)]
    // 482. Unicode normalization and graphemes
    
    #[cfg(test)]
    mod tests {
        #[test]
        fn test_nfc_nfd() {
            assert_ne!("caf\u{00E9}", "caf\u{0065}\u{0301}");
        }
        #[test]
        fn test_ascii_eq() {
            assert!("hello".eq_ignore_ascii_case("HELLO"));
        }
        #[test]
        fn test_is_ascii() {
            assert!("hello".is_ascii());
            assert!(!"café".is_ascii());
        }
        #[test]
        fn test_emoji() {
            let e = "\u{1F600}";
            assert_eq!(e.len(), 4);
            assert_eq!(e.chars().count(), 1);
        }
    }
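
The tests above only exercise ASCII input for eq_ignore_ascii_case. A short sketch of where it stops working may help; the function names ascii_caseless_eq and lowercase_eq are hypothetical, and the to_lowercase variant is a rough approximation, not Unicode case folding or normalisation.

```rust
/// ASCII-only comparison: no allocation, but non-ASCII characters must match exactly.
fn ascii_caseless_eq(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Lowercases both sides with the Unicode-aware `str::to_lowercase`.
/// This allocates two Strings and is still not full case folding.
fn lowercase_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // É (U+00C9) and é (U+00E9) are outside ASCII, so they are never treated as equal here.
    println!("{}", ascii_caseless_eq("CAF\u{00C9}", "caf\u{00E9}")); // false
    println!("{}", lowercase_eq("CAF\u{00C9}", "caf\u{00E9}")); // true

    // Lowercasing leaves ß untouched, so it does not match "ss".
    println!("{}", lowercase_eq("Stra\u{00DF}e", "STRASSE")); // false
}
```

The last line is the reason a dedicated case-folding step is needed for caseless matching across all of Unicode.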

    Key Differences

  • Standard Unicode properties: Rust's char::is_alphabetic() uses the Unicode Alphabetic property; OCaml's standard Char module operates on single bytes and offers no Unicode-aware alphabetic test, so the uucp library (Uucp.Alpha.is_alphabetic) is needed.
  • NFC/NFD in stdlib: Neither standard library includes normalisation; Rust delegates it to the unicode-normalization crate and OCaml to uunf.
  • eq_ignore_ascii_case: Rust has this in the standard library; OCaml needs String.lowercase_ascii on both strings followed by String.equal.
  • Emoji byte count: Both languages store U+1F600 as a 4-byte UTF-8 sequence, and counting codepoints (.chars().count() in Rust, decoding with Uutf in OCaml) yields 1. Counting grapheme clusters requires unicode-segmentation in Rust and Uuseg in OCaml.
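
The Rust side of the first and last points can be sketched with the standard library. alphabetic_counts is a hypothetical helper; the flag example is an added illustration of why codepoint counting is not grapheme counting.

```rust
/// Returns (characters with the Unicode Alphabetic property, ASCII letters).
/// `alphabetic_counts` is a hypothetical helper for this illustration.
fn alphabetic_counts(s: &str) -> (usize, usize) {
    let unicode = s.chars().filter(|c| c.is_alphabetic()).count();
    let ascii = s.chars().filter(|c| c.is_ascii_alphabetic()).count();
    (unicode, ascii)
}

fn main() {
    // é (U+00E9) is alphabetic in Unicode but is not an ASCII letter.
    println!("{:?}", alphabetic_counts("caf\u{00E9} 42")); // (4, 3)

    // A flag is two regional-indicator codepoints shown as one glyph,
    // so chars().count() over-counts user-perceived characters.
    let flag = "\u{1F1EB}\u{1F1F7}";
    println!("{} bytes, {} codepoints", flag.len(), flag.chars().count()); // 8 bytes, 2 codepoints
}
```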

    OCaml Approach

    OCaml's standard library has no Unicode normalisation. String.equal is byte equality. Case-insensitive comparison requires String.lowercase_ascii (ASCII-only) or Uucp.Case.Fold.fold from the uucp library (full Unicode case folding, applied per codepoint):

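    let ascii_eq =
      String.equal
        (String.lowercase_ascii "Hello")
        (String.lowercase_ascii "HELLO")  (* true *)
    
    (* Unicode normalisation via uunf, decoding with uutf *)
    let nfc s =
      let buf = Buffer.create (String.length s * 3) in
      let norm = Uunf.create `NFC in
      (* feed one item to the normaliser, then drain what it emits into buf *)
      let rec add v = match Uunf.add norm v with
        | `Uchar u -> Uutf.Buffer.add_utf_8 buf u; add `Await
        | `Await | `End -> ()
      in
      let add_uchar () _ = function
        | `Malformed _ -> add (`Uchar Uutf.u_rep)  (* invalid bytes become U+FFFD *)
        | `Uchar _ as u -> add u
      in
      Uutf.String.fold_utf_8 add_uchar () s; add `End;
      Buffer.contents buf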
    

    Exercises

  • NFC normalise and compare: Write unicode_eq(a: &str, b: &str) -> bool that normalises both strings to NFC (using unicode-normalization) before comparing.
  • Emoji counter: Write count_emoji(s: &str) -> usize that counts characters in Unicode general category So (Other Symbol): skip ASCII characters with char::is_ascii(), then classify the rest with the unicode-properties crate.
  • Case folding: Use the caseless crate to implement case_fold_eq(a: &str, b: &str) -> bool, and test it against Unicode case-folding edge cases such as the Turkish dotless ı.
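
As a warm-up for the first exercise, the normalise-then-compare shape can be sketched without any crate. This is a toy under loud assumptions: compose_acute, toy_compose, and toy_unicode_eq are hypothetical names, the table covers only five lowercase vowels with U+0301, and it is not a substitute for real NFC, which also handles canonical reordering and the full composition tables.

```rust
/// Toy composition table: base vowel + combining acute accent (U+0301).
/// Covers five pairs only; real NFC needs the unicode-normalization crate.
fn compose_acute(base: char) -> Option<char> {
    match base {
        'a' => Some('\u{00E1}'),
        'e' => Some('\u{00E9}'),
        'i' => Some('\u{00ED}'),
        'o' => Some('\u{00F3}'),
        'u' => Some('\u{00FA}'),
        _ => None,
    }
}

/// Replaces each supported "vowel + U+0301" pair with its precomposed form.
fn toy_compose(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if chars.peek() == Some(&'\u{0301}') {
            if let Some(composed) = compose_acute(c) {
                out.push(composed);
                chars.next(); // consume the combining accent
                continue;
            }
        }
        out.push(c);
    }
    out
}

/// Compares two strings after bringing both to the same (toy) composed form.
fn toy_unicode_eq(a: &str, b: &str) -> bool {
    toy_compose(a) == toy_compose(b)
}

fn main() {
    println!("{}", toy_unicode_eq("caf\u{00E9}", "caf\u{0065}\u{0301}")); // true
}
```

In a real solution the two toy_compose calls would be replaced by the crate's NFC normalisation; the comparison step stays the same.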

    Open Source Repos