String Encoding
Functional Programming
Tutorial
The Problem
Software systems communicate using standardised text encodings: HTTP headers are Latin-1 or UTF-8, XML files may start with a byte-order mark (BOM, U+FEFF), JSON must be UTF-8, and legacy databases often use Windows-1252. Rust's strings are always UTF-8 internally, but interfacing with the outside world requires encoding knowledge: how many bytes does a character occupy, how do I detect a BOM, how do I validate arbitrary bytes as UTF-8 before accepting them as &str?
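Each of those three questions maps to a short call in Rust's standard library. A minimal sketch (the payload and byte values here are illustrative):

```rust
fn main() {
    // How many bytes does a character occupy in UTF-8?
    assert_eq!('é'.len_utf8(), 2);
    assert_eq!('🎯'.len_utf8(), 4);

    // How do I detect (and strip) a BOM?
    let input = "\u{FEFF}payload";
    let without_bom = input.strip_prefix('\u{FEFF}').unwrap_or(input);
    assert_eq!(without_bom, "payload");

    // How do I validate arbitrary bytes as UTF-8 before accepting them as &str?
    let bytes = vec![0x68, 0x69]; // the ASCII bytes for "hi"
    match std::str::from_utf8(&bytes) {
        Ok(s) => assert_eq!(s, "hi"),
        Err(e) => panic!("invalid UTF-8: {e}"),
    }
    println!("ok");
}
```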
🎯 Learning Outcomes
- Encoding a char to its UTF-8 byte representation with encode_utf8(&mut buf)
- Querying a character's encoded length with char::len_utf8()
- Validating raw bytes with std::str::from_utf8, which returns Result<&str, Utf8Error>
- Stripping a byte-order mark with strip_prefix('\u{FEFF}')

Code Example
#![allow(clippy::all)]
// 483. UTF-8 encoding patterns
#[cfg(test)]
mod tests {
    #[test]
    fn test_encode() {
        let mut b = [0u8; 4];
        assert_eq!('A'.encode_utf8(&mut b), "A");
        assert_eq!('é'.len_utf8(), 2);
    }

    #[test]
    fn test_validate() {
        assert!(std::str::from_utf8(&[104, 105]).is_ok());
        assert!(std::str::from_utf8(&[0xFF]).is_err());
    }

    #[test]
    fn test_bom() {
        let s = "\u{FEFF}hi";
        assert_eq!(s.strip_prefix('\u{FEFF}'), Some("hi"));
    }
}

Key Differences
- Zero-copy validation: str::from_utf8 validates and returns a &str pointing into the original bytes; OCaml's equivalent requires external crates and always decodes.
- encode_utf8 to stack buffer: Rust encodes a char into a stack-allocated [u8; 4]; OCaml's Uutf.Buffer.add_utf_8 writes to a heap Buffer.
- BOM stripping: strip_prefix('\u{FEFF}') handles the BOM as a normal char; OCaml needs manual byte-prefix matching.
- len_utf8: Rust provides char::len_utf8() as an O(1) query; OCaml has no equivalent, so you must encode and measure.

OCaml Approach
OCaml encodes/decodes UTF-8 via the Uutf library:
(* Encode a Unicode codepoint to UTF-8 bytes *)
let encode_utf8 uchar =
  let buf = Buffer.create 4 in
  Uutf.Buffer.add_utf_8 buf uchar;
  Buffer.to_bytes buf

(* Validate UTF-8: fold over the string, failing on any `Malformed chunk.
   Note `Malformed carries the offending bytes, so it must be matched
   with a wildcard payload rather than compared with (<>). *)
let is_valid_utf8 s =
  Uutf.String.fold_utf_8
    (fun ok _ d -> match d with `Malformed _ -> false | `Uchar _ -> ok)
    true s
OCaml has no BOM-stripping in the standard library; a manual if String.length s >= 3 && String.sub s 0 3 = "\xef\xbb\xbf" then String.sub s 3 ... check is typical.
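The zero-copy claim in the Key Differences above can be observed directly: the &str returned by from_utf8 borrows the very same allocation as the input bytes. A small sketch checking pointer identity:

```rust
fn main() {
    let bytes: Vec<u8> = b"zero-copy".to_vec();
    // Validation succeeds without decoding or copying anything.
    let s: &str = std::str::from_utf8(&bytes).unwrap();
    // Same starting address and length: the &str is a view into `bytes`.
    assert_eq!(s.as_ptr(), bytes.as_ptr());
    assert_eq!(s.len(), bytes.len());
    println!("ok");
}
```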
Full Source
#![allow(clippy::all)]
// 483. UTF-8 encoding patterns
#[cfg(test)]
mod tests {
    #[test]
    fn test_encode() {
        let mut b = [0u8; 4];
        assert_eq!('A'.encode_utf8(&mut b), "A");
        assert_eq!('é'.len_utf8(), 2);
    }

    #[test]
    fn test_validate() {
        assert!(std::str::from_utf8(&[104, 105]).is_ok());
        assert!(std::str::from_utf8(&[0xFF]).is_err());
    }

    #[test]
    fn test_bom() {
        let s = "\u{FEFF}hi";
        assert_eq!(s.strip_prefix('\u{FEFF}'), Some("hi"));
    }
}
✓ Tests
Rust test suite
#[cfg(test)]
mod tests {
    #[test]
    fn test_encode() {
        let mut b = [0u8; 4];
        assert_eq!('A'.encode_utf8(&mut b), "A");
        assert_eq!('é'.len_utf8(), 2);
    }

    #[test]
    fn test_validate() {
        assert!(std::str::from_utf8(&[104, 105]).is_ok());
        assert!(std::str::from_utf8(&[0xFF]).is_err());
    }

    #[test]
    fn test_bom() {
        let s = "\u{FEFF}hi";
        assert_eq!(s.strip_prefix('\u{FEFF}'), Some("hi"));
    }
}
Exercises
1. Implement a streaming Utf8Validator that accepts bytes one at a time and returns Valid, Invalid, or Incomplete (for a multibyte sequence split across buffers).
2. Write read_text_file(path: &Path) -> Result<String> that reads raw bytes, detects a UTF-8/UTF-16 BOM, and returns a normalised UTF-8 string (transcode UTF-16 using the encoding_rs crate).
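A possible starting point for the first exercise. This sketch classifies a whole buffer rather than accepting one byte at a time (the buffering strategy is left as part of the exercise), and the Valid/Invalid/Incomplete names come from the prompt. The key tool is Utf8Error::error_len(): None means from_utf8 ran out of input in the middle of a multibyte sequence, so more bytes could still complete it, while Some(_) is an unrecoverable error.

```rust
#[derive(Debug, PartialEq)]
enum Utf8State {
    Valid,
    Invalid,
    Incomplete, // a multibyte sequence is cut off at the end of the buffer
}

fn classify(bytes: &[u8]) -> Utf8State {
    match std::str::from_utf8(bytes) {
        Ok(_) => Utf8State::Valid,
        Err(e) => match e.error_len() {
            None => Utf8State::Incomplete, // truncated sequence: wait for more bytes
            Some(_) => Utf8State::Invalid, // bytes that can never start valid UTF-8
        },
    }
}

fn main() {
    assert_eq!(classify(b"hi"), Utf8State::Valid);
    assert_eq!(classify(&[0xC3]), Utf8State::Incomplete); // first byte of 'é'
    assert_eq!(classify(&[0xC3, 0xA9]), Utf8State::Valid); // both bytes of 'é'
    assert_eq!(classify(&[0xFF]), Utf8State::Invalid);
    println!("ok");
}
```

A true streaming validator would carry the trailing incomplete bytes over into the next call, using Utf8Error::valid_up_to() to find where the pending sequence begins.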