472 Fundamental

String Slices

Functional Programming

Tutorial

The Problem

In many languages, str[2] gives you the third character. In Rust, string slices are byte ranges, and indexing a str with a single integer does not compile at all. UTF-8 encodes each non-ASCII character in 2–4 bytes, so slicing at an arbitrary byte offset can land inside a multi-byte character. Direct range indexing (&s[a..b]) panics at runtime in that case, while str::get returns Option<&str>: None if the range is out of bounds or either end is not on a char boundary. Correct Unicode handling therefore means working with characters (chars(), char_indices()), not raw byte offsets.
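The difference between the panicking and the non-panicking path can be sketched as follows. This is a minimal illustration using only standard-library calls (str::get and str::is_char_boundary); the helper names safe_prefix and truncate_to_boundary are hypothetical and not part of the original example.

```rust
/// Returns the first `n` bytes of `s`, or `None` if `n` is out of bounds
/// or lands inside a multi-byte character.
/// (`safe_prefix` is an illustrative name, not a standard-library function.)
fn safe_prefix(s: &str, n: usize) -> Option<&str> {
    // `str::get` checks both the bounds and the char boundaries.
    s.get(..n)
}

/// Returns the longest prefix of `s` that is at most `n` bytes long and
/// ends on a char boundary. Unlike `&s[..n]`, this never panics.
/// (`truncate_to_boundary` is likewise a hypothetical helper.)
fn truncate_to_boundary(s: &str, n: usize) -> &str {
    let mut end = n.min(s.len());
    // Step back until we hit a boundary; index 0 is always one,
    // so the loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

With "café" (5 bytes, 'é' occupying bytes 3 and 4), safe_prefix("café", 4) yields None, whereas truncate_to_boundary("café", 4) backs off to "caf". Writing &"café"[..4] directly would panic.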

🎯 Learning Outcomes

• Understand that "café".len() == 5 (bytes) but "café".chars().count() == 4 (chars)

• Use .get(range) for safe slicing that returns None on boundary violations

• Use char_indices() to map character positions to byte offsets

• Distinguish ASCII-safe [byte_range] slicing from multi-byte safe .chars() iteration

• Recognise when byte-level slicing is acceptable (known ASCII or validated boundaries)
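To make the char_indices() outcome concrete, here is one possible way to slice by character position instead of byte offset. It is a sketch under the assumption that "character" means a Unicode scalar value (a Rust char), not a grapheme cluster; char_slice is a hypothetical helper name.

```rust
/// Slices `s` by character positions `[start, end)` rather than byte offsets.
/// Returns `None` if the range is reversed or extends past the last character.
/// (`char_slice` is an illustrative helper, not a standard-library function.)
fn char_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset at which each char starts, followed by `s.len()` so that
    // the position one past the last char is also a valid boundary.
    let mut offsets = s
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len()));

    // `nth` consumes the iterator, so the second call counts from `start`.
    let byte_start = offsets.nth(start)?;
    let byte_end = if end == start {
        byte_start
    } else {
        offsets.nth(end - start - 1)?
    };
    // Both offsets came from `char_indices` (or `len`), so this cannot panic.
    Some(&s[byte_start..byte_end])
}
```

Each call walks the string from the beginning, so the cost is linear in the number of characters skipped. For known-ASCII input, plain byte slicing remains the cheaper choice.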

Code Example

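#![allow(clippy::all)]
// 472. String slices and byte boundaries

#[cfg(test)]
mod tests {
    // Byte-range slicing is safe when the text is known to be ASCII.
    #[test]
    fn test_ascii() {
        assert_eq!(&"hello"[0..3], "hel");
    }
    // `get` returns None instead of panicking on a bad range.
    #[test]
    fn test_safe_get() {
        assert_eq!("hello".get(1..4), Some("ell"));
        assert_eq!("hello".get(0..99), None); // out of bounds
        assert_eq!("café".get(0..4), None); // byte 4 is inside 'é'
    }
    // `len` counts bytes; `chars().count()` counts Unicode scalar values.
    #[test]
    fn test_utf8() {
        assert_eq!("café".len(), 5);
        assert_eq!("café".chars().count(), 4);
    }
    // `char_indices` pairs each char with the byte offset where it starts.
    #[test]
    fn test_char_idx() {
        let v: Vec<_> = "abc".char_indices().collect();
        assert_eq!(v, vec![(0, 'a'), (1, 'b'), (2, 'c')]);
    }
}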

(* 472. String slices – OCaml *)
let () =
  let s = "Hello, World!" in
  Printf.printf "sub: %s\n" (String.sub s 7 5);
  let pos = String.index s ',' in
  Printf.printf "before comma: %s\n" (String.sub s 0 pos);
  let safe_sub s p l =
    if p>=0 && l>=0 && p+l<=String.length s then Some(String.sub s p l) else None
  in
  Printf.printf "safe: %s\n" (match safe_sub s 0 5 with Some v->v | None->"None");
  (* UTF-8: String.length counts bytes *)
  let cafe = "caf\xc3\xa9" in  (* café *)
  Printf.printf "byte_len=%d\n" (String.length cafe)

Key Differences

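Panic vs. None: Rust's &s[range] panics when the range is out of bounds or does not fall on char boundaries; s.get(range) returns an Option instead. OCaml's String.sub raises Invalid_argument on out-of-bounds but performs no UTF-8 boundary check, so it can silently split a character.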

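Byte vs. character length: Rust's len() and OCaml's String.length both count bytes; counting characters takes chars().count() in Rust and explicit UTF-8 decoding in OCaml (String.get_utf_8_uchar in the standard library since 4.14, or a library such as Uutf).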

char_indices: Rust provides char_indices() as a standard iterator; OCaml requires Uutf.String.fold_utf_8 or manual UTF-8 decoding.

Safety by default: Rust's type system distinguishes char (a Unicode scalar value, 4 bytes) from u8 (a byte); OCaml's char is a single byte, so treating it as a character silently gives wrong results for non-ASCII text.
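The char versus u8 distinction can be checked directly. The sketch below uses only std::mem::size_of and char::len_utf8; the function names utf8_widths and storage_sizes are hypothetical and exist only for this illustration.

```rust
use std::mem::size_of;

/// Pairs each char of `s` with the number of bytes its UTF-8 encoding uses.
/// (`utf8_widths` is an illustrative helper name.)
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

/// Returns (size of `s` as UTF-8 bytes, size of the same text if it were
/// stored as a sequence of 4-byte `char` values).
/// (`storage_sizes` is an illustrative helper name.)
fn storage_sizes(s: &str) -> (usize, usize) {
    let utf8_bytes = s.len();
    let as_chars = s.chars().count() * size_of::<char>();
    (utf8_bytes, as_chars)
}
```

A char always occupies 4 bytes in memory, but inside a str the same character takes between 1 and 4 bytes. That variable width is why byte offsets and character positions diverge as soon as the text leaves ASCII.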

OCaml Approach

OCaml's standard string is a byte string: String.length "café" returns 5, matching Rust's .len(). Character-level operations need explicit UTF-8 decoding, commonly through the Uutf or Camomile library:

(* Byte-level slicing *)
let sub = String.sub "hello" 1 3  (* "ell" *)

(* Character count via Uutf *)
let char_count s =
  Uutf.String.fold_utf_8 (fun acc _ _ -> acc + 1) 0 s

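OCaml's standard library has no Unicode-aware slicing or iteration comparable to Rust's chars(). Since OCaml 4.14 it does provide low-level UTF-8 decoding and validation (String.get_utf_8_uchar, String.is_valid_utf_8), but higher-level handling such as normalization or grapheme segmentation still requires an external package.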

Exercises

Safe nth char: Implement nth_char(s: &str, n: usize) -> Option<char> using chars().nth(n) and benchmark it against a byte-indexed approach on ASCII-only input.

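Char boundary validator: Write is_char_boundary_range(s: &str, start: usize, end: usize) -> bool without using str::get: check start <= end && s.is_char_boundary(start) && s.is_char_boundary(end). Note that is_char_boundary already returns false for any index greater than s.len().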

Grapheme clusters: Use the unicode-segmentation crate to split "e\u{0301}" (e + combining accent) correctly and compare the grapheme count to .chars().count().
