472 Fundamental

String Slices

Functional Programming

Tutorial

The Problem

In many languages, str[2] gives you the third character. In Rust, a string slice is indexed by byte range, not by character position. UTF-8 encodes each non-ASCII character in 2–4 bytes, so a range that starts or ends at an arbitrary byte offset can land inside a multi-byte character. Direct indexing (&s[range]) panics at runtime in that case, while str::get returns Option<&str>: None if the range is out of bounds or does not fall on char boundaries. Working with characters rather than bytes means iterating with chars() or char_indices().
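
A minimal sketch of the difference, assuming a hypothetical helper named safe_prefix (not part of the original example). The 'é' is written as the escape \u{e9} so that its two-byte encoding is unambiguous:

```rust
/// Hypothetical helper: the first `end` bytes of `s`, or None if `end`
/// is out of bounds or falls inside a multi-byte character.
fn safe_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    let s = "caf\u{e9}"; // "café": 'é' occupies byte offsets 3 and 4
    assert_eq!(s.len(), 5);

    assert_eq!(safe_prefix(s, 3), Some("caf"));
    assert_eq!(safe_prefix(s, 4), None); // offset 4 is inside 'é'
    assert_eq!(safe_prefix(s, 5), Some(s));

    // `is_char_boundary` reports whether an offset is a valid cut point.
    assert!(s.is_char_boundary(3));
    assert!(!s.is_char_boundary(4));

    // By contrast, `&s[..4]` would panic at runtime for the same reason.
}
```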

🎯 Learning Outcomes

  • Understand that "café".len() == 5 (bytes) but "café".chars().count() == 4 (chars)
  • Use .get(range) for safe slicing that returns None on boundary violations
  • Use char_indices() to map character positions to byte offsets
  • Distinguish ASCII-safe [byte_range] slicing from multi-byte safe .chars() iteration
  • Recognise when byte-level slicing is acceptable (known ASCII or validated boundaries)
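
One way to map character positions to byte offsets with char_indices() is sketched below. The helper char_slice is a hypothetical name introduced for illustration, and it walks the string from the start on every call, so it is linear in the length of the input:

```rust
/// Hypothetical helper: slice `s` by character positions `start..end`
/// (half-open), returning None if either position is past the end.
fn char_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of the n-th character; `s.len()` is appended so that
    // n == number of chars maps to the end of the string.
    let offset = |n: usize| {
        s.char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(s.len()))
            .nth(n)
    };
    let lo = offset(start)?;
    let hi = offset(end)?;
    // Both offsets come from char_indices (or s.len()), so they are
    // always valid char boundaries.
    s.get(lo..hi)
}

fn main() {
    let s = "caf\u{e9}!"; // "café!": 5 chars, 6 bytes
    assert_eq!(char_slice(s, 3, 5), Some("\u{e9}!"));
    assert_eq!(char_slice(s, 0, 3), Some("caf"));
    assert_eq!(char_slice(s, 2, 9), None); // past the last character
}
```
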

    Code Example

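    #![allow(clippy::all)]
    // 472. String slices and byte boundaries
    
    #[cfg(test)]
    mod tests {
        // Byte-range indexing is safe when the text is known to be ASCII.
        #[test]
        fn test_ascii() {
            assert_eq!(&"hello"[0..3], "hel");
        }
        // `get` returns None instead of panicking on an invalid range.
        #[test]
        fn test_safe_get() {
            assert_eq!("hello".get(1..4), Some("ell"));
            assert_eq!("hello".get(0..99), None); // out of bounds
            assert_eq!("café".get(0..4), None); // byte 4 is inside 'é'
        }
        // `len` counts bytes; `chars().count()` counts Unicode scalar values.
        #[test]
        fn test_utf8() {
            assert_eq!("café".len(), 5);
            assert_eq!("café".chars().count(), 4);
        }
        // `char_indices` yields (byte offset, char) pairs.
        #[test]
        fn test_char_idx() {
            let v: Vec<_> = "abc".char_indices().collect();
            assert_eq!(v, vec![(0, 'a'), (1, 'b'), (2, 'c')]);
        }
    }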

    Key Differences

  • Panic vs. None: Rust's &s[range] panics when the range is out of bounds or does not fall on char boundaries; s.get(range) returns an Option instead. OCaml's String.sub raises Invalid_argument on out-of-bounds.
  • Byte vs. character length: Both Rust's len and OCaml's String.length count bytes; character counting uses chars().count() in Rust and a library or manual decoding in OCaml.
  • char_indices: Rust provides char_indices() as a standard iterator; OCaml requires Uutf.String.fold_utf_8 or manual UTF-8 decoding.
  • Safety by default: Rust's type system distinguishes char (a Unicode scalar value, 4 bytes) from u8 (a byte); OCaml's char is a single byte, which silently gives wrong results for non-ASCII text.
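
The Rust side of these differences can be checked directly. The sketch below is illustrative only; byte_and_char_len is a hypothetical helper name, not something from the original example:

```rust
use std::mem::size_of;

/// Hypothetical helper: (length in bytes, length in Unicode scalar values).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // `char` is always 4 bytes in memory; `u8` is a single byte.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(size_of::<u8>(), 1);

    // Inside a string, a char is stored as 1 to 4 UTF-8 bytes.
    assert_eq!('a'.len_utf8(), 1);
    assert_eq!('\u{e9}'.len_utf8(), 2); // 'é'

    // "café": 5 bytes, 4 chars.
    assert_eq!(byte_and_char_len("caf\u{e9}"), (5, 4));

    // Iterating bytes exposes the raw encoding of 'é'.
    let bytes: Vec<u8> = "\u{e9}".bytes().collect();
    assert_eq!(bytes, vec![0xC3, 0xA9]);
}
```
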

    OCaml Approach

    OCaml's standard string is a byte string — String.length "café" returns 5, matching Rust's .len(). Character-level operations require the Uutf or Camomile library:

    (* Byte-level slicing *)
    let sub = String.sub "hello" 1 3  (* "ell" *)
    
    (* Character count via Uutf *)
    let char_count s =
      Uutf.String.fold_utf_8 (fun acc _ _ -> acc + 1) 0 s
    

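    OCaml's standard library treats strings as byte sequences. Recent versions add low-level UTF-8 decoding (for example String.get_utf_8_uchar, available since OCaml 4.14), but there is no built-in character iterator comparable to chars(), so higher-level Unicode handling usually relies on an external package.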


    Exercises

  • Safe nth char: Implement nth_char(s: &str, n: usize) -> Option<char> using chars().nth(n) and benchmark it against a byte-indexed approach on ASCII-only input.
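  • Char boundary validator: Write is_char_boundary_range(s: &str, start: usize, end: usize) -> bool without using str::get — check start <= end && s.is_char_boundary(start) && s.is_char_boundary(end). The start <= end check matters because str::get also returns None for a reversed range.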
  • Grapheme clusters: Use the unicode-segmentation crate to split "e\u{0301}" (e + combining accent) correctly and compare the grapheme count to .chars().count().
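
As a starting point for the first two exercises, here is a rough sketch rather than a reference solution. It uses the function names from the exercise text and cross-checks the validator against str::get in main; the benchmark and the grapheme exercise are left out:

```rust
/// n-th Unicode scalar value of `s`; walks the string from the start.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// True when `start..end` is a range that `s.get(start..end)` would accept.
fn is_char_boundary_range(s: &str, start: usize, end: usize) -> bool {
    // `is_char_boundary` is false past the end of the string, so the
    // bounds check comes for free.
    start <= end && s.is_char_boundary(start) && s.is_char_boundary(end)
}

fn main() {
    let s = "caf\u{e9}"; // "café"
    assert_eq!(nth_char(s, 3), Some('\u{e9}'));
    assert_eq!(nth_char(s, 4), None);

    // Cross-check against `get` for every pair of offsets, including
    // one offset past the end of the string.
    for start in 0..=s.len() + 1 {
        for end in 0..=s.len() + 1 {
            assert_eq!(
                is_char_boundary_range(s, start, end),
                s.get(start..end).is_some()
            );
        }
    }
}
```
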

    Open Source Repos