String Truncation
The Problem
Naively truncating with &s[..N] panics if N falls in the middle of a multi-byte character (e.g., slicing "café" at byte 4 lands inside é, which occupies bytes 3 and 4). Database column limits, UI text truncation, and API field limits are measured in bytes or in characters, and the two counts differ for any non-ASCII text. A correct truncation implementation must: (1) find the nearest valid char boundary for byte-limited truncation, and (2) find the byte position of the Nth character for character-limited truncation.
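The byte/character mismatch can be observed without triggering the panic. The sketch below is illustrative only: try_prefix and byte_and_char_len are hypothetical helper names, not part of the tutorial's code. It relies on str::get, the standard non-panicking counterpart of slice indexing.

```rust
/// Hypothetical helper: returns the prefix of `s` ending at byte `n`, or
/// `None` when `n` is past the end or would split a multi-byte character.
fn try_prefix(s: &str, n: usize) -> Option<&str> {
    // str::get checks bounds and char boundaries instead of panicking.
    s.get(..n)
}

/// Hypothetical helper: (length in bytes, length in chars).
/// The two numbers differ as soon as the text contains non-ASCII characters.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For "café", byte_and_char_len reports 5 bytes but 4 chars, and try_prefix(s, 4) yields None where &s[..4] would panic.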
🎯 Learning Outcomes
- Find a safe byte boundary with is_char_boundary
- Use char_indices().nth(max_chars) for the byte position of the Nth character
- Append … (U+2026, 3 bytes in UTF-8) when truncating display strings
- Use saturating_sub to avoid underflow when reserving space for the ellipsis
Code Example
#![allow(clippy::all)]
// 498. Safe Unicode truncation

/// Byte-limited truncation: returns the longest prefix of `s` that fits in
/// `max_bytes` and ends on a char boundary. Borrows from `s`, no allocation.
fn truncate_bytes(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // str::floor_char_boundary does this in one call, but it was still
    // nightly-only in Rust 1.72 and was stabilized much later.
    // For compatibility, implement it manually:
    let mut end = max_bytes;
    while end > 0 && !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Character-limited truncation: returns the first `max_chars` chars of `s`.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    // The byte offset of char number `max_chars` is where the prefix ends.
    match s.char_indices().nth(max_chars) {
        Some((byte_pos, _)) => &s[..byte_pos],
        None => s, // shorter than max_chars
    }
}

/// Truncates to `max_chars` chars, the last of which is an ellipsis.
fn truncate_with_ellipsis(s: &str, max_chars: usize) -> String {
    let char_count = s.chars().count();
    if char_count <= max_chars {
        return s.to_string();
    }
    // Reserve one char position for '…'; saturating_sub avoids underflow at 0.
    let truncated = truncate_chars(s, max_chars.saturating_sub(1));
    format!("{}…", truncated)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_truncate_bytes() {
        assert_eq!(truncate_bytes("hello", 3), "hel");
        assert_eq!(truncate_bytes("café", 3), "caf");
        // Byte 4 is inside 'é', so the cut walks back to byte 3.
        assert_eq!(truncate_bytes("café", 4), "caf");
    }

    #[test]
    fn test_truncate_chars() {
        assert_eq!(truncate_chars("café", 3), "caf");
        assert_eq!(truncate_chars("hello", 10), "hello");
    }

    #[test]
    fn test_ellipsis() {
        assert_eq!(truncate_with_ellipsis("hello world", 8), "hello w…");
        assert_eq!(truncate_with_ellipsis("hi", 10), "hi");
    }

    #[test]
    fn test_emoji() {
        let s = "🌍🌎🌏";
        assert_eq!(truncate_chars(s, 2).chars().count(), 2);
    }
}
Key Differences
- is_char_boundary: Rust provides this as a standard str method; OCaml's standard library has no direct equivalent, so the check comes from a library such as Uutf or is written by hand.
- Zero-copy slices: truncate_bytes and truncate_chars return &str pointing into the original, with no allocation; OCaml's String.sub always allocates.
- char_indices().nth(n): Rust's O(N) character indexing via char_indices is explicit about its cost; OCaml's fold is equally O(N) but less idiomatically readable.
- Ellipsis char: '…' is 3 UTF-8 bytes; using saturating_sub(1) reserves one character position, not one byte, which is correct for the char-counting truncate_chars.
OCaml Approach
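(* OCaml's String has no is_char_boundary, so define one: an offset is a
   boundary unless it points at a UTF-8 continuation byte (10xxxxxx).
   Assumes s is valid UTF-8. *)
let is_char_boundary s i =
  i = 0 || i = String.length s
  || (i < String.length s && Char.code s.[i] land 0xC0 <> 0x80)

let truncate_bytes s max_bytes =
  if String.length s <= max_bytes then s
  else
    (* Walk back to the nearest UTF-8 boundary *)
    let i = ref max_bytes in
    while !i > 0 && not (is_char_boundary s !i) do decr i done;
    String.sub s 0 !i

let truncate_chars s max_chars =
  (* cut = byte offset of the character at index max_chars, if there is one *)
  let count = ref 0 and cut = ref None in
  Uutf.String.fold_utf_8 (fun () pos _ ->
    if !count = max_chars then cut := Some pos;
    incr count) () s;
  match !cut with
  | Some pos -> String.sub s 0 pos
  | None -> s (* shorter than max_chars *)
OCaml's standard String module has no is_char_boundary, so the boundary check is written by hand here, and Uutf's fold_utf_8 does the decoding for character counting. Unlike Rust's borrowed slices, String.sub allocates a new string.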
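In contrast to String.sub, the Rust functions return a view into the caller's buffer. A minimal sketch of how that can be checked follows; truncate_chars is repeated so the block compiles on its own, and shares_storage is a hypothetical helper introduced only for this illustration.

```rust
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_pos, _)) => &s[..byte_pos],
        None => s, // shorter than max_chars
    }
}

/// Hypothetical helper: true when `prefix` starts at the same address as
/// `original`, i.e. it is a borrowed view rather than a fresh copy.
fn shares_storage(original: &str, prefix: &str) -> bool {
    std::ptr::eq(original.as_ptr(), prefix.as_ptr())
}
```

Calling shares_storage(&s, truncate_chars(&s, 2)) returns true, while comparing s against an owned copy such as truncate_chars(&s, 2).to_string() returns false, because the copy lives in its own allocation.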
Exercises
- floor_char_boundary: str::floor_char_boundary(n) is available on recent toolchains (it was still unstable in Rust 1.72); rewrite truncate_bytes to use it and keep the manual loop as a #[cfg]-gated fallback for older Rust versions.
- Use the unicode-width crate to truncate based on terminal column width (CJK characters are 2 columns wide).
- Write truncate_sentence(s: &str, max_chars: usize) -> String that truncates at the last sentence boundary (./!/?) before max_chars rather than mid-word.
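As a possible starting point for the truncate_sentence exercise, here is a rough sketch under stated assumptions: only the ASCII characters '.', '!' and '?' count as sentence terminators, and when the window contains none of them the function falls back to a plain character cut. Both choices are assumptions of this sketch, not requirements stated by the exercise.

```rust
/// Sketch: keeps at most `max_chars` chars of `s`, cutting right after the
/// last '.', '!' or '?' found inside that window.
fn truncate_sentence(s: &str, max_chars: usize) -> String {
    // Byte offset where the char window ends; short inputs are returned whole.
    let window_end = match s.char_indices().nth(max_chars) {
        Some((byte_pos, _)) => byte_pos,
        None => return s.to_string(),
    };
    let window = &s[..window_end];
    // rfind returns the byte offset of the last terminator. All three
    // terminators are 1 byte long, so `pos + 1` is a valid char boundary.
    match window.rfind(|c: char| matches!(c, '.' | '!' | '?')) {
        Some(pos) => window[..pos + 1].to_string(),
        None => window.to_string(), // assumption: fall back to a char cut
    }
}
```

With this sketch, truncate_sentence("Hi there. How are you? Fine.", 15) keeps "Hi there." instead of stopping in the middle of the second sentence.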