String Interning
Tutorial
The Problem
Compilers, parsers, and databases deal with many repeated string values: variable names, SQL column names, log message templates. Without interning, each copy of "username" is a separate heap allocation — wasting memory and requiring character-by-character comparison for equality. Interning maintains a global table of unique strings; the same content always returns the same pointer. This enables pointer equality (ptr_eq) instead of string comparison — O(1) vs. O(N) — and a 2–10× memory reduction for symbol-heavy workloads. LLVM's StringRef, Java's String.intern(), and Python's string interning all use this pattern.
🎯 Learning Outcomes
Interner backed by HashMap<String, Arc<str>>Arc<str> as a cheap cloneable handle to an interned stringArc::ptr_eq to confirm interning worksMutex for thread-safe shared accessArc<str> as a fat-pointer reference-counted string without a String wrapperCode Example
#![allow(clippy::all)]
// 487. String interning pattern
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
pub struct Interner {
table: HashMap<String, Arc<str>>,
}
impl Interner {
pub fn new() -> Self {
Interner {
table: HashMap::new(),
}
}
pub fn intern(&mut self, s: &str) -> Arc<str> {
if let Some(v) = self.table.get(s) {
return Arc::clone(v);
}
let arc: Arc<str> = Arc::from(s);
self.table.insert(s.to_string(), Arc::clone(&arc));
arc
}
pub fn len(&self) -> usize {
self.table.len()
}
}
// Thread-safe interner
pub struct SyncInterner(Mutex<Interner>);
impl SyncInterner {
pub fn new() -> Self {
SyncInterner(Mutex::new(Interner::new()))
}
pub fn intern(&self, s: &str) -> Arc<str> {
self.0.lock().unwrap().intern(s)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_intern_same() {
let mut i = Interner::new();
let a = i.intern("hi");
let b = i.intern("hi");
assert!(Arc::ptr_eq(&a, &b));
}
#[test]
fn test_intern_diff() {
let mut i = Interner::new();
let a = i.intern("hi");
let b = i.intern("ho");
assert!(!Arc::ptr_eq(&a, &b));
}
#[test]
fn test_intern_size() {
let mut i = Interner::new();
i.intern("a");
i.intern("b");
i.intern("a");
assert_eq!(i.len(), 2);
}
}Key Differences
Arc<str> to track interned string lifetimes — the interner table holds one count, callers hold additional counts; OCaml's GC handles lifetime automatically.Arc::ptr_eq; OCaml uses (==) (physical equality).SyncInterner wraps in Mutex; OCaml requires a Mutex too, but the GC ensures no use-after-free even without one on reads.Arc<str> vs. Arc<String>**: Arc<str> is a fat pointer (address + length) with one allocation; Arc<String> would add a String wrapper layer unnecessarily.OCaml Approach
OCaml strings are immutable and GC-managed; the GC does not intern by default. A manual interner uses a Hashtbl:
let table : (string, string) Hashtbl.t = Hashtbl.create 64
let intern s =
match Hashtbl.find_opt table s with
| Some v -> v (* same physical string *)
| None -> Hashtbl.add table s s; s
(* Pointer equality *)
let same a b = a == b (* physical equality in OCaml *)
OCaml's (==) is physical (pointer) equality; (=) is structural. Interning allows using (==) for string identity checks.
Full Source
#![allow(clippy::all)]
// 487. String interning pattern
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
pub struct Interner {
table: HashMap<String, Arc<str>>,
}
impl Interner {
pub fn new() -> Self {
Interner {
table: HashMap::new(),
}
}
pub fn intern(&mut self, s: &str) -> Arc<str> {
if let Some(v) = self.table.get(s) {
return Arc::clone(v);
}
let arc: Arc<str> = Arc::from(s);
self.table.insert(s.to_string(), Arc::clone(&arc));
arc
}
pub fn len(&self) -> usize {
self.table.len()
}
}
// Thread-safe interner
pub struct SyncInterner(Mutex<Interner>);
impl SyncInterner {
pub fn new() -> Self {
SyncInterner(Mutex::new(Interner::new()))
}
pub fn intern(&self, s: &str) -> Arc<str> {
self.0.lock().unwrap().intern(s)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_intern_same() {
let mut i = Interner::new();
let a = i.intern("hi");
let b = i.intern("hi");
assert!(Arc::ptr_eq(&a, &b));
}
#[test]
fn test_intern_diff() {
let mut i = Interner::new();
let a = i.intern("hi");
let b = i.intern("ho");
assert!(!Arc::ptr_eq(&a, &b));
}
#[test]
fn test_intern_size() {
let mut i = Interner::new();
i.intern("a");
i.intern("b");
i.intern("a");
assert_eq!(i.len(), 2);
}
}#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_intern_same() {
let mut i = Interner::new();
let a = i.intern("hi");
let b = i.intern("hi");
assert!(Arc::ptr_eq(&a, &b));
}
#[test]
fn test_intern_diff() {
let mut i = Interner::new();
let a = i.intern("hi");
let b = i.intern("ho");
assert!(!Arc::ptr_eq(&a, &b));
}
#[test]
fn test_intern_size() {
let mut i = Interner::new();
i.intern("a");
i.intern("b");
i.intern("a");
assert_eq!(i.len(), 2);
}
}
Exercises
Arc<str> values with Weak<str> so interned strings are freed when no external handles remain, turning the table into a cache rather than a permanent registry.Mutex<Interner> with a DashMap<String, Arc<str>> (from the dashmap crate) for concurrent interning without a global lock.Arc<str> in a newtype struct Symbol(Arc<str>) and implement PartialEq/Hash using pointer identity rather than string content for O(1) hash map lookups.