ExamplesBy LevelBy TopicLearning Paths
1050 Intermediate

1050-interning — String Interning

Functional Programming

Tutorial

The Problem

When a program repeatedly stores and compares the same set of strings — identifiers in a compiler, field names in a serializer, CSS class names in a renderer — storing each string as a separate allocation wastes memory and makes comparison O(|string|) instead of O(1).

String interning maps each unique string to a small integer "symbol". Comparing two symbols is a single integer comparison. The interner maintains a bidirectional map between strings and IDs, ensuring each unique string is stored exactly once.

🎯 Learning Outcomes

  • • Implement a string interner with HashMap<String, Symbol> and Vec<String>
  • • Understand that interned symbols compare in O(1) instead of O(|string|)
  • • Return the same symbol for repeated insertions of the same string
  • • Look up the original string from a symbol in O(1)
  • • Connect to compiler symbol tables and CSS parser use cases
  • Code Example

    #![allow(clippy::all)]
    // 1050: String Interning — Dedup Strings to IDs
    // Map strings to unique integer IDs for O(1) comparison
    
    use std::collections::HashMap;
    
    /// Interned string handle — cheap to copy and compare
    #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
    struct Symbol(usize);
    
    /// String interner: maps strings ↔ unique IDs
    struct Interner {
        to_id: HashMap<String, Symbol>,
        to_str: Vec<String>,
    }
    
    impl Interner {
        fn new() -> Self {
            Interner {
                to_id: HashMap::new(),
                to_str: Vec::new(),
            }
        }
    
        /// Intern a string: returns existing ID or assigns a new one
        fn intern(&mut self, s: &str) -> Symbol {
            if let Some(&id) = self.to_id.get(s) {
                return id;
            }
            let id = Symbol(self.to_str.len());
            self.to_str.push(s.to_string());
            self.to_id.insert(s.to_string(), id);
            id
        }
    
        /// Resolve a symbol back to its string
        fn resolve(&self, sym: Symbol) -> Option<&str> {
            self.to_str.get(sym.0).map(|s| s.as_str())
        }
    
        /// Number of interned strings
        fn len(&self) -> usize {
            self.to_str.len()
        }
    }
    
    fn basic_interning() {
        let mut interner = Interner::new();
        let id1 = interner.intern("hello");
        let id2 = interner.intern("world");
        let id3 = interner.intern("hello"); // Same as id1
    
        assert_eq!(id1, id3); // Same string → same ID
        assert_ne!(id1, id2); // Different strings → different IDs
        assert_eq!(interner.resolve(id1), Some("hello"));
        assert_eq!(interner.resolve(id2), Some("world"));
        assert_eq!(interner.len(), 2); // Only 2 unique strings
    }
    
    fn fast_comparison() {
        let mut interner = Interner::new();
        let words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"];
        let ids: Vec<Symbol> = words.iter().map(|w| interner.intern(w)).collect();
    
        // Compare IDs instead of strings: integer comparison is O(1)
        let the_id = ids[0];
        let count = ids.iter().filter(|&&id| id == the_id).count();
        assert_eq!(count, 3); // "the" appears 3 times
    
        // Frequency counting with symbols is faster than with strings
        let mut freq: HashMap<Symbol, usize> = HashMap::new();
        for &id in &ids {
            *freq.entry(id).or_insert(0) += 1;
        }
        assert_eq!(freq[&the_id], 3);
    }
    
    fn symbol_table() {
        let mut interner = Interner::new();
        let vars = ["x", "y", "x", "z", "y", "x"];
        let interned: Vec<Symbol> = vars.iter().map(|v| interner.intern(v)).collect();
    
        // Only 3 unique symbols
        assert_eq!(interner.len(), 3);
    
        // Dedup using a set of symbols
        let mut unique: Vec<Symbol> = interned.clone();
        unique.sort_by_key(|s| s.0);
        unique.dedup();
        assert_eq!(unique.len(), 3);
    
        // Resolve back to strings
        let names: Vec<&str> = unique
            .iter()
            .filter_map(|&sym| interner.resolve(sym))
            .collect();
        assert_eq!(names.len(), 3);
    }
    
    /// Symbols work great as HashMap keys (faster than String keys)
    fn symbol_as_key() {
        let mut interner = Interner::new();
        let mut values: HashMap<Symbol, i32> = HashMap::new();
    
        let x = interner.intern("x");
        let y = interner.intern("y");
    
        values.insert(x, 42);
        values.insert(y, 99);
    
        assert_eq!(values[&x], 42);
        assert_eq!(values[&y], 99);
    
        // Symbol is Copy — no cloning needed
        let key = x;
        assert_eq!(values[&key], 42);
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_basic() {
            basic_interning();
        }
    
        #[test]
        fn test_fast_compare() {
            fast_comparison();
        }
    
        #[test]
        fn test_symbols() {
            symbol_table();
        }
    
        #[test]
        fn test_symbol_key() {
            symbol_as_key();
        }
    
        #[test]
        fn test_empty_string() {
            let mut interner = Interner::new();
            let empty = interner.intern("");
            assert_eq!(interner.resolve(empty), Some(""));
        }
    
        #[test]
        fn test_symbol_is_copy() {
            let mut interner = Interner::new();
            let sym = interner.intern("test");
            let copy = sym; // Copy, not move
            assert_eq!(sym, copy); // Both still valid
        }
    }

    Key Differences

  • Motivation: In Rust, string comparison is O(|string|) even with ==; interning reduces it to O(1). In OCaml, String.equal is O(n) but physical equality == is O(1) and can be used after interning.
  • Memory deduplication: Both languages benefit from storing each unique string once; the memory saving is language-independent.
  • Thread safety: Rust's single-threaded interner uses HashMap; a thread-safe version needs Mutex or DashMap. OCaml's Hashtbl is single-threaded; Domain-safe versions need explicit locking.
  • Arena allocation: Production interners use arena allocation to avoid individual String heap allocations; Rust's typed-arena crate supports this.
  • OCaml Approach

    OCaml uses Hashtbl for mutable interning:

    let interner = Hashtbl.create 256
    let next_id = ref 0
    
    let intern s =
      match Hashtbl.find_opt interner s with
      | Some id -> id
      | None ->
        let id = !next_id in
        Hashtbl.add interner s id;
        incr next_id;
        id
    

    OCaml's polymorphic equality already compares strings by value — interning is more about memory deduplication than comparison speed. The Camomile and Zarith libraries use interning for performance-critical string handling.

    Full Source

    #![allow(clippy::all)]
    // 1050: String Interning — Dedup Strings to IDs
    // Map strings to unique integer IDs for O(1) comparison
    
    use std::collections::HashMap;
    
    /// Interned string handle — cheap to copy and compare
    #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
    struct Symbol(usize);
    
    /// String interner: maps strings ↔ unique IDs
    struct Interner {
        to_id: HashMap<String, Symbol>,
        to_str: Vec<String>,
    }
    
    impl Interner {
        fn new() -> Self {
            Interner {
                to_id: HashMap::new(),
                to_str: Vec::new(),
            }
        }
    
        /// Intern a string: returns existing ID or assigns a new one
        fn intern(&mut self, s: &str) -> Symbol {
            if let Some(&id) = self.to_id.get(s) {
                return id;
            }
            let id = Symbol(self.to_str.len());
            self.to_str.push(s.to_string());
            self.to_id.insert(s.to_string(), id);
            id
        }
    
        /// Resolve a symbol back to its string
        fn resolve(&self, sym: Symbol) -> Option<&str> {
            self.to_str.get(sym.0).map(|s| s.as_str())
        }
    
        /// Number of interned strings
        fn len(&self) -> usize {
            self.to_str.len()
        }
    }
    
    fn basic_interning() {
        let mut interner = Interner::new();
        let id1 = interner.intern("hello");
        let id2 = interner.intern("world");
        let id3 = interner.intern("hello"); // Same as id1
    
        assert_eq!(id1, id3); // Same string → same ID
        assert_ne!(id1, id2); // Different strings → different IDs
        assert_eq!(interner.resolve(id1), Some("hello"));
        assert_eq!(interner.resolve(id2), Some("world"));
        assert_eq!(interner.len(), 2); // Only 2 unique strings
    }
    
    fn fast_comparison() {
        let mut interner = Interner::new();
        let words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"];
        let ids: Vec<Symbol> = words.iter().map(|w| interner.intern(w)).collect();
    
        // Compare IDs instead of strings: integer comparison is O(1)
        let the_id = ids[0];
        let count = ids.iter().filter(|&&id| id == the_id).count();
        assert_eq!(count, 3); // "the" appears 3 times
    
        // Frequency counting with symbols is faster than with strings
        let mut freq: HashMap<Symbol, usize> = HashMap::new();
        for &id in &ids {
            *freq.entry(id).or_insert(0) += 1;
        }
        assert_eq!(freq[&the_id], 3);
    }
    
    fn symbol_table() {
        let mut interner = Interner::new();
        let vars = ["x", "y", "x", "z", "y", "x"];
        let interned: Vec<Symbol> = vars.iter().map(|v| interner.intern(v)).collect();
    
        // Only 3 unique symbols
        assert_eq!(interner.len(), 3);
    
        // Dedup using a set of symbols
        let mut unique: Vec<Symbol> = interned.clone();
        unique.sort_by_key(|s| s.0);
        unique.dedup();
        assert_eq!(unique.len(), 3);
    
        // Resolve back to strings
        let names: Vec<&str> = unique
            .iter()
            .filter_map(|&sym| interner.resolve(sym))
            .collect();
        assert_eq!(names.len(), 3);
    }
    
    /// Symbols work great as HashMap keys (faster than String keys)
    fn symbol_as_key() {
        let mut interner = Interner::new();
        let mut values: HashMap<Symbol, i32> = HashMap::new();
    
        let x = interner.intern("x");
        let y = interner.intern("y");
    
        values.insert(x, 42);
        values.insert(y, 99);
    
        assert_eq!(values[&x], 42);
        assert_eq!(values[&y], 99);
    
        // Symbol is Copy — no cloning needed
        let key = x;
        assert_eq!(values[&key], 42);
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_basic() {
            basic_interning();
        }
    
        #[test]
        fn test_fast_compare() {
            fast_comparison();
        }
    
        #[test]
        fn test_symbols() {
            symbol_table();
        }
    
        #[test]
        fn test_symbol_key() {
            symbol_as_key();
        }
    
        #[test]
        fn test_empty_string() {
            let mut interner = Interner::new();
            let empty = interner.intern("");
            assert_eq!(interner.resolve(empty), Some(""));
        }
    
        #[test]
        fn test_symbol_is_copy() {
            let mut interner = Interner::new();
            let sym = interner.intern("test");
            let copy = sym; // Copy, not move
            assert_eq!(sym, copy); // Both still valid
        }
    }
    ✓ Tests Rust test suite
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_basic() {
            basic_interning();
        }
    
        #[test]
        fn test_fast_compare() {
            fast_comparison();
        }
    
        #[test]
        fn test_symbols() {
            symbol_table();
        }
    
        #[test]
        fn test_symbol_key() {
            symbol_as_key();
        }
    
        #[test]
        fn test_empty_string() {
            let mut interner = Interner::new();
            let empty = interner.intern("");
            assert_eq!(interner.resolve(empty), Some(""));
        }
    
        #[test]
        fn test_symbol_is_copy() {
            let mut interner = Interner::new();
            let sym = interner.intern("test");
            let copy = sym; // Copy, not move
            assert_eq!(sym, copy); // Both still valid
        }
    }

    Deep Comparison

    String Interning — Comparison

    Core Insight

    String interning maps strings to unique integer IDs, enabling O(1) comparison and smaller memory footprint when strings repeat. Both languages implement it the same way — a bidirectional map between strings and IDs. The key difference is that Rust's Symbol can be Copy, making it as cheap as an integer.

    OCaml Approach

  • Hashtbl for both directions: (string, int) and (int, string)
  • • Mutable counter for next ID
  • • Module-based encapsulation
  • • OCaml strings are already value-compared (structural equality)
  • • GC handles interned string lifetime
  • Rust Approach

  • HashMap<String, Symbol> for string→ID
  • Vec<String> for ID→string (index = ID)
  • Symbol(usize) newtype: derives Copy, Clone, Eq, Hash
  • • Symbols as HashMap keys are faster than String keys
  • • Must manage lifetimes — interner must outlive symbols
  • Comparison Table

    FeatureOCamlRust
    String→IDHashtblHashMap<String, Symbol>
    ID→StringHashtblVec<String> (index)
    ID typeintSymbol(usize) newtype
    Copy semanticsN/A (GC)Copy trait — zero-cost
    Comparison costO(1) int compareO(1) usize compare
    Use as keyint in any mapSymbol derives Hash + Eq
    MemoryGC managedMust outlive usage
    Common inCompilers, interpretersCompilers, ECS, databases

    Exercises

  • Make the interner thread-safe by wrapping it in Arc<Mutex<Interner>> and benchmark the contention.
  • Implement Interner::merge(other: Interner) -> Interner that combines two interners, remapping the IDs of one to avoid conflicts.
  • Write a simple parser that interns all identifiers during lexing and compares symbols (not strings) during semantic analysis.
  • Open Source Repos