724 Fundamental

Zero-Copy Parsing

Functional Programming

Tutorial

The Problem

A naive parser copies every token from the input buffer into a freshly allocated String. For a 100 MB JSON file, this means allocating millions of strings, putting pressure on the allocator, and roughly doubling memory traffic, because every token is read once and then written again. Zero-copy parsing eliminates these allocations by returning borrowed references (&str, &[u8]) into the original input buffer. The parse result's lifetime is tied to the input's lifetime, which prevents the input from being freed while tokens are still accessible.
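
The contrast can be sketched with a whitespace tokenizer. This is a minimal illustration, not part of the example's source: `tokens_owned`, `tokens_borrowed`, and `points_into` are hypothetical names introduced here.

```rust
/// Naive approach: every token is copied into its own heap-allocated String.
pub fn tokens_owned(input: &str) -> Vec<String> {
    input.split_whitespace().map(|tok| tok.to_string()).collect()
}

/// Zero-copy approach: every token is a `&str` view into `input`.
/// Only the Vec of (pointer, length) pairs is allocated.
pub fn tokens_borrowed(input: &str) -> Vec<&str> {
    input.split_whitespace().collect()
}

/// Returns true when `token` lies entirely inside the memory of `input`.
pub fn points_into(input: &str, token: &str) -> bool {
    let start = input.as_ptr() as usize;
    let end = start + input.len();
    let tok = token.as_ptr() as usize;
    tok >= start && tok + token.len() <= end
}
```

Both functions return the same token text, but only the borrowed tokens pass the `points_into` check against the original input; the owned tokens live in separate heap allocations.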

The pattern is common in high-frequency trading systems (parsing FIX protocol messages millions of times per second), network proxies (forwarding HTTP headers without copying), and embedded systems (parsing sensor data from a DMA buffer). Rust's lifetime system makes zero-copy safe by statically ensuring that borrowed parse results cannot outlive their input. In languages with GC (Java, Python, OCaml), the collector keeps a shared buffer alive, but the usual substring operations copy, so avoiding copies takes careful discipline; Rust expresses the borrow in the type system and checks it at compile time.
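
A small sketch of that compile-time check, using hypothetical names (`first_word`, `demo`) that are not part of the example's source. The commented-out line shows the kind of program the borrow checker rejects.

```rust
/// The explicit `'a` says: the returned slice lives no longer than `input`.
pub fn first_word<'a>(input: &'a str) -> &'a str {
    input.split_whitespace().next().unwrap_or("")
}

pub fn demo() -> usize {
    let buffer = String::from("GET /index.html");
    let word = first_word(&buffer); // `word` borrows from `buffer`
    // drop(buffer); // uncommenting this line fails to compile (error E0505):
    //               // `buffer` cannot be moved while `word` still borrows it.
    word.len()
}
```

The lifetime could be elided here; it is written out only to make the link between input and output visible.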

🎯 Learning Outcomes

  • Implement a zero-copy parser that returns &str / &[u8] slices into input
  • Use lifetime annotations to tie parse output lifetimes to input lifetimes
  • Represent parse errors with enum ParseError without heap allocation
  • Apply split_once, splitn, and manual byte scanning to avoid allocation
  • Understand when zero-copy is impossible (e.g., unescaping, base64 decoding)
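
For the last outcome, one common way to handle input that sometimes needs rewriting is `std::borrow::Cow`. The sketch below assumes a simple, hypothetical backslash-escape format (`\n`, `\t`, and "any other escaped character stands for itself"); it borrows when no escape is present and allocates only when it must.

```rust
use std::borrow::Cow;

/// Unescape `input`, borrowing when there is nothing to rewrite.
/// The escape rules here are an assumption made for illustration.
pub fn unescape(input: &str) -> Cow<'_, str> {
    if !input.contains('\\') {
        // Fast path: no escapes, so the input itself is the result.
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some(other) => out.push(other), // e.g. `\\` or `\"`
            None => out.push('\\'),         // trailing backslash kept literally
        }
    }
    Cow::Owned(out)
}
```

Callers can treat the result as a `&str` in both cases, so the allocation cost is paid only by inputs that actually contain escapes.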

    Code Example

    pub fn take_until(buf: &[u8], delimiter: u8) -> Result<(&[u8], &[u8]), ParseError> {
        buf.iter()
            .position(|&b| b == delimiter)
            .map(|pos| (&buf[..pos], &buf[pos + 1..]))
            .ok_or(ParseError::MissingDelimiter(delimiter))
    }
    
    pub fn parse_key_value(line: &[u8]) -> Result<KeyValue<'_>, ParseError> {
        let (key_bytes, value_bytes) = take_until(line, b'=')?;
        Ok(KeyValue { key: as_str(key_bytes)?, value: as_str(value_bytes)? })
    }
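
When the input is already a `&str`, the standard library's `str::split_once` and `str::splitn` give the same borrowing behavior without manual byte scanning. A minimal sketch follows; `key_value_str` and `three_fields` are hypothetical helpers, not part of the source above.

```rust
/// Split `line` at the first '=' using `str::split_once`.
/// Both halves borrow from `line`; nothing is allocated.
pub fn key_value_str(line: &str) -> Option<(&str, &str)> {
    line.split_once('=')
}

/// Split into exactly three ':'-separated pieces with `splitn`.
/// The last piece keeps any further ':' characters.
pub fn three_fields(line: &str) -> Option<(&str, &str, &str)> {
    let mut parts = line.splitn(3, ':');
    Some((parts.next()?, parts.next()?, parts.next()?))
}
```

For example, `key_value_str("url=http://x?a=1")` splits only at the first `=`, matching the behavior of `parse_key_value`.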

    Key Differences

    Aspect | Rust | OCaml
    Borrowing input | Lifetime-annotated &str | GC-managed; String.sub copies
    Safety enforcement | Compile-time lifetime check | Runtime / discipline
    Binary frames | &[u8] slices, no copy | Bigstring with Angstrom
    Error type | Enum, stack-allocated | string or exception
    Parser libraries | nom, winnow (zero-copy) | Angstrom (Bigstring)
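
The "Enum, stack-allocated" row can be made concrete with a short sketch. The enum below repeats the variants used in this example; the `Clone`/`Copy` derives and the `Display` and `Error` impls are additions assumed here for illustration, not part of the original source.

```rust
use std::fmt;

/// Every variant holds plain data or a `&'static str`, so constructing
/// and returning an error never touches the heap.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ParseError {
    UnexpectedEof,
    InvalidUtf8,
    MissingDelimiter(u8),
    InvalidFormat(&'static str),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::UnexpectedEof => write!(f, "unexpected end of input"),
            ParseError::InvalidUtf8 => write!(f, "invalid UTF-8"),
            ParseError::MissingDelimiter(d) => {
                write!(f, "missing delimiter {:?}", *d as char)
            }
            ParseError::InvalidFormat(msg) => write!(f, "invalid format: {msg}"),
        }
    }
}

impl std::error::Error for ParseError {}
```

Formatting writes directly into the caller's formatter, so an allocation happens only if the caller chooses to build a String from the error.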

    OCaml Approach

    OCaml strings are immutable and the GC manages their lifetime, but the standard ways to extract a substring, String.sub and Bytes.sub_string, both copy. True zero-copy requires Bigstring/Bigarray buffers, such as those used by the Angstrom parser combinator library:

    (* Copies substring — not zero-copy *)
    let parse_key_value s =
      match String.split_on_char '=' s with
      | [k; v] -> Ok (String.trim k, String.trim v)
      | _       -> Error "invalid"
    
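    (* Zero-copy with Angstrom: the Unsafe combinators expose the Bigstring buffer *)
    (* let record_parser = ... Angstrom.Unsafe.take_while ... *)
    

    The Angstrom library uses Bigstring (a Bigarray.Array1 of char) as the backing buffer. Its default combinators such as take_while return freshly allocated strings, while the combinators in its Unsafe module hand the callback the buffer together with an offset and a length, so input can be inspected without copying.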

    Full Source

    #![allow(clippy::all)]
    // 724. Zero-copy parsing with byte slices
    //
    // Returns &str / &[u8] slices into the input buffer — no allocation.
    // Lifetimes tie parsed references to the original input.
    
    use std::str;
    
    // ── Error type ────────────────────────────────────────────────────────────────
    
    #[derive(Debug, PartialEq)]
    pub enum ParseError {
        UnexpectedEof,
        InvalidUtf8,
        MissingDelimiter(u8),
        InvalidFormat(&'static str),
    }
    
    // ── Low-level byte-slice combinators ─────────────────────────────────────────
    
    /// Take `n` bytes from the front of `buf`, returning (taken, rest).
    pub fn take(buf: &[u8], n: usize) -> Result<(&[u8], &[u8]), ParseError> {
        if buf.len() < n {
            Err(ParseError::UnexpectedEof)
        } else {
            Ok((&buf[..n], &buf[n..]))
        }
    }
    
    /// Consume bytes until `delimiter` (exclusive), returning (before, after_delim).
    pub fn take_until(buf: &[u8], delimiter: u8) -> Result<(&[u8], &[u8]), ParseError> {
        buf.iter()
            .position(|&b| b == delimiter)
            .map(|pos| (&buf[..pos], &buf[pos + 1..]))
            .ok_or(ParseError::MissingDelimiter(delimiter))
    }
    
    /// Interpret a byte slice as UTF-8 `&str` — zero-copy, zero allocation.
    pub fn as_str(buf: &[u8]) -> Result<&str, ParseError> {
        str::from_utf8(buf).map_err(|_| ParseError::InvalidUtf8)
    }
    
    /// Skip leading ASCII whitespace, returning the trimmed slice.
    pub fn skip_whitespace(buf: &[u8]) -> &[u8] {
        let pos = buf
            .iter()
            .position(|b| !b.is_ascii_whitespace())
            .unwrap_or(buf.len());
        &buf[pos..]
    }
    
    // ── Span — index-pair view into a shared buffer ───────────────────────────────
    
    /// A lightweight window into a byte buffer: start index + length.
    /// Mirrors the OCaml `span` record, but without copying.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct Span {
        pub start: usize,
        pub len: usize,
    }
    
    impl Span {
        pub fn new(start: usize, len: usize) -> Self {
            Self { start, len }
        }
    
        /// Resolve the span against the original buffer — still zero-copy.
        pub fn slice<'a>(&self, buf: &'a [u8]) -> &'a [u8] {
            &buf[self.start..self.start + self.len]
        }
    
        pub fn as_str<'a>(&self, buf: &'a [u8]) -> Result<&'a str, ParseError> {
            as_str(self.slice(buf))
        }
    }
    
    /// Split `buf` at the first `sep` byte, returning two `Span`s (no allocation).
    pub fn span_split_at(buf: &[u8], start: usize, len: usize, sep: u8) -> Option<(Span, Span)> {
        let slice = &buf[start..start + len];
        slice.iter().position(|&b| b == sep).map(|pos| {
            let left = Span::new(start, pos);
            // +1 to skip the separator itself
            let right = Span::new(start + pos + 1, len - pos - 1);
            (left, right)
        })
    }
    
    // ── HTTP request-line parser ──────────────────────────────────────────────────
    
    /// Parsed HTTP request line.  All fields borrow from the input buffer.
    #[derive(Debug, PartialEq)]
    pub struct RequestLine<'a> {
        pub method: &'a str,
        pub path: &'a str,
        pub version: &'a str,
    }
    
    /// Parse `"METHOD /path HTTP/1.x\r\n"` without allocating.
    ///
    /// Every `&str` in the returned struct points directly into `buf`.
    pub fn parse_request_line(buf: &[u8]) -> Result<RequestLine<'_>, ParseError> {
        // Consume up to first space → method
        let (method_bytes, rest) = take_until(buf, b' ')?;
        let method = as_str(method_bytes)?;
    
        let rest = skip_whitespace(rest);
    
        // Consume up to second space → path
        let (path_bytes, rest) = take_until(rest, b' ')?;
        let path = as_str(path_bytes)?;
    
        let rest = skip_whitespace(rest);
    
        // Consume up to \r\n or end of slice → version
        let version_bytes = rest
            .iter()
            .position(|&b| b == b'\r' || b == b'\n')
            .map(|pos| &rest[..pos])
            .unwrap_or(rest);
        let version = as_str(version_bytes)?;
    
        Ok(RequestLine {
            method,
            path,
            version,
        })
    }
    
    // ── CSV field iterator — yields &str slices, zero-copy ───────────────────────
    
    /// Iterator over comma-separated fields in a single CSV row.
    /// Yields `&str` slices borrowed from the original input.
    pub struct CsvFields<'a> {
        remaining: &'a [u8],
        done: bool,
    }
    
    impl<'a> CsvFields<'a> {
        pub fn new(row: &'a [u8]) -> Self {
            Self {
                remaining: row,
                done: false,
            }
        }
    }
    
    impl<'a> Iterator for CsvFields<'a> {
        type Item = Result<&'a str, ParseError>;
    
        fn next(&mut self) -> Option<Self::Item> {
            if self.done {
                return None;
            }
            match self.remaining.iter().position(|&b| b == b',') {
                Some(pos) => {
                    let field = &self.remaining[..pos];
                    self.remaining = &self.remaining[pos + 1..];
                    Some(as_str(field))
                }
                None => {
                    // Last field: consume everything. An empty remainder (empty row
                    // or trailing comma) yields no field.
                    self.done = true;
                    if self.remaining.is_empty() {
                        None
                    } else {
                        let field = self.remaining;
                        self.remaining = &[];
                        Some(as_str(field))
                    }
                }
            }
        }
    }
    
    /// Collect all CSV fields from a row into a `Vec<&str>`, zero-copy.
    pub fn parse_csv_row(row: &[u8]) -> Result<Vec<&str>, ParseError> {
        CsvFields::new(row).collect()
    }
    
    // ── Key=Value line parser ─────────────────────────────────────────────────────
    
    /// A single `key=value` pair, both halves borrowing from the input.
    #[derive(Debug, PartialEq)]
    pub struct KeyValue<'a> {
        pub key: &'a str,
        pub value: &'a str,
    }
    
    pub fn parse_key_value(line: &[u8]) -> Result<KeyValue<'_>, ParseError> {
        let (key_bytes, value_bytes) = take_until(line, b'=')?;
        Ok(KeyValue {
            key: as_str(key_bytes)?,
            value: as_str(value_bytes)?,
        })
    }
    
    // ─────────────────────────────────────────────────────────────────────────────
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        // ── take ─────────────────────────────────────────────────────────────────
    
        #[test]
        fn take_splits_correctly() {
            let buf = b"Hello, world!";
            let (head, tail) = take(buf, 5).unwrap();
            assert_eq!(head, b"Hello");
            assert_eq!(tail, b", world!");
        }
    
        #[test]
        fn take_eof_returns_error() {
            let buf = b"Hi";
            assert_eq!(take(buf, 10), Err(ParseError::UnexpectedEof));
        }
    
        #[test]
        fn take_zero_returns_empty_head() {
            let buf = b"abc";
            let (head, tail) = take(buf, 0).unwrap();
            assert_eq!(head, b"");
            assert_eq!(tail, b"abc");
        }
    
        // ── take_until ───────────────────────────────────────────────────────────
    
        #[test]
        fn take_until_finds_delimiter() {
            let buf = b"key=value";
            let (before, after) = take_until(buf, b'=').unwrap();
            assert_eq!(before, b"key");
            assert_eq!(after, b"value");
        }
    
        #[test]
        fn take_until_missing_delimiter_errors() {
            let buf = b"nodot";
            assert_eq!(
                take_until(buf, b'.'),
                Err(ParseError::MissingDelimiter(b'.'))
            );
        }
    
        // ── Span ─────────────────────────────────────────────────────────────────
    
        #[test]
        fn span_slice_is_zero_copy() {
            let buf = b"Hello, world!";
            let span = Span::new(7, 5);
            assert_eq!(span.slice(buf), b"world");
            assert_eq!(span.as_str(buf).unwrap(), "world");
        }
    
        #[test]
        fn span_split_at_produces_two_windows() {
            let buf = b"left:right";
            let (l, r) = span_split_at(buf, 0, buf.len(), b':').unwrap();
            assert_eq!(l.as_str(buf).unwrap(), "left");
            assert_eq!(r.as_str(buf).unwrap(), "right");
        }
    
        #[test]
        fn span_split_at_missing_sep_returns_none() {
            let buf = b"nodot";
            assert!(span_split_at(buf, 0, buf.len(), b'.').is_none());
        }
    
        // ── HTTP request-line ─────────────────────────────────────────────────────
    
        #[test]
        fn parse_request_line_get() {
            let input = b"GET /index.html HTTP/1.1\r\n";
            let req = parse_request_line(input).unwrap();
            assert_eq!(req.method, "GET");
            assert_eq!(req.path, "/index.html");
            assert_eq!(req.version, "HTTP/1.1");
        }
    
        #[test]
        fn parse_request_line_post_no_crlf() {
            let input = b"POST /api/data HTTP/2";
            let req = parse_request_line(input).unwrap();
            assert_eq!(req.method, "POST");
            assert_eq!(req.path, "/api/data");
            assert_eq!(req.version, "HTTP/2");
        }
    
        #[test]
        fn parse_request_line_missing_path_errors() {
            let input = b"GET";
            assert!(parse_request_line(input).is_err());
        }
    
        // ── CSV row ───────────────────────────────────────────────────────────────
    
        #[test]
        fn parse_csv_row_three_fields() {
            let row = b"alice,30,engineer";
            let fields = parse_csv_row(row).unwrap();
            assert_eq!(fields, vec!["alice", "30", "engineer"]);
        }
    
        #[test]
        fn parse_csv_row_single_field() {
            let row = b"only";
            let fields = parse_csv_row(row).unwrap();
            assert_eq!(fields, vec!["only"]);
        }
    
        #[test]
        fn parse_csv_row_empty_fields() {
            let row = b"a,,c";
            let fields = parse_csv_row(row).unwrap();
            assert_eq!(fields, vec!["a", "", "c"]);
        }
    
        // ── key=value ─────────────────────────────────────────────────────────────
    
        #[test]
        fn parse_key_value_basic() {
            let line = b"host=localhost";
            let kv = parse_key_value(line).unwrap();
            assert_eq!(kv.key, "host");
            assert_eq!(kv.value, "localhost");
        }
    
        #[test]
        fn parse_key_value_value_with_equals() {
            // Only splits on the FIRST '='
            let line = b"url=http://x?a=1";
            let kv = parse_key_value(line).unwrap();
            assert_eq!(kv.key, "url");
            assert_eq!(kv.value, "http://x?a=1");
        }
    
        #[test]
        fn parse_key_value_missing_equals_errors() {
            let line = b"noequals";
            assert!(parse_key_value(line).is_err());
        }
    
        // ── Lifetime safety (compile-time) ────────────────────────────────────────
    
        #[test]
        fn parsed_fields_borrow_from_input() {
            let input = b"name=Ferris";
            let kv = parse_key_value(input).unwrap();
            // Both &str slices point into `input` — no heap allocation occurred.
            assert!(std::ptr::eq(kv.key.as_bytes().as_ptr(), input.as_ptr()));
            assert!(std::ptr::eq(kv.value.as_bytes().as_ptr(), unsafe {
                input.as_ptr().add(5)
            }));
        }
    }

    Deep Comparison

    OCaml vs Rust: Zero-Copy Parsing with Byte Slices

    Side-by-Side Code

    OCaml

    (* OCaml tracks positions as (start, length) pairs over a shared Bytes buffer *)
    type span = { buf: bytes; start: int; len: int }
    
    let span_split_at sep s =
      let rec find i =
        if i >= s.len then None
        else if Bytes.get s.buf (s.start + i) = sep then
          Some ({ s with len = i },
                { s with start = s.start + i + 1; len = s.len - i - 1 })
        else find (i + 1)
      in find 0
    
    let span_to_string s = Bytes.sub_string s.buf s.start s.len
    

    Rust (idiomatic — slice references)

    pub fn take_until(buf: &[u8], delimiter: u8) -> Result<(&[u8], &[u8]), ParseError> {
        buf.iter()
            .position(|&b| b == delimiter)
            .map(|pos| (&buf[..pos], &buf[pos + 1..]))
            .ok_or(ParseError::MissingDelimiter(delimiter))
    }
    
    pub fn parse_key_value(line: &[u8]) -> Result<KeyValue<'_>, ParseError> {
        let (key_bytes, value_bytes) = take_until(line, b'=')?;
        Ok(KeyValue { key: as_str(key_bytes)?, value: as_str(value_bytes)? })
    }
    

    Rust (Span — index-pair approach, mirrors OCaml)

    #[derive(Debug, Clone, Copy)]
    pub struct Span { pub start: usize, pub len: usize }
    
    impl Span {
        pub fn slice<'a>(&self, buf: &'a [u8]) -> &'a [u8] {
            &buf[self.start..self.start + self.len]
        }
    }
    
    pub fn span_split_at(buf: &[u8], start: usize, len: usize, sep: u8) -> Option<(Span, Span)> {
        let slice = &buf[start..start + len];
        slice.iter().position(|&b| b == sep).map(|pos| {
            // Struct literals are used because this excerpt omits Span::new.
            (Span { start, len: pos }, Span { start: start + pos + 1, len: len - pos - 1 })
        })
    }
    

    Type Signatures

    Concept | OCaml | Rust
    Buffer view | type span = { buf: bytes; start: int; len: int } | &[u8] (fat pointer: ptr + len)
    Split result | span * span (tuple of spans) | (&[u8], &[u8]) (tuple of slices)
    UTF-8 view | Bytes.sub_string (copies) | str::from_utf8 (borrows)
    Lifetime contract | Implicit: GC keeps buffer alive | <'a> annotation, compiler enforced
    Optional result | 'a option | Option<T>
    Fallible result | option or exception | Result<T, ParseError>
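
The "fat pointer" entry can be checked directly. The sketch below uses hypothetical helper names and relies on the pointer-plus-length layout that current rustc targets use for slice references.

```rust
use std::mem::size_of;

/// A slice reference is two machine words: a data pointer and a length.
pub fn slice_ref_is_two_words() -> bool {
    size_of::<&[u8]>() == 2 * size_of::<usize>()
        && size_of::<&str>() == 2 * size_of::<usize>()
}

/// Byte offset of `child` inside `parent`.
/// Assumes `child` was produced by sub-slicing `parent`.
pub fn offset_within(parent: &[u8], child: &[u8]) -> usize {
    child.as_ptr() as usize - parent.as_ptr() as usize
}
```

Sub-slicing `b"key=value"` at `[4..]` yields a reference whose pointer is 4 bytes past the start and whose length is 5; no bytes move.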

    Key Insights

  • OCaml's String.sub allocates; Rust's &[u8] slice never does. In OCaml, extracting a substring almost always copies bytes into a new heap object. Rust &[u8] and &str are fat pointers (address + length) into existing memory: the parsed value is the original bytes, viewed differently.

  • Lifetimes replace garbage collection as the safety mechanism. OCaml's GC ensures the underlying bytes buffer is kept alive as long as any span references it. Rust achieves the same guarantee at compile time through lifetime annotations: struct RequestLine<'a> cannot outlive the &'a [u8] it was parsed from. Use-after-free is rejected before the binary is produced.

  • The Span struct is the OCaml idiom; slice references are the Rust idiom. OCaml must carry buf inside every span because the language has no built-in borrowed view into part of a bytes value. Rust fat-pointer slices already carry both address and length, so the idiomatic Rust equivalent of a span is just &[u8], with no wrapper struct required.

  • Iterator-based field parsers compose without allocation. CsvFields is a lazy Iterator<Item = Result<&str, _>> that yields slices into the original row buffer. In OCaml a comparable implementation would either allocate a list of substrings or thread an explicit index through a recursive function.

  • The ? operator plus Result makes zero-copy parsers as ergonomic as exception-based ones. OCaml parsers often raise exceptions for error paths. Rust's ? propagates Result::Err up the call stack with the same brevity, but without hidden control flow and with explicit error types that the caller can inspect or recover from.
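
The point about lazy, allocation-free composition can be sketched with the standard library's `slice::split`, which plays the same role as CsvFields. The function names below are hypothetical and not part of the example's source.

```rust
use std::str;

/// Sum the numeric fields of a comma-separated row without building a Vec.
/// Fields that are not valid UTF-8 or not unsigned integers are skipped.
pub fn sum_numeric_fields(row: &[u8]) -> u64 {
    row.split(|&b| b == b',')
        .filter_map(|field| str::from_utf8(field).ok())
        .filter_map(|text| text.trim().parse::<u64>().ok())
        .sum()
}

/// Fetch the n-th field lazily; iteration stops as soon as it is found.
pub fn nth_field(row: &[u8], n: usize) -> Option<&str> {
    row.split(|&b| b == b',')
        .nth(n)
        .and_then(|field| str::from_utf8(field).ok())
}
```

One difference worth noting: `slice::split` yields a trailing empty field for a row that ends in a comma, whereas the CsvFields iterator in the full source yields nothing for an empty remainder.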

    When to Use Each Style

    Use idiomatic Rust &[u8] / &str slices when: you control the full parser pipeline in one crate and want maximum ergonomics. The compiler infers lifetimes in most cases and the code reads like a sequence of combinator calls.

    Use the Span index-pair style when: you need to store multiple parsed views alongside the buffer in a single struct (a self-referential pattern that Rust slices cannot express directly without unsafe), or when passing parsed results across FFI boundaries where raw pointer + length pairs are expected.
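
A minimal sketch of the first Span use case: a struct that owns its buffer and stores index pairs instead of references, which avoids the self-referential borrow. `OwnedKeyValue` and its methods are hypothetical names introduced for this illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub len: usize,
}

/// Owns its buffer and remembers where the key and value live inside it.
pub struct OwnedKeyValue {
    buf: Vec<u8>,
    key: Span,
    value: Span,
}

impl OwnedKeyValue {
    /// Take ownership of `buf` and record the spans around the first '='.
    pub fn parse(buf: Vec<u8>) -> Option<Self> {
        let eq = buf.iter().position(|&b| b == b'=')?;
        let key = Span { start: 0, len: eq };
        let value = Span { start: eq + 1, len: buf.len() - eq - 1 };
        Some(Self { buf, key, value })
    }

    /// Resolve a span against the owned buffer; still no copy.
    fn resolve(&self, span: Span) -> Option<&str> {
        std::str::from_utf8(&self.buf[span.start..span.start + span.len]).ok()
    }

    pub fn key(&self) -> Option<&str> {
        self.resolve(self.key)
    }

    pub fn value(&self) -> Option<&str> {
        self.resolve(self.value)
    }
}
```

Because the spans are plain integers, the struct can be moved or stored freely; the borrow is created only when `key()` or `value()` is called.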

    Exercises

  • Implement a zero-copy HTTP/1.1 request-line parser returning (&str, &str, &str) for method, path, and version. Write property tests verifying that no allocation occurs (for example, with a counting global allocator as the oracle).

  • Extend parse_frame to return an iterator over multiple consecutive frames in a buffer, with each frame borrowing from the original &[u8].

  • Implement a zero-copy JSON string tokenizer that returns &str slices for unescaped strings but falls back to String for strings containing \uXXXX escapes (use Cow<'_, str>).

  • Benchmark your CSV field parser vs one that collects into Vec<String>. Measure allocations with heaptrack or the dhat allocator.

  • Write a nom-based parser for a simple binary format and compare its generated code to your hand-rolled version using cargo asm.

    Open Source Repos