958 Intermediate

958 CSV Parser

Functional Programming

Tutorial

The Problem

Implement a CSV line parser that handles quoted fields (fields containing commas or newlines wrapped in double quotes) and escaped quotes ("" inside a quoted field represents a literal "). Model the parser as a finite state machine with three states: Normal, InQuote, and AfterQuote. Implement both a simple split-only version and the full state-machine version.

🎯 Learning Outcomes

  • Implement a simple CSV split using line.split(',').collect()
  • Model parser state as an enum: Normal, InQuote, AfterQuote
  • Drive the state machine character by character with match (&state, c)
  • Handle "" escape: transition from InQuote to AfterQuote on ", then back to InQuote on another " (escaped quote) or to Normal on , (end of field)
  • Recognize when split(',') is insufficient and a real state machine is required
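
To make the last outcome concrete, here is a minimal sketch of what a plain split does to a quoted field. The helper name `naive_fields` is hypothetical and exists only for this illustration.

```rust
// Hypothetical helper for illustration: splits on every comma,
// with no awareness of double quotes.
pub fn naive_fields(line: &str) -> Vec<&str> {
    line.split(',').collect()
}
```

For the line `"one, two",three` this returns three pieces (`"one`, ` two"` and `three`) instead of the two fields the line actually contains, and the quote characters stay in the output.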

Code Example

    #![allow(clippy::all)]
    // 958: CSV Parser
    // OCaml uses mutable Buffer + state ref; Rust uses an enum state machine with chars iterator
    
    // Approach 1: Simple split (no quote handling)
    pub fn split_simple(line: &str) -> Vec<&str> {
        line.split(',').collect()
    }
    
    // Approach 2: Full CSV state machine with quote handling
    #[derive(Debug, PartialEq)]
    enum State {
        Normal,
        InQuote,
        AfterQuote,
    }
    
    pub fn parse_csv_line(line: &str) -> Vec<String> {
        let mut fields: Vec<String> = Vec::new();
        let mut current = String::new();
        let mut state = State::Normal;
    
        for c in line.chars() {
            match (&state, c) {
                (State::Normal, '"') => {
                    state = State::InQuote;
                }
                (State::Normal, ',') => {
                    fields.push(current.clone());
                    current.clear();
                }
                (State::Normal, c) => {
                    current.push(c);
                }
                (State::InQuote, '"') => {
                    state = State::AfterQuote;
                }
                (State::InQuote, c) => {
                    current.push(c);
                }
                (State::AfterQuote, '"') => {
                    // Escaped quote: "" inside quoted field
                    current.push('"');
                    state = State::InQuote;
                }
                (State::AfterQuote, ',') => {
                    fields.push(current.clone());
                    current.clear();
                    state = State::Normal;
                }
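                (State::AfterQuote, c) => {
                    // Malformed input (text right after a closing quote):
                    // keep the character, as the OCaml version does
                    current.push(c);
                    state = State::Normal;
                }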
            }
        }
        // Push last field
        fields.push(current);
        fields
    }
    
    // Approach 3: Parse multiple rows
    pub fn parse_csv(text: &str) -> Vec<Vec<String>> {
        text.lines()
            .filter(|line| !line.is_empty())
            .map(parse_csv_line)
            .collect()
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn test_simple_split() {
            assert_eq!(split_simple("a,b,c"), vec!["a", "b", "c"]);
            assert_eq!(split_simple("one"), vec!["one"]);
        }
    
        #[test]
        fn test_quoted_fields() {
            assert_eq!(
                parse_csv_line("\"hello\",\"world\",plain"),
                vec!["hello", "world", "plain"]
            );
        }
    
        #[test]
        fn test_comma_inside_quotes() {
            assert_eq!(
                parse_csv_line("\"one, two\",three"),
                vec!["one, two", "three"]
            );
        }
    
        #[test]
        fn test_escaped_quotes() {
            assert_eq!(
                parse_csv_line("\"say \"\"hi\"\"\",end"),
                vec!["say \"hi\"", "end"]
            );
        }
    
        #[test]
        fn test_empty_fields() {
            assert_eq!(parse_csv_line(",,"), vec!["", "", ""]);
            assert_eq!(parse_csv_line("a,,c"), vec!["a", "", "c"]);
        }
    
        #[test]
        fn test_mixed() {
            assert_eq!(
                parse_csv_line("name,\"Alice, Bob\",42"),
                vec!["name", "Alice, Bob", "42"]
            );
        }
    
        #[test]
        fn test_multi_row() {
            let csv = "a,b,c\n1,2,3\n\"x,y\",z,w";
            let rows = parse_csv(csv);
            assert_eq!(rows.len(), 3);
            assert_eq!(rows[0], vec!["a", "b", "c"]);
            assert_eq!(rows[2], vec!["x,y", "z", "w"]);
        }
    }

    Key Differences

    | Aspect | Rust | OCaml |
    |---|---|---|
    | String builder | String with push | Buffer with add_char |
    | Mutable state | let mut state = State::Normal | let state = ref Normal |
    | Pattern match | match (&state, c) | match !state, c |
    | Character iteration | line.chars() | String.iter (fun c -> ...) |
    | Final field | fields.push(current) after loop | fields := contents :: !fields, then List.rev |

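    CSV parsing is a classic case where a plain split(',') breaks on real-world data: any quoted field that contains a comma is cut in two. The three-state machine is a small FSM that handles RFC 4180-style quoting and "" escaping within a single line. Quoted fields that span several lines need more, because parse_csv splits the text into lines before any quote handling happens.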
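
One way to make the state machine explicit is to separate the transition table from the buffer handling. The sketch below is an illustration, not the tutorial's implementation: the `Action` enum and the names `step` and `parse_with_step` are assumptions introduced here. For a character that follows a closing quote and is neither `"` nor `,`, it keeps the character, as the OCaml version on this page does.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Normal,
    InQuote,
    AfterQuote,
}

// Hypothetical type: what the caller should do with the character just read.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Append(char), // add this character to the current field
    EndField,     // finish the current field and start a new one
    Skip,         // consume the character without producing output
}

// Pure transition function: (state, input) -> (next state, action).
pub fn step(state: State, c: char) -> (State, Action) {
    match (state, c) {
        (State::Normal, '"') => (State::InQuote, Action::Skip),
        (State::Normal, ',') => (State::Normal, Action::EndField),
        (State::Normal, c) => (State::Normal, Action::Append(c)),
        (State::InQuote, '"') => (State::AfterQuote, Action::Skip),
        (State::InQuote, c) => (State::InQuote, Action::Append(c)),
        // A second quote right after a quote is an escaped literal quote.
        (State::AfterQuote, '"') => (State::InQuote, Action::Append('"')),
        (State::AfterQuote, ',') => (State::Normal, Action::EndField),
        // Malformed input: keep the character and return to Normal.
        (State::AfterQuote, c) => (State::Normal, Action::Append(c)),
    }
}

// Driver loop: all buffer mutation lives here, none in `step`.
pub fn parse_with_step(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut state = State::Normal;
    for c in line.chars() {
        let (next, action) = step(state, c);
        match action {
            Action::Append(ch) => current.push(ch),
            Action::EndField => fields.push(std::mem::take(&mut current)),
            Action::Skip => {}
        }
        state = next;
    }
    fields.push(current); // the last field has no trailing comma
    fields
}
```

Because `step` has no side effects, each row of the transition table can be checked in isolation, independently of how fields are accumulated.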

    OCaml Approach

    type state = Normal | InQuote | AfterQuote
    
    let parse_csv_line line =
      let fields = ref [] in
      let current = Buffer.create 64 in
      let state = ref Normal in
    
      String.iter (fun c ->
        match !state, c with
        | Normal, '"'      -> state := InQuote
        | Normal, ','      -> fields := Buffer.contents current :: !fields;
                              Buffer.clear current
        | Normal, c        -> Buffer.add_char current c
        | InQuote, '"'     -> state := AfterQuote
        | InQuote, c       -> Buffer.add_char current c
        | AfterQuote, '"'  -> Buffer.add_char current '"'; state := InQuote
        | AfterQuote, ','  -> fields := Buffer.contents current :: !fields;
                              Buffer.clear current; state := Normal
        | AfterQuote, c    -> Buffer.add_char current c; state := Normal
      ) line;
      fields := Buffer.contents current :: !fields;
      List.rev !fields
    

    OCaml uses Buffer for efficient mutable string building. The Rust counterpart is a String, and Buffer.create 64 corresponds to String::with_capacity(64). The ref cell is how OCaml expresses mutable state in imperative style. match !state, c has the same structure as Rust's match (&state, c).
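
The Buffer correspondence can be sketched in Rust as follows. This is an illustration only: the function name `split_with_buffer` is hypothetical, and quote handling is left out on purpose so that the buffer operations stay in focus.

```rust
// Hypothetical helper: comma split that mirrors the OCaml Buffer calls.
pub fn split_with_buffer(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    // Counterpart of `Buffer.create 64`: reserve space up front.
    let mut current = String::with_capacity(64);
    for c in line.chars() {
        if c == ',' {
            // Counterpart of `Buffer.contents` followed by `Buffer.clear`:
            // copy the text out, then empty the buffer while keeping
            // its allocation for the next field.
            fields.push(current.clone());
            current.clear();
        } else {
            // Counterpart of `Buffer.add_char`.
            current.push(c);
        }
    }
    fields.push(current);
    fields
}
```

An alternative to `clone` plus `clear` is `std::mem::take(&mut current)`, which moves the string out without copying but leaves behind a fresh empty String that has no reserved capacity.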

    Deep Comparison

    CSV Parser — Comparison

    Core Insight

    CSV parsing requires a state machine to handle quoted fields. The algorithm is the same in both languages. OCaml expresses mutable state via ref cells; Rust uses let mut variables. OCaml accumulates the current field in a Buffer, Rust in a String. Rust's enum State and OCaml's type state are the same construct, a sum type with three constructors, and each is idiomatic in its own language.
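
Since both versions rely on a mutable state variable, it may help to see the same transitions threaded through a fold instead. The following is a sketch under that assumption, not code from the tutorial; the name `parse_csv_line_fold` is hypothetical.

```rust
#[derive(Clone, Copy)]
enum State {
    Normal,
    InQuote,
    AfterQuote,
}

// Hypothetical variant: the state travels in the fold accumulator
// (finished fields, field in progress, parser state).
pub fn parse_csv_line_fold(line: &str) -> Vec<String> {
    let init: (Vec<String>, String, State) = (Vec::new(), String::new(), State::Normal);
    let (mut fields, current, _) =
        line.chars().fold(init, |(mut fields, mut current, state), c| {
            match (state, c) {
                (State::Normal, '"') => (fields, current, State::InQuote),
                (State::Normal, ',') | (State::AfterQuote, ',') => {
                    fields.push(current);
                    (fields, String::new(), State::Normal)
                }
                (State::InQuote, '"') => (fields, current, State::AfterQuote),
                (State::AfterQuote, '"') => {
                    // Escaped quote inside a quoted field.
                    current.push('"');
                    (fields, current, State::InQuote)
                }
                (State::InQuote, c) => {
                    current.push(c);
                    (fields, current, State::InQuote)
                }
                (State::Normal, c) | (State::AfterQuote, c) => {
                    current.push(c);
                    (fields, current, State::Normal)
                }
            }
        });
    fields.push(current); // last field
    fields
}
```

The accumulator is moved into and out of the closure on every step, so no buffer is copied; the trade-off is that each match arm has to return the full triple.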

    OCaml Approach

  • type state = Normal | InQuote | AfterQuote — custom variant type
  • state := InQuote — mutable reference cell update
  • Buffer.create, Buffer.add_char, Buffer.contents — mutable character accumulation
  • String.iter (fun c -> ...) — iteration over the characters (bytes) of the string
  • String.split_on_char '\n' for line splitting
  • List.filter_map to skip empty lines

    Rust Approach

  • enum State { Normal, InQuote, AfterQuote } — same concept, idiomatic Rust
  • state = State::InQuote — direct assignment of enum variant
  • String::new(), push(c), clear() — mutable String accumulation
  • for c in line.chars() — iterator over chars (Unicode-safe)
  • match (&state, c) — tuple pattern matching on (state, char) pair
  • text.lines().filter(...).map(...).collect() — functional pipeline for rows

    Comparison Table

    | Aspect | OCaml | Rust |
    |---|---|---|
    | State type | type state = ... | enum State { ... } |
    | State mutation | state := InQuote | state = State::InQuote |
    | Char accumulation | Buffer.add_char current c | current.push(c) |
    | String from buffer | Buffer.contents current | current.clone() |
    | Loop style | String.iter (fun c -> ...) | for c in line.chars() |
    | Pattern on pair | match !state, c with | match (&state, c) |
    | Line iteration | String.split_on_char '\n' | text.lines() |
    | Skip empty | List.filter_map | .filter(\|l\| !l.is_empty()) |
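
One detail behind the "Line iteration" row: Rust's `lines()` strips a trailing `\r` as well as the `\n`, whereas splitting on `'\n'` alone leaves the carriage return in place. A small sketch, with hypothetical helper names chosen for this illustration:

```rust
// Uses str::lines, which treats both "\n" and "\r\n" as line endings.
pub fn non_empty_lines(text: &str) -> Vec<&str> {
    text.lines().filter(|l| !l.is_empty()).collect()
}

// Splits on '\n' only, similar in spirit to String.split_on_char '\n'.
pub fn non_empty_split(text: &str) -> Vec<&str> {
    text.split('\n').filter(|l| !l.is_empty()).collect()
}
```

On input with Windows line endings such as `"a,b\r\n\r\n1,2\n"`, the first helper yields `a,b` and `1,2`, while the second keeps the `\r` characters and reports a line consisting only of `\r` as non-empty.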

    Exercises

  • Extend the parser to handle multi-line records (quoted fields containing newlines).
  • Implement a full CSV file parser: split on newlines, parse each line, return Vec<Vec<String>>.
  • Add trimming of leading/trailing whitespace from unquoted fields.
  • Implement a streaming parser using impl Iterator<Item=Vec<String>> that processes one line at a time.
  • Write a property test: encode a Vec<Vec<String>> with the CSV writer (959), then parse with this parser — result should round-trip exactly.
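
As a possible starting point for the first exercise, the sketch below runs the state machine over the whole text instead of over pre-split lines, so a newline inside quotes stays part of the field. It is one approach among several; the name `parse_csv_multiline` is hypothetical, and dropping `\r` outside quotes is a simplification assumed here to cope with Windows line endings.

```rust
#[derive(Clone, Copy, PartialEq)]
enum State {
    Normal,
    InQuote,
    AfterQuote,
}

// Hypothetical multi-line variant: newlines end a row only outside quotes.
pub fn parse_csv_multiline(text: &str) -> Vec<Vec<String>> {
    let mut rows = Vec::new();
    let mut row: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut state = State::Normal;

    for c in text.chars() {
        match (state, c) {
            (State::InQuote, '"') => state = State::AfterQuote,
            // Inside quotes every other character is data, newlines included.
            (State::InQuote, c) => current.push(c),
            (State::AfterQuote, '"') => {
                current.push('"');
                state = State::InQuote;
            }
            (State::Normal, '"') => state = State::InQuote,
            (_, ',') => {
                row.push(std::mem::take(&mut current));
                state = State::Normal;
            }
            (s, '\n') => {
                // Skip completely blank lines, as parse_csv does.
                let blank = s == State::Normal && row.is_empty() && current.is_empty();
                if !blank {
                    row.push(std::mem::take(&mut current));
                    rows.push(std::mem::take(&mut row));
                }
                state = State::Normal;
            }
            // Simplification: ignore carriage returns outside quotes.
            (_, '\r') => {}
            (_, c) => {
                current.push(c);
                state = State::Normal;
            }
        }
    }
    // Flush a final row that is not terminated by a newline.
    if state != State::Normal || !row.is_empty() || !current.is_empty() {
        row.push(current);
        rows.push(row);
    }
    rows
}
```

An unterminated quote at the end of the input is not reported as an error in this sketch; the partial field is simply emitted.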