958 CSV Parser
Tutorial
The Problem
Implement a CSV line parser that handles quoted fields (fields containing commas or newlines wrapped in double quotes) and escaped quotes ("" inside a quoted field represents a literal "). Model the parser as a finite state machine with three states: Normal, InQuote, and AfterQuote. Implement both a simple split-only version and the full state-machine version.
🎯 Learning Outcomes
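- Split a simple, unquoted line with line.split(',').collect()
- Model the parser as three states: Normal, InQuote, AfterQuote
- Dispatch on the current state and character with match (&state, c)
- Handle the "" escape: transition from InQuote to AfterQuote on ", then back to InQuote on another " (escaped quote) or to Normal on , (end of field)
- Understand why split(',') is insufficient and a real state machine is required
Code Example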
#![allow(clippy::all)]
// 958: CSV Parser
// OCaml uses mutable Buffer + state ref; Rust uses an enum state machine with chars iterator

// Approach 1: Simple split (no quote handling)
pub fn split_simple(line: &str) -> Vec<&str> {
    line.split(',').collect()
}

// Approach 2: Full CSV state machine with quote handling
#[derive(Debug, PartialEq)]
enum State {
    Normal,
    InQuote,
    AfterQuote,
}

pub fn parse_csv_line(line: &str) -> Vec<String> {
    let mut fields: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut state = State::Normal;

    for c in line.chars() {
        match (&state, c) {
            (State::Normal, '"') => {
                state = State::InQuote;
            }
            (State::Normal, ',') => {
                fields.push(current.clone());
                current.clear();
            }
            (State::Normal, c) => {
                current.push(c);
            }
            (State::InQuote, '"') => {
                state = State::AfterQuote;
            }
            (State::InQuote, c) => {
                current.push(c);
            }
            (State::AfterQuote, '"') => {
                // Escaped quote: "" inside quoted field
                current.push('"');
                state = State::InQuote;
            }
            (State::AfterQuote, ',') => {
                fields.push(current.clone());
                current.clear();
                state = State::Normal;
            }
            (State::AfterQuote, _) => {
                state = State::Normal;
            }
        }
    }

    // Push last field
    fields.push(current);
    fields
}

// Approach 3: Parse multiple rows
pub fn parse_csv(text: &str) -> Vec<Vec<String>> {
    text.lines()
        .filter(|line| !line.is_empty())
        .map(parse_csv_line)
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_simple_split() {
        assert_eq!(split_simple("a,b,c"), vec!["a", "b", "c"]);
        assert_eq!(split_simple("one"), vec!["one"]);
    }

    #[test]
    fn test_quoted_fields() {
        assert_eq!(
            parse_csv_line("\"hello\",\"world\",plain"),
            vec!["hello", "world", "plain"]
        );
    }

    #[test]
    fn test_comma_inside_quotes() {
        assert_eq!(
            parse_csv_line("\"one, two\",three"),
            vec!["one, two", "three"]
        );
    }

    #[test]
    fn test_escaped_quotes() {
        assert_eq!(
            parse_csv_line("\"say \"\"hi\"\"\",end"),
            vec!["say \"hi\"", "end"]
        );
    }

    #[test]
    fn test_empty_fields() {
        assert_eq!(parse_csv_line(",,"), vec!["", "", ""]);
        assert_eq!(parse_csv_line("a,,c"), vec!["a", "", "c"]);
    }

    #[test]
    fn test_mixed() {
        assert_eq!(
            parse_csv_line("name,\"Alice, Bob\",42"),
            vec!["name", "Alice, Bob", "42"]
        );
    }

    #[test]
    fn test_multi_row() {
        let csv = "a,b,c\n1,2,3\n\"x,y\",z,w";
        let rows = parse_csv(csv);
        assert_eq!(rows.len(), 3);
        assert_eq!(rows[0], vec!["a", "b", "c"]);
        assert_eq!(rows[2], vec!["x,y", "z", "w"]);
    }
}
Key Differences
| Aspect | Rust | OCaml |
|---|---|---|
| String builder | String with push | Buffer with add_char |
| Mutable state | let mut state = State::Normal | let state = ref Normal |
| Pattern match | match (&state, c) | match !state, c |
| Character iteration | line.chars() | String.iter (fun c -> ...) |
| Final field | fields.push(current) after loop | fields := contents :: !fields + List.rev |
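CSV parsing is a classic case where a plain split(',') fails on real-world data: any quoted field that contains a comma is cut in two. The three-state machine is a minimal FSM for RFC 4180-style quoting and "" escaping within a single line. Because parse_csv splits the input with text.lines() before parsing, a quoted field that contains a newline is not supported by this version.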
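To make the failure concrete, here is a small sketch that runs the naive splitter on a line with a quoted comma. `show_split_failure` is a hypothetical helper introduced only for this illustration; `split_simple` is the same one-liner as in the listing above.

```rust
/// Naive splitting: every comma counts as a separator,
/// even a comma that sits inside a quoted field.
pub fn split_simple(line: &str) -> Vec<&str> {
    line.split(',').collect()
}

/// Hypothetical demo helper (not part of the original listing).
/// The intended fields are: name | Alice, Bob | 42
pub fn show_split_failure() -> Vec<&'static str> {
    split_simple("name,\"Alice, Bob\",42")
}
```

The call returns four pieces instead of three, and the quote characters are still attached to the two halves of the middle field. This is the behavior the state machine is meant to replace.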
OCaml Approach
type state = Normal | InQuote | AfterQuote

let parse_csv_line line =
  let fields = ref [] in
  let current = Buffer.create 64 in
  let state = ref Normal in
  String.iter (fun c ->
    match !state, c with
    | Normal, '"' -> state := InQuote
    | Normal, ',' ->
      fields := Buffer.contents current :: !fields;
      Buffer.clear current
    | Normal, c -> Buffer.add_char current c
    | InQuote, '"' -> state := AfterQuote
    | InQuote, c -> Buffer.add_char current c
    | AfterQuote, '"' -> Buffer.add_char current '"'; state := InQuote
    | AfterQuote, ',' ->
      fields := Buffer.contents current :: !fields;
      Buffer.clear current; state := Normal
    | AfterQuote, c -> Buffer.add_char current c; state := Normal
  ) line;
  fields := Buffer.contents current :: !fields;
  List.rev !fields
OCaml uses Buffer for efficient mutable string building (Rust equivalent: String::with_capacity). The ref pattern for mutable state is OCaml's imperative style. The match !state, c has identical structure to Rust's match (&state, c).
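As a rough Rust counterpart to `Buffer.create 64`, the sketch below preallocates the field buffer with `String::with_capacity(64)`. The function name `parse_csv_line_prealloc` and the choice to return the next state from each match arm are assumptions made for this illustration, not the tutorial's own code. The parsing behavior is meant to match the Rust listing.

```rust
#[derive(Clone, Copy)]
enum State {
    Normal,
    InQuote,
    AfterQuote,
}

/// Hypothetical variant of the line parser that reserves buffer space up front.
pub fn parse_csv_line_prealloc(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    // Counterpart of `Buffer.create 64`: reserve room for 64 bytes.
    let mut current = String::with_capacity(64);
    let mut state = State::Normal;

    for c in line.chars() {
        state = match (state, c) {
            (State::Normal, '"') => State::InQuote,
            (State::Normal, ',') | (State::AfterQuote, ',') => {
                // Like `Buffer.contents` followed by `Buffer.clear`:
                // copy the text out, then reset the length.
                // `clear` keeps the allocation for the next field.
                fields.push(current.clone());
                current.clear();
                State::Normal
            }
            (State::Normal, c) => {
                current.push(c);
                State::Normal
            }
            (State::InQuote, '"') => State::AfterQuote,
            (State::InQuote, c) => {
                current.push(c);
                State::InQuote
            }
            // Escaped quote: "" inside a quoted field.
            (State::AfterQuote, '"') => {
                current.push('"');
                State::InQuote
            }
            // Stray character after a closing quote: dropped, as in the Rust listing.
            (State::AfterQuote, _) => State::Normal,
        };
    }

    fields.push(current);
    fields
}
```

For example, `parse_csv_line_prealloc("a,\"b,c\"")` yields the two fields `a` and `b,c`. Whether preallocation helps in practice depends on typical field lengths, so treat the 64 as a placeholder.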
Deep Comparison
CSV Parser — Comparison
Core Insight
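CSV parsing requires a state machine to handle quoted fields. The algorithm is essentially the same in both languages: OCaml expresses mutable state via ref cells; Rust uses let mut variables. Both accumulate the current field in a growable buffer (Buffer in OCaml, String in Rust). The state type is equally natural in each language, since an OCaml variant type and a Rust enum express the same idea. The listings differ in one detail: when a character other than " or , follows a closing quote, the OCaml version appends it to the field, while the Rust version discards it.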
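One way to see the state machine on its own is to separate the transitions from the field accumulation. The sketch below is an illustration under that assumption: `next_state` and `trace_states` are hypothetical names, and the transitions are copied from the Rust listing with the buffer handling left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Normal,
    InQuote,
    AfterQuote,
}

/// Pure transition function: which state follows `state` on input `c`.
/// Field accumulation is deliberately omitted.
pub fn next_state(state: State, c: char) -> State {
    match (state, c) {
        (State::Normal, '"') => State::InQuote,
        (State::Normal, _) => State::Normal,
        (State::InQuote, '"') => State::AfterQuote,
        (State::InQuote, _) => State::InQuote,
        // A second quote means the pair was an escaped quote.
        (State::AfterQuote, '"') => State::InQuote,
        // A comma (or anything else) ends the quoted section.
        (State::AfterQuote, _) => State::Normal,
    }
}

/// Hypothetical helper: the state reached after each character of `line`.
pub fn trace_states(line: &str) -> Vec<State> {
    let mut state = State::Normal;
    line.chars()
        .map(|c| {
            state = next_state(state, c);
            state
        })
        .collect()
}
```

Tracing the input `"x""y",z` visits InQuote, InQuote, AfterQuote, InQuote, InQuote, AfterQuote, Normal, Normal. The brief visit to AfterQuote in the middle is the escaped quote being recognized.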
OCaml Approach
- type state = Normal | InQuote | AfterQuote (custom variant type)
- state := InQuote (mutable reference cell update)
- Buffer.create, Buffer.add_char, Buffer.contents (mutable character accumulation)
- for i = 0 to n - 1 do ... done (imperative iteration over string indices)
- String.split_on_char '\n' for line splitting
- List.filter_map to skip empty lines
Rust Approach
- enum State { Normal, InQuote, AfterQuote } (same concept, idiomatic Rust)
- state = State::InQuote (direct assignment of enum variant)
- String::new(), push(c), clear() (mutable String accumulation)
- for c in line.chars() (iterator over chars, Unicode-safe)
- match (&state, c) (tuple pattern matching on the (state, char) pair)
- text.lines().filter(...).map(...).collect() (functional pipeline for rows)
Comparison Table
| Aspect | OCaml | Rust |
|---|---|---|
| State type | type state = ... | enum State { ... } |
| State mutation | state := InQuote | state = State::InQuote |
| Char accumulation | Buffer.add_char current c | current.push(c) |
| String from buffer | Buffer.contents current | current.clone() |
| Loop style | for i = 0 to n-1 | for c in line.chars() |
| Pattern on pair | match !state, c with | match (&state, c) |
| Line iteration | String.split_on_char '\n' | text.lines() |
| Skip empty | List.filter_map | .filter(\|l\| !l.is_empty()) |
Exercises
- Vec<Vec<String>>.
- impl Iterator<Item=Vec<String>> that processes one line at a time.
- Vec<Vec<String>> with the CSV writer (959), then parse with this parser; the result should round-trip exactly.
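For the exercise about `impl Iterator<Item=Vec<String>>`, one possible starting point is sketched below. The name `parse_csv_iter` is an assumption, and the line parser is a compact restatement of the tutorial's version so that the block compiles on its own. It is a sketch, not a reference solution.

```rust
#[derive(Clone, Copy)]
enum State {
    Normal,
    InQuote,
    AfterQuote,
}

/// Compact restatement of the three-state line parser.
pub fn parse_csv_line(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut state = State::Normal;
    for c in line.chars() {
        state = match (state, c) {
            (State::Normal, '"') => State::InQuote,
            (State::Normal, ',') | (State::AfterQuote, ',') => {
                fields.push(current.clone());
                current.clear();
                State::Normal
            }
            (State::Normal, c) => {
                current.push(c);
                State::Normal
            }
            (State::InQuote, '"') => State::AfterQuote,
            (State::InQuote, c) => {
                current.push(c);
                State::InQuote
            }
            (State::AfterQuote, '"') => {
                current.push('"');
                State::InQuote
            }
            (State::AfterQuote, _) => State::Normal,
        };
    }
    fields.push(current);
    fields
}

/// Hypothetical lazy variant of `parse_csv`: rows are parsed one line
/// at a time, only when the caller asks for the next item.
pub fn parse_csv_iter(text: &str) -> impl Iterator<Item = Vec<String>> + '_ {
    text.lines()
        .filter(|line| !line.is_empty())
        .map(parse_csv_line)
}
```

A caller can write `for row in parse_csv_iter(text) { ... }` and stop early without parsing the remaining lines. Like `parse_csv`, this version splits on lines first, so a quoted field that spans lines is still out of scope.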