CSV and JSON are the two most common formats for exchanging structured data in modern software, and converting between them reliably is one of the most frequent data-engineering tasks on any project. CSV is the lingua franca of spreadsheets, analytics exports, and legacy systems; JSON is the standard for web APIs, modern data pipelines, and document databases. The sections below explain the subtle edge cases that trip up naive converters, when to choose each format, and how to interpret the Data Inspector output to catch quality problems before they reach production.
Edge Cases That Trip Up Naive Converters
A naive CSV parser splits each line on the delimiter character and calls it a day, but real-world CSV files contain several edge cases that break that approach. Quoted fields are the most common: a cell like `"Smith, John"` contains a comma that is part of the value rather than a field separator, and the parser must track quote state as it walks the line. Embedded newlines inside quoted fields are legal per RFC 4180 but frequently mis-parsed — a CSV row can span multiple physical lines when a quoted cell wraps. Escaped double quotes inside quoted fields are represented as two consecutive double quotes (`"He said ""hello"""`), which must be collapsed to a single quote in the output. Unicode BOMs (byte-order marks) appear at the start of files saved by Excel on Windows and create invisible junk bytes in the first field name unless stripped. Different systems use different line endings (LF on Unix, CRLF on Windows, occasionally bare CR on old Mac files), and the parser must handle all three to avoid phantom blank rows or stray carriage returns in field values. This converter handles every one of these cases correctly, but if you ever need to parse CSV in your own code, don't roll your own parser: use a well-tested library such as Papa Parse in JavaScript, the csv module in Python, or an equivalent for your language.
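If you do end up handling CSV input yourself, the sketch below shows how Python's standard `csv` module copes with these edge cases; the sample data, field names, and values are invented purely for illustration and are not part of this converter.

```python
import csv
import io

# Sample CSV exercising the edge cases above: a BOM, a quoted comma, an
# escaped double quote, an embedded newline, and CRLF line endings.
raw = (
    '\ufeffname,quote,notes\r\n'
    '"Smith, John","He said ""hello""","spans\ntwo lines"\r\n'
)

# Decoding with utf-8-sig strips a leading BOM if one is present.
text = raw.encode('utf-8').decode('utf-8-sig')

# newline='' leaves line endings untranslated so the csv module can manage
# quoted newlines and CR/LF/CRLF terminators itself.
for row in csv.reader(io.StringIO(text, newline='')):
    print(row)
# ['name', 'quote', 'notes']
# ['Smith, John', 'He said "hello"', 'spans\ntwo lines']
```

Decoding with `utf-8-sig` is harmless when no BOM is present, and passing `newline=''` mirrors what the csv module's documentation recommends when opening files, so quoted newlines and mixed line endings are handled by the parser rather than by a premature line split.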
When to Choose CSV vs JSON
The format choice depends on the consumer and the data shape rather than the source. Choose CSV when the data is fundamentally tabular (rows of homogeneous records) and the consumer is a spreadsheet tool (Excel, Google Sheets, Numbers), a legacy ETL pipeline, or an analyst who'll open the file to eyeball it. CSV has almost no structural overhead — close to the smallest possible representation of tabular data — which matters when files reach gigabytes. The downsides: CSV has no type system (everything is a string), no support for nested structures (hierarchical data must be flattened), and no formal way to represent nulls (usually an empty string or a sentinel like `NULL`). Choose JSON when records are heterogeneous (different shape per record), when data has natural nesting (orders with line items, users with addresses), or when the consumer is a modern API, JavaScript application, or document database. JSON preserves native types and structure. The downsides are size (JSON with repeated keys is typically 2–5× larger than the equivalent CSV) and parsing cost (a standard JSON document must usually be parsed in full before use, while CSV can be processed row by row). JSONL threads the needle for very large datasets that are naturally hierarchical — each line is independently parseable, enabling true streaming without holding the entire file in memory.
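As a concrete illustration of the streaming point, here is a minimal Python sketch of reading JSONL one record at a time; the file name `orders.jsonl` and the `line_items` field are hypothetical examples, not part of this tool.

```python
import json

def iter_jsonl(path):
    """Yield one parsed record per line without loading the whole file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:                      # tolerate blank lines
                yield json.loads(line)

# Usage (assuming a hypothetical orders.jsonl with nested line items):
# for order in iter_jsonl('orders.jsonl'):
#     print(order['id'], len(order.get('line_items', [])))
```

Because each line is parsed independently, memory use stays flat regardless of file size, and a malformed record can be skipped or logged without abandoning the rest of the file.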
Using the Data Inspector to Catch Problems
The Data Inspector tab surfaces data-quality issues that would otherwise only appear downstream when something breaks. The most important signal is the per-column fill rate — the percentage of rows that have a non-null value in each column. A column at 100% fill is fully populated; a column at 50% fill has half its rows empty, which may be correct (optional fields) or may indicate upstream data corruption. Spot-check columns with under 80% fill before sending the data anywhere important. Type inference reports whether a column is consistently numeric, consistently string, or mixed. Mixed-type columns are usually a bug in the source — typically a numeric column with a few stray text entries that will cause type errors in a strictly-typed downstream consumer. The unique-value count catches duplicate keys: if your supposed primary-key column has fewer unique values than total rows, the source data has duplicates that need deduplication before import. Sample values give you a quick sanity check on the first few values in each column, surfacing problems such as dates that parsed as strings, numbers with currency symbols still attached, or trailing whitespace that will cause string matches to fail. Together these diagnostics catch roughly 90% of real-world CSV/JSON data quality problems before they leave this tool.
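For readers who want to reproduce these checks outside the tool, the rough Python sketch below computes per-column fill rate, a naive type inference, and unique-value counts. It illustrates the ideas only — it is not the Data Inspector's actual implementation — and the sample rows are invented.

```python
import csv
import io

# Invented sample data: a duplicate id, a stray text value in a numeric
# column, and a partially filled note column.
sample_csv = """id,price,note
1,9.99,ok
2,free,
2,12.50,dup id
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for col in rows[0].keys():
    values = [r[col] for r in rows]
    non_empty = [v for v in values if v not in ('', None)]
    fill_rate = len(non_empty) / len(values)

    # Crude numeric test: digits with an optional sign and one decimal point.
    numeric = sum(1 for v in non_empty
                  if v.replace('.', '', 1).lstrip('-').isdigit())
    if numeric == len(non_empty):
        inferred = 'numeric'
    elif numeric == 0:
        inferred = 'string'
    else:
        inferred = 'mixed'                # stray text in a numeric column

    print(f"{col}: fill={fill_rate:.0%} type={inferred} "
          f"unique={len(set(non_empty))}")
# id: fill=100% type=numeric unique=2   <- 2 unique values in 3 rows: duplicates
# price: fill=100% type=mixed unique=3  <- 'free' mixed into a numeric column
# note: fill=67% type=string unique=2
```

The same three numbers per column — fill rate, inferred type, unique count — are enough to flag the duplicate key, the mixed-type column, and the sparsely filled field in the sample before the data goes anywhere important.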