World's Smallest CSV Parser (C#) (github.com/kjpgit)
64 points by vilark 8 months ago | 71 comments



To counterbalance this, an extremely fast CSV parser (also written in C#, uses SIMD and multi-threading): https://github.com/nietras/Sep/

P.S.: This one is unfortunately another parser that drains char-by-char into a list/vec/buffer, a parsing approach that is very inefficient and plagues many languages, and which prevents it from taking advantage of the vectorized string.Split. But other than that, I'm happy more people are noticing .NET.


What's the utility of defining the "Error" exception? Why not use an existing one, say InvalidOperationException, or a plain Exception? Is making your own better practice?


There is no utility. It's perhaps written for JavaScript developers who are used to Error... but it's not idiomatic C#. Might be indicative of Copilot, too.

The use of a class-scoped `StringBuilder` that only one method uses, and `ReadQuotedColumn`/`ReadNonQuotedColumn` yielding one character at a time rather than accepting the builder, isn't a good sign either (for efficiency). Nor is casting everything to a `char` (this won't support UTF8), or assuming that an end quote followed by anything (:71) is a valid way to end a field.


C# `char` is a UTF-16 code unit, not a byte; a byte is just `byte`.

Having StringBuilder be a private field on the parser instance is not an issue either - it is simply reused.


Iterating over the `char`s does not support the full range of what can be stored in a C# string (for instance, graphemes outside the Basic Multilingual Plane are serialized as surrogate pairs and therefore occupy two `char`s in a C# string).

.Net provides a TextElementEnumerator that will iterate over graphemes instead: https://learn.microsoft.com/en-us/dotnet/api/system.globaliz...

There's a fairly comprehensive guide to working with .net character encodings at https://learn.microsoft.com/en-us/dotnet/standard/base-types... .
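A minimal sketch of what that looks like (the string value here is just an example, not anything from the submission):

    using System.Globalization;

    var s = "a👍b";  // the emoji is stored as a surrogate pair (two chars) but is one text element
    var e = StringInfo.GetTextElementEnumerator(s);
    while (e.MoveNext())
        Console.WriteLine((string)e.Current);  // prints "a", "👍", "b"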


The return value of StreamReader.Read() will always be between -1 and char.MaxValue.

All surrogate pairs will be drained into the StringBuilder, working correctly. Most implementations agree that torn UTF-16 surrogate pairs (which only occur for code points outside the Basic Multilingual Plane) may exist in the input and will be passed through as is, which is different from what UTF-8 implementations choose (Rust is strict about this; Go lets you tear code points arbitrarily).

We, as a community, can do better than to jump to immediate criticism of this type.


If you (a consuming dev) want the world's smallest (in your code), use the .NET built-in parser[0]. Bonus: it's RFC 4180 compliant.

If you (competing/learning) want to write the world's smallest (code-golf style)... this isn't it, and it has some weird superfluous lines (if that's your measure, per the original question).

If you (learning) want to write an efficient parser... this isn't it. You don't need a StringBuilder; you can seek the Stream to collect the (already formed) strings directly from the source instead of doing a char-by-char memory copy and rebuild. Yes, that limits your stream choices, but since the example/tests only use FileStreams (which are seekable) you might not come across other kinds. If you need to use un-seekable streams, then you'll need to use a large enough buffer.

[0]: https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...


This is not a correct link (it refers to VB.NET). There are better parsers out there (Sep).

I'm not sure what your point is, but it certainly misses the idea behind this HN submission and makes me sad, as it would be nice to see words of encouragement in .NET submissions here instead.


It's an assembly with "Microsoft.VisualBasic" in the name, but it shipped as part of every version of .NET to date, and is perfectly usable from C#. In fact, I would be very surprised if there aren't vastly more uses of this API from C#, since it's a very old trick of the trade.

What GP is saying is that, given that it is already included in the standard class library, it's always the cheapest option wrt the size of your shipping app. So it should arguably be the default choice for any .NET dev unless they either need better performance or have some more exotic requirements wrt input format.
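For reference, a rough sketch of calling it from C# (the file name is made up, and depending on the project type you may need an explicit reference to the Microsoft.VisualBasic assembly):

    using Microsoft.VisualBasic.FileIO;

    using var parser = new TextFieldParser("data.csv");  // path is illustrative
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;
    while (!parser.EndOfData)
    {
        var fields = parser.ReadFields();  // one record's columns, quoting handled for you
    }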


What is it with .NET or C# submissions (but I suppose other languages are not immune either) that attracts this type of reply, which misses the point behind a particular piece of code, trivial or not?

Yes, there are existing implementations, many of which are incomparably better, one of which ships with default project SDK (even if it is effectively obsolete[0]). But surely offering a competitive implementation that intends to replace existing solutions wasn't the purpose of this?

Either way, I'm not the author of the code and have already spent enough (free) time in the last 8 months working on a string library which has performant parsing as one of the project goals[1].

[0] https://github.com/dotnet/runtime/tree/main/src/libraries/Mi...

[1] https://github.com/U8String/U8String


The unit tests have an emoji test (which uses a surrogate pair). I thought I would have to use Runes, but it's not necessary. https://github.com/kjpgit/SmallestCSVParser/blob/master/Smal...


You imply that a string, reversed, would have the same length as the original.

This is not true.


Where are they implying this and why would the strings not have the same length? Is there normalization implied somewhere?


If they weren't reversing it, what other operation would separate grapheme clusters?


> Having StringBuilder be a private field on the parser instance is not an issue either - it is simply reused.

It doesn’t matter for this API, but it is a code smell. It makes the class not reentrant.

Talking of the API, I would make it simpler to use and more idiomatic by making the entire public API

    static IEnumerable<List<string>> Parse(StreamReader sr)
That call would store the parser state (currently just the StreamReader and that reused StringBuilder) in a private inner class. There would not be a constructor of the publicly visible class, removing that code smell.
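Roughly something like this, as a sketch only (the names and the naive line handling are illustrative, not the submission's code; the real field/quote logic would live in the inner class):

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    public static class SmallestCsv
    {
        // The only public surface: an iterator over records.
        public static IEnumerable<List<string>> Parse(StreamReader sr)
        {
            var state = new State(sr);
            while (state.TryReadRecord(out var record))
                yield return record;
        }

        // Private inner class holds the reader and the reused StringBuilder.
        private sealed class State
        {
            private readonly StreamReader _reader;
            private readonly StringBuilder _sb = new();  // reused by the real field-reading code (elided here)

            public State(StreamReader reader) => _reader = reader;

            public bool TryReadRecord(out List<string> record)
            {
                var line = _reader.ReadLine();
                if (line is null) { record = new(); return false; }
                // Naive split; quoting/escaping omitted, only the API shape matters here.
                record = new List<string>(line.Split(','));
                return true;
            }
        }
    }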


I will add a micro benchmark to see if the `yield return` is slowing things down, compared to just calling _sb.Append() inside Read*(). I will also see if it looks cleaner that way. To be honest, the `yield return` is currently in there just because I thought it was "cool".


30% performance improvement after removing the `yield return`, and readability is probably better too.
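For anyone curious what that change amounts to, a simplified illustration of the two shapes (not the repo's actual methods):

    using System.Text;

    // Yield-per-char: every character goes through an iterator MoveNext/Current round trip.
    static IEnumerable<char> ReadColumnChars(TextReader r)
    {
        int c;
        while ((c = r.Read()) >= 0 && c != ',')
            yield return (char)c;
    }

    // Direct append: the loop writes straight into the shared StringBuilder.
    static void ReadColumnInto(TextReader r, StringBuilder sb)
    {
        int c;
        while ((c = r.Read()) >= 0 && c != ',')
            sb.Append((char)c);
    }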


It's good practice to throw an exception from your own namespace if you're writing a library.

You don't want to expose an implementation detail like some specific exception as part of your public API and have to worry about breaking that later.

You could reuse some built-in exception, but IMO that's not the best practice. You muddy your API, and a caller has to wrap your exception if they want to bubble it up and catch it specifically anyway.
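A minimal sketch of that, with a made-up library namespace and exception name:

    namespace MyCsvLibrary
    {
        // Callers can catch this specifically without also catching unrelated framework exceptions.
        public class CsvParseException : System.Exception
        {
            public CsvParseException(string message) : base(message) { }
        }
    }

A caller can then write `catch (MyCsvLibrary.CsvParseException)` and stays insulated from whatever built-in exceptions the implementation happens to use internally.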


> Why not use ... a plain Exception.

It is forbidden.

https://learn.microsoft.com/en-us/dotnet/standard/exceptions...

> Exception ... None (use a derived class of this exception).

https://learn.microsoft.com/en-us/dotnet/standard/design-gui...

> DO NOT throw System.Exception or System.SystemException.


It’s a guideline not to. It’s not a hard rule or forbidden.


I'm not seeing where it says not to throw InvalidOperationException.


InvalidOperationException means "the object is in an inappropriate state". That does not describe a parse error.

C# conventions for exceptions are admittedly a bit confusing. There are a handful of very specific scenarios where you're supposed to use a built-in exception (most commonly ArgumentException). For everything else, you want to define your own type.


The state of the stream, such that it's pointing to an illegal character, actually does seem to be invalid though. Maybe this is an overly pedantic argument. I've been writing quite a bit of C# for over a decade and have basically been doing this the whole time. I thought that I knew how to use exceptions, but it seems I do not.


If you want to recover from CSV errors you should ideally throw InvalidCSVException.


Very nice. The submission's main file (SmallestCSVParser.cs) is 3851 characters (of which 657 are commentary).

Mine, in C, is only 2807 characters (of which 198 are commentary):

https://github.com/semitrivial/csv_parser/blob/master/csv.c

Ahh, but the submission's main file is for parsing an entire .csv, whereas mine is only for parsing a single "line" (possibly including quote-escaped newlines). So the submission wins :)


Do you need more than 1k of additional source to wrap the line parsing in a loop? I doubt it.


Linebreaks can be escaped in CSV, so splitting a file into rows is actually ~1/3 the complexity of parsing a whole row.

See: https://github.com/semitrivial/csv_parser/blob/master/split....

Though I suppose that's the naive approach. You could combine the two into a single file by, like you say, wrapping the row-parser in a (clever, non-trivial) outer loop, and it probably wouldn't take anywhere near 1000 characters to do that...


> Linebreaks can be escaped in CSV

In some variants of CSV. There isn’t agreement on the format. For example, https://www.ietf.org/rfc/rfc4180.txt says

“While there are various specifications and implementations for the CSV format (for ex. [4], [5], [6] and [7]), there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.”

That RFC doesn’t even agree with itself, saying

“1. Each record is located on a separate line, delimited by a line break (CRLF).”

but then following that up with:

“6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes”


There is no contradiction, as (1) does not say that "each record is located on a (exactly one) single separate line". But it could be better phrased, like "two consecutive records are separated by...".

This is the "mathematical a", which does not mean "exactly one" but "at least one, and we don't care how many, we already did the interesting work". Like in "this problem has a solution".


Parse the whole file in one go. You need to track opening and closing quotes (and escaped ones) anyway, so there is no need to distinguish between commas (semicolons, tabs) and newlines.

Btw, you are not handling escaped double quotes in strings at all, and if you did you'd also need to count the number of backslashes. Oh, and no need to escape double quotes in single quotes ('\" could just be '"').


Back in the day (2008) I created an AutoHotkey v1 function for parsing a delimiter-separated line, called ReturnDSVArray.

It can be found here: https://www.autohotkey.com/board/topic/30102-how-can-i-parse...

It consists of some 30-ish lines of code and 67 lines with comments and a usage example.


The code seems fairly nice at first glance, but the world's smallest?


Don't just parse - convert. In a pipeline to split-parseable data, if you like, such as the possibly smaller, faster, and more general: https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim (And, ideally, convert all the way to a mmap & go binary format like nio so you don't have to re-parse.)


Fun! Convert to js:

    csv.split('\n').join('".split(\',\'));a.push("');

So that each line becomes:

    a.push("foo,bar,baz".split(','));

And then we have an array of arrays.


Of course the “hard part” of CSV parsing is dealing with escapes, which break simple splits.

But now I’m wondering if a good approach might be to split on the escape character and then reassemble / parse from there, safe in the knowledge that every character has exactly one interpretation.
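Something along those lines can work for RFC 4180-style quoting, where the quote character plays the role of the "escape". A rough, unoptimized C# sketch (assumes a single well-formed line with no embedded newlines):

    static List<string> ParseLine(string line)
    {
        var fields = new List<string> { "" };
        var parts = line.Split('"');              // even indexes: outside quotes, odd: inside
        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 1)
                fields[^1] += parts[i];           // inside quotes: take the text literally
            else if (i > 0 && i + 1 < parts.Length && parts[i].Length == 0)
                fields[^1] += "\"";               // empty gap between quoted runs = doubled quote
            else
            {
                var cells = parts[i].Split(',');  // outside quotes: commas delimit fields
                fields[^1] += cells[0];
                for (int j = 1; j < cells.Length; j++) fields.Add(cells[j]);
            }
        }
        return fields;
    }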


"Normal" CSV doesn't have escape characters. Quotes in quoted strings are escaped by doubling then, and everything else (including newlines) is interpreted as is inside quoted strings.


There is no spec or standard or consensus on “Normal” CSV.

I like it when CSV follows RFC 4180 too - but it’s descriptive not prescriptive.


There is no normal CSV! I always used Excel as the "standard" when writing a CSV parser.

If every field is quoted you can indeed remove the first and last ", then split on "," and then replace "" with " in the fields.
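Roughly, and only under that "every field is quoted" assumption (the single-string Split overload needs .NET Core / .NET 5+):

    using System.Linq;

    static string[] SplitFullyQuotedLine(string line)
    {
        var inner = line.Substring(1, line.Length - 2);    // drop the outermost quotes
        return inner.Split("\",\"")                        // split on the "," between fields
                    .Select(f => f.Replace("\"\"", "\""))  // un-double embedded quotes
                    .ToArray();
    }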


That is precisely why I put "normal" in quotes.

Nevertheless, if there is a way to escape anything at all, usually it is the quotation mark, and usually it is escaped by doubling. Pretty much any other scheme is very unlikely to be properly interpreted in this context.


Yes indeed. To make it easy to parse, everything has to be quoted. If some things are quoted then you can't just split on comma, because for example:

    ", is a cat",", is my boyfriend",123
etc.


I think of what you say as really just the first step on the path to the parsing state machine in the c2tsv.nim (or /c2dsv.c in the same folder) thing I mentioned above, both of which have comments in their source code.

I think it helps to frame the problem more like: "How do I translate a complex-syntax buffered input stream into a buffered output stream, where 'most' of the time I just translate ',' to '\t', and stay 'almost' as fast as a Unix `tr , \\t`?" If there were no escaping/quoting, the output buffer could literally be the same memory as the input, just with the delimiter bytes changed.

The next step is realizing that you can still just do this byte translation if you "flush" the IO buffer opportunistically at syntactically relevant times. That gets you the "almost" performance. (Scare quotes on "almost" since you might do a few more IO-system calls with certain kinds of dense syntax, but unlike your "reassemble" there won't be any allocations. Various trade-offs, but a nifty design.)

There are other nice aspects to the "partitioned program design" mentioned in a sibling-ish comment, but, all together, I think it is a pretty tidy solution.
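To make the idea concrete, a much-simplified C# rendition of that convert step (this is not the c2tsv tool itself; embedded newlines and tabs inside quoted fields are crudely replaced with spaces so the output stays line- and tab-splittable):

    static void CsvToTsv(TextReader input, TextWriter output)
    {
        bool inQuotes = false;
        int c;
        while ((c = input.Read()) >= 0)
        {
            char ch = (char)c;
            if (inQuotes)
            {
                if (ch == '"')
                {
                    if (input.Peek() == '"') { input.Read(); output.Write('"'); }  // doubled quote
                    else inQuotes = false;                                         // closing quote
                }
                else
                    output.Write(ch == '\n' || ch == '\r' || ch == '\t' ? ' ' : ch);  // lossy but splittable
            }
            else if (ch == '"') inQuotes = true;     // opening quote is dropped
            else if (ch == ',') output.Write('\t');  // the actual "tr , \t" step
            else output.Write(ch);
        }
    }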


In my silliness I forgot to brag about the output still being CSV.

If you want to enjoy all the strange escapery, extra commas, line breaks and wrapping quotes, you may describe it in code in the first and last 2 columns.


It's also expressible (and is vectorized) in C#. But that's the author's code, not mine :)


That's right - pure splitting is much more SIMD-friendly than... the whole syntax melange. This is another charm of the "partitioned" design. The conversion to split-parseable TSV can run on its own CPU core and the SIMD splitting on its own core. As long as pipe bandwidth suffices, you have very easy parallelism. This kind of design/intent was popular on Unix at the dawn of multiprocessing, when there were still Giant Kernel Locks. But it still has merits, even on Windows.

If you have a lot of data (and space for it, e.g. in /dev/shm) you can also save all the converted data to a TSV file. That's now soundly "partitionable" at the "nearest ASCII newline to 1/N bytes" and you can then go core-parallel as well as SIMD within cores (but that N-wise pass with memory mapping or mem.views). Admittedly this is probably more helpful when you are doing more computation than just splits, like ASCII-to-binary conversion of fields or such.

Plus, someone might have actually exported from Excel (or whatever) into some sound TSV instead of weird quote-escaped CSV that some think is standardized by RFC 4180 (which itself disavows being a "standard"). In that case, at least, you needn't convert at all.

So, I see at least 3 reasons to layer this part of a system as a convert-then-split: pipeline parallelism, file parallelism, and entire pass elision.


I like to think that the author wasn't happy with how much code his CSV parser required, and decided to nerd-snipe Hacker News to find some alternatives.


I reviewed the code in this project and it looks pretty reasonable. I thought CSV was a loosely specified format. In my past experience, I never had a smooth experience moving data from one system to another using CSV. I had a lot of trouble with Snowflake -> CSV -> Clickhouse. I now use JSONL for pretty much everything.


There are the specs and there is the real world. The specs are more often than not on the opposite side of the moon; you never see them in real life. Oh, so many hours of my life were wasted on that. Real-world CSVs are as loosely specified as any free text in a notepad.


Yes, loosely specified in practice. I made a CSV parser that tries to do something reasonable for many variants by default. When that's not enough, you can specify options. https://www.neilvandyke.org/racket/csv-reading/


This is the problem… CSV isn't specified in sufficient detail; it is just too loose in the real world. So the question "can you make a small parser" isn't the real issue. And then, the problem with such a small parser is: which edge cases are you missing/ignoring?

I just don't see the flex in having a small CSV parser.


There's an RFC that specifies a standard format for CSV. If you're smart you'd use it ^W^W… well, you'd probably not use CSV to start with.

The problem is that often, what you have to ingest is more properly described as "malformed CSV / bytes that loosely resemble CSV in some manner that I have no choice but to either try to shove into a parser, or write some custom junk for this hot garbage because it comes from a source that I cannot control".

A lot of parsers are fairly configurable precisely to account for the situation of "the other end is sending me ill-defined jank" and to be flexible enough that maybe, just maybe, it'll mostly work. But it's hardly "engineering" at that point.


It doesn’t specify a standard format.

It describes a common format.

https://datatracker.ietf.org/doc/html/rfc4180


To the author: please consider using an actual license.

I infer from the tone of your license that you intended to give the code away to any "human" who wants to "use" it.

What if I modify it? Is modification "use"? (No.)

What if a shell script calls it? Is that a "human" using it, or a "computer"? Is linking "use"?

The result is probably a license which is non-free, which doesn't appear to be your intention.


Maybe also add that as an issue at the repo.

(Writing it as a comment here is/was still useful because it may help others)


Maybe part of the license is proving you are (human) by interpreting it reasonably.

Or maybe it’s a trap. Who can say?


It's a nice tidy CSV parser, but needs a new title. "world's smallest" is never going to happen in C#, for any measure of "smallest". And aside from that, nobody should be rolling their own CSV parsers if they want to solve real-world problems; use the most capable library your language offers you, which will account for a hundred edge cases yours doesn't.


Funny story re "nobody should be rolling":

When I was switching from academia to industry, I decided, based on HN comments like this, that I should un-publish my CSV parser.

I was worried potential employers would tsk-tsk me for self-rolling.

I promptly got an email from the creator of Ruby asking me why I had un-published my CSV parser, which apparently was being used in Ruby at the time.

(...And then later I landed my current job, a dream job, a large part of which involves handling CSV files in finance!)


Well, it depends on whether your self-rolled version is a complete library, or a quick sidetrack you implemented as part of a bigger project. If you published it as an installable independent library, and Ruby was using it, I think it's safe to say that you had a complete product. The evil of self-rolled CSV is that people often build on an incomplete understanding of the problem, don't have unit tests, or do something simplistic that necessitates workarounds like this: https://metacpan.org/pod/Data::TableReader::Decoder::IdiotCS...

(case in point, that crazy workaround is only possible because of a large expenditure of effort by the authors of perl's Text::CSV which very few CSV parsers would have implemented)


No Perl submission in the comments?!

Y'all are slacking on me.


`column.Substring(1, column.Length - 2)` could use the new-ish range indexing syntax.

    column[1..^1]


FYI, I got rid of this line; now I just don't add the quotes in the first place, unless the caller requests it. Performance didn't actually change, but it looks smarter. Thanks again for the review, though.


Thanks... I didn't think that would compile on .NET 6, but it does!


You can probably shorten it to

    column[1..]
and it compiles down to a Substring call, and ranges are part of C# 8, so they exist since .NET Core 3.1. But even if the syntax is newer (e.g. collection expressions in C# 12) you can often also use features on older target frameworks if they don't require additional runtime support (and even that can often be retrofitted internally).


That's different. It includes the last character.
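A quick illustration of the difference with a made-up value:

    var column = "\"hello\"";
    Console.WriteLine(column[1..^1]);                           // hello    (drops first and last char)
    Console.WriteLine(column[1..]);                             // hello"   (drops only the first char)
    Console.WriteLine(column.Substring(1, column.Length - 2));  // hello    (same as [1..^1])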


That’s still multiple lines of code.

Here's one in C++ that uses only a single line of code (excluding the function header):

// How do I format this as code?

    vector<string> split(const string& s) { return accumulate(s.begin(), s.end(), vector<string>(1), [=](auto acc, char c) { if (c == ',') { acc.push_back(string()); } else { acc.back() += c; } return acc; }); }


Does it parse this, though?

  "Verne, Jules", "20,000 leagues under the sea"


Wrap it in a second function that parses out quotes and bit-stuffs the commas with something else.


Yuck, that RFC 4180 actually quotes Postel's poorly considered "law" and recommends that it be followed. Why have an RFC, then? If you're going to be liberal in what you accept, then the RFC is just a suggestion.

Plus, it's not just defining CSV but actually a CSV MIME format (what?), and thus insists that, since CSV is MIME wire data, it must use CR-LF line breaks, rather than assuming that the data can be converted to an operating system's native text format and line breaks.

You'd think that CSV could be defined without reference to MIME whatsoever, using abstract line breaks; and that it's a no-brainer that since it is text, it can be MIME-encoded as a plain text type.


Questions:

1. What is the size of the install required to get C# to run on a computer without Windows or macOS installed?

2. What is the size of the install for this C# library?

Just curious


App with runtime would probably be around 30 MB?

    dotnet publish -c Release -r linux-x64 -o output -p:PublishTrimmed=true

or ~100 MB without trimming.


An AOT-built /Example should be <= 2 MB (like most of them, regardless of the library), since the library itself can only be consumed by .NET and its assembly would take a couple dozen KB at most.


If UNIX means "a certified unix that isn't MacOS" the answer is 0 bytes (actually more like 0/0 bytes).




