World's Smallest CSV Parser (C#) (github.com/kjpgit)
64 points by vilark 8 months ago | 71 comments



To counterbalance this, an extremely fast CSV parser (also written in C#, uses SIMD and multi-threading): https://github.com/nietras/Sep/

P.S.: This one is unfortunately another parser that drains char-by-char into a list/vec/buffer, a parsing approach that is very inefficient and plagues many languages, and which prevents it from taking advantage of the vectorized string.Split. But other than that, I'm happy more people are noticing .NET.


What's the utility of defining the "Error" exception? Why not use an existing one, say InvalidOperationException, or a plain Exception? Is making your own better practice?


There is no utility. It's perhaps written for JavaScript developers who are used to Error... but it's not idiomatic C#. Might be indicative of Copilot, too.

The use of a class-scoped `StringBuilder` that only one method uses, and `ReadQuotedColumn`/`ReadNonQuotedColumn` yielding one character at a time rather than accepting the builder, isn't a good sign either (for efficiency). Nor is casting everything to a `char` (this won't support UTF8), or assuming that an end quote followed by anything (:71) is a valid way to end a field.


C# `char` is a UTF-16 code unit, not a byte; a byte is just `byte`.

Having StringBuilder be a private field on the parser instance is not an issue either - it is simply reused.


Iterating over the `char`s does not support the full range of what can be stored in a C# string (for instance, graphemes outside the Basic Multilingual Plane are serialized as surrogate pairs and therefore occupy two `char`s in a C# string).

.Net provides a TextElementEnumerator that will iterate over graphemes instead: https://learn.microsoft.com/en-us/dotnet/api/system.globaliz...

There's a fairly comprehensive guide to working with .net character encodings at https://learn.microsoft.com/en-us/dotnet/standard/base-types... .
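A minimal sketch of what that looks like (the string value here is just an example, not anything from the submission):

    using System.Globalization;

    var s = "a👍b";  // the emoji is stored as a surrogate pair (two chars) but is one text element
    var e = StringInfo.GetTextElementEnumerator(s);
    while (e.MoveNext())
        Console.WriteLine((string)e.Current);  // prints "a", "👍", "b"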


The return value of StreamReader.Read() will always be between -1 and char.MaxValue.

All surrogate pairs will be drained into the StringBuilder, working correctly. Most implementations agree that torn UTF-16 surrogate pairs (which only occur for code points outside the Basic Multilingual Plane) may exist in the input and will be passed through as is, which is different from what UTF-8 implementations choose (Rust is strict about this; Go lets you tear code points arbitrarily).

We, as a community, can do better than to jump to immediate criticism of this type.


If you (a consuming dev) want the world's smallest (in your code), use the .NET built-in parser[0]. Bonus: it's RFC 4180 compliant.

If you (competing/learning) want to write the world's smallest (code-golf style)... this isn't it, and it has some weird superfluous lines (if that's your measure, per the original question).

If you (learning) want to write an efficient parser... this isn't it. You don't need a StringBuilder; you can seek the Stream to collect the (already formed) strings directly from the source instead of doing a char-by-char memory copy and rebuild. Yes, that limits your stream choices, but since the example/tests only use FileStreams (which are seekable) you might not come across other kinds. If you need to use un-seekable streams, then you'll need to use a large enough buffer.

[0]: https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...


This is not a correct link (it refers to VB.NET). There are better parsers out there (Sep).

I'm not sure what your point is, but it certainly misses the idea behind this HN submission and makes me sad, as it would be nice to see words of encouragement in .NET submissions here instead.


It's an assembly with "Microsoft.VisualBasic" in the name, but it shipped as part of every version of .NET to date, and is perfectly usable from C#. In fact, I would be very surprised if there aren't vastly more uses of this API from C#, since it's a very old trick of the trade.

What GP is saying is that, given that it is already included in the standard class library, it's always the cheapest option wrt the size of your shipping app. So it should arguably be the default choice for any .NET dev unless they either need better performance or have some more exotic requirements wrt input format.
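For reference, a rough sketch of calling it from C# (the file name is made up, and depending on the project type you may need an explicit reference to the Microsoft.VisualBasic assembly):

    using Microsoft.VisualBasic.FileIO;

    using var parser = new TextFieldParser("data.csv");  // path is illustrative
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;
    while (!parser.EndOfData)
    {
        var fields = parser.ReadFields();  // one record's columns, quoting handled for you
    }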


What is it with .NET or C# submissions (but I suppose other languages are not immune either) that attracts this type of reply, which misses the point behind a particular piece of code, trivial or not?

Yes, there are existing implementations, many of which are incomparably better, one of which ships with default project SDK (even if it is effectively obsolete[0]). But surely offering a competitive implementation that intends to replace existing solutions wasn't the purpose of this?

Either way, I'm not the author of the code and have already spent enough (free) time in the last 8 months working on a string library which has performant parsing as one of the project goals[1].

[0] https://github.com/dotnet/runtime/tree/main/src/libraries/Mi...

[1] https://github.com/U8String/U8String


The unit tests have an emoji test (which uses a surrogate pair). I thought I would have to use Runes, but it's not necessary. https://github.com/kjpgit/SmallestCSVParser/blob/master/Smal...


You imply that a string, reversed, would have the same length as the original.

This is not true.


Where are they implying this and why would the strings not have the same length? Is there normalization implied somewhere?


If they weren't reversing it, what other operation would separate grapheme clusters?


> Having StringBuilder be a private field on the parser instance is not an issue either - it is simply reused.

It doesn’t matter for this API, but it is a code smell. It makes the class not reentrant.

Talking of the API, I would make it simpler to use and more idiomatic by making the entire public API

    static IEnumerable<List<string>> Parse(StreamReader sr)
That call would store the parser state (currently just the StreamReader and that reused StringBuilder) in a private inner class. There would not be a constructor of the publicly visible class, removing that code smell.
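Roughly something like this, as a sketch only (the names and the naive line handling are illustrative, not the submission's code; the real field/quote logic would live in the inner class):

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    public static class SmallestCsv
    {
        // The only public surface: an iterator over records.
        public static IEnumerable<List<string>> Parse(StreamReader sr)
        {
            var state = new State(sr);
            while (state.TryReadRecord(out var record))
                yield return record;
        }

        // Private inner class holds the reader and the reused StringBuilder.
        private sealed class State
        {
            private readonly StreamReader _reader;
            private readonly StringBuilder _sb = new();  // reused by the real field-reading code (elided here)

            public State(StreamReader reader) => _reader = reader;

            public bool TryReadRecord(out List<string> record)
            {
                var line = _reader.ReadLine();
                if (line is null) { record = new(); return false; }
                // Naive split; quoting/escaping omitted, only the API shape matters here.
                record = new List<string>(line.Split(','));
                return true;
            }
        }
    }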


I will add a micro benchmark to see if the `yield return` is slowing things down, compared to just calling _sb.Append() inside Read*(). I will also see if it looks cleaner that way. To be honest, the `yield return` is currently in there just because I thought it was "cool".


30% performance improvement after removing the `yield return`, and readability is probably better too.
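For anyone curious what that change amounts to, a simplified illustration of the two shapes (not the repo's actual methods):

    using System.Text;

    // Yield-per-char: every character goes through an iterator MoveNext/Current round trip.
    static IEnumerable<char> ReadColumnChars(TextReader r)
    {
        int c;
        while ((c = r.Read()) >= 0 && c != ',')
            yield return (char)c;
    }

    // Direct append: the loop writes straight into the shared StringBuilder.
    static void ReadColumnInto(TextReader r, StringBuilder sb)
    {
        int c;
        while ((c = r.Read()) >= 0 && c != ',')
            sb.Append((char)c);
    }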


It's good practice to throw an exception from your own namespace if you're writing a library.

You don't want to expose an implementation detail like some specific exception as part of your public API and have to worry about breaking that later.

You could reuse some built-in exception, but IMO that's not the best practice. You muddy your API, and a caller has to wrap your exception if they want to bubble it up and catch it specifically anyway.
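A minimal sketch of that, with a made-up library namespace and exception name:

    namespace MyCsvLibrary
    {
        // Callers can catch this specifically without also catching unrelated framework exceptions.
        public class CsvParseException : System.Exception
        {
            public CsvParseException(string message) : base(message) { }
        }
    }

A caller can then write `catch (MyCsvLibrary.CsvParseException)` and stays insulated from whatever built-in exceptions the implementation happens to use internally.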


> Why not use ... a plain Exception.

It is forbidden.

https://learn.microsoft.com/en-us/dotnet/standard/exceptions...

> Exception ... None (use a derived class of this exception).

https://learn.microsoft.com/en-us/dotnet/standard/design-gui...

> DO NOT throw System.Exception or System.SystemException.


It’s a guideline not to. It’s not a hard rule or forbidden.


I'm not seeing where it says not to throw InvalidOperationException.


InvalidOperationException means "the object is in an inappropriate state". That does not describe a parse error.

C# conventions for exceptions are admittedly a bit confusing. There are a handful of very specific scenarios where you're supposed to use a built-in exception (most commonly ArgumentException). For everything else, you want to define your own type.


The state of the stream, such that it's pointing to an illegal character, actually does seem to be invalid though. Maybe this is an overly pedantic argument. I've been writing quite a bit of C# for over a decade and have basically been doing this the whole time. I thought that I knew how to use exceptions, but it seems I do not.


If you want to recover from CSV errors you should ideally throw InvalidCSVException.


Very nice. The submission's main file (SmallestCSVParser.cs) is 3851 characters (of which 657 are commentary).

Mine, in C, is only 2807 characters (of which 198 are commentary):

https://github.com/semitrivial/csv_parser/blob/master/csv.c

Ahh, but the submission's main file is for parsing an entire .csv, whereas mine is only for parsing a single "line" (possibly including quote-escaped newlines). So the submission wins :)


Do you need more than 1k of additional source to wrap the line parsing in a loop? I doubt it.


Linebreaks can be escaped in CSV, so splitting a file into rows is actually ~1/3 the complexity of parsing a whole row.

See: https://github.com/semitrivial/csv_parser/blob/master/split....

Though I suppose that's the naive approach. You could combine the two into a single file by, like you say, wrapping the row-parser in a (clever, non-trivial) outer loop, and it probably wouldn't take anywhere near 1000 characters to do that...


> Linebreaks can be escaped in CSV

In some variants of CSV. There isn’t agreement on the format. For example, https://www.ietf.org/rfc/rfc4180.txt says

“While there are various specifications and implementations for the CSV format (for ex. [4], [5], [6] and [7]), there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.”

That RFC doesn’t even agree with itself, saying

“1. Each record is located on a separate line, delimited by a line break (CRLF).”

but then following that up with:

“6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes”


There is no contradiction, as (1) does not say that "each record is located on a (exactly one) single separate line". But it could be better phrased, like "two consecutive records are separated by...".

This is the "mathematical a", which does not mean "exactly one" but "at least one, and we don't care how many, we already did the interesting work". Like in "this problem has a solution".


Parse the whole file in one go. You need to track opening and closing quotes (and escaped ones) anyway, so there is no need to distinguish between commas (semicolons, tabs) and newlines.

Btw, you are not handling escaped double quotes in strings at all, and if you did you'd also need to count the number of backslashes. Oh, and no need to escape double quotes in single quotes ('\" could just be '"').


Back in the day (2008) I created an AutoHotkey v1 function for parsing a delimiter-separated line, called ReturnDSVArray.

It can be found here: https://www.autohotkey.com/board/topic/30102-how-can-i-parse...

It consists of some 30-ish lines of code and 67 lines with comments and a usage example.


The code seems fairly nice at first glance, but the world's smallest?


Don't just parse - convert. In a pipeline to split-parseable data, if you like, such as the possibly smaller, faster, and more general: https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim (And, ideally, convert all the way to a mmap & go binary format like nio so you don't have to re-parse.)


Fun! Convert to js:

    csv.split('\n').join('".split(\',\'));a.push("');

So that each line becomes:

    a.push("foo,bar,baz".split(','));

And then we have an array of arrays.


Of course the “hard part” of CSV parsing is dealing with escapes, which break simple splits.

But now I’m wondering if a good approach might be to split on the escape character and then reassemble / parse from there, safe in the knowledge that every character has exactly one interpretation.
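Something along those lines can work for RFC 4180-style quoting, where the quote character plays the role of the "escape". A rough, unoptimized C# sketch (assumes a single well-formed line with no embedded newlines):

    static List<string> ParseLine(string line)
    {
        var fields = new List<string> { "" };
        var parts = line.Split('"');              // even indexes: outside quotes, odd: inside
        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 1)
                fields[^1] += parts[i];           // inside quotes: take the text literally
            else if (i > 0 && i + 1 < parts.Length && parts[i].Length == 0)
                fields[^1] += "\"";               // empty gap between quoted runs = doubled quote
            else
            {
                var cells = parts[i].Split(',');  // outside quotes: commas delimit fields
                fields[^1] += cells[0];
                for (int j = 1; j < cells.Length; j++) fields.Add(cells[j]);
            }
        }
        return fields;
    }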


"Normal" CSV doesn't have escape characters. Quotes in quoted strings are escaped by doubling then, and everything else (including newlines) is interpreted as is inside quoted strings.


There is no spec or standard or consensus on “Normal” CSV.

I like it when CSV follows RFC 4180 too - but it’s descriptive not prescriptive.


There is no normal CSV! I always used Excel as the "standard" when writing a CSV parser.

If every field is quoted you can indeed remove the first and last ", then split on "," and then replace "" with " in the fields.
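Roughly, and only under that "every field is quoted" assumption (the single-string Split overload needs .NET Core / .NET 5+):

    using System.Linq;

    static string[] SplitFullyQuotedLine(string line)
    {
        var inner = line.Substring(1, line.Length - 2);    // drop the outermost quotes
        return inner.Split("\",\"")                        // split on the "," between fields
                    .Select(f => f.Replace("\"\"", "\""))  // un-double embedded quotes
                    .ToArray();
    }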


That is precisely why I put "normal" in quotes.

Nevertheless, if there is a way to escape anything at all, usually it is the quotation mark, and usually it is escaped by doubling. Pretty much any other scheme is very unlikely to be properly interpreted in this context.


Yes indeed. To make it easy to parse, everything has to be quoted. If some things are quoted then you can't just split on comma, because for example:

    ", is a cat",", is my boyfriend",123
etc.


I think of what you say as really just the first step on the path to the parsing state machine in the c2tsv.nim (or /c2dsv.c in the same folder) thing I mentioned above, both of which have comments in their source code.

I think it helps to frame the problem more like: "How do I translate a complex-syntax buffered input stream into a buffered output stream, where 'most' of the time I just translate ',' to '\t', and stay 'almost' as fast as a Unix `tr , \\t`?" If there were no escaping/quoting, the output buffer could literally be the same memory as the input, just with the delimiter bytes changed.

The next step is realizing that you can still just do this byte translation if you "flush" the IO buffer opportunistically at syntactically relevant times. That gets you the "almost" performance. (Scare quotes on "almost" since you might do a few more IO-system calls with certain kinds of dense syntax, but unlike your "reassemble" there won't be any allocations. Various trade-offs, but a nifty design.)

There are other nice aspects to the "partitioned program design" mentioned in a sibling-ish comment, but, all together, I think it is a pretty tidy solution.
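To make the idea concrete, a much-simplified C# rendition of that convert step (this is not the c2tsv tool itself; embedded newlines and tabs inside quoted fields are crudely replaced with spaces so the output stays line- and tab-splittable):

    static void CsvToTsv(TextReader input, TextWriter output)
    {
        bool inQuotes = false;
        int c;
        while ((c = input.Read()) >= 0)
        {
            char ch = (char)c;
            if (inQuotes)
            {
                if (ch == '"')
                {
                    if (input.Peek() == '"') { input.Read(); output.Write('"'); }  // doubled quote
                    else inQuotes = false;                                         // closing quote
                }
                else
                    output.Write(ch == '\n' || ch == '\r' || ch == '\t' ? ' ' : ch);  // lossy but splittable
            }
            else if (ch == '"') inQuotes = true;     // opening quote is dropped
            else if (ch == ',') output.Write('\t');  // the actual "tr , \t" step
            else output.Write(ch);
        }
    }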


In my silliness I forgot to brag about the output still being CSV.

If you want to enjoy all the strange escapery, extra commas, line breaks and wrapping quotes, you may describe it in code in the first and last 2 columns.


It's also expressible (and is vectorized) in C#. But that's the author's code, not mine :)


That's right - pure splitting is much more SIMD-friendly than... the whole syntax melange. This is another charm of the "partitioned" design. The conversion to split-parseable TSV can run on its own CPU core and the SIMD splitting on its own core. As long as pipe bandwidth suffices, you have very easy parallelism. This kind of design/intent was popular on Unix at the dawn of multiprocessing, when there were still Giant Kernel Locks. But it still has merits, even on Windows.

If you have a lot of data (and space for it, e.g. in /dev/shm) you can also save all the converted data to a TSV file. That's now soundly "partitionable" at the "nearest ASCII newline to 1/N bytes" and you can then go core-parallel as well as SIMD within cores (but that N-wise pass with memory mapping or mem.views). Admittedly this is probably more helpful when you are doing more computation than just splits, like ASCII-to-binary conversion of fields or such.

Plus, someone might have actually exported from Excel (or whatever) into some sound TSV instead of weird quote-escaped CSV that some think is standardized by RFC 4180 (which itself disavows being a "standard"). In that case, at least, you needn't convert at all.

So, I see at least 3 reasons to layer this part of a system as a convert-then-split: pipeline parallelism, file parallelism, and entire pass elision.


I like to think that the author wasn't happy with how much code his CSV parser required, and decided to nerd-snipe Hacker News to find some alternatives.


I reviewed the code in this project and it looks pretty reasonable. I thought CSV was a loosely specified format. In my past experience, I never had a smooth experience moving data from one system to another using CSV. I had a lot of trouble with Snowflake -> CSV -> Clickhouse. I now use JSONL for pretty much everything.


There are the specs and there is the real world. The specs are more often than not on the opposite side of the moon; you never see them in real life. Oh, so many hours of my life were wasted on that. Real-world CSVs are as loosely specified as any free text in a notepad.


Yes, loosely specified in practice. I made a CSV parser that tries to do something reasonable for many variants by default. When that's not enough, you can specify options. https://www.neilvandyke.org/racket/csv-reading/


This is the problem… CSV isn't specified in sufficient detail; it is just too loose in the real world. So the question "can you make a small parser" isn't the real issue. And then, the problem with such a small parser is: which edge cases are you missing/ignoring?

I just don't see the flex in having a small CSV parser.


There's an RFC that specifies a standard format for CSV. If you're smart you'd use it ^W^W… well, you'd probably not use CSV to start with.

The problem is that often, what you have to ingest is more properly described as "malformed CSV / bytes that loosely resemble CSV in some manner that I have no choice but to either try to shove into a parser, or write some custom junk for this hot garbage because it comes from a source that I cannot control".

A lot of parsers are fairly configurable precisely to account for the situation of "the other end is sending me ill-defined jank" and to be flexible enough that maybe, just maybe, it'll mostly work. But it's hardly "engineering" at that point.


It doesn’t specify a standard format.

It describes a common format.

https://datatracker.ietf.org/doc/html/rfc4180


To the author: please consider using an actual license.

I infer from the tone of your license that you intended to give the code away to any "human" who wants to "use" it.

What if I modify it? Is modification "use"? (No.)

What if a shell script calls it? Is that a "human" using it, or a "computer"? Is linking "use"?

The result is probably a license which is non-free, which doesn't appear to be your intention.


Maybe also add that as an issue at the repo.

(Writing it as a comment here is/was still useful because it may help others)


Maybe part of the license is proving you are (human) by interpreting it reasonably.

Or maybe it’s a trap. Who can say?


It's a nice tidy CSV parser, but needs a new title. "world's smallest" is never going to happen in C#, for any measure of "smallest". And aside from that, nobody should be rolling their own CSV parsers if they want to solve real-world problems; use the most capable library your language offers you, which will account for a hundred edge cases yours doesn't.


Funny story re "nobody should be rolling":

When I was switching from academia to industry, I decided, based on HN comments like this, that I should un-publish my CSV parser.

I was worried potential employers would tsk-tsk me for self-rolling.

I promptly got an email from the creator of Ruby asking me why I had un-published my CSV parser, which apparently was being used in Ruby at the time.

(...And then later I landed my current job, a dream job, a large part of which involves handling CSV files in finance!)


Well, it depends on whether your self-rolled version is a complete library, or a quick sidetrack you implemented as part of a bigger project. If you published it as an installable independent library, and Ruby was using it, I think it's safe to say that you had a complete product. The evil of self-rolled CSV is that people often build on an incomplete understanding of the problem, don't have unit tests, or do something simplistic that necessitates workarounds like this: https://metacpan.org/pod/Data::TableReader::Decoder::IdiotCS...

(case in point, that crazy workaround is only possible because of a large expenditure of effort by the authors of perl's Text::CSV which very few CSV parsers would have implemented)


No Perl submission in the comments?!

Y'all are slacking on me.


`column.Substring(1, column.Length - 2)` could use the new-ish range indexing syntax.

    column[1..^1]


FYI, I got rid of this line; now I just don't add the quotes in the first place, unless the caller requests it. Performance didn't actually change, but it looks smarter. Thanks again for the review, though.


Thanks... I didn't think that would compile on .NET 6, but it does!


You can probably shorten it to

    column[1..]
and it compiles down to a Substring call, and ranges are part of C# 8, so they exist since .NET Core 3.1. But even if the syntax is newer (e.g. collection expressions in C# 12) you can often also use features on older target frameworks if they don't require additional runtime support (and even that can often be retrofitted internally).


That's different. It includes the last character.
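A quick illustration of the difference with a made-up value:

    var column = "\"hello\"";
    Console.WriteLine(column[1..^1]);                           // hello    (drops first and last char)
    Console.WriteLine(column[1..]);                             // hello"   (drops only the first char)
    Console.WriteLine(column.Substring(1, column.Length - 2));  // hello    (same as [1..^1])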


That’s still multiple lines of code.

Here's one in C++ that uses only a single line of code (excluding the function header):

// How do I format this as code?

    vector<string> split(const string& s) { return accumulate(s.begin(), s.end(), vector<string>(1), [=](auto acc, char c) { if (c == ',') { acc.push_back(string()); } else { acc.back() += c; } return acc; }); }


Does it parse this, though?

  "Verne, Jules", "20,000 leagues under the sea"


Wrap it in a second function that parses out quotes and bit-stuffs the commas with something else.


Yuck, that RFC 4180 actually quotes Postel's poorly considered "law" and recommends that it be followed. Why have an RFC, then? If you're going to be liberal in what you accept, then the RFC is just a suggestion.

Plus, it's not just defining CSV but actually a CSV MIME format (what?), and thus insists that, since CSV is MIME wire data, it must use CR-LF line breaks, rather than assuming that the data can be converted to an operating system's native text format and line breaks.

You'd think that CSV could be defined without reference to MIME whatsoever, using abstract line breaks; and that it's a no-brainer that since it is text, it can be MIME-encoded as a plain text type.


Questions:

1. What is the size of the install required to get C# to run on a computer without Windows or macOS installed?

2. What is the size of the install for this C# library?

Just curious


App with runtime would probably be around 30 MB?

    dotnet publish -c Release -r linux-x64 -o output -p:PublishTrimmed=true

or ~100 MB without trimming.


An AOT-built /Example should be <= 2 MB (like most of them, regardless of the library), since the library itself can only be consumed by .NET and its assembly would take a couple dozen KB at most.


If UNIX means "a certified unix that isn't MacOS" the answer is 0 bytes (actually more like 0/0 bytes).




