A C# `char` is a UTF-16 code unit. It does not represent a byte, which is just `byte...

giaour · 2024-04-10T23:04:29 1712790269

Iterating over the `char`s does not cover the full range of what can be stored in a C# string. For instance, code points outside the Basic Multilingual Plane are stored as surrogate pairs, so a single such character occupies two `char`s in a C# string, and a grapheme can span even more.

.NET provides a TextElementEnumerator that will iterate over graphemes instead: https://learn.microsoft.com/en-us/dotnet/api/system.globaliz...
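To make the difference concrete, here is a minimal sketch that compares `string.Length` (UTF-16 code units) with the number of text elements reported by `StringInfo.GetTextElementEnumerator`. The class and method names (`GraphemeDemo`, `CountTextElements`) and the sample string are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class GraphemeDemo
{
    // Counts text elements (graphemes) by walking a TextElementEnumerator.
    public static int CountTextElements(string s)
    {
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        int count = 0;
        while (e.MoveNext())
        {
            // e.GetTextElement() returns the current element as a string,
            // which may itself contain more than one char.
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // 'a', then U+1F600 (one code point, stored as a surrogate pair),
        // then 'e' followed by U+0301 (combining acute accent).
        string s = "a\U0001F600e\u0301";

        Console.WriteLine(s.Length);             // 5 UTF-16 code units
        Console.WriteLine(CountTextElements(s)); // 3 text elements
    }
}
```

Whether a parser needs this depends on what it does with the text: splitting on `,` and `"` never lands inside a surrogate pair, so grapheme awareness mostly matters for operations such as reversing or truncating.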

There's a fairly comprehensive guide to working with .NET character encodings at https://learn.microsoft.com/en-us/dotnet/standard/base-types...

neonsunset · 2024-04-10T23:12:42 1712790762

The return value of StreamReader.Read() will always be between -1 and char.MaxValue, inclusive.

All surrogate pairs will be drained into the StringBuilder one `char` at a time and come out intact, so this works correctly. Most UTF-16 implementations agree that torn surrogate pairs (surrogate pairs only ever encode code points outside of the Basic Multilingual Plane) may exist in the input and pass them through as is, which is different from what UTF-8 implementations choose (Rust is strict about this, Go lets you tear code points arbitrarily).
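A small sketch of that claim, assuming UTF-8 input read through a `StreamReader`: each call to `Read()` yields one UTF-16 code unit (or -1 at the end), and appending those units to a `StringBuilder` in order reproduces the original string, surrogate pair included. `ReadDemo` and `Drain` are hypothetical names, not taken from the parser under discussion.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadDemo
{
    // Reads the stream char by char and rebuilds the text in a StringBuilder.
    public static string Drain(Stream input)
    {
        using var reader = new StreamReader(input, Encoding.UTF8);
        var sb = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            // c is always in the range 0..char.MaxValue here, so the cast is safe.
            sb.Append((char)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string original = "x,\U0001F600,y"; // the emoji is one surrogate pair
        byte[] utf8 = Encoding.UTF8.GetBytes(original);

        string roundTripped = Drain(new MemoryStream(utf8));

        Console.WriteLine(roundTripped == original); // True
        Console.WriteLine(roundTripped.Length);      // 6 code units
    }
}
```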

We, as a community, can do better than to jump to immediate criticism of this type.

gnabgib · 2024-04-11T00:51:26 1712796686

If you (a consuming dev) want the world's smallest (in your code), use the .NET built-in parser[0]. Bonus: it's RFC 4180 compliant.

If you (competing/learning) want to write the world's smallest (code golf style), this isn't it, and it has some weird superfluous lines (if that's your measure, per the original question).

If you (learning) want to write an efficient parser, this isn't it. You don't need a StringBuilder: you can seek the Stream and collect the (already formed) strings directly from the source instead of copying char by char and rebuilding them. Yes, that limits your stream choices, but since the example/tests only use FileStreams (which are seekable) you might not come across other kinds. If you need to support non-seekable streams, you'll need a large enough buffer.

[0]: https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...
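For readers who have not seen it, a minimal sketch of what using a built-in parser can look like. The link above is truncated, so this assumes it refers to `TextFieldParser` in the `Microsoft.VisualBasic.FileIO` namespace. The wrapper names (`BuiltInCsv`, `ReadAll`) and the sample data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class BuiltInCsv
{
    // Reads every record from the reader using TextFieldParser.
    public static List<string[]> ReadAll(TextReader input)
    {
        var rows = new List<string[]>();
        using var parser = new TextFieldParser(input);
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true;

        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            if (fields != null)
            {
                rows.Add(fields);
            }
        }
        return rows;
    }

    public static void Main()
    {
        var input = new StringReader("name,note\n\"Ann\",\"hello, world\"\n");
        foreach (string[] row in ReadAll(input))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

From C# this needs nothing beyond the `using Microsoft.VisualBasic.FileIO;` directive in a default SDK-style project.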

neonsunset · 2024-04-11T01:04:11 1712797451

This is not a correct link (it refers to VB.NET). There are better parsers out there (Sep).

I'm not sure what your point is, but it certainly misses the idea behind this HN submission, and it makes me sad, as it would be nice to see words of encouragement in .NET submissions here instead.

int_19h · 2024-04-11T01:39:30 1712799570

It's an assembly with "Microsoft.VisualBasic" in the name, but it shipped as part of every version of .NET to date, and is perfectly usable from C#. In fact, I would be very surprised if there aren't vastly more uses of this API from C#, since it's a very old trick of the trade.

What GP is saying is that, given that it is already included in the standard class library, it's always the cheapest option wrt size of your shipping app. So it should arguably be the default choice for any .NET dev unless they either need better performance or some more exotic requirements wrt input format.

neonsunset · 2024-04-11T02:06:35 1712801195

What is it with .NET or C# submissions (but I suppose other languages are not immune either) that attracts this type of reply, which misses the point behind a particular piece of code, trivial or not?

Yes, there are existing implementations, many of which are incomparably better, one of which ships with the default project SDK (even if it is effectively obsolete[0]). But surely offering a competitive implementation that intends to replace existing solutions wasn't the purpose of this?

Either way, I'm not the author of the code and have already spent enough (free) time in the last 8 months working on a string library which has performant parsing as one of the project goals[1].

[0] https://github.com/dotnet/runtime/tree/main/src/libraries/Mi...

[1] https://github.com/U8String/U8String

vilark · 2024-04-11T16:13:10 1712851990

The unit tests have an emoji test (which uses a surrogate pair). I thought I would have to use Runes, but it's not necessary. https://github.com/kjpgit/SmallestCSVParser/blob/master/Smal...

Jerrrry · 2024-04-10T23:15:12 1712790912

You imply that a string, reversed, would have the same length as the original.

This is not true.

esdf · 2024-04-10T23:46:34 1712792794

Where are they implying this and why would the strings not have the same length? Is there normalization implied somewhere?

Jerrrry · 2024-04-11T00:22:08 1712794928

If they weren't reversing it, what other operation would separate grapheme clusters?

Someone · 2024-04-11T04:56:41 1712811401

> Having StringBuilder be a private field on the parser instance is not an issue either - it is simply reused.

It doesn’t matter for this API, but it is a code smell. It makes the class not reentrant.

Talking of the API, I would make it simpler to use and more idiomatic by making the entire public API

   static IEnumerable<List<String>> parse(StreamReader sr)

That call would store the parser state (currently just the StreamReader and that reused StringBuilder) in a private inner class. The publicly visible class would have no constructor, which removes that code smell.
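A rough sketch of the shape described above, not the submission's actual code. It assumes a deliberately simplified CSV dialect (comma separator, `"` quoting with `""` as the escape, `\n` or `\r\n` line endings), takes a `TextReader` rather than a `StreamReader` so it can be fed a `StringReader`, and relies on `Peek()` for one character of lookahead. All names (`CsvSketch`, `Parse`, `State`, `TryReadRow`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CsvSketch
{
    // The entire public surface: one static method, no public constructor.
    public static IEnumerable<List<string>> Parse(TextReader reader)
    {
        // Each enumeration gets its own State, so nothing is shared between calls.
        var state = new State(reader);
        while (state.TryReadRow(out List<string> row))
        {
            yield return row;
        }
    }

    // Private holder for the per-call parser state.
    private sealed class State
    {
        private readonly TextReader _reader;
        private readonly StringBuilder _sb = new StringBuilder(); // reused per field

        public State(TextReader reader) => _reader = reader;

        public bool TryReadRow(out List<string> row)
        {
            row = new List<string>();
            if (_reader.Peek() == -1) return false; // no more input

            bool inQuotes = false;
            _sb.Clear();
            while (true)
            {
                int c = _reader.Read();
                if (c == -1) break; // end of input ends the last row
                if (inQuotes)
                {
                    if (c != '"') _sb.Append((char)c);
                    else if (_reader.Peek() == '"') { _reader.Read(); _sb.Append('"'); }
                    else inQuotes = false; // closing quote
                }
                else if (c == '"') inQuotes = true;
                else if (c == ',') { row.Add(_sb.ToString()); _sb.Clear(); }
                else if (c == '\n') break; // end of row
                else if (c != '\r') _sb.Append((char)c);
            }
            row.Add(_sb.ToString());
            return true;
        }
    }

    public static void Main()
    {
        var input = new StringReader("a,\"b,\"\"c\"\"\"\r\nd,e");
        foreach (List<string> row in Parse(input))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

Because `Parse` is an iterator method, the `State` object is only created when enumeration starts, and two concurrent enumerations over different readers never touch the same StringBuilder.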