Friends don't let friends export to CSV

prepend · 2024-03-26T00:05:49.000000Z

This article seems written by someone who never had to work with diverse data pipelines.

I work with large volumes of data from many different sources. I’m lucky to get them to send csv. Of course there are better formats, but all these sources aren’t able to agree on some successful format.

Csv that’s zipped is producible and readable by everyone. And that makes is more efficient.

I’ve been reading these “everyone is stupid, why don’t they just do the simple, right thing and I don’t understand the real reason for success” articles for so long it just makes me think the author doesn’t have a mentor or an editor with deep experience.

It’s like arguing how much mp3 sucks and how we should all just use flac.

The author means well, I’m sure. Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That’s a clever and unique observation.

hn_throwaway_99 · 2024-03-26T00:49:33.000000Z

Totally agree. His arguments are basically "performance!" (which is honestly not important to 99% of CSV export users) and "It's underderspecified!" And while I can agree with the second, at least partly, in the real world the spec is essentially "Can you import it to Excel?". I'm amazed at how much programmers can discount "It already works pretty much everywhere" for the sake of more esoteric improvements.

All that said (and perhaps counter to what I said), I do hope "Unicode Separated Values" takes off. It's essentially just a slight tweak to CSV where the delimiters are special unicode characters, so you don't have to have complicated quoting/escaping logic, and it also supports multiple sheets (i.e. a workbook) in a single file.

diogocp · 2024-03-26T01:21:04.000000Z

> in the real world the spec is essentially "Can you import it to Excel?"

And the answer to that is always no. You will it think it's yes because it works for you, but when you send it to someone who has a different Excel version or simply different regional settings, it won't work. The recipient will first have to figure out what dialect you used to export.

Freak_NL · 2024-03-26T07:49:04.000000Z

Oh absolutely. Don't forget about the giant mess Microsoft made in countries like the Netherlands, where Excel will refuse to automatically open comma-separated values files (you know, CSV), unless the separator is a semicolon — because someone in the past thought that was how the Dutch did it.

You want people to be able to open your nice and sensible CSV files in Excel? First figure out which arcane locale specific settings you need, then generate those monstrosities and annoy anyone who genuinely expected Unicode comma-separated values.

My solution was to just write a generic spreadsheet writer component in our API and have it generate CSV (normal, comma-separated, UTF-8), XLSX, or ODS. Anyone using Excel just grabs the XLSX output. Generating XLSX and ODS was just a matter of saving a minimal example file for each and figuring out where to write cells and rows. No extra library needed (both file formats are just zipped up XML files), and everybody is happy.

roelschroeven · 2024-03-26T10:24:53.000000Z

Many countries use the comma as decimal separator, and Microsoft in its infinite wisdom thinks that data interchange formats should follow regional formatting settings (that's unbelievably stupid; I'll never understand how such an enormous error not only came to be, but was never corrected). That makes the comma unusable as column separator in those countries for exchange of numerical data.

wouldbecouldbe · 2024-03-26T12:22:13.000000Z

US users have the same in my experience. I've had multiple clients complain the exports are broken, because Excel only imports them (perfectly) but doesn't allow to open them. I thought it was their way of forcing excel files. Iphone & Google Drive no issues.

chaps · 2024-03-26T01:49:59.000000Z

"Always no"? What. Really not sure what you mean here when you agree that the possibility of it working exists.

I get the sentiment -- when I request data through FOIA, I will almost always request it as "an excel format" because I know that I'll at least be able to import it. CSV is much less of a guarantee and will have issues -- missing quotes, wrong delimiters, inconsistent column counts, things like that. So requesting "an excel format" implies "make the best effort to give me something that will load in my excel, but without asking what version of excel I have". Removes a fair amount of hassle, especially when it took months to get the data. It also means that if they fuck up the columns by doing a conversion, you have some means of telling them that the data is simply wrong, rather than the data is hard to work with. It does mean dealing with [0-9]+GB sized excel files sometimes, though.

That all said, I prefer to share CSV files. Haven't had much of a problem with it and I can guarantee some consistency from my side. Like, the people I share files with aren't going to know what the hell a parquet file is. A CSV though? Just double click it if you're less technical, or open it up in less if you can use a terminal. It usually compresses well, despite what the author wrote.

throwaway2037 · 2024-03-26T04:52:34.000000Z

   > when I request data through FOIA

Fascinating. Can you share any details? Did you ever think to share some of your interesting finds here on HN as a submission?

chaps · 2024-03-26T06:41:21.000000Z

Every now and then, yep :)

https://mchap.io/that-time-the-city-of-seattle-accidentally-...

chongli · 2024-03-26T10:01:43.000000Z

That was a really fascinating story! Thanks for sharing.

r-w · 2024-03-26T13:16:18.000000Z

Wow, what a treasure trove you’ve got there! I’ve subscribed via RSS, in case anything else comes down the pipe :)

chaps · 2024-03-27T05:12:52.000000Z

Thank you! Hopefully by the end of the year but these things can get, strange.

Pikamander2 · 2024-03-26T07:48:35.000000Z

I've been amazed by how much better LibreOffice is at importing CSVs in a sane manner than Excel. Its CSV import prompt is nothing short of the gold standard and puts Excel to shame.

Also, even if the CSV format is completely valid, Excel will still find a way to misinterpret some of your cells in baffling ways, destroying the original data in the process if you don't notice it in time.

wiz21c · 2024-03-26T09:13:13.000000Z

Yeah, I can complain about LO in many ways, but the way it opens CSV is much better than Excel. It was developed by a dev, that's for sure.

nuc1e0n · 2024-03-26T22:31:12.000000Z

The root cause of a lot of problems is that Excel's CSV import is garbage.

Someone should write a simple utility modelled on LibreOffice's CSV import dialog that reads in a CSV file and outputs its content as an XLSX file with all the cells formatted as text. Being as how XLSX files are just XML text in a renamed ZIP file and CSV is a very simple format such a project could be written over a weekend.

Network admins could then create a group policy to reassign the CSV file extension to open with the utility instead of Excel. I guess the utility could automatically open the generated XLSX in Excel as well.

This would fix so many data corruption issues across the whole world.

Microsoft themselves could even do this as a powertoy.

prox · 2024-03-26T08:49:25.000000Z

Heck yes, LibreOffice shines when it comes to that. Excel always threw me curveballs a lot.

smcin · 2024-03-26T13:28:55.000000Z

Yes the LibreOffice CSV import dialog showing you a what-if preview of what you'd get as you play with different settings, is pure amazing.

Stranger43 · 2024-03-26T05:51:13.000000Z

At this point i suspect excel is as dangerous as powerpoint to the quest for sharing and improving human knowledge, in that it gives the illusion of allowing people to analyze data but not the tools to make sure they get their data and model lined up correctly.

geraldhh · 2024-03-26T09:43:59.000000Z

otoh it could be instrumental precisely because it is flexible and doesn't require a lot of forethought

pif · 2024-03-26T12:09:18.000000Z

>> in the real world the spec is essentially "Can you import it to Excel?"

> And the answer to that is always no.

Sorry, you are wrong! You are confusing "No" and "Yes, after a quick setup at most".

braiamp · 2024-03-26T12:14:12.000000Z

Oh oh oh, I have a story about a quick setup. Sent a csv file to someone in the same org. Guy said that it was not opening. Wanted me to come to their office to see. I told them that IT support should fix it, since I can't, and every machine on my OU could read it. I was the bad guy. Yeah, quick setup my ass. Users can't be arsed to understand that their setup isn't properly configured.

mcv · 2024-03-26T13:14:26.000000Z

Yeah, I definitely prefer CSV over Excel. Excel messes with your data. CSV is what it is. Sure, you may not know exactly what that is, and plenty of times I've had to figure out manually how to handle a specific CSV export, but once I've figured that out, it will work. With Excel, there's always the risk that a new value for a string field could also be interpreted as a date.

alexvoda · 2024-03-26T11:54:43.000000Z

On the other hand you can now use Power Query to import perquet data into Excel.

resource_waste · 2024-03-26T15:12:03.000000Z

Whatever situation got you into the "Power" universe was bad.

Warning you about M$, you will soon be an enterprise dev.

alexvoda · 2024-03-27T09:02:39.000000Z

I am not in that universe, on the contrary I try to stay as far away as possible. And I agree with you. However I think the everything-done-in-OLD-Excel universe is worse. Some people will be terminally stuck in Excel but at least they can use new Excel capabilities instead of being stuck with the Excel of 20 years ago.

So why remain stuck importing CSVs into Excel when you can use Power Query to import Parquet. Why remain stuck using VBA in Excel when you can use Python in Excel.

I do not think an Excel user can be convinced to move to things like Jupyter, R, databases, etc. since they won't even make the jump to Access but maybe they can be convinced to use modern features of Excel.

resource_waste · 2024-03-27T19:17:25.000000Z

Sorry man, I can't think of a case where I'd import a CSV into excel, but have the skill level to use powerquery and import parquet.

Like, if you are going to use power query, why not just python? At least this way you arent going to get nailed into a legacy hellhole.

goatlover · 2024-03-26T06:08:15.000000Z

If the answer was always no, importing CSVs to Excel wouldn't be an expectation or widely used.

kibwen · 2024-03-26T11:04:37.000000Z

The answer is "always no" because the question is inherently underspecified, precisely because importing things into Excel is more complicated and context-dependent than it appears on the surface.

djbusby · 2024-03-26T05:32:24.000000Z

ASCII has had field and record separators since like, forever. Wish we had kept using those.

wombatpm · 2024-03-26T06:31:16.000000Z

No you don’t. It’s a holdover from when files were on tapes. The logic is all inverted too. Record separators were at the beginning of a record. Group and unit separators could then be nested. You really needed a spec for whatever document you were trying to makt

rightbyte · 2024-03-26T07:43:39.000000Z

It doesn't matter if the sentinel byte is after or before each record.

Having it before is nice for syncing byte streams.

wombatpm · 2024-03-26T14:24:26.000000Z

It matters when you trying to roundtrip the data through a text editor because existing tools balk at a 300MB file with a single line.

Repulsion9513 · 2024-03-26T21:30:03.000000Z

You need a "spec" just the same for a CSV: does it have headers, is there a consistent column ordering, etc. Control characters work the exact same as commas and newlines, but you don't have to worry about escaping them because (unless your data is binary or recursive) the control characters won't show up in the data!

jdougan · 2024-03-26T21:33:28.000000Z

Do you have a reference to how this worked?

josefx · 2024-03-26T06:08:47.000000Z

Does it have nesting operators? I want to embedd ASCII within my ASCII fields. So I can have a table within my table.

ykonstant · 2024-03-26T09:28:54.000000Z

The POSIX spec defines portable and non-portable ASCII characters, prudently placing the separators in the non-portable set. In order to nest tables, base64 (or whatever portable encoding) code the table into the field. This works much better, easier and more error-free than any escaping strategy.

Regarding visibility in editors, if you are nesting tables I don't think you care too much about manual editing, but if you do, it is easy to setup vim/emacs/vscode to display the ASCII separators. I am being told even notepad++ can display those, there are no excuses.

vaylian · 2024-03-26T06:27:13.000000Z

It kind of does. See `man ascii`

* FS (0x1C) file separator

* GS (0x1D) group separator

* RS (0x1E) record separator

* US (0x1F) unit separator

I've never seen these in the wild though.

alex_suzuki · 2024-03-26T06:36:29.000000Z

They‘re used a lot in barcodes, e.g. for delimiting the different fields of a driving license.

euroderf · 2024-03-26T08:17:15.000000Z

That sounds like the premise for an utterly fascinating deep dive.

alex_suzuki · 2024-03-26T16:23:33.000000Z

It‘s quite the rabbit hole, I can assure you.

adammarples · 2024-03-26T12:15:55.000000Z

Not seeing them in the wild is good, it means they will work when you use them. The more they get used the more often you'll find they crop up in the text fields you're trying to separate and the whole csv escaping nightmare will continue.

Repulsion9513 · 2024-03-26T21:31:49.000000Z

If you've got delimited-text embedded inside your delimited-text, you've got a nightmare that escaping can't save you from.

(obligatory https://knowyourmeme.com/memes/xzibit-yo-dawg)

dhagberg · 2024-03-26T17:02:04.000000Z

They're definitely used in ACARS messages that go to every commercial airplane cockpit...

lolive · 2024-03-26T06:19:01.000000Z

A manager: I wish I could have a CSV inside my CSV.

Any sane person: NO!

josefx · 2024-03-26T06:27:56.000000Z

base64 encoded fields it is.

lolive · 2024-03-26T06:36:11.000000Z

That was the right answer. You are hired!

bruce511 · 2024-03-26T08:07:01.000000Z

Well you might be wrong, but EDI in general and HL7 specifically allow 3 levels of "fields in fields in field".

As long as your parser copes, and as long as you have appropriate structures to import into, its no big deal.

iainmerrick · 2024-03-26T08:52:11.000000Z

So now when I'm exporting data I need to know what nesting level it's going to live at so I can generate the correct separators?

I really think that might be the worst idea I've heard for a while!

bruce511 · 2024-03-28T14:52:27.000000Z

It's not toooo bad :) But it's very much a "thing" in the real word. It's called EDI, and it's been around for a long time.

icedchai · 2024-03-26T13:47:42.000000Z

About 10 years ago, I worked at a place where we were embedding both XML and JSON in CSV fields.

Then there are always the people who can't generate a valid CSV due to escaping issues...

Nothing is ever simple.

wiredfool · 2024-03-26T08:42:10.000000Z

Been there, done the Interop.

lolive · 2024-03-26T06:23:21.000000Z

https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...

ryncewynd · 2024-03-26T00:55:37.000000Z

Are there unicode characters specifically for delimiters?

If Excel had a standardised "Save as USV" option it would solve so many issues for me.

I get so many broken CSVs from third-parties

hiccuphippo · 2024-03-26T01:09:53.000000Z

ASCII has characters for unit, record, group and file separator. And a some days ago there was a story here about using the unicode printable representation of these for an editor friendly format.

https://news.ycombinator.com/item?id=39679378

crq-yml · 2024-03-26T01:11:57.000000Z

There are characters from ASCII for delimiting records, which are underused because they cause confusion about whether they should be represented as a state change like a backspace character, or as a glyph. See also: "nobody can agree on the line ending character sequence".

The USV proposal uses additional codepoints introduced in Unicode for the representation of the record delimiters, so they will always look and edit like character glyphs, and nobody is using them for some other purpose. The standardized look of these glyphs is unappealing, and they aren't easy to type, but it's fixable with a font and some editing functions.

Most of the issue hinges on Excel support.

samatman · 2024-03-26T02:15:14.000000Z

> nobody is using them for some other purpose.

There's a lot of tooling which uses them for their intended purpose, which is to represent the C0 control characters in a font, so they can be printed when they appear in a document. Your editor is probably one of those.

Which is why I consider USV a terrible idea. If I see ␇ in a file, I don't want to be constantly wondering if it's "mention ␇" or "use ␇ to represent \x07". That's why the control pictures block exists: to provide pictographs of invisible control characters. Not to make a cutesy hack "look! it's a picture of a control character pretending to be what it isn't!!" format.

ykonstant · 2024-03-26T09:41:05.000000Z

I agree about USV, it creates confusion where none needs to exist. For personal use, though, it is not that bad to receive a USV: it should be postmarked ".usv" and in any case if you suspect shenanigans you can `grep` for the offending (literally!) unicode characters and `tr` them into proper ASCII separators. Now, if there is nesting in the USV, I give up.

I share the lament: the whole table issue was solved before it became a problem. POSIX divides ASCII into portable and non-portable characters; only portable characters are allowed in the fields and separators are non-portable. If you need nesting, use a portable encoding of the inner table. This scheme repeats indefinitely without escaping hell or exceptions, preventing tons of errors and headache.

Visibility is such a bizarre complaint. Text editors already handle control characters: they handle tabs, they handle newlines, it is not a tremendous, earth-shattering feature request to make them handle separators gracefully.

rightbyte · 2024-03-26T07:49:00.000000Z

I don't underatand why this is a question up for debate. You need eye tracking, so that there is a beep when you read the relevant part.

bhaney · 2024-03-26T01:09:01.000000Z

Hell, there's ASCII characters specifically for delimiters. 0x1C to 0x1F are respectively defined as file, group, record, and unit separators. Unicode naturally inherits them all.

smcin · 2024-03-26T13:31:23.000000Z

Except nobody uses them. Another previous discussion: https://news.ycombinator.com/item?id=33935140

fbdab103 · 2024-03-26T02:50:56.000000Z

My significantly bigger beef would be all of the auto-formatting Excel does to mangle data. Excel loves to turn entries into dates.

Human genes had to be renamed so as to avoid this Excel features.

potatoicecoffee · 2024-03-26T04:27:29.000000Z

excel now has a prompt so you can tell it to not convert stuff automatically

fbdab103 · 2024-03-26T04:33:49.000000Z

Gasp. Big news. I do not recall ever seeing this, so I wonder if $JOB is running some hilariously outdated version for compatibility with a load bearing VBA script.

sgerenser · 2024-03-26T14:54:43.000000Z

Automatic Data Conversion toggle was only added in the past ~year: https://insider.microsoft365.com/en-us/blog/control-data-con...

hn_throwaway_99 · 2024-03-26T01:15:58.000000Z

Yes, this recent discussion has lots of good info and links in the comments: https://news.ycombinator.com/item?id=39679378

lostlogin · 2024-03-26T02:05:21.000000Z

> Are there unicode characters specifically for delimiters?

We could use the HL7 pipe ‘|’ and all enjoy that hell.

deathanatos · 2024-03-26T03:25:20.000000Z

God, please no.

For those unfamiliar with the atrocity that is HL7v2, the format is essentially CSV, but with the record separator set to a lone CR, and the field separator usually set to |. Usually, because the format lets the consumer of the format redefine it, for whatever reasons. (The first use of the field separator it determines whatever character it will be. Thankfully, the first use is in a fixed spot, so it's determinable, but still. Oh, but we don't know the character encoding until like the 18th field in … and it doesn't necessarily have to be an ASCII superset. So I have no idea what an HL7v2 message in a non-ASCII superset even looks like, or how a parser is even supposed to reasonably parse such a thing. I presume attempt a decoding in all possible decodings, and then see which one matches the embedded character set, and pray nobody can create a polyglot?)

There's also further separators, delimiting within a field.

It also has its own escape sequences, to deal with the above.

… and it is what carries an unfortunate amount of medical data, and is generally how providers interoperate, despite the existence of more civilized standards like FHIR

deathanatos · 2024-03-28T18:41:13.000000Z

I've unfortunately had to bless my brain with much more of this standard this week, for some reason.

Did I mention that subcomponents (if you look at it like a CSV, cells are further subdivided into components & subcomponents, so subcomponents are sort of where we hit "cell text", if you want to keep going with that broken analogy) — contain escape sequences, so that you can have things like the field separator. Normal stuff, so far. The escape sequences also include highlighting, binary blobs, and a subset of roff.

chaps · 2024-03-26T02:39:35.000000Z

My worst nightmare was a semicolon-delimited file. Where one of the columns had hand-typed street names - without quotes.. so "WELLS" was often "WE;;S".

Since it was the only column like that, the # of columns to the left of the annoying column and the # on the right would always stay the same. So it was pretty easy to clean.

doctor_eval · 2024-03-26T02:43:33.000000Z

It’s been years since I last worked with HL7. Isn’t there also ^ and ~ to deal with?

Hell indeed.

alexvoda · 2024-03-26T12:03:15.000000Z

Isn't that what HL7 stands for? Hell Layer 7 as in the seventh circle of hell.

deathanatos · 2024-03-28T18:42:01.000000Z

Yes, there is. Multi-dimensional CSV?

lostlogin · 2024-03-26T03:49:02.000000Z

There is the white space bs too. Sometimes it matters, sometimes it doesn’t. What type of white space is it?

Seriously rough.

Dalewyn · 2024-03-26T02:16:08.000000Z

>I do hope "Unicode Separated Values" takes off. It's essentially just a slight tweak to CSV where the delimiters are special unicode characters

Commas can be typed by anyone on any keyboard and readable by anyone.

Special Unicode Characters(tm) can't be typed by anyone on any keyboard and readable by noone.

Convenience is a virtue.

hn_throwaway_99 · 2024-03-26T02:34:10.000000Z

I can't remember the last time I, or anyone I've ever worked with for that matter, ever typed up a CSV from scratch. The whole point of USV is that the delimiters can't normally be typed so you don't have to worry about escaping.

USV supports displayable delimiters (see https://github.com/SixArm/usv), so for the much more common case of editing an existing CSV in a text editor, you can just copy and paste.

albert_e · 2024-03-26T03:24:16.000000Z

Everyone of us was a beginner at some point. The first time we came across CSV format we likely typed it in notepad by hand. A lot of issues with CSVs are also sometimes troubleshooted by hand-- by manually fixing a quote or a comma.

There is value is the ability to do this level of editing and troubleshooting.

hn_throwaway_99 · 2024-03-26T03:44:29.000000Z

> The first time we came across CSV format we likely typed it in notepad by hand.

Again, I'm not saying CSVs aren't edited by hand in a text editor, I'm saying they aren't created from scratch in a text editor, even by beginners. USVs are easy to edit in a text editor, too, and I tried viewing and editing USVs with a couple different fonts and had no problems.

iainmerrick · 2024-03-26T08:54:19.000000Z

If the separators can't easily be typed, how do you add a new cell?

scbrg · 2024-03-26T12:31:30.000000Z

Nobody can type up a GIF image, or Word document in a Notepad, yet files of both those formats exist. The answer obviously is tooling. If a format with sane separators was common, so would editors that could edit that format be.

iainmerrick · 2024-03-26T12:52:43.000000Z

I was responding to the GP's:

USVs are easy to edit in a text editor

I don't see how that's the case.

If a format with sane separators was common, so would editors that could edit that format be

Sure, but that's a hypothetical future editor, not something that currently exists.

Edit to add: I also disagree with "sane" in that context. New separators won't solve anything. You'll always need escaping/encoding/encapsulation; get that right and everything else follows. JSON is comma-separated and does just fine.

hn_throwaway_99 · 2024-03-26T13:50:54.000000Z

Copy and paste.

eviks · 2024-03-26T06:29:20.000000Z

No we didn't, we likely typed in Excel by double clicking on our first csv

hnlmorg · 2024-03-26T07:35:59.000000Z

I can’t speak for everyone, but I definitely didn’t use Excel.

Dalewyn · 2024-03-26T02:37:56.000000Z

I've valued the virtue of CSVs being readable by any text editor known to man, and I've occasionally edited them by hand. The pure simplicity of reading and typing commas trumps any value provided by more esoteric configurations.

As for escaping, that's for the subsequent programmers (which could also be me) to figure out. If it is me, I'll deal with it because it keeps things simple.

groestl · 2024-03-26T09:10:15.000000Z

> I've occasionally edited them by hand.

Yeah, usually when the quoting was f'up.

deathanatos · 2024-03-26T03:30:03.000000Z

> Special Unicode Characters(tm) can't be typed by anyone on any keyboard and readable by noone.

While I'm not a fan of USV, I do believe it is type-able on both macOS and Linux¹. The IME's character picker in both OSes contains all of the necessary characters, I think. (I use "␤" in commit messages, occasionally. That was a typed ␤, not copy/pasted from a Google query or such.)

It would be quite tedious, I do admit; reasonably typed by someone, probably not.

I don't normally type out CSVs by hand either, though.

(¹I don't know about Windows as I simply haven't cared about it in years. For all I know, they've grown an IME in the intervening decade.)

al_borland · 2024-03-26T04:35:55.000000Z

Even if the csv isn’t being typed out by hand, when importing into Excel and the delimiter needs to be manually entered, because it isn’t one of the common ones Excel has a radio button for… it is nice to be able to easily type it.

deathanatos · 2024-03-26T16:55:42.000000Z

While I can see a convenience argument for the somewhat contrived case of typing an entire file out by hand, entering the character once for the import does not seem like a great bar.

> it is nice to be able to easily type it.

Again, that's where an IME is helpful; on the OSes I mention, it's "␞" is:

  macOS: ⌘+^+Space, "record", <arrows to select>, Enter
  Linux: Super+e, "record", Space, Space, <arrows>, Enter

The process is highly visual on both, so you're getting feedback about whether you're about to hit the right character, or not.

(And like, if you have the file, you can always ^C ^V the character, if say you don't know how to IME, or you don't know what the name of the char is, etc.…)

bee_rider · 2024-03-26T04:06:23.000000Z

Would it be possible to just type the file using commas, semicolons, or pipes or something (whatever you happen to know you don’t have in your file) and then convert them using sed?

deathanatos · 2024-03-26T16:59:44.000000Z

Yes, it would be possible. You'd have to make sure the character didn't appear / no escaping at all was present, which the data may or may not allow.

Might as well just get a dedicated CSV→USV converter, though.

(I have a variant of this problem in JSON→YAML … usually I just shove one into the other & pray, akin to your sed solution.)

Dalewyn · 2024-03-26T05:04:45.000000Z

Any character within reason can certainly be entered by way of Character Map in Windows or its equivalent in Linux or MacOS, but if you're arguing that then you don't understand the crux of my argument: Convenience is a virtue.

There is value in the delimiter simply being a key on practically any keyboard in existence. Anything that involves something more complicated than just pushing a single button on a keyboard (this includes IMEs) is a non-starter, because convenience is a virtue.

deathanatos · 2024-03-26T16:50:40.000000Z

> Anything that involves something more complicated than just pushing a single button on a keyboard (this includes IMEs)

My point is that this is merely a more stringent argument; it's now "on a keyboard, and cannot involve dead keys, etc." … which now excludes classic CSV, too, which requires two keys to enter a double quote. (Shift+')

Again, it does require more keys, and it is certainly not convenient, but n keys to me is still n keys. The real question is why one isn't using a proper tool to enter the data, and is instead encoding it by hand, which, again, even for a classic CSV, is basically something I've never done. (… because why would one?)

adammarples · 2024-03-26T12:18:40.000000Z

The fact that it is a character on the keyboard is exactly the problem, too. Any character a user can easily enter will definitely end up mixed into your data somewhere.

medstrom · 2024-03-26T13:21:35.000000Z

The IANA standard for TSV already disallows tabs inside fields, so you can skip writing any quoting logic (in principle). The MIME type is `text/tab-separated-values`.

https://www.iana.org/assignments/media-types/text/tab-separa...

chrisandchris · 2024-03-26T07:41:25.000000Z

So true. Working with imports/exports in CSV from ERP software. One can't imagine how often "Oh, this import doesn't work. I'll just fix the CSV file" occurs. Try that with some compressed, "esoteric" file, or even CML and users will break it.

Besides all the downsides CSV has, as soon as it's not only machine-machine communication and a human is involved, CSV is just simole enough.

xeromal · 2024-03-26T12:23:56.000000Z

Check out Polars in python if you want some CSV performance lol. I recently got a 5 million row CSV from a 3rd party and I could manipulate columns (filtering, sorting, grouping) in actions that took less than a second. It's an incredible tool.

paulddraper · 2024-03-26T05:40:38.000000Z

USV has a mountain of problems.

And really is in search of a problem to solve.

palmfacehn · 2024-03-26T05:00:19.000000Z

JSON objects as a CSV field has been mostly agreeable for my usage. It would be nice if some of the spreadsheet apps displayed the object tree.

puppymaster · 2024-03-26T13:58:24.000000Z

ditto. if it fits in the ram, file types don't matter.

likis · 2024-03-26T07:49:45.000000Z

It seems like you missed the conclusion in the article. If users want CSV exports, let them have it.

If you have important data being shuffled around systems, pick something with a specification instead.

whakim · 2024-03-26T01:27:21.000000Z

To me this criticism feels excessive. It feels like the author is describing their frustrations with internal usage of CSVs - there's no mention of customers and non-technical stakeholders at all. I think it goes without saying that Parquet files and other non-human-readable formats are a nonstarter when working with external stakeholders and the last paragraph makes that clear - if the end-user wants CSV, give them CSV.

I also think we shouldn't blindly dismiss the performance drawbacks of CSV when working with data pipelines. At even modest scales it becomes hard to work with very large CSVs because the data often doesn't fit into memory, a problem easily solved by Parquet and other formats assuming you only need a subset.

prepend · 2024-03-26T02:41:51.000000Z

I deal with gig size csvs all the time and don’t have any performance issues. These aren’t huge files, but decent sized. And most are just a few megs and only thousands to millions of records.

Csv is not very performant, but it doesn’t matter for these use cases.

I’ll also add that I’m not working with the csvs, they are just I/o. So any memory issues are handled by the load process. I certainly don’t use csvs for my internal processes. Just for when someone sends me data or I have to send it back to them.

That being said my workstation is pretty big and can handle 10s of gigs of csv before I care. But that’s usually just for dev or debugging and anything that sticks around will be working with data in some proper store (usually parquet distributed across nodes).

whakim · 2024-03-26T15:12:11.000000Z

That may be your experience, but certainly not a universal experience (and apparently not the author's, either). In my experience, it's pretty easy to have CSVs (or Parquet files, or whatever) that are tens or hundreds of GBs in size. The space savings from a more modern file format are significant, as is the convenience of being able to specify and download/open only a subset of rows or columns over the network. Most of us don't have workstations with 50GB of RAM, because it's far more cost-effective to use a Cloud VM if you only occasionally need that much memory.

That being said, the real point here is that folks blindly use CSVs for internal-facing processes even though there's no particular reason to, and they have plenty of drawbacks. If you're just building some kind of ETL pipeline why wouldn't you use Parquet? It isn't as if you're opening stuff in Excel.

prepend · 2024-03-26T19:21:00.000000Z

The author is giving universal advice to all friends.

If the title was “friends in certain circumstances shouldn’t let friends in certain circumstances export to csv.”

Even a laptop with 8gb ram can open a gig csv.

Of course the internals of your etl will use some efficient data structure, but you’d still want to export as csv at some point to get data to other people. Or you want your friends to export csv to get data to you.

Certhas · 2024-03-26T08:15:55.000000Z

If I run a simulation workload it's pretty easy to generate gigabytes of data per second. CSV encoding adds a huge overhead space and time wise, so saving trajectories to disc for later analysis can easily become the bottleneck.

I have had many other situations where CSV was the bottleneck.

I still would default to CSV first in many situations because it's robust and easily inspected by hand.

smcin · 2024-03-26T12:41:27.000000Z

> That being said my workstation is pretty big and can handle 10s of gigs of csv before I care.

How much RAM do you have? What's the ratio of [smallest CSV file which bottlenecks]/[your RAM]?

prepend · 2024-03-26T19:22:20.000000Z

My dev workstation has 96gb. I don’t work with massive data files so I’ve never really hit my limit. I think the biggest raw data file I’ve opened was 10-20gb.

strken · 2024-03-26T07:02:13.000000Z

I very much agree with this. For an integration where you have control over both ends of the pipeline, CSV is not optimal unless there's existing work to build on, and even then it's a legacy choice.

Parquet and Avro are widely supported in backend languages and also in data analysis. I don't think the article is talking about exported-like-a-jpeg, but instead exported-like-a-daily-report-run: the data scientist doing the exporting is probably using R or Pandas instead of Excel, and can reasonably be expected to read https://arrow.apache.org/docs/r/reference/read_parquet.html.

jgord · 2024-03-26T02:41:15.000000Z

btw, xsv has solved most of my problems dealing with 'large' 40GB csv files

throwaway2037 · 2024-03-26T04:55:10.000000Z

xsv? I never heard of it. This one? https://github.com/BurntSushi/xsv

If yes, looks very cool. Plus, bonus HN/Internet points for being written in Rust!

jgord · 2024-03-26T07:13:50.000000Z

yep .. his utils are most excellent.

vmchale · 2024-03-26T13:20:52.000000Z

its parser is buggy! https://github.com/BurntSushi/xsv/issues/337

(I ran into this issue myself)

burntsushi · 2024-03-26T13:38:38.000000Z

I just responded to that. It isn't the parser that's a buggy. The parser handles the quotes just fine. If it didn't, that would be a serious bug in the `csv` crate that oodles of users would run into all the time. There would be forks over it if it had persisted for that long.

The problem is that `xsv table` doesn't print the parsed contents. It just prints CSV data, but with tabs, and then those tabs are expanded to spaces for alignment. Arguably it ought to print the parsed contents, i.e., with quotes unescaped.

It almost looks like it's doing that because the quotes are removed in one case, but that's only because the CSV writer knows when it doesn't need to write quotes.

mardifoufs · 2024-03-26T16:03:49.000000Z

Ok this might sound stupid, and a bit unrelated, but you make so many great tools that I can't help but ask. How do you start planning and creating for a tool that needs to "follow standards"(in this case I know CSV is under specified but still!), is it by iteration or do you try to set and build a baseline for all the features a certain tool needs? Or do you just try to go for modularity from the get go even if the problem space is "smaller" for stuff like csv for example.

burntsushi · 2024-03-26T17:21:47.000000Z

I suppose https://old.reddit.com/r/burntsushi/ might be a good place for questions like this.

I don't really have a simple answer unfortunately. Part of it is just following my nose with respect to what I'm interested in. So there's an element of intrinsic motivation. The other part is trying to put myself in the shoes of users to understand what they want/need. I typically do a lot of background research and reading to try and understand what others have done before me and what the pain points are. And iteration plays a role too. The `csv` crate went through a lot of iteration for example.

I think that's about it. It's hard to answer this question in a fully general way unfortunately. But if you want to get into it, maybe you can be the first person who opens a thread on r/burntsushi haha.

chaps · 2024-03-26T01:50:46.000000Z

Why is this getting downvoted? They're right that the criticism is pretty excessive:

  "Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That’s a clever and unique observation."

Hm.

prepend · 2024-03-26T02:43:30.000000Z

I got a little snarky but I think the analogy holds.

Esperanto is a superior language to English. And English has many flaws.

Theoretically it would be better to have all pilots and airports learn an efficient language.

But it would be stupid and immature to seriously write a blog post about that, especially without talking about all the flaws in that plan.

lmm · 2024-03-26T03:50:50.000000Z

> Esperanto is a superior language to English.

Not really. You can say it's more regular but that's because it sees barely any actual use; if it ever gained popularity it wouldn't stay regular. (And given that pilots speak in set phrases anyway, irregularity isn't really an issue). It's not a great language by any stretch, it's an awkward mismash of four european languages; sure it sounds kind of nice in an Italianate way, but if that's what you want then why not just speak Italian?

shermantanktop · 2024-03-26T05:57:08.000000Z

> if it ever gained popularity it wouldn't stay regular

This. Utility and purity always pull in opposite directions.

I see those boutique little (programming) languages written by amateur language designers with exotic type systems or “everything is an X” philosophies, and my reaction is to assume that they are useless for anything past toys and experiments.

I know useful language features have been born in that world and then eventually bolted onto mutt languages like Java and Python, but that suits me just fine.

nrdvana · 2024-03-26T04:09:54.000000Z

It would stay regular if there was a strict governing body for it that wasn't a Webster-style "whatever people are speaking is the new definition of correct".

English really is a disaster of a language. There was a(nother) great XKCD about it just a few days ago. https://xkcd.com/2907/

prmoustache · 2024-03-26T08:20:34.000000Z

> if there was a strict governing body for it that wasn't a Webster-style "whatever people are speaking is the new definition of correct".

There is no way it can work.

People don't care about governing bodies when they speak a language.

maxcoder4 · 2024-03-26T11:18:06.000000Z

They kind of do. In my language there exists a central "governing" body that decides what is correct, and some "incorrect" regionalisms are disappearing because of it.

FredPret · 2024-03-26T11:48:42.000000Z

When that happens, it’s likely more about politics and social status than the governing body.

Ie, the governing body decrees that the regionalisms from the dominant region is the definitive version of the language. But it might be considered cool to speak that way even without a governing body.

mardifoufs · 2024-03-26T16:19:29.000000Z

Are you referring to french? Because, if anything, french in France has an insane quantity of slangs and has an extremely emergent vocabulary. Much more so than any English speaking country I can think of. Quebec isn't really influenced by the Académie française yet has a much more "correct" usage of the language generally speaking.

Maybe it's a totally different language but still it goes to show that even a very prestigious central authority doesn't make a language better or less prone to diverge. Regardless of the reason, French is evolving much more quickly than English.

nrdvana · 2024-03-26T18:19:24.000000Z

But maybe that's beside the point. If someone wants to "learn French" they can learn by the official rules and communicate with other french-speaking people regardless of how many slang variants exist in France. They can also probably watch French television and understand it.

The point of esperonto was to make it easier to learn. French is regular, but extremely complicated. English is complicated and has a million special cases. Both languages are hard enough to master that society starts to judge a person's intelligence by how well they know the rules and special cases.

Stranger43 · 2024-03-26T06:18:40.000000Z

Correction it's a disaster of at least 4 languages and this is probably why English is so hard to dethrone as it have no strict ownership so everyone is kind of equal in speaking it incorrectly.

Sometimes lack of rigidity is actually an feature that allows for things to sort of work that would be politically impossible if thinks had to be specified formally before being used.

nrdvana · 2024-03-26T19:20:49.000000Z

I'm pretty sure the reason English is hard to dethrone is because Britain ~helped~ forced the various colonies to join world commerce using English, so they started teaching it to entire generations as the national second-language, and then because the USA dominated world commerce after that in a sort of "we'll let you in on the game if you speak our language and use our money" sort of way.

vouaobrasil · 2024-03-26T12:01:26.000000Z

> Esperanto is a superior language to English. And English has many flaws.

At least one one perspective, English is superior. That perspective is that you can actually use it in almost any modern situation because it has been tried and tested globally.

prepend · 2024-03-26T19:25:06.000000Z

This is the point of my analogy.

English:Esperanto::csv:parquet

(Although I think parquet is much more useful than Esperanto and may eventually end up dethroning csv)

gxs · 2024-03-26T02:37:44.000000Z

Agree.

Not saying csv doesn’t have its issues, but I don’t think the author made a convincing argument.

A lot of the issues the author brought up didn’t sound that bad and/or it sounds like he never looked at the source data first.

If you’re doing work with large datasets, I think it’s a good practice to at least go and look at the source data briefly to see what to expect.

This will give you a good idea of the format it outputs, data types, some domain context, etc. or some combination thereof and I don’t think it even takes that long.

Also, it reminds me of the arguments against excel in a way. Most people know what a csv is, more or less how to open it, and don’t need too much context when discussing the file. Someone will quickly understand if you tell them the file isn’t delimited properly right away. These are pros that shouldn’t be taken for granted.

Again, I’m not saying csv doesnt have issues or that there aren’t better alternatives, simply that I didn’t find this particular argument convincing.

mrgoldenbrown · 2024-03-26T05:28:00.000000Z

IME most people don't know that using Excel to open and save a csv will silently mangle data. In our application leading zeros are significant, so we constantly get screwed by people trying to do quick manual edits and breaking the data. If we're lucky it breaks so badly the import fails. It's worse when the mangling results in structurally valid but wrong data.

gxs · 2024-03-26T06:00:26.000000Z

I think what you’re saying is accurate, but it’s also important to be practical about stuff.

These are pretty well know excel limitations by now.

And really, anyone using excel who is somehow not aware of that limitation is probably not someone yet experienced enough to be working on a larger and/or mission critical dataset to begin with.

Are there exceptions? Sure. You might be tempted to cite the example of the incident where this happened to some biologists not too long ago, but mistakes happen. I’ve seen people make mistakes building android or iPhone using the right (TM) tools.

What is the exact number of mistakes where you make the decision to jump to a new format?

I’m not sure. This does happen eventually, but the author didn’t make a strong case here imo.

likis · 2024-03-26T07:52:31.000000Z

But the point is that you don't have to look at the source data if you have an actual specification and defined format, right?

eviks · 2024-03-26T05:44:30.000000Z

> all these sources aren’t able to agree on some successful format.

But the same is true for csv, and they are not readable by everyone since you don't always know how to read them, there is not enough info for that

Also it's not a good reflection on "deep experience" if it leads to reflexive defense of common stupid things people do with wrong analogies (e.g, flac is less efficient, so more like csv)

prepend · 2024-03-26T19:32:55.000000Z

In my experience csv has the fewest problems. Not that it has zero problems.

lobochrome · 2024-03-26T05:58:30.000000Z

For me, the giveaway was:

"You give up human readable files,..."

I was genuinely interested in some alternative suggestions - but the human readableness of csv is what makes it so sticky imo.

DontchaKnowit · 2024-03-26T14:41:49.000000Z

My entire exoerience with software development has been me bellyaching about how stupidly things are setup, why dont we do it this way instead etc... only to actually set about working on fixing these things and reqlizing its either way harder than I thought, it makes more sense than I thought, or it just plumb isnt worth the effort.

The evervescent suggestions of a brighter more logical, even obvious, solutions, is often a clear indicator of domain inexperience or ignorance.

Oioioioiio · 2024-03-26T12:59:36.000000Z

I worked plenty enough with 'diverse data pipelines' and most of them were shit due to other companies just not knowing how to work properly.

CSV created tons of issues regarding encoding, value separation etc.

I started talking to our customers and were able to define interfaces with better and aligned format. json made my life easier.

netcan · 2024-03-26T13:46:50.000000Z

So...

In some senses, I think internet culture (maybe modern intellectual culture generally) gets stuck in these repetitive conversations.

Reprosecuting without seemingly knowing about all the previous times the conversation has been had.

mannyv · 2024-03-26T17:08:53.000000Z

And it's surprisingly hard for etl departments to export to csv correctly. I mean, if they can't do csv they can't do anything more complicated for sure.

iainmerrick · 2024-03-26T08:48:53.000000Z

This article seems written by someone who never had to work with diverse data pipelines

I think that's a little unfair, it sounds like the author does have a decent amount of experience working with real-world CSV files:

I remember spending hours trying to identify an issue that caused columns to "shift" around 80% into a 40GB CSV file, and let me tell you, that just isn't fun.

fennecbutt · 2024-03-26T21:39:08.000000Z

Yup, csv is always the best fallback, imo. It's: easily generated, easily parsed, human readable/editable, compact, portable, list goes on.

msla · 2024-03-26T01:06:23.000000Z

> Csv that’s zipped is producible and readable by everyone. And that makes is more efficient.

If only CSV were CSV, as opposed to some form that's 80-90% CSV by line count with enough oddities to really make the parser ugly and hard to read.

See, the sweet spot isn't something completely unstructured, because then you feel justified in throwing up (your hands) and declaring defeat. The sweet spot is a file that's sufficiently close to being structured you can almost parse it nicely, but has enough bad lines you can't go in and fix them all by hand in a reasonable timeframe, and you can't tell upstream to get their shit in order because it's only a few lines.

prepend · 2024-03-26T02:45:02.000000Z

There’s definitely hair to deal with and it’s a little messy, but it’s never a blocker.

But I’d say the error rate is actually very low, maybe .1-1% and nowhere near 10-20% of data being messed up.

wolrah · 2024-03-26T13:42:26.000000Z

> But I’d say the error rate is actually very low, maybe .1-1% and nowhere near 10-20% of data being messed up.

The thing with CSV-related issues is it's usually not a fixed percentage but instead depends on the data.

I work in the VoIP industry so I deal with the FreePBX Asterisk GUI quite often, and it uses CSV as its bulk import/export format. This mostly makes sense as the data is almost entirely (with one notable exception) simple tables that fit nicely in to rows and columns. The issue I run in to most commonly with this is that it doesn't quote numerical fields, and as a result the fields for extension numbers and voicemail PINs can be problematic when they contain one or more leading zeroes. All of the major spreadsheet software I've used defaults to dropping leading zeroes from columns they've decided contain numerical values, and this results in broken data in these cases. It's of course relatively rare for users to choose a voicemail PIN starting with zero and even more rare for extensions to be set up with a leading zero, but both happen regularly enough that I need to remember to manually configure those columns as "Text" when opening an extension export CSV.

Either way, how often the problem occurs depends entirely on the data being sent through this pipeline. Most sites will never see the problem on the extension column, but one of my sites where the company liked a user's extension to be the last four of their DID when they were initially set up 20 years ago has a dozen of them in a row.

nrdvana · 2024-03-26T04:16:12.000000Z

Depends on your tools, I suppose. I'd just like to share this:

https://metacpan.org/pod/Data::TableReader::Decoder::IdiotCS...

likis · 2024-03-26T07:47:18.000000Z

Did you actually read the conclusion at the end of the article?

"Of course, we can't conclude that you should never export to CSV. If your users are just going to try to find the quickest way to turn your data into CSV anyway, there's no reason why you shouldn't deliver that. But it's a super fragile file format to use for anything serious like data integration between systems, so stick with something that at the very least has a schema and is more efficient to work with."

hackerlight · 2024-03-26T04:26:06.000000Z

It's a premature optimization issue. If you don't have special requirements like IO throughput, or mission critical data accuracy guarantees, be biased towards picking the format that anyone can easily open in a spreadsheet.

eviks · 2024-03-26T06:45:06.000000Z

You can open out easily, but just as easily it can be wrong. So with this bias you'd still not export csv, you'd use xls

tdudhhu · 2024-03-26T07:30:06.000000Z

"Of course there are better formats, but all these sources aren’t able to agree on some successful format."

It's the same with csv. They come in all kinds of formats because nobody agreed on the standard. Comma separated, semicolon separated, pipe separated, escaped, not escaped.

Everytime I have to deal with csv I first have to figure out how to parse it in code.

So I think the author is right, we must agree on a better format because that is what friends do.

You are also right because it's an illusion to think that this is going to change anythime soon. But who knows..

iamflimflam1 · 2024-03-26T09:04:07.000000Z

Every integration I’ve ever worked on has started off with high ideas of APIs and nice data standards. And has eventually devolved into “can we just put a CSV file on an FTP site…”. With the inevitable, “it’s not really CSV…”

gerdesj · 2024-03-26T00:27:38.000000Z

... And what's more, you'll be an Engineer my son.

rossdavidh · 2024-03-26T03:11:30.000000Z

"You give up human readable files, but what you gain in return is..." Stop right there. You lose more than you gain.

Plus, taking the data out of [proprietary software app my client's data is in] in csv is usually easy. Taking the data out in Apache Parquet is...usually impossible, but if it is possible at all you'll need to write the code for it.

Loading the data into [proprietary software app my client wants data put into] using a csv is usually already a feature it has. If it doesn't, I can manipulate csv to put it into their import format with any language's basic tools.

And if it doesn't work, I can look at the csv myself, because it's human readable, to see what the problem is.

90% of real world coding is taking data from a source you don't control, and somehow getting it to a destination you don't control, possibly doing things with it along the way. Your choices are usually csv, xlsx, json, or [shudder] xml. Looking at the pros and cons of those is a reasonable discussion to have.

TimTheTinker · 2024-03-26T04:46:38.000000Z

I think his arguments apply more closely to SQLite databases. They're not directly human readable, but boy are there a lot of tools for working with them.

askvictor · 2024-03-26T07:34:55.000000Z

We have a use case where we effectively need to have a relational database, but in git. The database doesn't change much, but when it does, references between tables may need to be updated. But we need to easily be able to see diffs between different versions. We're trying an SQLite DB, with exports to CSV as part of CI - the CSV files are human-readable and diff'able.

It's also worth noting that SQLite can ingest CSV files into memory and perform queries on them directly - if the files are not too large, it's possible to bypass the sqlite format entirely.

thenbe · 2024-03-26T13:39:40.000000Z

> we need to easily be able to see diffs between different versions

Can git attributes help in this case? It allows you to teach git how to diff binary files using external tools. Here [0] is a demonstration for teaching git to produce an "image diff" for *.png files using exiftool. You can do something similar for *.sqlite files by adding these 3 lines [1] [2]. The sqlite3 cli needs to be installed.

Alternatively, there's a tool that might also fit the bill called datafold/data-diff [3]. I'm pretty sure I originally heard of it on a HN thread so those comments may offer even more alternative solutions.

[0]: https://youtu.be/Md44rcw13k4?t=540 [the relevant timestamp is @ 9:00]

[1]: https://github.com/kriansa/dotfiles/blob/7a8c1b11b06378b8ca8...

[2]: https://github.com/kriansa/dotfiles/blob/7a8c1b11b06378b8ca8...

[3]: https://github.com/datafold/data-diff

zachmu · 2024-03-27T23:21:01.000000Z

Somebody already said this, but we built exactly this and it's called Dolt.

https://github.com/dolthub/dolt

Would love to hear how it addresses your use case or falls short.

john-shaffer · 2024-03-26T13:38:03.000000Z

Have you considered https://github.com/dolthub/dolt for your use case?

ultrarunner · 2024-03-26T17:04:29.000000Z

Real world example of this that we just experienced:

I work with a provider who offers CSV exports as the only way to access data. Recently, we found they were including unsanitized user input directly in fields. They weren't even quoting these fields.

The platform "notified their quality assurance team ASAP" (like every other issue, we never heard back), but we had a deadline. This, of course, was a mess, but being able to quickly open the file and fix quotes was all it took. I shudder at the thought of trying to salvage a corrupted binary with zero help from the holder of the data.

Too · 2024-03-26T20:45:39.000000Z

This sounds like a problem that wouldn’t have existed in the first place if following a binary protocol with a standard format and using a proper serialization library.

The issue comes from CSV files looking easy to generate by hand, when it in fact is not.

ultrarunner · 2024-03-26T22:00:58.000000Z

This is a decent point, but practically the platform uses some library in their ancient ASP application. The issue is that these types of things can't be fixed because the original author is gone and the tech debt has become unmanageable. This is not the only issue we've had, unfortunately.

Debugging this same issue in a binary format is far and away not going to happen in this scenario.

rossdavidh · 2024-03-26T21:30:08.000000Z

...but if it was a binary protocol that did have a problem, of any sort whatsoever, and you couldn't get the provider to address is (in time), then you're hosed if it's not human readable.

saulrh · 2024-03-26T05:40:50.000000Z

In my experience, human readable file formats are a mistake. As soon as people can read a single file they think that that's the entire format and that it's okay to write it by hand or write their own code for it. And when everyone writes code based on the just what they've personally seen about a format, everyone is sad. This is why not a single piece of software on earth uses the CSV RFC. This is why people hand you CSVs that don't quote string fields with commas in them. This is why you find software that can't handle non-comma delimiters. This is why you find software that assumes that any field made of digits is an integer and then crashes when it tries to do string operations on it. This is why you find software that can't be used unless you change your computer's locale because the front end understands locales and uses commas for numbers but the backend is running on a server and doesn't know what locales are and now everything is broken. This has happened for every single "human-readable" format in existence: html, markdown, CSV, rtf, json, everything. I consider human readability to be a death knell for a format's chances of broad, easy interoperability. There are certainly other things that can doom a format - parquet is almost too complex to implement and so only just barely works, for example - but I'll take a sqlite database over a csv every single time.

asddubs · 2024-03-26T06:44:58.000000Z

I think a takeaway could also be not to give people options when making a human-readable format. "you always need quotes, they're not optional" solves the comma problem. "the delimiter is always a comma" solves the delimiter problem. json has also fared better than csv, I'd say.

emj · 2024-03-26T07:19:44.000000Z

That makes the delimiter "," which is ugly so someone is just going to use , instead and you are back to square one.

asddubs · 2024-03-26T22:17:44.000000Z

this is only the case because this was allowed from day 1. If no one ever allowed this to begin with, that just wouldn't work. Of course this is speaking in hypotheticals, but my point is a more general one about specifications of human-readable formats. no one ever attempts to use strings without quotes in json, because then you would be incompatible with everything. There's compatibility issues, but they're far more subtle edge cases

GuB-42 · 2024-03-26T00:25:40.000000Z

As a French, there is another problem with CSV.

In the French locale, the decimal point is the comma, so "121.5" is written "121,5". It means, of course, that the comma can't be used as a separator, so the semicolon is used instead.

It means that depending whether or not the tool that exports the CSV is localized or not, you get commas or you get semicolons. If you are lucky, the tool that imports it speaks the same language. If you are unlucky, it doesn't, but you can still convert it. If you are really unlucky, then you get commas for both decimal numbers and separators, making the file completely unusable.

There is a CSV standard, RFC 4180, but no one seems to care.

layer8 · 2024-03-26T00:39:41.000000Z

There are tools to convert between the formats. Either you have a defined data pipeline where you know what you get at each step and apply the necessary transformations. Or you get random files and, yes, have to inspect them and see how to convert them if necessary.

It’s unfortunate that there isn’t a single CSV format, but for historical reasons it is what it is. It’s effectively more like a family of formats that share the same file extension.

Excel actually has a convention where it understands when there is a line

   sep=;

at the start of the file.

By the way, in addition to differing separators, you can also get different character encodings.

Excel does understand a BOM to indicate UTF-8, but some versions of Excel unfortunately ignore it when the “sep=“ line is present…

al_borland · 2024-03-26T04:42:56.000000Z

>sep=;

Thank you! This is a game changer. As I was reading through these comments I was thinking how much better it would be if the separator could be specified in the file, but it would only be useful for my own stuff, if I was to do that on my own.

I’ll be trying this first thing tomorrow at work. I don’t do as much with CSVs as I used to, but am currently working with them a lot, and have a single entry that always throws the resulting Excel file off, and I can’t be bothered to figure out how to escape it (I spent too much time in the past messing with that and got nowhere). This sep= lines will hopefully solve my issues.

heresie-dabord · 2024-03-26T11:07:03.000000Z

I have sometimes distributed a CSV file with a similarly-named text file that contains key-value pairs to aid in the use of the CSV file.

A minimal text file of this type would contain:

    #key=value
    sep=;

eyegor · 2024-03-26T01:41:02.000000Z

What kind of voodoo is this? I've always wanted something like this for non technical coworkers across countries but I didn't know it existed. I always just exported TSV and provided steps for how to import it (although I think most excel versions have native tsv support).

Gabriel54 · 2024-03-26T00:30:46.000000Z

If I'm not mistaken this is pretty universal outside of the US (and maybe the UK).

sfRattan · 2024-03-26T00:43:59.000000Z

Going by the Wikipedia article and included map, use of comma versus period as decimal separators is roughly an even split:

https://en.wikipedia.org/wiki/Decimal_separator

https://commons.wikimedia.org/wiki/File:DecimalSeparator.svg

Larrikin · 2024-03-26T01:57:14.000000Z

Seems geographically split, but I wonder what is the actual population split is. Most of the top 10 population countries use the decimal separator. Only Brazil, Russia and Indonesia don't.

Maybe someone with a CSV of the world populations and a CSV of the countries broken down by their separator can do that comparison.

fshr · 2024-03-26T01:54:29.000000Z

There's definitely a big distribution disparity. 11 of the 15 most populous countries use the period for decimals.

micheljansen · 2024-03-26T04:57:14.000000Z

Most of Europe is in the fat long tail though, as those countries are counted individually.

zztop44 · 2024-03-26T00:38:25.000000Z

You are mistaken. Probably more countries overall use a decimal comma, but the decimal point is used as convention in many countries, including China, India, Nigeria and the Philippines.

echoangle · 2024-03-26T06:28:38.000000Z

Does any tool seriously localize CSVs?

elric · 2024-03-26T09:13:30.000000Z

Any serious CSV tool has the option to pick a delimiter. Usually semicolon or comma, some offer additional options. The only impact it has is on which fields need quoting. When using comma, all decimals in many langugaes need to be quoted. When using semicolon those don't need to be quoted.

Overall, semicolon feels like the superior delimiter.

Most sensible people don't export formatted numbers (e.g. 100.000,00), but even those are pretty trivial to import.

nlitened · 2024-03-26T12:51:30.000000Z

So many tools do that, even (most importantly) Excel does

kmmlng · 2024-03-26T10:19:34.000000Z

In my experience, this is only a problem when you are using Excel. It's ridiculous how bad Excel is at handling CSVs, I really cannot comprehend it. If you use LibreOffice all your problems magically disappear.

henriquecm8 · 2024-03-26T17:11:47.000000Z

I find out that Excel follows windows locale, you can change the decimal and the thousand separator on windows, and it will affect how excel exports and reads CSVs.

stephenr · 2024-03-26T08:39:39.000000Z

Isn't this exactly what quoting solves?

i.e.: ``` "1,20","2,3",hello "2,40","4,6",goodbye ```

If your tool reads CSV by doing `string_split(',', line);`, your tool is doing it wrong. There's a bunch of nuance and shit, which can make CSVs interesting to work with, but storing a comma in a field is a pretty solved issue if the tool in question has more than 5 minutes thought put into it.

ric2b · 2024-03-26T09:12:57.000000Z

Now all your numbers are strings

elric · 2024-03-26T09:14:36.000000Z

It's a text file. All your numbers were already strings. Nothing has changed.

ranguna · 2024-03-26T11:06:52.000000Z

There's a difference between "1" and 1. When you import a csv and try to do maths on a "number" you won't get the expected result. Some importers won't even allow you to specify that "number" columns are numbers, they'll outright fail and force you to say it's a string, or you'll have to specify which columns are "numbers" and map the strings to numbers on the importer side.

If they are numbers to begin with (not "numbers"), you can just import the csv and you'll get the expected result out of the box.

In the end things are just 1s and 0s, but that doesn't mean we only ever do binary operations on data at the abstraction layer we humans operate, so saying it's just 0s and 1s or just strings is not very smart.

elric · 2024-03-26T17:25:02.000000Z

Sounds to me like you're using a shitty parser. CSV is schemaless. It is up to you to tell the parser what types to use if it isn't unambiguous. Quoted values can be numbers, and unquoted values can be strings. I have not used any CSV tools that don't support this behaviour.

ranguna · 2024-03-27T09:25:37.000000Z

A shitty parser is one that assumes, if I quite a number I want it to be a string.

itishappy · 2024-03-26T15:41:17.000000Z

But now all our strings might be numbers! We now have to parse every quoted string, and we can no longer represent numbers as text.

    unquoted input:
    0, 10, "Text", "123"

    unambiguous output:
    (Num) 0, (Num) 10, (Text) Text, (Text) 123

    quoted input:
    "0", "10", "Text", "123"

    output :
    (Num) 0, (Num) 10, (Text) Text, (Num) 123

elric · 2024-03-26T17:27:45.000000Z

Why would you quote Text in the first example? That makes no sense. Text does not contain any delimiters or special characters.

Unquoted input should look like this:

unquoted input: 0,10,Text,123

Note the absence of spaces as well. Not sure what flavour of CSV you are using, but there usually aren't spaces after the delimiter.

itishappy · 2024-03-26T17:48:59.000000Z

Fair point, but that doesn't really resolve the issue. Here's a cleaned up example showing the same problem:

    unquoted input:
    0,10,"Text,","123"

    output:
    (Num) 0
    (Num) 10
    (Text) Text,
    (Text) 123

    quoted input:
    "0","10","Text,","123"

    output:
    (Num) 0
    (Num) 10
    (Text) Text,
    (Num) 123

junto · 2024-03-26T11:10:50.000000Z

Same in German. We have things like 1.500.021,92 (I.e. 1,500,021.92)

gerdesj · 2024-03-26T00:58:43.000000Z

If only a totally separate data field separator character had been invented early on and been given its own key on the keyboard, coloured: "only use for field delimiting". You know as well as I do that it would have been co-opted into another rôle within weeks, probably as a currency indicator.

You should probably use PSV - Point Separated Variable! Obviously we would need to adjust PSV to account for correct French word order (and actually use French correctly). Britain and USA would use something called PVS2 instead as a data interchange format with France which involves variable @@@ symbols as delimiters, unless it is a Friday which is undefined. The rest of the world would roll its eyes and use PSV with varying success. A few years later France would announce VSP on the back page of Le Monde, enshrine its use in law (in Aquitaine, during the fifteenth century) but not actually release the standard for fifteen years.

The world is odd. We have to work within an odd world.

Interestingly enough, you and I could look at a CSV encoded data set with commas as decimal separators and work out what is going on. You'll need a better parser!

lolive · 2024-03-26T06:13:15.000000Z

Don’t look at the ASCII table, at entries 28 to 31. You are not mentally ready for what you will find there.

junto · 2024-03-26T11:08:46.000000Z

> ..what you will find there

Made me laugh, especially since you can’t “see” them.

In fact, that was maybe an oversight when ASCII was designed, but maybe there was a reason for that. If they were visible, and were actually recognizable as separator types then people would know them better.

https://www.lammertbies.nl/comm/info/ascii-characters

frob · 2024-03-26T14:23:53.000000Z

My company has been using DSV for a bit: Dagger Separated Values. Unicode dagger (†) to separate values and double-dagger (‡) to indicate end of row. It allows us to easily handle both commas and newlines.

msla · 2024-03-26T01:08:03.000000Z

> In the French locale, the decimal point is the comma, so "121.5" is written "121,5". It means, of course, that the comma can't be used as a separator

Heh. Ah, HN. Always good for a laugh line.

samatman · 2024-03-26T02:19:42.000000Z

Yeah, hon hon hon and all, but one of my (US) bank statements exports a CSV which uses commas in numbers in the US fashion, so $1,500 and the like. Writing a custom CSV munger to intake that into ledger-csv was... fun, but then again, only had to do it once.

stephenr · 2024-03-26T08:41:16.000000Z

Surely if they're putting commas in values they were quoting the values though?

samatman · 2024-03-26T14:39:10.000000Z

They were not.

Fortunately the values are always prefixed by a dollar sign, making parsing deterministic, though ugly.