It's sad that the ASCII specification includes two codes, 30 and 31 (record separator and unit separator), precisely to cleanly answer the need that CSV addresses.
During the 90's I was anal about using them, annoying the hell out of my teammates and users by forcing them to use these 'standard compliant' files. Had to give up.
And they still don't fix the escaping problem. You might as well use a niche UTF-8 emoji as a separator; at least editors know how to render an emoji consistently.
As a co-op student I used a library to achieve fool-proof CSV encoding: it escaped and quoted everything as necessary, so commas, backslashes, quotes, and any other character could be included in the data. But it was rejected because the plain-text files were difficult to read and edit by hand!
I agree. So if we don't need this to be hand-crafted and for human consumption, we may as well just use some TLV or LV encoding instead of the CSV madness of separators and escaping. CSV is basically designed for hand-crafting.
They are also easy to read, perhaps easier than a space or another character, although this could be because we are just used to seeing data (e.g. CSV) presented this way.
A lesson in confusing representation with data.
If users can learn not to edit .xls files in a text editor, and to type Tab to go to the next cell in spreadsheet software, they can learn to edit CSV in a proper CSV editor.
The only trap was that we made a non-text format so simple that it tricked us into thinking "it's only plain text".
Editing CSV by hand is something I've seen a lot in internal-only software, where every user is a super power user who needs to move small but bulk amounts of data and sometimes make small edits for formatting.
Easiest example is geo: I need 20 states listed as US-CO, US-CA, etc., but one tool exported them as US CO.
- To escape the delimiter, we should enclose the value with double quotes. Ok, makes sense.
- To escape double quotes within the enclosing double quotes, we need to use 2 double quotes.
Many tools get this wrong. Meanwhile, some tools like pgAdmin justifiably allow you to configure the escape character to be a double quote or a single quote, because the CSV "standard" is often not respected.
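For reference, here is a minimal sketch of the RFC 4180 convention (quote any field that contains the delimiter or quotes, and double the embedded quotes) using Python's standard csv module; the values are made up for illustration:

    import csv, io

    rows = [["id", "comment"],
            ["1", 'She said "hi", then left'],   # contains a comma and quotes
            ["2", "plain value"]]

    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    print(buf.getvalue())
    # 1,"She said ""hi"", then left"   <- comma and quotes escaped per RFC 4180

    # Round-trip: the reader undoes the quoting.
    print(list(csv.reader(io.StringIO(buf.getvalue()))))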
Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
They're not getting it wrong, they're just assuming a different variant.
There is no "standard" for CSV. Yes, there's an RFC, published in 2005, about 30 years after everyone was already using CSV. That's too late. You can't expect people to drop all compatibility just because someone published some document somewhere. RFC 4180 explicitly says that "it does not specify an Internet standard of any kind", although many people do take it as a "standard". But even if it did call itself a standard: it's still just some document someone published somewhere.
They should have just created a new "Comma Separated Data" (file.csd) standard or something instead of trying to retroactively redefine something that already exists. Then applications could add that as a new option, rather than "CSV, but different from what we already support". That was always going to be an uphill battle.
Never mind that RFC 4180 is just insufficient by not specifying character encodings in the file itself, as well as some other things such as delimiters. If someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.
RFC 4180 says it "documents the format that seems to be followed by most implementations" and in practice I find that to be true, though my CSVs don't interact with a lot of very old software. You get very far by treating "RFC 4180, UTF-8" as a standard and considering every implementation that doesn't follow it to be broken. I'm not sure I have ever seen software that simultaneously doesn't follow the RFC, but does consistently support escaping.
It's in the standard library for Python, Rust, Julia, and maybe some other languages. It's also widely used in those ecosystems (pyproject.toml, cargo.toml). I think it's fair to say it took off, even though YAML is also popular.
> someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.
Why? We have xlsx for the office crowd and arrow for the HPC crowd. In no universe does anyone actually have to invent another tabular data format using delimiters.
Neither is a universal replacement for CSV. They're not even text formats (well, technically xlsx is if you extract the XML from the zip, but practically: not really). The article already explains why, as the title says, "CSV is still king": it's simple, it's used all over the place, it's universal, and it's more or less human-readable.
I can't tell you how to run your business, but subscriptions for offline apps aren't going to be popular here.
Charge me more upfront for a perpetual license, or just version the software. Say $40 today for v3, and every year charge a reasonable fee to upgrade, but let me keep using the software I purchased...
I recently saw a license that was based on a monthly subscription, but once you paid for a year you got a perpetual license to the version you started with. Every year, your perpetual license was updated to the next year's version. I find that to be a reasonable middle ground.
Thank you for your feedback. I think your opinion is super valid here.
I've been thinking about pricing, and a lot of people did complain about it. However, many people expense their software cost, so they don't mind the yearly subscription.
I'm improving the pricing right now and a perpetual license is what I'm going with.
> Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
Looks like SQL is the main selling point for your tool. For other simpler needs, Modern CSV [1] seems suitable (and it’s cheaper too, with a one time purchase compared to a yearly subscription fee). But Modern CSV does not support SQL or other ways to create complex queries.
It would be more useful if every RFC had a test suite of input/output and input/error.
Yes, those are potentially infinite, but a core set would be useful. As ambiguities come up, publish an addendum for clarification, and eventually, as the exceptions accumulate, a version step.
I don't understand how anyone can write a spec without concrete examples of pass/fail in their head. Perhaps there could be an informal example/counterexample syntax for those writing RFCs, which could be extracted into the 1.0 test suite.
The test suite must be a single open source repo, that accumulates acceptable edge cases until the relevant informed adults can make a call about revising the spec.
There has to be one approved, sanctioned, well-known and monitored test suite repo. It cannot be shrugged off into a free-for-all that makes it impossible to find a single canonical test suite. The interwebs are big and conflicted.
See Imre Lakatos 'Proofs and Refutations' for how this evolves.
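As a concrete sketch of what such a suite could look like for CSV: a plain table of input/expected pairs that any implementation under test can be driven against. Everything below is invented for illustration, with Python's own parser standing in as the implementation under test:

    import csv, io

    # (name, raw input, expected rows)
    CASES = [
        ("simple",           'a,b\n1,2\n',          [["a", "b"], ["1", "2"]]),
        ("quoted comma",     '"a,b",c\n',           [["a,b", "c"]]),
        ("doubled quote",    '"say ""hi"""\n',      [['say "hi"']]),
        ("embedded newline", '"line1\nline2",x\n',  [["line1\nline2", "x"]]),
    ]

    def run(parse):
        for name, raw, expected in CASES:
            try:
                got = parse(raw)
            except Exception:
                got = "ERROR"
            status = "ok" if got == expected else f"FAIL (got {got!r})"
            print(f"{name:18} {status}")

    run(lambda raw: list(csv.reader(io.StringIO(raw))))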
RFCs sometimes have pseudocode. It would be nice to have a "pseudocode translator" that translates it to some actual programming language.
With few exceptions, I have given up on documentation, whether it is specifications or software. Now I just read source code instead.
I think in the 60s and 70s documentation used to be better and focused more on input/output. For example, I still use SPITBOL and Icon.
Maybe it is a controversial view, but I fail to comprehend how any RFC can be considered a "specification". In truth an RFC is only a "proposed specification" at best, literally a "request for comments". (Where are the comments?) In fact, often RFCs simply document some internet practice that already exists. (Meanwhile the number of "BCPs" is relatively small.) RFCs can be anything.
I agree about markdown, but the only awkward implementation issue is nested syntax: what markup is parsed inside various other outer markup forms?
Italic headings? Bold links? Nested lists - how many levels? Code in list? How do paragraphs interact with lists? There are many opinions and many leaky implementations of those opinions. Newlines? Embedding HTML in Markdown !?!?
It all seems so sad, because (X)HTML nailed most of these issues a very long time ago. But HTML implementations were sloppy from the outset. And XML was born with inherited bloat, then got ever more complex over time (modular specs, XLink, XPath, XSLT, DTD -> XML Schema, ...)
With Markdown, it is relatively easy to introduce some recursion into the parser, but for what spec? In what contextual cases? At what cost?
It is possible to just treat commas as whitespace. It makes implementation so much easier. It accepts missing, trailing and repeated commas. It makes elements uniform. It ignores many common errors that arise from typos or cut'n'paste. It makes JSON writers simpler, by removing the first/last special case.
A JSON parser that treats commas as whitespace can be two dozen lines in most programming languages - if you do not want line/column, chapter and verse, for the remaining error messages.
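Here is a sketch of that idea in Python, a bit over two dozen lines once comments are included (string unescaping is simplified and there is no line/column error reporting), just to illustrate how small the parser gets once commas are lumped in with whitespace:

    import re

    # Commas are lexed together with whitespace, so ",,[1 2,3,]" style input is fine.
    TOKEN = re.compile(
        r'[ \t\r\n,]+|([{}\[\]:])|("(?:\\.|[^"\\])*")'
        r'|(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)|(\w+)')

    def tokenize(text):
        pos = 0
        while pos < len(text):
            m = TOKEN.match(text, pos)
            if not m:
                raise ValueError(f"bad input at {pos}")
            pos = m.end()
            if m.lastindex:                # skip pure whitespace/comma runs
                yield m.group(m.lastindex)

    def parse(text):
        value, rest = parse_value(list(tokenize(text)))
        if rest:
            raise ValueError("trailing garbage")
        return value

    def parse_value(toks):
        head, rest = toks[0], toks[1:]
        if head == '{':
            obj = {}
            while rest[0] != '}':
                key, rest = parse_value(rest)
                assert rest[0] == ':'
                obj[key], rest = parse_value(rest[1:])
            return obj, rest[1:]
        if head == '[':
            arr = []
            while rest[0] != ']':
                val, rest = parse_value(rest)
                arr.append(val)
            return arr, rest[1:]
        if head.startswith('"'):           # simplified: only \" and \\ handled
            return head[1:-1].replace('\\"', '"').replace('\\\\', '\\'), rest
        if head in ('true', 'false', 'null'):
            return {'true': True, 'false': False, 'null': None}[head], rest
        return (float(head) if '.' in head or 'e' in head.lower() else int(head)), rest

    print(parse('{"a": [1, 2,, 3,], "b": "x,y",}'))   # commas are just noise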
Just use TSV. Commas are a terrible delimiter because many human strings have commas in them. This means that CSV needs quoting of fields and nobody can agree on how exactly that should work.
TSV doesn't have this problem. Without any quoting it can represent any string that doesn't contain a tab or a newline, which covers far more real-world strings than unquoted CSV can.
It's 2024 and Excel still doesn't natively parse CSV with tabs as delimiters. When I send such csv files to my colleagues, they complain about not being able to open them directly in Excel. I wish Excel could pop up a window like LibreOffice does to confirm the delimiter before opening a csv file.
Excel does not support any delimiter natively, since it's region-dependent.
I ended up saving my mental health by supporting two different formats: "RFC csv" and "Excel csv". For Excel you can, for example, use a sep= hint (e.g. sep=;) at the beginning of the file to get the delimiter to work consistently. The sep= annotation obviously breaks parsing for every other CSV parser, but that's why there is the other format.
Also, there might be other reasons to mess with the file to get it to open correctly in Excel, like date formats or adding a BOM to get it recognized as UTF-8, etc. (Not quite sure whether the BOM was for Excel or for some other software we used to work with.)
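A sketch of the "Excel csv" variant described above, in Python (the sep= hint and the UTF-8 BOM are real Excel behaviours; the file layout and values here are just an illustration):

    import csv

    rows = [["name", "amount"], ["Müller", "1,50"]]

    # "Excel csv": UTF-8 BOM so Excel detects the encoding, plus a sep= hint on the
    # first line so Excel uses ';' regardless of locale. The hint line breaks most
    # non-Excel CSV parsers, hence the second, plain variant.
    with open("report_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
        f.write("sep=;\n")
        csv.writer(f, delimiter=";").writerows(rows)

    # "RFC csv": plain UTF-8, comma-delimited, no hint line.
    with open("report_rfc.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)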
I also use sep= annotation.
That is not documented ANYWHERE by Microsoft
I assume one of the devs mentioned this on a mailing list sometime in the nineties and it has found its way around.
Still... shame on Microsoft for not documenting this and perhaps other annotations that one can use for Excel.
"Delimiters Select the character that separates values in your text file. If the character is not listed, select the Other check box, and then type the character in the box that contains the cursor."
Maybe they should know their tools better instead of just double-clicking and hoping for the best.
It's 2024 and people still haven't realized that Excel does not and never will support opening CSV files. The closest thing it allows you to do is import data from a CSV file into your current spreadsheet, but open a CSV file? It will never do that. Stop using CSV for excel, just generate .xlsx files like everyone else.
Not in every version. I recently found out that Excel doesn't recognize commas as separators in a comma-separated-values file on my coworkers PCs.
I presume it's because Germany uses the comma as a decimal separator instead of a dot.
I eventually settled on just exporting Excel files because I couldn't get both the encoding and the separator to work at the same time.
Another fun story is that a coworker lost data, when they opened a csv, wrote data to a second sheet, and then saved it. A sane program would probably have brought up a save-as window. Excel didn't. It just discarded the second sheet.
If Excel is set to handle the .csv extension, then attempting to open a .csv file correctly launches Excel and imports it. Fine for read-only use, but if you want the data back out you have to force matters; it's not automatic.
Feature-wise, Excel probably still has more options, but in terms of ergonomics, Google Sheets is much better. And I'm saying this as someone who has used Excel for 20 years.
Here are a few specific examples:
1. Editing formulas using the keyboard only is a nightmare in Excel. It often randomly throws errors and warnings when I move the cursor around (like typing parentheses or quotes first and then trying to move back to type text inside, etc.) before finishing editing.
2. Conditional formatting in Excel is so non-intuitive that I actively try to avoid it like the plague. Yet, I use it extensively in Google Sheets because it is so easy to create multiple rules there.
3. The whole copy/paste design choice in Excel is, in my opinion, weird. Firstly, there is a distinction between copying a cell and copying text: if you copy an entire cell, you cannot paste it as text in a formula or any other input area. You have to copy from the formula bar of that cell. Even for pure cell copying, the cells have to remain highlighted. If you copy a cell and then unselect it (by pressing Esc or trying to edit any cells), the copied content is lost. I'm sure there are reasons it's designed this way, but it's so irritating, and I never find any benefit.
Google Sheets indeed does everything a common user would expect from a spreadsheet, without having to install anything and fiddle with licenses. This in itself is the killer feature.
For me personally the absolute killer feature is the bidirectional integration with BigQuery, something you won’t get that easily (if at all, correct me if I am wrong) with Excel.
As a long-time Office user since the Windows 3.1 days, I think Google Sheets still has a lot of ground to cover.
Quattro Pro could easily do all of the Sheets stuff except document collaboration, ignoring the fact that Sheets is a web application and Quattro Pro started on MS-DOS.
If you go with the CSV convention of two adjacent tabs => blank cell in the middle, then rows of different length will not line up properly in most text editors. And "different length" depends on the client's tab width too
If you allow any amount of tabs between columns, then you need a special way to signify an actually-blank column. And escaping for when you want to quote that
If you say "use tabs for columns and spaces for alignment", then you've got to trim all values, which may not be desirable
You’re talking about issues with alignment when data is displayed on a terminal or text editor, which is not at all related to data exchange.
In data exchange nobody ever allows multiple tabs between columns. If there are multiple tabs with nothing in between it means the column is empty for that row.
Just like CSV, TSV is always a pain to edit manually, so the issues there are the same. Using tabs does have a lower likelihood of conflicting with the actual data.
This is true, but I'd assumed that one of the major reasons to use TSV is for human readability. If not, then I'd personally choose an even rarer character as my delimiter
Because nobody made keyboards with those keys. Had they stuck with a 'next unit' and 'next record' key pair that sent them, we'd all be fine, but instead we got overly redundant text-editing keys rather than keyboards better suited to data entry.
I tend to prefer that over CSV as well. But usually I go for ndjson files since that's a bit more flexible for more complex data and easier to deal with when parsing. But it depends on the context what I use.
However, a good reason to use TSV/CSV is that import/export in spreadsheets is really easy. TSV used to have an obscure advantage: Google Sheets could export it but not CSV. They've since fixed that and you can do both now.
And of course, getting CSV out of a database is straightforward as well. Both databases and spreadsheets are of course tabular data; so the format is a good fit for that.
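For example, a minimal dump of a query result to CSV with Python's standard library (the database, table, and column names here are made up):

    import csv, sqlite3

    conn = sqlite3.connect("app.db")
    cur = conn.execute("SELECT id, name, email FROM users ORDER BY id")

    with open("users.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header from cursor metadata
        writer.writerows(cur)                                  # csv module handles the quoting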
Spreadsheets are nice when you are dealing with non-technical people. They make it easier to involve them in editing / managing content. Also, a spreadsheet is a great substitute for admin tools to edit this data. I was once on a project where we paid some poor freelancer to work on some convoluted tool to edit data. In the end, the customer hated it and we unceremoniously replaced it with a spreadsheet (my suggestion). Much easier to edit stuff with those. They loved it. The poor guy worked for months on that tool with the help of a lot of misguided UX, design, and product management. It got super complicated and was tedious to use. Complete waste of time. All they needed was a simple spreadsheet and some way to get the data inside deployed. They already knew how to use those, so they were all over it.
That's because anyone can easily type a tab character with their keyboard. No one ever remembers the key combination for those special ASCII characters.
- If I send someone a spreadsheet, they'll open it with a spreadsheet application; Excel, LibreOffice, whatever.
- If I send someone a CSV file, they'll want to open it with a text editor.
Ack, no! Open it with a spreadsheet app, or load it into SQLite, or, best of all, open it with VisiData or some kind of editor designed for tabular data.
Actually no - spreadsheets classically choose their own way to interpret CSVs, that's the classic way to get your client to continue to send you support requests.
There's a reason so many tools export to xls instead of csv.
And if you just double-click a csv file to open it in Excel (rather than importing it from within Excel), Excel will happily corrupt your data: trim leading zeros in IDs, round large integers, not spot that the date column is in US rather than European format, etc.
Excel will happily ignore it if you use tabs as the delimiter and show all the columns as a single column. At least LibreOffice tries to figure out the delimiter and confirms it before opening a CSV.
While that's true, the way text editors handle these characters is not standardized, and many may not let you input them. One of the important features of CSV/TSV is that they're relatively easy to edit by hand, and for that you need separator characters that are easy for both text editors and humans to work with.
Personally, since I've discovered the field/group/record/file separator characters in ASCII, I've been using them to concat fields and rows on one-to-many SQL joins. They work great for that purpose since (at least on all the projects I've done this with so far) I can be certain that none of the values in the joined data will have those characters, so no further escaping is necessary. For example, in MySQL:
SELECT
i.item_id,
GROUP_CONCAT(CONCAT_WS(0x1F, f.field_id, f.field_value) SEPARATOR 0x1E) AS field_values
FROM items i
LEFT JOIN fields f ON f.item_id = i.item_id
WHERE ...
Then split field_values on 0x1E to get each field ID / field value pair, and split each of those on 0x1F. Easy as pie.
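A sketch of that client-side split in Python, following the column naming in the query above:

    def parse_field_values(field_values):
        """Split the GROUP_CONCAT result back into {field_id: field_value}."""
        result = {}
        if not field_values:                      # LEFT JOIN can yield NULL/empty
            return result
        for pair in field_values.split("\x1e"):   # 0x1E = record separator
            field_id, _, value = pair.partition("\x1f")   # 0x1F = unit separator
            result[field_id] = value
        return result

    print(parse_field_values("42\x1fred\x1e43\x1flarge"))
    # {'42': 'red', '43': 'large'}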
I wish binary length-prefixed formats would've become more common. Parsing text, and especially escaping, seems to be a continual source of bugs and confusion. Then again, those who don't implement escaping correctly may also overlap with those who can't be bothered learning how to use a hex editor.
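As a sketch of how simple such a format can be (the layout below is invented for illustration, not any existing standard): each record stores a field count, then each field as a 4-byte little-endian length followed by raw UTF-8 bytes, so nothing ever needs escaping.

    import struct

    def encode_record(fields):
        """Field count, then each field as 4-byte LE length + raw UTF-8 bytes."""
        out = [struct.pack("<I", len(fields))]
        for field in fields:
            data = field.encode("utf-8")
            out.append(struct.pack("<I", len(data)))
            out.append(data)
        return b"".join(out)

    def decode_record(buf):
        (count,) = struct.unpack_from("<I", buf, 0)
        pos, fields = 4, []
        for _ in range(count):
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            fields.append(buf[pos:pos + length].decode("utf-8"))
            pos += length
        return fields

    # Commas, quotes, tabs, newlines: none of it needs escaping.
    record = encode_record(['say "hi"', 'a,b\tc\nd'])
    print(decode_record(record))   # ['say "hi"', 'a,b\tc\nd']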
CSV comes from a world in which the producer and consumer know each other; if there are problems they talk to each other and work it out.
There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.
If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From these kind of folks, I'd like something other than CSV.
But the big guy most likely exports the .csv correctly in the first place, you don't *need* to work with them.
Only once have I seen a bad .csv from a "big" company--big fish in a small pond type big. We were looking to get data out; hey, great, .csv is a valid export format. I'm not sure exactly what was in that file, but it appeared to be the printout with some field info attached to each field (put this at that location on the paper etc., one field per line). Every output format it has is bugged in some scenario.
I fully agree that CSV is king and am quite happy about it. But the comma character was probably one of the worst choices they could make for the "standard", IMHO of course.
Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.
Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.
I wasn't around at the time, but surely ASCII was (even if not ubiquitous)? Is there any particular reason that the FS/GS/RS/US (file/group/record/unit separator) characters didn't catch on in this role?
I did an ETL project years ago from a legacy app that used these delimiters. It was gloriously easy. No need to worry about escaping (as these characters were illegal in the input). It's a shame they didn't catch on.
I actually just finished a library that adds proper typed parsing and works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.
CSV is king because most ETL department programmers suck. Half the time they can't generate a CSV correctly. Anything more complicated would cause their tiny brains to explode.
I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.
A lot of data that I see in CSV "format" would work fine as tab-delimited and wouldn't need any escaping (because most of the data I see doesn't allow literal tabs anyway). That would be a simple improvement over CSV.
I'm surprised that the article and the comments failed to mention pipe-delimited files. I work with almost two dozen different vendors (in healthcare) and 90% use pipes. Doing data exchange with a variety of delimiters is so common that I just built out a bespoke system for taking in a set of common configurations and parsing the information. Other settings include line endings, encoding, escape characters, whether the header is included, etc.
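A sketch of that kind of configuration-driven ingestion using Python's standard csv module (the settings mirror the ones mentioned above; the names and defaults are made up):

    import csv
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FeedConfig:
        delimiter: str = "|"              # pipes are common in healthcare feeds
        quotechar: str = '"'
        escapechar: Optional[str] = None  # some vendors backslash-escape instead of quoting
        encoding: str = "utf-8"
        has_header: bool = True

    def read_feed(path, cfg):
        with open(path, newline="", encoding=cfg.encoding) as f:
            reader = csv.reader(f, delimiter=cfg.delimiter,
                                quotechar=cfg.quotechar,
                                escapechar=cfg.escapechar)
            rows = list(reader)
        header = rows.pop(0) if cfg.has_header and rows else None
        return header, rows

    # One vendor ships pipes with a header, another tabs with no header in latin-1.
    vendor_a = FeedConfig(delimiter="|")
    vendor_b = FeedConfig(delimiter="\t", has_header=False, encoding="latin-1")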
I prefer ndjson for systems I build (with only JSON objects at the top level). It's much safer for a lot of edge cases. If there's significant repetition in the keys, the files end up zipping well.
As the article says, it will be interesting to see if NDJSON becomes more popular. Although it's a bit more difficult to parse and makes for larger files than CSV, it is less ambiguous.
Indeed, I created my own tool to preview and adjust CSV files before viewing: https://csvonline.newbeelearn.com/csvdemo . It's not ready yet and would probably not work for large files, but it works well enough for CSVs with appended data that screws up the formatting.
Wow! It does exactly what I want :-) What are the odds of that! I tested with a bank file where the CSV starts after some lines, and I was able to read it after a bit of fiddling with the configure button. What is the theming demo doing with CSV, by the way?
The site says "something went wrong" just one second AFTER it successfully displayed the content. Something is so wrong that it had to withdraw the content from the user...
Use JS only to enhance UX!
I routinely interface with 1GB+ CSVs. The size explosion for JSON would be huge. Disk IO aside, I assume a JSON parser is going to be slower than a CSV parser.
Nobody does this currently. You have now created another bespoke format. If I am going to need a custom parser/writer, I might as well lean on a binary format that has far stronger properties than a text based one.
JSONL is a pretty common format. It makes sense for logs and anything else written incrementally.
JSON parsers are super common. They are simpler and faster than CSV parsers because the format is more regular. JSONL is simple to implement because you write record by record and read line by line.
The only difference from CSV is the bracket characters around each line, and that every string has quotes. The benefit is clear escaping rules, including for newlines.
JSONL is standard. Upthread said to write the header row and then make subsequent rows arrays, and I am not aware of anything that does this currently.
My objection to JSONL was about the increase in file size owing to repeating the keys.
JSON can write arrays in addition to hashes. JSON arrays are nearly identical to CSV rows. The only difference is brackets around lines. There is no extra space wasted on keys.
CSV is explicitly about tabular data. JSON (including JSON5) is much more flexible. Flexibility can be great but can also be annoying. If you want tabular data, then a system that enables nesting isn't great.
You would write JSON arrays without names for tabular data. I don't know if there is a standard way to do the header, but an array of names would work. Or a JSON Schema record.
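A sketch of that convention (header as a JSON array on the first line, one data array per subsequent line; this is just one way to do it, not a standard):

    import json

    rows = [["id", "name", "note"],
            [1, "Ada", 'says "hi", then leaves'],
            [2, "Bob", "line1\nline2"]]

    # Write: one JSON array per line; quoting of commas, quotes and newlines is JSON's problem.
    with open("table.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Read: header from the first line, data rows from the rest.
    with open("table.jsonl", encoding="utf-8") as f:
        header = json.loads(next(f))
        records = [dict(zip(header, json.loads(line))) for line in f]

    print(records[0])   # {'id': 1, 'name': 'Ada', 'note': 'says "hi", then leaves'}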
Rather than highlighting flexibility as the differentiator, I would say: CSV is for dense data, JSON is for sparse data. They are flexible in different ways. For example, CSV is very flexible when renaming a column title.
I don’t think CSV became king because “,” is a great delimiter (obviously it is not), it became king because it is an easy and logical separator _to most people_. Yeah it’s infuriatingly dumb from a technical standpoint. All the points here that tabs or ascii separators are superior are of course correct. I honestly respect it for how ubiquitous it became WITHOUT having a standard. Still going to curse when I have to deal with a broken one though.
I think we just need someone to get fed up and simply tackle the list of well-known problems with CSV.
What we need is,
- A standard (yeah, link xkcd 927; it's mentioned enough that I can recall its ID) to be announced **after** the rest of the pieces are ready.
- Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.
- Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools can deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).
The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata since most of the times there's a single reasonable way to join tables.
This seems too much work to get right since the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.
Most major languages have decent libraries, frameworks and tools for dealing with CSV. Those tend to have lots of tests for all the well known issues and edge cases. Especially in the python world, which is used for a lot of data processing, tooling is not really an issue. But most other languages also have decent frameworks. Most of that stuff covers the few standards that exist for this, the well known variants of the format that are out there (quite a few) and can deal with the quirks of those.
The only time people get in trouble with CSV is when they skip using those tools, hack something together, and then get it wrong.
> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them
There's no need for new stuff. It would be redundant as there are several things already that do these things. Adding more isn't helpful. The problem is most of the stuff that supports CSV tends to support none of those things and fixing a lot of ancient systems to retrofit them with e.g. parquet support or whatever is a mission impossible. CSVs principle feature is that it is simply everywhere. That's hard to replicate. People have been trying for decades.
> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata since most of the times there's a single reasonable way to join tables.
Parquet fits the bill here. It's not perfect (there is no perfect file format), but it's a practical compromise as of today, at least for new systems where a columnar format is appropriate. There are some columnar formats that are better in some aspects (like ORC and some proprietary formats) but they're not as widely supported.
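For instance, a minimal round-trip with pandas (which uses pyarrow or fastparquet under the hood; the file name and values are illustrative). Unlike CSV, the column types and metadata travel with the file:

    import pandas as pd

    df = pd.DataFrame({
        "state": ["US-CO", "US-CA"],
        "population": [5_800_000, 39_000_000],                   # stays an integer column
        "updated": pd.to_datetime(["2024-01-01", "2024-01-02"]), # stays a datetime column
    })

    df.to_parquet("states.parquet")       # types, compression and column stats included
    back = pd.read_parquet("states.parquet")
    print(back.dtypes)                    # int64 / datetime64 preserved, no re-parsing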
It's not that CSV/TSV is bad in every situation, but more that CSV/TSV is overused for things it shouldn't be used for. (CSV is good as a tabular format for simple applications, very bad as the storage format for data lakes or anything you want to query, questionable as a data exchange format, and okay as a semi-structured format for structurally simple data -- many open data platforms offer it as a download format and it generally works.)
To get a sense of how much variation a CSV reader needs to handle, just look at the number of arguments Pandas' read_csv takes. And it still fails on some CSVs! (I've had to preprocess CSVs before pd.read_csv would work.)
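A few of those arguments, each of which exists only to paper over some real-world CSV variant (the file name and column values here are illustrative):

    import pandas as pd

    df = pd.read_csv(
        "export.csv",
        sep=";",                # European exports often use ';' because ',' is the decimal mark
        decimal=",",            # ...and ',' inside numbers
        quotechar='"',
        escapechar="\\",        # some producers backslash-escape instead of doubling quotes
        encoding="cp1252",      # legacy Windows exports
        dtype={"customer_id": str},       # keep leading zeros instead of parsing as int
        na_values=["", "NULL", "N/A"],
        skiprows=2,             # junk preamble lines before the real header
        on_bad_lines="skip",    # rows with the wrong number of fields
    )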
CSV is not king, but it is popular. But popularity doesn't mean it's good for every use case. Optimizing for human readability and easy generation means trading off other very important characteristics (type safety, legibility across different tooling, random access performance, reliability/consistency).
You can't do anything about legacy systems, but when designing a new system, you should really ask yourself: is CSV really the right choice?
(With DuckDB, the answer for me is increasingly no)
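For example, DuckDB will sniff a messy CSV and move the data into a typed format in a couple of lines (the paths and the column name are made up):

    import duckdb

    # Sniffs delimiter, quoting, header and column types, then lets you query with SQL.
    duckdb.sql("SELECT state, count(*) FROM read_csv_auto('export.csv') GROUP BY state").show()

    # One-off conversion to a typed, compressed, columnar file for everything downstream.
    duckdb.execute("COPY (SELECT * FROM read_csv_auto('export.csv')) TO 'states.parquet' (FORMAT PARQUET)")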
Yeah, I used it about 7-8 years ago. I liked the idea of chaining things, but it's very clear that csv has not been holding up well in the past decades.
Also, what I have in mind for file sharding probably needs a standard on top of a record/column file format. The successor to CSV should be easy to process in parallel.
Exchanging information between different data formats is one of the biggest problems I've experienced in computing and IT and it's been thus from the earliest days.
Having so many formats is confusing, inefficient and leads to data loss. This article is right, CSV is king simply because it's essentially the lowest common denominator and I, like most of us, use it for that reason—at least that's so for data that can be stored in database type formats.
But take other data such as images, sound and AVI, and even text. There are dozens of sound, image and other formats. It's all a first-class mess.
For example, we fall back to the antiquated, horrible JPG format because we can't agree on better ones such as, say, JPEG 2000; there are always excuses why we can't, such as speed, data size, inefficient algorithms, etc.
Take word processing for instance: why is it so hard to convert Microsoft's confounded, nasty DOC format to, say, the OpenDocument ODT format without errors? It's almost impossible to get the layout in one format converted accurately into another. Similarly, information is lost converting from lossless TIF to, say, JPG, or from WAV to MP3, etc. What's worse is that so few seem to care about such things.
Every time a conversion is done between lossless formats and lossy ones entropy increases. That's not to say that shouldn't happen it's just that in isolation one has little or no idea about the quality of the original material. Even with ever increasing speeds, more and more storage space so many still have an obsession—in fact a fetish—of compressing data into smaller and smaller sizes using lossy formats with little regard for what's actually lost.
It's not only in sound and image formats where data integrity suffers over convenience, take the case of converting data fields from one format to another. How often has one experienced the situation where a field is truncated during conversion—where say 128 characters suddenly becomes 64 or so after conversion and there's no indication from the converter that data has actually been truncated? Many times I'd suggest.
Another instance, is where fields in the original data don't exist in the converted format. For example, data is often lost from one's phone contacts when converted from an old phone to a new one because the new phone doesn't accommodate all the fields of the old one.
Programmers really have a damn hide for not only allowing this to occur but for not even warning the poor hapless user that some of his/her data has been lost.
That programmers have so little regard and consideration for data integrity is, I reckon, a terrible situation and a blight on the whole IT industry.
Why doesn't computer science take these issues more seriously?
>Why doesn't computer science take these issues more seriously?
Simple: cost. A company is not going to approve any project to move to a new standard. Plus you have new hires coming in with their favorite "standard of the day" and using it no matter what they are told.
Management only cares about the end result (i.e. the bottom line), not how it got there.
That lack of consideration for users' data will ultimately lead to regulation. Much of a user's data is only machine-readable, so ordinary users shouldn't be expected to know when their data is truncated after say data conversion. They aren't responsible for realizing their data is corrupted long after the event and past the point where it can be corrected.
It's like everything else, originally there's the Wild West days when everything's a free-for-all, but regulations eventually kick in after the harm done is considered unacceptable. We've seen regulations introduced everywhere else, from foods—pure food acts, pharmaceutical—FDA, transport—NTSB, Water purity standards and so on. So eventually computing/IT will be no exception.
Unfortunately, computing/IT is still in the 'Wild West' days. Personally, I can hardly wait for those enforced regulations to become effective.