CSVs Are Kinda Bad. DSVs Are Kinda Good (matthodges.com)
107 points by hieronymusN 4 months ago | 127 comments



The article talks about reading and parsing CSV data of unknown variants, but then jumps to the solution of using a different format altogether. But you can only switch to a different format if you are producing data, not if you are reading it!

And if you are in control of producing data, just produce strict RFC 4180-compliant CSV data and everybody will be able to read it just fine. There is no need to make your readers' lives difficult by using yet another non-standard data format.

See also: <https://news.ycombinator.com/item?id=39679753>


I just had a look at RFC 4180. This is the grammar they suggest:

> file = [header CRLF] record *(CRLF record) [CRLF]

I find it kind of wild that you have to have at least one record. Suppose I have a program that lists the events that occurred on a given day. How do I represent the fact that the program ran successfully but that there weren't any events on that day?


Event: "Logging started". Additionally, a "Logging ended (for the day)" event lets you check for intermediate crashes and startup errors, kinda like a pulse / health check.

Not necessarily how I'd do it intuitively, but doesn't seem that crazy


Easy, count running the report as an event.


That isn't a problem of the file format. That's a problem of your process.


> yet another non-standard data format.

Tbf, ASCII delimiter characters have been around since the 1960s. They're not exactly reinventing the wheel


It’s the classic ‘self driving is easy when we modify the world to be what we want’ type solution.


"Politics is easy, the problem is we aren't working together!"


SQLite isn't a standard per se, but outputting an SQLite db file, if you're writing the export code, is easy enough, and enough of a de facto standard that I dare say you'd be doing fine to output .db files marked as SQLite.
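For example, a sketch of such an export using Python's built-in sqlite3 module (the table name and columns here are made up, not from any particular tool):

    import sqlite3

    rows = [("Alice", 'She said, "Hello"'), ("Fred", "Line one\nLine two")]

    con = sqlite3.connect("export.db")
    con.execute("CREATE TABLE IF NOT EXISTS events (name TEXT, note TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?)", rows)
    con.commit()
    con.close()

Consumers then open the same file and query it; there are no quoting or escaping rules to agree on.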


[flagged]


Don't use broken tools. The key phrase is "in control of producing data". If you're forced to use Excel, then it's not really you in control, is it?


>> […] Excel […]

> Don't use broken tools.

Tell that to your accounting and finance department and let us know how the message is received.

> If you're forced to use Excel, then it's not really you in control, is it?

In which case the up-thread's advice to "just produce strict RFC 4180-compliant CSV data" is worthless. "Just."

We're stuck with whatever CSVs we get, so 'just' doing X is not an option.


The up-thread’s comment (emphasis mine):

> And *if you are in control of producing data*, just produce strict RFC 4180-compliant CSV data

The point of the comment was that you likely aren’t in control of producing the data, so the article’s recommendation of using an entirely different format is likely also invalid. I’m not sure what you are arguing against as you seem to actually agree with them.


From a pragmatic viewpoint, the files that I get from finance (usually saved as .xlsx) have the same parsing issues as a CSV. But since the issues are consistent, I can automate conversion from .xlsx to CSV, then process the CSV using awk to eliminate errors in further parsing (for import, analysis, etc.). Sure, I'm essentially parsing the CSV twice but, because the parsing issues are consistent, I can automate the process and keep it efficient.

Obviously that wouldn't work for CSVs with different structures, but can be effective in the workplace in certain scenarios.


As long as a human didn't generate the file, all things can be automated.

However, if you ever have the misfortune of dealing with human generated files (particularly Excels) then you will suffer much pain and loss.

I once had to deal with a "CSV" which had not one, not two but 6(!) distinct date formats in the same file. Life as a data scientist kinda sucks sometimes :shrug:.


Before 2010 and UTF-8 everywhere, I regularly had the misfortune of dealing with multi-encoding CSVs. Someone got CSVs from multiple sources and catted them together. One source used ISO 8859-1, another -15, another UTF-8; sometimes Greek or Russian or even EBCDIC was in there. Fun trying to guess where one stopped and the other began. Of course, none of them were consistent CRLF- or escape-wise.


This is some next-level response where Excel is called a "broken tool". You may not agree with its choices or design or anything else, but calling the entire product broken isn't making a strong case for the prior point.


“Just be in control” is unfortunately bad advice when you’re … not in control.


I don't think the advice is "Just be in control" as much as "Acknowledge you're not in control most of the time."


Importing data generated from Excel: don't, or force RFC 4180-compliant CSV data.

Exporting data into Excel: provide RFC 4180-compliant CSV data, or just generate minimal XLSX files.

Most Excel users generally don't export to CSV (or practice any data sanity conventions); they seem to believe XLSX is a perfectly fine data exchange format for automated use. ("Oh, the import broke? I just added an empty row above the headers, because it looked sloppy.") Those that do understand that automated data processing means being stricter in the sheet you are exporting tend to understand how to export proper CSVs from Excel as well.
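For the "generate minimal XLSX" route, a sketch with openpyxl (the library choice, headers, and path are illustrative assumptions):

    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(["name", "note"])                            # header row
    ws.append(["Alice", 'She said, "Hello", and waved.'])
    wb.save("report.xlsx")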


>Those that do understand that automated data processing means being stricter in the sheet you are exporting tend to understand how to export proper CSVs from Excel as well

exactly right


Do you have any recommendations or references about getting data properly out of Excel? I usually avoid using Excel altogether if possible, but obviously that is not always an option.


As someone who has Excel open all day, every day: I don't think Excel-employees are "producing" data. We are ingesting CSVs/XLSXs, performing modeling/analysis, and then saving as an XLSX.

I don't know anyone who is saving a CSV unless it is the final model output and another system (e.g. TM1) can only ingest the CSV.

If I accidentally save a spreadsheet as a CSV it is a bad day since I probably lost all my formatting, formulas, and additional tabs.


> Excel

If you’re deep in the Excel world, chances are extremely high that you also have access to SSMS, which has a really, really good data import tool that makes short work of nasty CSV files. The output of this tool doesn’t even have to be SQL Server, it will use any ODBC driver you’ve got installed; you can send the data to Excel or even a new, properly formatted CSV.

And if you want a repeatable package, there is always SSIS.

Look, I would never recommend anyone jump into the Microsoft ecosystem. But when in Rome, do as the Romans do.


> If you’re deep in the Excel world, chances are extremely high that you also have access to SSMS,

Can you provide more information about this?

Like, I'd reinstall windows if this would actually work on messy excel/CSV data.


There is a screenshot here [1]. The trick is to choose the generic "import data" task rather than "import flat file". Then you have a wizard that lets you play around with the specifications, change things up, look at the errors, go back to tweak things, go forwards to try again, etc. The only improvement I can think of is if you could save what you did as a standalone SSIS package (or whatever).

[1] https://www.mssqltips.com/sqlservertutorial/9248/import-and-...


Can you get DSVs out of Excel? If not, it's kinda moot.


How does DSV solve Excel compatibility?


If you have to use Excel, you have to produce whatever data Excel produces. It all depends on what (explicit or implicit) contracts you have in place with whoever is to consume the data which you produce.


Most likely the easiest solution is to use a Python library to read the Excel file, and another to export RFC-compliant data.
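A sketch of that approach, assuming openpyxl for the reading half and the standard csv module for the writing half (file names are placeholders):

    import csv
    from openpyxl import load_workbook

    wb = load_workbook("export.xlsx", read_only=True, data_only=True)
    ws = wb.active

    with open("export.csv", "w", newline="", encoding="utf-8") as f:
        # The csv module's default quoting already matches RFC 4180; \r\n is the RFC line ending.
        writer = csv.writer(f, lineterminator="\r\n", quoting=csv.QUOTE_MINIMAL)
        for row in ws.iter_rows(values_only=True):
            writer.writerow(["" if v is None else v for v in row])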


Where does Excel fall down when exporting CSVs?


I can just not use Excel :-)


Looking it up, using a custom delimited format in Excel is near impossible https://superuser.com/questions/733462/can-ms-excel-use-non-...

So this solution is not going to work for Excel either.


> CSVs are kinda bad.

Not really.

What's bad is when people keep insisting on coming up with new and amazing CSV dialects.

https://www.ietf.org/rfc/rfc4180.txt is very clear about what CSV files are supposed to look like, and the fact that people keep ignoring this for whatever reason is not the format's problem.

And no, "using another format" is not a solution to this. Because: I can just invent a new DSV dialect. Or a JSON dialect. Or a dialect where the field separator is "0xFF00FF00" and the row separator is the string `DECAFCOFFEE` encoded in EBCDIC, all other characters have to be UTF-32, except for a, b and f, which also need to be EBCDIC encoded.

> For starters, it’s rather unreadable when opened in a text editor. But I bet you don’t really do that with your CSVs all that often anyway!

Wrong. I do that with csv files all the time. In fact I even have an amazing vim plugin just for them [0]. That's pretty much the point of having a plaintext tabular data storage format: That I can view and edit it using standard text wrangling utilities.

---

There is a much simpler solution to this problem: Don't accept broken CSV. If people keep ignoring standards, that's their problem.

[0]: https://github.com/mechatroner/rainbow_csv


"Broken" is a sliding scale, and it's unfeasible to refuse engaging at all times.

If you are a multi-billion dollar company creating a new integration, you can demand that your small supplier provide an RFC-4180 compliant file, and even refuse to process it if its schema or encoding is not conformant.

If you are the small supplier of a multi-billion dollar company, you will absolutely process whatever it is that they send you. If it changes, you will even adapt your processes around it.

TFA proposes a nice format that is efficient to parse and in some ways better than CSV; in other ways it is not. Use it if you can and if it makes sense.


I agree up to a point. It is a kind of tug-o-war, and yes, the weight of each side plays an important role there.

Nevertheless, even in projects where my services are talking to something bigger, I will at the very least ask "why can't it be RFC compliant? Is there a reason?". And without blowing my own horn overly much, quite a few systems larger than mine have changed because someone asked that question.


> https://www.ietf.org/rfc/rfc4180.txt is very clear about what CSV files are supposed to look like

Mm, not really. By its own admission, it is descriptive, not prescriptive:

> This section documents the format that seems to be followed by most implementations

And it came out in 2005, by which date CSVs had already been in use for some twenty or thirty years.


It doesn't matter when it came out, and it doesn't matter that it is descriptive. It is the standard, period.

Yes, CSV is much, much older. In fact it predates personal computers. And it went through changes. Again: None of that matters. We have a standard, we should use the standard, and systems should demand the standard.

Standards are meant to ensure minimal-friction interoperability. If systems don't enforce standards, then there is no point in having a standard in the first place.


Yes, but you could argue that web browsers shouldn't accept broken HTML either. But they do, and that's why there is so much broken HTML out there in the wild. Same with broken CSV -- basically, people's measure is "if Excel can read it correctly, it's fine", even if not every CSV library in every programming language can.


"This memo provides information for the Internet community. It does not specify an Internet standard of any kind."


Note the qualifier: “not an Internet standard” (my emphasis).


And again: None of that matters. I am not talking about formalities here, I am talking about technical realities.

Whether it is formally called a standard or not doesn't change the fact that this is the document everyone points at when determining what CSV is and is supposed to look like. So it is de facto a standard. Call it a "quasi standard" if that makes you happy.


Oh no; I agree with you completely. I just wanted to point out that the document does not disclaim being a "standard", it just says that it is not an "Internet standard".


My mistake, in that case, thank you :-)


> Don't accept broken CSV. If people keep ignoring standards, thats their problem.

From the very memo you link to (RFC 4180):

> Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files.


Oh, I am nothing but liberal when it comes to CSV: Clients get the liberty to either have their requests processed, or get a 400 BAD REQUEST

And yes, I am aware that the standard says this. My counter question to that is: How much client-liberty do I have to accept? Where do I draw the line? How much is too much liberty?

And the answer is: there is no answer. Wherever any system draws that line, it's an arbitrary decision; Except for one, which ensures the least surprise and maximum interoperability (aka. the point of a standard): to be "conservative", and simply demand the standard.


I think the suggestion reflected a deep understanding that transitioning from decades of wild-west to standardized in the smooth fashion most likely to succeed would require that strategy.

If you don’t accept whatever some org’s data is encoded with, they won’t consider it a win for standards, or swap out whatever is producing that data for something more compliant. They’ll consider it a bug, and probably use some other more flexible processor.

On the other hand, if you can be flexible enough to allow quirks on import while not perpetuating them on export, eventually you and other software built with the same philosophy standardize the field.

I do think there’s a point where things are standardized enough that you can safely stop doing that—when all the extra quirk code is so rarely used as to be irrelevant—but I’m unsure if we’ve reached it yet. It would be something to actually analyze, though, rather than just a philosophical decision.


> On the other hand, if you can be flexible enough to allow quirks on import while not perpetuating them on export, eventually you and other software built with the same philosophy standardize the field.

How? The only thing I can see happening is perpetuation of sloppy use of standards. "Why should I change my |-delimited CSV dialect that requires a double-semicolon at the end of each row, which is arbitrarily denoted by either \n or \r or \n\r, when all those programmers will accommodate me, no matter how little sense it makes to do so?"

> I do think there’s a point where things are standardized enough that you can safely stop doing that

I agree. And that point was when someone sat down and penned RFC 4180.

Everything after that point has to justify why it isn't RFC compliant, not the other way around.


> In fact I even have an amazing vim plugin just for them

So this is gold. Editing xSV files has been an ongoing pain, and this plugin is just amazingly awesome. Thanks for the link to it.


My pleasure :-)


you mean to say that vim can't handle simple character substitution? /s


No, it isn't in the real world. It's very much your problem if you're the team consuming these files. Try telling the head of accounting they need to make all their data RFC 4180 compliant and see how that goes.


> Try telling the head of accounting they need to make all their data RFC 4180 compliant and see how that goes

Fun fact: I did. And not just for accounting systems, but all sorts of data ingestion pipelines. Did it work every time? No. Did it work in many cases? Yes. Is that better? Absolutely.

Here is the thing: If I accept broken CSV, where do I stop? What's next? Next thing, my webservice backends have to accept broken HTTP? My JSON-RPC backends have to accept JSON with /* */ style block comments? My ODBC load balancer has to accept natural language instead of SQL statements (I mean, it's the age of the LLM, I could make that possible)?


I draw the line at the source changing how it's broken.

If things are broken, but in a predictable, standard-for-that-source way... ugh, but at least it's their standard, and if some tweak gets the common tools working for that one standard, then everyone can move on and be happy.


This keeps coming up as new people discover what CSVs are. An ancient TEXT data exchange format. The lowest vaguely common denominator. A style of format with flavors software long out of support contract are happy to export data in.

The intent of the format is to be human readable and editable. Sure, tab characters can be used instead of commas (TSV files). Yes, there's that "" rule to escape a quote. Oh, and quoting values is optional; unquoted strings are fine as long as they contain no comma, newline, or record separator characters.
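For illustration, Python's csv module follows exactly that "" rule by default; a tiny round-trip:

    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(['She said, "Hello"', 'plain'])
    buf.getvalue()                                   # '"She said, ""Hello""",plain\r\n'
    next(csv.reader(io.StringIO(buf.getvalue())))    # ['She said, "Hello"', 'plain']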

Sure, you could make another CSV-inspired format which uses the old mainframe control characters; except, as keeps getting pointed out, even programmers often don't know how to enter raw control characters on their systems. Who even bashes those out these days? I know I have to look it up every time.

Rejoice that the format is so simple, it's all just text which software might convert to numbers or other values as it might desire.


I agree completely. Its simplicity is what gives it staying power.

When I was an undergrad, I had kind of an anal software engineering 101 professor who was treating the course like he was a scrum master. The deliverable was to make some dumb crud app, and a requirement was it used a "database." It was so stupid simple to write a csv to s3 or local disk that I just used that for the entire project. He tried to fail me for not following the requirements, and I had to go to the dean of CS and argue that by definition, a structured data format on a disk is absolutely a database, and I won. I got graded horribly after that though.


> even programmers often don't know how to enter raw flow control characters on their systems.

Yes, but that is because those characters are not meant to be entered directly. DSV files should either be created by a dedicated DSV editor or constructed by a software library, just as you would rather use a paint program to create an image than write the image's bytes in a text editor.


Aka a completely different use case than CSV.


> Aka a completely different use case than CSV.

How many CSVs are generated, edited, or viewed by Notepad.exe and how many by Excel (or Google Sheets)?

I would posit the vast majority of CSVs are generated through some kind of program where you go to File > Export or File > Save As…. In which case, selecting a drop-down option for the file format to be TSV or DSV (with the corresponding file extension) would solve a lot of problems. (Or at least if CSVs from Excel were RFC 4180 compliant by default.)


How many get edited or inspected in notepad at some point in their life? Nearly all of them (for any given workflow).


It is nice that text editors are abundantly available and that they can be used for the task. But once the CSV columns get too wide and irregular, then you probably want to reach for a dedicated spreadsheet program, because it is otherwise too hard to figure out which column you are currently reading.

There is still room between a text editor and a full-blown spreadsheet program. New DSV editors could emerge when the DSV format gains popularity.


At the point someone is using a different format, they’ll likely pick something explicitly structured. Like everything from JSON, to Yaml, to Protobufs, or hell even XML.

DSV seems like worst of both worlds. Not really structured, AND also not really viewable/editable by lowest common denominator tooling.


> when the DSV format gains popularity

CSV is equivalent to Voyager I, the chances of catching up with that kind of head start are extremely low.


Right, the author skipped right over human-readable TSV files which play nicely with sed/awk/grep/sort pipelines, and are supported by all CSV parsers and spreadsheet software.


TSV is also my go-to when mucking around on the command line. Perfect for noodling with data before you have to put together an Excel file to show to management.


The problem is that people (non-technical mostly), put tabs in fields, and then you have all the problems that the article notes.


I personally find that this happens (a lot) less often than with commas or quote characters.


That's fair, but it only takes one to mess up the rest of the file.


Agreed, that's why it's not good for production processes.


CSV isn't a common denominator of anything. Everything is communicated out of band. Nobody understands your CSV files.


The author seems to ignore the fact that CSV got so popular because it is human readable. If anyone wanted a binary format there’s plenty of them - most better than this DSV.

Also, I’m on a mobile right now, so can’t verify that, but it seems the format is flawed. The reader decodes UTF8 strings after splitting the binary buffer by the delimiter, but I believe the delimiter may be a part of a UTF8 character.

Edit: just checked and there’s actually no chance that the delimiter the author chose would be part of UTF8 encoding of any other character than the delimiter itself


No, all UTF-8 multi-byte encodings have the most significant bit set.
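A quick way to convince yourself (Python sketch; the sample characters are arbitrary):

    # Every byte of a multi-byte UTF-8 sequence has the high bit set (0x80..0xFF),
    # so a 0x1E or 0x1F byte in a UTF-8 stream can only be the separator itself.
    for ch in ["é", "ẞ", "漢", "🎉"]:
        assert all(b >= 0x80 for b in ch.encode("utf-8"))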


CSVs aren't really readable either, though. They're "inspectable", but that's different. So if you want to read them you'll need to either use specific software, or do some preprocessing to align things properly, etc. ... in which case the extra step of performing a file-wide substitution of the record separator with newlines and the unit separator with tabs or something isn't a much worse problem.
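e.g. a throwaway way to eyeball a DSV in a terminal (sketch, assuming the article's 0x1E record / 0x1F field separators and a made-up file name):

    with open("data.dsv", encoding="utf-8") as f:
        print(f.read().replace("\x1e", "\n").replace("\x1f", "\t"))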


I'd say CSVs stuck around because there weren't any other alternatives that could be easily created, appended to, read by different apps.


> The author seems to ignore the fact that CSV got so popular because it is human readable.

It might seem that way if you didn't actually read the article:

> So what’s the downside? This custom FEC tooling might give you a hint.

> For starters, it’s rather unreadable when opened in a text editor.


[Grumpy mode start]

Some nitpicks, maybe someone finds them useful. Could we talk about code design a little bit?

    class DSV:
        @property
        def delimiter(cls) -> bytes:
            return b'\x1F'

        @property
        def record_separator(cls) -> bytes:
            return b'\x1E'

        @property
        def encoding(cls) -> str:
            return 'utf-8'
It's Python; don't make premature properties for static values.

    class DSV:
        delimiter = b'\x1F'
        record_separator = b'\x1E'
        encoding = 'utf-8'
Also, it's a false inheritance relationship. A writer is not related to configuration. You can't make any other useful subclasses of DSV (OK, maybe DSVReader, but that's it). At the very least it should be the other way around: an abstract Writer operating on instance configuration attributes, and DSVWriter defining those attributes.

Also, `self._buffer += chunk` is O(N^2). It starts to bite even for buffers as small as 100 bytes. It's OK for an example, but it's an issue for real code. The example at least buffers only the incomplete record, not the whole read chunk (good!), but it does only one split at a time (bad).
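For what it's worth, a rough sketch of buffering that avoids both issues (it assumes the article's 0x1E/0x1F separators; the names and chunk size are illustrative, not the author's code):

    RS, FS = b"\x1e", b"\x1f"

    def read_records(stream, chunk_size=65536):
        buffer = bytearray()
        while chunk := stream.read(chunk_size):
            buffer += chunk                      # in-place extend, amortized O(1)
            *complete, rest = buffer.split(RS)   # split every finished record at once
            buffer = bytearray(rest)             # keep only the trailing partial record
            for record in complete:
                yield [field.decode("utf-8") for field in record.split(FS)]
        if buffer:
            yield [field.decode("utf-8") for field in buffer.split(FS)]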

[Grumpy mode end]

Nevertheless, the article is very valuable and interesting to read. The CSV gotchas are well described.


From what I've seen, the biggest problem isn't with the CSV standard (even though it has a lot of faults), but rather that a lot of software that utilizes CSVs is poorly tested.

I can't tell you how many times I've downloaded a CSV that didn't escape quotes or newlines correctly, or how many times Excel has failed to correctly parse a perfectly valid CSV due to some decades-old quirk.

I know that there are better formats that make these types of situations pop up less, but is a little bit of quality control too much to ask for? If you've tested your software to make sure that it can handle CSVs with line breaks, tabs, and both types of quotes, then you've seemingly done more testing than 90% of the software out there.

On that note, the LibreOffice Calc team deserves major credit for how many different CSV formats it can handle. It's saved my bacon so many times when Excel wasn't up to the task.


I read a comment here some years ago from someone who had discovered ASCII field delimiters and was excited to use them. They then discovered that those characters are only used in three places: the ASCII spec, their own code, and the data from the first client where they tried to use this solution.

Any file format needs a well-specified escape strategy, because every file format is binary and may contain binary data. CSV is kinda bad not only because, in practice, there's no consensus escaping, but also because we don't communicate what the chosen escaping is!

I think a standard meta header like follows would do wonders to improve interchangeability, without having to communicate the serialization format out-of-band.

    #csv delim=";" encoding=utf8 quote=double locale="pt-BR" header=true

(RFC-4180 does specify that charset and header may be specified in the MIME type)
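If such a convention existed, parsing the header would be trivial; a toy sketch (the key names are this comment's own invention, nothing standardized):

    import shlex

    def parse_meta(line):
        assert line.startswith("#csv")
        return dict(part.split("=", 1) for part in shlex.split(line[4:]))

    parse_meta('#csv delim=";" encoding=utf8 quote=double locale="pt-BR" header=true')
    # -> {'delim': ';', 'encoding': 'utf8', 'quote': 'double', 'locale': 'pt-BR', 'header': 'true'}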


To me it's wild that the problem was solved back in the early 1960s (and really, well before that) but everyone just ignored it because of reasons and now we're stuck with a sub-optimal solution.


The only real benefit of CSV (other than that it is widely supported) is that it is easy for humans to read and write. The approach in this article solves the quoting problem, but also removes that benefit. If you have the power to move from CSV, surely JSON would be better if you need to keep the human readable/writable feature. And if you don't need it, there are other more featureful binary formats out there like parquet.


You can’t put comments in JSON, while that’s fairly easy in CSV. This makes JSON unusable most of the time for human-editable data.


There is no such thing as a comment in CSV.


In many dialects there are. Usually you start the line with #.

Comments will happen. If your file format doesn’t allow comments, then people will make up an extension to allow it. This is true even for binary formats.


JSON is often good, but it also has potentially a lot of overhead, depending on how sparse the data is. For sparse data, it might be better; but for non-sparse data, it has the overhead of repeating attribute names over and over. Of course you could also have arrays in JSON, not writing attribute names over and over, but then you are basically back to a CSV inside the JSON file ...


>Of course you could also have arrays in JSON, not writing attribute names over and over, but then you are basically back to a CSV inside the JSON file ...

You're confusing the concept of tabular data with the file format. If the most natural way to represent tabular data is through a 2D array, then so be it. The vast majority of people aren't complaining about the fact that they have to hardcode the meaning of "the last name is written into the fifth column", they are cursing that the fifth column has suddenly shifted into the sixth column, because the first name contained a comma.


Where am I confusing the two?


I like the idea but this is non-standard enough to be just as hard as making a custom format.

In my experience, the best way to handle this is:

1) Use TSV (tab-separated) instead of CSV (most things that export CSV also export TSV). Strip LF characters while reading and assume newlines are CR.

2) If you have a stubborn data source that insists on CSV, convert it to TSV in a pre-process (rough sketch below). This could be a separate step or part of your reader as you're reading in the file. That means there's a single place to handle the escaping nonsense, and you can tailor it to each data source.
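A rough shape of that pre-process in Python (illustrative only; the character substitutions and line ending are choices you'd tailor per source):

    import csv

    def csv_to_tsv(src_path, dst_path):
        with open(src_path, newline="", encoding="utf-8") as src, \
             open(dst_path, "w", encoding="utf-8") as dst:
            for row in csv.reader(src):
                # make each field safe for an unquoted, one-line-per-record TSV
                clean = [f.replace("\t", " ").replace("\r", "").replace("\n", " ") for f in row]
                dst.write("\t".join(clean) + "\n")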


> If we used 31 as a field delimiter and 30 instead of newlines, we solve every single edge case from above. Why? Because these are non-printing characters that should never appear in a text-stream data set.

I have in fact seen CSV files used as an interchange format for things that include non-plaintext fields. And I've seen nested CSV files.


LOL nested CSVs are a new one to me. What was it used for?


>For starters, it’s rather unreadable when opened in a text editor. But I bet you don’t really do that with your CSVs all that often anyway!

I really wish that were true.


I had to LOL a bit about this. I built a career that lasted over 30 years writing software that deciphered clients' attempts to produce sales data files in CSV format.

Many times they just couldn't seem to find the comma. Other times there were commas in the item description (unescaped). My favourite, though, was when files were edited collaboratively on Mac, Windows, and Linux machines - multiple line-end types FTW! Like I said, a long and somewhat inglorious career..


CSVs are a subset of DSVs. So I guess the idea is that using that specific subset is bad. But then again, it doesn't matter too much which character is used for separation, at least as long as it's not a character that frequently appears in cell values, because that would require a lot of escaping.


About a month ago, somebody posted a link to an interview with Brian Kernighan. About 6 minutes in, he talks about the difficulty of writing a decent CSV parser: https://www.youtube.com/watch?v=_QQ7k5sn2-o



When we get to CSVs I tell my Python students that while the CSV module does do a lot of nice things for you, CSVs are still a minefield and you really have to look at the file in a text editor first if you're not the one who created it.


If you strip the exchangeability from an exchange format, it is useless.

DSVs didn't work with Google Sheets, nor vim, nor Python, which I assume is the exhaustive list of software the author would have needed support from. The question, then: If no software understands the format, what's the point?

> I first learned about these ASCII delimiters while working with .fec [Federal Election Commission] files.

And then the author instantly chose a different delimiter. Two parties and already two standards. That should have been the final red flag for this proposal.

--- Aside: CSVs have so many problems with their data format that you always have to verify them anyway.

Germans write 5.000,24€ where an American would write $5,000.24. Date strings. URL-encoded strings. Numbers as strings. Floating numbers.

Solving the delimiter problem accomplishes nothing.


I was wondering what DSV is and saw it's a term the author created. I have seen this format usually called ASV (ASCII Separated Values).

There's also a more modern USV (Unicode Separated Values) which has visible separators.


I cannot imagine any way it is worth anyone's time to follow this article's suggestion vs just using something like zsv (https://github.com/liquidaty/zsv, which I'm an author of) or xsv (https://github.com/BurntSushi/xsv/edit/master/README.md) and then spending that time saved on "real" work


If you're not concerned with the size of the file, you might consider just using NDJSON.

https://github.com/ndjson/ndjson-spec
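For reference, the whole round trip is only a few lines (sketch; the file name and fields are made up):

    import json

    rows = [{"name": "Alice", "note": 'She said, "Hello"'},
            {"name": "Fred", "note": "Line one\nLine two"}]

    with open("events.ndjson", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")   # one JSON object per line

    with open("events.ndjson", encoding="utf-8") as f:
        parsed = [json.loads(line) for line in f]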


If only...

Every time I have to do any major work with CSVs, I re-lament this exact thing.

I think the only way this could ever become more widespread is to fix all the open source tooling so that it's eventually just supported everywhere - then keep evangelizing for... ~30 yrs.

Probably you should also register a new mime type and extension and make it a new thing - don't overload .CSV any further - but make the same tooling support it.


If time could be turned back, a good idea would be to make CSV mean CSV. Not semicolon separated values, not any other thing separated values, but only comma separated values. To not overload the name in the first place.


And I would rename the format to SSV and make semicolon the separator. Comma is a terrible choice, because it's used as the decimal separator in many countries around the world.


Yep. Also limit data that can be stored there. Absolutely no decimal values. Absolutely no text.

Ban those things and it starts to become reasonable enough for general use.


People here complaining this guy is suggesting a "new" standard: it's ASCII. It is already a standard, and probably a lot more sensible than others that followed.

I too have wondered why the hell we aren't using those special characters already, ever since I discovered their existence.


CSV is kinda great... but it does help to have nice tools to wrangle it, such as the famous xsv by BurntSushi.


I have found that pandas is much better than the standard library's csv module for importing random CSV files and automatically figuring out what you would want to do most of the time: detecting column headers, dealing with quoted strings, etc.
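e.g. (a sketch; sep=None with the python engine makes pandas sniff the delimiter, and dtype=str keeps things like leading zeros intact):

    import pandas as pd

    df = pd.read_csv("messy.csv", sep=None, engine="python", dtype=str)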


Are there any commonly used fonts that display FS, GS, RS, and US as cool graphical characters? If I'm going to use them to structure text, I want them to be visible and clearly distinguishable from the text.


Had a similar challenge when writing alphareader (it's on GitHub). HN comments helped me think about multi-byte separators, and one thing is sure: no matter which char you choose, it will appear in the wrong place at some point.


['Alice', 'She said, "Hello" and waved.', 'Fred''s Car is broken']

You still have the escaping issue described for "", just with '' instead, if I read the examples correctly.


Nice! Regarding support from third-party software, perhaps it would be worth writing a specification for DSVs. I think that could ease adoption by well-known software.


I wanted a quick and dirty way to parse a CSV in JS the other day, and just added square brackets around it and used JSON.parse.

Am I alone in this?
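The same trick in Python terms (sketch; it only holds up while every field happens to be a valid JSON literal, i.e. double-quoted strings or bare numbers, with no ""-style CSV escapes):

    import json

    row = '"Alice","Bob, Jr.",30'
    json.loads("[" + row + "]")    # ['Alice', 'Bob, Jr.', 30]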


That's actually pretty smart.

The number of times I have written a simple CSV parser to correctly handle quoted strings and the like is more than I have digits, when I could have just pretended it's JSON.

Going to make a mental note to try this next time!


CSVs are bad. If you can change the format then don't use a DSV, use parquet and a library for your language to consume parquet.

It's less code for you and you can do neat things like zstd compression on the columns.

Bonus, it also doesn't require that you load and unload everything in memory.

https://arrow.apache.org/docs/python/parquet.html
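A minimal sketch of that suggestion with pyarrow (column names and file path are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"date": ["2024-01-01", "2024-01-02"],
                      "amount": [5000.24, 123.45]})
    pq.write_table(table, "sales.parquet", compression="zstd")

    # Columns can be read back selectively, without loading everything into memory as text.
    subset = pq.read_table("sales.parquet", columns=["amount"])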


When I'm in control I just produce a TSV instead of a CSV. A comma is much more likely in text than a tab


Ok, but which tool can you use to edit csvs with the same power as excel… without messing with the csv?


What this DSV format needs for evangelization is for someone to create a front end editor for it.


Original author writes: >>> "Quick aside: I first learned about these ASCII delimiters while working with .fec files. For whatever reason, the Federal Election Commission in the United States also decided that they needed to ditch the comma, but they landed on using ASCII character 28 which is supposed to be used as a file separator, not a field separator. I have no idea why they picked that one when 31 was right there. Anyway, the FEC also has a tool called fs2comma.exe that turns it back into a CSV format, and a couple of years ago I filed a FOIA request and got the source code."

I can only speculate on this, but in Perl, for fake multidimensional arrays à la `$foo{$x,$y,$z}`[^1], Perl uses ASCII character 28 (U+001C INFORMATION SEPARATOR FOUR) as its default subscript separator. Perl borrowed this feature from AWK, which uses the same character by default for the same purpose.

Based on Perl, I initially used that same character for that same purpose in a project or two. I cannot speculate on why Aho, Weinberger, and/or Kernighan chose that character. (On or before 1977.)

[^1]: Not to be confused with nested array (or hash) references in Perl, a truer form of multidimensional arrays: `$foo->[$x]->{$y}->[$z]`


DSVs are pretty bad, CSVs are kinda OK (just don’t ever open one in Excel)


Directly opening CSV files in Excel does not work for me either; importing a CSV file, however, works decently well for me.


Love it. I’m gonna use this. Thank you for sharing!


> Because these are non-printing characters that should never appear in a text-stream data set.

Good luck with that.


Ah yes good old CSV. It's perfectly fine to use for data transfer and there are libraries for (probably) every language that handle it perfectly to spec.

The problem isn't "CSV". The problems come from:
- "excel doesn't like this CSV therefore it's not valid"
- "what do you mean the CSV I sent you is wrong? I used excel"
- "standard? What's a standard? I put info then a comma. That should be good enough for anyone"

CSV, when done right (i.e. following a standard) is a great format. It's human readable, less verbose than, say, JSON, and everybody understands it.

Just have to make sure business people (and less technical technical people) understand that CSV != Excel and vice-versa.


Question: I started with a deliberately convoluted PDF from which, after much effort, I filtered, sorted, reorganized, and transferred the 18,000 useful lines to a CSV. These lines are simple, with dates, an indicator, and corresponding numbers.

The purpose is to statistically analyze the numbers for anomalies or any signs of deviation from expected randomness. I do this all in Python 3 with various libraries. It seems to be working, but...

What is a more efficient format than csv for this kind of operation?

Edit: I have also preserved all leading zeros by conversion to strings -- csv readers don't care much for leading zeros and simply disappear them, but quotes fix that.


18k lines is very small; CSVs are fine as a storage option.

My rule of thumb is that anything that fits into Excel (approx 1M lines) is "small data" and can be analysed with Pandas in memory.


Hey, thanks for taking the time to reply. I won't be reaching 1M anytime soon, so good to know!


The phones we use at {JOB} can be programmatically controlled using their proprietary command language, which is just CSV, with each command ended by a newline (because how else would you do it? Packet boundaries? Pfft).

It's something I've never understood: why not use something more standard like SIP, or at least a more structured message format? Having to parse CSV across N different packet boundaries is a royal PITA.


This reminds me of https://xkcd.com/

As others said, most of the time, if you are producing them, just produce them right, or choose another format.

If you don't then pray for the best.


Did you mean:

https://xkcd.com/927/

?



