It's sad that the ASCII specification includes two codes, 30 and 31 (record separator and unit separator), precisely to cleanly answer the need that CSV addresses.
During the 90's I was anal about using them, annoying the hell out of my teammates and users by forcing them to use these 'standard compliant' files. Had to give up.
And they still don't fix the escaping problem. You might as well use a niche UTF-8 emoji as a separator; at least editors know how to render an emoji consistently.
As a co-op student I used a library to achieve fool-proof CSV encoding: it escaped and quoted everything as necessary, so commas, backslashes, quotes, and any other character could be included in the data. But it was rejected because the plain-text files were difficult to read and edit by hand!
I agree. So if we don't need this to be hand-crafted and for human consumption, we may as well just use some TLV or LV encoding instead of the CSV madness of separators and escaping. CSV is basically designed for hand-crafting.
They are also easy to read, perhaps easier than a space or another character, although this could be because we are just used to seeing data (e.g. CSV) presented this way.
A lesson in confusing representation with data.
If users can learn not to edit .xls files in a text editor, and to type Tab to go to the next cell in spreadsheet software, they can learn to edit CSV in a proper CSV editor.
The only trap was that we made a non-text format so simple that it tricked us into thinking "it's only plain text".
Editing CSV by hand is something I've seen a lot in internal-only software, where every user is a super power user who needs to move small but bulk amounts of data and sometimes make small edits for formatting.
Easiest example is geo: I need 20 states listed as US-CO, US-CA, etc., but one tool exported them as US CO.
- To escape the delimiter, we should enclose the value with double quotes. Ok, makes sense.
- To escape double quotes within the enclosing double quotes, we need to use 2 double quotes.
Many tools get this wrong. Meanwhile, some tools like pgAdmin justifiably allow you to configure the escape character to be a double quote or a single quote, because the CSV "standard" is often not respected.
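For reference, here is a minimal sketch of the RFC 4180 convention (quote any field that contains the delimiter or quotes, and double the embedded quotes) using Python's standard csv module; the values are made up for illustration:

    import csv, io

    rows = [["id", "comment"],
            ["1", 'She said "hi", then left'],   # contains a comma and quotes
            ["2", "plain value"]]

    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    print(buf.getvalue())
    # 1,"She said ""hi"", then left"   <- comma and quotes escaped per RFC 4180

    # Round-trip: the reader undoes the quoting.
    print(list(csv.reader(io.StringIO(buf.getvalue()))))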
Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
They're not getting it wrong, they're just assuming a different variant.
There is no "standard" for CSV. Yes, there's an RFC, published in 2005, about 30 years after everyone was already using CSV. That's too late. You can't expect people to drop all compatibility just because someone published some document somewhere. RFC 4180 explicitly says that "it does not specify an Internet standard of any kind", although many people do take it as a "standard". But even if it did call itself a standard: it's still just some document someone published somewhere.
They should have just created a new "Comma Separated Data" (file.csd) standard or something instead of trying to retroactively redefine something that already exists. Then applications could add that as a new option, rather than "CSV, but different from what we already support". That was always going to be an uphill battle.
Never mind that RFC 4180 is just insufficient by not specifying character encodings in the file itself, as well as some other things such as delimiters. If someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.
RFC 4180 says it "documents the format that seems to be followed by most implementations" and in practice I find that to be true, though my CSVs don't interact with a lot of very old software. You get very far by treating "RFC 4180, UTF-8" as a standard and considering every implementation that doesn't follow it to be broken. I'm not sure I have ever seen software that simultaneously doesn't follow the RFC, but does consistently support escaping.
It's in the standard library for Python, Rust, Julia, and maybe some other languages. It's also widely used in those ecosystems (pyproject.toml, cargo.toml). I think it's fair to say it took off, even though YAML is also popular.
> someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.
Why? We have xlsx for the office crowd and arrow for the HPC crowd. In no universe does anyone actually have to invent another tabular data format using delimiters.
Neither is a universal replacement for CSV. They're not even text formats (well, technically xlsx is if you extract the XML from the zip, but practically: not really). The article already explains why, as the title says, "CSV is still king": it's simple, it's used all over the place, it's universal, and it's more or less human-readable.
I can't tell you how to run your business, but subscriptions for offline apps aren't going to be popular here.
Charge me more upfront for a perpetual license, or just version the software. Say $40 today for v3, and every year charge a reasonable fee to upgrade, but let me keep using the software I purchased...
I recently saw a license that was based on a monthly subscription, but once you paid for a year you got a perpetual license to the version you started with. Every year, your perpetual license was updated to the next year's version. I find that to be a reasonable middle ground.
Thank you for your feedback. I think your opinion is super valid here.
I've been thinking about pricing, and a lot of people did complain about it. However, many people expense their software cost, so they don't mind the yearly subscription.
I'm improving the pricing right now and a perpetual license is what I'm going with.
> Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
Looks like SQL is the main selling point for your tool. For other simpler needs, Modern CSV [1] seems suitable (and it’s cheaper too, with a one time purchase compared to a yearly subscription fee). But Modern CSV does not support SQL or other ways to create complex queries.
It would be more useful if every RFC had a test suite of input/output and input/error.
Yes, those are potentially infinite, but a core set would be useful. As ambiguities come up, publish an addendum for clarification, and eventually, as the exceptions accumulate, a version step.
I don't understand how anyone can write a spec without concrete examples of pass/fail in their head. Perhaps there could be an informal example/counterexample syntax for those writing RFCs, which could be extracted into the 1.0 test suite.
The test suite must be a single open source repo, that accumulates acceptable edge cases until the relevant informed adults can make a call about revising the spec.
There has to be one approved, sanctioned, well-known and monitored test suite repo. It cannot be shrugged off into a free-for-all that makes it impossible to find a single canonical test suite. The interwebs are big and conflicted.
See Imre Lakatos 'Proofs and Refutations' for how this evolves.
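As a concrete sketch of what such a suite could look like for CSV: a plain table of input/expected pairs that any implementation under test can be driven against. Everything below is invented for illustration, with Python's own parser standing in as the implementation under test:

    import csv, io

    # (name, raw input, expected rows)
    CASES = [
        ("simple",           'a,b\n1,2\n',          [["a", "b"], ["1", "2"]]),
        ("quoted comma",     '"a,b",c\n',           [["a,b", "c"]]),
        ("doubled quote",    '"say ""hi"""\n',      [['say "hi"']]),
        ("embedded newline", '"line1\nline2",x\n',  [["line1\nline2", "x"]]),
    ]

    def run(parse):
        for name, raw, expected in CASES:
            try:
                got = parse(raw)
            except Exception:
                got = "ERROR"
            status = "ok" if got == expected else f"FAIL (got {got!r})"
            print(f"{name:18} {status}")

    run(lambda raw: list(csv.reader(io.StringIO(raw))))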
RFCs sometimes have pseudocode. It would be nice to have a "pseudocode translator" that translates it to some actual programming language.
With few exceptions, I have given up on documentation, whether it is specifications or software. Now I just read source code instead.
I think in the 60s and 70s documentation used to be better and focused more on input/output. For example, I still use SPITBOL and Icon.
Maybe it is a controversial view, but I fail to comprehend how any RFC can be considered a "specification". In truth an RFC is only a "proposed specification" at best, literally a "request for comments". (Where are the comments?) In fact, often RFCs simply document some internet practice that already exists. (Meanwhile the number of "BCPs" is relatively small.) RFCs can be anything.
I agree about markdown, but the only awkward implementation issue is nested syntax: what markup is parsed inside various other outer markup forms?
Italic headings? Bold links? Nested lists - how many levels? Code in list? How do paragraphs interact with lists? There are many opinions and many leaky implementations of those opinions. Newlines? Embedding HTML in Markdown !?!?
It all seems so sad, because (X)HTML nailed most of these issues a very long time ago. But HTML implementations were sloppy from the outset. And XML was born with inherited bloat, then got ever more complex over time (modular specs, XLink, XPath, XSLT, DTD -> XML Schema, ...)
With Markdown, it is relatively easy to introduce some recursion into the parser, but for what spec? In what contextual cases? At what cost?
It is possible to just treat commas as whitespace. It makes implementation so much easier. It accepts missing, trailing and repeated commas. It makes elements uniform. It ignores many common errors that arise from typos or cut'n'paste. It makes JSON writers simpler, by removing the first/last special case.
A JSON parser that treats commas as whitespace can be two dozen lines in most programming languages - if you do not want line/column, chapter and verse, for the remaining error messages.
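Here is a sketch of that idea in Python, a bit over two dozen lines once comments are included (string unescaping is simplified and there is no line/column error reporting), just to illustrate how small the parser gets once commas are lumped in with whitespace:

    import re

    # Commas are lexed together with whitespace, so ",,[1 2,3,]" style input is fine.
    TOKEN = re.compile(
        r'[ \t\r\n,]+|([{}\[\]:])|("(?:\\.|[^"\\])*")'
        r'|(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)|(\w+)')

    def tokenize(text):
        pos = 0
        while pos < len(text):
            m = TOKEN.match(text, pos)
            if not m:
                raise ValueError(f"bad input at {pos}")
            pos = m.end()
            if m.lastindex:                # skip pure whitespace/comma runs
                yield m.group(m.lastindex)

    def parse(text):
        value, rest = parse_value(list(tokenize(text)))
        if rest:
            raise ValueError("trailing garbage")
        return value

    def parse_value(toks):
        head, rest = toks[0], toks[1:]
        if head == '{':
            obj = {}
            while rest[0] != '}':
                key, rest = parse_value(rest)
                assert rest[0] == ':'
                obj[key], rest = parse_value(rest[1:])
            return obj, rest[1:]
        if head == '[':
            arr = []
            while rest[0] != ']':
                val, rest = parse_value(rest)
                arr.append(val)
            return arr, rest[1:]
        if head.startswith('"'):           # simplified: only \" and \\ handled
            return head[1:-1].replace('\\"', '"').replace('\\\\', '\\'), rest
        if head in ('true', 'false', 'null'):
            return {'true': True, 'false': False, 'null': None}[head], rest
        return (float(head) if '.' in head or 'e' in head.lower() else int(head)), rest

    print(parse('{"a": [1, 2,, 3,], "b": "x,y",}'))   # commas are just noise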
Just use TSV. Commas are a terrible delimiter because many human strings have commas in them. This means that CSV needs quoting of fields and nobody can agree on how exactly that should work.
TSV doesn't have this problem. Without any quoting it can represent any string that doesn't contain a tab or a newline, which covers far more real-world strings than unquoted CSV can.
It's 2024 and Excel still doesn't natively parse CSV with tabs as delimiters. When I send such csv files to my colleagues, they complain about not being able to open them directly in Excel. I wish Excel could pop up a window like LibreOffice does to confirm the delimiter before opening a csv file.
Excel does not support any delimiter natively, since it's region-dependent.
I ended up saving my mental health by supporting two different formats: "RFC csv" and "Excel csv". For Excel you can, for example, use a sep= hint (e.g. sep=;) at the beginning of the file to get the delimiter to work consistently. The sep= annotation obviously breaks parsing for every other CSV parser, but that's why there is the other format.
Also, there might be other reasons to mess with the file to get it to open correctly in Excel, like date formats or adding a BOM to get it recognized as UTF-8, etc. (Not quite sure whether the BOM was for Excel or for some other software we used to work with.)
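A sketch of the "Excel csv" variant described above, in Python (the sep= hint and the UTF-8 BOM are real Excel behaviours; the file layout and values here are just an illustration):

    import csv

    rows = [["name", "amount"], ["Müller", "1,50"]]

    # "Excel csv": UTF-8 BOM so Excel detects the encoding, plus a sep= hint on the
    # first line so Excel uses ';' regardless of locale. The hint line breaks most
    # non-Excel CSV parsers, hence the second, plain variant.
    with open("report_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
        f.write("sep=;\n")
        csv.writer(f, delimiter=";").writerows(rows)

    # "RFC csv": plain UTF-8, comma-delimited, no hint line.
    with open("report_rfc.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)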
I also use sep= annotation.
That is not documented ANYWHERE by Microsoft
I assume one of the devs mentioned this on a mailing list sometime in the nineties and it has found its way around.
Still... shame on Microsoft for not documenting this and perhaps other annotations that one can use for Excel.
"Delimiters Select the character that separates values in your text file. If the character is not listed, select the Other check box, and then type the character in the box that contains the cursor."
Maybe they should know their tools better instead of just double-clicking and hoping for the best.
It's 2024 and people still haven't realized that Excel does not and never will support opening CSV files. The closest thing it allows you to do is import data from a CSV file into your current spreadsheet, but open a CSV file? It will never do that. Stop using CSV for excel, just generate .xlsx files like everyone else.
Not in every version. I recently found out that Excel doesn't recognize commas as separators in a comma-separated-values file on my coworkers PCs.
I presume it's because Germany uses the comma as a decimal separator instead of a dot.
I eventually settled on just exporting Excel files because I couldn't get both the encoding and the separator to work at the same time.
Another fun story is that a coworker lost data, when they opened a csv, wrote data to a second sheet, and then saved it. A sane program would probably have brought up a save-as window. Excel didn't. It just discarded the second sheet.
If Excel is set to handle the .csv extension, then attempting to open a .csv file correctly launches Excel and imports it. Fine for read-only use, but if you want the data back out you have to force matters; it's not automatic.
Feature-wise, Excel probably still has more options, but in terms of ergonomics, Google Sheets is much better. And I'm saying this as someone who has used Excel for 20 years.
Here are a few specific examples:
1. Editing formulas using the keyboard only is a nightmare in Excel. It often randomly throws errors and warnings when I move the cursor around (like typing parentheses or quotes first and then trying to move back to type text inside, etc.) before finishing editing.
2. Conditional formatting in Excel is so non-intuitive that I actively try to avoid it like the plague. Yet, I use it extensively in Google Sheets because it is so easy to create multiple rules there.
3. The whole copy/paste design choice in Excel is, in my opinion, weird. Firstly, there is a distinction between copying a cell and copying text: if you copy an entire cell, you cannot paste it as text in a formula or any other input area. You have to copy from the formula bar of that cell. Even for pure cell copying, the cells have to remain highlighted. If you copy a cell and then unselect it (by pressing Esc or trying to edit any cells), the copied content is lost. I'm sure there are reasons it's designed this way, but it's so irritating, and I never find any benefit.
Google Sheets indeed does everything a common user would expect from a spreadsheet, without having to install anything and fiddle with licenses. This in itself is the killer feature.
For me personally the absolute killer feature is the bidirectional integration with BigQuery, something you won’t get that easily (if at all, correct me if I am wrong) with Excel.
As a long-time Office user since the Windows 3.1 days, I think Google Sheets still has a lot of ground to cover.
Quattro Pro could easily do all of the Sheets stuff except document collaboration, ignoring the fact that Sheets is a web application and Quattro Pro started on MS-DOS.
If you go with the CSV convention of two adjacent tabs => blank cell in the middle, then rows of different length will not line up properly in most text editors. And "different length" depends on the client's tab width too
If you allow any amount of tabs between columns, then you need a special way to signify an actually-blank column. And escaping for when you want to quote that
If you say "use tabs for columns and spaces for alignment", then you've got to trim all values, which may not be desirable
You’re talking about issues with alignment when data is displayed on a terminal or text editor, which is not at all related to data exchange.
In data exchange nobody ever allows multiple tabs between columns. If there are multiple tabs with nothing in between it means the column is empty for that row.
Just like CSV, TSV is always a pain to edit manually, so the issues there are the same. Using tabs does have a lower likelihood of conflicting with the actual data.
This is true, but I'd assumed that one of the major reasons to use TSV is for human readability. If not, then I'd personally choose an even rarer character as my delimiter
Because nobody made keyboards with those keys. Had they stuck with a 'next unit' and 'next record' key pair that sent them, we'd all be fine, but instead we got overly redundant text-editing keys rather than keyboards better suited to data entry.
I tend to prefer that over CSV as well. But usually I go for ndjson files since that's a bit more flexible for more complex data and easier to deal with when parsing. But it depends on the context what I use.
However, a good reason to use TSV/CSV is that import/export in spreadsheets is really easy. TSV used to have an obscure advantage: Google Sheets could export it but not CSV. They've since fixed that and you can do both now.
And of course, getting CSV out of a database is straightforward as well. Both databases and spreadsheets are of course tabular data; so the format is a good fit for that.
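For example, a minimal dump of a query result to CSV with Python's standard library (the database, table, and column names here are made up):

    import csv, sqlite3

    conn = sqlite3.connect("app.db")
    cur = conn.execute("SELECT id, name, email FROM users ORDER BY id")

    with open("users.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header from cursor metadata
        writer.writerows(cur)                                  # csv module handles the quoting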
Spreadsheets are nice when you are dealing with non-technical people. They make it easier to involve them in editing / managing content. Also, a spreadsheet is a great substitute for admin tools to edit this data. I was once on a project where we paid some poor freelancer to work on some convoluted tool to edit data. In the end, the customer hated it and we unceremoniously replaced it with a spreadsheet (my suggestion). Much easier to edit stuff with those. They loved it. The poor guy worked for months on that tool with the help of a lot of misguided UX, design, and product management. It got super complicated and was tedious to use. Complete waste of time. All they needed was a simple spreadsheet and some way to get the data inside deployed. They already knew how to use those, so they were all over it.
That's because anyone can easily type a tab character with their keyboard. No one ever remembers the key combination for those special ASCII characters.
- If I send someone a spreadsheet, they'll open it with a spreadsheet application; Excel, LibreOffice, whatever.
- If I send someone a CSV file, they'll want to open it with a text editor.
Ack, no! Open it with a spreadsheet app, or load it into SQLite, or, best of all, open it with VisiData or some kind of editor designed for tabular data.
Actually no - spreadsheets classically choose their own way to interpret CSVs, that's the classic way to get your client to continue to send you support requests.
There's a reason so many tools export to xls instead of csv.
And if you just double-click a csv file to open it in Excel (rather than importing it from within Excel), Excel will happily corrupt your data: trim leading zeros in IDs, round large integers, not spot that the date column is in US rather than European format, etc.
Excel will happily ignore it if you use tabs as the delimiter and show all the columns as a single column. At least LibreOffice tries to figure out the delimiter and confirms it before opening a CSV.
While that's true, the way text editors handle these characters is not standardized, and many may not let you input them. One of the important features of CSV/TSV is that they're relatively easy to edit by hand, and for that you need separator characters that are easy for both text editors and humans to work with.
Personally, since I've discovered the field/group/record/file separator characters in ASCII, I've been using them to concat fields and rows on one-to-many SQL joins. They work great for that purpose since (at least on all the projects I've done this with so far) I can be certain that none of the values in the joined data will have those characters, so no further escaping is necessary. For example, in MySQL:
SELECT
i.item_id,
GROUP_CONCAT(CONCAT_WS(0x1F, f.field_id, f.field_value) SEPARATOR 0x1E) AS field_values
FROM items i
LEFT JOIN fields f ON f.item_id = i.item_id
WHERE ...
Then split field_values on 0x1E to get each field ID / field value pair, and split each of those on 0x1F. Easy as pie.
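A sketch of that client-side split in Python, following the column naming in the query above:

    def parse_field_values(field_values):
        """Split the GROUP_CONCAT result back into {field_id: field_value}."""
        result = {}
        if not field_values:                      # LEFT JOIN can yield NULL/empty
            return result
        for pair in field_values.split("\x1e"):   # 0x1E = record separator
            field_id, _, value = pair.partition("\x1f")   # 0x1F = unit separator
            result[field_id] = value
        return result

    print(parse_field_values("42\x1fred\x1e43\x1flarge"))
    # {'42': 'red', '43': 'large'}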
I wish binary length-prefixed formats would've become more common. Parsing text, and especially escaping, seems to be a continual source of bugs and confusion. Then again, those who don't implement escaping correctly may also overlap with those who can't be bothered learning how to use a hex editor.
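As a sketch of how simple such a format can be (the layout below is invented for illustration, not any existing standard): each record stores a field count, then each field as a 4-byte little-endian length followed by raw UTF-8 bytes, so nothing ever needs escaping.

    import struct

    def encode_record(fields):
        """Field count, then each field as 4-byte LE length + raw UTF-8 bytes."""
        out = [struct.pack("<I", len(fields))]
        for field in fields:
            data = field.encode("utf-8")
            out.append(struct.pack("<I", len(data)))
            out.append(data)
        return b"".join(out)

    def decode_record(buf):
        (count,) = struct.unpack_from("<I", buf, 0)
        pos, fields = 4, []
        for _ in range(count):
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            fields.append(buf[pos:pos + length].decode("utf-8"))
            pos += length
        return fields

    # Commas, quotes, tabs, newlines: none of it needs escaping.
    record = encode_record(['say "hi"', 'a,b\tc\nd'])
    print(decode_record(record))   # ['say "hi"', 'a,b\tc\nd']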
CSV comes from a world in which the producer and consumer know each other; if there are problems they talk to each other and work it out.
There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.
If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From these kind of folks, I'd like something other than CSV.
But the big guy most likely exports the .csv correctly in the first place, you don't *need* to work with them.
Only once have I seen a bad .csv from a "big" company--big fish in a small pond type big. We were looking to get data out; hey, great, .csv is a valid export format. I'm not sure exactly what was in that file, but it appeared to be the printout with some field info attached to each field (put this at that location on the paper etc., one field per line). Every output format it has is bugged in some scenario.
I fully agree that CSV is king and am quite happy about it. But the comma character was probably one of the worst choices they could make for the "standard", IMHO of course.
Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.
Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.
I wasn't around at the time, but surely ASCII was (even if not ubiquitous)? Is there any particular reason that the FS/GS/RS/US (file/group/record/unit separator) characters didn't catch on in this role?
I did an ETL project years ago from a legacy app that used these delimiters. It was gloriously easy. No need to worry about escaping (as these characters were illegal in the input). It's a shame they didn't catch on.
I actually just finished a library that adds proper typed parsing and works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.
CSV is king because most ETL department programmers suck. Half the time they can't generate a CSV correctly. Anything more complicated would cause their tiny brains to explode.
I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.
A lot of data that I see in CSV "format" would work fine as tab-delimited and wouldn't need any escaping (because most of the data I see doesn't allow literal tabs anyway). That would be a simple improvement over CSV.
I'm surprised that the article and the comments failed to mention pipe-delimited files. I work with almost two dozen different vendors (in healthcare) and 90% use pipes. Doing data exchange with a variety of delimiters is so common that I just built out a bespoke system for taking in a set of common configurations and parsing the information. Other settings include line endings, encoding, escape characters, whether the header is included, etc.
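A sketch of that kind of configuration-driven ingestion using Python's standard csv module (the settings mirror the ones mentioned above; the names and defaults are made up):

    import csv
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FeedConfig:
        delimiter: str = "|"              # pipes are common in healthcare feeds
        quotechar: str = '"'
        escapechar: Optional[str] = None  # some vendors backslash-escape instead of quoting
        encoding: str = "utf-8"
        has_header: bool = True

    def read_feed(path, cfg):
        with open(path, newline="", encoding=cfg.encoding) as f:
            reader = csv.reader(f, delimiter=cfg.delimiter,
                                quotechar=cfg.quotechar,
                                escapechar=cfg.escapechar)
            rows = list(reader)
        header = rows.pop(0) if cfg.has_header and rows else None
        return header, rows

    # One vendor ships pipes with a header, another tabs with no header in latin-1.
    vendor_a = FeedConfig(delimiter="|")
    vendor_b = FeedConfig(delimiter="\t", has_header=False, encoding="latin-1")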
I prefer ndjson for systems I build (with only JSON objects at the top level). It's much safer for a lot of edge cases. If there's significant repetition in the keys, the files end up zipping well.
As the article says, it will be interesting to see if NDJSON becomes more popular. Although it's a bit more difficult to parse and makes for larger files than CSV, it is less ambiguous.
Indeed, I created my own tool to preview and adjust CSV files before viewing: https://csvonline.newbeelearn.com/csvdemo . It's not ready yet and would probably not work for large files, but it works well enough for CSVs with appended data that screws up the formatting.
Wow! It does exactly what I want :-) What are the odds of that! I tested with a bank file where the CSV starts after some lines, and I was able to read it after a bit of fiddling with the configure button. What is the theming demo doing with CSV, by the way?
The site says "something went wrong" just one second AFTER it successfully displayed the content. Something is so wrong that it had to withdraw the content from the user...
Use JS only to enhance UX!
I routinely interface with 1GB+ CSVs. The size explosion for JSON would be huge. Disk IO aside, I assume a JSON parser is going to be slower than a CSV parser.
Nobody does this currently. You have now created another bespoke format. If I am going to need a custom parser/writer, I might as well lean on a binary format that has far stronger properties than a text based one.
JSONL is a pretty common format. It makes sense for logs and anything else written incrementally.
JSON parsers are super common. They are simpler and faster than CSV parsers because the format is more regular. JSONL is simple to implement because you write record by record and read line by line.
The only difference from CSV is the bracket characters around each line, and that every string has quotes. The benefit is clear escaping rules, including for newlines.
JSONL is standard. Upthread said to write the header row and then make subsequent rows arrays, and I am not aware of anything that does this currently.
My objection to JSONL was about the increase in file size owing to repeating the keys.
JSON can write arrays in addition to hashes. JSON arrays are nearly identical to CSV rows. The only difference is brackets around lines. There is no extra space wasted on keys.
CSV is explicitly about tabular data. JSON (including JSON5) is much more flexible. Flexibility can be great but can also be annoying. If you want tabular data, then a system that enables nesting isn't great.
You would write JSON arrays without names for tabular data. I don't know if there is a standard way to do the header, but an array of names would work. Or a JSON Schema record.
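A sketch of that convention (header as a JSON array on the first line, one data array per subsequent line; this is just one way to do it, not a standard):

    import json

    rows = [["id", "name", "note"],
            [1, "Ada", 'says "hi", then leaves'],
            [2, "Bob", "line1\nline2"]]

    # Write: one JSON array per line; quoting of commas, quotes and newlines is JSON's problem.
    with open("table.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Read: header from the first line, data rows from the rest.
    with open("table.jsonl", encoding="utf-8") as f:
        header = json.loads(next(f))
        records = [dict(zip(header, json.loads(line))) for line in f]

    print(records[0])   # {'id': 1, 'name': 'Ada', 'note': 'says "hi", then leaves'}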
Rather than highlighting flexibility as the differentiator, I would say: CSV is for dense data, JSON is for sparse data. They are flexible in different ways. For example, CSV is very flexible when renaming a column title.
I don’t think CSV became king because “,” is a great delimiter (obviously it is not), it became king because it is an easy and logical separator _to most people_. Yeah it’s infuriatingly dumb from a technical standpoint. All the points here that tabs or ascii separators are superior are of course correct. I honestly respect it for how ubiquitous it became WITHOUT having a standard. Still going to curse when I have to deal with a broken one though.
I think we just need someone to get fed up and simply tackle the list of well-known problems with CSV.
What we need is,
- A standard (yeah, link xkcd 927; it's mentioned enough that I can recall its ID) to be announced **after** the rest of the pieces are ready.
- Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.
- Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools can deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).
The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata since most of the times there's a single reasonable way to join tables.
This seems too much work to get right since the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.
Most major languages have decent libraries, frameworks and tools for dealing with CSV. Those tend to have lots of tests for all the well known issues and edge cases. Especially in the python world, which is used for a lot of data processing, tooling is not really an issue. But most other languages also have decent frameworks. Most of that stuff covers the few standards that exist for this, the well known variants of the format that are out there (quite a few) and can deal with the quirks of those.
The only time people get in trouble with CSV is when they skip using those tools, hack something together, and then get it wrong.
> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them
There's no need for new stuff. It would be redundant as there are several things already that do these things. Adding more isn't helpful. The problem is most of the stuff that supports CSV tends to support none of those things and fixing a lot of ancient systems to retrofit them with e.g. parquet support or whatever is a mission impossible. CSVs principle feature is that it is simply everywhere. That's hard to replicate. People have been trying for decades.
> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata since most of the times there's a single reasonable way to join tables.
Parquet fits the bill here. It's not perfect (there is no perfect file format), but it's a practical compromise as of today, at least for new systems where a columnar format is appropriate. There are some columnar formats that are better in some aspects (like ORC and some proprietary formats) but they're not as widely supported.
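For instance, a minimal round-trip with pandas (which uses pyarrow or fastparquet under the hood; the file name and values are illustrative). Unlike CSV, the column types and metadata travel with the file:

    import pandas as pd

    df = pd.DataFrame({
        "state": ["US-CO", "US-CA"],
        "population": [5_800_000, 39_000_000],                   # stays an integer column
        "updated": pd.to_datetime(["2024-01-01", "2024-01-02"]), # stays a datetime column
    })

    df.to_parquet("states.parquet")       # types, compression and column stats included
    back = pd.read_parquet("states.parquet")
    print(back.dtypes)                    # int64 / datetime64 preserved, no re-parsing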
It's not that CSV/TSV is bad in every situation, but more that CSV/TSV is overused for things it shouldn't be used for. (CSV is good as a tabular format for simple applications, very bad as the storage format for data lakes or anything you want to query, questionable as a data exchange format, and okay as a semi-structured format for structurally simple data -- many open data platforms offer it as a download format and it generally works.)
To get a sense of how much variation a CSV reader needs to handle, just look at the number of arguments Pandas' read_csv takes. And it still fails on some CSVs! (I've had to preprocess CSVs before pd.read_csv would work.)
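A few of those arguments, each of which exists only to paper over some real-world CSV variant (the file name and column values here are illustrative):

    import pandas as pd

    df = pd.read_csv(
        "export.csv",
        sep=";",                # European exports often use ';' because ',' is the decimal mark
        decimal=",",            # ...and ',' inside numbers
        quotechar='"',
        escapechar="\\",        # some producers backslash-escape instead of doubling quotes
        encoding="cp1252",      # legacy Windows exports
        dtype={"customer_id": str},       # keep leading zeros instead of parsing as int
        na_values=["", "NULL", "N/A"],
        skiprows=2,             # junk preamble lines before the real header
        on_bad_lines="skip",    # rows with the wrong number of fields
    )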
CSV is not king, but it is popular. But popularity doesn't mean it's good for every use case. Optimizing for human readability and easy generation means trading off other very important characteristics (type safety, legibility across different tooling, random access performance, reliability/consistency).
You can't do anything about legacy systems, but when designing a new system, you should really ask yourself: is CSV really the right choice?
(With DuckDB, the answer for me is increasingly no)
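For example, DuckDB will sniff a messy CSV and move the data into a typed format in a couple of lines (the paths and the column name are made up):

    import duckdb

    # Sniffs delimiter, quoting, header and column types, then lets you query with SQL.
    duckdb.sql("SELECT state, count(*) FROM read_csv_auto('export.csv') GROUP BY state").show()

    # One-off conversion to a typed, compressed, columnar file for everything downstream.
    duckdb.execute("COPY (SELECT * FROM read_csv_auto('export.csv')) TO 'states.parquet' (FORMAT PARQUET)")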
Yeah, I used it about 7-8 years ago. I liked the idea of chaining things, but it's very clear that csv has not been holding up well in the past decades.
Also, what I have in mind for file sharding probably needs a standard on top of a record/column file format. The successor to CSV should be easy to process in parallel.
Exchanging information between different data formats is one of the biggest problems I've experienced in computing and IT and it's been thus from the earliest days.
Having so many formats is confusing, inefficient and leads to data loss. This article is right, CSV is king simply because it's essentially the lowest common denominator and I, like most of us, use it for that reason—at least that's so for data that can be stored in database type formats.
But take other data such as images, sound and AVI, and even text. There are dozens of sound, image and other formats. It's all a first-class mess.
For example, we fall back to the antiquated, horrible JPG format because we can't agree on better ones such as, say, JPEG 2000; there are always excuses why we can't, such as speed, data size, inefficient algorithms, etc.
Take word processing for instance: why is it so hard to convert Microsoft's confounded, nasty DOC format to, say, the OpenDocument ODT format without errors? It's almost impossible to get the layout in one format converted accurately into another. Similarly, information is lost converting from lossless TIF to, say, JPG, or from WAV to MP3, etc. What's worse is that so few seem to care about such things.
Every time a conversion is done between lossless formats and lossy ones entropy increases. That's not to say that shouldn't happen it's just that in isolation one has little or no idea about the quality of the original material. Even with ever increasing speeds, more and more storage space so many still have an obsession—in fact a fetish—of compressing data into smaller and smaller sizes using lossy formats with little regard for what's actually lost.
It's not only in sound and image formats where data integrity suffers over convenience, take the case of converting data fields from one format to another. How often has one experienced the situation where a field is truncated during conversion—where say 128 characters suddenly becomes 64 or so after conversion and there's no indication from the converter that data has actually been truncated? Many times I'd suggest.
Another instance, is where fields in the original data don't exist in the converted format. For example, data is often lost from one's phone contacts when converted from an old phone to a new one because the new phone doesn't accommodate all the fields of the old one.
Programmers really have a damn hide for not only allowing this to occur but for not even warning the poor hapless user that some of his/her data has been lost.
That programmers have so little regard and consideration for data integrity is, I reckon, a terrible situation and a blight on the whole IT industry.
Why doesn't computer science take these issues more seriously?
>Why doesn't computer science take these issues more seriously?
Simple: cost. A company is not going to approve any project to move to a new standard. Plus you have new hires coming in with their favorite "standard of the day" and using it no matter what they are told.
Management only cares about the end result (i.e. the bottom line), not how it got there.
That lack of consideration for users' data will ultimately lead to regulation. Much of a user's data is only machine-readable, so ordinary users shouldn't be expected to know when their data is truncated after say data conversion. They aren't responsible for realizing their data is corrupted long after the event and past the point where it can be corrected.
It's like everything else, originally there's the Wild West days when everything's a free-for-all, but regulations eventually kick in after the harm done is considered unacceptable. We've seen regulations introduced everywhere else, from foods—pure food acts, pharmaceutical—FDA, transport—NTSB, Water purity standards and so on. So eventually computing/IT will be no exception.
Unfortunately, computing/IT is still in the 'Wild West' days. Personally, I can hardly wait for those enforced regulations to become effective.