Hi mckoss, awesome to hear one of the main Excel architects is here on HN!
Any interesting stories to share? I think we'd all be interested.
Mind if I ask a question directly related to one of my comments a few days ago? I was lamenting the fact that the ASCII codes 29, 30, 31 (Group, Record, and File separators) never really became widely implemented, as these were specifically designed to delimit data. Ie, one could easily include commas, new lines/carriage returns, etc in data cells without clashing. But instead CSV seems to be the most common standard for tabular data.
Were these ASCII codes ever considered for tabular files?
You couldn't conveniently type those codes on a keyboard and there was no standard way to display them. You'd need a special editor for such files, eliminating the main advantage of an interchangeable text based file format.
Part of your response is accurate as to 'why' dedicated separators never became mainstream - software compatibility - but frankly 99% of csv files were/are never meant to be hand edited, simply used as a neutral file. For the later usage, dedicated separators that do not require escaping or import settings would be major improvement.
For files not intended to be hand-edited, it doesn't really matter what delimiters you use. It's not that difficult to code up char-escaping for csv.
At Yahoo, the typical delimiters for logs and whatnot were ctrl+a, ctrl+b, etc. It was slightly nicer than CSV, but only slightly. It was mostly nicer when manually inspecting files with columns that had embedded commas (that otherwise would have been escaped). The machines don't care, and for any interesting processing you'd often end up with escaping anyway.
But csv files can also be manipulated in shell scripts, with the text being piped from one util to the next. Here it becomes important that the separators can be correctly typed and displayed.
The standard way to type them is as control-\, control-], control-^, and control-_, which has been true since the introduction of control keys in the 1960s. It's true that there isn't a standard way to display them across platforms; some use ␜, ␝,␞, and ␟, while others use ^\, ^], ^^, and ^_ (sometimes in inverse video), and maybe the most common display would be the CP437 glyphs, which render them as ∟, ↔, ▲, and ▼. In the Unicode standard, they are displayed as dotted boxes around the strings "FS", "GS", "RS", and "US".
Emacs can edit files in this format without any extra work (it displays them as ^\, ^_, etc., with a different text color so that you can easily distinguish them from character sequences like "^" followed by "\") but maybe you mean to say that Emacs by itself doesn't understand the hierarchical structure of such a file.
This is easily fixed. You can get Emacs forms-mode for a file with these delimiters as follows:
(setq forms-field-sep "\036")
(setq forms-multi-line "\037")
(setq forms-read-file-filter 'forms-replace-gs-with-newlines)
(setq forms-write-file-filter 'forms-replace-newlines-with-gs)
(setq forms-file "fsgsusrs.data")
(setq forms-number-of-fields (forms-enumerate '(name aliases wikipedia employer notes)))
(setq forms-format-list
(list
"Project for a New American Century conspirators\n\n"
"\n Name: " name
"\n Aliases: " aliases
"\n Employer: " employer
"\nWikipedia URL: " wikipedia
"\n\nNotes:\n\n" notes))
(defun forms-replace-newlines-with-gs ()
(goto-char (point-min))
(while (search-forward "\n" nil t)
(replace-match "\035" nil t)))
(defun forms-replace-gs-with-newlines ()
(goto-char (point-min))
(while (search-forward "\035" nil t)
(replace-match "\n" nil t)))
It took me 35 minutes to write this "special editor for such files"; thus your argument is invalid.
You might further argue that this would create incompatibilities between different systems and so of course everyone would just use the same data file format. Even today, this seems implausible — JSON, various dialects of CSV (with tabs, commas, doublequote-delimited commas, pipes, and colons being the most common delimiters), SQL dumps, and HTML are all in common usage — and in the context of the 1960s and 1970s it seems even less founded. Remember that there were at least five widely used conventions for how to separate lines in ASCII text files up to the 1980s: \r\n (from teletypes), fixed-width records of 80 bytes (from punched cards), \n (from Unix), \r (from PARC, used in Smalltalk, Oberon, and the Macintosh), and \xfe (Pick, see below). And the PDP-10 used a six-bit variant of ASCII they called SIXBIT, the PDP-11 used ASCII, IBM used EBCDIC, and UNIVAC used FIELDATA.
This is a time when even computers from the same manufacturer couldn't agree on how many bits were in a character, much less how to delimit fields in data files. Thus even my attempt to save your argument is invalid.
Interestingly, there was a popular system that worked this way, with non-printable delimiters to divide up different levels of a hierarchical data structure represented as a string: Pick. But Pick didn't use FS, GS, RS, and US in storage either, and although it did use them in the user interface, it used them backwards. Pick's "items" (usually used like records in a database, but accessed like files in a directory) were divided into "attributes", corresponding to database fields, by the "attribute mark", byte 254, displayed as "^" (or often as a line break) and entered as control-^ (RS); the attributes were divided into "values" by the "value mark", byte 253, displayed as "]" and entered as control-] (GS); and values could be divided into "sub-values" by the "sub-value mark", byte 252, displayed as "\" and entered as control-\ (FS).
Pick also reserved byte 255 to mark the ends of items, like NUL (\0) in C, or ^Z (\x1A) in CP/M or early MS-DOS. It called it the "segment mark", displayed as "_", and it was entered with as control-_ (US).
Any interesting stories to share? I think we'd all be interested.
Mind if I ask a question directly related to one of my comments a few days ago? I was lamenting the fact that the ASCII codes 29, 30, 31 (Group, Record, and File separators) never really became widely implemented, as these were specifically designed to delimit data. Ie, one could easily include commas, new lines/carriage returns, etc in data cells without clashing. But instead CSV seems to be the most common standard for tabular data.
Were these ASCII codes ever considered for tabular files?
Thanks.