
> The real short of it is that it's a character that existed in another character set, and its role in that character set appears to be limited to "minor printing mark,"

Not really. The printing mark itself wasn’t a “character” in any character set until Unicode. It was probably a custom SVG-like shape that could be plopped into the margin of the document in desktop-publishing programs, much like drawing any other arbitrary vector shape (e.g. a 13-gon). Specialized desktop-publishing programs (e.g. musical-score “engraving” programs) ship with all sorts of these custom vector-shape symbols for placement onto the document; but, until very recently†, these weren’t “text” in any sense, just shapes.

† https://blog.dorico.com/2013/05/introducing-bravura-music-fo...

What was a character in a character set (ISO 5426-2) was a sort of “reference to said printing mark” or “exemplar of said printing mark”, used once in a standards document to give examples of how printing marks are used. Note the difference: the printing mark itself is an arbitrarily-complex and varied shape that different desktop-publishing programs might render in all sorts of ways. The “reference to the printing mark” is one mark that appears in the standards document, one way. (Picture the difference between an actual calligraphic flourish separating passages of text, and a Unicode fleuron/heart-bullet. One is not the proper encoding of the other! The former is a shape, not a character per se. The latter is a character: it’s a standard shape-thing with a standard meaning, that communicates meaning, and is considered “part of” the text.)

Unicode’s mission statement is essentially to offer an encoding for the set of all characters needed to losslessly re-encode all document-corpuses (corpii?) of historical importance, such that those corpuses may be re-encoded into Unicode for archival purposes, with the Unicode standard then acting as the record-keeper of the semantic meaning of the characters used in those documents. Without this work, over time we’d lose the semantic meanings of those encoded characters, as the original living systems that implemented the encoding began to rot/disappear. We’d just be left with opaque bytes in the middle of a document, whose meaning we’d have to guess at.

The printing mark, as used for its intended purpose, doesn’t meet Unicode’s mission statement, because nobody’s encoding it into a (preserved) document, only “drawing” it on page-edges before trimming them off. It’s not “text” per se. It doesn’t appear in any electronic stream-of-text in need of archival preservation. It exists only in the same way an image embedded in a Word document exists: as self-contained data interpretable by the document-publishing system, an “add-on” on top of the text, but without the meaning of the text depending on the image. The document-publishing system “brings together” a text source and other data to create a final “layout” that is more than just text per se. The text source, not the final layout, is what Unicode is concerned with.

But the “exemplar of the printing mark” ended up in a standard about document binding. And that glyph is “text”, in the same sense that an emoji is text, i.e. it’s part of the corpus that Unicode seeks to preserve, in order to preserve the meaning of the text surrounding it. So Unicode must treat the mark there as a character.

(Yes, this means that if a text of historical importance has some odd squiggle in the middle of it — and it then talks about the squiggle — that squiggle will inevitably become part of Unicode. That doesn’t imply that a visual for the glyph will be drawn by every font artist, though! Just that Unicode will retain a codepoint defining it in the standard, to document the meaning of the glyph as it appears in that one archived document. It’ll likely appear as a replacement-box character, but people can go look up in the standard what that particular replacement-box character means, and that’s all a historian needs.)
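To make the "go look it up" step concrete, here is a minimal sketch using Python's standard `unicodedata` module. It assumes the sideways-Q under discussion is U+213A (this thread never names the code point, so treat that as my assumption), and `describe` is just an illustrative helper name.

```python
import unicodedata

# Assumption: the sideways-Q discussed here is U+213A.
SIDEWAYS_Q = "\u213A"

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (category)' for a single character."""
    # Fall back to a placeholder if the character database has no name for it.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name} ({unicodedata.category(ch)})"

print(describe(SIDEWAYS_Q))
```

Even on a system where no font draws the glyph, the name and category come from the character database rather than from any font, which is the point: the meaning survives even when the rendering doesn't.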




> It likely wasn’t a “character” in any character set until Unicode.

The article expressly states:

> The proximate cause for the encoding of this character was the need to provide roundtrip mapping for encoded characters in ISO 5426-2:1996

What is ISO 5426-2, if not a character set?
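For anyone unsure what "roundtrip mapping" means in that quote, here is a rough sketch. The byte values below are made up for illustration and are not the real ISO 5426-2 assignments; the table and function names are hypothetical too. The idea is only that every legacy code needs its own Unicode code point, so that legacy -> Unicode -> legacy gives back the original bytes.

```python
# Hypothetical single-byte legacy table. The byte values are invented for
# illustration and are NOT taken from ISO 5426-2.
LEGACY_TO_UNICODE = {
    0x41: "A",
    0x42: "B",
    0x7B: "\u213A",  # assumed code point for the sideways-Q
}

# A roundtrip mapping requires the table to be invertible: one code point
# per legacy code, with no two legacy codes sharing a code point.
UNICODE_TO_LEGACY = {ch: b for b, ch in LEGACY_TO_UNICODE.items()}

def decode_legacy(data: bytes) -> str:
    """Map each legacy byte to its Unicode character."""
    return "".join(LEGACY_TO_UNICODE[b] for b in data)

def encode_legacy(text: str) -> bytes:
    """Map each Unicode character back to its legacy byte."""
    return bytes(UNICODE_TO_LEGACY[ch] for ch in text)

def roundtrips(data: bytes) -> bool:
    """True if decoding and re-encoding reproduces the original bytes."""
    return encode_legacy(decode_legacy(data)) == data

print(roundtrips(bytes([0x41, 0x7B, 0x42])))
```

If the legacy set contains a character with no Unicode equivalent, the table has a hole and the roundtrip fails, which is presumably why the character had to be encoded.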


You misunderstand. ISO 5426-2 is a standard specifying the encoding previously used for the text of the standard; it isn’t the encoding used by the desktop-publishing programs themselves that said standard refers to/affects. (After all, the standard is only tangentially “about” desktop-publishing.)

In the context of the standards document and its accompanying encoding, yes, sideways-Q — the exemplar of the mark — is already a defined character; and so Unicode could just adopt it directly from that source character-set.

But my point was twofold:

1. In the source desktop-publishing files that the standard documents, sideways-Q isn’t a character. It’s an embeddable shape. This shape is the ‘reality’ to which the character in the standard is a reference, like the difference between a real floppy disk and the floppy-disk emoji. Making the shape out to be a character would be like encoding a game of hangman drawn on a page as a standard “hangman glyph.” Not the same! Loses meaning!

2. Even if the standards document didn’t have its own character set where sideways-Q was defined as a character, but instead just took a particular desktop-publishing program’s sideways-Q shape and plopped it inline into the text, the fact that the text refers to the mark, requiring you to be able to see the rendering of the glyph to know what it is talking about, means that the “sideways-Q exemplar” would inevitably become a Unicode codepoint, in order to encode this document successfully. (In fact, if the standards document were entirely analogue, e.g. typewritten, and the sideways-Q were drawn onto the page after the fact, then it’d still make it into Unicode in the process of digitizing the work.)


Okay, so I think we agree that ISO 5426-2 is a character set... are you saying that it's a character set invented solely to encode its own standard document, or some predecessor standard? Or are you possibly saying that the character was only included in Unicode to facilitate representing the ISO 5426-2 standard document as Unicode?

Do you have some particular knowledge/reason for saying so? Other than your comments I can't find any evidence to this effect... maybe I'm simply misunderstanding you. I don't see anything that leads me to believe this was the rationale behind either the creation of the original standard or its inclusion in Unicode.

The sideways Q specifically, like most of this particular set, seems to originate from a British Library internal character set, was included in the earlier 5426-2 set, and went from there into Unicode, with the primary interest throughout focusing on bibliographic records, i.e., MARC.

The bit about the difference between the character and the shape/glyph is fine as far as it goes (and does matter here as these older standards didn't really follow that separation) but is similarly true for all of Unicode.


Ah, so what you're saying is that sideways-Q is basically the same thing as ¤: a placeholder.


“only “drawing” it on page-edges before trimming them off”

https://collation.folger.edu/2012/08/deciphering-signature-m... has examples where they are on the same line as the last line of the text, so these marks weren’t always trimmed off.


> document-corpuses (corpii?)

“Corpora”, if you want. (cf. “corporeal”)


I believe the reason corpus changes the root when you decline it is that Latin went through a phase where /s/ between vowels became /r/ (rhotacism).

Eg: corpus -> * corpusem -> corporem

Similar pattern with many other Latin words we think of as ending in -r but the nominative form actually has an -s, eg. colos -> * colosem -> colorem


colos is an archaic form; standard Latin has color.


Ok. But the thing I am going for is the pattern where we think of descendants of the word having /r/. colos does fit the pattern. As would flos.



