Show HN: Base32H, a human-friendly duotrigesimal number system (base32h.github.io)
87 points by yellowapple on Sept 7, 2020 | 73 comments



Huh, interesting. For stuff like this (human-friendly numbers/ids/codes) I tend to use Google's base-20 Open Location Code alphabet[0] that I read an explanation of a long time ago on their website[1] :P. Whereas this project aliases some similar-looking letters together, the OLC alphabet avoids similar-looking letters altogether and tries to avoid spelling any words at all. Keep in mind that "oh, letters are aliases for each other" doesn't necessarily help now that "passwords are case sensitive" is also drilled into users' heads, and treating I/l as interchangeable is even rarer than case-insensitivity, so users might unnecessarily memorize the distinction or make a fuss about it (speaking from experience helping my grandmother with technology).

Anyway, usually I just copy/paste a piece of code like below to use the OLC alphabet and randomly generate an id, and you could just copy the literal to use as a radix alphabet as well.

    "".join(random.choices("23456789CFGHJMPQRVWX", k=5))
    # check against id in dict
    if userinput.toupper() in idmapping: ...
[0]: https://github.com/google/open-location-code/blob/master/js/...

[1]: https://github.com/google/open-location-code/blob/master/doc..., under "Easy to use"


I don't get the use case. We already have Base32. We already have human-readable Base32 (Crockford's flavor). And Crockford makes the better tradeoff by addressing I/1/l ambiguity instead of U/V. We don't need a new hex format either. I don't get why this exists.


> And Crockford makes the better tradeoff by addressing I/1/l ambiguity instead of U/V.

Exactly!

It is much easier to mistake "l" for "1" (lowercase L for number 1) than "5" for "S" or "U" for "V".


Crockford doesn't have U in its alphabet, so it doesn't have the U/V problem either. It just doesn't have aliasing for U, because U is used for check digits.


At least they document the tradeoffs with comparisons to other encodings.

https://base32h.github.io/comparisons


So, according to the document, the only difference is that Base32H solves the ambiguity between S and 5 but allows ambiguity between l and 1.

I think that's the worse tradeoff to be made, let alone a reason to create a whole new encoding scheme.


> allows ambiguity between l and 1

It doesn't really. The canonical form is L (uppercase). In practice lowercase l will never appear, unless for some reason you want it to (you could convert L05ER to l05ER just for presentation; there are worse leetspeak offensive terms that would still be allowed, though).

Also the U/V thing neatly gets rid of FUCK. S/5 and I/1 take care of others.


I wouldn't want ALLUPPERCASE identifiers in my URLs, for instance. I presume that would be one of the most common scenarios it would be used in.


Well, it will still work but you'll lose the benefit of it.

Most of these don't look like SHOUTING to me. They look uppercase but not aggressive.


You don't want upper-case in URLs for social reasons. It simply looks out of place, takes more screen real-estate, and stands out unnecessarily.

I'm just saying: if the canonical form is good enough, then the "S/5" fix alone isn't sufficient to justify a whole new standard. And if it isn't good enough, then the value this one promises over Crockford is arguable, due to the lack of I/l/1 aliasing.


I downcase all IDs; sometimes this includes UUIDs and string IDs.


This is suggesting that you don't.

It shouldn't be that hard to start using it in the canonical form - just start treating IDs as if they're case sensitive.


> I think that's the worse tradeoff to be made

That's definitely a fair conclusion. If you're using lowercase identifiers, then I agree wholeheartedly, and you'd be better served by Crockford's (or perhaps some Crockford/Base32H hybrid that aliases L/l to 1 and de-aliases S/s from 5).

What motivated me to make that tradeoff was two-fold:

1) For my use-cases, identifiers were/are always uppercase by default (and indeed, the Base32H specification requires encoders to emit uppercase-only), which eliminates ambiguity between L and I/1

2) Smudges and grime and scratches and such can and frequently do make it difficult to distinguish S from 5 or V from U; it was therefore a pragmatic choice to alias those (while preserving Crockford's partial profanity defense as a nice side-effect)


That's fair, and what you say makes sense in that context. Crockford doesn't have the U/V problem either, by the way, because it doesn't have U in its alphabet. It's only missing aliasing for it, so there is no possibility of passing on a wrong value by confusing V with U.

That just leaves 5 and S. A whole new standard just for that might be acceptable if the use case you mentioned (mostly upper-case) gets significantly more benefit from it.


There's also "Bech32" from the Bitcoin world: https://en.bitcoin.it/wiki/BIP_0173#Bech32

The cool thing about using a standard Base32 representation is there's so many to choose from!


something something xkcd 927 something something

(Though in my defense I'm under no illusion that Base32H satisfies everyone's use cases)


Or that Base32H satisfies anyone's use cases...


Haha, fair enough.


I threw 0x2059B7DEDB800C03 (2331096449934167043) in there as a Postgres int8 I had lying around and I get the message: "The number you're encoding is bigger than what Javascript can accurately represent, so the below value is probably incorrect."

However, it is (now) possible to represent this in JavaScript as a BigInt[0]

[0] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


Yeah, I wanted to keep the initial pass at a JS implementation as universal and dependency-free as possible, which precluded the use of BigInt for this attempt (a surprising number of JS runtimes out there don't ship with it, including both any version of Internet Explorer and the ancient version of Node.js that's the default on my system). PRs are certainly welcome if someone wants to have a go at hacking it in (on the condition that it's able to fall back to a non-BigInt approach should it be unavailable).


The V/v/U/u alias should be split into V/v and U/u. The letter l should be used as an alias for 1.

I totally see the historical and Soundex-style analogy between V/v and U/u, but it seems to me that the visual similarity of 1, L, l, and i takes precedence.


I considered that, but wanted to preserve Crockford's partial defense against accidental profanity generation while still allowing U/u to be decoded (if Crockford's aliased it I probably wouldn't have bothered to whip up Base32H, lol).

As for 1/l/I, yeah, that's definitely the main flaw. The workaround (as described in the FAQ) would be to always emit uppercase and take advantage of L being pretty visually distinct. A bit ugly for URLs, but for things like asset tags, product keys, and item codes / SKUs (the main things I had in mind) that's already the norm.


> I considered that, but wanted to preserve Crockford's partial defense against accidental profanity...

I don't understand the point of this "partial" defense against profanity.

In the same way you can write "FVCK" in Base32H, you can write "F0DA" in hexadecimal (which means the same thing in Portuguese).

So, people will still be offended by random numbers.


A language is a tool. Is special-casing the U u to prevent a single case of profanity worth it?

People will come up with many ways to generate profane language. See guids/uuids for example, or l33tsp34k. A major change to hypothetically prevent a single case seems unbalanced.


> A language is a tool. Is special-casing the U u to prevent a single case of profanity worth it?

It was important enough that Douglas Crockford entirely removed the letter U from his system except as one of several choices for a check digit. I opted for the same tradeoff for Base32H; if anything, keeping U/u as a separate digit would be the major change. V and U look close enough together (more so than I and L, especially in real-world conditions where legibility is poor) that it didn't seem unreasonable for the latter to be an alias of the former.


As noted above, this is to take care of unintended F* appearing in the code.


It gets rid of a lot of swear words. I think I might prefer this to NanoID for showing to the user. I really like the choice of uppercase for making it visually boring. Slack does the same thing for its ids, in that it's a single capital letter for the type followed by a sequence of decimal digits (which could be mistaken for an integer).

For the example page as well as in the library, I suggest using big integers. That way, if you use NanoID or BSON (MongoDB) IDs or even UUIDs behind the scenes, you could always handle the default-size ones.

It seems a lot of commenters are missing that it doesn't work as well in lowercase. Ever notice that the old product key codes on the boxes of shrinkwrapped software were in uppercase?


I created an Elixir library for this today: https://github.com/jechol/base32h


Nice! Beat me to the punch :) I'll try to get that added to the implementation list as soon as I've got a sec.


Huh, this blew up unexpectedly (posted this the other day and it seemed to fly under the radar, so I figured that was that; imagine my surprise when I suddenly see it's got a bunch of attention and an inexplicably-more-recent timestamp). (EDIT: apparently the powers-that-be put it in the "second chance pool"; thanks!)

Thanks for all the feedback, y'all! I'll try to keep answering questions as they come up.


Why is displaying 32 bits, rather than 16, per rune actually a useful idea? Why even bother, when MIME and Usenet encodings like yEnc are denser, and hex aligns so well?

They toss out a 40 bit number as an example, but who even uses that? What even is it, a truncated hash? A 32 or 64 bit truncation is just as valid, and using a range of 1, 16, 256, 4096, or 65536 buckets (zero to four letters) seems to be just as valid.

EDIT (add this paragraph): Use cases for labeling involve 4 and 8 characters of 32-bit runes, but that's 4 vs 5 and 8 vs 10 hex characters. The hex characters are practically self-documenting, and as I noted in another part of the criticism an hour ago, the folding / skipping of some characters could cause someone to wonder where missing elements of a set are. No array or lookup table or spec definition is required for hex, which would be better in this use case than any 32-bit rune system.

Additionally, Hex natively extends decimal in a natural way that doesn't skip any letters. The skipped letters actively make me worry that in a list of folders someone might believe there are missing locations.

Our computational systems are currently based on powers of 2, because we use binary logic and binary addresses. This is so powerful that the base unit for text standardized on a friendly power of 2 within the useful range, 8 bits, instead of the other sizes that were popular a few decades ago.

Octal encoding represents only 3 bits per character; this proved inconvenient enough that hex encoding became popular, representing 4 bits per display character to align with power-of-2 boundaries (as well as expanding the encoding of a single octet/byte, the de facto minimum unit of addressing in popular architectures, to exactly twice its normal storage/display footprint).


> Why is displaying 32 bits, rather than 16, per rune actually a useful idea?

To be clear: base-32 systems (including Base32H) use 5 bits per rune, which means...

> A 32 or 64 bit truncation is just as valid, and using a range of 1, 16, 256, 4096, or 65536 buckets (zero to four letters) seems to be just as valid.

...base-32 systems would naturally produce 1, 32, 1024, 32768, or 1048576 buckets for that same range of letter counts. Base-64 systems would produce even better numbers of buckets within that range of letter counts, but (as outlined in the "Comparisons" page) it comes at the expense of case-sensitivity.


I was just updating my earlier reply after I thought to look at their use cases page.

I should write a better criticism about that.

Usernames: No, store them as given, just like real names. Also, use UTF-8. Consult the visual rendering engine for decomposing the value to display units, or a rendered result, etc.

Asset tags: Just use hex, or numbers. Mortal people looking at either of those systems understand them quickly enough without further documentation. It's only 4 vs 5 letters, worth the lack of a binder describing why you're not missing crates.

Cheat codes: This is actually fine, I don't care, it isn't important. Though hex or anything else is just as fine.

Cryptography: Fewer possible artifacts from fewer letters. 0123456789ABCDEF typically doesn't contain any characters that are hard to read or easy to confuse for dyslexic or otherwise non-perfect readers. If density matters (again, 4 vs 5), no printed media is going to work out anyway.

Geographic coordinates / addresses: Everyone's using base 10 here, either as big floats or as sets of numbers that are friendly to read. Those articles that have average consumers open an app and read 3 words back from it are making up for the missing safety feature of a well-formatted set of SMS messages, carrying the handset's current GPS data, being sent alongside the 911 call.


I have only one complaint about this Base32 encoding choice, and it stems from the fact that I prefer to encode Base32 using lower case letters, instead of the choice made here to make upper case canonical. When using lower case, the main source of possible confusion is that it can be difficult to tell l and 1 apart, as in l1l1l1l... and this scheme uses both l (canonically "L") and 1.


Crockford's Base32 already does that. I/l/1 are all synonyms.


Hmm, other base32 systems avoid that by not including I and L (and O), and some other specs I've read (ULID comes to mind) say to produce UPPER output but accept either case as input.

And, like this spec, the values are aliases, so 0/o are the same, 1/I/l are the same, etc.

https://github.com/ulid/spec


Yes. I'm surprised the author would be more concerned about confusion between U/V (or u/v) than between 1/l ... the former has always seemed relatively far-fetched to me, whereas depending on the font, the latter can be a real problem. Again, I attribute the issue to the choice of upper case as canonical, because L is not easily confused with any other letter or number.


> Again, I attribute the issue to the choice of upper case as canonical, because L is not easily confused with any other letter or number.

That is indeed why I settled for that tradeoff, yep; as long as it's all VPPERCA5E it'll be distinct enough.


Why are l and I not aliased? They are easily confused in sans-serif fonts.


26 letters + 10 digits - 4 letters lost to aliasing (o, i, s, u) = 32 symbols. Making L an alias would leave only enough for base 31, unless they dropped one of the other aliases.

Personally, I'd be OK with that. I think U is much less likely to be confused for V, at least in anything not handwritten, than lower-case L is likely to be confused for 1.


Encoders are required by the spec to emit uppercase only for this reason, since L is plenty distinct from 1 (whereas I and 1 can still get mixed up, and therefore the former is an alias for the latter). A bit unfair to uppercase L that it has to be so neglected just because of its lowercase counterpart :)

The original Tcl implementation treated uppercase L and lowercase l separately (i.e. uppercase was standalone and lowercase aliased to 1 like I/i do), but in the process of writing up that documentation site I quickly learned that it ended up just confusing me more than it helped, so I just kept lowercase and uppercase L together and distinct and called it a day, figuring that discouraging lowercase output (while still allowing lowercase input) was a good enough workaround for the lI1 issue.
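
To make that concrete, here's a rough Python sketch of how the digit values and aliases shake out (an illustration only, not the reference implementation):

    # Illustrative sketch of the Base32H digit/alias tables (not the reference code).
    ALPHABET = "0123456789ABCDEFGHJKLMNPQRTVWXYZ"  # canonical digits; encoders emit these only
    DECODE = {c: v for v, c in enumerate(ALPHABET)}
    DECODE.update({c.lower(): v for v, c in enumerate(ALPHABET) if c.isalpha()})
    DECODE.update({"O": 0, "o": 0, "I": 1, "i": 1, "S": 5, "s": 5,
                   "U": DECODE["V"], "u": DECODE["V"]})

    def encode(n):
        if n == 0:
            return "0"
        digits = []
        while n:
            n, r = divmod(n, 32)
            digits.append(ALPHABET[r])
        return "".join(reversed(digits))

    def decode(s):
        n = 0
        for c in s:
            n = n * 32 + DECODE[c]
        return n

So decode("l05er") == decode("L05ER"), while encode() only ever produces the canonical uppercase digits.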


I like this, though I agree with others that the minimally-confused U & V would be better traded for the oft-confused 1, I & l.

One slight additional issue not mentioned so far: what about the case of needing to encode one of the many "NSFW numbers", such as the trigger-warning (!) decimal 739787225?


Yeah, that's a hard-to-address problem with any number system like this (for example, most other base-32 systems have this problem, as do Base64 and pretty much everything else using the full alphanumeric range). Aliasing U and V is an attempt to do the same for a different case (addressing e.g. decimal numbers 519571 and 421594).

If I didn't have as strong of a want for a power-of-2 radix I'd have aliased G/g to 6 (and re-aliased L/l to 1) to further address this. Maybe even aliased B/b to 8, since sometimes that's a source of readability issues (and it further mitigates 11594129).


Why do you consider this particular number NSFW?


Bitcoin addresses use base58 I believe, which is like base64 but avoids 0, O, I, l, +, /. It arguably serves the human-friendly requirement well while being more compact than Base32H. It is, however, not friendly to byte-aligning use cases.


Base58 isn't suitable for general purpose encoding though as its performance degrades exponentially based on the length of the input. Base32 doesn't have that problem.


> Base58 isn't suitable for general purpose encoding though as its performance degrades exponentially based on the length of the input.

Only with an absurdly naive implementation.

Base conversion with arbitrary (and even mixed) base can be done in entirely linear time.

Base58 doesn't work that great for a wide collection of uses because the mixed case makes it extremely slow for e.g. spoken use.

More modern Bitcoin addresses use Bech32 (https://github.com/bitcoin/bips/blob/master/bip-0173.mediawi...), a base-32 scheme whose characters and permutation were selected via machine optimization against a confusion matrix, in order to minimize the number of bit errors made by human transcription. It incorporates an error-correcting scheme that adds 6 extra characters and guarantees detecting up to 4 substitutions or transpositions (or 5 substitutions if they're all Hamming-distance-1 errors, such as q vs g, z vs 2, u vs v, etc.).


> Base conversion with arbitrary (and even mixed) base can be done in entirely linear time.

Are you saying Bitcoin uses an O(N^2) implementation because they are naive?

You need to propagate carry to the rest of the digits. I wonder how it’s possible for Base58 to be implemented in linear time. Do you have any sources I can read?


Bitcoin doesn't use base-58 in any performance-relevant way; it's only used for user-interfacey sorts of stuff.

Probably the most straightforward way to make fast base conversion is the way that most range coders in compression do it (they're effectively just fancy mixed-radix base converters): first store temporary digits of a larger size so that there won't be any inter-digit carries, then make a second pass to propagate the carries and generate the output.

IIRC the approach in Bitcoin Core is related, but it processes the carries eagerly rather than deferring them (which makes the worst case asymptotically quadratic, though the average case is fast).


Note, though, that content of a length sufficient to result in that performance degradation would not be "human-friendly" (and thus not a viable use case for Base32H as it's specified here), no matter whether a base-32 or base-58 or whatever-base encoding scheme is used.

(Technically, a book can be considered 'human friendly', but that's not the definition OP is using for that phrase, either.)


To be clear, long Base32H numbers are definitely in scope (which is why the specification "recommends" a binary encoding scheme for this exact purpose, and why things like public keys/signatures are mentioned among the proposed use cases). Yeah, it'd suck for a human to have to manually record or dictate large-ish (more than 1kb / 1600 digits) sets of data, but it sucks at least a little bit less when you already know that everything can be assumed to be uppercase.


> Base58 isn't suitable for general purpose encoding though as its performance degrades exponentially based on the length of the input.

This comment only applies to a naive scheme. There are various ways to avoid the performance degradation with minimal overhead; a frequent one is grouping characters so that each group represents (only slightly more than) an integral number of bits.


Bitcoin uses the naive scheme, why?


Probably because:

1. There is no small power of 58 that is only slightly more than an integral power of 256. The closest one I can imagine is 58^11 ~= 1.354 * 2^64.

2. Bitcoin addresses are short enough that trivial bignum calculation is not a bottleneck.


Exponential? I would assume it would be more like O(n^3) or less, where n is the number of bits of the input.


Yes, polynomial would be more accurate, and I think it's O(N^2).


You can also use it in a practically linear way, i.e. O(N*M^2) if you break your input into chunks of fixed size M and accept a very slight overhead in wasted bits.
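
For example (a sketch of one such scheme, not Bitcoin's actual encoding): split the input into fixed 8-byte chunks and give each one a fixed-width 11-character Base58 group, since 58^11 just exceeds 2^64. The work is then linear in the number of chunks, at the cost of a fraction of a wasted bit per chunk.

    # Sketch of fixed-chunk Base58 (illustrative only; not Base58Check).
    B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def encode_chunked(data: bytes, chunk=8, width=11):
        # 58**11 > 2**64, so 11 characters always fit an 8-byte chunk.
        assert len(data) % chunk == 0, "a real scheme would pad or length-prefix the input"
        out = []
        for i in range(0, len(data), chunk):
            n = int.from_bytes(data[i:i + chunk], "big")
            group = []
            for _ in range(width):
                n, r = divmod(n, 58)
                group.append(B58[r])
            out.append("".join(reversed(group)))
        return "".join(out)

    def decode_chunked(text: str, chunk=8, width=11):
        out = bytearray()
        for i in range(0, len(text), width):
            n = 0
            for c in text[i:i + width]:
                n = n * 58 + B58.index(c)
            out += n.to_bytes(chunk, "big")
        return bytes(out)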


Yep. The comparisons page mentions this: https://base32h.github.io/comparisons#octoquinquagesimal-58

tl;dr: base58 is more compact than any base-32 system, but (in addition to being unfriendly to byte-alignment) it relies on case-sensitivity, which makes it a non-starter for filenames on Windows or (by default) macOS and makes it annoying to read aloud ("Was that an uppercase or lowercase C?").


Interesting approach, thank you for sharing this. I especially like the S/5 aliasing.

A test vector file for implementers would be nice (something like what CBOR provides) so all possible edge cases can be checked.


Good idea, agreed. There's a partial attempt at that in the JS implementation's test suite, albeit written into the test code itself; that'd probably be a decent enough starting point for a non-comprehensive approach.

At some point a full-blown test harness will be useful (i.e. to compare implementations and make sure they have equivalent behavior, for e.g. randomized or sequential tests). Haven't gotten that far yet :)


I just recently understood the deep connection between bytes and hex, hex and decimal.

Can someone ELI5 what would be the use case for having data represented in Base32H? I understand it conveys more info in less runes but it feels hard to keep in my head. Is this just something that takes practice and getting used to or I'm not supposed to try to use this to read binaries?


> Can someone ELI5 what would be the use case for having data represented in Base32H?

Well, the main thing I wanted to do (that motivated me to come up with Base32H, instead of using a different base-32 system) was be able to use English words as numbers (and likewise, turn numbers into English words). This is hard to do with hex because you (usually) only have 6 letters, but pretty easy to do with Base32H because every letter can be a digit.

It might seem silly at first glance, but it can be pretty handy for things like adding prefixes to a number. For example, for asset tagging (the use case that first motivated Base32H), to figure out whether or not a given asset is a desktop PC you could use 8 digits in total: 4 for the prefix "DTPC" (for DeskTop PC) and 4 for the actual identifier. Since DTPC-0000 in Base32H is 475378221056 in decimal, and DTPC-ZZZZ is 475379269631, you immediately know that anything between those numbers (inclusive) is a desktop PC, and anything outside that range is something else. Same deal with, say, laptop PCs (LTPC) being in the range of LTPC-0000 (715896389632) through LTPC-ZZZZ (715897438207), or keyboards (KBRD) being in the range of KBRD-0000 (665498681344) through KBRD-ZZZZ (665499729919).

And yes, you could (and should!) absolutely do this as part of a database schema, too (for example, by having an "asset_type" column in whatever table's storing asset tags), but the advantage of this is that anyone and anything encountering one of these asset IDs "in the wild" can figure out the type without needing to access the database at all.
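
Here's a tiny Python sketch of that range check (illustrative only; the alphabet shown is the canonical digit set, and alias handling is skipped for brevity):

    # Illustrative only: classify a numeric asset ID by its 4-digit Base32H prefix range.
    ALPHABET = "0123456789ABCDEFGHJKLMNPQRTVWXYZ"

    def decode(s):
        n = 0
        for c in s.upper():
            n = n * 32 + ALPHABET.index(c)
        return n

    def asset_type(asset_id):
        for prefix, label in (("DTPC", "desktop PC"), ("LTPC", "laptop PC"), ("KBRD", "keyboard")):
            lo = decode(prefix + "0000")  # e.g. DTPC-0000 -> 475378221056
            hi = decode(prefix + "ZZZZ")  # e.g. DTPC-ZZZZ -> 475379269631
            if lo <= asset_id <= hi:
                return label
        return "unknown"

    print(asset_type(475378221056 + 42))  # -> desktop PC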


I miss a comparison to the duodecimal number system (base 12). To me that seems much better than both the decimal and the duotrigesimal number systems. http://duodecimal.net/archives/duodecimal/duodecimal.html


Time to link Conlang Critic's video "A better way to count", where he compares decimal and dozenal (base 12) with seximal (base 6), and base 6 wins.

https://www.youtube.com/watch?v=qID2B4MK7Y0

One of the best videos of number-base related comedy.


I'll see about throwing that into the comparisons page, thanks!


There is also a base24 that has many implementations in various languages (I did the one for .NET):

https://www.kuon.ch/post/2020-02-27-base24/


I often use base 36 (0-9 a-z), which is the most compact notation (highest base) that Python's int() function can decode.


Use the US-ASCII 0-9 and a-z (caps equivalent). O and zero (0) are the same, as are the numeral 1 and L and I. S matches a 5.

I like it.


Throw in some kind of erasure code and I’m sold.


Yep, that's the one major missing feature. The good news is there's nothing stopping an application from implementing it (and indeed, with 8-character / 40-bit chunks being the recommendation, that lends itself well to sticking 8-bit checksums/ECCs/whatever onto 32-bit values).
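
As a sketch of what that could look like (my own illustration, not something the spec defines): append an 8-bit check byte to a 32-bit value and encode the resulting 40 bits as exactly 8 digits. The check function below is just a sum-of-bytes placeholder; a real application would want a CRC-8 or a proper ECC.

    # Sketch only: 32-bit value + 8-bit check byte -> 40 bits -> 8 Base32H digits.
    ALPHABET = "0123456789ABCDEFGHJKLMNPQRTVWXYZ"  # canonical digits; aliases omitted here

    def check_byte(value):
        # Placeholder checksum; substitute a CRC-8 or a real ECC in practice.
        return sum(value.to_bytes(4, "big")) & 0xFF

    def encode_tagged(value):
        n = (value << 8) | check_byte(value)  # 40 bits total
        return "".join(ALPHABET[(n >> (5 * i)) & 31] for i in range(7, -1, -1))

    def decode_tagged(s):
        n = 0
        for c in s:
            n = n * 32 + ALPHABET.index(c)
        value, check = n >> 8, n & 0xFF
        if check != check_byte(value):
            raise ValueError("check byte mismatch")
        return value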


Be warned, the implementation is vulnerable to side-channel attacks. Please avoid using it to encode secrets. I would suggest using libsodium instead https://libsodium.gitbook.io/doc/helpers - although it does not offer a base32 variant.


Anything you'd recommend to mitigate those side-channel attacks? I was going more for simplicity and portability for the reference implementations, but should there be a security-focused implementation (e.g. as part of some library like libsodium) it'd be useful to know the attack surfaces.


Do not use the input as indices, do not branch depending on the input, and do not use division, mod, or even multiplication on the input. Check how libsodium does it. Here is a safe rot13 implementation in C, if it is any help (note: it assumes safe islower and isupper implementations).

    #include <ctype.h>
    
    #define IFTHENELSE(c, t, e) ((-(!!(c)) & (t)) | (-(!(c)) & (e)))
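    /* The macro above is a branchless select: it computes (c ? t : e) using only
       bit masks, so the chosen value is not selected via a data-dependent branch. */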
    
    unsigned char
    mod26 (unsigned char x)
    {
      x -= IFTHENELSE(x >= 26 * 4,
        26 * 4,
        0);
      x -= IFTHENELSE(x >= 26 * 2,
        26 * 2,
        0);
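      /* The 26*2 step appears twice so the function fully reduces any unsigned
         char input (up to 255); the rot13 calls below only ever need the final step. */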
      x -= IFTHENELSE(x >= 26 * 2,
        26 * 2,
        0);
      x -= IFTHENELSE(x >= 26,
        26,
        0);
      return x;
    }
    
    unsigned char
    rot13 (unsigned char x)
    {
      return IFTHENELSE(islower(x), mod26(x - 'a' + 13) + 'a',
          IFTHENELSE(isupper(x), mod26(x - 'A' + 13) + 'A',
              x));
    }
Implementing Base32H should be much easier, as you do not need to perform mod by a non-power of 2.



