Base 65536 (github.com/qntm)
162 points by leoh on Sept 20, 2020 | 46 comments



Similarly, we do a lot of Unicode bit packing on Dwitter.net, since the code size limit is in characters (140), not bytes.

There are a few compressors that raise the budget for arbitrary ASCII chars, the best allowing 194 chars including the decoder; the compressors are themselves dweets: https://www.dwitter.net/d/11852 This is not the most efficient way to store bits in Unicode, but there's a balance to strike on Dwitter because you need to fit the decoder into the same 140 characters.

There are also some more specialist methods, such as packing image or other binary data and decoding it with string.charCodeAt(), e.g. https://www.dwitter.net/d/12808 This isn't limited to exact bit word sizes: for instance, I packed this 3-color image efficiently by encoding it numerically in base 3, at ~1.6 bits per pixel. Decoding just requires division by powers of 3 and mod 3 (the generalization of bitwise shifting and masking): https://www.dwitter.net/d/19166
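Roughly, the unpacking step looks like this (a sketch of the idea only, not the actual dweet's code; the 5-trits-per-byte grouping is just one possible choice, since 3^5 = 243 fits in 8 bits):

    // recover pixel values 0..2 from a packed byte n by repeated
    // division and mod by 3 -- the base-3 analogue of shift-and-mask
    const unpackPixels = n => {
      const out = [];
      for (let i = 0; i < 5; i++) {
        out.push(n % 3);        // low "trit" = next pixel's color
        n = (n - n % 3) / 3;    // integer-divide by 3
      }
      return out;
    };
    unpackPixels(107); // [2, 2, 2, 0, 1] since 107 = 1*81 + 0*27 + 2*9 + 2*3 + 2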

[edit] I should explain: that last example combines the general code compressor with the image bit-packing technique. The image is encoded as 5 pixels (3 colors) per 8-bit character, deliberately not using the higher bits of Unicode, so that the whole program can then be encoded using the above compressor, which fits two chars of up to 8 bits each into one UTF-16 character and wraps it in the decoder.
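For the curious, the packing idea in that compressor is roughly this (a simplified sketch, not the real dweet; for ASCII input the packed values stay below the surrogate range, which the real compressors have to dodge for full 8-bit input):

    // pack two chars of up to 8 bits each into one UTF-16 code unit,
    // then unpack with charCodeAt (assumes even-length input)
    const pack = s => s.replace(/../g, p =>
      String.fromCharCode(p.charCodeAt(0) * 256 + p.charCodeAt(1)));
    const unpack = s => [...s].map(c => String.fromCharCode(
      c.charCodeAt(0) >> 8, c.charCodeAt(0) & 255)).join('');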

You can see the source by replacing "eval" with "throw".


How about adding standard decompressors to the Dwitter JS environment so they can be executed with less code?


Do you mean to add the decompressor wrapper automatically so that more UTF-16 packed code can fit?

That would effectively raise the limit to 280 usable ASCII chars, i.e. a double dweet. This has been discussed before, but I don't think the authors will implement it.

There are a few other JS golfing sites that have multiple size brackets, and I believe they are less active as a result. Although Dwitter is not exactly popular, I think this clear limitation is what keeps it focused and the community actively engaged. The 194B hack is a slight weakening of that, but it's often reflected in the votes: people are less impressed when you've resorted to that trick unless it was really worth it. It also feels like a fair compromise, because you cannot, for instance, use emojis inside that compression method, so it's not entirely free.


For those looking for motivation: you can see examples of the related Base2048 encoding used in tweets to the BBC Micro Bot Twitter account: https://www.dompajak.com/bbcmicrobot.html Encoding and decoding (to learn from others' coding tricks) is easy with the editor: https://editor.8bitkick.cc/?compress=true

Without these tools I'd never have managed to fit this in a tweet: https://twitter.com/bbcmicrobot/status/1248312174054948864


what a thing of beauty


Initially designed to pack more data in a tweet. Thanks to this article I learned about:

https://blog.twitter.com/engineering/en_us/topics/insights/2...


So tweets have differing length limits according to the Unicode range being used. But Germans are still cheated compared to English speakers, English being quite a concise language. (Of course, the article focuses on Japanese because of the huge difference from Latin-script languages.)



Base65536 uses only "safe" Unicode code points - no unassigned code points, no whitespace, no control characters, etc.

The string encodes two bytes per code point.

Is it really 64k then? It is hard to understand how this works without documentation on the format.

I wonder how many others were able to mentally decode the example input string just on sight, from having worked with ASCII extensively.


It encodes two bytes per code point; the encoding of that code point may be longer than two bytes. (Indeed it mentions that it's optimized for UTF-32.) It accomplishes having the full 2^16, while only using safe characters, by not restricting to the BMP.

If you read the details of how it was constructed, you can see that the reason the 2^17 version isn't ready yet is that there aren't yet 2^17 safe characters in Unicode! (There are a total of 123,813 characters in Unicode 13 that are judged "safe" by qntm's criteria.)
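You can see the payload-vs-serialized-size distinction directly in JS (the specific code point here is just an arbitrary astral-plane example, not necessarily one Base65536 uses):

    const s = String.fromCodePoint(0x16A00);  // one code point outside the BMP
    new TextEncoder().encode(s).length;       // 4 bytes in UTF-8
    s.length * 2;                             // 4 bytes in UTF-16 (surrogate pair)
    // in UTF-32 it would also be exactly 4 bytes: 16 payload bits per 32-bit unit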


A code point is just a number. (Well, a number below 1.1 million.)

The simplest way to turn two bytes into a code point is to output the same number.

But some code points are invalid, so you have to skip them. The simplest way is to split your 64k into two ranges. For example numbers 0-50k can get code points 1k-51k, and numbers 50k-64k can get code points 70k-84k.
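In code, that two-range example would be something like (ranges illustrative, not Base65536's actual ones):

    // map n in 0..65535 to a code point, skipping an invalid region
    const toCodePoint   = n  => n  < 50000 ? n + 1000  : n + 20000;  // 50000..65535 -> 70000..85535
    const fromCodePoint = cp => cp < 70000 ? cp - 1000 : cp - 20000;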

Base65536 wants to avoid code points that are likely to misbehave, so it uses about ten different ranges of code points, carefully selected to avoid problems (and to look visually cohesive).


> Is it really 64k then?

There are significantly more than 64k possible code points (the total code point space has about 1.1 million possible values), so why not? In baseX, the "X" is the number of symbols in the encoded output, e.g. base32 uses 32 separate output symbols and base64 uses 64.

The problem was finding 64k code points which are mapped, tweet-safe (e.g. not control characters), and which Twitter only counts as 1.

> It is hard to understand how this works without documentation on the format.

The concept is explained in a separate document: https://qntm.org/safe
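If you just want to use it, the npm package boils down to an encode/decode pair. A sketch (the exact API may differ between versions; older releases used ArrayBuffer rather than Uint8Array):

    import { encode, decode } from 'base65536';
    const str   = encode(new Uint8Array([1, 2, 3])); // one code point per 2 bytes
    const bytes = decode(str);                       // round-trips the input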


It's actually 65,792 code points total: 65,536 plus 256 for "padding" characters if the input has an odd number of octets.


https://github.com/qntm/base65536#why

> Why?

> Erm.

> I wanted people to be able to share HATETRIS replays via Twitter.

My brain is melting.


So I went ahead and allowed myself to ask "why is it called HATETRIS?" and tried it out without reading anything first.

This has been a wonderful rollercoaster, thank you for sharing and pointing this out :D I only played the game for 30 seconds before I could already guess exactly why it's called that!


HATETRIS is actually a surprisingly engaging challenge, particularly since it has no time constraints. After a few minutes I found I enjoyed it at least as much if not more than regular Tetris. Figuring out how to make all the pieces appear makes quite the juicy little puzzle, as does figuring out how to maximize the number of cleared lines.


I was not able to clear more than 2 lines per game. I would like to become good at this game. I'm not sure it's possible. But the prospect of a community that shares hatetris replays is intriguing...


ИටฎਕذඥʣݹஐఢƔݹமقݩܥआƣະࡆ౫ටࠃݹԫइາকԥටݧऍ௨ටຜݹߝƐໃԓஸҨȣݓ௨ʈȣџƝۇȥɞୡइไऎਉضƖݹ௨پЈஆjതਈࡨ௬ࡃࠆٽߤಈϼஅԬݑȣ1

Wonder if HN comments will include all of the characters in this replay or if any of them are removed, since HN strips most emojis, for example.

Edit: Yup, all of the characters in this replay were included.


I love this! Idea, execution, everything.

This reminds me of the time when I encoded a binary database of several megabytes, with indices and everything, in the 53 integer bits of huge arrays of 64-bit floating-point numbers, because that was the only way to use that data in the Lua runtime environment of a computer game's UI that didn't expose direct file-system access to UI plugins.


I want to marvel at the characters and languages our species has created, which we've also found a way to encode and display on computers.


> maximum Tweet length is indeed 280 Unicode code points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double.

Oh, that's why some carefully composed tweets don't fit — things like General Punctuation and Currency Symbols count double. I think I'm up to the Gillette5 version of Hanlon's Razor where Twitter is concerned.


Superseded by Base2048 [1] by now.

[1] https://github.com/qntm/base2048


Assuming the goal is to transmit data via Twitter.

Other environments may treat characters after U+1100 with no penalties.

See: “Rationale” section at: https://github.com/qntm/base2048#rationale

> following some experimentation and examination of the new web client code, we now understand that maximum Tweet length is indeed 280 Unicode code points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double.


You are right that Base2048 is tailor-made for Twitter, but I think so is Base65536.

More specifically, the explicit length limitation is easily one of the weirdest aspects of Twitter. It makes the Twitter experience somewhat like that of IRC, facilitating knee-jerk reactions and misunderstandings. And the limit is severe even for casual conversation, to the point that it had to be changed once emojis broke its core assumptions. I don't think Base65536 has much use outside of Twitter and the like.


Base65536 is useful anywhere people are pasting codes to each other with sizes around 20-1000 bytes. Taking up half as much screen space as base64 is a big feature.


I needed something similar for a javascript project and stumbled upon this excellent library by pieroxy that worked flawlessly: https://pieroxy.net/blog/pages/lz-string/index.html

base65536 looks pretty similar to that, except that it is focused not so much on compression as on "copy-paste-ability".

Shoco is another excellent library that comes to mind for shorter strings: https://ed-von-schleck.github.io/shoco/


Mightn't the raw data be a good candidate for Huffman encoding?

https://en.wikipedia.org/wiki/Huffman_coding


Base65536 is not a compression algorithm but rather an encoding scheme that can convey any data, compressed or not. Compression algorithms have an implicit parameter, the alphabet of output symbols, which is commonly but not necessarily the octet (0 through 255). If your output format is the tweet, you will eventually end up with something like Base65536 for the output symbols.


The article mentions compression briefly. They are using basic run-length encoding.
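Something like this, presumably (a sketch; the replay format's actual RLE details may differ):

    // run-length encode a move string: "LLLLRRDD" -> "L4R2D2"
    const rle = s => s.replace(/(.)\1*/g, run => run[0] + run.length);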

There are probably better algorithms. GP suggested Huffman but I don't think it will work. Something like the algorithm used by crinkler [1] (context modeling + arithmetic coding) may work better. It can be made even better if combined with the encoding (the base doesn't have to be a power of two).

Anyway, no need to overdo it; the author's technique works, at least for the newer Base2048 version, optimized for long tweets.

[1] https://xoofx.com/blog/2010/12/28/crinkler-secrets-4k-intro-...


If they went out of their way to make base65536, I'd be surprised if they didn't evaluate at least a few compression schemes. It's mentioned in the readme that they are using RLE so it seems reasonable to assume that their data tends to have long runs that compress well with RLE.


The HATETRIS moves? You could try, but because Huffman coding works at the bit level and there are only 4 moves, it's going to have absolutely awful efficiency. You'd want arithmetic coding at least.


Left and right seem like they'd be a lot more common than whatever the other two moves are, so maybe a 1-bit code could be assigned to one of those and still do better than RLE.
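As a sketch, with made-up frequencies (a real distribution would have to be measured from actual replays):

    // prefix code favouring the assumed-common moves
    const code = { L: '0', R: '10', D: '110', U: '111' };
    // with p(L)=.5, p(R)=.25, p(D)=p(U)=.125 the average is
    // .5*1 + .25*2 + .125*3 + .125*3 = 1.75 bits/move vs 2 bits fixed-width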


How would this be considered more efficient if you need an additional character to mark each character?


Make sure to check out the game. Best Tetris ever.

https://qntm.org/files/hatetris/hatetris.html


On the other end of the scale, we have base-1, e.g. scratching a row of 1's on a surface as a tally.

I guess base-0 can't really exist, since 0⁰ is undefined.


Fun fact: As the base goes down, the number gets longer. If you think about the base as the height, you could then find the "area" of the number by multiplying the two. It turns out that the base that minimizes this area is e.

https://web.williams.edu/Mathematics/sjmiller/public_html/10...
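(Sketch of the derivation: N takes about log_b(N) = ln(N)/ln(b) digits in base b, so the area is A(b) = b * ln(N)/ln(b); setting dA/db = 0 gives ln(b) = 1, i.e. b = e.)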



JavaScript really has come a long way.

I can remember when hacks like this were in CPAN, and then in pip. Now npm?!


"To encode one additional bit per code point, we need to double the number of code points we use from 65,536 to 131,072."

I'd argue that going into fractional bits per code point would not make the project significantly geekier than it already is :-)


Well, if you really want to put more data into a tweet, you can always abuse t.co [1].

[1] https://xem.github.io/obfuscatweet-reloaded/


Related: An "arbitrary base converter" that treats all strings as numbers: https://convert.zamicol.com/


Reminds me of Jira. Jira (at least server and data centre editions) uses base 64 numbers to impose global ordering onto issues.


Has anyone tried to tweet compressed data in any of these encodings?

Wondering if there are any famous pieces of text that would fit.


I like Base1. Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa


Using a platform to do something it was not designed for could constitute hacking and therefore get you into legal trouble.


A platform designed to disseminate data is being used for its purpose. This is really no different than switching from English to Chinese to increase information density: https://pugs.blogs.com/audrey/2009/10/our-paroqial-fermament...



