Similarly, we do a lot of Unicode bit packing on Dwitter.net, since the code size limit is in characters (140), not bytes.
There are a few compressors that expand the budget for arbitrary ASCII chars, the best allowing 194 chars including the decoder; the compressors are themselves dweets: https://www.dwitter.net/d/11852 This is not the most efficient way to store bits in Unicode, but there is a balance to strike on Dwitter because you need to fit the decoder into the same 140 characters.
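The idea behind those compressor dweets, in a simplified sketch (not the actual golfed code), is to fuse two 8-bit chars into each UTF-16 code unit:

    // Simplified sketch, not the actual dweet: pack two chars (each <= 8
    // bits) into one UTF-16 code unit. For ASCII input the result stays
    // below the surrogate range (max 127*256 + 255 = 32767 < 0xD800).
    function pack(src) {
      let out = '';
      for (let i = 0; i < src.length; i += 2)
        out += String.fromCharCode(src.charCodeAt(i) * 256 + (src.charCodeAt(i + 1) || 0));
      return out;
    }

    // The decoder, which must fit in the same 140 chars, reverses it:
    function unpack(packed) {
      let out = '';
      for (const c of packed) {
        const n = c.charCodeAt(0);
        out += String.fromCharCode(n >> 8, n & 255);
      }
      return out;
    }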
There are also some more specialist methods, such as packing image or other binary data and decoding it with string.charCodeAt(), e.g. https://www.dwitter.net/d/12808 This is not limited to exact bit word sizes: for instance, I packed this 3-color image efficiently by encoding it numerically in base 3, at ~1.6 bits per pixel. The decoding just requires division by powers of 3 and mod 3 (a generalization of bit shifting and masking): https://www.dwitter.net/d/19166
[edit] I should explain: that last example combines the general code compressor with the image bit-packing technique. The image is encoded at 5 pixels (3 colors, since 3^5 = 243 fits in a byte) per 8-bit character, but it avoids the higher bits of Unicode so that the whole program can still be run through the above compressor, which fits two chars of up to 8 bits each into one UTF-16 code unit and wraps the result in the decoder.
You can see the source by replacing "eval" with "throw".
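For reference, the 5-pixels-per-char unpacking described above boils down to something like this (a sketch of the technique, not the golfed source):

    // Sketch, not the actual dweet: each 8-bit char stores 5 ternary
    // digits (3^5 = 243 <= 256), one pixel value (0..2) per digit.
    function pixelAt(data, i) {
      const c = data.charCodeAt(Math.floor(i / 5)); // char holding pixel i
      return Math.floor(c / 3 ** (i % 5)) % 3;      // divide by a power of 3, then mod 3
    }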
Do you mean to add the decompressor wrapper automatically so that more UTF-16 packed code can fit?
That would effectively increase the limit to 280 usable ASCII chars, i.e. a double dweet. This has been discussed before, but I don't think the authors will implement it.
There are a few other JS golfing sites with multiple size brackets, and I believe they are less active as a result. Although Dwitter is not exactly popular, I think this clear limitation is what keeps it focused and the community actively engaged. The 194-char hack is a slight weakening of that, but that's often reflected in the votes: people are less impressed when you've resorted to the trick unless it was really worth it. It also feels like a fair compromise because you cannot, for instance, use emojis inside that compression method, so it's not entirely free.
So tweets have differing length limits according to the Unicode range being used. But Germans are still cheated compared to English speakers, English being quite a concise language. (Of course, the comparison focuses on Japanese because of the huge difference from Latin-script languages.)
It encodes two bytes per code point; the encoding of that code point may itself be longer than two bytes. (Indeed, it mentions that it's optimized for UTF-32.) It achieves the full 2^16 symbols, while using only safe characters, by not restricting itself to the BMP.
If you read the details of how it was constructed, you can see that the reason the 2^17 version isn't ready yet is that there aren't yet 2^17 safe characters in Unicode! (There are a total of 123,813 characters in Unicode 13 that are judged "safe" by qntm's criteria, and 2^17 = 131,072.)
A code point is just a number. (Well, a number below 1.1 million.)
The simplest way to turn two bytes into a code point is to output the same number.
But some code points are invalid, so you have to skip them. The simplest way is to split your 64k values into two ranges. For example, numbers 0-50k can map to code points 1k-51k, and numbers 50k-64k to code points 70k-84k.
Base65536 wants to avoid code points that are likely to misbehave, so it uses about ten different ranges of code points, carefully selected to avoid problems (and to look visually cohesive).
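A minimal sketch of that range-walking idea (the two ranges here are made up for illustration and are not qntm's actual safe blocks):

    // Hypothetical ranges, not Base65536's real tables: two "safe" blocks
    // of 32768 code points each, covering all 65536 input values. The
    // second block is deliberately outside the BMP.
    const RANGES = [[0x3400, 0x8000], [0x20000, 0x8000]];

    function encode16(value) { // value: 0..65535
      for (const [start, len] of RANGES) {
        if (value < len) return String.fromCodePoint(start + value);
        value -= len;
      }
    }

    function decode16(char) {
      let offset = 0;
      const cp = char.codePointAt(0);
      for (const [start, len] of RANGES) {
        if (cp >= start && cp < start + len) return offset + (cp - start);
        offset += len;
      }
    }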
There are significantly more than 64k possible code points (the total code point space has about 1.1 million values), so why not? In baseX, the "X" is the number of symbols in the encoded output, e.g. base32 uses 32 distinct output symbols and base64 uses 64.
The problem was finding 64k code points which are mapped, tweet-safe (e.g. not control characters), and which Twitter counts as only 1 character each.
> It is hard to understand how this works without documentation on the format.
So I went ahead and allowed myself to ask "why is it called HATETRIS?" and tried it out without reading anything first.
This has been a wonderful rollercoaster, thank you for sharing and pointing this out :D I only played the game for 30 seconds before I could already guess exactly why it's called that!
HATETRIS is actually a surprisingly engaging challenge, particularly since it has no time constraints. After a few minutes I found I enjoyed it at least as much if not more than regular Tetris. Figuring out how to make all the pieces appear makes quite the juicy little puzzle, as does figuring out how to maximize the number of cleared lines.
I was not able to clear more than 2 lines per game. I would like to become good at this game. I'm not sure it's possible. But the prospect of a community that shares hatetris replays is intriguing...
I wonder if HN comments will include all of the characters in this replay or if any of them get removed, since HN strips most emojis, for example.
Edit: Yup, all of the characters in this replay were included.
This reminds me of the time I encoded a binary database of several megabytes, with indices and everything, in the 53 integer bits of huge 64-bit floating-point arrays, because that was the only way to get that data into the Lua runtime of a computer game's UI, which didn't expose direct file system access to UI plugins.
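In JS (whose numbers are the same IEEE-754 doubles as Lua's), the trick looks roughly like this: pack 6 bytes (48 bits) per double, comfortably under the 53-bit exact-integer limit. A sketch, assuming the input is padded to a multiple of 6 bytes:

    // Sketch of the trick: 6 bytes per double; 2^48 < 2^53, so every
    // packed value is exactly representable. Assumes bytes.length % 6 === 0.
    function packBytes(bytes) { // bytes: Uint8Array
      const nums = [];
      for (let i = 0; i < bytes.length; i += 6) {
        let n = 0;
        for (let j = 0; j < 6; j++) n = n * 256 + bytes[i + j]; // * not <<: avoids 32-bit ops
        nums.push(n);
      }
      return nums;
    }

    function unpackBytes(nums) {
      const bytes = new Uint8Array(nums.length * 6);
      nums.forEach((n, i) => {
        for (let j = 5; j >= 0; j--) { bytes[i * 6 + j] = n % 256; n = Math.floor(n / 256); }
      });
      return bytes;
    }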
> maximum Tweet length is indeed 280 Unicode code points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double.
Oh, that's why some carefully composed tweets don't fit — things like General Punctuation and Currency Symbols count double. I think I'm up to the Gillette5 version of Hanlon's Razor where Twitter is concerned.
> following some experimentation and examination of the new web client code, we now understand that maximum Tweet length is indeed 280 Unicode code points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double.
You are right that Base2048 is tailor-made for Twitter, but I think so is Base65536.
More specifically, the explicit length limitation is easily one of the weirdest aspects of Twitter. It makes the Twitter experience somewhat like that of IRC, facilitating knee-jerk reactions and misunderstandings. And the limit is severe even for casual conversation, to the point that it had to be changed once emojis broke its core assumptions. I don't think Base65536 has much use outside of Twitter and the like.
Base65536 is useful anywhere people are pasting codes to each other with sizes around 20-1000 bytes. Taking up half as much screen space as base64 is a big feature.
Base65536 is not a compression algorithm but rather an encoding scheme that can convey any data, compressed or not. Compression algorithms have an implicit parameter: the alphabet of output symbols, which is commonly, but not necessarily, the octet (0 through 255). If your output format is the tweet, you will eventually end up with something like Base65536 as your output alphabet.
The article mentions compression briefly. They are using basic run-length encoding.
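Run-length encoding on a move string is about as simple as compression gets; a sketch (not necessarily the exact scheme the readme describes):

    // Sketch of RLE over a move string like "LLLLRDD".
    function rleEncode(s) {
      const runs = [];
      for (const ch of s) {
        const last = runs[runs.length - 1];
        if (last && last[1] === ch) last[0]++;
        else runs.push([1, ch]);
      }
      return runs; // "LLLLRDD" -> [[4,'L'],[1,'R'],[2,'D']]
    }

    function rleDecode(runs) {
      return runs.map(([n, ch]) => ch.repeat(n)).join('');
    }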
There are probably better algorithms. GP suggested Huffman but I don't think it will work. Something like the algorithm used by crinkler [1] (context modeling + arithmetic coding) may work better. It can be made even better if combined with the encoding (the base doesn't have to be a power of two).
Anyway, there's no need to overdo it; the author's technique works, at least for the newer Base2048 version, which is optimized for long tweets.
If they went out of their way to make base65536, I'd be surprised if they didn't evaluate at least a few compression schemes. It's mentioned in the readme that they are using RLE so it seems reasonable to assume that their data tends to have long runs that compress well with RLE.
The HATETRIS moves? You could try, but because Huffman coding assigns a whole number of bits to each symbol and there are only 4 moves, it's going to have awful efficiency. You'd want arithmetic coding at least.
Left and right seem like they'd be a lot more common than whatever the other two moves are, so maybe a 1-bit code could be assigned to one of those and still do better than RLE.
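For instance, with invented frequencies where L dominates, a prefix code over the four moves might look like:

    // Hypothetical prefix code for the 4 moves; frequencies are invented
    // for illustration, with L assumed most common.
    const CODE = { L: '0', R: '10', D: '110', U: '111' };
    const DECODE = Object.fromEntries(Object.entries(CODE).map(([k, v]) => [v, k]));

    const encodeMoves = moves => [...moves].map(m => CODE[m]).join('');

    function decodeMoves(bits) {
      let buf = '', out = '';
      for (const b of bits) {
        buf += b;
        if (buf in DECODE) { out += DECODE[buf]; buf = ''; }
      }
      return out;
    }
    // encodeMoves('LLRLD') -> '00100110'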
Fun fact: As the base goes down, the number gets longer. If you think about the base as the height, you could then find the "area" of the number by multiplying the two. It turns out that the base that minimizes this area is e.
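A quick sketch of why: a number N has about log_b N digits in base b, so

    A(b) = b \cdot \log_b N = b \cdot \frac{\ln N}{\ln b},
    \qquad
    A'(b) = \ln N \cdot \frac{\ln b - 1}{(\ln b)^2} = 0
    \;\Longrightarrow\; \ln b = 1 \;\Longrightarrow\; b = e.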