Base64 Encoding, Explained (akshaykhot.com)
256 points by software_writer on Oct 23, 2023 | 119 comments



> It's important to remember that we are not encrypting the text here.

Thank you for emphasizing this. Many junior devs have been bitten by not being told early enough the difference between encryption (requires a secret to be reversed), hashing (cannot be reversed) and encoding (can always be trivially reversed).

Also good to know that while the output looks random, it has the same entropy as the input. I.e.: don't base64 encode your password to make it stronger.


This is a nitpick and not pertinent to anything, but base64 encoding your password could make it stronger. Password strength is not just about entropy; high entropy is simply the most effective way of making your password stronger. If your password is 100% randomly generated (as it should be), then base64 encoding it won't do anything.

If however your password is engineered to be easier to remember, for example by using dictionaries or some kind of scheme that has lower entropy, then the base64 encoding step adds a single bit of strength to your password. Meaning anyone who is brute forcing your password using a smart password cracker has to configure that cracker to also consider base64 encoding as a scheme, basically forcing it to perform a single extra operation for every try.

Anyway, useless information, you shouldn't be using password schemes like that. The horse shoe battery staple type password style should be quite sufficient I think.


> Anyway, useless information, you shouldn't be using password schemes like that. The horse shoe battery staple type password style should be quite sufficient I think.

I wonder if it's better to make an encoder that uses words, so the output looks like "horse shoe battery staple", except you don't release your dictionary of potential words output by the encoder. Then you guarantee that you can always re-create a password if you lose it, assuming you don't lose the dictionary file.


I feel like we're discussing the 24 word mnemonic private keys used by crypto wallets with extra steps.


One could argue that base64 having a shorter output length than its input would weaken any given password, assuming a brute force attack (not a dictionary one).


Base64 turns 6 bits into an ASCII character (8 bits), so I don't think it's possible to have a shorter output length.


Damn, indeed, my bad. In my defense, I just had a new baby and didn't sleep much last night..


My brain never recovered from having kids. Good luck.


With each kid you add a brain to the world. So it's a net win?


> With each kid you add a brain to the world. So it's a net win?

It depends... Is it the sum or the average that counts?


Congrats on the baby!


Adding to this, here [1] is a chart that shows how several encodings affect the size ratios of UTF-8, -16 and -32. Of course that gets into the discussions [2][3] of using UTF in passwords, and I have no idea how many password schemes support this beyond what I use.

[1] - https://github.com/qntm/base2048

[2] - https://security.stackexchange.com/questions/85663/is-it-a-g...

[3] - https://security.stackexchange.com/questions/4943/will-using...


Sorry, but it's the opposite.

A base64-encoded string will be ~30% longer than the original string.
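
To make the ratio concrete, a quick check with Python's standard library (the input here is just a throwaway example; 3 input bytes become 4 output characters, so roughly a third longer):

    import base64

    data = b"x" * 30                      # 30 input bytes
    encoded = base64.b64encode(data)      # every 3 bytes -> 4 ASCII characters
    print(len(data), len(encoded))        # 30 40
    print(len(encoded) / len(data))       # 1.333... i.e. about a third longer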


I would think every CS grad would know the difference between these things. Everyone interested in coding could learn these concepts in an afternoon.


You and I would both think so -- but on the other hand, I've interviewed a lot of people who have disabused me of that notion. Seemingly simple concepts don't always stick.


coding as in java or coding as in information theory?


as in cardiac arrest


A related thing always worth emphasizing: hashes aren't necessarily cryptographically secure!

Hashing has many purposes besides security, and for that reason there are many hash libraries. If you plan on using hashes for something related to security or cryptography, you need to use a hash designed for that purpose! Yes, CRC hashing is really fast, and that's great, but it's not great when you use it for user passwords.


Password-hashing functions are not usually what's referred to as a "cryptographically-secure hash"; e.g. unsalted SHA-2 is the latter but not the former. For example, resistance to enumeration or rainbow tables is not a requirement for cryptographic hash functions, but is important for good password-hashing functions.

https://en.wikipedia.org/wiki/Cryptographic_hash_function

https://en.wikipedia.org/wiki/Password-hashing_function


Unless you have really specific requirements (hashes/second, hash size no bigger than X characters, etc) is there any reason not to default to sha256?

I still see newly released projects that choose md5. Like, sure, for the intended use case, probably nobody will construct a collision, but why even allow the possibility?


It's not much about collisions, more like predictability of the output. You can trivially construct a rainbow table of the most common N passwords and test a dump of SHA-256 hashes against it. Also, SHA-256 is vulnerable to length extension attacks, so it may not be suitable in some applications on variable-length inputs.

Generally speaking, hashing user-provided data as-is is only a guarantee of integrity, not of authenticity (see HMAC), nor secrecy.
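
A minimal Python sketch of that integrity-vs-authenticity distinction (the key and message here are made up for illustration):

    import hashlib
    import hmac

    key = b"shared-secret"                # example key, not a real secret
    message = b"amount=100&to=alice"

    # Plain hash: anyone who can alter the message can recompute this.
    digest = hashlib.sha256(message).hexdigest()

    # HMAC: recomputing the tag requires the key, so it authenticates
    # the sender as well as protecting integrity.
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()

    # Compare MACs with a constant-time comparison.
    assert hmac.compare_digest(tag, hmac.new(key, message, hashlib.sha256).hexdigest())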


This is why computer science as a discipline matters.


> Many junior devs have been bitten by not being told early enough the difference between encryption (requires a secret to be reversed), hashing (cannot be reversed) and encoding (can always be trivially reversed).

If they want to encrypt something just tell them to use ROT13 twice.


This is dangerous and outdated advice. ROT13 should always be used an odd number of times to avoid CVE-2022-13!


I really thought you were going to say "Chat-GPT detected" when quoting that message.


A funny thing about Base64: if you iterate encoding starting from any string, a growing prefix of the result tends towards a fixed point. In Bash:

    $ function iter {
    N="$1"
    CMD="$2"
    STATE=$(cat)
    for i in $(seq 1 $N); do
       STATE=$(echo -n $STATE | $CMD)
    done
    cat <<EOF
    $STATE
    EOF
    }
    $ echo "HN" | iter 20 base64 | head -1
    Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
    $ echo "Hello Hacker News" | iter 20 base64 | head -1
    Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
    $ echo "Bonjour Hacker News" | iter 20 base64 | head -1
    Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
EDIT: I just remembered that when I found that out by pure serendipity more than 10 years ago I tweeted cryptically about it [1] and someone made a blog post on the subject which I submitted here but it didn't generate discussion [2]. Someone else posted it on Reddit /r/compsci and it generated fruitful discussion there, correcting the blog post [3]. The blog is down now but the internet archive has a copy of it [4].

[1] https://twitter.com/p4bl0/status/298900842076045312

[2] https://news.ycombinator.com/item?id=5181256

[3] https://www.reddit.com/r/compsci/comments/18234a/the_base64_...

[4] https://web.archive.org/web/20130315082932/http://fmota.eu/b...


Whoa, that's really neat!


Aww the article linked in Reddit is dead


As explained in my comment above, the blog is dead, but link 4 in my comment is a backup of this article by archive.org.


Oops missed that


Just a note on the Bash encoding method. It should be done with the -n option:

  $ echo -n "abcde" | base64
Otherwise, without the -n, echo appends a newline character to the end of the string, and that newline gets encoded too.


Simply don't use echo if you want predictable output. Use printf. https://linux.die.net/man/1/printf


This is the way

(however, the parent's use of "echo" would be fine as it's not using a variable and so won't be interpreting a dash as an extra option etc)


echo -n is not safe, because some versions of echo will just print "-n" as part of their output (and add a newline at the end, as usual). In fact, XSI-compliant implementations are required to do this (and the same for anything else you try to pass as an option to echo). According to the POSIX standard[1], "If the first operand is -n, or if any of the operands contain a <backslash> character, the results are implementation-defined."

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/e...


Thanks - I wasn't aware that echo was that problematic as I target bash (usually v4 and above) from my scripts.

I just tested it out with:

  sh /bin/echo -n "test"
  /bin/echo: 3: Syntax error: "(" unexpected
I didn't realise until recently that printf can also replace a lot of uses of the "date" command which is helpful with logging as it avoids calling an external command for every line logged.


> printf can also replace a lot of uses of the "date" command

Very cool (but bash-specific). Manual: https://www.gnu.org/software/bash/manual/bash.html#index-pri...

> sh /bin/echo -n "test"

This is gibberish -- it's trying to execute /bin/echo as if it was a shell script. Maybe you meant:

  sh -c '/bin/echo -n "test"'


Not all echos accept -n to suppress newlines.

printf is always the better choice.


Yeah, been bitten by it a few times. Somehow I keep forgetting.


There is also base64url, where the encoding uses different ASCII characters that are URL-safe. I have seen some devs use base64url but call it just base64, and that can lead to some problems for the unaware.

https://datatracker.ietf.org/doc/html/rfc4648#section-5
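
A small Python illustration of the two alphabets (the input bytes are chosen to hit the characters that differ):

    import base64

    data = b"\xfb\xef\xff"
    print(base64.b64encode(data))          # b'++//'  standard alphabet: + and /
    print(base64.urlsafe_b64encode(data))  # b'--__'  URL-safe alphabet: - and _

    # Decoding with the wrong variant either fails or silently mangles
    # the data, which is why mislabelling base64url as "base64" bites people.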


The problem with base64url is that ~ and . are not letters. This means that double-clicking on something encoded with base64url isn’t going to select the whole thing if you want to copy-paste it. This is annoying needless friction in a huge number of use cases. Base62 encoding (0-9A-Za-z) is almost as efficient as base64url and retains URL safety but is more easily copy-pasted.

If you want to eliminate ambiguity for human readers, you can drop to Base58 but in almost all cases, if you are BaseXX-encoding something, it’s long enough that copy-pasting is the norm, so it doesn’t usually matter.

https://en.wikipedia.org/wiki/Base62


Encoding and decoding base58 is a lot less efficient (needs arithmetic multiplication with a carry across the stream).

Base32 is sufficient in most cases and can avoid some incidental swear words.

If you want density go for Z85, which is a 4 -> 5 byte chunked encoding and therefore much more efficient on a pipelined CPU.

https://rfc.zeromq.org/spec/32/


Base32 has almost half the efficiency of Base62; Z85 suffers from the same problem as Base64 in terms of including word-breaking punctuation.


Base32 is far more efficient than that. Base32 allows you to encode 5 bytes in 8 symbols, as compared to Base64, which allows you to encode 6 bytes in 8 symbols. While the exact efficiency will vary based on how many bytes your message is, for the most part Base32 produces output that is only 20% larger than Base64, let alone Base62.


> The problem with base64url is that ~ and . are not letters.

No, typically the extra characters used are “-“ and “_”. That’s what the table in the IETF link shows.


The issue remains: "-" breaks double clicking to select the full string, which means you'll have to manually select all the characters before copying. Same thing happens with UUIDs: using double clicking, you can only select one block at a time.

This isn't a major issue, which means there's no easy answer and it generally comes down to preference if this is a requirement or not.


Double-clicking foo-bar_baz in GNOME Terminal selects the entire string. Anyway, this is something that is user-configurable surely?


Very few apps are that user-friendly. Besides, your terminal isn't either, since whether you want to select the entire string or not depends on its type.


It is in many terminals (I think it is in gnome-terminal) but not in major browsers and a lot of other software.


In XTerm it only selects the entire string with this in `~/.Xdefaults`:

  UXTerm*charClass: 33:48,36-47:48,58-59:48,61:48,63-64:48,95:48,126:48
There is also `UXTerm*on2Clicks` and `UXTerm*on3Clicks`.

FWIW, browser only selects either `foo` or `bar_baz`, but not the whole `foo-bar_baz`.


> The problem with base64url is that ~ and . are not letters. This means that double-clicking on something encoded with base64url isn’t going to select the whole thing

Well, you're in luck: tilde and dot aren't part of base64url


The issue remains: "-" breaks double clicking to select the full string, which means you'll have to manually select all the characters before copying. Same thing happens with UUIDs: using double clicking, you can only select one block at a time.

This isn't a major issue, which means there's no easy answer and it generally comes down to preference if this is a requirement or not.


Since the encoding is explicitly for use in URLs and filenames, and those generally aren’t selectable by double-clicking either, I don’t see what the problem is.


Besides, regular Base64 has the same problem with "/" and "+".


Base64url also typically omits padding.

Since a base64 string with padding is always guaranteed to be a multiple of four characters long, if you get a string that is not a multiple of four in length, you can figure out how much padding it should have had, which tells you how to handle the last three bytes of decoding.

Which makes it a little confusing why base64 needs == padding in the first place.
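
If you need to decode unpadded base64url, a minimal Python sketch of the re-padding trick described above:

    import base64

    def b64url_decode(s: str) -> bytes:
        # Re-add the padding implied by the length before decoding.
        return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

    print(b64url_decode("QUJD"))    # b'ABC'  (no padding needed)
    print(b64url_decode("QUI"))     # b'AB'   (one '=' re-added)
    print(b64url_decode("QQ"))      # b'A'    (two '=' re-added)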


Any time base conversion comes up, I shamelessly plug my arbitrary base converter: https://convert.zamicol.com

The base64 under "useful alphabets" is the "natural" base, i.e. iterative divide-by-radix conversion. The RFC's "bucket" conversion base is under extras.


If you ever need to encode something and expect people to type it out... I recommend using https://en.wikipedia.org/wiki/Base32 instead. Nothing more frustrating than realizing (often because of bad fonts) that I was an l or a 1, or that o was an O, or was it a 0?


Perhaps a bit pedantic, but it would be more accurate to say that Base64 encodes binary data into a subset of ASCII characters, since ASCII has 128 code points - 95 printable characters and 33 control characters - whereas Base64 uses 64 - 65 if we include the padding - of those.


It didn't go into detail about the purpose of the = / == padding, and it also didn't show in the example how to handle data that cannot be divided into groups of 6 bits without bits left over. I think I have an understanding of how to do it, but it would be nice to be certain. Could someone address the following 2 questions in a short and exhaustive way:

- When do you use =, when do you use ==, and do you always add = / ==, or are there cases where you don't add = / ==?

- How do you precisely handle leftover bits, for example for the string "5byte"? And is there anything to consider when decoding?


Your questions are related.

For context: since a base64 character represents 6 bits, every block of three data bytes corresponds to a block of four base64 encoded characters. (8 * 3 == 24 == 6 * 4)

That means it's often convenient to process base64 data 4 characters at a time. (in the same way that it's often convenient to process hexadecimal data 2 characters at a time)

1) You use = to pad the encoded string to a multiple of 4 characters, adding zero, one, or two as needed to hit the next multiple-of-4.

So, "543210" becomes "543210==", "6543210" becomes "6543210=", and "76543210" doesn't need padding.

(You'll never need three = for padding, since one byte of data already needs at least two base64 characters)

2) Leftover bits should just be set to zero; the decoder can see that there's not enough bits for a full byte and discard them.

3) In almost all modern cases, the padding isn't necessary, it's just convention.

The Wikipedia article is pretty exhaustive: https://en.wikipedia.org/wiki/Base64
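
The same three cases in Python, for anyone who wants to poke at them (standard library only):

    import base64

    print(base64.b64encode(b"A"))      # b'QQ=='   1 byte  -> 2 chars + '=='
    print(base64.b64encode(b"AB"))     # b'QUI='   2 bytes -> 3 chars + '='
    print(base64.b64encode(b"ABC"))    # b'QUJD'   3 bytes -> 4 chars, no padding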


Padding is only required if concatenating / streaming encoded data. I.e. when there are padding chars _within_ the encoded stream.

Padding chars at the end (of stream / file / string) can be inferred from the length already processed, and thus are not strictly necessary.

Note that how padding is treated is quite subtle, and has resulted in interesting variations in handling, as discussed at: https://eprint.iacr.org/2022/361.pdf


from the article: "Every Base64 digit represents 6 bits of data. There are 8 bits in a byte, and the closest common multiple of 8 and 6 is 24. So 24 bits, or 3 bytes, can be represented using four 6-bit Base64 digits."

So you're essentially encoding in groups of 24 bits at a time. Once the data ends, you pad out the remainder of the 24 bits with = instead of A because A represents 000000 as data.

For the record, I had to read the whole thing twice to understand that too.


Not quite. The ‘=‘ isn’t strictly padding - it’s the padding marker. You pad the original data with one or two bytes of zeroes. Then you add ‘=‘ to indicate how many such bytes you had to add.

This is because if you’ve only got one of the three bytes you’re going to need, your data looks like this:

   XXXXXXXX
Then when you group into 6 bit base64 numbers you get

   XXXXXX XX????
Which you have to pad with two bytes worth of zeroes because otherwise you don’t even have a full second digit.

   XXXXXX XX0000 000000 000000
so to encode all your data you still need the first two of these four base64 digits - although the second one will always have four zeroes in it, so it’ll be 0, 16, 32, or 48.

The ‘=‘ isn’t just telling you those last 12 bits are zeroes - they’re telling you to ignore the last four bits of the previous digit too.

Similarly with two bytes remaining:

   XXXXXXXX YYYYYYYY
That groups as

   XXXXXX XXYYYY YYYY??
Which pads out with one byte of zeroes to

   XXXXXX XXYYYY YYYY00 000000
And now your third digit is some multiple of 4 because it’s forced to contain zeroes.

Funny side effect of this:

Some base64 decoders will accept a digit right before the padding that isn’t either a multiple of four (with one byte of padding) or of 16 (with two).

They will decode the digit as normal, then discard the lower bits.

That means it’s possible in some decoders for dissimilar base64 strings to decode to the same binary value.

Which can occasionally be a security concern, when base64 strings are checked for equality, rather than their decoded values.
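
A tiny Python demonstration of that last point (CPython happens to be one of the lenient decoders that discards the extra bits):

    import base64

    # 'QQ==' is the canonical encoding of b'A'; 'QR==' sets one of the
    # "discarded" low bits, and lenient decoders silently drop it.
    print(base64.b64decode("QQ=="))                              # b'A'
    print(base64.b64decode("QR=="))                              # also b'A'
    print(base64.b64decode("QQ==") == base64.b64decode("QR=="))  # True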


Here is my Base64 encoder shader:

https://github.com/Rezmason/excel_97_egg/blob/main/glsl/base...

I got it down to about thirteen lines of GLSL:

https://github.com/Rezmason/excel_97_egg/blob/main/glsl/base...

I use it for Cursed Mode of my side project, which renders the WebGL framebuffer to a Base64-encoded, 640x480 pixel, indexed color BMP, about 15 times per second:

https://rezmason.github.io/excel_97_egg/?cursed=1


Blast from the past, the Excel easter egg. Solid.


There's some additional interesting details, and a surprising amount of variation in those details, once you start really digging into things.

If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits. Which means you have an extra 4 or 2 bits which don't encode anything.

In the RFC, encoders are required to set those bits to 0, but decoders only "MAY" choose to reject input which does not have those set to 0. In practice, nothing rejects those by default, and as far as I know only Ruby, Rust, and Go allow you to fail on such inputs - Python has a "validate" option, but it doesn't validate those bits.

The other major difference is in handling of whitespace and other non-base64 characters. A surprising number of implementations, including Python, allow arbitrary characters in the input, and silently ignore them. That's a problem if you get the alphabet wrong - for example, in Python `base64.standard_b64decode(base64.urlsafe_b64encode(b'\xFF\xFE\xFD\xFC'))` will silently give you the wrong output, rather than an error. Ouch!

Another fun fact is that Ruby's base64 encoder will put linebreaks every 60 characters, which is a wild choice because no standard encoding requires lines that short except PEM, but PEM requires _exactly_ 64 characters per line.

I have a writeup of some of the differences among programming languages and some JavaScript libraries here [1], because I'm working on getting a better base64 added to JS [2].

[1] https://gist.github.com/bakkot/16cae276209da91b652c2cb3f612a...

[2] https://github.com/tc39/proposal-arraybuffer-base64
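
To make the Python behaviour described above concrete, a small stdlib-only sketch:

    import base64

    # Non-alphabet characters (here a newline and a '!') are silently
    # dropped by default...
    print(base64.b64decode("QUJj\nRA=="))   # b'ABcD'
    print(base64.b64decode("QUJj!RA=="))    # b'ABcD' as well

    # ...and validate=True only rejects non-alphabet characters; it still
    # doesn't check the unused trailing bits mentioned above.
    try:
        base64.b64decode("QUJj!RA==", validate=True)
    except Exception as e:
        print("rejected:", e)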


I am still not convinced by the reasons for using base64.

1. "Another common use case is when we have to store or transmit some binary data over the network that's supposed to handle text, or US-ASCII data. This ensures data remains unchanged during transport."

What does it mean by a network that handles text? Why should the network care about the kind of data in the packet? If the receiver's end is expecting binary data, then why is there a need to encode it using base64? Also, if data is changed during transport, like "bit-flipping" or some corruption, shouldn't it affect the credibility of the base64-encoded data as well?

2. "they cannot be misinterpreted by legacy computers and programs unlike characters such as <, >, \n and many others."

My question here is: what happens if the legacy computers interpret characters like < and > incorrectly? If you sent binary data, isn't that better, since it's just 0s and 1s and only the program that understands that binary data will interpret it?
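
Not the article's wording, but one way to see the point: many carriers (JSON documents, email bodies, HTTP headers) are defined in terms of characters rather than arbitrary bytes, so raw bytes either can't be placed there at all or risk being translated along the way. A small Python sketch of the round trip:

    import base64
    import json

    payload = bytes([0, 10, 13, 60, 62, 255])   # NUL, LF, CR, '<', '>', 0xFF

    # json.dumps(payload) would raise TypeError: bytes aren't text, and the
    # control bytes could be stripped or translated by intermediate software.
    wrapped = json.dumps({"blob": base64.b64encode(payload).decode("ascii")})
    print(wrapped)

    # The receiver reverses it losslessly:
    back = base64.b64decode(json.loads(wrapped)["blob"])
    assert back == payload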


Thanks for the read, it's a very simple encoding but I never decided to find out how it works either, good to know.


Can someone give a plausible explanation why base64 is used instead of base85? Both are transcoded in linear time (Even base91 is possible in linear time but potentially still slower due to more difficult SIMD).

People have been making the argument for over a decade that base64 is incumbent and so people stick with it due to interoperability. But base85 represents a 20% compression gain for basically free from a computational perspective. Isn't that worth switching over as a widely used standard?


Base85 uses ASCII characters 33 (“!”) through 117 (“u”), which includes a number of inconvenient characters, like backslash and single and double quotes. The reason to use an encoding at all instead of just raw binary is not just to avoid nonprintable characters, but also to avoid certain printable characters with meta semantics, like string delimiters and escape characters.

An alternative would be the encoding specified by RFC 1924, which uses a different, noncontiguous set of characters. It still has the drawback that dividing by 85 is a bit slower than dividing by 64 (which is just a bit shift).

Last but not least, Base64 has the benefit of being easily recognizable by a human. Due to its relatively restricted character set, it doesn’t look like just line noise, but also doesn’t look like more intelligible syntax, or like hex, etc. It sits in a middle sweet spot.
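
For reference, the characters classic Ascii85 draws on can be listed with a couple of lines of Python (plain arithmetic, no library involved):

    # Classic Ascii85 uses the 85 ASCII characters from '!' (33) to 'u' (117).
    ascii85_chars = "".join(chr(c) for c in range(33, 118))
    awkward = [ch for ch in "\"'\\<>&" if ch in ascii85_chars]
    print(awkward)   # ['"', "'", '\\', '<', '>', '&'] -- all need escaping somewhere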


One of the reasons for the shift from uuencoding to base64 is that the latter uses almost exclusively alphanumeric characters and consequently is more robust against cross-platform or ‘smart’ transformations or quoting or escaping issues.


Was it free when base64 was invented?

CPUs were slower relative to memory in those days, so it may have made a difference. Many more programs were CPU bound, and people wrote assembly more often

Also base64 is just easier to code


20% better isn't enough to motivate every single producer and consumer to incur the switching costs or taking on an IPv6-scale migration process.


Note: depending on the language, you might have to do more work to ensure that the original data is in a format that base64 encoding will support.

For example, in JavaScript, that involves making sure it's a well-formed string. I did a write-up of that here: https://web.dev/articles/base64-encoding.


Funny. I'm trying to write a base50 encoder now. No good reason, just 'cuz. Can't quite figure out what to do with the half a bit. Gotta carry it forward somehow until I have a full char again but haven't come up with a good scheme.


Interesting choice. You won't be able to do it in batches because 50^n = 2^m doesn't have a solution with integer n and m. So you basically have to implement a bignum with division and modulo operations.
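
A minimal Python sketch of that bignum approach (the 50-character alphabet is arbitrary, and the decoder has to be told the original length so leading zero bytes survive):

    import string

    ALPHABET = string.ascii_letters[:50]   # arbitrary 50-symbol alphabet, illustration only

    def encode_base50(data: bytes) -> str:
        # Treat the whole input as one big integer, then repeatedly divide by 50.
        n = int.from_bytes(data, "big")
        out = []
        while n:
            n, r = divmod(n, 50)
            out.append(ALPHABET[r])
        return "".join(reversed(out)) or ALPHABET[0]

    def decode_base50(text: str, length: int) -> bytes:
        n = 0
        for ch in text:
            n = n * 50 + ALPHABET.index(ch)
        return n.to_bytes(length, "big")

    msg = b"HN"
    assert decode_base50(encode_base50(msg), len(msg)) == msg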


You could also write an arithmetic encoder for a source of 50 uniform symbols. You could implement that without bignum.


You can't represent 1/50 exactly in binary, thus arithmetic coding can't be equal to the usual numeric base conversion.

Arithmetic coding will produce an encoding from binary to 50-ary symbols. But then again, there are much simpler ways to do that if exact base conversion is not a requirement (e.g. choose a good enough approximate solution for 2^n = 50^m and do the conversion in n-bit chunks).


> You can't represent 1/50 exactly in binary

You can though? You represent the encoder state in fractions of 50. In practice, it means the encoder state will not use all possible binary strings. But you don't need to trade-off in approximation size.

I am not 100% certain, but pretty convinced you can work with much smaller encoder states of only a few bytes, achieving the same compression performance of chunk approximations requiring kilobytes in size of memory.


On the other hand, base91 is a real thing


Base91 encodes 13 bits (8192 possibilities) in two base91 symbols (8281 possibilities), so it's not 100% efficient.


write a base-n encoder and use 50 as a parameter


What do you mean, half a bit? A typical base50 encoder would use 50 characters, so say 309 would get encoded as 69 (309 = 6 * 50 + 9 * 1).


He meant approximately half a bit. Base64 uses 64 characters, which means 6 bits per character. 50 characters gives you about 5.64 bits per character. Unlike Base64's 3-bytes-equals-4-characters, that never comes out evenly.


Fun fact: GPT speaks base64 as a native language, and you can instruct it - in base64 - to only respond to you in base64, and it will comply and switch to only speaking base64.


I throw base64 in at the end when I'm teaching students about hex and binary. It's always fun to ask them to come up with the digits.


A use case I had for it recently was to embed multiple scripts and even files into one script.


Embedding images with base64 encoding is pretty cool. I wonder why it is not used more.
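
In case anyone wants to try it, the usual trick is a data: URI; a small Python sketch (the filename is just a placeholder):

    import base64
    import pathlib

    png = pathlib.Path("icon.png").read_bytes()
    data_uri = "data:image/png;base64," + base64.b64encode(png).decode("ascii")
    print(f'<img src="{data_uri}" alt="icon">')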


I think mostly because it's less efficient in byte transfer size, roughly +25%, and also it won't be downloaded as an image in parallel.


Gzip will fix this problem. But these embedded images are not cacheable.


Gzip or Brotli will help, but the result will still be bigger when base64 is in the middle.


The only problem with using any sort of compression in HTTP is that unless you add extra countermeasures, you open your page up to the BREACH vulnerability:

https://en.wikipedia.org/wiki/BREACH


That is a point


It's not cool when all you know is Base64. Vivaldi uses it to store inline thumbnails in the \Vivaldi\User Data\Default\Sessions Tabs_ and Session_ files. A simple file containing a list of 100 tabs takes multiple megabytes. They also used to store inline thumbnails in the Bookmarks file; a ~200-bookmark file was taking 10-20MB!


Base64 encoding is about 75% efficient, so that means that file contains about 7-15 megabytes of thumbnails.

As these are probably not compressible that means there really isn’t a whole lot lost compared to the optimal solution.


I forgot to mention the best part, the punch line: Vivaldi rewrites Session_/Tab_ files every time you close a tab (and I think sometimes on open, or maybe on a timer), so that in case of a browser crash you can recover your session with all tabs intact. Yes, that's right, Vivaldi rewrites tens of megabytes on a whim. For me the average is ~2.6GB per day / ~1TB/year.

Just to nail the point home - a couple of the latest snapshots have a bug where they don't clear previous entries in Tabs_ https://www.reddit.com/r/vivaldibrowser/comments/17aqe76/tim... After ~23 restarts the Tabs_ file grew to 144MB, still being rewritten every time you close a tab :o, and it grows by 6MB with every restart.

How wasteful is Vivaldi/Chrome https://www.reddit.com/r/vivaldibrowser/comments/xu6o3k/bad_... :

\AppData\Local\Vivaldi\User Data\Default\Network\TransportSecurity ~700KB file is being regularly deleted and recreated

https://imgur.com/JAhCV3C

This 700KB file is being rewritten 2-4 times _per minute_ while only updating ~10 entries inside. Here is an example of what's inside TransportSecurity:

https://chromium.googlesource.com/chromium/src.git/+/080fbdc...

Those entries aren't even important for permanent storage ... and the only data changing is the expiry. WTF is going on?

\AppData\Local\Vivaldi\User Data\Default\Preferences again ~700KB, also only useless stuff changes between rewrites, like:

    "last_visited"
    "language_model_counters"
    "predictionmodelfetcher"
    ""expiration"
    "last_modified"
    "visits"
    "lastEngagementTime"
Useless stats; if anything (why would I need those exactly?) those all belong in a database file somewhere. A ~100-byte change forces a 700KB file rewrite multiple times per minute.

\AppData\Local\Vivaldi\User Data\Default\Sessions\Tabs_ multi-megabyte file (~8MB for ~300 tabs), force-written with every closing of a tab. Why is it so big? After all, it's just a list of open tabs, right? The answer is pretty terrifying. It's full of base64-encoded JPEG thumbnails for _every single tab preview_.

    "thumbnail":"data:image/jpeg;base64
TEXT-encoded images in a constantly rewritten file, instead of using the browser image cache!

\AppData\Local\Vivaldi\User Data\Default\Sessions\Session_ same as Tabs_, ~same contents including useless base64 encoded jpegs inside, rewritten together with Tabs_.

\AppData\Local\Vivaldi\User Data\Local State 12KB; at this point it's almost insignificant that we are constantly rewriting this file. This one keeps such "important" garbage as

    "browser_last_live_timestamp"
    "session_id_generator_last_value"
    "reloads_external"
so more useless stats; a 20-byte change forces rewriting a 12KB file.

I sort of understand the logic behind the decision that led to making Tabs_ and Session_ files slam the user's SSD on every tab interaction - someone was very concerned with Vivaldi constantly losing tabs and sessions on browser crash. But the way this got implemented is backwards and doesn't exactly achieve the intended goal.

1 Saves to \AppData\Local\Vivaldi\User Data\Default\Sessions happen ONLY on tab close, not on tab open. You can still open 10 tabs, have the browser crash, and lose those tabs.

2 Why would you store base64 encoded jpeg thumbnails when you have image cache storage in the browser?

3 Why flat-file rewrites instead of leveraging SQLite, IndexedDB or LevelDB? All are available natively in the Chrome codebase. The first two are journaled, and the third still claims crash resistance.

Why am I making this post? 'I mean it's one banana, Michael. What could it cost, $10?' https://www.youtube.com/watch?v=Nl_Qyk9DSUw. Let's count together an average daily session with Vivaldi being open for 6 hours and someone opening and closing ~100 tabs. TransportSecurity: 360 minutes x 0.7 x 2 = 500MB. Preferences: 360 minutes x 0.7 x 2 = 500MB. 100 tab closes: 100 x 8 x 2 = 1.6GB. That's 2.6GB per day, ~1TB/year of system SSD endurance burned doing useless writes. A reminder: "A typical TBW figure for a 250 GB SSD lies between 60 and 150 terabytes written".

Edit: Addendum for people assuming it's all Chromium's fault. Chrome 106.0.5249.91 was released 2 days ago. Preferences and TransportSecurity are indeed being written ~1/minute :/, but Tabs_ and Session_ are NOT written _on every single tab close_ like in Vivaldi. A 15-minute Chrome session resulted in 3 Session_ and 2 Tabs_ writes. Chrome also doesn't appear to be storing base64-encoded thumbnails in Tabs_/Session_ files. Looks like the issue is caused by Vivaldi's own hack after multiple complaints about crashes resulting in lost data.

Additionally, even if you don't care about SSD wear, there is also the issue of additional power draw for mobile users. Heavy IO is not cheap.

Edit2: Found another one:

Vivaldi creates 9 files in \AppData\Local\Vivaldi\User Data\Default\JumpListIconsVivaldiSpeedDials and immediately deletes them.

What is a Jump List anyway? MS says "A jump list is a system-provided menu that appears when the user right-clicks a program in the taskbar or on the Start menu." Does Vivaldi support Jump Lists in the first place? Chrome does, so probably yes.

How To Get Back The Jump List Of Google Chrome In Taskbar https://www.youtube.com/watch?v=WG1tv-kceF4

Sure enough after enabling "show recently opened" Vivaldi does populate it with Speed Dial items. Why is Vivaldi refreshing JumpListIconsVivaldiSpeedDials so often? It tries to Regenerate favicons for jump list items:

- Even when Jump List is disabled.

- Even if NOTHING changed on the Speed Dial.

- Even if Speed Dial "show Favicons" Setting is _Disabled_.

- Despite ALL Vivaldi Jump list Speed Dial entries using default Vivaldi icon and NEVER using favicons.

- EVERY time Session_ file is written, and those are written on _every tab close_ https://www.reddit.com/r/vivaldibrowser/comments/xu6o3k/bad_....

- and then DELETES all the generated data anyway, making the whole operation a huge waste of CPU and IO resources.


Thanks, I will not be tuning in to your TED talk. Do you really think anyone is going to read through all that? If you dislike that browser so much, consider using one of the many alternatives.


It's ok, it was cut & paste from the Vivaldi forum. You didn't understand the problem of using Base64 told in four sentences; now you complain the detailed explanation is too long :). It doesn't matter how efficient Base64 is when it encourages you to stuff immutable bulk data into hot text/ini/config files.


No cache in web context


Historically, base64 was developed to make binary data printable.


I thought it was developed for sending binary files over SMTP/Usenet.


possible. "printable" seems to be just the category but not that it was actually printed. of course some people print base64 of encrypted passwords (for offline storage) that would otherwise contain unprintable characters.

usenet exists since 1979/80 [1] and base64 was first described in 1987 [2].

1: https://en.wikipedia.org/wiki/Usenet 2: https://base64.guru/learn/what-is-base64


Maybe, but uuencode was there first! :)


MIME picked Base64 because uuencode uses a larger character set which, while fine for its original uu (Unix-to-Unix) purposes, made it less robust to cross-platform weirdness.


uuencode was used for SMTP/Usenet. Base64 became popular as part of MIME if I remember correctly.


> Base64 encoding takes binary data and converts it into text, specifically ASCII text.

This perpetuates the idea that there's "binary" and "text", which is incorrect, but it also implies you can't encode ordinary ASCII text into base64.


There is binary and text, though. Many bit sequences aren't valid in a given text encoding (such as UTF-whatever) and so trying to use them as text is an error.

I understand what you mean, of course text can be represented and treated as binary, and often the inverse as well, although it isn't necessarily true. Even in Windows-1252, where the upper 127 characters are in use, there are control characters such as null, delete, and EOT; I'd be impressed if a random chat program preserved them across the wire.

I also don't read an implication that ASCII couldn't be converted to b64


The article actually shows an example of text-to-base64 encoding. But base64 is generally used for encoding data in places where only ASCII is admissible, like URLs and inlined binary blobs.


It's part of a lot of web standards and also commonly used for crypto stuff. E.g. the plain text files in your .ssh directory are typically in base64 encoding; if you use basic authentication that's $user:$passwd base64 encoded in a header; you can indeed use it to have images and other inline content in the url in web pages; email attachments are usually base64 encoded. And so on. One of those things any decent standard library for just about any language would need.
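
The Basic auth case, spelled out in Python (credentials made up, obviously):

    import base64

    user, password = "alice", "s3cret"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}"}
    print(headers)   # {'Authorization': 'Basic YWxpY2U6czNjcmV0'}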


> Perpetuates the idea that there's "binary" and "text" [...]

Well, there is binary, and there is text. Sure, all text - like "strawman" ;) - is binary somehow, but not all binary data is text, nor can it even be interpreted as such, even if you tried really hard ... like all those poor hex editors.


All text is binary. Everything is binary. "Somehow" is an odd choice of words here.


Text is text. Text is encodable as binary. If text was binary, that encoding would be unique, but it isn't. Latin-1 encodes "Ü" differently than UTF-8 does, and even the humble "A" could be a 0x41 in ASCII and UTF-8 or a 0xC1 in EBCDIC


This is just ... not how I look at it at all. Everything is represented in powers of two... binary.


Everything is representable as binary, but not everything is binary. The abstract concept of 'A' has no inherent binary representation, and 0x41 is just one of the options. Representing Pi or e calls for even more abstract encoding, even though they are a very specific concept. Text is not binary, but text has to be encoded to binary (one way or another) in order to do any kind of computer assisted processing of it. But we tend to think of text in abstract terms, instead of "utf-8 encoded bytes", hence this abstraction is useful.


I fully understand that there are different binary representations for the same character depending on the encoding. It is still all binary.


What if the computer isn't binary, but it needs to talk to a binary computer? Then you definitely can't go "oh, this text is binary anyway, I can just push it on the wire as-is and let the other end figure it out".


What? Is a ZIP file not binary because it’s not a valid tar.gz file? Text just means “something that can be printed” and by this definition not even all valid ASCII sequences are text.


Came to say this.


Most developers (etc) use "binary data" or "a binary format" as a shorthand for "not even remotely ASCII or Unicode" - as opposed to the opposite, like a .txt file or HTML or Markdown, where it's readable and printable to the screen. Of course if it's in a file or in RAM or whatever, it's always ultimately stored as 0s and 1s but that's not the sense we mean here.


> but also implies you can't encode ordinary ASCII text into base64.

I don't think it implies that at all.

Text isn't binary. Text can be encoded in binary, and there are different ways to do it. ASCII, UTF-8/16/32, latin-1, Shift-JIS, Windows-1252, etc. Many can't encode all text characters, especially languages that don't use the Latin alphabet.

The fact that you have to ensure you're using the correct encoding when processing text from a binary stream is proof enough that text isn't binary. Python before 3.x allowed you to treat binary and text as equal, and it often caused problems.



