Calling NSA to find your encryption key after a few bits were flipped (2010) (astroengineer.wordpress.com)
265 points by marcodiego on April 24, 2022 | 122 comments



Bit flips are scary even on Earth.

At a previous job, we had a customer who suddenly couldn’t send us email anymore. When their IT sent us the server logs to “prove” it was our fault, we saw that one letter in the cached MX record was wrong. This was puzzling, until I looked at the ASCII table and verified that the difference was exactly one bit.

We never found out where in the name resolution process the bit got flipped. The problem healed itself a few days later when the DNS cache expired, so it wasn’t worth further investigation.

That really made me pause and wonder how often random bits are wrong in other data.
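
You can check a pair of characters the same way with a couple of lines of Python (a minimal sketch; 'a' and 'q' are just an illustration, not the characters from the logs):

    def bit_distance(a: str, b: str) -> int:
        # Number of bits that differ between two ASCII characters.
        return bin(ord(a) ^ ord(b)).count("1")

    print(bit_distance("a", "q"))  # 'a' is 0x61, 'q' is 0x71 -> exactly 1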


Bit flips are quite useful for sorting huge arrays of data: https://news.ycombinator.com/item?id=28766154


That's not a proper sort. A sort is not just a function that takes in a list and returns a list that is sorted. The result must also include all of the elements that you had when starting. That property isn't checked by the code you linked.
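
For instance, a minimal Python sketch of the full property (sortedness alone isn't enough; the output must also be a permutation of the input):

    from collections import Counter

    def is_valid_sort(inp, out):
        # A proper sort result is ordered AND contains exactly the
        # same elements (with multiplicity) as the input.
        ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
        return ordered and Counter(inp) == Counter(out)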


Someone should calculate how many copies of the array you need (as a function of input data size) to make that sorting strategy work with reasonable assurance that the original data is maintained.


Do you know of any sort routine that guarantees the result includes all of the input elements even in the event of random bit flips? If not, do you consider the sorting routines in your everyday libraries improper as well or is that a special restriction reserved for this particular implementation?


No, when we normally talk about code we have some vague notion of a formal system that works according to its specification, and thus that !isSorted loop example is pretty clearly invalid. And even if you do allow for random bit flips, surely that’s more likely to happen to the boolean returned by isSorted than for enough bits to flip to make the array sorted.


The objection that the GP raised (the lack of a check that the result includes all of the elements of the input) is not needed under your notion of a system that works according to its specification: under that specification, it is a loop invariant. The check is also not useful there, as the loop would simply be infinite. Your view of why cosmic ray sort is not a proper sort therefore contradicts the GP's explanation, even if you share the GP's conclusion. My question was aimed at further exploring the GP's view; I think an opposing view does not help here.


You're arguing that the code that uses internal failure as functionality is invalid?

Damn, I wish I had your kind of expertise.


That code has a race condition :^)


Did you mean ray's condition?


Oddly apropos anecdote to your pun:

About 15 years ago, while working on Windows at Microsoft, a test machine sitting at my desk hit a kernel panic (BSOD). As was standard on the test team, the machine was already set up for kernel debugging, and so I set out to debug it a bit in order to file a decent bug report.

Hours later, I couldn't make sense of it (I wasn't super experienced at this point). A few of the nearby devs couldn't either, and a small troop of us curious enough about the puzzler eventually escalated to the resident wizard, Raymond Chen[1]. Within 15 minutes of checking our work and poking at the machine, he traced the root cause down to a bit flip.

Ray's condition indeed. :)

[1] https://devblogs.microsoft.com/oldnewthing/


Great story, a true random glitch.

Interestingly, a blog post of his was up on HN front page the other day: "The x86 architecture is the weirdo, part 2" https://news.ycombinator.com/item?id=31077912


Ray complained in 2005 that Microsoft sees lots of weirdo crash reports from computers with overclocked CPUs https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...


Awwww, that Windows Me must have been overclocked out-of-the-box, considering the number of BSODs :D


Happens even in end-user browsers, resulting in bit-flipped domains being looked up: https://securitee.org/files/bitsquatting_www2013.pdf


Bitsquatting is a great name. I’m not 100% sure that this isn’t just typosquatting though.


It might be a subset of typosquatting, but it's distinct from what the term usually refers to: the typos that are statistically likely to result from user input (swapping neighboring keys, duplicating or omitting letters, etc.) occur at the level of keys and characters, whereas those that result from bitflips ("adjacent" bit sequences) occur invisibly, without any manual mistake, at the level of bit- or byte-streams. The two form mostly exclusive sets of domains that could be targeted. A determined attacker would likely target both.


It isn't simple typosquatting because it can involve errors that aren't plausible typos (like "mic2osoft.com" for "microsoft.com"), or domains that a user would never enter manually (like "fbcdn.com" or "ytimg.com", which are used for content delivery at Facebook and YouTube, respectively).
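
A minimal Python sketch that enumerates such single-bit-flip neighbors (the allowed character set is a simplifying assumption; real studies also consider flips landing in the dot or outside hostname characters):

    import string

    ALLOWED = set(string.ascii_lowercase + string.digits + "-.")

    def bitsquat_candidates(domain):
        # All strings one bit flip away from `domain` that still
        # consist of plausible hostname characters.
        out = set()
        for i, ch in enumerate(domain):
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit))
                if flipped != ch and flipped in ALLOWED:
                    out.add(domain[:i] + flipped + domain[i + 1:])
        return out

    # 'r' (0x72) with bit 6 flipped is '2' (0x32), so this set
    # includes the "mic2osoft.com" example above.
    print(sorted(bitsquat_candidates("microsoft.com")))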


Veritasium has a video explaining cases of bit flips that you might find interesting:

https://youtu.be/AaZ_RSt0KP8


It's scary to think an election can be flipped due to cosmic rays

https://en.wikipedia.org/wiki/Electronic_voting_in_Belgium


Can't wait for the History Channel documentary on this... "Aliens"


That is why I migrated from RAID 1 to ZFS on my home server and do weekly scrubs. Not sure if this would have prevented your case, though.


I have super bivalent opinions about Intel. This is the opposite of ambivalent: it means heavily charged in both directions, but cancellation is not allowed.

So that's why they should have let all their chips do ECC instead of making it a premium feature; it would have been better for their brand as "Chipzilla" and had no real cost. And it's dangerous! In fact, a soft error at sea level killed an operating system update on me, and I lost tens of thousands in data and money. I have standing to sue Intel, until this sentence clause in which I hereby forfeit the suit, together with requesting them to reconsider ECC (error correction codes) in all their chips as a safeguard needed due to Moore's Law, which was their business plan. Just give it a thought, Intel.


> had no real cost.

Can you expand on that? Typical coding schemes use redundancy to be able to detect failures. I'm no info theory expert but I presume there's always a cost. And if you can design a smart enough algorithm you can balance the cost with the actual likelihood of failure to find some kind of equilibrium. Of course the algorithm development has its own cost (NRE, time-to-market).

It's not clear whether you're talking about modern computers or dawn-of-PCs when you say "should have." It might be a small cost now, but back in the day it would have been mega expensive.


You bought a non-ECC part from Intel and some memory from a (different) vendor, and the memory flipped a bit.

Do you have any evidence or reason to believe that memory error was outside the advertised error rate for the device, or due to some other wrongful conduct by the vendor? And how would you connect that to wrongful action by Intel which contributed to your damages?

I'm no lawyer but I suspect your case is not quite the bargaining chip that you seem to believe.


> I have super bivalent opinions about Intel. This is the opposite of ambivalent: it means heavily charged in both directions

I mean, if you're at the level of manipulating the form of the word, that's already what "ambivalent" means. Strong in both directions. Bi- makes no sense as an opposite of ambi- since ambi- means "both", and bi- means "double". There is no negative element present in either word.


It's your error: having a system with important data, no actual/realtime backup, no second system, no plan to recover from a failed update, and no ECC is YOUR error alone.

However, Intel should have made ECC the standard and not just for $1000+ Xeons.


It's turtles all the way down, sadly.

The bitflip could have hit an OS distribution service. Yes, the target machines should check the checksum, but the distribution service could have flipped the bit before the checksum was computed.

But, yeah, ECC should be the standard. Also, Intel should be better about documenting where they've left holes in their online checksums, machine check exception implementations, etc.


>The bitflip could have hit an OS distribution service.

No, because the updates are signed (integrity).


>However, Intel should have made ECC the standard and not just for $1000+ Xeons.

Agree 100%. IMHO, the choice between "domestic" and "industrial-strength" should not mean choosing between different degrees of risks of failure.


I gather DDR5 will have ECC as standard due to the extreme density of memory, which will bitflip a lot more than usual. Yay for consumers.


DDR5 supports on-chip ECC but the extra parity bits we typically associate as "ECC" are still as optional as ever, motherboard manufacturers will still not bother to route those signals anyway, and Intel will still demand you give up overclocking and pay more for Xeon in order to use ECC sticks.


I did some reading up on this just now. You are of course right that on-chip ECC will only handle bitflips inside the chip. To ensure the integrity of signals between the memory and the processor, traditional ECC will still be required. With DDR5 it will be necessary to have 4 parity chips, whereas with DDR4 you only need 2, so true ECC-capable memory will be considerably more expensive.


> you give up overclocking and pay more for Xeon in order to use ECC sticks.

That's my point: a virtuous monopoly wouldn't do that. It would allow at least some way to have both. Especially since soft errors become more likely with smaller transistors.


"It's your error, having a system with important data no actual/realtime backup"

Have you ever taken picture of anything important with your phone, like a crime, or a car accident? Have you ever called 911 or sent money?

How dare you use an unreliabke system without ECC, what if a random bitflip would cause it to send 10x more money or data woupd be lost without realtime backup!

This disrespect to users and wanky attitude is the cardinal sin of our industry, people's lives are at stake and it's their fault for trusting us.


I dunno, I think both points of view are valid, and it's important to be pragmatic. Stuff fails all the time and it's usually not a big deal. Every holiday season, all the cash registers at Cost Plus stop working, and I stand in line for an hour to buy stocking stuffers. All three AV nerves on my heart failed and I had to get a pacemaker. The first night it skipped 5 beats and the doctors scratched their heads for a while. Turns out it was a loose ring terminal! I have no idea if it has ECC memory, but probably not, because it's a pain to replace the battery...

The probability of most things randomly failing at any particular instant is approximately zero. The probability goes up as you increase the size of the time window, or push things to their limits. You can trust your phone to store an important photo for a few days, as long as you don't run it through a washing machine or something. A few years? You're taking a risk. Many people take that risk, and it works out fine for them. I've had phones fail, hard drives fail, heart nerves fail. I take reasonable precautions to back up stuff I care about. I also have plenty of data that I would be bummed to lose that I haven't backed up yet. I'll get to it one day, or maybe it will get corrupted and I'll be bummed.


You're unlikely to notice a single bit flip in a picture you take.


I just tested this on a JPEG. Flipping any bit in the first few bytes of a JPEG renders it unopenable on my Fedora system. I get that it's just a header and could probably be fixed, but there are consequences for bitflips, even on images.
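
For anyone who wants to reproduce that experiment, a minimal sketch (the filename is a placeholder):

    # Flip one bit near the start of a JPEG and save a corrupted copy.
    data = bytearray(open("photo.jpg", "rb").read())
    data[2] ^= 0x01  # flip the low bit of the third byte (in the header)
    open("photo_flipped.jpg", "wb").write(bytes(data))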


For compressed images (such as JPEG), if you flip a bit in the image data section, it breaks the compressed stream and the decompression algorithm can't make sense of what follows. However, it could also be that there's a checksum over most of the file data; that would explain why you can't open the image at all.


Consider the likelihood of the bitflip occurring in the critical area you targeted versus any old place in the file. The file is dominated by encoded raster info.

Most errors like the bitflips we discuss are not correlated with file locations. They're typically medium errors, and the design of typical filesystems will result in something closer to a uniform distribution of errors. Bus errors could perhaps be correlated with the start of activity, but that seems uncommon to me.


unless it gives the killer the same mole that you have on the back of your neck


I strictly use Xeon processors for online banking for this reason.


>Have you ever taken a picture of anything important with your phone, like a crime or a car accident? Have you ever called 911 or sent money?

I don't compare my phone to a 24x7 server with $10,000 worth of data, but even if I did, I would make a versioned backup every X minutes, especially before "updates".

But yes, phones should have ECC too.


I do, because a picture of the criminal might be worth more than $10,000, or you might have to call an ambulance and your life is worth something


>I do, because a picture of the criminal might be worth more than $10,000, or you might have to call an ambulance and your life is worth something

He had the chance to have a good solution and decided it wasn't worth it; that's the difference.

>and your life is worth something

Restart your phone.


> Restart your phone.

Alright, next time my 80-year-old grandpa is having a heart attack, I am sure he will have time to wait for Android to boot and then input the encryption key.

That's why I still keep a dumbass analogue phone wired to the socket - our industry is full of wankers and can't be trusted.


>Alright, next time my 80-year-old grandpa is having a heart attack, I am sure he will have time to wait for Android to boot and then input the encryption key.

Let's go and buy him a mainframe and three redundant inet connections and power lines, but don't try to sue Intel because of your own stupidity in relying on a cellphone...alone!

If you trust your life to a cellphone... your problem.

And just for your information: your cellphone is not a server, nor reliable. Keep that in mind. Go climbing in the mountains and you'll see the massive difference between your iPhone and a rugged satellite phone (with buttons). But even then, never ever trust a single device.


ryzen pro supports ecc now, fwiw


Bit flips sound like a glitch in the Matrix


I'm kind of surprised that this took "two weeks, a stable of computers, and billions of combinations tested"? If we make the (generous) assumption that this was using a 128-bit key (more than was common in 1993—the age of DES and 56-bit keys, unless you were using public key crypto – which would be a very strange choice for a military satellite), we have:

256 (2 * 128) keys with 1 bit different

32,512 (2^2 * 128 choose 2) keys with 2 bits different

2,731,008 (2^3 * 128 choose 3) keys with 3 bits different

170,688,000 (2^4 * 128 choose 4) keys with 4 bits different

8,466,124,800 (2^5 * 128 choose 5) keys with 5 bits different

So to reach billions of combinations you need 5 bitflips, which seems quite high! But I guess space is a pretty rough environment :)


Article says "which was only a handful of bits away from the original". As non-native speaker i don't know the exact nuance of handful when considering bits, but it seems a lot


No, a handful in this abstract context (strengthened by the word “only”) means not so many - though even given the power of the computers at the NSA's disposal at the time, it was enough to ruin your two weeks.


And our love/hate relationship with the English language continues :)

My latest annoyance: https://www.usingenglish.com/forum/threads/three-times-as-mu...

At least we don't have such problems in programming langu... Actually, never mind.


More like a pinch of bits, then?


That's assuming the NSA told the world (/ people who go around posting stories about their day job to the internet) immediately after succeeding with the crack. It's much better opsec to wait for a period at least as long as the time the crack took; that leaks less information about the magnitude of compute you can harness.


It's only (128 choose k), I think. Why are you multiplying by 2^k?


Oh, you might be right – I originally had that but second-guessed myself (thinking that once you have the k bit positions, you need to exhaustively search the 2^k possible settings for those bits). I guess for each possible set of positions you only need to check the case where they're all flipped.

Without the extra factor you need 6 flipped bits to reach a billion combinations (128 choose 6 is 5,423,611,200).

Thanks!
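
A quick sanity check of the corrected counts (a Python sketch):

    from math import comb

    # Cumulative number of candidate keys within k bit flips of a 128-bit key.
    total = 0
    for k in range(1, 7):
        total += comb(128, k)
        print(f"{k} flips: {comb(128, k):,} (cumulative: {total:,})")
    # comb(128, 6) = 5,423,611,200, so 6 flips pushes the search
    # into the billions.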


From the headline I imagined this was something like "We lost our encryption key for some important data, but the NSA had already cracked or stolen it, so they were able to return it to us"


Same, but in reality it was a far more interesting topic. And it's surprising to see how long it took them to crack it, considering they had a priori information about the key (knowing the new key could only be a few bits from the old one).


If, hypothetically, the NSA could have found the key faster using classified technology, would they be forced to do it the slower way? Otherwise an uncleared person could surmise the existence of the classified technology.


I mean, there are two possibilities with that, right? If they got lucky, they could report it immediately and freak everyone out, or they could just sit on the answer for a week. Either way, we get bounds on their capabilities.


Me too!


For a more humorous tale of recovering a private key after a chunk of it was damaged (in this case, replaced by the string "FUCK A DUCK"), see this hilarious CRYPTO 2012 rump session talk:

https://crypto.2012.rump.cr.yp.to/87d4905b6d2fbc6ad2389debb7...

There is also a video and a few more details here (starting at 47:38) in their longer CCC talk:

https://youtu.be/v_X0gUzGWsA?t=2858

(Slides for that longer talk: https://www.hyperelliptic.org/tanja/vortraege/facthacks-29C3...)


It always sucks to put a triple-redundant system into place, only for a cosmic ray to find the one weak link that wasn't redundant. Space isn't for the faint-hearted.


Considering bit flips were the leading theory for the changed key, I'm surprised it took that long to brute-force test for the changed bit(s).

Sure, I don't know what the key length was or how long the encrypted string was, but surely it wouldn't have taken that long to cycle through a number of flipped bits, or would it?


Same intuition here.

Assuming no miscommunication or subterfuge, perhaps it can be explained by a large number of bits flipped by a single ray and/or a preponderance of rays, each flipping a small number of bits. If the satellite's shielding design was poor, there could have been a lot of exposure from a single event, such as a solar flare.

Or perhaps just a single bit of code was flipped and it began writing to protected areas of memory.


For Voyager 2 in particular, the problem was then fixed in May 2010.

https://www.space-travel.com/reports/NASA_Fixes_Bug_On_Voyag...


This is what worries me about the current push that all backups, including the off-site tape vault, must be encrypted at rest. Any problem with the decryption and your data is toast.


This is taken care of at the filesystem level.


That's physically impossible. Silent corruption can always happen between the point in time when the data is generated and when the filesystem checksums it.

Modern hardware tries to detect these sorts of things and halt before the corruption is propagated. Sometimes it succeeds, sometimes it does not.

The best checksums can reliably do is point at the software/hardware component that is at fault.


What are you talking about? Because it isn't a tape-writing machine.


Can you explain what you mean by that?


Filesystems keep checksums of every block of data. If single bits are flipped then they can be corrected. If you encrypt at a lower level than the filesystem then you're at the mercy of that lower level's error correction, but in practice it is rare to encrypt at a lower level. Typically it's done at the filesystem level or higher, including when using self-encrypting drives.


> Filesystems keep checksums of every block of data.

False. A limited number of filesystems keep checksums of data - most notably ZFS and btrfs. Some like ext4 and APFS will do it for metadata only. One of the most commonly used filesystems, NTFS, does not for either data or metadata.

> Typically it's done at the filesystem level or higher, including when using self-encrypting drives.

I don’t know where you got this idea from, but it’s basically the opposite of true.

> If single bits are flipped then they can be corrected.

Also false. Most checksums are used for error detection, not correction. CRCs as are typically used for filesystems are not particularly well suited for error correction.
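
A small Python illustration of that detection-versus-correction distinction:

    import zlib

    block = b"a block of file data"
    stored = zlib.crc32(block)

    corrupted = bytearray(block)
    corrupted[0] ^= 0x04  # simulate a single bit flip
    print(zlib.crc32(bytes(corrupted)) != stored)  # True: flip detected
    # But the mismatching CRC alone doesn't say which bit to flip back;
    # correction needs redundancy beyond a bare checksum.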


Those filesystems (ZFS, BTRFS) have robust error correction, but every modern filesystem has checksums for every block. Checkdisk isn't powered by magic.

Also, look into TCG Pyrite. Almost all consumer drives with SED features are Pyrite.


> but every modern filesystem has checksums for every block.

At least 2 people have informed you that you’re wrong. Now it’s up to you if you choose to educate yourself on this topic or remain a fool.

> Checkdisk isn't powered by magic.

What is “checkdisk”? If you’re talking about chkdsk, or Check Disk, or fsck like tools - none of those require checksums to do what they do. At a basic level they check the integrity of on disk data structures - the actual connectivity, valid counts, etc. How in your mind does a checksum contribute to this task?

Chkdsk exists for FAT32, which you already seem to admit has no checksums. How do you think it works?

> Also, look into TCG Pyrite. Almost all consumer drives with SED features are Pyrite.

What does this have to do with anything - it certainly isn’t filesystem level encryption.


> If you encrypt at a lower level than the filesystem then you're at the mercy of that lower level's error correction, but in practice it is rare to encrypt at a lower level.

My understanding is that many SSDs do encryption transparently. The ATA protocol even has a “SECURE ERASE” command that instructs the drive to wipe just the encryption key. This allowed even “bad blocks” to be erased securely.


SSDs have a lot more aggressive error correction than most filesystems because they anticipate a high error rate and are constantly changing the map of logical blocks to physical blocks.

Tapes at rest don't have to worry about that though.


> SSDs have a lot more aggressive error correction than most filesystems

Most filesystems do not do any error correction. ZFS, btrfs, ReFS, bcachefs are a few notable exceptions. And for the most part these schemes are for multiple device resilience - which is architected quite differently from the schemes used in an SSD or tape (yes, even tape).

> Tapes at rest don't have to worry about that though.

This isn't really true. Pretty much every physical medium at the densities used in modern time requires robust error correction because all physical media has flaws either manufactured or acquired from wear and degradation. For instance, modern LTO tapes use relatively robust 2D Reed-Solomon forward error correction similar to DVD/Blu-Ray.


Which filesystems support this degree of integrity checking? Presumably ZFS, but what about EXT4/3, ReiserFS, BTRFS, NTFS, and FAT32?

It would be wonderful if they all had the feature, but I thought only ZFS was really that paranoid.


ZFS and BTRFS have nice online scrubbing features, but nearly every filesystem these days is journaling, including NTFS and XFS (and its contemporaries). Journaling means every block has a checksum. Sure, FAT32 doesn't have that, but no one should ever have the expectation of data integrity on FAT32. You can run checkdisk on journaling filesystems to scrub for errors.


> Journaling means every block has a checksum.

No it doesn’t.

Many journal filesystems use a checksum for log entries, but that is certainly not covering every block of the filesystem with a checksum. And that checksum only comes into play during log recoveries. Once a block is committed to disk there is no checksum in play (unless the fs has special support for it).

Some newer journaling filesystems support metadata checksumming, but that is not some requirement to be journaling. XFS has not always supported metadata checksumming, and it’s a relatively recent addition to ext4 (like last decade). NTFS doesn’t do checksums on even metadata. This is one reason why ReFS is a thing.


FAT is a requirement of UEFI though, isn't it? So if you can boot from the drive, you can't rely on it to have the filesystem integrity preserved at the disk level.


Journaling means that there is a two-phase commit to the metadata of the file system. This helps avoid file system corruption on unclean shutdown and speeds up recovery afterwards. But it has nothing to do with data checksums. You can’t perform any scrub-like behavior to validate your data on, e.g., ext4 just because it has a journal.


Block-level encryption in SAN is pretty common.


Correctly configured RAID setups also make it possible to detect and recover errors across drives and data without downtime; this is commonly how it's done in datacenters.


I wonder what the design consideration was there? If I had to make a guess, the point of having the key in re-programmable memory (susceptible to such errors) could be that it could be re-programmed later - otherwise it could've just been hardcoded in ROM. Although if the error was that a RAM copy of the key got corrupted, that might explain things - no one around to reboot the machine, huh. If re-programmability was a design consideration, it is interesting that there was no key reset procedure (with another, "master" key), which is something one would want to use if the normal communications key is compromised or corrupted.


This was my thought as well, maybe the key was considered sensitive and getting it on the ROM would have exposed it to more people than "necessary"?


Isn’t the use case the ability to change the encryption key in case it’s compromised?


That's pretty fascinating. I admit I find it kind of strange that only the encryption key was affected in this way, i.e. that finding the newly-correct key proved a fix to the issues, as opposed to fixing one issue out of many linked to such an event.

It made me wonder--what are the odds of that? What is the relative exposure area of the encryption key compared to the rest of the onboard assets which could have been mangled?


It probably wasn't the only thing affected. It's just flipping bits in encryption keys has much more dramatic and obvious effect than flipping other random bits in memory. Flip a bit in a raster image and you get one funny-looking pixel. Flip a bit in an AES key and you completely corrupt all the data handled by that key.
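
A sketch of that contrast (this assumes the third-party pyca/cryptography package; key, nonce, and message are made up):

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key, nonce = os.urandom(16), os.urandom(16)
    ct = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor().update(
        b"telemetry frame 0042")

    bad_key = bytes([key[0] ^ 0x01]) + key[1:]  # one bit flipped in the key
    pt = Cipher(algorithms.AES(bad_key), modes.CTR(nonce)).decryptor().update(ct)
    print(pt)  # unrelated garbage, not just one wrong byte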


There are basically two things that happen with radiation in orbit.

* Ionizing dose weakens and disrupts the crystalline structure. It wears things out / degrades their specs.

* Single, very high energy particles -- e.g. protons -- come in at high speed and change a voltage somewhere. This can have massively bad effects (for non-radiation-hardened parts it can, for example, cause parts of the chip not meant to be a transistor to become one, shorting power and ground -- this is a "single event latchup"). Or it can affect the operation of one or a few adjacent bits of memory (a "single event upset").

https://en.wikipedia.org/wiki/Single-event_upset


They probably had to find the key to get it to accept a command to restart and reread the key from ROM.


This is a somewhat common problem with solar flares and cellular base stations. Bit flips cause odd configurations to show up, sites to become sleepy, or even to just straight up go off-air. A simple reset fixes them, but it happens more often than you might expect.


Why is traffic from Voyager even encrypted? The results back are public scientific data anyway, right? And it's not like other nations (the only ones with power to transmit that far, back then) would send rogue commands without getting caught.


Voyager launched in 1977 so I find your surprise rather odd. Also it wasn't always so far from Earth that amateurs couldn't reach it.


Was it? The story is about a spy satellite.


> it's not like other nations (the only ones with power to transmit that far, back then) would send rogue commands without getting caught.

How would you catch them?

Also, these spacecraft didn't start off outside the solar system. They weren't always so far away that a lone prankster would have trouble sending them messages.


> How would you catch them?

Wouldn't they be monitoring whatever frequency it's using? I don't know much about how spacecraft worked in the 70s so maybe it's not practical.

> They weren't always so far away that a lone prankster would have trouble sending them messages.

That and what mlindner said is a good point. The start of the mission could've been easy to mess with.


The antennas used to send signals to probes are very directional. Monitoring the uplink frequency wouldn't detect someone else sending commands to the probe unless the monitoring receiver was very close to the transmitting antenna or within the transmitting antenna's beam.


> Wouldn't they be monitoring whatever frequency it's using?

How would they? They have a huge directional dish antenna for communicating with the probe, they can't intercept every signal on 8 GHz, they would need an omnidirectional antenna which would catch a lot of noise.


What happens if a bit is flipped in a private key embedded in an HSM? For example, the root CA private key, or the root cold-wallet key for a cryptocurrency exchange? In this case you are not able to alter the public key to correct for the bit flip (like they did with Voyager). I guess if that happens you are toast?


Yeah, the same happens if someone tries to open it and the HSM deletes the contents, or someone physically burns it, etc.

Generally, that's why you ideally don't have just one key, but multiple. Ideally with voting, but even if you just replicate the key into a second HSM at a different physical location, it's going to improve your situation a lot.


Is there a way to embed redundancy in a crypto key so that another key a few bits away can still decrypt the data?


DES (and TDES) has 1 parity bit for every 7 bits of key. Nobody really uses it as far as I've seen (e.g. they just generate random keys with invalid parity), but it's built in to the key itself.
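
A sketch of that convention (DES stores the parity bit as the least significant bit of each key byte, with odd parity):

    def fix_des_key_parity(key: bytes) -> bytes:
        # Force each byte to odd parity, per the DES key layout
        # (7 key bits + 1 parity bit per byte).
        out = bytearray()
        for b in key:
            b &= 0xFE                      # clear the parity bit
            if bin(b).count("1") % 2 == 0:
                b |= 0x01                  # make the popcount odd
            out.append(b)
        return bytes(out)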


Parity only tells you the key is corrupt; it can't correct the block :/


Encode it for storage with FEC.



Besides hardware mitigations (radiation hardening, ECC memory) what would be software mitigation techniques for this?


To protect against bit flips in automotive fly-by-wire systems, each signal is sent three times, with the 2-of-3 majority making the decision. This happened after the runaway Prius fiasco, which may have been caused by a gamma ray. Prior to that incident the fly-by-wire system only sent one signal.
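
A minimal sketch of 2-of-3 voting at the bit level (an illustration of the idea, not any vendor's actual implementation):

    def majority_vote(a: int, b: int, c: int) -> int:
        # Each output bit takes the value held by at least two of
        # the three redundant copies (triple modular redundancy).
        return (a & b) | (a & c) | (b & c)

    # One corrupted copy gets outvoted:
    assert majority_vote(0b1010, 0b1010, 0b0011) == 0b1010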


They do this in planes, but with different coders for the three inputs in case of human coding error as well.

Unfortunately, humans tend to make similar errors at similar areas of code when given the same specs.


This is really inefficient: two bit flips in the same position (in two of the three copies) will still produce a wrong bit. For 3x the space, surely there's a more resilient scheme that can handle more.


If I recall right, this depends on the original message length, and 1 bit is a bit of an edge case. If you transfer just 1 bit, you're very space-constrained and it's hard to do much: 1 bit to 1 bit has nothing, 1 bit to 2 bits has single-bit error detection, and 1 bit to 3 bits has single-bit error correction (--and 2-bit error detection-- (this isn't right, on second thought)). After this, the minimum required checksum length grows logarithmically, plus 1 or 2 bits for detection / correction - and that constant factor makes 1 bit so weird.


All error correction and detection schemes are designed with an acceptable probability of error in mind, which depends on the medium. For example, if you know cosmic rays might flip 1 in a million bits, and you want your system to have 1 error per trillion bits, then sending every bit twice is enough: an undetected error requires both copies to flip the same way, which happens with probability of roughly (10^-6)^2 = 10^-12 per bit.


It's not only about memory size, but the speed of processing too.

Higher-level protocols would signal an error and require a retransmission until no error is detected.

However, "majority rule" does indeed sound like an error-checking mechanism designed by a non-engineer (or non-mathematician) committee.


So in this context, send the message with three different encryption keys?


Forward error correction.

At the simplest, write the key in ten different places in memory and compare the read values to determine the most probably correct value.
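
A sketch of that simplest scheme, assuming a per-bit majority vote across the stored copies:

    def majority_decode(copies: list[bytes]) -> bytes:
        # Recover a value by taking, for every bit position, whichever
        # bit value the majority of the redundant copies agree on.
        n, length = len(copies), len(copies[0])
        out = bytearray(length)
        for i in range(length):
            for bit in range(8):
                ones = sum((c[i] >> bit) & 1 for c in copies)
                if 2 * ones > n:
                    out[i] |= 1 << bit
        return bytes(out)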


I expected the NSA would hand over the key they had stolen before the bit flip.


Why are they even encrypting the stuff? A 1024-bit key? On such a long-term mission, KISS is pretty crucial.

EDIT: Probably some military doctrine. For science, you monster.


Can cosmic rays flip 0s to 1s and vice versa, or just in one direction?

Edit: I am thinking of memory chips


I think only 1 to 0, as the cosmic ray causes electrons to "discharge", hence lowering the voltage, assuming that 1 is a slightly higher voltage than 0. But I'm far from an expert in electronics and physics :)


Quoting the NASA incident report:

Updated May 17, 2010 at 5:00 PT.

One flip of a bit in the memory of an onboard computer appears to have caused the change in the science data pattern returning from Voyager 2, engineers at NASA’s Jet Propulsion Laboratory said Monday, May 17. A value in a single memory location was changed from a 0 to a 1.

On May 12, engineers received a full memory readout from the flight data system computer, which formats the data to send back to Earth. They isolated the one bit in the memory that had changed, and they recreated the effect on a computer at JPL. They found the effect agrees with data coming down from the spacecraft. They are planning to reset the bit to its normal state on Wednesday, May 19.



