At a previous job, we had a customer who suddenly couldn’t send us email anymore. When their IT sent us the server logs to “prove” it was our fault, we saw that one letter in the cached MX record was wrong. This was puzzling, until I looked at the ASCII table and verified that the difference was exactly one bit.
We never found out where in the name resolution process the bit got flipped. The problem healed itself a few days later when the DNS cache expired, so it wasn’t worth further investigation.
That really gave me pause to think about how often random bits are wrong in other data.
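For the curious: the check is just an XOR and a popcount. A minimal sketch in Python, with a hypothetical letter pair ('a' and 'q' happen to differ in exactly one bit):

    # Do two characters differ by exactly one bit?
    def bit_distance(a: str, b: str) -> int:
        return bin(ord(a) ^ ord(b)).count("1")

    # 'a' is 0x61 and 'q' is 0x71; they differ only in bit 4.
    print(bit_distance("a", "q"))  # -> 1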
That's not a proper sort. A sort is not just a function that takes in a list and returns a list that is sorted. The result must also include all of the elements that you had when starting. That property isn't checked by the code you linked.
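A minimal sketch of the full property in Python, assuming hashable, comparable elements (both conditions have to hold, not just orderedness):

    from collections import Counter

    def is_proper_sort(original: list, result: list) -> bool:
        # Condition 1: the output is in nondecreasing order.
        in_order = all(result[i] <= result[i + 1] for i in range(len(result) - 1))
        # Condition 2: the output is a permutation of the input (multiset equality).
        same_elements = Counter(original) == Counter(result)
        return in_order and same_elements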
Someone should calculate how many copies of the array you need (as a function of input data size) to make that sorting strategy work with reasonable assurance that the original data is maintained.
Do you know of any sort routine that guarantees the result includes all of the input elements even in the event of random bit flips? If not, do you consider the sorting routines in your everyday libraries improper as well or is that a special restriction reserved for this particular implementation?
No, when we normally talk about code we have some vague notion of a formal system that works according to its specification, and thus that !isSorted loop example is pretty clearly invalid. And even if you do allow for random bit flips, surely that’s more likely to happen to the boolean returned by isSorted than for enough bits to flip to make the array sorted.
The objection that the GP raised, the lack of a check that the result includes all of the elements of the input, is not needed under your notion of a system that works according to its specification, as according to that specification it is a loop invariant. It is also not useful, as it is an infinite loop. Your view of why cosmic ray sort is not a proper sort therefore contradicts GP's explanation, even if you share the GP's conclusion. My question was aimed at further exploring the GP's view; I think an opposing view does not help here.
About 15 years ago, while working on Windows at Microsoft, a test machine sitting at my desk hit a kernel panic (BSOD). As was standard working on the test team, the machine was already set up for kernel debugging, and so I set out to debug it a bit in order to file a decent bug report.
Hours later, I couldn't make sense of it (I wasn't super experienced at this point). A few of the nearby devs couldn't either, and a small troop of us curious enough about the puzzler eventually escalated to the resident wizard, Raymond Chen[1]. Within 15 minutes of checking our work and poking at the machine, he traced the root cause down to a bit flip.
It might be a subset of typosquatting, but it's distinct from what the term usually refers to, because the typos that are statistically likely to result from user input (swapping neighboring keys, duplicating or omitting letters, etc.) occur manually at the level of keys and characters, whereas those that result from bitflips ("adjacent" bit sequences) occur invisibly, without a manual mistake, at the level of bit-/byte-streams, forming two mostly exclusive sets of domains that could be targeted. A determined attacker would likely target both.
It isn't simple typosquatting because it can involve errors that aren't plausible typos (like "mic2osoft.com" for "microsoft.com"), or domains that a user would never enter manually (like "fbcdn.com" or "ytimg.com", which are used for content delivery at Facebook and YouTube, respectively).
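Enumerating the one-bitflip neighborhood of a domain takes only a few lines; a rough sketch (it keeps only variants whose changed character is still a legal hostname character, and it ignores label structure, so a flipped dot just merges the labels):

    import string

    HOSTNAME_CHARS = set(string.ascii_lowercase + string.digits + "-")

    def bitflip_variants(domain: str) -> set[str]:
        """Domains reachable from `domain` by flipping one bit of one byte."""
        variants = set()
        for i, byte in enumerate(domain.encode("ascii")):
            for bit in range(8):
                c = chr(byte ^ (1 << bit))
                if c in HOSTNAME_CHARS and c != domain[i]:
                    variants.add(domain[:i] + c + domain[i + 1:])
        return variants

    print(sorted(bitflip_variants("microsoft.com")))  # includes "mic2osoft.com"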
I have super bivalent opinions about Intel. This is the opposite of ambivalent, it means heavily charged in both directions, but cancellation is not allowed.
So that's why they should have let all their chips do ECC instead of making it a premium feature; it would have been better for their brand as "Chipzilla" and had no real cost. And it's dangerous! In fact, a soft error at sea level killed an operating system update on me and cost me tens of thousands in data and money. I have standing to sue Intel, until this sentence clause in which I hereby forfeit the suit, together with requesting them to reconsider ECC (error correction codes) in all their chips as a safeguard needed due to Moore's Law, which was their business plan. Just give it a thought, Intel.
Can you expand on that? Typical coding schemes use redundancy to be able to detect failures. I'm no info theory expert but I presume there's always a cost. And if you can design a smart enough algorithm you can balance the cost with the actual likelihood of failure to find some kind of equilibrium. Of course the algorithm development has its own cost (NRE, time-to-market).
It's not clear whether you're talking about modern computers or dawn-of-PCs when you say "should have." It might be a small cost now, but back in the day it would have been mega expensive.
You bought a non-ECC part from Intel and some memory from a (different) vendor, and the memory flipped a bit.
Do you have any evidence or reason to believe that memory error was outside the advertised error rate for the device, or due to some other wrongful conduct by the vendor? And how would you connect that to wrongful action by Intel which contributed to your damages?
I'm no lawyer but I suspect your case is not quite the bargaining chip that you seem to believe.
> I have super bivalent opinions about Intel. This is the opposite of ambivalent, it means heavily charged in both directions
I mean, if you're at the level of manipulating the form of the word, that's already what "ambivalent" means. Strong in both directions. Bi- makes no sense as an opposite of ambi- since ambi- means "both", and bi- means "double". There is no negative element present in either word.
It's your error: having a system with important data, no actual/realtime backup, no second system, no plan to recover from a failed update, and no ECC is YOUR error alone.
However, Intel should have made ECC the standard and not just for $1000+ Xeons.
The bitflip could have hit an OS distribution service. Yes, the target machines should check the checksum, but the distribution service could have flipped the bit before the checksum was computed.
But, yeah, ECC should be the standard. Also, Intel should be better about documenting where they've left holes in their online checksums, machine check exception implementations, etc.
DDR5 supports on-chip ECC, but the extra parity bits we typically associate with "ECC" are still as optional as ever; motherboard manufacturers will still not bother to route those signals anyway, and Intel will still demand you give up overclocking and pay more for Xeon in order to use ECC sticks.
I did some reading up on this just now. You are of course right that on-chip ECC will only handle bitflips inside the chip. To ensure the integrity of signals between the memory and the processor, traditional ECC will still be required. On DDR5 it will be necessary to have 4 parity chips whereas with DDR4 you only need 2, so true ECC-capable memory will be considerably more expensive.
> you give up overclocking and pay more for Xeon in order to use ECC sticks.
That's my point, a virtuous monopoly wouldn't do that. It would allow at least some way to have both. Especially since soft errors become more likely as transistors shrink.
"It's your error, having a system with important data no actual/realtime backup"
Have you ever taken a picture of anything important with your phone, like a crime or a car accident? Have you ever called 911 or sent money?
How dare you use an unreliable system without ECC; what if a random bitflip caused it to send 10x more money, or data would be lost without a realtime backup!
This disrespect to users and wanky attitude is the cardinal sin of our industry; people's lives are at stake, and it's their fault for trusting us.
I dunno, I think both points of view are valid, and it's important to be pragmatic. Stuff fails all the time and it's usually not a big deal. Every holiday season, all the cash registers at Cost Plus stop working, and I stand in line for an hour to buy stocking stuffers. All three AV nerves on my heart failed and I had to get a pacemaker. The first night it skipped 5 beats and the doctors scratched their heads for a while. Turns out it was a loose ring terminal! I have no idea if it has ECC memory, but probably not, because it's a pain to replace the battery...
The probability of most things randomly failing at any particular instant is approximately zero. The probability goes up as you increase the size of the time window, or push things to their limits. You can trust your phone to store an important photo for a few days, as long as you don't run it through a washing machine or something. A few years? You're taking a risk. Many people take that risk, and it works out fine for them. I've had phones fail, hard drives fail, heart nerves fail. I take reasonable precautions to back up stuff I care about. I also have plenty of data that I would be bummed to lose that I haven't backed up yet. I'll get to it one day, or maybe it will get corrupted and I'll be bummed.
I just tested this on a JPEG. Flipping any bit in the first few bytes of a JPEG renders it unopenable on my Fedora system. I get that it's just a header and could probably be fixed, but there are consequences for bitflips, even on images.
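For anyone who wants to repeat the experiment, a sketch along these lines is all it takes (the file names are placeholders):

    import shutil

    def flip_bit(src: str, dst: str, byte_index: int, bit: int) -> None:
        """Copy a file and flip a single bit in the copy."""
        shutil.copyfile(src, dst)
        with open(dst, "r+b") as f:
            f.seek(byte_index)
            original = f.read(1)[0]
            f.seek(byte_index)
            f.write(bytes([original ^ (1 << bit)]))

    # e.g. corrupt one bit of byte 2, which lands in the JPEG header area:
    flip_bit("photo.jpg", "photo_corrupt.jpg", byte_index=2, bit=3)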
For compressed images (such as JPEG), if you flip a bit in the image data section, it breaks the compressed data and the decompressed output won't make sense. However, it could also be that there's a checksum over most of the file data; that would explain why you can't open the image at all.
Consider the likelihood of the bitflip occurring in the critical area you targeted versus any old place in the file. The file is dominated by encoded raster info.
Most errors like the bitflips we discuss are not correlated with file locations. They're typically medium errors, and the design of typical filesystems will result in something closer to a uniform distribution of errors. Bus errors could perhaps be correlated with the start of activity, but I think that seems uncommon.
>Have you ever taken a picture of anything important with your phone, like a crime or a car accident? Have you ever called 911 or sent money?
I don't compare my phone to a server (24x7) with $10,000 worth of data, but even if I did, I would make a versioned backup every X minutes, especially before "updates".
Alright, next time my 80 year old grandpa is having a heart attack, I am sure he will have time to wait for Android to boot, then input the encryption key
That's why I still keep a dumbass analogue phone wired to the socket - our industry is full of wankers and can't be trusted
>Alright, next time my 80 year old grandpa is having a heart attack, I am sure he will have time to wait for Android to boot, then input the encryption key
Let's go and buy him a mainframe and three redundant internet connections and power lines, but don't try to sue Intel because of your own stupidity in relying on a cellphone...alone!
If you trust your life to a cellphone... your problem.
And just for your information...your cellphone is not a server, nor reliable...keep that in mind. Go climbing in the mountains and you'll see the massive difference between your iPhone and a rugged satellite phone (with buttons), but even then, never ever trust a single device.
I'm kind of surprised that this took "two weeks, a stable of computers, and billions of combinations tested"? If we make the (generous) assumption that this was using a 128-bit key (more than was common in 1993—the age of DES and 56-bit keys, unless you were using public key crypto – which would be a very strange choice for a military satellite), we have:
256 (2 * 128) keys with 1 bit different
32,512 (2^2 * 128 choose 2) keys with 2 bits different
2,731,008 (2^3 * 128 choose 3) keys with 3 bits different
170,688,000 (2^4 * 128 choose 4) keys with 4 bits different
8,466,124,800 (2^5 * 128 choose 5) keys with 5 bits different
So to reach billions of combinations you need 5 bitflips, which seems quite high! But I guess space is a pretty rough environment :)
Article says "which was only a handful of bits away from the original". As a non-native speaker, I don't know the exact nuance of "handful" when considering bits, but it seems like a lot.
No, a handful in this abstract context (strengthened by the word “only”) means “not so many” (though even with the power of the computers at the NSA's disposal at the time, it was enough to ruin your 2 weeks).
That's assuming the NSA told the world (/ people who go around posting stories about their day job to the internet) immediately after succeeding with the crack. Much better opsec to wait for a period at least as long as the time it took to crack; that leaks less information about the magnitude of compute you can harness.
Oh, you might be right – I originally had that but second-guessed myself (thinking that once you have the k bit positions, you need to exhaustively search the 2^k possible settings for those bits). I guess for each possible set of positions you only need to check the case where they're all flipped.
Without the extra factor you need 6 flipped bits to reach a billion combinations (128 choose 6 is 5,423,611,200).
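Easy to sanity-check with a few lines of Python (using the corrected count, without the 2^k factor):

    from math import comb

    total = 0
    for k in range(1, 7):
        total += comb(128, k)
        print(f"{k} bits flipped: {comb(128, k):,} keys ({total:,} cumulative)")
    # k=6 alone contributes 5,423,611,200 candidates, i.e. billions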
From the headline I imagined this was something like "We lost our encryption key for some important data, but the NSA had already cracked or stolen it, so they were able to return it to us"
Same, but in reality it was a far more interesting topic. And surprising to see how long it took them to crack it considering they had a priori information for the key (knowing the new key could only be a few bits from the old key).
If, hypothetically, the NSA could have found the key faster using classified technology, would they be forced to do it the slower way? Otherwise an uncleared person could surmise the existence of the classified technology.
I mean, there are two possibilities with that, right? If they got lucky, they could deliver it immediately and freak everyone out, or they could just sit on the answer for a week. Either way, we get bounds on their capabilities.
For a more humorous tale of recovering a private key after a chunk of it was damaged (in this case, replaced by the string "FUCK A DUCK"), see this hilarious CRYPTO 2012 rump session talk:
Always sucks to put a triple-redundant system into place, only for a cosmic ray to find the one weak link that wasn't redundant. Space isn't for the faint-hearted.
Considering bit flips were the leading theory for the changed key, I'm surprised it took that long to brute force test for the changed bit(s).
Sure, I don't know what the key length was, and I don't know how long the encrypted string was, but surely it wouldn't have taken that long to cycle through a number of flipped bits, or would it?
Assuming no miscommunication or subterfuge, perhaps it can be explained by a large number of bits flipped by a single ray and/or a preponderance of rays, each flipping a small number of bits. If the satellite's shielding design was poor, there could have been a lot of exposure from a single event, such as a solar flare.
Or perhaps just a single bit of code was flipped and it began writing to protected areas of memory.
This is what worries me about the current push that all backups, including the off-site tape vault, must be encrypted at rest. Any problem with the decryption and your data is toast.
That's physically impossible. Silent corruption can always happen between the point in time when the data is generated and when the filesystem checksums it.
Modern hardware tries to detect these sorts of things and halt before the corruption is propagated. Sometimes it succeeds, sometimes it does not.
The best checksums can reliably do is point at the software/hardware component that is at fault.
Filesystems keep checksums of every block of data. If single bits are flipped then they can be corrected. If you encrypt at a lower level than the filesystem then you're at the mercy of that lower level's error correction, but in practice it is rare to encrypt at a lower level. Typically it's done at the filesystem level or higher, including when using self-encrypting drives.
> Filesystems keep checksums of every block of data.
False. A limited number of filesystems keep checksums of data - most notably ZFS and btrfs. Some like ext4 and APFS will do it for metadata only. One of the most commonly used filesystems, NTFS, does not for either data or metadata.
> Typically it's done at the filesystem level or higher, including when using self-encrypting drives.
I don’t know where you got this idea from, but it’s basically the opposite of true.
> If single bits are flipped then they can be corrected.
Also false. Most checksums are used for error detection, not correction. CRCs as are typically used for filesystems are not particularly well suited for error correction.
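To make that concrete: a CRC flags that a block changed, but doesn't tell you which bit to repair. A quick sketch:

    import zlib

    block = b"some block of file data"
    stored_crc = zlib.crc32(block)

    corrupted = bytearray(block)
    corrupted[5] ^= 0x04  # flip one bit
    # Detection works, but the CRC alone doesn't say which bit flipped:
    print(zlib.crc32(bytes(corrupted)) == stored_crc)  # False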
Those filesystems (ZFS, BTRFS) have robust error correction, but every modern filesystem has checksums for every block. Checkdisk isn't powered by magic.
Also, look into TCG Pyrite. Almost all consumer drives with SED features are Pyrite.
> but every modern filesystem has checksums for every block.
At least 2 people have informed you that you’re wrong. Now it’s up to you if you choose to educate yourself on this topic or remain a fool.
> Checkdisk isn't powered by magic.
What is “checkdisk”?
If you’re talking about chkdsk, or Check Disk, or fsck like tools - none of those require checksums to do what they do. At a basic level they check the integrity of on disk data structures - the actual connectivity, valid counts, etc. How in your mind does a checksum contribute to this task?
Chkdsk exists for FAT32, which you already seem to admit has no checksums. How do you think it works?
> Also, look into TCG Pyrite. Almost all consumer drives with SED features are Pyrite.
What does this have to do with anything - it certainly isn’t filesystem level encryption.
> If you encrypt at a lower level than the filesystem then you're at the mercy of that lower level's error correction, but in practice it is rare to encrypt at a lower level.
My understanding is that many SSDs do encryption transparently. The ATA protocol even has a “SECURE ERASE” command that instructs the drive to wipe just the encryption key. This allowed even “bad blocks” to be erased securely.
SSDs have a lot more aggressive error correction than most filesystems because they anticipate a high error rate and are constantly changing the map of logical blocks to physical blocks.
Tapes at rest don't have to worry about that though.
> SSDs have a lot more aggressive error correction than most filesystems
Most filesystems do not do any error correction. ZFS, btrfs, ReFS, bcachefs are a few notable exceptions. And for the most part these schemes are for multiple device resilience - which is architected quite differently from the schemes used in an SSD or tape (yes, even tape).
> Tapes at rest don't have to worry about that though.
This isn't really true. Pretty much every physical medium at the densities used in modern time requires robust error correction because all physical media has flaws either manufactured or acquired from wear and degradation. For instance, modern LTO tapes use relatively robust 2D Reed-Solomon forward error correction similar to DVD/Blu-Ray.
ZFS and BTRFS have nice online scrubbing features, but nearly every filesystem these days is journaling, including NTFS and XFS (and its contemporaries). Journaling means every block has a checksum. Sure, FAT32 doesn't have that, but no one should ever have the expectation of data integrity on FAT32. You can run checkdisk on journaling filesystems to scrub for errors.
Many journaling filesystems use a checksum for log entries, but that is certainly not covering every block of the filesystem with a checksum. And that checksum only comes into play during log recovery. Once a block is committed to disk there is no checksum in play (unless the fs has special support for it).
Some newer journaling filesystems support metadata checksumming, but that is not some requirement to be journaling. XFS has not always supported metadata checksumming, and it’s a relatively recent addition to ext4 (like last decade). NTFS doesn’t do checksums on even metadata. This is one reason why ReFS is a thing.
FAT is a requirement of UEFI though, isn't it? So if you can boot from the drive, you can't rely on it to have the filesystem integrity preserved at the disk level.
Journaling means that there is a two phase commit to the metadata of the file system. This helps avoid file system corruption on unclean shutdown and speed up recovery after an unclean shutdown. But it has nothing to do with data checksums. You can’t perform any scrub like behavior to validate your on, e.g., ext4 just because it has a journal.
Correctly configured RAID setups also make it possible to detect and recover from errors across drives and data without downtime; this is commonly how it's done in datacenters.
I wonder what the design consideration was there? If I had to make a guess, the point of having the key in re-programmable memory (susceptible to such errors) could be that it could be re-programmed later - otherwise it could've just been hardcoded in ROM. Although if the error was that a RAM copy of the key got corrupted, this might explain things - no one around to reboot the machine, huh. If re-programmability was a design consideration, it is interesting that there was no key reset procedure (using another, "master" key), which is something one would want if the normal communications key were compromised or corrupted.
That's pretty fascinating. I admit I find it kind of strange that only the encryption key was affected in this way, i.e. that finding the newly-correct key proved a fix to the issues, as opposed to fixing one issue out of many linked to such an event.
It made me wonder: what are the odds of that? What is the relative exposure area of the encryption key compared to the rest of the onboard assets which could have been mangled?
It probably wasn't the only thing affected. It's just flipping bits in encryption keys has much more dramatic and obvious effect than flipping other random bits in memory. Flip a bit in a raster image and you get one funny-looking pixel. Flip a bit in an AES key and you completely corrupt all the data handled by that key.
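You can see the scale of the effect with any keyed primitive; here's an illustrative sketch using a hash as a stand-in for a cipher's key processing (not actual AES):

    import hashlib

    key = bytes(16)
    flipped = bytes([key[0] ^ 0x01]) + key[1:]  # same key, one bit flipped

    # One flipped key bit changes essentially every output bit (avalanche):
    print(hashlib.sha256(key).hexdigest())
    print(hashlib.sha256(flipped).hexdigest())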
There are basically two things that happen with radiation in orbit:
* Ionizing dose weakens and disrupts crystalline structure. Wears things out / degrades their specs.
* Single, very high energy particles (e.g. protons) come in at high speed and change a voltage somewhere. This can have massively bad effects: for non-radiation-hardened parts, it can cause parts of the chip not meant to be a transistor to become one, shorting power and ground (this is a "single event latchup"). Or it can affect the operation of one or a few adjacent bits of memory ("single event upset").
This is a somewhat common problem with solar flares and cellular base stations. Bit flips cause odd configurations to show up, sites to become sleepy, or even to go straight up off-air. A simple reset fixes them, but it happens more often than you might expect.
Why is traffic from Voyager even encrypted? The results back are public scientific data anyway, right? And it's not like other nations (the only ones with power to transmit that far, back then) would send rogue commands without getting caught.
> it's not like other nations (the only ones with power to transmit that far, back then) would send rogue commands without getting caught.
How would you catch them?
Also, these spacecraft didn't start off outside the solar system. They weren't always so far away that a lone prankster would have trouble sending them messages.
The antennas used to send signals to probes are very directional. Monitoring the uplink frequency wouldn't detect someone else sending commands to the probe unless the monitoring receiver was very close to the transmitting antenna or within the transmitting antenna's beam.
> Wouldn't they be monitoring whatever frequency it's using?
How would they? They have a huge directional dish antenna for communicating with the probe; they can't intercept every signal on 8 GHz. They would need an omnidirectional antenna, which would catch a lot of noise.
What happens if a bit is flipped in a private key embedded in a HSM? For example the root CA private key or the root cold wallet key for a cryptocurrency exchange? In this case you are not able to alter the public key to correct for the bit flip (like they did with Voyager). I guess if that happens you are toast?
Yeah, the same happens if someone tries to open it and the HSM deletes the contents, or someone physically burns it, etc.
Generally that's why you ideally don't just have one key, but multiple. Ideally with voting, but even if you just replicate the key into a second HSM at a different physical location, it's going to improve your situation a lot.
DES (and TDES) has 1 parity bit for every 7 bits of key. Nobody really uses it as far as I've seen (e.g. they just generate random keys with invalid parity), but it's built into the key itself.
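For reference, a sketch of that parity scheme - the low bit of each key byte is set so the byte's total popcount is odd:

    def set_des_parity(key: bytes) -> bytes:
        """Adjust each byte of a DES key so it has odd parity overall."""
        out = bytearray()
        for b in key:
            data_bits = bin(b >> 1).count("1")  # popcount of the 7 key bits
            parity = 0 if data_bits % 2 else 1  # make the total popcount odd
            out.append((b & 0xFE) | parity)
        return bytes(out)

    def has_valid_parity(key: bytes) -> bool:
        return all(bin(b).count("1") % 2 == 1 for b in key)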
To protect against bit flips in car fly-by-wire systems, each signal is sent three times with the 2/3 majority making the decision. This happened after the runaway Prius fiasco that may have been caused by a gamma ray. Prior to that incident the fly-by-wire system only sent one signal.
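The voter itself is tiny; a bitwise 2-of-3 majority looks something like this (a sketch, not the actual automotive implementation):

    def vote(a: int, b: int, c: int) -> int:
        """Each output bit agrees with at least two of the three inputs."""
        return (a & b) | (a & c) | (b & c)

    # A single corrupted copy is outvoted:
    assert vote(0b1010, 0b1010, 0b1110) == 0b1010
    # But matching flips in two copies win the vote, as noted below:
    assert vote(0b1110, 0b1010, 0b1110) == 0b1110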
This is really inefficient, and two bitflips in the same position (across two of the three copies) will still produce a wrong result. For 3x the space, surely there's a more resilient scheme that can handle more.
If I recall right, this depends on the original message length, and 1 bit is a bit of an edge case. If you transfer just 1 bit, you're very space constrained and it's hard to do much. 1 bit to 1 bit gives you nothing, 1 bit to 2 bits gives single-bit error detection, and 1 bit to 3 bits gives single-bit error correction (and 2-bit error detection - no, that isn't right, thinking about it for a second). After this, the minimum required checksum length grows logarithmically, plus 1 or 2 for detection/correction - and that constant factor makes 1 bit so weird.
All error correction and detection are designed with an acceptable probability of error in mind, which depends on the medium. For example, if you know cosmic rays might flip 1 in a million bits, and you want your system to have 1 error per trillion bits, then you need to send every bit twice to be able to detect a one-in-a-million error.
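The arithmetic behind that example, using the hypothetical rates from above:

    p = 1e-6           # assumed chance that any given bit flips
    # With each bit sent twice, detection fails only when both copies
    # of the same bit flip together:
    undetected = p ** 2
    print(undetected)  # 1e-12, i.e. about one missed error per trillion bits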
I think only 1 to 0, as the cosmic ray causes electrons to "discharge", hence lowering the voltage, assuming that 1 is a slightly higher voltage than 0. But I'm far from an expert in electronics and physics :)
One flip of a bit in the memory of an onboard computer appears to have caused the change in the science data pattern returning from Voyager 2, engineers at NASA’s Jet Propulsion Laboratory said Monday, May 17. A value in a single memory location was changed from a 0 to a 1.
On May 12, engineers received a full memory readout from the flight data system computer, which formats the data to send back to Earth. They isolated the one bit in the memory that had changed, and they recreated the effect on a computer at JPL. They found the effect agrees with data coming down from the spacecraft. They are planning to reset the bit to its normal state on Wednesday, May 19.