Random bit-flip invalidates certificate transparency log – again? (groups.google.com)
141 points by nickf on May 10, 2023 | 115 comments



The numbers from this Google SIGMETRICS09 paper are my usual benchmark for thinking about ECC DIMMs:

https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

Their metric of “25,000 to 70,000 errors per billion device hours per megabit” is a bit hard to grapple with. If you assume each error is a single bit then that’s 20 to 50 bytes per GB DIMM per month, or one bit per GB every two hours.
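
If you want to sanity-check that conversion, a quick back-of-the-envelope script (assuming each reported error is a single flipped bit, 1 GB = 8192 Mbit, and a 30-day month) lands on roughly the same numbers:

  # Convert "errors per billion device hours per megabit" into per-GB figures,
  # assuming each reported error is a single flipped bit.
  MBIT_PER_GB = 8 * 1024        # 8192 Mbit per GB
  HOURS_PER_MONTH = 30 * 24

  for rate in (25_000, 70_000):                       # errors / (1e9 h * Mbit)
      bits_per_gb_hour = rate / 1e9 * MBIT_PER_GB
      bytes_per_gb_month = bits_per_gb_hour * HOURS_PER_MONTH / 8
      print(f"{rate}: ~{bytes_per_gb_month:.0f} bytes/GB/month, "
            f"one bit per GB every {1 / bits_per_gb_hour:.1f} hours")
  # prints roughly 18-52 bytes/GB/month, one bit per GB every ~1.7-4.9 hours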


> or one bit per GB every two hours

How is that possible? Wouldn't such an error frequency lead to user-observable problems all the time? For example, in the average statically compiled codebase, flipping a single bit in source code being edited (with the code then being saved to disk) will make it fail to compile with high probability, which would be noticed immediately. Yet I've never encountered this situation in practice, nor heard of anyone else encountering it, and like most people I don't even use ECC RAM. That seems incongruent with the figure quoted above.


This paper is ~15 years old. It was published in 2009, but the data was collected 2006-2008. As densities for things like RAM increase, error rates have to decrease commensurately to hide errors from end users.

Additionally, Google runs hardware in their datacenters in a much more pathological way than most consumers will (e.g. they are more likely to run hardware much hotter than typical consumers).

The point is that the figures in the paper are an interesting starting point for discussion, but they're not necessarily applicable to your laptop.


I'm not convinced by this argument. In practice, errors on consumer hardware seem to be several orders of magnitude less common than implied by the paper. If they weren't, we'd be hearing anecdotes of one-character diffs appearing out of nowhere all the time, among many other issues.

It's hard to see how this could be explained away with the slight increase in RAM densities since 2008, or different temperatures in data centers (plenty of individuals run their systems crazy hot, e.g. when the tower is stuffed with dust after years without cleaning).


Somebody at Mozilla a while back had the idea of using browser telemetry to find off-by-one errors caused by bit flips and correlate them with cosmic storms. Interesting, but not definitive, results:

https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...

I have access to a massive number of API logs for my job, so it was easy to do something similar - most bit-flip errors in a request would result in a 4xx of some sort, especially a 404.

Every time there's a huge magnetic storm, I run a query to see if 4xx responses go up in correlation with the storm. I even filter the logs for the geo hardest hit. Nope. No correlation. ¯\_(ツ)_/¯


ArenaNet did this for GuildWars when they encountered particularly inexplicable bugs:

[Mike O’Brien] wrote a module (“OsStress”) which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second.

On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!

https://www.codeofhonor.com/blog/whose-bug-is-this-anyway
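
A minimal sketch of that kind of in-loop self-check (the real OsStress code isn't public, so the pattern function and block size below are invented purely for illustration):

  # Sketch of an in-process memory/CPU self-check in the spirit of the
  # "OsStress" module described above; the "known answer" function is arbitrary.
  BLOCK_WORDS = 1 << 16

  def known_answer(i: int) -> int:
      # Cheap deterministic calculation whose result can be recomputed later.
      return (i * 2654435761 + 12345) & 0xFFFFFFFF

  scratch = [known_answer(i) for i in range(BLOCK_WORDS)]

  def stress_check() -> bool:
      # Call from the main loop; False means memory or CPU gave a wrong result.
      for i, value in enumerate(scratch):
          if value != known_answer(i):
              print(f"self-check failed at {i}: {value:#x} != {known_answer(i):#x}")
              return False
      return True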


Thanks for sharing that -- on its own it was a great read.

At first, I was not surprised that they saw 1% of GuildWars players having these issues. My reasoning was that, given the time of the article (2012) and its pointing back to a hazier era somewhere between StarCraft and Diablo[0], their user base probably included a lot of "custom builds", a not-small portion of which were built with overclocking in mind, in an era when the firmware/motherboard/processor would happily apply settings that became extremely unstable under load.[1]

... and I got about that far before I thought: only 1%? As in, only 1% of users managed to break things with "let's see what happens if we alter the voltages/frequencies so the DDR2-333 memory runs at DDR2-2048 (it's been a while) and change the various latency settings", throwing caution to the wind, while all of this hardware sits in a case with a fan configuration that slightly lowers the air pressure inside[2]...

Back in the day (it's less common today) you could often pick any value from a range of "Way Too Low" to "Way Too High" and every step in between. If it could still reach the boot sector... "Weeeeeeeeee!!" I recall a customer who completely trashed an Intel Pentium (2? 3?) chip via firmware settings his "kid. was. F---in'. with!"[3]

Maybe I'm wrong and it's just that this particular game attracts people who rarely overclock or build their own machines, and the 1% is "just those people", but anecdotally, everyone I knew back then who was a PC gamer either started with a build of their own or went that route the instant they had a job that allowed them to buy their own computer (rather than sharing Dad's, or getting his hand-me-downs if you were really lucky/have fewer years on you than I do).

My experience, especially back in the 08-09 era, was that things were ridiculously forgiving. Depending on the use, overclocking would either "do nothing at all" or maybe let you get away with slightly better graphics settings while maintaining a playable game. Generally, I purchased the cheapest memory rated slightly better than what I intended to configure it for (DDR2-xxx slightly higher than where I would clock it) with whatever I decided were the right latency values[4], placebo-overclocked day-to-day, but would push it from time to time. There have been times the system simply wouldn't boot, but one "tiny notch south" and it would pass a 72-hour memory stress test.

I've had crashes, locally, that I can't explain beyond "memory corruption", but I can't say whether it's an OS failure, a hardware failure, or "something that's not memory corruption but other malfunctioning hardware"... it passes a stress test, and I usually find the "other hardware" culprit a little while later (surprisingly often the PSU, for me).

[0] Unclear whether the editions implied are the first in both cases, and it's not important enough to matter, since anything reasonably earlier than 2012 puts you in the category of "the firmware/motherboards will let you do incredibly stupid things".

[1] Sometimes subtle "some character's armor texture is transparent" (making your enemies see-through, not naked, much to my teenage chagrin). Sometimes crazy like "Everything is shades of green with partial geometries".

[2] Say, you had to have the sleekest case, where every available airflow "vent" can take a fan facing inward or outward (and it can sometimes be a bad idea not to leave one empty), and every one of those fans is blowing out... I've seen "The Vacuum Computer".

[3] So, this really happened to me (CompUSA, filling in at the Parts Department), and that's exactly the way I heard it, but you need to understand that it wasn't some "Jersey Shore" accent -- this guy was Korean, with an accent so thick, and not a day younger than... 60? Aside from the shock of a person very much struggling with the English language dragging out the "F" word, you just don't see a plain-looking, harmless old man speak with such vigor and gusto!

[4] This being "unimportant" information for me day-to-day, it requires hours of "refreshing via Google", learning every miserable thing that's changed, coming to a complete understanding, finding that the memory I want either doesn't exist, doesn't work with my motherboard, or costs a fortune, and settling somewhere pretty close to what is "typical", if not... exactly typical.


Were they testing system RAM exclusively? If it's VRAM, I'd agree with this. Back in the day we'd even overclock VRAM until we got "snow" in the rendered graphics, then back off a bit.


Run any consumer device for more than 4 weeks and it'll slowly accumulate all kinds of weird bugs. I've seen smartphones unable to receive SMS because they simply ran too long without a reboot, a Windows PC where selecting files wouldn't work anymore, etc.

> we'd be hearing anecdotes of one-character diffs appearing out of nowhere all the time

On consumer devices the "attack" surface might simply be small. A server running 24/7 serving countless requests would probably see more issues than a laptop sitting idle 95% of the time and loading a website every few minutes. Also, a bit flip isn't guaranteed to cause a noticeable bug - and if it does, it might not stand out from the usual noise of janky software oddities.


We're perfectly capable of being bad enough at software to do that without invoking cosmic rays and quantum weirdness.

How long would you expect to have to run memtest on any desktop computer in good condition before you observe a memory error? Because I would expect the number to be somewhere between 'weeks' and 'forever.'


IDK what you mean by consumer, but I've routinely had desktop computers running 24/7 for several hundred days without noticeable glitches on Linux (BTW I run Arch).


> a Windows PC where selecting files wouldn't work anymore etc.

That is just Windows Explorer being shite. They even have a specific “restart” button for it in Task Manager for when it leaks into itself so much it stops functioning usefully.


Isn't the restart there because explorer is also the desktop? If it was just killed, you'd have to know how to launch shutdown.exe from the task manager etc.


I think it is a relatively recent (Win8?) addition, and explorer has been the shell since '95. I take adding a specific button for it as an admission that Explorer is getting increasingly buggy, so users are needing to restart it more. Before then you could do the same thing from the same task manager just in two steps: kill the existing explorer.exe and start a new one from the Run command in the menus.


I find this kind of difficult to believe, because I have machines with >6 months of uptime and never see any issues. Always just running regular consumer grade hardware.


One question is what proportion of the memory is being used for different things, and which of those you'd notice.

The actual text representation is tiny compared to (for example) the framebuffer of said text rendered to the screen. I wouldn't be surprised if the size of my desktop background in RAM is larger than all the code I have written ever in my professional life. You probably wouldn't even notice a bit corruption in that, especially if it's a temporary image buffer being updated constantly.

As you'd expect a random error to hit things proportional to their size in memory, there's likely orders of magnitude more running code in your editor than the actual text you're editing, which is more likely to simply crash than silently corrupt a diff.

And on crashes: there are issues I've seen from user reports with backtraces and memory states that just don't make sense. I personally suspect that some, especially those without multiple examples, could quite easily be due to random memory corruption. And there's a long enough tail of these weird issues in that sampling that I honestly wouldn't be surprised if such corruption events were a lot more common than many people think.

And that's on a pretty locked down platform so no worries about consumers overclocking. I've heard it's even worse for some of my colleagues working on more gamer-adjacent products - some people's definition of "Stable" is rather loose.

And as for the OG paper, I think it also said the errors weren't evenly distributed over all hardware - likely due to manufacturing differences, some parts run closer to the edge than others, but in a way that doesn't break quickly enough to be easily tested and replaced as defective. So you might just be unlucky and have a weirdly unstable system.


> If they weren't, we'd be hearing anecdotes of one-character diffs appearing out of nowhere all the time, among many other issues.

I think the amount of textual user data in memory is dwarfed by various kinds of program data (pointers, metadata, static strings, libraries, etc) as well as image data. It seems pretty likely that bit flips could result in a crash once in a while or could have no effect at all if the bit flip is in some rarely used piece of an executable loaded in memory. For example, would anyone notice that a letter in an error message string is corrupted?

So errors may be happening at the rate that researchers report, but with only a tiny fraction of them happening in user data where they will be noticed.


I think that the main difference is that the average person reboots their PC at least once a day, and in addition a good chunk, if not most, of the RAM is unallocated most of the time. Combine the two, and the chances of a bit flip actually affecting something important get very low.


But it should be applicable to EC2 instances, no? I have quite often seen instances running for multiple months. If there were thousands of random bit flips for any app, I think we should assume it is not running what we expect it to be running.


All EC2 instances use ECC: https://aws.amazon.com/ec2/faqs/


It just dawned on me that EC2 is probably named that to avoid confusion with ECC.


> they are more likely to run hardware much hotter than typical consumers

I'd expect laptops and phones with their limited cooling capacity to run their hardware much hotter than datacenters with their expensive HVAC systems


> As densities for things like RAM increase, error rates have to decrease commensurately to hide errors from end users.

But I would say that with more density you actually increase the error rate. Things like rowhammer are possible because higher density means smaller capacitors (in volume, and also in capacitance), making it easier to cause bit flips. Or is there something in recent RAM to compensate for the smaller caps?


I'll run memtest86 for a full day when I build a system or get new ram. I've never seen an error outside of the situation where a stick has gone bad and it errors like crazy. I don't use ECC other than on my homelab server either.

Tons of people do this and have no issues with bit flips or we'd be hearing about it.


In my experience, after handling many dozens of systems (out of 3000) with ECC errors, it's quite rare for the problem to be reproducible in memtest86. Running in production often triggers many errors per day. Our alarm threshold is over 100 correctable or 1 uncorrectable error per day. After producing zero errors in 24 hours of memtest86, a machine usually shows more errors once returned to production. Not sure if it's heat/cooling related, the access pattern, or maybe that memtest86 tests the memory value but not the memory address. I.e. if you fill memory with 0xdeadbeef you'll read 0xdeadbeef successfully even if the address is off.
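
One classic way to make addressing faults visible is to use each word's own address as the test pattern, so a bad row/column decode reads back a value that doesn't match its location. A rough sketch of the idea (illustrative only, no claim about what memtest86 actually does internally):

  import array

  # Fill a buffer with each word's own index; an addressing fault then shows
  # up as a value that doesn't match its location.
  N_WORDS = 1 << 20
  buf = array.array("Q", range(N_WORDS))   # 64-bit words, value == index

  def check_addresses() -> int:
      bad = 0
      for i, value in enumerate(buf):
          if value != i:
              print(f"word {i} holds {value}: possible addressing or bit fault")
              bad += 1
      return bad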

DIMMs can "fail" but still have low error rates. Stories are pretty common; Linus Torvalds had one in the last year and spent something like a week tracking it down, assuming it was a new kernel bug. Tracking down memory errors without ECC is a painful process that involves ruling out numerous possible causes of crashes.

Sure bitflips are easy to ignore, but weird things do happen. Linux crashes, processes crash, suddenly things are weird, the desktop doesn't work like you expect. People assume that "the system is buggy" or it's related to patching, or some other user space error, but sometimes it's just a bitflip.

Sadly there are few cases where a bit flip is obvious, but if you keep 10k files around for a decade it's pretty common to find a few corrupted. Various weird behaviors have been tracked down to memory issues. GCC for years had a class of bug reports that nearly always turned out to be an internal compiler error... triggered by a memory problem. One famous speed run of a game had something very weird happen, which was tracked down to a single-bit memory error.

Sure, you can get away without ECC, certainly. But the minimal price increase is, in my opinion, quite reasonable. I'd much rather get a "single bit error corrected on dimm #3" dmesg than occasional (or not so occasional) process, kernel, or filesystem crashes.

I'm typing this from a 2015 system with a Xeon E3-1230 CPU that was cheaper than the equivalent i7 but has ECC support. Sure, I spent a bit more on the motherboard and DIMMs, something around $120. Money well spent. I find system crashes quite disruptive; even one fewer per year is valuable to me.


My server is quite similar, an E3-1275L v3 with 32GB of ECC from a trashcan Mac Pro. I had been planning to migrate my i9 desktop into that role at some point, but since it doesn't support ECC you're making me second-guess that option. Maybe I'll eBay the lot and get a Ryzen. ;)


I believe Alder Lake and newer Intel desktop chips support ECC, if you get the right motherboard/chipset.

Desktop Ryzens have supported ECC for several generations, but ECC support varies by chipset. AMD forums often discuss it. Some Ryzen motherboards mention the support and certify ECC DIMMs.

So ECC is possible, but annoying.


Anecdotes are not real world statistics. Most people wouldn't identify a flipped bit of memory that caused a web page to glitch in their browser as a hardware problem. They'll just write it off as general enshittification of the web.


> How is that possible?

I'd say that we live in a world of checksums, cryptographic checksums, retries, deterministic builds, distributed VCS (like Git), digital signatures, etc. which helps a lot with the very rare bitflips that do actually happen.

> For example, in the average statically compiled codebase, flipping a single bit in source code being edited (with the code then being saved to disk) will make it fail to compile with high probability, which would be noticed immediately.

I fully agree. If I set "unstable" RAM settings in my UEFI/BIOS, I cannot compile, say, Emacs without getting a segfault.

> That seems incongruent with the figure quoted above.

I don't understand it either.


One bit per GB of RAM, not per GB of processed source code (most of your RAM is likely empty most of the time, or used for stuff where a bit flip is less likely to be noticed). However, it still seems like a high estimate to me.


Operating systems keep files cached in RAM, in addition to allocated memory, until they need the RAM for something else, so I think it should be mostly full all the time.


Unused RAM ~= wasted RAM...

Recently I had to troubleshoot a Linux process that started segfaulting randomly. Having dealt with non-ECC bit flips before, I took a copy of the binary as loaded (i.e. from the filesystem cache), dropped the caches so it would be re-read from disk, and then compared and disassembled the two copies: a single bit flip had resulted in an illegal instruction.
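
The comparison step itself is easy to script - roughly something like this (paths are placeholders):

  # Compare two copies of what should be the same binary and report which
  # bits differ.
  def diff_bits(path_a: str, path_b: str) -> None:
      with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
          a, b = fa.read(), fb.read()
      for offset in range(min(len(a), len(b))):
          if a[offset] != b[offset]:
              print(f"offset {offset:#x}: {a[offset]:#04x} vs {b[offset]:#04x}, "
                    f"xor {a[offset] ^ b[offset]:#010b}")

  # diff_bits("binary.from_cache", "binary.from_disk")   # placeholder paths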

Cost of investigating issues >>> cost of ECC


But bit flips may not be noticed any time soon in the disk cache, and when they are later noticed, may be wrongly attributed to disk failure. It's a different failure mode than a bit flip in some source code that you are about to compile (which is far less likely to happen, due to that source code being a much "smaller target").


I have 32 GB of RAM. If I only use 1 GB of it it's still one bit flip in the RAM I'm using.


Eh? The "cosmic ray" (or similar upset) is roughly equally likely to affect any bit in the 32GB, so the chance of it appearing in the 1GB that you're using right now is 32 times smaller than the chance of it happening anywhere in your RAM.


it is my interpretation as well that most bit-errors occur in places that simply don't matter or are handled in userspace error-detection/correction (tcp-checksums, asset-checksums, encryption, etc)


Any RAM-caused bit flip that happens, e.g., before the TCP checksum is computed is just as likely to happen after the checksum is verified; the data sits in roughly the same memory either way.


leaving aside that a checksum doesn't catch every bit-change, that's still a 50% chance that it _will_ be detected.


I'm arguing against

> most bit-errors... are handled in userspace error-detection

If there really are a significant number of bad-RAM errors getting caught at the TCP layer, then there are also a significant number happening after that. There aren't, so there aren't.


Yeah, that's not realistic. That's about as much code as goes through a CI system I work with many times a day. I'd constantly see errors about an unknown variable if this were the real rate.


Aren’t you forgetting the time aspect? If you’re keeping that code in RAM for just a few seconds at most then the probability of a random bit flip is negligible, even if the hourly rate is high.

But I agree this sounds unrealistic.


Presumably, most of the code doesn't change often, and consequently, should be kept in the filesystem cache and never actually read from disk. So the numbers could apply, I think.


It's about one second of video at "editing quality" ;-)


Or you are using ECC RAM?


If it's this common, it should be straightforward to test as well, right? Allocate 1 gig of memory, do whatever mmap/mlock calls you need to make sure it's not paged out, fill it with deterministically random stuff, then every hour check the contents and refill it. Keep it running for a month.

Seems odd to me that we have to rely on some paper from 2009...
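
Something like this would be a starting point on Linux (a rough sketch only - mlock needs a generous RLIMIT_MEMLOCK, and error handling is minimal):

  import ctypes, hashlib, os, time

  SIZE = 1 << 30                       # 1 GiB test buffer

  buf = bytearray(os.urandom(SIZE))    # random fill; a hash serves as the reference
  libc = ctypes.CDLL("libc.so.6", use_errno=True)
  addr = ctypes.addressof((ctypes.c_char * SIZE).from_buffer(buf))
  if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(SIZE)) != 0:
      raise OSError(ctypes.get_errno(), "mlock failed (raise RLIMIT_MEMLOCK)")

  reference = hashlib.sha256(buf).digest()
  while True:
      time.sleep(3600)                 # check once an hour
      if hashlib.sha256(buf).digest() != reference:
          print(time.ctime(), "buffer contents changed: possible bit flip")
          buf[:] = os.urandom(SIZE)    # refill and keep going
          reference = hashlib.sha256(buf).digest()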


They observed that the majority of DIMMs had no errors. They had ECC RAM, so it's possible for a DIMM that is going bad to carry on in service, regularly producing correctable errors. If it were in your laptop you would start seeing problems quite quickly and end up replacing it.


DDR5 includes on-die ECC in the standard (due to the high density).

It's not the same as "full" ECC, though, as it offers no protection once the data leaves the chip (on the wire, at the CPU, etc.).


I think you're off by a couple orders of magnitude there. 25e3 errors/(1e9 h * 1e6 b) = 25e-12 errors/hour/bit, ~~or 4 errors/hour/terabyte.~~

Edit: nope, I got that last step wrong, it's actually 200 errors/hour/terabyte, or 1 error/gigabyte/5 hours, pretty close to what you said.


I never really experienced that in real life. It is the sort of thing we need to cope with in functional safety (some call it machine safety). There are people working with (mainly) electrical machines, and by all possible measures an error caused by interference, a bit flip, a stuck bit, … needs to be detected so that the worst case, dead or injured people, is prevented. These are things we also need to cope with in automated/self-driving cars.


> I never really experienced that in real life.

Most of the time the flipped bit will be in a cat photo and go unnoticed.


4 bit flips per GB in a working day, in every GB of my laptop. Let's keep using powers of 2: over 256 days of use that's 1024 bit flips per GB per year. Some will happen in unused memory, some in running programs. It's 1024 * N bit flips in our code editor per year, where N is how many times larger our editor plus language servers are than 1 GB.

Some in unused and unsaved data, some in code paths I'm not using. However some will probably crash a program because of a wrong or invalid machine code instruction or visibly alter data.

Multiply that by the number of people using computers. Maybe those spelling mistakes or wrong figures were not our fault after all?

In a code editor, people should occasionally see one-character git diffs in files that are open but parked in a long-unused tab (if the editor detects the change and saves it, or asks to).


I've observed that error rates are not the same for every machine - there are very bad machines, which usually have some way of triggering enough errors systematically that they end up getting someone's attention, then there are bad machines which have memory errors or are observed to report errors or crash occasionally, and then a bunch of good machines which on the surface appear to work flawlessly. Sometimes what makes a machine bad is the memory itself, other times it's the hardware or the configuration that the hardware settles on. The statistics may be skewed by the uneven distribution of "bad machines" amongst good ones, especially if you take into account hardware in the hands of end users rather than servers. It gives a new meaning to "works on my machine."


DRAM is not greatly affected by radiation, because the capacitors are large structures relative to radiation events. SRAM is affected, which is why SRAM arrays should always use SECDED ECC.

The dominant cause of DRAM failures is bit flips from variable retention time (VRT), where a cell fails to hold charge long enough to meet the refresh timing. These are believed to be caused by stray charge trapped in the gate dielectric, a bit like an accidental NAND cell, and they can persist for days to months. This is why the latest generations (LPDDR4X, LP/DDR5) have single-bit correction built into the DRAM chip. Along with permanent single-cell failures due to aging, this probably fixes more than 95% of DRAM faults.

The DRAM vendors sure could do a lot better on publishing error statistics. They are probably the least transparent critical technology used in everything, but no regulation requires them to explain, and they generally refuse to provide fault statistics even to major customers (which is why folks like AMD run large experiments at supercomputer sites to investigate, and most clouds gather their own data).

That said, DRAM chips are pretty good. The DDR4 generation probably had better than a 1000 FIT rate per 2GB chip, so a laptop with 16GB would have seen fewer than 10 errors per million hours, or under 1 error per 50 laptops used for a year.

For many of us the vast majority of data is in media files. I personally notice broken photos and videos every now and then. I would love to have a laptop with a competent level of ECC, but they do not exist. Even desktop servers often come without it. It is unclear how much better the LP/DDR5 generation will be, since the on-die ECC still does not fix higher-order faults in word lines and other shared structures, which may sum to as much as 10% of aging faults. All simply educated guesses, since the industry will not publish.


So this might be a dumb question but it's been bothering me and you sound like you might know.

What's the big advantage of DRAM over SRAM? In school we learned that DRAM was cheaper -- but surely the difference between 1T1C and 6T isn't more than 6 and my intuition says C is big so it's probably 2 or 3 or something for a given process generation. The problem is that the latency of DRAM is absolutely dreadful. On one hand I see a staggering amount of engineering that goes into hiding DRAM latency, and on the other hand I see that DRAM has become so cheap that many systems are over-provisioned by a factor larger than its theoretical cost advantage purely by accident. The "obvious solution" would seem to be DIMMs of SRAM with (comparatively) wicked fast timings -- but this doesn't happen, despite the fact that the memory industry is extremely competitive and filled to the gills with smart people, so presumably there's another factor that stops "DIMMs of SRAM" from being viable. Do you happen to know what it is?


Density. Storing a bit in DRAM requires one capacitor, whose dielectric is simply the gate insulation layer on a transistor. Storing a bit in SRAM takes at least four, and typically six, transistors.


That's precisely the answer I didn't find convincing for the reasons I mentioned. If it were that simple, I strongly suspect we would have SRAM-DIMMs and DRAM-DIMMs duking it out in the marketplace in analogy to SSD vs HDD a decade ago.

> one capacitor, whose dielectric is simply the gate insulation

Every DRAM cell depiction I've seen in the last ~5 years has had a gigantic trench capacitor. Are those not in production?


Density isn't the only issue - power usage is also a contributing factor. Because SRAM uses more transistors per bit, leakage of the transistors in large arrays is a significant source of power draw. In DRAM leakage of the single transistor per cell can be compensated for by adjusting the refresh rate.

MRAM and other persistent memory technologies might be used someday, but there's a lot of R&D work to get them to the same level of price and performance as DRAM. It's sad that Intel gave up prematurely (imho) on Optane.


Ah, that makes sense! It's too bad that CMOS becomes leaky at small sizes.

Yeah, too bad about Optane, but CXL gives me hope that the next decade will bring more action in this space.


and it's not like SRAM isn't used in modern computers -- it's baked directly into the silicon of your CPU to serve as register banks and cache


Great question! I wasn't smart enough to ask it myself. Gonna learn something here I hope. ..


>SRAM is affected, which is why SRAM arrays should always use SECDED ECC.

How many SRAM arrays exist in the electronics we use every day that don't have, at minimum, an error detection and reset mechanism? What about hardware registers that aren't structured as 2D arrays - how often are those protected? Things like buses and counters are at least somewhat vulnerable too.


It seems to me that CT (or its operators?) should take a lesson from adversarial blockchains (cryptocurrency) here: a new state should not be propagated without verification.

I think that, for CT, this should be fairly straightforward. Some machine with access to the signing keys should generate new nodes and signatures and push those internally to some front-end machines. The latter (on separate physical machines) should fully validate the result before propagating it any farther. No outside user sees the result until at least, say, 3 machines fully validate it. Then, if validation fails, the state could be rolled back internally.

When a rollback occurs, the signing machine would think it’s signing a new, conflicting state, but that’s fine: no one outside the log operator has seen the old conflicting state.


In a blockchain, there are concurrent, multiple, anonymous appends to the log. That's why you need consensus.

In CT, all appends are controlled. Nothing is anonymous.


A distributed consensus mechanism provides Byzantine fault tolerance - which is helpful even with trusted actors, as this event demonstrates.


It shouldn’t even need much distribution or any protocol change. If every CT operator required three separate nodes to validate a proposed new block before distributing that block, then the only way to corrupt the log would be for an invalid block to pass verification three times (due to a bug or to a vanishingly unlikely coincidence of hardware errors) or for something to accidentally publish the block without verification.

Unlike cryptocurrency, this would require no fancy public protocols, negligible computational resources, and no additional verification overhead outside the log operator at all.


This happened before, a couple of years ago, and the problem seems to have repeated itself. https://news.ycombinator.com/item?id=27728287


I cannot understand why they don't use two-of-three voting to produce the log entries.

It doesn't have to be three machines owned by separate organizations, or even in separate buildings. Just three servers in the same rack, doing the same computations, and nobody signs anything unless one of the other two produces the exact same result.
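
Sketching the idea (compute_tree_head and sign_and_publish are hypothetical stand-ins, not any real CT codebase):

  from collections import Counter

  def two_of_three(results):
      # Return the value at least two of the three replicas agree on, else None.
      value, count = Counter(results).most_common(1)[0]
      return value if count >= 2 else None

  # heads = [compute_tree_head(r) for r in replicas]    # hypothetical helpers
  # agreed = two_of_three(heads)
  # if agreed is None:
  #     raise RuntimeError("replicas disagree; refusing to sign")
  # sign_and_publish(agreed)                            # hypothetical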


Is coincidence plausible here? Anyhow, Ian Fleming's thoughts on such things seem applicable:

"Once is happenstance. Twice is coincidence. Three times is enemy action"


Bitflips just happen. There's a good chance the device and connection you're using right now has had several bit flips in (unused) memory and you'll never even know.

Everything is fine until these flips happen in critical code or data paths. ZFS corruption is a famous example, but there's also DNS corruption that happens quite regularly (and has been demonstrated to be usable for malicious purposes).

I'm a little surprised the certificate transparency protocol doesn't validate the incoming data well enough to detect these bit flips, but on the other hand most software I've seen just assumes the bytes in memory and the bytes received through the network are all what they're supposed to be.


eh, I bet you are wrong.

I've run memtest many times, often for days, and it never detected any bitflips in RAM. And memtest is specifically designed to exercise every bit of memory and detect bitflips. While I've seen my share of PCs with memory errors, when you see one you replace memory / tweak settings, and then run memtest for a day or so to ensure it does not reappear.

Also, "unused memory" should not be a thing in modern PC: the read cache should expand to fill it - it's super fast to evict and provides tangible benefits in case of hit.

The errors usually occur on the boundaries - in the network, in the SATA connection to the disk, or even on the USB bus. For example, the original "ZFS corruption" story (which seems to be gone from the regular web, but I think it's [0]) pretty clearly mentions damage "on the way to disk".

[0] https://web.archive.org/web/20091212132248/http://blogs.sun....


Not sure what it is about memtest86, but in my experience it doesn't find most memory errors. I've had serious memory issues that trigger memlogd, dmesg, or similar reports about EDAC/ECC errors. I run memtest86 overnight: no errors. Restart the node and get more errors. This seems to happen most of the time; only rarely can I reproduce errors I see on a production system with memtest86.


Same guy too (Andrew Ayer). Someone is keeping an eye on it!


Do some folks still not validate new entries in their certificate transparency logs on at least one other machine before publishing them?

This is getting to the point (2 log failures in just under 2 years) that I wouldn't be surprised to see some certificates invalidated because they only used 2 transparency logs and both failed within the lifetime of the cert.


We don't take bit flips seriously enough: practically every consumer device uses non-ECC memory, and very few folks use filesystems (e.g. ZFS) that can detect corrupt blocks. Even when using those things together, it's still not perfect.

Everything is terrible.


Why the downvote? Is it really acceptable for the machine reading your passport at the airport to have a bit flip? Or for the person processing you at the DMV, the doctor reading your medical history, or a million other things that power the modern world?


On reads it's easy enough to do a re-read... bit flips when writing something are somewhat more critical, as that is non-repeatable.


>Unfortunately, it is not possible for the log to recover from this.

That sounds bad. What does this mean? Does the entire log need to be thrown out, and a new log created to start from scratch?


Comment by the author (from the last time this happened) seemed helpful:

"OP here. Unless you work for a certificate authority or a web browser, this event will have zero impact on you. While this particular CT log has failed, there are many other CT logs, and certificates are required to be logged to 2-3 different logs (depending on certificate lifetime) so that if a log fails web browsers can rely on one of the other logs to ensure the certificate is publicly logged.

This is the 8th log to fail (although the first caused by a bit flip), and log failure has never caused a user-facing certificate error. The overall CT ecosystem has proven very resilient, even if a bit flip can take out an individual log.

(P.S. No one knows if it was really a cosmic ray or not. But it's almost certainly a random hardware error rather than a software bug, and cosmic ray is just the informal term people like to use for unexplained hardware bit flips.)"

-https://news.ycombinator.com/item?id=27731210


Cosmic rays are common. There is a detector exposed to the public in the Toledo metro station at Naples, Italy. It's placed 40 meters underground. In this short video you can see 2 cosmic rays passing by https://www.youtube.com/watch?v=K_puq6U7khg


I have no knowledge of how the CAs maintain the CT logs.

What is the process for a CA to rebuild the CT log, if one exists? Is it something like what is illustrated below?

Let CA1, CA2, CA3 and CA4 be different certificate authorities. The set enumerated alongside each CA is the set of certificates logged into its CT log.

  CA1 : {c1, c2, c3}

  CA2 : {c2, c3'}

  CA3 : {c1, c2}

  CA4 : {c1, c3}

Suppose CA2 is where an issue was detected with certificate c3 (the anomalous cert is denoted by c3') and CA2 trusts CA3 and CA4. Then the set {c1, c2, c3} can be reconstructed by verifying the CT logs of CA3 and CA4 and merging them.

Is that kind of how it would work or would CA2 just truncate its log and restart from this point forward?


There's no merging or rebuilding. You just nuke the bad log, and certificates continue to work because they are present in other logs. (If all the logs used by a certificate go bad, then you'll have a bad time. This hasn't happened in 5 years of mandatory CT.)

The log operator may choose to stand up a replacement log. The new log will have a different key, URL, and name from the old log. It's for all intents and purposes a completely different log.


Thanks. So the consequence is that browser vendors and others who utilize CT should ignore this log and use whatever is stood up as its replacement. That makes sense.


The broken log will just be replaced by a new one.


If there is a software remedy (or at least an amelioration) for the hardware problem of random bit flips affecting a software data structure, it might involve creating multiple (2 or more) redundant copies of the data structure in memory and checking each one for consistency against the others at specific intervals...

If the random bit flips affect code, and the code is deterministic, then one solution (or at least an amelioration) might be running multiple copies of the same code, loaded at different memory locations, and checking the results of one copy's calculations against the results of the same code loaded and executed from a different memory location...

Kludgy? Yes -- but if the underlying hardware is buggy (random bit errors which cannot be removed for whatever reason), then it may be the only effective way to make the system work, despite the kludginess...

Which brings up a strictly academic question -- what would an OS look like in which every OS data structure and code path was replicated, with the redundant code paths run and the redundant data structures checked against each other at various intervals?

(I know NASA did something like that a long time ago with using something like five redundant computers where each computer checks the results of the computation of the group, and if there's an inconsistency, the computer producing the inconsistent result would be shut down...)

Related: https://history.nasa.gov/computers/Ch5-5.html


Do we know whether the machine on which this occurred had ECC memory?


It appears to be hosted on AWS, which claims to use ECC memory.


My lifetime experience says to suspect software did this (I've had careers in both hardware design, including designing large memory subsystems, and in software development). Yes, it's one bit changed, which makes the mind go to the ever-present alpha particle, but code also flips single bits. If some library code inside the process generating this data wanted to update a bitmap structure but got the address wrong, you'd have the same outcome.


I just don't have the same level of trust in HW. My priors include a bunch of dead hard drives and a few bad sticks of RAM, all of which caused significant subtle damage before the big catastrophic failure that drew attention. None of these were ECC or RAID, but I've also seen enough foot-guns with ECC and RAID to place the probability of "solution degraded to consumer reliability" considerably north of zero.


Funny enough my experience says that it is hardware. Software bugs rarely manifest at such low frequencies (say 10^-15 or so) compared to hardware faults.


Flipping a single bit is a lot harder to do by accident in code than corrupting a byte or multiple bytes. You really are not doing bitwise operations on a regular basis in most types of software.


Reminds me of bitsquatting, where cosmic rays, hardware faults, or other errors flip a bit in a domain name, and the advantage you can gain by purchasing bit-flipped domains.

>Over the course of about seven months, 52,317 requests were made to the bitsquat domains [0]

[0] https://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabu...


Section 5.3 of the PDF is veeery interesting.

They check where the most requests to bitsquat domains come from. They chose microsoft.com as a more neutral reference than e.g. FB. And looking at the graph, the lion's share of requests to bitsquat domains for microsoft comes from... China?? Followed by Brazil? (and only then by US)


This is perhaps one of the few legitimate use cases for a distributed blockchain. Then several nodes have to agree for the chain to advance.


Why?

What would a distributed agreement do that isn't achieved by the existing multiple logs per cert?


Automating the consensus.


(More/stronger) ECC would seem a better choice. What am I missing?


> This is perhaps one of the few legitimate use cases for a distributed blockchain.

It's incredible how a global propaganda machine has turned one of the most transformative technologies of our time into something that's considered shady and even crime-adjacent by default, and for which "legitimate" use cases are somehow considered special, when in reality there are countless potential applications for distributed ledgers. But the powers that be want to maintain centralized control at all costs, and their pushback has clearly reached even the minds of HN users.


Or, hear me out, the technology sucks. And a great deal of software engineers here can see that the Emperor has no clothes.


If you can name an alternative technology that solves the same problem (integrity consensus without centralized authority), I'm all ears.


Why do we care about the centralized-authority part so much? There are a lot of problem domains where zero trust is either not desirable or understood to be a fool's errand.

In a B2B setting where you have 3 parties working together on an integration project wherein 2 parties are B2B vendors and the 3rd is a B2C org, where does the centralization reside? What about vendors of the B2B vendors? I honestly don't know what we are getting at with this word anymore.

If we were to apply something like SQL Server w/ Ledger tables (i.e. centralized blockchain) to this kind of problem in my shop, we would almost certainly find a solution that all parties would find agreeable. This forces you to trust 1-2 parties (i.e. Microsoft themselves), but in the above we agree that for many (most?) areas this may likely be explicitly desirable. There are also technologies (again, centralized) that provide non-repudiation through the hosting layer itself. Example of this being something like Azure Confidential Ledger.

The part where this seems to get frustrating for people is the desired crystalline & immediate nature of the system. If you operate with the tiniest amount of extra flexibility you can get so much more done - E.g. perhaps the business can review a data tamper event tomorrow with their partners on the phone.


Try again when you can solve the problem without introducing a dozen others that make everything, overall, worse.

Until then, trusting a central authority will do just fine. It works well enough now.


Perhaps the problem described is actually not the important one to solve. Once you leave the realm of pure digital operations and try to apply blockchain to any real-world use case you run into the same problem again and again - the inputs to the blockchain could be wrong, either through an honest mistake or malice.

Blockchain does nothing to resolve this, and it seems like a much more common issue than malicious manipulation of data on the backend. Indeed, centralized authorities are actually a kind of solution to the dirty-input problem, because they can alter records to agree with reality after the fact if a mismatch is discovered.


For this particular use case all the possible benefits from integrity consensus can be obtained while still having a centralized authority. The grandparent post notes that it would be better if "several nodes have to agree for the chain to advance.", and that's true, but there's no benefit for it to be distributed, the best and most efficient way to implement that is to simply have like three nodes in a single room 'voting' in a 2-out-of-3 manner, and there is no need to bring in extra complexity required to do the same thing without centralized authority.


It's on the 4th line where it says

00000030: 9126 9384 ....

Instead of 9284 ....


Relevant video from Veritasium on cosmic rays flipping bits and causing chaos

https://www.youtube.com/watch?v=AaZ_RSt0KP8


With things like RAID and Reed-Solomon codes, we have the ability to have verifiably correct data even with some percentage lost; how come something like this isn't used?


Wait, it's DigiCert's again? (Previous: https://groups.google.com/a/chromium.org/g/ct-policy/c/PCkKU...) Do we have a list of all failed CTs?


Not directly, but you can search for "Failed" on this page: https://sslmate.com/app/ctlogs


Interestingly, Google has a team of engineers dedicated to detecting hardware prone to these kinds of errors by perpetually QAing devices in the field. You can then replace the node before it breaks something important.


I would hope, at the very least, everyone doing important work is using ECC and actively monitoring their correction counters. This should be automated.

We do a lot of extra testing on top of our vendor's to catch weak bitcells before a device is shipped to customers. Over many generations of tech, RAM faults have always been a small but constant source of failures.


RAM isn’t the only source of these errors. They can originate inside of a CPU core as well.


Yes. And in long wiring paths (in fact, these are far more frequent than core logic faults). However, we've found ATPG stuck-at and transition fault coverage to be vastly better with each generation, due to improvements in the methodologies, the tools themselves, and our vendor's attitude. Of course, even with coverage in the high 90s that still leaves many paths unchecked, but it's a small percentage of the faults (that we find...).

But for our devices, RAM faults don't seem to be getting much better, they're always a pain point.


what happened here?


an append-only log became non-writable earlier than expected


[flagged]


In CT, we want every inconsistency manually checked and reconfirmed. They are published to a public mailing list like this one to deter attacks.



