Cosmic rays causing 30k network malfunctions in Japan each year (mainichi.jp)
194 points by uncleberg on April 7, 2021 | 116 comments



Kind of a fun anecdote: in "the old days", when you'd run a server room in your own office, we had a very large HPC cluster plus a significant amount of storage and other one-off servers on the top floor of a mid-rise office building. I eventually moved it all to a former nuclear fallout facility where our systems were three floors down and under a gigantic pool of water. Error rates and random crashes fell off IMMEDIATELY. I believe Microsoft reported similar findings with their submerged data center.


Wonderful. Yet another reason for the IT office to be relegated to a sub-basement.


Why the whole IT office? Just the hardware running the software, not the humans who develop it.


Most offices/businesses have nothing to do with software development. Most IT shops are in basements, or at least on lower levels near the core servers and other infrastructure they manage. "The IT Crowd" was a stereotype, but it was still based on the reality of most businesses. I have never seen or even heard of an IT shop with a skylight. Even windows are rare.


I feel like OP was joking


should turn the entire floor into a big toilet


There are places lower than the basement.


Yeah, when looking into this myself I did float in conversation that we'd need to build a giant pool over our data centers. It was more of a joke, though, and never went anywhere.

Although our error rate was more like 1-2 per week on the equipment I was looking at.


Pools are expensive. The same weight of gravel or sand would produce a similar effect. Put the server farm under a multi-level car park structure.


Water is one of the most effective radiation shields we have for cosmic rays. NASA has contemplated using a water shield for Mars astronauts. There is a reason most fallout shelters are built under a pool when possible.


NASA has also contemplated using human feces as shielding. The mixed-use, dual-purpose systems for spaceflight are a special case, though. What really matters is mass. Sand is cheaper than pool water.


Doing some quick searching, it looks like there has been research into which materials shield against cosmic rays better than others. A couple of minutes of reading suggests there is more complexity than just mass: different materials have different properties, including at different energy levels of the cosmic rays. Material cost and construction cost would also be factors, where a lot of sand might make for an easier build than the equivalent shielding in water. Although we're probably at the point where it's better to use something like a mine that is no longer in service.


Sun’s UltraSPARC II CPU modules were claimed to be particularly susceptible to cosmic rays circa 2000.



Not so much susceptible as "came with their own radioactive source"


A reduction in floor and building vibration might have contributed too?


Or for example, old cables and improperly installed cooling fans might tend to get fixed during a move. Hard to know for sure.


Well yeah, that's why it's presented as an anecdote. :)


Totally. Kind of reminds me of http://www.ibiblio.org/harris/500milemail.html


I was recalling this paper characterizing loss of system drive performance due to environmental vibrations (though I'm sure there are others).

https://www.usenix.org/legacy/events/sustainit10/tech/full_p...


Awesome. Next time clients call about their services being down I am totally using this one.


I've had a couple outages just from people jiggling networking cables, back when nobody cared about locking them down.

I wonder how much of that was cosmic rays and how much was just less foot traffic resulting in fewer errors.


Can you give an order of magnitude for any of the improvements? E.g., crash rates halved?


I have no experience with tracking this kind of thing. How do you do it, what kind of analytics/tracking program is used, etc.?


And people still think the sun is nuclear. Ha!


Cosmic rays are, or rather were, the reason why manufacturers avoided shipping high-end cameras by plane; cosmic rays kill pixels in imaging sensors: https://youtu.be/98FZ8C6HneE?t=479

It's also the reason why all ISS footage has dead pixels on it.


Astronauts[1] also report seeing random flashes of light from time to time, believed to be due to high-energy particles interacting with their optic nerves, or cells in the retina, or maybe neurons too. Space is not a very friendly environment.

This is somewhat similar – although on a much different scale – to when Anatoli Bugorski[2] had an accident in 1978 involving a beam of protons traveling at over 99% the speed of light going through his head. He also described it as a flash "brighter than a thousand suns".

[1] https://en.wikipedia.org/wiki/Cosmic_ray_visual_phenomena

[2] https://en.wikipedia.org/wiki/Anatoli_Bugorski


When I was a postdoc at Berkeley I was talking to the head of the bioengineering department, who mentioned the phenomenon. He had actually written a paper in which he put an appropriate set of filters in front of a particle accelerator to ensure that, statistically, a photon was emitted at a specific frequency (i.e., it output 1 photon/sec, though only on average). Then he sat in front of the filters and looked into the accelerator.

https://www.nytimes.com/1971/03/04/archives/space-lights-tra...

Still one of the most audacious things I've seen, but he was a good physicist and did the math right, so he wasn't being unsafe.


Unrelated tangent, but this is the third time in two days that I've seen Anatoli Bugorski mentioned. Twice on HN, once on YT.

I guess I'm just noticing his name more?


This is a fascinating claim (and video), but I can't find a ton of corroborating evidence for it: searches for "digital camera cosmic rays" bring up (1) considerations for cameras in space (much higher background radiation levels than a commercial flight), and (2) consumer photography forums where people are discussing that 2011 talk. Some of the latter seem to think that Kodak has/had a vested interest in spreading fear around digital photography, since their analog division was imploding at the time.


Any ray that might damage a CCD would also impact film stock, exposing sections as bright dots. Did Kodak also not ship film by air?


Not really, no. It might be too brief to cause a chemical reaction, but powerful enough to induce a very high voltage.


Does anyone know how this works with our space-based telescopes? Or even our mountain top observatories? Surely if it's this crazy, they have to deal with that as well?


Yes, the electronics, including the imaging sensors, are all radiation hardened.

This can involve things such as digital circuits that are radiation resistant (e.g. look up radiation-resistant flip-flops); multiple computers running the same computation that all "vote" on the correct result, so if one computer has an error from radiation you don't suffer; and semiconductors that are more resistant to radiation (larger band gaps mean more energy is required to flip a bit).
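A minimal sketch of the voting idea (my own illustration, not any particular flight system's implementation):

    # Triple modular redundancy in miniature: three independent results,
    # majority wins, so a single radiation-induced error is outvoted.
    from collections import Counter

    def vote(results):
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: more than one unit disagrees")
        return value

    # Unit 1 suffered a bit flip; units 0 and 2 still agree.
    print(vote([42, 46, 42]))  # -> 42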

Physical shielding is key as well. The infrared imager on the Cassini probe had a case made out of tantalum, as tantalum is a very dense material which prevents a lot of radiation from going through it.


Also, larger feature nodes mean more capacitance, so a lower voltage spike and more area to dissipate the same amount of absorbed energy.


Well, one thing they also do is take pretty much any desired image as multiple exposures, at slightly offset positions in close succession, precisely to remove such artifacts. The images are then stacked, etc.

This allows removal of effects that do not correlate with what should physically be in the image but are artifacts of the sensor, imaging system, etc.:

-- sensor artifacts: dead pixels, flat field irregularity, pixel response variations, electronic noise

-- imaging system issues: optical problems, lens/mirror defects

-- and then exactly what's being discussed here: cosmic rays, transient objects (satellite tracks!)


Interesting, where can I see an example of the ISS footage with dead pixels? I've never noticed them before.


https://www.youtube.com/watch?v=QvTmdIhYnes

Modern camera firmware detects stuck and dead pixels and tries to fill them in with neighbor data, but when there are too many... there's not enough data to fill in.

Keep watching the top-left part of the video. Most visible at 0:30.
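"Fill them in with neighbor data" boils down to something like this (a toy sketch for a grayscale frame, assuming you already have a mask of known-bad pixels; not actual camera firmware):

    import numpy as np

    def fill_dead_pixels(img, dead_mask):
        # Replace each flagged pixel with the median of its healthy 3x3 neighbors.
        out = img.astype(float).copy()
        h, w = img.shape
        for y, x in zip(*np.nonzero(dead_mask)):
            y0, y1 = max(0, y - 1), min(h, y + 2)
            x0, x1 = max(0, x - 1), min(w, x + 2)
            good = img[y0:y1, x0:x1][~dead_mask[y0:y1, x0:x1]]
            if good.size:  # if every neighbor is also dead, leave the pixel as-is
                out[y, x] = np.median(good)
        return out

Once a whole cluster of neighbors is dead, as in that footage, there's nothing clean left to borrow from, and the defects show through.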


Actually, the very end, when they go into a dark place, is where it's most visible. The whole sensor is covered with defects.


Maybe it's me, maybe it's the screen, I can barely spot any dead pixel at the 00:30 mark. But at the end of the video it's very clear. Thanks for pointing it out.


Does this apply to human retinas too?


Very interesting, so should people who travel with photography equipment take night flights if possible?


Cosmic rays hit our planet regardless of the Sun; if anything, it might block a few. They are caused by ultra-energetic cosmic events, such as star collisions and black hole feeding cycles.

Cosmic rays are just very potent photons, capable of knocking electrons out of an atom (meaning they are ionizing), causing havoc in precision electronics and, well, our DNA.


What is the human tolerance for cosmic rays? I ask because on the journey to Mars you'll get quite a few, I guess, and also on the planet's surface once you're there, with such a thin atmosphere.


This is actually a thing people are criticizing Musk for; he is being accused of sending people to Mars with insufficient shielding against cosmic rays.

In my opinion it's a bit of a silly argument, as there's a whole bunch of other risks and quality-of-life sacrifices made by the people who are going to undertake that journey. Some raised chance of cancer is probably the least of their worries.


He can't send people to Mars yet so how are people accusing him of sending people to mars with insufficient shielding? There are limits to criticism of Musk one would think.


Humans going to Mars should expect to never come back


But at least they should reasonably expect to arrive.


Anyone know what expectations Columbus or other expeditions had about survival/risk?


For a normal NASA flight

"NASA told Business Insider it estimates there’s a 1-in-276 chance the flight could be fatal and a 1-in-60 chance that some problem would cause the mission to fail (but not kill the crew)"

That's fairly low -- it feels like the risk in the age of sail would have been far higher. Columbus's second voyage alone had about 25 deaths just from scurvy, and his first voyage had only about 90 crew.

But remember that life wasn't exactly easy on land either.


The Magellan expedition left with 5 ships and ~270 men and returned with 1 ship and 18 men. People knew that going on long overseas expeditions was very dangerous, but so was life in general in 1500. I imagine worrying about a 50% increase in cancer risk on a voyage to fricking Mars would seem very silly to those explorers.


A lot less than an astronaut has when going to Mars, for sure. We at least know exactly how far away Mars is, what obstacles lie in the way, and what inhabitants it has. They just got on a boat, prayed for fair wind and good weather so they might reach India, and thanked God there was a whole other continent in between.


That's actually a fascinating question! It turns out we don't know enough to answer one way or another.

The model commonly used in radiation protection to assess cancer risk with respect to radiation dose is called the Linear No-Threshold (LNT) model [1]. The model critically assumes that (1) total radiation dose is the only predictor of cancer risk, and (2) any radiation exposure results in an increased cancer risk.

This model works at high absorbed doses; however, its applicability is highly controversial for low absorbed doses, or for relatively high absorbed doses that occur over a long period of time (i.e., a low dose rate).

The thing is, the human body has built-in defense mechanisms against cancer, such as DNA repair. There is a good body of evidence that small doses and low-rate exposures do not result in increased cancer risk (i.e., there is a threshold absorbed dose and probably also a threshold dose rate), but the model does not account for this.
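In rough symbols (my own sketch of the two competing pictures, with R_0 the baseline risk, D the absorbed dose, alpha the slope, and D_T a hypothetical threshold; not a formula from the article):

    R_{\mathrm{LNT}}(D) = R_0 + \alpha D

    R_{\mathrm{threshold}}(D) =
      \begin{cases}
        R_0                    & D < D_T \\
        R_0 + \alpha (D - D_T) & D \ge D_T
      \end{cases}

The controversy is essentially about which of these, if either, describes the low-dose, low-dose-rate regime.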

This is particularly problematic when trying to assess excess mortality from things such as radiological accidents: when you multiply the small LNT-predicted risk for a low dose by a very large population, you end up with a lot of cancers. This is one of the reasons you'll see estimates for deaths from the Chernobyl accident vary by orders of magnitude.

It's also problematic when assessing something like a Mars mission: yes, the astronauts would get large cumulative doses, but the dose rate is pretty low over most of the mission (other than during high dose rate solar events where they would need radiation shielding). How much of an elevated cancer risk is it actually? Nobody is quite sure.

[1] https://en.wikipedia.org/wiki/Linear_no-threshold_model


I vaguely remember that if there had been a solar eruption during a lunar mission, the crew might have died. I tried a while ago to find a good source on that, though. If this is true, forget Mars.


As far as cancer rates go, I've heard it's about as bad as taking up smoking cigarettes. Which is not great, but doesn't seem like a show-stopper.


According to https://xkcd.com/radiation/, you get about 4x the normal daily background radiation from a transamerican flight.


Cosmic rays can also be high-energy ions. Most of the photons come from the sun, and the ions can come both from outside the solar system and from the sun.


Cosmic rays are not exclusively associated with the sun; there are plenty of galactic cosmic rays as well. They're pretty much leftovers from supernovae, black hole mergers, neutron stars, accretion disks and such. But yeah, at night would be better.

Edit: Correction, there is very little difference and fewer cosmic rays during the day. Source: https://arxiv.org/pdf/physics/0105005.pdf


They have to do two things :-)

1) Use ECC memory

2) Go underground

"One experiment measured the soft error rate at the sea level to be 5,950 failures in time (FIT = failures per billion hours) per DRAM chip. When the same test setup was moved to an underground vault, shielded by over 50 feet (15 m) of rock that effectively eliminated all cosmic rays, zero soft errors were recorded.[6] In this test, all other causes of soft errors are too small to be measured, compared to the error rate caused by cosmic rays."

"Soft Errors" https://en.wikipedia.org/wiki/Soft_error#Cosmic_rays_creatin...


> 1) Use ECC memory

Not exactly. When I was in telco, the place I had this problem was in FPGAs; we had ECC memory everywhere and I never linked any problems to bit flips in RAM. But as I remember, the FPGAs we had used a type of SRAM cell, and because it isn't a memory module, the FPGA configuration itself could take a bit flip. So the product had a checksum function that read back the configuration on a cycle and reset itself if the configuration no longer matched the checksum. We would see 1-2 crashes/restarts per week in our FPGAs that we believe were bit flips.

We then ran an analysis on any units with higher-than-expected error rates to try to identify genuinely bad hardware and replace it.

I think the vendor eventually came up with a way to reprogram the FPGA without just crashing and rebooting the entire board.
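Roughly, the check-and-reset cycle looked like this (a from-memory sketch, not the vendor's actual API; read_config and reload_config are hypothetical callbacks):

    import hashlib
    import time

    def scrub_loop(read_config, reload_config, golden_image, period_s=1.0):
        # Periodically read back the FPGA configuration, hash it against the
        # known-good bitstream, and reload (our products just reset) on mismatch.
        golden_digest = hashlib.sha256(golden_image).hexdigest()
        while True:
            live = read_config()
            if hashlib.sha256(live).hexdigest() != golden_digest:
                reload_config(golden_image)
            time.sleep(period_s)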


Many modern FPGAs now include dedicated logic for config SRAM "scrubbing." This logic continuously checks config frame checksums to identify upsets. These can then be fixed in real-time either using the error correction properties of the checksum technique, or from the non-volatile config memory (typically NOR flash). It's also important to note that only a subset of the SRAM config bits are critical for a given application. Usually this is a small percentage of the overall array.

https://www.xilinx.com/support/documentation/application_not...

If even higher levels of reliability are needed, there are rad-hard-by-design FPGA families (e.g. Xilinx Virtex 5QV). These have a special config SRAM cell that has more charge storage sites than a conventional SRAM cell. It is less area efficient than a conventional SRAM cell, but geometry of the charge storage sites ensures that a single cosmic ray can't flip the state of a majority of them. Essentially the cell can self-correct, no scrubbing required.


Interesting, and it makes sense. Do you have any additional references you would suggest, particularly in the context of FPGAs?

Would you say this quick reference is a good overview? https://www.intel.com/content/dam/www/programmable/us/en/pdf...


Sorry, I should have mentioned this was quite a few years ago, so I'm very out of date and don't have any known-good references handy. That link you shared seems pretty good on a quick scan through, and in line with what I remember. I'm pretty sure I dug up similar resources for other vendors, including one I think was looking at satellite hardware.


Compaq/HP handled this with many redundant resources / cores: https://en.wikipedia.org/wiki/Tandem_Computers


Polyethylene is supposedly good at blocking cosmic rays. It would be funny to me if the fix was just a drop ceiling full of old grocery bags.


Have a source? That seems like an awesome way to recycle grocery bags.


Water is also a good protector. The polyethylene is really a proxy for "hydrogen atoms": a high density of hydrogen means a lot of protons to interact with the incoming radiation.

But unfortunately plastic bags are not dense polyethylene. You would kind of need full blocks of solid polyethylene...


You can make solid polyethylene out of plastic bags by pressing them in a mold heated to ~100-150C.

But it's quite flammable, so you might not want to use it as a building material.


The peer comment posted a source that has lots of references.

Though most of them are tests in space, where I assume the thickness requirements would rule out grocery bags. I am curious how thick a layer of HDPE you would need on earth to make any notable difference.


That linked nature paper seems to indicate 10g/cm^2 for a 50% reduction. A standard shopping bag is about 5g, so you would need roughly 20k bags/m^2, or ~2k bags/ft^2. At 0.9g/cm^3, that would be roughly 11cm of solid polyethylene
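A quick back-of-the-envelope check of those numbers (the 10 g/cm^2 figure and the ~5 g bag mass are the assumed inputs):

    areal_density = 10.0    # g/cm^2 for ~50% reduction, per the linked paper
    bag_mass = 5.0          # g per shopping bag (assumed)
    hdpe_density = 0.9      # g/cm^3

    mass_per_m2 = areal_density * 100 * 100       # 100,000 g = 100 kg per m^2
    print(mass_per_m2 / bag_mass)                 # ~20,000 bags per m^2
    print(mass_per_m2 / bag_mass / 10.76)         # ~1,860 bags per ft^2
    print(areal_density / hdpe_density)           # ~11.1 cm of solid polyethylene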


I just squished a plastic bag as much as I could, and got it down to an inch cubed. So if I hypothetically did that for 2000 bags in 1ft^2 (sorry for English units), I think that would mean it could be 13-14in thick. So maybe it is reasonable to attenuate radiation by about 50% in a rather generous drop ceiling?


Ah, so bags are out, as is practically the ceiling. But cut sheet HDPE is easy to find, so tiles atop your machine would be relatively cheap and easy.


I work for a particle accelerator, and I can confirm that our beam dump uses polyethylene for neutron shielding: https://www.sciencedirect.com/science/article/pii/S092037961...



Doesn't sound like that solution is plenum-rated.


Nice, I'm going to line the rooftops with Tyvek now.


5950 failures per billion hours is approximately equal to 1 failure every 19 years.


And if you have a datacenter with 200 machines, that's roughly once a month.


Each machine is going to have way more than 1 DRAM chip. A 128GB DDR4 stick is going to have ~20 chips on one side (not sure if they're double-sided, just going by a product listing), and you're going to have terabytes of RAM per machine.
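Putting the numbers together (the chip count per machine below is just a guess for illustration):

    fit_per_chip = 5950                  # failures per 1e9 device-hours, from the quote
    hours_per_failure = 1e9 / fit_per_chip
    print(hours_per_failure / (24 * 365))          # ~19.2 years per DRAM chip

    chips_per_machine = 40               # guess: e.g. a couple of sticks at ~20 chips each
    machines = 200
    print(hours_per_failure / chips_per_machine)   # ~175 days per machine
    print(hours_per_failure / (chips_per_machine * machines))  # ~21 hours for the fleet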


That's a bit more than just 'go underground'. I've seen people dig for cables but I've never seen them dig a hole 15m deep.


A 'Faraday Cage' should also help.

edit: No idea why the downvotes. Faraday cages have long been thought of as a way to protect electrical devices from the electromagnetic waves that can result from solar flares.

I even chatted to someone from NASA’s Solar Dynamics Observatory about it in the past.


Cosmic rays are not electromagnetic waves. They are highly energetic particles, like protons and naked helium nuclei.


Because a Faraday cage would make it worse, not better:

https://www.reddit.com/r/askscience/comments/1fsiv9/how_effe...


This reminds me of Bitsquatting, which is like typosquatting but uses domains one bit-flip away instead of one mistyped letter away.[0] In a small experiment, for some popular domains, the author reportedly got a lot of hits.

In the associated DEFCON talk, the author notes that even if you use ECC main memory, the DRAM inside your hard drives is unlikely to have it too.[1]

0: http://dinaburg.org/bitsquatting.html

1: https://www.youtube.com/watch?v=9WcHsT97suU
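Generating bitsquat candidates is only a few lines (a sketch of the idea, not the author's actual tooling):

    def bitsquats(domain):
        # Yield domains one bit-flip away, keeping only valid hostname characters.
        allowed = set("abcdefghijklmnopqrstuvwxyz0123456789-")
        for i, ch in enumerate(domain):
            if ch == ".":
                continue                 # leave the label separators alone
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit)).lower()
                if flipped in allowed and flipped != ch:
                    yield domain[:i] + flipped + domain[i + 1:]

    print(sorted(set(bitsquats("example.com")))[:10])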


I have always wondered whether new ideas are seeded by cosmic rays disturbing neurons in the brain...


That'd mean people underground (or under a thick layer of concrete, like in skyscrapers) are not creative.


Sounds like we have a hypothesis and a test environment; now we just need a test for creativity and we can start doing science!


I guess you could say they'd be...

Living under a rock.

Yeah, yeah, I'm leaving. Put the weapons away...


But grad students are often sent to work in the basement, and yet they need to have the most ideas!


I suspect that modern processors are closer to quantum limits than what neurons are.


That's actually mind blowing to think about. Thanks for sharing this.


> That's actually mind blowing to think about.

That would be too many cosmic rays.

And the "flipped the wrong neuron" possibility is much scarier in my mind. And probably in everyone else's brain as well.


I always wondered if that random stabbing pain in your rib or kidney etc. was a cosmic ray to the nerve.


seems too low-level, no? the temporary excitation of a single neuron would most likely just fade away into the equilibrating process of the brain


>In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear system can result in large differences in a later state.

https://en.wikipedia.org/wiki/Butterfly_effect


Is there reason to believe the brain is chaotic on that level? The reason I doubt it is that any stable biological system has, as one of its primary tasks - if not its #1 task - maintaining equilibrium. Otherwise it would collapse into chaos due to the inherent imprecision of molecular phenomena. So I would be very surprised if you could stimulate a neuron - not even a neuron, but an electron inside a neuron - and cause something as complex as an idea to form. More likely it would be equilibrated away.


Well, cosmic rays can cause perceptions: https://en.wikipedia.org/wiki/Cosmic_ray_visual_phenomena

That said, just saying "chaos theory means any change can cause anything" is a terribly weak argument for "cosmic rays cause ideas". Not to mention all the reasons why it's implausible as a significant idea generator.


Ahh, the ol' Cisco get-out-of-jail-free card.

Headline would be more accurate if it said "30K unidentified bugs blamed on unverifiable phenomenon each year in Japan".


Incorrect. According to the article NTT experimented with neutrons (simulated cosmic rays) in a lab.

NTT systematically set out to be able to distinguish these different kinds of error, and you go ahead and ignore that effort.


I'm speculating here, but it seems likely that software bugs cause more of these errors than cosmic rays do.


Found the rust programmer.


The rust programmer is the one out there ending that phrase with "and this is a problem that must be solved".

...

Well, the rust programmer and the nuclear engineer, possibly.


This is a great place to ask: are our brains susceptible to this kind of interference, and if so, do you think their structure evolved to handle some corruption of this kind?


We could reduce the cosmic ray flux by thickening the atmosphere. A small increase in normal air density would give us a significant reduction in cosmic rays. Talk to the terraformers. They've given a lot of thought to how to make air.


Many computer problems are probabilistic in nature with very low occurrence rates, and it can be very challenging to come to any real conclusions about the cause and outcome of various process failures. While I believe cosmic rays are a problem, the issue I'm dealing with now is bad silicon: when you make a bunch of chips (millions), some fraction of them will compute a few operations incorrectly sometimes. Post-manufacture validation doesn't catch all problems, and ultimately some bad machines slip into the serving fleet.

ML people who train on this hardware report more NaN-caused training failures than software bugs alone would explain. It's extremely challenging to debug because most ML codes are very robust to small amounts of injected noise, especially independent Gaussian noise (there's literature showing that introducing random noise in training often helps training go faster).

This is a fascinating area and there aren't a lot of people who can really make forward progress in it.


https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...

A more comprehensive and interesting take on the cosmic ray phenomenon. It's actually a global issue that is accounted for in certain electronics.


Cosmic bitflip is like the Godwin's Law of BOFH excuses.


As we move to 5nm chips, I wonder if there's a greater likelihood of errors from background radiation.


Is appeal to cosmic rays a candidate for an engineering logical fallacy?


I'm curious whether anyone has run a year-long memtest and seen what happens. That, and whether they can perhaps measure the actual number of particles coming into contact with the system.


meanwhile, here is a 230 megapixel photo of the sun I saw just this week:

https://twitter.com/AJamesMcCarthy/status/137946154911957401...


How many are caused by Windows? :-)


> Soft errors occur when the data in an electronic device is corrupted after neutrons, produced when cosmic rays hit oxygen and nitrogen in the earth's atmosphere, collide with the semiconductors within the equipment.

Uhhhhhhh. Neutrons? Doubt goes here. I was pretty sure charged particles and ionizing radiation are to blame for messing up your electronics.


Neutron radiation is in fact ionizing radiation.

It is much more dangerous than other types, as it readily passes through most materials, yet the dose is still absorbed. Absorbed neutrons (almost always) cause the absorbing nuclei to become radioactive, which causes havoc as those newly radioactive nuclei decay.

The other common types of radiation (alpha, beta and gamma) do not (usually) cause things to become radioactive; instead, they are a threat if the actually radioactive material accumulates, for example as dust on your skin.

Neutron radiation is much more dangerous because it seeds radioactivity deep in the body, so you can't clean it off by washing or waiting for the absorbed particles to be flushed back out of the body naturally within a few days.


One of the ways high energy neutrons interact is by scattering protons elastically. Their masses are essentially the same so the neutron can transfer its momentum very efficiently. These scattered protons (perhaps from plastic in the packaging etc) are then the charged particles that flip bits. These protons are generally lower energy than cosmic ray protons, which means they deposit MORE energy per distance traveled, so they can be worse than cosmics directly.
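The textbook kinematics behind that: in a non-relativistic elastic collision, the maximum fraction of kinetic energy a particle of mass m can transfer to a target of mass M is

    \frac{\Delta E_{\max}}{E} = \frac{4 m M}{(m + M)^2}

which is essentially 1 for a neutron hitting a proton (m is approximately M) and tiny for a neutron hitting a heavy nucleus.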


Supposedly yes, neutrons [0] (they give a link under Causes). I guess neutrons are more penetrating than protons and can pass through chips, buildings, etc. more easily, and even though each interaction is weak, they end up making a more significant contribution than protons and other particles that interact more readily.

[0] https://en.m.wikipedia.org/wiki/Soft_error



