The mystery of my desktop that locks up when it gets too cold (utcc.utoronto.ca)
111 points by zdw on March 31, 2019 | 77 comments



Get a thermometer and check the ambient temp. It could be a bug in the mobo/BIOS. As an AWS SRE, I once saw an issue where waves of racks became unavailable. They'd go offline and be rebooted by automation, whereupon they'd come up only to go offline again. Our usual approach to solving such issues didn't work (we had a concall spanning 3 days to triage this issue). We typically relied on IPMI sensors for thermal events in the fleet. The typical script fetched the IPMI temp and raised a ticket if it was too high. What we never factored in was a temp going too low. It turns out that the datacenter in question would let in outside air during winter to help with cooling. One of the air vents got stuck open and the servers went cold... (in deg C): 5, 4, 3, 2, 1, 0, 255 << oops, the BIOS initiates a server shutdown due to a very high chassis temp. The server vendor issued a patch a few weeks later. We had to put an alarm for temps going too low into our monitoring!
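
The 0 to 255 jump above is what you would expect if the reading lives in an unsigned 8-bit field, though the comment does not confirm the actual encoding. A minimal Python sketch of that assumption, with hypothetical function names and thresholds, showing why a low-side alarm catches the problem before the wraparound does:

```python
def to_uint8(temp_c: int) -> int:
    """Store a temperature the way an unsigned 8-bit field would (wraps modulo 256).

    Assumption: the firmware keeps the raw value in one unsigned byte.
    """
    return temp_c % 256


def check_temp(reading: int, low: int = 5, high: int = 80) -> str:
    """Classify a raw reading; the low/high thresholds are made up for illustration."""
    if reading > high:
        return "too_hot"
    if reading < low:
        return "too_cold"
    return "ok"


# A room cooling from 5 C down to -1 C, as seen through the 8-bit field.
raw = [to_uint8(t) for t in (5, 4, 3, 2, 1, 0, -1)]
for value in raw:
    print(value, check_temp(value))
```

With only the high-side check, the first alert is the bogus 255. With the low-side check added, the readings below 5 already raise "too_cold" several steps earlier.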


Interesting. A decade ago I observed something similar to the article while traveling in Nepal / Tibet.

A friend and I both had iPods (of different generations) that got 'altitude sick'.

At two points in the trip we got to about 5,000m. Each time both iPods went totally unresponsive. Both times we were only up there for a few hours, and once we descended they'd start working again.

The obvious explanation was something to do with the cold and the battery, but both times the battery showed the same charge once they started working.

Can't remember exactly what the ambient temperature was (we were in cars both times) but it's puzzled me whether it was cold or altitude.


A quick trip to Wikipedia suggests that iPod Classics had a spinning disk for storage. The heads in hard drives "float" on a cushion of air just above the platters. When the outside air pressure gets lower, that cushion gets thinner and the gap shrinks. At a low enough pressure the heads can touch the platter and the drive can stop working altogether. I do not know if that is what affected your iPods, but I do know that hard drive failures at the summit of Mauna Kea (~4000m) are common and it seems to mostly be down to luck which ones work and which don't (once you have a working one, it tends to stay working).


Wow, that's awesome. No idea if that's actually what was happening, but it fits the picture of why these were getting 'altitude sickness'


Wow I can't believe they used an unsigned 8 bit integer for temps


I recently had to parse string-based temperature readings coming in over UART (e.g. 25.4C), and I forgot to include the leading minus character in my regular expression. I think generative testing is a great option here if you run the regex "in reverse".
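
A rough Python sketch of that idea, assuming readings look like `25.4C` (the exact UART format is not given in the comment). Instead of literally running the regex in reverse, it generates random values, formats them the way the sensor presumably would, and checks that the parser round-trips them. The names `TEMP_RE`, `parse_temp`, and `roundtrip_check` are hypothetical.

```python
import random
import re

# The optional leading "-" is the part that is easy to forget.
TEMP_RE = re.compile(r"^(-?\d+(?:\.\d+)?)C$")


def parse_temp(line: str) -> float:
    """Parse a reading such as '25.4C' or '-3.0C' into degrees Celsius."""
    match = TEMP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised reading: {line!r}")
    return float(match.group(1))


def roundtrip_check(samples: int = 1000, seed: int = 0) -> None:
    """Generate random readings, including negatives, and parse them back."""
    rng = random.Random(seed)
    for _ in range(samples):
        value = round(rng.uniform(-40.0, 125.0), 1)
        text = f"{value:.1f}C"
        assert parse_temp(text) == value, text


roundtrip_check()
```

Dropping the `-?` from the pattern makes `roundtrip_check` fail on the first negative sample, which is the kind of bug a hand-picked set of positive test values would miss.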


He does measure the ambient temperature in the article. Happens around 60 degrees Fahrenheit, so well over freezing.


Maybe cold solder on a component on the mobo. Heat expands the joint slightly making a connection, cold temperature makes the contact points contract away from each other, breaking the connection. I had a similar issue with an LCD dashboard display in my car, resoldering fixed it.


The temp threshold reported (60F) seems too high for this to be a mechanical issue. I'd put my money on some marginal circuitry, probably in the PSU or on-board DC/DC converters, or perhaps a crystal -- they're wily things. I suppose it could also be "bad silicon".

Anyhow, time to break out a can of freezer spray. For those without a Fry's handy, you can substitute compressed air cans (available at Staples and Costco), inverted so as to spray the propellant. Except I didn't tell you to do that if you end up giving yourself frostbite.


A lot of people don't realize how fragile semiconductors are with regard to reliability. It takes a TON of work to get a virgin part ready for "production." There's huge pressure to get it out the door and it goes out working "just enough." Semiconductor parts are held together by tons of "bubble gum and baling wire" hacks to keep them operational over wide temperature swings.


I'd just add that you get what you pay for. There are no bubble gum hacks keeping your ISP's networking gear forwarding packets when their air con fails and the inlet temperature hits 50C. Or when it goes the other way (to a lesser degree). That's the higher spec they are engineered to meet.


I had a loose SMD chip in an old phone which I fixed by jamming cardboard from a pack of Fig Newtons between the case and the logic board. I recommend choosing a bit without the filling on.


That's exactly what Apple does to their refurbs! https://www.youtube.com/watch?v=XaGHcBZjmWA


Yep, came here to suggest cold solder too. Would be entirely unsurprising.


Could also be a cracked trace of some sort. Either way a good smack can really help. All those movies where they hit the console and it comes back to life aren't too far off.


I once used a CRT monitor at school where the picture would spontaneously lose color. A good smack always fixed it... for about 30 seconds. It became a reflex - work work work work SMACK work work work work SMACK work work work...


I've heard a similar story. An old colleague had been introduced to a computer with the same problem and fix. He looked at it for a while, then reached behind it and seated the cable properly, and the problem went away. Turns out there's an actual reason VGA cables had those screws.


Oh yeah. A badly seated RAM stick or PCI/graphics card (in desktops) can do the same. A couple of times after car trips I 'fixed' my old desktop with a judicious slap on the side of the case.


Those temperature related problems can be reproduced and spotted by using one of those cold spray cans ("freeze spray" or similar names) made especially for electronics that can be used to selectively cool down parts of a device until the problem appears.


In a pinch you can also use a can of compressed air (not actually air) held upside down.


Chances are the substance in "canned air" is the same as in "freeze spray" --- a hydrofluorocarbon refrigerant. (They used to be a CFC, before those were banned for ozone layer depletion.)


It is not widely known that HFCs are a really, really terrible greenhouse gas, with hundreds of times the infrared-retaining power of CO2 and hundreds of times the half-life in the atmosphere.

It is estimated that if all the HFCs currently in use end up vented, it would account for half the total greenhouse effect. It's really important to make sure A/C systems don't leak. If you are ever in a position to specify a big system, get one that uses ammonia.


> get one that uses ammonia

... or CO2, or hydrocarbons. The general term is "natural working fluids". But yes, a thousand times this.


Ammonia is toxic, hydrocarbons are flammable, and CO2 requires extremely high working pressures meaning more expensive equipment and lower efficiency. It's not surprising that before ozone depletion and global warming were known, CFCs became the refrigerant to use, since they're almost completely inert and work at low pressures.


I agree that the CFCs have some nice properties.

But the new generation of synthetic refrigerants like R1234 are not nice, they are flammable AND they form nice things like hydrofluoric acid when they burn.

CO2 is widely used in Japan in the vending machines you find everywhere, it's widespread in supermarket refrigeration systems, and it's looking to be the refrigerant of choice for next-gen EVs. It's already what Mercedes uses for the AC in the S-class.

Hydrocarbons are flammable, yes, but the amount used for a residential unit like a refrigerator is less than what's in your can of lighter fluid in the cupboard. For a commercial kitchen, people have no problem with multiple 10 kg cylinders of hydrocarbon for cooking, but a few hundred grams in the refrigeration system is very dangerous?


That's why it's super important to have a good infrastructure for deconstructing and safely emptying AC units and refrigerators. Older models use the nasty stuff that broke down the ozone layer; glad that has been banned at least.


Wouldn't just compressed air (the same stuff we breathe) be good enough?

Actually, in places where compressed air is used frequently, why not use a compressor? Much more renewable. I can imagine regular compressed air causes static electricity or something though so maybe a system that negates that (and filters the air)?


Much of the cooling comes from a phase change - the reason it works well as a freeze spray is the same as its use in a refrigerant. The can contains liquid which boils. To get the same effect with compressed air you'd need much higher pressure.


It's maybe worth mentioning that people doing that trick with duster gas (such as R-134a) should avoid freezing any body parts, and should also try not to breathe much of the gas.


Is R-134a that bad for you? It's the propellant in my albuterol inhaler (for asthma).


R134a is being phased out in automotive AC systems due to its high global warming potential. It’s being replaced by R1234yf, and in the newest systems, CO2 (aka R744).

I would hope it’s not being used unnecessarily as an aerosol in things like air dusters!


R-134a is just one of the possibilities in duster gas, [1] but none of them are great to be inhaling.

[1] https://en.wikipedia.org/wiki/Gas_duster


The danger is that it's an asphyxiant --- like most other gases. Otherwise it is inert, which is why it is used. Inhaling a small quantity such as from a dose of an inhaler would not be harmful.


Like when my phone would only turn on after I put it in the freezer. Eventually that stopped working too. I miss that phone. I loved that phone. I still have it. Can't seem to let it go.


I had an electronic module that worked in our lab, failed at the customer's, and then worked again in ours.

Turns out one of the pins in a space grade connector (MDM-25) open-circuited exactly between 68F and 70F. Our lab was 70F, and the customer's was 69F. The way we caught it was by watching the temp chamber as it slowly ramped from cold to ambient.

Turns out the connector supplier had shifted production to Mexico, and they were contaminating the contacts with RTV when sealing it.


You have the skills to diagnose this sort of thing but I suspect that a mere Unix sysadmin (soz: "herder") doesn't.

For the likes of Chris and me: put in a new mobo and move on, is surely the fix that any techy would do.


You can selectively hit areas with cold spray (i.e. duster turned upside down). Could be a bad/oxidized connector, cracked solder joint, tombstoned part. Only worth diagnosing it for curiosity.


Be careful when using duster that way, some of them leave a white, powdery residue.


RTV?


It’s a type of silicone rubber: https://en.m.wikipedia.org/wiki/RTV_silicone


Check to see if the screws that hold down the motherboard are flexing it in any way. In the '90s we had trouble with machines coming from the factory with the screws too tight and loosening things up solved the issue. Sometimes we had to put non-conductive washers underneath the mobo because the post to which it was attached was too short.


Yes, back then I had a Power Computing Mac clone that no one could make stable. After hours of staring I realized that the steel chassis was faintly bent, flexing the motherboard and stressing the inferior Ram sockets of the day. A very thin washer in the right place fixed everything.


Tricky-to-reproduce problems are always the hardest. It sounds like you've made good headway in identifying temperature as an exacerbant.

Replacing the motherboard would be a sensible next step.

If you've exhausted all the traditional suggestions for troubleshooting (disassembling and reassembling all components, re-seating RAM, CPU, etc.), try this:

Get a thermal imaging camera (if available) and a can of cold spray (sometimes referred to colloquially as "liquid nitrogen"). Cool sections of the boards at a time, and see if you can isolate which area causes the lockup (e.g. something near power-related IC's for USB?). The camera isn't critical, but might help you envision where temperature changes most rapidly and achieve better granularity as to which components you thermally stress in any given test.

Borrowing a PSU from a friend and repeating the cold test might also be enlightening.

Good luck, and let us know how it turns out!


If you treat a circuit as a 2D surface, yes. But the cold spray won't evenly change the temperature of that 1000uf electrolytic capacitor. It's a good start though.

Side note: A vendor my company uses outright rejects bug reports that aren't consistently reproducible. Very annoying.

We waste a lot of time trying to find a pattern to the issue, but can't always do so.


I've worked with a product manager that would reject very legitimate bug reports just because they came from me. Coming from a developer's mindset, I would make very detailed bug reports just like I would dream to receive. The product manager told my direct manager that I was trying to show off and make his team look bad. So then it became me writing up the bug report, but my manager would put his name on it and the product manager complaining our department was out to get him.


I hate it when they do that. Seen it from the inside, and it's a blatant cop-out. When a company shows no interest in tracking down and correcting their botched work that's a big red flag for me to be on the lookout for another vendor.

Imagine if Boeing said "those plane crashes are intermittent, we won't work this problem until you've consistently reproduced it."


Two distinct possibilities:

1. Conductors with different thermal expansion coefficients and/or cold solder joint/s. BGA chips (ie graphics) especially. Reballing BGAs is no simple procedure to prevent popcorning, thermal damage and cold joints.

2. Condensation - I have an A1278 non-Retina MBP that I'm donating that, for 5 years now, refuses to recognize the boot drive if it's warmed up too quickly. I suspect corrosion resistance and/or a short from condensation somewhere along the SATA path or a signal feeding a chip that provides it. I've tried using a pencil eraser on the male and female SATA connector contacts to no avail. I bet Louis Rossmann could fix it for $180-350 but I already bought a Lenovo T480.


For your A1278, have you swapped the SATA cable? It’s such a common point of failure that we keep them in stock at our repair shops. A new cable is pretty cheap (~$20, proprietary, of course.) I find that many of these “intermittent boot failure” scenarios can be fixed just by replacing the cable.

Of course, I would also take this time to upgrade to a SSD if you haven’t already. :)


> Reballing BGAs is no simple procedure

Anyone remember the Xbox 360 Towel Trick? The idea was to wrap an Xbox with a specific error code in a towel, allowing it to overheat to the point that it resoldered a bad connection.


No, you could never resolder anything like that. What this did was warm the motherboard to the point where it was able to bend and release the stress accumulated from thermal cycling, _temporarily_ fixing broken solder joints.


My Sony PS3 3D Display has the well-known problem where its screen shuts off for a few seconds every once in a while. I noticed it got significantly worse this winter, when I opened the window for some fresh air.

Did a few tests, and sure enough, if the room temperature is below 22C and the screen has been on for about 10 minutes, then it starts to shut off frequently. My guess is the ambient temperature is causing some metal component to contract and break a connection.


Screens and audio devices shutting off periodically is a typical sign of clock drift.

It's where a graphics card is outputting 60Hz and the screen is expecting to receive 60Hz, but one of them is slightly off (60.0001Hz).

At some point the sending and receiving get too far misaligned, some error condition is triggered, and the whole thing restarts.

It's a design flaw really - you should never recreate someone else's clock signal.

There are some times it's impossible to avoid - for example a picture in picture mode has to synchronise with two other people's clock signals for each of the incoming images.

In your case, temperature will affect the speed of the oscillator making it happen more frequently unless you have the ideal temperature.
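
To put rough numbers on the drift described above: with two free-running clocks, the offset grows by the frequency difference every second, so a one-frame slip takes about 1 / |f1 - f2| seconds. A small Python sketch of that arithmetic follows; the function name and the one-frame tolerance are assumptions for illustration, not how any particular display behaves.

```python
def seconds_until_slip(f_source_hz: float, f_sink_hz: float,
                       tolerance_frames: float = 1.0) -> float:
    """Time until source and sink are misaligned by `tolerance_frames` frames.

    Assumes both clocks run at a constant rate and nothing resynchronises them.
    """
    delta_hz = abs(f_source_hz - f_sink_hz)
    if delta_hz == 0:
        return float("inf")  # perfectly matched clocks never slip
    return tolerance_frames / delta_hz


# 60 Hz source against a 60.0001 Hz sink: roughly 10,000 s, a bit under 3 hours.
print(seconds_until_slip(60.0, 60.0001))
# If cold shifts the oscillator further, e.g. to 60.01 Hz, the slip comes
# about every 100 s.
print(seconds_until_slip(60.0, 60.01))
```

This is why a small temperature-induced change in oscillator frequency can turn an occasional dropout into a frequent one.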


> My guess is the ambient temperature is causing some metal component to contract and break a connection.

That failure mode is less likely to cause a "well-known problem" and would show up in other circumstances (not just temperature).

I thought maybe capacitance, but it looks as though they are designed to be fairly flat deltas around 20°C:

https://www.murata.com/products/emiconfun/capacitor/2012/10/...


No, it's your tab connectors at the edge of the screen pulling away - those aren't soldered, they're literally glued on and the glue is very temperature-sensitive.

Source: I used to do LCD screen repair for Philips, LG, and Sony. Next to TCON board problems, the most common issue was panel edge IC glue failure.


I would start by checking whether all the capacitors are in good condition (i.e. not bulging). Then maybe search for a bad solder joint.


It could be humidity. It’s very dry when it’s cold outside, heater kicks on, and indoor humidity plunges even further. The colder it is outside, the larger the differential to indoor humidity.


I, no kidding, once had a PC that booted up better if I opened up the side and pointed a space heater directly into it.

On cold starts it would boot, then lose all power seconds later, boot again, and it would work for a few seconds longer than before, rinse and repeat 5 times or so and it would finally stay booted. After surmising that it was temperature related, I decided to try applying extra heat. It worked. After it was running for 2 or 3 minutes I would turn the heater off.

I never identified what component was causing this.


I don't know if anyone has suggested this but low temps can cause condensation on the board if you live in a high humidity area. This could have the effect of shorting the board.


I know you said "all the fans were spinning" but my power supply fan starts hitting its own frame when it gets too cold. Did you check that fan? I forgot to check that one myself the first time I diagnosed it. In my case it's just super loud but doesn't stop the fan or cause a shutdown.

If that's it, loosening the screws holding the fan in place just a little bit worked for me.


What surprises me is that his writing seems to imply that the computer freezes but then manages to recover and continue as if nothing happened. I’d have expected it to lock up forever and/or crash & reboot.

Would be interesting to see system logs (if any) after such an incident - they could contain some clues as to which parts go offline (he mentioned USB devices going off) during the problem.


This was (is) a lack of clarity in my entry. When the system locks up, it only recovers by rebooting through the BIOS; it doesn't resume operation from some suspension. System logs cut off abruptly at the time of the hang, with nothing abnormal even a few seconds before the time and no kernel messages sent out through netconsole (I don't have a serial console available).

(I'm the author of the linked-to entry.)


People are talking about freezer spray, but really you need to substitute parts, first, until you narrow it down. PSU, RAM, GPU, motherboard, CPU. Start with RAM -- it's easiest and most likely, and you can get along on half for a while, just to see.

RAM, PSUs and motherboards are remarkably cheap to replace.


Professional overclockers usually use liquid nitrogen (LN2) to cool down their CPUs to <0C, but significantly below this point the thermal paste between the CPU and its heatspreader will snap [1]. If this system ever saw extreme cold, like leaving it in a car overnight in the winter, or being transported in <0C plane/truck storage, that might be a possible explanation.

It looks like you have the i7-8700K and a suitable motherboard/CPU cooler for overclocking, and from what I've read, the stock thermal paste isn't enough for it which is why Intel switched to soldered IHS's with the 9700K. I'd try replacing the thermal paste if I were you, here's a video showing how to delid it [2]. Thermal Grizzly Kryonaut is the best paste in my opinion, and it's super cheap.

If this doesn't work, you could try posting your story to the HWBot forums [3] and see if anyone has any ideas.

[1] https://www.tomshardware.com/reviews/splave-overclocking-wor...

[2] https://www.youtube.com/watch?v=2ixxYDMFR24

[3] https://community.hwbot.org/


When you say the power was cut to your USB keyboard, could it perhaps have been reset and left in an unenumerated state? If you have a multimeter it could be useful to check the various power rails in the bad state.

Try running Memtest86 and lower the temp while it runs.


When it's locked up, measure the voltage on the grey wire of the ATX power connector (while plugged in).

It should be 5 volts.

If it's 0 volts, your issue is in the power supply.

If it's 5 volts, your issue could be in either the power supply or motherboard.


I had a GFX card that would only work when warm (power-on PC, black screen, wait 30 seconds, power-off PC, power-on PC, PC boots). Failing to wait long enough the PC wouldn't boot. PC booted fine with another GFX card. Never tried debugging further.

My wife had a computer (before we met) that would boot only when warmed up using a hair dryer !


I see some people have already suggested a dry solder joint. It's not that simple to find, but try looking with a strong light and a magnifying glass. The freeze spray sounds like a great idea as well.


It'll be a bad solder joint on a BGA chip - no way to see those without an x-ray really.


I had a webcam that became faulty when it got cold. We suspected that the rubber connector tightened in the cold, but couldn't quite figure out why.


The interesting thing about rubber is that it loosens up when it's cold, and shrinks when heated


Do not use a cold-spray to emulate this. Because when you have a part that is significantly colder than ambient temperature, condensation is very likely to occur.


Sounds like a capacitor acting up IMHO. This can be tested by cooling the capacitors with an air can held upside down.


It can be much simpler than that. Maybe an I/O pin is left floating by accident (neither pulled up nor down) and below a certain temperature it gets toggled and brings the machine down. We actually had this exact problem, but it was the other way around: the floating pin was pulled the wrong way once a certain temperature was exceeded.


This kind of problem almost made me go crazy once when developing a simple encoder routine. It is the absolute worst.

"sometimes it does work, so a hardware fault is very unlikely and why isn't my software working if the sun is shining?"

I reimplemented these routines countless times...

Seriously, I was slowly beginning to think to look for a new career. Already getting angry again just thinking about it.


Not temperature, but I once had a tiny speck of solder bridge two address lines together on an EEPROM. Drove me nuts figuring out why my attempts to program it kept failing after installing a JTAG when everything else I read worked fine.


Sooooo you decide to log a call ...

Hardware description, OS (and version), software installed, etc etc before we even start to think about it.

Bugger that: cost your time at say £20 per hour (I'm thinking of a reasonably good techy take home in a reasonably rich economy). Now do a cost/benefit analysis ..... buy a new motherboard, fit it and delete the blog post.

Don't forget that the ambient temperature may also correlate with say humidity or some other parameter. Buy a new mobo and delete the post and move on unless you are prepared to really go to town with a decent investigation 8)


You must have fun hobbies.



