It's absurd that this is still going on six months after the story first broke and we're really none the wiser. With estimates of 10-25% of desktop CPUs impacted, it seems likely all of the affected CPUs are going to fail eventually (including mine). They can't even recall and replace products yet because the root cause isn't known. I sure hope Intel isn't just hiding the cause when it's been known all along, because that is going to turn into big lawsuits across the world.
Completely agree. The lack of clarity around all this is hardly confidence inspiring. Definitely seems like a good time to be considering AMD or Qualcomm.
I've been using AMD since 2004. My first AMD processor was the Athlon 64 3000+; I was a kid and wasn't really allowed anything too expensive. We had predominantly used Intel up to that point, but when 64-bit CPUs hit, it was a revolutionary thing.
The roughest era of AMD CPUs was the FX era. While it was comparable to its mid-range competition, it was also a surefire way to burn down your house with its power draw.
Ryzen was a huge step forward in CPU design and architecture.
I see this era as Intel's FX era; if they have the right leadership in place, they can turn the boat around and innovate.
>Ryzen was a huge step forward in CPU design and architecture.
First-gen Ryzen was kinda mediocre. Second gen (correction: meaning Zen 2, not the Ryzen 2000 series, which was still Zen 1) was where the performance came.
Also, let's not ignore how they screwed consumers like me by dropping software support for Vega in 2023 while still selling laptops with Vega-powered APUs on the shelves all the way into the present day in 2024, or by having a naming scheme that's intentionally confusing to mislead consumers, where you can't tell whether that Ryzen 7000 laptop APU has Zen 2, Zen 3, Zen 3+ or Zen 4 CPU cores, whether it's 4 nm, 5 nm, 6 nm or 7 nm, or whether it's running RDNA2, RDNA3 or the now-obsolete Vega in a modern system.[1] Maddening.
Despite that, I'm a returning AMD customer to avoid Intel, but I'm having my own issues now with their iGPU drivers, making me regret not going Intel this time around. The grass isn't always greener across the fence; there are just different issues.
I get it, you're an AMD fan, but let's be objective and not ignore their stinkers and anti-consumer practices, which they had plenty of. They only played nice for a while to get sympathy because they were the struggling underdog, but they didn't hesitate to milk and deceive consumers the moment they got back on top, like any other for-profit company with a moment of market dominance.
My point being: don't get attached or loyal to any large company, since you're just a dollar sign to all of them. Be an informed consumer and make purchasing decisions on objective, current factors, not blind brand loyalty from the distant past.
>AMD FX is a series of high-end AMD microprocessors for personal computers which debuted in 2011, claimed as AMD's first native 8-core desktop processor.[1] The line was introduced with the Bulldozer microarchitecture at launch (codenamed "Zambezi"), and was then succeeded by its derivative Piledriver in 2012 (codenamed "Vishera").
Tom's Hardware posted a retraction over a year later admitting the motherboard was at fault and that the test was proposed and designed by Intel (including picking the motherboard vendors) as part of their Pentium 4 promotion drive.
Same as the Pentium III of the same era: thermal throttling on Socket A was supposed to be implemented by motherboard vendors using the chip's integrated thermal diode. A Pentium III would have burned the same way if put on a motherboard with a non-working thermal cutout.
"Just like AMD's mobile Athlon4 processors, AthlonMP is based on AMD's new 'Palomino'-core, which will also be used in the upcoming AthlonXP processor. This core comes equipped with a thermal diode that is required for Mobile Athlon4's clock throttling abilities. Unfortunately Palomino is still lacking a proper on-die thermal protection logic. A motherboard that doesn't read the thermal diode is unable to protect the new Athlon processor from a heat death. We used a specific Palomino motherboard, Siemens' D1289 with VIA's KT266 chipset."
Intel suggested the Siemens D1289 board for the test; that board didn't have thermal protection. For the Pentium III, Intel suggested (or even delivered) a motherboard with working thermal protection.
>AMD FX is a series of high-end AMD microprocessors for personal computers which debuted in 2011
Ha, well, that's wrong. This is the first time I've found a mistake, or more accurately a contradiction, in Wikipedia.
AMD's first FX CPU (the FX-51) came out in 2003 as a premium Athlon 64 that was an expensive, power-hungry beast, which is the one I assume the GP was talking about. Here, also from Wikipedia:
"The Athlon 64 FX is positioned as a hardware enthusiast product, marketed by AMD especially toward gamers. Unlike the standard Athlon 64, all of the Athlon 64 FX processors have their multipliers completely unlocked."
It's not contradictory. The "FX" you're talking about is used as "Athlon FX"[1], whereas the "FX" in the article is "AMD FX"[2]. The branding might be a bit confusing, but the article isn't wrong.
> First gen Ryzen was mediocre. Second gen was where the performance came.
Are you sure? I just looked at Ryzen 5 1600 vs 2600 benchmarks and the difference is around 5%. And I also remember the hype when the first generation was released. I think Ryzen gen 1 was by far the largest step.
Both of you forget that for the longest time, Intel consumer chips excluded virtualization and other features that first-generation Ryzen made available. First gen was a huge win in functionality for consumers even if it didn't hit the same performance as Intel. (AVX-512 wasn't supported on first gen, to be clear, but there were other features I forget now; they were also a reason I had stuck with AMD.)
I've used both the Ryzen 3 1200 and the 7 1700, and both seemed fine for their time and price.
Honestly, I had the 1700 in my main PC up until last year; it was still very much OK for most things I might actually want to do, except the lack of ReBAR support pushed me towards a Ryzen 5 4500 (got it for a low price; otherwise slightly better than the 1700 in performance and still good for my needs, though it runs noticeably hotter, even without a big OC).
I guess things are quite different for enthusiasts and power users, but their needs probably don't affect what would be considered bad/mediocre/good for the general population.
I'm sure you will be happy to hear this is a purely artificial limitation introduced by AMD for product-segmentation purposes. The very first Zen generation of Ryzen fully supports ReBAR in hardware, but it's locked out by AMD's BIOS.
Given that I got an Intel Arc A580 for myself, this was pretty important! Quite bad that it wasn't officially supported if there are no hardware issues. I would have liked to just keep using the 1700 for a few more years, but I opted to buy a new CPU so my old one would remain a reasonable backup; path of least resistance in this case.
Would also like to try out the recent Intel CPUs (though surely not the variety that seems to have stability issues), but that's not in the cards for now because most of my PCs and homelab all use AM4, on which I'll stay for the foreseeable future.
I actually like both companies. Intel isn't bad, right now isn't great for them though.
We're all better off with Intel and AMD coexisting. But my gamble is on AMD because I've always liked the hardware's compatibility with a variety of technology: you can easily get server-grade capabilities on consumer-grade parts, which for the longest time wasn't true of Intel. When AMD pulls an Intel, I'll go full Intel. There are huge wins in Intel getting new fabs built in the States, because it means a lot for security and development.
I remember receiving a ridiculously high-RPM fan with my FX-8350 CPU (in the box), which sounded like a vacuum when it ran. It took me less than a week to upgrade to a proper cooler that managed to cool that damn thing at 600 RPM or so, and life was quiet again!
"Evil Inside(tm)" software made sure many of the libraries and compilers had much slower performance on AMD chips for years.
We had to use an Intel CPU/GPU + a CUDA GPU simply because of compatibility requirements (heavy media codecs and ML workloads).
Let's be honest: AMD has technically had a better product for decades if you exclude the power-consumption metric. ARM64 v8 is also good, if and only if you don't need advanced GPU features.
The Ryzen chips are definitely respectable in PassMark's benchmark value rankings. =)
The 3700X and 5700X are 65 W parts specifically made for quiet/cool boxes (they're also 8-core). I have both, since I enjoy my sanity and don't care about 10% extra performance. They are the pick of the litter in my mind. Also have a laptop with a 5850H. Same with their Navi chips: not blazing hot, but good enough, and my boxes are nice and quiet.
I think we've been in the "good enough" computing age for a while now, and only CUDA GPUs and codec ASICs really feature in most desktop upgrade decisions.
Quiet machines are great, especially when you have to sit next to one for 9 hours a day. =3
> AMD technically has had a better product for decades if you exclude the power consumption metric
And single core performance.
And some other stuff which obviously didn’t matter during the period in question but suddenly became very important when AMD surpassed Intel in that regard…
I've picked AMD over Intel too, but I've had so many issues with it that I partly regret it: memory stability issues, extremely long boot times, excessively high voltages, iGPU driver timeouts. Most of the issues have been fixed, but not all. After months of dealing with an annoying memory leak, I've just recently been able to confirm that it is caused by a Zen 4 iGPU driver.
I would never buy an AMD machine again after my last Ryzen 3600X. So many issues. It had to be power cycled 2-3 times to get it to boot. Memory corruption issues and stability issues galore. Not overclocked. Stock configuration. Decent quality board and power supply. Just hell.
Swapped board out assuming it was that. Same problem. Turned out to be the CPU which was a pain in the ass getting a warranty replacement for.
Ended up buying a new open box Intel 12400 Lenovo lump off eBay and using that.
I had similar issues with Zen across a few different generations, and with various boards. As a result, I built a new machine around an Intel 12400 as well. I did have to buy a Thermaltake socket-reinforcement bracket to mitigate the bending issue.
Oddly, this Intel build somewhat restored my faith in humans to build hardware and software as the thing seems to work quite well.
An issue with these parts was that the out-of-the-box config wasn't very good: even if you knew to turn on the XMP profiles, it still threw a ridiculous amount of voltage at the chip in pursuit of a few percent of extra performance.
I don't think there's a lot in it between vendors, to be honest. They are all cheap garbage with lurid chunks of metal and artwork designed by a 5-year-old stuck all over them.
And there's one thing you can NEVER trust and that is objectivity from gamers when looking at failure and reliability statistics. It's one huge cargo cult.
Notably my kids both have Ryzen 5600G + MSI B550 boards with no problems.
I have been using Gigabyte for a very long time and had no problems. ASUS was OK for me too, but MSI boards were the worst due to stability, driver and cooling curve problems. Don’t buy MSI.
The B550 series is a power-reduced, cost-cutting version of the X570 boards. They are only meant for the 6-core versions of the chips, and the 65 W versions. You need to pick your components carefully.
The VRM is the component you need to look at regarding power delivery for the CPU. There are many motherboards that combine a lower-tier chipset with a high-end VRM.
B550 was never that limited, even initially. Even the Ryzen 9 5950X runs on B550-series motherboards today. B550 is a bit scaled down, e.g. the chipset's downstream lanes are PCIe 3.0 rather than 4.0, but that's OK with me.
My motherboard is an ASUS ROG Strix variety with 4x32GB ECC RAM and the Ryzen 9 5950X works just fine.
I built an Intel workstation for the first time in two decades when the 13700K was released. It hasn't been a bed of roses, starting with thermal throttling from the LGA1700 socket bending the IHS so badly that the heatsink only contacted it in a strip down the middle, needing to physically reseat the onboard HDMI for the display signal to resume after the monitor is disconnected, a generally boiling TDP, DDR5 quirks like 5-minute training times (no blame here, just didn't expect my servers to boot faster), and generally having goofier names for UEFI options designed around overclocking. I still don't know how to use XTU.
Couple that with the underwhelming software support for AI/ML on their own hardware for about a year after CPU and GPU launch, and I wish I'd just stuck to AMD.
I don't think either are perfect, but it's the devil you know, and I've grown to trust that even when AMD cocks something up, they'll listen to customers, coordinate engineering efforts with OEMs, and handle it. Intel are either too high and mighty or don't empower their engineers to treat partners like partners without layers of management getting involved to be able to do something similar.
> Couple that with the underwhelming software support for AI/ML on their own hardware for about a year after CPU and GPU launch, and I wish I'd just stuck to AMD.
Seems like a strange way to express that point? Why mention underwhelming support for AI/ML if it’s the same on both? (if we’re talking about desktop chips I don’t even understand what’s that supposed to mean).
Sounds like bad RAM (clean the contacts, re-seat, and test) or temperature issues. (The main reasons we still use a mobile i7-12700H were the cheap 64 GB DDR4 RAM kit, the Iris media GPU drivers, and an RTX CUDA GPU.)
Intel has its own issues; Gigabyte told me to pound sand when I asked them to unlock the BIOS on my own equipment to disable the IME.
There is no greener grass across the fence line... just a different set of issues =3
>Sounds like bad ram (clean contacts, re-seat, and test)
Since he's talking about iGPU issues, he most likely has a laptop APU, so there's no RAM to reseat. I'm also having similar issues on my Ryzen 7000 laptop. I kinda regret upgrading from the Ryzen 5000 laptop, which AMD obsoleted just 2 years after I bought it, as at least that one had no issues. Hopefully new drivers will fix stability in the future, but you never know.
What I do know is that this will most likely be my last AMD machine if Intel shows enough improvement to match AMD, since Intel's Linux driver support is just top notch.
Increasing the VRAM size (UMA size) to 4 GB fixed the frequent driver timeouts for me.
Reverting to an older driver (driver cleaner -> driver v23.11.1) fixed the memory leak. This memory leak is weird, since PoolMon doesn't show anything unusual. Nothing shows as using too much memory anywhere, except that committed memory grows to over 100 GB after a few days of uptime, and RamMap shows a large amount of unused-active memory.
GPUs have the most complex drivers in the whole system (we're talking tens of millions of LOC), so it's absolutely not surprising that you're having issues like that, given how recent AMD's investment in APUs is. I wouldn't use them for a few more years; get a cheap discrete GPU from Nvidia, or maybe even from Intel.
Hm? AMD's investment in APUs is not a new thing; that goes back to the FX days with their FM1 socket. Since Ryzen 1 they have had their G-series APUs, and their integrated graphics power the Steam Deck and many other mobile handhelds. Plus, Intel's integrated graphics are known for their driver issues (and so is Arc, for now), so I'd disagree with that recommendation.
Please post the cpu-z (win) or cpu-x (linux) chip make/model for other users to compare/search.
If there is enough data here, we may be able to see a common key detail emerge, i.e. if the anecdotal problem(s) remain overtly random, then a solution from the community or the OEM may prove impossible.
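For Linux users who don't have cpu-x installed, a rough equivalent of the requested dump can be pulled with standard tools. A minimal sketch (`lscpu` ships with util-linux; `dmidecode` needs root, so it is only shown as a comment):

```shell
# Print the CPU vendor, model name and stepping, roughly matching
# the fields cpu-z/cpu-x report.
lscpu | grep -E '^(Vendor ID|Model name|Stepping)'

# RAM manufacturer/part numbers and speeds (run as root on a real system):
#   dmidecode -t memory | grep -E 'Manufacturer|Part Number|Speed'
```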
I initially got somewhat frequent hangs on Fedora with a Radeon 680M iGPU (in a Ryzen 7 PRO 6850U APU). The hangs stopped when I added amdgpu.dcdebugmask=0x10 to kernel boot options, based on some comments in an AMD Linux driver bug report [1]. That seems to disable panel self-refresh so it would seem to be related to that somehow.
Stability has been fine since. The bug report has since been closed but I haven't tested in a while to see if disabling PSR is still needed or if the issue has actually been fixed.
I haven't seen significant stability issues on Windows, although I don't use it much on the AMD device.
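For anyone wanting to try the same `amdgpu.dcdebugmask=0x10` workaround persistently, here is a minimal sketch of adding it to the kernel command line on a GRUB-based distro. The file contents and the config-regeneration command are assumptions that vary by distro, so the snippet edits a scratch copy rather than the real `/etc/default/grub`:

```shell
# Work on a scratch copy of what /etc/default/grub might contain
cfg=$(mktemp)
echo 'GRUB_CMDLINE_LINUX="rhgb quiet"' > "$cfg"

# Append the amdgpu PSR workaround flag if it is not already present
grep -q 'amdgpu.dcdebugmask' "$cfg" || \
  sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 amdgpu.dcdebugmask=0x10"/' "$cfg"

cat "$cfg"
# On a real system you would then regenerate the config, e.g. on Fedora:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```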
With PSR in the mix, is the system really hanging or is it just failing to update the screen somehow? I.e. can you tell the difference with logs or a remote connection or configure and use an unprompted shutdown via the power button?
It was on Wayland. I'm not sure if I tried with X.
I can't remember the details of it. It effectively hung in the sense that I couldn't get the system into a usable state again locally without rebooting. I'm not sure if the system responded to the power button or not, or whether there was useful log output.
I didn't bother trying with a remote connection since the hang was frequent enough that it wouldn't have been of any use as a workaround anyway. I'd guess switching to another virtual console probably didn't work because I'd probably remember it if it did.
I can try re-enabling PSR and see if the problem is still there if you're interested.
Looks like some of the patches discussed in that bug report work around the problem by disabling PSR-SU for the specific timing controller my display also has. Those patches are in current kernels already. So basically the problem is gone for me, even if I remove the dcdebugmask.
So, I don't really know if the system was fully hanging, or if the display was just unable to update any more, but it was likely exactly the same that happened to other people with Parade TCONs in that bug discussion.
Depends on the failure mode, as it is common for specs to drift around under load (also, temperature cycling stresses the PCB and can shear BGA connections).
I'd try a cheap set of slower, lower-bandwidth/higher-latency RAM sticks to see if it stops glitching. If you are using low-latency sticks (usually recommended with an iGPU), then dropping the performance a bit may stabilize your specific equipment.
We may still be able to use this information to compare with other users' glitches to see if there is some underlying similarity.
Unfortunately, if it is thermal stress/warping on the PCB cracking open RAM BGA balls or shifting traces... one won't really be able to completely identify the intermittent issue.
We were actually looking at buying a similar economy model earlier this year (ended up with a few classic Lenovo models instead)... so please be verbose with the make/model to help future searchers =3
Please dump the problematic CPU/RAM chip model numbers to help other users. These chip manufacturer numbers are not really personally identifiable information, as they are shared between hundreds of thousands of products.
The classic cpu-z for Windows users is here if you don't run *nix:
>X-ray vision like Superman I gather... nice... ;)
That snarkiness is uncalled for. I repasted the laptop, ran benchmarks, and checked the temperature sensors, plus used my FLIR. It's not a thermal issue. It's just AMD iGPU driver bugginess.
Processors Information
-------------------------------------------------------------------------
Socket 1 ID = 0
Number of cores 8 (max 8)
Number of threads 16 (max 16)
Secondary bus # 0
Number of CCDs 1
Manufacturer AuthenticAMD
Name AMD Ryzen 7 7840HS
Codename Phoenix
Specification AMD Ryzen 7 7840HS with Radeon 780M Graphics
Package Socket FP7
CPUID F.4.1
Extended CPUID 19.74
Core Stepping PHX-A1
Technology 4 nm
TDP Limit 54.0 Watts
Tjmax 90.0 °C
Core Speed 2761.5 MHz
Multiplier x Bus Speed 27.71 x 99.6 MHz
Base frequency (cores) 99.6 MHz
Base frequency (mem.) 99.6 MHz
Instructions sets MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AES, AVX, AVX2, AVX512 (DQ, BW, VL, CD, IFMA, VBMI, VBMI2, VNNI, BITALG, VPOPCNTDQ, BF16), FMA3, SHA
Microcode Revision 0xA704104
L1 Data cache 8 x 32 KB (8-way, 64-byte line)
L1 Instruction cache 8 x 32 KB (8-way, 64-byte line)
L2 cache 8 x 1024 KB (8-way, 64-byte line)
L3 cache 16 MB (16-way, 64-byte line)
Preferred cores 2 (#1, #3)
Max CPUID level 0000000Dh
Max CPUID ext. level 80000026h
FID/VID Control yes
# of P-States 3
P-State FID 0x898 - VID 0xBF (38.00x - 1.194 V)
P-State FID 0x858 - VID 0xAB (22.00x - 1.069 V)
P-State FID 0xA50 - VID 0x97 (16.00x - 0.944 V)
PStateReg 0x80000000-0x49AFC898
PStateReg 0x80000000-0x45AAC858
PStateReg 0x80000000-0x4425CA50
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
Package Type 0x4
Model 00
String 1 0x0
String 2 0x0
Page 0x0
Power Unit 0.0
SMU Version 76.73.00
TDP/TJMAX 0x36005A
TCTL Offset 0x0
PMTV 004C0008
Package Power Tracking (PPT) 54.0 W (current)
Package Power Limit #1 (long) 35.0 W
Package Power Limit #2 (short) 25.0 W
DMI Physical Memory Array
location Motherboard
usage System Memory
correction None
max capacity 64 GB
max# of devices 4
DMI Memory Device
designation DIMM 0
format Row of chips
type LPDDR5
total width 32 bits
data width 32 bits
size 8 GB
speed 6400 MHz
manufacturer Micron Technology
part number MT62F2G32D4DS-026 WT
serial number 00000000
voltage 0.500000
manufacturer id 0x2C80
product id 0x0
Display adapter 0 (primary)
ID 0x2180003
Name AMD Radeon™ 780M
Board Manufacturer Lenovo
Codename Phoenix
Cores 768
ROP Units 16
Technology 4 nm
Memory size 1024 MB
Current Link Width x16
Current Link Speed 16.0 GT/s
PCI device bus 99 (0x63), device 0 (0x0), function 0 (0x0)
Vendor ID 0x1002 (0x17AA)
Model ID 0x15BF (0x3819)
Revision ID 0xC7
Root device bus 0 (0x0), device 8 (0x8), function 1 (0x1)
Performance Level Current
Core clock 800.0 MHz
Shader clock 400.0 MHz
Memory clock 800.0 MHz
Driver version 32.0.11021.1011
WDDM Model 3.1
People use special X-ray machines to inspect the BGA solder bonds to the PCB underneath chips. These chips may also be additionally glued down to the PCB on higher-end equipment (impossible to visually inspect).
Note that BGA chips were never initially intended to be larger than 20 mm wide, and they can still put enormous shear forces on the contact bonds as the solder solidifies post-reflow (and the bimetallic cantilever the PCB forms starts to pop back). A certain percentage of products will thus fail when they warm up, as the PCB will locally heat/warp the area again and foobar a few random connections in the process. A low-heat paint-stripper heat gun might be able to replicate the crash to eliminate this theory, and you might be able to RMA the board/chips if you are still under warranty.
Could indeed also just be software as some suspect, but that is a lot harder to find in kernel drivers.
It is hard to read people's emotions online, but do assume that if someone is reaching out to help, they probably think you are worth respecting too.
Thanks for posting data other users may find useful, and have a wonderful day. =3
No, it's not a solder/BGA issue, because RAM/CPU/GPU stress benchmarks would cause some instability, but that's never the case. The instability only manifests when running video decode tasks (watching YouTube) in the browser while browsing websites in parallel, meaning it's 99% sure an iGPU driver issue.
I've been staunchly an Intel stan since Pentium 4s were cool, and this year will be my first AMD build. We've already been using their server hardware at the office and I'm not disappointed at all.
No particular straw broke the camel's back, they just haven't managed to justify their price premium in a very long time.
> It's even worse with their p vs e cores that creates issues in many games.
Didn't AMD also have issues with different cache types/sizes on dual-CCD chips? Meaning that it basically didn't make any sense to buy anything more expensive than the 7800X3D if you only care about gaming.
Did you use AM5? It was hardly without issues, with users experiencing 30+ second POST times. I'm not even sure that's fixed yet with most motherboards OOTB.
The long POST times are a consequence of DDR5 link training, so it's not an entirely AMD-specific problem. Most motherboards for either Intel or AMD now have a feature (often called Memory Context Restore on AM5 boards) to skip most of the link training if it doesn't look like there's been a hardware change since the last boot, but it's unavoidable on the first boot.
30 seconds would be a blessed relief. Link training takes upwards of 3 minutes for my 4x 32GiB DDR5 machine whenever I update its firmware, and then 3 minutes all over again when I load the XMP profile instead of running at the new firmware's safe stock 4000 MT/s.
Just last week I bought a new desktop for a family member. I was considering an Intel CPU, but at the last minute found a better and cheaper option with an AMD processor. Am I glad I dodged that bullet.
I've been firmly in camp Intel ever since my third Athlon XP 2200+ burned itself out (and finally took the motherboard socket with it on its way to the melty grave) back around 2005 (Intel CPUs had thermal protection at the time; AMD CPUs did not).
This fiasco has me convinced that I will not be building an Intel system again, and I haven't even (yet) had any problems with either of the Z790 i7-14700K systems I put together in March.
Being in any camp is just a bad way to approach things.
I go with AMD because they make the best desktop CPUs right now. When Intel gets their shit together (hopefully) and the pendulum swings back I'll go with them.
These are all corporations at the end of the day. They get better and worse over time. They certainly never stay the same forever, and none of them deserves any kind of brand loyalty.
Well, maybe, but Intel is historically much scummier than AMD ever was: the ICC benchmark cheating, the OEM bribery debacle, ECC only for Xeon, constant socket changes, shitty non-solder TIM for a long time, etc...
Remember when Nvidia used to have better open-source drivers than ATI? The company that's in the lead always acts the shittiest, because they can, because they already have the lead. Don't tie yourself to one company.
Unknown to me at the time (I wasn't even an adult yet) was that the heatsink was the wrong model, so it wasn't making good contact and doing its job properly. This wouldn't have been a problem if the CPU had thermal throttled; that would have immediately caused a performance problem I would have been curious about. No, instead they just died, over the course of about 2 years. Changing platform wasn't my choice; my mother bought a new desktop and it had a Pentium. I don't remember the model.
I think there's a real concern that Xeon E-2400 may be failing at this point too. It's an open question whether Emerald Rapids might have the same issues (and EMR has a mesh, not a ring, so this is an interesting question for diagnosing the cause!), but W-2400 and W-3500 still use Golden Cove.
The leading theory at this point is not really voltage or current related but actually a defect in the coatings that is allowing oxidation in the vias. Manufacturing defect.
It affects even 35 W SKUs like the 13700T, so it's really not the snide/trite "too much power" thing. Like, bro, Zen boosts to 6 GHz now too and nobody says a word. And believe it or not, if you look at the power consumption, both of them are probably fairly comparable in core power: both brands are consistently boosting to around 25-30 W single-thread under load. AMD's highest voltages occur during these same single-core boost loads, which are the ones of concern at this point. If it is just voltage that's killing these 35 W chips, well, AMD is playing in the exact same voltage/current domains.
Furthermore, if it were power, it wouldn't be a problem limited to 10-25% of the silicon; it would be all of them.
There was a specific problem with partners implementing eTVB wrong, and that was rectified. The remaining problem is actually pretty complex and potentially there are multiple overlapping issues.
It has just become a lightning rod for people who are generally dissatisfied with Intel, and people are lumping their random "it doesn't keep up with X3D efficiency!" complaints into one big bucket. But Intel actually isn't all that far off the non-X3D SKUs in terms of efficiency, especially in non-AVX workloads. "140 W instead of 115 W for gaming" is pretty typical of the results, and that's not "burn my processor out" level bad. The 13900K has always been silly, but the 13700K is like... fine?
(Granted, this may be launch BIOS behavior, and it sounds like part of the problem is that partners have been tinkering over time and it's gotten worse and worse... I'm dubious these numbers are the same as you'd get today, but they are in fact pretty consistent across a whole bunch of reviewers; ctrl-f "CPU consumption" and the gaming and non-AVX power numbers are in broadly unconcerning ranges. 57-170 W is, broadly speaking, fine.)
Again, even if there is a power/current issue, at the end of the day it's going to have a specific cause and diagnosis attached to it, like AMD's issue was VSOC being too high. Saying "too much power" is like writing "died of a broken heart" on a death certificate: maybe that's true, but it's not really a medical diagnosis. Some voltage on some rail is damaging some thing, or some thermal limit is being exceeded unintentionally, and that is causing electromigration, or something.
You might as well just come out and say it: intel's hubris displeased the gods, they tempted fate and this is their divine punishment. That's really what people are trying to say. Right? Don't dress it up in un-deserved technical window-dressing.
Hard to tell. My workplace is currently running on nothing but 13th-gen i5 HP EliteBooks. We haven't had any issues, but then I suspect these would all be running CPUs from the same batch, possibly even the same wafer.
No, Intel has said nothing at all (that's part of the problem), and 10-25% are the numbers from Wendell and GN, who have been investigating the issue in various companies' prod environments and event logs.
I hear a lot of anecdotes and noise from YouTubers around this but little to no actual data or analysis. I am a skeptic until I see concrete data. That covers both the mobile and desktop issues.
Observations so far are limited to:
I have seen actual evidence that some W680 boards have been shipping with an unlimited power profile, which will toast a CPU fairly quickly. As to whose fault that is, and whether this correlates with or is causal to the rest of the reports, I don't know.
My own Asus B760M board shipped with an unlimited power profile. I had to switch it to “Intel Default”. This machine has been under heavy load with no issues so far.
When I have done research, I have only found people reporting this on custom-built systems or lowball "servers". I haven't found any viable big-brand system failure reports yet (Dell/HP/Lenovo etc.). While some of this might be statistical failures, I'd like to see configuration eliminated from the data as a cause first.
I think it would be rather nice at this point if Intel produced their own desktop boards again, with their own tested BIOS, so we'd have a viable reference system to compare against, rather than the usual ugly junk-shifter outfits or big brands. A fully vertically integrated component PC would be a nice thing to have again. They just worked!
Not fully confirmed yet, but that sounds really bad. It seems like it also hits low power models like the 13900T, which would imply this isn't just a voltage issue from auto overclocking.
They were boring in every single way: They weren't flashy, they weren't expensive, they didn't have weird features, and they were ridiculously stable.
I didn't ever buy any of them for myself because I like to tinker with stuff, but I sold a bunch of them to people who simply wanted a computer that just worked.
That makes sense if the target market is overclockers. They want to be able to override everything for a high score if they want to. My board (ASUS TRX50) has all kinds of override settings for fan speeds, voltages, TDP (whatever that does!) and a warning not to mess with them if you don't know what you're doing.
Yes unfortunately. When you buy "enthusiast boards" which is everything that Dell and HP etc don't ship these days then you have literally no idea what crappy BIOS and software configuration you are inheriting.
that is part of the problem: W680 is not the same thing as C266 (and even C266 might be affected; Wendell is sounding concerned about the E-2400 platform too). W680 is still a consumer-socket product, it's just one that supports ECC. Like yes, people run those in a datacenter and that's fine and normal and supported - some customers want high single-threaded performance, and the big server chips just aren't as good at that. One of the affected customers is Citadel, which is unsurprising if you think about it (HFT).
this also means you get fun stuff like 13700T sometimes being run without power limits... but even within power limits they've seen 13700T degrading too, which is kind of a point against the whole "their hubris and power consumption angered the gods" thesis. If 35W is too much power, we're all cooked.
But it's hard to say, since nothing is being run within spec and you have to bend over backwards to get "stock" behavior, etc. Buildzoid has elaborated and clarified on this (after a couple of initial videos that were working from incomplete info). And yeah, that's a whole shitshow too: partners were severely breaking the spec in a whole bunch of places, both departing further from the spec in ways that could cause problems and performing a factory undervolt out of the box that isn't necessarily stable, and this has gotten more and more out-of-spec over time (both the undervolting and the loadline). Also, the "Intel Baseline Profile" and "Intel Failsafe Profile" apparently did not come from Intel; those were made up by Gigabyte and MSI, while the "Intel Default" profile did. Great stuff, you love to see it. /s
But there just has to be a reason that only 10-25% of samples are affected and if it's just generically power or current you should see it everywhere. Hence why board config is/was a concern, and why GN is now kinda pointing the finger at this "contamination/oxidation of the vias" fab problem theory.
I used to say, 'Never bet against Intel' but the last 5-10 years or so have not been kind to them. They have been kicking out the supports in the name of efficiency and we are seeing the impacts of this now.
Same issue that is plaguing Boeing. MBA is now a swear word.
I think they have a chance to escape the Boeing destiny, though? With Gelsinger the "technical reign" returned to the company, if I understand correctly?
A few years ago, if you said you bought AMD, people would think you were hallucinating, but now it looks like the only reliable vendor for x64. Intel was once the king of reliability, but in recent years it has looked like the king of bugs.
That was around 6-7 years ago at this point. Personally, every AMD machine I've had since then has been very stable on Linux with an Nvidia GPU. My latest one, an Intel + Nvidia, had issues under virtually every Linux distro I tried.
Now that I don't need CUDA anymore I might consider going full team red.
Ah but only due to a broad range of hardware and software issues, not because of the same hardware issue killing the desktop equivalents, so that's good news.
Based on Intel's behavior so far and the previous comment by Alderon Games' founder, I'm not sure why you're so willing to believe them at face value.
> "The laptops crash in the exact same way as the desktop parts including workloads under Unreal Engine, decompression, ycruncher or similar. Laptop chips we have seen failing include but not limited to 13900HX etc.," Cassells said.
> "Intel seems to be down playing the issues here most likely due to the expensive costs related to BGA rework and possible harm to OEMs and Partners," he continued. "We have seen these crashes on Razer, MSI, Asus Laptops and similar used by developers in our studio to work on the game. The crash reporting data for my game shows a huge amount of laptops that could be having issues."
When your processor is cooking itself to death, all bets are off. We have seen some of them in our data center over the years, albeit very rarely.
Interestingly, a modern processor is very resilient against losing its functional blocks during operation. While this is a boon, diagnosing these problems is a bit too complicated for the inexperienced.
An x86 processor can detect when it makes a serious error in some pipelines and rerun these steps until things go right. This is the first line of recovery (this is why temperature spikes start to happen when a CPU reaches its overclocking limits. It starts to make mistakes and this mechanism kicks in).
Also x86 has something called “machine check architecture” which constantly monitors the system and the CPU and throws “Machine Check Exceptions” when something goes very wrong.
These exceptions divide into “recoverable” and “unrecoverable” exceptions. An unrecoverable exception generally triggers a kernel panic, and recoverable ones are logged in system logs.
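On Linux, those recoverable events land in the kernel log, where tools like mcelog or rasdaemon decode them. As a toy illustration (the exact log format varies across kernel versions, so this regex is only a sketch, not a robust parser), here is how you might pull the CPU, bank, and raw status word out of one logged line:

```python
import re

# Illustrative only: MCE kernel log formats differ between kernel versions,
# so treat this pattern as a sketch rather than a parser for every distro.
MCE_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check(?: Exception)?: "
    r"(?P<mcg>\S+) Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)

def parse_mce(line: str):
    """Extract CPU number, MC bank, and raw status word from one log line."""
    m = MCE_RE.search(line)
    if not m:
        return None
    return {"cpu": int(m["cpu"]),
            "bank": int(m["bank"]),
            "status": int(m["status"], 16)}

sample = "mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108"
print(parse_mce(sample))
```

The raw status word then has to be decoded against the vendor's machine-check documentation to tell you which unit actually misbehaved.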
Moreover, a CPU can lose (fry) some caches (e.g. half of L1), and it’ll boot with whatever is available and report what it can access and address. In some extreme cases, it loses its FPU or vector units, and instead of getting upset, it tries to do the operations at the microcode level or with whatever units remain. This manifests as extremely low LINPACK numbers. We had a couple of these; I didn’t run accuracy tests on those specimens, and LINPACK didn’t flag anything about the results. The performance was just very low compared to normal processors.
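The crude way to catch a unit that computes wrongly without crashing is a consistency check: run the same FP-heavy workload twice and compare, which is roughly what stress testers like prime95 or y-cruncher do. A minimal Python sketch of the idea (the workload itself is arbitrary and entirely made up for illustration):

```python
import math

def workload(n=50_000, seed=12345.0):
    """Deterministic FP-heavy loop; healthy hardware returns the same bits every run."""
    x = seed
    acc = 0.0
    for i in range(1, n):
        # Cheap pseudo-random walk plus transcendental ops to exercise the FPU.
        x = math.fmod(x * 1103515245.0 + 12345.0, 2.0 ** 31)
        acc += math.sin(x) * math.sqrt(i)
    return acc

# A mismatch here would point at silent compute errors rather than a software bug.
assert workload() == workload()
```

Real stress testers additionally compare against a precomputed reference result, so a unit that is consistently wrong still gets caught.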
Throttling is a normal defense against poor cooling. The mechanisms above try to keep the processor operational in limp mode, so you can diagnose and migrate somehow.
Actually it has accumulated over years. First being interested in hardware itself, and following the overclocking scene (and doing my own experiments), then my job as an HPC administrator allowed me to touch a lot of systems. Trying to drive them to max performance without damaging them resulted in seeing lots of edge cases over the years.
On top of that, I was always interested in high performance / efficient computing and did my M.Sc. and Ph.D. in related subjects.
It's not impossible to gather this knowledge, but it's a lot of rabbit holes which are a bit hard to find sometimes.
Thanks. Do you think the M.Sc. and Ph.D. helped a lot? I don't have any experience in this field and feel that this is probably one of the domains that people HAVE to rely mostly on vendor manuals and very low level debugging messages. Maybe at the same level of AAA game engine optimization?
Yes, they helped a lot, but because I was already interested in high performance programming and was looking to improve myself on these fronts. Also, I started my job right after my B.Sc., so there was a positive feedback loop between my work and research (I fed my work with research and fed my research with the know-how from my job, which was encouraged and required by the place I work).
You need to know a lot of things to do this. It's half dark art and half science, really. Vendors do not always tell the full story about their hardware (Intel's infamous AVX frequency, and their compilers' shenanigans when they detect an AMD CPU), and you need to be able to bend the compiler to your will to build the binary the way you want. Lastly, of course, you need to know what you're doing with your code and understand how it translates to assembly and what your target CPU does with all of it.
To be able to understand that kind of details, we have valgrind/callgrind, perf, software traces, some vendor specific low-level tools to see what processor is doing, and pure timing-related logging.
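Even a language's built-in profiler is often enough to see where the time goes before reaching for the heavier tools; a minimal sketch with Python's cProfile, where `hot_loop` is just a hypothetical stand-in workload:

```python
import cProfile
import io
import pstats

def hot_loop(n):
    """A deliberately compute-heavy function that should dominate the profile."""
    return sum(i * i for i in range(n))

def run_profiled():
    """Profile one call to hot_loop and return (result, text report)."""
    pr = cProfile.Profile()
    pr.enable()
    result = hot_loop(200_000)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(3)
    return result, buf.getvalue()

result, report = run_profiled()
print(report)  # hot_loop should appear near the top, sorted by cumulative time
```

The same workflow scales up: once the profile names the choke point, that's where valgrind/callgrind or perf annotations become worth the setup cost.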
Game engines are a different beast; I do scientific software, but a friend of mine was doing game engines. Highly optimized graphics drivers are black boxes, and that's a whole different game. They're not very well documented, are riddled with trade secrets, and are full of undocumented behaviors that drivers use to optimize stuff. Plus, you have to use the driver in very complex and sometimes ugly ways to make it perform.
While this is hard to start and looks like a big mountain, all of this gets way easier when you develop a "feeling for the machine". It's similar to how mechanics listen to an engine and say "it's spark plug 3, most probably". You can feel how a program runs and where it chokes just by observing how it runs.
This is why C/C++ is also used in a lot of low-level contexts. Yes, it allows you to do some very dangerous things, but if you need to do things fast, and you can prove mathematically that the dangerous thing can't happen, you can unlock some more potential from your system. The people doing this are very few, and the people who do it recklessly (or just code carelessly) give C/C++ a bad name.
It's not impossible. If Carmack, Unreal, scientific software companies, Pixar, Blender and more are able to do it, you can do it, too.
> To be able to understand that kind of details, we have valgrind/callgrind, perf, software traces, some vendor specific low-level tools to see what processor is doing, and pure timing-related logging.
Working as a data warehouse engineer, I'm not exposed to these kinds of things. Our upstream team, the streaming guys, does have a bit of exposure to performance related to Spark.
> It's similar to how mechanics listen to an engine and say "it's spark plug 3, most probably". You can feel how a program runs and where it chokes just by observing how it runs.
> It's not impossible. If Carmack, Unreal, scientific software companies, Pixar, Blender and more are able to do it, you can do it, too.
I kind of feel that I have to switch jobs to learn these kinds of things. I do have some personal projects, but they don't need that kind of attention. I'll see what I can do. I have always wanted to move away from data engineering anyway.
I’m not sure if that’s what they meant, but generally, CPUs will throttle or shut down if they detect overtemp, hopefully before they start encountering errors which lead to wrong calculation results or crashes.
ECC RAM would probably have helped, but that got axed in consumer CPUs, likely as a financial optimization: they wanted to keep ECC as an upsell feature for 'server grade' products.
DDR5 (and any other DRAM build on the latest fab processes for DRAM) has on-die ECC which provides some protection against corruption of data at rest. It's necessary because the density of the memory array is too high; there's not enough isolation between memory cells and not enough charge stored in each cell to ensure sufficiently low error rates without adding the on-die ECC. A typical DDR5 chip might be less susceptible to random bit-flips or rowhammer than a typical DDR4 chip, but the on-die ECC is really only intended to prevent a major regression in reliability.
What ordinary consumer DDR5 modules still lack is any form of ECC on the link between the DRAM and the CPU's memory controller. With the link running at about twice the speed used by DDR4, DDR5 is much more challenging for the memory controller/PHY to handle.
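For intuition about what single-error correction buys you: the code recomputes parity over overlapping bit groups, and the resulting syndrome points directly at the flipped bit. DDR5's on-die code is much wider (commonly described as 8 check bits per 128 data bits), but a toy Hamming(7,4) sketch in Python shows the principle:

```python
def hamming74_encode(data):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Recompute parity; a nonzero syndrome is the 1-based position of the flipped bit."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # repair the single flipped bit in place
    return c

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1                       # flip one bit "in the array"
assert hamming74_correct(corrupted) == word
```

Any single flipped bit is repaired, but two flips will silently mis-correct, which is part of why on-die ECC alone is no substitute for link and system-level ECC.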
I think I'm fine; my backup laptop is 12th gen, so it should be OK. Still amazing that this spans two generations, and that the problems weren't noted or even considered already with the 13th...
The 14th gen is so similar to the 13th gen, Intel took a lot of heat for it in the initial reviews. It's no surprise that they both suffer the same ails.
It's not similar. It's literally the same silicon. They didn't tape out any new dies for the products branded as "14th gen"; not even a new stepping. Just minor tweaks to the binning.
I didn’t know that. I would like to know how to get better informed about these kinds of structural differences between CPU generations.
Going back to what you said, Intel selling the same silicon as two different generations (even if this is still just marketing terminology) is a bit lame on their side.
Having put together an i7-12900k rig on a z690 six months ago, two observations -
* DDR5 is wildly different from previous generations in being much less stable with more DIMMs, due to timing synchronization sensitivity. With four 6000 sticks I just flat out can't get more than a 12 hour stable prime95, even at jedec-4800 certified speeds. I can't even boot at 6000. My first few months were plagued with random crashes minutes into loading a game.
* There is a consensus that we're operating at & beyond the limit of this consumer ATX platform's TDP. There are recognized limitations in the motherboard retention mechanism that has prompted the use of aftermarket shims. Only the very top of the line largest air heatsinks are practical, and even then you spend much of the time thermally limited. Daring people regularly prove that the heatspreader is a limiting factor by going back to bare die cooling and getting five or ten degrees of advantage.
Because of the temp throttling becoming a normal state rather than an emergency protection, better cooling translates directly into higher performance.
Intel 13th gen and 14th gen were supposedly very similar, with slight thermal improvements from the process node.
If you have memory errors, you can corrupt your OS during and/or after install time, which may explain some of your instability. Memory errors must be resolved prior to OS installation for any expectation of problem-free usage.
An overnight run of memtest86+ with all the tests enabled (including the RowHammer ones) is necessary to verify RAM correctness. I wonder if the latter is related to this somehow.
I would agree with your recommendations and your concerns. If you have RAM errors and proceed to install an OS, you’re gonna have a bad time.
Windows SFC scans, DISM, etc. can fix up some of these issues after the fact, but unless you’re also going to repair-install all your software again, just save all the data and reinstall. It’s just not worth the trouble, and you’ll be chasing your tail and ghosts forever.
> With four 6000 sticks I just flat out can't get more than a 12 hour stable prime95, even at jedec-4800 certified speeds. I can't even boot at 6000.
Note that while your memory sticks may be rated to handle JEDEC's DDR5-4800 speed, and faster with XMP profiles, Intel's memory controller is only rated to operate at DDR5-4000 with two single-rank modules per channel, and DDR5-3600 with two dual-rank modules per channel. The speed of an individual DIMM is not the only important factor anymore. For the 12th gen parts, Intel didn't even promise DDR5-4800 unless the motherboard only had one slot per channel.
> Only the very top of the line largest air heatsinks are practical, and even then you spend much of the time thermally limited.
Air cooling is just not adequate these days. For extreme CPU loads, it hasn't been adequate for YEARS.
I've had an i9-9900K since about two months after its release, but had an air cooler on it. I'm a gamer, but nothing pushed all its cores until I got Cities: Skylines 2 last year. Even with my fan at 100%, I was bouncing off the thermal limit and getting a BSoD about once every hour or so. I had to turn down the thermal limit (and of course lose some performance, though I don't think I noticed) in order for my system to remain stable.
Upgraded to liquid cooling, now I never go above 70C, and I could probably go even lower with a more aggressive fan profile.
My wife did a system overhaul to a 13th gen i5, and we got a liquid cooler for her. She was like "I don't do crazy overclocking, why do I need a liquid cooler?", and I said that liquid cooling is basically a necessity for modern CPUs unless you're buying something low-end.
I have an AMD Threadripper 7000 system with DDR5 ECC registered RAM, one stick per channel (maximum) and I've noticed that one corrected bit error is logged every few hours.
For desktop I'm on my second generation of AIO liquid cooling (as in, the machine I use now and the previous one three years ago are liquid cooled). Air cooling is too noisy.
It’s past the return window. I’m sure I could make a warranty claim (and then be stuck with the same issue). Luckily it was just the one, and I paid 200 total for the card and chassis :)
Hmm I was considering buying the Lattepanda Sigma for a project, but seeing it's a 13th gen mobile i5-1340P... err maybe not. It is a shame though, it's beefier than any ARM board and AMD doesn't seem to bother doing SBC integrations for some reason. I guess they hate money.
The N100 is great, hopefully not affected by this problem though I'm not sure yet. It sips power and I've really liked it for homelab use where memory and IO are more important than core count (because most of the times things are idle, but one wants to keep VMs in memory and oversub the cores).
Yeah that would be the Delta, but it's significantly slower (~6 times at multicore). The N100 is just a 9th gen Celeron after all. I'm more or less looking for a complete powerhouse in a smaller than ITX form factor for extremely compute intensive multithreaded stuff.
You can do a mini-pc with up to a Ryzen 9 8945HS. Which at 65-85W is a bit of a beast, as far as 8-cores goes on the sub-itx size. The 8(8|9)4(0|5)H(S) are all pretty good options though. Just got a Beelink SER8 (8845HS) for Chimera, and it's been running very nicely.
I believe this was what caused sudden system instabilities on my 13600kf. I even undervolted my chip (lite load 1) when I got it, things ran fine for years until just a few weeks ago when I started hard freezing. I ended up disabling XMP which "fixed" it.
So lucky I opted for an i7 13850 in my new thinkpad and instead put the cash towards the RTX 3500. Doing large language models on the go, on GPU... and on Qubes OS no doubt... simply amazing.