It's absurd that this is still going on six months after the story first broke and we're really none the wiser. With estimates of 10-25% of desktop CPUs impacted, it seems likely all of the affected CPUs are going to fail eventually (including mine). They can't even recall and replace products yet because the root cause isn't known. I sure hope Intel isn't just hiding the cause when it's been known all along, because that is going to turn into big lawsuits across the world.
Completely agree. The lack of clarity around all this is hardly confidence inspiring. Definitely seems like a good time to be considering AMD or Qualcomm.
I've been using AMD since 2004. My first AMD processor was the Athlon 64 3000+; I was a kid and wasn't really allowed anything too expensive. We had predominantly used Intel up to that point, but when 64-bit CPUs hit, it was a revolutionary thing.
The roughest era of AMD CPUs was the FX era. While it was comparable to its mid-range competition, it was also a surefire way to burn down your house with its power draw.
Ryzen was a huge step forward in CPU design and architecture.
I see this era as Intel's FX era; if they have the right leadership in place, they can turn the boat around and innovate.
>Ryzen was a huge step forward in CPU design and architecture.
First-gen Ryzen was kinda mediocre. Second gen (correction: meaning Zen 2, not the Ryzen 2000 series, which was still Zen 1) was where the performance came.
Also, let's not ignore how they screwed consumers like me by dropping software support for Vega in 2023 while still selling laptops with Vega-powered APUs on the shelves all the way into the present day in 2024, or by having a naming scheme that's intentionally confusing to mislead consumers, where you can't tell whether that Ryzen 7000 laptop APU has Zen 2, Zen 3, Zen 3+ or Zen 4 CPU cores, whether it's 4 nm, 5 nm, 6 nm or 7 nm, or whether it's running RDNA2, RDNA3 or the now-obsolete Vega in a modern system.[1] Maddening.
Despite that, I'm a returning AMD customer to avoid Intel, but I'm having my own issues now with their iGPU drivers, making me regret not going Intel this time around. The grass isn't always greener across the fence; there are just different issues.
I get it, you're an AMD fan, but let's be objective and not ignore their stinkers and anti-consumer practices, which they had plenty of. They only played nice for a while to get sympathy because they were the struggling underdog, but they didn't hesitate to milk and deceive consumers the moment they got back on top, like any other for-profit company with a moment of market dominance.
My point being: don't get attached or loyal to any large company, since you're just a dollar sign to all of them. Be an informed consumer and make purchasing decisions on objective, current factors, not blind brand loyalty from the distant past.
>AMD FX is a series of high-end AMD microprocessors for personal computers which debuted in 2011, claimed as AMD's first native 8-core desktop processor.[1] The line was introduced with the Bulldozer microarchitecture at launch (codenamed "Zambezi"), and was then succeeded by its derivative Piledriver in 2012 (codenamed "Vishera").
Tom's Hardware posted a retraction over a year later admitting the motherboard was at fault and that the test was proposed and designed by Intel (including picking the motherboard vendors) as part of their Pentium 4 promotion drive.
Same as the Pentium III of the same era: thermal throttling on Socket A was supposed to be implemented by motherboard vendors using the chip's integrated thermal diode. A Pentium III would have burned the same way if put on a motherboard with a non-working thermal cutout.
"Just like AMD's mobile Athlon4 processors, AthlonMP is based on AMD's new 'Palomino'-core, which will also be used in the upcoming AthlonXP processor. This core comes equipped with a thermal diode that is required for Mobile Athlon4's clock throttling abilities. Unfortunately Palomino is still lacking a proper on-die thermal protection logic. A motherboard that doesn't read the thermal diode is unable to protect the new Athlon processor from a heat death. We used a specific Palomino motherboard, Siemens' D1289 with VIA's KT266 chipset."
Intel suggested the Siemens D1289 board for the test; that board didn't have thermal protection. For the Pentium III, Intel suggested (or even delivered) a motherboard with working thermal protection.
>AMD FX is a series of high-end AMD microprocessors for personal computers which debuted in 2011
Ha, well, that's wrong. This is the first time I've found a mistake, or more accurately a contradiction, in Wikipedia.
AMD's first FX CPU (the FX-51) came out in 2003 as a premium Athlon 64 that was an expensive, power-hungry beast, which is the one I assume the GP was talking about. Here, also from Wikipedia:
"The Athlon 64 FX is positioned as a hardware enthusiast product, marketed by AMD especially toward gamers. Unlike the standard Athlon 64, all of the Athlon 64 FX processors have their multipliers completely unlocked."
It's not contradictory. The "FX" you're talking about is used as "Athlon FX"[1], whereas the "FX" in the article is "AMD FX"[2]. The branding might be a bit confusing, but the article isn't wrong.
> First gen Ryzen was mediocre. Second gen was where the performance came.
Are you sure? I just looked at Ryzen 5 1600 vs 2600 benchmarks and the difference is around 5%. And I also remember the hype when the first generation was released. I think Ryzen gen 1 was by far the largest step.
Both of you forget that for the longest time, Intel consumer chips excluded virtualization and other features that first-generation Ryzen made available. First gen was a huge win in functionality for consumers even if it didn't hit the same performance as Intel. (AVX-512 wasn't supported on first gen, to be clear, but there were other features I forget now; they were also a reason I had stuck with AMD.)
I've used both the Ryzen 3 1200 and the 7 1700, and both seemed fine for their time and price.
Honestly, I had the 1700 in my main PC up until last year; it was still very much OK for most things I might actually want to do, except the lack of ReBAR support pushed me towards a Ryzen 5 4500 (got it for a low price; otherwise slightly better than the 1700 in performance and still good for my needs, though it runs noticeably hotter, even without a big OC).
I guess things are quite different for enthusiasts and power users, but their needs probably don't affect what would be considered bad/mediocre/good for the general population.
I'm sure you will be happy to hear this is a purely artificial limitation introduced by AMD for product-segmentation purposes. The very first Zen generation of Ryzen fully supports ReBAR in hardware, but it's locked out by AMD's BIOS.
Given that I got an Intel Arc A580 for myself, this was pretty important! Quite bad that it wasn't officially supported if there are no hardware issues. I would have liked to just keep using the 1700 for a few more years, but I opted to buy a new CPU so my old one would remain a reasonable backup; path of least resistance in this case.
Would also like to try out the recent Intel CPUs (though surely not the variety that seems to have stability issues), but that's not in the cards for now because most of my PCs and homelab all use AM4, on which I'll stay for the foreseeable future.
I actually like both companies. Intel isn't bad, right now isn't great for them though.
We're all better off with Intel and AMD coexisting. But my gamble is on AMD because I've always liked the hardware's compatibility with a variety of technology: you can easily get server-grade capabilities on consumer-grade parts, which for the longest time wasn't true of Intel. When AMD pulls an Intel, I'll go full Intel. There are huge wins in Intel getting new fabs built in the States, because it means a lot for security and development.
I remember receiving a ridiculously high-RPM fan with my FX-8350 CPU (in the box), which sounded like a vacuum when it ran. It took me less than a week to upgrade to a proper cooler that managed to cool that damn thing at 600 RPM or so, and life was quiet again!
"Evil Inside(tm)" software made sure many of the libraries and compilers had much slower performance on AMD chips for years.
We had to use an Intel CPU/GPU + a CUDA GPU simply because of compatibility requirements (heavy media codecs and ML workloads).
Let's be honest: AMD has technically had a better product for decades if you exclude the power-consumption metric. ARM64 v8 is also good, if and only if you don't need advanced GPU features.
The Ryzen chips are definitely respectable in PassMark's benchmark value rankings. =)
The 3700X and 5700X are 65 W parts specifically made for quiet/cool boxes (they're also 8-core). I have both, since I enjoy my sanity and don't care about 10% extra performance. They are the pick of the litter in my mind. Also have a laptop with a 5850H. Same with their Navi chips: not blazing hot, but good enough, and my boxes are nice and quiet.
I think we've been in the "good enough" computing age for a while now, and only CUDA GPUs and codec ASICs really feature in most desktop upgrade decisions.
Quiet machines are great, especially when you have to sit next to one for 9 hours a day. =3
> AMD technically has had a better product for decades if you exclude the power consumption metric
And single core performance.
And some other stuff which obviously didn’t matter during the period in question but suddenly became very important when AMD surpassed Intel in that regard…
I've picked AMD over Intel too, but I've had so many issues with it that I partly regret it: memory stability issues, extremely long boot times, excessively high voltages, iGPU driver timeouts. Most of the issues have been fixed, but not all. After months of dealing with an annoying memory leak, I've just recently been able to confirm that it is caused by a Zen 4 iGPU driver.
I would never buy an AMD machine again after my last Ryzen 3600X. So many issues. It had to be power cycled 2-3 times to get it to boot. Memory corruption issues and stability issues galore. Not overclocked. Stock configuration. Decent quality board and power supply. Just hell.
Swapped board out assuming it was that. Same problem. Turned out to be the CPU which was a pain in the ass getting a warranty replacement for.
Ended up buying a new open box Intel 12400 Lenovo lump off eBay and using that.
I had similar issues with Zen across a few different generations, and with various boards. As a result, I built a new machine around an Intel 12400 as well. I did have to buy a Thermaltake socket-reinforcement bracket to mitigate the bending issue.
Oddly, this Intel build somewhat restored my faith in humans to build hardware and software as the thing seems to work quite well.
An issue with these parts was that the out-of-the-box config wasn't very good: even if you knew to turn on the XMP profiles, it still threw a ridiculous amount of voltage at the chip in pursuit of a few percent of extra performance.
I don't think there's a lot in it between vendors, to be honest. They are all cheap garbage with lurid chunks of metal and artwork designed by a 5-year-old stuck all over them.
And there's one thing you can NEVER trust and that is objectivity from gamers when looking at failure and reliability statistics. It's one huge cargo cult.
Notably my kids both have Ryzen 5600G + MSI B550 boards with no problems.
I have been using Gigabyte for a very long time and had no problems. ASUS was OK for me too, but MSI boards were the worst due to stability, driver and cooling curve problems. Don’t buy MSI.
The B550 series is a power-reduced, cost-cutting version of the X570 boards. They are only meant for the 6-core versions of the chips, and the 65 W versions. You need to pick your components carefully.
The VRM is the component you need to look at regarding power delivery for the CPU. There are many motherboards that combine a lower-tier chipset with a high-end VRM.
B550 was never that limited, even initially. Even the Ryzen 9 5950X runs on B550-series motherboards today. B550 is a bit scaled down, e.g. the chipset's downstream lanes are PCIe 3.0 rather than 4.0, but that's OK with me.
My motherboard is an ASUS ROG Strix variety with 4x32GB ECC RAM and the Ryzen 9 5950X works just fine.
I built an Intel workstation for the first time in two decades when the 13700K was released. It hasn't been a bed of roses, starting with thermal throttling from the LGA1700 socket bending the IHS so badly that the heatsink only contacted it in a strip down the middle, needing to physically reseat the onboard HDMI for the display signal to resume after the monitor is disconnected, a generally boiling TDP, DDR5 quirks like 5-minute training times (no blame here, just didn't expect my servers to boot faster), and generally having goofier names for UEFI options designed around overclocking. I still don't know how to use XTU.
Couple that with the underwhelming software support for AI/ML on their own hardware for about a year after CPU and GPU launch, and I wish I'd just stuck to AMD.
I don't think either are perfect, but it's the devil you know, and I've grown to trust that even when AMD cocks something up, they'll listen to customers, coordinate engineering efforts with OEMs, and handle it. Intel are either too high and mighty or don't empower their engineers to treat partners like partners without layers of management getting involved to be able to do something similar.
> Couple that with the underwhelming software support for AI/ML on their own hardware for about a year after CPU and GPU launch, and I wish I'd just stuck to AMD.
Seems like a strange way to express that point? Why mention underwhelming support for AI/ML if it’s the same on both? (if we’re talking about desktop chips I don’t even understand what’s that supposed to mean).
Sounds like bad RAM (clean the contacts, re-seat, and test) or temperature issues. (The main reasons we still use a mobile i7-12700H were the cheap 64 GB DDR4 RAM kit, the Iris media GPU drivers, and an RTX CUDA GPU.)
Intel has its own issues; Gigabyte told me to pound sand when I asked them to unlock the BIOS on my own equipment to disable the IME.
There is no greener grass across the fence line... just a different set of issues =3
>Sounds like bad ram (clean contacts, re-seat, and test)
Since he's talking about iGPU issues, he most likely has a laptop APU, so there's no RAM to reseat. I'm also having similar issues on my Ryzen 7000 laptop. I kinda regret upgrading from the Ryzen 5000 laptop, which AMD obsoleted just 2 years after I bought it, as at least that one had no issues. Hopefully new drivers will fix stability in the future, but you never know.
What I do know is that this will most likely be my last AMD machine if Intel shows enough improvement to match AMD, since Intel's Linux driver support is just top notch.
Increasing the VRAM size (UMA size) to 4 GB fixed the frequent driver timeouts for me.
Reverting to an older driver (driver cleaner -> driver v23.11.1) fixed the memory leak. This memory leak is weird, since PoolMon doesn't show anything unusual. Nothing shows as using too much memory anywhere, except that committed memory grows to over 100 GB after a few days of uptime, and RamMap shows a large amount of unused-active memory.
GPUs have the most complex drivers in the whole system (we're talking tens of millions of LOC), so it's absolutely not surprising that you're having issues like that, given how recent AMD's investment in APUs is. I wouldn't use them for a few more years; get a cheap discrete GPU from Nvidia, or maybe even from Intel.
Hm? AMD's investment in APUs is not a new thing; that goes back to the FX days with their FM1 socket. Since Ryzen 1 they have had their G-series APUs, and their integrated graphics power the Steam Deck and many other mobile handhelds. Plus, Intel's integrated graphics are known for their driver issues (and so is Arc, for now), so I'd disagree with that recommendation.
Please post the cpu-z (win) or cpu-x (linux) chip make/model for other users to compare/search.
If there is enough data here, we may be able to see a common key detail emerge, i.e. if the anecdotal problem(s) remain overtly random, then a solution from the community or the OEM may prove impossible.
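For Linux users who don't have cpu-x installed, a rough equivalent of the requested dump can be pulled with standard tools. A minimal sketch (`lscpu` ships with util-linux; `dmidecode` needs root, so it is only shown as a comment):

```shell
# Print the CPU vendor, model name and stepping, roughly matching
# the fields cpu-z/cpu-x report.
lscpu | grep -E '^(Vendor ID|Model name|Stepping)'

# RAM manufacturer/part numbers and speeds (run as root on a real system):
#   dmidecode -t memory | grep -E 'Manufacturer|Part Number|Speed'
```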
I initially got somewhat frequent hangs on Fedora with a Radeon 680M iGPU (in a Ryzen 7 PRO 6850U APU). The hangs stopped when I added amdgpu.dcdebugmask=0x10 to kernel boot options, based on some comments in an AMD Linux driver bug report [1]. That seems to disable panel self-refresh so it would seem to be related to that somehow.
Stability has been fine since. The bug report has since been closed but I haven't tested in a while to see if disabling PSR is still needed or if the issue has actually been fixed.
I haven't seen significant stability issues on Windows, although I don't use it much on the AMD device.
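For anyone wanting to try the same `amdgpu.dcdebugmask=0x10` workaround persistently, here is a minimal sketch of adding it to the kernel command line on a GRUB-based distro. The file contents and the config-regeneration command are assumptions that vary by distro, so the snippet edits a scratch copy rather than the real `/etc/default/grub`:

```shell
# Work on a scratch copy of what /etc/default/grub might contain
cfg=$(mktemp)
echo 'GRUB_CMDLINE_LINUX="rhgb quiet"' > "$cfg"

# Append the amdgpu PSR workaround flag if it is not already present
grep -q 'amdgpu.dcdebugmask' "$cfg" || \
  sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 amdgpu.dcdebugmask=0x10"/' "$cfg"

cat "$cfg"
# On a real system you would then regenerate the config, e.g. on Fedora:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```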
With PSR in the mix, is the system really hanging or is it just failing to update the screen somehow? I.e. can you tell the difference with logs or a remote connection or configure and use an unprompted shutdown via the power button?
It was on Wayland. I'm not sure if I tried with X.
I can't remember the details of it. It effectively hung in the sense that I couldn't get the system into a usable state again locally without rebooting. I'm not sure if the system responded to the power button or not, or whether there was useful log output.
I didn't bother trying with a remote connection since the hang was frequent enough that it wouldn't have been of any use as a workaround anyway. I'd guess switching to another virtual console probably didn't work because I'd probably remember it if it did.
I can try re-enabling PSR and see if the problem is still there if you're interested.
Looks like some of the patches discussed in that bug report work around the problem by disabling PSR-SU for the specific timing controller my display also has. Those patches are in current kernels already. So basically the problem is gone for me, even if I remove the dcdebugmask.
So, I don't really know if the system was fully hanging, or if the display was just unable to update any more, but it was likely exactly the same that happened to other people with Parade TCONs in that bug discussion.
Depends on the failure mode, as it is common for specs to drift around under load (also, temperature cycling stresses the PCB and can shear BGA connections).
I'd try a cheap set of slower, lower-bandwidth/higher-latency RAM sticks to see if it stops glitching. If you are using low-latency sticks (usually recommended with an iGPU), then dropping the performance a bit may stabilize your specific equipment.
We may still be able to use this information to compare with other users' glitches to see if there is some underlying similarity.
Unfortunately, if it is thermal stress/warping on the PCB cracking open RAM BGA balls or shifting traces... one won't really be able to completely identify the intermittent issue.
We were actually looking at buying a similar economy model earlier this year (ended up with a few classic Lenovo models instead)... so please be verbose with the make/model to help future searchers =3
Please dump the problematic CPU/RAM chip model numbers to help other users. These chip manufacturer numbers are not really personally identifiable information, as they are shared between hundreds of thousands of products.
The classic cpu-z for Windows users is here if you don't run *nix:
>X-ray vision like Superman I gather... nice... ;)
That snarkiness is uncalled for. I repasted the laptop, ran benchmarks, and checked the temperature sensors, plus used my FLIR. It's not a thermal issue. It's just AMD iGPU driver bugginess.
Processors Information
-------------------------------------------------------------------------
Socket 1 ID = 0
Number of cores 8 (max 8)
Number of threads 16 (max 16)
Secondary bus # 0
Number of CCDs 1
Manufacturer AuthenticAMD
Name AMD Ryzen 7 7840HS
Codename Phoenix
Specification AMD Ryzen 7 7840HS with Radeon 780M Graphics
Package Socket FP7
CPUID F.4.1
Extended CPUID 19.74
Core Stepping PHX-A1
Technology 4 nm
TDP Limit 54.0 Watts
Tjmax 90.0 °C
Core Speed 2761.5 MHz
Multiplier x Bus Speed 27.71 x 99.6 MHz
Base frequency (cores) 99.6 MHz
Base frequency (mem.) 99.6 MHz
Instructions sets MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AES, AVX, AVX2, AVX512 (DQ, BW, VL, CD, IFMA, VBMI, VBMI2, VNNI, BITALG, VPOPCNTDQ, BF16), FMA3, SHA
Microcode Revision 0xA704104
L1 Data cache 8 x 32 KB (8-way, 64-byte line)
L1 Instruction cache 8 x 32 KB (8-way, 64-byte line)
L2 cache 8 x 1024 KB (8-way, 64-byte line)
L3 cache 16 MB (16-way, 64-byte line)
Preferred cores 2 (#1, #3)
Max CPUID level 0000000Dh
Max CPUID ext. level 80000026h
FID/VID Control yes
# of P-States 3
P-State FID 0x898 - VID 0xBF (38.00x - 1.194 V)
P-State FID 0x858 - VID 0xAB (22.00x - 1.069 V)
P-State FID 0xA50 - VID 0x97 (16.00x - 0.944 V)
PStateReg 0x80000000-0x49AFC898
PStateReg 0x80000000-0x45AAC858
PStateReg 0x80000000-0x4425CA50
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
PStateReg 0x00000000-0x00000000
Package Type 0x4
Model 00
String 1 0x0
String 2 0x0
Page 0x0
Power Unit 0.0
SMU Version 76.73.00
TDP/TJMAX 0x36005A
TCTL Offset 0x0
PMTV 004C0008
Package Power Tracking (PPT) 54.0 W (current)
Package Power Limit #1 (long) 35.0 W
Package Power Limit #2 (short) 25.0 W
DMI Physical Memory Array
location Motherboard
usage System Memory
correction None
max capacity 64 GB
max# of devices 4
DMI Memory Device
designation DIMM 0
format Row of chips
type LPDDR5
total width 32 bits
data width 32 bits
size 8 GB
speed 6400 MHz
manufacturer Micron Technology
part number MT62F2G32D4DS-026 WT
serial number 00000000
voltage 0.500000
manufacturer id 0x2C80
product id 0x0
Display adapter 0 (primary)
ID 0x2180003
Name AMD Radeon™ 780M
Board Manufacturer Lenovo
Codename Phoenix
Cores 768
ROP Units 16
Technology 4 nm
Memory size 1024 MB
Current Link Width x16
Current Link Speed 16.0 GT/s
PCI device bus 99 (0x63), device 0 (0x0), function 0 (0x0)
Vendor ID 0x1002 (0x17AA)
Model ID 0x15BF (0x3819)
Revision ID 0xC7
Root device bus 0 (0x0), device 8 (0x8), function 1 (0x1)
Performance Level Current
Core clock 800.0 MHz
Shader clock 400.0 MHz
Memory clock 800.0 MHz
Driver version 32.0.11021.1011
WDDM Model 3.1
People use special X-ray machines to inspect the BGA solder bonds to the PCB underneath chips. These chips may also be additionally glued down to the PCB on higher-end equipment (impossible to visually inspect).
Note that BGA chips were never initially intended to be larger than 20 mm wide, and they can still put enormous shear forces on the contact bonds as the solder solidifies post-reflow (and the bimetallic cantilever the PCB forms starts to pop back). A certain percentage of products will thus fail when they warm up, as the PCB will locally heat/warp the area again and foobar a few random connections in the process. A low-heat paint-stripper heat gun might be able to replicate the crash to eliminate this theory, and you might be able to RMA the board/chips if you are still under warranty.
Could indeed also just be software as some suspect, but that is a lot harder to find in kernel drivers.
It is hard to read people's emotions online, but do assume that if someone is reaching out to help, they probably think you are worth respecting too.
Thanks for posting data other users may find useful, and have a wonderful day. =3
No, it's not a solder/BGA issue, because RAM/CPU/GPU stress benchmarks would cause some instability, but that's never the case. The instability only manifests when running video decode tasks (watching YouTube) in the browser while browsing websites in parallel, meaning it's 99% sure an iGPU driver issue.
I've been staunchly an Intel stan since Pentium 4s were cool, and this year will be my first AMD build. We've already been using their server hardware at the office and I'm not disappointed at all.
No particular straw broke the camel's back, they just haven't managed to justify their price premium in a very long time.
> It's even worse with their p vs e cores that creates issues in many games.
Didn't AMD also have issues with different cache types/sizes on dual-CCD chips? Meaning that it basically didn't make any sense to buy anything more expensive than the 7800X3D if you only care about gaming.
Did you use AM5? It was hardly without issues, with users experiencing 30+ second POST times. I'm not even sure that's fixed yet with most motherboards OOTB.
The long POST times are a consequence of DDR5 link training, so it's not an entirely AMD-specific problem. Most motherboards for either Intel or AMD now have a feature (often called Memory Context Restore on AM5 boards) to skip most of the link training if it doesn't look like there's been a hardware change since the last boot, but it's unavoidable on the first boot.
30 seconds would be a blessed relief. Link training takes upwards of 3 minutes for my 4x 32GiB DDR5 machine whenever I update its firmware, and then 3 minutes all over again when I load the XMP profile instead of running at the new firmware's safe stock 4000 MT/s.
Just last week I bought a new desktop for a family member. I was considering an Intel CPU, but at the last minute found a better and cheaper option with an AMD processor. Am I glad I dodged that bullet.
I've been firmly in camp Intel ever since my third Athlon XP 2200+ burned itself out (and finally took the motherboard socket with it on its way to the melty grave) back around 2005 (Intel CPUs had thermal protection at the time; AMD CPUs did not).
This fiasco has me convinced that I will not be building an Intel system again, and I haven't even (yet) had any problems with either of the Z790 i7-14700K systems I put together in March.
Being in any camp is just a bad way to approach things.
I go with AMD because they make the best desktop CPUs right now. When Intel gets their shit together (hopefully) and the pendulum swings back I'll go with them.
These are all corporations at the end of the day. They get better and worse over time. They certainly never stay the same forever, and none of them deserves any kind of brand loyalty.
Well, maybe, but Intel is historically much scummier than AMD ever was: the ICC benchmark cheating, the OEM bribery debacle, ECC only for Xeon, constant socket changes, shitty non-solder TIM for a long time, etc...
Remember when Nvidia used to have better open-source drivers than ATI? The company that's in the lead always acts the shittiest, because they can, because they already have the lead. Don't tie yourself to one company.
Unknown to me at the time (I wasn't even an adult yet) was that the heatsink was the wrong model, so it wasn't making good contact and doing its job properly. This wouldn't have been a problem if the CPU had thermal throttled; that would have immediately caused a performance problem I would have been curious about. No, instead they just died, over the course of about 2 years. Changing platform wasn't my choice; my mother bought a new desktop and it had a Pentium. I don't remember the model.
I think there's a real concern that Xeon E-2400 may be failing at this point too. It's an open question whether Emerald Rapids might have the same issues (and EMR has a mesh, not a ring, so this is an interesting question for diagnosing the cause!), but W-2400 and W-3500 still use Golden Cove.
The leading theory at this point is not really voltage or current related but actually a defect in the coatings that is allowing oxidation in the vias. Manufacturing defect.
It affects even 35 W SKUs like the 13700T, so it's really not the snide/trite "too much power" thing. Like, bro, Zen boosts to 6 GHz now too and nobody says a word. And believe it or not, if you look at the power consumption, both of them are probably fairly comparable in core power: both brands are consistently boosting to around 25-30 W single-thread under load. AMD's highest voltages occur during these same single-core boost loads, which are the ones of concern at this point. If it is just voltage that's killing these 35 W chips, well, AMD is playing in the exact same voltage/current domains.
Furthermore, if it were power, it wouldn't be a problem limited to 10-25% of the silicon; it would be all of them.
There was a specific problem with partners implementing eTVB wrong, and that was rectified. The remaining problem is actually pretty complex and potentially there are multiple overlapping issues.
It has just become a lightning rod for people who are generally dissatisfied with Intel, and people are lumping their random "it doesn't keep up with X3D efficiency!" complaints into one big bucket. But Intel actually isn't all that far off the non-X3D SKUs in terms of efficiency, especially in non-AVX workloads. "140 W instead of 115 W for gaming" is pretty typical of the results, and that's not "burn my processor out" level bad. The 13900K has always been silly, but the 13700K is like... fine?
(Granted, this may be launch BIOS behavior, and it sounds like part of the problem is that partners have been tinkering over time and it's gotten worse and worse... I'm dubious these numbers are the same as you'd get today, but they are in fact pretty consistent across a whole bunch of reviewers; ctrl-f "CPU consumption" and the gaming and non-AVX power numbers are in broadly unconcerning ranges. 57-170 W is, broadly speaking, fine.)
Again, even if there is a power/current issue, at the end of the day it's going to have a specific cause and diagnosis attached to it, like AMD's issue was VSOC being too high. Saying "too much power" is like writing "died of a broken heart" on a death certificate: maybe that's true, but it's not really a medical diagnosis. Some voltage on some rail is damaging some thing, or some thermal limit is being exceeded unintentionally, and that is causing electromigration, or something.
You might as well just come out and say it: intel's hubris displeased the gods, they tempted fate and this is their divine punishment. That's really what people are trying to say. Right? Don't dress it up in un-deserved technical window-dressing.
Hard to tell. My workplace is currently running on nothing but 13th-gen i5 HP EliteBooks. We haven't had any issues, but then I suspect these would all be running CPUs from the same batch, possibly even the same wafer.
No, Intel has said nothing at all (that's part of the problem), and 10-25% are the numbers from Wendell and GN, who have been investigating the issue in various companies' prod environments and event logs.
I hear a lot of anecdotes and noise from YouTubers around this but little to no actual data or analysis. I am a skeptic until I see concrete data. That covers both the mobile and desktop issues.
Observations so far are limited to:
I have seen actual evidence that some W680 boards have been shipping with an unlimited power profile, which will toast a CPU fairly quickly. As to whose fault that is, and whether this correlates with or is causal to the rest of the reports, I don't know.
My own Asus B760M board shipped with an unlimited power profile. I had to switch it to “Intel Default”. This machine has been under heavy load with no issues so far.
When I have done research, I have only found people reporting this on custom-built systems or lowball "servers". I haven't found any viable big-brand system failure reports yet (Dell/HP/Lenovo etc.). While some of this might be statistical failures, I'd like to see configuration eliminated from the data as a cause first.
I think it would be rather nice at this point if Intel produced their own desktop boards again, with their own tested BIOS, so we'd have a viable reference system to compare against, rather than the usual ugly junk-shifter outfits or big brands. A fully vertically integrated component PC would be a nice thing to have again. They just worked!
Not fully confirmed yet, but that sounds really bad. It seems like it also hits low power models like the 13900T, which would imply this isn't just a voltage issue from auto overclocking.
They were boring in every single way: They weren't flashy, they weren't expensive, they didn't have weird features, and they were ridiculously stable.
I didn't ever buy any of them for myself because I like to tinker with stuff, but I sold a bunch of them to people who simply wanted a computer that just worked.
That makes sense if the target market is overclockers. They want to be able to override everything for a high score if they want to. My board (ASUS TRX50) has all kinds of override settings for fan speeds, voltages, TDP (whatever that does!) and a warning not to mess with them if you don't know what you're doing.
Yes unfortunately. When you buy "enthusiast boards" which is everything that Dell and HP etc don't ship these days then you have literally no idea what crappy BIOS and software configuration you are inheriting.
that is part of the problem: W680 is not the same thing as C266 (and even C266 might be affected; Wendell is sounding concerned about the E-2400 platform too). W680 is still a consumer-socket product, it's just one that supports ECC. Like yes, people run those in a datacenter and that's fine and normal and supported - some customers want high single-threaded performance, and the big server chips just aren't as good at that. One of the affected customers is Citadel, which is unsurprising if you think about it (HFT).
this also means you get fun stuff like 13700T sometimes being run without power limits... but even within power limits they've seen 13700T degrading too, which is kind of a point against the whole "their hubris and power consumption angered the gods" thesis. If 35W is too much power, we're all cooked.
But it's hard to say, since nothing is being run within spec and you have to bend over backwards to get "stock" behavior, etc. Buildzoid has elaborated and clarified on this (after a couple of initial videos that were working from incomplete info). And yeah, that's a whole shitshow too: partners were severely breaking the spec in a whole bunch of places, both departing further from the spec in ways that could cause problems and performing a factory undervolt out of the box that isn't necessarily stable, and this has gotten more and more out-of-spec over time (both the undervolting and the loadline). Also, the "Intel Baseline Profile" and "Intel Failsafe Profile" apparently did not come from Intel; those were made up by Gigabyte and MSI, while the "Intel Default" profile did. Great stuff, you love to see it. /s
But there just has to be a reason that only 10-25% of samples are affected and if it's just generically power or current you should see it everywhere. Hence why board config is/was a concern, and why GN is now kinda pointing the finger at this "contamination/oxidation of the vias" fab problem theory.
I used to say, 'Never bet against Intel' but the last 5-10 years or so have not been kind to them. They have been kicking out the supports in the name of efficiency and we are seeing the impacts of this now.
Same issue that is plaguing Boeing. MBA is now a swear word.
I think they have a chance to escape the Boeing destiny, though? With Gelsinger the "technical reign" returned to the company, if I understand correctly?
A few years ago, if you said you bought AMD, people would think you were hallucinating, but now it looks like the only reliable vendor for x64. Intel was once the king of reliability, but in recent years it has looked like the king of bugs.
That was around 6-7 years ago at this point. Personally, every AMD machine I've had since then has been very stable on Linux with an Nvidia GPU. My latest one, an Intel + Nvidia, had issues under virtually every Linux distro I tried.
Now that I don't need CUDA anymore I might consider going full team red.
Ah but only due to a broad range of hardware and software issues, not because of the same hardware issue killing the desktop equivalents, so that's good news.
Based on Intel's behavior so far and the previous comment by Alderon Games' founder, I'm not sure why you're so willing to believe them at face value.
> "The laptops crash in the exact same way as the desktop parts including workloads under Unreal Engine, decompression, ycruncher or similar. Laptop chips we have seen failing include but not limited to 13900HX etc.," Cassells said.
> "Intel seems to be down playing the issues here most likely due to the expensive costs related to BGA rework and possible harm to OEMs and Partners," he continued. "We have seen these crashes on Razer, MSI, Asus Laptops and similar used by developers in our studio to work on the game. The crash reporting data for my game shows a huge amount of laptops that could be having issues."
When your processor is cooking itself to death, all bets are off. We have seen some of them in our data center over the years, albeit very rarely.
Interestingly, a modern processor is very resilient against losing its functional blocks during operation. While this is a boon, diagnosing these problems is a bit too complicated for the inexperienced.
An x86 processor can detect when it makes a serious error in some pipelines and rerun these steps until things go right. This is the first line of recovery (this is why temperature spikes start to happen when a CPU reaches its overclocking limits. It starts to make mistakes and this mechanism kicks in).
Also x86 has something called “machine check architecture” which constantly monitors the system and the CPU and throws “Machine Check Exceptions” when something goes very wrong.
These exceptions divide into “recoverable” and “unrecoverable” exceptions. An unrecoverable exception generally triggers a kernel panic, and recoverable ones are logged in system logs.
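On Linux, those recoverable events land in the kernel log, where tools like mcelog or rasdaemon decode them. As a toy illustration (the exact log format varies across kernel versions, so this regex is only a sketch, not a robust parser), here is how you might pull the CPU, bank, and raw status word out of one logged line:

```python
import re

# Illustrative only: MCE kernel log formats differ between kernel versions,
# so treat this pattern as a sketch rather than a parser for every distro.
MCE_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check(?: Exception)?: "
    r"(?P<mcg>\S+) Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)

def parse_mce(line: str):
    """Extract CPU number, MC bank, and raw status word from one log line."""
    m = MCE_RE.search(line)
    if not m:
        return None
    return {"cpu": int(m["cpu"]),
            "bank": int(m["bank"]),
            "status": int(m["status"], 16)}

sample = "mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108"
print(parse_mce(sample))
```

The raw status word then has to be decoded against the vendor's machine-check documentation to tell you which unit actually misbehaved.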
Moreover, a CPU can lose (fry) some caches (e.g. half of L1), and it’ll boot with whatever is available and report what it can access and address. In some extreme cases, it loses its FPU or vector units, and instead of getting upset, it tries to do the operations at the microcode level or with whatever units remain. This manifests as extremely low LINPACK numbers. We had a couple of these; I didn’t run accuracy tests on those specimens, and LINPACK didn’t flag anything about the results. The performance was just very low compared to normal processors.
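The crude way to catch a unit that computes wrongly without crashing is a consistency check: run the same FP-heavy workload twice and compare, which is roughly what stress testers like prime95 or y-cruncher do. A minimal Python sketch of the idea (the workload itself is arbitrary and entirely made up for illustration):

```python
import math

def workload(n=50_000, seed=12345.0):
    """Deterministic FP-heavy loop; healthy hardware returns the same bits every run."""
    x = seed
    acc = 0.0
    for i in range(1, n):
        # Cheap pseudo-random walk plus transcendental ops to exercise the FPU.
        x = math.fmod(x * 1103515245.0 + 12345.0, 2.0 ** 31)
        acc += math.sin(x) * math.sqrt(i)
    return acc

# A mismatch here would point at silent compute errors rather than a software bug.
assert workload() == workload()
```

Real stress testers additionally compare against a precomputed reference result, so a unit that is consistently wrong still gets caught.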
Throttling is a normal defense against poor cooling. The mechanisms above try to keep the processor operational in limp mode, so you can diagnose and migrate somehow.
Actually it has accumulated over years. First being interested in hardware itself, and following the overclocking scene (and doing my own experiments), then my job as an HPC administrator allowed me to touch a lot of systems. Trying to drive them to max performance without damaging them resulted in seeing lots of edge cases over the years.
On top of that, I was always interested in high performance / efficient computing and did my M.Sc. and Ph.D. in related subjects.
It's not impossible to gather this knowledge, but it's a lot of rabbit holes which are a bit hard to find sometimes.
Thanks. Do you think the M.Sc. and Ph.D. helped a lot? I don't have any experience in this field and feel that this is probably one of the domains that people HAVE to rely mostly on vendor manuals and very low level debugging messages. Maybe at the same level of AAA game engine optimization?
Yes, they helped a lot, but because I was already interested in high performance programming and was looking to improve myself on these fronts. Also, I started my job right after my B.Sc., so there was a positive feedback loop between my work and research (I fed my work with research and fed my research with the know-how from my job, which was encouraged and required by the place I work).
You need to know a lot of things to do this. It's half dark art and half science, really. Vendors do not always tell the full story about their hardware (Intel's infamous AVX frequency, and their compilers' shenanigans when they detect an AMD CPU), and you need to be able to bend the compiler to your will to build the binary the way you want. Lastly, of course, you need to know what you're doing with your code and understand how it translates to assembly and what your target CPU does with all of it.
To be able to understand that kind of details, we have valgrind/callgrind, perf, software traces, some vendor specific low-level tools to see what processor is doing, and pure timing-related logging.
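Even a language's built-in profiler is often enough to see where the time goes before reaching for the heavier tools; a minimal sketch with Python's cProfile, where `hot_loop` is just a hypothetical stand-in workload:

```python
import cProfile
import io
import pstats

def hot_loop(n):
    """A deliberately compute-heavy function that should dominate the profile."""
    return sum(i * i for i in range(n))

def run_profiled():
    """Profile one call to hot_loop and return (result, text report)."""
    pr = cProfile.Profile()
    pr.enable()
    result = hot_loop(200_000)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(3)
    return result, buf.getvalue()

result, report = run_profiled()
print(report)  # hot_loop should appear near the top, sorted by cumulative time
```

The same workflow scales up: once the profile names the choke point, that's where valgrind/callgrind or perf annotations become worth the setup cost.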
Game engines are a different beast; I do scientific software, but a friend of mine was doing game engines. Highly optimized graphics drivers are black boxes, and that's a whole different game. They're not very well documented, are riddled with trade secrets, and are full of undocumented behaviors that drivers use to optimize stuff. Plus, you have to use the driver in very complex and sometimes ugly ways to make it perform.
While this is hard to start and looks like a big mountain, all of this gets way easier when you develop a "feeling for the machine". It's similar to how mechanics listen to an engine and say "it's spark plug 3, most probably". You can feel how a program runs and where it chokes just by observing how it runs.
This is why C/C++ is also used in a lot of low-level contexts. Yes, it allows you to do some very dangerous things, but if you need to do things fast, and you can prove mathematically that the dangerous thing can't happen, you can unlock some more potential from your system. The people doing this are very few, and the people who do it recklessly (or just code carelessly) give C/C++ a bad name.
It's not impossible. If Carmack, Unreal, scientific software companies, Pixar, Blender and more are able to do it, you can do it, too.
> To be able to understand that kind of details, we have valgrind/callgrind, perf, software traces, some vendor specific low-level tools to see what processor is doing, and pure timing-related logging.
Working as a data warehouse engineer, I'm not exposed to these kinds of things. Our upstream team, the streaming guys, does have a bit of exposure to performance related to Spark.
> It's similar to how mechanics listen to an engine and say "it's spark plug 3, most probably". You can feel how a program runs and where it chokes just by observing how it runs.
> It's not impossible. If Carmack, Unreal, scientific software companies, Pixar, Blender and more are able to do it, you can do it, too.
I kind of feel that I have to switch jobs to learn these kinds of things. I do have some personal projects, but they don't need that kind of attention. I'll see what I can do. I have always wanted to move away from data engineering anyway.
I’m not sure if that’s what they meant, but generally, CPUs will throttle or shut down if they detect overtemp, hopefully before they start encountering errors which lead to wrong calculation results or crashes.
ECC RAM would probably have helped, but that got axed in consumer CPUs, likely as a financial optimization: they wanted to keep ECC as an upsell feature for 'server grade' products.
DDR5 (and any other DRAM build on the latest fab processes for DRAM) has on-die ECC which provides some protection against corruption of data at rest. It's necessary because the density of the memory array is too high; there's not enough isolation between memory cells and not enough charge stored in each cell to ensure sufficiently low error rates without adding the on-die ECC. A typical DDR5 chip might be less susceptible to random bit-flips or rowhammer than a typical DDR4 chip, but the on-die ECC is really only intended to prevent a major regression in reliability.
What ordinary consumer DDR5 modules still lack is any form of ECC on the link between the DRAM and the CPU's memory controller. With the link running at about twice the speed used by DDR4, DDR5 is much more challenging for the memory controller/PHY to handle.
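For intuition about what single-error correction buys you: the code recomputes parity over overlapping bit groups, and the resulting syndrome points directly at the flipped bit. DDR5's on-die code is much wider (commonly described as 8 check bits per 128 data bits), but a toy Hamming(7,4) sketch in Python shows the principle:

```python
def hamming74_encode(data):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Recompute parity; a nonzero syndrome is the 1-based position of the flipped bit."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # repair the single flipped bit in place
    return c

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1                       # flip one bit "in the array"
assert hamming74_correct(corrupted) == word
```

Any single flipped bit is repaired, but two flips will silently mis-correct, which is part of why on-die ECC alone is no substitute for link and system-level ECC.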
I think I'm fine; my backup laptop is 12th gen, so it should be OK. Still amazing that this spans two generations, and that the problems weren't noted or even considered already with the 13th...
The 14th gen is so similar to the 13th gen, Intel took a lot of heat for it in the initial reviews. It's no surprise that they both suffer the same ails.
It's not similar. It's literally the same silicon. They didn't tape out any new dies for the products branded as "14th gen"; not even a new stepping. Just minor tweaks to the binning.
I didn’t know that. I would like to know how to get better informed about these kinds of structural differences between CPU generations.
Going back to what you said, Intel selling the same silicon as two different generations (even if this is still just marketing terminology) is a bit lame on their side.
Having put together an i7-12900k rig on a z690 six months ago, two observations -
* DDR5 is wildly different from previous generations in being much less stable with more DIMMs, due to timing synchronization sensitivity. With four 6000 sticks I just flat out can't get more than a 12 hour stable prime95, even at jedec-4800 certified speeds. I can't even boot at 6000. My first few months were plagued with random crashes minutes into loading a game.
* There is a consensus that we're operating at & beyond the limit of this consumer ATX platform's TDP. There are recognized limitations in the motherboard retention mechanism that has prompted the use of aftermarket shims. Only the very top of the line largest air heatsinks are practical, and even then you spend much of the time thermally limited. Daring people regularly prove that the heatspreader is a limiting factor by going back to bare die cooling and getting five or ten degrees of advantage.
Because of the temp throttling becoming a normal state rather than an emergency protection, better cooling translates directly into higher performance.
Intel 13th gen and 14th gen were supposedly very similar, with slight thermal improvements from the process node.
If you have memory errors, you can corrupt your OS during and/or after install time, which may explain some of your instability. Memory errors must be resolved prior to OS installation for any expectation of problem-free usage.
An overnight run of memtest86+ with all the tests enabled (including the RowHammer ones) is necessary to verify RAM correctness. I wonder if the latter is related to this somehow.
I would agree with your recommendations and your concerns. If you have RAM errors and proceed to install an OS, you’re gonna have a bad time.
Windows SFC scans, DISM, etc. can fix up some of these issues after the fact, but unless you’re also going to repair-install all your software again, just save all the data and reinstall. It’s just not worth the trouble, and you’ll be chasing your tail and ghosts forever.
> With four 6000 sticks I just flat out can't get more than a 12 hour stable prime95, even at jedec-4800 certified speeds. I can't even boot at 6000.
Note that while your memory sticks may be rated to handle JEDEC's DDR5-4800 speed, and faster with XMP profiles, Intel's memory controller is only rated to operate at DDR5-4000 with two single-rank modules per channel, and DDR5-3600 with two dual-rank modules per channel. The speed of an individual DIMM is not the only important factor anymore. For the 12th gen parts, Intel didn't even promise DDR5-4800 unless the motherboard only had one slot per channel.
> Only the very top of the line largest air heatsinks are practical, and even then you spend much of the time thermally limited.
Air cooling is just not adequate these days. For extreme CPU loads, it hasn't been adequate for YEARS.
I've had an i9-9900K since about two months after its release, but had an air cooler on it. I'm a gamer, but nothing pushed all its cores until I got Cities: Skylines 2 last year. Even with my fan at 100%, I was bouncing off the thermal limit and getting a BSoD about once every hour or so. I had to turn down the thermal limit (and of course lose some performance, though I don't think I noticed) in order for my system to remain stable.
Upgraded to liquid cooling, now I never go above 70C, and I could probably go even lower with a more aggressive fan profile.
My wife did a system overhaul to a 13th gen i5, and we got a liquid cooler for her. She was like "I don't do crazy overclocking, why do I need a liquid cooler?", and I said that liquid cooling is basically a necessity for modern CPUs unless you're buying something low-end.
I have an AMD Threadripper 7000 system with DDR5 ECC registered RAM, one stick per channel (maximum) and I've noticed that one corrected bit error is logged every few hours.
For desktop I'm on my second generation of AIO liquid cooling (as in, the machine I use now and the previous one three years ago are liquid cooled). Air cooling is too noisy.
It’s past the return window. I’m sure I could make a warranty claim (and then be stuck with the same issue). Luckily it was just the one, and I paid 200 total for the card and chassis :)
Hmm I was considering buying the Lattepanda Sigma for a project, but seeing it's a 13th gen mobile i5-1340P... err maybe not. It is a shame though, it's beefier than any ARM board and AMD doesn't seem to bother doing SBC integrations for some reason. I guess they hate money.
The N100 is great, hopefully not affected by this problem though I'm not sure yet. It sips power and I've really liked it for homelab use where memory and IO are more important than core count (because most of the times things are idle, but one wants to keep VMs in memory and oversub the cores).
Yeah that would be the Delta, but it's significantly slower (~6 times at multicore). The N100 is just a 9th gen Celeron after all. I'm more or less looking for a complete powerhouse in a smaller than ITX form factor for extremely compute intensive multithreaded stuff.
You can do a mini-pc with up to a Ryzen 9 8945HS. Which at 65-85W is a bit of a beast, as far as 8-cores goes on the sub-itx size. The 8(8|9)4(0|5)H(S) are all pretty good options though. Just got a Beelink SER8 (8845HS) for Chimera, and it's been running very nicely.
I believe this was what caused sudden system instabilities on my 13600kf. I even undervolted my chip (lite load 1) when I got it, things ran fine for years until just a few weeks ago when I started hard freezing. I ended up disabling XMP which "fixed" it.
So lucky I opted for an i7 13850 in my new thinkpad and instead put the cash towards the RTX 3500. Doing large language models on the go, on GPU... and on Qubes OS no doubt... simply amazing.