
Is this memory corruption you speak of silent, or simply fatal?

This could be a significant problem if the workload requires some form of integrity, since the hardware could be quietly introducing errors into otherwise normal-looking computations.

I remember having this issue with overclocked AMD cards used for mining too, where it was common to undervolt the card or overclock the memory. I wonder if any of those tuning tools work here, and if it would be possible to underclock the memory to increase its stability.

Either way, this echoes some of the sentiment I generally had around hardware intended for mining, including the Bitcoin-branded 2000-watt power supplies built with bottom-of-the-barrel parts. Most hardware built for mining was built with exactly one purpose in mind, and has significant warts when someone tries to repurpose it. The constraints and requirements that cryptomining presents are really quite different from those of most modern IT systems.




Silent. It'll be things like you can't ssh into the box any more, or you log in and can't reboot it. Likely due to Ethash mining, which is heavily RAM-bound, combined with the voltage and clock settings. Luckily, it is easy to change those settings to get more stability. I have a process that auto-tunes the machines for known instabilities... but the weird silent RAM corruption ones are much harder to detect.

You're totally right that mining hardware was mostly single-purpose, especially at large scale. Those PSUs did the job, but yes, in general they were hand-soldered in China and prone to doing weird things.

It certainly puts a damper on what can be done with it now that the merge has happened, but I'd like to keep trying to find uses!
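
One way to look for the silent kind of corruption described above is a write-then-verify pattern test. The sketch below is only an illustration and is not the auto-tuning process mentioned in this comment: it fills a buffer with a seeded pseudo-random pattern, later regenerates the same pattern, and reports any byte offsets that differ. The names `fill_pattern` and `find_corruption` are hypothetical, and the test only touches memory that the Python process can allocate from system RAM, so a clean pass does not prove the machine is stable.

```python
import random


def fill_pattern(size_bytes: int, seed: int) -> bytearray:
    """Return a buffer filled with a reproducible pseudo-random pattern."""
    rng = random.Random(seed)
    # getrandbits + to_bytes yields size_bytes deterministic bytes for a given seed.
    return bytearray(rng.getrandbits(8 * size_bytes).to_bytes(size_bytes, "little"))


def find_corruption(buf: bytearray, seed: int) -> list[int]:
    """Compare buf against a freshly regenerated pattern.

    Returns the byte offsets that no longer match (empty list if intact).
    """
    reference = fill_pattern(len(buf), seed)
    return [i for i, (got, want) in enumerate(zip(buf, reference)) if got != want]


if __name__ == "__main__":
    SEED = 1234                           # assumption: any fixed seed works
    buf = fill_pattern(1 << 20, SEED)     # 1 MiB test buffer (arbitrary size)
    bad = find_corruption(buf, SEED)
    print("mismatched offsets:", bad)
```

In practice a test like this would be run with a much larger buffer and repeated over time, since intermittent bit flips may not show up on a single pass.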


I wonder if these have any chance of running TensorFlow or other ML applications. The problem would again be that there is no local storage, so the 4GB Stable Diffusion model might be a bit much, but once loaded, it may work well for that kind of non-critical application.

I think one of the reasons GPU memory corruption may cause the system to freeze is that the GPU and main memory are unified on APUs, which would probably explain why the machines are sometimes difficult to log in to or use.


It is effectively this GPU with RDNA1: https://en.wikipedia.org/wiki/Radeon_RX_5000_series

Yes, shared memory is definitely the cause.



