Movidius kind of did, as a part of Intel. They sold dozens of their sticks, maybe even hundreds. By EOY, “good enough” built-in, power-efficient acceleration will be available on sub-$100 dev boards. So they have to raise while they still can.
Please correct me if I am wrong, but Movidius also supplied Vision Processing Units to Tesla. They terminated their contract with Tesla over some tussle whose details I'm not aware of.
There’s no real stack without an NDA if you want to write code for the device itself. Clang does support the SHAVE cores they use, but I don’t know what the tooling actually looks like. Probably much like any other cross-compilation toolchain.
If you just want to use it, there’s a Python SDK you can use to run a subset of the models that popular frameworks can train. In that case it’s pretty easy to get going. I believe there’s also a C++ integration of some sort, but I haven’t used it myself.
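For reference, a minimal sketch of what using the stick looks like, assuming the NCSDK v1 Python bindings (mvnc) and a model blob precompiled with the SDK's compiler; the "graph" filename and the dummy input are placeholders, and exact names may differ between SDK releases:

    # Minimal NCSDK-style inference sketch; API names assumed from the
    # v1 Python bindings (mvnc) and may vary between SDK versions.
    import numpy as np
    from mvnc import mvncapi as mvnc

    devices = mvnc.EnumerateDevices()      # find attached Movidius sticks
    device = mvnc.Device(devices[0])
    device.OpenDevice()

    with open("graph", "rb") as f:         # "graph" = precompiled model blob
        graph = device.AllocateGraph(f.read())

    # Stand-in for a real preprocessed camera frame.
    frame = np.random.rand(224, 224, 3).astype(np.float16)
    graph.LoadTensor(frame, "user object")
    output, _ = graph.GetResult()
    print("top class:", int(np.argmax(output)))

    graph.DeallocateGraph()
    device.CloseDevice()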
It’s all moot, though, because properly programmed, recent ARM CPUs are fast enough to run usable models such as MobileNet and ShuffleNet with good performance. The problem is that at the moment this draws anywhere between 4 and 12 W when things are running full blast, at least until throttling kicks in. Movidius is more economical in terms of joules per inference, but it also costs quite a bit, and it’s not built in, so market penetration is non-existent.
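To make the CPU-side claim concrete, here is a minimal, CPU-only MobileNet timing sketch, assuming the TensorFlow Lite Python interpreter and a placeholder model filename; the comment above doesn't say which runtime was used, so treat this purely as an illustration:

    # CPU-only MobileNet inference timing sketch using tf.lite.Interpreter.
    # "mobilenet_v1_1.0_224.tflite" is a placeholder model path.
    import time
    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="mobilenet_v1_1.0_224.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    # Dummy 224x224 RGB frame; a real input would come from a camera pipeline.
    frame = np.random.rand(1, 224, 224, 3).astype(np.float32)
    interpreter.set_tensor(inp["index"], frame)

    start = time.time()
    interpreter.invoke()
    print("inference took %.1f ms" % ((time.time() - start) * 1000))
    print("top class:", int(np.argmax(interpreter.get_tensor(out["index"]))))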
As I mentioned, though, this should start being less of an issue with the advent of cheap, built-in tensor acceleration IP.
You should take a look at their architecture diagram. There’s a lot of meat on the bones there besides DSP, particularly when you look at their memory setup and consider how much bandwidth it can provide simultaneously. And let us not forget that convolution is essentially a DSP operation.
I was, frankly, surprised by the technical depth of what they were able to put together.
I have no experience with AI, but is this application very different from LTC or ETH FPGA mining? Do I have it right that an AI chip needs plenty of memory to hold coefficients, manipulate the data, fetch the next set of coefficients, and repeat this many times, with memory bandwidth as the bottleneck? Or can it be parallelized across multiple FPGAs handling different layers, each with its own storage for coefficients?
But it's still all von Neumann architecture. We need a clean break at the hardware level to really get somewhere on this, IMHO. I think reservoir computing is the way to go, with a specially designed substrate on silicon acting as the reservoir. That's where I'll be investing if I get the chance.