Qualcomm's Snapdragon chips ship with a Hexagon DSP core, which is optimized for high-throughput numerical calculations -- not the branch-heavy code you'll see in most general-purpose applications.
TensorFlow does lots of matrix multiplies. The Hexagon chip can do 8 multiplies each cycle, and runs multiple threads on each core. The benchmark isn't clear, but it's likely that _one_ Hexagon instruction can replace multiple normal ARM instructions for the inner loop.
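Roughly, the speedup comes from doing the inner loop's multiply-accumulates many lanes at a time instead of one per instruction, on small integers rather than floats. A toy sketch of that inner loop (illustrative only; the real Hexagon kernels are hand-written vector intrinsics, and the 8-bit/32-bit widths here are my assumption):

    import numpy as np

    def scalar_dot(a, b):
        # One multiply-accumulate per iteration: what a plain CPU loop does.
        acc = np.int32(0)
        for x, y in zip(a, b):
            acc += np.int32(x) * np.int32(y)
        return acc

    def vector_dot(a, b):
        # Whole-vector multiply-accumulate: what wide SIMD/vector units do,
        # many lanes per instruction.
        return np.dot(a.astype(np.int32), b.astype(np.int32))

    a = np.random.randint(-128, 127, size=1024, dtype=np.int8)
    b = np.random.randint(-128, 127, size=1024, dtype=np.int8)
    assert scalar_dot(a, b) == vector_dot(a, b)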
Looks like those slides are for an older version. Since then they've bumped up the size of the vector execution units from 64 bits to 1024 bits and quadrupled the number of them, if I'm reading this right.
Unfortunately, this DSP is not FOSS; you need an SDK with binary components to use it. Hopefully some day we'll have a cross-DSP standard, or at least documentation, so these chips can be used directly. OpenCL could also acquire a DSP profile.
A DSP is significantly simpler than a CPU, especially this particular brand.
The hard part is implementing it efficiently (power, area, speed). So in theory we could have an open source design, and vendors could still compete with each other by providing the most efficient implementation.
Nice, looks like about a 10x speedup for this classification task.
I think there are big gains to be made in lower-precision inference too. Lots of people doing interesting work in that area, check out these guys: https://xnor.ai/ and https://arxiv.org/abs/1603.05279
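The core trick in that paper is approximating the real-valued weights with a single scale factor times their signs, so most multiplies turn into additions/subtractions (or XNORs on binary inputs). A toy version, with the paper's per-filter scaling simplified to one scale per matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 128))   # full-precision weights
    x = rng.normal(size=128)         # one input activation vector

    alpha = np.abs(W).mean()         # scale factor (the paper uses one per filter)
    B = np.sign(W)                   # binary {-1, +1} weights

    full = W @ x
    binary = alpha * (B @ x)         # cheap: sign flips plus one scale

    print("relative error:", np.linalg.norm(full - binary) / np.linalg.norm(full))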
Are the two devices running the same model? The article claims the DSP has higher confidence, but I don't see why that would be the case. I suppose one could work at a higher precision but that wouldn't make sense if they're comparing performance.
I've talked a little bit with some engineers at Qualcomm who worked on projects like this. My impression was that they make a lot of compromises when they optimize a computer vision algorithm for their hardware, which slightly alters the results but lets it run extremely fast with comparable quality. It's likely they're doing something similar here, which might explain the difference in confidences, but I highly doubt that it objectively classifies images better than the one running on the CPU. If anything, the apparently better classification is an illusion created by the DSP model simply reacting more quickly.
Sure! They have a neural net that they pre-trained for image recognition as a demo. They ran it on the mobile device both times -- no cloud involved -- but the one on the left is running on the CPU, while the one on the right is running on a DSP located on the same chip. The DSP is specialized for workloads that have very regular control flow and involve a lot of fixed-point arithmetic. Running the neural network is such a workload, so they get impressive speed and power improvements by using the DSP instead of the CPU.
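If "fixed-point arithmetic" is unfamiliar: reals get stored as scaled integers, so a multiply becomes an integer multiply plus a shift, which is exactly the kind of operation DSP hardware is built around. A tiny sketch with made-up Q15 helpers:

    Q = 15                      # fractional bits

    def to_q15(x):
        return int(round(x * (1 << Q)))

    def q15_mul(a, b):
        # integer multiply + shift instead of a floating-point multiply
        return (a * b) >> Q

    a, b = 0.75, -0.5
    print(q15_mul(to_q15(a), to_q15(b)) / (1 << Q))   # ~ -0.375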
How does training the brain compare to running a trained brain?
Can the pre-trained brain (the one in the phone) flip to training mode? Can you teach it something and upload that new training result to the original?
Or for things it doesn't recognise, do you need to add the images and classification to the training data and create a 'new brain' and download it to the phone?
Is there one super organism (cloud-based learning) that gives birth to millions of mini-minds? Each mini-mind asking its parent to help it with things it doesn't understand. In 20 years' time what will this say about consciousness? Where would it live? Is this a new way to think about minds, those that are distributed across many physical devices?
And the precision of the hardware changing thought processes in subtle ways is very interesting. Upgrading a neural net to a new hardware platform would change how it works, how it thinks and makes decisions.
> How does training the brain compare to running a trained brain?
Harder operations, and you need to do a lot more of them. Far more suited to having a single massive training system that then sends out the trained model just for inference.
Another thing that can be done is to train a large neural net then figure out which bits you can cut out without sacrificing much accuracy. The newer, smaller net is then faster to run and more likely to actually fit neatly into the RAM on your phone.
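A minimal sketch of that idea (magnitude pruning: zero out the smallest weights and keep only the largest fraction; the sizes and keep fraction below are made up, and real pipelines usually retrain afterwards to recover accuracy):

    import numpy as np

    def prune(weights, keep_fraction=0.3):
        threshold = np.quantile(np.abs(weights), 1.0 - keep_fraction)
        mask = np.abs(weights) >= threshold
        return weights * mask, mask

    rng = np.random.default_rng(1)
    W = rng.normal(size=(256, 256))
    W_pruned, mask = prune(W, keep_fraction=0.3)
    print("fraction of weights kept:", mask.mean())   # ~0.3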
> Can the pre-trained brain (the one in the phone) flip to training mode? Can you teach it something and upload that new training result to the original?
Technically you probably could, but practically the answer is no for the types of nets used in this kind of thing. You'd want to be training the net on millions of images, and even if it were as fast as the inference on the phones that'd still take way too long.
[edit: interestingly, this is not only technically possible but pretty much what is often done, just on more powerful machines. You can start with a pre-trained network or model and then "fine tune" it with your own data: http://cs231n.github.io/transfer-learning/]
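A rough sketch of that fine-tuning recipe, using Keras as my own choice of framework (nothing from the thread): freeze the pre-trained feature layers and retrain only a small new classifier head on your own labelled images.

    import numpy as np
    from tensorflow.keras import Model, layers
    from tensorflow.keras.applications import VGG16

    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False                           # keep pre-trained features fixed

    x = layers.GlobalAveragePooling2D()(base.output)
    out = layers.Dense(5, activation="softmax")(x)   # 5 = however many classes you have
    model = Model(base.input, out)

    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Placeholder data standing in for your own labelled images.
    images = np.random.rand(8, 224, 224, 3).astype("float32")
    labels = np.random.randint(0, 5, size=8)
    model.fit(images, labels, epochs=1)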
> Or for things it doesn't recognise, do you need to add the images and classification to the training data and create a 'new brain' and download it to the phone?
This is generally the approach, yes. It has other advantages though: the performance can be checked and compared once, then the trained net can be re-used lots of times.
> Is there one super organism (cloud-based learning) that gives birth to millions of mini-minds? Each mini-mind asking its parent to help it with things it doesn't understand. In 20 years' time what will this say about consciousness? Where would it live? Is this a new way to think about minds, those that are distributed across many physical devices?
In many ways, sounds similar to delegating work to more junior / less well trained staff.
Lower-power and real-time machine learning, and it could also be used for stuff like computational photography. Doing computational photography through the cloud while you're taking a picture would be pretty crazy.
It can also be useful for basic "AI assistants" that process the data locally, so you get some extra privacy. For instance, you could get better image search on the device, without ever putting the photos in the cloud.
I also don't think any of those AI assistants that Google and Facebook are pushing with their messengers actually need to exist in the cloud. But of course Google and Facebook will continue to prefer doing it over the cloud because they actually want that data for themselves.
I think Huawei is also pushing for "smart notification management" to save battery life using such AI, although so far Huawei's solution has been pretty dumb. But I can see how this could improve in the future.
There should be at least a few more use cases where this is useful, and I think we'll see more smartphone makers take advantage of this.
> It can also be useful for basic "AI assistants" that process the data locally, so you get some extra privacy. For instance, you could get better image search on the device, without ever putting the photos in the cloud.
Until we are surrounded by recording devices that have autoencoder-based speaker fingerprinting and audio transcription, combined with some NLP to make sure that if you say "Hello, I'm Tom Walker", it'll remember and fill that in in the transcriptions. Instead of vague videos and maybe some confusing sounds that can be deciphered by the police if there's enough reason to put in the effort and personnel, we'll have direct audio transcriptions of everything we say and do, everywhere, available to a number of companies.
And the worst part of it is: this is useful. For security, for remembering things, for an automated secretary, for ... People will want this, and the features it can bring, so it'll happen, and privacy will be eroded until it's entirely gone.
In some of these examples the Hexagon DSP one detects it first but with a low confidence, and then the CPU detects it later with a higher confidence than the Hexagon DSP one has yet obtained.
If you were using this for a real purpose, would you only consider the object identified at a certain confidence? If you did, then in some of these examples the CPU one is surprisingly more performant, despite taking longer to pick up the object at all.
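For what it's worth, a hypothetical sketch of that thresholding; the labels, timings and confidences below are made up, not taken from the demo:

    def first_accepted(frames, threshold=0.8):
        # frames: iterable of (timestamp_ms, label, confidence) tuples
        for t, label, conf in frames:
            if conf >= threshold:
                return t, label, conf
        return None

    cpu_frames = [(900, "keyboard", 0.55), (1400, "keyboard", 0.92)]
    dsp_frames = [(120, "keyboard", 0.60), (400, "keyboard", 0.75)]

    print("CPU accepts at:", first_accepted(cpu_frames))   # later, but crosses 0.8
    print("DSP accepts at:", first_accepted(dsp_frames))   # never crosses 0.8 here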
They appear to be seeing slightly different scenes. I think the phones are next to each other, and this might explain the difference in what they're reporting.
I'd be very interested to know if there's any difference in the processing that should be taken into account, however.
Cell phones were already powerful enough for most of this to begin with. Face recognition? Windows 98 on a 180MHz Evergreen Overclocked processor and 48MB RAM did that just fine. Voice? Ditto. Handwriting recognition? Palm Pilots with far less power could do it.
I think nobody's bothered to code that stuff in, because, well, despite trying time and time again to make these hot-shit features for a couple of decades, these features end up unused. Those that pay attention to history see this, and figure "Probably not worth trying, even in this day and age."
Maybe they are unused because the implementation was poor. And the implementation was poor because a Pentium 2 processor is shit and deep learning has only been practical in the last couple of years.
More specifically, I had a Palm Pilot, and you had to write using weird letter shapes for it to work.
The implementation worked great. You walked up to your computer, turned it on, and looked at the camera. Bam! You were logged in, assuming you had enough proper light on your face for the software to make out your facial shape.
It was just bloody annoying because you had to wait about 15 seconds for it to figure everything out. It was far faster to just use the keyboard.
High accuracy voice recognition done fully on the mobile client, instead of in the cloud, would be pretty big, and useful. I think we'll get there in 1-2 years.
HMM based speech recognition works ok locally today (and has worked for a while as you say) but there is a large difference in word error rate, handling noisy environments etc when compared to SOTA recurrent neural network models. Those don't (yet) run realtime locally on mobile hardware, but we are not far off.
The difference between 8x% and 97% accurate speech recognition in terms of user experience is pretty drastic.
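Rough back-of-the-envelope, treating accuracy as a per-word error rate:

    # Expected wrong words in a 20-word utterance at different accuracies.
    for accuracy in (0.85, 0.97):
        errors = 20 * (1 - accuracy)
        print(f"{accuracy:.0%} accurate -> ~{errors:.1f} wrong words per 20-word sentence")

At 85% you're fixing a few words in every sentence you dictate; at 97% most sentences come out clean or nearly so.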
I remember Dragon Naturally Speaking. You had to do a lot of training to your specific voice to get it to a level of accuracy I would describe as "not great." You'd have to slow down your speaking to get anything I'd consider acceptable output out of it. Modern systems are much better.
Systems Requirements for FaceIt PC
-- Microsoft Windows 95.
-- 90 MHz Intel Pentium compatible or higher.
-- 16 MB RAM/10 MB free hard disc space.
-- VGA or higher video display adapter.
-- CD-ROM drive.
-- Microsoft Video for Windows (VFW) compatible video capture system with resolution 320 x 240, depth: 15 bit RGB, capture rate to memory: 5 frames/sec.
There is a big difference in recognition accuracy between that and current state-of-the-art systems. This isn't just a bit of incremental improvement. Today's systems are way better, but they also require a bit more processing.
Ahh well, how can we forget FaceIt: the cutting edge of facial recognition. 20 years behind the curve, but at least it can run on Windows 95, so that proves your point? Damn these kids with their bloated software, we got terrible results using rubbish software with 16MB of RAM processing at 5 FPS and so should they!
Looks like most of the use cases you mentioned could benefit from this? As for "good network": today's wireless networks' peak performance might be enough, but being mobile, you can't guarantee you'll always have that peak performance. Having dedicated high-performance silicon on the SoC definitely helps those use cases, especially while you're mobile.
I for one am curious how large the image classification neural net is (in MB). I've come across an image classifier (VGG16) in some ML course that was a 500MB file, although the format may have been very inefficient.
If it's a 100MB file, you'd basically have to ship it with the operating system.
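Quick estimate for the curious: VGG16 has roughly 138 million parameters, so plain 32-bit floats land right around that 500MB figure, and quantizing the weights to 8 bits would shrink it about 4x:

    params = 138_000_000          # approximate VGG16 parameter count
    for bytes_per_weight in (4, 1):
        size_mb = params * bytes_per_weight / 1e6
        print(f"{bytes_per_weight} byte(s)/weight -> ~{size_mb:.0f} MB")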
Is this available now or just announced? I've searched their site and forums but can't find anything that's been released, including for the 820, aside from some lower-level SDKs (comma.ai's openpilot uses those in its closed-source portion).
More like a CUDA-ish backend for a specific application, and for Hexagon rather than ARM. Hexagon and ARM are both architectures: ARM is a RISC design for application processing, and Hexagon is a VLIW design for digital signal processing.
This is for executing existing models rather than training them. E.g., they train a speech recognition net in the cloud or wherever, then run the model on your phone directly for performance/network reasons.
You can see some more on how the Hexagon DSP works here: http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_hotchips20...