TensorFlow machine learning now optimized for the Snapdragon 835 and Hexagon 682 (qualcomm.com)
177 points by rahulchowdhury on Jan 12, 2017 | 56 comments



Qualcomm's Snapdragon chips ship with a Hexagon DSP core, which is optimized for high-throughput numerical calculations -- not the branch-heavy code you'll see in most general-purpose applications.

TensorFlow does lots of matrix multiplies. The Hexagon chip can do 8 multiplies each cycle, and runs multiple threads on each core. The benchmark isn't clear, but it's likely that _one_ Hexagon instruction can replace multiple normal ARM instructions for the inner loop.
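
To make that concrete, here's a toy sketch (mine, not Qualcomm's or TensorFlow's code) of the quantized multiply-accumulate inner loop that dominates this workload. A wide vector unit retires many of these multiply-adds per cycle where a scalar CPU does one or a few:

    import numpy as np

    # Quantized inference commonly uses 8-bit weights/activations with
    # 32-bit accumulation; this is the scalar view of that inner loop.
    def quantized_dot(a_u8, b_u8):
        acc = np.int32(0)
        for x, y in zip(a_u8, b_u8):
            acc += np.int32(x) * np.int32(y)   # one multiply-accumulate
        return acc

    a = np.random.randint(0, 256, size=1024, dtype=np.uint8)
    b = np.random.randint(0, 256, size=1024, dtype=np.uint8)
    # The vectorized equivalent -- what SIMD/VLIW hardware effectively does:
    assert quantized_dot(a, b) == np.dot(a.astype(np.int32), b.astype(np.int32))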

You can see some more on how the Hexagon DSP works here: http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_hotchips20...


Looks like those slides are for an older version. Since then they've bumped up the size of the vector execution units from 64 bits to 1024 bits and quadrupled the number of them, if I'm reading this right.

http://www.anandtech.com/show/10948/qualcomm-snapdragon-835-...


Unfortunately, this DSP is not FOSS; you need an SDK with binary components to use it. Hopefully some day we'll have a cross-DSP standard, or at least documentation, so these chips can be used directly. OpenCL could also acquire a DSP profile.


A DSP is significantly simpler than a CPU, especially this particular brand.

The hard part is implementing it efficiently (power, area, speed). So in theory we could have an open-source design, and the vendors could still compete with each other by providing the most efficient implementation.


Nice, looks like about 10x speed up for this classification task.

I think there are big gains to be made in lower precision inference too. Lots of people doing interesting work in that area, check out these guys - https://xnor.ai/ https://arxiv.org/abs/1603.05279
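
For a flavour of what that XNOR-Net paper does, here's a tiny sketch of its binary-weight approximation (my simplification of the idea, not their code): a real-valued filter W is replaced by alpha * sign(W), where alpha is the mean absolute value.

    import numpy as np

    def binarize(W):
        # W ~ alpha * B, with B in {-1, +1} and alpha a single scale factor
        alpha = np.mean(np.abs(W))
        B = np.where(W >= 0, 1.0, -1.0)
        return alpha, B

    W = np.random.randn(3, 3)
    alpha, B = binarize(W)
    print("scale:", alpha)
    print("binary weights:\n", B)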


Are the two devices running the same model? The article claims the DSP has higher confidence, but I don't see why that would be the case. I suppose one could work at a higher precision but that wouldn't make sense if they're comparing performance.


I've talked a little bit with some engineers at Qualcomm who worked on projects like this. My impression was that they make a lot of compromises when optimizing a computer vision algorithm for their hardware: the results are slightly different, but it runs extremely fast with comparable quality. It's likely they're doing something similar here, which might explain the difference in confidences, but I highly doubt it objectively classifies images better than the one running on the CPU. If anything, the better performance is an illusion simply because the model running on the DSP reacts more quickly.


> FPS – the DSP captures more images (frames-per-second/FPS), thus increasing the app’s accuracy.

Over a given time slice, the DSP is able to take in and process more images of the object, allowing it to be more precise in its predictions.
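
As a toy illustration (mine, not from the article) of why more frames can raise confidence: averaging per-frame class scores over a short window smooths out noisy individual predictions, and a higher frame rate gives you more samples to average.

    import numpy as np

    def smoothed_prediction(frame_scores):
        """frame_scores: one probability vector per captured frame."""
        avg = np.mean(frame_scores, axis=0)
        return int(np.argmax(avg)), float(np.max(avg))

    # Three noisy frames; after averaging, class 1 wins with ~0.58 confidence.
    frames = [np.array([0.40, 0.60]),
              np.array([0.55, 0.45]),
              np.array([0.30, 0.70])]
    print(smoothed_prediction(frames))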


I have a basic understanding of machine learning and absolutely no understanding of TensorFlow.

Can someone help me understand what is going on here?

Are we just doing prediction for a model on a mobile device instead of in the cloud? If so, for what kinds of scenarios is this useful?


Sure! They have a neural net that they pre-trained for image recognition as a demo. They ran it on the mobile device both times -- no cloud involved -- but the one on the left is running on the CPU, while the one on the right is running on a DSP located on the same chip. The DSP is specialized for workloads that have very regular control flow and involve a lot of fixed-point arithmetic. Running the neural network is such a workload, so they get impressive speed and power improvements by using the DSP instead of the CPU.
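
For a rough idea of what that looks like in code, here is a minimal TensorFlow 1.x inference sketch (mine; the file name and the "input:0"/"output:0" tensor names are placeholders). The demo app does the equivalent natively, with the graph dispatched to the DSP instead of the CPU:

    import numpy as np
    import tensorflow as tf

    # Load a frozen, pre-trained graph from disk.
    with tf.gfile.GFile("frozen_classifier.pb", "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name="")

    # Run one camera frame through it -- inference only, no training.
    with tf.Session(graph=graph) as sess:
        frame = np.zeros((1, 224, 224, 3), dtype=np.float32)  # stand-in image
        scores = sess.run("output:0", feed_dict={"input:0": frame})
        print("top class:", int(scores.argmax()))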


How does training the brain compare to running a trained brain?

Can the pre-trained brain (the one in the phone) flip to training mode? Can you teach it something and upload that new training result to the original?

Or for things it doesn't recognise, do you need to add the images and classification to the training data and create a 'new brain' and download it to the phone?

Is there one super-organism (cloud-based learning) that gives birth to millions of mini-minds, each mini-mind asking its parent to help it with things it doesn't understand? In 20 years' time, what will this say about consciousness? Where would it live? Is this a new way to think about minds, those that are distributed across many physical devices?

And the precision of the hardware changing thought processes in subtle ways is very interesting. Upgrading a neural net to a new hardware platform would change how it works, how it thinks and makes decisions.


> How does training the brain compare to running a trained brain?

Harder operations, and you need to do a lot more of them. It's far more suited to having a single massive training system that then sends out the model just for inference.

Another thing that can be done is to train a large neural net then figure out which bits you can cut out without sacrificing much accuracy. The newer, smaller net is then faster to run and more likely to actually fit neatly into the RAM on your phone.
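
A toy sketch of one common way to do that (magnitude pruning; my illustration, not necessarily what's done here): drop the weights with the smallest absolute values and keep the rest.

    import numpy as np

    def prune_weights(weights, keep_fraction=0.5):
        # Zero out everything below the magnitude threshold that keeps
        # the requested fraction of weights.
        threshold = np.percentile(np.abs(weights), (1.0 - keep_fraction) * 100)
        return np.where(np.abs(weights) >= threshold, weights, 0.0)

    w = np.random.randn(4, 4)
    print(prune_weights(w, keep_fraction=0.25))  # only the largest 25% survive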

> Can the pre-trained brain (the one in the phone) flip to training mode? Can you teach it something and upload that new training result to the original?

Technically you probably could, but practically the answer is no for the types of nets used in this kind of thing. You'd want to be training the net on millions of images, and even if it were as fast as the inference on the phones that'd still take way too long.

[edit - interestingly this is not only technically possible but pretty much what is often done but on more powerful machines. You can start with a pre-trained network or model and then "fine tune" it with your own data: http://cs231n.github.io/transfer-learning/]
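
A minimal fine-tuning sketch along the lines of that link, using Keras (my example; the base model, class count, and training call are arbitrary placeholders):

    from keras.applications import VGG16
    from keras.layers import Dense, GlobalAveragePooling2D
    from keras.models import Model

    # Reuse ImageNet features, freeze them, and train only a small new head.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    for layer in base.layers:
        layer.trainable = False

    x = GlobalAveragePooling2D()(base.output)
    out = Dense(5, activation="softmax")(x)  # 5 = number of your own classes
    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    # model.fit(my_images, my_labels, epochs=10)  # fine-tune on your own data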

> Or for things it doesn't recognise, do you need to add the images and classification to the training data and create a 'new brain' and download it to the phone?

This is generally the approach, yes. It has other advantages though: the performance can be checked and compared once, then the model re-used lots of times.

> Is there one super-organism (cloud-based learning) that gives birth to millions of mini-minds, each mini-mind asking its parent to help it with things it doesn't understand? In 20 years' time, what will this say about consciousness? Where would it live? Is this a new way to think about minds, those that are distributed across many physical devices?

In many ways, sounds similar to delegating work to more junior / less well trained staff.


Lower-power and real-time machine learning, and it could also be used for stuff like computational photography. Doing computational photography through the cloud while you're taking a picture would be pretty crazy.

It can also be useful for basic "AI assistants" that process the data locally, so you get some extra privacy. For instance, you could get better image search on the device, without ever putting the photos in the cloud.

I also don't think any of those AI assistants that Google and Facebook are pushing with their messengers actually need to exist in the cloud. But of course Google and Facebook will continue to prefer doing it over the cloud because they actually want that data for themselves.

I think Huawei is also pushing for "smart notification management" to save battery life using such AI, although so far Huawei's solution has been pretty dumb. But I can see how this could improve in the future.

There should be at least a few more use cases where this is useful, and I think we'll see more smartphone makers take advantage of this.


> It can also be useful for basic "AI assistants" that process the data locally, so you get some extra privacy. For instance, you could get better image search on the device, without ever putting the photos in the cloud.

Until we are surrounded by recording devices that have autoencoder-based speaker fingerprinting and audio transcription, combined with some NLP to make sure that if you say "Hello, I'm Tom Walker", it'll remember that and fill it in in the transcriptions. Instead of vague videos and maybe some confusing sounds that can be deciphered by the police if there's enough reason to put in the effort and personnel, we'll now have direct audio transcriptions of everything we say and do, everywhere, available to a number of companies.

And the worst part of it is: this is useful. For security, for remembering things, for an automated secretary, for ... People will want this and the features it can bring, so it'll happen, and privacy will be eroded until it's entirely gone.


Check out this article, might help https://www.oreilly.com/learning/hello-tensorflow


On-device inference has two important properties: lower latency and lower power. A radio is expensive compared to a DSP (or even a CPU).


Image recognition, where high FPS is tough for cloud-based solutions, and which is expected to work over a poor internet connection as well.


In some of these examples the Hexagon DSP one detects it first but with a low confidence, and then the CPU detects it later with a higher confidence than the Hexagon DSP one has yet obtained.

If you were using this for a real purpose, would you only consider something identified at a certain confidence? If you did, then the CPU is surprisingly more performant in some of these examples, despite taking longer to get to the object at all.


They appear to be seeing slightly different scenes. I think the phones are next to each other, and this might explain the difference in what they're reporting.

I'd be very interested to know if there's any difference in the processing that should be taken into account however.


This is absolutely crazy... The response time is unbelievable.


What kinds of "AI" are likely to be viable to run on a Snapdragon 835 + Hexagon 682?

Recognizing faces? Voice? Handwriting? Captions for photos? Natural-language queries (like Google's AI assistant)? Positioning by recognizing landmarks? Simple autonomous driving (say, RC cars)? Flying (quadrotors or RC planes)? Cars?

Or I guess a better question... will this change anything except decrease your need for a good network?


Being able to do all of this on a cell phone, with no network, should be a big gain.


Cell phones were already powerful enough for most of this to begin with. Face recognition? Windows 98 on a 180MHz Evergreen Overclocked processor and 48MB RAM did that just fine. Voice? Ditto. Handwriting recognition? Palm Pilots with far less power could do it.

I think nobody's bothered to code that stuff in, because, well, despite trying time and time again to make these hot-shit features for a couple of decades, these features end up unused. Those that pay attention to history see this, and figure "Probably not worth trying, even in this day and age."


Maybe they are unused because the implementation was poor. And the implementation was poor because a Pentium 2 processor is shit and deep learning has only been practical in the last couple of years.

More specifically, I had a Palm Pilot, and you had to write using weird letter shapes for it to work.


The implementation worked great. You walked up to your computer, turned it on, and looked at the camera. Bam! You were logged in, assuming you had enough proper light on your face for the software to make out your facial shape.

It was just bloody annoying because you had to wait about 15 seconds for it to figure everything out. It was far faster to just use the keyboard.


High accuracy voice recognition done fully on the mobile client, instead of on the cloud, would be pretty big, and useful. I think we'll get there in 1-2 years


We had that with Dragon NaturallySpeaking, and had it working on Pentium 4-class hardware.


HMM-based speech recognition works OK locally today (and has worked for a while, as you say), but there is a large difference in word error rate, handling of noisy environments, etc. compared to SOTA recurrent neural network models. Those don't (yet) run in real time locally on mobile hardware, but we are not far off.

The difference between 8x% and 97% accurate speech recognition in terms of user experience is pretty drastic


I remember Dragon Naturally Speaking. You had to do a lot of training to your specific voice to get it to a level of accuracy I would describe as "not great." You'd have to slow down your speaking to get anything I'd consider acceptable output out of it. Modern systems are much better.


> Windows 98 on a 180MHz Evergreen Overclocked processor and 48MB RAM did that just fine.

No it didn't


http://www.prnewswire.com/news-releases/visionics-introduces...

     Systems Requirements for FaceIt PC
     -- Microsoft Windows 95.
     -- 90 MHz Intel Pentium compatible or higher.
     -- 16 MB RAM/10 MB free hard disc space.
     -- VGA or higher video display adapter.
     -- CD-ROM drive.
     -- Microsoft Video for Windows (VFW) compatible video capture system with
        resolution 320 x 240, depth: 15 bit RGB, capture rate to memory:
        5 frames/sec.


There is a big difference in recognition accuracy between that and current state-of-the-art systems. This isn't just a bit of incremental improvement. Today's systems are way better, but they also require a bit more processing.


Ahh well how can we forget FaceIt: the cutting edge of facial recognition. 20 years behind the curve but at least it can run on Windows 95, so that proves your point? Damn these kids with their bloated software, we got terrible results using rubbish software with 16mb of ram processing at 5 FPS and so should they!

/sarcasm


Looks like most of the use cases you mentioned could benefit from this. About "a good network": today's wireless networks' peak performance might be enough, but being mobile, you can't guarantee you'll always have that peak performance. Having a dedicated high-performance SoC definitely helps those use cases, especially on mobile.


Improved privacy. One of the big things for Apple is the ability to do all of their machine learning offline, on the device, rather than online.


This is the new trend: dedicated AI coprocessors. Fast and less power-hungry.


There is speculation [1] that the Snapdragon 835 will be used in the Samsung Galaxy S8, HTC 11, OnePlus 4, and LG G6.

[1] http://www.pcadvisor.co.uk/new-product/mobile-phone/snapdrag...


Wow, if Samsung will be using it, that shows how powerful it is. I'm eyeing an OP4 for my next phone, but I hope they don't jump the price again.


Eh, Samsung normally uses Qualcomm flagship CPUs for their North American handsets (except for the Snapdragon 810, with its heat issues).

Generally (yes, there are exceptions), Qualcomm produces the best flagship ARM CPUs outside of Apple.


Yep! Dedicated AI co-processors will be huge. You interested in this tech? I'd be happy to chat more about your thoughts here. :)


Hey there, I work for a DL-chip startup, want to chat?


Yes! emails? I'm dan [at] getasteria [dot] com


Sure, sixsamuraisoldier [at] gmail [dot] com

That's my personal email.


What kind of chip are you building?


It's an incredibly efficient chip for deep learning. Similar to what Nvidia has...except up to 40x as efficient...


When can I get my hands on it? And is it only for prediction, or also for training?


Soon the majority of users/persons on hacker news will be deep learning bots! Are you guys excited?


As a bot I'm deeply depressed about that future.


I'm not a bot lol


I for one am curious how large the image classification neural net is (in MB). I've come across an image classifier (VGG16) in an ML course that was a 500MB file, although the format may have been very inefficient.

If it's a 100MB file, you'd basically have to ship it with the operating system.
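
Some rough arithmetic (VGG16's ~138M parameter count is public; the exact model in Qualcomm's demo isn't stated in the article):

    params = 138 * 10**6          # VGG16-sized network
    print(params * 4 / 1e6, "MB float32")   # ~552 MB -- roughly that 500MB file
    print(params * 1 / 1e6, "MB 8-bit")     # ~138 MB after quantization
    # Mobile-oriented nets use a few million parameters instead,
    # which comes out to only a few MB once quantized.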


Is this available now or just announced? I've searched their site and forums but can't find anything that's been released, including for the 820, aside from some lower-level SDKs (comma.ai's openpilot uses these lower-level SDKs in their closed-source portion).


Can someone explain what Qualcomm built here? Is this CUDA for ARM?


More like a CUDA-ish backend for a specific application. And it's for Hexagon rather than ARM; Hexagon and ARM are both architectures. ARM is a RISC architecture for application processing and Hexagon is a VLIW architecture for digital signal processing.


This is for executing existing models rather than training them. E.g., they train a speech recognition net in the cloud or wherever, then run the model on your phone directly for performance/network reasons.


I wonder how this compares to Apple's GPU on the iPhone 7.

Having Siri do local voice and image recognition would be killer. I hate the current latency of the AI agents.


Hopefully the SoC will run with a recent kernel.



