Andrew Ng predicts half of web searches will soon be speech and images (venturebeat.com)
60 points by thousandx on Sept 20, 2014 | 46 comments



Summary: someone who is working on a product for searching by speech and images issues a press release predicting that people will soon be doing half their searches by speech and images.


Well, Andrew Ng could work on pretty much anything he wanted to in search. It's not as if he's a random product manager assigned to this job. He picked this. He put his money where his mouth is. I'd think that would give him more credibility on the claim, not less.


Right. It's reasonable to question a claim if someone has a vested interest in the outcome, but Baidu's future is looking pretty secure no matter what interface people prefer for their searches. Researchers definitely have biases, but I would much rather hear an educated opinion from a researcher who has chosen to dedicate time and thought to a subject than from almost anybody else.


Came here for this. (Not saying Ng is wrong though.)


tooshay


did you mean touché?


This idea of searching by speech or images seems ... uninspired. Claiming the interface will change and we'll search by speech or images is a linear progression from text searching.

How about framing the problem as proactive vs. reactive search? Piece together enough fragmented data about me to know what song I have stuck in my head but don't know the name of; recognize that my email contains a course syllabus and auto-populate my calendar with assignments and study times; a million other proactive tasks, all done with me in mind.

Geez, these guys are talking about changing the interface... try to get rid of the interface altogether!


Your ideas seem like massive invasions of privacy. I get that we're trending towards a loss of it (it may be inevitable). But yeah, I can get my own sandwich while you get yours auto-delivered to you seconds before you wanted it.


That's what makes Google Now so cool. It's not super advanced (yet?), but in many circumstances it's quite good at automatically giving you the information you need.


No, Google Now is not cool. I already have my mom nagging me. I don't need my phone to chime in.


Let's consider only "safe for work" Internet content:

Google/Bing have done well with keyword/phrase searching with results sorted by popularity and date.

Ng seems to be thinking that the big change will be having speech and images coming from users as their input to the search process. My guess is that this will not be very important. I do guess, though, that search for Internet content that itself consists of speech and images will become more important.

Yes, likely Ng can get a lot of pictures of what he knows to be, say, Ferraris, use some of them as the training set for a neural network to identify Ferraris, and test the training with the rest of the pictures. Okay. Maybe his neural network will be able to identify Ferraris. So, he could repeat this training for, say, 100,000 objects -- Fords, bread, airplanes, jewelry, Victorian houses, .... Maybe there will be some value there.
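
To make that train-and-hold-out workflow concrete, here's a minimal sketch in Python with scikit-learn. Random vectors stand in for real image features, and the labels are made up; everything here is illustrative, not Ng's actual pipeline:

    # Sketch of the train/test split described above. Stand-in data:
    # random "feature vectors" and random labels (1 = "Ferrari").
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 256))   # stand-in image feature vectors
    y = rng.integers(0, 2, size=1000)  # stand-in labels

    # Hold out a quarter of the labeled pictures to test the training.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))

With random features the held-out accuracy hovers around chance, which is the point: the scaffolding is trivial, and all the value is in the features and the labeled pictures.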

My view is that the future of Internet search is quite different.


Words are an incredibly efficient mechanism for communicating with Google. Google image search is great for some things: you find a picture of a sculpture on Tumblr and it's not credited, so you use the photo as the input to find the source. But taking photos of a Ferrari seems like a dreadfully inefficient way to make a search. Maybe identifying flowers would be a better example; most cars have the make and model written in very clear letters on the back.


As I understand it, you have described what Google does with words and pictures well.

Ng wants neural networks to identify things, maybe Ferraris. Then he wants a search user to send a picture as their input for the search they want to do. So, then the user might be able to find more pictures of Ferraris. Maybe.

For what Ng is doing, I doubt that flowers would work because there are far too many varieties of flowers that look too different from one another.

Search by keywords/phrases by Google/Bing has worked very well for a huge collection of Internet content.

But I am guessing that in a sense Ng is correct about images and sounds -- there stands to be a lot more such content on the Internet in the future.

Generally I'm guessing that there is also a huge collection of Internet content, searches people want to do, and results people want to find for which search via keywords/phrases, as in Google/Bing, ranges from poor down to useless. Thus, my guess is that a new means of search is needed.

What do you think?


> For what Ng is doing, I doubt that flowers would work because there are far too many varieties of flowers that look too different from one another.

Only if you see it as a supervised learning problem. An alternative is to find the nearest matches, after which you can let human intelligence make the final visual match. Often, the webpage containing the matched image will have enough context to identify the object being searched for.
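
A rough sketch of that retrieval approach, assuming each indexed image has already been reduced to a feature vector (the vectors below are random stand-ins, and scikit-learn is just one convenient way to do the lookup):

    # Nearest-match image retrieval: no per-object classifier, just
    # "find the indexed images whose feature vectors are closest".
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(1)
    index_vectors = rng.normal(size=(10000, 128))  # stand-in embeddings of indexed images
    query = rng.normal(size=(1, 128))              # stand-in embedding of the query photo

    nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(index_vectors)
    distances, ids = nn.kneighbors(query)
    print("closest indexed images:", ids[0])  # a human makes the final call

The pages those five images came from would then supply the context for identifying the object, as described above.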


Defining what counts as a near match is itself an issue.


Are you suggesting that there is a distinction between "searches people want to do" and "results they want to find" (besides the purely functional one of search -> result)? E.g., are there search results people want even though they don't know they want them? Unknown unknowns that people will retrospectively be grateful for?


I'm guessing the workflow would be something like "I want intermittent rotation", except probably not expressed that literately. Maybe a sketch, or a lot of babble along those lines, gets searched on, whatever the input format. If it works, maybe you find a Geneva mechanism.

If you have enough domain experience to search for "continuous rotation rotary intermittent rotary" then you'll find it, but if you are mechanically illiterate you may not know what "intermittent" or "rotary" mean... maybe.

It would be a truly amazing display of AI to be given a really poor sketch of a Geneva mechanism and find a really nice blueprint. I'd be impressed if this could find the hypoid gear in a differential. I'd be more impressed if someone who doesn't understand the concept of, or reason for, a Torsen differential was nonetheless able to search for it.

As a concrete example, there's a pretty impressive Lego Torsen(-ish) diff out there. It's easy to find if you Google for the terms. I'd be impressed if you could give a sketch to a search engine and find this Lego diff.


I'm not trying to compete with Google/Bing where their keyword/phrase matching, page-rank popularity, and, say, date sorting work well. And for a huge pile of Internet content, searches, and results, they do work well.

For the searches you mention, I believe that for maybe one of them it could be possible to improve on Google/Bing, but I don't believe that real AI would be needed. Describing how to build such a search engine might take more than the 10,000 character limit on HN posts!


Yes, there can be some differences. E.g., a user might search for something that does not exist. So, they can do the search, that is, make the attempt, but not find the results.

Or: there's a lot of content on the Internet; a lot of people want a lot of it; to get it, they want to do searches that promise to be able to find that content. Nothing more obscure than that.


I get why someone would think that speech is going to be more widely used: the computing world is moving to devices - phones and tablets - that are awkward to type on. Wearables (if they ever become a big thing) won't usually have a keyboard at all. I don't really see why picture searches would be that big.


I think it depends on how we define what a 'search' is exactly. When you wear a future Google Glass and it discovers that you're drinking an iced coffee instead of a regular coffee, did it do that by searching? If so, did it do 100 'searches' to discover this fact, where a search is really an image search of static frames from your vision?

Or is a 'search' something that a person dictates to a computer to answer a question?


I can see this happening in a couple of years, especially for images, since the world uploads thousands or even millions of images online. Then there is all the recognition software, which is not limited to people but extends to places and things.


The problem with speech search is that talking can be disruptive.

But if they could do accurate speech recognition with a very quiet whisper (maybe using lip reading technology too) then I could see it completely dominating text searches in usage.


Yeah, I kinda agree. It would be cool if our smartphones could read lips.


I'd get excited about image search technology if it could identify components of pictures (e.g. a red house, an angry cat, etc.). This means that when I search for "happy family at beach" the search would actually look at the db of pics to find a) a family, b) a beach, and c) indications of happiness. That'd drastically cut my image search time down.

Text tags on images have a limit to the accuracy of the results.
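
As a toy sketch of that component idea: assume a detector has already assigned component tags to each stored picture, and the query is decomposed into required components (the filenames and tags below are made up):

    # Compositional image search as tag-set intersection.
    photo_tags = {
        "img1.jpg": {"family", "beach", "smiling"},
        "img2.jpg": {"cat", "sofa"},
        "img3.jpg": {"family", "beach"},
    }
    query = {"family", "beach", "smiling"}  # "happy family at beach"
    hits = [img for img, tags in photo_tags.items() if query <= tags]
    print(hits)  # -> ['img1.jpg']

The hard part, of course, is the detector that produces those tags reliably, not the lookup.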


I think that Andrew is making an easy prediction. Of course most search queries will be over voice or image.

My wife and I usually use speech input for web search on our phones. Also, the current Google image search is very nice. Have you tried using it? Go to Google image search and drag a photo to the text input field. I have used this to identify pictures of small mechanical parts and also to identify plants.


A lot of skepticism here, but I can say that children under 5 (who don't read or write yet) are the main target for this kind of search.


Exactly! More importantly, those who can't spell yet.

A year ago I showed my kid (then 5) Google voice search on an iPad. Her eyes lit up and off she went, on YouTube and later on Google search itself... First obscure animals (obscure to me, but apparently mentioned on 'Wild Kratts' and other animal shows), then stuff she heard about on 'Magic School Bus' and Brainpop.

Judging from what she's been showing me, it works for song titles and lyrics too, and she loves the voice search interface on our Amazon FireTV as well.

As a result, I find myself using it more often. I love it on the FireTV. If I had a single button for picture search on my Sammy, like the Amazon Fire Phone has, I'd use that too.

I will show her picture search tomorrow. Can't believe I forgot that.

Edit: after reading the article, one more important group comes to mind: those for whom English is not their first language. Calling my parents tomorrow.


So after a child under 5 grows up, why would they switch to a more antiquated form of search? They'll have a level of expectation that voice search works, and when it doesn't, the software sucks.


Why more antiquated?


Speech I can understand, but how does he figure pictures will be a huge way to search? There aren't that many times when I've said to myself, "I wish I could just take a picture to search for similar items."


If you think about wearables there's a more compelling case. If I walk past a car I like I can look at it and get specs, price, who has it for sale, etc. Or if I'm at the grocery store and wow, those tomatoes look amazing, I see recipes. While it seems weird to search by image right now, it will probably feel more natural once it's seamless; it certainly removes friction from the experience.


The thing is, I don't think those are really image searches. If I take a picture of a car, it's probably a car I can't afford, so I'm not really looking for a dealer; I'm probably looking for either 1) what's the make/model, 2) where have I seen that before, or 3) what year is it?

For produce there are even more things I might be interested in. Is that a good tomato, or what is that deformed-looking red plant?

Essentially I'm saying that image searches have just never been useful for me without supplying textual context. Speech has the potential to help with that but just doesn't seem to work well in consumer devices yet. Speech will get there but image searches just aren't useful without speech or text.


> While it seems weird to search by image right now, it will probably feel more natural once it's seamless;

Especially when we become so "plugged in" that I can search for what I see. Right now, if I want to see "tomato recipes" I'm probably not inclined to pull my phone out, take a picture (and who wants that on their stream anyway?), and then pair the picture with "recipes". I'd rather just type "tomato recipes". But if I can search what I see and speak, e.g. "oh look, tomatoes? Find recipes!", that's much easier.


Or you could just type or say "honda civic" - is that really so hard?


A prediction is about what people will do, not what they should do. Taking that into account, is your objection relevant?


Things like taking a picture of a product and being able to order it straight away from Amazon (a big push behind the Fire Phone).

And then at some point, it'll become a live-streaming thing, i.e. instead of actually flipping out your phone and taking a picture, it'll be some Google Glass type thing. You walk past a store, you look at it, and it's recognized instantly. You can then get the wiki page, the floor plan from government records, its online store, and its opening times from various online repositories of such information. Of course much of this is already possible using GPS; it's just an example.


Maybe because you can't. But what if you could? I've frequently wanted to be able to take pictures of concert posters and have my phone prompt me to add the event to my calendar, or just buy tickets. Or maybe it's a picture of something you want to know more about and have no way of choosing a search term for; you'd need to pick a category and then filter through results yourself.

This way you wouldn't have to do that.


> I've frequently wanted to be able to take pictures of concert posters and have my phone prompt me to add the event to my calendar, or just buy tickets

I bet you've already done something like this with QR Code enabled posters.


As a sibling comment tersely stated, I have to think that such a shift to image searches will be dominated by pornography. Remember that porn accounts for an enormous fraction of web traffic and searches.

A friend once quipped, "the best porn site in the world is Google." As "reverse image searches" (where you provide an image and the service finds matches, similar photos, and associated information, e.g. to figure out the identity of the person in the photo) improve in quality, and as mining of the pages the related images come from improves, more people will shift from the current text-based searching to this.

There will of course be searches for similar-looking wardrobe, or for other commerce or travel reasons. But online porn is a behemoth that will overshadow any other image-based search motive. It is driven by one of man's three basic desires, after all (nourishment, sleep, and sex).


Well, Baidu, focusing on the China market, is not allowed to optimize for porn (though I'm sure they do covertly). If anything, they are expected to filter out adult images proactively.


Well, my thought is that even if Baidu won't allow pornographic searches, that won't stop its user base from adopting a separate service to fulfill their needs.


Some ideas...

- take a photo of a street scene and ask, "how do I get home?"

- or more generally, "where am I?"

- If I hate reading every ingredient on a menu to check for allergies, why don't I point my phone at the menu and have it tell me what I would like best?

- "what's in this sandwich?"

- take a photo of water, "is this safe to drink?" (maybe realistic with near-IR?)

- "what plant is this?"

- record birdsong, "what species is that?"

- photo of text fragment, "who wrote this and which book is it from?"


I assume it would be something like you see an object in real life, and you want to know more about it, so you take a picture of it, and use that image to search.


Images will become more ubiquitous over the next few years. It's very hard to describe in words certain things that an image can convey very quickly.

I think mobile image search will see a big uptick if there's a good interface for it.


porn



