The state of Computer Vision and AI: we are really, really far (2012) (karpathy.github.io)
190 points by harperlee on May 7, 2015 | 121 comments



One of the problems is that we still do not have the hardware to do the computation needed to solve the problem.

The brain in reality is quite slow. We know neurons require at least 5 ms to do any computation. But the real power lies in the sheer parallelism.

Essentially biology set a limit of 5 ms, but evolution worked around it by creating billions of neurons. Even if they are slower, because there are so many of them, they can do more computation in that 5 ms gap than all computers in the world combined! It's truly marvellous when you sit back and think about that.

When you think about how that gets done with just that information, it's clear the modern computer architecture works completely differently. What that means is we do not even have adequate hardware to start working on this problem.

We can push the current limit of computation to solve sub-problems, and it seems under that constraint we have done very well. But it's a slow evolution. GPUs gave us a clue, now we have FPGAs, and soon we will have better hardware to create "better" intelligent machines. The machinery that is the brain is vastly complex and beautiful, but not well understood. It's a slow and incremental process, but we will get there sometime this century or the next.


One of the most important things to remember is that the efficiency of the human brain is layered. The ancient parts (vision, for example) have been honed to an amazing level of efficiency, so even the most powerful computers can't come close, while other modules (say, simple arithmetic) are so inefficient that a $2 calculator uses only 1/100,000th the energy to solve the same problem. The general rule is that anything we find easy as humans is amazingly hard and ancient, while anything we find hard is relatively easy and recent (evolutionarily).


One thing to point out is that even if the "module" that does arithmetic is slow, the brain understands the context of that calculation.

2 + 2 can be viewed as a computation that happens in nature and our brain all the time.

But the real magic doesn't lie in the pure computation. Our brain understands what that 2 means in the context of all the knowledge of the entire Universe. 2 cows, 2 sheep, 2 planets?

Understanding context is what we are really good at; as the article says, "prior" knowledge. So it's not surprising that a cheap calculator can outsmart me in computation when it doesn't actually have to deal with the context of its computation, which is sort of cheating.


I was trying to stay away from "meaning" since it gets messy, but if you just compare the energy used by subconscious human arithmetic with that expended by a simple CPU, then the CPU is massively more efficient. We find mathematics far more mentally taxing than visual observation purely because we are so bad at mathematics, not because visual processing is easy.


Arithmetic savants and trained arithmeticians can come a lot closer to the calculator, showing that the usual human brain layout and/or architecture is a really poor fit for performing this type of calculation.

There's also some evidence that the "efficiency" of the more powerful parts of animal brains (including humans) simply comes from a sort of n-dimensional conditional railroading, which would work like this, if my understanding is correct: if you map (B/W) visual inputs to the XY axes of a slice of neurons, then when a line is formed in the input, it fires up a line of neurons, which allows a direct connection between the Z±1 neurons of the first and last neurons on the mapped line, so that connection becomes the line. These Z±1 neurons have pathways to various other nearby potential Z±1 neurons; a connection lighting up between Z±1 neurons that don't have a straight-line connection over the XY map could detect different shapes. Then the Z±1 layers and their connections could be creating highway maps between both Z±2 neurons and diagonally-offset neurons that don't "fit in the grid" we're imagining here (remember, we're in 3D and neurons don't form a homogeneous grid, so this is all abstraction anyway).

These second-level connections, arguably identifying something like shapes, connect in such a way that the next level up detects something else, etc. (insert magic we don't understand), until the detection results in recognizing a specific object. Meanwhile, the side-channel connections we haven't mentioned yet, but which keep occurring at every level, connect to a different side-channel ZY map for, say, the position of the object (with its own layers of processing that eventually connect to the object-concept neuron, so that the "position" is associated with the recognized object), and/or perhaps to other side-channel maps or parallel XY mappings for various other things. And of course, a lot of these things connect to completely different parts that we ignore here, and a lot of the information indirectly makes its way to the conscious mind in this way; the brain doesn't compile a complete vision report and submit it to the conscious mind in one download, it's all using the same hardware.
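To make that hierarchy a bit more concrete, here is a toy sketch in Python (purely illustrative; the grid, the "line units" and the "shape unit" are inventions for this comment, not a claim about real neurons):

    import numpy as np

    # Layer 0: a tiny black/white "retina" mapped onto an XY grid.
    image = np.array([
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
    ])

    # Layer 1 ("Z+1"): one line unit per column; it fires only if the
    # whole straight line of retina cells beneath it is active.
    line_units = image.min(axis=0)
    print("line units:", line_units)      # -> [0 0 1 0 0]

    # Layer 2 ("Z+2"): a shape unit that fires when a particular
    # combination of line units below it lights up, e.g. "exactly one
    # vertical line in view".
    shape_unit = int(line_units.sum() == 1)
    print("shape unit:", shape_unit)      # -> 1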


In essence, in order to figure out the brain, we would need to figure out how a single neuron actually works.

I think that will take us 500 years, give or take a few.


Why do you think that would be the case?

You could say the same about every tool that we use, which we don't totally understand. It just needs to work.


No, we wouldn't need this. How we could come to figure out and understand "the brain"[1] without understanding its smaller components is worth several science classes mostly involving maths, most of which I haven't mastered and thus won't arrogantly attempt to explain. We've managed to figure out other complex systems without understanding their components in detail. For one point of comparison, we'd already had a pretty good idea of how inertia and motion worked (Newton's laws) long before we understood in detail the atoms and forces acting upon/between them.

Then there's also the problem of your estimate. 500 years is a rather exaggerated timeframe. Twenty years ago I could've had the most prominent field experts tell me humans would never in the next three hundred years figure out biology even for the simplest of living organisms, because they were simply too irreducibly complex for that. And then a few years ago we simulated an entire worm's nervous system and gave it enough of a body for it to use, move around, and receive input from its entirely-fictional environment. Now we've got people working on doing the same thing with a cat brain. We're nearing breakthroughs in creating fully-synthetic, fully-functional animal organs that you can essentially real-life drag'n'drop to replace a failing natural organ without complications.

Knowing the above, are you really sure it's not 15 rather than 500 years?

[1]. (which is about as meaningful as saying we understand "weather", which is really a huge messy amalgam of many completely unrelated things ranging from fluidic motion and thermodynamics all the way to plate tectonics and even anthropology, after a fashion)


I can pick up a rock and throw it at the water so it skips across the surface. It takes less than a second. How many calculations did I just do?

We are 'slow' only when we formalize the problem.


> How many calculations did I just do?

Probably not that many. Let's say you create a water-skipping robot with two different AI programs in it and a shaky, not very precise arm (but still as good as a human arm in terms of specs).

One is custom-made to the physics of the problem, and will calculate the angles and the forces and the pressures and the candy tasty physekz all over, and then decide on a particular motion of the robot arm, and the rock will skip. It'll take a lot of computing power, but it'll work.

The other just has a goal, to see the rock skip, and tries things at semi-random (though it has the general knowledge that it can be done and that it has to throw the rock towards the water in a particular way for it to happen and that it can control the outcome), and tries to figure out patterns between what it did and what happened as a result. Eventually, it has a particular tactic, a certain set of instructions, which could be "down down down the left thing and up up right down down the right thing and the up thing does down down up push push down push twist-force-2 at the same time", an entirely not at all complicated set of instructions with very little computation. This signal is sent through other nodes that might not have the perfect signal number, and eventually this outputs to the arm's motors... which nevertheless manage to make the rock skip, because the motion here is almost the same as the motion of the first AI, yet this one was obtained by eliminating the ones that didn't result in the rock skipping.

TL;DR: You don't do that many calculations on the spot. The "maths" in most situations like this is done by elimination throughout all the attempts you've made in your life to control a throw. Subsequent throws once you're already a practiced rock-skipper just involve firing "the same neurons as usual" which send "the same signals as usual" to your muscles, which result in the same skipping as usual, with very little math. If you already know that 2x^5 + 20 = 110 implies x^5 = 45 from doing the same calculation twenty times in the past hour, your brain isn't performing "calculations" anymore, it's just repeating a pattern that's already there in your brain, like a table lookup.
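A toy version of that "eliminate until something works, then just replay it" loop (the physics function below is completely made up; it only exists so the sketch runs):

    import random

    def rock_skips(angle_deg, speed):
        """Fake world: reward shallow, reasonably fast throws."""
        return max(0, int(speed / 5) - int(abs(angle_deg - 20) / 5))

    learned_throw = None  # the stored pattern, once one is found

    def attempt_throw():
        global learned_throw
        if learned_throw is not None:
            # "The same neurons as usual": no search, just replay.
            return rock_skips(*learned_throw)
        # Otherwise: semi-random attempts, keeping the first throw that
        # works and discarding (eliminating) everything that didn't.
        for _ in range(10000):
            candidate = (random.uniform(0, 90), random.uniform(1, 40))
            if rock_skips(*candidate) >= 3:
                learned_throw = candidate
                return rock_skips(*candidate)
        return 0

    print(attempt_throw())  # slow the first time (trial and error)
    print(attempt_throw())  # instant afterwards (pattern lookup)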


I tend to disagree with the idea that we don't have computational power equivalent to the brain's. The reasoning is that, yes, all the things you say are true about having billions of slow neurons, but that makes the assumption that their only fault is that they are slow. I don't necessarily agree with that assumption. The evolutionary process doesn't find true maxima, it finds local maxima. The key is that the neuron is good enough for the task of intelligence and planning and pattern detection. However, the architecture is most likely highly inefficient. The reason they are so good comparatively is that we just kept adding neurons, both by increasing the size of the frontal cortex and by folding it, so as to jam more in there.

That is not an architectural design per se, and I highly doubt that it is optimal. Rather, it is good enough. Now, the Von Neumann computer architecture is of course lacking. Parallel computing as we know it works, but it's a total nightmare, and hardly competes with the incredible parallelism in nature. But I do think that there is sufficient power now to really feasibly compete with the brain, intelligence wise.

All this with a grain of salt. I have been reading stuff in the weird corners of AI research. I think this article is fundamentally a straw man. (I believe that...) Deep learning and probabilistic machine learning are fundamentally flawed for strong AI. Jeff Hawkins is a well-respected AI researcher who I think seems to agree. Another problem is of course that these are all trained in a supervised fashion. Google does well because they have huge tagged data sets. However, the brain doesn't work by training on tagged data sets; it can learn on its own, unsupervised. So long as we keep pointing mainstream AI down the supervised statistical machine learning path, we will always be far away from strong AI.


I don't agree that brains can learn on their own, unsupervised.

Firstly, the brain has a basic feedback mechanism, which can be thought of as basic supervision: it receives quick feedback on certain kinds of hardware issues its actions caused. Most computers don't even get a similar level of feedback (e.g., that they tied up the cluster on junk computations).

Secondly, modern humans are booted through this mechanism by already-running humans, in what is direct supervised learning: from the time we're infants -- and for the next two decades -- our actions are supervised, commented on, corrected, and our exposure to materials is metered and selected to create a (theoretically) optimum consumption plan.

That we get any results out of computers with a couple months of training when comparing them to humans with a couple decades of training is testament to the fact that computer learning is orders of magnitude more effective than human learning.

I find it very strange that people leave out the 2 decades of hardware tuning and supervised knowledge building that humans get when discussing how awful it is that we have to train our machines if we want them to be smart.


Brains can definitely learn without supervision. Observe any child playing; they explore and learn from stimulus in ways far more sophisticated than our current machine learning models.

An interesting aspect of brains vs machines is how brains can learn from other brains indirectly. When someone tags a database for a machine to learn from, that is a form of direct communication (the kind machines are good at). A crow can watch another crow use a stick as a tool, and in turn learns that a stick can be used as a tool. This requires a complex understanding of the situation.


If you feed one AI a tagged data set to train it, I think you would be able to easily hook it up to an untrained AI to train the second AI. As in, pass the first AI an un-tagged input, get its result, and pass the second AI the same input tagged with the first AI's result. That would be essentially the same thing as a crow learning from watching another crow, aside from the fact that the crows are self-directed.
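That scheme has a name in the ML world, pseudo-labelling (or a teacher/student setup). A minimal sketch of the idea, using scikit-learn with synthetic data standing in for a real tagged set:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Teacher: trained on a hand-tagged data set.
    X_tagged = rng.normal(size=(200, 10))
    y_tagged = (X_tagged[:, 0] > 0).astype(int)
    teacher = LogisticRegression(max_iter=1000).fit(X_tagged, y_tagged)

    # Student: never sees the hand tags, only the teacher's guesses on
    # fresh, untagged inputs -- the "crow watching another crow" step.
    X_untagged = rng.normal(size=(2000, 10))
    pseudo_labels = teacher.predict(X_untagged)
    student = LogisticRegression(max_iter=1000).fit(X_untagged, pseudo_labels)

    agreement = (student.predict(X_untagged) == pseudo_labels).mean()
    print(f"student agrees with teacher on {agreement:.0%} of inputs")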


> I don't agree that brains can learn on their own, unsupervised.

Facts don't require your agreement. They're still facts.

You're suggesting that the "2 decades of hardware tuning" is equivalent to programming, and that just isn't the case. While parents definitely provide instruction, each child's brain is fully on its own to program itself. It teaches itself how to make sense of the visual signals it's receiving from the retina... no one teaches a child up from down, how shading and colors differ, how to use both eyes in tandem to provide depth perception. It's all automatic and 'magical' in a way, because literally NOTHING you can do as a parent can speed that process up in any meaningful way. The same is true for hearing. You can't just upload your current understanding of the world... the closest you can come is by trying to help them along mechanically, by holding them up as they 'walk' their legs, for instance, but their brain is still figuring it out entirely independently. Showing the child how to move their legs is NOT the equivalent of showing them how to detect orientation using their inner ear, or teaching them how to send impulses to their muscles in the right order and at the right time, or teaching them to predict momentum so that they don't just fall on their face.

> computer learning is orders of magnitude more effective than human learning

The only reason one could even ATTEMPT to make that claim is that a computer's learning is something that can be copied digitally. A single machine can't even approach the learning abilities of the brain. Computers are so fundamentally different that it's not even a fair comparison to the computer... the brain will win every time.

> I find it very strange that people leave out the 2 decades of hardware tuning and supervised knowledge building that humans get

It's because it just isn't relevant. Take the most advanced AI available today, slap it in a robot and spend 2 decades raising it. I guarantee it won't end up anywhere close to a human in abilities or intelligence. If you can't see that, you're being intentionally obtuse.


The last part was funny.

Modern robotics has been around for more than 20 years and there is no autonomy whatsoever.

It might surprise many people who claim that AGI is 20-30 years away that in robotics, something as simple as cup stacking is a major problem. If you can figure out how to do that you will get slightly famous.

The brain and a computer work so fundamentally differently that it's impossible to provide a single framework to compare them. This is why I used the metric of computational complexity, but even that is not at all accurate.

I think that, due to the tremendous amount of automation we have seen in the past 10-20 years, many programmers have difficulty appreciating the awesomeness of biological intelligence, and instead view the success of automation as a failure of how human education works.

Only by fully appreciating the complexity of biological intelligence can we start to learn from it and extend it.


Totally agree! I think we have deviated too far from biological models in order to achieve marketable results. Google is satisfied with probabilistic machine learning because it will still make them plenty of money being able to tag images. This is why I admire the approach Jeff Hawkins is taking by going back to the biological model for more inspiration.


Of course, speaking from physics - yes. It's a local maximum.

It's one of the shortcomings of natural selection and genetic programming.

We have better tools to do sequential computation than biology can match. And soon we will be able to harness the power of quantum mechanics to do our bidding.

But the point I am trying to make is that, whatever approach we use, sequential or parallel, just in terms of computational complexity the brain's capacity is greater than the sum of all the computers humans have ever produced. Just using that metric, we are still far away from having the tools to simulate brains. How long will it take for Intel to compress the complexity of ALL computers ever produced into a single chip?

Surely it's not tomorrow? Maybe 20-30 years? I do not know.

What I do know is that in 2015 it seems there is a lot of work that needs to be done just from a hardware perspective.


I would agree that the computational complexity of every single quark in every single atomic particle in every atom in every synapse is far, far greater than the computational power we possess today. I just think it's a red herring to chase down that level of computational complexity (for now; true physical simulations truly amaze me, and I can't wait to see them become more granular and accurate). When it comes to intelligence, I think we are mostly missing the algorithm, not the computational power. I think once somebody finds the algorithm (which may be very soon, in my own opinion), we will have more than enough computational power to blow the mind away by orders of magnitude.


There's a lot of structure in how the brain is wired that is baked in when you're born. The visual cortex is in the back, the hippocampus is in the middle, the frontal cortex is in the front. You do have unsupervised learning, but you start out with highly tuned hardware. Is that not some form of basic programming?


Yeah, absolutely. But that's the power of millennia of evolutionary tuning, which we have the intelligence to design ourselves, removing the inefficiencies. That then drops the computational power required for equivalent behavior.


I agree that the current Von Neumann architecture is probably not adequate. However, I think there is some active research in this area.

Here are a couple links:

From today actually. Memristors are an exciting development if they can scale production: http://www.technologyreview.com/news/537211/a-better-way-to-...

A startup working on this: http://brainchipinc.com/

IBM's TrueNorth Architecture: http://www.research.ibm.com/articles/brain-chip.shtml


All these developments are really excellent.

The more people who are interested in solving these problems the better.

Neural Networks are awesome, but modern AI tells us that there are also various other techniques. The problem with Neural Networks is that they are essentially a black box, and many mathematicians and engineers are not happy working with black boxes.

Right now the models for synapses and neurons are very simple compared to what is actually happening in them. But regardless, the approach is not in vain.

(My current research involves autonomous motor control with tactile feedback, which is very far away from AGI)


Can you tell me more about your research? I became really fascinated with motor control after reading an introductory neuroscience book a while back. I feel like this is not an area that receives as much attention as it should.


True. However, we know shockingly little about how neurons actually do computation. Everything a neuron does, it does with a combination of electrical and chemical signals. This is why it's slow, but it also introduces considerable noise into the process.

Also, because they're usually inside some skull-like object, and because neurons are incredibly hard to culture, we simply lack the tools to observe them in their natural environment at the resolution you might want. Simple questions, like how often a neural signal fails to cross a synapse in a brain, are remarkably difficult to answer. Do we need to know how every neuron is connected to every other neuron in the brain to understand it? Even if we had such a map, would it be missing something so crucial as to be useless (for example, how glia interact with the neurons)? Nobody can say for sure right now.


The human brain may indeed be beyond all computers.

On the other hand, the brain and eyes of the fruit fly do amazing things. I think it's reasonable to say that vision is lacking both hardware and algorithms at this point.


He wrote this more recently when convnets surpassed "human level performance": http://karpathy.github.io/2014/09/02/what-i-learned-from-com...

Gains have continued in 2015.

Also, more recent reflections on his research which I think gives a bit more color to the OP: http://karpathy.github.io/2014/07/03/feature-learning-escapa...

Kind of like with the rise of electricity, the microprocessor, the PC, or internet -- in the beginning, only the people building it understood what all the fuss was about. But that changed quickly over the course of N years (where N ends up being sooner than everyone thinks). If you had started a career in any of those fields before they were obvious in hindsight, you would have probably done quite well.

The author of the post has not quit to go start a photo app as far as I know, he's still doing research on the cutting edge of deep learning because that's where the most promise is.


I don't think his later comments really negate his earlier comments.

Neural networks continue to make progress in the narrow fields they are designed for, and researching these I'm sure continues to be interesting. That doesn't change the point that humans don't interpret an image as a couple of annotations but as a rich fabric of information far beyond what computer vision currently does.

Basically, there is an ocean of interesting, useful and exciting things computers can do before they arrive at what humans can do.


I'm not sure how well we can estimate how difficult a task is just based on our amazement at "iceberg complexity".

Just the fact that I was able to read that article the way I did is the product of something enormous. Discovery of electricity and semiconductors, mathematics, development of CPUs and memory, global computer networks, the entire software stack that allows me to display what someone wrote three years ago on a glowing rectangle with no involvement of any paper at all. The entire social and economic development that allows me to read this instead of walking through the woods and trying to impale some creature on an arrow.

That's a lot of solved problems, most of them figured out during the past century. I don't think some image segmentation, pose estimation and semantic reasoning is going to take quite as long, especially with more and more people working on it.


Advances in computer vision are being made every day. Image understanding is a big challenge, and the way we tackle big challenges is in little steps.

I was impressed by some video segmentation and object classification results that Microsoft showed off the other day at its Ignite conference. We're a lot farther along than some people realize.

Picture here: https://twitter.com/MS_Ignite/status/595365048547180545

Video clip at the 1:02:00 mark: https://channel9.msdn.com/Events/Ignite/2015/KEY02


In fact, the author of this blog published a paper pretty recently on scene understanding.

http://cs.stanford.edu/people/karpathy/deepimagesent/

This got a lot of press:

http://www.nytimes.com/2014/11/18/science/researchers-announ...


You're right that the video looks very impressive. However, I'm hesitant to believe that this actually works as well as one might perceive. In this area (called semantic segmentation in the computer vision world) it is common for people to highlight both the good cases (where the labels/segmentation are correct) and the bad cases (where they are incorrect). In the MS video they only show the good. It's easy to cherry-pick examples where it works -- even if the overall accuracy is very low.

Furthermore, I don't know which dataset they're using. Perhaps it only works on a small set of objects such as those shown in the video.

I don't mean to knock their results, but it will still take time to get this to work on a broader set of videos.


The article implies that we will or ought to arrive at the point where a machine can appreciate why an image is funny in the same complex way that a human can. Not to discount the value of AI Vision research and technology, but at some point we must ask "why".

Partly the question goes to the difference between artificial intelligence and artificial consciousness. According to some definitions, the former is the ability to produce relevant information, while the latter is an autonomous system capable of using that information for self-interest.

For example, no matter how complex a system like Watson is, it's no more conscious than a rock. Meanwhile, a rudimentary life form is something we are nowhere near replicating artificially. This distinction is quite important, and very intelligent AI pundits seem to fail to make it (or understand it?).

While we are quite capable of producing intelligence artificially, the capacities associated with consciousness easily become confused with it in the mind. While some intelligence is easy to produce with machines, there are some problems of experience that simply cannot be solved by artificial intelligence, and require consciousness.

But let's be clear: producing an artificial consciousness is orders of magnitude more complex an engineering challenge than building an artificial intelligence machine.

It is also potentially enormously destructive: many times more difficult to create, and at least as destructive as the atomic bomb.


> How can we even begin to go about writing an algorithm that can reason about the scene like I did?

You are doing AI wrong. AI should learn all of that context by itself, from a large amount of stimulus. If it was a good one, it might be able to learn enough in less than N years, where N is the age of a human who would laugh at the photo.


I think the author was using a linguistic-shorthand. He is an active researcher in the field: http://cs.stanford.edu/people/karpathy/


Not just "stimulus", but active exploration of the environment, communicating with and learning from other people, etc.


If we could make AI study the relationship between objects in images and videos, it could learn a lot of raw common sense. Also, it could be useful to add a different domain to the mix by cross referencing that with information extracted from text.


It strikes me that a forum such as this one might make a great training tool for AI. So many good cues for following branched reasoning, dealing with ambiguity, etc. I usually "read" the articles through the comments.


> AI should learn all of that context by itself, from a large amount of stimulus.

That's basically what Nature is doing anyway.


The flash you get out of a joke is built on a huge framework of human interaction that takes literally years to acquire for the "dedicated" BI of a brain. The fact that AI is only starting to reach the power of facial recognition, which is for us a seemingly "low-level automatic" function, tells us how much learning is left for an AI to come close to our understanding of the world. Not only in terms of computational power, but in the length and diversity of the learning process.

Assuming sufficiently powerful Neural Networks, they will probably go through years of learning our world through interaction, just like a kid does, before they "get" the joke. That doesn't mean it's impossible, and that's quite scary (in a good and bad sense, I guess).


Years _worth_ of learning. They could learn by watching videos, or by playback of sensory data from early AI bots, and by playing video games created for the specific purpose of training them, which will be much more efficient.


Good point. The thought actually popped into my mind after I wrote the post: "Wall E" and many more films/books have suggested this accelerated learning path.

However, this is information only, not interaction; I suspect this will have a serious distortion effect on how the AI "perceives". It's a wild guess, but I believe interaction is at the root of understanding.

Edit: yes, you also mention interaction through video games, which I skipped when I scanned your comment. But then again, video games might still be far from the depth of real-world interaction, more of a learning reinforcer than the source of it - like books are for us...


Interaction in a sense acts as a method of preventing over-fitting. If you can't interact and are given a fixed batch to learn from, you can find trivial, overfitted models that still look like good predictive models (e.g. the identity model).

One way of reproducing this aspect of interactions is simply using standard ML techniques to prevent overfitting, such as cross validation.
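For reference, that kind of check looks like this with standard tooling (a generic scikit-learn example, not specific to the interaction argument):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

    # An unconstrained tree can memorise the fixed batch (the trivial
    # overfitted model); held-out folds expose that.
    deep_tree = DecisionTreeClassifier()              # no depth limit
    shallow_tree = DecisionTreeClassifier(max_depth=3)

    print("deep tree,    5-fold CV:", cross_val_score(deep_tree, X, y, cv=5).mean())
    print("shallow tree, 5-fold CV:", cross_val_score(shallow_tree, X, y, cv=5).mean())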

There's another aspect that's more difficult to reproduce, which is the "online learning" aspect of interactions. If you can interact, you can form hypotheses in real time, then test and modify them. This can greatly enhance learning efficiency, I suppose -- you may directly explore failures in your models and improve in an optimal way.

This aspect also might be reproduced I believe simply through a large enough dataset. The learner could be given some capability to explore this dataset in a non-sequential way and look for informative results in it.

Interesting stuff.


But there is already so much content on the internet, that accelerated learning could still happen by way of proxied interactions visible in "old" content. Does your AI have to interact with people to learn how to interact? Or is watching interaction good enough?


In a nutshell I don't see how any kind of adaptive intelligence can bypass the reinforcement process of trial and error through interaction. Then again you could have simulated interaction, but that may be the equivalent of the machine dreaming :-)


You can learn to not touch a hotplate by watching someone burn their hand on it.


A related post on the front page right now:

"Neural network chip built using memristors (arstechnica.com)" https://news.ycombinator.com/item?id=9501119


To be fair, the author is evaluating a computer based on tasks our brains have specifically evolved to be good at: facial recognition, social interaction, and familiar settings. You could turn it around and find tasks that computers are good at, e.g. watch a month's worth of highway videos and count how many cars passed.


The argument in the article seems to be that we are really far away from having an AI with social intelligence. However, AIs do not need social intelligence to be interesting / dangerous. My best guess would be that creating artificial humour is much harder than creating a machine that is dangerously intelligent.


A great argument! In fact, humour is a tough task for humans as well. We do it all the time, but only a few people can be consistently funny in the eyes of other people. Also, different cultures experience humour differently. E.g., German jokes aren't funny to Chinese people and the other way around; the overlap is nearly zero.


Three years later, computers now outperform humans on ImageNet. While we still have a lot of work ahead, it shows how fast things can change in the exponential world of computers.


Can you please give a link showing an exponential improvement between 2012 and 2015?


http://image-net.org/challenges/LSVRC/2012/results.html One team at 15%

http://www.image-net.org/challenges/LSVRC/2014/results Most teams are below 15%, GoogLeNet is at 6%

http://www.image-net.org/challenges/LSVRC/2014/results Microsoft is now below 5%

http://arxiv.org/pdf/1502.03167.pdf Google is now below 5%

I can't really argue whether that's exponential or not, but it's amazing progress in a short amount of time.
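One crude way to eyeball it, taking the quoted top-5 error rates at face value (a back-of-the-envelope check, nothing more):

    # Approximate ILSVRC top-5 error rates quoted above.
    errors = {2012: 0.15, 2014: 0.06, 2015: 0.05}

    years = sorted(errors)
    for a, b in zip(years, years[1:]):
        per_year = (errors[a] / errors[b]) ** (1 / (b - a))
        print(f"{a} -> {b}: errors shrank ~{per_year:.2f}x per year")
    # A roughly constant shrink factor per year is what an exponential
    # trend in error reduction would look like.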


We might be really far from an ASI that completely groks every nuance of that picture... but we already have AI that can understand pieces of the picture... and those AIs can be used to do interesting things today. This isn't an all-or-nothing endeavor.


A properly trained AI would have recognized Obama in that picture much faster than I did. As for the foot in the scale, I only noticed when the author mentioned it.

So much for the 'quick glance'. Which brings me to another matter. One of the reasons the author can extract all that information from that picture is because all the elements in it have been 'seen' already. A machine might not be able to extract the whole context, but things like the people involved and that they seem happy? Easy (-ish).


The image produces totally different ideas in a baby and an adult. Babies don't understand most of the parts, but adults do, because adults learned everything from their experience.

We need systems that learn from experience.


Though in order to build systems that learn from experience, we first need to grasp much more clearly just how learning works for us, and how exactly it is that we persist our experience data.

To me it seems we are still at the stage where we have to understand ourselves better, because we simply can't devise an algorithm to solve a problem we cannot solve ourselves.

We can do clever stuff with statistics and math, but that seems to me more like a hack. For instance, people create models from very small datasets; if you have kids, you can watch in amazement just how efficiently we are wired in that regard. We try to mimic that by feeding huge datasets to algorithms, but it still pales in comparison.


True, we still don't know how "learning" works in brains. But it works in our neuronal network. What about combining some real neuronal cells with technology?

I remember one documentary on the Discovery Channel, where some scientists made use of rat brain cells cultured on a circuit to build a small robot, which learned to avoid obstacles in its path.

A similar video is here: https://www.youtube.com/watch?v=1-0eZytv6Qk


Part of the experience adults use to interpret that image is that they, at one point in their life, either stepped on someone else's scale, had someone step on theirs, or witnessed such a scene, or even a different kind of prank suitable to draw the analogy from, laughing or getting angry about it. How are you going to provide such experience to a system? Edit: typo


just step out of the 'Machine Learning' bubble for a sec

there have been decades of research into 'Cognitive Architectures' (http://en.wikipedia.org/wiki/Cognitive_architecture) and 'Artificial Consciousness' (http://en.wikipedia.org/wiki/Artificial_consciousness)

there is a massive amount of experimental observation on learning and cognition in neuroscience and the cognitive sciences (from neurophysiology to psychology) that is largely ignored by the Artificial General Intelligence and Machine Learning communities.

On the other hand the progress in Deep Learning, Computer Vision, NLP and Robotics is largely ignored by neuroscientists because these learning models do not respect biological constraints

There is a whole group of narrow domains like Formal Concept Analysis, Statistical Relational Learning, Inductive Logic Programming, Commonsense Reasoning, Probabilistic Graphical Models that don't talk to each other but all deal with cognition and conceptual reasoning using different tools

I think we have a chance to make progress if these fragmented domains converge.


Not really,

There are researchers in all the different fields whose sole job is to report on what other communities are doing and to be agents of cross-pollination.

Everyone agrees that artificial general intelligence is a difficult problem.

Practically, it's not possible to converge all the different fields, and also, what would be the point of that?

Each researcher is interested in solving their own set of problems, ones they find interesting or have the motivation to be part of the solution to.

Progress is being made - maybe not at the rate of Silicon Valley start-ups, but hard problems require time to solve.

It would not be ideal for Computer Vision people to suddenly stop doing their research and take the massive risk of betting everything on Deep Learning.

People doing Computer Vision have their own sets of constraints and goals. If tomorrow the garbage man, the cleaner, the cook, etc. all stopped working and said "we are all going to work on deep learning", the world would stop working.

As absurd as that sounds, that is what the implications would be if these separate fields tried to converge. Even if we solved the problem of AGI today, what direct change or improvement in the human condition would we see tomorrow?

When that AGI needs to be integrated into a framework like computer vision, robotics, or a search engine, you need domain experts and practitioners in those various fields to still exist tomorrow to maximize the economic benefit of such a technology.


I'm not suggesting experts should drop what they're good at and work on integration, it's a job for engineers, think 'Apollo 11' kind of integration of different sciences into a single working product.


The author wrote this in 2011:

>My impression from this exercise is that it will be hard to go above 80%, but I suspect improvements might be possible up to range of about 85-90%, depending on how wrong I am about the lack of training data. (2015 update: Obviously this prediction was way off, with state of the art now in 95%, as seen in this Kaggle competition leaderboard. I'm impressed!)

That's more impressive than it sounds, because each percentage point is exponentially harder than the last. Getting 95% accuracy is not 5% harder than getting 90%; it means cutting the remaining errors in half.

Just recently, machine vision started beating humans on ImageNet. ImageNet has 1,000 classes of high-resolution images taken randomly from the internet. No one would have predicted that a few years ago.

Sometimes a notable researcher like Hinton says that something like transcription of images into sentences might be possible in five years, only for researchers to demonstrate it in five months.

I remember reading that early engineers working on computers were extremely skeptical of the rate of computer advancement. They were so focused on narrow technical problems that they didn't see the big picture.


Why are we grouping together computer vision and AI? They are two distinctly different fields, in my opinion. Computer vision is really hard. Great work has been done in the last 5 years with tech like the Kinect, but there's certainly a long way to go. AI has made small steps with projects like IBM's Watson, but still seems to be in its infancy (and even that may be generous).


Is there a good term for a system that integrates both fields to achieve the goal of, say, matching human performance at visual scene understanding?


To understand the picture the machine should reason like this.

First: Look for the focus; detect the zone the eyes are looking at. Result: The focus is on the man at the right, since eyes are directed at him.

Second: Sentiment analysis. Whatever he is doing, people find it funny.

People like to play jokes that make you experience strange things that you can't understand at the moment.

Hypothesis: The man B near the main character A is interacting with A. It seems that the foot of B is interacting with the machine M. So B interacts with M, which interacts with A.

Generated question: What kind of interaction could be exerted by B on M to cause M to get A confused?

Hypothesis: B's foot is making the machine to malfunction in such a way to give a false message to A.

Hypothesis: To make this image more noticeable the men are well known people or famous people.

Data: The more serious the role these people play in society, the funnier the image, since their behavior is more unexpected.

Reasoning like this the machine could get a plausible hypothesis of what is happening in the scene:

A famous man B, who probably plays a serious and responsible role in society, is playing a trick on a man A by making a machine M malfunction in such a way that it gives a false message or information to A. The malfunction is caused by putting a foot on the machine. To make it funnier, the main character can't see that, ...

Having a general context like this, the machine could look for machines and people which could play such a role and give a heavier weight to those that make the joke a better one.

The next generation of machines could modify the image to make the joke better, since it understands the context and purpose of it perfectly.


At the risk of stating the obvious, even if this were true and we were actually really far away from AI or good computer vision classification, Moore's law should take effect. For every order of magnitude of difficulty, by whatever measure, we should only need to wait a linear amount of time.
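The arithmetic behind that, assuming the classic 18-month doubling keeps holding (a big assumption):

    import math

    doubling_years = 1.5  # assumed doubling period for compute
    for k in (1, 2, 3):   # k orders of magnitude of extra difficulty
        wait = math.log2(10 ** k) * doubling_years
        print(f"{k} order(s) of magnitude harder -> ~{wait:.0f} years of waiting")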

Also, notice that for this particular picture a lot of what the article is talking about doesn't matter:

  - Does it matter that they're in a hallway?
  - Does it matter there are mirrors?
  - Does it matter that one person is the president?
The main gag is having one person standing on a scale with the intent of making a weight measurement while the other is subverting that intent, presumably with the knowledge of bystanders. My bet is that this picture could be classified as funny, even correctly labelling the joke, if the picture was tagged and a classifier were run on a large database of other tagged pictures.
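A sketch of that bet, with random numbers standing in for real image features and hand tags (everything here, from `image_features` to `is_funny`, is hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    image_features = rng.normal(size=(5000, 512))   # e.g. pretrained convnet features
    is_funny = rng.integers(0, 2, size=5000)        # hand tags

    X_train, X_test, y_train, y_test = train_test_split(
        image_features, is_funny, test_size=0.2, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    # With random stand-ins this hovers around 0.5; the bet is that real
    # features from a large tagged database would do much better.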


There's the explicit gag itself, and it's an old one, of stepping on a scale.

The real joke, however, is in fact that it is the president. It's more humorous because it shows the president as a real person. It's unexpected, and out of context with how we typically think the holder of the office of President of the United States should behave.

That's the main joke. So yes, the finer details are important. It's not just a funny picture.


I doubt a system could be written with today's technology capable of deciding if a picture is funny/not funny based on a huge database of funny/not funny hand tagged pictures. Too much is involved. What does "is funny" even mean?


There is no reason for a "computer" to do all that, and if it could, it would not. All that he talked about is about dealing with the limitations of our brains, our minds. The machine would not care, because it would not care about its limits. We, as a computer, are one big limited system constrained by the process that brought us into being: chance and natural selection. An AI is not constrained by this any more than a dog is constrained from finding the same thing funny. Humor is not a problem for us because we are part of the problem; we are a constantly derived huge word puzzle. An AI would not be that; an AI would have a purpose.


The article nicely illustrates the complexity present in a single image, and I agree that we are far from automatically and fully understanding images like this. I believe we would need to replicate the full brain to do that. But still, there are many interesting applications that are possible with state of the art computer vision. The question is whether state of the art is useful for industry, not whether the ultimate goal is close or far.


The article is from a few years ago and missed the AGI research that existed at the time. Deep learning is now mainstream, and awareness of AGI research that has been going on for years is slowly creeping into mainstream AI.

Anyway, he mentioned what needs to be done at the very end of the article, which is to emulate the embodied human development process, and there is quite a lot of progress in that area.


It's not just vision.


I have no hesitation in arguing that our textual understanding is currently far worse.

E.g., modern deep-learning classifiers can get an F1 score in the high 90s on identifying entities in images, but what's our capability for recognizing, say, product entities being mentioned in Amazon reviews? Good luck getting even a 70% F1 score. It's still incredibly awful.

And this is just talking about very basic entity recognition in realistic settings. Let's not even consider relation extraction and more meaningful tasks.


Presumably the author is familiar with the Google paper - http://arxiv.org/pdf/1411.4555v2.pdf which should get the people, scale, and laughing bits out of the image. There are a remarkable number of technologies converging here.


You know, there are some problems that are simply unsolvable. Star Trek-style teleportation, I'm inclined to think, will never happen.

I think "true" AI of the sort described in this article is, likewise a pipe dream. Not in a 100 million years of scientific effort could we create the hardware and software necessary to do this. If you think otherwise, that's because you have an enduring and overriding faith in the power of science to overcome all barriers and achieve all goals. But on what basis could you place that faith? Sure, science is generally good at solving problems and advancing technology, but there are some "technologies" that are simply not attainable: indeed, not even clearly definable.

I think that deep AI of this kind is one of those. We think we know what it would mean for a computer to think like a person. It's not so clear that we do. Given that we deeply don't understand how we ourselves not only think, but also feel, want, assess, moralize, and experience, we're unlikely to produce thinking machines.

"Ah, but the Power of Science, may it be praised forever!" you say. "Science will help us to understand how we understand!" Yeah, maybe, nah, I don't think so. Just look at the construction itself. It's loopy. "Understand how we understand"? I doubt it's possible for any instrument to ever really comprehend itself; in other words, for any thinker to ever think accurately about how it thinks.

We should keep AI research within the realm of observable, measurable, useful utilities. "Comprehension" of any kind... it's never going to happen.

There I've said it. Now roast me at the stake, O ye of great faith.


The funny thing is, you could just as well think about the complexity of any kind of animal behavior. AI isn't just very far from human intelligence; it is also very far from any kind of living intelligence.


The roundworm C. elegans has just 302 fully mapped nerve cells, and people are still trying to figure out how that small system works.


If you're interested you can join in! http://www.openworm.org/


>I have a really cool idea for a mobile local social iPhone app.

I wonder if this was tongue in cheek since I thought people didn't start making fun of mo-lo-so until "Silicon Valley" aired: https://www.youtube.com/watch?v=J-GVd_HLlps.


The state of Computer Vision and AI: we are really, really far _away_


Something related - the title is:

    The state of Computer Vision and AI:
    we are really, really far. (2012)
To me, that means we are really close to achieving it, because we are really, really far along the path. But reading the article immediately creates a cognitive dissonance - that can't be right, can it?

No, the author means "we are really, really far away from our objective," which is not the same thing at all.

Was this deliberate, or did the author, in his focus on the question of interpreting images, simply not notice that his text was ambiguous as well?

Or is it just me?


It's not just you. My initial reaction was that it meant "we have come far", meaning that "we are close". My next thought was "That can't be right, he must mean we are distant from the goal."

Why is this so ambiguous? Far and distant are synonyms, right?

I suspect that it's because you can use "far" to describe a path you have traveled, in a way you can't with "distant". "We have come far", "We are far along our journey." So that "far" can mean far from a point of origin as well as a point of destination. Whereas "distant" only means distant from that point of reference.

So I thought the title was ambiguous without a preposition to clarify "far": "far away", for example.

Perhaps it's in the framing of the question: "the current state of x" implies that it's being considered as an ongoing process, which implicitly has an origin and a conceivable end.

Also, putting the subject as "we are..." puts it in the frame of an ongoing journey. If you say "That lighthouse over there, it is far." there is no ambiguity that the lighthouse is distant from us. Also "Effective computer vision: it is far" is not really ambiguous.


I think it's just you (from your username you're probably a native English speaker, still)

"we are far" to me means "we are distant (from the objective)"

https://en.wiktionary.org/wiki/far#Adverb

If it was "far ahead", or "we've come far" I'd agree with you


It's not just him. I'm a British English speaker and read it the same way. I read it as "We are far [along our course]". It came across as a bit of an odd phrase but I don't think it's really good English either way, it needs an object to be clear.


I'm from the US and I read it the same way, as a statement that we are far along our course.


The problem is nobody ever says "we're really far". It's either "we've gone really far" or "we're really far away".

It's almost grammatically incorrect to say "we're really far", unless it is a response to a question - "how far are we from home?", "we're really far". In that case it works because the subject is implicit.

It's like you can't have a title that is "It is the best smartphone yet."


Yes, to me, not being a native English speaker, it was the same: I read it as far from the goal.


Such far-sighted far-thinking will take you far.


Same here, and I opened the article thinking "cool, an optimistic article" as in the tech industry we tend to be overly critical of ... everything.

I was disappointed.


Hmmm, I feel if anything tech people are more optimistic than warranted about AI. I found it refreshing that this made it clear there is still a lot more to intelligence than just recognizing a face in an image. Now, the tech industry being pessimistic about many other things, sure, but in my experience that hasn't been the case with AI.


I came to the same conclusion - I wonder what an AI assessment would have made of the title.


Actually, a useful application of NLP would be to detect this kind of ambiguity before it's published (or written into contracts or laws).


Exactly. Us meatbags will be useful for a while still even if just to interpret other meatbags.


At least as long as it is considered useful to interpret meatbags at all :D


Ugly bags of mostly water.


Not just you; I also went through exactly the same thought process.


I had the same sensation as you. I thought we are far, like "almost there".


I thought the opposite immediately and didn't even think twice about it before I read the comments. I wonder what that says about me.


I noticed the ambiguity and clicked on the link just to figure out which was intended.


100% sure this was the intention with the title.


I assumed the author took the statement "we are really, really close" and then replaced "close" with its opposite, "far". This is the sort of thing a non-native speaker would do, and I suspect he was unaware of the ambiguity this caused. The implied idiomatic continuation '... far from the goal' or '... far along the path' is what throws native speakers; non-native speakers, however, simply parse the sentence the way the author intended, in a much simpler way.


Interesting point, I've posted a question about the potential ambiguity at the English StackExchange* for additional discussion.

* http://english.stackexchange.com/questions/244893/what-is-th...


To me, that means we are really close to achieving it

So in this context far and close are synonyms? English is so weird.


The point here is the question of what comes after far:

* We are really, really far along the way

* We are really, really far from our objective

It seems clear that most native English speakers in this thread have seen the ambiguity and started by assuming the first option, whereas non-native English speakers seem to have assumed the second, and may not have noticed the other option at all.

But see also: Contranym/contronym/Auto-antonym:

http://en.wikipedia.org/wiki/Auto-antonym#Examples


I think this is a great example of the points illustrated in the article: Namely that interpretation of a {photo, sentence, whatever} depends on the history/background of the subject receiving the information and is one of the reasons why the field is "so far [away]".

For example, a person who had no experience/knowledge of scales like that shown would have a tough time discerning what the reason for humour was.


This is really good to read, since I'm not a native speaker and I consider my English not that good.


Not really. The problem is that "we're far" is ambiguous as to what we're far from.

Are we really far from our origin? Or are we really far from the destination?

In the absence of a complete statement, people tend to insert their own bias, which can lead to confusion.


The title needs rephrasing, but I wouldn't expect someone to say that we're almost there with AI, as it would sound ridiculous, so I used my natural intelligence to correct the meaning of the title and understood it as the author had intended.


The title could've easily been referring to specific progress or framed in optimism.

Can there be no articles about how far we've come until we've reached the stars? Or singularity?


I agree that the title could be read both ways, but as someone who works with CV, I knew immediately what it meant. CV outside of a few specific applications on controlled images is only just starting to work at all.


we are really, really far along the path

Except the "along the path" the path part is neither stated or implied. Rather then just being a grammar or language issue I think people are projecting their optimistic opinion of AI onto the title, with no basis for it based on what's actually written. Objectively if you read the title how it should be construed as negative, we have a long way to go, etc.


"From the objective" is neither stated nor implied either.


I'm not a native English speaker and understood it, as the author intended it. It was a bit confusing at first, because "far" is seldom used by itself in English in this way, but I settled on "far from" as opposed to "far along".


It's not seldom used in that way. In my experience, I would say that "far" by itself more commonly refers to "far along" until otherwise clarified with "far away".

The title is simply ambiguous.


Thanks, I didn't realize this. Fixed!


Oh, but Ray Kurzweil says we'll have an AI beating the turing test in 14 years, so I guess we're fine.


I don't find the picture funny, even though I understand all the bullet points from looking at the picture.

Yes, the man is confused because he's being pranked by the president. It would be funny if I was there.

But looking at this photo? It's not funny to me.

I suppose I would rank this picture a lot higher than a picture of a skyscraper in the list of funniest pictures, but neither would be above the "laugh" threshold.

Does everyone else here find it funny?


You're missing the point entirely. You can explain why people might find it funny, that's the point.


Thanks...I thought it was slam-dunk laugh out loud funny for everyone except me and something was wrong with my sense of humour...


I'm not sure when he says it's funny he means laugh out loud funny. Do you find it amusing? I think the point is as humans most of us can get the joke this image is conveying and it would be very difficult for AI to understand that. How funny you actually find it is a matter of personal taste and I don't believe the author really cares about that in this post.



