Hacker News
Beyond PageRank: Learning with Content and Networks (measuringmeasures.blogspot.com)
91 points by prakash on Jan 1, 2010 | 14 comments



It speaks well for Flightcaster that founder Bradford Cross can so effectively communicate such complex concepts in such a simple way to a general audience. Excellent post.


That is very kind of you.

It would be great if more researchers did some pieces like this from time to time.

Things are going very fast right now and it is creating a large gap in general awareness as to what is happening.

This is true not only for the general public, but also for researchers with different specializations. I think there is also room for a more technical version of this kind of essay to serve like a cross-specialization review paper.


"I want to switch gears slightly and talk about other ways we use humans in modern research."

I thought the first argument, regarding the use of humans to conserve resources associated with time, data and domain expertise, was right on and very articulate.

I expected this argument to be extended in the '40 years' section of the article. After all, isn't the shift to computer-driven learning simply a threshold between the utility / cost of supervised/semi-supervised learning vs unsupervised learning? In other words, couldn't you say that at some point the 'ways we use humans' become too expensive vs non-human methods? When this threshold is crossed is when 'non-biological entities have more reasoning and processing power than biological entities.' Maybe just another way of saying the same thing.

In any case, a great post. Worth reading closely in my opinion.


Um, no.

The trade-off isn't between utilizing humans VS utilizing computers. It's the R&D cost of building simple, proven systems and paying peanuts into mturk VS the R&D cost of complex, sophisticated systems multiplied by their probability of working.

In other words, our tightest bottleneck is not the cost of supervision, but rather our understanding of the strategies and abilities needed to tackle a given problem domain using only the limited tools of a fancy abacus.

And since this is HN, I'd like to digress here and propose that the curse of researching such strategies and abilities is that, from an investor's POV, the math simply doesn't add up. The R&D phase can go on for practically as long as the budget; the probability of even incremental success is slim; and except for a couple of well-defined domains, applicability is negligible. Investors have two AI winters[1] to point at and laugh about, which I suspect is the main reason ML/IR/NLP keep re-marketing their terminology. This is a very deep problem I would pay for humans to tackle.

To deconstruct the second part of your argument, the notion of "having more reasoning and processing power than biological entities" is naive at best. In the context of human brains, neither "reasoning power" nor "processing power" can be quantified in any way that would even resemble our quantification of these properties in computers. It's like comparing the Na'vi[2] to Steve Jobs.

This naivety stems from the common misconception of presuming a singular/general reasoning capability, instead of a fairly large number of highly specialized modules, each responsible for a single type of cognition and reasoning. An important point to note here is that what our brain does VS what we are building[3] differ significantly: the market is much more interested in functional diversification than in exchanging pennies of supervision for pennies of computing power.

In conclusion, the "reasoning and processing power" of humans and of ML/IR/NLP/etc. systems are incomparable not only in terms of quantification, but also in their evolutionary environment, and thus in their functional goals.

[1] http://en.wikipedia.org/wiki/Ai_winter

[2] http://en.wikipedia.org/wiki/Navi#Na.27vi

[3] http://stackoverflow.com/questions/1050696/the-business-of-a...


The article was addressing the validity of the viewpoint that 'non-biological entities have more reasoning and processing power than biological entities... I think this is all going to happen in the next 40 years...' I tend to agree with you that there is no known quantitative measure to answer this question. Instead (because there is no known quantification method), I proposed the question: if a given learning problem can be solved most cost-effectively using only computers, has it satisfied this requirement (of a non-biological entity achieving superior reasoning on a specific problem or set of problems)?

You said 'no', due to the cost of R&D. However, computer-driven R&D, particularly in the form of generative hypothesis testing, may one day be able to perform the job of 'understanding the strategies' more accurately and at a lower cost than a human.

Let's say your cost of learning is a function of the resource limitations the article laid out, i.e., Cost of Learning = Cost of R&D + Cost of Data + Cost of Domain Expertise. What I am suggesting is that if this Cost of Learning is lower for a computer alone than it is for a computer + human, then this is perhaps an adequate test to say 'In this problem area, a non-biological entity has achieved superior reasoning compared to a biological entity'.
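
Something like this sketches the test I have in mind (every cost figure below is invented, purely to make the threshold concrete):

    # Hypothetical cost-of-learning comparison; all figures are made up
    # to illustrate the threshold argument, not estimates of real costs.
    def cost_of_learning(rd, data, domain_expertise):
        return rd + data + domain_expertise

    # computer + human (e.g. paying Mechanical Turk for labels)
    human_in_loop = cost_of_learning(rd=50000, data=20000, domain_expertise=30000)

    # computer only (more R&D up front, cheaper data and expertise after)
    computer_only = cost_of_learning(rd=120000, data=20000, domain_expertise=5000)

    if computer_only < human_in_loop:
        print("on this problem, the non-biological entity wins")
    else:
        print("humans in the loop are still cheaper")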

btw, I tend to agree with your digression as well. This is a deep problem that is not really investor friendly, but needs sponsorship one way or another to advance the state of the art.


Picking single-aspect reasoning weakens the condition (and thus the strength of the argument) quite a bit. If you take a single learning problem domain, there are a number of fully unsupervised ML/IR systems already out there with superhuman reasoning capabilities.

The second part of your argument is a funky one. Fully autonomous R&D-capable AIs (AKA self-improving: a problem-solving / reasoning system capable of creating new, and improving upon existing, problem-solving / reasoning systems - which, coincidentally, might include itself) are the modern equivalent of the philosopher's stone. The problem has not only withstood attacks in ways that few other problem domains have, but seems to be AI-hard[1] by itself. As I laid out above, market forces do not favour this kind of R&D; thus I predict this problem will remain unsolved for the short-term future.

On the other paw, the second parameter of your cost function - "cost of data" - can be exchanged for "amount of available data", which seems to be in an explosive boom from where I stand. Thus, while we're talking predictions, I would hypothesize a short-term future with slow, incremental improvements in tackling domains, combined with an exponentially growing amount of available data, letting even relatively naive learning methods produce results well within the optimal range.
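
To make that concrete, here is a toy sketch (scikit-learn assumed, synthetic data): take a deliberately naive learner and just keep feeding it more samples; the point is the trend in accuracy, not the task.

    # Naive learner + synthetic data: watch accuracy climb with sample size.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=50000, n_features=40,
                               n_informative=15, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    for n in (100, 1000, 10000, 40000):
        clf = GaussianNB().fit(X_train[:n], y_train[:n])
        print(n, round(clf.score(X_test, y_test), 3))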

It will still require humans. It will still require R&D. But the "cost of learning" will be so much lower, that the process will be routinely employed (successfully) by developers; thus diversifying the overall software market even further.

I know it doesn't make for a fancy movie, but throwing away 6.5M years of R&D by mother evolution doesn't seem to be a wise diversification move.

[1] http://en.wikipedia.org/wiki/AI-complete


bradfordcross: good overview, with one minor point. I wouldn't think of Pandora's problem as one mitigated with semi-supervised learning. That's usually applied to a situation where you have a small number of labeled points and a whole mass of unlabeled data; often the task is then to determine low density regions to define boundaries of natural clusters.

In Pandora's case, they have TONS of labeled data. All you'd need to do would be to run a decision tree (or a categorical-variable version of PCA) to (1) determine that many of those features are strongly statistically dependent and (2) reduce the number that need to be populated for any given song.
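
A rough sketch of point (1), with a random 0/1 matrix standing in for Pandora's hand-labeled attributes (scikit-learn assumed): train a shallow decision tree to predict each attribute from all the others; any attribute it predicts well is a candidate to infer rather than hand-label.

    # Which hand-labeled attributes are largely predictable from the rest?
    # The label matrix here is a random stand-in, not Pandora's actual data.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n_songs, n_attrs = 5000, 50
    labels = rng.integers(0, 2, size=(n_songs, n_attrs))  # 0/1 attribute matrix

    for j in range(n_attrs):
        X = np.delete(labels, j, axis=1)   # every other attribute
        y = labels[:, j]                   # the attribute we try to infer
        score = cross_val_score(DecisionTreeClassifier(max_depth=5), X, y, cv=3).mean()
        if score > 0.9:                    # strongly dependent -> stop hand-labeling it
            print("attribute %d is largely predictable from the others (%.2f)" % (j, score))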

You could probably also do supervised learning on their massive sound database to infer lots of these features automatically (i.e., I bet you can pick out male vs. female vocalists without having someone listen to it).
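
A sketch of that idea, assuming librosa for the audio features; the file names and labels are placeholders:

    # Infer an attribute like male vs. female vocalist directly from audio,
    # using tracks that were already hand-labeled as the training set.
    import numpy as np
    import librosa
    from sklearn.linear_model import LogisticRegression

    def mfcc_summary(path):
        y, sr = librosa.load(path, duration=30)            # first 30 seconds
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return mfcc.mean(axis=1)                           # crude per-track summary

    labeled = [("track_a.mp3", 1), ("track_b.mp3", 0)]     # 1 = male vocalist (placeholder)
    X = np.array([mfcc_summary(p) for p, _ in labeled])
    y = np.array([label for _, label in labeled])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([mfcc_summary("unlabeled_track.mp3")]))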

Combining these (supervised learning on historical data + decision tree on historical data) would likely vastly increase their per-song labeling throughput. Only "global" features like song genre would have to be input by humans.


This is the same point I am making. If they are still manually curating each song at 30 mins each, they could just stop, use the labels they already have, and infer the rest through semi-supervised learning, or learning the target labels based on the destructured tracks.
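
For the semi-supervised route, a minimal sketch with scikit-learn's label spreading; the feature matrix is a placeholder for whatever gets extracted from the tracks:

    # Infer one attribute for the uncurated songs from the curated ones.
    # Rows labeled -1 are songs nobody has hand-curated yet.
    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    rng = np.random.default_rng(0)
    X = rng.random((1000, 20))               # placeholder track features
    y = np.full(1000, -1)                    # -1 = unlabeled
    y[:100] = rng.integers(0, 2, 100)        # the 100 songs already hand-curated

    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
    inferred = model.transduction_           # a label for every song, inferred where missing
    print(inferred[:10])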


It is a good approach and one I share. NLP has done it for years: once you have a tagged corpus, it is so much easier to then work on your data. I also like your idea of using Mechanical Turk to gain traction on the manual tagging; in any case, that is probably what super-intelligent computers would do in the 40-year span - use humans to tag - before they super-massively carry out the balance of the calculations! :)
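
For what it's worth, the NLP version of "once you have a tagged corpus, the rest is much easier" fits in a few lines (NLTK's bundled treebank sample assumed):

    # A tagged corpus makes building a baseline tagger almost trivial.
    import nltk
    from nltk.corpus import treebank
    from nltk.tag import UnigramTagger

    nltk.download("treebank")                    # small tagged sample shipped with NLTK
    sents = list(treebank.tagged_sents())
    tagger = UnigramTagger(sents[:3000])
    print(tagger.evaluate(sents[3000:]))         # accuracy on held-out sentences
    print(tagger.tag("The flight was delayed again".split()))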

One area the article did not touch on is how to introduce controls to identify 'rigging' of the system, i.e., similar to controlling link farms at Google. This is where, in my opinion, the problem is turned the other way around.


This is a great article. I posted on the same topic but different perspective today as well: http://fitnr.com/filtering-the-web-of-noise/


Thanks! Likewise - the filtering problem needs to be solved.

Funny that we are discussing it on HN, where the paradigm is exactly the link-curation approach you are talking about in your post.


Indeed. From the perspective of a learning algorithm, what would it think about either of these links? They've both been shared on Twitter, both been retweeted at least once on Twitter, and one has been curated here on Hacker News.

Alas, ultimately actions taken by people can always be simulated by spammers. We talked about this a bit on Fred's post: how do you know the act of sharing is legitimate and not spam? Hacker News employs votes, but that data isn't easily extracted by a learning algorithm in a general sense. I can only assume that part of solving this problem in totality is coming up with a way of determining the realness of the human curator. Otherwise, spammers will attack as soon as it gains any sort of momentum.


Yep, it's a good place to start if you want to research the subject further - lots of avenues to take, all in one spot.


Thanks for this great post; it really sets a solid platform for exploring AI and its future development. So glad you highlighted that Google's technology is so much more than just PageRank, which people seem to want to oversimplify quite often.



