Graph-Powered Machine Learning at Google (googleblog.com)
371 points by tim_sw on Oct 7, 2016 | hide | past | favorite | 66 comments



While there is a lot to get excited about with ML both as a consumer and as a software developer I can't help feeling a pang of sadness.

Big data, and in this case the relationships (the graph) between big data points, is what's needed to make great ML/AI products. By nature, the only companies that will ever have access to this data in a meaningful way are going to be the larger ones: Google, Amazon, Apple, etc. Because of this I worry that small upstarts may never be able to compete on these types of products in a viable way, as the requirements to build these features are so easily defensible by the larger incumbents.

I hope this is not the case but I'm getting less and less optimistic when I see articles like this.


The graph algorithm they're describing is basically Manaal Faruqui's "retrofitting", although they don't cite him. I will make the charitable assumption that they came up with it independently (and managed to miss the excitement about it at NAACL 2015).

Here's why not to be sad: Retrofitting and its variants are quite effective, and surprisingly, they're not really that computationally intensive. I use an extension of retrofitting to build ConceptNet Numberbatch [1], which is built from open data and is the best-performing semantic model on many word-relatedness benchmarks.

The entirety of ConceptNet, including Numberbatch, builds in 5 hours on a desktop computer.

Big companies have resources, but they also have inertia. Some problems are solved by throwing all the resources of Google at the problem, but that doesn't mean it's the only way to solve the problem.

[1] https://github.com/commonsense/conceptnet-numberbatch


The graph algorithm described in the blogpost is more closely related to label propagation (which is more than 10 years old) than to "retrofitting". And the Google paper linked in the blogpost cites the relevant literature correctly.


I probably sounded more accusatory than I should have, and I apologize for that wording.

But I do think this is much more like retrofitting than like label propagation. It's the vectors that are being propagated, as I understand it, not labels.
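For anyone curious, the core retrofitting update is simple enough to sketch in a few lines of plain Python. This is my own toy rendering of the idea, not Faruqui's exact formulation: each vector is repeatedly pulled toward the average of its graph neighbors while staying anchored to its original embedding.

```python
def retrofit(vectors, neighbors, alpha=1.0, beta=1.0, iters=10):
    """Retrofitting-style update (after Faruqui et al., 2015), a sketch.

    `vectors`   maps word -> list of floats (the original embeddings).
    `neighbors` maps word -> list of neighboring words (the semantic graph).
    Each update sets a word's vector to a weighted average of its
    original embedding (weight alpha) and its neighbors' current
    vectors (weight beta each).
    """
    Q = {w: list(v) for w, v in vectors.items()}
    for _ in range(iters):
        for w, nbrs in neighbors.items():
            if not nbrs:
                continue
            denom = alpha + beta * len(nbrs)
            for d in range(len(Q[w])):
                nbr_sum = sum(Q[u][d] for u in nbrs)
                Q[w][d] = (alpha * vectors[w][d] + beta * nbr_sum) / denom
    return Q

# Toy usage: "scalding" has no useful embedding, but the graph links
# it to "hot", so it gets pulled halfway toward hot's vector: [0.5, 0.0].
vecs = {"hot": [1.0, 0.0], "scalding": [0.0, 0.0]}
graph = {"scalding": ["hot"]}
print(retrofit(vecs, graph)["scalding"])
```

Note how it's vectors, not labels, flowing along the edges, which is the distinction I was making above.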


I can't find a single comment thread on HN where ConceptNet is discussed. Bummer. See here: https://hn.algolia.com/?query=conceptnet&sort=byPopularity&p...

Short of signing up to the mailing list on Google Groups is there any info out there on the net that tells me:

a) How ConceptNet relates to efforts like Cyc's OpenCyc and the big G's Knowledge Graph?

b) How to add domain specific concepts, attributes, and relations to ConceptNet

Both of these things are the first to pop into my mind, but the website (http://conceptnet5.media.mit.edu/) doesn't seem to cover them.


Hm, I know I've brought it up before.

In relation to OpenCyc: ConceptNet doesn't have CycL or any sort of logical predicates -- the assumption is that you are just doing fuzzy machine learning over ConceptNet. I'm pretty sure you need the rest of Cyc to do anything interesting with CycL, though.

ConceptNet takes advantage of linked open data when it can, so it does contain attempted translations of the facts in OpenCyc.

Compared to the Google Knowledge Graph: well, you can't really use the Knowledge Graph without being at Google, or making a research agreement with them, right? But from what I've seen of it, and of its predecessor Freebase, I think it focuses a lot on named entities: particularly things you can look up on Wikipedia or things you can buy. Which is fine information. I'd say it's a different segment of linked data than "what words mean", which is ConceptNet's focus. So maybe think of it as more like a bigger WordNet than a smaller Knowledge Graph.

How to add domain-specific concepts to ConceptNet: the unsatisfying answer is that you can get the code for ConceptNet and alter the build process to include new data sources (try it on the 5.5 branch if you attempt this). And the startuppy sellout answer is that doing this automatically is one of the things my company Luminoso is for (http://www.luminoso.com).

Thanks for asking, and I'll try to clarify this kind of stuff when I deploy the new site.


Thanks for the response. I'll keep an eye on your progress, dig deeper into what you already do, and have a play with your tools.


Wow, there's something funny going on with HN search there. There are quite a few discussions about ConceptNet. https://www.google.com/search?q=site%3Ahttps%3A%2F%2Fnews.yc...


It differs depending on whether you search by Stories or Comments. There are hits if you search by Stories, but none of them contain posts, which is what threw me.


> Big companies have [...] inertia.

This used to be the case. But companies nowadays are often more structured like clusters of small startups, especially for innovative products and moonshots and such.


This is data capitalism. In ordinary capitalism the rich get richer, and in data capitalism those who have data build superior products, get more users, and thus more data, while others starve. Good datasets can be bought, but that is scary in its own way.


"In ordinary capitalism the rich get richer [...]".

Errr, not really? At a minimum you're making a somewhat controversial statement. Lots of people (me included) wouldn't automatically agree that the rich get richer, especially if you mean that "only" the rich get richer, which is what the comment above implies. (As evidence, see established companies that go bankrupt, and new startups that rise from nothing to become dominant players, much like Google itself, which has existed for a relatively brief period, or something like Uber.)

So let's hope that data capitalism is like regular capitalism!


Controversial? There are ton of articles like this one:

http://www.latimes.com/business/hiltzik/la-fi-hiltzik-ft-gra...

On growing income inequality and the transfer of wealth to the 1%.

You can argue about whether it's right or wrong, but it's a bit of a whopper to say it doesn't happen.


I'm rich and part of the financial 1%. I can indeed confirm the rich get richer. I'm not going to enumerate the specific methods, but I do feel bad that middle class or bay area software engineers cannot employ certain financial tricks that only the rich can do.

As a teaser, one important thing is to not share information (i.e. Don't blog about financial "hacks"). Call it a corrupt moral compass or whatever you like, but the fact is this behavior is rampant.


I think education is the biggest differentiator of this generation, and that boils down to school district (and everything that entails: race, wealth, etc.).

That something so important, with such wide-reaching implications, is decided at such an early stage of one's life is quite frankly a disgrace.

Only the most determined break the cycle, and they are too few and far between; breaking it is not really encouraged at most levels of society around the world, due to the norms.


to anybody curious about this dude's 'financial hacks' - just hire an accountant

the vast majority of financial tricks require scale (of capital) - e.g. itemized deductions, business expenses, even tax-loss harvesting - but they aren't secret!


The tricks you mention are only about avoiding taxes, I wouldn't call those "rich getting richer" tricks, I'd call them rich preserving their capital tricks. Rich getting richer is more along the lines of hearing about investment opportunities while golfing with your banker, insider trading, having the capital to play in certain markets that only big money is allowed to play in, being able to take huge risks. Angel and VC investing being in that last category.


Capitalism is around 200 years old. Most of the extreme income inequality in the States has been going on for only around 30-40 years or so. So I don't think you can necessarily blame only capitalism for this.

Also, I don't think anyone knows the exact causes of this growing inequality - some libertarians, for example, would argue that it's excessive regulation that is causing part of the problem - which isn't capitalism, it's the opposite of capitalism.

But most importantly, note that I specifically talked most strongly against the idea that only the rich get richer. Just as an example, how many of today's billionaires are relatively new to the game? It's not all of them, not by a long shot. And while some minimum level of "being rich" is certainly a factor (in the sense that everyone who lives in a 1st-world country is rich), there are lots of people on that list who started with relatively little.


I suggest reading Thomas Piketty's book "Capital in the Twenty-First Century".

One of Piketty's points is the relationship between the rate of growth of the economy, `g`, and the rate of return on capital, `r`. If the rate of return of capital exceeds the rate of growth of the economy, that is `r > g`, then inequality of wealth increases over time. As they say, "the rich get richer".

These conditions (`r > g`) have been observed in historical data under normal conditions (excluding events like the world wars), and we expect them to continue into the future, particularly given projections of declining or stabilising global economic growth rates [1].

The period after the two world wars (with relatively high equality) is unusual in history, and now that it's over we're seeing inequality of wealth increase, inherited capital become an increasingly relevant factor, etc.

[1] let's ignore the other perspective of whether continual economic growth is actually feasible or desirable given the reality of hard environmental limits.
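A toy calculation (my own simplification: it assumes returns are fully reinvested and ignores consumption, taxes, and redistribution) shows how fast the gap compounds when `r > g`:

```python
def capital_vs_economy(r, g, years):
    """Ratio by which fully reinvested capital (growing at rate r)
    outpaces the overall economy (growing at rate g) over `years`."""
    return ((1 + r) / (1 + g)) ** years

# With r = 5% and g = 1.5%, roughly in line with Piketty's long-run
# historical figures, capital outgrows the economy by a factor of
# about 5 over 50 years, i.e. within two generations.
print(capital_vs_economy(0.05, 0.015, 50))
```

When `r == g` the ratio stays at exactly 1, which is why the relationship between the two rates is the crux of the argument.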


I haven't read the book, and only have a high school knowledge of economics, but that seems fairly intuitive to me.

Wealth is effectively "liquid power", and "power" is the ability to get other people to do what you want. One of the absolute most obvious things you'd want them to do is... give you more power.

So I think it's a pretty natural outcome that, absent other forces, any power imbalance will tend to magnify over time.

Times in history where the entire economy is growing very fast basically mean new power is raining down on all people uniformly. That will tend to reduce disparity in the same way that adding the same positive number to both the numerator and denominator leads to a fraction closer to one.

Of course, this simplified model treats every person as an island. Where the story gets more complex is when you consider people working together in a group. And I think through most of history when you've seen power imbalances get reduced, it's because you've seen people work together to form groups that have greater power than the smaller number of individuals they are pushing against.

One of the things that really scares me about the US today is how much we've culturally lost that ability to organize and work together. And, of course, the small number of increasingly powerful people and groups like it that way, as they always have.


The rich get richer in capitalism because they have more money to leverage into better deals, and more resources for researching better ways to multiply their money than poor people do.


Most successful (Valley) startups don't start from nothing, far from it; they start from a vast network of other rich people's money, usually VC, occasionally the state.


Well, things are of course much more complicated than I have presented in my short comment. And economists are still debating whether Piketty's conclusions are valid.

But the dynamic I have described is certainly there. And it can become dominant.


This is not controversial at all; it is a well-known and empirically confirmed trend since the middle of the last century.


Pure unregulated capitalism (which America very much is not) is Darwinian in how similar it is to natural selection. In natural selection those at the top of the food chain sometimes die off, but in general I agree with the GP poster.


I don't think it's a great idea to apply Darwinism to fields outside of biology, especially in the social sphere. It led to a lot of bad science in the past.


It sure as hell seems like natural evolution is being applied to the social sphere. Statistically, children from "well off" families do consistently better across all life pillars (health, money, family, friends) than poorer families. Come on, financially richer families are less fat/obese than poorer families. This is a direct link to living and longevity (Darwinism).


Yes, but the two groups exist in, and are adapted to, two totally different environments. I imagine that some in the well-off population see the others as a different species. Thus, strict Darwinism doesn't apply here.

To truly see who is more fit, I propose that we take a group of 12, 6 from each class, and drop them off on an uninhabited island. We will places weapons and traps at strategic locations. We shall call this experiment Project Craving Romp.


May the odds be ever in their favor!


It's pretty hard to say that human interaction is outside biology without invoking the metaphysical


Yeah, and guess who are the people behind these falling corporations and these new startups? Yes, it is them. The bankers. Wall Street. The Rothschilds. TPTB. Call them what you like, but the money is in the hands of a few, and without money you cannot build a startup.


Agreed, and we need the data equivalent of antitrust.


A company like Google, which has a virtual monopoly on search queries, can make decisions like removing referrers to the downstream sites where they send users, thus depriving everyone but Google of access to the actual keywords. While doing this obviously benefits Google greatly, they can also claim it benefits the user, since their query keywords aren't being shared with a third party. It is complicated to think about monopoly vs. privacy in this context.


Is open data + copyleft licenses a possibility?


It is not always feasible - consider for example the dataset of all gmail messages.


A lot of the datasets behind the big image recognition and reinforcement learning results (especially recently) have been open sourced:

- https://research.googleblog.com/2016/09/announcing-youtube-8...

- https://techcrunch.com/2016/10/05/udacity-open-sources-an-ad...

- http://image-net.org/

There are also more paid, private dataset services now:

- https://fantasydata.com/pricing/nfl-data-api.aspx

- https://www.quandl.com/

I could see there being a lot of opportunity for startups that democratize data, which would then unlock more startups to build novel ML architectures on top of.


I know for a fact (and you can check online too) that Google's non-public datasets are orders of magnitude larger in training size and far superior in learning signal (the labels are very close to their task).

Google has thousands of real robotic manipulator arms to train their reinforcement learning algorithms. Algorithms trained using the OpenAI gym are practically guaranteed to fail in the real world since OpenAI gym is (designed as) a perfect simulator. Don't even get me started on internal image and speech recognition datasets collected from YouTube, search indexing, and Hangouts/Duo calls.

There is no way in hell you can compete with this. The closest you have is Facebook, but they've only just built an AI that has learned how to play Go, let alone defeat a world champion. Hell, Google is already working on beating humans at Starcraft.


It depends on the vision and products too, imo. Tesla is a much smaller car company than Honda, but Musk just announced that they have collected 222m Autopilot miles from Tesla cars. I assume they have been feeding all that data to ML/AI systems to create fully autonomous vehicles (Level 4). I don't think Honda has designed its cars in a way that allows it to collect data from its Honda Sensing features while the cars are driven.


> By nature the only companies that will ever have access to this data in a meaningful way are going to be larger companies: Google, Amazon, Apple, etc. Because of this I worry that small upstarts may never be able to compete on these types of products in a viable way as the requirements to build these features are so easily defensible by the larger incumbents.

Wow, what's with all the fatalism around here? C'mon, people have been making variations of this argument for hundreds of years, but small companies still keep coming along and knocking the KOTM off its perch. There's always a new way to do things, a new angle, a new hack, something that hasn't been done before, that can (somewhat) level the playing field (for a while).


1) Buy a $200 drone, fly it around for a day and record video output.

2) Write a web scraper and run it for a day on an AWS instance.

3) Download any of the publicly released datasets. There are a lot of interesting ones out there, including medical imaging, etc.

4) Hack a video game engine to generate data.

5) Write your own data generator (could even be a generative NN model). Not all data can be simulated, but when it can be, doing so is incredibly powerful.

A few of the above options were not available just 5-7 years ago. Data is becoming more publicly accessible, not less.
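To make point 2 concrete, here's a minimal stdlib-only sketch of the scraping half (the URL and what you extract from each page are obviously up to you, and a real crawler would add rate limiting and robots.txt handling):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect every href target from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def scrape_links(url):
    """Fetch one page and return the links found on it."""
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

Feed the collected links back into a queue and you have the skeleton of a crawler; point it at pages with the labels you care about and you've bootstrapped a dataset in an afternoon.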


That doesn't preclude the same outcome as globalization where the "rich" accumulate a much greater proportion of overall wealth while the "poor" are still left much better off in absolute terms. To say nothing of falling in-between.


I wrote about this a few months ago [1]

The one thing that makes me feel slightly better about this is the recently announced AmaGoogFaceSoft partnership on training data open sourcing [2].

I say slightly because they still control the pipes and what is released, so they'll likely never release the good data, partly because it would probably cause privacy issues if they did. It's incumbent on us as the little guys and girls to make new data pipes through innovative products.

[1] https://medium.com/@andrewkemendo/the-ai-revolution-will-be-...

[2] http://www.partnershiponai.org/


We could build software to help people collect their own data, and make it easier to do things with that information (combination of education, motivation, complementary tools, etc).


The techniques they used in the post actually try to overcome the lack of labeled data (by using semi-supervised techniques / label propagation).

You are right that Google has a lot more raw data, but that is much cheaper to obtain, and depending on your problem domain, there could be a bunch of public datasets that are available for you to cobble together.

Some of these big tech companies don't necessarily have the right training data for their applications either. For example, Adam Coates from Baidu outlined some of the ways that they generated a better training set for their speech recognition use case here: https://youtu.be/g-sndkf7mCs?t=3817
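To get a feel for how little machinery basic label propagation needs, here's a toy Zhu & Ghahramani-style sketch in plain Python (my own minimal version: clamped seed labels, iterative neighbor averaging; the production systems scale this same idea to billions of edges):

```python
def propagate_labels(adjacency, seed_labels, iters=50):
    """Minimal label propagation sketch.

    `adjacency`   maps node -> list of neighbor nodes.
    `seed_labels` maps a few labeled nodes to a score (0.0 or 1.0).
    Unlabeled nodes start at 0.5 and repeatedly take the average of
    their neighbors' scores; seed nodes stay clamped. Returns the
    converged score per node.
    """
    scores = {n: seed_labels.get(n, 0.5) for n in adjacency}
    for _ in range(iters):
        for node, nbrs in adjacency.items():
            if node in seed_labels:
                continue  # labeled seeds never change
            if nbrs:
                scores[node] = sum(scores[m] for m in nbrs) / len(nbrs)
    return scores

# A 4-node chain 0-1-2-3, with node 0 labeled 0.0 and node 3 labeled 1.0.
# The interior nodes converge to the harmonic solution: 1/3 and 2/3.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(propagate_labels(graph, {0: 0.0, 3: 1.0}))
```

Only two of the four nodes needed labels; the graph structure filled in the rest, which is exactly the appeal for label-starved domains.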


Couldn't blockchain technologies, along with big open source projects, become a great source of big data? I'm thinking of Augur for one, but a lot of projects connected to the Ethereum network feel like they could be great sources of data to analyze.

I agree that big data centralisation is a problem right now, but I don't believe it will last. I mean, those big companies will remain, but blockchain tech is fairly likely to take a big spot in there with them, I believe.


You have gone off on a tangent without explaining the how of your opinion.


Yes, and it's a land grab for those domains with untapped data and potential for ML/AI related products. New startups that can identify these untapped areas will succeed.


Stop thinking about products and money based on ML/AI, start thinking about AGI, research, science and the possibilities AGI could bring. You don't need big data to build AGI.

This is a step in the right direction, and from experience, a small research-only startup could achieve very similar results.

More than likely, the amount of data is a detriment to creating a better system.


I'm more optimistic about this. I think of it in terms of Big Co has tons of data. Does all the learning. Shares the learning through API for a small fee (very small relative to Small Co having to come up with the learning). For example I upload a picture and Google tells me it is a cat.


This technique could help break the monopoly of big data, as it relies less on labelling, which for a large dataset is even harder and more expensive than collecting the actual data.


Even if you have the data, you also need the processing power.


I'm a big fan of the smart replies in Allo and Inbox so this was a fun read. I did something similar in grad school where I and some other students manually labeled a handful of sentences and then used graph-based semi-supervised learning to label the rest for the purposes of using it as a training dataset. It would be neat to hear what technology they used for the graph-based learning; perhaps Cayley? We used Python's igraph at the time but it was pretty slow. It would also be interesting to try this in Neo4j.


The algorithm is well known. The main contribution is that they were able to scale it up to a huge graph. Google doesn't go into much detail about that because it's proprietary. So what's the point of this blog post? Oh yeah, free Allo/Inbox advertisement.


This is really cool. The underlying graph optimization algorithm seems similar to how graphical models are trained. Is that correct? Can someone please help me understand the difference?


Yes, label propagation is very similar to belief propagation on a Gaussian Markov random field (a particularly simple type of graphical model). To be precise this is 'inference' on the graphical model, rather than 'training' or 'learning'. See the author's 2016 JMLR paper for more details: http://www.jmlr.org/proceedings/papers/v51/ravi16.pdf
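In case it helps intuition, both views minimize the same kind of quadratic objective (the notation below is mine, a generic form rather than the paper's exact one): seed terms anchor the labeled nodes, edge terms smooth scores over the graph.

```latex
% Generic graph-regularized objective (notation mine):
% alpha_i anchors node i to its seed value \hat{q}_i,
% w_{ij} smooths connected nodes toward each other.
\min_{Q} \; \sum_{i} \alpha_i \,\lVert q_i - \hat{q}_i \rVert^2
        \;+\; \sum_{(i,j)\in E} w_{ij} \,\lVert q_i - q_j \rVert^2
```

Because the objective is quadratic in Q, the minimizer is the solution of a sparse linear system, which is exactly MAP/mean inference in a Gaussian MRF whose precision matrix is the graph Laplacian plus the diagonal anchor weights.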


I mean, it should be... you are learning a probability distribution over the nodes and then performing some kind of inference at eval time.


Great stuff. I have been using machine learning since the late 1980s. Semi-supervised learning is new to me, and it is exciting to be able to mix labeled and unlabeled data.


Anyone know if they are going to release code in TensorFlow that works with public data? I have been using the smart reply feature in Inbox and it's very useful.


These algorithms can't really make much use of Tensorflow.


I see this as a kNN approach in which the distance is a function of the strength of the edge between the vertices.

In the retrofitting paper cited in the comments there is a process of smoothing, that is, feeding the message or information back to update the states of the graph (there's an example in the modern AI book). It doesn't seem like anything new.


Is there something in Expander that makes this happen, or can something like Apache Spark be used as well?


As a software engineer who knows nothing about ML, where's the best place to learn?


Is there an implementation released, maybe? Similar to what they did with TensorFlow.


Unrelated, but none of the Google blog links open up in Safari on iOS. The link opens but the text doesn't show.


This is probably due to your ad blocker. Same here: the page doesn't work if you block the tracking. After disabling Focus it's working.


That this is the case, and that this is the explanation, is unacceptable. Not that I blame the parent, but I would like to add the statement, as we might otherwise forget, that information could be available without the tracking mafia.



