
> They tested it on problems from recent contests. The implication being: the statements and solutions to these problems were not available when the Github training set was collected.

Yes, and I would like to know how similar the training and test datasets actually were. Suppose the models were trained only on greedy algorithms and then I provided a dynamic programming problem in the test set, (how) would the model solve it?
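To make the greedy-vs-DP gap concrete, here is a toy sketch (my own illustration, nothing to do with AlphaCode's actual training data): coin change with denominations {1, 3, 4}, where the largest-coin-first greedy heuristic is suboptimal but a simple DP recurrence is exact.

    # Coin change with coins {1, 3, 4}: greedy (take the largest coin first)
    # returns 3 coins for amount 6 (4+1+1), while DP finds the optimum, 2 (3+3).

    def greedy_coins(amount, coins=(4, 3, 1)):
        count = 0
        for c in coins:                 # coins sorted largest-first
            count += amount // c
            amount %= c
        return count

    def dp_coins(amount, coins=(1, 3, 4)):
        INF = float("inf")
        best = [0] + [INF] * amount     # best[a] = min coins needed to make a
        for a in range(1, amount + 1):
            best[a] = min((best[a - c] + 1 for c in coins if c <= a), default=INF)
        return best[amount]

    print(greedy_coins(6), dp_coins(6))  # -> 3 2

Jumping from the first kind of solution to the second is exactly the leap I would want to see the model make without having similar DP code in its training set.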

> And yet, many humans who participate in these contests are unable to do so (although I guess the issue here is that Github is not properly indexed and searchable for humans?).

Indeed, so we don't know what "difficult" means for <human+indexed Github>, and hence we cannot compare it to <model trained on Github>.

My point is, whenever I see a new achievement of deep learning, I have no frame of reference (apart from my personal biases) for how "trivial" or "awesome" it is. I would like to have a quantity that measures this - I call it generalization difficulty.

Otherwise the datasets and models just keep getting larger, and we have no idea of the full capability of these models.




> Suppose the models were trained only on greedy algorithms and then I provided a dynamic programming problem in the test set, (how) would the model solve it?

How many human beings do you personally know who were able to solve a dynamic programming problem at first sight without ever having seen anything but greedy algorithms?

DeepMind is not claiming they have a machine capable of performing original research here.

Many human programmers are unable to solve DP problems even after having them explained several times. If you could get a machine that takes in all of Github and can solve "any" DP problem you describe in natural language with a couple of examples, that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.


> that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.

That's not the point being made. The point OP is making is that it is not possible to understand how impressive a model is at "generalizing" to uncertainty if you don't know how different the training set is from the test set. If they are extremely similar to each other, then the model generalizes weakly (this is also why the world's smartest chess bot needs to play a million games to beat the average grandmaster, who has played fewer than 10,000 games in her lifetime). Weak generalization vs strong generalization.

Perhaps all such published results should contain info about this "difference" so it becomes easier to judge the model's true learning capabilities.


I guess weaker generalisation is why it's better, though. It converges more slowly, but in the end its knowledge is more subtle. So my bet is that with more compute, programming and math get "solved" - not in the research sense, but as a very helpful "copilot".

The real fun will begin once someone discovers how to make any problem differentiable, so the trial-and-error method isn't needed. I suggest watching the recent Yann LeCun interview. That would solve research as well.


> How many human beings do you personally know who were able to solve a dynamic programming problem at first sight without ever having seen anything but greedy algorithms?

Zero, which is why if a trained network could do it, that would be "impressive" to me, given my personal biases.

> If you could get a machine that takes in all of Github and can solve "any" DP problem you describe in natural language with a couple of examples, that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.

I agree with you that such a machine would be awesome, and AlphaCode is certainly a great step closer towards that ideal. However, I would like to have a number that measures the "awesomeness" of the machine (not an Elo rating, because that depends on a human reference), so I will have something as a benchmark to refer to when the next improvement arrives.


I understand wanting to look at different metrics to gauge progress, but what is the issue with this?

> not an Elo rating, because that depends on a human reference


The Turing Test (https://en.wikipedia.org/wiki/Turing_test) for artificial intelligence required the machine to convince a human questioner that it was a human. Since then, most AI results have been measured against a human reference of performance to showcase their prowess. I don't find this appealing because:

1) It's an imprecise target: believers can always hype and skeptics can always downplay improvements. Humans can do lots of different things somewhat well at the same time, so a machine beating human-level performance in one field (like identifying digits) says little about other fields (like identifying code vulnerabilities).

2) Elo ratings and similar metrics are measurements of skill, and can be brute-forced to some extent - the equivalent of grinding up levels in a video game. Brute-forcing a solution is "bad", but how do we know a new method is "better/more elegant/more efficient"? For algorithms we have Big-O notation, so we know (brute force < bubble sort < quicksort); perhaps there is an analogue for machine learning.
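For context, the Elo update itself is just relative bookkeeping - a rough sketch of the standard formula (my simplification; real rating systems add more machinery), which shows why a rating only means anything with respect to the pool of opponents it was earned against:

    # Standard Elo update: a rating only encodes expected results against the
    # current opponent pool, which is why it needs a (human) reference population.

    def expected_score(r_a, r_b):
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def update(r_a, r_b, score_a, k=32):
        # score_a is 1 for a win, 0.5 for a draw, 0 for a loss
        return r_a + k * (score_a - expected_score(r_a, r_b))

    print(update(1500, 1700, 1))  # underdog win: roughly 1524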

I would like performance comparisons that focus on quantities unique to machines. I don't compare the addition of computer processors with reference to human addition, so why not treat machine intelligence similarly?

There are many interesting quantities with which we can compare ML models. Energy usage is a popular metric, but we can also compare the structure of the network, the code used, the hardware, the amount of training data, the amount of training time, and the similarity between training and test data. I think a combination of these would be useful to look at every time a new model arrives.
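As a crude sketch of that last metric (hypothetical helper names; real de-duplication pipelines are much more involved), one could at least report the maximum n-gram overlap between each test problem statement and the training corpus:

    # Crude proxy for train/test similarity: max Jaccard overlap of word n-grams
    # between a test problem statement and every training statement.

    def ngrams(text, n=3):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def nearest_train_similarity(test_statement, train_statements, n=3):
        test = ngrams(test_statement, n)
        return max((jaccard(test, ngrams(t, n)) for t in train_statements), default=0.0)

A score near 1.0 would suggest the "new" problem is basically a paraphrase of a training one; a score near 0.0 would suggest the model had to generalize much further.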


Using my previous chess analogy: the world's smartest chess bot has played a million games to beat the average grandmaster, who has played fewer than 10,000 games in her lifetime. So while they will both have the same Elo rating, which measures how good they are at the narrow domain of chess, there is clearly something superior about how the human grandmaster learns from just a few data points, i.e. strong generalization vs the AI's weak generalization. Hence a task-specific Elo rating does not give enough context to understand how well a model adapts to uncertainty. For instance, a Roomba would beat a human hands down if there were an Elo rating for vacuuming floors.



