
I disagree. At its core, RL is just updating a table of values, and then using function approximation (aka machine learning) for the more complex cases.

I think it might be perceived as math-heavy because the best resources on the topic [0][1] use a lot of math notation to express their ideas. The ideas are subtle and easy to mess up, though. These books are some of the first that made me appreciate math notation: it can look scary, but it conveys the ideas more accurately than words can.

[0] http://incompleteideas.net/ (his "Reinforcement Learning: An Introduction" is a great book)

[1] https://www.manning.com/books/grokking-deep-reinforcement-le... looks promising; I've read the few chapters available and was impressed. It's still an early work with grammatical mistakes, but the layout of the ideas is clear and organized.




> At the core RL is just updating a table of values, and then using function approximation (aka, machine learning) for more complex cases.

This seems to be a common assertion about ML. Other refrains include "ML is just matrix multiplication" and "ML is just affine transformations followed by a nonlinearity".

While technically true, it's an unhelpful comment, as it doesn't shed any light on the salient questions, such as:

* how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?

* how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?

* how do you know your policy network is exploring new states?

It is the equivalent of learning how to draw an owl: draw a few circles, then draw the fucking owl.


It's funny reading these comments. The thing is, if you go through AI ideas and principles as a mathematician, there will be a lot of "hold on", "you can't do that", "that's wrong", and so on. AI simply isn't correct and shouldn't work. There are also many, many cases of "if A works, B should work as well" where A works and B doesn't.

AI is semi-random. That's the truth.

> how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?

You don't. You just try stuff and some of it works.

> how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?

You don't. You just try stuff and some of it works.

> how do you know your policy network is exploring new states?

You don't. You just try stuff and some of it works.

Now there are intuitive answers (that are bullshit) to your questions. The problem is that they don't stand up to scrutiny.

> how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?

You assume the structure of the AI computation itself will have a "sufficiently" correct bias/variance. So the structure of the network and the structure of the updates encode something that works in the state/action space.

In other words: you try dozens of architectures until one works (also known as "hyperparameter search"). To put it another way, any AI algorithm is a huge polynomial, say Ax + By + C. Backprop learns A, B, and C; the "hyperparameters" encode the fact that it's Ax + By + C that gets calculated in the first place. For instance, NEAT is a good hyperparameter search algorithm.
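
To make that split concrete, here's a minimal sketch under my own assumptions (a toy curve-fitting problem, nothing from the thread): gradient descent learns the coefficients of one fixed structure, and an outer loop searches over which structure gets computed at all.

    import numpy as np

    def fit_coefficients(features, y, lr=0.1, steps=500):
        # Gradient descent learns the "A, B, C" for one fixed structure.
        w = np.zeros(features.shape[1])
        for _ in range(steps):
            grad = features.T @ (features @ w - y) / len(y)
            w -= lr * grad
        return w, float(np.mean((features @ w - y) ** 2))

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 200)
    y = 3 * x**2 - x + 0.5 + 0.05 * rng.normal(size=200)

    # The "hyperparameter" search: which structure gets computed at all.
    best = None
    for degree in (1, 2, 3, 4):
        features = np.vander(x, degree + 1)   # columns x^degree, ..., x, 1
        w, mse = fit_coefficients(features, y)
        if best is None or mse < best[1]:
            best = (degree, mse)

    print("best structure: degree", best[0], "with MSE", best[1])

The inner loop is the backprop part; the outer loop is the "try dozens of architectures until one works" part.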

> how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?

This is the problem of reinforcement learning. Literally all the theory is about this very question; just don't expect to read math that holds up. But it's easy to see how, for instance, Q-learning, TD(λ), TD(0), and TD(1) all do this.

How about I put down the intuition behind Q-learning. You have states, actions, and rewards. You are in a state, you pick an action, and you are rewarded (or not). At some point the game ends, and we have "rounds": sequences of moves from the start of the game to the end. You calculate the Q-value, which is the discounted reward given optimal play. Wait, what? You calculate the expected reward given that you play optimally in the future. So the Q-value for (state(t), action(t)) is R(t) + gamma * max over all possible action(t+1) of Q(state(t+1), action(t+1)), in expectation over state(t+1). By playing you constantly recalculate those Q-values, meaning they become more accurate over time, and then you play by simply picking the action with the highest Q-value.
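
Here is that intuition as a runnable sketch of tabular Q-learning. The toy chain environment, the epsilon-greedy exploration, and the learning-rate/discount values are my own assumptions for illustration:

    import random

    # Toy chain: states 0..4, starting at 0; reaching state 4 ends the round
    # with a reward of 1.
    N_STATES, ACTIONS = 5, (-1, +1)
    GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1

    def step(state, action):
        nxt = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if nxt == N_STATES - 1 else 0.0
        return nxt, reward, nxt == N_STATES - 1   # next state, reward, done

    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    for episode in range(200):                    # the "rounds" from the comment
        state, done = 0, False
        while not done:
            # Mostly pick the action with the highest Q-value, sometimes explore.
            if random.random() < EPSILON:
                action = random.choice(ACTIONS)
            else:
                best = max(Q[(state, a)] for a in ACTIONS)
                action = random.choice([a for a in ACTIONS if Q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            future = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (reward + GAMMA * future - Q[(state, action)])
            state = nxt

    # Greedy policy after training: should point toward the rewarding state.
    print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})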

> how do you know your policy network is exploring new states?

One way to do it: when you calculate the gradient, you can see that the gradient is a big honking set of matrices. You stick them all together into one vector and use the norm of that vector (which is, intuitively, a measure of how much the network "learned" from the last experiment) as a reward. Alternatively, you sum the per-matrix norms together and do the same.
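
A rough sketch of that gradient-norm bonus, assuming a PyTorch policy network; the toy observation, the 0.01 scale, and the way the bonus is mixed into the reward are my own guesses, not a standard recipe:

    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

    def gradient_norm(loss):
        # Flatten every parameter gradient into one vector and take its norm:
        # a crude proxy for how much the network "learned" from this experience.
        policy.zero_grad()
        loss.backward()
        flat = torch.cat([p.grad.flatten() for p in policy.parameters()])
        return flat.norm().item()

    obs = torch.randn(1, 4)        # stand-in observation
    action = torch.tensor([0])     # stand-in action that was taken
    logits = policy(obs)
    loss = -torch.distributions.Categorical(logits=logits).log_prob(action).mean()

    env_reward = 1.0               # whatever the environment paid out
    shaped_reward = env_reward + 0.01 * gradient_norm(loss)
    print(shaped_reward)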

There are other metrics. For instance, you train a separate network to predict the future game state, then add how wrong that network is to the reward function. Assuming both networks learn in roughly the same way (one of those "but that's... wrong" things), that gives you an indication of how new a state is: the newer it is, the more the prediction network gets it wrong.
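
And a sketch of that prediction-error bonus (the shapes, the learning rate, and the 0.1 scale are assumptions; the general idea is usually called curiosity-driven exploration):

    import torch
    import torch.nn as nn

    # A separate network that tries to predict the next observation.
    predictor = nn.Sequential(nn.Linear(4 + 1, 32), nn.ReLU(), nn.Linear(32, 4))
    optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

    def curiosity_bonus(obs, action, next_obs):
        # How wrong the predictor is about the next state: the newer the state,
        # the bigger the error, so the bigger the bonus added to the reward.
        pred = predictor(torch.cat([obs, action], dim=-1))
        error = ((pred - next_obs) ** 2).mean()
        optimizer.zero_grad()
        error.backward()           # the predictor keeps learning as the agent plays
        optimizer.step()
        return error.item()

    obs, next_obs = torch.randn(4), torch.randn(4)   # stand-in transition
    action = torch.tensor([1.0])
    env_reward = 0.0
    shaped_reward = env_reward + 0.1 * curiosity_bonus(obs, action, next_obs)
    print(shaped_reward)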

Another way is to have two instances of your AI solver fight each other, while also training a prediction of how the other will do, and using that same network to judge your own play.

You can also have long-term cycles where you switch between reward functions. For instance, sometimes your network just tries to terminate the game quickly, regardless of the outcome. Sometimes surprise is what you're looking for. Sometimes it just wants to see the world burn (lots of rapid action). Sometimes it wants everything to be quiet. Ideally, though, if you go down this path, you need to somehow tell the network which (combination of) rewards you're going for.


Yes, the Bellman equation is the fundamental idea behind all RL algorithms (i.e. updating a table of values), but RL is also much more fragile than supervised learning methods, so to ensure stability modern algorithms rely on complex mathematical tools. I wouldn't say it's the math notation that's scary; rather, the concepts behind modern algorithms require higher-level mathematics. For example, for the Max Entropy algorithm for inverse RL, it's essential to know the concept of Shannon information entropy to understand why it works.
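
For reference, the Shannon entropy being invoked is just a measure of how spread out a distribution is; a quick sketch (not tied to any particular MaxEnt IRL implementation):

    import math

    def shannon_entropy(probs):
        # H(p) = -sum_i p_i * log(p_i)
        return -sum(p * math.log(p) for p in probs if p > 0)

    print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform policy: maximal entropy (log 4 ~= 1.386)
    print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # near-deterministic policy: low entropy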


Information theory does not require graduate-level math. Even if it's entirely new to you, you can pick it up with a few days on Wikipedia.


Ah yeah, I agree it's not a terribly difficult concept to learn, but perhaps we have different definitions of graduate-level math. Things like information theory are, at best, skimmed over in upper-division math classes during undergrad. Having a solid understanding of things like KL divergence and information entropy is an indicator of one's overall math level, and if you are already there, you can consider yourself at graduate level. Though I guess the math used in ML is probably kids' play compared to that of a graduate-level mathematician or physicist.


Granted, I am not an expert in RL. It sounds like you have more experience than me, so maybe I will change my mind as I learn more.



