DRL is fun, that's what matters! :) Classical non-biological RL has some strong assumptions (Markovian) that may not hold in the real world (it's nice to see fixed points of the Bellman operator in theory), but for some reason, once you add the "Deep" part, magic happens.
I disagree with the conclusion. The article has some critiques of DRL, but I don't think those invalidate the field as a whole. The article even has a section called "When could Deep RL work for me?"
Author here: this is meant as an overview of the field, with links to relevant resources, not a preview of the courses we give at Insight, since every project here is different!
I think "learn" is a bit misleading here but I do have to say it's a nice and intuitive overview of RL.
RL is quite hard and math-heavy; I don't know if one can take a shortcut in learning RL without a solid graduate-level math foundation.
I disagree. At its core RL is just updating a table of values, and then using function approximation (aka machine learning) for more complex cases.
I think it might be perceived as being math-heavy because the best resources on the topic [0][1] use a lot of math notation to express their ideas. The ideas are subtle and easy to mess up, though. I think these books are some of the first that made me appreciate math notation: it can look scary, but it conveys the ideas more accurately than words can.
> At its core RL is just updating a table of values, and then using function approximation (aka machine learning) for more complex cases.
this seems to be a common assertion about ml. other refrains include "ml is just matrix multiplication" and "ml is just affine transformations followed by nonlinearity".
while technically true, it is an unhelpful comment as it doesn't shed any light on the salient questions, such as:
* how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?
* how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?
* how do you know your policy network is exploring new states?
it is the equivalent of learning how to draw an owl: draw a few circles, then draw the fucking owl.
It's funny reading these comments. The thing is, if you go through AI ideas and principles as a mathematician, there will be a lot of "hold on", "you can't do that", "that's wrong", and so on. AI simply isn't correct and shouldn't work. There are also many, many cases of "if A works, B should work as well" where A works and B doesn't.
AI is semi random. That's the truth.
> how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?
You don't. You just try stuff and some of it works.
> how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?
You don't. You just try stuff and some of it works.
> how do you know your policy network is exploring new states?
You don't. You just try stuff and some of it works.
Now, there are intuitive answers (that are bullshit) to your questions. The problem is they don't stand up to scrutiny.
> how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?
You assume the structure of the AI computation itself will have a "sufficiently" correct bias/variance. So the structure of the network and the structure of the updates encode something that works in the state/action space.
In other words: you try dozens of architectures until one works (also known as "hyperparameter search"). To put it another way, any AI algorithm is a huuuuuge polynomial, say Ax + By + C: backprop learns A, B, and C, while the "hyperparameters" encode the fact that it's Ax + By + C being computed in the first place. For instance, NEAT is a good hyperparameter search algorithm.
> how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?
This is the central problem of reinforcement learning. Literally all the theory is about this very question; just don't expect to read math that holds up. But it's easy to see how, for instance, Q-learning, TD(λ), TD(0), and TD(1) all do this.
How about I put down the intuition behind Q-learning. You have states, rewards, and actions. You are in a state, you pick an action, and you are rewarded (or not). At some point the game ends, so we have "rounds" (episodes): sequences of moves from the start of the game to the end. You calculate the Q-value, which is the discounted reward given optimal play. Wait, what? You calculate the expected reward given that you play optimally in the future. So the Q-value for (state(t), action(t)) is R(t) + gamma * max over all possible action(t+1)'s of Q(state(t+1), action(t+1)). By playing, you constantly recalculate those Q-values, meaning they become more accurate over time, and then you play by simply picking the action with the highest Q-value.
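Roughly, as code (a minimal tabular sketch of that update; the action set and the alpha/gamma/epsilon constants are placeholder values I'm assuming, not anything from the article):

    import random
    from collections import defaultdict

    Q = defaultdict(float)                  # Q[(state, action)] -> estimated discounted return
    alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration rate (placeholders)
    actions = [0, 1, 2, 3]

    def act(state):
        if random.random() < epsilon:                     # explore occasionally
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])  # otherwise pick the highest Q-value

    def update(state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])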
> how do you know your policy network is exploring new states?
One way to do it: when you calculate the gradient, you can see that it is a big honking set of matrices. You just stick them all together into one vector, and then use the norm of that vector (which is intuitively a measure of how much the network "learned" in the last experiment) as a reward. Alternatively, you sum the norms of the individual matrices and do the same.
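In PyTorch terms that's something like the sketch below; the model and loss are assumed placeholders, and this is just the "concatenate the gradients and take the norm" idea, not any particular paper's method:

    import torch

    def grad_norm_bonus(loss, model):
        # retain_graph=True so the caller can still run the real backward pass afterwards
        grads = torch.autograd.grad(loss, list(model.parameters()), retain_graph=True)
        flat = torch.cat([g.reshape(-1) for g in grads])  # stick all the gradient matrices into one vector
        return flat.norm().item()                         # big norm ~ the network "learned" a lot this step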
There are other metrics. For instance, you train a separate network to predict the future game state, then add how wrong that network is to the reward function. Assuming both networks learn at a similar rate (one of those "but that's... wrong" things), that gives you an indication of how new a state is: the newer it is, the more the prediction network gets it wrong.
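As a rough sketch (the dimensions and network sizes here are made-up placeholders, and obs/action/next_obs are assumed to be float tensors):

    import torch
    import torch.nn as nn

    obs_dim, act_dim = 8, 2   # placeholder sizes
    forward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))
    opt = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

    def intrinsic_reward(obs, action, next_obs):
        pred = forward_model(torch.cat([obs, action], dim=-1))
        err = (pred - next_obs.detach()).pow(2).mean()
        opt.zero_grad(); err.backward(); opt.step()  # train the predictor on the transition it just saw
        return err.item()                            # hand the (pre-update) error back as a novelty bonus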
Another way is to have two instances of your AI solver and have them fight each other, while also training a prediction of how the other will do, and using that same network to judge your own play.
You can also have long-term cycles where you switch between reward functions. For instance, sometimes your network just tries to terminate the game quickly, regardless of the outcome. Sometimes surprise is what you're looking for. Sometimes it just wants to see the world burn (lots of rapid action). Sometimes it wants everything to be quiet. Ideally, though, if you go down this path, you need to somehow tell the network which (combination of) rewards you're going for.
Yes, the Bellman equation is the fundamental idea behind all RL algorithms (i.e. updating a table of values), but RL is also much more fragile than supervised learning, so to ensure stability, modern algorithms rely on complex mathematical tools. I wouldn't say it's the math notation that's scary; rather, the concepts behind modern algorithms require higher-level math. For example, with the max-entropy algorithm for inverse RL, it's essential to know Shannon's information entropy to understand why it works.
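To be clear, the entropy formula itself is tiny; it's understanding why maximizing it helps that takes the theory. For reference (just the standard definition, nothing from the article):

    import numpy as np

    def entropy(policy_probs):
        p = np.asarray(policy_probs, dtype=float)
        return -np.sum(p * np.log(p + 1e-12))   # H(pi) = -sum_a pi(a) * log(pi(a))

    entropy([0.25, 0.25, 0.25, 0.25])   # uniform policy: maximal entropy (~1.39 nats)
    entropy([0.97, 0.01, 0.01, 0.01])   # near-deterministic policy: much lower (~0.17 nats)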
Ah yeah, I agree it's not a terribly difficult concept to learn, but perhaps we have different definitions of graduate-level math. Things like information theory are, at best, skimmed over in upper-division math classes during undergrad. Having a solid understanding of things like KL divergence and information entropy is a decent indicator of one's overall math level, and if you are already there, you can consider yourself at graduate level.
Though I guess the math used in ML is probably child's play compared to what a graduate-level mathematician or physicist works with.
It depends on whether you are implementing a library to support RL or using one that does. In the former case, yeah, you'd better have a grip; in the latter, you need to understand enough to know how to use the library and not feed it nasty things. This isn't to say you can be mathematically illiterate, but there are levels to the required knowledge - the same as you don't need to know how to make a compiler to use one.
I agree to an extent. Current RL algorithms are more fragmented and have more limited applications compared to something like a CNN.
So unless one only wants to learn how to play Atari, they won't be able to "fix" an algorithm when it breaks down in an untested environment, e.g. a non-deterministic, sparse-reward game.
If one day there is a generalized algorithm that can solve a large set of RL problems, then I think a high-level intuition is probably enough to use RL, but for now, I'd say RL is definitely not for the faint of heart.
True - I often make the heretical argument that SL vs RL is just a question of where the labeling comes from :-) You are correct that the tooling is weaker in this space, but it is growing. My point is only that there's a difference between knowing how to use the tool and knowing how to make the tool: making the tool can liquefy your brain; using someone else's tool (assuming it's a good tool) will simply give you headaches from time to time :-D
Is it really math-heavy? It's not like people are working out these algorithms by hand. I haven't done RL, but I would assume you just need to download some libraries and tutorials, then play around with it to get an understanding. This may go faster if you have already read a book about it, worked out the math behind how it works, etc., but that shouldn't be necessary.
For very limited applications you can probably plug and play something like DDQN or A3C, but since we don't have a good algorithm that applies to many RL problems, any innovation will take quite a bit of theoretical background.
A while ago we tried to use DDQN to self-play in a multi-agent Pacman environment that we built; it was able to converge, but the convergence was very suboptimal. We had to modify the sampling process to get slightly better convergence.
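To give a concrete (if generic) example of what "modifying the sampling" can look like, here is a prioritized-replay-style sketch, i.e. sampling transitions in proportion to their TD error instead of uniformly; the exact change depends on the environment, and the names/constants here are only illustrative:

    import numpy as np

    def sample_indices(td_errors, batch_size, alpha=0.6):
        priorities = (np.abs(td_errors) + 1e-6) ** alpha   # bigger TD error -> higher priority
        probs = priorities / priorities.sum()
        return np.random.choice(len(td_errors), size=batch_size, p=probs)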
A small tangential criticism, but using "deep" every other sentence and especially expressions like "classical deep learning" made me take this article less seriously.
This is not unique to this author, sadly. I'm tired of seeing the d-word thrown into research papers just for the sake of adding more buzzwords per buzzword.
Once you've made it clear you are using neural networks with a lot of layers, you can start using some variation in the discourse. Maybe just call them neural networks...
There were so many technical terms, I'm surprised you could get through even an overview, and then practicals, in just 4 hours.
Do you know of any resources that list most of the common alternatives? E.g., what are the alternatives to A3C for parallelizing, or the alternatives to A2C for getting policy and value estimates?