Reinforcement learning is not an optimization algorithm, and his times are consistent with an off-the-shelf Q-learning approach.
Reinforcement learning tries to find the optimal policy of actions from any given position in a discrete state space.
The policy is roughly a map from state -> action, but one that understands the temporal nature of the world.
It starts out knowing nothing, then discovers that hitting a pipe is really bad, and therefore that states close to a pipe are almost as bad. Over tons of iterations, the expected negative value of hitting a pipe bleeds across the expected values of the state space.
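Here's a minimal tabular Q-learning sketch of that bleeding effect. The environment interface (`reset`/`step`) and the reward values are hypothetical stand-ins for a discretized Flappy Bird, but the update rule is the standard Q-learning formula: the crash penalty propagates backward one step at a time, so states near a pipe gradually acquire low values too.

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount: lets the pipe's penalty bleed into earlier states
EPSILON = 0.1  # exploration rate
ACTIONS = [0, 1]  # 0 = do nothing, 1 = flap

Q = defaultdict(float)  # (state, action) -> estimated expected return

def choose_action(state):
    # The policy is just argmax over Q, with epsilon-greedy exploration.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Q-learning update: pull Q(s, a) toward reward + gamma * max_a' Q(s', a').
    # A crash's large negative reward flows backward through this term,
    # lowering the values of states that lead toward the pipe.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def train(env, episodes=10_000):
    # `env` is an assumed interface: reset() -> state, and
    # step(action) -> (next_state, reward, done), with e.g. +1 per frame
    # survived and -1000 for hitting a pipe.
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)
            update(state, action, reward, next_state)
            state = next_state
```

Note that nobody ever labels the near-pipe states as bad; the discount factor and the update rule spread the penalty there automatically.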
With regression and optimization, it's never clear what you are optimizing against. Obviously hitting a pipe is bad, but what about the states near a pipe? Are they good or bad? There is no natural metric in the problem that tells you the distance from the pipe or what the corrective action should be.
So that's the major difference between reinforcement and supervised learning.