A Long Peek into Reinforcement Learning (lilianweng.github.io)
133 points by tosh on Sept 6, 2018 | 7 comments



Pretty cool, this is actually a great reference for a lot of things. Even if you're familiar with RL, you might be reminded of something or learn something new.

Sutton and Barto's book is also good if you want to do more than dip your toes into RL: http://incompleteideas.net/book/the-book-2nd.html That page has the PDF with links to code, problem solutions, and course material.

If you just want an idea of what an RL algorithm looks like, I've got a (heavily) documented version of TD(λ) for linear function approximation here: https://github.com/rldotai/rl-algorithms/blob/master/py3/td....
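
The core update itself is short; here's a minimal sketch of TD(λ) with linear function approximation and accumulating traces (the standard textbook update, not necessarily the repo's exact code):

    import numpy as np

    class LinearTDLambda:
        # TD(lambda) policy evaluation with linear function approximation
        # and accumulating eligibility traces.
        def __init__(self, n_features, alpha=0.01, gamma=0.99, lam=0.9):
            self.w = np.zeros(n_features)   # weights: V(s) ~= w . x(s)
            self.z = np.zeros(n_features)   # eligibility trace
            self.alpha, self.gamma, self.lam = alpha, gamma, lam

        def update(self, x, reward, x_next, done=False):
            # One learning step from the transition (x, reward, x_next).
            v = np.dot(self.w, x)
            v_next = 0.0 if done else np.dot(self.w, x_next)
            delta = reward + self.gamma * v_next - v      # TD error
            self.z = self.gamma * self.lam * self.z + x   # decay, then accumulate
            self.w += self.alpha * delta * self.z         # apply the correction
            if done:
                self.z = np.zeros_like(self.z)            # reset traces at episode end
            return delta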


If anyone is interested in a fun toy program to learn temporal difference learning on, the game 2048 is amenable to the approach - in fact it currently gives the best results of any 2048 AI.

It uses the N-tuple technique - an instructive paper describing N-tuples and TDL as applied to 2048 is here: https://arxiv.org/ftp/arxiv/papers/1606/1606.07374.pdf
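
To give a flavor of the technique, here's a rough sketch of an n-tuple value function for 2048 with a TD(0)-style update (the board encoding, choice of tuples, and step size below are made up for illustration, not taken from the paper):

    import numpy as np

    # Board: a length-16 array of tile exponents (0 = empty, 1 = tile 2, 2 = tile 4, ...).
    TUPLES = [
        (0, 1, 2, 3),    # top row
        (4, 5, 6, 7),    # second row
        (0, 1, 4, 5),    # top-left 2x2 square
        (1, 2, 5, 6),    # neighbouring 2x2 square
    ]
    N_VALUES = 16  # cap on distinct tile exponents per cell

    # One lookup table of weights per tuple.
    tables = [np.zeros(N_VALUES ** len(t)) for t in TUPLES]

    def index(board, positions):
        # Encode the tile exponents at the given positions as a single table index.
        idx = 0
        for p in positions:
            idx = idx * N_VALUES + board[p]
        return idx

    def value(board):
        # Value estimate = sum of the looked-up weights across all tuples.
        return sum(table[index(board, t)] for table, t in zip(tables, TUPLES))

    def td_update(board, reward, next_board, alpha=0.01):
        # TD(0): nudge every active weight toward the bootstrapped target.
        delta = reward + value(next_board) - value(board)
        for table, t in zip(tables, TUPLES):
            table[index(board, t)] += alpha * delta / len(TUPLES)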

An alternative project would be a game like "connect-4", which is amenable to the same approach.


I’d be very interested to hear for which similar problems (games, control problems) reinforcement learning is not the best approach (i.e. there exist other, qualitatively better methods). We hear a lot about success stories but not a lot about the limitations of RL.


There are a couple of places where RL/Evolutionary Methods/Something Else are jockeying for the top of the leaderboards, but usually if one technique succeeds over the others it just means that someone's put a lot of work into that one particular approach.

Anecdotally, I've heard about some optimization problems that should be a good fit for dynamic programming or reinforcement learning, but where it turns out they don't actually work as desired.

For example, optimizing elevator policy: an agent controlling an elevator wants to achieve high throughput of passengers while also minimizing each individual's wait time. We want people to get to their floor quickly, but also want to avoid having any individual wait for an excessively long period of time. The agent can only observe information about the buttons pressed on each floor, although it gets a reward that reflects our desiderata [1]. As it turns out, most elevators are not running some sort of cool machine-optimized policy, but rather something hand-coded.

This is not for lack of opportunity, either: apparently some researchers pitched Otis on this in the 90s, and it didn't work as well as what they already had, despite this looking like a case where theory should match reality. Why this is, I don't know, but it might come down to the fact that there's really a lot more to a "pleasant elevator experience" than might be assumed (or modeled by the reward function), or perhaps the hand-coded policy incorporates background knowledge about human vertical locomotion that's unavailable to a machine.

-----

1. Maybe something like `R_t = Transporting(t) - ∑_i Waiting(i, t)`, meaning "the number of people currently being transported, minus the sum over waiting people of how long each has been waiting".
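
In code, that might look something like (names here are hypothetical, just to pin down the idea):

    def elevator_reward(num_riding, wait_times):
        # num_riding: number of people currently being transported
        # wait_times: how long each currently-waiting person has been waiting so far
        return num_riding - sum(wait_times)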


Or perhaps those researchers in the 90s just whiffed?

For other RL problems, it seems we've only recently crossed some compute-power viability threshold. Maybe contemporary researchers with contemporary compute power would have better luck?


One class of examples would be problems where probing the environment is expensive. In many robotics applications, this can be tricky because the robot might break, or might not return to a suitable initial condition without human oversight. Moreover, the physical experiment can be relatively slow, so little data is generated, and what is generated may not generalize well. That being said, how to overcome these challenges is a very active research topic in the community.


It probably depends on how you define "best approach".

For example, in chess: sure, AlphaZero is better than Stockfish, but Stockfish is "good enough" (hundreds of Elo better than the best human) without the CPU-cycle cost of AlphaZero.



