The Class Every Reinforcement Learning Researcher Should Take

What training dogs can teach you about RL

Vincent Vanhoucke
Towards Data Science

My team just spent a day hanging out with some Very Good Boys and some Very Good Girls, all in the name of research.

One Very Good Girl, courtesy of Dallas Hamilton.

Much of machine learning research purports to take inspiration from neuroscience, psychology and child development, touting concepts such as Hebbian learning, curiosity-driven exploration or curriculum learning as justification — and, more often than not, a post-rationalization — for the latest twist in architecture design or learning theory.

This ignores the fact that the modern machine learning toolkit has neither a substrate that comes close to approximating the brain’s neurophysiology, nor the high level of consciousness or intellectual development of even a small child.

There is an interesting middle ground between the study of neural structures of the brain and that of human children, which is to look at learning in animals. Much has been written about experiments on simple organisms, from molds to insects or bats. But that line of research is often limited to learning ‘in the wild,’ whereby the environment applies pressure one way or another to elicit a specific behavior, without the intervention of a motivated teacher.

Machine learning, however, is arguably often more about devising a better teacher than a better learner.

Animal training adds the requisite additional layer to this picture, whereby animal minds capable of eliciting very complex behaviors, but driven by simple goals and motivations (food, play, companionship), are shaped by people with decades of experience in eliciting the right kind of behavior in ways that are effective, efficient and repeatable. The parallels to deep reinforcement learning research are clear: a relatively narrow channel for communicating rewards to the agent (mostly a varying spectrum of reward and punishment), complex and ambiguous inputs, a capable but somewhat opaque learner with a knack for responding to whatever incentives are presented to it, and complex behaviors to learn. But also a ‘teacher’ that we want to make as effective as possible, and whose complexity and ingenuity are limited only by their imagination and understanding of the best strategy to deploy. Admittedly, the analogy is stretched thinnest when we presume that a dog presented with a treat has a functional response that approximates a neural network subject to back-propagation.

My first exposure to animal learning came from reading the horrific tales of the chimpanzees used in the early days of the space program in The Right Stuff. That book (a fantastic read nonetheless) makes the case that, given the extreme conditions the animals were put in, no amount of positive reward could make up for the level of stress they endured, and all the learning was driven by negative reward, namely electrified pedals attached to their feet. Huh.

The class we attended had, of course, none of that, and was full of happy, cuddly, energetic dogs who were extremely excited to learn. For that was the first lesson of the day: reward-seeking behaviors and the taste for learning, while to some extent innate and often determined by the dog’s personality, are also the first thing to cultivate and develop. The connection between behaviors and outcomes can be built and strengthened, and much of the subsequent learning builds on it. This basic building block rests on the classic Pavlovian response: taking something primal, like the yearning for food, and associating it with simple cues that one can manipulate, such as a clicker or a ‘good boy’ serving as a proxy reward. Leveraging the dopamine rush associated with the anticipation of reward is a key ingredient in building a tight coupling between inputs and expected outcomes.
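This Pavlovian transfer can be sketched in a few lines. Here is a toy model (entirely hypothetical, not from the class) in which a temporal-difference-style update transfers value from the primal reward, food, onto the clicker that predicts it:

```python
# Toy sketch of Pavlovian conditioning: the clicker acquires value
# because it reliably predicts food. All names and constants here are
# illustrative assumptions, not a claim about how dogs actually learn.

def condition_clicker(pairings=200, alpha=0.1, food_value=1.0):
    """Repeatedly pair the click with food; return the click's learned value."""
    click_value = 0.0
    for _ in range(pairings):
        # Temporal-difference style update: prediction moves toward outcome.
        click_value += alpha * (food_value - click_value)
    return click_value

print(round(condition_clicker(), 3))  # → 1.0: the click alone now signals reward
```

After enough pairings the click carries nearly the full value of the food, which is what lets a trainer use it as a proxy reward.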

One Very Good Boy, courtesy of Michael Fraas.

What’s interesting for a machine learning scientist is that the issue of finding proxy rewards that are more flexible to manipulate than the true reward comes up over and over, and ensuring that the system keeps an association between the true reward and those proxies is often a real problem. Another notable use of such proxy rewards is to create negative rewards that are not intrinsically negative (as in painful or upsetting) — I’ll come back to those. I found it interesting that one key ingredient of this phase of training is that the dopamine response has a property that facilitates transfer learning: it is attached not to the reward itself, but to the expectation of the reward. This makes it possible to substitute the proxy even in the absence of true reward. Furthermore, a direct consequence is that the dopamine response is strongest when the reward is most uncertain. Trainers use this knowledge to amplify the effect of rewards by making them highly stochastic: sometimes high, sometimes low, sometimes absent entirely (dare I say dropout?). Keeping the trainee guessing is an effective way to sustain their eagerness to seek ‘perfection’ in order to maximize their reward.
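One way to picture the trainers’ stochastic schedule is as a reward distribution with a fixed mean but high variance. A minimal sketch, where the magnitudes and probabilities are made up for illustration:

```python
import random

# Hypothetical variable-magnitude reward schedule: the expected reward
# stays at 1.0, but any single trial may pay nothing, the usual treat,
# or a jackpot, keeping the trainee's anticipation high.

def stochastic_reward(rng):
    r = rng.random()
    if r < 0.3:
        return 0.0   # no treat at all
    elif r < 0.8:
        return 1.0   # the usual treat
    else:
        return 2.5   # jackpot
    # Expected value: 0.3*0.0 + 0.5*1.0 + 0.2*2.5 = 1.0

rng = random.Random(0)
mean = sum(stochastic_reward(rng) for _ in range(100_000)) / 100_000
print(round(mean, 2))  # close to 1.0, despite wildly varying single trials
```

The learner cannot predict any individual payout, only the long-run average, which is exactly the uncertainty the trainers exploit.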

Once eagerness to learn is established, much is about eliciting interesting behaviors, either by being crafty (scratch the dog’s snout to incite it to put its paw on it, then ‘proxy away’ from the original reward to turn that into a ‘shame’ gesture), or by taking advantage of chance behaviors and rewarding them as if they had been the goal in the first place (hindsight experience replay, anyone?). One key aspect of this behavior elicitation is to shower your trainee with rewards as soon as a key step in the behavior is reached, and subsequently to keep them engaged with just enough lesser rewards until they repeat it perfectly. This reward dance is subtle, and not intuitive at all, but the really fun thing is that you can practice it yourself by simply playing the roles of dog and trainer (no treats required, just a signal like a clicker or a whistle). That role-playing was by far the most fun and illuminating part of the training, and trying to put into words the kinds of strategies you develop to get your teammate to perform complex tasks with a simple binary reward would not do it justice. I recommend that you experience it for yourself.

Much of this is essentially a game of reward shaping. Reward shaping has a bad reputation for being very task-specific, and hence difficult to turn into a general strategy for reinforcement learning, but this experience really left me thinking that the strategies I ended up developing were not task-specific, and that with the right vocabulary they could be distilled into very general rules, provided the setting gives the reward access to the state of the agent as well as the world. It also convinced me that reward shaping should be dynamic and depend on the agent’s history, which I don’t see generally being done.
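That ‘big reward for the first milestone, then taper’ pattern is exactly a history-dependent shaping function. Here is a hypothetical sketch of what dynamic, stateful shaping could look like (names and constants are mine, not the trainers’):

```python
# Sketch of history-dependent reward shaping (all names hypothetical):
# a milestone earns a large reward the first time it is reached, then
# geometrically smaller ones, nudging the learner to push past the
# partial behavior instead of resting on it.

class DynamicShaper:
    def __init__(self, initial=1.0, decay=0.5):
        self.initial = initial
        self.decay = decay
        self.times_rewarded = {}  # milestone -> times rewarded so far

    def reward(self, milestone):
        n = self.times_rewarded.get(milestone, 0)
        self.times_rewarded[milestone] = n + 1
        return self.initial * (self.decay ** n)

shaper = DynamicShaper()
print([shaper.reward("paw_on_snout") for _ in range(4)])  # → [1.0, 0.5, 0.25, 0.125]
```

Because the shaper keeps per-milestone state, the same partial behavior is worth less each time it recurs, which is the ‘dynamic and stateful’ property argued for above.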

Alex Irpan and Dallas Hamilton shaping Jie Tan’s rewards.

The next step of the training is to introduce negative rewards. Again, ‘negative’ experiences are only negative by proxy: a collar that generated a buzzy electrical feeling was attached to our arm, and we were told to consider it ‘bad’. Likewise, dogs are trained to associate dissatisfaction with that feeling through a series of exercises in which food is only delivered after the buzzy feeling stops. The fascinating thing about negative rewards is that trainers see them purely as a way to improve sample efficiency, though of course they don’t express it that way. No new behaviors are expected from introducing negative rewards, merely faster learning, by essentially narrowing the exploration space and disincentivizing behaviors that have nothing to do with the task at hand. And we experienced this firsthand: it’s very tempting to use negative rewards to ‘steer’ your dog (or, in our case, our proxy-dog of a colleague) towards the right behavior, but we quickly found that the bulk of the steering happened best using positive rewards, with a judicious pruning of the search space by buzzing the agent away whenever they strayed from the ‘interesting’ set of possible actions.
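That ‘pruning, not steering’ role of negative rewards can be sketched as a filter over candidate actions. In this toy setup (the environment, actions and thresholds are invented), the penalty never teaches the target behavior; it just shrinks the set of actions exploration bothers with, leaving the positive rewards to do the steering:

```python
# Toy sketch of negative reward as search-space pruning (hypothetical
# environment): off-task actions accumulate a penalty and drop out of
# the candidate set before exploration even considers them.

ACTIONS = ["sit", "spin", "bark", "wander_off", "chew_leash"]
OFF_TASK = {"wander_off", "chew_leash"}  # actions that earn a buzz

def candidate_actions(q_values, penalty=-1.0, threshold=-0.5):
    """Keep only actions whose penalized value stays above threshold."""
    def value(action):
        return q_values.get(action, 0.0) + (penalty if action in OFF_TASK else 0.0)
    return [a for a in ACTIONS if value(a) > threshold]

print(candidate_actions({}))  # → ['sit', 'spin', 'bark']
```

The surviving actions still have identical values; choosing among them is left entirely to the positive rewards, mirroring what we observed in the class.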

There is probably a lot more to learn from training dogs, and much of it felt like the kind of intuition you can only build by experiencing it in person. For me, the most immediate insight was to reconcile myself with reward shaping, and to get some hints on ways it might be made more task-agnostic: perhaps by making the shaping itself dynamic, stateful and potentially even stochastic.

Not bad for a day of doggy fun.


I am a Distinguished Scientist at Google, working on Machine Learning and Robotics.