# Interaction-Grounded Learning: Learning from feedback, not rewards

In a typical reinforcement learning problem, an agent situated in an environment makes observations, takes actions, and receives rewards. The goal of the agent is to learn to receive the largest sum of (sometimes discounted) rewards possible. To do so, the reward at each time-step is used to adjust the likelihood of the actions the agent takes in a given state, such that in the future the agent will receive more reward on average than it did in the past. This setting has been extensively studied, and very efficient algorithms exist in both the tabular and deep learning settings to solve it.

What if there were no rewards available, and the agent instead simply received a feedback signal from the environment? In that case, most traditional RL approaches no longer apply. This problem setting was recently described as “Interaction-Grounded Learning” (IGL) by Tengyang Xie and his collaborators at Microsoft Research in their ICML 2021 paper of the same name. In that work, they not only laid out the IGL setting, but also proposed a couple of preliminary algorithms which can solve IGL problems. In this post, I will walk through IGL in a little more depth, and provide code for solving a simple digit identification problem using feedback instead of rewards. I open-sourced my PyTorch code, which can be found at this link:

https://github.com/awjuliani/interaction-grounded-learning

## The IGL Setting

In the paper, the authors motivate IGL with examples from human-computer interface research. If we want machines which can interact with humans in a natural way, we need them to be able to learn from human feedback in a natural way as well. Asking the human to provide a discrete reward signal after every action the agent takes is an unreasonably cumbersome burden. Demonstration data may also be unavailable, or may not make sense in a number of contexts. Instead, if the computer could learn to interpret the human's hand gestures, facial expressions, or even brain signals to infer the latent reward signal, learning could happen in a much smoother way.

Making things more concrete, the authors propose a much simpler toy problem to validate their early approach. This problem is a simple MNIST digit identification task. At each trial, the agent is shown an image of an MNIST digit, and must guess the identity of the digit (between 0 and 9). If the agent guesses correctly, it is provided with a feedback signal corresponding to an image of the digit one. If it guesses incorrectly, it is provided an image of a zero digit. The problem is to learn to infer the meaning of this feedback, and to use it to improve the performance of the agent.
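To make the feedback rule concrete, here is a minimal sketch of the environment side of the task. The function name `igl_feedback` and the placeholder "images" (plain strings standing in for MNIST pixel arrays) are my own assumptions for illustration, not code from the paper or the linked repository.

```python
import random

def igl_feedback(true_digit, guess, images_by_digit, rng=random):
    """Return a feedback image: a 'one' if the guess is right, a 'zero' otherwise."""
    feedback_digit = 1 if guess == true_digit else 0
    return rng.choice(images_by_digit[feedback_digit])

# Placeholder "images": strings standing in for MNIST pixel arrays (assumption).
images_by_digit = {d: [f"img_{d}_{i}" for i in range(3)] for d in range(10)}

rng = random.Random(0)
right = igl_feedback(7, 7, images_by_digit, rng)  # some image of a one
wrong = igl_feedback(7, 3, images_by_digit, rng)  # some image of a zero
```

Note that the agent never sees `feedback_digit` itself, only the sampled image, so it still has to learn what the two kinds of image mean.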

Such a task is indeed solvable, provided that you make some simplifying assumptions. The key assumption made in the IGL paper is that the desired policy is significantly different from a random policy. We can see this in the case of the MNIST task, where a random policy will provide the agent with a feedback signal consisting of many more images of zeros than images of ones. In contrast, the optimal policy will result in feedback consisting only of images of ones.
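The arithmetic behind this assumption can be checked in a few lines. This is only a back-of-the-envelope sketch: the variable names are mine, and the "gap" is simply the difference between how often a policy and the random policy receive the kind of feedback image that is treated as rewarding.

```python
from fractions import Fraction

NUM_CLASSES = 10  # digits 0-9

# A uniformly random guess matches the true digit in 1 of every 10 cases,
# so the random policy sees "one" images 10% of the time and "zero" images 90%.
p_one_random = Fraction(1, NUM_CLASSES)
p_zero_random = 1 - p_one_random

# Reading A: "one" images mean reward, and the policy always guesses right.
gap_if_ones_are_good = 1 - p_one_random    # 9/10
# Reading B: "zero" images mean reward, and the policy always guesses wrong.
gap_if_zeros_are_good = 1 - p_zero_random  # 1/10

print(p_one_random, gap_if_ones_are_good, gap_if_zeros_are_good)
```

The always-correct policy separates itself from the random policy by 9/10, while the always-wrong policy only manages 1/10, which is why "be as different from random as possible" picks out the intended behavior in this task.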

*The learning problem is then to jointly learn a policy and a reward decoder for which the expected value of the learned policy with respect to the decoded rewards is greater than that of a random policy with respect to the same decoded rewards.*
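One way to write this down, using notation I am choosing here for illustration (the paper's own symbols may differ): let $x$ be the context, $a$ the action, $y$ the feedback, $\pi$ the policy, $\psi$ the reward decoder, and $\pi_{\text{rand}}$ the uniformly random policy.

```latex
% Value of a policy under a given reward decoder
V(\pi, \psi) = \mathbb{E}_{x,\; a \sim \pi(\cdot \mid x),\; y}\big[\psi(y)\big]

% Joint objective: separate the learned policy from the random one
\max_{\pi,\, \psi} \; V(\pi, \psi) - V(\pi_{\text{rand}}, \psi)
```

Both terms share the same decoder $\psi$, so the decoder cannot inflate the objective by calling every feedback signal rewarding.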

In the paper, the authors provide both an offline and an online algorithm for solving this problem, along with a set of theoretical analyses of their solutions. I highly recommend taking a look at their paper for all the details.

My own implementation of an IGL agent differs slightly from theirs, but solves the problem with comparable efficiency.

The main idea is to collect mini-batches of trials from both the policy we are training (exploit) and the random policy (explore). The exploit policy is then updated with its decoded rewards using policy gradient to increase the likelihood of taking rewarding actions. At the same time, the decoded rewards of the random policy are minimized using gradient descent, which decreases the average reward the decoder assigns to the random policy's feedback. The process is repeated until convergence. The result of this procedure is that the learned policy becomes increasingly different from the random policy. Since receiving only images of ones as feedback makes the learned policy maximally different from the random policy (with respect to feedback), that is what it learns to do. You can see the results of the learning process below.

The code in the repository linked above can be used to reproduce these results in about a minute.
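For readers who want to see the shape of the two updates without opening the repository, here is a deliberately tiny sketch. It is not my PyTorch implementation: it swaps the neural networks for a tabular softmax policy and a two-parameter sigmoid decoder, uses the digit label itself as the context, and uses `0`/`1` as stand-ins for the zero and one feedback images. The names `igl_step`, `policy_logits`, `decoder_w`, and the fixed `0.5` baseline are all assumptions made for illustration.

```python
import math
import random

NUM_CLASSES = 10  # digits 0-9

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def igl_step(policy_logits, decoder_w, rng, batch_size=32, lr=0.5):
    """One mini-batch: per-trial policy updates, then one decoder update."""
    decoder_grad = [0.0, 0.0]
    for _ in range(batch_size):
        digit = rng.randrange(NUM_CLASSES)  # context (stand-in for an image)
        probs = softmax(policy_logits[digit])

        # Exploit: act with the learned policy, observe feedback 1 (right) or 0 (wrong).
        action = rng.choices(range(NUM_CLASSES), weights=probs)[0]
        fb_exploit = 1 if action == digit else 0
        r = sigmoid(decoder_w[fb_exploit])  # decoded reward in (0, 1)

        # REINFORCE-style update with the decoded reward.
        # The 0.5 baseline is an assumption to center rewards around zero.
        for a in range(NUM_CLASSES):
            indicator = 1.0 if a == action else 0.0
            policy_logits[digit][a] += lr * (r - 0.5) * (indicator - probs[a])

        # Explore: act uniformly at random and decode that feedback too.
        fb_random = 1 if rng.randrange(NUM_CLASSES) == digit else 0
        r_rand = sigmoid(decoder_w[fb_random])

        # Decoder gradient: raise decoded reward for exploit feedback,
        # lower it for random-policy feedback (d sigmoid / dw = s * (1 - s)).
        decoder_grad[fb_exploit] += r * (1.0 - r) / batch_size
        decoder_grad[fb_random] -= r_rand * (1.0 - r_rand) / batch_size

    for y in (0, 1):
        decoder_w[y] += lr * decoder_grad[y]
    return policy_logits, decoder_w

rng = random.Random(0)
policy_logits = [[0.0] * NUM_CLASSES for _ in range(NUM_CLASSES)]
decoder_w = [0.0, 0.0]
for _ in range(100):
    igl_step(policy_logits, decoder_w, rng)
```

This sketch makes no claim about convergence. With such a simple decoder, which feedback signal ends up being treated as rewarding depends on how the initial symmetry breaks, so treat it as an illustration of the update structure rather than a way to reproduce the results above.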

## Applying IGL to “Real Problems”

As I mentioned above, IGL has the potential to be applicable to many real-world domains where a nice reward signal is not available, but a messy feedback signal still might be. A number of extensions to the current approach will likely be required before that becomes feasible, though. Indeed, this novel formulation is ripe for additional follow-up work.

In the algorithms described in the paper and here, an assumption is made that the desired policy is significantly different from a random policy. This is not always the case. We can also imagine counter-examples, where the optimal policy is not only similar to the random policy, but where there are much worse policies which are quite different. Imagine a version of the MNIST task where selecting the true digit identity was undesirable (i.e., would traditionally provide a `-1` reward). The current formulation also assumes a contextual bandit setting, which despite its broad applicability, may not work for settings which are better suited to an MDP formulation. Feedback from humans is also quite messy, and the difference between “good” and “bad” feedback may be quite noisy, or even change over time.

Regardless, if we want to arrive at a world where agents and humans interact in more natural and fluent ways, agents which learn from ambiguous feedback are going to be an essential step towards getting there, and IGL provides a useful formalism towards that goal.