The present in terms of the future: Successor representations in reinforcement learning

Arthur Juliani
Oct 31, 2019

A few years ago, I wrote a series of articles on the basics of Reinforcement Learning (RL). In those, I walked through a number of the fundamental algorithms and ideas of RL, providing code along the way. This is a kind of continuation of that series, with the difference being that this article (and possibly future ones) will dive into some of the more eclectic and experimental elements of the RL world. I want to begin with a topic close to my heart and to my recent research interests, something called the successor representation (SR). In this post I will walk through the inspiration and basic mathematics of SRs, demonstrate a simple working example, and discuss the ways in which SRs have been applied in deep learning, psychology, and neuroscience in recent years.

(Note: If you’d like to jump ahead and follow along with the tutorial code, you can find it here).

Background — Value estimates

Almost all reinforcement learning algorithms are concerned, in one way or another, with the task of estimating value. At the simplest level this means learning how good it is to be in the current state — i.e. V(s), or how good it is to take any of the actions available in the current state — i.e. Q(s, a). It is this latter piece of information that allows an agent to make informed decisions and optimize future reward. These two functions, V and Q, serve as the backbone of the majority of contemporary algorithms, ranging from DQN and SAC to PPO and many more.
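Formally, both quantities are expectations over the temporally discounted return (with discount factor gamma and reward r_t at time t):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s\Big] \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s,\ a_0 = a\Big]$$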

The process of learning value estimates from the rewards in an environment is referred to as credit assignment. Given an often sparse reward signal, how do we appropriately assign value to the states and actions which eventually lead to these rewards? Put even more simply, what states and actions were useful for acquiring the reward, and which were not? If we can learn this, we can learn an optimal policy.

The vast majority of algorithms involve learning these value estimates directly from the reward. There are other ways to arrive at a value estimate though. In fact, value estimates can be seen as the combination of two pieces of information which are even more fundamental: (1) the reward function of the environment, and (2) the dynamics of the environment and the agent itself. This second piece makes up the successor representation, and it provides a mathematically straightforward way for an agent to represent the present state in terms of the future.

Basics of the Successor Representation

The value estimate for a given state corresponds to the temporally discounted return an agent expects to receive over time, starting from that state.

The insight described by Peter Dayan, who first formalized the successor representation, is that this function can be decomposed into the expected discounted future states the agent will encounter and the expected reward in those states. With this insight, we can re-write the value function as follows:
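$$V(s) = \sum_{s'} M(s, s')\, R(s'), \qquad \text{where} \quad M(s, s') = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}[s_t = s'] \,\Big|\, s_0 = s\Big]$$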

If the states of the environment are discrete and known in advance, then we can represent V(s) using the product of a reward array and a successor matrix. In this case, R(s) corresponds to an array in which each element represents a state, and the value of that element is the reward received in that particular state. M(s, s’) corresponds to a matrix with the rows and columns each corresponding to all the states in the environment, and the values of the matrix corresponding to the expected future occupancy of one state given another. A given row of M(s, *) reflects how often all other states are likely to be visited from a given state. When R(s) and M(s,s’) are combined using the dot product, the result is a traditional value estimate for any given state of the environment.
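As a minimal sketch of that dot product in code (using a made-up three-state chain rather than the four-room environment from the notebook, so the numbers below are purely illustrative):

```python
import numpy as np

# Toy environment: three states in a chain, s0 -> s1 -> s2 (terminal).
# R(s): reward array, +1 only in the final state.
R = np.array([0.0, 0.0, 1.0])

# M(s, s'): successor matrix under an "always move right" policy, gamma = 0.9.
# Each entry is the expected discounted future occupancy of s' starting from s.
gamma = 0.9
M = np.array([
    [1.0, gamma, gamma ** 2],  # from s0: visit s0 now, s1 next, then s2
    [0.0, 1.0,   gamma],       # from s1: visit s1 now, then s2
    [0.0, 0.0,   1.0],         # s2 is terminal
])

# The value of every state is the dot product of its successor row with R.
V = M @ R
print(V)  # [0.81 0.9  1.  ]
```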

We can learn R(s) and M(s,s’) using all of the same traditional methods for learning value estimates in reinforcement learning. In order to learn R(s), all that is needed is to minimize the distance between the true reward and R(s):
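$$R(s) \leftarrow R(s) + \alpha \big( r - R(s) \big)$$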

(Where alpha is the learning rate and r is the reward)

In order to learn M(s,s’), we can use the canonical temporal difference learning algorithm:
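$$M(s_t, \cdot) \leftarrow M(s_t, \cdot) + \alpha \big( \mathbb{I}(s_t) + \gamma\, M(s_{t+1}, \cdot) - M(s_t, \cdot) \big)$$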

(Where alpha is the learning rate, gamma is the discount factor, and I is a one-hot vector with 1 in the index of the state)

If we want to learn a state/action Q-function instead of a state value function, we can simply learn a successor tensor of shape actions × states × states.
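Putting the two update rules together, a sketch of the tabular learning loop might look like the following. The `env.reset()` / `env.step(action)` interface returning integer state indices is an assumption on my part, not the API of the notebook's environment; adapt it to whatever environment you use.

```python
import numpy as np

def learn_sr(env, n_states, n_actions, n_episodes=500, alpha=0.1, gamma=0.95):
    """Tabular TD learning of the successor matrix M and the reward vector R."""
    M = np.eye(n_states)    # initially, each state only predicts itself
    R = np.zeros(n_states)  # learned per-state reward estimates

    for _ in range(n_episodes):
        s = env.reset()                        # hypothetical: returns a state index
        done = False
        while not done:
            a = np.random.randint(n_actions)   # a uniform random policy
            s_next, r, done = env.step(a)      # hypothetical interface

            # Delta-rule update for the reward function.
            # (Depending on your reward convention, you may want R[s] instead.)
            R[s_next] += alpha * (r - R[s_next])

            # TD update for the successor representation.
            onehot = np.zeros(n_states)
            onehot[s] = 1.0
            M[s] += alpha * (onehot + gamma * M[s_next] - M[s])

            s = s_next

    return M, R, M @ R   # M, R, and the recovered value estimates
```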

An Interactive Example

Simple four-room gridworld environment. Red corresponds to the agent position. Blue corresponds to the walls. Green corresponds to the goal position.

What is appealing about learning these two pieces of information separately is that changes to one part no longer impact the other, as they would in traditional value-learning algorithms. Take for example a simple gridworld environment with four rooms. In this environment a specific state in the rightmost room contains a treasure chest and provides the agent with a +1 reward, and all other states provide no reward.

If you’d like to try this example out for yourself, you can find an interactive IPython notebook, along with all the required code, in a GitHub repository here.

We can train an agent using any number of algorithms to move toward the rewarding chest. Let’s say that after the initial learning phase, our environment has changed, and the rewarding chest has been moved to the bottom-right room. If we have used a traditional value learning algorithm, the agent now has to completely re-learn its value estimates, since the reward landscape has changed and all the value estimates are wrong. If, however, the agent learned the reward function and the successor representation separately, it only needs to re-learn the reward function, and can re-use the successor representation it learned originally.
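To make that concrete, here is a sketch of what re-learning only the reward function looks like, reusing the output of the `learn_sr` sketch above (`reward_experience` is a hypothetical list of (state, reward) pairs gathered after the chest has moved):

```python
import numpy as np

def relearn_reward_only(M, reward_experience, n_states, alpha=0.1):
    """Re-estimate only R after the goal moves, reusing the previously learned M."""
    R_new = np.zeros(n_states)
    for s, r in reward_experience:
        R_new[s] += alpha * (r - R_new[s])   # same delta rule as before
    return M @ R_new                          # new value estimates, old dynamics
```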

Successor representations of nine states from four-room environment. Each state is represented in terms of future expected states.

There is one small caveat, which is that depending on the policy used to learn the successor representation, it may be biased toward the original reward location, and might need a little fine-tuning. Why is that? Because the successor representation is itself composed of two separate entities! Don’t worry though, the recursive decomposition ends here. The successor representation is the product of the inherent transition dynamics of the environment (i.e. what happens when you take each action in each state — T(s, a) = s’) and the policy the agent is following (i.e. what action the agent chooses to take in each state — P(s) = a). Because of this, unless the policy is completely uniform random, the successor representation will be biased toward certain parts of the state space, reflecting the biases in the policy function. In the case of the four-room environment, the successor states will likely point toward the original reward location, even after the reward has been moved to the new location. Despite this, using the successor representation still results in much faster learning than using a value function by itself.

Agent state occupancy at test time after being exposed to both reward locations. The agent has learned a separate path to each of the two goals.

What the example above demonstrates is that successor representations enable a simple version of task transfer. Because we have learned separate functions for the state representation and the reward function, we can now transfer what we’ve learned about how the agent moves around the space, and apply it to any arbitrary reward function possible within this environment. I encourage those following along with the example code to do just this, and see what kind of results come about.

The Successor Representation and Hierarchical RL

Once we have learned a successor representation, it can be analyzed to discover things about the nature of the environment. Going back to the four-room environment, we can perform dimensionality reduction on the learned representation of each state. This allows us to visualize the representations in a human-interpretable way, even though they are inherently high-dimensional. If we plot them on a 2D plane (see figure below), we can make a couple of observations. The first is that all of the states in each of the four rooms are represented as being very similar to the other states in that room. If we think about what the successor representation captures, this is to be expected, since each of the states in a given room is most likely to lead to other states in the same room under any given policy.

More interestingly, the four states corresponding to the doors between the rooms are represented far apart from all other states, sitting between the groupings of within-room states. This is because from these states the agent can end up in very different parts of the state space. We can refer to these states as bottlenecks, since they allow for transitions between different groups of states. Learning which states are bottlenecks can be useful, especially if we want to then use them as the states of a higher-level controller in a hierarchical learning algorithm. It is exactly these kinds of states that make ideal candidates for option termination points (or sub-goals) in a hierarchical reinforcement learning framework. Many people also believe this kind of hierarchical decomposition of state space is what naturally takes place in human and other animal brains (more on the implications of the successor representation for psychology below).

All states’ successor representations from the four-room environment plotted on a 2D grid. Bottleneck states are marked in orange; all other states are marked in purple.
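Here is a sketch of how a plot like this can be produced, assuming you already have a learned successor matrix `M` (I use a plain PCA via NumPy's SVD here; the original notebook may use a different projection):

```python
import numpy as np
import matplotlib.pyplot as plt

def project_sr_2d(M):
    """Project each state's successor row onto its first two principal components."""
    X = M - M.mean(axis=0, keepdims=True)       # center the rows
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # shape: (n_states, 2)

coords = project_sr_2d(M)                        # M from the learning sketch above
plt.scatter(coords[:, 0], coords[:, 1])
plt.title("Successor representations projected to 2D")
plt.show()
```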

The Successor Representation and Deep Learning

So far, I have only described learning successor representations in the tabular domain. There is in fact plenty of work that has been done to extend the successor framework to the world of deep learning and neural networks as well. The key insight to make this happen is that instead of R(s) and M(s,s’) being a vector and matrix, they can become the high-dimensional outputs of neural networks. Likewise, rather than being a simple one-hot index, the state can be represented as any vector of real numbers. Because of this property, successors in the deep learning framework are referred to as successor features, a phrase proposed by Andre Barreto and colleagues.

Example of a neural network architecture utilizing successor features. Specifically, this is the Universal Successor Feature Approximator agent.
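The figure above shows the Universal Successor Feature Approximator specifically; below is only a minimal PyTorch sketch of the general successor-feature idea, where a network predicts the discounted future feature occupancy psi(s, a) and the reward is assumed to be linear in the features. The layer sizes and names are my own illustrative choices, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class SuccessorFeatureNet(nn.Module):
    """Minimal successor-feature head: psi(s, a) plus learned reward weights w."""
    def __init__(self, obs_dim, feat_dim, n_actions):
        super().__init__()
        # psi predicts the discounted future feature occupancy, one vector per action
        self.psi = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim * n_actions),
        )
        # reward is modeled as linear in the features: r ~ phi(s) . w
        self.w = nn.Parameter(torch.zeros(feat_dim))
        self.feat_dim, self.n_actions = feat_dim, n_actions

    def forward(self, obs):
        psi = self.psi(obs).view(-1, self.n_actions, self.feat_dim)
        q = psi @ self.w   # Q(s, a) = psi(s, a) . w, shape (batch, n_actions)
        return psi, q
```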

In the past few years there have been a number of interesting lines of research taking advantage of successor features. One great example is their use in a powerful multi-task learning setting. Borsa and colleagues have shown that agents using successor features can flexibly transfer between multiple tasks in a high-dimensional 3D navigation environment. Agents using successor features have also been shown to be able to learn to play multiple Atari games with only limited exposure to the true reward function. Successor features have also been used by researchers at the University of Alberta and Google Brain to generate an intrinsic exploration reward signal for agents playing Atari games. Since successor features measure how likely certain future states are to appear, it turns out that they can also be used to determine how novel new states are.

The Successor Representation in Psychology and Neuroscience

It isn’t just in the domain of machine learning that the successor representation has gained traction as a paradigm of choice; it has also caught the eye of a number of psychologists and neuroscientists.

Research by Evan Russek and colleagues has demonstrated that agents using a modified form of the successor representation are able to mimic a number of the basic capacities of rodents when navigating mazes. These capabilities fall under what psychologists refer to as a cognitive map. This ‘map’ provides animals with the ability to represent and navigate space in a non-reactive way. For rodents, this has been demonstrated in their ability to perform latent learning (the ability to use seemingly irrelevant information from a previous task to solve a new task) and their ability to quickly adapt to shortcuts and detours. Agents trained using a successor representation and a form of experience replay called Dyna were able to demonstrate both of these abilities.

Firing activity of a place cell in a rodent as it navigates around a circular space. B and C show effects of moving an indicator on the wall of the space. The place cell firing is anchored by the position of the indicator. Reproduced from Muller & Kubie 1987.

Research scientist Kimberly Stachenfeld and colleagues have proposed that the activity of place cells can be modeled using a successor representation. For those unfamiliar with place cells, they are cells in the mammalian hippocampus which fire whenever the animal is in a specific location in space, regardless of the direction the animal is facing. These cells are thought of as one of the building blocks of the cognitive map. It turns out that these cells display certain biases in their response to the location of the animal which the successor representation captures nicely.

The connection goes further though. There is another canonical group of brain cells (this time in the entorhinal cortex), referred to as grid cells. These cells fire as the animal moves around an environment, with each cell responding in a regular, grid-like pattern that tiles the space. It also happens to be the case that taking the eigendecomposition of the successor representation that models the place cells results in a representation that looks a lot like these grid cells.

Grid cell activity of rodent as it navigates environments of different shapes. Reproduced from Stachenfeld et al., 2017.
Eigendecomposition of successor representation of agent that navigated environments of different shapes. Reproduced from Stachenfeld et al., 2017.
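To try this yourself, here is a sketch using the tabular `M` from earlier (symmetrizing M before the eigendecomposition is my own simplification to keep the eigenvectors real, and `grid_shape` is a placeholder for however your gridworld happens to be laid out):

```python
import numpy as np
import matplotlib.pyplot as plt

def sr_eigenvectors(M, k=9, grid_shape=(11, 11)):
    """Top-k eigenvectors of the symmetrized SR, reshaped into spatial maps."""
    M_sym = 0.5 * (M + M.T)                      # symmetrize -> real eigenvectors
    vals, vecs = np.linalg.eigh(M_sym)           # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]]    # columns are the top-k eigenvectors
    return [top[:, i].reshape(grid_shape) for i in range(k)]

# Plot each eigenvector as a heat map over the gridworld; the low-frequency
# ones are the components that resemble grid-cell firing fields.
for i, ev in enumerate(sr_eigenvectors(M)):
    plt.subplot(3, 3, i + 1)
    plt.imshow(ev)
    plt.axis("off")
plt.show()
```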

Another line of behavioral research, by research scientist Ida Momennejad and colleagues, has shown that humans make decisions that are more consistent with learning a successor representation than with learning either a direct value function or a more complex model of the environment. The authors of the study demonstrated this by showing that, like the agent navigating the four-room environment, humans are better able to adapt when the reward location changes than when the structure of the environment itself changes. The successor approach also has advantages over pure planning, where an agent or person would have to consciously run through a chain of possible future outcomes to pick the best one. With a successor representation, that information is already available in a summarized form.

Epilogue

All of this research is still in its early phases, but it is encouraging to see that a useful and insightful computational principle from over twenty years ago is being so fruitfully applied in the domains of both deep learning and neuroscience. As you may have guessed by this point, it is also something that has been at the center of my own research this past year, and I hope to be able to share the fruits of some of that work in the not-too-distant future.

Even if the successor formulation turns out to not exactly be how humans and animals represent their ‘current state,’ the concept of representing the present in terms of the future is an appealing one. We are, after all, always forward-looking creatures, and predictive representations of one form or another are what can make that orientation possible.

