# Maximum Entropy Policies in Reinforcement Learning & Everyday Life


As those who follow this blog are probably aware, I spend a lot of time thinking about Reinforcement Learning (RL). These thoughts naturally extend into my everyday life, and to the ways in which the formalisms provided by RL can be applied to the world beyond artificial agents. For those unfamiliar, Reinforcement Learning is an area of Machine Learning dedicated to optimizing behaviors in the context of external rewards. If you’d like an introduction, I wrote a series of introductory articles on the topic a couple of years ago. What I want to talk about in this article is not RL as a whole, but specifically the role of randomness in action selection.

At first glance, the notion of randomness may seem counterintuitive for an algorithm intended to arrive at optimal behavior. Surely optimal behavior isn’t random. It turns out, however, that random actions are essential to the learning process. We use random actions in RL because we want our agents to be able to explore their worlds, and in the absence of a priori knowledge about the world, random actions are as good a policy as any other for starting to explore the environment. That being said, these random actions are taken under specific conditions, such as when sampling actions from a probability distribution, as in policy gradient methods, or when following an epsilon-greedy schedule for action selection, as in value-based methods.
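To make these two sources of randomness concrete, here is a minimal sketch in plain Python. The function names, the action-value estimates, and the action probabilities are all hypothetical and chosen only for illustration; a real agent would produce them from a learned model.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Value-based style: act randomly with probability epsilon, else greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def sample_from_policy(action_probs, rng):
    """Policy-gradient style: draw an action index from a categorical distribution."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

rng = random.Random(0)
q_values = [0.1, 0.5, 0.2]      # hypothetical action-value estimates
action_probs = [0.2, 0.7, 0.1]  # hypothetical policy output for one state

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)   # never explores
explore_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)  # always explores
sampled_action = sample_from_policy(action_probs, rng)
```

With `epsilon=0.0` the agent always picks the action with the highest estimated value, while `epsilon=1.0` makes it act uniformly at random. In practice epsilon is usually annealed from a high value toward a low one over the course of training.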

## Entropy in Reinforcement Learning

In many RL algorithms, such as the policy-gradient and actor-critic families, the policy is defined as a probability distribution over actions, conditioned on the state of the environment: p(a | s). When an agent takes a discrete action, picking one of many possible actions, a categorical distribution is used. In the case of a continuous control agent, a Gaussian distribution with a mean and standard deviation may be used. With these kinds of policies, the randomness of the actions an agent takes can be quantified by the entropy of that probability distribution.

Entropy is a term with a long history. It was originally used in physics to denote the lack of order within a system. From there it was integrated into the core of information theory, as a measure of the information present within a communication. In the case of RL, the information-theoretic definition gets repurposed. Because RL is all about learned behaviors, entropy here relates directly to the unpredictability of the actions which an agent takes under a given policy. The greater the entropy, the more random the actions an agent takes.
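As a rough illustration, the entropy of a categorical policy is the negative sum of p(a | s) log p(a | s) over actions, and a one-dimensional Gaussian policy has a closed-form entropy that depends only on its standard deviation. The sketch below computes both in nats; the function names and the example probabilities are my own, picked purely for demonstration.

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    # Terms with p == 0 are skipped, following the convention 0 * log(0) = 0.
    return -sum(p * math.log(p) for p in probs if p > 0)

def gaussian_entropy(std):
    """Differential entropy (in nats) of a 1-D Gaussian with standard deviation std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

uniform = categorical_entropy([0.25, 0.25, 0.25, 0.25])   # as random as 4 actions allow
peaked = categorical_entropy([0.97, 0.01, 0.01, 0.01])    # nearly deterministic
deterministic = categorical_entropy([1.0, 0.0, 0.0, 0.0]) # no randomness at all

narrow_gaussian = gaussian_entropy(0.1)
wide_gaussian = gaussian_entropy(1.0)
```

The uniform policy attains the maximum possible entropy for four actions, log(4), while the deterministic policy has an entropy of zero. Likewise, the Gaussian policy with the larger standard deviation has the larger entropy.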

In RL, the goal is typically formalized as optimizing the long-term sum of discounted rewards. This means learning to take specific sequences of actions which can accomplish this goal, to the exclusion of other possible action sequences. Such a learning process will naturally lead to the entropy of the action selection policy decreasing. This is only reasonable, since if we expect purposeful and coordinated behavior, then that behavior will naturally be less random than the original policy.
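For reference, the long-term sum of discounted rewards is commonly written as the return below. The symbols are standard textbook notation rather than anything defined earlier in this article: r is the reward received at a given timestep and γ is a discount factor between 0 and 1.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, \qquad 0 \le \gamma < 1
```

The agent's goal is then to find a policy that maximizes the expected value of this return.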

## Encouraging Entropy

Those familiar with the RL literature, however, will know that this is not the entire story. In addition to encouraging a policy to converge toward a set of probabilities over actions which lead to a high long-term reward, it is also typical to add what is sometimes called an “entropy bonus” to the loss function. This bonus encourages the agent to take actions more unpredictably, rather than less so.

Entropy bonuses are used because without them an agent can too quickly converge on a policy that is locally optimal, but not necessarily globally optimal. Anyone who has worked on RL problems empirically can attest to how often an agent may get stuck learning a policy that only runs into walls, or only turns in a single direction, or exhibits any number of other clearly suboptimal, but low-entropy, behaviors. In the case where the globally optimal behavior is difficult to learn due to sparse rewards or other factors, an agent can be forgiven for settling on something simpler, but less optimal. The entropy bonus is used to attempt to counteract this tendency by adding an entropy-increasing term to the loss function, and it works well in most cases. Indeed, many of the current state-of-the-art on-policy Deep RL methods, such as A3C, PPO, and others, take this approach.
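Here is a minimal sketch of what such a bonus can look like for a single sample, written in plain Python without automatic differentiation. The function names, the coefficient `beta`, and its default of 0.01 are assumptions for illustration; real implementations compute this over batches inside a deep learning framework, and the exact form differs between methods.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def policy_loss_with_entropy_bonus(probs, action, advantage, beta=0.01):
    """Policy-gradient loss for one sample, minus beta times the policy entropy.

    Subtracting the entropy from the loss means that minimizing the loss
    pushes the entropy up. beta is a hypothetical bonus coefficient.
    """
    pg_loss = -math.log(probs[action]) * advantage
    return pg_loss - beta * entropy(probs)

# Two hypothetical policies that assign the same probability to action 0,
# but spread the remaining probability mass differently.
spread = [0.5, 0.25, 0.25]
narrow = [0.5, 0.5, 0.0]

loss_spread = policy_loss_with_entropy_bonus(spread, action=0, advantage=1.0)
loss_narrow = policy_loss_with_entropy_bonus(narrow, action=0, advantage=1.0)
```

Both policies yield the same policy-gradient term, since they agree on the probability of the chosen action. The more spread-out policy ends up with the lower loss because of its higher entropy, which is exactly the pressure toward unpredictability that the bonus is meant to provide.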

## Maximizing for Long-term Entropy

While entropy bonuses are widely used, they actually connect to a much more fundamental concept in the theory of learning behavior. The entropy bonus described so far is what is referred to as a one-step bonus. This is because it applies only to the current state of the agent in an environment, and doesn’t take into account the future states the agent may find itself in. It can be thought of as a “greedy” optimization of entropy. We can draw a parallel to how RL agents learn from rewards. Rather than optimizing for the reward at every timestep, agents are trained to optimize for the long-term sum of future rewards. We can apply this same principle to the entropy of the agent’s policy, and optimize for the long-term sum of entropy.
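The difference between the one-step view and the long-term view can be sketched with a toy calculation. Everything here is hypothetical: the function names, the weight `alpha` on entropy, the discount `gamma`, and the reward and entropy values along the two trajectories. Conventions also vary on whether the entropy at the first step is included in the sum; this sketch includes it.

```python
def discounted_sum(values, gamma):
    """Discounted sum of a finite sequence, accumulated from the last step backward."""
    total = 0.0
    for v in reversed(values):
        total = v + gamma * total
    return total

def entropy_augmented_return(rewards, entropies, gamma=0.99, alpha=0.1):
    """Discounted sum of reward plus alpha-weighted policy entropy at each step."""
    augmented = [r + alpha * h for r, h in zip(rewards, entropies)]
    return discounted_sum(augmented, gamma)

# Two hypothetical trajectories with identical rewards and identical entropy
# at the current step, but different entropy in the states that follow.
rewards = [0.0, 0.0, 1.0]
entropies_open = [0.7, 0.7, 0.7]    # the agent keeps its options open
entropies_closed = [0.7, 0.0, 0.0]  # the agent ends up in states with no real choice

open_return = entropy_augmented_return(rewards, entropies_open)
closed_return = entropy_augmented_return(rewards, entropies_closed)
```

A one-step bonus sees only the first entry of each entropy list, so it cannot tell these trajectories apart. The long-term sum prefers the trajectory that keeps entropy high in future states.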

There is theoretical work from a number of researchers suggesting that, rather than only providing an entropy bonus at each time step, optimizing for this long-term objective is an even better approach. What this means is that it is optimal for an agent to learn not only to get as much future reward as possible, but also to put itself in positions where its future entropy will be the largest.
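One common way of writing this joint objective is shown below. The notation is an assumption on my part rather than something defined earlier in the article: α is a temperature that trades off reward against entropy, γ is the discount factor, and the entropy term is taken over the policy's action distribution in state s_t.

```latex
J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)
```

Setting α to zero recovers the usual discounted-reward objective, while larger values of α place more weight on visiting states where the policy can remain uncertain.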

One way to think about it is that an optimal agent does everything necessary to get as much reward as possible, but is as non-committal as possible about the specific set of actions it is taking, such that it can change its behavior in the future. An even simpler way to think about it is that optimizing for long-term entropy means optimizing for long-term adaptability. This way, if a better way of acting presents itself, either within the current episode of learning, or within the training process as a whole, the agent can most easily transition to another policy. The formalism around this approach is referred to as maximum entropy reinforcement learning. As you can probably imagine, it is called this because we want to jointly optimize for long-term reward as well as long-term entropy. This approach is useful under a number of circumstances, all of which relate to changes in the agent’s knowledge of the environment, or changes in the environment itself over time. There has been some work to empirically validate this approach as well, as you can see in the figure below. Even in these few Atari tasks, which are stationary, the use of a long-term entropy reward leads to similar or better performance.

## Maximum Entropy Policies in Everyday Life

I’d like to argue that this maximum entropy reinforcement learning principle actually applies much more broadly than just to RL, and touches many aspects of our lives as well. In maximum entropy RL, the basic principle is that optimal behavior corresponds to a proper balance between commitment and adaptability. I believe that this applies just as well to life decisions as it does to the behavior of artificial agents.