Maximum Entropy Policies in Reinforcement Learning & Everyday Life

Arthur Juliani
Nov 2, 2018


As those who follow this blog are probably aware, I spend a lot of time thinking about Reinforcement Learning (RL). These thoughts naturally extend into my everyday life, and the ways in which the formalisms provided by RL can be applied to the world beyond artificial agents. For those unfamiliar, Reinforcement Learning is an area of Machine Learning dedicated to optimizing behaviors in the context of external rewards. If you’d like an introduction, I wrote a series of introductory articles on the topic a couple years ago. What I want to talk about in this article is not RL as a whole, but specifically the role of randomness in action selection.

At first glance, the notion of randomness may seem counterintuitive for an algorithm intended to arrive at optimal behavior. Surely optimal behavior isn’t random. It turns out, however, that random actions are essential to the learning process. We use random actions in RL because we want our agents to be able to explore their worlds, and in the absence of a priori knowledge about the world, a random policy is as good a starting point for exploration as any other. That being said, these random actions are taken under specific conditions, such as when sampling actions from a probability distribution, as in policy gradient methods, or when following an epsilon-greedy schedule for action selection, as in value-based methods.
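As a rough illustration of the epsilon-greedy case, here is a minimal sketch. The function name `epsilon_greedy` and the list `q_values` (the agent's current value estimates, one per action) are hypothetical names chosen for this example, not part of any particular library.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly at random among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=1.0` every action is random, and with `epsilon=0.0` the agent always picks its current best guess. A schedule typically moves epsilon from a high value toward a low one over the course of training.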

Entropy in Reinforcement Learning

In many RL algorithms, such as the policy-gradient and actor-critic families, the policy is defined as a probability distribution over actions, conditioned on the state of the environment: p(a | s). When an agent takes a discrete action, picking one of many possible actions, a categorical distribution is used. In the case of a continuous control agent, a Gaussian distribution with a mean and standard deviation may be used. With these kinds of policies, the randomness of the actions an agent takes can be quantified by the entropy of that probability distribution.
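To make the two kinds of policy concrete, the following sketch samples an action from each. It assumes the policy's outputs for the current state are already available as plain numbers (`probs` for the categorical case, `mean` and `std` for the Gaussian case). In practice these would come from a learned model, and the function names here are hypothetical.

```python
import random

def sample_discrete_action(probs, rng=random):
    # Categorical policy: probs[i] is p(a = i | s) for the current state.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_continuous_action(mean, std, rng=random):
    # Gaussian policy: the action is drawn from a normal distribution
    # with the given mean and standard deviation.
    return rng.gauss(mean, std)
```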

Categorical (left) and Gaussian (right) distributions. Orange shows low-entropy distributions, while blue shows high-entropy distributions.

Entropy is a term with a long history. It was originally used in physics to denote the lack of order within a system. From there it was integrated into the core of information theory, as a measure of the information present in a message. In the case of RL, the information-theoretic definition gets repurposed. Because RL is all about learned behaviors, entropy here relates directly to the unpredictability of the actions which an agent takes under a given policy. The greater the entropy, the more random the actions an agent takes.

Equation for entropy of a discrete probability distribution (p).
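The equation image itself is not reproduced in this text. The standard Shannon entropy of a discrete distribution is H(p) = -Σ p(x) log p(x), and the sketch below computes it, along with the closed-form differential entropy of a one-dimensional Gaussian. The natural logarithm is an assumption (the base only changes the units), and the two example distributions are made up for illustration.

```python
import math

def categorical_entropy(probs):
    # H(p) = -sum_i p_i * log(p_i); terms with p_i = 0 contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

def gaussian_entropy(std):
    # Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * std^2).
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)

uniform = [0.25, 0.25, 0.25, 0.25]  # high entropy: every action equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # low entropy: one action dominates
```

The uniform distribution reaches the maximum possible entropy for four actions, log(4), while the peaked one is close to zero. For the Gaussian, a wider standard deviation means higher entropy.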

In RL, the goal is typically formalized as optimizing the long-term sum of discounted rewards. This means learning to take specific sequences of actions which can accomplish this goal to the exclusion of other possible action sequences. Such a learning process will naturally lead to the entropy of the action selection policy decreasing. This is only reasonable, since if we expect purposeful and coordinated behavior, then that behavior will naturally be less random than the original policy.

Encouraging Entropy

Those familiar with the RL literature, however, will know that this is not the entire story. In addition to encouraging a policy to converge toward a set of probabilities over actions which lead to a high long-term reward, it is also typical to add what is sometimes called an “entropy bonus” to the loss function. This bonus encourages the agent to take actions more unpredictably, rather than less so.

Update equation for A3C. Entropy bonus is H(π) term.

Entropy bonuses are used because without them an agent can too quickly converge on a policy that is locally optimal, but not necessarily globally optimal. Anyone who has worked on RL problems empirically can attest to how often an agent may get stuck learning a policy that only runs into walls, or only turns in a single direction, or exhibits any number of clearly suboptimal but low-entropy behaviors. In the case where the globally optimal behavior is difficult to learn due to sparse rewards or other factors, an agent can be forgiven for settling on something simpler, but less optimal. The entropy bonus attempts to counteract this tendency by adding an entropy-increasing term to the loss function, and it works well in most cases. Indeed, many of the current state-of-the-art on-policy Deep RL methods, such as A3C and PPO, take this approach.
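A minimal sketch of how such a bonus enters a per-step policy loss is shown below. It is not the full A3C update (which also includes a value-function term), and the function names as well as the default weight `beta=0.01` are assumptions made for illustration.

```python
import math

def entropy(probs):
    # Shannon entropy of the action distribution for the current state.
    return -sum(p * math.log(p) for p in probs if p > 0)

def policy_loss_with_entropy_bonus(probs, action, advantage, beta=0.01):
    # Policy-gradient term: minimizing it raises the log-probability
    # of actions that had a positive advantage.
    pg_loss = -math.log(probs[action]) * advantage
    # Entropy bonus: subtracting beta * H(pi) means that minimizing
    # the loss pushes the policy's entropy up.
    return pg_loss - beta * entropy(probs)
```

The weight `beta` sets the trade-off: at zero the bonus disappears, and at large values the policy is held close to uniform regardless of reward.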

Maximizing for Long-term Entropy

While entropy bonuses are widely used, they actually connect to a much more fundamental concept in the theory of learning behavior. The entropy bonus described so far is what is referred to as a one-step bonus. This is because it applies to only the current state of the agent in an environment, and doesn’t take into account the future states the agent may find itself in. It can be thought of as a “greedy” optimization of entropy. We can draw a parallel to how RL agents learn from rewards. Rather than optimizing for the reward at every timestep, agents are trained to optimize for the long-term sum of future rewards. We can apply this same principle to the entropy of the agent’s policy, and optimize for the long-term sum of entropy.
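The difference between the one-step and long-term views can be sketched with a discounted sum, applied to per-step policy entropies in exactly the way it is usually applied to rewards. The two entropy sequences below are invented numbers for illustration, and `gamma=0.9` is an arbitrary choice.

```python
def discounted_sum(values, gamma=0.99):
    # Accumulate backwards: G_t = x_t + gamma * G_{t+1}.
    total = 0.0
    for x in reversed(values):
        total = x + gamma * total
    return total

# Hypothetical per-step policy entropies along two possible futures.
commit_early = [1.0, 0.2, 0.1]   # high entropy now, little afterwards
stay_flexible = [0.7, 1.2, 1.3]  # slightly lower now, more later

one_step_prefers_commit = commit_early[0] > stay_flexible[0]
long_term_prefers_flexible = (
    discounted_sum(stay_flexible, gamma=0.9) > discounted_sum(commit_early, gamma=0.9)
)
```

A greedy, one-step view ranks the first future higher because its immediate entropy is larger, while the discounted sum ranks the second future higher.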

There is also theoretical work from a number of researchers suggesting that optimizing for this long-term objective, rather than only providing an entropy bonus at each time step, is an even better approach. What this means is that it is optimal for an agent to learn not only to get as much future reward as possible, but also to put itself in positions where its future entropy will be the largest.

Equation for Maximum Entropy Reinforcement Learning. Optimal policy π corresponds to maximum over both discounted rewards and entropy.
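The equation image is not reproduced in this text. One common way to write the objective the caption describes is given below. The temperature α, which weights entropy against reward, and the discount factor γ are notational assumptions and may differ from the original figure.

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \alpha \, H\big(\pi(\cdot \mid s_t)\big) \Big) \right]
```

Setting α to zero recovers the usual discounted-reward objective. Larger values of α place more weight on keeping the policy's entropy high in the states the agent expects to visit.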

One way to think about it is that an optimal agent does everything necessary to get as much reward as possible, but is as non-committal as possible about the specific set of actions it is taking, such that it can change its behavior in the future. An even simpler way to think about it is that optimizing for long-term entropy means optimizing for long-term adaptability. This way, if a better way of acting presents itself, either within the current episode of learning, or within the training process as a whole, the agent can most easily transition to another policy. The formalism around this approach is referred to as maximum entropy reinforcement learning. As you can probably imagine, it is called this because we want to jointly optimize for long-term reward as well as long-term entropy. This approach is useful under a number of circumstances, all of which relate to changes in the agent’s knowledge of the environment, or changes in the environment itself over time. There has been some work to empirically validate this approach as well, as you can see in the figure below. Even in these few Atari tasks, which are stationary, the use of a long-term entropy reward leads to similar or better performance.

Results from experiments comparing one-step entropy bonus (red) to long-term optimization of entropy (blue). In the six tasks compared, the long-term entropy optimization leads to as good or better performance than the naive one-step entropy optimization. Taken from

Maximum Entropy Policies in Everyday Life

I’d like to argue that this maximum entropy reinforcement learning principle actually applies much more broadly than just to RL, and touches many aspects of our lives as well. In maximum entropy RL, the basic principle is that optimal behavior corresponds to a proper balance between commitment and adaptability. I believe that this applies just as well to life decisions as it does to the behavior of artificial agents.

Consider the hypothetical example of moving to a new city in a colder climate than where you grew up. You might have developed a habit of frequently wearing T-shirts and shorts where you came from. In the new city this may lead to a less comfortable experience. Your willingness to adjust your wardrobe to fit the new circumstances is directly related to how “high-entropy” your clothing policy was. In the original city you optimized your clothing for comfort. If you had a high-entropy policy, you will quickly be able to adapt to the new city. If you had a low-entropy clothing policy, then you may stubbornly hold on to your pre-existing clothing patterns, and suffer as a result. The key here is not only having a high-entropy policy in the moment, but ensuring that when something like a move to a new city happens, the entropy will also be high. This would correspond to not spending all your money on T-shirts, for example.

The above example may seem somewhat silly, but I think it is reflective of a vast array of phenomena we encounter in modern society. Let’s consider another example at the societal level: that of scientific development. Take for example any revolution in science, such as the Copernican, Darwinian, or your personal favorite. Scientists attempting to optimize the rewards (fame, truth, societal/technological impact, etc.) of scientific discovery were faced with a choice: either continue along their pre-existing lines of research, or adapt to the new paradigm. Those with “high-entropy” research programs were more likely to adapt to a scientific program based on a sun-centered universe, or on natural-selection-based development of an organism’s traits. In contrast, those with low-entropy policies were more likely to continue with their pre-existing programs to their detriment. The long-term maximum-entropy aspect comes in early in these scientists’ careers, when they have the opportunity to ensure that their research programs don’t become too closed or too focused on specific theoretical beliefs. Making these decisions early in a career then enables swift changes later in one’s life.

The examples provided above are just a few of countless possible ones related to our personal and collective decision making in the world. Similar examples can easily be drawn from interpersonal life, politics, and any number of other life decisions. In all cases the key is to plan not only for a good outcome, but also for the ability to change when the world does. This is an insight that many successful individuals have already ingrained into their lives, and I imagine there will likely be many artificial individuals imbued with the same insight to come.



Arthur Juliani

Postdoctoral researcher at Microsoft. Interested in artificial intelligence, neuroscience, philosophy, and meditation.
