The Emotional Lives of RL Agents

Mar 8, 2023

Introduction

When psychologists discuss mental content, they typically divide it into two broad categories: thoughts (cognition) and emotions (affect). The field of Artificial Intelligence (AI) has focused extensively on understanding and simulating cognition in machines. Especially in the past decade, this approach has seen great success, with more and more human-like cognitive abilities being performed by artificial agents. Despite this sole focus on cognition, however, the question has started to be raised whether advanced large language models (LLMs) like ChatGPT are sentient in any real sense (I don’t think that they are). Still, the question of sentience is intimately connected to the question of whether there is any sense in which LLMs (or any artificial agents, for that matter) are engaged not only in cognitive processing, but also in some form of what we might call affective computation. Unlike the sentience question, I think that the question of affective computation is worth pursuing, both to improve the capacities of artificial agents and to provide a lens through which we can understand our own emotional lives.

The one domain within AI that has historically served as an exception to the broader focus on cognition has been Reinforcement Learning (RL). From its outset, RL was developed as a way to model motivation and learning in the face of a special kind of valenced feedback referred to as rewards. At the core of the formulation is an assumption that the agent is trained to obtain positive rewards and avoid negative rewards. As humans we learn to do the same, but not just because of some unconscious processing. Rather, we do it because those outcomes feel a certain way. It is through this “feels a way” that affect comes into the picture.

In this article, I want to explore this connection between RL and affect more deeply. To do so, I will be working through a conceptual framework through which researchers have attempted to understand both moment-to-moment affective experiences and the longer-term construct of mood through a computational lens [1], [2]. I will be particularly focusing on the work of Yael Niv and colleagues at Princeton, whose relevant work you can find referenced at the bottom of this article. I believe that this particular computational lens has compelling implications both for how we understand human emotional experience and for how we might build artificial agents going forward. That said, for the sake of succinctness, I will be presenting a version with a few simplifications. If you are interested in the fuller elaboration, I recommend exploring the original research articles.

Reinforcement Learning Basics

Before diving into the more complex constructs of affect and mood, it is useful to establish some of the basic terms within the RL framework. If you are already familiar with RL, feel free to skip to the next section. The most fundamental concept is that of the agent, an entity that observes the world, uses those observations to infer the state of the world, and then produces actions in response to that state. The process by which observations are mapped to actions is called the policy of the agent. In RL we train an agent to learn a policy which will maximize the expected value in a given environment. This expected value is the result of integrating the rewards it expects to encounter within the environment over time. As such, a policy with a large positive expected value corresponds to a policy that receives a large positive set of rewards.

We can now define these basics in slightly more formal terms. The policy is defined as the probability of selecting a given action (a) conditioned on the current state (s): π(a | s). The value function describes the expected return either from a given state, V(s), or from a given state and action, Q(s, a). A reward function describes the reward the agent can expect to receive for a given action in a given state: R(s, a). When an agent takes an action in the environment, the dynamics of the environment are governed by a probabilistic transition function T(s’ | s, a). Lastly, the observation provided to the agent is governed by a probabilistic observation function O(o | s, a).

Given the state-action value function Q(s, a) and the state value function V(s), we can define one additional term, the advantage A(s, a), which corresponds to how much better a given action is (or we expect it to be) than what was more generally expected from the state in which the action was taken. We can accordingly define the advantage with respect to these two value functions: A(s, a) = Q(s, a) - V(s). Far from being just a mathematical curiosity, the advantage is often the term used in practice to derive a learning signal to improve the policy of the agent. This makes sense, because it provides a direct measure which can be used to increase or decrease the probability of taking a given action in the future. If the advantage is positive, then the action was better than expected in that state, and the agent should take it more frequently; if it is negative, then the agent should take it less often.
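To make the definition concrete, here is a minimal sketch in Python. The Q-values and policy probabilities are made-up numbers for a single hypothetical state, and the sketch assumes the standard identity that V(s) is the policy-weighted average of Q(s, a).

```python
# Hypothetical Q(s, a) values for one state with three actions
# (illustrative numbers only, not taken from any experiment).
q_values = {"left": 1.0, "stay": 2.0, "right": 4.0}

# Hypothetical policy pi(a | s) for the same state.
policy = {"left": 0.25, "stay": 0.5, "right": 0.25}


def state_value(q, pi):
    """V(s), assumed here to be the policy-weighted average of Q(s, a)."""
    return sum(pi[a] * q[a] for a in q)


def advantages(q, pi):
    """A(s, a) = Q(s, a) - V(s) for every action in the state."""
    v = state_value(q, pi)
    return {a: q[a] - v for a in q}


v = state_value(q_values, policy)
adv = advantages(q_values, policy)
print(v)    # 2.25
print(adv)  # {'left': -1.25, 'stay': -0.25, 'right': 1.75}
```

Under this assumption the advantages average to zero under the policy, so some actions must come out better than expected and others worse.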

In a world in which the environment and policy remain stationary and deterministic, V(s) and the Q(s, a) of the action the policy selects would be identical. In practice, however, agents typically have to learn both Q(s, a) and V(s) from experience while they are learning a policy π(a | s). This means that a lot of trial and error is involved, and that the “correct” Q(s, a) and V(s) vary over time. In online learning settings, it is also common to approximate Q(s, a) by decomposing it into the immediate reward plus a discounted expected future reward. Formally this looks like the following: Q(s, a) = r + γ * Q(s’, a’), where γ is a discount factor used to privilege more temporally proximal rewards. Using this decomposition, which forms the basis of temporal difference (TD) learning, immediate outcomes from actions in the environment can directly influence the advantage being learned. In the context of TD learning, the advantage is also sometimes referred to as the reward prediction error (RPE) in cases where there is only a single decision to be made. In this case A(s, a) = r - V(s), where r = Q(s, a), since there is no next state or Q(s’, a’).
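The decomposition above can be sketched as a one-step update. This is a simplified illustration, not a full agent: the step size `alpha` and the specific numbers are assumptions introduced here, and the function names are hypothetical.

```python
def td_error(r, q_sa, q_next, gamma):
    """Difference between the TD target r + gamma * Q(s', a') and Q(s, a)."""
    return r + gamma * q_next - q_sa


def td_update(q_sa, r, q_next, alpha, gamma):
    """Move Q(s, a) a fraction alpha of the way toward the TD target."""
    return q_sa + alpha * td_error(r, q_sa, q_next, gamma)


def reward_prediction_error(r, v_s):
    """Single-decision case: A(s, a) = r - V(s), with no next state."""
    return r - v_s


# Illustrative numbers: current estimate 1.0, reward 1.0, next estimate 2.0.
delta = td_error(r=1.0, q_sa=1.0, q_next=2.0, gamma=0.5)
new_q = td_update(q_sa=1.0, r=1.0, q_next=2.0, alpha=0.5, gamma=0.5)
rpe = reward_prediction_error(r=1.0, v_s=0.25)
print(delta, new_q, rpe)  # 1.0 1.5 0.75
```

A positive error means the outcome was better than the current estimate predicted, so the estimate is nudged upward; a negative error nudges it downward.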

Affect as Momentary Advantage

If we think about our daily emotional experiences, we find that somewhat counterintuitively what brings us happiness or sorrow is not directly connected to the underlying external rewards we find in our environment. Take for example the classic adage that “money can’t buy happiness.” There are plenty of wealthy individuals with clinical depression, and likewise individuals with low incomes living happy lives. If affect does not correspond to the underlying reward function, then it likewise does not correspond directly to the value function either, since V(s) simply provides an internal estimate of the expected future reward.

If not when we receive rewards, then when do we find ourselves most happy? Research suggests that it is at moments when we were expecting things to go poorly, and then they turn out (to our surprise) to go well. Likewise, when do we experience the greatest sadness, if not when we expect things to go well and they turn out worse than expected? This intuitive understanding of happiness and sadness suggests that it is in fact the advantage function, rather than reward or value estimates, which is best suited to describe moment-to-moment affective experience.

The sense in which advantage captures our feelings about an event extends beyond lived experience to the domain of memory as well. It is here where things get more complicated, since it is possible to update our expectations relative to the outcome given later information, which is to say that we have the benefit of hindsight. Let’s imagine that a hypothetical college student ‘Jane’ has just graduated, and is choosing between two job offers at different companies. One is a large and stable company offering an average entry-level position (Company A). The other is a smaller startup offering a more exciting role with greater responsibilities and freedom (Company B). At the time, Jane decides to work for Company B. In doing so, she accrues a positive advantage (and corresponding positive affect), because she now has a job which is better than what she expected to get upon completing college.

What happens though if Company B goes bankrupt six months after she starts the job? What was experienced as a positive event (taking the more exciting job) is suddenly remembered by Jane as a regrettable mistake. Maybe she tells herself that she should have chosen the safer option, and insists that she won’t make the same mistake in the future. In computational terms, this change of feeling occurs because the advantage associated with the underlying action has changed: the value of the decision has moved from above the expectation to below it. With the new information she gained when she lost her job, her estimate of Q(s, a) decreased relative to V(s) for her decision about where to work after college.

Things might have gone otherwise. Assume Jane had taken the job at Company A instead. She might have felt relatively neutral about her decision at first, or perhaps even regretted it in the immediate aftermath, thinking that she lost out on a promising opportunity. If she later learns that Company B went bankrupt though, she would now re-evaluate her decision to work for Company A and feel much better about that decision as a result. What was a missed opportunity before is now seen as having dodged a bullet due to the newly revised decrease in V(s) relative to Q(s, a).

Figure 1. Advantage is the difference between outcomes (real, recalled, or imagined) and the expectation in a given state.

The advantages underlying our affective experiences need not correspond even to actual events (present or past). They can also be influenced by imagined alternatives. To understand this we can return to the case of Jane taking one job over the other. Let’s imagine that she chose Company B, and the company did not go bankrupt after all. She is generally happy with her decision, but then one day starts daydreaming about working at Company C, which she didn’t even apply to, or even know about before taking a job at Company B. If she imagines that Company C would be a much better place to work than where she is now, then the advantage of her original decision goes down due to an increase in V(s) relative to Q(s, a). This is all despite the fact that Company C wasn’t even part of the original decision-making process at the time!
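The three versions of Jane's story (the original choice, the hindsight revision after the bankruptcy, and the imagined Company C) can be put side by side with made-up numbers. Every value below is a hypothetical chosen only to show the direction of each change.

```python
def advantage(q_sa, v_s):
    """A(s, a) = Q(s, a) - V(s)."""
    return q_sa - v_s


# Hypothetical estimates at the time of the decision.
v_after_college = 5.0  # V(s): what Jane generally expected after graduating
q_company_b = 8.0      # Q(s, a): her estimate for choosing Company B

# 1. Lived experience: the job looks better than expected.
at_decision = advantage(q_company_b, v_after_college)

# 2. Hindsight: after the bankruptcy, Q(s, a) is revised downward.
q_company_b_hindsight = 2.0
in_hindsight = advantage(q_company_b_hindsight, v_after_college)

# 3. Imagination: daydreaming about Company C raises V(s) instead.
v_with_company_c = 10.0
when_daydreaming = advantage(q_company_b, v_with_company_c)

print(at_decision, in_hindsight, when_daydreaming)  # 3.0 -3.0 -2.0
```

The same action ends up with a positive or a negative advantage depending only on which estimate was revised, which is the sense in which the feeling attached to a memory can flip without the past itself changing.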

With this perspective on affect corresponding to advantage, we can start to think of how such a system relates to emotional pathologies. Given the generation of an incorrect Q(s, a) or V(s) value estimate, whether in reality, memory, or imagination, the resulting advantage may not actually be appropriate for the situation. If we overestimate V(s) by imagining unreachable alternatives, then we are forced to experience our current decisions as “bad” regardless of their real value to us. Similarly, if we underestimate Q(s, a) by not appreciating the real expected value of our current situation, we would needlessly experience regret. As such, correctly estimating these values has a real impact both on our emotional experience and on our ability to learn to act more adaptively in the future, since advantage is one of the main signals driving behavioral learning.

Mood as Temporally Integrated Advantage

With the basic computational building block of affect understood as advantage, we can move to describing a more complex emotional variable: mood. Rather than reflecting one’s feelings at a given time about particular events, mood captures a longer-term sense of how one generally feels and interprets events. In this sense, we can think of mood as being derived from affect through a process of temporal integration. More formally, we can define the mood update at a given time as: M = M + α * (A(s, a) - M), where α is the learning rate. In this sense, mood tracks the trend in advantage over time. Depending on the scale of α, an individual may be more or less sensitive to current experiences, which modulates how strongly those experiences affect the person’s longer-term mood.
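As a minimal sketch, the mood update is a single line of Python. The function name and the example numbers are hypothetical; the update rule itself is the one given above.

```python
def update_mood(mood, advantage, alpha):
    """M = M + alpha * (A(s, a) - M): move mood toward the latest advantage."""
    return mood + alpha * (advantage - mood)


# The same positive experience (advantage of 1.0) for two hypothetical people
# who start from a neutral mood but differ in their learning rate.
sensitive = update_mood(0.0, 1.0, alpha=0.5)
steady = update_mood(0.0, 1.0, alpha=0.25)
print(sensitive, steady)  # 0.5 0.25

# Repeated positive experiences pull mood toward the advantage level.
mood = 0.0
for adv in [1.0, 1.0, 1.0]:
    mood = update_mood(mood, adv, alpha=0.5)
print(mood)  # 0.875
```

A larger α makes mood follow recent experiences closely, while a smaller α averages over a longer history.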

Let’s use an example to better understand this phenomenon. Consider Jane, our college graduate, once again. She makes the decision to work for Company B, and experiences a large positive advantage value as a result of starting work for the company. This will result in an increase in her mood, which was otherwise lower than the advantage she received from starting the job. However, in the absence of other strongly positive or negative advantage events, this increase in mood will slowly fade over time (due to the integration of small advantage values), and mood will eventually return to a baseline level. This captures the typical experience of the novelty of initially mood-boosting positive experiences such as new relationships, jobs, homes, etc., eventually wearing off. One can perhaps even interpret a mood near neutral as a kind of boredom.
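This return to baseline can be sketched by feeding one large advantage into the mood update and then a run of uneventful steps. The numbers are illustrative, and treating the uneventful steps as advantages of exactly zero is a simplifying assumption.

```python
def update_mood(mood, advantage, alpha):
    """M = M + alpha * (A(s, a) - M)."""
    return mood + alpha * (advantage - mood)


def mood_trajectory(advantages, alpha=0.5, baseline=0.0):
    """Apply the mood update once per advantage and record mood after each."""
    mood = baseline
    trajectory = []
    for adv in advantages:
        mood = update_mood(mood, adv, alpha)
        trajectory.append(mood)
    return trajectory


# One strongly positive event (starting the new job) followed by
# ten uneventful steps, assumed here to have an advantage of zero.
advantages = [10.0] + [0.0] * 10
trajectory = mood_trajectory(advantages)
print([round(m, 3) for m in trajectory])
```

Mood jumps when the event occurs and then shrinks by a constant fraction at every uneventful step, which is one simple way to picture the novelty wearing off.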

Figure 2. Example from a simulation of an agent learning to find a goal in a gridworld. As episodic return increases, so does mood. Over time, however, mood returns to baseline. To reproduce these results yourself, see: *here*.

Importantly, this mood parameter is the integration of not only actual experiences, but also of remembered or imagined experiences, each of which has an advantage associated with it, as described above. In this way, repeatedly recalling a past negative event can directly impact one’s mood, even if the event happened years ago. Likewise, imagining a set of positive events in the future can brighten our day, even if those events are unlikely to happen anytime soon (if at all). In fact, this intentional imagining of potential positive future events (i.e., events with a positively valued advantage) is the basis for some forms of talk therapy used to help treat mood disorders [3].

We can also use this computational lens to understand traumatic experiences, and their impact on an individual’s long-term mood. There is a line of research suggesting that the valence of experience scales exponentially, rather than linearly [4]. What this means is that the difference between a somewhat positive experience and a very positive experience might be an order of magnitude greater than the difference between a neutral experience and a somewhat positive experience. In terms of a quantitative advantage, we might say that a neutral and predictable experience such as brushing one’s teeth corresponds to an advantage of 0, a positive unexpected experience such as getting a promotion at work would correspond to an advantage of 5, but the unexpected positive experience of starting a new romantic relationship might correspond to an advantage of 500.

What this means for updating the mood term is that the very positive experience has a disproportionate impact on mood, not just in the moment, but also into the longer-term future. The same is true of negative affective experiences. As such, if someone experiences an extremely negative and unpredicted event, such as the death of a loved one (which we might arbitrarily assign a -10000 advantage), then that event can potentially impact the mood for weeks, months, or even years. This is especially true given the fact that mood is updated not only from actual experience, but from remembered experience as well. In this sense, a traumatic event can continue to influence the ongoing mood directly as well as indirectly.
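To see how the magnitude of a single event carries forward, the sketch below counts how many uneventful steps it takes for mood to fall back within a small band around neutral. The advantage values are the arbitrary ones used in the text; the learning rate, the threshold, and the assumption that all later steps have zero advantage are choices made here for illustration.

```python
def update_mood(mood, advantage, alpha):
    """M = M + alpha * (A(s, a) - M)."""
    return mood + alpha * (advantage - mood)


def steps_to_recover(event_advantage, alpha=0.1, threshold=0.1):
    """Count zero-advantage steps until |mood| drops below the threshold."""
    mood = update_mood(0.0, event_advantage, alpha)
    steps = 0
    while abs(mood) >= threshold:
        mood = update_mood(mood, 0.0, alpha)
        steps += 1
    return steps


events = [
    ("brushing teeth", 0.0),
    ("promotion", 5.0),
    ("new relationship", 500.0),
    ("death of a loved one", -10000.0),
]
for label, adv in events:
    print(label, steps_to_recover(adv))
```

In this simplified setting the effect of a single event fades geometrically, so recalled experiences, which apply an advantage again each time they come to mind, are what would keep mood displaced for much longer.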

Figure 3. The difference between linear and exponential scaling. A line of research suggests affective experience follows the latter rather than the former.

If this were the full story, however, then we would always be caught up in either remembering or imagining the most extreme events of the past, keeping our emotional life perpetually intensified. While this happens to some individuals struggling with mental health issues, it is not the typical experience. What happens most commonly is that experiences which were emotionally strong at the time get processed, integrated, and eventually lose their emotional sharpness. This is because the advantage of a given event does not stay fixed. As we update our Q(s, a) and V(s) estimates, the advantage correspondingly shrinks in magnitude over time. In fact, reducing this discrepancy is exactly the objective we use to update these value estimates as well as the policy by which we act in the world. Because of this, over time the advantage of an emotionally salient event will trend towards zero, and thus be less likely to have a strong impact on one’s mood when it is recalled. Of course this process is not purely unconscious. Especially in the case of emotionally difficult or traumatic experiences, their integration (resulting in updated value estimates) can require reevaluations of one’s sense of self and one’s place in family, workplace, or broader community.
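The way an event loses its sharpness can be sketched with the single-decision form A(s, a) = r - V(s): each time the event is revisited, the value estimate moves toward the outcome, so the next recall produces a smaller advantage. The step size and the outcome value are illustrative assumptions, and the function name is hypothetical.

```python
def recall_and_integrate(outcome, v_estimate, alpha=0.5, n_recalls=5):
    """Return the advantage felt at each recall and the final value estimate.

    At every recall the advantage is outcome - V(s), and V(s) is then
    moved a fraction alpha of the way toward the outcome.
    """
    felt = []
    for _ in range(n_recalls):
        adv = outcome - v_estimate
        felt.append(adv)
        v_estimate = v_estimate + alpha * adv
    return felt, v_estimate


felt, v_final = recall_and_integrate(outcome=10.0, v_estimate=0.0)
print(felt)     # [10.0, 5.0, 2.5, 1.25, 0.625]
print(v_final)  # 9.6875
```

The felt advantage halves at each recall in this example because the expectation has absorbed half of the remaining surprise each time.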

The purpose of affect/advantage is to provide a learning signal which can direct us to act more optimally in the future, given a set of goals and environmental conditions. The phenomenal experience of the affect is itself the learning signal. When it comes to mood however, it is less immediately obvious what role it might have in improving our ability to learn adaptive behaviors. One hypothesis which I will expand upon here is that mood acts as a kind of momentum signal for the expected change in rewards over time [1].

The thinking goes that there is typically a strong spatial and temporal correlation between positive and negative rewards. If our primate ancestors discovered a fruit tree in a forest, then it is likely the case that there would be other fruit trees nearby. In this way, it is appropriate to begin to expect other positive rewards in the future. Likewise, if a family member dies of disease, it may be appropriate to expect other deaths in the near future. Computationally this can be described as a modification of the value function update: V(s) = V(s) + α * (f * M + r - V(s)), where f is a term used to modulate the impact of mood on value updates. What this means is that if we have a negative mood, it will color our value updates to be more negative than they would otherwise be, and vice versa. A related, but slightly different perspective is that mood serves as an analogue for the momentum parameter used in stochastic gradient descent algorithms [2], biasing learning accordingly. Further research needs to be done, though, to fully understand its complex role in learning.
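The mood-biased value update can be sketched as follows. This is only an illustration of the formula above with made-up numbers; the function name is hypothetical and it is not the implementation used by any particular library.

```python
def mood_biased_value_update(v, r, mood, alpha, f):
    """V(s) = V(s) + alpha * (f * M + r - V(s))."""
    return v + alpha * (f * mood + r - v)


# The same reward (1.0) and starting estimate (0.0) under three moods.
good_mood = mood_biased_value_update(v=0.0, r=1.0, mood=1.0, alpha=0.5, f=1.0)
neutral = mood_biased_value_update(v=0.0, r=1.0, mood=0.0, alpha=0.5, f=1.0)
bad_mood = mood_biased_value_update(v=0.0, r=1.0, mood=-1.0, alpha=0.5, f=1.0)
print(good_mood, neutral, bad_mood)  # 1.0 0.5 0.0
```

With f set to zero the update reduces to the ordinary value update, so f controls how much mood is allowed to tint what is learned from each reward.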

If you are interested in experimenting with a simple version of these algorithms for learning and using mood, see this notebook I created which uses the neuro-nav library, and a MoodQ agent.

Further Considerations

In this article I attempted to outline a basic model for the computational basis of moment-to-moment affect and temporally extended mood in humans. I believe it has the explanatory power to allow us to make sense of a number of phenomena in our daily emotional lives. Still, like any basic model it is an oversimplification of the reality of human experience. In particular, the actual emotional experiences that we have are not only influenced by our underlying affective systems, but are also the result of complex cognitive, physical, and social interplays. There are also many different forms of mental pathology which result in disruptions in affect and mood that aren’t accounted for without developing the system described here much further. Still, I think it is a useful starting point.

If you are interested in a further exploration of the relationship between affect and RL, I recommend a recent paper called “Emotions as Computations” [5]. While the focus in AI has been heavily on cognition, there is a world of value to be gained from understanding emotion and its complex relationship to adaptive behavior. I hope that this article has provided a glimpse into that potential.

References

[1] Eldar, E., Rutledge, R. B., Dolan, R. J., & Niv, Y. (2016). Mood as representation of momentum. Trends in Cognitive Sciences, 20(1), 15–24.

[2] Bennett, D., Davidson, G., & Niv, Y. (2022). A model of mood as integrated advantage. Psychological Review, 129(3), 513.

[3] Jacobson, N. S., Dobson, K. S., Truax, P. A., Addis, M. E., Koerner, K., Gollan, J. K., … & Prince, S. E. (1996). A component analysis of cognitive-behavioral treatment for depression. Journal of Consulting and Clinical Psychology, 64(2), 295.

[4] Gómez-Emilsson, A. (2019). Logarithmic Scales of Pleasure and Pain: Rating, Ranking, and Comparing Peak Experiences Suggest the Existence of Long Tails for Bliss and Suffering. QRI.

[5] Emanuel, A., & Eldar, E. (2022). Emotions as computations. Neuroscience & Biobehavioral Reviews, 104977.