The field of Deep Reinforcement Learning (DeepRL) has made significant gains in the past half-decade. To go from sometimes solving simple Atari games to achieving superhuman performance in complex modern games like Starcraft 2 and DOTA in such a short time is quite an accomplishment. Indeed, given such successes one might think that RL has solved most interesting problems in games. There are however domains where there is still much progress to be made. One such area is in tasks which change over time, and require online adaptation on the part of the agent. The field of Meta-RL has sought to address such environments. In the Meta-RL formulation, an agent learns to solve not just one task, but rather to solve a distribution of tasks. Trained agents are then evaluated based on how well they can perform in unseen tasks drawn from a similar distribution. Often performing well in the never-before-seen tasks involves learning how to properly probe the environment to understand its potentially novel dynamics. This process has various names, such as hypothesis testing, online inference, or active inference, and is an active area of research.
Recently, a group at DeepMind has proposed a novel 3D environment for evaluating Meta-RL agents called Alchemy. In this environment, the agent must mix various items together in a pot to create a desired outcome. The nature of the items and the rules governing their interaction change between episodes, thus requiring meta-learning in the form of online hypothesis testing to solve optimally. A surprising outcome of their work was the discovery that current state-of-the-art DeepRL algorithms fail to perform this task optimally. The learned agents display a lack of hypothesis testing ability, preferring to act based on average outcomes over all episodes rather than paying attention to the unique rules of a given episode. What the work suggested was that although current DeepRL agents were able to learn to solve the control problem without issue, the underlying problem of properly probing each environment to understand how it differs from the others, and then exploiting that difference, was beyond the capability of the trained agents. This weakness leaves open space for future research in the area.
While the Alchemy environment is quite extensive and well designed in its approach to testing for these key cognitive abilities, it involves a large amount of low-level motor control on the part of the agent, making it less accessible to researchers with smaller computational budgets (4/13/2021 Update: I have learned there is also a symbolic version which does away with the compute-intensive 3D requirements). Like many purpose-built environments, it also lacks a certain amount of ecological validity, since the rules of the task were developed a priori for DeepRL research. I was recently thinking about these limitations, and it struck me that there exists a ready-made class of environments which display many of the desirable properties for testing Meta-RL agents, and could be simulated at very little computational cost. These are Japanese Role Playing Games (JRPGs), a genre of video game which became popular in the mid-90s with long running series such as Final Fantasy, Dragon Quest, Persona, and Pokemon.
JRPGs typically involve long complicated fantasy stories, where a group of individuals come together to save the world by battling a series of increasingly challenging enemies over the course of dozens of hours of gameplay. The main form of challenge in these games lies in their battle system, where the player controls one or more protagonists and must battle one or more enemies at a time. These battles traditionally take the form of so-called turn-based encounters, where the protagonists and enemies take turns taking actions. I believe it is these turn-based battles which are particularly ripe for benchmarking Meta-RL agents. Before I get into that though, a few words on what makes a useful Meta-RL benchmark to begin with.
(Note: While this article focuses on examples from Japanese turn-based RPGs, many Western turn-based RPGs share the same relevant mechanics, so feel free to swap your preferred game into the examples discussed below. I personally grew up playing and enjoying many JRPGs, so those are the examples presented here.)
In order to make a worthwhile Meta-RL benchmark, there must be a distribution of similar environments, each with different underlying dynamics which require different policies to solve. Sometimes these differences are quite small, but other times they can result in significantly different strategies on the part of the agent. The simplest version of this setup used in research is often a distribution of multi-armed bandits, where each bandit has a different probability distribution over outcomes for each of its arms. In this case the agent must experiment in each environment by determining which arm of the bandit provides the best return on average, and then exploiting that knowledge. To do so, the agent must learn a policy at two levels of abstraction. At the highest level it must learn a policy for performing the experiment in each sampled environment, and then at the lower level it must learn a policy for exploiting that learned knowledge. It is this multi-level learning process which earns Meta-RL the “Meta” name.
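To make the bandit setup concrete, here is a minimal sketch in Python. The function names (sample_bandit, pull, explore_then_exploit) and the pull counts are my own illustrative assumptions, and the explore-then-exploit rule is hand-written rather than meta-learned. It only shows the two phases a trained Meta-RL agent would have to discover for itself.

```python
import random

def sample_bandit(rng, n_arms=3):
    """Sample one task: a list of per-arm success probabilities."""
    return [rng.random() for _ in range(n_arms)]

def pull(probs, arm, rng):
    """Pull one arm and return a 0/1 reward."""
    return 1 if rng.random() < probs[arm] else 0

def explore_then_exploit(probs, rng, pulls_per_arm=5, total_pulls=30):
    """Try every arm a few times, then commit to the best-looking one."""
    n_arms = len(probs)
    wins = [0] * n_arms
    total_reward = 0
    # Experiment phase: gather evidence about this particular bandit.
    for arm in range(n_arms):
        for _ in range(pulls_per_arm):
            reward = pull(probs, arm, rng)
            wins[arm] += reward
            total_reward += reward
    # Exploit phase: commit to the arm with the most observed wins.
    best_arm = max(range(n_arms), key=lambda a: wins[a])
    for _ in range(total_pulls - n_arms * pulls_per_arm):
        total_reward += pull(probs, best_arm, rng)
    return best_arm, total_reward

if __name__ == "__main__":
    rng = random.Random(0)
    for episode in range(3):
        task = sample_bandit(rng)  # a new bandit every episode
        best_arm, reward = explore_then_exploit(task, rng)
        print(episode, [round(p, 2) for p in task], best_arm, reward)
```

Each episode draws a fresh bandit, so no single arm is best across episodes. What carries over is only the procedure of experimenting first and exploiting afterwards.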
In a typical JRPG, the criterion described above is readily met by the variety of different possible enemies for the player to encounter in a given game. In fact, many JRPGs can have up to hundreds of unique enemies, many of which are only encountered once. When encountering a never-before-seen enemy, the player must perform hypothesis testing to determine the optimal strategy for defeating it. In a Meta-RL context, this involves probing the environment to discover the hidden dynamics. This can take the form of learning about various possible elemental weaknesses and strengths, determining whether the enemy has a high defense, or other contextual vulnerabilities. The player does this by acting within the environment. Physically attacking an enemy provides information about whether it has a high or low physical defense. Casting an ice spell on the enemy will allow the player to determine whether the enemy has a weakness or resistance to ice magic. Once such a weakness has been discovered, the optimal strategy can then be deployed to defeat the enemy.
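The probing described above can be sketched in a few lines of Python. Everything here is hypothetical: the Enemy class, the three-action set, the base damage of 10, and the multiplier values are assumptions chosen for illustration and are not taken from any particular game. The sketch also assumes the enemy starts with positive hit points.

```python
from dataclasses import dataclass, field

ACTIONS = ("attack", "fire", "ice")  # hypothetical action set

@dataclass
class Enemy:
    """Hypothetical enemy whose damage multipliers are hidden from the agent."""
    hp: int
    multipliers: dict = field(default_factory=dict)  # action -> multiplier

    def take(self, action, base_damage=10):
        """Apply an action and return the damage actually dealt."""
        dealt = int(base_damage * self.multipliers.get(action, 1.0))
        self.hp -= dealt
        return dealt

def probe_then_exploit(enemy, max_turns=100):
    """Try each action once, then repeat whichever dealt the most damage."""
    observed = {}
    turns = 0
    # Hypothesis testing: the only way to learn a multiplier is to act.
    for action in ACTIONS:
        if enemy.hp <= 0:
            break
        observed[action] = enemy.take(action)
        turns += 1
    best = max(observed, key=observed.get)
    # Exploitation: keep using the most effective action found so far.
    while enemy.hp > 0 and turns < max_turns:
        enemy.take(best)
        turns += 1
    return best, turns

if __name__ == "__main__":
    ice_weak = Enemy(hp=100, multipliers={"attack": 0.5, "fire": 0.0, "ice": 2.0})
    print(probe_then_exploit(ice_weak))
```

The agent never reads the multipliers directly. It only sees the damage numbers that come back from its own actions, which is the same information a human player gets.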
Of course, the enemy does not sit still. While the player attempts to learn and exploit its weaknesses, the enemy is attempting to defeat the player by taking actions of its own. The adversarial nature of these encounters induces a natural pressure for the player to learn and exploit the optimal strategy as quickly and efficiently as possible. Either poor hypothesis testing or sub-optimal strategy exploitation can lead to the player being defeated by the enemy. Once an optimal strategy has been discovered and the enemy defeated, though, the process repeats with a potentially novel enemy and a new set of hypotheses to test regarding its latent properties.
The above description maps quite nicely onto the bandit problem, where there are two hierarchical levels of policies to be learned. What sets JRPGs apart, however, is that there is a third hierarchical level at which an agent needs to meta-learn. It is not only the case that within a game different enemies will require different policies to defeat, but between games the rules governing those optimal strategies will differ. Take for example two games within the same series: Final Fantasy X and Final Fantasy XIV. In Final Fantasy X, elemental weaknesses work in a paired fashion. Ice magic damages fire-based enemies, and fire magic damages ice-based enemies. In Final Fantasy XIV however, elemental weaknesses work on a rock-paper-scissors basis, where fire beats ice, ice beats wind, and wind beats fire. Other games in the genre have even more varied elemental interactions, with the Pokemon games for example containing more than a dozen elemental types with complex relationships between them. As humans, we can quickly learn the rules of each game and apply them within that game. In doing so, we deploy a flexible policy at three levels of abstraction: the game, the battle within the game, and the turn within the battle.
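The game-level difference can be written down as data. In the sketch below, the two rule tables encode only the paired and rock-paper-scissors schemes as I described them above. The names PAIRED_RULES, CYCLIC_RULES, best_spell, and infer_rules are hypothetical, and the tables are a simplification rather than a faithful model of either game.

```python
# Each rule set maps an attacking element to the enemy element it is strong against.
PAIRED_RULES = {"fire": "ice", "ice": "fire"}
CYCLIC_RULES = {"fire": "ice", "ice": "wind", "wind": "fire"}

def best_spell(rules, enemy_element):
    """Return the spell that is strong against enemy_element, or None."""
    for spell, beaten in rules.items():
        if beaten == enemy_element:
            return spell
    return None

def infer_rules(observations, candidates):
    """Keep the candidate rule sets consistent with what was observed.

    observations: list of (spell, enemy_element, was_effective) tuples.
    candidates: dict mapping a rule-set name to its rule table.
    """
    consistent = []
    for name, rules in candidates.items():
        if all((rules.get(spell) == enemy) == effective
               for spell, enemy, effective in observations):
            consistent.append(name)
    return consistent

if __name__ == "__main__":
    candidates = {"paired": PAIRED_RULES, "cyclic": CYCLIC_RULES}
    # The same fire-based enemy calls for a different spell in each game.
    print(best_spell(PAIRED_RULES, "fire"), best_spell(CYCLIC_RULES, "fire"))
    # Ice failing against a fire enemy rules out the paired scheme.
    print(infer_rules([("ice", "fire", False)], candidates))
```

A single well-chosen probe can separate the two candidate games, which is the kind of game-level hypothesis testing that sits above the per-enemy level.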
In addition to online hypothesis testing, performing well in the set of all possible JRPGs involves a powerful and flexible memory system. Whatever an agent might learn about a single enemy might be useful if the agent encounters that enemy again within the game, or within a similar game. Furthermore, the properties of enemies within a game often follow set patterns. Enemies with ice resistance, for example, may often have visual cues in their design which might imply their status, such as a crystal-blue appearance. An ice wolf or an ice bird enemy thus might both share similar appearances. Enemies with high defensive stats might be visually represented as wearing heavy armor. Each game typically deploys an internally consistent visual language to convey this information to players. Importantly though, this visual language changes between games. As such, the hypothesis space when encountering a new enemy within a game is not uniform over all possible strategies. Learning to remember these regularities over time allows the agent to more quickly arrive at the optimal strategy within any battle.
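One simple way to picture such a memory is a table of counts keyed by game and visual cue, which turns past battles into a non-uniform prior over what to try first. The CueMemory class below is a hypothetical sketch, not a proposal for how a DeepRL agent would actually store this. The add-one smoothing and the assumption that crystal-blue enemies turned out to be weak to fire are both mine.

```python
from collections import Counter, defaultdict

class CueMemory:
    """Hypothetical memory from (game, visual cue) to observed weaknesses."""

    def __init__(self, actions):
        self.actions = tuple(actions)
        self.counts = defaultdict(Counter)  # (game, cue) -> Counter of weaknesses

    def record(self, game, cue, weakness):
        """Store the weakness discovered for an enemy showing this cue."""
        self.counts[(game, cue)][weakness] += 1

    def prior(self, game, cue):
        """Add-one smoothed probability that each action is the weakness."""
        seen = self.counts.get((game, cue), Counter())
        total = sum(seen.values()) + len(self.actions)
        return {a: (seen[a] + 1) / total for a in self.actions}

    def probe_order(self, game, cue):
        """Order actions from most to least promising under the prior."""
        p = self.prior(game, cue)
        return sorted(self.actions, key=lambda a: -p[a])

if __name__ == "__main__":
    memory = CueMemory(["attack", "fire", "ice"])
    memory.record("game_a", "crystal-blue", "fire")
    memory.record("game_a", "crystal-blue", "fire")
    print(memory.probe_order("game_a", "crystal-blue"))  # fire is tried first
    print(memory.probe_order("game_b", "crystal-blue"))  # no evidence in a new game
```

Keying the counts on the game as well as the cue reflects the point that the visual language is consistent within a game but not across games.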
I hope that all of the properties described above make it clear that JRPGs are a potentially fertile ground for DeepRL research, and Meta-RL research in particular. This is leaving aside the other kinds of interesting learning problems these games often pose, such as puzzle solving, inventory management, equipment upgrading, and multi-agent coordination when the player’s team consists of more than one protagonist. While Starcraft 2 and DOTA are significant challenges, there is much more that games have to offer the field as we strive to develop agents with more and more general intelligence.
I also recognize that the qualities described here with respect to JRPGs are not totally unique to the genre. Many games, for example, involve hypothesis testing whenever new mechanisms or enemies are introduced. What I believe sets the JRPG genre apart, however, is the extent to which these aspects are formalized, and largely removed from considerations of visuals, physics, or control. What is complex and cognitively interesting about these games is not that there is necessarily a large state or action space, as many game genres typically considered challenging for RL possess, but rather that each encounter with a new enemy poses a slightly different kind of challenge from all previous encounters, and thus requires hypothesis testing and the online learning of a new policy. Importantly, the agent must actively experiment with each new enemy or set of enemies to learn the optimal strategy. While I believe that it is this aspect which makes the genre unique, I would be very interested in hearing other opinions, as there are perhaps other genres which are equally ripe for testing Meta-RL agents.
For those convinced of the value of JRPGs to Meta-RL research, and looking for practical directions with respect to where to go next, I think that there are two possible paths to follow. One would be to assemble a suite of JRPGs and create a benchmark out of them. This poses a number of technical difficulties, and similar attempts to follow this path such as OpenAI’s Gym Retro have failed to attract the attention which the original Atari suite garnered. The other approach is to take the mechanical principles of JRPGs described here and to develop a novel benchmark suite from scratch.
Thankfully, what is compelling about these games is not complicated physics, visuals, or controls, but rather the need for performing hypothesis testing, and the learning and general application of rules over multiple scales of abstraction. In fact, the underlying rules governing these games are simple to write, and can be implemented easily in any programming language. Special attention would need to be paid, however, to ensure that such a benchmark stays true to the complexities of real games within the genre, and does not become overly tailored to the current capabilities of DeepRL agents. As an additional means of validating such a benchmark, classic JRPGs from the Super Nintendo or other console emulators could be used as a held-out set of test environments for an agent.
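To give a sense of how little code the core loop needs, here is a toy battle environment in Python. It is a sketch under heavy assumptions: the BattleEnv class, its reset/step interface, the three elements, and every number in it (hit points, damage values) are invented for illustration and do not follow the API of any existing library. A real benchmark would need far richer rules than this.

```python
import random

ELEMENTS = ("fire", "ice", "wind")  # hypothetical element set

class BattleEnv:
    """Hypothetical turn-based battle with a hidden elemental weakness."""

    def __init__(self, seed=0, enemy_hp=60, player_hp=40, enemy_attack=5):
        self.rng = random.Random(seed)
        self.enemy_hp0, self.player_hp0 = enemy_hp, player_hp
        self.enemy_attack = enemy_attack

    def reset(self):
        """Start a new battle against an enemy with a freshly sampled weakness."""
        self.weakness = self.rng.choice(ELEMENTS)  # hidden from the agent
        self.enemy_hp, self.player_hp = self.enemy_hp0, self.player_hp0
        return {"last_damage": 0, "player_hp": self.player_hp}

    def step(self, action):
        """Cast one spell, then let the enemy retaliate if it is still alive."""
        damage = 20 if action == self.weakness else 5
        self.enemy_hp -= damage
        if self.enemy_hp <= 0:  # enemy defeated: win
            return {"last_damage": damage, "player_hp": self.player_hp}, 1.0, True
        self.player_hp -= self.enemy_attack  # the enemy does not sit still
        done = self.player_hp <= 0
        reward = -1.0 if done else 0.0
        return {"last_damage": damage, "player_hp": self.player_hp}, reward, done

def run_probing_agent(env):
    """Probe elements until one hits hard, then keep casting it."""
    env.reset()
    best, turns = None, 0
    untried = list(ELEMENTS)
    while True:
        action = best if best is not None else untried.pop(0)
        obs, reward, done = env.step(action)
        turns += 1
        if obs["last_damage"] >= 20:  # the probe revealed the weakness
            best = action
        if done:
            return reward, turns

if __name__ == "__main__":
    env = BattleEnv(seed=0)
    for battle in range(3):
        print(battle, run_probing_agent(env))
```

With these particular numbers, an agent that probes and then exploits always wins within five turns, while one that repeats a single ineffective spell is defeated before the enemy falls. The enemy's retaliation is what puts a price on slow or careless experimentation.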
If anyone is interested in such a project and has the time to commit to it, feel free to reach out. I would be happy to collaborate on such an endeavor as an open-source project. Or, if you know of a similar project that already exists, please let me know! For the reasons discussed above, I believe it has the potential to provide a useful addition to the current field of Meta-RL benchmarks by filling a currently under-explored niche. In the meantime, if you are a researcher looking for a challenging benchmark that will push the current state of the art, DeepMind’s Alchemy is open-source and poses a number of interesting yet-to-be-solved problems.