Written by
Rober Boshra
July 6, 2023

Every day, we face the challenge of balancing our wants and needs, prioritizing essential requirements while efficiently utilizing our resources. We become experts at prioritizing the needs that matter most in the moment (e.g., a meal when we are very hungry) and at forming flexible plans (i.e., actions that get us what we need) that are integral to our survival. However, it remains to be fully understood how we balance our needs to produce this dynamic interplay of motivators and actions. Do we have independent cognitive modules, each aiming to optimize one of our wants? Or do we have an all-encompassing self that takes all our needs into account and generates optimized actions? These are fundamental questions for a deeper understanding of decision making and of how we maneuver in a dynamic, ever-changing environment, with extensive implications for building practical, scalable reinforcement learning agents. In the pursuit of understanding how agents may navigate this intricate dance, Princeton Neuroscience Institute researchers Zack Dulberg, Rachit Dubey, Isabel M. Berwian, and Jonathan Cohen initiated a project that explored the efficiency and scalability of utilizing modular “selves” compared to a single “self” model. This research not only targets computational goals crucial for developing smarter and more efficient machine learning agents but also paves the way to understanding psychological conflicts inherent in the human psyche.

An agent traverses the grid environment using discrete actions to acquire resources (1-4) and fill internally monitored needs.

The researchers were motivated by the conflicting goals we encounter in our daily lives. To investigate this, they trained agents using a modern framework called deep reinforcement learning. Agents were trained to maintain homeostasis across a number of needs (e.g., in an over-simplified model of a biological entity, hunger, thirst, temperature, and rest). Over time, these needs (or stats) decrease and require replenishment (see the illustration above). To replenish a stat, the agent moved through a 2D grid environment to the location holding the corresponding resource (e.g., food for hunger). Agents were trained explicitly to achieve this kind of physiological stability, using a framework known as homeostatically-regulated reinforcement learning. The authors used deep Q-learning, an approach that combines deep learning with Q-learning, to train a reinforcement learning agent to observe its surroundings, assign a value to each of its actions (here, one of four movements along the grid), and select the action that maximizes reward. To test their primary question of whether a modular agent provides a benefit over a monolithic one, they trained two types of agents. The monolithic agent was trained to understand the environment in a unified way; it optimized its multiple objectives in unison, following a single “reward signal” that accounted for all resources in tandem. In contrast, the modular agent had as many “brains,” or modules, as there were resources in the environment, each vying to satisfy its own need. Although all modules learned incrementally and suggested actions that maximized their own need, only one action could be taken at each step because of the agent's single “body.”
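
To make the contrast concrete, here is a minimal sketch of how the two designs might be wired up. It is not the authors' code: the observation encoding, network sizes, the specific homeostatic reward, and the simple sum-of-Q-values arbitration rule for the modular agent are all illustrative assumptions.

```python
# A minimal sketch contrasting the two agent designs -- not the authors' code.
# Assumptions (illustrative, not from the paper): a flattened observation vector,
# small MLP Q-networks, a homeostatic reward based on squared deviation from a
# set point, and a sum-of-Q-values arbitration rule for the modular agent.
import torch
import torch.nn as nn

N_ACTIONS = 4   # up, down, left, right on the grid
N_STATS = 4     # e.g., hunger, thirst, temperature, rest
OBS_DIM = 16    # assumed size of the flattened observation (grid view + internal stats)

def q_network() -> nn.Module:
    """A small MLP mapping an observation to one Q-value per action."""
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

def homeostatic_reward(stats_before: torch.Tensor, stats_after: torch.Tensor,
                       set_point: float = 1.0) -> torch.Tensor:
    """Reward = reduction in total squared deviation of the stats from their set point."""
    drive = lambda s: ((s - set_point) ** 2).sum()
    return drive(stats_before) - drive(stats_after)

# Monolithic agent: a single network trained on the single aggregated reward above.
monolithic_q = q_network()

def monolithic_action(obs: torch.Tensor) -> int:
    return monolithic_q(obs).argmax().item()

# Modular agent: one network per need, each trained only on its own stat's reward.
# The shared "body" still takes a single action, chosen here by summing the modules'
# Q-values (one plausible arbitration rule; the paper's exact scheme may differ).
modular_qs = [q_network() for _ in range(N_STATS)]

def modular_action(obs: torch.Tensor) -> int:
    summed_q = torch.stack([q(obs) for q in modular_qs]).sum(dim=0)
    return summed_q.argmax().item()
```

During training, each module would update its network with a standard deep Q-learning target computed from its own need's reward, while the monolithic network would use the aggregated reward; those update loops are omitted here for brevity.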

Zack and the team reported intriguing results. Not only did the modular agent learn faster than its monolithic counterpart, it also came with perks beyond training speed. In reinforcement learning, explicit exploration is often required to prevent the model from solely exploiting its flawed representations during the early stages of training. The modular agent’s performance was largely invariant to this explicit imposition of exploration. This advantage was attributed, at least in part, to what the authors referred to as "free exploration": because the agent’s single "body" was commandeered at each step by whichever module won out, the non-winning modules were taken along for the ride and got to explore the environment. In other words, training formed a “... virtuous cycle with exploration” where exploration facilitated learning representations that further enhanced exploration and improved subsequent actions, explained lead author Zack Dulberg. Further, the modular agent was inherently robust to the curse of dimensionality. Since each module focused only on its own resource, the complexity arising from combinations of resources in higher dimensions was avoided, while the modules together held a more complete representation of the environment (see illustration below). In contrast, the monolithic agent often took longer to achieve a desirable homeostatic state, tended to overshoot in the early stages of training, and, lacking "free exploration," was highly influenced by the explicitly defined exploration criteria (see illustration below).

A monolithic representation of rewards
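
A rough back-of-the-envelope calculation (ours, not the paper's) shows why modularity sidesteps this combinatorial blow-up: if each of n internal needs is discretized into k levels, a monolithic value function has to cover k^n joint configurations, whereas n modules each only need to cover k.

```python
# Illustrative arithmetic (not from the paper): joint configurations a monolithic
# representation must cover versus what the modules cover independently, assuming
# each need is discretized into k = 10 levels.
k = 10
for n in (2, 4, 8):
    print(f"needs={n}: monolithic ~ {k**n:,} joint configurations, modular ~ {n * k}")
```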

The authors next tackled whether this inherent advantage of modular agents was restricted to static resources. What happens when the goals move over time and the agent must dynamically update its understanding of the environment to reach them? The findings again confirmed modularity to be a winning feature: the modular agent learned faster than its monolithic counterpart in these changing environments and scaled better as the number of resources increased.

The modular agent (orange) trained faster than its monolithic (blue) counterpart.

These results open up numerous possibilities for exploring the potential of modular agents. In the current implementation, modules were matched one-to-one with the resources the agent was trained to maintain. How does this scale to a large number of resources? How beneficial would a strategy be that flexibly assigns modules, or dynamically shares resources within a single module, based on demand? These and many other questions arise, targeting both an abstract, computational understanding of this intrinsic benefit of modularity and how best to utilize it to build better and smarter agents. Similarly, these modeling studies may offer immense insight into questions in psychology and cognitive neuroscience. In particular, Zack noted his great excitement for further studies of modular systems: “How do modules develop? What features of the developmental environment put pressure on different solutions? And do the benefits of modularity explain why internal psychological conflict seems so central to the human condition?” Answering these questions can provide insight into how agents could adopt different strategies to learn efficiently in specific environments. Tapping into developmental trajectories may also shed light on analogues in psychological development, providing a sandbox for probing translational questions in the future, and possibly suggesting a computational basis for informal theories of conflict that have persisted since the time of Sigmund Freud.

The article Having multiple selves helps learning agents explore and adapt in complex changing worlds has been published in the Proceedings of the National Academy of Sciences (https://doi.org/10.1073/pnas.2221180120). Congratulations to the authors for the impactful milestone and for opening up exciting avenues for further research.