Study reports a multidimensional distributional map of future reward in dopamine neurons
Author: 小柯机器人 (Xiaoke Robot)    Published: 2025/6/5 16:13:06

The group of Joseph J. Paton at the Champalimaud Centre for the Unknown in Portugal has reported a multidimensional distributional map of future reward in dopamine neurons. The paper was published in Nature on 4 June 2025.

Here, the researchers present time–magnitude RL (TMRL), a multidimensional variant of distributional RL that learns the joint distribution of future rewards over time and magnitude. The team also uncovered signatures of TMRL-like computations in the activity of optogenetically identified DANs in mice during behaviour. Specifically, they show that there is significant diversity across DANs in both temporal discounting and tuning for reward magnitude. These features allow a two-dimensional, probabilistic map of future rewards to be computed from the population response to a reward-predictive cue.
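
The following is a minimal, self-contained sketch of the population-code idea described above, not the authors' TMRL implementation: each simulated unit combines its own temporal discount factor with its own reward-magnitude tuning, and a simple least-squares decode recovers a two-dimensional (delay x magnitude) map of future reward from the population's cue responses. All numbers, the Gaussian magnitude tuning, and the linear decoding step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth joint distribution behind a reward-predictive cue:
# two possible delays (s) x two possible magnitudes (arbitrary units).
true_delays = np.array([1.0, 3.0])
true_mags = np.array([2.0, 6.0])
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])          # P(delay, magnitude), sums to 1

# A heterogeneous "DAN-like" population: each unit has its own temporal
# discount factor and its own preferred reward magnitude (both assumed).
n_units = 200
gammas = rng.uniform(0.5, 0.99, n_units)
mag_prefs = rng.uniform(1.0, 8.0, n_units)
mag_width = 1.5

def tuning(m, pref):
    """Gaussian magnitude tuning (an assumption, for illustration only)."""
    return np.exp(-0.5 * ((m - pref) / mag_width) ** 2)

# Each unit's cue response: a discounted, magnitude-weighted expectation
# over the joint distribution of delay and magnitude.
grid = [(d, m) for d in true_delays for m in true_mags]
basis = np.stack([(gammas ** d) * m * tuning(m, mag_prefs) for d, m in grid], axis=1)
responses = basis @ p_joint.ravel()

# Because discounting and magnitude tuning differ across units, the joint
# (delay, magnitude) probabilities can be decoded back from the population
# response; here with a simple least-squares decode (an assumption).
decoded, *_ = np.linalg.lstsq(basis, responses, rcond=None)
print("true joint:\n", p_joint)
print("decoded joint:\n", np.round(decoded.reshape(p_joint.shape), 3))
```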

Furthermore, reward-time predictions derived from this neural code correlate with anticipatory behaviour, suggesting that similar information is used to guide decisions about when to act. Finally, by simulating behaviour in a foraging environment, the team highlights the benefits of a joint probability distribution of reward over time and magnitude in the face of dynamic reward landscapes and internal states. These findings show that rich probabilistic reward information is learnt and communicated to DANs, and suggest a simple, local-in-time extension of TD algorithms that explains how such information might be acquired and computed.
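
To make the claimed benefit concrete, here is a toy illustration, not the paper's foraging simulation: given a joint (delay, magnitude) map such as the one decoded in the sketch above, an agent whose internal state imposes a time budget can re-weight the same learnt map for different deadlines, which a single scalar value (the discounted mean) cannot support. The numbers and the deadline rule are assumptions.

```python
import numpy as np

delays = np.array([1.0, 3.0])            # possible reward delays (s)
mags = np.array([2.0, 6.0])              # possible reward magnitudes
p_joint = np.array([[0.4, 0.1],          # same toy joint map as above
                    [0.1, 0.4]])

def usable_reward(p, deadline):
    """Expected reward obtainable before the deadline, from the joint map."""
    mask = delays[:, None] <= deadline   # only rewards arriving in time count
    return float((p * mask * mags[None, :]).sum())

for deadline in (0.5, 1.5, 4.0):         # different internal states / time budgets
    print(f"deadline {deadline:>3} s -> expected usable reward "
          f"{usable_reward(p_joint, deadline):.2f}")
```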

By way of background, midbrain dopamine neurons (DANs) signal reward-prediction errors that teach recipient circuits about expected rewards. However, DANs have been thought to provide a substrate for temporal difference (TD) reinforcement learning (RL), an algorithm that learns only the mean of temporally discounted expected future rewards, discarding useful information about the experienced distributions of reward amounts and delays.
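
For contrast with TMRL, a minimal TD(0) sketch of this standard view, with illustrative task and parameter values not taken from the paper, shows how a single discount factor and a single value per state collapse a stochastic reward (2 or 6 units) into one discounted mean:

```python
import numpy as np

rng = np.random.default_rng(1)

n_states = 5        # a simple chain task: state 0 -> 1 -> ... -> 4 (terminal)
gamma = 0.9         # a single, shared discount factor
alpha = 0.1         # learning rate
V = np.zeros(n_states)

for episode in range(2000):
    for s in range(n_states - 1):
        s_next = s + 1
        terminal = s_next == n_states - 1
        # Stochastic reward of 2 or 6 delivered only at the end (assumed task).
        r = rng.choice([2.0, 6.0]) if terminal else 0.0
        target = r + (0.0 if terminal else gamma * V[s_next])
        V[s] += alpha * (target - V[s])     # reward-prediction-error update

# Each V[s] converges to a single scalar, the discounted mean of future
# reward, with the 2-vs-6 reward distribution collapsed away.
print(np.round(V, 2))
```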

Appendix: original English text

Title: A multidimensional distributional map of future reward in dopamine neurons

Author: Sousa, Margarida, Bujalski, Pawel, Cruz, Bruno F., Louie, Kenway, McNamee, Daniel C., Paton, Joseph J.

Issue&Volume: 2025-06-04

Abstract: Midbrain dopamine neurons (DANs) signal reward-prediction errors that teach recipient circuits about expected rewards1. However, DANs are thought to provide a substrate for temporal difference (TD) reinforcement learning (RL), an algorithm that learns the mean of temporally discounted expected future rewards, discarding useful information about experienced distributions of reward amounts and delays2. Here we present time–magnitude RL (TMRL), a multidimensional variant of distributional RL that learns the joint distribution of future rewards over time and magnitude. We also uncover signatures of TMRL-like computations in the activity of optogenetically identified DANs in mice during behaviour. Specifically, we show that there is significant diversity in both temporal discounting and tuning for the reward magnitude across DANs. These features allow the computation of a two-dimensional, probabilistic map of future rewards from just 450ms of the DAN population response to a reward-predictive cue. Furthermore, reward-time predictions derived from this code correlate with anticipatory behaviour, suggesting that similar information is used to guide decisions about when to act. Finally, by simulating behaviour in a foraging environment, we highlight the benefits of a joint probability distribution of reward over time and magnitude in the face of dynamic reward landscapes and internal states. These findings show that rich probabilistic reward information is learnt and communicated to DANs, and suggest a simple, local-in-time extension of TD algorithms that explains how such information might be acquired and computed.

DOI: 10.1038/s41586-025-09089-6

Source: https://www.nature.com/articles/s41586-025-09089-6

Journal information

Nature: founded in 1869; published by the Springer Nature group. Latest IF: 69.504
Official website: http://www.nature.com/
Submission link: http://www.nature.com/authors/submit_manuscript.html