A research team led by Naoshige Uchida at Harvard University has reported a new advance. Their latest study proposes multi-timescale reinforcement learning in the brain. The work was published on 4 June 2025 in the international journal Nature.
In this work, the team explored the presence of multiple timescales in biological reinforcement learning. They first showed that reinforcement learning agents that learn at a multitude of timescales possess distinct computational advantages. Next, they reported that dopaminergic neurons in mice performing two behavioural tasks encode reward prediction errors with a diversity of discount time constants. Their model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and the slower-timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons was correlated across the two tasks, suggesting that it is a cell-specific property. Together, these results provide a new paradigm for understanding functional heterogeneity in dopaminergic neurons, offer a mechanistic basis for the empirical observation that humans and animals discount future rewards non-exponentially in many situations, and open new avenues for designing more efficient reinforcement learning algorithms.
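To make the multi-timescale idea concrete, here is a minimal sketch, not the authors' implementation: a population of value channels, each with its own discount factor, computes its own TD-style reward prediction error, and averaging exponential discounts across channels yields a non-exponential, roughly hyperbolic effective discount, in line with the mechanistic claim above. The variable names, the range of discount factors, and the learning rate are illustrative assumptions.

```python
import numpy as np

# Hypothetical population of value channels, one per discount factor (illustrative values).
gammas = np.linspace(0.55, 0.99, 20)

def multi_timescale_step(values, reward, next_values, gammas, alpha=0.05):
    """Each channel i computes its own TD error: delta_i = r + gamma_i * V_i(s') - V_i(s)."""
    rpes = reward + gammas * next_values - values   # channel-specific reward prediction errors
    values = values + alpha * rpes                  # each channel updates its own value estimate
    return values, rpes

# Averaging exponential discounts across channels produces a non-exponential
# (hyperbolic-like) effective discount curve as a function of delay.
delays = np.arange(30)
effective_discount = np.array([np.mean(gammas ** d) for d in delays])
```

One hedged intuition for the computational benefit: because each channel decays at a different rate, the vector of channel values at a single moment carries information about when future rewards are expected, not just their total discounted amount; the paper quantifies such benefits in its own analyses.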
By way of background, to thrive in complex environments, animals and artificial agents must learn to act adaptively so as to maximize fitness and reward. Such adaptive behaviour can be learned through reinforcement learning, a class of algorithms that has been successful both at training artificial agents and at characterizing the firing of dopaminergic neurons in the midbrain. In classical reinforcement learning, an agent discounts future rewards exponentially according to a single timescale, known as the discount factor.
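For reference, the classical single-timescale case can be written as a textbook TD(0) update in which one global discount factor gamma discounts future rewards exponentially. This is a generic sketch under that assumption, not code from the study; function names and parameter values are illustrative.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Exponentially discounted return: sum_k gamma**k * r_k, with a single timescale gamma."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def td0_update(values, state, next_state, reward, gamma=0.9, alpha=0.1):
    """One TD(0) step; the reward prediction error is the quantity classically ascribed to dopamine."""
    rpe = reward + gamma * values[next_state] - values[state]  # reward prediction error
    values[state] += alpha * rpe
    return rpe

# Usage: a three-state chain in which only the final transition is rewarded.
values = np.zeros(3)
for _ in range(100):
    td0_update(values, state=0, next_state=1, reward=0.0)
    td0_update(values, state=1, next_state=2, reward=1.0)
```

After learning, values[1] approaches 1 and values[0] approaches gamma, i.e. states further from reward are worth exponentially less under the single discount factor.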
Appendix: original English abstract
Title: Multi-timescale reinforcement learning in the brain
Author: Masset, Paul, Tano, Pablo, Kim, HyungGoo R., Malik, Athar N., Pouget, Alexandre, Uchida, Naoshige
Issue&Volume: 2025-06-04
Abstract: To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behaviour can be learned through reinforcement learning1, a class of algorithms that has been successful at training artificial agents2,3,4,5 and at characterizing the firing of dopaminergic neurons in the midbrain6,7,8. In classical reinforcement learning, agents discount future rewards exponentially according to a single timescale, known as the discount factor. Here we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopaminergic neurons in mice performing two behavioural tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm for understanding functional heterogeneity in dopaminergic neurons and a mechanistic basis for the empirical observation that humans and animals use non-exponential discounts in many situations9,10,11,12, and open new avenues for the design of more-efficient reinforcement learning algorithms.
DOI: 10.1038/s41586-025-08929-9
Source: https://www.nature.com/articles/s41586-025-08929-9
Nature: founded in 1869; published by Springer Nature. Latest impact factor: 69.504
Official website: http://www.nature.com/
Submission link: http://www.nature.com/authors/submit_manuscript.html