
安全约束下合作型多智能体TD3算法 (Cooperative multi-agent TD3 algorithm under security constraints)

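Cite this article:	Hao Yuzhe (郝禹哲), Wang Zhenlei (王振雷). Cooperative multi-agent TD3 algorithm under security constraints[J]. Application Research of Computers (计算机应用研究), 2023, 40(6): 1692-1696+1701.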

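Authors:	Hao Yuzhe (郝禹哲), Wang Zhenlei (王振雷)

Affiliations:	East China University of Science and Technology; East China University of Science and Technology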

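Funding:	National Key Research and Development Program of China (2018YFB1701103); Major Program of the National Natural Science Foundation of China (61890930-3); General Program of the National Natural Science Foundation of China (61873093, 61873092)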

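Abstract:	In a cooperative Markov game, each agent must not only pursue the common goal but also ensure that the joint action satisfies the prescribed constraints. To this end, the paper proposes MACTD3 (multi-agent constrained twin delayed deep deterministic policy gradient), a cooperative multi-agent TD3 algorithm under security constraints. First, an attention mechanism coordinates the actions taken by the individual agents with the constraints of the decision process. Next, Lagrange multipliers are used to construct a modified cost function. Then, to guarantee convergence and to ensure that every agent satisfies the preset constraints, a learning strategy with separate time scales is designed: gradient descent on the Actor-Critic networks runs on the short time scale, while the Lagrange parameters are iterated on the long time scale. Finally, experiments are carried out in heterogeneous and homogeneous cooperative multi-agent environments. The results show that, compared with other algorithms, the proposed MACTD3 consistently attains the lowest penalty cost. Scalability experiments over the number of agents show that MACTD3 still satisfies the constraints when the number of agents changes, which demonstrates the effectiveness and scalability of the algorithm.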
Keywords:	safe reinforcement learning; multi-agent; Lagrange multiplier method
Received:	2022/8/19
Revised:	2023/5/16
Cooperative multi-agent TD3 algorithm under security constraints

Hao Yuzhe and Wang Zhenlei. Cooperative multi-agent TD3 algorithm under security constraints[J]. Application Research of Computers, 2023, 40(6): 1692-1696+1701.

Authors:	Hao Yuzhe and Wang Zhenlei

Affiliation:	East China University of Science and Technology

Abstract:	In a cooperative Markov game, each agent must not only achieve the common goal but also ensure that the joint action satisfies the prescribed constraints. This paper therefore proposes a cooperative multi-agent TD3 algorithm under security constraints, MACTD3 (multi-agent constrained twin delayed deep deterministic policy gradient). First, an attention mechanism coordinates the actions of the individual agents with the constraints of the decision process. Next, a modified cost function is constructed using Lagrange multipliers. Then, to guarantee convergence and to ensure that each agent satisfies the predefined constraints, a learning strategy with separate time scales is designed: the Actor-Critic networks are updated by gradient descent on the short time scale, while the Lagrange parameters are iterated on the long time scale. Finally, experiments in heterogeneous and homogeneous cooperative multi-agent environments show that MACTD3 consistently obtains the lowest penalty cost compared with other algorithms. Scalability experiments over the number of agents show that MACTD3 still satisfies the constraints with different numbers of agents, demonstrating the effectiveness and scalability of the algorithm.

Keywords:	safe reinforcement learning; multi-agent; Lagrangian multipliers
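
The two-time-scale Lagrangian update described in the abstract can be illustrated with a minimal sketch. This is not the MACTD3 implementation: the function name `two_timescale_primal_dual`, the step sizes, and the scalar toy problem (minimise (x - 3)^2 subject to x - 1 <= 0) are all assumptions chosen for illustration. The scalar `x` stands in for the Actor-Critic parameters and `lam` for the Lagrange multiplier.

```python
def two_timescale_primal_dual(outer_steps=500, inner_steps=10,
                              fast_lr=0.1, slow_lr=0.05):
    """Toy primal-dual loop: minimise (x - 3)^2 subject to x - 1 <= 0.

    Hypothetical illustration only; x plays the role of the network
    parameters and lam the role of the Lagrange multiplier.
    """
    x = 0.0      # fast variable (stand-in for Actor-Critic parameters)
    lam = 0.0    # slow variable (Lagrange multiplier), kept non-negative

    for _ in range(outer_steps):
        # Short time scale: several gradient-descent steps on the
        # Lagrangian L(x, lam) = (x - 3)^2 + lam * (x - 1), lam frozen.
        for _ in range(inner_steps):
            grad_x = 2.0 * (x - 3.0) + lam
            x -= fast_lr * grad_x

        # Long time scale: one projected ascent step on the multiplier,
        # driven by the current constraint violation x - 1.
        violation = x - 1.0
        lam = max(0.0, lam + slow_lr * violation)

    return x, lam


if __name__ == "__main__":
    x, lam = two_timescale_primal_dual()
    print(f"x = {x:.4f}, lambda = {lam:.4f}")
```

For this toy problem the KKT conditions give x = 1 and lambda = 4, and the loop should settle near those values. Updating the multiplier less often and with a smaller step than the primal variable is what separates the two time scales; in the paper's setting the inner step would presumably be replaced by TD3 actor and critic updates.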
