Wu Qingyuan, Tan Xiaoyang. Alternated Deep Q Network Based on Upper Confidence Bound[J]. Journal of Nanjing Normal University(Engineering and Technology), 2022, (01): 24-29. [doi:10.3969/j.issn.1672-1292.2022.01.004]

Alternated Deep Q Network Based on Upper Confidence Bound

南京师范大学学报(工程技术版)[ISSN:1006-6977/CN:61-1281/TN]

Volume:
Issue:
2022, No. 01
Pages:
24-29
Section:
Machine Learning
Publication Date:
2022-03-15

Article Info

Title:
Alternated Deep Q Network Based on Upper Confidence Bound
Article ID:
1672-1292(2022)01-0024-06
Author(s):
Wu Qingyuan1, Tan Xiaoyang1,2
(1.College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
(2.MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
Keywords:
reinforcement learning; deep reinforcement learning; deep Q-network; upper confidence bound
CLC Number:
TP18
DOI:
10.3969/j.issn.1672-1292.2022.01.004
Document Code:
A
Abstract:
In deep reinforcement learning (DRL), the agent learns by interacting with the environment, so it must balance exploitation against exploration. Improving the sample efficiency and the exploration ability of DRL algorithms is therefore an important research direction. Building on existing work, this paper applies multiple deep Q-networks (DQNs) with independent random initializations and lets them interact with the environment alternately, exploiting the exploration behavior induced by random initialization. Based on the upper confidence bound (UCB) method, a policy for alternately selecting among the DQNs is constructed; combining this scheduling policy with the randomly initialized networks yields the Alternated DQN (ADQN) algorithm. Experimental results on several standard reinforcement learning environments show that ADQN achieves higher sample efficiency and learning efficiency than the baseline algorithms.
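The selection mechanism summarized in the abstract, a UCB rule deciding which of several independently initialized DQNs interacts with the environment next, can be sketched roughly as follows. This is a minimal illustration: the function name, the per-network statistics tracked, and the exploration coefficient `c` are assumptions for exposition, not the paper's exact formulation.

```python
import math

def ucb_select(counts, returns, c=2.0):
    """UCB1-style choice among K candidate Q-networks.

    counts[i]  -- number of episodes network i has been chosen for
    returns[i] -- sum of episode returns obtained with network i
    c          -- exploration coefficient (an illustrative default)
    """
    # First, pick any network that has never interacted with the environment.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)

    # Score = empirical mean return plus a confidence bonus that shrinks
    # as a network accumulates more interaction episodes.
    def score(i):
        return returns[i] / counts[i] + c * math.sqrt(math.log(total) / counts[i])

    return max(range(len(counts)), key=score)

# Example: two networks tried equally often; the one with the higher
# mean return wins, since their confidence bonuses are identical.
best = ucb_select([10, 10], [5.0, 9.0])  # selects index 1
```

In a full training loop each episode's return would update `counts` and `returns` for the selected network, so under-used networks keep a large bonus and are periodically revisited, which is the exploration effect the alternation scheme relies on.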



Memo:
Received: 2021-08-31.
Funding: Science and Technology Innovation 2030 Major Project (2021ZD0113203); National Natural Science Foundation of China (61976115).
Corresponding author: Tan Xiaoyang, PhD, professor; research interest: reinforcement learning. E-mail: x.tan@nuaa.edu.cn
Last Update: 2022-03-15