Wang Zhechao,Fu Qiming,Chen Jianping,et al. Review of Research on Reinforcement Learning in Few-Shot Scenes[J]. Journal of Nanjing Normal University(Engineering and Technology),2022,(01):86-92.[doi:10.3969/j.issn.1672-1292.2022.01.013]





Review of Research on Reinforcement Learning in Few-Shot Scenes
Wang Zhechao(1,2,3),Fu Qiming(1,2,3),Chen Jianping(2,3),Hu Fuyuan(1,2,3),Lu You(1,2,3),Wu Hongjie(1,2,3)
(1.School of Electronic and Information Engineering,Suzhou University of Science and Technology,Suzhou 215009,China)(2.Jiangsu Provincial Key Laboratory of Building Intelligence and Energy Saving,Suzhou University of Science and Technology,Suzhou 215009,China)(3.Suzhou Key Laboratory of Mobile Networking and Applied Technologies,Suzhou University of Science and Technology,Suzhou 215009,China)
reinforcement learning; few-shot learning; meta-learning; transfer learning; lifelong learning; knowledge generalization
According to the background of the few-shot problem,this paper divides few-shot scenes into two types:the first pursues more specialized performance,while the second pursues more general performance. During knowledge generalization,different scenes show a clear preference for the kind of knowledge carrier they require. Based on this observation,few-shot learning(FSL)methods are classified,from the perspective of the knowledge carrier,into methods that use procedural knowledge and methods that use declarative knowledge,and few-shot reinforcement learning(FS-RL)algorithms are then discussed under this classification. Finally,possible directions for future development are proposed from both theoretical and application perspectives,in the hope of providing a reference for subsequent research.


[1]JI S S. Data clustering based on neural network tree and artificial bee colony optimization[J]. Journal of Nanjing Normal University(Natural Science Edition),2021,44(1):119-127.
[2]LI F F,FERGUS R,PERONA P. A Bayesian approach to unsupervised one-shot learning of object categories[C]//Proceedings of the 9th IEEE International Conference on Computer Vision. Nice,France:IEEE,2003:1134-1141.
[3]SUTTON R S,BARTO A G. Reinforcement learning:an introduction[M]. 2nd ed. Cambridge,MA:MIT Press,2018.
[4]MITCHELL T M. Machine learning[M]. New York:McGraw-Hill,1997.
[5]TOBIN J,FONG R,RAY A,et al. Domain randomization for transferring deep neural networks from simulation to the real world[J]. arXiv Preprint arXiv:1703.06907,2017.
[6]HESTER T,VECERIK M,PIETQUIN O,et al. Deep Q-learning from demonstrations[C]//The 32nd AAAI Conference on Artificial Intelligence. New Orleans,USA,2018:3223-3230.
[7]ANDERSON J R. Cognitive psychology and its implications[M]. 3rd ed. New York:Freeman,1990.
[8]WANG H,GAO Y,CHEN X G. Transfer of reinforcement learning:methods and progress[J]. Acta Electronica Sinica,2008,36(Suppl 1):39-43.
[9]KIM B,FARAHMAND A,PINEAU J,et al. Approximate policy iteration with demonstration data[C]//The 1st Multi-disciplinary Conference on Reinforcement Learning and Decision Making. Princeton,USA,2013:168-172.
[10]BERTSEKAS D P. Approximate policy iteration:a survey and some new methods[J]. Journal of Control Theory and Applications,2011,9(3):310-335.
[11]PIOT B,GEIST M,PIETQUIN O. Boosted bellman residual minimization handling expert demonstrations[C]//The 25th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Nancy,France,2014:549-564.
[12]CHEMALI J,LAZARIC A. Direct policy iteration with demonstrations[C]//The 24th International Joint Conference on Artificial Intelligence. Buenos Aires,Argentina,2015:3380-3386.
[13]LAZARIC A,RESTELLI M,BONARINI A. Transfer of samples in batch reinforcement learning[C]//The 25th International Conference on Machine Learning. Helsinki,Finland,2008:544-551.
[14]CORTES C,MOHRI M,RILEY M,et al. Sample selection bias correction theory[C]//The 19th International Conference on Algorithmic Learning Theory. Budapest,Hungary,2008:38-53.
[15]LAROCHE R,BARLIER M. Transfer reinforcement learning with shared dynamics[C]//The 31st AAAI Conference on Artificial Intelligence. San Francisco,USA,2017:2147-2153.
[16]TIRINZONI A,SESSA A,PIROTTA M,et al. Importance weighted transfer of samples in reinforcement learning[C]//The 35th International Conference on Machine Learning. Stockholm,Sweden,2018:4943-4952.
[17]ERNST D,GEURTS P,WEHENKEL L. Tree-based batch mode reinforcement learning[J]. Journal of Machine Learning Research,2005,6(4):503-556.
[18]NG A Y,HARADA D,RUSSELL S J. Policy invariance under reward transformations:theory and application to reward shaping[C]//The 16th International Conference on Machine Learning. Bled,Slovenia,1999:278-287.
[19]WIEWIORA E,COTTRELL G W,ELKAN C. Principled methods for advising reinforcement learning agents[C]//The 20th International Conference on Machine Learning. Washington DC,USA,2003:792-799.
[20]DEVLIN S,KUDENKO D. Dynamic potential-based reward shaping[C]//The 11th International Conference on Autonomous Agents and Multiagent Systems. Valencia,Spain,2012:433-440.
[21]HARUTYUNYAN A,DEVLIN S,VRANCX P,et al. Expressing arbitrary reward functions as potential-based advice[C]//The 29th AAAI Conference on Artificial Intelligence. Austin,USA,2015:2652-2658.
[22]FINN C,ABBEEL P,LEVINE S. Model-agnostic meta-learning for fast adaptation of deep networks[C]//The 34th International Conference on Machine Learning. Sydney,Australia,2017:1126-1135.
[23]DELEU T,BENGIO Y. The effects of negative adaptation in Model-Agnostic Meta-Learning[J]. arXiv Preprint arXiv:1812.02159,2018.
[24]RUSU A A,COLMENAREJO S G,GULCEHRE C,et al. Policy distillation[C]//The 4th International Conference on Learning Representations. San Juan,Puerto Rico,2016.
[25]ABEL D. A theory of state abstraction for reinforcement learning[C]//The 31st Innovative Applications of Artificial Intelligence Conference. Honolulu,USA,2019:9876-9877.
[26]ABEL D,HERSHKOWITZ D E,LITTMAN M L. Near optimal behavior via approximate state abstraction[C]//International Conference on Machine Learning. New York,USA,2016:2915-2923.
[27]VALIANT L G. A theory of the learnable[J]. Communications of the ACM,1984,27(11):1134-1142.
[28]YAO H,ZHANG C,WEI Y,et al. Graph few-shot learning via knowledge transfer[C]//The 34th AAAI Conference on Artificial Intelligence. New York,USA,2020:6656-6663.
[29]ZHANG C,YAO H,HUANG C,et al. Few-shot knowledge graph completion[C]//The 34th AAAI Conference on Artificial Intelligence. New York,USA,2020:3041-3048.
[30]PARISOTTO E,BA J L,SALAKHUTDINOV R. Actor-mimic:deep multitask and transfer reinforcement learning[C]//The 4th International Conference on Learning Representations. San Juan,Puerto Rico,2016:156-171.
[31]MEHTA B,DELEU T,RAPARTHY S C,et al. Curriculum in gradient-based meta-reinforcement learning[J]. arXiv Preprint arXiv:2002.07956,2020.
[32]BENGIO Y,LOURADOUR J,COLLOBERT R,et al. Curriculum learning[C]//The 26th Annual International Conference on Machine Learning. New York,USA,2009:41-48.
[33]HESTER T,STONE P. Texplore:real-time sample-efficient reinforcement learning for robots[J]. Machine Learning,2013,90(3):385-429.
[34]SHI W,FENG Y H,CHENG G Q,et al. Research on multi-aircraft cooperative air combat method based on deep reinforcement learning[J]. Acta Automatica Sinica,2021,47(7):1610-1623.
[35]MENG L,SHEN N,QI Y Q,et al. Three-dimensional game control algorithm based on reinforcement learning[J]. Journal of Northeastern University(Natural Science),2021,42(4):478-482,493.




Corresponding author:Fu Qiming,Ph.D.,associate professor. Research interests:reinforcement learning,deep learning,intelligent information processing. E-mail:fqm_1@126.com
Last Update: 2022-03-15