Similar articles
20 similar articles found (search time: 109 ms)
1.
Reinforcement learning is a research hotspot in machine learning: it studies how an agent interacts with its environment, makes sequential decisions, optimizes its policy, and maximizes cumulative return. With great research value and application potential, it is a key step toward artificial general intelligence. This paper surveys progress and trends in reinforcement learning algorithms and applications. It first introduces the fundamentals of reinforcement learning, including Markov decision processes, value functions, and the exploration-exploitation problem. It then reviews the classical algorithms, including value-function-based methods, policy-search methods, and methods combining value functions with policy search, and surveys frontier research, mainly multi-agent reinforcement learning and meta-reinforcement learning. Finally, it reviews successful applications in game playing, robot control, urban traffic, and business, and closes with a summary and outlook.
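The fundamentals named above (Markov decision processes, value functions, exploration versus exploitation) come together in the simplest value-based method. Below is a minimal sketch of tabular Q-learning with an epsilon-greedy policy; the Gym-style env interface (classic 4-tuple step()) and all hyperparameter values are illustrative assumptions, not taken from the surveyed work.

```python
# Minimal sketch: tabular Q-learning with epsilon-greedy exploration.
# The env object is assumed to follow the classic Gym reset()/step() API
# with discrete, hashable states; hyperparameters are illustrative.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)  # q[(state, action)] -> estimated return

    def greedy(state):
        return max(range(env.action_space.n), key=lambda a: q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Exploration-exploitation: random action with probability epsilon.
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = greedy(state)
            next_state, reward, done, _ = env.step(action)
            # TD target bootstraps from the best next action (off-policy).
            target = reward + (0 if done else gamma * q[(next_state, greedy(next_state))])
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q
```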

2.
陈奕宇  霍静  丁天雨  高阳 《软件学报》2024,35(4):1618-1650
In recent years, deep reinforcement learning (DRL) has achieved remarkable success on many sequential decision-making tasks, but that success still depends heavily on massive training data and computing resources; poor sample efficiency and limited policy generality are the key factors constraining further progress. Meta-reinforcement learning (Meta-RL) aims to adapt to a wider range of tasks from fewer samples, and research in this direction promises to ease these limitations and advance the field. Taking the research objects and applicable scenarios of Meta-RL work as its thread, this paper comprehensively reviews progress in the area: it first introduces the background of deep reinforcement learning and meta-learning; it then gives a formal definition of Meta-RL, summarizes common problem settings, and surveys existing work from the perspective of the scope over which its results apply; finally, it analyzes the research challenges and prospects of the field.
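The inner/outer-loop structure common to Meta-RL methods can be sketched compactly. Below is a minimal, hypothetical sketch in the style of Reptile (a first-order relative of MAML); sample_task() and adapt() are stand-ins for a task distribution and a few inner-loop RL updates, and are not from the surveyed paper.

```python
# Minimal sketch of a meta-RL training loop in the Reptile style.
# sample_task() and adapt() are hypothetical stand-ins: adapt() runs a
# few inner-loop RL updates on one task and returns adapted parameters.
import numpy as np

def reptile_meta_train(init_params, sample_task, adapt,
                       meta_iters=1000, meta_lr=0.1):
    theta = np.asarray(init_params, dtype=float)
    for _ in range(meta_iters):
        task = sample_task()             # draw a task from the distribution
        theta_task = adapt(theta, task)  # inner loop: adapt to this task
        # Outer loop: move meta-parameters toward the adapted solution,
        # so that a few inner steps suffice on new tasks (sample efficiency).
        theta += meta_lr * (theta_task - theta)
    return theta
```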

3.
张立华  刘全  黄志刚  朱斐 《软件学报》2023,34(10):4772-4803
Inverse reinforcement learning (IRL), also known as inverse optimal control (IOC), is an important methodology in reinforcement learning and imitation learning: it recovers a reward function from expert demonstrations and then solves for the optimal policy under that reward, so as to imitate the expert policy. In recent years IRL has produced rich results in imitation learning and has been widely applied to problems such as vehicle navigation, route recommendation, and optimal robot control. This paper first introduces the theoretical foundations of IRL. Then, starting from how the reward function is constructed, it discusses and analyzes IRL algorithms based on linear and nonlinear reward functions, including maximum-margin IRL, maximum-entropy IRL, maximum-entropy deep IRL, and generative adversarial imitation learning. It then surveys frontier directions of the field, comparing and analyzing representative algorithms, including IRL with incomplete state-action information, multi-agent IRL, IRL from suboptimal demonstrations, and guided IRL. Finally, it summarizes the key open problems and discusses future directions in both theory and application.
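For the maximum-entropy family named above, the core computation under a linear reward r(s) = w . phi(s) is a feature-matching gradient step. Below is a minimal, hypothetical sketch; the expected_features() helper (e.g., soft value iteration or policy rollouts under the current reward) is an assumed stand-in, not the survey's code.

```python
# Minimal sketch of one maximum-entropy IRL gradient step under a
# linear reward r(s) = w . phi(s): match expert feature expectations
# against those induced by the current reward. expected_features() is
# a hypothetical stand-in for soft value iteration / policy rollouts.
import numpy as np

def maxent_irl_step(w, expert_features, expected_features, lr=0.01):
    """One gradient-ascent step on the max-entropy IRL log-likelihood."""
    mu_expert = np.mean(expert_features, axis=0)   # from demonstrations
    mu_learner = expected_features(w)              # from current policy
    grad = mu_expert - mu_learner                  # feature-matching gradient
    return w + lr * grad
```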

4.
Reinforcement learning lets a robot learn an optimal action policy through interaction with its environment and is one of the important frontier directions in robotics. This paper briefly formalizes the robot task-planning problem, analyzes the main reinforcement learning approaches, and reviews progress in model-free reinforcement learning, model-based reinforcement learning, and hierarchical reinforcement learning, with emphasis on reinforcement-learning-based robot task planning; it also discusses the various methods and their applications. Finally, it summarizes the problems and challenges facing reinforcement learning in robot applications and looks ahead to future research directions.

5.
Advances in deep reinforcement learning: from AlphaGo to AlphaGo Zero   Total citations: 1 (self-citations: 0, others: 1)
In early 2016, AlphaGo's victory over Lee Sedol became a milestone in artificial intelligence. Its core technique, deep reinforcement learning, has since attracted wide attention and research and produced rich theoretical and applied results. The follow-up system AlphaGo Zero, algorithmically simpler and trained by self-play entirely without human experience, defeated AlphaGo decisively and once again reshaped expectations for deep reinforcement learning. Deep reinforcement learning combines the strengths of deep learning and reinforcement learning, enabling end-to-end perception and decision-making in complex, high-dimensional state-action spaces. This paper surveys the progress of deep reinforcement learning from AlphaGo to AlphaGo Zero. It first reviews the algorithms that contributed most to its success, including the deep Q-network (DQN), A3C, policy gradient methods, and their extensions. It then describes and discusses AlphaGo Zero in detail and analyzes its strong impetus to artificial intelligence. It also covers application progress in games, robotics, natural language processing, autonomous driving, and intelligent healthcare, as well as related resources. Finally, it discusses the outlook for deep reinforcement learning and its implications for AI development in other potential domains.
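Of the algorithms listed, DQN is the most compact to write down: Q(s, a) is regressed toward a bootstrapped target from a periodically synced target network. Below is a minimal sketch of that loss; the network shapes, batch layout, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the DQN loss: a TD target computed from a frozen
# target network, with Huber-loss regression on Q(s, a).
# batch tensors come from a replay buffer; done is a 0/1 float tensor.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the periodically-synced target network.
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * max_next
    return F.smooth_l1_loss(q_sa, target)
```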

6.
Deep reinforcement learning algorithms can now solve many complex tasks, yet balancing exploration and exploitation remains a fundamental open problem in reinforcement learning. This paper therefore proposes a deep reinforcement learning exploration method that couples a stochastic policy with a deterministic one: exploiting the stochastic policy's natural capacity for exploration, experience samples generated by the stochastic policy are used to train the deterministic policy, encouraging it to learn to explore while retaining its own advantages. By combining it with the deterministic policy algorithm DDPG...
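The scheme described, a stochastic behavior policy generating training experience for a deterministic policy, resembles the standard DDPG pattern of a deterministic actor plus exploration noise. Below is a minimal sketch under that assumption; it is not the paper's method, and all names and hyperparameters are illustrative.

```python
# Minimal sketch: a stochastic behavior policy (deterministic actor plus
# Gaussian noise, as in standard DDPG) generates experience, and the
# deterministic actor is trained on those replayed samples.
import torch

def select_action(actor, state, noise_std=0.2, low=-1.0, high=1.0):
    """Stochastic behavior policy: deterministic actor + exploration noise."""
    with torch.no_grad():
        a = actor(state)
    a = a + noise_std * torch.randn_like(a)
    return a.clamp(low, high)

def actor_update(actor, critic, actor_opt, states):
    """Deterministic policy gradient: maximize Q(s, actor(s)) on replay data."""
    loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```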

7.
A survey of the sparse reward problem in deep reinforcement learning   Total citations: 1 (self-citations: 0, others: 1)
Reinforcement learning, an important branch of machine learning, is a family of methods that seek an optimal policy through interaction with the environment. In recent years it has been widely combined with deep learning, forming the research field of deep reinforcement learning. As a new machine learning approach, deep reinforcement learning can both perceive complex inputs and solve for optimal policies, and can be applied to complex decision problems such as robot control. The sparse reward problem is a core obstacle that deep reinforcement learning faces when solving tasks and is widespread in practical applications. Solving it improves sample efficiency and the quality of the learned policy, and promotes the broad application of deep reinforcement learning to real tasks. This paper first describes the core algorithms of deep reinforcement learning; it then introduces five families of solutions to the sparse reward problem: reward design and learning, experience replay mechanisms, exploration and exploitation, multi-goal learning, and auxiliary tasks; finally, it summarizes the related work and gives an outlook.
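Among the five families listed, the experience-replay and multi-goal directions are often combined through hindsight relabeling (HER-style): a failed trajectory is replayed as if a state it actually reached had been the goal, turning a sparse signal into a useful one. Below is a minimal, hypothetical sketch of that idea; the transition layout and reward_fn are assumptions, not the survey's code.

```python
# Minimal sketch of hindsight relabeling (HER-style) for sparse rewards.
# reward_fn(next_state, goal) is a hypothetical 0/1 sparse reward check.
def hindsight_relabel(trajectory, reward_fn):
    """trajectory: list of (state, action, next_state, goal) tuples."""
    relabeled = []
    achieved_goal = trajectory[-1][2]  # pretend the final state was the goal
    for state, action, next_state, _ in trajectory:
        reward = reward_fn(next_state, achieved_goal)
        relabeled.append((state, action, reward, next_state, achieved_goal))
    return relabeled
```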

8.
A survey of deep reinforcement learning, with a discussion of the development of computer Go   Total citations: 2 (self-citations: 0, others: 2)
Deep reinforcement learning combines the perceptual ability of deep learning with the decision-making ability of reinforcement learning, allowing control directly from image input, an approach to artificial intelligence closer to human thinking. Since its introduction it has achieved remarkable theoretical and applied results. In particular, Google DeepMind's deep-reinforcement-learning-based computer Go program AlphaGo ("初弈号") defeated world-class Go player Lee Sedol 4:1 in March 2016, a new milestone in the history of artificial intelligence. This paper surveys the development of deep reinforcement learning, reviews the history of computer Go, analyzes the characteristics of the algorithms, and discusses future trends and application prospects, hoping to provide a valuable reference for new directions in control theory and applications.

9.
Deep hierarchical reinforcement learning is an important research direction within deep reinforcement learning, focusing on problems that classical deep reinforcement learning struggles with: sparse rewards, sequential decision-making, and weak transfer. Its core idea is to build a multi-level reinforcement learning policy following a hierarchical design: temporal abstraction composes temporally fine-grained low-level actions into temporally coarse-grained, semantically meaningful high-level actions, decomposing a complex problem into several simpler ones. In recent years the approach has made substantial breakthroughs and has been applied in everyday domains such as visual navigation, natural language processing, recommender systems, and video captioning. This paper first introduces the theoretical foundations of hierarchical reinforcement learning; it then describes the core techniques of deep hierarchical reinforcement learning, including hierarchical abstraction and common experimental environments; it analyzes in detail skill-based and subgoal-based deep hierarchical reinforcement learning frameworks, comparing the state of research and development trends of the algorithms in each class; it then introduces applications in several real-world domains; finally, it gives an outlook and summary.
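The subgoal-based framework described above can be sketched as a two-level rollout: a high-level policy emits a subgoal every k steps, and a low-level policy acts toward it. Below is a minimal, hypothetical sketch; high_policy, low_policy, the Gym-style env, and k are illustrative stand-ins.

```python
# Minimal sketch of a subgoal-based hierarchy: the high-level policy
# chooses a temporally abstract subgoal every k steps; the low-level
# policy selects primitive actions to pursue it. Gym-style 4-tuple step().
def hierarchical_rollout(env, high_policy, low_policy, k=10, max_steps=200):
    state, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        subgoal = high_policy(state)            # coarse, semantic action
        for _ in range(k):                       # temporal abstraction window
            action = low_policy(state, subgoal)  # fine-grained primitive action
            state, reward, done, _ = env.step(action)
            t += 1
            if done:
                break
```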

10.
In recent years reinforcement learning has made great progress in video games, board games, and decision control, and it has also driven the rapid development of financial trading systems. Financial trading has become a research hotspot in reinforcement learning, with broad application demand and academic significance in areas such as stocks, foreign exchange, and futures. Following the development of the reinforcement learning models commonly used in finance, this paper surveys research results on trading systems, adaptive algorithms, and trading strategies. Finally, it discusses the difficulties and challenges facing reinforcement learning applications in finance and looks ahead to trends in reinforcement learning trading systems.

11.
A machine learning method for Go joseki   Total citations: 5 (self-citations: 0, others: 5)
谷蓉  刘学民  朱仲涛  周杰 《计算机工程》2004,30(6):142-144,173
This paper proposes a machine learning method for Go joseki (standard corner sequences). The method automatically extracts joseki from a library of game records and generates a joseki library. For large collections of records, a staged learning procedure improves learning efficiency. Applying the method to 34,000 game records yielded 680,638 joseki points. Finally, a method based on combinatorial game theory for using joseki in a computer Go playing system is given.

12.
The game of Go is considered one of the most complicated games in the world. A Go game is divided into three stages: the opening, the middle game, and the endgame. Millions of people around the world play Go regularly. The game is played by two players, one taking White and the other Black, who alternate placing stones on empty intersections of a square grid-patterned board; the player with more territory wins. This paper proposes a soft-computing-based emotional expression mechanism and applies it to computer Go so that Go beginners enjoy watching a game and stay engaged with it. First, the knowledge base and rule base of the proposed mechanism are defined following the standards of the fuzzy markup language. The soft-computing mechanism for the Go regional alarm level is responsible for showing the inferred regional alarm level to Go beginners. Based on the inferred board situation, fuzzy inference mechanisms infer the degrees of emotional pleasure and arousal, respectively. An emotional expression mapping mechanism maps the inferred degrees of pleasure and arousal into the emotional expression of an eye robot. Finally, the protocol transmission mechanism sends the pre-defined protocol to the eye robot via a universal serial bus interface so that the robot expresses the corresponding emotional motion. The experimental results show that the eye robot can help Go beginners have fun and stay engaged while watching or playing a game of Go.

13.
This paper reports on an investigation of the possibilities of enhancing the formal e-learning process by harnessing the potential of informal game-based learning on social networks. The goal of the research is to improve the outcomes of the formal learning process through the design and implementation of an educational game on a social network and its integration with the learning management system (LMS) of an educational institution. As a proof of concept, a Facebook educational game that enables students to learn and test their knowledge was developed. The game was integrated with the Moodle LMS and evaluated within the e-learning system at the Faculty of Organizational Sciences, University of Belgrade. The results show that applying social network edutainment in an e-learning ecosystem has a positive impact on both the students' results and their satisfaction with the learning process as a whole.

14.
曹慧芳  刘知青 《软件》2011,32(1):79-82
Machine game playing, also called computer game playing, means having a computer play games such as Go. Go is a two-player strategy board game played with black and white stones on a grid-patterned board. The computer Go engine developed here adopts a Markov decision model and artificial intelligence techniques and involves heavy computation, tightly dependent on system resources: the more computation available, the more accurate the engine's move selection and the stronger its play. Given the specific hardware and software of embedded systems and their limited resources and computing power, this paper accomplishes two things. First, it ports the laboratory's PC game engine to WinCE, producing a Go engine suited to embedded systems; by migrating the large-scale computation and streamlining the computational load, the engine achieves good results on the limited resources of an embedded platform. Second, it implements the front-end interface of the Go game on WinCE.

15.
Reinforcement learning solves optimal decision problems in the model-free setting and is one of the key technologies for realizing artificial intelligence, but traditional tabular reinforcement learning methods struggle with large-scale, continuous-space control problems. Inspired by function approximation, approximate reinforcement learning parameterizes the value function or the policy function and obtains the optimal behavior policy indirectly through parameter optimization; it has shown notable results in video games, board-game playing, and robot control. On this basis, the paper reviews the state of research and application progress of approximate reinforcement learning algorithms. It introduces the relevant theoretical foundations; classifies and summarizes the classical algorithms of approximate reinforcement learning and some corresponding improvements; outlines research progress in the field of robot control; and summarizes several main open problems, as a reference for subsequent research.
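The parameterization described above is easiest to see for a linear value function v(s) = w . phi(s) trained by semi-gradient TD(0). Below is a minimal sketch; the feature map phi() and the step sizes are illustrative assumptions.

```python
# Minimal sketch of approximate RL with a linear value function
# v(s) = w . phi(s), optimized by semi-gradient TD(0).
import numpy as np

def semi_gradient_td0(w, phi, transition, alpha=0.01, gamma=0.99):
    """One TD(0) update on a (state, reward, next_state, done) transition."""
    s, r, s_next, done = transition
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    td_error = r + gamma * v_next - v_s
    # Semi-gradient: differentiate only through v(s), not the target.
    return w + alpha * td_error * phi(s)
```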

16.
Computer Go is one of the important branches of machine game playing, and its enormous game space poses a huge challenge to researchers. Current computer Go programs mostly use static evaluation search or Monte Carlo tree search. This paper instead introduces a temporal-difference (TD) algorithm into a 9x9 computer Go system and proposes a TD-based Go playing system model. The system has a degree of self-learning ability and gradually improves its playing strength through continued games. Matches against a system using alpha-beta search demonstrate the feasibility of the method.
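The self-learning loop described above can be sketched as a TD update of a position-value table after each self-play game. Below is a minimal, hypothetical sketch; the position hashing and the outcome encoding (+1 win, -1 loss) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: after a self-play game, update the value table for the
# visited positions by temporal differences, sweeping backwards so each
# position's value moves toward its successor's (and the last toward the
# game outcome).
from collections import defaultdict

def td_update_from_game(values, positions, outcome, alpha=0.05):
    """positions: hashable board states in play order; outcome in {+1, -1}."""
    values = defaultdict(float, values)
    target = outcome
    for pos in reversed(positions):
        values[pos] += alpha * (target - values[pos])
        target = values[pos]
    return values
```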

17.
The Oriental game of Go contains a unique method by which pieces, called stones, are captured and made safe from capture. A group of stones safe from capture is called safe, unconditionally alive, or similar terms. Life or its lack can be determined by lookahead through the game tree, at some expense. We present a graph-theoretic static analysis of the board arrangement which determines unconditional life or its lack, together with proofs of its equivalency to lookahead. An algorithm for the static evaluation is given and we argue that it is the preferable method for computer Go play. These results constitute the first realistic theorems in the theory of Go.
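The static analysis described is in the spirit of Benson's algorithm for unconditional life: iteratively discard chains with fewer than two vital regions, and regions that are no longer vital to any surviving chain, until a fixed point. Below is a minimal, hypothetical sketch over an assumed chain/region decomposition of the board; it is a simplified illustration, not the paper's algorithm verbatim.

```python
# Minimal sketch of a Benson-style fixed-point test for unconditional life.
# chains:  dict chain_id -> set of adjacent region ids
# regions: dict region_id -> {'adjacent': set of bordering chain ids,
#                             'healthy_for': set of chain ids for which every
#                             empty point of the region touches that chain}
def unconditionally_alive(chains, regions):
    alive = set(chains)
    while True:
        def vital(chain_id):
            # A region stays vital to a chain only while every chain
            # bordering the region is still in the surviving set.
            return {r for r in chains[chain_id]
                    if chain_id in regions[r]['healthy_for']
                    and regions[r]['adjacent'] <= alive}
        pruned = {c for c in alive if len(vital(c)) < 2}
        if not pruned:
            return alive   # chains proven unconditionally alive
        alive -= pruned
```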

19.
This article presents a new learning system, called Gone, for predicting life and death in the game of Go. The system uses a multi-layer perceptron classifier trained on learning examples extracted from game records. Blocks of stones are represented by a large number of features, which enables a rather precise prediction of life and death. On average, Gone correctly predicts life and death for 88% of all blocks that are relevant for scoring; towards the end of a game the performance increases up to 99%. A straightforward extension to full-board evaluation is discussed. Experiments indicate that the predictor is an important component for building a strong full-board evaluation function.
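The predictor described, a multi-layer perceptron over block features, can be sketched with a standard library. Below is a minimal sketch; the feature extraction is a hypothetical stand-in and the layer sizes are illustrative assumptions, not Gone's actual architecture.

```python
# Minimal sketch: an MLP classifier predicting life (1) or death (0)
# for blocks of stones from feature vectors extracted from game records.
from sklearn.neural_network import MLPClassifier

def train_life_death_predictor(block_features, labels):
    """block_features: (n_blocks, n_features) array; labels: 1=alive, 0=dead."""
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(block_features, labels)
    return clf

# Usage: predictions = clf.predict(features_of_new_blocks)
```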

20.
The benefits of delivering learning content in ways that match the student's learning style have been identified in both classroom learning and eLearning. Although there is limited empirical evidence in adaptive games-based learning (GBL), adaptivity has been identified as having the potential to improve learning effectiveness. This paper presents the results of a study investigating the use of learning styles in GBL, in particular how the learning style is identified in GBL and how it fluctuates during the learning process. For the purposes of this study, a game with two modes was developed: 1) a non-adaptive mode, and 2) a mode with an in-game adaptive system that dynamically and continuously adapted its content according to the student's interactions in the game. In both modes, the interactions between the participants and the game were recorded in a database. The study was performed with 60 students in higher education. The results show that the learning style identified by a learning style questionnaire is not always consistent with the learning style identified in the game, and that the learning style fluctuates during the learning process in GBL, although in the first mission of the game participants tend to choose the same learning style as the one identified outside the game. The number of mistakes committed by participants shows a strong correlation with this fluctuation. The results contribute to the body of empirical evidence in adaptive GBL, particularly on learning-style fluctuation, and the paper provides recommendations on the use of adaptivity in GBL to accommodate this fluctuation.
