1.

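穆翔, 刘全, 傅启明, 孙洪坤, 周鑫. 《通信学报》, 2013, 34(10): 11-99

To address the slow convergence of traditional Q-value iteration algorithms based on lookup tables or function approximation in continuous-space problems, and their difficulty in yielding continuous action policies, an on-policy temporal difference algorithm based on two-layer fuzzy partitioning, DFP-OPTD, is proposed, and its convergence is analyzed theoretically. The first fuzzy partition layer acts on the state space and the second on the action space; the two layers are combined to compute the Q-value function. Based on the resulting Q-value function, gradient descent is used to update the consequent parameters of the fuzzy rules. DFP-OPTD is applied to classic reinforcement learning problems, and the experimental results show that the algorithm converges well and can produce continuous action policies.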

2.

A new fast Sarsa algorithm based on value function transfer

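傅启明, 刘全, 尤树华, 黄蔚, 章晓芳. 《电子学报》, 2014, 42(11): 2157-2161

Knowledge transfer is a new research focus in machine learning. Its basic idea is to transfer experience from historical tasks to a target task in order to improve an algorithm's convergence speed and accuracy. To address the slow convergence of classic reinforcement learning algorithms, value function information is transferred during learning to reduce the number of samples needed for convergence and thereby speed it up. Building on the learning framework of the classic on-policy Sarsa algorithm and combining it with value function transfer to improve the initial value function, a new fast Sarsa algorithm based on value function transfer, VFT-Sarsa, is proposed. In its early stage the algorithm introduces a bisimulation metric to measure, given identical state and action spaces, the distance between states of the target task and states of historical tasks; value functions are transferred for states that are similar and satisfy certain conditions, after which the learning algorithm proceeds as usual. VFT-Sarsa is applied to the Random Walk problem and compared with the classic Sarsa algorithm, Q-learning, and the QV algorithm, which has good convergence speed; the experimental results show that it converges faster while maintaining convergence accuracy.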
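
The on-policy Sarsa update that this entry builds on can be sketched as follows. This is a minimal tabular illustration, not the authors' implementation: the function name `sarsa_update`, the dictionary-based Q table, the state and action labels, and the step-size and discount values are all assumptions. In this sketch, value function transfer would enter only through the initial contents of `q`.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, terminal=False):
    """One on-policy Sarsa step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = r if terminal else r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


# Hypothetical warm start: a value copied from a similar state in a past task.
q = defaultdict(float)
q[("s1", "right")] = 0.5  # transferred initial value (assumed for illustration)

# One transition s0 --right--> s1 with zero reward, next action also "right".
new_value = sarsa_update(q, "s0", "right", r=0.0, s_next="s1", a_next="right")
print(new_value)
```

Because the target uses the action actually chosen in the next state, a non-zero initial value at `("s1", "right")` propagates back to `("s0", "right")` on the very first update.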

3.

A Bayesian temporal difference algorithm based on random projection

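刘全, 于俊, 王辉, 傅启明, 朱斐. 《电子学报》, 2016, 44(11): 2752-2757

Most reinforcement learning algorithms are based on value function evaluation. The Gaussian process temporal difference algorithm evaluates the value function with Bayesian methods, using the Bellman equation and Bayes' rule to build a probabilistic generative model relating immediate rewards to the value function. In the state space, online kernel sparsification combined with least squares is used to compute an approximate linear representation of each new sample and so speed up execution, but the time complexity remains high. To address the selection of approximate states in the state space, a Bayesian temporal difference algorithm based on random projection is proposed within the Gaussian process framework. The algorithm uses hash functions to map the elements of the dictionary state set to hash values and groups them by hash value, thereby reducing the number of comparisons between states. Experimental results show that the method not only speeds up execution but also strikes a good balance between the accuracy of state value function evaluation and execution time.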
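
The idea of grouping dictionary states by hash value so that a new state is compared with only one group can be sketched as below. The abstract does not specify the hash function, so the sign-of-random-projection hash used here is an assumption, as are the names `make_hasher` and `group_dictionary` and the toy two-dimensional states.

```python
import random
from collections import defaultdict


def make_hasher(dim, n_bits, seed=0):
    """Build a hash that records the sign of n_bits random projections of a state."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_state(x):
        bits = 0
        for plane in planes:
            dot = sum(p * xi for p, xi in zip(plane, x))
            bits = (bits << 1) | (1 if dot >= 0.0 else 0)
        return bits

    return hash_state


def group_dictionary(states, hash_state):
    """Bucket dictionary states by hash value."""
    buckets = defaultdict(list)
    for s in states:
        buckets[hash_state(s)].append(s)
    return buckets


hash_state = make_hasher(dim=2, n_bits=4, seed=42)
dictionary = [(0.1, 0.2), (0.2, 0.4), (-0.3, 0.5), (-1.0, -1.0)]
buckets = group_dictionary(dictionary, hash_state)

# A new state is compared only with dictionary states in its own bucket.
candidates = buckets[hash_state((0.1, 0.2))]
print(len(candidates), "of", len(dictionary), "dictionary states need comparing")
```

States pointing in the same direction always share a bucket under this hash, which is why `(0.1, 0.2)` and `(0.2, 0.4)` land together.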

4.

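Resource allocation in elastic optical networks based on an improved DQN reinforcement learning algorithm

尚晓凯, 韩龙龙, 翟慧鹏. 《光通信技术》, 2023, (5): 12-15

To address low spectrum utilization in optical network resource allocation, an improved deep Q-network (DQN) reinforcement learning algorithm is proposed. Based on an ε-greedy policy, the algorithm defines the loss function from the difference between the action value function and the state value function and continually adjusts ε to change the agent's exploration rate. In this way it obtains the optimal action value function and handles the routing and spectrum allocation problem well. In addition, a different experience pool sampling method is adopted to speed up the convergence of iterative training. Simulation results show that the improved DQN algorithm makes the elastic optical network training model converge quickly; at a traffic load of 300 Erlang, compared with DQN it raises spectrum utilization by 10.09%, lowers the blocking rate by 12.41%, and reduces the average access delay by 1.27 ms.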
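
An ε-greedy policy with an adjustable exploration rate can be sketched as follows. The abstract does not state how ε is adjusted, so the multiplicative decay with a floor is an assumption, as are the function names and the example Q values.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    """Assumed schedule: shrink epsilon geometrically but never below `floor`."""
    return max(floor, epsilon * rate)


rng = random.Random(0)
epsilon = 1.0
q_values = [0.1, 0.9, 0.3]  # hypothetical action values for one state
for step in range(1000):
    action = epsilon_greedy(q_values, epsilon, rng)
    epsilon = decay_epsilon(epsilon)
print(epsilon)
```

Early steps explore almost uniformly, while late steps mostly exploit the current action values.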

5.

A model-based factored Bayesian online reinforcement learning method

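仵博, 郑红燕, 冯延蓬, 陈鑫. 《电子学报》, 2014, 42(7): 1429-1434

To address the large number of parameters, slow convergence, and inability to learn online in Bayesian reinforcement learning, a model-based factored Bayesian reinforcement learning method is proposed. First, the learning parameters are given a factored representation to reduce their number. Next, a Bayesian method learns from prior knowledge and observed data to optimize the balance between exploration and exploitation. Finally, a point-based Bayesian reinforcement learning method is used to make learning converge quickly, enabling online learning. Simulation results show that the algorithm meets the performance requirements of real-time systems.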

6.

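Value iteration algorithm based on topological order updating

黄蔚, 刘全, 孙洪坤, 傅启明, 周小科. 《通信学报》, 2014, 35(8): 8-62

A value iteration algorithm based on topological order updating is proposed. Using the transition relationships between states, it decomposes the directed graph of the task model into a series of smaller strongly connected components and updates them in topological order. Results on the classic planning problems Mountain Car and maze show that the algorithm converges faster and more accurately and is robust to growth of the state space.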
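
The decomposition into strongly connected components followed by ordered updates can be sketched as below. This is an illustrative reading of the abstract, not the paper's code: Tarjan's algorithm is an assumed choice for finding the components, and the MDP encoding (`mdp[s][a]` as a list of `(probability, next_state, reward)` triples, with every state present as a key) and the toy four-state task are hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; components are emitted successors-first."""
    index, low, on_stack, stack, order = {}, {}, set(), [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            order.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return order


def topological_value_iteration(mdp, gamma=0.9, tol=1e-8):
    """Solve each component to convergence once all its successors are solved."""
    graph = {s: {ns for outcomes in acts.values() for _, ns, _ in outcomes}
             for s, acts in mdp.items()}
    values = {s: 0.0 for s in mdp}
    for comp in strongly_connected_components(graph):
        while True:
            delta = 0.0
            for s in comp:
                if not mdp[s]:
                    continue  # terminal state: value stays 0
                best = max(sum(p * (r + gamma * values[ns]) for p, ns, r in outcomes)
                           for outcomes in mdp[s].values())
                delta = max(delta, abs(best - values[s]))
                values[s] = best
            if delta < tol:
                break
    return values


# Toy task: a <-> b form one component; c and goal are singleton components.
mdp = {
    "a": {"go": [(1.0, "b", 0.0)]},
    "b": {"go": [(1.0, "c", 0.0)], "back": [(1.0, "a", 0.0)]},
    "c": {"go": [(1.0, "goal", 1.0)]},
    "goal": {},
}
values = topological_value_iteration(mdp)
print(values)
```

Each component is swept only until its own values settle, so states outside a cycle are backed up exactly once instead of on every global sweep.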

7.

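Online hierarchical reinforcement learning method based on interruptible Options

朱斐, 许志鹏, 刘全, 伏玉琛, 王辉. 《通信学报》, 2016, 37(6): 65-74

To cope with the sheer volume of big data, an online-updating Macro-Q algorithm (MQIU) is proposed on the basis of the Macro-Q algorithm; it updates the value functions of abstract actions and of primitive actions simultaneously, improving the utilization of data samples. Because the traditional Markov process model and abstract actions both struggle with variability, an interruption mechanism is introduced and a model-free Macro-Q learning algorithm with interruptible abstract actions (IMQ) is proposed, which can learn and improve control policies in dynamic environments. Simulation results verify that MQIU speeds up convergence and can therefore handle larger problems, and that IMQ speeds up task solving while keeping learning performance stable.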

8.

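Low-emission traffic signal control at intersections based on reinforcement learning

李昕. 《电子技术》, 2014, (8): 5-8

Vehicle emissions at intersections are complex, and especially when initial queue length is taken into account it is difficult to build an explicit mathematical model. Q-learning is a model-free reinforcement learning algorithm that learns the optimal control policy through trial-and-error interaction with the environment. This paper proposes a Q-learning-based signal control scheme for traffic emissions. Using the simulation platform USTCMTS2.0, the optimal signal timing under different per-phase queue lengths is found through repeated trial-and-error learning. Fuzzy initialization of the Q function is added to Q-learning to improve its convergence speed and accelerate learning. Simulation results show that the reinforcement learning algorithm performs well: compared with Hideki's method, average vehicle emissions fall by 13.9% under high traffic flow, and fuzzy initialization of the Q values greatly accelerates convergence of the Q function.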
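
The Q-learning update with a non-zero initial Q table can be sketched as follows. The state and action labels, the reward (a negative emission amount), and the hand-set initial value are hypothetical; the hand-set value only stands in for the fuzzy initialization, whose rules the abstract does not give.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy step: Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]


actions = ["keep", "extend"]            # hypothetical signal-timing choices
q = {("long_queue", "extend"): 1.0}     # assumed prior guess, replacing an all-zero table

# Reward is the negative of a made-up emission amount for this step.
new_value = q_learning_update(q, "short_queue", "keep", r=-2.0,
                              s_next="long_queue", actions=actions)
print(new_value)
```

Starting from informed values rather than zeros means early updates already reflect which next states are promising.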

9.

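Non-cooperative multi-user dynamic power control method based on SumTree sampling combined with Double DQN

刘骏, 王永华, 王磊, 尹泽中. 《电讯技术》, 2023, 63(10): 1603-1611

To guarantee the communication quality of service of secondary users in cognitive radio networks while reducing the power wasted by unreasonable transmit power, a non-cooperative multi-user dynamic power control method is proposed that combines SumTree sampling with a double deep Q-network (Double Deep Q Network, Double DQN). With this method, secondary users interact continually with auxiliary base stations and, through ongoing learning in a dynamically changing environment, choose lower transmit powers that still complete the power control task. The method also decouples the selection of the target Q-value action from the computation of the target Q value, which effectively reduces overestimation and the algorithm's loss. Furthermore, when drawing experience samples it accounts for differences in importance between samples by using SumTree sampling, which combines priority-based and random sampling, so that high-priority transitions are favored while even the lowest-priority transitions keep a nonzero sampling probability. Simulation results show that after convergence the average loss stays within 0.04, convergence is at least 10 training episodes faster, the upper bound on total secondary-user throughput and the success rate of secondary-user power control both improve, and the average power consumption of secondary users falls by more than 0.5 mW.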
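
The two mechanisms named in this abstract, SumTree sampling and the decoupled Double DQN target, can be sketched as below. This is a generic illustration rather than the paper's implementation: the array layout of the tree, the small constant `eps` that keeps priorities positive, the stratified sampling in `sample_batch`, and all names and numbers are assumptions.

```python
import random


class SumTree:
    """Binary tree whose leaves hold priorities and whose inner nodes hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # node 1 is the root; leaves start at `capacity`
        self.data = [None] * capacity
        self.next_slot = 0

    def add(self, priority, item):
        slot = self.next_slot
        self.data[slot] = item
        self.update(slot, priority)
        self.next_slot = (slot + 1) % self.capacity  # overwrite oldest when full

    def update(self, slot, priority):
        node = slot + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def total(self):
        return self.tree[1]

    def get(self, mass):
        """Return (slot, priority, item) whose cumulative interval contains `mass`."""
        node = 1
        while node < self.capacity:
            left = 2 * node
            if mass < self.tree[left]:
                node = left
            else:
                mass -= self.tree[left]
                node = left + 1
        slot = node - self.capacity
        return slot, self.tree[node], self.data[slot]


def priority(td_error, eps=0.01):
    """Assumed rule: eps keeps even zero-error transitions sampleable."""
    return abs(td_error) + eps


def sample_batch(tree, batch_size, rng):
    """Split the total mass into equal segments and draw one point in each."""
    segment = tree.total() / batch_size
    return [tree.get((i + rng.random()) * segment) for i in range(batch_size)]


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Online network picks the action; target network evaluates it."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best]


tree = SumTree(capacity=4)
for item, p in [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)]:  # hand-set priorities
    tree.add(p, item)
batch = sample_batch(tree, batch_size=2, rng=random.Random(0))
target = double_dqn_target(1.0, next_q_online=[1.0, 3.0], next_q_target=[2.0, 0.5])
print([item for _, _, item in batch], target)
```

Sampling costs one root-to-leaf walk per draw, and a priority change only touches the nodes on that leaf's path to the root.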

10.

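UAV three-dimensional path planning based on a Q-learning arithmetic optimization algorithm

丁兵兵, 匡珍春, 卢来. 《电光与控制》, 2024, (3): 61-69

To overcome the high planning cost, poor accuracy, and tendency to fall into local optima of traditional methods for UAV 3D path planning, a UAV 3D path planning algorithm based on a Q-learning arithmetic optimization algorithm is proposed. To improve the search accuracy of the arithmetic optimization algorithm, Circle chaotic mapping is introduced to increase the diversity and uniformity of the initial population; Q-learning is introduced to adaptively adjust the update of the math optimizer accelerated function according to individual state, balancing global exploration and local exploitation; and a neighborhood perturbation of the best solution is designed to improve global search ability. A UAV 3D path planning model is built that converts path planning into a multi-objective function optimization problem, and the improved algorithm is used to solve it, with an objective function that jointly considers path cost, terrain cost, and boundary cost to evaluate particle fitness and iteratively optimize the path. Simulation results show that paths planned by the proposed algorithm have lower total cost and remain stable across different complex terrain environments.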
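
Population initialization with a Circle chaotic map can be sketched as follows. The map form x_{k+1} = (x_k + b - (a / 2π) sin(2π x_k)) mod 1 with a = 0.5 and b = 0.2 is the variant commonly used in the metaheuristics literature and is assumed here, since the abstract gives no parameters; the seed value `x0`, the function names, and the search bounds are also hypothetical.

```python
import math


def circle_map_sequence(n, x0=0.7, a=0.5, b=0.2):
    """Generate n values in [0, 1] by iterating the Circle map from x0."""
    xs, x = [], x0
    for _ in range(n):
        x = (x + b - (a / (2 * math.pi)) * math.sin(2 * math.pi * x)) % 1.0
        xs.append(x)
    return xs


def init_population(pop_size, lower, upper, x0=0.7):
    """Scale the chaotic sequence into the search bounds, one value per coordinate."""
    dim = len(lower)
    chaos = circle_map_sequence(pop_size * dim, x0)
    return [[lower[d] + chaos[i * dim + d] * (upper[d] - lower[d]) for d in range(dim)]
            for i in range(pop_size)]


# Hypothetical 3D waypoint bounds: x and y in [0, 100], altitude in [10, 50].
population = init_population(5, lower=[0.0, 0.0, 10.0], upper=[100.0, 100.0, 50.0])
for individual in population:
    print([round(v, 2) for v in individual])
```

The sequence is deterministic for a given `x0`, so runs are reproducible without a random seed.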

11.

Heuristic Sarsa algorithm based on value function transfer

Jianping CHEN, Zhengxia YANG, Quan LIU, Hongjie WU, Yang XU, Qiming FU. 《通信学报》, 2018, 39(8): 37-47

To address the slow convergence of the traditional Sarsa algorithm, an improved heuristic Sarsa algorithm based on value function transfer is proposed. The algorithm combines traditional Sarsa with value function transfer: it introduces a bisimulation metric to measure the similarity between a new task and historical tasks that share the same state and action spaces, and uses it to speed up convergence. In addition, following a heuristic exploration approach, the algorithm introduces Bayesian inference and uses variational inference to measure information gain. The information gain is then used to build an intrinsic reward model that serves as an exploration factor, further speeding up convergence. The proposed algorithm is applied to the traditional Grid World problem and compared with the traditional Sarsa algorithm, the Q-Learning algorithm, and the VFT-Sarsa and IGP-Sarsa algorithms, which have better convergence performance; the experimental results show that it converges faster and more stably.

12.

Hyperparameter Optimization for Machine Learning Models Based on Bayesian Optimization

Jia Wu, Xiu-Yun Chen, Hao Zhang, Li-Dong Xiong, Hang Lei, Si-Hao Deng. 《电子科技学刊:英文版》, 2019, 17(1): 26-40

Hyperparameters are important for machine learning algorithms because they directly control the behavior of training algorithms and significantly affect model performance. Several tuning techniques have been developed and applied successfully in certain application domains. However, this work demands professional knowledge and expert experience, and sometimes has to resort to brute-force search. An efficient hyperparameter optimization algorithm that can tune any given machine learning method would therefore greatly improve the efficiency of machine learning. In this paper, we model the relationship between the performance of machine learning models and their hyperparameters with Gaussian processes. The hyperparameter tuning problem can then be abstracted as an optimization problem, which is solved with Bayesian optimization. Bayesian optimization is based on Bayes' theorem: it places a prior over the objective function and uses the information from previous samples to update the posterior over that function. A utility function selects the next sample point so as to maximize the objective function. Several experiments were conducted on standard test datasets. The results show that the proposed method can find the best hyperparameters for widely used machine learning models, such as random forests and neural networks, and even for multi-grained cascade forest, when time cost is taken into account.

13.

Advantage estimator based on importance sampling

Quan LIU, Yubin JIANG, Zhihui HU. 《通信学报》, 2019, 40(5): 108-116

In continuous action tasks, deep reinforcement learning usually uses a Gaussian distribution as the policy function. To address the slowdown in convergence that the Gaussian policy function suffers because of clipped actions, an importance sampling advantage estimator (ISAE) is proposed. Building on the generalized advantage estimator, ISAE introduces an importance sampling mechanism that improves the convergence speed of the algorithm and corrects the value function bias caused by boundary actions by computing the ratio between the target policy and the behavior policy for those actions. In addition, ISAE introduces a parameter L that improves the reliability of samples and keeps the network parameters stable by limiting the range of the importance sampling ratio. To verify its effectiveness, ISAE is applied to proximal policy optimization and compared with other algorithms on the MuJoCo platform. Experimental results show that ISAE converges faster.
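
The two ingredients this abstract combines, an advantage estimator and a bounded importance sampling ratio, can be sketched as below. This is not ISAE itself: `gae` is the standard generalized advantage estimator recursion, and `clipped_ratio` with its `low` and `high` bounds is only an assumed stand-in for how the range of the ratio might be limited, since the abstract does not define the L parameter.

```python
import math


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates; `values` has one extra bootstrap entry."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def clipped_ratio(logp_target, logp_behavior, low=0.8, high=1.2):
    """Importance ratio pi_target / pi_behavior, limited to [low, high]."""
    ratio = math.exp(logp_target - logp_behavior)
    return min(max(ratio, low), high)


# Hypothetical two-step trajectory with a zero value baseline.
advantages = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
weights = [clipped_ratio(-0.9, -1.0), clipped_ratio(1.0, 0.0)]
weighted = [w * a for w, a in zip(weights, advantages)]
print(advantages, weights, weighted)
```

Bounding the ratio keeps a single off-policy sample from dominating the gradient, at the cost of some bias.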

14.

Optimal Bayesian network structure learning algorithm based on Markov blanket constraints

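谭翔元, 高晓光, 贺楚超. 《电子学报》, 2019, 47(9): 1898-1904

This paper addresses structure learning for optimal Bayesian networks. Building on the dynamic programming (DP) algorithm, it uses the Markov blankets computed by the IAMB (Incremental Association Markov Blanket) algorithm to constrain score computation, reducing the number of score evaluations, and proposes a dynamic programming algorithm constrained with Markov blankets (Dynamic Programming Constrained with Markov Blanket, DPCMB). The effect of the importance threshold in IAMB on the performance metrics of DPCMB is studied, and recommendations for tuning the threshold are given. Experimental results show that, by adjusting the importance threshold, DPCMB can match the accuracy of DP while greatly reducing running time, the number of score evaluations, and required storage.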

15.

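A multi-objective Sarsa(λ) algorithm based on maximum collective expected loss

刘全, 李瑾, 傅启明, 崔志明, 伏玉琛. 《电子学报》, 2013, 41(8): 1469-1473

For RoboCup, a typical multi-objective reinforcement learning problem, a multi-objective reinforcement learning algorithm based on maximum collective expected loss, LRGM-Sarsa(λ), is proposed. The algorithm estimates the maximum collective expected loss of each objective and, while balancing the objectives, selects the best joint action to produce an optimal joint policy. When training a single objective, it uses a Sarsa(λ) algorithm based on an improved MSBR error function and optimizes the action selection probability function and the step-size parameter, resolving the instability and non-convergence of reinforcement learning under nonlinear function generalization. Applied to training a local shooting policy in RoboCup, the algorithm achieves good results, demonstrating its effectiveness.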

16.

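Learning step-size optimization in gradient descent training of RBF neural networks (cited 9 times: 0 self-citations, 9 by others)

林嘉宇, 刘荧. 《信号处理》, 2002, 18(1): 43-48

Gradient descent is an effective method for training RBF neural networks. As with other descent-based algorithms, gradient descent training of RBF neural networks raises the question of how to choose the learning step size. Based on a second-order Taylor expansion of the error energy function with respect to the learning step size, this paper constructs a method for optimizing the step size and gives a fairly detailed derivation. Experiments show that the method effectively speeds up the convergence of gradient descent and improves its performance. The idea can also be applied to step-size optimization in other descent-based methods.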
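
Choosing the step size from a second-order Taylor expansion can be sketched as follows. Writing φ(η) = E(w - η g) for the error along the negative gradient, the expansion φ(η) ≈ φ(0) + φ'(0) η + ½ φ''(0) η² is minimized at η* = -φ'(0) / φ''(0). The code estimates both derivatives by central differences, which is an assumption made to keep the sketch generic; the paper derives them for the RBF network itself. The quadratic test error and all names are hypothetical.

```python
def optimal_step(error, w, grad, h=1e-3):
    """Step size minimizing the second-order Taylor model of error(w - eta * grad)."""
    def phi(eta):
        return error([wi - eta * gi for wi, gi in zip(w, grad)])

    d1 = (phi(h) - phi(-h)) / (2 * h)                 # estimate of phi'(0)
    d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / (h * h)  # estimate of phi''(0)
    if d2 <= 0:
        return None  # model has no minimum along this direction
    return -d1 / d2


def error(w):
    """Hypothetical error surface with its minimum at (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


w = [0.0, 0.0]
grad = [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]  # analytic gradient of `error`
eta = optimal_step(error, w, grad)
w_new = [wi - eta * gi for wi, gi in zip(w, grad)]
print(eta, w_new, error(w_new))
```

On this isotropic quadratic the Taylor model is exact, so a single step with the computed size lands at the minimum; on a general error surface the step is only locally optimal and is recomputed every iteration.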

17.

Structure Learning in Bayesian Networks Using Asexual Reproduction Optimization

Ali Reza Khanteymoori, Mohammad Bagher Menhaj, Mohammad Mehdi Homayounpour. 《ETRI Journal》, 2011, 33(1): 39-49

A new structure learning approach for Bayesian networks based on asexual reproduction optimization (ARO) is proposed in this paper. ARO can be considered an evolutionary algorithm that mathematically models the budding mechanism of asexual reproduction. In ARO, a parent produces a bud through a reproduction operator; the parent and its bud then compete to survive according to a performance index obtained from the underlying objective function of the optimization problem, and the fitter individual is kept. The convergence measure of ARO is analyzed. The proposed method is applied to real-world and benchmark applications, and its effectiveness is demonstrated through computer simulations. Simulation results show that ARO outperforms the genetic algorithm (GA): it yields a good structure and converges faster than GA.
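
The parent-and-bud loop described in this abstract can be sketched as below. This shows only the competition scheme on a continuous toy problem: the Gaussian perturbation used as the reproduction operator, the sphere objective, and every name and parameter are assumptions, and the sketch does not attempt Bayesian network structure learning.

```python
import random


def aro_minimize(objective, parent, n_iter=200, sigma=0.3, seed=0):
    """Single-parent search: each iteration the parent buds, and the fitter one survives."""
    rng = random.Random(seed)
    best, best_cost = list(parent), objective(parent)
    for _ in range(n_iter):
        # Assumed reproduction operator: the bud is a Gaussian perturbation of the parent.
        bud = [x + rng.gauss(0.0, sigma) for x in best]
        bud_cost = objective(bud)
        if bud_cost < best_cost:  # competition: the bud replaces the parent only if fitter
            best, best_cost = bud, bud_cost
    return best, best_cost


def sphere(x):
    """Hypothetical objective: squared distance from the origin."""
    return sum(v * v for v in x)


start = [3.0, -2.0]
best, best_cost = aro_minimize(sphere, start)
print(best_cost, "from", sphere(start))
```

Because a bud is accepted only when it improves the objective, the cost of the surviving individual never increases from one iteration to the next.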