Similar Documents
20 similar documents found.
1.
Mechatronics, 2014, 24(8): 966-974
Reinforcement learning (RL) is a framework that enables a controller to find an optimal control policy for a task in an unknown environment. Although RL has been successfully used to solve optimal control problems, learning is generally slow. The main causes are the inefficient use of information collected during interaction with the system and the inability to use prior knowledge of the system or the control task. In addition, the learning speed depends heavily on the learning-rate parameter, which is difficult to tune. In this paper, we present a sample-efficient, learning-rate-free version of the Value-Gradient Based Policy (VGBP) algorithm. The main difference between VGBP and other frequently used algorithms, such as Sarsa, is that in VGBP the learning agent has direct access to the reward function, rather than just the immediate reward values. Furthermore, the agent learns a process model. This enables the algorithm to select control actions by optimizing over the right-hand side of the Bellman equation. We demonstrate fast learning convergence in simulations and experiments with the underactuated pendulum swing-up task. In addition, we present experimental results for a more complex 2-DOF robotic manipulator.
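
To make the action-selection step concrete, here is a minimal sketch of optimizing the right-hand side of the Bellman equation with a known reward function and a learned process model. This is an illustration, not the paper's implementation; the model f, reward r, value function V, and the toy 1-D system below are all hypothetical stand-ins.

```python
import numpy as np

def greedy_action(x, f, r, V, actions, gamma=0.98):
    """Pick the action maximizing the right-hand side of the Bellman
    equation: r(x, a) + gamma * V(f(x, a)).

    f: learned (deterministic) process model, f(x, a) -> next state
    r: reward function the agent has direct access to
    V: current value-function approximation
    """
    returns = [r(x, a) + gamma * V(f(x, a)) for a in actions]
    return actions[int(np.argmax(returns))]

# Toy 1-D system: x' = x + a, reward penalizes distance from the origin.
f = lambda x, a: x + a
r = lambda x, a: -abs(x + a)
V = lambda x: -abs(x)            # stand-in for the learned value function
print(greedy_action(2.0, f, r, V, actions=[-1.0, 0.0, 1.0]))  # -> -1.0
```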

2.
Energy harvesting from the working environment has received increasing attention in research on wireless sensor networks. Recent developments in this area can be used to replenish the power supply of sensors. However, power management remains a crucial issue for such networks due to the uncertainty of stochastic replenishment. In this paper, we propose a generic mathematical framework to characterize the policy for single-hop transmission over a replenishable sensor network. First, we introduce a Markov chain model to describe different modes of energy renewal. Then, we derive the optimal transmission policy for sensors with different energy budgets. Depending on the energy status of a sensor and the reward for successfully transmitting a message, we prove the existence of optimal thresholds that maximize the average reward rate. Our results are quite general, since the reward values can be made application-specific for different design objectives. Compared with the unconditional transmit-all policy, which transmits every message as long as the energy storage is positive, the proposed optimal transmission policy is shown to achieve significant gains in the average reward rate.
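
The flavor of the result can be seen in a small simulation. The sketch below assumes, for brevity, Bernoulli energy arrivals in place of the paper's Markov-chain renewal model and a uniformly distributed per-message reward; the threshold function thr is a hypothetical example of an energy-dependent threshold.

```python
import random

def simulate(threshold, p_harvest=0.3, capacity=10, steps=100_000, seed=0):
    """Simulate a single-hop sensor: each slot a message with a random
    reward arrives; transmitting costs one energy unit and is done only
    if the reward meets the energy-dependent threshold. Harvesting is
    simplified to Bernoulli arrivals (an assumption for brevity)."""
    rng = random.Random(seed)
    energy, total_reward = capacity, 0.0
    for _ in range(steps):
        if rng.random() < p_harvest and energy < capacity:
            energy += 1
        reward = rng.random()            # application-specific reward
        if energy > 0 and reward >= threshold(energy):
            total_reward += reward
            energy -= 1
    return total_reward / steps          # average reward rate

# Thresholds that decrease as stored energy grows, vs. transmit-all.
thr = lambda e: max(0.0, 0.9 - 0.08 * e)
print(simulate(thr))                     # threshold policy
print(simulate(lambda e: 0.0))           # unconditional transmit-all
```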

3.
In this paper we study an optimal server allocation problem, where a single server is shared among multiple queues based on queue backlog information. Due to the physical nature of the system, this information is delayed: when the allocation decision is made, the server only has backlog information from an earlier time. Queues have different arrival processes as well as different buffering/holding costs. The objective is to minimize the expected total discounted holding cost over a finite or infinite horizon. We introduce an index policy in which the index of a queue is a function of the state of that queue. Our primary interest is to characterize conditions under which this index policy is optimal. We present a fairly general method for bounding the reward of serving one queue instead of another. Using this result, sufficient conditions for the optimality of the index policy can be derived for a variety of arrival processes and packet holding costs. These conditions are in general in the form of sufficient separation among indices, and they characterize the part of the state space where the index policy is optimal. We provide examples, derive the indices, and illustrate the region where the index policy is optimal.
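
The decision rule itself is simple once the indices are chosen. A minimal sketch follows, using an illustrative index of the form c·mu·x (an assumption, not necessarily the paper's index); the paper's separation conditions determine when such a rule is optimal for the delayed-information system.

```python
def index_policy(delayed_backlogs, holding_costs, service_rates):
    """Serve the queue with the largest index, computed from the
    (delayed) backlog of each queue. The c * mu * x index form is an
    illustrative placeholder."""
    indices = [c * mu * x for x, c, mu in
               zip(delayed_backlogs, holding_costs, service_rates)]
    return max(range(len(indices)), key=indices.__getitem__)

# Three queues whose backlogs are observed one slot late.
print(index_policy([4, 7, 2], [2.0, 1.0, 3.0], [1.0, 1.0, 1.0]))  # -> 0
```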

4.
Consider a circuit-switched broadband ISDN network that supports a variety of traffic classes (e.g., data, voice, video, facsimile), each of which has its own traffic requirement and reward function. We address the problem of dynamically allocating the capacity of each circuit among the traffic classes. As an optimal allocation policy is extremely hard to find, we apply a different methodology: we bound the optimal expected reward from above and propose a specific threshold policy, restricted complete sharing (RCS), that yields a reward sufficiently close to this bound. The initial parameters of the threshold policy are found with the aid of our bounding technique and are improved by two iterative procedures. The quality of our policy is demonstrated by several numerical examples.

5.
This paper is concerned with the optimal flow control of an ATM switching element in a broadband integrated services digital network. We model the switching element as a stochastic fluid-flow system with a finite buffer, a constant output rate, and Markov-modulated fluid input. There is a cost for holding fluid and a reward for admitting fluid to the buffer. We study the optimal flow control policies that minimize the total expected discounted cost. We analyze the problem by two different approaches and show that the optimal policy is of the turnpike type, with the turnpike levels dependent on the states of the Markov-modulated source. We also state sufficient conditions under which the optimal turnpike levels are monotonic functions of the states of the Markov-modulated source.
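
A turnpike policy of this kind reduces to a state-dependent admission threshold on the buffer content. The sketch below is illustrative only; the turnpike levels are placeholder numbers, chosen monotone in the source state as the paper's sufficient conditions suggest.

```python
def admit(buffer_level, source_state, turnpike_levels):
    """Turnpike-type flow control: admit arriving fluid only while the
    buffer content is below the turnpike level of the current
    Markov-modulated source state; otherwise reject."""
    return buffer_level < turnpike_levels[source_state]

# Hypothetical monotone turnpike levels for a 3-state source.
levels = {0: 8.0, 1: 5.0, 2: 2.0}
print(admit(4.5, 1, levels))   # True: below the state-1 level, admit
print(admit(6.0, 1, levels))   # False: above it, reject
```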

6.
In this paper, we formulate the combined handoff and channel assignment problem in a CDMA LEO satellite network as a reward/cost optimization problem. The probabilistic properties of the signals (channel fading as a function of satellite elevation angle) and of the traffic in the footprints are used to formulate a finite-horizon Markov decision process. The optimal policy is obtained by minimizing a cost function consisting of the weighted sum of switching costs and blocking costs, subject to a bit-error-rate or outage-probability constraint. A backward induction algorithm is applied to derive the optimal policy. The performance of the optimal policy and that of a direct threshold policy are compared.
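
Backward induction for a finite-horizon MDP is standard; the sketch below shows the generic recursion with placeholder transition matrices and costs (the paper's actual state space, switching and blocking costs, and BER constraint are not reproduced here).

```python
import numpy as np

def backward_induction(P, cost, T):
    """Finite-horizon backward induction.
    P[a]  : S x S transition matrix under action a
    cost  : S x A one-stage cost (e.g., weighted switching + blocking)
    T     : horizon length
    Returns the value-to-go V_0 and the optimal policy at each stage."""
    S, A = cost.shape
    V = np.zeros(S)                       # terminal value
    policy = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        Q = np.stack([cost[:, a] + P[a] @ V for a in range(A)], axis=1)
        policy[t] = Q.argmin(axis=1)      # minimize expected cost
        V = Q.min(axis=1)
    return V, policy

# Tiny illustration: 2 states, 2 actions, hypothetical numbers.
P = [np.array([[0.9, 0.1], [0.3, 0.7]]),
     np.array([[0.6, 0.4], [0.1, 0.9]])]
cost = np.array([[0.0, 1.0], [2.0, 1.5]])
V, pi = backward_induction(P, cost, T=10)
print(V, pi[0])
```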

7.
Cooperative relaying is emerging as an effective technology for fulfilling requirements on high-data-rate coverage in next-generation cellular networks, such as Long Term Evolution-Advanced (LTE-Advanced). In this paper, we propose a distributed joint relay node (RN) selection and power allocation scheme for multihop relaying cellular networks toward LTE-Advanced, taking both the wireless channel state and the RNs' residual energy into consideration. We formulate the multihop relaying cellular network as a restless bandit system. A first-order finite-state Markov chain is used to characterize the time-varying channel and residual-energy state transitions. With this stochastic optimization formulation, the optimal policy has the indexability property, which dramatically reduces the computational complexity. Simulation results demonstrate that the proposed scheme efficiently enhances the expected system reward compared with other existing algorithms.

8.
Optimality of Myopic Sensing in Multichannel Opportunistic Access
This paper considers opportunistic communication over multiple channels where the state (“good” or “bad”) of each channel evolves as independent and identically distributed (i.i.d.) Markov processes. A user, with limited channel sensing capability, chooses one channel to sense and decides whether to use the channel (based on the sensing result) in each time slot. A reward is obtained whenever the user senses and accesses a “good” channel. The objective is to design a channel selection policy that maximizes the expected total (discounted or average) reward accrued over a finite or infinite horizon. This problem can be cast as a partially observed Markov decision process (POMDP) or a restless multiarmed bandit process, to which optimal solutions are often intractable. This paper shows that a myopic policy that maximizes the immediate one-step reward is optimal when the state transitions are positively correlated over time. When the state transitions are negatively correlated, we show that the same policy is optimal when the number of channels is limited to two or three, while presenting a counterexample for the case of four channels. This result finds applications in opportunistic transmission scheduling in a fading environment, cognitive radio networks for spectrum overlay, and resource-constrained jamming and antijamming.
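
For two-state channels the myopic policy has a particularly simple form: track a belief (probability of being "good") per channel, sense the channel with the highest belief, and propagate all beliefs through the common Markov chain. A minimal sketch with hypothetical transition probabilities:

```python
def myopic_choice(beliefs):
    """Sense the channel most likely to be good: this maximizes the
    immediate one-step reward."""
    return max(range(len(beliefs)), key=beliefs.__getitem__)

def step(beliefs, p11, p01, choice, observed_good):
    """One slot of belief updating for i.i.d. two-state Markov channels.
    Sensing reveals the chosen channel's state; every belief is then
    propagated one step through the common chain, where
    p11 = P(good -> good) and p01 = P(bad -> good)."""
    post = list(beliefs)
    post[choice] = 1.0 if observed_good else 0.0
    return [w * p11 + (1.0 - w) * p01 for w in post]

# Positively correlated channels (p11 > p01), where myopic is optimal.
beliefs = [0.5, 0.5, 0.5]
c = myopic_choice(beliefs)
beliefs = step(beliefs, p11=0.8, p01=0.2, choice=c, observed_good=True)
print(c, beliefs)
```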

9.

Distributed real-time database systems are increasingly expected to manage large volumes of dispersed data. To make distributed real-time data processing a reality and stay competitive, well-defined protocols and algorithms are required to access and manipulate the data. Admission control is a major task in accessing real-time data, and it has become challenging due to the random arrival of user requests and transaction timing constraints. This paper proposes an optimal admission control policy based on a deep reinforcement learning algorithm and a memetic algorithm, which can efficiently handle the load-balancing problem without affecting Quality of Service (QoS) parameters. A Markov decision process (MDP) is formulated for the admission control problem, which provides an optimized solution for dynamic resource sharing. Solutions to the MDP are obtained using reinforcement learning and linear programming with an average reward criterion. The deep reinforcement learning algorithm filters the requests arriving from different users and admits only those needed, which increases the number of sessions the system can support. We then frame the load-balancing problem as a dynamic and stochastic assignment problem and obtain optimal control policies using a memetic algorithm; the admission control problem is mapped onto the memetic encoding so that each session corresponds to an element of the initial chromosome. The performance of the proposed optimal admission control policy is compared with other approaches through simulation, and the results show that the proposed system outperforms the other techniques in terms of throughput, execution time, and miss ratio, leading to better QoS.


10.
We address the issue of optimal coding-rate scheduling for adaptive type-I hybrid automatic repeat request wireless systems. In this scheme, the coding rate is varied depending on channel, buffer, and incoming-traffic conditions. In general, we consider a hidden Markov model for both the time-varying flat-fading channel and the bursty correlated incoming traffic. It is shown that the appropriate framework for computing the optimal coding-rate allocation policies is the partially observable Markov decision process (POMDP). In this framework, the optimal coding-rate allocation policy maximizes the reward function, a weighted sum of throughput and buffer occupancy with appropriate sign. Since a polynomial amount of space is needed to calculate the optimal policy even for a simple POMDP problem, we investigate maximum-likelihood, voting, and Q-MDP heuristic approaches for efficient, real-time solutions. Our results show that all three heuristics perform close to the completely observable system-state case when the fading and/or traffic-state mixing rate is slow. On the other hand, when the channel fading is fast, the Q-MDP heuristic is the most throughput-efficient of the considered heuristics, and its performance is close to the optimal coding-rate allocation policy for the fully observable case. We also explore the performance of the proposed heuristics with bursty correlated traffic and show that the maximum-likelihood and voting heuristics consistently outperform the non-adaptive case.
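
Of the three heuristics, Q-MDP is the simplest to state: compute the optimal Q-function of the underlying fully observable MDP and act greedily with respect to the belief-averaged Q-values. A minimal sketch with hypothetical numbers:

```python
import numpy as np

def qmdp_action(belief, Q):
    """Q-MDP heuristic for a POMDP: act as if all state uncertainty
    vanishes after one step, picking argmax_a sum_s b(s) * Q*(s, a),
    where Q* is the optimal Q-function of the underlying MDP."""
    return int(np.argmax(belief @ Q))

# Hypothetical Q* for a 3-state, 2-action MDP (e.g., two coding rates
# under three fading states).
Q = np.array([[5.0, 2.0],
              [3.0, 3.5],
              [0.5, 4.0]])
belief = np.array([0.2, 0.3, 0.5])   # channel-state belief from the HMM
print(qmdp_action(belief, Q))        # -> 1 (higher expected Q-value)
```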

11.
A generalization of the block replacement (BR) policy is proposed and analyzed for a system subject to shocks. Under such a policy, an operating system is preventively replaced by a new one at times i·T (i = 1, 2, 3, ...), independently of its failure history. If the system fails in (a) ((i-1)·T, (i-1)·T + T0), it is either replaced by a new one or minimally repaired; if it fails in (b) ((i-1)·T + T0, i·T), it is either minimally repaired or remains inactive until the next planned replacement. The choice between these two actions is based on a mechanism (modeled as random) that depends on the number of shocks since the latest replacement. The average cost rate is obtained using results from renewal reward theory. The model with two variables is transformed into a model with one variable, and the optimum policy is discussed. Various special cases are considered. The results extend many of the well-known results for BR policies.
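
The renewal-reward computation behind such models divides the expected cost of one replacement cycle by its expected length. The Monte Carlo sketch below uses a simplified variant (minimal repair at every failure, Weibull-type cumulative hazard) rather than the paper's two-phase shock mechanism; all parameters are illustrative.

```python
import numpy as np

def average_cost_rate(T, c_block, c_repair, theta, beta,
                      n_cycles=200_000, seed=1):
    """Renewal-reward estimate of the long-run average cost per unit
    time for a simplified block replacement policy: planned replacement
    every T at cost c_block, minimal repair at cost c_repair for each
    failure in between. Under minimal repair the failure count over
    (0, T] is Poisson with mean equal to the cumulative hazard;
    a Weibull hazard H(T) = (T / theta)**beta is assumed here."""
    rng = np.random.default_rng(seed)
    failures = rng.poisson((T / theta) ** beta, size=n_cycles)
    cycle_cost = c_block + c_repair * failures
    return cycle_cost.mean() / T      # E[cycle cost] / E[cycle length]

# Sweep the replacement interval T; an increasing failure rate
# (beta > 1) gives an interior optimum.
for T in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(T, round(average_cost_rate(T, c_block=5.0, c_repair=1.0,
                                     theta=1.0, beta=2.0), 3))
```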

12.
Living organisms learn by acting on their environment, observing the resulting reward stimulus, and adjusting their actions accordingly to improve the reward. This action-based learning, known as Reinforcement Learning, can capture notions of optimal behavior occurring in natural systems. We describe mathematical formulations for Reinforcement Learning and a practical implementation method known as Adaptive Dynamic Programming. These give us insight into the design of controllers for man-made engineered systems that both learn and exhibit optimal behavior.

13.
Fourth-generation wireless communication systems aim to provide users with the convenience of seamless roaming among heterogeneous wireless access networks. To achieve this goal, support for vertical handoff is important in mobility management. This paper focuses on the vertical handoff decision algorithm, which determines the criteria under which vertical handoff should be performed. The problem is formulated as a constrained Markov decision process. The objective is to maximize the expected total reward of a connection subject to an expected total access-cost constraint. In our model, a benefit function is used to assess the quality of the connection, and a penalty function is used to model the signaling incurred and call dropping. The user's velocity and location information are also considered when making handoff decisions. The policy iteration and Q-learning algorithms are employed to determine the optimal policy. Structural results on the optimal vertical handoff policy are derived using the concept of supermodularity. We show that the optimal policy is a threshold policy in bandwidth, delay, and velocity. Numerical results show that our proposed vertical handoff decision algorithm outperforms other decision schemes under a wide range of conditions, including variations in connection duration, user velocity, user budget, traffic type, signaling cost, and monetary access cost.

14.
Due to the uncertainty of connections in delay tolerant networks, the source may need help from other nodes, making these nodes serve as relays to forward messages to the destination. To further improve performance, the source may also make some nodes serve as agents, which help the source recruit other nodes as relays. However, because of their selfish nature, nodes may not be willing to help the source without reward, so the source has to pay a certain reward to the nodes that provide help. Furthermore, such fees may vary with time: for example, if the nodes sense that the source is eager to transmit the message to the destination, they may ask for a larger reward. In addition, the reward that the source obtains from the destination may also vary with time; for example, the sooner the destination gets the message, the greater the reward may be. In such a complex setting, it may not be good for the source to request help all the time. This paper proposes a unifying theoretical framework based on ordinary differential equations to evaluate the total reward that the source can obtain. Based on this framework, we study the optimal control problem via Pontryagin's Maximum Principle and prove that in some cases the optimal policy conforms to a threshold form. Simulations based on both synthetic and real motion traces show the accuracy of the framework. Furthermore, extensive numerical results demonstrate that the performance of the optimal policy is the best.

15.
A geometric-process repair model with good-as-new preventive repair
This paper studies a deteriorating simple repairable system. To improve the availability or economize on the operating costs of the system, preventive repair is adopted before the system fails. We assume that preventive repair leaves the system as good as new, while failure repair does not, so that the successive working times form a stochastically decreasing geometric process while the consecutive failure repair times form a stochastically increasing geometric process. Under these and other assumptions, using the geometric process we consider a replacement policy N based on the failure number of the system. Our problem is to determine an optimal replacement policy N such that the average cost rate (i.e., the long-run average cost per unit time) is minimized. The explicit expression for the average cost rate is derived, and the corresponding optimal replacement policy can be determined analytically or numerically. The fixed-length interval time of preventive repair in the system is also discussed. Finally, an appropriate numerical example is given, from which it is seen that both optimal policies N** and N* are unique; moreover, the optimal policy N** with preventive repair is better than the optimal policy N* without preventive repair.
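
Under the geometric-process assumptions the average cost rate C(N) has a closed form via renewal reward theory, and N* can be found by direct search. The sketch below uses a simplified cost structure (repair cost rate, operating reward, fixed replacement cost) with illustrative parameters; the preventive-repair component of the paper's model is omitted.

```python
def average_cost_rate(N, lam, a, mu, b, c_repair, c_replace, r_work):
    """Long-run average cost rate for replacement policy N in a
    geometric-process repair model (renewal reward): one cycle consists
    of N working periods, N - 1 failure repairs, then a replacement.
    E[X_k] = lam / a**(k-1)  (a > 1: working times decrease)
    E[Y_k] = mu * b**(k-1)   (b > 1: repair times increase)"""
    work = sum(lam / a ** (k - 1) for k in range(1, N + 1))
    repair = sum(mu * b ** (k - 1) for k in range(1, N))
    cycle_cost = c_repair * repair - r_work * work + c_replace
    return cycle_cost / (work + repair)

# Direct search for the optimal replacement policy N*.
costs = {N: average_cost_rate(N, lam=40, a=1.1, mu=5, b=1.2,
                              c_repair=25.0, c_replace=300.0, r_work=3.0)
         for N in range(1, 40)}
N_star = min(costs, key=costs.get)
print(N_star, round(costs[N_star], 3))   # a unique minimum, here N* = 3
```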

16.
In this paper, for packet transmission over a flat-fading channel in a single-input single-output system, we consider the power control problem in a cross-layer design where adaptive modulation is adopted at the physical layer to improve spectral efficiency and the queues at the data link layer are modeled as having finite length. The goal is to identify the optimal queuing-aware power allocation algorithm that minimizes the overall system packet error rate under a long-term transmit power constraint. One crucial step, which we call the 'inner' problem, is to find the optimal power vector for a given target packet error rate at the physical layer. Rather than attack the multi-dimensional optimization problem directly using conventional methods, we first observe that the 'inner' problem is closely related to an average-reward Markov decision process problem, and relax the former to the latter so as to take advantage of its equivalence with a linear program, which allows an efficient solution. Since the randomness in the associated Markov decision process is at most mild, we propose an approximately deterministic policy as a suboptimal solution to the 'inner' problem with insignificant performance degradation. We also propose two-parameter power allocation functions that achieve suboptimal results with low complexity. The impact of system parameters on overall system performance is also evaluated. The accuracy of the numerical results is verified by Monte Carlo simulations.
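
The linear program behind the relaxed 'inner' problem is the standard LP over stationary state-action frequencies for an average-reward MDP. A generic sketch (not the paper's specific cross-layer formulation), using scipy:

```python
import numpy as np
from scipy.optimize import linprog

def solve_avg_reward_mdp(P, r):
    """Solve an average-reward MDP via the standard LP over stationary
    state-action frequencies x(s, a):
        max  sum_{s,a} r(s,a) x(s,a)
        s.t. sum_a x(s',a) = sum_{s,a} P(s'|s,a) x(s,a)  for all s',
             sum x = 1,  x >= 0.
    P[a] is the S x S transition matrix of action a; r is S x A."""
    S, A = r.shape
    n = S * A
    c = -r.reshape(n)                    # linprog minimizes
    A_eq = np.zeros((S + 1, n))
    for s_next in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s_next, s * A + a] = P[a][s, s_next]
        for a in range(A):
            A_eq[s_next, s_next * A + a] -= 1.0
    A_eq[S, :] = 1.0                     # normalization row
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.x.reshape(S, A), -res.fun

# Toy 2-state, 2-action MDP with hypothetical numbers.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.1, 0.9], [0.7, 0.3]])]
r = np.array([[1.0, 0.0], [0.0, 2.0]])
x, gain = solve_avg_reward_mdp(P, r)
print(np.round(x, 3), round(gain, 3))
```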

17.
In this paper, a deteriorating simple repairable system with three states, including two failure states and one working state, is studied. We assume that the system after repair cannot be "as good as new" and that the deterioration of the system is stochastic. Under these assumptions, we use a replacement policy N based on the failure number of the system. Our aim is then to determine an optimal replacement policy N* such that the average cost rate (i.e., the long-run average cost per unit time) is minimized. An explicit expression for the average cost rate is derived, and an optimal replacement policy is determined analytically or numerically. Furthermore, we find that the repair model for the three-state repairable system in this paper forms a general monotone process model. Finally, we present a numerical example and carry out some discussion and sensitivity analysis of the model.

18.
We address the issue of optimal energy allocation and admission control for communications satellites in Earth orbit. Such satellites receive requests for transmission as they orbit the Earth but may not be able to serve them all, due to energy limitations. The objective is to choose which requests to serve so that the expected total reward is maximized. The special case of a single energy-constrained satellite is considered. Rewards and demands from users for transmission (energy) are random and known only at request time. Using a dynamic programming approach, an optimal policy is derived and characterized in terms of thresholds. Furthermore, in the special case where demand for energy is unlimited, an optimal policy is obtained in closed form. Although motivated by satellite communications, our approach is general and can be used to solve a variety of resource allocation problems in wireless communications.
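
The threshold structure arises from a backward dynamic program: with k requests remaining and energy e, a request of reward r is served iff r exceeds the marginal value of energy, V_{k-1}(e) - V_{k-1}(e-1). The sketch below assumes unit energy demands and a discrete reward distribution for brevity; both are simplifications of the paper's model.

```python
import numpy as np

def value_and_thresholds(E_max, K, reward_values, probs):
    """Backward DP for admitting transmission requests on a single
    energy-constrained satellite. Recursion:
    V_k(e) = V_{k-1}(e) + E_r[max(r - (V_{k-1}(e) - V_{k-1}(e-1)), 0)],
    i.e., serve iff the reward beats the marginal value of energy."""
    V = np.zeros(E_max + 1)               # V_0 = 0: no requests left
    for _ in range(K):
        newV = V.copy()
        for e in range(1, E_max + 1):
            marginal = V[e] - V[e - 1]
            gain = sum(p * max(r - marginal, 0.0)
                       for r, p in zip(reward_values, probs))
            newV[e] = V[e] + gain
        V = newV
    # Marginal values with K requests to go: decreasing in e, so the
    # serve-iff-reward-exceeds-threshold rule gets laxer as energy grows.
    return V, V[1:] - V[:-1]

V, thr = value_and_thresholds(E_max=5, K=10,
                              reward_values=[1.0, 3.0], probs=[0.7, 0.3])
print(np.round(V, 2), np.round(thr, 2))
```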

19.
Reinforcement learning is an online learning method in which an agent, by interacting with its environment, seeks an optimal policy by maximizing cumulative reward. In a non-stationary environment, the MDP model at a given moment changes after interacting with the agent, so traditional reinforcement learning methods based on a stationary MDP model cannot solve the optimal-policy problem in non-stationary environments. To address policy solving in non-stationary environments, this paper models the non-stationary environment with a distribution over MDPs and proposes a formula-set-based policy search algorithm, FSPS. During learning, FSPS collects historical sample information and extracts feature information from it, uses these features to construct different formulas for action selection, and applies policy search to find the optimal formula. On this basis, an optimality bound for the resulting policy is given, and it is proved theoretically that the optimality of a policy transferred to a new MDP distribution depends mainly on the distance between the MDP distributions and on the performance of the policy in the original MDP distribution. Finally, FSPS is applied to the classical Markov Chain problem; experimental results show that the resulting policies perform well.

20.
In this paper, a maintenance model for a two-unit redundant system with one repairman is studied. Initially, unit 1 is operating and unit 2 is the standby unit. The costs include the operating reward, repair cost, and replacement cost; in addition, a penalty cost is incurred if the system breaks down. Two kinds of replacement policy, based on the number of failures of the two units and on the working age, respectively, are used. The long-run average cost per unit time for each kind of replacement policy is derived. A particular model, in which the system is deteriorating, the two units are identical, and the penalty cost rate is high, is also studied thoroughly.
