Similar Documents
20 similar documents found.
1.
Interest in inverse reinforcement learning (IRL) has recently increased, that is, interest in the problem of recovering the reward function underlying a Markov decision process (MDP) given the dynamics of the system and the behavior of an expert. This paper deals with an incremental approach to online IRL. First, the convergence of the incremental method for the IRL problem is investigated, and bounds on both the number of mistakes made during learning and the regret are established with a detailed proof. An online algorithm based on incremental error correction is then derived for the IRL problem. The key idea is to add an increment to the current reward estimate each time an action mismatch occurs, so that the estimate approaches a target optimal value. The proposed method was tested in a driving simulation experiment and was able to efficiently recover an adequate reward function.
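The abstract does not give the update rule in full; the following is a minimal sketch of the incremental error-correcting idea under an assumed linear reward model r(s, a) = w·φ(s, a). The perceptron-style form of the update and all names (phi, lr) are illustrative assumptions, not the paper's exact algorithm.

```python
def incremental_irl_update(w, phi, state, expert_action, greedy_action, lr=0.1):
    """Perceptron-style reward correction on an action mismatch (assumed form).

    w             -- current reward weight vector, reward(s, a) = w @ phi(s, a)
    phi           -- feature map phi(state, action) -> vector
    expert_action -- action the expert actually took in `state`
    greedy_action -- action currently preferred under the estimated reward
    """
    if expert_action != greedy_action:
        # Nudge the estimate so the expert's action scores higher next time.
        w = w + lr * (phi(state, expert_action) - phi(state, greedy_action))
    return w
```

Each mismatch moves the estimate toward rewarding the expert's choice, which is in the spirit of the mistake-bound analysis mentioned in the abstract.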

2.
This paper deals with a new approach based on Q-learning for solving the problem of mobile robot path planning in complex unknown static environments. As a computational approach to learning through interaction with the environment, reinforcement learning algorithms have been widely used for intelligent robot control, especially in the field of autonomous mobile robots. However, the learning process is slow and cumbersome, and practical applications require rapid convergence. To address the slow convergence and long learning time of Q-learning-based mobile robot path planning, a state-chain sequential feedback Q-learning algorithm is proposed for quickly searching for the optimal path of mobile robots in complex unknown static environments. The state chain is built during the search process. After an action is chosen and the reward is received, the Q-values of the state-action pairs on the previously built state chain are sequentially updated with one-step Q-learning. As the number of Q-values updated after each action grows, the number of actual steps needed for convergence decreases (a step being a state transition), and the learning time decreases accordingly. Extensive simulations validate the efficiency of the proposed approach for mobile robot path planning in complex environments. The results show that the new approach converges quickly and that the robot finds the collision-free optimal path in complex unknown static environments in much less time than the one-step Q-learning algorithm and the Q(λ)-learning algorithm.
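A minimal sketch of the state-chain idea described above: after each transition, one-step Q-learning backups are replayed sequentially along the recorded chain of visited state-action pairs. The tabular representation, the backward sweep order, and the parameter names are illustrative assumptions; the paper's exact bookkeeping may differ.

```python
def state_chain_q_update(Q, chain, actions, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One environment step of a state-chain sequential feedback Q-learning sketch.

    Q     -- dict mapping (state, action) -> Q-value
    chain -- list of (state, action, reward, next_state) visited so far
    """
    chain.append((s, a, r, s_next))
    # Sweep backwards along the chain so fresh reward information
    # propagates to earlier state-action pairs in a single pass.
    for (cs, ca, cr, cns) in reversed(chain):
        best_next = max(Q.get((cns, b), 0.0) for b in actions)
        old = Q.get((cs, ca), 0.0)
        Q[(cs, ca)] = old + alpha * (cr + gamma * best_next - old)
    return Q
```

Updating many Q-values per action is what shortens the number of environment steps needed for convergence.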

3.
The path planning of autonomous mobile robots (PPoAMR) is a very complex multi-constraint problem. The main goal is to find the shortest collision-free path from the starting point to the target point. The PPoAMR problem comes with the prior knowledge that, when obstacles are ignored, the straight path between the starting point and the target point is the optimal solution. This paper proposes a new path planning algorithm based on this prior knowledge, which includes a fitness value calculation method and the prior knowledge particle swarm optimization (PKPSO) algorithm. The new fitness calculation method preserves the information carried by each individual as much as possible by adding an adaptive coefficient. The PKPSO algorithm modifies the particle velocity update by adding a prior particle calculated from the prior knowledge of PPoAMR, and also implements an elite retention strategy, which improves the ability to escape local optima. In addition, a quintic polynomial trajectory optimization approach is devised to generate a smooth path. Finally, experimental comparisons with state-of-the-art methods are carried out to demonstrate the effectiveness of the proposed path planning algorithm.
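The paper's exact velocity-update formula is not reproduced in the abstract; the following is one plausible sketch of adding a prior-knowledge attractor to the standard PSO velocity update, with a prior particle built from the straight segment toward the goal. The extra coefficient c3 and all names are assumptions for illustration only.

```python
import numpy as np

def pkpso_velocity(v, x, pbest, gbest, prior, w=0.7, c1=1.5, c2=1.5, c3=0.8):
    """Standard PSO velocity update plus an assumed prior-knowledge term.

    prior -- particle built from prior knowledge (e.g., waypoints sampled
             on the straight segment from start to goal).
    """
    r1, r2, r3 = np.random.rand(3)
    return (w * v
            + c1 * r1 * (pbest - x)       # cognitive term
            + c2 * r2 * (gbest - x)       # social term
            + c3 * r3 * (prior - x))      # prior-knowledge term (assumed form)
```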

4.
In this paper we study optimal control problems in which the control variable appears linearly. A novel method is presented for optimizing the switching times of controls containing both bang-bang and singular arcs. This method transforms the control problem into a finite-dimensional optimization problem by reformulating it as a multi-stage optimization problem: the optimal control problem is partitioned into several stages, each corresponding to a particular control arc. A control vector parameterization approach is applied to convert the control problem into a static nonlinear programming (NLP) problem, with the control profiles and stage lengths acting as decision variables. Based on the Pontryagin maximum principle, a multi-stage adjoint system is constructed to calculate the gradients required by the NLP solvers. Two examples are studied to demonstrate the effectiveness of this strategy.
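For concreteness, a generic multi-stage formulation of the kind described above can be written as follows; the stage dynamics f_i and the exact choice of decision variables here are illustrative, not the paper's precise statement.

```latex
\min_{\{u_i(\cdot)\},\,\{\tau_i\}} \; \Phi\bigl(x(\tau_N)\bigr)
\quad \text{s.t.} \quad
\dot{x}(t) = f_i\bigl(x(t), u_i(t)\bigr), \quad t \in [\tau_{i-1}, \tau_i],
\quad i = 1, \dots, N,
```

with 0 = τ₀ ≤ τ₁ ≤ … ≤ τ_N = t_f, where each stage i corresponds to one bang-bang or singular arc; after control vector parameterization, the decision variables are the finitely many parameters of each u_i together with the stage lengths τ_i − τ_{i−1}.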

5.
This paper addresses an unmanned aerial vehicle (UAV) path planning problem for a team of cooperating heterogeneous vehicles composed of one UAV and multiple unmanned ground vehicles (UGVs). The UGVs are used as mobile actuators and are scattered over a large area. To achieve multi-UGV communication and collaboration, the UAV serves as a messenger that flies over all task points to collect the task information and then flies to all UGVs to transmit the information about the tasks and the other UGVs. The path planning of the messenger UAV is formulated as a precedence-constrained dynamic Dubins traveling salesman problem with neighborhood (PDDTSPN). The goal is to find the shortest route that enables the UAV to fly over all task points and deliver information to all requesting UGVs. When solving this path planning problem, a decoupling strategy is proposed to sequentially and rapidly determine both the access sequence in which the UAV visits task points and UGVs and the access location of the UAV in the communication neighborhood of each task point and each UGV. The effectiveness of the proposed approach is corroborated through computational experiments on randomly generated instances. The computational results on both small and large instances demonstrate that the proposed approach generates high-quality solutions in reasonable time compared with two other heuristic algorithms.

6.
Online optimization has received considerable attention in the past two decades, mostly inspired by its potential applications to auctions, smart grids, portfolio management, dictionary learning, neural networks, and so on. Generally, online optimization is a sequence of decision-making processes in which a sequence of time-varying loss functions is gradually revealed in a dynamic environment that may be adversarial. At each time instant, the loss function at the current time is revealed to the decision maker only after her/his decision is made. The objective of online optimization is to choose the best possible decision at each time step, but this goal is generally difficult or impossible to achieve. As such, two metrics are usually exploited to measure the performance of an algorithm, i.e., regret and competitive ratio, of which the former is used more frequently in the literature. Moreover, two kinds of regret, i.e., static and dynamic regret, are usually considered: static regret compares the cumulative loss with that of the single best fixed decision over the whole time horizon, while dynamic regret compares it with the best decision at each time instant. More recently, another regret, called adaptive regret, has been proposed and investigated as a metric suited to changing environments, as dynamic regret is. Historically, centralized online optimization was addressed first; that is, there is a centralized decision maker who can access all the information on the revealed loss function at each time. Along this line, a wide range of results have thus far been reported in the literature. For example, online optimization was studied subject to feasible set constraints, where it has been shown that the optimal bound for static regret is O(√T)....
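In the standard notation (x_t the decision at time t, f_t the revealed loss, 𝒳 the feasible set), the two regret notions mentioned above are written out as:

```latex
\text{static regret:} \quad
R_s(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x),
\qquad
\text{dynamic regret:} \quad
R_d(T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} \min_{x \in \mathcal{X}} f_t(x).
```

Dynamic regret compares against a stronger, time-varying benchmark, so it is never smaller than static regret.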

7.
The aim of software testing is to find faults in a program under test, so generating test data that can expose the faults of a program is very important. To date, studies on generating test data for path coverage have not performed well in detecting low-probability faults on the covered path. The automatic generation of test data for both path coverage and fault detection using genetic algorithms is the focus of this study. To this end, the problem is first formulated as a bi-objective optimization problem with one constraint, whose objectives are the number of faults detected in the traversed path and the risk level of these faults, and whose constraint is that the traversed path must be the target path. An evolutionary algorithm is employed to solve the formulated model, and several fault detection methods are given. Finally, the proposed method is applied to several real-world programs and compared with a random method and an evolutionary optimization method in three respects: the number of generations and the time needed to generate the desired test data, and the success rate of detecting faults. The experimental results confirm that the proposed method can effectively generate test data that not only traverse the target path but also detect faults lying in it.
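A minimal sketch of how the bi-objective constrained formulation above could be evaluated for one candidate test input. The instrumentation hook (run_instrumented) and the fault objects with a risk attribute are hypothetical placeholders; the paper's concrete fault-detection methods are not reproduced here.

```python
def evaluate(test_input, target_path, run_instrumented):
    """Return (objectives, feasible) for a candidate test input.

    run_instrumented -- hypothetical hook executing the program under test,
    returning the traversed path and the faults it exposed (each fault
    carrying a risk level).
    """
    traversed_path, faults = run_instrumented(test_input)
    feasible = (traversed_path == target_path)      # the single constraint
    num_faults = len(faults)                        # objective 1 (maximize)
    total_risk = sum(f.risk for f in faults)        # objective 2 (maximize)
    return (num_faults, total_risk), feasible
```

A genetic algorithm would then select among feasible individuals on the two objectives.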

8.
In this paper, a Newton-conjugate gradient (CG) augmented Lagrangian method is proposed for solving path-constrained dynamic process optimization problems. The path constraints are reduced to a single final-time constraint by using a novel constraint aggregation function. A control vector parameterization (CVP) approach is then applied to convert the simplified dynamic optimization problem into a nonlinear programming (NLP) problem with inequality constraints. By constructing an augmented Lagrangian function, the inequality constraints are folded into the augmented objective function, yielding a box-constrained NLP problem, which is then solved by a line-search Newton-CG approach, also known as the truncated Newton (TN) approach. By constructing the Hamiltonian functions of the objective and constraint functions, two adjoint systems are generated to calculate the gradients needed by the NLP solver. Simulation examples demonstrate the effectiveness of the algorithm.
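The paper's novel aggregation function is not given in the abstract; for context only, a classic constraint aggregation function of the kind referred to is the Kreisselmeier–Steinhauser (KS) function, which replaces m constraints g_i ≤ 0 with a single smooth scalar constraint:

```latex
KS(\mathbf{g};\rho) = \frac{1}{\rho}\,\ln\!\left(\sum_{i=1}^{m} e^{\rho g_i}\right),
\qquad
\max_i g_i \;\le\; KS(\mathbf{g};\rho) \;\le\; \max_i g_i + \frac{\ln m}{\rho}.
```

Requiring KS ≤ 0 conservatively enforces every g_i ≤ 0, and the approximation tightens as ρ grows.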

9.
Conflict resolution (CR) plays a crucial role in safe air traffic management (ATM). In this paper, we propose a new hybrid distributed-centralized tactical CR approach based on cooperative co-evolution, named the CCDG (cooperative co-evolution with dynamic grouping) strategy, to overcome the drawbacks of the two current types of approaches, the fully centralized approach and the distributed approach. First, the aircraft are divided into several sub-groups based on their interdependence, and a dynamic grouping strategy is proposed to better handle the tight coupling among them: the sub-groups are adjusted dynamically as new conflicts appear after each iteration. Second, a fast genetic algorithm (GA) is used by each sub-group to optimize the paths of its aircraft simultaneously. Third, the aircraft's optimal paths are obtained through cooperation among the different sub-groups based on cooperative co-evolution (CC). An experimental study on two illustrative scenarios compares the CCDG method with existing approaches and shows that CCDG, which obtains the optimal solution effectively and efficiently in near real time, outperforms most of the existing approaches, including Stratway, the fast GA, a general evolutionary path planner, and three well-known cooperative co-evolution algorithms.
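A minimal sketch of the cooperative co-evolutionary loop with regrouping described above. The grouping rule, the GA step, and the evaluation interface are placeholders; the paper's dynamic regrouping criterion and fast GA are not reproduced.

```python
import copy

def ccdg(aircraft, regroup, ga_step, evaluate, iterations=100):
    """Cooperative co-evolution with dynamic (re)grouping, sketched.

    regroup  -- callable returning a partition of aircraft into sub-groups
                based on the conflicts in the current solution (placeholder).
    ga_step  -- callable evolving one sub-group's paths for a few generations
                with the rest of the joint solution frozen, returning a
                complete joint candidate (placeholder).
    evaluate -- callable scoring a complete joint solution (lower is better).
    """
    best = [a.initial_path() for a in aircraft]   # placeholder initializer
    for _ in range(iterations):
        groups = regroup(aircraft, best)          # dynamic grouping
        for group in groups:
            candidate = ga_step(group, copy.deepcopy(best))
            if evaluate(candidate) < evaluate(best):
                best = candidate                  # cooperative improvement
    return best
```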

10.
As a large amount of data is increasingly generated by edge devices, such as smart homes, mobile phones, and wearable devices, it becomes crucial for many applications to deploy machine learning models across edge devices. The execution speed of the deployed model is a key element in ensuring service quality. Considering a highly heterogeneous edge deployment scenario, deep learning compiling is a novel approach that aims to solve this problem. It defines models using certain DSLs and generates effi...

11.
In this paper we discuss policy iteration methods for the approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
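A minimal sketch of the simplest instance of the idea above, hard aggregation: states sharing the same feature vector are merged into one aggregate state, and the smaller aggregate MDP is solved by value iteration. The uniform weighting used to build the aggregate transition probabilities and costs is an illustrative assumption, not the paper's construction.

```python
import numpy as np

def solve_hard_aggregate_mdp(P, c, feature, gamma=0.95, iters=500):
    """P[a] is an (n, n) transition matrix, c[a] an n-vector of costs,
    feature(i) maps state i to a hashable feature vector."""
    n = P[0].shape[0]
    groups = {}
    for i in range(n):
        groups.setdefault(feature(i), []).append(i)
    agg = list(groups.values())                     # aggregate states
    m, A = len(agg), len(P)
    # Aggregate dynamics: average members uniformly (assumed weighting).
    Pa = np.zeros((A, m, m))
    ca = np.zeros((A, m))
    for a in range(A):
        for x, members in enumerate(agg):
            ca[a, x] = np.mean([c[a][i] for i in members])
            for y, others in enumerate(agg):
                Pa[a, x, y] = np.mean([P[a][i, others].sum() for i in members])
    # Value iteration on the (much smaller) aggregate problem.
    V = np.zeros(m)
    for _ in range(iters):
        V = np.min(ca + gamma * (Pa @ V), axis=0)
    return agg, V
```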

12.
In this article we show that there is a strong connection between decision tree learning and local pattern mining. This connection allows us to solve the computationally hard problem of finding optimal decision trees in a wide range of applications by post-processing a set of patterns: we use local patterns to construct a global model. We exploit the connection between constraints in pattern mining and constraints in decision tree induction to develop a framework for categorizing decision tree mining constraints. This framework allows us to determine which model constraints can be pushed deeply into the pattern mining process, and allows us to improve the state-of-the-art of optimal decision tree induction.

13.

In this paper, we propose a temporal difference (TD) learning method, called integral TD learning, that efficiently finds solutions to continuous-time (CT) linear quadratic regulation (LQR) problems in an online fashion where the system matrix A is unknown. The idea originates from a computational reinforcement learning method known as TD(0), which is the simplest TD method in a finite Markov decision process. For the proposed integral TD method, we mathematically analyze the positive definiteness of the updated value functions, monotone convergence conditions, and stability properties concerning the locations of the closed-loop poles in terms of the learning rate and the discount factor. The proposed method includes the existing value iteration method for CT LQR problems as a special case. Finally, numerical simulations are carried out to verify the effectiveness of the proposed method and further investigate the aforementioned mathematical properties.
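The abstract notes that the idea originates from TD(0). For reference, the standard TD(0) update in a finite MDP, which the integral TD method carries over to the continuous-time LQR setting, is

```latex
V(s_t) \;\leftarrow\; V(s_t) + \alpha \bigl[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \bigr],
```

where α is the learning rate and γ the discount factor, the two quantities whose roles in convergence and closed-loop pole locations the paper analyzes.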


14.
Kernel-based least squares policy iteration for reinforcement learning.
In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation step of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared with previous work on approximate RL methods, KLSPI makes two advances that eliminate the main difficulties of existing results. One is the better convergence and (near-)optimality guarantee obtained by using the KLSTD-Q algorithm for high-precision policy evaluation. The other is automatic feature selection via the ALD-based kernel sparsification. The KLSPI algorithm therefore provides a general RL method with generalization performance and a convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task, a stochastic chain problem, demonstrate that KLSPI consistently achieves better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems: a ship heading control problem and the swing-up control of a double-link underactuated pendulum called the acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.
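A minimal sketch of the approximate linear dependency (ALD) test used for kernel sparsification, as described in the literature the abstract builds on: a new sample is added to the kernel dictionary only if it cannot be well approximated by a linear combination of the existing dictionary elements in feature space. The threshold ν and the Gaussian kernel are illustrative choices.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def ald_test(dictionary, x, kernel=gaussian_kernel, nu=1e-3):
    """Return True if x is approximately linearly dependent on the
    current dictionary in feature space, i.e., it need not be added."""
    if not dictionary:
        return False
    K = np.array([[kernel(u, v) for v in dictionary] for u in dictionary])
    k_x = np.array([kernel(u, x) for u in dictionary])
    a = np.linalg.solve(K, k_x)            # least-squares coefficients
    delta = kernel(x, x) - k_x @ a         # approximation residual
    return delta <= nu                     # small residual -> dependent

# Building a sparse dictionary from a stream of samples:
# if not ald_test(dictionary, x): dictionary.append(x)
```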

15.
Robust motion control is fundamental to autonomous mobile robots. In the past few years, reinforcement learning (RL) has attracted considerable attention in the feedback control of wheeled mobile robots. However, it is still difficult for RL to solve problems with large or continuous state spaces, which are common in robotics. To improve the generalization ability of RL, this paper presents a novel hierarchical RL approach for optimal path tracking of wheeled mobile robots. In the proposed approach, a graph Laplacian-based hierarchical approximate policy iteration (GHAPI) algorithm is developed, in which the basis functions are constructed automatically using the graph Laplacian operator. In GHAPI, the state space of a Markov decision process is divided into several subspaces and approximate policy iteration is carried out on each subspace. A near-optimal path-tracking control strategy can then be obtained by GHAPI combined with proportional-derivative (PD) control. The performance of the proposed approach is evaluated on a P3-AT wheeled mobile robot. It is demonstrated that the GHAPI-based PD control obtains better near-optimal control policies than previous approaches.
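A minimal sketch of constructing basis functions from the graph Laplacian, in the spirit of the automatic basis construction mentioned above: the smoothest eigenvectors of the Laplacian of a state-adjacency graph serve as features for value-function approximation. How the adjacency matrix W is built (e.g., from k-nearest neighbors over sampled states) is an assumption left to the caller.

```python
import numpy as np

def laplacian_basis(W, num_basis=10):
    """Basis functions from a state-adjacency graph.

    W -- (n, n) symmetric adjacency/weight matrix over sampled states.
    Returns the num_basis eigenvectors of the combinatorial Laplacian
    L = D - W with the smallest eigenvalues (the smoothest functions
    on the graph); each column is one basis function, one row per state.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, :num_basis]
```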

16.
To speed up the automatic generation of task hierarchies in hierarchical reinforcement learning, a parallel automatic hierarchization method based on a multi-agent system is proposed. The method takes the Option hierarchical reinforcement learning framework proposed by Sutton as its theoretical basis. First, multiple agents cooperatively explore the state space in parallel, and centralized clustering of the results produces state subspaces; the agents then learn the internal policies of the subspaces in parallel, finally generating the Options. An algorithm is presented for the task of shortest-path planning between two points in a two-dimensional grid space with obstacles, and simulation experiments and analysis are carried out. The results show that the parallel automatic hierarchization method generates the task hierarchy markedly faster than previous serial automatic methods. The approach is applicable to problem domains such as space exploration, path planning, and pursuit-evasion.

17.
In the real world, the automatic detection of liver disease is a challenging problem for medical practitioners. The intent of this work is to propose an intelligent hybrid approach for the diagnosis of hepatitis disease. The diagnosis is performed with a combination of k-means clustering and improved ensemble-driven learning. To avoid reliance on clinical experience and to reduce evaluation time, ensemble learning is deployed, which constructs a set of hypotheses using multiple learners to solve the liver disease problem. The performance of the proposed integrated hybrid system is compared in terms of accuracy, true positive rate, precision, f-measure, kappa statistic, mean absolute error, and root mean squared error. Simulation results showed that the enhanced k-means clustering and improved ensemble learning with enhanced adaptive boosting, bagged decision tree, and J48 decision tree-based intelligent hybrid approach achieved better prediction outcomes than other existing individual and integrated methods.
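A minimal sketch of a k-means-plus-ensemble hybrid of the kind described above, using scikit-learn. Appending the cluster assignment as an extra feature before the boosted and bagged tree ensembles is an illustrative interpretation; the paper's exact combination and its "improved" variants are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def fit_hybrid(X_train, y_train, n_clusters=2):
    # Stage 1: k-means clustering; the cluster id becomes an extra feature.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_train)
    X_aug = np.hstack([X_train, km.labels_.reshape(-1, 1)])
    # Stage 2: ensemble learners over the augmented data
    # (adaptive boosting and a bagged decision tree, as in the abstract).
    boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                 n_estimators=100, random_state=0)
    bagged = BaggingClassifier(DecisionTreeClassifier(),
                               n_estimators=100, random_state=0)
    boosted.fit(X_aug, y_train)
    bagged.fit(X_aug, y_train)
    return km, boosted, bagged
```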

18.
This paper presents several results on some cost-minimizing path problems in polygonal regions. For these types of problems, an approach often used to compute approximate optimal paths is to apply a discrete search algorithm to a graph Gε constructed from a discretization of the problem; this graph is guaranteed to contain an ε-good approximate optimal path, i.e., a path with cost within a (1 + ε) factor of that of an optimal path, between given source and destination points. Here, ε > 0 is the user-defined error tolerance ratio. We introduce a class of piecewise pseudo-Euclidean optimal path problems that includes several previously studied non-Euclidean optimal path problems, and show that the BUSHWHACK algorithm, originally designed for the weighted-region optimal path problem, can be generalized to solve any optimal path problem of this class. We also introduce an empirical method, the adaptive discretization method, that improves the performance of the approximation algorithms by placing discretization points densely only in areas that may contain optimal paths. It proceeds in multiple iterations, and in each iteration it varies the approximation parameters and fine-tunes the discretization.

19.
In adversarial classification, the interaction between classifiers and adversaries can be modeled as a game between two players. It is natural to model this interaction as a dynamic game of incomplete information, since the classifier does not know the exact intentions of the different types of adversaries (senders). For these games, equilibrium strategies can be approximated and used as input for classification models. In this paper we show how to model such interactions between players, as well as give directions on how to approximate their mixed strategies. We propose perceptron-like machine learning approximations as well as novel Adversary-Aware Online Support Vector Machines. Results in a real-world adversarial environment show that our approach is competitive with benchmark online learning algorithms, and provides important insights into the complex relations among players.

20.
Most methods that generate decision trees for a specific problem use the data instances themselves in the tree-generation process. This article proposes a method called RBDT-1 (rule-based decision tree) for learning a decision tree from a set of decision rules that cover the data instances, rather than from the data instances themselves. The goal is to create, on demand, a short and accurate decision tree from a stable or dynamically changing set of rules. The rules could be generated by an expert, by an inductive rule learning program that induces decision rules from examples of decision instances (such as AQ-type rule induction programs), or extracted from a tree generated by another method, such as ID3 or C4.5. In terms of tree complexity (number of nodes and leaves in the decision tree), RBDT-1 compares favorably with AQDT-1 and AQDT-2, two methods that create decision trees from rules. RBDT-1 also compares favorably with ID3 and is as effective as C4.5, both well-known methods that generate decision trees from data examples. Experiments show that the classification accuracies of the decision trees produced by all methods under comparison are indistinguishable.
