Similar Documents
20 similar documents found (search time: 437 ms)
1.
We consider the total weighted completion time scheduling problem for parallel identical machines and precedence constraints, P|prec|∑ w_i C_i. This important and broad class of problems is known to be NP-hard, even for restricted special cases, and the best known approximation algorithms have worst-case performance that is far from optimal. However, little is known about the experimental behavior of algorithms for the general problem. This paper represents the first attempt to describe and evaluate comprehensively a range of weighted completion time scheduling algorithms. We first describe a family of combinatorial scheduling algorithms that optimally solve the single-machine problem, and show that they can be used to achieve good performance for the multiple-machine problem. These algorithms are efficient and find schedules that are on average within 1.5% of optimal over a large synthetic benchmark consisting of trees, chains, and instances with no precedence constraints. We then present several ways to create feasible schedules from nonintegral solutions to a new linear programming relaxation for the multiple-machine problem. The best of these linear programming-based approaches finds schedules that are within 0.2% of optimal over our benchmark. Finally, we describe how the scheduling phase in profile-based program compilation can be expressed as a weighted completion time scheduling problem and apply our algorithms to a set of instances extracted from the SPECint95 compiler benchmark. For these instances with arbitrary precedence constraints, the best linear programming-based approach finds optimal solutions in 78% of cases. Our results demonstrate that careful experimentation can help lead the way to high quality algorithms, even for difficult optimization problems. Received October 30, 1998; revised March 28, 2001.
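The combinatorial single-machine algorithms referenced above build on Smith's ratio rule (WSPT), which is provably optimal for 1||∑ w_i C_i when there are no precedence constraints. A minimal sketch of that precedence-free base case (the function name and job encoding are illustrative, not from the paper):

```python
def wspt_schedule(jobs):
    """Smith's rule: order jobs by non-increasing weight/time ratio.

    jobs is a list of (weight, processing_time) pairs; returns the
    schedule order and the total weighted completion time sum(w_i * C_i),
    which this ordering minimizes for a single machine without precedence.
    """
    order = sorted(jobs, key=lambda j: j[0] / j[1], reverse=True)
    t, total = 0, 0
    for w, p in order:
        t += p          # completion time C_i of this job
        total += w * t  # accumulate w_i * C_i
    return order, total
```

Sorting by w/p is the ingredient that the paper's combinatorial algorithms extend to handle precedence constraints and multiple machines.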

2.
We present a linear algebra framework for structured matrices and general optimization problems. The matrices and matrix operations are defined recursively to efficiently capture complex structures and enable advanced compiler optimization. In addition to common dense and sparse matrix types, we define mixed matrices, which allow every element to be of a different type. Using mixed matrices, the low- and high-level structure of complex optimization problems can be encoded in a single type. This type is then analyzed at compile time by a recursive linear solver that picks the optimal algorithm for the given problem. For common computer vision problems, our system yields a speedup of 3–5× compared to other optimization frameworks. The BLAS performance is benchmarked against the MKL library. We achieve a significant speedup in block-SPMV and block-SPMM. This work is implemented and released open-source as a header-only extension to the C++ math library Eigen.

3.
Cooperative Global Instruction Scheduling and Register Allocation
Instruction-level parallelism is a key feature of modern high-performance processors, and the compiler plays a crucial role in exploiting the parallel processing capability such processors offer. This paper discusses the core problems of compilation for instruction-level parallelism: global instruction scheduling and register allocation. Against the background of a compiler system the authors built for a new microprocessor with an explicitly parallel architecture, it describes the phase-ordering problem between instruction scheduling and register allocation that arises in designing the back end of such a compiler, and presents a cooperative global instruction scheduling and register allocation method proposed to solve this problem.

4.
The goal of the less is more approach (LIMA) for solving optimization problems, recently proposed in Mladenović et al. (2016), is to find the minimum number of search ingredients that make a heuristic more efficient than the current best. In this paper, LIMA is successfully applied to solve the obnoxious p-median problem (OpMP). More precisely, we developed a basic variable neighborhood search for solving the OpMP, where a single search ingredient, the interchange neighborhood structure, is used. We also propose a new simple local search strategy for solving facility location problems within the interchange neighborhood structure, which falls between the two usual strategies, first improvement and best improvement. We call it facility best improvement local search. In our experiments, it proved more efficient and effective than both first and best improvement. According to the results obtained on the benchmark instances, our heuristic turns out to be highly competitive with the existing ones, establishing new state-of-the-art results. For example, four new best-known solutions and 133 ties are claimed in testing the set with 144 instances.
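The interchange (swap) neighborhood that LIMA reduces the heuristic to can be sketched as below, here as a plain best-improvement pass. The client-by-facility distance matrix, set encoding, and function names are our illustrative assumptions; the paper's facility best improvement strategy scans this same neighborhood in a different, cheaper order.

```python
def obnoxious_obj(D, S):
    # Obnoxious p-median objective: each client's distance to its nearest
    # open (undesirable) facility, summed over clients -- to be MAXIMIZED.
    # D[c][f] is the distance from client c to facility site f; S is the
    # set of currently open facility sites.
    return sum(min(D[c][f] for f in S) for c in range(len(D)))

def interchange_best(D, S, candidates):
    # One best-improvement pass over the interchange neighborhood:
    # close one open facility, open one closed candidate, keep the
    # swap with the largest objective gain (if any).
    best_S, best_v = set(S), obnoxious_obj(D, S)
    for out_f in S:
        for in_f in candidates - S:
            T = (S - {out_f}) | {in_f}
            v = obnoxious_obj(D, T)
            if v > best_v:
                best_S, best_v = T, v
    return best_S, best_v
```

Repeating such passes until no improving swap exists gives a local search; variable neighborhood search wraps this with a shaking step.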

5.
Local CPS conversion is a compiler transformation for improving the code generated for nested loops by a direct-style compiler that uses recursive functions to represent loops. The transformation selectively applies CPS conversion at non-tail call sites, which allows the compiler to use a single machine procedure and stack frame for both the caller and callee. In this paper, we describe LCPS conversion, as well as a supporting analysis. We have implemented Local CPS conversion in the MOBY compiler and describe our implementation. In addition to improving the performance of loops, Local CPS conversion is also used to aid the optimization of non-local control flow by the MOBY compiler. We present results from preliminary experiments with our compiler that show significant reductions in loop overhead as a result of Local CPS conversion.

6.
Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to uniform dependency computations with a two-dimensional, parallelogram-shaped iteration domain which can be tiled with lines parallel to the domain boundaries. The target architecture is a linear array (or a ring). Our model is developed in two steps. We first abstract each tile by two simple parameters, namely the tile period P_t and the intertile latency L_t. We formulate and partially resolve the corresponding optimization problem independent of the machine and program. Next, we refine the model with realistic machine and program parameters, yielding a discrete nonlinear optimization problem. We solve this analytically, yielding a closed form solution, which can be used by a compiler before code generation.
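Under a hypothetical cost model in the spirit of this abstract, with tile period P_t(s) = σ + τs and a fixed intertile forwarding latency, the discrete optimization the paper solves analytically can be approximated by direct search. All parameters and the formula below are our illustrative assumptions, not the paper's exact model:

```python
def best_tile_size(N, P, sigma, tau, alpha):
    """Search the tile size s minimizing a pipeline execution-time model.

    Assumed model: a tile of size s costs P_t(s) = sigma + tau*s to compute
    (sigma = per-tile startup overhead) plus alpha to forward to the next
    processor. On a P-processor pipeline over N iterations per row:
        T(s) ~ (N/s) * P_t(s) / P  +  (P - 1) * (P_t(s) + alpha)
    i.e. steady-state work plus pipeline fill: small tiles pay the startup
    cost N/s times, large tiles stretch the fill phase.
    """
    best_s, best_T = None, float("inf")
    for s in range(1, N + 1):
        T = (N / s) * (sigma + tau * s) / P + (P - 1) * (sigma + tau * s + alpha)
        if T < best_T:
            best_s, best_T = s, T
    return best_s, best_T
```

For this model, balancing the Nσ/(sP) startup term against the (P−1)τs fill term gives s* ≈ √(Nσ/(P(P−1)τ)), the same kind of closed-form answer the paper derives for its refined model.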

7.
In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is functionally correct but without any consideration for performance optimization. Our compiler applies a set of optimization techniques to the naïve kernel and generates the optimized GPU kernel. Our compiler supports optimizations for GPU kernels using either global memory or texture memory. The implementation of our compiler is facilitated with a source-to-source compiler infrastructure, Cetus. A code transformation in the Cetus compiler framework is called a pass. We classify all the passes used in our work into two categories: functional passes and optimization passes. The functional passes translate input kernels into a desired intermediate representation, which clearly represents memory access patterns and thread configurations. A series of optimization passes improve the performance of the kernels by adapting them to the target GPGPU architecture. Our experiments show that the optimized code achieves very high performance, either superior or very close to highly fine-tuned libraries.

8.
Portfolio optimization is an NP-hard problem, and conventional methods have difficulty getting close to the global optimum. Building on the classical particle swarm optimization (PSO) algorithm, this paper studies a single-period portfolio optimization method based on quantum-behaved particle swarm optimization (QPSO), and describes in detail how the QPSO algorithm searches for the optimal portfolio according to the objective function. In the application, the algorithm is improved to increase its convergence and stability. Validation on real historical data shows that QPSO-based portfolio optimization outperforms PSO on the single-period portfolio optimization problem, and that QPSO has considerable practical value in the field of portfolio optimization.
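A minimal sketch of the quantum-behaved update QPSO uses: a stochastic attractor between the personal and global bests, plus a jump scaled by the swarm's mean best and a shrinking contraction coefficient. The parameter choices are illustrative assumptions, and a real portfolio objective would add budget and weight constraints on top of this generic minimizer:

```python
import math
import random

def qpso(f, dim, bounds, n_particles=20, iters=200, seed=1):
    """Minimize f over [lo, hi]^dim with quantum-behaved PSO (sketch)."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    P = [x[:] for x in X]          # personal best positions
    g = min(P, key=f)[:]           # global best position
    for t in range(iters):
        beta = 1.0 - 0.5 * t / iters   # contraction coefficient, 1.0 -> 0.5
        mbest = [sum(p[d] for p in P) / n_particles for d in range(dim)]
        for i, x in enumerate(X):
            for d in range(dim):
                phi = rng.random()
                attractor = phi * P[i][d] + (1 - phi) * g[d]
                u = rng.random()
                step = beta * abs(mbest[d] - x[d]) * math.log(1.0 / u)
                x[d] = attractor + step if rng.random() < 0.5 else attractor - step
            if f(x) < f(P[i]):
                P[i] = x[:]
                if f(x) < f(g):
                    g = x[:]
    return g, f(g)
```

Unlike PSO there are no velocities: every coordinate is resampled around the attractor each iteration, which is what gives QPSO its stronger global search behavior.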

9.
This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (LCG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions (the decompositions) that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.

10.
Rising application time complexity and falling hardware costs are two major drivers behind the adoption of high-performance architectures such as cluster computing systems. Cluster computing environments in contemporary, sophisticated data centres provide the main infrastructure for processing data of many kinds, and biomedical data is no exception. Optimized task scheduling is key to achieving high performance in such computing environments. The most problematic assumption made by state-of-the-art approaches to task scheduling is to treat the problem as a whole and try to enhance overall performance, whereas the problem actually consists of two subproblems of disparate nature, a sequencing subproblem and an assignment subproblem, each of which needs special consideration. In this paper, an efficient hybrid approach named ACO-CLA is proposed to solve the task scheduling problem in mesh-topology cluster computing environments. In the proposed approach, an enhanced ant colony optimization (ACO) is developed to solve the sequencing subproblem, whereas a cellular learning automata (CLA) machine tackles the assignment subproblem. The use of background knowledge about the problem (i.e., task priorities) makes the proposed approach robust and efficient. A randomly generated data set consisting of 125 random task graphs with various shape parameters, like those frequently encountered in biomedicine, was used to evaluate the proposed approach. The comparison study clearly shows the efficiency and superiority of the proposed approach over traditional counterparts.
In terms of our first metric, NSL (normalized schedule length), the proposed ACO-CLA is 2.48% better than ETF (earliest time first), the second-best approach, and 5.55% better than the average performance of all other competing methods. In terms of our second metric, speedup, the proposed ACO-CLA is 2.66% better than ETF and 5.15% better than the average performance of all the other competitors.

11.
Steve Carr, Philip Sweany. Software, 2003, 33(15): 1419–1445
This paper describes our experiments comparing multiple scalar replacement algorithms to evaluate their effectiveness on entire scientific application benchmarks within the context of a production-level compiler. We investigate at what point aggressive scalar replacement becomes detrimental and which dependence tests are necessary to give scalar replacement enough information to be effective. As many commercial optimizing compilers may include some version of scalar replacement as an optimization, it is important to determine how aggressive these algorithms need to be. Previously, no study has examined 'how much' scalar replacement is sufficient and effective within the context of an existing highly optimizing compiler. Our experiments show that, on whole programs, simple algorithms and simple dependence analysis capture nearly all opportunities for scalar replacement found in scientific application benchmarks. While additional aggressiveness may lead to some performance gain in individual loops, it leads to performance degradation too often to be worth the risk when considering entire applications. Algorithms restricted to value reuse over at most one loop iteration and to fully redundant array references give the best results. Our experiments further show that scalar replacement is not only an effective optimization, but also a feasible one for commercial optimizers, since the simple algorithms are not computationally expensive. Based upon our findings, we conclude that scalar replacement ought to be a part of any highly optimizing compiler because of its low cost and significant potential gain. Copyright © 2003 John Wiley & Sons, Ltd.
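The "value reuse over at most one loop iteration" that the study finds sufficient is easy to illustrate. In the sketch below, Python stands in for the compiler-transformed loop, and the example is ours, not from the paper: the array element loaded as b[i+1] in one iteration is carried in a scalar (i.e., a register) so it need not be reloaded as b[i] in the next iteration.

```python
def sum_adjacent(b):
    """out[i] = b[i] + b[i+1], with scalar replacement applied.

    Naively, each iteration loads both b[i] and b[i+1], so b[i+1] is
    reloaded one iteration later as b[i]. After scalar replacement the
    loop performs a single array load per iteration and carries the
    other operand in the scalar t.
    """
    if len(b) < 2:
        return []
    out = []
    t = b[0]                  # scalar holding the reused array value
    for i in range(len(b) - 1):
        nxt = b[i + 1]        # the only array load in the loop body
        out.append(t + nxt)
        t = nxt               # reuse across exactly one loop iteration
    return out
```

This distance-1 reuse is precisely the restricted class of opportunities that the paper reports simple algorithms already capture on whole programs.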

12.
Just as a processor executes flawlessly at different frequencies, a compiler should produce correct results at any optimization level. The Intel® Itanium® processor family, with its new features such as the register stack engine and control and data speculation, presents new and unique challenges for ported software and compiler technology. This paper describes validation and evaluation techniques that can be employed in compilation tools and can help to achieve a cleaner port of an application, a more robust compilation system, and even insights into performance tuning opportunities. Using Itanium as a specific example, the paper explains why the register stack engine (RSE), the large register file, and control and data speculation can potentially expose bugs in poorly written or compiled software. It then demonstrates validation and evaluation techniques to find or expose these bugs. An evaluation team can employ them to find, eliminate and evaluate software bugs. A compiler team can use them to make the compiler more stable and robust. A performance analysis team can use them to uncover performance opportunities in an application. We demonstrate our validation and evaluation techniques on code examples and provide run-time data to indicate the cost of some of our methods.

13.
We present a new memory access optimization for Java to perform aggressive code motion for speculatively optimizing memory accesses by applying partial redundancy elimination (PRE) techniques. First, to reduce as many barriers as possible and to enhance code motion, we perform alias analysis to identify all the regions in which each object reference is not aliased. Secondly, we find all the possible barriers. Finally, we perform code motions in three steps. For the first step, we apply a non-speculative PRE algorithm to move load instructions and their following instructions in the backwards direction of the control flow graph. For the second step, we apply a speculative PRE algorithm to move some of them aggressively before the conditional branches. For the third step, we apply our modified version of a non-speculative PRE algorithm to move store instructions in the forward direction of the control flow graph and to even move some of them after the merge points. We implemented our new algorithm in our production-level Java just-in-time compiler. Our experimental results show that our speculative algorithm improves the average (maximum) performance by 13.1% (90.7%) for jBYTEmark and 1.4% (4.4%) for SPECjvm98 over the fastest algorithm previously described, while it increases the average (maximum) compilation time by 0.9% (2.9%) for both benchmark suites. Copyright © 2004 John Wiley & Sons, Ltd.

14.
Because of intensive inter-node communications, image compositing has always been a bottleneck in parallel visualization systems. In a heterogeneous networking environment, the variation of link bandwidth and latency adds more uncertainty to the system performance. In this paper, we present a pipelining image compositing algorithm for heterogeneous networking environments, which is able to rearrange the direction of data flow of a compositing pipeline under a strict ordering constraint. We introduce a novel directional image compositing operator that specifies not only the color and α channels of the output but also the direction of data flow when performing compositing. Based on this new operator, we thoroughly study the properties of image compositing pipelines in heterogeneous environments. We develop an optimization algorithm that can find the optimal pipeline from an exponentially large search space in polynomial time. We conducted a comprehensive evaluation on the ns-3 network simulator. Experimental results demonstrate the efficiency of our method. Copyright © 2016 John Wiley & Sons, Ltd.
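The strict ordering constraint arises because compositing relies on the Porter-Duff "over" operator, which is associative but not commutative. The paper's directional operator additionally carries the data-flow direction; this color/α-only sketch on premultiplied RGBA omits that extension:

```python
def over(front, back):
    """Porter-Duff 'over' for premultiplied RGBA pixels.

    Associative but non-commutative: pipeline stages may be re-bracketed
    freely, but front-to-back order must be preserved.
    """
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    k = 1.0 - fa                      # how much of the back layer shows through
    return (fr + k * br, fg + k * bg, fb + k * bb, fa + k * ba)
```

Associativity is exactly what allows a compositing pipeline to regroup work across heterogeneous nodes without changing the final image, while the fixed left-to-right (front-to-back) order is the constraint the paper's pipeline rearrangement must respect.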

15.
The best tracking problem for a single-input-single-output (SISO) networked control system with communication constraints is studied in this paper. The tracking performance is measured by the energy of the error signal between the output of the plant and the reference signal. The communication constraints under consideration are finite bandwidth and network-induced delay. Explicit expressions for the minimal tracking error are obtained for networked control systems with or without communication constraints. It is shown that the best tracking performance depends on the nonminimum phase zeros and unstable poles of the given plant, as well as on the bandwidth and network-induced delay. It is also shown that, in the absence of communication constraints, the best tracking performance reduces to the existing tracking performance of the control system without communication constraints. The result shows how the bandwidth and network-induced delay of a communication channel may fundamentally constrain a control system's tracking capability. Some typical examples are given to illustrate the theoretical results.

16.
17.
Most of the current compiler projects for distributed memory architectures leave the critical and time-consuming problem of finding performance-efficient data distributions and profitable program transformations for a given parallel program almost entirely to the programmer. Performance estimators provide critical performance information to both programmers and parallelizing compilers, the most crucial part of which involves determining the communication overhead induced by a program. In this paper, we present a very practical approach to the problem of compile-time estimation of communication costs for regular codes that includes analytical methods to model the number of messages exchanged, data volume transferred, transfer time, and network contention. In order to achieve high estimation accuracy, our estimator aggressively exploits compiler analysis and optimization information. It is assumed that machine parameters and problem size are known at compile time. We conducted a variety of experiments to validate the estimation accuracy and the ability to support both the programmer and compiler in the effort of performance tuning of parallel programs. We believe that our approach can be automatically applied to a large class of regular codes.
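The transfer-time component of such a compile-time estimator is typically a latency-bandwidth (α–β) model; a minimal sketch with an illustrative multiplicative contention factor (the parameterization is our assumption, not the paper's exact model):

```python
def comm_cost(n_msgs, bytes_per_msg, alpha, beta, contention=1.0):
    """Estimated communication time under the alpha-beta model.

    Each message costs a fixed startup latency alpha plus beta seconds
    per byte; a contention factor > 1.0 crudely models shared links.
    A compiler can evaluate this at compile time once it has derived
    the message count and volume from the distribution being considered.
    """
    return contention * n_msgs * (alpha + beta * bytes_per_msg)
```

Comparing this estimate across candidate data distributions, before generating any code, is exactly the kind of decision support the abstract describes.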

18.
A Vectorizing Compiler for Multimedia Extensions
In this paper, we present an implementation of a vectorizing C compiler for Intel's MMX (Multimedia Extension). This compiler identifies data parallel sections of the code using scalar and array dependence analysis. To enhance the scope for application of the subword semantics, our compiler performs several code transformations, including strip mining, scalar expansion, grouping and reduction, and distribution. Thereafter, inline assembly instructions corresponding to the data parallel sections are generated. We have used the Stanford University Intermediate Format (SUIF), a public domain compiler tool, for our implementation. We evaluated the performance of the code generated by our compiler on a number of benchmarks. Initial performance results reveal that our compiler-generated code produces a reasonable performance improvement (speedup of 2 to 6.5) over the code generated without the vectorizing transformations/inline assembly. In certain cases, the performance of the compiler-generated code is within 85% of the hand-tuned code for the MMX architecture.

19.
We aim to find robust solutions in optimization settings where there is uncertainty associated with the operating/environmental conditions, and the fitness of a solution is hence best described by a distribution of outcomes. In such settings, the nature of the fitness distribution (reflecting the performance of a particular solution across a set of operating scenarios) is of potential interest in deciding solution quality, and previous work has suggested the inclusion of robustness as an additional optimization objective. However, there has been limited investigation of different robustness criteria, and the impact this choice may have on the sample size needed to obtain reliable fitness estimates. Here, we investigate different single and multi-objective formulations for robust optimization, in the context of a real-world problem addressed via simulation-based optimization. For the (limited evaluation) setting considered, our results highlight the value of an explicit robustness criterion in steering an optimizer towards solutions that are not only robust (as may be expected), but also associated with a profit that is, on average, higher than that identified by standard single-objective approaches. We also observe significant interactions between the choice of robustness measure and the sample size employed during fitness evaluation, an effect that is more pronounced for our multi-objective models.

20.
The architecture of a production optimizing compiler for Pascal is described, and the structure of the optimizer is detailed. The compiler performs both interprocedural and global optimizations, in addition to optimization of basic blocks. We have found that a high-level structured language such as Pascal provides unique opportunities for effective optimization, but that standard optimization techniques must be extended to take advantage of these opportunities. These issues are considered in our discussion of the optimization algorithms we have developed and the sequence in which we apply them.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号