Similar Documents
20 similar documents found (search time: 15 ms)
1.
A number of highly-threaded, many-core architectures hide memory-access latency by low-overhead context switching among a large number of threads. The speedup of a program on these machines depends on how well the latency is hidden. If the number of threads were infinite, theoretically, these machines could provide the performance predicted by the PRAM analysis of these programs. However, the number of threads per processor is not infinite, and is constrained by both hardware and algorithmic limits. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to provide a more fine-grained and accurate performance prediction than the PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the dynamic programming algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the dynamic programming algorithm performs better on these machines. We validate several predictions made by our model using empirical measurements on an instantiation of a highly-threaded, many-core machine, namely the NVIDIA GTX 480.
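For context on the classic problem the four analyzed algorithms target, here is a minimal sketch of the dynamic-programming all-pairs shortest paths algorithm (Floyd-Warshall, the standard instance of that approach); this is an illustration only, not the paper's implementation:

```python
# Floyd-Warshall all-pairs shortest paths: the classic dynamic-programming
# APSP algorithm. O(V^3) work; its regular triple loop maps naturally onto
# massively threaded hardware such as GPUs.
INF = float("inf")

def floyd_warshall(dist):
    """dist: square matrix (list of lists); dist[i][j] = edge weight or INF."""
    n = len(dist)
    d = [row[:] for row in dist]        # copy so the input is not mutated
    for k in range(n):                  # k = allowed intermediate vertex
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

On a GPU, the inner i/j loops for a fixed k are fully parallel, which is exactly the kind of regular parallelism whose memory-latency behaviour the TMM model is designed to capture.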

2.
The dynamic and unpredictable nature of workloads on many-core systems often leads to high power consumption and long latency; agile run-time task allocation can effectively mitigate both problems. To this end, an approximation model is proposed for the task-allocation problem in many-core systems, which estimates the number of available nodes around any given node. A hill-climbing search heuristic (SHiC) is then used to quickly locate the optimal first node among all available nodes, and finally the CoNA algorithm performs the actual task allocation. Simulations over a range of network sizes and parameter settings show that SHiC delivers significant performance gains, improving both network latency and power consumption over the most recent prior work.

3.
Energy consumption is a critical issue in parallel and distributed embedded systems. We present a novel algorithm for energy efficient scheduling of Directed Acyclic Graph (DAG) based applications on Dynamic Voltage Scaling (DVS) enabled systems. Experimental results show that our algorithm provides near optimal solutions for energy minimization with considerably smaller computational time and memory requirements compared to an existing algorithm that provides near optimal solutions.

4.
Summary: In this paper a method of labelling is applied to construct a correct and complete transformation system which allows one, for any program scheme, to systematically construct any permissible memory allocation for the variables of the scheme. Permissible memory allocations are those allocations that preserve all information-flow connections of the initial scheme. Notation: B: set of internal statements (2.3); D: information flow graph (2.2); D_j: component of an information flow graph (2.2); E: set of statements reachable via downward arcs (2.3); G: skeleton (2.1); i: input; L: set of statements reachable via upward arcs (2.3); N: information carrier (2.1); o: output (2.1); (o, i): information pair (2.1); p: pole (2.1); R: memory allocation (2.1); L: Lavrov scheme (2.1); T, U: statements (2.1); V: pole allocation (2.1); W: inconsistency graph (2.3); X: memory (2.1); x, y: variables (2.1); ∅: empty set; calculability relation (2.1); equivalence relation (3.1); transformability relation (3.1); empty word. The author is grateful to Miss E. L. Gorel, whose research on the axiomatics of generalized Yanov schemata stimulated this writing.

5.
《Micro, IEEE》2004,24(6):118-127
Power is a major problem for scaling the hardware needed to support memory disambiguation in future out-of-order architectures. In current machines, the traditional detection of memory ordering violations requires frequent associative searches of state proportional to the instruction window size. A new class of solutions yields an order-of-magnitude reduction in the energy required to properly order loads and stores for windows of hundreds to thousands of in-flight instructions.

6.
In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly parallel kernels. However, architectural constraints impose hard limits on the main memory bandwidth, and push for software techniques which optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, targeted to accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant to the OpenVX standard, and based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to massive reduction of time and bandwidth, even when the main memory bandwidth for the accelerator is severely constrained.

7.
Dynamic storage allocation is a vital component of programming systems intended for multiprocessor architectures that support globally shared memory. Highly parallel algorithms for access to system data structures lie at the core of effective memory allocation strategies as well as of solutions to other parallel systems problems. In this paper, we investigate four algorithms, all based on the first-fit approach, that provide different granularities of parallel access to the allocator's data structures. These solutions employ a variety of design techniques, including specialized locking protocols, the use of atomic fetch-and-Φ operations, and structural modifications. We describe experiments designed to compare the performance of these schemes. The results show that simple algorithms are appropriate when the expected number of concurrent requests per memory is low and the request pattern is not bursty. Algorithms that support finer-granularity access while avoiding locking protocols are successful at larger processor/memory ratios. This research was supported in part by the National Science Foundation under Grant Number DCR 8320136, DARPA/U.S. Army Engineer Topographic Laboratories under contract number DACA76-85-C-0001, and Unisys Corporation. A preliminary version appeared in International Conference on Parallel Processing, August 1987.
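To make the first-fit strategy the four algorithms build on concrete, here is a toy sketch with a single coarse lock, i.e. the coarsest granularity of parallel access; the class and method names are illustrative, not taken from the paper:

```python
import threading

class FirstFitAllocator:
    """Toy first-fit allocator over a list of (offset, size) free blocks.
    One global lock models the coarsest-grained scheme; the finer-grained
    schemes discussed above instead lock individual blocks or rely on
    atomic fetch-and-Phi primitives to avoid locking altogether."""

    def __init__(self, total):
        self.free = [(0, total)]        # one big free block initially
        self.lock = threading.Lock()

    def alloc(self, size):
        with self.lock:
            for idx, (off, blk) in enumerate(self.free):
                if blk >= size:         # first block that fits wins
                    if blk == size:
                        self.free.pop(idx)
                    else:               # shrink the block from the front
                        self.free[idx] = (off + size, blk - size)
                    return off
            return None                 # no block large enough

    def free_block(self, off, size):
        with self.lock:
            self.free.append((off, size))   # no coalescing in this toy
```

Under bursty request patterns every thread serializes on the single lock, which is precisely why the finer-granularity variants become attractive at higher processor/memory ratios.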

8.
The Journal of Supercomputing - Finite element method (FEM) has been used for years for radiation problems in the field of electromagnetism. To tackle problems of this kind, mesh truncation...

9.
Intelligent Service Robotics - Tasks in the real world are complex and often require multiple robots to collaborate to be serviced. In many cases, a task might require different sensory inputs and...

10.
An inverted file is a commonly used index for both archival databases and free text where no updates are expected. Applications like information filtering and dynamic environments like the Internet require inverted files to be updated efficiently. Recently, extensible inverted files have been proposed which can be used for fast online indexing. The effective storage allocation scheme for such inverted files uses the arrival rate to preallocate storage. In this article, this storage allocation scheme is improved by using information about both the arrival rates and their variability to predict the storage needed, as well as by scaling the storage allocation by a logarithmic factor. The resulting final storage utilization rate can be as high as 97-98% after indexing about 1.6 million documents. This compares favorably with the storage utilization rate of the original arrival-rate storage allocation scheme. Our evaluation shows that the retrieval time for an extensible inverted file on a solid-state disk is on average similar to the retrieval time for an in-memory extensible inverted file. When file seek time is not an issue, our scalable storage allocation enables extensible inverted files to be used as the main index on disk. Our statistical storage allocation may be applicable to novel situations where the arrival of items follows a binomial, Poisson or normal distribution.

11.
In this paper, we consider the optimal loop scheduling and minimum storage allocation problems based on the argument-fetching dataflow architecture model. Under the argument-fetching model, the result generated by a node is stored in a unique location which is addressable by its successors. The main contribution of this paper is the following: for loops containing no loop-carried dependences, we prove that the problem of allocating the minimum storage required to support rate-optimal loop scheduling can be solved in polynomial time. The polynomial-time algorithm is based on the fact that the constraint matrix in the formulation is totally unimodular. Since the instruction processing unit of an argument-fetching dataflow architecture is very much like a conventional processor architecture without a program counter, the solution of the optimal loop storage allocation problem for the former will also be useful for the latter.

12.
Recent development in computer hardware has brought more widespread emergence of shared memory, multi-core systems. These architectures offer opportunities to speed up various tasks—model checking and reachability analysis among others. In this paper, we present a design for a parallel shared memory LTL model checker that is based on a distributed memory algorithm. To improve the scalability of our tool, we have devised a number of implementation techniques which we present in this paper. We also report on a number of experiments we conducted to analyse the behaviour of our tool under different conditions using various models. We demonstrate that our tool exhibits significant speedup in comparison with sequential tools, which improves the workflow of verification in general.

13.
This paper addresses task allocation schemes for MIN-based multiprocessors. Two types of allocation policies, cubic and noncubic, are discussed here. Conflicts through the network and the inability to partition the system effectively are the main bottlenecks in a MIN-based system. To solve both problems, a renaming scheme for the input and output ports of a MIN is proposed. We use the baseline MIN as an example in this work and call the renaming scheme the bit-reversal (BR) matching pattern. Allocation with the new matching pattern minimizes conflicts and partitions the system completely into independent subsystems. The novelty of this matching pattern is that any dynamic cubic allocation and/or scheduling scheme developed for hypercubes can also be used for MIN machines. The BR matching pattern can be used with any kind of MIN. An allocation policy for noncubic tasks is also presented with this matching pattern. Various performance measures with different allocation algorithms are compared via simulation. The advantages of the algorithms with the proposed matching pattern are shown in terms of system efficiency, delay and task miss ratio.
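The bit-reversal renaming at the heart of the BR matching pattern is a standard permutation; a minimal sketch (generic bit reversal, with the application to MIN port numbering noted in comments; the function name is ours):

```python
def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x, e.g. 0b001 -> 0b100 for bits=3.
    Renaming port i of an n-port MIN (n = 2**bits) to bit_reverse(i, bits)
    is the kind of BR renaming described above."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)   # shift the lowest bit of x into r
        x >>= 1
    return r

# Bit reversal is a permutation: every port receives a unique new name,
# which is what lets the renamed MIN be partitioned like a hypercube.
ports = [bit_reverse(i, 3) for i in range(8)]
```

Because the mapping is its own inverse (reversing twice restores the original index), translating between original and renamed port numbers is cheap in both directions.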

14.
We explore novel algorithms for DVS (Dynamic Voltage Scaling) based energy minimization of DAG (Directed Acyclic Graph) based applications on parallel and distributed machines in dynamic environments. Static DVS algorithms for DAG execution use the estimated execution time. In practice, the estimated time is either an overestimate or an underestimate. Therefore, many tasks may be completed earlier or later than expected during the actual execution. For overestimation, the extra available slack can be added to future tasks so that energy requirements can be reduced. For underestimation, the increased time may cause the application to miss the deadline; slack can be reduced for future tasks to lower the possibility of missing the deadline. In this paper, we present novel dynamic scheduling algorithms for reallocating the slack of future tasks to reduce energy and/or satisfy deadline constraints. Experimental results show that our algorithms are comparable to static algorithms applied at runtime in terms of energy minimization and deadline satisfaction, but require considerably smaller computational overhead.
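The slack-reallocation idea can be reduced to a one-line speed rule; the following sketch is purely illustrative (the paper's algorithms operate on whole DAGs, and the function name is ours): when a task finishes early, the enlarged remaining time lets the next task run at a lower speed, and a late finish pushes the speed back up.

```python
def next_task_speed(remaining_work, deadline, now):
    """Pick the slowest speed (fraction of full speed) that still meets
    the deadline, assuming work scales linearly with speed. Early finishes
    enlarge (deadline - now) and thus lower the speed (and, under DVS,
    the voltage and energy); late finishes raise it, capped at full speed."""
    slack = deadline - now
    if slack <= 0:
        return 1.0                              # no slack left: run flat out
    return min(1.0, remaining_work / slack)
```

Under the usual DVS assumption that dynamic power grows superlinearly with frequency, running at the lowest deadline-feasible speed is what saves energy.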

15.
For data arriving according to Poisson processes, a method is proposed for the optimal allocation of computer memory organized as serially positioned stacks.

16.
17.
The Journal of Supercomputing - As the distributed computing systems have been widely used in many research and industrial areas, the problem of allocating tasks to available processors in the...

18.
This paper presents a novel compiler algorithm, called acyclic orientation graph coloring (AOG coloring), for managing data objects in software-managed memory allocation. The key insight is that software-managed memory allocation can be solved as an interval coloring problem, or equivalently, an acyclic orientation problem. We generalize graph-coloring register allocation to interval-coloring memory allocation by maintaining an acyclic orientation of the currently colored subgraph. This is achieved with some well-crafted heuristics, including Aggressive Simplify, which does not necessarily preserve colorability, and Best-Fit Select, which assigns intervals (i.e., colors) to nodes while possibly adjusting the colors already assigned to other nodes earlier. Our algorithm generalizes, and subsumes as a special case, the classical graph-coloring register allocation algorithm without notably increased complexity: it deals with memory allocation while preserving the elegance and practicality of traditional graph-coloring register allocation. We have implemented our algorithm and tested it on Appel's 27,921 interference graphs for scalars (augmented with node weights). Our algorithm outperforms Memory Coloring, the best algorithm in the literature for software-managed memory allocation, on 98.64% of the graphs; the gaps are more than 20% on 68.31% of the graphs, and our algorithm is worse on only 0.29% of the graphs. We also tested it on all 73 DIMACS weighted benchmarks (weighted graphs); AOG Coloring outperforms Memory Coloring on all of them, with gaps of more than 20% on 83.56% of the graphs.
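To illustrate the interval coloring problem that models memory allocation here (each node needs a contiguous address range of length equal to its weight, and adjacent nodes must get disjoint ranges), a naive greedy first-fit sketch; this is NOT the AOG algorithm, just a statement of the problem in code, with names of our own choosing:

```python
def greedy_interval_coloring(weights, edges):
    """Assign each node v an interval [start[v], start[v] + weights[v])
    such that adjacent nodes get disjoint intervals. Greedy first-fit in
    node order: slide each node's interval past the intervals of its
    already-placed neighbours until it fits."""
    n = len(weights)
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    start = {}
    for v in range(n):
        # Intervals of neighbours placed so far, sorted by start address.
        taken = sorted((start[u], start[u] + weights[u])
                       for u in adj[v] if u in start)
        s = 0
        for lo, hi in taken:
            if s + weights[v] <= lo:    # fits in the gap before this interval
                break
            s = max(s, hi)              # otherwise slide past it
        start[v] = s
    return start
```

The span max(start[v] + weights[v]) is the total memory used; the heuristics above (Aggressive Simplify, Best-Fit Select) aim to shrink that span well below what such a naive ordering achieves.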

19.
Over the years, heterogeneous systems have come to dominate the area of concurrent job execution. A heterogeneous system is a natural choice, as it can be designed around legacy systems. Scheduling on such systems is an important activity, as it affects job execution characteristics. Heterogeneity introduces many challenges for efficient job execution. Heterogeneity in core architecture introduces the possibility of heterogeneous memory architecture in many/multi-core heterogeneous systems. This often makes it impossible to determine, for the same instruction, whether a high-frequency core has lower or higher memory latency than a low-frequency core, and vice versa. This work proposes an improved scheduler for systems in which both cores and memory are heterogeneous. It defines the average effective time (AE_t) as the base parameter for this purpose. Priorities of each thread (workload) and each core are dynamically generated using AE_t for effective mapping. Experimental results on the benchmark data reveal that the proposed scheduler performs much better in terms of core utilization, speedup and efficiency than other similar models.

20.
Mapping linear workflow applications onto a set of homogeneous processors can be solved optimally in polynomial time for the throughput objective when there are fewer processors than stages. This result holds true even when setup times occur in the execution and homogeneous buffers are available for the storage of intermediate results. In this kind of application, several computation stages are interconnected as a linear application graph; each stage holds a buffer of limited size where intermediate results are stored, and a processor setup time is incurred when passing from one stage to another. In this paper, we tackle the problem in which the buffer sizes are not given beforehand and must be fixed before the execution to maximize the throughput within each processor. The goal of this work is to minimize the cost induced by the setup times by allocating buffers that are proportional in size to each other. We present a closed formula to compute the optimal buffer allocation in the case of nondecreasing setup costs in the linear application. For the case of unsorted setup times, we provide competitive heuristics that are validated via extensive simulation. Three nonscalable brute-force algorithms are also provided to compare the heuristic approaches to optimal ones for small applications and to evaluate the relevance of our approach.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号