Similar literature
 Found 20 similar documents; search took 15 ms
1.
Hosts with several, possibly heterogeneous and/or multicore, processors provide new challenges and opportunities to accelerate applications with high communication bandwidth requirements. Many opportunities to scale these network applications as link bandwidths increase come from exploiting the parallelism provided by the several processing cores in the servers, not only for computing the workload of the user application but also for decreasing the overhead associated with the network interface and the system software.

2.
Programming heterogeneous multiprocessor architectures combining multiple processor cores and hardware accelerators is a real challenge. Computer-aided design and development tools try to reduce the large design space by simplifying hardware-software mapping mechanisms. However, energy consumption is not well supported in most design space exploration methodologies because it is difficult to estimate energy consumption quickly and accurately. To this end, this paper proposes and validates an exploration method for partitioning tiling-based parallel applications on software cores and hardware accelerators under energy-efficiency constraints. The methodology is based on energy and performance measurement of a tiny subset of the design space and an analytical formulation of the performance and energy of an application kernel mapped onto a heterogeneous architecture. This closed-form expression is captured and solved using Mixed Integer Linear Programming, which allows for very fast exploration and results in the best hardware and software partitioning under an energy constraint. The approach is validated on two application kernels using a Zynq-based architecture, showing a speed-up and energy saving of more than 12% compared to standard approaches. Results also show that the most energy-efficient solution is application- and platform-dependent and, moreover, hard to predict, which highlights the need for fast exploration tools such as the one in this paper.

3.
The influence of the most resource-intensive regular system interrupts on the performance of parallel programs is studied. Depending on the hardware architecture and OS settings, these interrupts take 0.1–5% of CPU time [1, 2], yet they can degrade the performance of a parallel program by 10–100%, for instance, for bulk operations [3–10]. The way these interrupts affect the run time of a class of parallel programs with synchronization between neighboring processes at each iteration, such as stencil computations, synchronous cellular automata, and explicit difference schemes, is studied. Results of tests on computing clusters are given. Measures to reduce the influence of these interrupts on the performance of a parallel program are discussed.

4.
5.
The memory behavior of cache oblivious stencil computations   Total citations: 1, self-citations: 0, citations by others: 1
We present and evaluate a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an “ideal cache” of size Z, our algorithm saves a factor of Θ(Z^{1/n}) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy. We evaluate our algorithm in terms of the number of cache misses, and demonstrate that the memory behavior agrees with our theoretical predictions. Our experimental evaluation is based on a finite-difference solution of a heat diffusion problem, as well as a Gauss-Seidel iteration and a 2-dimensional LBMHD program, both reformulated as cache oblivious stencil computations. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.
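For orientation, the kind of naive baseline the abstract compares against can be sketched as follows. This is only an illustrative 3-point explicit heat-diffusion stencil in one dimension, not the cache oblivious algorithm itself; the function names `heat_step` and `heat_naive` and the coefficient `alpha` are assumptions made for this sketch and do not come from the paper.

```python
def heat_step(u, alpha):
    """Apply one explicit finite-difference step of 1D heat diffusion.

    The two boundary values are held fixed; `alpha` is an assumed
    diffusion coefficient (dt / dx**2 scaled), not taken from the paper.
    """
    nxt = list(u)
    for i in range(1, len(u) - 1):
        # 3-point stencil: each new value depends on its two neighbours.
        nxt[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return nxt


def heat_naive(u, alpha, steps):
    """Naive time loop: sweep the whole array once per time step."""
    for _ in range(steps):
        u = heat_step(u, alpha)
    return u
```

Because every time step sweeps the whole array, data that no longer fits in cache is reloaded at each step; this is the reuse pattern that a cache oblivious reformulation is meant to improve.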

6.
The Journal of Supercomputing - Complex tile shapes maximize parallelism and locality of stencil computations by enabling tile-wise concurrent start, i.e., all tiles along a particular tiling...

7.
Apan Qasem  Josh Magee 《Software》2013,43(6):705-729
Translation Lookaside Buffers (TLBs) can play a critical role in improving the performance of emerging parallel workloads. Most current chip multiprocessor systems include multilevel TLBs and provide support for superpages both at the hardware and software level. Judicious use of superpages can significantly cut down the number of TLB misses and improve overall system performance. However, indiscriminate superpage allocation results in page fragmentation and increased application footprint, which often outweigh the benefits of reduced TLB misses. Previous research has explored policies for smart allocation of superpages from an operating system perspective. This paper presents a compiler‐based strategy for automatic and profitable memory allocation via superpages. A significant advantage of a compiler‐based approach is the availability of data‐reuse information within an application. Our strategy employs data‐locality analysis to estimate the TLB demands for both single‐threaded and multi‐threaded programs and uses this metric to apply selective superpage allocation. Apart from its obvious utility in improving TLB performance, this strategy can be used to improve the effectiveness of certain data‐layout transformations and can be a useful tool in benchmarking and automatic performance tuning. We demonstrate the effectiveness of this strategy with experiments on three multicore platforms on a workload that contains both sequential and parallel applications. Copyright © 2012 John Wiley & Sons, Ltd.

8.
We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil based finite-difference time-domain code. Weak-scaling parallel efficiency is over 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve 7.7-fold reduction of the last level cache miss rate of Intel Nehalem, whereas register blocking increases data parallelism and thereby achieves 5.9 Gflops performance on a single core. Register blocking + multithreading optimizations achieve 5.8-fold speedup on a single quadcore Nehalem.

9.
10.
The paper focuses on the problem of partitioning and mapping parallel programs onto heterogeneous embedded multiprocessor architectures for real-time applications. Such applications present unique constraints and challenges. In addition to heterogeneity, the proposed partitioning and mapping algorithms satisfy memory, task throughput, task placement, intertask communication bandwidth, and co-location constraints. They do so for architectures that utilize circuit-switched (rather than packet-switched) interprocessor communication and optimize latency and throughput in addition to load-balancing. Finally, these mapping algorithms make use of knowledge of the local scheduling discipline to accommodate real-time scheduling constraints. Our focus is on unstructured parallel programs that fall into one of two classes: (i) the class of computations characteristic of control applications in a real-time environment where tasks execute concurrently, periodically exchanging information, and (ii) pipelined computation graphs found in sensor data processing applications. The algorithms are implemented in a set of tools that operate with commercial CASE tools at one end, and present an interface to multiprocessor simulators at the other end. Collectively, the algorithms form a significant component of an interactive design environment for the development and mapping of real-time embedded parallel programs. The paper describes the algorithms and the encapsulating toolset, and presents an example of their application to an existing embedded application, an Autonomous Underwater Vehicle application.

11.
In this paper, an innovative strategy for data-flow synchronization in shared-memory systems is proposed. This strategy synchronizes only interdependent threads, in contrast to the barrier approach, which synchronizes all threads. We demonstrate the adaptation of the data-flow synchronization strategy to two complex scientific applications based on stencil codes. An algorithm for the data-flow synchronization is developed and successfully used for both applications. The proposed approach is evaluated for various Intel microarchitectures released in the last 5 years, including the newest processors: Skylake and Knights Landing. An important part of this assessment is the performance comparison of the proposed data-flow synchronization with the OpenMP barrier. The experimental results show that the studied applications can be accelerated by up to 1.3 times using the proposed data-flow synchronization strategy.

12.
This paper describes a formal synthesis approach to design of optimal application-specific heterogeneous multiprocessor systems. The method generates a static task execution schedule along with the structure of the multiprocessor system and a mapping of subtasks to processors. The approach itself is quite general, but its application is demonstrated with a specific style of design. The approach involves creation of a Mixed Integer-Linear Programming (MILP) model and solution of the model. A primary component of the model is the set of relations that must be satisfied to ensure proper ordering of various events in the task execution as well as to ensure completeness and correctness of the system. Several experiments and tradeoff studies have been performed using the approach. These results indicate that the approach can be a useful tool in designing application-specific multiprocessor systems.

13.
The emergence of new environments, such as grids and mobile devices, poses new requirements for programming systems and models. Migration techniques, which have been extensively studied but not widely used, gain a chance to rise in light of these requirements. This survey reviews a particular case of migration that fits well into those environments, namely heterogeneous strong migration of running computations at application level. We approach it from an implementation point of view, commenting on related issues and current solutions. We discuss the strong influence that language support for migration has on implementation issues, and the advantages that a higher level of support would represent for the portability and performance of the application, as well as in terms of control and flexibility for the programmer. Copyright © 2007 John Wiley & Sons, Ltd.

14.
An algorithm is proposed for scheduling dependent tasks in time-varying heterogeneous multiprocessor systems, in which computational power and links between processors are allowed to change over time. Link contention is considered in the multiprocessor scheduling problem. A linear switching-state space-modeling paradigm is introduced to enable theoretical analysis from a system engineering perspective. Theoretical analysis of this model shows its robustness against changes in processing power and link failure. The proposed algorithm uses a fuzzy decision-making procedure to handle changes in the multiprocessor system. The efficiency of the proposed algorithm is illustrated by several random experiments and comparison against a recent benchmark approach. The results show up to 18% average improvement in makespan, especially for larger scale systems.

15.
We consider the problem of simulating linear arrays and rings on the multiply twisted cube. We introduce a new concept, the reflected link label sequence, and use it to define a generalized Gray Code (GGC). We show that GGCs can be easily used to identify Hamiltonian paths and cycles in the multiply twisted cube. We also give a method for embedding a ring with an arbitrary number of nodes into the multiply twisted cube.
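As background for the Gray code idea, the classical case can be sketched in a few lines. The sketch below uses the standard binary reflected Gray code on the ordinary (untwisted) hypercube, where consecutive codewords differ in exactly one bit and therefore trace a Hamiltonian cycle. It is not the generalized Gray code (GGC) of the abstract, whose definition is not given here; the function names are assumptions made for illustration.

```python
def reflected_gray_code(n):
    """Return all 2**n vertices of the ordinary n-cube in binary reflected Gray code order."""
    return [i ^ (i >> 1) for i in range(2 ** n)]


def is_hamiltonian_cycle(seq, n):
    """Check that seq visits every n-bit vertex exactly once and that
    cyclically consecutive vertices differ in exactly one bit,
    i.e. are adjacent in the ordinary n-cube."""
    if sorted(seq) != list(range(2 ** n)):
        return False
    return all(
        bin(seq[k] ^ seq[(k + 1) % len(seq)]).count("1") == 1
        for k in range(len(seq))
    )
```

For example, `reflected_gray_code(3)` yields the vertex order 0, 1, 3, 2, 6, 7, 5, 4, which embeds a ring of 8 nodes in the 3-cube. In a twisted cube some links connect different vertex pairs, so this one-bit adjacency check would not apply as written.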

16.
《Computer Networks》2007,51(3):901-917
Peer-to-peer networks have been commonly used for tasks such as file sharing or file distribution. We study a class of cooperative file distribution systems where a file is broken up into many chunks that can be downloaded independently. The different peers cooperate by mutually exchanging the different chunks of the file, each peer being client and server at the same time. While such systems are already in widespread use, little is known about their performance and scaling behavior. We develop analytic models that provide insights into how long it takes to deliver a file to N clients given a distribution architecture. Our results indicate that even for the case of heterogeneous client populations it is possible to achieve download times that are almost independent of the number of clients and very close to optimal.

17.
Maximizing multiprocessor performance with the SUIF compiler   Total citations: 1, self-citations: 0, citations by others: 1
This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs. Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three-fourths of the NAS and SPECfp95 benchmark programs.

18.
An asymptotic queuing theoretic approach is proposed to analyze the performance of an FCFS (first-come, first-served) heterogeneous multiprocessor computer system with a single bus operating in a randomly changing environment. All stochastic times in the system are considered to be exponentially distributed and independent of the random environment, while the access and service rates of the processors are subject to random fluctuations. It is shown under the assumption of 'fast' arrivals that the busy period length of the bus converges weakly, under appropriate normalization, to an exponentially distributed random variable. As a consequence, the main steady-state performance measures, such as system throughput, mean delay time, expected waiting time, and mean number of active processors, can be approximately determined. The reliability of the proposed method is validated by comparing the new approximations with known exact results.

19.
In this paper, we present a framework for partitioning data parallel computations across a heterogeneous metasystem at runtime. The framework is guided by program and resource information which is made available to the system. Three difficult problems are handled by the framework: processor selection, task placement and heterogeneous data domain decomposition. Solving each of these problems contributes to reduced elapsed time. In particular, processor selection determines the best grain size at which to run the computation, task placement reduces communication cost, and data domain decomposition achieves processor load balance. We present results which indicate that excellent performance is achievable using the framework. The paper extends our earlier work on partitioning data parallel computations across a single-level network of heterogeneous workstations.

20.
The multiprocessor Fixed-Job Priority (FJP) scheduling of real-time systems is studied. Predictability (regardless of the execution times), an important property for schedulability analysis, is studied for heterogeneous multiprocessor platforms. Our main contribution is to show that any FJP scheduler is predictable on unrelated platforms. A convenient consequence is that any FJP scheduler is predictable on uniform multiprocessors.
