期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The Servet 3.0 benchmark suite: Characterization of network performance degradation

Jorge González-Domínguez María J. Martín Guillermo L. Taboada Roberto R. Expósito Juan Touriño 《Computers & Electrical Engineering》2013

Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Up to now the tool considered network bandwidth as constant and independent of the communication pattern. Nevertheless, the inter-node communication bandwidth decreases on modern large supercomputers depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer and some examples of how real parallel codes can be optimized by using the information about network degradation. 相似文献

2.

Memory aware load balance strategy on a parallel branch‐and‐bound application

Juliana M.N. Silva Cristina Boeres Lúcia M.A. Drummond Artur A. Pessoa 《Concurrency and Computation》2015,27(5):1122-1144

The latest trends in high performance computing systems show an increasing demand on the use of a large scale multicore system in an efficient way so that high compute‐intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by the poor utilization of the memory hierarchy. Actually, the multicore architecture introduces some distinct features that are already observed in shared memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that a careful allocation scheme of an application is carried out on the available cores, based on a scheduling specification that considers not only processors characteristics but also memory contention. This paper proposes a multicore cluster representation that captures relevant performance characteristics in multicores systems such as the influence of memory hierarchy and contention on application performance. Improved performance was achieved by a branch‐and‐bound application applied to the partitioning sets problem that incorporated a memory aware load balancing strategy based on the proposed multicore cluster representation. An in‐depth analysis on this application execution showed its applicability to modern systems. Copyright © 2014 John Wiley & Sons, Ltd. 相似文献

3.

Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

Seonggun Kim Hwansoo Han Kwang-Moo Choe 《The Journal of supercomputing》2011,56(1):25-55

Multicore architectures are evolving with the promise of extreme performance for the classes of applications that require high performance and large bandwidth of memory. Irregular reduction is one of important computation patterns for many complex scientific applications, and it typically requires high performance and large bandwidth of memory. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing memory hierarchy in software requires a lot of programming efforts and tends to be error-prone. The difficulties are even worse for applications with irregular data access patterns. To relieve the burden of memory management from programmers, we develop abstractions, particularly targeted to irregular reduction, for structuring parallel tasks, mapping the parallel tasks to processing units and scheduling data transfers between the memory hierarchies. Our framework employs iteration reordering based on regions of data along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques for irregular reduction kernels on the Cell processor embedded in a Sony PlayStation3. Experimental results show the speedups of 8 to 14 on the six available SPEs. 相似文献

4.

Power and performance analysis of multimedia applications running on low-power devices by cache modeling

Abu Asaduzzaman Govipalagodage H. Gunasekara 《Multimedia Tools and Applications》2014,72(1):207-230

High processing speed is required to support computation intensive applications. Cache memory is used to improve processing speed by reducing the speed gap between the fast processing core and slow main memory. However, the problem of adopting cache into computing systems is twofold: cache is power hungry (that challenges energy constraints) and cache introduces execution time unpredictability (that challenges supporting real-time multimedia applications). Recently published articles suggest that using cache locking improves predictability. However, increased cache activities due to aggressive cache locking make the system consume more energy and become less efficient. In this paper, we investigate the impact of cache parameters and cache locking on power consumption and performance for real-time multimedia applications running on low-power devices. In this work, we consider Intel Pentium-like single-processor and Xeon-like multicore architectures, both with two-level cache memory hierarchy, using three popular multimedia applications: MPEG-4 (the global video coding standard), H.264/AVC (the network friendly video coding standard), and recently introduced H.265/HEVC (for improved video quality and data compression ratio). Experimental results show that cache locking mechanism added to an optimized cache memory structure is very promising to increase the performance/power ratio of low-power systems running multimedia applications. According to the simulation results, performance can be improved by decreasing cache miss rate down to 36 % and the total power consumption can be saved up to 33 %. It is also observed that H.265/HEVC has significant performance advantage over H.264/AVC (and MPEG-4) for smaller caches. 相似文献

5.

面向嵌入式多核的OpenMP扩展方法

下载免费PDF全文

王庆季振洲刘涛《计算机科学与探索》2011,5(1):81-86

为多核平台开发一种有效的编程方法已经成为并行软件研究的一个重要目标.在嵌入式多核平台上进行了OpenMP并行程序的有效的实施运行.针对嵌入式具有有限内存资源的特点,提出了通过扩展OpenMP自定义制导语句tiling来提高并行程序在嵌入式多核平台上的运行效率.扩展后的OpenMP并行程序支持循环分片,从而能够充分利用层... 相似文献

6.

LAPT: A locality-aware page table for thread and data mapping

《Parallel Computing》2016

The performance and energy efficiency of current systems is influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction, and which cache or main memory bank responded to it. In this context, the locality of the memory accesses plays a key role for the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping).In this paper, we present LAPT, a hardware-based mechanism to store the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC Benchmark Suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively, (6.7% and 5.3% on average). 相似文献

7.

Adaptive thread mapping strategies for transactional memory applications

Márcio Castro Luís Fabrício W. Góes Jean-François Méhaut 《Journal of Parallel and Distributed Computing》2014

Transactional Memory (TM) is a programmer friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, the performance of the applications still relies on how frequent they synchronize and the way they access shared data. These aspects must be taken into consideration if one intends to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latencies and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploit the memory hierarchy is called thread mapping. However, a single fixed thread mapping cannot deliver the best performance when dealing with a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread mapping strategies for TM applications to tackle this problem. They range from simple strategies that do not require any prior knowledge to strategies based on Machine Learning techniques. Taking the Linux default strategy as baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall performance improvement of up to 16.5% on the standard STAMP benchmark suite. 相似文献

8.

Implementation and tuning of a parallel symmetric Toeplitz eigensolver

Pedro AlonsoAuthor Vitae Antonio M. Vidal^{Author Vitae} 《Journal of Parallel and Distributed Computing》2011,71(3):485-494

In a previous paper (Vidal et al., 2008, [21]), we presented a parallel solver for the symmetric Toeplitz eigenvalue problem, which is based on a modified version of the Lanczos iteration. However, its efficient implementation on modern parallel architectures is not trivial.In this paper, we present an efficient implementation on multicore processors which takes advantage of the features of this architecture. Several optimization techniques have been incorporated to the algorithm: improvement of Discrete Sine Transform routines, utilization of the Gohberg-Semencul formulas to solve the Toeplitz linear systems, optimization of the workload distribution among processors, and others. Although the algorithm follows a distributed memory parallel programming paradigm that is led by the nature of the mathematical derivation, special attention has been paid to obtaining the best performance in multicore environments. Hybrid techniques, which merge OpenMP and MPI, have been used to increase the performance in these environments. Experimental results show that our implementation takes advantage of multicore architectures and clearly outperforms the results obtained with LAPACK or ScaLAPACK. 相似文献

9.

Measurement of the latency parameters of the Multi-BSP model: a multicore benchmarking approach

Abdorreza Savadi Hossein Deldari 《The Journal of supercomputing》2014,67(2):565-584

Computer benchmarking is a common method for measuring the parameters of a computational model. It helps to measure the parameters of any computer. With the emergence of multicore computers, the evaluation of computers was brought under consideration. Since these types of computers can be viewed and considered as parallel computers, the evaluation methods for parallel computers may be appropriate for multicore computers. However, because multicore architectures seriously focus on cache hierarchy, there is a need for new and different benchmarks to evaluate them correctly. To this end, this paper presents a method for measuring the parameters of one of the most famous multicore computational models, namely Multi-Bulk Synchronous Parallel (Multi-BSP). This method measures the hardware latency parameters of multicore computers, namely communication latency (g _i) and synchronization latency (L _i) for all levels of the cache memory hierarchy in a bottom-up manner. By determining the parameters, the performance of algorithms on multicore architectures can be evaluated as a sequence. 相似文献

10.

Impact of level-2 cache sharing on the performance and power requirements of homogeneous multicore embedded systems

Abu Fadi N. Manira 《Microprocessors and Microsystems》2009,33(5-6):388-397

In order to satisfy the needs for increasing computer processing power, there are significant changes in the design process of modern computing systems. Major chip-vendors are deploying multicore or manycore processors to their product lines. Multicore architectures offer a tremendous amount of processing speed. At the same time, they bring challenges for embedded systems which suffer from limited resources. Various cache memory hierarchies have been proposed to satisfy the requirements for different embedded systems. Normally, a level-1 cache (CL1) memory is dedicated to each core. However, the level-2 cache (CL2) can be shared (like Intel Xeon and IBM Cell) or distributed (like AMD Athlon). In this paper, we investigate the impact of the CL2 organization type (shared Vs distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that by replacing a single-core system with an 8-core system, reductions in mean delay per core of 64% for distributed CL2 and 53% for shared CL2 are possible with little additional power (15% for distributed CL2 and 18% for shared CL2) for FFT. Results also reveal that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered and for other applications with similar code characteristics. 相似文献

11.

Design of scalable Java message-passing communications over InfiniBand

Roberto R. Expósito Guillermo L. Taboada Juan Touri?o Ramón Doallo 《The Journal of supercomputing》2012,61(1):141-165

This paper presents ibvdev a scalable and efficient low-level Java message-passing communication device over InfiniBand. The continuous increase in the number of cores per processor underscores the need for efficient communication support for parallel solutions. Moreover, current system deployments are aggregating a significant number of cores through advanced network technologies, such as InfiniBand, increasing the complexity of communication protocols, especially when dealing with hybrid shared/distributed memory architectures such as clusters. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the gap between Java and compiled languages performance has been narrowing for the last years, Java is an emerging option for High Performance Computing (HPC). The developed communication middleware ibvdev increases Java applications performance on clusters of multicore processors interconnected via InfiniBand through: (1) providing Java with direct access to InfiniBand using InfiniBand Verbs API, somewhat restricted so far to MPI libraries; (2) implementing an efficient and scalable communication protocol which obtains start-up latencies and bandwidths similar to MPI performance results; and (3) allowing its integration in any Java parallel and distributed application. In fact, it has been successfully integrated in the Java messaging library MPJ Express. The experimental evaluation of this middleware on an InfiniBand cluster of multicore processors has shown significant point-to-point performance benefits, up to 85% start-up latency reduction and twice the bandwidth compared to previous Java middleware on InfiniBand. Additionally, the impact of ibvdev on message-passing collective operations is significant, achieving up to one order of magnitude performance increases compared to previous Java solutions, especially when combined with multithreading. Finally, the efficiency of this middleware, which is even competitive with MPI in terms of performance, increments the scalability of communications intensive Java HPC applications. 相似文献

12.

Performance‐based parallel loop self‐scheduling using hybrid OpenMP and MPI programming on multicore SMP clusters

Chao‐Tung Yang Chao‐Chin Wu Jen‐Hsiang Chang 《Concurrency and Computation》2011,23(8):721-744

Parallel loop self‐scheduling on parallel and distributed systems has been a critical problem and it is becoming more difficult to deal with in the emerging heterogeneous cluster computing environments. In the past, some self‐scheduling schemes have been proposed as applicable to heterogeneous cluster computing environments. In recent years, multicore computers have been widely included in cluster systems. However, previous researches into parallel loop self‐scheduling did not consider certain aspects of multicore computers; for example, it is more appropriate for shared‐memory multiprocessors to adopt Open Multi‐Processing (OpenMP) for parallel programming. In this paper, we propose a performance‐based approach using hybrid OpenMP and MPI parallel programming, which partition loop iterations according to the performance weighting of multicore nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node. Experimental results show that the proposed approach performs better than previous schemes. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

13.

多核阵列的任务调度技术研究

陈亦欧吕信科凌翔《计算机科学》2017,44(8):42-45, 70

随着信号处理的复杂度的增加,多核并行架构成为数字信号系统的有效解决方案。主要研究了面向数字信号处理系统的无线多核阵列的任务调度问题。从数字信号处理系统与无线多核阵列的性能和开销要求出发,以功耗、热分布以及延时为优化目标,设计出相应的功耗、热均衡评估与延时模型,作为多目标优化算法的目标函数。同时,在NSGA-II算法的基础上改进拥挤策略与初始种群,并设计新的适应度函数,兼顾3个优化目标的性能,增加探索到更优解的可能性。最后,在无线多核阵列平台上采用多种任务图进行仿真,验证了所提算法的有效性与优越性。相似文献

14.

High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Arslan Munir Farinaz Koushanfar Ann Gordon-Ross Sanjay Ranka 《The Journal of supercomputing》2013,66(1):431-487

Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore, and subsequently many-core architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs)—an architectural innovation in the multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. The TMAs are suitable for these embedded applications due to low-power design features in many of these TMAs. We discuss the performance optimizations on a single tile (processor core) as well as parallel performance optimizations, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication for TMAs. We elaborate compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate the performance and performance per watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs. 相似文献

15.

Transactional memory

Håkan Grahn 《Journal of Parallel and Distributed Computing》2010

Current and future processor generations are based on multicore architectures where the performance increase comes from an increasing number of cores on a chip. In order to utilize the performance potential of multicore architectures the programs also need to be parallel, but writing parallel programs is a non-trivial task. Transactional memory tries to ease parallel program development by providing atomic and isolated execution of code sequences, enabling software composability and protected access to shared data. In addition, transactional memory has the ability to execute atomic code sequences in parallel as long as no data conflicts occur. Transactional memory implementation proposals exist for both hardware and software, as well as hybrid solutions. This special issue on transactional memory introduces transactional memory as a concept, presents an overview of some of the most important approaches so far, and finally, includes five articles that advances the state-of-the-art in transactional memory research. 相似文献

16.

多核机群上数据密集型应用并行程序性能优化

黄华林钟诚《计算机工程与应用》2012,48(30):73-77

在异构多核机群系统上利用数据任务块的动态调度策略和全锁定技术,给出一种面向数据密集型应用的结点内主存和可用的共享二级缓存大小中动态调度数据块的多进程级和多线程级并行编程机制,给出了优化数据密集型应用并行程序性能的策略和技术。在多核计算机组成的异构机群上并行求解随机序列多关键字查找的实验结果表明,所给出的多核并行程序设计机制和性能优化方法可行和高效。相似文献

17.

An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems

Marc Baboulin Dulceneia Becker George Bosilca Anthony Danalis Jack Dongarra 《Parallel Computing》2014

Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, but requires only half the execution time and half the amount of data storage of a general dense solver. 相似文献

18.

Supporting soft real-time parallel applications on multiprocessors

《Journal of Systems Architecture》2014,60(2):152-164

The prevalence of multicore processors has resulted in the wider applicability of parallel programming models such as OpenMP and MapReduce. A common goal of running parallel applications implemented under such models is to guarantee bounded response times while maximizing system utilization. Unfortunately, little previous work has been done that can provide such performance guarantees. In this paper, this problem is addressed by applying soft real-time scheduling analysis techniques. Analysis and conditions are presented for guaranteeing bounded response times for parallel applications under global EDF multiprocessor scheduling. 相似文献

19.

GPGPU test suite minimisation: search based software engineering performance improvement using graphics cards

Shin Yoo Mark Harman Shmuel Ur 《Empirical Software Engineering》2013,18(3):550-593

It has often been claimed that SBSE uses so-called ‘embarrassingly parallel’ algorithms that will imbue SBSE applications with easy routes to dramatic performance improvements. However, despite recent advances in multicore computation, this claim remains largely theoretical; there are few reports of performance improvements using multicore SBSE. This paper shows how inexpensive General Purpose computing on Graphical Processing Units (GPGPU) can be used to massively parallelise suitably adapted SBSE algorithms, thereby making progress towards cheap, easy and useful SBSE parallelism. The paper presents results for three different algorithms: NSGA2, SPEA2, and the Two Archive Evolutionary Algorithm, all three of which are adapted for multi-objective regression test selection and minimization. The results show that all three algorithms achieved performance improvements up to 25 times, using widely available standard GPUs. We also found that the speed-up was observed to be statistically strongly correlated to the size of the problem instance; as the problem gets harder the performance improvements also get better. 相似文献

20.

Explicit formulation of multibody dynamics based on principle of dynamical balance and its parallelization

Seho Shin Jonghoon Park Jaeheung Park 《Multibody System Dynamics》2016,37(2):175-193

Efficient computation of dynamics parameters is one of the important issues in simulation and control of the multibody systems as these systems become more complex. Recent advances in computer architecture are toward multiple core systems rather than high-speed single core systems. Therefore, parallel computation algorithms for dynamics parameters should be designed to improve the performance on these multicore architectures. In this paper, a new dynamics computation algorithm is derived using the principle of dynamical balance, which provides explicit computation of dynamic parameters. This new algorithm has the structure to which parallel computation can be easily applicable. Parallel computation methods are then applied so that we can exploit the structure of the proposed dynamics computation algorithm based on the principle of dynamical balance. The parallel algorithm is designed based on task and data-parallelism. The performance of the proposed algorithm is verified on robots with various topologies. The improved speed of parallel computation is demonstrated through these experiments. 相似文献