首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
王洁  衷璐洁  曾宇 《计算机科学》2011,38(10):281-284
多核处理器的新特性使多核机群的存储层次更加复杂,同时也给MPI程序带来了新的优化空间。国内外学 者提出了许多多核机群下MPI程序的优化方法和技术。测试了3个不同多核机群的通信性能,并分别在Intel与 AMD多核机群下实验评估了几种具有普遍意义的优化技术:混合MPI/OpcnMP、优化MPI运行时参数以及优化 MPI进程摆放,同时对实验结果和优化性能进行了分析。  相似文献   

2.
大规模并行应用程序的性能优化和并行化的关键瓶颈之一在于多核CPU中越来越深和越来越复杂的存储层次。文中系统地分析和总结了当前主要多核CPU和并行程序设计语言中的局部性设计方法,提出了两种局部性,即横向局部性和纵向局部性,从这两种局部性的视角深入分析了当前的主要并行程序设计语言的局部性设计机制,进一步总结对比了其优缺点,并指出了新一代并行程序设计语言应具有的特点,重点提出了新语言应同时综合考虑两种局部性支持的设计机制的研究观点。  相似文献   

3.
This paper presents a scalable and efficient Message-Passing in Java (MPJ) collective communication library for parallel computing on multi-core architectures. The continuous increase in the number of cores per processor underscores the need for scalable parallel solutions. Moreover, current system deployments are usually multi-core clusters, a hybrid shared/distributed memory architecture which increases the complexity of communication protocols. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the gap between Java and compiled languages performance has been narrowing for the last years, Java is an emerging option for High Performance Computing (HPC).  相似文献   

4.
Executing traditional Message Passing Interface (MPI) applications on multi-core cluster balancing speed and computational efficiency is a difficult task that parallel programmers have to deal with. For this reason, communications on multi-core clusters ought to be handled carefully in order to improve performance metrics such as efficiency, speedup, execution time and scalability. In this paper we focus our attention on SPMD (Single Program Multiple Data) applications with high communication volume and synchronicity and also following characteristics such as: static, local and regular. This work proposes a method for SPMD applications, which is focused on managing the communication heterogeneity (different cache level, RAM memory, network, etc.) on homogeneous multi-core computing platform in order to improve the application efficiency. In this sense, the main objective of this work is to find analytically the ideal number of cores necessary that allows us to obtain the maximum speedup, while the computational efficiency is maintained over a defined threshold (strong scalability). This method also allows us to determine how the problem size must be increased in order to maintain an execution time constant while the number of cores are expanded (weak scalability) considering the tradeoff between speed and efficiency. This methodology has been tested with different benchmarks and applications and we achieved an average improvement around 30.35% of efficiency in applications tested using different problems sizes and multi-core clusters. In addition, results show that maximum speedup with a defined efficiency is located close to the values calculated with our analytical model with an error rate lower than 5% for the applications tested.  相似文献   

5.
A framework for synthesizing communication-efficient distributed-memory parallel programs for block recursive algorithms such as the fast Fourier transform (FFT) and Strassen's matrix multiplication is presented. This framework is based on an algebraic representation of the algorithms, which involves the tensor (Kronecker) product and other matrix operations. This representation is useful in analyzing the communication implications of computation partitioning and data distributions. The programs are synthesized under two different target program models. These two models are based on different ways of managing the distribution of data for optimizing communication. The first model uses point-to-point interprocessor communication primitives, whereas the second model uses data redistribution primitives involving collective all-to-many communication. These two program models are shown to be suitable for different ranges of problem size. The methodology is illustrated by synthesizing communication-efficient programs for the FFT. This framework has been incorporated into the EXTENT system for automatic generation of parallel/vector programs for block recursive algorithms.  相似文献   

6.
F-MPJ: scalable Java message-passing communications on parallel systems   总被引:1,自引:0,他引:1  
This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ boosts this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance up to seven times.  相似文献   

7.
针对并行计算机体系结构中没有通用的计算模型这一问题,分析了一些现有的典型计算模型,在同步性、通信方式、参数方面进行比较,以LogGP模型为基础提出一种改进的mzLogGP模型。利用MPI并行算法对满足节点计算资源非独占、网络存在拥塞条件下的并行程序进行分析与测试,通过增加memory层次化层数和网络拥塞指数这两个参数,计算其计算开销和通信开销,将实测时间与预测时间进行比较,可知随节点数的增加系统误差不断减小,说明该新模型能改善并行应用在多核处理器集群平台上运行的性能,具有较好的可扩展性。  相似文献   

8.
Multicast is one of the most frequently used collective communication operations in multi-core SoC platforms. Bus as the traditional interconnect architecture for SoC development has been highly efficient in delivering multicast messages. Since the bus is non-scalable, it can not address the bandwidth requirements of the large SoCs. The networks on-chip (NoCs) emerged as a scalable alternative to address the increasing communication demands of such systems. However, due to its hop-to-hop communication, the NoCs may not be able to deliver multicast operations as efficiently as buses do. Adopting multi-port routers has been an approach to improve the performance of the multicast operations in interconnection networks. This paper presents a novel analytical model to compute communication latency of the multicast operation in wormhole-routed interconnection networks employing asynchronous multi-port routers scheme. The model is applied to the Quarc NoC and its validity is verified by comparing the model predictions against the results obtained from a discrete-event simulator developed using OMNET++.  相似文献   

9.
An island model is a typical implementation of genetic programming on parallel computers with distributed memory. The island model has a migration facility that sends/receives some individuals in an island to/from another island to maintain diversity. The island model requires synchronization to migrate same-generation individuals between islands, and this synchronization causes an increase in computation time. This article proposes a new parallel genetic programming implementation based on the island model with asynchronous migration. Most recent computers are equipped with one or more multi-core processors, and are suitable for multi-threading. Therefore we employ a communication thread for migration between islands. The communication thread on a processor communicates with the communication thread on another processor to migrate individuals at appropriate intervals. Since the migration and other genetic operations can be independently processed on each core, and since we allow the exchange of individuals of different generations, no synchronization is needed in our implementation. In addition, a fitness calculation is also executed in parallel by the remaining cores. Experimental results show that the proposed method can reduce the computation time to about 17% in serial GP by using 40 threads.  相似文献   

10.
郑启龙  汪睿  周寰 《计算机应用》2011,31(6):1453-1457
大规模集群已经发展到多核的时代,多核架构对并行计算提出了新的要求。消息传递接口(MPI)是最常用的并行编程模型,而群集通信又是MPI中的重要组成部分。研究高效的群集通信算法对并行计算效率的提升有着重要的作用。KD60平台是采用首款国产多核芯片——龙芯3号搭建的国产万亿次多核集群。首先分析了KD60平台多核集群的体系特征以及多核架构下通信具有的层次性特征;然后分析原有群集通信算法实现原理及其不足;最后以广播为例,在原有算法基础上,采用一种基于片上多核(CMP)架构改进算法,改变原有算法通信模式,同时结合实验平台KD60体系特征,对算法做了体系相关优化。实验结果表明,改进算法能够很好地利用多核结构的特点,提高了群集通信广播算法的性能。  相似文献   

11.
Abstract Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is a very tedious and error-prone approach. We believe that it is desirable to possess a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP, including 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on PreeScale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads.  相似文献   

12.
Accurate measurement and modeling of network performance is important for predicting and optimizing the running time of high-performance computing applications. Although the LogP family of models has proven to be a valuable tool for assessing the communication performance of parallel architectures, non-intrusive LogP parameter assessment of real systems remains a difficult task. Based on an analysis of accuracy and contention properties of existing measurement methods, we develop a new low-overhead measurement method which also assesses protocol changes in the underlying transport layers. We use the gathered parameters to simulate LogGP models of collective operations and demonstrate the errors in common benchmarking methods for collective operations. The simulations provide new insight into the nature of collective algorithms and their pipelining properties. We show that the error of conventional benchmark methods can grow linearly with the system size.  相似文献   

13.
异构众核架构具有超高的性能功耗比,已成为超级计算机体系结构的重要发展方向.但众核系统更为复杂的并行层次和存储层次,给编程和优化带来了极大的挑战,因此研究面向众核系统的并行编程技术,对于降低国产众核系统并行应用的编程难度、提升并行程序的性能都具有重要的意义.提出统一架构的多模式并行编程模型,包括异构融合的加速运算模型和按同构方式编程的自主运算模型,根据编程模型设计了Parallel C语言,能有效描述国产众核系统的异构并行性,与其它众核系统上MPI+X的使用模式相比,编程和系统优化都具有全局视角,在多级局部性描述、单边消息、兼容已有多核应用等方面具有特色;基于Open64构建了Parallel C编译系统,全面支持加速运算模型和自主运算模型,提出并实现了数据布局与自动DMA、编译指导的线程代理和拓扑位置感知的集合通信等优化.Micro Benchmark和实际应用在神威太湖之光计算机系统上的测试数据表明,Parallel C语言和编译系统具有良好的性能和可扩展性,能够有效支撑大型应用.  相似文献   

14.
基于TI公司的异构多核数字信号处理SoC芯片C66AK2H06完成一款PE超声相控阵探伤仪的算法处理及软件部分的设计。设计着重围绕新系统架构中的软件架构、算法组成、多核并行计算、核间通信、片间通信、共享内存和多层内存分配等展开,并对算法处理性能、实时性性能等方面的关键指标进行优化。研究结果表明采用C66AK2H06实现的系统架构更加灵活,性能得到提升,具有很好的可扩展性和可升级性,基于该架构研制的PE超声相控阵探伤仪实现了更高性能、更低功耗以及更好的便携性。  相似文献   

15.
《Parallel Computing》2014,40(10):611-627
Work-stealing and work-sharing are two basic paradigms for dynamic task scheduling. This paper introduces an adaptive and hierarchical task scheduling scheme (AHS) for multi-core clusters, in which work-stealing and work-sharing are adaptively used to achieve load balancing.Work-stealing has been widely used in task-based parallel programing languages and models, especially on shared memory systems. However, high inter-node communication costs hinder work-stealing from being directly performed on distributed memory systems. AHS addresses this issue with the following techniques: (1) initial partitioning, which reduces the inter-node task migrations; (2) hierarchical scheduling scheme, which performs work-stealing inside a node before going across the node boundary and adopts work-sharing to overlap computation and communication at the inter-node level; and (3) hierarchical and centralized control for inter-node task migration, which improves the efficiency of victim selection and termination detection.We evaluated AHS and existing work-stealing schemes on a 16-nodes multi-core cluster. Experimental results show that AHS outperforms existing schemes by 11–21.4%, for the benchmarks studied in this paper.  相似文献   

16.
在列数据库中,连接操作依然是最核心和最耗时的操作,GPU强大的计算能力可为此提供新的优化手段。基于Fermi架构,提出了新的Hash Join算法和Sort merge Join算法,其基本思想是充分利用该架构新增的缓存结构来减少连接操作的cache缺失率。与CUDA stream技术相结合,新算法在输出结果较多时可以有效地隐藏主存与显存间数据传输带来的延迟,进一步提升其执行效率。实验结果证实了基于Fcrmi架构的Hash Join算法处理偏抖数据的高效性及Sort merge Join算法的稳定性,并且通过比较表明,这两种算法的性能全面优于基于多核CPU充分优化的Join算法,最大加速2.4倍,在外键分布高偏抖时新的Hash Join算法的执行速度甚至达到每秒217M元组。  相似文献   

17.
建立一个适用于整数序列排序的数据分配模型,在多核计算节点组成的异构机群上设计通信高效的整数序列并行算法。所提出的数据分配模型依据机群中各节点不同的计算能力、通信速率和存储容量,动态计算出调度分配给各节点的数据块的大小以平衡各个节点的负载。所设计的并行排序算法利用整数序列的特性,主节点采取两轮分发数据与接收结果的方法,从节点运用分桶打包方式返回有序的整数子序列给主节点,主节点采用桶映射方法将各个有序子序列直接整合成最终有序序列,以减少需要耗费较多通信时间的数据归并操作。分析与实验测试结果表明,给出的多核机群上的整数序列并行排序算法高效,具有良好的可扩展性。  相似文献   

18.
面向层次化NoC的混合并行编程模型   总被引:1,自引:0,他引:1       下载免费PDF全文
曹祥  易伟  潘红兵  高明伦  李丽 《计算机工程》2010,36(13):278-280
为更好发挥多核处理器的硬件性能,针对层次化的片上网络架构,提出MPI/OpenMP混合并行编程模型。运用基于MPI的任务级并行模型实现片内簇间的高效通信,采用OpenMP模型实现簇内四核的通信、同步和数据交换。实验结果表明,与单一并行编程模型相比,混合并行编程模型加速比提高了20%~50%。  相似文献   

19.
Unified Parallel C(UPC) is a parallel extension of ANSI C based on the Partitioned Global Address Space(PGAS) programming model,which provides a shared memory view that simplifies code development while it can take advantage of the scalability of distributed memory architectures.Therefore,UPC allows programmers to write parallel applications on hybrid shared/distributed memory architectures,such as multi-core clusters,in a more productive way,accessing remote memory by means of different high-level language constructs,such as assignments to shared variables or collective primitives.However,the standard UPC collectives library includes a reduced set of eight basic primitives with quite limited functionality.This work presents the design and implementation of extended UPC collective functions that overcome the limitations of the standard collectives library,allowing,for example,the use of a specific source and destination thread or defining the amount of data transferred by each particular thread.This library fulfills the demands made by the UPC developers community and implements portable algorithms,independent of the specific UPC compiler/runtime being used.The use of a representative set of these extended collectives has been evaluated using two applications and four kernels as case studies.The results obtained confirm the suitability of the new library to provide easier programming without trading off performance,thus achieving high productivity in parallel programming to harness the performance of hybrid shared/distributed memory architectures in high performance computing.  相似文献   

20.
一种用于评估多核处理器存储层次性能的模型,使用排队论建模,求解速度快,可以在设计早期给出不同配置参数对处理器整体性能的影响,从而调整存储层次结构,优化设计.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号