Similar literature
1.
2.
In this paper, we propose high-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. We use the four-step or six-step FFT algorithms to implement the radix-2, 3 and 5 parallel 1-D complex FFT algorithms. In our parallel FFT algorithms, since we use cyclic distribution, all-to-all communication takes place only once. Moreover, the input data and output data are both in natural order. We also show that the suitability of a parallel FFT algorithm is machine-dependent because of the differences in the architecture of the processor elements in distributed-memory parallel computers. Experimental results for 2^p·3^q·5^r-point FFTs on two distributed-memory parallel computers, the HITACHI SR2201 and the IBM SP2, are reported. We achieved performance of about 130 GFLOPS on a 1024-PE HITACHI SR2201 and about 1.25 GFLOPS on a 32-PE IBM SP2.
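
The four-step decomposition mentioned in this abstract can be illustrated with a small serial sketch. It is not the authors' parallel implementation: the function names `dft` and `four_step_fft` and the factor arguments `n1` and `n2` are assumptions made for illustration, and a naive DFT stands in for the radix-2, 3 and 5 kernels. The sketch shows how an N = n1·n2 point transform splits into short transforms over stride-n1 subsequences, a twiddle multiplication, and a second set of short transforms.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used as the short-transform kernel and as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """Compute the DFT of x (length n1 * n2) with the four-step decomposition.

    Input index:  j = j1 + j2 * n1   (j1 < n1, j2 < n2)
    Output index: k = k2 + k1 * n2   (k1 < n1, k2 < n2)
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Step 1: n1 independent n2-point DFTs over the stride-n1 subsequences.
    rows = [dft(x[j1::n1]) for j1 in range(n1)]

    # Step 2: multiply element (j1, k2) by the twiddle factor w_N^(j1 * k2).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Steps 3 and 4: transpose (read columns), then n2 independent n1-point DFTs.
    out = [0j] * n
    for k2 in range(n2):
        column = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[k2 + k1 * n2] = column[k1]
    return out
```

For a 30-point input (30 = 2·3·5), `four_step_fft(x, 6, 5)` should agree with `dft(x)` up to rounding error. In a distributed setting the column read in the last step is where data would have to move between processors.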

3.
The control of a statically configured pipeline corresponds to certain paths in its state graph. Properties of this graph and algorithms for optimal paths are discussed.

4.
5.
In this paper we describe a new algorithm for maintaining a balanced search tree on a message-passing MIMD architecture; the algorithm is particularly well suited for implementation on a small number of processors. We introduce a (2B-2, 2B) search tree that uses a bidirectional ring of O(log n) processors to store n entries. Update operations use a bottom-up node-splitting scheme, which performs significantly better than top-down search tree algorithms. The bottom-up algorithm requires far fewer messages and results in less blocking due to synchronization than top-down algorithms. Additionally, for a given cost ratio of computation to communication, the value of B may be varied to maximize performance. Implementations on a parallel-architecture simulator are described.

6.
The problem of performing a global combine (summation) operation on a distributed memory computer using a two-dimensional mesh interconnect with wormhole routing is considered. We present algorithms that are asymptotically optimal for short vectors (O(log(p)) for p processing nodes) and for long vectors (O(n) for n data elements per node), as well as hybrid algorithms that are superior for intermediate n. The algorithms are analyzed using detailed performance models that include the effects of link conflicts and other characteristics of the underlying communication system. The models are validated using experimental data from the Intel Touchstone DELTA computer. We show that no one algorithm is optimal for all vector lengths; rather, each of the presented algorithms is superior under some circumstances.
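
The O(log(p)) bound for short vectors can be made concrete with a serial simulation of recursive doubling, a generic global-sum pattern. This is an illustrative stand-in and is not claimed to be one of the algorithms of this paper; the function name `global_sum_recursive_doubling` is hypothetical, p is assumed to be a power of two, and mesh topology and link conflicts are ignored.

```python
def global_sum_recursive_doubling(values):
    """Simulate a global sum over p nodes, one scalar per node.

    In the step with distance d, node i exchanges its partial sum with
    node i XOR d and adds the two. After log2(p) steps every node holds
    the full sum. Returns (per-node results, number of exchange steps).
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("number of nodes must be a power of two")

    local = list(values)
    steps = 0
    distance = 1
    while distance < p:
        # All nodes exchange and combine at the same time in this step.
        local = [local[i] + local[i ^ distance] for i in range(p)]
        distance *= 2
        steps += 1
    return local, steps
```

With eight nodes holding 1 through 8, every node ends with 36 after three exchange steps.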

7.
8.
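This paper presents two new algorithms for eigenstructure assignment in two-dimensional linear systems and gives sufficient conditions for the existence of the state feedback matrix.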

9.
The challenges imposed by environmental issues, such as global warming and the energy crisis, are demanding more responsible energy usage, including in the optical networking field. In optical transmission networks, most of the electrical power is consumed by the optical-electrical-optical conversion in optical repeaters. Modern optical network control plane technologies allow idle optical repeaters to be put into a low-power sleep mode. Inspired by this, we propose a novel power-efficient routing and wavelength assignment (RWA) algorithm, called HTAPE. The HTAPE algorithm exploits the knowledge of the connection holding times to minimize the number of optical repeaters in the active mode, and hence reduce the total electricity consumption of the optical network. We test the new algorithm on the typical CERNET and USNET networks. Compared with traditional RWA algorithms without holding-time-awareness, it is observed that the HTAPE algorithm yields significant reductions in power consumption.

10.
11.
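This paper presents an 8×8 two-dimensional inverse discrete cosine transform (IDCT) processor based on the row-column decomposition algorithm. The conventional input registers that hold the input column vectors and the parallel-to-serial conversion registers are no longer needed, which reduces both chip area and processing latency. The one-dimensional discrete cosine transform is implemented with look-up tables, and the ROM used for these tables is also much smaller than the ROM of the conventional distributed-arithmetic approach. The proposed 2-D IDCT processor not only offers optimized area, low latency, and high throughput, but also has a regular, fully pipelined structure, which makes it well suited to VLSI and FPGA implementation.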
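
Row-column decomposition, on which the processor in this abstract is based, can be sketched in software as follows. This is a floating-point illustration only and does not model the look-up-table hardware described by the authors; the function names `idct_1d` and `idct_2d_row_column` are hypothetical, and the orthonormal scaling convention is an assumption.

```python
import math


def idct_1d(coeffs):
    """Orthonormal 1-D inverse DCT (DCT-III) of a list of coefficients."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos((2 * i + 1) * k * math.pi / (2 * n))
        out.append(total)
    return out


def idct_2d_row_column(block):
    """2-D inverse DCT computed as 1-D IDCTs on rows, then on columns."""
    # Pass 1: transform every row.
    rows = [idct_1d(row) for row in block]
    # Pass 2: transform every column of the intermediate result.
    cols = [idct_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed as [row][column].
    return [list(row) for row in zip(*cols)]
```

An 8×8 block whose only non-zero coefficient is a DC value of 8 reconstructs to a block in which every sample equals 1, which is a quick sanity check of the scaling.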

12.
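The advantages and disadvantages of the main current real-time wormhole routing algorithms are comprehensively analyzed and compared.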

13.
14.
A decimation-in-time radix-2 fast Fourier transform (FFT) algorithm is considered here for implementation in multiprocessors with shared bus, multistage interconnection network (MIN), and in mesh connected computers. Results are derived for data allocation, interprocessor communication, approximate computation time, and speedup of an N-point FFT on any P available processing elements (PEs). Further generalization is obtained for a radix-r FFT algorithm. An N × N-point two-dimensional discrete Fourier transform (DFT) implementation is also considered when one or more rows of the input data matrix are allocated to each PE.
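
For reference, the decimation-in-time radix-2 recursion that this abstract builds on can be written as a short serial sketch. It says nothing about the paper's data allocation or interprocessor communication; the function name `fft_radix2` is hypothetical and the input length is assumed to be a power of two.

```python
import cmath


def fft_radix2(x):
    """Recursive decimation-in-time radix-2 FFT of a power-of-two-length list."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")

    # Split into even- and odd-indexed samples and transform each half.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])

    # Butterfly: combine the half-size transforms with twiddle factors.
    out = [0j] * n
    half = n // 2
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

A unit impulse `[1, 0, 0, 0]` transforms to four ones, and a constant input `[1, 1, 1, 1]` transforms to `[4, 0, 0, 0]` up to rounding.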

15.
In this correspondence, some image transforms and features such as projections along linear patterns, convex hull approximations, Hough transform for line detection, diameter, moments, and principal components will be considered. Specifically, we present algorithms for computing these features which are suitable for implementation in image analysis pipeline architectures. In particular, random access memories and other dedicated hardware components which may be found in the implementation of classical techniques are no longer needed in our algorithms. The effectiveness of our approach is demonstrated by running some of the new algorithms in conventional short pipelines for image analysis. In related papers, we have shown a pipeline architecture organization called PPPE (Parallel Pipeline Projection Engine), which unleashes the power of projection-based computer vision, image processing, and computer graphics. In the present correspondence, we deal with just a few of the many algorithms which can be supported in PPPE. These algorithms illustrate the use of the Radon transform as a tool for image analysis.

16.
The scalability of communication infrastructure in modern Integrated Circuits (ICs) becomes a challenging issue, which might be a significant bottleneck if not carefully addressed. Towards this direction, the usage of Networks-on-Chip (NoC) is a preferred solution. In this work, we propose a software-supported framework for quantifying the efficiency of heterogeneous 3-D NoC architectures. In contrast to existing approaches for NoC design, the introduced heterogeneous architecture consists of a mixture of 2-D and 3-D routers, which reduces the delay and power consumption with a slight impact on packet hops. More specifically, the experimental results with a number of DSP applications show the effectiveness of the introduced methodology, as we achieve on average 25% higher maximum operation frequency and 39% lower power consumption compared to the uniform 3-D NoCs.

17.
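Optimization of the 1-D FFT algorithm on a many-core processor with hardware-software co-support   Total citations: 2, self-citations: 0, citations by others: 2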
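With the growing demand for high-performance computing, on-chip many-core processors are becoming the direction of future processor architectures. The fast Fourier transform (FFT), an important application in high-performance computing, places high demands on both computing power and communication bandwidth. Implementing an efficient, scalable FFT algorithm on a many-core platform is therefore a challenge shared by algorithm designers and architecture designers. This paper optimizes and evaluates the 1-D FFT algorithm on the Godson-T many-core processor. While saving almost one third of the L2 cache storage overhead, optimization strategies such as hiding the matrix transpose and overlapping computation with communication give the optimized 1-D FFT algorithm a performance improvement of more than 3×. Experimental analysis of on-chip network congestion further shows that, for memory-bandwidth-bound applications such as FFT, increasing the L2 cache access bandwidth relieves the pressure that bursty reads and writes place on the on-chip network and the L2 cache, which further improves program performance and scalability.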

18.
OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. In this we focus on some of the software architecture design choices and low-level optimizations to maximize performance on NVIDIA’s Fermi architecture GPUs. The performance impact of these design choices is quantified on two NVIDIA GPUs (GTX560Ti, Tesla C2070) using the end-to-end performance of an industrial representative CFD application developed using the OP2 API. Results show that for each system, a number of key configuration parameters need to be set carefully in order to gain good performance. Utilizing a recently developed auto-tuning framework, we explore the effect of these parameters, their limitations and insights into optimizations for improved performance.

19.
20.
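Two-dimensional ring/double-ring interconnected Petersen graph networks and their routing algorithms   Total citations: 4, self-citations: 1, citations by others: 4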
Wang Lei, Lin Yaping, Chen Zhiping, Wen Xue. Chinese Journal of Computers (《计算机学报》), 2004, 27(9): 1290-1296
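A new method for extending the Petersen graph based on the double-ring structure is proposed, and on this basis a two-dimensional double-ring interconnected Petersen graph network DCP(k) is constructed. The properties of the two-dimensional ring interconnected Petersen graph network TCP(k) are analyzed, and conditions are given under which TCP(k) has a better diameter and groupability than the 2-D Torus interconnection network. DCP(k) and TCP(k) are proved to have good scalability and connectivity; moreover, for interconnection networks of 10×k nodes, both DCP(k) and TCP(k) have a smaller diameter and better groupability than RP(k) and the 2-D Torus interconnection network. Finally, unicast and broadcast routing algorithms are designed for DCP(k) and TCP(k), and their communication efficiency is proved to be clearly higher than that of the corresponding algorithms on RP(k), with DCP(k) outperforming TCP(k).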
