首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到18条相似文献,搜索用时 187 毫秒
1.
网格连接的处理器阵列是一种应用广泛的高性能体系结构,而容错处理器阵列的重构技术是近年来的研究热点之一.现有的研究多数集中在串行重构算法上,忽视了该结构重构时内在的可并行性.本文根据阵列结构的特点设计了一种基于VHDL语言的重构算法,该算法从第一行的各个无故障处理器单元同时向下选路,具有潜在的并行性,.实验结果表明,与现有的串行算法相比,本文提出的并行算法同样能够生成最大规模的目标阵列并且当物理阵列大小为48×48,本文提出的并行算法加速重构将近20倍.  相似文献   

2.
从多处理器阵列中获取所需大小并且同步通讯性能优良的子阵列,是高性能拓扑重构的核心问题之一。基于不同的逻辑列剔除策略提出了3种面向通讯同步的拓扑重构算法:基于分治思想剔除逻辑列的重构算法(SCA_01),该算法能够使被优化的逻辑列相对均匀地分布在物理阵列中;优先剔除长逻辑列的贪心重构算法(SCA_02),该算法能够使被优化的逻辑列的长链接总数最少;基于分治与长链接数的混成重构算法(SCA_03),该算法将某一区域内的最长逻辑列剔除,且尽可能将剩余逻辑列均匀分布在物理阵列中。同时,对逻辑阵列的最大通讯延时给出了下界的求解算法。实验结果表明,3种算法在故障率小于1%、逻辑列的剔除率超过20%时,算法重构出的逻辑阵列的通讯延时特别接近计算出的性能下界。在多数情况下SCA_01优于SCA_02和SCA_03,而后两者的性能相近。在小阵列上且故障率与剔除率较小时,SCA_02具有性能优势,但在大阵列上SCA_03具有优势。在32×32的阵列上,SCA_01构造的阵列产生的通讯延时较SCA_02和SCA_03产生的延时平均减少25%,并且运行速度也提升了19.4%。  相似文献   

3.
可重构多处理器阵列上的容错技术可用来重构含有故障单元的处理器阵列,以便获得最大可用的目标阵列。现有的研究成果主要侧重于重构算法的构造,还没有涉及对重构后目标阵列的同步通讯性能的研究。提出了一种改善目标阵列同步通讯性能的电路优化算法,用来降低目标阵列行与行之间通讯的延时,使得相邻两行处理器的通讯尽可能达到同步。实验结果表明,提出的算法对不同大小、不同故障率的阵列都有相应的同步通讯性能的改善。  相似文献   

4.
将OpenACC编程模型用于异构多核处理器时,由于异构多核处理器加速设备内存有限,操作大量数据的代码不能获得很好的加速。针对这一问题,在OpenACC中引入循环分块子句,对循环进行分块处理,使每个循环块使用的数据能够存储在设备内存中;提出面向异构多核处理器的循环分块子句生成算法,并在基于Open64的"源-源"自动并行化系统Auto-ACC中进行实现。测试结果表明,在异构多核处理器上,扩展的循环分块子句及所提生成算法能够对程序进行明显的加速。  相似文献   

5.
基于FPGA的嵌入式多核处理器及SUSAN算法并行化   总被引:1,自引:0,他引:1  
给出了四核心嵌入式并行处理器FPEP的结构设计并建立了FPGA验证平台.为了对多核处理器平台性能进行评测,提出了基于OpenMP的3种可行的图像处理领域的经典算法SUSAN算法的并行化方法:直接并行化SUSAN、图像分块处理和多图像并行处理,并对这3种并行算法在Intel四核心平台和FPEP的FPGA验证平台上进行性能测试.实验表明,3种并行算法在两种四核心平台下均可获得接近3.0的加速比,多图像并行处理在FPEP的FPGA验证平台可以获得接近4.0的加速比.  相似文献   

6.
逐次松弛迭代算法(SOR)是求解线性方程组的一种常用迭代算法,当系数矩阵正定时,它具有较快的收敛速度。但是,由于每个迭代步内存在数据相关,它难以实现并行计算。目前的SOR并行算法采用数据分解的方法,但由于该法并行区域过小,同步通讯代价大,并行效率低。本文提出了SOR的一种新型并行算法,该算法与传统SOR方法等价,具有相同的收敛性和迭代结果。该并行算法通过矩阵分块增大了可并行计算的区域,并引入流水线技术,利用各处理器间通讯与计算时间的重叠,获得较理想的并行加速效率。通过多核微机以及小规模集群上的数值实验证明,本文提出的SOR并行算法在求解大型稠密线性方程组时具有较好的并行效率。  相似文献   

7.
高效的容错技术对于提高多处理器系统的可靠性至关重要。环网(Torus)是连接多处理器阵列的重要网络结构,而环网处理器阵列上的容错重构技术目前尚属空白。针对环网阵列的特殊连接方式,将环网阵列重构问题转化为矛盾图上求解最大独立集问题。矛盾图上的结点表示故障处理器的替换方案,而边代表了不同替换方案之间的不可共存特性。主要是根据三种不同的冗余处理器分布方案,设计生成矛盾图算法,求解最大独立集算法,以及由独立集生成逻辑处理器阵列算法,取得了令人满意的结果。实验结果表明,当阵列规模较小或故障率较低时,一行一列和十字型的冗余单元分布的重构能力较好;而随着阵列规模或故障率的增大,三种冗余单元分布策略的重构成功率都随之下降,但可通过增加冗余单元以及调整冗余分布来改善容错效果。此外,从实验结果中还可以看出,环网处理器阵列的容错能力显然优于网格(Mesh)处理器阵列。  相似文献   

8.
张思乾  程果  陈荤  熊伟 《计算机科学》2012,39(1):295-298
随着处理器由高主频的单核处理器逐步转向片上多核处理器(CMP),计算机并行处理能力不断提升。通过分析GIS串行算法面临的性能瓶颈,利用CMP的优势,采用线程级并行处理栅格数据。针对边缘提取算法,深入分析和比较了MPI、OpenMP等当前主流的并行编程模式,提出了并行性能估计模型。基于OpenMP编程模型分析线程数、调度方式和分块大小对算法并行性能的影响,实现边缘提取最优并行。实验证明,性能评估模型能够准确预测CMP环境下的并行性能,基于OpenMP实现的边缘提取并行算法能够提高图像边缘提取效率。  相似文献   

9.
随着处理器由高主频的单核处理器逐步转向片上多核处理器(CMP),计算机并行处理能力不断提升.通过分析GIS串行算法面临的性能瓶颈,利用CMP的优势,采用线程级并行处理栅格数据.针对边缘提取算法,深入分析和比较了MPI、OpenMP等当前主流的并行编程模式,提出了并行性能估计模型.基于OpenMP编程模型分析线程数、调度方式和分块大小对算法并行性能的影响,实现边缘提取最优并行.实验证明,性能评估模型能够准确预测CMP环境下的并行性能,基于OpenMP实现的边缘提取并行算法能够提高图像边缘提取效率.  相似文献   

10.
新一代视频编码标准获得了较高的编码效率,但同时也增加了计算量。HEVC(High Efficiency Video Coding)并行算法能够提高编码速度,开发适用于多核处理器的并行编码算法对于满足高清视频实时传输和大规模实时共享具有十分重要的意义。分析帧内预测算法在处理像素过程中数据之间的依赖关系,进行基于预测模式的细粒度并行性的设计。块与块之间采用流水线处理,减少帧内预测算法的执行时间。利用动态可编程可重构视频阵列处理器,对帧内预测算法进行验证。实验结果表明,相比于HM16.0官方测试标准,信噪比提高了10%,算法的执行时间减少了大约70%。  相似文献   

11.
Effective fault tolerance techniques are essential for improving the reliability of multiprocessor systems. At the same time, fault tolerance must be achieved at high speed to meet the real-time constraints of embedded systems. While parallelism has often been exploited to increase performance, to the best of our knowledge, there has been no previously reported work on parallel reconfiguration of mesh-connected processor arrays with faults. This paper presents two parallel algorithms to accelerate reconfiguration of the processor arrays. The first algorithm reconfigures a host array in parallel in a multithreading manner. The threads in the parallel algorithm execute independently within a safe rerouting distance. The second algorithm is based on a divide-and-conquer approach to first generate the leftmost segments in parallel and then merge the segments in parallel. When compared to the conventional algorithm, simulation results from a large number of instances confirm that the proposed algorithms significantly accelerate the reconfiguration without loss of harvest.  相似文献   

12.
Homogeneous processor arrays are emerging in tera-scale computation and effective fault tolerance techniques are essential to improving the reliability of such complex integrated circuits. We study the degradable processor arrays to achieve fault tolerance by employing reconfiguration. Three bypass schemes and three rerouting schemes are proposed to reconfigure three-dimensional processor arrays with defective processors to achieve target arrays without faults. A heuristic algorithm is proposed to construct a target array on the selected rows and columns. It is also proved that the proposed greedy plane rerouting algorithm (GPR) produces maximum target array. In addition, the problem of constructing the communication efficient array is considered in this paper. An algorithm is proposed to refine the communication among processors within the target array constructed by GPR. Experimental study shows that the proposed algorithm GPR produces target arrays with higher harvest and lower degradation on the host arrays with fault density no more than 5%. In addition, the communication performance is significantly optimized by reducing the number of long interconnects, and the average improvement is about 34% for all cases considered in this paper.  相似文献   

13.
In a multiprocessor array, some processing elements (PEs) fail to function normally due to hardware defects or soft faults caused by overheating, overload or occupancy by other running applications. Fault-tolerant reconfiguration reorganizes fault-free PEs to a new regular topology by changing the interconnection among PEs. This paper investigates the problem of constructing as large as possible logical array with short interconnects from a physical array with faults. A flexible rerouting scheme is developed to improve the efficiency of utilizing fault-free PEs. Under the scheme, two efficient reconfiguration algorithms are proposed. The first algorithm is able to generate the maximum logical array (MLA) in linear time. The second algorithm reduces the interconnect length of the MLA, and it is capable of producing nearly optimal logical arrays in comparison to the lower bound of the interconnect length, that is also proposed in this paper. Experimental results validate the efficiency of the flexible rerouting schemes and the proposed algorithms. For 128×128 host arrays with 30% unavailable PEs, the proposed approaches improve existing algorithm up to 44% in terms of logical array size, while reducing the interconnection redundancy by 49.6%. In addition, the proposed algorithms are more scalable than existing approaches. On host arrays with 50% unavailable PEs, our algorithms can produce logical arrays with harvest over 56% while existing approaches fail to construct a feasible logical array.  相似文献   

14.
The authors consider a mesh-connected processor array with broadcast, in which the processors in each row and in each column can access a shared bus. This computation model is an abstraction of commercially available parallel machines such as the distributed array processor (DAP). The authors show efficient data movement using the buses, which leads to significantly faster solution times for several image problems compared to those on an ordinary mesh-connected array. For several problems, the time performance is comparable to that on the pyramid of corresponding size. Alternate parallel organizations with broadcast feature are also shown. Utilizing the broadcast feature on the arrays can lead to significant improvement of the time performance of many image computations  相似文献   

15.
16.
The distance calculation in an image is a basic operation in computer vision, pattern recognition, and robotics. Several parallel algorithms have been proposed for calculating the Euclidean distance transform (EDT). Recently, Chen and Chuang proposed a parallel algorithm for computing the EDT on mesh-connected SIMD computers (1995). For an nxn image, their algorithm runs in O(n) time on a two-dimensional (2-D) nxn mesh-connected processor array. In this paper, we propose a more efficient parallel algorithm for computing the EDT on a reconfigurable mesh model. For the same problem, our algorithm runs in O(log(2)n) time on a 2-D nxn reconfigurable mesh. Since a reconfigurable mesh uses the same amount of VLSI area as a plain mesh of the same size does when implemented in VLSI, our algorithm improves the result in [3] significantly.  相似文献   

17.
In this paper, we propose a new I/O overhead free Givens rotations based parallel algorithm for solving a system of linear equations. The algorithm uses a new technique called two-sided elimination and requires an N×(N+1) mesh-connected processor array to solve N linear equations in (5N-log N-4) time steps. The array is well suited for VLSI implementation as identical processors with simple and regular interconnection pattern are required. We also describe a fault-tolerant scheme based on an algorithm based fault tolerance (ABFT) approach. This scheme has small hardware and time overhead and can tolerate up to N processor failures  相似文献   

18.
通过冗余修复方法来解决超大规模集成电路(VLSI)制造过程中因缺陷而造成的成品率低的问题。根据物理阵列中缺陷单元的分布情况,构造相应的矛盾图模型,将阵列的重构问题转化为用蚁群优化算法求解矛盾图的最大独立集问题,使得所求独立集的顶点个数恰为缺陷单元的个数。实验表明,与标准遗传算法和神经网络算法相比,用蚁群优化算法来求解单通道冗余VLSI阵列重构问题是简单有效的。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号