Similar literature
 20 similar documents found (search time: 250 ms)
1.
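江南 (Jiang Nan), 《福建电脑》 (Fujian Computer), 2002, (12): 3-4
This paper briefly introduces the algorithmic strategies of parallel processing and gives a descriptive definition of parallel processing, then describes in some detail how parallel techniques are realized in software and hardware. In software, parallel processing is realized mainly by analyzing program dependences and building network interconnections. In hardware there are three main classes of machine: multiprocessors, multicomputers, and SIMD/vector machines. At the level of hardware technology, parallelism is realized mainly through three components: memory, processors, and pipelines.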

2.
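This paper completes the hardware design of a video server and, to make full use of the processing capability of the DM642 hardware platform, proposes an encoding optimization scheme for the AVS-M algorithm. The scheme was derived after careful consideration of the software framework flow; it avoids redundant operations and designs the data structures of each part around the memory system. In addition, DMA is used to run computation and data transfer in parallel.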

3.
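Research on a video server based on AVS-M and DM642   Total citations: 2, self-citations: 0, citations by others: 2
This paper completes the hardware design of a video server and, to make full use of the processing capability of the DM642 hardware platform, proposes an encoding optimization scheme for the AVS-M algorithm. The scheme was derived after careful consideration of the software framework flow; it avoids redundant operations and designs the data structures of each part around the memory system. In addition, DMA is used to run computation and data transfer in parallel.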

4.
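RNA secondary structure prediction based on the stochastic context-free grammar (SCFG) model is an important computational approach to studying RNA secondary structure. Because the standard SCFG-based structure prediction algorithm, Cocke-Younger-Kasami (CYK), has very high time and space complexity, accelerating CYK has become a challenging and active problem in computational biology. The parallel performance of CYK is limited by the algorithm's multidimensional, non-uniform data dependences and its low computation-to-communication ratio. Existing massively parallel schemes based on general-purpose microprocessor architectures do not achieve satisfactory speedup, and large-scale parallel computer systems are expensive to purchase, operate, and maintain, which limits their applicability. Building on an in-depth analysis of the computational characteristics of CYK, this paper proposes and implements a fine-grained parallel CYK algorithm on an FPGA platform. The design partitions the three-dimensional dynamic programming matrix by region and processes it layer by layer, in parallel by column, to balance load across processing elements. It uses data prefetching, a sliding window, and a data-transfer pipeline to reuse data between processing elements, which effectively balances computation against communication. It also designs a master-slave multi-PE parallel computing array with a systolic-like array structure and, on the largest FPGA chip currently available (Xilinx XC5VLX330), successfully integrates 16 processing elements (process...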
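For orientation, the dynamic program that such accelerators target can be sketched in a few lines. This is a textbook Viterbi-style CYK for an SCFG in Chomsky normal form, not the FPGA design described in the abstract. The function name `cyk_best_logprob`, the dictionary encoding of the grammar, and the toy grammar in the usage note are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def cyk_best_logprob(seq, unary, binary, start="S"):
    """Best log-probability of deriving seq from `start` under a CNF SCFG.

    unary:  {(A, terminal): prob}   for rules A -> terminal
    binary: {(A, B, C): prob}       for rules A -> B C
    (Hypothetical encoding chosen for this sketch.)
    """
    n = len(seq)
    if n == 0:
        return NEG_INF
    # best[i][j] maps a nonterminal to the best log-prob of deriving seq[i..j].
    best = [[{} for _ in range(n)] for _ in range(n)]
    for i, sym in enumerate(seq):
        for (a, t), p in unary.items():
            if t == sym:
                best[i][i][a] = max(best[i][i].get(a, NEG_INF), math.log(p))
    # Cells with the same span length depend only on shorter spans, so all
    # cells of one span length could be filled in parallel.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            cell = best[i][j]
            for k in range(i, j):  # split point: the third dimension of the DP
                left, right = best[i][k], best[k + 1][j]
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        score = math.log(p) + left[b] + right[c]
                        if score > cell.get(a, NEG_INF):
                            cell[a] = score
    return best[0][n - 1].get(start, NEG_INF)
```

With a toy grammar such as `unary = {("A", "a"): 0.5, ("A", "g"): 0.5, ("B", "b"): 1.0}` and `binary = {("S", "A", "B"): 1.0}`, calling `cyk_best_logprob("ab", unary, binary)` scores the single parse of `"ab"`. The loop over `k` is the source of the cubic running time and of the non-uniform data dependences the abstract mentions.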

5.
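Multicast communication in multistage interconnection networks   Total citations: 3, self-citations: 1, citations by others: 3
Parallel communication in MPP systems is currently a focus of parallel processing research; improving parallel communication performance and raising network throughput are key to realizing MPP performance. Multicast communication is a one-to-many mode of communication, as distinct from point-to-point communication. It is therefore more powerful and more flexible to use, and it is very widely applied in parallel processing. Against the background of multistage interconnection networks, which use switching elements to interconnect nodes dynamically, this paper studies the efficiency of routing algorithms for multicast communication.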

6.
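To address the factors that limit execution speed when image matching algorithms run on a traditional serial architecture, this paper analyzes the inherent pipelining and parallelism of the image matching algorithm and proposes a parallel hardware architecture that accelerates it. By introducing pipelined data delays and multiple parallel processing units, the architecture greatly reduces the number of repeated memory reads and thereby speeds up image matching. The paper gives an implementation block diagram of the model and calculates what is required to run the image matching algorithm on this architecture. The calculation shows that, for a 64×64 search image and a 32×32 template image, the architecture can, in less than 9m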

7.
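A unified fast distance transform   Total citations: 12, self-citations: 2, citations by others: 10
The distance transform is an effective tool for image processing and analysis, and it is a global operation on the image. To avoid the very large computational cost, the global operation is usually decomposed into local operations, but this decomposition yields only an approximation of the Euclidean distance. This paper proposes a unified distance transform algorithm that quickly performs both the distance transform and the nearest feature transform without the support of parallel processing hardware. To use a different distance measure function, only the distance lookup table needs to be adjusted; the algorithm itself does not change. The paper closes with an analysis of the algorithm and experimental results.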

8.
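This paper proposes an improved particle filter algorithm for target tracking. It introduces a region-restricted pseudo-random algorithm and an adaptive resampling algorithm based on the weight distribution to improve tracking accuracy and parallelism. In the FPGA hardware implementation, the program structure is adjusted to make full use of pipelined parallel processing for numerical computation, and hardware parallelism is used to speed up the sorting of particle weights. Experimental results show that the proposed algorithm achieves good tracking accuracy and real-time performance both in a laboratory scene and under occlusion.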

9.
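Research on new interconnect communication technologies for massively parallel processing systems   Total citations: 2, self-citations: 0, citations by others: 2
This paper surveys research on massively parallel processing systems and points out that the research focus and key technology is efficient interconnect communication. It concentrates on the research topics in this field: node architecture, network interfaces, switching techniques, topologies, routing algorithms, communication mechanisms, communication protocols, computation models, and so on.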

10.
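Starting from the iterative algorithm for Fresnel computer-generated holograms, the authors derive a computer-generated hologram algorithm suited to the VHDL hardware description language. They introduce the idea of parallel processing into computer-generated holography and propose an FPGA-based parallel method for producing computer-generated holograms, which increases computation speed. Finally, a computer-generated hologram is obtained by simulation and successfully reconstructed; the results show that the method can be used to produce computer-generated holograms.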

11.
1 Introduction. A software distributed shared memory (DSM) system, or shared virtual memory (SVM) system, provides an abstraction of a single shared space on top of the physically distributed memories present on a network of workstations. It has been extensively studied in the past decade since it combines the programmability of shared memory systems and the scalability of distributed systems[1]. However, the performance gap between software DSM systems and message passing platforms remains, which greatly hinders the adoption of software DSM systems. In genera…

12.
A new language construct, called molecule, is described for the efficient implementation of algorithms on parallel computers. A molecule can be considered a procedure associated with a molecule type. Each molecule type characterizes a particular computation mode (sequential, pipelining, array processing, dataflow, multiprocessing, etc.). Basic concepts of molecule are introduced with a procedural language, called PAL. A concrete example is presented to illustrate layered software development using PAL on a multicomputer (the iPSC). It is concluded that high-level languages, augmented with the molecule construct, offer application flexibility, user friendliness, and efficiency in implementing parallel programs.

13.
Various contiguous and noncontiguous processor allocation policies have been proposed for mesh-connected multicomputers. Contiguous allocation suffers from high external processor fragmentation because it requires that the processors allocated to a parallel job be contiguous and have the same topology as the multicomputer. The goal of lifting the contiguity condition in noncontiguous allocation is reducing processor fragmentation. However, this can increase the communication overhead because the distances traversed by messages can be longer, and messages from different jobs can interfere with each other by competing for communication resources. The extra communication overhead depends on how the allocation request is partitioned and mapped to free processors. In this paper, we investigate a new class of noncontiguous allocation schemes for two-dimensional mesh-connected multicomputers. These schemes are different from previous ones in that request partitioning is based on the submeshes available for allocation. The available submeshes selected for allocation to a job are such that a high degree of contiguity among their processors is achieved. The proposed policies are compared to previous noncontiguous policies using detailed simulations, where several common communication patterns are considered. The results show that the proposed policies can reduce the communication overhead and improve performance substantially.

14.
Strategies for dynamic load balancing on highly parallel computers   Total citations: 5, self-citations: 0, citations by others: 5
Dynamic load balancing strategies for minimizing the execution time of single applications running in parallel on multicomputer systems are discussed. Dynamic load balancing (DLB) is essential for the efficient use of highly parallel systems when solving non-uniform problems with unpredictable load estimates. With the evolution of more highly parallel systems, centralized DLB approaches which make use of a high degree of knowledge become less feasible due to the load balancing communication overhead. Five DLB strategies are presented which illustrate the tradeoff between 1) knowledge - the accuracy of each balancing decision, and 2) overhead - the amount of added processing and communication incurred by the balancing process. All five strategies have been implemented on an Intel iPSC/2 hypercube.

15.
Cluster/distributed computing has become a popular, cost-effective alternative to high-performance parallel computers. Many parallel programming languages and related programming models have become widely accepted on clusters. However, the high communication overhead is a major shortcoming of running parallel applications on cluster/distributed computing environments. To reduce the communication overhead and thus the completion time of a parallel application, this paper introduces and evaluates an efficient Key Message (KM) approach to support parallel computing on cluster computing environments. In this paper, we briefly present the model and algorithm, and then analytical and simulation methods are adopted to evaluate the performance of the algorithm. The analysis shows that as the network background load increases or the computation-to-communication ratio decreases, the KM approach yields a larger improvement in the communication of a parallel application over a system that does not use it.

16.
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
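The ghost-zone trade-off described above can be illustrated with a small serial model. The sketch below is not the paper's framework or performance model. It assumes a 1D three-point averaging stencil with periodic boundaries, and the names `run_global`, `run_tiled`, `tile`, and `ghost` are hypothetical. Each tile loads `ghost` extra cells per side once per round, then advances `ghost` time steps with no further exchange, redundantly computing cells that neighbouring tiles also compute.

```python
def step_global(u):
    """One three-point averaging step over the whole periodic array."""
    n = len(u)
    return [(u[(i - 1) % n] + u[i] + u[(i + 1) % n]) / 3.0 for i in range(n)]

def run_global(u, steps):
    """Reference: apply the stencil `steps` times with no tiling."""
    for _ in range(steps):
        u = step_global(u)
    return u

def run_tiled(u, steps, tile, ghost):
    """Tiled version: one halo load per `ghost` time steps (illustrative)."""
    n = len(u)
    assert n % tile == 0 and steps % ghost == 0
    for _round in range(steps // ghost):
        nxt = [0.0] * n
        for start in range(0, n, tile):
            # The single "exchange" of the round: the tile plus `ghost`
            # cells on each side, wrapping around periodically.
            local = [u[(start - ghost + k) % n] for k in range(tile + 2 * ghost)]
            for _step in range(ghost):
                # Each step loses one valid cell per side, so the local
                # array shrinks by two; the lost cells are the ghost zone.
                local = [
                    (local[k - 1] + local[k] + local[k + 1]) / 3.0
                    for k in range(1, len(local) - 1)
                ]
            nxt[start:start + tile] = local  # exactly `tile` cells remain
        u = nxt
    return u
```

For example, `run_tiled(u, 6, 4, 3)` on a 12-element list performs two rounds of three steps each and should match `run_global(u, 6)`. A larger `ghost` means fewer exchanges but more redundant work per tile, which is the quantity a performance model would have to balance.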

17.
In this paper, we present a static load balancing method for mapping production rules in an expert system onto processors of a message-passing multicomputer. The method uses simulated annealing to achieve a nearly optimal allocation of production rules onto processor nodes. The approach balances the initial rule distribution to avoid higher communication demand among processor nodes at run time. A formal mapping model is developed and a new cost function is defined for the annealing process. New heuristic swap functions and cooling policies are provided to ensure the quality of the annealing process. A software load balancing package, SIMAL, was implemented on a SUN workstation to carry out the benchmark experiments. The overhead associated with this mapping method is O(m ln m), where m is the number of rules in the production system. Two benchmark production systems, Toru-waltz and Tourney, are mapped onto a hypercube computer with 32 nodes. Experimental benchmark results verify the effectiveness of the rule mapping method. The method can be applied for distributed artificial intelligence processing or for the parallel execution of cooperating expert systems on a message-passing multicomputer.
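As a rough illustration of annealing-based mapping, the sketch below assigns weighted rules to processors. It does not reproduce SIMAL, the paper's cost function, its swap heuristics, or its cooling policies. The cost (load imbalance plus the weight of communication between rules on different processors), the single-rule reassignment move, the geometric cooling schedule, and all names and parameters are assumptions made for this example.

```python
import math
import random

def cost(assign, weights, comm, n_procs, alpha=1.0):
    """Illustrative cost: load imbalance plus weighted cross-processor traffic."""
    loads = [0.0] * n_procs
    for rule, proc in enumerate(assign):
        loads[proc] += weights[rule]
    imbalance = max(loads) - min(loads)
    cut = sum(w for (a, b), w in comm.items() if assign[a] != assign[b])
    return imbalance + alpha * cut

def anneal(weights, comm, n_procs, t0=10.0, cooling=0.95, iters=2000, seed=0):
    """Simulated annealing over rule-to-processor assignments (a sketch)."""
    rng = random.Random(seed)
    m = len(weights)
    cur = [r % n_procs for r in range(m)]  # round-robin starting point
    cur_cost = cost(cur, weights, comm, n_procs)
    best, best_cost = cur[:], cur_cost
    t = t0
    for it in range(iters):
        rule = rng.randrange(m)
        old = cur[rule]
        cur[rule] = rng.randrange(n_procs)  # move: reassign one rule
        new_cost = cost(cur, weights, comm, n_procs)
        delta = new_cost - cur_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = cur[:], cur_cost
        else:
            cur[rule] = old  # reject: undo the move
        if (it + 1) % 50 == 0:
            t = max(t * cooling, 1e-6)  # geometric cooling, floored
    return best, best_cost
```

Here `weights[r]` is the assumed processing cost of rule `r` and `comm[(a, b)]` the assumed traffic between rules `a` and `b`. The returned assignment is the best one seen, so its cost is never worse than the round-robin starting point.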

18.
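Research on methods for improving the communication performance of workstation cluster systems   Total citations: 2, self-citations: 0, citations by others: 2
Workstation cluster systems are one of the current research focuses of parallel processing technology, and network communication performance is key to cluster systems. Based on an analysis of the factors that affect system performance, this paper discusses, from both the software and the hardware side, methods for improving network communication performance, with emphasis on high-speed network technology, streamlined communication protocols, and the Active Message communication mechanism.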

19.
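Poor communication performance is one of the main factors limiting the performance of software distributed shared memory systems. User-level communication makes full use of the hardware performance of high-speed networks, reduces the number of data copies, lowers software overhead, and markedly improves bandwidth and latency, which opens a new path to better performance for software distributed shared memory systems. This paper designs and implements a user-level communication library for software distributed shared memory systems. The library improves not only the system's communication performance but also its parallel computing performance, and so significantly raises the overall performance of the software distributed shared memory system.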

20.
Cellular computing architectures represent an important class of computation that is characterized by simple processing elements, local interconnect and massive parallelism. These architectures are a good match for many image and video processing applications and can be substantially accelerated with Reconfigurable Computers. We present a flexible software/hardware framework for design, implementation and automatic synthesis of cellular image processing algorithms. The system provides an extremely flexible set of parallel, pipelined and time-multiplexed components which can be tailored through reconfigurable hardware for particular applications. The most novel aspects of our framework include a highly pipelined architecture for multi-scale cellular image processing as well as support for several different pattern recognition applications. In this paper, we will describe the system in detail and present our performance assessments. The system achieved speed-up of at least 100× for computationally expensive sub-problems and 10× for end-to-end applications compared to software implementations.
