首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 93 毫秒
1.
本文将讨论在多处理器系统上,用大规模并行处理技术实现人工神经网络时,处理器是通信开销问题。通过理论推导和实验证明,全连接和随机连接神经网络在多处理器系统上并行实现时,处理器网络的拓扑结构对神经网络实现的通信开销影响很小,改变处理器的拓扑结构也无助于神经网络实现性能的提高。  相似文献   

2.
直觉告诉我们:当人工神经网络算法在多处理器系统上并行实现时,处理器网络的拓扑结构和处理器节点的扇入尺寸(即输入输出规模)会影响并行算法的效率,但是对全连接和随机连接神经网络,上述结论并不成立。在神经网络的并行实现中,处理器的通信开销是一个主要的限制因素,本文将对全连接和随机连接神经网络并行实现的几个相关问题进行讨论。1 学习时间的分解  相似文献   

3.
本文对两种非模块型神经网络-全连接和随机连接神经网络在多处理器网络上并行实现的性能进行了分析,指出处理器网络的拓扑结构和结点的扇玫对这两种神经网络的并行实现的性能影响不大,并对神经知多处理并行实现时一个学习周期时间进行了分析,讨论了并行实现的最大加速比和最优处理器规模的计算。  相似文献   

4.
龙芯3号互联系统的设计与实现   总被引:5,自引:1,他引:4  
龙芯3号的互联结构设计采用了一种基于二维Mesh的可伸缩分布式多核结构,可为芯片级、主板级和系统级的互联提供统一的拓扑结构和逻辑设计.龙芯3号的对外接口采用扩展的HyperTransport协议,既可以用于连接IO,又可以实现多芯片的互联.在龙芯3号的互联结构中还设置了软件路由配置机制,可以在板级直接构筑中等规模的CC-NUMA系统和更大规模的NCC-NUMA系统,提供高效的通信机制.介绍了基于龙芯3号的多处理器系统互联架构.采用了双层可伸缩互联结构:片内由二维Mesh连接多个结点.结点内由交叉开关连接多个处理器核和二级缓存模块.片间无需额外硬件支持即可通过支持缓存一致性的HyperTransport接口实现16核的多处理器系统.利用层次化目录技术,龙芯3号还可以支持更大规模的多处理器系统.龙芯3号的互联架构为搭建简洁、高效、灵活、高度可扩展的共享存储多处理器系统提供了有力支持.  相似文献   

5.
TMS320C6678多核DSP的核间通信方法   总被引:8,自引:3,他引:5  
嵌入式应用中采用多处理系统所面临的主要难题是多处理器内核之间的通信。对Key-Stone架构TMS320C6678处理器的多核间通信机制进行研究,利用处理器间中断和核间通信寄存器,设计并实现了多核之间的通信。从系统的角度出发,设计与仿真了两种多核通信拓扑结构,并分析对比了性能。对设计多核DSP处理器的核间通信有一定的指导价值。  相似文献   

6.
片上网络互连拓扑综述   总被引:1,自引:0,他引:1  
随着器件、工艺和应用技术的不断发展,片上多处理器已经成为主流技术,而且片上多处理器的规模越来越大、片内集成的处理器核数目越来越多,用于片内处理器核及其它部件之间互连的片上网络逐渐成为影响片上多处理器性能的瓶颈之一。片上网络的拓扑结构定义网络内部结点的物理布局和互连方法,决定和影响片上网络的成本、延迟、吞吐率、面积、容错能力和功耗等,同时影响网络路由策略和网络芯片的布局布线方法,是片上网络研究中的关键之一。对比了不同片上网络的拓扑结构,分析了各种结构的性能,并对未来片上网络拓扑研究提出建议。  相似文献   

7.
随着器件、工艺和应用技术的不断发展,片上多处理器已经成为主流技术,而且片上多处理器的规模越来越大、片内集成的处理器核数目越来越多,用于片内处理器核及其它部件之间互连的片上网络逐渐成为影响片上多处理器性能的瓶颈之一.片上网络的拓扑结构定义网络内部结点的物理布局和互连方法,决定和影响片上网络的成本、延迟、吞吐率、面积、容错...  相似文献   

8.
开销敏感的多处理器最优节能实时调度算法   总被引:1,自引:0,他引:1  
嵌入式多处理器系统的能耗问题变得日益重要,如何减少能耗同时满足实时约束成为多处理器系统节能实时调度中的一个重要问题.目前绝大多数研究基于关键速度降低处理器的频率以减少动态能耗,采用关闭处理器的方法减少静态能耗.虽然这种方法可以实现节能,但是不能保证最小化能耗.而现有最优的节能实时调度未考虑处理器状态切换的时间和能量开销,因此在切换开销不可忽视的实际平台中不再是最优的.文中针对具有独立动态电压频率调节和动态功耗管理功能的多处理器系统,考虑处理器切换开销,提出一种基于帧任务模型的最优节能实时调度算法.该算法根据关键速度来判断系统负载情况,确定具有最低能耗值的活跃处理器个数,然后根据状态切换开销来确定最优调度序列.该算法允许实时任务在处理器之间任意迁移,计算复杂度小,易于实现.数学分析证明了该算法的最优性.  相似文献   

9.
多处理器片上系统任务调度研究进展评述   总被引:9,自引:0,他引:9  
多处理器片上系统在单芯片上集成了多种指令集处理器,可完成复杂完整的功能,在图像处理、网络多媒体和嵌入式系统等应用领域前景广阔.任务映射与调度是多处理器片上系统设计的关键问题之一.介绍了多处理器片上系统的基本结构和面临的挑战,从调度算法分析和实现框架两个方面着重探讨了近年来多处理器片上系统任务调度的国内外研究进展情况,分析了当前亟待解决的问题与下一步主要的研究方向,可为多处理器片上系统相关研究提供参考.  相似文献   

10.
适用于多核处理器的簇状片上网络设计   总被引:1,自引:1,他引:0       下载免费PDF全文
提出一种新型簇状片上网络架构。该架构以二维网状拓扑结构连接各个簇单元,每个簇单元由3个处理器、1个直接访存单元和1个簇共享存储单元组成。基于该架构的多核处理器可以获得更高的通信效率及存储器利用率。在实验系统上实现3 780点的快速傅里叶变换,结果表明,在快速傅里叶变换应用中存储器的利用率能提升至79.5%。  相似文献   

11.
神经元的映射分配是人工神经网络虚拟实现中的重要研究课题。本文系统地分析了人工神经网络的重要性质-并行分布处理,并对映射分配问题中的两个关键性概念-负载均衡和通信开销进行了深入讨论。以此为基础,提出了一系列映射算法,并对算法性能进行了分析。其中,吸收算法最大程度地开发了人工神经网络固有的并行性,是一个实时的算法。  相似文献   

12.
The efficiency of the basic operations of a NUMA (nonuniform memory access) multiprocessor determines the parallel processing performance on a NUMA multiprocessor. The authors present several analytical models for predicting and evaluating the overhead of interprocessor communication, process scheduling, process synchronization, and remote memory access, where network contention and memory contention are considered. Performance measurements to support the models and analyses through several numerical examples have been done on the BBN GP1000, a NUMA shared-memory multiprocessor. Analytical and experimental results give a comprehensive understanding of the various effects, which are important for the effective use of NUMA shared-memory multiprocessor. The results presented can be used to determine optimal strategies in developing an efficient programming environment for a NUMA system  相似文献   

13.
The design issues affecting a parallel implementation of the alpha-beta search algorithm are discussed with emphasis on a tree decomposition scheme that is intended for use on well ordered trees. In particular, the principal variation splitting method has been implemented, and experimental results are presented which show how such refinements as progressive deepening, narrow window searching, and the use of memory tables affect the performance of multiprocessor based chess playing programs. When dealing with parallel processing systems, communication delays are perhaps the greatest source of lost time. Therefore, an implementation of our tree decomposition based algorithm is presented, one that operates with a modest amount of message passing within a network of processors. Since our system has low search overhead, the principal basis for comparison is the communication overhead, which in turn is shown to have two components.  相似文献   

14.
用多机系统进行并行仿真是解决大规模连续系统实时仿真问题的有效途径。多机并行仿真中关键要解决的问题,是如何有效地将一个仿真任务分配到多机系统上并发执行,并获得高的加速比。本文介绍了作者自行研制的并行仿真软件支撑环境PARSIM,它可将一个传统单机上串行执行的仿真程序自动转换成在同构型多机系统上高效并发执行的并行仿真程序,并就并行性识别,多任务自动划分等问题展开了讨论,给出了相应的算法和应用实例。  相似文献   

15.
We propose and evaluate a parallel “decomposite best-first” search branch-and-bound algorithm (dbs) for MIN-based multiprocessor systems. We start with a new probabilistic model to estimate the number of evaluated nodes for a serial best-first search branch-and-bound algorithm. This analysis is used in predicting the parallel algorithm speed-up. The proposed algorithm initially decomposes a problem into N subproblems, where N is the number of processors available in a multiprocessor. Afterwards, each processor executes the serial best-first search to find a local feasible solution. Local solutions are broadcasted through the network to compute the final solution. A conflict-free mapping scheme, known as the step-by-step spread, is used for subproblem distribution on the MIN. A speedup expression for the parallel algorithm is then derived using the serial best-first search node evaluation model. Our analysis considers both computation and communication overheads for providing realistic speed-up. Communication modeling is also extended for the parallel global best-first search technique. All the analytical results are validated via simulation. For large systems, when communication overhead is taken into consideration, it is observed that the parallel decomposite best-first search algorithm provides better speed-up compared to other reported schemes  相似文献   

16.
In large scale multiprocessor systems, the distance between processors should be taken into account by software to reduce the network traffic and the communication overhead. A load balancing method based on P3 (Processing Power Plane) model is proposed to enable programmers to specify distributing computational load, keeping the locality of the computation. In this method, a process is allocated to a rectangle on a hypothetical processing power plane. The size of the rectangle represents the processing power given to the process, and the distance between rectangles represents the communication cost between them. This plane is divided to processors, and the region of the processor may be dynamically reshaped to alleviate imbalance on P3. Mechanism for realization of the method has been implemented on the Multi-PSI/version, 2, which is a parallel processing system with 64 processing elements connected to form a 2-dimensional mesh network. A packet transmission mechanism of the Multi-PSI/version 2 is described, which realizes the process distribution along with the balancing method.  相似文献   

17.
This letter presents a novel cooperative neural network ensemble learning method based on Negative Correlation learning. It enables easy integration of various network models and reduces communication bandwidth significantly for effective parallel speedup. Comparison with the best Negative Correlation learning method reported demonstrates comparable performance at significantly reduced communication overhead.  相似文献   

18.
The system design of a locally connected competitive neural network for video motion detection is presented. The motion information from a sequence of image data can be determined through a two-dimensional multiprocessor array in which each processing element consists of an analog neuroprocessor. Massively parallel neurocomputing is done by compact and efficient neuroprocessors. Local data transfer between the neuroprocessors is performed by using an analog point-to-point interconnection scheme. To maintain strong signal strength over the whole system, global data communication between the host computer and neuroprocessors is carried out in a digital common bus. A mixed-signal very large scale integration (VLSI) neural chip that includes multiple neuroprocessors for fast video motion detection has been developed. Measured results of the programmable synapse, and winner-takes-all circuitry are presented. Based on the measurement data, system-level analysis on a sequence of real-world images was conducted.  相似文献   

19.
Low-latency communication over ATM networks using active messages   总被引:1,自引:0,他引:1  
von Eicken  T. Basu  A. Buch  V. 《Micro, IEEE》1995,15(1):46-53
Today's communication architectures for parallel machines reduce communication overheads and latencies by over an order of magnitude. However, carrying over these techniques to workstation clusters connected by an ATM network presents major design challenges. We discuss the differences in communication characteristics between workstation clusters built from standard hardware and software components and state-of-the-art multiprocessors, and then evaluate a prototype implementation of an active message communication layer. Application round-trip latencies of about 50 microseconds for small messages roughly compare to a similar implementation on the Thinking Machines CM-5 multiprocessor  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号