Similar literature
 20 similar documents found (search time: 31 ms)
1.
Synchronous dataflow architecture for network processors   Cited by: 1 (self: 0, others: 1)
Carlstrom, J.; Boden, T. IEEE Micro, 2004, 24(5): 10-18
Network processors are programmable, highly integrated communications circuits optimized to provide processing at high data and packet rates. The packet instruction set computer (PISC) architecture is a synchronous dataflow architecture developed for network processors. It uses a deep pipeline that contains two types of processing elements: PISC processors, which perform programmable data manipulation, and I/O processors, which provide access to shared resources such as look-up table memory, hardware accelerators, or coprocessors.

2.
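A parallel algorithm for time-dependent Monte Carlo transport problems   Cited by: 1 (self: 0, others: 1)
This paper presents a parallel algorithm for time-dependent (unsteady) Monte Carlo (MC) transport problems, and discusses and optimizes the loading and execution mode of the parallel program. A parallel random number generator algorithm that requires no communication in the ideal case is designed for parallel MC computation. Dynamic MC transport problems involve a large number of I/O operations; reading the data files of the remaining particles in particular takes considerable I/O time. To address the I/O problem, three parallel I/O algorithms are proposed. Finally, performance test results for the parallel algorithm are given: compared with the serial computation time, the parallel computation time on 64 processors is reduced by a factor of 30.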
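The abstract does not say how its communication-free generator is built. One well-known way to get such streams is leapfrog partitioning of a linear congruential generator (LCG), sketched here as an assumption rather than as the authors' method. The constants and the function names `lcg_sequence` and `leapfrog_stream` are illustrative.

```python
# Leapfrog partitioning of a single LCG across several processes.
# Illustrative only: A and M are the classic "minimal standard" LCG
# parameters, not values taken from the paper.
M = 2**31 - 1   # modulus
A = 16807       # multiplier


def lcg_sequence(seed, count):
    """Return the first `count` values of the serial LCG x -> A*x mod M."""
    values, x = [], seed
    for _ in range(count):
        x = (A * x) % M
        values.append(x)
    return values


def leapfrog_stream(seed, rank, nprocs, count):
    """Return elements rank, rank+nprocs, rank+2*nprocs, ... of the serial
    sequence (0-based positions).

    Each process needs only (seed, rank, nprocs), so no messages are
    exchanged while numbers are generated.
    """
    stride = pow(A, nprocs, M)              # advance nprocs steps at once
    x = (pow(A, rank + 1, M) * seed) % M    # element at position `rank`
    values = []
    for _ in range(count):
        values.append(x)
        x = (stride * x) % M
    return values
```

Taken together, the streams for ranks 0 to `nprocs - 1` interleave to reproduce the serial sequence exactly, so a parallel run can be checked against a serial one. As for the reported timing, a 30-fold reduction on 64 processors corresponds to a parallel efficiency of 30/64, or about 47%.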

3.
Computer Networks, 2003, 41(5): 563-586
Network processors (NPs) are an emerging field of programmable processors that are optimized to implement data plane packet processing networking functions. Unlike the general-purpose CPUs that rely heavily on caching for improving performance, the lack of locality in packet processing and need for high-performance I/O have forced designers to come up with innovative architectures that can hide memory latency while still processing packets at high data rates. Most of these NPs use some type of multiprocessing in combination with a hierarchy of memory types to achieve high performance. In addition, to keep up with packets arriving at high data rates over multiple incoming media interfaces, an NP must perform fast I/O and memory operations such as packet storage, table lookup, and extraction of fields in packet headers. We describe an architecture that uses a combination of distributed memory architecture and one or more multithreaded processors to achieve the necessary performance. We describe the challenges in programming such a processor including the issues related to consistency and maintaining packet ordering. We also present a programming model for generic network applications that uses software pipelines. We then demonstrate the use of the programming model in implementing two applications, namely, mapping traffic management algorithms onto a multithreaded architecture and an implementation of a media gateway based on voice-over-AAL2.

4.
In general, message passing multiprocessors suffer from communication overhead between processors, and shared memory multiprocessors suffer from memory contention. Also, in computer vision tasks, data I/O overhead limits performance. In particular, high level vision tasks, which are complex and require nondeterministic communication, are strongly affected by these disadvantages. This paper proposes a flexibly (tightly/loosely) coupled hypercube multiprocessor (FCHM) for high level vision to alleviate these problems. A variable address space memory scheme is proposed in which a set of adjacent memory modules can be merged into a shared memory module by a dynamically partitionable hypercube topology. The architecture is quantitatively analyzed using computational models and simulated on Intel's Personal SuperComputer (iPSC/I), a hypercube multiprocessor. A parallel algorithm for exhaustive search is simulated on FCHM using the iPSC/I, showing significant performance improvements over that of the iPSC/I. This research was supported in part by IBM Corporation.

5.
Several issues concerning the design of an I/O (input/output) system for a multiprocessor such as a hypercube are examined. A methodology is proposed for connecting the I/O processors to such a system for efficient I/O access. The effect of I/O communication on the multiprocessor network is analyzed. Different disk organizations that can be employed within such a system are evaluated to see which organization performs better. It is observed that parallelism in serving an I/O request plays a dominant role in the scientific workload. The problem of mapping specific data structures such as matrices onto the disks so that the data can be accessed efficiently is considered.

6.
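Parallel execution of Prolog programs on a network of workstations   Cited by: 1 (self: 0, others: 1)
Tao Jie; Ju Jiubin. Journal of Software (软件学报), 1994, 5(11): 38-43
This paper describes DC-PROLOG, a distributed C-PROLOG interpreter system implemented on a network of SUN workstations. It automatically turns the sequential interpretation of an application program into a parallel one, and it can make full use of idle main memory to solve large problems, so that tasks that cannot run on a single machine because of insufficient memory can be executed.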

7.
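Storage devices account for a growing share of the power consumed by computer systems as a whole, and by data centers in particular. Optimizing hard disk power consumption requires a model of that consumption when no hardware is available to measure it directly. Modern operating systems manage disk I/O through the file system layer and provide real-time monitoring data. An experimental analysis of the disk I/O behavior of common file systems such as EXT2, EXT4, and NILFS2 shows that different file systems affect disk power consumption differently. To characterize this difference, a disk power modeling and evaluation method based on the I/O idle rate is proposed. The I/O idle rate is also used to guide disk power optimization: the I/O idle rate of a video player is raised from 4.79% to 96.55%, and disk power consumption falls by 45.28%.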

8.
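A multiprocessor architecture based on the SDZX-MV-02 core is designed. A shared-memory bus switch solves the problem of sharing data among the processors, and an I/O latch handles the exchange of information, commands, and status between them. A block diagram, implementation code, and simulation results are given. The design offers a good solution to implementing high-end machine vision processing with low-end microprocessors.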

9.
As the technology used to implement computer network infrastructure advances, networking resources are becoming more vulnerable to attack. Recent router designs are based on general-purpose programmable processors, which increase their potential vulnerability. To address this issue, a Secure Packet Processing platform has been developed that can flexibly protect emerging router systems. Both instruction-level operation of embedded processors and I/O operations of router ports are monitored to detect anomalous behavior. If such behavior is detected, a recovery system is invoked to restore the system into an operational state. Experimental results show that processor-based attacks can generally be determined by a processing monitor within a single instruction. I/O anomalies, including unexpected packet broadcast or delay, can be detected by an I/O monitor with limited overhead. Overall, the system overhead for secure monitoring is limited to a fraction of the overall system space, memory, and power budget.

10.
The Cydra 5 is a heterogeneous multiprocessor system that targets small work groups or departments of scientists and engineers. The two types of processors are functionally specialized for the different components of the work load found in a departmental setting. The Cydra 5 numeric processor, based on a directed-data-flow architecture, provides consistently high performance on a broader class of numerical computations. The interactive processors offload all nonnumeric work from the numeric processor, leaving it free to spend all its time on the numeric application. The I/O processors permit high-bandwidth I/O transitions with minimal involvement from the interactive or numeric processors. The system architecture and data-flow architecture are described. The numeric processor decisions and tradeoffs are examined, and the main memory system is discussed. Some reflections on the design issues are offered.

11.
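A system on chip (SoC) is a complex system that integrates multiple IP cores on a single chip, such as application-specific processors, general-purpose processors, DSPs, shared memory blocks, dedicated memory blocks, and I/O components. A regular mesh topology has advantages such as tidy wiring, and a mesh makes it convenient to implement a complex SoC. Mapping intellectual property cores (IP cores) onto the tiles of the mesh is one of the key problems in SoC design. It is essentially a quadratic assignment problem, which is NP-hard, and no polynomial-time method for finding the optimum is currently known. Genetic algorithms, however, can effectively find near-optimal solutions. A genetic mapping algorithm is proposed that finds a mapping with minimum communication energy consumption within a few minutes.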

12.
This paper presents further results on the design and implementation of various optimizations based on our earlier work of developing a parallel pipelined model for computationally intensive applications that have multiple processing tasks. Performance evaluation of this model was done by using a real-time airborne radar application that employs a Space-Time Adaptive Processing (STAP) algorithm. This paper focuses on the following four issues: (1) The tradeoffs between increasing the throughput and reducing the latency are examined in more detail when allocating processors among different processing tasks. (2) A multi-threaded design is incorporated into the pipeline model and implemented on a massively parallel computer with symmetric multi-processor nodes, which shows enhanced performance. (3) The disk I/O is incorporated into the parallel pipeline to study its effect on performance, for which two I/O task designs have been implemented: embedding I/O in the pipeline or having a separate I/O task. By using a double buffering approach together with asynchronous I/O, the overall pipeline performance scales well as the number of processors increases. (4) From the comparison of the two I/O implementations, it is discovered that the latency may be improved when merging multiple tasks into a single task. The effect of reorganizing the task structure of the pipeline is discussed in detail. All the performance results shown in this work demonstrate the linear scalability the parallel pipeline model can achieve using a production radar application. Although this paper focuses on the implementation of the parallel pipeline model and uses the results from a STAP application to support the claims of the discovered properties for this pipeline, this model is also applicable to many other types of applications with similar computational characteristics.

13.
We show that the product of two N × N boolean matrices can be calculated in constant time on an LARPBS with O(N^3 / log N) processors. All data communications and computations are performed on the bit level. To the best of the author's knowledge, this is the first parallel boolean matrix multiplication algorithm that has constant execution time and is executed on a distributed memory system with o(N^3) processors. By using our boolean matrix multiplication algorithm, it is shown that the transitive closure of a directed graph can be obtained in O(log N) time (measured by bit-level operations) on an LARPBS with O(N^3 / log N) processors. To the best of our knowledge, this is the first parallel algorithm for transitive closure of directed graphs with time complexity O(log N) (comparable to that of CRCW PRAM) and cost O(N^3) on a realistic parallel computing model, which has no shared memory and in which interprocessor communications are dealt with explicitly and efficiently.
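The step from constant-time boolean matrix multiplication to O(log N) transitive closure follows from the standard repeated-squaring reduction: about log2(N) boolean products are enough. The sequential Python sketch below illustrates only that reduction, not the LARPBS algorithm itself; the function names are hypothetical, and the result includes each vertex reaching itself (the reflexive-transitive closure).

```python
def bool_matmul(a, b):
    """Boolean product: c[i][j] is True if a[i][k] and b[k][j] for some k."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Reflexive-transitive closure by repeated squaring of (I OR A).

    After t squarings the matrix covers all paths of length <= 2**t, and
    any reachable vertex is reachable by a path of length <= n - 1.
    """
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    covered = 1                      # longest path length covered so far
    while covered < n - 1:
        reach = bool_matmul(reach, reach)
        covered *= 2
    return reach
```

For a four-vertex chain 0 → 1 → 2 → 3, two squarings are enough for vertex 0 to be marked as reaching vertex 3. If each product takes constant time, as the abstract claims for the LARPBS, the loop as a whole takes O(log N) time.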

14.
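In high-performance processors, demand for I/O bandwidth keeps increasing: the number of lanes in high-speed interfaces keeps growing, and interface transfer rates are rising as well. The network-on-chip (NoC) of a high-performance processor must match the bandwidth requirements of the various high-speed I/O interfaces and must guarantee that DMA requests complete correctly. However, the communication mechanisms of the high-speed interface protocols differ considerably from those of the NoC protocol, which can lead to problems such as deadlock. This paper first analyzes the problems of an NoC that has to match high-performance I/O, and then proposes a high-bandwidth I/O design method and a deadlock resolution method. An NoC that uses the deadlock resolution method makes the I/O system more robust, reduces the restrictions on NoC design and operation, and improves I/O performance. Finally, the proposed optimizations are applied to a high-performance server processor chip and evaluated. For a 16-lane PCIe 4.0 interface, the bidirectional read and write bandwidths each reach 30 GB/s; when deadlock occurs in certain special scenarios, the NoC automatically detects and resolves it.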

15.
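In a broadband environment, the network communication capability of the system is high. To increase the number of concurrent requests and improve the real-time responsiveness of video on demand, the disk access speed bottleneck at the video server must be addressed. The video multicast strategy proposed in this paper uses an adaptive caching algorithm that takes both network communication capability and disk access speed into account; it optimizes overall system performance and improves the efficiency of the traditional batching algorithm for video on demand.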

16.
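Adjacency matrix algorithms have extremely important applications in scientific computing and information processing, and are part of the basic research in graph theory. Most existing adjacency matrix algorithms are serial, or are based on parallel SIMD models that cannot resolve memory conflicts. To address this, a parallel adjacency matrix algorithm based on the SIMD-EREW shared-memory model is proposed. The algorithm uses O(p) parallel processing units and computes the adjacency matrix of n data points in O(n^2/p) time. A performance comparison with existing algorithms shows that the proposed algorithm clearly improves on results in the literature and is a parallel adjacency matrix algorithm free of memory conflicts.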

17.
Paired gang scheduling   Cited by: 2 (self: 0, others: 2)
Conventional gang scheduling has the disadvantage that when processes perform I/O or blocking communication, their processors remain idle because alternative processes cannot be run independently of their own gangs. To alleviate this problem, we suggest a slight relaxation of this rule: match gangs that make heavy use of the CPU with gangs that make light use of the CPU (presumably due to I/O or communication activity), and schedule such pairs together, allowing the local scheduler on each node to select either of the two processes at any instant. As I/O-intensive gangs make light use of the CPU, this only causes a minor degradation in the service to compute-bound jobs. This degradation is more than offset by the overall improvement in system performance due to the better utilization of the resources.
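The matching rule in this abstract can be illustrated with a simple heuristic: sort gangs by measured CPU utilization and pair them from both ends of the list. The abstract states only that heavy and light CPU users should be matched, so the sort-and-pair strategy, the function name `pair_gangs`, and the sample figures are assumptions made for illustration.

```python
def pair_gangs(cpu_utilization):
    """Pair the most CPU-bound gang with the least CPU-bound one, and so on.

    `cpu_utilization` maps a gang name to its measured CPU utilization in
    [0, 1]. Returns a list of (heavy, light) pairs; with an odd number of
    gangs the middle one is returned alone as (gang, None).
    """
    ordered = sorted(cpu_utilization, key=cpu_utilization.get)
    pairs = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        pairs.append((ordered[hi], ordered[lo]))
        lo += 1
        hi -= 1
    if lo == hi:
        pairs.append((ordered[lo], None))
    return pairs


# Hypothetical utilization figures for four gangs.
example = pair_gangs({"sim": 0.95, "io": 0.10, "net": 0.30, "fft": 0.80})
```

In this example `sim` is paired with `io` and `fft` with `net`, so each pair combines one compute-bound and one I/O-bound gang. The local scheduler on each node would then pick whichever process of the pair is runnable.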

18.
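Adjacency matrix algorithms have extremely important applications in scientific computing and information processing, and are part of the basic research in graph theory. Most existing adjacency matrix algorithms are serial, or are based on parallel SIMD models that cannot resolve memory conflicts. To address this, a parallel adjacency matrix algorithm based on the SIMD-EREW shared-memory model is proposed. The algorithm uses O(p) parallel processing units and computes the adjacency matrix of n data points in O(n^2/p) time. A performance comparison with existing algorithms shows that the proposed algorithm clearly improves on results in the literature and is a parallel adjacency matrix algorithm free of memory conflicts.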

19.
We present the design, implementation, and performance evaluation of a suite of resource policing mechanisms that allow guest processes to efficiently and unobtrusively exploit otherwise idle workstation resources. Unlike traditional policies that harvest cycles only from unused machines, we employ fine-grained cycle stealing to exploit resources even from machines that have active users. We developed a suite of kernel extensions that enable these policies to operate without significantly impacting host processes: 1) a new starvation-level CPU priority for guest jobs, 2) a new page replacement policy that imposes hard bounds on physical memory usage by guest processes, and 3) a new I/O scheduling mechanism called rate windows that throttles guest processes' usage of I/O and network bandwidth. We evaluate both the individual impacts of each mechanism and their utility for our fine-grained cycle stealing.

20.
Parallel applications can be executed using the idle computing capacity of workstation clusters. However, it remains unclear how to schedule the processors among different applications most effectively. Processor scheduling algorithms that were successful for shared-memory machines have proven to be inadequate for distributed memory environments due to the high costs of remote memory accesses and redistributing data. We investigate how knowledge of system load and application characteristics can be used in scheduling decisions. We propose a new algorithm based on adaptive equipartitioning, which, by properly exploiting both the information types above, performs better than other nonpreemptive scheduling rules, and nearly as well as idealized versions of preemptive rules (with free preemption). We conclude that the new algorithm is suitable for use in scheduling parallel applications on networks of workstations.
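For orientation, plain equipartitioning, the baseline that this abstract's adaptive algorithm builds on, can be sketched as follows. The abstract does not describe how system load and application characteristics enter the adaptive version, so this shows only the basic idea: divide the processors evenly among jobs, cap each job at the most it can use, and redistribute the surplus. The function name and the sample limits are hypothetical.

```python
def equipartition(total_procs, max_parallelism):
    """Split processors as evenly as possible, capped by each job's limit.

    `max_parallelism` maps a job name to the most processors it can use.
    Processors a job cannot use are redistributed among the other jobs.
    """
    alloc = {job: 0 for job in max_parallelism}
    # Visit jobs with the smallest limits first so their surplus is freed.
    active = sorted(max_parallelism, key=max_parallelism.get)
    remaining = total_procs
    while active and remaining > 0:
        share = remaining // len(active)
        job = active[0]
        if max_parallelism[job] <= share:
            # This job needs no more than the fair share: give it its limit.
            alloc[job] = max_parallelism[job]
            remaining -= alloc[job]
            active.pop(0)
        else:
            # All remaining jobs can absorb a full share; split evenly and
            # hand out leftover processors one at a time.
            extra = remaining - share * len(active)
            for i, name in enumerate(active):
                alloc[name] = share + (1 if i < extra else 0)
            remaining = 0
    return alloc


# Hypothetical example: 16 processors, three jobs with different limits.
example_alloc = equipartition(16, {"a": 2, "b": 8, "c": 20})
```

Here job `a` receives its limit of 2 processors, and the 14 that remain are split evenly between `b` and `c`.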
