首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
In this paper, a closed queuing network model with both single and multiple servers has been proposed to model dataflow in a multi-threaded architecture. Multi-threading is useful in reducing the latency by switching among a set of threads in order to improve the processor utilization. Two sets of processors, synchronization and execution processors exist. Synchronization processors handle load/store operations and execution processors handle arithmetic/logic and control operations. A closed queuing network model is suitable for large number of job arrivals. The normalization constant is derived using a recursive algorithm for the given model. State diagrams are drawn from the hybrid closed queuing network model, and the steady-state balance equations are derived from it. Performance measures such as average response times and average system throughput are derived and plotted against the total number of processors in the closed queuing network model. Other important performance measures like processor utilizations, average queue lengths, average waiting times and relative utilizations are also derived.  相似文献   

2.
Fault-tolerant dataflow system is an entirely new field.This paper presents an overview of FTDF-TD,a fault-tolerant dataflow system.Various aspects of FTDF-TD,such as the architecture,the fault-tolerant strategy,and its reliability,are described.The research on overhead and performance evaluation based on software simulation is introduced.It is shown that FTDF-TD gets valuable results in reducing overhead,execution time and increasing reliability of processor array,compared with two other fault-tolerant systems.  相似文献   

3.
In this paper a dataflow architecture is introduced that maps efficiently onto multi-FPGA platforms and is composed of communication channels which can be dynamically adapted to the dataflow of the algorithm. The reconfiguration of the topology can be accomplished within a single clock cycle while DSP operations are in progress. Finally, the programmability and scalability of the proposed architecture is demonstrated by a high-performance parallel FFT implementation.  相似文献   

4.
This paper presents a new architecture for embedded systems and describes an appropriate method for programming a control system. A grinding machine control system was built and an experimental verification of the theoretical approach was performed. The efficiency of this novel system was compared with the conventional control systems by grinding a workpiece up to the stringent quality requirements. The superior performance of the OR dataflow control system lead to the encouraging conclusions presented in this paper.  相似文献   

5.
在高性能计算领域,数据流是一类重要的计算结构,也在很多实际场景表现出很好的性能和适用性。在数据流计算模式中,程序是以数据流图来表示的,数据流计算中一个关键的问题是如何将数据流图映射到多个执行单元上。通过分析现有数据流结构的指令映射方法及其不足,提出了基于数据流结构的新型指令映射优化方法。主要是根据多地址共享数据包的特性对指令映射方法进行优化,延迟多地址共享数据路由包的拆分,减少网络拥堵。  相似文献   

6.
A parallel systolic structure of the adaptive moving target detector (AMTD) is described. The adaptive signal processing is realized in the whole area of observation by a set of systolic processors operating in parallel. The cost of systolic AMTD implementation (necessary number of processing elements and computational steps) is also evaluated.  相似文献   

7.
Current rendering processors are aiming to process triangles as fast as possible and they have the tendency of equipping with multiple rasterizers to be capable of handling a number of triangles in parallel for increasing polygon rendering performance. However, those parallel architectures may have the consistency problem when more than one rasterizer try to access the data at the same address. This paper proposes a consistency-free memory architecture for sort-last parallel rendering processors, in which a consistency-free pixel cache architecture is devised and effectively associated with three different memory systems consisting of a single frame buffer, a memory interface unit, and consistency-test units. Furthermore, the proposed architecture can reduce the latency caused by pixel cache misses because the rasterizer does not wait until cache miss handling is completed when the pixel cache miss occurs. The experimental results show that the proposed architecture can achieve almost linear speedup upto four rasterizers with a single frame buffer.  相似文献   

8.
国产异构众核处理器是我国打破国际技术壁垒,在高性能计算领域取得突破的关键环节.围绕国产超算的软件生态环境建设,采用智能源码转换的方法盘活海量多核架构的遗产代码是加速软件研发效率,推动领域发展的重要途径.针对国产运算核心不支持C++编译的现状,基于开源的ANTLR语言翻译工具,提出了一种面向异构众核处理器的智能化C++语...  相似文献   

9.
Embedded processors rely on the efficient use of instruction-level parallelism to answer the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it might lead to a negative impact on energy consumption, a particularly critical constraint for current systems. In this paper, we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that can reduce energy consumption and energy-delay product (EDP), while still providing an increase in memory bandwidth. We combine the use of software-managed memories (SMM) with the data cache, and leverage the lower energy access cost of SMMs to provide a processor with reduced energy consumption and EDP. SoMMA also provides a better overall performance, as memory accesses can be performed in parallel, with no cost in extra memory ports. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. The approach shows average speedups of 1.118x and 1.121x, while consuming up to 11% and 12.8% less energy when comparing two modified ρVEX processors and their baselines, at full-system level comparisons. SoMMA also shows reduction of up to 41.5% on full-system EDP, maintaining the same processor area as baseline processors.  相似文献   

10.
A resource efficient and high-performance architecture for a two-dimensional multi-level discrete wavelet transform processor is presented in this paper. The JPEG2000 standard integer lossless 5-3 filter has been implemented. It achieves optimal hardware utilisation with minimal combinational logic block slices and high frequency of operation. To reduce the hardware complexity and to achieve high performance the proposed architecture implements lifting scheme with a single multiplier-free processing element to perform both predict and update operations. Symmetric extension is used at image boundaries without requiring any extra clock cycle. The generic architecture is very flexible and can perform up to five levels of forward transform on any arbitrary image size. Synthesis of the 5-level architecture on Xilinx Virtex 5 FPGA shows that the processor can achieve a maximum frequency of operation of 221.44 MHz. The reduced hardware complexity and high frequency of operation render the design suitable for incorporation in image processing applications requiring fast operations. The 5-level design has been successfully implemented on a Xilinx Spartan 3E FPGA, utilising only 1104 slices for a 512-by-512 pixel test image, the lowest hardware requirements for a 5-level discrete wavelet transform processor reported to date.  相似文献   

11.
Application-specific instruction processors (ASIPs) have great potential to meet the challenging demands of pervasive systems. This hierarchical system automatically designs highly customized multicluster processors. In the first of two tightly coupled components, design space exploration heuristically searches the basic capabilities that define the processor's overall parallelism. In the second, a hardware compiler determines the detailed architecture configuration that realizes the parallelism.  相似文献   

12.
In this paper, we propose a new interconnection mechanism for network line cards. We project that the packet storage needs for the next-generation networks will be much higher. Such that the number of memory modules required to store the packets will be more than that can be directly connected to the network processor (NPU). In other words, the NPU I/O pins are limited and they do not scale well with the growing number of memory modules and processing elements employed on the network line cards. As a result, we propose to explore more suitable off-chip interconnect and communication mechanisms that will replace the existing systems and that will provide extraordinary high throughput. In particular, we investigate if the packet-switched k-ary n-cube networks can be a solution. To the best of our knowledge, this is the first time, the k-ary n-cube networks are used on a board. We investigate multiple k-ary n-cube based interconnects and include a variation of 2-ary 3-cube interconnect called the 3D-mesh. All of the k-ary n-cube interconnects include multiple, highly efficient techniques to route, switch, and control packet flows in order to minimize congestion spots and packet loss within the interconnects. We explore the tradeoffs between implementation constraints and performance. Performance results show that k-ary n-cube topologies significantly outperform the existing line card interconnects and they are able to sustain higher traffic loads. Furthermore, the 3D-mesh reaches the highest performance results of all interconnects and allows future scalability to adopt more memories and/or processors to increase the line card’s processing power.  相似文献   

13.
Many present applications usually require high communication throughputs. Multiprocessor nodes and multicore architectures, as well as programmable NICs (Network Interface Cards) provide new opportunities to take advantage of the available multigigabits per second link bandwidths. Nevertheless, to achieve adequate communication performance levels efficient parallel processing of network tasks and interfaces should be considered. In this paper, we leverage network processors as heterogeneous microarchitectures with several cores that implement multithreading and are suited for packet processing, to investigate on the use of parallel processing to accelerate the network interface, and thus the network applications developed above it. More specifically, we have implemented an intrusion prevention system (IPS) with such a network processor. We describe the IPS we have developed that after its offloaded implementation allows faster packet processing of both normal and corrupted traffic. The benefits from placing the IPS close to the network, by using specialized network processors, give many times lower latency and higher bandwidth available to the legitimate traffic.  相似文献   

14.
《Computer Networks》2003,41(5):601-621
To provide flexibility in deploying new protocols and services, general-purpose processing engines are being placed in the datapath of routers. Such network processors (NPs) are typically simple RISC multiprocessors that perform forwarding and custom application processing of packets. The inherent unpredictability of execution time of arbitrary instruction code poses a significant challenge in providing service guarantees for data flows that compete for such processing resources in the network. However, we show that network processing workloads are highly regular and predictable, which can be exploited for scheduling purposes. We present two such predictive processor scheduling algorithms that aim at providing service guarantees as well as improving the performance of the NP by increasing the instruction data locality. Simulation results show that these algorithms provide significantly better performance than processor scheduling algorithms that do not take packet processing times into consideration.  相似文献   

15.
对网络安全管理系统建模及进行可生存性分析是一个非常重要的研究课题,但对于大型网络如何建模和评估生存性,尚无较好的解决方案。现存的两种评估生存性的方法存在较大局限性。提出了一种新的方法,分析邻接节点间的关系,用成熟的数学模型进行实现,使网络中间接的两个节点间形成一种生存性关系。理论分析和实验证明:该模型和算法是可行、有效的。可在不同状态下评估网络中可生存性,计算效率、适用性强于以前的算法。  相似文献   

16.
Due to the increasing complexity of the processors, developers often seek for tools that would simplify the process of finding bottlenecks while executing applications. Although more and more data may be collected from processors, usually much detailed knowledge about the internals of a given architecture is required to understand them.This paper introduces a Top-Down Characterization Approximation for the analysis of applications performance executed on AMD processors and is an extension of a Top-Down Method initially developed by Intel. Since not all required performance counters are available on AMD processors to calculate the exact values of metrics, this method was named as an approximation. It allows one to get a deeper understanding of different stages of program execution, compare different architectures and identify bottlenecks in out-of-order processors. It hides from the user the complexity of microarchitecture details and at the same time exposes the main contributors of inefficient program execution. This method aims at defining a few main metrics on top of performance counters to easily locate the main efficiency issues.At this time this method was applied to Intel processors only. The main reason behind it was the fact that it uses designated performance counters that are unique among different processors and its portability is not straightforward. Positive feedback from users encouraged the authors to develop a similar technique for AMD processors.  相似文献   

17.
This paper describes our effort to build extensible routers using a combination of general‐purpose and network processors. We emphasize five overriding challenges that dictate our design decisions: (1) optimal resource allocation; (2) efficient but flexible scheduling of the CPU; (3) maintaining overall router robustness; (4) maximizing router performance; and (5) providing sufficient extensibility to enable the injection of new functionality into the router. We adopt a hierarchical architecture, in which packet flows traverse a range of processing/forwarding paths, thereby partitioning hardware and software in concert. This paper both presents the architecture, and describes our experiences implementing the architecture and addressing the five design challenges in a prototype built from Intel IXP 1200 and a Pentium. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

18.
Distributed object-oriented environments have become important platforms for parallel and distributed service frameworks. Among distributed object-oriented software, .NET Remoting provides a language layer of abstractions for performing parallel and distributed computing in .NET environments. In this paper, we present our methodologies in supporting .NET Remoting over meta-clustered environments. We take the advantage of the programmability of network processors to develop the content-based switch for distributing workloads generated from remote invocations in .NET. Our scheduling mechanisms include stateful supports for .NET Remoting services. In addition, we also propose scheduling policy to incorporate work-flow models as the models are now incorporated in many of tools of grid architectures. The result of our experiment shows that the improvement of EFT is from 5% to 21% when compared to ETT and is from 8% to 34% when compared to RR while the stateful task ratio is 50%. Our schemes are effective in supporting the switching of .NET Remoting computations over meta-cluster environments.  相似文献   

19.
The fast-changing communications market requires high-performance yet flexible network-processing platforms. StepNP is an exploratory network processor simulation environment for exploring applications, multiprocessor network-processing architectures, and SoC tools. Supporting model interaction, instrumentation, and analysis, the platform lets R&D teams easily add new processors, coprocessors, and interconnects.  相似文献   

20.
CORDIC流水线结构因其高吞吐率及规整性,而很适合于FFT蝶形运算,但其缺点是耗资源多,本文从FFT中旋转因子固定不任意的特点出发,根据CORDIC基本旋转角度与缩放因子的对应关系和缩放因子之间的转换规律,对CORDIC流水线结构进行了改进,在蝶形运算速度不变的情况下,进一步减少所耗资源,在字长为16位的FFT中,每个旋转因子可用25位的控制序列来替代,从而使每个旋转因子的存储空间由32位减少到25位。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号