Similar Literature
20 similar documents retrieved.
1.
2.
A new type of high-performance array processor system is presented in this paper. Unlike conventional host-peripheral array processor systems, this system is designed with a functionally distributed approach. The design philosophy is described first. The hardware organizations of two concrete systems, namely the 150-AP and the GF-10/12, including the communication between processors, are then shown. Some attractive system performance results for user programs are also given.

3.
As video codec standards continue to evolve, the amount of data processed by their algorithms grows dramatically. Multi-core parallel processing increases the computation speed of the algorithms, but it also turns the memory architecture into the performance bottleneck of the whole codec system. Targeting three characteristics of video codec algorithms, namely the locality of memory accesses, the frequent data exchange between algorithm stages, and the large amount of temporary data inside each stage that is never exchanged, a multi-level distributed memory architecture consisting of a private storage layer and a shared storage layer is designed and implemented. The design was tested on a Xilinx Virtex-6 xc6vlx550T development board. Experimental results show that the architecture, while remaining simple and scalable, provides a peak memory access bandwidth of 9.73 GB/s, which satisfies the data access requirements of video codec algorithms.

4.
Design of array processor software for nonlinear structural analysis
This paper presents an investigation of the solution of large-scale nonlinear structural problems using a 32-bit minicomputer with an attached 64-bit array processor. The two processors communicate via a common memory interface. A user-oriented software package has been designed to allow the use of the given computer configuration by a typical engineer or scientist. It is possible to use the system without a detailed knowledge of the operation of the hardware or of the complex data handling necessary to manipulate the data associated with a large problem. Several test examples are considered using finite element 3-D frame models and the Newton-Raphson solution scheme. The array processor could not yet be utilized, due to the lack of proper vendor software; hence, a simulator was designed to predict the system performance.
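The abstract above mentions solving finite element 3-D frame models with the Newton-Raphson scheme. As a minimal illustration of that scheme (not the paper's software package), the sketch below iterates on F_int(u) = F_ext using a tangent stiffness matrix; the toy one-degree-of-freedom hardening-spring example and all names are assumptions for illustration only.

```python
import numpy as np

def newton_raphson(f_int, tangent_K, f_ext, u0, tol=1e-8, max_iter=50):
    """Minimal Newton-Raphson loop for a nonlinear system F_int(u) = F_ext.

    f_int     : callable returning the internal force vector for displacements u
    tangent_K : callable returning the tangent stiffness matrix dF_int/du
    f_ext     : external load vector
    u0        : initial guess for the displacement vector
    """
    u = np.array(u0, dtype=float)
    for it in range(max_iter):
        residual = f_ext - f_int(u)          # out-of-balance force
        if np.linalg.norm(residual) < tol:   # converged
            return u, it
        du = np.linalg.solve(tangent_K(u), residual)  # linearized correction
        u += du
    raise RuntimeError("Newton-Raphson did not converge")

# Toy 1-DOF example: a hardening spring F_int(u) = k*u + a*u**3 under load P.
k, a, P = 100.0, 5.0, 250.0
u_sol, iters = newton_raphson(
    f_int=lambda u: np.array([k * u[0] + a * u[0] ** 3]),
    tangent_K=lambda u: np.array([[k + 3 * a * u[0] ** 2]]),
    f_ext=np.array([P]),
    u0=[0.0],
)
print(u_sol, iters)
```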

5.
The design and construction of an array processor that computes autocorrelation functions are presented. The architecture of this system offers speed while avoiding complexity. Random-access memory with shifting-across-zero techniques and a high-speed address generator are used. Performance is measured for different array sizes and compared with the time required to process the same arrays in software.
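For reference, the computation the hardware accelerates is the autocorrelation r[k] = Σ_n x[n]·x[n+k]. Below is a minimal software baseline of the kind the abstract compares against; the array size and test signal are illustrative.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Direct (unnormalized) autocorrelation r[k] = sum_n x[n] * x[n + k]
    for lags k = 0 .. max_lag, the kind of computation an array processor
    would pipeline in hardware."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) for k in range(max_lag + 1)])

# Example: autocorrelation of a short noisy sinusoid.
t = np.arange(256)
signal = np.sin(2 * np.pi * t / 32) + 0.1 * np.random.randn(256)
r = autocorrelation(signal, max_lag=64)
print(r[:5])
```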

6.
7.
Concepts such as connectivity, network width, and network scale for the internal interconnection network of a reconfigurable cryptographic processor are proposed, together with several design principles that should be followed during the design process. Three typical internal interconnection networks, namely full interconnection, single bus, and multiple bus, are presented, and their characteristics are analyzed.

8.

The application of multimedia in embedded systems (ES), such as virtual reality and 3-D imaging, represents the current trend in ES development. Coupling multimedia with ES has raised new multimedia-related challenges that have been added to the common ES constraints. These challenges deal with the real-time, quality, performance, and efficient processing requirements of multimedia applications. The integration of self-adaptation in ES development has been, for many years, a paramount solution to cope with these issues. Although there has been extensive research on the topic of ES self-adaptation, the related works still lack global approaches that better deal with multimedia-related constraints. Coordinating different adaptation mechanisms, monitoring multiple system constraints, and supporting multi-application contexts are still underexplored. The aim of the present work is to fill in these gaps by providing a global adaptation approach that offers better adaptation decisions with fair resource sharing among competing multimedia applications. With the above challenges in mind, we propose a multi-constraint combined adaptation approach that targets multimedia ES. It addresses four critical system constraints: maximizing the overall system's Quality of Application (QoA) under the real-time constraint, the remaining system energy, and the available network bandwidth. It coordinates the adaptation at both the application and architecture levels. To test and validate the proposed technique, a videophone system is designed on a Xilinx FPGA development board. It executes two complex multimedia applications. The validation results show the ability of the proposed system to reconfigure itself successfully at run-time in response to its constraints.


9.
A power model is established and implemented according to the architecture of a reconfigurable processor. The model abstracts the circuit-level characteristics of the processor, estimates static peak power from architecture-level attributes and process parameters, collects dynamic power statistics on a performance simulator, and implements clock gating under three conditional-clock schemes. Compared with a superscalar general-purpose microprocessor, the reconfigurable processor achieves an average speedup of 3.59 in performance, while its average increase in power consumption is only a factor of 1.48. Experiments also show that the simple CC1 gating technique can effectively reduce the power consumption and hardware complexity of the reconfigurable system. The model lays a foundation for low-power design of reconfigurable processors and for research on compiler-level low-power optimization.

10.
Growing demand for high-speed processing of streamed data (e.g. video streams, digital signal streams, communication streams, etc.) in advanced manufacturing environments requires adequate, cost-efficient stream-processing platforms. Platforms based on embedded microprocessors often cannot satisfy performance requirements due to limitations associated with the sequential nature of the data execution process. During the last decade, development and prototyping of such embedded platforms has been moving towards utilization of Field Programmable Gate Array (FPGA) devices. However, programming an application onto an FPGA-based platform remains an issue due to the relatively complicated hardware design process. The paper presents an approach which simplifies the application programming process by utilizing: (i) a uniform FPGA platform with a dynamically reconfigurable architecture, and (ii) a programming technique based on temporal partitioning of the application into segments which can be described in terms of macro-operators (function-specific virtual components). The paper describes the concept of the approach and presents an analytical investigation and experimental verification of the cost-effectiveness of the proposed platform compared to platforms based on sequential microprocessors. It is also shown that the approach can be beneficially utilized in collaborative design and manufacturing.

11.
Research on a reconfigurable SPJ query processor for data streams
Real-time processing of data streams demands very high processing speed, and one solution is to use a coprocessor. However, the hard wiring of a coprocessor is fixed, while queries change constantly, so its overall performance over time cannot remain optimal. To improve stream-processing speed and resource utilization, a reconfigurable SPJ (select-project-join) query processor for data streams is adopted. Built on three query modules, selection, projection, and join, together with their corresponding instruction sets, it invokes the appropriate modules according to the query tree of each incoming query, adaptively reprograms the FPGA to change its own wiring, and thereby processes the data stream. Extensive experiments verify that the processor is not only correct but also fast and flexible.
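As a rough software analogue of the idea described above, the sketch below interprets a small query tree built from selection, projection, and join operators over tuple streams; the function names and tuple format are hypothetical and do not reflect the processor's actual modules or instruction set.

```python
# Hypothetical software model of an SPJ query tree: each node corresponds to
# one of the three module types (select, project, join) the abstract names.

def select(stream, predicate):
    """Selection module: keep tuples satisfying the predicate."""
    return (t for t in stream if predicate(t))

def project(stream, fields):
    """Projection module: keep only the requested fields of each tuple."""
    return ({f: t[f] for f in fields} for t in stream)

def join(left, right, key):
    """Join module: hash join of two tuple streams on a common key."""
    table = {}
    for t in left:
        table.setdefault(t[key], []).append(t)
    for r in right:
        for l in table.get(r[key], []):
            yield {**l, **r}

# Example query tree: project(select(join(orders, customers)))
orders = [{"cust": 1, "amount": 30}, {"cust": 2, "amount": 75}]
customers = [{"cust": 1, "name": "A"}, {"cust": 2, "name": "B"}]
result = project(
    select(join(orders, customers, key="cust"), lambda t: t["amount"] > 50),
    fields=["name", "amount"],
)
print(list(result))  # [{'name': 'B', 'amount': 75}]
```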

12.
To address the slow compilation and poor reusability of embedded applications caused by the complexity of real-time operating system kernels, a residency method for the RT-Thread real-time operating system based on the General Embedded Computer (GEC) architecture is proposed. On top of a reasonable partitioning of the memory space, interrupt service routines are shared so as to provide users with function-call services for both low-level drivers and the software application layer. Finally, RT-Thread residency is tested on the D1-H application processor as an example. Practical results show that the residency method achieves physical isolation between the system kernel and application programs, shortens compilation time, and improves development efficiency, providing an application foundation for timely, convenient, and simple embedded program development.

13.
We present efficient parallel matrix multiplication algorithms for linear arrays with reconfigurable pipelined bus systems (LARPBS). Such systems are able to support a large volume of parallel communication of various patterns in constant time. An LARPBS can also be reconfigured into many independent subsystems and, thus, is able to support parallel implementations of divide-and-conquer computations like Strassen's algorithm. The main contributions of the paper are as follows. We develop five matrix multiplication algorithms with varying degrees of parallelism on the LARPBS computing model, namely MM1, MM2, MM3, and the compound algorithms C1(ε) and C2(δ). Algorithm C1(ε) has adjustable time complexity at the sublinear level. Algorithm C2(δ) implies that it is feasible to achieve sublogarithmic time using o(N^3) processors for matrix multiplication on a realistic system. Algorithms MM3, C1(ε), and C2(δ) all have o(N^3) cost and, hence, are very processor efficient. Algorithms MM1, MM3, and C1(ε) are general-purpose matrix multiplication algorithms, where the array elements are in any ring. Algorithms MM2 and C2(δ) are applicable to array elements that are integers of bounded magnitude, or floating-point values of bounded precision and magnitude, or Boolean values. Extension of algorithms MM2 and C2(δ) to unbounded integers and reals is also discussed.
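For context, the divide-and-conquer structure the LARPBS can host is Strassen's recursion, which replaces eight half-size matrix products with seven. Below is a plain sequential sketch of that recursion for power-of-two sizes (not the MM or compound algorithms of the paper); the cutoff value is an illustrative choice.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's divide-and-conquer matrix multiplication for square
    matrices whose size is a power of two; falls back to NumPy below cutoff."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Sanity check against the library routine.
X = np.random.rand(128, 128)
Y = np.random.rand(128, 128)
assert np.allclose(strassen(X, Y), X @ Y)
```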

14.
15.
This paper proposes a mixed-level simulator for dynamic coarse-grained reconfigurable processors (CGRP), called ReSSIM (reconfigurable system simulation implementation mechanism), and the corresponding simulation tool-chain, including a task compiler, profiler, and debugger. A generic modeling methodology supporting convenient extension of on-chip modules is also proposed. In order to explore the details of the modules of interest while maintaining reasonable simulation speed, the RCA (reconfigurable computing array), the key reconfigurable device in ReSSIM, is modeled at the cycle-accurate level, while the other modules are modeled at the transaction level. The typical parameters of the RCA are scalable and adjustable, which helps architects explore the massive details of the reconfigurable device. Experiments show that the simulation speedup achieved ranges from 9.26× to 18.39× compared with VCS (Synopsys Verilog compiler simulator) when running three computation-intensive kernel tasks of the H.264 decoding algorithm: IDCT (inverse discrete cosine transform), deblocking, and MC-chroma (motion compensation). Simulation speed for a set of real applications, such as MPEG4, G.729, and EFR, is 35× slower than the corresponding native executions (i.e. measured from the real chip), and the relative simulation error with respect to the measured IPC (instructions per cycle) of the real chip is less than 11%.

16.
Using reconfigurable cache and dynamic voltage scaling techniques, a phase-based adaptive low-energy algorithm, PBLEA (phase-based low energy algorithm), is proposed for the processor and its cache. The algorithm uses a phase-monitoring state machine, built on instruction working-set signatures, to decide whether the program phase has changed and to make decisions on adjusting the cache capacity and the CPU voltage and frequency. Within a phase, the cache capacity and then the CPU voltage and frequency are adjusted by means of a capacity-tuning state machine and by computing a frequency scaling factor β. The algorithm was implemented on the Simpanalyzer simulator. Tests on the MiBench benchmark suite show that …
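Below is a minimal software sketch of the phase-monitoring idea described above, comparing working-set signatures between consecutive intervals; the distance metric, threshold, and example signatures are assumptions for illustration and are not the PBLEA state machine itself.

```python
# Hypothetical sketch of phase detection from instruction working-set
# signatures: a signature is the set of (basic-block or page) addresses
# touched in an interval; a large relative difference between consecutive
# signatures signals a phase change, triggering cache/DVFS re-tuning.

def signature_distance(prev, curr):
    """Relative working-set distance: |symmetric difference| / |union|."""
    if not prev and not curr:
        return 0.0
    return len(prev ^ curr) / len(prev | curr)

def monitor(intervals, threshold=0.5):
    """Yield, for each interval, whether a new phase was detected."""
    prev = set()
    for sig in intervals:
        changed = signature_distance(prev, sig) > threshold
        prev = sig
        yield changed

# Example: three intervals; the third touches a mostly new working set.
intervals = [{0x10, 0x11, 0x12}, {0x10, 0x11, 0x13}, {0x40, 0x41, 0x42}]
print(list(monitor(intervals)))  # [True, False, True]
```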

17.
Several mesh-like coarse-grained reconfigurable architectures have been devised in the last few years, accompanied by their corresponding mapping flows. One of the major bottlenecks in mapping algorithms onto these architectures is the limited memory access bandwidth. Only a few mapping methodologies have addressed the problem of limited bandwidth, and none has explored how the performance improvements are affected by the architectural characteristics. We study in this paper the impact that the architectural parameters have on the performance speedups achieved when the PEs' local RAMs are used for storing the variables with data reuse opportunities. The data reuse values are transferred over the internal interconnection network instead of being fetched from external memories, in order to reduce the data transfer burden on the bus network. A novel mapping algorithm is also proposed that uses a list scheduling technique. The experimental results quantify the trade-offs that exist between the performance improvements and the memory access latency, the interconnection network, and the processing element's local RAM size. For this reason, our mapping methodology targets a flexible architecture template, which permits such an exploration. More specifically, the experiments showed that the improvements increase with the memory access latency, while a richer interconnection topology can improve the operation parallelism by a factor of 1.4 on average. Finally, for the considered set of benchmarks, the operation parallelism has been improved by 8.6% to 85.1% through the application of our methodology, with each PE's local RAM having a size of 8 words.
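The mapping algorithm above is said to rely on list scheduling. Below is a generic list-scheduling sketch for a small operation DAG onto a fixed number of PEs; the priority function, latencies, and example DAG are illustrative assumptions, not the paper's methodology.

```python
def list_schedule(dag, latency, num_pes):
    """Generic list scheduling: repeatedly pick the ready operation with the
    highest static level (longest path to a sink) and place it on the
    earliest-free PE, respecting data dependences. dag maps op -> predecessors."""
    succs = {op: [] for op in dag}
    for op, preds in dag.items():
        for p in preds:
            succs[p].append(op)

    level = {}
    def static_level(op):
        if op not in level:
            level[op] = latency[op] + max((static_level(s) for s in succs[op]), default=0)
        return level[op]

    finish = {}                      # op -> finish cycle
    pe_free = [0] * num_pes          # next free cycle of each PE
    remaining = set(dag)
    while remaining:
        ready = [op for op in remaining if all(p in finish for p in dag[op])]
        op = max(ready, key=static_level)
        pe = min(range(num_pes), key=lambda i: pe_free[i])
        start = max([pe_free[pe]] + [finish[p] for p in dag[op]])
        finish[op] = start + latency[op]
        pe_free[pe] = finish[op]
        remaining.remove(op)
    return finish

# Tiny example DAG: a and b are independent, c needs both, d needs c.
dag = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
latency = {"a": 1, "b": 2, "c": 1, "d": 1}
print(list_schedule(dag, latency, num_pes=2))
```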

18.
Optical interconnections attract many engineers' and scientists' attention due to their potential for gigahertz transfer rates and concurrent access to the bus in a pipelined fashion. These unique characteristics of optical interconnections give us the opportunity to reconsider traditional algorithms designed for ideal parallel computing models, such as PRAMs. Since the PRAM model is far from practice, not all algorithms designed for this model can be implemented on a realistic parallel computing system. From this point of view, we study Cole's pipelined merge sort [Cole R. Parallel merge sort. SIAM J Comput 1988;14:770–85] on the CREW PRAM and extend it in an innovative way to an optical interconnection model, the LARPBS (Linear Array with Reconfigurable Pipelined Bus System) model [Pan Y, Li K. Linear array with a reconfigurable pipelined bus system: concepts and applications. J Inform Sci 1998;106:237–58]. Although Cole's algorithm is optimal, communication details have not been provided, due to the fact that it is designed for a PRAM. We close this gap in our sorting algorithm on the LARPBS model and obtain an O(log N)-time optimal sorting algorithm using O(N) processors. This is a substantial improvement over the previous best sorting algorithm on the LARPBS model, which runs in O(log N log log N) worst-case time using N processors [Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212–22]. Our solution allows processors to be assigned and reused efficiently. We also discover two new properties of Cole's sorting algorithm, which are presented as lemmas in this paper.
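The primitive underlying Cole-style pipelined merging is ranking each element of one sorted sequence into the other, so that every output position is known independently and the merge can proceed in parallel. Below is a sequential sketch of that ranking step (not the LARPBS communication schedule); the tie-breaking rule and example data are illustrative.

```python
from bisect import bisect_left, bisect_right

def merge_by_ranks(a, b):
    """Merge two sorted lists by first computing, for every element, its rank
    in the other list; each output position then follows directly, which is
    what allows the merge to be performed in parallel in Cole-style algorithms."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # position of a[i] = own index + number of elements of b strictly smaller
        out[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # position of b[j] = own index + number of elements of a that are <= y
        out[j + bisect_right(a, y)] = y
    return out

print(merge_by_ranks([1, 4, 7], [2, 4, 9]))  # [1, 2, 4, 4, 7, 9]
```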

19.
Reconfigurable platforms can be very effective for lowering production costs because they allow the reuse of architecture resources across a variety of applications. We show how to program a reduced-instruction-set-computing (RISC) microprocessor with a reconfigurable functional unit, focusing on DSP applications and using the example of a turbodecoder. We have developed a complete design flow, including a methodology and compilation tool chain, to address the instruction set hardware-software codesign problem for a processor with a runtime reconfigurable unit. The flow starts from a system-level specification (usually a software program) of the application and partitions it into software and hardware domains to achieve the best speed, power, and area performance, while satisfying resource constraints imposed by the target platform architecture. We describe a methodology and a set of tools that allow extensive design exploration for hardware-software codesign with the goal of improving the overall utilization of reconfigurable multimedia platforms.

20.
In many scientific applications, array redistribution is required to enhance data locality and reduce remote memory accesses in parallel programs on distributed-memory multicomputers. Since the redistribution is performed at runtime, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a generalized processor mapping technique to minimize the amount of data exchange for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) array redistribution and vice versa. The main idea of the generalized processor mapping technique is first to develop mapping functions for computing a new rank for each destination processor. Based on the mapping functions, a new logical sequence of destination processors can be derived. The new logical processor sequence is then used to minimize the amount of data exchange in a redistribution. The generalized processor mapping technique can handle array redistribution with arbitrary source and destination processor sets and can be applied to multidimensional array redistribution. We present a theoretical model to analyze the performance improvement of the generalized processor mapping technique. To evaluate the performance of the proposed technique, we have implemented the generalized processor mapping technique on an IBM SP2 parallel machine. The experimental results show that the generalized processor mapping technique can provide performance improvement over a wide range of redistribution problems.
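For intuition, the sketch below applies the standard BLOCK-CYCLIC ownership rule to count the element traffic between source and destination processors in a BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution; it is not the paper's rank-remapping functions, and the sizes and parameters are illustrative.

```python
def owner(index, block, num_procs):
    """Owning processor of a global element under a BLOCK-CYCLIC(block)
    distribution over num_procs processors."""
    return (index // block) % num_procs

def redistribution_traffic(n, src_block, dst_block, num_procs):
    """Count how many elements each (source, destination) processor pair
    must exchange when going from BLOCK-CYCLIC(src_block) to
    BLOCK-CYCLIC(dst_block); diagonal entries stay local."""
    traffic = [[0] * num_procs for _ in range(num_procs)]
    for i in range(n):
        traffic[owner(i, src_block, num_procs)][owner(i, dst_block, num_procs)] += 1
    return traffic

# Example: BLOCK-CYCLIC(kr) -> BLOCK-CYCLIC(r) with k=2, r=3 on 4 processors.
for row in redistribution_traffic(n=48, src_block=6, dst_block=3, num_procs=4):
    print(row)
```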
