Similar Literature
1.
Stream processors, with the stream programming model, have demonstrated significant performance advantages in the domains of signal processing, multimedia, and graphics applications. In this paper we examine the applicability of a stream processor to 2-D Jacobi iteration, which is widely used to solve partial differential equations, an important class of scientific programs. We first map a FORTRAN version of 2-D Jacobi iteration to the stream processor in a straightforward way. In a stream processor system, the management of system resources is the programmer's responsibility. We then present several optimizations that let the stream program for 2-D Jacobi iteration, called StreamJacobi, take advantage of various aspects of the stream processor architecture. Finally, we analyze the performance of StreamJacobi at different problem scales and with the presented optimizations. The final stream program StreamJacobi is 2.31 to 6.42 times faster than the corresponding FORTRAN programs on a Xeon 5100 processor, with the optimizations playing an important role in realizing the performance improvement.
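
For readers unfamiliar with the kernel named in this abstract, the following is a minimal sketch of 2-D Jacobi iteration for a Laplace-type problem on a rectangular grid with fixed boundary values. It is a plain Python illustration, not the StreamJacobi program or the FORTRAN code from the paper; the function names `jacobi_step` and `jacobi`, the tolerance, and the iteration cap are all assumptions made for this example.

```python
def jacobi_step(grid):
    """Return a new grid after one Jacobi sweep; boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]          # copy so the sweep reads only old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior point becomes the average of its four neighbours.
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def jacobi(grid, tol=1e-6, max_iters=10_000):
    """Iterate until the largest point-wise change drops below tol.

    Returns the final grid and the number of sweeps performed.
    tol and max_iters are illustrative defaults, not values from the paper.
    """
    for it in range(1, max_iters + 1):
        new = jacobi_step(grid)
        diff = max(abs(new[i][j] - grid[i][j])
                   for i in range(len(grid))
                   for j in range(len(grid[0])))
        grid = new
        if diff < tol:
            return grid, it
    return grid, max_iters
```

Because every interior point in a sweep depends only on values from the previous sweep, the updates are independent of one another, which is the property that makes the kernel a natural candidate for data-parallel hardware.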

2.
The demand for high-performance embedded processors in multimedia mobile electronics is growing, and their power consumption increasingly threatens battery lifetime. It is usually believed that dynamic voltage and frequency scaling (DVFS) saves significant energy by changing the performance levels of processors to match the performance demands of applications on the fly. However, because the energy efficiency of embedded processors is rapidly improving, the effectiveness of DVFS is expected to change. In this paper, we analyze the benefit of DVFS in state-of-the-art mobile embedded platforms in comparison to that in servers or PCs. To obtain a clearer view of the relationship between power and performance, we develop a measurement methodology that can synchronize time series for power consumption with those for processor utilization. The results show that DVFS hardly improves the energy efficiency of mobile multimedia electronics, and can even significantly worsen energy efficiency and performance in some cases. Based on this observation, we suggest that power management for mobile electronics should concentrate on adaptive and intelligent power management for peripheral devices. As a preliminary design, we implement an adaptive network interface card (NIC) speed control that reduces power consumption by 10% when the NIC is not heavily used. Our results provide valuable insights into the design of power management schemes for future mobile embedded systems.

3.
The Godson-3B processor is a powerful processor designed for high-performance servers, including Dawning servers. It offers significantly improved performance over previous Godson-3 series CPUs by incorporating eight CPU cores and vector computing units. It contains 582.6 M transistors within a 300 mm² area in 65 nm technology and is implemented in parallel with full hierarchical design flows. In Godson-3B, advanced clock distribution mechanisms including GALS (Globally Asynchronous Locally Synchronous) and clock mesh are adopted to obtain an OCV-tolerant clock network. Custom-designed de-skew modules are also implemented to allow further latency balancing after fabrication. Power reduction in Godson-3B is achieved by MLMM (Multi Level Multi Mode) clock gating and multi-threshold-voltage cell substitution schemes. The highest frequency of Godson-3B is 1.05 GHz and the peak performance is 128 GFlops (double-precision) or 256 GFlops (single-precision) with 40 W power consumption.

4.
The general-purpose processor (GPP) is an important platform for the fast Fourier transform (FFT), due to its flexibility, reliability and practicality. Because FFT is a representative application that is intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits the performance of many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP, and also guides architecture optimization for GPPs. Based on the model, two theorems on the optimization of architecture parameters are deduced, which concern the lower bounds of register number and memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model. The above investigations were adopted in the development of Godson-3B, which is an industrial GPP. The optimization techniques deduced from our performance model improve FFT performance by about 40%, while incurring only 0.8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 W power consumption, and has the highest performance-per-watt in complex FFT among processors as far as we know. This work could benefit the optimization of other GPPs as well.
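
To put the quoted FFT timing in perspective, the sketch below converts it into a nominal throughput. It assumes the common benchmarking convention of counting 5·N·log2(N) floating-point operations for a complex FFT of size N; the abstract does not say which operation count the authors use, so the resulting figure is only an illustration. The helper names `fft_nominal_flops` and `gflops` are made up for this example.

```python
import math


def fft_nominal_flops(n):
    """Nominal flop count for a size-n complex FFT (5 * n * log2(n) convention).

    This convention is an assumption of the example, not something the abstract states.
    """
    return 5 * n * math.log2(n)


def gflops(n, seconds):
    """Nominal throughput in GFlops for one size-n FFT completed in `seconds`."""
    return fft_nominal_flops(n) / seconds / 1e9


# Figures taken from the abstract: 1024-point FFT, 0.368 microseconds, about 40 W.
rate = gflops(1024, 0.368e-6)
per_watt = rate / 40.0
print(f"{rate:.1f} GFlops nominal, {per_watt:.2f} GFlops/W")
```

Under this convention a 1024-point transform counts as 51,200 operations, so the quoted time corresponds to roughly 139 nominal GFlops, or about 3.5 GFlops per watt at 40 W.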

5.
In this paper, the architecture of trustworthy and controllable networks is discussed to meet emerging application requirements. After reviewing the lessons and experiences of success and failure in the Internet and summarizing related work, we analyze the basic goals of providing trustworthiness and controllability. Then, the anticipated architecture is introduced. Based on the resulting design, several trustworthy and controllable mechanisms are also discussed.

6.
Numerous new RISC processors provide support for supercomputing. Using the "mini-Cray" i860 superscalar processor, an add-on board has been developed to boost the performance of a real-time system. A parallel heterogeneous multiprocessor supercomputing system, TSP, is constructed. In this paper, we present the system design considerations and describe the architecture of the TSP and its features.

7.
The IBM Cell Broadband Engine (BE) simulator simulates parallel execution of programs on a state-of-the-art 9-core Cell processor model. In this paper, we report our experience with implementing, simulating, and analyzing the performance of image processing applications on the IBM Cell Broadband Engine simulator. We report the performance measured on the simulator for PPE-only and embedded applications, with various input data file sizes or numbers of SPEs enabled. The simulator results indicate that the Cell BE processor can outperform modern single-core RISC processors by orders of magnitude on SIMD compute-intensive applications such as edge detection. We also explore different features and development processes available on the simulator. Different techniques for obtaining accurate results (close to real hardware results) are also explored.

8.
There is a general consensus in academia and industry about the success of the Internet architecture. However, with the development of diversified applications, the existing Internet architecture is facing more and more challenges in scalability, security, mobility and performance. A novel evolvable Internet architecture framework is proposed in this paper to meet continuously changing application requirements. The basic idea of evolvability is to relax the constraints that limit the development of the architecture while adhering to the core design principles of the Internet. Three important design constraints used to ensure the construction of the evolvable architecture are comprehensively described: the evolvability constraint, the economic adaptability constraint and the manageability constraint. We consider that the evolvable architecture can be developed from the network layer under these design constraints. Moreover, we believe that the address system is the foundation of the Internet. Therefore, we propose a general address platform that provides a more open and efficient network environment for the research and development of the evolvable architecture.

9.
Multithreading is a developing trend in high-performance processors. The memory consistency model is essential to the correctness, performance and complexity of a multithreaded processor. A chip multithreaded consistency model suited to multithreaded processors is proposed in this paper. The restrictions imposed on memory event ordering by chip multithreaded consistency are presented and formalized. Using the idea of the critical cycle developed by Wei-Wu Hu, we prove that the proposed chip multithreaded consistency model satisfies the criterion of correct execution of the sequential consistency model. The chip multithreaded consistency model provides a way of achieving higher performance than the sequential consistency model and ensures software compatibility, in that the execution result on a multithreaded processor is the same as the execution result on a uniprocessor. An implementation strategy for the chip multithreaded consistency model in the Godson-2 SMT processor is also proposed. The Godson-2 SMT processor supports the chip multithreaded consistency model correctly by means of an exception scheme based on the sequential memory access queue of each thread.

10.
Biologically Inspired Behaviour Design for Autonomous Robotic Fish   Total citations: 1, self-citations: 0, citations by others: 1
The behaviour-based approach plays a key role in enabling mobile robots to operate safely in unknown or dynamically changing environments. We have developed a hybrid control architecture for our autonomous robotic fish that consists of three layers: cognitive, behaviour and swim pattern. In this paper, we describe some of the main design issues of the behaviour layer, which is the centre of the layered control architecture of our robotic fish. Fuzzy logic control (FLC) is adopted here to design individual behaviours. Simulation and real experiments are presented to show the feasibility and performance of the designed behaviour layer.

11.
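Fault Injection Method and Soft Error Sensitivity Analysis for the Godson-1 Processor   Total citations: 12, self-citations: 0, citations by others: 12
Under nanometer-scale manufacturing processes and in special applications such as aerospace, reliability will be an important consideration in processor design. Taking the Godson-1 processor as the object of study, this paper explores fault injection methods for processor reliability design and proposes a fault injection and analysis method that runs two processor RTL models simultaneously, enabling continuous and fast fault injection in processor simulation. On this basis, the soft error sensitivity of the Godson-1 processor is further analyzed. Rapidly injecting about 300,000 soft errors ensures that the analysis results are statistically meaningful and can effectively guide subsequent fault-tolerance and reliability design.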
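
The idea of running two models side by side can be illustrated with a toy example. The sketch below is not the paper's RTL setup: it replaces the processor with a hypothetical three-register state machine, flips one bit in a "faulty" copy, steps it in lockstep with a fault-free "golden" copy, and reports whether the observable output ever diverges. All names (`step`, `inject_and_compare`, `campaign`) and the model itself are assumptions made for illustration.

```python
import random

WIDTH = 8                      # bits per register in the toy model
MASK = (1 << WIDTH) - 1


def step(state):
    """Advance the toy model one cycle.

    state = [counter, accumulator, scratch]; only the accumulator is treated
    as an observable output, and scratch is never read by the model.
    """
    counter, acc, scratch = state
    return [(counter + 1) & MASK, (acc + counter) & MASK, scratch]


def inject_and_compare(reg, bit, inject_cycle, total_cycles):
    """Run golden and faulty copies in lockstep; return True if outputs diverge."""
    golden = [0, 0, 0]
    faulty = [0, 0, 0]
    for cycle in range(total_cycles):
        if cycle == inject_cycle:
            faulty[reg] ^= 1 << bit      # a single bit flip models a soft error
        golden = step(golden)
        faulty = step(faulty)
        if golden[1] != faulty[1]:       # compare the observable output
            return True
    return False


def campaign(num_faults, total_cycles=50, seed=0):
    """Inject random single-bit faults and return the fraction that became visible."""
    rng = random.Random(seed)
    visible = 0
    for _ in range(num_faults):
        reg = rng.randrange(3)
        bit = rng.randrange(WIDTH)
        cycle = rng.randrange(total_cycles)
        if inject_and_compare(reg, bit, cycle, total_cycles):
            visible += 1
    return visible / num_faults
```

Calling `campaign(1000)` gives the fraction of injected faults that reached the output of this toy model. Faults in the unused scratch register are always masked, which mirrors, in miniature, why sensitivity differs between parts of a real design.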

12.
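Software-Hardware Interface Design for the Godson-2 Simultaneous Multithreading Processor   Total citations: 1, self-citations: 0, citations by others: 1
As manufacturing processes improve, more and more transistors can be integrated on a chip, and multithreading has gradually become a mainstream processor architecture technique, making the software-hardware interface of multithreaded processors a problem in urgent need of a solution. Based on an analysis of the software requirements of simultaneous multithreading, this paper proposes a software-hardware interface co-design solution for the Godson-2 simultaneous multithreading processor and gives a corresponding operating system implementation scheme. An operating system for the Godson-2 simultaneous multithreading processor is implemented on the basis of Linux 2.4.20. Performance evaluation with benchmarks such as SPEC CPU2000 shows that the Godson-2 simultaneous multithreading processor with this software-hardware interface greatly improves the performance of multi-process workloads. The analysis and design scheme apply not only to simultaneous multithreading processors but can also inform the design of chip multiprocessors.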

13.
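Design and Performance Analysis of the Godson-2 Processor   Total citations: 16, self-citations: 4, citations by others: 16
This paper introduces the design of the Godson-2 processor and its performance test results. Godson-2 adopts a four-issue superscalar, superpipelined architecture. The on-chip level-1 instruction and data caches are 64 KB each, and the off-chip level-2 cache can be up to 8 MB. To make full use of the pipeline, Godson-2 implements advanced out-of-order execution techniques such as branch prediction, register renaming and dynamic scheduling, as well as dynamic memory access mechanisms such as non-blocking cache access and load speculation. The Godson-2 processor is implemented in a 0.18 μm CMOS process; its maximum operating frequency at normal voltage is 500 MHz, and its measured power consumption at 500 MHz is 3 to 5 W. The peak single-precision floating-point speed of Godson-2 is 2 billion operations per second and the double-precision floating-point speed is 1 billion operations per second. Its measured SPEC CPU2000 performance is 8 to 10 times that of Godson-1, and its overall performance has reached the level of the Pentium III. The current chip prototype smoothly runs a complete 64-bit Chinese Linux operating system, a full-featured Mozilla browser, a multimedia player and the OpenOffice office suite, and can meet the requirements of the vast majority of desktop applications.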

14.
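The performance of a multi-core processor is closely tied to system software. The operating system is the interface between the processor and applications and plays an extremely important role in fully exploiting processor features and improving application performance. The compiler is closely related to the processor architecture: it must generate binary code that the processor supports, and it must also exploit processor features to generate code that runs efficiently, so its quality directly affects overall system performance. To improve the real-world performance of Godson-3A systems, a series of effective optimizations were carried out in the operating system and the compiler, taking the microarchitectural features of Godson-3A into account. These measures include a CC-NUMA multi-core operating system implementation, an operating system level-2 cache locking mechanism, shared level-2 cache allocation in operating system scheduling, automatic vectorizing compilation, and compilation that supports the prefetch mechanism. Experimental results show that adding support for processor features to system software can fully exploit the advantages of the architecture and considerably benefits system performance. These performance optimization techniques may also serve as a reference for optimizing other processors.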

15.
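Simultaneous Multithreading Design of the Godson-2 Processor   Total citations: 1, self-citations: 0, citations by others: 1
A simultaneous multithreading processor model suited to the Godson-2 processor is proposed, and the detailed microarchitecture design and the corresponding Linux operating system implementation scheme are described. Performance is evaluated by booting the Linux operating system on the designed Godson-2 simultaneous multithreading processor and running applications such as SPEC CPU2000. The results show that, by exploiting thread-level parallelism, the Godson-2 simultaneous multithreading processor improves the performance of the Godson-2 processor by 31.1%.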

16.
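Based on the architectural characteristics of the Godson-2 processor, a series of special instructions such as floating-point multiply-add, conditional move and prefetch are introduced, and the open-source compiler GCC is modified to support them. The algorithms that generate the corresponding instructions are also adjusted and optimized. Practice has shown that introducing the special instructions and the corresponding optimizations improves application performance considerably and achieves the expected results.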

17.
Reilly, M.; Edmondson, J. Computer, 1998, 31(5): 50-58
Although producing a finished microprocessor takes the effort of many engineers in many disciplines, the first step requires that an architecture team sketch out the organization of a better, faster and cheaper chip. This effort involves searching for a solution to a design problem in a space of possible solutions. Some solutions in the space are bad, some are better, and only a few are satisfactory. The goal is to find the solution that satisfies the product goals. The architecture team must invent and refine the design until it converges on an implementable architecture. In designing Digital's Alpha processors, our teams are guided in large part by an executable performance model. The model allows us to measure the effect or utility of each invention and improvement. We describe the performance model that guides one of our current Alpha processor design projects.

18.
The Cydra 5 is a heterogeneous multiprocessor system that targets small work groups or departments of scientists and engineers. The two types of processors are functionally specialized for the different components of the workload found in a departmental setting. The Cydra 5 numeric processor, based on a directed-data-flow architecture, provides consistently high performance on a broader class of numerical computations. The interactive processors offload all nonnumeric work from the numeric processor, leaving it free to spend all its time on the numeric application. The I/O processors permit high-bandwidth I/O transitions with minimal involvement from the interactive or numeric processors. The system architecture and data-flow architecture are described. The numeric processor decisions and tradeoffs are examined, and the main memory system is discussed. Some reflections on the design issues are offered.

19.