首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
龙芯2号同时多线程处理器的软硬件接口设计   总被引:1,自引:0,他引:1  
随着生产工艺的提高,芯片上能集成越来越多的晶体管,多线程技术也逐步成为一种主流的处理器体系结构技术,而多线程处理器的软硬件接口也就成为急需解决的问题.在分析同时多线程的软件需求的基础上,提出龙芯2号同时多线程处理器的软硬件接口协同设计解决方案,给出相应的操作系统实现方案.同时,在Linux 2.4.20的基础上实现了龙芯2号同时多线程处理器相应的操作系统.通过运行SPEC CPU2000等测试程序进行性能评测,充分说明实现软硬件接口的龙芯2号同时多线程处理器极大地提高了多进程负载的性能.分析和设计方案不仅适用于同时多线程处理器,而且对于片内多核处理器的设计也有借鉴作用.  相似文献   

2.
片上多核处理器存储一致性验证   总被引:2,自引:0,他引:2  
存储一致性验证是片上多核处理器功能验证的重要部分.由于验证并行程序的执行结果是否符合存储一致性模型理论上是NP难问题,现有的验证方法中只能采用一些时间复杂度大于O(n3)的不完全方法.发现在支持写原子性的多处理器系统中,两条执行时间不重叠的操作之间存在确定的时间序.通过引入时间序的概念,设计并实现了一种线性时间复杂度的存储一致性验证工具LCHECK.LCHECK利用时间序将验证局部化,使得在表示程序执行结果的有向图中,序关系边的推导和正确性检测都被限定在有限范围内.与现有其他方法相比,LCHECK时间复杂度低,对程序长度和访存地址数没有限制,因此验证效率更高.作为国产片上多核处理器龙芯3号的重要验证工具, LCHECK发现了一些存储系统的设计错误.  相似文献   

3.
Current trend of research on multithreading processors is toward the chip multithreading (CMT), which exploits thread level parallelism (TLP) and improves performance of softwares built on traditional threading components, e.g., Pthread. There exist commercially available processors that support simultaneous multithreading (SMT) on multicore processors. But they are basically based on the conventional sequential execution model, and execute multiple threads in parallel under the control of OS that handles interruptions. Moreover, there exist few languages or programming techniques to utilize the multicore processors effectively. We are taking another approach to develop a multithreading processor, which is dedicated to TLP. Our processor, named Fuce, is based on the continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions which are executed without interruption. Every thread execution is triggered only by the event called continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Last, the performance of the Fuce processor is evaluated by means of the clock-level software simulation.  相似文献   

4.
矩阵乘法作为高性能计算中的关键组成部分,是一种具有计算和访存密集特点的典型应用,因此优化矩阵乘法的性能对通用处理器是非常重要的.为了提高矩阵乘法的性能,本文提出了一种性能模型,用于预测通用处理器上矩阵乘法的执行时间.该模型反映了矩阵乘法执行时间与通用处理器的运算部件、访存带宽、寄存器个数等结构参数之间的关系,可以指导处理器结构的优化来平衡计算和访存能力、提高执行速度.基于该模型本文给出了在一个优化的通用处理器结构中,寄存器个数和访存带宽应满足的理论下界.本文在Godson-3B处理器平台上对该性能模型进行了验证,实验结果表明矩阵乘法执行时间的预测精确度达到95%以上.基于该模型,本文还提出了一种对Godson-3B结构进行优化的方法,使矩阵乘法的执行时间减少了50%左右.  相似文献   

5.
This paper introduces the microarchitecture and physical implementation of the Godson-2E processor, which is a four-issue superscalar RISC processor that supports the 64-bit MIPS instruction set. The adoption of the aggressive out-of-order execution and memory hierarchy techniques help Godson-2E to achieve high performance. The Godson-2E processor has been physically designed in a 7-metal 90nm CMOS process using the cell-based methodology with some bitsliced manual placement and a number of crafted cells and macros. The processor can be run at 1GHz and achieves a SPEC CPU2000 rate higher than 500.  相似文献   

6.
Microarchitecture of the Godson-2 Processor   总被引:23,自引:3,他引:23       下载免费PDF全文
The Godson project is the first attempt to design high performance general-purpose microprocessors in China. This paper introduces the microarchitecture of the Godson-2 processor which is a 64-bit, 4-issue, out-of-order execution RISC processor that implements the 64-bit MlPS-like instruction set. The adoption of the aggressive out-of-order execution techniques (such as register mapping, branch prediction, and dynamic scheduling) and cache techniques (such as non-blocking cache, load speculation, dynamic memory disambiguation) helps the Godson-2 processor to achieve high performance even at not so high frequency. The Godson-2 processor has been physically implemented on a 6-metal 0.18μm CMOS technology based on the automatic placing and routing flow with the help of some crafted library cells and macros. The area of the chip is 6,700 micrometers by 6,200 micrometers and the clock cycle at typical corner is 2.3ns.  相似文献   

7.
随着生产工艺的提高,芯片上能集成越来越多的晶体管,多线程技术也逐步成为一种主流的处理器体系结构技术.提出一种融合同时多线程技术和微线程技术的新型体系结构同时多微线程(simultaneous multi-microthreading,SMMT),并给出同时多微线程体系结构的实现方案.SMMT有效结合同时多线程技术硬件代价小和微线程技术能够加速单进程应用的优点,通过软硬件协同的方式充分挖掘单进程程序的微线程级并行性.通过在设计的龙芯2号同时多微线程处理器上进行性能评测,结果表明,同时多微线程体系结构能够有效地加速单进程的程序,以很小的硬件代价显著地提高了处理器的性能.  相似文献   

8.
龙芯2号处理器的同时多线程设计   总被引:1,自引:0,他引:1  
提出了适合龙芯2号处理器的同时多线程处理器模型,并介绍了具体的微体系结构设计以及相应的Linux操作系统的实现方案.通过在设计的龙芯2号同时多线程处理器上启动Linux操作系统,并运行应用程序,例如SPEC CPU2000,进行性能评测.结果表明,龙芯2号同时多线程处理器通过挖掘线程级并行性,将龙芯2号处理器的性能提高了31.1%.  相似文献   

9.
基于共享Cache多核处理器的Hash连接优化   总被引:1,自引:0,他引:1  
邓亚丹  景宁  熊伟 《软件学报》2010,21(6):1220-1232
针对目前主流的多核处理器,研究了基于共享缓存多核处理器环境下的数据库Hash连接优化.首先提出基于Radix-Join算法的Hash连接多线程执行框架,通过实例分析了影响多线程Radix-Join算法性能的因素.在此基础上,优化了Hash连接多线程执行框架中的各种线程及其访问共享Cache的性能,优化了聚集连接时Hash连接算法的内存访问,并分析了多线程聚集划分的加速比.基于开源数据库INGRES和EaseDB,实现了所提出的连接多线程执行框架,在实验中测试了多线程Hash连接框架的性能.实验结果表明,该算法可以有效解决Hash连接执行时共享Cache在多线程条件下的访问冲突和处理器负载均衡问题,极大地提高了Hash连接性能.  相似文献   

10.
The authors present a data-race-free-1, shared-memory model that unifies four earlier models: weak ordering, release consistency (with sequentially consistent special operations), the VAX memory model, and data-race-free-0. Data-race-free-1 unifies the models of weak ordering, release consistency, the VAX, and data-race-free-0 by formalizing the intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance in a manner that retains the advantages of each of the four models. Data-race-free-1 expresses the programmer's interface more explicitly and formally than weak ordering and the VAX, and allows an implementation not allowed by weak ordering, release consistency, or data-race-free-0. The implementation proposal for data-race-free-1 differs from earlier implementations by permitting the execution of all synchronization operations of a processor even while previous data operations of the processor are in progress. To ensure sequential consistency, two sychronizing processors exchange information to delay later operations of the second processor that conflict with an incomplete data operation of the first processor  相似文献   

11.
We describe the design and implementation of a highly optimized, multithreaded algorithm for the propositional satisfiability problem. The algorithm is based on the Davis-Putnam-Logemann-Loveland sequential algorithm, but includes many of the optimization techniques introduced in recent years. We provide experimental results for the execution of the parallel algorithm on a variety of multiprocessor machines with shared memory architecture. In particular, the detrimental effect of parallel execution on the performance of processor cache is studied.  相似文献   

12.
一种分片式多核处理器的用户级模拟器   总被引:1,自引:0,他引:1  
黄琨  马可  曾洪博  张戈  章隆兵 《软件学报》2008,19(4):1069-1080
随着片上晶体管资源的增多和互连线延迟的加大,分片式多核微处理器已成为多核处理器设计的新方向.为了对这种新型处理器进行体系结构的深入研究和设计空间的探索,设计并实现了针对分片式多核处理器的用户级多核性能模拟器.该多核模拟器在龙芯2号单处理器核的基础上,完整地模拟了基于目录的Cache一致性协议和存储转发式片上互联网络的结构模型,详细地刻画了由于系统乱序处理各种请求应答和请求之间的冲突而造成的时序特性,可以通过运行各种串行或并行的工作负载对多核处理器的各种重要性能指标加以评估,为多核处理器的结构设计提供了快速、灵活、高效的研究平台.  相似文献   

13.
This paper explores microarchitecture models for a simultaneous multithreaded (SMT) processor with multimedia enhancements. We start with a wide-issue superscalar processor, enhance it by the SMT technique, by multimedia units, and by an additional on-chip RAM storage. Our workload is a multithreaded MPEG-2 video decompression algorithm that extensively uses multimedia units. The simulations show that a single-threaded, 8-issue maximum processor (assuming an abundance of resources) reaches an instructions per cycle (IPC) count of only 1.60, while an 8-threaded 8-issue processor is able to reach an IPC of 6.07. A more realistic processor model reaches an IPC of 1.27 in the single-threaded 8-issue vs 3.03 in the 4-threaded 4-issue and 3.21 in the 8-threaded 8-issue modes. Our conclusion on next generation’s microprocessors is that a 2- or 4-threaded 4-issue processor with a small on-chip RAM accessed by a local load/store unit will be superior to a wide-issue (single-threaded) superscalar processor at least for MPEG-2 style video decompression algorithms.  相似文献   

14.
Yeager  K.C. 《Micro, IEEE》1996,16(2):28-41
The Mips R10000 is a dynamic, superscalar microprocessor that implements the 64-bit Mips 4 instruction set architecture. It fetches and decodes four instructions per cycle and dynamically issues them to five fully-pipelined, low-latency execution units. Instructions can be fetched and executed speculatively beyond branches. Instructions graduate in order upon completion. Although execution is out of order, the processor still provides sequential memory consistency and precise exception handling. The R10000 is designed for high performance, even in large, real-world applications with poor memory locality. With speculative execution, it calculates memory addresses and initiates cache refills early. Its hierarchical, nonblocking memory system helps hide memory latency with two levels of set-associative, write-back caches  相似文献   

15.
支持多核并行程序确定性重放的高效访存冲突记录方法   总被引:2,自引:0,他引:2  
多核系统中并行程序执行过程的不确定性给程序调试带来了很大的困难.准确记录初始执行中冲突访存的次序是并行程序确定性重放的基础.提出了通过建立精确happens-before关系记录访存冲突的方法.此方法利用简洁高效的地址冲突检测机制确定冲突访存操作在执行中所处happens-before序关系的位置,可以抑制部分记录信息的产生,从而有效减少记录信息.与其他方式方法相比,可以进一步压缩17%的记录条数.采用逻辑向量时钟描述冲突访存操作间的happens-before关系,与采用标量时钟相比,可以避免happens-before关系的误识,降低重放执行时并行度的损失.  相似文献   

16.
龙芯2号处理器设计和性能分析   总被引:16,自引:4,他引:16  
介绍龙芯2号处理器设计及其性能测试结果.龙芯2号采用四发射超标量超流水结构。片内一级指令和数据高速缓存各64KB,片外二级高速缓存最多可达8MB.为了充分发挥流水线的效率,龙芯2号实现了先进的转移猜测、寄存器重命名、动态调度等乱序执行技术以及非阻塞的Cache访问和load Speculation等动态存储访问机制.龙芯2号处理器采用0.18gm的CMOS工艺实现,在正常电压下的最高工作频率为500MHz,500MHz时的实测功耗为3~5W.龙芯2号单精度峰值浮点运算速度为20亿a/秒,双精度浮点运算速度为10亿a/秒,SPECCPU2000的实测性能是龙芯1号的8~10倍,综合性能已经达到PentiumⅢ的水平.目前芯片样机能流畅运行完整的64位中文Linux操作系统,全功能的Mozilla浏览器、多媒体播放器和OpenOffice办公套件,可以满足绝大多数桌面应用的要求.  相似文献   

17.
Simulation results are presented using the hardware-implemented, trace-based dynamic instruction scheduler of our single process DTSVLIW architecture to schedule instructions from several processes into multiple streams of VLIW instructions for execution by a wide-issue, simultaneous multi-threading (SMT) execution engine. The scheduling process involves single instruction execution of each process, dynamically scheduling executed instructions into blocks of VLIW instructions cached for subsequent SMT execution: SMT provides a mechanism to reduce the impact of horizontal and vertical waste, and variable memory latencies, seen in the DTSVLIW. Preliminary experiments explore this extended model. Results achieve PE utilization of up to 87% on a 4-thread, 1-scalar, 8 PE design, with speed-ups of up to 6.3 that of a single processor. Noticeably it only needs a single scalar process to be scheduled at any time, with main memory fetches being 1–4% that of a single processor.  相似文献   

18.
《Computer Networks》2003,41(5):563-586
Network processors (NPs) are an emerging field of programmable processors that are optimized to implement data plane packet processing networking functions. Unlike the general-purpose CPUs that rely heavily on caching for improving performance, the lack of locality in packet processing and need for high-performance I/O have forced designers to come up with innovative architectures that can hide memory latency while still processing packets at high data rates. Most of these NPs use some type of multiprocessing in combination with a hierarchy of memory types to achieve high performance. In addition, to keep up with packets arriving at high data rates over multiple incoming media interfaces, an NP must perform fast I/O and memory operations such as packet storage, table lookup, and extraction of fields in packet headers. We describe an architecture that uses a combination of distributed memory architecture and one or more multithreaded processors to achieve the necessary performance. We describe the challenges in programming such a processor including the issues related to consistency and maintaining packet ordering. We also present a programming model for generic network applications that uses software pipelines. We then demonstrate the use of the programming model in implementing two applications, namely, mapping traffic management algorithms onto a multithreaded architecture and an implementation of a media gateway based on voice-over-AAL2.  相似文献   

19.
SMA:一种新的多线程处理器模型   总被引:1,自引:0,他引:1       下载免费PDF全文
本文提出一种新的多线程处理器模型,它结合了前瞻性执行机制和多线程执行机制,既能从更大的指令窗口中开发出更多的ILP,又能屏蔽各种长延迟操作,达到较高的资 源利用率。本文深入讨论了SMA模型及其特点,并进行了初步的性能分析。  相似文献   

20.
存储模型仿真器的设计与实现   总被引:2,自引:1,他引:1  
存储一致性问题和高速缓存一致性问题是共享存储并行计算机中两个最关键的问题,通过仿真器对它们进行了量化研究,设计并实现了一个存储模型仿真器MMS.基于MMS仿真了不同并行机结构模型下多种存储一致性模型的行为;针对不同类型的计算问题比较了不同的存储一致性模型,并对实验结果进行了分析;实现了几个不同的高速缓存一致性协议,并比较了它们的性能.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号