1.
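In a multicore processor, the cores can access external memory concurrently, which provides a memory-level parallelism (MLP) capability that a uniprocessor lacks. For loops in irregular applications, traditional parallelization methods have difficulty identifying the available parallelism and therefore cannot fully exploit either the MLP or the parallel computing capability of multicore processors. This paper discusses software-based exploitation of MLP on multicore processors and proposes a speculative multithreading algorithm, LLSM (loop-level speculative multithreading), which parallelizes loops in irregular applications. Test results on a multicore processor show that the algorithm effectively exploits both the MLP and the computing capability of multicore processors. The paper also points out that, in a multicore environment, the formula for memory-level parallel computation must account for thread synchronization overhead.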
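The abstract does not describe LLSM's internals, so the following is only a generic sketch of loop-level thread-level speculation, not the paper's algorithm. It assumes iterations are grouped into epochs of `n_threads`, each iteration runs against a snapshot of memory with its writes buffered, and iterations commit in program order; an iteration that read an address written by an earlier iteration in the same epoch is squashed and re-executed. All names (`TrackedView`, `speculative_loop`) are hypothetical, and the "parallel" phase is simulated sequentially for clarity.

```python
from typing import Callable, Dict, Set, Tuple

Memory = Dict[int, int]

class TrackedView:
    """Speculative view of memory: reads fall through to a snapshot, writes are buffered."""
    def __init__(self, snapshot: Memory):
        self.snapshot = snapshot
        self.reads: Set[int] = set()
        self.writes: Memory = {}

    def load(self, addr: int) -> int:
        if addr in self.writes:          # an iteration sees its own buffered writes
            return self.writes[addr]
        self.reads.add(addr)             # record exposed reads for later validation
        return self.snapshot.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        self.writes[addr] = value

def speculative_loop(body: Callable[[int, TrackedView], None], n_iters: int,
                     mem: Memory, n_threads: int = 4) -> Tuple[Memory, int]:
    """Run the loop in epochs of n_threads speculative iterations.
    Returns the final memory and the number of squashed (re-executed) iterations."""
    squashes = 0
    for start in range(0, n_iters, n_threads):
        epoch = range(start, min(start + n_threads, n_iters))
        snapshot = dict(mem)
        # Phase 1: every iteration of the epoch runs against the same snapshot.
        views = {}
        for it in epoch:
            views[it] = TrackedView(snapshot)
            body(it, views[it])
        # Phase 2: commit in program order; squash on a read-after-write violation.
        committed: Set[int] = set()
        for it in epoch:
            v = views[it]
            if v.reads & committed:
                squashes += 1
                v = TrackedView(mem)     # re-execute on the up-to-date committed state
                body(it, v)
            mem.update(v.writes)
            committed |= set(v.writes)
    return mem, squashes

# Usage: an irregular loop a[idx[i]] += i with data-dependent subscripts.
idx = [0, 1, 0, 2, 3, 3, 4, 1]
def body(i: int, m: TrackedView) -> None:
    a = idx[i]
    m.store(a, m.load(a) + i)

mem, squashes = speculative_loop(body, len(idx), {})
print(mem, squashes)  # {0: 2, 1: 8, 2: 3, 3: 9, 4: 6} 2
```

Only the two iterations that actually collide with an earlier one in their epoch (i = 2 and i = 5) are re-executed; the rest commit speculatively, which is where the extra memory-level parallelism would come from on real hardware threads.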
2.
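This paper describes OLSM (Optimized Lock-Step execution Model), an instruction-optimized lock-step execution model that can effectively improve memory-level parallelism (MLP), and builds a memory hierarchy that embodies the ideas of OLSM. OLSM allows Explicitly Parallel Instruction Computing (EPIC) microprocessors to achieve a degree of out-of-order execution, overcoming the drawbacks of the lock-step execution used in traditional Very Long Instruction Word (VLIW) processors. It can make full use of the abundant compute and storage resources in the architecture, hide memory latency as far as possible, and increase MLP.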
3.
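Runahead execution can significantly improve the memory-level parallelism of a computer system without major changes to the processor microarchitecture. However, a runahead processor executes many more instructions than a conventional processor, at most more than three times the number executed in normal mode, which greatly increases power consumption. Our analysis shows that runahead execution executes a large number of useless instructions during the pre-execution phase; on this basis, we propose a method that reduces useless instructions and thereby improves the efficiency of runahead processors. Experiments show that, with little impact on performance, the method eliminates up to 50% of the useless instructions that a runahead processor executes during pre-execution.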
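The abstract does not state how "useless" instructions are identified. As an illustration only, the sketch below assumes a simple criterion: during runahead, values depending on the blocking miss are INV, a load whose address is valid can issue a useful prefetch, and an instruction counts as useful only if it is such a load or lies in the backward dependence slice of one. `Inst` and `runahead_analysis` are hypothetical names, and the model simplifies real runahead by treating every valid-address load's result as valid.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

@dataclass
class Inst:
    dest: Optional[str]          # destination register, or None
    srcs: Tuple[str, ...]        # source registers
    is_load: bool = False        # loads with a valid address can issue prefetches

def runahead_analysis(window: List[Inst], miss_dest: str) -> Tuple[int, int]:
    """Pre-execute `window` after a blocking load miss that writes `miss_dest`.
    Returns (#instructions executed, #useful instructions)."""
    inv: Set[str] = {miss_dest}          # registers whose value depends on the miss
    producer: Dict[str, int] = {}        # register -> index of its latest producer
    deps: List[Set[int]] = []            # direct producers of each instruction's sources
    prefetching_loads: List[int] = []
    for i, ins in enumerate(window):
        deps.append({producer[s] for s in ins.srcs if s in producer})
        src_inv = any(s in inv for s in ins.srcs)
        if ins.is_load and not src_inv:
            prefetching_loads.append(i)  # valid address: a prefetch can be issued
        if ins.dest is not None:
            if src_inv:
                inv.add(ins.dest)        # INV propagates through dependences
            else:
                inv.discard(ins.dest)
            producer[ins.dest] = i
    # Useful = prefetching loads plus everything in their backward slices.
    useful: Set[int] = set()
    stack = list(prefetching_loads)
    while stack:
        i = stack.pop()
        if i not in useful:
            useful.add(i)
            stack.extend(deps[i])
    return len(window), len(useful)

window = [
    Inst("r2", ("r1",)),                 # depends on the miss -> INV
    Inst("r3", ("r2",), is_load=True),   # load with INV address -> no prefetch
    Inst("r4", ("r5",)),                 # computes a valid address
    Inst("r6", ("r4",), is_load=True),   # valid-address load -> prefetch
    Inst("r7", ("r3", "r6")),            # INV result, feeds no load
    Inst("r8", ("r5",)),                 # valid, but feeds no load
]
print(runahead_analysis(window, "r1"))  # (6, 2)
```

Under this criterion only 2 of the 6 pre-executed instructions contribute to prefetching; a filtering mechanism would aim to skip the other 4.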
4.
Research on Parallel Techniques in Embedded RISC Processor Architectures  Total citations: 1, self-citations: 0, other citations: 1
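Based on a study of innovation and development in current mainstream embedded processor architectures, both in China and abroad, this paper gives a fairly comprehensive analysis of how parallel techniques improve system efficiency and reduce power consumption, focusing on departures from RISC design rules, data processing, multithreading, and the organization of multicore processors, and it also studies how these parallel mechanisms are implemented. The study shows that applying parallel techniques in embedded processor architectures is an effective way to meet the high-performance, low-power challenges of current embedded applications.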
5.
6.
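The great diversity of parallel computer architectures often bewilders ordinary computer practitioners. This paper surveys the alternative approaches to parallel processing within a high-level taxonomy.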
7.
Architecture of Parallel Database Systems  Total citations: 1, self-citations: 0, other citations: 1
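I. Introduction  Since the 1990s, a growing number of applications have shown that traditional mainframe systems lack the capability to support high-performance online transaction processing and complex queries. The rapid growth of database sizes, ever-heavier database workloads, and the continual emergence and maturation of new application domains have pushed traditional mainframes to their performance limits. For example, the information database of the U.S. Patent Office held as much as 25 terabytes (1980) [1]; even on today's fastest mainframe, processing 100 megabytes per second, a single full scan of this database would take 100 hours. Designing high-performance database systems that support massive data volumes and meet real-time requirements has become a serious challenge for the database research community.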
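The scan-time arithmetic in the mainframe example can be checked directly. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and, for the parallel case, an idealized system in which the data is evenly partitioned across `nodes` that scan concurrently with perfectly linear speedup; `full_scan_hours` is a hypothetical helper, not something from the paper.

```python
def full_scan_hours(db_bytes: float, bytes_per_sec: float, nodes: int = 1) -> float:
    """Hours to scan db_bytes once when the data is evenly partitioned across
    `nodes`, each scanning at bytes_per_sec in parallel (ideal linear speedup)."""
    return db_bytes / (bytes_per_sec * nodes) / 3600

TB, MB = 10**12, 10**6
print(round(full_scan_hours(25 * TB, 100 * MB), 1))             # 69.4  (one machine)
print(round(full_scan_hours(25 * TB, 100 * MB, nodes=100), 2))  # 0.69  (100 nodes)
```

With decimal units a single machine needs about 69 hours, the same order of magnitude as the figure quoted above. Dividing the data across N independently scanning nodes divides that time by N in the ideal case, which is the basic argument for parallel database architectures.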
8.
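Power estimators are important tools for architecture research. Several mature microarchitecture-level power estimators already exist in China and abroad, but each targets only a single architecture or supports only a few functional units, which hampers comprehensive power analysis. This paper analyzes the strengths and weaknesses of existing estimation tools, designs a configurable microarchitecture-level power estimator, and implements it on Windows. Experimental results show that the estimator is highly configurable and broadly applicable.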
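The abstract does not say how the estimator models power. A common basis for microarchitecture-level estimators (Wattch is a well-known example) is the activity-based dynamic power equation P = αCV²f evaluated per functional unit. The sketch below assumes that model and makes the set of units a configuration input, which is one simple way to obtain the kind of configurability described; the unit names, capacitances, and activity factors are purely illustrative, not measured values.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class UnitConfig:
    capacitance_f: float  # effective switched capacitance in farads (illustrative)
    activity: float       # activity factor alpha in [0, 1]

def dynamic_power_w(units: Dict[str, UnitConfig], vdd: float,
                    freq_hz: float) -> Dict[str, float]:
    """Per-unit dynamic power in watts: P = alpha * C * Vdd^2 * f."""
    return {name: u.activity * u.capacitance_f * vdd ** 2 * freq_hz
            for name, u in units.items()}

# Hypothetical configuration: swap or extend this dict to model a different core.
core = {
    "alu":     UnitConfig(capacitance_f=2e-10, activity=0.5),
    "icache":  UnitConfig(capacitance_f=5e-10, activity=0.8),
    "regfile": UnitConfig(capacitance_f=3e-10, activity=0.6),
}
per_unit = dynamic_power_w(core, vdd=1.2, freq_hz=1e9)
print({k: round(v, 3) for k, v in per_unit.items()}, round(sum(per_unit.values()), 3))
```

Because the architecture is just data, supporting a new core or an additional functional unit means editing the configuration rather than the estimator; activity factors would normally come from a cycle-level simulator.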
9.
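Current fluctuations caused by low-power techniques such as clock gating, together with parasitic impedance in the power-delivery network, produce inductive noise (di/dt) that causes supply-voltage fluctuations. Excessive voltage fluctuations can cause timing faults and compromise correct system operation; such an event is called a voltage emergency. This paper analyzes the relationship between voltage emergencies and program memory-access behavior in simultaneous multithreading (SMT) processors and, taking the memory-level parallelism of programs into account, proposes a thread-scheduling method that reduces the performance impact of voltage emergencies. Experimental results show that, compared with the flush method, the proposed method reduces voltage emergencies by 21.7% on average with two threads and by 25.2% on average with four threads, and it effectively improves the fairness of SMT processors.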
10.
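This paper describes the NetBurst microarchitecture of the Intel Pentium 4, focusing on the details of its internal implementation. It reviews the design history of Intel processors, analyzes the NetBurst microarchitecture, examines the instruction pipeline in terms of the internal functional units, and finally presents a performance evaluation of the Pentium 4.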
11.
Wei-Wu Hu, Fu-Xin Zhang, Zu-Song Li 《Journal of Computer Science and Technology》2005,20(2)
The Godson project is the first attempt to design high-performance general-purpose microprocessors in China. This paper introduces the microarchitecture of the Godson-2 processor, a 64-bit, 4-issue, out-of-order RISC processor that implements a 64-bit MIPS-like instruction set. Aggressive out-of-order execution techniques (such as register mapping, branch prediction, and dynamic scheduling) and cache techniques (such as a non-blocking cache, load speculation, and dynamic memory disambiguation) help Godson-2 achieve high performance even at a modest clock frequency. The Godson-2 processor has been physically implemented in a 6-metal-layer 0.18 μm CMOS technology using an automatic place-and-route flow together with some hand-crafted library cells and macros. The die measures 6,700 μm by 6,200 μm, and the clock cycle at the typical corner is 2.3 ns.
12.
I-Jui Sung, Nasser Anssari, John A. Stratton, Wen-Mei W. Hwu 《International journal of parallel programming》2012,40(1):4-24
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures that use a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we enable automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also show how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to 10.94X speedup over the original layout, and a 1.16X performance gain in the worst case.
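To see why layout affects how concurrent requests are spread over memory components, the sketch below assumes a hypothetical GPU memory system with 8 channels interleaved at 256-byte granularity, and 32 threads that each read element (row, col) of a different grid row at the same column. Row padding is used as the transformation because it is one of the simplest layout changes; it is not necessarily the transformation the paper's tool selects, and all names and parameters here are illustrative.

```python
from typing import Callable, List

def channel_of(byte_addr: int, n_channels: int = 8, interleave: int = 256) -> int:
    """Hypothetical address-to-channel mapping: fixed-size chunks assigned round-robin."""
    return (byte_addr // interleave) % n_channels

def channels_touched(addr_fn: Callable[[int, int], int], rows: List[int], col: int) -> int:
    """Distinct channels hit when one thread per row reads element (row, col)."""
    return len({channel_of(addr_fn(r, col)) for r in rows})

ELEM = 4        # bytes per float
WIDTH = 1024    # grid row length in elements (4096 bytes = 16 chunks per row)

def row_major(i: int, j: int) -> int:
    return (i * WIDTH + j) * ELEM

PAD = 256 // ELEM   # pad each row by one interleave chunk (64 elements)
def padded_row_major(i: int, j: int) -> int:
    return (i * (WIDTH + PAD) + j) * ELEM

warp_rows = list(range(32))
print(channels_touched(row_major, warp_rows, col=5))         # 1
print(channels_touched(padded_row_major, warp_rows, col=5))  # 8
```

In the original layout every row starts a multiple of 8 chunks apart, so all 32 requests queue on one channel; after padding, consecutive rows are 17 chunks apart and the requests cover all 8 channels, raising the memory-level parallelism the hardware can actually use at the cost of slightly more complex address arithmetic.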
13.
N. I. V'yukova, V. A. Galatenko, S. V. Samborskii, S. M. Shumakov 《Programming and Computer Software》2002,28(5):261-279
This paper discusses code-generation problems specific to processor architectures with explicit parallelism and suggests methods for solving them. The main method discussed is a postprocessor that operates at the level of assembler instructions.
14.
15.
As chip multiprocessors with simultaneous multithreaded cores become commonplace, there is a need for simple approaches to exploit thread-level parallelism. In this paper, we consider thread-level speculation as a means to extract thread-level parallelism from application binaries. We first investigate the tradeoffs between scheduling speculative threads on the same core and on different cores. With the former approach, threads contend for the same resources; the latter approach is plagued by the overhead of inter-core communication. Despite the impact of resource contention, our detailed simulations show that the first approach provides the best performance because of its lower inter-thread communication cost. The key contribution of the paper is the proposed design and evaluation of the dual-thread speculation system. This design point has very low complexity and reaps most of the gains of a system.
The work was carried out while Fredrik Warg was a graduate student at Chalmers University of Technology.
16.
Parallelism and evolutionary algorithms  Total citations: 13, self-citations: 0, other citations: 13
This paper presents a modern view of the parallelization techniques used for evolutionary algorithms (EAs). The work is motivated by two fundamental facts: 1) the different families of EAs have naturally converged over the last decade, while parallel EAs (PEAs) still lack unified studies; and 2) there are a large number of improvements in these algorithms and in their parallelization that call for a comprehensive survey. We stress the differences between the EA model and its parallel implementation throughout the paper. We discuss the advantages and drawbacks of PEAs, mention successful applications, and identify open problems. We propose potential solutions to these problems and classify the different ways in which recent results in theory and practice are helping to solve them. Finally, we provide a highly structured background on PEAs to make researchers aware of the benefits of decentralizing and parallelizing an EA.
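One widely used way of decentralizing an EA is the island (distributed) model: the population is split into subpopulations that evolve independently and periodically exchange individuals. The sketch below is a minimal, sequentially simulated island GA on the OneMax toy problem, assuming ring migration in which each island's best individual replaces the worst of the next island; the operators and parameters are illustrative choices, not taken from the survey. In a real PEA each island would run on its own processor.

```python
import random
from typing import List

Genome = List[int]

def onemax(g: Genome) -> int:
    return sum(g)

def evolve_island(pop: List[Genome], rng: random.Random, p_mut: float) -> List[Genome]:
    """One generation: elitism, binary tournament, one-point crossover, bit-flip mutation."""
    def tournament() -> Genome:
        a, b = rng.sample(pop, 2)
        return a if onemax(a) >= onemax(b) else b
    nxt = [max(pop, key=onemax)[:]]            # elitism: best individual survives
    while len(nxt) < len(pop):
        p1, p2 = tournament(), tournament()
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        nxt.append([bit ^ (rng.random() < p_mut) for bit in child])
    return nxt

def island_ga(n_islands: int = 4, pop_size: int = 20, n_bits: int = 32,
              generations: int = 60, migrate_every: int = 10,
              p_mut: float = 0.02, seed: int = 0) -> List[int]:
    """Island-model GA on OneMax; returns the global best fitness after each generation."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
               for _ in range(n_islands)]
    history = []
    for gen in range(1, generations + 1):
        islands = [evolve_island(pop, rng, p_mut) for pop in islands]
        if gen % migrate_every == 0:
            # Ring migration: island k receives a copy of island k-1's best individual.
            migrants = [max(pop, key=onemax)[:] for pop in islands]
            for k, pop in enumerate(islands):
                worst = min(range(len(pop)), key=lambda idx: onemax(pop[idx]))
                pop[worst] = migrants[k - 1]
        history.append(max(onemax(g) for pop in islands for g in pop))
    return history

history = island_ga()
print(history[0], history[-1])
```

Isolation lets each island explore a different region of the search space, while infrequent migration spreads good building blocks; the migration interval and topology are the main knobs such surveys discuss for trading diversity against convergence speed.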