Similar Literature
20 similar documents found.
1.
Multicore architectures have become the mainstream processor design and the dominant processing platform for multimedia applications, and the efficiency of inter-core communication is one of the key factors affecting multicore processor performance. This paper proposes a method for optimizing inter-core communication in multimedia applications. Exploiting the regularity of data reads in such applications, the method adds communication queues to the multicore processor so that read-only data can be passed between cores quickly, improving the parallel execution efficiency of multimedia applications. Experiments show that the communication queues consistently improve the performance of a variety of multimedia kernel algorithms. The method also scales well: as the number of cores increases, the benefit of the communication queues becomes even more pronounced.
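A minimal software analogue may help make the mechanism concrete: a single-producer/single-consumer ring buffer through which one core publishes pointers to read-only data for a neighboring core. The paper describes a hardware queue; the C implementation, queue depth, and names below are illustrative assumptions only.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Single-producer / single-consumer ring buffer modeling the paper's
 * hardware communication queue: one core publishes pointers to
 * read-only data, a neighbor core consumes them. Depth is assumed. */
#define QDEPTH 64   /* power of two */

typedef struct {
    const void *slot[QDEPTH];
    _Atomic size_t head;   /* advanced by the consumer */
    _Atomic size_t tail;   /* advanced by the producer */
} comm_queue_t;

/* Producer core: enqueue a block of read-only data; returns 0 if full. */
int cq_push(comm_queue_t *q, const void *data) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (t - atomic_load_explicit(&q->head, memory_order_acquire) == QDEPTH)
        return 0;                          /* queue full */
    q->slot[t % QDEPTH] = data;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 1;
}

/* Consumer core: dequeue the next block; returns NULL if empty. */
const void *cq_pop(comm_queue_t *q) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&q->tail, memory_order_acquire))
        return NULL;                       /* queue empty */
    const void *data = q->slot[h % QDEPTH];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return data;
}
```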

2.
Sector caches were used in some of the earliest computer systems to adopt cache technology. Although a sector cache performs slightly worse than a conventional cache, at equal capacity the sector organization needs markedly fewer tag bits. Because embedded processors face strict chip-area constraints, this advantage is especially valuable there. This paper uses simulation to analyze in detail the performance of sector caches in embedded workloads. The results show that a well-chosen sector organization reduces the number of tag bits substantially at a small performance cost, so a sector cache can minimize the area of the cache controller while still meeting performance requirements. The paper argues that the sector cache is an effective performance/area trade-off for embedded processor designers.
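A back-of-envelope illustration of the tag saving (our own numbers, not taken from the paper): assume 32-bit addresses and a 16 KB direct-mapped cache with 32 B lines, versus a sector organization grouping K = 4 sub-blocks of 32 B into 128 B sectors.

```latex
% Conventional: 16 KB / 32 B = 512 lines; tag = 32 - 9 (index) - 5 (offset) = 18 bits
512 \times 18 = 9216 \;\text{tag bits}
% Sector, K = 4: 16 KB / 128 B = 128 sectors; tag = 32 - 7 (index) - 7 (offset) = 18 bits
128 \times 18 + 512 \;\text{(one valid bit per sub-block)} = 2816 \;\text{bits}
```

Under these assumptions the sector organization cuts tag-array storage roughly 3.5-fold at equal capacity, which is the kind of area saving the abstract refers to.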

3.
To make better use of caches in embedded systems, and observing that different applications place different real-time capacity demands on the instruction and data caches, a sliding cache organization is proposed. It balances instruction- and data-cache demand and dynamically adjusts the capacity and configuration of the level-1 cache. With the sliding cache structure, not only are the dynamic power and static leakage power of the L1 cache reduced, but the dynamic power of the whole processor drops as well. Simulation results show that the scheme improves overall cache performance while effectively reducing cache power.

4.
The performance of a multicore processor is closely tied to its system software. The operating system is the interface between the processor and applications and plays a critical role in exploiting processor features and improving application performance. The compiler is tightly coupled to the processor architecture: it must emit binary code the processor supports while also generating efficient code that exploits processor features, so its quality directly affects overall system performance. To improve the real-world performance of the Loongson 3A system, a series of optimizations was carried out in the operating system and the compiler, guided by the Loongson 3A microarchitecture. These include a CC-NUMA multicore operating system, an OS-level L2 cache locking mechanism, OS scheduling of shared L2 cache allocation, auto-vectorizing compilation, and compiler support for prefetching. Experimental results show that adding processor-specific support in the system software fully exploits the strengths of the architecture and yields substantial performance gains; the optimization techniques are also instructive for other processors.

5.
With the rapid progress of semiconductor process technology, the single-chip multiprocessor (SCMP) is an effective way to improve processor performance. Building on a survey of current single-chip multiprocessing research, this paper focuses on future research directions for single-chip multiprocessors.

6.
The design and management of the L2 cache in a chip multiprocessor is one of the key factors affecting its performance. Starting from private L2 caches, this paper proposes a cooperative cache design based on a centralized coherence directory, which optimizes processor performance by managing on-chip storage resources effectively. The cooperative cache offers low average memory-access latency, a low cache miss rate, and good scalability. Experimental results show that, compared with a shared L2 cache design, the cooperative cache improves the throughput of a 4-core processor by 13.5% on average, at a hardware overhead of about 8.1%.
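As a sketch of what a centralized coherence directory might track per block (the field names and the 4-core geometry are our assumptions, not the paper's):

```c
#include <stdint.h>

#define NUM_CORES 4

/* Hypothetical entry of a centralized coherence directory: one entry
 * per cached block, tracking which private L2s hold a copy and which
 * core (if any) owns the block dirty. */
typedef struct {
    uint64_t tag;       /* block address tag */
    uint8_t  sharers;   /* bit i (i < NUM_CORES) set => core i holds a copy */
    int8_t   owner;     /* core with the dirty copy, -1 if clean */
} dir_entry_t;

/* On a read miss by core c: the controller forwards data from the dirty
 * owner if there is one, otherwise from a sharer or memory, then adds
 * c to the sharer set. */
static void dir_read_miss(dir_entry_t *e, int c) {
    e->sharers |= (uint8_t)(1u << c);
}

/* On a write by core c: invalidate all other sharers and record c as
 * the exclusive dirty owner. */
static void dir_write(dir_entry_t *e, int c) {
    e->sharers = (uint8_t)(1u << c);
    e->owner   = (int8_t)c;
}
```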

7.
As a key component of embedded processors, the on-chip cache can account for more than 50% of total processor power. A well-designed on-chip data storage unit can effectively lower processor power and improve overall system performance. Scratchpad memory (SPM) occupies little on-chip area, consumes little power, and has deterministic access latency, which has made it a research focus in embedded systems. Based on SPM, this paper presents a design method for a dynamically configurable on-chip data storage unit and proposes SPM operation functions to simplify application development. Experimental results show that the proposed storage unit reduces energy consumption by more than 35% and shortens benchmark run time by 20.3% on average.
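The abstract mentions SPM operation functions but not their interface; the sketch below shows what such helpers could look like. All names, the base address, and the bump allocator are hypothetical assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SPM operation functions in the spirit of the paper's
 * proposal; names, signatures, and the memory map are assumed. */

#define SPM_BASE ((uint8_t *)0x20000000)  /* assumed SPM-mapped region */
#define SPM_SIZE (64 * 1024)              /* assumed SPM capacity */

static size_t spm_next;  /* simple bump allocator over the SPM */

/* Reserve a block inside the scratchpad; returns NULL when full. */
void *spm_alloc(size_t bytes) {
    if (spm_next + bytes > SPM_SIZE) return NULL;
    void *p = SPM_BASE + spm_next;
    spm_next += bytes;
    return p;
}

/* Stage a hot buffer from main memory into the SPM (real hardware
 * would typically use DMA; a plain copy is shown here). */
void spm_copy_in(void *spm_dst, const void *mem_src, size_t bytes) {
    const uint8_t *s = mem_src;
    uint8_t *d = spm_dst;
    while (bytes--) *d++ = *s++;
}
```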

8.
At low supply voltages, rising hard- and soft-error rates can prevent a cache from working correctly. To address this, a cache structure based on hybrid error-correcting codes is proposed. Exploiting the observation that dirty data must be kept correct by the on-chip cache while clean data can be recovered from off chip, the cache is split into two regions, one protected by a multi-bit ECC and the other by a single-bit ECC. A new replacement policy keeps dirty data in the multi-bit-ECC region so that it always receives the stronger protection, ensuring reliable cache operation at low voltage. Experiments on the EEMBC benchmarks show that the design runs correctly at 590 mV and, compared with VS-ECC, the most recent work in this area, stores 23.6% less ECC information while improving performance by 5.9%.
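A minimal sketch of the placement idea, assuming a 4-way set split evenly between a multi-bit-ECC region and a single-bit-ECC region (the split and geometry are our assumptions, not the paper's parameters):

```c
#include <stdbool.h>

#define WAYS        4
#define STRONG_WAYS 2   /* ways [0, STRONG_WAYS) carry multi-bit ECC */

typedef struct { bool valid, dirty; } line_t;

/* Pick a victim way for an incoming fill: dirty fills (writes) must
 * land in the strongly protected region; clean fills prefer the weakly
 * protected region so strong ways stay available for dirty data. */
int choose_victim(line_t set[WAYS], bool fill_is_dirty) {
    int lo = fill_is_dirty ? 0 : STRONG_WAYS;
    int hi = fill_is_dirty ? STRONG_WAYS : WAYS;
    for (int w = lo; w < hi; w++)   /* free way in the preferred region */
        if (!set[w].valid) return w;
    return lo;                      /* otherwise evict from that region */
}
```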

9.
A Low-Power, High-Performance Sliding Cache Scheme
Cache memories account for a major share of total chip power. Observing that different applications place different real-time capacity demands on the instruction and data caches, a sliding cache organization is proposed. It balances instruction- and data-cache demand, dynamically adjusts the capacity and configuration of the level-1 cache, and eliminates the power consumed by idle portions of the cache. SPEC95 simulation results show that the sliding cache lowers both the dynamic power and the static leakage power of the L1 cache, reduces the dynamic power of the whole processor, and improves performance. Compared with two conventional cache organizations and the DRI cache, the sliding cache reduces average L1 dynamic power by 21.3%, 19.52%, and 20.62%, respectively; reduces average processor dynamic power by 8.84%, 8.23%, and 10.31%; and improves the average energy-delay product by 12.25%, 7.02%, and 13.39%.
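One way to picture the sliding mechanism (our own illustrative model, not the paper's hardware): a pool of L1 banks with a movable boundary between the instruction and data partitions, periodically slid toward whichever side is missing more. Bank count, thresholds, and the counters below are assumptions.

```c
#define BANKS 8

static int i_banks = 4;            /* banks currently serving the I-cache */

/* Called periodically with miss counts from the last interval. */
void slide_boundary(unsigned i_misses, unsigned d_misses) {
    const unsigned HYSTERESIS = 1024;  /* avoid thrashing the boundary */
    if (i_misses > d_misses + HYSTERESIS && i_banks < BANKS - 1)
        i_banks++;                 /* grow I-cache, shrink D-cache */
    else if (d_misses > i_misses + HYSTERESIS && i_banks > 1)
        i_banks--;                 /* grow D-cache, shrink I-cache */
    /* Banks outside each partition can be put in a low-leakage state,
     * which is where the static power saving comes from. */
}
```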

10.
The "memory wall" is one of the obstacles that high-performance processor design must overcome, and an efficient, intelligent cache system is a key element of the processor memory hierarchy. In a processor with branch prediction, data fetched by memory instructions executed speculatively along a predicted path can pollute the cache and significantly degrade cache and processor performance. This paper analyzes the impact of speculative execution and cache pollution on processor performance and, building on the characteristics of the branch prediction mechanism, proposes Contra, a cache pollution control technique based on branch-path tracking. Contra builds a branch-path tracking table to track data written into the cache along speculative paths and controls how that data is stored, accessed, and replaced, effectively shielding the cache from polluted data and improving the performance of the processor memory system. Simulation results show that, relative to the baseline, Contra improves the L1 D-Cache hit rate by 0.03% to 6.69% (1.80% on average) and IPC by 0.01% to 6.60% (2.56% on average).
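A sketch of the tracking idea, under our own assumptions about table size and interfaces (the paper defines the actual branch-path tracking table): fills issued under an unresolved branch are tagged with that branch's ID, and on a misprediction the tagged lines are invalidated before they displace useful data.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRACK_ENTRIES 64   /* assumed table size */

typedef struct {
    bool     valid;
    uint16_t branch_id;   /* unresolved branch this fill depends on */
    uint32_t set, way;    /* cache location of the speculative fill */
} track_entry_t;

static track_entry_t track[TRACK_ENTRIES];

/* Record a fill performed on a speculative path. */
void track_spec_fill(uint16_t branch_id, uint32_t set, uint32_t way) {
    for (int i = 0; i < TRACK_ENTRIES; i++)
        if (!track[i].valid) {
            track[i] = (track_entry_t){true, branch_id, set, way};
            return;
        }
}

/* On branch resolution: keep fills from correct paths, squash the rest. */
void resolve_branch(uint16_t branch_id, bool mispredicted,
                    void (*invalidate)(uint32_t set, uint32_t way)) {
    for (int i = 0; i < TRACK_ENTRIES; i++)
        if (track[i].valid && track[i].branch_id == branch_id) {
            if (mispredicted)
                invalidate(track[i].set, track[i].way);
            track[i].valid = false;
        }
}
```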

11.
As technology improves and transistor feature sizes continue to shrink, the effects of on-chip interconnect wire latencies on processor clock speeds will become more important. In addition, as we reach the limits of instruction-level parallelism that can be extracted from application programs, there will be an increased emphasis on thread-level parallelism. To continue to improve performance, computer architects will need to focus on architectures that can efficiently support thread-level parallelism while minimizing the length of on-chip interconnect wires. The SCMP (Single-Chip Message-Passing) parallel computer system is one such architecture. The SCMP system includes up to 64 processors on a single chip, connected in a 2-D mesh with nearest neighbor connections. Memory is included on-chip with the processors and the architecture includes hardware support for communication and the execution of parallel threads. Since there are no global signals or shared resources between the processors, the length of the interconnect wires will be determined by the size of the individual processors, not the size of the entire chip. Avoiding long interconnect wires will allow the use of very high clock frequencies, which, when coupled with the use of multiple processors, will offer tremendous computational power.

12.
This paper addresses cache organization in chip multiprocessors (CMPs). We show that in CMP systems it is valuable to distinguish between shared data, which is accessed by multiple cores, and private data accessed by a single core. We introduce Nahalal, an architecture whose novel floorplan topology partitions cached data according to its usage (shared versus private data), and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor. Nahalal exhibits significant improvements in cache access latency compared to a traditional cache design.

13.
Multicore architectures have been ruling the recent microprocessor design trend. This is due to different reasons: better performance, thread-level parallelism bounds in modern applications, ILP diminishing returns, better thermal/power scaling (many small cores dissipate less than a large and complex one), and the ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture that we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1, shared/private L2, and a shared bus interconnect. We consider a benchmark set composed of several parallel shared memory applications. We explore the design space related to the number of cores, L2 cache size, and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption, and temperature. Design trade-offs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, showing the importance of considering thermal effects at the architectural level to achieve the best design choice.

14.
A single-chip multiprocessor for multimedia: the MVP
The multimedia video processor (MVP) architecture, which incorporates a variety of parallel processing techniques to deliver very high performance to a wide range of imaging and graphics applications, is described. The MVP combines, on a single semiconductor chip, multiple fully programmable processors with multiple data streams connected to shared RAMs through a crossbar network. Each of the independent processors can execute many operations in parallel every cycle. The architecture is scalable and supports different numbers of processors to meet the cost and performance requirements of different markets. MVP's target environment and the development of MVP are outlined.

15.
Nayfeh, B.A.; Olukotun, K. Computer, 1997, 30(9): 79-85
Presents the case for billion-transistor processor architectures that will consist of chip multiprocessors (CMPs): multiple (four to 16) simple, fast processors on one chip. In their proposal, each processor is tightly coupled to a small, fast, level-one cache, and all processors share a larger level-two cache. The processors may collaborate on a parallel job or run independent tasks (as in the SMT proposal). The CMP architecture lends itself to simpler design, faster validation, cleaner functional partitioning, and higher theoretical peak performance. However, for this architecture to realize its performance potential, either programmers or compilers will have to make code explicitly parallel. Old ISAs will be incompatible with this architecture (although they could run slowly on one of the small processors).

16.
With the trends of microprocessor design towards multicore, cache performance becomes more important because an off-chip access would be increasingly expensive due to the competition across the processor cores. A question arises: How to design the cache architecture to prevent a performance bottleneck caused by data accesses? This work studies a reconfigurable cache architecture that can be dynamically configured for meeting the individual demand of running applications. Using a self-developed cache simulator, we first examined how different cache organization and configuration influence the parallel execution of OpenMP applications. The experimental results show that applications benefit from a flexible cache with reconfigurability. This motivated us to go a step further and develop a hardware prototype of this novel architecture.
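A minimal sketch of how a reconfigurable cache model can derive its geometry from a run-time configuration, in the spirit of the self-developed simulator mentioned above; the structure and names are our assumptions:

```c
#include <stdint.h>

typedef struct {
    uint32_t size_bytes;   /* total capacity, e.g. 32768 */
    uint32_t line_bytes;   /* line size, power of two */
    uint32_t ways;         /* associativity */
} cache_cfg_t;

typedef struct {
    cache_cfg_t cfg;
    uint32_t    sets;          /* derived: size / (line * ways) */
    uint32_t    offset_bits;   /* derived: log2(line_bytes) */
} cache_t;

static uint32_t log2u(uint32_t x) { uint32_t n = 0; while (x >>= 1) n++; return n; }

/* (Re)configure the cache; a real model would also flush its contents. */
void cache_configure(cache_t *c, cache_cfg_t cfg) {
    c->cfg = cfg;
    c->sets = cfg.size_bytes / (cfg.line_bytes * cfg.ways);
    c->offset_bits = log2u(cfg.line_bytes);
}

/* Map an address to its set index under the current configuration. */
uint32_t cache_set_of(const cache_t *c, uint64_t addr) {
    return (uint32_t)((addr >> c->offset_bits) % c->sets);
}
```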

17.
The realization of modern processors is based on a multicore architecture with increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, last level cache is shared among several or all cores (e.g., L3 cache) and each core possesses private low level caches (e.g., L1 and L2 caches). Superlinear speedup is possible for matrix multiplication algorithm executed in a shared memory multiprocessor due to the existence of a superlinear region. It is a region where cache requirements for matrix storage of the sequential execution incur more cache misses than in parallel execution. This paper shows theoretically and experimentally that there is a region, where the superlinear speedup can be achieved. We provide a theoretical proof of existence of a superlinear speedup and determine boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have impact on future software development and exploitation of parallel hardware on the basis of a shared memory multiprocessor architecture.
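As a worked illustration of where such a region can lie (our own back-of-envelope condition, not the paper's exact model): with n x n matrices of e-byte elements, p cores, and a per-core cache of capacity C, superlinear speedup becomes possible roughly when the sequential working set overflows one cache while each core's share fits:

```latex
% Assumed working sets: sequential ~ 3 n^2 e (matrices A, B, C);
% parallel, per core ~ 3 n^2 e / p.
3 n^2 e > C \quad\text{and}\quad \frac{3 n^2 e}{p} \le C
% The sequential run thrashes its cache while each core's partition
% fits, so the measured speedup can exceed p.
```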

18.
Many-core processor design faces severe chip-area constraints, and how to devote the limited area to computation is a hot topic in many-core architecture research. Focusing on the instruction cache design of many-core processors, this paper studies sharing a level-1 instruction cache among multiple cores to improve the performance of the instruction subsystem and the processor pipeline. It presents the design of the shared instruction cache, evaluates it with cycle-accurate performance simulation, and obtains area and timing figures by synthesizing RTL code. Test results show that the shared instruction cache reduces the cache miss rate by 11% to 27% and improves pipeline performance by 4% to 7%.

19.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
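A sketch of how a sharing degree could map blocks to banks (our own illustrative mapping, not the paper's exact policy): with sharing degree s, the 16 cores form clusters of s cores, each cluster owning 4s of the 64 banks, and a block is homed in a bank of the requesting core's cluster.

```c
#include <stdint.h>

#define CORES 16
#define BANKS 64

/* Illustrative NUCA bank mapping; the hashing and cluster layout are
 * assumptions. Sharing degree s groups cores into clusters of s cores
 * that share BANKS / (CORES / s) = 4*s banks. */
int home_bank(uint64_t block_addr, int core, int sharing_degree) {
    int banks_per_cluster = BANKS / (CORES / sharing_degree);
    int cluster = core / sharing_degree;
    int offset  = (int)(block_addr % (uint64_t)banks_per_cluster);
    return cluster * banks_per_cluster + offset;
}
```

With s = 1 each core keeps a private 4-bank slice (low hit latency); with s = 16 all cores share all 64 banks (fewest misses); s = 2 or 4 are the middle points the paper finds best.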

20.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.
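The sketch below renders the two mechanisms as hypothetical compiler-visible hooks; the names, signatures, and stub bodies are our assumptions, since the paper describes architectural support rather than a software API.

```c
#include <stddef.h>

/* Prepushing: the producer forwards freshly written data toward the
 * consumer core's cache before it is demanded, avoiding demand-based
 * coherence misses on the consumer side. Stub body so it compiles. */
void prepush(const void *addr, size_t bytes, int dest_core) {
    (void)addr; (void)bytes; (void)dest_core;
}

/* Software Controlled Eviction: demote shared data from the producer's
 * private cache into the shared cache level, so the consumer finds it
 * closer than a remote private cache or main memory. */
void sce_evict_to_shared(const void *addr, size_t bytes) {
    (void)addr; (void)bytes;
}

/* Producer-side usage after filling a buffer destined for core 1. */
void produce(int *buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] = i * i;                           /* write shared data */
    prepush(buf, (size_t)n * sizeof buf[0], /*dest_core=*/1);
}
```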
