Similar Documents
Found 20 similar documents.
1.
Most existing analytical models for memory interference generally assume random bank selection for each memory access. In vector computers, however, memory accesses typically follow regular patterns, with a number of data items accessed concurrently from different banks. Very little is known about the queueing behavior of memory interference in multiple-stream vector accesses. This paper presents an analytical model for memory interference due to vector accesses in multiple vector processor systems. The model captures the effects on system performance of both bank conflicts among elements within one vector access stream and conflicts among multiple vector access streams. The model is based on a closed queueing network assuming an ideal interconnection network. An approximation technique is proposed to solve the memory queueing system, which serves customers in a complicated (non-FIFO) way. We also carry out extensive simulation experiments to study memory interference and validate our analytical model. Simulation results and analytical results are in very good agreement, indicating that the model is very accurate. We further validate our analysis by comparing the numerical results obtained from our analytical model with measurement results published by other researchers. Based on our analytical model and simulations, we evaluate the performance of multiple vector processor systems. Our numerical results show that memory access conflicts pose a severe limitation on the number of useful processors in the system, implying that memory system design is essential to high-performance computing.
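The queueing behavior the model addresses can be illustrated with a small experiment. The following C program is a minimal sketch, not the paper's analytical model: it issues several strided vector streams against low-order-interleaved banks with a fixed bank busy time and counts how often a request stalls on a busy bank. All parameters (bank count, busy time, strides) are illustrative assumptions.

```c
/* Minimal sketch (not the paper's analytical model): simulate several
 * strided vector access streams hitting interleaved memory banks and
 * measure how often a request stalls on a busy bank.
 * All parameters below are illustrative assumptions. */
#include <stdio.h>

#define NBANKS   16   /* number of memory banks (assumed)          */
#define BUSY      4   /* bank cycle (busy) time in cycles (assumed) */
#define NSTREAMS  4   /* concurrent vector access streams (assumed) */
#define NELEMS  1024  /* elements fetched per stream                */

int main(void) {
    long bank_free[NBANKS] = {0};            /* cycle when each bank is free */
    long addr[NSTREAMS], stride[NSTREAMS] = {1, 2, 3, 5};
    long issued[NSTREAMS] = {0}, stalls = 0, cycle = 0;

    for (int s = 0; s < NSTREAMS; s++) addr[s] = s * 64;

    for (;;) {
        int done = 1;
        for (int s = 0; s < NSTREAMS; s++) {
            if (issued[s] >= NELEMS) continue;
            done = 0;
            int b = (int)(addr[s] % NBANKS); /* low-order interleaving */
            if (bank_free[b] <= cycle) {     /* bank available: issue  */
                bank_free[b] = cycle + BUSY;
                addr[s] += stride[s];
                issued[s]++;
            } else {
                stalls++;                    /* bank conflict: retry   */
            }
        }
        cycle++;
        if (done) break;
    }
    printf("cycles=%ld stalls=%ld stalls/access=%.3f\n",
           cycle, stalls, (double)stalls / (NSTREAMS * NELEMS));
    return 0;
}
```

Raising BUSY or adding streams quickly drives up the stall rate, which is the conflict effect the closed queueing network captures analytically.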

2.
1. Introduction. Sequential consistency [1] is the popularly accepted criterion of correct execution in shared-memory systems. It defines a correct execution as one whose result is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by the program. Typical implementation of sequential consistency requires each access to be delayed until the previous access in the same process completes [2]. This is detrimental to system performance because it dis…

3.
The growing gap between microprocessor speed and DRAM speed is a major problem that computer designers are facing. In order to narrow the gap, it is necessary to improve DRAM's speed and throughput. To achieve this goal, this paper proposes techniques that exploit the characteristics of the 3-stage access of contemporary DRAM chips by grouping accesses to the same row together and interleaving the execution of memory accesses from different banks. A family of Bubble Filling Scheduling (BFS) algorithms is proposed to minimize memory access schedule length and improve memory access time for embedded systems. When the memory access trace is known, as in some application-specific embedded systems, this information can be fully utilized to generate efficient memory access schedules. The offline BFS algorithm generates schedules that are on average 47.49% shorter than in-order scheduling and 8.51% shorter than existing burst scheduling. When memory accesses arrive at the single memory controller in real time, they have to be scheduled as they come. The online BFS algorithm serves this purpose and generates schedules that are on average 58.47% shorter than in-order scheduling and 4.73% shorter than burst scheduling. To improve memory throughput and further shorten the memory access schedule, an architecture with dual memory controllers is proposed. According to the experimental results, the dual-controller algorithm generates schedules that are on average 62.89% shorter than in-order scheduling, 14.23% shorter than burst scheduling, and 10.07% shorter than the single-controller BFS algorithms.
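The core idea, grouping accesses to the same DRAM row and interleaving banks so one bank's row activation overlaps another's data transfer, can be sketched as follows. This is only the underlying intuition, not the published BFS algorithm; the address-to-bank/row split is an assumption.

```c
/* Simplified illustration of row-grouping plus bank-interleaved
 * scheduling: reorder a known access trace so accesses to the same
 * (bank, row) are adjacent, then emit them round-robin across banks so
 * one bank's row activation overlaps another bank's data transfer.
 * NOT the published BFS algorithm; the address split is an assumption. */
#include <stdio.h>
#include <stdlib.h>

#define NBANKS 4
#define ROW(a)  ((a) >> 12)          /* assumed: 4 KB rows */
#define BANK(a) (((a) >> 10) & 3)    /* assumed: 4 banks   */

static int cmp(const void *x, const void *y) {
    unsigned a = *(const unsigned *)x, b = *(const unsigned *)y;
    int ba = (int)BANK(a), bb = (int)BANK(b);
    if (ba != bb) return ba - bb;
    return (ROW(a) > ROW(b)) - (ROW(a) < ROW(b));
}

int main(void) {
    unsigned trace[] = {0x01400, 0x00400, 0x01404, 0x02000, 0x00404, 0x03400};
    int n = sizeof trace / sizeof trace[0];

    qsort(trace, n, sizeof trace[0], cmp);   /* group by (bank, row) */

    /* per-bank index ranges after the sort */
    int lo[NBANKS], hi[NBANKS], next[NBANKS];
    for (int b = 0; b < NBANKS; b++) { lo[b] = -1; hi[b] = -1; }
    for (int i = 0; i < n; i++) {
        int b = (int)BANK(trace[i]);
        if (lo[b] < 0) lo[b] = i;
        hi[b] = i + 1;
    }
    for (int b = 0; b < NBANKS; b++) next[b] = lo[b];

    /* emit round-robin over banks so consecutive commands hit
     * different banks and their latencies can overlap */
    for (int emitted = 0; emitted < n; ) {
        for (int b = 0; b < NBANKS; b++) {
            if (next[b] >= 0 && next[b] < hi[b]) {
                printf("bank %d row %u addr 0x%05x\n",
                       b, ROW(trace[next[b]]), trace[next[b]]);
                next[b]++; emitted++;
            }
        }
    }
    return 0;
}
```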

4.
Billion-transistor processors will be much as they are today, just bigger, faster and wider (issuing more instructions at once). The authors describe the key problems (instruction supply, data memory supply and an implementable execution core) that prevent current superscalar computers from scaling up to 16 or 32 instructions per issue. They propose using out-of-order fetching, multi-hybrid branch predictors and trace caches to improve the instruction supply. They predict that replicated first-level caches, huge on-chip caches and data value speculation will enhance the data supply. To provide a high-speed, implementable execution core capable of sustaining the necessary instruction throughput, they advocate a large, out-of-order-issue instruction window (2,000 instructions), clustered (separated) banks of functional units and hierarchical scheduling of ready instructions. They contend that the current uniprocessor model can provide sufficient performance and use a billion transistors effectively without changing the programming model or discarding software compatibility.

5.
Traditional implementations of sequential consistency in shared-memory systems require memory accesses to be globally performed in program order. Based on an event-ordering model for correct executions in shared-memory systems, this paper proposes and proves that out-of-order execution does not affect the correctness of an execution provided a certain condition is met. Simulation results show that the out-of-order execution proposed in this paper is an effective way to improve the performance of a sequentially consistent shared-memory system.

6.
As the speed gap between processors and memory keeps widening, memory-access instructions, especially those that frequently miss the cache, have become a major performance bottleneck. Because a compiler cannot know how many cycles a memory instruction will actually take at run time, it typically assumes either the cache-hit or the cache-miss latency for such instructions, which is inaccurate. We introduce cache profiling to collect run-time hit/miss information for memory-access instructions and use it to estimate their latencies. On an out-of-order machine, the hardware scheduler dynamically schedules instructions within the issue window well, but the compiler has the advantage when scheduling over longer ranges. Once a cache miss occurs, the reorder buffer tends to fill up and stall the pipeline. Scheduling miss-prone instructions so that they execute in parallel hides their long miss latencies and improves performance. We therefore target load instructions: we adjust the assumed latency of frequently missing loads and modify the scheduling policy to increase memory-level parallelism. Experiments show that our scheduling improves bzip2 by up to 4.8% and art by 4%, with an overall average improvement of 1.5%.
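A hedged sketch of how such profile information might feed a scheduler: loads whose profiled miss rate exceeds a threshold are assigned the miss latency, so a list scheduler will separate them from their consumers and overlap the misses. The threshold and latency values below are illustrative assumptions, not the paper's numbers.

```c
/* Sketch: derive per-load scheduling latencies from cache-profile counts.
 * Loads that miss frequently get the miss latency so a list scheduler
 * will hoist them away from their consumers and overlap the misses.
 * Threshold and latency values are illustrative assumptions. */
#include <stdio.h>

#define HIT_LAT    2     /* assumed L1 hit latency (cycles)  */
#define MISS_LAT  60     /* assumed miss latency (cycles)    */
#define THRESHOLD 0.20   /* assumed "frequently misses" rate */

struct load_profile {
    const char *name;    /* load instruction id                 */
    long hits, misses;   /* counts collected by cache profiling */
};

static int sched_latency(const struct load_profile *p) {
    double total = (double)(p->hits + p->misses);
    double miss_rate = total > 0 ? p->misses / total : 0.0;
    return miss_rate > THRESHOLD ? MISS_LAT : HIT_LAT;
}

int main(void) {
    struct load_profile loads[] = {
        {"load_a", 9500, 500},    /* mostly hits -> short latency */
        {"load_b", 4000, 6000},   /* hot miss    -> long latency  */
    };
    for (int i = 0; i < 2; i++)
        printf("%s: latency %d cycles\n",
               loads[i].name, sched_latency(&loads[i]));
    return 0;
}
```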

7.
Journal of Systems Architecture, 1999, 45(12-13): 1001-1022
Commodity microprocessors contain more on-chip memory with each successive generation, and will contain tens of megabytes within the decade. We describe a novel architecture that runs an unmodified uniprocessor program across multiple nodes, each of which contains a processor tightly integrated with a sizable memory. The execution of instructions is replicated, while the access of operands is distributed across the nodes. Each node accesses operands in its fast local memory and broadcasts them to the other nodes. This architecture exploits out-of-order execution and the fact that each chip integrates processor and memory to run memory-intensive, hard-to-parallelize programs more efficiently. In this paper, we describe an implementation with specific solutions to the unique problems that this architecture poses. Finally, we conclude by comparing simulation results of our implementation to more traditional equivalent systems. In our simulated implementation, five unmodified SPEC95 binaries ran, in most cases, considerably faster than in systems with more traditional memory systems.

8.
Intel's XScale, which has powered many multimedia applications, uses a scoreboard to control instruction execution. The scoreboard stalls the pipeline whenever a source operand or functional unit is needed but not available. While waiting for the resources to become available, the processor accesses the scoreboard every cycle. Such accesses consume energy without contributing to performance. We address this inefficiency by investigating stall behaviour and introduce an adaptive technique that avoids regular accesses to the scoreboard during stall periods. Our study shows that by using our technique, and for the representative subset of the MiBench benchmark suite studied here, scoreboard energy consumption can be reduced by up to 33% while keeping the performance cost within 0.25%.
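One way to picture the adaptive technique: instead of probing the scoreboard every stall cycle, predict the stall length from recent history and skip scoreboard reads until the predicted wake-up cycle. The sketch below uses a simple last-value predictor, which is an assumption; the paper's actual mechanism is not reproduced here.

```c
/* Sketch of the idea: during a stall, skip scoreboard reads until a
 * predicted wake-up cycle instead of polling every cycle. The simple
 * last-value predictor below is an assumption, not the paper's design. */
#include <stdio.h>

static int predicted_stall = 4;        /* learned stall length (cycles) */

/* returns the number of scoreboard reads spent waiting out one stall */
static int wait_for_operand(int actual_stall) {
    int reads = 0, waited = 0;
    while (waited < actual_stall) {
        /* sleep (no scoreboard access) until the predicted wake-up */
        int skip = predicted_stall - waited;
        waited += skip > 0 ? skip : 1;
        reads++;                       /* one scoreboard probe */
    }
    predicted_stall = actual_stall;    /* adapt to the observed length */
    return reads;
}

int main(void) {
    int stalls[] = {6, 6, 6, 12, 12};
    for (int i = 0; i < 5; i++)
        printf("stall %2d cycles -> %d scoreboard reads (vs %d when polling)\n",
               stalls[i], wait_for_operand(stalls[i]), stalls[i]);
    return 0;
}
```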

9.
Memory accesses introduce large time overheads and power consumption because of the performance gap between processors and main memory. This paper describes and evaluates a technique, loop scheduling with memory access reduction (LSMAR), that replaces hidden redundant load operations with register operations in loop kernels and performs partial scheduling for the newly generated register operations subject to register constraints. By exploiting the data dependences of memory access operations, the LSMAR technique can effectively reduce the number of memory accesses of loop kernels, thereby improving timing performance. The technique has been implemented in the Trimaran compiler and evaluated using a set of benchmarks from DSPstone and MiBench on the cycle-accurate simulator of the Trimaran infrastructure. The experimental results show that the LSMAR technique reduces the number of memory accesses by 18.47% on average over the benchmarks. The measurements also indicate that the optimizations lead to only an average 1.41% increase in code size. With such small code size expansion, the technique is well suited to embedded systems compared with prior work. Copyright © 2013 John Wiley & Sons, Ltd.
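The transformation can be illustrated on a toy loop kernel: a value loaded in one iteration is re-loaded in the next, and keeping it in a register removes one memory access per iteration. This hand-written before/after pair sketches the idea and is not actual Trimaran output.

```c
/* Illustration of hidden-redundant-load removal in a loop kernel.
 * The a[i] read in iteration i+1 is the same value as the a[i+1] read
 * in iteration i; carrying it in a register removes one load per
 * iteration. Hand-written sketch of the idea, not Trimaran output. */

/* before: two loads of a[] per iteration */
void kernel_before(const int *a, int *b, int n) {
    for (int i = 0; i < n - 1; i++)
        b[i] = a[i] + a[i + 1];     /* a[i] was loaded as a[i+1] last time */
}

/* after: one load per iteration, value carried in a register */
void kernel_after(const int *a, int *b, int n) {
    int prev = a[0];                /* register-resident copy     */
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];        /* the only load in the loop  */
        b[i] = prev + next;
        prev = next;                /* register move, not a load  */
    }
}
```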

10.
Caches are essential to bridge the gap between the high-latency main memory and the fast processor pipeline. Standard processor architectures implement two first-level caches to avoid a structural hazard in the pipeline: an instruction cache and a data cache. For tight worst-case execution times it is important to classify memory accesses as either cache hits or cache misses. The addresses of instruction fetches are known statically, so static cache hit/miss classification is possible for the instruction cache. Accesses to data cached in the data cache are harder to predict statically. Several different data areas, such as the stack, global data, and heap-allocated data, share the same cache. Some addresses are known statically; other addresses are only known at runtime. With a standard cache organization, all those different data areas must be considered by worst-case execution time analysis. In this paper we propose to split the data cache for the different data areas. Data cache analysis can then be performed individually for each area. An access to an unknown address in the heap does not destroy the abstract cache state for other data areas. Furthermore, we propose to use a small, highly associative cache for the heap area. We designed and implemented a static analysis for this cache, and integrated it into a worst-case execution time analysis tool.

11.
Memory access speed has become a bottleneck in modern processor systems, and scheduling memory accesses can effectively improve memory bandwidth utilization. Building on a mechanism known as burst scheduling, this work selects, by means of a priority expression, the most suitable burst from the banks to access memory, and adaptively adjusts the coefficients of the priority expression so that the method performs well across different applications. Running the SPEC CPU2000 benchmarks and the stream program, the adaptive scheduling mechanism improves bus utilization by 52% over in-order access and by 4.8% over burst scheduling.
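A rough sketch of a priority expression of the kind described above: each pending burst is scored from features such as row-hit status and age, and the coefficients are nudged according to observed bus utilization. The features, weights, and update rule are all assumptions, not the paper's formulation.

```c
/* Sketch: score pending bursts with a weighted priority expression and
 * adapt the weights from utilization feedback. Features, weights, and
 * the update rule are illustrative assumptions. */
#include <stdio.h>

struct burst { int row_hit; int age; int size; };

static double w_hit = 2.0, w_age = 0.5, w_size = 0.1;  /* adaptive coeffs */

static double priority(const struct burst *b) {
    return w_hit * b->row_hit + w_age * b->age + w_size * b->size;
}

/* crude feedback: if utilization dropped after favoring row hits,
 * shift weight toward age (fairness), and vice versa */
static void adapt(double util_now, double util_before) {
    double step = 0.05;
    if (util_now < util_before) { w_hit -= step; w_age += step; }
    else                        { w_hit += step; w_age -= step; }
}

int main(void) {
    struct burst pending[] = {{1, 2, 4}, {0, 9, 8}, {1, 5, 2}};
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (priority(&pending[i]) > priority(&pending[best])) best = i;
    printf("selected burst %d (priority %.2f)\n",
           best, priority(&pending[best]));
    adapt(0.55, 0.52);   /* pretend utilization improved */
    printf("adapted weights: w_hit=%.2f w_age=%.2f\n", w_hit, w_age);
    return 0;
}
```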

12.
The performance and energy efficiency of current systems is influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction and on which cache or main memory bank responded to it. In this context, the locality of memory accesses plays a key role in the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping). In this paper, we present LAPT, a hardware-based mechanism that stores the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC benchmark suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively (6.7% and 5.3% on average).
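The data-mapping decision that such access-pattern information enables can be sketched in a few lines: given per-page, per-node access counts, place each page on the node that touches it most. The flat counter table below is a simplification; LAPT itself keeps this information in the page table with hardware support.

```c
/* Sketch of the data-mapping decision enabled by per-page access
 * pattern information: place each page on the NUMA node that accesses
 * it most. The flat counter table and sample data are simplifications;
 * LAPT keeps this information in the page table via hardware support. */
#include <stdio.h>

#define NPAGES 4
#define NNODES 2

int main(void) {
    /* accesses[p][n] = accesses to page p from node n (sample data) */
    long accesses[NPAGES][NNODES] = {
        {900,  10},   /* page 0: private to node 0 */
        { 20, 850},   /* page 1: private to node 1 */
        {400, 420},   /* page 2: shared            */
        {  5, 700},   /* page 3: private to node 1 */
    };
    for (int p = 0; p < NPAGES; p++) {
        int best = 0;
        for (int n = 1; n < NNODES; n++)
            if (accesses[p][n] > accesses[p][best]) best = n;
        printf("page %d -> node %d\n", p, best);
    }
    return 0;
}
```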

13.
A discussion is presented of the use of dynamic storage schemes to improve parallel memory performance during three important classes of data accesses: vector accesses in which multiple strides are used to access a single vector, block accesses, and constant-geometry FFT accesses. The schemes investigated are based on linear address transformations, also known as XOR schemes. It has been shown that this class of schemes can be implemented more efficiently in hardware and has more flexibility than schemes based on row rotations or other techniques. Several analytical results are shown. These include quantitative analysis of buffering effects in pipelined memory systems; design rules for storage schemes that provide conflict-free access using multiple strides, blocks, and FFT access patterns; and an analysis of the effects of memory bank cycle time on storage scheme capabilities.
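An XOR scheme maps an address to a bank through a linear transformation over GF(2) instead of plain low-order interleaving. The sketch below contrasts the two mappings for a power-of-two stride; the particular bit folding is one common choice, not necessarily the scheme analyzed in the paper.

```c
/* Sketch: XOR-based bank mapping vs. low-order interleaving for a
 * power-of-two stride. XORing a higher address field into the bank
 * index (a linear GF(2) transform) spreads a stride-8 stream over all
 * banks, where simple modulo interleaving would hit a single bank.
 * This particular folding is one common choice of XOR scheme. */
#include <stdio.h>

#define NBANKS 8   /* power of two */

static unsigned bank_mod(unsigned addr) { return addr % NBANKS; }

static unsigned bank_xor(unsigned addr) {
    /* fold the next-higher 3-bit field into the bank bits */
    return (addr ^ (addr >> 3)) % NBANKS;
}

int main(void) {
    printf("stride-8 access stream:\n");
    for (unsigned i = 0; i < 8; i++) {
        unsigned a = i * 8;
        printf("addr %3u  mod-bank %u  xor-bank %u\n",
               a, bank_mod(a), bank_xor(a));
    }
    return 0;
}
```

For the stride-8 stream, modulo interleaving maps every access to bank 0, while the XOR mapping spreads the eight accesses over all eight banks, eliminating the conflicts.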

14.
Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction, multiple data) processors in video/imaging. To tackle it, we propose new block and row access modes for a parallel on-chip memory subsystem, which enable a higher processing throughput and lower energy consumption than the access modes of state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses, and thereby accelerate one of the key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit spatial overlaps of blocks/rows accessed for pixel interpolation, which are known at subsystem design time, and merge multiple accesses into a single one by accessing somewhat more pixels at a time than other parallel memories do. To avoid the need for a wider, and therefore more costly, SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows, in a form convenient for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word, and a previously proposed shifted scheme for arranging pixels in the banks. We analytically and experimentally demonstrate the advantages of this work on a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time while clocked at 600 MHz. Compared to implementations based on state-of-the-art subsystems, this work enables 40–70% higher throughput, consumes 17–44% less energy, and has similar silicon area and off-chip memory bandwidth costs. That is 1.8–2.9 times more efficient than the prior art, considering the throughput and all costs, i.e., consumption, area, and off-chip bandwidth. This higher efficiency is the result of the new access modes, which reduce the number of on-chip memory accesses by 1.6–2.1 times, and of the cost-efficient architecture.

15.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. By adding a new link field, SLC achieves the low power consumption and low access delay of a direct-mapped cache together with a cache hit rate similar to that of a two-way set-associative cache. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption of each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions, reducing the power consumption of each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, the processor can fetch the requested instruction directly from memory if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.

16.
Open Computing Language (OpenCL) is an open, functionally portable programming model for a large range of highly parallel processors. To provide users with access to the underlying platforms, OpenCL has explicit support for features such as local memory and vector data types (VDTs). However, these are often low-level, hardware-specific features, which can be detrimental to performance on different platforms. In this paper, we focus on VDTs and investigate their usage in a systematic way. First, we propose two different approaches (inter-vdt and intra-vdt) to using VDTs in OpenCL kernels, and show how to translate scalar OpenCL kernels into vectorized ones. After obtaining vectorized code, we evaluate the performance effects of using VDTs with two types of benchmarks: micro-benchmarks and macro-benchmarks. With micro-benchmarks, we study the execution model of VDTs and the role of the compiler-aided vectorizer on five devices. With macro-benchmarks, we explore the changes in memory access patterns before and after using VDTs, and the resulting performance impact. Our evaluation not only provides insights into how OpenCL's VDTs are mapped onto different processors, but also indicates that using such data types introduces changes in both computation and memory accesses. Based on the lessons learned, we discuss how to deal with performance portability in the presence of VDTs. Copyright © 2014 John Wiley & Sons, Ltd.
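The two kernel forms under comparison look roughly like this in OpenCL C. The scalar kernel processes one float per work-item; the vectorized one uses the float4 VDT, shrinking the global work size by four and widening each memory transaction. These are generic examples of the transformation, not the paper's benchmark kernels.

```c
/* Generic example of the scalar -> vector-data-type rewrite studied in
 * the paper (not its benchmark kernels). OpenCL C source. */

/* scalar: one float per work-item */
__kernel void saxpy_scalar(__global const float *x,
                           __global float *y,
                           const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

/* vectorized: one float4 per work-item; the global size shrinks by 4,
 * and each load/store moves 16 contiguous bytes, which changes the
 * memory access pattern as discussed above */
__kernel void saxpy_vdt(__global const float4 *x,
                        __global float4 *y,
                        const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```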

17.
In-order pipelines, thanks to their simple logic and low complexity, are widely used in embedded systems and in on-chip multi-core and many-core processors. However, the instruction issue rate of an in-order pipeline is limited by long-latency events such as memory accesses, so its performance is often low. This paper proposes an out-of-order issue policy based on instruction blocks on top of an in-order pipeline, obtaining a good performance/power ratio at only a small cost in complexity. Experimental results show that the proposed out-of-order issue policy is a complexity-effective improvement over in-order issue.

18.
Custom instructions can potentially improve the execution speed and code compression of embedded applications. However, more efficient custom instructions need a larger number of simultaneous register-file accesses, and larger register files are more power hungry and require complex forwarding interconnects. Therefore, due to the limited ports of the base processor's register file, the size and efficiency of custom instructions are generally limited. Recent research has focused on overcoming this limitation with innovative architectural techniques supplemented by customized compilation. However, to the best of our knowledge, little research has taken complete pipeline design and implementation considerations into account. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available register-file data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose register-file accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.

19.
In multi-core Digital Signal Processing (DSP) systems, the processor-memory gap remains the primary obstacle to improving system performance. This paper addresses this bottleneck by combining task scheduling and memory accesses so that the system architecture and memory modules of a multi-core DSP can be utilized as efficiently as possible. To improve system and memory utilization, the key is to exploit locality as much as possible and integrate it into task scheduling. Two algorithms are proposed to optimize memory accesses while scheduling tasks with timing and resource constraints. The first uses Integer Linear Programming (ILP) to produce a schedule with the most efficient memory access sequence while satisfying the constraints. The second is a heuristic algorithm that produces a near-optimal schedule in polynomial running time. The experimental results show that the memory access cost can be reduced by up to 60% while the schedule length is also shortened.

20.
Out-of-order execution is widely used in modern microprocessor design to improve pipeline performance, but fully out-of-order designs, which both execute and retire instructions out of order, are not common in superscalar processors, and such fully out-of-order structures pose a great challenge to reference-model-based correctness verification. This paper presents a new approach to verifying fully out-of-order processors based on the ultimate criterion for correct program behavior: the programmer-visible architectural state must change sequentially according to program order.
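The criterion can be phrased as a runtime checker: every time the design under test retires an instruction, step a sequential reference model by one instruction and compare the programmer-visible state. The C sketch below stubs the design side with a toy commit log; a real checker would read retirement events from the processor simulator.

```c
/* Minimal sketch of the checking idea: the programmer-visible state must
 * change as if instructions completed in program order, whatever order
 * the core used internally. A sequential reference model is stepped once
 * per retired instruction and the architectural states are compared.
 * The DUT side below is a stub replaying a toy commit stream. */
#include <stdio.h>
#include <string.h>

#define NREGS 4

struct arch_state { unsigned pc; unsigned regs[NREGS]; };

/* reference model: executes the toy program strictly in order */
static void ref_step(struct arch_state *s, int idx) {
    s->pc += 4;
    s->regs[1] += idx;          /* toy semantics: r1 += instruction index */
}

/* stub for the out-of-order DUT's retirement stream (in program order);
 * in real use this would come from the processor simulator */
static void dut_retire(struct arch_state *s, int idx) {
    s->pc += 4;
    s->regs[1] += idx;
}

int main(void) {
    struct arch_state ref = {0, {0}}, dut = {0, {0}};
    for (int idx = 0; idx < 3; idx++) {
        dut_retire(&dut, idx);             /* DUT retires one insn       */
        ref_step(&ref, idx);               /* reference executes one     */
        if (memcmp(&dut, &ref, sizeof dut) != 0) {
            fprintf(stderr, "state mismatch at instruction %d\n", idx);
            return 1;
        }
    }
    printf("all retired instructions matched the reference model\n");
    return 0;
}
```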
