1.

Distributed fair DRAM scheduling in network-on-chips architecture

《Journal of Systems Architecture》2013,59(7):543-550

Memory access scheduling is an effective way to improve the performance of chip multiprocessors (CMPs) by exploiting the timing characteristics of DRAM. A memory access scheduler can partition resource usage (banks and rows) to increase throughput by accessing different DRAM banks in parallel. However, threads running on different cores may see very different performance: one thread may starve while the others are serviced normally. A scheduler is therefore needed that reduces unfairness in the DRAM system while also improving system throughput across a variety of workloads and systems. This paper presents DFDS, a distributed fair DRAM scheduling scheme for two-dimensional mesh networks-on-chip (NoCs). The key design points of DFDS are: (i) measuring the total waiting cycles of a memory request in the NoC and using them as an arbitration metric; to this end, the waiting cycles are carried in an additional flit of the packet and updated as it traverses the NoC; and (ii) a semi-dynamic virtual channel allocation that delivers memory requests in order to the memory controllers (MCs). As a result, the MCs can use a simple scheduling algorithm instead of a complex one. The approach is validated with synthetic workloads and real workloads from the PARSEC benchmark suite. The results show that it reduces the waiting time of memory requests by up to 15%.
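
The arbitration idea in this abstract (carry a waiting-cycle count with each request and favor the request that has waited longest) can be illustrated with a minimal sketch. This is not the DFDS implementation: the `MemRequest` class, its fields, and the `router_cycle` function are hypothetical names introduced here, and the model assumes a single router that grants one request per cycle.

```python
from dataclasses import dataclass


@dataclass
class MemRequest:
    core_id: int             # hypothetical: which core issued the request
    waiting_cycles: int = 0  # stands in for the count carried in the extra flit


def arbitrate(candidates):
    """Grant the request that has waited longest (first one wins a tie)."""
    return max(candidates, key=lambda r: r.waiting_cycles)


def router_cycle(candidates):
    """One arbitration cycle: the winner moves on, the losers age by one cycle."""
    winner = arbitrate(candidates)
    for req in candidates:
        if req is not winner:
            req.waiting_cycles += 1
    return winner


reqs = [MemRequest(0, 3), MemRequest(1, 7), MemRequest(2, 5)]
pending = list(reqs)
order = []
while pending:
    winner = router_cycle(pending)
    pending = [r for r in pending if r is not winner]
    order.append(winner.core_id)

print(order)  # [1, 2, 0]
```

Because losing requests keep aging, a request that is repeatedly passed over eventually becomes the oldest and is granted, which is the starvation-avoidance property the abstract is after.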

2.

An adaptive out-of-order memory access scheduling mechanism

杨磊逄珺时磊张铁军王东辉侯朝焕《计算机工程》2009,35(8):125-127

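Memory access speed has become a bottleneck in modern processor systems. Scheduling memory accesses can effectively improve memory bandwidth utilization. This paper improves on a mechanism known as burst scheduling: a priority expression is used to select the most suitable burst from each block for accessing memory, and an adaptive method adjusts the coefficients of the terms in the priority expression so that the approach works well across different applications. Running the SPEC CPU2000 benchmarks and the stream program shows that, compared with the in-order access mechanism and the burst scheduling mechanism, the adaptive scheduling mechanism improves bus utilization by 52% and 4.8%, respectively.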

3.

Minimizing accumulative memory load cost on multi-core DSPs with multi-level memory

《Journal of Systems Architecture》2013,59(7):389-399

In multi-core Digital Signal Processing (DSP) systems, the processor-memory gap remains the primary obstacle to improving system performance. This paper addresses this bottleneck by combining task scheduling and memory accesses so that the system architecture and memory modules of a multi-core DSP can be utilized as efficiently as possible. To improve system and memory utilization, the key is to take advantage of locality as much as possible and integrate it into task scheduling. Two algorithms are proposed to optimize memory accesses while scheduling tasks with timing and resource constraints. The first uses Integer Linear Programming (ILP) to produce a schedule with the most efficient memory access sequence while satisfying the constraints. The second is a heuristic algorithm that can produce a near-optimal schedule in polynomial running time. The experimental results show that the memory access cost can be reduced by up to 60% while the schedule length is also shortened.

4.

Memory bandwidth analysis of an H.264 encoder and DRAM controller design

胡红旗许家栋孙景楠《计算机工程与应用》2009,45(14):141-144

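Based on an analysis of the memory bandwidth requirements of H.264/AVC encoding, this paper proposes a DRAM controller architecture and implements DRAM controller designs with several different scheduling policies. Three structures are implemented: token ring, fixed priority, and preemptive. Combined with an existing memory space mapping method, they further improve memory bandwidth utilization by reducing redundant cycles during row changes and bank switches. Experimental results show that, of the three proposed structures, preemptive scheduling achieves the highest bandwidth utilization and can support real-time HDTV 1080p encoding at a 150 MHz clock frequency.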

5.

An improved ant colony optimization algorithm for the memory access scheduling problem

田烁窦强王勇张洪广周朝兵李石明《计算机工程与科学》2016,38(8):1563-1567

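Memory access scheduling policies are complex: they must take into account not only the specific circuit timing parameters but also the number of memory access cycles. Based on an analysis of DRAM characteristics and memory access scheduling policies, and taking the DDR3 timing specification into account, this paper proposes an improved ant colony optimization access scheduling policy. Tested on different traces and compared with a greedy scheduling algorithm, the algorithm effectively reduces the average total latency and improves bandwidth utilization.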

6.

A memory access scheduling algorithm based on reinforcement learning

邱东黎施晶晶《计算机工程与应用》2018,54(2):62-67

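In modern processors, the memory controller manages and performs the processor chip's accesses to off-chip memory, and the algorithm used to schedule those accesses has a major impact on actual memory access performance. To address the limited adaptivity of existing scheduling algorithms under different workload characteristics, this paper proposes ALHS, an algorithm based on reinforcement learning that learns an optimal policy by adaptively adjusting the upper limit on consecutive page hits under page-hit-first scheduling. Simulations of several typical memory access patterns show that, compared with conventional FR-FCFS, ALHS improves execution speed by 10.98% on average and achieves a performance gain close to that of the optimal policy, indicating that the algorithm can explore its environment and optimize itself autonomously.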
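
To make the tuned parameter concrete, the sketch below shows a row-hit-first (FR-FCFS style) scheduler with a fixed cap on consecutive row hits. It is an illustration only: the function name `fr_fcfs_capped`, the single-bank model, and the fixed `cap` are assumptions made here, and the reinforcement learning loop that ALHS uses to adjust the cap is not modeled.

```python
def fr_fcfs_capped(queue, open_row, cap):
    """Return the service order (indices into queue) for a single bank.

    queue    -- row numbers of pending requests, oldest first
    open_row -- row currently held in the row buffer
    cap      -- maximum number of consecutive row-hit grants (hypothetical knob)
    """
    pending = list(enumerate(queue))
    order = []
    streak = 0  # consecutive row-hit grants so far
    while pending:
        # Oldest pending request that hits the open row, if any.
        hit = next((p for p in pending if p[1] == open_row), None)
        if hit is not None and streak < cap:
            choice = hit
            streak += 1
        else:
            # No hit, or the cap is reached: fall back to the oldest request.
            choice = pending[0]
            open_row = choice[1]
            streak = 0
        pending.remove(choice)
        order.append(choice[0])
    return order


print(fr_fcfs_capped([5, 9, 5, 5, 5], open_row=5, cap=2))   # [0, 2, 1, 3, 4]
print(fr_fcfs_capped([5, 9, 5, 5, 5], open_row=5, cap=10))  # [0, 2, 3, 4, 1]
```

With a small cap the request to row 9 is served third; with a large cap it waits until every hit on row 5 has been served. That trade-off between row-buffer locality and delay for other requests is what an adaptive cap has to balance.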

7.

A study of priority-based memory access scheduling algorithms

廖春科罗莉何鸿君《计算机与数字工程》2009,37(8):12-16

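This article studies memory access scheduling policies in the memory controller and proposes a priority-based memory access scheduling algorithm. A genetic algorithm is first used to build an effective data source; statistics are then applied to that data source to mine scheduling priorities, yielding three priority levels, and the scheduling algorithm is constructed using only these three priorities to schedule memory access sequences. Experimental results show that the proposed algorithm substantially reduces the running time of memory access sequences, with an optimization effect close to that of the greedy memory access scheduling algorithm proposed in [4], while the algorithm's own running time is far shorter than the latter's.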

8.

Compaction of Schedules and a Two-Stage Approach for Duplication-Based DAG Scheduling (total citations: 1; self-citations: 0; citations by others: 1)

Bozdag D. Ozguner F. Catalyurek U.V. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(6):857-871

Many DAG scheduling algorithms generate schedules that require a prohibitively large number of processors. To address this problem, we propose a generic algorithm, SC, to minimize the processor requirement of any given valid schedule. SC preserves the schedule length of the original schedule and reduces the processor count by merging processor schedules and removing redundant duplicate tasks. To the best of our knowledge, this is the first algorithm to address this largely unexplored aspect of DAG scheduling. On average, SC reduced the processor requirement by 91, 82, and 72 percent for schedules generated by the PLW, TCSD, and CPFD algorithms, respectively. The SC algorithm has a low complexity (O(N³)) compared to most duplication-based algorithms. Moreover, it decouples processor economization from the schedule length minimization problem. To take advantage of these features of SC, we also propose a scheduling algorithm, SDS, with the same time complexity as SC. Our experiments demonstrate that schedules generated by SDS are only 3 percent longer than those of CPFD (O(N⁴)), one of the best algorithms in that respect. SDS and SC together form a two-stage scheduling algorithm that produces high-quality schedules with a low processor requirement and has lower complexity than comparable algorithms that produce similarly high-quality results.

9.

CARR: a scalable solution for network packet classification

Wei Li Weibin Zheng Juanjuan Lin Xiaohong Guan Ling Li Sohail S. Chaudhry Pan Wang Yanping Liu 《Expert Systems》2012,29(1):70-83

Modern Internet routers have to handle a large number of packet classification rules, which requires classification schemes to be scalable in both time and space. In this paper, we present a scalable packet classification algorithm developed by adding two new concepts to the well-known bit vector (BV) scheme. We propose a range search method based on a cache-aware tree (CATree), which makes full use of the processor's cache line to reduce the number of dynamic random access memory (DRAM) accesses. Theoretically, the number of DRAM accesses of CATree is about log(m+1) times lower than that of the widely used binary search algorithm, where m is the number of keys in a single cache line. Based on our computational results on a set of 1024 keys, the CATree algorithm is 36% faster than the binary search algorithm, and the performance is better when applied to a larger set of keys. In addition, we develop a rule re-arrangement algorithm to reduce the bitmap space of BV. With this re-arrangement, the rules for the same action may be assigned an identical priority. This reduces the number of priorities as well as the memory space of the bitmap. Furthermore, it reduces the number of memory accesses and hence increases CPU cache utilization. With CATree and rule re-arrangement, the cache-aware bit vector with rule re-arrangement algorithm achieves better performance than the regular BV scheme in both space and time. In our experiments, the proposed algorithm reduces the bitmap memory space of a practical set of firewall rules by two orders of magnitude and is 91% faster than the regular BV.
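
The "about log(m+1) times lower" claim can be checked with a small back-of-the-envelope calculation. The sketch below is not the CATree data structure; it only counts worst-case memory probes under two assumptions made here: every binary search probe touches a different cache line, and each node of the wide tree fits in one cache line holding m keys. The value m = 7 is a hypothetical choice for illustration.

```python
def binary_search_probes(n):
    """Worst-case probes of binary search over n sorted keys: floor(log2 n) + 1."""
    return n.bit_length()


def wide_tree_levels(n, m):
    """Levels needed by a search tree whose nodes hold m keys (m + 1 children)."""
    levels, capacity = 0, 0
    while capacity < n:
        # A full tree with one more level holds (m + 1) ** levels - 1 keys.
        capacity = capacity * (m + 1) + m
        levels += 1
    return levels


n, m = 1024, 7  # 1024 keys as in the abstract; m = 7 is an assumed line capacity
print(binary_search_probes(n))  # 11
print(wide_tree_levels(n, m))   # 4
```

For these values the ratio is 11 / 4 = 2.75, close to log2(m + 1) = 3, which is consistent with the order of magnitude stated in the abstract.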

10.

Coordinated concurrent memory accesses on a reconfigurable multimedia accelerator

Samar Yazdani Joël Cambonie Bernard Pottier 《Microprocessors and Microsystems》2009,33(1):13-23

Reconfigurable fabrics are designed by tiling operators and memory banks. In the context of system on chip, the inclusion of multiple local memories is critical for algorithmic performance, as they provide concurrent data accesses for configured compute processes. This paper considers a practical case where internal fabric buses and connectivity give a shared memory characteristic to the architecture. This relies on static reconfigurability and high-level programming techniques to render automated memory access scheduling feasible in a deterministic manner. A complete flow has been developed starting from the programming model down to micro-code enabling task synchronization on memory resources. Compile time analysis is achieved by observing the sequence of operations in the concurrent processes, and by synthesizing a controller program to support the best schedule of operations favoring high throughput. The hardware target is a reconfigurable fabric designed at STMicroelectronics in 65 nm. This hardware/software solution is scalable, flexible and provides high throughput on shared memory.

11.

Minimizing write operation for multi-dimensional DSP applications via a two-level partition technique with complete memory latency hiding

《Journal of Systems Architecture》2015,61(2):112-126

Most scientific and digital signal processing (DSP) applications are recursive or iterative. Executing these applications on a chip multiprocessor (CMP) faces two challenges. First, because most DSP applications are both computation intensive and data intensive, an inefficient scheduling scheme may generate a huge number of write operations, cost a lot of time, and consume a significant amount of energy. Second, because CPU speed has increased dramatically compared with memory speed, slow memory hinders overall system performance. In this paper, we develop a Two-Level Partition (TLP) algorithm that minimizes write operations while achieving full parallelism for multi-dimensional DSP applications running on CMPs that employ scratchpad memory (SPM) as on-chip memory (e.g., the IBM Cell processor). Experiments on DSP benchmarks demonstrate the effectiveness and efficiency of the TLP algorithm: it completely hides memory latencies to achieve full parallelism and generates the fewest write operations to main memory compared with previous approaches. Experimental results show that the proposed algorithm is superior to all known methods, including the list scheduling, rotation scheduling, Partition Scheduling with Prefetching (PSP), and Iterational Retiming with Partitioning (IRP) algorithms. Furthermore, the TLP scheduling algorithm reduces write operations to main memory by 45.35% and the schedule length by 23.7% on average compared with the IRP scheduling algorithm, the best known algorithm.

12.

Efficient real-time divisible load scheduling

Anwar Mamat Ying Lu Jitender Deogun Steve Goddard 《Journal of Parallel and Distributed Computing》2012

Providing QoS and performance guarantees to arbitrarily divisible loads has become a significant problem for many cluster-based research computing facilities. While progress is being made in scheduling arbitrarily divisible loads, current approaches are not efficient and do not scale well. In this paper, we propose a linear algorithm for real-time divisible load scheduling. Unlike existing approaches, the new algorithm relaxes the tight coupling between the task admission controller and the task dispatcher. By eliminating the need to generate exact schedules in the admission controller, the algorithm avoids high overheads. We also propose a hybrid algorithm that combines the best of our efficient algorithm and a previously best-known approach. We experimentally evaluate the new algorithm. Simulation results demonstrate that the algorithm scales well, can schedule large numbers of tasks efficiently, and performs similarly to existing approaches in terms of providing real-time guarantees.

13.

FEADS: A Framework for Exploring the Application Design Space on Network Processors

Rajani Pai R. Govindarajan 《International journal of parallel programming》2007,35(1):1-31

Network processors are designed to handle the inherently parallel nature of network processing applications. However, partitioning and scheduling application tasks and allocating data to reduce memory contention remain major challenges in realizing the full performance potential of a given network processor. The large variety of processor architectures in use and the increasing complexity of network applications further aggravate the problem. This work proposes a novel framework, called FEADS, for automating the task of application partitioning and scheduling for network processors. FEADS uses simulated annealing to perform design space exploration of application mapping onto processor resources. Further, it uses cyclic and r-periodic scheduling to achieve higher-throughput schedules. To evaluate dynamic performance metrics such as throughput and resource utilization under realistic workloads, FEADS automatically generates a Petri net (PN) that models the application, architectural resources, mapping, and the constructed schedule, along with their interaction. The throughput obtained by schedules constructed by FEADS is comparable to that obtained by manual scheduling for linear task flow graphs; for more complicated task graphs, FEADS' schedules have a throughput up to 2.5 times higher than the manual schedules. Further, static scheduling of tasks increases throughput by up to 30% compared to an implementation of the same mapping without task scheduling.

14.

A hybrid main memory simulator based on DRAM and PCM

张德志万寿红岳丽华《计算机系统应用》2017,26(9):16-23

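Phase change memory (PCM) has become a hot topic in main memory research owing to its non-volatility, high read speed, and low static power consumption. However, usable PCM devices are currently lacking, so PCM-based algorithms cannot be validated effectively. This paper therefore proposes using a main memory simulator to simulate and validate PCM algorithms. It first describes the characteristics of existing main memory simulators and points out that they do not fully meet the practical needs of current main memory research, and on that basis proposes and builds a hybrid main memory simulator based on DRAM and PCM. Experimental comparison with existing simulators shows that the hybrid main memory simulator effectively simulates a hybrid DRAM and PCM memory architecture, supports simulation of different forms of hybrid main memory systems, and is highly configurable. Finally, a usage example illustrates the ease of use of the simulator's programming interface.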

15.

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Weiwei Fu Tianzhou Chen Chao Wang Li Liu 《The Journal of supercomputing》2014,69(3):1491-1516

On-chip distributed memory systems have become an attractive solution for the massively parallel memory accesses found in future many-core processors. However, the increasing number of on-chip cores and memory controllers inevitably introduces many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnect. This paper tries to optimize on-chip memory access traffic via runtime thread migration. We first analyze memory access behaviors in multi-threaded applications and find that the memory access targets and volumes are similar during short periods, which makes runtime prediction feasible. Over long periods, however, the memory access targets exhibit great mobility, motivating us to dynamically move threads towards the data. Based on these observations, we propose a novel low-cost and distributed thread migration algorithm that adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the trigger and arbitration of migration requests and the procedures to determine the migration chains. Simulation results show that our algorithm achieves a system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It can find a few but effective thread migrations to optimize on-chip memory access traffic with acceptable hardware and runtime overheads.

16.

A hybrid heuristic–genetic algorithm for task scheduling in heterogeneous processor networks

Mohammad I. Daoud Nawwaf Kharma 《Journal of Parallel and Distributed Computing》2011,71(11):1518-1531

Efficient task scheduling on heterogeneous distributed computing systems (HeDCSs) requires the consideration of the heterogeneity of processors and the inter-processor communication. This paper presents a two-phase algorithm, called H2GS, for task scheduling on HeDCSs. The first phase implements a heuristic list-based algorithm, called LDCP, to generate a high quality schedule. In the second phase, the LDCP-generated schedule is injected into the initial population of a customized genetic algorithm, called GAS, which proceeds to evolve shorter schedules. GAS employs a simple genome composed of a two-dimensional chromosome. A mapping procedure is developed which maps every possible genome to a valid schedule. Moreover, GAS uses customized operators that are designed for the scheduling problem to enable an efficient stochastic search. The performance of each phase of H2GS is compared to two leading scheduling algorithms, and H2GS outperforms both algorithms. The improvement in performance obtained by H2GS increases as the inter-task communication cost increases.

17.

Data parallel address architecture

Jung Ho Ahn Dally W.J. 《Computer Architecture Letters》2006,5(1):30-33

Data parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of increasing latency. At the same time, the throughput of modern DRAMs is very sensitive to access patterns due to the time required to precharge and activate banks and to switch between read and write access. To achieve memory reference parallelism, a system may simultaneously issue references from multiple reference threads. Alternatively, multiple references from a single thread can be issued in parallel. In this paper, we examine this tradeoff and show that allowing only a single thread to access DRAM at any given time significantly improves performance by increasing the locality of the reference stream and hence reducing precharge/activate operations and read/write turnaround. Simulations of scientific and multimedia applications show that generating multiple references from a single thread gives, on average, 17% better performance than generating references from two parallel threads.
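
The locality argument can be illustrated with a toy model that counts row activations. This is a simplified sketch, not the simulator used in the paper: it assumes a single bank with an open-row policy, and the two reference streams (`thread_a`, `thread_b`) are made-up examples in which each thread stays within one row.

```python
def count_activations(row_stream):
    """Count row activations for one bank that keeps the last-used row open."""
    activations, open_row = 0, None
    for row in row_stream:
        if row != open_row:   # row miss: precharge the old row, activate the new
            activations += 1
            open_row = row
    return activations


thread_a = [0, 0, 0, 0]  # hypothetical: thread A touches row 0 only
thread_b = [1, 1, 1, 1]  # hypothetical: thread B touches row 1 only

one_thread_at_a_time = thread_a + thread_b
interleaved = [row for pair in zip(thread_a, thread_b) for row in pair]

print(count_activations(one_thread_at_a_time))  # 2
print(count_activations(interleaved))           # 8
```

Serving one thread at a time needs two activations, while interleaving the two threads forces an activation on every access in this model, which is the effect the abstract attributes to reduced locality of the reference stream.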

18.

N-step incremental straight-line algorithms (total citations: 7; self-citations: 0; citations by others: 7)

Gill G.W. 《Computer Graphics and Applications, IEEE》1994,14(3):66-72

This class of algorithms extends Bresenham's (1965) integer straight-line algorithm to generate more than one pixel per inner loop, thus reducing inner loop overhead. The quad-step algorithm is too large to justify its use in older hardware with limited memory space, but it can be viable in the context of modern memory and software sizes. Because the algorithm reduces both calculation overhead and the number of memory accesses for adjacent pixels, it can improve the performance of current systems that are limited in their processor speed and of future systems that might be limited in their memory speed. The algorithm gives results identical to those from Bresenham's single-step routine while drawing pixels in the expected direction from start to end point. Furthermore, as the gradual trend towards more bits per pixel continues, a processor supporting multi-word burst data instructions could make good use of this algorithm in speeding up line drawing into a 24-bits-per-pixel, 1-pixel-per-word color frame buffer. I chose to implement 4 steps per loop because it gave a useful performance improvement without exceeding the resources of the target processor, and it was small enough to hand-code. However, the techniques described can be used to construct a straight-line algorithm that generates more than 4 steps per loop. The relatively small average decision tree sizes indicate that algorithms of greater than 4 pixels per step might further improve line-drawing efficiency.
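
For reference, the single-step baseline that the N-step algorithms extend can be sketched as follows. This is a common integer-only formulation of Bresenham's line algorithm that handles all octants, not the quad-step routine from the article; it emits one pixel per loop iteration, which is exactly the per-pixel loop overhead the N-step variants aim to reduce.

```python
def bresenham(x0, y0, x1, y1):
    """Return the pixels of a line from (x0, y0) to (x1, y1), one per iteration."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1   # step direction in x
    sy = 1 if y0 < y1 else -1   # step direction in y
    err = dx + dy               # integer error term, no floating point needed
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:            # error allows a step in x
            err += dy
            x0 += sx
        if e2 <= dx:            # error allows a step in y
            err += dx
            y0 += sy
    return points


print(bresenham(0, 0, 4, 2))  # [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)]
```

Pixels are produced in order from the start point to the end point, matching the drawing direction the abstract mentions.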

19.

Loop scheduling with memory access reduction subject to register constraints for DSP applications

Yi Wang Zhiping Jia Renhai Chen Meng Wang Duo Liu Zili Shao 《Software》2014,44(8):999-1026

Memory accesses introduce significant time overhead and power consumption because of the performance gap between processors and main memory. This paper describes and evaluates a technique, loop scheduling with memory access reduction (LSMAR), that replaces hidden redundant load operations with register operations in loop kernels and performs partial scheduling for the newly generated register operations subject to register constraints. By exploiting the data dependence of memory access operations, the LSMAR technique can effectively reduce the number of memory accesses of loop kernels, thereby improving timing performance. The technique has been implemented in the Trimaran compiler and evaluated using a set of benchmarks from DSPstone and MiBench on the cycle-accurate simulator of the Trimaran infrastructure. The experimental results show that applying the LSMAR technique reduces the number of memory accesses by 18.47% on average across the benchmarks compared with not applying it. The measurements also indicate that the optimizations lead to an average increase of only 1.41% in code size. With such a small code size expansion, the technique is more suitable for embedded systems than prior work. Copyright © 2013 John Wiley & Sons, Ltd.
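
The kind of redundant load the abstract refers to can be shown on a small loop. The sketch below is a hand-written illustration, not the LSMAR compiler transformation: the function names and the explicit `loads` counter are introduced here to make the memory traffic visible, and register constraints and scheduling are not modeled.

```python
def sum_pairs_naive(a):
    """out[i] = a[i] + a[i-1], loading both operands from memory every iteration."""
    loads, out = 0, []
    for i in range(1, len(a)):
        cur = a[i]
        prev = a[i - 1]      # redundant: this value was loaded one iteration ago
        loads += 2
        out.append(cur + prev)
    return out, loads


def sum_pairs_register(a):
    """Same result, but the previous element is carried over in a local variable."""
    loads, out = 0, []
    if not a:
        return out, loads
    prev = a[0]
    loads += 1
    for i in range(1, len(a)):
        cur = a[i]
        loads += 1
        out.append(cur + prev)
        prev = cur           # register-to-register move replaces the second load
    return out, loads


data = [1, 2, 3, 4, 5]
print(sum_pairs_naive(data))     # ([3, 5, 7, 9], 8)
print(sum_pairs_register(data))  # ([3, 5, 7, 9], 5)
```

Both versions compute the same result, but the second performs one load per iteration instead of two, at the cost of keeping one extra value live in a register.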

20.

A genetic algorithm for a single hoist scheduling in the printed-circuit-board electroplating line (total citations: 6; self-citations: 0; citations by others: 6)

Joon-Mook Lim 《Computers & Industrial Engineering》1997,33(3-4):789-792

In this paper, the problem of determining cyclic schedules for a material handling hoist in the printed-circuit-board (PCB) electroplating line is considered. The objective of this research is to determine an optimal simple-cycle schedule of the hoist, which in turn maximizes the line throughput rate. Previous approaches to the cyclic hoist scheduling problem all develop cyclic schedules using mathematical programming (Mixed Integer Programming, Linear Programming based Branch and Bound, Branch and Bound Search Method, and so on). In this paper, a genetic algorithm-based approach for single hoist scheduling in the PCB electroplating line is described. Through an experiment on well-known example data, the proposed algorithm is shown to be more efficient than the previous mathematical programming-based algorithm.