1.

Li Yong, Hu Huili, Yang Huanrong 《Journal of Computer Applications》2014,34(4):1005-1009

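In digital signal processing software, loops account for a large share of execution time. Holding loop code in an instruction buffer reduces the number of program memory accesses and improves processor performance. A buffer that supports loop instructions is added to the instruction pipeline of a VLIW processor; it caches the instructions of a loop and dispatches them to the functional units in software-pipelined form. Loop code is thus fetched from memory once and executed many times, which greatly reduces the number of memory accesses. While a loop is running, the buffer signals the program memory to enter a sleep state, which lowers processor power consumption. Tests on typical applications show that with the loop buffer the fetch pipeline idle rate exceeds 90%, overall processor performance improves by about 10%, and the hardware area overhead of the loop buffer is about 9% of the fetch pipeline.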
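The fetch-once, execute-many effect described in this abstract can be illustrated with a small counting model. The sketch below is not the paper's design: it assumes one program-memory access per instruction, a loop body that fits entirely in the buffer, and no software-pipelining prologue or epilogue. The function names and parameters are hypothetical.

```python
def fetch_counts(loop_body_len: int, iterations: int, other_instrs: int):
    """Return (fetches_without_buffer, fetches_with_buffer).

    Assumptions (illustrative only): one memory access per instruction,
    the whole loop body fits in the loop buffer, and instructions outside
    the loop are always fetched from program memory.
    """
    without_buffer = other_instrs + loop_body_len * iterations
    # With a loop buffer the body is fetched from memory only once.
    with_buffer = other_instrs + loop_body_len
    return without_buffer, with_buffer


def fetch_reduction(loop_body_len: int, iterations: int, other_instrs: int) -> float:
    """Fraction of program-memory fetches avoided by the loop buffer."""
    without_buffer, with_buffer = fetch_counts(loop_body_len, iterations, other_instrs)
    return 1.0 - with_buffer / without_buffer
```

In this toy model the saving grows with the iteration count and with the share of instructions that sit inside loops, which is why loop-heavy DSP code is the natural target.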

2.

Analysis of L1 I-Cache Miss Penalty Based on Interval Model

MU Ya-li, YANG Bing, YU Ming-yan 《Computer Engineering》2012,38(7)

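The average miss penalty of the L1 instruction cache is usually quantified as the access time of the next level of the memory system; in processor bottleneck analysis, this simple quantification can introduce large errors. To address the problem, an interval model is applied to analyze the front-end factors that affect the average L1 instruction cache miss penalty, and simulation experiments are used to study them. The results show that, besides the access time of the next-level memory system, the fetch bandwidth, the size of the fetch queue, the L1 instruction cache miss rate, and program characteristics all affect the average L1 instruction cache miss penalty.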

3.

Analysis of L1 Instruction Cache Miss Penalty Based on Interval Model

Mu Yali, Yang Bing, Yu Mingyan 《Computer Engineering》2012,38(7):273-275,278

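The average miss penalty of the L1 instruction cache is usually quantified as the access time of the next level of the memory system; in processor bottleneck analysis, this simple quantification can introduce large errors. To address the problem, an interval model is applied to analyze the front-end factors that affect the average L1 instruction cache miss penalty, and simulation experiments are used to study them. The results show that, besides the access time of the next-level memory system, the fetch bandwidth, the size of the fetch queue, the L1 instruction cache miss rate, and program characteristics all affect the average L1 instruction cache miss penalty.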

4.

Design of a Memory Prefetcher Based on the SESC Simulator

Zhao Lei, Zhang Meng, Liu Fang 《Computer and Modernization》2013,(6):183-188

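A simulator is a software system that runs on a host machine and models the behavior of a machine with the target architecture. It can interpret and execute programs built for the target architecture while providing run-time records of instructions and events, together with performance statistics for the target machine. A system-level architecture simulator is a software system that can run as a virtual target machine; it provides functional simulation of subsystems such as single or multiple processors, the memory system, caches, and peripheral devices. Based on the structural characteristics of multi-core processors, this paper studies design methods for architecture simulators and test programs. Using an architecture simulator, it analyzes the off-chip memory access demands of multi-core processors with different structures and discusses how off-chip memory access bandwidth affects computing performance. It summarizes the mechanisms and requirements of off-chip memory access in multi-core systems, and the relationship between off-chip memory access and program characteristics.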

5.

Formalization of Vectorized Machine Learning Instruction Semantics Based on K Framework

Huang Houhua, Liu Jiaxiang, Shi Xiaomu 《Journal of Software》2023,34(8):3853-3869

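ARM has introduced a technology based on the M-Profile Vector Extension for the ARMv8.1-M microprocessor architecture, named ARM Helium, and states that it can improve machine learning performance on ARM Cortex-M processors by up to 15 times. With the rapid growth of the Internet of Things, the correctness of microprocessor instruction execution is especially important. The official manual of an instruction set is the basis for developing chip simulators and on-chip applications, and is a basic guarantee of program correctness. This paper describes a study of the semantic correctness of the vectorized machine learning instructions in the official ARMv8.1-M reference manual, carried out with the executable semantics framework K Framework. The pseudocode that describes the execution of the vectorized machine learning instructions is extracted automatically from the official ARMv8.1-M reference manual and converted into formal semantic transformation rules. Using the executable framework provided by K Framework together with test cases, the correctness of the arithmetic operations performed by the machine learning instructions is verified.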

6.

Analyzing Inter-Core Performance Interference on Multi-Core Processors Using Statistical Learning

Zhao Jiacheng, Cui Huimin, Feng Xiaobing 《Journal of Software》2013,24(11):2558-2570

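It is widely believed that cloud computing and multi-core processors will dominate the future of computing. However, the utilization of computing resources in today's cloud data centers is very low, mainly because of severe and unpredictable performance interference on multi-core processors. To guarantee the QoS of critical applications, these applications can only be barred from running alongside other programs, which leads to over-provisioning of resources. Analyzing inter-core performance interference has therefore become a key problem for raising data center utilization. The authors observe that the inter-core performance interference a program suffers can be expressed as a piecewise linear function of the total pressure on the memory subsystem, regardless of which applications generate that pressure. Based on this observation, they propose a statistical-learning-based method for analyzing inter-core performance interference, which uses principal component linear regression to build the interference model and can accurately and quantitatively predict the performance degradation of any program caused by contention for memory subsystem resources. Experimental results show an average prediction error of only 1.1%.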
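The key observation in this abstract, that degradation depends only on the total memory-subsystem pressure and follows a piecewise linear curve, can be sketched as follows. This is an illustration only: the knot values, the scalar notion of "pressure", and the function names are hypothetical, and the paper's model is fitted with principal component regression rather than specified by hand.

```python
from bisect import bisect_right


def piecewise_linear(x, knots):
    """Evaluate a piecewise linear curve given sorted (x, y) knots.

    Values outside the knot range are clamped to the end points.
    """
    xs = [k[0] for k in knots]
    if x <= xs[0]:
        return knots[0][1]
    if x >= xs[-1]:
        return knots[-1][1]
    i = bisect_right(xs, x) - 1          # segment containing x
    (x0, y0), (x1, y1) = knots[i], knots[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def predict_degradation(co_runner_pressures, knots):
    """Predicted slowdown as a function of the *sum* of co-runner pressures.

    Only the aggregate matters, not which co-runners contribute to it.
    """
    return piecewise_linear(sum(co_runner_pressures), knots)
```

Under this model, two co-runners with pressures 4 and 6 yield the same prediction as a single co-runner with pressure 10, which is the property that lets one curve per program cover arbitrary co-runner mixes.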

7.

A Weight- and Replication-Based Scheduling Algorithm for Heterogeneous Systems

Deng Dekang, Zhang Zhenrong 《Computer Engineering and Design》2023,(10):3004-3011

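The development of dynamic voltage and frequency scaling (DVFS) allows heterogeneous systems to achieve low power consumption. However, DVFS lowers power consumption by reducing the processor's execution frequency, which greatly increases the risk of transient processor faults and seriously threatens application reliability. To address the tendency of earlier algorithms to fail during task scheduling, a scheduling algorithm based on weight and replication (SAWR) is proposed. It schedules applications on heterogeneous systems so as to meet the reliability target of parallel applications while reducing system power consumption. Simulation results show that the proposed algorithm achieves good performance compared with earlier algorithms.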

8.

An Energy-Saving Scheduling Strategy for Determining Processor Frequency

Chen Zhuanhong, Hu Xuhuai 《Computer Engineering》2013,(8)

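In energy-saving applications of dynamic voltage scaling (DVS) in data centers, pursuing a low processor execution frequency does not necessarily save the most energy: lowering voltage and frequency with DVS reduces system power but also reduces system performance and lengthens execution time. This paper therefore analyzes a DVS-based mathematical energy model for real-time transactions in real-time data centers and, combining it with the relationship between transaction execution time and processor frequency, derives a method for computing the energy-optimal initial processor execution frequency that depends only on static characteristic parameters of the server. Calculations on example data show that completing transactions at the optimal initial execution frequency saves about 30% of the energy compared with always using the maximum processor frequency.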
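Why the lowest frequency is not automatically the most energy-efficient can be seen from a common textbook model, sketched below. It is not the paper's derivation: it assumes power P(f) = p_static + k·f³ and execution time cycles / f, and the parameter names and units are hypothetical.

```python
def energy(f: float, cycles: float, p_static: float, k: float) -> float:
    """Energy to run `cycles` at frequency f under P(f) = p_static + k * f**3."""
    return (p_static + k * f ** 3) * (cycles / f)


def optimal_frequency(p_static: float, k: float, f_min: float, f_max: float) -> float:
    """Frequency that minimizes energy() within [f_min, f_max].

    Setting dE/df = 0 for E(f) = cycles * (p_static / f + k * f**2)
    gives f* = (p_static / (2 * k)) ** (1/3); the result is then clamped
    to the range the hardware supports.
    """
    f_star = (p_static / (2.0 * k)) ** (1.0 / 3.0)
    return min(max(f_star, f_min), f_max)
```

In this model the optimum does not depend on the cycle count, only on the static and dynamic power coefficients, which is in the same spirit as the abstract's statement that the optimal initial frequency depends only on static server parameters.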

9.

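Research on On-Chip Synchronization Mechanisms and Evaluation Methods for Many-Core Processors (total citations: 1, self-citations: 0, citations by others: 1)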

Xu Weizhi, Song Fenglong, Liu Zhiyong, Fan Dongrui, Yu Lei, Zhang Shuai 《Chinese Journal of Computers》2010,33(10)

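Synchronization is key to correct execution and cooperative communication on on-chip multi-core and many-core processors, and its efficiency matters greatly for processor performance. For on-chip many-core architectures, this paper proposes and implements two coarse-grained synchronization mechanisms and one fine-grained mechanism: synchronization supported by dedicated on-chip hardware, primitive-based on-chip mutual-exclusion synchronization, and fine-grained synchronization based on full/empty bits. It also proposes evaluation criteria and an evaluation method for coarse-grained synchronization mechanisms and designs a quantitative evaluation program. Using a simulator of the homogeneous on-chip many-core processor Godson-T and the commercial AMD Opteron on-chip multi-core processor as platforms, the performance of the proposed hardware-supported mechanism is compared with that of the primitive-based mechanism. The results show that hardware support markedly improves synchronization performance on on-chip many-core processors, and that in traditional primitive-based synchronization most of the performance loss is waiting time caused by load imbalance and by serialized operations at synchronization points.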

10.

Architectural Characterization of Mobile Device Applications

Huang Yongbing, Chen Mingyu 《Chinese Journal of Computers》2015,38(2)

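Mobile devices such as smartphones and tablets have become the dominant consumer electronics products and continue to grow rapidly. The applications that run on them are highly diverse and place very different demands on the resources of the hardware platform, while the hardware platforms of mobile devices have their own limits in performance and power. Characterizing mobile applications at the architecture level is therefore a useful guide both to the design of hardware components such as processors and memory and to application optimization. The paper selects several classes of commonly used applications on the Android operating system and analyzes in depth their microarchitecture-dependent characteristics on mainstream mobile devices. The results show that mobile applications generally exhibit high miss rates in the instruction cache and the instruction translation lookaside buffer, and a high branch misprediction rate as well. Based on the architectural characteristics of each program, the paper extracts a subset of the most representative applications and proposes Moby, a mobile benchmark suite for architecture research. Moby includes a browser, a mail client, music and video players, a document reader, and a map application. The paper also analyzes in detail the microarchitecture-independent characteristics of Moby, such as instruction mix, instruction locality, working set size, and instruction execution flow.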

11.

Understanding the future of energy-performance trade-off via DVFS in HPC environments

M. Etinski, J. Corbalan, J. Labarta, M. Valero 《Journal of Parallel and Distributed Computing》2012

DVFS is a ubiquitous technique for CPU power management in modern computing systems. Reducing processor frequency/voltage leads to a decrease of CPU power consumption and an increase in the execution time. In this paper, we analyze which application/platform characteristics are necessary for a successful energy-performance trade-off of large scale parallel applications. We present a model that gives an upper bound on performance loss due to frequency scaling using the application parallel efficiency. The model was validated with performance measurements of large scale parallel applications. Then we track how application sensitivity to frequency scaling evolved over the last decade for different cluster generations. Finally, we study how cluster power consumption characteristics together with application sensitivity to frequency scaling determine the energy effectiveness of the DVFS technique.

12.

Assignment of independent tasks to minimize completion time

Ben A. Blake 《Software》1992,22(9):723-734

The task of scheduling dynamic applications that consist of single process tasks on a non-shared memory multicomputer is examined in this paper. Each task of the application is assumed to (1) require execution on a single processor, (2) have an estimate of its maximum execution time, and (3) not wait on communications with other tasks. The objective of the studied schedulers is to map an application's tasks onto the underlying hardware in such a way that the application's completion time is minimized. Experimental evaluation of the schedulers indicates that in many situations, a more sophisticated scheduler fails to outperform simpler schedulers.
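To make the problem setting concrete, the sketch below assigns independent tasks with known execution-time estimates to identical processors using the well-known longest-processing-time-first (LPT) greedy heuristic. LPT is chosen here only as a familiar example of a simple scheduler; the abstract does not say which schedulers the paper evaluates, and the function name is hypothetical.

```python
import heapq


def lpt_schedule(task_times, num_processors):
    """Greedy LPT assignment of independent tasks to identical processors.

    Tasks are taken in order of decreasing estimated time and each one is
    placed on the processor with the smallest current load.
    Returns (makespan, assignment), where assignment[p] lists task indices.
    """
    loads = [(0.0, p) for p in range(num_processors)]  # (current load, processor id)
    heapq.heapify(loads)
    assignment = [[] for _ in range(num_processors)]

    for task, t in sorted(enumerate(task_times), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(loads)        # least-loaded processor
        assignment[p].append(task)
        heapq.heappush(loads, (load + t, p))

    makespan = max(load for load, _ in loads)
    return makespan, assignment
```

For example, `lpt_schedule([7, 5, 4, 3, 3, 2], 2)` balances the 24 units of work into two loads of 12. LPT is a heuristic and does not guarantee an optimal completion time in general.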

13.

Automatic runtime frequency-scaling system for energy savings in parallel applications

Vaibhav Sundriyal, Masha Sosonkina, Zhao Zhang 《The Journal of supercomputing》2014,68(2):777-797

Although high-performance computing has always been about efficient application execution, both energy and power consumption have become critical concerns owing to their effect on operating costs and failure rates of large-scale computing platforms. Modern processors provide techniques, such as dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (called throttling), to improve energy efficiency on-the-fly. Without careful application, however, DVFS and throttling may cause a significant performance loss due to system overhead. This paper proposes a novel runtime system that maximizes energy saving by selecting appropriate values for DVFS and throttling in parallel applications. Specifically, the system automatically predicts communication phases in parallel applications and applies frequency scaling considering both the CPU offload, provided by the network-interface card, and the architectural stalls during computation. Experiments, performed on NAS parallel benchmarks as well as on real-world applications in molecular dynamics and linear system solution, demonstrate that the proposed runtime system obtains energy savings of as much as 14% with a low performance loss of about 2%.

14.

Area efficient remote code execution platform with on-demand instruction manager for cloud-connected code executable IoT devices

《Simulation Modelling Practice and Theory》2017

An energy-area efficient cloud-connected software execution architecture for an IoT sensor processor is proposed. A remotely installed sensor device such as an environmental activity monitor is commonly implemented using a conventional embedded processor that provides only fixed services, with statically compiled embedded software stored in on-chip flash memory. Instead of conventional on-chip flash memory for the instruction code area, we adopt a virtually mapped internal memory concept to realize cloud-connected software execution, in which the remote storage area reached via the IoT platform is indirectly mapped onto the physical address space of the instruction memory using a dynamic address translation technique. The proposed cloud-connected architecture enables on-demand code execution: instructions are fetched from the cloud-side remote storage area at runtime instead of over a directly connected on-chip instruction bus. The proposed storage-less approach may be adopted to reduce the high access current and large chip area overhead by eliminating the on-chip code flash memory. To reduce the access current overhead of retrieving a requested instruction, a small RAM scratch pad is adopted to retain hot-spot instruction code and is filled early with a pre-estimated instruction sector. The experimental results show that the proposed technique reduces the energy consumption and packet delay of an IoT device executing remote embedded software, and also reduces chip area by realizing a storage-less sensor architecture.

15.

Energy-centric DVFS controlling method for multi-core platforms (total citations: 1, self-citations: 0, citations by others: 1)

Shin-gyu Kim, Hyeonsang Eom, Heon Y. Yeom, Sang Lyul Min 《Computing》2014,96(12):1163-1177

Dynamic voltage and frequency scaling (DVFS) is a well-known and effective technique for reducing energy consumption in modern processors. However, accurately predicting the effect of frequency scaling on system performance is a challenging problem in real environments. In this paper, we propose a realistic DVFS performance prediction method, and a practical DVFS control policy (eDVFS) that aims to minimize total energy consumption in multi-core platforms. We also present power consumption estimation models for CPU and DRAM by exploiting a hardware energy monitoring unit. We implemented eDVFS in Linux, and our evaluation results show that eDVFS can save a substantial amount of energy compared with Linux “on-demand” CPU governor in diverse environments.

16.

Energy minimization for reliability-guaranteed real-time applications using DVFS and checkpointing techniques

《Journal of Systems Architecture》2015,61(2):71-81

This paper addresses the energy minimization issue when executing real-time applications that have stringent reliability and deadline requirements. To guarantee the satisfaction of the application’s reliability and deadline requirements, checkpointing, Dynamic Voltage Frequency Scaling (DVFS) and backward fault recovery techniques are used. We formally prove that, if backward fault recovery is used, executing an application with a uniform frequency (or with neighboring frequencies if the desired frequency is not available) not only consumes the minimal energy but also results in the highest system reliability. Based on this theoretical conclusion, we develop a strategy that utilizes DVFS and checkpointing techniques to execute real-time applications so that not only are the applications' reliability and deadline requirements guaranteed, but the energy consumption for executing the applications is also minimized. The developed strategy needs at most one execution frequency change during the execution of an application; hence, the execution overhead caused by frequency switching is small, which makes the strategy particularly useful for processors with a large frequency switching overhead. We empirically compare the developed real-time application execution strategy with recently published work. The experimental results show that, without sacrificing reliability and deadline satisfaction guarantees, the proposed approach can save up to 12% more energy when compared with other approaches.
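One ingredient of the uniform-frequency result can be seen from a standard convexity argument. The derivation below is only a sketch under assumptions that are not taken from the paper: dynamic power is modeled as $P(f) = k f^3$, static power, checkpointing overhead and fault recovery are ignored, and the application needs $W$ cycles within a time budget $D$, with $f(t)$ the frequency used at time $t$.

```latex
\begin{align*}
\text{Work constraint:}\quad & \int_0^D f(t)\,dt = W, \\
\text{Energy:}\quad & E = \int_0^D k\, f(t)^3\,dt, \\
\text{Jensen's inequality ($x \mapsto x^3$ is convex for $x \ge 0$):}\quad &
  \frac{1}{D}\int_0^D f(t)^3\,dt \;\ge\; \left(\frac{1}{D}\int_0^D f(t)\,dt\right)^{3}
  = \left(\frac{W}{D}\right)^{3}, \\
\text{Hence:}\quad & E \;\ge\; k\,D\left(\frac{W}{D}\right)^{3} = \frac{k\,W^{3}}{D^{2}}.
\end{align*}
```

Equality holds when $f(t) = W/D$ throughout, so under this simplified model any schedule that varies the frequency while finishing the same work in the same time uses at least as much dynamic energy as the uniform one. The paper's formal proof additionally covers reliability and the case where the desired frequency is unavailable, which this sketch does not.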

17.

Two versions of architectures for dynamic implied addressing mode

Jonghee M. Youn, Minwook Ahn, Yunheung Paek, Jongwung Kim, Jeonghun Cho 《Journal of Systems Architecture》2010,56(8):368-383

The complexity of today’s embedded applications increases with various requirements such as execution time, code size or power consumption. To satisfy these requirements for performance, efficient instruction set design is one of the important issues, because an instruction customized for specific applications can perform better than multiple instructions in terms of execution time, code size, and power consumption. Limited encoding space, however, does not allow application-specific and complex instructions to be added freely to the instruction set architecture. To resolve this problem, conventional architectures increase free space for encoding by trimming excessive bits required beyond the fixed word length. This approach, however, shows severe weakness in terms of compiler complexity, code size and execution time. In this paper, we propose a new instruction encoding scheme based on the dynamic implied addressing mode (DIAM) to resolve the limited encoding space and the side effects of trimming. We report our two versions of architectures to support our DIAM-based approach. In the first version, we use a special on-chip memory to store extra encoding information. In the second version, we replace the memory by a small on-chip buffer along with a special instruction. We also suggest a code generation algorithm to fully utilize DIAM. In our experiment, the architecture augmented with DIAM shows about 8% code size reduction and 18% speed up on average, as compared to the basic architecture without DIAM.

18.

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Weiwei Fu, Tianzhou Chen, Chao Wang, Li Liu 《The Journal of supercomputing》2014,69(3):1491-1516

On-chip distributed memory systems have become an attractive solution for the massive parallel memory accesses found in future many-core processors. However, the increasing number of on-chip cores and memory controllers inevitably introduces many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnection. This paper tries to optimize on-chip memory access traffic via runtime thread migration. We first analyze memory access behaviors in multi-threaded applications and find that the memory access targets and volumes are similar during short periods, which makes runtime prediction feasible. But the memory access targets exhibit great mobility during long periods, motivating us to dynamically move threads towards the data. Based on these observations, we propose a novel low-cost and distributed thread migration algorithm which adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the trigger and arbitration of migration requests and the procedures to determine the migration chains. Simulation results show that our algorithm achieves system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It can find a few but effective thread migrations to optimize on-chip memory access traffic with acceptable hardware and runtime overheads.

19.

A simulator for adaptive parallel applications

Basile Schaeli, Sebastian Gerlach, Roger D. Hersch 《Journal of Computer and System Sciences》2008,74(6):983-999

Dynamically allocating computing nodes to parallel applications is a promising technique for improving the utilization of cluster resources. Detailed simulations can help identify allocation strategies and problem decomposition parameters that increase the efficiency of parallel applications. We describe a simulation framework supporting dynamic node allocation which, given a simple cluster model, predicts the running time of parallel applications taking CPU and network sharing into account. Simulations can be carried out without needing to modify the application code. Thanks to partial direct execution, simulation times and memory requirements are reduced. In partial direct execution simulations, the application's parallel behavior is retrieved via direct execution, and the duration of individual operations is obtained from a performance prediction model or from prior measurements. Simulations may then vary cluster model parameters, operation durations and problem decomposition parameters to analyze their impact on the application performance and identify the limiting factors. We implemented the proposed techniques by adding direct execution simulation capabilities to the Dynamic Parallel Schedules parallelization framework. We introduce the concept of dynamic efficiency to express the resource utilization efficiency as a function of time. We verify the accuracy of our simulator by comparing the effective running time, respectively the dynamic efficiency, of parallel program executions with the running time, respectively the dynamic efficiency, predicted by the simulator under different parallelization and dynamic node allocation strategies.