1.

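赵姗 郝春亮 翟健 李明树 《软件学报》2020,31(9):2965-2979

In recent years, heterogeneous multicore processors have become mainstream in mobile computing. Compared with traditional homogeneous designs, they meet a device's computing needs at a lower power cost. However, the microarchitectural differences between CPU cores in a heterogeneous environment also pose new challenges for some basic operating-system mechanisms. To address load balancing in scheduling on performance-asymmetric heterogeneous multicore systems, the paper proposes S-Bridge, a system-level load-balancing mechanism that reduces the impact of processor microarchitecture differences and of differing task execution demands on traditional load balancing. The main contribution of S-Bridge is a general, heterogeneity-aware set of load-balancing interfaces provided at the system level, so that any scheduler can be adapted easily to a heterogeneous multicore system. Experiments were run with the CFS and HMP schedulers on ARM platforms, and the generality of S-Bridge was verified on an x86 platform. The results show that S-Bridge can be implemented quickly on different real platforms and kernel versions, with an average performance improvement of more than 15% and up to 65% in some cases.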
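The idea of making load balancing aware of core asymmetry can be illustrated with a minimal sketch. This is not the S-Bridge interface, which the abstract does not describe; the `Core` class, the `capacity` value, and the greedy placement rule are all assumptions made for illustration. Each core's load is divided by its relative capacity before cores are compared, so a big core is expected to carry proportionally more work than a little core.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Core:
    """Hypothetical model of a CPU core; not taken from the paper."""
    name: str
    capacity: float                      # assumed relative throughput (little core = 1.0)
    tasks: List[float] = field(default_factory=list)  # weights of assigned tasks

    def scaled_load(self) -> float:
        # Load normalized by capacity, so big and little cores are comparable.
        return sum(self.tasks) / self.capacity


def pick_core(cores: List[Core], weight: float) -> Core:
    """Choose the core whose capacity-scaled load would be lowest after adding the task."""
    return min(cores, key=lambda c: (sum(c.tasks) + weight) / c.capacity)


def place(cores: List[Core], weights: List[float]) -> None:
    """Greedy placement: heaviest tasks first, each to the least (scaled) loaded core."""
    for w in sorted(weights, reverse=True):
        pick_core(cores, w).tasks.append(w)


big = Core("big", capacity=2.0)
little = Core("little", capacity=1.0)
place([big, little], [4.0, 2.0, 2.0, 1.0])
print(big.tasks, little.tasks)
```

A balancer that compared raw task weights would treat both cores as equal; dividing by capacity is one simple way to express the asymmetry, and a real scheduler would also need to account for per-task differences in how much each task benefits from a big core.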

2.

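Power-Aware Parallel Loop Scheduling for Heterogeneous Systems

王桂彬 杨学军 徐新海 林一松 李鑫 《软件学报》2011,22(9):2222-2234

Taking OpenMP-like parallel programs as the object of study, the paper combines parallel loop scheduling on heterogeneous systems with dynamic voltage scaling of the processors to optimize system power consumption while satisfying performance constraints. It first establishes a basic model of the power-aware parallel loop scheduling problem on heterogeneous systems. It then derives analytically a lower bound on the energy consumption of parallel loop scheduling on heterogeneous systems, which can be used to evaluate the actual efficiency of power optimization methods. Next, the scheduling problem is formulated as an integer programming problem, and on this basis an intra-processor loop rescheduling method is proposed to reduce power consumption further. Finally, 10 typical kernel programs are evaluated on a CPU-GPU heterogeneous system. The experimental results show that the method effectively reduces system power consumption and improves system efficiency.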

3.

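A Fair Scheduling Algorithm for Dynamic Heterogeneous Multicore Processors

王涛 安虹 孙涛 高晓川 张海博 程亦超 彭毅 《软件学报》2014,25(S2):80-89

The cores of a dynamic heterogeneous multicore processor can be reconfigured at run time, which brings new opportunities and challenges for operating-system scheduling algorithms. Exploiting this reconfigurability makes it possible to match the execution needs of different tasks more closely and opens up considerable room for performance optimization, but it also introduces new costs and makes fairness more complex to compute. To address fair scheduling on dynamic heterogeneous multicore processors, the paper proposes a scheduling model based on a centralized run queue, which reduces the maintenance overhead that dynamic core changes impose on the scheduling algorithm. It also reconsiders the definition of fairness on dynamic heterogeneous processors and proposes a new HFS scheduling algorithm based on the existing CFS algorithm. HFS exploits the performance advantages of dynamic heterogeneous multicore processors simply and effectively while also providing fair scheduling on them. Simulation of SCMP, ACMP, and DHCMP platforms shows that HFS makes good use of the performance characteristics of the DHCMP architecture: compared with SCMP and ACMP architectures running current mainstream scheduling algorithms, it improves user-level performance (ANTT) by 10.55% and system throughput (WSU) by 14.24%.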
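For readers unfamiliar with the two metrics named in this abstract, the sketch below computes them from per-task run times. It assumes the commonly used definitions (ANTT as the mean per-task slowdown relative to running alone, weighted speedup as the sum of per-task relative progress); the abstract itself does not give formulas, so the paper's exact definitions may differ. Function names and sample numbers are illustrative only.

```python
from typing import Sequence


def antt(alone: Sequence[float], shared: Sequence[float]) -> float:
    """Average normalized turnaround time: mean slowdown per task (lower is better)."""
    if len(alone) != len(shared) or not alone:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(s / a for a, s in zip(alone, shared)) / len(alone)


def weighted_speedup(alone: Sequence[float], shared: Sequence[float]) -> float:
    """Weighted speedup: sum of per-task relative progress (higher is better)."""
    if len(alone) != len(shared) or not alone:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a / s for a, s in zip(alone, shared))


# Illustrative run times in seconds: each task alone vs. in the multiprogrammed mix.
alone_times = [10.0, 20.0]
shared_times = [20.0, 25.0]
print(antt(alone_times, shared_times))              # mean of 2.0 and 1.25
print(weighted_speedup(alone_times, shared_times))  # 0.5 + 0.8
```

ANTT reflects what an individual user experiences, while weighted speedup reflects how much total work the system completes, which is why scheduling papers tend to report both.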

4.

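Optimization of Level 1 and Level 2 BLAS Functions on the Sunway Many-Core Processor

孙家栋 孙乔 邓攀 杨超 《计算机系统应用》2017,26(11):101-108

BLAS (Basic Linear Algebra Subprograms) is a library of basic functions that operate on vectors and matrices. Its functions are divided into three levels, which provide basic vector-vector (Level 1), matrix-vector (Level 2), and matrix-matrix (Level 3) operations. The paper studies the parallel implementation of Level 1 and Level 2 BLAS functions on the Sunway many-core processor, tunes their performance in depth by exploiting platform features, and summarizes techniques for parallel implementation and optimization on the Sunway platform. The Sunway 26010 (申威26010) CPU uses a heterogeneous many-core architecture whose many compute cores provide large-scale parallel processing, giving a single chip 3 TFLOPS of double-precision floating-point performance. The experimental results show that the Level 1 and Level 2 BLAS functions achieve average speedups of up to 11.x and 6.x, respectively, over the GotoBLAS reference implementation, and that each optimization technique yields a clear speedup.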

5.

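A Heterogeneous Multicore Scheduling Mechanism Based on Heterogeneity-Aware Static Scheduling and Dynamic Migration

张苗 张德贤 《计算机应用》2011,31(7):1808-1810

Heterogeneous multicore architectures can effectively reduce power overhead and represent the trend in processor development, but load imbalance can make processor execution unstable. The paper proposes a heterogeneous multicore scheduling mechanism that combines heterogeneity-aware static scheduling with dynamic thread migration, which resolves load imbalance between different cores and improves throughput. Simulation experiments comparing this mechanism with a static scheduling policy (SS) show that it improves the performance of heterogeneous multicore processors and keeps execution stable.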

6.

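Pipeline Parallelism Optimization on the Cell Heterogeneous Multicore Processor

曹倩 胡长军 李士刚 《计算机应用研究》2011,28(9):3344-3347

To exploit the advantages of heterogeneous multicore processors and improve program execution efficiency, the paper proposes two optimization techniques for the Cell heterogeneous multicore processor: thread-synchronized pipeline parallelism and iteration-synchronized pipeline parallelism. These techniques effectively speed up code with irregular writes and irregular control structures. Tests on the Cell processor with IS, EP, and LU from the NAS benchmarks and MOLDYN from SPEC2001 show that the pipeline parallelism scheme effectively improves the execution efficiency of critical sections and flush operations and markedly increases program execution speed.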

7.

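A Scheduling Method Balancing Application Fairness and Energy Consumption on Heterogeneous Multicore Processors

杨亚琪 栾钟治 杨海龙 杨姝 钱德沛 《计算机工程与科学》2016,38(5):848-856

Heterogeneous multicore processors usually consist of high-performance big cores and low-energy little cores, and appropriate thread scheduling on them can effectively improve resource utilization and save energy. Fair scheduling on big and little cores proposed in earlier work did not consider cores with different frequency/voltage states, yet processors that support DVFS are increasingly common, so the calculation of fairness between threads needs to be extended and improved. The paper proposes a method for computing thread fairness on heterogeneous multicore processors when each core has several DVFS states, improves an existing performance prediction model by using an adaptive algorithm to adjust its coefficients, and on this basis proposes a scheduling policy that keeps inter-thread fairness and processor power within preset thresholds while selecting the most energy-efficient configuration, thereby reducing the energy consumed by running applications. The experimental results show that, compared with the proposed scheduling policy, the static, DVFS-only, and swap-only scheduling methods consume on average more than 20% more energy for almost the same total running time, and as much as 50% more for some applications.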

8.

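Multi-Operating-System Cooperation on Heterogeneous Processors

冯瑞青 张激 赵俊才 《计算机系统应用》2018,27(12):90-95

As the application scenarios of embedded devices grow more complex, heterogeneous multicore architectures are becoming the mainstream architecture for embedded processors. The single-operating-system mode that multicore processors mainly use today has many limitations in practice. To make full use of the multiple cores of a heterogeneous processor, it is essential to deploy a suitable operating system on each type of core and to enable cooperative processing across operating systems. The paper studies operating systems for a heterogeneous multicore processor (ARM+DSP) and ports embedded Linux and ReWorks, a domestically developed real-time operating system for DSPs, to a heterogeneous multicore platform. To enable cooperative processing between ReWorks and Linux, it analyzes the key techniques of inter-core communication and, taking TI's AM5718 as an example, designs a set of heterogeneous multicore communication components. Testing shows that these components support dynamic loading of the ReWorks operating system and applications onto the DSP core from the ARM core, inter-core message passing between Linux and ReWorks, and cooperative computing between Linux and ReWorks.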

9.

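A Heterogeneity-Aware Multicore Scheduling Method Based on Machine Learning

安鑫 康安 夏近伟 李建华 陈田 任福继 《计算机应用》2020,40(10):3081-3087

Heterogeneous multicore processors have become the mainstream solution for modern embedded systems, and a good online mapping or scheduling method is crucial to fully exploiting their high performance and low power consumption. For the problem of dynamically mapping and scheduling applications on heterogeneous multicore systems, the paper proposes a mapping and scheduling solution that uses machine learning to evaluate program performance quickly and accurately, together with a technique for detecting phase changes in program behavior that determines when to remap, so as to maximize system performance. On the one hand, by carefully selecting static and dynamic features of the processing cores and of the running program, the solution captures the differences in computing capability and workload behavior caused by heterogeneity, which allows a more accurate prediction model to be built. On the other hand, phase detection keeps the number of online mapping computations as low as possible, which yields a more efficient scheduling scheme. Finally, the effectiveness of the proposed scheme is validated on the SPLASH-2 benchmark suite. The experimental results show that, compared with Linux's default Completely Fair Scheduler (CFS), the proposed method improves system computing performance by 52% and CPU resource utilization by 9.4%. This indicates that the proposed method performs well in both respects and can effectively improve dynamic application mapping and scheduling on heterogeneous multicore systems.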

10.

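A Heterogeneity-Aware Multicore Scheduling Method Based on Machine Learning

安鑫 康安 夏近伟 李建华 陈田 任福继 《计算机应用》2020,40(10):3081-3087

Heterogeneous multicore processors have become the mainstream solution for modern embedded systems, and a good online mapping or scheduling method is crucial to fully exploiting their high performance and low power consumption. For the problem of dynamically mapping and scheduling applications on heterogeneous multicore systems, the paper proposes a mapping and scheduling solution that uses machine learning to evaluate program performance quickly and accurately, together with a technique for detecting phase changes in program behavior that determines when to remap, so as to maximize system performance. On the one hand, by carefully selecting static and dynamic features of the processing cores and of the running program, the solution captures the differences in computing capability and workload behavior caused by heterogeneity, which allows a more accurate prediction model to be built. On the other hand, phase detection keeps the number of online mapping computations as low as possible, which yields a more efficient scheduling scheme. Finally, the effectiveness of the proposed scheme is validated on the SPLASH-2 benchmark suite. The experimental results show that, compared with Linux's default Completely Fair Scheduler (CFS), the proposed method improves system computing performance by 52% and CPU resource utilization by 9.4%. This indicates that the proposed method performs well in both respects and can effectively improve dynamic application mapping and scheduling on heterogeneous multicore systems.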

11.

Leveraging workload diversity through OS scheduling to maximize performance on single-ISA heterogeneous multicore systems (total citations: 1, self-citations: 0, citations by others: 1)

Juan Carlos Saez Daniel Shelepov 《Journal of Parallel and Distributed Computing》2011,71(1):114-131

Recent research has highlighted the potential benefits of single-ISA heterogeneous multicore processors over cost-equivalent homogeneous ones, and it is likely that future processors will integrate cores that have the same instruction set architecture (ISA) but offer different performance and power characteristics. To fully tap into the potential of these processors, the operating system must be aware of the hardware asymmetry when making scheduling decisions and map applications to cores in consideration of their performance characteristics. We propose a Heterogeneity-Aware Signature-Supported (HASS) scheduling algorithm that performs this mapping using per-thread architectural signatures, which are compact summaries of threads’ architectural properties. We implemented HASS in OpenSolaris, and demonstrated that it always outperforms a heterogeneity-agnostic scheduler (by as much as 12.5%) for workloads exhibiting sufficient diversity. Our evaluation also includes an extensive comparison with other heterogeneity-aware schedulers to provide a clearer understanding of the pros and cons of HASS.

12.

Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

《Parallel Computing》2007,33(10-11):700-719

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. We investigate user-level schedulers that dynamically “rightsize” the dimensions and degrees of parallelism on the Cell Broadband Engine. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. We evaluate recently introduced schedulers for event-driven execution and utilization-driven dynamic multi-grain parallelization on Cell. We also present a new scheduling scheme for dynamic multi-grain parallelism, S-MGPS, which uses sampling of dominant execution phases to converge to the optimal scheduling algorithm. We evaluate S-MGPS on an IBM Cell BladeCenter with two realistic bioinformatics applications that infer large phylogenies. S-MGPS performs within 2–10% of the optimal scheduling algorithm in these applications, while exhibiting low overhead and little sensitivity to application-dependent parameters.

13.

Efficient and scalable scheduling for performance heterogeneous multicore systems

Pengcheng Nie Zhenhua Duan 《Journal of Parallel and Distributed Computing》2012

Performance heterogeneous multicore processors (HMP for brevity), consisting of multiple cores with the same instruction set but different performance characteristics (e.g., clock speed, issue width), are of great interest because they can deliver higher performance per watt and area than comparable homogeneous ones for programs with diverse architectural requirements. However, such power and area efficiencies of performance heterogeneous multicore systems can only be achieved when workloads are matched with cores according to both the properties of the workload and the features of the cores.

14.

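Platform-Oblivious In-Memory Join Optimization Based on Vector Referencing

张延松 张宇 王珊 《软件学报》2018,29(3):883-895

Graph-analytics database systems represented by MapD use new many-core processors such as GPUs and Phi to support high-performance analytical processing, but join operations remain a major performance bottleneck for complex data schemas. In recent years, heterogeneous processors have become the mainstream platform for high-performance computing, and research on in-memory join performance has extended from multicore CPUs to emerging many-core processors. However, the many published results do not systematically reveal the intrinsic relationships among join algorithm performance, join data set size, and hardware architecture, which makes it difficult to provide a strategy for choosing the best join platform for databases on future heterogeneous processors. The paper targets in-memory join optimization for multicore CPU, Xeon Phi, and GPU platforms. By optimizing the design of the in-memory hash table, it replaces hash mapping with vector mapping and eliminates the influence of hashing cost on in-memory join algorithms, so that the performance characteristics of in-memory joins can be measured more accurately under hardware-related factors such as the cache size of multicore CPUs, the cache size of Xeon Phi, the concurrent multithreading of Xeon Phi, and the SIMT (single instruction, multiple threads) mechanism of GPUs. The experimental results show that caching and concurrent multithreading are important factors in improving in-memory join performance. Caching gives a performance advantage for joins that fit within the cache, the concurrent multithreading of GPUs delivers higher performance for joins on larger tables, and Xeon Phi achieves the highest performance for joins that fit within its L2 cache. The results reveal the relationship between in-memory join performance and the hardware characteristics of heterogeneous processors, and provide optimization strategies for query optimizers of in-memory databases on future heterogeneous processor platforms.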
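The core idea named in this abstract, replacing hash mapping with vector mapping, can be sketched as follows. This is an illustrative reading rather than the paper's implementation: it assumes that the join keys of the smaller table are dense non-negative integers, so a key can be used directly as an array position. The function names and the row layout are hypothetical.

```python
from typing import Any, List, Tuple

_EMPTY = object()  # sentinel marking an unused vector slot


def hash_join(dim: List[Tuple[int, Any]], fact: List[Tuple[int, Any]]) -> List[Tuple[int, Any, Any]]:
    """Baseline: build a hash table on the dimension key, then probe with each fact row."""
    table = {key: payload for key, payload in dim}
    return [(fk, m, table[fk]) for fk, m in fact if fk in table]


def vector_join(dim: List[Tuple[int, Any]], fact: List[Tuple[int, Any]]) -> List[Tuple[int, Any, Any]]:
    """Vector variant: the key is the array index, so probing needs no hash computation.

    Assumes dimension keys are small, dense, non-negative integers.
    """
    size = max((key for key, _ in dim), default=-1) + 1
    vector = [_EMPTY] * size
    for key, payload in dim:
        vector[key] = payload
    out = []
    for fk, m in fact:
        if 0 <= fk < size and vector[fk] is not _EMPTY:
            out.append((fk, m, vector[fk]))
    return out


dim_rows = [(0, "red"), (1, "green"), (3, "blue")]
fact_rows = [(1, 10.5), (3, 2.0), (2, 7.0), (0, 1.0), (3, 4.0)]
print(vector_join(dim_rows, fact_rows))
```

Because the probe reduces to a bounds check and one array access, the remaining cost is dominated by memory behavior (cache size, threading model), which is the effect the abstract says the authors wanted to isolate.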

15.

Embedded GPU and multicore processors for emotional-based mobile robotic agents

《Future Generation Computer Systems》2016

Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of this architecture are the emotional processes that decide which behaviors the robot must activate to fulfill the objectives. The number of emotional processes increases (hundreds of millions/s) with the complexity level of the application, limiting the capacity of a main processor to solve complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel, hence providing the computing power to tackle complex dynamic problems. In this paper, Graphics Processing Units (GPUs), multicore processors and single instruction multiple data (SIMD) instructions are used to provide parallelism for the emotional processes. Different GPUs, multicore processors and SIMD instruction sets are evaluated and compared to analyze their suitability to cope with robotic applications. The applications are set up taking into account different environmental conditions, robot dynamics and emotional states. Experimental results show that, despite the fact that GPUs have a bottleneck in the data transmission between the host and the device, the evaluated GTX 670 GPU provides a performance more than one order of magnitude greater than the initial implementation of the architecture on a single core. Thus, all complex proposed application problems can be solved using the GPU technology, in contrast to the first prototype, where only 55% of them could be solved. Using AVX SIMD instructions, the performance of the architecture is increased by 3.25 times relative to the first implementation. Thus, of the 27 proposed applications, about 88.8% are solved. In the case of the SSE SIMD instructions, the performance is almost doubled and the robot could solve about 74% of the proposed application problems. The use of AVX and SSE SIMD instructions provides almost the same performance as a quad-core and a dual-core, respectively, with the advantage that they do not add any hardware cost.

16.

Resources Snapshot Model for Concurrent Transactions in Multi-Core Processors

赵雷杨季文《计算机科学技术学报》2013,28(1):106-118

Transaction parallelism in database systems is an attractive way of improving transaction performance. There exist two levels of transaction parallelism, the inter-transaction level and the intra-transaction level. With the advent of multicore processors, new hopes of improving transaction parallelism appear on the scene. The greatest execution efficiency of concurrent transactions comes from the lowest dependencies among them. However, the dependencies of concurrent transactions stand in the way of exploiting parallelism. In this paper, we present the Resource Snapshot Model (RSM) for resource modeling at both levels. We propose a non-restarting scheduling algorithm at the inter-transaction level and a processor assignment algorithm at the intra-transaction level for multi-core processors. Through these algorithms, the execution performance of transaction streams will be improved in a parallel system with multiple heterogeneous processors that have different numbers of cores.

17.

Software and Hardware Techniques to Optimize Register File Utilization in VLIW Architectures (total citations: 1, self-citations: 0, citations by others: 1)

Javier Zalamea Josep Llosa Eduard Ayguadé Mateo Valero 《International journal of parallel programming》2004,32(6):447-474

High-performance microprocessors are currently designed with the purpose of exploiting instruction level parallelism (ILP). The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. This paper reviews hardware and software techniques that alleviate the high register demands of aggressive scheduling heuristics on VLIW cores. From the software point of view, instruction scheduling can stretch lifetimes and reduce the register pressure. If more registers than those available in the architecture are required, some actions (such as the injection of spill code) have to be applied to reduce this pressure, at the expense of some performance degradation. From the hardware point of view, this degradation could be reduced if a high-capacity register file were included without causing a negative impact on the design of the processor (cycle time, area and power dissipation). Novel organizations for the register file based on clustering and hierarchical organization are necessary to meet the technology constraints. This paper proposes the use of a clustered organization and an aggressive instruction scheduling technique that minimizes the negative effect of the limitations imposed by the register file organization.

18.

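Task Scheduling for Multicore Arrays

陈亦欧 吕信科 凌翔 《计算机科学》2017,44(8):42-45, 70

As signal processing grows more complex, multicore parallel architectures have become an effective solution for digital signal systems. The paper studies task scheduling on wireless multicore arrays for digital signal processing systems. Starting from the performance and overhead requirements of digital signal processing systems and wireless multicore arrays, it takes power consumption, heat distribution, and latency as optimization objectives and designs corresponding models for power, thermal balance evaluation, and latency, which serve as the objective functions of a multi-objective optimization algorithm. Building on the NSGA-II algorithm, it improves the crowding strategy and the initial population and designs a new fitness function that balances the three objectives and increases the chance of finding better solutions. Finally, simulations with several task graphs on a wireless multicore array platform verify the effectiveness and superiority of the proposed algorithm.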
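The two building blocks of NSGA-II that this abstract refers to, Pareto dominance over several objectives and the crowding measure, can be sketched as below. The code shows the standard textbook forms, not the improved crowding strategy or fitness function proposed in the paper, whose details the abstract does not give. Objective tuples are assumed to be values to minimize, for example (power, thermal imbalance, latency); the sample numbers are made up.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def crowding_distance(front: List[Point]) -> List[float]:
    """Standard NSGA-II crowding distance; boundary points get infinity."""
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    for k in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue  # all values equal: this objective adds no spread
        for pos in range(1, n - 1):
            i = order[pos]
            dist[i] += (front[order[pos + 1]][k] - front[order[pos - 1]][k]) / (hi - lo)
    return dist


# Hypothetical schedules scored as (power, thermal imbalance, latency).
candidates = [(1.0, 5.0, 3.0), (2.0, 2.0, 2.0), (3.0, 1.0, 4.0), (3.0, 6.0, 5.0)]
print(pareto_front(candidates))
```

Selection in NSGA-II prefers non-dominated solutions first and, among equally ranked ones, those with a larger crowding distance, which keeps the population spread along the trade-off surface.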

19.

Memory aware load balance strategy on a parallel branch‐and‐bound application

Juliana M.N. Silva Cristina Boeres Lúcia M.A. Drummond Artur A. Pessoa 《Concurrency and Computation》2015,27(5):1122-1144

The latest trends in high performance computing systems show an increasing demand for using large-scale multicore systems efficiently so that highly compute-intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by poor utilization of the memory hierarchy. In fact, the multicore architecture introduces some distinct features that are already observed in shared memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that an application is allocated carefully to the available cores, based on a scheduling specification that considers not only processor characteristics but also memory contention. This paper proposes a multicore cluster representation that captures relevant performance characteristics of multicore systems, such as the influence of the memory hierarchy and contention on application performance. Improved performance was achieved by a branch-and-bound application applied to the partitioning sets problem that incorporated a memory-aware load balancing strategy based on the proposed multicore cluster representation. An in-depth analysis of this application's execution showed its applicability to modern systems. Copyright © 2014 John Wiley & Sons, Ltd.