Similar Literature
 20 similar records found
1.
Off-chip memory bandwidth is the main performance bottleneck of shared-memory multi-core systems. Bandwidth-aware task scheduling can effectively mitigate memory contention among concurrent programs and improve system throughput. However, applying such a scheduling policy requires prior knowledge about program execution, which places an extra burden on system users; moreover, bandwidth contention among concurrent programs makes the bandwidth-demand information collected at run time inaccurate, degrading the scheduling results. This paper proposes a low-overhead, user-transparent cross-execution optimization approach to address these problems. It identifies program phase behavior at run time and estimates the solo-run performance of each phase; this information is stored in a database to guide scheduling in future executions, and its accuracy keeps improving as the program is run repeatedly. Bandwidth-aware scheduling therefore no longer needs any user-supplied guidance, and its effectiveness is improved. The authors implemented and evaluated the system on an 8-core Intel Xeon platform; compared with the default Linux operating system (OS) scheduler, the proposed approach improves system throughput by 3.7% on average, and by up to 8.5% for certain program mixes.
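To make the cross-execution idea concrete, the following is a minimal Python sketch, not the paper's implementation: per-program, per-phase bandwidth estimates are persisted in a simple JSON file and refined with an exponentially weighted update, then used to spread bandwidth-heavy programs across core groups. The names (`PhaseDB`, `co_schedule`), the 0.7/0.3 weights, and the snake placement are illustrative assumptions.

```python
import json
import os

class PhaseDB:
    """Persists per-program, per-phase solo-run bandwidth estimates across executions."""

    def __init__(self, path="phase_db.json"):
        self.path = path
        self.db = json.load(open(path)) if os.path.exists(path) else {}

    def update(self, program, phase, observed_bw_gbs):
        # Exponentially weighted refinement: estimates get more accurate run after run.
        key = f"{program}:{phase}"
        old = self.db.get(key, observed_bw_gbs)
        self.db[key] = 0.7 * old + 0.3 * observed_bw_gbs
        with open(self.path, "w") as f:
            json.dump(self.db, f)

    def demand(self, program):
        # Average estimated bandwidth demand over all known phases of the program.
        vals = [v for k, v in self.db.items() if k.startswith(program + ":")]
        return sum(vals) / len(vals) if vals else 0.0


def co_schedule(programs, db, groups=2):
    """Spread bandwidth-heavy programs across core groups (snake placement)."""
    ranked = sorted(programs, key=db.demand, reverse=True)
    buckets = [[] for _ in range(groups)]
    for i, prog in enumerate(ranked):
        rnd, pos = divmod(i, groups)
        buckets[pos if rnd % 2 == 0 else groups - 1 - pos].append(prog)
    return buckets
```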

2.
In modern processors, the memory controller manages and executes the chip's accesses to off-chip memory, and the scheduling algorithm applied to memory requests has a major impact on actual memory performance. To address the poor adaptivity of existing scheduling algorithms across different workload characteristics, a reinforcement-learning-based algorithm, ALHS, is proposed: it adaptively adjusts the cap on consecutive row-buffer (page) hits used by hit-first scheduling and thereby learns a near-optimal policy. Simulation results over a variety of typical memory access patterns show that, compared with conventional FR-FCFS, ALHS improves performance by 10.98% on average and approaches the performance of the optimal policy, demonstrating that the algorithm can autonomously explore its environment and optimize itself.
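A rough sketch of the adaptive-cap idea follows. It assumes tabular epsilon-greedy Q-learning over a small range of cap values, with the state being the current cap and the reward being some per-epoch throughput measurement; these state/action/reward choices are assumptions for illustration, not the ALHS design itself.

```python
import random

ACTIONS = (-1, 0, +1)  # decrease, keep, or increase the consecutive row-hit cap

class HitStreakTuner:
    def __init__(self, caps=range(1, 17), eps=0.1, alpha=0.2, gamma=0.9):
        self.caps = list(caps)
        self.q = {(c, a): 0.0 for c in self.caps for a in ACTIONS}
        self.eps, self.alpha, self.gamma = eps, alpha, gamma
        self.cap = self.caps[len(self.caps) // 2]   # start in the middle of the range

    def choose(self):
        # Epsilon-greedy action selection over the current cap's Q values.
        if random.random() < self.eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: self.q[(self.cap, act)])
        new_cap = min(max(self.cap + a, self.caps[0]), self.caps[-1])
        return a, new_cap

    def update(self, action, new_cap, reward):
        # One-step Q-learning update, then commit the new cap for the next epoch.
        best_next = max(self.q[(new_cap, b)] for b in ACTIONS)
        self.q[(self.cap, action)] += self.alpha * (
            reward + self.gamma * best_next - self.q[(self.cap, action)]
        )
        self.cap = new_cap

# Usage per scheduling epoch (reward could be requests served per cycle in that epoch):
tuner = HitStreakTuner()
action, cap = tuner.choose()
# ... run one epoch with `cap` as the consecutive row-hit limit, measure throughput ...
tuner.update(action, cap, reward=0.0)   # plug in the measured reward
```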

3.
As the performance gap between the processor and main memory keeps widening, long-latency memory accesses have become one of the main causes of reduced processor performance. Memory-level parallelism (MLP) mitigates the impact of long-latency accesses by executing multiple memory accesses in parallel. This paper reviews the background from which MLP emerged, introduces the concept of MLP and its relationship to processor performance models, analyzes the main factors limiting a processor's MLP, surveys in detail the various techniques for improving it, and ...

4.
Thanks to higher peak performance and energy efficiency than traditional CPUs, together with an increasingly mature software ecosystem, the graphics processing unit (GPU) has become one of the most popular accelerators for building heterogeneous parallel systems. Although the GPU hides memory latency by rapidly switching among lightweight threads, its extremely high concurrency still puts heavy pressure on the memory system, and its delivered performance is strongly affected by memory access efficiency. Consequently, analyzing and optimizing the memory behavior of GPU programs has long been a research hotspot, yet little work has examined, from an architectural perspective, how the design of the memory hierarchy affects performance. To better guide GPU memory hierarchy design and memory access optimization, this paper experimentally analyzes in detail how each level of the GPU memory hierarchy affects program performance and distills a set of guiding optimization strategies, offering suggestions for memory hierarchy design and program optimization on similar architectures.

5.
As the gap between memory access speed and processor computation speed grows ever more pronounced, memory performance has become the bottleneck for improving processor performance. Based on an analysis of program memory behavior, this paper proposes an adaptive stack cache scheme with fast address calculation. The scheme separates stack accesses from data cache accesses and exploits the characteristics of stack data accesses to increase instruction-level parallelism, reduce data cache pollution, and lower the data cache miss rate; a fast address calculation strategy shortens the hit time of stack accesses. The stack cache is adaptively disabled on stack overflow to avoid the performance impact of stack switching. A process identifier is added to the stack cache tags, so data need not be written back to lower levels of the memory hierarchy on a context switch, making the scheme suitable for multi-process environments. Results on SPEC CPU2000 show that with the adaptive stack cache and fast address calculation, 25.8% of memory instructions can execute in parallel, the data cache miss rate drops by 9.4% on average, and IPC improves by 6.9% on average.

6.
A prefetching policy that incorporates the state of the memory miss queue
As the gap between memory access speed and processor computation speed becomes increasingly significant, memory performance has become the bottleneck for improving computer system performance. Based on an analysis of instruction cache and data cache miss behavior, this paper proposes a prefetching policy that incorporates the state of the memory miss queue. The policy preserves the order of instruction and data accesses, which helps extract prefetch streams, and it separates instruction-stream from data-stream prefetching to avoid mutual replacement. When deciding when to issue a prefetch, it considers not only whether the bus is currently idle but also the state of the miss queue, reducing interference with the processor's normal memory requests. A stream-filtering mechanism improves prefetch accuracy and lowers the bandwidth demand of prefetching. Results show that with this policy, the processor's average memory latency is reduced by 30% and the IPC of SPEC CPU2000 programs improves by 8.3% on average.
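As a hedged illustration of the issue-timing rule described above (idle bus plus slack in the miss queue), the predicate below sketches the decision in Python. The slack threshold and parameter names are assumptions, not the paper's hardware interface.

```python
def should_issue_prefetch(bus_idle: bool, miss_queue_occupancy: int,
                          miss_queue_size: int, free_slack: int = 2) -> bool:
    """Issue a prefetch only when the bus is idle AND the miss queue keeps enough
    free entries, so prefetches do not delay the processor's demand misses."""
    free_entries = miss_queue_size - miss_queue_occupancy
    return bus_idle and free_entries > free_slack
```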

7.
As the speed gap between the processor and memory keeps widening, memory instructions, especially those that frequently miss in the cache, have become a major performance bottleneck. Because the compiler cannot know how many cycles a memory instruction will actually take at run time, it typically assumes either the cache-hit or the cache-miss latency, which is inaccurate. We introduce cache profiling to collect run-time information on which memory instructions hit or miss in the cache, and use this information to compute memory latencies. On out-of-order machines, hardware instruction scheduling handles instructions within the issue window well, whereas the compiler has the advantage over a much longer range. Once a cache miss occurs, the reorder buffer easily fills up and stalls the pipeline. Scheduling miss-prone instructions so that they execute in parallel hides their long latencies and improves program performance. We therefore, for load instructions, both adjust the latencies of frequently missing instructions and modify the scheduling policy to increase memory-level parallelism. Experiments show that our scheduling improves bzip2 by up to 4.8% and art by 4%, with an overall average improvement of 1.5%.

8.
Software pipelining is an important instruction scheduling technique that speeds up loops by executing instructions from different loop iterations simultaneously. As the gap between processor speed and memory speed keeps widening, memory instructions, especially those that miss in the cache, increasingly become the bottleneck to system performance. Since the latencies of these instructions are not fixed, predicting and hiding them within software pipelining is very important. Unlike previous approaches to latency prediction, we introduce cache profiling and use dynamically collected profile information to predict memory latencies and schedule accordingly. Simply increasing the assumed latencies of memory instructions in a modulo-scheduled loop also enlarges the initiation interval, so performance does not necessarily improve. The CSMS and FLMS algorithms change memory instruction latencies while trying not to enlarge the initiation interval. We improve CSMS and FLMS by adjusting memory latencies according to cache profiling information, which is more accurate than previous methods. Experiments show that the new method effectively improves program performance: about 1% on average over the SPEC2000 benchmarks, and up to 11% in individual cases.

9.
Limited off-chip memory bandwidth is one of the bottlenecks constraining stream processor performance. Stream memory systems already employ several mechanisms to alleviate this problem, but current designs do not fully consider how an application's specific memory access patterns affect effective bandwidth utilization. Through analysis and experiments, we evaluate how the main design parameters of the stream memory system affect different access patterns; on this basis we propose architectural improvements targeted at different degrees of stream access parallelism, adding support for wide issue and shortest-job-first scheduling, so as to fully exploit the locality and parallelism of memory accesses and improve load balance, thereby effectively raising off-chip bandwidth utilization and the overall performance of stream programs.

10.
To improve the efficiency of the unified-architecture shader in a mobile graphics processor and reduce the number of accesses to off-chip memory, a four-port texture cache structure is proposed. The structure uses mipmap-based texture mapping and selects among different single-port caches according to the level of detail (LOD), which raises the texture cache hit rate. In addition, to improve data throughput, texels are read in parallel through the four ports, and a FIFO buffer is designed to prefetch data and reduce memory latency. An experimental platform built with SV was used to test texture images; the results show an average texture cache hit rate of 92.5% and a data throughput close to four times that of a single-port cache.

11.
With the rapid growth in the number of mobile devices and the widespread use of compute-intensive applications such as face recognition, the Internet of Vehicles, and virtual reality, reasonable task scheduling schemes for compute-intensive applications are needed to optimally match tasks that satisfy users' QoS requirements with cooperating resources, thereby addressing problems in edge cloud centers such as long latency, high cost, load imbalance, and low resource utilization. This paper describes the task scheduling framework, execution process, application scenarios, and performance metrics of compute-intensive applications in edge computing environments. It compares and analyzes task scheduling strategies that optimize for time and cost, for energy consumption and resource utilization, and for load balance and throughput, and it summarizes the strengths, weaknesses, and applicable scenarios of these strategies. By analyzing an SDN-based edge computing architecture in a 5G environment, it presents a task scheduling strategy for compute-intensive packets in SDN-based edge computing, a deep-reinforcement-learning-based task scheduling strategy for compute-intensive applications, and a multi-objective cross-layer task scheduling strategy for 5G IoV networks. Finally, it summarizes the challenges currently facing task scheduling in edge computing from the perspectives of fault-tolerant scheduling, dynamic microservice scheduling, crowd-aware scheduling, and security and privacy.

12.
The data stream model, as a relatively new model, plays an important role in many applications, and query processing over data streams has been studied extensively. To improve query system performance, existing work largely falls into two categories: scheduling optimization and load shedding. Scheduling optimization improves query performance by changing the order in which tuples are processed; load shedding improves throughput by reducing the input rate when the load exceeds the system's processing capacity. However, little work has combined the two. Combining scheduling optimization for shared sliding-window queries with load shedding, this paper proposes two strategies for improving query throughput under bursty conditions: a uniform load shedding strategy and a small-window-accurate load shedding strategy. Both theoretical analysis and experimental results show that the two strategies significantly improve system performance.
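As a small illustration of the load-shedding side only, the sketch below keeps each incoming tuple with probability capacity/rate when the arrival rate exceeds the system's processing capacity, so load is shed evenly across the shared queries. The function name and parameters are assumptions for illustration, not the paper's strategy.

```python
import random

def shed_uniform(tuples, arrival_rate, capacity):
    """Uniform load shedding: drop tuples at random so the kept rate matches capacity."""
    keep_prob = min(1.0, capacity / max(arrival_rate, 1e-9))
    return [t for t in tuples if random.random() < keep_prob]
```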

13.
Efficient Execution of Multiple Queries on Deep Memory Hierarchy
This paper proposes a complementary novel idea, called MiniTasking, to further reduce the number of cache misses by improving data temporal locality for multiple concurrent queries. The idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing to improve data temporal locality by scheduling query execution at three levels: query-level batching, operator-level grouping, and mini-task-level scheduling. Experimental results with various types of concurrent TPC-H query workloads show that, with the traditional N-ary Storage Model (NSM) layout, MiniTasking significantly reduces L2 cache misses by up to 83% and thereby achieves a 24% reduction in execution time. With the Partition Attributes Across (PAX) layout, MiniTasking further reduces cache misses by 65% and execution time by 9%. For the TPC-H throughput test workload, MiniTasking improves end performance by up to 20%.
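A minimal Python sketch of the three scheduling levels follows. The query/operator dictionary shapes, the assumed table size, and the chunking constants are all illustrative assumptions; the sketch only shows how work that touches the same data can be grouped and chunked so it runs back-to-back while that data is still warm in the L2 cache.

```python
from collections import defaultdict

def batch_queries(queries, table):
    """Query-level batching: run together the concurrent queries that touch `table`."""
    return [q for q in queries if table in q["tables"]]

def group_operators(batch):
    """Operator-level grouping: group identical operators (same kind, same table)."""
    groups = defaultdict(list)
    for q in batch:
        for op in q["operators"]:
            groups[(op["kind"], op["table"])].append(op)
    return groups

def mini_task_schedule(groups, table_rows=100_000, chunk_rows=10_000):
    """Mini-task scheduling: run every operator of a group on one cache-sized chunk
    before moving to the next chunk, improving temporal locality."""
    plan = []
    for _, ops in sorted(groups.items()):
        for start in range(0, table_rows, chunk_rows):
            plan += [(start, op["query"]) for op in ops]
    return plan
```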

14.
Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence caused by branch instructions leads to underutilization of computational resources, degrading the performance of SIMD architectures. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, unlike graphics applications, general-purpose applications contain many branch instructions, resulting in serious GPU performance degradation due to branch divergence. In this paper, we propose the concurrent warp execution (CWE) technique to reduce this performance degradation when executing general-purpose applications by increasing resource utilization. CWE selects co-warps to activate more threads in the warp, leading to concurrent execution of the combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over PDOM, 91% over DWF) with little hardware overhead.
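The sketch below illustrates the co-warp selection idea in plain Python, not the paper's hardware design: when branch divergence leaves some SIMD lanes idle in one warp, a partner warp is chosen whose active mask fills the complementary lanes, so the combined warp keeps more lanes busy. Mask widths and the selection rule are assumptions.

```python
def lane_count(mask: int) -> int:
    """Number of active lanes in a bitmask."""
    return bin(mask).count("1")

def pick_co_warp(active_mask: int, candidates: dict) -> str | None:
    """Choose the candidate warp that adds the most non-overlapping active lanes."""
    best, best_gain = None, 0
    for warp_id, mask in candidates.items():
        gain = lane_count(mask & ~active_mask)   # lanes this candidate would fill
        if gain > best_gain:
            best, best_gain = warp_id, gain
    return best

# Example: warp A has lanes 0-15 active; warp B fills the complementary lanes 16-31.
warp_a = (1 << 16) - 1
candidates = {"B": ((1 << 32) - 1) ^ warp_a, "C": warp_a}
assert pick_co_warp(warp_a, candidates) == "B"
```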

15.
Due to its potential, the use of virtual machines in grid computing is attracting increasing attention. Most research focuses on how to create or destroy virtual execution environments for different kinds of applications, while policies for managing these environments are not widely discussed. This paper presents the design, implementation, and evaluation of ADVE, an adaptive and dependable virtual execution environment for grid computing that focuses on the policies for managing virtual machines in grid environments. To build dependable virtual execution environments for grid applications, ADVE provides a set of adaptive policies for managing virtual machines, such as when to create and destroy a new virtual execution environment and when to migrate applications from one virtual execution environment to a new one. We conduct experiments on a cluster to evaluate the performance of ADVE, and the results show that ADVE can improve the throughput and reliability of grid resources through adaptive management of virtual machines.

16.
Cloud computing is an information technology deployment model built on virtualization. Task scheduling defines the rules for allocating tasks to specific virtual machines in a cloud computing environment, and achieving optimal task scheduling performance remains a challenge. First, cloud scheduling performance is improved by proposing a Dynamic Weighted Round-Robin (DWRR) algorithm, which considers resource capabilities, task priorities, and task lengths. Second, a heuristic algorithm called Hybrid Particle Swarm Parallel Ant Colony Optimization (HPSPACO) is proposed to address the task execution delay problem in DWRR-based scheduling. Finally, a fuzzy logic system is designed for HPSPACO to further improve task scheduling in the cloud environment: a fuzzy method updates the inertia weight of the PSO and the pheromone trails of the PACO. The proposed Fuzzy Hybrid Particle Swarm Parallel Ant Colony Optimization thus achieves improved task scheduling by minimizing execution and waiting times while maximizing system throughput and resource utilization.
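Below is a minimal sketch of a Dynamic Weighted Round-Robin style assignment, under assumed weight formulas (VM weight from MIPS times cores, task ordering by priority then length); it is an illustration of the general technique, not the DWRR algorithm as specified in the paper.

```python
from itertools import cycle

def vm_weight(vm):
    # Assumption: a VM's weight grows with its processing capacity.
    return vm["mips"] * vm["cores"]

def task_key(task):
    # Higher-priority tasks first; among equals, shorter tasks first.
    return (-task["priority"], task["length"])

def dwrr_schedule(tasks, vms):
    """Deal tasks to VMs in proportion to each VM's weight."""
    tasks = sorted(tasks, key=task_key)
    total = sum(vm_weight(v) for v in vms)
    # Round-robin ring in which each VM appears in proportion to its weight.
    ring = [v["id"] for v in vms
            for _ in range(max(1, round(10 * vm_weight(v) / total)))]
    assignment = {v["id"]: [] for v in vms}
    for task, vm_id in zip(tasks, cycle(ring)):
        assignment[vm_id].append(task["id"])
    return assignment

# Hypothetical usage:
vms = [{"id": "vm1", "mips": 1000, "cores": 2}, {"id": "vm2", "mips": 500, "cores": 1}]
tasks = [{"id": i, "priority": i % 3, "length": 100 * (i + 1)} for i in range(6)]
print(dwrr_schedule(tasks, vms))
```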

17.
Scheduling large-scale applications in heterogeneous distributed computing systems is a fundamental NP-complete problem that is critical to achieving good performance and low execution cost. In this paper, we address the scheduling problem for an important class of large-scale Grid applications inspired by the real world, characterized by a huge number of homogeneous, concurrent, and computationally intensive tasks that are the main sources of performance, cost, and storage bottlenecks. We propose a new formulation of this problem based on a cooperative distributed game-theoretic method, applied using three algorithms with low time complexity to optimize three important metrics in scientific computing: execution time, economic cost, and storage requirements. We present comprehensive experiments using simulation and real-world applications that demonstrate the effectiveness of our approach in terms of time and fairness compared with other related algorithms.

18.
In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple distributed resources. These sites are heterogeneous in nature, and the performance of different tasks in a workflow varies from one site to another. Additionally, users typically have a limited resource allocation at each site, capped by administrative policies. In such cases, a judicious scheduling strategy is required to map tasks in the workflow to resources so that the workload is balanced among sites and the data-transfer overhead is minimized. Most existing systems either run the entire workflow at a single site, use naïve approaches to distribute tasks across sites, or leave it to the user to optimize the allocation of tasks to distributed resources, which results in a significant loss of productivity. We propose a multi-site workflow scheduling technique that uses performance models to predict the execution time on resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach on real-world applications using the Swift parallel and distributed execution framework, in two distinct computational environments: geographically distributed multiple clusters and multiple clouds. We show that our approach improves resource utilization and reduces execution time compared with the default schedule.

19.
Stream computing applications require minimum latency and high throughput to process real-time data efficiently. Typically, data-intensive applications, in which large datasets must be moved across execution nodes, have low-latency requirements. In this paper, a stream-based data processing model is adopted to develop an algorithm for optimally partitioning the input data such that the inter-partition data flow remains minimal. The proposed algorithm improves the execution of data-intensive workflows in heterogeneous computing environments by partitioning the workflow and mapping each partition onto the available heterogeneous resources that offer minimum execution time. Minimal data movement between partitions reduces latency, which can be reduced further by applying data parallelism techniques. In this paper, we apply a data parallelism technique to the bottleneck (most compute-intensive) task in each partition, which significantly reduces latency. We study the effectiveness and performance of the proposed approach using synthesized workflows and real-world applications such as Montage and CyberShake. Our evaluation shows that the proposed algorithm produces schedules with approximately 12% lower latency and nearly 17% higher throughput compared with existing state-of-the-art algorithms.
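As a hedged illustration of the partitioning step only, the sketch below greedily merges tasks across the heaviest data edges until the requested number of partitions remains, then marks the most compute-intensive task of each partition as the candidate for data-parallel splitting. The edge/cost dictionaries and the greedy merge heuristic are assumptions for illustration, not the paper's algorithm.

```python
def partition(edges, costs, num_parts):
    """edges: {(u, v): bytes_transferred}; costs: {task: compute_cost}.
    Greedily merge tasks across the heaviest edges so big transfers stay internal."""
    part = {t: i for i, t in enumerate(costs)}            # each task alone initially
    for (u, v), _ in sorted(edges.items(), key=lambda e: -e[1]):
        if len(set(part.values())) <= num_parts:
            break
        old, new = part[u], part[v]
        if old != new:                                    # merge the two partitions
            part = {t: (new if p == old else p) for t, p in part.items()}
    return part

def bottlenecks(part, costs):
    """Pick the most compute-intensive task per partition for data parallelism."""
    best = {}
    for task, p in part.items():
        if p not in best or costs[task] > costs[best[p]]:
            best[p] = task
    return best
```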

20.
In grid computing environments, several classes of multi-component applications exist. These applications may often require additional resources of different types that go beyond what is available at any of the sites making up the grid. The heterogeneous nature of both the user application and the computing environment makes this a challenging problem, and current off-the-shelf scheduling software can hardly cope with these diversities in distributed computing application frameworks. There is therefore a need for an adequate scheduling system that grants simultaneous or coordinated access to multi-component applications requiring resources of possibly multiple types, in multiple locations, managed by different resource providers. The main focus of this paper is to develop a mobile agent scheduling model that addresses this challenge. A scheduling policy covering job scheduling and resource allocation is proposed. The policy treats different multi-component applications requiring diverse heterogeneous resources fairly, and it is used by mobile agents to schedule user applications and to find available and suitable distributed resources capable of executing them in minimal time. Copyright © 2015 John Wiley & Sons, Ltd.
