1.

李春江, 杨学军 《Computer Engineering & Science (计算机工程与科学)》 2009, 31(8)

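Heterogeneous multi-core processors with a master-slave, one-sided heterogeneous architecture are widely used to accelerate computation in specialized application domains, for example as heterogeneous multi-core embedded processors, DSPs, and SoCs; high-performance processors of this kind can also handle some large-scale scientific and engineering computing problems. Master-slave one-sided heterogeneous processors pose many challenges for programming models and compilation techniques, including the choice of programming model, the design of the programming language, the compiler architecture, and the design of the runtime library. This paper analyzes the structural characteristics and execution model of this class of processors and argues that the function offload model is the programming model best suited to this architecture. It also analyzes the key issues in designing a programming language for the function offload model, proposes an architecture for the compilation system, and discusses the corresponding runtime library design issues.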

2.

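A Comprehensive Scheduling Algorithm for Asymmetric Multi-Core Processors

陈锐忠, 齐德昱, 林伟伟, 李剑 《Journal of Software (软件学报)》 2013, 24(2): 343-357

When scheduling tasks on asymmetric multi-core processors, existing operating system schedulers do not take the asymmetry into account. To address operating system scheduling on single-ISA asymmetric multi-core processors, the paper first builds a linear programming model, analyzes the relevant factors, and derives three scheduling principles: behavior matching, migration reduction, and load balancing. Based on these principles, it then proposes a comprehensive scheduling algorithm with two parts: 1) integrated workload characterization, which introduces the concept of integrated behavior to measure both the overall and the phase behavior of a task; 2) a scheduling algorithm based on integrated behavior, which exploits the characteristics of asymmetric multi-core processors, keeps the load balanced across cores, and avoids unnecessary task migration. A parameter adjustment mechanism makes the algorithm general, so it handles both the overall and the phase behavior of tasks across different settings. Experiments on a real platform show that the algorithm applies to a variety of environments and improves performance by 6% to 22% over other corresponding algorithms.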

3.

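Research on Deterministic Replay for Parallel Programs on Multi-Core Processors

高岚, 王锐, 钱德沛 《Journal of Software (软件学报)》 2013, 24(6): 1390-1402

Deterministic replay of parallel programs on multi-core processors is an effective means of debugging parallel programs and is important for parallel programming. However, because shared-memory accesses are not synchronized on multi-core architectures, research on deterministic replay still faces many challenges. This makes parallel programs hard to debug and seriously hinders the adoption and development of parallel programs on multi-core architectures. This paper analyzes the key factors that make deterministic replay difficult on multi-core processors, summarizes evaluation metrics for deterministic replay, and surveys recent academic research on deterministic replay of parallel programs. Using these metrics, it analyzes and compares current deterministic replay methods, both pure-software and hardware-assisted, and on that basis discusses future research trends and application prospects for deterministic replay on multi-core architectures.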

4.

System Architecture of Godson-3 Multi-Core Processors (total citations: 1, self-citations: 0, citations by others: 1)

高翔, 陈云霁, 王焕东, 唐丹, 胡伟武 《Journal of Computer Science and Technology (计算机科学技术学报)》 2010, 25(2): 181-191

Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications including x86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects, including system scalability, organization of the memory hierarchy, network-on-chip, inter-chip connection, and the I/O subsystem.

5.

Resources Snapshot Model for Concurrent Transactions in Multi-Core Processors

赵雷, 杨季文 《Journal of Computer Science and Technology (计算机科学技术学报)》 2013, 28(1): 106-118

Transaction parallelism in database systems is an attractive way of improving transaction performance. There exist two levels of transaction parallelism: the inter-transaction level and the intra-transaction level. With the advent of multicore processors, new hopes of improving transaction parallelism have appeared. Concurrent transactions execute most efficiently when the dependencies among them are lowest; however, those dependencies stand in the way of exploiting parallelism. In this paper, we present the Resource Snapshot Model (RSM) for resource modeling at both levels. We propose a non-restarting scheduling algorithm at the inter-transaction level and a processor assignment algorithm at the intra-transaction level for multi-core processors. Through these algorithms, the execution performance of transaction streams is improved in a parallel system with multiple heterogeneous processors that have different numbers of cores.

6.

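A Parallel Compilation Optimization Method for Sunway High-Performance Multi-Core Processors

周雍浩, 徐金龙, 李斌, 钱宏, 聂凯 《Computer Engineering (计算机工程)》 2022, 48(9): 130-138

On Sunway high-performance multi-core servers, the automatic parallelizing compilation system identifies and declares the parallelism in a program, but the OpenMP programs it generates are not fully optimized: they use a simple fork-join model and contain many nested parallel loops, which leads to low runtime efficiency. To improve the efficiency of these generated OpenMP programs, a parallel region reconstruction optimization technique is proposed. By merging parallel regions in the program and extending the scope of parallel regions in nested loops, the technique reduces the number of parallel regions in an OpenMP program, lowers control overheads such as the frequent creation and joining of thread teams, and converts OpenMP programs from the simple fork-join model into the more efficient single-program multiple-data (SPMD) model. Experimental results show that on SW1621, the new-generation Sunway high-performance multi-core server platform, parallel region reconstruction improves runtime efficiency by 10.77% on the NPB3.3-OMP benchmark suite and by 7.94% on the SPEC OMP2012 benchmark suite, effectively improving the execution efficiency of the OpenMP programs produced by the automatic parallelizing compilation system.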

7.

A Superscalar software architecture model for Multi-Core Processors (MCPs)

Gyu Sang Choi, Chita R. Das 《Journal of Systems and Software》 2010, 83(10): 1823-1837

Design of high-performance servers has become a research thrust to meet the increasing demand of network-based applications. One approach to designing such architectures is to exploit the enormous computing power of Multi-Core Processors (MCPs), which are envisioned to become the state of the art in processor architecture. In this paper, we propose a new software architecture model, called SuperScalar, suitable for MCP machines. The proposed SuperScalar model consists of multiple pipelined thread pools, where each pipelined thread pool consists of multiple threads, and each thread takes a different role. The main advantages of the proposed model are global information sharing by the threads and a minimal memory requirement due to fewer threads. We have conducted in-depth performance analyses of the proposed scheme along with three prior software architecture schemes (Multi-Process (MP), Multi-Thread (MT) and Event-Driven (ED)) via an analytical model. The performance results indicate that the proposed SuperScalar model shows the best performance across all system and workload parameters compared to the MP, MT and ED models. Although the MT model shows competitive performance with fewer processing cores and a smaller data cache, the advantage of the SuperScalar model becomes obvious as the number of processing cores increases.
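The notion of a pipelined thread pool, in which each thread takes a different role, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three roles (`parse`, `process`, `respond`), the `run_pipeline` helper, and the queue-based hand-off between stages are all assumptions chosen for illustration.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def _stage(func, inbox, outbox):
    """Run one pipeline role: take an item, transform it, pass it on."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(requests, stage_funcs):
    """Push requests through one pipelined pool, one thread per role."""
    queues = [queue.Queue() for _ in range(len(stage_funcs) + 1)]
    threads = [
        threading.Thread(target=_stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(stage_funcs)
    ]
    for t in threads:
        t.start()
    for r in requests:
        queues[0].put(r)
    queues[0].put(_STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical roles for a tiny request handler.
def parse(raw):
    return raw.strip().lower()


def process(name):
    return len(name)


def respond(n):
    return f"length={n}"


print(run_pipeline([" GET ", "Post"], [parse, process, respond]))
# -> ['length=3', 'length=4']
```

Because each stage is a single thread reading from a FIFO queue, results come out in request order. A server in the spirit of the abstract would run several such pools side by side.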

8.

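A Parallelism-Aware Scheduling Algorithm for Dependent Tasks on Multi-Core Processors

梁秋玲, 张向利, 张红梅, 闫坤 《Computer Engineering (计算机工程)》 2021, 47(7): 212-217

The communication delay incurred when dependent tasks are scheduled in parallel on a multi-core processor adversely affects the schedule length and processor utilization. To improve how multi-core systems handle dependent tasks, a parallelism-aware scheduling algorithm is proposed based on the scheduling characteristics of such tasks on multi-core processors. The algorithm computes the longest path value from each task to the exit node, orders tasks for scheduling in descending order of that value, takes both the degree of association and the earliest executable time of a task into account when assigning processor cores, and sets the best matching...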
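The ordering step described above (longest path to the exit node, then descending order) can be sketched as follows. This is a minimal illustration under stated assumptions: the `priority_order` name, the dictionary-based DAG encoding, and the decision to ignore communication delay are all hypothetical. The core-assignment step is not covered because the abstract is truncated.

```python
from functools import lru_cache


def priority_order(succ, cost):
    """Order tasks by descending longest-path length to an exit task.

    succ: dict mapping each task to a list of successor tasks (a DAG).
    cost: dict mapping each task to its execution time.
    Communication delay is ignored here (an assumption of this sketch).
    """

    @lru_cache(maxsize=None)
    def longest(task):
        # Own cost plus the longest remaining path through any successor.
        return cost[task] + max((longest(s) for s in succ[task]), default=0)

    # Ties are broken by task name to keep the order deterministic.
    return sorted(succ, key=lambda t: (-longest(t), t))


# Hypothetical four-task graph: A feeds B and C, which both feed D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
cost = {"A": 2, "B": 3, "C": 1, "D": 2}
print(priority_order(succ, cost))
# -> ['A', 'B', 'C', 'D']
```

Here the longest-path values are A=7, B=5, C=3, D=2, so tasks on the critical path are scheduled first.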

9.

Performance Behaviour Analysis of the Present 3-Level Cache System for Multi-Core Processors

Muhammad Ali Ismail 《Computer Technology and Application (计算机技术与应用:英文)》 2012, (11): 729-733

10.

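"Intel Cup" National Computer Multi-Core Programming Contest

《Computer Education (计算机教育)》 2007, (9)

1. Purpose: to popularize multi-core computing technology, to help university students and software engineers in China improve their ability to design and optimize parallel programs on multi-core platforms, and to prepare for the arrival of the multi-core era.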

11.

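Performance Analysis of Main-Memory Database Indexes on Multi-Core Processors (total citations: 2, self-citations: 0, citations by others: 2)

郭超, 李坤, 王永炎, 刘胜航, 王宏安 《Chinese Journal of Computers (计算机学报)》 2010, 33(8)

From the T-tree, introduced when main-memory databases appeared in the 1980s, to cache-conscious structures such as the CSS-tree and CSB+-tree that appeared in the early 2000s, index structures have followed the hardware trends of their time and offered certain performance advantages. As computer hardware has developed further, and especially as multi-core technology has spread, new multi-core processors improve index performance but also pose new challenges for in-memory index structures. This paper experimentally evaluates several classic in-memory index structures (B+-tree, T-tree, CSS-tree, and CSB+-tree) on multi-core processors, compares and analyzes the composition of and differences in their performance under different data inputs, node sizes, and other conditions, and summarizes the key factors that affect index performance on multi-core processors, laying a foundation for further improvement of in-memory index structures.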

12.

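Analyzing Inter-Core Performance Interference Based on Statistical Learning

赵家程, 崔慧敏, 冯晓兵 《Journal of Software (软件学报)》 2013, 24(11): 2558-2570

It is widely believed that cloud computing and multi-core processors will dominate the future of computing. However, the utilization of computing resources in today's cloud data centers is very low, mainly because of severe and unpredictable performance interference on multi-core processors. To guarantee the QoS of critical applications, these applications have to be barred from co-running with other programs, which leads to over-provisioning of resources. Analyzing inter-core performance interference therefore becomes a key problem for improving data center utilization. The authors observe that the inter-core performance interference a program suffers can be expressed as a piecewise linear function of the aggregate pressure on the memory subsystem, independent of which applications generate that pressure. Based on this observation, they propose a statistical-learning method for analyzing inter-core performance interference that uses principal component linear regression to obtain the interference model and can accurately and quantitatively predict the performance degradation of any program caused by contention for memory subsystem resources. Experimental results show an average prediction error of only 1.1%.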
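The key observation, that slowdown is a piecewise linear function of aggregate memory-subsystem pressure regardless of which co-runners produce it, can be illustrated with a small Python sketch. Everything concrete here is an assumption: the `predicted_slowdown` name, the knot values, and the idea that per-program pressures simply add up are made up for illustration, and the principal component regression used to fit the model is not shown.

```python
from bisect import bisect_right


def predicted_slowdown(pressure, breakpoints, slowdowns):
    """Evaluate a piecewise-linear model by interpolating between knots.

    breakpoints: strictly increasing aggregate-pressure values (knots).
    slowdowns:   slowdown at each knot, same length as breakpoints.
    Values outside the knot range are clamped to the end points.
    """
    if pressure <= breakpoints[0]:
        return slowdowns[0]
    if pressure >= breakpoints[-1]:
        return slowdowns[-1]
    i = bisect_right(breakpoints, pressure) - 1
    x0, x1 = breakpoints[i], breakpoints[i + 1]
    y0, y1 = slowdowns[i], slowdowns[i + 1]
    return y0 + (y1 - y0) * (pressure - x0) / (x1 - x0)


# Hypothetical model for one program; the numbers are invented.
knots = [0.0, 10.0, 20.0]
loss = [0.0, 0.05, 0.30]

# Only the total pressure matters, not which co-runners produce it.
co_runners_a = [4.0, 11.0]  # two co-running programs
co_runners_b = [15.0]       # one co-running program, same total
print(predicted_slowdown(sum(co_runners_a), knots, loss))
print(predicted_slowdown(sum(co_runners_b), knots, loss))
```

Both calls see the same total pressure and therefore return the same prediction, which is the property the abstract highlights.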

13.

Optimizing Parallel Sn Sweeps on Unstructured Grids for Multi-Core Clusters

闫洁, 谭光明, 孙凝晖 《Journal of Computer Science and Technology (计算机科学技术学报)》 2013, 28(4): 657-670

In particle transport simulations, radiation effects are often described by the discrete ordinates (Sn) form of the Boltzmann equation. In each ordinate direction, the solution is computed by sweeping the radiation flux across the grid. A parallel Sn sweep on an unstructured grid can be explicitly modeled as a topological traversal through an equivalent directed acyclic graph (DAG), which is a data-driven algorithm. Its traditional design using the MPI model results in irregular communication of massive numbers of short messages, which cannot be efficiently handled by the MPI runtime. Meanwhile, in high-end HPC cluster systems, multicore has become the standard processor configuration of a single node. The traditional data-driven algorithm for Sn sweeps has not exploited the potential advantages of multi-threading on shared-memory multicore nodes. These advantages, however, as we shall demonstrate, could provide an elegant solution to the problems of the previous MPI-only design. In this paper, we give a new design of data-driven parallel Sn sweeps using hybrid MPI and Pthread programming, namely Sweep-H, to exploit hierarchical parallelism of processes and threads. With special multi-threading techniques and a vertex scheduling policy, Sweep-H achieves more efficient communication and better load balance. We further present an analytical performance model for Sweep-H to reveal why and when it is advantageous over its MPI-only predecessor. On a 64-node multicore cluster system with 12 cores per node, 768 cores in total, Sweep-H achieves nearly linear scalability for moderate problem sizes, and better absolute performance than the previous MPI algorithm on more than 16 nodes (with up to a twofold speedup on 64 nodes).
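The view of a sweep as a data-driven topological traversal of a DAG can be sketched in a few lines of serial Python. This is not Sweep-H: the `sweep_order` name and the dictionary encoding of the dependency graph are assumptions, and the sketch omits MPI, threads, and the paper's vertex scheduling policy. A cell is processed only once all of its upstream neighbors for the given direction have been processed.

```python
from collections import deque


def sweep_order(downstream):
    """Return one valid sweep order for a DAG of grid cells.

    downstream: dict mapping every cell to the list of cells that
                depend on it for one ordinate direction.
    """
    # Count how many upstream results each cell is still waiting for.
    pending = {cell: 0 for cell in downstream}
    for targets in downstream.values():
        for t in targets:
            pending[t] += 1

    ready = deque(c for c, n in pending.items() if n == 0)
    order = []
    while ready:
        cell = ready.popleft()
        order.append(cell)       # the cell would be solved here
        for t in downstream[cell]:
            pending[t] -= 1      # one upstream flux has arrived
            if pending[t] == 0:
                ready.append(t)  # all inputs present: cell is ready
    if len(order) != len(downstream):
        raise ValueError("dependency graph has a cycle")
    return order


# Hypothetical four-cell grid: cell 0 feeds 1 and 2, which both feed 3.
print(sweep_order({0: [1, 2], 1: [3], 2: [3], 3: []}))
# -> [0, 1, 2, 3]
```

In a parallel setting, every cell in the `ready` queue could be handed to a different worker, which is where the irregular, fine-grained communication mentioned in the abstract comes from.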

14.

An Efficient Algorithm for Solving Fuzzy Linear Programming Problems

M. H. Noori Skandari, M. Ghaznavi 《Neural Processing Letters》 2018, 48(3): 1563-1582

In this article, we consider some well-known approaches for solving fuzzy linear programming (FLP) problems and present some of their difficulties. Crisp linear programming problems are then suggested for solving FLP problems, and a new algorithm is given. The proposed approach has advantages over the other methods. For example, it can achieve higher membership degrees for the objective function and constraints. Moreover, we show that the fuzzy optimal solutions obtained by the proposed approach are efficient. Unlike previous methods, our method finds efficient solutions by solving only one crisp linear problem instead of two or three. Finally, some numerical examples are presented to show the efficiency of the given approach over the other approaches.

15.

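Application of LBM in a Multi-Core Parallel Programming Model (total citations: 1, self-citations: 0, citations by others: 1)

李彬彬, 李青 《Computer Technology and Development (计算机技术与发展)》 2011, 21(7)

The LBGK (Lattice Bhatnagar-Gross-Krook) model is not only a new breakthrough in the theory and application of the LBM (Lattice Boltzmann Method) but also a novel numerical method well suited to large-scale parallel computing. The multi-thread parallel programming interface library (Multi-Thread Interface, MTI) makes full use of multi-core processor resources to improve computational performance and provides an interface for conveniently developing efficient parallel programs in multi-core environments, greatly reducing the burden on developers. MTI supports data-parallel computation of a single task by partitioning the data set with a cache-blocking technique, and parallel processing of multiple tasks with a task-stealing scheduling policy. Using MTI, a parallel computation that simulates pattern formation with the LBGK model was implemented and achieved high parallel efficiency.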

16.

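An Online Program Judging System for Teaching Assistance on Multi-Core Platforms (total citations: 1, self-citations: 1, citations by others: 0)

李旻朔, 林巧 《Computer Systems & Applications (计算机系统应用)》 2011, 20(6): 129-132

Using LAMP as the development environment, an online judging system for programming courses was designed and developed to assist teaching. The paper focuses on the design and implementation of a multi-threaded or multi-process online judging system on a multi-core platform, which solves the low efficiency of single-threaded or single-process judging on single-core systems. Comparison with a single-core serial judging system shows that the multi-core system judges significantly faster while producing the same results as serial judging, with high accuracy.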

17.

A Parallel Dynamic Binary Translator for Efficient Multi-Core Simulation

Oscar Almer, Igor Böhm, Tobias Edler von Koch, Björn Franke, Stephen Kyle, Volker Seeker, Christopher Thompson, Nigel Topham 《International Journal of Parallel Programming》 2013, 41(2): 212-235

In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators, where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry standard SPLASH-2 and EEMBC MultiBench benchmarks and demonstrate simulation speeds up to 25,307 MIPS on a 32-core x86 host machine for as many as 2,048 target processors whilst exhibiting minimal and near constant overhead, including memory considerations.

18.

Energy Efficient Block-Partitioned Multicore Processors for Parallel Applications

Xuan Qi, Da-Kai Zhu 《Journal of Computer Science and Technology (计算机科学技术学报)》 2011, 26(3): 418-433

Due to the increasing power consumption of modern computing systems, energy management has become an important research area in the last decade. Recently, multicore has emerged as an energy-efficient architecture that exploits the parallelism in modern applications. However, as the number of cores on a single chip continues to increase, effectively managing the energy efficiency of multicore-based systems has become a grand challenge. In this paper, based on the voltage island and dynamic voltage and frequency scaling (DVFS) techniques, we investigate the energy efficiency of block-partitioned multicore processors, where cores are grouped into blocks and the cores in one block share a DVFS-enabled power supply. Depending on the number of cores in each block, we study both symmetric and asymmetric block configurations. We develop a system-level power model (which can support various power management techniques) and derive both block- and system-wide energy-efficient frequencies for systems with block-partitioned multicore processors. Based on the power model, we prove that, for embarrassingly parallel applications, having all cores in a single block can achieve the same energy savings as the individual block configuration (where each core forms a single block and has its own power supply). However, for applications with limited degrees of parallelism, we show the superiority of the buddy-asymmetric block configuration, where the number of required blocks (and power supplies) is logarithmically related to the number of cores on the chip, in that it can achieve the same energy savings as the individual block configuration. The energy efficiency of different block configurations is further evaluated through extensive simulations with both synthetic applications and a real-life application.

19.

Computing Optimised Parallel Speeded-Up Robust Features (P-SURF) on Multi-Core Processors

Nan Zhang 《International Journal of Parallel Programming》 2010, 38(2): 138-158

20.

Smart Containers and Skeleton Programming for GPU-Based Systems

Usman Dastgeer, Christoph Kessler 《International Journal of Parallel Programming》 2016, 44(3): 506-530

In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution for skeleton calls while providing implicit synchronization capabilities in a data consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.