Similar Documents
20 similar documents found (search time: 93 ms)
1.
With the development of multi-core processors, exploiting thread-level parallelism has become essential for improving application performance, and transactional memory, as a parallel programming mechanism with good prospects for supporting thread-level parallelism, has attracted growing attention. This paper first classifies transactional memory systems from the perspectives of their conflict detection and data version management mechanisms; it then surveys the implementation approaches of the major transactional memory systems; finally, it re-examines transactional memory from the perspective of fault tolerance. We argue that transactional memory has inherently good fault-tolerance properties and can be combined naturally with several mainstream fault-tolerance techniques to achieve efficient fault isolation, detection, and recovery.
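The two classification axes this survey uses, conflict detection and data version management, can each be eager or lazy. A toy software sketch of the lazy/lazy corner (writes buffered until commit, read set validated at commit time) is shown below; the class and method names are illustrative and not taken from any surveyed system.

```python
# Toy software-transactional-memory sketch: lazy versioning (writes are
# buffered in the transaction) and lazy conflict detection (the read set
# is validated only at commit time). Illustrative names only.

class Transaction:
    def __init__(self, store):
        self.store = store      # shared dict: name -> (value, version)
        self.read_set = {}      # name -> version observed at first read
        self.write_buf = {}     # buffered (lazy) writes

    def read(self, name):
        if name in self.write_buf:          # read-your-own-writes
            return self.write_buf[name]
        value, version = self.store[name]
        self.read_set.setdefault(name, version)
        return value

    def write(self, name, value):
        # Lazy versioning: nothing reaches the shared store until commit.
        self.write_buf[name] = value

    def commit(self):
        # Lazy conflict detection: abort if any read value was overwritten.
        for name, seen_version in self.read_set.items():
            if self.store[name][1] != seen_version:
                return False
        for name, value in self.write_buf.items():
            _, version = self.store.get(name, (None, 0))
            self.store[name] = (value, version + 1)
        return True
```

Two transactions incrementing the same location illustrate the mechanism: whichever commits second fails validation and would be re-executed.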

2.
In this paper, we study several issues related to the medium grain dataflow model of execution. We present bottom-up compilation of medium grain clusters from a fine grain dataflow graph. We compare the basic block and the dependence sets algorithms that partition dataflow graphs into clusters. For an extensive set of benchmarks we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain dataflow execution. We study the performance of medium grain dataflow when several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium grain execution offers a good speedup over the fine grain model, that it is scalable, and that it tolerates network latency and high matching costs well. Medium grain execution can benefit from a higher output bandwidth of a processor. Finally, a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster. This work is supported in part by NSF Grants CCR-9010240 and MIP-9113268.
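The clustering idea can be sketched in simplified form. The code below is not the paper's basic block or dependence sets algorithm; it is a hedged stand-in that merges nodes along unique producer/consumer chains of a dataflow graph, one crude way to form medium grain clusters and reduce the number of token-matching points.

```python
# Simplified clustering sketch (a stand-in, not the paper's algorithms):
# nodes linked by a single-producer/single-consumer edge are fused into
# one cluster, so the intermediate token needs no matching operation.

def cluster(edges, nodes):
    """edges: list of (producer, consumer) pairs. Returns list of node sets."""
    consumers, producers = {}, {}
    for p, c in edges:
        consumers.setdefault(p, []).append(c)
        producers.setdefault(c, []).append(p)
    cluster_of = {}
    clusters = []
    for n in nodes:
        if n in cluster_of:
            continue
        cur = {n}
        m = n
        # Extend the cluster while the edge is a unique producer/consumer link.
        while len(consumers.get(m, [])) == 1:
            nxt = consumers[m][0]
            if len(producers.get(nxt, [])) != 1 or nxt in cluster_of:
                break
            cur.add(nxt)
            m = nxt
        for x in cur:
            cluster_of[x] = len(clusters)
        clusters.append(cur)
    return clusters
```

On a chain a→b→c feeding a two-input node e (also fed by d), the chain collapses into one cluster while e, which needs real matching, stays separate.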

3.
As chip multiprocessors with simultaneous multithreaded cores are becoming commonplace, there is a need for simple approaches to exploit thread-level parallelism. In this paper, we consider thread-level speculation as a means to reap thread-level parallelism out of application binaries. We first investigate the tradeoffs between scheduling speculative threads on the same core and on different cores. While threads contend for the same resources using the former approach, the latter approach is plagued by the overhead for inter-core communication. Despite the impact of resource contention, our detailed simulations show that the first approach provides the best performance due to lower inter-thread communication cost. The key contribution of the paper is the proposed design and evaluation of the dual-thread speculation system. This design point has very low complexity and reaps most of the gains of a system. The work was carried out while Fredrik Warg was a graduate student at Chalmers University of Technology.

4.
The paper presents a dataflow execution model, DIALOG, for logic programs which operates on an intermediate virtual machine. The virtual machine is granulated at clause argument level to exploit argument parallelism through unification. The model utilises a new variable binding scheme that eliminates dereference operations for accessing variables, and therefore supports OR-parallelism in the highly distributed dataflow environment. The model has been implemented in Occam. A conventional dataflow architecture in support of the model has been simulated as a testbed for the evaluation. The simulation indicates some encouraging results and suggests future improvements.

5.
To address the problem of how to balance strategies in speculative parallelization and determine a suitable execution model to obtain ideal performance, this paper proposes a speculative parallelization method based on interaction information. The method uses interaction information to determine the execution model for speculative parallelization and builds a corresponding evaluation model; focusing on thread extraction and creation, it presents the corresponding strategies and their performance evaluation. Experiments show that "on-demand" parallelization based on interaction information can meet the required performance goals.

6.
This paper discusses the relationship between parallelism granularity and system overhead of dataflow computer systems, and indicates that a trade-off between them should be determined to obtain optimal efficiency of the overall system. On the basis of this discussion, a macro-dataflow computational model is established to exploit task-level parallelism. Working as a macro-dataflow computer, an Experimental Distributed Dataflow Simulation System (EDDSS) is developed to examine the effectiveness of the macro-dataflow computational model.

7.
The concept of dataflow is usually limited to totally new computer structures. This paper discusses the simulation of dataflow in a symmetric system of standard microcomputer modules. The approach has advantages in that the parallelism of the computation can be expressed and handled in a natural way. The simulation of the dataflow semantics takes, however, hundreds of machine instructions for each actor execution, and quite long actors are needed to keep the overhead low. The implementation of a dataflow computation using the adaptive trapezoid method for numerical integration is discussed. In this computation the total estimated overhead is less than 10 per cent.
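The adaptive trapezoid method mentioned above splits an interval whenever the two-panel estimate disagrees with the one-panel estimate; the two halves are data-independent, which is what makes the method a natural dataflow computation with long-running actors. A minimal sequential sketch follows (the tolerance-splitting rule is one common variant, not necessarily the paper's exact scheme).

```python
# Adaptive trapezoid integration, sequential sketch. Each interval plays
# the role of a dataflow actor; when the refined estimate disagrees with
# the coarse one, the interval splits into two independent sub-actors.

def adaptive_trapezoid(f, a, b, tol=1e-8):
    fa, fb = f(a), f(b)
    whole = (b - a) * (fa + fb) / 2.0
    return _refine(f, a, b, fa, fb, whole, tol)

def _refine(f, a, b, fa, fb, whole, tol):
    m = (a + b) / 2.0
    fm = f(m)
    left = (m - a) * (fa + fm) / 2.0
    right = (b - m) * (fm + fb) / 2.0
    if abs(left + right - whole) <= 3 * tol:
        return left + right
    # The two halves share no data: in a dataflow system each recursive
    # call would fire as a separate actor, in parallel.
    return (_refine(f, a, m, fa, fm, left, tol / 2) +
            _refine(f, m, b, fm, fb, right, tol / 2))
```

For a smooth integrand such as x² on [0, 1] the recursion converges quickly to the exact value 1/3.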

8.
In the upcoming era of exascale supercomputing, power efficiency has become the most important obstacle to building an exascale system. Dataflow architectures have a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes iterations in loops flow through the processing element (PE) array of a dataflow accelerator. This method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indices, reducing the complexity of the computation kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. Via these two techniques, the average number of instructions ready to execute per cycle is increased, keeping the floating-point units busy. Simulation results demonstrate that our proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of these two techniques is acceptable.

9.
Chen, M.K.; Olukotun, K. IEEE Micro, 2003, 23(6):26-35
As instruction-level parallelism with a single thread of control approaches its performance limits, designers must find other architectural improvements to speed up program execution. The Java runtime parallelizing machine (JRPM) system takes advantage of recent developments to enable a new approach to automatic parallelization. JRPM can exploit thread-level parallelism with minimal programmer effort.

10.
Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research of the past decade has revealed several fundamental problems. Synchronization at the instruction level, token matching, coloring, and re-labeling operations have a negative impact on performance by significantly increasing the number of non-compute "overhead" cycles. Recently, many novel hybrid von-Neumann data driven machines have been proposed to alleviate some of these problems. The major objective has been to reduce or eliminate unnecessary synchronization costs through simplified operand matching schemes and increased task granularity. Moreover, the results from recent studies quantifying locality suggest sufficient spatial and temporal locality is present in dataflow execution to merit its exploitation. In this paper we present a data structure for exploiting locality in a data driven environment: the vector cell. A vector cell consists of a number of fixed length chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness to data structure access. We describe the semantics of the model, processor architecture and instruction set as well as a Sisal to dataflow vectorizing compiler back-end. The vector cell model is evaluated by comparing its performance to those of both a classical fine-grain dataflow processor employing I-structures and a conventional pipelined vector processor. Results indicate that the model is surprisingly resilient to long memory and communication latencies and is able to dynamically exploit the underlying parallelism across multiple processing elements at run time.
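The vector cell's presence-bit semantics, strict within a chunk and non-strict across chunks, can be sketched as follows. The chunk size and method names here are illustrative choices, not taken from the paper.

```python
# Vector-cell sketch: a vector is split into fixed-length chunks, each
# guarded by one presence bit. A read within a chunk must wait until the
# whole chunk is written (intra-chunk strictness), but other chunks may
# be read as soon as they are complete (inter-chunk non-strictness).

CHUNK = 4  # illustrative chunk length

class VectorCell:
    def __init__(self, length):
        self.data = [None] * length
        nchunks = (length + CHUNK - 1) // CHUNK
        self.present = [False] * nchunks   # one presence bit per chunk
        self.filled = [0] * nchunks        # writes seen per chunk

    def write(self, i, value):
        self.data[i] = value
        c = i // CHUNK
        self.filled[c] += 1
        limit = min(CHUNK, len(self.data) - c * CHUNK)
        if self.filled[c] == limit:
            self.present[c] = True         # chunk becomes readable

    def try_read(self, i):
        c = i // CHUNK
        if not self.present[c]:
            return None                    # chunk not yet complete
        return self.data[i]
```

A consumer polling `try_read` blocks (here: gets `None`) only until its own chunk fills, independently of the other chunks' progress.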

11.
We propose a new framework design for exploiting multi-core architectures in the context of visualization dataflow systems. Recent hardware advancements have greatly increased the levels of parallelism available, with all indications showing this trend will continue in the future. Existing visualization dataflow systems have attempted to take advantage of these new resources, though they still have a number of limitations when deployed on shared memory multi-core architectures. Ideally, visualization systems should be built on top of a parallel dataflow scheme that can optimally utilize CPUs and assign resources adaptively to pipeline elements. We propose the design of a flexible dataflow architecture aimed at addressing many of the shortcomings of existing systems, including a unified execution model for both demand-driven and event-driven models; a resource scheduler that can automatically make decisions on how to allocate computing resources; and support for more general streaming data structures which include unstructured elements. We have implemented our system on top of VTK with backward compatibility. In this paper, we provide evidence of performance improvements on a number of applications.

12.
The parallel processing system Harray for scientific computations is introduced. The special features of the Harray system described are (1) the Controlled Dataflow (CD flow) mechanism, (2) the preceding activation scheme with graph unfolding, and (3) the visual environment for dataflow program development. The CD flow mechanism, which controls the sequence of execution at two levels (dataflow execution within each processor and control flow execution between processors), is adopted in the Harray system. Though dataflow computers are expected to extract parallelism fully from a program, they have many problems, such as the difficulty of controlling the sequence of execution; the CD flow mechanism is adopted to solve these problems. The preceding activation scheme makes it possible to bypass control dependencies in a program, such as IF-GOTO statements, which decrease the parallelism in a program. The flow graph of a program is unfolded to decrease the control dependency and to increase the parallelism. The visual environment helps programmers in writing and debugging dataflow programs; it consists of a graphical editor for dataflow graphs and a debugger. These special features of the Harray system and its execution mechanism are described.

13.
As technology improves and transistor feature sizes continue to shrink, the effects of on-chip interconnect wire latencies on processor clock speeds will become more important. In addition, as we reach the limits of instruction-level parallelism that can be extracted from application programs, there will be an increased emphasis on thread-level parallelism. To continue to improve performance, computer architects will need to focus on architectures that can efficiently support thread-level parallelism while minimizing the length of on-chip interconnect wires. The SCMP (Single-Chip Message-Passing) parallel computer system is one such architecture. The SCMP system includes up to 64 processors on a single chip, connected in a 2-D mesh with nearest neighbor connections. Memory is included on-chip with the processors and the architecture includes hardware support for communication and the execution of parallel threads. Since there are no global signals or shared resources between the processors, the length of the interconnect wires will be determined by the size of the individual processors, not the size of the entire chip. Avoiding long interconnect wires will allow the use of very high clock frequencies, which, when coupled with the use of multiple processors, will offer tremendous computational power.

14.
As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of available parallelism in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores make it possible to achieve this performance by parallelizing and offloading computation-intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model. We also give an example of scheduling code to be memory-latency tolerant using software pipelining techniques on the SPE. This paper is based in part on "Chip multiprocessing and the Cell Broadband Engine", ACM Computing Frontiers 2006.

15.
Chip multiprocessors with thread-level speculation have become the subject of intense research. This work refutes the claim that such a design is necessarily too energy inefficient. In addition, it proposes out-of-order task spawning to exploit more sources of speculative task-level parallelism.

16.
With the wide adoption of multi-core chips, exploiting thread-level parallelism has become crucial. Transactions let programmers express parallelism through a very simple multithreaded programming model, and transactional memory (TM) provides a straightforward way to guarantee the atomicity and isolation of transaction execution. This paper introduces the current mainstream transactional memory systems TCC, LogTM, and PTM, analyzes their system architectures and the corresponding operating system support, and on this basis reveals the relationship between the hardware design of transactional memory systems and operating system support, finally summarizing some basic trends and characteristics of TM development.

17.
Wang Yizhuo; Chen Xu; Ji Weixing; Su Yan; Wang Xiaojun; Shi Feng. Journal of Software (软件学报), 2016, 27(7):1789-1804
Task-parallel programming models have become mainstream in parallel programming, improving the system performance of parallel computers by exploiting task-level parallelism. This paper proposes a fault-tolerant task-parallel programming model that integrates fault-tolerance techniques into the task-parallel model, improving system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and provides fault-tolerance support at the application level. A Buffer-Commit computation model supports the detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors such as node failures; and a fault-tolerant work-stealing task scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance for hardware errors at low performance overhead.
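The work-stealing scheduling strategy this model builds on has a standard shape: each worker pushes and pops tasks at the bottom of its own deque, and an idle worker steals from the top of a victim's deque. A single-threaded simulation of that shape is sketched below; the fault-tolerance hooks the paper adds are omitted, and all names are illustrative.

```python
# Single-threaded simulation of work stealing: owners work LIFO at the
# bottom of their own deques; idle workers steal FIFO from the top of a
# randomly chosen non-empty victim. Fault-tolerance hooks omitted.

from collections import deque
import random

def run(workers_tasks, seed=0):
    rng = random.Random(seed)
    deques = [deque(ts) for ts in workers_tasks]
    done = []
    while any(deques):
        for dq in deques:
            if dq:
                done.append(dq.pop())              # owner: bottom (LIFO)
            else:
                victims = [v for v in deques if v]
                if victims:
                    done.append(rng.choice(victims).popleft())  # thief: top
    return done
```

Stealing from the top tends to move the oldest, typically largest, subtrees of work, which keeps steal frequency (and hence inter-worker communication) low.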

18.
This paper presents a system for parallel execution of Prolog supporting both independent conjunctive and disjunctive parallelism. The system is intended for distributed memory architectures and is composed of a set of workers with a hierarchically structured scheduler. The execution model has been designed in such a way that each worker's environment does not contain references to terms in other environments, thus reducing communication overhead. In order to guarantee that exploiting parallelism actually improves performance, granularity control has been introduced for each kind of parallelism. For conjunctive parallelism, PDP applies a control based on the estimation provided by CASLOG; the features of the system make it possible to introduce this control without adding overhead. For disjunctive parallelism, PDP controls granularity by applying a heuristic-based method, which can be adapted to other parallel Prolog systems. Different scheduling policies have also been tested. The system has been implemented on a transputer network, and performance results show that it provides high speedup for coarse grain parallel programs.

19.
While the dataflow execution model can potentially uncover all forms and levels of parallelism in a program, in its traditional fine grain form it does not exploit any form of locality. Recent evidence indicates that the exploitation of locality in dataflow programs could have a dramatic impact on performance. The current trend in the design of dataflow processors suggests a synthesis of traditional nonstrict fine grain instruction execution and strict coarse grain execution in order to exploit locality. While an increase in instruction granularity favors the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. We define fine grain intrathread locality as a dynamic measure of instruction level locality and quantify it using a set of numeric and nonnumeric benchmarks. The results point to a very large degree of intrathread locality and a remarkable uniformity and consistency of the distribution of thread locality across a wide variety of benchmarks. As the execution is moved to a coarser granularity, it can result in an increase of the input latency of operands that would have a detrimental effect on performance. We evaluate the resulting latency incurred through the partitioning of fine grain instructions into coarser grain threads. We define the concept of a cluster of fine grain instructions to quantify coarse grain input and output latencies. The results of our experiments offer compelling evidence that a coarse grain execution outperforms a fine grain one on a significant number of numeric codes. These results suggest that the effect of increased instruction granularity on latency is minimal for a high percentage of the measured codes, and in large part is offset by available intrathread locality. Furthermore, simulation results indicate that strict or nonstrict data structure access does not change the basic cluster characteristics.

20.
Lee, B.; Hurson, A.R. Computer, 1994, 27(8):27-39
Contrary to initial expectations, implementing dataflow computers has presented a monumental challenge. Now, however, multithreading offers a viable alternative for building hybrid architectures that exploit parallelism. The eventual success of dataflow computers will depend on their programmability. Traditionally, they have been programmed in languages such as Id and SISAL (Streams and Iterations in a Single Assignment Language) that use functional semantics. These languages reveal high levels of concurrency and translate onto dataflow machines and conventional parallel machines via the Threaded Abstract Machine (TAM). However, because their syntax and semantics differ from imperative counterparts such as Fortran and C, they have been slow to gain acceptance in the programming community. An alternative is to explore the use of established imperative languages to program dataflow machines. However, the difficulty lies in analyzing data dependencies and extracting parallelism from source code that contains side effects. Therefore, more research is still needed to develop compilers for conventional languages that can produce parallel code comparable to that of parallel functional languages.

