期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Coarse-grained thread pipelining: a speculative parallel executionmodel for shared-memory multiprocessors

Kazi I.H. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(9):952-966

This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor 相似文献

2.

SMA:前瞻性多线程体系结构 总被引：4，自引：1，他引：3

肖刚周兴铭徐明邓鹍《计算机学报》1999,22(6):582-590

提出了一种新的ＩＬＰ处理器体系结构－前瞻性多线程体系的结构,简称ＳＭＡ．它结合了前瞻性执行机制和多线程执行机制,以整个线程为长步进行前瞻性执行,多个线程并行执行并且共享处理器硬件资源,这样,处理器既通过组合每个线程的指令窗口形成一个大的动态指令窗口,开发出程序中更大的ＩＬＰ,又利用多线程执行机制屏蔽各种长延迟操作,达到较高的资源利用率;介绍了ＳＭＡ执行模型,并讨论了ＳＭＡ处理器的实现和其中的关键技相似文献

3.

Entity-life modeling: modeling a thread architecture on the problem environment

Sanden B.I. 《Software, IEEE》2003,20(4):70-78

With Java threads and the wider availability of multiprocessors, more programmers are confronted with multithreading. Concurrent threads let you take advantage of multiprocessors to speed up execution. They are also useful on a single processor, where one thread can compute while others wait for external input. Entity-life modeling is an approach for designing multithread programs. 相似文献

4.

A general compiler framework for speculative multithreaded processors

Bhowmik A. Franklin M. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(8):713-724

Speculative multithreading (SpMT) promises to be an effective mechanism for parallelizing nonnumeric programs, which tend to have irregular and pointer-intensive data structures and complex flows of control. Proper thread formation is crucial for obtaining good speedup in an SpMT system. This paper presents a compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system. This framework is very general and supports speculative threads, nonspeculative threads, loop-centric threads, and out-of-order thread spawning. It is therefore useful for compiling for a wide variety of SpMT architectures. For effective partitioning of programs, the compiler uses profiling, interprocedural pointer analysis, data dependence information, and control dependence information. The compiler is implemented on the SUIF-MachSUIF platform. A simulation-based evaluation of the generated threads shows that the use of nonspeculative threads and nonloop speculative threads provides a significant increase in speedup for nonnumeric programs. 相似文献

5.

一种改进的并发程序静态切片算法

下载免费PDF全文

肖健宇张德运陈海诠董皓《计算机工程》2006,32(14):14-16,4

分析了Krinke切片算法对循环体内嵌套有线程的程序结构会产生切片不精确的现象，认为其原因是该算法对线程间数据依赖的定义过于粗糙，且对程序行为约束不够。该文提出一种新算法，在并发程序内部表示中，增加跨线程边界循环-承载数据依赖，并引入区域化执行证据约束程序行为。实例研究表明，该算法克服了Krinke算法的不精确现象。相似文献

6.

Run-Time Support for the Automatic Parallelization of Java Programs

Bryan Chan Tarek S. Abdelrahman 《The Journal of supercomputing》2004,28(1):91-117

相似文献

7.

Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

《Future Generation Computer Systems》2014

Modern processor architectures are increasingly complex and heterogeneous, often requiring software solutions tailored to the specific hardware characteristics of each processor model. In this article, we address this problem by targeting two processors featuring Simultaneous MultiThreading (SMT) to improve the occupancy of their internal execution units through a sustained stream of instructions coming from more than one thread. We target the AMD Bulldozer and IBM POWER7 processors as case studies for specific hardware-oriented performance optimizations that increase the variety of instructions sent to each core to maximize the occupancy of all its execution units. WorkOver, presented in this article, improves thread scheduling by increasing the performance of floating point-intensive workloads on Linux-based operating systems. WorkOver is a user-space monitoring tool that automatically identifies FPU-intensive threads and schedules them in a more efficient way without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved by simply allowing WorkOver to monitor applications and schedule their threads, without any modification of the workload. 相似文献

8.

Background optimization in full system binary translation

R. A. Sokolov A. V. Ermolovich 《Programming and Computer Software》2012,38(3):119-126

Binary translation and dynamic optimization are widely used to provide compatibility between legacy and promising upcoming architectures on the level of executable binary codes. Dynamic optimization is one of the key contributors to dynamic binary translation system performance. At the same time it can be a major source of overhead, both in terms of CPU cycles and whole system latency, as long as optimization time is included in the execution time of the application under translation. One of the solutions that allow to eliminate dynamic optimization overhead is to perform optimization simultaneously with the execution, in a separate thread. In the paper we present implementation of this technique in full system dynamic binary translator. For this purpose, an infrastructure for multithreaded execution was implemented in binary translation system. This allowed running dynamic optimization in a separate thread independently of and concurrently with the main thread of execution of binary codes under translation. Depending on the computational resources available, this is achieved whether by interleaving the two threads on a single processor core or by moving optimization thread to an underutilized processor core. In the first case the latency introduced to the system by a computational intensive dynamic optimization is reduced. In the second case overlapping of execution and optimization threads also results in elimination of optimization time from the total execution time of original binary codes. 相似文献

9.

A continuation-based noninterruptible multithreading processor architecture

Satoshi Amamiya Makoto Amamiya Ryuzo Hasegawa Hiroshi Fujita 《The Journal of supercomputing》2009,47(2):228-252

Current trend of research on multithreading processors is toward the chip multithreading (CMT), which exploits thread level parallelism (TLP) and improves performance of softwares built on traditional threading components, e.g., Pthread. There exist commercially available processors that support simultaneous multithreading (SMT) on multicore processors. But they are basically based on the conventional sequential execution model, and execute multiple threads in parallel under the control of OS that handles interruptions. Moreover, there exist few languages or programming techniques to utilize the multicore processors effectively. We are taking another approach to develop a multithreading processor, which is dedicated to TLP. Our processor, named Fuce, is based on the continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions which are executed without interruption. Every thread execution is triggered only by the event called continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Last, the performance of the Fuce processor is evaluated by means of the clock-level software simulation. 相似文献

10.

The impact of incorrectly speculated memory operations in a multithreaded architecture

Sendag R. Ying Chen Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(3):271-285

The speculated execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they actually needed by the program. In this study, we examine how the load instructions executed on what turn out to be incorrectly executed program paths impact the memory system performance. We find that incorrect speculation (wrong execution) on the instruction and thread-level provides an indirect prefetching effect for the later correct execution paths and threads. By continuing to execute the mispredicted load instructions even after the instruction or thread-level control speculation is known to be incorrect, the cache misses observed on the correctly executed paths can be reduced by 16 to 73 percent, with an average reduction of 45 percent. However, we also find that these extra loads can increase the amount of memory traffic and can pollute the cache. We introduce the small, fully associative wrong execution cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions. Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture up to 18.5 percent on the benchmark programs tested, with an average improvement of 9.7 percent, due to the reductions in the number of cache misses. 相似文献

11.

Thread algebra for poly-threading

J. A. Bergstra C. A. Middelburg 《Formal Aspects of Computing》2011,23(4):567-583

It is a fact of life that sequential programs are often fragmented. Consequently, fragmented program behaviours are frequently found. We consider this phenomenon in the setting of thread algebra. We extend basic thread algebra with poly-threading, the barest mechanism for sequencing of threads that are taken for program fragment behaviours. This mechanism is the counterpart of program overlaying at the level of program behaviours. We relate the resulting theory to the process theory known as ACP and use it to describe analytic execution architectures suited for fragmented programs. We also consider the case where the steps of fragmented program behaviours are interleaved in the ways of non-distributed and distributed multi-threading. 相似文献

12.

Synchronised execution on shared memory multiprocessors

Rhys Francis Ian Mathieson 《Parallel Computing》1988,8(1-3):165-175

Threads provides a mechanism for simulating the execution of parallel algorithms on a simplified model of a shared-memory multiprocessor. The algorithms can be expressed in a high-level block-structured language, which supports multiple threads of execution within a common body of program code. Results show an ability to achieve good speedup for small problems using algorithms derived by simple modifications of sequential algorithms. As well, a sibling thread synchronisation feature provides the basis for the synchronous execution of threads. k-parallel algorithms tailored to the machine size and implemented as synchronously executing iterations, can provide near linear speedup as the problem size is increased. The techniques described in this paper seem to promise an effective synchronous execution mode for shared-memory MIMD architectures. 相似文献

13.

Arachne: a portable threads system supporting migrant threads onheterogeneous network farms

Dimitrov B. Rego V. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(5):459-469

We present the design and implementation of Arachne, a threads system that can be interfaced with a communications library for multithreaded distributed computations. In particular, Arachne supports thread migration between heterogeneous platforms, dynamic stack size management, and recursive thread functions. Arachne is efficient, flexible, and portable-it is based entirely on C and C++. To facilitate heterogeneous thread operations, we have added three keywords to the C++ language. The Arachne preprocessor takes as input code written in that language and outputs C++ code suitable for compilation with a conventional C++ compiler. The Arachne runtime system manages all threads during program execution. We present some performance measurements on the costs of basic thread operations and thread migration in Arachne and compare these to costs in other threads systems 相似文献

14.

Contextual debugging and analysis of multithreaded applications

M. BEDNORZ A. GWOZDOWSKI K. ZIELI&#x;SKI 《Concurrency and Computation》1997,9(2):123-139

Multithreaded programs are especially difficult to test and debug. The aim of the paper is to present a new concept of multithreaded program analysis and debugging based on contextual visualisation of the program components that influence thread execution. For this purpose, a dedicated software package called MTV (multithreading viewer) has been designed and implemented. It performs above the run-time library level, and hence only a programmer's view of multiple threads of control execution may be analyzed. The paper presents tested program code instrumentation, communication and synchronization between the instrumented program and MTV. Next, a general concept of contextual visualisation of multithreaded programs has been elaborated. A scheme of the MTV cooperation with the monitored program is discussed. The user interface has been described. A representation of the multithreaded program state has been shown, and the capability of MTV for certain classes of error recognition has been specified and illustrated by a few examples. These examples have been not intended to be exhaustive, but they rather indicate the opportunities to exploit MTV for analysis of complex applications. Short evaluation of the proposed contextual visualisation techniques with application to multithreaded program analysis concludes the paper. © 1997 by John Wiley & Sons, Ltd. 相似文献

15.

Performance Optimization Using Extended Critical Path Analysis in Multithreaded Programs on Multiprocessors

《Journal of Parallel and Distributed Computing》2001,61(1):115-136

Efficient performance tuning of parallel programs is often hard. Optimization is often done when the program is written as a last effort to increase the performance. With sequential programs each (executed) code segment will affect the completion time. In the case of a parallel program executed on a multiprocessor this is not always true, due to dependencies between the different threads. Thus, certain code segments of the execution may not affect the completion time of the program. Optimization of such code segments will not increase the performance. In this paper we present an approach to optimize performance by finding the extended critical path of the multithreaded program. The extended critical path analysis is a generalization of the critical path analysis in the sense that it also deals with more threads than processors. We have implemented the extended critical path analysis in a performance optimization tool. The tool allows the user to determine the extended critical path of a multithreaded application written for the Solaris operating system for any number of processors based on execution on a single processor workstation. 相似文献

16.

同时多线程处理器上的动态分支预测器设计方案研究

任建安虹路放梁博《计算机科学》2006,33(3):239-243

同时多线程处理器（SMT）每个周期能够从多个线程中发射指令执行，从而大大地提高了超标量微处理器的指令吞吐量，但多个线程的同时执行也带来了许多硬件资源的共享冲突问题.其中，多个线程共享分支预测硬件的方案会对分支预测精度产生较大的影响.研究SMT处理器中分支处理方案对于处理器整体性能的影响，对于指导SMT处理器的设计是十分重要的.本文利用SMT处理器模拟器，针对各线程运行独立应用的SMT结构实验评估了几种著名的分支预测方案;给出了在单线程和多线程情况下，分支预测方案对分支预测精度和处理器整体性能的影响的分析;总结出在这样的SMT结构中，各线程拥有独立的预测器是一种较好的选择，并且由于各独立预测器可以采用小而简单的结构，所以不会带来太多的硬件开销. 相似文献

17.

A comparison of concurrent programming and cooperative multithreading under load balancing applications

Justin T. Maris Aaron W. Keen Takashi Ishihara Ronald A. Olsson 《Concurrency and Computation》2004,16(4):345-369

Two models of thread execution are the general concurrent programming execution model (CP) and the cooperative multithreading execution model (CM). CP provides nondeterministic thread execution where context switches occur arbitrarily. CM provides threads that execute one at a time until they explicitly choose to yield the processor. This paper focuses on a classic application to reveal the advantages and disadvantages of load balancing during thread execution under CP and CM styles; results from a second classic application were similar. These applications are programmed in two different languages (SR and Dynamic C) on different hardware (standard PCs and embedded system controllers). An SR‐like run‐time system, DesCaRTeS, was developed to provide interprocess communication for the Dynamic C implementations. This paper compares load balancing and non‐load balancing implementations; it also compares CP and CM style implementations. The results show that in cases of very high or very low workloads, load balancing slightly hindered performance; and in cases of moderate workload, both SR and Dynamic C implementations of load balancing generally performed well. Further, for these applications, CM style programs outperform CP style programs in some cases, but the opposite occurs in some other cases. This paper also discusses qualitative tradeoffs between CM style programming and CP style programming for these applications. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

18.

Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

Teo Milanez Sylvain Collange Fernando Magno Quintão Pereira Wagner Meira Jr. Renato Ferreira 《Parallel Computing》2014

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one architecture recently proposed that shares instruction decoding and execution between threads running the same program in an SMT processor, thereby generalizing the approach followed by Graphics Processing Units to general-purpose processors. In this paper we propose new ways to expose redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Our heuristic only inspects the program counter and the stack frame to reconverge threads; hence, it is amenable to efficient and inexpensive hardware implementation. Second, we demonstrate that this heuristic is able to reveal the existence of substantial regularity in inter-thread memory access patterns. We validate our results on data-parallel applications from the PARSEC and SPLASH suites. Our new reconvergence heuristic increases the throughput of our MMT model by 7%, when compared to a previous, and substantially more complex approach, due to Long et al. Moreover, it gives us an effective way to increase regularity in memory accesses. We have observed that over 70% of simultaneous memory accesses are either the same for all the threads, or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses. 相似文献

19.

Adaptive dynamic thread scheduling for simultaneous multithreaded architectures with a detector thread

《Journal of Parallel and Distributed Computing》2006,66(10):1304-1321

Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share resources. However, earlier studies have shown that the performance of an SMT architecture begins to saturate as the number of coexisting threads increases beyond four. We show that no single fetch policy can be the best solution during the entire execution time and that a significant performance improvement can be attained by dynamically switching the fetch policies. We propose an implementation method which includes an extremely lightweight thread to control fetch policies (a detector thread) and a processor architecture to run the detector thread without impact on the user application threads. We evaluate various heuristics for the detector thread to determine the best fetch policies. We show that, with eight threads running on our simulated SMT, the proposed approach can outperform fixed scheduling mechanisms by up to 30%. 相似文献

20.

Power Efficient Approaches to Redundant Multithreading

Madan N. Balasubramonian R. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1066-1079

Noise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread. 相似文献