期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Power Efficient Approaches to Redundant Multithreading

Madan N. Balasubramonian R. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1066-1079

Noise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread. 相似文献

2.

Improving Adaptability and Per-Core Performance of Many-Core Processors Through Reconfiguration

Tameesh Suri Aneesh Aggarwal 《International journal of parallel programming》2010,38(3-4):203-224

Increasing the number of cores in a multi-core processor can only be achieved by reducing the resources available in each core, and hence sacrificing the per-core performance. Furthermore, having a large number of homogeneous cores may not be effective for all the applications. For instance, threads with high instruction level parallelism will under-perform considerably in the resource-constrained cores. In this paper, we propose a core architecture that can be adapted to improve a single thread’s performance or to execute multiple threads. In particular, we integrate Reconfigurable Hardware Unit (RHU) in the resource-constrained cores of a many-core processor. The RHU can be reconfigured to execute the frequently encountered instructions from a thread in order to increase the core’s overall execution bandwidth, thus improving its performance. On the other hand, if the core’s resources are sufficient for a thread, then the RHU can be configured to executed instructions from a different thread to increase the thread level parallelism. The RHU has low area overhead, and hence has minimal impact on scalability of the number of cores. To further limit the area overhead of this mechanism, generation of the reconfiguration bits for the RHUs of multiple cores is delegated to a single core. In this paper, we present the results for using the RHU to improve a single thread’s performance. Our experiments show that the proposed architecture improves the per-core performance by an average of about 23% across a wide range of applications. 相似文献

3.

动态二进制翻译中的指令调度技术研究与实现

孙俊文延华漆锋滨《计算机应用与软件》2008,25(1):17-19

动态二进制翻译提供了无需重新编译源代码就能将源机器生成的可执行代码自动转换到目标机器的方法,很好地解决了代码兼容性问题.其核心思想是根据程序的动态运行信息找到反复执行的代码序列,对代码序列进行翻译和优化,并将结果多次重用.指令调度作为一种有效的编译优化手段,也适用于动态二进制翻译.在对gcc的指令调度器分析研究的基础上,结合动态二进制翻译的实时性特点,提出了适合动态二进制翻译的效率高、开销小的指令调度算法. 相似文献

4.

数据流程序动态调度与优化

杨胜哲于俊清唐九飞《计算机工程与科学》2017,39(7):1201-1210

为了解决数据流编程模型的可用性问题,使其能在兼顾程序并行性的前提下适用于动态数据交互速率的流应用,设计了一种动态调度与静态优化相结合的数据流编译系统。编译器以COStream语言编写的源程序为输入,通过对源程序进行分析,以动态速率的数据通信边作为边界划分程序到粗粒度的子图,在子图内部应用静态优化。根据子图的每个计算单元的工作量估计计算资源的使用状况,实现子图内计算单元到处理器核的映射,经过阶段划分分配子图内计算单元到相应流水阶段。在运行时,每个子图在各个处理器核上均启动一个线程,通过对线程间通信的优化,避免了运行时多个线程对同一段内存同时读写产生的同步开销,减少了线程的上下文切换次数。使用信号量控制子图内线程间的同步,基于各子图计算单元运行时数据交互速率并结合当前线程的状态,动态调度各个子图的执行,构建动态的软件流水线,生成相应多线程目标代码。实验以通用X86-64多核处理器作为实验平台,测试和分析数据流编译的性能。实验结果表明,编译系统可以实现动态数据交互速率的数据流应用,扩大了编译系统可用性并且具有一定加速效果。相似文献

5.

CoDBT: A multi-source dynamic binary translator using hardware–software collaborative techniques

Haibing Guan Bo Liu Zhengwei Qi Yindong Yang Hongbo Yang Alei Liang 《Journal of Systems Architecture》2010,56(10):500-508

For implementing a dynamic binary translation system, traditional software-based solutions suffer from significant runtime overhead and are not suitable for extra complex optimization. This paper proposes using hardware–software collaboration techniques to create an high efficient dynamic binary translation system, CoDBT, which emulates several heterogeneous ISAs (Instruction Set Architectures) on a host processor without changing to the existing processor. We analyze the major performance bottlenecks via evaluating overhead of a pure software-solution DBT. Guidelines are provided for applying a suitable hardware–software partition process to CoDBT, as are algorithms for designing hardware-based binary translator and code cache management. An intermediate instruction set is introduced to make multi-source translation more practicable and scalable. Meantime, a novel runtime profiling strategy is integrated into the infrastructure to collect program hot spots information to supporting potential future optimizations. The advantages of using co-design as an implementation approach for DBT system are assessed by several SPEC benchmarks. Our results demonstrate that significant performance improvements can be achieved with appropriate hardware support choices. CoDBT could be an efficient and cost-effective solution for situations where the usual methods of performance acceleration for dynamic binary translation are inappropriate. 相似文献

6.

动态二进制翻译中数据预取优化研究*

罗琼程吴强《计算机应用研究》2009,26(12):4572-4576

动态优化是动态二进制翻译研究中一个十分重要的课题,数据预取优化能提高现代处理器体系结构应用程序性能。基于超级块(Superblock)的动态数据预取优化采用软件插桩方式收集应用程序的load访存延迟信息并构造Superblock;然后根据延迟信息以及Superblock数据流分析得出的寄存器定值引用关系,对延迟load指令进行预取优化。通过在龙芯DigitalBridge动态二进制翻译系统上实验验证,数据预取优化可以提高翻译后SPEC2000浮点测试程序代码的平均性能3.3%,开销远小于0.5%。相似文献

7.

动态二进制翻译中基本块重叠冗余的优化

下载免费PDF全文

李骏管海兵李增祥梁阿磊《计算机工程》2007,33(22):60-62

动态二进制翻译技术通常采用基本块作为翻译和执行的基本单元，动态翻译中的基本块在划分过程中存在重叠冗余的情况，即当前翻译的基本块可能是一个已经过翻译的基本块子集，或者包含一个已翻译的基本块，这增加了翻译开销。该文从优化动态二进制翻译角度出发，检测、消除由基本块重叠冗余带来的开销。实验表明，在动态二进制翻译过程中存在5%左右的基本块重叠率，通过消除这些冗余可以将翻译和执行的性能提高1%~4%。相似文献

8.

32位多线程包处理微引擎的设计 总被引：1，自引：0，他引：1

周昔平高德远樊晓桠张盛兵《小型微型计算机系统》2006,27(11):2072-2076

硬件多线程技术是网络处理器中的核心技术，本文介绍了一个专门面向网络协议处理的硬件多线程包处理微引擎NRS05的设计，详细介绍了其流水线的整体结构，提出了一种基于混合多线程的动态调度策略实现了长延时操作的隐藏，保证单线程性能能够满足应用需求的同时保证了各线程在执行核上运行的公平性，并将多线程技术和流水线技术进行了结合，解决了传统处理器中指令间因控制相关导致的流水线停顿问题，最后给出了设计的综合结果及包处理性能．相似文献

9.

An efficient scheduler of RTOS for multi/many-core system

Xiongli GuAuthor VitaePeng LiuAuthor Vitae Mei YangAuthor VitaeJie YangAuthor Vitae Cheng LiAuthor VitaeQingdong YaoAuthor Vitae 《Computers & Electrical Engineering》2012,38(3):785-800

Recently there is a trend to broaden the usage of lower-power embedded media processor core to build the future high-end computing machine or the supercomputer. However the embedded solution also faces the operating system (OS) design challenge which the thread invoking overhead is higher for fine-grained scientific workload, the message passing among threads is not managed efficiently enough and the OS does not provide convenient enough service for parallel programming. This paper presents a scheduler of master-slave real-time operating system (RTOS) to manage the thread running for the distributed multi/many-core system without shared memories. The proposed scheduler exploits the data-driven feature of scientific workloads to reduce the thread invoking overhead. And it also defines two protocols: (1) one is between the RTOS and application program, which is used to reduce the burden of parallel programming for the programmer; (2) another one is between the RTOS and networks-on-chip, which is used to manage the message passing among threads efficiently. The experimental results show that the proposed scheduler can manage the thread running with lower overhead and less storage requirement, thereby, improving the multi/many-core system performance. 相似文献

10.

多核并行技术在分子动力学模拟中的应用 总被引：1，自引：0，他引：1

刘青昆滕人达刘凤宫利东张建强《计算机工程与设计》2011,32(10):3395-3398

为了充分利用多核处理器资源,研究了一种用于分子动力学模拟中的多核并行技术。在多核处理器上利用OpenMP技术实现多线程创建与同步、动态设置子线程的调度运行方式以及负载均衡以减少子线程执行等待时间。通过对不同分子体系结构下的动力学模型测试,得出在不同子线程下并行计算的时间,并且得到了良好的性能加速比。实验结果表明,采用OpenMP并行技术可有效地提高电荷求解过程在分子动力学模拟运算中的时间效率,以及多核计算机资源的利用率。相似文献

11.

Experiences implementing efficient Java thread serialization,mobility and persistence

S. Bouchenak D. Hagimont S. Krakowiak N. De Palma F. Boyer 《Software》2004,34(4):355-393

Today, mobility and persistence are important aspects of distributed computing. They have many fields of use such as load balancing, fault tolerance and dynamic reconfiguration of applications. In this context, Java provides many useful mechanisms for the mobility of code via dynamic class loading, and the mobility or persistence of data via object serialization. However, Java does not provide any mechanism for the mobility/persistence of computation (i.e. threads). We designed and implemented a new mechanism, called Java thread serialization, that is used to build thread mobility or thread persistence. Therefore, a running Java thread can, at an arbitrary state of its execution, migrate to a remote machine where it resumes its execution, or be checkpointed on disk for possible subsequent recovery. With our services, migrating a thread is simply performed by the call of our go primitive, and checkpointing/recovering a thread is performed by the call of our store and load primitives. Several projects have recently addressed the issue of Java thread serialization, e.g. Sumatra, Wasp, JavaGo, Brakes, JavaGoX, Merpati. Some of them have attempted to minimize the overhead incurred by the thread serialization mechanism on thread performance, but none of them has been able to completely avoid this overhead. We propose a generic Java thread serialization mechanism that does not impose any performance overhead on serialized threads. This is achieved thanks to the use of type inference and dynamic de‐optimization techniques. In this paper, we describe the design and implementation details of our thread serialization prototype in Sun Microsystems' JDK. We report on experiments conducted with our prototype, present a comparative performance evaluation of the main thread serialization techniques, and confirm the elimination of the performance overhead with our thread serialization mechanism. Copyright © 2003 John Wiley & Sons, Ltd. 相似文献

12.

动态二进制翻译的多线程并行优化研究

徐金龙蒋烈辉董卫宇方明《计算机工程与设计》2011,32(7):2370-2372,2380

为了充分利用多核CPU来实现动态二进制翻译的并行化,研究了用多线程将翻译阶段和执行阶段并行优化的方法,提供了并行化系统的程序流程。并根据翻译与执行的时序及相关性,设计实现了一种超前翻译算法,它能够有效预测程序的执行路径,为翻译过程提供导向作用。实验结果表明,该优化方法提高了翻译缓存中基本块的命中率,使执行阶段尽量不被中断,进而提升了执行效率。相似文献

13.

动态二进制翻译中的冗余LOAD删除优化技术

王丽一文延华《计算机应用与软件》2008,25(6):40-43

动态二进制翻译系统是根据程序的动态执行信息来将源机器上的可执行代码翻译成目标机器上的可执行代码.在翻译成中间表示的过程中会产生一些冗余的LOAD指令,为提高代码的执行效率,提出对这些LOAD指令进行冗余删除优化.该优化技术可以使优化效果超过其自身的开销,达到优化的目的. 相似文献

14.

Adaptive dynamic thread scheduling for simultaneous multithreaded architectures with a detector thread

《Journal of Parallel and Distributed Computing》2006,66(10):1304-1321

Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share resources. However, earlier studies have shown that the performance of an SMT architecture begins to saturate as the number of coexisting threads increases beyond four. We show that no single fetch policy can be the best solution during the entire execution time and that a significant performance improvement can be attained by dynamically switching the fetch policies. We propose an implementation method which includes an extremely lightweight thread to control fetch policies (a detector thread) and a processor architecture to run the detector thread without impact on the user application threads. We evaluate various heuristics for the detector thread to determine the best fetch policies. We show that, with eight threads running on our simulated SMT, the proposed approach can outperform fixed scheduling mechanisms by up to 30%. 相似文献

15.

Generation, Optimization, and Evaluation of Multithreaded Code

Lucas Roh Walid A. Najjar A.P.Wim Böhm 《Journal of Parallel and Distributed Computing》1996,32(2):188

The recent advent of multithreaded architectures holds many promises: the exploitation of intrathread locality and the latency tolerance of multithreaded synchronization can result in a more efficient processor utilization and higher scalability. The challenge for a code generation scheme is to make effective use of the underlying hardware by generating large threads with a large degree of internal locality without limiting the program level parallelism or increasing latency. Top-down code generation, where threads are created directly from the compiler's intermediate form, is effective at creating a relatively large thread. However, having only a limited view of the code at any one time limits the quality of threads generated. These top-down generated threads can therefore be optimized by global, bottom-up optimization techniques. In this paper, we introduce the Pebbles multithreaded model of computation and analyze a code generation scheme whereby top-down code generation is combined with bottom-up optimizations. We evaluate the effectiveness of this scheme in terms of overall performance and specific thread characteristics such as size, length, instruction level parallelism, number of inputs, and synchronization costs. 相似文献

16.

Compiler Techniques for the Superthreaded Architectures1, 2

Jenn-Yuan Tsai Zhenzhen Jiang Pen-Chung Yew 《International journal of parallel programming》1999,27(1):1-19

Several useful compiler and program transformation techniques for the superthreaded architectures are presented in this paper. The superthreaded architecture adopts a thread pipelining execution model to facilitate runtime data dependence checking between threads, and to maximize thread overlap to enhance concurrency. In this paper, we present some important program transformation techniques to facilitate concurrent execution among threads, and to manage critical system resources such as the memory buffers effectively. We evaluate the effectiveness of those program transformation techniques by applying them manually on several benchmark programs, and using a trace-driven, cycle-by-cycle superthreaded processor simulator. The simulation results show that a superthreaded processor can achieve promising speedup for most of the benchmark programs. 相似文献

17.

一种基于频度统计的动态二进制翻译优化方法

李男庞建民单征《计算机工程与科学》2018,40(4):602-608

在动态二进制翻译过程中,将执行频度高的代码片段长时间驻留在翻译缓存,同时扩大翻译器一次执行的代码量,是减少上下文切换开销、提升系统效率的有效途径。为此,提出了“热代码识别→超块缓存构造→T-Cache管理策略改进”的优化线索,设计了一种基于频度统计的热代码识别算法,将频度值超过预设阈值的基本块及其后续基本块作为热代码识别条件;基于识别出的热代码,提出了构造超块缓存的思想,将热代码包含的基本块翻译后做物理连接,形成容量更大的超块缓存提供给T-Cache系统;以此为基础,改进了T-Cache系统原有的查找方法和替换策略。实验验证了该优化方法的正确性和有效性,在国产申威处理器平台上,该方法使得标准测试集SPEC 2006获得平均9.34%的性能提升。相似文献

18.

Architectural support for thread communications in multi-core processors

Sevin Varoglu Stephen Jenks 《Parallel Computing》2011,37(1):26-41

In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors. 相似文献

19.

The effect of scheduling discipline on spin overhead in sharedmemory parallel systems

Zahorjan J. Lazowska E.D. Eager D.L. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(2):180-198

Spinning, or busy waiting, is commonly employed in parallel processors when threads of execution must wait for some event, such as synchronization with another thread. Because spinning is purely overhead, it is detrimental to both user response time and system throughput. The effects of two environmental factors, multiprogramming and data-dependent execution times, on spinning are discussed, and it is shown how the choice of scheduling discipline can be used to reduce the amount of spinning in each case 相似文献

20.

动态二进制翻译器CrossBit的性能分析与评估

下载免费PDF全文

官孝峰梁阿磊《计算机工程与应用》2008,44(27):91-94

动态二进制翻译是广泛应用于虚拟机系统的一种二进制代码的翻译技术。动态二进制翻译由于拥有代码缓存、本地执行、代码块链接、动态热路径生成等优化技术的支持,有着很高的性能。CrossBit是一个多元多目标的动态二进制翻译系统,通过对CrossBit二进制翻译器的性能进行的研究,分析动态二进制翻译器性能提升中所必须解决的若干问题,并通过定量的分析总结了一些二进制翻译系统的在不同的配置和负载下系统优化手段的执行时策略。相似文献