Similar Documents
20 similar documents found (search time: 46 ms)
1.
In this paper, we conduct performance scaling analysis of multithreaded multicore processors (MMPs) for parallel computing. We propose a thread-level closed-queuing network model covering a fairly large design space, accounting for hardware scaling models; coarse-grain, fine-grain, and simultaneous multithreading (SMT) cores; and shared resources, including cache, memory, and critical sections. We then derive a closed-form solution for this model in terms of the speedup performance measure. This solution makes it possible to analyze the performance scaling properties of MMPs along multiple dimensions. In particular, we show that for the parallelizable part of the workload, the speedup, in the absence of resource contention, is no longer just a linear function of the parallel processing unit count, as predicted by Amdahl's law, but also a strong function of workload characteristics, ranging from strongly memory-bound to strongly CPU-bound workloads. We also find that with core multithreading, superlinear speedup, higher than that predicted by Amdahl's law, may be achieved for the parallelizable part of the workload if core threads exhibit strong cache affinity and the workload is strongly memory-bound. We then derive a tight speedup upper bound in the presence of both memory resource contention and critical sections for multicore processors with single-threaded cores. This upper bound indicates that with resource contention among threads, whether due to shared memory or critical sections, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload, as dictated by Amdahl's law. As a result, to improve speedup performance for MMPs, one should strive to enhance memory parallelism and confine critical sections as locally as possible, e.g., to the smallest possible number of threads in the same core.
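For orientation, the baseline against which the paper's model is compared is the classical Amdahl's law speedup; the paper's own closed-form solution and contention-aware upper bound are not reproduced in this abstract, so only the standard formula is shown here, with p the parallelizable fraction of the workload and N the number of parallel processing units:

\[ S(N) = \frac{1}{(1 - p) + p/N} \]

The abstract's key refinement is that thread contention for shared memory or critical sections effectively adds a further sequential term to the denominator, on top of the (1 - p) term, even within the parallelizable part of the workload.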

2.
Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share its resources. However, earlier studies have shown that the performance of an SMT architecture begins to saturate as the number of coexisting threads increases beyond four. We show that no single fetch policy is the best choice for the entire execution time and that a significant performance improvement can be attained by dynamically switching among fetch policies. We propose an implementation method that includes an extremely lightweight thread to control fetch policies (a detector thread) and a processor architecture that runs the detector thread without impacting the user application threads. We evaluate various heuristics for the detector thread to determine the best fetch policies. We show that, with eight threads running on our simulated SMT processor, the proposed approach can outperform fixed scheduling mechanisms by up to 30%.
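As a loose illustration of the detector-thread idea (the policy names, the sampling interface, and the greedy heuristic below are assumptions for the sketch, not the paper's actual mechanism), a minimal C version might periodically sample per-policy throughput and switch to whichever fetch policy currently performs best:

```c
/* Hypothetical sketch of a "detector thread" that periodically re-evaluates
 * the SMT fetch policy.  Policy names, the sampling interface (sample_ipc)
 * and the switching heuristic are illustrative assumptions only. */
#include <stdio.h>

typedef enum { FETCH_ICOUNT, FETCH_ROUND_ROBIN, FETCH_STALL_FLUSH, NUM_POLICIES } fetch_policy_t;

/* Stand-in for programming the fetch unit and reading a performance counter. */
static double sample_ipc(fetch_policy_t p, int interval)
{
    static const double base[NUM_POLICIES] = { 2.1, 1.8, 2.0 };
    return base[p] + 0.01 * (interval % 7);   /* fabricated so the sketch runs */
}

int main(void)
{
    fetch_policy_t current = FETCH_ICOUNT;
    for (int interval = 0; interval < 10; interval++) {
        double best_ipc = sample_ipc(current, interval);
        fetch_policy_t best = current;
        /* Greedy heuristic: briefly evaluate each policy and keep the best one. */
        for (int p = 0; p < NUM_POLICIES; p++) {
            double ipc = sample_ipc((fetch_policy_t)p, interval);
            if (ipc > best_ipc) { best_ipc = ipc; best = (fetch_policy_t)p; }
        }
        if (best != current) {
            printf("interval %d: switching fetch policy %d -> %d\n", interval, current, best);
            current = best;
        }
    }
    return 0;
}
```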

3.
Simultaneous multithreading (SMT) can execute instructions from different threads in the same clock cycle, exploiting both instruction-level parallelism (ILP) and thread-level parallelism (TLP), while Explicitly Parallel Instruction Computing (EPIC) focuses on cooperation between the compiler and the hardware. In this paper, we design and implement a parallel environment that comprises the parallel compiler OpenUH and EDSMT, an IA-64-based simultaneous multithreading architecture, and we evaluate its performance using the NAS parallel benchmarks.

4.
This article focuses on the optimization of PCDM, a parallel, two-dimensional (2D) Delaunay mesh generation application, and its interaction with parallel architectures based on simultaneous multithreading (SMT) processors. We first present the step-by-step effect of a series of optimizations on performance. These optimizations improve the performance of PCDM by up to a factor of six. They target issues that very often limit the performance of scientific computing codes. We then evaluate the interaction of PCDM with a real SMT-based SMP system, using both high-level metrics, such as execution time, and low-level information from hardware performance counters.

5.
The sort operation is a core part of many critical applications (e.g., database management systems). Despite large efforts to parallelize it, its heavy data dependencies severely limit its performance. Multithreaded architectures are emerging as a leading technology in high-end processors; these architectures include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix sort and quicksort integer sorting algorithms on modern multithreaded architectures. We achieve speedups of up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with four multithreaded processors, compared to the single-threaded versions. We find that since radix sort is CPU-intensive, it performs best on chip multiprocessors, where multiple CPUs are available, while quicksort achieves speedups on all types of multithreaded processors thanks to its ability to overlap memory miss latencies with other useful processing.
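As a rough illustration of the thread-level parallelism such sort studies exploit (a generic sketch, not the algorithms evaluated in the paper), the following C program parallelizes quicksort with POSIX threads, handing one partition to a new thread and recursing on the other, with a depth limit to bound the number of threads:

```c
/* Minimal pthread-parallel quicksort sketch; compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int *a; int lo, hi, depth; } task_t;

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }
static void *qsort_task(void *arg);

static void parallel_qsort(int *a, int lo, int hi, int depth)
{
    if (lo >= hi) return;
    int pivot = a[hi], i = lo;                 /* Lomuto partition */
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) swap(&a[i++], &a[j]);
    swap(&a[i], &a[hi]);

    if (depth > 0) {                           /* sort the left half in a new thread */
        task_t t = { a, lo, i - 1, depth - 1 };
        pthread_t tid;
        pthread_create(&tid, NULL, qsort_task, &t);
        parallel_qsort(a, i + 1, hi, depth - 1);
        pthread_join(tid, NULL);
    } else {                                   /* fall back to sequential recursion */
        parallel_qsort(a, lo, i - 1, 0);
        parallel_qsort(a, i + 1, hi, 0);
    }
}

static void *qsort_task(void *arg)
{
    task_t *t = (task_t *)arg;
    parallel_qsort(t->a, t->lo, t->hi, t->depth);
    return NULL;
}

int main(void)
{
    int a[16];
    for (int i = 0; i < 16; i++) a[i] = rand() % 100;
    parallel_qsort(a, 0, 15, 2);               /* depth 2 -> at most ~4 threads */
    for (int i = 0; i < 16; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```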

6.
路放, 安虹, 梁博, 任建. 《计算机科学》 (Computer Science), 2006, 33(1): 158-163
Simultaneous multithreading (SMT) is currently one of the research hotspots in microprocessor architecture. To support in-depth research on SMT and on chip multiprocessor (CMP) architectures built from SMT cores, we developed an SMT architecture simulator, OpenSMT, on top of the widely used superscalar architecture simulator SimpleScalar, by appropriately abstracting the key features of an SMT design. This paper describes the main design ideas and implementation of the simulator, including the representation of multiple thread contexts, the simulation of each stage of the superscalar pipeline, and several key problems that had to be solved during its design and implementation. Preliminary application studies show that, compared with freely available research SMT simulators, this simulator strikes a good balance among the three basic design goals of simulation performance, flexibility, and accuracy, and meets design requirements such as execution-driven simulation, an easily extensible instruction set architecture, a good user interface, a flexible software structure, and suitability for evaluating a broader SMT architecture design space.

7.
Hardware/Software Interface Design for the Godson-2 (龙芯2号) Simultaneous Multithreading Processor
As fabrication technology improves, more and more transistors can be integrated on a chip, and multithreading has gradually become a mainstream processor architecture technique; the hardware/software interface of multithreaded processors has therefore become an urgent problem. Based on an analysis of the software requirements of simultaneous multithreading, this paper proposes a hardware/software co-design solution for the interface of the Godson-2 SMT processor and presents a corresponding operating system implementation. The operating system support for the Godson-2 SMT processor was implemented on top of Linux 2.4.20. Performance evaluation with benchmarks such as SPEC CPU2000 shows that the Godson-2 SMT processor with this hardware/software interface greatly improves the performance of multi-process workloads. The analysis and design are applicable not only to SMT processors but are also instructive for the design of on-chip multicore processors.

8.
The current trend in research on multithreading processors is toward chip multithreading (CMT), which exploits thread-level parallelism (TLP) and improves the performance of software built on traditional threading components, e.g., Pthreads. Commercially available processors already support simultaneous multithreading (SMT) on multicore processors, but they are basically based on the conventional sequential execution model and execute multiple threads in parallel under the control of an OS that handles interruptions. Moreover, few languages or programming techniques exist to utilize multicore processors effectively. We are taking another approach and developing a multithreading processor dedicated to TLP. Our processor, named Fuce, is based on continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions that are executed without interruption, and every thread execution is triggered only by an event called a continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture, then presents multithreaded programming techniques and the continuation-based multithreading language system CML. Finally, the performance of the Fuce processor is evaluated by means of clock-level software simulation.
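The execution model described above can be mimicked in plain C as a conceptual sketch (this is not Fuce hardware behavior or CML syntax): each "thread" is a run-to-completion function guarded by a counter of pending continuations, and it fires only when other threads have signaled it the required number of times:

```c
/* Conceptual sketch of continuation-based multithreading in plain C.
 * A "thread" is a run-to-completion function that fires only when its
 * continuation counter reaches zero; other threads fire it via continue_to().
 * This mimics the execution model in the abstract, not the Fuce hardware. */
#include <stdio.h>

typedef struct cthread {
    void (*body)(void);   /* block of code executed without interruption */
    int   pending;        /* number of continuations still awaited       */
} cthread_t;

static void continue_to(cthread_t *t)
{
    if (--t->pending == 0)      /* last awaited event arrived: run the thread */
        t->body();
}

static cthread_t consumer;

static void consume(void)    { puts("consumer fired after both continuations"); }
static void producer_a(void) { puts("producer A done"); continue_to(&consumer); }
static void producer_b(void) { puts("producer B done"); continue_to(&consumer); }

int main(void)
{
    consumer = (cthread_t){ consume, 2 };  /* waits for two continuations */
    producer_a();
    producer_b();                          /* second continuation triggers the consumer */
    return 0;
}
```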

9.
任建, 安虹, 路放, 梁博. 《计算机科学》 (Computer Science), 2006, 33(3): 239-243
A simultaneous multithreading (SMT) processor can issue instructions from multiple threads in each cycle, greatly improving the instruction throughput of superscalar microprocessors, but the concurrent execution of multiple threads also introduces contention for shared hardware resources. In particular, sharing the branch prediction hardware among threads can significantly affect branch prediction accuracy. Studying how branch handling schemes affect overall processor performance is therefore important for guiding SMT processor design. Using an SMT processor simulator, this paper experimentally evaluates several well-known branch prediction schemes for an SMT configuration in which each thread runs an independent application, and analyzes the impact of these schemes on prediction accuracy and overall performance in both single-threaded and multithreaded cases. It concludes that in such an SMT configuration, giving each thread its own predictor is the better choice; since each private predictor can use a small and simple structure, this does not incur much hardware overhead.
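One widely known scheme typically included in such evaluations is gshare; the abstract does not list the exact schemes studied, so the sketch below is only a generic illustration. A per-thread ("private predictor") variant, which the paper recommends, would simply instantiate one such counter table and history register per hardware thread:

```c
/* Minimal gshare predictor sketch: a table of 2-bit saturating counters
 * indexed by (PC XOR global history). */
#include <stdint.h>
#include <stdio.h>

#define GSHARE_BITS 12
#define GSHARE_SIZE (1u << GSHARE_BITS)

static uint8_t  counters[GSHARE_SIZE];  /* 2-bit counters, start at 0 (strongly not-taken) */
static uint32_t history;                /* global branch history register */

static int predict(uint32_t pc)
{
    uint32_t idx = (pc ^ history) & (GSHARE_SIZE - 1);
    return counters[idx] >= 2;          /* predict taken if counter is 2 or 3 */
}

static void update(uint32_t pc, int taken)
{
    uint32_t idx = (pc ^ history) & (GSHARE_SIZE - 1);
    if (taken  && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
    history = ((history << 1) | (taken & 1)) & (GSHARE_SIZE - 1);
}

int main(void)
{
    /* Toy loop branch: taken 7 times, then not taken, repeated. */
    int correct = 0, total = 0;
    for (int i = 0; i < 800; i++) {
        int taken = (i % 8) != 7;
        correct += (predict(0x400123) == taken);
        update(0x400123, taken);
        total++;
    }
    printf("accuracy: %.1f%%\n", 100.0 * correct / total);
    return 0;
}
```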

10.
Research on Low-Power SMT Architectures
Because of insufficient or unbalanced ILP and TLP in applications, the performance and resource utilization of superscalar and multiprocessor designs face serious challenges, whereas the simultaneous multithreading (SMT) processor is an energy-efficient architecture that can fully utilize resources and dynamically convert TLP into ILP. Focusing on the two goals of high performance and low power, this article discusses and explores the basic ideas and design techniques of SMT architectures, low-power considerations, and the new issues that compiler and operating system design should take into account.

11.
IEEE Micro, 2004, 24(6): 74-82
Memory latency dominates the performance of many applications on modern processors, despite advances in caches and prefetching techniques. Numerous prefetching techniques, both in hardware and software, try to alleviate the memory bottleneck. One such technique, known as helper threading, improves single-thread performance on a simultaneous multithreaded (SMT) architecture, which shares processor resources, including caches, among logical threads. It uses otherwise idle hardware thread contexts to execute speculative threads on behalf of the main thread, accelerating a program by exploiting the processor's multithreading capability to run assist threads. Based on the helper threading usage model, virtual multithreading (VMT), a form of switch-on-event user-level multithreading, can improve performance for real-world workloads with wall-clock speedups of 5.0 to 38.5 percent.
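A software-level analogue of helper threading (only an analogy; real helper threading and VMT as described here rely on spare hardware thread contexts and switch-on-event support) is a second POSIX thread that walks the same data structure concurrently with the main thread, warming the cache for the main traversal:

```c
/* Software sketch of a "prefetching" helper thread; compile with -pthread.
 * The real mechanism in the article uses idle SMT/VMT hardware contexts;
 * this is only a software-level analogy. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node { struct node *next; int payload; } node_t;

#define N 100000

static node_t *head;
static volatile long sink;   /* keeps the helper's loads from being optimized away */

static void *helper(void *arg)
{
    (void)arg;
    for (node_t *p = head; p != NULL; p = p->next)
        sink += p->payload;              /* touch each node to pull it into the cache */
    return NULL;
}

int main(void)
{
    head = NULL;                         /* build a linked list */
    for (int i = 0; i < N; i++) {
        node_t *n = malloc(sizeof *n);
        n->payload = i;
        n->next = head;
        head = n;
    }

    pthread_t tid;
    pthread_create(&tid, NULL, helper, NULL);   /* start the "assist" thread */

    long total = 0;
    for (node_t *p = head; p != NULL; p = p->next)
        total += p->payload;                    /* main traversal */

    pthread_join(tid, NULL);
    printf("total = %ld\n", total);
    return 0;
}
```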

12.
In Computer-Aided Design applications there is often a need to compute the union, intersection, and difference of two polygons or polyhedra. Sequential algorithms for this problem are characterized by poor response time and large computational complexity. To remove these defects, an algorithm amenable to implementation on a parallel architecture is proposed. The parallel architecture designed is a systolic array that forms a dedicated subsystem for performing set-theoretic operations on polygons. The improvement in speed gained by using the systolic array, compared to a uniprocessor, has been evaluated using simulation techniques. Extensions of this architecture to perform the same operations on polyhedra are also discussed.

13.
Multithreaded architectures provide an opportunity for efficiently executing programs with irregular parallelism and/or irregular locality. This paper presents a strategy that makes use of the multithreaded execution model without exposing multithreading to the programmer. Our approach is to design simple extensions to C and to provide compiler support that automatically translates high-level C programs into lower-level threaded programs. In this paper we present EARTH-C, our extended C language, which contains simple constructs for specifying control parallelism, data locality, shared variables, and atomic operations. Based on EARTH-C, we describe compiler techniques for translating to lower-level Threaded-C programs for the EARTH multithreaded architecture. We demonstrate our approach with six benchmark programs. We show that even naive EARTH-C programs can achieve reasonable performance, and that more advanced EARTH-C programs can give performance very close to hand-coded Threaded-C programs. This work was supported, in part, by NSERC and FCAR.

14.
Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from fine-grained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: loop iterations should be distributed to threads cyclically rather than in blocks; non-loop programs should not be software-speculated; and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
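The cyclic-versus-blocked recommendation can be illustrated with OpenMP scheduling clauses (a hedged analogy; the paper's compiler experiments were not expressed in OpenMP): schedule(static, 1) deals iterations out cyclically, one at a time, while plain schedule(static) gives each thread one contiguous block:

```c
/* Cyclic vs. blocked distribution of loop iterations, illustrated with OpenMP.
 * Compile with: cc -fopenmp sched.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    /* Blocked: each thread gets one contiguous chunk of iterations. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) a[i] = 2.0 * b[i];

    /* Cyclic: iterations are dealt out one at a time, round-robin.  On an SMT
     * core, neighboring iterations then run in threads that share the same
     * cache, which is closer to the distribution the paper's results favor. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++) a[i] += 2.0 * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```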

15.
As information processing applications take greater roles in our everyday life, database management systems (DBMSs) are growing in importance. DBMSs have traditionally exhibited poor cache performance and large memory footprints, therefore performing at only a fraction of their ideal execution and exhibiting low processor utilization. Previous research has studied the memory system of DBMSs on research-oriented simultaneous multithreading (SMT) processors. Recently, several differences have been noted between the real hyper-threaded architecture implemented by the Intel Pentium 4 and the earlier SMT research architectures. This paper characterizes the performance of a prototype open-source DBMS running TPC-equivalent benchmark queries on an Intel Pentium 4 Hyper-Threading processor. We use the hardware counters provided by the Pentium 4 to evaluate the micro-architecture and study the memory system behavior of each query running on the DBMS. Our results show a performance improvement of up to 1.16× on TPC-C-equivalent and 1.26× on TPC-H-equivalent queries due to hyper-threading.

16.
Simultaneous multithreading (SMT) is a technique that allows multiple independent threads to issue multiple instructions per cycle; it fully exploits available instruction-level and thread-level parallelism and improves the utilization of limited resources. Taking "Longteng R2" (龙腾R2), a 32-bit superscalar processor independently developed by the Aviation Microelectronics Center of Northwestern Polytechnical University, as the baseline, this article introduces SMT under the constraints of essentially not changing the sizes of internal structures, adding no execution units, and making only the necessary modifications. Performance is analyzed by simulating different numbers of threads and various thread combinations. Although some factors constrain the performance gain, introducing SMT still yields a performance increase of up to about 50%.

17.
Sequence segmentation has gained popularity in bioinformatics, particularly in the study of DNA sequences. Information-theoretic models have been used to provide accurate solutions to the segmentation of DNA sequences. Existing dynamic programming approaches provide an optimal solution to the segmentation problem, but their quadratic time complexity prohibits their applicability to long sequences. In this paper, we propose a parallel approach to improve the performance of a quasilinear sequence segmentation algorithm. The target segmentation technique is a divide-and-conquer recursive algorithm based on information theory principles and models. We present three parallel implementations that aim to reduce the segmentation time. The first implementation uses the multithreading capabilities of CPUs. The second is a hybrid implementation that utilizes the synergy between the CPU and the multithreading power of GPUs. The third is a variation of the hybrid approach that utilizes unified memory between the CPU and the GPU instead of the standard memory-copy approach. We demonstrate the applicability of the parallel implementations by testing them on real DNA sequences and randomly generated sequences with different lengths and different numbers of unique elements. The results show that the hybrid CPU-GPU approach outperforms the sequential implementation with a speedup of up to 5.9×, while the CPU-only parallel implementation provides a speedup of only 1.7×.

18.
Multithreading is a well-known technique for hiding latency in a non-blocking cache architecture. By switching execution from one thread to another, the CPU can perform useful work while waiting for pending requests to be processed by the main memory. In this paper we examine the effects of varying the associativity and block size on cache performance in an environment with reduced locality of reference due to multithreading. We find that for an associativity equal to the number of threads, the cache produces a very low miss rate even for small sizes. Taking into account the increase in cycle time due to larger cache size or higher associativity, we find that the optimum cache configuration for best processor performance is 16 Kbytes, direct-mapped. Finally, with a constant main memory bandwidth, increasing the block size beyond 32 bytes reduces the miss rate but degrades processor performance.

19.
Simultaneous multithreading is a processor design which consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because they successfully (and simultaneously) exploit both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.

20.
In this paper the locality of sparse algebra codes on simultaneous multithreading (SMT) architectures is studied. In this kind of architecture many hardware structures are dynamically shared among the running threads. This puts a lot of stress on the memory hierarchy, and poor locality, both inter-thread and intra-thread, may become a major bottleneck in the performance of a code. This behavior is even more pronounced when the code is irregular, as is the case for sparse matrix codes. Therefore, techniques that increase the locality of irregular codes on SMT architectures are important for achieving high performance. This paper proposes a data reordering technique specially tuned for this kind of architecture and code. It is based on a locality model developed by the authors in previous works. The technique has been tested, first, using a simulator of an SMT architecture and, subsequently, on a real architecture, Intel's Hyper-Threading. Important reductions in the number of cache misses have been achieved, even as the number of running threads grows. When applying the locality improvement technique, we also decrease the total execution time and improve the scalability of the code. Copyright © 2009 John Wiley & Sons, Ltd.
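The abstract does not describe the reordering algorithm itself; as a generic illustration of where such a permutation plugs into a sparse kernel, the sketch below computes a CSR sparse matrix-vector product with rows visited in a precomputed order perm (here just the identity), which is the hook a locality-driven reordering would fill in:

```c
/* CSR sparse matrix-vector product with a row-visit permutation.
 * perm[] is where a locality-driven reordering would plug in;
 * here it is simply the identity order. */
#include <stdio.h>

int main(void)
{
    /* 3x3 example matrix: [10 0 0; 0 20 30; 40 0 50] in CSR form. */
    int    row_ptr[] = { 0, 1, 3, 5 };
    int    col_idx[] = { 0, 1, 2, 0, 2 };
    double val[]     = { 10, 20, 30, 40, 50 };
    double x[]       = { 1, 2, 3 };
    double y[3]      = { 0 };
    int    perm[]    = { 0, 1, 2 };   /* identity; a reordering would change this */

    for (int k = 0; k < 3; k++) {
        int i = perm[k];              /* visit rows in the reordered sequence */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            y[i] += val[j] * x[col_idx[j]];
    }

    for (int i = 0; i < 3; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```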
