1.
《Journal of Parallel and Distributed Computing》2006,66(10):1304-1321
Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share resources. However, earlier studies have shown that the performance of an SMT architecture begins to saturate as the number of coexisting threads increases beyond four. We show that no single fetch policy can be the best solution during the entire execution time and that a significant performance improvement can be attained by dynamically switching the fetch policies. We propose an implementation method which includes an extremely lightweight thread to control fetch policies (a detector thread) and a processor architecture to run the detector thread without impact on the user application threads. We evaluate various heuristics for the detector thread to determine the best fetch policies. We show that, with eight threads running on our simulated SMT, the proposed approach can outperform fixed scheduling mechanisms by up to 30%.
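The idea of switching fetch policies at run time can be illustrated with a minimal sketch. The abstract does not state which heuristics the detector thread uses, so the rule below is an assumption: it supposes that a recent throughput sample (instructions per cycle) is available for each candidate policy, and it switches only when another policy beats the current one by a margin. The function name, the policy names, and the `margin` parameter are all hypothetical.

```python
from typing import Dict


def pick_fetch_policy(recent_ipc: Dict[str, float],
                      current: str,
                      margin: float = 0.05) -> str:
    """Return the fetch policy to use for the next interval.

    Hypothetical heuristic (not taken from the paper): switch only if some
    other policy's recently sampled IPC exceeds the current policy's IPC by
    more than `margin` (a relative threshold that damps oscillation).
    """
    best = max(recent_ipc, key=lambda name: recent_ipc[name])
    if best != current and recent_ipc[best] > recent_ipc[current] * (1.0 + margin):
        return best
    return current


# Illustrative, made-up samples for two placeholder policies.
samples = {"policy_a": 1.0, "policy_b": 1.2}
next_policy = pick_fetch_policy(samples, current="policy_a")
```

The margin is a design choice of this sketch: without it, two policies with nearly equal throughput would cause the selector to flip back and forth on measurement noise.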
2.
We have designed and implemented a lightweight process (thread) library called ‘Lesser Bear’ for SMP computers. Lesser Bear has high portability and thread-level parallelism. Creating UNIX processes as virtual processors and a memory-mapped file as a huge shared-memory space enables Lesser Bear to execute threads in parallel. Lesser Bear requires exclusive operation between peer virtual processors, and treats a shared-memory space as a critical section for synchronization of threads. Therefore, thread functions of the previous Lesser Bear are serialized. In this paper, we present a scheduling mechanism to execute thread functions in parallel. In the design of the proposed mechanism, we divide the entire shared-memory space into partial spaces for virtual processors, and prepare two queues (Protect Queue and Waiver Queue) for each partial space. We adopt an algorithm in which lock operations are not necessary for enqueueing. This algorithm allows us to propose a scheduling mechanism that can reduce the scheduling overhead. The mechanism is applied to Lesser Bear and evaluated by experimental results. Copyright © 2001 John Wiley & Sons, Ltd.
3.
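To address the usability problem of the dataflow programming model, so that it can support stream applications with dynamic data exchange rates while preserving program parallelism, a dataflow compilation system that combines dynamic scheduling with static optimization is designed. The compiler takes source programs written in the COStream language as input, analyzes them, and partitions the program into coarse-grained subgraphs, using communication edges with dynamic data rates as boundaries; static optimization is applied within each subgraph. Computing resource usage is estimated from the workload of each computation unit in a subgraph, the computation units of the subgraph are mapped to processor cores, and stage partitioning assigns them to the corresponding pipeline stages. At run time, each subgraph starts one thread on each processor core; optimizing inter-thread communication avoids the synchronization overhead caused by multiple threads reading and writing the same memory region simultaneously, and reduces the number of thread context switches. Semaphores control synchronization among the threads within a subgraph. Based on the run-time data exchange rates of the computation units in each subgraph, together with the current thread state, the execution of the subgraphs is scheduled dynamically, a dynamic software pipeline is built, and the corresponding multithreaded target code is generated. Experiments use a general-purpose x86-64 multicore processor as the platform to test and analyze the performance of the dataflow compilation. The results show that the compilation system can implement dataflow applications with dynamic data exchange rates, which broadens the usability of the compilation system and yields a certain speedup.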
4.
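This paper combines the thread scheduling mechanism of the Windows NT operating system with thread synchronization control methods in applications to analyze the performance of each thread in the software system of an infrared scene generator. It studies how synchronization between the system's typical threads is implemented, proposes improvements to raise application efficiency, and describes the overall performance of the new system. Experiments show that the improved system runs significantly faster and that the application achieves higher CPU utilization.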
5.
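For multicore processor systems with independent DVFS, a parallel task scheduling optimization algorithm based on a K-thread low-energy model is proposed (Tasks Optimization based on Energy-Effectiveness Model, TO-EEM). Compared with traditional energy-saving scheduling of parallel tasks, the algorithm aims not only to reduce instantaneous processor power by lowering processor frequency, but also to account for the thread blocking caused by synchronization and mutual exclusion among parallel tasks, allocating thread resources appropriately to shorten thread synchronization time and improve parallel performance. While guaranteeing a given level of parallel speedup, it raises resource utilization and reduces energy consumption, reaching a trade-off between program energy consumption and performance. Extensive simulation experiments show that the proposed task optimization algorithm saves energy markedly, effectively lowers processor power consumption, and consistently maintains linear speedup.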
6.
《Computer Architecture Letters》2004,3(1):5-5
Simultaneous Multi-Threading (SMT) is a processor design method in which concurrent hardware threads share processor resources such as functional units and memory. The scheduling complexity and performance of an SMT processor depend on the topology used in the fetch and issue stages. In this paper, we propose a thread-sensitive issue policy for a partitioned SMT processor which is based on a thread metric. We propose the number of ready-to-issue instructions of each thread as the priority metric. To evaluate our method, we have developed a reconfigurable SMT simulator on top of the SimpleScalar Toolset. We simulated our modeled processor under several workloads composed of SPEC benchmarks. Experimental results show around 30% improvement compared to the conventional OLDEST_FIRST mixed topology issue policy. Additionally, the hardware implementation of our architecture with this metric in the issue stage is quite simple.
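A priority metric based on per-thread ready-to-issue counts can be sketched as a simple ordering of thread IDs. The abstract does not say whether threads with more or fewer ready instructions are favored, nor how ties are broken; the sketch below assumes that more ready instructions means higher priority and that ties go to the lower thread ID. The function name `issue_order` is hypothetical.

```python
from typing import Dict, List


def issue_order(ready_counts: Dict[int, int]) -> List[int]:
    """Order thread IDs for the issue stage.

    Assumption (not confirmed by the abstract): a thread with more
    ready-to-issue instructions gets higher priority; ties are broken
    in favor of the lower thread ID.
    """
    return sorted(ready_counts, key=lambda tid: (-ready_counts[tid], tid))


# Thread 1 has the most ready instructions, so it is served first.
order = issue_order({0: 3, 1: 5, 2: 3})
```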
7.
When the critical path of a communication session between end points includes the actions of operating system kernels, there are attendant overheads. Along with other factors, such as functionality and flexibility, such overheads motivate and favor the implementation of communication protocols in user space. When implemented with threads, such protocols may hold the key to optimal communication performance and functionality. Based on implementations of reliable user-space protocols supported by a threads framework, we focus on our experiences with internal thread scheduling techniques and their potential impact on performance. We present scheduling strategies that enable threads to do both application-level and communication-related processing. With experiments performed on a Sun SPARC-5 LAN environment, we show how different scheduling strategies yield different levels of application-processing efficiency, communication latency and packet loss. This work forms part of a larger study on the implementation of multiple thread-based protocols in a single address space, and the benefits of coupling protocols with applications. Copyright © 2005 John Wiley & Sons, Ltd.
8.
Eliminating synchronization faults in air traffic control software via design for verification with concurrency controllers (Total citations: 1; self-citations: 0; citations by others: 1)
Aysu Betin Can Tevfik Bultan Mikael Lindvall Benjamin Lux Stefan Topp 《Automated Software Engineering》2007,14(2):129-178
The increasing level of automation in critical infrastructures requires development of effective ways for finding faults in
safety critical software components. Synchronization in concurrent components is especially prone to errors and, due to difficulty
of exploring all thread interleavings, it is difficult to find synchronization faults. In this paper we present an experimental
study demonstrating the effectiveness of model checking techniques in finding synchronization faults in safety critical software
when they are combined with a design for verification approach. We based our experiments on an automated air traffic control software component called the Tactical Separation
Assisted Flight Environment (TSAFE). We first reengineered TSAFE using the concurrency controller design pattern. The concurrency controller design pattern enables a modular verification strategy by decoupling the behaviors of the concurrency
controllers from the behaviors of the threads that use them using interfaces specified as finite state machines. The behavior
of a concurrency controller is verified with respect to arbitrary numbers of threads using the infinite state model checking
techniques implemented in the Action Language Verifier (ALV). The threads which use the controller classes are checked for
interface violations using the finite state model checking techniques implemented in the Java Path Finder (JPF). We present
techniques for thread isolation which enables us to analyze each thread in the program separately during interface verification.
We conducted two sets of experiments using these verification techniques. First, we created 40 faulty versions of TSAFE using
manual fault seeding. During this exercise we also developed a classification of faults that can be found using the presented
design for verification approach. Next, we generated another 100 faulty versions of TSAFE using randomly seeded faults that
were created automatically based on this fault classification. We used both infinite and finite state verification techniques
for finding the seeded faults. The results of our experiments demonstrate the effectiveness of the presented design for verification
approach in eliminating synchronization faults.
9.
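Research on thread scheduling for multicore processors based on particle swarm optimization (Total citations: 1; self-citations: 1; citations by others: 0)
To solve the thread scheduling problem of multicore processors effectively, a thread scheduling algorithm built on the particle swarm optimization framework is proposed. Following the designed scheduling model, the algorithm duplicates, on the thread DAG, dependent threads that are not on the same processor, producing mutually independent sub-DAGs, and schedules them with an improved particle swarm optimization algorithm, thereby improving thread scheduling efficiency. The algorithm is implemented in simulation, and experimental data verify its advantages.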
10.
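Design and implementation of user-level multithreading in a Java virtual machine (Total citations: 5; self-citations: 0; citations by others: 5)
This paper describes in detail the design and implementation of multithreading in the Java virtual machine of a domestically developed (Chinese) open-system platform. For thread scheduling, static-priority-level round-robin scheduling with independent queues is used, which handles the scheduling of independent looping threads well. For thread synchronization, a hashed hybrid lock design is adopted. Experimental results show that the lock has a small memory footprint and high execution efficiency.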
11.
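When extending the Windows operating system, custom scheduling requires custom thread synchronization, so a tailored synchronization mechanism is needed. After analyzing how the existing critical section is implemented, a custom critical section is designed and implemented. In the custom critical section, a kernel driver provides scheduling; atomic operations on unsigned integers guarantee the atomicity of operations on kernel objects; and a memory-mapping mechanism maps kernel object addresses to user-mode addresses, so that operations can complete in user mode, which improves efficiency. Experimental results show that the custom critical section achieves thread synchronization.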
12.
Adaptive parallel interval branch and bound algorithms based on their performance for multicore architectures (Total citations: 1; self-citations: 1; citations by others: 0)
This work studies how to adapt the number of threads of a parallel Interval Branch and Bound algorithm to the available computational
resources based on its current performance. Basically, a thread can create a new thread that will process part of the ancestor
workload. In this way, load balancing is inherent to the creation of threads. The applications in which we are interested
use branch-and-bound algorithms which are highly irregular and therefore difficult to predict. The proposed methods can be
used for more predictable algorithms as well. This research complements and does not substitute other devices that improve
the exploitation of the system, such as dynamic scheduling policies or work-stealing. Several approaches are presented. They
differ in the metrics used and in whether the Operating System (O.S.) must be modified. The scenario for this research
is just one multithreaded application running in a multicore architecture. Experimental results show that the appropriate
number of running threads can be determined at run-time, avoiding having to statically establish the number of threads of
an application. Thread creation decisions have to be made frequently to obtain better results, but are time-consuming. One
of the presented models uses the existence of an idle processor to carry out these decisions, obtaining the desired results.
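The thread-creation step described in this abstract (a thread spawns a new thread that takes over part of its workload) can be sketched roughly as follows. The abstract does not give the decision metrics, so the rule here is an assumption: spawn only when a processor is idle and enough pending work exists to split. The names `should_spawn`, `split_workload`, and `min_items` are hypothetical.

```python
from typing import Tuple


def should_spawn(idle_processors: int, pending_items: int, min_items: int = 2) -> bool:
    """Hypothetical decision rule (not the paper's metric): create a new
    thread only if a processor is idle and the workload can be split."""
    return idle_processors > 0 and pending_items >= min_items


def split_workload(work: list) -> Tuple[list, list]:
    """Split pending work between the ancestor thread and the new thread.

    The ancestor keeps the first half (the larger half when the length is
    odd); the remainder goes to the new thread.
    """
    mid = (len(work) + 1) // 2
    return work[:mid], work[mid:]


pending = [1, 2, 3, 4, 5]
if should_spawn(idle_processors=1, pending_items=len(pending)):
    kept, handed_over = split_workload(pending)
```

Because the new thread receives work directly from its ancestor, load balancing follows from thread creation itself, which matches the abstract's remark that load balancing is inherent to the creation of threads.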
13.
14.
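A simultaneous multithreading processor executes instructions from different threads at the same time, exploiting instruction-level parallelism both within and across threads and greatly improving processor performance. However, this way of sharing resources can lead to harmful contention for critical resources (including rename registers and instruction queues), causing starvation and even reducing processor throughput. This happens mainly because some threads encounter long-latency instructions and hold critical resources for long periods, so that other threads' resource demands cannot be met; it also lowers resource utilization. There are three main ways to reduce the negative effects of contention: thread scheduling, which decides at the fetch stage which threads to fetch instructions from; instruction scheduling, which decides which instructions enter the critical resources; and critical resource partitioning, which allocates separate critical resources to each thread. This paper surveys these scheduling strategies.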
15.
Simultaneous Multi-Threading (SMT) has been a very popular design for improving resource utilization by sharing key datapath components among multiple independent threads. However, allowing any of the threads to overwhelm these shared resources not only leads to unfair thread processing but may also result in severely degraded overall performance. Preventing idling threads from clogging the critical resources in the pipeline is essential to sustaining the desired system performance. In this paper, we show that, if one can manage to recall instructions of idling threads from the shared Issue Queue (IQ), the system performance is easily enhanced by a significant margin, with up to 20% for some benchmark mixes. An even more noteworthy feature of this technique is that the ensuing hardware overhead is very small, and it can also be coupled with other advanced techniques employed in other stages of the SMT pipeline for potentially additive benefits.
16.
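In multicore systems, the memory subsystem consumes a large share of energy, and that share will keep growing, so addressing memory power consumption is key to optimizing system power. The threads in the system are divided into thread groups according to their memory address spaces and a load-balancing policy; threads in the same group are allocated physical pages in the same memory rank; and scheduling is then performed group by group. This yields a memory power optimization method that combines page allocation with group scheduling (CAS). CAS periodically activates the memory ranks currently needed, so that ranks temporarily not in use can be put into a low-power state and the idle time of low-power ranks is extended. Simulation results show that, compared with other similar methods, CAS reduces memory power consumption by 10% on average while improving performance by 8%.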
17.
Sonia López Óscar Garnica David H. Albonesi Steven Dropsho Juan Lanchares José I. Hidalgo 《Microprocessors and Microsystems》2011,35(8):683-694
Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, thus offering even more opportunities to dynamically adjust cache resources to the workload. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguring in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically; the former minimizes the arithmetic mean cache access time (which we will call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies with the degree of multi-threading. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one and two thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one, two, and four thread workloads, respectively, over the best fixed-configuration cache design.
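The contrast between the arithmetic and harmonic means of access time can be made concrete with a small numeric sketch. It only illustrates the two standard means on a made-up sample of per-access latencies; it is not the HAMAT algorithm, whose details the abstract does not give, and the sample values and function names are hypothetical.

```python
from statistics import harmonic_mean, mean
from typing import Sequence


def arithmetic_mean_access_time(latencies: Sequence[float]) -> float:
    """Arithmetic mean of per-access latencies (in cycles)."""
    return mean(latencies)


def harmonic_mean_access_time(latencies: Sequence[float]) -> float:
    """Harmonic mean of per-access latencies (in cycles)."""
    return harmonic_mean(latencies)


# Made-up sample: three 2-cycle hits and one 20-cycle miss.
sample = [2, 2, 2, 20]
amat = arithmetic_mean_access_time(sample)  # (2 + 2 + 2 + 20) / 4
hmat = harmonic_mean_access_time(sample)    # 4 / (1/2 + 1/2 + 1/2 + 1/20)
```

In this sample the single miss pulls the arithmetic mean well above the hit latency, while the harmonic mean stays close to it. That is consistent with the abstract's pairing of miss-oriented control with the arithmetic mean and hit-oriented control with the harmonic mean.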
18.
Tan Nguyen Daniel Hefenbrock Jason Oberg Ryan Kastner Scott Baden 《Journal of Parallel and Distributed Computing》2013
Face detection is a key component in applications such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. The Viola–Jones algorithm is an attractive means of meeting the real-time requirement, and has been widely implemented on custom hardware, FPGAs and GPUs. We demonstrate a GPU implementation that achieves competitive performance, but with low development costs. Our solution treats the irregularity inherent to the algorithm using a novel dynamic warp scheduling approach that eliminates thread divergence. This new scheme also employs a thread pool mechanism, which significantly alleviates the cost of creating, switching, and terminating threads. Compared to static thread scheduling, our dynamic warp scheduling approach reduces the execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, realizing 95.6 FPS on 5 Fermi GPUs.
19.
Bin Liu Yinliang Zhao Yuxiang Li Yanjun Sun Boqin Feng 《The Journal of supercomputing》2014,67(3):778-805
Speculative multithreading (SpMT) is a thread-level automatic parallelization technique, which partitions sequential programs into multiple threads to be executed in parallel. This paper presents different thread partitioning strategies for nonloops and loops. For nonloops, we propose a cost estimation based on combined run-time effects of various speculation factors to predict the resulting performance of candidate threads to guide the thread partitioning. For loops, we parallelize all the profitable loops that can potentially offer additional performance benefits by multilevel spawning in loop bodies, loop iterations, and inner loops. Then we select a proper thread boundary located in front of the loop branch instruction to reduce invalid spawning threads that waste core resources. Experimental results show that the proposed approach can obtain a significant increase in speedup and Olden benchmarks reach a performance improvement of 6.62% on average.
20.
Jenn-Yuan Tsai Zhenzhen Jiang Pen-Chung Yew 《International journal of parallel programming》1999,27(1):1-19
Several useful compiler and program transformation techniques for the superthreaded architectures are presented in this paper. The superthreaded architecture adopts a thread pipelining execution model to facilitate runtime data dependence checking between threads, and to maximize thread overlap to enhance concurrency. In this paper, we present some important program transformation techniques to facilitate concurrent execution among threads, and to manage critical system resources such as the memory buffers effectively. We evaluate the effectiveness of those program transformation techniques by applying them manually on several benchmark programs, and using a trace-driven, cycle-by-cycle superthreaded processor simulator. The simulation results show that a superthreaded processor can achieve promising speedup for most of the benchmark programs.