Similar articles
1.
Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share resources. However, earlier studies have shown that the performance of an SMT architecture begins to saturate as the number of coexisting threads increases beyond four. We show that no single fetch policy can be the best solution during the entire execution time and that a significant performance improvement can be attained by dynamically switching the fetch policies. We propose an implementation method which includes an extremely lightweight thread to control fetch policies (a detector thread) and a processor architecture to run the detector thread without impact on the user application threads. We evaluate various heuristics for the detector thread to determine the best fetch policies. We show that, with eight threads running on our simulated SMT, the proposed approach can outperform fixed scheduling mechanisms by up to 30%.

2.
We have designed and implemented a light‐weight process (thread) library called ‘Lesser Bear’ for SMP computers. Lesser Bear has high portability and thread‐level parallelism. Creating UNIX processes as virtual processors and a memory‐mapped file as a huge shared‐memory space enables Lesser Bear to execute threads in parallel. Lesser Bear requires exclusive operation between peer virtual processors, and treats a shared‐memory space as a critical section for synchronization of threads. Therefore, thread functions of the previous Lesser Bear are serialized. In this paper, we present a scheduling mechanism to execute thread functions in parallel. In the design of the proposed mechanism, we divide the entire shared‐memory space into partial spaces for virtual processors, and prepare two queues (Protect Queue and Waiver Queue) for each partial space. We adopt an algorithm in which lock operations are not necessary for enqueueing. This algorithm allows us to propose a scheduling mechanism that can reduce the scheduling overhead. The mechanism is applied to Lesser Bear and evaluated by experimental results. Copyright © 2001 John Wiley & Sons, Ltd.

3.
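To address the usability problem of the dataflow programming model, so that it can serve stream applications with dynamic data exchange rates while preserving program parallelism, a dataflow compilation system that combines dynamic scheduling with static optimization is designed. The compiler takes source programs written in the COStream language as input. By analyzing the source program, it partitions the program into coarse-grained subgraphs, using communication edges with dynamic rates as boundaries, and applies static optimization within each subgraph. It estimates the usage of computing resources from the workload of each computation unit in a subgraph and maps the computation units within the subgraph to processor cores; stage partitioning then assigns those computation units to the corresponding pipeline stages. At run time, each subgraph starts one thread on each processor core. Optimizing inter-thread communication avoids the synchronization overhead caused by multiple threads reading and writing the same memory region at the same time and reduces the number of thread context switches. Semaphores control synchronization among the threads within a subgraph. Based on the run-time data exchange rates of the computation units in each subgraph, combined with the current thread states, the execution of each subgraph is scheduled dynamically, a dynamic software pipeline is constructed, and the corresponding multithreaded target code is generated. The experiments use a general-purpose X86-64 multicore processor as the platform to test and analyze the performance of the dataflow compilation. The results show that the compilation system supports dataflow applications with dynamic data exchange rates, which broadens its usability, and that it provides a certain speedup.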

4.
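This paper combines the thread scheduling mechanism of the Windows NT operating system with thread synchronization control methods used in applications. It analyzes the performance of each thread in the software system of an infrared scene generator and studies how synchronization between its typical threads is implemented. It also proposes improvements to raise application efficiency and describes the overall performance of the new system. Experiments show that the improved system runs significantly faster and that the application achieves relatively high CPU utilization.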

5.
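For multicore processor systems with independent DVFS, a parallel task scheduling optimization algorithm based on a K-thread low-energy model is proposed (Tasks Optimization based on Energy-Effectiveness Model, TO-EEM). Compared with traditional energy-saving scheduling of parallel tasks, the main goal of the algorithm is not only to reduce the instantaneous power of the processor by lowering its frequency, but also to account for the thread blocking caused by synchronization and mutual exclusion among parallel tasks, allocating thread resources sensibly to reduce thread synchronization time and optimize parallel performance. While guaranteeing a given parallel speedup, it improves resource utilization and reduces energy consumption, achieving a trade-off between program energy consumption and performance. Extensive simulation experiments were carried out. The results show that the proposed task optimization model algorithm saves energy noticeably, effectively reduces processor power consumption, and consistently maintains linear speedup.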

6.
Simultaneous Multi Threading (SMT) is a processor design method in which concurrent hardware threads share processor resources like functional units and memory. The scheduling complexity and performance of an SMT processor depend on the topology used in the fetch and issue stages. In this paper, we propose a thread sensitive issue policy for a partitioned SMT processor which is based on a thread metric. We propose the number of ready-to-issue instructions of each thread as the priority metric. To evaluate our method, we have developed a reconfigurable SMT-simulator on top of the SimpleScalar Toolset. We simulated our modeled processor under several workloads composed of SPEC benchmarks. Experimental results show around 30% improvement compared to the conventional OLDEST_FIRST mixed topology issue policy. Additionally, the hardware implementation of our architecture with this metric in the issue stage is quite simple.
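A minimal sketch of how a per-thread ready-to-issue count could drive issue priority. The abstract does not say whether a larger count means higher or lower priority, so this sketch assumes that threads with more ready instructions issue first and that ties go to the lower thread id; `issue_order`, `issue_cycle` and `issue_width` are hypothetical names, not taken from the paper or from SimpleScalar.

```python
from collections import deque


def issue_order(ready_counts):
    """Order thread ids by descending ready-to-issue count (ties: lower id first)."""
    return sorted(ready_counts, key=lambda tid: (-ready_counts[tid], tid))


def issue_cycle(ready_queues, issue_width):
    """Issue up to issue_width instructions in one cycle.

    ready_queues maps a thread id to a deque of its ready instructions.
    Threads are visited in the priority order given by issue_order.
    """
    counts = {tid: len(queue) for tid, queue in ready_queues.items()}
    issued = []
    for tid in issue_order(counts):
        queue = ready_queues[tid]
        while queue and len(issued) < issue_width:
            issued.append((tid, queue.popleft()))
    return issued


queues = {
    0: deque(["a"]),
    1: deque(["b", "c", "d"]),
    2: deque(["e", "f"]),
}
picked = issue_cycle(queues, issue_width=4)
```

With an issue width of 4, thread 1 (three ready instructions) fills three slots and thread 2 takes the last one, so thread 0 waits for the next cycle.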

7.
When the critical path of a communication session between end points includes the actions of operating system kernels, there are attendant overheads. Along with other factors, such as functionality and flexibility, such overheads motivate and favor the implementation of communication protocols in user space. When implemented with threads, such protocols may hold the key to optimal communication performance and functionality. Based on implementations of reliable user‐space protocols supported by a threads framework, we focus on our experiences with internal threads' scheduling techniques and their potential impact on performance. We present scheduling strategies that enable threads to do both application‐level and communication‐related processing. With experiments performed on a Sun SPARC‐5 LAN environment, we show how different scheduling strategies yield different levels of application‐processing efficiency, communication latency and packet‐loss. This work forms part of a larger study on the implementation of multiple thread‐based protocols in a single address space, and the benefits of coupling protocols with applications. Copyright © 2005 John Wiley & Sons, Ltd.

8.
The increasing level of automation in critical infrastructures requires development of effective ways for finding faults in safety critical software components. Synchronization in concurrent components is especially prone to errors and, due to difficulty of exploring all thread interleavings, it is difficult to find synchronization faults. In this paper we present an experimental study demonstrating the effectiveness of model checking techniques in finding synchronization faults in safety critical software when they are combined with a design for verification approach. We based our experiments on an automated air traffic control software component called the Tactical Separation Assisted Flight Environment (TSAFE). We first reengineered TSAFE using the concurrency controller design pattern. The concurrency controller design pattern enables a modular verification strategy by decoupling the behaviors of the concurrency controllers from the behaviors of the threads that use them using interfaces specified as finite state machines. The behavior of a concurrency controller is verified with respect to arbitrary numbers of threads using the infinite state model checking techniques implemented in the Action Language Verifier (ALV). The threads which use the controller classes are checked for interface violations using the finite state model checking techniques implemented in the Java Path Finder (JPF). We present techniques for thread isolation which enables us to analyze each thread in the program separately during interface verification. We conducted two sets of experiments using these verification techniques. First, we created 40 faulty versions of TSAFE using manual fault seeding. During this exercise we also developed a classification of faults that can be found using the presented design for verification approach. Next, we generated another 100 faulty versions of TSAFE using randomly seeded faults that were created automatically based on this fault classification. 
We used both infinite and finite state verification techniques for finding the seeded faults. The results of our experiments demonstrate the effectiveness of the presented design for verification approach in eliminating synchronization faults.

9.
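Research on thread scheduling for multicore processors based on particle swarm optimization   Total citations: 1 (self-citations: 1, citations by others: 0)
To solve the thread scheduling problem of multicore processors effectively, a thread scheduling algorithm built on the particle swarm optimization framework is proposed. Following the designed scheduling model, the algorithm duplicates threads in the thread DAG that depend on each other but are not on the same processor, which produces mutually independent sub-DAGs, and schedules them with an improved particle swarm optimization algorithm, thereby improving thread scheduling efficiency. The algorithm was implemented in simulation, and experimental data verify its advantages.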

10.
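Design and implementation of user-level multithreading for the Java virtual machine   Total citations: 5 (self-citations: 0, citations by others: 5)
Ding Yuxin, Cheng Hu. Journal of Software (《软件学报》), 2000, 11(5): 701-706
This paper describes in detail the design and implementation of multithreading in the Java virtual machine for a domestically developed open system platform. For thread scheduling, static-priority round-robin scheduling with independent queues is used, which handles the scheduling of independent looping threads well. For thread synchronization, a hashed hybrid lock design is adopted. Experimental results show that the lock has a small memory footprint and high execution efficiency.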
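A small sketch of static-priority round-robin scheduling with one queue per priority level, which is one common reading of the scheduling scheme named above. The abstract gives no implementation details, so the class `PriorityRoundRobin`, its methods, and the convention that level 0 is the highest priority are all assumptions made for illustration.

```python
from collections import deque


class PriorityRoundRobin:
    """One FIFO queue per static priority level; index 0 is the highest."""

    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]

    def add(self, thread, priority):
        """Register a runnable thread at a fixed priority level."""
        self.queues[priority].append(thread)

    def next_thread(self):
        """Pick the next thread from the highest non-empty level.

        The chosen thread moves to the back of its own queue, so threads
        at the same level take turns, including threads that never block.
        """
        for queue in self.queues:
            if queue:
                thread = queue.popleft()
                queue.append(thread)
                return thread
        return None


scheduler = PriorityRoundRobin(levels=2)
scheduler.add("A", 0)
scheduler.add("B", 0)
scheduler.add("C", 1)
order = [scheduler.next_thread() for _ in range(3)]
```

In this sketch a looping thread cannot monopolize its level, but lower levels only run when every higher level is empty; whether the original design adds anything against that form of starvation is not stated in the abstract.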

11.
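When extending the Windows operating system, custom scheduling requires custom thread synchronization, so the synchronization mechanism has to be customized. After analyzing how the existing critical section is implemented, a custom critical section was designed and implemented. In the custom critical section, a kernel driver provides scheduling; atomic operations on unsigned integers guarantee the atomicity of operations on kernel objects; and memory mapping maps kernel object addresses to user-mode addresses, so that operations can complete in user mode, which improves efficiency. Experimental results show that the custom critical section achieves thread synchronization.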

12.
This work studies how to adapt the number of threads of a parallel Interval Branch and Bound algorithm to the available computational resources based on its current performance. Basically, a thread can create a new thread that will process part of the ancestor workload. In this way, load balancing is inherent to the creation of threads. The applications in which we are interested use branch-and-bound algorithms which are highly irregular and therefore difficult to predict. The proposed methods can be used for more predictable algorithms as well. This research complements and does not substitute other devices that improve the exploitation of the system, such as dynamic scheduling policies or work-stealing. Several approaches are presented. They differ in the metrics used and in whether or not they require modifying the Operating System (O.S.). The scenario for this research is just one multithreaded application running in a multicore architecture. Experimental results show that the appropriate number of running threads can be determined at run-time, avoiding having to statically establish the number of threads of an application. Thread creation decisions have to be made frequently to obtain better results, but are time-consuming. One of the presented models uses the existence of an idle processor to carry out these decisions, obtaining the desired results.
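The idea that a thread creates a new thread and hands over part of its own workload can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the spawn decision here is a simple cap on the number of running threads (`max_threads`), standing in for the performance metrics the abstract mentions, and the "branch and bound" is reduced to bisecting an interval until it is no wider than `eps`. The class and method names are hypothetical.

```python
import threading


class AdaptiveBisection:
    """Bisect [lo, hi] down to width eps; threads split their work on demand."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.lock = threading.Lock()
        self.active = 0    # threads currently running
        self.spawned = 0   # threads created in total
        self.leaves = 0    # intervals no wider than eps
        self.threads = []

    def _spawn(self, work, eps):
        # Caller holds self.lock. The thread is started before the lock is
        # released so that solve() never joins an unstarted thread.
        self.active += 1
        self.spawned += 1
        thread = threading.Thread(target=self._run, args=(work, eps))
        self.threads.append(thread)
        thread.start()

    def _maybe_spawn(self, work, eps):
        """Hand half of this thread's pending intervals to a new thread."""
        with self.lock:
            if self.active >= self.max_threads:
                return
            half = len(work) // 2
            handed_over = work[:half]
            del work[:half]
            self._spawn(handed_over, eps)

    def _run(self, work, eps):
        local_leaves = 0
        while work:
            if len(work) > 1:
                self._maybe_spawn(work, eps)
            lo, hi = work.pop()
            if hi - lo <= eps:
                local_leaves += 1
            else:
                mid = (lo + hi) / 2
                work.extend([(lo, mid), (mid, hi)])
        with self.lock:
            self.leaves += local_leaves
            self.active -= 1

    def solve(self, lo, hi, eps):
        with self.lock:
            self._spawn([(lo, hi)], eps)
        index = 0
        while True:
            with self.lock:
                if index >= len(self.threads):
                    break
                thread = self.threads[index]
            thread.join()  # threads it created are already in the list
            index += 1
        return self.leaves


solver = AdaptiveBisection(max_threads=4)
leaf_count = solver.solve(0.0, 1.0, 1 / 16)
```

Because each new thread takes over part of its creator's pending intervals, the load is rebalanced at the moment of creation; the result (16 leaves for these inputs) does not depend on how many threads were actually created.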

13.
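Research on thread scheduling policies for multicore multithreaded architectures   Total citations: 1 (self-citations: 0, citations by others: 1)
The chip multithreading (CMT) architecture combines the advantages of chip multiprocessing (CMP) and simultaneous multithreading (SMT) architectures. It lets all on-chip threads that are in the executing state run in parallel every cycle, which leads to sharing of, and contention for, hardware resources within and between cores. After describing the resource-sharing characteristics of the CMT architecture and briefly reviewing the development of SMT thread scheduling, this paper focuses on active topics such as thread scheduling policies and resource partitioning mechanisms that aim to reduce resource contention. It analyzes the state of research, discusses the strengths and weaknesses of existing policies in handling these problems, and explores possible research directions.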

14.
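Yin Jie, Jiang Jianhui. Computer Science (《计算机科学》), 2010, 37(3): 256-261
A simultaneous multithreading processor executes instructions from different threads at the same time, exploiting instruction parallelism both within and across threads, which greatly improves processor performance. However, this way of sharing resources can cause harmful competition for critical resources (including rename registers, instruction queues, and so on), leading to starvation and even reducing processor throughput. This happens mainly because some threads encounter long-latency instructions and hold critical resources for a long time, so the resource demands of other threads cannot be met; it also lowers resource utilization. There are three main ways to reduce the negative effects of this competition: thread scheduling, which decides at the fetch stage which threads to fetch instructions from; instruction scheduling, which decides which instructions enter the critical resources; and critical resource partitioning, which allocates separate critical resources to each thread. This paper surveys these scheduling strategies.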

15.
Simultaneous Multi-Threading (SMT) has been a very popular design in improving resource utilization by sharing key datapath components among multiple independent threads. However, allowing any of the threads to overwhelm these shared resources not only leads to unfair thread processing but may also result in severely degraded overall performance. How to prevent idling threads from clogging the critical resources in the pipeline becomes a must in sustaining desired system performance. In this paper, we show that, if one can manage to recall instructions of idling threads from the shared Issue Queue (IQ), the system performance is easily enhanced by a significant margin, with up to 20% for some benchmark mixes. An even more noteworthy feature about this technique is that the ensuing hardware overhead is very insignificant and it can also be coupled with other advanced techniques employed in other stages of the SMT pipeline for potentially additive benefits.

16.
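Jia Gangyong, Wan Jian, Li Xi, Jiang Congfeng, Dai Dong. Journal of Software (《软件学报》), 2014, 25(7): 1403-1415
In multicore systems, the memory subsystem consumes a large amount of energy, and its share will keep growing. Addressing memory power consumption is therefore key to optimizing system power. Threads in the system are divided into thread groups according to their memory address spaces and the load-balancing policy; threads in the same group are allocated physical pages from the same memory rank, and scheduling is then performed one group at a time. On this basis, a memory power optimization method that combines page allocation and group scheduling (CAS) is proposed. CAS periodically activates the memory ranks that are currently needed, so ranks that are temporarily unused can be put into a low-power state, and it lengthens the idle time of the low-power ranks. Simulation results show that, compared with other similar methods, CAS reduces memory power consumption by 10% on average while improving performance by 8%.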

17.
Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, thus offering even more opportunities to dynamically adjust cache resources to the workload. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguring in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically; the former minimizes the arithmetic mean cache access time (which we will call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies with the degree of multi-threading. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one and two thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one, two, and four thread workloads, respectively, over the best fixed-configuration cache design.
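The contrast between the arithmetic and the harmonic mean of the access time can be made concrete with a small calculation. The two cache configurations and their numbers below are invented for illustration, and the functions compute only the two plain means over hits and misses; the abstract does not give the HAMAT formula itself, so it is not reproduced here.

```python
def arithmetic_mean_access_time(hit_time, miss_time, miss_rate):
    """Per-access arithmetic mean of the latency (in cycles)."""
    return (1.0 - miss_rate) * hit_time + miss_rate * miss_time


def harmonic_mean_access_time(hit_time, miss_time, miss_rate):
    """Per-access harmonic mean of the latency (in cycles)."""
    return 1.0 / ((1.0 - miss_rate) / hit_time + miss_rate / miss_time)


# Hypothetical configurations: a small, fast cache that misses more often
# and a large, slower cache that misses less often.
configs = {
    "small_fast": dict(hit_time=1.0, miss_time=101.0, miss_rate=0.10),
    "large_slow": dict(hit_time=2.0, miss_time=102.0, miss_rate=0.05),
}

best_by_arithmetic = min(
    configs, key=lambda name: arithmetic_mean_access_time(**configs[name])
)
best_by_harmonic = min(
    configs, key=lambda name: harmonic_mean_access_time(**configs[name])
)
```

For these numbers the arithmetic mean is about 11 cycles for the small cache and 7 for the large one, so it favors the configuration with fewer misses. The harmonic mean is about 1.1 versus 2.1 cycles, because it is dominated by the short hit latency, so it favors the faster cache. This matches the abstract's pairing of the arithmetic mean with miss behavior and the harmonic mean with hit behavior.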

18.
Face detection is a key component in applications such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. The Viola–Jones algorithm is an attractive means of meeting the real-time requirement, and has been widely implemented on custom hardware, FPGAs and GPUs. We demonstrate a GPU implementation that achieves competitive performance, but with low development costs. Our solution treats the irregularity inherent to the algorithm using a novel dynamic warp scheduling approach that eliminates thread divergence. This new scheme also employs a thread pool mechanism, which significantly alleviates the cost of creating, switching, and terminating threads. Compared to static thread scheduling, our dynamic warp scheduling approach reduces the execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, realizing 95.6 FPS on 5 Fermi GPUs.

19.
Speculative multithreading (SpMT) is a thread-level automatic parallelization technique, which partitions sequential programs into multiple threads to be executed in parallel. This paper presents different thread partitioning strategies for nonloops and loops. For nonloops, we propose a cost estimation based on combined run-time effects of various speculation factors to predict the resulting performance of candidate threads to guide the thread partitioning. For loops, we parallelize all the profitable loops that can potentially offer additional performance benefits by multilevel spawning in loop bodies, loop iterations, and inner loops. Then we select a proper thread boundary located in front of the loop branch instruction to reduce invalid spawning threads that waste core resources. Experimental results show that the proposed approach can obtain a significant increase in speedup and Olden benchmarks reach a performance improvement of 6.62% on average.

20.
Several useful compiler and program transformation techniques for the superthreaded architectures are presented in this paper. The superthreaded architecture adopts a thread pipelining execution model to facilitate runtime data dependence checking between threads, and to maximize thread overlap to enhance concurrency. In this paper, we present some important program transformation techniques to facilitate concurrent execution among threads, and to manage critical system resources such as the memory buffers effectively. We evaluate the effectiveness of those program transformation techniques by applying them manually on several benchmark programs, and using a trace-driven, cycle-by-cycle superthreaded processor simulator. The simulation results show that a superthreaded processor can achieve promising speedup for most of the benchmark programs.
