首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 93 毫秒
1.
We present the design and implementation of Arachne, a threads system that can be interfaced with a communications library for multithreaded distributed computations. In particular, Arachne supports thread migration between heterogeneous platforms, dynamic stack size management, and recursive thread functions. Arachne is efficient, flexible, and portable-it is based entirely on C and C++. To facilitate heterogeneous thread operations, we have added three keywords to the C++ language. The Arachne preprocessor takes as input code written in that language and outputs C++ code suitable for compilation with a conventional C++ compiler. The Arachne runtime system manages all threads during program execution. We present some performance measurements on the costs of basic thread operations and thread migration in Arachne and compare these to costs in other threads systems  相似文献   

2.
Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must first decide which threads should have their state loaded into the multiple contexts, and second, which loaded thread is to execute instructions at any given time. In this paper we show that both decisions are important, and that incorrect choices can lead to serious performance degradation. We propose thread prioritization as a means of guiding both levels of scheduling. Each thread has a priority that can change dynamically, and that the scheduler uses to allocate as many computation resources as possible to critical threads. We briefly describe its implementation, and we show simulation performance results for a number of simple benchmarks in which synchronization performance is critical.  相似文献   

3.
Simultaneous multithreading (SMT) is an architectural technique that improves resource utilization by allowing instructions from multiple threads to coexist in a processor and share resources. However, earlier studies have shown that the performance of an SMT architecture begins to saturate as the number of coexisting threads increases beyond four. We show that no single fetch policy can be the best solution during the entire execution time and that a significant performance improvement can be attained by dynamically switching the fetch policies. We propose an implementation method which includes an extremely lightweight thread to control fetch policies (a detector thread) and a processor architecture to run the detector thread without impact on the user application threads. We evaluate various heuristics for the detector thread to determine the best fetch policies. We show that, with eight threads running on our simulated SMT, the proposed approach can outperform fixed scheduling mechanisms by up to 30%.  相似文献   

4.
We present new admission tests for periodic real-time threads with explicitly stated deadlines scheduled according to the earliest deadline first (EDF) algorithm. In traditional real-time periodic scheduling, the deadline of a periodic thread is conventionally the end of the current period. In contrast, our tests support periodic threads in which the deadline may be earlier than the end of the current period. In the extreme case, the deadline may be specified as identical to the per period execution time, which results in perfectly isochronous periodic threads. The provision of such threads, which we refer to as jitter-constrained threads, helps end-systems to honour jitter as well as throughput-related QoS parameters in distributed multimedia systems. In addition, such threads can reduce end-to-end delay and buffer memory requirements as less buffering is needed to smooth excessive delay jitter.  相似文献   

5.
Although nonuniform memory access architecture provides better scalability for multicore systems, cores accessing memory on remote nodes take longer than those accessing on local nodes. Remote memory access accompanied by contention for internode interconnection degrades performance. Properly mapping threads to cores and data accessed to their nodes can substantially improve performance and energy efficiency. However, an operating system kernel's load-balancing activity may migrate threads across nodes, which thus messes up the thread mapping. Besides, subsequent data mapping behavior pays for the cost of page migration to reduce remote memory access. Once unsuitable threads are migrated, it is detrimental to system performance. This paper focuses on improving the kernel's internode load balancing on nonuniform memory access systems. We develop a memory-aware kernel mechanism and policies to reduce remote memory access incurred by internode thread migration. The Linux kernel's load balancing mechanism is modified to incorporate selection policies in the internode thread migration, and the kernel is modified to track the amount of memory used by each thread on each node. With this information, well-designed policies can then choose suitable threads for internode migration. The purpose is to avoid migrating a thread that might incur relatively more remote memory access and page migration. The experimental results show that with our mechanism and the proposed selection policies, the system performance is substantially increased when compared with the unmodified Linux kernel that does not consider memory usage and always migrates the first-fit thread in the runqueue that can be migrated to the target central processing unit.  相似文献   

6.
Distributed strategic interleaving with load balancing   总被引:1,自引:0,他引:1  
In a previous paper, we developed an algebraic theory of threads, interleaving of threads, and interaction of threads with services. In the current paper, we assume that the threads and services are distributed over the nodes of a network. We extend the theory developed so far to the distributed case by introducing distributed interleaving strategies that support explicit thread migration and see to load balancing or capability searching by implicit thread migration. The extension to the distributed case provides insight into details of multi-threading that come up in a networked environment.  相似文献   

7.
余勇  庞建民  单征  刘晓楠 《计算机工程》2012,38(9):282-284,287
统一计算设备架构(CUDA)程序移植到其他异构众核架构时的线程数不匹配。为此,提出一种层次化的线程映射模型。在第1个映射层次上,将CUDA主机端线程和设备端线程分别映射到目标平台的主核和从核阵列上,在第2个映射层次上,采用线程循环的方法消除协作线程阵列(CTA)中线程间同步操作,将整个CTA映射到从核阵列的一个从核上。实验结果表明,该模型能使CUDA程序在其他异构众核系统上得到有效运行。  相似文献   

8.
Lightweight threads have an important role to play in parallel systems: they can be used to exploit shared-memory parallelism, to mask communication and I/O latencies, to implement remote memory access, and to support task-parallel and irregular applications. In this paper, we address the question of how to integrate threads and communication in high-performance distributed-memory systems. We propose an approach based on global pointer and remote service request mechanisms, and explain how these mechanisms support dynamic communication structures, asynchronous messaging, dynamic thread creation and destruction, and a global memory model via interprocessor references. We also explain how these mechanisms can be implemented in various environments. Our global pointer and remote service request mechanisms have been incorporated in a runtime system called Nexus that is used as a compiler target for parallel languages and as a substrate for higher-level communication libraries. We report the results of performance studies conducted using a Nexus implementation; these results indicate that Nexus mechanisms can be implemented efficiently on commodity hardware and software systems.  相似文献   

9.
Model checking of safety properties can be scaled up by pooling the CPU and memory resources of multiple computers. As compute clusters containing 100s of nodes, with each node realized using multi-core (e.g., 2) CPUs will be widespread, a model checker based on the parallel (shared memory) and distributed (message passing) paradigms will more efficiently use the hardware resources. Such a model checker can be designed by having each node employ two shared memory threads that run on the (typically) two CPUs of a node, with one thread responsible for state generation, and the other for efficient communication, including (1) performing overlapped asynchronous message passing, and (2) aggregating the states to be sent into larger chunks in order to improve communication network utilization. We present the design details of such a novel model checking architecture called Eddy. We describe the design rationale, details of how the threads interact and yield control, exchange messages, as well as detect termination. We have realized an instance of this architecture for the Murphi modeling language. Called Eddy_Murphi, we report its performance over the number of nodes as well as communication parameters such as those controlling state aggregation. Nearly linear reduction of compute time with increasing number of nodes is observed. Our thread task partition is done in such a way that it is modular, easy to port across different modeling languages, and easy to tune across a variety of platforms. Supported in part by NSF award CNS-0509379 and SRC Contract 2005-TJ-1318.  相似文献   

10.
We present a modular approach to specification and verification of concurrency controllers by decoupling their behavior and interface specifications. The behavior specification of a concurrency controller defines how its shared variables change their values whereas the interface specification defines the order in which a client thread should call its methods. We show that the concurrency controllers can be designed modularly by composing their interfaces. We separate the verification of the concurrency controllers from the verification of the threads that use them. For the verification of the concurrency controllers we use infinite state verification techniques which enable us to verify controllers with parameterized constants and arbitrary number of user threads. We automatically generate Java monitors from the concurrency controller specifications which preserve the verified properties. For the thread verification we use finite state program verification tools which enable us to verify Java threads without any restrictions. We show that the user threads can be verified using stubs generated from the concurrency controller interfaces which improves the efficiency of the thread verification significantly.  相似文献   

11.
This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle.  相似文献   

12.
Today, mobility and persistence are important aspects of distributed computing. They have many fields of use such as load balancing, fault tolerance and dynamic reconfiguration of applications. In this context, Java provides many useful mechanisms for the mobility of code via dynamic class loading, and the mobility or persistence of data via object serialization. However, Java does not provide any mechanism for the mobility/persistence of computation (i.e. threads). We designed and implemented a new mechanism, called Java thread serialization, that is used to build thread mobility or thread persistence. Therefore, a running Java thread can, at an arbitrary state of its execution, migrate to a remote machine where it resumes its execution, or be checkpointed on disk for possible subsequent recovery. With our services, migrating a thread is simply performed by the call of our go primitive, and checkpointing/recovering a thread is performed by the call of our store and load primitives. Several projects have recently addressed the issue of Java thread serialization, e.g. Sumatra, Wasp, JavaGo, Brakes, JavaGoX, Merpati. Some of them have attempted to minimize the overhead incurred by the thread serialization mechanism on thread performance, but none of them has been able to completely avoid this overhead. We propose a generic Java thread serialization mechanism that does not impose any performance overhead on serialized threads. This is achieved thanks to the use of type inference and dynamic de‐optimization techniques. In this paper, we describe the design and implementation details of our thread serialization prototype in Sun Microsystems' JDK. We report on experiments conducted with our prototype, present a comparative performance evaluation of the main thread serialization techniques, and confirm the elimination of the performance overhead with our thread serialization mechanism. Copyright © 2003 John Wiley & Sons, Ltd.  相似文献   

13.
We present a model checking-based method for verifying list-based concurrent set data structures. Concurrent data structures are notorious for being hard to get right and thus, their verification has received significant attention from the verification community. These data structures are unbounded in two dimensions: the list size is unbounded and an unbounded number of threads access them. Thus, their model checking requires abstraction to a model bounded in both the dimensions. We first show how the unbounded number of threads can be model checked by reduction to a finite model, while assuming a bounded number of list nodes. In particular, we leverage the CMP (CoMPositional) method which abstracts the unbounded threads by keeping one thread as is (concrete) and abstracting all the other threads to a single environment thread. Next, the method proceeds as a series of iterations where in each iteration the abstraction is model checked and, if a spurious counterexample is observed, refined. This is accomplished by the user by inspecting the returned counterexamples. If the user determines the returned counterexample to be spurious, she adds constraints to the abstraction in the form of lemmas to refine it. Upon addition, these lemmas are also verified for correctness as part of the CMP method. Thus, since these lemmas are verified as well, we show how some of the lemmas required for refinement of this abstract model can be mined automatically using an assertion mining tool, Daikon. Next, we show how the CMP method approach can be extended to the list dimension as well, to fully verify the data structures in both the dimensions. While it is possible to show correctness of these data structures for an unbounded number of threads by keeping one concrete thread and abstracting others, this is not directly possible in the list dimension as the nodes pointed to by the threads change during list traversal. Our method addresses this challenge by constructing an abstraction for which the concrete nodes, i.e., the nodes pointed to by the threads, can change as the thread pointers move with program execution. Further, our method also allows for refinement of this abstraction to prove properties of interest. We show the soundness of our method and establish its utility by model checking challenging concurrent list-based data structure examples.  相似文献   

14.
This work studies how to adapt the number of threads of a parallel Interval Branch and Bound algorithm to the available computational resources based on its current performance. Basically, a thread can create a new thread that will process part of the ancestor workload. In this way, load balancing is inherent to the creation of threads. The applications in which we are interested use branch-and-bound algorithms which are highly irregular and therefore difficult to predict. The proposed methods can be used for more predictable algorithms as well. This research complements and does not substitute other devices that improve the exploitation of the system, such as dynamic scheduling policies or work-stealing. Several approaches are presented. They differ in the metrics used and in the need or not having to modify the Operating System (O.S.). The scenario for this research is just one multithreaded application running in a multicore architecture. Experimental results show that the appropriate number of running threads can be determined at run-time, avoiding having to statically establish the number of threads of an application. Thread creation decisions have to be made frequently to obtain better results, but are time-consuming. One of the presented models uses the existence of an idle processor to carry out these decisions, obtaining the desired results.  相似文献   

15.
Most of the research effort towards verification of concurrent software has focused on multithreaded code. On the other hand, concurrency in low-end embedded systems is predominantly based on interrupts. Low-end embedded systems are ubiquitous in safety-critical applications such as those supporting transportation and medical automation; their verification is important. Although interrupts are superficially similar to threads, there are subtle semantic differences between the two abstractions. This paper compares and contrasts threads and interrupts from the point of view of verifying the absence of race conditions. We identify a small set of extensions that permit thread verification tools to also verify interrupt-driven software, and we present examples of source-to-source transformations that turn interrupt-driven code into semantically equivalent thread-based code that can be checked by a thread verifier. Finally, we demonstrate a proof-of-concept program transformation tool that converts interrupt-driven sensor network applications into multithreaded code, and we use an existing tool to find race conditions in these applications.  相似文献   

16.
Several useful compiler and program transformation techniques for the superthreaded architectures are presented in this paper. The superthreaded architecture adopts a thread pipelining execution model to facilitate runtime data dependence checking between threads, and to maximize thread overlap to enhance concurrency. In this paper, we present some important program transformation techniques to facilitate concurrent execution among threads, and to manage critical system resources such as the memory buffers effectively. We evaluate the effectiveness of those program transformation techniques by applying them manually on several benchmark programs, and using a trace-driven, cycle-by-cycle superthreaded processor simulator. The simulation results show that a superthreaded processor can achieve promising speedup for most of the benchmark programs.  相似文献   

17.
多核多线程结构线程调度策略研究   总被引:1,自引:0,他引:1  
片上多核多线程(CMT)结构兼具了片上多处理(CMP)和同时多线程(sMT)结构的优势,支持片上所有处于执行状态的线程每周期并行执行,导致核内与核间硬件资源共享和争用问题。该文在阐述CMT结构的资源共享特征并简要介绍SMT线程调度发展状况的基础上,主要围绕以减少资源争用为目标的线程调度策略和资源划分机制等热点,分析其研究现状,论述已有策略在处理这些问题上的优缺点,并探讨了可能的研究发展方向。  相似文献   

18.
A distributed system can support fault-tolerant applications by replicating data and computation at nodes that have independent failure modes. We present a scheme called parallel execution threads (PET) which can be used to implement fault-tolerant computations in an object-based distributed system. In a system that replicates objects, the PET scheme can be used to replicate a computation by creating a number of parallel threads which execute with different replicas of the invoked objects. A computation can be completed successfully if at least one thread does not encounter any failed nodes and its completion preserves the consistency of the objects. The PET scheme can tolerate failures that occur during the execution of the computation as long as all threads are not affected by the failures. We present the algorithms required to implement the PET scheme and also address some performance issues. Mustaque Ahamad received his B.E. (Hons.) degree in Electrical Engineering from the Birla Institute of Technology and Science, Pilani, India. He obtained his M.S. and Ph.D. degrees in Computer Science from the State University of New York at Stony Brook in 1983 and 1985 respectively. Since September 1985, he is an Assistant Professor in the School of Information and Computer Science at the Georgia Institute of Technology, Atlanta. His research interests include distributed operating systems, distributed algorithms, faulttolerant systems and performance evaluation. Partha Dasgupta is an Assistant Professor at Georgia Tech since 1984. He has a Ph.D. in Computer Science from the State University of New York at Stony Brook. He is the technical project director of the Clouds distributed operating systems project, as well as a coprincipal investigator of Georgia Tech's NSF-CER award. His research interests include building distributed operating systems, distributed algorithms, fault-tolerant systems and distributed programming support. Richard J. LeBlanc, Jr. received the B.S. degree in physics from Louisiana State University in 1972 and the M.S. and Ph.D. degrees in computer sciences from the University of Wisconsin-Madison in 1974 and 1977, respectively. He is currently a Professor in the School of Information and Computer Science of the Georgia Institute of Technology. His research interests include programming language design and implementation, programming environments, and software engineering. Dr. LeBlanc's current research work involves application of these interests in distributed processing systems. As co-director of the Clouds Project, he is studying language concepts and software engineering methodology for utilizing a highly reliable, object-based distributed system. He is also interested in specification-based software development methodologies and tools. Dr. LeBlanc is a member of the Association for Computing Machinery, the IEEE Computer Society and Sigma Xi.This work was supported in part by NSF grants CCR-8619886 and CCR-8806358, and RADC contract number F30602-86-C-0032  相似文献   

19.
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.  相似文献   

20.
本文讨论了在分布式数据库中实现高性能原子事务的一种方法:通过使用多线程结构的事务,分布式数据库系统的性能可以得到大幅提高。性能的改善是由于多个线程的并发执行;而且,由于线程创建和线程间切换的开销都比进程的相应开销小得多,采用多线程的分布式数据库系统的性能比传统的采用单线程(即进程)结构的数据库系统的性能要好得多。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号