首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
In a previous article,(1) Gupta and Hill introduced anadaptive combining tree algorithm for busy-wait barrier synchronization on shared-memory multiprocessors. The intent of the algorithm was to achieve a barrier in logarithmic time when processes arrive simultaneously, and in constant time after the last arrival when arrival times are skewed. Afuzzy (2) version of the algorithm allows a process to perform useful work between the point at which it notifies other processes of its arrival and the point at which it waits for all other processes to arrive. Unfortunately, adaptive combining tree barriers as originally devised perform a large amount of work at each node of the tree, including the acquisition and release of locks. They also perform an unbounded number of accesses to nonlocal locations, inducing large amounts of memory and interconnect contention. We present new adaptive combining tree barriers that eliminate these problems. We compare the performance of the new algorithms to that of other fast barriers on a 64-node BBN Butterfly 1 multiprocessor, a 35-node BBN TC2000, and a 126-node KSR 1. The results reveal scenarios in which our algorithms outperform all known alternatives, and suggest that both adaptation and the combination of fuzziness with tree-style synchronization will be of increasing importance on future generations of shared-memory multiprocessors. At the University of Rochester, this work was supported in part by NSF Institutional Infrastructure grant number CDA-8822724 and ONR research contract number N00014-92-J-1801 (in conjunction with the ARPA Research in Information Science and Technology—High Performance Computing, Software Science and Technology program, ARPA Order No. 8930). At Rice University, this work was supported in part by NSF Cooperative Agreements CCR-8809615 and CCR-912008.  相似文献   

2.
Total exchange (or multiscattering) is one of the important collective communication problems in multiprocessor interconnection networks. It involves the dissemination of distinct messages from every node to every other node. We present a novel theory for solving the problem in any multidimensional (cartesian product) network. These networks have been adopted as cost-effective interconnection structures for distributed-memory multiprocessors. We construct a general algorithm for single-port networks and provide conditions under which it behaves optimally. It is seen that many of the popular topologies, including hypercubes, k-ary n-cubes, and general tori satisfy these conditions. The algorithm is also extended to homogeneous networks with 2k dimensions and with multiport capabilities. Optimality conditions are also given for this model. In the course of our analysis, we also derive a formula for the average distance of nodes in multidimensional networks; it can be used to obtain almost closed-form results for many interesting networks  相似文献   

3.
A generalized paging problem is considered. Each request is expressed as a set of u pages. In order to satisfy the request, at least one of these pages must be in the cache. Therefore, on a page fault, the algorithm must load into the cache at least one page out of the u pages given in the request. The problem arises in systems in which requests can be serviced by various utilities (e.g., a request for a data that lies in various web-pages) and a single utility can service many requests (e.g., a web-page containing various data). The server has the freedom to select the utility that will service the next request and hopefully additional requests in the future. The case u=1 is simply the classical paging problem, which is known to be polynomially solvable. We show that for any u>1 the offline problem is NP-hard and hard to approximate if the cache size k is part of the input, but solvable in polynomial time for constant values of k. We consider mainly online algorithms, and design competitive algorithms for arbitrary values of k, u. We study in more detail the cases where u and k are small. We also give an algorithm which uses resource augmentation and which is asymptotically optimal for u=2. A preliminary version of this paper appeared in Proc. Scandinavian Workshop on Algorithm Theory (SWAT 2006), pp. 124–135, 2006. Research of R. van Stee supported by Alexander von Humboldt Foundation.  相似文献   

4.
Introduces a class of hierarchical networks that is suitable for implementation of large multi-computers in VLSI with wafer scale integration (VLSI/WSI) technology. These networks, which are termed dBCube, employ the hypercube topology as a basic cluster, connect many of these clusters using a de Bruijn graph, and maintain the node connectivity to be the same for all nodes product graph. The size of this class of regular networks can be easily extended by increments of a cluster size. Local communication, to be satisfied by the hypercube topology, allows easy embedding of existing parallel algorithms, while the de Bruijn graph, which was chosen for JPL's 8096-node multiprocessor, provides the shortest distance between clusters running different parts of an application. A scheme for obtaining WSI layout is introduced and used to estimate the number of tracks needed and the required area of the wafer. The exact number of tracks in the hypercube and an approximation for the de Bruijn graph are also obtained. Tradeoffs of area versus static parameters and the size of the hypercube versus that of the de Bruijn graph are also discussed  相似文献   

5.
Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, we measured the behavior of seven shared-memory parallel application programs on a simulated distributed-shared-memory machine. Our results (1) confirm the importance of gang CPU scheduling, (2) show that a page-faulting processor should spin rather than invoke a parallel context switch, (3) show that our parallel programs frequently touch most of their data, and (4) indicate that memory, not just CPUs, must be gang scheduled. Overall, our experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel context-switch overheads.An earlier version of this paper was presented at Supercomputing '94.This work is supported in part by NSF Presidential Young Investigator Award CCR-9157366; NSF Grants MIP-9225097, CCR-9100968, and CDA-9024618; Office of Naval Research Grant N00014-89-J-1222; Department of Energy Grant DE-FG02-93ER25176; and donations from Thinking Machines Corporation, Xerox Corporation, and Digital Equipment Corporation.  相似文献   

6.
Cheong  H. Veidenbaum  A.V. 《Computer》1990,23(6):39-47
The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented  相似文献   

7.
The authors describe the hardware and software of a system for prototyping digital-signal-processing applications by using commercially available digital-signal processors linked to form multiprocessors. The graphical programming environment makes it easier to program, compile, debug, and test DSP algorithms. The system reduces the cost of application prototyping, making it a feasible step at early design stages. To demonstrate the advantages of their approach, the authors explain how they prototyped a digital audio broadcast system. The complexities encountered highlight the limitations of the system  相似文献   

8.
The processor allocation problem requires recognizing and locating a free subcube that can accommodate a request for a subcube of a specified size for an incoming task. Methods reported in the literature fall into two strategies: bottom-up or bit mapped technique (BMT) and top-downer available cube technique (ACT). Our algorithm that solves the allocation problem in faulty hypercubes falls into the category of ACT's which offer the advantage over BMT's of quickly recognizing whether or not a requested subcube is available in the list of fault-free subcubes. We introduce new algebraic functions and the concept of separation factor to select a subcube for allocation. The notion of overlap-syndrome, defined in the text, quantifies the overlap among free subcubes. Our technique has full subcube recognition ability and thus recognizes more subcubes as compared to bit mapped techniques: Buddy, Gray code and its variants. The advantages of our approach over some of the existing ACT's in terms of fragmentation and overall completion time are described in the text and in simulation results  相似文献   

9.
Rsim: simulating shared-memory multiprocessors with ILP processors   总被引:1,自引:0,他引:1  
The early 1990s saw several announcements of commercial shared-memory systems using processors that aggressively exploited instruction-level parallelism (ILP), including the MIPS R10000, Hewlett-Packard PA8000, and Intel Pentium Pro. These processors could potentially reduce memory read stalls by overlapping read latency with other operations, possibly changing the nature of performance bottlenecks in the system. The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems. In particular, current simple processor-based approximations cannot model significant performance effects for applications exhibiting parallel read misses. Further, recent shared-memory designs such as aggressive implementations of sequential consistency use the aggressive ILP-enhancing features of modern processors that simple processor-based simulators do not model. As microprocessor systems become more complex, the availability of shared infrastructure source code is likely to become increasingly crucial. The authors plan to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, Rabbit fast simulation mode, and ports to Linux platforms  相似文献   

10.
This paper addresses mapping of streaming applications (such as MPEG) on multiprocessor platforms with time-division-multiplexed network-on-chip. In particular, we solve processor selection, path selection and router configuration problems. Given the complexity of these problems, state of the art approaches in this area largely rely on greedy heuristics, which do not guarantee optimality. Our approach is based on a constraint programming formulation that merges a number of steps, usually tackled in sequence in classic approaches. Thus, our method has the potential of finding optimal solutions with respect to resource usage under throughput constraints. The experimental evaluation presented in here shows that our approach is capable of exploring a range of solutions while giving the designer the opportunity to emphasize the importance of various design metrics.  相似文献   

11.
Stenstrom  P. 《Computer》1988,21(11):26-37
The techniques that can be used to design a memory system that reduces the impact of contention are examined. To exemplify the techniques, the implementations and the design decisions taken in each are reviewed. The discussion covers memory organization, interconnection networks, memory allocation, cache memory, and synchronization and contention. The multiprocessor implementations considered are C.mmp, CM*, RP3, Alliant FX, Cedar, Butterfly, SPUR, Dragon, Multimax, and Balance  相似文献   

12.
A sorting algorithm, dubbed MeshSort, for multidimensional mesh-connected multiprocessors is introduced. Bitonic Sort and ShearSort are shown to be special cases of MeshSort. MeshSort thus provides some insight into the operation of parallel sorting. It requires operations only along orthogonal vectors of processors, simplifying the control of the multiprocessor. This allows MeshSort to be used on any reduced architecture where a multidimensional memory structure is interconnected with a lower dimensional structure of processors. A modified version of MeshSort, called FastMeshSort, is presented. This algorithm applies the same basic principle as MeshSort, and is almost as simple to implement, but achieves much better performance. The modified algorithm is shown to be very efficient for reasonably sized meshes. FastMeshSort is presented as a practical sorting and routing algorithm for real multidimensional mesh-connected multiprocessors. The algorithms can easily be extended to other multiprocessor structures  相似文献   

13.
Executing multiple applications concurrently is an important way of utilizing the computational power provided by emerging chip multiprocessor (CMP) architectures. However, this multiprogramming brings a resource management and partitioning problem, for which one can find numerous examples in the literature. Most of the resource partitioning schemes proposed to date focus on performance or energy centric strategies. In contrast, this paper explores reliability-aware core partitioning strategies targeting CMPs. One of our schemes considers both performance and reliability objectives by maximizing a novel combined metric called the vulnerability-delay product (VDP). The vulnerability component in this metric is represented with Thread Vulnerability Factor (TVF), a recently proposed metric for quantifying thread vulnerability for multicores. Execution time of the given application represents the delay component of the VDP metric. As part of our experimental analysis, proposed core partitioning schemes are compared with respect to normalized weighted speedup, normalized weighted reliability loss and normalized weighted vulnerability delay product gain metrics for various workloads of benchmark applications.  相似文献   

14.
The presence of high-performance mechanisms in shared-memory multiprocessors such as private caches, the extensive pipelining of memory access, and combining networks may render a logical concurrency model complex to implement or inefficient. The problem of implementing a given logical concurrency model in such a multiprocessor is addressed. Two concurrency models are considered, and simple rules are introduced to verify that a multiprocessor architecture adheres to the models. The rules are applied to several examples of multiprocessor architectures  相似文献   

15.
Consider the problem of scheduling a task set τ of implicit-deadline sporadic tasks to meet all deadlines on a t-type heterogeneous multiprocessor platform where tasks may access multiple shared resources. The multiprocessor platform has m k processors of type-k, where k∈{1,2,…,t}. The execution time of a task depends on the type of processor on which it executes. The set of shared resources is denoted by R. For each task τ i , there is a resource set R i ?R such that for each job of τ i , during one phase of its execution, the job requests to hold the resource set R i exclusively with the interpretation that (i) the job makes a single request to hold all the resources in the resource set R i and (ii) at all times, when a job of τ i holds R i , no other job holds any resource in R i . Each job of task τ i may request the resource set R i at most once during its execution. A job is allowed to migrate when it requests a resource set and when it releases the resource set but a job is not allowed to migrate at other times. Our goal is to design a scheduling algorithm for this problem and prove its performance. We propose an algorithm, LP-EE-vpr, which offers the guarantee that if an implicit-deadline sporadic task set is schedulable on a t-type heterogeneous multiprocessor platform by an optimal scheduling algorithm that allows a job to migrate only when it requests or releases a resource set, then our algorithm also meets the deadlines with the same restriction on job migration, if given processors $4 \times (1 + \operatorname{MAXP}\times \lceil \frac{\vert P\vert \times\operatorname{MAXP}}{\min \{m_{1}, m_{2}, \ldots, m_{t} \}} \rceil )$ times as fast. (Here $\operatorname{MAXP}$ and |P| are computed based on the resource sets that tasks request.) For the special case that each task requests at most one resource, the bound of LP-EE-vpr collapses to $4 \times (1 + \lceil \frac{\vert R\vert }{\min \{m_{1}, m_{2}, \ldots, m_{t} \}} \rceil )$ . To the best of our knowledge, LP-EE-vpr is the first algorithm with proven performance guarantee for real-time scheduling of sporadic tasks with resource sharing on t-type heterogeneous multiprocessors.  相似文献   

16.
The microprocessor industry is rapidly moving to the use of multicore chips as general-purpose processors. Whereas the current generation of chip multiprocessors (CMPs) target server applications, future desktop processors likely have tens of multithreaded cores on a single die. Various redundant multithreading (RMT) approaches exploit the multithreaded capability of current general-purpose microprocessors. These approaches replicate the entire program, running it as a separate thread using time or space redundancy. This guards the processor core against all errors, including those in combinational logic. Because RMT exploits the existing multithreaded hardware, it requires only a modest amount of additional hardware support for comparing results and, depending on the implementation, duplicating inputs.  相似文献   

17.
To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of pre-fetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25% respectively  相似文献   

18.
Directory-based cache coherence in large-scale multiprocessors   总被引:1,自引:0,他引:1  
The usefulness of shared-data caches in large-scale multiprocessors, the relative merits of different coherence schemes, and system-level methods for improving directory efficiency are addressed. The research presented is part of an effort to build a high-performance, large-scale multiprocessor. The various classes of cache directory schemes are described, and a method of measuring cache coherence is presented. The various directory schemes are analyzed, and ways of improving the performance of directories are considered. It is found that the best solutions to the cache-coherence problem result from a synergy between a multiprocessor's software and hardware components  相似文献   

19.
A new parallel sorting algorithm, called parsort, suitable for implementation on tightly coupled multiprocessors is presented. The algorithm is based upon quicksort and two-way merging. An asynchronous parallel partitioning algorithm is used to distribute work evenly during merging to ensure a good load balance amongst processors, which is crucial if we are to achieve high efficiency. The implementation of this parallel sorting algorithm exhibits theoretical and measured near linear speed-up when compared to sequential quicksort. This is illustrated by the results of experiments carried out on the Sequent Balance 8000 multiprocessor.  相似文献   

20.
We analyze two important problems that arise in shared-memory multiprocessor systems. Thestale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors.False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on software methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号