1.

Sorting with near linear speed-up on tightly coupled multiprocessors

Mitchell Wheat 《Concurrency and Computation》1991,3(1):1-13

A new parallel sorting algorithm, called parsort, suitable for implementation on tightly coupled multiprocessors is presented. The algorithm is based upon quicksort and two-way merging. An asynchronous parallel partitioning algorithm is used to distribute work evenly during merging to ensure a good load balance amongst processors, which is crucial if we are to achieve high efficiency. The implementation of this parallel sorting algorithm exhibits theoretical and measured near linear speed-up when compared to sequential quicksort. This is illustrated by the results of experiments carried out on the Sequent Balance 8000 multiprocessor.
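The two-phase structure described in this abstract (local sorting followed by two-way merging, with each merge split into independent pieces) can be sketched roughly as follows. This is a sequential illustration only, not the paper's parsort: the names `split_merge` and `parsort_sketch` are hypothetical, Python's built-in `sorted` stands in for quicksort, the "workers" are simulated in a loop, and the split shown here divides only the first run evenly, which is a simpler rule than the load-balanced partitioning the abstract refers to.

```python
from bisect import bisect_left
from heapq import merge


def split_merge(a, b, parts):
    """Merge sorted lists a and b as `parts` independent pieces.

    Piece k takes an equal slice of `a` plus the elements of `b` that fall
    below the first element of the next slice (found by binary search).
    The pieces do not overlap, so each could run on its own processor.
    """
    out = []
    n = len(a)
    b_lo = 0
    for k in range(parts):
        a_lo, a_hi = k * n // parts, (k + 1) * n // parts
        # Elements of b smaller than a[a_hi] belong to this piece.
        b_hi = bisect_left(b, a[a_hi]) if a_hi < n else len(b)
        out.extend(merge(a[a_lo:a_hi], b[b_lo:b_hi]))
        b_lo = b_hi
    return out


def parsort_sketch(data, workers=4):
    """Sort `data` by chunk-wise sorting and repeated two-way merging."""
    n = len(data)
    # Phase 1: each (simulated) worker sorts its own chunk.
    runs = [sorted(data[i * n // workers:(i + 1) * n // workers])
            for i in range(workers)]
    # Phase 2: merge neighbouring runs pairwise until one run remains.
    while len(runs) > 1:
        merged = [split_merge(runs[i], runs[i + 1], workers)
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])  # odd run out is carried forward
        runs = merged
    return runs[0]
```

Because the pieces of each merge cover disjoint value ranges, concatenating their outputs in order yields the merged run without any further coordination.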

2.

Processor allocation in mesh multiprocessors using the leapfrog method

Fan Wu Ching-Chi Hsu Li-Ping Chou 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(3):276-289

The mesh-connected multiprocessor has become popular because of its simple and regular structure. A new data structure, the R-array, is first proposed to represent the mesh. Each element of the R-array stores statistical information about the occupancy of the mesh. This statistical information can direct the allocation process to jump to the processes that can serve as a base of a free submesh. Based on a simple and reasonable assumption, we develop a stochastic process to analyze the behavior of the proposed scheme. The proposed scheme is the first whose probabilities of locating free submeshes under different workloads are precisely computed. These results can be applied to each full-recognition scheme. In addition, the execution costs of the proposed scheme can also be accurately calculated. Finally, simulations are performed which show that the proposed schemes are faster than most.

3.

Convolution on mesh connected multicomputers

Ranka S. Sahni S. 《IEEE transactions on pattern analysis and machine intelligence》1990,12(3):315-318

An efficient parallel algorithm is presented for convolution on a mesh-connected computer with wraparound. The algorithm does not require a broadcast feature for data values, as assumed by previously proposed algorithms. As a result, the algorithm is applicable to both SIMD and MIMD meshes. For an N×N image and an M×M template, the previous algorithms take O(M²q) time on an N×N mesh-connected multicomputer (q is the number of bits in each entry of the convolution matrix). The proposed algorithms have complexity O(M²r), where r = max{number of bits in an image entry, number of bits in a template entry}. In addition to not requiring a broadcast capability, these algorithms are faster for binary images.
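To make the M² factor in these bounds concrete, the sketch below computes a 2-D convolution with wraparound indexing for an N×N image and an M×M template. It is a plain sequential reference, not the mesh algorithm from the paper: the function name `convolve_wraparound` is hypothetical, and the nested loops simply show that every output pixel accumulates M² products, which on an N×N mesh would be the work done by one processor.

```python
def convolve_wraparound(image, template):
    """Circular 2-D convolution of an N x N image with an M x M template.

    out[i][j] = sum over (a, b) of template[a][b] * image[(i - a) % N][(j - b) % N]
    Indices wrap around the image edges, mirroring a mesh with wraparound links.
    """
    n = len(image)
    m = len(template)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            # M * M multiply-accumulate steps per output pixel.
            for a in range(m):
                for b in range(m):
                    acc += template[a][b] * image[(i - a) % n][(j - b) % n]
            out[i][j] = acc
    return out
```

Convolving an image that contains a single 1 at position (0, 0) reproduces the template in the top-left corner of the output, which is a quick way to check the indexing.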

4.

Allocating precise submeshes in mesh connected systems

Po-Jen Chuang Nian-Feng Tzeng 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(2):211-217

We propose a new processor allocation strategy that applies to any mesh system and recognizes submeshes of arbitrary sizes at any locations in a mesh system. The proposed strategy allocates a submesh of exactly the size requested by an incoming task, completely avoiding internal fragmentation. Because of its efficient allocation, this strategy exhibits better performance than an earlier allocation strategy based on the buddy principle. An efficient implementation of this strategy is presented. Extensive simulation runs are carried out to collect experimental cost and performance measures of interest under different allocation schemes.
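The idea of allocating a submesh of exactly the requested size, at any location, can be illustrated with a naive first-fit scan over an occupancy grid. This is only a baseline sketch under assumed names (`find_free_submesh`, `allocate`, and the boolean `busy` grid are hypothetical); it checks every candidate base position exhaustively and makes no attempt to reproduce the efficient implementation the abstract mentions.

```python
def find_free_submesh(busy, width, height):
    """Return the (row, col) base of the first free width x height submesh, or None.

    `busy` is a list of rows; busy[r][c] is True when processor (r, c) is allocated.
    """
    rows, cols = len(busy), len(busy[0])
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            if all(not busy[r + dr][c + dc]
                   for dr in range(height) for dc in range(width)):
                return r, c
    return None


def allocate(busy, width, height):
    """Mark an exactly sized submesh as busy and return its base, or None if none fits."""
    base = find_free_submesh(busy, width, height)
    if base is None:
        return None
    r, c = base
    for dr in range(height):
        for dc in range(width):
            busy[r + dr][c + dc] = True
    return base
```

Because the request is satisfied with exactly `width * height` processors, no processor inside the allocated region is left idle, which is what avoiding internal fragmentation means here.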

5.

Hypercube algorithms on mesh connected multicomputers

de Cerio L.D. Valero-Garcia M. Gonzalez A. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(12):1247-1260

A new methodology named CALMANT (CC-cube Algorithms on Meshes and Tori) is proposed for mapping a type of algorithm that we call CC-cube algorithm onto multicomputers with hypercube, mesh, or torus interconnection topology. This methodology is suitable when the initial problem can be expressed as a set of processes that communicate through a hypercube topology (a CC-cube algorithm). There are many important algorithms that fit into the CC-cube type. CALMANT is based on three different techniques: (a) the standard embedding to assign the processes of the algorithm to the nodes of the mesh multicomputer; (b) the communication pipelining technique to increase the level of communication parallelism inherent in the CC-cube algorithms; and (c) optimal message-scheduling algorithms proposed in this work to avoid conflicts and thereby minimize the communication time. Although CALMANT is proposed for multicomputers with different interconnection network topologies, the paper focuses only on the particular case of meshes.

6.

Sorting in constant number of row and column phases on a mesh

John M. Marberg Eli Gafni 《Algorithmica》1988,3(1-4):561-572

An algorithm for sorting on a mesh by alternately transforming the rows and columns is presented. The algorithm runs in a constant number of row- and column-transformation phases (sixteen phases), an improvement over the previous best upper bound of O(log log m) phases, m being the number of rows in the mesh. A corresponding lower bound of five phases is also shown.

7.

Sorting in constant number of row and column phases on a mesh

John M. Marberg Eli Gafni 《Algorithmica》1988,3(1):561-572

An algorithm for sorting on a mesh by alternately transforming the rows and columns is presented. The algorithm runs in a constant number of row- and column-transformation phases (sixteen phases), an improvement over the previous best upper bound of O(log log m) phases, m being the number of rows in the mesh. A corresponding lower bound of five phases is also shown. This research was supported by the National Science Foundation under Grant DCR-84-51396, and by IBM Corporation under Grant D8400622.

8.

Parallel 2-d convolution on a mesh connected array processor

Lee SY Aggarwal JK 《IEEE transactions on pattern analysis and machine intelligence》1987,(4):590-594

In this correspondence, a parallel 2-D convolution scheme is presented. The processing structure is a mesh connected array processor consisting of the same number of simple processing elements as the number of pixels in the image. For most windows considered, the number of computation steps required is the same as that of the coefficients of a convolution window. The proposed scheme can be easily extended to convolution windows of arbitrary size and shape. The basic idea of the proposed scheme is to apply the 1-D systolic concept to 2-D convolution on a mesh structure. The computation is carried out along a path called a convolution path in a systolic manner. The efficiency of the scheme is analyzed for windows of various shapes. The ideal convolution path is a Hamiltonian path ending at the center of the window, the length of which is equal to the number of window coefficients. The simple architecture and control strategy make the proposed scheme suitable for VLSI implementation.

9.

A compendium of processor allocation strategies for two-dimensional mesh connected systems

Bonnie E. Melhart Craig A. Morgenstern Tom Nute 《Concurrency and Computation》1995,7(5):497-514

Multiple processor systems are an integral part of today's high-performance computing environment. Such systems are often configured as a two-dimensional grid of processors called a mesh. Tasks compete for rectangular submeshes of this mesh. The choice of submesh allocation strategy can significantly affect the level of processor utilization and a task's waiting time. In addition, the execution speed of various allocation algorithms varies widely, which can further affect system performance. This paper describes and categorizes several submesh allocation strategies, including a previously unreported method that is superior to other methods in terms of execution speed. The paper includes results of simulation studies used to compare the performance characteristics of the most efficient allocation strategies in each category.

10.

Paging tradeoffs in distributed-shared-memory multiprocessors

Douglas C. Burger Rahmat S. Hyder Barton P. Miller David A. Wood 《The Journal of supercomputing》1996,10(1):87-104

Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, we measured the behavior of seven shared-memory parallel application programs on a simulated distributed-shared-memory machine. Our results (1) confirm the importance of gang CPU scheduling, (2) show that a page-faulting processor should spin rather than invoke a parallel context switch, (3) show that our parallel programs frequently touch most of their data, and (4) indicate that memory, not just CPUs, must be gang scheduled. Overall, our experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel context-switch overheads. An earlier version of this paper was presented at Supercomputing '94. This work is supported in part by NSF Presidential Young Investigator Award CCR-9157366; NSF Grants MIP-9225097, CCR-9100968, and CDA-9024618; Office of Naval Research Grant N00014-89-J-1222; Department of Energy Grant DE-FG02-93ER25176; and donations from Thinking Machines Corporation, Xerox Corporation, and Digital Equipment Corporation.

11.

Compiler-directed cache management in multiprocessors

Cheong H. Veidenbaum A.V. 《Computer》1990,23(6):39-47

The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.

12.

An efficient non-contiguous processor allocation strategy for 2D mesh connected multicomputers

S. Bani-Mohammad M. Ould-Khaoua 《Information Sciences》2007,177(14):2867-2883

In non-contiguous allocation, a job request can be split into smaller parts that are allocated possibly non-adjacent free sub-meshes rather than always waiting until a single sub-mesh of the requested size and shape is available. Lifting the contiguity condition is expected to reduce processor fragmentation and increase system utilization. However, the distances traversed by messages can be long, and as a result the communication overhead, especially contention, is increased. The extra communication overhead depends on how the allocation request is partitioned and assigned to free sub-meshes. In this paper, a new non-contiguous processor allocation strategy, referred to as Greedy-Available-Busy-List, is suggested for the 2D mesh network. Request partitioning in our suggested strategy is based on the sub-meshes available for allocation. To evaluate the performance improvement achieved by our strategy and compare it against well-known existing non-contiguous and contiguous strategies, we conduct extensive simulation runs under the assumption of wormhole routing and three communication patterns, notably one-to-all, all-to-all and random. The results show that the new strategy can reduce the communication overhead and substantially improve performance in terms of job turnaround time and system utilization.

13.

Processor allocation in hypercube multiprocessors

Rai S. Trahan J.L. Smailus T. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(6):606-616

The processor allocation problem requires recognizing and locating a free subcube that can accommodate a request for a subcube of a specified size for an incoming task. Methods reported in the literature fall into two strategies: bottom-up or bit mapped technique (BMT) and top-down or available cube technique (ACT). Our algorithm, which solves the allocation problem in faulty hypercubes, falls into the category of ACTs, which offer the advantage over BMTs of quickly recognizing whether or not a requested subcube is available in the list of fault-free subcubes. We introduce new algebraic functions and the concept of separation factor to select a subcube for allocation. The notion of overlap-syndrome, defined in the text, quantifies the overlap among free subcubes. Our technique has full subcube recognition ability and thus recognizes more subcubes than the bit mapped techniques: Buddy, Gray code and its variants. The advantages of our approach over some of the existing ACTs in terms of fragmentation and overall completion time are described in the text and in simulation results.

14.

Heterogeneous chip multiprocessors

Kumar R. Tullsen D.M. Jouppi N.P. Ranganathan P. 《Computer》2005,38(11):32-38

Heterogeneous (or asymmetric) chip multiprocessors present unique opportunities for improving system throughput, reducing processor power, and mitigating Amdahl's law. On-chip heterogeneity allows the processor to better match execution resources to each application's needs and to address a much wider spectrum of system loads - from low to high thread parallelism - with high efficiency.

15.

Reducing contention in shared-memory multiprocessors

Stenstrom P. 《Computer》1988,21(11):26-37

The techniques that can be used to design a memory system that reduces the impact of contention are examined. To exemplify the techniques, the implementations and the design decisions taken in each are reviewed. The discussion covers memory organization, interconnection networks, memory allocation, cache memory, and synchronization and contention. The multiprocessor implementations considered are C.mmp, CM*, RP3, Alliant FX, Cedar, Butterfly, SPUR, Dragon, Multimax, and Balance.

16.

Software strategy for multiprocessors

Mark Dowson Brian Collins Brian McBride 《Microprocessors and Microsystems》1979,3(6):263-266

The task of writing basic system software for a multiprocessor system is simplified if the system is organized so that it may be programmed as a single virtual machine. This strategy is considered for tightly coupled real time systems with many processors. Software is written in a high level language supporting parallel execution (such as concurrent PASCAL, MODULA or ADA) without knowledge of the target hardware configuration. The Demos system is discussed as an example of such an approach.

17.

Connected component labeling of binary images on a mesh connected massively parallel processor

《Computer Vision, Graphics, and Image Processing》1989,45(2):133-149

An algorithm for connected component labeling of binary patterns using SIMD mesh connected computers is presented. The algorithm consists of three major steps: identifying exactly one point (seed point) within each connected component (region), assigning a unique label to each seed point, and expanding the labels to fill all pixels in the respective regions. Two approaches are given for identifying seed points. The first approach is based on shrinking and the second on the iterative replacement of equivalent labels with local minima or maxima. The shrinking algorithm reduces simply connected regions into single pixels, but multiply connected regions form rings around the holes contained in the regions. A parallel algorithm is developed to break each such ring at a single point. The broken rings are then reduced to single pixels by reshrinking. With iterations consisting of shrinking, breaking rings, if any, and reshrinking, each pattern (of any complexity) is reduced to isolated points within itself. In the second approach every region pixel in the image is initially given a unique label equal to its address in the image. Every 3 × 3 neighborhood in the image is then examined in parallel to replace the central label with the maximum (or minimum) of the labels assigned to the set of region pixels in the neighborhood. This is done iteratively until there is no further change. The seed points are then the locations where the pixel addresses match their converged labels. A parallel sorting method is used for assigning a consecutive set of numbers as labels to the seed points. Parallel expansion up to the boundaries of the original patterns then completes the connected component labeling. The computational complexities of the algorithm are discussed.
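The second approach in this abstract (initial labels equal to pixel addresses, repeated replacement by the 3 × 3 neighborhood maximum, then consecutive renumbering of the seed labels) can be sketched sequentially as below. The function name `label_components` is hypothetical, addresses are offset by one so that 0 can denote background, and each sweep of the double loop stands in for one synchronous step that a SIMD mesh would perform on all pixels at once; the paper's parallel sorting and expansion steps are replaced here by an ordinary `sorted` call and a dictionary lookup.

```python
def label_components(image):
    """Label 8-connected regions of a binary image by iterative max propagation.

    image: list of rows containing 0 (background) or 1 (region pixel).
    Returns a grid of the same shape with labels 1, 2, ... and 0 for background.
    """
    rows, cols = len(image), len(image[0])
    # Initial label = pixel address + 1 (0 is reserved for background).
    labels = [[r * cols + c + 1 if image[r][c] else 0 for c in range(cols)]
              for r in range(rows)]
    changed = True
    while changed:
        changed = False
        new = [row[:] for row in labels]  # synchronous update, like a SIMD step
        for r in range(rows):
            for c in range(cols):
                if not image[r][c]:
                    continue
                # Maximum label over the 3 x 3 neighborhood (background counts as 0).
                best = max(labels[rr][cc]
                           for rr in range(max(r - 1, 0), min(r + 2, rows))
                           for cc in range(max(c - 1, 0), min(c + 2, cols)))
                if best != labels[r][c]:
                    new[r][c] = best
                    changed = True
        labels = new
    # Each converged label is the address of one seed point; renumber consecutively.
    seeds = sorted({lab for row in labels for lab in row if lab})
    rank = {lab: k + 1 for k, lab in enumerate(seeds)}
    return [[rank.get(lab, 0) for lab in row] for row in labels]
```

After convergence every pixel of a region carries the largest address in that region, so the pixel whose own address equals its label is that region's unique seed point.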

18.

A shortly connected mesh topology for high performance and energy efficient network-on-chip architectures

Md. Hasan Furhad Jong-Myon Kim 《The Journal of supercomputing》2014,69(2):766-792

Network-on-chip-based communication schemes represent a promising solution to the increasing complexity of system-on-chip problems. In this paper, we propose a new mesh-like topology called the shortly connected mesh technology (ScMesh), which is based on the traditional mesh topology, to exploit the graph symmetry properties of interconnection networks. This proposed topology not only enhances network performance by reducing the network diameter, but also provides a lower area/energy solution for interconnection network scenarios. This study analyzes and compares the performance of ScMesh to some newly improved topologies, including the WK-recursive, extended-butterfly fat tree, and diametrical mesh topologies. The experiment results indicate that ScMesh outperforms the other topologies, with throughput increases of 47.71, 33.45, and 18.64 % as well as latency decreases of 45.71, 35.84, and 14.58 % compared to the extended-butterfly fat tree, WK-recursive and diametrical mesh topologies, respectively. In addition, ScMesh achieves 41.22, 32.23, and 15.01 % lower energy consumption and 38.96, 27.43, and 18.21 % lower area overhead than the extended-butterfly fat tree, WK-recursive, and diametrical mesh topologies, respectively.

19.

Memory access dependencies in shared-memory multiprocessors

Dubois M. Scheurich C. 《IEEE Transactions on Software Engineering》1990,16(6):660-673

The presence of high-performance mechanisms in shared-memory multiprocessors such as private caches, the extensive pipelining of memory access, and combining networks may render a logical concurrency model complex to implement or inefficient. The problem of implementing a given logical concurrency model in such a multiprocessor is addressed. Two concurrency models are considered, and simple rules are introduced to verify that a multiprocessor architecture adheres to the models. The rules are applied to several examples of multiprocessor architectures.

20.

Reliability-aware core partitioning in chip multiprocessors

Isil Oz Haluk Rahmi Topcuoglu Mahmut Kandemir Oguz Tosun 《Journal of Systems Architecture》2012,58(3-4):160-176

Executing multiple applications concurrently is an important way of utilizing the computational power provided by emerging chip multiprocessor (CMP) architectures. However, this multiprogramming brings a resource management and partitioning problem, for which one can find numerous examples in the literature. Most of the resource partitioning schemes proposed to date focus on performance or energy centric strategies. In contrast, this paper explores reliability-aware core partitioning strategies targeting CMPs. One of our schemes considers both performance and reliability objectives by maximizing a novel combined metric called the vulnerability-delay product (VDP). The vulnerability component in this metric is represented with Thread Vulnerability Factor (TVF), a recently proposed metric for quantifying thread vulnerability for multicores. Execution time of the given application represents the delay component of the VDP metric. As part of our experimental analysis, proposed core partitioning schemes are compared with respect to normalized weighted speedup, normalized weighted reliability loss and normalized weighted vulnerability delay product gain metrics for various workloads of benchmark applications.