Similar Documents
20 similar documents found (search time: 31 ms)
1.
Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This study analyzes various mappings of image correlation algorithms in SIMD, MIMD, and mixed-mode environments. Experiments were conducted on the Intel Paragon, MasPar MP-1, nCUBE 2, and PASM prototype. The machine features considered in this study include: modes of parallelism, communication/computation ratio, network topology and implementation, SIMD CU/PE overlap, and communication/computation overlap. Performance of an implementation can be enhanced by using algorithmic techniques that match the machine features. Some algorithmic techniques discussed here are additional communication versus redundant computation, data block transfers, and communication/computation overlap. The results presented are applicable to a large class of image processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping the subtasks of an application task, or a set of independent application tasks, onto a heterogeneous suite of parallel machines.  相似文献   
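For reference, image correlation in its simplest serial form is just a sliding-window sum of products; the sketch below (Python, illustrative assumptions only, not the implementation studied above) shows that baseline.

# Minimal serial baseline for template correlation. The study above maps
# variants of this computation onto SIMD/MIMD machines by tiling the image
# across PEs; this sketch is a hypothetical reference, not their code.
import numpy as np

def correlate(image, template):
    """Plain (unnormalized) cross-correlation of a 2-D template over an image."""
    ih, iw = image.shape
    th, tw = template.shape
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = image[r:r + th, c:c + tw]
            out[r, c] = float(np.sum(window * template))
    return out

In a data-parallel mapping, each PE would own a block of the output and either fetch the overlapping border pixels from neighbouring PEs (additional communication) or hold and process them redundantly (redundant computation), which is the algorithmic trade-off named in the abstract.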

2.
In order to minimize the execution time of a parallel application running on a heterogeneous distributed computing system, an appropriate mapping scheme is needed to allocate the application tasks to the processors. The general problem of mapping tasks to machines is a well-known NP-hard problem and several heuristics have been proposed to approximate its optimal solution. In this paper we propose a static graph-based mapping algorithm, called Heterogeneous Multi-phase Mapping (HMM), which permits suboptimal mapping of a parallel application onto a heterogeneous distributed computing system by using a local search technique together with a tabu search meta-heuristic. HMM allocates parallel tasks by exploiting the information embedded in the parallelism forms used to implement an application, and by considering an affinity parameter that identifies which machine in the heterogeneous computing system is most suitable to execute a task. We compare HMM with some leading techniques and with an exhaustive mapping algorithm. We also give an example of the mapping of two real applications using HMM. Experimental results show that HMM performs well, demonstrating the applicability of our approach. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献
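HMM itself is only summarized above; as a rough illustration of the underlying idea, the sketch below is a generic tabu-style local search for mapping tasks to heterogeneous machines. The cost model (a per-task execution-time table with a makespan objective), the move structure, and all names are assumptions for illustration, not the published algorithm.

# Hedged sketch of a tabu-style local search for task-to-machine mapping.
import random

def tabu_map(exec_time, n_iters=500, tabu_len=20, seed=0):
    """exec_time[t][m] = estimated time of task t on machine m.
    Returns (mapping, cost) where mapping[t] is the machine of task t."""
    rng = random.Random(seed)
    n_tasks, n_machines = len(exec_time), len(exec_time[0])

    def makespan(mapping):
        load = [0.0] * n_machines
        for t, m in enumerate(mapping):
            load[m] += exec_time[t][m]
        return max(load)

    current = [rng.randrange(n_machines) for _ in range(n_tasks)]
    best, best_cost = current[:], makespan(current)
    tabu = []  # recently applied (task, machine) moves
    for _ in range(n_iters):
        # Evaluate all single-task reassignments that are not tabu.
        candidates = []
        for t in range(n_tasks):
            for m in range(n_machines):
                if m != current[t] and (t, m) not in tabu:
                    trial = current[:]
                    trial[t] = m
                    candidates.append((makespan(trial), t, m, trial))
        if not candidates:
            break
        cost, t, m, trial = min(candidates, key=lambda x: x[0])
        current = trial                      # accept the best move, even if worse
        tabu.append((t, m))
        if len(tabu) > tabu_len:
            tabu.pop(0)
        if cost < best_cost:
            best, best_cost = trial[:], cost
    return best, best_cost

An affinity-style term, as used by HMM, could be folded into makespan() by weighting exec_time toward the machines considered most suitable for each task.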

3.
An automatic global data partitioning algorithm for SIMD machines is proposed. The algorithm handles multiple non-perfectly-nested loop nests in which array subscripts are linear expressions of the loop variables. It first abstracts the communication patterns of the computation through data and iteration mappings, and then gives formal conditions for recognizing regular communication patterns. A data-iteration graph carrying alignment information and the corresponding communication costs is then constructed, and a heuristic algorithm over this graph computes favorable data and iteration distributions so as to reduce the communication overhead between processing elements. By analyzing the mappings of the multiple arrays involved across the several loop nests …  相似文献

4.
In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to different tasks of an input application at run time. The proposed mechanism targets optimizing the on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm to allocate the free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when some application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for network-on-chips, to virtually connect the disjoint regions and make the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms.  相似文献   
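As a rough illustration of non-contiguous allocation on a 2-D mesh, the sketch below grows a region of free nodes around a seed so that the allocated processors stay close in Manhattan distance; it ignores task migration and the virtual point-to-point circuits described above, and is an assumption-laden stand-in rather than the paper's mechanism.

# Hedged sketch: greedy region growing over the free nodes of a 2-D mesh NoC.
def allocate(free_nodes, n_tasks):
    """free_nodes: iterable of (x, y) mesh coordinates currently unused.
    Returns a list of n_tasks coordinates, or None if not enough are free."""
    free = set(free_nodes)
    if len(free) < n_tasks:
        return None
    seed = min(free)                 # deterministic seed choice
    region = [seed]
    free.remove(seed)
    while len(region) < n_tasks:
        # pick the free node closest (Manhattan distance) to the region so far
        nxt = min(free, key=lambda n: min(abs(n[0] - r[0]) + abs(n[1] - r[1])
                                          for r in region))
        region.append(nxt)
        free.remove(nxt)
    return region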

5.
The performance of conjugate gradient (CG) algorithms for the solution of the system of linear equations that results from the finite-differencing of the neutron diffusion equation was analyzed on SIMD, MIMD, and mixed-mode parallel machines. A block preconditioner based on the incomplete Cholesky factorization was used to accelerate the conjugate gradient search. The issues involved in mapping both the unpreconditioned and preconditioned conjugate gradient algorithms onto the mixed-mode PASM prototype, the SIMD MasPar MP-1, and the MIMD Intel Paragon XP/S are discussed. On PASM, the mixed-mode implementation outperformed either SIMD or MIMD alone. Theoretical performance predictions were analyzed and compared with the experimental results on the MasPar MP-1 and the Paragon XP/S. Other issues addressed include the impact on execution time of the number of processors used, the effect of the interprocessor communication network on performance, and the relationship of the number of processors to the quality of the preconditioning. Application studies such as this are necessary in the development of software tools for mapping algorithms onto either a single parallel machine or a heterogeneous suite of parallel machines.  相似文献
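For reference, preconditioned conjugate gradient has the well-known form sketched below. The paper uses a block preconditioner based on incomplete Cholesky factorization; this sketch simply accepts any callable that applies an approximate inverse, and the Jacobi (diagonal) preconditioner shown in the usage comment is a stand-in, not the preconditioner studied above.

# Hedged sketch of preconditioned conjugate gradient (PCG) for A x = b,
# with A symmetric positive definite and M_inv applying a preconditioner.
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = float(r @ z)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / float(p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = float(r @ z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Usage with a simple Jacobi (diagonal) preconditioner as a stand-in:
#   x = pcg(A, b, lambda r: r / np.diag(A))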

6.
We design a task mapper TPCM for assigning tasks to virtual machines, and an application-aware virtual machine scheduler TPCS oriented for parallel computing to achieve a high performance in virtual computing systems. To solve the problem of mapping tasks to virtual machines, a virtual machine mapping algorithm (VMMA) in TPCM is presented to achieve load balance in a cluster. Based on such mapping results, TPCS is constructed including three components: a middleware supporting an application-driven scheduling, a device driver in the guest OS kernel, and a virtual machine scheduling algorithm. These components are implemented in the user space, guest OS, and the CPU virtualization subsystem of the Xen hypervisor, respectively. In TPCS, the progress statuses of tasks are transmitted to the underlying kernel from the user space, thus enabling virtual machine scheduling policy to schedule based on the progress of tasks. This policy aims to exchange completion time of tasks for resource utilization. Experimental results show that TPCM can mine the parallelism among tasks to implement the mapping from tasks to virtual machines based on the relations among subtasks. The TPCS scheduler can complete the tasks in a shorter time than can Credit and other schedulers, because it uses task progress to ensure that the tasks in virtual machines complete simultaneously, thereby reducing the time spent in pending, synchronization, communication, and switching. Therefore, parallel tasks can collaborate with each other to achieve higher resource utilization and lower overheads. We conclude that the TPCS scheduler can overcome the shortcomings of present algorithms in perceiving the progress of tasks, making it better than schedulers currently used in parallel computing.  相似文献   

7.
This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors.  相似文献   
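Several of the measures listed have standard textbook definitions, collected below for convenience; the paper's exact formulations, and the weighted "sprice" measure in particular, may differ and are not reproduced here.

# Common definitions (assumed, not quoted from the paper).
# T_serial: best serial time; T_par: parallel time on p processors;
# ops_serial / ops_par: operation counts of the serial and parallel algorithms.
def speedup(T_serial, T_par):
    return T_serial / T_par

def efficiency(T_serial, T_par, p):
    return speedup(T_serial, T_par) / p        # ideal value is 1.0

def overhead(T_serial, T_par, p):
    return p * T_par - T_serial                # total processor-time overhead

def redundancy(ops_serial, ops_par):
    return ops_par / ops_serial                # extra operations introduced in parallel

def cost(T_par, p):
    return p * T_par                           # processor-time product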

8.
PASM is a proposed large-scale distributed/parallel processing system which can be partitioned into independent SIMD/MIMD machines of various sizes. One design problem for systems such as PASM is task scheduling. The use of multiple FIFO queues for nonpreemptive task scheduling is described. Four multiple-queue scheduling algorithms with different placement policies are presented and applied to the PASM parallel processing system. Simulation of a queueing network model is used to compare the performance of the algorithms. Their performance is also considered in the case where there are faulty control units and processors. The multiple-queue scheduling algorithms can be adapted for inclusion in other multiple-SIMD and partitionable SIMD/MIMD systems that use similar types of interconnection networks to those being considered for PASM.  相似文献   
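A minimal sketch of nonpreemptive scheduling with multiple FIFO queues is given below, with one queue per partition size; the placement policy, partition sizes, and machine model are illustrative assumptions and do not correspond to any of the four policies evaluated in the paper.

# Hedged sketch: one FIFO queue per partition size; a task joins the queue of
# the smallest partition that satisfies its request and is dispatched FIFO
# when enough processors are free. Requests are assumed to fit the largest size.
from collections import deque

class MultiQueueScheduler:
    def __init__(self, total_pes, sizes=(16, 32, 64, 128)):
        self.queues = {s: deque() for s in sizes}
        self.free_pes = total_pes

    def submit(self, task_id, requested_size):
        size = min(s for s in self.queues if s >= requested_size)
        self.queues[size].append(task_id)

    def dispatch(self):
        """Start every queued task whose partition currently fits."""
        started = []
        for size in sorted(self.queues):
            while self.queues[size] and self.free_pes >= size:
                started.append((self.queues[size].popleft(), size))
                self.free_pes -= size
        return started

    def finish(self, size):
        self.free_pes += size          # a task completed; release its partition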

9.
Triggered by the ever-increasing advancements in processor and networking technology, a cluster of PCs connected by a high-speed network has become a viable and cost-effective platform for the execution of computation-intensive parallel multithreaded applications. However, there are two research issues to be tackled in the scheduling problem for PC cluster computing: (1) how to reduce the communication overhead of executing a multithreaded application on the cluster; (2) how to exploit the heterogeneity, which is unavoidable in an evolving PC cluster, for the application. In this paper, we propose to use a duplication-based approach in scheduling tasks/threads to a heterogeneous cluster of PCs. In duplication-based scheduling, critical tasks are redundantly scheduled to more than one machine, in order to reduce the number of inter-task communication operations. The start times of the succeeding tasks are also reduced. The task duplication process is guided by the system heterogeneity: critical tasks are scheduled or replicated on the faster machines. The algorithm has been implemented in our experimental application parallelization system for generating multithreaded parallel code executable on a cluster of Pentium PCs. Our experiments, using three numerical applications and one protocol processing kernel (multithreading per request), have indicated that the heterogeneity of a PC cluster is indeed useful for optimizing the execution of parallel multithreaded programs.  相似文献

10.
11.
This paper examines the applicability of fine-grained tree-structured SIMD machines, which are amenable to highly efficient VLSI implementation, to several low-level image understanding tasks. Algorithms are presented for histogramming, thresholding, image correlation, connected component labeling, and computing Euler number. A particular massively parallel machine called NON-VON is used for purposes of explication and performance evaluation. Only NON-VON tree-structured communication capabilities and its SIMD mode of execution are considered in this paper. Novel algorithmic techniques are described, such as vertical pipelining, subproblem partitioning, associative matching, and data duplication, that effectively exploit the massive parallelism available in fine-grained SIMD tree machines while avoiding communication bottlenecks. Simulation results are presented and compared with results obtained or forecast for other highly parallel machines. The relative advantages and limitations of the class of machines under consideration are outlined; except for some types of image correlation, the fine-grained SIMD tree is exceptionally fast.  相似文献   
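Of the tasks listed, histogramming maps most directly onto a tree machine: each leaf builds a local histogram and every internal node merges its two children, so the global histogram is formed in O(log P) combining steps. The sketch below is a plain-Python stand-in for that reduction (an assumption for illustration, not NON-VON code).

# Hedged sketch of tree-structured histogramming: leaves -> pairwise merges.
def tree_histogram(leaf_pixel_blocks, n_bins):
    # Step 1: each leaf PE computes the histogram of its own pixel block.
    level = []
    for block in leaf_pixel_blocks:
        h = [0] * n_bins
        for v in block:
            h[v] += 1
        level.append(h)
    # Step 2: merge pairs of children up the tree, one level per step.
    while len(level) > 1:
        nxt = [[a + b for a, b in zip(level[i], level[i + 1])]
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:             # an unpaired node is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]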

12.
Parallel algorithms on SIMD (single-instruction stream multiple-data stream) machines for hierarchical clustering and cluster validity computation are proposed. The machine model uses a parallel memory system and an alignment network to facilitate parallel access to both the pattern matrix and the proximity matrix. For a problem with N patterns, the number of memory accesses is reduced from O(N³) on a sequential machine to O(N²) on an SIMD machine with N PEs.  相似文献

13.
Describes the IXM2 associative processor and its main application in speech-to-speech translation. The IXM2 is a semantic memory system machine that began as a faithful implementation of the NETL semantic network machine and grew into a massively parallel SIMD machine that has demonstrated the power of large associative memories. Such processors can support robust performance in speech applications. In fact, the IXM2 with 73 transputers has outperformed a Cray in some language-translation tasks. We selected speech-to-speech translation as our main application because it is one of the grand challenges of massively parallel artificial intelligence. The social implications of successful automatic translation are enormous; for example, people who speak different languages could communicate in real time by using interpreting telephony.  相似文献

14.
Features of an explicitly parallel programming language targeted for reconfigurable parallel processing systems, where the machine's N processing elements (PEs) are capable of operating in both the SIMD and SPMD modes of parallelism, are described. The SPMD (single program-multiple data) mode of parallelism is a subset of the MIMD mode where all processors execute the same program. By providing all aspects of the language with an SIMD mode version and an SPMD mode version that are syntactically and semantically equivalent, the language facilitates experimentation with and exploitation of hybrid SIMD/SPMD machines. Language constructs (and their implementations) for data management, data-dependent control-flow, and PE-address-dependent control-flow are presented. These constructs are based on experience gained from programming a parallel machine prototype and are being incorporated into a compiler under development. Much of the research presented is applicable to general SIMD machines and MIMD machines.  相似文献

15.
A software behavioural simulator for a new massively parallel single-instruction/multiple-data (SIMD) architecture has been developed that can accurately simulate the entire 16,384-processor bit-serial array. The key to this high performance modelling is the exploitation of an inherent mapping that exists between massively parallel SIMD architectures and the vector architectures used in many high performance scientific supercomputers. The new SIMD architecture, called BLITZEN, is based on the Massively Parallel Processor (MPP) built for NASA by Goodyear in the late 1970s. By simulating the full-scale machine with very high performance, the simulator allows development of algorithms and high-level software to proceed before realization of the hardware. This paper describes the SIMD-to-vector architecture mapping, the highly vectorized simulator in which it is used, and how the result was a simulator that achieved a level of performance three orders of magnitude faster than the conventional uniprocessor approach.  相似文献

16.
Network processors are designed to handle the inherently parallel nature of network processing applications. However, partitioning and scheduling of application tasks and data allocation to reduce memory contention remain major challenges in realizing the full performance potential of a given network processor. The large variety of processor architectures in use and the increasing complexity of network applications further aggravate the problem. This work proposes a novel framework, called FEADS, for automating the task of application partitioning and scheduling for network processors. FEADS uses the simulated annealing approach to perform design space exploration of application mapping onto processor resources. Further, it uses cyclic and r-periodic scheduling to achieve higher throughput schedules. To evaluate dynamic performance metrics such as throughput and resource utilization under realistic workloads, FEADS automatically generates a Petri net (PN) which models the application, architectural resources, mapping and the constructed schedule and their interaction. The throughput obtained by schedules constructed by FEADS is comparable to that obtained by manual scheduling for linear task flow graphs; for more complicated task graphs, FEADS’ schedules have a throughput which is up to 2.5 times higher compared to the manual schedules. Further, static scheduling of tasks results in an increase in throughput by up to 30% compared to an implementation of the same mapping without task scheduling.  相似文献
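The simulated-annealing core of such a design-space exploration can be sketched as follows; the single-task move and the externally supplied cost function are placeholders, and the Petri-net evaluation that FEADS uses to score candidate mappings is not reproduced.

# Hedged sketch of simulated annealing over task-to-PE mappings.
import math
import random

def anneal_mapping(cost, n_tasks, n_pes, T0=10.0, cooling=0.995, iters=5000, seed=0):
    """cost(mapping) -> value to minimize; mapping[t] is the PE of task t."""
    rng = random.Random(seed)
    current = [rng.randrange(n_pes) for _ in range(n_tasks)]
    cur_cost = cost(current)
    best, best_cost = current[:], cur_cost
    T = T0
    for _ in range(iters):
        t = rng.randrange(n_tasks)               # perturb: move one task
        trial = current[:]
        trial[t] = rng.randrange(n_pes)
        delta = cost(trial) - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            current, cur_cost = trial, cur_cost + delta
            if cur_cost < best_cost:
                best, best_cost = current[:], cur_cost
        T *= cooling                             # geometric cooling schedule
    return best, best_cost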

17.
Multi-view video coding (MVC) comprises rich 3D information and is widely used in new visual media, such as 3DTV and free-viewpoint TV (FTV). However, even with mainstream computer manufacturers migrating to multi-core processors, the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper, we demonstrate the design and implementation of the first parallel MVC system on the Cell Broadband Engine™ processor, a state-of-the-art multi-core processor. We propose a task-dispatching algorithm that is adaptive and data-driven at the frame level for MVC, and implement a parallel multi-view video decoder with a modified H.264/AVC codec on a real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management, utilization of code locality, and SIMD improvements. Decoding speed, speedup, and core utilization rates are reported in the experimental results.  相似文献

18.
This paper describes an object recognition algorithm both on a sequential machine and on a single instruction multiple data (SIMD) parallel processor such as the MIT connection machine. The problem, in the way it is presently formulated on a sequential machine, is essentially a propagation of constraints through a tree of possibilities in an attempt to prune the tree to a small number of leaves. The tree can become excessively large, however, and so implementations on massively parallel machines are sought in order to speed up the problem. Two fast parallel algorithms are described here, a static algorithm and a dynamic algorithm. The static algorithm reformulates the problem by assigning every leaf in the completely expanded unpruned tree to a separate processor in the connection machine. Then pruning is done in nearly constant time by broadcasting constraints to the entire SIMD array. This parallel version is shown to run three to four orders of magnitude faster than the sequential version. For large recognition problems which would exceed the capacity of the machine, a dynamic algorithm is described which performs a series of loading and pruning steps, dynamically allocating and deallocating processors through the use of the connection machine's global router communications mechanism.  相似文献   
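The static algorithm's central idea, one (virtual) processor per leaf hypothesis with each constraint broadcast to all of them at once, can be mimicked with a vectorized boolean mask; in the sketch below NumPy stands in for the SIMD array, and the hypothesis and constraint representations are assumptions for illustration.

# Hedged sketch: prune all leaf hypotheses in one data-parallel step per constraint.
import numpy as np

def prune(hypotheses, constraints):
    """hypotheses: NumPy array of candidates; constraints: vectorized predicates."""
    alive = np.ones(len(hypotheses), dtype=bool)
    for constraint in constraints:
        alive &= constraint(hypotheses)     # one 'broadcast' per constraint
    return hypotheses[alive]

# Example: keep even candidates below 50.
#   prune(np.arange(100), [lambda h: h % 2 == 0, lambda h: h < 50])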

19.
In order to program SIMD (single instruction stream-multiple data stream) parallel machines used for tasks such as speech and image processing, a language with explicit parallel constructs is often desirable. The language Ada, developed by the Department of Defense, is used here as a basis for such a language. Extensions of Ada, which allow the user to specify such operations as interprocessor communications and activation of processors, are proposed. These features are demonstrated by showing their use in a common speech processing algorithm, the parallel FFT.  相似文献   

20.
In this paper attention is concentrated on the mapping of computationally intensive multi-task applications onto shared computational grids. This problem, already known to be NP-complete in parallel systems, becomes even more arduous in such environments. To find a near-optimal mapping solution, a parallel version of a Differential Evolution algorithm is presented and evaluated on different applications and operating conditions of the grid nodes. The purpose is to select for a given application the mapping solutions that minimize the greatest among the time intervals that each node dedicates to the execution of the tasks assigned to it. The experiments, carried out with applications represented as task interaction graphs, demonstrate the ability of the evolutionary tool to perform multisite grid mapping, and show that the parallel approach is more effective than the sequential version both in enhancing the quality of the solution and in the time needed to get it. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献
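A sequential sketch of differential evolution for this kind of mapping problem is given below, minimizing the largest per-node execution time; the encoding, parameters, and cost model are illustrative assumptions, and the parallel DE and the grid/communication model of the paper are not reproduced.

# Hedged sketch of DE/rand/1/bin for task-to-node mapping (makespan-style objective).
import random

def de_map(node_time, n_tasks, n_nodes, pop_size=30, F=0.5, CR=0.9,
           generations=200, seed=0):
    """node_time[t][n] = time node n needs for task t (communication ignored)."""
    rng = random.Random(seed)

    def decode(vec):
        return [min(n_nodes - 1, max(0, int(round(x)))) for x in vec]

    def fitness(vec):
        load = [0.0] * n_nodes
        for t, n in enumerate(decode(vec)):
            load[n] += node_time[t][n]
        return max(load)            # greatest per-node time, as in the abstract

    pop = [[rng.uniform(0, n_nodes - 1) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    fit = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(n_tasks)
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if (rng.random() < CR or j == j_rand) else pop[i][j]
                     for j in range(n_tasks)]
            f = fitness(trial)
            if f <= fit[i]:
                pop[i], fit[i] = trial, f
    best = min(range(pop_size), key=lambda i: fit[i])
    return decode(pop[best]), fit[best]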
