首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Parallel computers offer a solution to improve the lengthy computation time of many conventional, sequential programs used in molecular biology. On a parallel computer, different pieces of the computation are performed simultaneously on different processors. LINKMAP is a sequential program widely used by scientists to perform genetic linkage analysis. We have converted LINKMAP to run on a parallel computer, using the machine-independent parallel programming language, Linda. Using the parallelization of LINKMAP as a case study, the paper outlines an approach to converting existing highly iterative programs to a parallel form. The paper describes the steps involved in converting the sequential program to a parallel program. It presents performance benchmarks comparing the sequential version of LINKMAP with the parallel version running on different parallel machines. The paper also discusses alternative approaches to the problem of "load balancing," making sure the computational load is shared as evenly as possible among the available processors.  相似文献   

2.
姿态运动的并行仿真模型   总被引:1,自引:0,他引:1       下载免费PDF全文
本文构造运载火箭姿态运动的并行数字仿真模型。首先将姿态运动的隐式数学模型变换到显式模型,然后应用预处理数值积分方法离散化,并将其计算量均匀地分配给各个处理机,最后构造了各个处理机之间的同步方法。实际计算表明,本文构造的并行仿真模型的计算复杂性小,具有较高的加速比。  相似文献   

3.
Problem partitioning to solve ordinary differential equations on a parallel processor system using classical numerical integration methods involves defining and ordering computation tasks and scheduling the tasks for execution, In defining tasks there is a tradeoff between decomposing a computation into a large number of primitive tasks to expose all potential parallelism and decomposing it into a smaller number of tasks to simplify scheduling. Scheduling is an intractable problem; heuristic scheduling algorithms reduce the effort required to schedule tasks but cannot guarantee that the parallel solution will execute in minimum time. An example illustrates difficulties encountered in scheduling tasks for parallel computation and use of a dependency graph as a tool in problem partitioning. The need for an efficient mechanism for asynchronous data exchanges among processors is demonstrated.  相似文献   

4.
殷新春  陈崚  谢立 《计算机应用》2003,23(4):13-15,18
实际并行计算中,在问题规模一定的前提下,有时增加处理机个数不但不能明显减少计算时间,反而会使计算时间延长。文中针对这种现象进行分析,提出了一种建立在对通令延时解析分析基础上选择最优处理机上数的策略,以使并行计算时间最短,这种策略的正确性已通过在曙光-2000并行机上的实验结果得到验证。  相似文献   

5.
In this paper we benchmark the performance of the Cray T3D, IBM 9076 SP/1 and Intel Paragon XP/S parallel computers, using implementations of parallel algorithms for the computation of the vector outer-product A = uvT operation. The vector outer-product operation, although very simple in nature, requires the computation of a large number of floating-point operations and its parallelization induces a great level of communication between the processors. It is thus suited to measure the relative speed of the processor, memory subsystem and network capabilities of a parallel computer. It should not be considered a ‘toy problem’, since it arises in numerical methods in the context of the solution of systems of non-linear equations – still a difficult problem to solve. We present algorithms for both the explicit shared-memory and message-passing programming models together with theoretical computation models for those algorithms. Actual experiments were run on those computers, using Fortran 77 implementations of the algorithms. The results obtained with these experiments show that due to the high degree of communication between the processors one needs a parallel computer with fast communications and carefully implemented data exchange routines. The theoretical computation model allows prediction of the speed-up to be obtained for some problem size on a given number of processors. © 1997 John Wiley & Sons, Ltd.  相似文献   

6.

In distributed computing, divisible load theory provides an important system model for allocation of data-intensive computations to processing units working in parallel. The main task is to define how a computation job should be split into parts, to which processors those parts should be allocated and in which sequence. The model is characterized by multiple parameters describing processor availability in time, transfer times of job parts to processors, their computation times and processor usage costs. The main criteria are usually the schedule length and cost minimization. In this paper, we provide the generalized formulation of the problem, combining key features of divisible load models studied in the literature, and prove its NP-hardness even for unrestricted processor availability windows. We formulate a linear program for the version of the problem with a fixed number of processors. For the case with an arbitrary number of processors, we close the gaps in the study of special cases, developing efficient algorithms for single criterion and bicriteria versions of the problem, when transfer times are negligible.

  相似文献   

7.
In this paper, we present a software tool, RTS (real time simulator), that analyses the time cost behaviour of parallel computations through simulation. It is assumed in RTS that the computer system which supports the executions of parallel computations has a limited number of processors all processors have the same speed and they communicate with each other through a shared memory. In RTS, the time cost of a parallel computation is defined as a function of the input, the algorithm, the data structure, the processor speed, the number of processors, the processor power allocation, the communication and the execution environment. How RTS models the time cost is first discussed in the paper. In the model, a locking technique is used to manipulate the access to the shared memory, processing power is equally allocated among all the operations that are currently being performed in parallel in the computer system, and the number of operations in the execution environment of a parallel computation changes from time to time. How RTS works and how the simulation is used to do time cost analysis are also discussed.  相似文献   

8.
块三对角线性方程组的一种分布式并行算法   总被引:16,自引:0,他引:16  
骆志刚  李晓梅 《计算机学报》2000,23(10):1028-1034
提出了分布环境下求解三对角线性方程组的一种并行算法,该算法基于对计算量的仔细估算,合理地将方程组求解工作分配到各处理机,达到负载平衡,同时,充分地将计算与通信重叠,减少处理机空闲时间;当块三以角线性方程组的系数矩阵为对角占优时,算法在执行过程中不会中断;文中分析了算法的复杂性,给出了在分析布存储多计算机系统上的数值试验结果,数值结果表明,文中算法的效率较Chung等的算法有较大的提高。  相似文献   

9.
In this paper, we first show how a certain ordering of vertices, called bicompatible elimination ordering (BCO), of a proper interval graph can be used to solve efficiently several problems in proper interval graphs. We, then, propose an NC parallel algorithm (i.e., polylogarithmic-time employing a polynomial number of processors) in SIMD CRCW PRAM (Single Instruction Stream Multiple Data Stream Concurrent Read Concurrent Write Parallel Random Access Machine) model of parallel computation to compute a BCO of a proper interval graph. To the best of our knowledge, this is the first NC parallel algorithm to compute a BCO of a proper interval graph.  相似文献   

10.
三对角线性方程组的一种有效分布式并行算法   总被引:8,自引:0,他引:8  
提出了分布式存储环境下求解三对角线性方程的一种并行算法,该算法基于“分而治之”的策略,高效地形成并求解其缩减方程组,避免不必要的冗余计算,通过对计算量的仔细估计,较好地平衡了各处理机的负载;同时,充分利用了计算与通信重叠技术,减少处理机空闲时间,分析了自救的复杂性,给 分布存储多计算机系统上的数值试验结果,数值结果表明,算法的效率较迟利华和李晓梅的DPP算法有较大的提高。  相似文献   

11.
网状函数关系数据对象的通用编辑技术   总被引:1,自引:0,他引:1  
文中对一类具有网状函数关系的数据对象论述了其交互式的编辑技术,重点讨论了结构化程序设计中对对话框操作方式下网状关系数据对象数据模型的描述、编辑及处理。所述的方法可以实现通用的与应用无关的复杂对象的数据编辑,以增强软件的可重用性。  相似文献   

12.
In parallel adaptive mesh refinement (AMR) computations the problem size can vary significantly during a simulation. The goal here is to explore the performance implications of dynamically varying the number of processors proportional to the problem size during simulation. An emulator has been developed to assess the effects of this approach on parallel communication, parallel runtime and resource consumption. The computation and communication models used in the emulator are described in detail. Results using the emulator with different AMR strategies are described for a test case. Results show for the test case, varying the number of processors, on average, reduces the total parallel communications overhead from 16 to 19% and improves parallel runtime time from 4 to 8%. These results also show that on average resource utilization improves more than 37%.  相似文献   

13.
The constrained 2D cutting stock problem is an irregular problem with dynamic data structures, highly variable amounts of computation per task, and unpredictable amounts and patterns of communication. This paper describes the design and implementation of a parallel solution to this problem on a cluster of workstations and a distributed memory multicomputer. The key element of our parallel solution is the replication of an important data structure on all processors. By exploiting properties of the cutting stock problem which allow the use of relaxed consistency mechanisms, our approach is able to reduce the overheads for communication and synchronization in comparison to approaches that partition the data structure among processors. A token-based lazy release consistency protocol is used to ensure mutual exclusion and maintain consistency, and a randomized work-stealing protocol is employed to dynamically balance work among processors. Good speedups are reported for three benchmark problems executed on two distributed memory platforms: a cluster of workstations interconnected by a 10 Mbit/s Ethernet and an Intel Paragon. © 1998 John Wiley & Sons, Ltd.  相似文献   

14.
We have parallelized the FASTA algorithm for biological sequence comparison using Linda, a machine-independent parallel programming language. The resulting parallel program runs on a variety of different parallel machines. A straight-forward parallelization strategy works well if the amount of computation to be done is relatively large. When the amount of computation is reduced, however, disk I/O becomes a bottleneck which may prevent additional speed-up as the number of processors is increased. The paper describes the parallelization of FASTA, and uses FASTA to illustrate the I/O bottleneck problem that may arise when performing parallel database search with a fast sequence comparison algorithm. The paper also describes several program design strategies that can help with this problem. The paper discusses how this bottleneck is an example of a general problem that may occur when parallelizing, or otherwise speeding up, a time-consuming computation.  相似文献   

15.
We consider the problem of scheduling tasks on multiprocessor architectures in the presence of communication delays. Given a set of dependent tasks, the scheduling problem is to allocate the tasks to processors such that the pre-specified precedence constraints among the tasks are obeyed and certain cost-measures (such as the computation time) are minimized. Several cases of the scheduling problem have been proven to be NP-complete. Nevertheless, there are polynomial time algorithms for interesting special cases of the general scheduling problem. Most of these results, however, do not take into consideration the delays due to message passing among processors. In this paper we study the increase in time complexity of scheduling problems due to the introduction of communication delays. In particular, we address the open problem of scheduling Out-forests (In-forests) in a multiprocessor system of m identical processors when communication delays are considered. The corresponding problem of scheduling Out-forests (In-forests) without communication delays admits an elegant polynomial time solution as presented first by Hu in 1961; however, the problem in the presence of communication delays has remained unsolved. We present here first known polynomial time algorithms for the computation of the optimal schedule when the number of available processors is given and bounded and both computation and communication delays are assumed to take one unit of time. Furthermore, we present a linear-time algorithm for computing a near-optimal schedule for unit-delay out-forests. The schedule's length exceeds the optimum by no more than (m-2) time units, where m is the number of processors. Hence for two processors the computed schedule is strictly optimum  相似文献   

16.
Recently Akl et al. introduced a new model of parallel computation, called BSR (broadcasting with selective reduction) and showed that it is more powerful than any CRCW PRAM and yet requires no more resources for implementation than even EREW PRAM. The model allows constant time solutions to sorting, parallel prefix and other problems. In this paper, we describe constant time solutions to the parenthesis matching, decoding binary trees in bitstring representation, generating next tree shape in B-order, and the reconstruction of binary trees from their traversals, using the BSR model. They are the first constant time solutions to mentioned problems on any model of computation. The number of processors used is equal to the input size, for each problem. A new algorithm for sorting integers is also presented  相似文献   

17.
The modular exponentiation operation of the current algorithms for asymmetric cryptography is the most expensive part in terms of computational cost. The RSA algorithm, for example, uses the modular exponentiation algorithm in encryption and decryption procedure. Thus, the overall performance of those asymmetric cryptosystems depends heavily on the performance of the specific algorithm used for modular exponentiation. This work proposes new parallel algorithms to perform this arithmetical operation and determines the optimal number of processors that yields the greatest speedup. The optimal number is obtained by balancing the processing load evenly among the processors. Practical implementations are also performed to evaluate the theoretical proposals.  相似文献   

18.
Multigrid methods are powerful techniques to accelerate the solution of computationally-intensive problems arising in a broad range of applications. Used in conjunction with iterative processes for solving partial differential equations, multigrid methods speed up iterative methods by moving the computation from the original mesh covering the problem domain through a series of coarser meshes. But this hierarchical structure leaves domain-parallel versions of the standard multigrid algorithms with a deficiency of parallelism on coarser grids. To compensate, several parallel multigrid strategies with more parallelism, but also more work, have been designed. We examine these parallel strategies and compare them to simpler standard algorithms to try to determine which techniques are more efficient and practical. We consider three parallel multigrid strategies: (1) domain-parallel versions of the standard V-cycle and F-cycle algorithms; (2) a multiple coarse grid algorithm, proposed by Fredrickson and McBryan, which generates several coarse grids for each fine grid; and (3) two Rosendale algorithm, which allow computation on all grids simultaneously. We study an elliptic model problem on simple domains, discretized with finite difference techniques on block-structured meshes in two or three dimensions with up to 106 or 109 points, respectively. We analyze performance using three models of parallel computation: the PRAM and two bridging models. The bridging models reflect the salient characteristics of two kinds of parallel computers: SIMD fine-grain computers, which contain a large number of small (bitserial) processors, and SPMD medium-grain computers, which have a more modest number of powerful (single chip) processors. Our analysis suggests that the standard algorithms are substantially more efficient than algorithms utilizing either parallel strategy. Both parallel strategies need too much extra work to compensate for their extra parallelism. They require a highly impractical number of processors to be competitive with simpler, standard algorithms. The analysis also suggests that the F-cycle, with the appropriate optimization techniques, is more efficient than the V-cycle under a broad range of problem, implementation, and machine characteristics, despite the fact that it exhibits even less parallelism than the V-cycle. Research at Princeton University partially supported by the National Science Foundation, Grant No. CCR-8920505, and the Office of Naval Research, Contract No. N0014-91-J-1463.  相似文献   

19.
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The parameters of the performance for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of optimizing throughput with a latency constraint. The problem formulation uses a general and realistic model of intertask communication and addresses the entire problem of mapping, which includes clustering tasks into modules, assigning of processors to modules, and possible replicating of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.  相似文献   

20.
Client-Agent-Server model (CAS model) which can decrease the work load of the client by adding agent processors to the Client-Server model (CS model) is proposed and an approach to parallel test generation for logic circuits on the CAS model is presented. Two problems are considered: optimal granularity problem and optimal scheme problem. First, the problem of parallel test generation on the CAS model is formulated to analyze the effect of the granularity (grain size of target faults allocated to processors) in both cases of static and dynamic task allocation (optimal granularity problem). Then the relationship between the number of processors and the total processing time is analyzed (optimal scheme problem). From the analysis, it is shown that the CAS model can reduce the total processing time over the CS model and that there exists an optimal scheme (an optimal pair of numbers of agent processors and server processors) for the CAS model which minimizes the total processing time for a given number of processors. To corroborate the analysis, the proposed parallel test generation algorithm is implemented on a network of more than 100 workstations and experimental results for the ISCAS benchmark circuits are presented. It is shown that the experimental results are very close to the theoretical results which confirms the existence of optimal granularity and optimal scheme which minimizes the total processing time for the CAS model  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号