
1.

RTS: A system to simulate the real time cost behaviour of parallel computations

Bin Qin Howard A. Sholl Reda A. Ammar 《Software》1988,18(10):967-985

In this paper, we present a software tool, RTS (real time simulator), that analyses the time cost behaviour of parallel computations through simulation. RTS assumes that the computer system supporting the execution of parallel computations has a limited number of processors, that all processors have the same speed, and that they communicate with each other through a shared memory. In RTS, the time cost of a parallel computation is defined as a function of the input, the algorithm, the data structure, the processor speed, the number of processors, the processor power allocation, the communication and the execution environment. The paper first discusses how RTS models the time cost. In the model, a locking technique is used to control access to the shared memory, processing power is allocated equally among all the operations currently being performed in parallel in the computer system, and the number of operations in the execution environment of a parallel computation changes over time. How RTS works and how the simulation is used for time cost analysis are also discussed.

2.

Optimal Partitioning and Granularity of Uniform Task Graphs


Zhang Zhongyun Li Guojie 《计算机科学技术学报》1991,6(2):185-194

Task partitioning is an important technique in parallel processing. In this paper, we investigate the optimal partitioning strategies and granularities of tasks with communications based on several models of parallel computer systems. Different from the usual approach, we study the optimal partitioning strategies and granularities from the viewpoint of minimizing T as well as minimizing NT^2, where N is the number of processors used and T is the program execution time using N processors. Our results show that the optimal partitioning strategies for all cases discussed in this paper are the same: either assign all tasks to one processor or distribute them among the processors as equally as possible, depending only on functions of the ratio of running time to communication time, R/C.
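The "all on one processor or split equally" result can be illustrated with a toy cost model. The sketch below is an assumption, not one of the paper's models: it takes two processors, n identical tasks of running time R each, a cost C for every pair of tasks placed on different processors, and no overlap of computation and communication. The name `best_split` is hypothetical.

```python
def best_split(n, run_time, comm_time):
    """Find how many of n tasks to place on the second of two processors.

    Assumed toy model: total time = R * max(k, n - k) + C * k * (n - k),
    where k tasks go to the second processor and n - k stay on the first.
    """
    def total_time(k):
        return run_time * max(k, n - k) + comm_time * k * (n - k)

    # By symmetry it is enough to try k = 0 .. n // 2.
    best_k = min(range(n // 2 + 1), key=total_time)
    return best_k, total_time(best_k)
```

In this toy model the best split is always at an extreme: all tasks on one processor when R/C is below n/2, and an even split when R/C is above it.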

3.

A statement-oriented mutant reduction strategy for MBFL

王林鑫 王微微 赵瑞莲 李征 《计算机科学》2017,44(11):175-180

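How to locate faulty code efficiently and precisely during software debugging is a common concern of software developers. MBFL is a fault localization technique based on mutation analysis. While it achieves relatively high fault localization accuracy, it generates a large number of mutants and executes the test suite on each of them, which is very costly. To reduce the mutant execution cost of MBFL, a statement-oriented mutant reduction strategy is proposed: by analyzing the execution information of the test cases, the set of mutants generated for each statement covered by failing test cases is reduced by a fixed proportion. Experimental results on 112 faulty versions from 7 program packages show that MBFL with the statement-oriented mutant reduction strategy reduces the mutant execution cost by 73.51% to 79.98% while maintaining relatively high fault localization accuracy.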
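The per-statement reduction step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract states only that the mutant set of each statement covered by failing tests is reduced by a fixed proportion; the function name `reduce_mutants`, the dictionary layout, the uniform random sampling and the "keep at least one" rule are all assumptions.

```python
import math
import random


def reduce_mutants(mutants_by_stmt, failing_covered, keep_ratio, seed=0):
    """Keep a fixed proportion of the mutants of each failing-covered statement.

    mutants_by_stmt: dict mapping statement id -> list of mutant ids (assumed layout)
    failing_covered: set of statement ids executed by at least one failing test
    keep_ratio: fraction of each statement's mutants to keep, in (0, 1]
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    reduced = {}
    for stmt, mutants in mutants_by_stmt.items():
        if stmt not in failing_covered or not mutants:
            continue  # only statements covered by failing tests are considered
        # Assumption: keep at least one mutant per statement.
        k = max(1, math.ceil(keep_ratio * len(mutants)))
        reduced[stmt] = rng.sample(mutants, k)
    return reduced
```

Sampling per statement, rather than from the global mutant pool, keeps every suspicious statement represented in the reduced set.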

4.

Performance modeling for SPMD message-passing programs

JÜRGEN BREHM PATRICK H. WORLEY MANISH MADHUKAR 《Concurrency and Computation》1998,10(5):333-357

Today's massively parallel machines are typically message-passing systems consisting of hundreds or thousands of processors. Implementing parallel applications efficiently in this environment is a challenging task, and poor parallel design decisions can be expensive to correct. Tools and techniques that allow the fast and accurate evaluation of different parallelization strategies would significantly improve the productivity of application developers and increase throughput on parallel architectures. This paper investigates one of the major issues in building tools to compare parallelization strategies: determining what type of performance models of the application code and of the computer system are sufficient for a fast and accurate comparison of different strategies. The paper is built around a case study employing the performance prediction tool (PerPreT) to predict performance of the parallel spectral transform shallow water model code (PSTSWM) on the Intel Paragon. PSTSWM is a parallel application code that was designed to evaluate different parallel strategies for the spectral transform method as it is used in climate modeling and weather forecasting. Multiple parallel algorithms and algorithm variants are embedded in the code. PerPreT uses a relatively simple algebraic model to predict execution time for SPMD (single program multiple data) parallel applications. Applications are modeled through parameterized formulae for communication and computation, where the parameters include the problem size, the number of processors used to execute the program, and system characteristics (e.g. setup times for communication, link bandwidth and sustained computing performance per processor). In this paper we describe performance models that predict the performance of the different algorithms in PSTSWM accurately enough to allow them to be compared, establishing the feasibility of such a demanding application of performance modeling. 
We also discuss issues in generating and validating the performance models, emphasizing the practical importance of tools such as PerPreT in such studies. © 1998 John Wiley & Sons, Ltd.
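A parameterized formula of the kind this abstract describes might look like the sketch below. It is not PerPreT's actual model: the function `predicted_time`, its parameter names, and the specific form (perfectly divided computation plus a per-message setup and transfer cost) are assumptions chosen to match the parameters the abstract lists.

```python
def predicted_time(total_flops, num_procs, flops_per_sec,
                   msgs_per_proc, bytes_per_msg, setup_sec, bytes_per_sec):
    """Estimate SPMD execution time as computation time plus communication time.

    Assumes the work divides evenly over the processors and that each message
    costs a fixed setup time plus its size divided by the link bandwidth.
    """
    compute = total_flops / (num_procs * flops_per_sec)
    comm = msgs_per_proc * (setup_sec + bytes_per_msg / bytes_per_sec)
    return compute + comm
```

For example, `predicted_time(1e9, 10, 1e8, 100, 1e6, 1e-4, 1e8)` combines 1.0 s of computation with 1.01 s of communication, so two parallelization strategies can be compared by evaluating the formula with each strategy's message counts and sizes.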

5.

A parallel dynamic load-balancing algorithm for solution-adaptive finite element meshes on 2D tori

Yeh-Ching Chung Yaa-Jyun Yeh J.-S Liu 《Concurrency and Computation》1995,7(7):615-631

To efficiently execute a finite element program on a 2D torus, we need to map nodes of the corresponding finite element graph to processors of a 2D torus such that each processor has approximately the same amount of computational load and the communication among processors is minimized. If nodes of a finite element graph do not increase during the execution of a program, the mapping only needs to be performed once. However, if a finite element graph is solution-adaptive, that is, nodes of a finite element graph increase discretely due to the refinement of some finite elements during the execution of a program, a dynamic load-balancing algorithm has to be performed many times in order to balance the computational load of processors while keeping the communication cost as low as possible. In the paper we propose a parallel dynamic load-balancing algorithm (LB) to deal with the load-imbalancing problem of a solution-adaptive finite element program on a 2D torus. The algorithm uses an iterative approach to achieve load-balancing. We have implemented the proposed algorithm along with two parallel mapping algorithms, parallel orthogonal recursive bisection (ORB) and parallel recursive mincut bipartitioning (MC), on a simulated 2D torus. Three criteria, the execution time of load-balancing algorithms, the computation time of an application program under different load balancing algorithms, and the total execution time of an application program (under several refinement phases) are used for performance evaluation. Simulation results show that (1) the execution of LB is faster than those of MC and ORB; (2) the mappings of LB are better than those of ORB and MC; and (3) the speedups of LB are better than those of ORB and MC.

6.

MuJava: an automated class mutation system

Yu‐Seung Ma Jeff Offutt Yong Rae Kwon 《Software Testing, Verification and Reliability》2005,15(2):97-133

Several module and class testing techniques have been applied to object‐oriented (OO) programs, but researchers have only recently begun developing test criteria that evaluate the use of key OO features such as inheritance, polymorphism, and encapsulation. Mutation testing is a powerful testing technique for generating software tests and evaluating the quality of software. However, the cost of mutation testing has traditionally been so high that it cannot be applied without full automated tool support. This paper presents a method to reduce the execution cost of mutation testing for OO programs by using two key technologies, mutant schemata generation (MSG) and bytecode translation. This method adapts the existing MSG method for mutants that change the program behaviour and uses bytecode translation for mutants that change the program structure. A key advantage is in performance: only two compilations are required and both the compilation and execution time for each is greatly reduced. A mutation tool based on the MSG/bytecode translation method has been built and used to measure the speedup over the separate compilation approach. Experimental results show that the MSG/bytecode translation method is about five times faster than separate compilation. Copyright © 2004 John Wiley & Sons, Ltd.

7.

Partitioning and mapping of nested loops for linear array multicomputers (cited by: 1; self-citations: 1; citations by others: 0)

Jang-Ping Sheu Tzung-Shi Chen 《The Journal of supercomputing》1995,9(1-2):183-202

In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project an n-dimensional computational structure, representing an n-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.

8.

A second-order mutant reduction method based on SOM neural networks

宋利 刘靖 《软件学报》2019,30(5):1464-1480

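Second-order mutation testing simulates the complex faults that occur in real programs by manually injecting two faults into the source program, and it is of great significance in software testing. However, the number of second-order mutants formed by combining first-order mutants grows sharply, which greatly increases the execution cost. To reduce the number of second-order mutants and lower the execution cost, a second-order mutant reduction method based on SOM neural networks is proposed. The method first applies a relatively comprehensive fault combination strategy to combine first-order mutants into second-order mutants; it then clusters the mutants with an SOM neural network according to the similarity of the intermediate values produced during their execution. The method was validated on classic benchmark programs and open-source programs. The experimental results show that, on the one hand, a combination strategy with more comprehensive fault coverage can adequately simulate complex program faults; after clustering-based reduction, the number of second-order mutants drops sharply while the second-order mutation adequacy comes closer to the first-order mutation adequacy, and because far fewer second-order mutants are executed, running the clustered second-order mutants takes markedly less time than running all of them. On the other hand, the experiments uncovered hidden second-order mutants that are useful for adding test components.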

9.

Problem size, parallel architecture, and optimal speedup

《Journal of Parallel and Distributed Computing》1988,5(4):404-420

The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. In this paper we examine the numerical solution of an elliptic partial differential equation in order to study the relationship between problem size and architecture. The equation's domain is discretized into n² grid points which are divided into partitions and mapped onto the individual processor memories. We analytically quantify the relationships among grid size, stencil type, partitioning strategy, processor execution time, and communication network type. In doing so, we determine the optimal number of processors to assign to the solution (and hence the optimal speedup), and identify (i) the smallest grid size which fully benefits from using all available processors, (ii) the leverage on performance given by increasing processor speed or communication network speed, and (iii) the suitability of various architectures for large numerical problems. Finally, we compare the predictions of our analytic model with measurements from a multiprocessor and find that the model accurately predicts performance.
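The effect described here, where adding processors eventually increases execution time, can be shown with a deliberately simple model. The sketch below is an assumption and not the paper's analytic model: it divides the work on n² grid points evenly over p processors and charges a communication cost that grows linearly with p (as on a shared bus). The name `best_processor_count` and its parameters are hypothetical.

```python
def best_processor_count(n, t_calc, t_comm, max_procs):
    """Search for the processor count that minimizes a toy execution-time model.

    Assumed model: time(p) = n^2 * t_calc / p + t_comm * (p - 1),
    i.e. evenly divided grid work plus overhead that grows with p.
    """
    def time(p):
        return (n * n * t_calc) / p + t_comm * (p - 1)

    best = min(range(1, max_procs + 1), key=time)
    return best, time(best)
```

With n = 100, t_calc = 1e-6 and t_comm = 1e-4, the model is minimized at 10 processors even when 64 are available, while a machine with only 4 processors is best used in full.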

10.

Performance Bounds on Multiprocessor Scheduling Strategies for Chain Structured Programs

《Journal of Parallel and Distributed Computing》1994,23(1):112-118

In multiprocessors with static allocation of processes to processors, scheduling can be done locally for each processor. The scheduling strategy may have dramatic effect on the execution time of a parallel program. It is NP-hard to find an optimal schedule, and very little is known on how close the heuristic solutions get. In order to obtain nontrivial performance bounds, this study focuses on a restricted class of parallel programs, viz., chain structured programs. The major result is a theorem stating that if certain program parameters are known the execution time of the optimal schedule can be calculated within a factor of 2, even though the optimal schedule is unknown. Using a previously developed tool, one can extract the necessary parameters from a parallel program. This technique makes it possible to compare the execution time for different scheduling strategies with the optimal case. The technique used for calculating the performance bounds gives important hints on how to design efficient scheduling algorithms for chain structured programs.

11.

Parallel Molecular Dynamics: Implications for Massively Parallel Machines

Valerie E. Taylor Rick L. Stevens Kathryn E. Arnold 《Journal of Parallel and Distributed Computing》1997,45(2):159

Molecular dynamics simulation is a class of applications that require reducing the execution time of fixed-size problems. This reduction in execution time is important to drug design and protein interaction studies. Many implementations of parallel molecular dynamics have been developed, but very little work has addressed issues related to the use of machines with 50,000 processors for modest-sized problems in the range of 50,000 atoms. Current massively parallel machines present a major obstacle to achieving good performance: communication overhead. In this paper we quantify the communication latency and network bandwidth necessary to achieve 30–40% efficiency on future message-passing machines with sizes on the order of tens of thousands of processors, for executing molecular dynamics problems with the same order of atoms. We derive an analytical model of a benchmark application that simulates a system of helium atoms executing on the Intel Touchstone Delta using an interaction decomposition method. This model is validated and used to extrapolate information on the startup time and network bandwidth. The results indicate that for an MPP with a four-dimensional mesh topology using 400 MHz processors, the communication startup time must be at most 30 clock cycles and the network bandwidth at least 2.3 GB/s. This configuration results in 30–40% efficiency of the MPP for a problem with 50,000 atoms executing on 50,000 processors.

12.

Analysis of parallelism in nested DO loops

P.W. Foulk S.M. Nassar 《Journal of Systems and Software》1985,5(1):73-80

The execution time of FORTRAN programs can be decreased by putting solutions to problems in their maximally parallel forms. The most important issue is the DO-loop. In this study nested DO-loops were considered and analysis of parallelism was performed on matrix multiplication using a PROLOG program. When processed by the AIDS system, the maximally parallel graph was produced. This indicates the number of processors that could be used in parallel to execute the FORTRAN program. The study shows that the maximally parallel program can run in considerably less time than that needed to run the original sequential FORTRAN program. N×N matrix multiplication programs are speeded up by a time-saving ratio that is always greater than (1:N²), but it cannot exceed (1:N³), since N³ is the maximum number of processors used in parallel at any time. These time-saving ratio evaluations assume that all operations have equal execution time and initialization overhead is ignored.

13.

A Uniform Framework for Dynamic Load Balancing Strategies in Distributed Processing Systems

《Journal of Parallel and Distributed Computing》1994,23(2):246-255

Load balancing plays a central role in processor utilizations in distributed systems. Several strategies have been proposed in the literature to achieve load balancing. Usually, these strategies attempt to achieve a tradeoff between reducing the execution time of an application and minimizing the synchronization and the communication overhead. In this paper, we present a general model in which load balancing decisions are reached by enforcing performance metrics which may be adapted to reflect the specific requirements of different environments. Many of the load balancing schemes that have been suggested in the literature can be viewed as specific instances of the general framework presented in this paper. The basic scheme in this framework uses a load contention number that accounts for the load of the processors, the communication cost and the distance among processors. It is meant to be adaptable to the overall load on the system, the load on the communication devices, the run time characteristics of the tasks, and the configuration of the system. Furthermore, its implementation is not computationally complex. Thus, the gains made by load balancing are not overshadowed by the load balancing cost.

14.

Parallel genetic programming: Synchronous and asynchronous migration

Shisanu Tongchim Prabhas Chongstitvatana 《Artificial Life and Robotics》2001,5(4):189-194

Genetic programming (GP) has been used successfully as a technique for constructing robot control programs. Depending on the number of evaluations and the cost of each evaluation, however, GP may require a substantial amount of processing time to find a feasible solution. The advent of parallel GP has brought the execution time of GP to a more acceptable level. This paper investigates parallel GP with a mobile robot navigation problem. The parallel implementations are based on a coarse-grained model. A technique for distributing the task of serial GP is proposed. In particular, this technique shows that the total amount of work can be reduced while maintaining the quality of the solutions. Asynchronous and synchronous implementations are examined. We compare the performance in terms of both the solution quality and the execution time. The timing analysis is investigated to give an insight into the behavior of parallel implementations. The results show that the parallel algorithm with asynchronous migration using 10 processors is 33 times faster than the serial algorithm. This work was presented in part at the 5th International Symposium on Artificial Life and Robotics, Oita, Japan, January 26–28, 2000.

15.

A framework for the application of metaheuristics to tasks-to-processors assignation problems

Francisco Almeida Javier Cuenca Domingo Giménez Antonio Llanes-Castro Juan-Pedro Martínez-Gallar 《The Journal of supercomputing》2012,62(2):698-723

The optimization of the execution time of a parallel algorithm can be achieved through the use of an analytical cost model function representing the running time. Typically the cost function includes a set of parameters that model the behavior of the system and the algorithm. In order to reach an optimal execution, some of these parameters must be fitted according to the input problem and to the target architecture. An optimization problem can be stated where the modeled execution time for the algorithm is used to estimate the parameters. Due to the large number of variable parameters in the model, analytical minimization techniques are discarded. Exhaustive search techniques can be used to solve the optimization problem, but when the number of parameters or the size of the computational system increases, the method is impracticable due to time restrictions. The use of approximation methods to guide the search is also an alternative. However, the dependence on the algorithm modeled and the bad quality of the solutions as a result of the presence of many local optima values in the objective functions are also drawbacks to these techniques. The problem becomes particularly difficult in complex systems hosting a large number of heterogeneous processors solving non-trivial scientific applications. The use of metaheuristics allows for the development of valid approaches to solve general problems with a large number of parameters. A well-known advantage of metaheuristic methods is the ability to obtain high-quality solutions at low running times while maintaining generality. We propose combining the parameterized analytical cost model function and metaheuristic minimization methods, which contributes to a novel real alternative to minimize the parallel execution time in complex systems. The success of the proposed approach is shown with two different algorithmic schemes on parallel heterogeneous systems. 
Furthermore, the development of a general framework allows us to easily develop and experiment with different metaheuristics to adjust them to particular problems.

16.

CMuJava: a concurrent mutant generation system for Java programs

孙昌爱 耿宁 代贺鹏 顾友达 《软件学报》2022,33(2):397-409

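A concurrent program consists of multiple flows of execution that run concurrently and share memory. Because the execution order among these flows is nondeterministic, testing concurrent software systems is difficult. Mutation testing is a fault-based software testing technique that is widely used to evaluate the adequacy of test suites and the effectiveness of testing techniques. A key problem in applying mutation testing to concurrent programs is how to efficiently generate a large set of mutants that simulate concurrency faults. A mutation testing framework for concurrent programs is presented, ...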

17.

Efficient computation of optimal assignments for distributed tasks

《Journal of Parallel and Distributed Computing》1987,4(4):342-362

We consider the problem of finding an optimal assignment of the modules of a program to processors in a distributed system. A module incurs an execution cost that may be different for each processor assignment, and modules that are not assigned to the same processor but that communicate with one another incur a communication cost. An optimal assignment minimizes the sum of the module execution costs and the intermodule communication costs. This problem is known to be NP-complete for more than three processors. Using a branch-and-bound-with-underestimates algorithm to reduce the size of the search tree, we evaluate its average time and space complexity for two underestimating functions through simulation. The more complex of the two functions, called the minimum independent assignment cost underestimate (MIACU), performs extremely well over a wide range of values of program model parameters such as the number of modules, the number of processors, and the ratio of average module execution cost to average intermodule communication cost. By reordering the list of modules to allow a subset of modules that do not communicate with one another to be assigned last, further improvements using MIACU are possible.
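The objective function in this abstract (execution costs plus communication costs between modules on different processors) can be written down directly. The sketch below pairs it with plain exhaustive search, which is a stand-in for illustration and not the paper's branch-and-bound algorithm or its MIACU underestimate. The names `assignment_cost` and `optimal_assignment` and the data layout are assumptions.

```python
from itertools import product


def assignment_cost(assign, exec_cost, comm_cost):
    """Sum of execution costs plus communication costs across processors.

    assign: tuple where assign[m] is the processor chosen for module m
    exec_cost: exec_cost[m][p] is the cost of running module m on processor p
    comm_cost: dict mapping a module pair (a, b) to its communication cost
    """
    total = sum(exec_cost[m][p] for m, p in enumerate(assign))
    for (a, b), cost in comm_cost.items():
        if assign[a] != assign[b]:  # cost is paid only across processors
            total += cost
    return total


def optimal_assignment(exec_cost, comm_cost, num_procs):
    """Try every assignment; exponential in the number of modules."""
    n = len(exec_cost)
    return min(product(range(num_procs), repeat=n),
               key=lambda a: assignment_cost(a, exec_cost, comm_cost))
```

Exhaustive search visits num_procs ** n assignments, which is why a pruned search with good underestimates matters for anything beyond a handful of modules.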

18.

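A multi-round divisible load scheduling algorithm with result return on heterogeneous cluster systems (cited by: 1; self-citations: 0; citations by others: 1)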

钟诚 李显宁 《计算机研究与发展》2008,45(Z1):99-104

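This work targets heterogeneous cluster computing environments in which processors differ in computing speed and communication capability, and the common practical case in which result information is returned to the central processor node after tasks have been processed. A multi-round divisible load scheduling algorithm with result return and a variable number of scheduling rounds is proposed. The algorithm allows computation and communication operations to overlap, adopts a FIFO scheduling strategy, and distributes computing tasks in parallel over multiple rounds. Experimental results show that, for applications that return result information, the algorithm achieves better scheduling performance than the UMR multi-round divisible load scheduling algorithm and obtains a near-optimal number of scheduling rounds.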

19.

Decreasing the cost of mutation testing with second‐order mutants

Macario Polo Mario Piattini Ignacio García‐Rodríguez 《Software Testing, Verification and Reliability》2009,19(2):111-131

Although powerful, mutation is a computationally very expensive testing technique. In fact, its three main stages (mutant generation, mutant execution and result analysis) require many resources to be successfully accomplished. Thus, researchers have made important efforts to reduce its costs. This paper represents an additional effort in this direction. It describes the results of two experiments in which the original set of mutants is combined to obtain a new set of mutants, each one with two faults, so that the number of mutants used is reduced to half. The results suggest that mutant combination does not decrease the quality of the test suite, while yielding important savings in mutant execution and result analysis. Copyright © 2008 John Wiley & Sons, Ltd.
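The halving effect of combining mutants can be seen in a small sketch. The abstract does not say how mutants are paired, so pairing adjacent entries of a list is purely an assumption, as are the function name `pair_mutants` and the handling of an odd leftover mutant.

```python
def pair_mutants(first_order):
    """Combine first-order mutants pairwise into second-order mutants.

    Each first-order mutant is used exactly once, so n mutants give
    ceil(n / 2) combined mutants. Assumption: with an odd count, the
    last mutant is kept on its own as a one-element tuple.
    """
    combined = []
    for i in range(0, len(first_order) - 1, 2):
        combined.append((first_order[i], first_order[i + 1]))
    if len(first_order) % 2 == 1:
        combined.append((first_order[-1],))
    return combined
```

Because every original fault still appears in exactly one combined mutant, the test suite is still exercised against each fault while only about half as many mutants are executed.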

20.

Techniques for compiling programs on distributed memory multicomputers

PeiZong Lee 《Parallel Computing》1995,21(12):1895-1923

It is widely accepted that distributed memory parallel computers will play an important role in solving computation-intensive problems. However, the design of an algorithm in a distributed memory system is time-consuming and error-prone, because a programmer is forced to manage both parallelism and communication. In this paper, we present techniques for compiling programs on distributed memory parallel computers. We will study the storage management of data arrays and the execution schedule arrangement of Do-loop programs on distributed memory parallel computers. First, we introduce formulas for representing data distribution of specific data arrays across processors. Then, we define communication cost for some message-passing communication operations. Next, we derive a dynamic programming algorithm for data distribution. After that, we show how to improve the communication time by pipelining data, and illustrate how to use data-dependence information for pipelining data. Jacobi's iterative algorithm and the Gauss elimination algorithm for linear systems are used to illustrate our method. We also present experimental results on a 32-node nCUBE-2 computer.