期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Performance Prediction for Parallel Iterative Solvers

V. Blanco P. González J. C. Cabaleiro D. B. Heras T. F. Pena J. J. Pombo F. F. Rivera 《The Journal of supercomputing》2004,28(2):177-191

In this paper, an exhaustive parallel library of sparse iterative methods and preconditioners in HPF and MPI was developed, and a model for predicting the performance of these codes is presented. This model can be used both by users and by library developers to optimize the efficiency of the codes, as well as to simplify their use. The information offered by this model combines theoretical features of the methods and preconditioners in addition to certain practical considerations and predictions about aspects of the performance of their execution in distributed memory multiprocessors. 相似文献

2.

Optimal and Efficient Parallel Tridiagonal Solvers Using Direct Methods

Santos Eunice E. 《The Journal of supercomputing》2004,30(2):97-115

The problem of solving tridiagonal linear systems on parallel distributed-memory environments is considered in this paper. In particular, two common direct methods for solving such systems are considered: odd-even cyclic reduction and prefix summing. For each method, a variety of lower bounds on execution time for solving tridiagonal linear systems are presented. Specifically, lower bounds are presented that (a) hold when the number of data items per processor is bounded, (b) are general lower bounds, and (c) for specific data layouts commonly used in designing parallel algorithms to solve tridiagonal linear systems. Furthermore, algorithms are presented that have running times within a constant factor of the lower bounds provided. Lastly, a comparison of bounds for odd-even cyclic reduction and prefix summing is given. 相似文献

3.

Highly Scalable Two- and Three-Dimensional Navier-Stokes Parallel Solvers on MIMD Multiprocessors

Averbuch A. Ioffe L. Israeli M. Vozovoi L. 《The Journal of supercomputing》1997,11(1):7-39

In this paper we present a new parallel algorithm for the solution of the incompressible two- and three-dimensional Navier-Stokes equations. The parallelization is achieved via domain decomposition. The computational region is considered in the form of a 2-D or 3-D periodic box decomposed into parallel strips (slabs). For time discretization we use a third order multistep method of [11]. The time discretization procedure results in solving global elliptic problems of (monotonic) Helmholtz and Poisson types in each time step. For the space discretization we employ the multidomain local Fourier (MDLF) method that was developed in [9, 10, 13]. The discretization in the periodic directions is performed by the standard Fourier method. In the direction across the strips we use the Local Fourier Basis technique which involves the overlapping of the neighboring subdomains and smoothing of local functions across the interior boundaries (interfaces). The matching of the local solutions is performed by adding properly weighted interface Green's functions. Their amplitudes are found in terms of the jumps of the solution and its first derivatives at the interfaces. The present paper extends the results of our previous works [1, 9, 10, 13] on parallel use of the MDLF method in three-fold aspects: 1. In [1] a model Navier-Stokes type system was considered which does not include the pressure term. Correspondingly, in each time step only the Helmholtx type equations were solved. It was shown that the parallel solution of this equation can be accomplished using only local (neighbor-to-neighbor) communication due to localization properties of the Helmholtz operator. We consider the complete Navier-Stokes system including the pressure term. The solution of the Poisson equation for pressure has the potential to degrade the performance and the achieved speedup of a parallel algorithm due to the global nature of this equation that necessitates global communication among the processors. However, we show that only a few lowest harmonics require for the global data transfer whereas the rest of harmonics can be treated locally. Therefore, most of the communication that is required for parallelization of the Navier-Stokes solver using the MDLF method is mainly local between adjacent subdomains (processors). Moreover, the percentage of the time spent in global communication reduces as the size of the problem increases. Thus, the present parallel algorithm is highly scalable. 2. In [l] we considered only 2-D equations. In this paper we extend the previous technique to 3-D problems. 3. Previously, the MDLF solver was implemented only on the MEIKO parallel machine. In this paper the 2-D and 3-D Navier-Stokes solvers are implemented on three MIMD message-passing multiprocessors (a 60-processors IBM SP2, a 20-processors MOSIX [3], and a network of 10 Alpha workstations) and achieve an efficiency of more than 70% to 95%. The same code written with the PVM (parallel virtual machine [7]) software package was executed on all the above distinct computational platforms. Detailed performance results, which include scalability analysis, are presented. This revised version was published online in June 2006 with corrections to the Cover Date. 相似文献

4.

绝对时钟智能控制器的研制

田继成廖瑜《电脑开发与应用》1997,10(3):40-41,6

介绍了一种日常参数智能控制器的研制及研制过程中的几个重要问题的解决方法。特别是用于城市照明及美化城市的彩灯的开、关时间智能控制方面，给出了依据当地绝对时间、经度、纬度进行昼夜时间计算的方法，使之在不接光电传感器情况下可跟踪季节的昼夜变化。控制系统采用８９Ｃ２０５１单片机，配有时钟日历芯片ＤＳ１２８８７，系统体积小、功能强、可靠性高。相似文献

5.

利用MapReduce平台实现高效并行的频繁子图挖掘

孙鹤立陈强刘玮黄健斌邹建华《计算机科学与探索》2014,(7):790-801

频繁子图挖掘是数据挖掘领域的一个重要问题,并且有着广泛的应用。在Hadoop平台上实现了一种基于MapReduce的高效频繁子图挖掘算法Cloud-GFSG（cloud-global frequent subgraph）。该算法基于Apriori思想,在扩展边生成新的子图时,使用已经挖掘出的k-1阶的频繁子图生成k阶的频繁子图。同时,检查是否存在待扩展生成的子图,设定生成的频繁子图表示规则,保证了频繁子图信息的唯一性。较同类算法相比,该算法在挖掘频繁子图时更具通用性,并且在扩展边时避免产生大量的复制图,从而使得算法的正确性得以保证,且运行效率显著提高。相似文献

6.

Distributed Parallel Strategies for Industrial CFD Solvers: A Case Study and Analysis of Performances

Gianmarco Manzini Luca Stolcis 《Journal of Parallel and Distributed Computing》1999,57(3):3339

The complexity of characterizing both parallel hardware and software makes it very difficult to explain and predict the performances of parallel programs for real industrial CFD applications. A performance model based on a generalized Amdahl's formulation has been developed and applied to a flow solver. The present formulation allows us to explain the behavior of a typical CFD explicit multiblock solver when the program is run on a multiprocessor distributed-memory system. Using this approach, it is possible to gain an insight on the performance limitations of this class of parallel solvers, by considering the impact of larger and larger number of processors on fixed-scaled problems. 相似文献

7.

A Parallelization Scheme Based on Work Stealing for a Class of SAT Solvers

Bernard?Jurkowiak Email author Chu?Min?Li Gil?Utard 《Journal of Automated Reasoning》2005,34(1):73-101

Because of the inherent NP-completeness of SAT, many SAT problems currently cannot be solved in a reasonable time. Usually, in order to tackle a new class of SAT problems, new ad hoc algorithms must be designed. Another way to solve a new problem is to use a generic solver and employ parallelism to reduce the solve time. In this paper we propose a parallelization scheme for a class of SAT solvers based on the DPLL procedure. The scheme uses a dynamic load-balancing mechanism based on work-stealing techniques to deal with the irregularity of SAT problems. We parallelize Satz, one of the best generic SAT solvers, with our scheme to obtain a parallel solver called PSatz. The first experimental results on random 3-SAT problems and a set of well-known structured problems show the efficiency of PSatz. PSatz is freely available and runs on any networked workstations under Unix/Linux. 相似文献

8.

基于GPU的并行Cholesky分解及其应用

沈雁戴瑜兴《计算机工程》2019,45(2):284-289

在OpenCL并行计算框架的clMAGMA库中,Cholesky分解算法采用大尺寸分块并行方法,不能充分利用GPU的高速局部存储器,且在计算过程中存在多次GPU-CPU间的数据传递。为此,提出采用小尺寸分块并行方法,充分利用GPU中的高速局部存储器,使矩阵子块的逆矩阵得到复用,完成对称正定矩阵的高效Cholesky分解,并且其能够应用于三维视觉光束平差问题中的大型正定矩阵的分解。实验结果表明,该方法的Cholesky分解速度比clMAGMA提升50%以上,针对光束平差问题,比Ceres Solver中使用的Eigen库速度提升约38倍。相似文献

9.

ANTS: Agents on Networks, Trees, and Subgraphs 总被引：1，自引：0，他引：1

Israel A. Reference to Wagner Michael Reference to Lindenbaum Alfred M. Reference to Bruckstein 《Future Generation Computer Systems》2000,16(8):915-926

Efficient exploration of large networks is a central issue in data mining and network maintenance applications. In most existing work there is a distinction between the active ‘searcher’ which both executes the algorithm and holds the memory and the passive ‘searched graph’ over which the searcher has no control at all. Large dynamic networks like the Internet, where the nodes are powerful computers and the links have narrow bandwidth and are heavily-loaded, call for a different paradigm, in which a noncentralized group of one or more lightweight autonomous agents traverse the network in a completely distributed and parallelizable way. Potential advantages of such a paradigm would be fault tolerance against network and agent failures, and reduced load on the busy nodes due to the small amount of memory and computing resources required by the agent in each node. Algorithms for network covering based on this paradigm could be used in today’s Internet as a support for data mining and network control algorithms. Recently, a vertex ant walk () method has been suggested [I.A. Wagner, M. Lindenbaum, A.M. Bruckstein, Ann. Math. Artificial Intelligence 24 (1998) 211–223] for searching an undirected, connected graph by an a(ge)nt that walks along the edges of the graph, occasionally leaving ‘pheromone’ traces at nodes, and using those traces to guide its exploration. It was shown there that the ant can cover a static graph within time nd, where n is the number of vertices and d the diameter of the graph. In this work we further investigate the performance of the method on dynamic graphs, where edges may appear or disappear during the search process. In particular we prove that (a) if a certain spanning subgraph S is stable during the period of covering, then the method is guaranteed to cover the graph within time nd_s, where d_s is the diameter of S, and (b) if a failure occurs on each edge with probability p, then the expected cover time is bounded from above by nd((logΔ/log(1/p))+((1+p)/(1−p))), where Δ is the maximum vertex degree in the graph. We also show that (c) if G is a static tree then it is covered within time 2n. 相似文献

10.

Domain Decomposition Models for Parallel Monte Carlo Transport

Alme Henry J. Rodrigue Garry H. Zimmerman George B. 《The Journal of supercomputing》2001,18(1):5-23

We present a strategy for parallelizing computations that use the transport method. It combines spatial domain decomposition with domain replication to realize the scaling benefits of replication while allowing for problems whose computational mesh will not fit in a single processor's memory. The mesh is decomposed into a small number of spatial domains—typically fewer domains than there are processors—and heuristics are used to estimate the computational effort required to generate the solution in each subdomain using Monte Carlo. That work estimate determines the number of times a subdomain is replicated relative to the others. Timing of runs for two problems show that the new method scales better than traditional domain decomposition. 相似文献

11.

Finding Unmanipulable Singularities in Parallel Mechanisms Using Jacobian Decomposition

Yawei Yang John F. O’Brien 《Journal of Intelligent and Robotic Systems》2008,53(1):3-19

A geometric method is presented to determine the unmanipulable singular configurations of a general class of parallel mechanisms. In unmanipulable singular configurations, the composite Jacobian matrix that maps active joints velocities to end-effector velocity loses rank, indicating the loss of a task degree-of-freedom. Finding unmanipulable configurations is difficult due to the complexity of the Jacobian matrix. The problem is greatly simplified by a novel decomposition of the matrix presented in this paper. The method is used to find singularities in several example parallel machines. 相似文献

12.

Coarse-Grained Parallel Transitive Closure Algorithm: Path Decomposition Technique 总被引：1，自引：0，他引：1

Gibbons Alan; Pagourtzis Aris; Potapov Igor; Rytter Wojciech 《Computer Journal》2003,46(4):391-400

相似文献

13.

基于消息传递的时域分解并行算法研究

李永刚欧阳洁《计算机工程与应用》2006,42(18):32-36

将Parareal算法中的预估校正格式加以改进,提出时域分解并行算法。基于主从模式和消息传递,具体考察了群体通信和非阻塞通信模式,并设计出通用而简便的并行化模型。在集群系统下对热传导方程和对流扩散方程的数值模拟结果表明:算法具有较高的加速性能以及良好的可扩展性,体现了时域分解的独特优势。相似文献

14.

一种用计算域分解的等几何分析并行化方法

郭利财黄章进顾乃杰《小型微型计算机系统》2013,34(6)

提出一种按照计算域分解的并行化方法来构建等几何分析的刚度矩阵和右侧向量.将计算域分解成为若干个不相交的子区域,然后为每个区域分配一个处理器,所有处理器并行进行子区域上面的计算,所有处理器完成子区域的计算以后,使用一个快速的归并算法完成线性系统的装配.实验表明,本文提出的方法在8核的机器上可以达到6.46的加速比,能够在4秒左右的时间计算680万个矩阵元素个数.使用Intel MKL稀疏求解器来求解线性系统,本文的等几何分析求解器能够在大约10秒的时间内求解52万的自由度,本文的方法比ISOGAT速度要快上万倍. 相似文献

15.

基于平行因子分解的频域盲解卷积算法

李剑杨贤《计算机与网络》2012,(9):51-54

针对卷积混合盲分离问题,文章提出了一张基于张量平行因子分解的盲分离算法。该算法通过将接收信号的频域相关矩阵叠加成三阶张量,再对此三阶张量进行平行因子分解,最后利用基于K-means聚类的全排列解模糊算法来完成无排列模糊的混合矩阵估计。通过仿真实验,计算分离信号与源信号的相似系数,结果表明提出的算法具有很好的分离效果,而且实现简单,可满足实际应用的要求。相似文献

16.

基于GPU的多层次并行QR分解算法研究

穆帅王晨曦邓仰东《计算机仿真》2013,30(9)

QR分解作为一个基本计算模块,广泛应用在图像处理、信号处理、通信工程等众多领域.传统的并行QR分解算法只能挖掘计算过程中的数据级并行.在分析快速Givens Rotation分解特征的基础上,提出了一种多层次并行算法,能够同时挖掘计算过程中的任务级并行和数据级并行,非常适合于以图形处理器(GPU)为代表的大规模并行处理器.同时,采用GPU的并行QR分解算法可以作为基本运算模块被GPU平台上的众多应用程序直接调用.实验结果显示,与CPU平台上使用OpenMP实现的算法相比,基于GPU的多层次并行算法能够获得5倍以上的性能提升,而调用QR分解模块的奇异值分解(SVD)应用可以获得3倍以上的性能提升. 相似文献

17.

Two-Time Scale Controlled Markov Chains: A Decomposition and Parallel Processing Approach

Haurie A. Moresino F. 《Automatic Control, IEEE Transactions on》2007,52(12):2325-2331

This correspondence deals with a class of ergodic control problems for systems described by Markov chains with strong and weak interactions. These systems are composed of a set of subchains that are weakly coupled. Using results already available in the literature one formulates a limit control problem the solution of which can be obtained via an associated nondifferentiable convex programming (NDCP) problem. The technique used to solve the NDCP problem is the Analytic Center Cutting Plane Method (ACCPM) which implements a dialogue between, on one hand, a master program computing the analytical center of a localization set containing the solution and, on the other hand, an oracle proposing cutting planes that reduce the size of the localization set at each main iteration. The interesting aspect of this implementation comes from two characteristics: (i) the oracle proposes cutting planes by solving reduced sized Markov Decision Problems (MDP) via a linear program (LP) or a policy iteration method; (ii) several cutting planes can be proposed simultaneously through a parallel implementation on processors. The correspondence concentrates on these two aspects and shows, on a large scale MDP obtained from the numerical approximation ldquoa la Kushner-Dupuisrdquo of a singularly perturbed hybrid stochastic control problem, the important computational speed-up obtained. 相似文献

18.

一种基于自动机分解的网络协议并行处理策略 总被引：1，自引：0，他引：1

张志斌郭莉方滨兴陈小军《计算机学报》2006,29(8):1496-1504

为了解决线程级并行体系结构带来的线程间对于共享资源的竞争问题,利用模拟对协议处理自动机中不同状态阶段的性能特点和在SMT结构上各状态阶段对于Cache竞争的情况进行了对比,指出CPU中Load／Store部件有可能构成协议处理的性能瓶颈之一,且对于有状态协议的处理在建立连接阶段具有较强的Cache竞争能力．进而提出一种新的协议并行处理策略,根据各状态阶段的性能特点利用有限自动机分解的方法将协议自动机划分成若干子自动机后并行处理．模拟结果表明该策略能够缓解协议并行处理时线程间对于Cache的竞争,在真实SMT环境中与基于连接的并行策略相比处理性能提高了5％左右．相似文献

19.

Achieving Scalable Parallel Molecular Dynamics Using Dynamic Spatial Domain Decomposition Techniques

Lars Nyland Jan Prins Ru Huai Yun Jan Hermans Hye-Chung Kum Lei Wang 《Journal of Parallel and Distributed Computing》1997,47(2):129

To achieve scalable parallel performance in molecular dynamics simulations, we have modeled and implemented several dynamic spatial domain decomposition algorithms. The modeling is based upon the bulk synchronous parallel architecture model (BSP), which describes supersteps of computation, communication, and synchronization. Using this model, we have developed prototypes that explore the differing costs of several spatial decomposition algorithms and then use this data to drive implementation of our molecular dynamics simulator,Sigma. The parallel implementation is not bound to the limitations of the BSP model, allowing us to extend the spatial decomposition algorithm. For an initial decomposition, we use one of the successful decomposition strategies from the BSP study and then subsequently use performance data to adjust the decomposition, dynamically improving the load balance. The motivating reason to use historical performance data is that the computation to predict a better decomposition increases in cost with the quality of prediction, while the measurement of past work often has hardware support, requiring only a slight amount of work to modify the decomposition for future simulation steps. In this paper, we present our adaptive spatial decomposition algorithms, the results of modeling them with the BSP, the enhanced spatial decomposition algorithm, and its performance results on computers available locally and at the national supercomputer centers. 相似文献

20.

Image-Space Decomposition Algorithms for Sort-First Parallel Volume Rendering of Unstructured Grids

Kutluca Hüuseyin Kurç Tah¨sin M. Aykanat Cevdet 《The Journal of supercomputing》2000,15(1):51-93

Twelve adaptive image-space decomposition algorithms are presented for sort-first parallel direct volume rendering (DVR) of unstructured grids on distributed-memory architectures. The algorithms are presented under a novel taxonomy based on the dimension of the screen decomposition, the dimension of the workload arrays used in the decomposition, and the scheme used for workload-array creation and querying the workload of a region. For the 2D decomposition schemes using 2D workload arrays, a novel scheme is proposed to query the exact number of screen-space bounding boxes of the primitives in a screen region in constant time. A probe-based chains-on-chains partitioning algorithm is exploited for load balancing in optimal 1D decomposition and iterative 2D rectilinear decomposition (RD). A new probe-based optimal 2D jagged decomposition (OJD) is proposed which is much faster than the dynamic-programming based OJD scheme proposed in the literature. The summed-area table is successfully exploited to query the workload of a rectangular region in constant time in both OJD and RD schemes for the subdivision of general 2D workload arrays. Two orthogonal recursive bisection (ORB) variants are adapted to relax the straight-line division restriction in conventional ORB through using the medians-of-medians approach on regular mesh and quadtree superimposed on the screen. Two approaches based on the Hilbert space-filling curve and graph-partitioning are also proposed. An efficient primitive classification scheme is proposed for redistribution in 1D, and 2D rectilinear and jagged decompositions. The performance comparison of the decomposition algorithms is modeled by establishing appropriate quality measures for load-balancing, amount of primitive replication and parallel execution time. The experimental results on a Parsytec CC system using a set of benchmark volumetric datasets verify the validity of the proposed performance models. The performance evaluation of the decomposition algorithms is also carried out through the sort-first parallelization of an efficient DVR algorithm. 相似文献