
1.

Fast Parallel Reordering and Isomorphism Testing of k-Trees

Del Greco Sekharan Sridhar 《Algorithmica》2008,32(1):61-72

Abstract. In this paper two problems on the class of k-trees, a subclass of the class of chordal graphs, are considered: the fast reordering problem and the isomorphism problem. An O(log² n) time parallel algorithm for the fast reordering problem is described that uses O(nk(n-k)/log n) processors on a CRCW PRAM, proving membership in the class NC for fixed k. An O(nk(k+1)!) time sequential algorithm for the isomorphism problem is obtained, representing an improvement over the O(n² k(k+1)!) algorithm of Sekharan (the second author) [10]. A parallel version of this sequential algorithm is presented that runs in O(log² n) time using O((nk((k+1)!+n-k))/log n) processors, improving on a parallel algorithm of Sekharan for the isomorphism problem [10]. Both the sequential and parallel algorithms use a concept introduced in this paper called the kernel of a k-tree.

2.

An Overview of Parallel Processing Technology (cited by: 1; self-citations: 0; citations by others: 1)

袁方王凤先《微机发展》1997,7(4):8-11

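This paper gives a comprehensive overview of the architecture and software systems of parallel computers, and introduces the development trends of parallel processing technology along with several key issues.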

3.

Efficient Parallel Computation of the Characteristic Polynomial of a Sparse, Separable Matrix

J. H. Reif 《Algorithmica》2001,29(3):487-510

This paper is concerned with the problem of computing the characteristic polynomial of a matrix. In a large number of applications, the matrices are symmetric and sparse, with O(n) non-zero entries. The problem has an efficient sequential solution in this case, requiring O(n²) work by use of the sparse Lanczos method. A major remaining open question is to find a polylog time parallel algorithm with matching work bounds. Unfortunately, the sparse Lanczos method cannot be parallelized to faster than time Ω(n) using n processors. Let M(n) be the processor bound to multiply two n × n matrices in O(log n) parallel time. Giesbrecht [G2] gave the best previous polylog time parallel algorithms for the characteristic polynomial of a dense matrix with O(M(n)) processors. There is no known improvement to this processor bound in the case where the matrix is sparse. Often, in addition to being symmetric and sparse, the matrix has a sparsity graph (which has edges between indices of the matrix with non-zero entries) that has small separators. This paper gives a new algorithm for computing the characteristic polynomial of a sparse symmetric matrix, assuming that the sparsity graph is s(n)-separable and has a separator of size s(n) = O(n^γ), for some γ, 0 < γ < 1, that when deleted results in connected components of ≤ αn vertices, for some 0 < α < 1, with the same property. We derive an interesting algebraic version of Nested Dissection, which constructs a sparse factorization of the matrix A - λI_n, where A is the input matrix and I_n is the n × n identity matrix. While Nested Dissection is commonly used to minimize the fill-in in the solution of sparse linear systems, our innovation is to use the separator structure to bound also the work for manipulation of rational functions in the recursively factored matrices. The matrix elements are assumed to be over an arbitrary field.
We compute the characteristic polynomial of a sparse symmetric matrix in polylog time using P(n)(n + M(s(n))) ≤ P(n)(n + s(n)^2.376) processors, where P(n) is the processor bound to multiply two degree n polynomials in O(log n) parallel time using a PRAM (P(n) = O(n) if the field supports an FFT of size n, but is otherwise O(n log log n) [CK]). Our method requires only that a matrix be symmetric and non-singular (it need not be positive definite as usual for Nested Dissection techniques). For the frequently occurring case where the matrix has small separator size, our polylog parallel algorithm has work bounds competitive with the best known sequential algorithms (i.e., the Ω(n²) work of sparse Lanczos methods); for example, when the sparsity graph is a planar graph, s(n) ≤ O(√n), and we require polylog time with only P(n) n^1.188 processors. Received September 26, 1997; revised June 5, 1999.

4.

Research on a Parallel Matrix Inversion Algorithm Based on a Simplified LogP Model

陈天麒曾庆华孙世新《计算机科学》2003,30(8):176-177

LogP is becoming a practical parallel computation model that meets the demands of parallel computers and parallel algorithms, so it is important to redesign parallel algorithms for the LogP model. This paper studies a parallel algorithm for computing the inverse of a matrix on a simplified LogP model and reports simulation results.

5.

Selecting Base Points of Elliptic Curves over GF(p) Using a Distributed Parallel Algorithm

张金山《计算机仿真》2004,(4)

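Research on and implementation of elliptic curve cryptosystems (ECC) have gradually become the mainstream of public-key cryptography research. Selecting secure elliptic curves suitable for cryptography, together with their base points, is the foundation of any elliptic curve cryptosystem implementation, and efficiency is a key factor in the wide adoption of such systems. This paper first introduces elliptic curves defined over finite fields and the rules of point-group arithmetic, and gives the order of the elliptic curve point group. It then discusses algorithms for selecting base points of secure elliptic curves over large prime fields, further improves and optimizes them with a distributed parallel algorithm, and implements them in standard C with the help of the MIRACL system. Test results show that this work does speed up the selection of base points for secure elliptic curves.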
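The point-group arithmetic and base-point check described in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration: the toy curve y² = x³ + 2x + 2 over GF(17), the candidate point (5, 1), and the function names are not taken from the paper, which works over large prime fields in C with MIRACL. Because each candidate is tested independently, the checks could in principle be handed out to separate workers.

```python
# Toy illustration only: the curve, prime, and point are assumptions chosen
# for readability, not parameters from the paper.
P_MOD = 17          # small prime p (real systems use very large primes)
A, B = 2, 2         # curve y^2 = x^3 + A*x + B over GF(p)
INF = None          # point at infinity (the group identity)


def on_curve(pt):
    """Return True if pt satisfies the curve equation (or is the identity)."""
    if pt is INF:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def add(p1, p2):
    """Add two curve points with the chord-and-tangent rule."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                   # p2 is the inverse of p1
    if p1 == p2:
        # Tangent slope for point doubling.
        slope = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        # Chord slope for two distinct points.
        slope = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    slope %= P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def mul(k, pt):
    """Compute the scalar multiple k*pt by double-and-add."""
    result = INF
    while k > 0:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result


def is_base_point(pt, prime_order):
    """Accept pt if it lies on the curve and prime_order*pt is the identity.

    For a prime order this implies the point has exactly that order.
    """
    return pt is not INF and on_curve(pt) and mul(prime_order, pt) is INF
```

The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later.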

6.

Research on a Matrix Multiplication Acceleration Algorithm Based on MPSoC Parallel Scheduling

杨飞马昱春侯金徐宁《计算机科学》2017,44(8):36-41

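Matrix multiplication underlies numerical analysis and graphics and image processing algorithms, and the design of general-purpose matrix multiplication accelerators has long been an active research topic in embedded system design. Because of its high computational complexity and low processing efficiency, however, matrix multiplication often becomes the bottleneck for computing speed in embedded systems. To make better use of matrix multiplication in the embedded domain, this paper proposes a hardware/software co-designed acceleration architecture based on MPSoC (MultiProcessor System-on-Chip). Within this architecture, a matrix blocking method oriented to hardware constraints is designed, yielding a general-purpose matrix multiplication accelerator system; in addition, corresponding task partitioning and load-balancing scheduling algorithms are proposed that exploit the multi-core architecture of the MPSoC, improving parallel efficiency and overall system speedup. Experimental results show that the proposed architecture and algorithms achieve general-purpose matrix multiplication, and that the multi-core parallel scheduling algorithm realized through hardware/software co-design significantly improves computational efficiency compared with a traditional single-core design.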
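The combination of matrix blocking and task partitioning mentioned in this abstract can be sketched roughly as follows. This is a sequential Python illustration under stated assumptions, not the paper's design: the tile size stands in for a hardware constraint, the round-robin assignment is a hypothetical placeholder for the load-balancing scheduler, and the names `blocked_matmul`, `tile`, and `n_cores` are invented for the example.

```python
def blocked_matmul(a, b, tile, n_cores):
    """Multiply a (n x m) by b (m x p) tile by tile.

    Returns the product and the per-core task lists. The schedule is a
    simple round-robin split, assumed here only for illustration.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]

    # One task per output tile, identified by its top-left corner.
    tasks = [(i0, j0) for i0 in range(0, n, tile) for j0 in range(0, p, tile)]

    # Hypothetical scheduler: deal tasks out to cores in round-robin order.
    schedule = [tasks[core::n_cores] for core in range(n_cores)]

    # On real hardware each task list would run on its own core; here the
    # lists are processed one after another.
    for core_tasks in schedule:
        for i0, j0 in core_tasks:
            for k0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c, schedule
```

Output tiles do not overlap, so the tasks write to disjoint parts of the result and need no synchronization among themselves.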

7.

Fast Four‐Way Parallel Radix Sorting on GPUs

Linh Ha Jens Krüger Cláudio T. Silva 《Computer Graphics Forum》2009,28(8):2368-2378

Efficient sorting is a key requirement for many computer science algorithms. Acceleration of existing techniques as well as developing new sorting approaches is crucial for many real‐time graphics scenarios, database systems, and numerical simulations, to name just a few. It is one of the most fundamental operations to organize and filter the ever growing massive amounts of data gathered on a daily basis. While optimal sorting models for serial execution on a single processor exist, efficient parallel sorting remains a challenge. In this paper, we present a hardware‐optimized parallel implementation of the radix sort algorithm that results in a significant speedup over existing sorting implementations. We outperform all known Graphics Processing Unit (GPU) based sorting systems by about a factor of two and eliminate restrictions on the sorting key space. This makes our algorithm not only the fastest, but also the first general GPU sorting solution.
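For readers unfamiliar with radix sort, the following Python sketch shows a least-significant-digit radix sort that splits keys four ways per pass. It assumes that "four-way" means four buckets (2-bit digits) per pass; it is a sequential illustration of the general technique, not the authors' GPU implementation, and the function name and `key_bits` parameter are invented for the example.

```python
def radix_sort_4way(keys, key_bits=32):
    """Sort non-negative integers below 2**key_bits using 2-bit digits."""
    for shift in range(0, key_bits, 2):
        # Histogram: count how many keys fall into each of the 4 buckets.
        counts = [0, 0, 0, 0]
        for k in keys:
            counts[(k >> shift) & 3] += 1

        # Exclusive prefix sum gives each bucket's starting offset.
        offsets = [0, 0, 0, 0]
        for d in range(1, 4):
            offsets[d] = offsets[d - 1] + counts[d - 1]

        # Stable scatter: keys keep their relative order within a bucket.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & 3
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

The histogram, prefix sum, and scatter steps are the parts that parallel implementations typically distribute across threads.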

8.

Design and Implementation of a Practical Parallel Delaunay Algorithm (cited by: 1; self-citations: 0; citations by others: 1)

G. E. Blelloch G. L. Miller J. C. Hardwick D. Talmor 《Algorithmica》1999,24(3-4):243-269

This paper describes the design and implementation of a practical parallel algorithm for Delaunay triangulation that works well on general distributions. Although there have been many theoretical parallel algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well known reduction of 2D Delaunay triangulation to finding the 3D convex hull of points on a paraboloid. Based on this reduction we developed a variant of the Edelsbrunner and Shi 3D convex hull algorithm, specialized for the case when the point set lies on a paraboloid. This simplification reduces the work required by the algorithm (number of operations) from O(n log² n) to O(n log n). The depth (parallel time) is O(log³ n) on a CREW PRAM. The algorithm is simpler than previous O(n log n) work parallel algorithms, leading to smaller constants. Initial experiments using a variety of distributions showed that our parallel algorithm was within a factor of 2 in work from the best sequential algorithm. Based on these promising results, the algorithm was implemented using C and an MPI-based toolkit. Compared with previous work, the resulting implementation achieves significantly better speedups over good sequential code, does not assume a uniform distribution of points, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for the IBM SP2, Cray T3D, SGI Power Challenge, and DEC AlphaCluster. Received June 1, 1997; revised March 10, 1998.

9.

Implementing Parallel Arithmetic Coding on Android

王如亲常传文林明《计算机与数字工程》2013,41(9)

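With the emergence of large numbers of mobile Internet devices, open operating systems, represented by Google's Android, have come into wide use; at the same time, the arrival of multi-core processors has brought new challenges to software design. OpenMP is a widely used multi-core programming standard. However, for reasons such as lack of support in the build environment, OpenMP has not yet been applied on Android. This paper studies the application of the OpenMP standard on Android, with successful results. It gives two methods for using OpenMP on Android and, taking arithmetic coding as an example, presents a test program whose performance is measured on files from the widely used Canterbury test corpus, with good results.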

10.

Finding the k Shortest Paths in Parallel

E. Ruppert 《Algorithmica》2000,28(2):242-254

A concurrent-read exclusive-write PRAM algorithm is developed to find the k shortest paths between pairs of vertices in an edge-weighted directed graph. Repetitions of vertices along the paths are allowed. The algorithm computes an implicit representation of the k shortest paths to a given destination vertex from every vertex of a graph with n vertices and m edges, using O(m + nk log² k) work and O(log³ k log* k + log n (log log k + log* n)) time, assuming that a shortest path tree rooted at the destination is pre-computed. The paths themselves can be extracted from the implicit representation in O(log k + log n) time and O(n log n + L) work, where L is the total length of the output. Received July 2, 1997; revised June 18, 1998.

11.

Parallel Complexity of Matrix Multiplication

Santos Eunice E. 《The Journal of supercomputing》2003,25(2):155-175

Effective design of parallel matrix multiplication algorithms relies on the consideration of many interdependent issues based on the underlying parallel machine or network upon which such algorithms will be implemented, as well as the type of methodology utilized by an algorithm. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and/or networks. In other words, we provide an achievable parallel run-time that cannot be beaten by any algorithm (known or unknown) for solving this problem. In addition, any algorithm that claims to be optimal must attain this run-time. In order to obtain results that are general and useful throughout a span of machines, we base our results on the well-known LogP model. Furthermore, three important criteria must be considered in order to determine the running time of a parallel algorithm; namely, (i) local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We provide optimality results by first proving general lower bounds on parallel run-time. These lower bounds lead to significant insights on (i)–(iii) above. In particular, we present what types of data layouts and communication schedules are needed in order to obtain optimal run-times. We prove that no one data layout can achieve optimal running times for all cases. Instead, optimal layouts depend on the dimensions of each matrix, and on the number of processors. Lastly, optimal algorithms are provided.

12.

A Fast Recognition Algorithm for Moving Targets (cited by: 5; self-citations: 0; citations by others: 5)

吴炯张秀彬张峰门蓬涛孙志旻《微计算机信息》2004,20(3):27-28,4

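To address problems in existing techniques, this paper proposes a composite algorithm for fast recognition and tracking of multiple moving objects against a complex background. The algorithm determines the moving target regions by computing the difference between consecutive images, which removes interference from the complex background and markedly improves the speed and accuracy of target recognition. Experiments have demonstrated the effectiveness and practicality of the system. It can be widely applied in monitoring and recognition systems, as well as in fields such as unattended surveillance and unmanned autonomous operation. The algorithm described in this paper is currently a key technology in a major national project.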
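The frame-differencing step described in this abstract can be sketched in a few lines of Python. This is only a minimal illustration of the general idea, not the paper's composite algorithm: frames are assumed to be equal-sized grayscale images stored as nested lists, a single bounding box is returned rather than one per object, and the function name and `threshold` value are invented for the example.

```python
def moving_region(prev_frame, curr_frame, threshold=30):
    """Return (top, left, bottom, right) of the changed pixels, or None.

    A pixel counts as moving when its gray level differs between the two
    consecutive frames by more than `threshold`; static background pixels
    cancel out in the difference.
    """
    rows, cols = [], []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if abs(q - p) > threshold:
                rows.append(r)
                cols.append(c)
    if not rows:
        return None          # nothing moved between the two frames
    return (min(rows), min(cols), max(rows), max(cols))
```

Tracking several objects separately would additionally require grouping the changed pixels into connected regions, which this sketch leaves out.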

13.

Optimal Parallel Randomized Algorithms for the Voronoi Diagram of Line Segments in the Plane

Rajasekaran Ramaswami 《Algorithmica》2002,33(4):436-460

Abstract. We present an optimal parallel randomized algorithm for the Voronoi diagram of a set of n nonintersecting (except possibly at endpoints) line segments in the plane. Our algorithm runs in O(log n) time with high probability using O(n) processors on a CRCW PRAM. This algorithm is optimal in terms of work done since the sequential time bound for this problem is Ω(n log n). Our algorithm improves by an O(log n) factor the previously best known deterministic parallel algorithm, given by Goodrich, Ó'Dúnlaing, and Yap, which runs in O(log² n) time using O(n) processors. We obtain this result by using a new "two-stage" random sampling technique. By choosing large samples in the first stage of the algorithm, we avoid the hurdle of problem-size "blow-up" that is typical in recursive parallel geometric algorithms. We combine the two-stage sampling technique with efficient search and merge procedures to obtain an optimal algorithm. This technique gives an alternative optimal algorithm for the Voronoi diagram of points as well (all other optimal parallel algorithms for this problem use the transformation to three-dimensional half-space intersection).

14.

The memory behavior of cache oblivious stencil computations (cited by: 1; self-citations: 0; citations by others: 1)

Matteo Frigo Volker Strumpen 《The Journal of supercomputing》2007,39(2):93-112

We present and evaluate a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an “ideal cache” of size Z, our algorithm saves a factor of Θ(Z^(1/n)) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy. We evaluate our algorithm in terms of the number of cache misses, and demonstrate that the memory behavior agrees with our theoretical predictions. Our experimental evaluation is based on a finite-difference solution of a heat diffusion problem, as well as a Gauss-Seidel iteration and a 2-dimensional LBMHD program, both reformulated as cache oblivious stencil computations. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

15.

Research on Algorithms for Multiplying High-Order Matrices

杨永娟蒋群《数字社区&智能家居》2007,3(16):1080

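By comparing serial and parallel algorithms for multiplying high-order matrices, and in particular by examining parallel algorithms based on MPI, this paper concludes that multiplying high-order matrices in an MPI environment is feasible, simple, and necessary.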

16.

The Complexity of Parallel Multisearch on Coarse-Grained Machines

A. Bäumker W. Dittrich A. Pietracaprina 《Algorithmica》1999,24(3-4):209-242

Given m ordered segments that form a partition of some universe (e.g., a two-dimensional strip), the multisearch problem consists of determining, for a set of n query points in the universe, the segments they belong to. We present the first nontrivial parallel deterministic scheme for performing multisearch on a distributed-memory machine when m=ω(n). The scheme is designed on the BSP* model of parallel computation, a variant of Valiant's BSP which rewards blockwise communication, and relies on a suitable redundant representation of the segments. The time needed to answer the queries is analyzed as a function of the redundancy and of the BSP* parameters. We show that optimal performance can be obtained using logarithmic redundancy. We also prove a lower bound on the communication requirements of any deterministic multisearch scheme realized on a distributed-memory machine. The lower bound exhibits a tradeoff between the redundancy used to represent the segments and the performance of the scheme. Received June 1, 1997; revised March 10, 1998.

17.

Parallel computation and conflicts in memory access

Luděk Kučera 《Information Processing Letters》1982,14(2):93-96


18.

Fast and Robust Approximation of Smallest Enclosing Balls in Arbitrary Dimensions

Thomas Larsson Linus Källberg 《Computer Graphics Forum》2013,32(5):93-101

In this paper, an algorithm is introduced that computes an arbitrarily fine approximation of the smallest enclosing ball of a point set in any dimension. This operation is important in, for example, classification, clustering, and data mining. The algorithm is very simple to implement, gives reliable results, and gracefully handles large problem instances in low and high dimensions, as confirmed by both theoretical arguments and empirical evaluation. For example, using a CPU with eight cores, it takes less than two seconds to compute a 1.001‐approximation of the smallest enclosing ball of one million points uniformly distributed in a hypercube in dimension 200. Furthermore, the presented approach extends to a more general class of input objects, such as ball sets.

19.

Fast leader election in anonymous rings with bounded expected delay

Rena Bakhshi Jörg Endrullis 《Information Processing Letters》2011,111(17):864-870

We propose a probabilistic network model, called asynchronous bounded expected delay (ABE), which requires a known bound on the expected message delay. In ABE networks all asynchronous executions are possible, but executions with extremely long delays are less probable. Thus, the ABE model captures asynchrony that occurs in sensor networks and ad-hoc networks. Using an election algorithm as an example, we show that the minimal assumptions of ABE networks are sufficient for the development of efficient algorithms. For anonymous, unidirectional ABE rings of known size n we devise a probabilistic election algorithm having average message and time complexity O(n).

20.

System Design of the PD-100 Parallel Simulation Computer

洪远麟康继昌龙卫红赵光飞肖骊《微机发展》1991,(2)

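PD-100 is a homogeneous parallel computer based on the Transputer, intended for real-time simulation. It uses current VLSI devices, which makes high performance at low cost possible. However, this also makes programming difficult. We therefore developed a parallel simulation language oriented to state equations, so that users can write parallel simulation programs easily.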