1.

Distributed computation of the knn graph for large high-dimensional point sets

Erion Plaku Lydia E. Kavraki 《Journal of Parallel and Distributed Computing》2007

High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over 100 processors and indicate that similar speedup can be obtained with several hundred processors.
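
The definition of the knn graph given in this abstract can be illustrated with a minimal serial sketch. This is the brute-force O(n²) baseline on one machine, not the distributed algorithm of the paper; the function name `knn_graph` and the sample points are assumptions made for illustration, and any distance function can be passed in place of the Euclidean default.

```python
import heapq
import math


def knn_graph(points, k, dist=math.dist):
    """Brute-force knn graph: map each point index to the indices of its k closest points.

    `dist` may be any distance function on two points (Euclidean by default).
    """
    graph = {}
    for i, p in enumerate(points):
        # Pair every other point's distance with its index, then keep the k smallest.
        candidates = ((dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in heapq.nsmallest(k, candidates)]
    return graph


# Hypothetical sample data: four points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, k=2)
```

Every point is compared against every other point, which is the cost that motivates spreading the work across many processors for large data sets.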

2.

A linear time algorithm for max-min length triangulation of a convex polygon

Shiyan Hu 《Information Processing Letters》2007,101(5):203-208

We consider the following planar max-min length triangulation problem: given a set of n points in the Euclidean plane, find a triangulation such that the length of the shortest edge in the triangulation is maximized. In this paper, a linear time algorithm is proposed for computing the max-min length triangulation of a set of points in convex position. In addition, an O(n log n) time algorithm is proposed for computing the max-min length k-set triangulation of a set of points in convex position, where we are to compute a set of k vertices such that the max-min length triangulation on them is minimized over all possible k-sets. We further show that the graph version of max-min length triangulation is NP-complete, and that some common heuristics, such as the greedy algorithm, are in general not able to give a bounded-ratio approximation to the max-min length triangulation.

3.

SAC—A Functional Array Language for Efficient Multi-threaded Execution

Clemens Grelck Sven-Bodo Scholz 《International journal of parallel programming》2006,34(4):383-427

We give an in-depth introduction to the design of our functional array programming language SaC, the main aspects of its compilation into host machine code, and its parallelisation based on multi-threading. The language design of SaC aims at combining high-level, compositional array programming with fully automatic resource management for highly productive code development and maintenance. We outline the compilation process that maps SaC programs to computing machinery. Here, our focus is on optimisation techniques that aim at restructuring entire applications from nested compositions of general fine-grained operations into specialised coarse-grained operations. We present our implicit parallelisation technology for shared memory architectures based on multi-threading and discuss further optimisation opportunities on this level of code generation. Both optimisation and parallelisation rigorously exploit the absence of side-effects and the explicit data flow characteristic of a functional setting.

4.

Automatic generation of message-passing parallel programs

Zhang Ping, Li Qingbao, Zhao Rongcai 《Computer Engineering and Applications》2007,43(8):74-77

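Parallelization for distributed-memory architectures transforms a serial program into an SPMD parallel program that runs on every processing node; each node program contains the computation executed by that node and the communication operations that exchange information with other nodes. This paper discusses an algorithm for generating message-passing parallel programs for distributed-memory architectures when the data decomposition and computation partition are already known. Building on the basic linear-inequality framework proposed by Lam, it makes effective improvements to the Paraguin work: first, data distribution is introduced into the code generation algorithm; second, the processor space is extended from one dimension to multiple dimensions; third, the mapping from virtual processors to physical processors is introduced into the code generation algorithm, which reduces the amount of inter-node communication and improves the performance of the generated parallel code.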

5.

Line-segment intersection reporting in parallel

Christine Rüb 《Algorithmica》1992,8(1-6):119-144

In this paper we give a parallel algorithm for line-segment intersection reporting in the plane. It runs in time O(((n + k) log n log log n)/p) using p processors on a concurrent-read-exclusive-write (CREW)-PRAM, where n is the number of line segments, k is the number of intersections, and p ≤ n + k.

6.

ERI sorting for emerging processor architectures

Tirath Ramdas Gregory K. Egan 《Computer Physics Communications》2009,180(8):1221-1229

Electron Repulsion Integrals (ERIs) are a common bottleneck in ab initio computational chemistry. It is known that sorted/reordered execution of ERIs results in efficient SIMD/vector processing. This paper shows that reconfigurable computing and heterogeneous processor architectures can also benefit from a deliberate ordering of ERI tasks. However, realizing these benefits as net speedup requires a very rapid sorting mechanism. This paper presents two such mechanisms. Included in this study are analytical, simulation-based, and experimental benchmarking approaches to consider five use cases for ERI sorting, i.e. SIMD processing, reconfigurable computing, limited address spaces, instruction cache exploitation, and data cache exploitation. Specific consideration is given to existing cache-based processors, FPGAs, and the Cell Broadband Engine processor. It is proposed that the analyses conducted in this work should be built upon to aid the development of software autotuners which will produce efficient ab initio computational chemistry codes for a variety of computer architectures.

7.

Generating Local Addresses and Communication Sets for Data-Parallel Programs

《Journal of Parallel and Distributed Computing》1995,26(1):72-84

Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We demonstrate a storage scheme for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution that does not waste any storage, and show that, under this storage scheme, the local memory access sequence of any processor for a computation involving the regular section A(ℓ:h:s) is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and we extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.
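
To make the cyclic(k) distribution concrete, the following sketch maps a global array index to its owning processor and local address, then lists the local addresses one processor touches for a regular section. It assumes 0-based indices and an identity alignment of the array to the template (the paper handles general affine alignments), and it enumerates the section by brute force instead of building the finite state machine described in the abstract. The function names are hypothetical.

```python
def cyclic_k_owner(g, k, p):
    """Map global index g to (owner, local address) under a cyclic(k) distribution.

    Blocks of k consecutive elements are dealt round-robin to p processors.
    Assumes 0-based indexing and identity alignment (an illustrative simplification).
    """
    block = g // k                        # which block of size k contains g
    owner = block % p                     # blocks are dealt round-robin
    local = (block // p) * k + (g % k)    # dense local storage, no wasted slots
    return owner, local


def local_sequence(lo, hi, step, k, p, proc):
    """Local addresses accessed by processor `proc` for the section A(lo:hi:step)."""
    sequence = []
    for g in range(lo, hi + 1, step):     # hi is inclusive, as in Fortran sections
        owner, local = cyclic_k_owner(g, k, p)
        if owner == proc:
            sequence.append(local)
    return sequence


# Section A(0:35:5) with k = 4 and p = 3, as seen from processor 0.
accesses = local_sequence(0, 35, 5, k=4, p=3, proc=0)
```

The brute-force loop visits every element of the section on every processor; avoiding that enumeration is the motivation for characterizing the access sequence with a small state machine.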

8.

Parallel use of multiplicative congruential random number generators

Pei-Chi Wu 《Computer Physics Communications》2006,175(1):25-29

On parallel processors or in distributed computing environments, generating and sharing one stream of random numbers for all parallel processing elements is usually impractical. A more attractive method is to allow each processing element to generate random numbers independently. This paper investigates parallel use of multiplicative congruential generators. We analyze the leapfrog, the regular spacing, and the random spacing methods. Our results show: (1) The leapfrog method can result in multipliers of low spectral values. (2) In the random spacing method, the minimal distance between n substreams is only 1/n² of the cycle length on average. (3) The regular spacing method can result in strong correlation between substreams if the starting points α^j x₀ are poorly selected. We then suggest selecting multiplier a and factor α based on their k-dimensional spectral values and the minimal distance between substreams of these generators.
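
The leapfrog method mentioned above can be sketched as follows. The sketch assumes a multiplicative congruential generator x ← a·x mod m; the parameters a = 16807 and m = 2³¹ − 1 are the well-known "minimal standard" choice, used here only as an example and not taken from the paper. The function names are hypothetical.

```python
A = 16807          # example multiplier ("minimal standard" generator)
M = 2**31 - 1      # example modulus


def mcg(a, m, x0, count):
    """First `count` outputs x_1, x_2, ... of the generator x <- a * x mod m."""
    x, out = x0, []
    for _ in range(count):
        x = (a * x) % m
        out.append(x)
    return out


def leapfrog_stream(a, m, x0, n, j, count):
    """Substream j (0-based) of n: every n-th element of the base sequence.

    Each substream is itself a multiplicative congruential generator whose
    effective multiplier is a^n mod m.
    """
    stride = pow(a, n, m)                 # effective multiplier of the substream
    x = (pow(a, j + 1, m) * x0) % m       # first element owned by substream j
    out = []
    for _ in range(count):
        out.append(x)
        x = (stride * x) % m
    return out


base = mcg(A, M, 1, 12)
streams = [leapfrog_stream(A, M, 1, 3, j, 4) for j in range(3)]
```

Because each substream runs with multiplier a^n mod m, its quality depends on that derived multiplier and not on a itself, which is why the abstract warns that leapfrogging can yield multipliers of low spectral values.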

9.

Parallel computational geometry of rectangles

Sharat Chandran Sung Kwon Kim David M. Mount 《Algorithmica》1992,7(1-6):25-49

Rectangles in a plane provide a very useful abstraction for a number of problems in diverse fields. In this paper we consider the problem of computing geometric properties of a set of rectangles in the plane. We give parallel algorithms for a number of problems using n processors, where n is the number of upright rectangles. Specifically, we present algorithms for computing the area, perimeter, eccentricity, and moment of inertia of the region covered by the rectangles in O(log n) time. We also present algorithms for computing the maximum clique and connected components of the rectangles in O(log n) time. Finally, we give algorithms for finding the entire contour of the rectangles and the medial axis representation of a given n × n binary image in O(n) time. Our results are faster than previous results and optimal (to within a constant factor).

10.

A (4n − 9)/3 diagnosis algorithm on n-dimensional cube network

Xiaofan Yang Yuan Yan Tang 《Information Sciences》2007,177(8):1771-1781

As a generalization of the precise and pessimistic diagnosis strategies of system-level diagnosis of multicomputers, the t/k diagnosis strategy can significantly improve the self-diagnosing capability of a system at the expense of no more than k fault-free processors (nodes) being mistakenly diagnosed as faulty. In the case k ≥ 2, to our knowledge, there is no known t/k diagnosis algorithm for a general diagnosable system or for any specific system. The hypercube is a popular topology for interconnecting processors of multicomputers. It is known that an n-dimensional cube is (4n − 9)/3-diagnosable. This paper addresses the (4n − 9)/3 diagnosis of the n-dimensional cube. By exploring the relationship between a largest connected component of the 0-test subgraph of a faulty hypercube and the distribution of the faulty nodes over the network, the fault diagnosis of an n-dimensional cube can be reduced to those of two constituent (n − 1)-dimensional cubes. On this basis, a diagnosis algorithm is presented. Given that there are no more than 4n − 9 faulty nodes, this algorithm can isolate all faulty nodes to within a set in which at most three nodes are fault-free. The proposed algorithm can operate in O(N log₂ N) time, where N = 2ⁿ is the total number of nodes of the hypercube. The work of this paper provides insight into developing efficient t/k diagnosis algorithms for larger values of k and for other types of interconnection networks.

11.

Parallel algorithms for some functions of two convex polygons

Mikhail J. Atallah Michael T. Goodrich 《Algorithmica》1988,3(1-4):535-548

Let P and Q be two convex, n-vertex polygons. We consider the problem of computing, in parallel, some functions of P and Q when P and Q are disjoint. The model of parallel computation we consider is the CREW-PRAM, i.e., it is the synchronous shared-memory model where concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location (even if they are trying to write the same thing). We show that a CREW-PRAM having n^(1/k) processors can compute the following functions in O(k^(1+ε)) time: (i) the common tangents between P and Q, and (ii) the distance between P and Q (and hence a straight line separating them). The positive constant ε can be made arbitrarily close to zero. Even with a linear number of processors, it was not previously known how to achieve constant time performance for computing these functions. The algorithm for problem (ii) is easily modified to detect the case of zero distance as well.

12.

Solving the generalized eigenvalue problem on a synchronous linear processor array

《Parallel Computing》1986,3(2):153-166

We present a parallel method to solve the generalized eigenvalue problem on a linear array of processors, each connected to its nearest neighbors and operating synchronously. We also include a wrap-around connection from end to end. Our method is based on the well-known QZ algorithm of Moler and Stewart, which simultaneously reduces two n × n matrices to upper triangular form by orthogonal or unitary transformations. We show how this algorithm may be partitioned and distributed over n + 1 processors, achieving a speed-up over the serial algorithm of O(n). We use the concept of windows to describe the action of each processor at each step. We show how to incorporate single shifts, and how to apply orthogonal plane rotations on either side of a matrix without the need to transpose the matrix itself.

13.

Upper Bounds on Number of Steals in Rooted Trees

Charles E. Leiserson Tao B. Schardl Warut Suksompong 《Theory of Computing Systems》2016,58(2):223-240

Inspired by applications in parallel computing, we analyze the setting of work stealing in multithreaded computations. We obtain tight upper bounds on the number of steals when the computation can be modeled by rooted trees. In particular, we show that if the computation with n processors starts with one processor having a complete k-ary tree of height h (and the remaining n − 1 processors having nothing), the maximum possible number of steals is \(\sum_{i=1}^{n}(k-1)^{i}\binom{h}{i}\).
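
As a small worked evaluation of the bound stated in this abstract, the sketch below computes the sum for given n, k, and h. Only the formula comes from the abstract; the function name `max_steals` and the sample values are assumptions for illustration.

```python
from math import comb


def max_steals(n, k, h):
    """Evaluate sum_{i=1}^{n} (k-1)^i * C(h, i), the bound quoted in the abstract."""
    # comb(h, i) is 0 for i > h, so terms beyond i = h contribute nothing.
    return sum((k - 1) ** i * comb(h, i) for i in range(1, n + 1))


small = max_steals(n=2, k=3, h=2)    # 2*2 + 4*1
capped = max_steals(n=5, k=2, h=3)   # 3 + 3 + 1, terms with i > 3 vanish
```

When n ≥ h the sum runs over every nonzero binomial term, so by the binomial theorem it equals k^h − 1; adding more processors beyond h no longer increases the bound.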

14.

A strong direct product theorem for quantum query complexity

Troy Lee Jérémie Roland 《Computational Complexity》2013,22(2):429-462

We show that quantum query complexity satisfies a strong direct product theorem. This means that computing k copies of a function with fewer than k times the quantum queries needed to compute one copy of the function implies that the overall success probability will be exponentially small in k. For a boolean function f, we also show an XOR lemma: computing the parity of k copies of f with fewer than k times the queries needed for one copy implies that the advantage over random guessing will be exponentially small. We do this by showing that the multiplicative adversary method, which inherently satisfies a strong direct product theorem, characterizes bounded-error quantum query complexity. In particular, we show that the multiplicative adversary bound is always at least as large as the additive adversary bound, which is known to characterize bounded-error quantum query complexity.

15.

Communication-Efficient Sorting Algorithms on Reconfigurable Array of Processors With Slotted Optical Buses

《Journal of Parallel and Distributed Computing》1999,57(2):166-187

The reconfigurable array with slotted optical buses (RASOB) has recently received a lot of attention from the research community. In this paper, we first discuss the reconfiguration methods and communication capabilities of the RASOB architecture. Then, we use this architecture for the implementation of efficient sorting algorithms on the 1D RASOB and the 2D RASOB. Our parallel sorting algorithm on the 1D RASOB is based on an efficient divide-and-conquer scheme. It sorts N data items using N processors in O(k) communication cycles, where k is the size of the data items to be sorted in bits. We further develop a parallel sorting algorithm on the 2D RASOB based on the sorting algorithm on the 1D RASOB in conjunction with the well-known Rotatesort algorithm. Similarly, this algorithm sorts N data items on a 2D RASOB of size N in O(k) communication cycles. These sorting algorithms are much more efficient than state-of-the-art sorting algorithms on reconfigurable arrays of processors with electronic buses using the same number of processors.

16.

Some problems in distributed computational geometry

Sergio Rajsbaum Jorge Urrutia 《Theoretical computer science》2011,412(41):5760-5770

In a planar geometric network vertices are located in the plane, and edges are straight line segments connecting pairs of vertices, such that no two of them intersect. In this paper we study distributed computing in asynchronous, failure-free planar geometric networks, where each vertex is associated to a processor, and each edge to a bidirectional message communication link. Processors are aware of their locations in the plane. We consider fundamental computational geometry problems from the distributed computing point of view, such as finding the convex hull of a geometric network and identification of the external face. We also study the classic distributed computing problem of leader election, to understand the impact that geometric information has on the message complexity of solving it. We obtain an O(n log² n) message complexity algorithm to find the convex hull, and an O(n log n) message complexity algorithm to identify the external face of a geometric network of n processors. We present a matching lower bound for the external face problem. We prove that the message complexity of leader election in a geometric ring is Ω(n log n), hence geometric information does not help in reducing the message complexity of this problem.

17.

The complexity of preemptive scheduling given interprocessor communication delays

《Information Processing Letters》1987,25(2):123-125

We discuss the problem of scheduling a set of independent tasks T, each t_i ∈ T of length ℓ_i ∈ Z⁺, on m identical processors. We allow preemption but assume a communication delay of time k ∈ N. Whenever a task is preempted from one processor to another, there must be a delay of at least k time units. We show that if k = 1, an optimal schedule can be found in polynomial time but if k ⩾ 2, the corresponding decision problem is NP-complete.

18.

A Parallel Algorithm for the Visibility of a Simple Polygon Using Scan Operations

《CVGIP: Graphical Models and Image Processing》1993,55(3):192-202

This paper describes a parallel algorithm for computing the visible portion of a simple planar polygon with N vertices from a given point on or inside the polygon. The algorithm accomplishes this in O(k log N) time using O(N/log N) processors, where k is the link diameter of the polygon in consideration. The link diameter of a polygon is the maximum number of straight line segments needed to connect any two points within the polygon, where all line segments lie completely within the polygon. The algorithm can also be used to compute the visible portion of the plane given a point outside of the polygon. In this case, however, the parameter k in the asymptotic bounds would be the link diameter of a different polygon. The algorithm is optimal for sets of polygons that have a constant link diameter. It is a rather simple algorithm, and has a very small run time constant, making it fast and practical to implement. The interprocessor communication needed involves only local neighbor communication and scan operations (i.e., parallel prefix operations). Thus the algorithm can be implemented not only on an EREW PRAM, but also on a variety of other more practical machine architectures, such as hypercubes, trees, butterflies, and shuffle exchange networks. The algorithm was implemented on the Connection Machine as well as the MasPar MP-1, and various performance tests were conducted.
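
The scan (parallel prefix) operation that this abstract relies on can be sketched as below. The sketch simulates the doubling scheme usually attributed to Hillis and Steele on a single machine; that scheme is swapped in here for illustration and is not necessarily the variant used in the paper. The function name `inclusive_scan` is hypothetical.

```python
from operator import add


def inclusive_scan(values, op=add):
    """Inclusive prefix scan by repeated doubling.

    Each pass of the while loop stands for one parallel step in which every
    position i >= offset combines with position i - offset at the same time.
    About log2(N) steps are needed for N values; `op` must be associative.
    """
    result = list(values)
    offset = 1
    while offset < len(result):
        # Build a new list so all positions read the previous step's values.
        result = [
            result[i] if i < offset else op(result[i - offset], result[i])
            for i in range(len(result))
        ]
        offset *= 2
    return result


prefix_sums = inclusive_scan([3, 1, 4, 1, 5])
running_max = inclusive_scan([2, 7, 1, 8, 2], op=max)
```

The doubling scheme performs O(N log N) total operations; work-efficient scans reduce this to O(N), which matters when only O(N/log N) processors are available, as in the bounds quoted above.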

19.

An improved image analogy method based on adaptive CUDA-accelerated neighborhood matching framework

Ying Tang Xiaoying Shi Tingzhe Xiao Jing Fan 《The Visual computer》2012,28(6-8):743-753

The image analogy framework is especially useful to synthesize appealing images for non-homogeneous input and gives users creative control over the synthesized results. However, the traditional framework did not adapt its search strategy to the differing textural content of neighborhoods. Besides, the synthesis speed is slow due to the intensive computation involved in neighborhood matching. In this paper we present a CUDA-based neighborhood matching algorithm for image analogy. Our algorithm adaptively applies the global search of the exact L₂ nearest neighbor and k-coherence search strategies during synthesis according to different textural features of images, which is especially useful for non-homogeneous textures. To consistently implement the above two search strategies on the GPU, we adopt the fast k nearest neighbor searching algorithm based on CUDA. Such an acceleration greatly reduces the time of the pre-process of k-coherence search and the synthesis procedure of the global search, which makes possible the adjustment of important synthesis parameters. We further adopt synthesis magnification to get the final high-resolution synthesis image for running efficiency. Experimental results show that our algorithm is suitable for various applications of the image analogy framework and takes full advantage of the GPU's parallel processing capability to improve synthesis speed and get satisfactory synthesis results.

20.

On Spectral Bounds for the k-Partitioning of Graphs

Robert Elsässer Thomas Lücking Burkhard Monien 《Theory of Computing Systems》2003,36(5):461-478

When executing processes on parallel computer systems a major bottleneck is interprocessor communication. One way to address this problem is to minimize the communication between processes that are mapped to different processors. This translates to the k-partitioning problem of the corresponding process graph, where k is the number of processors. The classical spectral lower bound of \(\frac{|V|}{2k}\sum_{i=1}^{k}\lambda_i\) for the k-section width of a graph is well known. We show new relations between the structure and the eigenvalues of a graph and present a new method to get tighter lower bounds on the k-section width. This method makes use of the level structure defined by the k-section. We define a global expansion property and prove that for graphs with the same k-section width the spectral lower bound increases with this global expansion. We also present examples of graphs for which our new bounds are tight up to a constant factor.