共查询到20条相似文献,搜索用时 15 毫秒
1.
An efficient implementation of parallel eigenvalue computation for massively parallel processing 总被引:4,自引:0,他引:4
This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when problem size is relatively small, e.g., the order of 10,000. This is very important for relevant practical applications, where many diagonalizations for such matrices are required so often. The routine was evaluated on the 1024 processors HITACHI SR2201, and giving speedup ratios of about 2–5 times as compared to the ScaLAPACK library on 1024 processors of the HITACHI SR2201. 相似文献
2.
Taoufik Saidani Mohamed Atri Lazhar Khriji Rached Tourki 《Journal of Real-Time Image Processing》2016,11(1):63-74
With the augmentation in multimedia technology, demand for high-speed real-time image compression systems has also increased. JPEG 2000 still image compression standard is developed to accommodate such application requirements. Embedded block coding with optimal truncation (EBCOT) is an essential and computationally very demanding part of the compression process of JPEG 2000 image compression standard. Various applications, such as satellite imagery, medical imaging, digital cinema, and others, require high speed and performance EBCOT architecture. In JPEG 2000 standard, the context formation block of EBCOT tier-1 contains high complexity computation and also becomes the bottleneck in this system. In this paper, we propose a fast and efficient VLSI hardware architecture design of context formation for EBCOT tier-1. A high-speed parallel bit-plane coding (BPC) hardware architecture for the EBCOT module in JPEG 2000 is proposed and implemented. Experimental results show that our design outperforms well-known techniques with respect to the processing time. It can reach 70 % reduction when compared to bit plane sequential processing. 相似文献
3.
This paper focuses on the implementation and the performance analysis of a smooth particle mesh Ewald method on several parallel computers. We present the details of the algorithms and our implementation that are used to optimize parallel efficiency on such parallel computers. 相似文献
4.
We consider a system of Maxwell’s and Landau-Lifshitz-Gilbert equations describing magnetization dynamics in micromagnetism. The problem is discretized by a convergent, unconditionally stable finite element method. A multigrid preconditioned Uzawa type method for the solution of the algebraic system resulting from the discretized Maxwell’s equations is constructed. The efficiency of the method is demonstrated on numerical experiments and the results are compared to those obtained by simplified models. 相似文献
5.
Min—Min is a popular heuristic for scheduling tasks to heterogeneous computational resources, which has been applied either directly or as part of more sophisticated heuristics. However, for large scenarios such as grid computing platforms, the time complexity of a straightforward implementation of Min—Min, which is quadratic in the number of tasks, may be prohibitive. This has motivated the development of high performance computing (HPC) implementations, and the use of simpler heuristics for the sake of acceptable execution times. We propose a simple algorithm that implements Min—Min requiring only O(mn) operations for scheduling n tasks on m machines. Our experiments show, in practice, that a straightforward sequential implementation of this algorithm significantly outperforms other state of the art implementations of Min—Min, even compared to HPC implementations. In addition, the proposed algorithm is at least as suitable for parallelization as a direct implementation of Min—Min. 相似文献
6.
San Juan Sebastián Pablo Virtanen Tuomas Garcia-Molla Victor M. Vidal Antonio M. 《The Journal of supercomputing》2019,75(3):1298-1309
The Journal of Supercomputing - This paper presents an analysis of an efficient parallel implementation of the active-set Newton algorithm (ASNA), which is used to estimate the nonnegative weights... 相似文献
7.
《International Journal of Parallel, Emergent and Distributed Systems》2012,27(4):293-301
In this paper, we present a new, easy to implement algorithm for detecting the termination of a parallel asynchronous computation on distributed-memory MIMD computers. We demonstrate that it operates concurrently with the main computation, adding minimal overhead, and we prove that it correctly detects termination when it occurs. Experimental results confirm that the termination detection routine imposes an overhead smaller than the experimental uncertainty. 相似文献
8.
G. Of 《Computing》2008,82(2-3):139-155
Fast boundary element methods still need good preconditioning techniques for an almost optimal complexity. An algebraic multigrid method is presented for the single layer potential using the fast multipole method. The coarsening is based on the cluster structure of the fast multipole method. The effort for the construction of the nearfield part of the coarse grid matrices and for an application of the multigrid preconditioner is of the same almost optimal order as the fast multipole method itself. 相似文献
9.
Jakub ístek Bedich Sousedík Pavel Burda Jan Mandel Jaroslav Novotný 《Computers & Fluids》2011,46(1):429-435
A parallel implementation of the Balancing Domain Decomposition by Constraints (BDDC) method is described. It is based on formulation of BDDC with global matrices without explicit coarse problem. The implementation is based on the MUMPS parallel solver for computing the approximate inverse used for preconditioning. It is successfully applied to several problems of Stokes flow discretized by Taylor–Hood finite elements and BDDC is shown to be a promising method also for this class of problems. 相似文献
10.
11.
A new internal array structure, called a double-array, implementing a trie structure is presented. The double-array combines the fast access of a matrix form with the compactness of a list form. The algorithms for retrieval, insertion and deletion are introduced through examples. Although insertion is rather slow, it is still practical, and both the deletion and the retrieval time can be improved from the list form. From the comparison with the list for various large sets of keys, it is shown that the size of the double-array can be about 17 per cent smaller than that of the list, and that the retrieval speed of the double-array can be from 3–1 to 5–1 times faster than that of the list. 相似文献
12.
A block preconditioner is considered in a parallel computing environment. This preconditioner has good parallel properties, however, the convergence deteriorates when the number of blocks increases. Two different techniques are studied to accelerate the convergence: overlapping at the interfaces and using a coarse grid correction. It appears that the latter technique is indeed scalable, so the wall clock time is constant when the number of blocks increases. Furthermore, the method is easily added to an existing solution code. 相似文献
13.
In this paper we propose an efficient algorithm to implement parallel integer multiplication by a combination of parallel additions, shifts and reads from a memory-resident lookup table dedicated to squares. Such an operator called PIM (parallel integer multiplication) is in fact microprogrammed at the PROM level. Our theoretical approach is included within the framework of time and space parallel complexity theory. The mathematical relation used defines this multiplication operator in terms of a difference of two quadratic expressions, each being computed in parallel by one addition and one shift. This leads to the CPU time for any pair of numbers being constant. Our contribution is above all of practical interest on any massively parallel architecture in the field of scientific and numerical computing. 相似文献
14.
Sun-yuan Hsieh 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(9):985-993
In the literature, there are quite a few sequential and parallel algorithms for solving problems on distance-hereditary graphs. With an n-vertex and m-edge distance-hereditary graph G, we show that the efficient domination problem on G can be solved in O(log/sup 2/ n) time using O(n + m) processors on a CREW PRAM. Moreover, if a binary tree representation of G is given, the problem can be optimally solved in O(log n) time using O(n/log n) processors on an EREW PRAM. 相似文献
15.
We present a parallel recognition algorithm for bipartite-permutation graphs. The algorithm can be executed in O(log n) time on the CRCW PRAM if O(n3/log n) processors are used, or O(log2 n) time on the CREW PRAM if O(n3/log2 n) processors are used. Chen and Yesha (1993) have presented another CRCW PRAM algorithm that takes O(log2n) time if O(n 3) processors are used. Compared with Chen and Yesha's algorithm, our algorithm requires either less time and fewer processors on the same machine model, or fewer processors on a weaker machine model. Our algorithm can also be applied to determine if two bipartite-permutation graphs are isomorphic 相似文献
16.
17.
We present an efficient parallel algorithm for building the separating tree for a separable permutation. Our algorithm runs in O(log2n) time using O(nlog1.5n) operations on the CREW PRAM and O(log2n) time using O(nlognloglogn) operations on the COMMON CRCW PRAM. 相似文献
18.
The specification of data structure in higher-level languages is isolated from the related specifications of data allocation and binding of names. Structure specification is claimed to be the definition of the accessing (addressing) function for items having the structure. Conventional techniques for data structure isolation in higher-level languages are examined and are found to suffer from a lack of clarity and efficiency. The means by which data structure accessors may be defined in Bliss, the specification of their association with named, allocated storage, and their automatic invocation by reference to the named storage only, are discussed. An example is presented which illustrates their efficient implementation and their utility for separating the activities, of data structure programming and algorithmic programming.This work was supported by the Advanced Research Projects Agency of the Office of the Secretary of Defense (F44620-70-C-0107) and is monitored by the Air Force Office of Scientific Research. 相似文献
19.
The Journal of Supercomputing - The prefix computation strategy is a fundamental technique used to solve many problems in computer science such as sorting, clustering, and computer vision. A large... 相似文献