期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An efficient implementation of parallel eigenvalue computation for massively parallel processing 总被引：4，自引：0，他引：4

Takahiro Katagiri Yasumasa Kanada 《Parallel Computing》2001,27(14):1831-1845

This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when problem size is relatively small, e.g., the order of 10,000. This is very important for relevant practical applications, where many diagonalizations for such matrices are required so often. The routine was evaluated on the 1024 processors HITACHI SR2201, and giving speedup ratios of about 2–5 times as compared to the ScaLAPACK library on 1024 processors of the HITACHI SR2201. 相似文献

2.

An efficient hardware implementation of parallel EBCOT algorithm for JPEG 2000

Taoufik Saidani Mohamed Atri Lazhar Khriji Rached Tourki 《Journal of Real-Time Image Processing》2016,11(1):63-74

With the augmentation in multimedia technology, demand for high-speed real-time image compression systems has also increased. JPEG 2000 still image compression standard is developed to accommodate such application requirements. Embedded block coding with optimal truncation (EBCOT) is an essential and computationally very demanding part of the compression process of JPEG 2000 image compression standard. Various applications, such as satellite imagery, medical imaging, digital cinema, and others, require high speed and performance EBCOT architecture. In JPEG 2000 standard, the context formation block of EBCOT tier-1 contains high complexity computation and also becomes the bottleneck in this system. In this paper, we propose a fast and efficient VLSI hardware architecture design of context formation for EBCOT tier-1. A high-speed parallel bit-plane coding (BPC) hardware architecture for the EBCOT module in JPEG 2000 is proposed and implemented. Experimental results show that our design outperforms well-known techniques with respect to the processing time. It can reach 70 % reduction when compared to bit plane sequential processing. 相似文献

3.

An efficient parallel implementation of the smooth particle mesh Ewald method for molecular dynamics simulations

Kwang Jin Oh Yuefan Deng 《Computer Physics Communications》2007,177(5):426-431

This paper focuses on the implementation and the performance analysis of a smooth particle mesh Ewald method on several parallel computers. We present the details of the algorithms and our implementation that are used to optimize parallel efficiency on such parallel computers. 相似文献

4.

An efficient multigrid preconditioner for Maxwell’s equations in micromagnetism

Ľubomír Baňas 《Mathematics and computers in simulation》2010

We consider a system of Maxwell’s and Landau-Lifshitz-Gilbert equations describing magnetization dynamics in micromagnetism. The problem is discretized by a convergent, unconditionally stable finite element method. A multigrid preconditioned Uzawa type method for the solution of the algebraic system resulting from the discretized Maxwell’s equations is constructed. The efficiency of the method is demonstrated on numerical experiments and the results are compared to those obtained by simplified models. 相似文献

5.

An efficient implementation of the Min-Min heuristic

Pablo Ezzatti Martín Pedemonte Álvaro Martín 《Computers & Operations Research》2013

Min—Min is a popular heuristic for scheduling tasks to heterogeneous computational resources, which has been applied either directly or as part of more sophisticated heuristics. However, for large scenarios such as grid computing platforms, the time complexity of a straightforward implementation of Min—Min, which is quadratic in the number of tasks, may be prohibitive. This has motivated the development of high performance computing (HPC) implementations, and the use of simpler heuristics for the sake of acceptable execution times. We propose a simple algorithm that implements Min—Min requiring only O(mn) operations for scheduling n tasks on m machines. Our experiments show, in practice, that a straightforward sequential implementation of this algorithm significantly outperforms other state of the art implementations of Min—Min, even compared to HPC implementations. In addition, the proposed algorithm is at least as suitable for parallelization as a direct implementation of Min—Min. 相似文献

6.

Analysis of an efficient parallel implementation of active-set Newton algorithm

San Juan Sebastián Pablo Virtanen Tuomas Garcia-Molla Victor M. Vidal Antonio M. 《The Journal of supercomputing》2019,75(3):1298-1309

The Journal of Supercomputing - This paper presents an analysis of an efficient parallel implementation of the active-set Newton algorithm (ASNA), which is used to estimate the nonnegative weights... 相似文献

7.

An efficient parallel termination detection algorithm

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(4):293-301

In this paper, we present a new, easy to implement algorithm for detecting the termination of a parallel asynchronous computation on distributed-memory MIMD computers. We demonstrate that it operates concurrently with the main computation, adding minimal overhead, and we prove that it correctly detects termination when it occurs. Experimental results confirm that the termination detection routine imposes an overhead smaller than the experimental uncertainty. 相似文献

8.

An efficient algebraic multigrid preconditioner for a fast multipole boundary element method

G. Of 《Computing》2008,82(2-3):139-155

Fast boundary element methods still need good preconditioning techniques for an almost optimal complexity. An algebraic multigrid method is presented for the single layer potential using the fast multipole method. The coarsening is based on the cluster structure of the fast multipole method. The effort for the construction of the nearfield part of the coarse grid matrices and for an application of the multigrid preconditioner is of the same almost optimal order as the fast multipole method itself. 相似文献

9.

Application of the parallel BDDC preconditioner to the Stokes flow

Jakub &#x;ístek Bed&#x;ich Sousedík Pavel Burda Jan Mandel Jaroslav Novotný 《Computers & Fluids》2011,46(1):429-435

A parallel implementation of the Balancing Domain Decomposition by Constraints (BDDC) method is described. It is based on formulation of BDDC with global matrices without explicit coarse problem. The implementation is based on the MUMPS parallel solver for computing the approximate inverse used for preconditioning. It is successfully applied to several problems of Stokes flow discretized by Taylor–Hood finite elements and BDDC is shown to be a promising method also for this class of problems. 相似文献

10.

An efficient implementation of optimization algorithms

Jean Vignes 《Mathematics and computers in simulation》1984,26(3):243-256

相似文献

11.

An efficient implementation of trie structures

Jun-Ichi Aoe Katsushi Morimoto Takashi Sato 《Software》1992,22(9):695-721

A new internal array structure, called a double-array, implementing a trie structure is presented. The double-array combines the fast access of a matrix form with the compactness of a list form. The algorithms for retrieval, insertion and deletion are introduced through examples. Although insertion is rather slow, it is still practical, and both the deletion and the retrieval time can be improved from the list form. From the comparison with the list for various large sets of keys, it is shown that the size of the double-array can be about 17 per cent smaller than that of the list, and that the retrieval speed of the double-array can be from 3–1 to 5–1 times faster than that of the list. 相似文献

12.

Coarse grid acceleration of a parallel block preconditioner

C. J. 《Future Generation Computer Systems》2001,17(8):933-940

A block preconditioner is considered in a parallel computing environment. This preconditioner has good parallel properties, however, the convergence deteriorates when the number of blocks increases. Two different techniques are studied to accelerate the convergence: overlapping at the interfaces and using a coarse grid correction. It appears that the latter technique is indeed scalable, so the wall clock time is constant when the number of blocks increases. Furthermore, the method is easily added to an existing solution code. 相似文献

13.

An efficient algorithm for parallel integer multiplication

Benjamin Singer George Saon 《Journal of Network and Computer Applications》1996,19(4):415-418

In this paper we propose an efficient algorithm to implement parallel integer multiplication by a combination of parallel additions, shifts and reads from a memory-resident lookup table dedicated to squares. Such an operator called PIM (parallel integer multiplication) is in fact microprogrammed at the PROM level. Our theoretical approach is included within the framework of time and space parallel complexity theory. The mathematical relation used defines this multiplication operator in terms of a difference of two quadratic expressions, each being computed in parallel by one addition and one shift. This leads to the CPU time for any pair of numbers being constant. Our contribution is above all of practical interest on any massively parallel architecture in the field of scientific and numerical computing. 相似文献

14.

An efficient parallel algorithm for the efficient domination problem on distance-hereditary graphs

Sun-yuan Hsieh 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(9):985-993

In the literature, there are quite a few sequential and parallel algorithms for solving problems on distance-hereditary graphs. With an n-vertex and m-edge distance-hereditary graph G, we show that the efficient domination problem on G can be solved in O(log/sup 2/ n) time using O(n + m) processors on a CREW PRAM. Moreover, if a binary tree representation of G is given, the problem can be optimally solved in O(log n) time using O(n/log n) processors on an EREW PRAM. 相似文献

15.

An efficient parallel recognition algorithm forbipartite-permutation graphs

Chang-Wu Yu Gen-Huey Chen 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(1):3-10

We present a parallel recognition algorithm for bipartite-permutation graphs. The algorithm can be executed in O(log n) time on the CRCW PRAM if O(n³/log n) processors are used, or O(log² n) time on the CREW PRAM if O(n³/log²n) processors are used. Chen and Yesha (1993) have presented another CRCW PRAM algorithm that takes O(log²n) time if O(n ³) processors are used. Compared with Chen and Yesha's algorithm, our algorithm requires either less time and fewer processors on the same machine model, or fewer processors on a weaker machine model. Our algorithm can also be applied to determine if two bipartite-permutation graphs are isomorphic 相似文献

16.

An efficient functional implementation of FIFO queues

F.Warren Burton 《Information Processing Letters》1982,14(5):205-206

相似文献

17.

An efficient parallel algorithm for building the separating tree

Yijie Han Sanjeev Saxena Xiaojun Shen 《Journal of Parallel and Distributed Computing》2010

We present an efficient parallel algorithm for building the separating tree for a separable permutation. Our algorithm runs in O(log²n)

O ({log}^{2} n)

time using O(nlog^1.5n)

O (n {log}^{1.5} n)

operations on the CREW PRAM and O(log²n)

O ({log}^{2} n)

time using O(nlognloglogn)

O (n log n log log n)

operations on the COMMON CRCW PRAM. 相似文献

18.

An implementation base for efficient data structuring

D. S. Wile C. M. Geschke 《International journal of parallel programming》1972,1(3):209-224

The specification of data structure in higher-level languages is isolated from the related specifications of data allocation and binding of names. Structure specification is claimed to be the definition of the accessing (addressing) function for items having the structure. Conventional techniques for data structure isolation in higher-level languages are examined and are found to suffer from a lack of clarity and efficiency. The means by which data structure accessors may be defined in Bliss, the specification of their association with named, allocated storage, and their automatic invocation by reference to the named storage only, are discussed. An example is presented which illustrates their efficient implementation and their utility for separating the activities, of data structure programming and algorithmic programming.This work was supported by the Advanced Research Projects Agency of the Office of the Secretary of Defense (F44620-70-C-0107) and is monitored by the Air Force Office of Scientific Research. 相似文献

19.

An efficient parallel strategy for high-cost prefix operation

Bahig Hazem M. Fathy Khaled A. 《The Journal of supercomputing》2021,77(6):5267-5288

The Journal of Supercomputing - The prefix computation strategy is a fundamental technique used to solve many problems in computer science such as sorting, clustering, and computer vision. A large... 相似文献

20.

An efficient parallel neural network-based multi-instance learning algorithm

Cheng Hua Li Iker Gondra Lijun Liu 《The Journal of supercomputing》2012,62(2):724-740

相似文献