共查询到20条相似文献,搜索用时 15 毫秒
1.
大整数运算广泛地应用于公钥加密算法、大规模科学计算中高精度浮点数运算类以及构建大特征值等领域,然而其大部分算法空间和时间开销都很大,尤其对于核心运算之一的大整数乘法,当数据达到一定规模时,超长的串行计算时间已成为制约算法应用的巨大瓶颈.近几年来,伴随着多核、众核芯片的迅猛发展,通过充分挖掘算法本身的并行度以利用并行处理器的强大计算能力,进而高效地提升算法性能,成为一种研究趋势.本文基于通用多核并行计算平台,研究了大整数乘法Comba及Karatsuba快速算法的并行化,提出了高效的多核并行算法.在算法实现及性能优化上,采用了OpenMP+SIMD的多级并行技术,使性能获得巨大提升.在性能测试上,我们使用优化的并行算法与原始串行算法进行对比试验,结果显示,8线程并行Comba算法和Karatsuba算法相比串行对应算法分别实现了5.85倍以及6.14倍的性能加速比提升. 相似文献
2.
S. K. S. Gupta C. -H. Huang P. Sadayappan R. W. Johnson 《Journal of Parallel and Distributed Computing》1996,34(2):137
A framework for synthesizing communication-efficient distributed-memory parallel programs for block recursive algorithms such as the fast Fourier transform (FFT) and Strassen's matrix multiplication is presented. This framework is based on an algebraic representation of the algorithms, which involves the tensor (Kronecker) product and other matrix operations. This representation is useful in analyzing the communication implications of computation partitioning and data distributions. The programs are synthesized under two different target program models. These two models are based on different ways of managing the distribution of data for optimizing communication. The first model uses point-to-point interprocessor communication primitives, whereas the second model uses data redistribution primitives involving collective all-to-many communication. These two program models are shown to be suitable for different ranges of problem size. The methodology is illustrated by synthesizing communication-efficient programs for the FFT. This framework has been incorporated into the EXTENT system for automatic generation of parallel/vector programs for block recursive algorithms. 相似文献
3.
4.
We consider the following one- and two-dimensional bucketing problems: Given a set S of n points in \reals 1 or \reals 2 and a positive integer b , distribute the points of S into b equal-size buckets so that the maximum number of points in a bucket is minimized. Suppose at most (n/b) + Δ points lie in each bucket in an optimal solution. We present algorithms whose time complexities depend on b and Δ . No prior knowledge of Δ is necessary for our algorithms. For the one-dimensional problem, we give a deterministic algorithm that achieves a running time of O(b 4 (Δ 2 +log n) + n) . For the two-dimensional problem, we present a Monte Carlo algorithm that runs in subquadratic time for small values of b and Δ . The previous algorithms, by Asano and Tokuyama [1], searched the entire parameterized space and required Ω ( n 2 ) time in the worst case even for constant values of b and Δ . We also present a subquadratic algorithm for the special case of the two-dimensional problem when b=2 . 相似文献
5.
Abstract. We consider the following one- and two-dimensional bucketing problems: Given a set S of n points in \reals
1
or \reals
2
and a positive integer b , distribute the points of S into b equal-size buckets so that the maximum number of points in a bucket is minimized. Suppose at most (n/b) + Δ points lie in each bucket in an optimal solution. We present algorithms whose time complexities depend on b and Δ . No prior knowledge of Δ is necessary for our algorithms.
For the one-dimensional problem, we give a deterministic algorithm that achieves a running time of O(b
4
(Δ
2
+log n) + n) . For the two-dimensional problem, we present a Monte Carlo algorithm that runs in subquadratic time for small values of
b and Δ . The previous algorithms, by Asano and Tokuyama [1], searched the entire parameterized space and required Ω ( n
2
) time in the worst case even for constant values of b and Δ . We also present a subquadratic algorithm for the special case of the two-dimensional problem when b=2 . 相似文献
6.
Hazem M. Bahig Sameh S. Daoud Mahmoud K. A. Khairat 《The Journal of supercomputing》2002,22(3):269-275
We consider the problem of sorting n integers when the elements are drawn from the restricted domain [1...n]. A new deterministic parallel algorithm for sorting n integers is obtained. Its running time is O(lognlog(n/logn)) using n/logn processors on EREW (exclusive read exclusive write) PRAM (parallel random access machine). Also, our algorithm was modified to become optimal when we use
processors. This algorithm belongs to class EP (Efficient, Polynomial fast). 相似文献
7.
The problem of merging two sorted arrays A = (a1, a2, ..., an1) and B = (b1, b2, ..., bn2) is considered. For input elements that are drawn from a domain of integers [1...s] we present an algorithm that runs in O(log log log s) time using n/log log log s CREW PRAM processors (optimal speed-up) and O(nsε) space, where n = n1 + n2. For input elements that are drawn from a domain of integers [1...n] we present a second algorithm that runs in O(α(n)) time (where α(n) is the inverse of Ackermann′s function) using n/α(n) CREW PRAM processors and linear space. This second algorithm is non-uniform; however, it can be made uniform at a price of a certain loss of speed, or by using a CRCW PRAM. 相似文献
8.
9.
H.264是ITU与ISO联合共同开发的具有高编码效率、高压缩质量的视频新标准。整数变换是其提高压缩性能最主要的改进方法之一,基于同样的整数变换过程的变换基可以不唯一,因此在整数变换的理论确定后,寻找变换基是一件重要的工作,提出了一种通用的变换基生成算法。该方法通过分析整数DCT变换的原理,指出了整数变换矩阵应该满足的4个约束条件,以满足正交性约束为出发点,导出了整数矩阵元素之间的数量关系,并辅以另外3个约束条件,采用搜索的方法寻找变换基。实验结果表明,该算法在经过几十步的搜索后,就能找出所有可用的变换基,JVT参考模型用到的变换基也在其中。 相似文献
10.
Nicolas Basset 《Algorithmica》2016,76(4):989-1034
11.
We present two deterministic polynomial time algorithms for the following problem: check whether a sparse polynomial f(x) vanishes at a given primitive nth root of unity ζ
n
. A priori f(ζ
n
) may be nonzero and doubly exponentially small in the input size. The existence of a polynomial time procedure in the case
of factored n was conjectured by D. Plaisted in 1984, but all previously known algorithms are either randomized, or do not run in polynomial
time.
We apply polynomial zero testing algorithms to construct a nondeterministic polynomial time algorithm for the torsion point
problem (TP). The problem TP is a particular case of the feasibility problem for a system of polynomial equations in complex
numbers (coefficients of polynomials are integers). In the problem TP all coordinates of a solution must be roots of unity. 相似文献
12.
We examine the problem of integer representation in a nearly minimal number of bits so that the increment and the decrement
(and indeed the addition and the subtraction) operations can be performed using few bit inspections and fewer bit changes.
The model of computation we considered is the bit probe model, where the complexity measure counts only the bitwise accesses
to the data structure. We present several efficient data structures to represent integer that use a logarithmic number of
bit inspections and a constant number of bit changes per operation. The most space-efficient data structure uses only one
extra bit. We also present an extension to our data structure to support efficient addition and subtraction, where the larger
value is replaced by the result, while retaining the same asymptotic bounds for the increment and the decrement operations. 相似文献
13.
主要研究了著名的几何曲线——蔓叶线的一种并行生成算法,以Bresenham算法为基础,对蔓叶线的并行生成算法进行了分析和讨论。首先,从蔓叶线图像的一个已知点开始,根据递推公式逐点选择最靠近蔓叶线的像素点;然后引入并行机制生成蔓叶线的图像;最后,利用C#多线程模拟实现了该算法。模拟结果表明,这是关于蔓叶线图像的一种快速、高效的并行算法。 相似文献
14.
生物序列的k-mer频次统计是生物信息处理中一个非常基础且重要的问题. 本文针对多序列在对齐模式下,不同偏移处一段长度范围内的k-mer频次统计问题进行了研究. 提出了一种逆向遍历k-mer计数算法BTKC. 该算法能够充分利用长度的k-mer统计信息,快速得到长度的k-mer统计信息,从而避免了统计任意长度的k-mer频次信息时都需要对所有序列进行遍历. 算法的时间复杂度分析及实验结果表明,相比于传统的前向遍历FTKC算法,BTKC算法性能提升非常明显,且其时间复杂度与k-mer长度的变化范围无关,非常适合于在k-mer长度变化范围较大的情况下使用. 相似文献
15.
Boosting Algorithms for Parallel and Distributed Learning 总被引:1,自引:0,他引:1
The growing amount of available information and its distributed and heterogeneous nature has a major impact on the field of data mining. In this paper, we propose a framework for parallel and distributed boosting algorithms intended for efficient integrating specialized classifiers learned over very large, distributed and possibly heterogeneous databases that cannot fit into main computer memory. Boosting is a popular technique for constructing highly accurate classifier ensembles, where the classifiers are trained serially, with the weights on the training instances adaptively set according to the performance of previous classifiers. Our parallel boosting algorithm is designed for tightly coupled shared memory systems with a small number of processors, with an objective of achieving the maximal prediction accuracy in fewer iterations than boosting on a single processor. After all processors learn classifiers in parallel at each boosting round, they are combined according to the confidence of their prediction. Our distributed boosting algorithm is proposed primarily for learning from several disjoint data sites when the data cannot be merged together, although it can also be used for parallel learning where a massive data set is partitioned into several disjoint subsets for a more efficient analysis. At each boosting round, the proposed method combines classifiers from all sites and creates a classifier ensemble on each site. The final classifier is constructed as an ensemble of all classifier ensembles built on disjoint data sets. The new proposed methods applied to several data sets have shown that parallel boosting can achieve the same or even better prediction accuracy considerably faster than the standard sequential boosting. Results from the experiments also indicate that distributed boosting has comparable or slightly improved classification accuracy over standard boosting, while requiring much less memory and computational time since it uses smaller data sets. 相似文献
16.
17.
本文详细介绍了1-D DFT精确计算的六步框架并行算法和按位并行计算法,以及按位计算法在2-D Mesh和Torus上的模拟实现,同时介绍了近似计算中的基于奇异值分解的算法和基于快速多极方法的算法。对于2-D DFT,本文介绍了并行行列算法和并行多项式变换算法,并分析了其优缺点。 相似文献
18.
19.
《Journal of Parallel and Distributed Computing》2001,61(4):536-544
With the widening gap between processor speeds and disk access speeds, the I/O bottleneck has become critical. Parallel disk systems have been introduced to alleviate this bottleneck. In this paper we present deterministic and randomized selection algorithms for parallel disk systems. The algorithms to be presented, in addition to being asymptotically optimal, have small underlying constants in their time bounds and hence have the potential of being practical. 相似文献
20.
《Journal of Parallel and Distributed Computing》1995,26(1):116-124
In this paper, we present optimal O(log n) time, O(n/log n) processor EREW PRAM parallel algorithms for finding the connected components, cut vertices, and bridges of a permutation graph. We also present an O(log n) time, O(n) processor, CREW PRAM model parallel algorithm for finding a Breadth First Search (BFS) spanning tree of a permutation graph rooted at vertex 1 and use the same to derive an efficient parallel algorithm for the All Pairs Shortest Path problem on permutation graphs. 相似文献