
1.

Faster optimal parallel prefix circuits: New algorithmic construction

Yen-Chun Chin-Yu 《Journal of Parallel and Distributed Computing》2005,65(12):1585-1595

Parallel prefix circuits are parallel prefix algorithms on the combinational circuit model. A prefix circuit with n inputs is depth-size optimal if its depth plus size equals 2n-2. Smaller depth implies faster computation, while smaller size implies less power consumption, less VLSI area, and less cost. To be of practical use, the depth and fan-out of a depth-size optimal prefix circuit should be small. A circuit with a smaller fan-out is in general faster and occupies less VLSI area. In this paper, we present a new algorithm to design parallel prefix circuits, and construct a class of depth-size optimal parallel prefix circuits, named SU4, with fan-out 4. When n ≥ 30, SU4 has the smallest depth among all known depth-size optimal prefix circuits with fan-out 4.

2.

Constructing H4, a Fast Depth-Size Optimal Parallel Prefix Circuit

Lin Yen-Chun, Hsu Yao-Hsien, Liu Chun-Keng 《The Journal of supercomputing》2003,24(3):279-304

Given n values x₁, x₂, ..., x_n and an associative binary operation ∘, the prefix problem is to compute x₁ ∘ x₂ ∘ ··· ∘ x_i, 1 ≤ i ≤ n. Prefix circuits are combinational circuits for solving the prefix problem. For any n-input prefix circuit D with depth d and size s, if d + s = 2n–2, then D is depth-size optimal. In general, a prefix circuit with a small depth is faster than one with a large depth. For prefix circuits with the same depth, a prefix circuit with a smaller fan-out occupies less area and is faster in VLSI implementation. This paper is on constructing parallel prefix circuits that are depth-size optimal with small depth and small fan-out. We construct a depth-size optimal prefix circuit H4 with fan-out 4. It has the smallest depth among all known depth-size optimal prefix circuits with a constant fan-out; furthermore, when n ≥ 136, its depth is less than, or equal to, those of all known depth-size optimal prefix circuits with unlimited fan-out. A size lower bound of prefix circuits is also derived. Some properties related to depth-size optimality and size optimality are introduced; they are used to prove that H4 is depth-size optimal.
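The depth-size measure used in these abstracts can be checked on small circuits. The following is a minimal sketch, not the H4 or SU4 construction: it assumes a circuit is given as an ordered list of operation nodes `(src, dst)`, meaning line `dst` receives `line[src] ∘ line[dst]`. The function names and this representation are illustrative assumptions. It compares the serial circuit (depth-size optimal) with a Sklansky-style divide-and-conquer circuit (small depth, but not depth-size optimal).

```python
from itertools import accumulate
import operator


def serial_prefix_circuit(n):
    """Serial circuit: each line i combines with line i-1 (n-1 nodes in a chain)."""
    return [(i - 1, i) for i in range(1, n)]


def sklansky_circuit(n):
    """Sklansky-style circuit for n a power of two (illustrative, unbounded fan-out)."""
    nodes = []
    span = 1
    while span < n:
        for block in range(0, n, 2 * span):
            src = block + span - 1  # last line of the lower half-block
            for dst in range(block + span, min(block + 2 * span, n)):
                nodes.append((src, dst))
        span *= 2
    return nodes


def evaluate(circuit, xs, op):
    """Apply the nodes in order; return (outputs, size, depth)."""
    vals = list(xs)
    level = [0] * len(vals)  # level at which each line's current value is ready
    for src, dst in circuit:
        vals[dst] = op(vals[src], vals[dst])
        level[dst] = max(level[src], level[dst]) + 1
    return vals, len(circuit), max(level)


n = 8
xs = list("abcdefgh")  # string concatenation: associative, not commutative
for build in (serial_prefix_circuit, sklansky_circuit):
    out, size, depth = evaluate(build(n), xs, operator.add)
    assert out == list(accumulate(xs, operator.add))
    print(build.__name__, "size", size, "depth", depth, "sum", size + depth)
```

For n = 8 the serial circuit has size 7 and depth 7, so depth plus size equals 2n–2 = 14, while the Sklansky-style circuit has size 12 and depth 3, giving 15. The constructions in the listed papers aim for the optimal sum with a much smaller depth than the serial circuit.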

3.

Fast problem-size-independent parallel prefix circuits

Yen-Chun Lin, Li-Ling Hung 《Journal of Parallel and Distributed Computing》2009

A family of parallel algorithms solving the prefix problem on the combinational circuit model is presented. These prefix circuits are waist-size optimal with waist 1 (WSO-1). They are not only building blocks for constructing fast depth-size optimal prefix circuits, but also themselves fast problem-size-independent prefix circuits. When the problem size is greater than the circuit width, the presented prefix circuits may be much faster than any other prefix circuits of the same width, especially when the problem size is greater than or equal to twice the circuit width. The new prefix circuits are compared analytically with other representative prefix circuits to show how fast they are. They have the minimum depth and are the fastest among all WSO-1 prefix circuits of the same width and fan-out. Thus, they are better building blocks than other WSO-1 circuits for constructing fast depth-size optimal prefix circuits with the same fan-out.
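The idea of a problem-size-independent prefix circuit, where the problem size exceeds the circuit width, can be illustrated in software. This is only a sketch under stated assumptions: `blocked_prefix` and `width` are hypothetical names, and `itertools.accumulate` stands in for one pass through a fixed-width circuit; it is not the WSO-1 construction from the paper. The input is fed through in blocks, and the last output of each block is carried into the first input of the next.

```python
from itertools import accumulate
import operator


def blocked_prefix(xs, width, op):
    """Compute all prefixes of xs using repeated passes of a width-limited unit."""
    out = []
    carry = None  # last prefix of the previous block, if any
    for start in range(0, len(xs), width):
        block = list(xs[start:start + width])
        if carry is not None:
            # Fold the running prefix into the first input of this block.
            block[0] = op(carry, block[0])
        # Stand-in for one pass through a width-`width` prefix circuit.
        prefixes = list(accumulate(block, op))
        out.extend(prefixes)
        carry = prefixes[-1]
    return out


data = list(range(1, 11))
print(blocked_prefix(data, 4, operator.add))
```

With this scheme the number of passes grows with the problem size divided by the width, which is why the depth of the fixed-width circuit matters so much once the problem size is at least twice the width.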

4.

Limited width parallel prefix circuits

David A. Carlson, Binay Sugla 《The Journal of supercomputing》1990,4(2):107-129

In this paper, we present lower and upper bounds on the size of limited width, bounded and unbounded fan-out parallel prefix circuits. The lower bounds on the sizes of such circuits are a function of the depth, width, and number of inputs. The size requirement of an N input bounded fan-out parallel prefix circuit having limited width W and extra depth k (the difference between allowed and minimum possible depth) is shown to be (N log₂ W/2^k + N) for k log₂ W. This implies that insisting on minimum depth causes the circuit size to be nonlinear, while as little as log₂log₂ W of extra depth can possibly reduce the size to linear. Also, we show that there is a clear difference between the two cases of bounded and unbounded fan-out by proving the size of a limited width, unbounded fan-out parallel prefix circuit lies between a lower bound of ((2 + 2^1–k/3)N) and an upper bound of O((2 + 2^1–k)N). Uniform, systolic constructions of limited width parallel prefix circuits are provided here and shown to be asymptotically optimal. By associating the width of the circuit with the number of processors and the fan-out capabilities of the circuit with the interconnection structure of a multiprocessor, time- and processor-efficient algorithms may be developed.

5.

A New Class of Depth-Size Optimal Parallel Prefix Circuits

Lin Yen-Chun, Shih Chao-Cheng 《The Journal of supercomputing》1999,14(1):39-52

Given n values x₁, x₂, ..., x_n and an associative binary operation ∘, the prefix problem is to compute x₁ ∘ x₂ ∘ ··· ∘ x_i, 1 ≤ i ≤ n. Many combinational circuits for solving the prefix problem, called prefix circuits, have been designed. It has been proved that the size s(D(n)) and the depth d(D(n)) of an n-input prefix circuit D(n) satisfy the inequality d(D(n)) + s(D(n)) ≥ 2n–2; thus, a prefix circuit is depth-size optimal if d(D(n)) + s(D(n)) = 2n–2. In this paper, we construct a new depth-size optimal prefix circuit SL(n). In addition, we can build depth-size optimal prefix circuits whose depth can be any integer between d(SL(n)) and n–1. SL(n) has the same maximum fan-out lg n + 1 as Snir's SN(n), but the depth of SL(n) is smaller; thus, SL(n) is faster. Compared with another optimal prefix circuit LYD(n), d(LYD(n)) + 2 ≥ d(SL(n)) ≥ d(LYD(n)). However, LYD(n) may have a fan-out of at most 2 lg n – 2, and the fan-out of LYD(n) is greater than that of SL(n) for almost all n ≥ 12. Because an operation node with greater fan-out occupies more chip area and is slower in VLSI implementation, in most cases, SL(n) needs less area and may be faster than LYD(n). Moreover, it is much easier to design SL(n) than LYD(n).

6.

Limit laws for terminal nodes in random circuits with restricted fan-out: a family of graphs generalizing binary search trees

Hosam Mahmoud, Tatsuie Tsukiji 《Acta Informatica》2004,41(2-3):99-110

We introduce a family of graphs C(n,i,s,a) that generalizes the binary search tree. The graphs represent logic circuits with fan-in i, restricted fan-out s, and arising by n progressive additions of random gates to a starting circuit of a isolated nodes. We show via martingales that a suitably normalized version of the number of terminal nodes in binary circuits converges in distribution to a normal random variate. Received: 23 July 2003, Published online: 29 October 2004

7.

Balancing Bounded Treewidth Circuits

Maurice Jansen, Jayalal Sarma 《Theory of Computing Systems》2014,54(2):318-336

We use algorithmic tools for graphs of small treewidth to address questions in complexity theory. For our main construction, we prove that multiplicatively disjoint arithmetic circuits of size n^O(1) and treewidth k can be simulated by bounded fan-in arithmetic formulas of depth O(k² log n). From this we derive an analogous statement for syntactically multilinear arithmetic circuits, which strengthens the central theorem of M. Mahajan and B.V.R. Rao (Proc. 33rd International Symposium on Mathematical Foundations of Computer Science, vol. 5162, pp. 455–466, 2008). We show our main construction has the following three applications:

Bounded width arithmetic circuits of size n^O(1) can be balanced to depth O(log n), provided chains of iterated multiplication in the circuit are of length O(1).
Boolean bounded fan-in circuits of size n^O(1) and treewidth k can be simulated by bounded fan-in formulas of depth O(k² log n). This strengthens in the non-uniform setting the known inclusion that SC⁰ ⊆ NC¹.
We demonstrate treewidth restricted cases of Directed-Reachability and Circuit Value Problem that can be solved in LogDCFL.

We also give a construction showing, for both arithmetic and Boolean circuits, that any circuit of size n^O(1) and treewidth O(logⁱ n) can be simulated by a circuit of width O(logⁱ⁺¹ n) and size n^c, where c = O(1) if i = 0, and c = O(log log n) otherwise.

8.

Random Data Accesses on a Coarse-Grained Parallel Machine II. One-to-Many and Many-to-One Mappings

《Journal of Parallel and Distributed Computing》1997,44(1):24-34

This paper describes deterministic communication-efficient algorithms for performing random data accesses with hot spots on a coarse-grained parallel machine. The general random access read-write operations with hot spots can be completed in Cμn/p (+ lower order terms) time and is optimal and scalable provided n ⪢ p³ + p²τ/μ (n is the number of elements distributed across p processors, τ is the start-up overhead and 1/μ is the data transfer rate). C is a small constant between 3 and 4 for the random access write operation, slightly higher for the random access read operation. Monotonic random access reads/writes can be completed with smaller constants and are optimal for smaller n as well. A companion paper [26] deals with the problem of performing dynamic permutations.

9.

Amplifying circuit lower bounds against polynomial time, with applications

Richard J. Lipton, Ryan Williams 《Computational Complexity》2013,22(2):311-343

We give a self-reduction for the Circuit Evaluation problem (CircEval) and prove the following consequences.

Amplifying size–depth lower bounds. If CircEval has Boolean circuits of n^k size and n^(1−δ) depth for some k and δ, then for every ε > 0, there is a δ′ > 0 such that CircEval has circuits of n^(1+ε) size and n^(1−δ′) depth. Moreover, the resulting circuits require only Õ(n^ε) bits of non-uniformity to construct. As a consequence, strong enough depth lower bounds for Circuit Evaluation imply a full separation of P and NC (even with a weak size lower bound).
Lower bounds for quantified Boolean formulas. Let c, d > 1 and e < 1 satisfy c < (1 − e + d)/d. Either the problem of recognizing valid quantified Boolean formulas (QBF) is not solvable in TIME[n^c], or the Circuit Evaluation problem cannot be solved with circuits of n^d size and n^e depth. This implies unconditional polynomial-time uniform circuit lower bounds for solving QBF. We also prove that QBF does not have n^c-time uniform NC circuits, for all c < 2.


10.

Bounded-Depth Circuits Cannot Sample Good Codes

Shachar Lovett, Emanuele Viola 《Computational Complexity》2012,21(2):245-266

We study a variant of the classical circuit-lower-bound problems: proving lower bounds for sampling distributions given random bits. We prove a lower bound of 1 − 1/n^Ω(1) on the statistical distance between (i) the output distribution of any small constant-depth (a.k.a. AC⁰) circuit f : {0, 1}^poly(n) → {0, 1}ⁿ, and (ii) the uniform distribution over any code C ⊆ {0, 1}ⁿ that is "good," that is, has relative distance and rate both Ω(1). This seems to be the first lower bound of this kind. We give two simple applications of this result: (1) any data structure for storing codewords of a good code C ⊆ {0, 1}ⁿ requires redundancy Ω(log n), if each bit of the codeword can be retrieved by a small AC⁰ circuit; and (2) for some choice of the underlying combinatorial designs, the output distribution of Nisan's pseudorandom generator against AC⁰ circuits of depth d cannot be sampled by small AC⁰ circuits of depth less than d.

11.

Random Data Accesses on a Coarse-Grained Parallel Machine I. One-to-One Mappings

《Journal of Parallel and Distributed Computing》1997,44(1):14-23

This paper describes deterministic communication-efficient algorithms for performing dynamic permutations on a coarse-grained parallel machine. Our analysis shows that the general permutation operation can be completed in Cμn/p (+ lower order terms) time and is optimal and scalable provided n ⪢ p³ + p²τ/μ (n is the size of the permutation or the number of elements distributed across the p processors, τ is the start-up overhead, and 1/μ is the data transfer rate). C is a small constant typically between 2 and 3 for write permutations, slightly higher for read permutations. Modifications to exploit locality of access are presented. Special classes of permutations that are optimal for smaller sizes are also described. A companion paper [22] deals with the problem of random data accesses with hot spots.

12.

Learning large-alphabet and analog circuits with value injection queries

Dana Angluin, James Aspnes, Jiang Chen, Lev Reyzin 《Machine Learning》2008,72(1-2):113-138

We consider the problem of learning an acyclic discrete circuit with n wires, fan-in bounded by k and alphabet size s using value injection queries. For the class of transitively reduced circuits, we develop the Distinguishing Paths Algorithm, which learns such a circuit using (ns)^O(k) value injection queries and time polynomial in the number of queries. We describe a generalization of the algorithm to the class of circuits with shortcut width bounded by b that uses (ns)^O(k+b) value injection queries. Both algorithms use value injection queries that fix only O(kd) wires, where d is the depth of the target circuit. We give a reduction showing that without such restrictions on the topology of the circuit, the learning problem may be computationally intractable when s = n^Θ(1), even for circuits of depth O(log n). We then apply our large-alphabet learning algorithms to the problem of approximate learning of analog circuits whose gate functions satisfy a Lipschitz condition. Finally, we consider models in which behavioral equivalence queries are also available, and extend and improve the learning algorithms of (Angluin in Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pp. 584–593, 2006) to handle general classes of gate functions that are polynomial time learnable from counterexamples.

13.

Efficient Parallel Computations on the Reduced Mesh of Trees Organization

《Journal of Parallel and Distributed Computing》1994,20(2):121-135

Optimal and near optimal parallel algorithms for several fundamental problems are proposed for a parallel organization consisting of n processors, each having access to a row and a column of an n × n array of memory modules. Parallel computations are implemented on such an organization by decomposing them into alternating orthogonal processing phases. A number of efficient data movement techniques are developed for the proposed organization which lead to optimal or near optimal solutions to several communication-intensive problems such as sorting, performing permutations, list ranking (data dependent parallel prefix), and problems on graphs represented by an unsorted list of n² edges. It is also shown that the proposed organization is capable of simulating any fixed-degree network on n² processors with O(n) loss in time, which is optimal. Finally, an enhanced organization having p processors, 1 ≤ p ≤ n², and O(n²) memory locations is presented, and is shown to provide optimal speedups for adjacency-matrix based graph problems for any number of processors in the range [1, n^(3/2)].

14.

Parallel Algorithms for the Circuit Value Update Problem

C. E. Leiserson, K. H. Randall 《Theory of Computing Systems》1997,30(6):583-597

The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinational element has bounded fan-in and fan-out and can be evaluated in constant time. This problem is easily solved on an ordinary serial computer in O(W+D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth (its depth measured in the original circuit). In this paper we show how to solve the circuit value update problem efficiently on a P-processor parallel computer. We give a straightforward synchronous, parallel algorithm that runs in expected time. Our main contribution, however, is an optimistic, asynchronous, parallel algorithm that runs in expected time, where W and D are the size and embedded depth, respectively, of the "volatile" subcircuit, the subcircuit of elements that have inputs which either change or glitch as a result of the update. To our knowledge, our analysis provides the first analytical bounds on the running time of an optimistic, asynchronous, parallel algorithm. Received November 1, 1995, and in final form November 25, 1996.
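The serial baseline mentioned in this abstract can be sketched as level-ordered change propagation. This is a minimal illustration under stated assumptions, not the authors' algorithm: the circuit representation (`gates` mapping a gate name to a function and its input names), the helper names, and the bucket-per-level scheme are all hypothetical. Only gates with a changed input are re-evaluated, and levels are processed in increasing order so that each gate is evaluated at most once.

```python
from collections import defaultdict


def build_index(gates):
    """Precompute fan-out lists and the level (depth in the full circuit) of each gate."""
    fanout = defaultdict(list)
    for g, (_, ins) in gates.items():
        for w in ins:
            fanout[w].append(g)
    level = {}

    def depth(w):
        if w not in gates:  # primary input
            return 0
        if w not in level:
            level[w] = 1 + max(depth(i) for i in gates[w][1])
        return level[w]

    for g in gates:
        depth(g)
    return fanout, level


def update(gates, values, fanout, level, new_inputs):
    """Apply new primary-input values in place; return the number of gates re-evaluated."""
    buckets = defaultdict(set)  # level -> gates waiting to be re-evaluated
    for w, v in new_inputs.items():
        if values[w] != v:
            values[w] = v
            for g in fanout[w]:
                buckets[level[g]].add(g)
    evaluated = 0
    lvl = 0
    while buckets:  # pending gates always sit at levels above lvl
        lvl += 1
        for g in buckets.pop(lvl, ()):
            func, ins = gates[g]
            new = func(*(values[i] for i in ins))
            evaluated += 1
            if new != values[g]:  # propagate only real changes
                values[g] = new
                for h in fanout[g]:
                    buckets[level[h]].add(h)
    return evaluated


# Tiny example circuit: g1 = a AND b, g2 = g1 OR c, g3 = NOT g2.
gates = {
    "g1": (lambda x, y: x & y, ["a", "b"]),
    "g2": (lambda x, y: x | y, ["g1", "c"]),
    "g3": (lambda x: 1 - x, ["g2"]),
}
values = {"a": 1, "b": 1, "c": 0, "g1": 1, "g2": 1, "g3": 0}
fanout, level = build_index(gates)
print(update(gates, values, fanout, level, {"a": 0}), values)
```

In this sketch the work is proportional to the number of re-evaluated gates plus the number of levels visited, which mirrors the O(W+D) form quoted in the abstract for bounded fan-in and fan-out.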

15.

Improving the average delay of sorting

Andreas Jakoby, Maciej Liśkiewicz, Rüdiger Reischuk, Christian Schindelhauer 《Theoretical computer science》2009

In previous work we have introduced an average case measure for the time complexity of Boolean circuits. Instead of fixed circuit depth, for each input we take the minimal number of time steps necessary to perform the computation for that particular input using gates that forward their output values as soon as possible. This measure is called delay. Based on it, the complexity of a whole class of functions that can be described as prefix computations has been analysed in detail.

16.

Parallelization of the Gaussian Elimination Algorithm on Systolic Arrays

《Journal of Parallel and Distributed Computing》1996,33(1):69-75

We study the parallel implementation of the Gaussian elimination scheme on systolic arrays. We first show that the time (resp. area) complexity of the algorithm is T = 3n − 1 (resp. S = (n²/4) + O(n)), where n is the size of the linear system. Then we exhibit three algorithms. The first two are optimal in time. The first one corresponds to an orthogonally connected array of size 3(n²/8) + O(n). The second network is smaller, S = 3(n²/10) + O(n), but two layers are necessary in order to obtain a regular layout with local communications. The third one is hexagonally connected, has size (n²/3) + O(n), and is almost optimal in time.

17.

A class of almost-optimal size-independent parallel prefix circuits

Hatem M. El-Boghdadi 《Journal of Parallel and Distributed Computing》2013

Prefix computation is one of the fundamental problems that can be used in many applications such as fast adders. Most proposed parallel prefix circuits assume that the circuit is of the same width as the input size.

18.

New upper bounds on the Boolean circuit complexity of symmetric functions

E. Demenkov, A. Kojevnikov, A. Kulikov, G. Yaroslavtsev 《Information Processing Letters》2010,110(7):264-267

In this note, we present improved upper bounds on the circuit complexity of symmetric Boolean functions. In particular, we describe circuits of size 4.5n+o(n) for any symmetric function of n variables, as well as circuits of size 3n for function.

19.

Majority gates vs. general weighted threshold gates

Mikael Goldmann, Johan Håstad, Alexander Razborov 《Computational Complexity》1992,2(4):277-300

In this paper we study small depth circuits that contain threshold gates (with or without weights) and parity gates. All circuits we consider are of polynomial size. We prove several results which complete the work on characterizing possible inclusions between many classes defined by small depth circuits. These results are the following:

1.	A single threshold gate with weights cannot in general be replaced by a polynomial fan-in unweighted threshold gate of parity gates.
2.	On the other hand it can be replaced by a depth 2 unweighted threshold circuit of polynomial size. An extension of this construction is used to prove that whatever can be computed by a depth d polynomial size threshold circuit with weights can be computed by a depth d+1 polynomial size unweighted threshold circuit, where d is an arbitrary fixed integer.
3.	A polynomial fan-in threshold gate (with weights) of parity gates cannot in general be replaced by a depth 2 unweighted threshold circuit of polynomial size.


20.

On the minimum number of negations leading to super-polynomial savings

Stasys Jukna 《Information Processing Letters》2004,89(2):71-74

We show that an explicit sequence of monotone functions can be computed by Boolean circuits with polynomial (in n) number of And, Or and Not gates, but every such circuit must use at least log n − O(log log n) Not gates. This is almost optimal because results of Markov [J. ACM 5 (1958) 331] and Fisher [Lecture Notes in Comput. Sci., Vol. 33, Springer, 1974, p. 71] imply that, with only small increase of the total number of gates, any circuit in n variables can be simulated by a circuit with at most ⌈log(n+1)⌉ Not gates.