Similar Literature
20 similar documents found (search time: 31 ms)
1.
Microsystem Technologies - This paper explores the design of various components that are extensively used in digital circuits, namely the AND gate, OR gate, NAND gate, NOR gate and Full Adder, using GDI...
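As a rough illustration of the components listed above, the sketch below models only their Boolean behaviour in Python; the paper's contribution is the GDI (Gate Diffusion Input) transistor-level realization, which a truth-table model like this cannot capture. The helper functions are assumptions made for illustration, not code from the paper.

```python
# Boolean-level sketch of the listed components (AND, OR, NAND, NOR, Full Adder).
# This only models logic behaviour, not the GDI transistor-level design.

def AND(a, b):  return a & b
def OR(a, b):   return a | b
def NAND(a, b): return 1 - (a & b)   # listed component, unused below
def NOR(a, b):  return 1 - (a | b)   # listed component, unused below

def full_adder(a, b, cin):
    """Sum and carry-out of a 1-bit full adder built from the gates above."""
    axb  = OR(AND(a, 1 - b), AND(1 - a, b))        # a XOR b from AND/OR
    s    = OR(AND(axb, 1 - cin), AND(1 - axb, cin))
    cout = OR(AND(a, b), AND(axb, cin))
    return s, cout

# exhaustive check against integer addition
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```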

2.
We offer a universal multilevel structure (of parallel and parallel-sequential types) that uses logical AND, NOR and OR functions to convert cyclic code words containing a cyclic group of ones or zeros. Realized on FPGAs, the structures are shown to operate correctly. The results of simulation experiments are given as time diagrams.

3.

Applying deep neural networks (DNNs) in mobile and safety-critical systems, such as autonomous vehicles, demands a reliable and efficient execution on hardware. The design of the neural architecture has a large influence on the achievable efficiency and bit error resilience of the network on hardware. Since there are numerous design choices for the architecture of DNNs, with partially opposing effects on the preferred characteristics (such as small error rates at low latency), multi-objective optimization strategies are necessary. In this paper, we develop an evolutionary optimization technique for the automated design of hardware-optimized DNN architectures. For this purpose, we derive a set of inexpensively computable objective functions, which enable the fast evaluation of DNN architectures with respect to their hardware efficiency and error resilience. We observe a strong correlation between predicted error resilience and actual measurements obtained from fault injection simulations. Furthermore, we analyze two different quantization schemes for efficient DNN computation and find one providing a significantly higher error resilience compared to the other. Finally, a comparison of the architectures provided by our algorithm with the popular MobileNetV2 and NASNet-A models reveals an up to seven times improved bit error resilience of our models. We are the first to combine error resilience, efficiency, and performance optimization in a neural architecture search framework.
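One building block the abstract relies on is fault-injection simulation for estimating bit error resilience. The snippet below is an illustrative sketch of random bit flips injected into an 8-bit quantized weight tensor; the tensor shape, bit error rate and deviation metric are assumptions, not the paper's setup.

```python
import numpy as np

# Illustrative sketch (not the paper's pipeline): estimate the sensitivity of an
# 8-bit quantized weight tensor to random bit flips at a given bit error rate.

def inject_bit_flips(q_weights: np.ndarray, ber: float, rng) -> np.ndarray:
    """Flip each bit of the uint8 weights independently with probability `ber`."""
    bits = np.unpackbits(q_weights[..., np.newaxis], axis=-1)   # shape (..., 8)
    flips = rng.random(bits.shape) < ber
    return np.packbits(bits ^ flips, axis=-1).squeeze(-1)

rng = np.random.default_rng(0)
w = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)   # hypothetical layer
w_faulty = inject_bit_flips(w, ber=1e-3, rng=rng)
print("mean absolute deviation:",
      np.abs(w.astype(int) - w_faulty.astype(int)).mean())
```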


4.
Discrete dynamical systems based on dependency graphs have played an important role in the mathematical theory of computer simulations. In this paper, we are concerned with parallel dynamical systems (PDS) and sequential dynamical systems (SDS) with the OR and NOR functions as local functions. It has been recognized by Barrett, Mortveit and Reidys that SDS with the NOR function are closely related to combinatorial properties of the dependency graphs. We present an evaluation scheme for systems with the OR and NOR functions which can be used to clarify some basic properties of the dynamical systems. We show that for forests that do not contain a single edge, the number of orientations equals the number of different OR-SDS.
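For readers unfamiliar with the PDS/SDS distinction used above, the following minimal Python sketch evaluates one synchronous (parallel) update and one permutation-ordered (sequential) update with OR as the local function over a vertex's closed neighbourhood; the graph, initial state and update order are illustrative assumptions.

```python
# One update of a parallel dynamical system (PDS) and of a sequential dynamical
# system (SDS) over a dependency graph, with OR as the local function.

def or_local(state, graph, v):
    # OR over the closed neighbourhood of v
    return int(state[v] or any(state[u] for u in graph[v]))

def pds_step(state, graph):
    # all vertices updated simultaneously
    return {v: or_local(state, graph, v) for v in graph}

def sds_step(state, graph, order):
    # vertices updated one by one in the given permutation
    state = dict(state)
    for v in order:
        state[v] = or_local(state, graph, v)
    return state

graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a path (a forest)
state = {0: 0, 1: 1, 2: 0, 3: 0}
print(pds_step(state, graph))
print(sds_step(state, graph, order=[3, 2, 1, 0]))
```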

5.
We propose a new model for exact learning of acyclic circuits using experiments in which chosen values may be assigned to an arbitrary subset of wires internal to the circuit, but only the value of the circuit's single output wire may be observed. We give polynomial time algorithms to learn (1) arbitrary circuits with logarithmic depth and constant fan-in and (2) Boolean circuits of constant depth and unbounded fan-in over AND, OR, and NOT gates. Thus, both AC⁰ and NC¹ circuits are learnable in polynomial time in this model. Negative results show that some restrictions on depth, fan-in and gate types are necessary: exponentially many experiments are required to learn AND/OR circuits of unbounded depth and fan-in; it is NP-hard to learn AND/OR circuits of unbounded depth and fan-in 2; and it is NP-hard to learn circuits of constant depth and unbounded fan-in over AND, OR, and threshold gates, even when the target circuit is known to contain at most one threshold gate and that threshold gate has threshold 2. We also consider the effect of adding an oracle for behavioral equivalence. In this case there are polynomial-time algorithms to learn arbitrary circuits of constant fan-in and unbounded depth and to learn Boolean circuits with arbitrary fan-in and unbounded depth over AND, OR, and NOT gates. A corollary is that these two classes are PAC-learnable if experiments are available. Finally, we consider an extension of the model called the synchronous model. We show that an even more general class of circuits is learnable in this model. In particular, we are able to learn circuits with cycles.

6.
郑宇华  谢立 《计算机学报》1993,16(9):641-647
This paper proposes a new parallel inference mechanism, BTJ, which simultaneously supports restricted AND-parallelism and full OR-parallelism. Compared with other AND/OR parallel models, BTJ offers a high degree of parallelism at low run-time cost. Performance tests show that BTJ achieves good parallel speedups for both AND-parallelism and OR-parallelism.

7.
The paper introduces a neural network-based model of logical connectives. The basic processing unit consists of two types of generic OR and AND neurons structured into a three-layer topology. Because of its functional integrity, we refer to it as an OR/AND neuron. The specificity of the logical connectives is captured by the OR/AND neuron through supervised learning. Further analysis of the connections of the neuron obtained in this way provides better insight into the nature of the connectives applied in fuzzy sets by emphasizing their features of “locality” and interactivity. Afterward, we study several architectures of neural networks comprising these neurons as their basic functional components. The numerical studies embrace both structures formed by single OR/AND neurons aimed at modeling logical connectives (including the Zimmermann-Zysno data set, 1980) and networks representing various decision-making architectures. We also propose a realization of a pseudo-median filter in which the OR/AND neurons play an ultimate role.
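A minimal sketch of generic OR and AND neurons and their three-layer OR/AND combination, assuming a product t-norm and probabilistic-sum s-norm; the specific norms, weights and inputs are assumptions for illustration, not the learned connectives from the paper.

```python
# Generic fuzzy OR and AND neurons and a three-layer OR/AND neuron.
# With these norms, a larger weight makes an input more relevant in the
# OR neuron, while a smaller weight makes it more relevant in the AND neuron.

def t_norm(a, b):            # product t-norm
    return a * b

def s_norm(a, b):            # probabilistic sum
    return a + b - a * b

def or_neuron(x, w):
    """y = S_i t(x_i, w_i): OR-like aggregation of weighted inputs."""
    y = 0.0
    for xi, wi in zip(x, w):
        y = s_norm(y, t_norm(xi, wi))
    return y

def and_neuron(x, w):
    """y = T_i s(x_i, w_i): AND-like aggregation of weighted inputs."""
    y = 1.0
    for xi, wi in zip(x, w):
        y = t_norm(y, s_norm(xi, wi))
    return y

def or_and_neuron(x, w_and, w_or, v):
    """Hidden AND and OR neurons combined by an output OR neuron."""
    hidden = [and_neuron(x, w_and), or_neuron(x, w_or)]
    return or_neuron(hidden, v)

x = [0.7, 0.2, 0.9]          # hypothetical inputs and weights
print(or_and_neuron(x, w_and=[0.0, 0.1, 0.2], w_or=[0.9, 0.3, 0.8], v=[0.6, 0.7]))
```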

8.
We propose new shared memory multiprocessor architectures and evaluate their performance for future Internet protocol (IP) routers based on symmetric multiprocessor (SMP) and cache coherent nonuniform memory access (CC-NUMA) paradigms. We also propose a benchmark application suite, RouterBench, which consists of four categories of applications representing key functions on the time-critical path of packet processing in routers. An execution driven simulation environment is created to evaluate SMP and CC-NUMA router architectures using this RouterBench. The execution driven simulation can produce accurate cycle-level execution time prediction and reveal the impact of various architectural parameters on the performance of routers. We port the FUNET trace and its routing table for use in our experiments. We find that the CC-NUMA architecture provides an excellent scalability for design of high-performance IP routers. Results also show that the CC-NUMA architecture can sustain good lookup performance, even at a high frequency of route updates.

9.
For pt. I, see ibid., p. 170-80. In pt. I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and provide performance of the compiler on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube and a network of Sun workstations. The compiler is part of a machine-independent parallel Prolog development system built on top of a run-time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both OR parallelism and independent AND parallelism in Prolog programs and is suitable for execution on both shared and nonshared memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor and scales linearly with increasing numbers of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler.

10.
In conventional architectures, the central processing unit (CPU) spends a significant amount of execution time allocating and de-allocating memory. Efforts to improve memory management functions using custom allocators have led to only small improvements in performance. In this work, we test the feasibility of decoupling memory management functions from the main processing element to separate memory management hardware. Such memory management hardware can reside on the same die as the CPU, in a memory controller or embedded within a DRAM chip. Using Simplescalar, we simulated our architecture and investigated the execution performance of various benchmarks selected from SPECInt2000, Olden and other memory-intensive application suites.

The hardware allocator reduced the execution time of applications by as much as 50%. In fact, the decoupled hardware results in a performance improvement even when we assume that both the hardware and software memory allocators require the same number of cycles. We attribute much of this improved performance to improved cache behavior, since decoupling memory management functions reduces the cache pollution caused by dynamic memory management software. We anticipate that even higher levels of performance can be achieved by using innovative hardware and software optimizations. We do not show any specific implementation for the memory management hardware; this paper only investigates the potential performance gains that can result from a hardware allocator.


11.

Comparator is an essential building block in many digital circuits such as biometric authentication, data sorting, and exponent comparison in floating-point architectures, among others. Quantum-dot Cellular Automata (QCA) is an emerging nanotechnology that overcomes drawbacks of Complementary Metal Oxide Semiconductor (CMOS) technology. In this paper, a novel area-optimized 2n-bit comparator architecture is proposed. To achieve this objective, 1-bit stack-type and 4-bit tree-based stack-type (TB-ST) comparators are proposed using QCA. Then, two tree-based 4-bit comparator architectures are arranged in two layers to optimize the number of quantum cells and the area of an 8-bit comparator. Thus, this design can be extended to any 2n-bit comparator. Simulation results of the 4-bit and 8-bit comparators using QCADesigner 2.0.3 show a significant improvement in the number of quantum cells and area occupancy. The proposed TB-ST 8-bit comparator uses 2.5 clock cycles and 622 quantum cells with an area occupancy of 0.49 µm², an improvement of 10.5% and 38%, respectively, compared to existing designs. Scaled to a 32-bit comparator, the proposed architecture requires only 2675 quantum cells in an area of 2.05 µm² with a delay of 3.5 clock cycles, corresponding to 9.35% and 28.8% improvements, respectively, and demonstrating the merit of the proposed architecture. In addition, energy dissipation analysis of the proposed TB-ST 8-bit comparator using the QCADesigner-E tool indicates an average energy dissipation reduction of 17.3% compared to existing works.
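The tree-based composition underlying such comparators can be illustrated at the Boolean level: the sketch below merges per-bit (greater, less) flags pairwise, MSB first. It models only the logic of a tree-structured magnitude comparator, not the QCA cell layout, clocking or the area figures reported above; the operands are hypothetical.

```python
# Logic-level sketch of a tree-structured magnitude comparator.

def leaf(a_bit, b_bit):
    """(greater, less) flags for a single bit position."""
    return (a_bit & ~b_bit & 1, ~a_bit & b_bit & 1)

def combine(hi, lo):
    """Merge a more-significant half `hi` with a less-significant half `lo`."""
    gt_hi, lt_hi = hi
    gt_lo, lt_lo = lo
    eq_hi = 1 - (gt_hi | lt_hi)
    return (gt_hi | (eq_hi & gt_lo), lt_hi | (eq_hi & lt_lo))

def compare(a_bits, b_bits):
    """Tree reduction over bit positions, MSB first; returns (a>b, a<b)."""
    nodes = [leaf(a, b) for a, b in zip(a_bits, b_bits)]
    while len(nodes) > 1:
        nodes = [combine(nodes[i], nodes[i + 1]) if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return nodes[0]

a, b = [1, 0, 1, 1], [1, 0, 1, 0]     # MSB-first 4-bit operands: 11 vs 10
print(compare(a, b))                  # -> (1, 0): a > b
```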


12.
The complexity of today’s embedded applications increases with various requirements such as execution time, code size or power consumption. To satisfy these requirements, efficient instruction set design is one of the important issues, because an instruction customized for specific applications can deliver better performance than multiple general instructions in terms of execution time, code size, and power consumption. Limited encoding space, however, does not allow adding application-specific and complex instructions freely to the instruction set architecture. To resolve this problem, conventional architectures increase free encoding space by trimming excess bits beyond the fixed word length. This approach, however, shows severe weaknesses in terms of compiler complexity, code size and execution time. In this paper, we propose a new instruction encoding scheme based on the dynamic implied addressing mode (DIAM) to resolve the limited encoding space and the side effects of trimming. We describe two versions of architectures that support our DIAM-based approach. In the first version, we use a special on-chip memory to store extra encoding information. In the second version, we replace the memory with a small on-chip buffer along with a special instruction. We also suggest a code generation algorithm to fully utilize DIAM. In our experiments, the architecture augmented with DIAM shows about 8% code size reduction and 18% speed-up on average, compared to the basic architecture without DIAM.

13.
The problem of constructing a time-optimal schedule for job execution under logical precedence conditions is considered. Every job is associated with a list of its direct predecessors, its execution time, and the number of completed direct predecessors necessary to start the job. It is shown that the problem can be solved by the methods of the theory of cyclical games [1, 2]. We propose a pseudopolynomial algorithm for construction of the optimal schedule; i.e., the presented algorithm is efficient for problems with small data arrays. The paper extends the results from [3], where AND/OR schedules are considered: there, a job can be started when all its direct predecessors have been executed or when at least one of them has been completed.
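A hedged sketch of the threshold precedence rule described above: job j may start once need[j] of its direct predecessors have completed (need equal to the number of predecessors gives AND, need = 1 gives OR). The Dijkstra-like sweep over completion times and the job data are illustrative assumptions, not the paper's pseudopolynomial algorithm.

```python
import heapq
from collections import defaultdict

# Earliest-start computation under the threshold precedence rule,
# assuming nonnegative execution times.

def earliest_schedule(preds, duration, need):
    succs = defaultdict(list)
    for j, ps in preds.items():
        for p in ps:
            succs[p].append(j)
    done_count = defaultdict(int)
    finish = {}
    heap = [(duration[j], j) for j in preds if need[j] == 0]   # source jobs
    heapq.heapify(heap)
    while heap:
        f, j = heapq.heappop(heap)
        if j in finish:
            continue
        finish[j] = f                       # finish times settle in increasing order
        for s in succs[j]:
            done_count[s] += 1
            if done_count[s] == need[s] and s not in finish:
                heapq.heappush(heap, (f + duration[s], s))
    return finish                           # jobs missing here can never start

preds    = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"]}
duration = {"a": 3, "b": 5, "c": 2, "d": 4}
need     = {"a": 0, "b": 0, "c": 2, "d": 1}   # c waits for both, d for either
print(earliest_schedule(preds, duration, need))
```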

14.
In this paper, we propose a concept for multi-level reconfigurable architectures with more than two levels of reconfiguration, and study these architectures theoretically and experimentally. The proposed architectures are extensions of 2-level reconfigurable architectures in which the reconfiguration operations on the lowest level correspond to the reconfiguration operations of standard 1-level reconfigurable architectures, and the reconfigurable units are simple switches. It is shown that finding an optimal number of reconfiguration levels and a corresponding reconfiguration scheme that minimizes the number of reconfiguration bits for a given algorithm can be done in polynomial time. However, finding the optimal number of reconfiguration levels is NP-hard for heterogeneous multi-level architectures, where the number of reconfiguration levels varies across the reconfigurable units. Experimental results for different test applications show that 3–4 reconfiguration levels are optimal with respect to the number of reconfiguration bits needed. The number of reconfiguration bits is reduced by 35–86% compared to 1-level reconfiguration and by 8–34% compared to 2-level reconfiguration. The heterogeneous architecture reduces the number of necessary reconfiguration bits by an additional 1–5% and also needs fewer SRAM cells.

15.

Development in photonic integrated circuits (PICs) provides a promising solution for on-chip optical computation and communication. PICs offer a strong alternative to traditional networks-on-chip (NoC), which face serious challenges in bandwidth, latency and power consumption. Integrated optics has demonstrated the ability to accomplish low-power communication and low-power data processing at ultra-high speeds. In this work, we propose a new NoC architecture that may improve overall on-chip network performance by reducing power consumption, providing large channel capacity for communication, decreasing latency among nodes and reducing hop count. Key features of the proposed architecture are a reduced waveguide network for communication among nodes and the ability to serve as a building block for constructing other architectures. The architecture uses micro-ring resonators (MRRs) to provide high-bandwidth connections among nodes with fewer waveguides. Furthermore, results show that this PIC-based architecture provides better performance in terms of low communication latency, low power consumption and high bandwidth. It also achieves acceptable FSR, FWHM, finesse and Q-factor values for the micro-ring resonators used in the design.
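For context on the figures of merit mentioned above, the snippet below evaluates the textbook relations for an all-pass micro-ring resonator (FSR, FWHM, finesse, Q). These are standard formulas, not the paper's model, and the parameter values are hypothetical.

```python
import math

# Standard all-pass micro-ring resonator figures of merit:
#   FSR = lambda^2 / (n_g L)
#   FWHM = (1 - r a) lambda^2 / (pi n_g L sqrt(r a))
#   finesse = FSR / FWHM, Q = lambda / FWHM

def ring_metrics(wavelength_m, n_group, length_m, r_coupling, a_loss):
    fsr  = wavelength_m**2 / (n_group * length_m)
    fwhm = (1 - r_coupling * a_loss) * wavelength_m**2 / (
           math.pi * n_group * length_m * math.sqrt(r_coupling * a_loss))
    return {"FSR": fsr, "FWHM": fwhm, "finesse": fsr / fwhm, "Q": wavelength_m / fwhm}

# e.g. a 10 µm-radius ring at 1550 nm with group index 4.2 (hypothetical values)
m = ring_metrics(1550e-9, 4.2, 2 * math.pi * 10e-6, r_coupling=0.98, a_loss=0.99)
print({k: f"{v:.3e}" for k, v in m.items()})
```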


16.
Energy costs have become increasingly problematic for high-performance processors, but the rising number of on-chip cores offers promising opportunities for energy reduction. Further, emerging architectures such as heterogeneous multicores present new opportunities for improved energy efficiency. While previous work has presented novel memory architectures, multithreading techniques, and data mapping strategies for reducing energy, consideration of thread generation mechanisms that take data locality into account has been limited. This study presents methodologies for the joint partitioning of data and threads to parallelize sequential codes across an innovative heterogeneous multicore processor called the Passive/Active Multicore (PAM), reducing energy consumption from on-chip data transport and cache access components while also improving execution time. Experimental results show that the design with automatic thread partitioning offered reductions in energy-delay product (EDP) of up to 48%.

17.
In this paper, we present a family of three algorithms for checkpointing and rolling back in Time Warp. These algorithms are primarily intended for use in simulations with a large number of LPs in which events have a small computational granularity. Important representatives of this class are VLSI and computer network simulations. In each of our algorithms, LPs are gathered into clusters via application-dependent algorithms. To examine the performance of our algorithms and compare them to Time Warp, we used two of the largest digital logic circuits available from the ISCAS89 benchmark series of combinational circuits. The execution time, number of states saved, and maximal memory consumption were compared to the same quantities for Time Warp. Our results indicate that each of the algorithms occupies a different point in the spectrum of possible trade-offs between memory usage and execution time, ranging from substantial memory savings at a comparable cost in speed, to memory savings at a speed comparable to Time Warp. Hence, an important benefit of our algorithms is the ability to trade off memory requirements against execution time.
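The checkpoint/rollback mechanics that such algorithms build on can be sketched as follows: state is saved every k processed events, and a straggler forces a restore of the latest earlier checkpoint followed by coast-forward re-execution. This is a generic illustration, not the paper's clustered algorithms; the LP state and event handling are assumptions.

```python
# Infrequent checkpointing with rollback for a single logical process (LP).

class LogicalProcess:
    def __init__(self, checkpoint_interval):
        self.k = checkpoint_interval
        self.state = 0
        self.processed = []                 # (timestamp, value) in processed order
        self.checkpoints = [(0.0, 0, 0)]    # (time, saved state, events processed)

    def handle(self, ts, value):
        self.state += value                 # stand-in for real event processing
        self.processed.append((ts, value))
        if len(self.processed) % self.k == 0:
            self.checkpoints.append((ts, self.state, len(self.processed)))

    def rollback(self, straggler_ts):
        # restore the latest checkpoint taken at or before the straggler's time
        _, s, done = max(c for c in self.checkpoints if c[0] <= straggler_ts)
        self.checkpoints = [c for c in self.checkpoints if c[0] <= straggler_ts]
        replay = self.processed[done:]
        self.state, self.processed = s, self.processed[:done]
        # coast forward over already-known events up to the straggler's time
        for ts, value in replay:
            if ts <= straggler_ts:
                self.handle(ts, value)
        # events after the straggler must be re-executed later
        return [e for e in replay if e[0] > straggler_ts]

lp = LogicalProcess(checkpoint_interval=2)
for ts in (1.0, 2.0, 3.0, 4.0, 5.0):
    lp.handle(ts, value=1)
print(lp.rollback(straggler_ts=2.5), lp.state)   # events after 2.5 returned; state restored
```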

18.

Powerlists are data structures that can be successfully used for defining parallel programs based on the divide-and-conquer paradigm. These parallel recursive data structures and their algebraic theories offer both a methodology to design parallel algorithms and parallel programming abstractions that ease the development of parallel applications. The paper presents a technique for speeding up parallel recursive programs defined over powerlists. The improvements are achieved by applying transformation rules that introduce tuple functions and prefix operators, for which a more efficient execution model is defined. Together with the execution model, a cost model is also defined to allow a proper evaluation. The treated examples emphasise that the transformation leads to important improvements in the programs. The speed-up is achieved by reducing the number of recursive calls, and also by enabling the fusion of splitting/combining operations on different data structures. In addition, extending the function to be computed to a tuple of other useful functions can improve the cost reduction even further.
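To make the powerlist vocabulary concrete, the sketch below shows the two standard decompositions (tie into halves, zip into even/odd interleaves) and the tupling idea mentioned above: computing two results in a single recursive traversal instead of two. The min/max example and the list data are illustrative assumptions.

```python
# Powerlist-style decompositions and a tupled divide-and-conquer traversal.

def untie(p):                      # p = u | v  (left and right halves)
    h = len(p) // 2
    return p[:h], p[h:]

def unzip(p):                      # p = zip(u, v): even- and odd-indexed elements
    return p[0::2], p[1::2]

def min_max(p):
    """Tupled divide-and-conquer: one traversal returns (min, max)."""
    if len(p) == 1:
        return p[0], p[0]
    u, v = untie(p)
    (min_u, max_u), (min_v, max_v) = min_max(u), min_max(v)
    return min(min_u, min_v), max(max_u, max_v)

data = [7, 2, 9, 4, 1, 8, 6, 3]    # powerlists have length 2^n
print(min_max(data))               # -> (1, 9)
print(unzip(data))                 # -> ([7, 9, 1, 6], [2, 4, 8, 3])
```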


19.
The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. In this paper we examine the numerical solution of an elliptic partial differential equation in order to study the relationship between problem size and architecture. The equation's domain is discretized into n² grid points which are divided into partitions and mapped onto the individual processor memories. We analytically quantify the relationships among grid size, stencil type, partitioning strategy, processor execution time, and communication network type. In doing so, we determine the optimal number of processors to assign to the solution (and hence the optimal speedup), and identify (i) the smallest grid size which fully benefits from using all available processors, (ii) the leverage on performance given by increasing processor speed or communication network speed, and (iii) the suitability of various architectures for large numerical problems. Finally, we compare the predictions of our analytic model with measurements from a multiprocessor and find that the model accurately predicts performance.
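A hedged cost-model sketch in the spirit of the analysis above, not the paper's exact model: per-iteration time for a 5-point stencil on an n × n grid split into p square blocks, with halo exchange and a global reduction for the convergence test. All machine parameters are hypothetical.

```python
import math

# Simple per-iteration cost model and a brute-force search for the
# processor count that minimizes it.

def iteration_time(n, p, t_flop, t_word, t_startup):
    compute = 5 * t_flop * n * n / p                       # stencil updates
    if p == 1:
        return compute
    halo = 4 * (t_startup + t_word * n / math.sqrt(p))     # boundary exchange
    reduce_ = t_startup * math.ceil(math.log2(p))          # convergence check
    return compute + halo + reduce_

def best_processor_count(n, p_max, **costs):
    return min(range(1, p_max + 1), key=lambda p: iteration_time(n, p, **costs))

costs = dict(t_flop=1e-9, t_word=1e-8, t_startup=1e-5)     # hypothetical machine
for n in (128, 512, 2048):
    p = best_processor_count(n, p_max=1024, **costs)
    speedup = iteration_time(n, 1, **costs) / iteration_time(n, p, **costs)
    print(f"n={n:5d}: best p={p:4d}, speedup={speedup:7.1f}")
```

Under these assumed costs, small grids gain little from extra processors while large grids support many more, which mirrors the trade-off the abstract analyzes.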

20.
Field-programmable gate arrays (FPGAs) are being integrated with processors on the same motherboard or even the same chip in order to achieve flexible high-performance computing, and this may become mainstream in chip multi-core architectures. However, the expensive FPGA area is often used inefficiently, with much of the logic idle at any given time. This work, motivated by the Dynamic-Link Library (DLL) concept in software, explores the possibility of “hardware DLLs” by finding ways for fast dynamic incremental reconfiguration of FPGAs. Doing so would, among other things, enable same-function replication at any given time, with functions changing quickly over time, thereby enabling efficient exploitation of data parallelism at no additional hardware cost. We present two new multi-context FPGA architectures based on two different configuration storage architectures: local and centralized. Problems such as configuration storage and reconfiguration overhead (time, power and space) are considered. Well-known area and power models are used to evaluate the various approaches and to provide guidelines for matching architectures to target applications. Lastly, we provide insights into the resulting scheduling issues. Our findings provide the foundation and “rules of the game” for subsequent development of reconfiguration schedulers and execution environments.

