Similar Literature
20 similar documents found (search time: 0 ms)
1.
An introduction to neural networks and neural information processing is provided. Neurocomputers are discussed, focusing on how their design exploits the architectural properties of VLSI circuits. General-purpose and special-purpose neurocomputer developments throughout the world are examined. As an illustration, and to put European developments in perspective, some of the important projects in the United States and Japan are described. European research is then discussed in greater detail.

2.
Electron Repulsion Integrals (ERIs) are a common bottleneck in ab initio computational chemistry. It is known that sorted/reordered execution of ERIs results in efficient SIMD/vector processing. This paper shows that reconfigurable computing and heterogeneous processor architectures can also benefit from a deliberate ordering of ERI tasks. However, realizing these benefits as a net speedup requires a very rapid sorting mechanism. This paper presents two such mechanisms. Included in this study are analytical, simulation-based, and experimental benchmarking approaches covering five use cases for ERI sorting: SIMD processing, reconfigurable computing, limited address spaces, instruction cache exploitation, and data cache exploitation. Specific consideration is given to existing cache-based processors, FPGAs, and the Cell Broadband Engine processor. It is proposed that the analyses conducted in this work be built upon to aid the development of software autotuners that produce efficient ab initio computational chemistry codes for a variety of computer architectures.
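As a rough illustration of why ordering matters, the sketch below (Python, not from the paper; the task representation, the shell_class key, and the batch width are illustrative assumptions) buckets ERI tasks by integral class so that each fixed-width batch follows a single code path, as SIMD lanes require. A bucketing pass like this is O(n) in the number of tasks, the kind of very rapid mechanism the paper calls for.

```python
from collections import defaultdict

def sort_eri_tasks(tasks, key=lambda t: t["shell_class"], batch_width=8):
    """Bucket ERI tasks by integral class (O(n) pass) so that each
    fixed-width batch executes one uniform code path on a SIMD unit."""
    buckets = defaultdict(list)
    for task in tasks:
        buckets[key(task)].append(task)
    # Emit fixed-width batches of same-class tasks for the SIMD lanes.
    batches = []
    for cls in sorted(buckets):
        group = buckets[cls]
        for i in range(0, len(group), batch_width):
            batches.append((cls, group[i:i + batch_width]))
    return batches
```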

3.
Depth recovery from gray-scale images is an important topic in the field of computer and robot vision. Intensity gradient analysis (IGA) is a robust technique for inferring depth information from a sequence of images acquired by a sensor undergoing translational motion. IGA obviates the need to explicitly solve the correspondence problem and hence is an efficient technique for range estimation. Many applications require real-time processing at very high frame rates, and special-purpose hardware can significantly speed up the computations in IGA. In this paper, we propose two VLSI architectures for high-speed range estimation based on IGA. The architectures fully exploit the principles of pipelining and parallelism in order to obtain high speed and throughput. The designs are conceptually simple and suitable for VLSI implementation.
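A minimal sketch of the gradient-based depth idea, assuming pure lateral translation with known speed and a pinhole camera; the names below are illustrative and the paper's actual IGA formulation may differ. Brightness constancy (I_x u + I_t = 0) yields the image velocity directly from intensity gradients, and the pinhole model u = -f v_x / Z then gives depth with no correspondence search.

```python
import numpy as np

def iga_depth(frames, dt, vx, focal_px, eps=1e-6):
    """Per-pixel depth from an image pair under known lateral translation.

    Brightness constancy: Ix*u + It = 0  =>  u = -It/Ix   (pixels/s)
    Pinhole model:        u = -focal_px*vx/Z  =>  Z = -focal_px*vx/u
    """
    I0, I1 = frames
    Ix = np.gradient(I0.astype(float), axis=1)   # spatial gradient
    It = (I1.astype(float) - I0) / dt            # temporal gradient
    u = -It / (Ix + eps)                         # image velocity
    Z = -focal_px * vx / (u + eps)               # depth; reliable only where |Ix| is large
    return Z
```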

4.
Lee K.H., Leung K.S., Cheang S.M. Micro, IEEE, 1990, 10(4): 50-61
A low-cost alternative to using a full-scale Lisp machine for list processing is proposed. The ASLP, a PC-based, highly pipelined list processor with two memory modules, supplements a procedural programming language to solve the list-manipulation problems in AI programs. This hardware-assisted processor was designed for use in an IBM PC AT to support list-manipulation functions. A discussion of Lisp and its internal structure and a review of existing Lisp machines are included. An estimation of the theoretical execution time for some typical list-manipulation functions, expressed in terms of machine cycles and memory cycles, is presented.
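For context, a sketch of the cons-cell representation whose traversal such processors accelerate; this is generic Lisp bookkeeping written in Python, not the ASLP's actual memory layout or instruction set.

```python
class Cons:
    """A Lisp cons cell: a car/cdr pair; lists are chains ending in None."""
    __slots__ = ("car", "cdr")
    def __init__(self, car, cdr=None):
        self.car, self.cdr = car, cdr

def from_list(xs):
    """Build a cons chain from a Python list."""
    head = None
    for x in reversed(xs):
        head = Cons(x, head)
    return head

def length(cell):
    # The pointer-chasing loop that dominates list manipulation; two
    # memory modules plausibly allow such fetches to be pipelined.
    n = 0
    while cell is not None:
        n, cell = n + 1, cell.cdr
    return n
```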

5.
Most Western Governments (USA, Japan, EEC, etc.) have now launched national programmes to develop computer systems for use in the 1990s. These so-called Fifth Generation computers are viewed as “knowledge” processing systems which support the symbolic computation underlying Artificial Intelligence applications. The major driving force in Fifth Generation computer design is to efficiently support very high level programming languages (i.e. VHLL architecture).

Historically, however, commercial VHLL architectures have been largely unsuccessful. The driving force in computer design has principally been advances in hardware, which at present means architectures that exploit very large scale integration (i.e. VLSI architecture).

This paper examines VHLL architectures and VLSI architectures and their probable influence on Fifth Generation computers. Interestingly, the major problem for both architecture classes is parallelism: how to orchestrate a single parallel computation so that it can be distributed across an ensemble of processors.


6.
The authors present a compile-time scheduling heuristic called dynamic level scheduling, which accounts for interprocessor communication overhead when mapping precedence-constrained, communicating tasks onto heterogeneous processor architectures with limited or possibly irregular interconnection structures. The technique uses dynamically changing priorities to match tasks with processors at each step, and schedules over both spatial and temporal dimensions to eliminate shared-resource contention. The method is fast, flexible, widely targetable, and displays promising performance.
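A sketch of one step of the dynamic-level idea: pair the ready task and processor that maximize static level minus earliest start time, where the start time folds in communication delay. The helper callables below (static_level, data_ready, proc_free) are assumed interfaces for illustration, not the authors' API.

```python
def dynamic_level_schedule(ready, procs, static_level, data_ready, proc_free):
    """One step of dynamic level scheduling (sketch).

    static_level(t): longest path from task t to a sink (fixed priority)
    data_ready(t, p): time all of t's inputs reach processor p, incl. comm delay
    proc_free(p):     time p finishes its currently assigned work
    """
    best = None
    for t in ready:
        for p in procs:
            start = max(data_ready(t, p), proc_free(p))
            dl = static_level(t) - start      # the "dynamic level"
            if best is None or dl > best[0]:
                best = (dl, t, p, start)
    _, task, proc, start = best
    return task, proc, start                  # schedule task on proc at start
```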

7.
An efficient processor allocation policy is presented for hypercube computers. The policy is called free list, since it maintains a list of the free subcubes available in the system. An incoming request of dimension k (2^k nodes) is served by finding a free subcube of dimension k or by decomposing an available subcube of dimension greater than k. The free list policy uses a top-down allocation rule, in contrast to the bottom-up approach of previous bit-map allocation algorithms. The scheme is compared to the buddy, gray code (GC), and modified buddy allocation policies reported for hypercubes. It is shown that the free list policy is optimal in a static environment, as are the other policies, and that it also gives better subcube recognition ability than the previous schemes in a dynamic environment. The performance of the policy, in terms of parameters such as average delay, system utilization, and time complexity, is compared with the other schemes to demonstrate its effectiveness. Extensions of the algorithm for parallel implementation, noncubic allocation, and inclusion/exclusion allocation are also given.
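A sketch of the top-down free-list allocation, under the simplifying assumption that a free subcube is encoded as (base address, dimension) with its free dimensions in the low-order bits; real subcube addressing is more general, and the split helper is illustrative.

```python
def split(cube, d):
    """Split a dimension-(d+1) subcube into two dimension-d halves along bit d."""
    base, _ = cube
    return (base, d), (base | (1 << d), d)

def allocate(free_lists, k):
    """Top-down free-list subcube allocation (sketch).

    free_lists[d] lists the free subcubes of dimension d.  A request for
    2**k nodes is served directly from free_lists[k] if possible; otherwise
    the smallest larger free subcube is decomposed top-down, freeing the
    unused half at each level.
    """
    if free_lists.get(k):
        return free_lists[k].pop()
    bigger = min((d for d in free_lists if d > k and free_lists[d]), default=None)
    if bigger is None:
        return None                            # request cannot be satisfied
    cube = free_lists[bigger].pop()
    for d in range(bigger - 1, k - 1, -1):
        cube, other = split(cube, d)
        free_lists.setdefault(d, []).append(other)
    return cube

# Example: a free 4-cube serves a 2-cube request, leaving a 3-cube and a 2-cube free.
free_lists = {4: [(0, 4)]}
print(allocate(free_lists, 2), free_lists)
```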

8.
We describe new architectures for the efficient computation of redundant manipulator kinematics (direct and inverse). By calculating the core of the problem in hardware, we can make full use of the redundancy by implementing more complex self-motion algorithms. A key component of our architecture is the calculation in VLSI hardware of the Singular Value Decomposition of the manipulator Jacobian. Recent advances in VLSI have allowed complex algorithms to be mapped to hardware using systolic arrays with advanced computer arithmetic algorithms, such as the coordinate rotation (CORDIC) algorithms. We use CORDIC arithmetic in the novel design of our special-purpose VLSI array, which is used in the computation of the Direct Kinematics Solution (DKS), the manipulator Jacobian, and the Jacobian pseudoinverse. Application-specific (subtask-dependent) portions of the inverse kinematics are handled in parallel by a DSP processor that interfaces with the custom hardware and the host machine. The architecture and algorithm development are valid for general redundant manipulators and for a wide range of processors currently available or under commercial development.
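For reference, a floating-point sketch of the CORDIC rotation primitive on which such systolic SVD/Jacobian arrays are built; hardware versions use fixed-point shift-and-add iterations, and this sketch is valid only inside CORDIC's convergence range (|theta| up to about 1.74 rad).

```python
import math

def cordic_rotate(x, y, theta, n=32):
    """Rotate (x, y) by theta using only shifts, adds, and a small
    table of arctangents -- the CORDIC circular rotation mode."""
    K = 1.0
    for i in range(n):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))   # accumulated gain correction
    z = theta
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0             # steer residual angle to zero
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        z -= d * math.atan(2.0 ** (-i))
    return x * K, y * K

# Example: rotating (1, 0) by 30 degrees gives approximately (0.866, 0.5).
print(cordic_rotate(1.0, 0.0, math.radians(30)))
```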

9.
Eckardt H. Computing, 1994, 53(1): 13-31
I/O in computer systems is prone to become a bottleneck. This is a particularly severe problem in highly parallel machines, where some applications are fully I/O bound if only one or few...

10.
11.
The size of the program code has become a critical design constraint in embedded systems, especially in handheld devices. Large program codes require large memories, which increase the size and cost of the chip. In addition, power consumption rises due to the higher memory I/O bandwidth. Program compression is one of the most commonly used methods to reduce code size. In this paper, dictionary-based program compression is evaluated on a customizable processor architecture with parallel resources. In addition to code density, the effectiveness of the method is evaluated in terms of area and power consumption. Furthermore, a mechanism is proposed to maintain programmability after compression. Up to 77% reduction in area and 73% reduction in power consumption of the program memory and the associated control logic were obtained.
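A sketch of the generic full-dictionary variant of dictionary-based compression (the paper's exact scheme may differ): each distinct instruction word is stored once, the program memory holds only short indices, and decompression is a single table lookup per fetch.

```python
def dict_compress(instr_words):
    """Full-dictionary program compression (sketch).

    Wide parallel instruction words compress well when only a few
    distinct resource combinations occur: the index width is
    ceil(log2(len(dictionary))) bits instead of the full word width.
    """
    dictionary, index, code = [], {}, []
    for w in instr_words:
        if w not in index:
            index[w] = len(dictionary)   # first occurrence: add to dictionary
            dictionary.append(w)
        code.append(index[w])            # program image stores only the index
    return dictionary, code

# Example: four fetches, two distinct words -> a 2-entry dictionary.
print(dict_compress(["nop|add", "ld|add", "nop|add", "nop|add"]))
```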

12.
An adaptive electronic neural network processor has been developed for high-speed image compression based on a frequency-sensitive self-organization algorithm. The performance of this self-organizing network is compared with that of a conventional algorithm for vector quantization. The proposed method is quite efficient and can achieve near-optimal results. The neural network processor includes a pipelined codebook generator and a parallel vector quantizer, which achieves a time complexity of O(1) for each quantization vector. A mixed-signal design technique is used, with analog circuitry to perform the neural computation and digital circuitry to process the multiple-bit address information. A prototype chip for a 25-D adaptive vector quantizer of 64 codewords was designed, fabricated, and tested. It occupies a silicon area of 4.6 mm × 6.8 mm in a 2.0 μm scalable CMOS technology and provides a computing capability as high as 3.2 billion connections/s. Experimental results for the chip and for the winner-take-all circuit test structure are presented.
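A software sketch of frequency-sensitive competitive learning, the self-organization rule the abstract refers to; the learning rate, epoch count, and the exact form of the win-count penalty are illustrative assumptions, and data is assumed to be an (N, d) NumPy array.

```python
import numpy as np

def fscl_train(data, n_codes=64, epochs=10, lr=0.1, seed=0):
    """Frequency-sensitive self-organization (sketch): the winner
    minimizes win_count * distance, so frequently winning codewords
    are penalized and every codeword gets trained."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), n_codes, replace=False)].astype(float)
    counts = np.ones(n_codes)
    for _ in range(epochs):
        for x in data:
            d = np.linalg.norm(codebook - x, axis=1)
            w = np.argmin(counts * d)              # frequency-sensitive distortion
            codebook[w] += lr * (x - codebook[w])  # move winner toward the input
            counts[w] += 1
    return codebook
```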

13.
We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and an additional 3–4 and 4–16 times faster execution on the Cell and a GPU, respectively. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all the necessary infrastructure for the development and evaluation of such kernels, significantly simplifying their development.

Program summary

Program title: HONEI
Catalogue identifier: AEDW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPLv2
No. of lines in distributed program, including test data, etc.: 216 180
No. of bytes in distributed program, including test data, etc.: 1 270 140
Distribution format: tar.gz
Programming language: C++
Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3
Operating system: Linux
RAM: at least 500 MB free
Classification: 4.8, 4.3, 6.1
External routines: SSE: none; [1] for GPU, [2] for Cell backend
Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the underlying hardware towards heterogeneity and parallelism. This is particularly relevant for data-intensive problems stemming from discretisations with local support, such as finite differences, volumes and elements.
Solution method: To address these issues, we present a hardware-aware collection of libraries combining the advantages of modern software techniques and hardware-oriented programming. Applications built on top of these libraries can be configured trivially to execute on CPUs, GPUs or the Cell processor. In order to evaluate the performance and accuracy of our approach, we provide two domain-specific applications: a multigrid solver for the Poisson problem and a fully explicit solver for 2D shallow water equations.
Restrictions: HONEI is actively being developed, and its feature list is continuously expanded. Not all combinations of operations and architectures might be supported in earlier versions of the code. Obtaining snapshots from http://www.honei.org is recommended.
Unusual features: The considered applications as well as all library operations can be run on NVIDIA GPUs and the Cell BE.
Running time: Depends on the application and the input sizes. The Poisson solver executes in a few seconds, while the SWE solver requires up to 5 minutes for large spatial discretisations or small timesteps.
References:
  • [1] 
    http://www.nvidia.com/cuda.
  • [2] 
    http://www.ibm.com/developerworks/power/cell.

14.
15.
16.
The implementation of a Hough transform processor using a wafer-scale-integration technology, restructurable VLSI circuit is described. The Hough transform is typically used as a grouping operation in an image-processing sequence. The transform discussed here groups pixels in order to extract linear features. The calculation is realized with a wafer-scale processor that allows a complete line-extraction system to be integrated on a single PC board. Also discussed is the use of the CAD tools that allowed this processor to be realized without incurring silicon layout and processing overhead.
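A minimal software sketch of the line-extracting Hough transform that such a processor implements in hardware: each edge pixel votes for every (rho, theta) line passing through it, and peaks in the accumulator correspond to linear features. The parameter counts here are illustrative, not the wafer-scale design's.

```python
import numpy as np

def hough_lines(edge_pixels, shape, n_theta=180):
    """Accumulate Hough votes for lines rho = x*cos(theta) + y*sin(theta)."""
    h, w = shape
    thetas = np.deg2rad(np.arange(n_theta))
    diag = int(np.ceil(np.hypot(h, w)))            # max possible |rho|
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int32)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    for y, x in edge_pixels:
        # One vote per theta bin; offset rho so indices are non-negative.
        rho = np.round(x * cos_t + y * sin_t).astype(int) + diag
        acc[rho, np.arange(n_theta)] += 1
    return acc, thetas                              # peaks in acc = extracted lines
```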

17.
Computer Networks, 2003, 41(5): 641-665
The designs of most systems-on-a-chip (SoC) architectures rely on simulation as a means for performance estimation. Such designs usually start with a parameterizable template architecture, and design space exploration is restricted to identifying suitable parameters for all the architectural components. However, in the case of heterogeneous SoC architectures such as network processors, the design space exploration also involves a combinatorial aspect (which architectural components to choose, how to interconnect them, and how to map tasks onto them), thereby enlarging the design space. Moreover, in the case of network processor architectures there is also an associated uncertainty in terms of the application scenario and the traffic the processor will be required to handle. As a result, simulation is no longer a feasible option for evaluating such architectures in any automated or semi-automated design space exploration process, owing to the high simulation times involved. To address this problem, we hypothesize in this paper that the design space exploration for network processors should be separated into multiple stages, each with a different level of abstraction. Further, it is appropriate to use analytical evaluation frameworks during the initial stages and to resort to simulation techniques only when a relatively small set of potential architectures has been identified. None of the known performance evaluation methods for network processors has been positioned from this perspective. We show that there are already suitable analytical models for network processor performance evaluation which may be used to support our hypothesis. To this end, we choose a reference system-level model of a network processor architecture and compare its performance evaluation results derived using a known analytical model [Thiele et al., Design space exploration of network processor architectures, in: Proc. 1st Workshop on Network Processors, Cambridge, MA, February 2002; Thiele et al., A framework for evaluating design tradeoffs in packet processing architectures, in: Proc. 39th Design Automation Conference (DAC), New Orleans, USA, ACM Press, 2002] with results derived by detailed simulation. Based on this comparison, we propose a scheme for the design space exploration of network processor architectures in which both analytical performance evaluation techniques and simulation techniques have unique roles to play.

18.
In this contribution, the concept of functional-level power analysis (FLPA) for power estimation of programmable processors is extended to model embedded as well as heterogeneous processor architectures featuring different embedded processor cores. The basic FLPA approach separates the processor architecture into functional blocks such as the processing unit, clock network, and internal memory. The power consumption of these blocks is described by parameterized arithmetic models. A parser-based automated analysis of assembler code computes the input parameters of these arithmetic functions, such as the achieved degree of parallelism or the kind and number of memory accesses. For modeling an embedded general-purpose processor (here, an ARM940T), the basic FLPA modeling concept had to be extended to a so-called hybrid functional-level and instruction-level (FLPA/ILPA) model in order to achieve good modeling accuracy. To show the applicability of this approach, even a heterogeneous processor architecture (OMAP5912) featuring an ARM926EJ-S core and a C55x DSP core has been modeled using the hybrid FLPA/ILPA technique. The approach is demonstrated and evaluated on a variety of basic digital signal processing tasks, ranging from basic filters to complete audio decoders and classical benchmark suites. Estimated power figures for the inspected tasks are compared with physically measured values for both processor architectures. A maximum estimation error of 9% for the ARM940T and less than 4% for the OMAP5912 is achieved.
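A toy sketch of the basic FLPA decomposition, P_total = sum over blocks b of P_b(params), with per-block parameterized models; the block set matches the abstract, but the model forms and coefficients below are made-up placeholders, not the authors' fitted values.

```python
def flpa_power(block_models, params):
    """Functional-level power analysis (sketch): total power is the sum
    of per-block parameterized models, where the parameters (degree of
    parallelism, memory-access counts, ...) come from an automated
    parse of the assembler code."""
    return sum(model(params) for model in block_models.values())

# Illustrative block models; all coefficients are placeholders (mW).
block_models = {
    "processing_unit": lambda p: 12.0 * p["parallelism"],
    "clock_network":   lambda p: 8.5,                     # roughly constant
    "internal_memory": lambda p: 20.0 * p["mem_accesses_per_cycle"],
}
print(flpa_power(block_models, {"parallelism": 2.4, "mem_accesses_per_cycle": 0.6}))
```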

19.
This paper describes the programming language Actus II, which has evolved from the Pascal-based parallel language Actus and has also been influenced by the architecture of array processors. The language facilitates the construction of parallel algorithms in a notation that is independent of the underlying architecture. Work on the implementation of a compiler for the ICL Distributed Array Processor (DAP) is currently under way, and some aspects of this implementation are described.

20.