
1.

CP-PACS: A massively parallel processor at the University of Tsukuba (Total citations: 1, self-citations: 0, citations by others: 1)

Kisaburo Nakazawa Hiroshi Nakamura Taisuke Boku Ikuo Nakata Yoshiyuki Yamashita 《Parallel Computing》1999,25(13-14):1635-1661

Computational Physics by Parallel Array Computer System (CP-PACS) is a massively parallel processor developed and in full operation at the Center for Computational Physics at the University of Tsukuba. It is an MIMD machine with a distributed memory, equipped with 2048 processing units and 128 GB of main memory. The theoretical peak performance of CP-PACS is 614.4 Gflops. CP-PACS achieved 368.2 Gflops with the Linpack benchmark in 1996, which at that time was the fastest Gflops rating in the world. CP-PACS has two remarkable features: a Pseudo Vector Processing feature (PVP-SW) on each node processor, which can perform high-speed vector processing on a single-chip superscalar microprocessor; and a 3-dimensional Hyper-Crossbar (3-D HXB) interconnection network, which provides high-speed and flexible communication among node processors. In this article, we present an overview of CP-PACS, the architectural topics, some details of hardware and support software, and several performance results.
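The figures quoted in this abstract can be cross-checked with a little arithmetic. The following is only a sketch: it assumes that peak performance and main memory are divided evenly across the 2048 processing units, which the abstract does not state, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
PROCESSING_UNITS = 2048
PEAK_GFLOPS = 614.4
LINPACK_GFLOPS = 368.2
MAIN_MEMORY_GB = 128

# Assumption: peak performance and memory are spread evenly across units.
peak_per_unit_gflops = PEAK_GFLOPS / PROCESSING_UNITS
memory_per_unit_mb = MAIN_MEMORY_GB * 1024 / PROCESSING_UNITS

# Fraction of theoretical peak reached by the Linpack run.
linpack_efficiency = LINPACK_GFLOPS / PEAK_GFLOPS

print(f"peak per unit:      {peak_per_unit_gflops * 1000:.0f} Mflops")
print(f"memory per unit:    {memory_per_unit_mb:.0f} MB")
print(f"Linpack efficiency: {linpack_efficiency:.1%}")
```

Under that even-split assumption, each unit peaks at about 300 Mflops with 64 MB of memory, and the Linpack result corresponds to roughly 60% of theoretical peak.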

2.

Achieving supercomputer performance for neural net simulation with an array of digital signal processors

Muller U.A. Baumle B. Kohler P. Gunzinger A. Guggenbuhl W. 《Micro, IEEE》1992,12(5):55-65

Music, a digital signal processor (DSP)-based system with a parallel distributed-memory architecture that provides enormous computing power yet retains the flexibility of a general-purpose computer, is discussed. It is shown that Music reaches a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers. The Music system hardware, programming, and backpropagation implementation are described.

3.

Shared-Memory Parallel Vector Implementation of the Immersed Boundary Method for the Computation of Blood Flow in the Beating Mammalian Heart (Total citations: 3, self-citations: 0, citations by others: 3)

McQueen David Peskin Charles 《The Journal of supercomputing》1997,11(3):213-236

This paper describes the parallel implementation of the immersed boundary method on a shared-memory machine such as the Cray C-90 computer. In this implementation, outer loops are parallelized and inner loops are vectorized. The sustained computation rates achieved are 0.258 Gflops with a single processor, 1.89 Gflops with 8 processors, and 2.50 Gflops with 16 processors. An application to the computer simulation of blood flow in the heart is presented.
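The three sustained rates quoted above are enough to work out parallel speedup and efficiency. The sketch below uses the standard definitions (speedup = rate on p processors / rate on one processor, efficiency = speedup / p); the function name is illustrative and nothing here comes from the paper beyond the three rates.

```python
# Sustained rates quoted in the abstract (Gflops), keyed by processor count.
rates = {1: 0.258, 8: 1.89, 16: 2.50}

def speedup_and_efficiency(rates):
    """Map processor count -> (speedup, efficiency) relative to one processor."""
    base = rates[1]
    result = {}
    for processors, rate in sorted(rates.items()):
        speedup = rate / base
        result[processors] = (speedup, speedup / processors)
    return result

for processors, (speedup, efficiency) in speedup_and_efficiency(rates).items():
    print(f"{processors:2d} processors: speedup {speedup:5.2f}, "
          f"efficiency {efficiency:6.1%}")
```

With these figures, efficiency is about 92% on 8 processors and about 61% on 16, so the gain from doubling the processor count is well short of linear.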

4.

Hardware accelerator for molecular dynamics: MDGRAPE-2

Ryutaro Susukita Toshikazu Ebisuzaki Hideaki Furusawa Atsushi Kawai Takahiro Koishi Tetsu Narumi 《Computer Physics Communications》2003,155(2):115-131

We developed MDGRAPE-2, a hardware accelerator that calculates forces at high speed in molecular dynamics (MD) simulations. MDGRAPE-2 is connected to a PC or a workstation as an extension board. The sustained performance of one MDGRAPE-2 board is 15 Gflops, roughly equivalent to the peak performance of the fastest supercomputer processing element. One board is able to calculate all forces between 10 000 particles in 0.28 s (i.e. 310 000 time steps per day). If 16 boards are connected to one computer and operated in parallel, this calculation speed becomes ∼10 times faster. In addition to MD, MDGRAPE-2 can be applied to gravitational N-body simulations, the vortex method and smoothed particle hydrodynamics in computational fluid dynamics.
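The "time steps per day" figure follows directly from the 0.28 s per force calculation. The sketch below reproduces it and also estimates the pair-evaluation rate, assuming every ordered pair of distinct particles is evaluated once per step; that counting convention is an assumption, not something the abstract states, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def steps_per_day(seconds_per_step):
    """Number of force calculations that fit in one day."""
    return SECONDS_PER_DAY / seconds_per_step

def pair_evaluations_per_second(n_particles, seconds_per_step):
    # Assumption: each ordered pair (i, j) with i != j is evaluated once per step.
    return n_particles * (n_particles - 1) / seconds_per_step

print(f"steps per day:    {steps_per_day(0.28):,.0f}")
print(f"pair evaluations: {pair_evaluations_per_second(10_000, 0.28):.3e} per second")
```

The first line gives about 308 571 steps per day, which the abstract rounds to 310 000.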

5.

Better than $1/Mflops sustained: a scalable PC-based parallel computer for lattice QCD

Zoltán Fodor Sándor D. Katz Gábor Papp 《Computer Physics Communications》2003,152(2):121-134

We study the feasibility of a PC-based parallel computer for medium to large scale lattice QCD simulations. The Eötvös Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD, respectively (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware costs are spent on the communication). According to our benchmark measurements this type of communication results in around 40% communication time fraction for lattices up to 48³·96 in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size that can fit in our parallel computer. The communication software is freely available upon request for non-profit organizations.

6.

Parallel Algorithm Design on Some Distributed Systems (Total citations: 3, self-citations: 0, citations by others: 3)

Sun Jiachang Chi Xuebin Cao Jianwen Zhang Linbo 《计算机科学技术学报》1997,12(2):97-104

Some testing results on DAWNING-1000, Paragon, and a workstation cluster are described in this paper. On the home-made parallel system DAWNING-1000 with 32 computational processors, practical performance of 1.1777 Gflops and 1.58 Gflops has been measured in solving a dense linear system and in matrix multiplication, respectively. The scalability is also investigated. The importance of designing efficient parallel algorithms for evaluating parallel systems is emphasized.

7.

Lattice QCD results from the CP-PACS computer

《Parallel Computing》1999,25(10-11):1257-1280

We present physics results from large-scale simulations of Quantum Chromodynamics (QCD) on the space-time lattice carried out with the CP-PACS computer. The CP-PACS is a massively parallel system with a peak speed of 614 Gflops and 320 Gbyte of main memory developed at the Center for Computational Physics, University of Tsukuba. Since the start of full operation of CP-PACS in October 1996, precision calculation of the light hadron spectrum in the quenched approximation of QCD and a systematic attempt at a calculation without this approximation have been pursued. Physics motivations of these calculations, the computational difficulties, and advances brought in by the CP-PACS are discussed. Performance of the CP-PACS for lattice QCD computations is described in a companion paper by S. Aoki et al.

8.

A fine-grained parallel algorithm for Cholesky decomposition (Total citations: 1, self-citations: 0, citations by others: 1)

邬贵明 窦勇 王淼 《计算机工程与科学》2010,32(9):102-106

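This paper proposes a fine-grained pipelined parallel algorithm for Cholesky decomposition. The algorithm can handle data of arbitrary size and fully exploits the fine-grained parallelism offered by FPGA accelerators. Experiments show that the algorithm scales well: 36 processing elements (PEs) can be integrated on a Xilinx XC5VLX330 FPGA, and performance reaches 14.3 GFLOPS for a matrix of order 16384 at a clock frequency of 200 MHz.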
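For readers unfamiliar with the factorization itself, here is a minimal sequential reference sketch of Cholesky decomposition in plain Python. It is not the authors' pipelined FPGA algorithm, whose details the abstract does not give; it only shows the textbook column-by-column computation, and the function name and test matrix are illustrative.

```python
import math

def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    `a` must be a symmetric positive-definite matrix given as a list of rows.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already computed in row j.
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j do not depend on each other,
        # so this inner loop is where fine-grained parallelism is available.
        for i in range(j + 1, n):
            dot = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - dot) / L[j][j]
    return L

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
```

For this matrix the factor is `[[2, 0, 0], [6, 1, 0], [-8, 5, 3]]`. The independence of the entries within one column is the kind of parallelism a hardware pipeline could exploit, though how the paper maps it onto processing elements is not described in the abstract.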

9.

Special-purpose computer for holography HORN-4 with recurrence algorithm (Total citations: 1, self-citations: 0, citations by others: 1)

Tomoyoshi Shimobaba Sinsuke Hishinuma Tomoyoshi Ito 《Computer Physics Communications》2002,148(2):160-170

We designed and built a special-purpose computer for holography, HORN-4 (HOlographic ReconstructioN), using PLD (Programmable Logic Device) technology. HORN computers have a pipeline architecture. We use HORN-4 as an attached processor to enhance the performance of a general-purpose computer when it is used to generate holograms using a “recurrence formulas” algorithm developed in our previous paper. In the HORN-4 system, we designed the pipeline by adopting our “recurrence formulas” algorithm, which can calculate the phase on a hologram. As a result, we could integrate the pipeline composed of 21 units into one PLD chip. The units in the pipeline consist of one BPU (Basic Phase Unit) and twenty CUs (Cascade Units). These CUs can compute twenty light intensities on a hologram plane at one time. By mounting two of the PLD chips on a PCI (Peripheral Component Interconnect) universal board, HORN-4 can calculate holograms at a high speed of about 42 Gflops equivalent. The cost of the HORN-4 board is about 1700 US dollars. We could obtain an 800×600 grid hologram from a 3D image composed of 415 points in about 0.45 s with the HORN-4 system.

10.

Real-Time Neural Network Inversion on the SRC-6e Reconfigurable Computer (Total citations: 2, self-citations: 0, citations by others: 2)

Duren R. W. Marks II R. J. Reynolds P. D. Trumbo M. L. 《Neural Networks, IEEE Transactions on》2007,18(3):889-901

Implementation of real-time neural network inversion on the SRC-6e, a computer that uses multiple field-programmable gate arrays (FPGAs) as reconfigurable computing elements, is examined using a sonar application as a specific case study. A feedforward multilayer perceptron neural network is used to estimate the performance of the sonar system (Jung , 2001). A particle swarm algorithm uses the trained network to perform a search for the control parameters required to optimize the output performance of the sonar system in the presence of imposed environmental constraints (Fox , 2002). The particle swarm optimization (PSO) requires repetitive queries of the neural network. Alternatives for implementing neural networks and particle swarm algorithms in reconfigurable hardware are contrasted. The final implementation provides nearly two orders of magnitude of speed increase over a state-of-the-art personal computer (PC), providing a real-time solution.

11.

The other side of the wall: deeper reflections on the Turing model

李奕权 郑利龙 《计算机科学》2011,38(9):282-287

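In recent years, computer architecture has moved toward multiprocessor designs. However, multicore architectures dominated by the von Neumann machine leave us on the wrong side of the memory wall. The redundancy introduced by address parameters lowers processor efficiency and is the fatal weakness of the Turing-von Neumann model. Reflecting more deeply on the Turing model, this paper uses information transformation to unify the program transformation of von Neumann machines with the transformation performed by neural networks, and analyzes the similarities, differences, and relative merits of the two. It proposes a design based on micro-cores that, according to the maturity of a transformation, differentiates into either a flexible, programmable von Neumann machine or a high-speed neural network, simulating the evolution of biological nervous systems to build intelligent machines that serve people. On September 15, 2010, at the High Performance Embedded Computing (HPEC) workshop in Boston, USA, Professor Eugenio Culurciello of Yale University presented NeuFlow, a high-performance computer based on the human visual system, whose architecture uses concepts very similar to those of the bionic computer described in this paper.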

12.

Artificial neural networks based modeling for solving Volterra integral equations system

《Applied Soft Computing》2015

Properly designing an artificial neural network is very important for achieving optimal performance. This study aims to utilize an architecture of these networks together with Taylor polynomials to obtain an approximate solution of a system of linear Volterra integral equations of the second kind. For this purpose, we first substitute the Nth truncation of the Taylor expansion for the unknown functions in the original system. Then we apply the suggested neural net to adjust the numerical coefficients of the given expansions in the resulting system. Consequently, the reported architecture, using a learning algorithm based on the gradient descent method, adjusts the coefficients in the given Taylor series. The proposed method is illustrated by several examples with computer simulations. Subsequently, performance comparisons with other developed methods are made. The comparative experimental results show that this approach is more effective and robust.

13.

Design of a dual-speed CAN bus in automotive control systems

冯钟 《电子技术应用》2008,34(3)

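Because the control units in an automotive control system have different response-time requirements, this paper proposes building a dual-speed CAN bus network to connect the vehicle's control units. In the system, computer control units with strict real-time requirements communicate over a high-speed CAN network, and the others use a low-speed CAN network. The results show that a design with separate high-speed and low-speed CAN buses makes data sharing across the whole system straightforward and effectively relieves the communication load on the bus, thereby improving control reliability.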

14.

Fei Teng 64 Stream Processing System: Architecture, Compiler, and Programming

Yang Xuejun Yan Xiaobo Xing Zuocheng Deng Yu Jiang Jiang Du Jing Zhang Ying 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(8):1142-1157

The stream architecture is a novel microprocessor architecture with wide application potential. It is critical to study how to use the stream architecture to accelerate scientific computing programs. However, existing stream processors and stream programming languages are not designed for scientific computing. To address this issue, we design and implement a 64-bit stream processor, Fei Teng 64 (FT64), which has a peak performance of 16 Gflops. FT64 supports two kinds of communications, message passing and stream communications, based on which an interconnection architecture is designed for an FT64-based high-performance computer. This high-performance computer contains multiple modules, with each module containing eight FT64s. We also design a novel stream programming language, Stream Fortran 95 (SF95), together with the compiler SF95Compiler, so as to facilitate the development of scientific applications. We test nine typical scientific application kernels on our FT64 platform to evaluate this design. The results demonstrate the effectiveness and efficiency of FT64 and its compiler for scientific computing.

15.

Velocity Estimation for Robot Manipulators Using Neural Network

S. P. Chan 《Journal of Intelligent and Robotic Systems》1998,23(2-4):147-163

In robot manipulators, optical incremental encoders are widely used as the transducers to monitor joint position and velocity information. With an incremental encoder, positional information is determined as discrete data relative to a reference (home) position. However, velocity information can only be deduced by processing the position data. In this paper, a method of using a neural network to estimate the velocity information of a robotic joint from discrete position versus time data is proposed and evaluated. The architecture of the neural net and the training methodology are presented and discussed. This approach is then applied to estimate the joint velocity of a SCARA robot while performing an electronic component assembly task. Based on computer simulations, comparisons of the accuracy of the neural network estimator with two other well-established velocity estimation algorithms are made. The neural net approach can maintain good performance even in the presence of measurement noise.

16.

Beyond Moore's law: Internet growth trends

Roberts L.G. 《Computer》2000,33(1):117-119

To keep pace with the Internet's growth, the maximum speed of core routers and switches must increase at the same rate. In a study conducted in 1969, the author analyzed 39 scientific computers released or planned for release from 1958 to 1972 to determine optimal computer replacement strategy (http://www.ziplink.net/lroberts/Forecast69.htm). This study looked at the trend of CPU throughput per dollar and predicted that computer performance would double every 18.6 months. Updating the study using data for 1999 PCs shows that the trend over 41 years is a doubling of computer performance every 21 months, a remarkably small correction. A similar study tracking the costs from the first ARPA packet switches in 1969 to the most modern routers and ATM switches in 1999 confirms that packet switches have followed the same trend as computers, with performance per dollar doubling every 21 months. Although the computer performance rate predicted in the updated 1969 study is similar to Moore's law, the trends are not identical. It would appear that both the performance per dollar for computers and the serial interface speed for communications are increasing at 94 percent of the yearly growth rate of semiconductor performance. We can use this information about performance and cost trends to predict the cost of computers and communications and to understand the Internet traffic growth. Keeping up with these trends will be a major engineering challenge.
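A doubling period can be turned into a yearly growth factor with the standard relation factor = 2^(12 / doubling period in months). The sketch below applies it to the two doubling periods quoted in the abstract; the function names are illustrative, and the calculation assumes steady exponential growth.

```python
def annual_growth_factor(doubling_months):
    """Yearly multiplier implied by a doubling period given in months."""
    return 2.0 ** (12.0 / doubling_months)

def growth_over(years, doubling_months):
    """Total multiplier after `years` of steady exponential growth."""
    return 2.0 ** (years * 12.0 / doubling_months)

for months in (18.6, 21.0):
    factor = annual_growth_factor(months)
    print(f"doubling every {months} months -> x{factor:.3f} per year")

print(f"41 years at a 21-month doubling: x{growth_over(41, 21.0):.3g}")
```

A 21-month doubling corresponds to roughly 49% growth per year, against roughly 56% for the 18.6-month forecast, and compounds to a factor on the order of ten million over 41 years.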

17.

Adaptive control of a nonlinear dc motor drive using recurrent neural networks

《Applied Soft Computing》2008,8(1):371-382

A model-following adaptive control structure is proposed for the speed control of a nonlinear motor drive system and the compensation of the nonlinearities. A recurrent artificial neural network is used for the online modeling and control of the nonlinear motor drive system with high static and Coulomb friction. The neural network is first trained off-line to learn the inverse dynamics of the motor drive system using a modified form of the decoupled extended Kalman filter algorithm. It is shown that the recurrent neural network structure combined with the inverse model control approach allows an effective direct adaptive control of the motor drive system. The performance of this method is validated experimentally on a dc motor drive system using a standard personal computer. The results obtained confirm the excellent disturbance rejection and tracking performance properties of the system.

18.

A product quality inspection method based on neural networks and computer vision

严太山 崔杜武 《计算机工程》2007,33(23):191-193

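Automated inspection that combines computer vision with neural network technology is a new development in computer-based inspection. It is non-contact, fast, efficient, and flexible, and has broad application prospects in modern product quality inspection. This paper introduces the general structure of a product quality inspection system based on neural networks and computer vision, and describes the implementation of one example of such a system: an online crack detection system for glass bottles. Thanks to the neural network, the system has good self-learning and adaptive capabilities, and it achieves fast, accurate detection of cracks in glass bottles on the production line.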

19.

A study of the robustness of a class of high-resolution bearing estimation algorithms

侯颖妮 冯西安 黄建国 《计算机仿真》2007,24(6):92-95

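In underwater multi-target bearing estimation, to reduce the effect of array errors on the performance of high-resolution bearing estimation algorithms and to make the algorithms more robust, an error correction method based on channel normalization and Toeplitzization is proposed. The performance of the MUSIC, Mini-Norm, Root-MUSIC, TLS-ESPRIT, and MODE algorithms is analyzed through computer simulation. The results show that the correction method substantially improves the performance of the high-resolution algorithms, and the processing of water-tank experimental data further confirms the simulation conclusions. The simulation and tank results also show that, after correction, the TLS-ESPRIT, Root-MUSIC, and MODE algorithms are highly robust and have low noise resolution thresholds, which makes them promising for engineering applications.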

20.

Sensor fault diagnosis for blast furnace heat flux analysis based on RBF networks

陈至坤 陈少敏 李福进 王福斌 郭建飞 董传阳 《传感器与微系统》2007,26(6):58-59

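The heat flux intensity analysis system of an ironmaking blast furnace relies on temperature, flow, and other sensors. To ensure reliable sensor data and continuous, stable operation of the heat flux analysis system, the diagnosis system uses radial basis function (RBF) neural networks to detect sensor faults. The system consists of a host computer, temperature and flow acquisition devices, and sensors. An RBF neural network prediction model is built for each sensor: the network input is the n most recent values of the sensor signal, and the output is the predicted sensor value at time n+1. The network monitors sensor faults online through online learning. Simulation analysis shows that a prediction model built with an RBF neural network meets the real-time requirements of diagnosis and improves the diagnostic accuracy of the system.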