Similar Documents (20 results)
1.
This paper analyzes parallel implementation of the backpropagation training algorithm on a heterogeneous transputer network (i.e., transputers of different speed and memory) connected in a pipelined ring topology. Training-set parallelism is employed as the parallelizing paradigm for the backpropagation algorithm. It is shown through analysis that finding the optimal allocation of the training patterns amongst the processors to minimize the time for a training epoch is a mixed integer programming problem. Using mixed integer programming, optimal pattern allocations for heterogeneous processor networks having a mixture of T805-20 (20 MHz) and T805-25 (25 MHz) transputers are theoretically found for two benchmark problems. The time for an epoch corresponding to the optimal pattern allocations is then obtained experimentally for the benchmark problems from the T805-20/T805-25 heterogeneous networks. A Monte Carlo simulation study is carried out to statistically verify the optimality of the epoch time obtained from the mixed-integer-programming-based allocations. In this study pattern allocations are randomly generated and the corresponding time for an epoch is experimentally obtained from the heterogeneous network. The mean and standard deviation of the epoch times from the random allocations are then compared with the optimal epoch time. The results show the optimal epoch time to be lower than the mean epoch times by more than three standard deviations (3σ) for all the sample sizes used in the study, thus validating the theoretical analysis.
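The allocation problem in the abstract above can be illustrated with a toy sketch: for small instances, exhaustively enumerating splits of the training set across processors of different speeds finds the allocation that minimizes the per-epoch bottleneck. The paper solves the general case with mixed integer programming; this brute-force version, with hypothetical per-pattern costs rather than measured transputer timings, only illustrates the objective.

```python
# Toy version of the training-pattern allocation problem: given a
# (hypothetical) per-pattern processing cost for each heterogeneous
# processor, find the split of the training set that minimizes the
# epoch time. Exhaustive search is feasible only for tiny instances;
# the paper uses mixed integer programming for the general case.

def epoch_time(alloc, cost_per_pattern):
    # Under training-set parallelism the slowest processor bounds the epoch.
    return max(n * c for n, c in zip(alloc, cost_per_pattern))

def best_allocation(total_patterns, cost_per_pattern):
    best_t, best_alloc = None, None

    def rec(remaining, idx, acc):
        nonlocal best_t, best_alloc
        if idx == len(cost_per_pattern) - 1:
            alloc = acc + [remaining]      # last processor takes the rest
            t = epoch_time(alloc, cost_per_pattern)
            if best_t is None or t < best_t:
                best_t, best_alloc = t, alloc
            return
        for n in range(remaining + 1):
            rec(remaining - n, idx + 1, acc + [n])

    rec(total_patterns, 0, [])
    return best_t, best_alloc
```

For example, splitting 9 patterns between a slow processor (cost 1.0 per pattern) and a faster one (cost 0.8) yields the balanced allocation [4, 5] with epoch time 4.0, rather than the naive even split.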

2.
Continuing advances in VLSI fabrication technology are allowing circuit designs to become more and more complex and are thereby fuelling the need for ever-faster digital simulators. In this paper, we investigate a multi-transputer-based method for speeding up gate-level digital timing simulation, the acknowledged 'workhorse' of the digital circuit design verification process. In particular, we describe a variant of the basic conservative method for distributed discrete-event simulation and we present PARSIM, a gate-level digital timing simulator which is based on this method and which runs on arrays of transputers. Preliminary results on small transputer arrays demonstrate good speed-ups for bitslice partitions of circuits with regular structure, including supralinear speed-ups for a large (1664-gate) circuit. Although these results are encouraging, poor results for less-than-ideal partitions of our test circuits suggest that we require an improvement in the efficiency of deadlock resolution and/or a means of automated optimized partitioning before PARSIM can be used as a practical tool for speeding up gate-level simulation.

3.
Our digital fuzzy processor's main features are high throughput, performance independent of fuzzy-model size, high design parameter flexibility, Max-Min inference, and the ability to handle a large number of complex rules without loss of efficiency. We carried out circuit development using a VHDL simulator, with European Silicon Structures' 1-μm standard cells. We have achieved performance results of over 10 million fuzzy logical inferences per second.

4.
We present two parallel multilevel methods for solving large-scale discretized partial differential equations on unstructured 2D/3D grids. The presented methods combine three powerful numerical algorithms: overlapping domain decomposition, the multigrid method and adaptivity. As the foundation of the methods we propose an algorithm for generating and partitioning a hierarchy of adaptively refined unstructured grids, so that adaptivity can be incorporated up to a certain grid level. We ensure that the resulting subgrid hierarchies are well balanced and no inter-processor communication is needed across different grid levels, thus obtaining high parallel efficiency. Numerical experiments show that the parallel multilevel methods offer almost equally fast convergence as their sequential multigrid counterpart, and that the resulting implementation has reasonably good scalability. Received: 4 December 1998 / Accepted: 12 January 2000

5.
Flat Concurrent Prolog (FCP) is a general purpose logic programming language designed for concurrent programming and parallel execution. Starting with a concise introduction of the language and its underlying computational model, we describe how to implement a distributed FCP interpreter on a transputer environment using OCCAM. Basic techniques we used for exploiting and controlling parallelism are explained in terms of an abstract architecture. The result of mapping this abstract model onto transputers is presented as a concrete architecture. Substantial design issues are considered in detail.

6.
There is a perceived need within the database community to extend traditional relational database systems so as to accommodate applications which are deductive in nature. One major problem involved in such an extension is the efficient processing of recursive queries. To this end, parallel processing is expected to play an important role. While substantial work has been done in devising strategies for processing recursive queries in parallel, it is perhaps surprising that little has been reported on the implementation and the run-time performance of these strategies. In this paper we report our experience of implementing a pipelined evaluation strategy on transputers. A wide range of queries, database structures and architectural configurations are considered as benchmarks in this study. The performance is studied in terms of both speed-up factors and communication costs. The experimental results show the potential of processing recursive queries in parallel, and provide insight into the usefulness of using transputers for such applications.

7.
In this paper, we present parallel multilevel algorithms for the hypergraph partitioning problem. In particular, we describe algorithms for parallel coarsening, parallel greedy k-way refinement and parallel multi-phase refinement. Using an asymptotic theoretical performance model, we derive the isoefficiency function for our algorithms and hence show that they are technically scalable when the maximum vertex and hyperedge degrees are small. We conduct experiments on hypergraphs from six different application domains to investigate the empirical scalability of our algorithms both in terms of runtime and partition quality. Our findings confirm that the quality of partition produced by our algorithms is stable as the number of processors is increased, while being competitive with those produced by a state-of-the-art serial multilevel partitioning tool. We also validate our theoretical performance model through an isoefficiency study. Finally, we evaluate the impact of introducing parallel multi-phase refinement into our parallel multilevel algorithm in terms of the trade-off between improved partition quality and higher runtime cost.

8.
9.
A technique for improving the performance of a gate-level logic simulator through the use of interactive hardware is described. This technique not only reduces the program run-time and increases the gate capacity of the simulator, but also enables the same simulator to be used directly for on-line logic circuit testing. The simulator and interactive hardware were designed and implemented on a PDP11/20 computer.

10.
The analysis of large-scale nonlinear shell problems calls for parallel simulation approaches. One crucial part of efficient and well scalable parallel FE-simulations is the solver for the system of equations. Due to their inherent suitability for parallelization, one is very much directed towards preconditioned iterative solvers. However, thin-walled structures discretized by finite elements lead to ill-conditioned system matrices, and therefore the performance of iterative solvers is generally poor. This situation further deteriorates when the thickness change of the shell is taken into account. A preconditioner for this challenging class of problems is presented, combining two approaches in a parallel framework. The first approach is a mechanically motivated improvement called 'scaled director conditioning' (SDC), which is able to remove the extra ill-conditioning that appears with three-dimensional shell formulations as compared to formulations that neglect thickness change of the shell. It is introduced at the element level and harmonizes well with the second approach, which utilizes a multilevel algorithm. Here a hierarchy of coarse grids is generated in a semi-algebraic sense using an aggregation concept; thereby the complicated and expensive explicit generation of coarse triangulations can be avoided. The formulation of this combined preconditioning approach is given, and its effects on the performance of iterative solvers are demonstrated via numerical examples.

11.
Qualitative simulation is a rather new and challenging simulation paradigm. Its major strength is the prediction of all physically possible behaviors of a system given only weak and incomplete information about it. This strength is exploited more and more in applications like design, monitoring and fault diagnosis. However, the poor performance of current qualitative simulators complicates or even prevents their application in technical environments. This paper presents the development of a special-purpose computer architecture for the best-known qualitative simulator, QSIM. Two design methods are applied to improve the performance. Complex functions are parallelized and mapped onto a multiprocessor system. Less complex functions are accelerated by software-to-hardware migration; they are executed on specialized coprocessors.

12.
The verification of multilevel designs in a single simulator environment can be achieved efficiently using concurrent simulation. MCS is a research simulation tool developed in conjunction with Compaq Computer Corporation and Draper Laboratories. MCS overcomes limitations imposed by merged simulator approaches. MCS achieves this by incorporating techniques that are not specific to any abstraction level, making it attractive for testing interface interconnects and mixed-mode logic. We describe our approach, which is a cohesive simulator platform based on concurrent simulation algorithms.

13.
Synchronous VLSI design is approaching a critical point, with clock distribution becoming an increasingly costly and complicated issue and power consumption rapidly emerging as a major concern. Hence, recently, there has been a resurgence of interest in asynchronous digital design techniques, as they promise to liberate VLSI systems from clock skew problems, offer the potential for low power and high performance, and encourage a modular design philosophy which makes incremental technological migration a much easier task. This activity has revealed a need for modelling and simulation techniques suitable for the asynchronous design style. Contributing to this quest, and motivated by the increasing debate regarding the potential of CSP for this purpose, this paper investigates the suitability of occam, a CSP-based programming language, for the modelling and simulation of complex asynchronous systems. A generic modelling framework is introduced, and issues arising from the parallel semantics of CSP/occam when the latter is employed to perform simulation are addressed.

14.
Summary: This paper focuses upon a particular conservative algorithm for parallel simulation, the Time of Next Event (TNE) suite of algorithms [13]. TNE relies upon a shortest path algorithm which is independently executed on each processor in order to unblock LPs in the processor and to increase the parallelism of the simulation. TNE differs fundamentally from other conservative approaches in that it takes advantage of having several LPs assigned to each processor, and does not rely upon message passing to provide lookahead; instead, it relies upon a shortest path algorithm executed independently in each processor. A deadlock resolution algorithm is employed for interprocessor deadlocks. We describe an empirical investigation of the performance of TNE on the iPSC/i860 hypercube multiprocessor. Several factors which play an important role in TNE's behavior are identified, and the speedup relative to a fast uniprocessor-based event list algorithm is reported. Our results indicate that TNE yields good speedups and outperforms an optimized version of the Chandy-Misra null-message (CMB) algorithm. TNE was 2–5 times as fast as the CMB approach for fewer than 10 processors, and 1.5–3 times as fast when more than 10 processors were used for the same population of processes.

Azzedine Boukerche received the State Engineer degree in Software Engineering from Oran University, Oran, Algeria, and the M.Sc. degree in Computer Science from McGill University, Montreal, Canada. He is a Ph.D. candidate at the School of Computer Science, McGill University. During 1991–1992, he was a visiting doctoral student at the California Institute of Technology. He has been employed as a Faculty Lecturer of Computer Science at McGill University since 1993. His research interests include parallel simulation, distributed algorithms, and system performance analysis. He is a student member of the IEEE and ACM.

Carl Tropper is an Associate Professor of Computer Science at McGill University. His primary area of research is parallel discrete event simulation. His general area of interest is parallel computing, and distributed algorithms in particular. Previously, he did research in the performance modeling of computer networks, having written a book, Local Computer Network Technologies, while active in the area. Before coming to university life, he worked for the BBN Corporation and the Mitre Corporation, both located in the Boston area. He spent the 1991–92 academic year on a sabbatical leave at the Jet Propulsion Laboratories of the California Institute of Technology, where he contributed to a project centered about the verification of flight control software. As part of this project he developed algorithms for the parallel simulation of communicating finite state machines. During winters he may be found hurtling down mountains on skis.

This work was completed while the first author was a visiting doctoral student, and the second author was on sabbatical leave at the Jet Propulsion Laboratories, at the California Institute of Technology.

15.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the shallow water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D shallow water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
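The explicit structured-grid hyperbolic updates discussed above can be illustrated with a minimal 1D Lax-Friedrichs step for the linear advection equation, an illustrative stand-in for (not a reproduction of) SWsolver's actual shallow-water scheme:

```python
# One explicit Lax-Friedrichs update for u_t + c * u_x = 0 on a
# periodic 1D grid. This is the simplest example of the stencil
# pattern that explicit structured-grid hyperbolic solvers apply
# cell-by-cell, which is what makes them data-parallel friendly.

def lax_friedrichs_step(u, c, dx, dt):
    n = len(u)
    nu = c * dt / dx  # Courant number; stability requires |nu| <= 1
    return [0.5 * (u[(i + 1) % n] + u[(i - 1) % n])
            - 0.5 * nu * (u[(i + 1) % n] - u[(i - 1) % n])
            for i in range(n)]
```

Because each output cell depends only on its immediate neighbours, the grid can be tiled across MPI ranks, Cell SPEs, or CUDA threads with a one-cell halo exchange per step.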

16.
The Chandy-Misra algorithm offers more parallelism than the standard event-driven algorithm for digital logic simulation. With suitable enhancements, the Chandy-Misra algorithm also offers significantly better parallel performance. The authors present methods to optimize the algorithm using information about the large number of global synchronization points, called deadlocks, that limit performance. They classify deadlocks and describe them in terms of circuit structure. They propose methods that use domain-specific knowledge to avoid deadlocks, and present a way to greatly reduce the time it takes to resolve a deadlock. For one benchmark circuit, the authors eliminated all deadlocks using their techniques and increased the average number of logic elements available for concurrent execution from 45 to 160. Simulation results for a 60-processor machine show that the Chandy-Misra algorithm outperforms the event-driven algorithm by a factor of 2 to 15.
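The deadlock problem discussed in the abstract above comes from the conservative rule that a logical process (LP) may only advance to the smallest timestamp available on its inputs. A common remedy is the null message: a blocked LP promises its neighbour that nothing will arrive before its clock plus a lookahead. This sketch, with hypothetical names and a fixed lookahead, shows the mechanism on a two-LP ring:

```python
# Sketch of the null-message idea used with Chandy-Misra conservative
# simulation. Each LP holds timestamps received from its neighbour;
# on each step it advances to the earliest one and sends back a
# "null" promise (clock + lookahead) so the ring never deadlocks.
from collections import deque

class LP:
    def __init__(self, name, lookahead):
        self.name = name
        self.lookahead = lookahead
        self.clock = 0.0
        self.inbox = deque()  # timestamps received from the neighbour

    def receive(self, ts):
        self.inbox.append(ts)

    def step(self, neighbour):
        # Conservative rule: only advance to a timestamp that is
        # guaranteed safe, i.e. one already received on the input.
        if not self.inbox:
            return False  # blocked: would deadlock without null messages
        ts = self.inbox.popleft()
        self.clock = max(self.clock, ts)
        # Null message: "I will send nothing earlier than clock + lookahead."
        neighbour.receive(self.clock + self.lookahead)
        return True

a, b = LP("a", 1.0), LP("b", 1.0)
a.receive(0.0)  # seed the ring so one LP can make the first move
for _ in range(5):
    a.step(b)
    b.step(a)
```

After five rounds the clocks have advanced to 8.0 and 9.0: each null message unblocks the other LP by one lookahead interval, which is exactly the progress guarantee that deadlock-avoidance schemes trade extra messages for.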

17.
In the oil industry it is common practice to simulate the exploitation of an oil reservoir by means of some numerical method. Such a numerical method may use dynamic local grid refinement in order to track fronts of water and oil moving through the reservoir. In this paper, we discuss a domain decomposition method which may be used to parallelize reservoir simulation. The parallel algorithm and timing experiments on a hypercube-type parallel computer are considered.

18.
This paper presents a novel hardware framework of particle swarm optimization (PSO) for various kinds of discrete optimization problems based on the system-on-a-programmable-chip (SOPC) concept. PSO is a new optimization algorithm with a growing field of applications. Nevertheless, similar to other evolutionary algorithms, PSO is generally a computationally intensive method which suffers from long execution time. Hence, it is difficult to use PSO in real-time applications in which reaching a proper solution in a limited time is essential. SOPC offers a platform to effectively design flexible systems with a high degree of complexity. A hardware pipelined PSO (PPSO) Core is applied with which the required computational operations of the algorithm are performed. Embedded processors have also been employed to evaluate the fitness values by running programmed software codes. Applying the subparticle method brings the benefit of full scalability to the framework and makes it independent of the particle length. Therefore, more complex and larger problems can be addressed without modifying the architecture of the framework. To speed up the computations, the optimization architecture is implemented on a single-chip master-slave multiprocessor structure. Moreover, the asynchronous model of PSO gains parallel efficacy and provides an approach to update particles continuously. Five benchmarks are exploited to evaluate the effectiveness and robustness of the system. The results indicate a speed-up of up to 98 times over the software implementation in the elapsed computation time. Moreover, the PPSO Core has been employed for neural network training in an SOPC-based embedded system, which confirms the system's applicability for real-world applications.
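The abstract above pipelines PSO in hardware; the underlying recursion it accelerates is the standard global-best velocity/position update, sketched here in plain Python. The parameter values (inertia w, acceleration coefficients c1 and c2) are conventional defaults, not those of the paper.

```python
# Standard global-best PSO: each particle's velocity is pulled toward
# its own best position (pbest) and the swarm's best (gbest), then
# integrated into the position. Minimizes an arbitrary fitness function.
import random

def pso(fitness, dim, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    xs = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_f = [fitness(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            f = fitness(xs[i])
            if f < pbest_f[i]:           # update personal best
                pbest[i], pbest_f[i] = xs[i][:], f
                if f < gbest_f:          # update swarm best
                    gbest, gbest_f = xs[i][:], f
    return gbest, gbest_f
```

The per-particle, per-dimension updates are independent within an iteration, which is what makes the algorithm amenable to the pipelined, master-slave hardware mapping the paper describes.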

19.
Based on modern design techniques for reconfigurable hardware, the hardware implementation of a Kalman filter is studied. Exploiting the rich resources, flexibility and parallelism of FPGAs, and following the "timing first, circuit second" design principle, a floating-point Kalman filter is designed using a top-down synchronous design method. The working principle of the Kalman filter is analyzed, and the hardware structure is optimized using Intellectual Property (IP) cores and time-division multiplexing. Finally, the designed filter is verified by simulation in the application context of improving Global Positioning System (GPS) accuracy. ModelSim simulation results show that this floating-point Kalman filter hardware design not only achieves high real-time performance but also saves resources, is well suited to hardware implementation, and has practical application value.
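The predict/update recursion that the FPGA design above implements in floating point can be sketched in its simplest scalar form. The constant-state model and the noise variances q and r below are illustrative choices, not values from the paper.

```python
# Scalar (1D) Kalman filter for a constant-state model x_k = x_{k-1} + w:
# the two-phase recursion (predict, then measurement update) that
# hardware Kalman filters pipeline, reduced to its simplest form.

def kalman_1d(zs, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    x, p = x0, p0                # state estimate and its variance
    estimates = []
    for z in zs:
        p = p + q                # predict: variance grows by process noise q
        k = p / (p + r)          # Kalman gain balances prediction vs measurement
        x = x + k * (z - x)      # update state with the innovation (z - x)
        p = (1.0 - k) * p        # update variance
        estimates.append(x)
    return estimates
```

Fed noisy measurements scattered around a constant value, the estimate settles near that value while the gain k decays toward its steady state, which is the smoothing behaviour a GPS application relies on.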

20.
Research on a Hardware Simulation Platform for a Space Robot Control System
A hardware simulation platform for a space robot control system was built. The problem of controlling a space robot using hand-eye vision was studied, and simulators for the system's key components were constructed. The platform consists of a central controller, a joint simulator, a hand-eye simulator, a dynamics/kinematics simulation computer, and a 3D animation display computer. Based on this platform, the control characteristics of the space robot and the time-delay elements of the simulation process were investigated. Results of autonomous target-capture simulation experiments show that the adopted motion control algorithm converges stably to the target, and that the platform can effectively emulate and test the control process of a real robot system and validate its control algorithms.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司), 京ICP备09084417号