Similar Documents
20 similar documents were retrieved (search time: 31 ms).
1.
Distributed computing is a process through which a set of computers connected by a network is used collectively to solve a single problem. In this paper, we propose a distributed computing methodology for training neural networks for the detection of lesions in colonoscopy. Our approach is based on partitioning the training set across multiple processors using a parallel virtual machine. In this way, interconnected computers of varied architectures can be used for the distributed evaluation of the error function and gradient values, and thus for training neural networks with various learning methods. The proposed methodology has coarse granularity and low synchronization overhead, and has been implemented and tested. Our results indicate that the parallel virtual machine implementation of the training algorithms developed leads to considerable speedup, especially when large network architectures and training sets are used.
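
A minimal sketch of the training-set partitioning idea, assuming a simple linear least-squares model in place of a neural network and Python's multiprocessing module in place of the paper's parallel virtual machine; all function and variable names here are illustrative, not the authors' code.

    import numpy as np
    from multiprocessing import Pool

    def partial_error_and_grad(args):
        """Squared error and gradient of a linear model on one training block."""
        w, X_block, y_block = args
        residual = X_block @ w - y_block
        return 0.5 * np.sum(residual ** 2), X_block.T @ residual

    def distributed_gradient_step(w, X, y, n_workers=4, lr=1e-3):
        # Partition the training set across workers, evaluate error/gradient
        # contributions in parallel, then combine them for one global update.
        blocks = np.array_split(np.arange(len(y)), n_workers)
        tasks = [(w, X[idx], y[idx]) for idx in blocks]
        with Pool(n_workers) as pool:
            results = pool.map(partial_error_and_grad, tasks)
        total_error = sum(e for e, _ in results)
        total_grad = sum(g for _, g in results)
        return w - lr * total_grad, total_error

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X, y = rng.normal(size=(1000, 8)), rng.normal(size=1000)
        w = np.zeros(8)
        w, err = distributed_gradient_step(w, X, y)
        print(err)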

2.
Ray tracing is a well known technique to generate life-like images. Unfortunately, ray tracing complex scenes can require large amounts of CPU time and memory storage. Distributed memory parallel computers with large memory capacities and high processing speeds are ideal candidates to perform ray tracing. However, the computational cost of rendering pixels and the patterns of data access cannot be predicted until runtime. To parallelize such an application efficiently on distributed memory parallel computers, the issues of database distribution, dynamic data management and dynamic load balancing must be addressed. In this paper, we present a parallel implementation of a ray tracing algorithm on the Intel Delta parallel computer. In our database distribution, a small fraction of the database is duplicated on each processor, while the remaining part is evenly distributed among groups of processors. In the system, there are multiple copies of the entire database in the memory of groups of processors. Dynamic data management is achieved by an ALRU cache scheme which can exploit image coherence to reduce data movement when ray tracing consecutive pixels. We balance load among processors by distributing subimages to processors in a global fashion based on previous workload requests. The success of our implementation depends crucially on a number of parameters which are evaluated experimentally. © 1997 John Wiley & Sons, Ltd.
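
A minimal sketch of the global subimage assignment idea, assuming each tile carries a cost estimate (the hard-coded numbers below stand in for previously measured workload); the greedy least-loaded-processor rule is an illustration, not the paper's exact scheme.

    import heapq

    def assign_tiles(tile_costs, n_procs):
        """Return {processor: [tile, ...]} balancing the total estimated cost."""
        heap = [(0.0, p) for p in range(n_procs)]   # (accumulated cost, processor)
        heapq.heapify(heap)
        assignment = {p: [] for p in range(n_procs)}
        # Handing out the most expensive tiles first gives a better greedy balance.
        for tile, cost in sorted(enumerate(tile_costs), key=lambda t: -t[1]):
            load, proc = heapq.heappop(heap)
            assignment[proc].append(tile)
            heapq.heappush(heap, (load + cost, proc))
        return assignment

    print(assign_tiles([5.0, 1.0, 3.0, 2.0, 4.0, 1.5], n_procs=2))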

3.
A Network-Parallel Iterative Algorithm for Solving Linear Systems Based on PVM   Cited by: 1 (self-citations: 0, citations by others: 1)
For a network parallel computing environment built from desktop PCs connected over a LAN and running PVM, where the processors are relatively fast but inter-processor communication is comparatively slow, this paper proposes a block Gauss-Seidel parallel iterative algorithm for solving systems of linear equations. The algorithm partitions the augmented matrix of the linear system by rows and stores the row blocks on the individual processors; each processor applies the Gauss-Seidel iteration to its own block, so little inter-processor communication is required and the algorithm is easy to implement. Numerical experiments were carried out on a LAN of 1 to 24 desktop PCs using PVM 3.4 on Windows 2000 with VC 6.0 as the parallel computing platform. The results show that the algorithm outperforms both the conventional parallel Jacobi iteration and the conventional parallel Gauss-Seidel iteration.
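
A minimal single-process sketch of the block Gauss-Seidel idea described above, with the row blocks held in one NumPy array rather than distributed over PVM processes; the loop over blocks stands in for the work each networked PC would do, and the matrix is an assumed diagonally dominant test case.

    import numpy as np

    def block_gauss_seidel(A, b, n_blocks=2, sweeps=100):
        n = len(b)
        x = np.zeros(n)
        blocks = np.array_split(np.arange(n), n_blocks)   # row-wise partition
        for _ in range(sweeps):
            for block in blocks:          # each block would live on one PC
                for i in block:
                    s = A[i] @ x - A[i, i] * x[i]
                    x[i] = (b[i] - s) / A[i, i]
        return x

    # Diagonally dominant system, so the iteration converges.
    A = np.array([[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]])
    b = np.array([5.0, 6.0, 5.0])
    print(block_gauss_seidel(A, b))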

4.
5.
Practical parallel algorithms, based on classical sequential Union-Find algorithms for computing transitive closures of binary relations, are described and implemented for both shared memory and distributed memory parallel computers. By practical algorithms, we mean algorithms that are efficient for parallel systems with bounded numbers of processors, as opposed to algorithms where the number of processors grows with the problem size. Transitive closures are useful for decomposing many application problems into independent subproblems. The implementations were on an ENCORE Multimax shared memory machine and an NCUBE hypercube. Our implementations indicate that transitive closure computations are intrinsically difficult for distributed memory parallel machines because of the need for global information. By contrast, our results for shared memory machines exhibited excellent speedups. Supported in part by NSF Grant DCR-8619103, ONR contract N000-86-G-0202 and DOE Grant DE-FG02-85ER25001. Supported in part by RADC contract F30602-85-C-0303.
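
A minimal sequential Union-Find sketch of the classical structure the paper parallelizes; computing the transitive closure of a symmetric binary relation amounts to finding the connected components of its graph, which is what the code below does. The relation used is an assumed toy example.

    def find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(parent, rank, a, b):
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            return
        if rank[ra] < rank[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        rank[ra] += rank[ra] == rank[rb]

    pairs = [(0, 1), (1, 2), (3, 4)]        # the binary relation
    n = 5
    parent, rank = list(range(n)), [0] * n
    for a, b in pairs:
        union(parent, rank, a, b)
    print([find(parent, i) for i in range(n)])   # component representatives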

6.
Quisquater, J.-J.; Desmedt, Y.G. 《Computer》, 1991, 24(11): 14-22
It is demonstrated that some problems can be solved inexpensively using widely distributed computers instead of an expensive supercomputer. This is illustrated by discussing how to build a simple fault-tolerant exhaustive code-breaking machine. The solution, which uses distributed processors, is based on some elementary concepts of probability theory (the lotto). The need for communication between processors is almost nil. Two approaches, deterministic and random, are compared. How to hide such a machine and how to build larger versions are also discussed.

7.
We introduce a formalism that allows computer architecture to be treated as a formal optimization problem. We apply this to the design of shared memory parallel machines. While present parallel computers of this type support the programming model of a shared memory, they often process simultaneous accesses by several processors to the shared memory sequentially; theoretical computer science offers solutions for this problem that are provably fast and asymptotically optimal. But the constants in these constructions seemed too large to make them competitive. We modify these constructions from an engineering point of view and improve the price/performance ratio by roughly a factor of 6. The resulting machine has a surprisingly good price/performance ratio even when compared with distributed memory machines. For almost all access patterns of all processors into the shared memory, access is as fast as the access of only a single processor. Received: 29 June 1993 / 22 June 1999

8.
For a network parallel computing environment consisting of desktop PCs connected over a LAN and running PVM, where processor speed is relatively high but inter-processor communication is comparatively slow, a parallel algorithm for solving triangular systems of equations on a LAN is presented. The algorithm partitions the coefficient matrix and the right-hand side by rows into blocks, which are stored on the processors in a wrap-around (cyclic) fashion; the solution components already computed are circulated among the processors to reduce communication overhead, and the algorithm is easy to implement. Numerical experiments were performed on a LAN of 1 to 4 desktop PCs using PVM 3.4 on Windows 2000 with VC 6.0 as the parallel computing platform, and the results show that the algorithm is effective.
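
A minimal sketch of forward substitution for a lower-triangular system with the wrap-around (row-cyclic) ownership described above: row i belongs to processor i % p, and in the distributed version each newly computed component would be circulated to the other processors. Here everything runs in one process, and the matrix is an assumed test case.

    import numpy as np

    def forward_substitution_cyclic(L, b, p=4):
        n = len(b)
        x = np.zeros(n)
        for i in range(n):
            owner = i % p    # processor that would compute x[i] in the PVM version
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
            # ... `owner` would now pass x[i] on to the other processors
        return x

    L = np.array([[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 1.0, 5.0]])
    b = np.array([2.0, 5.0, 15.0])
    print(forward_substitution_cyclic(L, b))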

9.
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This allows useful insights to be gained into the MPI communication performance of parallel computers, and in particular how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.
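
A minimal Monte Carlo sketch of the final point, under the assumption that per-message times follow an exponential distribution with the measured mean: a step that waits for the slowest of many processors takes much longer on average than the mean message time suggests, so a prediction built from the average alone and one built by sampling from the distribution disagree.

    import random

    random.seed(0)
    n_procs, n_trials, mean_time = 64, 2000, 1.0

    def sampled_step_time():
        # Assumed per-message time distribution; MPIBench would supply a measured one.
        times = [random.expovariate(1.0 / mean_time) for _ in range(n_procs)]
        return max(times)            # the step finishes when the slowest processor does

    average_based = mean_time        # naive prediction using the average time alone
    monte_carlo = sum(sampled_step_time() for _ in range(n_trials)) / n_trials
    print(average_based, monte_carlo)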

10.
In this paper we discuss the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16K bit-serial processors; the Flex/32, an MIMD machine with 20 processors; and the Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2, while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
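
A minimal sketch of the sequential tridiagonal elimination (the Thomas algorithm) of the kind used per grid line in an ADI sweep on the Flex/32 and Cray/2; the cyclic (odd-even) elimination used on the MPP is not shown. The small system below is an assumed example.

    import numpy as np

    def thomas(a, b, c, d):
        """Solve a tridiagonal system; a = sub-, b = main, c = super-diagonal."""
        n = len(d)
        cp, dp = np.zeros(n), np.zeros(n)
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):                      # forward elimination
            denom = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / denom
            dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
        x = np.zeros(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):             # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    a = np.array([0.0, -1.0, -1.0])
    b = np.array([2.0, 2.0, 2.0])
    c = np.array([-1.0, -1.0, 0.0])
    d = np.array([1.0, 1.0, 1.0])
    print(thomas(a, b, c, d))    # expected [1.5, 2.0, 1.5]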

11.
In this paper, analog circuit designs for implementations of Gibbs samplers are presented, which offer fully parallel computation. The Gibbs sampler for a discrete solution space (or Boltzmann machine) can be used to solve both deterministic and probabilistic assignment (association) problems. The primary drawback to the use of a Boltzmann machine for optimization is its computational complexity, since updating of the neurons is typically performed sequentially. We first consider the diffusion equation emulation of a Boltzmann machine introduced by Roysam and Miller (1991), which employs a parallel network of nonlinear amplifiers. It is shown that an analog circuit implementation of the diffusion equation requires a complex neural structure incorporating matched nonlinear feedback amplifiers and current multipliers. We introduce a simpler implementation of the Boltzmann machine, using a "constant gradient" diffusion equation, which eliminates the need for a matched feedback amplifier. The performance of the Roysam and Miller network and the new constant gradient (CG) network is compared using simulations for the multiple-neuron case, and integration of the Chapman-Kolmogorov equation for a single neuron. Based on the simulation results, heuristic criteria for establishing the diffusion equation boundaries and the neuron sigmoidal gain are obtained. The final CG analog circuit is suitable for VLSI implementation, and hence may offer rapid convergence.
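
A minimal software sketch of a Gibbs sampler for a Boltzmann machine with ±1 units, updating each unit from its conditional distribution given the rest; the paper's contribution is an analog circuit that performs such updates fully in parallel, which this sequential sketch does not attempt to model. Weights, biases, and temperature below are assumed toy values.

    import numpy as np

    def gibbs_sweep(s, W, b, temperature, rng):
        for i in rng.permutation(len(s)):
            field = W[i] @ s - W[i, i] * s[i] + b[i]     # input from the other units
            p_up = 1.0 / (1.0 + np.exp(-2.0 * field / temperature))
            s[i] = 1 if rng.random() < p_up else -1
        return s

    rng = np.random.default_rng(0)
    W = np.array([[0.0, 1.0], [1.0, 0.0]])   # two mutually excitatory units
    b = np.zeros(2)
    s = rng.choice([-1, 1], size=2)
    for _ in range(100):
        s = gibbs_sweep(s, W, b, temperature=0.5, rng=rng)
    print(s)                                 # tends toward an aligned configuration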

12.
We present a method for mapping a given Bayesian network to a Boltzmann machine architecture, in the sense that the updating process of the resulting Boltzmann machine model provably converges to a state which can be mapped back to a maximum a posteriori (MAP) probability state in the probability distribution represented by the Bayesian network. The Boltzmann machine model can be implemented efficiently on massively parallel hardware, since the resulting structure can be divided into two separate clusters where all the nodes in one cluster can be updated simultaneously. This means that the proposed mapping can be used to provide Bayesian network models with a massively parallel probabilistic reasoning module capable of finding the MAP states in a computationally efficient manner. From the neural network point of view, the mapping from a Bayesian network to a Boltzmann machine can be seen as a method for automatically determining the structure and the connection weights of a Boltzmann machine by incorporating high-level probabilistic information directly into the neural network architecture, without recourse to a time-consuming and unreliable learning process.

13.
This paper studies the behavior of scientific applications running on distributed memory parallel computers. Our goal is to quantify the floating point, memory, I/O, and communication requirements of highly parallel scientific applications that perform explicit communication. In addition to quantifying these requirements for fixed problem sizes and numbers of processors, we develop analytical models for the effects of changing the problem size and the degree of parallelism for several of the applications. The contribution of our paper is that it provides quantitative data about real parallel scientific applications in a manner that is largely independent of the specific machine on which the application was run. Such data, which are clearly very valuable to an architect who is designing a new parallel computer, were not previously available. For example, the majority of research papers on interconnection networks have used simulated communication loads consisting of fixed-size messages. Our data, which show that using such simulated loads is unrealistic, can be used to generate more realistic communication loads.

14.
In this paper, we propose a novel parallel 3D Delaunay triangulation algorithm for large-scale simulations on parallel computers. Our method keeps the 3D boundary representation model information throughout the parallel 3D Delaunay triangulation process, so that the solid model information can be accessed dynamically and the meshing results can closely conform to the model boundary as the meshing scale increases. The model is first coarsely meshed and distributed across processors, with consistent partitioned shared interfaces and partitioned model boundary meshes. The domain partition aims to minimize the edge-cuts across processors for minimum communication cost and to distribute roughly equal numbers of mesh vertices for load balance. A parallel multi-scale surface mesh refinement phase is then performed iteratively to meet the mesh density criteria, followed by a parallel surface mesh optimization phase that moves vertices onto the model boundary so as to fit the model geometry features dynamically. A dynamic load balancing algorithm changes the partition interfaces if necessary. Finally, a 3D local non-Delaunay mesh repair algorithm is applied to the shared interfaces across processors and to the model boundaries. The experimental results demonstrate that our method achieves high parallel performance and perfect scalability, while preserving model boundary features and generating high-quality 3D Delaunay meshes.

15.
PeiZong Lee. 《Parallel Computing》, 1995, 21(12): 1895-1923
It is widely accepted that distributed memory parallel computers will play an important role in solving computation-intensive problems. However, the design of an algorithm in a distributed memory system is time-consuming and error-prone, because a programmer is forced to manage both parallelism and communication. In this paper, we present techniques for compiling programs for distributed memory parallel computers. We study the storage management of data arrays and the arrangement of execution schedules for Do-loop programs on distributed memory parallel computers. First, we introduce formulas for representing the data distribution of specific data arrays across processors. Then, we define communication costs for some message-passing communication operations. Next, we derive a dynamic programming algorithm for data distribution. After that, we show how to improve the communication time by pipelining data, and illustrate how to use data-dependence information for pipelining data. Jacobi's iterative algorithm and the Gauss elimination algorithm for linear systems are used to illustrate our method. We also present experimental results on a 32-node nCUBE-2 computer.
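
A minimal sketch of the kind of closed-form data-distribution formulas referred to above: the owning processor and local index of global array element i under block and cyclic distributions over p processors. The exact formulas used in the paper may differ; these are the standard textbook mappings.

    def block_owner(i, n, p):
        size = -(-n // p)                # ceil(n / p) elements per processor
        return i // size, i % size       # (processor, local index)

    def cyclic_owner(i, n, p):
        return i % p, i // p             # (processor, local index)

    n, p = 10, 4
    print([block_owner(i, n, p) for i in range(n)])
    print([cyclic_owner(i, n, p) for i in range(n)])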

16.
Current connectionist simulations require huge computational resources. We describe a neural network simulator for the IBM GF11, an experimental SIMD machine with 566 processors and a peak arithmetic performance of 11 Gigaflops. We present our parallel implementation of the backpropagation learning algorithm, techniques for increasing efficiency, performance measurements on the NETTALK text-to-speech benchmark, and a performance model for the simulator. Our simulator currently runs the backpropagation learning algorithm at 900 million connections per second, where each “connection per second” includes both a forward and a backward pass. This figure was obtained on the machine when only 356 processors were working; with all 566 processors operational, our simulator will run at over one billion connections per second. We conclude that the GF11 is well-suited to neural network simulation, and we analyze our use of the machine to determine which features are the most important for high performance.
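
A small check of the throughput figures quoted above, assuming the quoted rate scales linearly with the number of working processors.

    measured_cps = 900e6            # connections per second on 356 processors
    working, full = 356, 566
    per_processor = measured_cps / working
    print(per_processor * full)     # ≈ 1.43e9, i.e. over one billion connections/s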

17.
We present a new parallel algorithm for computing a maximum cardinality matching in a bipartite graph, suitable for distributed memory computers. The presented algorithm is based on the Push-Relabel algorithm, which is known to be one of the fastest algorithms for the bipartite matching problem. Previous attempts at developing parallel implementations of it have focused on shared memory computers using only a limited number of processors. We first present a straightforward adaptation of these shared memory algorithms to distributed memory computers. However, this is not a viable approach as it requires too much communication. We then develop our new algorithm by modifying the previous approach through a sequence of steps, with the main goal being to reduce the amount of communication and to increase load balance. The first goal is achieved by changing the algorithm so that many push and relabel operations can be performed locally between communication rounds and also by selecting augmenting paths that cross processor boundaries infrequently. To achieve good load balance, we limit the speed at which global relabelings traverse the graph. In several experiments on a large number of instances, we study the weak and strong scalability of our algorithm using up to 128 processors. The algorithm can also be used to find ε-approximate matchings quickly.
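
A minimal sequential baseline for maximum bipartite matching using augmenting paths (Kuhn's algorithm), shown only to illustrate what the distributed algorithm computes; it is not the push-relabel scheme the paper parallelizes. The small graph is an assumed example.

    def max_bipartite_matching(adj, n_right):
        match_right = [-1] * n_right     # right vertex -> matched left vertex

        def try_augment(u, visited):
            for v in adj[u]:
                if not visited[v]:
                    visited[v] = True
                    if match_right[v] == -1 or try_augment(match_right[v], visited):
                        match_right[v] = u
                        return True
            return False

        return sum(try_augment(u, [False] * n_right) for u in adj)

    adj = {0: [0, 1], 1: [0], 2: [1, 2]}            # left vertex -> right neighbours
    print(max_bipartite_matching(adj, n_right=3))   # maximum matching of size 3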

18.
Properly adapted Boltzmann machine neural networks are used to devise effective unstructured grid partitioners that are capable of providing equally loaded grid subsets with minimum interface, for concurrent data-handling on parallel computers. The partitioning scheme is based on recursive bisections, so that the outcome always consists of 2^n partitions. Two different techniques are introduced to speed up the otherwise costly partitioning process, and several variants are considered. In particular, a transformation of bipolar Hopfield-type neural networks is developed, providing an effective multi-scale approach. Results on a number of test cases are presented in order to assess the performance of the proposed techniques.
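
A minimal sketch of the recursive bisection skeleton referred to above, in which n levels of bisection always produce 2^n parts; each bisection here is a simple coordinate-median split for illustration, not the Boltzmann/Hopfield-network optimisation the paper uses to minimise the interface. The point grid is an assumed example.

    def recursive_bisection(points, levels, axis=0):
        if levels == 0:
            return [points]
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2                        # equal load by construction
        next_axis = (axis + 1) % len(points[0])
        return (recursive_bisection(pts[:mid], levels - 1, next_axis) +
                recursive_bisection(pts[mid:], levels - 1, next_axis))

    grid = [(x * 0.1, y * 0.1) for x in range(8) for y in range(8)]
    parts = recursive_bisection(grid, levels=3)    # 2**3 = 8 partitions
    print([len(p) for p in parts])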

19.
The choice of the number of data blocks is one of the basic model-selection problems in parallel/distributed machine learning, and it directly affects both the generalization ability and the running efficiency of machine learning algorithms. Existing parallel/distributed machine learning methods usually choose the number of data blocks empirically or according to the number of processors, without an explicit selection criterion. This paper proposes a parallel-efficiency-aware criterion for selecting the number of data blocks in parallel/distributed machine learning, which improves computational efficiency while preserving the test accuracy of the model. First, the relation between the generalization error of the parallel/distributed machine learning model and the number of blocks is derived. On this basis, a block-number selection criterion that trades off generalization ability against parallel efficiency is proposed. Finally, an implementation of large-scale support vector machines in a random Fourier feature space under the ADMM framework using the proposed criterion is given, and the effectiveness of the criterion is verified experimentally on a high-performance computing cluster with large-scale benchmark data sets.

20.
Many computation-intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the question of which programming and system support is needed to yield an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time, which is the basis for an exact performance prediction. Common approaches for parallel implementations, such as lock protection for concurrent accesses and sequential or distributed task queues, are replaced by more efficient access mechanisms and data structures which can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.
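
A minimal sketch of what a multiprefix (fetch-and-add style) operation provides: each processor contributes a count of items it wants to append, an exclusive prefix sum over those counts gives every processor a private slice of the shared output array, and no lock is needed when it writes. The counts below are assumed values; on the SB-PRAM the prefix sum itself is a single hardware-supported operation.

    import itertools

    def exclusive_prefix_sum(counts):
        return [0] + list(itertools.accumulate(counts))[:-1]

    counts = [3, 0, 2, 4]                     # items produced per processor
    offsets = exclusive_prefix_sum(counts)    # private write offsets
    shared = [None] * sum(counts)
    for proc, (off, cnt) in enumerate(zip(offsets, counts)):
        for k in range(cnt):                  # each processor fills its own slice
            shared[off + k] = (proc, k)
    print(offsets, shared)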

