
1.

Tao Jie, Ju Jiubin 《Journal of Software (软件学报)》1994,5(11):38-43

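This paper presents DC-PROLOG, a distributed C-PROLOG interpreter system implemented on a network of SUN workstations. It automatically turns the sequential interpretation of an application program into a parallel one, and it can make full use of idle main memory to solve large problems, so that some tasks that cannot run on a single machine because of insufficient memory can be executed.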

2.

XDP: A compiler intermediate language extension for the representation and optimization of data movement

Larry Carter Jeanne Ferrante Vasanth Bala 《International journal of parallel programming》1994,22(5):485-518

The ability to represent, manipulate, and optimize data placement and movement between processors in a distributed address space machine is crucial in allowing compilers to generate efficient code. Data placement is embodied in the concept of data ownership. Data movement can include not just the transfer of data values but the transfer of ownership as well. However, most existing compilers for distributed address space machines either represent these notions in a language- or machine-dependent manner, or represent data or ownership transfer implicitly. In this paper we describe XDP, a set of intermediate language extensions for representing and manipulating data and ownership transfers explicitly in a compiler. XDP is supported by a set of per-processor structures that can be used to implement ownership testing and manipulation at run-time. XDP provides a uniform framework for translating and optimizing sequential, data parallel, and message-passing programs to a distributed address space machine. We describe analysis and optimization techniques for this explicit representation. Finally, we compare the intermediate languages of some current distributed address space compilers with XDP.
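
The distinction the abstract draws between transferring a value and transferring ownership can be illustrated with a small sketch. This is not XDP's actual notation or its run-time structures: the `Processor` class and the functions `send_value` and `send_ownership` are hypothetical names chosen for illustration, and moving the value together with the ownership is just one possible design choice.

```python
class Processor:
    """Hypothetical per-processor structure: which elements are owned, plus local values."""

    def __init__(self, rank):
        self.rank = rank
        self.owned = set()   # keys of the data elements this processor owns
        self.values = {}     # local copies, whether owned or merely received

    def owns(self, key):
        # Run-time ownership test, e.g. to guard an "owner computes" update.
        return key in self.owned


def send_value(src, dst, key):
    """Copy the value to dst; ownership stays with src."""
    assert src.owns(key), "only the owner may send the value in this sketch"
    dst.values[key] = src.values[key]


def send_ownership(src, dst, key):
    """Move the value and the ownership from src to dst (assumed semantics)."""
    assert src.owns(key), "only the owner may give ownership away"
    dst.values[key] = src.values.pop(key)
    src.owned.remove(key)
    dst.owned.add(key)


p0, p1 = Processor(0), Processor(1)
p0.owned.add("a[0]")
p0.values["a[0]"] = 1.0

send_value(p0, p1, "a[0]")       # p1 can read a[0], p0 still owns it
send_ownership(p0, p1, "a[0]")   # from here on, p1 is responsible for a[0]
```

Making both kinds of transfer explicit operations, as here, is what lets a compiler reason about and reorder them; an implicit representation would hide the second call inside the first.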

3.

On the Implementation and Use of Ada on Fault-Tolerant Distributed Systems

《IEEE transactions on pattern analysis and machine intelligence》1987,(5):553-563

In this paper, we discuss the use of Ada® on distributed systems in which failure of processors has to be tolerated. We assume that tasks are the primary object of distribution, and that communication between tasks on separate processors will take place using the facilities of the Ada language. It would be possible to build a separate set of facilities for communication between processors, and to treat the software on each machine as a separate program. This is unnecessary and undesirable. In addition, the Ada language Reference Manual states specifically that a system consisting of communicating processors with private memories is suitable for executing an Ada program.

4.

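SMP-based parallel image processing for high-speed, high-precision chip mounters (cited 1 time in total: 0 self-citations, 1 citation by others)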

Cai Yanyan, Hu Yueming, Gao Hongxia 《Computer Measurement & Control (计算机测量与控制)》2006,14(1):51-53

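SMP-based multithreaded parallel processing is applied to the image processing system of a chip mounter. By analyzing experimental data for the SMP system, the authors draw some conclusions about task allocation and processor allocation in multithreaded programming on SMP systems.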

5.

Virtual machines for distributed real-time systems

Marco Cereia Ivan Cibrario Bertolotti 《Computer Standards & Interfaces》2009,31(1):30-39

The steady increase in raw computing power of the processors commonly adopted for distributed real-time systems leads to the opportunity of hosting diverse classes of tasks on the same hardware, for example process control tasks, network protocol stacks and man–machine interfaces. This paper describes how virtualization techniques can be used to concurrently run multiple operating systems on the same physical machine, although they are kept fully separated from the security and execution timing points of view, and still have them exhibit acceptable real-time execution characteristics. With respect to competing approaches, the main advantages of this method are that it requires little or no modifications to the operating systems it hosts, along with a better modularity and clarity of design.

6.

Research and implementation of a load index for network cluster computing

Hu Kai, Ma Xuejie, Deng Ke 《Computer Engineering and Design (计算机工程与设计)》2007,28(4):829-831

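For environments that use idle machines on a network for distributed parallel computing on non-dedicated clusters, this paper studies which indices should be used to discover idle processors on a complex general-purpose network and how to measure changes in processor load in real time in order to schedule and allocate processors. Building on a study of the load indices used in existing distributed systems and dedicated clusters, it proposes a composite load index suited to network cluster computing, discusses its role and implementation in the system in detail, and, through extensive testing and analysis, derives a reasonable update period for the load index.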

7.

Performance of a QR algorithm implementation on a multicluster of transputers

《Computing Systems in Engineering》1995,6(4-5):363-367

Some results of an implementation of the QR factorization by Householder reflectors on a multicluster transputer system with distributed memory are presented; they show how important the communication time between processors is to the performance of the algorithm. The QR factorization was chosen as the test method because it is required for many real-life applications, for instance in least squares problems. We use a version of the Householder transformation that is the basis for numerically stable QR factorization. The machine used was the MultiCluster 2 model of Parsytec, which is a distributed memory system with 16 Inmos T800 processors. The Helios operating system was chosen because it provides transparency in CPU management. However, it limits the set of connecting topologies that can be used. The results are presented in terms of speedup and efficiency, showing the importance of the communication time on the total elapsed time.
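
The two metrics named in the abstract have standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of processors. The sketch below pairs them with a toy cost model to show why communication time matters. The model (perfectly divisible computation plus a fixed communication cost) and all timings are assumptions for illustration, not the paper's measurements.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Efficiency E = S / p."""
    return speedup(t_serial, t_parallel) / processors


def modeled_parallel_time(t_serial, processors, t_comm):
    """Toy model (assumption): compute time divides evenly, communication is a fixed add-on."""
    return t_serial / processors + t_comm


# Hypothetical numbers: 16 time units of work on 16 processors.
ideal = modeled_parallel_time(16.0, 16, t_comm=0.0)     # 1.0
with_comm = modeled_parallel_time(16.0, 16, t_comm=1.0)  # 2.0

print(speedup(16.0, ideal), efficiency(16.0, ideal, 16))          # 16.0 1.0
print(speedup(16.0, with_comm), efficiency(16.0, with_comm, 16))  # 8.0 0.5
```

In this toy setting, a communication cost equal to the per-processor compute time is enough to halve both speedup and efficiency.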

8.

Parallel Coarse Grain Computing of Boltzmann Machines

Ortega Julio Rojas Ignacio Diaz Antonio F. Prieto Alberto 《Neural Processing Letters》1998,7(3):169-184

The resolution of combinatorial optimization problems can greatly benefit from the parallel and distributed processing which is characteristic of neural network paradigms. Nevertheless, the fine grain parallelism of the usual neural models cannot be implemented in an entirely efficient way either in general-purpose multicomputers or in networks of computers, which are nowadays the most common parallel computer architectures. Therefore, we present a parallel implementation of a modified Boltzmann machine where the neurons are distributed among the processors of the multicomputer, which asynchronously compute the evolution of their subset of neurons using values for the other neurons that might not be updated, thus reducing the communication requirements. Several alternatives to allow the processors to work cooperatively are analyzed and their performance detailed. Among the proposed schemes, we have identified one that allows the corresponding Boltzmann Machine to converge to solutions with high quality and which provides a high acceleration over the execution of the Boltzmann machine in uniprocessor computers.

9.

Experience with a genetic algorithm implemented on a multiprocessor computer

G.E. Plassman J. Sobieszczanski-Sobieski 《Structural and Multidisciplinary Optimization》2001,22(2):102-115

Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other so that they may be executed concurrently on separate processors, and, on the other hand, that there are some operations in a GA that cannot be so distributed. The algorithm experimented with was based on a Gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure of up to 1080 design variables. The experimentation engaging up to 128 processors confirmed expectations of radical elapsed time reductions compared to a conventional single processor implementation. It also demonstrated that the time spent in the nondistributable parts of the algorithm and the attendant cross-processor communication may have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, raising efficiency above 99 percent. Of particular interest to the user, corresponding elapsed time compression factors approaching 128 are realized on 128 processors. Received October 18, 2000
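
As a rough consistency check on the figures quoted above, one can assume Amdahl's law with a nondistributable fraction $f$ of the work. The abstract does not say which model the authors used, so this is an assumption. With $p = 128$ processors and efficiency $E = 0.99$:

```latex
S = E \, p = 0.99 \times 128 \approx 126.7,
\qquad
S = \frac{1}{f + \frac{1 - f}{p}}
\;\Longrightarrow\;
E = \frac{S}{p} = \frac{1}{1 + f\,(p - 1)}
\;\Longrightarrow\;
f = \frac{\frac{1}{E} - 1}{p - 1} = \frac{\frac{1}{0.99} - 1}{127} \approx 8 \times 10^{-5}
```

Under this model, the nondistributable part (including communication) would have to stay below roughly 0.008 percent of the single-processor run time to exceed 99 percent efficiency, which is consistent with the abstract's emphasis on mitigating that part.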

10.

Parallel Computation of Gröbner Bases on Distributed Memory Machines

《Journal of Symbolic Computation》1994,18(3):207-222

This paper reports our work on parallelizing an algorithm computing Gröbner bases on a distributed memory parallel machine. When computing Gröbner bases, the efficiency of computation is dominated by the total number of S-polynomials. To decrease the total number of S-polynomials it is necessary to apply a selection strategy that selects the minimum polynomial as a new element of an intermediate base. On a distributed memory parallel machine, as opposed to a shared memory parallel machine, we have to take into account non-trivial communication costs between processors. To reduce such communication costs, it is better to employ coarse grained parallelism rather than fine grained parallelism. We adopt a manager-worker model. S-polynomials are reduced in worker processes in parallel, and the minimum polynomial is selected in the manager process. To implement the selection strategy in this parallel model, synchronization between worker processes is required for every selection of a new element of the intermediate base. However, in spite of synchronization, introducing the selection strategy produces not only a better absolute computation speed but also better speedup with multi-processors. We achieved about 8 times speedup with 64 processors for large problems, T-6 and Ex-17.

11.

Scalable s-to-p broadcasting on message-passing MPPs

Hambrusch S.E. Khokhar A.A. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):758-768

In s-to-p broadcasting, s processors in a p-processor machine contain a message to be broadcast to all the processors, 1⩽s⩽p. We present a number of different broadcasting algorithms that handle all ranges of s. We show how the performance of each algorithm is influenced by the distribution of the s source processors and by the relationships between the distribution and the characteristics of the interconnection network. For the Intel Paragon we show that for each algorithm and machine dimension there exist ideal distributions and distributions on which the performance degrades. For the Cray T3D we also demonstrate dependencies between distributions and machine sizes. To reduce the dependence of the performance on the distribution of sources, we propose a repositioning approach. In this approach, the initial distribution is turned into an ideal distribution of the target broadcasting algorithm. We report experimental results for the Intel Paragon and Cray T3D and discuss scalability and performance.

12.

Adaptive parallelism and Piranha

Carriero N. Freeman E. Gelernter D. Kaminsky D. 《Computer》1995,28(1):40-49


13.

Efficient Compositing Methods for the Sort-Last-Sparse Parallel Volume Rendering System on Distributed Memory Multicomputers

Yang Don-Lin Yu Jen-Chih Chung Yeh-Ching 《The Journal of supercomputing》2001,18(2):201-220

In the sort-last-sparse parallel volume rendering system on distributed memory multicomputers, one can achieve a very good performance improvement in the rendering phase by increasing the number of processors. This is because each processor can render images locally without communicating with other processors. However, in the compositing phase, a processor has to exchange local images with other processors. When the number of processors exceeds a threshold, the image compositing time becomes a bottleneck. In this paper, we propose three compositing methods to efficiently reduce the compositing time in parallel volume rendering. They are the binary-swap with bounding rectangle (BSBR) method, the binary-swap with run-length encoding and static load-balancing (BSLC) method, and the binary-swap with bounding rectangle and run-length encoding (BSBRC) method. The proposed methods were implemented on an SP2 parallel machine along with the binary-swap compositing method. The experimental results show that the BSBRC method has the best performance among these four methods.
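
For orientation, the plain binary-swap method that the three proposed variants build on can be simulated in a few lines. This is a minimal single-process sketch, not the paper's implementation. It assumes a power-of-two number of ranks, rank 0 nearest the viewer, and one-dimensional images of premultiplied `(color, alpha)` pixels. The bounding-rectangle and run-length-encoding refinements are not shown.

```python
def over(front, back):
    """Composite a premultiplied (color, alpha) pixel `front` over `back`."""
    cf, af = front
    cb, ab = back
    return (cf + (1.0 - af) * cb, af + (1.0 - af) * ab)


def binary_swap(images):
    """Simulate binary-swap compositing.

    images[r] is the full-size local image of rank r (rank 0 is in front).
    Returns, per rank, (start, end, pixels): the slice of the final image it owns.
    """
    p = len(images)
    assert p > 0 and p & (p - 1) == 0, "number of ranks must be a power of two"
    n = len(images[0])
    state = [(0, n, list(img)) for img in images]  # everyone starts with the whole image
    bit = 1
    while bit < p:
        new_state = [None] * p
        for r in range(p):
            partner = r ^ bit
            start, end, mine = state[r]
            theirs = state[partner][2]  # the partner covers the same region
            mid = (start + end) // 2
            # The lower rank keeps the first half of the region, the higher rank the second.
            lo, hi = (start, mid) if r < partner else (mid, end)
            my_part = mine[lo - start:hi - start]
            their_part = theirs[lo - start:hi - start]
            # Lower ranks are nearer the viewer, so their pixels go in front.
            if r < partner:
                merged = [over(a, b) for a, b in zip(my_part, their_part)]
            else:
                merged = [over(b, a) for a, b in zip(my_part, their_part)]
            new_state[r] = (lo, hi, merged)
        state = new_state
        bit <<= 1
    return state


def assemble(state, n):
    """Gather the per-rank slices into one final image."""
    out = [None] * n
    for start, end, pixels in state:
        out[start:end] = pixels
    return out
```

After log2(p) rounds each rank holds a fully composited 1/p of the image, and in every round it exchanges only half of its current region with its partner. That exchanged half is the data the bounding-rectangle and run-length variants aim to shrink.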

14.

A SIMD parallel trapezoid rasterization algorithm

R.Victor Klassen 《Computers & Graphics》1991,15(4):553-559

An algorithm for the rasterization of trapezoids on the Connection Machine™ (CM) is described. The input consists of an array of trapezoids, with two horizontal sides, arranged with one trapezoid per processor. (Unless otherwise indicated, “processor” should be taken to mean virtual processor.) Each trapezoid is converted to an edge record and the edge records are then distributed to enough processors so that each processor is responsible for one scanline of one trapezoid. Each processor computes a scan record for its scanline, and the scan records are then distributed to enough processors so that one processor is responsible for a single pixel. Final interpolation of position, and possibly shading information, is performed in parallel for all pixels thus created, and the pixels are then broadcast into a frame buffer array, with depth comparisons being performed at the receiving end to ensure that the nearest pixel appears in the array. The Connection Machine-specific features used by the algorithm are logarithmic time cumulative summing, general communication with comparisons for collision arbitration, and virtual processor sets. Performance is similar to that of good graphics workstations. The intended application is to display data already resident in the machine as the result of some previous computation, when a high-performance graphics workstation is not available.
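
One plausible way to hand out one processor per scanline is an exclusive prefix sum over the per-trapezoid scanline counts. The abstract lists cumulative summing among the machine features used but does not spell out the exact scheme, so the sketch below is an assumption. It also runs sequentially, whereas the Connection Machine computes the sum in logarithmic time, and `allocate_scanlines` is a hypothetical name.

```python
from itertools import accumulate


def allocate_scanlines(heights):
    """Assign one (virtual) processor to every scanline of every trapezoid.

    heights[t] is the number of scanlines covered by trapezoid t.
    Returns (offsets, owner): offsets[t] is the first processor index used by
    trapezoid t, and owner[i] is the (trapezoid, local scanline) pair that
    processor i handles.
    """
    # Exclusive prefix sum: the running total *before* each trapezoid.
    offsets = list(accumulate(heights, initial=0))[:-1]
    owner = [None] * sum(heights)
    for t, (off, h) in enumerate(zip(offsets, heights)):
        for k in range(h):
            owner[off + k] = (t, k)
    return offsets, owner


offsets, owner = allocate_scanlines([3, 1, 2])
print(offsets)  # [0, 3, 4]
print(owner)    # [(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 1)]
```

The same pattern would apply one level down, with per-scanline pixel counts in place of per-trapezoid scanline counts, to reach one processor per pixel.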

15.

Execution behavior analysis and performance prediction for a shared-memory implementation of an irregular particle simulation method

《Simulation Practice and Theory》1998,6(7):665-687

Many computationally intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the question of which programming and system support is needed to yield an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time, which is the basis for an exact performance prediction. Common approaches for parallel implementations, like lock protection for concurrent accesses and sequential or distributed task queues, are replaced by more efficient access mechanisms and data structures which can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.

16.

Parallel-efficiency-sensitive selection of the number of data blocks for large-scale SVM

Zhang Chuang, Liao Shizhong 《Journal of Data Acquisition and Processing (数据采集与处理)》2018,33(6):1068-1076

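Choosing the number of data blocks is one of the basic model selection problems in parallel/distributed machine learning, and it directly affects both the generalization and the running efficiency of learning algorithms. Existing parallel/distributed machine learning methods usually choose the number of blocks by experience or by the number of processors, without an explicit selection criterion. This paper proposes a parallel-efficiency-sensitive criterion for selecting the number of data blocks, which improves computational efficiency while preserving the test accuracy of the model. First, the relationship between the generalization error of a parallel/distributed machine learning model and the number of blocks is derived. On this basis, a selection criterion that trades off generalization against parallel efficiency is proposed. Finally, an implementation of large-scale support vector machines that uses the criterion is given in random Fourier feature space under the ADMM framework, and the effectiveness of the criterion is verified experimentally on a high-performance computing cluster and large-scale benchmark datasets.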

17.

Distributed computing methodology for training neural networks in an image-guided diagnostic application

Plagianakos VP Magoulas GD Vrahatis MN 《Computer methods and programs in biomedicine》2006,81(3):228-235

Distributed computing is a process through which a set of computers connected by a network is used collectively to solve a single problem. In this paper, we propose a distributed computing methodology for training neural networks for the detection of lesions in colonoscopy. Our approach is based on partitioning the training set across multiple processors using a parallel virtual machine. In this way, interconnected computers of varied architectures can be used for the distributed evaluation of the error function and gradient values, and, thus, training neural networks utilizing various learning methods. The proposed methodology has large granularity and low synchronization, and has been implemented and tested. Our results indicate that the parallel virtual machine implementation of the training algorithms developed leads to considerable speedup, especially when large network architectures and training sets are used.
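
The idea of partitioning the training set and summing per-partition error and gradient contributions can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' method: a one-weight linear model with squared error stands in for a neural network, a thread pool stands in for the parallel virtual machine, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_error_and_gradient(w, b, chunk):
    """Squared error and its gradient over one partition of (x, y) pairs."""
    err = grad_w = grad_b = 0.0
    for x, y in chunk:
        residual = w * x + b - y
        err += 0.5 * residual * residual
        grad_w += residual * x
        grad_b += residual
    return err, grad_w, grad_b


def distributed_error_and_gradient(w, b, data, workers=4):
    """Split the training set, evaluate each part concurrently, and sum the results."""
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: partial_error_and_gradient(w, b, c), chunks))
    # The only synchronization point: one reduction per evaluation.
    return tuple(sum(part[k] for part in parts) for k in range(3))


data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
err, grad_w, grad_b = distributed_error_and_gradient(0.0, 0.0, data)
```

Because the error function is a sum over training samples, each worker needs only the current weights and returns three numbers, which matches the abstract's description of large granularity and low synchronization.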

18.

Compiling the ARM Linux kernel and simulating its port on Skyeye

Guo Caixia, Yao Qiang 《Computer and Microelectronics Technology (电脑与微电子技术)》2011,(15):56-60

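Taking the Linux operating system as the example environment, this paper describes how to set up an embedded cross-compilation environment under Linux, how to compile an embedded Linux kernel with the cross-compilation tools, and how to simulate porting the Linux kernel on Skyeye. In the ARM Linux cross-compilation part, the S3C2410X is used as the target processor and a Linux kernel that runs on it is built; the kernel version is the latest, Linux-2.6.39.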

19.

Compiling the ARM Linux kernel and simulating its port on Skyeye

Guo Caixia, Yao Qiang 《Modern Computer (现代计算机)》2011,(17):56-60

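Taking the Linux operating system as the example environment, this paper describes how to set up an embedded cross-compilation environment under Linux, how to compile an embedded Linux kernel with the cross-compilation tools, and how to simulate porting the Linux kernel on Skyeye. In the ARM Linux cross-compilation part, the S3C2410X is used as the target processor and a Linux kernel that runs on it is built; the kernel version is the latest, Linux-2.6.39.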

20.

Location Independent Remote Execution in NEST

《IEEE transactions on pattern analysis and machine intelligence》1987,(8):905-912

We consider a computing environment consisting of a network of autonomous, yet cooperating personal computer workstations and shared servers. Computing cycles in such an environment can be shared by creating a pool of compute servers in the network that may be used by the workstations to supplement their computing needs. Some processors may be permanently designated to be the compute servers. In addition, through an advertisement mechanism, any workstation may make itself temporarily available for a specific duration of time to be used as a compute server. In this paper, we present the design and implementation of a scheme for augmenting the UNIX® operating system with a location-independent remote execution capability. This capability allows processes to be offloaded to the compute servers and preserves the execution environment of these processes as if they were still executing locally at the originating machine. Our model provides execution location independence of processes by preserving the process view of the file system, parent-child relationships, process groups, and process signaling across machine boundaries in a transparent way. We also present our scheme that allows processors to advertise themselves as available to some or all nodes in the network and withdraw as a compute server in a distributed manner. The scheme is robust in the presence of node failures.