首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In the paper, a systematic discussion is made of state of the art theory and applications of fine-grained (massive) parallelism, which is permanently being adopted in computational mathematics. All known models of fine-grained computations (cellular automaton, neural and cellular neural networks, statistical automata, etc.) are represented in terms of a unique formalism, the so-called parallel substitution algorithm (PSA), which made it possible, on the one hand, to highlight common properties of the models and, on the other hand, to demonstrate expressiveness capabilities of the PSA. Theoretical and experimental results of studies on applications of fine-grained algorithms to modeling of spatial dynamics of reaction–diffusion and molecular processes are presented. Promising prospects of their application are substantiated both for the creation of special processors designed for the implementation of the algorithms and for the implementation of the algorithms on multiprocessor systems.  相似文献   

2.
Various schemes are considered of the parallel implementation of the branch and bound method, as applied to multiprocessor computing systems (clusters) with the distributed memory. In the language of informal automata, questions are set out of the organization of the exchange of data and signals within the cluster, which afford the asynchronous operation of its processors. Common ideas are illustrated by the example of the classical traveling salesman problem and data of numerical experiments performed on the multiprocessor computing system-100 (MCS-100) are given.  相似文献   

3.
The Cydra 5 is a heterogeneous multiprocessor system that targets small work groups or departments of scientists and engineers. The two types of processors are functionally specialized for the different components of the work load found in a departmental setting. The Cydra 5 numeric processor, based on a directed-data-flow architecture, provides consistently high performance on a broader class of numerical computations. The interactive processors offload all nonnumeric work from the numeric processor, leaving it free to spend all its time on the numeric application. The I/O processors permit high-bandwidth I/O transitions with minimal involvement from the interactive or numeric processors. The system architecture and data-flow architecture are described. The numeric processor decisions and tradeoffs are examined, and the main memory system is discussed. Some reflections on the design issues are offered  相似文献   

4.
This paper proposes a new parallel architecture, which has the potential to support low-level image processing as well as intermediate and high-level vision analysis tasks efficiently. The integrated architecture consists of an SIMD mesh of processors enhanced with multiple broadcast buses, and MIMD multiprocessor with orthogonal access buses, and a two-dimensional shared memory array. Low-level image processing is performed on the mesh processor, while intermediate and high-level vision analysis is performed on the orthogonal multiprocessor. The interaction between the two levels is supported by a common shared memory. Concurrent computations and I/O are made possible by partitioning the memory into disjoint spaces so that each processor system can access a different memory space. To illustrate the power of such a two-level system, we present efficient parallel algorithms for a variety of problems from low-level image processing to high-level vision. Representative problems include matrix based computations, histogramming and key counting operations, image component labeling, pyramid computations, Hough transform, pattern clustering, and scene labeling. Through computational complexity analysis, we show that the integrated architecture meets the processing requirements of most image understanding tasks.  相似文献   

5.
Certain special computational models called the filter automata (FA) are presented. We use them in iterative, pipelined processing of discrete sequences. Such computations reveal characteristic moving, periodic objects, with soliton-like properties referred to as the filtrons. Filter automata play the role of a transmitting medium or a field, while filtrons behave like waves or particles

We characterise FA by showing their distinguishing features, and fillrons—by introducing some parameters, such as edge numbers, number of FA cycles involved, distribution of internal idle pulses, etc. We show that filtrons may be generated in parallel by a multiprocessor ring net. This means that each filtron has its associated ring computation. If such a computation succeeds then its companion filtron exists, too. The link between filtrons and their ring computations is established simply by the filtron edge numbers. This gives a method of calculating a whole filtron if its edge number and supporting FA are known, and of looking for possible filter automata capable of supporting a prescribed filtron

A number of FA computational phenomena, recently discovered, are also presented. Some of them call for physical experiments to verify to what extent the filter automata may model physical reality. We believe that such experiments could be performed in optical fibres using solitons of light.  相似文献   

6.
Speedup and efficiency, two measures for performance of pipelined computers, are now used to evaluate performance of parallel algorithms for multiprocessor systems. However, these measures consider only the computation time and number of processors used and do not include the number of the communication links in the system. The author defines two new measures, cost effectiveness and time-cost effectiveness, for evaluating performance of a parallel algorithm for a multiprocessor system. From these two measures two characterization factors for multiprocessor systems are defined and used to analyze some well-known multiprocessor systems. It is found that for a given penalty function, every multiprocessor architecture has an optimal number of processors that produces maximum profit. If too many processors are used, the higher cost of the system reduces the profit obtained from the faster solution. On the other hand, if too few processors are used, the penalty paid for taking a longer time to obtain the solution reduces the profit  相似文献   

7.
Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements  相似文献   

8.
A cellular automata (CA) approach is proposed for simulating a fluid flow through porous materials with tortuous channels at pore level. The approach aims to combine CA methods both for constructing computer representation of porous material morphology and for simulating fluid flow through it. Morphology representation is obtained using CA whose evolution exhibits self-organization and results in a stable configuration. The latter is then used for Lattice Gas CA application to simulate fluid flow through a porous material specimen and compute its permeability properties. Special boundary conditions are introduced allowing for different smoothness of solid pore walls surface. The model has been tested on a small 2D fragment in a PC and then implemented to investigate a porous carbon electrode of a hydrogen fuel cell on 128 processors of a multiprocessor cluster.  相似文献   

9.
Partitioned EDF scheduling on a few types of unrelated multiprocessors   总被引:1,自引:1,他引:0  
A polynomial-time approximation scheme (PTAS) is derived for the partitioned EDF scheduling of implicit-deadline sporadic task systems upon unrelated multiprocessor platforms that are comprised of a constant number of distinct types of processors. This generalizes earlier results showing the existence of polynomial-time approximation schemes for the partitioned EDF scheduling of implicit-deadline sporadic task systems on (1) identical multiprocessor platforms, and (2) unrelated multiprocessor platforms containing a constant number of processors.  相似文献   

10.
单片多处理器结构支持较高线程级的并行,能显著提高性能.介绍了单片多处理器的结构,对一些结构模型和实际的商用处理器进行举例,并对关键技术进行了研究分析.  相似文献   

11.
Static and dynamic load balancing strategies for a multiprocessor system for a ray tracing algorithm based on constant subdivision are presented. An object space is divided into regular cubes (subspaces), whose boundary planes are perpendicular to the coordinate axes, and these are allocated to the processors in the system. Here, load balancing among the processors is the most important problem. Firstly, in a category of static load balancing, strategies for mapping the subspaces into the processors are evaluated by simulation. Moreover, we propose a hierarchical multiprocessor system in order to realize dynamic load balancing with the static one. Its architecture can overcome the limitation of the static load balancing in a large scale multiprocessor system.  相似文献   

12.
The high power consumption of modern processors becomes a major concern because it leads to decreased mission duration (for battery-operated systems), increased heat dissipation, and decreased reliability. While many techniques have been proposed to reduce power consumption for uniprocessor systems, there has been considerably less work on multiprocessor systems. In this paper, based on the concept of slack sharing among processors, we propose two novel power-aware scheduling algorithms for task sets with and without precedence constraints executing on multiprocessor systems. These scheduling techniques reclaim the time unused by a task to reduce the execution speed of future tasks and, thus, reduce the total energy consumption of the system. We also study the effect of discrete voltage/speed levels on the energy savings for multiprocessor systems and propose a new scheme of slack reservation to incorporate voltage/speed adjustment overhead in the scheduling algorithms. Simulation and trace-based results indicate that our algorithms achieve substantial energy savings on systems with variable voltage processors. Moreover, processors with a few discrete voltage/speed levels obtain nearly the same energy savings as processors with continuous voltage/speed, and the effect of voltage/speed adjustment overhead on the energy savings is relatively small.  相似文献   

13.
The Data Locality of Work Stealing   总被引:1,自引:0,他引:1  
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental validation. {As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations G n whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ω(n) . On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large, important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on P processors a multiprocessor execution incurs an expected
more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T fty is the number of nodes on the longest chain of dependencies. Based on this we give strong execution time bounds for nested-parallel computations using work stealing.} For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.  相似文献   

14.
Scheduling program tasks on processors is at the core of the efficient use of multiprocessor systems. Most task-scheduling problems are known to be NP-Hard and, thus, heuristics are the method of choice in all but the simplest cases. The utilization of acknowledged sets of benchmark-problem instances is essential for the correct comparison and analysis of heuristics. Yet, such sets are not available for several important classes of scheduling problems, including multiprocessor scheduling problem with communication delays (MSPCD) where one is interested in scheduling dependent tasks onto homogeneous multiprocessor systems, with processors connected in an arbitrary way, while explicitly accounting for the time required to transfer data between tasks allocated to different processors. We propose test-problem instances for the MSPCD that are representative in terms of number of processors, type of multiprocessor architecture, number of tasks to be scheduled, and task graph characteristics (task execution times, communication costs, and density of dependencies between tasks). Moreover, we define our task-graph generators in a way appropriate to ensure that the corresponding problem instances obey the theoretical principles recently proposed in the literature.  相似文献   

15.
Cellular automata (CA) are able to produce a global behavior from local interactions between their units. They have been applied to the task scheduling problem in multiprocessor systems in a very distinguished way. As this problem is NP-Complete, heuristics and meta-heuristics are usually employed. However, these techniques must always start the scheduling process from scratch for each new parallel application given as input. On the other hand, the main advantage to use CA for scheduling is the discovery of rules while solving one application and their subsequent reuse in other instances. Recently studies related to CA-based scheduling have shown relevant approaches as the use of synchronous updating in CA evolution and good results in multiprocessor systems with two processors. However, some aspects, such as the low performance of CA-based schedulers in architectures with more than two processors and during the reuse of the discovered rules, need to be investigated. This paper presents two new models to improve CA-based scheduling to deal with such aspects. The first proposal refers to the employment of a construction heuristic to initialize CA evolution and the second one is a new neighborhood model able to capture the dependence and relations strength among the tasks in a very simple way. It was named pseudo-linear neighborhood. An extensive experimental evaluation was performed using graphs of parallel programs found in the literature and new ones randomly generated. Experimental analysis showed the combined application of both techniques makes the search for CA transition rules during learning stage more robust and leads to a significant gain when considering the reuse of them on real-world conditions.  相似文献   

16.
Suh  T. Lee  H.-H.S. Blough  D.M. 《Micro, IEEE》2004,24(5):70-78
In this second part of this two-part article, we present two examples of integrating heterogeneous processors and show the limitation of integrating processors without native support for cache coherence. Finally, we discuss the Verilog simulation results of applying our techniques to actual heterogeneous multiprocessor platforms. Experiments with actual heterogeneous multiprocessor platforms on a shared-bus measure the effectiveness of two cache coherence techniques. This integration approach, snoop-hit buffer, and the accompanying region-based cache coherence approach yield significant speedups compared to a pure software solution.  相似文献   

17.
Summary General models of multiprocessor systems in which processors are functionally dedicated are described. In these models, processors are divided into different types. A task can be assigned only to a processor of certain types. Clearly, the model of multiprocessor systems with identical processors is a special case of our models. These models also include the job shop problem in which there is exactly one processor of each type. Worst case performance bounds of priority-driven schedules are obtained.This work was supported by the National Science Foundation under Grants NSFDCR 72-03740 and NSFMCS 73-03408  相似文献   

18.
多处理机系统随着系统规模的不断扩大,其组件的脆弱性随着增大。定位和替换故障处理器对于系统保持高可靠性能至关重要。故障诊断就是系统通过自身的测试来识别故障处理机的过程。基于比较的诊断分析是多计算机系统故障诊断分析的有效方法。证明煎饼网络(Pancake Network)在比较诊断模型下的诊断度。  相似文献   

19.
王刚  王跃科  乔纯捷 《测控技术》2007,26(12):48-50
在某大型测试系统的设计中,为了实现多只处理器问灵活的数据交换及并行处理,提出了基于Ethernet的多处理器并行系统设计;通过FPGA实现了以太网交换机的介质无关接口与处理器同步串口的接口转换,从而实现了处理器接收和发送网络数据.在此基础上实现了多处理器的并行数据处理。为实现高效的对多处理器系统的开发调试,提出了基于Ethernet的多处理器网络调试方案,最后对系统的可扩展性进行了分析。  相似文献   

20.
An efficient parallel algorithm for merging two sorted lists is presented. The algorithm is based on a novel partitioning algorithm that splits the two lists among the processors, in a way that ensures load balance during the merge. The partitioning algorithm can itself be efficiently parallelized, allowing the solution to scale with increased numbers of processors. A shared memory multiprocessor is assumed. The time complexity for partitioning and merging is O(N/p + log N), where p is the number of processors and N is the total number of elements in the two lists. Implementation results on a twenty node Sequent Symmetry multiprocessor are also presented.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号