Similar Documents
20 similar documents found.
1.
《Real》1996,2(6):383-392
Image processing applications require both computing and communication power. The aim of the GFLOPS project was to study all aspects of the design of such computers. The project's goal was to develop a parallel architecture, as well as its software environment, to implement those applications efficiently. The proposed architecture supports up to 512 processor nodes connected over a scalable and cost-effective network at a constant cost per node. The first prototype implementation, running since the beginning of 1995, has shown that a parallel system can be both scalable and programmable through the use of a virtually shared memory paradigm, physically implemented with atomic message passing. GFLOPS-2 is a single-user machine designed to be used as a low-cost parallel co-processor board in a desktop workstation. In this paper we discuss the design of the GFLOPS-2 machine and evaluate the effectiveness of the mechanisms incorporated. The analysis of the architecture's behaviour was conducted with image-processing and general-purpose algorithms, written in C and assembly language, through execution-driven simulations. A development environment, in particular a C data-parallel language, was built for this purpose.

2.
C++ was originally designed as a sequential programming language. For the development of multithreaded applications, libraries such as Pthreads, Windows threads, and Boost are traditionally used. The C++11 standard introduced some basic concepts and means for developing parallel and concurrent programs, but the direct use of these low-level facilities requires high programming skill and significant effort. The absence of high-level models of parallelism in C++ is partly compensated for by various parallel libraries and directive-based parallelization tools (such as OpenMP), as well as by language extensions supported by some compilers (Intel Cilk Plus). Nevertheless, more advanced means of expressing parallelism are still needed at the level of the language standard and the standard library. In this survey, we consider the means for parallel and concurrent programming included in the C++17 standard, as well as some capabilities expected in future standards.
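As a concrete illustration of the standard facilities such a survey covers, here is a minimal sketch using the C++17 parallel algorithms with execution policies; it assumes a toolchain whose standard library actually runs these policies in parallel (for example, one backed by TBB) and is not drawn from the surveyed paper itself.

// Minimal sketch of the C++17 parallel algorithms (std::execution policies).
// Whether the work really runs on multiple threads depends on the standard
// library implementation; the policy is a permission, not a guarantee.
#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 1.0);

    // Element-wise update that may be executed on multiple threads.
    std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                   [](double x) { return 2.0 * x; });

    // Parallel reduction over the updated values.
    const double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);

    std::cout << sum << '\n';   // prints 2e+06
    return 0;
}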

3.
Programming multiprocessor parallel architectures is a complex task. This paper describes a block-structured scientific programming language, BLAZE, designed to simplify this task. BLAZE contains array arithmetic, ‘forall’ loops, and APL-style accumulation operators, which allow natural expression of fine-grained parallelism. It also employs an applicative, or functional, procedure invocation mechanism, which makes it easy for compilers to extract coarse-grained parallelism using machine-specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow.

A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture, while neglecting the remainder. This paper describes the features of BLAZE and shows how this language would be used in typical scientific programming.
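BLAZE syntax is not reproduced in this abstract; purely as a hedged analogue of the array arithmetic, ‘forall’ loops, and accumulation operators it mentions, the C++ sketch below expresses the same kind of fine-grained data parallelism with a standard transform-reduce. All names are illustrative.

// Hedged C++ analogue (not BLAZE syntax): an element-wise array operation
// fused with an APL-style accumulation, i.e. roughly
//   forall i: c[i] = alpha * a[i] * b[i];  result = +/ c
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

double scaled_dot(const std::vector<double>& a,
                  const std::vector<double>& b,
                  double alpha) {
    return std::transform_reduce(std::execution::par,
                                 a.begin(), a.end(), b.begin(), 0.0,
                                 std::plus<>(),
                                 [alpha](double x, double y) {
                                     return alpha * x * y;
                                 });
}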


4.
5.
The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular, so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on the Cray XD1, a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of the XD1.
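The paper's designs are hardware PE arrays; as a hedged software illustration of the underlying trade-off between on-chip storage and external memory bandwidth, the C++ sketch below shows plain blocked matrix multiplication, where the block size blk stands in for the amount of fast local storage available. It is not the paper's algorithm.

// Hedged sketch: blocked matrix multiplication in plain C++.
// A larger block size reuses each loaded element more often, lowering the
// external memory traffic needed per floating-point operation.
#include <algorithm>
#include <cstddef>
#include <vector>

void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n, std::size_t blk) {
    // A, B, C are n x n matrices in row-major order; C must be zero-initialized.
    for (std::size_t ii = 0; ii < n; ii += blk)
        for (std::size_t kk = 0; kk < n; kk += blk)
            for (std::size_t jj = 0; jj < n; jj += blk)
                for (std::size_t i = ii; i < std::min(ii + blk, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + blk, n); ++k) {
                        const double aik = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + blk, n); ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}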

6.
The HPCG benchmark is a new metric for ranking supercomputers. It mainly measures a supercomputer's ability to solve large sparse linear systems, is closer to real applications, and has attracted wide attention in recent years. Developing heterogeneous many-core parallel HPCG software for domestic supercomputers is therefore of great significance: it can not only improve the HPCG ranking of domestic supercomputers, but also provide parallel algorithms and optimization techniques as a reference for many applications. This paper targets a certain domestic complex heterogeneous supercomputer. We first parallelize HPCG with a blocked graph-coloring algorithm and propose a coloring algorithm suited to structured grids; its parallel performance is higher than that of traditional algorithms such as JPL and CC, and its coloring quality is also higher. Applied to HPCG, it reduces the iteration count by three and improves overall performance by 6%. We also analyze the transfer overheads among the components of the complex heterogeneous system, propose a task-partitioning scheme better suited to HPCG, and carry out fine-grained optimizations in terms of sparse matrix storage format, sparse matrix reordering, and memory access. In addition, for multi-process runs, an inner/outer region partitioning algorithm is used to hide the neighbor communication in the core kernels SpMV and SymGS. In the final full-machine test, performance reaches 1.67% of the peak performance of the domestic supercomputer, and the full-machine weak-scaling parallel efficiency relative to a single node reaches 92%.
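For reference, the sketch below is a generic CSR sparse matrix-vector product, the SpMV kernel named in the abstract; it is not the optimized heterogeneous implementation described there. Rows are independent, which is what makes SpMV easy to parallelize, whereas the SymGS sweep has row dependences and needs the coloring discussed above.

// Hedged, generic CSR (compressed sparse row) SpMV reference kernel.
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::vector<std::size_t> row_ptr;  // size n+1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double>      val;      // size nnz
};

void spmv(const CsrMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
    const std::size_t n = A.row_ptr.size() - 1;
    for (std::size_t i = 0; i < n; ++i) {       // rows are independent,
        double sum = 0.0;                       // so this loop parallelizes
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}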

7.
The HPCG benchmark is a new metric for ranking supercomputers. It mainly measures a supercomputer's ability to solve large sparse linear systems, is closer to real applications, and has attracted wide attention in recent years. Developing heterogeneous many-core parallel HPCG software for domestic supercomputers is of great significance: it can not only improve the HPCG ranking of domestic supercomputers, but also provide parallel algorithms and optimization techniques as a reference for many applications. Targeting a certain domestic complex…

8.
This paper proposes a predicate named dosim which provides a new function for the parallel execution of logic programs. The parallelism achieved by this predicate is a simultaneous mapping operation, like that of the bagof and setof predicates. However, the degree of parallelism can easily be decided by arranging the arguments of the dosim goal. The parallel processing system with dosim was realized on a tightly coupled multiprocessor machine. To control the degree of parallelism and reduce the amount of memory required for execution, we introduce a grouping method for the goals executed in parallel and some variations of the dosim predicate. The effectiveness of the proposed method is demonstrated by the results of the execution of several applications.
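dosim itself is a Prolog-level construct; purely as a hedged C++ analogue, the sketch below maps a function over a list of work items in parallel, with a groups parameter that bounds the degree of parallelism in the spirit of the grouping method described. All names here are illustrative.

// Hedged analogue (not the paper's Prolog system): a parallel map whose
// degree of parallelism is bounded by grouping the work items; each group
// is handled by one asynchronous task and processed sequentially inside it.
#include <algorithm>
#include <cstddef>
#include <future>
#include <type_traits>
#include <vector>

template <typename In, typename Func>
auto parallel_map(const std::vector<In>& items, Func f, std::size_t groups)
    -> std::vector<std::invoke_result_t<Func, const In&>> {
    using Out = std::invoke_result_t<Func, const In&>;
    std::vector<Out> results(items.size());
    if (items.empty() || groups == 0) return results;
    const std::size_t chunk = (items.size() + groups - 1) / groups;
    std::vector<std::future<void>> tasks;
    for (std::size_t g = 0; g < groups; ++g) {
        tasks.push_back(std::async(std::launch::async, [&, g] {
            const std::size_t lo = g * chunk;
            const std::size_t hi = std::min(lo + chunk, items.size());
            for (std::size_t i = lo; i < hi; ++i)
                results[i] = f(items[i]);
        }));
    }
    for (auto& t : tasks) t.get();   // wait for every group to finish
    return results;
}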

9.
Research in the area of parallel evaluation mechanisms for logic programs has led to the proposal of a number of schemes exploiting various forms of parallelism. Many of the early models have been based on the conventional approach of organising concurrent components of computations as communicating processes. More recently, however, models based on more novel computation organisations, in particular data-driven organisations, have been proposed. This paper describes the development of one such model, its implementation, and the design of a data-driven machine to support it. The model exploits a form of parallelism known as OR-parallelism and is particularly suited to database applications, although it would also support general applications. It is envisaged that the proposed machine may be refined into an efficient database engine, which can then be a component of a more powerful and integrated logic programming machine.

10.
The construction of efficient parallel programs usually requires expert knowledge in the application area and deep insight into the architecture of a specific parallel machine. Often, the resulting performance is not portable, i.e., a program that is efficient on one machine is not necessarily efficient on another machine with a different architecture. Transformation systems provide a more flexible solution. They start with a specification of the application problem and allow the generation of efficient programs for different parallel machines. The programmer has to give an exact specification of the algorithm expressing the inherent degree of parallelism and is released from the low-level details of the architecture. We propose such a transformation system, with an emphasis on the exploitation of data parallelism combined with a hierarchically organized structure of task parallelism. Starting with a specification of the maximum degree of task and data parallelism, the transformations generate a specification of a parallel program for a specific parallel machine. The transformations are based on a cost model and are applied in a predefined order, fixing the most important design decisions such as the scheduling of independent multitask activations, data distributions, pipelining of tasks, and the assignment of processors to task activations. We demonstrate the usefulness of the approach with examples from scientific computing.

11.
Case studies in asynchronous data parallelism
Is the owner-computes style of parallelism, captured in a variety of data-parallel languages, attractive as a paradigm for designing explicit control-parallel codes? This question gives rise to a number of others. Will such use be unwieldy? Will the resulting code run well? What can such an approach offer beyond merely replicating, in a more labor-intensive way, the services and coverage of data-parallel languages? We investigate these questions via a simple example and “real world” case studies developed using C-Linda, a language for explicit parallel programming formed by the merger of C with the Linda coordination language. The results demonstrate that owner-computes is an effective design strategy in Linda.
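The case studies themselves use C-Linda tuple-space operations, which are not shown in this abstract; purely as a hedged illustration of the owner-computes rule, the C++ thread sketch below has each worker own one block of the output array and compute only the points it owns, while reading neighbouring data freely.

// Hedged illustration of owner-computes in plain C++ threads (not C-Linda):
// in a 1-D Jacobi relaxation, each worker owns a contiguous block of new_u
// and writes only the elements it owns; old_u is read-only for everyone.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void jacobi_owner_computes(const std::vector<double>& old_u,
                           std::vector<double>& new_u,
                           std::size_t workers) {
    const std::size_t n = old_u.size();
    if (n < 3 || workers == 0) return;           // no interior points to update
    const std::size_t chunk = (n + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const std::size_t lo = std::max<std::size_t>(1, w * chunk);
            const std::size_t hi = std::min(n - 1, (w + 1) * chunk);
            for (std::size_t i = lo; i < hi; ++i)    // only the owner writes new_u[i]
                new_u[i] = 0.5 * (old_u[i - 1] + old_u[i + 1]);
        });
    }
    for (auto& t : pool) t.join();
}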

12.
This paper first considers the major developments that have occurred in the design of high-level languages for sequential machines. These developments illustrate how languages that were independent of the hardware eventually evolved. Two major types of language for vector and parallel processors have evolved, namely detection-of-parallelism languages and expression-of-machine-parallelism languages. The advantages and disadvantages of each type of language are examined. A third type of language is also considered, which reflects neither the compiler's detection mechanism nor the underlying hardware. The syntax of this language enables the parallel nature of a problem to be expressed directly. The language is thus appropriate for both vector and array processors.

13.
In the international supercomputing field, the architecture of parallel machines and the corresponding parallel programming languages have long been frontier topics and difficult problems, and changes in architecture inevitably drive the improvement and evolution of programming languages. Based on an SPP architecture with hypernodes, this paper discusses, at the language level, control (task) parallelism, data distribution, and synchronization under that architecture.

14.
This paper presents the first results for the implementation of the logic language BRAVE on a parallel architecture. We explain the operational semantics of BRAVE with common programming examples and show how both AND- and OR-parallelism can be exploited and controlled using BRAVE syntax. The design of an abstract machine for the parallel execution of BRAVE is given, along with the principles of compilation and example codings. Results are presented from running example programs on a three-processor prototype, using an interpreter for BRAVE written in C.

15.
Wilmarth  D.D. 《Computer》1993,26(8):70-72
The parallel processing requirements of many computer applications, such as machine vision, radar, sonar, and signal processing, are reviewed. The major hardware architectural features for optimizing parallel processing performance (interconnect topology, memory locality, and synchronization facilities) are discussed. The various parallel processing models available are also discussed; these include job-level parallelism, data-level parallelism, algorithm-level parallelism, loop-level parallelism, and compute clusters.

16.
One of the challenges for parallel compilers and compiler-related tools is, given a machine-independent parallel language, to generate executable code for a variety of computational models, and to identify those specific parallel modes for which a program is well-suited. One portion of this problem, developing a method for estimating the relative execution time of a data-parallel algorithm in an environment capable of the SIMD and SPMD (MIMD) modes of parallelism, is presented. Given a data-parallel program in a language whose syntax is mode-independent and empirical information about instruction execution time characteristics, the goal is to use static source-code analysis to determine an implementation that results in an optimal execution time for a mixed-mode machine capable of SIMD and SPMD parallelism. Statistical information about individual operation execution times and paths of execution through a parallel program is assumed. A secondary goal of this study is to indicate language, algorithm, and machine characteristics that must be researched to learn how to provide the information needed to obtain an optimal assignment of parallel modes to program segments.
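The abstract does not give the estimation procedure itself; as a hedged, illustrative stand-in, the sketch below shows one simple way such a mixed-mode cost model could be evaluated: each program segment has an estimated SIMD time and an estimated SPMD time, and a small dynamic program picks per-segment modes while charging a fixed penalty for every mode switch. The structure and all parameters are assumptions, not the paper's method.

// Hedged, illustrative mixed-mode cost model (not the paper's estimator):
// choose per-segment modes minimizing total time, with a fixed cost added
// whenever consecutive segments run in different modes.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Segment { double t_simd; double t_spmd; };

double best_mixed_mode_time(const std::vector<Segment>& segs,
                            double switch_cost) {
    if (segs.empty()) return 0.0;
    // dp_simd = best total time so far ending in SIMD mode; dp_spmd likewise.
    double dp_simd = segs[0].t_simd;
    double dp_spmd = segs[0].t_spmd;
    for (std::size_t i = 1; i < segs.size(); ++i) {
        const double next_simd =
            segs[i].t_simd + std::min(dp_simd, dp_spmd + switch_cost);
        const double next_spmd =
            segs[i].t_spmd + std::min(dp_spmd, dp_simd + switch_cost);
        dp_simd = next_simd;
        dp_spmd = next_spmd;
    }
    return std::min(dp_simd, dp_spmd);
}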

17.
Kasi Anantha  Fred Long 《Software》1990,20(6):537-554
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.

18.
PPCDS (Parallel Program Conceptual Design System) is a parallel programming environment that integrates a high-level data-parallel modeling language, parallelism-recognition methods, automatic construction of parallel programs, and human-computer interaction interface technology. It simplifies parallel program design, effectively shortens the parallel program development cycle, and improves parallel computing efficiency. The PPCDS integrated development environment is an important component of PPCDS; this paper briefly introduces its design and implementation.

19.
The IBM Blue Gene/C parallel computer aims to demonstrate the feasibility of a cellular architecture computer with millions of concurrent threads of execution. One of the major challenges in this project is showing that applications can successfully scale to this massive amount of parallelism. In this paper we demonstrate that the simulation of protein folding using classical molecular dynamics falls in this category. Starting from the sequential version of a well known molecular dynamics code, we developed a new parallel implementation that exploited the multiple levels of parallelism present in the Blue Gene/C cellular architecture. We performed both analytical and simulation studies of the behavior of this application when executed on a very large number of threads. As a result, we demonstrate that this class of applications can execute efficiently on a large cellular machine.
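The abstract describes the application only at a high level; purely as a hedged illustration of the classical molecular dynamics computation involved, the sketch below shows one velocity-Verlet time step with a naive O(N^2) pair force loop, which is the part a massively parallel implementation would decompose across its many threads. The force model and all names are illustrative, not the paper's code.

// Hedged sketch of one classical MD time step (velocity Verlet) with a naive
// O(N^2) pair force loop; a massively parallel code would decompose this pair
// loop across threads/nodes. Illustrative only.
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

void md_step(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
             std::vector<Vec3>& force, double mass, double dt) {
    const std::size_t n = pos.size();
    // Half-kick and drift.
    for (std::size_t i = 0; i < n; ++i) {
        vel[i].x += 0.5 * dt * force[i].x / mass;
        vel[i].y += 0.5 * dt * force[i].y / mass;
        vel[i].z += 0.5 * dt * force[i].z / mass;
        pos[i].x += dt * vel[i].x;
        pos[i].y += dt * vel[i].y;
        pos[i].z += dt * vel[i].z;
    }
    // Recompute forces with a simple 1/r^2 pair interaction (placeholder model).
    for (auto& f : force) f = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            const double dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y,
                         dz = pos[i].z - pos[j].z;
            const double r2 = dx * dx + dy * dy + dz * dz + 1e-12;  // avoid /0
            const double s = 1.0 / (r2 * std::sqrt(r2));  // 1/r^3 factor -> 1/r^2 force
            force[i].x += s * dx; force[i].y += s * dy; force[i].z += s * dz;
            force[j].x -= s * dx; force[j].y -= s * dy; force[j].z -= s * dz;
        }
    // Second half-kick.
    for (std::size_t i = 0; i < n; ++i) {
        vel[i].x += 0.5 * dt * force[i].x / mass;
        vel[i].y += 0.5 * dt * force[i].y / mass;
        vel[i].z += 0.5 * dt * force[i].z / mass;
    }
}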

20.
The IC* project is an effort to create an environment for the design, specification, and development of complex systems such as communication protocols, parallel machines, and distributed systems. The basis of the project is the IC* model of parallel computation, in which a system is specified by a set of invariant expressions that describe its behavior in time. The features of this model include temporal and structural constraints, inherent parallelism, explicit modeling of time, nondeterministic evolution, and dynamic activation. The project also includes the construction of a parallel computer specifically designed to support the model of computation. The authors discuss the IC* model and the current user language, and describe the architecture and hardware of the prototype supercomputer built to execute IC* programs.
