Similar Documents
Retrieved 20 similar documents (search time: 42 ms)
1.
In compiling applications for distributed memory machines, runtime analysis is required when the data to be communicated cannot be determined at compile time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the required runtime analysis; the library is currently implemented on several different systems. We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and that the compiler-parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library.
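As a rough illustration of the runtime analysis involved, the following Python sketch performs the inspector step such a library needs: given the index ranges one block owns and the ghost region another block requires, it computes the overlap to be communicated. All names here are hypothetical; this is not the authors' library.

# Hypothetical sketch of the inspector step for irregularly coupled
# structured blocks: find which cells of a neighbouring block must be
# communicated to fill a ghost region. Not the authors' actual library.

def intersect(r1, r2):
    """Intersection of two half-open index ranges (lo, hi)."""
    lo, hi = max(r1[0], r2[0]), min(r1[1], r2[1])
    return (lo, hi) if lo < hi else None

def build_schedule(owned, needed):
    """Inspector: per-dimension overlap of a remote block's owned
    region with the ghost region this process needs."""
    overlap = [intersect(o, n) for o, n in zip(owned, needed)]
    return overlap if all(overlap) else None

# Block A owns rows 0..8, cols 0..8; our ghost region needs rows 8..10.
print(build_schedule(owned=[(0, 8), (0, 8)],
                     needed=[(8, 10), (0, 8)]))   # -> None (no overlap)
print(build_schedule(owned=[(0, 8), (0, 8)],
                     needed=[(6, 8), (0, 8)]))    # -> [(6, 8), (0, 8)]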

2.
In the 1990s the Message Passing Interface Forum defined MPI bindings for Fortran, C, and C++. With the success of MPI, these relatively conservative languages have continued to dominate in the parallel computing community. There are compelling arguments in favour of more modern languages like Java, including portability, better runtime error checking, modularity, and multi-threading. But these arguments have not converted many HPC programmers, perhaps due to the scarcity of full-scale scientific Java codes and the lack of evidence for performance competitive with C or Fortran. This paper tries to redress this situation by porting two scientific applications to Java. Both applications are parallelized using our thread-safe Java messaging system, MPJ Express. The first application is the Gadget-2 code, a massively parallel structure formation code for cosmological simulations. The second application uses the finite-difference time-domain (FDTD) method for simulations in the area of computational electromagnetics. We evaluate and compare the performance of the Java and C versions of these two scientific applications, and demonstrate that the Java codes can achieve performance comparable with legacy applications written in conventional HPC languages. Copyright © 2009 John Wiley & Sons, Ltd.
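The second application rests on the FDTD method. As a minimal illustration of the kind of kernel involved (the paper's actual codes are parallel Java and C), here is a serial 1D FDTD update in Python with arbitrary grid size and constants:

import numpy as np

# Minimal serial 1D FDTD sketch (free space, normalized units).
# Illustrative only; the paper's application is a parallel Java code.
n, steps = 200, 500
ez = np.zeros(n)        # electric field
hy = np.zeros(n - 1)    # magnetic field, staggered half a cell

for t in range(steps):
    hy += 0.5 * (ez[1:] - ez[:-1])               # update H from curl of E
    ez[1:-1] += 0.5 * (hy[1:] - hy[:-1])         # update E from curl of H
    ez[n // 2] += np.exp(-((t - 30) / 10) ** 2)  # soft Gaussian source

print("peak |Ez| =", float(np.abs(ez).max()))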

3.
Fortran 90 provides a rich set of array intrinsic functions, each of which operates on the elements of multi-dimensional array objects concurrently. They provide a rich source of parallelism and play an increasingly important role in automatic support of data parallel programming. However, no such support exists when these intrinsic functions are applied to sparse data sets. In this paper, we address this gap by presenting an efficient library for parallel sparse computations with Fortran 90 array intrinsic operations. Our method provides both compression schemes and distribution schemes, applicable to higher-dimensional sparse arrays, on distributed memory environments. This way, programmers need not worry about low-level system details when developing sparse applications. Sparse programs can be expressed concisely using array expressions and parallelized with the help of our library. Our sparse libraries are built for the array intrinsics of Fortran 90, and they include an extensive set of array operations such as CSHIFT, EOSHIFT, MATMUL, MERGE, PACK, SUM, RESHAPE, SPREAD, TRANSPOSE, UNPACK, and section moves. To the best of our knowledge, ours is the first work to provide sparse and parallel sparse support for the array intrinsics of Fortran 90. In addition, we provide a complete complexity analysis of our sparse implementation. The complexity of our algorithms is proportional to the number of nonzero elements in the arrays, which is consistent with the conventional design criteria for sparse algorithms and data structures. Our current testbed is an IBM SP2 workstation cluster. Preliminary experimental results with numerical routines, numerical applications, and data-intensive applications related to OLAP (on-line analytical processing) show that our approach is promising in speeding up sparse matrix computations on both sequential and distributed memory environments when programs are expressed with Fortran 90 array expressions.
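To convey how an array intrinsic carries over to a compressed representation in time proportional to the number of nonzeros, here is a hedged Python analogue of the Fortran 90 SUM and CSHIFT intrinsics on a COO-style sparse vector; the paper's library is in Fortran 90 and these function names are illustrative.

# Illustrative COO-style analogues of Fortran 90 SUM and CSHIFT,
# each running in O(nnz) rather than O(array size).
# Not the paper's library; names are hypothetical.

def sparse_sum(indices, values):
    return sum(values)

def sparse_cshift(indices, values, shift, n):
    """Circular left shift of a length-n sparse vector: only the nnz
    stored indices are touched, the values are untouched."""
    return [(i - shift) % n for i in indices], list(values)

idx, val = [0, 3, 7], [2.0, 5.0, 1.0]          # sparse vector of length 10
print(sparse_sum(idx, val))                    # 8.0
print(sparse_cshift(idx, val, shift=2, n=10))  # ([8, 1, 5], [2.0, 5.0, 1.0])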

4.
5.
Standardization of Fortran is considered. An overview of new features of modern Fortran standards and an outline of the future Fortran standard (Fortran 2000) are given. Fortran-based programming languages for parallel computers are discussed.

6.
Fortran 90 and Object-Oriented Program Design
The object-oriented approach has become one of the most promising software development methods. Object-oriented programming has been applied to engineering computation for nearly a decade, using languages such as C++, Eiffel, and Smalltalk.

7.
We describe our implementation of C and Fortran preprocessors for the FPS T-series hypercube. The target of these preprocessors is the occam I language. We provide a brief overview of the INMOS transputer and the Weitek vector processing unit (VPU); these two units comprise one node of the T-series. Some depth of understanding of the VPU is required to fully appreciate the problems encountered in generating vector code, complexities that were not fully appreciated at the outset. The occam I language is briefly described, focusing only on those aspects of occam I which differ radically from C. The transformations used to preprocess C into occam I are discussed in detail. The special problems with the VPU, both in terms of its (non)interface with occam I and in dealing with numerical programs, are discussed separately. A lengthy discussion of the special techniques required for compilation is provided. C and Fortran are simply incompatible with the occam I model, and we provide a catalogue of the problems encountered. We emphasize that these problems lie not so much with occam I as with preprocessing to occam I; we feel the CSP and occam I models are quite good for distributed processing. The ultimate message from this work should be seen in a larger context. Several languages, such as Ada and Modula-2, are being touted as the standards for the 1990s. These languages severely restrict parallel programming style; this may make saving dusty decks by preprocessing an impossibility.

8.
In many scientific applications, dynamic data redistribution is used to enhance algorithm performance and achieve data locality in parallel programs on distributed memory multicomputers. In this paper, we present a new method, the Compressed Diagonals Remapping (CDR) technique, aimed at efficient runtime data redistribution of banded sparse matrices. The main idea of the proposed technique is first to compress the source matrix into a Compressed Diagonal Matrix (CDM) form. Based on the compressed diagonal matrix, a one-dimensional local and global index transformation method can be applied to carry out data redistribution on the compressed diagonal matrix; this process is identical to redistributing data in the two-dimensional banded sparse matrix. The CDR technique uses an efficient one-dimensional indexing scheme to perform data redistribution on banded sparse matrices. A significant improvement of this approach is that a processor does not need to determine the complicated sending or receiving data sets for dynamic data redistribution, so the indexing cost is reduced significantly. The second advantage of the present technique is the achievement of optimal packing/unpacking stages, owing to the consecutive layout of column elements in a compressed diagonal matrix. Another contribution of our method is the ability to handle sparse matrix redistribution under two disjoint processor grids in the source and destination phases. A theoretical model to analyze the performance of the proposed technique is also presented in this paper. To evaluate the performance of our methods, we have implemented the present techniques on an IBM SP2 parallel machine along with the v2m algorithm and a dense redistribution strategy. The experimental results show that our technique provides significant improvement for runtime data redistribution of banded sparse matrices in most test samples.
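A minimal Python sketch of the two ingredients, assuming an illustrative layout rather than the paper's exact CDM format: diagonals of a banded matrix are packed into rows of a small dense array, after which ownership under a 1D BLOCK distribution reduces to simple column arithmetic.

import numpy as np

# Hedged sketch of the compressed diagonal idea: pack each diagonal of a
# banded matrix into a row of a small dense array, so redistribution can
# use 1D index arithmetic. Layout details are illustrative only.

def to_cdm(a, bw):
    """Compress an n x n matrix with bandwidth bw (|i-j| <= bw)."""
    n = a.shape[0]
    cdm = np.zeros((2 * bw + 1, n))
    for d in range(-bw, bw + 1):                 # diagonal offset
        for i in range(max(0, -d), min(n, n - d)):
            cdm[d + bw, i] = a[i, i + d]
    return cdm

def block_owner(col, n, nprocs):
    """1D BLOCK distribution: which process owns a compressed column."""
    blk = -(-n // nprocs)                        # ceiling division
    return col // blk

a = np.diag([1., 2., 3., 4.]) + np.diag([5., 6., 7.], k=1)
print(to_cdm(a, bw=1))
print([block_owner(c, 4, 2) for c in range(4)])  # [0, 0, 1, 1]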

9.
Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory, and the exploitation of shared-memory models. In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks, containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities. Our experimental results show that, despite our runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes, and with codes generated with auto-parallelizing compilers.
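A minimal sketch of the runtime communication calculation, under the simplifying assumption of 1D block distributions (not Trasgo's implementation): given source and destination tile sizes, the exact coarse-grained messages are the pairwise intersections of owned ranges.

# Hedged sketch of computing exact coarse-grained messages at runtime
# between two 1D block distributions with different tile sizes.
# Illustrative only; not Trasgo's implementation.

def blocks(n, tile):
    """Half-open index ranges owned by each process for a given tile size."""
    return [(lo, min(lo + tile, n)) for lo in range(0, n, tile)]

def messages(n, src_tile, dst_tile):
    """All (src_proc, dst_proc, lo, hi) transfers between the layouts."""
    msgs = []
    for sp, (slo, shi) in enumerate(blocks(n, src_tile)):
        for dp, (dlo, dhi) in enumerate(blocks(n, dst_tile)):
            lo, hi = max(slo, dlo), min(shi, dhi)
            if lo < hi and sp != dp:
                msgs.append((sp, dp, lo, hi))
    return msgs

# Redistribute 12 elements from tile size 4 (3 procs) to tile size 3 (4 procs).
for m in messages(12, src_tile=4, dst_tile=3):
    print("proc %d -> proc %d : [%d, %d)" % m)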

10.
A large class of intensive numerical applications shows an irregular structure, exhibiting unpredictable runtime behavior. Two kinds of irregularity can be distinguished in these applications: first, irregular control structures, derived from the use of conditional statements on data known only at runtime; second, irregular data structures, derived from computations involving sparse matrices, grids, trees, graphs, etc. Many of these applications exhibit a large amount of parallelism, but the above features usually make exploiting such parallelism a very difficult task. This paper discusses the effective parallelization of numerical irregular codes, focusing on the definition and use of data-parallel extensions to express the parallelism that they exhibit. We show that combining data distributions with storage structures makes it possible to obtain efficient parallel codes. Codes dealing with sparse matrices, finite element methods, and molecular dynamics (MD) simulations are taken as working examples.
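As an example of the irregular data structures involved, here is a serial CSR (compressed sparse row) matrix-vector product in Python, the kind of kernel such data-parallel extensions target; the paper itself works at the language level.

# A serial CSR matrix-vector product: an irregular-data-structure kernel
# of the sort these data-parallel extensions address. Sketch only.

def csr_matvec(rowptr, colind, vals, x):
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):                       # rows could be
        for k in range(rowptr[i], rowptr[i + 1]): # distributed to procs
            y[i] += vals[k] * x[colind[k]]
    return y

# 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form:
rowptr = [0, 2, 3, 5]
colind = [0, 2, 1, 0, 2]
vals   = [4.0, 1.0, 2.0, 3.0, 5.0]
print(csr_matvec(rowptr, colind, vals, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]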

11.
Beck, B.; Olien, D. IEEE Software, 1989, 6(3): 63-72.
A process model is presented for constructing and executing parallel programs on a shared-memory multiprocessor running under Unix. The model involves some simple extensions to the standard Unix process model, a set of language extensions, runtime library support, and additional operating-system support. The model is easy to use, and it supports several higher-level parallel programming constructs in several languages, including microtasking in C and Fortran and multitasking in Ada and C++. It frees programmers to concentrate on parallel algorithms instead of low-level implementation details, and it yields good performance.
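A rough Python analogue of microtasking a loop, using the standard multiprocessing module rather than the paper's Unix process extensions, purely for illustration:

from multiprocessing import Pool

# Microtasking in miniature: fork worker processes and split loop
# iterations among them. The paper's model uses Unix process extensions
# with C/Fortran runtime support; this is only an analogue.

def chunk_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, nproc = 1_000_000, 4
    step = n // nproc
    chunks = [(p * step, n if p == nproc - 1 else (p + 1) * step)
              for p in range(nproc)]
    with Pool(nproc) as pool:
        total = sum(pool.map(chunk_sum, chunks))
    print(total == sum(i * i for i in range(n)))  # True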

12.
Feature selection is a very important dimensionality-reduction step in processing high-dimensional data. Low-rank representation models can reveal the global structure of data and provide a certain discriminative ability, while sparse representation models can reveal the essential structure of data through relatively few connections. We introduce a sparsity constraint into the low-rank representation model to construct a low-rank sparse representation model that learns a low-rank sparse similarity matrix over the data; based on this matrix, we propose a low-rank sparse scoring mechanism for unsupervised feature selection. Clustering and classification experiments on the selected features, conducted on different databases and compared against traditional feature selection algorithms, demonstrate the effectiveness of the low-rank feature selection algorithm.
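A hedged Python sketch of the scoring step: here the similarity matrix W is a plain Gaussian-kernel stand-in rather than the learned low-rank sparse matrix, so only the idea of similarity-based unsupervised feature scoring is shown; the Laplacian-style score below is illustrative, not necessarily the paper's formula.

import numpy as np

# Similarity-matrix-based unsupervised feature scoring, sketched.
# The paper learns W from a low-rank sparse representation model; here
# W is a simple Gaussian-kernel stand-in.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                 # 50 samples, 8 features
X[:, 0] = np.repeat([0.0, 5.0], 25)          # feature 0 tracks structure

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())                  # stand-in similarity matrix
D = np.diag(W.sum(1))
L = D - W                                    # graph Laplacian

scores = []
for j in range(X.shape[1]):
    f = X[:, j] - X[:, j].mean()
    scores.append((f @ L @ f) / (f @ D @ f))  # small = structure-preserving
print("best feature:", int(np.argmin(scores)))  # expected: 0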

13.
An algorithm for making sequential programs parallel is described: it first identifies all subroutines, then determines the appropriate execution mode and restructures the code, working recursively to parallelize the entire program. We use Fortran in our work, but many of the concepts apply to other languages. Our hardware model is a shared-memory multiprocessor system with a fixed number of identical processors, each with its own local memory, connected to a common memory that is accessible to all processors equally. The model implements interprocessor synchronization and communication via special memory locations or special storage. Systems like the Cray X-MP, IBM 3090, and Alliant FX/8 fit this model. Our input is a sequential, structured Fortran program with no overlapping branches; with today's emphasis on writing structured code, this restriction is reasonable. A prototype of a system implementing the algorithm is under development on an IBM 3090 multiprocessor.
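One decision the restructurer must make is whether a region's iterations may run in parallel. A toy Python sketch, not the paper's algorithm: each iteration is summarized by its read and write sets, and any cross-iteration conflict forces serial mode.

# Toy sketch of choosing an execution mode: iterations run in parallel
# only if no cross-iteration write-write or write-read conflict exists.
# Illustrative only; not the paper's dependence analysis.

def execution_mode(iters):
    """iters: list of (reads, writes) sets, one per iteration."""
    for i, (ri, wi) in enumerate(iters):
        for j, (rj, wj) in enumerate(iters):
            if i != j and (wi & wj or wi & rj):
                return "serial"
    return "parallel"

# DO i: A(i) = A(i) + 1  -> independent iterations
print(execution_mode([({"A1"}, {"A1"}), ({"A2"}, {"A2"})]))          # parallel
# DO i: S = S + A(i)     -> every iteration writes S
print(execution_mode([({"S", "A1"}, {"S"}), ({"S", "A2"}, {"S"})]))  # serial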

14.
Lee, B.; Hurson, A.R. IEEE Computer, 1994, 27(8): 27-39.
Contrary to initial expectations, implementing dataflow computers has presented a monumental challenge. Now, however, multithreading offers a viable alternative for building hybrid architectures that exploit parallelism. The eventual success of dataflow computers will depend on their programmability. Traditionally, they have been programmed in languages such as Id and SISAL (Streams and Iterations in a Single Assignment Language) that use functional semantics. These languages reveal high levels of concurrency and translate onto dataflow machines and conventional parallel machines via the Threaded Abstract Machine (TAM). However, because their syntax and semantics differ from those of imperative counterparts such as Fortran and C, they have been slow to gain acceptance in the programming community. An alternative is to explore the use of established imperative languages to program dataflow machines. The difficulty, however, lies in analyzing data dependencies and extracting parallelism from source code that contains side effects. More research is therefore needed to develop compilers for conventional languages that can produce parallel code comparable to that of parallel functional languages.

15.
The flexibility offered by dynamically typed programming languages has been put to good use in scenarios where dynamic adaptability is an important issue, and some existing statically typed languages have gradually incorporated more dynamic features into their implementations. As a result, some programming languages are considered hybrid, combining dynamic and static typing. However, these languages do not perform static type inference on dynamically typed code, lacking the common features provided when statically typed code is used. This lack is also present in the corresponding IDEs, which, when dynamically typed code is used, do not provide the services offered for static typing. We have customized an IDE for a hybrid language that statically infers type information for dynamically typed code. Using this type information, we show how the IDE can provide a set of appealing services that existing approaches do not support, such as compile-time type error detection, code completion, transition from dynamically to statically typed code (and vice versa), and significant runtime performance optimizations. We have evaluated the programmer's performance improvement obtained with our IDE and compared it with similar approaches.
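To give a flavor of static inference over dynamically typed code, here is a toy Python example that types a small expression tree and reports errors before execution; it bears no relation to the IDE's actual inference engine.

# Toy static type inference over a dynamically typed expression tree:
# errors are detected without running the expression. Illustrative only.

def infer(expr):
    if isinstance(expr, bool):  return "bool"   # check bool before int
    if isinstance(expr, int):   return "int"
    if isinstance(expr, str):   return "str"
    op, lhs, rhs = expr                          # e.g. ("+", e1, e2)
    lt, rt = infer(lhs), infer(rhs)
    if op == "+" and lt == rt in ("int", "str"):
        return lt                                # int+int or str+str
    if op == "<" and lt == rt == "int":
        return "bool"
    raise TypeError(f"cannot apply {op!r} to {lt} and {rt}")

print(infer(("+", 1, ("+", 2, 3))))    # int
print(infer(("<", 1, 2)))              # bool
try:
    infer(("+", "a", 1))               # caught before execution
except TypeError as e:
    print("static error:", e)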

16.
Array operations are useful in a large number of important scientific codes, such as molecular dynamics, finite element methods, climate modeling, atmosphere and ocean sciences, etc. In our previous work, we have proposed a scheme of extended Karnaugh map representation (EKMR) for multidimensional array representation. We have shown that sequential multidimensional array operation algorithms based on the EKMR scheme have better performance than those based on the traditional matrix representation (TMR) scheme. Since parallel multidimensional array operations have been an extensively investigated problem, we present efficient data parallel algorithms for multidimensional array operations based on the EKMR scheme for distributed memory multicomputers. In a data parallel programming paradigm, in general, we distribute array elements to processors based on various distribution schemes, do local computation in each processor, and collect computation results from each processor. Based on the row, column, and 2D mesh distribution schemes, we design data parallel algorithms for matrix-matrix addition and matrix-matrix multiplication array operations in both TMR and EKMR schemes for multidimensional arrays. We also design data parallel algorithms for six Fortran 90 array intrinsic functions: All, Maxval, Merge, Pack, Sum, and Cshift. We compare the time of the data distribution, the local computation, and the result collection phases of these array operations based on the TMR and the EKMR schemes. The experimental results show that algorithms based on the EKMR scheme outperform those based on the TMR scheme for all test cases.
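The essence of EKMR is storing a multidimensional array as a two-dimensional one via an index mapping, in the manner of a Karnaugh map, so that 2D algorithms and distributions apply directly. The mapping below is a plausible Python illustration and may differ from the paper's precise definition.

import numpy as np

# EKMR in spirit: map a 3D array onto a 2D array through an index
# mapping so 2D distributions and algorithms apply. The exact layout
# here is an illustrative guess, not necessarily the paper's.

def ekmr3_index(i, j, k, nj, nk):
    """Map 3D index (i, j, k) to a 2D (row, col) position."""
    return i, j * nk + k        # rows keep i; (j, k) interleave on columns

ni, nj, nk = 2, 3, 4
a3 = np.arange(ni * nj * nk).reshape(ni, nj, nk)   # TMR form
a2 = np.zeros((ni, nj * nk), dtype=int)            # EKMR-style form
for i in range(ni):
    for j in range(nj):
        for k in range(nk):
            a2[ekmr3_index(i, j, k, nj, nk)] = a3[i, j, k]

# 2D operations (row-block distribution, elementwise add) now apply.
print(np.array_equal(a2.reshape(ni, nj, nk), a3))  # True: lossless mapping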

17.
We present an efficient implementation of the Modified SParse Approximate Inverse (MSPAI) preconditioner. MSPAI generalizes the class of preconditioners based on Frobenius norm minimizations, the class of modified preconditioners such as MILU, as well as interface probing techniques in domain decomposition: it adds probing constraints to the basic SPAI formulation, so one can optimize the preconditioner relative to certain subspaces. We demonstrate MSPAI's qualities for iterative regularization problems arising from image deblurring. Such applications demand a fast and parallel preconditioner implementation. We present such an implementation, introducing two new optimization techniques: first, we avoid redundant calculations using a dictionary; second, our implementation reduces the runtime spent on the most demanding numerical parts by switching to sparse QR decomposition methods wherever profitable. The optimized code runs in parallel with dynamic load balancing.
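A minimal dense sketch of the Frobenius-norm minimization underlying SPAI-type preconditioners: min ||AM - I||_F decouples into one small least-squares problem per column of M over a chosen sparsity pattern. This omits MSPAI's probing constraints and optimizations; names and patterns are illustrative.

import numpy as np

# SPAI-style Frobenius norm minimization, dense and unoptimized:
# each column of M solves min ||A[:,J] m_k - e_k||_2 over its pattern J.
# MSPAI adds probing constraints on top of this basic formulation.

def spai_columns(A, patterns):
    n = A.shape[0]
    M = np.zeros((n, n))
    for k in range(n):
        J = patterns[k]                       # allowed nonzeros of col k
        ek = np.eye(n)[:, k]
        mk, *_ = np.linalg.lstsq(A[:, J], ek, rcond=None)
        M[J, k] = mk
    return M

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
patterns = [[0, 1], [0, 1, 2], [1, 2]]        # e.g., pattern of A itself
M = spai_columns(A, patterns)
print(np.linalg.norm(A @ M - np.eye(3)))      # well below the M = 0 baseline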

18.
Partial Redundancy Elimination (PRE) is a general scheme for suppressing partial redundancies which encompasses traditional optimizations like loop-invariant code motion and redundant code elimination. In this paper, we address the problem of performing this optimization interprocedurally. We present an Interprocedural Partial Redundancy Elimination (IPRE) scheme based upon a new, concise, full program representation. Our method is applicable to arbitrary recursive programs. We use interprocedural partial redundancy elimination for the placement of communication and communication preprocessing statements while compiling for distributed memory parallel machines. We have implemented our scheme as an extension to the Fortran D compilation system. We present experimental results from two codes compiled using our system to demonstrate the usefulness of IPRE in distributed memory compilation.
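What PRE does, in miniature, shown in Python on the loop-invariant special case it subsumes (the paper applies the same idea interprocedurally to communication statements):

# PRE in miniature: an expression recomputed on every iteration (x*y)
# is redundant and can be hoisted to the loop preheader. Illustrative
# of the intraprocedural special case only.

def before(xs, x, y):
    out = []
    for v in xs:
        out.append(v + x * y)     # x*y recomputed every iteration
    return out

def after(xs, x, y):
    t = x * y                     # computed once, before the loop
    out = []
    for v in xs:
        out.append(v + t)
    return out

assert before([1, 2, 3], 4, 5) == after([1, 2, 3], 4, 5)
print("transformed code is equivalent")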

19.
Arrays are mapped to processors through a two-step process (alignment followed by distribution) in data-parallel languages such as High Performance Fortran. This mapping creates disjoint pieces of the array that are locally owned by each processor. An HPF compiler that generates code for array statements must compute the sequence of local memory addresses accessed by each processor and the sequence of sends and receives for a given processor to access nonlocal data. In this paper, we present an approach to the address sequence generation problem using the theory of integer lattices. The set of elements referenced can be generated by integer linear combinations of basis vectors. Unlike other work on this problem, we derive closed-form expressions for the basis vectors as a function of the mapping of data. Using these basis vectors, and exploiting the fact that there is a repeating pattern in the access sequence, we derive highly optimized code that generates the pattern at runtime via table lookup. Experimental results show that our approach is faster than other solutions to this problem.
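A hedged Python sketch of the address sequence problem: for an array section with stride s under a cyclic(b) distribution on P processors, enumerate the local addresses one processor touches and observe the repeating gap pattern that the paper instead derives in closed form.

# Enumerating the local address sequence for A(0 : n-1 : s) under a
# cyclic(b) distribution on P processors. The paper obtains the repeating
# pattern in closed form via integer lattices; this brute-force sketch
# only exhibits the pattern.

def local_addresses(n, s, b, P, p):
    addrs = []
    for g in range(0, n, s):                    # global indices in section
        blk, off = divmod(g, b)
        if blk % P == p:                        # cyclic(b) block ownership
            addrs.append((blk // P) * b + off)  # local memory address
    return addrs

seq = local_addresses(n=96, s=3, b=4, P=2, p=1)
print(seq[:8])                                # [2, 4, 7, 9, 14, 16, 19, 21]
deltas = [y - x for x, y in zip(seq, seq[1:])]
print(deltas[:8])                             # [2, 3, 2, 5, 2, 3, 2, 5]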

20.
Data abstraction is an effective tool in the design of complex systems, and the representation independence it provides is a key factor in the maintenance and adaptation of software systems. This paper describes a system development methodology based on the development of hierarchies of abstract data types (ADTs). The methodology preserves a high degree of representation independence throughout both the design and implementation of complex systems, and it is illustrated with examples from the design and implementation of a Vision Research Programming System. These examples include ADT specifications, ADT interface specifications, and partial implementation code for the system in two different programming languages, Ada and Fortran.
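The representation-independence idea in miniature, sketched in Python rather than the paper's Ada and Fortran: client code written against the abstract interface is unaffected when the representation changes.

# One abstract stack interface, two representations; the client never
# changes. Illustrative sketch, unrelated to the paper's actual system.

class ArrayStack:                      # representation 1: Python list
    def __init__(self, cap):
        self._items, self._cap = [], cap
    def push(self, x):
        if len(self._items) == self._cap:
            raise OverflowError("stack full")
        self._items.append(x)
    def pop(self):
        return self._items.pop()
    def size(self):
        return len(self._items)

class LinkedStack:                     # representation 2: linked pairs
    def __init__(self, cap):
        self._top, self._n, self._cap = None, 0, cap
    def push(self, x):
        if self._n == self._cap:
            raise OverflowError("stack full")
        self._top, self._n = (x, self._top), self._n + 1
    def pop(self):
        x, self._top = self._top
        self._n -= 1
        return x
    def size(self):
        return self._n

def client(stack):                     # unchanged for either representation
    for i in range(3):
        stack.push(i)
    return [stack.pop() for _ in range(stack.size())]

print(client(ArrayStack(5)), client(LinkedStack(5)))  # [2, 1, 0] [2, 1, 0]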
