Similar Documents (20 results)
1.
We describe an integrated compile time and run time system for efficient shared memory parallel computing on distributed memory machines. The combined system presents the user with a shared memory programming model. The run time system implements a consistent shared memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared memory implementation by directing the run time system to exploit the message passing capabilities of the underlying hardware. To do so, the compiler analyzes shared memory accesses and transforms the code to insert calls to the run time system that provide it with the access information computed by the compiler. The run time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run time consistency maintenance. In those cases where the compiler analysis succeeds for the entire program, we demonstrate that the combined system achieves performance comparable to that produced by compilers that directly target message passing. If the compiler analysis is successful only for parts of the program, for instance, because of irregular accesses to some of the arrays, the resulting optimizations can be applied to those parts for which the analysis succeeds. If the compiler analysis fails entirely, we rely on the run time maintenance of shared memory and thereby avoid the complexity and the limitations of compilers that directly target message passing. The result is a single system that combines efficient support for both regular and irregular memory access patterns.
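A minimal sketch of the compiler/run-time coupling described above, under the assumption of invented rts_* entry points (they stand in for the access-check and bulk-transfer hooks the abstract mentions, not the actual system's API): once the compiler has proven which region a loop touches, one bulk acquire/release pair can replace per-access consistency checks.

```c
/* Illustrative sketch only: rts_* names are invented stand-ins for the
 * run-time system's entry points.  The point is the shape of the code the
 * compiler might emit for a fully analyzed loop over a shared array. */
#include <stddef.h>

/* Stub run-time hooks; a real DSM run time would fetch/flush remote data here. */
static void rts_acquire_range(void *base, size_t len) { (void)base; (void)len; }
static void rts_release_range(void *base, size_t len) { (void)base; (void)len; }

void scale_shared(double *a, size_t n, double k)
{
    rts_acquire_range(a, n * sizeof *a);   /* one bulk transfer, not n access checks */
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
    rts_release_range(a, n * sizeof *a);   /* push modifications back in bulk */
}
```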

2.

Heterogeneous systems that consist of multiple CPUs and GPUs for high-performance computing are becoming increasingly popular, and OpenCL (Open Computing Language) provides a framework for writing programs that can be executed across heterogeneous devices. Compared with OpenCL 1.2, the new features of OpenCL 2.0 provide developers with better expressive power for programming heterogeneous computing environments. Currently, gem5-gpu, which includes gem5 and GPGPU-Sim, can offer an experimental simulation environment for OpenCL. In gem5-gpu, gem5 only supports CUDA, although GPGPU-Sim can support OpenCL by compiling OpenCL kernel code to PTX code using real GPU drivers. However, this compilation flow in GPGPU-Sim can only support up to OpenCL 1.2. OpenCL 2.0 provides new features such as workgroup built-in functions, extended atomic built-in functions, and device-side enqueue. To support OpenCL 2.0, the compiler must be extended to enable the compilation of OpenCL 2.0 kernel code to PTX code. In this paper, the proposed compiler is derived from the low level virtual machine (LLVM) compiler and extended with these features so that the emulator can support OpenCL 2.0. The proposed compiler creates local buffers for each workgroup to enable workgroup built-in functions and adds atomic built-in functions with memory order and memory scope for OpenCL 2.0 in NVPTX. Furthermore, the APIs available in CUDA are utilized to implement the OpenCL 2.0 device-side enqueue kernel, and the compilation schemes in Clang are revised. The AMD APP SDK 3.0 and NTU OpenCL benchmarks are used to verify that the proposed compiler can support the features of OpenCL 2.0.

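To make the three OpenCL 2.0 feature groups named above concrete, here is a small, purely illustrative kernel (OpenCL C, which is C-based; the histogram workload and all names are invented, not taken from the paper or its benchmarks) that touches a workgroup built-in function, an atomic built-in with explicit memory order and scope, and device-side enqueue.

```c
/* Illustrative OpenCL 2.0 kernel, not from the paper. */
kernel void histogram_block(global const int *in,
                            global atomic_int *bins,
                            global int *partial)
{
    int v = in[get_global_id(0)];

    /* OpenCL 2.0 workgroup built-in function */
    int group_sum = work_group_reduce_add(v);

    /* OpenCL 2.0 atomic built-in with memory order and memory scope */
    atomic_fetch_add_explicit(&bins[v & 255], 1,
                              memory_order_relaxed, memory_scope_device);

    if (get_local_id(0) == 0) {
        partial[get_group_id(0)] = group_sum;

        /* OpenCL 2.0 device-side enqueue: launch a child kernel (a Clang block)
         * without returning control to the host */
        queue_t q = get_default_queue();
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(get_num_groups(0)),
                       ^{ partial[get_global_id(0)] += 1; });
    }
}
```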

3.
Three-dimensional shortwave ray-tracing computation built with different compilers is studied on a high-performance computer. Based on an analysis of the 3D shortwave ray-tracing computation, serial and parallel versions of the ray-tracing software were designed and implemented, and executables were built with both the GCC and the PGI compiler; the run time of the PGI-compiled executables is clearly better than that of the GCC-compiled ones.

4.
Experience acquired in the development of an interface between a fiber optic system and a prototype management system developed using an object-oriented approach is discussed. Open System Interconnection (OSI) implementation concerns, the use of an ASN.1 compiler, and the use of standard application programming interfaces (APIs) are described. The role of the seven-layer OSI stack in exchanging information between managing operating systems and managed network elements is reviewed. The implementation of an OSI stack on a SUN workstation using the UNIX operating system is also described. Methods for passive and active testing of the resulting programs are discussed.

5.
We address the problem of code generation for embedded DSP systems. Such systems devote a limited quantity of silicon to program memory, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints. Unfortunately, current compiler technology is unable to generate dense, high-performance code for DSPs, because it does not provide adequate support for the specialized architectural features of DSPs via machine-dependent code optimizations. Thus, designers often program the embedded software in assembly, a very time-consuming task. In order to increase productivity, compilers must be developed that are capable of generating high-quality code for DSPs. The compilation process must also be made retargetable, so that a variety of DSPs may be efficiently evaluated for potential use in an embedded system. We present a retargetable compilation methodology that enables high-quality code to be generated for a wide range of DSPs. Previous work in retargetable DSP compilation has focused on complete automation, and this desire for automation has limited the number of machine-dependent optimizations that can be supported. In our efforts, we have given code quality higher priority than complete automation. We demonstrate how, by using a library of machine-dependent optimization routines accessible via a programming interface, it is possible to support a wide range of machine-dependent optimizations, albeit at some cost to automation. Experimental results demonstrate the effectiveness of our methodology, which has been used to build good-quality compilers for three fixed-point DSPs.
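The "library of machine-dependent optimization routines accessible via a programming interface" can be pictured roughly as below. This is an invented C sketch, not the paper's actual interface: each routine encapsulates one DSP-specific transformation, and the retargeting engineer composes a per-target pipeline by hand, trading some automation for code quality.

```c
/* Invented illustration of a machine-dependent optimization library;
 * none of these names come from the paper. */
#include <stddef.h>

typedef struct IRFunction IRFunction;      /* opaque compiler IR for one function */
typedef int (*OptPass)(IRFunction *f);     /* returns nonzero if the IR changed   */

/* Target-specific routines the library would export (stubbed out here). */
static int opt_mac_fusion(IRFunction *f)        { (void)f; return 0; } /* fuse mul+add into MAC units   */
static int opt_modulo_addressing(IRFunction *f) { (void)f; return 0; } /* circular-buffer address modes */
static int opt_dual_bank_split(IRFunction *f)   { (void)f; return 0; } /* split arrays over X/Y banks   */

/* Written per DSP by the retargeting engineer rather than fully automated:
 * the selection and order of passes encode target expertise. */
static const OptPass fixed_point_dsp_pipeline[] = {
    opt_dual_bank_split,
    opt_modulo_addressing,
    opt_mac_fusion,
};

void run_machine_dependent_opts(IRFunction *f)
{
    for (size_t i = 0; i < sizeof fixed_point_dsp_pipeline / sizeof fixed_point_dsp_pipeline[0]; i++)
        fixed_point_dsp_pipeline[i](f);
}
```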

6.
Programming real-time applications with SIGNAL
The authors present the main features of the SIGNAL language and its compiler. Designed to provide safe real time system programming, the SIGNAL language is based on synchronous principles. Its semantics are defined via a mathematical model of multiple-clocked flows of data and events. SIGNAL programs describe relations on such objects, so that it is possible to program a real time application via constraints. The compiler calculates the solutions of the system and thus can be used as a proof system. The equational approach is a natural way to derive multiprocessor executions of a program. This approach uses a graphical interface of a block-diagram style, and the system is illustrated on a speech recognition application.

7.
This paper describes the Broadway compiler and our experiences in using it to support domain-specific compiler optimizations. Our goal is to provide compiler support for a wide range of domains and to do so in the context of existing programming languages. Therefore, we focus on a technique that we call library-level optimization, which recognizes and exploits the domain-specific semantics of software libraries. The key to our system is a separation of concerns: compiler expertise is built into the Broadway compiler machinery, while domain expertise resides in separate annotation files that are provided by domain experts. We describe how this system can optimize parallel linear algebra codes written using the PLAPACK library. We find that our annotations effectively capture PLAPACK expertise at several levels of abstraction and that our compiler can automatically apply this expertise to produce considerable performance improvements. Our approach shows that the abstraction and modularity found in modern software can be as much an asset to the compiler as it is to the programmer.

8.
The mainstream arrival of predication, a means other than branching of selecting instructions for execution, has required compiler architects to reformulate fundamental analyses and transformations. Traditionally, the compiler has generated branches straightforwardly to implement control flow designed by the programmer and has then performed sophisticated "global" optimizations to move and optimize code around them. In this model, the inherent tie between the control state of the program and the location of the single instruction pointer serialized run-time evaluation of control and limited the extent to which the compiler could optimize the control structure of the program (without extensive code replication). Predication provides a means of control independent of branches and instruction fetch location, freeing both compiler and architecture from these restrictions; effective compilation of predicated code, however, requires sophisticated understanding of the program's control structure. This paper explores a representational technique which, through direct code analysis, maps the program's control component into a canonical database, a reduced ordered binary decision diagram (ROBDD), which fully enables the compiler to utilize and manipulate predication. This abstraction is then applied to optimize the program's control component, transforming it into a form more amenable to instruction-level parallel (ILP) execution.

9.
10.
Compiler technology for future microprocessors
Advances in hardware technology have made it possible for microprocessors to execute a large number of instructions concurrently (i.e., in parallel). These microprocessors take advantage of the opportunity to execute instructions in parallel to increase the execution speed of a program. As in other forms of parallel processing, the performance of these microprocessors can vary greatly depending on the quality of the software. In particular, the quality of compilers can make an order of magnitude difference in performance. This paper presents a new generation of compiler technology that has emerged to deliver the large amount of instruction-level parallelism that is already required by some current state-of-the-art microprocessors and will be required by many future microprocessors. We introduce critical components of the technology which deal with difficult problems that are encountered when compiling programs for a high degree of instruction-level parallelism. We present examples to illustrate the functional requirements of these components. To provide more insight into the challenges involved, we present in-depth case studies on predicated compilation and maintenance of dependence information, two of the components that are largely missing from most current commercial compilers.

11.
Research on the LS MPP programming language
To ease the development of applications that run on the LS MPP system, this paper studies the LS MPP programming language. First, it analyzes the architecture of the existing LS MPP computer and the conceptual model of the image processor that represents its future direction. It then introduces the intermediate language and intermediate representation corresponding to this conceptual model. Finally, it analyzes in detail the high-level language extensions for the conceptual model. The analysis shows that the high-level language is very beneficial to improving the performance of the parallel computer described by the conceptual model, makes programming more convenient for the programmer, and reduces the complexity of the compiler.

12.
13.
A survey of some packet-switched routing methods for massively parallel computers is presented. Some of the techniques are applicable to both shared-memory and message-passing architectures. These routing methods are compared in terms of their efficiency in supporting programming models, efficiency in mapping to parallel machines, and practicality. Among the outlined methods, three nonadaptive techniques and some adaptive routing algorithms are discussed.

14.
The PSL-to-Verilog (P2V) compiler can translate a set of assertions about a block-structured software program into a hardware design to be executed concurrently with the program. The assertions validate the correctness of the software program without altering the program's temporal behavior in any way, a result never previously achieved by any online model-checking system. Additionally, the techniques and implementations described apply to any general purpose program and the absence of execution overhead renders the system ideal for the verification and debugging of real-time systems. Assertions are expressed in a simple subset of the property specification language (PSL), an IEEE standard originally intended for the behavioral specification of hardware designs. The target execution system is the eMIPS processor, a dynamically self-extensible processor realized with a field-programmable gate array (FPGA). The system can concurrently execute and check multiple programs at a time. Assertions are compiled into eMIPS Extensions, which are loaded by the operating system software into a portion of the FPGA, and discarded once the program terminates. If an assertion is violated, the program receives an exception; otherwise, it executes fully unaware of its verifier. The software program is not modified in any way. It can be compiled separately with full optimizations and executes with or without the corresponding hardware checker. The P2V compiler, implemented in Python, generates code for the implementation of the eMIPS processor running on the Xilinx ML401 development board. It is currently used to verify software properties in areas such as testing, debugging, intrusion detection, and the behavioral verification of concurrent and real-time programs.

15.
We describe a system, developed as part of the Cameron project, which compiles programs written in a single-assignment subset of C called SA-C into dataflow graphs and then into VHDL. The primary application domain is image processing. The system consists of an optimizing compiler which produces dataflow graphs and a dataflow graph to VHDL translator. The method used for the translation is described here, along with some results on an application. The objective is not to produce yet another design entry tool, but rather to shift the programming paradigm from HDLs to an algorithmic level, thereby extending the realm of hardware design to the application programmer.

16.

Achieving high performance in task-parallel runtime systems, especially with high degrees of parallelism and fine-grained tasks, requires tuning a large variety of behavioral parameters according to program characteristics. In the current state of the art, this tuning is generally performed in one of two ways: either by a group of experts who derive a single setup which achieves good – but not optimal – performance across a wide variety of use cases, or by monitoring a system’s behavior at runtime and responding to it. The former approach invariably fails to achieve optimal performance for programs with highly distinct execution patterns, while the latter induces overhead and cannot affect parameters which need to be set at compile time. In order to mitigate these drawbacks, we propose a set of novel static compiler analyses specifically designed to determine program features which affect the optimal settings for a task-parallel execution environment. These features include the parallel structure of task spawning, the granularity of individual tasks, the memory size of the closure required for task parameters, and an estimate of the stack size required per task. Based on the result of these analyses, various runtime system parameters are then tuned at compile time. We have implemented this approach in the Insieme compiler and runtime system, and evaluated its effectiveness on a set of 12 task-parallel benchmarks running with 1 to 64 hardware threads. Across this entire space of use cases, our implementation achieves a geometric mean performance improvement of 39%. To illustrate the impact of our optimizations, we also provide a comparison to current state-of-the-art task-parallel runtime systems, including OpenMP, Cilk, HPX, and Intel TBB.

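The tuning knobs those analyses estimate show up in any ordinary task-parallel kernel. The sketch below (plain C with OpenMP tasks, not Insieme code; the cutoff value is arbitrary) illustrates the kind of compile-time decision at stake: how fine-grained a task may be before spawning it costs more than running it inline.

```c
/* Illustrative only: a recursive task-parallel computation whose performance
 * depends on task granularity, the very parameter such analyses try to fix
 * at compile time.  Build with: cc -fopenmp granularity.c */
#include <stdio.h>

#define CUTOFF 20   /* below this problem size, tasks are too fine-grained to spawn */

static long fib(int n)
{
    if (n < 2)
        return n;
    long a, b;
    if (n < CUTOFF) {               /* fine-grained: run sequentially */
        a = fib(n - 1);
        b = fib(n - 2);
    } else {                        /* coarse enough: spawn child tasks */
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
    }
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single
    r = fib(35);
    printf("fib(35) = %ld\n", r);
    return 0;
}
```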

17.
Embedded and portable systems running multimedia applications create a new challenge for hardware architects. A microprocessor for such applications needs to be easy to program like a general-purpose processor and have the performance and power efficiency of a digital signal processor. This paper presents the codevelopment of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner. The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.

18.
Compiler performance testing and analysis based on hardware performance counters
The performance monitoring unit provided by the Itanium 2 processor can capture microarchitectural events while a program runs. The SPEC2006 benchmark programs were compiled with different optimization options of the GNU Gcc, Intel Icc, and HP-Opencc compilers and then executed. Performance data collected through the CPU hardware performance counters (HPCs) are used to understand the characteristics of specific programs and to analyze and compare the performance differences among the compilers, providing reference data for the performance optimization of the HP-Opencc compiler. The experiments show that the branch-prediction optimization of the HP-Opencc compiler can be further improved.
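The measurement behind such a comparison amounts to programming a hardware counter around the code of interest. The paper used the Itanium 2 PMU; the sketch below instead uses the Linux perf_event interface (a different mechanism, shown only to illustrate the idea), counting branch mispredictions around a stand-in loop.

```c
/* Illustrative sketch: read one hardware counter (branch misses) around a
 * code region via Linux perf_event; not the Itanium 2 PMU tooling the paper used. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* --- region of interest: stand-in for a benchmark workload --- */
    volatile long sum = 0;
    for (long i = 0; i < 10000000; i++)
        sum += (i % 3 == 0) ? i : -i;
    /* -------------------------------------------------------------- */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof misses) != sizeof misses)
        misses = 0;
    printf("branch misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```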

19.
Industrial computer control systems require the economical production of efficient software including executive systems, maintenance programs, and both special and general purpose application programs for direct digital control. Moreover, the hardware configuration varies considerably from the single dedicated control computer to a general purpose multicomputer system. RTL, a real-time language developed cooperatively with industrial suppliers and users specifically for industrial control, is described with emphasis on those features peculiar to applications such as dedicated direct digital control, combined direct and supervisory control, operator interfaces, and interaction with plant management computer systems. The use of RTL for the production of special purpose executive systems and general purpose application programs for direct control, startup, etc., is emphasized. Details of the language, discussed with examples, include its file structure for communication of data bases between independent programs, and a variety of data types including character codes, strings, labels, lists, peripheral variables, and data structures. Peripheral variables are variables in the language associated with hardware features of the central processor and its input-output devices such as registers, interrupts, error indicators, and addresses, all of which may be referenced in the language. Regular and peripheral variable data structures--combinations of different types of variables--are included and ease considerably the burden of real-time programming. The organization and performance of the existing compiler for RTL is explained.

20.
An overview of language support for parallel technical computing is provided. The rationale for multithreaded languages, in which the programmer explicitly specifies what work is to be carried out by multiple processors and how their activities should be coordinated, is described. The discussion begins with an introduction to the general models for manipulating multiple threads and how they are incorporated into programming languages. The wide variety of features for creating multiple threads, scheduling their execution, synchronizing their activities, and sharing data among them are then examined. Examples in a simplified, FORTRAN-like notation are included. It is shown how the language features are distributed among commercial compiler implementations. Some less traditional approaches to multithreaded language support are presented to provide a glimpse at what might be expected in future languages and compilers.
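The article's own examples use a FORTRAN-like notation; the C sketch below restates the ingredients it surveys (creating threads, synchronizing their activities, and sharing data among them) with POSIX threads. It is illustrative only and not drawn from the article.

```c
/* Illustrative only: thread creation, synchronization, and shared data in C/pthreads. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long total = 0;                                     /* shared data      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* synchronization  */

static void *worker(void *arg)
{
    long id = (long)arg;
    long partial = 0;
    for (long i = id; i < 10000; i += NTHREADS)            /* partition the work */
        partial += i;

    pthread_mutex_lock(&lock);                             /* coordinate access  */
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)                    /* create the threads */
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)                     /* wait for completion */
        pthread_join(t[i], NULL);
    printf("sum = %ld\n", total);
    return 0;
}
```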
