Similar Documents
20 similar documents found (search time: 93 ms)
1.
We describe an integrated compile time and run time system for efficient shared memory parallel computing on distributed memory machines. The combined system presents the user with a shared memory programming model. The run time system implements a consistent shared memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared memory implementation by directing the run time system to exploit the message passing capabilities of the underlying hardware. To do so, the compiler analyzes shared memory accesses and transforms the code to insert calls to the run time system that provide it with the access information computed by the compiler. The run time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run time consistency maintenance. In those cases where the compiler analysis succeeds for the entire program, we demonstrate that the combined system achieves performance comparable to that produced by compilers that directly target message passing. If the compiler analysis is successful only for parts of the program, for instance because of irregular accesses to some of the arrays, the resulting optimizations can be applied to those parts for which the analysis succeeds. If the compiler analysis fails entirely, we rely on the run time maintenance of shared memory and thereby avoid the complexity and the limitations of compilers that directly target message passing. The result is a single system that combines efficient support for both regular and irregular memory access patterns.
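As a hedged illustration of the transformation this abstract describes, the sketch below shows a loop rewritten so that the compiler's access summary is handed to the run time system up front. The entry-point names (rts_validate, rts_push_begin, rts_push_end) are invented for the example and stand in for whatever interface the actual system exposes.

```cpp
#include <cstddef>

// Invented runtime entry points, stubbed so the sketch compiles; the real
// run time system would move data and update consistency state here.
static void rts_validate(void*, std::size_t)   {}  // bulk-fetch a read region
static void rts_push_begin(void*, std::size_t) {}  // start tracking a write region
static void rts_push_end(void*, std::size_t)   {}  // eagerly push the written data

// Original loop:  for (int i = lo; i < hi; ++i) b[i] = a[i] + 1.0;
void transformed_loop(double* a, double* b, int lo, int hi) {
    std::size_t bytes = std::size_t(hi - lo) * sizeof(double);
    // Compiler-derived access summary: reads a[lo..hi), writes b[lo..hi).
    rts_validate(a + lo, bytes);     // fetch all needed data in one message
    rts_push_begin(b + lo, bytes);   // suppress per-access write detection
    for (int i = lo; i < hi; ++i)
        b[i] = a[i] + 1.0;           // loop body runs at full speed
    rts_push_end(b + lo, bytes);     // bulk write-back instead of page diffs
}
```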

2.
The MIT Alewife Machine   (cited by 3; 0 self-citations, 3 by others)
A variety of models for parallel architectures, such as shared memory, message passing, and data flow, have converged in the recent past to a hybrid architecture form called distributed shared memory (DSM). Alewife, an early prototype of such DSM architectures, uses hybrid software and hardware mechanisms to support coherent shared memory, efficient user level messaging, fine grain synchronization, and latency tolerance. Alewife supports up to 512 processing nodes connected over a scalable and cost effective mesh network at a constant cost per node. Four mechanisms combine to achieve Alewife's goals of scalability and programmability: software extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Extensive results from microbenchmarks, together with over a dozen complete applications running on a 32-node prototype, demonstrate that integrating message passing with shared memory enables a cost efficient solution to the cache coherence problem and provides a rich set of programming primitives. Our results further show that messaging and shared memory operations are both important, because each helps the programmer to achieve the best performance for various machine configurations.

3.
Optimizing for energy constraints is of critical importance due to the proliferation of battery-operated embedded devices. Thus, it is important to explore both hardware and software solutions for optimizing energy. The focus of high-level compiler optimizations has traditionally been on improving performance. In this paper, we present an experimental evaluation of several state-of-the-art high-level compiler optimizations on energy consumption, considering both the processor core (datapath) and the memory system. This is in contrast to many of the previous works, which have considered them in isolation.

4.
As new applications in embedded communications and control systems push the computational limits of digital signal processing (DSP) functions, there will be an increasing need for software applications to be migrated to hardware in the form of a hardware-software codesign system. In many cases, access to the high-level source code may not be available. It is thus desirable to have a technology to translate the software binaries intended for processors to hardware implementations. This paper provides details on the retargetable FREEDOM compiler. The compiler automatically translates DSP software binaries to register-transfer level (RTL) VHDL and Verilog for implementation on field-programmable gate arrays (FPGAs) as standalone or system-on-chip implementations. We describe the underlying optimizations and some novel algorithms for alias analysis, data dependency analysis, memory optimizations, procedure call recovery, and back-end code scheduling. Experimental results on resource usage and performance are shown for several program binaries intended for the Texas Instruments C6211 DSP (VLIW) and the ARM922T reduced instruction set computer (RISC) processors. Implementation results for four kernels from the Simulink demo library and others from commonly used DSP applications, such as MPEG-4, Viterbi, and JPEG, are also discussed. The compiler-generated RTL code is mapped to Xilinx Virtex II and Altera Stratix FPGAs. We record overall performance gains of 1.5x-26.9x for the hardware implementations of the kernels. Comparisons with the power aware compiler techniques (PACT) high-level synthesis compiler are used to show that software binaries can be used as intermediate representations from any high-level language and generate efficient hardware implementations.

5.
This paper describes the Broadway compiler and our experiences in using it to support domain-specific compiler optimizations. Our goal is to provide compiler support for a wide range of domains and to do so in the context of existing programming languages. Therefore, we focus on a technique that we call library-level optimization, which recognizes and exploits the domain-specific semantics of software libraries. The key to our system is a separation of concerns: compiler expertise is built into the Broadway compiler machinery, while domain expertise resides in separate annotation files that are provided by domain experts. We describe how this system can optimize parallel linear algebra codes written using the PLAPACK library. We find that our annotations effectively capture PLAPACK expertise at several levels of abstraction and that our compiler can automatically apply this expertise to produce considerable performance improvements. Our approach shows that the abstraction and modularity found in modern software can be as much an asset to the compiler as it is to the programmer.

6.
We present Avalanche, a prototyping framework that addresses the issues of power estimation and optimization for mixed hardware and software embedded systems. Avalanche is based on a generic embedded system architecture consisting of an embedded CPU, custom hardware, and a memory hierarchy. For system-level power estimation, given various system parameters such as cache sizes, cache policies, and bus widths, Avalanche is able to rapidly estimate power and performance and thus facilitates comprehensive design space explorations. For system-level power optimization, Avalanche offers different modes reflecting various design scenarios: if no hardware/software partitioning or only partial partitioning has been conducted, Avalanche guides the designer in finding a power-aware hardware/software partitioning; when a system has already been partitioned, Avalanche can optimize system parameters such as cache and memory size; if system parameters and partitioning are given, Avalanche applies additional optimizations for power, including source-to-source compiler transformations. Avalanche has been deployed during the design phase of real-world applications, including an MPEG II encoder in a set-top box design. Extensive design space explorations in terms of power and performance could be conducted within several hours, and the various optimization techniques led to power reductions of up to 94% without performance losses and with only a slight increase in total chip size (i.e., transistor count).
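A minimal sketch of the kind of design space exploration the abstract describes, assuming toy power and performance models (the formulas below are invented placeholders; Avalanche derives its estimates from the application and the architecture models):

```cpp
#include <cstdio>

// Toy stand-ins for Avalanche's estimators; the numbers and formulas are
// invented purely so the sweep below has something to evaluate.
static double estimate_power_mw(int cache_kb, int bus_bits) {
    return 20.0 + 1.5 * cache_kb + 0.2 * bus_bits;        // invented model
}
static double estimate_cycles(int cache_kb, int bus_bits) {
    return 1e9 / (0.1 * cache_kb + 0.02 * bus_bits + 1);  // invented model
}

int main() {
    // Sweep two of the parameters the abstract mentions (cache size, bus
    // width); a real exploration also covers cache policies, memory sizes, etc.
    double best = 1e300;
    int best_cache = 0, best_bus = 0;
    for (int cache_kb : {1, 2, 4, 8, 16, 32})
        for (int bus_bits : {16, 32, 64}) {
            double energy = estimate_power_mw(cache_kb, bus_bits) *
                            estimate_cycles(cache_kb, bus_bits);  // E = P * T
            if (energy < best) {
                best = energy; best_cache = cache_kb; best_bus = bus_bits;
            }
        }
    std::printf("lowest-energy point: %d KB cache, %d-bit bus\n",
                best_cache, best_bus);
}
```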

7.
To support the quality-of-service (QoS) requirements of embedded multimedia applications, off-the-shelf middleware like the common object request broker architecture (CORBA) must be flexible, efficient, and predictable. Moreover, stringent memory constraints imposed by embedded system hardware necessitate a minimal footprint for middleware that supports multimedia applications. This paper provides three contributions toward developing efficient object request broker (ORB) middleware to support embedded multimedia applications. First, we describe optimization principle patterns used to develop a time- and space-efficient CORBA inter-ORB protocol (IIOP) interpreter for the adaptive communication environment (ACE) ORB (TAO), which is our high-performance, real-time ORB. Second, we describe the optimizations applied to TAO's interface definition language (IDL) compiler to generate efficient and small stubs/skeletons used in TAO's IIOP protocol engine. Third, we empirically compare the performance and memory footprint of interpretive (de)marshaling versus compiled (de)marshaling for a wide range of IDL data types. Applying our optimization principle patterns to TAO's IIOP protocol engine improved its interpretive (de)marshaling performance to the point where it is now comparable to the performance of compiled (de)marshaling. Moreover, our IDL compiler optimizations generate interpreted stubs/skeletons whose footprint is substantially smaller than compiled stubs/skeletons. Our results illustrate that careful application of optimization principle patterns can yield both time- and space-efficient standards-based middleware.
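The contrast between interpretive and compiled (de)marshaling can be made concrete with a small sketch. The descriptor format and the Quote type below are invented for illustration and are not TAO's actual CDR machinery:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Field { std::size_t offset, size; };  // a tiny runtime "type descriptor"

// Interpretive marshaling: one generic routine walks the descriptor, paying
// a decode cost per field but needing only one copy of the code.
void marshal_interpreted(const void* obj, const Field* desc,
                         std::size_t nfields, std::vector<char>& out) {
    const char* base = static_cast<const char*>(obj);
    for (std::size_t i = 0; i < nfields; ++i)
        out.insert(out.end(), base + desc[i].offset,
                   base + desc[i].offset + desc[i].size);
}

struct Quote { std::int32_t id; double price; };
const Field quote_desc[] = {{offsetof(Quote, id),    sizeof(std::int32_t)},
                            {offsetof(Quote, price), sizeof(double)}};

// Compiled marshaling: the IDL compiler emits a stub with the field sequence
// baked in -- no descriptor decoding at run time, but more generated code.
void marshal_compiled(const Quote& q, std::vector<char>& out) {
    const char* p = reinterpret_cast<const char*>(&q.id);
    out.insert(out.end(), p, p + sizeof q.id);
    p = reinterpret_cast<const char*>(&q.price);
    out.insert(out.end(), p, p + sizeof q.price);
}

int main() {
    Quote q{7, 99.5};
    std::vector<char> a, b;
    marshal_interpreted(&q, quote_desc, 2, a);  // generic path
    marshal_compiled(q, b);                     // specialized path
}
```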

8.
The greatest challenge in fully exploiting the parallel performance of multicore processors is the parallel programming model. Today, parallel threads rely on locks for inter-thread synchronization, but locks introduce errors such as deadlock, and their performance is hard to optimize. The transactional memory model treats a sequence of shared-memory operations as a single transaction and guarantees its atomicity, consistency, and isolation. It can replace lock-based constructs, simplify the programming model, and improve the performance of parallel computation. This paper presents the structure of a software transactional memory model, Buffering Software Transactional Memory (BSTM), which mainly uses write buffering to simplify the design of the transactional model. Experimental results show that this model offers certain advantages.
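A minimal sketch of a write-buffering transaction in the spirit of BSTM. The class shape and the global commit lock are simplifications for illustration; the paper's actual design, including conflict detection and aborts, is not reproduced here:

```cpp
#include <map>
#include <mutex>

// All shared accesses inside a transaction go through read()/write();
// writes are buffered and published only at commit.
class Transaction {
    std::map<int*, int> write_buf;   // address -> buffered value
public:
    int read(int* addr) {            // a read sees the tx's own writes first
        auto it = write_buf.find(addr);
        return it != write_buf.end() ? it->second : *addr;
    }
    void write(int* addr, int val) { // buffered, not yet visible to others
        write_buf[addr] = val;
    }
    void commit() {
        // Simplification: serialize commits with one global lock. A real
        // STM would validate its read set here and abort on conflict.
        static std::mutex commit_lock;
        std::lock_guard<std::mutex> g(commit_lock);
        for (auto& [addr, val] : write_buf)
            *addr = val;             // publish the buffered writes
        write_buf.clear();
    }
};

int main() {
    static int shared_x = 0, shared_y = 0;
    Transaction tx;
    tx.write(&shared_x, 1);
    tx.write(&shared_y, tx.read(&shared_x) + 1);  // reads own buffered write
    tx.commit();                                  // shared_x=1, shared_y=2
}
```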

9.
Design automation for embedded systems comprising both hardware and software components demands code generators integrated into electronic CAD systems. These code generators provide the necessary link between software synthesis tools in HW/SW codesign systems and embedded processors. General-purpose compilers for standard processors are often insufficient, because they do not provide flexibility with respect to different target processors and also suffer from inferior code quality. While recent research on code generation for embedded processors has primarily focused on code quality issues, in this contribution we emphasize the importance of retargetability, and we describe an approach to achieve it. We propose the use of uniform, external target processor models in code generation, which describe embedded processors by means of RT-level netlists. Such structural models incorporate more hardware details than purely behavioral models, thereby permitting a close link to hardware design tools and fast adaptation to different target processors. The MSSQ compiler, which is part of the MIMOLA hardware design system, operates on structural models. We describe input formats, central data structures, and code generation techniques in MSSQ. The compiler has been successfully retargeted to a number of real-life processors, which proves the feasibility of our approach with respect to retargetability. We discuss the capabilities and limitations of MSSQ and identify possible areas of improvement.

10.
Embedded and portable systems running multimedia applications create a new challenge for hardware architects. A microprocessor for such applications needs to be easy to program like a general-purpose processor and have the performance and power efficiency of a digital signal processor. This paper presents the codevelopment of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner. The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.

11.
In recent years, deep learning algorithms have been widely deployed, from cloud servers to terminal devices, and researchers have proposed a variety of neural network accelerators and software development environments. In this article, we review representative neural network accelerators. Because the software stack must take the hardware architecture of the specific accelerator into account to deliver good end-to-end performance, we also summarize the programming environments of neural network accelerators and the optimizations applied in their software stacks. Finally, we comment on future trends in neural network accelerators and their programming environments.

12.
13.
We demonstrate the benefits of software shared memory protocols that adapt at run time to the memory access patterns observed in the applications. This adaptation is automatic, requires no user annotations, and does not rely on compiler support or special hardware. We investigate adaptation between single- and multiple-writer protocols, dynamic aggregation of pages into a larger transfer unit, and adaptation between invalidate and update. Our results indicate that adaptation between single- and multiple-writer protocols and dynamic page aggregation are clearly beneficial. The results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
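A toy sketch of the single- versus multiple-writer adaptation, assuming per-page bookkeeping updated at access faults and consulted at synchronization points (the names and the one-writer threshold are illustrative, not the protocol's actual mechanics):

```cpp
#include <set>

// Bookkeeping a page fault handler could update; adapt() runs at
// synchronization points and picks the protocol for the next interval.
struct PageState {
    std::set<int> writers_this_interval;  // thread ids seen writing the page
    bool multiple_writer = false;         // protocol currently in force

    void note_write(int thread_id) { writers_this_interval.insert(thread_id); }

    void adapt() {
        // Two or more concurrent writers: twinning/diffing (multiple-writer)
        // avoids ping-ponging the page between nodes. A single writer:
        // plain ownership with invalidation is cheaper.
        multiple_writer = writers_this_interval.size() > 1;
        writers_this_interval.clear();    // start observing the next interval
    }
};
```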

14.
Simplifying the programming models is paramount to the success of reconfigurable computing with field programmable gate arrays (FPGAs). This paper presents a methodology to combine true object-oriented design of the compiler/CAD tool with an object-oriented hardware design methodology in C++. The resulting system provides all the benefits of object-oriented design to the compiler/CAD tool designer and to the hardware designer/programmer. The two domain-specific compilers presented are BSAT and StReAm. Each is targeted at a very specific application domain: applications that accelerate Boolean satisfiability problems with BSAT, and applications that lend themselves to implementation as a stream architecture with StReAm. The key benefit of the presented domain-specific compilers is a reduction of design time by orders of magnitude while keeping the optimal performance of hand-designed circuits.
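The object-oriented hardware design style the paper combines with its compilers can be sketched as C++ operator overloading that builds a dataflow graph rather than computing values; all class and member names below are illustrative, not the actual BSAT/StReAm API:

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Overloaded operators record dataflow-graph nodes instead of computing;
// a back end would later schedule the graph and map it to FPGA logic.
struct Node {
    std::string op;
    std::vector<std::shared_ptr<Node>> inputs;
};

struct Stream {
    std::shared_ptr<Node> node;
    explicit Stream(const std::string& name)
        : node(std::make_shared<Node>(Node{name, {}})) {}
    explicit Stream(std::shared_ptr<Node> n) : node(std::move(n)) {}
};

Stream operator*(const Stream& a, const Stream& b) {
    return Stream(std::make_shared<Node>(Node{"*", {a.node, b.node}}));
}
Stream operator+(const Stream& a, const Stream& b) {
    return Stream(std::make_shared<Node>(Node{"+", {a.node, b.node}}));
}

int main() {
    Stream x("x"), h0("h0"), h1("h1");
    Stream y = x * h0 + x * h1;   // builds a two-multiplier, one-adder datapath
    std::printf("root op: %s, fan-in: %zu\n",
                y.node->op.c_str(), y.node->inputs.size());
}
```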

15.
We address the problem of code generation for embedded DSP systems. In such systems, it is typical for one or more digital signal processors (DSPs), program memory, and custom circuitry to be integrated onto a single IC. Consequently, the amount of silicon area that is dedicated to program memory is limited, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints, which may include hard real-time constraints. Unfortunately, existing compiler technology is unable to generate dense, high-performance code for DSPs since it does not provide adequate support for the specialized architectural features of DSPs. These specialized features not only allow for the fast execution of common DSP operations, but they also allow for the generation of dense assembly code that specifies these operations. Thus, system designers often hand-program the embedded software in assembly, which is a very time-consuming task. In this paper, we focus on providing compiler support for one particular specialized architectural feature, namely the paged absolute addressing mode – this feature is found in two commercial DSPs, Texas Instruments' TMS320C25 and TMS320C50 fixed-point DSPs; however, it may also be featured in application-specific processors (ASIPs). We present some machine-dependent code optimizations that improve code density by exploiting this architectural feature. Experimental results demonstrate that for a set of typical DSP benchmarks, some of our optimizations reduce overall code size and data memory consumption by an average of 5.0% and 16.0%, respectively. Our experimental vehicle throughout this research is the TMS320C25.
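To make the addressing-mode constraint concrete: on the TMS320C25, direct addressing reaches only the current 128-word data page, so each page crossing costs an extra page-pointer load. A simplified first-fit placement that packs variables into pages might look like the sketch below; the paper's actual optimizations, which also account for which variables are accessed together, are more sophisticated:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// First-fit packing of variables into 128-word data pages. Fewer page
// crossings between consecutive accesses means fewer page-pointer loads
// and therefore denser code. Variable names and sizes are made up.
int main() {
    const int PAGE_WORDS = 128;
    struct Var { std::string name; int words; };
    std::vector<Var> vars = {{"coeffs", 64}, {"state", 48}, {"tmp", 40}};

    int page = 0, used = 0;
    for (const Var& v : vars) {
        if (used + v.words > PAGE_WORDS) { ++page; used = 0; }  // next page
        std::printf("%s -> page %d, offset %d\n", v.name.c_str(), page, used);
        used += v.words;
    }
}
```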

16.
Achieving high performance in task-parallel runtime systems, especially with high degrees of parallelism and fine-grained tasks, requires tuning a large variety of behavioral parameters according to program characteristics. In the current state of the art, this tuning is generally performed in one of two ways: either by a group of experts who derive a single setup which achieves good – but not optimal – performance across a wide variety of use cases, or by monitoring a system's behavior at runtime and responding to it. The former approach invariably fails to achieve optimal performance for programs with highly distinct execution patterns, while the latter induces overhead and cannot affect parameters which need to be set at compile time. In order to mitigate these drawbacks, we propose a set of novel static compiler analyses specifically designed to determine program features which affect the optimal settings for a task-parallel execution environment. These features include the parallel structure of task spawning, the granularity of individual tasks, the memory size of the closure required for task parameters, and an estimate of the stack size required per task. Based on the results of these analyses, various runtime system parameters are then tuned at compile time. We have implemented this approach in the Insieme compiler and runtime system and evaluated its effectiveness on a set of 12 task-parallel benchmarks running with 1 to 64 hardware threads. Across this entire space of use cases, our implementation achieves a geometric mean performance improvement of 39%. To illustrate the impact of our optimizations, we also provide a comparison to current state-of-the-art task-parallel runtime systems, including OpenMP, Cilk, HPX, and Intel TBB.
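One of the tuned parameters, task granularity, can be illustrated with a compile-time spawn cutoff. The constant below is a hypothetical stand-in for a compiler-derived value; the actual analyses also set closure sizes, stack estimates, and other runtime parameters:

```cpp
#include <cstdio>

// Hypothetical compiler-chosen threshold: below it, spawning a task costs
// more than running the work inline. In the approach described above, a
// value like this would be derived by static analysis, not hand-picked.
constexpr int SPAWN_CUTOFF = 18;

long fib(int n) {
    if (n < 2) return n;
    if (n < SPAWN_CUTOFF)               // fine-grained: run sequentially
        return fib(n - 1) + fib(n - 2);
    long a, b;
    #pragma omp task shared(a)          // coarse-grained: worth a task
    a = fib(n - 1);
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}

int main() {
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib(30);
    std::printf("fib(30) = %ld\n", r);
}
```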

17.
GPU Computing   (cited by 9; 0 self-citations, 9 by others)
The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications.

18.
Native simulation is a promising virtual prototyping candidate for accelerating design space exploration of hardware/software systems, early software development, and functional verification. However, it does not natively provide the non-functional information needed for software performance estimation. To add this capability, the general approach is to extract performance metrics from the target binary code and back-annotate them into the high-level code from which the binary was generated. However, due to compiler optimizations, the high-level code and the binary code usually have different structures, which makes software annotation a complex task. This work proposes a loop-based mapping scheme that reflects optimizations from the binary back to the high-level IR for the purpose of precisely placing performance annotations in the software and yielding accurate performance estimates. Experiments on instruction counts show on average around 2% error while preserving a considerable speedup compared to instruction set simulation.
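A sketch of what back-annotated high-level code might look like once the loop-based mapping has matched binary loops to source loops. The consume_cycles() hook and the cycle numbers are invented placeholders for the simulator's actual annotation mechanism:

```cpp
// Annotations account for time in the simulated target without executing
// target instructions; here they just accumulate into a counter.
static unsigned long simulated_cycles = 0;
static inline void consume_cycles(unsigned n) { simulated_cycles += n; }

void fir(const int* x, const int* h, int* y, int n, int taps) {
    consume_cycles(12);                   // annotation: loop prologue cost
    for (int i = 0; i < n; ++i) {
        int acc = 0;
        for (int t = 0; t < taps; ++t)
            acc += x[i + t] * h[t];
        y[i] = acc;
        consume_cycles(4 + 2u * taps);    // annotation: per-iteration cost of
    }                                     // the optimized binary's loop body
}
```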

19.
An MPI+OpenMP Hybrid Programming Model for SMP Clusters and Its Efficient Implementation   (cited by 12; 1 self-citation, 11 by others)
An SMP cluster mixes two memory models: each node is a shared-memory multiprocessor, while distributed memory is used between nodes. This multilevel architecture raises issues of both programming model and performance. This paper discusses the performance of the MPI+OpenMP hybrid programming model and different ways of implementing it, and proposes a multi-granularity MPI+OpenMP hybrid programming method. A multi-granularity hybrid parallel algorithm for the symmetric tridiagonal eigenproblem is developed and compared, in terms of performance, with a pure MPI algorithm on the DeepComp 6800 (深腾6800) supercomputer. The results show that the hybrid parallel algorithm achieves better scalability and speedup.
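A minimal skeleton of the hybrid model, with MPI for the distributed memory between SMP nodes and OpenMP for the shared memory within each node (structure only; the multi-granularity eigenproblem algorithm itself is not reproduced):

```cpp
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided, rank;
    // FUNNELED: only the master thread of each process makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Coarse grain: one MPI process per node owns a block of the problem.
    #pragma omp parallel   // fine grain: threads share the node's block
    {
        std::printf("rank %d, thread %d\n", rank, omp_get_thread_num());
        // ... node-local computation on shared data ...
    }

    // Inter-node communication happens outside the threaded region,
    // funneled through the master thread.
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
}
```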

20.
The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors to the more aggressive wide-issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described. Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle, tracing the evolution of the mechanisms surrounding the instruction cache and the compiler optimizations used to better employ them.
