首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 546 毫秒
1.
Castelli  G. Ragazzini  G. 《Micro, IEEE》1995,15(5):41-49
Rather than dictating the architecture of application software and hardware, a real-time operating system should be flexible enough to adapt to the application's needs. The EOS real-time operating system provides a modular, scalable software platform users can tailor to specific custom hardware architectures. Developers can use minimum configurations of EOS for simple systems or enhance it with their own code for complex systems. Ultimately, we provide a configurable software platform that helps embedded application developers create low-cost, time-effective products  相似文献   

2.
llc is a C-based language where parallelism is expressed using compiler directives. In this paper, we present a new backend of an llc compiler that produces code for GPUs. We have also implemented a software architecture that eases the development of new backends. Our design represents an intermediate layer between a high-level parallel language and different hardware architectures.  相似文献   

3.
Providing highly flexible connectivity is a major architectural challenge for hardware implementation of reconfigurable neural networks. We perform an analytical evaluation and comparison of different configurable interconnect architectures (mesh NoC, tree, shared bus and point-to-point) emulating variants of two neural network topologies (having full and random configurable connectivity). We derive analytical expressions and asymptotic limits for performance (in terms of bandwidth) and cost (in terms of area and power) of the interconnect architectures considering three communication methods (unicast, multicast and broadcast). It is shown that multicast mesh NoC provides the highest performance/cost ratio and consequently it is the most suitable interconnect architecture for configurable neural network implementation. Routing table size requirements and their impact on scalability were analyzed. Modular hierarchical architecture based on multicast mesh NoC is proposed to allow large scale neural networks emulation. Simulation results successfully validate the analytical models and the asymptotic behavior of the network as a function of its size.  相似文献   

4.
We present novel architectures for the modified K-best algorithm and its very-large-scale integration implementation for spatially multiplexed wireless multiple-input multiple-output systems. The objective was to propose a simplified architecture based on the algorithm and to significantly improve the suitability for hardware implementation. Two different architecture designs were proposed: a distributed arithmetic- based tree-search detector and a breadth-first search sphere detector. The implementations were performed to obtain a configurable architectural solution for different antenna configurations and constellations. The synthesis analysis shows that the proposed architectures achieve a throughput of > 500 Mbps with reduced hardware complexity compared to previously reported architectures.  相似文献   

5.
With the recent proliferation of multimedia applications, several fast block matching motion estimation algorithms have been proposed in order to minimize the processing time in video coding. While some of these algorithms adopt pre-defined search patterns that directly reflect the most probable motion structures, other data-adaptive approaches dynamically configure the search pattern to avoid unnecessary computations and memory accesses. Either of these approaches leads to rather difficult hardware implementations, due to their configurability and adaptive nature. As a consequence, two different but quite configurable architectures are proposed in this paper. While the first architecture reflects an innovative mechanism to implement motion estimation processors that support fast but regular search algorithms, the second architecture makes use of an application specific instruction set processor (ASIP) platform, capable of implementing most data-adaptive algorithms that have been proposed in the last few years. Despite their different natures, these two architectures provide highly configurable hardware platforms for real-time motion estimation. By considering a wide set of fast and adaptive algorithms, the efficiency of these two architectures was compared and several motion estimators were synthesized in a Virtex-II Pro XC2VP30 FPGA from Xilinx, integrated within a ML310 development platform. Experimental results show that the proposed architectures can be easily reconfigured in run-time to implement a wide set of real-time motion estimation algorithms.
Leonel Sousa (Corresponding author)Email:
  相似文献   

6.
We study issues in verifying compilers for modern imperative and object-oriented languages. We take the view that it is not the compiler but the code generated by it which must be correct. It is this subtle difference that allows for reusing standard compiler architecture, construction methods and tools also in a verifying compiler.Program checking is the main technique for avoiding the cumbersome task of verifying most parts of a compiler and the tools by which they are generated. Program checking remaps the result of a compiler phase to its origin, the input of this phase, in a provably correct manner. We then only have to compare the actual input to its regenerated form, a basically syntactic process. The correctness proof of the generation of the result is replaced by the correctness proof of the remapping process. The latter turns out to be far easier than proving the generating process correct.The only part of a compiler where program checking does not seem to work is the transformation step which replaces source language constructs and their semantics, given, e.g., by an attributed syntax tree, by an intermediate representation, e.g., in SSA-form, which is expressing the same program but in terms of the target machine. This transformation phase must be directly proven using Hoare logic and/or theorem-provers. However, we can show that given the features of today's programming languages and hardware architectures this transformation is to a large extent universal: it can be reused for any pair of source and target language. To achieve this goal we investigate annotating the syntax tree as well as the intermediate representation with constraints for exhibiting specific properties of the source language. Such annotations are necessary during code optimization anyway.  相似文献   

7.
This paper presents low-complex and novel techniques for designing reconfigurable architectures for multi-standard address generator and interleaver. The emphasis of this work is on hardware re-use, but it also focuses on optimizing the hardware to support multiple standards. A low-cost reconfigurable architecture for address generator and interleaver is proposed which operates in WLAN (802.11a/b/g and 802.11n), WiMAX (802.16e) and 3GPP LTE standards. A simple algorithm and a reconfigurable architecture that eliminates the computationally intricate mod function for LTE, and floor as well as mod function for WLAN/WiMAX, are proposed to reduce the hardware cost as well as implementation complexity. Novel architectures are also proposed to select the increment values for 16-QAM and 64-QAM schemes. A unique configurable subtracting sub-block for each modulation scheme is also presented. Software simulation is carried out to authenticate the functionality of the algorithm. The proposed reconfigurable architectures are realized on FPGA and tested on board. Synthesis results on Spartan-3 FPGA display 66% reduction in FPGA resource utilization and 74% increase in operating frequency compared to the cited address generators. Implementation results on Kintex UltraScale FPGA display a reduction of 34% in resource utilization and 20% in total on-chip power compared to the cited interleavers. This design is also implemented using 45 nm CMOS standard cell technology, and ASIC synthesis results of the reconfigurable address generator exhibit 76.4% improvement in data rate and 52.23% decrease in latency compared to the state-of-the-art address generators. The proposed multimode interleaver also exhibit 60.28% reduction in hardware complexity.  相似文献   

8.
随着深度学习模型和硬件架构的快速发展,深度学习编译器已经被广泛应用.目前,深度学习模型的编译优化和调优的方法主要依赖基于高性能算子库的手动调优和基于搜索的自动调优策略.然而,面对多变的目标算子和多种硬件平台的适配需求,高性能算子库往往需要为各种架构进行多次重复实现.此外,现有的自动调优方案也面临着搜索开销大和缺乏可解释性的挑战.为了解决上述问题,本文提出了AutoConfig,一种面向深度学习编译优化的自动配置机制.针对不同的深度学习计算负载和特定的硬件平台,AutoConfig可以构建具备可解释性的优化算法分析模型,采用静态信息提取和动态开销测量的方法进行综合分析,并基于分析结果利用可配置的代码生成技术自动完成算法选择和调优.本文创新性地将优化分析模型与可配置的代码生成策略相结合,不仅保证了性能加速效果,还减少了重复开发的开销,同时简化了调优过程.在此基础上,本文进一步将AutoConfig集成到深度学习编译器Buddy Compiler中,对矩阵乘法和卷积的多种优化算法建立分析模型,并将自动配置的代码生成策略应用在多种SIMD硬件平台上进行评估.实验结果验证了AutoConfig在代码生成策略中有效地完成了参数配置和算法选择.与经过手动或自动优化的代码相比,由AutoConfig生成的代码可达到相似的执行性能,并且无需承担手动调优的重复实现开销和自动调优的搜索开销.  相似文献   

9.
为了使多标准视频解码器中的帧内预测器能够支持H.264和AVS两种视频标准,在对H.264和AVS两标准中的帧内预测计算模式进行分析,并对各模式计算公式之间相似性进行分析的基础之上,提出了一种支持H.264和AVS两种标准的,可配置的帧内预测值计算硬件架构。该架构由于将大部分预测模式的计算放到一个可配置的计算单元中进行,从而大大减少了芯片资源的浪费。为了提高处理速度,可采用4个相同的可配置的计算单元并行计算,一次计算出4个像素点的预测值。实验结果表明,该硬件架构在FPGA上占用10371个LUTs,频率可以达到150MHz。  相似文献   

10.
PipeRench: a reconfigurable architecture and compiler   总被引:1,自引:0,他引:1  
With the proliferation of highly specialized embedded computer systems has come a diversification of workloads for computing devices. General-purpose processors are struggling to efficiently meet these applications' disparate needs, and custom hardware is rarely feasible. According to the authors, reconfigurable computing, which combines the flexibility of general-purpose processors with the efficiency of custom hardware, can provide the alternative. PipeRench and its associated compiler comprise the authors' new architecture for reconfigurable computing. Combined with a traditional digital signal processor, microcontroller or general-purpose processor, PipeRench can support a system's various computing needs without requiring custom hardware. The authors describe the PipeRench architecture and how it solves some of the pre-existing problems with FPGA architectures, such as logic granularity, configuration time, forward compatibility, hard constraints and compilation time  相似文献   

11.
The modern trend toward heterogeneous many‐core architectures has led to high architectural diversity in both high performance and high‐end embedded systems. To effectively exploit the computational resources of such a wide range of architectures, programming languages and APIs such as OpenCL have become increasingly popular. Although OpenCL provides functional code portability and the ability to fine tune the application to the target hardware, providing performance portability is still an open problem. Thus, many research works have investigated the optimization of specific combinations of application and target platform. In this paper, we aim at leveraging the experience obtained in the implementation of algorithms from the cryptography domain to provide a set of guidelines for modern many‐core heterogeneous architecture performance portability and to establish a base on which domain‐specific languages and compiler transformations could be built in the near future. We study algorithmic choices and the effect of compiler transformations on three representative applications in the chosen domain on a set of seven target platforms. To estimate how well the application fits the architecture, we define a metric of computational intensity both for the architecture and the application implementation. Besides being useful to compare either different implementation or algorithmic choices and their fitness to a specific architecture, it can also be useful to the compiler to guide the code optimization process. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

12.
In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant).This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.  相似文献   

13.
Accurate instruction fetch and branch prediction is increasingly important in today's superscalar architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. A branch target buffer (BTB) is often used to provide target addresses for taken branches and to predict the destination of indirect jumps. Using a BTB avoids the delay needed to recalculate the destination address and reduces the misfetch penalty. However, an effective branch target buffer can be large and can possibly increase the cycle time of a processor. We propose that a design used in older computers, such as the PDP-8, be used in modern architectures instead of a BTB design. The compiler would pre-compute the branch destination for most branch instructions, allowing the branch information to be stored with the instruction. We consider computing branch destinations at link time and as instructions are fetched into the instruction cache; both alternatives offer similar performance with different advantages. A very small BTB is still useful to predict indirect branches, which cannot be pre-computed. Our results show that the Precomputed-Branch architecture performs better than an architecture using only a BTB, and has significant hardware savings. This is particularly true for larger programs more representative of modern applications.  相似文献   

14.
Compiling quantum programs   总被引:4,自引:0,他引:4  
In this paper we study a possible compiler for a high-level imperative programming language for quantum computation, the quantum Guarded-Command Language (qGCL). It is important because it liberates us from thinking of quantum algorithms at the data-flow level, in the same way as happened for standard computation a few decades ago.We make use of the normal-form approach to compiler design, introduced by Hoare, Jifeng and Sampaio. In this approach a source program is transformed, by means of algebraic manipulations, into a particular form which can be directly executed by a target machine. This entails the definition of a simple quantum hardware architecture, derived from Hoare et al.’s computing model.Our work provides a general framework for the construction of a compiler for qGCL, focusing mainly on the correctness of the design. Here we do not deal with other topics such as efficiency of compiled code, factorisation of unitary transformations and compilation of quantum data structures.  相似文献   

15.
Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies however have been unable to keep up with this trend, with modern designs allowing only 4?C16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs) where power and area per-core budget favor numerous lighter cores with less resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported, and optimized for the same. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks and the state-of-the art GCC implementation by up to 34.79%, depending upon hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration which delivers 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.  相似文献   

16.
Biologically-inspired packet switched network on chip (NoC) based hardware spiking neural network (SNN) architectures have been proposed as an embedded computing platform for classification, estimation and control applications. Storage of large synaptic connectivity (SNN topology) information in SNNs require large distributed on-chip memory, which poses serious challenges for compact hardware implementation of such architectures. Based on the structured neural organisation observed in human brain, a modular neural networks (MNN) design strategy partitions complex application tasks into smaller subtasks executing on distinct neural network modules, and integrates intermediate outputs in higher level functions. This paper proposes a hardware modular neural tile (MNT) architecture that reduces the SNN topology memory requirement of NoC-based hardware SNNs by using a combination of fixed and configurable synaptic connections. The proposed MNT contains a 16:16 fully-connected feed-forward SNN structure and integrates in a mesh topology NoC communication infrastructure. The SNN topology memory requirement is 50 % of the monolithic NoC-based hardware SNN implementation. The paper also presents a lookup table based SNN topology memory allocation technique, which further increases the memory utilisation efficiency. Overall the area requirement of the architecture is reduced by an average of 66 % for practical SNN application topologies. The paper presents micro-architecture details of the proposed MNT and digital neuron circuit. The proposed architecture has been validated on a Xilinx Virtex-6 FPGA and synthesised using 65 nm low-power CMOS technology. The evolvable capability of the proposed MNT and its suitability for executing subtasks within a MNN execution architecture is demonstrated by successfully evolving benchmark SNN application tasks representing classification and non-linear control functions. The paper addresses hardware modular SNN design and implementation challenges and contributes to the development of a compact hardware modular SNN architecture suitable for embedded applications  相似文献   

17.
Multithreaded architectures provide an opportunity for efficiently executing programs with irregular parallelism and/or irregular locality. This paper presents a strategy that makes use of the multithreaded execution model without exposing multithreading to the programmer. Our approach is to design simple extensions to C, and to provide compiler support that automatically translates high-level C programs into lower-level threaded programs. In this paper we present EARTH-C our extended C language which contains simple constructs for specifying control parallelism, data locality, shared variables and atomic operations. Based on EARTH-C, we describe compiler techniques that are used for translating to lower-level Threaded-C programs for the EARTH multithreaded architecture. We demonstrate our approach with six benchmark programs. We show that even naive EARTH-C programs can lead to reasonable performance, and that more advanced EARTH-C programs can give performance very close to hand-coded threated-C programs. This work supported, in part, by NSERC and FCAR.  相似文献   

18.
The choice of thread-block size and shape is one of the most important user decisions when a parallel problem is written for any CUDA architecture. The reason is that thread-block geometry has a significant impact on the global performance of the program. Unfortunately, the programmer has not enough information about the subtle interactions between this choice of parameters and the underlying hardware. This paper presents uBench, a complete suite of micro-benchmarks, in order to explore the impact on performance of (1) the thread-block geometry choice criteria, and (2) the GPU hardware resources and configurations. Each micro-benchmark has been designed to be as simple as possible to focus on a single effect derived from the hardware and thread-block parameter choice. As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.  相似文献   

19.
The performance of software on modern architectures has grown more and more difficult to predict and analyze, as modern microprocessors have grown more complex. The execution of a program now entails the complex interaction of code, compiler and processor architecture. The current generation of microprocessors is optimized to an existing set of commercial and scientific benchmarks but new applications such as data mining are becoming a significant part of the workload. In this paper we explore the use of performance monitoring hardware to analyze the execution of C4.5, a data mining application, on the IBM Power2 architecture. We see how the data gathered by the hardware can be used to identify potential changes that can be made to the program and the processor micro-architecture to improve performance. We then go on to evaluate changes to C4.5 and to the micro-architecture. Based on our experience, we identify issues that limit the use of performance monitoring hardware in user level tuning and in extending its use to high performance computing environments.  相似文献   

20.
Recently, there is a considerable research effort on providing time-predictable architectures for real-time systems. The correct functioning of them depends not only on the logically correct response, but also the time when it is given. To provide such determinism, the use of VLIW (Very Long Instruction Word) architecture with in-order pipelines and predication support became attractive. Predication is an architecture technique which helps the compiler to translate control dependencies into data dependencies reducing control flow overhead and enhancing the worst-case timing analysis. Since hardware predication support does not come for free, the goal of this work is to investigate a low overhead predication system for multi-issue VLIW machines targeting real-time applications. This study was conducted using a VLIW 4-issue prototype based on the VLIW HP ST231 ISA describing the hardware implementation and the benefits on the worst-case performance.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号