Similar literature
 Found 20 similar documents; search took 15 ms
1.
In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency, when there are potentially hundreds of computing cores on a chip, we see a need for a specialization of individual cores and the development of heterogeneous manycore computer architectures. However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain-specific manycore architectures based on the RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations, including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture. We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measuring their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than the general-purpose counterparts. In certain cases, replacing general-purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures.

2.
Numerous modern applications in various fields, such as communication and networking, multimedia, encryption, etc., impose extremely high demands regarding performance while at the same time requiring low energy consumption, low cost, and short design time. Often these very high demands cannot be satisfied by application implementations on programmable processors. Massively parallel multi-processor hardware accelerators are necessary to adequately serve these applications. The accelerator design for such applications has to decide both the micro-architectures of particular processors and the multi-processor system macro-architecture. Due to complex tradeoffs between the micro-architectures and macro-architectures, the micro- and macro-architecture design has to be performed in combination and not separately, as is done with the state-of-the-art design methods and tools. To ensure effective and efficient application implementations, an adequate design space exploration (DSE) is necessary. It has to construct and analyze several of the most promising micro- and macro-architecture combinations and select the best of them. In this paper, we show that the lack of such a design space exploration would not only make it very difficult to satisfy the ultra-high performance demands of such applications, but would also seriously degrade the accelerator quality in other design dimensions. To adequately design the multi-processor accelerators for highly demanding applications, we proposed a quality-driven model-based design method. This paper is devoted to the processor architecture exploration and synthesis of the heterogeneous multi-processor system, which is one of the most important aspects of our method. The method is implemented in our automatic DSE tool. Using our DSE tool and the LDPC decoding application as a case study, we performed extensive experimental research on the automatic synthesis of various hardware multi-processors for LDPC decoding to show various complex issues and tradeoffs in the processor architecture design, and to demonstrate the high quality of our method and DSE tool in relation to this aspect.

3.
Energy consumption is one of the most constraining requirements for the development and implementation of wireless sensor networks. Many design aspects affect energy consumption, including the hardware components, the operation of the sensors, the communication protocols, the application algorithms, and the application duty cycle. A full design space exploration solution is therefore required to estimate the contribution of all of these factors to energy consumption, and to significantly decrease the effort and time spent choosing the architecture that best fits a particular application. In this paper we present a flexible and extensible simulation and design space exploration framework called “PASES” for accurate power consumption analysis of wireless sensor networks. PASES performs both performance and energy analysis, including the application, the communication and the platform layers, providing an extensible and customizable environment. The framework assists designers in the selection of an optimal hardware solution and software implementation for the specific project of interest, ranging from standalone to large-scale networked systems. Experimental and simulation results demonstrate the framework's accuracy and utility.

4.
Network-on-Chip (NoC) has been proposed to overcome the complex on-chip communication problem of System-on-Chip (SoC) design in deep sub-micron. A complete NoC design contains exploration of both hardware and software architectures. The hardware architecture includes the selection of Processing Elements (PEs) of multiple types and their topology. The software architecture covers allocating tasks to PEs and scheduling tasks and their communications. To find the best hardware design for the target tasks, both hardware and software architectures need to be considered simultaneously. Previous works on NoC design have concentrated on solving only one or two design parameters at a time. In this paper, we propose a hardware–software co-synthesis algorithm for a heterogeneous NoC architecture. The design goal is to minimize energy consumption while meeting the real-time requirements commonly seen in embedded applications. The proposed algorithm is based on Simulated Annealing (SA). To compare the solution quality and efficiency of the proposed algorithm, we also implement branch-and-bound and iterative algorithms to solve the hardware–software co-synthesis problem of a heterogeneous NoC. With the given synthetic task sets, the experimental results show that the proposed SA-based algorithm achieves a near-optimal solution in a reasonable time, while the branch-and-bound algorithm takes a very long time to find the optimal solution, and the iterative algorithm fails to achieve good solution quality. When applying the co-synthesis algorithms to a real-world application with a PE library that has little variation in PE performance and energy consumption, the iterative algorithm achieves solution quality comparable to that of the proposed SA-based algorithm.
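The abstract above gives no details of its SA formulation, so the following is only a minimal sketch of the general idea: simulated annealing over a task-to-PE mapping that minimizes energy under a deadline. The PE library, the cost numbers, the deadline, the penalty weight, and every function name are hypothetical; task dependencies, NoC topology, and communication costs are deliberately ignored.

```python
import math
import random

# Hypothetical PE library: (execution time, energy) of each task on each PE.
# All numbers are made up for illustration and are not taken from the paper.
COSTS = {
    "fast_pe": [(2, 9), (3, 12), (1, 5), (4, 16)],
    "slow_pe": [(5, 3), (7, 4), (2, 2), (9, 6)],
}
PES = ["fast_pe", "slow_pe"]  # assumption: one instance of each PE type
DEADLINE = 12                 # hypothetical real-time bound


def evaluate(mapping):
    """Return (total energy, makespan) for a list mapping task index -> PE.

    Simplification: tasks are independent and run back to back on their PE,
    so the makespan is the total time of the busiest PE.
    """
    energy = 0
    busy = {pe: 0 for pe in PES}
    for task, pe in enumerate(mapping):
        time, joules = COSTS[pe][task]
        busy[pe] += time
        energy += joules
    return energy, max(busy.values())


def cost(mapping, penalty=100):
    """Energy plus a large penalty for every time unit past the deadline."""
    energy, makespan = evaluate(mapping)
    return energy + penalty * max(0, makespan - DEADLINE)


def anneal(seed=0, start_temp=20.0, cooling=0.95, steps=500):
    """Simulated annealing: move one random task per step, cool geometrically."""
    rng = random.Random(seed)
    current = [PES[0]] * len(COSTS[PES[0]])  # feasible start: all on fast_pe
    best = list(current)
    temp = start_temp
    for _ in range(steps):
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] = rng.choice(PES)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = list(current)
        temp = max(temp * cooling, 1e-3)
    return best


if __name__ == "__main__":
    best = anneal()
    print(best, evaluate(best))
```

Folding the deadline into the cost as a penalty lets the search pass through infeasible mappings on the way to better feasible ones. A real co-synthesis tool would also have to schedule communications and choose the PE set itself.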

5.
The co-design of architectures and algorithms has been postulated as a strategy for achieving Exascale computing in this decade. Exascale design space exploration is prohibitively expensive, at least partially due to the size and complexity of scientific applications of interest. Application codes can contain millions of lines and involve many libraries. Mini-applications, which attempt to capture some key performance issues, can potentially reduce the order of the exploration by a factor of a thousand. However, we need to carefully understand how representative mini-applications are of the full application code. This paper describes a methodology for this comparison and applies it to a particularly challenging mini-application. A multi-faceted methodology for design space exploration is also described that includes measurements on advanced architecture testbeds, experiments that use supercomputers and system software to emulate future hardware, and hardware/software co-simulation tools to predict the behavior of applications on hardware that does not yet exist.

6.
A decade of hardware/software codesign   (Cited by: 1; self-citations: 0; citations by others: 1)
Wolf, W. Computer, 2003, 36(4): 38-43
The term hardware/software codesign, coined about 10 years ago, describes a confluence of problems in integrated circuit design. By the 1990s, it became clear that microprocessor-based systems would be an important design discipline for IC designers as well. Large 16- and 32-bit microprocessors had already been used in board-level designs, and Moore's law ensured that chips would soon be large enough to include both a CPU and other subsystems. Multiple disciplines inform hardware/software codesign. Computer architecture tells us about the performance and energy consumption of single CPUs and multiprocessors. Real-time system theory helps analyze the deadline-driven performance of embedded systems. Computer-aided design assists hardware cost evaluation and design space exploration.

7.
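赵姗, 杨秋松, 李明树. 《软件学报》 (Journal of Software), 2019, 30(4): 1164-1190
To meet the diverse demands of applications, heterogeneous multicore processors have emerged and are gradually entering the market. Their processing cores have different microarchitectures or instruction set architectures (ISAs) and offer applications support for diverse characteristics, such as instruction-level parallelism (ILP) and memory-level parallelism (MLP). These cores work together to meet the optimization goals of the whole computing system, such as high performance, low power, or good energy efficiency. However, current mainstream scheduling techniques are designed mainly for traditional homogeneous processor architectures and do not account for differences in heterogeneous hardware capabilities. How scheduling on heterogeneous multicore processors can become aware of hardware heterogeneity, and provide each type of application with better-suited and better-matched hardware resources, is a question worth exploring. This paper surveys recent results in this research area. In particular, for performance-asymmetric multicore processor architectures, it analyzes and describes the main problems facing heterogeneous scheduling, namely optimization goals, analytical models, scheduling decisions, and algorithm evaluation, and systematically summarizes the related techniques for each in turn. Finally, it discusses future research directions from the perspective of hardware-software integration.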

8.
Along with the development of powerful processing platforms, heterogeneous architectures are nowadays permitting new design space explorations. In this paper, we propose a novel heterogeneous architecture for reliable pedestrian detection applications. It deploys an efficient Histogram of Oriented Gradient pipeline tightly coupled with a neuro-inspired spatio-temporal filter. By relying on hardware–software co-design principles, our architecture is capable of processing video sequences from real-world dynamic environments in real time. The paper presents the implemented algorithm and details the proposed architecture for executing it, exposing in particular the partitioning decisions made to meet the required performance. A prototype implementation is described and the results obtained are discussed with respect to other state-of-the-art solutions.

9.
Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing, which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations, with 10 implemented on an SoC-FPGA and the remainder on a cycle-accurate simulator. Across the 12 OpenCL benchmarks, results demonstrate near-linear wall-clock performance improvement of 1.8 × (using 2 dual-issue cores), up to 5.2 × (using 8 dual-issue cores) and, in one case, super-linear improvement of 8.4 × (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature.
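To read the reported speedups against core counts, a small calculation of parallel efficiency (speedup divided by number of cores) can help. The sketch below uses only the three figures quoted in the abstract; it assumes, since the abstract does not say, that each speedup is measured against a single dual-issue core, and the function name is hypothetical.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Speedup per core; 1.0 is linear scaling, above 1.0 is super-linear."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return speedup / cores


# Figures quoted in the abstract: (label, speedup, number of dual-issue cores).
# Assumption: the baseline is one dual-issue core.
REPORTED = [
    ("2 cores", 1.8, 2),
    ("8 cores", 5.2, 8),
    ("FixOffset kernel, 8 cores", 8.4, 8),
]

if __name__ == "__main__":
    for label, speedup, cores in REPORTED:
        eff = parallel_efficiency(speedup, cores)
        print(f"{label}: speedup {speedup}x, efficiency {eff:.2f}")
```

Under that baseline assumption, only the FixOffset result exceeds an efficiency of 1.0, which matches the abstract's description of it as super-linear.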

10.
Coarse-grained reconfigurable architectures (CGRAs) can be tailored and optimized for different application domains. The vast design space of coarse-grained reconfigurable architectures complicates the design of optimized processors. The goal is to design a domain-specific processor that provides just enough flexibility for that domain while minimizing the energy consumption for a given level of performance. However, a flexible architecture template and a retargetable simulator and compiler enable systematic architecture exploration that can lead to more efficient domain-specific architecture design. This article presents such an environment and an architecture exploration for a novel CGRA template.

11.
Many on-chip network circuit and architecture techniques are incompatible with modern design flows, making them unsuitable for use in systems-on-chip. This paper presents a network-on-chip (NoC) architecture design space exploration method for multi-processor systems-on-chip architectures. The NoC architecture design space is designed with a Layer-Interactive-Building block (LIB) methodology that is divided into three layers: application layer, link/network layer, and physical layer. The suggested LIB design philosophy provides a modular building-block structure in both hardware and software and the protocols for their interconnection in the three architecture layers. Using LIB, the designer can easily select these building blocks to build application-specific NoCs to meet different application requirements such as media, graphics, software radio and communication network applications. The LIB provides the NoC building blocks, architecture interacting systems-on-chip components, the programming models and application mapping strategies. The LIB can be used as a complementary library and set of tools for future on-chip interconnection network design.

12.
Computer Networks, 2003, 41(5): 623-640
We present a design methodology for a modular network processor architecture that leads to a balanced, service-defined mix between programmable processor cores, configurable hardware assists, and specialized coprocessors. Whereas the processor cores address the flexibility and extendibility needs of the networking market, the hardware components offload the processors, or even allow them to be bypassed for certain network processor-typical tasks to optimize chip area, performance, and power efficiency. We describe the rationale behind the selected functional partitioning in hardware and software components and discuss the challenges of designing the hardware components, and of organizing and integrating the programmable cores. We quantify our approach with a performance evaluation of the overall system.

13.
Multicore architectures were introduced to mitigate the increase in power dissipation with clock frequency. Deeper pipelines, speculative threading, and similar techniques for single-core systems did not bring much performance gain relative to their power overhead. For multicore architectures, however, scaling performance with the number of cores has always been a challenge. Amdahl's law shows that the theoretical maximum speedup of a multicore architecture falls well short of the number of cores whenever part of the code is serial. When little of the code is parallel, adding cores for an application may only increase power dissipation without bringing a performance advantage. There is therefore a need for an adaptive multicore architecture that can be tailored to the application in use for higher energy efficiency. In this paper, a fuzzy-logic-based design space exploration technique is presented that optimizes a multicore architecture according to workload requirements in order to achieve an optimum balance between throughput and energy.
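The Amdahl's law argument in this abstract can be made concrete with the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction of the code and n the number of cores. The sketch below is illustrative only; the function name and the sample fractions are assumptions, not values from the paper.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup under Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Sample parallel fractions chosen for illustration only.
    for p in (0.5, 0.9, 0.99):
        row = ", ".join(
            f"{n} cores: {amdahl_speedup(p, n):.2f}x" for n in (2, 8, 64)
        )
        print(f"p={p}: {row}")
```

Even with 90% of the code parallel, the speedup on 64 cores stays below 10, because the serial 10% bounds the speedup at 1 / 0.1 = 10 regardless of core count.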

14.
Herpel, H.-J.; Glesner, M. Real-Time Systems, 1998, 14(3): 269-291
Mechatronics is a rapidly growing field that requires application-specific hardware/software solutions for complex information processing at very low power consumption, area, and cost. Rapid prototyping is a proven method to check a design against its requirements during early design phases and thus shorten the overall design cycle. Rapid prototyping of real-time information processing units in mechatronics applications requires code generation and hardware synthesis tools for a fast and efficient search of the design space. In this paper we present a rapid prototyping environment that supports the designer of application-specific embedded controllers during the requirements validation phase.

15.
The GRAphics AcceLerator (GRAAL) design-exploration framework is an open system that offers a coherent development methodology for hardware/software cosimulation and codesign of embedded 3D graphics accelerators. GRAAL incorporates tools to help visually debug graphics algorithms implemented in hardware and to estimate performance in terms of throughput, power consumption, and area.

16.
For the design of classic computers, the parallel programming concept is used to abstract HW/SW interfaces during high-level specification of application software. The software is then adapted to existing multiprocessor platforms using a low-level software layer that implements the programming model. Unlike classic computers, the design of heterogeneous MPSoC also includes building the processors and other kinds of hardware components required to execute the software. In this case, the programming model hides both hardware and software refinements. This paper deals with parallel programming models to abstract both hardware and software interfaces in the case of heterogeneous MPSoC design. Different abstraction levels will be needed. In the long term, the use of higher-level programming models will open new vistas for optimization and architecture exploration, such as CPU/RTOS tradeoffs.

17.
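Hardware/software partitioning of embedded systems under resource contention   (Cited by: 2; self-citations: 1; citations by others: 1)
Building on the schedulability analysis of a time-constrained dataflow graph (DFG), this paper proposes a hardware/software partitioning algorithm. The algorithm accounts for the performance impact of contention for shared resources, so that partitioning can rely on more accurate performance analysis results. This narrows the gap between the performance estimates used during partitioning and actual runtime behavior, makes the partitioning more sound, and gives a more reliable guarantee of the target system's performance.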

18.
General-purpose processors are often incapable of achieving the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications.

19.
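In stream architectures, the scalar core and the stream processing core are heterogeneous cores, and coordination between them is the basis for correct and efficient operation of a stream processor. To address the low performance of the software coordination method used between heterogeneous cores, this paper proposes a coordination method for heterogeneous cores that combines software and hardware, and implements it on the MASA-I stream processing SOPC system. Tests using core algorithms from the media and digital signal processing domains show that, compared with the software coordination method, the proposed method improves coordination performance by two orders of magnitude and doubles overall program performance.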

20.
This paper investigates the use of 64-bit ARM cores to improve the processing efficiency of upcoming HPC systems. It describes a set of available tools, models and platforms, and their combination in an efficient methodology for the design space exploration of large manycore computing clusters. Experiments and results using representative benchmarks establish an exploration approach for evaluating essential design options at the micro-architectural level while scaling to a large number of cores. We then apply this methodology to examine the validity of SoC partitioning as an alternative to using large SoC designs, based on coherent multi-SoC models and the proposed SoC Coherent Interconnect (SCI).
