Similar Documents
20 similar documents found (search time: 285 ms)
1.

Design space exploration of a configurable, heterogeneous system searches, for a given application with a required throughput, for a combination of cores or softcores of different architectures that fits within the given ASIC or FPGA area, achieves the required throughput, and optimizes power consumption. For a soft real-time streaming application, modeled as a task graph with internally parallelizable streaming tasks, this requires assigning a core type, core quantity, and DVFS frequency level to each task, which determines the task's runtime and energy consumption, and then mapping and scheduling the tasks so that the throughput requirement is met. We tightly integrate such static scheduling for stream-processing applications with design space exploration of the best heterogeneous core combination, and solve the resulting combined optimization problem with an integer linear program (ILP). We evaluate our solution for different numbers of core types on synthetic and application-based task graphs, and demonstrate improvements of up to 34.8% for ARM big and LITTLE cores, and 70.5% for 3 different core types.
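The coupled choice of core types, DVFS levels, and task assignments can be expressed as one compact ILP. The sketch below, built with the PuLP library, illustrates the flavor of such a model only; it is not the paper's formulation, and the task workloads, core areas, power figures, and area/deadline budgets are invented for illustration.

```python
# Minimal DSE + DVFS assignment ILP sketch (assumed data, not the paper's model).
import pulp

tasks = {"t1": 4e6, "t2": 6e6}            # workload in cycles (assumed)
cores = {                                  # type: (area, {freq_MHz: power_W})
    "LITTLE": (1.0, {600: 0.15, 1200: 0.45}),
    "big":    (4.0, {1000: 0.80, 2000: 2.50}),
}
AREA_BUDGET = 6.0                          # total area units (assumed)
DEADLINE_S = 0.01                          # per-task budget implied by the throughput target

prob = pulp.LpProblem("dse_dvfs", pulp.LpMinimize)
x = {(t, c, f): pulp.LpVariable(f"x_{t}_{c}_{f}", cat="Binary")
     for t in tasks for c in cores for f in cores[c][1]}
use = {c: pulp.LpVariable(f"use_{c}", cat="Binary") for c in cores}

# Objective: total power of the chosen (core type, frequency) pairs.
prob += pulp.lpSum(cores[c][1][f] * x[t, c, f] for (t, c, f) in x)

for t in tasks:                            # each task gets exactly one core type and frequency
    prob += pulp.lpSum(x[t, c, f] for c in cores for f in cores[c][1]) == 1
for (t, c, f) in x:
    prob += (tasks[t] / (f * 1e6)) * x[t, c, f] <= DEADLINE_S  # runtime meets the deadline
    prob += x[t, c, f] <= use[c]           # assigning a task marks the core type as used
prob += pulp.lpSum(cores[c][0] * use[c] for c in cores) <= AREA_BUDGET

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (t, c, f), v in x.items():
    if v.value() == 1:
        print(t, "->", c, "@", f, "MHz")
```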


2.
The paper proposes a novel heuristic technique for integrated hardware-software partitioning, hardware design space exploration, and scheduling. The technique maps an application, specified as a task graph, onto a heterogeneous architecture with the objective of minimizing the latency of the task graph subject to the area constraint on the hardware coprocessor. It uses an iterative approach in which the partitioner decides the processor mapping and HW design points of some tasks, and the scheduler then simultaneously decides the processor mapping, HW design point, and schedule time of the remaining tasks. The two design stages are tightly coupled, allowing them to produce superior-quality designs in fewer iterations. The technique accounts for the time overheads of inter-processor/intra-processor communication and shared-memory access conflicts, and can therefore be used for both communication-intensive and computation-intensive applications. It also considers the dynamic reconfiguration capability of the hardware coprocessor, performing trade-off analysis and mapping hardware tasks to mutually exclusive temporal segments when this results in lower latency. The effectiveness of the technique is demonstrated by a case study of the JPEG image compression algorithm, a comparison with an optimal ILP-based approach, and experimentation with synthetic graphs.
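To make the coupling concrete, here is a toy sketch of integrated mapping and scheduling: tasks are visited in topological order, and each one is placed on the software processor or the hardware coprocessor, whichever finishes earlier, with the hardware choice gated by the area budget. The task graph and all costs are invented, and the paper's iterative heuristic is considerably more refined than this one-pass greedy.

```python
# One-pass greedy HW/SW partitioning + list scheduling sketch (toy data).
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}  # task: predecessors
sw_time = {"a": 4, "b": 6, "c": 5, "d": 7}                  # assumed SW latencies
hw_time = {"a": 1, "b": 2, "c": 2, "d": 3}                  # assumed HW latencies
hw_area = {"a": 3, "b": 4, "c": 4, "d": 5}                  # assumed HW areas
AREA_BUDGET = 8

ready = {"SW": 0, "HW": 0}          # next free time on each processing resource
finish, mapping, used_area = {}, {}, 0
for t in ("a", "b", "c", "d"):      # a valid topological order of `graph`
    est = max((finish[p] for p in graph[t]), default=0)     # earliest start
    f_sw = max(est, ready["SW"]) + sw_time[t]
    f_hw = max(est, ready["HW"]) + hw_time[t]
    if f_hw < f_sw and used_area + hw_area[t] <= AREA_BUDGET:
        mapping[t], finish[t], ready["HW"] = "HW", f_hw, f_hw
        used_area += hw_area[t]
    else:
        mapping[t], finish[t], ready["SW"] = "SW", f_sw, f_sw
print(mapping, "latency =", max(finish.values()))
```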

3.
Dynamic Partial Reconfiguration (DPR) enables resource sharing in FPGA-based systems. It can also be used to mitigate aging-related permanent faults by increasing the number of redundant Partially Reconfigurable Regions (PRRs). Normally, each of these PRRs can host any of the Partially Reconfigurable Modules (PRMs), or tasks, at a given instant; such a system is called homogeneous. However, FPGA resource constraints limit the amount of homogeneous redundancy that can be used and hence limit the lifetime of the system. This issue can be addressed with a heterogeneous approach, in which each PRR hosts only a subset of the tasks. Furthermore, application deadlines must be considered in the design phase when deciding the mapping and scheduling of tasks to PRRs. To this end, we propose an application-specific, multi-objective, system-level design methodology to determine the appropriate number of PRRs and the mapping and scheduling of tasks to them. Specifically, we propose a lifetime-aware scheduling method that maximizes the system's mean time to failure (MTTF) under different tolerances in an application's makespan specification. We use the scheduler together with an automated floorplanner for design-time design space exploration to generate a feasible heterogeneous PRR-based system. Our experiments show that heterogeneous systems can offer more than a 2x lifetime improvement over homogeneous ones, and that they scale better with increased tolerance in the makespan specification.

4.
5.
Research on loop mapping for coarse-grained reconfigurable architectures has so far focused on operation placement and the routing of temporary data, with little attention paid to data mapping. This paper proposes a modulo-scheduling mapping flow based on memory partitioning and path reuse. First, fine-grained memory partitioning finds a suitable data mapping that increases the parallelism of data accesses; modulo scheduling then determines the operation placement and the routing of temporary data; finally, a routing-cost model is built to balance the use of memory routing and processing-element routing, and a path-reuse strategy is introduced to optimize routing resources. Experimental results show that the method performs well in terms of loop initiation interval, instructions per cycle, and execution latency.

6.
《Microelectronics Journal》2014,45(2):211-216
Computer memory systems traditionally use distinct technologies at different hierarchy levels: typically volatile, high-speed, high-cost-per-byte solid-state memory for caches and main memory (SRAM and DRAM), and non-volatile, low-speed, low-cost-per-byte technologies (magnetic disks and flash) for secondary storage. Emerging non-volatile memory (NVM) technologies may substantially change this landscape. In this work we assess the system-level latency and energy impacts of a computer with persistent main memory using PCRAM and Memristor, comparing the development and execution of a search-engine application implemented with both a traditional file-based approach and a memory-persistence approach (Mnemosyne). Our observations show that using memory persistence on top of NVM main memory, instead of a file-based approach on top of DRAM/disk, requires less than half the lines of code, is more than 4× faster to develop, consumes 33× less memory energy, and executes search tasks up to 33× faster.

7.
In past years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimizing loops with software-pipelining approaches, which have proven effective in reducing the total execution time of multimedia and signal-processing applications. However, the run-time reconfigurability of CGRAs is hampered by the overheads introduced by the required translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo-scheduling approach that converts binary code on-the-fly to run on a CGRA. We propose a greedy approach, since modulo scheduling for CGRAs is an NP-complete problem. In addition to read-after-write dependencies, dynamic modulo scheduling faces new challenges, such as register insertion to resolve recurrence dependences and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm reaches a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility, as it supports different source ISAs. As a proof of concept for scaling, a change in memory bandwidth has been evaluated: when moving from one to two memory accesses per cycle, the modulo-scheduling algorithm exploits the increased bandwidth and enhances performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements, including a 4-issue VLIW host processor) is only 1.11× bigger than a standalone 8-issue VLIW softcore processor.
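Whether performed off-line or on-the-fly, modulo scheduling starts from the minimum initiation interval II = max(ResMII, RecMII) and retries with II + 1 on failure. The sketch below shows this standard lower-bound computation; the PE count and recurrence figures are invented, not taken from the paper's CGRA.

```python
# Standard minimum-II computation for modulo scheduling (illustrative numbers).
import math

def res_mii(n_ops, n_pes):
    # Resource bound: all operations must share the PEs every II cycles.
    return math.ceil(n_ops / n_pes)

def rec_mii(recurrences):
    # Recurrence bound: for each dependence cycle,
    # II >= (latency around the cycle) / (loop-carried distance).
    return max(math.ceil(lat / dist) for lat, dist in recurrences)

ops, pes = 22, 16                  # 22 loop operations on a 4x4 CGRA (assumed)
recurrences = [(3, 1), (5, 2)]     # (latency, distance) pairs (assumed)
ii = max(res_mii(ops, pes), rec_mii(recurrences))
print("start modulo scheduling at II =", ii)   # II = 3 for these numbers
```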

8.
The hardware-software co-synthesis process aims to determine an optimal architecture for an embedded application specified by a task graph or a specification language. In this paper, we present a co-synthesis approach targeting MPSoCs and distributed-memory multiprocessor architectures for high-performance embedded applications. Our approach produces pipelined multiprocessor architectures consisting of heterogeneous processing elements connected by a point-to-point communication structure. The co-synthesis process consists of four distinct phases: selection of processing elements to add to the system, pipelined task allocation, scheduling, and mapping to a regular interconnection topology. The methodology performs system partitioning to first generate an irregular-topology multiprocessor system, and then derives an optimal (or sub-optimal) regular-topology architecture by considering well-known regular topologies such as mesh, hypercube, and tree. The co-synthesis method is demonstrated by exploring embedded architectures for an MPEG encoder and for artificially generated application task graphs representing complex embedded systems.

9.
Rapidly developing Next-Generation Sequencing technologies produce huge numbers of short reads consisting of randomly fragmented DNA base-pair strings. Assembling those short reads requires mapping them to a reference genome, which is challenging in terms of both sensitivity and execution time. In this paper, we propose a customized many-core hardware acceleration platform for short-read mapping based on the hash-index method. The processing core is highly customized to suit both 2-hit string matching and banded Smith-Waterman sequence alignment, while a distributed memory interface with a 3D-stacked architecture provides high bandwidth and low access latency for highly customized dataset partitioning and memory-access scheduling. Producing results conformal with the original BFAST program, our design achieves a 45,012× speedup over the software approach for single-end short reads and 21,102× for paired-end short reads, and outperforms a comparable single-FPGA solution by 1,466× for single-end reads. Optimized seed generation yields much better sensitivity while the performance boost remains substantial.
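For reference, the alignment kernel being customized is of the following form: a banded Smith-Waterman that fills only the cells near the main diagonal. This sketch uses a linear gap penalty and invented scores (match +2, mismatch -1, gap -2); they are not BFAST's settings.

```python
# Banded Smith-Waterman sketch with a linear gap penalty (assumed scores).
def banded_sw(query, ref, band=3, match=2, mismatch=-1, gap=-2):
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        # Restrict column j to a band of width `band` around the diagonal.
        for j in range(max(1, i - band), min(cols, i + band + 1)):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment floor
                          H[i - 1][j - 1] + s,    # match / mismatch
                          H[i - 1][j] + gap,      # deletion
                          H[i][j - 1] + gap)      # insertion
            best = max(best, H[i][j])
    return best

print(banded_sw("ACGTACG", "ACGAACG"))  # 11: one mismatch inside a 7-bp read
```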

10.
In this paper, we study the problem of energy minimization when mapping streaming applications with throughput constraints onto homogeneous multiprocessor systems that support voltage and frequency scaling with a discrete set of operating modes. We propose a soft real-time semi-partitioned scheduling algorithm that distributes task utilization evenly among the available processors. This, in turn, enables processors to run at a lower frequency, which yields lower energy consumption. We show on a set of real-life applications that our semi-partitioned scheduling approach achieves significant energy savings compared to a purely partitioned approach and to an existing semi-partitioned one, EDF-os: on average 36% (and up to 64%) when using the lowest frequency that guarantees schedulability and is supported by the system. By using a periodic frequency-switching scheme that preserves schedulability, instead of this lowest supported fixed frequency, we obtain an additional energy saving of up to 18%. Although application throughput is unchanged by the proposed semi-partitioned approach, the energy savings come at the cost of increased memory requirements and application latency.
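The frequency-selection step can be sketched directly: after the semi-partitioned assignment, each processor is clocked at the lowest discrete mode whose scaled utilization remains schedulable (the EDF bound of total utilization <= 1 is used here as a stand-in). The mode table and utilizations are invented, not the paper's platform.

```python
# Lowest schedulable DVFS mode per processor (assumed modes and utilizations).
MODES_MHZ = [400, 600, 800, 1000]   # discrete frequency modes, ascending
F_MAX = 1000.0

def lowest_feasible_mode(util_at_fmax):
    # Slowing from f_max to f inflates utilization by f_max / f.
    for f in MODES_MHZ:
        if util_at_fmax * (F_MAX / f) <= 1.0:
            return f
    return MODES_MHZ[-1]            # no slack: stay at the highest mode

per_cpu_util = [0.35, 0.55, 0.72]   # utilization of each processor at f_max
print([lowest_feasible_mode(u) for u in per_cpu_util])  # [400, 600, 800]
```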

11.
The expanded use of field-programmable gate arrays (FPGAs) in remote, long-life, and system-critical applications requires the development and implementation of effective, efficient FPGA fault-tolerance techniques. FPGAs have inherent redundancy and in-the-field reconfiguration capabilities, providing alternatives to standard redundancy-based fault-recovery techniques for integrated circuits; runtime reliability can be enhanced by exploiting these unique features. Recovery from permanent logic and interconnect faults without runtime computer-aided design (CAD) support can be performed efficiently using fine-grained physical design partitioning. Faults are localized to small partitioned blocks that have fixed interfaces to the surrounding portions of the design, and the affected blocks are reconfigured with previously generated, functionally equivalent block instances that do not use the faulty resources. This technique minimizes post-fault-detection system downtime while requiring little area overhead: only the finely located faulty portions of the FPGA are removed from use. In addition, the end user needs no access to CAD tools, making the algorithm completely transparent to system users. The approach has been implemented efficiently on a diverse set of FPGA architectures. Its flexibility is also apparent from the variable emphasis that can be placed on system reliability, area overhead, timing overhead, design effort, and system memory; given user-defined emphases, the algorithm can be adapted to specific application requirements. Experiments using random s-independent and s-correlated fault models reveal that the approach enhances system reliability while minimizing area and timing overhead.

12.

Achieving high performance in task-parallel runtime systems, especially with high degrees of parallelism and fine-grained tasks, requires tuning a large variety of behavioral parameters according to program characteristics. In the current state of the art, this tuning is generally performed in one of two ways: either by a group of experts who derive a single setup that achieves good – but not optimal – performance across a wide variety of use cases, or by monitoring a system's behavior at runtime and responding to it. The former approach invariably fails to achieve optimal performance for programs with highly distinct execution patterns, while the latter induces overhead and cannot affect parameters that need to be set at compile time. To mitigate these drawbacks, we propose a set of novel static compiler analyses specifically designed to determine program features that affect the optimal settings for a task-parallel execution environment. These features include the parallel structure of task spawning, the granularity of individual tasks, the memory size of the closure required for task parameters, and an estimate of the stack size required per task. Based on the results of these analyses, various runtime-system parameters are then tuned at compile time. We have implemented this approach in the Insieme compiler and runtime system, and evaluated its effectiveness on a set of 12 task-parallel benchmarks running with 1 to 64 hardware threads. Across this entire space of use cases, our implementation achieves a geometric-mean performance improvement of 39%. To illustrate the impact of our optimizations, we also provide a comparison to current state-of-the-art task-parallel runtime systems, including OpenMP, Cilk, HPX, and Intel TBB.
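A speculative sketch of the idea, for illustration only: static features of the task structure select runtime-system settings before execution. The feature thresholds and parameter names below are invented, not Insieme's actual analyses or knobs.

```python
# Hypothetical compile-time parameter tuning from static task features.
def tune(task_cycles, closure_bytes):
    params = {}
    # Fine-grained tasks amortize scheduling badly; batch them more aggressively.
    params["lazy_spawn_depth"] = 8 if task_cycles < 10_000 else 0
    # Small closures can live inline in fixed-size queue slots.
    params["inline_closure"] = closure_bytes <= 64
    # Many tiny tasks need deeper worker queues than a few large ones.
    params["worker_queue_len"] = 1024 if task_cycles < 10_000 else 128
    return params

print(tune(task_cycles=2_500, closure_bytes=48))
```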


13.
An embedded system is called a multi-mode embedded system if it performs multiple applications by dynamically reconfiguring the system functionality, and a multi-mode multi-task embedded system if it additionally supports multiple tasks executing within a mode. In this paper, we address an important HW/SW partitioning problem: partitioning multi-mode multi-task embedded applications with timing constraints on tasks. The objective of the optimization problem is to find a minimal total system cost for the allocation/mapping of processing resources to the functional modules in tasks, together with a schedule that satisfies the timing constraints. The key to solving the problem is how fully the potential parallelism among module executions is exploited. However, because the search space of this parallelism is inherently very large, and in order to keep schedulability analysis tractable, prior HW/SW partitioning methods have not fully exploited the potential parallel execution of modules. To overcome this limitation, we propose a set of comprehensive HW/SW partitioning techniques that solve the three subproblems of the partitioning problem simultaneously: (1) allocation of processing resources, (2) mapping of the processing resources to the modules in tasks, and (3) determination of an execution schedule for the modules. Specifically, based on a precise measurement of the parallel execution and schedulability of modules, we develop a stepwise-refinement partitioning technique for single-mode multi-task applications that solves subproblems 1, 2, and 3 effectively in an integrated fashion. The proposed technique is then extended to the HW/SW partitioning problem of multi-mode multi-task applications, i.e., to find a globally optimized allocation/mapping of processing resources with a feasible execution schedule of modules. Experiments with a set of real-life applications show that the proposed techniques reduce the implementation cost by 19.0% and 17.0% for single- and multi-mode multi-task applications, respectively, compared with the conventional method.

14.
A methodology is presented for the hierarchical partitioning and mapping of digital signal processing (DSP) tasks onto a heterogeneous, local-cluster-based network of very-large-scale-integration (VLSI) processors, with the goal of rapid prototyping of VLSI DSP systems. This paper describes the high-level partitioning issues for DSP task graphs and the proposed metrics to guide the partitioning process. Partitioning to minimize power inefficiency in the DSP system is one important metric addressed by this work, since low-power signal processing is paramount for new portable and high-density multi-chip module (MCM) DSP systems. The application of the Ratio Cut partitioning approach to DSP graphs is explained. We illustrate our results with examples and show how the final partitions vary with the target architecture to meet rapid-prototyping requirements. We compare our approach with known techniques and show that it works much better for our target applications.
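The Ratio Cut objective that such methodologies build on divides the cut weight by the product of the partition sizes, favoring balanced partitions with few crossing edges. A small sketch with an invented task graph:

```python
# Ratio Cut metric: cut(A, B) / (|A| * |B|), on a toy weighted edge list.
def ratio_cut(edges, part_a, part_b):
    cut = sum(w for u, v, w in edges
              if (u in part_a) != (v in part_a))   # edge crosses the cut
    return cut / (len(part_a) * len(part_b))

edges = [("f1", "f2", 4), ("f2", "f3", 1), ("f3", "f4", 5)]
print(ratio_cut(edges, {"f1", "f2"}, {"f3", "f4"}))  # 1 / (2*2) = 0.25
print(ratio_cut(edges, {"f1"}, {"f2", "f3", "f4"}))  # 4 / (1*3) ≈ 1.33
```

The balanced split scores lower both because its crossing edge is lighter and because the size product in the denominator is larger, which is exactly the bias the metric is designed to provide.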

15.
A Multiprocessor System-on-Chip (MPSoC) may contain hundreds of processing elements (PEs) and thousands of tasks, but design productivity lags behind the evolution of HW platforms. One problem is application task mapping, which tries to find a placement of tasks onto PEs that optimizes several criteria, such as application runtime, inter-task communication, memory usage, energy consumption, and real-time constraints, as well as area when PE selection or buffer sizing is combined with the mapping procedure. Among optimization algorithms for task mapping, this paper focuses on Simulated Annealing (SA) heuristics. We present a literature survey and 5 general recommendations for reporting heuristics that should allow disciplined comparisons and reproduction by other researchers. Most importantly, we present our findings about SA parameter selection and 7 guidelines for obtaining a good trade-off between solution quality and the algorithm's execution time. Notably, SA is compared against the global optimum. Thorough experiments were performed with 2–8 PEs, 11–32 tasks, 10 graphs per system, and 1000 independent runs, totaling over 500 CPU days of computation. Results show that SA offers a 4–6 orders-of-magnitude reduction in optimization time compared to brute force while achieving high-quality solutions. In fact, the globally optimal solution was reached with a 1.6–90% probability when the problem size is around 1e9–4e9 possibilities, and there is an approximately 90% probability of finding a solution at most 18% worse than the optimum.
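A bare-bones SA task mapper in the spirit of the surveyed heuristics is sketched below. The cost model (makespan of independent task loads), the geometric cooling schedule, and all parameters are simplifying assumptions, not the paper's recommended setup.

```python
# Minimal Simulated Annealing task-mapping sketch (assumed cost model and parameters).
import math
import random

def sa_map(task_loads, n_pes, T0=100.0, alpha=0.95, iters=2000, seed=1):
    rng = random.Random(seed)
    mapping = [rng.randrange(n_pes) for _ in task_loads]

    def cost(m):
        loads = [0.0] * n_pes
        for t, pe in enumerate(m):
            loads[pe] += task_loads[t]
        return max(loads)                  # makespan of independent tasks

    cur_c = cost(mapping)
    best, best_c, T = mapping[:], cur_c, T0
    for _ in range(iters):
        cand = mapping[:]
        cand[rng.randrange(len(cand))] = rng.randrange(n_pes)  # move one task
        c = cost(cand)
        # Accept improvements always, worsenings with Boltzmann probability.
        if c < cur_c or rng.random() < math.exp((cur_c - c) / T):
            mapping, cur_c = cand, c
            if c < best_c:
                best, best_c = cand[:], c
        T *= alpha                         # geometric cooling
    return best, best_c

loads = [3, 1, 4, 1, 5, 9, 2, 6]           # invented task execution times
print(sa_map(loads, n_pes=3))              # makespan near the optimum of 11
```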

16.
The ever-increasing adoption of field-programmable devices in various application domains for building complex embedded systems based on FPGA processors, together with the reliability issues that have emerged for FPGA devices built with the latest nanometer technologies, has raised the need for new fault-tolerance techniques to improve dependability and extend system lifetime. In addition, the runtime partial-reconfiguration technology, now highly mature in modern FPGA families, along with the availability of unused programmable resources in most FPGA designs, provides new and interesting opportunities to build advanced fault-tolerance mechanisms. In this paper, we exploit the dynamic reconfiguration potential of today's FPGA architectures and the advances in the related design-support tools, and we propose a fault-tolerant approach for FPGA embedded processors based on runtime partial reconfiguration. In the proposed methodology, the processor core is partitioned into reconfigurable modules, and each module is duplicated to implement a concurrent error-detection mechanism. Precompiled configurations containing spare resources are generated for each duplicated module and are used to repair defective modules at runtime. A fault-tolerance scheme is also proposed for the proxy logic of the reconfigurable modules, which cannot move to the alternative configurations along with the rest of the logic. Moreover, a compression method for the alternative partial bitstreams is presented, which significantly reduces the approach's high storage requirements; two different hardware decompression schemes have been implemented on a Virtex-5 device and compared in terms of area overhead and decompression latency. Furthermore, a thorough examination has been performed of how the percentage of spare resources and their allocation in the reconfigurable regions affect compression efficiency and processor performance. Finally, the proposed approach has been demonstrated on three different components – ALU, multiplier-accumulator, and instruction-fetch unit – of an open-source embedded processor.

17.
Partial reconfiguration (PR) of FPGAs can be used to dynamically extend and adapt the functionality of computing systems by swapping HW tasks in and out. To coordinate on-demand task execution, we propose and implement a Run-Time System Manager (RTSM) for scheduling software (SW) tasks on the available processor(s) and hardware (HW) tasks on any number of reconfigurable regions (RRs) of a partially reconfigurable FPGA. Fed with the initial partitioning of the application into tasks, the corresponding task graph, and the available task mappings, the RTSM controls system operation considering the status of each task and region (e.g. busy, idle, scheduled for reconfiguration or execution). Our RTSM supports task reuse and configuration prefetching to minimize reconfigurations, task movement among regions to manage the FPGA area efficiently, and region reservation for future reconfiguration and execution. We validate the correctness and portability of our RTSM by executing an image-processing application on two Xilinx-based platforms: ZedBoard and XUPV5. We also perform a more extensive evaluation of its features using a simulation framework, and find that – despite the technology limitations – our approach gives promising results in terms of scheduling quality. Since our RTSM also supports the scheduling of parallel SW tasks, we use it to manage the execution of the entire parallel Edge Detection application on a desktop; its execution time is 2.4 times faster than an unoptimized OpenMP version of the application. When processor-affinity optimization is enabled for OpenMP, our RTSM and OpenMP are on par, indicating that the scheduling efficiency of our RTSM is competitive with this state-of-the-art scheduler while additionally supporting the management of HW tasks.
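A tiny sketch of the dispatch decision such an RTSM makes for a HW task (data structures and names are illustrative, not the actual implementation): reuse a region that already holds the task's bitstream to avoid a reconfiguration, otherwise reconfigure an idle region, otherwise wait.

```python
# Hypothetical RTSM placement check with task reuse before reconfiguration.
def place_hw_task(task, regions):
    for r in regions:                      # 1) reuse: bitstream already loaded
        if r["module"] == task and r["state"] == "idle":
            return r["name"], "reuse"
    for r in regions:                      # 2) reconfigure any idle region
        if r["state"] == "idle":
            return r["name"], "reconfigure"
    return None, "wait"                    # 3) all regions busy: queue the task

regions = [{"name": "RR0", "module": "sobel", "state": "idle"},
           {"name": "RR1", "module": "fft",   "state": "busy"}]
print(place_hw_task("sobel", regions))     # ('RR0', 'reuse'): no reconfiguration
```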

18.
李娜, 高博, 谢宗甫. 《电子科技》2022, 35(2):7-13
The efficiency and reliability of heterogeneous multiprocessors can meet the demands of increasingly complex signal-processing tasks, so layered heterogeneous systems have become the development trend for signal-processing platforms. To improve the platform's hard real-time performance and address the need for high throughput, this paper studies the software/hardware modules and architecture of layered heterogeneous signal-processing platforms, and models component tasks and hardware resources as directed acyclic graphs. Existing scheduling algorithms are then classified by task type, scheduling objective, scheduling process, and research method...

19.
A redundancy-free side-node insertion algorithm for cross-layer data transfer in two-dimensional RCAs
To address cross-layer data transfer for hardware tasks on two-dimensional reconfigurable cell arrays (RCAs), a pre-order-traversal backtracking side-node insertion algorithm is proposed. For the two relevant types of data-flow graphs, cross-layer input trees and cross-layer output trees, the algorithm preserves the logical relations among the original operation nodes and inserts side nodes without redundancy. A quantitative evaluation metric system and a pipelining model for partitioning and mapping in dynamically reconfigurable systems are given, together with the critical conditions for mapping with inserted side nodes. Experimental results show that, for the same system architecture and partitioning/mapping algorithm, and when the critical conditions are satisfied, mapping with side nodes improves on mapping without them in the number of partitioned modules, the number of non-primary inputs/outputs, configuration time, total execution cycles, and power consumption. Compared with existing state-of-the-art algorithms, the proposed algorithm reduces average total execution cycles by 23.3% (RCA5×5) and 30.5% (RCA8×8) and average power consumption by 15.7% (RCA5×5) and 18.6% (RCA8×8), validating the soundness and effectiveness of the proposed method.

20.
Future computing devices are likely to be based on heterogeneous architectures comprising multi-core CPUs accompanied by GPUs or special-purpose accelerators. A challenging issue for such devices is how to manage resources effectively to achieve high efficiency and low energy consumption. With multiple new programming models and advanced framework support for heterogeneous computing, many regular applications have benefited greatly from heterogeneous systems. However, migrating this success to irregular applications remains a challenge: an irregular program's attributes may vary during execution and are often unpredictable, making it difficult to allocate heterogeneous resources for the highest efficiency, and the irregularity may cause control-flow divergence, load imbalance, and low efficiency in parallel execution. To resolve these issues, we studied and propose phase-guided dynamic work partitioning, a lightweight and fast analysis technique that collects information during program phases at runtime to guide work partitioning in subsequent phases, enabling more efficient work dispatching on heterogeneous systems. We implemented an adaptive runtime system based on this technique and use Ray-Tracing to explore the performance potential of dynamic work-distribution techniques in our framework. Experiments show that this approach can be up to 5 times faster than the original system. The proposed techniques can be applied to other irregular applications with similar properties.
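The core feedback loop of phase-guided partitioning can be sketched in a few lines: the CPU/GPU split for the next phase is set so that, at the throughputs observed in the previous phase, both devices would finish simultaneously. The measured times below are invented for illustration.

```python
# Phase-guided CPU/GPU work-split adaptation (invented timing measurements).
def next_split(cpu_share, t_cpu, t_gpu):
    cpu_rate = cpu_share / t_cpu             # work per second observed on the CPU
    gpu_rate = (1.0 - cpu_share) / t_gpu     # work per second observed on the GPU
    return cpu_rate / (cpu_rate + gpu_rate)  # share that equalizes finish times

share = 0.5                                  # phase 0: split evenly
for t_cpu, t_gpu in [(8.0, 2.0), (5.0, 4.0), (4.5, 4.4)]:  # per-phase timings
    share = next_split(share, t_cpu, t_gpu)
    print(f"next CPU share: {share:.2f}")    # stabilizes as finish times even out
```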
