Similar Articles (20 results)
1.
Non-Uniform Cache Architecture (NUCA) has all but become the direction for future large-capacity on-chip caches. This paper identifies the main issues in designing a homogeneous single-chip multiprocessor: data coherence across the multi-level cache design, inter-core communication, and external bus efficiency, and it describes the corresponding solutions adopted in the multiprocessor design. Finally, single-core and dual-core designs are compared in terms of performance and power, and a floorplan of the dual-core processor is given. Using IP blocks such as the dual-core processor, the L2 cache controller, and the AXI bus controller, a NUCA platform for designing AXI-bus SoCs is proposed.

2.
In this paper, we describe a procedure for memory design and exploration for low-power embedded systems. Our system consists of an instruction cache and a data cache on-chip, and a large memory off-chip. In the first step, we reduce the power consumption due to memory traffic by applying memory-optimizing transformations such as loop transformations. Next, we use a memory exploration procedure to choose a cache configuration (cache size and line size) that satisfies the system requirements on area, number of cycles, and energy consumption. We include energy among the performance metrics because, across cache configurations, energy consumption varies quite differently from the number of cycles. The exploration procedure is efficient because it exploits the trends in the cycle and energy characteristics to reduce the search space significantly.
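As an illustration of this kind of trend-based pruning, here is a minimal Python sketch with made-up cost models (the paper derives its cycle and energy characteristics from simulation, not from these formulas): for each line size, the cache size is grown only until the cycle budget is met, instead of evaluating every (size, line) pair.

    # Sketch of trend-pruned cache exploration; cycles() and energy() are toy
    # stand-ins for the paper's simulated characteristics.
    def cycles(size_kb, line_b):
        miss = 0.2 / size_kb ** 0.5 + 0.0005 * line_b   # toy miss-rate model
        return 1e6 * (1 + 20 * miss)

    def energy(size_kb, line_b):
        return cycles(size_kb, line_b) * (0.1 + 0.002 * size_kb + 0.0005 * line_b)

    def explore(sizes, lines, max_cycles):
        best, best_e = None, float("inf")
        for l in lines:
            for s in sorted(sizes):
                # Cycles fall monotonically with size, so the first feasible
                # size is taken as this line size's candidate and the rest
                # of the size axis is pruned.
                if cycles(s, l) <= max_cycles:
                    if energy(s, l) < best_e:
                        best, best_e = (s, l), energy(s, l)
                    break
        return best, best_e

    print(explore(sizes=[1, 2, 4, 8, 16, 32], lines=[16, 32, 64], max_cycles=2.2e6))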

3.
The recent MPEG Reconfigurable Media Coding (RMC) standard aims at defining media processing specifications (e.g. video codecs) in a form that abstracts from the implementation platform, yet is an appropriate starting point for implementation on specific targets. To this end, the RMC framework has standardized an asynchronous dataflow model of computation and an associated specification language; together they provide the formalism and theoretical foundation for multimedia specifications. Even though these specifications are abstract and platform-independent, developing implementations from such initial specifications presents clear advantages over approaches based on classical sequential specifications. The advantages are particularly appealing when targeting current and emerging homogeneous and heterogeneous multicore and manycore processing platforms. These highly parallel computing machines are gradually replacing single-core processors, particularly when the system design aims at reducing power dissipation or increasing throughput. However, a straightforward mapping of an abstract dataflow specification onto a concurrent and heterogeneous platform often does not produce an efficient result. Before an abstract specification can be translated into an efficient software and hardware implementation, the dataflow networks need to be partitioned and then mapped to individual processing elements, and system performance requirements must be accounted for in the design optimization process. This paper discusses the state of the art of the combinatorial problems that must be faced at this design space exploration step. Some recent developments and experimental results for image and video coding applications are illustrated. Both well-known and novel heuristics for problems such as mapping, scheduling, and buffer minimization are investigated in the specific context of exploring the design space of dataflow program implementations.
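For a flavor of the mapping problem, the toy Python sketch below greedily places dataflow actors onto processing elements to balance load; the actor names and costs are invented, and the paper surveys far richer heuristics that co-optimize mapping with scheduling and buffer sizing.

    # Toy longest-processing-time-first mapping of actors onto PEs.
    def map_actors(actor_cost, n_pes):
        load = [0.0] * n_pes
        mapping = {}
        # Place the heaviest actors first, each on the least-loaded PE.
        for actor, cost in sorted(actor_cost.items(), key=lambda kv: -kv[1]):
            pe = min(range(n_pes), key=lambda p: load[p])
            mapping[actor] = pe
            load[pe] += cost
        return mapping, load

    actors = {"parse": 3.0, "idct": 9.0, "mc": 7.0, "deblock": 5.0, "vlc": 2.0}
    print(map_actors(actors, n_pes=2))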

4.
In this paper a single-input single-output wireless data transmission system with adaptive modulation and coding over a correlated fading channel is considered, where run-time power adjustment is not available. Higher-layer data packets are enqueued in a finite buffer before being released into the time-varying wireless channel. Without fixing the physical-layer error probability, the objective is to minimize the average joint packet loss rate due to both erroneous transmission and buffer overflow. Two optimization techniques are combined to reach the best solution. The first is policy-domain optimization, which formulates the data rate adaptation design as a classical Markov decision problem. The second is channel-domain optimization, which partitions the channel variation according to the particular fading environment and carried traffic pattern. The derived policy-domain analytical model precisely maps any policy design into various QoS performance metrics under a finite buffer. We then propose a tractable suboptimization framework that produces different two-dimensional suboptimal solutions with a scalable complexity-optimality tradeoff for practical implementations.
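A classical Markov decision problem of this kind can be solved by value iteration. The sketch below is generic Python/NumPy with illustrative transition and cost numbers, not the paper's channel or buffer model: states would encode (buffer level, channel state), actions the transmission rate, and the cost the expected packet loss.

    import numpy as np

    # Generic value iteration: P[s, a, s'] are transition probabilities and
    # C[s, a] immediate costs (e.g. expected packet loss). Numbers illustrative.
    def value_iteration(P, C, gamma=0.95, tol=1e-8):
        n_s, n_a, _ = P.shape
        V = np.zeros(n_s)
        while True:
            Q = C + gamma * P @ V            # Q[s, a] = C[s, a] + gamma * E[V(s')]
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmin(axis=1)   # optimal values and policy
            V = V_new

    P = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.3, 0.7], [0.6, 0.4]]])
    C = np.array([[1.0, 0.2], [0.5, 2.0]])
    print(value_iteration(P, C))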

5.
System-Level Synthesis Using Evolutionary Algorithms
In this paper, we consider system-level synthesis as the problem of optimally mapping a task-level specification onto a heterogeneous hardware/software architecture. This problem requires (1) the selection of the architecture (allocation), including general-purpose and dedicated processors, ASICs, busses and memories, (2) the mapping of the specification onto the selected architecture in space (binding) and time (scheduling), and (3) the design space exploration, with the goal of finding a set of implementations that satisfy a number of constraints on cost and performance. Existing methodologies often consider a fixed architecture, perform the binding only, do not reflect the tight interdependency between binding and scheduling, do not consider communication (tasks and resources), require long run-times that prevent design space exploration, or yield only one cost-optimal implementation. Here, a model is introduced that handles all of these requirements and allows system synthesis to be specified as an optimization problem. The application and adaptation of an Evolutionary Algorithm to solve the tasks of optimization and design space exploration is described.
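A bare-bones evolutionary algorithm for the binding sub-problem might look like the Python sketch below; the chromosome encoding (task-to-resource indices), cost matrix, and capacity penalty are invented for illustration, whereas the paper's EA also co-optimizes allocation and scheduling under explicit constraints.

    import random

    TASKS, RES = 8, 3
    COST = [[random.randint(1, 9) for _ in range(RES)] for _ in range(TASKS)]

    def fitness(ind):
        # Total binding cost plus a penalty when one resource is overloaded.
        load = [ind.count(r) for r in range(RES)]
        return sum(COST[t][r] for t, r in enumerate(ind)) + 10 * max(0, max(load) - 4)

    def evolve(pop_size=30, gens=100, pmut=0.2):
        pop = [[random.randrange(RES) for _ in range(TASKS)] for _ in range(pop_size)]
        for _ in range(gens):
            pop.sort(key=fitness)
            elite = pop[: pop_size // 2]
            children = []
            for _ in range(pop_size - len(elite)):
                a, b = random.sample(elite, 2)
                cut = random.randrange(1, TASKS)        # one-point crossover
                child = a[:cut] + b[cut:]
                if random.random() < pmut:              # point mutation
                    child[random.randrange(TASKS)] = random.randrange(RES)
                children.append(child)
            pop = elite + children
        return min(pop, key=fitness)

    best = evolve()
    print(best, fitness(best))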

6.
As the number of cores in chip multi-processor systems increases, the contention over shared last-level cache (LLC) resources increases, thus making LLC optimization critical, especially for embedded systems with strict area/energy/power constraints. We propose cache partitioning with partial sharing (CaPPS), which reduces LLC contention using cache partitioning and improves utilization with sharing configuration. Sharing configuration enables the partitions to be privately allocated to a single core, partially shared with a subset of cores, or fully shared with all cores, based on the co-executing applications' requirements. CaPPS imposes low hardware overhead and affords an extensive design space to increase optimization potential. To facilitate fast design space exploration, we develop an analytical model to quickly estimate the miss rates of all CaPPS configurations, using the applications' isolated LLC access traces to predict runtime LLC contention. Experimental results demonstrate that the analytical model estimates cache miss rates with an average error of only 0.73% and an average speedup of 3505× as compared to a cycle-accurate simulator. Due to CaPPS's extensive design space, CaPPS can reduce the average LLC miss rate by as much as 25% as compared to baseline configurations and by as much as 14–17% as compared to prior works.
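A common analytical device for trace-based miss-rate estimation (not necessarily CaPPS's exact model) is the LRU stack-distance histogram: under LRU, an access hits exactly when its reuse distance is smaller than the allocated associativity, so one pass over an isolated trace yields miss rates for every partition size. A minimal Python sketch:

    # Stack-distance miss-rate estimate from an isolated access trace.
    def miss_rate(trace, ways):
        stack, misses = [], 0
        for block in trace:
            if block in stack:
                depth = stack.index(block)      # reuse (stack) distance
                if depth >= ways:
                    misses += 1
                stack.pop(depth)
            else:
                misses += 1                     # cold miss
            stack.insert(0, block)              # move to MRU position
        return misses / len(trace)

    trace = ["A", "B", "C", "A", "B", "D", "A", "C"]
    for w in (1, 2, 3, 4):
        print(w, miss_rate(trace, w))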

7.
Time-dependent dielectric breakdown (TDDB) degrades the reliability of SRAM caches. A novel methodology to estimate SRAM cache reliability and performance is presented. The performance and reliability characteristics are obtained from activity extraction and Monte Carlo simulations, considering device dimensions, process variations, stress probability, and the thermal distribution. Based on this reliability-performance estimation methodology, caches with various settings of associativity, cache line size, and cache size are analysed and compared. Experiments show a tension between performance and reliability across cache configurations. Understanding this variation of performance and reliability can give SRAM designers insight into reliability-performance trade-offs in cache system design.
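For intuition, a hedged Monte Carlo sketch of cache lifetime under TDDB follows; it assumes Weibull-distributed per-cell breakdown times and a series failure model (cache fails with its first cell), and the shape/scale parameters are illustrative, not the paper's calibrated values.

    import random, math

    def mc_failure_prob(n_cells, beta, eta_years, horizon_years, trials=10_000):
        fails = 0
        for _ in range(trials):
            # Min of n i.i.d. Weibull(beta, eta) variables is Weibull with
            # scale eta * n**(-1/beta); sample it by inverse transform.
            u = random.random()
            t_first = eta_years * n_cells ** (-1 / beta) * (-math.log(u)) ** (1 / beta)
            fails += t_first < horizon_years
        return fails / trials

    # Probability a 256 KB cache of 6T cells breaks down within 5 years.
    print(mc_failure_prob(n_cells=6 * 1024 * 256, beta=1.5, eta_years=200.0,
                          horizon_years=5.0))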

8.
In this paper, we propose a system-level design methodology for the efficient exploration of the architectural parameters of memory sub-systems from a joint energy-delay perspective. The aim is to find the best configuration of the memory hierarchy without exhaustively analyzing the parameter space. The target system architecture includes the processor, separate instruction and data caches, the main memory, and the system buses. To achieve fast convergence toward the near-optimal configuration, the proposed methodology adopts an iterative local-search algorithm based on the sensitivity analysis of the cost function with respect to the tuning parameters of the memory sub-system architecture. The exploration strategy is based on the Energy-Delay Product (EDP) metric, taking both performance and energy constraints into consideration. The effectiveness of the proposed methodology has been demonstrated through the design space exploration of a real-world case study: the optimization of the memory hierarchy of a MicroSPARC2-based system executing the Mediabench multimedia benchmarks. Experimental results show an optimization speedup of two orders of magnitude with respect to the full-search approach, while the resulting near-optimal system-level configuration lies within roughly 2% of the optimal full-search configuration.
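The Python sketch below mirrors the iterative, sensitivity-driven local search in spirit only: it probes each tunable parameter one level up and down, moves along the single most EDP-improving direction, and stops at a local minimum. The EDP model and parameter levels are made up, not the paper's.

    # Sensitivity-driven local search on a toy EDP surface.
    def local_search(edp, params, levels):
        cur = {p: lv[0] for p, lv in levels.items()}   # start at smallest levels
        while True:
            best_move, best_val = None, edp(cur)
            for p in params:
                i = levels[p].index(cur[p])
                for j in (i - 1, i + 1):               # probe neighbors
                    if 0 <= j < len(levels[p]):
                        cand = dict(cur, **{p: levels[p][j]})
                        if edp(cand) < best_val:
                            best_move, best_val = cand, edp(cand)
            if best_move is None:                      # local minimum reached
                return cur, best_val
            cur = best_move

    levels = {"icache_kb": [1, 2, 4, 8, 16], "dcache_kb": [1, 2, 4, 8, 16]}
    edp = lambda c: (40 / c["icache_kb"] + 60 / c["dcache_kb"]) * \
                    (c["icache_kb"] + c["dcache_kb"])  # toy energy x delay
    print(local_search(edp, list(levels), levels))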

9.
Addition is an important operation that significantly impacts the performance of almost every data processing system. Due to their importance and popularity, addition algorithms and their corresponding circuit implementations have consistently received research attention over the years. One of the most popular implementations for long adders is the carry-skip adder. In this paper, we present a design space exploration for a variety of carry-skip adder implementations. More specifically, the paper focuses on the implementation of these adders using traditional as well as novel dynamic circuit design styles. 8-, 16-, 32-, and 64-bit adders were implemented using traditional domino, footless domino, and data-driven dynamic logic (D3L) in an ST Microelectronics 45 nm 1 V CMOS process. To further exploit the advantages of the domino and D3L approaches, a new hybrid methodology combining both strategies was implemented and is presented in this work. The adders were analyzed for energy-delay trade-offs at different process corners, and examined for their sensitivity to process and supply voltage variations. Comparative simulation results reveal that the full D3L adder ensures a better energy-delay product over all process corners (down to 34% and 25% lower than the domino and hybrid implementations, respectively, at the typical corner), while showing similar process and supply-voltage variability to the other considered carry-skip adder configurations.
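For readers unfamiliar with the adder itself, here is a functional Python model of the carry-skip scheme: within each block the carry ripples, but if every bit of the block propagates (p_i = a_i XOR b_i = 1) the block's carry-in skips straight to the next block. This models function only; it says nothing about the domino/D3L circuit styles compared in the paper.

    def carry_skip_add(a, b, width=32, block=4):
        carry, result = 0, 0
        for lo in range(0, width, block):
            cin = carry
            block_propagate = True
            for i in range(lo, lo + block):
                ai, bi = (a >> i) & 1, (b >> i) & 1
                p = ai ^ bi
                block_propagate &= bool(p)
                result |= (p ^ carry) << i         # sum bit = a ^ b ^ cin
                carry = (ai & bi) | (p & carry)    # ripple carry
            if block_propagate:
                # Skip path: functionally identical to rippling through an
                # all-propagate block, but much faster in hardware.
                carry = cin
        return result

    x, y = 0x12345678, 0x0FEDCBA8
    assert carry_skip_add(x, y) == (x + y) & 0xFFFFFFFF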

10.
The reconfiguration capability of modern FPGA devices can be utilized to execute an application by partitioning it into multiple segments such that each segment is executed one after the other on the device. This division of an application into multiple reconfigurable segments is called temporal partitioning. We present an automated temporal partitioning technique for acyclic behavior-level task graphs. To be effective, any behavior-level partitioning method should ensure that each temporal partition meets the underlying resource constraints, which requires knowing the implementation cost of each task on the hardware. Since multiple implementations of a task that differ in area and delay are possible, we perform design space exploration to choose the best implementation of a task from among the available ones. To overcome the high reconfiguration overhead of current-day FPGA devices, we propose integrating the temporal partitioning and design space exploration methodology with block-processing. Block-processing processes multiple blocks of data on each temporal partition so as to amortize the reconfiguration time. We focus on applications that can be represented as task graphs executed many times over a large set of input data. We have integrated block-processing into the temporal partitioning framework so that it also influences the design point selection for each task; this does not, however, exclude using our system for designs in which block-processing is not possible. For both block-processing and non-block-processing designs, our algorithm selects the best possible design point to minimize the execution time of the design. We present an ILP-based methodology for the integrated temporal partitioning, design space exploration, and block-processing technique that is solved to optimality for small design problems and via an iterative constraint-satisfaction approach for large design problems. We demonstrate the validity of our approach with extensive experimental results for the Discrete Cosine Transform (DCT) and random graphs.
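The amortization argument behind block-processing can be made concrete with back-of-the-envelope arithmetic (the formula and numbers below are my illustration, not taken from the paper): with P temporal partitions, reconfiguration time T_r each, and per-block compute time t_c per partition, processing B blocks per configuration cycle costs P*(T_r + B*t_c), i.e. T_r*P/B + P*t_c per block, so the reconfiguration term vanishes as B grows.

    def per_block_time(P, T_r, t_c, B):
        # Average time per data block when B blocks share one reconfiguration
        # pass over all P temporal partitions.
        return P * (T_r + B * t_c) / B

    for B in (1, 10, 100, 1000):
        print(B, per_block_time(P=4, T_r=20e-3, t_c=50e-6, B=B))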

11.
The storage requirements in data-dominated signal processing systems, whose behavior is described by array-based, loop-organized algorithmic specifications, have an important impact on the overall energy consumption, data access latency, and chip area. This paper gives a tutorial overview of existing techniques for evaluating the data memory size, an important step during the early stage of system-level exploration. The paper focuses on the most advanced developments in the field, presenting in more detail (1) an estimation approach for non-procedural specifications, where reordering the loop execution within loop nests can yield significant memory savings, and (2) an exact computation approach for procedural specifications, with relevant memory management applications, such as measuring the impact of loop transformations on data storage or analyzing the performance of different signal-to-memory mapping models. Moreover, the paper discusses typical memory management trade-offs, for instance between storage requirements and the number of memory accesses, taken into account when exploring the design space via loop transformations in the system specification.
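A toy Python version of the exact-computation idea follows: enumerate a procedural loop nest, record each array element's first write and last read, and sweep the resulting intervals to find the peak number of simultaneously live elements. Real techniques operate symbolically over polyhedral index sets rather than by enumeration, and the two-loop specification here is invented.

    # Toy exact storage-requirement computation for:
    #   for i: A[i] = f(i)          (producer loop)
    #   for i: use A[i], A[i-1]     (consumer loop)
    def peak_live_elements(n):
        first, last, step = {}, {}, 0
        for i in range(n):                          # producer loop
            first.setdefault(("A", i), step); last[("A", i)] = step; step += 1
        for i in range(n):                          # consumer loop
            for j in (i, i - 1):
                if ("A", j) in first:
                    last[("A", j)] = step
            step += 1
        events = [(first[e], 1) for e in first] + [(last[e] + 1, -1) for e in last]
        live = peak = 0
        for _, d in sorted(events):
            live += d; peak = max(peak, live)
        return peak

    # The whole array stays live between the two loops; fusing them
    # (a loop transformation) would shrink the requirement to 2 elements.
    print(peak_live_elements(8))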

12.
Information-centric networking (ICN) has emerged as a promising candidate for designing content-based future Internet paradigms. ICN increases network utilization through location-independent content naming and in-network content caching. In routers, the cache replacement policy determines which content is replaced when cache free space runs short; it therefore directly influences user experience, especially content delivery time. Meanwhile, content can be provided from different locations simultaneously because of the multi-source property of content in ICN. To the best of our knowledge, no work has yet studied the impact of the cache replacement policy on content delivery time considering multi-source content delivery in ICN, the issue addressed in this paper. As our contribution, we analytically quantify the average content delivery time under different cache replacement policies, namely least recently used (LRU) and random replacement (RR). Notably, we show that which policy is superior depends on the popularity distribution of the contents. The expected content delivery time in an assumed network topology is studied both theoretically and experimentally. On the basis of the obtained results, some interesting findings on the performance of the studied cache replacement policies are provided.
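A quick simulation of the kind that underlies such comparisons is sketched below in Python: LRU versus RR hit rates under a Zipf-like content popularity (delivery time improves with hit rate). The catalog size, cache size, and Zipf exponent are made-up parameters, and this ignores the paper's multi-source and topology modeling.

    import random

    def simulate(policy, catalog=1000, cache_size=50, zipf_s=0.8, n=100_000):
        weights = [1 / (r ** zipf_s) for r in range(1, catalog + 1)]
        reqs = random.choices(range(catalog), weights, k=n)   # Zipf-like demand
        cache, hits = [], 0
        for c in reqs:
            if c in cache:
                hits += 1
                if policy == "LRU":
                    cache.remove(c)
                    cache.append(c)        # refresh to MRU position
            else:
                if len(cache) >= cache_size:
                    victim = 0 if policy == "LRU" else random.randrange(len(cache))
                    cache.pop(victim)      # evict LRU head or a random entry
                cache.append(c)
        return hits / n

    for p in ("LRU", "RR"):
        print(p, round(simulate(p), 3))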

13.
A method is presented for solving design problems in which the objective is to best satisfy a given set of design specifications or constraints in the least pth or minimax sense, assuming that the first partial derivatives of the functions concerned with respect to the design parameters are available.
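The connection between the two senses is the standard norm limit: the p-norm of the error vector approaches the worst-case (minimax) error as p grows, while remaining differentiable, which is what lets gradient-based optimizers approach minimax designs. A small numeric illustration in Python (error values invented):

    # ||e||_p -> max_i |e_i| as p -> infinity.
    def lp(errors, p):
        m = max(abs(e) for e in errors)        # factor out for numeric stability
        return m * sum((abs(e) / m) ** p for e in errors) ** (1 / p)

    errors = [0.3, -0.5, 0.45, 0.2]
    for p in (2, 10, 100, 1000):
        print(p, round(lp(errors, p), 4))      # tends to max|e_i| = 0.5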

14.
Basic flip-flop structures are compared, with the main emphasis on CMOS ASIC implementations. Flip-flop properties are analyzed by means of simplified models, and some structural approaches for optimizing metastable behavior are discussed. A special integrated test circuit that facilitates accurate and reproducible measurements is presented. The circuit has been used to carry out metastability measurements over a wide temperature and voltage range in order to predict circuit parameters for worst-case designs. Results from measurements and circuit simulation indicate that different criteria for optimizing flip-flop performance should be used for synchronizers than for applications where the timing constraints imposed on flip-flop input signals are guaranteed to be observed. These results can help in determining the reliability of existing synchronizer and arbiter designs. By means of special synchronizer cells, the reliability of asynchronous interfaces can be improved significantly, enabling the system design to gain speed and flexibility in communication between independently clocked submodules.
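Such measured parameters feed the standard synchronizer MTBF model, MTBF = exp(t_r/tau) / (T0 * f_clk * f_data), where t_r is the resolution time allowed, tau the flip-flop's metastability time constant, and T0 its metastability window. The Python sketch below uses illustrative values, not measured ones; the paper's point is precisely that tau and T0 must be measured per process and operating condition.

    import math

    def mtbf(t_r, tau, T0, f_clk, f_data):
        # Standard synchronizer mean-time-between-failures model.
        return math.exp(t_r / tau) / (T0 * f_clk * f_data)

    seconds = mtbf(t_r=2e-9, tau=50e-12, T0=20e-12, f_clk=500e6, f_data=50e6)
    # Each extra tau of resolution slack multiplies the MTBF by e.
    print(f"{seconds / 3.15e7:.2e} years")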

15.
This work describes an aggressive SRAM cell configuration using dual-VT devices and minimum channel length to achieve high performance. A bitline leakage reduction technique is incorporated into an L1 cache design using the new cell in a 100-nm dual-VT technology to eliminate the impact of bitline leakage on performance and noise margin with minimal area overhead. Bitline delay is 23% better than in the best conventional design, enabling 6-GHz operation with 15% higher energy.

16.
A methodology for the automatic design optimization of analog integrated circuits is presented. A non-fixed-topology approach is realized by combining the optimization program OPTIMAN with the symbolic simulator ISAAC. After selecting a circuit topology, the user invokes ISAAC to model the circuit. ISAAC generates both exact and simplified analytic expressions describing the circuit's behavior. The model is then passed to the design optimization program OPTIMAN, which is based on a generalized formulation of the analog design problem. For the selected topology, the independent design variables are automatically extracted, and OPTIMAN sizes all elements to satisfy the performance constraints while optimizing a user-defined design objective. The global optimization method applied to the analytic circuit models is simulated annealing. Practical examples show that OPTIMAN quickly designs analog circuits that closely meet the specifications, and that it is a flexible and reliable design and exploration tool.
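Simulated annealing itself is compact enough to sketch; the Python below anneals device sizes against a stand-in cost function (penalized deviation from a gain target plus area), which is my toy substitute for an ISAAC-generated analytic model.

    import math, random

    def anneal(cost, x0, step=0.1, T0=1.0, alpha=0.995, iters=20_000):
        x, c = list(x0), cost(x0)
        best, best_c, T = list(x), c, T0
        for _ in range(iters):
            # Multiplicative perturbation keeps sizes positive.
            cand = [max(1e-3, xi * (1 + random.uniform(-step, step))) for xi in x]
            cc = cost(cand)
            if cc < c or random.random() < math.exp((c - cc) / T):  # Metropolis
                x, c = cand, cc
                if c < best_c:
                    best, best_c = list(x), c
            T *= alpha                                # geometric cooling
        return best, best_c

    # Toy "circuit": hit gain ~ w1/w2 == 10 while minimizing area ~ w1 + w2.
    cost = lambda w: (w[0] / w[1] - 10.0) ** 2 * 100 + 0.01 * (w[0] + w[1])
    print(anneal(cost, [5.0, 5.0]))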

17.
18.
With advances in process technology, soft errors are becoming an increasingly critical design concern. Owing to their large area, high density, and low operating voltages, caches are hit hardest by soft errors. Based on the observation that in multimedia applications not all data require the same amount of protection from soft errors, we propose a partially protected cache (PPC) architecture, in which two caches, one protected and one unprotected, sit at the same level of the memory hierarchy. We demonstrate that, compared to existing unprotected cache architectures, PPC architectures can provide a 47× reduction in failure rate at only 1% runtime and 3% power overheads. In addition, the failure rate reduction is very sensitive to the PPC cache configuration, which provides an opportunity for further improvement by correctly parameterizing the PPC configuration. Consequently, we develop design space exploration (DSE) strategies to discover the best PPC configuration. Our DSE technique can reduce the exploration time by more than 6× compared to an exhaustive approach.

19.
Selecting the ideal trade-off between the reliability and cost of a fault-tolerant architecture generally involves an extensive design space exploration. Employing state-of-the-art reliability estimation methods makes this exploration unscalable with design complexity. In this paper we introduce a low-cost reliability analysis methodology that supports this key decision with far less computational effort, running orders of magnitude faster. Based on this methodology we also propose a selective hardening technique using a hybrid fault-tolerant architecture that allows the soft-error rate constraints to be met within a given design cost budget, and vice versa. Our experimental validation shows that the methodology offers a huge gain (1200×) in computational effort in comparison with fault-injection-based reliability estimation, and produces results within acceptable error limits.

20.
This article reintroduces the power gating technique at different hierarchical levels of static random-access memory (SRAM) design, including cell, row, bank, and entire cache memory, in a 16 nm FinFET technology. Different SRAM cell structures, namely 6T, 8T, 9T, and 10T, are used in the design of a 2 MB cache memory. The power reduction of the entire cache memory employing cell-level optimisation is 99.7%, at the expense of area and other stability overheads. The power saving of the cell-level optimisation is 3× (1.2×) higher than power gating at the cache (bank) level, owing to its superior selectivity. The access delay times are allowed to increase by 4% at the same energy-delay product to achieve the best power reduction for each supply voltage and optimisation level. The results show that row-level power gating is best for optimising the power of the entire cache with the fewest drawbacks. Comparison of the cells shows that cells whose bodies have higher power consumption are the best candidates for the power gating technique in row-level optimisation. The technique has the lowest percentage of saving at the minimum energy point (MEP) of the design. Power gating also improves the power variation of all structures by at least 70%.
