首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
In the embedded computer system domain, MPSoC systems have become increasingly popular due to the ever-increasing performance demands of modern embedded applications. The number of processing elements in these MPSoCs also steadily increases. Whereas current MPSoCs still contain a limited number of processing elements, future MPSoCs will feature tens up to hundreds of (heterogeneous) processing elements that are all integrated on a single chip. On these future large-scale MPSoC systems, the mapping of applications onto the hardware resources plays an important role to fully explore the parallelism of applications. In this article, a hierarchical run-time adaptive resource allocation framework which uses an intelligent task remapping approach is proposed to improve the system performance for large-scale MPSoCs.  相似文献   

2.
This paper investigates optimized synchronization techniques for shared memory on-chip multiprocessors (CMPs) based on network-on-chip (NoC) and targeted at future mobile systems. The proposed solution is based on the idea of locally performing synchronization operations requiring continuous polling of a shared variable, thus, featuring large contentions (e.g., spin locks and barriers). A hardware (HW) module, the synchronization-operation buffer (SB), has been introduced to queue and to manage the requests issued by the processors. By using this mechanism, we propose a spin lock implementation requiring a constant number of network transactions and memory accesses per lock acquisition. The SB also supports an efficient implementation of barriers. Experimental validation has been carried out by using GRAPES, a cycle-accurate performance/power simulation platform for multiprocessor systems-on-chip (MPSoCs). Two different architectures have been explored to prove that the proposed approach is effective independently from caches and coherence schemes adopted. For an eight-processor target architecture, we show that the SB-based solution achieves up to 50% performance improvement and 30% energy saving with respect to synchronization based on the caching of the synchronization variables and directory-based coherence protocol. Furthermore, we prove the scalability of the proposed approach when the number of processors increases  相似文献   

3.
4.
The programmable video signal processor (VSP) is an important category of processors for multimedia systems. Programmable video processors combine the flexibility of programmability with special architectural features that improve performance on video processing applications. VSPs are typically multiple processors with several processing elements (PEs) and a parallel memory system. This paper focuses on the architectural design of the PE's in a video processor and shows how technology and circuit parameters influence the structure of the datapath and, hence, the overall architecture of a programmable VSP. We emphasize the need to consider technological and circuit-level issues during the design of a system architecture and present a method whereby the conceptual organization of the PEs-the number of PEs, pipelining of the datapath, size of the register file, and number of register ports-can be evaluated in terms of a target set of applications before a detailed design is undertaken. We use motion-estimation and discrete cosine transform as example applications to illustrate how various technology parameters affect the architectural design choices. We show that the design of the register file and the datapath-pipeline depth can drastically affect PE utilization and, therefore, the number of PEs required for different applications. Our results demonstrate that pursuing the fastest cycle time can greatly increase the silicon area which must be devoted to PEs, due to both increased pipeline latency and reduced register file bandwidth  相似文献   

5.
It is observed that due to the availability of fast and highly efficient processors, many embedded system developers are attracted to implement the majority of the system components in software rather than hardware. Software implementation offers a great level of flexibility and scalability of the design. At the same time, a wide choice exists between generic processors, DSP processors, network processors, etc. This increases the design space exploration by many folds to select an appropriate processor or a processor version for a specific application or application component. In this review, recent prominent directions for embedded software performance estimation have been discussed and their salient features are summarized.  相似文献   

6.
Cache作为处理器和系统总线之间的桥梁,是芯片功耗的主要来源,低功耗Cache设计在嵌入式芯片设计中具有重要意义.传统Cache设计一般依赖于特定体系结构,难以在不同的系统中进行集成,通用性差.本文提出了一种低功耗高效率的AHB-AXI双总线结构联合Cache的IP设计.实验结果显示,本设计可以显著降低Cache功耗和提高系统性能.  相似文献   

7.
State-of-the-art devices in the consumer electronics market are relying more and more on Multi-Processor Systems-On-Chip (MPSoCs) as an efficient solution to meet their multiple design constrains, such as low cost, low power consumption, high performance and short time-to-market. In fact, as technology scales down, logic density and power density increase, generating hot spots that seriously affect the MPSoC performance and can physically damage the final system behavior. Moreover, forthcoming three-dimensional (3D) MPSoCs can achieve higher system integration density, but the aforementioned thermal problems are seriously aggravated. Thus, new thermal exploration tools are needed to study the temperature variation effects inside 3D MPSoCs. In this paper, we present a novel approach for fast transient thermal modeling and analysis of 3D MPSoCs with active (liquid) cooling solutions, while capturing the hardware-software interaction. In order to preserve both accuracy and speed, we propose a close-loop framework that combines the use of Field Programmable Gate Arrays (FPGAs) to emulate the hardware components of 2D/3D MPSoC platforms with a highly optimized thermal simulator, which uses an RC-based linear thermal model to analyze the liquid flow. The proposed framework offers speed-ups of more than three orders of magnitude when compared to cycle-accurate 3D MPSoC thermal simulators. Thus, this approach enables MPSoC designers to validate different hardware- and software-based 3D thermal management policies in real-time, and while running real-life applications, including liquid cooling injection control.  相似文献   

8.
Software defined radios (SDR) are highly configurable hardware platforms that provide the technology for realizing the rapidly expanding third (and future) generation digital wireless communication infrastructure. While there are a number of silicon alternatives available for implementing the various functions in a SDR, field programmable gate arrays (FPGAs) are an attractive option for many of these tasks for reasons of performance, power consumption and flexibility. Amongst the more complex tasks performed in a high data rate wireless system is synchronization. This paper examines carrier synchronization in SDRs using FPGA based signal processors. We provide a tutorial style overview of carrier recovery techniques for QPSK and QAM modulation schemes and report on the design and FPGA implementation of a carrier recovery loop for a 16-QAM modern. Two design alternatives are presented to highlight the rich design space accessible using configurable logic. The FPGA device utilization and performance for a carrier recovery circuit using a look-up table approach and CORDIC arithmetic are presented. The simulation and FPGA implementation process using a recent system level design tool called System Generator for DSP described.  相似文献   

9.
Multiprocessor System-on-Chip (MPSoC) systems are evolving towards a processor pool-based architecture that employs hierarchical on-chip networks for inter- and intra-processor pool communication. Since the design space of processor pool-based MPSoCs is extremely wide, the application-specific optimization of on-chip communication architecture is a nontrivial task. This paper presents a systematic methodology for a cascaded bus matrix-based on-chip network design for processor pool-based MPSoCs. Our approach finds sub-optimal architectures in terms of energy consumption and on-chip area while satisfying given performance constraints. The proposed approach allows for independent configurations of processor pools, which leads to better solutions than seen in previous work. Since a simulation is too time-consuming to evaluate the performance of complex on-chip networks, we propose to prune the designs space efficiently by two static analysis techniques to minimize the use of simulations. Thanks to the static analysis techniques, our approach achieves an order of magnitude speed improvement for architecture exploration without performance loss, compared with simulation-based approaches.  相似文献   

10.
A programmable radio baseband signal processor is one of the essential enablers of software- defined radio. As wireless standards evolve, the processing power needed for baseband processing increases dramatically and the underlying hardware needs to cope with various standards or even simultaneously maintaining several radio links. Meanwhile, the maximum power consumption allowed by mobile terminals is still strictly limited. These challenges require both system and architecture level innovations. This article introduces a design methodology for radio baseband processors discussing the challenges and solutions of radio baseband signal processing. The LeoCore architecture is presented here as an example of a baseband processor design aimed at reducing power and silicon cost while maintaining sufficient flexibility.  相似文献   

11.
12.
朱玉飞  戴紫彬  徐进辉  李功丽 《电子学报》2017,45(12):2957-2964
以信息安全设备的密码应用需求为基础,融合流体系结构处理器基本架构,设计出流体系结构密码处理器.文章主要研究和设计影响该处理器性能的瓶颈--流存储系统.此系统针对专用密码处理器的存储特点,并采用可配置化设计,满足密码应用对处理器存储系统灵活高效的要求.同时,该设计将层次化-分布-分体式存储、多数据通道流水并行化访存、流访存调度策略相结合,优化存储系统的访存效率,以提高该处理器的整体性能.研究结果表明,相比于典型密码处理器的存储设计,该设计的访存效率最高可提升约6倍.  相似文献   

13.
14.
As more processors are integrated into Multiprocessor System-on-Chips (MPSoCs) via relentless technology scaling, the mean-time-to-failure (MTTF) is reduced to the extent that unexpected processor failures are considered during design time. A popular approach to tolerate processor failures is to migrate tasks on the faulty processor to live processors. This approach, however, is not suitable for real-time digital signal processing (DSP) applications since it may not guarantee real-time constraints. In this paper, we propose the re-scheduling of the entire application to minimize throughput degradation under a latency constraint, given that the application is specified by a Synchronous Data Flow (SDF) graph. We obtain sub-optimal re-scheduling results using a genetic algorithm for each scenario of processor failures at compile-time. If a failure is detected at run-time, the live processors obtain the saved schedule, perform task transfer, and execute the remaining tasks of the current iteration. We compare preemptive and non-preemptive migration policies and propose a hybrid policy to obtain better performance. We demonstrate the viability of the proposed technique through experiments with real-life DSP applications as well as randomly generated graphs under timing constraints and random fault scenarios.  相似文献   

15.
Security processors are used to implement cryptographic algorithmswith high throughput and/or low energy consumption constraints. The designof these processors is a balancing act between flexibility and energy consumption.The target is to create a processor with just enough programmability to covera set of algorithms—an application domain. This paper proposes GEZEL,a design environment consisting of a design language and an implementationmethodology that can be used for such domain specific processors. We use thesecurity domain as driver, and discuss the impact of the domain on the targetarchitecture. We also present a methodology to create, refine and verify asecurity processor.  相似文献   

16.
The behavior of a multiprocessing system with a multistage interconnection network is studied in the presence of faulty components. Measures for the connectivity and performance of these systems are proposed, including the average number of operational paths, the average number of accessible processors and memories, the average number of fault-free processors (memories) that are connected to an accessible memory (processor), the bandwidth, and the processing power of the system. Based on these measures, a tight upper bound for the maximal fully connected system is suggested. The gracefully degrading system is then compared, through some numerical examples, to a system whose faulty components are repaired upon failure. Based on these comparisons, the anticipated reduction in system performance can be estimated and consequently, appropriate maintenance policies can be determined  相似文献   

17.
Content-addressable memory (CAM) is frequently used in applications, such as lookup tables, databases, associative computing, and networking, that require high-speed searches due to its ability to improve application performance by using parallel comparison to reduce search time. Although the use of parallel comparison results in reduced search time, it also significantly increases power consumption. In this paper, we propose a Block-xor approach to improve the efficiency of low power precomputation-based CAM (PB-CAM). Through mathematical analysis, we found that our approach can effectively reduce the number of comparison operations by 50% on average as compared with the ones-count approach for 32-bit-long inputs. In our experiment, we used Synopsys Nanosim to estimate the power consumption in TSMC 0.35-mum CMOS technology. Compared with the ones-count PB-CAM system, the experimental results show that our proposed approach can achieve on average 30% in power reduction and 32% in power performance reduction. The major contribution of this paper is that it presents theoretical and practical proofs to verify that our proposed Block-xor PB-CAM system can achieve greater power reduction without the need for a special CAM cell design. This implies that our approach is more flexible and adaptive for general designs.  相似文献   

18.
We present a systematic methodology to support the design tradeoffs of array processors in several emerging issues, such as (1) high performance and high flexibility, (2) low cost, low power, (3) efficient memory usage, and (4) system-on-a-chip or the ease of system integration. This methodology is algebraic based, so it can cope with high-dimensional data dependence. The methodology consists of some transformation rules of data dependency graphs for facilitating flexible array designs. For example, two common partitioning approaches, LPGS and LSGP, could be unified under the methodology. It supports the design of high-speed and massively parallel processor arrays with efficient memory usage. More specifically, it leads to a novel systolic cache architecture comprising of shift registers only (cache without tags). To demonstrate how the methodology works, we have presented several systolic design examples based on the block-matching motion estimation algorithm (BMA). By multiprojecting a 4D DG of the BMA to 2D mesh, we can reconstruct several existing array processors. By multiprojecting a 6D DG of the BMA, a novel 2D systolic array can be derived that features significantly improved rates in data reusability (96%) and processor utilization (99%).  相似文献   

19.
武建平  方攀  凌明  张阳 《微电子学》2012,42(1):87-91,96
便签存储器(SPM)作为主要的片上存储器之一,可以用来提升嵌入式Linux系统的性能,并降低其能耗.提出一种针对Linux内核的SPM管理及优化方案,实现了针对Linux内核热点代码段、数据段的SPM静态优化技术.利用虚存管理技术,建立以SPM页区为基础的动态SPM页框分配机制,并实现页框分配的通用接口函数.在优化热点小对象分配器(SLAB)的基础上,实现对Linux内核的动态优化.实验结果表明,该优化方案能明显降低能耗和提升性能,其内核代码段优化方案平均提升11%的系统性能.  相似文献   

20.
This paper presents a novel approach to the synthesis of interleaved memory systems that is especially suited for application-specific processors. Our synthesis system generates the optimized interleaved memories for a specific algorithm and finds the best mapping of arrays in that algorithm onto the memory system to achieve high performance. The design space is four-dimensional (4-D) and comprises the number of memory banks, the type of memory components, the storage scheme, and the range of clock period in the system. Optimal designs are found among the Pareto points (a set of nondominated points in the design space) computed for our memory model under the performance and cost criteria set by the designer. The memory model includes all the components of an interleaved memory system and covers a lookup table-based address generation with data alignment. The synthesis is based on a general periodic storage scheme, which enables efficient handling of irregular and overlapped access patterns. The synthesis process is the exhaustive search of the heavily pruned design space, and the pruning is based on mathematically proven properties of periodic storage schemes. This paper presents the theorems, the synthesis algorithm, and the methods of effective word and bank address generation. Examples are given to illustrate the effectiveness of our method  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号