1.

A hierarchical run-time adaptive resource allocation framework for large-scale MPSoC systems

Wei Quan Andy D. Pimentel 《Design Automation for Embedded Systems》2016,20(4):311-339

In the embedded computer system domain, MPSoC systems have become increasingly popular due to the ever-increasing performance demands of modern embedded applications. The number of processing elements in these MPSoCs is also steadily increasing. Whereas current MPSoCs still contain a limited number of processing elements, future MPSoCs will feature tens up to hundreds of (heterogeneous) processing elements that are all integrated on a single chip. On these future large-scale MPSoC systems, the mapping of applications onto the hardware resources plays an important role in fully exploiting the parallelism of applications. In this article, a hierarchical run-time adaptive resource allocation framework that uses an intelligent task remapping approach is proposed to improve the system performance of large-scale MPSoCs.

2.

Efficient Synchronization for Embedded On-Chip Multiprocessors

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(10):1049-1062

This paper investigates optimized synchronization techniques for shared memory on-chip multiprocessors (CMPs) based on network-on-chip (NoC) and targeted at future mobile systems. The proposed solution is based on the idea of locally performing synchronization operations that require continuous polling of a shared variable and therefore exhibit heavy contention (e.g., spin locks and barriers). A hardware (HW) module, the synchronization-operation buffer (SB), has been introduced to queue and manage the requests issued by the processors. Using this mechanism, we propose a spin lock implementation requiring a constant number of network transactions and memory accesses per lock acquisition. The SB also supports an efficient implementation of barriers. Experimental validation has been carried out using GRAPES, a cycle-accurate performance/power simulation platform for multiprocessor systems-on-chip (MPSoCs). Two different architectures have been explored to show that the proposed approach is effective independently of the caches and coherence schemes adopted. For an eight-processor target architecture, we show that the SB-based solution achieves up to 50% performance improvement and 30% energy saving with respect to synchronization based on caching of the synchronization variables and a directory-based coherence protocol. Furthermore, we demonstrate the scalability of the proposed approach as the number of processors increases.

3.

Automated design of networks of transport-triggered architecture processors using dynamic dataflow programs

Hervé Yviquel Jani Boutellier Mickaël Raulet Emmanuel Casseau 《Signal Processing: Image Communication》2013,28(10):1295-1302

4.

A circuit-driven design methodology for video signal-processing datapath elements

Dutta S. Wolf W. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1999,7(2):229-240

The programmable video signal processor (VSP) is an important category of processors for multimedia systems. Programmable video processors combine the flexibility of programmability with special architectural features that improve performance on video processing applications. VSPs are typically multiprocessors with several processing elements (PEs) and a parallel memory system. This paper focuses on the architectural design of the PEs in a video processor and shows how technology and circuit parameters influence the structure of the datapath and, hence, the overall architecture of a programmable VSP. We emphasize the need to consider technological and circuit-level issues during the design of a system architecture and present a method whereby the conceptual organization of the PEs (the number of PEs, pipelining of the datapath, size of the register file, and number of register ports) can be evaluated in terms of a target set of applications before a detailed design is undertaken. We use motion estimation and the discrete cosine transform as example applications to illustrate how various technology parameters affect the architectural design choices. We show that the design of the register file and the datapath pipeline depth can drastically affect PE utilization and, therefore, the number of PEs required for different applications. Our results demonstrate that pursuing the fastest cycle time can greatly increase the silicon area that must be devoted to PEs, due to both increased pipeline latency and reduced register file bandwidth.

5.

Recent trends in embedded system software performance estimation

Rajendra Patel Arvind Rajawat 《Design Automation for Embedded Systems》2013,17(1):193-213

It is observed that, due to the availability of fast and highly efficient processors, many embedded system developers are inclined to implement the majority of the system components in software rather than hardware. Software implementation offers a great level of flexibility and scalability of the design. At the same time, a wide choice exists between generic processors, DSP processors, network processors, etc. This enlarges the design space to be explored manyfold when selecting an appropriate processor or processor version for a specific application or application component. In this review, recent prominent directions for embedded software performance estimation are discussed and their salient features are summarized.

6.

A low-power, high-efficiency IP design of a unified cache with an AHB-AXI dual-bus structure

Zhu Weicheng (朱伟成) Zhou Li (周莉) Yu Qingdong (喻庆东) 《微电子学与计算机》 (Microelectronics & Computer) 2012,29(5):46-49,53

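As the bridge between the processor and the system bus, the cache is a major source of chip power consumption, so low-power cache design is of great importance in embedded chip design. Traditional cache designs generally depend on a specific architecture, are difficult to integrate into different systems, and generalize poorly. This paper proposes a low-power, high-efficiency IP design of a unified cache with an AHB-AXI dual-bus structure. Experimental results show that the design significantly reduces cache power consumption and improves system performance.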

7.

Emulation-based transient thermal modeling of 2D/3D systems-on-chip with active cooling

Pablo G. Del Valle David Atienza 《Microelectronics Journal》2011,42(4):564-571

State-of-the-art devices in the consumer electronics market rely more and more on Multi-Processor Systems-On-Chip (MPSoCs) as an efficient solution to meet their multiple design constraints, such as low cost, low power consumption, high performance and short time-to-market. In fact, as technology scales down, logic density and power density increase, generating hot spots that seriously affect MPSoC performance and can physically damage the final system. Moreover, forthcoming three-dimensional (3D) MPSoCs can achieve higher system integration density, but the aforementioned thermal problems are seriously aggravated. Thus, new thermal exploration tools are needed to study the temperature variation effects inside 3D MPSoCs. In this paper, we present a novel approach for fast transient thermal modeling and analysis of 3D MPSoCs with active (liquid) cooling solutions, while capturing the hardware-software interaction. In order to preserve both accuracy and speed, we propose a closed-loop framework that combines the use of Field Programmable Gate Arrays (FPGAs) to emulate the hardware components of 2D/3D MPSoC platforms with a highly optimized thermal simulator, which uses an RC-based linear thermal model to analyze the liquid flow. The proposed framework offers speed-ups of more than three orders of magnitude compared to cycle-accurate 3D MPSoC thermal simulators. Thus, this approach enables MPSoC designers to validate different hardware- and software-based 3D thermal management policies in real time while running real-life applications, including liquid cooling injection control.

8.

FPGA Implementation of Carrier Synchronization for QAM Receivers

Chris Dick Fred Harris Michael Rice 《The Journal of VLSI Signal Processing》2004,36(1):57-71

Software defined radios (SDR) are highly configurable hardware platforms that provide the technology for realizing the rapidly expanding third (and future) generation digital wireless communication infrastructure. While there are a number of silicon alternatives available for implementing the various functions in an SDR, field programmable gate arrays (FPGAs) are an attractive option for many of these tasks for reasons of performance, power consumption and flexibility. Among the more complex tasks performed in a high data rate wireless system is synchronization. This paper examines carrier synchronization in SDRs using FPGA-based signal processors. We provide a tutorial-style overview of carrier recovery techniques for QPSK and QAM modulation schemes and report on the design and FPGA implementation of a carrier recovery loop for a 16-QAM modem. Two design alternatives are presented to highlight the rich design space accessible using configurable logic. The FPGA device utilization and performance for a carrier recovery circuit using a look-up table approach and CORDIC arithmetic are presented. The simulation and FPGA implementation process using a recent system-level design tool called System Generator for DSP is described.
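
The abstract mentions CORDIC arithmetic as one of the two design alternatives but does not give the circuit. As a rough illustration only, the following is a minimal floating-point sketch of the standard CORDIC vectoring mode, which recovers the magnitude and phase of a sample using only shift-like scalings and additions. The function name `cordic_vectoring`, the iteration count, and the use of floats instead of fixed-point hardware arithmetic are assumptions made for this example, not details from the paper.

```python
import math

def cordic_vectoring(x, y, iterations=16):
    """Return (magnitude, angle) of the vector (x, y), assuming x > 0.

    Hypothetical software model of CORDIC vectoring mode; a hardware
    version would use fixed-point shifts instead of 2.0 ** -i.
    """
    # Elementary rotation angles atan(2^-i), normally stored in a small ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; accumulate the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = 0.0  # accumulated rotation angle
    for i in range(iterations):
        step = 2.0 ** -i
        if y >= 0:
            # Rotate clockwise to drive y toward zero.
            x, y, z = x + y * step, y - x * step, z + angles[i]
        else:
            # Rotate counter-clockwise to drive y toward zero.
            x, y, z = x - y * step, y + x * step, z - angles[i]
    return x / gain, z

if __name__ == "__main__":
    mag, phase = cordic_vectoring(1.0, 1.0)
    print(mag, phase)
```

In a carrier recovery loop, a phase estimate of this kind could feed the phase detector; a look-up table approach would instead read the angle from a precomputed table.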

9.

Efficient hierarchical bus-matrix architecture exploration of processor pool-based MPSoC

Young-Pyo Joo Sungchan Kim Soonhoi Ha 《Design Automation for Embedded Systems》2012,16(4):293-317

Multiprocessor System-on-Chip (MPSoC) systems are evolving towards a processor pool-based architecture that employs hierarchical on-chip networks for inter- and intra-processor pool communication. Since the design space of processor pool-based MPSoCs is extremely wide, the application-specific optimization of on-chip communication architecture is a nontrivial task. This paper presents a systematic methodology for a cascaded bus matrix-based on-chip network design for processor pool-based MPSoCs. Our approach finds sub-optimal architectures in terms of energy consumption and on-chip area while satisfying given performance constraints. The proposed approach allows for independent configurations of processor pools, which leads to better solutions than seen in previous work. Since simulation is too time-consuming for evaluating the performance of complex on-chip networks, we propose to prune the design space efficiently with two static analysis techniques to minimize the use of simulations. Thanks to the static analysis techniques, our approach achieves an order of magnitude speed improvement for architecture exploration without performance loss, compared with simulation-based approaches.

10.

Bridging dream and reality: programmable baseband processors for software-defined radio - [integrated circuits for communications]

Liu D. Nilsson A. Tell E. Wu D. Eilert J. 《Communications Magazine, IEEE》2009,47(9):134-140

A programmable radio baseband signal processor is one of the essential enablers of software-defined radio. As wireless standards evolve, the processing power needed for baseband processing increases dramatically and the underlying hardware needs to cope with various standards or even simultaneously maintain several radio links. Meanwhile, the maximum power consumption allowed by mobile terminals is still strictly limited. These challenges require both system and architecture level innovations. This article introduces a design methodology for radio baseband processors, discussing the challenges and solutions of radio baseband signal processing. The LeoCore architecture is presented here as an example of a baseband processor design aimed at reducing power and silicon cost while maintaining sufficient flexibility.

11.

Combined Application of Data Transfer and Storage Optimizing Transformations and Subword Parallelism Exploitation for Power Consumption and Execution Time Reduction in VLIW Multimedia Processors

K. Masselos F. Catthoor C. E. Goutis H. DeMan 《The Journal of VLSI Signal Processing》2004,37(1):53-73

12.

Research and design of the memory system of a stream-architecture cryptographic processor

Zhu Yufei (朱玉飞) Dai Zibin (戴紫彬) Xu Jinhui (徐进辉) Li Gongli (李功丽) 《电子学报》 (Acta Electronica Sinica) 2017,45(12):2957-2964

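Based on the cryptographic application requirements of information security devices and drawing on the basic structure of stream-architecture processors, a stream-architecture cryptographic processor is designed. This article mainly studies and designs the bottleneck that limits the processor's performance: the stream memory system. The system targets the memory characteristics of dedicated cryptographic processors and adopts a configurable design to meet the flexibility and efficiency that cryptographic applications demand of the processor memory system. The design also combines hierarchical, distributed, multi-bank storage, pipelined parallel memory access over multiple data channels, and a stream memory-access scheduling policy to optimize the memory-access efficiency of the memory system and thereby improve the overall performance of the processor. The results show that, compared with the memory design of a typical cryptographic processor, the memory-access efficiency of this design can be improved by up to about 6 times.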

13.

Design Space Exploration for an ASIP/Co-Processor Architecture used in GNSS Receivers

G. Kappen L. Kurz O. Priebe T. G. Noll 《Journal of Signal Processing Systems》2010,58(1):41-51

14.

Failure-Aware Task Scheduling of Synchronous Data Flow Graphs Under Real-Time Constraints

Chanhee Lee Sungchan Kim Hyunok Oh Soonhoi Ha 《Journal of Signal Processing Systems》2013,73(2):201-212

As more processors are integrated into Multiprocessor System-on-Chips (MPSoCs) via relentless technology scaling, the mean-time-to-failure (MTTF) is reduced to the extent that unexpected processor failures are considered during design time. A popular approach to tolerate processor failures is to migrate tasks on the faulty processor to live processors. This approach, however, is not suitable for real-time digital signal processing (DSP) applications since it may not guarantee real-time constraints. In this paper, we propose the re-scheduling of the entire application to minimize throughput degradation under a latency constraint, given that the application is specified by a Synchronous Data Flow (SDF) graph. We obtain sub-optimal re-scheduling results using a genetic algorithm for each scenario of processor failures at compile-time. If a failure is detected at run-time, the live processors obtain the saved schedule, perform task transfer, and execute the remaining tasks of the current iteration. We compare preemptive and non-preemptive migration policies and propose a hybrid policy to obtain better performance. We demonstrate the viability of the proposed technique through experiments with real-life DSP applications as well as randomly generated graphs under timing constraints and random fault scenarios.

15.

Domain Specific Tools and Methods for Application in Security Processor Design

Patrick Schaumont Ingrid Verbauwhede 《Design Automation for Embedded Systems》2002,7(4):365-383

Security processors are used to implement cryptographic algorithms with high throughput and/or low energy consumption constraints. The design of these processors is a balancing act between flexibility and energy consumption. The target is to create a processor with just enough programmability to cover a set of algorithms: an application domain. This paper proposes GEZEL, a design environment consisting of a design language and an implementation methodology that can be used for such domain specific processors. We use the security domain as driver, and discuss the impact of the domain on the target architecture. We also present a methodology to create, refine and verify a security processor.

16.

On gracefully degrading multiprocessors with multistage interconnection networks

Koren I. Koren Z. 《Reliability, IEEE Transactions on》1989,38(1):82-89

The behavior of a multiprocessing system with a multistage interconnection network is studied in the presence of faulty components. Measures for the connectivity and performance of these systems are proposed, including the average number of operational paths, the average number of accessible processors and memories, the average number of fault-free processors (memories) that are connected to an accessible memory (processor), the bandwidth, and the processing power of the system. Based on these measures, a tight upper bound for the maximal fully connected system is suggested. The gracefully degrading system is then compared, through some numerical examples, to a system whose faulty components are repaired upon failure. Based on these comparisons, the anticipated reduction in system performance can be estimated and consequently, appropriate maintenance policies can be determined.

17.

Low Power Design of Precomputation-Based Content-Addressable Memory

Shanq-Jang Ruan Chi-Yu Wu Jui-Yuan Hsieh 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(3):331-335

Content-addressable memory (CAM) is frequently used in applications that require high-speed searches, such as lookup tables, databases, associative computing, and networking, because its parallel comparison reduces search time and thereby improves application performance. Although parallel comparison reduces search time, it also significantly increases power consumption. In this paper, we propose a Block-XOR approach to improve the efficiency of low-power precomputation-based CAM (PB-CAM). Through mathematical analysis, we found that our approach can effectively reduce the number of comparison operations by 50% on average compared with the ones-count approach for 32-bit-long inputs. In our experiment, we used Synopsys Nanosim to estimate the power consumption in TSMC 0.35-μm CMOS technology. Compared with the ones-count PB-CAM system, the experimental results show that our proposed approach achieves on average a 30% reduction in power and a 32% reduction in power performance. The major contribution of this paper is that it presents theoretical and practical proofs to verify that our proposed Block-XOR PB-CAM system can achieve greater power reduction without the need for a special CAM cell design. This implies that our approach is more flexible and adaptive for general designs.
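
To make the precomputation idea concrete, here is a small software sketch of the ones-count baseline that the abstract compares against: each stored word carries a precomputed parameter (its number of 1 bits), and a full word comparison is performed only for entries whose parameter matches that of the search key. The class `OnesCountCam` and its methods are hypothetical names for this illustration; the abstract does not describe the Block-XOR parameter extractor in enough detail to model it, so it is not shown.

```python
def ones_count(word):
    """Parameter used by the ones-count scheme: number of 1 bits in the word."""
    return bin(word).count("1")

class OnesCountCam:
    """Hypothetical software model of a ones-count precomputation-based CAM."""

    def __init__(self):
        # Each entry stores (precomputed parameter, data word).
        self.entries = []

    def write(self, word):
        """Store a word together with its precomputed parameter; return its address."""
        self.entries.append((ones_count(word), word))
        return len(self.entries) - 1

    def search(self, key):
        """Return (matching addresses, number of full word comparisons performed)."""
        key_param = ones_count(key)
        matches = []
        full_comparisons = 0
        for addr, (param, word) in enumerate(self.entries):
            # First stage: cheap parameter comparison filters most entries.
            if param != key_param:
                continue
            # Second stage: full comparison only for surviving entries.
            full_comparisons += 1
            if word == key:
                matches.append(addr)
        return matches, full_comparisons

if __name__ == "__main__":
    cam = OnesCountCam()
    for w in (0b0011, 0b0101, 0b1111, 0b0001):
        cam.write(w)
    print(cam.search(0b0101))
```

The fewer entries that share a parameter value with the key, the fewer full comparisons are needed, which is the quantity the abstract reports reducing relative to the ones-count scheme.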

18.

A Systolic Design Methodology with Application to Full-Search Block-Matching Architectures

Yen-Kuang Chen S.Y. Kung 《The Journal of VLSI Signal Processing》1998,19(1):51-77

We present a systematic methodology to support the design tradeoffs of array processors in several emerging issues, such as (1) high performance and high flexibility, (2) low cost, low power, (3) efficient memory usage, and (4) system-on-a-chip or the ease of system integration. This methodology is algebra-based, so it can cope with high-dimensional data dependence. The methodology consists of some transformation rules of data dependency graphs for facilitating flexible array designs. For example, two common partitioning approaches, LPGS and LSGP, can be unified under the methodology. It supports the design of high-speed and massively parallel processor arrays with efficient memory usage. More specifically, it leads to a novel systolic cache architecture comprising shift registers only (cache without tags). To demonstrate how the methodology works, we present several systolic design examples based on the block-matching motion estimation algorithm (BMA). By multiprojecting a 4D DG of the BMA to a 2D mesh, we can reconstruct several existing array processors. By multiprojecting a 6D DG of the BMA, a novel 2D systolic array can be derived that features significantly improved rates in data reusability (96%) and processor utilization (99%).
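
For readers unfamiliar with the full-search block-matching algorithm that the title refers to, the following is a plain sequential reference sketch, not the systolic architecture derived in the paper. It assumes frames are lists of rows of integer pixels and uses the sum of absolute differences (SAD) as the matching criterion, which is a common choice but an assumption here; the names `sad` and `full_search` and all parameters are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of the current
    frame at (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for r in range(n):
        for c in range(n):
            total += abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
    return total

def full_search(cur, ref, bx, by, n, p):
    """Exhaustively test every displacement in [-p, p] x [-p, p] and
    return (best_cost, dx, dy) for the block at (bx, by)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

if __name__ == "__main__":
    size = 8
    ref = [[r * size + c for c in range(size)] for r in range(size)]
    # Current frame: reference content shifted by (dx=2, dy=1), zero elsewhere.
    cur = [[ref[r + 1][c + 2] if r + 1 < size and c + 2 < size else 0
            for c in range(size)] for r in range(size)]
    print(full_search(cur, ref, bx=2, by=2, n=2, p=2))
```

The nested loops over displacement and pixel position are the kind of regular, high-dimensional iteration space that a dependence-graph projection methodology maps onto a processor array.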

19.

On-chip memory optimization for the Linux kernel

Wu Jianping (武建平) Fang Pan (方攀) Ling Ming (凌明) Zhang Yang (张阳) 《微电子学》 (Microelectronics) 2012,42(1):87-91,96

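Scratchpad memory (SPM), one of the main kinds of on-chip memory, can be used to improve the performance of embedded Linux systems and reduce their energy consumption. An SPM management and optimization scheme for the Linux kernel is proposed, which implements static SPM optimization for hot code and data segments of the Linux kernel. Using virtual memory management, a dynamic SPM page-frame allocation mechanism based on an SPM page zone is established, and a general interface function for page-frame allocation is implemented. On the basis of optimizing hot small-object allocators (SLAB), dynamic optimization of the Linux kernel is achieved. Experimental results show that the scheme clearly reduces energy consumption and improves performance; the kernel code-segment optimization improves system performance by 11% on average.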

20.

Synthesis of custom interleaved memory systems

Song Chen Postula A. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2000,8(1):74-83

This paper presents a novel approach to the synthesis of interleaved memory systems that is especially suited for application-specific processors. Our synthesis system generates the optimized interleaved memories for a specific algorithm and finds the best mapping of arrays in that algorithm onto the memory system to achieve high performance. The design space is four-dimensional (4-D) and comprises the number of memory banks, the type of memory components, the storage scheme, and the range of clock period in the system. Optimal designs are found among the Pareto points (a set of nondominated points in the design space) computed for our memory model under the performance and cost criteria set by the designer. The memory model includes all the components of an interleaved memory system and covers a lookup table-based address generation with data alignment. The synthesis is based on a general periodic storage scheme, which enables efficient handling of irregular and overlapped access patterns. The synthesis process is an exhaustive search of the heavily pruned design space, and the pruning is based on mathematically proven properties of periodic storage schemes. This paper presents the theorems, the synthesis algorithm, and the methods of effective word and bank address generation. Examples are given to illustrate the effectiveness of our method.
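
The notion of Pareto points used above can be illustrated with a short sketch. It assumes each candidate design is summarized as a tuple of objectives to be minimized, for example (cost, execution cycles); the tuple layout, the sample values, and the function names `dominates` and `pareto_points` are assumptions made for this example and do not come from the paper.

```python
def dominates(a, b):
    """True if design a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_points(designs):
    """Return the nondominated designs, preserving input order."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]

if __name__ == "__main__":
    # Hypothetical (cost, cycles) pairs for five candidate memory systems.
    candidates = [(4, 10), (2, 12), (3, 11), (5, 10), (3, 13)]
    print(pareto_points(candidates))
```

A designer would then pick among the surviving points according to the relative weight given to cost and performance.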