Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
Superscalar processors have the ability to execute instructions out of order to better exploit the internal hardware and to maximize performance. To maintain in-order instruction commitment and to guarantee the correctness of the final results (as well as precise exception management), the Reorder Buffer (ROB) may be used. From the architectural point of view, the ROB is a memory array of several thousand bits that must be tested against hardware faults to ensure the correct behavior of the processor. Since it is deeply embedded within the microprocessor circuitry, the most straightforward approach to testing the ROB is through Built-In Self-Test solutions, which are typically adopted by manufacturers for end-of-production test. However, these solutions may not always be usable for testing during the operational phase (in-field test), which aims at detecting possible hardware faults arising while the electronic system works in its target environment. In fact, they require test infrastructures that may not be accessible and/or documented, or that simply cannot be used during the operational phase. This paper proposes an alternative solution, based on a functional approach, in which the test is performed by forcing the processor to execute a specially written test program and checking the resulting behavior of the processor. This approach can be adopted for in-field test, e.g., at power-on, at power-off, or during the time slots unused by the system application. The method has been validated using both an architectural and a memory fault simulator.
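A minimal C sketch of the functional self-test idea is given below. It is an illustration only: the routine, the constants, and the golden-signature comparison are assumptions, not the paper's actual test program, which targets specific ROB entries and is written at the assembly level. The intent is to keep many independent operations in flight so that a faulty ROB entry corrupts the committed results and therefore the final signature.

```c
#include <stdint.h>

/* Hypothetical functional ROB stress routine: mixes independent operations
 * (which an out-of-order core can execute ahead of time) with a dependent
 * chain (which delays in-order commitment), so that many ROB entries stay
 * occupied.  The returned signature is compared by the caller against a
 * golden value recorded on a known-good processor. */
uint32_t rob_stress_signature(void)
{
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u, c = 0x0F0F0F0Fu, d = 0xF0F0F0F0u;
    uint32_t sig = 0;

    for (int i = 0; i < 64; i++) {
        /* independent operations: candidates for out-of-order execution */
        a = a * 7u + 1u;
        b = (b << 3) ^ (b >> 5);
        c = c + 0x01010101u;
        /* dependent chain: younger results must wait in the ROB to commit in order */
        d = d ^ a ^ b ^ c;
        sig += d;
    }
    return sig;   /* any deviation from the golden signature reveals a fault */
}
```

In a deployment, such a routine would run at power-on, power-off, or in idle time slots and its result would be compared against a precomputed reference, as the abstract describes.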

2.
We have introduced a low-cost at-speed BIST architecture that enables conventional microprocessors and DSP cores to test their functional blocks and embedded SRAMs in system-on-a-chip architectures using their existing hardware and software resources. To accommodate the proposed test methodology, minor modifications must be applied to the base processor for its test phase; that is, we modify the controller to interpret some of the instructions differently only within the initial test mode. In this paper, we propose a functional self-test methodology that is deterministic in nature. In the proposed architecture, a self-test program called the BIST Program is stored in an embedded ROM as the vehicle for applying tests. We first test the processor core using the proposed architecture. Once the testing of the processor core is completed, this core is used to test the embedded SRAMs. A test algorithm which utilizes a mixture of existing memory testing techniques and covers all important memory faults is presented in this paper. The proposed memory test algorithm covers 100% of the faults under the fault model, plus a data retention test. The hardware overhead of the proposed architecture is shown to be negligible. The architecture has been implemented on the UTS-DSP (University of Tehran and Iran Communication Industries (SAMA)) IC, which was designed in the VLSI Circuits and Systems Laboratory.
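For illustration, the sketch below shows March C-, a widely used word-oriented memory test of the kind such a ROM-resident BIST Program could apply. It is a representative building block only; the paper's actual algorithm mixes several techniques and adds a data retention phase, which are not reproduced here.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* March C- over a word-oriented SRAM region:
 * up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); down(r0) */
bool march_cminus(volatile uint32_t *mem, size_t words)
{
    size_t i;
    for (i = 0; i < words; i++) mem[i] = 0x00000000u;   /* up:   (w0)     */
    for (i = 0; i < words; i++) {                       /* up:   (r0, w1) */
        if (mem[i] != 0x00000000u) return false;
        mem[i] = 0xFFFFFFFFu;
    }
    for (i = 0; i < words; i++) {                       /* up:   (r1, w0) */
        if (mem[i] != 0xFFFFFFFFu) return false;
        mem[i] = 0x00000000u;
    }
    for (i = words; i-- > 0; ) {                        /* down: (r0, w1) */
        if (mem[i] != 0x00000000u) return false;
        mem[i] = 0xFFFFFFFFu;
    }
    for (i = words; i-- > 0; ) {                        /* down: (r1, w0) */
        if (mem[i] != 0xFFFFFFFFu) return false;
        mem[i] = 0x00000000u;
    }
    for (i = words; i-- > 0; )                          /* down: (r0)     */
        if (mem[i] != 0x00000000u) return false;
    return true;
}
```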

3.
This paper proposes a pure software technique, "error detection by duplicated instructions" (EDDI), for detecting errors during normal system operation. Compared to other error-detection techniques that use hardware redundancy, EDDI does not require any hardware modifications to add error-detection capability to the original system. EDDI duplicates instructions during compilation and uses different registers and variables for the new instructions. Especially for faults in the code segment of memory, formulas are derived to estimate the error-detection coverage of EDDI using probabilistic methods. These formulas use statistics of the program that are collected during compilation. EDDI was applied to eight benchmark programs and the error-detection coverage was estimated. The estimates were then verified by simulation, in which a fault injector forced bit-flips in the code segment of the executable machine code. The simulation results validated the estimated fault coverage and show that approximately 1.5% of injected faults produced incorrect results in the eight benchmark programs with EDDI, while, on average, 20% of injected faults produced undetected incorrect results in the programs without EDDI. Based on the theoretical estimates and actual fault-injection experiments, EDDI can provide over 98% fault coverage without any extra hardware for error detection. This pure software technique is especially useful when designers cannot change the hardware but need dependability in the computer system. To reduce the performance overhead, EDDI schedules the instructions that are added for detecting errors such that instruction-level parallelism (ILP) is maximized. Performance overhead can be reduced by increasing ILP within a single superscalar processor: the execution time overhead in a 4-way superscalar processor is less than that in processors that can issue two instructions per cycle.
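A source-level sketch of the duplication idea is shown below. EDDI itself operates on instructions inside the compiler, so the variable names and the explicit comparison point here are illustrative assumptions, not the paper's transformation.

```c
#include <stdio.h>
#include <stdlib.h>

/* Source-level illustration of instruction duplication: every computation is
 * repeated on shadow data in separate variables, and master and shadow results
 * are compared before a value becomes externally visible. */
static void error_detected(void)
{
    fprintf(stderr, "EDDI-style check: mismatch detected\n");
    abort();
}

int checked_sum_of_squares(const int *a, const int *a_dup, int n)
{
    int sum = 0, sum_dup = 0;            /* master and shadow accumulators */
    for (int i = 0; i < n; i++) {
        sum     += a[i]     * a[i];      /* master instruction stream      */
        sum_dup += a_dup[i] * a_dup[i];  /* duplicated instruction stream  */
    }
    if (sum != sum_dup)                  /* compare before the result is used */
        error_detected();
    return sum;
}
```

Keeping the duplicated stream in separate registers and variables, as with a_dup and sum_dup above, is what lets a superscalar core execute both streams in parallel and hide much of the overhead, as the abstract notes.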

4.
Building on a fixed-point SoC (System-on-Chip) architecture composed of a general-purpose RISC processor core and an attached fixed-point hardware accelerator, this paper proposes a novel statistics-based method for designing the word length of the fixed-point hardware accelerator. Using statistical parameters, the method analytically computes the minimum word length that satisfies different signal-to-noise ratio requirements, which effectively reduces chip area, power consumption, and manufacturing cost, and thus allows computationally intensive applications to run on low-cost SoCs built around a RISC processor core without a DSP coprocessor.
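The sketch below illustrates one common way such a statistics-driven word-length choice can be made. It assumes a simple rounding-quantizer noise model (noise power 2^-2B / 12 for B fractional bits) and a profiled signal variance; it is not the specific derivation used in the paper.

```c
#include <math.h>
#include <stdio.h>

/* Find the smallest number of fractional bits B whose quantization noise
 * keeps the signal-to-quantization-noise ratio above a target, given a
 * signal variance measured (e.g., profiled) beforehand. */
static int min_fraction_bits(double signal_variance, double target_sqnr_db)
{
    for (int B = 1; B <= 32; B++) {
        double step  = ldexp(1.0, -B);        /* quantization step 2^-B   */
        double noise = step * step / 12.0;    /* rounding noise power     */
        double sqnr  = 10.0 * log10(signal_variance / noise);
        if (sqnr >= target_sqnr_db)
            return B;                         /* first B meeting the target */
    }
    return -1;                                /* not reachable within 32 bits */
}

int main(void)
{
    /* Example: unit-variance signal, 60 dB target */
    printf("B = %d fractional bits\n", min_fraction_bits(1.0, 60.0));
    return 0;
}
```

Under these assumptions, a unit-variance signal with a 60 dB target yields B = 9 fractional bits, since each added bit buys roughly 6 dB.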

5.
The Transparent Online Memory Test (TOMT) introduced here has been specifically developed for online testing of word-oriented memories with parity or Hamming protection. Careful interleaving of a word-oriented and a bit-oriented test achieves a fault coverage and a test duration comparable to the widely used March C- algorithm. Unlike similar methods, TOMT actively exercises all bit cells in memory within one test period; hence it detects not only soft errors but also functional faults, and reliably prevents fault accumulation. Different variants of the basic TOMT algorithm are investigated in terms of fault coverage and test time. A prototype implementation for SRAM is introduced which, integrated into a standard processor/memory interface, autonomously performs the transparent online memory test. The trade-offs in terms of hardware overhead and memory access delay caused by this system integration are explored.
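As an illustration of the "transparent" property (not TOMT's actual interleaved word- and bit-oriented schedule), the sketch below exercises every cell in both polarities while preserving the memory content. In a real parity- or Hamming-protected memory the check bits would have to be updated consistently, which is omitted here.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* One transparent test pass: read, complement, verify, restore, verify.
 * Memory content is unchanged afterwards, so the test can run online, yet
 * every bit cell has been written and read in both states. */
bool transparent_pass(volatile uint32_t *mem, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        uint32_t v  = mem[i];            /* live application data         */
        uint32_t nv = ~v;
        mem[i] = nv;                     /* write the complement          */
        if (mem[i] != nv) return false;  /* cell must hold the new value  */
        mem[i] = v;                      /* restore the original content  */
        if (mem[i] != v) return false;   /* and verify the restoration    */
    }
    return true;
}
```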

6.
Matrix Converters (MCs) offer several advantages, but several barriers must still be overcome, such as the complexity of MC modulation and control techniques. This article proposes a multiplatform environment that allows the implementation of the Double Sided Space Vector Modulation (DS-SVM) algorithm in a last-generation Field Programmable Gate Array (FPGA) device. The traditional digital control architecture, based on an SP and some additional devices, is improved by means of a last-generation FPGA in which the main processor (PowerPC), internal memory, communication interfaces, I/O capabilities and a hardware core that executes the DS-SVM are connected using on-chip buses. The methodology begins by defining the DS-SVM in a Matlab-Simulink environment. The PowerPC delivers 680 MIPS, but it is not a good candidate for executing the DS-SVM algorithm because it cannot reach the modulation frequency that an MC requires. A new configurable hardware circuit that implements the whole DS-SVM algorithm is therefore proposed; this solution achieves modulation frequencies above 100 kHz. The hardware core is connected to one of the PowerPC buses, and the processor can configure it or obtain feedback information at any time. As the processor is relieved of the very time-consuming DS-SVM computation, it can execute many higher-level tasks.
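On the software side, this split of work between the PowerPC and the DS-SVM hardware core typically reduces to memory-mapped register accesses over the on-chip bus. The sketch below shows that pattern only; the base address and register offsets are hypothetical placeholders, not the register map of the core described in the article.

```c
#include <stdint.h>

/* Hypothetical memory-mapped interface to a modulation core on an on-chip bus. */
#define SVM_BASE      0x80000000u     /* hypothetical bus address of the core */
#define SVM_REG(off)  (*(volatile uint32_t *)(uintptr_t)(SVM_BASE + (off)))
#define SVM_CTRL      0x00u           /* start/stop and mode bits             */
#define SVM_FREQ      0x04u           /* modulation frequency in Hz           */
#define SVM_STATUS    0x08u           /* feedback / busy flags                */

void svm_configure(uint32_t modulation_hz)
{
    SVM_REG(SVM_FREQ) = modulation_hz;   /* e.g., above 100 kHz               */
    SVM_REG(SVM_CTRL) = 1u;              /* enable the hardware core          */
}

uint32_t svm_read_feedback(void)
{
    return SVM_REG(SVM_STATUS);          /* processor can poll at any time    */
}
```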

7.
This paper introduces an interconnect delay fault test (IDFT) controller for boards and system-on-chips (SoCs) with IEEE 1149.1 and IEEE 1500 wrappers. By capturing the transition signals launched during one system clock cycle, delay faults on interconnects operated by different system clocks can be tested simultaneously with our technique. The proposed IDFT technique does not require any modification of the boundary scan cells; instead, a small number of logic gates is added around the test access port controller. The IDFT controller is compatible with the IEEE 1149.1 and IEEE 1500 standards. The superiority of our approach is verified by implementing the controller on benchmark SoCs with IEEE 1500-wrapped cores.

8.
A software methodology for detecting hardware faults in VLIW data paths
The proposed methodology aims at VLIW processor data paths that are able to autonomously detect transient and permanent hardware faults while executing their applications. The approach, applied to the compiled application software, introduces additional instructions that check the correctness of the computation with respect to failures in any one of the data path functional units. A software approach to hardware fault detection is attractive because it can be applied only to the critical applications executed on the VLIW architecture, thus not delaying the execution of noncritical tasks. Furthermore, by exploiting the intrinsic redundancy of this class of architectures, no hardware modification of the data path is required, so that no processor customization is necessary.

9.
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. This architecture exploits the more efficient prefetching of decoupled processors, which make use of the parallelism between address computation and application data processing that mainly exists in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system’s performance. The application code is split into two parallel programs: the first runs on the Access processor and computes the addresses of the data in the memory hierarchy; the second processes the application data and runs on the Execute processor, a processor with a limited address space (just the register file addresses). Each transfer of any block in the memory hierarchy, up to the Execute processor’s register file, is controlled by the Access processor and the DMA units. This strongly differentiates the architecture from traditional uniprocessors and from existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies, and with (c) existing decoupled architectures, showing its higher normalized performance. The reason for this gain is the efficiency of data transfers that the scratch-pad memory hierarchy provides, combined with the ability of the decoupled processors to hide memory latency by using memory management techniques for transferring data instead of fixed prefetching methods. Experimental results show that performance is increased by up to almost 2 times compared to uniprocessor architectures with scratch-pad memories and by up to 3.7 times compared to those with caches. The proposed architecture achieves this performance without penalties in energy-delay product.

10.
3D stacking of silicon dies via Through Silicon Vias (TSVs) is an emerging technology to increase the performance, energy efficiency and integration density of today's and future Systems-on-Chip (SoCs). In particular, the stacking of Wide I/O DRAMs on top of a logic die is a very promising approach to tackle the memory wall and the energy efficiency challenge. The potential of this type of stacking is currently under investigation by many research groups and companies, in particular for mobile devices, where, for instance, the baseband processing and the application processor can be implemented on a single logic die with one or several Wide I/O DRAMs stacked on top. An example of such a SoC is the WIOMING chip [15]. However, new challenges emerge, especially thermal management, which is already a very demanding challenge in current 2D SoCs. With 3D SoCs this problem is exacerbated by reliability issues such as the temperature sensitivity of DRAMs: the retention time of a DRAM cell decreases strongly with increasing temperature. In this paper, we show a holistic cross-layer reliability approach for efficient reliability management, starting from measuring and modeling DRAM retention errors and finally leading to optimizations for specific applications. These optimizations exploit the data lifetime and the inherent error resilience of the application, which is given, for instance, by the probabilistic behavior of wireless communications.

11.
Carry checking/parity prediction adders and ALUs
In this paper, we present efficient self-checking implementations valid for all existing adder and arithmetic and logic unit (ALU) schemes (e.g., ripple carry, carry lookahead, and skip carry schemes). Among all known self-checking adder and ALU designs, the parity prediction scheme has the advantage of requiring the minimum hardware overhead for the adder/ALU and the minimum hardware overhead for the other data-path blocks. It also has the advantage of being compatible with memory systems checked by parity codes. The drawback of this scheme is that it is not fault-secure for single faults. The scheme proposed in this work has all the advantages of the parity prediction scheme; in addition, the new scheme is totally self-checking for single faults. Thus, the new scheme is substantially better than any other known solution.
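The parity prediction principle the abstract refers to can be stated in one equation: since each sum bit is s_i = a_i XOR b_i XOR c_i, the parity of the sum equals parity(A) XOR parity(B) XOR the parity of the carry vector. The C model below demonstrates that relation on a 32-bit adder; it illustrates the principle only and is not the paper's self-checking circuit.

```c
#include <stdint.h>
#include <stdbool.h>

static unsigned parity(uint32_t x)
{
    x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return x & 1u;
}

/* Add a and b with carry-in, and check that the predicted sum parity
 * parity(a) ^ parity(b) ^ parity(carries) matches the observed sum parity. */
bool parity_checked_add(uint32_t a, uint32_t b, unsigned cin, uint32_t *sum)
{
    uint32_t carries = 0, c = cin & 1u;
    for (int i = 0; i < 32; i++) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        carries |= c << i;                       /* carry into bit i            */
        c = (ai & bi) | (ai & c) | (bi & c);     /* carry out of bit i          */
    }
    *sum = a + b + (cin & 1u);
    unsigned predicted = parity(a) ^ parity(b) ^ parity(carries);
    return predicted == parity(*sum);            /* a mismatch would flag a fault */
}
```

In hardware the carry signals already exist inside the adder, so only parity trees and a comparator are added, and a disagreement between predicted and observed sum parity signals a fault.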

12.
This paper describes a heterogeneous multi-core processor (HMCP) architecture that integrates general-purpose processors (CPUs) and accelerators (ACCs) to achieve exceptional performance as well as low power consumption for the SoCs of embedded systems. The memory architectures of the CPUs and ACCs were unified to improve programming and compiling efficiency. As a preliminary evaluation of the HMCP architecture, Advanced Audio Coding Low Complexity (AAC-LC) stereo audio encoding was parallelized on a heterogeneous multi-core with homogeneous processor cores and dynamically reconfigurable processor (DRP) accelerator cores. The performance evaluation revealed that 54× AAC encoding was achieved on the chip with two CPUs at 600 MHz and two DRPs at 300 MHz, which allows an entire CD to be encoded within 1-2 minutes.

13.
In multimedia applications, the stringent requirements of balancing transmission capacity, flexible service provisioning and cost reduction lead manufacturers to provide highly integrated System-on-Chip (SoC) solutions. This paper analyzes the application of high-bandwidth networking SoCs to improve the cost efficiency of multimedia service distribution in home networks. We present a case study in which we utilize the inherent protocol processing capabilities and high-bandwidth interfaces of a modern network processor, scaled down to match the performance targets and low-cost requirements of the home networking environment. An efficient, low-cost Residential Gateway architecture results from mapping the home services onto the processing and memory blocks of this SoC.

14.
张西  吴甜 《电子工程师》2007,33(7):33-36
The Champ-av3 is a powerful digital signal processing board with high computation speed and good real-time behavior, making it well suited to radar, sonar, artificial intelligence and similar applications. This paper first introduces the hardware structure of the Champ-av3 board and its various interfaces, as well as the different types of memory chips and the way the on-board memory space is allocated, pointing out the particularities of this memory organization and the precautions to take when using it. It then describes the software functions, lists several important software support packages and programming considerations, discusses the cooperative operation of multiple modules and the estimation of program execution time, and finally gives an application example.

15.
The design of a unit which handles the functions of line scanning, digit forming, and digit pulsing fully independently of the main processor is presented. As implemented in conventional systems, these functions form a major portion of the real-time load on the processor. The hardware unit described in this paper takes care of all these functions with only nominal interaction with the processor, thereby considerably reducing the real-time load. In this unit, one very high-speed logic assembly is shared by a large number of lines with the help of memory and timing circuits. The same hardware, with only minor modifications, can also be used for telex applications. A typical unit for 1000 lines requires about 300 ICs, including the processor interface.

16.
Novel circuits and a design methodology for a massively parallel processor based on the matrix architecture are introduced. A fine-grained processing element (PE) circuit for high-throughput MAC operations based on Booth's algorithm enhances the performance of a 16-bit fixed-point signed MAC, which operates at up to 30.0 GOPS/W. Dedicated I/O interface circuits are designed to convert the direction of data access and to support the interleaved memory architecture, and they are implemented to maximize processor core efficiency. Power management techniques for suppressing current peaks and reducing average power consumption are proposed to enhance the robustness of the macro. The circuits and design methodology proposed in this paper are attractive for realizing a high-performance and robust massively parallel SIMD processor core for multimedia SoCs.
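As a reminder of the recoding that such a Booth-based MAC implements, the C model below performs a 16-bit signed multiply-accumulate using radix-2 Booth digits. It is a software sketch only: the paper's PE presumably uses a higher-radix hardware variant, and the 64-bit software accumulator is an assumption.

```c
#include <stdint.h>

/* Booth-recoded multiply-accumulate: each pair of multiplier bits
 * (b_i, b_{i-1}) is recoded into a digit d = b_{i-1} - b_i, and the
 * product is the sum of d * A * 2^i (a shift-and-add in hardware). */
int64_t booth_mac(int64_t acc, int16_t a, int16_t b)
{
    int64_t product = 0;
    int prev = 0;                                    /* implicit bit b_{-1} = 0 */

    for (int i = 0; i < 16; i++) {
        int cur = ((uint16_t)b >> i) & 1;
        int64_t weighted = (int64_t)a * ((int64_t)1 << i);  /* A * 2^i */
        if (cur == 0 && prev == 1)                   /* digit +1 */
            product += weighted;
        else if (cur == 1 && prev == 0)              /* digit -1 */
            product -= weighted;
        /* (0,0) and (1,1) bit pairs contribute nothing */
        prev = cur;
    }
    return acc + product;                            /* accumulate into the MAC register */
}
```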

17.
A new type of multimedia playback terminal is designed around the dedicated multimedia processor AMD Alchemy Au1200; it can play multimedia files in a variety of formats. The hardware mainly consists of the processor, memory, boot ROM, hard disk, LCD screen and peripheral expansion interfaces; the software consists of the operating system, the graphics system and the application programs. This solution provides a powerful, intelligent playback terminal for networked video advertising systems.

18.
A direct method for the computation of the 2-D DCT/IDCT on a linear-array architecture is presented. The 2-D DCT/IDCT is first converted into its corresponding 1-D DCT/IDCT problem through proper input/output index reordering. Then, a new coefficient matrix factorisation is derived, leading to a cascade of several basic computation blocks. Unlike other previously proposed high-speed 2-D N × N DCT/IDCT processors, which usually require intermediate transpose memory and have computation complexity O(N³), the proposed hardware-efficient architecture with distributed memory structure has computation complexity O(N² log₂ N) and requires only log₂ N multipliers. The new pipelinable and scalable 2-D DCT/IDCT processor uses storage elements local to the processing elements and thus does not require any address generation hardware or global memory-to-array routing.

19.
We propose a technique for reducing the energy spent in the memory-processor interface of an embedded system during the execution of firmware code. The method is based on the idea of compressing the most commonly executed instructions so as to reduce the energy dissipated during memory access. Instruction decompression is performed on the fly by a hardware block located between the processor and the memory: no changes to the processor architecture are required. Hence, our technique is well suited for systems employing IP cores whose internal architecture cannot be modified. We describe a number of decompression schemes and architectures that effectively trade hardware complexity and static code size increase for memory energy and bandwidth reduction, as shown by the experimental data we have collected by executing several test programs on different design templates.
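A dictionary-based scheme is one simple instance of this idea. The sketch below uses a hypothetical format (a 255-entry dictionary of the most frequently executed 32-bit instructions, a one-byte index per compressed instruction, and a 0xFF escape for uncompressed ones); it is not one of the specific schemes evaluated in the paper.

```c
#include <stdint.h>
#include <stddef.h>

enum { DICT_SIZE = 255, RAW_MARKER = 0xFF };

typedef struct {
    uint32_t dict[DICT_SIZE];   /* most frequently executed instructions */
} decompressor_t;

/* Decompress one instruction starting at code[*pos]; returns the 32-bit
 * instruction and advances *pos past the consumed bytes. */
uint32_t fetch_decompressed(const decompressor_t *d, const uint8_t *code, size_t *pos)
{
    uint8_t tag = code[(*pos)++];
    if (tag != RAW_MARKER)
        return d->dict[tag];                   /* 1 byte fetched instead of 4 */
    uint32_t insn = 0;                         /* raw instruction follows the marker */
    for (int i = 0; i < 4; i++)
        insn |= (uint32_t)code[(*pos)++] << (8 * i);
    return insn;
}
```

Because frequently executed instructions are fetched as a single byte instead of four, both the number of memory accesses and the switching activity on the memory bus drop, which is where the energy saving comes from.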

20.
This paper presents a layout-conscious approach to hardware/software codesign of systems-on-chip (SoCs) optimized for latency, including an original algorithm for bus architecture synthesis. Compared to similar work, the method addresses layout-related issues that affect system optimization, such as the dependency of task communication speed on interconnect parasitics. The codesign flow executes three consecutive steps: 1) combined partitioning and scheduling: besides partitioning and scheduling, this step also identifies the minimum speed constraints for each data link; 2) IP core placement, bus architecture synthesis, and routing: IP cores are placed using a hierarchical cluster-growth algorithm; bus architecture synthesis identifies a set of possible building blocks and then assembles them to minimize bus length and complexity; poor solutions are pruned using a special table structure and select-eliminated method; and 3) rescheduling for the best bus architecture. The paper offers extensive experiments for the proposed codesign method, including bus architecture synthesis for a network processor and a JPEG SoC.
