首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 78 毫秒
1.
李勇  胡慧俐  杨焕荣 《计算机应用》2014,34(4):1005-1009
数字信号处理软件中循环程序在执行时间上占有很大比例,用指令缓冲器暂存循环代码可以减少程序存储器的访问次数,提高处理器性能。在VLIW处理器指令流水线中增加一个支持循环指令的缓冲器,该缓冲器能够缓存循环程序指令,并以软件流水的形式向功能部件派发循环程序指令。这样循环程序代码只需访存一次而执行多次,大大减少了访存次数。在循环指令运行期间,缓冲器发出信号使程序存储器进入睡眠状态可以降低处理器功耗。典型的应用程序测试表明,使用了循环缓冲后,取指流水线空闲率可达90%以上,处理器整体性能提高10%左右,而循环缓冲的硬件面积开销大约占取指流水线的9%。  相似文献   

2.
随着社会的快速发展,人们对嵌入式智能设备的依赖程度越来越高。嵌入式系统工作在体积受限的封闭环境中,运算部件、存储单元等相关元器件体积小集成度高,不同工作环境、不同使用频度对电子元器件的可靠性将产生重要影响。针对嵌入式系统工作过程中的动态可靠性,提出了面向系统动态可靠性的自适应目标代码生成方法。该方法借助于决策树学习算法,构建了系统可靠性评估模型;并以此为指导,设计了多路径目标代码生成方法。使得系统能够根据实际工作状态信息,自适应的最佳的执行路径,以避免系统资源使用的不均衡,提高各运算部件的可靠性。实验表明,该方法将程序对单个处理器最高使用率由80%以上降到了30%以内,将内存单元最大最小访问比例由157.3降到了15.4,有效均衡了各处理器核和内存单元的使用。  相似文献   

3.
The message-driven processor (MDP), a 36-b, 1.1-million transistor, VLSI microcomputer, specialized to operate efficiently in a multicomputer, is described. The MDP chip includes a processor, a 4096-word by 36-b memory, and a network port. An on-chip memory controller with error checking and correction (ECC) permits local memory to be expanded to one million words by adding external DRAM chips. The MDP incorporates primitive mechanisms for communication, synchronization, and naming which support most proposed parallel programming models. The MDP system architecture, instruction set architecture, network architecture, implementation, and software are discussed  相似文献   

4.
The complexity of today’s embedded applications increases with various requirements such as execution time, code size or power consumption. To satisfy these requirements for performance, efficient instruction set design is one of the important issues because an instruction customized for specific applications can make better performance than multiple instructions in aspect of fast execution time, decrease of code size, and low power consumption. Limited encoding space, however, does not allow adding application specific and complex instructions freely to the instruction set architecture. To resolve this problem, conventional architectures increases free space for encoding by trimming excessive bits required beyond the fixed word length. This approach however shows severe weakness in terms of the complexity of compiler, code size and execution time. In this paper, we propose a new instruction encoding scheme based on the dynamic implied addressing mode (DIAM) to resolve limited encoding space and side-effect by trimming. We report our two versions of architectures to support our DIAM-based approach. In the first version, we use a special on-chip memory to store extra encoding information. In the second version, we replace the memory by a small on-chip buffer along with a special instruction. We also suggest a code generation algorithm to fully utilize DIAM. In our experiment, the architecture augmented with DIAM shows about 8% code size reduction and 18% speed up on average, as compared to the basic architecture without DIAM.  相似文献   

5.
嵌入式处理器中访存部件的低功耗设计研究   总被引:2,自引:0,他引:2  
以“龙芯1号”处理器为研究对象,探讨了嵌入式处理器中访存部件的低功耗设计方法.通过对访存部件的结构、功耗以及关键路径进行分析,利用局部性原理,提出一种根据虚拟地址历史记录进行判断的方法,可以显著减少TLB和Cache对RAM块的访问次数,使得TLB部件功耗平均降低了28.1%,Cache部件功耗平均降低了54.3%,处理器总功耗平均降低了23.2%,而关键路径延时反而减少,处理器性能略有提高.  相似文献   

6.
《Journal of Systems Architecture》1999,45(12-13):1001-1022
Commodity microprocessors contain more on-chip memory with each successive generation, and will contain tens of megabytes within the decade. We describe a novel architecture that runs an unmodified uniprocessor program across multiple nodes, each of which contains a processor tightly integrated with a sizable memory. The execution of instructions is replicated, while the access of operands is distributed across the nodes. Each node accesses operands in its fast local memory and broadcasts them to the other nodes. This architecture exploits out-of-order execution and the fact that each chip has integrated processor and memory, to run memory-intensive, hard-to-parallelize programs more efficiently. In this paper, we describe an implementation with specific solutions to the unique problems that this architecture poses. Finally, we conclude by comparing simulation results of our implementation to more traditional equivalent systems. In our simulated implementation, five unmodified SPEC95 binaries ran – in most cases – considerably faster than in systems with more traditional memory systems.  相似文献   

7.
The on-chip memory performance of embedded systems directly affects the system designers' decision about how to allocate expensive silicon area. A novel memory architecture, flexible sequential and random access memory (FSRAM), is investigated for embedded systems. To realize sequential accesses, small “links”are added to each row in the RAM array to point to the next row to be prefetched. The potential cache pollution is ameliorated by a small sequential access buyer (SAB). To evaluate the architecture-level performance of FSRAM, we ran the Mediabench benchmark programs on a modified version of the SimpleScalar simulator. Our results show that the FSRAM improves the performance of a baseline processor with a 16KB data cache up to 55%, with an average of 9%; furthermore, the FSRAM reduces 53.1% of the data cache miss count on average due to its prefetching effect. We also designed RTL and SPICE models of the FSRAM, which show that the FSRAM significantly improves memory access time, while reducing power consumption, with negligible area overhead.  相似文献   

8.
The WEDSP32C high-performance, programmable digital signal processor supports 32-bit floating-point arithmetic and is upwardly compatible with its predecessor, the WEDSP32. Because it is implemented in 0.75-μm (effective channel length) CMOS technology, the second-generation device achieves high functional density with low power consumption. The DSP32C offers the following features: 25-Mflop operation; 16-Mb/s serial-input and serial-output ports; a 160-bit, parallel I/O port for control and data transfer; interrupt facilities; single-instruction μ-law and A-law data conversions; single-instruction conversions between integers and floating-point data; a byte-addressable, on-chip memory that is extendable off chip; direct memory access to and from internal and external memory via parallel and serial I/O ports; 16 Mbytes of address space; and IEEE Std. 754 floating-point format conversion. The authors describe the DSP32C's instruction set, architecture, and application development tools. The latter includes an assembler, a simulator, an optimizing C compiler, and special-purpose hardware  相似文献   

9.
随着处理器的快速发展,RISC-V的软件生态环境建设成为其在处理器市场中站稳脚跟的关键因素之一。二进制翻译是解决处理器二进制代码兼容性问题、为处理器生态环境建设获取时间成本的关键技术之一,但由于二进制翻译器难以以较低的功耗面积开销获得高效执行的二进制代码,使其无法广泛应用于嵌入式领域。针对二进制翻译器执行效率和功耗面积开销难以取得平衡的问题,采用硬件逻辑加速的方式处理ARMv7-M中条件执行指令、更新标志位指令以及桶形移位指令,并利用静态二进制翻译器对ARMv7-M程序进行IT Block分裂、地址重计算及指令映射后生成RISC-V二进制代码,以此支持ARMv7-M的各类指令。基于开源内核CV32E40P设计了一个支持ARMv7-M的处理器内核,结果表明,运行ARMv7-M程序的平均性能能够达到直接运行RISC-V程序性能的137%,与纯软件二进制翻译支持ARMv7-M相比,该处理器核运行ARMv7-M程序的性能提升了5.59倍。  相似文献   

10.
Memory is a key parameter in embedded systems since both code complexity of embedded applications and amount of data they process are increasing. While it is true that the memory capacity of embedded systems is continuously increasing, the increases in the application complexity and dataset sizes are far greater. As a consequence, the memory space demand of code and data should be kept minimum. To reduce the memory space consumption of embedded systems, this paper proposes a control flow graph (CFG) based technique. Specifically, it tracks the lifetime of instructions at the basic block level. Based on the CFG analysis, if a basic block is known to be not accessible in the rest of the program execution, the instruction memory space allocated to this basic block is reclaimed. On the other hand, if the memory allocated to this basic block cannot be reclaimed, we try to compress this basic block. This way, it is possible to effectively use the available on-chip memory, thereby satisfying most of instruction/data requests from the on-chip memory. Our experiments with this framework show that it outperforms the previously proposed CFG-based memory reduction approaches.  相似文献   

11.
Multithreaded technique is the developing trend of high performance processor. Memory consistency model is essential to the correctness, performance and complexity of multithreaded processor. The chip multithreaded consistency model adapting to multithreaded processor is proposed in this paper. The restriction imposed on memory event ordering by chip multithreaded consistency is presented and formalized. With the idea of critical cycle built by Wei-Wu Hu, we prove that the proposed chip multithreaded consistency model satisfies the criterion of correct execution of sequential consistency model. Chip multithreaded consistency model provides a way of achieving high performance compared with sequential consistency model and easures the compatibility of software that the execution result in multithreaded processor is the same as the execution result in uniprocessor. The implementation strategy of chip multithreaded consistency model in Godson-2 SMT processor is also proposed. Godson-2 SMT processor supports chip multithreaded consistency model correctly by exception scheme based on the sequential memory access queue of each thread.  相似文献   

12.
嵌入式系统通常使用闪存作为存储设备,嵌入式Linux下的MTD技术可以方便地访问Flash这样的MTD设备。文章介绍了Linux块设备驱动程序框架,详细分析了MTD设备驱动程序层次结构、核心功能模块和数据结构,最后以Motorola MPC860T开发板为例,系统地给出了针对特定Flash的MTD驱动程序开发实例。  相似文献   

13.
Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance and energy efficiency of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market, significant non-recurring engineering and design costs are issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip-fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units which is multi-cycle with the capability of conditional execution. To generate more effective custom instructions, they are extended over basic blocks and hence, multiple-exits custom instruction and intuition behind it are introduced. Conditional execution capability has been added to the RFU to support the multi-exit feature of custom instructions. Because the proposed RFU has limitations on hardware resources (i.e., connections and processing elements), an integrated mapping-temporal partitioning framework is proposed to guarantee that the generated custom instructions can be mapped on the RFU (mappable custom instructions). Experimental results show that multi-exit custom instructions enhance the performance and energy efficiency by an average of 32% and 3% compared to custom instructions limited to one basic block, respectively. A maximum speedup of 4.9, compared to a single-issue embedded processor, and an average speedup of 1.9 was achieved on MiBench benchmark suite. The maximum and average energy saving are 56% and 22%, respectively. These performance and energy efficiency are obtained at the cost of 30% area overhead.  相似文献   

14.
Organization of the Motorola 88110 superscalar RISC microprocessor   总被引:1,自引:0,他引:1  
Diefendorff  K. Allen  M. 《Micro, IEEE》1992,12(2):40-63
Motorola's second-generation RISC microprocessor, which uses advanced techniques for exploiting instruction-level parallelism, including superscalar instruction issue, our-of-order instruction completion, speculative execution, dynamic instruction rescheduling, and two parallel, high-bandwidth, on-chip caches, is discussed. The microprocessor was designed to serve as the central processor in low-cost personal computers and workstations, and support demanding graphics and digital signal processing applications. The 88110's instruction set architecture, instruction sequencer, register files, execution units, address translation facilities, caches, and external bus interface are described  相似文献   

15.
Embedded systems are vulnerable to buffer overflow attacks. In this paper, we propose a hardware memory monitor module that aims to detect buffer overflow attacks by analyzing the security of an embedded processor at the instruction level. The functionality of the memory monitor module does not rely on the source code and can perform security check through dynamic methods. Compared with several existing countermeasures that protect only part of the program’s data space, our proposed memory monitor module can protect the program’s entire data space. The proposed memory monitor module has negligible performance overhead because it runs in parallel with the embedded processor. As demonstrated in an FPGA (Field Programmable Gate Array) based prototype, the experimental results show that our memory monitor module can effectively resist several types of buffer overflow attacks with approximately a 15% hardware cost overhead and only a 0.1% performance penalty.  相似文献   

16.
姚爱红  孙盟哲  吴剑 《计算机工程》2010,36(23):234-236,239
采用自顶向下方法,设计实现16位精简指令集计算机架构的嵌入式微处理器核HEUSoC 1,利用现场可编程门阵列片内的大量存储资源实现双端口存储器及零等待的指令和数据访问,从而保证指令的单周期执行。通过Verilog硬件描述语言实现微处理器核的RTL级描述,编写计算斐波那契数列的测试程序验证了HEUSoC 1的正确性。在Xilinx Spartan 2芯片上的统计结果表明,HEUSoC 1的资源占用率较低,处理器最高频率约为22 MHz,适合于对功耗和性价比要求严格的嵌入式应用领域。  相似文献   

17.
Warp processors are a novel architecture capable of autonomously optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. Previous research on warp processing focused on low-power embedded systems, incorporating a low-end ARM processor as the main software execution resource. We provide a thorough analysis of the scalability of warp processing by evaluating several possible warp processor implementations, from low-power to high-performance, and by evaluating the potential for parallel execution of the partitioned software and hardware. We further demonstrate that even considering a high-performance 1 GHz embedded processor, warp processing provides the equivalent performance of a 2.4 GHz processor. By further enabling parallel execution between the processes and FPGA, the parallel warp processor execution provides the equivalent performance of a 3.2 GHz processor.  相似文献   

18.
一种基于JTAG的嵌入式微处理器片上可调试系统   总被引:12,自引:1,他引:12  
文章提出了一种基于JTAG的嵌入式微处理器片上的可调试系统。该系统在JTAG工业标准的基础上,能够以较少的硬件开销支持指令/数据断点设置、单步执行、寄存器内容查看和设置、内存内容查看和设置、在线编程以及微处理器运行现场设置等调试功能。文章首先介绍了嵌入式微处理器可调试设计的原理,其次介绍了嵌入式微处理器的调试系统设计,最后给出调试实例分析。  相似文献   

19.
With reference to the typical hardware configuration of a sensor node, we present the architecture of a memory protection unit (MPU) designed as a low-complexity addition to the microcontroller. The MPU is aimed at supporting memory protection and the privileged execution mode. It is connected to the system buses, and is seen by the processor as a memory-mapped input/output device. The contents of the internal MPU registers specify the composition of the protection contexts of the running program in terms of access rights for the memory pages. The MPU generates a hardware interrupt to the processor when it detects a protection violation. The proposed MPU architecture is evaluated from a number of salient viewpoints, which include the distribution, review and revocation of access permissions, and the support for important memory protection paradigms, including hierarchical contexts and protection rings.  相似文献   

20.
Real-time systems are characterised by the fact that they have to meet a set of both functional and temporal requirements. Processor architectures have a significant impact on the predictability of software execution times and can add different sources of indeterminism depending on the features provided. The LEON processor family is the reference platform for space missions of the European Space Agency, with open-source implementations that are written in VHDL language. All versions of the LEON processors conform to the SPARC architecture Version 8. This architecture groups the general-purpose registers into windows to reduce memory transfer overhead in function calls. Unfortunately, this mechanism introduces indeterminism in software execution times at various levels. In this paper, we propose an extension to the original architecture that provides determinism for a configurable subset of tasks and interrupt service routines and eliminates the concurrency-related jitter, all this with a minimum cost in terms of FPGA resource utilisation. For the validation of the proposed solution, we have implemented the extension into the VHDL code of the LEON3 processor and modified the source code of the RTEMS operating system to make use of the new functionality.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号