Similar literature
1.
An energy- and area-efficient cloud-connected software execution architecture for IoT sensor processors is proposed. A remotely installed sensor device, such as an environmental activity monitor, is commonly implemented with a conventional embedded processor that provides only fixed services, with statically compiled embedded software stored in on-chip flash memory. Instead of conventional on-chip flash memory for the instruction code area, we adopt a virtually mapped internal memory concept to realize cloud-connected software execution, in which the remote storage area reached via the IoT platform is indirectly mapped onto the physical address space of the instruction memory using a dynamic address translation technique. The proposed architecture enables on-demand code execution: instructions are fetched from the cloud-side remote storage area at runtime instead of over a directly connected on-chip instruction bus. This storage-less approach can reduce the high access current and large chip area overhead by eliminating the on-chip code flash memory. To reduce the access current overhead of retrieving a requested instruction, a small RAM scratch pad retains hot-spot instruction code and is filled early with pre-estimated instruction sectors. The experimental results show that the proposed technique reduces the energy consumption and packet delay of an IoT device executing remote embedded software, and reduces chip area by realizing a storage-less sensor architecture.
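The on-demand fetch path described in this abstract can be pictured with a small model. The following is a minimal sketch, not the authors' design: the sector size, the least-recently-used eviction policy, and the names `RemoteCodeStore` and `ScratchPad` are all assumptions made for illustration.

```python
from collections import OrderedDict

SECTOR_SIZE = 64  # bytes per instruction sector (assumed value)


class RemoteCodeStore:
    """Hypothetical stand-in for the cloud-side storage; counts remote fetches."""

    def __init__(self, image: bytes):
        self.image = image
        self.fetches = 0

    def read_sector(self, sector: int) -> bytes:
        self.fetches += 1
        start = sector * SECTOR_SIZE
        return self.image[start:start + SECTOR_SIZE]


class ScratchPad:
    """Small RAM buffer holding recently used sectors (LRU policy is an assumption)."""

    def __init__(self, store: RemoteCodeStore, capacity_sectors: int, prefill=()):
        self.store = store
        self.capacity = capacity_sectors
        self.sectors = OrderedDict()
        for sector in prefill:  # early fill with pre-estimated sectors
            self._load(sector)

    def _load(self, sector: int) -> None:
        if len(self.sectors) >= self.capacity:
            self.sectors.popitem(last=False)  # evict the least recently used sector
        self.sectors[sector] = self.store.read_sector(sector)

    def fetch(self, address: int) -> int:
        """Translate an instruction address to (sector, offset) and return the byte."""
        sector, offset = divmod(address, SECTOR_SIZE)
        if sector in self.sectors:
            self.sectors.move_to_end(sector)  # mark as recently used
        else:
            self._load(sector)  # miss: fetch the sector from remote storage
        return self.sectors[sector][offset]
```

In this model, a hit in the scratch pad costs no remote access, so the `fetches` counter gives a rough proxy for the packet traffic that the abstract's pre-filling is meant to reduce.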

2.
Memory is a key parameter in embedded systems, since both the code complexity of embedded applications and the amount of data they process are increasing. While the memory capacity of embedded systems is continuously increasing, the increases in application complexity and dataset sizes are far greater. As a consequence, the memory space demand of code and data should be kept to a minimum. To reduce the memory space consumption of embedded systems, this paper proposes a control flow graph (CFG) based technique. Specifically, it tracks the lifetime of instructions at the basic block level. Based on the CFG analysis, if a basic block is known not to be accessed in the rest of the program execution, the instruction memory space allocated to this basic block is reclaimed. If the memory allocated to a basic block cannot be reclaimed, we try to compress the block. In this way, the available on-chip memory can be used effectively, satisfying most instruction/data requests from on-chip memory. Our experiments with this framework show that it outperforms previously proposed CFG-based memory reduction approaches.
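The reclamation condition in this abstract amounts to a reachability question on the CFG. The following is a minimal sketch under stated assumptions: the CFG is a dictionary mapping every basic block to its successors, and the function names are hypothetical, not taken from the paper.

```python
def reachable_from(cfg: dict, start: str) -> set:
    """Return every basic block reachable from `start` (including `start`)."""
    seen = set()
    stack = [start]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(cfg.get(block, ()))
    return seen


def reclaimable_blocks(cfg: dict, current: str) -> set:
    """Blocks that can no longer execute once control has reached `current`.

    Assumes every block appears as a key of `cfg`.
    """
    return set(cfg) - reachable_from(cfg, current)


# Example CFG: entry -> init -> loop (self-loop) -> exit
example_cfg = {
    "entry": ["init"],
    "init": ["loop"],
    "loop": ["loop", "exit"],
    "exit": [],
}
```

With `example_cfg`, once execution is inside `loop`, the blocks `entry` and `init` are unreachable and their instruction memory could be reclaimed; blocks that remain reachable would be candidates for compression instead.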

3.
Current superscalar architectures depend strongly on an instruction issue queue to achieve multiple instruction issue and out-of-order execution. However, the issue queue requires a centralized structure and relies on global broadcast operations to wake up and select instructions. A large issue queue therefore results in a low clock rate and high circuit complexity. In other words, the increasing demand for a larger issue queue imposes a significant burden on achieving a higher clock speed. This paper discusses Speculative Pre-Execution Assisted by compileR (SPEAR), a low-complexity issue queue design. SPEAR is designed to manage a small-window superscalar architecture more efficiently without increasing the window size. To this end, we first recognize that long memory latency is one of the factors that demand a large window, and we aim to execute miss-causing load instructions early using another hierarchy of issue queue. We pre-execute those miss-causing instructions speculatively as an additional prefetching thread. Simulation results show that the SPEAR design achieves performance comparable to, or even better than, that of superscalar architectures with a large issue queue. Because SPEAR uses smaller issue queues, it can be implemented with low hardware complexity and a high clock speed.

4.
The static specification of parallel operations with No Operations (NOPs) is another cause of increased code size in VLIW architectures. Several alternatives in instruction encoding and the memory subsystem have been proposed to minimize the impact of NOPs on code size. One is the compressed cache, which uses a packed encoding scheme; the other is the decompressed cache, which uses an unpacked encoding scheme. The compressed cache shows high memory utilization but increases the pipeline branch penalty because it requires very complex fetch hardware. In contrast, the decompressed cache lowers fetch overhead because the unpacked encoding scheme allows an instruction to be issued to the pipeline without any recovery process. Its shortcoming is that memory utilization deteriorates, because memory is allocated irrespective of the number of useful operations. In this research, a new instruction encoding scheme, called the semi-packed encoding scheme, and the section cache, which enables effective storage and retrieval of semi-packed instructions, are proposed. Through the partially fixed length of an instruction, this decreases both the hardware complexity of fetching an instruction and the memory space wasted on NOPs. The experimental results reveal that memory utilization in the section cache is 3.4 times higher than in the decompressed cache. The memory subsystem using the section cache provides about a 15% performance improvement with a moderate chip area.
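The trade-off between the encodings can be made concrete by counting stored operation slots. The following is a rough sketch only: the abstract does not give the semi-packed format, so the `granule` rounding used here is a hypothetical stand-in for "partially fixed length", and all function names are illustrative.

```python
import math


def unpacked_slots(useful_ops: list, width: int) -> int:
    """Unpacked encoding: every bundle stores `width` slots, NOPs included."""
    return len(useful_ops) * width


def packed_slots(useful_ops: list) -> int:
    """Packed encoding: only useful operations are stored."""
    return sum(useful_ops)


def semi_packed_slots(useful_ops: list, granule: int) -> int:
    """Hypothetical semi-packed encoding: each bundle is rounded up to a
    multiple of `granule` slots (an assumption, not the paper's format)."""
    return sum(max(1, math.ceil(n / granule)) * granule for n in useful_ops)


def utilization(useful_ops: list, stored_slots: int) -> float:
    """Fraction of stored slots that hold useful operations."""
    return sum(useful_ops) / stored_slots


# Number of useful operations in four consecutive 8-wide bundles (made-up data).
bundles = [1, 2, 8, 3]
```

For these made-up bundles, the unpacked form stores 32 slots, the packed form 14, and the assumed semi-packed form with `granule=4` stores 20, which places its utilization between the two extremes while keeping bundle lengths to a small set of fixed sizes.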

5.
Hitachi's SH series microprocessors feature a 32-bit RISC architecture with a 16-bit, fixed-length instruction set. We describe SH3, a pipelined implementation of the SH architecture with on-chip cache, MMU, and software-programmable power management. Its higher code density and the corresponding improvement in instruction-fetch latency lead to higher performance than typical 32-bit RISC architectures achieve. These features, together with a small die size and low power consumption, make SH3 an ideal microprocessor for portable computing systems and multimedia systems.

6.
In embedded systems, small code size is important because of memory constraints. One technique for achieving small code size is to reduce the instruction encoding from 32 bits to 16 bits, as in the ARM THUMB and MIPS-16 architectures. This half-size encoding leads to shorter register operand fields, making fewer registers available for register allocation and causing more spills, although invisible registers can be used as spill locations via copies. We propose reorganizing the original register file into dual banks, adding a bank toggle instruction for bank changes and inter-bank copies between the banks. We also propose an efficient dual-bank register allocation technique, based on regions in the code, to reduce spills. As a case study, we applied our banked register allocation model to the THUMB architecture. We found that the code size decreases by as much as 8% (5.8% on average), while performance improves by as much as 11.1% (3.3% on average). Our results indicate that, for an embedded CPU that provides a reduced encoding, it is better to organize the register file into dual banks for higher-quality register allocation than to use the invisible registers for spills.

7.
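李勇, 胡慧俐, 杨焕荣. 《计算机应用》 (Journal of Computer Applications), 2014, 34(4): 1005-1009
Loops account for a large share of execution time in digital signal processing software. Holding loop code in an instruction buffer reduces the number of program memory accesses and improves processor performance. A buffer supporting loop instructions is added to the instruction pipeline of a VLIW processor; it caches the loop's instructions and dispatches them to the functional units in software-pipelined form. The loop code is thus fetched from memory once and executed many times, which greatly reduces the number of memory accesses. While loop instructions are running, the buffer signals the program memory to enter a sleep state, which lowers processor power consumption. Tests on typical applications show that with the loop buffer the idle rate of the instruction fetch pipeline can exceed 90%, overall processor performance improves by about 10%, and the hardware area overhead of the loop buffer is about 9% of the fetch pipeline.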
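The "fetch once, execute many times" effect of a loop buffer can be checked with simple arithmetic. The following is a minimal sketch with assumed parameters; the fallback for loops larger than the buffer is an assumption, and the numbers are illustrative rather than taken from the paper.

```python
def fetches_without_buffer(body_len: int, iterations: int) -> int:
    """Every iteration re-fetches the whole loop body from program memory."""
    return body_len * iterations


def fetches_with_buffer(body_len: int, iterations: int, buffer_capacity: int) -> int:
    """The loop body is fetched once if it fits in the buffer.

    Loops larger than the buffer are assumed to bypass it entirely.
    """
    if body_len <= buffer_capacity:
        return body_len
    return body_len * iterations


def fetch_reduction(body_len: int, iterations: int, buffer_capacity: int) -> float:
    """Fraction of program memory fetches avoided by the loop buffer."""
    baseline = fetches_without_buffer(body_len, iterations)
    buffered = fetches_with_buffer(body_len, iterations, buffer_capacity)
    return 1 - buffered / baseline
```

For a 16-instruction loop body that runs 100 times with a 32-entry buffer, the model gives 16 fetches instead of 1600. During the remaining iterations the program memory is not accessed, which is the window in which it could be put to sleep.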

8.
Online periodic testing of microprocessors is a valuable means of increasing the reliability of a low-cost system when neither hardware-redundant nor time-redundant protection schemes can be applied. This is particularly true for floating-point (FP) units, which are becoming more common in embedded systems and are usually protected from operational faults through costly hardware-redundant approaches. In this paper, we present scalable instruction-based self-test program development for both single and double precision FP units, considering different instruction sets (MIPS, PowerPC, and Alpha), different microprocessor architectures (32-bit and 64-bit) and different memory configurations. Moreover, we introduce bit-level manipulation instruction sequences that are essential for the development of self-test programs for FP units. We developed self-test programs for single and double precision FP units on 32-bit and 64-bit microprocessor architectures and evaluated them against the requirements of low-cost online periodic self-testing: fault coverage, memory footprint, execution time, and power consumption, assuming different memory hierarchy configurations. Our comprehensive experimental evaluations reveal that the instruction set architecture plays a significant role in the development of self-test programs. Additionally, we suggest the most suitable self-test program development approach when memory footprint or low power consumption is of paramount importance.

9.
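With the coming of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architecture has a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism in loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes the iterations of a loop flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, which reduces the complexity of the computing kernel and lays a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. With these two techniques, the average number of instructions ready to execute per cycle is increased, keeping the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques is acceptable.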

10.
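A new architecture concept: virtual registers and parallel instruction processing units   Total citations: 4 (self-citations: 1, citations by others: 3)
As programs' demand for address space has grown, researchers proposed the concept of virtual memory, which frees the address space a program accesses from the limits of physical memory. With the development of register-oriented RISC technology and the growing importance of instruction scheduling in multiple-issue architectures, we propose the new concept of virtual registers, which frees the register space from the size of the physical register file, benefits instruction scheduling and register renaming, and increases instruction-level parallelism (ILP). In addition, modern RISC processors all concentrate on increasing execution parallelism in the data processing units and neglect the processing of instructions held in memory.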

11.
Extensive research has been done on extracting parallelism from single instruction stream processors. This paper presents our investigation into ways to modify MIMD architectures to allow them to extract the instruction level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from performing ILP operation. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.

12.
Journal of Systems Architecture, 1999, 45(12-13): 1001-1022
Commodity microprocessors contain more on-chip memory with each successive generation, and will contain tens of megabytes within the decade. We describe a novel architecture that runs an unmodified uniprocessor program across multiple nodes, each of which contains a processor tightly integrated with a sizable memory. The execution of instructions is replicated, while the access of operands is distributed across the nodes. Each node accesses operands in its fast local memory and broadcasts them to the other nodes. This architecture exploits out-of-order execution, and the fact that each chip integrates processor and memory, to run memory-intensive, hard-to-parallelize programs more efficiently. In this paper, we describe an implementation with specific solutions to the unique problems that this architecture poses. Finally, we compare simulation results for our implementation with more traditional equivalent systems. In our simulated implementation, five unmodified SPEC95 binaries ran, in most cases, considerably faster than on systems with more traditional memory systems.

13.
Hard real-time systems demand high performance in combination with timing-predictable program execution. The worst-case performance of a system, represented by its worst-case execution time (WCET), depends heavily on the design of the memory subsystem. In this paper we focus on the instruction memory hierarchy and quantify the impact of different on-chip instruction memories on the worst-case timing of the system. A function-based dynamic instruction scratchpad (D-ISP), an instruction cache, and static instruction scratchpads using basic-block-based and function-based assignment algorithms are compared. For this comparison, we provide WCET bounds for systems with different on-chip instruction memories and different off-chip memory timings. We show that for small memory sizes a static instruction scratchpad usually outperforms the other memories in terms of the WCET estimate. With increasing memory sizes, however, the D-ISP is able to reach lower WCET bounds. An instruction cache provides lower WCET bounds than the other memories only if no suitable assignment for the static instruction scratchpads is found, or if the D-ISP suffers from thrashing or frequently loads unused code.

14.
Matching an application to an architecture in structure and size is a way of achieving higher computation speed. This paper presents a combination of a compiler and a reconfigurable long instruction word (RLIW) architecture as an approach to the matching problem. Configurations suitable for the execution of different parts of a program are determined by a compiler, and code is generated both for reconfiguring the hardware and for performing the computation. The RLIW machine, consisting of multiple processing modules and global data memory modules, effectively utilizes the fine-grained parallelism detected in programs by a compiler. The long word instructions control the operation of the processing and memory modules in the system. To reduce data transfer between processing modules and data memory modules, we provide reconfigurable interconnections among the processing modules which permit direct communication. The compiler uses new techniques, including region scheduling, generation of code for reconfiguration of the system, and memory allocation techniques, to achieve improved performance. Algorithms for packing operations into long word instructions, and techniques for effectively assigning memory modules to the operands required by an instruction, are developed. Results of the experiments to evaluate the system indicate that speedups of 60–300% can be obtained for both scientific and nonscientific programs. The reconfigurable architecture is responsible for much of the speedup. The results also indicate that the memory bottleneck, a major problem in designing parallel systems, is successfully attacked. This paper represents work done while the author was at the University of Pittsburgh.

15.
Simulation results are presented using the hardware-implemented, trace-based dynamic instruction scheduler of our single-process DTSVLIW architecture to schedule instructions from several processes into multiple streams of VLIW instructions for execution by a wide-issue, simultaneous multi-threading (SMT) execution engine. The scheduling process involves single-instruction execution of each process, with executed instructions dynamically scheduled into blocks of VLIW instructions that are cached for subsequent SMT execution. SMT provides a mechanism to reduce the impact of horizontal and vertical waste, and of variable memory latencies, seen in the DTSVLIW. Preliminary experiments explore this extended model. Results show PE utilization of up to 87% on a 4-thread, 1-scalar, 8-PE design, with speed-ups of up to 6.3 times that of a single processor. Notably, only a single scalar process needs to be scheduled at any time, with main memory fetches being 1–4% of those of a single processor.

16.
Custom instructions can improve the execution speed and code compression of embedded applications. However, more efficient custom instructions need a higher number of simultaneous register file accesses, and larger register files are more power hungry and require complex forwarding interconnects. Because of the limited ports of the base processor's register file, the size and efficiency of custom instructions are therefore generally limited. Recent research has focused on overcoming this limitation with innovative architectural techniques supplemented by customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available register file data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose register file accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.

17.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Increasing the cache hit rate can therefore effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced cache-access power consumption by developing a new cache architecture, known as a single linked cache (SLC), that stores frequently executed instructions. By adding a new link field, SLC combines the low power consumption and low access delay of a direct-mapped cache with a high cache hit rate similar to that of a two-way set-associative cache. In addition, we developed another design, known as multiple linked caches (MLC), to further reduce the power consumption of each cache access and to avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions, which reduces the power consumption of each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access memory directly to obtain the requested instruction if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.

18.
Silent Data Corruptions (SDCs), errors that escape detection methods, are critical for system designers because they may result in system failures. To catch SDCs, mechanisms should focus on the behavioural aspects of errors in addition to their physical location or error patterns. Protection codes such as parity, Hamming, and Reed-Solomon codes, which depend heavily on the physical location of data bits, are therefore not sufficient for detecting computing errors in processors. By characterizing data behaviour during program execution, we observed value locality in the destination register results for each instruction binary code (instruction opcode and operand codes). This locality exists not only in the results of each instruction, but also in the results of instructions at different memory locations that have the same binary code. As a result, an architecture called the Instruction Result History Table (IRHT) is presented, which is indexed by the instruction binary code. In the IRHT, a history of the values produced by the same instruction binary code is stored and used during each instruction execution cycle. Any mismatch between the values stored in the IRHT and those generated by the current execution indicates an SDC syndrome. To confirm an SDC with a high level of confidence, a second execution of the current instruction is issued, and this duplicate execution confirms whether an SDC occurred. If it did, a third execution of the instruction, combined with majority voting, frees the system of the SDC. Extensive simulations showed that up to 83.54% of SDCs are detectable with the help of this locality. Moreover, with a small IRHT of 16 kB, 80.66% of SDCs can be detected on average. Note that the presented method can detect errors that escape conventional detection mechanisms, so it can be used in conjunction with other conventional methods.
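The check, re-execute, and vote sequence in this abstract can be sketched in a few lines. This is an illustrative model only: the history depth, the replacement policy, the class name, and the use of a Python callable to stand for one execution of an instruction are all assumptions, not details from the paper.

```python
from collections import Counter


class InstructionResultHistoryTable:
    """Toy IRHT: recent results per instruction encoding (depth is assumed)."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self.history = {}      # instruction binary code -> list of recent results
        self.reexecutions = 0  # extra executions triggered by mismatches

    def execute(self, encoding: int, run):
        """Run one instruction; `run` is a callable performing a single execution."""
        result = run()
        seen = self.history.setdefault(encoding, [])
        if seen and result not in seen:
            # Mismatch with the stored history: possible SDC, so execute again.
            self.reexecutions += 1
            second = run()
            if second != result:
                # The two executions disagree: a third run and a majority vote.
                self.reexecutions += 1
                third = run()
                result = Counter([result, second, third]).most_common(1)[0][0]
        if result not in seen:
            seen.append(result)
            del seen[:-self.depth]  # keep only the most recent `depth` values
        return result
```

A result that matches the history costs no extra execution, which is why the value locality reported in the abstract matters: the more often results repeat, the fewer re-executions a fault-free run needs. A new but legitimate value still triggers one confirming re-execution in this model.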

19.
Although technology advances can pack more and more physical registers into processors, the number of architectural registers defined by instruction set architectures (ISAs) remains relatively small on most modern processors. Exposing more architectural registers to compilers and programmers can improve the effectiveness of compiler optimization and the quality of code. However, increasing the number of architectural registers by simply adding extra bits to the register fields of instructions expands the code size. A better way of exposing more ISA registers without significantly expanding the code size is therefore needed. This paper presents a new ISA called Floating Accumulator Architecture (FAA) that can expand the number of ISA registers without increasing the instruction length. Unlike the accumulator architecture, whose accumulator is a fixed, special register, FAA dynamically chooses a register from the general-purpose register file as the accumulator. In other words, the accumulator in FAA is an alias for some register in the register file at any instruction, and the alias relation can be dynamically updated by FAA at any program point. Since the accumulator implicitly stores the result, the destination register field can be omitted from FAA instructions, saving 3 to 5 bits per instruction. This freed instruction bit space can be used in two possible ways: doubling the number of ISA registers of modern 32-bit RISC processors, or maintaining the number of ISA registers for 16-bit instructions on embedded processors. This paper presents the result of using the freed bit space to double the number of ISA registers from 16 to 32 on ARM processors, and experimental results show that performance improves by 7.6% on average for MediaBench benchmarks.

20.
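田祖伟, 孙光. 《计算机科学》 (Computer Science), 2010, 37(5): 130-133
The large number of branch instructions in programs severely limits the ability of architectures and compilers to exploit parallelism. A major challenge in effectively exploiting instruction-level parallelism is overcoming the limits imposed by branch instructions. Predicated execution can effectively remove branches by converting branch instructions into predicated code, which widens the scope of instruction scheduling and eliminates the performance loss caused by branch mispredictions. This paper describes compiler optimization techniques based on predicated code, including instruction scheduling, software pipelining, register allocation, and instruction merging. An instruction scheduling algorithm based on predicated code is designed and implemented. Experiments show that compiler optimization of predicated code effectively increases instruction-level parallelism, shortens code execution time, and improves program performance.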
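Converting a branch into predicated code (if-conversion) can be illustrated with a toy interpreter. The following is a minimal sketch, not the scheduling algorithm of the paper: the instruction format `(predicate, destination, operation)` and the function names are assumptions, and the `if` inside the interpreter models hardware nullification of an instruction whose predicate is false, not a program branch.

```python
def abs_diff_branching(a: int, b: int) -> int:
    """Original form: control flow depends on a conditional branch."""
    if a > b:
        return a - b
    return b - a


# If-converted form: a straight-line sequence of guarded instructions.
# Each entry is (guarding predicate register or None, destination, operation).
ABS_DIFF_PREDICATED = [
    (None, "p", lambda r: r["a"] > r["b"]),   # p = (a > b)
    (None, "q", lambda r: not r["p"]),        # q = !p
    ("p", "d", lambda r: r["a"] - r["b"]),    # (p) d = a - b
    ("q", "d", lambda r: r["b"] - r["a"]),    # (q) d = b - a
]


def run_predicated(program: list, regs: dict) -> dict:
    """Execute every instruction in order; nullify those whose predicate is false."""
    for pred, dest, op in program:
        if pred is None or regs[pred]:
            regs[dest] = op(regs)
    return regs


def abs_diff_predicated(a: int, b: int) -> int:
    return run_predicated(ABS_DIFF_PREDICATED, {"a": a, "b": b})["d"]
```

Because the predicated sequence contains no branch, the two guarded subtractions are independent of each other and a scheduler is free to place them in the same issue cycle, which is the wider scheduling scope the abstract refers to.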
