Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
This paper presents instruction set architectural guidelines for improving general-purpose embedded processors so that they optimally accommodate packet-processing applications. Like other embedded processors such as media processors, packet-processing engines are deployed in embedded applications, where cost and power are as important as performance. In this domain, growing demands for higher bandwidth and performance, together with the ongoing development of new networking protocols and applications, call for flexible power- and performance-optimized engines. The instruction set architectural guidelines are extracted from an exhaustive simulation-based, profile-driven quantitative analysis of different packet-processing workloads on 32-bit versions of two well-known general-purpose processors, ARM and MIPS. This extensive study reveals the main performance challenges and tradeoffs in developing an evolution path for such general-purpose processors that optimally accommodates packet-processing functions for future switching-intensive applications. The architectural guidelines cover types of instructions, branch offset size, displacement and immediate addressing modes for memory access along with the effective size of these fields, data types of memory operations, and new branch instructions. The effectiveness of the proposed guidelines is evaluated by developing a retargetable compilation and simulation framework. By developing an HDL model of the optimized base processor for networking applications and using a logic synthesis tool, we show that enhanced area, power, delay, and power per watt measures are achieved.

2.
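The IXP2400 is Intel's second-generation network processor, used mainly to develop high-performance, scalable network devices. Based on the IXP2400 network processor, this paper analyzes two key problems that packet processing faces in a multithreaded environment: synchronization and packet ordering.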
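
The abstract above does not say how the paper solves packet ordering, so the following is only a generic sketch of one common approach, not the IXP2400 mechanism: each packet receives a sequence number on arrival, worker threads finish in any order, and a reorder buffer releases packets strictly in sequence. The class name `ReorderBuffer` and its methods are hypothetical, and a lock stands in for the synchronization problem.

```python
import heapq
import threading


class ReorderBuffer:
    """Releases processed packets in sequence order (illustrative sketch)."""

    def __init__(self):
        self._lock = threading.Lock()  # guards state shared between worker threads
        self._next_seq = 0             # next sequence number allowed to leave
        self._pending = []             # min-heap of (seq, packet); seq assumed unique

    def complete(self, seq, packet):
        """Record a finished packet and return the packets that may now leave."""
        with self._lock:
            heapq.heappush(self._pending, (seq, packet))
            released = []
            # Drain the heap while its smallest entry is the one we are waiting for.
            while self._pending and self._pending[0][0] == self._next_seq:
                released.append(heapq.heappop(self._pending)[1])
                self._next_seq += 1
            return released
```

A packet that finishes early is held back until every earlier packet has completed, so the output order matches the arrival order regardless of thread scheduling.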

3.
Custom instructions can improve the execution speed and code compression of embedded applications. However, more efficient custom instructions need a higher number of simultaneous register file accesses, and larger register files are more power hungry and require complex forwarding interconnects. Because the base processor's register file has a limited number of ports, the size and efficiency of custom instructions are generally limited. Recent research has focused on overcoming this limitation with innovative architectural techniques supplemented by customized compilation. However, to the best of our knowledge, few studies take the complete pipeline design and implementation considerations into account. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available register file data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose register file accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.

4.
The demand for high-performance embedded processors in multimedia mobile electronics is growing, and their power consumption increasingly threatens battery lifetime. It is usually believed that dynamic voltage and frequency scaling (DVFS) saves significant energy by changing the performance levels of processors on the fly to match the performance demands of applications. However, because the energy efficiency of embedded processors is rapidly improving, the effectiveness of DVFS is expected to change. In this paper, we analyze the benefit of DVFS in state-of-the-art mobile embedded platforms in comparison with servers and PCs. To obtain a clearer view of the relationship between power and performance, we develop a measurement methodology that synchronizes time series for power consumption with those for processor utilization. The results show that DVFS hardly improves the energy efficiency of mobile multimedia electronics, and can even significantly worsen energy efficiency and performance in some cases. Based on this observation, we suggest that power management for mobile electronics should concentrate on adaptive and intelligent power management for peripheral devices. As a preliminary design, we implement an adaptive network interface card (NIC) speed control that reduces power consumption by 10% when the NIC is not heavily used. Our results provide valuable insights into the design of power management schemes for future mobile embedded systems.

5.
Distributed embedded smart cameras for surveillance applications   Total citations: 3, self-citations: 0, citations by others: 3
Recent advances in computing, communication, and sensor technology are pushing the development of many new applications. This trend is especially evident in pervasive computing, sensor networks, and embedded systems. Smart cameras, one example of this innovation, are equipped with a high-performance onboard computing and communication infrastructure, combining video sensing, processing, and communications in a single embedded device. By providing access to many views through cooperation among individual cameras, networks of embedded cameras can potentially support more complex and challenging applications - including smart rooms, surveillance, tracking, and motion analysis - than a single camera. We designed our smart camera as a fully embedded system, focusing on power consumption, QoS management, and limited resources. The camera is a scalable, embedded, high-performance, multiprocessor platform consisting of a network processor and a variable number of digital signal processors (DSPs). Using the implemented software framework, our embedded cameras offer system-level services such as dynamic load distribution and task reconfiguration. In addition, we combined several smart cameras to form a distributed embedded surveillance system that supports cooperation and communication among cameras.

6.
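Ternary content-addressable memory (TCAM) is a variant of content-addressable memory (CAM). Compared with CAM, it makes lookups more flexible and greatly improves the efficiency of tasks such as packet processing. Starting from the principles and structural characteristics of TCAM, this article explains why TCAM is suited to improving network processor (NP) performance, and uses a design example to illustrate the methods and techniques for accelerating packet processing in a network processor with a TCAM coprocessor.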
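
To illustrate the ternary matching that makes TCAM lookups more flexible than CAM lookups, here is a minimal software sketch. It is not taken from the article: real TCAMs compare all entries in parallel in hardware, whereas this loop checks them one by one. The function name `tcam_lookup` and the `(value, mask, result)` entry layout are assumptions, with a mask bit of 0 meaning "don't care" and earlier entries having higher priority.

```python
def tcam_lookup(entries, key):
    """Return the result of the first entry whose cared-about bits match key.

    entries: list of (value, mask, result) tuples, highest priority first.
    A 0 bit in mask marks a "don't care" position (the third, ternary state).
    """
    for value, mask, result in entries:
        if (key & mask) == (value & mask):
            return result
    return None  # no entry matched


# Hypothetical forwarding table: longer prefixes are listed first so they win.
ROUTES = [
    (0xC0A80100, 0xFFFFFF00, "port1"),    # 192.168.1.0/24
    (0xC0A80000, 0xFFFF0000, "port2"),    # 192.168.0.0/16
    (0x00000000, 0x00000000, "default"),  # matches everything
]
```

With this table, `tcam_lookup(ROUTES, 0xC0A80105)` selects the /24 entry, while an address outside 192.168.0.0/16 falls through to the default entry.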

7.
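Based on an analysis of the characteristics of streaming media applications and the features of embedded systems, this paper proposes a high-performance I/O interface solution for the storage subsystem of a streaming media server running embedded Linux. The design optimizes memory management, DMA control, and the file system, and the paper reports performance tests and analysis of the system.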

8.
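As the access device of a system area network, the adapter has a critical influence, through its functionality and performance, on the performance of the whole cluster system. In view of developments in embedded technology, this paper presents the design of the DCNet system area network adapter for the Dawning 4000A supercomputer, based on the Intel IOP310 I/O processor. Building on the original embedded system, the adapter extends the local memory bus into a local bus for network interconnection, and the network interface unit is designed and implemented on this bus. The DCNet adapter not only achieves performance close to that of Myrinet, SCI, and QsNet adapters, but also shows that building a high-performance adapter by extending the memory bus of an embedded system with a network interface is effective and feasible.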

9.
Today's distributed and high-performance applications require high computational power and high communication performance. Recently, the computational power of commodity PCs has doubled about every 18 months. At the same time, network interconnects that provide very low latency and very high bandwidth are also emerging. This is a promising trend for building high-performance computing environments by clustering, that is, combining the computational power of commodity PCs with the communication performance of high-speed network interconnects. Several network interconnects provide low latency and high bandwidth. Traditionally, researchers have used simple microbenchmarks, such as latency and bandwidth tests, to characterize a network interconnect's communication performance. Later, they proposed more sophisticated models such as LogP. However, these tests and models focus on general parallel computing systems and do not address many features present in these emerging commercial interconnects. Another way to evaluate different network interconnects is to use real-world applications. However, real applications usually run on top of a middleware layer such as the message passing interface (MPI). Our results show that to gain more insight into the performance characteristics of these interconnects, it is important to go beyond simple tests such as those for latency and bandwidth. In the future, we plan to expand our microbenchmark suite to include more tests and more interconnects.

10.
Dolle, M.; Schlett, M. IEEE Micro, 1995, 15(5): 32-40
Applications in telecommunications or multimedia require a new generation of fast and flexible microprocessors. We present a 32-bit RISC microprocessor with extended functionality for digital signal processing that reduces overall system cost. Due to its optimized design with just 210,000 transistors, this low-cost, medium- to high-performance microprocessor is well suited for a wide range of embedded system applications.

11.
The energy consumption of parallel computers has become an obstacle to higher-performance systems. In this paper, we focus on power optimization of high-performance interconnection networks for MPI applications in high-performance parallel computers. In contrast to previous history-based work, we propose compiler-directed, power-aware on/off network links. During the execution of parallel applications, network links have idle intervals in which they still consume large amounts of energy. With on/off network links, the compiler first divides load-balanced MPI applications into communication intervals and computation intervals, and then inserts on/off instructions into the applications to switch the link state. To avoid the time overhead of state switching, we use a time estimation technique to analyze the computation time and insert the on instruction before the communication interval is reached. Results from simulations and experiments show that the proposed compiler-directed method can reduce the energy consumption of interconnection networks by 20% to 70%, at a cost of less than 1% in network latency and performance degradation.
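
The placement of the on instruction described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' algorithm: the function name `schedule_link` and the parameters `wakeup_us` (link wake-up latency) and `breakeven_us` (shortest idle time for which switching off is assumed to save energy) are hypothetical.

```python
def schedule_link(compute_time_us, wakeup_us, breakeven_us):
    """Decide where to place off/on instructions in one computation interval.

    compute_time_us: estimated length of the computation interval
    wakeup_us:       assumed time the link needs to become usable again
    breakeven_us:    assumed minimum idle time for which switching pays off

    Returns (off_at, on_at) as offsets from the start of the interval,
    or None when the interval is too short to switch the link off.
    """
    if compute_time_us < max(breakeven_us, wakeup_us):
        return None  # keep the link on; switching would cost more than it saves
    # Turn the link off immediately and back on early enough that the
    # wake-up latency is hidden before communication resumes.
    return (0.0, compute_time_us - wakeup_us)
```

Issuing the on instruction `wakeup_us` before the interval ends is what hides the switching overhead; if the time estimate is too short, the link wakes late and the overhead shows up as added latency.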

12.
To capitalize on multicore power, modern high-speed data transfer applications usually adopt a multi-threaded design and aggregate multiple network interfaces. However, NUMA introduces another dimension of complexity to these applications. In this paper, we undertake comprehensive experiments on real systems to illustrate the importance of NUMA-awareness to applications with intensive memory accesses and network I/O. Instead of simply attributing the NUMA effect to the physical layout, we provide an in-depth analysis of the underlying interactions inside hardware devices. We profile system performance by monitoring relevant hardware counters, and reveal how the NUMA penalty occurs during prefetch and cache synchronization. We then implement a thread mapping module in a bulk data transfer tool, BBCP, as a practical example of enabling NUMA-awareness. The enhanced application is evaluated on our high-performance testbed with storage area networks (SAN). Our experimental results show that the proposed NUMA optimizations can significantly improve BBCP's performance in memory-based tests with various contention levels and in realistic data transfers involving SAN-based storage.

13.
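IP classification algorithms are key to improving the performance of network devices, and a conflict-free rule set is the prerequisite and guarantee for classifying IP packets correctly. The Intel IXP1200 network processor offers powerful programmability and parallel packet-processing capability. This paper designs and implements a conflict-free multidimensional IP classification algorithm on the IXP1200 platform, so that packet forwarding in network devices remains correct and fast as the number of rules grows.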

14.
Nowadays, Java is used in all types of embedded devices. For these memory-constrained systems, the automatic dynamic memory manager (Garbage Collector or GC) has always been a key factor in Java Virtual Machine (JVM) performance. Moreover, in current embedded platforms, power consumption is becoming as important as performance. Thus, in this paper we present an exploration, from an energy viewpoint, of the different possibilities of memory hierarchies for high-performance embedded systems when used by state-of-the-art GCs. This is a starting point for a better understanding of the interactions between Java applications, the memory hierarchy, and the GC. We subsequently present two techniques to reduce energy consumption on Java-based embedded systems, both based on exploiting GC information. The first technique uses GC execution behavior to reduce leakage energy consumption by taking advantage of the low-power mode of actual multi-banked SDRAM memories; it is intended for generational collectors and can reduce SDRAM leakage by up to 50%. The second technique adds a software-controlled (scratchpad) memory that stores GC instructions under JVM control to reduce active energy consumption and improve the performance of the target embedded system; it is aimed at all kinds of garbage collectors. For this second technique we have experimented with two approaches for selecting the GC code to be stored in the scratchpad memory: one static and one dynamic. Our experimental results show that the proposed dynamic scratchpad management approach for GCs enables up to a 63% reduction in energy consumption and a 25% performance improvement during the collector phase, which means, in terms of JVM execution, a global reduction of 29% and 17% for energy and cycles, respectively. Overall, this work shows that the key to an efficient low-power implementation of Java Virtual Machines for high-performance embedded systems is the synergy between the GC choice, the tuning of the memory architecture, and the inclusion of power management schemes controlled by the JVM that exploit knowledge of GC behavior.

15.
As many-core embedded systems evolve from single-memory designs to systems-on-a-chip built around an on-chip network, implementing a cache coherence mechanism in large-scale many-core embedded systems becomes a technical challenge. Existing coherence mechanisms are difficult to scale beyond tens of cores because they require excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets. In this paper, we present a hardware-software synergistic design of a cache coherence mechanism that considers OS-level application allocation and hardware-level coherence operations. The proposed application-oriented sparse directory (AoSD) cooperates with a contiguous allocation algorithm to isolate cache coherence traffic and thereby reduce interference among applications. The proposed micro-architecture for sharer set representation is area-efficient; moreover, it can be configured dynamically to track a flexible and exact sharer set. We verify our design by analyzing the memory requirements of different cache organizations and by implementing it on the popular simulator Graphite to evaluate the improvement in cache coherence traffic. The results show that our design is both area-efficient and efficient, improving memory network performance by 11.74%–28.72%. They also indicate that our design can feasibly scale to embedded systems with thousands of cores.

16.
The fast-changing communications market requires high-performance yet flexible network-processing platforms. StepNP is an exploratory network processor simulation environment for exploring applications, multiprocessor network-processing architectures, and SoC tools. Supporting model interaction, instrumentation, and analysis, the platform lets R&D teams easily add new processors, coprocessors, and interconnects.

17.
Godson-3: A Scalable Multicore RISC Processor with x86 Emulation   Total citations: 2, self-citations: 0, citations by others: 2
The Godson-3 microprocessor aims at high-throughput server applications, high-performance scientific computing, and high-end embedded applications. It offers a scalable network on chip, hardware support for x86 emulation, and a reconfigurable architecture. The four-core Godson-3 chip is fabricated with 65-nm CMOS technology. Eight- and 16-core Godson-3 chips are in development.

18.
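A P2P Download Model for Embedded Systems*   Total citations: 1, self-citations: 0, citations by others: 1
This paper proposes an embedded BT download model based on the BitTorrent protocol. The model extends and applies the BitTorrent approach and, targeting fast BT networks, adapts it for embedded system applications. It effectively improves download speed and even allows multimedia video files to be downloaded quickly on an embedded system for playback.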

19.
The NEon system offers an integrated approach to architecting, operating, and managing network services. NEon uses policy rules defining the operation of individual network services and produces a unified set of rules that generic packet-processing engines enforce.

20.
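In human-computer interaction, understanding human emotion is one of the skills a computer needs in order to communicate with people. Facial expressions convey human emotion most directly, so facial expression recognition is indispensable when designing a human-machine interface for any real-world setting. In this paper, we propose a new real-time facial expression recognition framework for interactive computing environments. The paper makes two main contributions to research in this field. First, it proposes a new network structure and an AdaBoost-based parameter learning algorithm for embedded HMMs. Second, it applies this optimized embedded HMM to real-time facial expression recognition. Here the embedded HMM uses two-dimensional discrete cosine transform coefficients as observation vectors, unlike earlier embedded HMM methods that build observation vectors from pixel depth values. Because the algorithm adjusts both the network structure and the parameters of the embedded HMM, classification accuracy improves substantially. The system reduces the complexity of training and recognition, provides a more flexible framework, and can be used in real-time human-computer interaction software. Experimental results show that the method is an efficient approach to facial expression recognition.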
