Similar Articles
 Found 20 similar articles; search took 0 ms
1.
In this paper, we describe a procedure for memory design and exploration for low-power embedded systems. Our system consists of an on-chip instruction cache and data cache and a large off-chip memory. In the first step, we try to reduce the power consumed by memory traffic by applying memory-optimizing transformations such as loop transformations. Next, we use a memory exploration procedure to choose a cache configuration (cache size and line size) that satisfies the system requirements for area, number of cycles, and energy consumption. We include energy in the performance metrics because, across cache configurations, energy consumption varies quite differently from the number of cycles. The memory exploration procedure is very efficient because it exploits the trends in the cycle and energy characteristics to reduce the search space significantly.

2.
This paper studies how to parallelize emerging media mining workloads on existing small-scale multi-core processors and future large-scale platforms. Media mining is an emerging technology for extracting meaningful knowledge from large amounts of multimedia data, with the aim of helping end users search, browse, and manage that data. Many media mining applications are very complicated and require a huge amount of computing power. The advent of multi-core architectures provides an opportunity to accelerate media mining. However, to use multi-core processors efficiently, many threads must execute effectively at the same time. In this paper, we show how to exploit multi-core processors to speed up computation-intensive media mining applications. We first parallelize two media mining applications by extracting coarse-grained parallelism and evaluate their parallel speedups on a small-scale multi-core system. Our experiments show that coarse-grained parallelization scales well, but not perfectly. When examining the memory requirements, we find that these coarse-grained parallelized workloads exhibit high memory demand. Their working set sizes increase almost linearly with the degree of parallelism, and their instantaneous memory bandwidth usage prevents perfect scalability on the 8-core machine. To avoid the memory bandwidth bottleneck, we turn to fine-grained parallelism and evaluate the parallel performance on the 8-core machine and a simulated 64-core processor. Experimental data show that fine-grained parallelization has much lower memory requirements than the coarse-grained approach, but exhibits significant read-write data sharing. As a result, expensive inter-thread communication limits the parallel speedup on the 8-core machine, while excellent speedup is observed on the large-scale processor, where a shared cache provides fast core-to-core communication.
Our study suggests that (1) extracting coarse-grained parallelism scales well on small-scale platforms but poorly on large-scale systems; (2) exploiting fine-grained parallelism is suitable for realizing the power of large-scale platforms; and (3) future many-core chips can provide a shared cache and sufficient on-chip interconnect bandwidth to enable efficient inter-core communication for applications with significant amounts of shared data. In short, this work demonstrates that proper parallelization techniques are critical to the performance of multi-core processors. We also demonstrate that performance analysis is one of the important factors in parallelization. The parallelization principles, practice, and performance analysis methodology presented in this paper are also useful to anyone seeking to exploit thread-level parallelism in their applications.
Wenlong Li

3.
We explore the energy dissipation of the Linear Processor Array (LPA) as a function of the number of available resources (Processor Units, P) within the array. This number P is an important parameter, as it reflects performance, relates parallel processing to energy dissipation, and influences the scaling of the various parts of the LPA architecture (memory, address generator, communication network). To make it possible to compare the different design variants for a fixed datawidth, we propose a high-level energy dissipation model of the processor, based on a detailed analysis of a general convolution algorithm. It is shown that the energy dissipation of the LPA can roughly be described by the relationship $E_{\mathrm{total}} \propto N/P$, with N representing the datawidth in pixels. This relationship is derived from two observations: first, the largest contribution to $E_{\mathrm{total}}$ is the energy dissipated by the memories, and second, in our model of the LPA, the datawidth of the memories corresponds to the number of pixels N to be processed, which results in an increase of the access rate when P decreases. Furthermore, we show that the energy dissipation caused by communication within the LPA increases with the number of resources: the trade-off between communication and computation in parallel computing. This contribution turns out to be negligible in the total energy dissipation, and we therefore conclude that the optimum solution is found when the full number of resources is applied within the LPA.

4.
A methodological framework for performance estimation of multimedia signal processing applications on different implementation platforms is presented. The methodology derives a complexity profile that is characteristic of an application but completely platform-independent. By correlating the complexity profile with platform-specific data, performance estimation results for different platforms are obtained. The methodology is based on a reference software implementation of the targeted application but is, in contrast to approaches based on instruction-level profiling, fully independent of its degree of optimization. The proposed methodology is demonstrated using the example of an MPEG-4 Advanced Simple Profile (ASP) video decoder. Performance estimation results are presented for two different platforms, a specialized VLIW media processor and an embedded general-purpose RISC processor, showing the high accuracy of the methodology. The approach can be employed to assist in design decisions in the specification phase of new architectures, in the selection of a suitable target platform for a multimedia application, or in the optimization stage of a software implementation on a specific platform.

Hans-Joachim Stolberg received the Dipl.-Ing. degree in electrical engineering from the University of Hannover, Germany, in 1995. From 1995 to 1996, he worked at the NEC Information Technology Research Laboratories, Kawasaki, Japan, on efficient implementation of video compression algorithms. Since 1996, he has been with the Institute of Microelectronic Systems at the University of Hannover as a Research Assistant. During summer 2001, he was a Monbukagakusho Research Fellow at the Tokyo Institute of Technology, Japan. His current research interests include VLSI architectures for video signal processing, performance estimation of multimedia schemes, and profile-guided memory organization approaches for signal processing and multimedia applications.

Mladen Bereković received the Dipl.-Ing. degree in electrical engineering from the University of Hannover, Germany, in 1995. Since then he has been a Research Assistant with the Institute of Microelectronic Systems of the University of Hannover. His current research interests include VLSI architectures for video signal processing, MPEG-4, System-on-Chip (SOC) designs, and simultaneously multi-threaded (SMT) processor architectures.

Peter Pirsch received the Ing. grad. degree from the engineering college in Hannover, Germany, in 1966, and the Dipl.-Ing. and Dr.-Ing. degrees from the University of Hannover, in 1973 and 1979, respectively, all in electrical engineering. From 1966 to 1973 he was employed by Telefunken, Hannover, working in the Television Department. He became a Research Assistant at the Department of Electrical Engineering, University of Hannover, in 1973, and a Senior Engineer in 1978. From 1979 to 1981 he was on leave, working in the Visual Communications Research Department, Bell Laboratories, Holmdel, NJ. From 1983 to 1986 he was Department Head for Digital Signal Processing at the SEL Research Center, Stuttgart, Germany. Since 1987 he has been Professor in the Department of Electrical and Computer Engineering at the University of Hannover. He served as Vice President Research of the University of Hannover from 1998 to 2002. His present research includes architectures and VLSI implementations for image processing applications, rapid prototyping, and design automation for DSP applications. He is the author or coauthor of more than 200 technical papers. He has edited a book on VLSI Implementations for Image Communications (Elsevier 1993) and is author of the book Architectures for Digital Signal Processing (John Wiley 1998). Dr. Pirsch is a member of the IEEE, the German Institute of Information Technology Engineers (ITG), and the German Association of Engineers (VDI). He has received several awards: the NTG paper prize award (1982), IEEE Fellow (1997), and the IEEE Circuits and Systems Golden Jubilee Medal (1999). He has been a member or chair of several technical program committees of international conferences and an organizer of special sessions and preconference courses. He has held several administrative and technical positions with the IEEE Circuits and Systems Society and other professional organizations. Dr. Pirsch currently serves as Vice President Publications of the IEEE Circuits and Systems Society. Since 2000 he has been chairman of the Accreditation Commission for Engineering and Informatics of the Accreditation Agency for Study Programs in Engineering, Informatics, Natural Science and Mathematics (ASIIN). Dr. Pirsch is chair of the VDI committee on Engineering Education.

5.
Upcoming multimedia compression applications will require high memory bandwidth. In this paper, we estimate that a software reference implementation of an MPEG-4 video decoder typically requires 200 Mtransfers/s to memory to decode 1 CIF (352×288) Video Object Plane (VOP) at 30 frames/s. This imposes a high penalty in terms of both power and performance. However, we also show that aggressive code transformations can greatly reduce the memory transfers without sacrificing speed (even gaining about 10% on cache misses and cycles for a DEC Alpha). For this purpose, we have manually applied an extended version of our data transfer and storage exploration (DTSE) methodology, which was originally developed for custom hardware implementations.
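To put the quoted figure in perspective, the arithmetic below converts 200 Mtransfers/s into transfers per VOP and per pixel. It is only a back-of-the-envelope sketch: it assumes that "pixel" means one sample position of the 352×288 frame (chroma samples are not counted separately), and the variable names are illustrative rather than taken from the paper.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
WIDTH, HEIGHT = 352, 288          # CIF resolution
FRAME_RATE = 30                   # VOPs decoded per second
TRANSFERS_PER_SECOND = 200e6      # 200 Mtransfers/s, as quoted

pixels_per_vop = WIDTH * HEIGHT                      # sample positions per VOP
pixels_per_second = pixels_per_vop * FRAME_RATE      # sample positions per second
transfers_per_vop = TRANSFERS_PER_SECOND / FRAME_RATE
transfers_per_pixel = TRANSFERS_PER_SECOND / pixels_per_second

print(f"pixels per VOP:      {pixels_per_vop}")
print(f"transfers per VOP:   {transfers_per_vop:.3e}")
print(f"transfers per pixel: {transfers_per_pixel:.1f}")
```

Under these assumptions the reference decoder performs several tens of memory transfers for every pixel it outputs, which is the kind of overhead the code transformations aim to reduce.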

6.
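A new switched-capacitor field-programmable analog array (FPAA) is used to design an audio measurement filter and to carry out its time-domain simulation and SPICE frequency-domain simulation. Taking the design and simulation of a high-performance 20 kHz low-pass filter as an example, the design is finally downloaded to a single FPAA chip for implementation.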

7.
8.
A simulation-based performance comparison of various linear multiuser and parallel interference cancellation (PIC) detectors in the presence of imperfect power control and channel estimation is presented. Results show that imperfect power control degrades the performance of even a single-user detector. Therefore, tight power control is indispensable for suboptimal detectors to maintain good performance. When power control is not perfect, interference cancellation detectors can outperform linear multiuser detectors. Among cancellation detectors, the conventional [1] and partial PIC [2] detectors are fairly sensitive to channel estimation error, while the LMS PIC [3] is quite robust in this regard.

9.
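This paper describes in detail the design and implementation of a high-performance FIR filter with programmable coefficients based on Stratix II. The filter provides a microprocessor-compatible programming interface through which its coefficients can be reprogrammed dynamically. Quartus II simulation results demonstrate the feasibility of the method, and the performance fully meets the design requirements. The design has significant application value both in traditional adaptive filtering and in the emerging field of cognitive radio (CR) [1].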
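The idea of a coefficient-programmable FIR filter can be illustrated with a small software model. This is only a behavioral sketch in Python, not the Stratix II design from the paper: the class name `ProgrammableFIR`, the method `load_coeffs` (standing in for the microprocessor-side coefficient write), and the 4-tap length are all assumptions made for illustration.

```python
class ProgrammableFIR:
    """Direct-form FIR filter whose taps can be rewritten between samples."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0.0] * len(self.coeffs)   # delay line, newest sample first

    def load_coeffs(self, coeffs):
        """Hypothetical model of the processor write: replace all taps at once."""
        coeffs = list(coeffs)
        if len(coeffs) != len(self.coeffs):
            raise ValueError("tap count is fixed in this model")
        self.coeffs = coeffs

    def step(self, x):
        """Shift one input sample in and return one output sample."""
        self.delay = [x] + self.delay[:-1]
        return sum(c * d for c, d in zip(self.coeffs, self.delay))


fir = ProgrammableFIR([0.25, 0.25, 0.25, 0.25])   # moving-average taps
impulse = [1.0, 0.0, 0.0, 0.0]
before = [fir.step(x) for x in impulse]           # impulse response = taps

fir.load_coeffs([1.0, -1.0, 0.0, 0.0])            # reprogram as a first difference
after = [fir.step(x) for x in impulse]

print(before)
print(after)
```

The filter structure stays the same across the reload; only the stored taps change, which is what makes the same hardware reusable for adaptive filtering.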

10.
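This paper describes in detail the design and implementation of a high-performance FIR filter with programmable coefficients based on Stratix II. The filter provides a microprocessor-compatible programming interface through which its coefficients can be reprogrammed dynamically. Quartus II simulation results demonstrate the feasibility of the method, and the performance fully meets the design requirements. The design has significant application value both in traditional adaptive filtering and in the emerging field of cognitive radio (CR).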

11.
12.
In addition to their attractiveness for ultralow-power applications, analog CMOS circuits based on subthreshold operation of the devices are known to have significantly higher gain than their superthreshold counterparts. The effects of halo doping, both double-halo (DH) and single-halo or lateral asymmetric channel (LAC), on the subthreshold analog performance of 100-nm CMOS devices are systematically investigated for the first time with extensive process and device simulations. In the subthreshold region, halo doping is found to improve the device performance parameters for analog applications (such as gm/Id, output resistance, and intrinsic gain) in general, and the improvement is significant in the LAC devices. A low tilt angle of the halo implant is found to give the best improvement in both the LAC and DH devices. Our results show that CMOS amplifiers made with halo-implanted devices have higher voltage gain than their conventional counterparts, and an improvement of more than 100% in the voltage gain is observed when LAC doping is applied to both the p- and n-channel devices of the amplifier.

13.
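For microprocessor-based line protection, the design exploits the fast and efficient signal-processing capability of a digital signal processor (DSP) and the strong Ethernet communication capability of an embedded ARM (Advanced RISC Machine) processor. The hardware adopts a dual-CPU structure of DSP + ARM9, with the two processors exchanging data through dual-port RAM. The software is based on the embedded Linux operating system: the bootloader and kernel were ported, a Ramdisk root file system was built, and the application program was ported.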

14.
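The digital signal processing system is the core of an underwater acoustic positioning system. In this work, the main control unit of the positioning system is designed around a DSP combined with an FPGA. The paper gives the hardware block diagram, describes the functions of the main modules, and covers the memory interface, network interface, and FPGA logic interface designs. Debugging and experimental results show that the hardware system can perform signal detection and time-delay estimation for cooperative targets and meets the requirements of the underwater acoustic positioning system.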

15.
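Matlab-Based Simulation and Performance Analysis of an OFDM System   Total citations: 2, self-citations: 5, citations by others: 2
Orthogonal frequency division multiplexing (OFDM) is a core technology of fourth-generation mobile communications. The article first briefly introduces the basic principle of OFDM, then builds a system simulation model and analyzes system synchronization and channel estimation. Based on this model, the constellation diagram and bit error rate curves of the system are given. Simulation results show that the model verifies the theoretical analysis well.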

16.
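An Analysis of Oracle Memory Structures and Memory Performance Diagnosis and Tuning   Total citations: 1, self-citations: 0, citations by others: 1
Memory structure is one of the most important parts of the Oracle database architecture and a major factor in database performance; the amount of server memory directly affects how fast the database runs. The memory structures used by Oracle can be divided into the System Global Area (SGA) and the Program Global Area (PGA). The paper analyzes these two kinds of memory structure and provides a mechanism for diagnosing whether the buffers are sized appropriately, thereby optimizing Oracle's memory structures.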

17.
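A method for helicopter rotor fault identification based on a PCA sparse representation algorithm is proposed. First, the PCA preprocessing itself has low computational complexity and reduces the dimensionality of the samples substantially. Second, following the sample similarity principle, the PCA-based sparse representation method preserves the pairwise distances between samples before and after processing while also improving computational efficiency. The new diagnostic model is used to classify and identify helicopter rotor faults and is compared with diagnosis methods based on neural networks and on support vector machines. The results show that the proposed method identifies rotor faults well.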

18.
We introduce a strained-SiGe technology that adopts Si cap layers of different thicknesses for low-power and high-performance CMOS applications. By simply adopting 3 nm and 7 nm thick Si-cap layers in n-channel and p-channel MOSFETs, respectively, the transconductances and driving currents of both devices were enhanced by 7 to 37% and 6 to 72%. These improvements appear to result from the formation of a lightly doped retrograde high-electron-mobility Si surface channel in nMOSFETs and a compressively strained high-hole-mobility Si0.8Ge0.2 buried channel in pMOSFETs. In addition, the nMOSFET exhibited greatly reduced subthreshold swing values (that is, reduced standby power consumption), and the pMOSFET revealed greatly suppressed 1/f noise and gate-leakage levels. Unlike conventional strained-Si CMOS, which employs a relatively thick (typically > 2 µm) SixGe1-x relaxed buffer layer, the strained-SiGe CMOS with a very thin (20 nm) Si0.8Ge0.2 layer in this study showed a negligible self-heating problem. Consequently, the proposed strained-SiGe CMOS design structure should be a good candidate for low-power and high-performance digital/analog applications.

19.
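This paper first analyzes the power spectral density and the autocorrelation function as covered in the course "Random Signal Analysis", then introduces methods for estimating the power spectrum and illustrates them with experiments. Finally, taking bandwidth estimation of OFDM signals in wireless communication systems as an example, it shows how power spectrum estimation is applied in engineering practice. The paper offers guidance for teaching power spectra and their estimation and helps students better understand both the theory and its engineering applications.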
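The link between the autocorrelation function and the power spectrum can be checked numerically. The abstract does not say which estimators the paper covers, so the sketch below assumes the two classical ones: the periodogram (direct method) and the transform of an autocorrelation estimate (indirect method). It uses a circular autocorrelation so that the two agree exactly; all function names are illustrative.

```python
import cmath
import math
import random


def dft(x):
    """Plain O(n^2) discrete Fourier transform, adequate for a short demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def periodogram(x):
    """Direct estimate: |X(k)|^2 / n."""
    n = len(x)
    return [abs(v) ** 2 / n for v in dft(x)]


def circular_autocorr(x):
    """Circular autocorrelation estimate r(lag) of a real sequence."""
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) / n
            for lag in range(n)]


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]    # white-noise test signal

direct = periodogram(x)
indirect = [v.real for v in dft(circular_autocorr(x))]

max_diff = max(abs(a - b) for a, b in zip(direct, indirect))
mean_power = sum(v * v for v in x) / len(x)
print(f"largest difference between the two estimates: {max_diff:.2e}")
print(f"mean of periodogram: {sum(direct) / len(x):.6f}, mean power: {mean_power:.6f}")
```

For a real signal the two estimates coincide up to rounding, and the average of the periodogram equals the signal's mean power, which is the discrete counterpart of the relationship between the autocorrelation at lag zero and the integrated spectrum.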

20.
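The power inversion algorithm requires no prior information about the received signal and makes the antenna form beam nulls in the directions of the interference, thereby suppressing it. It is particularly suited to anti-jamming applications in strong-interference, weak-signal environments such as satellite navigation and spread-spectrum communications. Based on an analysis of the principle of the power inversion adaptive algorithm, the LMS algorithm is used to compute the optimal power inversion weights iteratively. The anti-jamming performance of the algorithm under different interference conditions, and the influence of factors such as the iteration step size on its convergence, are then simulated for a four-element antenna. The results verify the adaptive anti-jamming performance of the power inversion algorithm and lay a foundation for subsequent hardware implementation.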
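A rough sketch of LMS-based power inversion for a four-element array is given below. It is not the paper's simulation: it assumes a uniform linear array with half-wavelength spacing, a single constant-amplitude jammer 30 dB above the noise arriving from 30°, no desired signal, and element 0 as the fixed reference. The function names, the step size, and the sample count are illustrative choices.

```python
import cmath
import math
import random

NUM_ELEMENTS = 4   # element 0 is the reference; elements 1..3 are weighted


def steering(theta_deg):
    """Steering vector of a half-wavelength-spaced uniform linear array."""
    s = math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * math.pi * k * s) for k in range(NUM_ELEMENTS)]


def cnoise(rng):
    """Unit-variance circular complex Gaussian noise sample."""
    sd = math.sqrt(0.5)
    return complex(rng.gauss(0.0, sd), rng.gauss(0.0, sd))


def response_db(w, theta_deg):
    """Array power gain toward theta for y = x0 - sum(conj(w_i) * x_i)."""
    a = steering(theta_deg)
    g = a[0] - sum(wi.conjugate() * ai for wi, ai in zip(w, a[1:]))
    return 10.0 * math.log10(abs(g) ** 2 + 1e-30)


def power_inversion_lms(jam_deg, jam_power, mu, num_samples, seed=1):
    """Minimize output power with the reference weight fixed at 1."""
    rng = random.Random(seed)
    a_jam = steering(jam_deg)
    amp = math.sqrt(jam_power)
    w = [0j] * (NUM_ELEMENTS - 1)
    for _ in range(num_samples):
        s = amp * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))  # jammer sample
        x = [s * ak + cnoise(rng) for ak in a_jam]                 # array snapshot
        y = x[0] - sum(wi.conjugate() * xi for wi, xi in zip(w, x[1:]))
        # Complex LMS update: w <- w + mu * u * conj(y)
        w = [wi + mu * xi * y.conjugate() for wi, xi in zip(w, x[1:])]
    return w


initial_db = response_db([0j] * (NUM_ELEMENTS - 1), 30.0)
weights = power_inversion_lms(jam_deg=30.0, jam_power=1000.0, mu=2e-5,
                              num_samples=3000)
null_db = response_db(weights, 30.0)
print(f"gain toward jammer before adaptation: {initial_db:.1f} dB")
print(f"gain toward jammer after adaptation:  {null_db:.1f} dB")
```

The step size matters in the way the abstract suggests: a larger `mu` converges in fewer samples but leaves more weight jitter and eventually diverges, with the commonly used conservative bound being `mu` below 2 divided by the total input power of the weighted elements (about 3000 in this setup).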
