Similar articles
1.
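This paper proposes a new thread-level dynamic scheduling model for heterogeneous multi-core architectures built on configurable processors. In such a system, each core extends its instruction set for one particular application. Using the mapping among threads, processor cores, and instruction sets, the scheduler dynamically dispatches each thread to a suitable core. Without a large increase in chip area, this achieves a speedup close to that of a system in which every core carries the full set of instruction extensions. The model also effectively reduces the complexity of the programming model.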
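
The mapping among threads, cores, and instruction-set extensions described in this abstract can be sketched as follows. This is a minimal illustration and not the paper's scheduler: the names `Core`, `pick_core`, and `schedule`, the extension labels, and the selection rule (prefer the core covering the most required extensions, then the shortest queue) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """A core with a hypothetical set of instruction-set extensions."""
    name: str
    extensions: frozenset
    queue: list = field(default_factory=list)  # thread ids assigned so far


def pick_core(cores, needed):
    """Return the core that covers the most needed extensions.

    Ties are broken by the shortest queue (an assumed policy).
    """
    return max(cores, key=lambda c: (len(c.extensions & needed), -len(c.queue)))


def schedule(cores, threads):
    """Assign each (thread_id, needed_extensions) pair to a core."""
    placement = {}
    for thread_id, needed in threads:
        core = pick_core(cores, frozenset(needed))
        core.queue.append(thread_id)
        placement[thread_id] = core.name
    return placement


# Hypothetical cores and threads, for illustration only.
cores = [
    Core("core0", frozenset({"aes"})),
    Core("core1", frozenset({"fft"})),
    Core("core2", frozenset()),  # base instruction set only
]
threads = [("t0", {"aes"}), ("t1", {"fft"}), ("t2", set()), ("t3", {"aes"})]
placement = schedule(cores, threads)
print(placement)
```

Threads that need an extension land on the core that provides it, while threads with no special requirement fall through to the least-loaded core.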

2.
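To address the difficulty of raising single-core clock frequency and the steady growth in processor power consumption, this paper proposes a design for a new heterogeneous multi-core processor. The architecture adds a B-Cache structure and a C-Core controller. The new processor avoids pipeline flushes caused by branch mispredictions and improves overall execution efficiency.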

3.
A Parallel Stream Memory Structure for a Heterogeneous Multi-core Processor (cited 4 times: 3 self-citations, 1 by others)
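Heterogeneous multi-core processors can combine the strengths of several processor architectures: they retain the flexibility of traditional general-purpose architectures while offering abundant computing resources and higher peak performance. The YHFT64-3 heterogeneous multi-core processor contains 18 floating-point processing units and has very high peak performance, so designing a matching memory system is a major challenge. For the YHFT64-3, this paper proposes a parallel stream hierarchical memory structure and describes in depth the design ideas and methods for a memory system that reflects application characteristics and supports parallel data-stream processing, exploiting or capturing parallel data streams at multiple levels. Test results show that this memory structure reflects application characteristics and lets the YHFT64-3 perform well: at the same frequency (500 MHz), the YHFT64-3 outperforms the YHFT64-2 by 2 to 3 orders of magnitude and matches a 1.6 GHz Itanium2 at lower cost.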

4.
A multicore system-on-chip (SoC) has been developed for various applications (recognition, inference, measurement, control, and security) that require high-performance processing and low power consumption. The SoC integrates three types of synthesizable processors: eight CPUs (M32R), two multi-bank matrix processors (MBMX), and a controller (M32C), operating at 1 GHz, 500 MHz, and 500 MHz, respectively. The three processor types are interconnected on chip by a high-bandwidth multi-layer system bus. The eight CPUs are connected to a common pipelined bus using a cache coherence mechanism, and they share a 512-kB L2 cache to reduce internal bus traffic. The multi-bank matrix processor supports 2-read/1-write calculation and background I/O operation. The 1-GHz CPU is realized using a delay management network consisting of delay monitors that can be applied to any kind of application or process technology. The configurable heterogeneous architecture, with nine CPUs and two matrix processors, reduces power consumption by 45%.

5.
Ruan Li, Qin Guangjun, Xiao Limin, Zhu Mingfa. 《通信学报》 (Journal on Communications), 2013, 34(12): 131-141
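This paper presents a hardware and software design and implementation method for a high-efficiency cloud computing node based on Loongson multi-core processors, and a corresponding prototype system has been built. Experiments and tests show that a single node achieves 0.256×10^12 floating-point operations per second (0.256 Tflops), and that a single rack holds 42 1U node chassis with 672 CPUs and 2,688 CPU cores (672×4). Overall, the system is based on Loongson multi-core processors and offers high density and a high performance-to-power ratio, laying a solid foundation for Loongson-based cloud computing systems.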
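
The rack-level figures quoted in this abstract can be cross-checked with simple arithmetic. The per-node CPU count, the per-core rate, and the rack total derived below are not stated in the abstract. They follow only if all 42 nodes are identical and peak performance scales linearly with node count, both of which are assumptions.

```python
from fractions import Fraction

NODES_PER_RACK = 42
CPUS_PER_RACK = 672
CORES_PER_CPU = 4
NODE_TFLOPS = Fraction(256, 1000)  # 0.256 Tflops per node, from the abstract

cores_per_rack = CPUS_PER_RACK * CORES_PER_CPU     # matches the quoted 2,688
cpus_per_node = CPUS_PER_RACK // NODES_PER_RACK    # assumes identical nodes
cores_per_node = cpus_per_node * CORES_PER_CPU
rack_tflops = NODE_TFLOPS * NODES_PER_RACK         # assumes linear scaling
gflops_per_core = NODE_TFLOPS * 1000 / cores_per_node

print(cores_per_rack, cpus_per_node, cores_per_node)
print(float(rack_tflops), float(gflops_per_core))
```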

6.
Adding a reconfigurable processing unit (RPU) to a general-purpose processor is a promising way to improve performance significantly. In this paper, a Reconfigurable Multi-Core (RMC) architecture combining general-purpose multi-core processing and reconfigurable logic is proposed. The reconfigurable logic is logically divided into RPUs, which are coupled with the general-purpose cores as co-processors via a full crossbar switch. An RPU Manager (RPU-M) is also designed to manage the RPUs. To verify RMC, a simulation method based on Simics and a Virtex 5 FPGA is adopted, which simplifies the simulation and ensures accurate evaluation of the hardware function cores. Five workloads are selected to test RMC: 3-DES, AES, SHA2, IDCT, and JPEG_ENC. The experimental results show an average speedup of 3.10 times over a software implementation on the original multi-core, and the data and control communication overhead on RMC is acceptable.

7.
To achieve high computational performance and low power consumption, many modern microprocessors are equipped with special multimedia instructions and multi-core processing capabilities. The number of cores on a single chip doubles every three years. Therefore, besides complexity reduction by smart algorithms such as fast macroblock mode selection, an effective algorithm for parallelizing H.264/AVC is also crucial in implementing a real-time encoder on a multi-core system. Such an algorithm distributes the H.264/AVC encoding workload uniformly over several slower and simpler processor cores on a single chip. In this paper, we propose a new adaptive slice-size selection technique for efficient slice-level parallelism of H.264/AVC encoding on a multi-core processor, using fast macroblock mode selection as a pre-processing step. For this, we propose a method that estimates the computational complexity of each macroblock from the pre-processing macroblock mode selection. Simulation results on a number of test video sequences show that, without any noticeable degradation, the proposed fast macroblock mode selection reduces the total encoding time by about 57.30%. The proposed adaptive slice-level parallelism performs well in parallel compared with conventional fixed slice-size parallelism. The proposed method can be applied to many multi-core systems for real-time H.264 video encoding.
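
The idea of choosing slice sizes from per-macroblock complexity estimates can be sketched as a contiguous partition with roughly equal estimated cost per slice. This is an illustrative greedy split, not the paper's algorithm: the function name `adaptive_slices`, the cost values, and the cut rule are assumptions.

```python
def adaptive_slices(costs, n_slices):
    """Split macroblocks (in raster order) into contiguous slices of
    roughly equal estimated cost.

    Returns a list of (start, end) index pairs with `end` exclusive.
    """
    if not 1 <= n_slices <= len(costs):
        raise ValueError("need 1 <= n_slices <= number of macroblocks")
    total = sum(costs)
    bounds, start, running = [], 0, 0
    for i, cost in enumerate(costs):
        running += cost
        slices_left = n_slices - len(bounds)  # includes the open slice
        if slices_left == 1:
            break  # everything that remains goes into the last slice
        mbs_left = len(costs) - (i + 1)
        # Cut once the running cost reaches the next even share of the total.
        reached_share = running * n_slices >= total * (len(bounds) + 1)
        # Cut if the remaining macroblocks are needed to fill later slices.
        must_cut = mbs_left == slices_left - 1
        if reached_share or must_cut:
            bounds.append((start, i + 1))
            start = i + 1
    bounds.append((start, len(costs)))
    return bounds


# Hypothetical per-macroblock complexity estimates.
costs = [1, 1, 1, 1, 4, 4, 2, 2]
slices = adaptive_slices(costs, 2)
slice_costs = [sum(costs[s:e]) for s, e in slices]
print(slices, slice_costs)
```

With these made-up costs, a fixed split into two slices of four macroblocks each would carry estimated costs of 4 and 12, whereas the adaptive split moves the boundary so that both slices carry the same estimated cost.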

8.
The performance of recent CPUs has been rapidly increasing with the help of parallel architectural supports, such as SIMD (Single Instruction Multiple Data) extensions and multi-core architecture. However, efficient use of such parallel supports for adaptive filtering is difficult due to feedback loops that induce the data dependency problem. In this paper, efficient parallel computation of adaptive filters is studied for multi-core architecture with SIMD arithmetic support. Control- and data-level parallel computation methods are considered, where the former finds parallelism in the evaluation of one output sample, while the latter processes multiple output samples at a time to increase the degree of parallelism. The control-level parallel approach frequently utilizes the pipelining technique to uncover the parallelism, whereas the data-level approach employs a parallel computation method for linear recurrence equations to resolve the dependency. Not only adaptive transversal LMS (Least Mean Square) but also gradient adaptive lattice (GAL) and QR-decomposition based least-square lattice (QRD-LSL) filters are implemented on a PC that employs both SIMD and multi-core architecture.
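
The abstract does not spell out its parallel method for linear recurrences. One standard way to break the dependency in a first-order recurrence y[n] = a[n]·y[n-1] + b[n] is to treat each step as an affine map and use the fact that composing such maps is associative, so blocks of samples can be reduced independently. The sketch below illustrates that general idea only; the function names and the sample coefficients are assumptions.

```python
from functools import reduce


def compose(f, g):
    """Compose affine maps: apply f = (a1, b1) first, then g = (a2, b2).

    Each pair (a, b) stands for y -> a * y + b. Composition is associative,
    which is what lets blocks be reduced independently.
    """
    a1, b1 = f
    a2, b2 = g
    return (a1 * a2, a2 * b1 + b2)


def sequential(coeffs, y0):
    """Reference: evaluate y[n] = a[n] * y[n-1] + b[n] one sample at a time."""
    out, y = [], y0
    for a, b in coeffs:
        y = a * y + b
        out.append(y)
    return out


def blocked(coeffs, y0, block):
    """Evaluate the same recurrence block by block.

    Step 1 reduces every block to one affine map; these reductions do not
    depend on each other and could run on separate cores or SIMD lanes.
    Step 2 chains the block maps to get each block's starting value.
    Step 3 fills in each block from its own starting value, again
    independently of the other blocks.
    """
    blocks = [coeffs[i:i + block] for i in range(0, len(coeffs), block)]
    summaries = [reduce(compose, blk) for blk in blocks]  # step 1
    starts, y = [], y0
    for a, b in summaries:                                # step 2
        starts.append(y)
        y = a * y + b
    out = []
    for blk, start in zip(blocks, starts):                # step 3
        out.extend(sequential(blk, start))
    return out


# Hypothetical coefficients (a[n], b[n]).
coeffs = [(2, 1), (1, 3), (3, 0), (1, -2), (2, 2), (1, 1)]
print(sequential(coeffs, 1))
print(blocked(coeffs, 1, block=2))
```

This version runs the steps in a single thread to keep the example short; the point is that steps 1 and 3 carry no dependency between blocks.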

9.
We have designed a microprocessor based on a single instruction multiple data stream (SIMD) architecture. It features a two-way superscalar architecture for multimedia embedded systems, in particular those that need to support MPEG2 video decoding/encoding and 3DCG image processing. This microprocessor meets all requirements of embedded systems, including (a) MPEG2 (MP@ML) decoding and graphics processing capabilities for three-dimensional images, (b) programming flexibility, and (c) low power consumption and low manufacturing cost. High performance was achieved through enhanced parallel processing capabilities, combining a SIMD architecture with a two-way superscalar architecture. Programming flexibility was increased by providing 170 dedicated multimedia instructions. Low power consumption was achieved by using advanced process technology and power-saving circuits. The processor supports a general-purpose RISC instruction set, which is important because the processor must also work as a controller of various target systems. The processor has been fabricated in 0.21-μm CMOS four-metal technology on a 9.84×10.12 mm die. It performs 2.16 GOPS/720 MFLOPS at an operating frequency of 180 MHz, with a power consumption of 1.2 W and a power supply of 1.8 V.

10.
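For the widely used DaVinci heterogeneous multi-core processors, this paper studies the communication and cooperation mechanisms among the ARM core, the DSP core, and the video coprocessors. Starting from an analysis of the principles of inter-core communication in multi-core processors, it examines inter-core communication on the TMS320DM816x family of DaVinci heterogeneous multi-core processors and describes in detail the on-chip inter-core interconnect and the implementation of the inter-core communication software. Finally, a multi-channel high-definition audio/video application system is designed on top of the SysLink low-level communication module to validate inter-core communication. The system makes full use of each processing core and achieves efficient cooperation among the cores.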

11.
In this paper, an attention-controlled multi-core architecture is proposed for energy-efficient object recognition. The proposed architecture employs two IP layers with different roles: the attention/control IPs compute regions of interest (ROIs) over the entire image and control the multiple processing cores, which perform local object recognition on the selected areas. To this end, a task manager is proposed that dynamically schedules ROI tasks from the attention IP onto the multiple cores in units of small grid tiles. Thanks to the many grid-tile threads generated by the task manager, the utilization of the multiple cores reaches 92% on average. As a result, the proposed architecture achieves a 2.1× energy reduction in a multi-core recognition system by directing the processing cores to focus on critical areas of the image, at a cost of 0.87 mJ for attention processing. Finally, the proposed architecture is implemented in 0.13 μm CMOS technology, and the fabricated chip demonstrates 3.2× lower energy dissipation per frame than the state-of-the-art object recognition processor.
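
Dynamic scheduling of ROI work in grid-tile units can be sketched as follows. This is a software illustration under assumptions and not the chip's task manager: the ROI coordinates, the tile size, the cost model (tile area), and the "earliest free core first" policy are all hypothetical, and the utilization it prints is unrelated to the 92% reported in the abstract.

```python
import heapq


def tile_roi(x, y, w, h, tile):
    """Split a rectangular ROI into tile-sized tasks (edge tiles may be smaller)."""
    return [(tx, ty, min(tile, x + w - tx), min(tile, y + h - ty))
            for ty in range(y, y + h, tile)
            for tx in range(x, x + w, tile)]


def dispatch(tasks, n_cores, cost):
    """Give each task to whichever core becomes free first.

    Returns (makespan, utilization), where utilization is total busy time
    divided by n_cores * makespan.
    """
    if not tasks:
        return 0, 0.0
    free_at = [(0, core) for core in range(n_cores)]
    heapq.heapify(free_at)
    busy = 0
    for task in tasks:
        t, core = heapq.heappop(free_at)
        duration = cost(task)
        busy += duration
        heapq.heappush(free_at, (t + duration, core))
    makespan = max(t for t, _ in free_at)
    return makespan, busy / (n_cores * makespan)


# Hypothetical ROIs as (x, y, width, height); cost model is tile area.
rois = [(0, 0, 64, 32), (100, 40, 48, 48)]
tiles = [t for roi in rois for t in tile_roi(*roi, tile=16)]
makespan, utilization = dispatch(tiles, n_cores=4, cost=lambda t: t[2] * t[3])
print(len(tiles), makespan, utilization)
```

Splitting ROIs into many small tasks gives the dispatcher more freedom to even out the load than assigning one whole ROI per core would.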

12.
This paper presents an integrated self-aware computing model that mitigates the power dissipation of a heterogeneous reconfigurable multicore architecture by dynamically scaling the operating frequency of each core. The power mitigation is achieved by equalizing the performance of all the cores for an uninterrupted exchange of data. The multicore platform consists of heterogeneous Coarse-Grained Reconfigurable Arrays (CGRAs) of application-specific sizes and a Reduced Instruction-Set Computing (RISC) core. The CGRAs and the RISC core are integrated over a Network-on-Chip (NoC) of six nodes arranged in two rows and three columns. The RISC core constantly monitors and controls the performance of each CGRA accelerator, adjusting the operating frequencies until the performance of all the CGRAs is optimally balanced over the platform. The CGRA cores process some of the most computationally intensive signal processing algorithms, while the RISC core establishes packet-based synchronization between the cores for computation and communication. All the cores can access each other's computational and memory resources while processing the kernels simultaneously and independently of each other. Besides general-purpose processing and overall platform supervision, the RISC processor manages performance equalization among all the cores, which reduces the overall dynamic power dissipation by 20.7% in a proof-of-concept test.
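
The principle of equalizing core performance by frequency scaling can be illustrated with a closed-form sketch. The abstract describes an iterative monitor-and-adjust loop run by the RISC core; the code below replaces that loop with a one-shot calculation and is therefore only an approximation of the idea. The cycle counts, the common maximum frequency, and the assumption that dynamic power scales linearly with frequency at a fixed supply voltage are all hypothetical, and the result is unrelated to the 20.7% figure in the abstract.

```python
def equalize(cycles, f_max):
    """Pick per-core frequencies so every core finishes at the same time.

    cycles: work per core in clock cycles for one iteration (hypothetical).
    f_max:  highest frequency available to every core, in Hz.
    The busiest core runs at f_max; the others are slowed to match it.
    """
    busiest = max(cycles)
    return [f_max * c / busiest for c in cycles]


def relative_dynamic_power(freqs, f_max):
    """Dynamic power relative to running every core at f_max.

    Assumes a fixed supply voltage, so dynamic power is proportional
    to frequency (P ~ C * V^2 * f).
    """
    return sum(freqs) / (f_max * len(freqs))


cycles = [400, 200, 100, 100]   # hypothetical per-core workloads
f_max = 200_000_000             # hypothetical 200 MHz ceiling
freqs = equalize(cycles, f_max)
ratio = relative_dynamic_power(freqs, f_max)
print(freqs, ratio)
```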

13.
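This work studies the architecture design, verification, and performance evaluation of network processors. A network processor for edge-network applications is designed, implemented on an FPGA, and evaluated. The network processor uses a concurrent multiprocessing architecture and is supported by basic software including a complete C development environment and an operating system. On a Xilinx XC2VP30 FPGA, its single-engine and four-engine configurations run at 116.4 MHz and 83.5 MHz and occupy 7,100 and 15,250 four-input LUTs, respectively. Experiments and analysis show that the network processor is efficient and scales well, and that it can meet application needs in edge networks such as forwarding and remote control.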

14.
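Heterogeneous multi-core processors can assign different types of tasks to different types of cores for parallel processing, providing a flexible and efficient processing mechanism for differing application requirements. This paper proposes a design method for SoC-oriented heterogeneous multi-core systems, with which image processing algorithms can be implemented efficiently and conveniently. It first introduces the basic methods of image degradation and restoration, gives the basic model for implementing the algorithms, and performs system-level modeling and simulation with the digital signal processing development tool System Generator. The image degradation and restoration coprocessor Pcore is then generated automatically through EDK Processor and combined with the Xilinx MicroBlaze soft core to build a heterogeneous multi-core system-on-chip.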

15.
A single-chip MPEG-2 MP@ML codec, integrating 3.8M gates on a 72-mm² die, is described. The codec employs a heterogeneous multiprocessor architecture in which six microprocessors with the same instruction set but different customization execute specific tasks such as video and audio concurrently. The microprocessor, developed for digital media processing, provides various extensions such as a very-long-instruction-word coprocessor, digital signal processor instructions, and hardware engines. Making full use of the extensions and optimizing the architecture of each microprocessor based upon the nature of specific tasks, the chip can execute not only MPEG-2 MP@ML video/audio/system encoding and decoding concurrently, but also MPEG-2 MP@HL decoding in real time.

16.
Novel algorithmic features of multimedia applications and advances in VLSI technologies are the driving forces behind new multimedia signal processors. We propose an architecture platform that provides high performance and flexibility and requires less external I/O and memory access. It comprises array processors used as hardware accelerators and RISC cores used as the basis of the programmable processor. It is a hierarchical and scalable architecture style that facilitates the hardware-software codesign of multimedia signal processing circuits and systems. While some control-intensive functions can be implemented using programmable CPUs, other computation-intensive functions can rely on hardware accelerators. To compile multimedia algorithms, we also present an operation placement and scheduling scheme suitable for the proposed architectural platform. Our scheme addresses data reusability and exploits local communication in order to avoid the memory/communication bandwidth bottleneck, which leads to faster program execution. Our method shows promising performance: a linear speed-up of 16 times can be achieved for the block-matching motion estimation algorithm and the true motion tracking algorithm, which form the basis of many multimedia applications (e.g., MPEG-2 and MPEG-4).

17.
This paper presents a new real-time infrared scene simulator. It has two 266 MHz Pentium CPUs, Z-buffer hardware, and an architecture based on the PCI bus. The hardware architecture and the schematic diagram of the system are given, showing how its performance is achieved. Finally, the experimental results indicate that the simulator can meet the demands of practical applications.

18.
Since the number of processing cores in a General Purpose Processor (GPP) increases steadily, parallelization of algorithms is a well-known topic in computer science. Algorithms have to be adapted to this new processor architecture to fully exploit the available processing power. This development equally affects Software Defined Radio (SDR) technology, because the GPP has become an important processor for SDR platforms. To make use of the entire processing power of a multi-core GPP, and hence to avoid system inefficiency, this work provides an approach to parallelizing C/C++ code using OpenMP. This application programming interface provides a rapid way to parallelize code using compiler directives inserted at appropriate positions in the code. The processing load can be shared between all available cores. We use Matlab Simulink as a framework for model-based design and evaluate the processing gain of embedded handwritten C-code blocks with OpenMP support. We show that OpenMP increases core utilization. We present the increase in processing speed relative to a single-core GPP as a function of the number of cores, and we highlight the limitations of code parallelization. Our results show that a straightforward implementation of algorithms without multi-core consideration leaves the system underutilized.
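
The abstract does not say which limitations of parallelization it means. One standard bound that is often used for this kind of analysis is Amdahl's law, which caps the speedup when only part of the run time can be parallelized. The sketch below shows that formula only; it is not taken from the paper, and the 90% parallel fraction is a made-up value.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Upper bound on speedup when only part of the run time parallelizes.

    parallel_fraction: share of single-core run time that can be spread
    across cores (between 0 and 1).
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)


# Hypothetical workload in which 90% of the run time is parallelizable.
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Under this model the serial share dominates as the core count grows, so doubling the number of cores yields progressively smaller gains.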

19.
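Multi-core processors, which make the structure of parallel systems increasingly complex, have become the processor mainstream and have developed into the mainstream processing platform for a wide range of communication and media applications. The communication architecture is one of the core technologies in a multi-core system, and the efficiency of inter-core communication is an important factor in multi-core processor performance. There are currently three main communication architectures: bus-based systems, crossbar switch networks, and networks-on-chip. Bus structures are relatively easy to design, consume less hardware, and cost less; crossbar switches are suited to building large-capacity...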

20.
Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. In addition, memory access will continue to be an important limiting factor on achieving performance, and heterogeneous systems may make use of cores with varying capabilities and performance characteristics. An appropriate programming model can address scalability and can expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores. We survey and evaluate a range of multicore processor architectures and programming models with a focus on GPUs and the Cell BE processor. These processors have a large number of cores and are available to consumers today, but the scalable programming models developed for them are also applicable to current and future multicore CPUs.
