Similar Documents
20 similar documents retrieved (search time: 406 ms).
1.
赵丽萍 《计算机工程》2006,32(6):95-97,185
With its microcode-programmable data plane, the network processor (NP) combines the high performance of dedicated hardware chips with the configuration flexibility of pure software solutions, offering an effective way to resolve forwarding-performance and QoS bottlenecks. Taking the AMCC NP34xx network processor as an example, this paper summarizes the general NP architecture, studies a data-plane software design framework based on the network-processor architecture, and proposes a hybrid programming model that combines microcode with high-level languages, together with performance-optimization methods. On this basis, and in conjunction with the design and implementation of a Layer-2 switching (L2SW) data-plane software package, it systematically describes the key implementation techniques and performance-evaluation strategies for network-processor-based data-plane software.

2.
To address the low data-forwarding efficiency of current vehicular ad hoc networks (VANET), a data-forwarding strategy and routing technique based on software-defined networking (SDN) is proposed. First, a hierarchical control structure for the software-defined VANET, consisting of local controllers and a global controller, separates data forwarding from control so that the forwarding direction can be steered flexibly. Second, a vehicle routing mechanism for a single road segment is designed; it predicts vehicle node positions and applies a greedy strategy to achieve stable data transmission. Third, a road-segment routing mechanism among multiple demands is designed; it combines breadth-first search (BFS) with edge sets to keep the paths of different demands edge-disjoint and thereby relieve bandwidth bottlenecks. Finally, simulation shows that, compared with Ad hoc On-demand Distance Vector (AODV) routing, the proposed forwarding strategy and routing algorithm improve the packet delivery ratio by more than 40% and reduce the average delay by about 60%. The experimental results indicate that the SDN-based VANET forwarding strategy and routing technique improve forwarding efficiency and reduce average packet-reception delay.
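
The abstract only sketches the BFS-plus-edge-set step; as an illustration of how edge-disjoint paths between demands could be computed at the controller, here is a minimal Python sketch (the graph representation, function names and undirected-edge assumption are mine, not the paper's code):

```python
from collections import deque

def bfs_path(graph, src, dst, used_edges):
    """Shortest-hop path from src to dst that avoids edges already in used_edges.

    graph: dict mapping node -> iterable of neighbour nodes (road-segment graph).
    used_edges: set of frozenset({u, v}) edges reserved by earlier demands.
    Returns a list of nodes, or None if no edge-disjoint path exists.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph[u]:
            if v not in parent and frozenset((u, v)) not in used_edges:
                parent[v] = u
                queue.append(v)
    return None

def route_demands(graph, demands):
    """Route each (src, dst) demand so that no two demands share a road segment."""
    used_edges, routes = set(), {}
    for src, dst in demands:
        path = bfs_path(graph, src, dst, used_edges)
        routes[(src, dst)] = path
        if path:
            used_edges.update(frozenset(e) for e in zip(path, path[1:]))
    return routes
```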

3.
A Pipelined, Splittable Multiply-Accumulator Architecture Supporting SIMD Instructions (Cited: 1; self-citations: 0; by others: 1)
李东晓 《计算机工程》2006,32(7):264-266
The multiply-accumulate (MAC) unit is a key arithmetic component of media digital signal processors. Drawing on the development of the 32-bit digital signal processor chip MD32 (a project under the national "863" Program), this paper proposes a pipelined, splittable hardware architecture for the MAC unit. Pipelining the multiply operation achieves single-cycle throughput at a 200 MHz operating frequency, and a splittable datapath supports SIMD multiply instructions, allowing parallel multiplication of 16-bit media data in four lanes and greatly improving the processor's media-processing performance. The paper gives the theoretical basis and experimental results for the proposed MAC architecture, which has been physically verified through the tape-out of MD32.
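
As a behavioural illustration of the splittable datapath described above (four lanes of 16-bit media data), the following sketch models the SIMD multiply in Python; the packing format and function names are assumptions made for clarity, not a description of the MD32 hardware:

```python
def simd_mul16x4(a: int, b: int) -> list[int]:
    """Behavioural model of a splittable SIMD multiply: treat the 64-bit inputs
    a and b as four packed 16-bit lanes and multiply lane by lane."""
    lanes = []
    for i in range(4):
        ai = (a >> (16 * i)) & 0xFFFF
        bi = (b >> (16 * i)) & 0xFFFF
        lanes.append(ai * bi)          # each product fits in 32 bits
    return lanes

def mac16x4(acc: list[int], a: int, b: int) -> list[int]:
    """Multiply-accumulate across the four lanes."""
    return [s + p for s, p in zip(acc, simd_mul16x4(a, b))]
```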

4.
An Effective Fetch Control Mechanism for Simultaneous Multithreading Processors (Cited: 1; self-citations: 0; by others: 1)
A simultaneous multithreading (SMT) processor greatly improves performance by fetching and executing instructions from several running threads in every clock cycle. The prediction accuracy of the branch predictor and the efficiency of the fetch policy are key factors in SMT performance. By combining a value-based branch predictor with a fetch policy based on thread progress speed, a new fetch control mechanism is proposed. The structure has a small hardware cost and low implementation complexity. Experimental results show that the mechanism effectively improves processor performance, achieving a 28% speedup over a conventional fetch control mechanism, which is also higher than the speedups of current stream-buffer-based and branch-classifier-based fetch control mechanisms.
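
The abstract does not define how "thread progress speed" is measured; purely for illustration, the sketch below assumes an ICOUNT-like proxy (threads with the fewest in-flight instructions fetch first), which is one common way such a fetch policy can be approximated:

```python
def pick_fetch_threads(in_flight: dict[int, int], slots_per_cycle: int = 2) -> list[int]:
    """Return the thread ids allowed to fetch this cycle.

    in_flight maps thread id -> number of instructions currently occupying the
    front end / issue queues; threads with the fewest in-flight instructions
    (i.e. those advancing fastest) are favoured.
    """
    ranked = sorted(in_flight, key=lambda tid: in_flight[tid])
    return ranked[:slots_per_cycle]

# Example: thread 2 has drained its queue fastest, so it fetches first.
print(pick_fetch_threads({0: 24, 1: 12, 2: 3, 3: 17}))  # -> [2, 1]
```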

5.
A Prefetching Policy That Incorporates Memory Miss Queue State (Cited: 1; self-citations: 0; by others: 1)
As the gap between memory access speed and processor computation speed keeps widening, memory performance has become the bottleneck for improving overall system performance. Based on an analysis of instruction-cache and data-cache miss behaviour, a prefetching policy is proposed that takes the state of the memory miss queue into account. The policy preserves the order of instruction and data accesses, which helps extract prefetch streams, and separates instruction-stream from data-stream prefetching to avoid mutual replacement. When choosing the moment to issue a prefetch, it considers not only whether the bus is currently idle but also the state of the miss queue, reducing interference with the processor's normal memory requests. A stream-filtering mechanism improves prefetch accuracy and lowers the bandwidth demand of prefetching. Results show that with this policy the processor's average memory latency is reduced by 30% and the IPC of the SPEC CPU2000 programs improves by 8.3% on average.
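
As a minimal illustration of the issue-timing rule described above (bus idle plus miss-queue state), the following sketch shows one plausible gating condition; the occupancy threshold is an assumption, since the paper's exact criterion is not given in the abstract:

```python
def should_issue_prefetch(bus_idle: bool, miss_queue_len: int,
                          miss_queue_capacity: int, threshold: float = 0.5) -> bool:
    """Issue a prefetch only when the bus is idle *and* the miss queue is not
    too full, so demand misses are never starved.

    The 50% occupancy threshold is illustrative; the published abstract does
    not state the exact condition.
    """
    return bus_idle and miss_queue_len < threshold * miss_queue_capacity
```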

6.
To address the low data-forwarding efficiency of existing vehicular ad hoc networks (VANET), a data-forwarding mechanism based on software-defined networking (SDN) is proposed. First, a hierarchical network model for the software-defined VANET is designed; it consists of local controllers and vehicles, separates control from data forwarding, and offers scalability and independence. Second, a vehicle routing and forwarding mechanism is designed that uses dynamic programming and binary search to achieve efficient data forwarding. Finally, simulation shows that, compared with the Ad hoc On-demand Distance Vector (AODV), Destination-Sequenced Distance Vector (DSDV), Dynamic Source Routing (DSR) and Optimized Link State Routing (OLSR) algorithms, the proposed mechanism improves the delivery success ratio by roughly 100% and reduces end-to-end delay by roughly 20%. The experimental results indicate that the SDN-based VANET forwarding mechanism improves routing efficiency and reduces delay.

7.
To improve the efficiency of data query search in unstructured P2P networks, a new one-hop query-and-forward search strategy (OHQFS) is proposed. It is obtained by integrating the query-forwarding and non-forwarding strategies used in unstructured P2P networks. In OHQFS, the query source node directly searches all of its own neighbours and forwards the query to them, so that these neighbours in turn search their own neighbours. The strategy does not need to maintain a large set of neighbour information during search, so the system maintenance overhead is small, and the single hop of query forwarding built into the strategy brings both the neighbours and the neighbours' neighbours within the search range. Network simulation results show that, compared with the forwarding search strategy, OHQFS improves query efficiency, and compared with the non-forwarding strategy it achieves a higher success rate.
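
To make the one-hop query-and-forward idea concrete, here is a small Python sketch of OHQFS-style search (the data structures and names are illustrative assumptions, not the authors' implementation): the source queries its neighbours, and each neighbour forwards the query exactly one more hop to its own neighbours:

```python
def ohqfs_search(peers: dict[str, dict], source: str, key: str) -> set[str]:
    """One-hop query-and-forward search (illustrative sketch).

    peers maps a node id to {'data': set of keys it stores,
                             'neighbours': set of adjacent node ids}.
    The source queries its neighbours directly; each neighbour then checks its
    own neighbours once.  No query travels further than that single forwarded hop.
    """
    hits = set()
    for n in peers[source]['neighbours']:
        if key in peers[n]['data']:
            hits.add(n)
        # the one forwarded hop: n checks its own neighbours on our behalf
        for m in peers[n]['neighbours']:
            if m != source and key in peers[m]['data']:
                hits.add(m)
    return hits
```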

8.
The "memory wall" is one of the obstacles that high-performance processor design must overcome, and an efficient, intelligent cache system is a key element of the processor's memory hierarchy. In a processor with branch prediction, the memory data fetched by load instructions executed speculatively along predicted branch paths causes cache pollution that can significantly hurt cache and processor performance. This paper analyses the impact of speculative execution and cache data pollution on processor performance and, building on the characteristics of the branch prediction mechanism, proposes Contra, a cache pollution control technique based on branch-path tracking. A branch-path tracking table records the data written into the cache along speculative paths and controls how that data is stored, accessed and replaced, effectively preventing polluted data from degrading cache efficiency and improving the performance of the processor's memory system. Simulation results show that, relative to the baseline design, Contra improves the L1 D-Cache hit rate by 0.03% to 6.69% (1.80% on average) and IPC by 0.01% to 6.60% (2.56% on average).
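
The following toy model illustrates the general idea of tracking speculative cache fills and controlling their replacement; the specific policy shown (tag fills with the spawning branch, prefer mispredicted-path lines as eviction victims) is an assumption for illustration rather than Contra's actual design:

```python
class SpeculativeFillTracker:
    """Toy model in the spirit of a branch-path tracking table: cache lines
    filled by loads on a predicted path are tagged with the branch that
    spawned them; lines whose branch resolves as mispredicted become
    preferred eviction victims."""

    def __init__(self):
        self.fill_branch = {}   # line address -> id of the unresolved branch
        self.polluted = set()   # lines filled on a path known to be wrong

    def record_fill(self, line_addr, branch_id):
        self.fill_branch[line_addr] = branch_id

    def on_branch_resolved(self, branch_id, mispredicted):
        for addr in [a for a, b in self.fill_branch.items() if b == branch_id]:
            del self.fill_branch[addr]
            if mispredicted:
                self.polluted.add(addr)      # wrong-path fill: evict first

    def victim_priority(self, line_addr):
        # Lower value = chosen as victim earlier by the replacement policy.
        return 0 if line_addr in self.polluted else 1
```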

9.
The data servers of the Parallel Virtual File System (PVFS) lack a load-balancing mechanism, so hot-spot servers arise and degrade overall system performance. A replica-based load-balancing mechanism is proposed that migrates load by replicating file data to remove this bottleneck. When choosing which files to replicate, it weighs a file's heat against its size to reduce replication cost, moving hot data to less loaded servers at low cost and effectively increasing the data throughput of the whole system. The work covers three parts: hot-spot detection, selection of replication source and destination nodes, and the file-replication policy. Experimental results show that the proposed mechanism effectively improves overall system performance, by up to 24%.
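
A minimal sketch of the replica-selection step might look as follows; the heat/size ratio is an assumed proxy for "benefit per byte copied", since the abstract does not give the exact weighting:

```python
def pick_replication_plan(files, servers):
    """Illustrative replica selection (not the paper's exact policy).

    files:   list of dicts {'name', 'heat', 'size', 'server'}
    servers: dict server id -> current load
    Returns (file, source server, destination server) for the best candidate,
    or None if the hot server holds no files.
    """
    hot_server = max(servers, key=servers.get)          # hot-spot detection
    candidates = [f for f in files if f['server'] == hot_server]
    if not candidates:
        return None
    best = max(candidates, key=lambda f: f['heat'] / f['size'])
    dest = min(servers, key=servers.get)                 # most idle server
    return best, hot_server, dest
```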

10.
In a wireless sensor network, the data sensed by a large number of sensor nodes must be forwarded to the base station. Because node energy is very limited, the data sometimes needs several forwarding hops to reach the base station. Owing to this relaying, nodes near the base station must forward large amounts of data from outer nodes far from the base station in addition to transmitting their own sensed data, so they are overloaded and die early, forming an "energy hole" around the base station that eventually kills the whole network. To address this problem, a new non-uniform node deployment strategy is proposed: the density structure of the deployment is adjusted and the density of inner-layer nodes is increased to share the inner-layer load, thereby avoiding the energy hole and improving the network's energy utilization. The strategy was implemented on the NS2 simulation platform, and the simulation results show that it is effective against the energy-hole problem and noticeably improves the energy balance of the network.
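
To see why inner nodes need a higher density, a simple concentric-ring calculation (an assumed abstraction, not the paper's model) shows how per-node relay load grows toward the base station and how weighting the inner rings evens it out:

```python
def per_node_relay_load(nodes_per_ring):
    """Per-node forwarding load in a toy concentric-ring model: every packet
    generated in ring i or farther out must pass through some node of ring i
    on its way to the base station at the centre (ring 0 is innermost)."""
    total_outer = [sum(nodes_per_ring[i:]) for i in range(len(nodes_per_ring))]
    return [outer / ring for ring, outer in zip(nodes_per_ring, total_outer)]

uniform = per_node_relay_load([50, 50, 50, 50])       # inner nodes carry ~4x the load
weighted = per_node_relay_load([200, 100, 70, 50])    # denser inner rings even it out
print(uniform, weighted, sep="\n")
```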

11.
Pipelining and bypassing in a VLIW processor (Cited: 1; self-citations: 0; by others: 1)
This short note describes issues involved in the bypassing mechanism for a very long instruction word (VLIW) processor and its relation to the pipeline structure of the processor. The authors first describe the pipeline structure of their processor, analyze its performance, and compare it to typical RISC-style pipeline structures in the context of a processor with multiple functional units. Next they study the performance effects of various bypassing schemes in terms of their effectiveness in resolving pipeline data hazards and their effect on the processor cycle time.

12.
In this paper, we have proposed an efficient method for integrating longer pipeline coprocessors with SPARCv8-compliant processor implementations that requires minimum changes in the existing processor pipeline. The proposed integration method is independent of the length of the coprocessor pipeline. We have used a COordinate Rotation DIgital Computer (CORDIC) core as the coprocessor, integrated with the SPARCv8-based LEON3 processor. Only a subset of the coprocessor instructions defined in the Instruction Set Architecture (ISA) is required in our proposed method. The required synchronisation of data and control signals between the coprocessor and the LEON3 pipeline is presented in detail. The performance of the resulting closely-coupled design is compared with bus-based integration in terms of speed, power and area in the System-on-Chip (SoC) design, and both FPGA and ASIC results are reported. For applications that require consecutive coprocessor operations, our proposed integration method shows significant improvements over the bus-based method in terms of the CPI metric, along with a substantial reduction in the number of cycles. A similar strategy can be employed for integration with coprocessors having different pipeline lengths.

13.
In this paper, a scalable scheme, configurable via register-transfer-level parameters, for full register bypassing in a modern embedded processor architecture, termed ByoRISC, is presented. The register bypassing specification is parameterized with respect to the number of homogeneous register-file read and write ports and the number of pipeline stages of the processor. The performance characteristics (cycle time, chip area) of the proposed technique have been evaluated for FPGA target implementations of the synthesizable ByoRISC model. It is proved that a full bypassing network is a viable solution for eliminating data hazards when servicing instructions with multiple read and write operands. While the maximum clock frequency is reduced by 17.9% on average when using full rather than partial forwarding, the positive effect of custom computation eliminates this penalty by providing cycle speedups of 3.9× to 5.5× and a corresponding execution-time speedup of 3.6× for a ByoRISC testbed processor. Individual application speedups of up to 9.4× have also been obtained.
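
As a behavioural illustration of what a full bypass network does (independent of ByoRISC's RTL parameters, which are not reproduced here), the sketch below selects each source operand from the youngest in-flight instruction that has already produced it, falling back to the register file:

```python
def bypass_select(src_reg, reg_file, inflight):
    """Operand selection with a full bypass network (illustrative model only).

    inflight is a list of in-flight instructions ordered youngest first, each a
    dict {'dest': register index or None, 'value': computed result or None}.
    The youngest instruction that has already produced src_reg forwards its
    result; otherwise the value comes from the architectural register file.
    """
    for instr in inflight:                      # youngest (closest) producer wins
        if instr['dest'] == src_reg and instr['value'] is not None:
            return instr['value']
    return reg_file[src_reg]
```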

14.
For mixed embedded-control and digital-signal-processing applications, a load-ahead mechanism is built for a processor with a fused MCU-DSP architecture. The core uses static superscalar techniques, has three pipelines (integer, load/store and loop), and adopts a special four-stage pipeline. In the load/store pipeline, the load-ahead mechanism dynamically schedules the memory-access order of instructions so that load instructions can run ahead of store instructions, making operands available earlier to the integer pipeline and speeding up pipeline processing.
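
A standard way to decide whether a load may run ahead of older stores is an address-overlap check against the pending store queue; the sketch below is an illustrative version of such a check, not the paper's exact rule:

```python
def load_may_go_ahead(load_addr: int, load_size: int, store_queue) -> bool:
    """Let a load bypass older stores only when their address ranges do not
    overlap (a conventional disambiguation check).

    store_queue: iterable of (addr, size) for stores that are older but have
    not yet written to memory."""
    load_end = load_addr + load_size
    for addr, size in store_queue:
        if load_addr < addr + size and addr < load_end:   # ranges overlap
            return False                                    # must wait for the store
    return True
```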

15.
In mainstream general-purpose processing systems, superscalar execution and caches make self-modifying code (SMC) a case that requires special handling; to keep supporting programs that use self-modifying code and remain compatible with existing software, the design of the processing system must handle SMC. This paper analyses and compares the SMC behaviour of various programs and the corresponding solutions, and designs a scheme that uses a FIFO queue to detect SMC outside the pipeline, avoiding interference with the main pipeline. It reuses the memory-access path to detect the cache-coherence problems caused by SMC; because the optimized design needs no extra port, a multi-ported data cache is avoided and the total area drops by 1.16%. Compared with an ideal zero-overhead scheme, the performance impact is below 0.1%.
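
The sketch below illustrates the FIFO-based idea of detecting self-modifying code outside the main pipeline; the queue depth, line size and flush policy are assumptions made for illustration:

```python
from collections import deque

class SmcDetector:
    """Off-pipeline SMC check via a small FIFO of recently fetched instruction
    cache-line addresses (parameters here are illustrative, not the paper's)."""

    def __init__(self, depth: int = 16, line_bytes: int = 64):
        self.lines = deque(maxlen=depth)
        self.line_bytes = line_bytes

    def note_fetch(self, pc: int):
        self.lines.append(pc // self.line_bytes)

    def store_hits_fetched_code(self, store_addr: int) -> bool:
        """Called for each store, outside the main pipeline; a hit means the
        store touched a line we recently fetched instructions from, so the
        front end must be flushed and the stale instructions refetched."""
        return (store_addr // self.line_bytes) in self.lines
```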

16.
Architecture Design of the Godson-1 Processor (Cited: 26; self-citations: 7; by others: 26)
This paper first introduces the background and technical approach of the Godson (Loongson) processor project. It analyses the reasons for Godson's insistence on a high-performance positioning, its steady, step-by-step design strategy, and its compatibility with mainstream processors, and points out that, since the conditions for matching the clock frequencies of foreign processors do not yet exist, performance should be raised by optimizing the processor microarchitecture, with breakthroughs in architectural techniques as the foundation. It then describes the architecture design of the Godson-1 processor, including a dynamic pipeline design based on a reused operation queue, precise exception handling under out-of-order execution, the instruction-fetch and branch control structure, memory management, and a system security design against buffer-overflow attacks. Tests show that Godson-1's instruction pipeline is efficient and that its security design effectively defends against network attacks that exploit buffer overflows; however, Godson-1's caches are too small and their organization also needs improvement.

17.
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the lock hand-off, only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler nor programmer support, nor ISA or coherence-protocol changes. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
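
The following toy model captures the core of request bypassing as described in the abstract; how the coherence controller identifies the "winning" processor is an assumption here (the current owner of the lock line), not necessarily the paper's exact condition:

```python
from collections import deque

class LockLineController:
    """Toy model of request bypassing at the coherence controller.

    Requests to a contended lock line are buffered in FIFO order; a request
    from the current owner of the line bypasses the buffer, because only the
    owner's accesses advance the critical section."""

    def __init__(self):
        self.buffer = deque()
        self.owner = None            # processor currently holding the lock line

    def arrive(self, requester):
        if requester == self.owner:
            return requester          # bypass: serve immediately
        self.buffer.append(requester) # everyone else waits their turn
        return None

    def next_to_serve(self):
        if self.buffer:
            self.owner = self.buffer.popleft()   # hand the line to the next waiter
        return self.owner
```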

18.
A brushless DC motor (BLDCM) control system based on PLL six-times frequency-multiplied speed acquisition is presented. With the high-performance STM32 processor as the core control element, phase-locked-loop frequency multiplication is used to achieve high-precision control of motor speed. The system consists of the processor, drive circuit, inverter circuit, frequency-multiplication circuit and motor; the frequency-multiplication circuit applies six-times multiplication to the three Hall signals, and its output serves as the system's speed feedback. In simulation and practical tests, the designed PLL six-times frequency-multiplication circuit ...
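
As a small worked example of how the six-times-multiplied Hall frequency maps back to motor speed, the sketch below assumes a standard three-Hall BLDC in which each Hall signal completes one cycle per electrical revolution; the pole-pair count is an assumption, not a value from the paper:

```python
def rpm_from_feedback(f_feedback_hz: float, pole_pairs: int = 4) -> float:
    """Convert the six-times-multiplied Hall frequency back to mechanical rpm.

    Assumed relations for a typical 3-Hall BLDC:
        f_feedback = 6 * f_hall,   f_hall = pole_pairs * rpm / 60
    """
    f_hall = f_feedback_hz / 6.0
    return 60.0 * f_hall / pole_pairs

print(rpm_from_feedback(1200.0))   # 1200 Hz feedback, 4 pole pairs -> 3000 rpm
```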

19.
Nowadays, the multi-core processor is the main technology used in desktop PCs, laptop computers and mobile hardware platforms. As the number of cores on a chip keeps increasing, complexity grows and has an increasing impact on both the power and the performance of a processor. In multi-processors, the number of cores and various parameters, such as issue width, number of instructions and execution time, are key design factors for balancing the amount of thread-level parallelism and instruction-level parallelism. In this paper, we perform a comprehensive simulation study that aims to find the optimum number of processor cores in desktop/laptop computing processor models with shallow pipeline depth. This paper also explores the trade-off between the number of cores and the different parameters used in multi-processors in terms of power–performance gains, and analyzes the impact of 3D stacking on the design of simultaneous multi-threading and chip multiprocessing. Our analysis shows that the optimum number of cores varies with different classes of workloads, namely SPEC2000, SPEC2006 and MiBench. A simulation study is presented using architectures with shorter pipeline depth, showing that (1) the optimum number of cores for power–performance is 8, (2) the optimum number of threads is in the range [2, 4], and (3) beyond 32 cores, multi-core processors are no longer efficient in terms of performance benefits and overall power consumption.

20.
To exploit the advantages of heterogeneous multi-core processors and thereby improve program execution efficiency, two optimization techniques on the Cell heterogeneous multi-core processor are proposed: thread-synchronized pipelined parallelism and iteration-synchronized pipelined parallelism. These techniques effectively speed up the execution of irregular writes and irregular control structures. Tests of IS, EP and LU from the NAS benchmarks and MOLDYN from SPEC2001 on the Cell processor show that the pipelined parallel scheme effectively improves the execution efficiency of critical sections and flush operations and noticeably increases program execution speed.
