期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

贾一鸣李磊肖建青《微电子学与计算机》2023,(4):117-124

针对多核系统中PLB和AXI总线间协议转换的需求,首先研究了总线协议与PowerPC处理器的访存行为,进一步研究了流水控制、读写叠加等高效率转换策略,最后研究了面向多核系统应用的缓存一致性维护策略.针对命令信号、读数据信号和写数据信号独立传输的特点,设计多通道流水线结构,既实现了命令与数据的流水并行也实现了读事务与写事务的叠加并行;在多通道流水线结构的基础上,提出一种流水并行+可变长描述符的2级加速转换技术,通过给予流水线输入更连续的总线事务,实现更高的总线转换效率;借鉴Cache表项的结构和维护策略,提出基于动态命中预测的缓存一致性维护技术,加速一致性读命令的进程.最终,实现一种高性能的PLB到AXI总线桥设计,达到总线协议行为全覆盖、命令转换低延迟的目标.总线桥应用于某款基于双核PowerPC处理器的异构多核体系结构芯片,解决了SoC系统内PLB到AXI总线的高效、高可靠转换问题,并在65 nm工艺下完成流片. 相似文献

2.

基于RISC-V处理器的TileLink与AXI4总线桥设计与实现

洪广伟崔超虞致国顾晓峰《微电子学与计算机》2022,(4):100-108

RISC-V是近年提出的一种开源精简指令集架构,TileLink总线是专为RISC-V处理器设计的片上总线.为使RISC-V处理器灵活适配更多已有的AXI4 IP资源,提出一种高效率TileLink与AXI4总线桥设计方案,其中由一系列功能子模块匹配总线间数据传输方式的差异,以流水线传输形式实现数据跨协议的传输,增加总... 相似文献

3.

多处理器设计技术探计

Mike McTigue 《电子设计技术》2003,10(6):82-92

人们可以将多处理器设计定义为这样的系统:它把各项功能和任务分配给多个处理器去完成,而这些处理器彼此协调、互相沟通,以保证行为一致.多处理器系统比单处理器设计更加复杂.它要求为内务操作和协调功能进行额外的编程,调试也更为复杂.这是因为多处理器系统需要各处理器间的交互作用,而在单处理器体系结构中不存在这种处理器之间的交互作用的问题.尽管多处理器设计增加了复杂性,但多年来,它一直用于高性能计算机和工作站中,而且正跻身于日益增多的嵌入式系统应用之中. 相似文献

4.

2004年度影响中国的嵌入式创新产品奖

《电子产品世界》2004,(17)

2004年5月17日,ARM公司发布了一款新的可授权处理器。该处理器是ARM与NEC电子公司合作开发的。MPCore可综合多处理器基于ARMv6体系结构,可配置为1~4个处理器,性能可达2600Dhrystone MIPS。其功能包括:一级可配置缓存、64位AMBA AXI接口、矢量浮点协处理器和可编程中断分配。该处理器支持不工作状态的处理器启用Adaptive Shutdown技术,提供动态节能,低功耗性能与不含缓存的常规130nm处理器相同,为0.57mW/MHz。ARM智能能量管理技术通过动态地预测所需性能,降低电压和频率,这两项技术可实现高达85%的功耗节省。多处理器理想地满… 相似文献

5.

双核3U Compact PCI服务器级工业计算机

《今日电子》2008,(8)

cPCI-3920搭配英特尔Core2Duo双核处理器,具有4MB闪存（Cache）与667MHz总线,若客户需要更省电与更佳的散热能力,也可选择搭配极低功耗的单核Celeron处理器。cPCI—3920支持具备纠错功能（ECC）的单通道DDR2—400SDRAM,最高容量达2GB,内存同样也是直接焊接在PCB板上,具有极佳抗震性。相似文献

6.

“龙腾R2”微处理器流水线的设计及优化 总被引：4，自引：3，他引：1

黄小平樊晓桠贾琳白永强《微电子学与计算机》2006,23(2):144-147

32位RISC微处理器“龙腾R2”是西北工业大学航空微电子中心2005年设计的一款自主知识产权的嵌入式微处理器。采用PowerPC体系结构，六级流水线，具有独立的数据Cache和指令Cache。文章介绍“龙腾R2”处理器流水线的设计思想以及优化方案。重点介绍流水线中相关的解决方案、精确异常的实现以及流水线中指令预取级的设计与实现等。相似文献

7.

“龙腾~RR2”微处理器流水线的设计及优化

黄小平樊晓桠贾琳白永强《微电子学与计算机》2006,(2)

32位RISC微处理器“龙腾~R R2”是西北工业大学航空微电子中心2005年设计的一款自主知识产权的嵌入式微处理器,采用PowerPC体系结构,六级流水线,具有独立的数据Cache和指令Cache。文章介绍“龙腾~R R2”处理器流水线的设计思想以及优化方案,重点介绍流水线中相关的解决方案、精确异常的实现以及流水线中指令预取级的设计与实现等。相似文献

8.

"龙腾(R)R2"微处理器流水线的设计及优化

黄小平樊晓桠贾琳白永强《微电子学与计算机》2006,23(2)

32位RISC微处理器"龙腾(R)R2"是西北工业大学航空微电子中心2005年设计的一敖自主知识产权的嵌入式微处理器,采用PowerPC体系结构,六级流水线,具有独立的数据Cache和指令Cache.文章介绍"龙腾(R)R2"处理器流水线的设计思想以及优化方案,重点介绍流水线中相关的解决方案、精确异常的实现以及流水线中指令预取级的设计与实现等. 相似文献

9.

通信处理器开拓了多总线Ⅱ的带宽

Scholhamer G. 智少游《通信技术》1987,(2)

在多处理器系统内,多总线Ⅱ借助消息传送来增强处理器之间的通信能力。在多总线Ⅱ上的并行系统总线(PSB)可以发送两种类型的消息:请求消息和非请求消息。多总线请求消息用于传送数据块。当象通信控制器这样的设备已收到数据时,就必须把它发送到主CPU。在请求消息方式下,控制器发送一个非请求消息到主机,以请求缓冲存储器接收它的数据块;然后主机发送一个非请求的回答消息,这个消息带有分配相似文献

10.

多线程非阻塞指令Cache设计

胡孔阳陈鹏桑红石《微电子学与计算机》2012,29(5):143-147

非阻塞Cache是指Cache在等待预取数据返回时,还能继续提供指令和数据.首先分析了多线程非阻塞Cache的处理器需求,然后提出其时序要求和一种实现方案.利用SystemVerilog对该方案进行RTL级建模和性能评估.仿真结果表明,该方案可以很好地应用于多线程、乱序执行处理器的指令引擎设计之中. 相似文献

11.

一种新颖的双端口数据高速缓冲存储器

张卫新单睿侯朝焕《微电子学》2003,33(6):537-540

VLIW体系结构是媒体处理器的首选技术。解决处理器内核与访存之间的数据瓶颈,可以采用双Load／Store单元。为此,需要开发具有双端口访问能力的数据高速缓冲存储器。通过分析双端口情况下的系统工作时序、缺失(miss)处理和替换算法,设计并实现了一个4路组相连、容量为16kB的双端口数据高速缓冲存储器。通过在高速缓冲存储器内使用双端口SRAM,使其具有真正双端口并行访问能力,提高了处理器内核的数据吞吐能力。相似文献

12.

一种数据Cache的设计和验证

屈凌翔袁潇王澧《电子与封装》2014,(5):28-32

Cache能够提高DSP处理器对外部存储器的存取速度,提高DSP的性能,设计高性能低功耗的Cache,对于提高DSP芯片的整体性能有着十分重大的意义。描述了DSP芯片中一种高性能低功耗的数据Cache。这种Cache可以通过增加具备重装功能的Line Buffer来减少处理器对Cache的访问频率,从而降低Cache功耗。通过FFT、AC3、FIR三种基准程序测试表明,Line Buffer可以降低35%的Cache访问频率,明显降低了数据Cache功耗。相似文献

13.

Accurate Analysis of Single Event Upsets in a Pipelined Microprocessor

M. Rebaudengo M. Sonza Reorda M. Violante 《Journal of Electronic Testing》2003,19(5):577-584

Modern processors embed features such as pipelined execution units and cache memories that can not be directly controlled by programmers through the processor instruction set. As a result, software-based fault injection approaches are even less suitable for assessing the effects of SEUs in modern processors, since they are not able to evaluate the effects of SEUs affecting pipelines and caches. In this paper we report an analysis of a commercial processor core where the effects of SEUs located in the processor pipeline and cache memories are studied. The obtained results are compared with those software-based approaches provide, showing that software-based approaches may lead to significant errors during the error rate estimation. A major novelty of the paper is an extensive analysis of the effects of SEUs in the pipeline of a commercial processor core during the execution of several benchmark programs. 相似文献

14.

一种基于FPGA的实时视频采集与远程传输系统

张龙滨黎福海《电视技术》2011,35(17):45-47

以FPGA器件XC5VLX110T作为核心,选用芯片AD9980进实时视频采集,扩充大容量DDR2 SDRAM作为系统运算缓存,内置TEMAC控制器建立以太网通信链路,配置软核处理器MicroBlaze管理各模块,并设计配套软件,构成一种功能强大的视频采集板卡,并支持组网集中管理. 相似文献

15.

RDMM: Runtime dynamic migration mechanism of distributed cache for reconfigurable array processor

《Integration, the VLSI Journal》2020

Reconfigurable array processors have emerged as powerful solution to speed up computationally intensive applications. However, they may suffer from a data access bottleneck as the frequency of memory access rises. At present, the distributed cache design in the reconfigurable array processor has a large cache failure rate, and the frequent access to external memory leads to a long delay in memory access. To mitigate this problem, we present a Runtime Dynamically Migration Mechanism (RDMM) of distributed cache for reconfigurable array processor based on the feature of obvious locality and high parallelism in accessing data. This mechanism allows temporary, static data to be dynamically scheduled to migrate data with a high access frequency from the remote cache to the processor's local migration storage table based on how often the reconfigurable array processors access the remote cache. We can accurately get the data on the shortest path by way of data search strategy based on migration storage tables, thereby effectively reducing the access delay of the entire system, increasing the memory bandwidth of the reconfigurable array processor. We leverage the hardware platform of reconfigurable array processor to test the proposed mechanism. The experimental results show that RDMM reduces access delay by up to 35.24% compared with the tradition distributed cache at the highest conflict rate. And compared with the Ref.[19], Ref.[20], Ref.[21] and Ref.[23], the working frequency can be increased by 15%, the hit rate can be increased by 6.1%, and the peak bandwidth can be increased by about 3×. 相似文献

16.

The on-chip 3-MB subarray-based third-level cache on an Itanium microprocessor

Weiss D. Wuu J.J. Chin V. 《Solid-State Circuits, IEEE Journal of》2002,37(11):1523-1529

The 3-MB on-chip level three cache in the Itanium 2 processor, built on an 0.18-/spl mu/m, six-layer Al metal process, employs a subarray design style that efficiently utilizes available area and flexibly adapts to floor plan changes. Through a distributed decoding scheme and compact circuit design and layout, 85% array efficiency was achieved for the subarrays. In addition, various test and reliability features were included. The cache allows for a store and a load every four core cycles and has been characterized to operate above 1.2 GHz at 1.5 V and 110/spl deg/C. When running at 1.0 GHz, the cache provides a total bandwidth of 64 GB/s. 相似文献

17.

A Case Study: Power and Performance Improvement of a Chip Multiprocessor for Transaction Processing

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(7):865-868

Current high-end microprocessor designs focus on increasing instruction parallelism and clock frequency at the expense of power dissipation. This paper presents a case study of a different direction, a chip multiprocessor (CMP) with a smaller processor core than a baseline high-end 130-nm 64-bit SPARC server uniprocessor. We demonstrate that the size of the baseline processor core can be reduced by 2/3 using a combination of logical resource reduction and dense custom macros while still delivering about 70% of the TPC-C performance. Circuit speed is traded for power reduction by reducing the power supply from 1.0 to 0.8 V and increasing transistor channel lengths by 12.5% above the minimum. The resulting CMP with six reduced size cores and 4-MB L2 cache is estimated to run at 1.8 GHz while consuming less than 30% of the power compared to the scaled baseline dual-core processor running at 2.4 GHz. The proposed CMP is more than four times higher in TPC/W than the dual-core processor, facilitating the design of high-density servers. 相似文献

18.

Energy optimization of multilevel cache architectures for RISC andCISC processors

Ko U. Balsara P.T. Nanda A.K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1998,6(2):299-308

In this paper, we present the characterization and design of energy-efficient, on chip cache memories. The characterization of power dissipation in on-chip cache memories reveals that the memory peripheral interface circuits and bit array dissipate comparable power. To optimize performance and power in a processor's cache, a multidivided module (MDM) cache architecture is proposed to conserve energy in the bit array as well as the memory peripheral circuits. Compared to a conventional, nondivided, 16-kB cache, the latency and power of the MDM cache are reduced by a factor of 1.9 and 4.6, respectively. Based on the MDM cache architecture, the energy efficiency of the complete memory hierarchy is analyzed with respect to cache parameters in a multilevel processor cache design. This analysis was conducted by executing the SPECint92 benchmark programs with the miss ratios for reduced instruction set computer (RISC) and complex instruction set computer (CISC) machines 相似文献

19.

The first IA-64 microprocessor

Rusu S. Singer G. 《Solid-State Circuits, IEEE Journal of》2000,35(11):1539-1544

The first implementation of the IA-64 architecture achieves high performance by using a highly parallel execution core, while maintaining binary compatibility with the IA-32 instruction set. Explicitly parallel instruction computing (EPIC) design maximizes performance through hardware and software synergy. The processor contains 25.4 million transistors and operates at 800 MHz. The chip is fabricated in a 0.18-μm CMOS process with six metal layers and packaged in a 1012-pad organic land grid array using C4 (flip chip) assembly technology. A core speed back-side bus connects the processor to a 4-MB L3 cache 相似文献

20.

A Systolic Design Methodology with Application to Full-Search Block-Matching Architectures

Yen-Kuang Chen S.Y. Kung 《The Journal of VLSI Signal Processing》1998,19(1):51-77

We present a systematic methodology to support the design tradeoffs of array processors in several emerging issues, such as (1) high performance and high flexibility, (2) low cost, low power, (3) efficient memory usage, and (4) system-on-a-chip or the ease of system integration. This methodology is algebraic based, so it can cope with high-dimensional data dependence. The methodology consists of some transformation rules of data dependency graphs for facilitating flexible array designs. For example, two common partitioning approaches, LPGS and LSGP, could be unified under the methodology. It supports the design of high-speed and massively parallel processor arrays with efficient memory usage. More specifically, it leads to a novel systolic cache architecture comprising of shift registers only (cache without tags). To demonstrate how the methodology works, we have presented several systolic design examples based on the block-matching motion estimation algorithm (BMA). By multiprojecting a 4D DG of the BMA to 2D mesh, we can reconstruct several existing array processors. By multiprojecting a 6D DG of the BMA, a novel 2D systolic array can be derived that features significantly improved rates in data reusability (96%) and processor utilization (99%). 相似文献