20 similar documents found; search took 15 ms.
1.
Noda H. Nakajima M. Dosaka K. Nakata K. Higashida M. Yamamoto O. Mizumoto K. Tanizaki T. Gyohten T. Okuno Y. Kondo H. Shimazu Y. Arimoto K. Saito K. Shimizu T. 《Solid-State Circuits, IEEE Journal of》2007,42(1):183-192
This paper describes the design and implementation of a massively parallel processor based on a matrix architecture suited to portable multimedia applications. The proposed architecture achieves a high performance of 40 GOPS for consecutive 16-bit fixed-point additions at a 200 MHz clock frequency, with a low power dissipation of 250 mW. In addition, 1 Mbit of SRAM for data registers and 2048 2-bit-grained processing elements connected by a flexible switching network are integrated in a small area of 3.1 mm² in 90 nm CMOS low-standby technology. The design techniques and architectures described in this paper are attractive for realizing area-efficient, energy-efficient, and high-performance multimedia processors.
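To illustrate what "2-bit-grained" processing elements imply for 16-bit arithmetic, here is a minimal sketch, assuming each PE adds one 2-bit digit per step and holds the carry between steps. The function names `pe_add_16` and `simd_add` are hypothetical; this is not the paper's actual PE microarchitecture.

```python
# Hypothetical model of a 2-bit-grained processing element (PE): a 16-bit
# addition is carried out as eight 2-bit digit additions, LSB first, with
# the carry kept in the PE between steps. Not the paper's real datapath.
from typing import List, Tuple

DIGIT_BITS = 2
DIGIT_MASK = (1 << DIGIT_BITS) - 1  # 0b11


def pe_add_16(a: int, b: int) -> Tuple[int, int]:
    """Add two 16-bit unsigned values digit by digit; return (sum mod 2**16, steps)."""
    result = 0
    carry = 0
    steps = 0
    for shift in range(0, 16, DIGIT_BITS):
        da = (a >> shift) & DIGIT_MASK
        db = (b >> shift) & DIGIT_MASK
        s = da + db + carry              # at most 3 + 3 + 1 = 7
        result |= (s & DIGIT_MASK) << shift
        carry = s >> DIGIT_BITS          # 0 or 1, fed into the next digit
        steps += 1
    return result & 0xFFFF, steps


def simd_add(vec_a: List[int], vec_b: List[int]) -> List[int]:
    """Every PE runs the same digit-serial add on its own operand pair."""
    return [pe_add_16(x, y)[0] for x, y in zip(vec_a, vec_b)]
```

For example, `pe_add_16(0x1234, 0x0FFF)` returns `(0x2233, 8)`: eight 2-bit steps per 16-bit add, which is why many PEs must run in parallel to reach high aggregate throughput.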
2.
3.
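This paper first discusses very long instruction word (VLIW) and multi-core processor architectures, focusing on the design of the Huawei (华威) processor, a multi-core microprocessor based on VLIW and SIMD architectures. The paper describes the processor's architecture, instruction scheduling, and compiler optimization techniques, and presents optimization results obtained with inference and speculation techniques.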
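As an illustration of VLIW instruction scheduling in general (not the scheduler used for this processor), the sketch below packs operations into fixed-width bundles with a greedy list scheduler. It assumes unit latency and that every operation appears as a key of the dependency map; `schedule_vliw` and the operation names are hypothetical.

```python
# Minimal greedy list scheduler: packs operations into VLIW bundles of a
# fixed issue width. An op is issued only after all of its producers were
# issued in an earlier bundle (unit latency assumed).
from typing import Dict, List, Set


def schedule_vliw(deps: Dict[str, List[str]], width: int) -> List[List[str]]:
    remaining = dict(deps)               # op -> ops it depends on
    done: Set[str] = set()
    bundles: List[List[str]] = []
    while remaining:
        ready = [op for op, preds in remaining.items()
                 if all(p in done for p in preds)]
        if not ready:
            raise ValueError("dependency cycle or unknown producer")
        bundle = ready[:width]           # fill at most `width` issue slots
        for op in bundle:
            del remaining[op]
        done.update(bundle)              # results visible from the next bundle on
        bundles.append(bundle)
    return bundles
```

With `width=2`, the dependency graph `load_a, load_b, load_c -> mul(a, b) -> add(mul, c) -> store` is packed into four bundles: `[load_a, load_b]`, `[load_c, mul]`, `[add]`, `[store]`. Independent loads fill otherwise empty slots, which is the basic benefit a VLIW compiler exploits.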
4.
5.
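To address the low flexibility and large resource usage common to current elliptic curve cryptography (ECC) processors, this paper uses statistical modeling, guided by a composite area-time (AT) performance metric, to propose a quantitative evaluation method for parallel ECC processing architectures, and determines that a 3-way heterogeneous parallel architecture gives the best overall processor performance. Second, it proposes a separated, hierarchical memory structure and a modular arithmetic unit with highly reused computational resources, which improve memory access efficiency and resource utilization. Synthesized in a 90 nm CMOS process, the processor occupies 1.62 mm² and completes one point multiplication over GF(2^571) and GF(p521) in 2.26 ms / 612.4 μJ and 2.63 ms / 665.4 μJ, respectively. Compared with similar designs, the processor not only offers high flexibility and scalability but also achieves a good trade-off between chip area and computation speed.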
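The point multiplication benchmarked above can be illustrated in software with the textbook left-to-right double-and-add method. This is a minimal sketch over a toy prime-field curve (y² = x³ + 2x + 2 over GF(17), base point (5, 1) of order 19); it does not model the paper's GF(2^571)/GF(p521) datapath, memory structure, or scheduling, and `point_add`/`scalar_mult` are illustrative names.

```python
# Textbook left-to-right double-and-add over a short Weierstrass curve
# y^2 = x^3 + a*x + b over GF(p). Toy parameters only; the hardware
# datapath described in the paper is not modeled here.
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]   # None represents the point at infinity


def point_add(P: Point, Q: Point, a: int, p: int) -> Point:
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                       # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p  # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p         # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def scalar_mult(k: int, P: Point, a: int, p: int) -> Point:
    R: Point = None
    for bit in bin(k)[2:]:              # scan scalar bits, MSB first
        R = point_add(R, R, a, p)       # double
        if bit == "1":
            R = point_add(R, P, a, p)   # add
    return R
```

Here `scalar_mult(2, (5, 1), 2, 17)` gives `(6, 3)` and `scalar_mult(19, (5, 1), 2, 17)` gives `None` (the point at infinity). Each double or add is a sequence of modular multiplications and one inversion, which is the workload a reusable modular arithmetic unit must serve. Requires Python 3.8+ for `pow(x, -1, p)`.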
6.
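Combining the strengths of classic Java processors such as PicoJava and JOP, this paper designs a RISC-based Java processor. It makes full use of Java instruction folding and the advantages of reduced-instruction-set processors, which both lowers design complexity and substantially improves Java processor performance.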
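Instruction folding can be pictured as rewriting a short run of stack bytecodes into one register-style operation. The sketch below is a hypothetical, deliberately narrow folder: it recognizes only the `iload; iload; <binop>; istore` group, whereas real folding units (such as the ones in PicoJava-style designs) handle more patterns. The tuple encoding and `fold` are assumptions for illustration.

```python
# Hypothetical bytecode folding: the stack sequence
#   iload a; iload b; <binop>; istore c
# becomes one RISC-style op (op, dest, src1, src2). Any other bytecode is
# passed through unchanged.
from typing import List, Tuple

Insn = Tuple  # e.g. ("iload", 1), ("iadd",), ("add", 3, 1, 2)
BINOPS = {"iadd": "add", "isub": "sub", "imul": "mul"}


def fold(bytecode: List[Insn]) -> List[Insn]:
    out: List[Insn] = []
    i = 0
    while i < len(bytecode):
        w = bytecode[i:i + 4]
        if (len(w) == 4 and w[0][0] == "iload" and w[1][0] == "iload"
                and w[2][0] in BINOPS and w[3][0] == "istore"):
            # four stack operations collapse into one three-operand op
            out.append((BINOPS[w[2][0]], w[3][1], w[0][1], w[1][1]))
            i += 4
        else:
            out.append(bytecode[i])
            i += 1
    return out
```

For example, `[("iload", 1), ("iload", 2), ("iadd",), ("istore", 3), ("return",)]` folds to `[("add", 3, 1, 2), ("return",)]`, so four stack instructions issue as a single RISC operation.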
7.
8.
9.
10.
《ZTE Communications (English Edition)》2009,(1):54-58
The Long Term Evolution (LTE) system imposes strict requirements on scheduling delay. Moreover, the very high LTE air-interface rate demands strong processing capability from the devices that process baseband signals. Consequently, a single-core processor cannot meet the requirements of the LTE system. This paper analyzes how to use multi-core processors to parallelize uplink demodulation and decoding in LTE systems and designs a parallel processing approach. Test results show that the approach works well.
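A minimal sketch of the partitioning idea, assuming independent code blocks can each be demodulated and decoded on a separate core. The stand-ins are loudly simplified: BPSK hard-decision demodulation and a rate-1/3 repetition decoder replace LTE's actual modulation and turbo decoding, and Python threads only model the work split (the GIL limits true CPU parallelism). All function names are hypothetical.

```python
# Sketch: distribute per-code-block uplink work across worker "cores".
# Stand-ins (assumptions, not LTE algorithms): BPSK hard decision and a
# rate-1/3 repetition decoder in place of turbo decoding.
from concurrent.futures import ThreadPoolExecutor
from typing import List


def demodulate(samples: List[float]) -> List[int]:
    # BPSK hard decision: negative amplitude -> bit 1, otherwise bit 0
    return [1 if s < 0 else 0 for s in samples]


def decode_repetition3(bits: List[int]) -> List[int]:
    # majority vote over each group of three repeated bits
    return [1 if sum(bits[i:i + 3]) >= 2 else 0 for i in range(0, len(bits), 3)]


def process_code_block(samples: List[float]) -> List[int]:
    return decode_repetition3(demodulate(samples))


def process_subframe(code_blocks: List[List[float]], cores: int = 4) -> List[List[int]]:
    # code blocks are independent, so each can go to a different worker;
    # pool.map returns results in submission order
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(process_code_block, code_blocks))
```

The design point is that the unit of parallelism (here, a code block) must carry no dependency on its neighbors, so results can be gathered in order without cross-core synchronization during processing.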
11.
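A general-purpose, efficient FPGA-based multi-DSP parallel processing system is proposed and simulated. The simulation results show that the system's data read/write timing fully matches the timing required by the DSP chips, enabling high-speed parallel data processing and meeting the design goals.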
12.
13.
14.
15.
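The article analyzes various CORDIC processor architectures and gives a method for trading off area, time, and throughput at the circuit-architecture level according to specific design requirements. Using this method, it designs and implements a high-performance CORDIC processor for space applications that complies with the IEEE-754 single-precision standard and uses a pipelined structure with a granularity of 2. The method offers useful guidance for the circuit-architecture-level design of CORDIC processors.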
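A minimal rotation-mode CORDIC sketch follows, assuming that "pipeline granularity of 2" means two CORDIC iterations per pipeline stage (an interpretation, not confirmed by the abstract). It uses Python floats for clarity and does not model IEEE-754 single-precision conversion or fixed-point shift-add hardware; `N_ITER = 24` and the function names are assumptions.

```python
# Rotation-mode CORDIC computing (sin, cos). Iterations are grouped into
# pipeline stages of `granularity` iterations each (assumed meaning of
# "granularity 2"). Floats stand in for the fixed-point datapath.
import math
from typing import List, Tuple

N_ITER = 24  # assumption: about one iteration per single-precision significand bit
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
# The micro-rotations scale the vector by prod(sqrt(1 + 2^-2i)); starting
# from x = 1/gain cancels that growth.
GAIN_INV = 1.0
for _i in range(N_ITER):
    GAIN_INV /= math.sqrt(1.0 + 2.0 ** (-2 * _i))


def pipeline_stages(n_iter: int, granularity: int) -> List[range]:
    """Group consecutive iterations; each group models one pipeline stage."""
    return [range(s, min(s + granularity, n_iter)) for s in range(0, n_iter, granularity)]


def cordic_stage(state: Tuple[float, float, float], iters: range) -> Tuple[float, float, float]:
    x, y, z = state
    for i in iters:
        d = 1.0 if z >= 0 else -1.0      # rotate toward residual angle z = 0
        t = 2.0 ** -i                    # a right shift by i in hardware
        x, y, z = x - d * y * t, y + d * x * t, z - d * ANGLES[i]
    return x, y, z


def cordic_sin_cos(theta: float, granularity: int = 2) -> Tuple[float, float]:
    """Valid for |theta| up to about 1.74 rad (the sum of ANGLES)."""
    state = (GAIN_INV, 0.0, theta)
    for iters in pipeline_stages(N_ITER, granularity):
        state = cordic_stage(state, iters)   # a pipeline register would sit here
    x, y, _ = state
    return y, x                              # (sin(theta), cos(theta))
```

With granularity 2, the 24 iterations map onto 12 stages; raising the granularity cuts pipeline registers (area) at the cost of a longer critical path per stage, which is the area/time/throughput trade-off the abstract refers to.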
16.
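This paper describes the design of a parallel operating mode for a digital signal processor built on two ADI ADSP-TS201S DSP chips. The master chip is booted from EPROM and the slave chip through the link port. The division of labor between the two DSPs is briefly described: the master DSP exchanges data with external systems and controls and manages both DSPs, while the slave DSP is dedicated to the system's core algorithm. The two DSPs synchronize the algorithm through DMA interrupts so that the whole system runs in real time. The system composition and the implementation of remote management and control are outlined, and the design used to update the master chip's remote parameter database and core algorithm program is described in detail. The master chip receives external information and data in interrupt mode.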
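The master/slave division of labor can be modeled in software to show the synchronization pattern. This is a hedged behavioral sketch, not ADSP-TS201S code: a thread stands in for the slave DSP, and a blocking `queue.Queue` hand-off stands in for a link-port DMA transfer plus its completion interrupt. `master_dsp` and `slave_dsp` are hypothetical names.

```python
# Behavioral model of the dual-DSP scheme: the master handles external I/O
# and control, the slave runs the core algorithm. Queue hand-offs stand in
# for DMA transfers and their completion interrupts (assumption).
import queue
import threading
from typing import Any, Callable, List


def slave_dsp(inbox: queue.Queue, outbox: queue.Queue,
              algorithm: Callable[[Any], Any]) -> None:
    while True:
        frame = inbox.get()              # blocks, like waiting on a DMA interrupt
        if frame is None:                # shutdown sentinel
            break
        outbox.put(algorithm(frame))     # "DMA back" the result


def master_dsp(frames: List[Any], algorithm: Callable[[Any], Any]) -> List[Any]:
    to_slave: queue.Queue = queue.Queue()
    from_slave: queue.Queue = queue.Queue()
    worker = threading.Thread(target=slave_dsp, args=(to_slave, from_slave, algorithm))
    worker.start()
    results = []
    for frame in frames:                 # frames received from outside the system
        to_slave.put(frame)              # hand the frame to the slave
        results.append(from_slave.get()) # wait for the slave's completion signal
    to_slave.put(None)
    worker.join()
    return results
```

For example, `master_dsp([[1, 2], [3, 4]], sum)` returns `[3, 7]`. The lockstep hand-off keeps the two processors synchronized per frame, mirroring the role the abstract assigns to the DMA interrupts.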
17.
18.
A. Broggi G. Conte F. Gregoretti C. Sansoè R. Passerone L.M. Reyneri 《The Journal of VLSI Signal Processing》1998,19(1):5-18
In this paper, PAPRICA, a massively parallel coprocessor dedicated to the analysis of bitmapped images, is presented, covering first the computational model, then the architecture and its implementation, and finally the performance analysis. The main goal of the project was to develop a subsystem to be attached to a standard workstation and to operate as a specialized processing module in dedicated systems. The computational model is closely related to the concepts of mathematical morphology, so the instruction set of the processing units implements basic morphological transformations. Moreover, a specific processor virtualization mechanism allows multiresolution data sets to be handled and processed. The actual implementation consists of a mesh of 256 single-bit processing units operating in SIMD style and is based on a set of custom VLSI circuits. The architecture includes specific hardware extensions that significantly improve performance in real-time applications.
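The basic morphological transformations mentioned above can be sketched on a binary bitmap: every pixel applies the same operation to its 3x3 neighborhood, which is the data-parallel pattern a SIMD mesh of single-bit units exploits. This is a minimal sketch, assuming a 3x3 square structuring element and treating pixels outside the image as background; it does not reproduce PAPRICA's instruction set.

```python
# Binary dilation and erosion with a 3x3 square structuring element.
# Each pixel applies the same OR (dilation) or AND (erosion) over its
# window, mimicking one single-bit processing unit per pixel.
from typing import Iterator, List

Bitmap = List[List[int]]


def neighborhood(img: Bitmap, r: int, c: int) -> Iterator[int]:
    """Values of the 3x3 window around (r, c); outside pixels count as 0."""
    rows, cols = len(img), len(img[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            yield img[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0


def dilate(img: Bitmap) -> Bitmap:
    return [[int(any(neighborhood(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def erode(img: Bitmap) -> Bitmap:
    return [[int(all(neighborhood(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]
```

Eroding a 3x3 block of ones in a 5x5 image leaves only its center pixel, and dilating that pixel restores the block; composing the two (`dilate(erode(img))`, a morphological opening) removes features smaller than the structuring element.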
19.
20.
This paper presents an implementation approach for testing the routers in a fine-grain massively parallel architecture. First, an ad hoc test technique that propagates test messages router by router is analyzed. Although the technique adds no hardware, it is shown to be inefficient and inapplicable because of practical constraints such as the limited number of pins on the chip implementing the machine. Based on a hierarchical implementation of the IEEE 1149.1 standard, two approaches are then proposed and compared in terms of area overhead, overall test time, and flexibility in applying tests and diagnosing the routers inside the machine. The basic idea of both approaches is to build groups of basic cells driven by the same test block and to compare their test results after the same test vectors are applied to each cell's inputs. The two approaches differ in the granularity of a basic cell. Choosing between them is not trivial, because neither dominates: the approach that gives better fault coverage and shorter test time requires more silicon and offers fewer diagnostic possibilities than the second approach.
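The group-comparison idea can be sketched as follows: since every cell in a group receives identical test vectors, fault-free cells should return identical responses, so a cell whose response differs is suspect. How the paper's hardware resolves the reference response is not stated in the abstract; this sketch assumes a majority vote (and therefore that most cells in a group are fault-free). `diagnose_group` is a hypothetical name.

```python
# Compare test responses of cells driven by the same test block. The
# majority response is used as the reference (assumption: most cells in
# a group are fault-free); cells that disagree are reported as suspect.
from collections import Counter
from typing import Dict, List


def diagnose_group(responses: Dict[str, List[int]]) -> List[str]:
    counts = Counter(tuple(r) for r in responses.values())
    reference, _ = counts.most_common(1)[0]
    return sorted(cell for cell, r in responses.items() if tuple(r) != reference)
```

For example, if cells `c0`, `c1`, and `c3` return `[1, 0, 1]` while `c2` returns `[1, 1, 1]`, the function reports `["c2"]`. Larger groups (coarser basic cells) need fewer comparators but localize a fault less precisely, which reflects the diagnosis trade-off described above.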