Found 20 similar documents; search took 15 ms
1.
2.
3.
Tim O Gara 《电子产品世界》2005,(22):67-68
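DSP architectures can be divided into fixed-point (FXP) and floating-point (FLP) types. Although an FXP DSP natively performs only integer arithmetic, it is fast, uses few resources, and costs less than an FLP device. Running FLP algorithms on an FXP DSP, in turn, gives higher precision and a wider dynamic range. Demand for FLP support on FXP DSP architectures keeps growing, for the following reasons. First, algorithm code is usually written in C/C++ using floating-point types, and converting an FLP algorithm to FXP format is laborious, whereas porting the floating-point algorithm to the DSP platform takes less time, so FLP lowers development cost. In addition, commonly used algorithms benefit from the larger numeric range that floating-point arithmetic provides. Finally, in some cases an FXP algorithm cannot deliver the required precision and dynamic range.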
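To make the fixed-point trade-off in the abstract above concrete, here is a minimal sketch of Q15 arithmetic, a common 16-bit fixed-point format. The choice of Q15 and the helper names `to_q15`, `from_q15`, and `q15_mul` are assumptions for illustration; they are not taken from the article.

```python
Q = 15                                  # fractional bits (Q15 is an assumed format)
SCALE = 1 << Q
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def to_q15(x):
    """Quantize a float to Q15, saturating outside roughly [-1, 1)."""
    return saturate(round(x * SCALE))


def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / SCALE


def q15_mul(a, b):
    """Multiply two Q15 values using integer operations only."""
    return saturate((a * b) >> Q)


if __name__ == "__main__":
    a, b = to_q15(0.5), to_q15(0.25)
    print(from_q15(q15_mul(a, b)))      # product of 0.5 and 0.25 in Q15
    print(from_q15(to_q15(3.0)))        # 3.0 is out of range and saturates
```

Every value has to be scaled into the representable range by hand, which is the conversion effort the abstract describes; values outside that range saturate, which is the dynamic-range limit.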
4.
This paper presents a technique for simulating processors based on the principle of compiled simulation. Unlike existing, commercially available instruction set simulators for DSPs, which are interpretive, the proposed technique performs instruction decoding and simulation scheduling at compile time. The technique offers up to three orders of magnitude faster simulation. The high speed allows the user to explore algorithms and hardware/software trade-offs before any hardware implementation. Moreover, the user can tailor the compiled simulation to trade speed for more accuracy. In this paper, the sources of the speedup and the limitations of the technique are analyzed and the realization of the simulation compiler is presented.
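The difference between interpretive and compiled simulation can be sketched on a toy example. The three-instruction register machine, `PROGRAM`, `interpret`, and `compile_program` below are all hypothetical and are not the simulator described in the abstract above; the sketch only shows decoding moved from run time to a one-off translation step.

```python
# Toy program for a hypothetical 4-register machine.
PROGRAM = [("li", 0, 5), ("li", 1, 7), ("add", 2, 0, 1), ("mul", 2, 2, 2)]


def interpret(program):
    """Interpretive simulation: every instruction is decoded each time it runs."""
    regs = [0] * 4
    for op, *args in program:
        if op == "li":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "mul":
            regs[args[0]] = regs[args[1]] * regs[args[2]]
    return regs


def compile_program(program):
    """Compiled simulation: decode once and emit straight-line host code."""
    lines = ["def run():", "    regs = [0] * 4"]
    for op, *args in program:
        if op == "li":
            lines.append(f"    regs[{args[0]}] = {args[1]}")
        elif op == "add":
            lines.append(f"    regs[{args[0]}] = regs[{args[1]}] + regs[{args[2]}]")
        elif op == "mul":
            lines.append(f"    regs[{args[0]}] = regs[{args[1]}] * regs[{args[2]}]")
    lines.append("    return regs")
    namespace = {}
    exec("\n".join(lines), namespace)   # translate once, run many times
    return namespace["run"]


if __name__ == "__main__":
    run = compile_program(PROGRAM)
    print(interpret(PROGRAM), run())
```

The generated `run` function contains no opcode dispatch at all, which is where a compiled simulator saves time when the same program is executed repeatedly.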
5.
Chi Ching Chi Mauricio Alvarez-Mesa Jan Lucas Ben Juurlink Thomas Schierl 《Journal of Signal Processing Systems》2013,71(3):247-260
The Joint Collaborative Team on Video Coding is developing a new standard named High Efficiency Video Coding (HEVC) that aims at reducing the bitrate of H.264/AVC by another 50%. In order to fulfill the computational demands of the new standard, in particular for high resolutions and at low power budgets, exploiting parallelism is no longer an option but a requirement. Therefore, HEVC includes several coding tools that allow each picture to be divided into several partitions that can be processed in parallel without degrading either quality or bitrate. In this paper we adapt one of these approaches, Wavefront Parallel Processing (WPP) coding, and show how it can be implemented on multi- and many-core processors. Our approach, named Overlapped Wavefront (OWF), processes several partitions as well as several pictures in parallel. This has the advantage that the amount of (thread-level) parallelism stays constant during execution. In addition, performance and power results are provided for three platforms: a server Intel CPU with 8 cores, a laptop Intel CPU with 4 cores, and a TILE-Gx36 with 36 cores from Tilera. The results show that our parallel HEVC decoder is capable of achieving an average frame rate of 116 fps for 4k resolution on a standard multicore CPU. The results also demonstrate that exploiting more parallelism by increasing the number of cores can substantially improve the energy efficiency, measured in joules per frame.
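The wavefront dependency pattern behind WPP can be sketched as a small scheduling model. It assumes unit decode time per coding tree unit (CTU), unlimited threads, and that each CTU waits for its left neighbour and for the top-right neighbour in the row above. The function names are hypothetical, and this is not the decoder from the abstract above.

```python
def wpp_schedule(rows, cols):
    """Earliest start step of each CTU under the assumed WPP dependencies."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1])                     # left neighbour
            if r > 0:
                # top-right neighbour (top neighbour in the last column)
                deps.append(start[r - 1][min(c + 1, cols - 1)])
            start[r][c] = max(deps) + 1 if deps else 0
    return start


def parallelism(start):
    """Number of CTUs that can be decoded concurrently at each step."""
    steps = max(max(row) for row in start) + 1
    counts = [0] * steps
    for row in start:
        for step in row:
            counts[step] += 1
    return counts


if __name__ == "__main__":
    print(parallelism(wpp_schedule(4, 8)))
```

Within a single picture the available parallelism ramps up at the start and down at the end. Overlapping the decoding of consecutive pictures, as OWF does, is aimed at filling those ramps.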
6.
V. Kitsakis K. Nakos D. Reisis N. Vlassopoulos 《Journal of Signal Processing Systems》2018,90(11):1593-1607
This paper introduces an efficient technique for parallel data addressing in FFT architectures that perform in-place computations. The novel addressing organization provides parallel load and store of the data involved in radix-r butterfly computations and leads to an efficient architecture when r is a power of 2. The addressing scheme is based on a permutation of the FFT data, which improves the address-generating circuit and the butterfly processor control. Moreover, the proposed technique is suitable for mixed-radix applications, especially for radices that are powers of 2, and for straightforward continuous-flow implementation. The paper presents the technique and the resulting FFT architecture and shows the advantages of the architecture compared to previously published results. Implementations of the in-place radix-8 FFT architectures on a Xilinx Virtex-7 VC707 FPGA, with input sizes of 64 and 512 complex points, validate the results.
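The requirement for parallel load and store in an in-place radix-r FFT is that the r operands of every butterfly sit in r different memory banks. The sketch below checks this for the classic digit-sum bank assignment (bank = sum of the radix-r digits of the index, modulo r). This is a well-known textbook scheme used here as a stand-in; it is not the permutation-based scheme proposed in the abstract above, and the function names are hypothetical.

```python
def digits(index, radix, num_digits):
    """Radix-r digits of index, least significant first."""
    out = []
    for _ in range(num_digits):
        out.append(index % radix)
        index //= radix
    return out


def bank(index, radix, num_digits):
    """Digit-sum bank assignment (assumed scheme, not the paper's)."""
    return sum(digits(index, radix, num_digits)) % radix


def conflict_free(radix, num_digits):
    """True if every butterfly in every stage touches all banks exactly once."""
    n = radix ** num_digits
    for stage in range(num_digits):
        stride = radix ** stage
        for base in range(n):
            if (base // stride) % radix != 0:
                continue            # visit each butterfly once, via its digit-0 member
            group = [base + k * stride for k in range(radix)]
            banks = {bank(i, radix, num_digits) for i in group}
            if len(banks) != radix:
                return False
    return True


if __name__ == "__main__":
    print(conflict_free(8, 2), conflict_free(8, 3))   # 64 and 512 points, radix 8
```

The check works because the operands of one butterfly differ in exactly one digit, so their digit sums take r distinct values modulo r.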
7.
Scalable Parallel Memory Architectures for Video Coding   Total citations: 1, self-citations: 0, citations by others: 1
Current video compression standards, which process frames macroblock by macroblock, employ several processing functions to achieve compression. These functions access the data memory address space in different ways. For example, motion estimation and motion compensation frequently require data accesses that are not aligned to word boundaries. On the other hand, the Discrete Cosine Transform (DCT) and its inverse (IDCT) for an 8 × 8 block can be performed first for rows and then for columns, so a transposition is needed between these two stages. A parallel memory architecture can provide a solution for these tasks, among others. In another paper, we briefly surveyed parallel memory architectures and proposed parallel memory architecture designs with different data path widths for video coding applications. In this paper, we construct video coding function examples that use the proposed parallel data memory efficiently. Furthermore, the performance and implementation cost of the parallel memory architecture are estimated and compared with those of more conventional memory architectures. The examples are given for different data bus widths (16, 32, 64, and 128 bits). We show that the parallel memory can keep the data path fully utilized in many video coding function implementations. This ensures high-speed operation and full utilization of the processing resources.
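The row-column decomposition mentioned in the abstract above, with the transposition between the two stages, can be sketched in plain Python. The orthonormal DCT-II scaling and the function names are assumptions for illustration; the abstract does not specify either.

```python
import math

N = 8


def dct_1d(v):
    """Orthonormal DCT-II of a length-N sequence (assumed normalization)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        acc = sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N))
        out.append(scale * acc)
    return out


def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]


def dct_2d(block):
    """2-D DCT of an N x N block: rows, transpose, rows again, transpose back."""
    rows_done = [dct_1d(row) for row in block]           # stage 1: rows
    cols_as_rows = transpose(rows_done)                  # transposition step
    cols_done = [dct_1d(row) for row in cols_as_rows]    # stage 2: columns
    return transpose(cols_done)


if __name__ == "__main__":
    flat = [[1.0] * N for _ in range(N)]
    print(round(dct_2d(flat)[0][0], 6))                  # DC term of a flat block
```

In software the transposition is only a reindexing, but in hardware it means reading columns from a memory that was written by rows, which is one of the access patterns a parallel memory is meant to serve.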
8.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(7):872-877
Long Bose–Chaudhuri–Hocquenghem (BCH) codes are used as the outer error-correcting codes in the second-generation Digital Video Broadcasting Standard from the European Telecommunications Standards Institute. These codes can achieve around 0.6-dB additional coding gain over Reed–Solomon codes with similar code rate and codeword length in long-haul optical communication systems. BCH encoders are conventionally implemented by a linear feedback shift register architecture. High-speed applications of BCH codes require parallel implementation of the encoders. In addition, long BCH encoders suffer from the effect of large fanout. In this paper, three novel architectures are proposed to reduce the achievable minimum clock period for long BCH encoders after the fanout bottleneck has been eliminated. For an (8191, 7684) BCH code, compared to the original 32-parallel BCH encoder architecture without fanout bottleneck, the proposed architectures can achieve a speedup of over 100%.
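The conventional serial linear feedback shift register (LFSR) encoder that the abstract above takes as its starting point can be sketched for a small code. The example uses the (15, 7) BCH code with generator polynomial g(x) = x^8 + x^7 + x^6 + x^4 + 1 rather than the (8191, 7684) code from the paper, and it models the one-bit-per-clock encoder, not the proposed parallel architectures. The function names are hypothetical.

```python
G = 0b111010001            # g(x) = x^8 + x^7 + x^6 + x^4 + 1 for BCH(15, 7)
N_BITS, K_BITS = 15, 7
R = N_BITS - K_BITS        # number of parity bits = degree of g(x)
MASK = (1 << R) - 1
LOW_TAPS = G & MASK        # g(x) without its leading x^R term


def lfsr_parity(msg_bits):
    """Serial LFSR division: returns m(x) * x^R mod g(x) as an integer."""
    reg = 0
    for bit in msg_bits:                      # highest-degree coefficient first
        feedback = bit ^ ((reg >> (R - 1)) & 1)
        reg = (reg << 1) & MASK
        if feedback:
            reg ^= LOW_TAPS
    return reg


def encode(msg_bits):
    """Systematic codeword: message bits followed by R parity bits."""
    parity = lfsr_parity(msg_bits)
    return list(msg_bits) + [(parity >> i) & 1 for i in range(R - 1, -1, -1)]


def poly_mod(value, gen):
    """Remainder of GF(2) polynomial division, polynomials stored as integers."""
    while value.bit_length() >= gen.bit_length():
        value ^= gen << (value.bit_length() - gen.bit_length())
    return value


if __name__ == "__main__":
    codeword = encode([0, 0, 0, 0, 0, 0, 1])
    print(codeword)
```

The feedback bit drives every tap of the register at once. In a long code with many taps this is the large-fanout signal mentioned in the abstract, and processing several message bits per clock makes the feedback logic deeper still.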
9.
10.
Israel Martínez-Pérez Wolfgang Brandt Michael Wild Karl-Heinz Zimmermann 《Journal of Signal Processing Systems》2010,58(2):117-124
The stickers model is a model of DNA computation that is computationally complete and universal. Many NP-complete problems can be described by stickers programs that have polynomial runtime and are exponential in space. The stickers model can be viewed as a bit-vertically operating register machine. This makes it attractive for in silico implementation. This paper describes a stickers model for the maximum clique problem and its implementation by an FPGA architecture. The results show that the FPGA-based algorithm is comparable with existing software algorithms for moderate problem sizes. More generally, the stickers model seems to be a well-suited programming model for dedicated hardware.
11.
《IEEE transactions on circuits and systems. I, Regular papers》2008,55(11):3438-3447
12.
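A design method is proposed for a 10-bit, 300 MHz sampling-rate, 4-channel parallel pipelined A/D converter based on an optimized time-overlapping technique, which relaxes the requirements on the operational amplifier. Theoretical calculation and a design example demonstrate the marked effectiveness of this low-power design method. An operational amplifier for the front end was designed; in a CSM 0.35 μm CMOS process with a 3.3 V supply, the amplifier has a gain of 106 dB, a unity-gain bandwidth of 402 MHz, and a settling time of 8.8 ns. With the optimized time-overlapping technique, it meets the requirement of 4-channel parallel operation at a 300 MHz sampling rate while consuming only 8.57 mW, which can greatly reduce the power consumption of the whole parallel pipelined A/D converter.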
13.
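Parallel Processing Methods for DSPs   Total citations: 3, self-citations: 0, citations by others: 3
TI's TMS320C6x and Analog Devices' ADSP2106x are digital signal processors (DSPs) in wide use in the industry today. This paper describes in detail how to build parallel DSP processing systems using the TMS320C6x's HPI and McBSP interfaces and the ADSP2106x's Link interface, and discusses the advantages and disadvantages of these methods.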
14.
15.
16.
17.
18.
Alberto Tarable Libero Dinoi Sergio Benedetto 《Communications Letters, IEEE》2007,11(2):167-169
In this paper we propose a technique to implement, in a parallel fashion, a turbo decoder based on an arbitrary permutation, and to expand its interleaver in order to produce a family of prunable S-random interleavers suitable for parallel implementations. We show that the spread properties of the obtained interleavers are almost optimal and we show by simulation that they are very competitive in terms of error floor performance. A few details on the decoder architecture are also provided.
19.
Alexandre Borghi Jérôme Darbon Sylvain Peyronnet Tony F. Chan Stanley Osher 《Journal of Signal Processing Systems》2013,71(1):1-20
In this paper we consider the ℓ1-compressive sensing problem. We propose an algorithm specifically designed to take advantage of shared-memory, vectorized, parallel and many-core microprocessors such as the Cell processor, new-generation Graphics Processing Units (GPUs) and standard vectorized multi-core processors (e.g. quad-core CPUs). In addition, it is easy to implement. We also give evidence of the efficiency of our approach and compare the algorithm on the three platforms, thus exhibiting the pros and cons of each.