Found 20 similar documents (search time: 15 ms)
1.
2.
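Building on the Max-Min Ant System, this work gives the ants an added scent-discrimination ability and applies the method to the routing problem of coarse-grained reconfigurable chips. Using the authors' coarse-grained reconfigurable chip CTaiJi as the target, comparisons on several test cases show that the method is better at finding the optimal solution than the commonly used negotiated congestion algorithm.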
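As a rough illustration of the base algorithm named in this abstract, the following minimal sketch shows the standard Max-Min Ant System pheromone update: evaporation, reinforcement along the best path, and clamping to a fixed interval. It does not reproduce the scent-discrimination extension, which the abstract does not specify. The function name `mmas_update`, the edge-dictionary representation, and the parameter values are assumptions made for illustration only.

```python
def mmas_update(pheromone, best_path, best_cost, rho=0.02, tau_min=0.01, tau_max=5.0):
    """One standard MMAS pheromone update (hypothetical helper, not the paper's code).

    pheromone: dict mapping a directed edge (u, v) to its pheromone level.
    best_path: list of nodes on the best route found so far.
    best_cost: cost of that route (must be positive).
    """
    deposit = 1.0 / best_cost
    best_edges = set(zip(best_path, best_path[1:]))
    updated = {}
    for edge, tau in pheromone.items():
        tau = (1.0 - rho) * tau              # evaporation on every edge
        if edge in best_edges:
            tau += deposit                   # only the best path is reinforced
        # MMAS keeps every trail inside [tau_min, tau_max]
        updated[edge] = min(tau_max, max(tau_min, tau))
    return updated


trails = {("a", "b"): 5.0, ("b", "c"): 0.01, ("a", "c"): 0.01}
trails = mmas_update(trails, ["a", "b", "c"], best_cost=2.0, rho=0.5)
```

The clamping step is what distinguishes MMAS from the basic ant system: it keeps unused edges from decaying to zero and heavily used edges from dominating the search.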
3.
Benaissa M. Yiqun Zhu 《IEEE transactions on circuits and systems. I, Regular papers》2007,54(3):555-565
A novel reconfigurable sequential decoder architecture based on the Fano algorithm is presented in which the constraint length, the threshold spacing, and the time-out threshold are all run-time reconfigurable. To maximize decoding performance, a maximum possible backward depth (of a whole frame) is performed. This is achieved by using shift registers combined with memory to store the information of an entire visited path. A field-programmable gate array (FPGA) prototype of the decoder is built, actual hardware decoding performance in terms of decoding speed, bit error rate (BER), and buffer overflow rate is obtained, and comparisons are made. To overcome the decoding delay that is inherent in sequential decoders, a hybrid scheme including simple block codes and a cyclic redundancy check is proposed to limit the number of backward search operations that the sequential decoder has to execute. As a result, a significant reduction in decoding delay and buffer overflow rate is achieved while maintaining comparable decoding performance in terms of BER.
4.
Application of Reconfigurable CORDIC Architectures (total citations: 1; self-citations: 0; citations by others: 1)
Oskar Mencer Luc Séméria Martin Morf Jean-Marc Delosme 《The Journal of VLSI Signal Processing》2000,24(2-3):211-221
Reconfiguration enables the adaptation of Coordinate Rotation DIgital Computer (CORDIC) units to the specific needs of sets of applications, hence creating application-specific CORDIC-style implementations. Reconfiguration can be implemented at a high level, taking the entire CORDIC unit as a basic cell (CORDIC-cells) implemented in VLSI, or at a low level such as Field-Programmable Gate Arrays (FPGAs). We suggest a design methodology and analyze area/time results for coarse (VLSI) and fine-grain (FPGA) reconfigurable CORDIC units. For FPGAs we implement CORDIC units in Verilog HDL and our object-oriented design environment, PAM-Blox. For CORDIC-cells, multiple reconfigurable CORDIC modules are synthesized with state-of-the-art CAD tools. At the algorithm level we present a case study combining multiple CORDICs based on a geometrical interpretation of a normalized ladder algorithm for adaptive filtering to reduce latency and area of a fully pipelined CORDIC implementation. Ultimately, the goal is to create automatic tools to map applications directly to reconfigurable high-level arithmetic units such as CORDICs.
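For readers unfamiliar with the CORDIC unit this abstract builds on, the following is a minimal software sketch of the textbook rotation-mode CORDIC iteration, which computes sine and cosine with shift-and-add style steps. It is not the paper's Verilog or PAM-Blox design; the function name `cordic_rotate` and the iteration count are assumptions for illustration, and floating point stands in for the fixed-point arithmetic a hardware unit would use.

```python
import math


def cordic_rotate(angle, iterations=32):
    """Return (cos(angle), sin(angle)) via rotation-mode CORDIC.

    Valid for |angle| <= pi/2, which lies inside the CORDIC convergence range.
    """
    # Elementary rotation angles atan(2^-i) and the accumulated gain of the
    # un-normalised micro-rotations.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Start from (1/gain, 0) so the final vector has unit length.
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y


c, s = cordic_rotate(0.5)
```

In hardware, the multiplications by `2.0 ** -i` become bit shifts, which is why the unit maps well onto both VLSI cells and FPGAs.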
5.
Dynamically reconfigurable hardware has already been deployed for accelerating computationally demanding applications. Some
of these hardware architectures allow run time reconfiguration but this usually leads to a large reconfiguration overhead.
The advantage of run time reconfiguration is that it allows new algorithmic solutions for many applications. To study the
potential of frequent run time reconfiguration it is interesting to investigate its costs and benefits from an abstract point
of view and to develop new architectural concepts. Multi-level reconfigurable architectures are one such concept that introduces
several levels of reconfiguration. This paper deals with new types of multi-level reconfigurable architectures. The corresponding
problem of finding the best granularity for different reconfiguration levels is formulated and investigated. Although this
problem is shown to be NP-complete, an interesting restricted subcase is solved optimally in polynomial time. For the general
case, a good heuristic is proposed that is based on solutions for the restricted case. Results on three example applications
show that the reconfiguration cost can be reduced with the new architectures. Based on a proposed measure of relative efficiency
it is also shown that the new architectures are more efficient so that they obtain a larger reconfiguration cost reduction
with less additional hardware.
Martin Middendorf
6.
Zambreno J. Honbo D. Choudhary A. Simha R. Narahari B. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2006,94(2):419-431
One of the key problems facing the computer industry today is ensuring the integrity of end-user applications and data. Researchers in the relatively new field of software protection investigate the development and evaluation of controls that prevent the unauthorized modification or use of system software. While many previously developed protection schemes have provided a strong level of security, their overall effectiveness has been hindered by a lack of transparency to the user in terms of performance overhead. Other approaches go to the opposite extreme and sacrifice security for the sake of this transparency. In this work we present an architecture for software protection that provides a high level of both security and user transparency by utilizing field programmable gate array (FPGA) technology as the main protection mechanism. We demonstrate that by relying on FPGA technology, this approach can accelerate the execution of programs in a cryptographic environment, while maintaining the flexibility through reprogramming to carry out any compiler-driven protections that may be application-specific.
7.
Nuzzo P. De Bernardinis F. Terreni P. Van der Plas G. 《IEEE transactions on circuits and systems. I, Regular papers》2008,55(6):1441-1454
8.
9.
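This paper introduces the principles of the JPEG algorithm, designs a workflow that parallelizes it for execution on a multi-core processor architecture, and, through optimizations such as memory-read tuning, greatly improves the speed of JPEG decoding. Measurements show that on a processor integrating three cores, the average decoding cycle count for a JPEG image is about 36% of that on a single-core processor.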
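To put the reported 36% figure in more familiar terms, the short calculation below converts it to a speedup and a parallel efficiency. This is plain arithmetic on the number quoted in the abstract, not a result from the paper; the helper name `speedup_and_efficiency` is hypothetical.

```python
def speedup_and_efficiency(cycle_fraction, cores):
    """Convert 'parallel cycles as a fraction of single-core cycles' to speedup/efficiency."""
    speedup = 1.0 / cycle_fraction
    return speedup, speedup / cores


speedup, efficiency = speedup_and_efficiency(0.36, 3)
print(round(speedup, 2), round(efficiency, 3))  # prints: 2.78 0.926
```

A speedup of roughly 2.78 on three cores corresponds to about 93% parallel efficiency.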
10.
11.
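For the hardware task partitioning and mapping problem on row-pipelined coarse-grained reconfigurable architectures under multiple constraints, a multi-objective optimization mapping algorithm is proposed. The algorithm constructs a cumulative probability weight function from factors such as the execution latency and degree of dependence of operation nodes. Subject to constraints on reconfigurable cell area and interconnect, it uses the value of this function to dynamically adjust the order in which ready nodes are mapped and scheduled. When the current row of a reconfigurable cell array has been mapped, the algorithm moves to the next row; when an array is full, it switches to the next array; and when a data-flow graph has been fully mapped, it computes parameters such as the number of partitions. Experimental results show that, compared with the level-based greedy mapping algorithm, the proposed algorithm reduces the average total execution cycles by 8.4% (RCA4×4) and 5.3% (RCA6×6), and compared with the split-compression kernel mapping algorithm, by 20.6% (RCA4×4) and 21.0% (RCA6×6), which validates the proposed algorithm.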
12.
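DSP processors achieve high performance by adopting a VLIW architecture, which also makes it harder for a compiler to generate assembly code for them. The code generator, as the compiler's code-generation component, is key to realizing the performance of a VLIW architecture. This paper therefore proposes and implements a code generator based on a retargetable compilation framework. The code generator fully exploits the architectural features of VLIW, supports SIMD instructions and predicated execution, and can generate assembly code with a high degree of instruction-level parallelism, significantly improving application performance.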
13.
Yuan Xie Wolf W. Lekatsas H. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(8):975-980
Code size "bloating" in embedded very long instruction word (VLIW) processors is a major concern for embedded systems since memory is one of the most restricted resources. In this paper, we describe a code compression algorithm based on arithmetic coding, discuss how to design the decompression architecture, and illustrate the tradeoffs between compression ratio and decompression overhead by using different probability models. Experimental results for a VLIW embedded processor, the TMS320C6x, show that compression ratios between 67% and 80% can be achieved, depending on the probability models used. A precache decompression unit design is implemented in TSMC 0.25 μm technology and a test chip is fabricated.
14.
Matthias Hartmann Vasileios Pantazis Tom Vander Aa Mladen Berekovic Christian Hochberger 《Journal of Signal Processing Systems》2010,60(2):225-237
Due to the increasing demands on efficiency, performance and flexibility, reconfigurable computational architectures are very
promising candidates in embedded systems design. Recently, coarse-grained reconfigurable array architectures (CGRAs), such
as the ADRES CGRA and its corresponding DRESC compiler, have been gaining popularity due to several technological breakthroughs
in this area. We investigate the mapping of two image processing algorithms, Wavelet encoding and decoding, and TIFF compression,
on this novel type of array architecture in a systematic way. The results of our experiments show that CGRAs based on ADRES
and its DRESC compiler technology deliver improved performance levels for these two benchmark applications when compared to
results obtained on a state-of-the-art commercial DSP platform, the c64x DSP from Texas Instruments. ADRES/DRESC can beat
its performance by at least 50% in cycle count, and the power consumption even drops to 10% of the published numbers of the
c64x DSP.
15.
This paper presents a new methodology for automatic RTL code generation from a coarse-grain dataflow specification for fast HW/SW
cosynthesis. A node in a coarse-grain dataflow specification represents a functional block such as FIR or DCT, and an arc
may deliver multiple data samples per block invocation, which complicates the problem and distinguishes it from the behavioral
synthesis problem. Given optimized HW library blocks for dataflow nodes, we aim to generate the RTL code for the entire hardware
system, including glue logic such as buffers and MUXes, and the central controller. In the proposed design methodology, a dataflow
graph can be mapped to various hardware structures by changing the resource allocation and schedule information. This simplifies
the management of the area/performance tradeoff in hardware design and widens the design space of hardware implementations
of a dataflow graph. We also support the Fractional Rate Dataflow (FRDF) specification for more efficient hardware implementation.
To overcome the additional hardware area overhead in the synthesized architecture, we propose two techniques for reducing buffer
overhead. Experiments with some real examples demonstrate the usefulness of the proposed technique.
Soonhoi Ha (corresponding author)
16.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(1):151-155
17.
18.
Yoon J.W. Shrivastava A. Sanghyun Park Minwook Ahn Yunheung Paek 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(11):1565-1578
Recently, coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA, and thus they i) are unable to map applications even though a mapping exists, and ii) use too many processing elements (PEs) to map an application. In this paper, we model several CGRA details, e.g., irregular CGRA topologies, shared resources and routing PEs, in our compiler and develop a graph-drawing-based approach, split-push kernel mapping (SPKM), for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5 times more applications than the previous approach, while generating mappings of better quality in terms of utilized CGRA resources. Utilizing fewer resources translates directly into increased opportunities for novel power and performance optimization techniques. Our technique shows lower power consumption in 71 cases and shorter execution cycles in 66 cases out of 100 synthetic applications, with minimal mapping time overhead. We observe similar results on a suite of benchmarks collected from the Livermore loops, Mediabench, Multimedia, Wavelet and DSPStone benchmarks. SPKM is not tailored to one specific CGRA template, which is demonstrated by exploring various PE interconnection topologies and shared resource configurations with SPKM.
19.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(12):1696-1707
20.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(11):1465-1474