Similar articles
 20 similar articles found (search time: 921 ms)
1.
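When processing massive data sets, software implementations of the Zstandard (Zstd) lossless compression algorithm struggle to meet the compression-speed requirements of certain application domains. Hardware acceleration of Zstd is an effective solution to this problem, especially hardware acceleration of Zstd's finite state entropy (FSE) coding. A hardware architecture for FSE compression and decompression suited to Zstd is therefore proposed, using fixed...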

2.
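A cell-level parallel lossless data compression method based on genetic evolution   Total citations: 5, self-citations: 1, citations by others: 4
帅典勋 (Shuai Dianxun), 顾静 (Gu Jing). 《计算机学报》 (Chinese Journal of Computers), 1999, 22(8): 797-803
First-order and second-order cellular automata are used to perform cell-level parallel lossless data compression. The data compression rules in the cellular automata are obtained by a genetic evolutionary algorithm, and the corresponding global permutation mappings are constructed. The correctness of the text compression rules is proved separately for the first-order and second-order cellular automata. The related time complexity and symbolic-dynamics properties are discussed.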

3.
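To meet the requirements of real-time covert transmission, a hardware implementation of a real-time information hiding algorithm with high embedding efficiency is presented. In n bits of modifiable host data, the algorithm embeds [log2(n+1)] bits of secret data while modifying at most 1 bit. The embedding algorithm has been applied successfully to information hiding in speech. The hardware design is based on two-level encryption and offers high security. The hardware implementation reaches a processing rate of 85.7 MHz, and 8-way parallel processing reaches a data throughput of 530 Mb/s, which is generally sufficient for real-time covert transmission.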
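The property stated above (at most one modified bit among n host bits, carrying about log2(n+1) message bits) is the one provided by Hamming-code matrix embedding. The abstract does not say which construction the authors use, so the Python sketch below is only an illustration under that assumption, restricted to n = 2^k - 1. The function names `syndrome`, `embed`, and `extract` are hypothetical.

```python
def syndrome(bits):
    """XOR of the 1-based positions of all set bits (the Hamming syndrome)."""
    s = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            s ^= position
    return s


def embed(host, message, k):
    """Hide a k-bit message in n = 2**k - 1 host bits, flipping at most one bit."""
    n = (1 << k) - 1
    if len(host) != n or not 0 <= message < (1 << k):
        raise ValueError("need 2**k - 1 host bits and a k-bit message")
    # The position to flip is the difference between the current
    # syndrome and the desired message; 0 means nothing has to change.
    flip = syndrome(host) ^ message
    stego = list(host)
    if flip:
        stego[flip - 1] ^= 1
    return stego


def extract(stego):
    """The receiver recovers the message as the syndrome of the stego bits."""
    return syndrome(stego)
```

For example, with k = 3 any 3-bit message fits into 7 host bits, and `extract(embed(host, m, 3))` returns `m` for every `m` in `range(8)`.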

4.
Modern complex embedded applications in multiple application fields impose stringent and continuously increasing functional and parametric demands. To adequately serve these applications, massively parallel multi-processor systems on a single chip (MPSoCs) are required. This paper is devoted to the design of scalable communication architectures of massively parallel hardware multi-processors for highly demanding applications. We demonstrate that in massively parallel hardware multi-processors the influence of the communication network on both throughput and circuit area dominates the influence of the processors, while the traditionally used flat communication architectures do not scale well as parallelism increases. Therefore, we propose to design highly optimized, application-specific, partitioned hierarchical organizations of the communication architectures by exploiting the regularity and hierarchy of the actual information flows of a given application. We developed related communication architecture synthesis strategies and incorporated them into our quality-driven, model-based multi-processor design methodology and the related automated architecture exploration framework. Using this framework we performed a large series of architecture synthesis experiments. Some of the results of the experiments are presented in this paper. They demonstrate many features of the synthesized communication architectures and show that our method and framework can efficiently synthesize highly scalable communication architectures even for high-end massively parallel multi-processors that have to satisfy extremely stringent computation demands.

5.
Parallelization of the Kalman filter algorithm, with emphasis on the specific demands of multicore architecture implementation, is investigated. The approach is based on the nonrestrictive assumption of a banded system matrix. Both time-varying and time-invariant systems can generally be transformed to such a form. The proposed method is applied to a radio interference power estimation problem for which speedup evaluations using up to eight cores are performed. It is shown that the algorithm is capable of achieving linear speedup in the number of cores used, while speedup factors for a parallel BLAS implementation are less than two. An algorithm analysis that provides guidelines for choosing implementation hardware to meet a desired performance is also provided.

6.
This paper presents an analytical performance prediction model and methodology that can be used to predict the execution time, speedup, scalability and similar performance metrics of a large set of image processing operations running on a p-processor parallel system. The model, which requires only a few parameters obtainable on a minimal system, can help in the systematic design, evaluation and performance tuning of parallel image processing systems. Using the model, one can reason about the performance of a parallel image processing system prior to implementation. The method can also support programmers in detecting critical parts of an implementation, and system designers in predicting hardware performance and the effect of hardware parameter changes on performance. The execution of parallel image processing operations was studied, and operations were arranged in three main problem classes based on data locality and the communication patterns of the algorithms. The core of the method is the derivation of the overhead function, as it is the overhead that determines the achievable speedup. The overheads were examined and modelled for each class. The use of the method is illustrated by four class-representative image processing algorithms: image-scalar addition, convolution, histogram calculation and the Fast Fourier Transform. The developed performance model has been validated on a 16-node parallel machine, and it has been shown that the model is able to predict the parallel run-time and other performance metrics of parallel image processing operations accurately.

7.
A hardware accelerator for self-organizing feature maps is presented. We have developed a massively parallel architecture that, on the one hand, allows a resource-efficient implementation of small or medium-sized maps for embedded applications, requiring only small areas of silicon. On the other hand, large maps can be simulated with systems that consist of several integrated circuits that work in parallel. Apart from the learning and recall of self-organizing feature maps, the hardware accelerates data pre- and postprocessing. For the verification of our architectural concepts in a real-world environment, we have implemented an ASIC that is integrated into our heterogeneous multiprocessor system for neural applications. The performance of our system is analyzed for various simulation parameters. Additionally, the performance that can be achieved with future microelectronic technologies is estimated.

8.
Burdened by their popularity, recommender systems increasingly take on larger datasets while they are expected to deliver high quality results within reasonable time. To meet these ever growing requirements, industrial recommender systems often turn to parallel hardware and distributed computing. While the MapReduce paradigm is generally accepted for massive parallel data processing, it often entails complex algorithm reorganization and suboptimal efficiency because mid-computation values are typically read from and written to hard disk. This work implements an in-memory, content-based recommendation algorithm and shows how it can be parallelized and efficiently distributed across many homogeneous machines in a distributed-memory environment. By focusing on data parallelism and carefully constructing the definition of work in the context of recommender systems, we are able to partition the complete calculation process into any number of independent and equally sized jobs. An empirically validated performance model is developed to predict parallel speedup and promises high efficiencies for realistic hardware configurations. For the MovieLens 10M dataset we note efficiency values up to 71% for a configuration of 200 computing nodes (eight cores per node).

9.
We present a novel approach to ray tracing execution on commodity graphics hardware using CUDA. We decompose a standard ray tracing algorithm into several data-parallel stages that are mapped efficiently to the massively parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frustums for packets, breadth-first frustum traversal through a bounding volume hierarchy for the scene, and localized ray-primitive intersections. We utilize the well-known parallel primitives scan and segmented scan in order to process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages. Our ray sorting stage is based on applying hash values to individual rays, ray stream compression, sorting and decompression. Our breadth-first BVH traversal is based on parallel frustum-bounding box intersection tests and a parallel scan for each BVH level. We demonstrate our algorithm with area light sources to get a soft shadow effect and show that our concept is reasonable for GPU implementation. For the same data sets and ray-primitive intersection routines our pipeline is ~3x faster than an optimized standard depth-first ray tracer implemented in one kernel.
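The scan primitive mentioned above is commonly used to turn per-element flags into output addresses, which is how irregular, variable-sized results can be written without a stack or per-thread branching. The Python sketch below shows that idea sequentially as generic stream compaction. It is not the authors' CUDA pipeline, and the names `exclusive_scan` and `compact` are hypothetical.

```python
def exclusive_scan(values):
    """Return (prefix sums excluding the current element, grand total)."""
    prefix = []
    running = 0
    for value in values:
        prefix.append(running)
        running += value
    return prefix, running


def compact(items, keep):
    """Keep the flagged items, preserving order, using scan-derived addresses."""
    flags = [1 if k else 0 for k in keep]
    offsets, count = exclusive_scan(flags)
    result = [None] * count
    # On a GPU every iteration of this loop would be an independent thread:
    # each surviving element already knows its unique output slot.
    for i, item in enumerate(items):
        if flags[i]:
            result[offsets[i]] = item
    return result
```

Because the write address of each element depends only on the scan result, the scatter step contains no data-dependent control flow beyond the flag test.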

10.
An FPGA-based RGBD imager   Total citations: 1, self-citations: 0, citations by others: 1
This paper describes a trinocular stereo vision system, called the RGBD imager, that uses a single FPGA chip to generate a composite color (RGB) and disparity data stream at video rate. The system uses a triangular configuration of three cameras for synchronous image capture and a trinocular adaptive cooperative algorithm based on local aggregation for smooth and accurate dense disparity mapping. We design a fine-grained parallel, pipelined architecture in the FPGA to achieve high computational and real-time throughput. A binary floating-point format is customized for data representation to satisfy the wide data range and high computation precision demands of the disparity calculation. Memory management and data bit-width control are applied in the system to reduce hardware resource consumption and accelerate processing. The system is able to produce dense disparity maps of 320 × 240 pixels with a disparity search range of 64 pixels at a rate of 30 frames per second.

11.
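A faster hyper-parallel data compression method using higher-order cellular automata   Total citations: 1, self-citations: 0, citations by others: 1
Higher-order permutation mappings are constructed, yielding a more effective hyper-parallel data compression method based on higher-order cellular automata. Without increasing the overall structural complexity of the cellular automaton, the processing speed can be several times that of the parallel compression method in reference [1]. The correctness and feasibility of the cell-level lossless data compression rules for higher-order cellular automata obtained by a genetic evolutionary algorithm are proved, and the related time complexity and the effectiveness of the higher-order data compression method are discussed.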

12.
High chip-level integration enables the implementation of large-scale parallel processing architectures with 64 or more processing nodes on a single chip or on an FPGA device. These parallel systems require a cost-effective yet high-performance interconnection scheme to provide the needed communications between processors. The massively parallel Network on Chip (mpNoC) was proposed to address the demand for parallel irregular communications in the massively parallel processing System on Chip (mppSoC). Targeting FPGA-based design, an efficient low-level RTL implementation of mpNoC is proposed that takes design constraints into account. The proposed network is designed as an FPGA-based Intellectual Property (IP) block that can be configured in different communication modes. It can communicate between processors and also perform parallel I/O data transfer, which is a key issue in an SIMD system. The mpNoC RTL implementation performs well in terms of area, throughput and power consumption, which are important metrics for an on-chip implementation. mpNoC is a flexible architecture that is suitable for use in FPGA-based parallel systems. This paper introduces the basic mppSoC architecture. It mainly focuses on the flexible IP-based design of mpNoC and its implementation on FPGA. The integration of mpNoC in mppSoC is also described. Implementation results on a Stratix II FPGA device are given for three data-parallel applications run on mppSoC. The good performance obtained justifies the effectiveness of the proposed parallel network. It is shown that mpNoC is a lightweight parallel network, making it suitable for both small and large FPGA-based parallel systems.

13.
Recent computer systems and handheld devices are equipped with high computing capability, such as general purpose GPUs (GPGPU) and multi-core CPUs. Utilizing such resources for computation has become a general trend, making their availability an important issue for the real-time aspect. Discrete cosine transform (DCT) and quantization are two major operations in image compression standards that require complex computations. In this paper, we develop an efficient parallel implementation of the forward DCT and quantization algorithms for JPEG image compression using Open Computing Language (OpenCL). This OpenCL-based parallel implementation utilizes a multi-core CPU and a GPGPU to perform DCT and quantization computations. We demonstrate the capability of this design via two proposed working scenarios. The proposed approach also applies certain optimization techniques to improve the kernel execution time and data movements. We developed an optimal OpenCL kernel for a particular device using device-based optimization factors, such as thread granularity, work-item mapping, workload allocation, and vector-based memory access. We evaluated the performance in a heterogeneous environment, finding that the proposed parallel implementation was able to speed up the execution time of the DCT and quantization by factors of 7.97 and 8.65, respectively, obtained from 1024 × 1024 and 2084 × 2048 image sizes in 4:4:4 format.
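For reference, the two operations being parallelized can be written down directly from the textbook definition of the 8 × 8 forward DCT used in JPEG. The Python sketch below is a plain sequential version, not the authors' OpenCL kernel. The names `dct_8x8` and `quantize` are hypothetical, the input is assumed to be already level-shifted, and the quantization table is supplied by the caller.

```python
import math

N = 8  # JPEG operates on 8 x 8 blocks


def dct_8x8(block):
    """Forward 2-D DCT-II of one 8 x 8 block (direct, unoptimized form)."""
    def scale(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            coeffs[u][v] = 0.25 * scale(u) * scale(v) * total
    return coeffs


def quantize(coeffs, qtable):
    """Divide each coefficient by its table entry and round to an integer.

    Python's round() rounds exact halves to the nearest even integer.
    """
    return [[round(coeffs[u][v] / qtable[u][v]) for v in range(N)]
            for u in range(N)]
```

Every block, and every output coefficient within a block, is computed independently of the others, which is what makes the mapping to work-items on a GPU or multi-core CPU natural.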

14.
MPEG video compression is quite difficult to achieve in real time, and hardware solutions for this problem are expensive. We present a portable, fault-tolerant, parallel, MPEG-1 encoder implemented in software. We detail the implementation strategy for the encoder and give performance results on a network of workstations and a massively parallel processor. We also show that our encoder can efficiently adapt to fluctuating processing power typical in workstation networks.

15.
16.
Genetic Algorithm for Boolean minimization in an FPGA cluster   Total citations: 1, self-citations: 0, citations by others: 1
Evolutionary algorithms are an alternative for Boolean synthesis because they can create hardware structures that could not be obtained with other techniques. This paper presents a parallel genetic programming (PGP) Boolean synthesis implementation based on a cluster of FPGAs that takes full advantage of parallel programming and hardware/software co-design techniques. The performance of our FPGA cluster implementation has been compared with an HPC implementation. The experimental results show excellent behavior in terms of speedup (up to ×500) and in resolving the scalability problems that these algorithms exhibited in previous works.

17.
In this paper, a new parallel-by-cell approach to undistorted data compression based on cellular automata and genetic algorithms is presented. The local compression rules in a cellular automaton are obtained by using a genetic evolutionary algorithm. The correctness of the hyper-parallel compression, the time complexity, and the relevant symbolic dynamic behaviour are discussed. In comparison with other traditional sequential or small-scale parallel methods for undistorted data compression, the proposed approach shows much higher real-time performance and better suitability and feasibility for systolic hardware implementation.

18.
An evolutionary algorithm implemented in hardware is expected to operate much faster than the equivalent software implementation. However, this may not be true for slow fitness evaluation applications. This paper introduces a fast evolutionary algorithm (FEA) that does not evaluate all new individuals, thus operating faster for slow fitness evaluation applications. Results of a hardware implementation of this algorithm are presented that show the real time advantages of such systems for slow fitness evaluation applications. Results are presented for six optimisation functions and for image compression hardware.

19.
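罗鹏 (Luo Peng), 许应 (Xu Ying), 封君 (Feng Jun), 王新安 (Wang Xin'an). 《计算机工程》 (Computer Engineering), 2009, 35(13): 153-155
For finite-field multiplication in elliptic curve cryptosystems, the hardware implementations of the basic serial, parallel, and hybrid serial-parallel multiplier structures are discussed together with their shortcomings, and an improved multiplier structure is proposed. The structure uses a divide-and-conquer algorithm and cascades low-bit-width multiplications to reduce computational complexity and the number of clock cycles required. FPGA experiments show that, at the same frequency, the new structure has a smaller area-time product. With this structure, one elliptic-curve point multiplication over GF(2^233) takes only 0.811 ms, which meets the application requirements of elliptic curve cryptosystems.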
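The divide-and-conquer idea of building a wide GF(2^m) multiplication out of narrower ones can be sketched in software as follows. The abstract does not specify how the operands are split, so this Python sketch uses the Karatsuba recursion as one possible variant. It also assumes the NIST reduction polynomial x^233 + x^74 + 1 for GF(2^233), and all function names are hypothetical.

```python
M = 233
MODULUS = (1 << 233) | (1 << 74) | 1  # x^233 + x^74 + 1 (assumed)


def clmul_small(a, b):
    """Carry-less (polynomial over GF(2)) multiply by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def clmul(a, b, width=256, base=32):
    """Karatsuba-style carry-less multiply of two operands below 2**width."""
    if width <= base:
        return clmul_small(a, b)
    half = width // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = clmul(a_lo, b_lo, half, base)
    hi = clmul(a_hi, b_hi, half, base)
    # In characteristic 2, addition and subtraction are both XOR.
    mid = clmul(a_lo ^ a_hi, b_lo ^ b_hi, half, base) ^ lo ^ hi
    return (hi << (2 * half)) ^ (mid << half) ^ lo


def reduce_mod(p):
    """Reduce a polynomial modulo MODULUS, clearing bits from the top down."""
    for i in range(p.bit_length() - 1, M - 1, -1):
        if (p >> i) & 1:
            p ^= MODULUS << (i - M)
    return p


def gf_mul(a, b):
    """Multiply two field elements of GF(2^233)."""
    return reduce_mod(clmul(a, b))
```

Each recursion level replaces one wide multiplication with three multiplications of half the width, which is the software analogue of cascading low-bit-width multiplier blocks.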

20.
Recent advances in computer graphics have relied on high-quality textures in order to generate photorealistic real-time images. Texture compression standards meet these growing demands for data, but current texture compression schemes use fixed-rate methods where statically sized blocks of pixels are represented using the same number of bits irrespective of their data content. In order to account for the natural variation in detail, we present an alternative format that allows variable bit-rate texture compression with minimal changes to texturing hardware. Our proposed scheme uses one additional level of indirection to allow the block size to vary across the same texture. This single change is exploited both to vary the number of bits allocated to certain parts of the texture and to duplicate redundant texture information across multiple pixels. To minimize hardware changes, the method picks combinations of block sizes and compression methods from existing fixed-rate standards. With this approach, our method is able to demonstrate energy savings of up to 50%, as well as higher quality compressed textures than current state-of-the-art techniques.
