Similar Documents
Found 20 similar documents (search time: 343 ms)
1.
Distributed hypermedia systems that support collaboration are important emerging tools for the creation, discovery, management, and delivery of information. These systems are becoming increasingly desired and practical as other areas of information technology advance. A framework is developed for efficiently exploring the hypermedia design space while intelligently capitalizing on tradeoffs between performance and area. We focus on a category of processors that are programmable yet optimized for a hypermedia application. The key components of the framework presented in this paper are a retargetable instruction-level parallelism compiler, instruction-level simulators, a set of complete media applications written in a high-level language, and a media processor synthesis algorithm. The framework addresses the need for efficient use of silicon by exploiting the instruction-level parallelism found in media applications through compilers that target multiple-instruction-issue processors. Using the developed framework, we conduct an extensive exploration of the design space for a hypermedia application. We find that there is enough instruction-level parallelism in typical media and communication applications to achieve highly concurrent execution when throughput requirements are high. On the other hand, when throughput requirements are low, there is little value in multiple-instruction-issue processors: the increased area does not improve performance enough to justify their use. The framework is valuable for making early architecture design decisions, such as the trade-off between cache size and issue width when area is constrained, and the choice of the number of branch units and the instruction issue width.

2.
Application-specific processors offer an attractive option in the design of embedded systems by providing high performance for a specific application domain. In this work, we describe the use of a reconfigurable processor core based on a RISC architecture as the starting point for application-specific processor design. By using a common base instruction set, development cost can be reduced and design space exploration is focused on the application-specific aspects of performance. An important aspect of deploying any new architecture is verification, which usually requires lengthy software simulation of a design model. We show how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow. While hardware emulation previously required massive investment in design effort and special-purpose emulators, an emulation approach based on high-density field-programmable gate array (FPGA) devices now makes hardware emulation practical and cost-effective for embedded processor designs. To reduce development cost and avoid duplication of design effort, FPGA prototypes and ASIC implementations are derived from a common source: we show how to perform targeted optimizations that fully exploit the capabilities of the target technology while maintaining a common source base.

3.
Real-time image processing usually requires an enormous throughput rate and a huge number of operations. Parallel processing, in the form of specialized hardware or multiprocessing, is therefore indispensable. This paper describes a flexible programmable image processing system using field programmable gate arrays (FPGAs). The logic-cell nature of currently available FPGAs is well suited to performing real-time bit-level image processing operations using the bit-level systolic concept. Here, we propose a novel architecture, the programmable image processing system (PIPS), which integrates this programmable hardware with digital signal processors (DSPs) to handle the bit-level as well as the arithmetic operations found in many image processing applications. The versatility of the system is demonstrated by the implementation of a 1-D median filter.
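The 1-D median filter used to demonstrate PIPS is simple to state in software; a minimal reference sketch (ours, not the paper's bit-level systolic implementation) is:

```python
def median_filter_1d(signal, window=3):
    """Sliding-window 1-D median filter. At the boundaries the
    window is clamped to the signal, so edge outputs use a
    shorter window (one of several common edge policies)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        w = sorted(signal[lo:hi])
        out.append(w[len(w) // 2])  # middle element of the sorted window
    return out
```

For example, a single impulse is removed: `median_filter_1d([0, 0, 9, 0, 0])` yields all zeros, which is exactly the impulse-noise suppression that makes the median filter a standard image processing benchmark.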

4.
Complex network protocols and various network services require significant processing capability in modern network applications. One of the important features of modern networks is differentiated service. Along with differentiated service, rapidly changing network environments result in congestion problems. In this paper, we analyze the characteristics of representative congestion control applications, namely scheduling and queue management algorithms, and we propose application-specific acceleration techniques that use instruction-level parallelism (ILP) and packet-level parallelism (PLP) in these applications. From the PLP perspective, we propose a hardware acceleration model based on a detailed analysis of congestion control applications. To achieve high throughput, a large number of processing elements (PEs) and a parallel comparator are designed. Such hardware accelerators provide parallelism proportional to the number of processing elements added. A 32-PE enhancement yields a 24× speedup for weighted fair queueing (WFQ) and a 27× speedup for random early detection (RED). For ILP, new instruction set extensions for fast conditional operations are applied to congestion control applications. Based on our experiments, the proposed instruction set extensions improve performance by 10%-12%. As the performance of general-purpose processors rapidly increases, defining architectural extensions for general-purpose processors, analogous to the multimedia extensions (MMX) used in multimedia applications, could be an alternative solution for a wide range of network applications.
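Random early detection, one of the two accelerated queue management algorithms, makes a per-packet drop decision from the average queue length. A simplified software sketch of that decision (the EWMA queue averaging and RED's count-based probability correction are omitted, and parameter names are ours):

```python
import random

def red_drop(avg_queue, min_th, max_th, max_p, rng=random.random):
    """Simplified RED drop decision: never drop below min_th,
    always drop at or above max_th, and in between drop with a
    probability that rises linearly from 0 to max_p."""
    if avg_queue < min_th:
        return False
    if avg_queue >= max_th:
        return True
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return rng() < p
```

The per-packet work is a pair of comparisons plus a conditional multiply, which is why the paper targets it both with fast conditional instructions (ILP) and with parallel comparators over many packets (PLP).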

5.
Application-specific instruction-set processors (ASIPs) provide a good alternative for video processing acceleration, but the productivity gap implied by such a new technology may prevent leveraging it fully. Video processing SoCs need flexibility that is not available in pure hardware architectures, while pure software solutions do not meet video processing performance constraints. Thus, ASIP design could offer a good tradeoff between performance and flexibility. Video processing algorithms are often characterized by intrinsic parallelism that can be accelerated by ASIP specialized instructions. In this paper, we propose a new approach for exploiting sequences of tightly coupled specialized instructions in ASIP design applicable to video processing. Our approach, which avoids costly data communications by applying data grouping and data reuse, consists of accelerating an algorithm's critical loops by transforming them according to a new intermediate representation. This representation is optimized, and loop parallelism possibilities are also explored. This approach has been applied to video processing algorithms such as the ELA deinterlacer and the 2D-DCT. Experimental results show speedups of up to 18 on the considered applications, while the hardware overhead in terms of additional logic gates was found to be between 18% and 59%.
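The ELA (edge-based line averaging) deinterlacer mentioned above interpolates each missing pixel along the direction of least luminance difference between the lines above and below. A minimal per-pixel sketch (ours; the paper accelerates loops of this kind with specialized instructions, not this Python code):

```python
def ela_pixel(upper, lower, x):
    """ELA interpolation for one pixel of a missing line: among
    three candidate directions (left diagonal, vertical, right
    diagonal), pick the pair with the smallest absolute
    difference and average along it."""
    best = None
    for d in (-1, 0, 1):
        xu, xl = x + d, x - d  # symmetric pair across the missing line
        if 0 <= xu < len(upper) and 0 <= xl < len(lower):
            diff = abs(upper[xu] - lower[xl])
            if best is None or diff < best[0]:
                best = (diff, (upper[xu] + lower[xl]) // 2)
    return best[1]
```

The inner loop is a short sequence of subtract/abs/compare/average operations on adjacent pixels, which is exactly the kind of tightly coupled instruction sequence the proposed approach groups into specialized instructions.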

6.
The programmable video signal processor (VSP) is an important category of processor for multimedia systems. Programmable video processors combine the flexibility of programmability with special architectural features that improve performance on video processing applications. VSPs are typically multiprocessors with several processing elements (PEs) and a parallel memory system. This paper focuses on the architectural design of the PEs in a video processor and shows how technology and circuit parameters influence the structure of the datapath and, hence, the overall architecture of a programmable VSP. We emphasize the need to consider technological and circuit-level issues during the design of a system architecture and present a method whereby the conceptual organization of the PEs (the number of PEs, the pipelining of the datapath, the size of the register file, and the number of register ports) can be evaluated in terms of a target set of applications before a detailed design is undertaken. We use motion estimation and the discrete cosine transform as example applications to illustrate how various technology parameters affect the architectural design choices. We show that the design of the register file and the datapath pipeline depth can drastically affect PE utilization and, therefore, the number of PEs required for different applications. Our results demonstrate that pursuing the fastest cycle time can greatly increase the silicon area that must be devoted to PEs, due to both increased pipeline latency and reduced register file bandwidth.
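Motion estimation, one of the two example applications, is dominated by block matching: for each block of the current frame, search a window of the reference frame for the displacement that minimizes the sum of absolute differences (SAD). A straightforward full-search sketch (ours, not the paper's datapath; names and the 2×2 block size in the test are illustrative):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-size blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

def full_search(ref, cur_block, top, left, radius):
    """Exhaustive block matching: slide cur_block over a
    (2*radius+1)^2 window of the reference frame anchored at
    (top, left) and return the (dy, dx) of minimal SAD."""
    n = len(cur_block)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y and y + n <= len(ref) and 0 <= x and x + n <= len(ref[0]):
                cand = [row[x:x + n] for row in ref[y:y + n]]
                cost = sad(cur_block, cand)
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
    return best[1]
```

The inner SAD accumulation is the operation whose register-file bandwidth and pipeline-latency sensitivity the paper's PE evaluation method is designed to expose.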

7.
In response to the continuous growth in network bandwidth and application requirements, specialized chips called network processors have been built to deliver high performance and flexibility at moderate cost. Network processors often employ parallelism to achieve this high performance/cost ratio. However, the same parallelism can also make the behavior of the software difficult to understand. When applications need to maintain reliable performance under heavy load, seemingly unrelated code fragments can interact with each other unexpectedly because of hardware resource contention, thereby impacting performance. To help software designers deal with this problem, we propose using software simulation to compare the impact of different design choices on performance. We show that it is possible to use relatively simple models, yet still extract information that aids in performance-tuning the system.

8.
Many-core processors are good candidates for speeding up video coding because the parallelism of these applications can be exploited efficiently by a many-core architecture. Lock methods are important on many-core architectures to ensure correct execution of the program and communication between on-chip threads, and the efficiency of the lock method is critical to the overall performance of a many-core processor. In this paper, we propose two types of hardware locks for on-chip many-core architectures: a centralized lock and a distributed lock. First, we design the architectures that implement the two hardware lock methods. Then, we evaluate the performance of the two hardware locks and a software lock with quantitative micro-benchmarks on the many-core processor simulator Godson-T. The experimental results show that the locks with dedicated hardware support outperform the software lock, and that the distributed hardware lock is more scalable than the centralized one.
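The centralized/distributed distinction can be illustrated with a toy functional model (ours, not the Godson-T hardware): a centralized unit routes all lock traffic through one table and thus one contention point, while a distributed scheme hashes lock words across several units so requests for different locks can be serviced in parallel.

```python
class CentralizedLockUnit:
    """All lock words live in one unit: every acquire/release
    passes through a single point, which serializes requests."""
    def __init__(self, n_locks):
        self.held = [False] * n_locks

    def try_acquire(self, lock_id):
        if not self.held[lock_id]:
            self.held[lock_id] = True
            return True
        return False

    def release(self, lock_id):
        self.held[lock_id] = False

class DistributedLockUnits:
    """Lock words are hashed across several units, so requests
    for locks on different units can proceed concurrently."""
    def __init__(self, n_units, n_locks):
        self.units = [CentralizedLockUnit(n_locks) for _ in range(n_units)]

    def _unit(self, lock_id):
        return self.units[lock_id % len(self.units)]

    def try_acquire(self, lock_id):
        return self._unit(lock_id).try_acquire(lock_id)

    def release(self, lock_id):
        self._unit(lock_id).release(lock_id)
```

This only models the functional behavior; the scalability gap the paper measures comes from the request serialization that the centralized unit imposes in hardware.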

9.
A programmable instruction decoder (PID) is introduced for designing adaptive multi-core DSP architectures using a hardware/software co-reconfigurable approach, without employing programmable devices. This PID permits DSP software developers to modify their DSP instruction sets after manufacturing, adding application-specific instructions whenever necessary. In addition, the PID offers software developers an enhanced means of utilizing the underlying DSP architectures by rescheduling the implemented micro-operations of their tailored instructions. Thus, emerging DSP applications can be swiftly and efficiently ported to PID-based DSP processors without fabricating new DSP chips. Beyond instruction-level modification, an innovative instruction-packing procedure for the PID is presented for further enhancement of PID-based DSP systems. The PID architecture was developed and implemented in VHDL. PID-based DSP systems were also developed and evaluated to demonstrate various post-manufacturing adaptabilities in DSP processor systems. Various multi-core DSP architectures based on Texas Instruments' TMS320C55 DSP processor were used to evaluate the performance and adaptability of this new programmable instruction decoder.
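The core idea of a programmable decoder can be modeled as a writable table mapping opcodes to micro-operation sequences, so a "new instruction" is just a new decode entry installed after manufacturing. A toy sketch (ours; the actual PID is a VHDL hardware design, and the opcode names below are hypothetical):

```python
class ProgrammableDecoder:
    """Toy programmable instruction decoder: each opcode maps to
    a programmable sequence of micro-ops, so application-specific
    instructions can be added by writing new decode entries."""
    def __init__(self):
        self.table = {}

    def define(self, opcode, micro_ops):
        """Install or replace the micro-op sequence for an opcode."""
        self.table[opcode] = list(micro_ops)

    def decode(self, opcode):
        return self.table[opcode]

def run(decoder, program, regs):
    """Execute a program by expanding each opcode into its
    micro-ops, each of which mutates the register file."""
    for opcode in program:
        for micro_op in decoder.decode(opcode):
            micro_op(regs)
    return regs
```

For instance, a fused multiply-accumulate "instruction" can be installed at run time as a single decode entry, mirroring how the PID lets developers pack tailored instructions from existing micro-operations.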

10.
The advantages of the programmable control paradigm are widely known in the design of synchronous sequential circuits: easy correction of late design errors, easy upgrade of product families to meet time-to-market constraints, and modification of the control algorithm, even at run time. However, despite the growing interest in asynchronous (self-timed) circuits, programmable asynchronous controllers based on the idea of microprogramming have not been actively pursued. In this paper, we propose an asynchronous microprogrammed control organization (called a microengine) that targets application-specific implementations and emphasizes simplicity, modularity, and high performance. The architecture takes advantage of the natural ability of self-timed circuits to chain actions efficiently without the clock-based scheduling constraints that would be involved in comparable synchronous designs. The result is a general approach to the design of application-specific microengines featuring a programmable data-path topology that offers very compact microcode and high performance: in fact, performance close to that offered by automated hardwired controllers. In performance comparisons of a CD-player error decoder design, the proposed microengine architecture was 26 times faster than the general-purpose hardware of a 280 MIPS microprocessor, over three times as fast as the special-purpose hardware of a low-power macromodule-based implementation, and even slightly faster than a finite state machine-based implementation.

11.
Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high-throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. To fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIPs) combined with an efficient memory and communication interconnect scheme. The designed ASIP has a single-instruction multiple-data (SIMD) architecture with a specialized and extensible instruction set and a six-stage pipeline. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the recent shuffled decoding technique introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability, which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture.

12.
13.
This paper presents the design and implementation of a hyperelliptic curve cryptography (HECC) coprocessor over affine and projective coordinates, along with measurements of its performance, hardware complexity, and power consumption. We applied several design techniques, including parallelism, pipelining, and loop unrolling, in designing field arithmetic units, group operation units, and scalar multiplication units to improve the performance and power consumption. Our affine and projective coordinate-based HECC processors execute in 0.436 ms and 0.531 ms, respectively, over the underlying field GF(2^89). These results are about five times faster than those of previous hardware implementations and at least 13 times better in terms of area-time products. Further results suggest that neither case is superior to the other when considering both hardware complexity and performance. The characteristics of our proposed HECC coprocessor show that it is applicable to high-speed network applications as well as resource-constrained environments, such as PDAs, smart cards, and so on.
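The top-level operation of an (H)ECC coprocessor is scalar multiplication, typically built from group "add" and "double" primitives via double-and-add. The control skeleton is shown here over a generic group (the paper's group is the Jacobian of a hyperelliptic curve over GF(2^89); the modular-arithmetic stand-ins in the usage example are ours, purely to exercise the skeleton):

```python
def scalar_multiply(k, g, add, double):
    """Left-to-right double-and-add scalar multiplication of k*g.
    `add` and `double` are the group operations; `None` plays the
    role of the group identity."""
    result = None
    for bit in bin(k)[2:]:
        if result is not None:
            result = double(result)  # shift the accumulated scalar left
        if bit == '1':
            result = g if result is None else add(result, g)
    return result
```

The field arithmetic, group operation, and scalar multiplication units in the paper correspond to the three levels of this loop: field operations inside `add`/`double`, the group primitives themselves, and the bit-scanning control shown above.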

14.
Embedded system architectures comprising software-programmable components (e.g., DSP, ASIP, and microcontroller cores) and customized hardware co-processors, integrated into a single cost-efficient VLSI chip, are emerging as a key solution to today's microelectronics design problems. This trend is driven by new emerging applications in the areas of wireless communication, high-speed optical networking, and multimedia computing, fueled by increasing levels of integration. These applications are often subject to stringent requirements in terms of processing performance, power dissipation, and flexibility. A key problem confronting embedded system designers today is the rapid prototyping of an application-specific embedded system architecture, where different combinations of programmable processor components, library hardware components, and customized hardware components must be integrated together while ensuring that the hardware and software parts communicate correctly. Designers often spend an enormous amount of time on this highly error-prone task. In this paper, we present a solution to this embedded architecture co-synthesis and system integration problem based on an orchestrated combination of architectural strategies, parameterized libraries, and software CAD tools.

15.
A synthesis environment that targets software-programmable architectures such as digital signal processors (DSPs) is presented. These processors are well suited to implementing real-time signal processing systems with medium throughput requirements. Techniques that tightly couple the synthesis environment to an existing communication system simulator are also presented; this enables a seamless transition between the simulation and implementation design levels of communication systems. Special focus is placed on optimization techniques for mapping data-flow-oriented block diagrams onto DSPs. The combination of different mapping and optimization strategies allows comfortable synthesis of real-time code that is highly adapted to application-specific needs imposed by constraints on memory space, sampling rate, or latency. Thus, tradeoff analysis is supported by efficient interactive or automatic exploration of the design space. All presented concepts are illustrated by the design of a phase synchronizer with automatic gain control on a floating-point DSP.

16.
To maximize parallelism, distribute rendering tasks effectively, and balance performance and flexibility in the graphics processing pipeline, this article presents the design, performance analysis, and optimization of a multi-core interactive graphics processing unit (MIGPU). This processor integrates twelve processing cores with a specific instruction set architecture and many sophisticated application-specific accelerators into a 3D graphics engine, implemented on an XC6VLX550T field programmable gate array (FPGA). MIGPU supports OpenGL 2.0 with a programmable front-end processor, vertex shader, plane clipper, geometry transformer, 3D clippers, and pixel shaders. To boost the performance of MIGPU, a model is established relating primitive types, vertices, and pixels to the effects of culling, clipping, and memory access, which shows a way to improve the speedup of the graphics pipeline. The processor is capable of assigning graphics rendering tasks to different cores for efficiency and flexibility. The pixel fill rate reaches 40 Mpixel/s at peak performance.

17.
In order to satisfy cost and performance requirements, digital signal processing and telecommunication systems are generally implemented with a combination of different components, from custom-designed chips to off-the-shelf processors. These components vary in area, performance, programmability, and so on, and the system functionality is partitioned amongst the components to best exploit this tradeoff. However, for performance-critical designs it is not sufficient to implement only the critical sections as custom-designed high-performance hardware; it is also necessary to pipeline the system at several levels of granularity. We present a design flow and an algorithm that first allocate software and hardware components, and then partition and pipeline a throughput-constrained specification amongst the selected components. This is performed to best satisfy the throughput constraint at minimal application-specific integrated circuit cost. Our ability to incorporate partitioning with pipelining at several levels of granularity enables us to attain high-throughput designs, and also distinguishes this paper from previously proposed hardware/software partitioning algorithms.

18.
19.
Application-specific instruction-set processors (ASIPs) provide a good alternative for video processing acceleration, but the productivity gap implied by such a new technology may prevent leveraging it fully. Video processing SoCs need flexibility that is not available in pure hardware architectures, while pure software solutions do not meet video processing performance constraints. Thus, ASIP design could offer a good tradeoff between performance and flexibility. Video processing algorithms are often characterized by intrinsic parallelism that can be accelerated by ASIP specialized instructions. In this paper, we propose a new approach for exploiting sequences of tightly coupled specialized instructions in ASIP design applicable to video processing. Our approach, which avoids costly data communications by applying data grouping and data reuse, consists of accelerating an algorithm's critical loops by transforming them according to a new intermediate representation. This representation is optimized, and loop parallelism possibilities are also explored. This approach has been applied to video processing algorithms such as the ELA deinterlacer and the 2D-DCT. Experimental results show speedups of up to 18 on the considered applications, while the hardware overhead in terms of additional logic gates was found to be between 18% and 59%.

20.
Coarse-grained reconfigurable arrays (CGRAs) have shown potential for application in embedded systems in recent years. The numerous reconfigurable processing elements (PEs) in CGRAs provide flexibility while maintaining high performance by exploiting different levels of parallelism. However, a gap remains between CGRAs and application-specific integrated circuits (ASICs). Some application domains, such as software-defined radios (SDRs), require flexibility even as performance demands increase, so more effective CGRA architectures need to be developed. Customising a CGRA for its application can improve performance and efficiency. This study proposes an application-specific CGRA architecture template composed of generic PEs (GPEs) and special PEs (SPEs). The hardware of the SPE can be customised to accelerate specific computational patterns. An automatic design methodology that includes pattern identification and application-specific function unit generation is also presented, along with a mapping algorithm based on ant colony optimisation. Experimental results on the SDR target domain show that, compared with other ordinary and application-specific reconfigurable architectures, the CGRA generated by the proposed method performs more efficiently for the given applications.
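The mapping algorithm above is based on ant colony optimisation. As an illustration of the metaheuristic itself (applied here to shortest path, not to the paper's CGRA mapping problem; all parameter values are illustrative), a minimal sketch:

```python
import random

def aco_shortest_path(edges, src, dst, n_ants=20, n_iters=30,
                      evaporation=0.5, seed=1):
    """Minimal ant colony optimisation: ants build paths with a
    probability biased by pheromone (tau) and inverse edge
    length; pheromone evaporates each iteration and is
    reinforced along good tours."""
    rng = random.Random(seed)
    tau = {e: 1.0 for e in edges}  # initial pheromone on every edge
    best = None
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            node, path, visited = src, [], {src}
            while node != dst:
                options = [e for e in edges
                           if e[0] == node and e[1] not in visited]
                if not options:          # dead end: abandon this ant
                    path = None
                    break
                weights = [tau[e] / edges[e] for e in options]
                e = rng.choices(options, weights=weights)[0]
                path.append(e)
                visited.add(e[1])
                node = e[1]
            if path is not None:
                cost = sum(edges[e] for e in path)
                tours.append((cost, path))
                if best is None or cost < best[0]:
                    best = (cost, path)
        for e in tau:                    # evaporation
            tau[e] *= (1.0 - evaporation)
        for cost, path in tours:         # reinforcement, stronger for cheap tours
            for e in path:
                tau[e] += 1.0 / cost
    return best
```

In the paper's setting the "tour" would be an assignment of dataflow operations to PEs rather than a path, but the pheromone/evaporation/reinforcement loop has the same shape.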


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号