Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
Parallel image processing with the block data parallel architecture (total citations: 2; self-citations: 0; citations by others: 2)
Many digital signal and image processing algorithms can be sped up by executing them in parallel on multiple processors. The speed of parallel execution is limited by the need for communication and synchronization between processors. In this paper, we present a paradigm for parallel processing that we call the block data flow paradigm (BDFP). The goal of this paradigm is to reduce interprocessor communication and relax the synchronization requirements for such applications. We present the block data parallel architecture, which implements this paradigm, and methods for mapping algorithms onto it. We illustrate this methodology for several applications, including two-dimensional (2-D) digital filters, the 2-D discrete cosine transform, QR decomposition of a matrix, and Cholesky factorization of a matrix. We analyze the resulting system performance for these applications with regard to speedup and efficiency as the number of processors increases. Our results demonstrate that the block data parallel architecture is a flexible, high-performance solution for numerous digital signal and image processing algorithms.
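As a rough illustration of the speedup and efficiency figures discussed above (not the authors' BDFP implementation), the sketch below models a 2-D filter whose image is split into blocks across P processors, with a per-processor communication cost added to the parallel runtime; all cost constants are hypothetical.

```python
# Hypothetical speedup/efficiency model for a block-partitioned 2-D filter.
# The cost constants are illustrative and not taken from the paper.

def block_parallel_times(rows, cols, p, t_pixel=1.0, t_comm_per_proc=500.0):
    """Estimated serial and parallel execution times (arbitrary units)."""
    t_serial = rows * cols * t_pixel
    # Each processor filters an equal share of the image blocks, then pays a
    # communication/synchronization cost that grows with the processor count.
    t_parallel = t_serial / p + t_comm_per_proc * p
    return t_serial, t_parallel

def report(rows=512, cols=512, processors=(1, 2, 4, 8, 16)):
    t_serial, _ = block_parallel_times(rows, cols, 1)
    for p in processors:
        _, tp = block_parallel_times(rows, cols, p)
        speedup = t_serial / tp
        efficiency = speedup / p
        print(f"P={p:2d}  speedup={speedup:6.2f}  efficiency={efficiency:4.2f}")

if __name__ == "__main__":
    report()
```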

2.
This paper presents a new heuristic, concurrent, iterative loop-based scheduling and allocation algorithm for high-level synthesis of digital signal processing (DSP) architectures using heterogeneous functional units. In a heterogeneous architecture, functional units may be bit-serial, digit-serial, or bit-parallel. We assume that a library of functional units based on heterogeneous implementation styles is available. Experiments show that this new heuristic synthesis approach generates optimal and near-optimal area solutions. Although optimum synthesis of such architectures was recently proposed using an integer linear programming (ILP) model, our method can produce similar solutions in one to two orders of magnitude less time, at the expense of sacrificing cost optimality. We compare the solutions generated by the proposed algorithm with the optimal solutions generated by the ILP approach and other recent techniques. We have incorporated this new algorithm into the Minnesota ARchitecture Synthesis (MARS-II) system.

3.
4.
We present a framework for the analysis of the decoding delay in multiview video coding (MVC). We show that in real-time applications, an accurate estimation of the decoding delay is essential to achieve a minimum communication latency. As opposed to single-view codecs, the complexity of the multiview prediction structure and the parallel decoding of several views require a systematic analysis of this decoding delay, which we carry out using graph theory and a model of the decoder hardware architecture. Our framework assumes a decoder implementation on general-purpose multicore processors with multithreading capabilities. For this hardware model, we show that frame processing times depend on the computational load of the decoder, and we provide an iterative algorithm to jointly compute frame processing times and decoding delay. Finally, we show that the decoding delay analysis can be applied to design decoders that minimize the communication latency of the MVC system.
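The iterative delay computation itself is not reproduced in the abstract; as a minimal sketch of the idea, the toy scheduler below estimates frame completion times for a small, hypothetical MVC prediction structure decoded on a fixed number of cores (greedy list scheduling, made-up frame processing times). It is one simple way such a decoding-delay estimate could be computed, not the paper's algorithm.

```python
import heapq

# Toy prediction structure: frame -> list of reference frames that must be
# decoded first. Processing times are hypothetical, in milliseconds.
deps  = {"I0": [], "P1": ["I0"], "B2": ["I0", "P1"], "P3": ["P1"], "B4": ["P1", "P3"]}
ptime = {"I0": 8.0, "P1": 6.0, "B2": 4.0, "P3": 6.0, "B4": 4.0}

def decode_finish_times(deps, ptime, cores=2):
    """Greedy list schedule: each ready frame is decoded on the earliest free core."""
    finish = {}
    core_free = [0.0] * cores          # min-heap of core-available times
    heapq.heapify(core_free)
    remaining = dict(deps)
    while remaining:
        # frames whose reference frames have all been decoded
        ready = [f for f, refs in remaining.items() if all(r in finish for r in refs)]
        ready.sort(key=lambda f: max((finish[r] for r in deps[f]), default=0.0))
        for f in ready:
            start = max(heapq.heappop(core_free),
                        max((finish[r] for r in deps[f]), default=0.0))
            finish[f] = start + ptime[f]
            heapq.heappush(core_free, finish[f])
            del remaining[f]
    return finish

if __name__ == "__main__":
    for frame, t in sorted(decode_finish_times(deps, ptime).items(), key=lambda kv: kv[1]):
        print(f"{frame}: finishes at {t:.1f} ms")
```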

5.
A digital circuit usually comprises a controller and a datapath. The time spent determining a valid controller behavior to detect a fault usually dominates test generation time. A validation test set is used to verify controller behavior and, hence, it activates various controller behaviors. In this paper, we present a novel methodology wherein the controller behaviors exercised by test sequences in a validation test set are reused for detecting faults in the datapath. A heuristic is used to identify controller behaviors that can justify/propagate precomputed test vectors/responses of datapath register-transfer level (RTL) modules. Such controller behaviors are said to be compatible with the corresponding precomputed test vectors/responses. The heuristic is fairly accurate, resulting in the detection of a majority of stuck-at faults in the datapath RTL modules. Also, since test generation is performed at the RTL and the controller behavior is predetermined, test generation time is reduced. For microprocessors, if the validation test set consists of instruction sequences, then the proposed methodology also generates instruction-level test sequences.

6.
We present a systematic methodology to support the design tradeoffs of array processors with respect to several emerging issues: (1) high performance and high flexibility, (2) low cost and low power, (3) efficient memory usage, and (4) system-on-a-chip integration or ease of system integration. The methodology is algebraic, so it can cope with high-dimensional data dependence. It consists of transformation rules for data dependence graphs that facilitate flexible array designs. For example, two common partitioning approaches, LPGS and LSGP, can be unified under the methodology. It supports the design of high-speed and massively parallel processor arrays with efficient memory usage. More specifically, it leads to a novel systolic cache architecture comprising only shift registers (a cache without tags). To demonstrate how the methodology works, we present several systolic design examples based on the block-matching motion estimation algorithm (BMA). By multiprojecting a 4-D dependence graph (DG) of the BMA onto a 2-D mesh, we can reconstruct several existing array processors. By multiprojecting a 6-D DG of the BMA, a novel 2-D systolic array can be derived that features significantly improved data reusability (96%) and processor utilization (99%).

7.
In this paper, we consider multiprocessor implementation of real-time recursive digital signal processing algorithms. The objective is to devise a periodic schedule and a fully static task assignment scheme that meet the desired throughput rate while minimizing the number of processors. Toward this goal, we propose a notion called cutoff time. We prove that the minimum-processor schedule can be found within a finite time interval bounded by the cutoff time, so the complexity of the scheduling and allocation algorithms can be significantly reduced. Next, taking advantage of the cutoff time, we derive efficient heuristic algorithms that promise better performance and lower computational complexity than other existing algorithms. Extensive benchmark examples are tested, yielding very encouraging results.

8.
The high demand on computation performance by multimedia information processing, such as digital video compression and decompression, has made multiprocessor computation more and more popular. In this paper, we present our study on adaptive job assignment for multiprocessor implementation of a Moving Picture Experts Group 2 (MPEG-2) video encoding algorithm. A data partitioning technique is used to assign jobs to the processors in the multiprocessor environment, exploiting the data structure adopted by the MPEG standard, which divides a picture frame into macroblocks (MBs) that are processed independently during encoding. An adaptive data partitioning scheme is developed to cope with the inherently nonuniform spatial distribution of motion activity, so that the computation load is distributed over the processors as uniformly as possible, which helps improve the speedup of the whole multiprocessor system. Simulations with several video sequences have shown that, in comparison to its nonadaptive counterpart, the adaptive scheme can effectively improve the speedup of the multiprocessor system. In addition, the speedup scales well as the number of processors used in the computation increases.
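A minimal sketch of the adaptive-partitioning idea, under assumptions of our own: macroblock rows are assigned to processors in contiguous runs, and the split points are chosen from per-row motion-activity cost estimates. The cost values and the rebalancing rule below are illustrative, not the paper's exact scheme.

```python
# Hypothetical adaptive partitioning of macroblock (MB) rows across processors.
# Each row is given a cost proportional to its estimated motion activity
# (e.g. measured on the previous frame); rows are assigned in contiguous runs so
# that per-processor load is as even as possible. All numbers are made up.

def partition_rows(row_cost, num_procs):
    """Contiguous split: close the current partition when adding the next row
    would move its load further from the (rebalanced) per-processor target."""
    parts, current, acc = [], [], 0.0
    remaining = float(sum(row_cost))
    for i, cost in enumerate(row_cost):
        parts_left = num_procs - len(parts) - 1      # partitions still to be opened
        target = remaining / (parts_left + 1)        # ideal load for the current partition
        rows_left = len(row_cost) - i                # unassigned rows, including this one
        overshoot = bool(current) and abs(acc + cost - target) > abs(acc - target)
        if parts_left > 0 and rows_left > parts_left and overshoot:
            parts.append(current)                    # close the current partition
            remaining -= acc
            current, acc = [], 0.0
        current.append(i)
        acc += cost
    parts.append(current)
    return parts

if __name__ == "__main__":
    motion_cost = [1, 1, 2, 5, 9, 9, 8, 4, 2, 1, 1, 1]   # per-MB-row activity estimate
    for p, rows in enumerate(partition_rows(motion_cost, 4)):
        print(f"processor {p}: MB rows {rows}, load {sum(motion_cost[r] for r in rows)}")
```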

9.
Leakage current optimization based on don't-care bits in test vectors (total citations: 1; self-citations: 1; citations by others: 0)
It is well known that the leakage power caused by leakage current accounts for a significant part of test power in CMOS circuit testing. Reducing leakage current during testing is crucial for extending the battery life of portable systems that require periodic self-test, improving test reliability, and lowering test cost. This paper first analyzes the components of leakage current and the related transistor stack effect. We then propose a method that uses a genetic algorithm to optimize the leakage current during integrated circuit testing based on the don't-care bits (X bits) in test vectors. Experimental results show that, for both combinational and sequential circuit tests, the method can effectively reduce the leakage current of the circuit under test without affecting fault coverage.
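As a toy illustration of X-bit filling with a genetic algorithm (not the paper's leakage model or GA parameters), the sketch below fills the don't-care bits of a single test vector so that a made-up per-bit-pair leakage estimate is minimized; the vector, cost table, and GA settings are all hypothetical.

```python
import random

# Fill the don't-care (X) bits of a test vector with a simple genetic algorithm
# so that a leakage-cost estimate is minimized. The cost table below (a made-up
# per-adjacent-bit-pair cost, loosely mimicking input-dependent leakage) is NOT
# the paper's leakage model.

VECTOR = list("1X0XX10X1X")            # specified bits fixed, X bits free
X_POS = [i for i, b in enumerate(VECTOR) if b == "X"]
PAIR_COST = {"00": 1, "01": 4, "10": 3, "11": 6}   # hypothetical leakage per bit pair

def leakage(bits):
    return sum(PAIR_COST[bits[i] + bits[i + 1]] for i in range(len(bits) - 1))

def fill(assign):
    v = VECTOR[:]
    for pos, bit in zip(X_POS, assign):
        v[pos] = bit
    return "".join(v)

def ga(pop_size=20, generations=40, mut_rate=0.1):
    random.seed(0)
    pop = [[random.choice("01") for _ in X_POS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: leakage(fill(a)))      # lower leakage = fitter
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, len(X_POS))
            child = p1[:cut] + p2[cut:]               # one-point crossover
            child = [b if random.random() > mut_rate else random.choice("01")
                     for b in child]                  # per-bit mutation
            children.append(child)
        pop = survivors + children
    best = min(pop, key=lambda a: leakage(fill(a)))
    return fill(best), leakage(fill(best))

if __name__ == "__main__":
    vector, cost = ga()
    print("filled vector:", vector, "estimated leakage:", cost)
```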

10.
As tester complexity and cost increase, reducing test time is an important manufacturing priority. Test time can be reduced by ordering tests so as to fail defective units early in the test process. Algorithms for ordering tests that guarantee optimality require execution time that is exponential in the number of tests applied. We develop a simple polynomial-time heuristic to order tests. The heuristic, based on criteria that offer local optimality, yields globally optimal solutions in many cases. An ordering algorithm requires information on the ability of tests to detect defective units. One way to obtain this information is by simulation; we obtain it by applying all possible tests to a small subset of manufactured units and assuming the information obtained from this subset is representative. The ordering heuristic was applied to manufactured digital and analog integrated circuits (ICs) tested with commercial testers. When both approaches work, the orders generated by the heuristic are optimal. More importantly, the heuristic is able to generate an improved order for large problem sizes when the optimal algorithm is not able to do so. The new test orders result in a significant reduction, as high as a factor of four, in the time needed to identify defective units. We also assess the validity of using such sampling techniques to order tests.
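One common locally optimal criterion of the kind the abstract alludes to is to apply first the tests with the highest detection probability per unit of test time. The sketch below uses that greedy rule on made-up test data and estimates the expected time to reject a defective unit under a simplifying independence assumption; it is an illustration, not the authors' exact heuristic.

```python
# Hypothetical greedy test ordering: apply first the tests with the largest
# defect-detection probability per unit of test time, as estimated from a sample
# of units on which every test was applied. Numbers below are made up.

def order_tests(tests):
    """tests: list of (name, detect_probability, test_time_seconds)."""
    return sorted(tests, key=lambda t: t[1] / t[2], reverse=True)

def expected_time_to_fail(order):
    """Expected time spent on a defective unit before the first failing test,
    assuming (for simplicity) independent detection events."""
    expected, p_survive = 0.0, 1.0
    for _, p_detect, t in order:
        expected += p_survive * t        # this test runs only if none before failed
        p_survive *= (1.0 - p_detect)
    return expected

if __name__ == "__main__":
    tests = [("scan", 0.60, 2.0), ("iddq", 0.30, 0.5),
             ("functional", 0.50, 5.0), ("param", 0.10, 0.2)]
    for label, order in [("original", tests), ("reordered", order_tests(tests))]:
        print(label, [t[0] for t in order],
              f"E[time to first fail] = {expected_time_to_fail(order):.2f}s")
```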

11.
There are now several laboratories worldwide that routinely perform three-dimensional (3D) physical modeling of acoustic phenomena of interest to seismologists. Corresponding computerized mathematical models have been limited to two dimensions on conventional computers due to the immense computational requirements of accurate full-size problems. The introduction of high-performance vector processors in the 100+ MFLOP range has at last made 3D computerized models a practical reality. Several major oil companies and large seismic contractors have installed such processors and are contributing to research and development in this area as an improved tool in the search for fossil energy sources. We now have such models and some limited experience in using them; however, much more work is required before these mathematical models can routinely supplant the physical ones. Here we describe one such model, developed for the Cyber 205. A two-dimensional (2D) model for the VAX-11 with FPS-100 Array Processors was modified to three dimensions and suitably vectorized for the larger system. We state the acoustic wave problem in mathematical form and present the algorithm used for its solution. We then review the theoretical justification for selecting this particular algorithm. We also discuss the most important technical aspects of implementing the algorithm in vector form on the Cyber system. We present typical output, and finally we give timing results for several model sizes, both 2D and 3D, on the Cyber 205 and 2D on the VAX-11 with FPS-100 Array Processors.
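The abstract does not spell out the discretization; as a generic illustration, the sketch below time-steps the 3D acoustic wave equation with a standard second-order explicit finite-difference scheme on a small grid. The grid size, velocity, time step, and source are hypothetical and unrelated to the Cyber 205 implementation.

```python
import numpy as np

# Standard explicit second-order finite-difference time stepping for the 3D
# acoustic wave equation  p_tt = c^2 * laplacian(p).  A generic textbook scheme
# for illustration only; all constants are hypothetical.

N, STEPS = 40, 100
c, dx, dt = 1500.0, 10.0, 0.003          # velocity (m/s), spacing (m), step (s)
r2 = (c * dt / dx) ** 2                  # stability requires (c*dt/dx)^2 <= 1/3 in 3D

p_old = np.zeros((N, N, N))
p = np.zeros((N, N, N))
p[N // 2, N // 2, N // 2] = 1.0          # impulsive point source at the grid center

for _ in range(STEPS):
    lap = (-6.0 * p[1:-1, 1:-1, 1:-1]
           + p[2:, 1:-1, 1:-1] + p[:-2, 1:-1, 1:-1]
           + p[1:-1, 2:, 1:-1] + p[1:-1, :-2, 1:-1]
           + p[1:-1, 1:-1, 2:] + p[1:-1, 1:-1, :-2])
    p_new = p.copy()
    # leapfrog update on interior points; boundary values are held at zero
    # (simple Dirichlet condition)
    p_new[1:-1, 1:-1, 1:-1] = (2.0 * p[1:-1, 1:-1, 1:-1]
                               - p_old[1:-1, 1:-1, 1:-1] + r2 * lap)
    p_old, p = p, p_new

print("wavefield peak after", STEPS, "steps:", float(np.abs(p).max()))
```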

12.
Wireless sensor networks (WSNs) are being used in a wide variety of critical applications, such as military and health-care applications. Such networks, which are composed of sensor nodes with limited memory capacity, limited processing capabilities, and, most importantly, a limited energy supply, require routing protocols that take these constraints into consideration. The aim of this paper is to provide an efficient power-aware routing algorithm for WSNs that guarantees QoS and at the same time minimizes energy consumption by calculating the remaining battery capacity of nodes and taking advantage of the battery recovery process. We present an online battery-aware geographic routing algorithm. To show the effectiveness of our approach, we simulated our algorithm in ns2 and compared it with greedy perimeter stateless routing for wireless networks and battery-aware routing for streaming data transmissions in WSNs. Copyright © 2009 John Wiley & Sons, Ltd.
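A minimal sketch of a battery-aware geographic next-hop rule in the spirit of the abstract: among neighbours that make progress toward the sink, pick the one with the best mix of geographic progress and remaining battery, with a crude bonus standing in for the recovery effect. The weighting and the bonus are assumptions, not the paper's cost function.

```python
import math

# Battery-aware geographic next-hop choice (illustrative only).

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def next_hop(current, sink, neighbours, alpha=0.6):
    """neighbours: dict name -> (position, battery_fraction, idle_rounds)."""
    d_cur = distance(current, sink)
    best, best_score = None, -1.0
    for name, (pos, battery, idle_rounds) in neighbours.items():
        progress = d_cur - distance(pos, sink)
        if progress <= 0:                               # only forwarding progress counts
            continue
        recovery_bonus = min(0.1, 0.02 * idle_rounds)   # crude recovery-effect proxy
        score = alpha * (progress / d_cur) + (1 - alpha) * (battery + recovery_bonus)
        if score > best_score:
            best, best_score = name, score
    return best                                         # None => greedy forwarding failed (void)

if __name__ == "__main__":
    current, sink = (0.0, 0.0), (100.0, 0.0)
    neighbours = {"A": ((30.0, 10.0), 0.9, 0),
                  "B": ((45.0, 0.0), 0.3, 4),
                  "C": ((-10.0, 5.0), 1.0, 10)}
    print("forward to:", next_hop(current, sink, neighbours))
```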

13.
Recent studies show that at-speed functional tests are better at finding realistic defects than tests executed at lower speeds. This advantage has led to growing interest in design for at-speed test. In addition, time-to-market requirements dictate development of tests early in the design process. In this paper, we present a new methodology for the synthesis of at-speed self-test programs for microprocessors. Based on information about the instruction set, this high-level test generation methodology can generate instruction sequences that exercise all the functional capabilities of complex processors. Modern processors have large memory modules, register files, and powerful ALUs with comprehensive operations, which can be used to generate and control built-in tests and to evaluate the test responses. Our method exploits the functional units to compress and check the test response at chip-internal speed. No hardware test pattern generators or signature analyzers are needed, and the method reduces area overhead and performance impact compared to current BIST techniques. A novel test instruction insertion technique is introduced to activate the control/status inputs and the internal modules related to them. The new methodology has been applied to an example processor much more complex than any benchmark circuit used in academia today. The results show that our approach is very effective in achieving high fault coverage and automation in at-speed self-test generation for microprocessor-like circuits.

14.
Single-point, sender-based control does not scale well for multicast delivery. For applications such as group video or teleconferencing, a multicast tree with low total cost is required. In this article, we present a destination-driven algorithm that minimizes the total cost of the multicast tree in a dynamic situation over the whole session duration. In this heuristic approach, we assume that the staying duration of each participant is available at the time of joining. The performance of our algorithm is analyzed through extensive simulation and evaluated against several other existing dynamic multicast routing algorithms, as well as against a well-known near-optimum heuristic algorithm used for solving the Steiner tree problem. We have further tested our algorithm using erroneous information given by the joining participants. Simulation results show that its performance does not degrade significantly even when the range of error is considerably high, which demonstrates the robustness of our algorithm.

15.
The Viterbi (1967) algorithm (VA) is known to be an efficient method for the realization of maximum-likelihood (ML) decoding of convolutional codes. The VA is characterized by a graph, called a trellis, which defines the transitions between states. Defining an area-efficient architecture for the VA is therefore equivalent to obtaining an efficient mapping of the trellis. We present a methodology that permits the efficient hardware mapping of the VA onto a processor network of arbitrary size. This formal model is employed to partition the computations among an arbitrary number of processors in such a way that the data are recirculated, optimizing the use of the processing elements (PEs) and the communications. The algorithm is thus mapped onto a column of processing elements, and an optimal design solution is obtained for a particular set of area and/or speed constraints. Furthermore, the management of the surviving-path memory, including its mapping and distribution among the processors, is studied. As a result, we obtain a regular and modular design appropriate for VLSI implementation, in which the only necessary communications between processors are the data recirculations between stages.
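The processor-array mapping is a hardware-level contribution; for reference, the sketch below is just the underlying trellis add-compare-select recursion, written as a hard-decision Viterbi decoder for the textbook rate-1/2, constraint-length-3 code with generators (7, 5) octal. It is a generic software decoder, not the paper's architecture.

```python
# Hard-decision Viterbi decoding of the rate-1/2, constraint-length-3
# convolutional code with generators (7, 5) octal.

def encode(bits):
    m1 = m2 = 0
    out = []
    for u in bits:
        out += [u ^ m1 ^ m2, u ^ m2]       # parity bits from taps 111 and 101
        m1, m2 = u, m1
    return out

def viterbi_decode(received, n_bits):
    INF = 10 ** 9
    metric = [0] + [INF] * 3               # start in the all-zero state
    paths = [[] for _ in range(4)]
    for t in range(n_bits):
        r = received[2 * t: 2 * t + 2]
        new_metric = [INF] * 4
        new_paths = [None] * 4
        for s in range(4):                 # s encodes the registers (m1, m2)
            if metric[s] == INF:
                continue
            m1, m2 = s >> 1, s & 1
            for u in (0, 1):
                expected = [u ^ m1 ^ m2, u ^ m2]
                branch = sum(a != b for a, b in zip(expected, r))   # Hamming distance
                ns = (u << 1) | m1
                if metric[s] + branch < new_metric[ns]:             # compare-select
                    new_metric[ns] = metric[s] + branch
                    new_paths[ns] = paths[s] + [u]
        metric, paths = new_metric, new_paths
    best = min(range(4), key=lambda s: metric[s])
    return paths[best]

if __name__ == "__main__":
    msg = [1, 0, 1, 1, 0, 0, 1, 0]
    coded = encode(msg)
    coded[3] ^= 1                          # inject one channel bit error
    print("decoded:", viterbi_decode(coded, len(msg)), "original:", msg)
```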

16.
This paper proposes a hierarchical representation of digital signal processing algorithms suitable for real-time implementations. Petri net models are used to expose every possible form of operating parallelism in their graphical expression, the marked Petri graph. Moreover, a hierarchical algorithm execution control based on delayed Petri graphs is presented. A strictly modular system architecture suitable for VLSI implementation and data-driven processing is reviewed in terms of its main components. The algorithm representation is then applied to the design of the control part of the system modules. Logic-level details of the controllers for an array of digital signal processors are presented as an application of the proposed methodology.

17.
This paper is concerned with multiprocessor implementations of embedded applications specified as iterative dataflow programs, in which synchronization overhead can be significant. We develop techniques to alleviate this overhead by determining a minimal set of processor synchronizations that are essential for correct execution. Our study is set in the context of self-timed execution of iterative dataflow programs. An iterative dataflow program consists of a dataflow representation of the body of a loop that is to be iterated an indefinite number of times; dataflow programming in this form has been studied and applied extensively, particularly in the context of signal processing software. Self-timed execution refers to a combined compile-time/run-time scheduling strategy in which processors synchronize with one another based only on interprocessor communication requirements; thus, synchronization of processors at the end of each loop iteration does not generally occur. We introduce a new graph-theoretic framework, based on a data structure called the synchronization graph, for analyzing and optimizing synchronization overhead in self-timed, iterative dataflow programs. We show that the comprehensive techniques that have been developed for removing redundant synchronizations in noniterative programs can be extended in this framework to optimally remove redundant synchronizations in our context. We also present an optimization that converts a feedforward dataflow graph into a strongly connected graph in such a way as to reduce synchronization overhead without slowing down execution.
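A small sketch of the redundancy idea on a hypothetical synchronization graph: a synchronization edge is treated as redundant when another path between the same two actors already enforces the ordering with no greater delay, which can be checked with a shortest-path computation. This simplified check only captures the flavor of the framework; the graph and delay values below are made up.

```python
import heapq

# Detect redundant synchronization edges: a sync edge (u, v) with delay d is
# considered redundant when some other path from u to v has total delay <= d.

def shortest_delay(edges, src, dst, skip):
    """Dijkstra over edge delays, ignoring the edge at index `skip`."""
    adj = {}
    for i, (u, v, d) in enumerate(edges):
        if i != skip:
            adj.setdefault(u, []).append((v, d))
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")

def redundant_syncs(comm_edges, sync_edges):
    """Return the synchronization edges already implied by the rest of the graph."""
    all_edges = comm_edges + sync_edges
    redundant = []
    for i, (u, v, d) in enumerate(sync_edges):
        idx = len(comm_edges) + i
        if shortest_delay(all_edges, u, v, skip=idx) <= d:
            redundant.append((u, v, d))
    return redundant

if __name__ == "__main__":
    comm = [("A", "B", 1), ("B", "C", 1)]          # interprocessor data transfers
    sync = [("A", "C", 2), ("A", "C", 1)]          # explicit synchronizations
    print("redundant:", redundant_syncs(comm, sync))
```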

18.
The growing interest in multiprocessor system-on-chip (MPSoC) design, or ‘multicore’ processors, has resulted in some confusion between the various types of multiprocessor architectures and their suitability for different application spaces. In particular, there are clear differences between the general-purpose, symmetric multiprocessor (SMP) approaches and the application-specific, asymmetric multiprocessor (AMP) architectures. Configurable and extensible processors are especially suited to the AMP approach, yet their flexibility means that new design methodologies and tools must be developed to allow effective utilisation of multiple instruction-set processors in a complex design. Such processors are especially well suited to data-intensive computational tasks, as found in many signal and image processing applications, including audio, video, and wireless and wired networking. A design methodology for such applications must pay careful attention to the right programming models, and dataflow styles of processing seem a natural fit for the application space. In this paper, we describe a design methodology, flow, and tools for MPSoC design using configurable and extensible processors that is especially interesting for data-intensive, dataflow-style applications. Some of the issues involved in this design approach are used to highlight opportunities for ongoing research.

19.
An efficient heuristic force-directed placement algorithm based on partitioning is proposed for very large-scale circuits. Our heuristic force-directed approach provides a more efficient cell-location adjustment scheme for iterative placement optimization than the force-directed relaxation (FDR) method. We apply hierarchical partitioning based on a new parallel clustering technique to decompose the circuit into several levels of sub-circuits. During the partitioning phase, a technique similar to ‘terminal propagation’ is introduced to maintain the external connections that affect cell adjustment within a sub-circuit. In the lowest-level sub-circuits, the heuristic force-directed algorithm is used to perform iterative placement optimization. Each pair of sub-circuits resulting from bisection is then combined into a larger one, in which cells are placed according to the better placement state of the two sub-circuits. The bottom-up combination proceeds successively until the original circuit is recovered, and at each combination level the heuristic force-directed placement algorithm is used to further improve placement quality. Experiments on a set of MCNC (Microelectronics Center of North Carolina) standard-cell benchmarks show that our placement algorithm produces, on average, 12% lower total wire length than Feng Shui, with slightly longer CPU time.
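The core cell-adjustment step in a force-directed scheme moves each movable cell toward the connectivity-weighted centroid of the cells it connects to. The toy loop below illustrates only that step, on a made-up netlist, and omits the partitioning hierarchy, terminal propagation, and legalization described above.

```python
# Toy force-directed relaxation: each movable cell is pulled toward the
# connectivity-weighted centroid of its connected cells. Pads are fixed.
# The netlist, weights, and coordinates are made up.

positions = {"p1": (0.0, 0.0), "p2": (10.0, 0.0), "p3": (10.0, 10.0),   # fixed pads
             "a": (2.0, 8.0), "b": (7.0, 2.0)}                          # movable cells
fixed = {"p1", "p2", "p3"}
# (cell, cell, weight): connection strength between two cells
nets = [("a", "p1", 2.0), ("a", "b", 1.0), ("b", "p2", 1.0), ("b", "p3", 3.0)]

def relax(positions, nets, fixed, iterations=20):
    pos = dict(positions)
    for _ in range(iterations):
        for cell in pos:
            if cell in fixed:
                continue
            wx = wy = wsum = 0.0
            for u, v, w in nets:
                if cell in (u, v):
                    other = v if u == cell else u
                    wx += w * pos[other][0]
                    wy += w * pos[other][1]
                    wsum += w
            if wsum > 0:
                pos[cell] = (wx / wsum, wy / wsum)    # move to weighted centroid
    return pos

if __name__ == "__main__":
    for cell, (x, y) in relax(positions, nets, fixed).items():
        tag = "fixed" if cell in fixed else "placed"
        print(f"{cell:>2} ({tag}): ({x:.2f}, {y:.2f})")
```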

20.
Battery lifetime enhancement is a critical design parameter for mobile computing devices. Maximizing the battery lifetime is a particularly difficult problem due to the nonlinearity of battery behavior and its dependence on the characteristics of the discharge profile. In this paper, we address the problem of task scheduling with voltage scaling in battery-powered single- and multiprocessor systems such that the residual charge or the battery voltage (the parameters for evaluating battery performance) is maximized. We propose an efficient heuristic algorithm using a charge-based cost function derived from an analytical battery model. Our algorithm first creates a task sequence that ensures battery survival, and then distributes the available delay slack so that the cost function is maximized. The effectiveness of the algorithm has been verified using DUALFOIL, a low-level Li-ion battery simulator. The algorithm has been validated on synthetic examples created from applications running on Compaq's handheld computing research platform, ITSY.
