Similar literature
1.
The hardware-software co-synthesis process aims to determine an optimal architecture for an embedded application specified by a task graph or a specification language. In this paper, we present a co-synthesis approach targeting MPSoCs and distributed memory multiprocessor architectures for high performance embedded applications. Our co-synthesis approach produces pipelined multiprocessor architectures consisting of heterogeneous processing elements connected by a point-to-point communication structure. The co-synthesis process consists of four distinct phases: processing element selection for addition to the system, pipelined task allocation, scheduling, and mapping to a regular interconnection topology. Initially, an irregular topology is generated, which is then mapped to a regular architecture. Our co-synthesis methodology performs system partitioning and produces an irregular topology multiprocessor system. It also generates an optimal (or sub-optimal) regular topology architecture after considering some of the well-known regular topologies such as mesh, hypercube, and tree. The co-synthesis method is demonstrated by exploring embedded architectures for an MPEG encoder and for artificially generated application task graphs representing complex embedded systems.

2.
Technology trends are driving parallel on-chip architectures in the form of multiprocessor systems-on-a-chip (MPSoCs) and chip multiprocessors (CMPs). In these systems, the increasing on-chip communication demand among the computation elements necessitates the use of scalable, high-bandwidth network-on-chip (NoC) fabrics instead of dedicated interconnects and shared buses. As transistor feature sizes are further miniaturized, more complicated NoC architectures become feasible that can support more demanding applications. Given the myriad emerging software-hardware combinations, for cost-effectiveness, a system designer critically needs to prune this widening NoC design-space to predict the interconnect fabric(s) that best balance(s) cost/performance, before the actual design process begins. This prompted us to develop Polaris, a system-level roadmapping toolchain for on-chip interconnection networks that helps designers predict the most suitable interconnection network design(s) tailored to their performance needs and power/silicon area constraints with respect to a range of applications that the system will run. Polaris explores the plethora of NoC designs based on projections of network traffic, architectures, and process characteristics. While Polaris's toolchain is extensible so new traffic, network designs, and technology processes can be added, the current version already incorporates 7872 NoC design points. Polaris is rapid, efficiently iterating over thousands of NoC design points, while maintaining high relative and absolute accuracies when validated against detailed NoC synthesis results.

3.
The growing interest in multiprocessor system-on-chip (MPSoC) design, or ‘multicore’ processors, has resulted in some confusion between the various types of multiprocessor architectures and their suitability in different application spaces. In particular, there are clear differences between the general-purpose, symmetric multiprocessor (SMP) approaches, and the application-specific, asymmetric multiprocessor (AMP) architectures. Configurable and extensible processors are especially suited for the AMP approach, yet their flexibility means that new design methodologies and tools must be developed to allow effective utilisation of multiple instruction-set processors in a complex design. Configurable and extensible processors are especially well suited for data-intensive computational tasks, such as are found in many signal and image processing applications, including audio, video, and wireless and wired networking. A design methodology for such applications must pay careful attention to the right programming models, and dataflow styles of processing seem a natural fit to the application space. In this paper, we describe a design methodology, flow and tools for MPSoC design using configurable and extensible processors that is especially interesting for data-intensive dataflow style applications. Some of the issues involved in this design approach are used to highlight opportunities for ongoing research.

4.
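To meet the real-time requirements of a real-time infrared dynamic scene simulation system, a parallel multi-DSP system with the ADSP-TS201 DSP as its core processor was designed and implemented. The hardware system design, the task partitioning of the infrared dynamic scene simulation algorithm, the parallel rendering architecture, and the application development model for multiple DSPs are discussed in detail. Finally, experimental test results are given.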

5.
The modeling and performance analysis of the two-dimensional (2-D) inverse fast cosine transform (FCT) algorithm on a multiprocessor are considered. The computational and communication complexities of this algorithm on shared-bus, multistage interconnection network (MIN), and mesh-connected multiprocessor architectures have been determined. The performance of the three multiprocessor architectures has been compared in terms of speedup, efficiency, and cost effectiveness. The evaluation shows that, for this application, the indirect binary n-cube MIN-connected multiprocessor architecture is the best of the three architectures considered in terms of speedup and efficiency.

6.
Stack Size Minimization for Embedded Real-Time Systems-on-a-Chip   Cited by: 1 (self-citations: 0, citations by others: 1)
The primary goal of real-time kernel software for single- and multiple-processor systems-on-a-chip is to support the design of timely and cost-effective systems. The kernel must provide time guarantees, in order to predict the timely behavior of the application; it must offer an extremely fast response time, in order not to waste computing power outside of the application cycles; and it must save as much RAM space as possible, in order to reduce the overall cost of the chip. Research on real-time software systems has produced algorithms that make it possible to schedule system resources effectively while guaranteeing the deadlines of the application, and to group tasks into a very small number of non-preemptive sets, which require much less RAM for stack. Unfortunately, up to now the research focus has been on time guarantees rather than on the optimization of RAM usage. Furthermore, these techniques do not apply to multiprocessor architectures, which are likely to be widely used in future microcontrollers. This paper presents innovative scheduling and optimization algorithms that effectively solve the problem of guaranteeing schedulability with very little operating system overhead while minimizing RAM usage. We developed a fast and simple algorithm for sharing resources in multiprocessor systems, together with an innovative procedure for assigning a preemption threshold to tasks. These allow the use of a single user stack. The experimental part shows the effectiveness of a simulated annealing-based tool that finds a schedulable system configuration starting from the selection of a near-optimal task allocation. When used in conjunction with our preemption threshold assignment algorithm, our tool further reduces the RAM usage in multiprocessor systems.

7.
Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. In order to fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIPs) combined with an efficient memory and communication interconnect scheme. The designed ASIP has a single instruction multiple data (SIMD) architecture with a specialized and extensible instruction set and 6-stage pipeline control. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the recent shuffled decoding technique introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability, which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture.

8.
Dataflow languages make it possible to describe signal processing applications in a platform-independent fashion, which makes them attractive in today’s multiprocessing era. RVC-CAL is a dynamic dataflow language for describing complex data-dependent programs such as video decoders. To date, design automation toolchains for RVC-CAL have made it possible to create workstation software, dedicated hardware, and embedded application-specific multiprocessor implementations from RVC-CAL programs. However, no solution has been presented for executing RVC-CAL applications on generic embedded multiprocessing platforms. This paper presents a dataflow-based multiprocessor communication model, an architecture prototype that uses it, and an automated toolchain for instantiating such a platform and the software for it. The complexity of the platform increases linearly with the number of processors. The experiments in this paper use several instances of the proposed platform, with different numbers of processors. An MPEG-4 video decoder is mapped to the platform and executed on it. Benchmarks are performed on an FPGA board.

9.
Domain-specific coarse-grained reconfigurable architectures (CGRAs) have great promise for energy-efficient flexible designs for a suite of applications. Designing such a reconfigurable device for an application domain is very challenging because the needs of different applications must be carefully balanced to achieve the targeted design goals. It requires the evaluation of many potential architectural options to select an optimal solution. Exploring the design space manually would be very time consuming and may not even be feasible for very large designs. Even mapping one algorithm onto a customized architecture can require time ranging from minutes to hours. Running a full power simulation on a complete suite of benchmarks for various architectural options requires several days. Finding the optimal point in a design space could require a very long time. We have designed a framework/tool that makes such design space exploration (DSE) feasible. The resulting framework allows testing a family of algorithms and architectural options in minutes rather than days and can allow rapid selection of architectural choices. In this paper, we describe our DSE framework for domain-specific reconfigurable computing, where the needs of the application domain drive the construction of the device architecture. The framework has been developed to automate design space case studies, allowing application developers to explore architectural tradeoffs efficiently and reach solutions quickly. We selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some edge-detection benchmarks from the image processing domain for our case studies. We describe two search algorithms: a stepped search algorithm motivated by our manual design studies and a more traditional gradient-based optimization. Approximate energy models are developed in each case to guide the search toward a minimal energy solution. We validate our search results by comparing the architectural solutions selected by our tool to an architecture optimized manually and by performing sensitivity tests to evaluate the ability of our algorithms to find good quality minima in the design space. All selected fabric architectures were synthesized on a 130 nm cell-based ASIC fabrication process from IBM. These architectures consume almost the same amount of energy on average, but the gradient-based approach is more general and promises to extend well to new problem domains. We expect these or similar heuristics and the overall design flow of the system to be useful for a wide range of architectures, including mesh-based and other commonly used architectures for CGRAs.

10.
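Against the background of common applications, this paper studies configuration schemes and design selection for single-DSP and multi-DSP systems in digital systems. For different application scenarios and functional requirements, it describes several single-DSP system configurations, in which a single DSP device is the main component and devices such as FPGAs, MCUs, and CPUs play a supporting role, together with their specific application scenarios. It also summarizes several multi-DSP interconnection schemes, including broadcast, serial, parallel, and master-slave, and their specific applications.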

11.
In this paper, we propose an efficient and secure embedded processing architecture that addresses various challenges involved in using face-based biometrics for authenticating a user to an embedded system. Our paper considers the use of robust face verifiers (PCA-LDA, Bayesian), and analyzes the computational workload involved in running their software implementations on an embedded processor. We then present a suite of hardware and software enhancements to accelerate these algorithms: fixed-point arithmetic, various code optimizations, generic custom instructions and dedicated coprocessors, and exploitation of parallel processing capabilities in multiprocessor systems-on-chip (SoCs). We also identify attacks targeted against the authentication process, and develop security measures to ensure the integrity of biometric code/data. We evaluated the proposed architectures in the context of popular open-source software implementations of face authentication algorithms running on a commercial embedded processor (Xtensa from Tensilica). Our paper shows that fast, in-system verification is possible even in the context of many resource-constrained embedded systems. We also demonstrate that the security of the authentication process for the given attack model can be achieved with minimal hardware overhead.

12.
The aim of rapid prototyping of real-time applications is to substantially reduce development times by confirming the functional and temporal requirements of the application at a very early stage of development with the help of an executable prototype. Hence, the real-time rapid prototyping system presented in this paper integrates two complementary tasks. On the one hand, it provides an automated design environment for rapid and easy generation of a working prototype. On the other hand, the design process is extended with real-time requirement specification and analysis in order to prove that the embedded system will meet all timing requirements, and to verify that the timing requirements have been modeled correctly. The REAR framework uses annotated SDL for the system specification, from which both the compilable source code and the real-time analysis model are generated. After instrumentation for timing measurements, the C and VHDL source code is compiled and synthesized, linked with communication libraries, and executed on the configurable, heterogeneous multiprocessor target system. Several case studies demonstrate the feasibility of this approach.

13.
Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming the dominant approach for building multiprocessors with moderate to large numbers of processors. Cache coherence allows such architectures to use caching to take advantage of locality in applications without changing the programmer's model of memory. We review the key developments that led to the creation of cache-coherent distributed shared memory and describe the Stanford DASH multiprocessor, the first working implementation of hardware-supported scalable cache coherence. We then provide a perspective on such architectures and discuss important remaining technical challenges.

14.
As a solution for dealing with the design complexity of multiprocessor SoC architectures, we present a joint Simulink-SystemC design flow that enables mixed hardware/software refinement and simulation early in the design process. First, we introduce the Simulink combined algorithm/architecture model (CAAM), which unifies the algorithm and the abstract target architecture. From the Simulink CAAM, a hardware architecture generator produces architecture models at three different abstraction levels, enabling a trade-off between simulation time and accuracy. A multithread code generator produces memory-efficient multithreaded programs to be executed on the architecture models. To show the applicability of the proposed design flow, we present experimental results on two real video applications.

15.
A software environment designed to support the real-time implementation of Digital Signal Processing (DSP) applications onto multiple programmable processors is described. The system, called McDAS, allows a designer to program his application as he would on a single processor, using a high level signal-flow DSP language. The program is then automatically scheduled and compiled onto a target multiprocessor. The environment allows the scheduler to be invoked with different numbers of processors and multiprocessor topology to explore various implementations. McDAS maximizes the computational throughput by exploiting pipelining, retiming, and parallel execution under the architectural constraints. The code generator is retargetable to different multiprocessor architectures as well as core processors. Data buffers and synchronizations are automatically inserted to ensure correct execution. The final implementation can be used for simulation speedup or real-time processing.

16.
A large range of application domains, from real-time embedded systems to grid-computing applications, now requires distribution. This trend implies the definition of new or tailored distribution mechanisms dedicated to specific applications and puts a strain on current middleware architectures and development. Middleware is intended to separate an application from variations in hardware and operating systems. This new interoperability problem is itself a serious industrial issue. We call this the middleware paradox. Next-generation middleware should be versatile enough to instantiate exactly the mechanisms required by different distribution models. Middleware components that depend on a specific distribution model should be limited to application-level components or to protocol-level components.

17.
Multiprocessor systems offer several advantages like high performance and enhanced reliability, compared with single processor systems. However, it is difficult to achieve the performance potential of a multiprocessor system. In this situation, a monitor that measures the dynamic behavior of such systems may help to detect performance bottlenecks. This paper presents a modular hardware monitor for multi-DSP systems based on the digital signal processor TMS320C40 by Texas Instruments.

18.
The integration capabilities of VLSI technology allow for the implementation of complex real-time signal processing applications. Whether these circuits are dedicated (ASICs) or programmed (processors and multiprocessors), it is necessary to use adequate methods and CAD tools that aid the design of complex systems and circuits. These then make it possible to implement real-time applications with reduced development and production costs. Today these methods give interesting results, both for the design of multiprocessor systems and for ASICs. They are based on a formalism for the modeling of algorithms and architectures, and on optimization techniques for the design. We present the basic characteristics of these methods. We show some results in acoustic echo cancellation in the area of multiprocessors and in the field of ASICs.

19.
Analyses of four broadband fiber-optic subscriber loop architectures, including active (high-speed time division multiplexing (TDM)-based) and passive (dense wavelength division multiplexing (WDM)-based, WDM-based with an analog subcarrier-multiplexing overlay, and splitter-based) double-star topologies, are presented. The analyses focus on specific demonstrated architectures and use component cost projections based on learning curves to estimate future network costs on a per-subscriber basis. Also investigated is the sensitivity of projected cost-per-subscriber to remote multiplexing node size and to double-star prove-in distance. The results indicate that the four architectures have very different double-star prove-in distances and that passive loop costs are minimized for much smaller remote node sizes than active loop costs, thus permitting cost-effective deployment of passive loops for smaller groups of subscribers. In addition, cost breakdowns for the four architectures indicate that splitter-based passive loops share electronics more effectively among subscribers than loop architectures requiring dedicated (per-subscriber) electronic interfaces, resulting in projected cost advantages for the splitter-based networks.

20.
The co-synthesis of hardware-software systems for complex embedded applications has been studied extensively with a focus on various qualitative system objectives such as high speed performance and low power dissipation. One of the main challenges in the construction of multiprocessor systems for complex real-time applications is to provide high levels of system availability that satisfy the users’ expectations. Even though the area of hardware-software co-synthesis has been studied extensively in the recent past, the issues that specifically relate to design exploration for highly available architectures need to be addressed more systematically and in a manner that supports active user participation. In this paper, we propose a user-centric co-synthesis mechanism for generating gracefully degrading, heterogeneous multiprocessor architectures that fulfill the dual objectives of achieving real-time performance and ensuring high levels of system availability at acceptable cost. A flexible interface allows the user to specify rules that effectively capture the users’ perceived availability expectations under different working conditions. We propose an algorithm to map these user requirements to the importance attached to the subset of services provided during any functional state. The system availability is evaluated on the basis of these user-driven importance values and a CTMC model of the underlying fail-repair process. We employ a stochastic timing model in which all the relevant performance parameters, such as task execution times, data arrival times, and data communication times, are taken to be random variables. A stochastic scheduling algorithm assigns start and completion time distributions to tasks. A hierarchical genetic algorithm optimizes the selection of resources, i.e. processors and busses, and the task allocations. We report the results of a number of experiments performed with representative task graphs. Analysis shows that the co-synthesis tool we have developed is effectively driven by the user’s availability requirements as well as by the topological characteristics of the task graph to yield high quality architectures. We experimentally demonstrate the edge provided by a stochastic timing model in terms of performance assessment, resource utilization, system availability, and cost. An erratum to this article is available at .
