3.
Reliability is an important design concern for modern many-core embedded systems. In particular, on-chip interconnects are vulnerable to permanent channel faults and transient data-transmission faults, which may significantly degrade overall system performance. In this work, a Unified Link-layer Fault-tolerant NoC (ULF-NoC) architecture is proposed. ULF-NoC is developed for NoCs equipped with bidirectional channels and features wormhole switching (instead of store-and-forward switching) and packet-based retransmission. An intelligent buffer controller is developed that does not require separate, dedicated buffer space to support packet retransmission. Extensive simulations using both synthetic and real-world traffic demonstrate the strong performance of the proposed ULF-NoC solution.
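As an illustrative sketch of the buffering idea (not the paper's actual controller; all names here are invented), the following keeps in-flight packets in the ordinary send queue until they are acknowledged, so no separate retransmit buffer is required:

```python
from collections import OrderedDict

class RetransmitBuffer:
    """Sketch: in-flight packets stay in the normal send queue until
    ACKed, so retransmission needs no dedicated buffer space."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = OrderedDict()   # seq -> packet, awaiting ACK

    def send(self, seq, packet, link):
        if len(self.in_flight) >= self.capacity:
            return False                 # back-pressure: no free slot
        self.in_flight[seq] = packet     # slot stays occupied until ACK
        link.transmit(seq, packet)       # 'link' is a hypothetical channel
        return True

    def on_ack(self, seq):
        self.in_flight.pop(seq, None)    # ACK frees the slot

    def on_nack(self, seq, link):
        if seq in self.in_flight:        # transient fault: resend in place
            link.transmit(seq, self.in_flight[seq])
```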
4.
Adaptive routing algorithms improve network performance by distributing traffic over the whole network. However, they require congestion information to facilitate load balancing. To provide local and global congestion information, we propose a learning method based on a dual reinforcement learning approach. This information can be dynamically updated according to changing traffic conditions in the network by propagating data and learning packets. We utilize a congestion detection method which updates the learning rate according to the congestion level. This method calculates the average number of free buffer slots in each switch at specific time intervals and compares it with maximum and minimum values. Based on the comparison result, the learning rate is set to a value between 0 and 1. If a switch becomes congested, the learning rate is set to a high value, meaning that global information is weighted more heavily than local information. Conversely, local information is emphasized over global information in non-congested switches. Results show that the proposed approach achieves a significant performance improvement over the traditional Q-routing, DRQ-routing, DBAR and Dynamic XY algorithms.
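A minimal sketch of the congestion-detection rule as described, assuming a linear mapping from buffer occupancy to learning rate (the paper's exact formula may differ):

```python
def update_learning_rate(free_slots_history, min_free, max_free):
    """Average free buffer slots over an interval, clamp against the
    min/max bounds, and map to a learning rate in [0, 1]."""
    assert max_free > min_free           # assumed well-formed bounds
    avg_free = sum(free_slots_history) / len(free_slots_history)
    avg_free = max(min_free, min(max_free, avg_free))
    # Fewer free slots -> more congestion -> higher learning rate,
    # so global congestion information outweighs local information.
    return (max_free - avg_free) / (max_free - min_free)
```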
5.
If software for embedded processors is based on a time-triggered architecture, using co-operative task scheduling, the resulting system can have very predictable behaviour. Such a system characteristic is highly desirable in many applications, including (but not restricted to) those with safety-related or safety-critical functions. In practice, a time-triggered, co-operatively scheduled (TTCS) architecture is less widely employed than might be expected, not least because care must be taken during the design and implementation of such systems if the theoretically predicted behaviour is to be obtained. In this paper, we argue that the use of appropriate ‘design patterns’ can greatly simplify the process of creating TTCS systems. We briefly explain the origins of design patterns. We then illustrate how an appropriate set of patterns can be used to facilitate the development of a non-trivial embedded system.
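As a sketch of one such pattern, the following is a minimal time-triggered co-operative scheduler: a static task table of (function, period, offset) entries dispatched from a single tick loop. Production TTCS systems implement this in C driven by a hardware timer interrupt; the tasks and tick length below are placeholders:

```python
import time

TICK_MS = 10                               # illustrative tick length
TASKS = [
    # (task function, period in ticks, offset in ticks)
    (lambda: print("read sensor"), 10, 0),
    (lambda: print("update display"), 50, 3),
]

def run(n_ticks):
    for tick in range(n_ticks):
        for task, period, offset in TASKS:
            if tick >= offset and (tick - offset) % period == 0:
                task()                     # tasks co-operate: run to completion
        time.sleep(TICK_MS / 1000.0)       # stand-in for a hardware timer
```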
6.
We examine several VLSI architectures and compare these for their suitability for various forms of the band matrix multiplication problem. The following architectures are considered: chain, broadcast chain, mesh, broadcast mesh and hexagonally connected. The forms of the matrix multiplication problem that are considered are: band matrix × vector and band matrix × band matrix. Metrics to measure the utilization of resources (bandwidth and processors) are also proposed. An important feature of this paper is the inclusion of correctness proofs. These proofs are provided for selected designs and illustrate how VLSI designs may be proved correct using traditional mathematical tools.
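For reference, a sequential sketch of the band matrix × vector computation that such architectures parallelize, with the matrix stored by diagonals (the storage layout is assumed for illustration, not taken from the paper):

```python
def band_matvec(a_diags, x, lower_bw, upper_bw):
    """y = A x for a band matrix stored as a dict of diagonals:
    a_diags[d] holds diagonal d (0 = main, negative = below the
    main diagonal), with length n - |d|."""
    n = len(x)
    y = [0.0] * n
    for d in range(-lower_bw, upper_bw + 1):
        diag = a_diags[d]
        for t in range(len(diag)):
            i = t + max(0, -d)           # row index of this band entry
            j = t + max(0, d)            # column index
            y[i] += diag[t] * x[j]
    return y
```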
7.
On-chip interconnection networks (OCINs) in many-core embedded systems consume large portions of the chip’s area, cost, delay and power. In addition to minimizing area, cost, and power, OCINs must feature low diameters to meet real-time deadlines. To achieve these goals, designing low-latency networks and sharing network resources are essential. We explore 13 OCINs – some new, such as the Enhanced Kite and the Spidergon–Donut networks – in 64-core systems with various topologies and properties. We also derive and compare their worst-case delays, longest and average distances, critical link lengths, bisection bandwidths, total link and router costs, and total arbiter powers. Results indicate that the Enhanced Kite, Kite, Spidergon–Donut and Spidergon–Donut4 achieve the best worst-case delays, with the Spidergon–Donut4 additionally featuring lower link and router costs, lower total arbiter power, and better 2D implementation and scalability.
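Distance metrics of this kind are typically obtained by breadth-first search over the topology graph; a generic sketch follows (the adjacency structure for each candidate topology would be built separately):

```python
from collections import deque

def distances_from(adj, src):
    """BFS hop distances from src in an unweighted topology graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_and_average(adj):
    """Diameter (longest shortest path) and average distance."""
    all_d = [d for s in adj for d in distances_from(adj, s).values() if d]
    return max(all_d), sum(all_d) / len(all_d)
```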
8.
This case study describes the specification and formal verification of the key part of SPaS, a development tool for the design of open loop programmable control developed at the University of Applied Sciences in Leipzig. SPaS translates the high-level representation of an open loop programmable control into a machine executable instruction list. The produced instruction list has to exhibit the same behaviour as suggested by the high-level representation. We discuss the following features of the case study: characterization of the correctness requirements, design of a verification strategy, the correctness proof, and the relation to the Common Criteria evaluation standard.
9.
In this paper, we explore the use of system-specific static code analyses to improve software quality for specific software systems. Specialized analyses, tailored to a particular system, make it possible to take advantage of system/domain knowledge that is not available to more generic analyses. Furthermore, analyses can be selected and/or developed to best meet the challenges and specific issues of the system at hand. As a result, such analyses can complement more generic code analysis tools because they are likely to have a greater impact on (business) concerns such as improving certain software quality attributes and reducing certain classes of failures. We present a case study of a large, industrial embedded system, giving examples of what kinds of analyses could be realized and demonstrating the feasibility of implementing them. We synthesize lessons learned from our case study and provide recommendations on how to realize system-specific analyses and how to get them adopted by industry.
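As an illustration of how small such a system-specific analysis can be, the sketch below flags direct calls to a hypothetical raw register-write helper that project rules say must go through a validated wrapper; the rule and all names are invented for illustration:

```python
import ast

BANNED_CALLS = {"write_register_raw"}      # hypothetical project rule

def check_source(source, filename="<module>"):
    """Report calls that bypass the (assumed) validated wrapper."""
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in BANNED_CALLS:
                findings.append((filename, node.lineno,
                                 f"call to {name} bypasses the validated wrapper"))
    return findings
```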
10.
In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution for efficiently executing highly parallel kernels. However, architectural constraints impose hard limits on main memory bandwidth and push for software techniques that optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, targeted at accelerating the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant with the OpenVX standard, based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach yields substantial reductions in execution time and bandwidth usage, even when the accelerator's main memory bandwidth is severely constrained.
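A sketch of the graph-analysis step under assumed inputs: topologically order the kernels of an OpenVX-style DAG so that, tile by tile, intermediate results can stay in on-chip memory between producer and consumer kernels (the edge-list format is an assumption, not the framework's actual API):

```python
from collections import defaultdict, deque

def tile_schedule(edges):
    """Topological order of a kernel DAG given as (producer, consumer)
    edges; running this order once per image tile keeps intermediates
    on-chip between adjacent kernels."""
    indeg, succ, nodes = defaultdict(int), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    q = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while q:
        u = q.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return order
```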
11.
Highly regular many-core architectures are increasingly popular, as they suit inherently parallelizable applications such as most of the image and video processing domain. In this article, we present a novel many-core microprocessor ASIC architecture dedicated to embedded video and image processing applications. We propose a flexible many-core approach with two architectures: one implemented in CMOS 65 nm technology containing 16 open-source tiles, and the other implemented in CMOS FD-SOI 28 nm technology containing 64 open-source tiles. Each tile of these architectures can choose its communication links according to the most relevant overall parallelism scheme for a targeted application. Both chips are fully functional in simulation. The layouts are presented with frequency, area and power consumption results. Various case studies illustrate the proposed flexible many-core architectures, focusing on architecture exploration, the instantiated parallelization scheme, and timing performance.
12.
The dynamic and unpredictable workloads of many-core systems often lead to high power consumption and long latencies; agile run-time task-allocation methods can effectively address these problems. To this end, an approximation model is proposed for the task-allocation problem in many-core systems to estimate the number of available nodes around any given node. A hill-climbing search heuristic (SHiC) is then used to rapidly find the optimal first node among all available nodes, and finally the CoNA algorithm performs the actual task allocation. Simulations under various network sizes and parameter settings show that SHiC achieves a significant performance improvement, reducing network latency and power consumption compared with the most recent related work.
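A minimal sketch in the spirit of the SHiC step, with a stand-in availability estimator in place of the paper's approximation model:

```python
def hill_climb_first_node(start, availability, neighbours):
    """Greedily move to the neighbour with the highest estimated
    number of available surrounding nodes until no neighbour improves;
    the local optimum becomes the first node for allocation."""
    current = start
    while True:
        best = max(neighbours(current), key=availability, default=None)
        if best is None or availability(best) <= availability(current):
            return current
        current = best
```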
13.
Matrix multiplication is widely used in a variety of application domains. When the input matrices and the product differ in memory format, a matrix transpose is required, and its efficiency has a non-negligible impact on performance. However, the state-of-the-art software solution and its optimizations suffer from low efficiency due to frequent interference with the main pipeline and their inability to perform matrix transpose and multiplication in parallel. To address this issue, we propose AMT, an asynchronous and in-place matrix transpose mechanism based on the C2R algorithm. AMT performs matrix transpose in an asynchronous processing module and uses two customized asynchronous matrix transpose instructions to facilitate processing. We implement the logic design of AMT in RTL and verify its correctness. Simulation results show that AMT achieves an average of 1.27x (up to 1.48x) speedup over a state-of-the-art software baseline, and is within 95.4% of an ideal method. Overhead analysis shows that AMT incurs only small area and power overheads.
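For orientation, a sketch of in-place transpose for the square case (row-major flat storage); the C2R algorithm used by AMT generalizes the in-place idea to rectangular matrices via cycle-following, which is omitted here:

```python
def transpose_in_place_square(a, n):
    """Swap entries across the main diagonal of an n x n matrix
    stored row-major in the flat list a, using no extra buffer."""
    for i in range(n):
        for j in range(i + 1, n):
            a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
```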
14.
Many image processing applications need real-time performance while facing restrictions on size, weight and power consumption. Common solutions, including hardware/software co-designs, are based on Field Programmable Gate Arrays (FPGAs); their main drawback is long development time. In this work, a co-design methodology for processor-centric embedded systems with FPGA-based hardware acceleration is proposed. The goal of this methodology is to achieve real-time embedded solutions using hardware acceleration, but with development time similar to that of software projects. Well-established methodologies, techniques and languages from the software domain, such as Object-Oriented design, the Unified Modelling Language, and multithreaded programming, are applied, and semiautomatic C-to-HDL translation tools and methods are used and compared. The methodology is applied to an embedded implementation of a global vision algorithm for the localization of multiple robots in an e-learning robotic laboratory. The algorithm is specifically developed to work reliably 24/7 and to detect the robots’ positions and headings even under the partial occlusions and varying lighting conditions expected in a normal classroom. The co-designed implementation processes 1,600 × 1,200 pixel images at a rate of 32 fps with an estimated energy consumption of 17 mJ per frame. It achieves a 16× acceleration and 92 % energy saving, which compares favorably with the most optimized embedded software solutions. This case study shows the usefulness of the proposed methodology for embedded real-time image processing applications.
15.
This work aims to pave the way for an efficient open system architecture for embedded electronic applications that handles computationally complex algorithms in real time and at low cost. The target is to define a standard architecture able to improve on the performance-cost trade-off delivered by alternatives on the market today, such as general-purpose multi-core processors. Our approach, built on hardware/software (HW/SW) co-design and run-time reconfigurable computing, is synthesizable in SRAM-based programmable logic. As a proof of concept, a run-time partially reconfigurable field-programmable gate array (FPGA) is used to carry out a specific application with high computational demands: an automatic fingerprint authentication system (AFAS). Biometric personal recognition is a good example of a compute-intensive algorithm composed of a series of image processing tasks executed in sequential order. In our pioneering conception, these tasks are first partitioned and synthesized into a series of coprocessors that are then instantiated and executed, multiplexed in time, on a partially reconfigurable region of the FPGA. Benchmarking the AFAS either as a pure software approach on a PC platform with a dual-core processor (Intel Core 2 Duo T5600 at 1.83 GHz) or as a reconfigurable FPGA co-design (the identical algorithm partitioned into HW/SW tasks operating at 50 or 100 MHz on the second smallest device of the Xilinx Virtex-4 LX family) shows a speed-up of one order of magnitude in favor of the FPGA alternative. These results point to biometric recognition as a compelling killer application for run-time reconfigurable computing, mainly in terms of efficiently balancing computational power, functional flexibility and cost. Such features, reached through partial reconfiguration, are readily portable today to a broad range of embedded applications with the same system architecture.
16.
In a recent paper, Fox, Otto and Hey consider matrix algorithms for hypercubes. For hypercubes allowing pipelined broadcast of messages they present a communication-efficient algorithm. We present in this paper a similar algorithm that uses only nearest-neighbour communication; it will therefore be very communication-efficient also on hypercubes that do not allow pipelined broadcast. We introduce a new algorithm that reduces the asymptotic communication cost. This is achieved by regarding the hypercube as a set of subcubes and by using the cascade sum algorithm.
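A simulated sketch of the cascade sum primitive on a hypercube, using only nearest-neighbour exchanges (this illustrates the reduction pattern, not the full matrix algorithm):

```python
def hypercube_sum(values, dims):
    """Cascade sum on a 2**dims-node hypercube: in step d every node
    combines with the neighbour whose id differs in bit d; after
    dims steps each node holds the global sum."""
    p = 1 << dims
    vals = list(values)
    for d in range(dims):
        vals = [vals[i] + vals[i ^ (1 << d)] for i in range(p)]
    return vals
```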
17.
In this paper we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system. © 1997 by John Wiley & Sons, Ltd.
18.
This paper describes several parallel algorithmic variations of Neville elimination. This elimination solves a system of linear equations by making zeros in a matrix column through adding to each row an adequate multiple of the preceding one. The parallel algorithms are run and compared on different multi- and many-core platforms using parallel programming techniques such as MPI, OpenMP and CUDA.
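A sequential sketch of the elimination step described above (pivoting concerns are ignored; the update assumes the preceding row's leading entry is nonzero, as holds e.g. for totally positive matrices):

```python
def neville_elimination(a):
    """Neville elimination on an n x n matrix (list of lists):
    zeros are created in each column by subtracting from each row a
    suitable multiple of the row immediately above it, working from
    the bottom up, rather than using a fixed pivot row."""
    n = len(a)
    for k in range(n - 1):
        for i in range(n - 1, k, -1):
            if a[i][k] != 0.0:
                m = a[i][k] / a[i - 1][k]
                for j in range(k, n):
                    a[i][j] -= m * a[i - 1][j]
    return a
```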