首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
大多数基于卷积神经网络(CNN)的算法都是计算密集型和存储密集型的,很难应用于具有低功耗要求的航天、移动机器人、智能手机等嵌入式领域。针对这一问题,提出一种面向CNN的高并行度现场可编程逻辑门阵列(FPGA)加速器。首先,比较研究CNN算法中可用于FPGA加速的4类并行度;然后,提出多通道卷积旋转寄存流水(MCRP)结构,简洁有效地利用了CNN算法的卷积核内并行;最后,采用输入输出通道并行+卷积核内并行的方案提出一种基于MCRP结构的高并行度CNN加速器架构,并将其部署到XILINX的XCZU9EG芯片上,在充分利用片上数字信号处理器(DPS)资源的情况下,峰值算力达到2 304 GOPS。以SSD-300算法为测试对象,该CNN加速器的实际算力为1 830.33 GOPS,硬件利用率达79.44%。实验结果表明,MCRP结构可有效提高CNN加速器的算力,基于MCRP结构的CNN加速器可基本满足嵌入式领域大部分应用的算力需求。  相似文献   

2.
Phase change memory (PCM) is a promising candidate to replace DRAM as main memory, thanks to its better scalability and lower static power than DRAM. However, PCM also presents a few drawbacks, such as long write latency and high write power. Moreover, the write commands parallelism of PCM is restricted by instantaneous power constraints, which degrades write bandwidth and overall performance. The write power of PCM is asymmetric: writing a zero consumes more power than writing a one. In this paper, we propose a new scheduling policy, write power asymmetry scheduling (WPAS), that exploits the asymmetry of write power. WPAS improveswrite commands parallelism of PCM memory without violating power constraint. The evaluation results show that WPAS can improve performance by up to 35.5%, and 18.5% on average. The effective read latency can be reduced by up to 33.0%, and 17.1% on average.  相似文献   

3.
The problem of mapping affine loop nests onto parallel computers with distributed memory is considered. A technique for algorithm scheduling and distributing operations and data over processors is proposed. This technique makes it possible to generate pipeline parallelism and minimize the amount of data exchanges between the processors. The method is adapted for automation and explicitly allows for dependence on outer variables of loops.  相似文献   

4.
张震  付印金  胡谷雨 《计算机应用》2018,38(8):2230-2235
相变存储器(PCM)凭借低功耗的优势有望成为新一代主存储器,但是耐受性的缺陷成为其广泛应用的重要障碍。现有的随机存取存储器(DRAM)缓存技术和磨损均衡分别从减少PCM写数量以及均匀化写操作分布两个角度延长PCM使用寿命,但前者在写回数据时未考虑数据的读写倾向性,后者在空间局部性较强的应用场景下存在数据交换粒度、空间开销、随机性等诸多问题。因此,设计一种全新的混合存储架构,结合最近最少使用(LRU)算法和带有时间变化的最不经常使用(LFU-Aging)算法提出区分数据读写倾向性的缓存策略,并且基于布隆过滤器(BF)设计针对强空间局部性工作集的动态磨损均衡算法,在有效减少冗余写操作的同时实现低空间开销的组间磨损均衡操作。实验结果表明,该策略能够减少PCM上13.4%~38.6%的写操作,同时有效均匀90%以上分组的写操作分布。  相似文献   

5.
Initially, parallel algorithms were designed by parallelising the existing sequential algorithms for frequently occurring problems on available parallel architectures.

More recently, parallel strategies have been identified and utilised resulting in many new parallel algorithms. However, the analysis of such techniques reveals that further strategies can be applied to increase the parallelism. One of these strategies, i.e., increasing the computational work in each processing node, can reduce the memory accesses and hence congestion in a shared memory multiprocessor system. Similarly, when network message passing is minimised in a distributed memory processor system, dramatic improvements in the performance of the algorithm ensue.

A frequently occurring computational problem in digital signal processing (DSP) is the solution of symmetric positive definite Toeplilz linear systems. The Levinson algorithm for solving such linear equations is where the Toeplitz matrix property is utilised in the elimination process of each element to advantage. However, it can be shown that in the Parallel Implicit Elimination (PIE) method where more than one element is eliminated simultaneously, the Toeplitz structure can again be utilised to advantage. This relatively simple strategy yields a reduction in accesses to shared memory or network message passing, resulting in a significant improvement in the performance of the algorithm [2],  相似文献   

6.
This paper addresses high level synthesis for real-time digital signal processing (DSP) architectures using heterogeneous functional units (FUs). For such special purpose architecture synthesis, an important problem is how to assign a proper FU type to each operation of a DSP application and generate a schedule in such a way that all requirements can be met and the total cost can be minimized. We propose a two-phase approach to solve this problem. In the first phase, we solve the heterogeneous assignment problem, i.e., how to assign proper FU types to applications such that the total cost can be minimized while the timing constraint is satisfied. In the second phase, based on the assignments obtained in the first phase, we propose a minimum resource scheduling algorithm to generate a schedule and a feasible configuration that uses as little resource as possible. We prove that the heterogeneous assignment problem is NP-complete. Efficient algorithms are proposed to find an optimal solution when the given DFG is a simple path or a tree. Three other algorithms are proposed to solve the general problem. The experiments show that our algorithms can effectively reduce the total cost compared with the previous work.  相似文献   

7.
张望  贾佳  孟渊  白旭 《计算机应用》2017,37(5):1341-1346
由于对广泛使用的AES算法的性能要求越来越高,基于软件的密码算法已经越来越难以满足高吞吐量密码破解的需求,因此越来越多的算法利用现场可编程逻辑门阵列(FPGA)平台进行加速。针对AES算法在FPGA硬件上存在的开发复杂度高且开发周期长等问题,采用高层次综合(HLS)设计方法,使用高级程序语言描述并设计AES硬件加速算法。首先利用循环展开等提高运算并行度;其次使用资源平衡技术进行优化,充分利用片上存储和电路资源;最后添加全流水结构,提高整体设计的时钟频率和吞吐量,同时也详细对比分析基准设计、利用结构展开、资源均衡以及流水线优化方法的设计。经过实验表明,在Xilinx xc7z020clg484 FPGA芯片上,最终AES算法的时钟频率最高达到127.06 MHz,而吞吐量达到了16.26 Gb/s,较之基准的AES设计,性能提升了三个数量级。  相似文献   

8.
We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result by attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory.Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh.  相似文献   

9.
为挖掘可重构处理器的内在并行性,需要编译器通过分析程序的并行性来决定可重构处理器硬件最好的执行模式。为此,提出一种基于可重构处理器的并行优化算法。将有向无环图的并行计算部分映射到可重构处理器上,对任务实现3个不同层次的并行性(指令级并行、循环级并行、线程级并行)。测试结果表明,该算法使得可重构处理器在处理任务时比未用并行优化算法的性能提升1.2倍左右。  相似文献   

10.
In this paper, we present an algorithm that can be used to implement sequential, causal, or cache consistency in distributed shared memory (DSM) systems. For this purpose it includes a parameter that allows us to choose the consistency model to be implemented. If all processes run the algorithm with the same value in this parameter, the corresponding consistency is achieved. (Additionally, the algorithm tolerates that processes use certain combination of parameter values.) This characteristic allows a concrete consistency model to be chosen, but implements it with the more efficient algorithm in each case (depending of the requirements of the applications). Additionally, as far as we know, this is the first algorithm proposed that implements cache coherence.In our algorithm, all the read and write operations are executed locally when implementing causal and cache consistency (i.e., they are fast). It is known that no sequential algorithm has only fast memory operations. In our algorithm, however, all the write operations and some read operations are fast when implementing sequential consistency. The algorithm uses propagation and full replication, where the values written by a process are propagated to the rest of the processes. It works in a cyclic turn fashion, with each process of the DSM system, broadcasting one message in turn. The values written by the process are sent in the message (instead of sending one message for each write operation): However, unnecessary values are excluded. All this permits the amount of message traffic owing to the algorithm to be controlled.  相似文献   

11.
We model a deterministic parallel program by a directed acyclic graph of tasks, where a task can execute as soon as all tasks preceding it have been executed. Each task can allocate or release an arbitrary amount of memory (i.e., heap memory allocation can be modeled). We call a parallel schedule “space efficient” if the amount of memory required is at most equal to the number of processors times the amount of memory required for some depth-first execution of the program by a single processor. We describe a simple, locally depth-first scheduling algorithm and show that it is always space efficient. Since the scheduling algorithm is greedy, it will be within a factor of two of being optimal with respect to time. For the special case of a program having a series-parallel structure, we show how to efficiently compute the worst case memory requirements over all possible depth-first executions of a program. Finally, we show how scheduling can be decentralized, making the approach scalable to a large number of processors when there is sufficient parallelism  相似文献   

12.
Multi-module memory has been employed in high-end digital signal processing system (DSP). It provides high memory bandwidth and low power operating mode for energy savings. However, making full use of these architectural features is a challenging problem for code optimization. In this paper, we propose an integer linear programming (ILP) model to optimize the performance and energy consumption of multi-module memories by solving variable assignment, instruction scheduling and operating mode setting problems simultaneously. The combined effect of performance and energy saving requirements has been considered as well. Specially, we develop two optimization techniques to improve the computation efficiency of our ILP model. The experimental results show that the optimal performance and energy solution can be achieved within a reasonable amount of time.  相似文献   

13.
Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task and data parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task and data parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks, and the intertask data communication volumes. A locality-conscious scheduling strategy is used to improve intertask data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications and synthetic graphs shows that our algorithm consistently generates schedules with a lower makespan as compared to Critical Path Reduction (CPR) and Critical Path and Allocation (CPA), two previously proposed scheduling algorithms. Our algorithm also produces schedules that have a lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.  相似文献   

14.
Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms  相似文献   

15.
数字信号处理(DSP)具有并行的硬件乘法器、流水线结构以及快速的片内存储器等资源,其技术广泛地应用于数字信号处理的各个领域.介绍了IIR数字滤波器的原理,利用MATLAB软件生成滤波器的输入数据和系数,进行相应的数据压缩处理,并生成仿真波形,最后给出了用DSP语言实现IIR数字滤波器的仿真结果,同时对仿真结果进行了分析、比较,确保了输出波形的精确度.  相似文献   

16.
Emerging non-volatile memory technologies, especially flash-based solid state drives (SSDs), have increasingly been adopted in the storage stack. They provide numerous advantages over traditional mechanically rotating hard disk drives (HDDs) and have a tendency to replace HDDs. Due to the long existence of HDDs as primary building blocks for storage systems, however, much of the system software has been specially designed for HDD and may not be optimal for non-volatile memory media. Therefore, in order to realistically leverage its superior raw performance to the maximum, the existing upper layer software has to be re-evaluated or re-designed. To this end, in this paper, we propose PASS, an optimized I/O scheduler at the Linux block layer to accommodate the changing trend of underlying storage devices toward flash-based SSDs. PASS takes the rich internal parallelism in SSDs into account when dispatching requests to the device driver in order to achieve high performance. Specifically, it parti-tions the logical storage space into fixed-size regions (preferably the component package sizes) as scheduling units. These scheduling units are serviced in a round-robin manner and for every chance that the chosen dispatching unit issues only a batch of either read or write requests to suppress the excessive mutual interference. Additionally, the requests are sorted according to their visiting addresses while waiting in the dispatching queues to exploit high sequential performance of SSD. The experimental results with a variety of workloads have shown that PASS outperforms the four Linux off-the-shelf I/O schedulers by a degree of 3%up to 41%, while at the same time it improves the lifetime significantly, due to reducing the internal write amplification.  相似文献   

17.
韩亮  陈杰  陈晓东 《计算机应用研究》2005,22(1):99-101,193
许多按照高性能思想设计出的DSP处理器,其性能却在应用中得不到很好的发挥。深入分析DSP处理器的指令编码就会发现,要使其高性能得以发挥就应该在设计指令集时慎重考虑指令的编码方式。要么通过提高指令编码密度的方式提高处理器的并行度;要么使用更加简单和规则的指令编码以提高处理器编程和编译的效率。在分别讨论、比较了两种方式后,提出了一种基于Huffman算法的能够提高编码效率的指令编码方法。  相似文献   

18.
赵姗  杨秋松  李明树 《软件学报》2019,30(4):1164-1190
为了满足应用程序的多样化需求,异构多核处理器出现并逐渐进入市场,其中的处理核心(core)具有不同的微架构或者指令集架构(ISA),为应用提供多样化特性支持,比如指令级并行(ILP)、内存级并行(MLP),这些核心协同工作满足整个计算系统的优化目标,比如高性能、低功耗或者良好的能效.然而,目前主流的调度技术主要是针对传统同构处理器架构设计,没有考虑异构硬件能力的差异性.在异构多核处理器环境下,调度技术如何感知硬件的异构特性,为不同类型的应用程序提供更加合适和匹配的硬件资源,这是值得探索的问题.对近年来在该研究领域的成果进行了综述研究,特别是在性能非对称多核处理器架构下,异构调度技术面临的优化目标、分析模型、调度决策和算法评估等主要问题进行了分析和描述,并依次对相关技术进行了系统的总结,最后从软硬件融合的角度对今后的研究工作进行了展望.  相似文献   

19.
王正华  郭炜  魏继增 《计算机工程》2010,36(10):282-284
针对传输触发架构下代码生成中指令调度的流水线冲突、调度死锁、资源冲突等问题,给出一种基于最小延时的遗传搜索算法模型,将软件旁路优化和资源动态分配优化整合到该模型中。实验结果表明,该算法能产生较高质量的并行代码,90%以上测试用例的指令级并行度高于表调度算法获得的结果。  相似文献   

20.
Inter-satellite link (ISL) scheduling is required by the BeiDou Navigation Satellite System (BDS) to guarantee the system ranging and communication performance. In the BDS, a great number of ISL scheduling instances must be addressed every day, which will certainly spend a lot of time via normal metaheuristics and hardly meet the quick-response requirements that often occur in real-world applications. To address the dual requirements of normal and quick-response ISL schedulings, a data-driven heuristic assisted memetic algorithm (DHMA) is proposed in this paper, which includes a high-performance memetic algorithm (MA) and a data-driven heuristic. In normal situations, the high-performance MA that hybridizes parallelism, competition, and evolution strategies is performed for high-quality ISL scheduling solutions over time. When in quick-response situations, the data-driven heuristic is performed to quickly schedule high-probability ISLs according to a prediction model, which is trained from the high-quality MA solutions. The main idea of the DHMA is to address normal and quick-response schedulings separately, while high-quality normal scheduling data are trained for quick-response use. In addition, this paper also presents an easy-to-understand ISL scheduling model and its NP-completeness. A seven-day experimental study with 10 080 one-minute ISL scheduling instances shows the efficient performance of the DHMA in addressing the ISL scheduling in normal (in 84 hours) and quick-response (in 0.62 hour) situations, which can well meet the dual scheduling requirements in real-world BDS applications.   相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号