Similar articles
 Found 20 similar articles (search time: 140 ms)
1.
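For the self-dependent loop structures common in CFD programs, this article analyzes why the wavefront parallelization technique cannot parallelize them. Based on the nature of the dependences involved, it proposes a mirror decomposition technique for self-dependent loops: by eliminating cross-iteration anti-dependences, the technique enables wavefront parallelism on such loops and thereby parallelizes them.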
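
As a rough illustration of the wavefront idea mentioned in this abstract, the sketch below sweeps a 2D grid whose cells depend on their upper and left neighbours. It shows only plain wavefront ordering on a loop with flow dependences; it does not implement the article's mirror decomposition, and the update rule and function names are assumptions chosen for the example.

```python
def sweep_sequential(grid):
    """Reference sweep in ordinary row-major order."""
    a = [row[:] for row in grid]
    n, m = len(a), len(a[0])
    for i in range(1, n):
        for j in range(1, m):
            # Hypothetical update: each cell depends on its upper and left neighbours.
            a[i][j] = a[i][j] + a[i - 1][j] + a[i][j - 1]
    return a


def sweep_wavefront(grid):
    """Same sweep, ordered by anti-diagonals d = i + j (the wavefronts)."""
    a = [row[:] for row in grid]
    n, m = len(a), len(a[0])
    for d in range(2, n + m - 1):
        lo = max(1, d - (m - 1))
        hi = min(n - 1, d - 1)
        # Cells on one anti-diagonal read only values from diagonal d - 1,
        # so this inner loop could be distributed across threads or processes.
        for i in range(lo, hi + 1):
            j = d - i
            a[i][j] = a[i][j] + a[i - 1][j] + a[i][j - 1]
    return a


if __name__ == "__main__":
    grid = [[1] * 5 for _ in range(4)]
    print(sweep_wavefront(grid) == sweep_sequential(grid))
```

The inner loop is left sequential here to keep the sketch short; the point is only that the two orderings produce the same grid.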

2.
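Taking the optimization of a serial inverse synthetic aperture radar (ISAR) simulation program as a case study, this work investigates methods for improving radar simulation run-time efficiency on multi-core PCs. It first summarizes the parallel algorithms commonly used for radar signal simulation, points out the strengths and weaknesses of each, and analyzes the parallel granularity of radar signal simulation. It then examines the time cost of, and optimization strategies for, each computing unit, including target echo generation, display of simulation results, and pulse compression, and analyzes the test results. Finally, the serial ISAR simulation is optimized with a combination of strategies: a high-performance function library, multi-channel parallel pulse compression, a fast recursive algorithm for echo simulation, parallel simulation of target echoes, and pipeline parallelism. After the combined optimizations the program runs 6.7 times as fast as the original, which demonstrates the effectiveness and practicality of the strategies.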

3.
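Infrared Technology (《红外技术》), 2017, (2): 152-156
Performance verification of infrared and low-light-level fusion algorithms and related products requires synchronized simulated video sources. This paper proposes a Vega-based method for the synchronized simulation of infrared and low-light-level video. An FPGA-based device for sending serial-port data in parallel was developed to control the synchronization of the video simulation, and a serial-port program and a dynamic simulation control program were written to provide serial-port monitoring, data reception, data identification, and simulation control. Using serial communication between the computers and the control board, two computers simulate infrared video and low-light-level video in synchrony. Finally, the effectiveness of the method is verified through image fusion analysis.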

4.
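王娴, 吴张永, 牛骁, 刘畅. Information Technology (《信息技术》), 2012, (6): 11-14, 18
Parallel discrete event simulation facilitates the design and study of large, complex systems. The technique requires the simulation program to advance the simulation correctly and sensibly under the causality constraints on events. Management and advancement of the simulation clock is one of the major factors affecting the efficiency of a parallel simulation system. Through a comparative analysis of the two basic synchronization mechanisms, conservative and optimistic, the paper studies and summarizes the key elements of simulation clock management and advancement. The conclusions offer a useful reference for implementing concrete synchronization algorithms in parallel discrete event simulation systems.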

5.
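To analyze and predict the kernel performance of CUDA parallel programs, and thereby guide parallel program design and performance optimization, a performance prediction framework is proposed. 1) Starting from the details of the GPU programming model and device architecture, and taking the warp as the unit of study, the framework combines the basic software and hardware characteristics closely related to GPU program run time and defines high-level performance-related features such as parallel space idleness, streaming processor warp load, and the parallel effect factor. 2) Based on these features, the framework targets GPU programs with balanced thread load and estimates the execution time of kernel functions under different problem sizes and execution configurations. 3) Following the performance evaluation principle, an optimization strategy for the execution configuration parameters of kernel functions is proposed. Validation experiments show that, in two typical scenarios, the framework predicts the performance of existing programs with average accuracies of 89% and 94% respectively, objectively captures the correlation between the high-level features and program performance, and can qualitatively analyze the performance level of parallel algorithms.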

6.
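韩秉君, 黄诗铭, 杜滢. Telecommunications Science (《电信科学》), 2015, 31(10): 82-88
A method is proposed for accelerating DFT (discrete Fourier transform) processing in communication simulation using CUDA (compute unified device architecture) on Kepler-architecture GPUs (graphics processing units). The core idea is to use thread-level parallelism to accelerate the DFT computation within a single transmit-receive link, and to use dynamic parallelism and Hyper-Q to process the links of different transmit-receive user pairs in parallel, thereby accelerating DFT processing in the simulation as a whole. Experimental results show that, compared with a single-core, single-threaded CPU program and a GPU program on the previous-generation Fermi architecture, the method increases DFT processing speed by factors of 300 and 3 respectively, a good acceleration result.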

7.
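To address the problems currently encountered when porting the WRF model, a concrete porting and deployment method is proposed. A rainfall simulation over the eastern United States is then used to show the difference in WRF computing performance between Intel x86 and ARM architectures, and the accuracy of the results is evaluated with NCL. The experiments show that the WRF model can be ported in full to an ARM-based supercomputer. The simulation results show that, in parallel runs, the WRF model takes less time on the ARM-based supercomputing cluster, and that in single-node runs ARM effectively improves parallel computing efficiency. The approach has practical value and provides a reference for researchers.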

8.
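Based on the principle of the CRC (cyclic redundancy check) algorithm and the requirements that the ISO/IEC 18000-6 standard places on the check circuit of ultra-high-frequency RFID systems, the serial CRC algorithm is analyzed and a parallel CRC algorithm is proposed. The algorithm is written in Verilog HDL and simulated in Quartus II 9.0. The simulation results are presented together with the parallel CRC generation and checking modules; the results show that the parallel CRC algorithm effectively increases the data processing speed of the system.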
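
The relationship between serial and parallel CRC can be sketched in software. The example below is not the article's Verilog design: it assumes a 16-bit CRC with generator 0x1021 and preset 0xFFFF (the parameters commonly labelled CRC-16/CCITT-FALSE), and all names are illustrative. The serial version shifts in one message bit per step; the "parallel" version consumes eight bits per step through a precomputed table, which plays the role of the combinational XOR network in a hardware implementation.

```python
POLY = 0x1021  # assumed generator polynomial x^16 + x^12 + x^5 + 1
INIT = 0xFFFF  # assumed register preset


def crc16_serial(data: bytes, crc: int = INIT) -> int:
    """Bit-serial CRC: one message bit enters the shift register per step."""
    for byte in data:
        for i in range(7, -1, -1):          # most significant bit first
            bit = (byte >> i) & 1
            msb = (crc >> 15) & 1
            crc = (crc << 1) & 0xFFFF
            if msb ^ bit:
                crc ^= POLY
    return crc


def _build_table() -> list:
    """Precompute the effect of eight serial steps for every possible byte."""
    table = []
    for value in range(256):
        crc = value << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
        table.append(crc)
    return table


TABLE = _build_table()


def crc16_parallel(data: bytes, crc: int = INIT) -> int:
    """Byte-parallel CRC: eight message bits are absorbed per step."""
    for byte in data:
        crc = ((crc << 8) & 0xFFFF) ^ TABLE[((crc >> 8) ^ byte) & 0xFF]
    return crc


if __name__ == "__main__":
    msg = b"123456789"
    print(hex(crc16_serial(msg)), hex(crc16_parallel(msg)))
```

Both functions return the same value for any input; the parallel form simply needs one eighth as many register updates, which is the source of the speed-up a hardware design would exploit.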

9.
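Parallelizing neural network computation is an effective way to speed it up, but writing parallel programs is laborious, and handling synchronization and communication statements is especially tedious. This paper analyzes the computation of BP neural networks to identify the characteristics that favor parallelization, then generates parallel source code, including the communication and synchronization statements, from given parameters. The paper closes with a prediction of the parallel efficiency.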

10.
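Applying the mapping sequence spread spectrum (MSSS) method to parallel combinatory spread spectrum (PCSS) underwater acoustic (UWA) communication effectively lowers the peak-to-average power ratio of the communication signal and thus improves communication performance. When Gold codes are used as the spreading codes, however, their cyclic shift-and-add property causes false peaks to appear when the mapped signal is decorrelated at the receiver, which severely degrades system performance. To reduce the impact of false peaks, this paper proposes two methods: PCSS based on the phase difference of correlation peaks (PDCP-PCSS) and interleaved PCSS (IPCSS). PDCP-PCSS adds the correlation-peak phase difference method at the receiver to identify and remove false peaks, effectively reducing their impact on communication performance. IPCSS combines interleaving with parallel combinatory spread spectrum and prevents false peaks from being generated. Simulations and sea trials verify that both methods outperform conventional PCSS underwater acoustic communication (CPCSS). PDCP-PCSS gives the best communication performance but applies only to combinations of three Gold codes, whereas IPCSS is more widely applicable.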

11.
Parallel simulation is an efficient way to cope with long runtimes and high computational requirements in simulations of modern complex integrated electronic circuits and systems. This paper presents an algorithm for parallel simulation based on parallelization in equation formulation and simultaneous calculation of matrix contributions for nonlinear analog elements. In addition, the paper describes the development of a grid interface for a parallel simulator that enables a designer to perform simulations on distant computer clusters. Performances of the developed parallel simulation algorithm are evaluated by simulation of a microelectromechanical system.

12.
The ParaScope parallel programming environment   Cited by: 1 (self-citations: 0, by others: 1)
The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, is described. It includes a collection of tools that use global program analysis to help users develop and debug parallel programs. The focus is on ParaScope's compilation system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. The debugging system detects and reports timing-dependent errors, called data races, in execution of parallel programs. A project aimed at extending ParaScope to support programming in FORTRAN D, a machine-independent parallel programming language for use with both distributed-memory and shared-memory parallel computers, is described.

13.
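This article introduces the use of JXTA technology in P2P and analyzes the advantages and disadvantages of applying JXTA to distributed parallel computing. Because task allocation for parallel computing in current JXTA network environments does not achieve load balancing, we propose a method that schedules parallel computing tasks by predicting the load of peers in JXTA. Finally, a parallel prime search program whose tasks are mutually independent is used to test the performance of this parallel task allocation strategy.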

14.
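Digital signal processing algorithms are expressed in parallel form using matrices, and the idea of the matrix tensor product is then used to derive a two-stage iterative parallel filtering algorithm expressed as tensor products. On this basis, further decomposition yields a multi-stage iterative parallel filtering algorithm, and polyphase decomposition is used to further raise the degree of parallelism in system applications. The algorithm is applied in a MATLAB simulation, at the multiply-add level, of 32-way parallel processing of a 16QAM signal at a bit rate of 5 Gbps. The simulation output is equivalent to the data obtained with the serial algorithm, and the resulting system bit error rate is compared with the ideal bit error rate: the simulated and theoretical bit error rates differ by less than 0.05 dB.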

15.
Compiling for distributed-memory systems   Cited by: 1 (self-citations: 0, by others: 1)
Compilation techniques for the source-to-source translation of programs in an extended FORTRAN 77 to equivalent parallel message-passing programs are discussed. A machine-independent language extension to FORTRAN 77, Data Parallel FORTRAN (DPF), is introduced. It allows the user to write programs for distributed-memory multiprocessing systems (DMMPS) using global addresses, and to specify the distribution of data across the processors of the machine. Message-Passing FORTRAN (MPF), a FORTRAN extension that allows the formulation of explicitly parallel programs that communicate via explicit message passing, is also introduced. Procedures and optimization techniques for both languages are discussed. Additional optimization methods and advanced parallelization techniques, including run-time analysis, are also addressed. An extensive overview of related work is given.

16.
This paper presents a new type of network simulator for simulating the call‐level operations of telecom networks and especially ATM networks. The simulator is a pure time‐true type as opposed to a call‐by‐call type simulator. It is also characterized as a batch‐type simulator. The entire simulation duration is divided into short time intervals of equal duration, t. During t, a batch processing of call origination or termination events is executed and the time‐points of these events are sorted. The number of sorting executions is drastically reduced compared to a call‐by‐call simulator, resulting in considerable timesaving. The proposed data structures of the simulator can be implemented by a general‐purpose programming language and are well fitted to parallel processing techniques for implementation on parallel computers, for further savings of execution time. We have first implemented the simulator in a sequential computer and then we have applied parallelization techniques to achieve its implementation on a parallel computer. In order to simplify the parallelization procedure, we dissociate the core simulation from the built‐in call‐level functions (e.g. bandwidth control or dynamic routing) of the network. The key point for a parallel implementation is to organize data by virtual paths (VPs) and distribute them among processors, which all execute the same set of instructions on this data. The performance of the proposed batch‐type, time‐true, ATM‐network simulator is compared with that of a call‐by‐call simulator to reveal its superiority in terms of sequential execution time (when both simulators run on conventional computers). Finally, a measure of the accuracy of the simulation results is given. Copyright © 2002 John Wiley & Sons, Ltd.
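
The batching idea in this abstract can be illustrated loosely as follows. This is only a guess at the general shape, not the simulator described in the paper: event time-points are grouped into intervals of equal duration and each interval's batch is sorted once, rather than the event list being re-ordered for every call. The function name, the `interval` parameter, and the sample data are assumptions made for the example.

```python
def order_by_batches(event_times, interval):
    """Group event time-points into equal-length intervals, then sort each batch once."""
    buckets = {}
    for ts in event_times:
        # Index of the interval [k * interval, (k + 1) * interval) containing ts.
        buckets.setdefault(int(ts // interval), []).append(ts)
    ordered = []
    for k in sorted(buckets):
        # One sort per interval; batches are independent of one another,
        # so they could in principle be handled by different processors.
        ordered.extend(sorted(buckets[k]))
    return ordered


if __name__ == "__main__":
    times = [0.7, 2.1, 0.2, 1.5, 2.05, 1.0]
    print(order_by_batches(times, interval=1.0))
```

Processing the batches in interval order yields the same global event order as a single full sort, while the number of sort calls is bounded by the number of intervals.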

17.
This paper discusses parallelization of elliptic curve cryptography hardware accelerators using elliptic curves over binary fields $\mathbb{F}_{2^{m}}$. Elliptic curve point multiplication, which is the operation used in every elliptic curve cryptosystem, is hierarchical in nature, and parallelism can be utilized in different hierarchy levels as shown in many publications. However, a comprehensive analysis on the effects of parallelization has not been previously presented. This paper provides tools for evaluating the use of parallelism and shows where it should be used in order to maximize efficiency. Special attention is given to a family of curves called Koblitz curves because they offer very efficient point multiplication. A new method where the latency of point multiplication is reduced with parallel field arithmetic processors is introduced. It is shown to outperform the previously presented multiple field multiplier techniques in the cases of Koblitz curves and generic curves with fixed base points. A highly efficient general elliptic curve cryptography processor architecture is presented and analyzed. Based on this architecture and analysis on the effects of parallelization, a few designs are implemented on an Altera Stratix II field-programmable gate array (FPGA).

18.
19.
This paper studies how to parallelize the emerging media mining workloads on existing small-scale multi-core processors and future large-scale platforms. Media mining is an emerging technology to extract meaningful knowledge from large amounts of multimedia data, aiming at helping end users search, browse, and manage multimedia data. Many of the media mining applications are very complicated and require a huge amount of computing power. The advent of multi-core architectures provides the acceleration opportunity for media mining. However, to efficiently utilize the multi-core processors, we must effectively execute many threads at the same time. In this paper, we present how to explore the multi-core processors to speed up the computation-intensive media mining applications. We first parallelize two media mining applications by extracting the coarse-grained parallelism and evaluate their parallel speedups on a small-scale multi-core system. Our experiment shows that the coarse-grained parallelization achieves good scaling performance, but not perfect. When examining the memory requirements, we find that these coarse-grained parallelized workloads expose high memory demand. Their working set sizes increase almost linearly with the degree of parallelism, and the instantaneous memory bandwidth usage prevents them from perfect scalability on the 8-core machine. To avoid the memory bandwidth bottleneck, we turn to exploit the fine-grained parallelism and evaluate the parallel performance on the 8-core machine and a simulated 64-core processor. Experimental data show that the fine-grained parallelization demonstrates much lower memory requirements than the coarse-grained one, but exhibits significant read-write data sharing behavior. Therefore, the expensive inter-thread communication limits the parallel speedup on the 8-core machine, while excellent speedup is observed on the large-scale processor as fast core-to-core communication is provided via a shared cache. 
Our study suggests that (1) extracting the coarse-grained parallelism scales well on small-scale platforms, but poorly on large-scale system; (2) exploiting the fine-grained parallelism is suitable to realize the power of large-scale platforms; (3) future many-core chips can provide shared cache and sufficient on-chip interconnect bandwidth to enable efficient inter-core communication for applications with significant amounts of shared data. In short, this work demonstrates proper parallelization techniques are critical to the performance of multi-core processors. We also demonstrate that one of the important factors in parallelization is the performance analysis. The parallelization principles, practice, and performance analysis methodology presented in this paper are also useful for everyone to exploit the thread-level parallelism in their applications.
Wenlong Li

20.
We propose a new parallelization scheme for the hmmsearch function of the HMMER software, in order to target FPGA technology. hmmsearch is a very compute-intensive program for biological sequence alignment, based on profile hidden Markov models. We derive a flexible, generic, scalable hardware parallel architecture which can accelerate the core of hmmsearch by nearly two orders of magnitude, without modifying the original algorithm of this software. Our derivation is based on the expression of the algorithm as a set of recurrence equations, and we show in a systematic way how a very efficient parallel version of the algorithm can be found by combining scheduling, projection, partitioning, pipelining and precision analysis. We present the performance of the implementation of this parallel algorithm on an FPGA platform.
