Similar Documents
20 similar documents found (search time: 187 ms)
1.
Thread-Level Speculation (TLS) is a thread-level automatic parallelization technique for accelerating sequential programs on multi-core processors. Loops have regular structure and account for a large share of execution time, making them ideal targets for extracting parallelism. However, deciding which loops to parallelize in order to improve program speedup is difficult. To address this problem, this paper proposes a loop selection method based on performance prediction. Profiling information from pre-execution of the program is gathered on a training input set and, combined with the various speculation factors, used to build a performance prediction model for loop structures. The prediction quantitatively estimates the speedup of speculatively parallelizing a loop and decides whether the loop is worth parallelizing at runtime. Experimental results show that the proposed method effectively predicts the parallelism inherent in loops and, based on the prediction, accurately selects loops whose speculative parallelization is profitable; overall, the speedup on the Olden benchmark suite improves by 12.34% on average.
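A minimal sketch of how such a profiling-driven loop selector might score candidate loops. The field names and the cost model below are illustrative assumptions, not the paper's actual prediction model:

```cuda
// Hypothetical loop-selection heuristic for TLS, sketched from the abstract.
// All field names and the cost model are illustrative assumptions.
struct LoopProfile {
    double seq_time;        // measured sequential time of the loop
    double spawn_overhead;  // cost of forking a speculative thread
    double misspec_rate;    // fraction of iterations violating a dependence
    double squash_penalty;  // average re-execution cost after mis-speculation
    int    num_cores;       // cores available for speculation
};

// Estimate the speedup of speculatively parallelizing this loop.
double predictSpeedup(const LoopProfile& p) {
    double ideal    = p.seq_time / p.num_cores;           // perfect partition
    double overhead = p.spawn_overhead
                    + p.misspec_rate * p.squash_penalty;  // speculation costs
    return p.seq_time / (ideal + overhead);
}

// Select the loop for TLS only when the predicted speedup exceeds 1.
bool shouldParallelize(const LoopProfile& p) {
    return predictSpeedup(p) > 1.0;
}
```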

2.
Parallelized programs have greatly improved application execution efficiency, and performance must be taken into account when designing multi-core programs. This paper focuses on how parallelization overhead, load balancing, and thread-synchronization overhead affect the performance of multi-core, multi-threaded programs under the OpenMP programming model.
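A generic OpenMP fragment (not from the paper) touching all three factors it discusses: the fork/join overhead of the parallel region, dynamic scheduling for load balance, and a reduction clause that keeps synchronization cost low:

```cuda
// Illustrates parallelization overhead, load balance, and sync cost.
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;

    // schedule(dynamic) trades a little scheduling overhead for better load
    // balance when iteration costs vary; reduction avoids a shared-variable
    // lock, cutting synchronization down to one combine per thread.
    #pragma omp parallel for schedule(dynamic, 1024) reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += 1.0 / (i + 1);   // stand-in for uneven per-iteration work
    }

    printf("sum = %f (threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```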

3.
Speculative multithreading partitions compiler-generated instructions into threads and, building on control-flow and data-flow analysis, automatically parallelizes sequential programs. As an effective means of validating thread-partitioning algorithms, a simulator can not only verify the correctness of program results but also evaluate the speedup of concurrent execution, and in turn reflect how reasonable the partitioning algorithm is. Using runtime statistics of the Olden suite collected on a simulator, this paper analyzes the register-dependence problem that arises during thread partitioning, discusses the main causes of register dependences in detail with concrete examples, and finally proposes an improved thread-partitioning method targeting the register-dependence problem.

4.
To address the long computation time of point-cloud registration algorithms based on swarm-intelligence optimization, a CUDA-based parallel particle swarm registration algorithm is proposed. Taking the minimum point-to-point distance as the fitness function, the algorithm exploits the natural parallelism of the particles in particle swarm optimization and distributes the computation of the transformation parameters across GPU threads. Since the many GPU threads execute simultaneously without interfering with one another, the speed of the particle swarm computation is greatly improved, enabling fast and accurate point-cloud registration. Experimental results show that the algorithm both overcomes the ICP algorithm's sensitivity to the initial position of the point clouds and effectively solves the long-runtime problem of swarm-intelligence-based registration.
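A hypothetical sketch of the one-thread-per-particle fitness evaluation: each thread applies its particle's six-parameter rigid transform (the ZYX rotation order is an assumption) to the source cloud and accumulates nearest-point distances; a brute-force nearest-neighbor search is used for brevity:

```cuda
// One thread evaluates one particle's point-to-point fitness.
__global__ void psoFitness(const float3* src, int nSrc,
                           const float3* dst, int nDst,
                           const float*  params,   // 6 DoF per particle
                           float* fitness, int nParticles) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;

    const float* q = &params[6 * p];          // rx, ry, rz, tx, ty, tz
    float cx = cosf(q[0]), sx = sinf(q[0]);
    float cy = cosf(q[1]), sy = sinf(q[1]);
    float cz = cosf(q[2]), sz = sinf(q[2]);

    float total = 0.0f;
    for (int i = 0; i < nSrc; ++i) {
        float3 v = src[i];
        // Rotate (R = Rz*Ry*Rx, an assumption) and translate.
        float x1 = cy*cz*v.x + (cz*sx*sy - cx*sz)*v.y + (cx*cz*sy + sx*sz)*v.z + q[3];
        float y1 = cy*sz*v.x + (cx*cz + sx*sy*sz)*v.y + (cx*sy*sz - cz*sx)*v.z + q[4];
        float z1 = -sy*v.x   + cy*sx*v.y              + cx*cy*v.z              + q[5];

        float best = 1e30f;                    // squared nearest-point distance
        for (int j = 0; j < nDst; ++j) {
            float dx = x1 - dst[j].x, dy = y1 - dst[j].y, dz = z1 - dst[j].z;
            best = fminf(best, dx*dx + dy*dy + dz*dz);
        }
        total += best;
    }
    fitness[p] = total;                        // smaller is better
}
```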

5.
To improve computation speed, the affine-transformation computation in the processing of complex mesh models is ported to a programmable GPU. In the parallel computation, each thread computes the k-neighborhood weights and the coordinate transform of one vertex, with many threads executing simultaneously. By tuning the thread layout and device-memory allocation, the GPU's parallel performance is fully exploited. Experimental results show that GPU acceleration achieves good speedups for affine transformation of large-scale mesh vertices.
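A minimal sketch of the per-vertex transform with one thread per vertex; the 3×4 row-major matrix layout is an assumption, and the k-neighborhood weighting step is omitted:

```cuda
// Apply one affine transform to every vertex, one thread per vertex.
__global__ void affineTransform(const float3* in, float3* out, int nVerts,
                                const float* M) {   // M: 3 rows x 4 cols
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nVerts) return;

    float3 v = in[i];
    out[i] = make_float3(
        M[0] * v.x + M[1] * v.y + M[2]  * v.z + M[3],
        M[4] * v.x + M[5] * v.y + M[6]  * v.z + M[7],
        M[8] * v.x + M[9] * v.y + M[10] * v.z + M[11]);
}
// Launch sketch: affineTransform<<<(nVerts + 255) / 256, 256>>>(d_in, d_out, nVerts, d_M);
```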

6.
Recommended article of the month: Optimizing multi-core signal-processing performance with LabVIEW (http://www.ednchina.com/1009-001.aspx). As a programming language with an inherently parallel structure, LabVIEW automatically maps parallel program branches to multiple threads and dispatches them across the processor cores, letting computation-heavy mathematical or signal-processing applications run more efficiently and achieve their best performance.

7.
To address the problem that the original sequential simulation kernel of SystemC (SC) cannot fully exploit the processing power of multi-core processors, a SystemC-based parallel simulation scheme for multi-core processors is proposed. The new scheme fully exploits the parallel processing power of the multi-threaded operating system and thread-pool technology: by revising the thread-scheduling policy of the original sequential SC kernel and improving its underlying simulation process, the revised SC makes better use of multi-core processors to accelerate simulation. In addition, the new scheme layers the original SC simulation process and framework, which simplifies the connections and data transfers between modules inside the simulated system, shortens modeling and processing time, and substantially improves simulation efficiency.

8.
RPC correction is an important step in the optical remote-sensing image production pipeline, converting remote-sensing images from image coordinates to geographic coordinates. To shorten the RPC correction time and achieve near-real-time image processing, parallel RPC correction of wide-swath remote-sensing images was implemented on a current top-performing GPU (NVIDIA TITAN RTX): the GPU assigns a thread to every pixel that needs computing, carrying out the coordinate transformation, resampling, and related computation in parallel; and compared with the CPU...
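A hedged sketch of the per-pixel correction step: each thread maps one output pixel back to source-image coordinates and resamples bilinearly. `rpcInverse` is a hypothetical stand-in for the rational-polynomial transform; the real coefficients and polynomial evaluation are omitted:

```cuda
// Placeholder for the rational-polynomial inverse mapping; a real version
// evaluates ratios of cubic polynomials with the sensor's RPC coefficients.
__device__ float2 rpcInverse(int outX, int outY) {
    return make_float2((float)outX, (float)outY);   // identity stub
}

// One thread per output pixel: inverse-map, then bilinear resample.
__global__ void rpcCorrect(const float* src, int srcW, int srcH,
                           float* dst, int dstW, int dstH) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    float2 p = rpcInverse(x, y);                // source coords of this pixel
    int x0 = (int)floorf(p.x), y0 = (int)floorf(p.y);
    if (x0 < 0 || y0 < 0 || x0 + 1 >= srcW || y0 + 1 >= srcH) {
        dst[y * dstW + x] = 0.0f;               // outside the source image
        return;
    }
    float fx = p.x - x0, fy = p.y - y0;         // bilinear weights
    const float* s = src + y0 * srcW + x0;
    dst[y * dstW + x] =
        (1 - fx) * (1 - fy) * s[0]    + fx * (1 - fy) * s[1] +
        (1 - fx) * fy * s[srcW]       + fx * fy * s[srcW + 1];
}
```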

9.
徐杭威  赵壮  岳江  柏连发 《红外技术》2018,40(4):362-368
To improve classification efficiency while keeping results clear and accurate, this paper proposes a real-time unsupervised classification method for hyperspectral images based on normalized spectral vectors, using a graphics processing unit (GPU) and parallel optimization. The spatial consistency of hyperspectral images is exploited to improve classification accuracy; normalized spectral vectors simplify the inter-pixel similarity formula and unify how pixels are processed, while GPU parallelism improves computation speed. First, the spectral-vector similarity between spatially adjacent pixels is computed in parallel on the GPU, and a safety threshold is obtained by Gaussian fitting. Then, using the spectral angle as the inter-pixel similarity measure, similar pixels are grouped into homogeneous regions. Finally, the average spectral vector of the pixels in each homogeneous region represents its spectral signature, and similar homogeneous regions are merged according to the safety threshold to complete the classification. The method is evaluated on AVIRIS data. Theoretical analysis and experimental results show that, compared with existing unsupervised classification methods, the method achieves higher classification accuracy while also running faster.
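A sketch of the GPU step that compares each pixel with its right-hand neighbor using the spectral angle; the band-interleaved-by-pixel layout and all names are assumptions:

```cuda
// Spectra stored as spectra[pixel * bands + b]; one thread per pixel pair.
__global__ void spectralAngleRight(const float* spectra, int width, int height,
                                   int bands, float* angle) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width - 1 || y >= height) return;

    const float* a = spectra + (y * width + x) * bands;
    const float* b = a + bands;                  // right-hand neighbor
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (int k = 0; k < bands; ++k) {
        dot += a[k] * b[k];
        na  += a[k] * a[k];
        nb  += b[k] * b[k];
    }
    // Spectral angle in radians; a small angle means similar spectra.
    float v = dot * rsqrtf(fmaxf(na * nb, 1e-12f));
    angle[y * (width - 1) + x] = acosf(fminf(fmaxf(v, -1.0f), 1.0f));
}
```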

10.
The domestically produced Sunway (申威) processor appeared relatively late, and its performance in the multimedia domain is not yet outstanding; meanwhile, single-instruction multiple-data (SIMD) units in general-purpose processors are favored by vendors because they effectively increase parallel processing power. To improve the multimedia capability of the domestic Sunway platform, this paper proposes an H.264 encoding optimization method using the SIMD instruction set of the Sunway Core3B architecture. Taking the parallel structure of the Sunway processor into account, system profiling tools adapted to Sunway, such as Perf and Top, are used to collect the hot functions most strongly correlated with encoding performance at two mainstream video resolutions; their suitability for parallelization is analyzed in detail, and fine-grained optimization is performed by hand-writing Sunway SIMD and memory-access-extension assembly instructions. Experimental results show that the method improves average H.264 encoding performance on the Sunway architecture by about 30%. The results have been contributed to the Sunway community, improving the desktop multimedia experience of domestic computers based on Sunway processors.

11.
GPUs can provide powerful computing ability, especially for data-parallel applications such as video/image processing. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult, and different optimization methods on a GPU often lead to different performance. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS and MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on the GPU architecture. This new method has three advantages. First, instead of using only one thread, we use a warp to compute an element of the vector y, so that the method can exploit the highly parallel GPU architecture. Second, register blocking is used to reduce the off-chip memory bandwidth requirement. Finally, the memory access order is carefully arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes, and the performance of the register blocking method with different block sizes is also evaluated. Experimental results show that the new method achieves very high speedups for small square matrices and fat matrices compared to CUBLAS and MAGMA, and also achieves higher performance for large square matrices.
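A minimal CUDA sketch of the warp-per-element idea described above (assumptions: row-major A of size m×n, blockDim.x a multiple of 32; the multi-row register blocking is omitted for brevity):

```cuda
// y = A*x with one warp per element of y: the 32 lanes stride across one
// row so loads of A are coalesced, then combine partial sums with shuffles.
__global__ void gemvWarpPerRow(const float* A, const float* x, float* y,
                               int m, int n) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId >= m) return;

    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)        // coalesced across the warp
        sum += A[warpId * n + j] * x[j];

    for (int off = 16; off > 0; off >>= 1)    // warp-level tree reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);

    if (lane == 0) y[warpId] = sum;
}
```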

12.
韩秉君  黄诗铭  杜滢 《电信科学》2015,31(10):82-88
A method is proposed for accelerating the DFT (discrete Fourier transform) step of communication simulations with CUDA (compute unified device architecture) on Kepler-generation GPUs (graphics processing units). Its core idea is to use thread-level parallelism to accelerate the DFT computation within a single transmit–receive link, and to use dynamic parallelism and Hyper-Q to parallelize the link processing across different transmit–receive user pairs, thereby accelerating the DFT step of the simulation as a whole. Experimental results show that, compared with a single-core single-thread CPU program and a program on the previous Fermi-generation GPU, the method speeds up DFT processing by 300× and 3× respectively, a good acceleration result.
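A minimal sketch of the thread-level parallelism for one link: each thread computes one output bin of an N-point DFT. (The paper additionally nests such work across links with dynamic parallelism and Hyper-Q, which is omitted here.)

```cuda
// Naive DFT, one thread per frequency bin k.
__global__ void dftKernel(const float2* in, float2* out, int N) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= N) return;

    float2 acc = make_float2(0.0f, 0.0f);
    for (int t = 0; t < N; ++t) {
        float ang = -2.0f * 3.14159265f * k * t / N;  // twiddle angle
        float s, c;
        sincosf(ang, &s, &c);
        acc.x += in[t].x * c - in[t].y * s;           // complex multiply-add
        acc.y += in[t].x * s + in[t].y * c;
    }
    out[k] = acc;
}
```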

13.
This paper introduces the characteristics of an embedded machine-vision system based on an ARM+DSP architecture and analyzes the factors that constrain its performance. Optimization schemes are discussed at both the operating-system and application levels. By trimming the embedded Linux kernel and file system, extensively optimizing the application code, and making full use of the NEON acceleration technology unique to Cortex-A processors, the system boot time was shortened by 25 s and the application ran 2.5 times faster.

14.
Research on a fast two-dimensional Walsh transform based on the GPU
A fast two-dimensional Walsh transform implementation based on the GPU CUDA (Compute Unified Device Architecture) platform is proposed. The method exploits the GPU's parallel structure and hardware characteristics, improving the speed of the Walsh transform through the algorithm implementation, the choice of memory types, and the logical architecture configuration. Experimental results show that as image resolution grows, the Walsh transform runs far faster on the GPU than on the CPU, with an increasingly pronounced speedup.
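A sketch of one butterfly stage of a fast Walsh–Hadamard transform along image rows; calling it for h = 1, 2, 4, ..., N/2 transforms each row of length N (a power of two), and a full 2-D transform would repeat the stages along columns. The details are illustrative, not the paper's code:

```cuda
// One FWHT butterfly stage over all rows; one thread per butterfly pair.
__global__ void fwhtStage(float* data, int N, int rows, int h) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int pairsPerRow = N / 2;
    if (idx >= pairsPerRow * rows) return;

    int row  = idx / pairsPerRow;
    int p    = idx % pairsPerRow;
    int base = (p / h) * 2 * h + (p % h);     // first element of the pair
    float* d = data + row * N;

    float a = d[base], b = d[base + h];
    d[base]     = a + b;                       // butterfly: sum
    d[base + h] = a - b;                       // butterfly: difference
}
// Host loop (assumed): for (int h = 1; h < N; h <<= 1) fwhtStage<<<...>>>(d, N, rows, h);
```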

15.
Future computing devices are likely to be based on heterogeneous architectures, which comprise multi-core CPUs accompanied by GPUs or special-purpose accelerators. A challenging issue for such devices is how to manage the resources effectively to achieve high efficiency and low energy consumption. With multiple new programming models and advanced framework support for heterogeneous computing, many regular applications have benefited greatly from heterogeneous systems. However, migrating this success to irregular applications remains a challenge. An irregular program's attributes may vary during execution and are often unpredictable, making it difficult to allocate heterogeneous resources for the highest efficiency. Moreover, the irregularity in applications may cause control-flow divergence, load imbalance, and low efficiency in parallel execution. To resolve these issues, we studied and propose phase-guided dynamic work partitioning, a lightweight and fast analysis technique that collects information during program phases at runtime to guide work partitioning in subsequent phases, enabling more efficient work dispatching on heterogeneous systems. We implemented an adaptive runtime system based on this technique and use ray tracing to explore the performance potential of dynamic work-distribution techniques in our framework. The experiments show that the performance gain of this approach can be as high as 5× over the original system. The proposed techniques can be applied to other irregular applications with similar properties.
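A host-side sketch, assumed and generic rather than the paper's runtime, of the phase-guided idea: measure CPU and GPU throughput during one phase, then re-split the next phase's work in proportion so the two devices finish together:

```cuda
#include <algorithm>

struct PhaseStats { double cpuItemsPerSec; double gpuItemsPerSec; };

// Fraction of the next phase's work items to send to the GPU.
double nextGpuShare(const PhaseStats& s, double prevShare) {
    double ideal = s.gpuItemsPerSec / (s.gpuItemsPerSec + s.cpuItemsPerSec);
    // Damp the update so one noisy phase cannot destabilize the split.
    double share = 0.5 * prevShare + 0.5 * ideal;
    return std::min(0.95, std::max(0.05, share));
}
```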

16.
杨肖  孙建伶 《中国通信》2011,8(6):11-18
As data volume grows, many enterprises are considering using MapReduce for its simplicity. However, how to evaluate the performance improvement before deployment remains an open issue. Current research on MapReduce performance is mainly based on monitoring and simulation and lacks mathematical models. In this paper, we present a simple but powerful performance model for predicting the execution time of a MapReduce program with limited resources. We study each component of the MapReduce framework and, based on our model, analyze the relation between overall performance and the number of mappers and reducers. Two typical MapReduce programs are evaluated in a small cluster with 13 nodes. Experimental results show that the mathematical performance model can estimate the execution time of MapReduce programs reliably. According to our model, the number of mappers and reducers can be tuned to form a better execution pipeline and achieve better performance. The model also points out potential bottlenecks of the framework and future improvements.
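The abstract does not reproduce the paper's actual model; an illustrative form of such a wave-based execution-time model, with M map tasks on S_m map slots and R reduce tasks on S_r reduce slots, might be:

```latex
% Illustrative wave-based MapReduce time model (assumed form, not the paper's):
T \approx T_{\text{setup}}
  + \left\lceil \frac{M}{S_m} \right\rceil t_{\text{map}}
  + T_{\text{shuffle}}
  + \left\lceil \frac{R}{S_r} \right\rceil t_{\text{reduce}}
  + T_{\text{cleanup}}
```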

17.
Single-GPU scaling is unable to keep pace with the soaring demand for high-throughput computing, so executing an application on multiple GPUs connected through an off-chip interconnect is becoming an attractive option to explore. However, much current code is written for a single-GPU system, and porting such code for execution on multiple GPUs is a difficult task. In particular, it requires programmer effort to determine how data is partitioned across the GPU cards and then to launch thread blocks that mostly access data local to each card; otherwise, cross-card data movement is an expensive operation. In this work we explore hardware support to efficiently parallelize single-GPU code for execution on multiple GPUs. In particular, our approach focuses on minimizing the number of remote memory accesses across the off-chip network without burdening the programmer with data partitioning and workload assignment. We propose a data-location-aware thread block scheduler that schedules each thread block on the GPU that holds most of its input data. The scheduler exploits the well-known observation that GPU workloads tend to launch a kernel multiple times, iteratively, to process large volumes of data, and that the memory accesses of a thread block across different iterations of a kernel launch exhibit correlated behavior. Our data-location-aware scheduler exploits this predictability to track the memory-access affinity of each thread block to a specific GPU card and stores this information to make scheduling decisions for future iterations. To further reduce the number of remote accesses, we propose a hybrid mechanism that migrates or copies pages between the memories of the GPUs based on their access behavior, so that most memory accesses go to local GPU memory. On an architecture consisting of two GPUs, our proposed schemes improve performance by 1.55× compared to single-GPU execution across the widely used Rodinia [17], Parboil [18], and Graph [23] benchmarks.
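A software sketch (assumed; the paper proposes hardware support) of the affinity-tracking idea: record which GPU served most of each thread block's accesses in the previous kernel iteration, then schedule the block there next time:

```cuda
#include <vector>

struct BlockAffinity { int accesses[2]; };     // per-GPU access counters

// Called by a profiling hook after each iteration (mechanism assumed).
void recordAccess(std::vector<BlockAffinity>& table, int block, int gpu) {
    table[block].accesses[gpu]++;
}

// Next launch: pick the GPU that served the majority of this block's
// accesses during the previous iteration of the kernel.
int scheduleBlock(const std::vector<BlockAffinity>& table, int block) {
    const BlockAffinity& a = table[block];
    return a.accesses[1] > a.accesses[0] ? 1 : 0;
}
```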

18.
马歌  肖汉 《现代电子技术》2014,(20):103-106
The Prewitt operator is among the most widely used edge-detection algorithms in image segmentation. A traditional serial implementation on the CPU is computationally heavy and slow, so accelerating it on the GPU is of real value. However, differences between GPU hardware architectures make cross-platform porting very difficult. To address this, a parallel Prewitt edge-detection algorithm based on the OpenCL heterogeneous framework is proposed. Experimental results show that the parallel algorithm runs much faster than the serial CPU version, with speedups of up to 30×; it effectively improves the efficiency of large-scale data processing, ports well, and has high application value.
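The paper targets OpenCL; the per-pixel Prewitt logic is sketched here in CUDA for consistency with the other examples, one thread per interior pixel:

```cuda
// Prewitt edge detection on an 8-bit grayscale image.
__global__ void prewitt(const unsigned char* in, unsigned char* out,
                        int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

    // 3x3 neighborhood.
    int p00 = in[(y-1)*w + x-1], p01 = in[(y-1)*w + x], p02 = in[(y-1)*w + x+1];
    int p10 = in[ y   *w + x-1],                        p12 = in[ y   *w + x+1];
    int p20 = in[(y+1)*w + x-1], p21 = in[(y+1)*w + x], p22 = in[(y+1)*w + x+1];

    int gx = (p02 + p12 + p22) - (p00 + p10 + p20);   // horizontal gradient
    int gy = (p20 + p21 + p22) - (p00 + p01 + p02);   // vertical gradient
    int mag = abs(gx) + abs(gy);                      // L1 gradient magnitude
    out[y * w + x] = mag > 255 ? 255 : (unsigned char)mag;
}
```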

19.
To address the heavy computation and slow speed of the forward-modeling step in current subsurface tomography algorithms, a finite-difference time-domain (FDTD) forward-modeling algorithm is studied and implemented on the GPU platform, with the graphics processing unit (GPU) at its core. CUDA is a general-purpose GPU parallel computing architecture introduced by NVIDIA and is currently among the most mature GPU parallel computing architectures; since the FDTD forward algorithm is inherently parallel, combining the two greatly accelerates the computation. In forward modeling based on the standard Marmousi velocity model, the program runs 30 times faster, while the difference between the GPU and CPU forward-modeled images is below one part in a thousand. The example shows that CUDA can greatly accelerate current FDTD forward algorithms, and as GPU hardware and the computing architecture continue to improve, the speedup will grow further, which will benefit subsequent waveform-inversion work.
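A sketch of the kind of stencil update such FDTD forward modeling parallelizes: a second-order-in-time, second-order-in-space 2-D acoustic update with one thread per grid point; absorbing boundaries and the source term are omitted, and all names are assumptions:

```cuda
// One leapfrog time step of the 2-D acoustic wave equation.
__global__ void fdtdStep(const float* pPrev, const float* pCur, float* pNext,
                         const float* vel, int nx, int nz,
                         float dt, float dx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index
    int k = blockIdx.y * blockDim.y + threadIdx.y;   // z index
    if (i < 1 || k < 1 || i >= nx - 1 || k >= nz - 1) return;

    int id = k * nx + i;
    // Five-point Laplacian of the current pressure field.
    float lap = (pCur[id - 1] + pCur[id + 1] + pCur[id - nx] + pCur[id + nx]
                 - 4.0f * pCur[id]) / (dx * dx);
    float c = vel[id];
    // Leapfrog update: p(t+dt) = 2 p(t) - p(t-dt) + c^2 dt^2 * Laplacian.
    pNext[id] = 2.0f * pCur[id] - pPrev[id] + c * c * dt * dt * lap;
}
```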

20.
A complete high-definition video-surveillance center solution is designed, using software decoding on a high-end CPU and hardware decoding on the GPU via Microsoft's DXVA. It implements real-time monitoring of multiple cameras at multiple sites, recording, voice intercom, PTZ control, video-wall output, and forwarding to decoders. Test results show that with the DXVA framework, both decoding speed and CPU utilization improve greatly, enabling real-time services for video surveillance, video conferencing, telemedicine, and online teaching.
