1.
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
2.
The increase in computational capabilities has made time-domain methods applicable to long-range sound propagation modelling. However, such approaches remain very demanding in terms of computational resources. Most current computers are supplied with a powerful device that is still little exploited: the graphics processing unit (GPU). The paper describes an implementation of a transmission line matrix model that allows parallel calculations on heterogeneous systems. A voxelization algorithm used to generate the computational domain is presented. A splitting process is also described, which makes simulations of very large domains feasible by dividing the computational domain into subdomains. Each subdomain is enlarged with extra cells containing data from the neighbouring subdomains, so that several computational iterations can run on a graphics device without data exchange with system memory. The influence of the ghost layer depth and the speed-up obtained with the GPU are then illustrated in a realistic built-up area.
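The ghost-layer idea in this abstract can be illustrated with a minimal one-dimensional sketch. It is not the paper's transmission line matrix code: the 3-point averaging stencil, the fixed end values, and the names `step`, `run_global` and `run_split` are all assumptions chosen only to show why a ghost layer of depth d allows d iterations without any exchange between subdomains.

```python
def step(u):
    """One iteration of a 3-point averaging stencil; the two end cells stay fixed."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def run_global(u, iters):
    """Reference: iterate the stencil on the whole (undivided) domain."""
    for _ in range(iters):
        u = step(u)
    return u


def run_split(u, n_sub, depth):
    """Advance `depth` iterations per subdomain with no data exchange.

    Each subdomain is enlarged by up to `depth` ghost cells per side, copied
    from its neighbours. Errors from the artificial edge of the enlarged block
    travel inward one cell per iteration, so after `depth` iterations only the
    ghost cells are wrong and they are simply discarded.
    """
    n = len(u)
    size = n // n_sub
    out = []
    for s in range(n_sub):
        lo = s * size
        hi = n if s == n_sub - 1 else lo + size
        glo, ghi = max(0, lo - depth), min(n, hi + depth)  # enlarged bounds
        local = u[glo:ghi]
        for _ in range(depth):
            local = step(local)
        out.extend(local[lo - glo:lo - glo + (hi - lo)])  # keep owned cells only
    return out


if __name__ == "__main__":
    field = [float((i * 7) % 11) for i in range(40)]
    print(run_split(field, n_sub=4, depth=3) == run_global(field, 3))
```

A deeper ghost layer means fewer exchanges but more redundant computation per subdomain, which is the trade-off the abstract refers to as the influence of the ghost layer depth.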
3.
Large-scale compute clusters of heterogeneous nodes equipped with multi-core CPUs and GPUs are becoming increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms, making application development very challenging. In this article we introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, which is a transparent abstraction of the underlying distributed architecture, offering advanced features such as inter-context and inter-node device synchronization. It provides a runtime system which tracks dependency information enforced by event synchronization to dynamically build a DAG of commands, on which we automatically apply two optimizations: collective communication pattern detection and device-host-device copy removal. We assess libWater’s performance in three compute clusters available from the Vienna Scientific Cluster, the Barcelona Supercomputing Center and the University of Innsbruck, demonstrating improved performance and scaling with different test applications and configurations.
4.
Multi-core systems equipped with micro processing units and accelerators such as digital signal processors (DSPs) and graphics processing units (GPUs) have become a major trend in processor design in recent years in attempts to meet ever-increasing application performance requirements. Open Computing Language (OpenCL) is one of the programming languages that include new extensions proposed to exploit the computing power of these kinds of processors. Among the newly extended language features, the single-instruction multiple-data (SIMD) linguistics and vector types are added to OpenCL to exploit hardware features of the accelerators. The addition makes it necessary to consider how traditional compiler data flow analysis can be adapted to meet the optimization requirements of vector linguistics. In this paper, we propose a calculus framework to support the data flow analysis of vector constructs for OpenCL programs that compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions. We then show that the data flow analysis for OpenCL vector linguistics can be performed based on the data access functions. Based on the information gathered from data flow analysis, we illustrate a set of SIMD optimizations on OpenCL programs. The experimental results incorporating our calculus and our proposed compiler optimizations show that the proposed SIMD optimizations can provide average performance improvements of 22% on x86 CPUs and 4% on AMD GPUs. For the selected 15 benchmarks, 11 of them are improved on x86 CPUs, and six of them are improved on AMD GPUs. The proposed framework has the potential to be used to construct other SIMD optimizations on OpenCL programs. Copyright © 2015 John Wiley & Sons, Ltd.
5.
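This paper proposes a parallelized annealed particle filter (P-APF) based on heterogeneous computing and uses the OpenCL framework to perform real-time markerless motion tracking. The annealed particle filtering process is decomposed into several subtasks of appropriate granularity. According to its degree of parallelism, each computing task is assigned to either the standard processor or an auxiliary processor, so as to make full use of the heterogeneous computing capability of the OpenCL framework. A task latency-hiding strategy is proposed to further reduce time consumption. In experiments on different human motion databases, P-APF achieves real-time processing without reducing tracking accuracy. Time consumption remains essentially unchanged as the number of particles or the number of views increases, and the average speedup is 106.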
6.
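Modern GPUs generally provide dedicated hardware (such as texture units, rasterization units and various on-chip caches) to accelerate the processing and display of two-dimensional images, and the corresponding programming models (CUDA, OpenCL) define specific programming interfaces (texture memory in CUDA, image objects in OpenCL) so that image applications can use this hardware support. Taking the optimization of a typical image blurring algorithm on AMD GPUs as an example, this paper examines the range of applicability of OpenCL image objects in optimizing image algorithms, and in particular analyzes their strengths and weaknesses relative to the more general approach of optimizing performance with global memory plus on-chip local memory. Experimental results show that image objects bring a good performance improvement only when the image has four channels and the amount of data that must be cached during computation is small; in all other cases, global memory plus local memory achieves good performance. The optimized algorithm achieves a speedup of 200 to 1000 over a carefully implemented CPU version, and a speedup of 1.3 to 5 over the corresponding functions of the NVIDIA NPP library.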
7.
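For large-scale optimization problems such as bundle adjustment, this paper implements parallelized automatic differentiation based on OpenCL. The more efficient reverse mode is adopted to compute the derivatives of functions of many parameters. Within the OpenCL framework, the host builds the function in C/C++ form and generates the computation sequence by topological sorting, while the device computes the function values and derivatives in parallel according to that sequence. Test results show that, when the implemented automatic differentiation is applied to computing the Jacobian matrix for bundle adjustment, the running speed improves by a factor of about 3.6 compared with Ceres Solver using OpenMP.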
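The combination of a topologically sorted computation sequence and reverse-mode differentiation described in this abstract can be sketched serially in a few lines. This is only an illustration under assumptions: the `Node` class, the tiny operator set (`add`, `mul`, `sin`) and the function names are hypothetical, and nothing here reproduces the paper's OpenCL host/device split.

```python
import math


class Node:
    """One entry of the computation graph (hypothetical minimal representation)."""

    def __init__(self, op, parents=(), value=None):
        self.op = op
        self.parents = tuple(parents)
        self.value = value
        self.grad = 0.0


def var(v):
    return Node("var", value=float(v))


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def sin(a):
    return Node("sin", (a,))


def topo_order(root):
    """Depth-first topological sort: every node appears after its inputs."""
    order, seen = [], set()

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
        order.append(n)

    visit(root)
    return order


def evaluate(root):
    """Forward sweep for values, then reverse sweep for d(root)/d(node)."""
    order = topo_order(root)
    for n in order:  # forward: inputs are always computed before their users
        n.grad = 0.0
        if n.op == "add":
            n.value = n.parents[0].value + n.parents[1].value
        elif n.op == "mul":
            n.value = n.parents[0].value * n.parents[1].value
        elif n.op == "sin":
            n.value = math.sin(n.parents[0].value)
    root.grad = 1.0
    for n in reversed(order):  # reverse: push each adjoint back to the inputs
        if n.op == "add":
            for p in n.parents:
                p.grad += n.grad
        elif n.op == "mul":
            a, b = n.parents
            a.grad += n.grad * b.value
            b.grad += n.grad * a.value
        elif n.op == "sin":
            n.parents[0].grad += n.grad * math.cos(n.parents[0].value)
    return root.value


if __name__ == "__main__":
    x, y = var(2.0), var(3.0)
    f = add(mul(x, y), sin(x))  # f(x, y) = x*y + sin(x)
    evaluate(f)
    print(f.value, x.grad, y.grad)
```

One reverse sweep yields the derivatives with respect to all inputs at once, which is why reverse mode suits functions of many parameters such as bundle adjustment residuals.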
8.
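The hexapod bionic robot described here is implemented on an SoC FPGA platform and combines mechanical structure design, hexapod gait control, Bluetooth transmission, flex sensors, OpenCL-accelerated image processing, VR display and other technologies. The ARM part acts as the main controller: it stores the video images from the camera, calls the FPGA module to accelerate image processing, and outputs the video stream to the VR glasses over a local area network set up with a router. The FPGA part receives Bluetooth signals and drives the movement of the robot arm, the switching of the camera angle and the walking of the six legs. In operation, the operator wears a self-made data glove and VR glasses. The direction buttons on the data glove control the movement of the robot. Each finger of the data glove carries a flex sensor, used to make the robot arm follow the human hand in real time. A smartphone placed in the VR glasses serves as the display terminal and shows the images captured by the robot's camera in real time. In repeated practical tests, operators wearing the VR glasses and the data glove were able to remotely control the robot to grasp a water bottle placed in complex terrain.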
9.
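With the continuous development of computer technology, software keeps growing in scale. A single remote sensing image can exceed several gigabytes, and processing it may sometimes take several hours. For systems that handle such large data volumes, doubling the speedup can therefore cut the running time by several hours, a substantial practical gain that is well worth pursuing. Taking the NDVI algorithm as an example, this paper introduces the NDVI algorithm, the applications and properties of NDVI, and OpenCL.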
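For readers unfamiliar with NDVI, the standard definition is NDVI = (NIR - Red) / (NIR + Red), evaluated independently for every pixel, which is what makes it a natural fit for one OpenCL work-item per pixel. The sketch below is a plain serial version, not the paper's code; the function names and the choice to return 0.0 when both bands are zero are assumptions made for illustration.

```python
def ndvi_pixel(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    Returns 0.0 when both bands are zero (an assumed convention to avoid
    dividing by zero; other conventions such as a no-data value are possible).
    """
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total


def ndvi_image(nir_band, red_band):
    """Apply ndvi_pixel element-wise to two equally sized 2D bands.

    Each output pixel depends only on the same pixel of the two inputs, so
    the loop body could be mapped directly onto independent parallel tasks.
    """
    return [
        [ndvi_pixel(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


if __name__ == "__main__":
    nir = [[0.5, 0.8], [0.0, 0.3]]
    red = [[0.1, 0.2], [0.0, 0.3]]
    print(ndvi_image(nir, red))
```

For non-negative reflectances the result lies in [-1, 1], with higher values generally indicating denser green vegetation.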
10.
Research on Optimizing the Image Integral Image Algorithm Based on OpenCL   Total citations: 1, self-citations: 0, citations by others: 1
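The integral image algorithm is widely used in fast feature detection, so accelerating it on GPUs is of considerable practical importance. However, because GPU hardware architectures are complex and differ from one another, optimizing the integral image algorithm on GPUs, and then achieving performance portability across different GPU platforms, is very difficult. Based on an analysis of the underlying hardware architectures of different GPU platforms, this paper examines how different optimization methods affect performance on different GPU hardware platforms from several angles, including off-chip memory bandwidth utilization, compute resource utilization and data locality, and on that basis implements an OpenCL-based integral image algorithm. Experimental results show that the optimized algorithm achieves speedups of 11.26× and 12.38× on AMD and NVIDIA GPUs respectively, and that the optimized GPU kernels also outperform the corresponding functions of the NVIDIA NPP library by 55.01% and 65.17% respectively. This verifies the effectiveness and performance portability of the proposed optimization methods.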
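As background for this abstract, an integral image (summed-area table) stores at each position the sum of all pixels above and to the left of it, inclusive, so that any rectangular sum needs only four lookups. The serial sketch below uses the common two-pass formulation (prefix sums along rows, then along columns); it does not reproduce the paper's optimized kernels, and the function names are assumptions.

```python
def integral_image(img):
    """Summed-area table via a row pass followed by a column pass."""
    h, w = len(img), len(img[0])
    ii = [row[:] for row in img]
    for y in range(h):  # pass 1: running sum along each row
        for x in range(1, w):
            ii[y][x] += ii[y][x - 1]
    for x in range(w):  # pass 2: running sum down each column
        for y in range(1, h):
            ii[y][x] += ii[y - 1][x]
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) using four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    table = integral_image(image)
    print(table)
    print(box_sum(table, 1, 1, 2, 2))
```

In the row pass every row is independent, and in the column pass every column is independent, which is the parallelism a GPU implementation can exploit.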