Similar Documents
Found 20 similar documents (search time: 475 ms)
1.
String matching is one of the most widely studied problems in computing science and has become a core operation in fields such as information retrieval and biological computing. Constrained by CPU compute power and memory access bandwidth, however, traditional serial string-matching algorithms are hard to accelerate further. GPUs offer far greater compute power and memory bandwidth and have already delivered excellent results in many applications. gAC, a GPU-based parallel AC algorithm, targets the GPU's SIMT (Single-Instruction Multiple-Thread) execution model and coalesced memory access, applying optimizations such as reducing conditional branches and coalescing global-memory accesses. On a C1060 GPU it scans text at 51 Gb/s, a 28x speedup over the CPU-based serial algorithm.
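The gAC implementation itself is not reproduced here; as a rough illustration of the SIMT mapping the abstract describes, the following minimal CUDA sketch (all names hypothetical, single pattern rather than a full AC automaton) assigns one candidate start position per thread, so adjacent threads read adjacent bytes of the text, giving coalesced global-memory access, and the only divergence is the early exit of the comparison loop.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// One thread per candidate start position: thread i compares text[i..i+m)
// against the pattern. Neighboring threads touch neighboring bytes, so the
// loads coalesce; divergence is limited to early exits in the inner loop.
__global__ void match_kernel(const char* text, int n, const char* pat, int m,
                             unsigned int* match_count) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i + m <= n;
         i += gridDim.x * blockDim.x) {          // grid-stride loop
        int j = 0;
        while (j < m && text[i + j] == pat[j]) ++j;
        if (j == m) atomicAdd(match_count, 1u);
    }
}

int main() {
    const char* h_text = "abracadabra abracadabra";
    const char* h_pat  = "abra";
    int n = (int)strlen(h_text), m = (int)strlen(h_pat);

    char *d_text, *d_pat; unsigned int *d_cnt;
    cudaMalloc(&d_text, n); cudaMalloc(&d_pat, m);
    cudaMalloc(&d_cnt, sizeof(unsigned int));
    cudaMemcpy(d_text, h_text, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_pat, h_pat, m, cudaMemcpyHostToDevice);
    cudaMemset(d_cnt, 0, sizeof(unsigned int));

    match_kernel<<<64, 256>>>(d_text, n, d_pat, m, d_cnt);

    unsigned int cnt = 0;
    cudaMemcpy(&cnt, d_cnt, sizeof(cnt), cudaMemcpyDeviceToHost);
    printf("%u matches\n", cnt);  // expect 4
    return 0;
}
```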

2.
In recent years, general-purpose computing on graphics processors has attracted wide attention and made progress in many fields. In-memory OLAP eliminates disk I/O, but the compute power of single- or multi-core CPUs and cache misses become the new performance bottleneck, so good efficiency cannot be guaranteed. With their many cores and high bandwidth, graphics processors are well suited to the characteristics of OLAP computation. This work uses the graphics processor to accelerate the computation of any cuboid, improving the performance of the whole in-memory OLAP system. A blocked parallel algorithm for graphics processors is proposed and optimized, and variants are discussed for conditions such as sparse data and skewed data distributions. The algorithm can be extended beyond the memory limit, forming a three-level disk/main-memory/GPU-memory pipeline to handle massive data, and it can also serve as the basis for computing the full cube. Experiments show that the GPU-based algorithm clearly outperforms a quad-core CPU algorithm.
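As a hedged sketch of why aggregation maps well to the GPU (not the paper's blocked algorithm), the following CUDA kernel computes one SUM cuboid, a group-by over a dense key domain, with one input row per thread; the paper's approach additionally blocks the input so each thread block aggregates locally before a global merge.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SUM-aggregate each measure into its cuboid cell, one input row per thread.
// This is the scatter/atomic formulation; a blocked algorithm would
// partition rows first so each block aggregates locally before merging.
__global__ void cuboid_sum(const int* group_key, const float* measure, int n,
                           float* cell /* one slot per distinct key */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&cell[group_key[i]], measure[i]);
}

int main() {
    const int n = 8, groups = 3;
    int   h_key[n] = {0, 1, 2, 0, 1, 2, 0, 1};
    float h_val[n] = {1, 2, 3, 4, 5, 6, 7, 8};

    int* d_key; float *d_val, *d_cell;
    cudaMalloc(&d_key, n * sizeof(int));
    cudaMalloc(&d_val, n * sizeof(float));
    cudaMalloc(&d_cell, groups * sizeof(float));
    cudaMemcpy(d_key, h_key, sizeof(h_key), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemset(d_cell, 0, groups * sizeof(float));

    cuboid_sum<<<1, 256>>>(d_key, d_val, n, d_cell);

    float h_cell[groups];
    cudaMemcpy(h_cell, d_cell, sizeof(h_cell), cudaMemcpyDeviceToHost);
    for (int g = 0; g < groups; ++g) printf("group %d: %.0f\n", g, h_cell[g]);
    return 0;  // expect 12, 15, 9
}
```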

3.
Onboard computing is one of the principal needs in space-related technology in recent years. In particular, onboard hyperspectral imaging (HSI) processing has advanced significantly. Due to advances in sensor technology, onboard HSI processing continuously meets new challenges related to increasing dataset size, limited processing time and limited communication links. High throughput and data reduction are crucial for satisfying real-time constraints and for preserving transmission bandwidth. For systems capable of accommodating a wide range of processing algorithms, there is a need for a flexible communication infrastructure that can provide fast access to/from memory in different access patterns. In this paper, existing FPGA-related Direct Memory Access (DMA) solutions are evaluated, and a new DMA solution tailored for hyperspectral images is proposed. Results show that the proposed DMA core, CubeDMA, handles the targeted memory access patterns more efficiently than existing solutions while remaining resource efficient.

4.
Walton S., Hutton A., Touch J. Computer, 1998, 31(11): 46-52
In networking today, host workstations are increasingly being used as routers. Host-based routers offer a number of advantages, but they suffer from the technology's major drawback: inefficient support for high-bandwidth interfaces. The authors' approach is to optimize packet processing by applying techniques that transfer packets directly among host interfaces, thus removing an extra data copy. They found that peer DMA forwarding can increase host-based router throughput by up to 45 percent while reducing the host's CPU load, supporting bandwidths of 480 Mbps. Peer DMA host-based forwarding requires network interface cards with substantial shared memory resources, because packet queues are stored on the interfaces themselves rather than in host RAM. The queuing algorithm remains in the host CPU, supporting advanced queue management. Current systems remain limited in packet-processing capacity; a combination of streamlined forwarding algorithms and aggregate interrupt processing should further increase host-based capability, and moving some of the IP processing out to the NIC coprocessor may enable this. It is also apparent that as processor speeds increase, the advantages of peer DMA will aid throughput for small packet sizes.

5.
Prefetching in a High-Bandwidth Remote Memory Architecture
Advances in high-speed circuits and optical interconnects have greatly increased network speed and bandwidth, making it possible to break with the traditional tightly coupled CPU-memory structure of high-performance computers: the coupling of CPU and memory is no longer limited by distance, which is bound to change system architecture. Reference [1] proposes the DSAG architecture, in which CPU and memory are physically separated: each CPU node keeps only a small amount of local memory, while massive memory is managed centrally on remote memory servers, and CPU nodes and memory servers are connected by a high-speed network. This new architecture brings better sharing and scalability, but it also challenges us to resolve the imbalance between CPU and memory. To reduce the extra access latency of DSAG's remote memory, we observe that normal CPU accesses do not fully utilize the network's high bandwidth, so the spare bandwidth can be used to prefetch remote data. This paper records the addresses of local misses (relative to remote memory) during application execution, analyzes the page-aligned statistical characteristics of the page frame streams they contain, and argues that a prefetching mechanism based on page frame streams can reduce access latency and improve system performance. Finally, we validate the feasibility and correctness of this view by simulation, propose three prefetching policies, and compare and analyze the factors that influence prefetch effectiveness.
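The paper's three prefetching policies are not spelled out in the abstract; a minimal sketch of the underlying idea, assuming 4 KB pages and hypothetical names throughout (`kTrain`, `kDepth`, `prefetch_remote` standing in for an RPC to the memory server), might look like:

```cpp
#include <cstdint>
#include <cstdio>

// Toy sequential-stream detector over the remote-miss trace: once kTrain
// consecutive misses hit consecutive page frames, prefetch the next
// kDepth frames using the otherwise idle network bandwidth.
struct StreamDetector {
    static constexpr int kTrain = 3;   // misses needed to confirm a stream
    static constexpr int kDepth = 4;   // frames to prefetch once confirmed
    uint64_t last_frame = UINT64_MAX;
    int      run        = 0;

    void on_miss(uint64_t addr) {
        uint64_t frame = addr >> 12;               // 4 KB page frame
        run = (frame == last_frame + 1) ? run + 1 : 0;
        last_frame = frame;
        if (run >= kTrain)
            for (int k = 1; k <= kDepth; ++k)
                prefetch_remote(frame + k);        // placeholder for the RPC
    }
    void prefetch_remote(uint64_t frame) {
        printf("prefetch frame %llu\n", (unsigned long long)frame);
    }
};

int main() {
    StreamDetector d;
    for (uint64_t a = 0x1000; a < 0x9000; a += 0x1000) d.on_miss(a);
    return 0;
}
```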

6.
程克非, 张聪, 汪林林, 张勤. 《计算机应用》, 2005, 25(10): 2431-2433
This paper introduces a performance-data collection and analysis method based on CPU hardware performance counters, starting from fine-grained run-time parameters to analyze a program's run-time behavior and thus reflect the actual dynamic state of the system more accurately. Experiments show that for applications that need detailed knowledge of the system's dynamic run-time state, the method provides very effective analysis data; to some extent it also provides reference data for compiler performance optimization.
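On Linux, one concrete way to read CPU hardware performance counters from software is the perf_event_open(2) syscall; the snippet below counts user-mode CPU cycles around a code region (a generic illustration, not the instrumentation framework the paper built):

```cpp
// Count CPU cycles around a code region with Linux perf_event_open(2);
// the same interface exposes cache misses, branch misses, etc. via
// different attr.config values. Error handling trimmed for brevity.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_HARDWARE;
    attr.size           = sizeof(attr);
    attr.config         = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0;                 /* the region being measured */
    for (int i = 0; i < 1000000; ++i) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    printf("cycles: %lld\n", cycles);
    close(fd);
    return 0;
}
```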

7.
With the development of UAV technology and deep learning, deep-learning-based multi-object detection algorithms have been widely applied on industrial UAVs. Such algorithms consume large amounts of compute, making real-time operation difficult on small and medium UAV platforms with limited processing power. This paper analyzes the run time of deep learning algorithms on a low-power CPU and proposes an optimization method for convolutional neural network computation. Simulation on an onboard computer shows that, with essentially unchanged detection quality, the algorithm reaches 56 FPS, achieving real-time multi-object detection on a UAV platform.

8.
Optimizing main-memory join on modern hardware
In the past decade, the exponential growth in commodity CPU speed has far outpaced advances in memory latency. A second trend is that CPU performance advances come not only from increased clock rates, but also from increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called "radix-cluster", which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated by almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
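A minimal sketch of one radix-cluster pass, histogram, prefix sum, then scatter on the low B bits of the key, is shown below; the actual algorithm chains several such passes so the fan-out per pass stays within TLB limits, but this conveys the core idea:

```cpp
#include <cstdint>
#include <vector>

// One radix-clustering pass: scatter keys into 2^B clusters on the low
// B bits, so each cluster's hash table later fits in cache.
std::vector<uint64_t> radix_partition(const std::vector<uint64_t>& keys,
                                      int B, std::vector<size_t>& bounds) {
    const size_t fanout = size_t(1) << B, mask = fanout - 1;
    std::vector<size_t> hist(fanout, 0);
    for (uint64_t k : keys) ++hist[k & mask];          // 1) histogram
    bounds.assign(fanout + 1, 0);
    for (size_t c = 0; c < fanout; ++c)                // 2) exclusive prefix sum
        bounds[c + 1] = bounds[c] + hist[c];
    std::vector<size_t> cursor(bounds.begin(), bounds.end() - 1);
    std::vector<uint64_t> out(keys.size());
    for (uint64_t k : keys)                            // 3) scatter
        out[cursor[k & mask]++] = k;
    return out;    // cluster c occupies out[bounds[c] .. bounds[c+1])
}
```

Each resulting cluster can then be joined independently against a hash table small enough to stay cache-resident.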

9.
Implementations of relational operators on GPU processors have achieved order-of-magnitude speedups over their multicore CPU counterparts. Here we focus on the efficient implementation of the string matching operators common in SQL queries. Due to different architectural features, the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so it is not feasible to keep the working set of all threads in the cache in a naive implementation. In GPUs the unit of execution is a group of threads, and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache memory efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in GPU memory. We evaluate memory efficiency in terms of memory access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in both raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
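For reference, the sequential Knuth–Morris–Pratt algorithm the paper favors looks as follows; its text pointer only moves forward, which is the regular access pattern that makes it GPU-friendly (the paper runs it per thread over slices of the pivoted layout; this is the plain CPU form):

```cpp
#include <string>
#include <vector>

// Failure function: f[i] = length of the longest proper prefix of p that
// is also a suffix of p[0..i].
std::vector<int> kmp_failure(const std::string& p) {
    std::vector<int> f(p.size(), 0);
    for (size_t i = 1, k = 0; i < p.size(); ++i) {
        while (k > 0 && p[i] != p[k]) k = f[k - 1];
        if (p[i] == p[k]) ++k;
        f[i] = (int)k;
    }
    return f;
}

// Scan the text once; on a mismatch, fall back via the failure function
// instead of moving the text pointer backwards.
std::vector<size_t> kmp_search(const std::string& text, const std::string& p) {
    std::vector<size_t> hits;
    std::vector<int> f = kmp_failure(p);
    for (size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != p[k]) k = f[k - 1];
        if (text[i] == p[k]) ++k;
        if (k == p.size()) { hits.push_back(i + 1 - p.size()); k = f[k - 1]; }
    }
    return hits;
}
```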

10.
Conventional implementations of iterative numerical algorithms, especially multigrid methods, reach only a disappointingly small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that the need for data locality, driven by poor main memory latency and limited bandwidth, is entirely neglected by many developers designing numerical software. Only when most of the data to be accessed during the computation are found in the system cache (or in one of the caches if the machine architecture comprises a cache hierarchy) can fast program execution be expected. Otherwise, i.e. in case of a significant rate of cache misses, the processor must stay idle until the necessary operands are fetched from main memory, whose cycle time is in general extremely large compared to the time needed to execute a floating-point instruction. In this paper, we describe program transformation techniques developed to improve the cache performance of two-dimensional multigrid algorithms. Although we merely consider the solution of Poisson's equation on the unit square using structured grids, our techniques provide valuable hints towards the efficient treatment of more general problems.
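As one small example of such a transformation, the loop-blocked Jacobi sweep below tiles the traversal of an N×N grid for the 5-point Poisson stencil so each tile stays cache-resident; Jacobi reads only the old array, so blocking preserves the result bit-for-bit (the paper applies richer transformations to Gauss-Seidel-type smoothers):

```cpp
#include <algorithm>
#include <vector>

// Loop-blocked Jacobi sweep for the 5-point stencil of Poisson's equation
// (Laplacian of u = f) on an N x N grid with mesh width h. Tiling the
// traversal in TB x TB blocks keeps each tile of u resident in cache;
// since Jacobi is order-independent, only the access order changes.
void jacobi_blocked(const std::vector<double>& u, std::vector<double>& unew,
                    const std::vector<double>& f, int N, double h, int TB) {
    for (int ib = 1; ib < N - 1; ib += TB)
        for (int jb = 1; jb < N - 1; jb += TB)
            for (int i = ib; i < std::min(ib + TB, N - 1); ++i)
                for (int j = jb; j < std::min(jb + TB, N - 1); ++j)
                    unew[i * N + j] = 0.25 * (u[(i - 1) * N + j] + u[(i + 1) * N + j]
                                            + u[i * N + j - 1]  + u[i * N + j + 1]
                                            - h * h * f[i * N + j]);
}
```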

11.
12.
Binary instrumentation is a key technique in software performance analysis, vulnerability discovery, and quality assessment. In embedded environments, traditional dynamic instrumentation algorithms are constrained by the absence of an operating system, complex CPU architectures, and tight memory, and are hard to apply. Targeting dynamic binary instrumentation, this paper combines static feature analysis with dynamic tracing, introduces graph-theoretic analysis of the binaries in firmware, proposes a remote debugging protocol for embedded devices, and implements run-time software...

13.
Graphics processing units (GPUs) have an SIMD architecture and have recently been widely used as powerful general-purpose co-processors for the CPU. In this paper, we investigate efficient GPU-based data cubing, because the most frequent operation in data cube computation is aggregation, an expensive operation well suited to SIMD parallel processors. The H-tree is a hyper-linked tree structure used in both top-k H-cubing and the stream cube. Fast H-tree construction, update and real-time query response are crucial in many OLAP applications. We design highly efficient GPU-based parallel algorithms for these H-tree based data cube operations. This has been made possible by adopting effective methods, such as parallel primitives for segmented data and efficient memory access patterns, to achieve load balance on the GPU while hiding memory access latency. As a result, our GPU algorithms can often achieve more than an order of magnitude speedup when compared with their sequential counterparts on a single CPU. To the best of our knowledge, this is the first attempt to develop parallel data cubing algorithms on graphics processors.
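One such parallel primitive for segmented data is a segmented reduction; the sketch below uses Thrust's reduce_by_key to sum per-segment values in a single data-parallel pass (an illustration of the primitive, not the paper's H-tree code):

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Segmented aggregation with a parallel primitive: given per-element
// segment ids (sorted), reduce_by_key sums each segment in one pass.
int main() {
    int   seg[] = {0, 0, 0, 1, 1, 2, 2, 2};
    float val[] = {1, 2, 3, 4, 5, 6, 7, 8};
    thrust::device_vector<int>   d_seg(seg, seg + 8);
    thrust::device_vector<float> d_val(val, val + 8);
    thrust::device_vector<int>   out_seg(3);
    thrust::device_vector<float> out_val(3);

    thrust::reduce_by_key(d_seg.begin(), d_seg.end(), d_val.begin(),
                          out_seg.begin(), out_val.begin());

    for (int i = 0; i < 3; ++i)
        printf("segment %d: %.0f\n", (int)out_seg[i], (float)out_val[i]);
    return 0;  // expect 6, 9, 21
}
```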

14.
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computation times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms on the CUDA architecture. For compute-bound algorithms, register pressure is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and a thread configuration that maximizes occupancy with a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved by reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache behavior is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm.
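The shared-memory variable-reuse strategy can be illustrated on a toy 1-D averaging kernel: each block stages its tile plus a one-element halo in shared memory, so every global value is loaded once and reused by three threads (a generic sketch with a zero halo at array boundaries, not the paper's registration or denoising kernels):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stage a tile plus a one-element halo in shared memory so each input
// value is fetched from global memory once and reused by three threads.
__global__ void avg3(const float* in, float* out, int n) {
    extern __shared__ float tile[];            // blockDim.x + 2 floats
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + 1;

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;          // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;  // right halo
    __syncthreads();

    if (g < n) out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

int main() {
    const int n = 8;
    float h_in[n] = {3, 3, 3, 6, 6, 6, 9, 9}, h_out[n];
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    int threads = 4;
    avg3<<<(n + threads - 1) / threads, threads,
           (threads + 2) * sizeof(float)>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%.2f ", h_out[i]);
    printf("\n");
    return 0;
}
```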

15.
Over the past decade, the problem of fair bandwidth allocation among contending traffic flows on a link has been extensively researched. However, as these flows traverse a computer network, they share different kinds of resources (e.g., links, buffers, router CPU). The ultimate goal should hence be overall fairness in the allocation of multiple resources rather than of a specific resource. Moreover, conventional resource scheduling algorithms depend strongly upon the assumption of prior knowledge of network parameters and cannot handle variations or lack of information about these parameters. In this paper, we present a novel scheduler called the composite bandwidth and CPU scheduler (CBCS), which jointly allocates the fair share of the link bandwidth as well as processing resources to all competing flows. CBCS also uses a simple and adaptive online prediction scheme for reliably estimating the processing times of the incoming data packets. Analytically, we prove that CBCS is efficient, with a per-packet work complexity of O(1). Finally, we present simulation results and experimental outcomes from a real-world implementation of CBCS on an Intel IXP 2400 network processor. Our results highlight the improved performance achieved by CBCS and demonstrate the ease with which it can be implemented on off-the-shelf hardware.

16.
An Adaptive Cache Write-Allocation Policy
The effective bandwidth a processor can provide is currently a key factor limiting processor performance. Based on an analysis of cache write-miss behavior, this paper proposes a new cache write-miss handling policy that improves bandwidth utilization: adaptive cache write allocation. The policy collects fully modified cache blocks in the miss queue, applies a no-write-allocate policy to them, and can adaptively switch back to write allocation. Compared with traditional write-miss handling, adaptive write allocation has low hardware cost, avoids unnecessary data transfers, reduces cache pollution, and lowers the frequency of memory-management queue stalls. Results show that with the adaptive write-allocation policy, the STREAM benchmarks gain 62.6% bandwidth on average and SPEC CPU2000 programs gain 5.9% in IPC on average.
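A toy model of the policy's key test might look like the following: each miss-queue entry tracks which bytes of a 64-byte block pending stores have written, and a fully modified block skips the memory read (names and block size are illustrative; the hardware mechanism is more involved):

```cpp
#include <cstdint>

// Toy model: a miss-queue entry records which bytes of a 64-byte block
// the pending stores have written. If the whole block is modified before
// the fill would start, fetching it from memory is pointless, so the
// policy uses no-write-allocate; otherwise it falls back to write-allocate.
struct WriteMissEntry {
    uint64_t block_addr   = 0;
    uint64_t written_mask = 0;                 // bit i set = byte i written

    void record_store(unsigned offset, unsigned size) {
        for (unsigned b = offset; b < offset + size && b < 64; ++b)
            written_mask |= uint64_t(1) << b;
    }
    bool fully_modified() const { return written_mask == ~uint64_t(0); }
};
```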

17.
Metagenomic gene clustering is a new method for screening pathogenic genes; it relies on massive sequencing data, effective clustering algorithms, and efficient computers. Computing the correlation coefficient matrix must be done before clustering and accounts for a large share of the total computation. For one gene catalog with 1,300 samples and a million genes per sample, a single-threaded run would take 27 years. Exploiting multi-core CPUs, harnessing the compute power of GPU accelerators, and scaling the program to multi-node clusters is therefore important and urgent work. After a careful analysis of the algorithm, we first produce efficient implementations for a single CPU node and a single GPU card, obtaining near-ideal speedups; we then improve performance further with cache optimizations; finally, we distribute the work across MPI ranks with load balancing, achieving good scalability. Compared with the unoptimized single-threaded program, 16 CPU nodes achieve a 238.8x speedup and 6 GPU cards achieve a 263.8x speedup.
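The kernel being scaled is pairwise Pearson correlation between gene-abundance vectors; since every (i, j) pair is independent, the loop nest parallelizes directly across CPU threads (shown below with OpenMP as one plausible single-node form), GPU threads, or MPI ranks:

```cpp
#include <cmath>
#include <vector>
#include <omp.h>

// Pearson correlation of two abundance vectors in one pass over the data.
static double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = (double)x.size();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t k = 0; k < x.size(); ++k) {
        sx += x[k]; sy += y[k];
        sxx += x[k] * x[k]; syy += y[k] * y[k]; sxy += x[k] * y[k];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return cov / std::sqrt(vx * vy);
}

// Upper triangle of the gene-gene correlation matrix; the (i, j) pairs are
// independent, so dynamic scheduling balances the shrinking inner loop.
void correlation_matrix(const std::vector<std::vector<double>>& genes,
                        std::vector<double>& corr /* row-major g x g */) {
    const int g = (int)genes.size();
    corr.assign((size_t)g * g, 0.0);
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < g; ++i)
        for (int j = i + 1; j < g; ++j)
            corr[(size_t)i * g + j] = pearson(genes[i], genes[j]);
}
```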

18.
Ethernet speeds are now growing much faster than memory and CPU speeds, and memory access and CPU protocol processing have become the TCP performance bottleneck. Growing network bandwidth places a heavy burden on the CPU: roughly 1 GHz of CPU resources is needed to handle protocol processing for 1 Gbps of traffic. We therefore use a multi-core NPU as the NIC, implement checksum computation and out-of-order segment reassembly on the TCP receive path, and hand the merged large packets to the protocol stack through the Linux NIC driver, reducing both the number of packets the stack must process and the number of interrupts the NIC raises, thereby improving end-system TCP performance. On a 10 Gbps Ethernet, the experiment achieves 4.9 Gbps of TCP receive throughput.
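The receive-path coalescing idea can be sketched as follows: in-order segments are appended to a per-flow aggregate buffer and handed to the stack as one large packet, while any gap forces an immediate flush so TCP handles reordering (a toy host-side model with illustrative names and a 64 KB cap, not the NPU firmware):

```cpp
#include <cstdint>
#include <vector>

// Toy receive-side coalescing: append each in-order TCP segment to the
// flow's aggregate buffer, so the stack sees one large packet per flush
// instead of one small packet (and one interrupt) per wire segment.
struct RxFlow {
    uint32_t next_seq = 0;
    std::vector<uint8_t> agg;                  // merged payload

    // Returns true if the segment was coalesced; false means flush the
    // aggregate to the stack now (a gap: let TCP handle the reordering).
    bool coalesce(uint32_t seq, const uint8_t* payload, size_t len) {
        if (seq != next_seq || agg.size() + len > 64 * 1024) return false;
        agg.insert(agg.end(), payload, payload + len);
        next_seq += (uint32_t)len;
        return true;
    }
};
```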

19.
Distributed Control of Multi-Robot Systems Engaged in Tightly Coupled Tasks
NASA mission concepts for the upcoming decades of this century include exploration of sites such as steep cliff faces on Mars, as well as infrastructure deployment for a sustained robotic/manned presence on planetary and/or lunar surfaces. Single robotic platforms, such as the Sojourner rover successfully flown in 1997 and the Mars Exploration Rovers (MER) that landed on Mars in January 2004, have neither the autonomy, mobility, nor manipulation capabilities for such ambitious undertakings. One possible approach to these future missions is the fielding of cooperative multi-robot systems that have the onboard control algorithms to perform tightly coordinated tasks more or less autonomously. These control algorithms must operate under the constrained mass, volume, processing, and communication conditions present on NASA planetary surface rover systems. In this paper, we describe the design and implementation of distributed control algorithms that build on our earlier development of an enabling architecture called CAMPOUT (Control Architecture for Multi-robot Planetary Outposts). We also report on ongoing physical experiments in tightly coupled distributed control at the Jet Propulsion Laboratory in Pasadena, CA, where in the first study two rovers acquire and carry an extended payload over uneven, natural terrain, and in the second three rovers form a team for cliff access.

20.
A Time-Sharing EDF Algorithm and Its Application in Multimedia Operating Systems
This paper proposes a new CPU scheduling algorithm, time-sharing EDF (Earliest Deadline First), which guarantees that hard real-time tasks never miss their deadlines and is easy to implement in a time-sharing system. On top of it, a new hierarchical CPU scheduling algorithm, HRFSFQ, is proposed; applied in a multimedia operating system, it guarantees the QoS of every class of task. Finally, extensive experiments confirm the effectiveness and correctness of these algorithms.
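The core of any EDF scheduler is dispatching the ready task with the earliest absolute deadline; a minimal sketch (illustrative types, with the time-sharing fallback left out) is:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// EDF core: always dispatch the ready task with the earliest absolute
// deadline. A time-sharing variant can fall back to round-robin among
// non-real-time tasks whenever this queue is empty.
struct Task {
    int      id;
    uint64_t deadline;                          // absolute deadline (ticks)
    bool operator>(const Task& o) const { return deadline > o.deadline; }
};

using EdfQueue =
    std::priority_queue<Task, std::vector<Task>, std::greater<Task>>;

Task pick_next(EdfQueue& ready) {               // precondition: !ready.empty()
    Task t = ready.top();
    ready.pop();
    return t;
}
```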
