Similar literature
Found 20 similar documents; search took 31 ms.
1.
Yu Hui, Jiang Xin-Yu, Zhao Jin, Qi Hao, Zhang Yu, Liao Xiao-Fei, Liu Hai-Kun, Mao Fu-Bing, Jin Hai. Journal of Computer Science and Technology (《计算机科学技术学报》), 2022, 37(4): 797-813

Many systems have been built to employ the delta-based iterative execution model to support iterative algorithms on distributed platforms by exploiting the sparse computational dependencies between data items of these iterative algorithms in a synchronous or asynchronous approach. However, for large-scale iterative algorithms, existing synchronous solutions suffer from slow convergence and load imbalance because of the strict barrier between iterations, while existing asynchronous approaches incur excessive redundant communication and computation cost as a result of being barrier-free. In view of the performance trade-off between these two approaches, this paper designs an efficient execution manager, called Aiter-R, which can be integrated into existing delta-based iterative processing systems to efficiently support the execution of delta-based iterative algorithms by using our proposed group-based iterative execution approach. It can efficiently and correctly explore the middle ground between the two extremes. A heuristic scheduling algorithm is further proposed to allow an iterative algorithm to adaptively choose its trade-off point so as to achieve the maximum efficiency. Experimental results show that Aiter-R strikes a good balance between the synchronous and asynchronous policies and outperforms state-of-the-art solutions. It reduces the execution time by up to 54.1% and 84.6% in comparison with existing asynchronous and synchronous models, respectively.
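
The delta-based model mentioned in this abstract can be illustrated with a small sketch. The code below is not Aiter-R and not its group-based scheduling; it is a minimal single-machine illustration, assuming an unnormalized PageRank as the iterative algorithm, of how each vertex accumulates incoming deltas and forwards only the change. The function name `delta_pagerank` and the barrier-per-round (synchronous) structure are assumptions made for the example.

```python
from collections import defaultdict


def delta_pagerank(graph, damping=0.85, tol=1e-10):
    """Delta-based accumulative PageRank (unnormalized), synchronous rounds.

    graph maps each vertex to the list of its out-neighbours.
    Hypothetical helper for illustration only.
    """
    rank = {v: 0.0 for v in graph}
    # Every vertex starts with the same initial delta.
    delta = {v: 1.0 - damping for v in graph}
    while any(abs(d) > tol for d in delta.values()):
        next_delta = defaultdict(float)
        for v, d in delta.items():
            if abs(d) <= tol:
                continue  # negligible change: skip it, nothing to propagate
            rank[v] += d  # accumulate the received change
            outs = graph[v]
            for u in outs:
                # Forward only the change, not the full value.
                next_delta[u] += damping * d / len(outs)
        # Barrier: all deltas of this round are applied before the next one.
        delta = {v: next_delta[v] for v in graph}
    return rank


if __name__ == "__main__":
    ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(delta_pagerank(ring))
```

An asynchronous variant would apply each delta as soon as it arrives instead of waiting at the barrier, which is the trade-off the abstract describes.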


2.
A Survey of General-Purpose Computation on Graphics Hardware   Total citations: 31 (self-citations: 0; citations by others: 31)
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general‐purpose computation to graphics hardware. We begin with the technical motivations that underlie general‐purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general‐purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general‐purpose application development on graphics hardware.

3.
The delta-based accumulative iterative computation (DAIC) model has been proposed to support iterative algorithms in a synchronous or an asynchronous way. However, the synchronous and the asynchronous DAIC models each suit only certain conditions and perform poorly under others, because of either high synchronization cost or many redundant activations. As a result, the overall performance of both DAIC models suffers from the serious network jitter and load jitter caused by multitenancy in the cloud. In this paper, we develop a system, namely HybIter, to guarantee the performance of iterative algorithms under different conditions. Through an adaptive execution model selection scheme, it can efficiently switch between the synchronous and asynchronous DAIC models to adapt to different conditions, always getting the best performance in the cloud. Experimental results show that our approach can improve the performance of current solutions by up to 39.0%.

4.
BSPlib: The BSP programming library   Total citations: 1 (self-citations: 0; citations by others: 1)
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access (DRMA) using put or get operations, and bulk synchronous message passing (BSMP). Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms (FFTs), sorting, and molecular dynamics.
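
The DRMA style described in this abstract can be sketched as a toy model. The code below is not BSPlib and does not reproduce its C interface; it is a single-process Python simulation, with hypothetical names (`ToyBSP`, `put`, `sync`, `total_exchange`), of the one property that defines a BSP superstep: remote writes issued during a superstep become visible only after the synchronization barrier.

```python
class ToyBSP:
    """Toy model of BSP direct remote memory access (illustration only)."""

    def __init__(self, nprocs, size):
        # One private memory area per simulated process.
        self.mem = [[0] * size for _ in range(nprocs)]
        self.pending = []  # writes buffered until the barrier

    def put(self, dest, offset, value):
        """Request a write into the memory of process `dest`."""
        self.pending.append((dest, offset, value))

    def sync(self):
        """Barrier: deliver all buffered writes, ending the superstep."""
        for dest, offset, value in self.pending:
            self.mem[dest][offset] = value
        self.pending.clear()


def total_exchange(nprocs):
    """Each process sends pid + 1 to slot pid of every process."""
    bsp = ToyBSP(nprocs, nprocs)
    for pid in range(nprocs):
        for dest in range(nprocs):
            bsp.put(dest, pid, pid + 1)
    before = [sum(row) for row in bsp.mem]  # nothing delivered yet
    bsp.sync()
    after = [sum(row) for row in bsp.mem]   # every process sees all values
    return before, after


if __name__ == "__main__":
    print(total_exchange(4))
```

Buffering the writes until `sync` is what lets a BSP program reason about each superstep independently of message timing.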

5.
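Iterative computation is an important class of big-data analytics applications. When an iterative computation is implemented on the distributed computing framework MapReduce, it is decomposed into multiple jobs that run sequentially according to their dependencies, so the program interacts with the distributed file system (DFS) many times, which lengthens execution time. Caching the data involved in these interactions reduces the time spent interacting with the DFS and thereby improves overall program performance. Considering that a large amount of cluster memory is idle most of the time, this paper proposes MemLoop, a programming framework for iterative applications that uses in-memory caching. The system implements cache management across the job-submission API, the scheduling algorithm, and the cache-management module, so that the memory cache is fully used for data that can stay resident across iterations and for dependent data within an iteration. We compare this framework with existing related frameworks, and the experimental results show that it improves the performance of iterative programs.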

6.
Through the combination of the sequential spectral factorization and the coprime factorization, a k‐step ahead MIMO H∞ (cumulative minimax) predictor is derived which is stable for the unstable noise model. This predictor and the modified internal model of the reference signal are embedded into the H∞ optimization framework, yielding a single degree of freedom multi‐input–multi‐output H∞ predictive controller that provides stochastic disturbance rejection and asymptotic tracking of the reference signals described by the internal model. It is shown that for a plant/disturbance model that represents a large class of systems, the inclusion of the H∞ predictor into the H∞ control algorithm introduces a performance/robustness tuning knob: an increase of the prediction horizon enforces a more conservative control effort and, correspondingly, results in deterioration of the transient and the steady‐state (tracking error variance) performance, but guarantees a large robustness margin, while a decrease of the prediction horizon results in a more aggressive control signal and better transient and steady‐state performance, but a smaller robustness margin. Copyright © 2001 John Wiley & Sons, Ltd.

7.
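GPGPU performance model and analysis of application examples   Total citations: 2 (self-citations: 1; citations by others: 1)
The high performance of modern graphics processing units (GPUs) has attracted a large number of non-graphics applications. To support effective performance prediction and optimization, this paper proposes a performance model for general-purpose computation on GPUs. Based on an analysis of the parallel architecture and operating principles of modern GPUs, the general-purpose computation process on a GPU is divided into four parallel stages: data fetch, computation, output, and transfer. Each stage is analyzed quantitatively using program characteristics and hardware specifications to produce a performance prediction. Experimental analysis identifies two major factors that affect performance, computational intensity and access density, which are adopted as the basic criteria for performance optimization. The model is used to analyze GPU implementations of several common image and video processing algorithms, including Gaussian convolution, the discrete cosine transform, and motion estimation. Experimental results show that, by increasing computational intensity and access density, the optimization schemes in the paper significantly reduce execution time on the GPU and improve computational efficiency by a factor of 4 to 10, which demonstrates the model's effectiveness for performance prediction and optimization.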

8.
Simulation is indispensable in computer architecture research. Researchers increasingly resort to detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on the system performance. However, the slow speed of conventional execution‐driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper describes a novel fast multicore processor architecture simulation framework called Two‐Phase Trace‐driven Simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is only incurred once during the cycle‐accurate simulation‐based trace generation phase and can be omitted in the repeated trace‐driven simulations. We report our experiences with tsim, an event‐driven multicore processor architecture simulator that models detailed memory hierarchy, interconnect, and coherence protocol based on the TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 million simulated instructions per second when running 16‐thread parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.

9.
In this paper, an efficient framework is proposed for the consensus and formation control of distributed multi‐agent systems with second‐order dynamics and unknown time‐varying parameters, by means of an adaptive iterative learning control approach. Under the assumption that the acceleration of the leader is unknown to any follower agent, a new adaptive auxiliary control and distributed adaptive iterative learning protocols are designed. Then, all follower agents track the leader uniformly on [0,T] for the consensus problem, and keep the desired distance from the leader and achieve velocity consensus uniformly on [0,T] for the formation problem. The coordination performance of the distributed multi‐agent system is analyzed based on Lyapunov stability theory. Finally, simulation examples are given to illustrate the effectiveness of the proposed protocols. Copyright © 2013 John Wiley & Sons, Ltd.

10.
Recent advances in neuroscientific understanding have highlighted the highly parallel computation power of the mammalian neocortex. In this paper we describe a GPGPU-accelerated implementation of an intelligent learning model inspired by the structural and functional properties of the neocortex. Furthermore, we consider two inefficiencies inherent to our initial implementation and propose software optimizations to mitigate such problems. Analysis of our application’s behavior and performance provides important insights into the GPGPU architecture, including the number of cores, the memory system, atomic operations, and the global thread scheduler. Additionally, we create a runtime profiling tool for the cortical network that proportionally distributes work across the host CPU as well as multiple GPGPUs available to the system. Using the profiling tool with these optimizations on Nvidia’s CUDA framework, we achieve up to 60× speedup over a single-threaded CPU implementation of the model.

11.
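Objective: Image inpainting and outpainting can both be viewed as rendering unknown regions from known regions, and they are a research focus in computer vision. In recent years, deep neural networks have become the mainstream approach to inpainting and outpainting. However, current solutions mostly treat the two problems separately, which makes them hard to handle in a unified way; moreover, the models are mostly built from convolutional neural networks (CNNs), whose local receptive field makes it difficult to render long-range content. To address these two problems, this paper follows a divide-and-conquer approach, combines CNNs with a Transformer to build a deep neural network, and proposes a unified framework and model for image inpainting and outpainting. Method: The solution process for inpainting and outpainting is decomposed into three parts: representation, prediction, and synthesis. Representation and synthesis are performed by CNNs, exploiting their local correlation for image-to-feature mapping and feature-to-image reconstruction. The core prediction is performed by a Transformer, exploiting its strong ability to model global context; a mask self-growth strategy is proposed to predict features iteratively, which reduces the difficulty of having the Transformer predict the features of a large unknown region all at once. Finally, adversarial learning is introduced to improve the realism of the rendered images. Results: Experiments provide comparative evaluations of inpainting and outpainting on multiple datasets, and the results show that the proposed method surpasses the compared methods on all performance metrics. Ablation experiments show that the model performs better than a non-decomposed variant, which indicates that the divide-and-conquer approach is effective. In addition, a detailed experimental analysis of the mask self-growth strategy shows that iterative prediction effectively improves rendering ability. Finally, the influence of the Transformer's key structural parameters on model performance is investigated. Conclusion: This paper proposes a unified iterative-prediction framework for image inpainting and outpainting. It outperforms the compared methods, and each part of the design contributes to the performance gain, which shows the application value and potential of the unified iterative-prediction framework and method for image inpainting and outpainting.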

12.
Harmful algal blooms have caused critical problems worldwide because they pose serious threats to human health and aquatic ecosystems. In particular, red tide blooms of Cochlodinium polykrikoides have caused serious damage to aquaculture in Korean coastal waters. In this study, multiple linear regression, regression tree (RT), and Random Forest models were applied to detect C. polykrikoides blooms in coastal waters. Five types of input data sets were implemented to test the performance of the models. The observed number of C. polykrikoides cells and reflectance data from Geostationary Ocean Color Imager images obtained in a 3-year period (2013–2015) were used to train and validate the models. The RT model demonstrated the best prediction performance when four bands and three-band ratio data were simultaneously used as input data. The results obtained via iterative model development with randomly chosen input data indicate that the recognition of patterns in the training data caused variations in the prediction performance. This work provides useful tools for reliable estimation of the number of C. polykrikoides cells using reasonable coastal water reflectance data sets. It is expected that administrators and decision-makers whose work is associated with coastal waters will be able to easily access and manipulate the RT model.

13.
In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is functionally correct but written without any consideration for performance optimization. Our compiler applies a set of optimization techniques to the naïve kernel and generates the optimized GPU kernel. Our compiler supports optimizations for GPU kernels using either global memory or texture memory. The implementation of our compiler is facilitated by a source-to-source compiler infrastructure, Cetus. A code transformation in the Cetus compiler framework is called a pass. We classify all the passes used in our work into two categories: functional passes and optimization passes. The functional passes translate input kernels into the desired intermediate representation, which clearly represents memory access patterns and thread configurations. A series of optimization passes improve the performance of the kernels by adapting them to the target GPGPU architecture. Our experiments show that the optimized code achieves very high performance, either superior or very close to highly fine-tuned libraries.

14.
In this study, an iterative learning control (ILC) algorithm is proposed to improve synchronous errors in rigid tapping. In rigid tapping, the displacements of the z‐axis and spindle must be kept synchronous to prevent damage. Using learning control provides better commands for both the z‐axis and spindle dynamics, improving the synchronicity of the output responses of the z‐axis and spindle. The proposed ILC makes use of synchronous errors in the previous cycle of tapping to modify the current position commands of both the z‐axis and spindle. A systematic algorithm is proposed for the computation of learning gains that guarantee the monotonic convergence of synchronous errors. A systematic procedure of applying ILC to rigid tapping is also proposed, where the ideas of effective learning gains and stop learning criteria are discussed. Experimental results on a tapping machine verify the effectiveness of the proposed ILC algorithm.
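
The idea of using the previous cycle's error to correct the next cycle's command can be sketched in a few lines. The code below is not the paper's algorithm and does not model a z‐axis and spindle pair; it is a generic P-type ILC update on a hypothetical first-order plant, with the plant parameters `a`, `b` and the learning gain chosen by assumption so that the peak error shrinks monotonically from cycle to cycle.

```python
import math


def simulate(u, a=0.3, b=1.0):
    """Hypothetical plant y(t+1) = a*y(t) + b*u(t), starting from y(0) = 0."""
    y, out = 0.0, []
    for ut in u:
        y = a * y + b * ut
        out.append(y)
    return out


def run_ilc(reference, gain=0.8, cycles=15):
    """P-type ILC: u_next(t) = u(t) + gain * error(t+1), repeated per cycle.

    Returns the peak absolute tracking error recorded in each cycle.
    """
    u = [0.0] * len(reference)
    peak_errors = []
    for _ in range(cycles):
        y = simulate(u)
        error = [r - yt for r, yt in zip(reference, y)]
        peak_errors.append(max(abs(e) for e in error))
        # Learn from this cycle's error to correct the next cycle's command.
        u = [ut + gain * e for ut, e in zip(u, error)]
    return peak_errors


if __name__ == "__main__":
    ref = [math.sin(2 * math.pi * t / 20) for t in range(20)]
    for k, e in enumerate(run_ilc(ref)):
        print(k, e)
```

For this plant the error map between cycles has an induced infinity norm below 0.2 + 0.8 × 0.3 / 0.7 ≈ 0.54, which is why the peak error decreases in every cycle; a different plant or gain would need its own check.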

15.
The formation of protein secondary structure, especially the regions of β-sheets, involves long-range interactions between amino acids. We propose a novel recurrent neural network architecture called segmented-memory recurrent neural network (SMRNN) and present experimental results showing that SMRNN outperforms conventional recurrent neural networks on long-term dependency problems. In order to capture long-term dependencies in protein sequences for secondary structure prediction, we develop a predictor based on the bidirectional segmented-memory recurrent neural network (BSMRNN), which is a noncausal generalization of SMRNN. In comparison with the existing predictor based on the bidirectional recurrent neural network (BRNN), the BSMRNN predictor can improve prediction performance, especially the recognition accuracy of β-sheets.

16.
The general-purpose graphics processing unit (GPGPU) is a popular accelerator for general applications such as scientific computing, because these applications are massively parallel and GPGPUs inherit the substantial parallel computing power of GPUs. However, distributing the workload among the large number of cores, that is, choosing the execution configuration of a GPGPU, is currently still a manual trial-and-error process. Programmers try out some configurations manually and might settle for a sub-optimal one, leading to poor performance and/or high power consumption. This paper presents an auto-tuning approach for GPGPU applications based on performance and power models. First, a model-based analytic approach for estimating the performance and power consumption of kernels is proposed. Second, an auto-tuning framework is proposed for automatically obtaining a near-optimal configuration for a kernel computation. In this work, we formulate the automatic search for an optimal configuration as a constrained optimization problem and solve it using either simulated annealing (SA) or a genetic algorithm (GA). Experimental results show that the fidelity of the proposed models for performance and energy consumption is 0.86 and 0.89, respectively. Further, the optimization algorithms result in a normalized optimality offset of 0.94% and 0.79% for SA and GA, respectively.
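
Searching an execution-configuration space with simulated annealing can be sketched as follows. This is not the paper's framework: the cost function `toy_cost` is a made-up stand-in for its performance and power models, and the search space (`BLOCK_SIZES`, `UNROLL`), the cooling schedule, and the neighbour move are all assumptions chosen to keep the example short.

```python
import math
import random

BLOCK_SIZES = [32, 64, 128, 256, 512, 1024]  # hypothetical threads per block
UNROLL = [1, 2, 4, 8]                        # hypothetical unroll factors


def toy_cost(block, unroll):
    """Made-up cost model (assumption); lowest at block=256, unroll=4."""
    return (math.log2(block) - 8) ** 2 + (math.log2(unroll) - 2) ** 2 + 1.0


def anneal(cost, steps=500, t0=5.0, seed=0):
    """Simulated annealing over the grid; returns (best_cost, block, unroll)."""
    rng = random.Random(seed)
    i, j = 0, 0  # start from the first configuration
    cur = cost(BLOCK_SIZES[i], UNROLL[j])
    best = (cur, BLOCK_SIZES[i], UNROLL[j])
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling
        ni, nj = i, j
        # Neighbour move: shift one parameter by one position, clamped.
        if rng.random() < 0.5:
            ni = min(max(i + rng.choice((-1, 1)), 0), len(BLOCK_SIZES) - 1)
        else:
            nj = min(max(j + rng.choice((-1, 1)), 0), len(UNROLL) - 1)
        new = cost(BLOCK_SIZES[ni], UNROLL[nj])
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            i, j, cur = ni, nj, new
            if cur < best[0]:
                best = (cur, BLOCK_SIZES[i], UNROLL[j])
    return best


if __name__ == "__main__":
    print(anneal(toy_cost))
```

In a real tuner the cost would come from a measured or modelled kernel run, and constraints such as a power budget could be handled by rejecting or penalizing infeasible configurations.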

17.
The lattice‐Boltzmann method is well suited for implementation in single‐instruction multiple‐data (SIMD) environments provided by general purpose graphics processing units (GPGPUs). This paper discusses the integration of these GPGPU programs with OpenMP to create lattice‐Boltzmann applications for multi‐GPU clusters. In addition to the standard single‐phase single‐component lattice‐Boltzmann method, the performances of more complex multiphase, multicomponent models are also examined. The contributions of various GPU lattice‐Boltzmann parameters to the performance are examined and quantified with a statistical model of the performance using Analysis of Variance (ANOVA). By examining single‐ and multi‐GPU lattice‐Boltzmann simulations with ANOVA, we show that all the lattice‐Boltzmann simulations primarily depend on effects corresponding to simulation geometry and decomposition, and not on the architectural aspects of GPU. Additionally, using ANOVA we confirm that the metrics of Efficiency and Utilization are not suitable for memory‐bandwidth‐dependent codes. Copyright © 2010 John Wiley & Sons, Ltd.

18.
We present the development process behind AtlantikSolar, a small 6.9 kg hand‐launchable low‐altitude solar‐powered unmanned aerial vehicle (UAV) that recently completed an 81‐hour continuous flight and thereby established a new flight endurance world record for all aircraft below 50 kg mass. The goal of our work is to increase the usability of such solar‐powered robotic aircraft by maximizing their perpetual flight robustness to meteorological deteriorations such as clouds or winds. We present energetic system models and a design methodology, implement them in our publicly available conceptual design framework for perpetual flight‐capable solar‐powered UAVs, and finally apply the framework to the AtlantikSolar UAV. We present the detailed AtlantikSolar characteristics as a practical design example. Airframe, avionics, hardware, state estimation, and control method development for autonomous flight operations are described. Flight data are used to validate the conceptual design framework. Flight results from the continuous 81‐hour and 2,338 km covered ground distance flight show that AtlantikSolar achieves 39% minimum state‐of‐charge, 6.8 h excess time and 6.2 h charge margin. These performance metrics are a significant improvement over previous solar‐powered UAVs. A performance outlook shows that AtlantikSolar allows perpetual flight in a 6‐month window around June 21 at mid‐European latitudes, and that multi‐day flights with small optical‐ or infrared‐camera payloads are possible for the first time. The demonstrated performance represents the current state‐of‐the‐art in solar‐powered low‐altitude perpetual flight performance. We conclude with lessons learned from the three‐year AtlantikSolar UAV development process and with a sensitivity analysis that identifies the most promising technological areas for future solar‐powered UAV performance improvements.

19.
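With the continuous development of process and manufacturing technology and the maturing of architectures, the parallel computing capability of general-purpose graphics processing units (GPGPUs) has improved greatly, and GPGPUs are increasingly widely used in general-purpose computing scenarios that demand high performance and high throughput. By supporting the concurrent execution of a large number of threads, a GPGPU can hide long-latency memory accesses well and thereby achieve high parallel computing capability. However, when a GPGPU handles applications with irregular computation and memory access, the efficiency of its memory subsystem suffers considerably; contention for the on-chip cache is especially pronounced, so the data needed by compute operations cannot be supplied in time and the GPGPU's high parallel computing capability cannot be fully exploited. Resolving on-chip cache contention and optimizing the performance of the cache subsystem is one of the main approaches to optimizing GPGPU performance, and it is also one of the main current research hotspots in GPGPU performance optimization. At present, research on performance optimization of the GPGPU cache subsystem focuses on five aspects: thread-level parallelism (TLP) throttling, memory access reordering, data throughput enhancement, last-level cache (LLC) optimization, and new GPGPU cache architectures based on non-volatile memory (NVM). This paper analyzes and discusses the main current methods for optimizing GPGPU cache subsystem performance from these five aspects, and finally points out problems that need further investigation in future GPGPU cache subsystem optimization, which is of significance for research on GPGPU cache subsystem performance optimization.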

20.
For a class of fractional‐order linear continuous‐time switched systems specified by an arbitrary switching rule, this paper proposes a PDα‐type fractional‐order iterative learning control algorithm. For systems disturbed by bounded measurement noise, the robustness of PDα‐type algorithm is first discussed in the iteration domain and the tracking performance is analyzed. Next, a sufficient condition for monotone convergence of the algorithm is studied when external noise is absent. The results of analysis and simulation illustrate the feasibility and effectiveness of the proposed control algorithm.
