1.
Hardware data prefetching can effectively improve a processor's memory-access performance and is a technique for which a breakthrough is urgently needed in the performance optimization of the Shenwei processor. Hardware overhead and the constraints imposed by the processor architecture are the main difficulties in implementing hardware prefetching. Drawing on academic research on hardware prefetching and its current industrial use, and working closely with the structural characteristics of the Shenwei processor, this work studies how hardware prefetching can be implemented on the Shenwei processor. Taking stream prefetching as an example, at the cost of a 0.97% increase in processor core area, hardware prefetching raises the integer performance of the current Shenwei processor by 5.17% on average (up to 28.88%) and its floating-point performance by 6.39% on average (up to 30.11%).
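The abstract does not describe the prefetcher's internals; as a rough, purely illustrative software model of what a stream prefetcher does (the 64-byte line size, 4-line prefetch depth, and single-stream table are assumptions, not the Shenwei design), one might sketch:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Minimal software model of a unit-stride stream prefetcher.
// On a cache miss it checks whether the miss continues an ascending
// stream; if so, it "prefetches" the next DEPTH lines.
constexpr uint64_t LINE  = 64;   // cache line size in bytes (assumed)
constexpr int      DEPTH = 4;    // prefetch depth (assumed)

struct Stream { uint64_t next_line = 0; bool valid = false; };

class StreamPrefetcher {
    Stream stream_;                       // a real design would track many streams
public:
    std::unordered_set<uint64_t> issued;  // lines prefetched so far (for demo only)

    void on_miss(uint64_t addr) {
        uint64_t line = addr / LINE;
        if (stream_.valid && line == stream_.next_line) {
            // Miss continues the detected stream: prefetch ahead.
            for (int i = 1; i <= DEPTH; ++i) issued.insert(line + i);
        }
        stream_.next_line = line + 1;     // train on the current miss
        stream_.valid = true;
    }
};

int main() {
    StreamPrefetcher pf;
    for (uint64_t a = 0; a < 8 * LINE; a += LINE) pf.on_miss(a);  // sequential misses
    std::printf("lines prefetched: %zu\n", pf.issued.size());
}
```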
2.
Nowadays, high performance applications exploit multiple level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators, and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming such a heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into heterogeneous HMPP + MPI programs exploiting multiple GPUs located on different computing nodes.
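As a hedged sketch of the inter-node half of such a transformation, the following plain MPI program scatters an array across nodes, lets each node run its locally accelerated kernel on its slice, and gathers the results; the node_kernel function is a stand-in for the HMPP-annotated codelet, and none of this is the paper's actual code:

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

void node_kernel(std::vector<double>& x) {        // stand-in for an HMPP codelet
    for (double& v : x) v = 2.0 * v + 1.0;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_per_node = 1 << 20;
    std::vector<double> global;                    // only meaningful on rank 0
    if (rank == 0) global.assign(static_cast<std::size_t>(n_per_node) * size, 1.0);

    std::vector<double> local(n_per_node);
    MPI_Scatter(global.data(), n_per_node, MPI_DOUBLE,
                local.data(),  n_per_node, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    node_kernel(local);                            // intra-node GPU work goes here

    MPI_Gather(local.data(),  n_per_node, MPI_DOUBLE,
               global.data(), n_per_node, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}
```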
3.
In our previous work, we provided tools for an efficient characterization of biomedical images using Legendre and Zernike moments, showing their relevance as biomarkers for classifying image tiles coming from bone tissue regeneration studies (Ujaldón, 2009) [24]. As part of our quest for efficiency, we developed methods for accelerating those computations on GPUs (Martín-Requena and Ujaldón, 2011). This new stage of our work focuses on efficient data partitioning to optimize the execution on many-core GPUs and clusters of GPUs, attaining gains of up to three orders of magnitude compared to the execution on multi-core CPUs of similar age and cost using 1 Mpixel images. We deploy a successive and successful chain of optimizations that exploit symmetries in the trigonometric functions and in the access patterns to image pixels, effectively combined with massive data parallelism on GPUs, to enable (1) real-time processing for our set of input biomedical images, and (2) the use of high-resolution images in clinical practice.
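For orientation only, a naive CUDA kernel for one Legendre moment might look as follows; it shows the separability that makes precomputed polynomial tables attractive, but none of the paper's symmetry exploits or reduction optimizations, and all names are illustrative:

```cuda
// Naive CUDA kernel for one Legendre moment lambda_{p,q} of an n x n image.
// Pp and Pq hold the Legendre polynomial values P_p(x_i) and P_q(y_j),
// precomputed on the host for pixel coordinates mapped to [-1, 1].
// atomicAdd keeps the sketch short; real code would use block-level reductions.
__global__ void legendre_moment(const float* img, const float* Pp, const float* Pq,
                                int n, float* acc)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n)
        atomicAdd(acc, Pp[x] * Pq[y] * img[y * n + x]);
}

// Host side (error checks omitted):
//   dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
//   legendre_moment<<<grid, block>>>(d_img, d_Pp, d_Pq, n, d_acc);
//   lambda_{p,q} = (2p+1)(2q+1)/(n*n) * (*acc)   // normalization applied on the host
```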
4.
To address the characteristics of current stream data, namely huge volume, rapid generation, and concept drift, an anomaly detection method for stream data based on a long short-term memory (LSTM) network and a sliding window is proposed. First, an LSTM network is used to predict the data, and the difference between the predicted and actual values is computed. For each data point, an appropriate sliding window is selected, the distribution of all the differences within the window is modeled, and the anomaly likelihood of the data point is computed from the probability density of its difference under the current distribution. The LSTM network not only predicts the data but also learns while predicting, updating and adjusting itself in real time to keep the model effective, while the sliding window makes the assignment of anomaly scores more reasonable. Finally, experiments were conducted on simulated data constructed from real data. The results show that in a low-noise environment, the proposed method improves the average area under the curve (AUC) by 0.187 and 0.05 compared with detection that uses the differences directly and with the anomaly data distribution modeling (ADM) method, respectively.
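A minimal sketch of the scoring stage, assuming a Gaussian model of the prediction errors in the window (the abstract only says the differences are distribution-modeled; the window size and the Gaussian choice here are assumptions):

```cpp
#include <cmath>
#include <cstddef>
#include <deque>

// Prediction errors e_t = |y_t - LSTM(y)| from the last W steps are modeled with
// a Gaussian; the anomaly likelihood of the newest error is derived from its
// probability density under that distribution.
class WindowScorer {
    std::deque<double> win_;
    std::size_t W_;
public:
    explicit WindowScorer(std::size_t W) : W_(W) {}

    // Returns an anomaly score in [0, 1]; higher means less likely / more anomalous.
    double score(double error) {
        win_.push_back(error);
        if (win_.size() > W_) win_.pop_front();

        double mean = 0.0, var = 0.0;
        for (double e : win_) mean += e;
        mean /= win_.size();
        for (double e : win_) var += (e - mean) * (e - mean);
        var = var / win_.size() + 1e-12;                       // avoid division by zero

        constexpr double kPi = 3.14159265358979323846;
        double pdf = std::exp(-(error - mean) * (error - mean) / (2.0 * var))
                     / std::sqrt(2.0 * kPi * var);
        double pdf_max = 1.0 / std::sqrt(2.0 * kPi * var);     // density at the mean
        return 1.0 - pdf / pdf_max;                            // 0 near the mean, ->1 in the tails
    }
};
```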
6.
The data gravitation-based classification (DGC) model, a new classification model inspired by a physical law, has been demonstrated to be effective for both standard and imbalanced classification tasks. However, owing to the large amount of gravitational computation during the feature-weighting process, DGC suffers from high computational complexity, especially on large data sets. In this paper, we address the problem of speeding up gravitational computation using the graphics processing unit (GPU). We design a GPU parallel algorithm, GPU–DGC, to accelerate the feature-weighting process of the DGC model. GPU–DGC distributes the gravitational computation across parallel GPU threads so that gravitation is computed simultaneously. We use 25 open classification data sets to evaluate the parallel performance of the algorithm, and we analyze the relationship between the speedup ratio and the number of GPU threads in empirical studies. The experimental results show the effectiveness of GPU–DGC, with a maximum speedup of 87 times over the serial DGC, and also reveal its sensitivity to the number of GPU threads.
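A hedged sketch of the per-thread decomposition such a GPU kernel could use, assuming the usual data-gravitation form F = m/d² with unit mass and a weighted distance (the actual GPU–DGC kernel layout is not given in the abstract):

```cuda
// One thread per training sample: compute the gravitation exerted on a single
// query point and accumulate it per class. The feature weights w are what the
// DGC feature-weighting phase tunes.
__global__ void gravitation(const float* train, const int* label, int n, int dim,
                            const float* query, const float* w, float* class_grav)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float d2 = 1e-12f;                             // weighted squared distance (avoid /0)
    for (int k = 0; k < dim; ++k) {
        float diff = train[i * dim + k] - query[k];
        d2 += w[k] * diff * diff;
    }
    atomicAdd(&class_grav[label[i]], 1.0f / d2);   // F = m / d^2 with m = 1
}
```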
7.
With the prevalence of general-purpose computation on GPUs, shared-memory programming models have been proposed to ease the pain of GPU programming. However, with increasingly intensive workloads, it is desirable to port GPU programs to more scalable distributed-memory environments, such as multi-GPU systems. To achieve this, programs need to be re-written with mixed programming models (e.g. CUDA and message passing); programmers must work carefully not only on workload distribution but also on scheduling mechanisms to ensure efficient execution. In this paper, we study the possibility of automating the parallelization to multiple GPUs. Starting from a GPU program written in the shared-memory model, our framework analyzes the access patterns of arrays in kernel functions to derive data partition schemes. To acquire the access patterns, we propose a three-tier approach: static analysis, profile-based analysis, and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, meaning that zero effort is needed to port an existing application to a distributed-memory environment. We use our framework to parallelize several applications and show that, for certain kinds of applications, CUDA-Zero achieves efficient parallelization in a multi-GPU environment.
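As a minimal illustration of the kind of partition such a framework might derive for a block-distributable 1D array, the host-side CUDA sketch below gives each GPU a contiguous slice; the scale kernel and the bare-bones device loop are illustrative, not the CUDA-Zero runtime:

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

// Each GPU receives one contiguous chunk, processes it, and writes it back.
void run_on_all_gpus(std::vector<float>& host, int num_gpus) {
    int chunk = (int)host.size() / num_gpus;       // assume divisible for brevity
    std::vector<float*> dev(num_gpus);
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&dev[g], chunk * sizeof(float));
        cudaMemcpyAsync(dev[g], host.data() + g * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice);
        scale<<<(chunk + 255) / 256, 256>>>(dev[g], chunk);
        cudaMemcpyAsync(host.data() + g * chunk, dev[g], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost);
    }
    for (int g = 0; g < num_gpus; ++g) {           // wait for all devices, then clean up
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(dev[g]);
    }
}
```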
8.
When implementing a DDR or DDR2 SDRAM interface, how should a user choose a suitable external memory interface solution for Altera Stratix II, Stratix II GX, and HardCopy II devices? There are generally two options:
9.
The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform.
Program summary:
Program title: PPIDD
Catalogue identifier: AEEF_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEF_1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 17 698
No. of bytes in distributed program, including test data, etc.: 166 173
Distribution format: tar.gz
Programming language: Fortran, C
Computer: Many parallel systems
Operating system: Various
Has the code been vectorised or parallelized?: Yes, 2–256 processors used
RAM: 50 Mbytes
Classification: 6.5
External routines: Global Arrays or MPI-2
Nature of problem: Many scientific applications require management and communication of data that is global, and the standard MPI-2 protocol provides only low-level methods for the required one-sided remote memory access.
Solution method: The PPIDD library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform.
Running time: Problem dependent. The test provided with the distribution takes only a few seconds to run.
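To illustrate the kind of low-level MPI-2 one-sided access that PPIDD abstracts away (this is plain MPI-2, not the PPIDD API itself), a minimal example is:

```cpp
#include <mpi.h>
#include <vector>

// Every rank exposes a window onto its piece of a "global array"; any rank can
// MPI_Put into any other rank's window between two fences.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> local(100, 0.0);          // this rank's piece of the global data
    MPI_Win win;
    MPI_Win_create(local.data(), local.size() * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    double value = (double)rank;
    int target = (rank + 1) % size;               // write one element into the neighbour
    MPI_Put(&value, 1, MPI_DOUBLE, target, /*target_disp=*/0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
}
```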
10.
In meteorology, visualization of three-dimensional storm data fields is one of the key techniques for storm monitoring and disaster prediction, and the efficiency and quality of the visualization directly affect the accuracy and timeliness of storm analysis. After studying traditional 2D and 3D texture-mapping volume rendering methods, a GPU-based multi-dimensional texture hybrid rendering technique for storm data fields is proposed. The method stores the storm data field in 3D textures and combines this with dynamic generation of proxy geometry, which overcomes the texture data redundancy of traditional methods and keeps 3D interaction with the model smooth. The smooth texture-mapping resampling strategy proposed in the method significantly improves the rendered quality of the storm volume model and, to a certain extent, avoids the CPU-GPU communication bottleneck.
11.
Because of environmental factors, the surface of ore deposits under brine is gently sloped, so the collected deposit point clouds contain many redundant points. To improve the efficiency of 3D modeling of the deposits, an improved GPU-parallel point cloud simplification algorithm is designed. A least-squares plane is fitted to the points within each small grid cell, most of the redundant points are pruned according to each point's distance to the fitted plane, and a second round of simplification is performed based on the curvature of the remaining points. Restricting the whole process to individual grid cells reduces the computational cost while avoiding the holes caused by over-simplification. In addition, the simplification process is parallelized with GPU multi-threading, greatly improving the efficiency of the whole procedure. Experiments show that the improved algorithm matches the quality of the original algorithm while improving its efficiency, and GPU acceleration greatly reduces the execution time.
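A serial C++ sketch of the per-cell geometric core described above, assuming the plane is fitted as z = ax + by + c by least squares and points close to it are treated as redundant; the curvature pass and the GPU threading are omitted, and the tolerance is a free parameter:

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y, z; };

static double det3(const double m[3][3]) {
    return m[0][0]*(m[1][1]*m[2][2]-m[1][2]*m[2][1])
         - m[0][1]*(m[1][0]*m[2][2]-m[1][2]*m[2][0])
         + m[0][2]*(m[1][0]*m[2][1]-m[1][1]*m[2][0]);
}

// Fit z = a*x + b*y + c to the cell's points and keep only the points farther
// from the plane than 'tol'; near-plane points are treated as redundant.
std::vector<Pt> simplify_cell(const std::vector<Pt>& pts, double tol) {
    // Normal equations A^T A [a b c]^T = A^T z, where rows of A are [x y 1].
    double sxx=0, sxy=0, sx=0, syy=0, sy=0, n=(double)pts.size();
    double sxz=0, syz=0, sz=0;
    for (const Pt& p : pts) {
        sxx += p.x*p.x; sxy += p.x*p.y; sx += p.x; syy += p.y*p.y; sy += p.y;
        sxz += p.x*p.z; syz += p.y*p.z; sz += p.z;
    }
    double A[3][3]  = {{sxx,sxy,sx},{sxy,syy,sy},{sx,sy,n}};
    double D = det3(A);
    if (std::fabs(D) < 1e-12) return pts;              // degenerate cell: keep everything
    double Aa[3][3] = {{sxz,sxy,sx},{syz,syy,sy},{sz,sy,n}};   // Cramer's rule
    double Ab[3][3] = {{sxx,sxz,sx},{sxy,syz,sy},{sx,sz,n}};
    double Ac[3][3] = {{sxx,sxy,sxz},{sxy,syy,syz},{sx,sy,sz}};
    double a = det3(Aa)/D, b = det3(Ab)/D, c = det3(Ac)/D;

    std::vector<Pt> kept;
    double norm = std::sqrt(a*a + b*b + 1.0);
    for (const Pt& p : pts)                            // point-to-plane distance test
        if (std::fabs(a*p.x + b*p.y + c - p.z) / norm > tol) kept.push_back(p);
    return kept;
}
```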
12.
Big data computing involves different modes such as stream computing, in-memory computing, batch computing, and graph computing, each with different memory access, communication, and resource utilization characteristics. GPU heterogeneous clusters are widely used in big data analytics, yet computational models for GPU heterogeneous clusters in big data analysis are lacking. Cooperative computing between multi-core CPUs and GPUs not only increases the density of computing resources but also raises the complexity of inter-node and intra-node communication. To study CPU-GPU cooperative computing theoretically, a multi-stage cooperative computing model (p-DCOT) oriented to multiple computing modes is established. p-DCOT takes the BSP (bulk synchronous parallel) model as its core, divides the cooperative computing process into three layers, namely a data layer, a computation layer, and a communication layer, and reuses the matrices of the DOT model to formally describe computation and communication behavior. By extending the p-DOT model to describe intra-node and inter-node cooperative computing behavior, the load-balancing parameters are refined and the time cost function is proved; finally, typical computing jobs are used to verify the effectiveness of the model and of the parameter analysis. This cooperative computing model can serve as a tool for revealing cooperative computing behavior in big data analytics.
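The abstract does not reproduce p-DCOT's cost function; as background, the classic per-superstep cost of the underlying BSP model, on which the three-layer description is built, is $T = \max_i w_i + g\,h + l$, where $w_i$ is the local computation of processor $i$, $h$ is the maximum number of words any processor sends or receives, $g$ is the per-word communication cost, and $l$ is the synchronization latency; the total time is the sum over supersteps. How p-DCOT refines this with its load-balancing parameters is not stated here.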
13.
A hardware implementation method for the memory management system of a heterogeneous multi-core GPU (HMGPU) is proposed. Fixed partitioning and paged partitioning are used to manage large contiguous data and small non-contiguous data, respectively. The hardware is designed and simulated in Verilog and verified on an FPGA development board. Experimental results show that the system provides the HMGPU with an effective memory bandwidth of 2021.2 MB/s.
14.
In this paper, we propose mining frequent patterns from univariate uncertain data streams, in which each attribute of a transaction has a quantitative interval and a probability density function indicating the possibilities that the values in the interval appear. Many data streams comprise flows of univariate uncertain data, for example the records of atmospheric pollution sensors and network monitoring records. We propose two algorithms to address this issue: the ExactU2Stream algorithm and the ApproxiU2Stream algorithm. The former incrementally stores the incoming transactions and delays the mining process until it is requested; the latter mines the transactions immediately when they arrive and stores the derived frequent patterns. Compared with the latter, the former returns more accurate results but requires more response time. Both algorithms utilize the sliding window scheme, which decomposes the continuous data stream into discrete, overlapping chunks. The proposed algorithms outperform the compared methods in terms of runtime and memory usage. We have applied the two algorithms to data streams recording the air quality in Taiwan; the derived frequent patterns show not only the typical air quality in Taiwan but also the extremely bad air quality when a sandstorm affects Taiwan.
15.
In many emerging application areas, such as sensor networks and real-time monitoring systems, the generated data streams are constantly changing, arrive continuously, may contain uncertain values, and must be processed quickly. Some operations on them, such as real-time window joins over data streams, are very time-consuming, which poses a severe challenge to the performance of stream processing systems. Currently, most algorithms rely on software optimizations to improve processing speed, but the gains are limited. Exploiting the highly parallel, multi-threaded, high-bandwidth processing capability of the GPU (graphics processing unit), a combined software/hardware method is designed to accelerate window join operations over data streams. Under CUDA (Compute Unified Device Architecture), the CPU controls the transfer of data from main memory to GPU memory, and the join is then processed in parallel by many threads. Experiments verify that the proposed method can greatly improve the processing speed of window joins over multiple data streams, reaching about 50 times the speed of a pure software implementation.
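A minimal CUDA sketch of how the O(|R|·|S|) comparison work of a window join maps onto GPU threads (one thread per tuple of window R, counting key matches in window S); the paper's actual kernels, result materialization, and tuple expiry handling are not shown:

```cuda
// One thread per tuple in window R: scan window S and count equality matches
// on the join key, accumulating into a global counter.
__global__ void window_join_count(const int* r_keys, int r_len,
                                  const int* s_keys, int s_len,
                                  unsigned long long* match_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= r_len) return;

    unsigned long long local = 0;
    int key = r_keys[i];
    for (int j = 0; j < s_len; ++j)
        if (s_keys[j] == key) ++local;
    if (local) atomicAdd(match_count, local);
}

// Host side (error checks omitted): copy both windows to the GPU, launch, copy back.
//   window_join_count<<<(r_len + 255) / 256, 256>>>(d_r, r_len, d_s, s_len, d_count);
```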
16.
Graphics processing units have proved their capability for general-purpose computing in many research areas. In this paper, we propose the mechanism and implementation of a database system that encrypts and decrypts data by using the GPU. The proposed mechanism is mainly designed for database systems that require data encryption and decryption to support a high security level; outsourced database systems or database cloud services are good candidates for our system. By exploiting the computation capability of the GPU, we achieve not only a fast encryption and decryption time per operation, but also higher overall performance of the database system by offloading computation to the GPU. Moreover, the proposed system includes a mechanism that decides whether or not to offload computation to the GPU for further performance gain. We implemented the AES algorithm on the CUDA framework and integrated it with MySQL, a commodity database system. Our evaluation demonstrates that encryption and decryption on the GPU show eight times better performance than on the CPU when the data size is 16 MB, and the performance gain is proportional to the data size. We also show that the proposed system alleviates CPU utilization, and the overall performance of the database system is improved by offloading heavy encryption and decryption computation to the GPU.
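The offload decision can be as simple as a size threshold; the sketch below is an assumed illustration of that idea (the 1 MiB constant is not the paper's measured break-even point):

```cpp
#include <cstddef>

enum class Device { CPU, GPU };

// Encrypt on the GPU only when the buffer is large enough for the transfer and
// kernel-launch overhead to pay off.
Device choose_device(std::size_t nbytes, bool gpu_available) {
    constexpr std::size_t kOffloadThreshold = 1u << 20;   // 1 MiB (assumed)
    return (gpu_available && nbytes >= kOffloadThreshold) ? Device::GPU : Device::CPU;
}
```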
18.
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this work, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm, and we show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface, experimenting with apriori and fp-tree-based association mining, k-means clustering, the k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner; each of these three techniques can outperform the others depending upon machine and dataset parameters, and all three perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.
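As a minimal sketch of the simplest of these techniques, full replication, the following C++ gives every thread a private copy of a counting reduction object and merges the copies at the end; the paper's reduction-object interface and its locking variants are not reproduced here:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Full replication: each thread updates its own private copy of the counts
// (no locks), and the copies are merged once at the end.
// 'data' holds item ids in [0, num_items).
std::vector<long> count_items(const std::vector<int>& data, int num_items, int num_threads) {
    std::vector<std::vector<long>> local(num_threads, std::vector<long>(num_items, 0));
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                ++local[t][data[i]];                     // private copy: no contention
        });
    for (auto& w : workers) w.join();

    std::vector<long> total(num_items, 0);               // merge phase
    for (int t = 0; t < num_threads; ++t)
        for (int k = 0; k < num_items; ++k) total[k] += local[t][k];
    return total;
}
```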
19.
To implement single-ended flexible arrays ('elastic memory'), a class of underlying dynamic storage allocation methods called buddy systems may be used to allocate fixed blocks of memory from a restricted set of sizes. Arrays are implemented in one approach as contiguous blocks of memory and in another as two-level structures. Each approach is combined with three buddy methods for allocating blocks of memory, making six methods in all. This paper illustrates the general ideas of implementing elastic memory mechanisms in terms of the buddy system interface.
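A compact sketch of the binary buddy method that underlies such systems (offsets only, no real memory; the paper's two array layouts and its other buddy variants are not shown): a request is rounded up to a power of two, larger blocks are split, and a freed block is coalesced with its buddy at offset XOR size.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

class BuddyAllocator {
    static constexpr int kMinOrder = 5, kMaxOrder = 20;        // 32 B .. 1 MiB blocks
    std::vector<std::set<std::size_t>> free_;                  // free offsets per order
    std::map<std::size_t, int> order_of_;                      // order of live blocks
public:
    BuddyAllocator() : free_(kMaxOrder + 1) { free_[kMaxOrder].insert(0); }

    std::ptrdiff_t alloc(std::size_t n) {                      // returns offset or -1
        int order = kMinOrder;
        while ((std::size_t{1} << order) < n) ++order;
        int o = order;
        while (o <= kMaxOrder && free_[o].empty()) ++o;        // find a big enough block
        if (o > kMaxOrder) return -1;
        std::size_t off = *free_[o].begin();
        free_[o].erase(free_[o].begin());
        while (o > order) {                                    // split until it fits
            --o;
            free_[o].insert(off + (std::size_t{1} << o));      // give back the upper half
        }
        order_of_[off] = order;
        return (std::ptrdiff_t)off;
    }

    void free_block(std::size_t off) {
        int order = order_of_[off];
        order_of_.erase(off);
        while (order < kMaxOrder) {                            // coalesce with the buddy
            std::size_t buddy = off ^ (std::size_t{1} << order);
            auto it = free_[order].find(buddy);
            if (it == free_[order].end()) break;
            free_[order].erase(it);
            off = std::min(off, buddy);
            ++order;
        }
        free_[order].insert(off);
    }
};
```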
20.
A design method is presented for the FPGA implementation of the path metric memory and its interface in Viterbi decoding of (2,1,N) convolutional codes. The decoder uses four ACS (add-compare-select) units operating in parallel, and the state metrics are updated in a ping-pong fashion. The partitioning of the memory into blocks and the determination of the read/write addresses and read/write clocks are described. The design makes full use of the abundant on-chip memory resources of FPGAs, achieving a high decoding speed with simple control logic.
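A software model of one trellis step of the ACS update with ping-pong metric buffers is sketched below; the state-numbering convention and the branch_metric stand-in are assumptions, and the four parallel hardware ACS units, survivor memory, and FPGA block partitioning are not modeled.

```cpp
#include <algorithm>
#include <vector>

// One trellis step of the add-compare-select update: metrics are read from 'cur'
// and written to 'nxt', then the two buffers swap roles, which is the access
// pattern the ping-pong memory scheme supports in hardware. K is the constraint
// length of the (2,1,K) code; branch_metric stands in for the Hamming distance
// between the received symbol pair and the encoder output on the given transition.
void acs_step(int K, std::vector<int>& cur, std::vector<int>& nxt,
              int (*branch_metric)(int from_state, int input_bit))
{
    const int num_states = 1 << (K - 1);
    for (int s = 0; s < num_states; ++s) {
        // With next_state = ((state << 1) | input) & (num_states - 1), the two
        // predecessors of s differ only in their most significant register bit.
        int input_bit = s & 1;
        int p0 = s >> 1;
        int p1 = p0 | (1 << (K - 2));
        int m0 = cur[p0] + branch_metric(p0, input_bit);
        int m1 = cur[p1] + branch_metric(p1, input_bit);
        nxt[s] = std::min(m0, m1);            // compare-select (survivor path omitted)
    }
    std::swap(cur, nxt);                      // ping-pong: buffers exchange roles
}
```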