Similar Literature
20 similar documents found (search time: 359 ms)
1.
Embedded Parallel computing architecture with Unique Memory Access (ePUMA) is a domain-specific embedded heterogeneous 9-core chip multiprocessor with a unique low-power, silicon-efficient design for high-throughput DSP in emerging telecommunication and multimedia applications. Sorting is one of the most widely studied algorithms, and a growing number of embedded applications need to sort efficiently. This paper proposes eSORT, an efficient bitonic sorting algorithm for the novel ePUMA DSP. eSORT consists of two parts, an in-core sorting algorithm and an inter-core sorting algorithm, both adapted to the architecture to take full advantage of the ePUMA platform. We implemented and evaluated eSORT on the ePUMA multi-core DSP for datasets of varying size and compared its performance with the Cell BE processor, which has the same SIMD parallelization structure. The results show that bitonic sort on the ePUMA multi-core DSP has much better performance and scalability: compared with an optimized bitonic sort on the Cell BE, the in-core sort is on average 11 times faster and the inter-core sort 15 times faster.
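As an illustration of the bitonic network that eSORT builds on, here is a minimal sequential C++ sketch (our own, not the authors' code): every compare-exchange within a stage is independent of the others, which is exactly the property that maps onto SIMD lanes and cores on a platform like ePUMA.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Bitonic sorting network for a power-of-two-sized array. All the
// compare-exchanges of one (k, j) stage are independent, so a parallel
// implementation can run a whole stage across SIMD lanes or cores.
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();              // must be a power of two
    for (std::size_t k = 2; k <= n; k *= 2)      // size of bitonic sequences
        for (std::size_t j = k / 2; j > 0; j /= 2)   // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;
                if (partner > i) {
                    bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }
}

int main() {
    std::vector<int> v = {7, 3, 6, 1, 8, 2, 5, 4};
    bitonic_sort(v);
    for (int x : v) std::printf("%d ", x);   // 1 2 3 4 5 6 7 8
}
```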

2.
The computer hardware industry has embraced multi-core processors. Extreme optimisation of sequential algorithms is therefore no longer sufficient to squeeze out the real machine power, which can only be exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable for parallelisation. This paper presents an in-depth study of the parallelisation of an implementation of the C4.5 algorithm for multi-core architectures. We characterise elapsed-time lower bounds for the forms of parallelisation adopted and achieve close-to-optimal performance. Our implementation is based on the FastFlow parallel programming environment and requires minimal changes to the original sequential code. Copyright © 2013 John Wiley & Sons, Ltd.
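To make the "natural concurrency" concrete, here is a hedged C++ sketch of attribute-level parallelism in decision-tree node splitting. The names `score_attribute` and `best_split` are ours, the scoring stand-in (column variance rather than C4.5's gain ratio) is a simplification, and std::async stands in for the FastFlow patterns the paper actually uses.

```cpp
#include <cstdio>
#include <functional>
#include <future>
#include <vector>

// Stand-in scoring function: in C4.5 this would be the information-gain
// ratio of splitting on column `attr`; column variance keeps the sketch
// self-contained and runnable.
double score_attribute(int attr, const std::vector<std::vector<double>>& rows) {
    double mean = 0, var = 0;
    for (const auto& r : rows) mean += r[attr];
    mean /= (double)rows.size();
    for (const auto& r : rows) var += (r[attr] - mean) * (r[attr] - mean);
    return var / (double)rows.size();
}

// Score every candidate attribute of one tree node concurrently -- the
// attribute-level concurrency that decision-tree induction exposes.
int best_split(const std::vector<std::vector<double>>& rows, int n_attrs) {
    std::vector<std::future<double>> scores;
    for (int a = 0; a < n_attrs; ++a)
        scores.push_back(std::async(std::launch::async,
                                    score_attribute, a, std::cref(rows)));
    int best = 0;
    double best_score = scores[0].get();
    for (int a = 1; a < n_attrs; ++a) {
        double s = scores[a].get();
        if (s > best_score) { best_score = s; best = a; }
    }
    return best;
}

int main() {
    std::vector<std::vector<double>> rows = {{1, 10}, {2, 20}, {3, 15}};
    std::printf("best attribute: %d\n", best_split(rows, 2));   // 1
}
```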

3.
MMX technology extension to the Intel architecture
Peleg, A.; Weiser, U. IEEE Micro, 1996, 16(4): 42–50
Designed to accelerate multimedia and communications software, MMX technology extends the Intel architecture (IA) with new data types and instructions that exploit the parallelism inherent in these applications. It uses a SIMD (single-instruction, multiple-data) technique to exploit the data parallelism present in many algorithms, yielding full-application performance 1.5 to 2 times faster than the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.
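A small illustration of the packed saturating arithmetic MMX introduced (e.g., PADDUSB on 64-bit MMX registers). Since MMX intrinsics are legacy, this sketch uses the equivalent 128-bit SSE2 instruction, `_mm_adds_epu8`, which is the direct successor of the MMX operation; treat it as a demo of the SIMD idea, not of the original MMX API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <emmintrin.h>   // SSE2: 128-bit successors of the MMX packed ops

// Saturating unsigned byte add, 16 pixels per instruction: the classic
// MMX-style use case of brightening an 8-bit image without overflow.
void brighten(uint8_t* dst, const uint8_t* src, std::size_t n, uint8_t delta) {
    __m128i d = _mm_set1_epi8((char)delta);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i px = _mm_loadu_si128((const __m128i*)(src + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_adds_epu8(px, d));
    }
    for (; i < n; ++i)                       // scalar tail
        dst[i] = (uint8_t)std::min<int>(255, src[i] + delta);
}

int main() {
    uint8_t in[20], out[20];
    for (int i = 0; i < 20; ++i) in[i] = (uint8_t)(i * 13);
    brighten(out, in, 20, 200);
    std::printf("%d %d\n", (int)out[0], (int)out[19]);  // 200 255 (saturated)
}
```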

4.
The single-instruction multiple-data (SIMD) computing capability of modern processors is continually improved to deliver ever better performance and power efficiency. For example, Intel has increased SIMD register lengths from 128 bits in the streaming SIMD extensions (SSE) to 512 bits in AVX-512, and the ARM Scalable Vector Extension supports SIMD register lengths up to 2048 bits and includes predicated instructions. However, SIMD instruction translation in dynamic binary translation has not received similar attention. For example, the widely used QEMU emulates guest SIMD instructions with a sequence of scalar instructions, even when the host machine has equivalent SIMD instructions, leaving significant potential for performance enhancement. We propose a newly designed SIMD translation framework for dynamic binary translation that takes advantage of the host's SIMD capabilities. The framework has been built into HQEMU, an enhanced QEMU with a separate thread for applying LLVM optimizations. The current prototype supports ARMv7, ARMv8, and IA32 guests on an x86-64 AVX2 host. Compared with the scalar-translation version of HQEMU, our framework runs up to 1.84 times faster on the SPEC CPU2006 floating-point (CFP2006) benchmarks and up to 6.81 times faster on selected real applications.

5.
kNN is a classic algorithm widely used in machine learning and data mining programs. As data volumes grow, the running time of kNN rises sharply. To exploit the GPUs and other compute units of modern computers and reduce kNN's computation time, this paper proposes an OpenCL-based parallel kNN algorithm that parallelizes the two bottlenecks, distance computation and sorting: the distance-computation stage uses a fine-grained parallelization strategy and an optimized thread model, and the sorting stage uses a bitonic sort with an optimized memory model. Using the UCI letter dataset as the test set and running kNN on an E8400 (CPU) and a GTS450 (GPU), the GPU-accelerated parallel kNN is 40.79 times faster than the CPU version.
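A sequential C++ sketch of the two bottlenecks the paper parallelizes: distance computation and selection of the k nearest. The function name `knn` is ours; the OpenCL kernels, thread model, and bitonic selection sort are replaced here by a plain loop and std::nth_element.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Point { float x, y; };

// Brute-force kNN. Phase 1 (distances) maps to one GPU work-item per point;
// phase 2 (selection) is what the paper implements as a bitonic sort with an
// optimized memory model. Here both phases are sequential.
std::vector<int> knn(const std::vector<Point>& pts, Point q, int k) {
    std::vector<std::pair<float, int>> d(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {        // phase 1: distances
        float dx = pts[i].x - q.x, dy = pts[i].y - q.y;
        d[i] = {dx * dx + dy * dy, (int)i};
    }
    std::nth_element(d.begin(), d.begin() + k, d.end());  // phase 2: selection
    std::vector<int> idx(k);
    for (int i = 0; i < k; ++i) idx[i] = d[i].second;
    return idx;
}

int main() {
    std::vector<Point> pts = {{0, 0}, {1, 1}, {5, 5}, {2, 2}, {9, 9}};
    // indices of the 3 nearest neighbours (0, 1, 3), in unspecified order
    for (int i : knn(pts, {0.5f, 0.5f}, 3)) std::printf("%d ", i);
}
```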

6.
Software solutions for mutual exclusion developed over a 30-year period, starting with complex ad hoc algorithms and progressing to simpler formal ones. While it is easy to dismiss software solutions for mutual exclusion, as this family of algorithms is antiquated and most platforms support atomic hardware instructions, there is still a need for these algorithms in threaded, embedded systems running on low-cost processors that lack atomic instructions. While N-thread solutions are usually short (10–25 lines of code), each is ingenious, with exceptionally subtle aspects, often making it difficult to prove correctness or construct an implementation. This work examines the correctness and performance of the implementations. An extensive survey of existing algorithms is presented, with explanations of the intuition behind the algorithms and how they work. Several errors were found and corrected, and a few small improvements made, in the existing algorithms; two new high-performance algorithms were also developed. Finally, a worst-case, high-contention performance experiment compares the algorithms and contrasts them with three common locks based on hardware atomic instructions. The results show that our two new algorithms are highly competitive with an equivalent hardware lock (Mellor-Crummey and Scott) over a range of 1–32 processors. Hence, threading is a viable alternative to event-driven programming for complex embedded systems without atomic instructions. Copyright © 2014 John Wiley & Sons, Ltd.
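For readers unfamiliar with this family, here is Peterson's classic two-thread algorithm, the archetype of the software locks surveyed. A caveat: these algorithms assume loads and stores are not reordered, so this sketch uses std::atomic with its default sequentially consistent ordering; plain variables would be incorrect on modern hardware.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Peterson's two-thread mutual exclusion (1981). Each thread declares
// intent, then yields the turn; a thread waits only while the other both
// intends to enter and holds the turn.
std::atomic<bool> intent[2] = {{false}, {false}};
std::atomic<int>  turn{0};
int counter = 0;   // shared state protected by the lock

void lock(int me) {
    int other = 1 - me;
    intent[me].store(true);
    turn.store(other);
    while (intent[other].load() && turn.load() == other) {}   // busy-wait
}
void unlock(int me) { intent[me].store(false); }

void work(int me) {
    for (int i = 0; i < 100000; ++i) { lock(me); ++counter; unlock(me); }
}

int main() {
    std::thread a(work, 0), b(work, 1);
    a.join(); b.join();
    std::printf("%d\n", counter);   // 200000 every run: no lost updates
}
```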

7.
Sorting is a very important task in computer science and becomes a critical operation for programs that make heavy use of sorting algorithms. General-purpose computing on Graphics Processing Units (GPUs) has been successfully used to parallelize some sorting algorithms. Two GPU-based implementations of quicksort have been presented in the literature: GPU-quicksort, an iterative compute-unified-device-architecture (CUDA) implementation, and CDP-quicksort, a recursive implementation based on CUDA dynamic parallelism (CDP) provided by NVIDIA Corporation. We propose CUDA-quicksort, an iterative GPU-based implementation of the sorting algorithm. CUDA-quicksort was designed starting from GPU-quicksort; unlike GPU-quicksort, it uses atomic primitives to perform inter-block communications while ensuring optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA-quicksort is up to four times faster than GPU-quicksort and up to three times faster than CDP-quicksort. An in-depth analysis of the performance difference between CUDA-quicksort and GPU-quicksort shows that the main improvement comes from the optimized GPU memory access rather than from the use of atomic primitives. Moreover, to assess the advantages of using CUDA dynamic parallelism, we implemented a recursive version of CUDA-quicksort. Experimental results show that CUDA-quicksort is faster than the CDP-quicksort provided by NVIDIA, with better performance achieved by the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.
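GPU-quicksort and CUDA-quicksort are iterative because each pass partitions all pending subranges in bulk rather than recursing. The CPU-side C++ sketch below shows the same structure with an explicit stack of subranges; on the GPU, each subrange would be partitioned by a whole thread block, a detail this simplification omits.

```cpp
#include <algorithm>
#include <cstdio>
#include <stack>
#include <utility>
#include <vector>

// Iterative quicksort with an explicit worklist of [lo, hi] subranges,
// mirroring the pass-based structure of the GPU implementations.
void quicksort(std::vector<int>& a) {
    std::stack<std::pair<int, int>> todo;
    todo.push({0, (int)a.size() - 1});
    while (!todo.empty()) {
        auto [lo, hi] = todo.top(); todo.pop();
        if (lo >= hi) continue;
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {                      // Hoare-style partition
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        todo.push({lo, j});                   // pending left half
        todo.push({i, hi});                   // pending right half
    }
}

int main() {
    std::vector<int> v = {9, 4, 7, 1, 8, 2, 2, 6};
    quicksort(v);
    for (int x : v) std::printf("%d ", x);   // 1 2 2 4 6 7 8 9
}
```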

8.
In wireless communication, the Viterbi decoding algorithm (VDA) is one of the most popular channel decoding algorithms and is widely used in WLAN, WiMAX, and 3G communications. However, the throughput of a Viterbi decoder is constrained by the convolutional characteristic of the code. Recently, the three-point VDA (TVDA) was proposed to solve this problem; in TVDA, the whole procedure is divided into three phases: forward, trace-back, and decoding. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA implementations on multi-core CPUs, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs), demonstrating approaches that fully exploit the performance potential of each platform. For CPU platforms, two optimization methods, single-instruction-multiple-data and multithreading, gain over a 145× speedup over the naive CPU version on a quad-core CPU platform. For GPU platforms, the combination of cached-memory optimization, coalesced global memory accesses, a codeword-packing scheme, and asynchronous data transfer achieves a throughput of 404.65 Mbps, a 12× speedup over the initial GPU version on an NVIDIA GeForce GTX580 card and a 7× speedup over an Intel quad-core i5-2300 CPU of the same manufacturing year, both with fully optimized schemes. For FPGA platforms, we customize a radix-4 pipelined architecture for the TVDA in a 45-nm Xilinx FPGA chip (XC6VLX760); at a 209.15-MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we discuss the performance evaluation and efficiency comparison of these flexible architectures for real-time Viterbi decoding in terms of decoding throughput, power consumption, optimization schemes, programming cost, and price. Copyright © 2013 John Wiley & Sons, Ltd.
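A toy C++ version of the forward and trace-back phases for the textbook rate-1/2, K=3 convolutional code (generators 7 and 5 octal). This is the baseline VDA with hard-decision metrics, a sketch of ours rather than the three-point TVDA the paper parallelizes.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>
#include <utility>
#include <vector>

// State = the two most recent input bits. For input u in state s, emit the
// two coded bits and the next state.
static void step(int s, int u, int& out0, int& out1, int& ns) {
    out0 = u ^ (s >> 1) ^ (s & 1);   // generator 111 (octal 7)
    out1 = u ^ (s & 1);              // generator 101 (octal 5)
    ns   = (u << 1) | (s >> 1);
}

// Hard-decision Viterbi: forward phase accumulates Hamming path metrics,
// trace-back phase recovers the minimum-metric input sequence.
std::vector<int> viterbi(const std::vector<std::pair<int, int>>& rx) {
    const int INF = 1 << 20, T = (int)rx.size();
    std::array<int, 4> metric{0, INF, INF, INF};              // start in state 0
    std::vector<std::array<std::pair<int, int>, 4>> pred(T);  // (prev state, bit)
    for (int t = 0; t < T; ++t) {                             // forward phase
        std::array<int, 4> next; next.fill(INF);
        for (int s = 0; s < 4; ++s)
            for (int u = 0; u < 2; ++u) {
                int o0, o1, ns; step(s, u, o0, o1, ns);
                int m = metric[s] + (o0 != rx[t].first) + (o1 != rx[t].second);
                if (m < next[ns]) { next[ns] = m; pred[t][ns] = {s, u}; }
            }
        metric = next;
    }
    int s = (int)(std::min_element(metric.begin(), metric.end()) - metric.begin());
    std::vector<int> bits(T);                                 // trace-back phase
    for (int t = T - 1; t >= 0; --t) { bits[t] = pred[t][s].second; s = pred[t][s].first; }
    return bits;
}

int main() {
    std::vector<int> msg = {1, 0, 1, 1, 0, 0};
    std::vector<std::pair<int, int>> rx;
    int s = 0;
    for (int u : msg) {
        int o0, o1, ns; step(s, u, o0, o1, ns);
        rx.push_back({o0, o1}); s = ns;                       // encode
    }
    rx[2].first ^= 1;                                         // inject one bit error
    for (int b : viterbi(rx)) std::printf("%d", b);           // 101100: corrected
    std::printf("\n");
}
```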

9.
In this paper we present a simple parallel sorting algorithm and illustrate its application in general sorting, disk sorting, and hypercube sorting. The algorithm, called the (l,m)-mergesort (LMM), is an extension of the bitonic and odd–even mergesorts. Literature on parallel sorting is abundant, but many of the algorithms proposed, though theoretically important, may not perform satisfactorily in practice owing to large constants in their time bounds. The algorithm presented in this paper has the potential to be practical. We present an application to the parallel disk sorting problem, for which the algorithm is asymptotically optimal (assuming that N is a polynomial in M, where N is the number of records to be sorted and M is the internal memory size) with a very small underlying constant. This algorithm performs better than the disk-striped mergesort (DSM) algorithm when the number of disks is large, and our implementation is as simple as that of DSM (requiring no fancy data structures or prefetch techniques). As a second application, we prove that we can get a sparse enumeration sort on the hypercube that is simpler than the classical algorithm of Nassimi and Sahni [16]. We also show that Leighton's columnsort algorithm is a special case of LMM.

10.
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, motivating us to port the method onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units and the Intel MIC have a dramatic advantage over CPUs in floating-point performance and power efficiency, they also pose a tough challenge for parallelizing and optimizing computational fluid dynamics codes on large-scale heterogeneous systems. In this paper, we parallelize and optimize the open-source 3D multiphase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe-2 supercomputer using a hybrid and heterogeneous MPI+OpenMP+Offload+single-instruction-multiple-data (SIMD) programming model. With cache blocking and a SIMD-friendly data structure transformation, we dramatically improve SIMD and cache efficiency, raising single-thread performance on the CPU and the Phi by 7.9X and 8.8X, respectively, compared with the baseline code. To make the CPUs and Phi processors collaborate efficiently, we propose a load-balance scheme that distributes workloads among the two intra-node CPUs and three Phi processors, and we use an asynchronous model to overlap the collaborative computation and communication as far as possible. The collaborative approach with two CPUs and three Phi processors improves performance by around 3.2X compared with the CPU-only approach. Scalability tests show that openlbmflow can achieve a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest-scale CPU-MIC collaborative LBM simulation for 3D multiphase flow problems. Copyright © 2015 John Wiley & Sons, Ltd.
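A "SIMD-friendly data structure transformation" usually means switching from array-of-structs (AoS) to struct-of-arrays (SoA). The minimal sketch below (our own, not from the openlbmflow code) shows why: with SoA, like components of consecutive elements are contiguous, so one vector load fills all SIMD lanes, while the AoS layout forces strided access.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct ParticleAoS { double x, y, z; };        // AoS: x values are stride-3

struct ParticlesSoA {                          // SoA: each component contiguous
    std::vector<double> x, y, z;
};

void push_x_soa(ParticlesSoA& p, double dt, const std::vector<double>& vx) {
    for (std::size_t i = 0; i < p.x.size(); ++i)   // unit stride: vectorizable
        p.x[i] += dt * vx[i];
}

void push_x_aos(std::vector<ParticleAoS>& p, double dt, const std::vector<double>& vx) {
    for (std::size_t i = 0; i < p.size(); ++i)     // strided access to x
        p[i].x += dt * vx[i];
}

int main() {
    ParticlesSoA p{{0, 1}, {0, 0}, {0, 0}};
    std::vector<double> vx = {2, 3};
    push_x_soa(p, 0.5, vx);
    std::printf("%g %g\n", p.x[0], p.x[1]);   // 1 2.5
}
```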

11.
A VLIW DSP obtains temporal parallelism through software pipelining and spatial parallelism through instruction clustering. Instruction clustering is essentially a resource-allocation problem. Traditional instruction clustering assumes each instruction is assigned to a single cluster for execution, but some architectures provide SIMD instructions, and traditional clustering algorithms are not fully applicable to them. The evaluation-model-based clustering algorithm proposed here clusters both SIMD instructions and ordinary instructions sensibly. After clustering, inter-cluster transfer instructions are scheduled and, where appropriate, merged into inter-cluster double-word transfer instructions. Thanks to the introduction of SIMD and inter-cluster double-word transfers, together with better clustering decisions, the overall scheduling latency of the program is reduced. For many digital signal processing programs, performance improves 2–3 times over the unclustered case and by about 7–10% over a register-pressure-based clustering algorithm.

12.
The complexity of modern processors poses increasingly difficult challenges to software optimization. Modern optimizing compilers have become essential tools for leveraging the power of recent processors, applying high-level optimizations to exploit multi-core platforms and single-instruction-multiple-data (SIMD) instructions, as well as advanced code generation to deal with microarchitectural performance aspects. Using the Intel® Core™ 2 Duo processor and the Intel Fortran/C++ compiler as a case study, this paper gives a detailed account of the sort of optimizations required to obtain high performance on modern processors.

13.
In recent years, many researchers have investigated optical interconnections for parallel computing. Optical interconnections are attractive owing to their high bandwidth and support for concurrent access to the bus in a pipelined fashion. The Linear Array with Reconfigurable Pipelined Bus System (LARPBS) model is a powerful optical bus system that combines the advantages of optical buses and reconfiguration. To increase the scalability of the LARPBS model, we propose a two-dimensional extension: a simplified two-dimensional Array with Reconfigurable Pipelined Bus System (2D ARPBS). While achieving better scalability, we show the effectiveness of the newly proposed model by designing two novel optimal sorting algorithms on it. The first is an extension of Leighton's seven-phase columnsort algorithm that eliminates the restriction of sorting only an r × s array with r ≥ s², and sorts an n × n array in O(log n) time. The second is an optimal multiway mergesort algorithm that uses a novel processor-efficient two-way mergesort algorithm and a novel multiway merge scheme to sort n² items in O(log n) time. Using Pipelined Mergesort, an optimal sorting algorithm designed for the LARPBS model, as a building block, we extend our research on parallel sorting on the LARPBS to the more scalable 2D ARPBS model and achieve optimality in both sorting algorithms.

14.
A Fast Background Extraction Method Based on Pentium SIMD Instructions
This paper proposes a fast background-extraction method based on Intel Pentium SIMD instructions. In an improved Gaussian-mixture background model, the computation of the Jeffrey divergence and the update of the background model exhibit high inherent SIMD parallelism; by organizing the data according to SSE data types, a SIMD implementation of the Gaussian-mixture background model is obtained. Experimental results show that the version using Pentium SIMD instructions improves performance by about 75% over the conventional computation, accelerating background extraction to real-time speeds and giving the method considerable practical value.
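A hedged sketch of the style of SSE byte arithmetic such a method relies on: a saturating-subtraction absolute difference plus a threshold, producing a foreground mask 16 pixels at a time. The mixture-model update and Jeffrey-divergence computation from the paper are omitted, the function name is ours, and n is assumed to be a multiple of 16 for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <emmintrin.h>   // SSE2

// Per-pixel |frame - background| >= thresh for 16 bytes per iteration.
// SSE has no unsigned-byte absolute difference, so we use the standard
// trick |a-b| = subs(a,b) | subs(b,a) with saturating subtraction.
void foreground_mask(uint8_t* mask, const uint8_t* frame,
                     const uint8_t* bg, std::size_t n, uint8_t thresh) {
    __m128i t = _mm_set1_epi8((char)thresh);
    for (std::size_t i = 0; i + 16 <= n; i += 16) {
        __m128i f = _mm_loadu_si128((const __m128i*)(frame + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(bg + i));
        __m128i d = _mm_or_si128(_mm_subs_epu8(f, b), _mm_subs_epu8(b, f));
        // bytes with |f-b| >= thresh become 0xFF: max(d,t)==d iff d >= t
        __m128i m = _mm_cmpeq_epi8(_mm_max_epu8(d, t), d);
        _mm_storeu_si128((__m128i*)(mask + i), m);
    }
}

int main() {
    uint8_t f[16], b[16], m[16];
    for (int i = 0; i < 16; ++i) { f[i] = (uint8_t)(i * 16); b[i] = 100; }
    foreground_mask(m, f, b, 16, 30);
    for (int i = 0; i < 16; ++i) std::printf("%c", m[i] ? 'F' : '.');
    std::printf("\n");   // FFFFF....FFFFFFF
}
```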

15.
An Ultra-Fast Sorting Algorithm
Quicksort has a simple structure and good average performance, while radix sort has stable performance. Combining quicksort and radix sort, an ultra-fast sorting algorithm is proposed; theoretical analysis and experiments show that the new algorithm outperforms both quicksort and radix sort.
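The abstract does not specify how the two sorts are combined. One plausible hybrid, shown below strictly as a sketch of the combination idea and not as the paper's scheme, is a single most-significant-byte radix pass followed by quicksort within each bucket (std::sort, which is introsort-based, stands in for quicksort).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// One radix pass distributes keys into 256 buckets by their most
// significant byte (comparison-free and order-preserving across buckets);
// quicksort then finishes each small bucket.
void hybrid_sort(std::vector<uint32_t>& a) {
    std::vector<std::vector<uint32_t>> bucket(256);
    for (uint32_t v : a) bucket[v >> 24].push_back(v);
    std::size_t out = 0;
    for (auto& b : bucket) {
        std::sort(b.begin(), b.end());          // introsort ~ quicksort here
        for (uint32_t v : b) a[out++] = v;
    }
}

int main() {
    std::vector<uint32_t> v = {0xdeadbeef, 42, 7, 0x10000000, 42};
    hybrid_sort(v);
    for (uint32_t x : v) std::printf("%u ", (unsigned)x);
}
```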

16.
Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single‐instruction, multiple data (SIMD) instructions available in common processors to boost the speed of integer compression schemes. Our S4‐BP128‐D4 scheme uses as little as 0.7 CPU cycles per decoded 32‐bit integer while still providing state‐of‐the‐art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decompression speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD GALLOPING algorithm. We exploit the fact that one SIMD instruction can compare four pairs of 32‐bit integers at once. We experiment with two Text REtrieval Conference (TREC) text collections, GOV2 and ClueWeb09 (category B), using logs from the TREC million‐query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state‐of‐the‐art approach. Copyright © 2015 John Wiley & Sons, Ltd.
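A scalar C++ skeleton of galloping intersection, the algorithm SIMD GALLOPING vectorizes: for each element of the shorter list, probe the longer list at exponentially growing distances, then binary-search the bracketed range. The function name is ours, and the SIMD step that compares four 32-bit pairs at once is deliberately left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Intersect sorted lists a (short) and b (long) by galloping through b.
std::vector<int> intersect(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t lo = 0;                       // resume point in b
    for (int v : a) {
        std::size_t step = 1, hi = lo;
        while (hi < b.size() && b[hi] < v) {  // gallop: 1, 2, 4, ... ahead
            lo = hi; hi += step; step *= 2;
        }
        auto it = std::lower_bound(b.begin() + lo,
                                   b.begin() + std::min(hi + 1, b.size()), v);
        lo = (std::size_t)(it - b.begin());
        if (lo < b.size() && b[lo] == v) out.push_back(v);
    }
    return out;
}

int main() {
    std::vector<int> a = {3, 9, 27}, b = {1, 2, 3, 4, 5, 8, 9, 10, 50};
    for (int v : intersect(a, b)) std::printf("%d ", v);   // 3 9
}
```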

17.
This paper presents count-sort, a parallel algorithm for mesh-connected computers to sort integers when the range of the inputs is known. A straightforward counting technique that has not previously been implemented in parallel sorting algorithms is presented. On a mesh-connected computer with N×N processors we are able to sort N² integers in the range 1…N in time cN, where the constant c is very small. For practical values of N, the algorithm is extremely fast. Further, it is possible to expand the range by a factor k, to 1…kN, so that the slowdown is less than k.

We produce two implementations of count-sort on the SIMD MasPar MP-1 with 8192 processors. The first sorts 8-bit integers, one per processor, significantly faster than the manufacturer's current library routine for sorting 8-bit integers. The second implementation is a fast version that sorts several elements per processor.
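The counting technique itself is elementary; here is a sequential C++ version for keys in a known range 1..maxval. The counting and emission steps are what the mesh algorithm distributes across the N×N processors.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Counting sort: tally each key, then emit keys in order. O(n + maxval)
// time, no comparisons -- fast exactly when the key range is known and small.
void count_sort(std::vector<int>& a, int maxval) {
    std::vector<int> count(maxval + 1, 0);
    for (int v : a) ++count[v];               // counting pass
    std::size_t out = 0;
    for (int v = 1; v <= maxval; ++v)         // emission pass
        while (count[v]--) a[out++] = v;
}

int main() {
    std::vector<int> v = {4, 1, 3, 4, 2, 1};
    count_sort(v, 4);
    for (int x : v) std::printf("%d ", x);   // 1 1 2 3 4 4
}
```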

18.
The development of intelligent transportation systems (ITS) and the resulting need to solve a variety of dynamic traffic network models and management problems require faster-than-real-time computation of shortest path problems in dynamic networks. Recently, a sequential algorithm was developed to compute shortest paths in discrete-time dynamic networks from all nodes and all departure times to one destination node. The algorithm, known as algorithm DOT, has an optimal worst-case running-time complexity, which implies that no algorithm with a better worst-case computational complexity can be discovered. Consequently, to derive faster algorithms for all-to-one shortest path problems in dynamic networks, one must explore avenues other than the design of sequential algorithms alone; using commercially available high-performance computing platforms to develop parallel implementations of sequential algorithms is one such avenue. This paper reports on the design, implementation, and computational testing of parallel dynamic shortest path algorithms. We develop two shared-memory and two message-passing dynamic shortest path algorithm implementations, derived from algorithm DOT using two parallelization strategies: decomposition by destination and decomposition by transportation network topology. The algorithms are coded using two types of parallel computing environments: a message-passing environment based on the parallel virtual machine (PVM) library and a multi-threading environment based on the SUN Microsystems Multi-Threads (MT) library. We also develop a time-based parallel version of algorithm DOT for the case of minimum time paths in FIFO networks, and a theoretical parallelization of algorithm DOT on an 'ideal' theoretical parallel machine. The performance of the implementations is analyzed and evaluated using large transportation networks and two types of parallel computing platforms: a distributed network of Unix workstations and a SUN shared-memory machine containing eight processors. Satisfactory speed-ups in the running time of the sequential algorithms are achieved, in particular on shared-memory machines. Numerical results indicate that shared-memory computers constitute the most appropriate type of parallel computing platform for the computation of dynamic shortest paths for real-time ITS applications.
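A hedged sketch of the decreasing-order-of-time recursion that underlies an all-departure-times-to-one-destination computation like algorithm DOT's: because travel times are at least one time step, labels at time t depend only on labels at later times, so one backward sweep suffices. The names are ours, and the boundary handling (trips arriving past the horizon count as unreachable) is a simplification, not the DOT specification.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Time-dependent arc: travel[t] (>= 1) is the travel time when entering at t.
struct Arc { int from, to; std::vector<int> travel; };

const int INF = 1 << 28;

// cost[t][i] = travel time from node i, departing at time t, to node `dest`.
std::vector<std::vector<int>> all_departures_to_dest(
        int n, int T, int dest, const std::vector<Arc>& arcs) {
    std::vector<std::vector<int>> cost(T, std::vector<int>(n, INF));
    for (int t = T - 1; t >= 0; --t) {        // decreasing order of time
        cost[t][dest] = 0;
        for (const Arc& a : arcs) {
            int arr = t + a.travel[t];        // arrival time, always > t
            if (arr < T && cost[arr][a.to] < INF)
                cost[t][a.from] = std::min(cost[t][a.from],
                                           a.travel[t] + cost[arr][a.to]);
        }
    }
    return cost;
}

int main() {
    // Nodes 0 -> 1 -> 2 (destination); arc 0->1 is slow at t=0, fast later.
    std::vector<Arc> arcs = {{0, 1, {3, 1, 1, 1, 1, 1, 1}},
                             {1, 2, {2, 2, 2, 2, 2, 2, 2}}};
    auto cost = all_departures_to_dest(3, 7, 2, arcs);
    std::printf("depart t=0: %d, depart t=1: %d\n",
                cost[0][0], cost[1][0]);      // 5, 3: departing later is faster
}
```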

19.
We introduce a parallel kd-tree construction method for 3-dimensional points on a GPU which employs a sorting algorithm that maintains high parallelism throughout construction. Typically, large arrays in the upper levels of a kd-tree do not yield high performance when computing each node in one thread. Conversely, small arrays in the lower levels of the tree do not benefit from typical parallel sorts. To address these issues, the proposed sorting approach uses a modified parallel sort on the upper levels before switching to basic parallelization on the lower levels. Our work focuses on 3D point registration and our results indicate that a speed gain by a factor of 100 can be achieved in comparison to a naive parallel algorithm for a typical scene.
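A sequential C++ sketch of median-split kd-tree construction over 3-D points. The nth_element partition at each node is the per-node work the paper maps to the GPU: few large partitions at the upper levels (handled with a parallel sort) and many small independent nodes at the lower levels (one thread each).

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Pt { float c[3]; };

// Build an implicit kd-tree in place over pts[lo, hi): the median along the
// cycling split axis lands at the middle index, and the halves recurse.
void build(std::vector<Pt>& pts, int lo, int hi, int depth) {
    if (hi - lo <= 1) return;
    int axis = depth % 3, mid = (lo + hi) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Pt& a, const Pt& b) { return a.c[axis] < b.c[axis]; });
    build(pts, lo, mid, depth + 1);       // points left of the splitting plane
    build(pts, mid + 1, hi, depth + 1);   // points right of it
}

int main() {
    std::vector<Pt> pts = {{{5, 0, 0}}, {{1, 2, 3}}, {{9, 9, 9}}, {{4, 4, 4}}, {{2, 7, 1}}};
    build(pts, 0, (int)pts.size(), 0);
    for (const Pt& p : pts) std::printf("(%g,%g,%g) ", p.c[0], p.c[1], p.c[2]);
}
```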

20.
In sequential systems, hash tables yield almost constant-time performance for single-element accesses. However, in massively parallel systems, we need to consider a large number of parallel accesses: the potential queueing delay as well as the communication overhead can alter the relative performance of hash algorithms. Thus, it is necessary to reevaluate the performance of conventional hash algorithms and investigate new algorithms that exploit the parallelism without suffering from excessive communication overheads or queueing delays. In this paper, we first study the performance of data-parallel hash algorithms with conventional collision-resolution strategies. For SIMD/SPMD hypercube systems, neither linear probing nor double hashing yields satisfactory performance. Thus, we develop a new collision-resolution strategy, hypercube hashing, which combines the randomness provided by double hashing with the low communication cost inherited from linear probing to yield better performance. We also investigate efficient implementation of the chaining algorithm in data-parallel systems and its performance. From the simulation results, hypercube hashing significantly outperforms the other open-addressing strategies in all cases (under the assumption of a random input key space). For high load factors, chaining performs better than hypercube hashing; with a low load factor, however, hypercube hashing significantly outperforms the chaining algorithm.
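A minimal C++ open-addressing table with double hashing, one of the conventional collision-resolution strategies the paper reevaluates: the probe stride depends on the key, so keys colliding on the first slot follow different probe sequences. Hypercube hashing itself, which keeps this randomness while bounding how far probes travel across the network, is not reproduced here; the struct and its members are illustrative names of ours.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Open addressing with double hashing over nonnegative keys; -1 marks empty.
struct Table {
    std::vector<long> slot;
    explicit Table(std::size_t n) : slot(n, -1) {}
    std::size_t h1(long k) const { return (std::size_t)k % slot.size(); }
    std::size_t h2(long k) const { return 1 + (std::size_t)k % (slot.size() - 1); }
    bool insert(long k) {
        std::size_t pos = h1(k), step = h2(k);   // key-dependent stride, never 0
        for (std::size_t i = 0; i < slot.size(); ++i) {
            if (slot[pos] == -1 || slot[pos] == k) { slot[pos] = k; return true; }
            pos = (pos + step) % slot.size();
        }
        return false;                            // table full
    }
};

int main() {
    Table t(11);                                 // prime size: every slot reachable
    for (long k : {7L, 18L, 29L, 40L}) t.insert(k);   // all collide under h1
    for (std::size_t i = 0; i < t.slot.size(); ++i)
        if (t.slot[i] != -1) std::printf("slot %zu: %ld\n", i, t.slot[i]);
}
```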
