Similar Literature
20 similar documents found (search time: 31 ms).
1.
In this work, we analyze the behavior of several parallel algorithms developed to compute the two-dimensional discrete wavelet transform, using both OpenMP on a multicore platform and CUDA on a GPU. The proposed parallel algorithms are based on both regular filter-bank convolution and the lifting transform, with small implementation changes aimed at reducing memory requirements and complexity. We compare our implementations against sequential CPU algorithms and other recently proposed algorithms, such as the SMDWT algorithm on different CPUs and the Wippig & Klauer algorithm on a GTX280 GPU. Finally, we analyze their behavior when the algorithms are adapted to each architecture. Significant execution-time improvements are achieved on both multicore platforms and GPUs. Depending on the multicore platform used, we achieve speed-ups of 1.9 and 3.4 using two and four processes, respectively, compared to the sequential CPU algorithm, or speed-ups of 7.1 and 8.9 using eight and ten processes. Regarding GPUs, the GPU convolution algorithm using the GPU shared memory obtains speed-ups of up to 20 compared to the sequential CPU algorithm.
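For readers unfamiliar with the lifting scheme, here is a minimal sketch of one Haar-style predict/update level on a 1-D signal; the function name and split are ours, not the paper's, and the paper's algorithms are parallel 2-D variants of this idea (a 2-D transform applies such a level along rows, then columns).

```python
# Minimal lifting sketch (Haar-style predict/update); the paper's parallel
# 2-D algorithms are variants of this idea, applied along rows then columns.
import numpy as np

def lifting_haar_1d(x):
    """One lifting level on an even-length 1-D signal."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even            # predict: estimate odd samples from even ones
    approx = even + detail / 2.0   # update: preserve the running average
    return approx, detail

approx, detail = lifting_haar_1d(np.arange(8))
print(approx, detail)              # [0.5 2.5 4.5 6.5] [1. 1. 1. 1.]
```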

2.
The 0–1 knapsack problem has been extensively studied in the past years due to its immediate applications in industry and financial management, such as cargo loading, stock cutting, and budget control. Many algorithms have been proposed to solve this problem, most of which are heuristic, as the problem is well known to be NP-hard. Only a few optimal algorithms have been designed to solve it, and those have high time complexity. This paper proposes a cost-optimal parallel algorithm (COPA) on an EREW PRAM model with shared memory to solve this problem. COPA is scalable and yields optimal solutions in less computational time. Furthermore, this paper implements COPA in two scenarios: multicore CPU architectures using OpenMP and GPU configurations using CUDA. A series of experiments is conducted to examine the performance of COPA on the two test platforms. The experimental results show that COPA reduces execution time significantly: our approach achieves speedups of up to 10.26 on multicore CPU implementations and 17.53 on GPU implementations when the sequential dynamic programming algorithm for KP01 is used as a baseline. Notably, the GPU implementations stand out in the experimental results.
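For reference, a minimal sketch of the generic sequential dynamic-programming baseline for KP01 that such speedups are measured against (this is the textbook algorithm, not COPA itself):

```python
def knapsack01(weights, values, capacity):
    """Sequential DP baseline for KP01: O(n * capacity) time, O(capacity) space."""
    dp = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # The reverse scan preserves 0-1 semantics (each item used at most once).
        # Given the previous item's row, all capacity cells are independent,
        # which is the dimension a PRAM/GPU version can sweep in parallel
        # (using two buffers instead of this in-place update).
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

assert knapsack01([2, 3, 4], [3, 4, 5], capacity=5) == 7  # take items 1 and 2
```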

3.
The two-stage hybrid flow shop problem under setup times is addressed in this paper. This problem is NP-hard; moreover, it models various real-life applications, especially in manufacturing and high-performance computing. Tackling this kind of problem requires the development of adapted algorithms. In this context, a metaheuristic based on the genetic algorithm and three heuristics are proposed in this paper. These approximate methods use the optimal solution of the parallel machines problem under release and delivery times. Indeed, they are iterative procedures that focus at each step on a particular stage, where a parallel machines problem has to be solved; the overall solution is then the concatenation of the solutions for each stage. In addition, three lower bounds based on the relaxation method are provided. These lower bounds offer a means to evaluate the efficiency of the developed algorithms through the measurement of the relative gap. Experimental results are discussed to evaluate the performance of the developed algorithms. In total, 8960 instances are implemented and tested to show the results given by the proposed lower bounds and heuristics. Several indicators are given to compare the algorithms. The results illustrated in this paper show the performance of the developed algorithms in terms of gap and running time.
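To make the per-stage subproblem concrete, here is a hedged sketch of list scheduling on parallel machines under release and delivery times. The paper calls an optimal subroutine for this subproblem; this greedy stand-in (function name and priority rule are ours) only illustrates the objective, the maximum completion-plus-delivery time.

```python
import heapq

def schedule_parallel_machines(jobs, m):
    """Greedy list-scheduling sketch for the per-stage subproblem: jobs are
    (release, processing, delivery) triples; returns max(completion + delivery).
    A hypothetical stand-in for the optimal subroutine the heuristics rely on."""
    jobs = sorted(jobs, key=lambda j: (j[0], -j[2]))  # by release, then delivery
    machines = [0.0] * m                # next free time of each machine
    heapq.heapify(machines)
    objective = 0.0
    for release, proc, delivery in jobs:
        free = heapq.heappop(machines)  # earliest-available machine
        finish = max(free, release) + proc
        heapq.heappush(machines, finish)
        objective = max(objective, finish + delivery)
    return objective

print(schedule_parallel_machines([(0, 3, 2), (1, 2, 5), (0, 4, 1)], m=2))  # 10.0
```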

4.
The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared-memory multicore architectures. Current numerical libraries (e.g., LAPACK) show clear limitations on such emerging systems, mainly because of their coarse-grained tasks. Thus, many numerical algorithms need to be redesigned to better fit the architectural design of the multicore platform. The Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) library, developed at the University of Tennessee, tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can be represented by directed acyclic graphs, where nodes are the tasks and edges are the dependencies between tasks. The paramount key to achieving high performance is a runtime environment that efficiently schedules the execution of the directed acyclic graph across the multicore platform. This paper studies the impact on overall performance of several parameters, both at the level of the scheduler (e.g., window size and locality) and of the algorithms (e.g., left-looking and right-looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.
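As a hedged illustration of the DAG firing rule such tile schedulers implement, the toy executor below releases a task once all of its inputs are done. Real runtimes add a sliding task window, locality-aware queues, and worker threads; the kernel names below are illustrative Cholesky-style placeholders, not PLASMA's API.

```python
from collections import defaultdict, deque

def run_dag(tasks, deps):
    """Toy dependency-driven executor: fire a task when its in-degree hits zero."""
    indegree = {name: len(parents) for name, parents in deps.items()}
    children = defaultdict(list)
    for name, parents in deps.items():
        for parent in parents:
            children[parent].append(name)
    ready = deque(name for name, d in indegree.items() if d == 0)
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()                       # a real scheduler dispatches to a core
        order.append(name)
        for child in children[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order

kernels = {k: (lambda: None) for k in ("potrf", "trsm", "syrk")}
print(run_dag(kernels, {"potrf": [], "trsm": ["potrf"], "syrk": ["trsm"]}))
```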

5.
Multiple Sequence Alignment (MSA) is an important problem in bioinformatics that aims to align more than two sequences in order to emphasize regions of similarity. The problem is known to be NP-hard, so heuristic methods are used to solve it. DIALIGN-TX is an iterative heuristic method for MSA that generates alignments by concatenating ungapped regions of high similarity. Usually, the first phase of MSA algorithms is parallelized by distributing several independent tasks among the nodes. Even though heterogeneous multicore clusters are becoming very common nowadays, very few task allocation policies have been proposed for this type of architecture. This paper proposes an MPI/OpenMP master/slave parallel strategy to run DIALIGN-TX on heterogeneous multicore clusters, with several allocation policies. We show that an appropriate choice of the master node has a great impact on overall system performance. Moreover, the results obtained on a heterogeneous multicore cluster composed of 4 nodes (30 cores), using real sequence sets, show that the execution time can be drastically reduced when the appropriate allocation policy is used.
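As one deliberately simple example of an allocation policy for such a heterogeneous cluster, the sketch below splits the independent first-phase tasks proportionally to each node's core count; the node names and core counts are illustrative, not the paper's testbed or its actual policies.

```python
def allocate_tasks(num_tasks, nodes):
    """Static proportional allocation sketch: nodes maps node name -> core count."""
    total = sum(nodes.values())
    shares = {n: (num_tasks * cores) // total for n, cores in nodes.items()}
    leftover = num_tasks - sum(shares.values())
    for n in sorted(nodes, key=nodes.get, reverse=True)[:leftover]:
        shares[n] += 1                  # hand the rounding remainder to big nodes
    return shares

print(allocate_tasks(100, {"node0": 8, "node1": 8, "node2": 6, "node3": 8}))
```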

6.
Multicore processors are widely used in today's computer systems, and multicore virtualization technology provides an elastic solution for utilizing multicore systems more efficiently. However, the Lock Holder Preemption (LHP) problem in virtualized multicore systems wastes a significant number of CPU cycles, which hurts virtual machine (VM) performance and increases response latency; the more VMs the system consolidates, the worse the LHP problem becomes. In this paper, we propose an efficient consolidation-aware vCPU (CVS) scheduling scheme for multicore virtualization platforms. Based on the vCPU over-commitment rate, the CVS scheme adaptively selects one of three vCPU scheduling algorithms (co-scheduling, yield-to-head, or yield-to-tail), since the actions of vCPU scheduling decompose into single steps such as scheduling vCPUs simultaneously or inserting one vCPU into the run-queue at the head or the tail. The CVS scheme can effectively improve VM performance in low, middle, and high VM consolidation scenarios. Using real-life parallel benchmarks, our experimental results show that the proposed CVS scheme improves overall system performance while the optimization overhead remains low.
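A hedged sketch of the consolidation-aware selection idea follows: switch the vCPU scheduling policy by over-commitment rate. The thresholds and function name are illustrative assumptions, not values from the paper.

```python
def pick_vcpu_policy(vcpus, pcpus):
    """Select a vCPU scheduling policy from the over-commitment rate.
    Thresholds here are illustrative, not the paper's tuned values."""
    rate = vcpus / pcpus
    if rate <= 1.0:
        return "co-scheduling"   # low consolidation: schedule sibling vCPUs together
    elif rate <= 2.0:
        return "yield-to-head"   # medium: reinsert a yielding vCPU at the queue head
    return "yield-to-tail"       # high: reinsert a yielding vCPU at the queue tail

print(pick_vcpu_policy(vcpus=12, pcpus=8))  # -> "yield-to-head"
```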

7.
Dynamic programming techniques are well established and employed by various practical algorithms, including the edit-distance and dynamic time warping algorithms. These algorithms usually operate in an iteration-based manner where new values are computed from values of the previous iteration. The data dependencies enforce synchronization, which limits the possibilities for internal parallel processing. In this paper, we investigate parallel approaches to processing matrix-based dynamic programming algorithms on modern multicore CPUs, Intel Xeon Phi accelerators, and general-purpose GPUs. We address both the problem of computing a single distance on large inputs and the problem of computing a number of distances on smaller inputs simultaneously (e.g., when a similarity query is being resolved). Our proposed solutions yielded significant performance improvements, achieving speedups of up to two orders of magnitude over the serial baseline.
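A hedged sketch of the standard anti-diagonal (wavefront) reformulation that makes such DP matrices parallelizable: every cell on one diagonal depends only on earlier diagonals, so a GPU or Xeon Phi version can assign a whole diagonal to threads. This serial version is the generic technique, not the paper's implementation.

```python
def edit_distance_wavefront(a, b):
    """Edit distance computed diagonal by diagonal (i + j == d); cells within
    one diagonal are mutually independent, which is the parallel dimension."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(n + m + 1):
        for i in range(max(0, d - m), min(n, d) + 1):   # parallelizable loop
            j = d - i
            if i == 0 or j == 0:
                dist[i][j] = i + j
            else:
                dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                                 dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dist[n][m]

assert edit_distance_wavefront("kitten", "sitting") == 3
```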

8.
This paper deals with wind speed prediction in wind farms, using spatial information from remote measurement stations. Owing to the temporal complexity of the problem, we employ local recurrent neural networks with internal dynamics as advanced forecast models. To improve prediction performance, the training task is accomplished using on-line learning algorithms based on the recursive prediction error (RPE) approach. A global RPE (GRPE) learning scheme is first developed, in which all adjustable weights are updated simultaneously. Then, through weight grouping, we devise a simplified method, the decoupled RPE (DRPE), with reduced computational demands. The partial derivatives required by the learning algorithms are derived using the adjoint model approach, adapted to the architecture of the networks being used. The efficiency of the proposed approach is tested on a real-world wind farm problem, where multi-step-ahead wind speed estimates from 15 min to 3 h are sought. Extensive simulation results demonstrate that our models exhibit superior performance compared to other network types suggested in the literature. Furthermore, the suggested learning algorithms are shown to outperform three gradient descent algorithms in training the recurrent forecast models.

9.
Xu Shuhui, Wang Yong, Lu Peichuan. Neural Computing & Applications, 2017, 28(7): 1667–1682.

The imperialist competitive algorithm (ICA) is a relatively new metaheuristic with good performance. However, it often suffers from premature convergence and falls into local optima when employed to solve complex problems. To further enhance its performance, this article proposes an improved approach that uses mutation operators to change the behavior of the imperialists. The improved approach is simple in structure and very easy to implement. Three different mutation operators, Gaussian, Cauchy, and Lévy, are investigated experimentally. The results suggest that all three improved algorithms have a faster convergence rate, better global search ability, and better stability than the original algorithm. Furthermore, the three improved algorithms are compared with two other high-performing algorithms on benchmark functions, and with four existing algorithms on one real-world optimization problem. The comparisons suggest that the proposed algorithms have their own strengths and good applicability, obtaining better results on some functions than the contrasted approaches.
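A hedged sketch of the three mutation operators applied to an imperialist's position vector; the step scales are illustrative assumptions, and the Lévy variant uses Mantegna's algorithm, a common way to sample Lévy-stable steps (the paper may parameterize it differently).

```python
import numpy as np
from math import gamma, pi, sin

def mutate(position, kind="gaussian", scale=0.1, rng=np.random.default_rng()):
    """Perturb a position vector; heavier tails (Cauchy, Levy) give longer
    escape jumps out of local optima. Scales are illustrative, not the paper's."""
    if kind == "gaussian":
        step = rng.normal(0.0, scale, size=position.shape)
    elif kind == "cauchy":
        step = scale * rng.standard_cauchy(size=position.shape)
    elif kind == "levy":  # Mantegna's algorithm with beta = 1.5
        beta = 1.5
        sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
                 (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
        u = rng.normal(0.0, sigma, size=position.shape)
        v = rng.normal(0.0, 1.0, size=position.shape)
        step = scale * u / np.abs(v) ** (1 / beta)
    else:
        raise ValueError(f"unknown mutation kind: {kind}")
    return position + step

for kind in ("gaussian", "cauchy", "levy"):
    print(kind, mutate(np.zeros(3), kind=kind))
```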


10.
We explore the comparative performance of the Cray XMT and XMT-2 massively multithreaded supercomputers. We use benchmarks to evaluate memory accesses for various types of loops, and we also compare the performance of these machines on matrix multiply and on three previously implemented dynamic programming algorithms. It is shown that the relative performance of these machines depends on the size (number of processors) of the configuration, as well as on the size of the problem being evaluated. In particular, small configurations of the original XMT can sometimes show slightly better performance than larger configurations of the XMT-2 for the same problem size. We note that, under heavy memory load, the performance of loops can saturate well before the maximum number of processors available, which suggests that it may not always be useful to use the maximum number of processors for a specific run. We also show that manual restructuring of nested loops, including decreasing the parallelism, can result in major performance improvements. The results in this paper indicate that careful exploration of the space of problem sizes, numbers of processors used, and choices of loop parallelization can yield substantial improvements in performance; these improvements can be very significant for production codes that run for extended periods of time. Copyright © 2012 John Wiley & Sons, Ltd.

11.
The beetle antennae search (BAS) algorithm is a single-solution metaheuristic optimizer that imitates the foraging behavior of beetles. Owing to its easy implementation and fast convergence, it has been applied in various engineering fields. To enhance the local search ability and avoid the blind position updates performed in BAS, an improved beetle antennae search algorithm with Lévy flight (IBAS) is proposed in this paper. Additionally, to further enhance algorithm performance, the features of inertia, normalization, and elitist selection are adopted in the proposed hybrid algorithm. In a comparison with other algorithms on 15 well-known benchmark functions and 4 classic engineering problems, IBAS maintains good performance while trading off efficacy and efficiency. The detailed results suggest that the improvements adopted in IBAS alleviate the shortcomings of BAS. In addition, a real-world engineering problem, laser energy prediction in micro-laser-assisted turning, is used to further validate the potential of the proposed algorithm for industrial applications. The results indicate the feasibility of the new solution based on the IBAS algorithm. Two further applications of IBAS, in the reconstruction of the temperature field and in online laser energy detection, are also discussed.
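For context, a hedged sketch of one vanilla BAS move, the baseline IBAS improves on: probe the objective at both antennae and step toward the better side. IBAS replaces the deterministic step with a Lévy-flight step length and adds inertia, normalization, and elitism; the step-decay schedule below is our own illustrative choice.

```python
import numpy as np

def bas_step(x, f, step, d, rng):
    """One baseline BAS move on a minimization objective f."""
    b = rng.normal(size=x.shape)
    b /= np.linalg.norm(b)                           # random antenna direction
    f_plus, f_minus = f(x + d * b), f(x - d * b)     # sense at both antennae
    return x - step * b * np.sign(f_plus - f_minus)  # move toward the lower value

sphere = lambda v: float(np.sum(v ** 2))
rng = np.random.default_rng(1)
x, step = np.array([3.0, -2.0]), 1.0
for _ in range(100):
    x = bas_step(x, sphere, step, d=2 * step, rng=rng)
    step *= 0.95                                     # shrink the search radius
print(x, sphere(x))
```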

12.
Energy efficient scheduling of parallel tasks on multiprocessor computers
In this paper, scheduling parallel tasks on multiprocessor computers with dynamically variable voltage and speed is addressed as a combinatorial optimization problem. Two problems are defined, namely, minimizing schedule length under an energy consumption constraint and minimizing energy consumption under a schedule length constraint. The first problem has applications in general multiprocessor and multicore computing systems where energy consumption is an important concern, and in mobile computers where energy conservation is a main concern. The second problem has applications in real-time multiprocessing systems and environments where a timing constraint is a major requirement. Our scheduling problems are defined such that the energy-delay product is optimized by fixing one factor and minimizing the other. We note that power-aware scheduling of parallel tasks has rarely been discussed before, and our investigation makes an initial attempt at energy-efficient scheduling of parallel tasks on multiprocessor computers with dynamic voltage and speed. Our scheduling problems contain three nontrivial subproblems, namely, system partitioning, task scheduling, and power supplying. Each subproblem should be solved efficiently so that heuristic algorithms with good overall performance can be developed; this decomposition of the optimization problems into three subproblems makes the design and analysis of heuristic algorithms tractable. A unique feature of our work is to compare the performance of our algorithms with optimal solutions analytically and to validate our results experimentally, rather than comparing heuristic algorithms among themselves only experimentally. The harmonic system partitioning and processor allocation scheme is used, which divides a multiprocessor computer into clusters of equal sizes and schedules tasks of similar sizes together to increase processor utilization. A three-level energy/time/power allocation scheme is adopted for a given schedule, such that the schedule length is minimized by consuming a given amount of energy, or the energy consumed is minimized without missing a given deadline. The performance of our heuristic algorithms is analyzed, and accurate performance bounds are derived. Simulation data that validate our analytical results are also presented. It is found that our analytical results provide very accurate estimates of the expected normalized schedule length and the expected normalized energy consumption, and that our heuristic algorithms produce solutions very close to optimum.
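The paper's three-level allocation scheme is more involved, but its core energy/time trade-off can be sketched under the usual dynamic power model p = s^α (α = 3 here is our assumption, not a value from the abstract): work w run at speed s takes time w/s and costs energy w·s^(α−1).

```python
ALPHA = 3.0  # dynamic power model p = s**ALPHA, a common DVFS assumption

def time_under_energy(works, energy):
    """Minimum length for a fixed task sequence under an energy budget: by
    convexity, one uniform speed that spends the whole budget is optimal."""
    total = sum(works)
    speed = (energy / total) ** (1.0 / (ALPHA - 1.0))
    return total / speed

def energy_under_deadline(works, deadline):
    """Dual problem: minimum energy to finish the same sequence by a deadline."""
    speed = sum(works) / deadline
    return sum(works) * speed ** (ALPHA - 1.0)

t = time_under_energy([4.0, 2.0, 2.0], energy=32.0)    # -> 4.0 time units
assert abs(energy_under_deadline([4.0, 2.0, 2.0], deadline=t) - 32.0) < 1e-9
```

The two functions are exact duals, mirroring the paper's pair of problems: fixing the energy budget determines the best schedule length, and fixing the deadline determines the minimum energy.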

13.
This paper describes PENNANT, a mini-app that operates on general unstructured meshes (meshes with arbitrary polygons) and is designed for advanced architecture research. It contains mesh data structures and physics algorithms adapted from the Los Alamos National Laboratory radiation-hydrodynamics code FLAG and gives a sample of FLAG's typical memory access patterns. The basic capabilities and optimization approaches of PENNANT are presented. Results are shown from sample performance experiments run on serial, multicore, and graphics processing unit implementations, giving an indication of how PENNANT can be a useful tool for studies of new architectures and programming models. Copyright © 2014 John Wiley & Sons, Ltd.

14.
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs, in the context of demanding numerical methods, is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective use of GPU texture memory. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.

15.

In the past decade, heterogeneous multicore architectures with support for Single Instruction Multiple Thread (SIMT) style computing have become the standard platform of choice for scheduling HPC applications. Here, applications are typically modelled as a set of data-parallel tasks with dependencies represented in the form of a directed acyclic graph (DAG). The relevant execution time information for each constituent task in the DAG is known beforehand and is leveraged by scheduling algorithms (list or cluster based) to ascertain near-optimal schedules at runtime. However, in an online setting, where applications are submitted by multiple users and the types of applications are not restricted, it is highly unlikely that execution time information will be known for every program. In this context, we propose a class of intelligent algorithms for heterogeneous CPU-GPU platforms that leverage static-analysis-assisted machine learning techniques to decide how device assignments should be made at runtime, thus bypassing the requirement for expensive offline profiling passes. We formalize relevant task-level ranking metrics and discuss how existing scheduling techniques can be adapted for our proposed class of algorithms. We also devise an online cluster scheduling algorithm that supports dynamic task arrival by determining, in any given scheduling epoch, mapping decisions for a subset of tasks in a DAG. We perform a detailed comparative analysis between our proposed cluster and list scheduling heuristics via extensive simulation experiments using a variety of heterogeneous multicore platform configurations, and we observe performance speedups in the range of 1.1–1.5× for cluster scheduling over list scheduling.


16.
FVC2000: Fingerprint Verification Competition
Reliable and accurate fingerprint recognition is a challenging pattern recognition problem, requiring algorithms that are robust in many contexts. The FVC2000 competition attempted to establish the first common benchmark, allowing companies and academic institutions to unambiguously compare performance and track improvements in their fingerprint recognition algorithms. Three databases were created using different state-of-the-art sensors, and a fourth database was artificially generated; 11 algorithms were extensively tested on the four data sets. We believe that the FVC2000 protocol, databases, and results will be useful to all practitioners in the field, not only as a benchmark for improving methods, but also for enabling an unbiased evaluation of algorithms.

17.
With the advent of multicore processors, it has become imperative to write parallel programs in order to exploit the next generation of processors. This paper deals with skyline computation as a case study of parallelizing database operations on multicore architectures. First, we parallelize three sequential skyline algorithms, BBS, SFS, and SSkyline, to see whether the design principles of sequential skyline computation also extend to parallel skyline computation. Then we develop a new parallel skyline algorithm, PSkyline, based on the divide-and-conquer strategy. Experimental results show that all the algorithms successfully utilize multiple cores to achieve a reasonable speedup. In particular, PSkyline achieves a speedup approximately proportional to the number of cores when parallel computation is needed most.
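A hedged sketch of the divide-and-conquer skyline idea (minimization on all attributes): solve the two halves independently, which is the parallelizable part, then merge by removing points dominated across halves. This is the generic scheme, not PSkyline's actual implementation.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Divide-and-conquer skyline: the two recursive calls are independent
    and could run on separate cores; the merge removes cross-dominated points."""
    if len(points) <= 1:
        return list(points)
    mid = len(points) // 2
    left, right = skyline(points[:mid]), skyline(points[mid:])
    merged = [p for p in left if not any(dominates(q, p) for q in right)]
    merged += [q for q in right if not any(dominates(p, q) for p in left)]
    return merged

print(skyline([(1, 9), (3, 3), (2, 8), (5, 1), (4, 4)]))
# -> [(1, 9), (3, 3), (2, 8), (5, 1)]; (4, 4) is dominated by (3, 3)
```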

18.
Asymmetric multicore processors (AMPs) have become popular in both high-end and low-end computing systems due to their flexibility and high performance. A performance-asymmetric multicore architecture (P-AMP) is a subcategory of AMP that integrates cores with different micro-architectures on the same chip. Due to the heterogeneous nature of cores and applications, recognizing an optimal hardware configuration, in terms of core type and voltage-frequency pair, for each application is still an NP-hard problem, and optimization of the energy-delay product (EDP) is an additional challenge on such architectures. To address these challenges, we developed a novel core prediction model called the lightweight deep neural network (LW-DNN) for asymmetric multicore processors. The proposed LW-DNN comprises three phases: feature selection, feature optimization, and core prediction. In the first and second phases, workload characteristics are extracted and optimized using a pre-processing algorithm; in the third phase, the model predicts the appropriate core for each workload at runtime to enhance energy efficiency and performance. We modeled the deep neural network using the scikit-learn Python library and evaluated it on an ODROID-XU3 ARM big.LITTLE performance-asymmetric multicore platform, considering the embedded benchmarks MiBench, IoMT, and CoreMark. The proposed LW-DNN prediction module is compared with other traditional algorithms in terms of accuracy, execution time, energy consumption, and energy-delay product. The experimental results show that core prediction accuracy reaches up to 97%, while energy consumption, energy-delay product, and execution time are reduced by an average of 33%, 35%, and 33%, respectively.
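Since the paper builds on scikit-learn, here is an illustrative sketch of such a core-prediction pipeline: workload features are mapped to a big/LITTLE core choice by a small neural network. The feature names, labels, and network size below are synthetic stand-ins, not the paper's LW-DNN or its benchmark measurements.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((200, 3))             # e.g. [IPC, cache-miss rate, mem-boundedness]
y = (X[:, 0] > X[:, 1]).astype(int)  # toy rule: compute-bound -> big core (1)

model = make_pipeline(
    StandardScaler(),                # feature normalization phase
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0),
)
model.fit(X[:150], y[:150])          # train on the first 150 synthetic workloads
print("held-out accuracy:", model.score(X[150:], y[150:]))
```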

19.
Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all others in terms of quality. In this paper, we present a unifying framework for classification tree construction, called RainForest, that separates the scalability aspects of tree-construction algorithms from the central features that determine the quality of the tree. The generic algorithm is easy to instantiate with specific split selection methods from the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT, and QUEST). In addition to its generality, in that it yields scalable versions of a wide range of classification algorithms, our approach offers performance improvements of over a factor of three relative to the SPRINT algorithm, the fastest scalable classification algorithm proposed previously. In contrast to SPRINT, however, our generic algorithm requires a certain minimum amount of main memory, proportional to the set of distinct values in a column of the input relation. Given current main memory costs, this requirement is readily met in most if not all workloads.

20.
Transactional Memory: An Overview
Writing applications that benefit from the massive computational power of future multicore chip multiprocessors will not be an easy task for mainstream programmers accustomed to sequential algorithms rather than parallel ones. This article presents a survey of transactional memory, a mechanism that promises to enable scalable performance while freeing programmers from some of the burden of modifying their parallel code.
