Found 20 similar documents (search time: 15 ms)
1.
2.
Kianzad, V.; Bhattacharyya, S.S. IEEE Transactions on Parallel and Distributed Systems, 2006, 17(7):667-680
Multiprocessor mapping and scheduling algorithms have been extensively studied over the past few decades and have been tackled from different perspectives. In the late 1980s, the two-step decomposition of scheduling into clustering and cluster-scheduling was introduced. Ever since, several clustering and merging algorithms have been proposed and individually reported to be efficient. However, it is not clear how effective they are and how well they compare against single-step scheduling algorithms or other multistep algorithms. In this paper, we explore the effectiveness of the two-phase decomposition of scheduling and describe efficient and novel techniques that aggressively streamline interprocessor communication and can be tuned to exploit the significantly longer compilation time available to embedded system designers. We evaluate a number of leading clustering and merging algorithms using a set of benchmarks with diverse structures. We present an experimental setup for comparing the single-step against the two-step scheduling approach. We determine the importance of the different steps in scheduling and their effect on overall schedule performance, and show that the decomposition of the scheduling process indeed improves overall performance. We also show that the quality of the solutions depends on the quality of the clusters generated in the clustering step. Based on the results, we also discuss why the parallel time metric used in the clustering step may not provide an accurate measure of the final performance of cluster-scheduling.
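For readers unfamiliar with the parallel time metric discussed above, here is a minimal sketch under our own simplified assumptions (the task graph, clustering, and costs are all hypothetical, not the paper's benchmarks): an edge whose endpoints share a cluster costs no communication, and parallel time is the weight of the longest path through the clustered DAG.

```python
# Minimal sketch (not the authors' code): the "parallel time" metric on a
# clustered task graph. Nodes carry computation costs, edges carry
# communication costs; intra-cluster edges are assumed to cost zero.
from functools import lru_cache

comp = {"a": 2, "b": 3, "c": 4, "d": 1}                      # hypothetical task costs
comm = {("a", "b"): 5, ("a", "c"): 2, ("b", "d"): 4, ("c", "d"): 6}
cluster = {"a": 0, "b": 0, "c": 1, "d": 0}                   # hypothetical clustering

succs = {}
for (u, v) in comm:
    succs.setdefault(u, []).append(v)

@lru_cache(maxsize=None)
def longest_from(u):
    best = 0
    for v in succs.get(u, []):
        edge = 0 if cluster[u] == cluster[v] else comm[(u, v)]
        best = max(best, edge + longest_from(v))
    return comp[u] + best

parallel_time = max(longest_from(u) for u in comp)           # critical path length
print(parallel_time)
```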
3.
We report on the performance obtained, at the application level, by two MPI implementations for InfiniBand that allow direct exchange of data stored in the global memory of Graphics Processing Units (GPUs) based on NVIDIA CUDA. For the same purpose, we also tested the Application Programming Interface of APEnet, a custom high-performance interconnect technology. As a benchmark, we consider the time required to update a single spin of the 3D Heisenberg spin glass model using the over-relaxation algorithm. The results show that CUDA streams are instrumental in achieving the best possible performance.
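The over-relaxation move that the benchmark times is a deterministic, energy-preserving update; a minimal NumPy sketch of the standard move (our illustration, not the paper's code; spins, neighbors, and couplings are made up) reflects each spin about its local molecular field.

```python
# Minimal sketch (ours) of the over-relaxation update for a Heisenberg spin:
# reflecting the spin about its local field h preserves the local energy s.h
# and needs no random numbers.
import numpy as np

def overrelax(spin, neighbors, couplings):
    """spin: (3,) unit vector; neighbors: (k, 3) spins; couplings: (k,)."""
    h = (couplings[:, None] * neighbors).sum(axis=0)          # local molecular field
    return 2.0 * np.dot(spin, h) / np.dot(h, h) * h - spin    # reflection about h

s = np.array([0.0, 0.0, 1.0])
nbrs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
J = np.array([1.0, -0.5, 0.3])
h = (J[:, None] * nbrs).sum(axis=0)
s_new = overrelax(s, nbrs, J)
print(np.dot(s, h), np.dot(s_new, h))   # the local energy s.h is unchanged
```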
4.
We compare five existing dynamic memory allocators optimized for GPUs and show their strengths and weaknesses. In the measurements, we use three generic evaluation tests proposed in the past and add one with a real workload, in which dynamic memory allocation is used to build the k-d tree data structure. Following the performance analysis, we propose a new dynamic memory allocator and its variants that address the limitations of the existing allocators. The new allocator uses few resources and is targeted at large and variably sized memory allocations on massively parallel hardware architectures.
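For context, here is a minimal sketch (ours, not one of the five allocators evaluated) of the simplest GPU allocation strategy such studies compare against: a bump allocator, whose offset update on the device would be a single atomicAdd. It is fast but never reuses freed memory, which is exactly the kind of limitation the evaluation tests are designed to expose.

```python
# Minimal sketch (ours) of a bump ("atomic add") allocator. The lock stands
# in for the atomic increment a real GPU heap would use.
import threading

class BumpAllocator:
    def __init__(self, heap_size):
        self.offset = 0
        self.size = heap_size
        self.lock = threading.Lock()          # stands in for atomicAdd

    def malloc(self, nbytes):
        with self.lock:
            if self.offset + nbytes > self.size:
                return None                   # out of memory; no free() exists
            ptr, self.offset = self.offset, self.offset + nbytes
            return ptr

heap = BumpAllocator(1 << 20)
print(heap.malloc(256), heap.malloc(1024))    # 0 256
```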
5.
Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by one kernel routine responsible for computing the action of a Dirac operator. This paper describes an experience in parallelizing this kernel routine. We explore parallelization granularities for this kernel on Graphics Processing Units (GPUs). We show that fine-grained parallelism can outperform coarse-grained parallelization, provided that control-flow and communication effects are minimized. We propose two techniques for transforming control-flow-based code into control-free code. We also show how to reduce the communication effect by optimizing for commonly used sequences of calls to this routine. In our implementation on an NVIDIA 8800 GTX, we achieved an 8.3x speedup over an SSE2-optimized version on a 2.8 GHz Intel Xeon CPU.
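The abstract does not spell out the two control-flow-elimination techniques; the following is a generic illustration (ours) of the underlying idea of replacing a divergent branch with arithmetic predication, so that all GPU threads follow one instruction stream.

```python
# Generic illustration (not the paper's transformation): a data-dependent
# branch is replaced by predicated arithmetic selection.
def with_branch(x):
    if x > 0:
        return 2 * x
    return -x

def control_free(x):
    p = int(x > 0)                        # predicate: a value, not a jump
    return p * (2 * x) + (1 - p) * (-x)   # both arms computed, one selected

assert all(with_branch(x) == control_free(x) for x in range(-5, 6))
```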
6.
To conserve space and power as well as to harness high performance in embedded systems, high utilization of the hardware is required. This can be facilitated through dynamic adaptation of the silicon resources in reconfigurable systems in order to realize various customized kernels as execution proceeds. Fortunately, the reconfiguration overheads encountered can be estimated. Therefore, if the scheduling of time-consuming kernels also considers the reconfiguration overheads, an overall performance gain can be obtained. We present our policy, experiments, and performance results for customizing and reconfiguring Field-Programmable Gate Arrays (FPGAs) for embedded kernels. Experiments involving EEMBC (EDN Embedded Microprocessor Benchmarking Consortium) and MiBench embedded benchmark kernels show high performance using our main policy when reconfiguration overheads are considered. Our policy reduces the required reconfigurations by more than 50% compared to brute-force solutions, and performs within 25% of the ideal execution time while conserving 60% of the FPGA resources. Alternative strategies for reducing the reconfiguration overhead are also presented and evaluated.
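The core trade-off in reconfiguration-aware scheduling can be stated in a few lines; this sketch (ours, not the paper's policy; all times are hypothetical) loads a kernel onto the FPGA only when the estimated hardware speedup amortizes the reconfiguration overhead.

```python
# Minimal sketch (ours) of the reconfiguration-aware scheduling decision.
def run_in_hardware(t_sw, t_hw, t_reconfig, already_loaded):
    """All times in the same units; True if the FPGA path is cheaper."""
    cost_hw = t_hw + (0.0 if already_loaded else t_reconfig)
    return cost_hw < t_sw

print(run_in_hardware(t_sw=10.0, t_hw=2.0, t_reconfig=12.0, already_loaded=False))  # False
print(run_in_hardware(t_sw=10.0, t_hw=2.0, t_reconfig=12.0, already_loaded=True))   # True
```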
7.
Class-attribute interdependence maximization (CAIM) is one of the state-of-the-art algorithms for discretizing data for which classes are known. However, it may take a long time when run on high-dimensional, large-scale data with a large number of attributes and/or instances. This paper presents a solution to this problem by introducing a graphics processing unit (GPU)-based implementation of the CAIM algorithm that significantly speeds up the discretization process on big, complex data sets. The GPU-based implementation is scalable to multiple GPU devices and exploits the concurrent kernel execution capabilities of modern GPUs. The GPU-based CAIM model is evaluated and compared with the original CAIM using single- and multi-threaded parallel configurations on 40 data sets with different characteristics. The results show great speedups, up to 139 times faster using four GPUs, which makes discretization of big data efficient and manageable. For example, the discretization time of one big data set is reduced from 2 h to under 2 min.
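For reference, the CAIM criterion itself is cheap to state. This small NumPy sketch (ours, with a made-up quanta matrix) computes it from class counts per discretization interval; the score rewards intervals dominated by a single class.

```python
# Minimal sketch (ours) of the CAIM criterion: for a quanta matrix whose rows
# are intervals and whose columns are class counts,
#   CAIM = (1/n) * sum_i (max class count in interval i)^2 / (total in interval i)
import numpy as np

def caim(quanta):
    """quanta: (n_intervals, n_classes) array of counts."""
    M = quanta.sum(axis=1).astype(float)   # instances per interval
    maxr = quanta.max(axis=1)              # dominant class count per interval
    return float((maxr ** 2 / M).sum() / quanta.shape[0])

q = np.array([[20, 2], [3, 15]])           # two intervals, two classes
print(caim(q))                             # higher means a crisper discretization
```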
8.
Given that the concurrent L1-minimization (L1-min) problem is often required in real applications, we investigate how to solve it in parallel on GPUs in this paper. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication Ax and a novel self-adaptive thread implementation of the transposed matrix-vector multiplication A^Tx on the GPU. Vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on the proposed kernels, the iterative shrinkage-thresholding algorithm is used to build two concurrent L1-min solvers, organized around streams and thread blocks on a GPU respectively, whose performance is further optimized using newer GPU features such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver for multiple GPUs. The experimental results validate the high effectiveness and good performance of our proposed methods.
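The iterative shrinkage-thresholding algorithm (ISTA) that both solvers build on consists of exactly the two kernels the paper tunes, Ax and A^Tx, plus a soft-threshold step. A minimal NumPy sketch (ours; the problem data and step size rule are illustrative):

```python
# Minimal sketch (ours) of ISTA for min ||Ax - b||^2/2 + lam*||x||_1.
import numpy as np

def ista(A, b, lam, step, iters=500):
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)                   # gradient: uses Ax and A^T x
        z = x - step * g
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[[3, 40]] = [1.0, -2.0]
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L with L = ||A||_2^2
x = ista(A, b, lam=0.1, step=step)
print(np.round(x[[3, 40]], 2))                   # approximately recovers 1.0, -2.0
```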
9.
Journal of Systems Architecture, 2014, 60(1):1-10
The GPU (Graphics Processing Unit) has recently become one of the most power-efficient processors in embedded and many other environments, and has been integrated into more and more SoCs (Systems on Chip). Modern GPUs therefore play a very important role in power-aware computing. Strongly Connected Component (SCC) decomposition is a fundamental graph algorithm with wide applications in model checking, electronic design automation, social network analysis, and other fields. GPUs have been shown to have great potential in accelerating many types of computations, including graph algorithms. Recent work has demonstrated the plausibility of GPU SCC decomposition, but the implementations are inefficient due to insufficient consideration of the distinctive GPU programming model, which leads to poor performance on irregular and sparse graphs. This paper presents a new GPU SCC decomposition algorithm that focuses on full utilization of contemporary embedded and desktop GPU architectures. In particular, a subgraph numbering scheme is proposed to facilitate the safe and efficient management of subgraph IDs and to serve as the basis of efficient source selection. Furthermore, we adopt a multi-source partition procedure that greatly reduces the recursion depth and use a vertex labeling approach that highly optimizes GPU memory access. The evaluation results show that the proposed approach achieves up to a 41× speedup over Tarjan's algorithm, one of the most efficient sequential SCC decomposition algorithms, and up to a 3.8× speedup over previous GPU algorithms.
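GPU SCC decomposition is typically built on the forward-backward (FB) scheme rather than on Tarjan's stack-based algorithm; a compact sequential sketch of FB (ours, with a toy graph) shows where the recursion depth and source (pivot) selection mentioned above come from: the SCC of a pivot is the intersection of its forward and backward reachable sets, and three residual partitions recurse independently.

```python
# Minimal sketch (ours) of forward-backward SCC decomposition.
def reach(vertices, edges, src, reverse=False):
    adj = {}
    for u, v in edges:
        if reverse:
            u, v = v, u
        if u in vertices and v in vertices:
            adj.setdefault(u, []).append(v)
    seen, stack = {src}, [src]
    while stack:
        for v in adj.get(stack.pop(), []):
            if v not in seen:
                seen.add(v); stack.append(v)
    return seen

def fb_scc(vertices, edges):
    if not vertices:
        return []
    pivot = next(iter(vertices))                      # source selection point
    fwd = reach(vertices, edges, pivot)
    bwd = reach(vertices, edges, pivot, reverse=True)
    scc = fwd & bwd
    return ([scc] + fb_scc(fwd - scc, edges)          # three independent
                  + fb_scc(bwd - scc, edges)          # recursive partitions
                  + fb_scc(vertices - fwd - bwd, edges))

V = {1, 2, 3, 4}
E = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]
print(fb_scc(V, E))   # [{1, 2}, {3, 4}] (order depends on pivot choice)
```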
10.
Ortega-Arranz, H.; Torres, Y.; Gonzalez-Escribano, A.; Llanos, D.R. The Journal of Supercomputing, 2014, 70(2):786-798
In recent years, GPU manycore devices have demonstrated their usefulness in accelerating computationally intensive problems. Although arriving at a parallelization of a highly parallel algorithm is an affordable task, the optimization of GPU codes is a challenging activity. The main reason for this is the number of parameters, programming choices, and tuning techniques available, many of them related to complex and sometimes hidden architectural details. A useful strategy for systematically attacking these optimization problems is to characterize the different kernels of the application and use this knowledge to select appropriate configuration parameters. The All-Pair Shortest-Path (APSP) problem is a well-known problem in graph theory whose objective is to find the shortest paths between any pair of nodes in a graph. It can be solved by highly parallel and computationally intensive tasks, making it a good candidate for exploitation by manycore devices. In this paper, we use kernel characterization criteria to optimize an APSP algorithm implementation for NVIDIA GPUs. Our experimental results show that the combined use of proper configuration policies and the concurrent-kernels capability of new CUDA architectures leads to a performance improvement of up to 62% with respect to one of the possible configurations recommended by CUDA, taken as the baseline.
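The APSP decomposition that makes the problem so amenable to manycore hardware is simply one single-source shortest-path computation per source node, each an independent task. A compact sequential sketch (ours; the graph is hypothetical and the per-source loop is where a GPU would parallelize):

```python
# Minimal sketch (ours) of APSP as independent per-source Dijkstra runs.
import heapq

def dijkstra(adj, src):
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                               # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

adj = {0: [(1, 4), (2, 1)], 2: [(1, 2)], 1: [(3, 5)]}
apsp = {s: dijkstra(adj, s) for s in adj}          # trivially parallel over sources
print(apsp[0])                                     # shortest distances from node 0
```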
11.
Gupta, J.P.; Winter, S.C.; Wilson, D.R. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1989, 15(11):1357-1367
The authors describe CTDNet, a data-driven reduction machine for the concurrent execution of applicative functional programs in the form of lambda calculus expressions. Such programs are stored as binary-tree-structured process graphs in which all processes maintain pointers to their immediate neighbors (i.e., ancestor and two children). Processes are of two basic types: master processes, which represent the original process graph, and slave processes, which carry out the actual executional work and are dynamically created and destroyed. CTDNet uses a distributed eager evaluation scheme, modified to evaluate conditional expressions lazily, together with a form of distributed string reduction with some graph-like modifications.
12.
This paper investigates the underlying impact of predictive inaccuracy on execution scheduling, with particular reference to execution time predictions. The study is conducted from two perspectives: job selection and resource allocation, both fundamental components of execution scheduling. A new performance metric, termed the degree of misperception, is introduced to express the probability that the predicted execution times of jobs display different ordering characteristics from their real execution times due to inaccurate prediction. Specific formulae are developed to calculate the degree of misperception in both job selection and resource allocation scenarios. The parameters that influence the degree of misperception are also extensively investigated. The results presented in this paper are of significant benefit to scheduling approaches that take predictive data into account; they are also of importance for the application of these scheduling techniques to real-world high-performance systems.
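The paper's exact formulae are not reproduced in the abstract; the sketch below (our reading of the definition, with made-up job data) computes an empirical degree of misperception as the fraction of job pairs whose predicted ordering disagrees with their real ordering.

```python
# Minimal sketch (ours, not the paper's formulae): empirical degree of
# misperception over all job pairs.
from itertools import combinations

def degree_of_misperception(predicted, real):
    pairs = list(combinations(range(len(real)), 2))
    flipped = sum(
        1 for i, j in pairs
        if (predicted[i] - predicted[j]) * (real[i] - real[j]) < 0
    )
    return flipped / len(pairs)

real      = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 41.0, 35.0]   # the last two jobs are misordered
print(degree_of_misperception(predicted, real))  # 1 of 6 pairs -> ~0.167
```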
13.
Farjo, J.; Abou Assi, R.; Masri, W. Software Testing, Verification and Reliability, 2015, 25(2):115-137
The interest in leveraging data mining and statistical techniques to enable dynamic program analysis has increased tremendously in recent years. Researchers have presented numerous techniques that mine and analyze execution profiles to assist software testing and other reliability-enhancing approaches. Previous empirical studies have shown that the effectiveness of such techniques is likely to be affected by the type of profiled program elements. This work further studies the impact of the characteristics of execution profiles by focusing on their size, noting that a typical profile comprises a large number of program elements, on the order of thousands or more. Specifically, the authors devised six reduction techniques and comparatively evaluated them by measuring: (1) the reduction rate; (2) the information loss; and (3) the impact on two applications of dynamic program analysis, namely cluster-based test suite minimization (App-I) and profile-based online failure and intrusion detection (App-II). The results were promising: (a) the average reduction rate ranged from 92% to 98%; (b) three techniques were lossless and three were slightly lossy; (c) reducing execution profiles had a major positive impact on the effectiveness and efficiency of App-I; and (d) reduction had a positive impact on the efficiency of App-II, but a minor negative impact on its effectiveness.
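The six techniques themselves are not described in the abstract. As a flavor of what a lossless reduction can look like (our example, not necessarily one of the six), dropping profile elements that never vary across executions shrinks the profile without losing any power to discriminate between executions.

```python
# Minimal sketch (ours) of one lossless profile reduction: remove columns
# (program elements) whose counts are identical in every execution.
import numpy as np

profiles = np.array([[3, 0, 7, 1],
                     [3, 5, 7, 1],
                     [3, 2, 7, 9]])   # rows: executions, cols: program elements

keep = ~(profiles == profiles[0]).all(axis=0)   # does the column ever vary?
reduced = profiles[:, keep]
print(reduced)   # constant columns 0 and 2 are removed
```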
14.
The purpose of this paper is to highlight the performance issues of matrix transposition algorithms for large matrices in relation to the Translation Lookaside Buffer (TLB) cache. Existing optimisation techniques such as coalesced access and the use of shared memory, regardless of their necessity and benefits, are not sufficient to neutralise the problem. As the problem size increases, these optimisations do not exploit data locality effectively enough to counteract the detrimental effects of TLB cache misses. We propose a new optimisation technique that counteracts the performance degradation of these algorithms and seamlessly complements current optimisations. Our optimisation is based on a detailed analysis of enumeration schemes that can be applied to either individual matrix entries or blocks (sub-matrices). The key advantage of these enumeration schemes is that they do not incur matrix storage format conversion, because they operate on canonical matrix layouts. In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, an out-of-place algorithm for rectangular matrices, and two 3D involutions. We demonstrate that the choice of enumeration scheme and its parametrisation can have a direct and significant impact on the algorithm's memory access pattern. Our in-place version of the algorithm delivers up to a 100% performance improvement over existing optimisation techniques, while for the out-of-place version we observe up to a 300% performance gain over NVIDIA's algorithm. We also offer improved versions of two involution transpositions for 3D matrices that achieve performance increases of up to 300%. To the best of our knowledge, this is the first effective attempt to control the logical-to-physical block association through the design of enumeration schemes in the context of matrix transposition.
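To see why block-wise traversal orders matter here, consider this far simpler sketch (ours, not the paper's enumeration schemes): visiting the matrix tile by tile keeps both the read and the write side within a small set of pages at any moment, which is what reduces TLB pressure compared with a naive row-by-row sweep.

```python
# Minimal sketch (ours) of a blocked (tiled) transpose; B is the tile size.
import numpy as np

def blocked_transpose(a, B=32):
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, B):
        for j in range(0, m, B):
            out[j:j + B, i:i + B] = a[i:i + B, j:j + B].T   # one tile at a time
    return out

a = np.arange(12).reshape(3, 4)
assert (blocked_transpose(a, B=2) == a.T).all()
```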
15.
Scientific workflow execution often spans multiple self-managing administrative domains to obtain specific processing capabilities. Existing (global) analysis techniques tend to mandate that every domain-specific application unveil all private behaviors for scientific collaboration. In practice, it is infeasible for a domain-specific application to disclose its process details (as a private workflow fragment) for privacy or security reasons. Consequently, coordinating scientific workflows with their distributed domain-specific applications is a challenging endeavor. To address this problem, we propose a collaborative scheduling approach that can deal with temporal dependencies between a scientific workflow and a private workflow fragment. Under this approach, a private workflow fragment can maintain temporal consistency with a scientific workflow in resource sharing and task enactment. An evaluation is also presented to demonstrate the proposed approach for coordinating multiple scientific workflow executions in a concurrent environment.
16.
17.
Lee, J.; Park, J.-H.; Kim, H.; Jung, C.; Lim, D.; Han, S. Journal of Parallel and Distributed Computing, 2010
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial, due to interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing it. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the application. We evaluate our technique using a set of standard numerical applications running on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.
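The run-time feedback mechanism can be sketched in a few lines (ours; the thread counts, workload, and trial protocol are illustrative, and Python threads will not reproduce SMT interference effects): time a few candidate modes during early invocations of the loop, then commit to the fastest for the remaining invocations.

```python
# Minimal sketch (ours) of timing-feedback mode selection for a parallel loop.
import time
from concurrent.futures import ThreadPoolExecutor

def run_loop(nthreads, work_items):
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        list(pool.map(lambda x: sum(i * i for i in range(x)), work_items))

def pick_thread_count(candidates, work_items):
    timings = {}
    for n in candidates:                  # trial phase: one invocation per mode
        t0 = time.perf_counter()
        run_loop(n, work_items)
        timings[n] = time.perf_counter() - t0
    return min(timings, key=timings.get)  # best mode for all later invocations

best = pick_thread_count([1, 2, 4, 8], [20000] * 64)
print("chosen thread count:", best)
```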
18.
The continuous broadcast of data together with an index structure is an effective way of disseminating data in a wireless mobile environment. The index allows a mobile client to tune in only when relevant data is available on the channel, leading to reduced power consumption for the clients. This paper investigates the execution of queries on broadcast index trees when query execution corresponds to a partial traversal of the tree. Queries exhibiting this behavior include range queries and nearest-neighbor queries. We present two broadcast schedules for index trees and two query algorithms executed by mobile clients. Our solutions simultaneously minimize tuning time and latency and adapt to the client's available memory. Experimental results using real and synthetic data compare a broadcast with node repetition to one without node repetition, and show how priority-based data management can help reduce tuning time and latency.
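Selective tuning on a broadcast index amounts to pruning: the client examines each index node as it arrives and only tunes in to subtrees whose key range intersects the query, dozing through everything else. A minimal sketch (ours; the tree layout and keys are hypothetical, and real schedules interleave index and data over time):

```python
# Minimal sketch (ours) of a range query as a partial traversal of an index
# tree; skipped subtrees correspond to doze-mode slots on the channel.
def range_query(node, lo, hi, hits):
    nlo, nhi, children, keys = node
    if nhi < lo or nlo > hi:
        return                      # doze: subtree irrelevant to the query
    if not children:                # leaf bucket: tune in and scan
        hits.extend(k for k in keys if lo <= k <= hi)
        return
    for child in children:
        range_query(child, lo, hi, hits)

leaf1 = (1, 4, [], [1, 3, 4])
leaf2 = (5, 8, [], [5, 7])
leaf3 = (9, 12, [], [9, 11, 12])
root = (1, 12, [leaf1, leaf2, leaf3], [])
hits = []
range_query(root, 6, 10, hits)
print(hits)                          # [7, 9]; leaf1 was slept through
```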
19.
The use of rules in a distributed environment creates new challenges for the development of active rule execution models. In particular, since a single event can trigger multiple rules that execute over distributed sources of data, it is important to make use of concurrent rule execution whenever possible. This paper presents the details of the integration rule scheduling (IRS) algorithm. Integration rules are active database rules used for component integration in a distributed environment. The IRS algorithm identifies conflicts among multiple rules triggered by the same event through static, compile-time analysis of the read and write sets of each rule. A unique aspect of the algorithm is that the conflict analysis includes the effects of nested rule execution that occurs as a result of using an execution model with an immediate coupling mode. The algorithm therefore identifies conflicts that may occur as a result of the concurrent execution of different rule-triggering sequences. The rules are then formed into a priority graph before execution, defining the order in which rules triggered by the same event should be processed. Rules with the same priority can be executed concurrently. The IRS algorithm guarantees confluence in the final state of the rule execution and is applicable to rule scheduling in both distributed and centralized rule execution environments.
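The read/write-set conflict test at the heart of such analysis is the standard one: two rules conflict when one's write set intersects the other's read or write set; non-conflicting rules can share a priority level and run concurrently. A minimal sketch (ours, with hypothetical rules; the nested-execution analysis is not modeled):

```python
# Minimal sketch (ours) of pairwise read/write-set conflict detection.
def conflicts(r1, r2):
    return bool(r1["write"] & (r2["read"] | r2["write"]) or
                r2["write"] & (r1["read"] | r1["write"]))

rules = {
    "R1": {"read": {"x"}, "write": {"y"}},
    "R2": {"read": {"y"}, "write": {"z"}},   # conflicts with R1 on y
    "R3": {"read": {"a"}, "write": {"b"}},   # independent of both
}
names = list(rules)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, "conflict" if conflicts(rules[a], rules[b]) else "concurrent")
```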
20.
Atluri, V.; Jajodia, S.; Bertino, E. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(5):839-854
This paper investigates issues related to transaction concurrency control in multilevel secure databases and demonstrates how the conflicts between the correctness requirements and the secrecy requirements can be reconciled, proposing two different solutions. It first explores correctness criteria that are weaker than one-copy serializability. Each of these weaker criteria, though not as strict as one-copy serializability, is required to preserve database consistency in some meaningful way, and its implementation does not require the scheduler to be trusted. It proposes three different, increasingly stricter notions of serializability (level-wise serializability, one-item read serializability, and pair-wise serializability) that can serve as substitutes for one-copy serializability. The paper then investigates secure concurrency control protocols that generate one-copy serializable histories and presents a multiversion timestamping protocol with several very desirable properties: it is secure, produces multiversion histories that are equivalent to serial one-copy histories in which transactions are placed in timestamp order, eliminates starvation, and can be implemented using single-level untrusted schedulers.
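The multiversion timestamping read rule such protocols build on is simple to state (this sketch is ours, with security labels and the write rule omitted): a transaction with timestamp ts reads the version of an item with the largest write timestamp not exceeding ts, so readers never block writers.

```python
# Minimal sketch (ours) of the multiversion timestamp read rule.
def read_version(versions, ts):
    """versions: list of (write_ts, value) pairs, in any order."""
    visible = [(wts, v) for wts, v in versions if wts <= ts]
    return max(visible)[1] if visible else None   # newest visible version

versions = [(5, "v1"), (12, "v2"), (20, "v3")]
print(read_version(versions, ts=15))   # 'v2' (written at timestamp 12)
```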