Similar Documents
20 similar documents found.
1.
In this work, we present a tool that exploits heterogeneous computing to calculate the noise scattered by an object from the pressure distribution over its surface and its normal derivative. The method mainly deals with a large matrix–vector product in which the matrix elements must be calculated on the fly so that the problem fits in main memory. To prove the performance of the heterogeneous implementations, the tool is tested using one NVIDIA K20c GPU, one Intel Xeon Phi 5110P, and two Intel Xeon E5-2650 CPUs. The speedup of the accelerated implementations ranges from \(3\times\) (Xeon Phi) to \(8\times\) (Xeon Phi \(+\) K20c) when compared to our parallel CPU code with 32 threads. This work, combined with the authors' previous works on the computation of the acoustic pressure over the obstacle surface, results in a valuable toolset for noise control applications during aircraft design.
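To make the memory-bounded scheme concrete, here is a minimal sketch of an on-the-fly matrix–vector product; the kernel function, block size, and problem size are illustrative stand-ins, not the paper's acoustic boundary-element formulation:

    import numpy as np

    def kernel_block(rows, cols):
        """Hypothetical kernel: element (i, j) computed on the fly.
        Stands in for the paper's surface-integral matrix entries."""
        i = rows[:, None].astype(np.float64)
        j = cols[None, :].astype(np.float64)
        return 1.0 / (1.0 + np.abs(i - j))

    def matvec_on_the_fly(n, x, block=1024):
        """y = A @ x without ever materializing the full n x n matrix.
        Only one (block x n) panel of A exists in memory at a time."""
        y = np.zeros(n)
        cols = np.arange(n)
        for start in range(0, n, block):
            rows = np.arange(start, min(start + block, n))
            panel = kernel_block(rows, cols)   # computed, used, discarded
            y[start:start + len(rows)] = panel @ x
        return y

    x = np.random.rand(4096)
    y = matvec_on_the_fly(4096, x)

On an accelerator, each panel would be computed and multiplied on the device, which is where the on-the-fly formulation pays off.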

2.
Thread-level speculation (TLS) can exploit the potential parallelism of programs and improve multi-core resource utilization, but the TACLeBench kernel benchmarks have not yet been effectively analyzed for TLS parallelization. To address this, we designed a profiling scheme and tool for loop-level speculative execution. Seven representative TACLeBench kernel benchmarks were selected. First, each program was analyzed, hot regions were identified, and loop markers were inserted; second, these regions were cross-compiled and data on speculative threads and memory addresses were recorded, yielding a profile of the maximum potential loop-level parallelism; finally, runtime characteristics (thread granularity, parallelizable coverage, and dependence features) and the influence of the source code on speedup were examined. The experimental results show that: 1) these programs are well suited to TLS acceleration; compared with serial execution, most programs achieve speedups above 2 under loop-level speculative execution, with a maximum speedup of 20.79; 2) when accelerating the TACLeBench kernels with TLS, most applications can effectively utilize 4 to 16 cores.
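As a rough illustration of the profiling idea (not the tool itself), the following sketch estimates the maximum loop-level TLS potential from recorded per-iteration read/write address sets: cross-iteration read-after-write dependences define a critical path whose length bounds the achievable speculative speedup:

    def profile_loop_tls(iterations):
        """Estimate loop-level TLS potential from per-iteration memory traces.
        `iterations` is a list of (reads, writes) address sets, one per loop
        iteration, as a trace recorder would emit. An iteration depends on
        the most recent earlier iteration that wrote an address it reads.
        The critical-path length over this dependence graph bounds the
        speedup achievable under ideal speculation."""
        last_writer = {}           # address -> iteration that last wrote it
        depth = []                 # critical-path depth of each iteration
        for i, (reads, writes) in enumerate(iterations):
            d = 1
            for addr in reads:
                if addr in last_writer:
                    d = max(d, depth[last_writer[addr]] + 1)
            depth.append(d)
            for addr in writes:
                last_writer[addr] = i
        return len(iterations) / max(depth)   # ideal speculative speedup

    # Toy trace: iteration i reads a[i-1], writes a[i] -> fully serial chain.
    serial = [({f"a[{i-1}]"}, {f"a[{i}]"}) for i in range(8)]
    print(profile_loop_tls(serial))     # 1.0: no TLS potential
    # Independent iterations -> speedup equals the iteration count.
    parallel = [({f"b[{i}]"}, {f"c[{i}]"}) for i in range(8)]
    print(profile_loop_tls(parallel))   # 8.0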

3.
We present techniques to parallelize membership tests for Deterministic Finite Automata (DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel using speculation. We partition the input string into chunks, match chunks in parallel, and combine the matching results. Our parallel matching algorithm exploits structural DFA properties to minimize the speculative overhead. Unlike previous approaches, our speculation is failure-free, i.e., (1) sequential semantics are maintained, and (2) slowdowns are avoided altogether. On architectures with a SIMD gather operation for indexed memory loads, our matching operation is fully vectorized. The proposed load-balancing scheme uses an off-line profiling step to determine the matching capacity of each participating processor. Based on matching capacities, DFA matches are load-balanced on inhomogeneous parallel architectures such as cloud computing environments. We evaluated our speculative DFA membership test on a representative set of benchmarks from the Perl-compatible Regular Expression (PCRE) library and the PROSITE protein database. Evaluation was conducted on a 4-CPU (40-core) shared-memory node of the Intel Academic Program Manycore Testing Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vectorized SIMD execution, and on a 20-node (288-core) cluster on the Amazon EC2 computing cloud. Obtained speedups are on the order of $\mathcal{O}\left(1+\frac{|P|-1}{|Q|\cdot\gamma}\right)$, where $|P|$ denotes the number of processors or SIMD units, $|Q|$ denotes the number of DFA states, and $0<\gamma\le 1$ represents a statically computed DFA property. For all observed cases, we found that $0.02<\gamma<0.47$. Actual speedups range from $2.3\times$ to $38.8\times$ for up to 512 DFA states for PCRE, and between $1.3\times$ and $19.9\times$ for up to 1,288 DFA states for PROSITE on a 40-core MTL node. Speedups on the EC2 computing cloud range from $5.0\times$ to $65.8\times$ for PCRE, and from $5.0\times$ to $138.5\times$ for PROSITE. Speedups of our C-based DFA matcher over the Perl-based ScanProsite scan tool range from $559.3\times$ to $15{,}079.7\times$ on a 40-core MTL node. We show the scalability of our approach for input sizes of up to 10 GB.
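The following sketch shows the speculative-chunking idea in its simplest, enumerative form: each chunk is matched from every DFA state in parallel, producing a state-to-state map, and the maps are composed sequentially at the end. The paper's algorithm refines this baseline by exploiting structural DFA properties; the toy two-state DFA below is illustrative:

    from concurrent.futures import ThreadPoolExecutor

    def run_chunk(delta, states, chunk):
        """Match one chunk from every possible start state, returning the
        induced state-to-state map. Chunks are independent, so these maps
        can be computed in parallel and composed left to right."""
        mapping = {}
        for s in states:
            q = s
            for ch in chunk:
                q = delta[q][ch]
            mapping[s] = q
        return mapping

    def parallel_membership(delta, start, accepting, text, nchunks=4):
        n = len(text)
        bounds = [(i * n // nchunks, (i + 1) * n // nchunks)
                  for i in range(nchunks)]
        states = list(delta)
        with ThreadPoolExecutor(nchunks) as ex:
            maps = list(ex.map(
                lambda b: run_chunk(delta, states, text[b[0]:b[1]]), bounds))
        q = start
        for m in maps:            # cheap sequential composition step
            q = m[q]
        return q in accepting

    # DFA over {a, b} accepting strings with an even number of 'a's.
    delta = {0: {'a': 1, 'b': 0}, 1: {'a': 0, 'b': 1}}
    print(parallel_membership(delta, 0, {0}, "abba"))   # True: two 'a's

The per-chunk cost grows with $|Q|$, which is why the speedup formula above degrades with the number of DFA states.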

4.
Many-core systems are designed primarily for applications with large amounts of data parallelism. We propose an efficient hybrid matrix multiplication implementation based on the Strassen and Winograd algorithms (S-MM and W-MM) for many-core systems. A depth-first (DFS) traversal of the recursion tree is used, in which all cores work in parallel on each \(N \times N\) sub-matrix while the sub-matrices themselves are computed in sequence. DFS reduces storage at the cost of large data motion to gather and aggregate the results. The proposed approach uses three optimizations: (1) a small set of basic algebra functions to reduce overhead, (2) invocation of an efficient library (CUBLAS 5.5) for the basic functions, and (3) parameter tuning of parametric kernels to improve resource occupancy. S-MM and W-MM are evaluated on a GPU and on MIC (Xeon Phi). On the GPU, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library, running up to twice as fast for arrays satisfying \(N \ge 2048\) and \(N \ge 3072\), respectively. Similar trends are observed for S-MM with reordering (R-S-MM), which is used to save storage. Compared to the NVIDIA SDK library, S-MM and W-MM achieve speedups between \(20\times\) and \(80\times\) for the same arrays. On MIC, two-recursion S-MM with reordering is 14–26% faster than the MKL library for \(N \ge 1024\). The proposed implementations achieve 2.35 TFLOPS (67% of peak) on the GPU and 0.5 TFLOPS (21% of peak) on MIC. Similarly encouraging results are obtained on a 16-core Xeon-E5 server. We conclude that S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic algebra libraries.
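For reference, a single Strassen recursion level looks as follows; the leaf fallback to the library product plays the role that the CUBLAS/MKL calls play in the paper (the sizes and leaf threshold are illustrative):

    import numpy as np

    def strassen(A, B, leaf=512):
        """Strassen's scheme: 7 recursive products instead of 8.
        Leaves fall back to the optimized library product. Assumes square
        matrices whose size is a power of two."""
        n = A.shape[0]
        if n <= leaf:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22, leaf)
        M2 = strassen(A21 + A22, B11, leaf)
        M3 = strassen(A11, B12 - B22, leaf)
        M4 = strassen(A22, B21 - B11, leaf)
        M5 = strassen(A11 + A12, B22, leaf)
        M6 = strassen(A21 - A11, B11 + B12, leaf)
        M7 = strassen(A12 - A22, B21 + B22, leaf)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A, B = np.random.rand(1024, 1024), np.random.rand(1024, 1024)
    assert np.allclose(strassen(A, B), A @ B)

A depth-first evaluation of the seven products, as in the paper, needs temporaries for only one branch of the recursion tree at a time.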

5.
We study a simple technique, originally presented by Herlihy (ACM Trans. Program. Lang. Syst. 15(5):745–770, 1993), for executing concurrently, in a wait-free manner, blocks of code that have been programmed for sequential execution and require significant synchronization in order to be performed in parallel. We first present an implementation of this technique, called Sim, which employs a collect object. We describe a simple implementation of a collect object from a single shared object that supports atomic Add (or XOR) in addition to read; this implementation has step complexity O(1). By plugging this implementation into Sim, Sim exhibits constant step complexity as well. This allows us to derive lower bounds on the step complexity of implementations of several shared objects, such as Add, XOR, collect, and snapshot objects, from LL/SC objects. We then present a practical version of Sim, called PSim, implemented on a real shared-memory machine. From a theoretical perspective, PSim has worse step complexity than Sim, its theoretical analog; in practice, however, we show experimentally that PSim is highly efficient: it outperforms several state-of-the-art lock-based and lock-free synchronization techniques even though it is wait-free, i.e., it satisfies a stronger progress condition than all the algorithms it outperforms. We have used PSim to obtain highly efficient wait-free implementations of stacks and queues.

6.
In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design-space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators, where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry-standard SPLASH-2 and EEMBC MultiBench benchmarks and demonstrate simulation speeds of up to 25,307 MIPS on a 32-core x86 host machine for as many as 2,048 target processors, whilst exhibiting minimal and near-constant overhead, including memory considerations.

7.
S-Net is a declarative coordination language and component technology aimed at radically facilitating software engineering for modern parallel compute systems through a near-complete separation of concerns between application (component) engineering and concurrency orchestration. S-Net builds on the concept of stream processing to structure networks of communicating asynchronous components implemented in a conventional (sequential) language. In this paper we present the design, implementation, and evaluation of a new and innovative runtime system for S-Net streaming networks. The Front runtime system outperforms the existing implementations of S-Net by orders of magnitude on stress-test benchmarks, significantly reduces the runtimes of fully-fledged parallel applications with compute-intensive components, and achieves good scalability on our 48-core test system.

8.
In this article, we formulate and study quantum analogues of randomized search heuristics, which make use of Grover search (Grover, in Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pp. 212–219, ACM, New York, 1996) to accelerate the search for improved offspring. We then specialize this formulation to two specific search heuristics: Random Local Search and the (1+1) Evolutionary Algorithm. We call the resulting quantum versions of these search heuristics Quantum Local Search and the (1+1) Quantum Evolutionary Algorithm. We conduct a rigorous runtime analysis of these quantum search heuristics in the computation model of quantum algorithms, which, besides classical computation steps, also permits steps unique to quantum computing devices. To this end, we study the six elementary pseudo-Boolean optimization problems OneMax, LeadingOnes, Discrepancy, Needle, Jump, and TinyTrap. It turns out that the advantage of the respective quantum search heuristic over its classical counterpart varies with the problem structure, ranging from no speedup at all for Discrepancy to exponential speedup for TinyTrap. We show that these runtime behaviors are closely linked to the probabilities of performing successful mutations in the classical algorithms.
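For concreteness, here is the classical (1+1) Evolutionary Algorithm on OneMax, the baseline that the quantum variant accelerates by replacing the repeated sampling of an improving offspring with Grover search:

    import random

    def one_plus_one_ea(n, fitness, max_iters=100_000):
        """Classical (1+1) EA: flip each bit independently with probability
        1/n, keep the offspring if it is at least as fit as the parent."""
        x = [random.randint(0, 1) for _ in range(n)]
        for _ in range(max_iters):
            y = [b ^ (random.random() < 1.0 / n) for b in x]
            if fitness(y) >= fitness(x):
                x = y
            if fitness(x) == n:     # optimum reached (for OneMax)
                break
        return x

    onemax = sum                     # OneMax: number of one-bits
    print(sum(one_plus_one_ea(32, onemax)))   # expect 32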

9.
Class-attribute interdependence maximization (CAIM) is one of the state-of-the-art algorithms for discretizing data for which classes are known. However, it may take a long time when run on high-dimensional, large-scale data with a large number of attributes and/or instances. This paper presents a solution to this problem by introducing a graphics processing unit (GPU)-based implementation of the CAIM algorithm that significantly speeds up the discretization process on big complex data sets. The GPU-based implementation is scalable to multiple GPU devices and exploits the concurrent-kernel execution capabilities of modern GPUs. The CAIM GPU-based model is evaluated and compared with the original CAIM using single- and multi-threaded parallel configurations on 40 data sets with different characteristics. The results show great speedup, up to 139 times faster using four GPUs, which makes discretization of big data efficient and manageable. For example, the discretization time of one big data set is reduced from 2 h to under 2 min.
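As a reminder of what is being computed, the sketch below evaluates the standard CAIM criterion for one candidate set of interval boundaries; the GPU implementation in the paper evaluates many such candidates concurrently. The data and boundaries here are illustrative, and empty intervals are simply scored zero:

    import numpy as np

    def caim(labels, values, boundaries):
        """CAIM criterion: (1/n) * sum over the n intervals r of
        max_r^2 / M_r, where max_r is the largest single-class count in
        interval r and M_r its total count. Higher is better: the score
        rewards intervals dominated by a single class."""
        classes = sorted(set(labels))
        edges = [-np.inf] + sorted(boundaries) + [np.inf]
        score = 0.0
        for lo, hi in zip(edges, edges[1:]):
            in_r = [c for v, c in zip(values, labels) if lo < v <= hi]
            if not in_r:
                return 0.0          # degenerate scheme: reject
            counts = [in_r.count(c) for c in classes]
            score += max(counts) ** 2 / len(in_r)
        return score / (len(edges) - 1)

    values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
    labels = ['a', 'a', 'a', 'b', 'b', 'b']
    print(caim(labels, values, [0.5]))   # 3.0: both intervals are pure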

10.
Nowadays, clustering of massive datasets is a crucial part of many data-analytic tasks. Most of the available clustering algorithms have two shortcomings when used on big data: (1) a large group of clustering algorithms, e.g. \(k\)-means, has to keep the data in memory and iterate over the data many times, which is very costly for big datasets; (2) clustering algorithms that run within limited memory, especially the family of stream-clustering algorithms, do not have parallel implementations that utilize modern multi-core processors, and they also lack decent quality of results. In this paper, we propose an algorithm that combines parallel clustering with single-pass, stream-clustering algorithms. The aim is a clustering algorithm that uses the full capabilities of a regular multi-core PC to cluster the dataset as fast as possible while producing clusters of acceptable quality. Our idea is to split the data into chunks and cluster each chunk in a separate thread, as sketched below. The clusters extracted from the chunks are then aggregated at the final stage using re-clustering. The parameters of the algorithm can be adjusted according to hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g. \(k\)-means++) and stream-based algorithms. Also, the quality of the solution is often equal to that of \(k\)-means++, while it significantly dominates stream-clustering algorithms. Our solution also scales well with extra available cores and hence provides an effective and fast solution to clustering large datasets on multi-core and multi-processor systems.
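A minimal sketch of the chunk-then-re-cluster scheme, using a tiny Lloyd's k-means as the per-chunk clusterer; the thread count and data are illustrative, and the paper's actual per-chunk algorithm is a stream clusterer:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def kmeans(X, k, iters=20, seed=0):
        """Tiny Lloyd's k-means, standing in for any single-chunk clusterer."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            d = np.linalg.norm(X[:, None] - centers[None], axis=2)
            assign = d.argmin(axis=1)
            for j in range(k):
                if (assign == j).any():
                    centers[j] = X[assign == j].mean(axis=0)
        return centers

    def chunked_cluster(X, k, nthreads=4):
        """Two-stage scheme: cluster each chunk in its own thread, then
        re-cluster the per-chunk centers into the final k clusters."""
        chunks = np.array_split(X, nthreads)
        with ThreadPoolExecutor(nthreads) as ex:
            partial = list(ex.map(lambda c: kmeans(c, k), chunks))
        return kmeans(np.vstack(partial), k)   # aggregation by re-clustering

    X = np.vstack([np.random.randn(500, 2) + off
                   for off in ([0, 0], [8, 8])])
    print(chunked_cluster(X, 2))   # two centers near (0,0) and (8,8)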

11.
We consider greedy contention managers for transactional memory for M × N execution windows of transactions, with M threads and N transactions per thread. We present, formally analyze, and experimentally evaluate three new randomized greedy contention-management algorithms for transaction windows. Assuming that each transaction has duration τ and conflicts with at most C other transactions inside the window, the first algorithm, Offline-Greedy, produces a schedule of length O(τ·(C + N·log(MN))) with high probability. The offline algorithm depends on knowing the conflict graph, which evolves as the execution of the transactions progresses. The second algorithm, Online-Greedy, produces a schedule whose length is only a logarithmic factor worse than that of Offline-Greedy, but does not require knowledge of the conflict graph. The third algorithm, Adaptive-Greedy, is an adaptive version of the previous algorithms that produces a schedule of asymptotically the same length as the online algorithm by adaptively guessing the value of C. All of the algorithms exhibit a competitive ratio very close to O(s), where s is the number of shared resources, and at the same time they provide new non-trivial tradeoffs for greedy transaction scheduling that parameterize window sizes and transaction conflicts within the execution window. We evaluate these window-based algorithms experimentally using the sorted linked list, red-black tree, skip list, and vacation benchmarks. The evaluation results confirm their benefits in practical performance throughput and in other metrics such as aborts-per-commit ratio and execution time overhead, along with the non-trivial provable properties of the algorithms.
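The random-priority greedy idea can be illustrated with a toy round-based simulation: every transaction draws a random priority, and in each round a transaction commits unless some live conflicting transaction outranks it. This is only a sketch of the flavor of the offline variant, not the analyzed algorithm:

    import random

    def offline_greedy_schedule(conflicts, n):
        """Toy random-priority greedy scheduler. `conflicts` is a set of
        unordered transaction pairs (i, j); returns the round in which
        each of the n transactions commits. In every round, a transaction
        commits unless a live higher-priority transaction conflicts with
        it (and would therefore win the conflict and abort it)."""
        prio = {t: random.random() for t in range(n)}
        live, round_of, rnd = set(range(n)), {}, 0
        while live:
            rnd += 1
            winners = {t for t in live
                       if all(not ((t, u) in conflicts or (u, t) in conflicts)
                              or prio[t] > prio[u]
                              for u in live if u != t)}
            for t in winners:
                round_of[t] = rnd
            live -= winners
        return round_of

    # 4 transactions; 0-1 and 1-2 conflict; 3 is independent.
    print(offline_greedy_schedule({(0, 1), (1, 2)}, 4))

Each round terminates progress-wise because the highest-priority live transaction always wins all of its conflicts.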

12.
We study a generalized job-shop problem called the body shop scheduling problem (BSSP). This problem arises from the industrial application of welding in a car body production line, where possible collisions between industrial robots have to be taken into account. BSSP corresponds to a job-shop problem in which the operations of a job have to follow alternating routes on the machines, certain operations of different jobs are not allowed to be processed at the same time, and after processing an operation of a certain job a machine may be unavailable for a given time for operations of other jobs. As our main results, we show that the special case of three jobs and four machines in which only one machine is used by more than one job is already $\mathcal{NP}$-hard. This also implies that the single-machine scheduling problem that asks for a makespan-minimal schedule of three chains of operations with delays between the operations of a chain is $\mathcal{NP}$-hard. On the positive side, we present a polynomial algorithm for the two-job case and a pseudo-polynomial algorithm together with an FPTAS for an arbitrary but constant number of jobs. Hence for a constant number of jobs we fully settle the complexity status of the problem.

13.
Uruguay is currently undergoing a gradual process of including wind energy in its electric power generation matrix. In this context, a computational tool has been developed to predict the electrical power that will be injected into the grid. The tool is based on the Weather Research and Forecasting (WRF) numerical model, which is the performance bottleneck of the application. For this reason, and in line with several successful efforts by other researchers, this article presents advances in porting WRF to the GPU. In particular, we present the implementation of the sintb and bdy_interp1 routines on the GPU and the integration of these routines with previous efforts by other authors. The speedup values obtained for the newly ported routines on an NVIDIA GeForce GTX 480 GPU are up to \(33.9\times\) compared with the sequential WRF and \(9.2\times\) compared with the four-threaded WRF. Integrating the newly ported routines with previous work reduces the total runtime of the multi-core four-threaded WRF by more than 30% and that of the single-threaded version by more than 50%.

14.
With the objective of minimizing the total execution time of a parallel program on a distributed-memory parallel computer, this paper discusses the selection of an optimal supernode shape for a supernode transformation (also known as tiling). We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For supernode transformations on algorithms with perfectly nested loops and uniform dependences, we prove the optimality of a constant linear schedule vector and give a necessary and sufficient condition for optimal relative side lengths. We also prove that the total running time is minimized by a cutting-hyperplane direction matrix from a particular subset of all valid directions, and we discuss the cases where this subset is unique. The results are derived in continuous space and should be considered approximate. Our model does not include cache effects and assumes an unbounded number of available processors, communication cost approximated by a constant, uniform dependences, and loop bounds known at compile time. A comprehensive example is discussed, with an application of the results to the Jacobi algorithm.
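A small sketch of the objects involved: rectangular supernodes of side lengths (s1, s2) partition a 2D iteration space, and a linear schedule groups them into wavefronts of tiles that can execute concurrently; the side lengths are exactly the "relative side lengths" whose optimal choice the paper characterizes. The schedule vector and sizes below are illustrative:

    def tiles(N1, N2, s1, s2):
        """Partition an N1 x N2 iteration space into s1 x s2 supernodes,
        yielding tile coordinates."""
        for t1 in range(0, N1, s1):
            for t2 in range(0, N2, s2):
                yield (t1 // s1, t2 // s2)

    def wavefronts(N1, N2, s1, s2, schedule=(1, 1)):
        """Group supernodes by a linear schedule h(t) = schedule . t.
        With uniform dependences such as (1,0) and (0,1), all tiles in
        one wavefront are independent and can run on different
        processors."""
        waves = {}
        for t in tiles(N1, N2, s1, s2):
            key = schedule[0] * t[0] + schedule[1] * t[1]
            waves.setdefault(key, []).append(t)
        return [waves[k] for k in sorted(waves)]

    for step, wave in enumerate(wavefronts(8, 8, 4, 2)):
        print(step, wave)   # tiles that may execute concurrently per step

Changing (s1, s2) trades per-tile computation against the number of wavefronts, which is the balance the paper's optimality conditions capture.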

15.
Periodic scheduling has many attractions for wireless telecommunications. It offers energy savings where equipment can be turned off between transmissions, and high-quality reception through the elimination of jitter caused by irregularity of reception. However, perfect periodic schedules, in which each of $n$ clients is serviced at regular, prespecified intervals, are notoriously difficult to construct. The problem is known to be NP-hard even when service times are identical. This paper focuses on cases of up to three distinct periodicities with unit service times. Our contribution is an $O(n^4)$ test for the existence of a feasible schedule, together with a method for constructing a feasible schedule if one exists, for the given combination of client periodicities. We also indicate why schedules with a higher number of periodicities are unlikely to be useful in practice. This methodology can be used to support perfect periodic scheduling in a wide range of real-world settings, including machine maintenance service, wireless mesh networks, and various other telecommunication networks transmitting packet-size data.
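The feasibility notion is easy to state in code: a perfect periodic schedule with unit service times assigns each client an offset and a period, and no two clients may ever claim the same slot. The checker below verifies a candidate assignment over one hyperperiod; it is a validity check on a given schedule, not the paper's $O(n^4)$ existence test:

    from math import lcm

    def is_perfect_periodic(jobs):
        """Check a candidate perfect periodic schedule with unit service
        times. `jobs` maps client -> (offset, period): client i occupies
        slots offset + k*period. Checking one hyperperiod (the lcm of all
        periods) suffices, since the slot pattern repeats after it."""
        horizon = lcm(*(p for _, p in jobs.values()))
        seen = {}
        for client, (off, per) in jobs.items():
            for slot in range(off % per, horizon, per):
                if slot in seen:
                    return False, (client, seen[slot], slot)
                seen[slot] = client
        return True, None

    # Three periodicities (2, 4, 4): a feasible assignment of offsets...
    print(is_perfect_periodic({'a': (0, 2), 'b': (1, 4), 'c': (3, 4)}))
    # ...and an infeasible one (b and c both claim slot 1).
    print(is_perfect_periodic({'a': (0, 2), 'b': (1, 4), 'c': (5, 4)}))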

16.
To obtain significant execution speedups, GPUs rely heavily on the inherent data-level parallelism present in the targeted application. However, application programs may not always be able to fully utilize these parallel computing resources due to intrinsic data dependencies or complex data-pointer operations. In this paper, we explore how to leverage aggressive software-based value prediction techniques on a GPU to accelerate programs that lack inherent data parallelism. This class of applications is typically difficult to map to parallel architectures due to the data dependencies and complex pointer manipulation they contain. Our experimental results show that, despite the overhead incurred by software speculation and the communication overhead between the CPU and GPU, we obtain up to $6.5\times$ speedup on a selected set of kernels taken from the SPEC CPU2006, PARSEC, and Sequoia benchmark suites.
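A sketch of the value-prediction idea on a loop-carried dependence: predict the accumulator value at each chunk boundary, execute the chunks independently (in the paper, on the GPU), then validate the predictions and re-execute only the mispredicted chunks. The predictor and workload here are illustrative:

    def speculative_chunks(data, step, predict, nchunks=4):
        """Break a loop-carried dependence acc = step(acc, x) by
        predicting the accumulator at each chunk boundary. The chunks
        then become independent (parallelizable); a sequential validation
        pass re-executes only chunks whose predicted input was wrong."""
        n = len(data)
        bounds = [(i * n // nchunks, (i + 1) * n // nchunks)
                  for i in range(nchunks)]

        def run(acc, lo, hi):
            for x in data[lo:hi]:
                acc = step(acc, x)
            return acc

        guesses = [predict(lo) for lo, _ in bounds]       # speculative inputs
        results = [run(g, lo, hi) for g, (lo, hi) in zip(guesses, bounds)]
        acc, redone = 0, 0
        for g, r, (lo, hi) in zip(guesses, results, bounds):
            if g == acc:
                acc = r              # prediction held: reuse chunk result
            else:                    # mispredict: re-execute this chunk
                acc, redone = run(acc, lo, hi), redone + 1
        return acc, redone

    data = list(range(100))
    # A perfect oracle for a prefix sum: acc before index i is sum(range(i)).
    perfect = lambda i: i * (i - 1) // 2
    print(speculative_chunks(data, lambda a, x: a + x, perfect))      # (4950, 0)
    print(speculative_chunks(data, lambda a, x: a + x, lambda i: 0))  # (4950, 3)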

17.
Over the years, heterogeneous systems have come to dominate concurrent job execution; they are a natural choice because they can be built around legacy systems. Scheduling on such systems is an important activity because it affects job execution characteristics, and heterogeneity introduces many challenges for efficient job execution. Heterogeneity in core architecture also brings the possibility of heterogeneous memory architecture in many/multi-core heterogeneous systems. As a result, it is often impossible to determine, for the same instruction, whether a high-frequency core will see lower or higher memory latency than a low-frequency core, and vice versa. This work proposes an improved scheduler for systems in which both cores and memory are heterogeneous. It defines the average effective time (\(\hbox{AE}_\mathrm{t}\)) as the base parameter for this purpose. Priorities for each thread (workload) and each core are generated dynamically using \(\hbox{AE}_\mathrm{t}\) for effective mapping. Experimental results on benchmark data reveal that the proposed scheduler performs much better than other similar models in terms of core utilization, speedup, and efficiency.
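Since the abstract does not spell out \(\hbox{AE}_\mathrm{t}\), the sketch below uses a stand-in definition (compute cycles at the core's frequency plus memory accesses at the core's latency) just to show the shape of an \(\hbox{AE}_\mathrm{t}\)-driven greedy mapping; the formula and all numbers are assumptions, not the paper's model:

    def schedule(threads, cores):
        """Greedy mapping driven by an average-effective-time estimate.
        `threads`: list of (cycles, mem_accesses);
        `cores`: list of (freq_ghz, mem_lat_ns)."""
        def ae_t(th, c):
            cycles, accesses = th
            freq, lat = c
            return cycles / freq + accesses * lat   # hypothetical AE_t (ns)

        # Prioritize the costliest threads; give each its best free core.
        order = sorted(range(len(threads)),
                       key=lambda i: -min(ae_t(threads[i], c) for c in cores))
        free, mapping = set(range(len(cores))), {}
        for i in order:
            best = min(free, key=lambda j: ae_t(threads[i], cores[j]))
            mapping[i] = best
            free.remove(best)
        return mapping

    threads = [(9e3, 10), (1e3, 900)]    # compute-bound vs memory-bound
    cores = [(3.0, 120.0), (1.2, 40.0)]  # fast core/slow memory, and vice versa
    print(schedule(threads, cores))      # thread 1 -> core 1, thread 0 -> core 0

The point of the example is that neither core dominates: the compute-bound thread prefers the high-frequency core while the memory-bound thread prefers the low-latency one, which is exactly the ambiguity the abstract describes.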

18.
Community detection and evaluation is an important task in graph mining. In many cases, a community is defined as a subgraph characterized by dense connections or interactions between its nodes. A variety of measures have been proposed to evaluate different quality aspects of such communities—in most cases ignoring the directed nature of edges. In this paper, we introduce novel metrics for evaluating the collaborative nature of directed graphs—a property not captured by single-node metrics or by other established community evaluation metrics. To accomplish this objective, we capitalize on the concept of graph degeneracy and define a novel D-core framework, extending the classic graph-theoretic notion of $k$-cores for undirected graphs to directed ones. Based on the D-core, which essentially can be seen as a measure of the robustness of a community under degeneracy, we devise a wealth of novel metrics used to evaluate the graph collaboration features of directed graphs. We applied the D-core approach to large synthetic and real-world graphs such as Wikipedia, DBLP, and ArXiv, and report interesting results at the graph as well as the node level.
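The D-core extends $k$-core peeling to directed graphs: the $(k, l)$ D-core is the maximal subgraph in which every node has in-degree at least $k$ and out-degree at least $l$, computed by iteratively removing violating nodes. A minimal sketch:

    def d_core(edges, k, l):
        """(k, l) D-core by iterative peeling: keep removing nodes whose
        in-degree < k or out-degree < l until none remain, then return
        the surviving node set."""
        nodes = {u for e in edges for u in e}
        E = set(edges)
        changed = True
        while changed:
            indeg = {v: 0 for v in nodes}
            outdeg = {v: 0 for v in nodes}
            for u, v in E:
                outdeg[u] += 1
                indeg[v] += 1
            bad = {v for v in nodes if indeg[v] < k or outdeg[v] < l}
            changed = bool(bad)
            nodes -= bad
            E = {(u, v) for u, v in E if u in nodes and v in nodes}
        return nodes

    # A directed triangle survives the (1,1) D-core; the pendant d does not.
    edges = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('a', 'd')]
    print(d_core(edges, 1, 1))   # {'a', 'b', 'c'}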

19.
We address the problem of generating compact code from software-pipelined loops. Although software pipelining is a powerful technique for extracting fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code generation challenge. We are looking for the minimal unrolling factor enabling the periodic register allocation of software-pipelined kernels. This challenge is generally addressed through one of: (1) hardware support in the form of rotating register files, which solves the unrolling problem but is expensive in hardware; (2) register renaming by inserting register moves, which increases the number of operations in the loop and may damage the schedule of the software pipeline and reduce throughput; (3) post-pass loop unrolling, which does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers (the maximal number of values simultaneously alive) are sufficient for periodic register allocation (Eisenbeis et al., PACT '95: Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques, pp. 264–267, Manchester, UK, 1995; Hendren et al., CC '92: Proceedings of the 4th International Conference on Compiler Construction, pp. 176–191, London, UK, 1992). However, the best existing heuristic for controlling this code growth, modulo variable expansion (Lam, SIGPLAN Not. 23(7):318–328, 1988), may not apply the correct amount of loop unrolling to guarantee that MAXLIVE registers are enough, which may result in register spills (Eisenbeis et al. 1995). This paper presents our research results on the open problem of minimal loop unrolling, allowing software-only code generation that does not trade the optimality of the initiation interval (II) for the compactness of the generated code. Our novel idea is to use the registers remaining free after periodic register allocation to relax the constraints on register reuse. The problem of minimal loop unrolling arises either before or after software pipelining, and either with a single or with multiple register types (classes). We provide a formal problem definition for each scenario, and we propose and study a dedicated algorithm for each problem. Our solutions are implemented within an industrial-strength compiler for a VLIW embedded processor from STMicroelectronics, and validated on multiple benchmark suites.
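For orientation, modulo variable expansion chooses its unroll factor from the value lifetimes and the initiation interval; the sketch below computes that classical factor which, as noted above, need not make MAXLIVE registers sufficient:

    from math import ceil

    def mve_unroll(lifetimes, ii):
        """Modulo variable expansion (Lam 1988): a value alive for L
        cycles in a kernel with initiation interval II needs ceil(L / II)
        rotating copies, so the kernel is unrolled by the maximum of
        ceil(L / II) over all values."""
        return max(ceil(L / ii) for L in lifetimes)

    # Three values alive 3, 5 and 9 cycles in a kernel with II = 4:
    print(mve_unroll([3, 5, 9], 4))   # 3 copies of the kernel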

20.
We introduce two new natural decision problems, denoted ∃ RATIONAL NASH and ∃ IRRATIONAL NASH, pertinent to the rationality and irrationality, respectively, of Nash equilibria for (finite) strategic games. These problems ask, given a strategic game, whether or not it admits (i) a rational Nash equilibrium, where all probabilities are rational numbers, and (ii) an irrational Nash equilibrium, where at least one probability is irrational, respectively. We are interested here in the complexities of ∃ RATIONAL NASH and ∃ IRRATIONAL NASH. Towards this end, we study two other decision problems, denoted NASH-EQUIVALENCE and NASH-REDUCTION, pertinent to mutual properties of the sets of Nash equilibria of two given strategic games with the same number of players. The problem NASH-EQUIVALENCE asks whether or not the two sets of Nash equilibria coincide; we identify a restriction of its complementary problem that witnesses ∃ RATIONAL NASH. The problem NASH-REDUCTION asks whether or not there is a so-called Nash reduction: a suitable map between corresponding strategy sets of players that yields a Nash equilibrium of the former game from a Nash equilibrium of the latter game; we identify a restriction of NASH-REDUCTION that witnesses ∃ IRRATIONAL NASH. As our main result, we provide two distinct reductions to show that (i) NASH-EQUIVALENCE is co-$\mathcal{NP}$-hard and ∃ RATIONAL NASH is $\mathcal{NP}$-hard, and (ii) NASH-REDUCTION and ∃ IRRATIONAL NASH are both $\mathcal{NP}$-hard. The reductions significantly extend techniques previously employed by Conitzer and Sandholm (Proceedings of the 18th Joint Conference on Artificial Intelligence, pp. 765–771, 2003; Games Econ. Behav. 63(2):621–641, 2008).
