期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Design of efficient Java message-passing collectives on multi-core clusters

Guillermo L. Taboada Sabela Ramos Juan Touriño Ramón Doallo 《The Journal of supercomputing》2011,55(2):126-154

This paper presents a scalable and efficient Message-Passing in Java (MPJ) collective communication library for parallel computing on multi-core architectures. The continuous increase in the number of cores per processor underscores the need for scalable parallel solutions. Moreover, current system deployments are usually multi-core clusters, a hybrid shared/distributed memory architecture which increases the complexity of communication protocols. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the gap between Java and compiled languages performance has been narrowing for the last years, Java is an emerging option for High Performance Computing (HPC). 相似文献

2.

Optimizing Parallel S n Sweeps on Unstructured Grids for Multi-Core Clusters

下载免费PDF全文

闫洁谭光明孙凝晖《计算机科学技术学报》2013,28(4):657-670

In particle transport simulations, radiation effects are often described by the discrete ordinates (Sn) form of Boltzmann equation. In each ordinate direction, the solution is computed by sweeping the radiation flux across the grid. Parallel Sn sweep on an unstructured grid can be explicitly modeled as topological traversal through an equivalent directed acyclic graph (DAG), which is a data-driven algorithm. Its traditional design using MPI model results in irregular communication of massive short messages which cannot be effciently handled by MPI runtime. Meanwhile, in high-end HPC cluster systems, multicore has become the standard processor configuration of a single node. The traditional data-driven algorithm of Sn sweeps has not exploited potential advantages of multi-threading of multicore on shared memory. These advantages, however, as we shall demonstrate, could provide an elegant solution resolving problems in the previous MPI-only design. In this paper, we give a new design of data-driven parallel Sn sweeps using hybrid MPI and Pthread programming, namely Sweep-H, to exploit hierarchical parallelism of processes and threads. With special multi-threading techniques and vertex schedule policy, Sweep-H gets more effcient communication and better load balance. We further present an analytical performance model for Sweep-H to reveal why and when it is advantageous over former MPI counterpart. On a 64-node multicore cluster system with 12 cores per node, 768 cores in total, Sweep-H achieves nearly linear scalability for moderate problem sizes, and better absolute performance than the previous MPI algorithm on more than 16 nodes (by up to two times speedup on 64 nodes). 相似文献

3.

Towards Scalable Java HPC with Hybrid and Native Communication Devices in MPJ Express

Ansar Javed Bibrak Qamar Mohsan Jameel Aamir Shafi Bryan Carpenter 《International journal of parallel programming》2016,44(6):1142-1172

MPJ Express is a messaging system that allows application developers to parallelize their compute-intensive sequential Java codes on High Performance Computing clusters and multicore processors. In this paper, we extend MPJ Express software to provide two new communication devices. The first device—called hybrid—enables MPJ Express to exploit hybrid parallelism on cluster of multicore processors by sitting on top of existing shared memory and network communication devices. The second device—called native—uses JNI wrappers in interfacing MPJ Express to native MPI implementations like MPICH and Open MPI. We evaluate performance of these devices on a range of interconnects including 1G/10G Ethernet, 10G Myrinet and 40G InfiniBand. In addition, we analyze and evaluate the cost of MPJ Express buffering layer and compare it with the performance numbers of other Java MPI libraries. Our performance evaluation reveals that the native device allows MPJ Express to achieve comparable performance to native MPI libraries—for latency and bandwidth of point-to-point and collective communications—which is a significant gain in performance compared to existing communication devices. The hybrid communication device—without any modifications at application level—also helps parallel applications achieve better speedups and scalability by exploiting multicore architecture. Our performance evaluation quantifies the cost incurred by buffering and its impact on overall performance of software. We witnessed comparative performance as both new devices improve application performance and achieve upto 90 % of the theoretical bandwidth available without application rewriting effort—including NAS Parallel Benchmarks, point-to-point and collective communication. 相似文献

4.

F-MPJ: scalable Java message-passing communications on parallel systems 总被引：1，自引：0，他引：1

Guillermo L. Taboada Juan Touri?o Ramón Doallo 《The Journal of supercomputing》2012,60(1):117-140

This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ boosts this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance up to seven times. 相似文献

5.

Device level communication libraries for high‐performance computing in Java

Guillermo L. Taboada Juan Tourio Ramn Doallo Aamir Shafi Mark Baker Bryan Carpenter 《Concurrency and Computation》2011,23(18):2382-2403

Since its release, the Java programming language has attracted considerable attention from the high‐performance computing (HPC) community because of its portability, high programming productivity, and built‐in multithreading and networking support. As a consequence, several initiatives have been taken to develop a high‐performance Java message‐passing library to program distributed memory architectures, such as clusters. The performance of Java message‐passing applications relies heavily on the communications performance. Thus, the design and implementation of low‐level communication devices that support message‐passing libraries is an important research issue in Java for HPC. MPJ Express is our Java message‐passing implementation for developing high‐performance parallel Java applications. Its public release currently contains three communication devices: the first one is built using the Java New Input/Output (NIO) package for the TCP/IP; the second one is specifically designed for the Myrinet Express library on Myrinet; and the third one supports thread‐based shared memory communications. Although these devices have been successfully deployed in many production environments, previous performance evaluations of MPJ Express suggest that the buffering layer, tightly coupled with these devices, incurs a certain degree of copying overhead, which represents one of the main performance penalties. This paper presents a more efficient Java message‐passing communications device, based on Java Input/Output sockets, that avoids this buffering overhead. Moreover, this device implements several strategies, both in the communication protocol and in the HPC hardware support, which optimizes Java message‐passing communications. In order to evaluate its benefits, this paper analyzes the performance of this device comparatively with other Java and native message‐passing libraries on various high‐speed networks, such as Gigabit Ethernet, Scalable Coherent Interface, Myrinet, and InfiniBand, as well as on a shared memory multicore scenario. The reported communication overhead reduction encourages the upcoming incorporation of this device in MPJ Express ( http://mpj‐express.org ). Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

6.

Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

Jie Yan Guangming Tan Ninghui Sun 《The Journal of supercomputing》2014,69(3):1462-1490

相似文献

7.

Novel high intrinsic dimensionality estimators

A. Rozza G. Lombardi C. Ceruti E. Casiraghi P. Campadelli 《Machine Learning》2012,89(1-2):37-65

Recently, a great deal of research work has been devoted to the development of algorithms to estimate the intrinsic dimensionality (id) of a given dataset, that is the minimum number of parameters needed to represent the data without information loss. id estimation is important for the following reasons: the capacity and the generalization capability of discriminant methods depend on it; id is a necessary information for any dimensionality reduction technique; in neural network design the number of hidden units in the encoding middle layer should be chosen according to the id of data; the id value is strongly related to the model order in a time series, that is crucial to obtain reliable time series predictions. Although many estimation techniques have been proposed in the literature, most of them fail on noisy data, or compute underestimated values when the id is sufficiently high. In this paper, after reviewing some of the most important id estimators related to our work, we provide a theoretical motivation of the bias that causes the underestimation effect, and we present two id estimators based on the statistical properties of manifold neighborhoods, which have been developed in order to reduce this effect. We exhaustively evaluate the proposed techniques on synthetic and real datasets, by employing an objective evaluation measure to compare their performance with those achieved by state of the art algorithms; the results show that the proposed methods are promising, and produce reliable estimates also in the difficult case of datasets drawn from non-linearly embedded manifolds, characterized by high?id. 相似文献

8.

Algorithm::Evolutionary, a flexible Perl module for evolutionary computation 总被引：1，自引：0，他引：1

Juan Julián Merelo Guervós Pedro A. Castillo Enrique Alba 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2010,14(10):1091-1109

相似文献

9.

Low‐latency Java communication devices on RDMA‐enabled networks

Roberto R. Expsito Guillermo L. Taboada Sabela Ramos Juan Tourio Ramn Doallo 《Concurrency and Computation》2015,27(17):4852-4879

Providing high‐performance inter‐node communication is a key capability for running high performance computing applications efficiently on parallel architectures. In fact, current systems deployments are aggregating a significant number of cores interconnected via advanced networking hardware with Remote Direct Memory Access (RDMA) mechanisms, that enable zero‐copy and kernel‐bypass features. The use of Java for parallel programming is becoming more promising thanks to some useful characteristics of this language, particularly its built‐in multithreading support, portability, easy‐to‐learn properties, and high productivity, along with the continuous increase in the performance of the Java virtual machine. However, current parallel Java applications generally suffer from inefficient communication middleware, mainly based on protocols with high communication overhead that do not take full advantage of RDMA‐enabled networks. This paper presents efficient low‐level Java communication devices that overcome these constraints by fully exploiting the underlying RDMA hardware, providing low‐latency and high‐bandwidth communications for parallel Java applications. The performance evaluation conducted on representative RDMA networks and parallel systems has shown significant point‐to‐point performance increases compared with previous Java communication middleware, allowing to obtain up to 40% improvement in application‐level performance on 4096 cores of a Cray XE6 supercomputer. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

10.

A Parallel Dynamic Binary Translator for Efficient Multi-Core Simulation

Oscar Almer Igor Böhm Tobias Edler von Koch Björn Franke Stephen Kyle Volker Seeker Christopher Thompson Nigel Topham 《International journal of parallel programming》2013,41(2):212-235

In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (Iss) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (Jit) dynamic binary translation (Dbt). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (Isa). We have evaluated our parallel simulation methodology against the industry standard Splash-2 and Eembc MultiBench benchmarks and demonstrate simulation speeds up to 25,307 Mips on a 32-core x86 host machine for as many as 2,048 target processors whilst exhibiting minimal and near constant overhead, including memory considerations. 相似文献

11.

Highly-Efficient Wait-Free Synchronization

Panagiota Fatourou Nikolaos D. Kallimanis 《Theory of Computing Systems》2014,55(3):475-520

We study a simple technique, originally presented by Herlihy (ACM Trans. Program. Lang. Syst. 15(5):745–770, 1993), for executing concurrently, in a wait-free manner, blocks of code that have been programmed for sequential execution and require significant synchronization in order to be performed in parallel. We first present an implementation of this technique, called Sim, which employs a collect object. We describe a simple implementation of a collect object from a single shared object that supports atomic Add (or XOR) in addition to read; this implementation has step complexity O(1). By plugging in to Sim this implementation, Sim exhibits constant step complexity as well. This allows us to derive lower bounds on the step complexity of implementations of several shared objects, like Add, XOR, collect, and snapshot objects, from LL/SC objects. We then present a practical version of Sim, called PSim, which is implemented in a real shared-memory machine. From a theoretical perspective, PSim has worse step complexity than Sim, its theoretical analog; in practice though, we experimentally show that PSim is highly-efficient: it outperforms several state-of-the-art lock-based and lock-free synchronization techniques, and this given that it is wait-free, i.e. that it satisfies a stronger progress condition than all the algorithms that it outperforms. We have used PSim to get highly-efficient wait-free implementations of stacks and queues. 相似文献

12.

Nelder-Mead Simplex Optimization Routine for Large-Scale Problems: A Distributed Memory Implementation

Kyle Klein Julian Neira 《Computational Economics》2014,43(4):447-461

The Nelder-Mead simplex method is an optimization routine that works well with irregular objective functions. For a function of $n$ parameters, it compares the objective function at the $n+1$ vertices of a simplex and updates the worst vertex through simplex search steps. However, a standard serial implementation can be prohibitively expensive for optimizations over a large number of parameters. We describe an implementation of the Nelder-Mead method in parallel using a distributed memory. For $p$ processors, each processor is assigned $(n+1)/p$ vertices at each iteration. Each processor then updates its worst local vertices, communicates the results, and a new simplex is formed with the vertices from all processors. We also describe how the algorithm can be implemented with only two MPI commands. In simulations, our implementation exhibits large speedups and is scalable to large problem sizes. 相似文献

13.

A massively parallel,multi-disciplinary Barnes–Hut tree code for extreme-scale N-body simulations

Mathias Winkel Robert Speck Helge Hübner Lukas Arnold Rolf Krause Paul Gibbon 《Computer Physics Communications》2012,183(4):880-889

The efficient parallelization of fast multipole-based algorithms for the N-body problem is one of the most challenging topics in high performance scientific computing. The emergence of non-local, irregular communication patterns generated by these algorithms can easily create an insurmountable bottleneck on supercomputers with hundreds of thousands of cores. To overcome this obstacle we have developed an innovative parallelization strategy for Barnes–Hut tree codes on present and upcoming HPC multicore architectures. This scheme, based on a combined MPI–Pthreads approach, permits an efficient overlap of computation and data exchange. We highlight the capabilities of this method on the full IBM Blue Gene/P system JUGENE at Jülich Supercomputing Centre and demonstrate scaling across 294,912 cores with up to 2,048,000,000 particles. Applying our implementation pepc to laser–plasma interaction and vortex particle methods close to the continuum limit, we demonstrate its potential for ground-breaking advances in large-scale particle simulations. 相似文献

14.

Improving IPS by network processors

Pablo Cascón Julio Ortega Yan Luo Eric Murray Antonio Díaz Ignacio Rojas 《The Journal of supercomputing》2011,57(1):99-108

Many present applications usually require high communication throughputs. Multiprocessor nodes and multicore architectures, as well as programmable NICs (Network Interface Cards) provide new opportunities to take advantage of the available multigigabits per second link bandwidths. Nevertheless, to achieve adequate communication performance levels efficient parallel processing of network tasks and interfaces should be considered. In this paper, we leverage network processors as heterogeneous microarchitectures with several cores that implement multithreading and are suited for packet processing, to investigate on the use of parallel processing to accelerate the network interface, and thus the network applications developed above it. More specifically, we have implemented an intrusion prevention system (IPS) with such a network processor. We describe the IPS we have developed that after its offloaded implementation allows faster packet processing of both normal and corrupted traffic. The benefits from placing the IPS close to the network, by using specialized network processors, give many times lower latency and higher bandwidth available to the legitimate traffic. 相似文献

15.

Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Alfred J. Park Kalyan S. Perumalla 《Journal of Parallel and Distributed Computing》2013

The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance. 相似文献

16.

The 2D wavelet transform on emerging architectures: GPUs and multicores

Joaquín Franco Gregorio Bernabé Juan Fernández Manuel Ujaldón 《Journal of Real-Time Image Processing》2012,7(3):145-152

Because of the computational power of today??s GPUs, they are starting to be harnessed more and more to help out CPUs on high-performance computing. In addition, an increasing number of today??s state-of-the-art supercomputers include commodity GPUs to bring us unprecedented levels of performance in terms of raw GFLOPS and GFLOPS/cost. In this work, we present a GPU implementation of an image processing application of growing popularity: The 2D fast wavelet transform (2D-FWT). Based on a pair of Quadrature Mirror Filters, a complete set of application-specific optimizations are developed from a CUDA perspective to achieve outstanding factor gains over a highly optimized version of 2D-FWT run in the CPU. An alternative approach based on the Lifting Scheme is also described in Franco et al. (Acceleration of the 2D wavelet transform for CUDA-enabled Devices, 2010). Then, we investigate hardware improvements like multicores on the CPU side, and exploit them at thread-level parallelism using the OpenMP API and pthreads . Overall, the GPU exhibits better scalability and parallel performance on large-scale images to become a solid alternative for computing the 2D-FWT versus those thread-level methods run on emerging multicore architectures. 相似文献

17.

Mining temporal specifications from object usage

Andrzej Wasylkowski Andreas Zeller 《Automated Software Engineering》2011,18(3-4):263-292

A caller must satisfy the callee??s precondition??that is, reach a state in which the callee may be called. Preconditions describe the state that needs to be reached, but not how to reach it. We combine static analysis with model checking to mine Fair Computation Tree Logic (CTL ^F ) formulas that describe the operations a parameter goes through: ??In parseProperties(String xml), the parameter xml normally stems from getProperties().?? Such operational preconditions can be learned from program code, and the code can be checked for their violations. Applied to AspectJ, our Tikanga prototype found 169 violations of operational preconditions, uncovering 7 unique defects and 27 unique code smells??with 52% true positives in the 25% top-ranked violations. 相似文献

18.

Binary extended perfect codes of length 16 and rank 14

V. A. Zinoviev D. V. Zinoviev 《Problems of Information Transmission》2006,42(2):123-138

All extended binary perfect (16, 4, 2¹¹) codes of rank 14 over the field F ₂ are classified. It is proved that among all nonequivalent extended binary perfect (16, 4, 2¹¹) codes there are exactly 1719 nonequivalent codes of rank 14 over F ₂. Among these codes there are 844 codes classified by Phelps (Solov’eva-Phelps codes) and 875 other codes obtained by the construction of Etzion-Vardy and by a new general doubling construction, presented in the paper. Thus, the only open question in the classification of extended binary perfect (16,4,2¹¹) codes is that on such codes of rank 15 over F ₂. 相似文献

19.

Counterexample-guided abstraction refinement for symmetric concurrent programs

Alastair F. Donaldson Alexander Kaiser Daniel Kroening Michael Tautschnig Thomas Wahl 《Formal Methods in System Design》2012,41(1):25-44

Predicate abstraction and counterexample-guided abstraction refinement (CEGAR) have enabled finite-state model checking of software written in mainstream programming languages. This combination of techniques has been successful in analysing system-level sequential C code. In contrast, there is little evidence of fruitful applications of CEGAR to shared-variable concurrent software. We attribute this gap to the lack of abstraction strategies that permit a scalable analysis of the resulting multi-threaded Boolean programs. The goal of this paper is to close this gap. We have developed a symmetry-aware CEGAR technique: it takes into account the replicated structure of programs that consist of many threads executing the same procedure, and generates a Boolean program template whose multi-threaded execution soundly overapproximates the original concurrent program. State explosion during model checking parallel instantiations of this template can now be absorbed by exploiting symmetry. We have implemented our method in a tool, SymmPa, and demonstrate its superior performance over alternative approaches on a range of synchronisation programs. 相似文献

20.

A note on balanced immunity

Haiko Müller 《Theory of Computing Systems》1993,26(2):157-167

For a finite alphabet ∑ we define a binary relation on $2^{\Sigma *} \times 2^{2^{\Sigma ^* } } $ , called balanced immunity. A setB ? ∑* is said to be balancedC-immune (with respect to a classC ? 2^Σ* of sets) iff, for all infiniteL εC, $$\mathop {\lim }\limits_{n \to \infty } \left| {L^{ \leqslant n} \cap B} \right|/\left| {L^{ \leqslant n} } \right| = \tfrac{1}{2}$$ Balanced immunity implies bi-immunity and in natural cases randomness. We give a general method to find a balanced immune set'B for any countable classC and prove that, fors(n) =o(t(n)) andt(n) >n, there is aB εSPACE(t(n)), which is balanced immune forSPACE(s(n)), both in the deterministic and nondeterministic case. 相似文献