20 similar documents found; search took 15 ms
1.
Pedro Furtado 《Distributed and Parallel Databases》2009,25(1-2):71-96
Consider data warehouses as large data repositories queried for analysis and data mining in a variety of application contexts. A query over such data may take a large amount of time to be processed in a regular PC. Consider partitioning the data into a set of PCs (nodes), with either a parallel database server or any database server at each node and an engine-independent middleware. Nodes and network may not even be fully dedicated to the data warehouse. In such a scenario, care must be taken to handle processing heterogeneity and availability, so we study and propose efficient solutions for this. We concentrate on three main contributions: a performance-wise index, measuring relative performance; a replication degree; and a flexible chunk-wise organization with on-demand processing. These contributions extend previous work on declustering and replication and are generic in the sense that they can be applied in very different contexts and with different data partitioning approaches. We evaluate their merits with a prototype implementation of the system.
2.
Željko Vrba Pål Halvorsen Carsten Griwodz Paul Beskow Håvard Espeland Dag Johansen 《The Journal of supercomputing》2013,63(1):191-217
Even though shared-memory concurrency is a paradigm frequently used for developing parallel applications on small- and middle-sized machines, experience has shown that it is hard to use. This is largely caused by synchronization primitives which are low-level, inherently non-deterministic, and, consequently, non-intuitive to use. In this paper, we present the Nornir run-time system. Nornir is comparable to well-known frameworks such as MapReduce and Dryad that are recognized for their efficiency and simplicity. Unlike these frameworks, Nornir also supports process structures containing branches and cycles. Nornir is based on the formalism of Kahn process networks, which is a shared-nothing, message-passing model of concurrency. We deem this model a simple and deterministic alternative to shared-memory concurrency. Experiments with real and synthetic benchmarks on up to 8 CPUs show that performance in most cases scales almost linearly with the number of CPUs, when not limited by data dependencies. We also show that the modeling flexibility allows Nornir to outperform its MapReduce counterparts using well-known benchmarks.
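To make the Kahn-process-network idea concrete, here is a minimal sketch (not Nornir's actual API, which this listing does not show): each process owns no shared state and communicates only through blocking FIFO channels, which is what makes execution deterministic.

```python
import threading
from queue import Queue

def producer(out_q, items):
    # A process communicates only via its channels (shared-nothing).
    for x in items:
        out_q.put(x)
    out_q.put(None)  # end-of-stream marker

def doubler(in_q, out_q):
    # Blocking reads on a single input channel keep behavior deterministic.
    while (x := in_q.get()) is not None:
        out_q.put(2 * x)
    out_q.put(None)

def run_pipeline(items):
    a, b = Queue(), Queue()
    stages = [
        threading.Thread(target=producer, args=(a, items)),
        threading.Thread(target=doubler, args=(a, b)),
    ]
    for t in stages:
        t.start()
    results = []
    while (x := b.get()) is not None:
        results.append(x)
    for t in stages:
        t.join()
    return results
```

Because each stage blocks on exactly one input channel, the output order is fully determined by the input, in contrast with lock-based shared-memory code.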
3.
Guillermo L. Taboada Sabela Ramos Juan Touriño Ramón Doallo 《The Journal of supercomputing》2011,55(2):126-154
This paper presents a scalable and efficient Message-Passing in Java (MPJ) collective communication library for parallel computing
on multi-core architectures. The continuous increase in the number of cores per processor underscores the need for scalable
parallel solutions. Moreover, current system deployments are usually multi-core clusters, a hybrid shared/distributed memory
architecture which increases the complexity of communication protocols. Here, Java represents an attractive choice for the
development of communication middleware for these systems, as it provides built-in networking and multithreading support.
As the performance gap between Java and compiled languages has been narrowing over the last few years, Java is an emerging option for High Performance Computing (HPC).
4.
Keqin Li 《The Journal of supercomputing》2010,54(3):271-297
We present fast and highly scalable parallel computations for a number of important and fundamental matrix problems on distributed
memory systems (DMS). These problems include matrix multiplication, matrix chain product, and computing the powers, the inverse,
the characteristic polynomial, the determinant, the rank, the Krylov matrix, and an LU- and a QR-factorization of a matrix,
and solving linear systems of equations. Our highly scalable parallel computations for these problems are based on a highly
scalable implementation of the fastest sequential matrix multiplication algorithm on DMS. We show that compared with the best
known parallel time complexities on parallel random access machines (PRAM), the most powerful but unrealistic shared memory
model of parallel computing, our parallel matrix computations achieve the same speeds on distributed memory parallel computers
(DMPC), and have an extra polylog factor in the time complexities on DMS with hypercubic networks. Furthermore, our parallel
matrix computations are fully scalable on DMPC and highly scalable over a wide range of system size on DMS with hypercubic
networks. Such fast (in terms of parallel time complexity) and highly scalable (in terms of our definition of scalability)
parallel matrix computations were rarely seen before on any distributed memory systems.
5.
Advances in computer technology, together with the rapid emergence of multi-core processors, have made many-core personal computers available and affordable. Networks of workstations and clusters of many-core SMPs have become an attractive platform for high performance computing, providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. Finding ways to extract unused and idle computing power from these resources, and to fully utilize the underlying new hardware platforms, are major topics in this field of research. This paper introduces the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments; it not only generates a timing-graph representation of a parallel application but also provides charts of the execution's performance data. The goal of the toolkit is to give application developers a better understanding of an application's behavior on the computing nodes selected for a particular execution. Additionally, multiple execution results of a given application under development can be combined and overlapped, permitting developers to perform "what-if" analysis, i.e., to understand more deeply the utilization of allocated computational resources. Experiments with the toolkit have shown its effectiveness in the development and performance tuning of parallel applications, and it has also been used in teaching message-passing and shared-memory parallel programming courses.
Tien-Hsiung Weng
6.
7.
Vijayalakshmi Saravanan Alagan Anpalagan D. P. Kothari Isaac Woungang Mohammad S. Obaidat (Fellow, IEEE; Fellow, SCS) 《The Journal of supercomputing》2014,70(1):465-487
Nowadays, the multi-core processor is the main technology used in desktop PCs, laptop computers and mobile hardware platforms. As the number of cores on a chip keeps increasing, complexity grows and the impact on both the power and the performance of a processor becomes larger. In multi-processors, the number of cores and various parameters, such as issue width, number of instructions and execution time, are key design factors in balancing the amounts of thread-level parallelism and instruction-level parallelism. In this paper, we perform a comprehensive simulation study that aims to find the optimum number of processor cores in desktop/laptop computing processor models with shallow pipeline depth. This paper also explores the trade-off between the number of cores and different parameters used in multi-processors in terms of power–performance gains, and analyzes the impact of 3D stacking on the design of simultaneous multi-threading and chip multiprocessing. Our analysis shows that the optimum number of cores varies with different classes of workloads, namely SPEC2000, SPEC2006 and MiBench. A simulation study is presented using architectures with shorter pipeline depth, showing that (1) the optimum number of cores for power–performance is 8, (2) the optimum number of threads is in the range [2, 4], and (3) beyond 32 cores, multi-core processors are no longer efficient in terms of performance benefits and overall power consumption.
8.
《Parallel Computing》1997,22(13):1837-1851
The PAPS (Performance Analysis of Parallel Systems) toolset is a testbed for the model-based performance prediction of message-passing parallel applications executed on private-memory multiprocessor computer systems. PAPS allows the user to describe the execution behavior of the computer hardware and operating system software resources down to a very detailed level. This enables very accurate performance prediction of parallel applications even in the case of substantial performance degradation due to contention for shared resources. In this paper the fundamental design principles and implementation methodologies for the development of the PAPS toolset are presented and the PAPS parallel system specification formalisms are described. A simplified performance study of a parallel Gaussian elimination application on the nCUBE 2 multiprocessor system is used to demonstrate the usage of the tool.
9.
The Journal of Supercomputing - Biological interaction databases accommodate information about interacted proteins or genes. Clustering on the networks formed by the interaction information for...
10.
The hybrid dynamic parallel scheduling algorithm for load balancing on Chained-Cubic Tree interconnection networks
The Chained-Cubic Tree (CCT) interconnection network topology was recently proposed as a continuation for the extended efforts
in the area of interconnection networks’ performance improvement. This topology, which promises to exhibit the best properties
of the hypercube and tree topologies, needs to be deeply investigated in order to evaluate its performance among other interconnection
networks’ topologies. This work comes as a complementary effort, in which the load balancing technique is investigated as
one of the most important aspects of performance improvement. This paper proposes a new load balancing algorithm on CCT interconnection
networks. The proposed algorithm, called the Hybrid Dynamic Parallel Scheduling Algorithm (HD-PSA), is a combination of two common load balancing strategies: dynamic load balancing and parallel scheduling. The performance of the proposed algorithm is evaluated both analytically and experimentally in terms of various performance metrics, including execution time, load balancing accuracy, communication cost, number of task hops, and task locality.
11.
Keqin Li 《The Journal of supercomputing》2012,60(2):223-247
In this paper, scheduling parallel tasks on multiprocessor computers with dynamically variable voltage and speed is addressed as a combinatorial optimization problem. Two problems are defined, namely, minimizing schedule length with an energy consumption
constraint and minimizing energy consumption with schedule length constraint. The first problem has applications in general
multiprocessor and multicore processor computing systems where energy consumption is an important concern and in mobile computers
where energy conservation is a main concern. The second problem has applications in real-time multiprocessing systems and
environments where timing constraint is a major requirement. Our scheduling problems are defined such that the energy-delay
product is optimized by fixing one factor and minimizing the other. It is noticed that power-aware scheduling of parallel
tasks has rarely been discussed before. Our investigation in this paper makes some initial attempt to energy-efficient scheduling
of parallel tasks on multiprocessor computers with dynamic voltage and speed. Our scheduling problems contain three nontrivial
subproblems, namely, system partitioning, task scheduling, and power supplying. Each subproblem should be solved efficiently,
so that heuristic algorithms with overall good performance can be developed. The above decomposition of our optimization problems
into three subproblems makes design and analysis of heuristic algorithms tractable. A unique feature of our work is to compare
the performance of our algorithms with optimal solutions analytically and validate our results experimentally, not to compare
the performance of heuristic algorithms among themselves only experimentally. The harmonic system partitioning and processor
allocation scheme is used, which divides a multiprocessor computer into clusters of equal sizes and schedules tasks of similar
sizes together to increase processor utilization. A three-level energy/time/power allocation scheme is adopted for a given
schedule, such that the schedule length is minimized by consuming a given amount of energy or the energy consumed is minimized
without missing a given deadline. The performance of our heuristic algorithms is analyzed, and accurate performance bounds
are derived. Simulation data which validate our analytical results are also presented. It is found that our analytical results
provide very accurate estimation of the expected normalized schedule length and the expected normalized energy consumption
and that our heuristic algorithms are able to produce solutions very close to optimum.
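The energy/time trade-off behind the two problems can be illustrated under the common dynamic power model p = s^alpha (an assumption of this sketch; the abstract does not state the paper's exact model): a task of w work units run at speed s takes t = w/s and consumes e = w * s^(alpha - 1), so fixing the energy budget determines the fastest feasible speed, and vice versa.

```python
def speed_for_energy(work, energy, alpha=3.0):
    # Under the dynamic power model p = s**alpha, a task of `work` units
    # run at speed s takes t = work / s and uses e = p * t
    # = work * s**(alpha - 1).  Solve for s given an energy budget.
    return (energy / work) ** (1.0 / (alpha - 1.0))

def exec_time(work, energy, alpha=3.0):
    # The task's contribution to schedule length under its energy budget.
    return work / speed_for_energy(work, energy, alpha)
```

Halving the work at a fixed energy budget more than halves the execution time, because the freed energy allows a higher speed; this coupling is what makes the joint partitioning/scheduling/power-supplying problem nontrivial.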
12.
Diego R. Martínez Julio L. Albín Tomás F. Pena José C. Cabaleiro Francisco F. Rivera Vicente Blanco 《The Journal of supercomputing》2011,58(3):332-340
Predictions based on analytical performance models can be used on efficient scheduling policies in order to select adequate
resources for an optimal execution in terms of throughput and response time. However, developing accurate analytical models
of parallel applications is a hard issue. The TIA (Tools for Instrumenting and Analysis) modeling framework provides an easy-to-use modeling method for obtaining analytical models of MPI applications. This method is based on model selection techniques
and, in particular, on Akaike’s information criterion (AIC). In this paper, first the AIC-based performance model of the HPL
benchmark is obtained using the TIA modeling framework. Then the use of this model for assessing the runtime estimation on
different backfilling policies is analyzed in the GridSim simulator. The behavior of these simulations is compared with the
equivalent simulations based on the theoretical model of the HPL provided by its developers.
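For reference, AIC-based selection among least-squares fits with Gaussian errors can use AIC = n ln(RSS/n) + 2k; the sketch below is a generic illustration of that criterion, not the TIA framework's implementation.

```python
import math

def aic_least_squares(rss, n, k):
    # AIC for a least-squares fit with Gaussian errors:
    # AIC = n * ln(RSS / n) + 2k, where k counts fitted parameters.
    return n * math.log(rss / n) + 2 * k

def select_model(candidates, n):
    # candidates: list of (name, rss, k) tuples; the lowest AIC wins,
    # penalizing extra parameters that barely reduce the residual.
    return min(candidates, key=lambda c: aic_least_squares(c[1], n, c[2]))
```

A richer model with only a marginally smaller residual loses to a simpler one, which is how AIC guards against overfitting the performance data.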
13.
Rosa Filgueira Jesús Carretero David E. Singh Alejandro Calderón Alberto Núñez 《The Journal of supercomputing》2012,59(1):361-391
This work presents an optimization of MPI communications, called Dynamic-CoMPI, which uses two techniques in order to reduce the impact of communications and non-contiguous I/O requests in parallel applications.
These techniques are independent of the application and complementary to each other. The first technique is an optimization
of the Two-Phase collective I/O technique from ROMIO, called Locality aware strategy for Two-Phase I/O (LA-Two-Phase I/O). In order to increase the locality of the file accesses, LA-Two-Phase I/O employs the Linear Assignment Problem (LAP) for finding an optimal I/O data communication schedule. The main purpose of this
technique is the reduction of the number of communications involved in the I/O collective operation. The second technique,
called Adaptive-CoMPI, is based on run-time compression of MPI messages exchanged by applications. Both techniques can be applied on every application,
because both of them are transparent for the users. Dynamic-CoMPI has been validated by using several MPI benchmarks and real HPC applications. The results show that, for many of the considered
scenarios, important reductions in the execution time are achieved by reducing the size and the number of the messages. Additional
benefits of our approach are the reduction of the total communication time and the network contention, thus enhancing, not
only performance, but also scalability.
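A toy analogue of such run-time message compression might look as follows (the header byte and size threshold are illustrative assumptions, not Adaptive-CoMPI's actual wire format): compress only when the message is large enough to plausibly benefit, and fall back to the raw payload when compression does not shrink it.

```python
import zlib

def pack(payload: bytes, threshold: int = 512) -> bytes:
    # Compress only messages at least `threshold` bytes long; a 1-byte
    # header records whether compression was actually applied.
    if len(payload) >= threshold:
        compressed = zlib.compress(payload)
        if len(compressed) < len(payload):
            return b"\x01" + compressed
    return b"\x00" + payload

def unpack(msg: bytes) -> bytes:
    # Inspect the header byte and undo compression if it was applied.
    return zlib.decompress(msg[1:]) if msg[:1] == b"\x01" else msg[1:]
```

The point of the threshold is that compressing tiny messages costs CPU time without reducing network traffic, so an adaptive scheme must be able to skip them.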
14.
Mesh of trees (MOT) is well known for its small diameter, high bisection width, simple decomposability and area universality.
On the other hand, OTIS (Optical Transpose Interconnection System) provides an efficient optoelectronic model for massively
parallel processing system. In this paper, we present OTIS-MOT as a competent candidate for a two-tier architecture that can
take advantage of both the OTIS and the MOT. We show that an n^4-processor OTIS-MOT has diameter 8 log n + 1 (the base of the logarithm is assumed to be 2 throughout this paper) and fault diameter 8 log n + 2 under single node failure. We establish other topological properties such as the bisection width, multiple paths and the modularity.
We show that many communication as well as application algorithms can run on this network in comparable time or even faster
than other similar tree-based two-tier architectures. The communication algorithms including row/column-group broadcast and
one-to-all broadcast are shown to require O(log n) time, multicast in O(n^2 log n) time and the bit-reverse permutation in O(n) time. Many parallel algorithms for various problems such as finding polynomial zeros, sales forecasting, matrix-vector multiplication
and the DFT computation are proposed to map in O(log n) time. Sorting and prefix computation are also shown to run in O(log n) time.
15.
In this paper, we present a particle swarm optimizer (PSO) to solve the variable weighting problem in projected clustering
of high-dimensional data. Many subspace clustering algorithms fail to yield good cluster quality because they do not employ
an efficient search strategy. In this paper, we are interested in soft projected clustering. We design a suitable k-means objective weighting function, in which a change of variable weights is exponentially reflected. We also transform the
original constrained variable weighting problem into a problem with bound constraints, using a normalized representation of
variable weights, and we utilize a particle swarm optimizer to minimize the objective function in order to search for global
optima to the variable weighting problem in clustering. Our experimental results on both synthetic and real data show that
the proposed algorithm greatly improves cluster quality. In addition, the results of the new algorithm are much less dependent
on the initial cluster centroids. In an application to text clustering, we show that the algorithm can be easily adapted to
other similarity measures, such as the extended Jaccard coefficient for text data, and can be very effective.
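A minimal bound-constrained particle swarm optimizer of the kind the abstract relies on might look like this (a generic sketch; the paper's weighting objective and parameter settings are not reproduced here):

```python
import random

def pso_minimize(f, dim, bounds=(0.0, 1.0), n_particles=20, iters=100, seed=0):
    # Generic gbest PSO with bound constraints on every coordinate.
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Clamp each coordinate back into the feasible box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In the paper's setting, `f` would be the k-means-style weighting objective over normalized variable weights; here any bound-constrained function works.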
16.
17.
Layali Rashid Wessam M. Hassanein Moustafa A. Hammad 《The Journal of supercomputing》2010,53(2):293-312
The sort operation is a core part of many critical applications (e.g., database management systems). Despite large efforts to parallelize it, its heavy data dependencies severely limit its performance. Multithreaded architectures are emerging as a leading technology in state-of-the-art processors. These architectures include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. We achieve speedups of up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with 4 multithreaded processors, compared to the single-threaded versions. We find that since radix sort is CPU-intensive, it exhibits better results on chip multiprocessors, where multiple CPUs are available, while quicksort achieves speedups on all types of multithreaded processors due to its ability to overlap memory-miss latencies with other useful processing.
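For reference, the LSD radix sort underlying such parallel versions proceeds digit by digit (a sequential sketch under a power-of-two base assumption; the parallel variants discussed in the paper distribute the per-digit passes across threads, which is not shown here):

```python
def radix_sort(values, base=256):
    # LSD radix sort for non-negative integers: each pass is a stable
    # bucket pass on one digit, from least to most significant.
    if not values:
        return []
    out = list(values)
    shift = 0
    while max(out) >> shift:  # stop once every remaining digit is zero
        buckets = [[] for _ in range(base)]
        for v in out:
            buckets[(v >> shift) % base].append(v)
        out = [v for b in buckets for v in b]
        shift += base.bit_length() - 1  # advance by one digit (8 bits here)
    return out
```

The stability of each pass is what lets later (more significant) digits preserve the order established by earlier ones.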
18.
In this paper we continue the study, which was initiated in (Ben-Artzi et al. in Math. Model. Numer. Anal. 35(2):313–303,
2001; Fishelov et al. in Lecture Notes in Computer Science, vol. 2667, pp. 809–817, 2003; Ben-Artzi et al. in J. Comput. Phys. 205(2):640–664, 2005 and SIAM J. Numer. Anal. 44(5):1997–2024, 2006) of the numerical resolution of the pure streamfunction formulation of the time-dependent two-dimensional Navier-Stokes equation.
Here we focus on enhancing our second-order scheme, introduced in the last three aforementioned articles, to fourth order
accuracy. We construct fourth order approximations for the Laplacian, the biharmonic and the nonlinear convective operators.
The scheme is compact (nine-point stencil) for the Laplacian and the biharmonic operators, which are both treated implicitly
in the time-stepping scheme. The approximation of the convective term is compact in the no-leak boundary conditions case and
is nearly compact (thirteen points stencil) in the case of general boundary conditions. However, we stress that in any case
no unphysical boundary condition was applied to our scheme. Numerical results demonstrate that the fourth order accuracy is actually obtained for several test cases.
19.
Chao-Chin Wu Chao-Tung Yang Kuan-Chou Lai Po-Hsun Chiu 《The Journal of supercomputing》2012,59(1):42-60
Loop scheduling on parallel and distributed systems has been thoroughly investigated in the past. However, none of these studies
considered the multi-core architecture feature for emerging grid systems. Although there have been many studies proposed to
employ the hybrid MPI and OpenMP programming model to exploit different levels of parallelism for a distributed system with
multi-core computers, none of them were aimed at parallel loop self-scheduling. Therefore, this paper investigates how to
employ the hybrid MPI and OpenMP model to design a parallel loop self-scheduling scheme adapted to the multi-core architecture
for emerging grid systems. Three different featured applications are implemented and evaluated to demonstrate the effectiveness
of the proposed scheduling approach. The experimental results show that the proposed approach outperforms the previous work
for the three applications and the speedups range from 1.13 to 1.75.
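Loop self-scheduling schemes of the kind studied here typically hand out shrinking chunks of the remaining iterations, so early requests get large chunks and the tail smooths out load imbalance. A generic guided-self-scheduling sketch (not the paper's specific hybrid scheme):

```python
def chunk_sizes(total_iters, n_workers, min_chunk=1):
    # Guided self-scheduling: each request claims remaining / n_workers
    # iterations, so chunk sizes decrease geometrically toward min_chunk.
    remaining, chunks = total_iters, []
    while remaining > 0:
        chunk = max(min_chunk, remaining // n_workers)
        chunk = min(chunk, remaining)  # never claim more than is left
        chunks.append(chunk)
        remaining -= chunk
    return chunks
```

In a hybrid MPI/OpenMP setting, the outer chunks would go to MPI nodes and each node would subdivide its chunk among OpenMP threads; this sketch shows only the chunk-size schedule itself.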
20.
The paper addresses the problem of multi-slot just-in-time scheduling. Unlike the existing literature on this subject, it
studies a more general criterion: the minimization of the schedule makespan rather than of the number of slots used by the schedule. It gives an O(n log^2 n)-time optimization algorithm for the single-machine problem. For an arbitrary number m > 1 of identical parallel machines, it presents an O(n log n)-time optimization algorithm for the case when the processing time of each job does not exceed its due date. For the general case on m > 1 machines, it proposes a polynomial-time constant-factor approximation algorithm.