20 similar documents found; search took 15 ms
1.
Pedro Furtado 《Distributed and Parallel Databases》2009,25(1-2):71-96
Consider data warehouses as large data repositories queried for analysis and data mining in a variety of application contexts. A query over such data may take a large amount of time to be processed in a regular PC. Consider partitioning the data into a set of PCs (nodes), with either a parallel database server or any database server at each node and an engine-independent middleware. Nodes and network may not even be fully dedicated to the data warehouse. In such a scenario, care must be taken to handle processing heterogeneity and availability, so we study and propose efficient solutions for this. We concentrate on three main contributions: a performance-wise index, measuring relative performance; a replication degree; and a flexible chunk-wise organization with on-demand processing. These contributions extend previous work on declustering and replication and are generic in the sense that they can be applied in very different contexts and with different data partitioning approaches. We evaluate their merits with a prototype implementation of the system.
2.
Željko Vrba Pål Halvorsen Carsten Griwodz Paul Beskow Håvard Espeland Dag Johansen 《The Journal of supercomputing》2013,63(1):191-217
Even though shared-memory concurrency is a paradigm frequently used for developing parallel applications on small- and middle-sized machines, experience has shown that it is hard to use. This is largely caused by synchronization primitives which are low-level, inherently non-deterministic, and, consequently, non-intuitive to use. In this paper, we present the Nornir run-time system. Nornir is comparable to well-known frameworks such as MapReduce and Dryad that are recognized for their efficiency and simplicity. Unlike these frameworks, Nornir also supports process structures containing branches and cycles. Nornir is based on the formalism of Kahn process networks, which is a shared-nothing, message-passing model of concurrency. We deem this model a simple and deterministic alternative to shared-memory concurrency. Experiments with real and synthetic benchmarks on up to 8 CPUs show that performance in most cases scales almost linearly with the number of CPUs, when not limited by data dependencies. We also show that the modeling flexibility allows Nornir to outperform its MapReduce counterparts using well-known benchmarks.
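To make the Kahn-process-network idea concrete, here is a minimal sketch (not Nornir's actual API, which this listing does not show): each process owns no shared state and communicates only through blocking FIFO channels, which is what makes execution deterministic.

```python
import threading
from queue import Queue

def producer(out_q, items):
    # A process communicates only via its channels (shared-nothing).
    for x in items:
        out_q.put(x)
    out_q.put(None)  # end-of-stream marker

def doubler(in_q, out_q):
    # Blocking reads on a single input channel keep behavior deterministic.
    while (x := in_q.get()) is not None:
        out_q.put(2 * x)
    out_q.put(None)

def run_pipeline(items):
    a, b = Queue(), Queue()
    stages = [
        threading.Thread(target=producer, args=(a, items)),
        threading.Thread(target=doubler, args=(a, b)),
    ]
    for t in stages:
        t.start()
    results = []
    while (x := b.get()) is not None:
        results.append(x)
    for t in stages:
        t.join()
    return results
```

Because each stage blocks on exactly one input channel, the output order is fully determined by the input, in contrast with lock-based shared-memory code.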
3.
Guillermo L. Taboada Sabela Ramos Juan Touriño Ramón Doallo 《The Journal of supercomputing》2011,55(2):126-154
This paper presents a scalable and efficient Message-Passing in Java (MPJ) collective communication library for parallel computing
on multi-core architectures. The continuous increase in the number of cores per processor underscores the need for scalable
parallel solutions. Moreover, current system deployments are usually multi-core clusters, a hybrid shared/distributed memory
architecture which increases the complexity of communication protocols. Here, Java represents an attractive choice for the
development of communication middleware for these systems, as it provides built-in networking and multithreading support.
As the performance gap between Java and compiled languages has been narrowing over the last few years, Java is an emerging option for High Performance Computing (HPC).
4.
Keqin Li 《The Journal of supercomputing》2010,54(3):271-297
We present fast and highly scalable parallel computations for a number of important and fundamental matrix problems on distributed
memory systems (DMS). These problems include matrix multiplication, matrix chain product, and computing the powers, the inverse,
the characteristic polynomial, the determinant, the rank, the Krylov matrix, and an LU- and a QR-factorization of a matrix,
and solving linear systems of equations. Our highly scalable parallel computations for these problems are based on a highly
scalable implementation of the fastest sequential matrix multiplication algorithm on DMS. We show that compared with the best
known parallel time complexities on parallel random access machines (PRAM), the most powerful but unrealistic shared memory
model of parallel computing, our parallel matrix computations achieve the same speeds on distributed memory parallel computers
(DMPC), and have an extra polylog factor in the time complexities on DMS with hypercubic networks. Furthermore, our parallel
matrix computations are fully scalable on DMPC and highly scalable over a wide range of system size on DMS with hypercubic
networks. Such fast (in terms of parallel time complexity) and highly scalable (in terms of our definition of scalability)
parallel matrix computations were rarely seen before on any distributed memory systems.
5.
Advances in computer technology, together with the rapid emergence of multi-core processors, have made many-core personal computers available and affordable. Networks of workstations and clusters of many-core SMPs have become an attractive platform for high performance computing, providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. Finding ways to extract unused and idle computing power from these resources, and to fully utilize the underlying new hardware platforms, are major topics in this field of research. This paper introduces the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments; it not only generates a timing-graph representation of a parallel application but also provides charts of the execution's performance data. The goal of the toolkit is to give application developers a better understanding of an application's behavior on the computing nodes selected for a particular execution. Additionally, multiple execution results of a given application under development can be combined and overlapped, permitting developers to perform "what-if" analysis, i.e., to understand more deeply the utilization of allocated computational resources. Experiments with the toolkit have shown its effectiveness in the development and performance tuning of parallel applications, and it has also been used in teaching message-passing and shared-memory parallel programming courses.
Tien-Hsiung Weng
6.
7.
Vijayalakshmi Saravanan Alagan Anpalagan D. P. Kothari Isaac Woungang Mohammad S. Obaidat (Fellow, IEEE; Fellow, SCS) 《The Journal of supercomputing》2014,70(1):465-487
Nowadays, the multi-core processor is the main technology used in desktop PCs, laptop computers and mobile hardware platforms. As the number of cores on a chip keeps increasing, complexity grows and the impact on both the power and the performance of a processor becomes larger. In multi-processors, the number of cores and various parameters, such as issue width, number of instructions and execution time, are key design factors in balancing the amounts of thread-level parallelism and instruction-level parallelism. In this paper, we perform a comprehensive simulation study that aims to find the optimum number of processor cores in desktop/laptop computing processor models with shallow pipeline depth. This paper also explores the trade-off between the number of cores and different parameters used in multi-processors in terms of power–performance gains, and analyzes the impact of 3D stacking on the design of simultaneous multi-threading and chip multiprocessing. Our analysis shows that the optimum number of cores varies with different classes of workloads, namely SPEC2000, SPEC2006 and MiBench. A simulation study is presented using architectures with shorter pipeline depth, showing that (1) the optimum number of cores for power–performance is 8, (2) the optimum number of threads is in the range [2, 4], and (3) beyond 32 cores, multi-core processors are no longer efficient in terms of performance benefits and overall power consumption.
8.
《Parallel Computing》1997,22(13):1837-1851
The PAPS (Performance Analysis of Parallel Systems) toolset is a testbed for the model-based performance prediction of message-passing parallel applications executed on private-memory multiprocessor computer systems. PAPS allows the user to describe the execution behavior of the computer hardware and operating system software resources down to a very detailed level. This enables very accurate performance prediction of parallel applications even in the case of substantial performance degradation due to contention for shared resources. In this paper the fundamental design principles and implementation methodologies for the development of the PAPS toolset are presented and the PAPS parallel system specification formalisms are described. A simplified performance study of a parallel Gaussian elimination application on the nCUBE 2 multiprocessor system is used to demonstrate the usage of the tool.
9.
The Journal of Supercomputing - Biological interaction databases accommodate information about interacted proteins or genes. Clustering on the networks formed by the interaction information for...
10.
The hybrid dynamic parallel scheduling algorithm for load balancing on Chained-Cubic Tree interconnection networks
The Chained-Cubic Tree (CCT) interconnection network topology was recently proposed as a continuation for the extended efforts
in the area of interconnection networks’ performance improvement. This topology, which promises to exhibit the best properties
of the hypercube and tree topologies, needs to be deeply investigated in order to evaluate its performance among other interconnection
networks’ topologies. This work comes as a complementary effort, in which the load balancing technique is investigated as
one of the most important aspects of performance improvement. This paper proposes a new load balancing algorithm on CCT interconnection
networks. The proposed algorithm, called the Hybrid Dynamic Parallel Scheduling Algorithm (HD-PSA), is a combination of two common load balancing strategies: dynamic load balancing and parallel scheduling. The performance of the proposed algorithm is evaluated both analytically and experimentally in terms of various performance metrics, including execution time, load balancing accuracy, communication cost, number of task hops, and task locality.
11.
Keqin Li 《The Journal of supercomputing》2012,60(2):223-247
In this paper, scheduling parallel tasks on multiprocessor computers with dynamically variable voltage and speed is addressed as a combinatorial optimization problem. Two problems are defined, namely, minimizing schedule length with an energy consumption
constraint and minimizing energy consumption with schedule length constraint. The first problem has applications in general
multiprocessor and multicore processor computing systems where energy consumption is an important concern and in mobile computers
where energy conservation is a main concern. The second problem has applications in real-time multiprocessing systems and
environments where timing constraint is a major requirement. Our scheduling problems are defined such that the energy-delay
product is optimized by fixing one factor and minimizing the other. It is noticed that power-aware scheduling of parallel
tasks has rarely been discussed before. Our investigation in this paper makes some initial attempt to energy-efficient scheduling
of parallel tasks on multiprocessor computers with dynamic voltage and speed. Our scheduling problems contain three nontrivial
subproblems, namely, system partitioning, task scheduling, and power supplying. Each subproblem should be solved efficiently,
so that heuristic algorithms with overall good performance can be developed. The above decomposition of our optimization problems
into three subproblems makes design and analysis of heuristic algorithms tractable. A unique feature of our work is to compare
the performance of our algorithms with optimal solutions analytically and validate our results experimentally, not to compare
the performance of heuristic algorithms among themselves only experimentally. The harmonic system partitioning and processor
allocation scheme is used, which divides a multiprocessor computer into clusters of equal sizes and schedules tasks of similar
sizes together to increase processor utilization. A three-level energy/time/power allocation scheme is adopted for a given
schedule, such that the schedule length is minimized by consuming a given amount of energy or the energy consumed is minimized
without missing a given deadline. The performance of our heuristic algorithms is analyzed, and accurate performance bounds
are derived. Simulation data which validate our analytical results are also presented. It is found that our analytical results
provide very accurate estimation of the expected normalized schedule length and the expected normalized energy consumption
and that our heuristic algorithms are able to produce solutions very close to optimum.
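The energy/time trade-off behind the two problems can be illustrated under the common dynamic power model p = s^alpha (an assumption of this sketch; the abstract does not state the paper's exact model): a task of w work units run at speed s takes t = w/s and consumes e = w * s^(alpha - 1), so fixing the energy budget determines the fastest feasible speed, and vice versa.

```python
def speed_for_energy(work, energy, alpha=3.0):
    # Under the dynamic power model p = s**alpha, a task of `work` units
    # run at speed s takes t = work / s and uses e = p * t
    # = work * s**(alpha - 1).  Solve for s given an energy budget.
    return (energy / work) ** (1.0 / (alpha - 1.0))

def exec_time(work, energy, alpha=3.0):
    # The task's contribution to schedule length under its energy budget.
    return work / speed_for_energy(work, energy, alpha)
```

Halving the work at a fixed energy budget more than halves the execution time, because the freed energy allows a higher speed; this coupling is what makes the joint partitioning/scheduling/power-supplying problem nontrivial.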
12.
Diego R. Martínez Julio L. Albín Tomás F. Pena José C. Cabaleiro Francisco F. Rivera Vicente Blanco 《The Journal of supercomputing》2011,58(3):332-340
Predictions based on analytical performance models can be used on efficient scheduling policies in order to select adequate
resources for an optimal execution in terms of throughput and response time. However, developing accurate analytical models
of parallel applications is a hard issue. The TIA (Tools for Instrumenting and Analysis) modeling framework provides an easy-to-use modeling method for obtaining analytical models of MPI applications. This method is based on model selection techniques
and, in particular, on Akaike’s information criterion (AIC). In this paper, first the AIC-based performance model of the HPL
benchmark is obtained using the TIA modeling framework. Then the use of this model for assessing the runtime estimation on
different backfilling policies is analyzed in the GridSim simulator. The behavior of these simulations is compared with the
equivalent simulations based on the theoretical model of the HPL provided by its developers.
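For reference, AIC-based selection among least-squares fits with Gaussian errors can use AIC = n ln(RSS/n) + 2k; the sketch below is a generic illustration of that criterion, not the TIA framework's implementation.

```python
import math

def aic_least_squares(rss, n, k):
    # AIC for a least-squares fit with Gaussian errors:
    # AIC = n * ln(RSS / n) + 2k, where k counts fitted parameters.
    return n * math.log(rss / n) + 2 * k

def select_model(candidates, n):
    # candidates: list of (name, rss, k) tuples; the lowest AIC wins,
    # penalizing extra parameters that barely reduce the residual.
    return min(candidates, key=lambda c: aic_least_squares(c[1], n, c[2]))
```

A richer model with only a marginally smaller residual loses to a simpler one, which is how AIC guards against overfitting the performance data.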
13.
Rosa Filgueira Jesús Carretero David E. Singh Alejandro Calderón Alberto Núñez 《The Journal of supercomputing》2012,59(1):361-391
This work presents an optimization of MPI communications, called Dynamic-CoMPI, which uses two techniques in order to reduce the impact of communications and non-contiguous I/O requests in parallel applications.
These techniques are independent of the application and complementary to each other. The first technique is an optimization
of the Two-Phase collective I/O technique from ROMIO, called Locality aware strategy for Two-Phase I/O (LA-Two-Phase I/O). In order to increase the locality of the file accesses, LA-Two-Phase I/O employs the Linear Assignment Problem (LAP) for finding an optimal I/O data communication schedule. The main purpose of this
technique is the reduction of the number of communications involved in the I/O collective operation. The second technique,
called Adaptive-CoMPI, is based on run-time compression of MPI messages exchanged by applications. Both techniques can be applied on every application,
because both of them are transparent for the users. Dynamic-CoMPI has been validated by using several MPI benchmarks and real HPC applications. The results show that, for many of the considered
scenarios, important reductions in the execution time are achieved by reducing the size and the number of the messages. Additional
benefits of our approach are the reduction of the total communication time and the network contention, thus enhancing, not
only performance, but also scalability.
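A toy analogue of such run-time message compression might look as follows (the header byte and size threshold are illustrative assumptions, not Adaptive-CoMPI's actual wire format): compress only when the message is large enough to plausibly benefit, and fall back to the raw payload when compression does not shrink it.

```python
import zlib

def pack(payload: bytes, threshold: int = 512) -> bytes:
    # Compress only messages at least `threshold` bytes long; a 1-byte
    # header records whether compression was actually applied.
    if len(payload) >= threshold:
        compressed = zlib.compress(payload)
        if len(compressed) < len(payload):
            return b"\x01" + compressed
    return b"\x00" + payload

def unpack(msg: bytes) -> bytes:
    # Inspect the header byte and undo compression if it was applied.
    return zlib.decompress(msg[1:]) if msg[:1] == b"\x01" else msg[1:]
```

The point of the threshold is that compressing tiny messages costs CPU time without reducing network traffic, so an adaptive scheme must be able to skip them.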
14.
Mesh of trees (MOT) is well known for its small diameter, high bisection width, simple decomposability and area universality.
On the other hand, OTIS (Optical Transpose Interconnection System) provides an efficient optoelectronic model for massively
parallel processing system. In this paper, we present OTIS-MOT as a competent candidate for a two-tier architecture that can
take advantage of both the OTIS and the MOT. We show that an n^4-processor OTIS-MOT has diameter 8 log n + 1 (the base of the logarithm is assumed to be 2 throughout this paper) and fault diameter 8 log n + 2 under single node failure. We establish other topological properties such as the bisection width, multiple paths and the modularity.
We show that many communication as well as application algorithms can run on this network in comparable time or even faster
than other similar tree-based two-tier architectures. The communication algorithms including row/column-group broadcast and
one-to-all broadcast are shown to require O(log n) time, multicast in O(n^2 log n) time and the bit-reverse permutation in O(n) time. Many parallel algorithms for various problems such as finding polynomial zeros, sales forecasting, matrix-vector multiplication
and the DFT computation are proposed to map in O(log n) time. Sorting and prefix computation are also shown to run in O(log n) time.
15.
In this paper, we present a particle swarm optimizer (PSO) to solve the variable weighting problem in projected clustering
of high-dimensional data. Many subspace clustering algorithms fail to yield good cluster quality because they do not employ
an efficient search strategy. In this paper, we are interested in soft projected clustering. We design a suitable k-means objective weighting function, in which a change of variable weights is exponentially reflected. We also transform the
original constrained variable weighting problem into a problem with bound constraints, using a normalized representation of
variable weights, and we utilize a particle swarm optimizer to minimize the objective function in order to search for global
optima to the variable weighting problem in clustering. Our experimental results on both synthetic and real data show that
the proposed algorithm greatly improves cluster quality. In addition, the results of the new algorithm are much less dependent
on the initial cluster centroids. In an application to text clustering, we show that the algorithm can be easily adapted to
other similarity measures, such as the extended Jaccard coefficient for text data, and can be very effective.
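A minimal bound-constrained particle swarm optimizer of the kind the abstract relies on might look like this (a generic sketch; the paper's weighting objective and parameter settings are not reproduced here):

```python
import random

def pso_minimize(f, dim, bounds=(0.0, 1.0), n_particles=20, iters=100, seed=0):
    # Generic gbest PSO with bound constraints on every coordinate.
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Clamp each coordinate back into the feasible box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In the paper's setting, `f` would be the k-means-style weighting objective over normalized variable weights; here any bound-constrained function works.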
16.
17.
Layali Rashid Wessam M. Hassanein Moustafa A. Hammad 《The Journal of supercomputing》2010,53(2):293-312
The sort operation is a core part of many critical applications (e.g., database management systems). Despite large efforts to parallelize it, its heavy data dependencies severely limit its performance. Multithreaded architectures are emerging as a leading technology in state-of-the-art processors. These architectures include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. We achieve speedups of up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with 4 multithreaded processors, compared to the single-threaded versions. We find that since radix sort is CPU-intensive, it exhibits better results on chip multiprocessors, where multiple CPUs are available, while quicksort achieves speedups on all types of multithreaded processors due to its ability to overlap memory-miss latencies with other useful processing.
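For reference, the LSD radix sort underlying such parallel versions proceeds digit by digit (a sequential sketch under a power-of-two base assumption; the parallel variants discussed in the paper distribute the per-digit passes across threads, which is not shown here):

```python
def radix_sort(values, base=256):
    # LSD radix sort for non-negative integers: each pass is a stable
    # bucket pass on one digit, from least to most significant.
    if not values:
        return []
    out = list(values)
    shift = 0
    while max(out) >> shift:  # stop once every remaining digit is zero
        buckets = [[] for _ in range(base)]
        for v in out:
            buckets[(v >> shift) % base].append(v)
        out = [v for b in buckets for v in b]
        shift += base.bit_length() - 1  # advance by one digit (8 bits here)
    return out
```

The stability of each pass is what lets later (more significant) digits preserve the order established by earlier ones.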
18.
In this paper we continue the study, which was initiated in (Ben-Artzi et al. in Math. Model. Numer. Anal. 35(2):313–303,
2001; Fishelov et al. in Lecture Notes in Computer Science, vol. 2667, pp. 809–817, 2003; Ben-Artzi et al. in J. Comput. Phys. 205(2):640–664, 2005 and SIAM J. Numer. Anal. 44(5):1997–2024, 2006) of the numerical resolution of the pure streamfunction formulation of the time-dependent two-dimensional Navier-Stokes equation.
Here we focus on enhancing our second-order scheme, introduced in the last three aforementioned articles, to fourth order
accuracy. We construct fourth order approximations for the Laplacian, the biharmonic and the nonlinear convective operators.
The scheme is compact (nine-point stencil) for the Laplacian and the biharmonic operators, which are both treated implicitly
in the time-stepping scheme. The approximation of the convective term is compact in the no-leak boundary conditions case and
is nearly compact (thirteen points stencil) in the case of general boundary conditions. However, we stress that in any case
no unphysical boundary condition was applied to our scheme. Numerical results demonstrate that the fourth order accuracy is actually obtained for several test cases.
19.
Chao-Chin Wu Chao-Tung Yang Kuan-Chou Lai Po-Hsun Chiu 《The Journal of supercomputing》2012,59(1):42-60
Loop scheduling on parallel and distributed systems has been thoroughly investigated in the past. However, none of these studies
considered the multi-core architecture feature for emerging grid systems. Although there have been many studies proposed to
employ the hybrid MPI and OpenMP programming model to exploit different levels of parallelism for a distributed system with
multi-core computers, none of them were aimed at parallel loop self-scheduling. Therefore, this paper investigates how to
employ the hybrid MPI and OpenMP model to design a parallel loop self-scheduling scheme adapted to the multi-core architecture
for emerging grid systems. Three different featured applications are implemented and evaluated to demonstrate the effectiveness
of the proposed scheduling approach. The experimental results show that the proposed approach outperforms the previous work
for the three applications and the speedups range from 1.13 to 1.75.
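Loop self-scheduling schemes of the kind studied here typically hand out shrinking chunks of the remaining iterations, so early requests get large chunks and the tail smooths out load imbalance. A generic guided-self-scheduling sketch (not the paper's specific hybrid scheme):

```python
def chunk_sizes(total_iters, n_workers, min_chunk=1):
    # Guided self-scheduling: each request claims remaining / n_workers
    # iterations, so chunk sizes decrease geometrically toward min_chunk.
    remaining, chunks = total_iters, []
    while remaining > 0:
        chunk = max(min_chunk, remaining // n_workers)
        chunk = min(chunk, remaining)  # never claim more than is left
        chunks.append(chunk)
        remaining -= chunk
    return chunks
```

In a hybrid MPI/OpenMP setting, the outer chunks would go to MPI nodes and each node would subdivide its chunk among OpenMP threads; this sketch shows only the chunk-size schedule itself.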
20.
The paper addresses the problem of multi-slot just-in-time scheduling. Unlike the existing literature on this subject, it
studies a more general criterion: the minimization of the schedule makespan rather than of the number of slots used by the schedule. It gives an O(n log^2 n)-time optimization algorithm for the single-machine problem. For an arbitrary number m > 1 of identical parallel machines, it presents an O(n log n)-time optimization algorithm for the case when the processing time of each job does not exceed its due date. For the general case on m > 1 machines, it proposes a polynomial-time constant-factor approximation algorithm.