Similar articles
 20 similar articles found (search time: 15 ms)
1.
Consider data warehouses as large data repositories queried for analysis and data mining in a variety of application contexts. A query over such data may take a long time to process on a single PC. Consider instead partitioning the data across a set of PCs (nodes), with either a parallel database server or an ordinary database server at each node and an engine-independent middleware layer. The nodes and the network may not even be fully dedicated to the data warehouse. In such a scenario, care must be taken to handle processing heterogeneity and availability, so we study and propose efficient solutions for this problem. We concentrate on three main contributions: a performance-wise index measuring relative node performance; a replication degree; and a flexible chunk-wise organization with on-demand processing. These contributions extend previous work on de-clustering and replication and are generic in the sense that they can be applied in very different contexts and with different data partitioning approaches. We evaluate their merits with a prototype implementation of the system.
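As an illustration of how a performance-wise index and on-demand chunk processing can fit together (a minimal standalone sketch, not the paper's middleware; all constants and names are assumed): each node's measured throughput is normalized against the fastest node, and because idle nodes pull chunks on demand, a node's expected share of chunks ends up roughly proportional to its throughput.

```c
#include <stdio.h>

#define NODES 4

int main(void) {
    /* Illustrative throughputs (chunks/s) measured per node; the values
     * are assumptions, not taken from the paper. */
    double throughput[NODES] = { 120.0, 80.0, 80.0, 40.0 };
    int total_chunks = 64;

    double best = throughput[0], sum = 0.0;
    for (int i = 0; i < NODES; i++) {
        if (throughput[i] > best) best = throughput[i];
        sum += throughput[i];
    }

    for (int i = 0; i < NODES; i++) {
        /* Performance-wise index: relative speed w.r.t. the fastest node. */
        double index = throughput[i] / best;
        /* Expected share of chunks if chunks are handed out on demand:
         * a node that is twice as fast takes about twice as many chunks. */
        double expected_share = throughput[i] / sum * total_chunks;
        printf("node %d: index %.2f, expected ~%.0f of %d chunks\n",
               i, index, expected_share, total_chunks);
    }
    return 0;
}
```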

2.
Even though shared-memory concurrency is a paradigm frequently used for developing parallel applications on small and medium-sized machines, experience has shown that it is hard to use. This is largely caused by synchronization primitives that are low-level, inherently non-deterministic and, consequently, non-intuitive to use. In this paper, we present the Nornir run-time system. Nornir is comparable to well-known frameworks such as MapReduce and Dryad, which are recognized for their efficiency and simplicity. Unlike these frameworks, Nornir also supports process structures containing branches and cycles. Nornir is based on the formalism of Kahn process networks, a shared-nothing, message-passing model of concurrency. We deem this model a simple and deterministic alternative to shared-memory concurrency. Experiments with real and synthetic benchmarks on up to 8 CPUs show that performance in most cases scales almost linearly with the number of CPUs when not limited by data dependencies. We also show that its modeling flexibility allows Nornir to outperform its MapReduce counterparts on well-known benchmarks.
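A minimal sketch of the Kahn-process-network idea the abstract builds on (plain POSIX C, not the Nornir API): processes interact only through blocking FIFO channels, so the result is deterministic regardless of how the processes are scheduled.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* A two-stage Kahn-style pipeline: a producer writes integers into a
 * channel, a consumer blocks on reads and sums them.  Determinism comes
 * from the fact that the only interaction is via the FIFO channel. */
int main(void) {
    int channel[2];                 /* channel[0] = read end, [1] = write end */
    if (pipe(channel) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                 /* producer process */
        close(channel[0]);
        for (int i = 1; i <= 10; i++)
            write(channel[1], &i, sizeof i);
        close(channel[1]);          /* closing the channel signals completion */
        _exit(0);
    }

    /* consumer process (parent) */
    close(channel[1]);
    int value, sum = 0;
    while (read(channel[0], &value, sizeof value) == sizeof value)
        sum += value;               /* read blocks until data is available */
    close(channel[0]);
    waitpid(pid, NULL, 0);

    printf("sum = %d\n", sum);      /* always 55, independent of scheduling */
    return 0;
}
```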

3.
This paper presents a scalable and efficient Message-Passing in Java (MPJ) collective communication library for parallel computing on multi-core architectures. The continuous increase in the number of cores per processor underscores the need for scalable parallel solutions. Moreover, current system deployments are usually multi-core clusters, a hybrid shared/distributed memory architecture that increases the complexity of communication protocols. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the performance gap between Java and compiled languages has been narrowing in recent years, Java is an emerging option for High Performance Computing (HPC).

4.
We present fast and highly scalable parallel computations for a number of important and fundamental matrix problems on distributed memory systems (DMS). These problems include matrix multiplication, matrix chain product, computing the powers, the inverse, the characteristic polynomial, the determinant, the rank, the Krylov matrix, and the LU and QR factorizations of a matrix, as well as solving linear systems of equations. Our highly scalable parallel computations for these problems are based on a highly scalable implementation of the fastest sequential matrix multiplication algorithm on DMS. We show that, compared with the best known parallel time complexities on parallel random access machines (PRAM), the most powerful but unrealistic shared memory model of parallel computing, our parallel matrix computations achieve the same speeds on distributed memory parallel computers (DMPC) and have an extra polylog factor in the time complexities on DMS with hypercubic networks. Furthermore, our parallel matrix computations are fully scalable on DMPC and highly scalable over a wide range of system sizes on DMS with hypercubic networks. Such fast (in terms of parallel time complexity) and highly scalable (in terms of our definition of scalability) parallel matrix computations had rarely been seen before on distributed memory systems.
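One of the simplest of these reductions, computing matrix powers from matrix multiplication, can be sketched as follows; the plain triple-loop kernel stands in for the fast, scalable parallel multiplication the paper actually uses, and the point is only that A^p needs O(log p) multiplications via repeated squaring.

```c
#include <stdio.h>
#include <string.h>

#define N 3   /* small fixed size for illustration */

/* C = A * B, plain O(N^3) multiply; the paper replaces this kernel
 * with a fast, highly scalable parallel multiplication. */
static void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

/* R = A^p via repeated squaring: O(log p) multiplications. */
static void matpow(double A[N][N], unsigned p, double R[N][N]) {
    double base[N][N], tmp[N][N];
    memcpy(base, A, sizeof base);
    /* start from the identity matrix */
    memset(R, 0, N * N * sizeof(double));
    for (int i = 0; i < N; i++) R[i][i] = 1.0;

    while (p > 0) {
        if (p & 1u) { matmul(R, base, tmp); memcpy(R, tmp, sizeof tmp); }
        matmul(base, base, tmp); memcpy(base, tmp, sizeof tmp);
        p >>= 1;
    }
}

int main(void) {
    double A[N][N] = { {1,1,0}, {0,1,1}, {0,0,1} }, R[N][N];
    matpow(A, 5, R);
    for (int i = 0; i < N; i++)
        printf("%6.1f %6.1f %6.1f\n", R[i][0], R[i][1], R[i][2]);
    return 0;
}
```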

5.
Advances in computer technology, coupled with the rapid emergence of multicore processor technology, have made many-core personal computers widely available and more affordable. The availability of networks of workstations and clusters of many-core SMPs has made them an attractive solution for high performance computing, providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. Finding ways to extract unused and idle computing power from these resources in order to improve overall performance, and to fully utilize the underlying new hardware platforms, are major topics in this field of research. In this paper, we introduce the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments; it not only generates a timing-graph representation of the parallel application, but also provides charts of the application's execution performance data. The goal in developing this toolkit is to allow application developers to better understand the application's behavior on the computing nodes selected for a particular execution. Additionally, results from multiple executions of an application under development can be combined and overlapped, permitting developers to perform "what-if" analysis, i.e., to gain a deeper understanding of the utilization of the allocated computational resources. Experiments using this toolkit have shown its effectiveness for the development and performance tuning of parallel applications, and have extended its use to the teaching of message-passing and shared-memory parallel programming courses.

6.
7.
Nowadays, the multi-core processor is the main technology used in desktop PCs, laptop computers and mobile hardware platforms. As the number of cores on a chip keeps increasing, it adds to the complexity of the processor and has a growing impact on both its power and its performance. In multi-processors, the number of cores and parameters such as issue width, number of instructions and execution time are key design factors for balancing thread-level parallelism against instruction-level parallelism. In this paper, we perform a comprehensive simulation study that aims to find the optimum number of processor cores in desktop/laptop processor models with shallow pipeline depth. The paper also explores the trade-off between the number of cores and the other parameters in terms of power-performance gains, and analyzes the impact of 3D stacking on the design of simultaneous multi-threading and chip multiprocessing. Our analysis shows that the optimum number of cores varies across classes of workloads, namely SPEC2000, SPEC2006 and MiBench. A simulation study using architectures with shorter pipeline depth shows that (1) the optimum number of cores for power-performance is 8, (2) the optimum number of threads is in the range [2, 4], and (3) beyond 32 cores, multi-core processors are no longer efficient in terms of performance benefits and overall power consumption.
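The paper's conclusions come from detailed architectural simulation over SPEC and MiBench workloads; purely as a back-of-the-envelope illustration of why an optimum core count exists at all, the sketch below combines an Amdahl-style speedup model with a power budget that grows linearly with the core count. The model and all constants are assumptions, not the paper's simulator, yet even with these arbitrary values performance per watt peaks near 8 cores and falls off beyond 32, qualitatively matching the reported trend.

```c
#include <stdio.h>

int main(void) {
    double parallel_fraction = 0.95;  /* assumed parallelizable share of work */
    double core_power = 1.0;          /* assumed power per active core (a.u.) */
    double uncore_power = 2.0;        /* assumed shared (uncore) power (a.u.) */

    printf("cores  speedup  power  perf/watt\n");
    for (int n = 1; n <= 64; n *= 2) {
        /* Amdahl-style speedup: the serial part does not scale with cores. */
        double speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n);
        double power = uncore_power + n * core_power;
        printf("%5d  %7.2f  %5.1f  %9.3f\n", n, speedup, power, speedup / power);
    }
    return 0;
}
```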

8.
《Parallel Computing》1997,22(13):1837-1851
The PAPS (Performance Analysis of Parallel Systems) toolset is a testbed for the model-based performance prediction of message-passing parallel applications executed on private-memory multiprocessor computer systems. PAPS allows the execution behavior of the computer hardware and operating-system software resources to be described at a very detailed level. This enables very accurate performance prediction of parallel applications even in the case of substantial performance degradation due to contention for shared resources. In this paper the fundamental design principles and implementation methodologies for the development of the PAPS toolset are presented, and the PAPS parallel-system specification formalisms are described. A simplified performance study of a parallel Gaussian elimination application on the nCUBE 2 multiprocessor system is used to demonstrate the usage of the tool.

9.
Fu, You; Zhou, Wei 《The Journal of Supercomputing》2022,78(7):9017-9037
Biological interaction databases accommodate information about interacting proteins or genes. Clustering on the networks formed by the interaction information for...

10.
The Chained-Cubic Tree (CCT) interconnection network topology was recently proposed as a continuation of the extended efforts in the area of interconnection network performance improvement. This topology, which promises to exhibit the best properties of both the hypercube and tree topologies, needs to be investigated in depth in order to evaluate its performance against other interconnection network topologies. This work is a complementary effort in that direction, in which load balancing is investigated as one of the most important aspects of performance improvement. The paper proposes a new load balancing algorithm for CCT interconnection networks. The proposed algorithm, called the Hybrid Dynamic Parallel Scheduling Algorithm (HD-PSA), combines two common load balancing strategies: dynamic load balancing and parallel scheduling. The performance of the proposed algorithm is evaluated both analytically and experimentally in terms of various performance metrics, including execution time, load balancing accuracy, communication cost, number of task hops, and task locality.
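The dynamic part of such a scheme can be illustrated in isolation (a generic sketch, not HD-PSA itself): each node compares its load with the global average to decide how many tasks to migrate or accept; routing those transfers over the CCT links in a hop- and locality-aware way is where the actual algorithm does its work.

```c
#include <stdio.h>

#define NODES 8

int main(void) {
    /* Illustrative per-node task counts (not from the paper). */
    int load[NODES] = { 12, 3, 9, 5, 14, 2, 7, 4 };

    int total = 0;
    for (int i = 0; i < NODES; i++) total += load[i];
    double average = (double)total / NODES;

    for (int i = 0; i < NODES; i++) {
        /* Positive surplus: tasks to migrate away; negative: capacity to
         * accept.  A CCT-aware scheduler would route these transfers along
         * hypercube/tree links to keep communication cost and hop count low. */
        double surplus = load[i] - average;
        printf("node %d: load %2d, surplus %+5.2f\n", i, load[i], surplus);
    }
    return 0;
}
```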

11.
Energy efficient scheduling of parallel tasks on multiprocessor computers
In this paper, the scheduling of parallel tasks on multiprocessor computers with dynamically variable voltage and speed is addressed through combinatorial optimization problems. Two problems are defined, namely, minimizing schedule length with an energy consumption constraint and minimizing energy consumption with a schedule length constraint. The first problem has applications in general multiprocessor and multicore computing systems where energy consumption is an important concern, and in mobile computers where energy conservation is a main concern. The second problem has applications in real-time multiprocessing systems and environments where timing constraints are a major requirement. Our scheduling problems are defined such that the energy-delay product is optimized by fixing one factor and minimizing the other. We note that power-aware scheduling of parallel tasks has rarely been discussed before. Our investigation in this paper makes an initial attempt at energy-efficient scheduling of parallel tasks on multiprocessor computers with dynamic voltage and speed. Our scheduling problems contain three nontrivial subproblems, namely, system partitioning, task scheduling, and power supplying. Each subproblem should be solved efficiently so that heuristic algorithms with good overall performance can be developed. The above decomposition of our optimization problems into three subproblems makes the design and analysis of heuristic algorithms tractable. A unique feature of our work is to compare the performance of our algorithms with optimal solutions analytically and to validate our results experimentally, rather than comparing the performance of heuristic algorithms among themselves only experimentally. The harmonic system partitioning and processor allocation scheme is used, which divides a multiprocessor computer into clusters of equal size and schedules tasks of similar sizes together to increase processor utilization. A three-level energy/time/power allocation scheme is adopted for a given schedule, such that the schedule length is minimized by consuming a given amount of energy or the energy consumed is minimized without missing a given deadline. The performance of our heuristic algorithms is analyzed, and accurate performance bounds are derived. Simulation data that validate our analytical results are also presented. It is found that our analytical results provide very accurate estimates of the expected normalized schedule length and the expected normalized energy consumption, and that our heuristic algorithms are able to produce solutions very close to optimal.
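A commonly used model behind this kind of formulation (stated here as the standard convention, not necessarily the paper's exact notation) makes the energy/schedule-length trade-off explicit:

```latex
% Standard dynamic power model: power grows polynomially with execution speed.
p = s^{\alpha}, \qquad \alpha > 1 \ (\text{typically } \alpha \approx 3).

% A task of work w executed at constant speed s then takes time and energy
t = \frac{w}{s}, \qquad
e = p \cdot t = s^{\alpha}\,\frac{w}{s} = w\, s^{\alpha - 1}.

% Lowering s reduces energy but lengthens the schedule; minimizing one
% quantity subject to a constraint on the other is exactly the pair of
% problems defined in the abstract.
```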

12.
Predictions based on analytical performance models can be used by efficient scheduling policies to select adequate resources for an optimal execution in terms of throughput and response time. However, developing accurate analytical models of parallel applications is a hard task. The TIA (Tools for Instrumenting and Analysis) modeling framework provides an easy-to-use modeling method for obtaining analytical models of MPI applications. This method is based on model selection techniques and, in particular, on Akaike's information criterion (AIC). In this paper, the AIC-based performance model of the HPL benchmark is first obtained using the TIA modeling framework. Then the use of this model for runtime estimation under different backfilling policies is analyzed with the GridSim simulator. The behavior of these simulations is compared with equivalent simulations based on the theoretical model of HPL provided by its developers.
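The AIC machinery the framework relies on can be shown in isolation; the candidate models, parameter counts and residuals below are hypothetical, but the selection rule for least-squares fits (lowest n*ln(RSS/n) + 2k wins) is the standard one.

```c
#include <stdio.h>
#include <math.h>

/* AIC for a least-squares fit: n*ln(RSS/n) + 2k,
 * where k counts the fitted parameters. */
static double aic(double rss, int n, int k) {
    return n * log(rss / n) + 2.0 * k;
}

int main(void) {
    int n = 40;   /* hypothetical number of measured runtimes */

    /* Hypothetical candidate runtime models with their residual sums of
     * squares and parameter counts. */
    struct { const char *name; double rss; int k; } model[] = {
        { "linear in problem size",      420.0, 2 },
        { "cubic / p + comm term",        95.0, 3 },
        { "cubic / p + comm + overhead",  90.0, 5 },
    };

    int best = 0;
    for (int i = 0; i < 3; i++) {
        double a = aic(model[i].rss, n, model[i].k);
        printf("%-30s AIC = %7.2f\n", model[i].name, a);
        if (a < aic(model[best].rss, n, model[best].k)) best = i;
    }
    /* AIC penalizes extra parameters, so the slightly better fit of the
     * third model does not justify its added complexity here. */
    printf("selected: %s\n", model[best].name);
    return 0;
}
```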

13.
This work presents an optimization of MPI communications, called Dynamic-CoMPI, which uses two techniques in order to reduce the impact of communications and non-contiguous I/O requests in parallel applications. These techniques are independent of the application and complementary to each other. The first technique is an optimization of the Two-Phase collective I/O technique from ROMIO, called Locality-Aware strategy for Two-Phase I/O (LA-Two-Phase I/O). In order to increase the locality of file accesses, LA-Two-Phase I/O employs the Linear Assignment Problem (LAP) to find an optimal I/O data communication schedule. The main purpose of this technique is to reduce the number of communications involved in the collective I/O operation. The second technique, called Adaptive-CoMPI, is based on run-time compression of the MPI messages exchanged by applications. Both techniques can be applied to any application, since both are transparent to the users. Dynamic-CoMPI has been validated using several MPI benchmarks and real HPC applications. The results show that, for many of the considered scenarios, significant reductions in execution time are achieved by reducing the size and number of messages. Additional benefits of our approach are the reduction of the total communication time and of network contention, thus enhancing not only performance but also scalability.
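The run-time compression decision can be sketched in isolation; the threshold, acceptable ratio and fallback behavior below are assumptions for illustration, not the policy actually implemented in Dynamic-CoMPI.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Compress buf if it is large enough and compresses well enough;
 * otherwise return NULL to signal "send the raw buffer instead". */
static unsigned char *maybe_compress(const unsigned char *buf, size_t len,
                                     size_t *out_len) {
    const size_t MIN_SIZE = 4096;     /* assumed threshold */
    if (len < MIN_SIZE) return NULL;  /* too small: compression overhead wins */

    uLongf bound = compressBound(len);
    unsigned char *out = malloc(bound);
    if (!out) return NULL;

    if (compress(out, &bound, buf, len) != Z_OK || bound >= len * 0.9) {
        free(out);                    /* poor ratio: not worth the CPU time */
        return NULL;
    }
    *out_len = bound;
    return out;
}

int main(void) {
    unsigned char msg[16384];
    memset(msg, 'A', sizeof msg);     /* highly compressible dummy payload */

    size_t clen = 0;
    unsigned char *c = maybe_compress(msg, sizeof msg, &clen);
    if (c) {
        printf("compressed %zu -> %zu bytes before sending\n", sizeof msg, clen);
        free(c);
    } else {
        printf("sending %zu bytes uncompressed\n", sizeof msg);
    }
    return 0;
}
```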

14.
Mesh of trees (MOT) is well known for its small diameter, high bisection width, simple decomposability and area universality. On the other hand, OTIS (Optical Transpose Interconnection System) provides an efficient optoelectronic model for massively parallel processing systems. In this paper, we present OTIS-MOT as a competent candidate for a two-tier architecture that can take advantage of both the OTIS and the MOT. We show that an n^4-processor OTIS-MOT has diameter 8 log n + 1 (the base of the logarithm is assumed to be 2 throughout this paper) and fault diameter 8 log n + 2 under a single node failure. We establish other topological properties such as bisection width, multiple paths and modularity. We show that many communication as well as application algorithms can run on this network in comparable time or even faster than on other similar tree-based two-tier architectures. The communication algorithms, including row/column-group broadcast and one-to-all broadcast, are shown to require O(log n) time, multicast O(n^2 log n) time, and the bit-reverse permutation O(n) time. Parallel algorithms for various problems such as finding polynomial zeros, sales forecasting, matrix-vector multiplication and DFT computation are shown to map onto this network in O(log n) time. Sorting and prefix computation are also shown to run in O(log n) time.

15.
In this paper, we present a particle swarm optimizer (PSO) to solve the variable weighting problem in projected clustering of high-dimensional data. Many subspace clustering algorithms fail to yield good cluster quality because they do not employ an efficient search strategy. In this paper, we are interested in soft projected clustering. We design a suitable k-means objective weighting function, in which a change of variable weights is exponentially reflected. We also transform the original constrained variable weighting problem into a problem with bound constraints, using a normalized representation of the variable weights, and we utilize a particle swarm optimizer to minimize the objective function in order to search for global optima of the variable weighting problem in clustering. Our experimental results on both synthetic and real data show that the proposed algorithm greatly improves cluster quality. In addition, the results of the new algorithm are much less dependent on the initial cluster centroids. In an application to text clustering, we show that the algorithm can easily be adapted to other similarity measures, such as the extended Jaccard coefficient for text data, and can be very effective.
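The search-engine part of such an approach is the standard PSO velocity/position update; the fragment below shows that update for a single particle's weight vector under bound constraints, with commonly used parameter values. The objective function and the clustering loop of the paper are omitted, and all names and values are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

#define DIM 4   /* number of variable weights (one per dimension) */

static double uniform01(void) { return (double)rand() / RAND_MAX; }

int main(void) {
    srand(1);

    /* One particle: current weights, velocity, personal best and the
     * swarm's global best (all values here are placeholders). */
    double x[DIM]     = { 0.25, 0.25, 0.25, 0.25 };
    double v[DIM]     = { 0.0,  0.0,  0.0,  0.0  };
    double pbest[DIM] = { 0.40, 0.10, 0.30, 0.20 };
    double gbest[DIM] = { 0.50, 0.05, 0.25, 0.20 };

    const double w = 0.72, c1 = 1.49, c2 = 1.49;   /* standard PSO constants */
    const double lo = 0.0, hi = 1.0;               /* bound constraints */

    for (int d = 0; d < DIM; d++) {
        /* Velocity update: inertia + cognitive pull + social pull. */
        v[d] = w * v[d]
             + c1 * uniform01() * (pbest[d] - x[d])
             + c2 * uniform01() * (gbest[d] - x[d]);
        x[d] += v[d];
        /* Clamp to the bound-constrained search space obtained from the
         * normalized reformulation of the weighting problem. */
        if (x[d] < lo) x[d] = lo;
        if (x[d] > hi) x[d] = hi;
        printf("w[%d] = %.3f\n", d, x[d]);
    }
    return 0;
}
```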

16.
17.
The sort operation is a core part of many critical applications (e.g., database management systems). Despite large efforts to parallelize it, the heavy data dependencies it suffers from severely limit its performance. Multithreaded architectures are emerging as a mainstream technology in leading-edge processors. These architectures include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. We achieve speedups of up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with 4 multithreaded processors, compared to the single-threaded versions. We find that since radix sort is CPU-intensive, it exhibits better results on chip multiprocessors, where multiple CPUs are available, while quicksort achieves speedups on all types of multithreaded processors due to its ability to overlap memory-miss latencies with other useful processing.
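As background, LSD radix sort proceeds by repeated counting sorts on successive digits; the single-threaded, byte-wise version below is the kind of sequential baseline that parallel versions partition across threads (it is not the paper's parallel implementation).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort of unsigned 32-bit keys, one byte (8 bits) per pass.
 * Each pass is a stable counting sort, so the whole sort is O(n). */
static void radix_sort(unsigned *a, size_t n) {
    unsigned *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = { 0 };

        /* Histogram of the current byte: a streaming, CPU-intensive pass. */
        for (size_t i = 0; i < n; i++)
            count[(a[i] >> shift) & 0xFF]++;

        /* Exclusive prefix sum gives the start offset of each bucket. */
        size_t offset = 0;
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        /* Scatter into buckets, preserving order within each bucket. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];

        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}

int main(void) {
    unsigned a[] = { 170, 45, 75, 90, 802, 24, 2, 66 };
    size_t n = sizeof a / sizeof a[0];
    radix_sort(a, n);
    for (size_t i = 0; i < n; i++) printf("%u ", a[i]);
    printf("\n");
    return 0;
}
```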

18.
In this paper we continue the study, initiated in (Ben-Artzi et al. in Math. Model. Numer. Anal. 35(2):313–303, 2001; Fishelov et al. in Lecture Notes in Computer Science, vol. 2667, pp. 809–817, 2003; Ben-Artzi et al. in J. Comput. Phys. 205(2):640–664, 2005 and SIAM J. Numer. Anal. 44(5):1997–2024, 2006), of the numerical resolution of the pure streamfunction formulation of the time-dependent two-dimensional Navier-Stokes equation. Here we focus on enhancing our second-order scheme, introduced in the last three aforementioned articles, to fourth-order accuracy. We construct fourth-order approximations for the Laplacian, the biharmonic and the nonlinear convective operators. The scheme is compact (nine-point stencil) for the Laplacian and the biharmonic operators, which are both treated implicitly in the time-stepping scheme. The approximation of the convective term is compact in the no-leak boundary-condition case and is nearly compact (thirteen-point stencil) in the case of general boundary conditions. However, we stress that in no case is an unphysical boundary condition applied to our scheme. Numerical results demonstrate that fourth-order accuracy is actually obtained for several test cases.
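As a point of reference for the compact nine-point treatment mentioned above, a standard fourth-order compact (Mehrstellen-type) approximation of the Laplacian on a uniform grid of spacing h is shown below; this is the generic stencil, not necessarily the exact operator used in the paper's streamfunction scheme.

```latex
% Nine-point compact (Mehrstellen) Laplacian on a uniform grid of spacing h:
L_h u_{i,j} = \frac{1}{6h^{2}}\Bigl[\,
      4\bigl(u_{i+1,j}+u_{i-1,j}+u_{i,j+1}+u_{i,j-1}\bigr)
    + u_{i+1,j+1}+u_{i+1,j-1}+u_{i-1,j+1}+u_{i-1,j-1}
    - 20\,u_{i,j}\Bigr].

% Taylor expansion gives  L_h u = \Delta u + \tfrac{h^{2}}{12}\,\Delta^{2}u + O(h^{4}),
% so for the Poisson problem \Delta u = f the discrete equation
L_h u_{i,j} = f_{i,j} + \frac{h^{2}}{12}\,\Delta f_{i,j}
% is fourth-order accurate while staying on the compact 3x3 stencil.
```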

19.
Loop scheduling on parallel and distributed systems has been thoroughly investigated in the past. However, none of these studies considered the multi-core architectures of emerging grid systems. Although many studies have proposed employing the hybrid MPI and OpenMP programming model to exploit different levels of parallelism in distributed systems with multi-core computers, none of them were aimed at parallel loop self-scheduling. Therefore, this paper investigates how to employ the hybrid MPI and OpenMP model to design a parallel loop self-scheduling scheme adapted to the multi-core architecture of emerging grid systems. Three applications with different features are implemented and evaluated to demonstrate the effectiveness of the proposed scheduling approach. The experimental results show that the proposed approach outperforms previous work for all three applications, with speedups ranging from 1.13 to 1.75.
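A stripped-down sketch of the hybrid structure such a scheme builds on (fixed chunk size and illustrative names; a real self-scheduler adapts the chunk size to node performance): an MPI master hands out loop chunks on demand, and each worker processes its chunk with an OpenMP parallel loop. Run with at least two MPI processes, e.g. mpirun -np 4.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000      /* total loop iterations */
#define CHUNK 50000    /* fixed chunk size (a real self-scheduler adapts this) */
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* master: MPI-level self-scheduling */
        int next = 0, active = size - 1;
        MPI_Status st;
        while (active > 0) {
            int dummy;
            /* a worker announces it is idle by sending its rank */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < N) {
                int range[2] = { next, next + CHUNK < N ? next + CHUNK : N };
                MPI_Send(range, 2, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next = range[1];
            } else {
                int stop[2] = { 0, 0 };
                MPI_Send(stop, 2, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                /* worker: OpenMP inside each chunk */
        double local = 0.0;
        while (1) {
            int range[2];
            MPI_Status st;
            MPI_Send(&rank, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(range, 2, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;

            #pragma omp parallel for reduction(+:local)
            for (int i = range[0]; i < range[1]; i++)
                local += 1.0 / (1.0 + i);   /* stand-in for the real loop body */
        }
        printf("worker %d partial result %.6f\n", rank, local);
    }

    MPI_Finalize();
    return 0;
}
```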

20.
The paper addresses the problem of multi-slot just-in-time scheduling. Unlike the existing literature on this subject, it studies a more general criterion: the minimization of the schedule makespan rather than the minimization of the number of slots used by the schedule. It gives an O(n log² n)-time optimization algorithm for the single-machine problem. For an arbitrary number m > 1 of identical parallel machines, it presents an O(n log n)-time optimization algorithm for the case when the processing time of each job does not exceed its due date. For the general case on m > 1 machines, it proposes a polynomial-time constant-factor approximation algorithm.

