Similar Articles
 Found 20 similar articles (search time: 15 ms)
1.
Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques.
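As a rough illustration of this two-level strategy (a Python stand-in, not the VFC compiler's generated code), the sketch below block-distributes a global array across MPI processes and uses a shared-memory thread pool within each process; mpi4py stands in for MPI and the thread pool for OpenMP. It assumes the process count divides the problem size, and runs with e.g. `mpiexec -n 4 python hybrid_sum.py`.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

N, THREADS = 1_000_000, 4            # global size (assumed divisible), threads per node
local = np.ones(N // nprocs)         # this process's block of the global array

def partial_sum(chunk):
    return float(np.sum(chunk))      # shared-memory work inside one SMP node

with ThreadPoolExecutor(max_workers=THREADS) as pool:
    node_sum = sum(pool.map(partial_sum, np.array_split(local, THREADS)))

total = comm.allreduce(node_sum, op=MPI.SUM)   # distributed-memory reduction
if rank == 0:
    print("global sum:", total)
```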

2.
Over the past decade, the trajectory to the petascale has been built on increased complexity and scale of the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of computational science teams using these new systems. In this regard, Global Address Space (GAS) programming models provide a straightforward and easy-to-use addressing model, which can lead to improved productivity. However, the scalability of GAS depends directly on the design and implementation of the runtime system on the target petascale distributed-memory architecture. In this paper, we describe the design, implementation, and optimization of the Aggregate Remote Memory Copy Interface (ARMCI) runtime library on the 2.3-petaflop Cray XT5 computer at Oak Ridge National Laboratory. We optimized our implementation with the flow intimation technique introduced in this paper. Our optimized ARMCI implementation improves the scalability of both the Global Arrays programming model and a real-world chemistry application—NWChem—from small jobs up through 180,000 cores.

3.
In this paper, we present performance results from several parallel benchmarks and applications on two large Linux clusters at Sandia National Laboratories. We compare the results on the Linux clusters to the performance obtained on a traditional distributed-memory massively parallel processing machine, the Intel TeraFLOPS. We discuss the characteristics of these machines that influence the performance results and identify the key system components that enable the scalability of commodity-based PC clusters to hundreds and possibly thousands of processors.

4.
We study the problem of exploiting parallelism from search-based AI systems on shared-nothing platforms, i.e., platforms where different machines do not have access to any form of shared memory. We propose a novel environment representation technique, called stack-splitting, a modification of the well-known stack-copying technique that enables the efficient exploitation of or-parallelism from AI systems on distributed-memory machines. Stack-splitting, coupled with appropriate scheduling strategies, leads to reduced communication during distributed execution and effective distribution of larger grain-sized work to processors. The novel technique can also be implemented on shared-memory machines, where it is quite competitive. In this paper we present a distributed implementation of or-parallelism based on stack-splitting, including experimental results. Our results suggest that stack-splitting is an effective technique for obtaining high-performance parallel AI systems on shared-memory as well as distributed-memory multiprocessors.
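As a toy illustration of the splitting step (an illustrative Python stand-in, not the authors' engine), the function below gives an idle worker half of the unexplored alternatives at every choice point in a single transfer, rather than copying the whole stack and handing over a single choice point as stack-copying does:

```python
def split_stack(choice_stack):
    """choice_stack: list of choice-point frames, each a list of unexplored
    alternatives, oldest frame first. The donor keeps half of each frame's
    alternatives; the idle worker receives the other half, so both end up
    with large-grain work after one transfer."""
    kept, given = [], []
    for alternatives in choice_stack:
        mid = (len(alternatives) + 1) // 2
        kept.append(alternatives[:mid])
        given.append(alternatives[mid:])
    return kept, given

donor = [["b1", "b2", "b3", "b4"], ["c1", "c2"]]
kept, given = split_stack(donor)
print(kept)    # [['b1', 'b2'], ['c1']]
print(given)   # [['b3', 'b4'], ['c2']]
```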

5.
International Journal of Computer Mathematics, 2012, 89(7):1160-1166
Recently, some iterative characterizations of H-matrices have been proposed. These methods can have lower computational complexity than direct ones, but since they are all designed for sequential computers, they may not be effective for large-scale matrices. In this paper, building on previous and new ideas, we discuss the parallel characterization of H-matrices on distributed-memory multiprocessor machines, and propose two new algorithms which require fewer iterations and less computational time than the earlier ones. Several numerical examples are provided to show the effectiveness of the proposed algorithms.
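The abstract does not reproduce the paper's two algorithms; as background, the sketch below shows one standard iterative characterization: A is an H-matrix iff the spectral radius of the Jacobi matrix of its comparison matrix is below 1, estimated here by power iteration (a sequential numpy sketch under that assumption, not the proposed parallel algorithms).

```python
import numpy as np

def is_h_matrix(A, iters=100, tol=1e-10):
    A = np.asarray(A, dtype=float)
    absA = np.abs(A)
    diag = np.diag(absA).copy()
    if np.any(diag == 0):
        return False                      # zero diagonal entry: not an H-matrix
    J = absA / diag[:, None]              # Jacobi matrix entries |a_ij| / |a_ii|
    np.fill_diagonal(J, 0.0)
    x = np.ones(A.shape[0])
    rho = 0.0
    for _ in range(iters):
        y = J @ x
        new_rho = np.max(y / np.maximum(x, tol))   # Collatz-Wielandt bound
        if abs(new_rho - rho) < tol:
            break
        rho, x = new_rho, y / np.max(y)   # renormalise to avoid overflow
    return rho < 1.0

print(is_h_matrix([[4, -1, 0], [-1, 4, -1], [0, -1, 4]]))   # True
print(is_h_matrix([[1, 2], [2, 1]]))                        # False
```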

6.
There are many proposals for moving traditional video surveillance systems into the cloud, commonly known as Video Surveillance as a Service (VSaaS). Most systems use Hadoop technology for storing video records and distributing video analysis tasks. However, Hadoop is more appropriate for video retrieval services than for real-time video analysis. Moreover, existing systems neither offer flexible deployment plans nor automatically minimize the number of required servers (whether physical or virtual machines). Our proposal involves the design and implementation of a component-based VSaaS running on Infrastructure as a Service (IaaS). This paper focuses on the design concepts and component functions that provide solutions for the availability and scalability of VSaaS. Our system can easily scale from one server up to a more complex cluster to support the varying requirements of users. It accesses cloud services via the Amazon EC2 API for computing and the Amazon S3 API for object storage, since these are supported by many IaaS providers. We also present a component deployment that is suitable for any size and type of system and that combines both physical and virtual machines. Experiments show that the system performs well and can tolerate difficult scenarios.

7.
Preconditioning techniques are important in solving linear problems, as they improve their computational properties. Scaling is the most widely used preconditioning technique in linear optimization algorithms; it is used to reduce the condition number of the constraint matrix, to improve the numerical behavior of the algorithms, and to reduce the number of iterations required to solve linear problems. Graphics processing units (GPUs) have gained a lot of popularity in recent years and have been applied to the solution of linear optimization problems. In this paper, we review and implement ten scaling techniques, focusing on their parallel implementation on GPUs. All these techniques have been implemented under the MATLAB and CUDA environment. Finally, a computational study on the Netlib set is presented to establish the practical value of GPU-based implementations. On average, the speedup gained from the GPU implementations of all scaling methods is about 7×.
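As a CPU reference for one classic technique of the kind reviewed (the paper's implementations are in MATLAB/CUDA; this numpy sketch only shows what the kernel computes), geometric-mean scaling divides each row and column by the geometric mean of its extreme nonzero magnitudes:

```python
import numpy as np

def geometric_mean_scaling(A):
    """Divide each row, then each column, by the geometric mean of its
    largest and smallest nonzero magnitudes (assumes every row and
    column contains at least one nonzero entry)."""
    A = np.asarray(A, dtype=float).copy()
    for axis in (1, 0):                          # rows first, then columns
        absA = np.where(A != 0, np.abs(A), np.nan)
        s = 1.0 / np.sqrt(np.nanmax(absA, axis=axis) * np.nanmin(absA, axis=axis))
        A *= s[:, None] if axis == 1 else s[None, :]
    return A

A = np.array([[1e4, 2.0], [0.5, 1e-3]])
print(geometric_mean_scaling(A))   # entry magnitudes pulled toward 1
```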

8.
Computer Networks, 2008, 52(5):935-956
Proxy caching servers are widely deployed in today’s Internet. While cooperation among proxy caches can significantly improve a network’s resilience to denial-of-service (DoS) attacks, lack of cooperation can transform such servers into viable DoS targets. In this paper, we investigate a class of pollution attacks that aim to degrade a proxy’s caching capabilities, either by ruining the cache file locality, or by inducing false file locality. Using simulations, we demonstrate and evaluate the effects of pollution attacks in both Web and peer-to-peer (p2p) scenarios, and reveal dramatic variability in resilience to pollution among several cache replacement policies. We develop efficient methods to detect both false-locality and locality-disruption attacks, as well as a combination of the two. To achieve high scalability for a large number of clients/requests without sacrificing detection accuracy, we leverage streaming computation techniques, i.e., Bloom filters and probabilistic counting. Evaluation results from large-scale simulations show that these mechanisms are effective and efficient in detecting and mitigating such attacks. Furthermore, a Squid-based implementation demonstrates that our protection mechanism forces the attacker to launch extremely large distributed attacks in order to succeed.
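As background for the streaming detection machinery, here is a minimal Bloom filter of the kind used to compactly remember which cached files have been requested again (sizing and names are illustrative, not the paper's implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)     # m-bit array, all zero

    def _hashes(self, item):
        for i in range(self.k):           # k independent hash positions
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):         # no false negatives, rare false positives
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._hashes(item))

seen = BloomFilter()
seen.add("/video/a.mpg")
print("/video/a.mpg" in seen)   # True
print("/video/b.mpg" in seen)   # False (with high probability)
```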

9.
Techniques for solving linear equations on a single instruction multiple data (SIMD) computer such as the ICL DAP have so far been confined to simple methods such as the Successive Overrelaxation and Alternating Direction Implicit algorithms. While these techniques are adequate for simple finite difference problems, more difficult problems require more complex algorithms. Preconditioned conjugate gradient methods have solved difficult problems successfully on serial machines. This paper describes a preconditioning technique suitable for parallel machines and presents numerical results obtained from a series of problems of varying degrees of difficulty.
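For reference, the preconditioned conjugate gradient iteration itself is compact. The sketch below uses a simple Jacobi (diagonal) preconditioner in numpy — a serial stand-in, since the paper's contribution is a preconditioner suited to SIMD hardware:

```python
import numpy as np

def pcg(A, b, tol=1e-8, maxiter=1000):
    """Conjugate gradient for symmetric positive definite A, preconditioned
    with M = diag(A) (the Jacobi preconditioner)."""
    x = np.zeros_like(b)
    r = b - A @ x
    Minv = 1.0 / np.diag(A)            # apply M^{-1} elementwise
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg(A, b))                        # ~ [0.0909, 0.6364]
```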

10.
Multicore processor systems have become mainstream. To release the full potential of multiple cores, applications are programmed to be parallel so as to keep every core busy. Unfortunately, lock contention within operating systems can limit scalability so seriously that the use of more cores leads to reduced throughput (scalability collapse). To understand and characterize the collapse behavior easily, a discrete-event simulation model, which considers both the sequential execution of critical sections and the overhead of hardware resource contention, is designed and implemented. Using the model, we observe that the percentage of time spent waiting for locks and the number of tasks requesting a lock correlate significantly with the occurrence of scalability collapse. On the basis of these observations, two new techniques (a lock-contention-aware scheduler and a requester-based adaptive lock) are proposed to remove scalability collapse on multicores. The proposed methods are implemented in the Linux kernel 2.6.29.4 and evaluated on an AMD 32-core system to verify their effectiveness. Using micro-benchmarks and macro-benchmarks, we find that these methods completely remove scalability collapse for four of the five workloads exhibiting the collapse behavior. For one workload that does not suffer scalability collapse, the proposed methods introduce only negligible overhead. Copyright © 2012 John Wiley & Sons, Ltd.
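A user-level Python analogue of the requester-based adaptive idea — spin while contention is low, block once the requester count passes a threshold. This is an illustrative stand-in, not the authors' kernel code, and the threshold is an assumption:

```python
import threading
import time

class AdaptiveLock:
    SPIN_LIMIT = 2                        # illustrative mode-switch threshold

    def __init__(self):
        self._lock = threading.Lock()
        self._guard = threading.Lock()    # protects the requester count
        self._requesters = 0

    def acquire(self):
        with self._guard:
            self._requesters += 1
            spin = self._requesters <= self.SPIN_LIMIT
        if spin:                                        # low contention: spin
            while not self._lock.acquire(blocking=False):
                pass
        else:                                           # high contention: block
            self._lock.acquire()
        with self._guard:
            self._requesters -= 1

    def release(self):
        self._lock.release()

lock = AdaptiveLock()

def worker():
    lock.acquire()
    time.sleep(0.001)                     # critical section
    lock.release()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all critical sections completed")
```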

11.
The increase of computer performance continues to support the practice of large-scale optimization. Computers with multiple computing cores and vector processing capabilities are now widely available. We investigate how the recently introduced Advanced Vector Extensions (AVX) instruction set on Intel-compatible architectures can be exploited in interior point methods for linear and nonlinear optimization. We focus on data structures and implementation techniques that utilize the new vector instructions. Our numerical experiments demonstrate that the AVX instruction set provides a significant performance boost in our implementation on large-scale problems that have significant fill-in in the sparse Cholesky factorization, achieving up to 100 gigaflops on a standard desktop computer on linear optimization problems for which the required Cholesky factorization is relatively dense.

12.
Symbolic computation is an important area of both Mathematics and Computer Science, with many large computations that would benefit from parallel execution. Symbolic computations are, however, challenging to parallelise as they have complex data and control structures, and both dynamic and highly irregular parallelism. The SymGridPar framework (SGP) has been developed to address these challenges on small-scale parallel architectures. However, the multicore revolution means that the number of cores and the number of failures are growing exponentially, and that the communication topology is becoming increasingly complex. Hence an improved parallel symbolic computation framework is required. This paper presents the design and initial evaluation of SymGridPar2 (SGP2), a successor to SymGridPar that is designed to provide scalability onto 10^5 cores, and hence must also provide fault tolerance. We present the SGP2 design goals, principles and architecture. We describe how scalability is achieved using layering and by allowing the programmer to control task placement. We outline how fault tolerance is provided by supervising remote computations, and outline higher-level fault tolerance abstractions. We describe the SGP2 implementation status and development plans. We report the scalability and efficiency, including weak scaling to about 32,000 cores, and investigate the overheads of tolerating faults for simple symbolic computations.
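The supervision idea can be sketched independently of SGP2's implementation: a supervisor submits a remote computation and reschedules it when the node fails. Names and the failure model below are illustrative Python, not the framework's code:

```python
import random
from concurrent.futures import ProcessPoolExecutor

class NodeFailure(Exception):
    """Stands in for losing the node that was running a remote task."""

def remote_task(x):
    if random.random() < 0.3:                 # simulated node failure
        raise NodeFailure(f"node lost while computing f({x})")
    return x * x

def supervise(pool, task, arg, retries=5):
    """Submit a remote computation; on failure, reschedule it."""
    for _ in range(retries):
        try:
            return pool.submit(task, arg).result()
        except NodeFailure:
            continue                          # resubmit (ideally elsewhere)
    raise RuntimeError(f"task {arg} failed {retries} times")

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        print([supervise(pool, remote_task, i) for i in range(8)])
```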

13.
Burdened by their popularity, recommender systems increasingly take on larger datasets while they are expected to deliver high-quality results within reasonable time. To meet these ever-growing requirements, industrial recommender systems often turn to parallel hardware and distributed computing. While the MapReduce paradigm is generally accepted for massive parallel data processing, it often entails complex algorithm reorganization and suboptimal efficiency because mid-computation values are typically read from and written to hard disk. This work implements an in-memory, content-based recommendation algorithm and shows how it can be parallelized and efficiently distributed across many homogeneous machines in a distributed-memory environment. By focusing on data parallelism and carefully constructing the definition of work in the context of recommender systems, we are able to partition the complete calculation process into any number of independent and equally sized jobs. An empirically validated performance model is developed to predict parallel speedup and promises high efficiencies for realistic hardware configurations. For the MovieLens 10M dataset we note efficiency values of up to 71% for a configuration of 200 computing nodes (eight cores per node).
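The key step — partitioning the computation into independent, equally sized jobs — can be sketched as slicing the triangular space of item pairs whose similarities must be computed (illustrative Python, not the authors' code):

```python
from itertools import combinations

def make_jobs(n_items, n_jobs):
    """Cut all item-pair similarity computations into n_jobs independent,
    near-equal slices that can run on separate machines."""
    pairs = list(combinations(range(n_items), 2))   # every similarity pair once
    size, rem = divmod(len(pairs), n_jobs)
    jobs, start = [], 0
    for j in range(n_jobs):
        end = start + size + (1 if j < rem else 0)  # spread the remainder
        jobs.append(pairs[start:end])
        start = end
    return jobs

jobs = make_jobs(n_items=6, n_jobs=4)
print([len(j) for j in jobs])   # [4, 4, 4, 3] — near-equal, independent jobs
```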

14.
Precise calculation of molecular electronic wavefunctions by methods such as coupled-cluster requires the computation of tensor contractions, the cost of which scales polynomially with the system and basis set sizes. Each contraction may be executed via matrix multiplication on a properly ordered and structured tensor. However, data transpositions are often needed to reorder the tensors for each contraction. Writing and optimizing distributed-memory kernels for each transposition and contraction is tedious, since the number of contractions scales combinatorially with the number of tensor indices. We present the Cyclops Tensor Framework (CTF), a distributed-memory numerical library that automatically manages tensor blocking and redistribution to perform any user-specified contractions. CTF serves as the distributed-memory contraction engine in Aquarius, a new program designed for high-accuracy and massively parallel quantum chemical computations. Aquarius implements a range of coupled-cluster and related methods such as CCSD and CCSDT by writing the equations on top of a C++ templated domain-specific language. This DSL calls CTF directly to manage the data and perform the contractions. Our CCSD and CCSDT implementations achieve high parallel scalability on the BlueGene/Q and Cray XC30 supercomputer architectures, showing that accurate electronic structure calculations can be effectively carried out on top of general distributed-memory tensor primitives.
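To make the "reorder, then multiply" idea concrete, here is a small numpy stand-in (not CTF itself, and the shapes are illustrative) that performs the contraction C[a,b] = Σ_{i,j} T[i,a,j] · V[i,j,b] as one matrix multiplication after a transposition:

```python
import numpy as np

a, b, i, j = 3, 4, 5, 6
T = np.random.rand(i, a, j)                    # indices stored as (i, a, j)
V = np.random.rand(i, j, b)

Tm = T.transpose(1, 0, 2).reshape(a, i * j)    # reorder to (a, i, j), flatten (i, j)
Vm = V.reshape(i * j, b)                       # flatten the contracted indices
C = Tm @ Vm                                    # one matrix multiplication

assert np.allclose(C, np.einsum("iaj,ijb->ab", T, V))   # correctness check
print(C.shape)                                 # (3, 4)
```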

15.
This paper discusses the implementation principles of virtualization technology and its application to network-attack-and-defense training, analyzing the strengths and weaknesses of the various virtualization technologies and the problems encountered when building network-confrontation training environments. A hybrid virtualization platform design is proposed in which the Libvirt virtualization application programming interface (API) drives two virtualization technologies running simultaneously on the same physical host. The implementation process and methods of the hybrid platform are described in detail, and benchmarks are run concurrently inside multiple virtual machines to analyze its scalability. Mixing multiple virtualization technologies on a single physical machine enables the creation of complex experimental environments and provides a useful reference for building large-scale network experimentation platforms.
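A minimal sketch of the unifying role Libvirt plays here, using the libvirt Python binding; the URIs and the absence of error handling are illustrative assumptions, not the paper's platform code:

```python
import libvirt   # libvirt-python binding

# One management layer, two hypervisor types on the same physical host:
URIS = ["qemu:///system", "lxc:///"]   # KVM/QEMU full virtualization + LXC containers

for uri in URIS:
    conn = libvirt.open(uri)           # same API regardless of hypervisor type
    try:
        names = [dom.name() for dom in conn.listAllDomains()]
        print(f"{uri}: {len(names)} domain(s): {names}")
    finally:
        conn.close()
```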

16.
In this paper we discuss the problem of computing a multidimensional integral on a MIMD distributed-memory multiprocessor. Adaptive quadrature is known to be a good approach for achieving accuracy and reliability while attempting to minimize the number of function evaluations. The implementation makes use of dynamic data structures to manage the subinterval partition. On a distributed-memory multiprocessor, each processor can execute code and manipulate data structures only in its own local memory, and data are sent from one processor to another by explicit message passing. Efficient implementation of an adaptive algorithm for multidimensional quadrature on a parallel computer is quite difficult because of the need for continuous information exchange between processors. Our algorithm is based on a global adaptive strategy which dynamically balances the workload and reduces data communication between processors in order to use the message-passing environment efficiently. Results and timings for several tests are given.
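A serial one-dimensional sketch of the global adaptive strategy (the distributed load balancing and message passing of the paper are omitted): all subintervals live in one priority queue ordered by error estimate, and the worst one is always refined next. The termination rule below is a simplifying assumption:

```python
import heapq
import math

def adaptive_quad(f, lo, hi, tol=1e-10, max_splits=100_000):
    def simpson(a, b):
        m = (a + b) / 2
        return (b - a) / 6 * (f(a) + 4 * f(m) + f(b))

    def entry(a, b):                           # (negated error, interval, value)
        m = (a + b) / 2
        s = simpson(a, b)
        err = abs(simpson(a, m) + simpson(m, b) - s)
        return (-err, a, b, s)

    heap = [entry(lo, hi)]                     # global pool of subintervals
    for _ in range(max_splits):
        if -heap[0][0] <= tol / len(heap):     # crude global error criterion
            break
        _, a, b, _ = heapq.heappop(heap)       # always refine the worst interval
        m = (a + b) / 2
        heapq.heappush(heap, entry(a, m))
        heapq.heappush(heap, entry(m, b))
    return sum(item[3] for item in heap)

print(adaptive_quad(math.sin, 0.0, math.pi))   # ~2.0
```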

17.
The ever-growing demand for digital information raises the need for content distribution architectures providing high storage capacity, data availability and good performance. While many simple solutions exist for the scalable distribution of quasi-static content, there are still no approaches that can ensure both scalability and consistency for highly dynamic content, such as the data managed inside wikis. We propose a peer-to-peer solution for distributing and managing dynamic content that combines two widely studied technologies: Distributed Hash Tables (DHT) and optimistic replication. In our “universal wiki” engine architecture (UniWiki), any number of front-ends can be added on top of a reliable, inexpensive and consistent DHT-based storage, ensuring both read and write scalability, as well as suitability for large-scale scenarios. The implementation is based on Damon, a distributed AOP middleware, thus separating distribution, replication, and consistency responsibilities, and also making our system transparently usable by third-party wiki engines. Finally, UniWiki has proved viable and fairly efficient in large-scale scenarios.
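As a sketch of how a DHT layer of this kind places keys (UniWiki builds on an existing DHT; this consistent-hashing toy is only illustrative), each page ID maps to the first node clockwise on a hash ring, so nodes can join or leave with minimal re-assignment:

```python
import bisect
import hashlib

def h(key):
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, replicas=64):
        # Each node appears at `replicas` virtual points for balance.
        self.points = sorted((h(f"{n}#{i}"), n)
                             for n in nodes for i in range(replicas))
        self.keys = [p for p, _ in self.points]

    def lookup(self, key):
        idx = bisect.bisect(self.keys, h(key)) % len(self.points)
        return self.points[idx][1]             # first node clockwise

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.lookup("wiki/MainPage"))   # the node responsible for this page
```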

18.
The paper aims at demonstrating and confirming that breadth-first search and pruning techniques can substantially improve the effectiveness of biomolecular algorithms. A breadth-first-search-based DNA algorithm solving the maximum clique problem for a graph is presented, and its complexity and scalability parameters are studied. The analysis shows that parameters such as the number of steps, the length and volume of DNA strands, the number of enzymes and the concentration of the molecules encoding solutions are dramatically improved in comparison with previous approaches to the same problem and that, theoretically, they would allow graphs with thousands of vertices to be processed. These parameters are also compared with several related results focusing on the scalability of DNA computing methods. Finally, an analysis of the error-resistance of the algorithm is given.
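The algorithmic skeleton that the DNA operations realize can be sketched in software (an in-silico stand-in, not the biomolecular protocol): level k holds all k-cliques, and each is extended only by later vertices adjacent to every member, pruning non-cliques as early as possible:

```python
def max_clique(adj):
    """Breadth-first clique enumeration with pruning over an adjacency matrix."""
    n = len(adj)
    # Level 1: every vertex, with its candidate extensions (later neighbours).
    level = [(frozenset([v]), {u for u in range(v + 1, n) if adj[v][u]})
             for v in range(n)]
    best = frozenset()
    while level:
        nxt = []
        for clique, cand in level:
            if len(clique) > len(best):
                best = clique
            for v in cand:   # extend only by vertices adjacent to all members
                nxt.append((clique | {v},
                            {u for u in cand if u > v and adj[v][u]}))
        level = nxt          # next level: all (k+1)-cliques
    return best

adj = [[0, 1, 0, 0],
       [1, 0, 1, 1],
       [0, 1, 0, 1],
       [0, 1, 1, 0]]
print(sorted(max_clique(adj)))   # [1, 2, 3]
```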

19.
Parallel servers offer improved processing power for relational database systems and provide system scalability. In order to support the users of these systems, new ways of assessing the performance of such machines are required. If these assessments are to show how the machines perform under commercial workloads, they need to be based upon models which have a real commercial basis. This paper shows how a realistic model of a financial application has been developed and how a set of tools has been created which allows the implementation of the model on any commercial database system. The tools allow the generation of large quantities of test data in a manner which renders it amenable to subsequent independent analysis. The test data thus generated forms the basis for the performance tuning of parallel database machines. Recommended by: Patrick Valduriez

20.
Research on a clustering search method for collaborative filtering recommender systems
Nearest-neighbor computation is the step of collaborative filtering that most directly affects both the runtime efficiency and the recommendation accuracy of a recommender system. Once the numbers of users and items reach a certain scale, the scalability of the recommender system degrades noticeably. Clustering can compensate for this shortcoming to some extent, but it also reduces recommendation accuracy. This paper proposes a user-clustering search algorithm that combines the inverted index from information retrieval with a "membership strategy", shortening the time needed for nearest-neighbor computation. Experimental results show that the method effectively improves the scalability of collaborative filtering recommender systems while preserving recommendation accuracy.
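A sketch of the inverted-index idea from the abstract (illustrative Python; the paper couples it with the clustering "membership strategy"): index users by the items they rated, so nearest-neighbor candidates are only the users who share at least one item rather than the whole user base:

```python
from collections import defaultdict

ratings = {                      # user -> {item: rating}
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i2": 4, "i3": 2},
    "u3": {"i4": 1},
}

index = defaultdict(set)         # inverted index: item -> users who rated it
for user, items in ratings.items():
    for item in items:
        index[item].add(user)

def neighbour_candidates(user):
    """Only users sharing at least one rated item can be nearest neighbours."""
    cands = set()
    for item in ratings[user]:
        cands |= index[item]
    cands.discard(user)
    return cands

print(neighbour_candidates("u1"))   # {'u2'} — u3 shares no items, skipped
```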
