Found 20 similar documents (search time: 0 ms)
1.
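To study the general-purpose computing capability of GPUs and a programming model suited to SMP clusters, this paper proposes, for the first time, a multi-granularity hybrid parallel programming approach, MPI+CUDA, in which MPI provides coarse-grained parallelism between nodes and CUDA provides fine-grained parallelism within each node. Using this approach, the parallel performance of large-scale matrix multiplication was tested on a purpose-built 3-node SMP cluster. The experimental results show that the method significantly improves parallel efficiency, and that the MPI+CUDA hybrid programming model can take full advantage of both the distributed storage across nodes and the shared memory within nodes of an SMP cluster, providing an effective parallelization strategy for SMP clusters equipped with CUDA-enabled GPUs.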
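The two-level decomposition described in this abstract can be illustrated with a small, single-process sketch. It is not the authors' implementation: the outer loop over row bands only stands in for MPI ranks, the tiled inner routine only stands in for a CUDA kernel, and the names `split_range`, `tile_product`, and `hybrid_matmul` are hypothetical.

```python
def split_range(n, parts):
    """Split range(n) into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def tile_product(a, b, rows, cols, tile):
    """Fine-grained level: compute C[rows, cols] tile by tile (stand-in for a GPU kernel)."""
    inner = len(b)
    out = {}
    for r0 in range(rows[0], rows[1], tile):
        for c0 in range(cols[0], cols[1], tile):
            # Every (i, j) inside a tile is independent, which is what a
            # thread block would exploit on a real GPU.
            for i in range(r0, min(r0 + tile, rows[1])):
                for j in range(c0, min(c0 + tile, cols[1])):
                    out[(i, j)] = sum(a[i][k] * b[k][j] for k in range(inner))
    return out


def hybrid_matmul(a, b, nodes=3, tile=2):
    """Coarse-grained level: one row band of A per node (stand-in for MPI ranks)."""
    n_rows, n_cols = len(a), len(b[0])
    c = [[0] * n_cols for _ in range(n_rows)]
    for band in split_range(n_rows, nodes):  # each iteration is one node's share
        partial = tile_product(a, b, band, (0, n_cols), tile)
        for (i, j), value in partial.items():  # stand-in for gathering the results
            c[i][j] = value
    return c
```

The default of three bands mirrors the 3-node cluster mentioned in the abstract; the tile size is an arbitrary choice for illustration.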
2.
A method is presented that addresses a weakness of the conventional quadratic performance criterion: it is ineffective for some real-world systems because its performance parameters are seldom related to meaningful quantities. Globally searching the performance index allows the index to have local minima as well as discontinuities, so it can be defined in meaningful terms. This ability to define meaningful performance indexes can potentially reduce design time and produce better controls for nonlinear systems. The method has been implemented and tested with a simulated nonlinear system. Comparison to optimal control theory shows that the methodology has merit. Gains occur because the controller can be nonlinear and the system can be efficiently optimized to have the desired characteristics.
3.
Wen Long Ximing Liang Yafei Huang Yixiong Chen 《Neural computing & applications》2014,24(3-4):911-926
In recognition of high-quality wideband speech codecs, several standardization activities have been conducted, resulting in the selection of a wideband speech codec called adaptive multi-rate wideband (AMR-WB). The algebraic code-excited linear prediction (ACELP) technique is recommended in AMR-WB, and it is noted that most of the complexity in the ACELP structure comes from the codebook search. In this paper, a new method is proposed for codebook search based on the behavior of the backward-filtered target signal, d(n), introduced in the ITU-T G.722.2 recommendation. To optimize the proposed scheme, five optimization algorithms (i.e., modified genetic algorithm, particle swarm optimization with dynamic inertia weight, bee colony optimization, modified differential evolution, and imperialist competition algorithm) are investigated. Experimental results show that the proposed method reduces codebook search operations by up to 59 percent compared with the ITU-T G.722.2 recommendation. Meanwhile, the BCO-based codebook search scheme has better convergence speed without significant degradation in quality metrics, such as segmental signal-to-noise ratio, mean opinion score, and perceptual evaluation of speech quality, when used in an AMR-WB speech codec.
4.
Jianjun Liu K. L. Teo Xiangyu Wang Changzhi Wu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2016,20(4):1305-1313
Differential search (DS) is a recently developed derivative-free global heuristic optimization algorithm for solving unconstrained optimization problems. In this paper, by applying the idea of the exact penalty function approach, a DS algorithm is developed for constrained global optimization problems; an S-type dynamical penalty factor is introduced to achieve a better balance between exploration and exploitation. To illustrate the applicability and effectiveness of the proposed approach, a comparison study is carried out by applying the proposed algorithm and other widely used evolutionary methods to 24 benchmark problems. The results obtained clearly indicate that the proposed method is more effective and efficient than the other widely used evolutionary methods for most of these benchmark problems.
5.
Mojtaba Shivaie Mohammad T. Ameli 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2014,18(8):1615-1630
In this paper a new scenario-based framework is presented for transmission expansion planning (TEP) under normal and N–1 conditions. The proposed framework takes into account the cost of network losses and the cost of the transmission circuits and substations as objective functions in the optimization process, while treating short-term and long-term constraints under normal and N–1 conditions as problem constraints. The proposed model is a non-convex optimization problem of a non-linear mixed-integer nature. A new improved harmony search algorithm (IHSA) is used to obtain the final optimal solution. The IHSA is a recently developed optimization algorithm which imitates the music improvisation process. In this process, the harmonists improvise their instrument pitches searching for the perfect state of harmony. The new planning methodology has been applied to the well-known Garver’s 6-bus test system and a real-life network of the south Brazilian electric power grid in order to demonstrate the feasibility and capabilities of the proposed algorithm. The detailed results of the case studies are presented and thoroughly analyzed. The obtained TEP results illustrate the sufficiency and profitability of the newly developed method in expansion planning when compared with other methods.
6.
Arrays that are distributed in a block-cyclic fashion are important for many applications in the computational sciences since they often lead to parallel algorithms with good load balancing properties. We consider the problem of redistributing such an array to a new block size. This operation is directly expressible in High Performance Fortran (HPF) and will arise in applications written in this language. Efficient message passing algorithms are given for the redistribution operation, expressed in the standardized message passing interface, MPI. The algorithms are analyzed and performance results from the IBM SP-1 and Intel Paragon are given and discussed. The results show that redistribution can be done in time comparable to other collective communication operations, such as broadcast and MPI_ALLTOALL.
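As a rough illustration of what redistributing a block-cyclic array to a new block size involves, the sketch below computes which process owns each element of a one-dimensional array before and after the change, and counts the elements every sender/receiver pair would have to exchange. It only derives the communication pattern; it is not the paper's message passing algorithm, and the function names are hypothetical.

```python
from collections import Counter


def owner(index, block, procs):
    """Process that owns global element `index` under a 1-D block-cyclic layout."""
    return (index // block) % procs


def redistribution_plan(n, procs, old_block, new_block):
    """Count the elements each (sender, receiver) pair must exchange."""
    plan = Counter()
    for i in range(n):
        src = owner(i, old_block, procs)
        dst = owner(i, new_block, procs)
        if src != dst:  # elements that stay on the same process need no message
            plan[(src, dst)] += 1
    return plan
```

For example, `redistribution_plan(12, 2, 2, 3)` describes moving a 12-element array on 2 processes from block size 2 to block size 3. Because several pairs generally exchange data at once, the operation has the shape of an all-to-all collective.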
7.
8.
9.
10.
During the past few years the interest paid to global optimization has rapidly increased. One of the main reasons is the new technology of parallel computers, which offer computational power capable of solving global optimization problems in reasonable time. The method studied in this work is based on interval analysis, which provides a reliable way of solving the problem. Despite the fact that the method contains a high degree of potential parallelism, it is not straightforward to parallelize due to its irregular and unpredictable computational behaviour. This paper deals with the problem of balancing the load dynamically, both with respect to the quantity and to the quality of the tasks. Efficient strategies are proposed and implemented on an Intel iPSC/2 hypercube. Since the sequential algorithm is used as a base, it will be modified to suit the parallel algorithm.
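A minimal sequential sketch of interval-based global minimization may help show where the irregular workload comes from. It assumes a hypothetical one-dimensional objective f(x) = x^2 - 2x with a hand-written interval enclosure, and it ignores the outward rounding that a rigorous interval library would apply; it is not the algorithm studied in the paper.

```python
import heapq


def f(x):
    """Example objective (an assumption for illustration)."""
    return x * x - 2.0 * x


def f_enclosure(lo, hi):
    """Interval enclosure of f over [lo, hi]: bounds on x**2 minus bounds on 2*x."""
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    sq_hi = max(lo * lo, hi * hi)
    return sq_lo - 2.0 * hi, sq_hi - 2.0 * lo


def interval_minimize(lo, hi, tol=1e-6):
    """Best-first branch and bound; returns (lower, upper) bounds on the global minimum."""
    best_upper = f(0.5 * (lo + hi))            # any sampled value is an upper bound
    heap = [(f_enclosure(lo, hi)[0], lo, hi)]  # boxes ordered by their lower bound
    while heap:
        lower, a, b = heapq.heappop(heap)
        if best_upper - lower <= tol:
            return lower, best_upper           # smallest lower bound is close enough
        mid = 0.5 * (a + b)
        best_upper = min(best_upper, f(mid))
        for c, d in ((a, mid), (mid, b)):
            child_lower = f_enclosure(c, d)[0]
            if child_lower <= best_upper:      # otherwise the box cannot hold the minimum
                heapq.heappush(heap, (child_lower, c, d))
    return best_upper, best_upper
```

Each box on the heap is an independent task, but how many boxes survive pruning, and which of them are worth exploring first, is only known at run time. That is one way to read the abstract's distinction between the quantity and the quality of tasks.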
11.
We introduce Lemon, an MPI parallel I/O library that provides efficient parallel I/O of both binary data and metadata on massively parallel architectures. Motivated by the demands of the Lattice Quantum Chromodynamics community, the data is stored in the SciDAC Lattice QCD Interchange Message Encapsulation format. This format allows for storing large blocks of binary data and corresponding metadata in the same file. Although designed for LQCD needs, this format might be useful for any application with this type of data profile. The design, implementation and application of Lemon are described. We conclude by presenting the excellent scaling properties of Lemon on state-of-the-art high performance computers.
Program summary
Program title: Lemon
Catalogue identifier: AELP_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AELP_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU General Public License version 3
No. of lines in distributed program, including test data, etc.: 32 860
No. of bytes in distributed program, including test data, etc.: 223 762
Distribution format: tar.gz
Programming language: MPI and C
Computer: Any which supports MPI I/O
Operating system: Any
Has the code been vectorised or parallelised?: Yes. Includes MPI directives.
RAM: Depending on input used
Classification: 11.5
External routines: MPI
Nature of problem: Distributed file I/O with metadata
Solution method: MPI parallel I/O based implementation of LIME format
Running time: Varies depending on file and architecture size, in the order of seconds
12.
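To address the low communication efficiency of single-layer (flat) MPI clusters, this paper compares the behavior of flat and tree structures in cluster collective communication and proposes an MPI cluster system design based on a tree structure. The design reduces global communication traffic and balances the load on the master node, thereby improving cluster communication efficiency and making cluster expansion more flexible. Experiments verify the feasibility of the design.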
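The difference between a flat and a tree-structured layout can be made concrete by counting how many messages each rank receives during a gather-style collective. The sketch below is an assumption-laden illustration, not the paper's design: it uses a k-ary tree in which the children of rank r are ranks k*r+1 through k*r+k, and it counts messages rather than bytes.

```python
def flat_gather_load(procs):
    """Messages received per rank when every worker reports directly to rank 0."""
    load = {rank: 0 for rank in range(procs)}
    load[0] = procs - 1
    return load


def tree_gather_load(procs, fanout):
    """Messages received per rank when each rank reports to its parent in a k-ary tree."""
    load = {rank: 0 for rank in range(procs)}
    for child in range(1, procs):
        load[(child - 1) // fanout] += 1  # parent of `child` in level-order numbering
    return load


def tree_depth(procs, fanout):
    """Number of hops from the last rank up to the root."""
    depth, rank = 0, procs - 1
    while rank > 0:
        rank = (rank - 1) // fanout
        depth += 1
    return depth
```

With 13 ranks and a fan-out of 3, the root receives 3 messages instead of 12, at the cost of two hops from the deepest ranks. The fan-out therefore trades master-node load against latency.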
13.
This contribution presents a new procedure for quantifying valve stiction in control loops based on global optimisation. Measurements of the controlled variable (PV) and controller output (OP) are used to estimate the parameters of a Hammerstein system, consisting of a two-parameter stiction model connected to a linear low-order process model. As the objective function is non-smooth, gradient-free optimisation algorithms, i.e., pattern search (PS) methods or genetic algorithms (GA), are used to find the global minimum over the parameters of the stiction model, with a subordinate least-squares estimator identifying the linear model parameters. Some approaches for selecting the model structure of the linear model part are discussed. Results show that this novel optimisation-based technique recovers accurate and reliable estimates of the stiction model parameters, dead-band plus stick band (S) and slip jump (J), from normal (closed-loop) operating data for self-regulating and integrating processes. The robustness of the proposed approach was demonstrated over a range of test conditions including different process types, controller settings and measurement noise. Numerous simulation and industrial case studies are described to demonstrate the applicability of the presented techniques for different loops and for different amounts of stiction.
14.
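This paper introduces CAPTCHA recognition technology, illustrates how to implement CAPTCHAs with the open-source JCAPTCHA framework, and, building on that framework, designs and implements a CAPTCHA component that encapsulates the implementation. The component can be easily integrated into Java EE projects, and practice shows that it does make CAPTCHA development simpler and more effective.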
15.
An Assessment of MPI Environments for Windows NT   Cited by: 1 (self-citations: 0; citations by others: 1)
Takeda K. Allsopp N. K. Hardwick J. C. Macey P. C. Nicole D. A. Cox S. J. Lancaster D. J. 《The Journal of supercomputing》2001,19(3):315-323
In this paper we evaluate the MPI environments currently available for Windows NT on the Intel IA32 and Compaq/DEC Alpha architectures. We present benchmark results for low-level communication and for the NAS Parallel Benchmarks to allow comparison with other systems, but our primary interest is determining real application performance and robustness in production cluster environments. For this we use PAFEC-FE, a large FORTRAN code for finite-element analysis. We present results from three MPI implementations, two architectures, and three networking technologies (10 and 100 Mbit/s Ethernet and 1 Gbit/s Myrinet).
16.
An efficient digital search algorithm that is based on an internal array structure called a double array, which combines the fast access of a matrix form with the compactness of a list form, is presented. Each arc of a digital search tree, called a DS-tree, can be computed from the double array in O(1) time; that is to say, the worst-case time complexity for retrieving a key becomes O(k) for the length k of that key. The double array is modified to make the size compact while maintaining fast access, and algorithms for retrieval, insertion, and deletion are presented. If the size of the double array is n + cm, where n is the number of nodes of the DS-tree, m is the number of input symbols, and c is a constant particular to each double array, then it is theoretically proved that the worst-case times of deletion and insertion are proportional to cm and cm^2, respectively, and are independent of n. Experimental results of building the double array incrementally for various sets of keys show that c has an extremely small value, ranging from 0.17 to 1.13.
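The O(1) arc computation can be illustrated with a small static double array. In the sketch below, a transition from state s on symbol code c goes to t = BASE[s] + c and is valid only if CHECK[t] == s. The first-fit builder, the end-of-key marker, and all names are assumptions made for illustration; the incremental insertion and deletion algorithms and the compaction described in the abstract are not reproduced.

```python
END = 0    # symbol code reserved for the end-of-key marker (an assumption)
FREE = -1  # CHECK value of an unused slot


def build_double_array(keys):
    """Build BASE/CHECK arrays from a key set using simple first-fit placement."""
    codes = {ch: i + 1 for i, ch in enumerate(sorted({c for k in keys for c in k}))}
    trie = {}
    for key in keys:  # ordinary nested-dict trie, used only during construction
        node = trie
        for ch in key:
            node = node.setdefault(codes[ch], {})
        node.setdefault(END, {})
    base, check = [0], [0]  # state 0 is the root; its slot is reserved
    stack = [(0, trie)]
    while stack:
        state, node = stack.pop()
        if not node:
            continue  # end-marker states have no outgoing arcs
        labels = sorted(node)
        b = 1
        # Find the smallest base for which every child slot is still free.
        while any(b + c < len(check) and check[b + c] != FREE for c in labels):
            b += 1
        base[state] = b
        grow = b + labels[-1] + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([FREE] * grow)
        for c in labels:
            check[b + c] = state  # claim the slot and record its parent
            stack.append((b + c, node[c]))
    return base, check, codes


def contains(double_array, key):
    """Retrieve a key with one array lookup per symbol, i.e. O(k) for key length k."""
    base, check, codes = double_array
    state = 0
    for ch in list(key) + [None]:  # None stands for the end-of-key marker
        c = END if ch is None else codes.get(ch)
        if c is None:
            return False  # symbol never seen at build time
        t = base[state] + c
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return True
```

A call such as `contains(build_double_array(["baby", "badge"]), "baby")` walks four symbol arcs plus the end marker. The cost depends on the key length only, not on how many keys are stored.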
17.
18.
Banikazemi M. Govindaraju R.K. Blackmore R. Panda D.K. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(10):1081-1093
The IBM RS/6000 SP system is one of the most cost-effective commercially available high performance machines. IBM RS/6000 SP systems support the Message Passing Interface standard (MPI) and LAPI. LAPI is a low-level, reliable and efficient one-sided communication API library implemented on IBM RS/6000 SP systems. This paper explains how the high performance of the LAPI library has been exploited in order to implement the MPI standard more efficiently than the existing MPI. It describes how to avoid unnecessary data copies at both the sending and receiving sides for such an implementation. The resolution of problems arising from the mismatches between the requirements of the MPI standard and the features of LAPI is discussed. As a result of this exercise, certain enhancements to LAPI are identified to enable an efficient implementation of MPI on LAPI. The performance of the new implementation of MPI is compared with that of the underlying LAPI itself. The latency (in polling and interrupt modes) and bandwidth of our new implementation are compared with those of the native MPI implementation on RS/6000 SP systems. The results indicate that the MPI implementation on LAPI performs comparably to or better than the original MPI implementation in most cases. Improvements of up to 17.3 percent in polling mode latency, 35.8 percent in interrupt mode latency, and 20.9 percent in bandwidth are obtained for certain message sizes. The implementation of MPI on top of LAPI also outperforms the native MPI implementation for the NAS Parallel Benchmarks.
19.
Bruck J. De Coster L. Dewulf N. Ching-Tien Ho Lauwereins R. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(3):256-265
A number of models have been proposed in recent years for message passing parallel systems. Examples are the postal model and its generalization, the LogP model. In the postal model a parameter λ is used to model the communication latency of the message-passing system. Each node during each round can send a fixed-size message and, simultaneously, receive a message of the same size. Furthermore, a message sent out during round r will incur a latency of λ and will arrive at the receiving node at round r+λ-1. Our goal in this paper is to bridge the gap between the theoretical modeling and the practical implementation. In particular, we investigate a number of practical issues related to the design and implementation of two collective communication operations, namely, the broadcast operation and the global combine operation. Those practical issues include, for example, (1) techniques for measuring the value of λ on a given machine, (2) creating efficient broadcast algorithms that take the latency λ and the number of nodes n as parameters and (3) creating efficient global combine algorithms for parallel machines for which λ is not an integer. We propose solutions that address those practical issues and present results of an experimental study of the new algorithms on the Intel Delta machine. Our main conclusion is that the postal model can help in performance prediction and tuning; for example, a properly tuned broadcast improves the known implementation by more than 20%.
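The round rule stated in this abstract (a message sent in round r arrives at round r+λ-1) is enough to sketch how fast a broadcast can spread. The code below assumes an integer λ of at least 1, assumes that a node can start sending in the round after the message arrives, and assumes that every informed node sends to a not-yet-informed node in every round. The function names are hypothetical, and the paper's handling of non-integer λ is not covered.

```python
def informed_after(rounds, latency):
    """Number of nodes holding the message at the end of rounds 0..rounds."""
    informed = [1]  # round 0: only the source has the message
    for r in range(1, rounds + 1):
        # Messages arriving in round r were sent in round r - latency + 1,
        # by the nodes that were already informed at the end of round r - latency.
        senders = informed[r - latency] if r >= latency else 0
        informed.append(informed[r - 1] + senders)
    return informed


def broadcast_rounds(nodes, latency):
    """Smallest number of rounds after which at least `nodes` nodes are informed."""
    rounds = 0
    while informed_after(rounds, latency)[-1] < nodes:
        rounds += 1
    return rounds
```

With a latency of 1 the informed count doubles every round, and with a latency of 2 it follows the Fibonacci sequence. This is why the best broadcast tree depends on λ, and why measuring λ on the target machine matters for tuning.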
20.
Houssein Essam H. Hosney Mosa E. Mohamed Waleed M. Ali Abdelmgeid A. Younis Eman M. G. 《Neural computing & applications》2023,35(7):5251-5275
Neural Computing and Applications - Feature selection (FS) is one of the basic data preprocessing steps in data mining and machine learning. It is used to reduce feature size and increase model...