首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 8 毫秒
1.
Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined in ATLAS by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to select library calls and choose data layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.  相似文献   

2.
In this paper we propose a new optimization framework that unites some of the existing tensor based methods for face recognition on a common mathematical basis. Tensor based approaches rely on the ability to decompose an image into its constituent factors (i.e. person, lighting, viewpoint, etc.) and then utilizing these factor spaces for recognition. We first develop a multilinear optimization problem relating an image to its constituent factors and then develop our framework by formulating a set of strategies that can be followed to solve this optimization problem. The novelty of our research is that the proposed framework offers an effective methodology for explicit non-empirical comparison of the different tensor methods as well as providing a way to determine the applicability of these methods in respect to different recognition scenarios. Importantly, the framework allows the comparative analysis on the basis of quality of solutions offered by these methods. Our theoretical contribution has been validated by extensive experimental results using four benchmark datasets which we present along with a detailed discussion.  相似文献   

3.
4.
In the paper we present a framework for partitioning data parallel computations across a heterogeneous metasystem at runtime. The framework is guided by program and resource information which is made available to the system. Three difficult problems are handled by the framework: processor selection, task placement and heterogeneous data domain decomposition. Solving each of these problems contributes to reduced elapsed time. In particular, processor selection determines the best grain size at which to run the computation, task placement reduces communication cost, and data domain decomposition achieves processor load balance. We present results which indicate that excellent performance is achievable using the framework. The paper extends our earlier work on partitioning data parallel computations across a single-level network of heterogeneous workstations.  相似文献   

5.
We present a new method for predicting RNA secondary structure based on a genetic algorithm. The algorithm is designed to run on a massively parallel SIMD computer. Statistical analysis shows that the program performs well when compared to a dynamic programming algorithm used to solve the same problem. The program has also pointed out a long-standing simplification in the implementation of the original dynamic programming algorithm that sometimes causes it not to find the optimal secondary structure.  相似文献   

6.
We present a Python extension to the massively parallel HPC simulation toolkit waLBerla. waLBerla is a framework for stencil based algorithms operating on block-structured grids, with the main application field being fluid simulations in complex geometries using the lattice Boltzmann method. Careful performance engineering results in excellent node performance and good scalability to over 400,000 cores. To increase the usability and flexibility of the framework, a Python interface was developed. Python extensions are used at all stages of the simulation pipeline: they simplify and automate scenario setup, evaluation, and plotting. We show how our Python interface outperforms the existing text-file-based configuration mechanism, providing features like automatic nondimensionalization of physical quantities and handling of complex parameter dependencies. Furthermore, Python is used to process and evaluate results while the simulation is running, leading to smaller output files and the possibility to adjust parameters dependent on the current simulation state. C++ data structures are exported such that a seamless interfacing to other numerical Python libraries is possible. The expressive power of Python and the performance of C++ make development of efficient code with low time effort possible.  相似文献   

7.
s and t within a given planar figure F is considered. The approach contains basic methodology developed for any parallel or distributed system. The 2D scene or the edge of F are represented in the n Cartesian coordinate system (n-CCS). Several algorithms for the shortest path are given, each one to be applied in specified circumstances depending on the exact machine model or on additional information concerning geometrical properties of the figure. If these algorithms are implemented in a parallel depth search machine (PDSM), then the shortest path can be computed in time O(1). The maximum number of processors used is 0(n). The given methodology can also be adapted for producing an approximate solution when the shortest path is approximated by polygonal lines.  相似文献   

8.
We present a scalable framework for parallelizing greedy graph coloring algorithms on distributed-memory computers. The framework unifies several existing algorithms and blends a variety of techniques for creating or facilitating concurrency. The latter techniques include exploiting features of the initial data distribution, the use of speculative coloring and randomization, and a BSP-style organization of computation and communication. We experimentally study the performance of several specialized algorithms designed using the framework and implemented using MPI. The experiments are conducted on two different platforms and the test cases include large-size synthetic graphs as well as real graphs drawn from various application areas. Computational results show that implementations that yield good speedup while at the same time using about the same number of colors as a sequential greedy algorithm can be achieved by setting parameters of the framework in accordance with the size and structure of the graph being colored. Our implementation is freely available as part of the Zoltan parallel data management and load-balancing library.  相似文献   

9.
Building large-scale parallel computer systems for time-critical applications is a challenging task since the designers of such systems need to consider a number of related factors such as proper support for fault tolerance, efficient task allocation and reallocation strategies, and scalability. In this paper we propose a massively parallel fault-tolerant architecture using hundreds or thousands of processors for critical applications with timing constraints. The proposed architecture is based on an interconnection network called thebisectional network. A bisectional network is isomorphic to a hypercube in that a binary hypercube network can be easily extended as a bisectional network by adding additional links. These additional links add to the network some rich topological properties such as node symmetry, small diameter, small internode distance, and partitionability. The important property of partitioning is exploited to propose a redundant task allocation and a task redistribution strategy under realtime constraints. The system is partitioned into symmetric regions (spheres) such that each sphere has a central control point. The central points, calledfault control points (FCPs), are distributed throughout the entire system in an optimal fashion and provide two-level task redundancy and efficiently redistribute the loads of failed nodes. FCPs are assigned to the processing nodes such that each node is assigned two types of FCPs for storing two redundant copies of every task present at the node. Similarly, the number of nodes assigned to each FCP is the same. For a failure-repair system environment the performance of the proposed system has been evaluated and compared with a hypercube-based system. Simulation results indicate that the proposed system can yield improved performance in the presence of a high number of node failures.  相似文献   

10.
This paper describes a parallel solver of tridiagonal systems appropriate for distributed memory computers and implemented on an array of chain-connected T800 Transputers. Each processor in the chain uses the same program to solve its own subset of equations. This implementation is suited, for instance, for solving the heat conduction equation in one-dimensional hydrodynamic codes. The procedure performs a parallel cyclic reduction, a recursive Gaussian elimination on a reduced number of equations and a parallel backward unfolding scheme, with a direct substitution in the reduced equations. The code has been written in Occam2 language. A one-way communication of values between adjacent processors is required at each cycle of both the reduction and the unfolding steps. Due to the number of floating point operations and the amount of communications the implementation described here works efficiently on arrays with more than 4 processors and for more than 50 equations per processor.  相似文献   

11.
Algorithms for parallel memory,II: Hierarchical multilevel memories   总被引:1,自引:0,他引:1  
In this paper we introduce parallel versions of two hierarchical memory models and give optimal algorithms in these models for sorting, FFT, and matrix multiplication. In our parallel models, there areP memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using/times the optimal running time is exponentially small in (log ) logP.A summarized version of this research was presented at the 22nd Annual ACM Symposium on Theory of Computing, Baltimore, MD, May 1990. This work was done while the first author was at Brown University. Support was provided in part by a National Science Foundation Presidential Young Investigator Award with matching funds from IBM, by NSF Research Grants DCR-8403613 and CCR-9007851, by Army Research Office Grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under Contract N00014-91-J-4052 ARPA Order 8225. This work was done in part while the second author was at Brown University supported by a Bellcore graduate fellowship and at Bellcore.  相似文献   

12.
This article is dedicated to techniques and theories of image fusion in automatic ways and addresses two issues—the parameter setting and quality assessment. Optimal parameters are in demand for specific applications or comparison between fusion methods because, as basic evidence, different parameters bring different fusion effects varying over a large range. In this paper, we propose a general framework of online parameter training to search optimal values that best suit input images. Furthermore, we optimized the compute‐intensive training process using parallelization and genetic algorithm, as well as patches extraction. We also propose a metric—spatial and spectral distortion—as the learning target. The spatial and spectral distortion is a fuzzy combination of mean potential energy measuring spatial distortion and Q4 measuring spectral distortion. Optimization validation on weighted Gram–Schmidt fusion indicated linear or superlinear acceleration ability, which proved that the proposed learning framework can speed up the learning process of image fusion to an acceptable time, and can thus be applied to high‐performance platforms to process large volumes of data. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

13.
The mapping of tasks of a parallel program onto nodes of a parallel computing system has a remarkable impact on application performance. In this paper we propose an optimization framework to solve the mapping problem, which takes into account the communication matrix of the application and a cost matrix that depends on the topology of the parallel system. This cost matrix is usually a distance matrix (the classic approach), but we propose a novel definition of the cost criterion, applicable to torus networks, that tries to distribute traffic evenly over the different axes; we call this the Traffic Distribution criterion. As the mapping problem can be seen as a particular instance of the Quadratic Assignment Problem (QAP), we can apply any QAP solver to this problem. In particular, we use a greedy randomized algorithm. Using simulation, we test the performance levels of the optimization-based mappings, and compare them with those of trivial mappings (consecutive, random), in two different environments: single application (one application uses all system resources all the time) and space sharing (several applications run simultaneously, on different system partitions), using systems with 2D and 3D topologies and real application traffic. Experimental results show that some applications do not benefit from optimization-based mappings: those in which there is a match between virtual and physical topologies, and those that carry out massive all-to-all communications. In other cases, optimization-based mappings with the TD criterion provide excellent performance levels.  相似文献   

14.
In high‐speed network monitoring, the ever‐growing traffic calls for a high‐performance solution for the computation of frequent items. The increasing number of cores in the current commodity multi‐core processors opens up new opportunities in parallelization. In this paper, we present a novel precision integrated framework (PRIF) that exploits the great parallel capability of multi‐cores to speed up the famous frequent algorithm. PRIF equally distributes the input data stream into sub‐threads that use the optimized weighted frequent algorithm to track local frequent items. The items with frequency increments exceeding a pre‐defined threshold are sent to a merging thread which is able to return the global continuous ε‐deficient frequent items. The theoretical correctness and complexity analyses are presented. Experiments with real and synthetic traces confirm the theoretical analyses and demonstrate the excellent performance as well as the effects of parameters and data skewness. The results show that PRIF is able to provide continuous frequent items and near‐linear speedup at the cost of greater memory use. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

15.
基于对角划分的矩阵乘并行算法   总被引:5,自引:0,他引:5  
提出了一种新的基于对角划分的矩阵乘并行算法,它在以往行列划分策略的基础上,采用基于对角划分的策略。数值试验表明该算法具有较高的加速比和并行效率。  相似文献   

16.
Increasingly large amount of multidimensional data are being generated on a daily basis in many applications. This leads to a strong demand for learning algorithms to extract useful information from these massive data. This paper surveys the field of multilinear subspace learning (MSL) for dimensionality reduction of multidimensional data directly from their tensorial representations. It discusses the central issues of MSL, including establishing the foundations of the field via multilinear projections, formulating a unifying MSL framework for systematic treatment of the problem, examining the algorithmic aspects of typical MSL solutions, and categorizing both unsupervised and supervised MSL algorithms into taxonomies. Lastly, the paper summarizes a wide range of MSL applications and concludes with perspectives on future research directions.  相似文献   

17.
18.
Cadabra is a powerful computer program for the manipulation of tensor equations. It was designed for use in high energy physics but its rich structure and ease of use lends itself well to the routine computations required in General Relativity. Here we will present a series of simple examples showing how Cadabra may be used, including verifying that the Levi-Civita connection is a metric connection and a derivation of the Gauss equation between induced and ambient curvatures.  相似文献   

19.
We present a randomized parallel algorithm that computes the greatest common divisor of two integers of n bits in length with probability 1−o(1) that takes O(nloglogn/logn) time using O(n6+?) processors for any ?>0 on the EREW PRAM parallel model of computation. The algorithm either gives a correct answer or reports failure.We believe this to be the first randomized sublinear time algorithm on the EREW PRAM for this problem.  相似文献   

20.
针对现有的并行模糊测试在测试效率、资源利用率以及异常处理上的局限性,本文围绕测试资源的生成、使用及容错三个方面提出了一种动态资源感知的系统化解决方案。针对测试环境在大规模和多场景两个维度快速搭建的需求,提出一种基于云平台的动态构建方法,加快测试环境部署,提高有效fuzz时间;针对并行模糊测试中资源利用率低的问题,提出一种多层次并行度动态调整的资源配置策略,优化整体测试资源配置并提高单机负载;针对大规模并行测试中节点易发生故障的问题,提出基于优先级调度的容错处理方法。最后,本文设计并实现了一个基于四级流水线并行处理结构的通用模糊测试框架。实验证明,该框架能够有效提高并行模糊测试的测试效率和资源利用率,实现系统的有效容错。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号