首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 515 毫秒
1.
In C++, multi‐dimensional arrays are often used but the language provides limited native support for them. The language, in its Standard Library, supplies sophisticated interfaces for manipulating sequential data, but relies on its bare‐bones C heritage for arrays. The MultiArray library, a part of the Boost library collection, enhances a C++ programmer's tool set with versatile multi‐dimensional array abstractions. It includes a general array class template and native array adaptors that support idiomatic array operations and interoperate with C++ Standard Library containers and algorithms. The arrays share a common interface, expressed as a generic programming concept, in terms of which generic array algorithms can be implemented. We present the library design, introduce a generic interface for array programming, demonstrate how the arrays integrate with the C++ Standard Library, and discuss the essential aspects of their implementation. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

2.
The development of intelligent transportation systems (ITS) and the resulting need for the solution of a variety of dynamic traffic network models and management problems require faster‐than‐real‐time computation of shortest path problems in dynamic networks. Recently, a sequential algorithm was developed to compute shortest paths in discrete time dynamic networks from all nodes and all departure times to one destination node. The algorithm is known as algorithm DOT and has an optimal worst‐case running‐time complexity. This implies that no algorithm with a better worst‐case computational complexity can be discovered. Consequently, in order to derive algorithms to solve all‐to‐one shortest path problems in dynamic networks, one would need to explore avenues other than the design of sequential solution algorithms only. The use of commercially‐available high‐performance computing platforms to develop parallel implementations of sequential algorithms is an example of such avenue. This paper reports on the design, implementation, and computational testing of parallel dynamic shortest path algorithms. We develop two shared‐memory and two message‐passing dynamic shortest path algorithm implementations, which are derived from algorithm DOT using the following parallelization strategies: decomposition by destination and decomposition by transportation network topology. The algorithms are coded using two types of parallel computing environments: a message‐passing environment based on the parallel virtual machine (PVM) library and a multi‐threading environment based on the SUN Microsystems Multi‐Threads (MT) library. We also develop a time‐based parallel version of algorithm DOT for the case of minimum time paths in FIFO networks, and a theoretical parallelization of algorithm DOT on an ‘ideal’ theoretical parallel machine. Performances of the implementations are analyzed and evaluated using large transportation networks, and two types of parallel computing platforms: a distributed network of Unix workstations and a SUN shared‐memory machine containing eight processors. Satisfactory speed‐ups in the running time of sequential algorithms are achieved, in particular for shared‐memory machines. Numerical results indicate that shared‐memory computers constitute the most appropriate type of parallel computing platforms for the computation of dynamic shortest paths for real‐time ITS applications.  相似文献   

3.
Symbolic computation has underpinned a number of key advances in Mathematics and Computer Science. Applications are typically large and potentially highly parallel, making them good candidates for parallel execution at a variety of scales from multi‐core to high‐performance computing systems. However, much existing work on parallel computing is based around numeric rather than symbolic computations. In particular, symbolic computing presents particular problems in terms of varying granularity and irregular task sizes that do not match conventional approaches to parallelisation. It also presents problems in terms of the structure of the algorithms and data. This paper describes a new implementation of the free open‐source GAP computational algebra system that places parallelism at the heart of the design, dealing with the key scalability and cross‐platform portability problems. We provide three system layers that deal with the three most important classes of hardware: individual shared memory multi‐core nodes, mid‐scale distributed clusters of (multi‐core) nodes and full‐blown high‐performance computing systems, comprising large‐scale tightly connected networks of multi‐core nodes. This requires us to develop new cross‐layer programming abstractions in the form of new domain‐specific skeletons that allow us to seamlessly target different hardware levels. Our results show that, using our approach, we can achieve good scalability and speedups for two realistic exemplars, on high‐performance systems comprising up to 32000 cores, as well as on ubiquitous multi‐core systems and distributed clusters. The work reported here paves the way towards full‐scale exploitation of symbolic computation by high‐performance computing systems, and we demonstrate the potential with two major case studies. © 2016 The Authors. Concurrency and Computation: Practice and Experience Published by John Wiley & Sons Ltd.  相似文献   

4.
We propose a parallelization scheme for an existing algorithm for constructing a web‐directory, that contains categories of web documents organized hierarchically. The clustering algorithm automatically infers the number of clusters using a quality function based on graph cuts. A parallel implementation of the algorithm has been developed to run on a cluster of multi‐core processors interconnected by an intranet. The effect of the well‐known Latent Semantic Indexing on the performance of the clustering algorithm is also considered. The parallelized graph‐cut based clustering algorithm achieves an F‐measure in the range [0.69,0.91] for the generated leaf‐level clusters while yielding a precision‐recall performance in the range [0.66,0.84] for the entire hierarchy of the generated clusters. As measured via empirical observations, the parallel algorithm achieves an average speedup of 7.38 over its sequential variant, at the same time yielding a better clustering performance than the sequential algorithm in terms of F‐measure. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

5.
The x86 multicore processor architecture and the concepts associated with symmetric multiprocessing have become the linchpin of modern operating systems and cloud computing. A solid understanding of these technologies has become critical to any developer entering the field. Unfortunately, the complexity associated with discovering, enabling, using, and virtualizing multiple cores has created a paucity in the available documentation, transferable knowledge, and readable code exemplars. This paper describes our experience in overcoming these hurdles in the design of a from scratch, multi‐core operating system––Bear––for secure and resilient cloud computing. In particular, we trace the intricacies involved in the development of a multi‐core microkernel with an integrated multi‐core hypervisor. By exploring the implementation details, from bootstrapping through core virtualization to process scheduling, this paper aims to fill the knowledge gaps, highlight potential pitfalls, and introduce multicore development in a concise start‐to‐finish exemplar.  相似文献   

6.
This paper presents an improved non‐sequential multi‐input multi‐output (MIMO) Quantitative Feedback Theory (QFT) design methodology for uncertain systems. A non‐sequential MIMO QFT stability theorem is derived that serves as the basis for an improvement of the design methodology, whereby it can be successfully applied to non‐minimum phase systems, albeit with a degree of conservatism partially inherent in independent and decentralized design methodologies. The results reduce the conservatism in a non‐sequential MIMO QFT design and provide insight into the plant cases for which the methodology can be successfully applied. Copyright © 2006 John Wiley & Sons, Ltd.  相似文献   

7.
We present photon beam diffusion, an efficient numerical method for accurately rendering translucent materials. Our approach interprets incident light as a continuous beam of photons inside the material. Numerically integrating diffusion from such extended sources has long been assumed computationally prohibitive, leading to the ubiquitous single‐depth dipole approximation and the recent analytic sum‐of‐Gaussians approach employed by Quantized Diffusion. In this paper, we show that numerical integration of the extended beam is not only feasible, but provides increased speed, flexibility, numerical stability, and ease of implementation, while retaining the benefits of previous approaches. We leverage the improved diffusion model, but propose an efficient and numerically stable Monte Carlo integration scheme that gives equivalent results using only 3–5 samples instead of 20–60 Gaussians as in previous work. Our method can account for finite and multi‐layer materials, and additionally supports directional incident effects at surfaces. We also propose a novel diffuse exact single‐scattering term which can be integrated in tandem with the multi‐scattering approximation. Our numerical approach furthermore allows us to easily correct inaccuracies of the diffusion model and even combine it with more general Monte Carlo rendering algorithms. We provide practical details necessary for efficient implementation, and demonstrate the versatility of our technique by incorporating it on top of several rendering algorithms in both research and production rendering systems.  相似文献   

8.
In this paper parallelization and segmentation methodologies are used to obtain a real‐time (RT) implementation of computationally expensive estimators or filters in an FPGA. First, the filter to be applied is briefly described, and afterwards its hardware structure and VHDL implementation are indicated. A comparative study is performed between the FPGA parallelized implementation and the implementation in a sequential processor. The analysis proves that the execution times measured on the FPGA are considerably lower, making that implementation valid for its use in RT systems. Moreover, several experimental results are shown for a visual servoing application in order to evidence the good performance of the proposed algorithms and implementations. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

9.
Multi‐core processors can deliver significant performance benefits for multi‐threaded software by adding processing power with minimal latency, given the proximity of the processors. Cryptographic applications are inherently complex and involve large computations. Most cryptographic operations can be translated into logical operations, shift operations, and table look‐ups. In this paper we design a novel processor (called mu‐core) with a reconfigurable Arithmetic Logic Unit, and design custom two‐dimensional multi‐core architectures on top of it to accelerate cryptographic kernels. We propose an efficient mapping of instructions from the multi‐core grid to the individual processor cores and illustrate the performance of AES‐128E algorithm over custom‐sized grids. The model was developed using Simulink and the performance analysis suggests a positive trend towards development of large multi‐core (or multi‐ µ‐core) architectures to achieve high throughputs in cryptographic operations. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

10.
An optimized parallel algorithm is proposed to solve the problem occurred in the process of complicated backward substitution of cyclic reduction during solving tridiagonal linear systems. Adopting a hybrid parallel model, this algorithm combines the cyclic reduction method and the partition method. This hybrid algorithm has simple backward substitution on parallel computers comparing with the cyclic reduction method. In this paper, the operation count and execution time are obtained to evaluate and make comparison for these methods. On the basis of results of these measured parameters, the hybrid algorithm using the hybrid approach with a multi‐threading implementation achieves better efficiency than the other parallel methods, that is, the cyclic reduction and the partition methods. In particular, the approach involved in this paper has the least scalar operation count and the shortest execution time on a multi‐core computer when the size of equations meets some dimension threshold. The hybrid parallel algorithm improves the performance of the cyclic reduction and partition methods by 19.2% and 13.2%, respectively. In addition, by comparing the single‐iteration and multi‐iteration hybrid parallel algorithms, it is found that increasing iteration steps of the cyclic reduction method does not affect the performance of the hybrid parallel algorithm very much. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

11.
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

12.
Branch‐and‐bound (B&B) algorithms are attractive methods for solving to optimality combinatorial optimization problems using an implicit enumeration of a dynamically built tree‐based search space. Nevertheless, they are time‐consuming when dealing with large problem instances. Therefore, pruning tree nodes (subproblems) is traditionally used as a powerful mechanism to reduce the size of the explored search space. Pruning requires to perform the bounding operation, which consists of applying a lower bound function to the subproblems generated during the exploration process. Preliminary experiments performed on the Flow‐Shop scheduling problem (FSP) have shown that the bounding operation consumes over 98% of the execution time of the B&B algorithm. In this paper, we investigate the use of graphics processing unit (GPU) computing as a major complementary way to speed up the search. We revisit the design and implementation of the parallel bounding model on GPU accelerators. The proposed approach enables data access optimization. Extensive experiments have been carried out on well‐known FSP benchmarks using an Nvidia Tesla C2050 GPU card. Compared to a CPU‐based single core execution using an Intel Core i7‐970 processor without GPU, speedups higher than 100 times faster are achieved for large problem instances. At an equivalent peak performance, GPU‐accelerated B&B is twice faster than its multi‐core counterpart. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

13.
This paper deals with formation control problems for multi‐agent systems by using iterative learning control (ILC) design approaches. Distributed formation ILC algorithms are presented to enable all agents in directed graphs to achieve the desired relative formations perfectly over a finite‐time interval. It is shown that not only asymptotic stability but also monotonic convergence of multi‐agent formation ILC can be accomplished, and the convergence conditions in terms of linear matrix inequalities can be simultaneously established. The derived results are also applicable to multi‐agent systems that are subject to stochastic disturbances and model uncertainties. Furthermore, the feasibility of convergence conditions and the effect of communication delays are discussed for the proposed multi‐agent formation ILC algorithms. Simulation results are given for uncertain multi‐agent systems to verify the theoretical study. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

14.
The Dandelion-like codes are eight bijections between labeled trees and strings of node labels. The literature contains optimal sequential algorithms for these bijections, but no parallel algorithms have been reported. In this paper the first parallel encoding and decoding algorithms for Dandelion-like codes are presented. Namely, a unique encoding algorithm and a unique decoding algorithm, which when properly parameterized, can be used for all Dandelion-like codes, are designed. These algorithms are optimal in the sequential setting. The encoding algorithm implementation on an EREW PRAM is optimal, while the efficient implementation of the decoding algorithm requires concurrent reading.  相似文献   

15.
Automatic test data generation is a very popular domain in the field of search‐based software engineering. Traditionally, the main goal has been to maximize coverage. However, other objectives can be defined, such as the oracle cost, which is the cost of executing the entire test suite and the cost of checking the system behavior. Indeed, in very large software systems, the cost spent to test the system can be an issue, and then it makes sense by considering two conflicting objectives: maximizing the coverage and minimizing the oracle cost. This is what we did in this paper. We mainly compared two approaches to deal with the multi‐objective test data generation problem: a direct multi‐objective approach and a combination of a mono‐objective algorithm together with multi‐objective test case selection optimization. Concretely, in this work, we used four state‐of‐the‐art multi‐objective algorithms and two mono‐objective evolutionary algorithms followed by a multi‐objective test case selection based on Pareto efficiency. The experimental analysis compares these techniques on two different benchmarks. The first one is composed of 800 Java programs created through a program generator. The second benchmark is composed of 13 real programs extracted from the literature. In the direct multi‐objective approach, the results indicate that the oracle cost can be properly optimized; however, the full branch coverage of the system poses a great challenge. Regarding the mono‐objective algorithms, although they need a second phase of test case selection for reducing the oracle cost, they are very effective in maximizing the branch coverage. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

16.
Although many image processing applications are ideally suited for parallel implementation, most researchers in imaging do not benefit from high‐performance computing on a daily basis. Essentially, this is due to the fact that no parallelization tools exist that truly match the image processing researcher's frame of reference. As it is unrealistic to expect imaging researchers to become experts in parallel computing, tools must be provided to allow them to develop high‐performance applications in a highly familiar manner. In an attempt to provide such a tool, we have designed a software architecture that allows transparent (i.e. sequential) implementation of data parallel imaging applications for execution on homogeneous distributed memory MIMD‐style multicomputers. This paper presents an extensive overview of the design rationale behind the software architecture, and gives an assessment of the architecture's effectiveness in providing significant performance gains. In particular, we describe the implementation and automatic parallelization of three well‐known example applications that contain many fundamental imaging operations: (1) template matching; (2) multi‐baseline stereo vision; and (3) line detection. Based on experimental results we conclude that our software architecture constitutes a powerful and user‐friendly tool for obtaining high performance in many important image processing research areas. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

17.
Unstructured mesh generation exposes highly irregular computation patterns, which imposes a challenge in implementing triangulation algorithms on parallel machines. This paper reports on an efficient parallel implementation of near Delaunay triangulation with High Performance Fortran (HPF). Our algorithm exploits embarrassing parallelism by performing sub‐block triangulation and boundary merge independently at the same time. The sub‐block triangulation is a divide & conquer Delaunay algorithm known for its sequential efficiency, and the boundary triangulation is an incremental construction algorithm with low overhead. Compared with prior work, our parallelization method is both simple and efficient. In the paper, we also describe a solution to the collinear points problem that usually arises in large data sets. Our experiences with the HPF implementation show that with careful control of the data distribution, we are able to parallelize the program using HPF's standard directives and extrinsic procedures. Experimental results on several parallel platforms, including an IBM SP2 and a DEC Alpha farm, show that a parallel efficiency of 42–86% can be achieved for an eight‐node distributed memory system. We also compare efficiency of the HPF implementation with that of a similarly hand‐coded MPI implementation. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

18.
This paper presents a new methodology for the design and implementation of gain‐scheduled controllers for multi‐rate systems. The proposed methodology provides a natural way to address the integrated guidance and control problem for autonomous vehicles when the outputs are sampled at different instants of time. A controller structure is first proposed for the regulation of non‐square multi‐rate systems with more measured outputs than inputs. Based on this structure, an implementation for a gain‐scheduled controller is obtained that satisfies an important property known as the linearization property. The implementation resembles the velocity implementation for single‐rate systems. The method is then applied to the problem of steering an autonomous rotorcraft along a predefined trajectory defined in terms of space and time coordinates. By considering a convenient error vector to describe the vehicle's dynamics, the trajectory tracking problem is reduced to that of regulating the error variables to zero. Because of the periodic multi‐rate nature of the onboard sensor suite, the controller synthesis is dealt with under the scope of linear periodic systems theory. Simulation results obtained with a full non‐linear rotorcraft dynamic model are presented and discussed. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

19.
Traversing voxels along a three dimensional (3D) line is one of the most fundamental algorithms for voxel‐based applications. This paper presents a new 6‐connectivity integer algorithm for this task. The proposed algorithm accepts voxels having different sizes in x, y and z directions. To explain the idea of the proposed approach, a 2D algorithm is firstly considered and then extended in 3D. This algorithm is a multi‐step as up to three voxels may be added in one iteration. It accepts both integer and floating‐point input. The new algorithm was compared to other popular voxel traversing algorithms. Counting the number of arithmetic operations showed that the proposed algorithm requires the least amount of operations per traversed voxel. A comparison of spent CPU time using either integer or floating‐point arithmetic confirms that the proposed algorithm is the most efficient. This algorithm is simple, and in compact form which also makes it attractive for hardware implementation.  相似文献   

20.
The control design problem for finite‐time and fixed‐time stabilizations of linear multi‐input system with nonlinear uncertainties and disturbances is considered. The control design algorithm based on block decomposition and implicit Lyapunov function technique is developed. The robustness properties of the obtained control laws with respect to matched and unmatched uncertainties and disturbances are studied. Procedures for tuning of control parameters are presented in the form of linear matrix inequalities. Aspects of practical implementation of developed algorithms are discussed. Theoretical results are supported by numerical simulations. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号