Similar literature
Found 20 similar documents; search took 15 ms
1.
Efficient computation of dynamics parameters is an important issue in the simulation and control of multibody systems as these systems become more complex. Recent advances in computer architecture favor multicore systems over high-speed single-core systems. Therefore, parallel algorithms for computing dynamics parameters should be designed to improve performance on these multicore architectures. In this paper, a new dynamics computation algorithm is derived using the principle of dynamical balance, which provides explicit computation of dynamic parameters. The new algorithm has a structure to which parallel computation can be easily applied. Parallel computation methods are then applied to exploit this structure. The parallel algorithm is designed based on task and data parallelism. The performance of the proposed algorithm is verified on robots with various topologies, and the experiments demonstrate the improved speed of parallel computation.

2.
Parallel Computing, 1997, 23(7): 899-913
Radiosity is a powerful method for solving the global illumination problem in the case of purely diffuse light reflections. The progressive refinement algorithm provides interactivity during computation by displaying intermediate images, and overshooting methods increase the convergence rate of progressive radiosity. However, computation times remain very high. Parallelising these algorithms is a good way to significantly improve interactivity by reducing computation time. The aim of this paper is to present a method for parallelising the progressive refinement radiosity algorithm on a massively parallel SIMD machine. In studying the several ways to implement the algorithm efficiently, we took into account both the SIMD nature of the machine and the large number of available processors. The parallel scheme we propose uses a disk projection area for form-factor estimation and dramatically decreases computation times.

3.
With the popularity of column-store databases, modern multi-core CPUs, and general-purpose computing on graphics processing units (GPGPUs), there will be radical changes in how processing is done in the online analytical processing (OLAP) and data warehousing fields. Cube computation is a core and time-consuming problem which has been researched extensively. However, most of the algorithms have been proposed without considering the prevalent multi-core architectures and column storage. This paper presents a new parallel cube algorithm that takes advantage of multi-core architectures. We first propose a cache-conscious bottom-up computation (BUC) algorithm called CC-BUC that adopts an integrated bottom-up and breadth-first partitioning order. Each dimension is separately stored and processed. In processing each dimension, breadth-first data scanning and result output reduce memory I/O and enhance cache locality. Cache misses are confined to the scope of a dimension, and translation lookaside buffer (TLB) misses are reduced. Based on CC-BUC, we give a multi-core architecture-based cube algorithm called MC-Cubing. Multiple partitions are processed simultaneously, and multiple threads execute in parallel inside each partition. MC-Cubing is well suited to multi-core architectures and offers high parallelism. The layout and associated algorithms take advantage of single instruction, multiple data (SIMD) instructions and thread-level parallelism (TLP). We implement and demonstrate the effectiveness of MC-Cubing on two multi-core architectures: multi-core CPUs and GPUs. Experimental results show that MC-Cubing achieves a speedup of nearly six times over BUC on real datasets.

4.
We consider a graph-theoretical model and study a parallel implementation of the well-known Gaussian elimination method on parallel distributed-memory architectures, where the communication delay for transmitting an elementary data item is higher than the computation time of an elementary instruction. We propose and analyze two low-complexity algorithms for scheduling the tasks of parallel Gaussian elimination on an unbounded number of completely connected processors. We compare these two algorithms with a higher-complexity general-purpose scheduling algorithm, the DSC heuristic, proposed by A. Gerasoulis and T. Yang (1993).

5.
We describe new architectures for the efficient computation of redundant manipulator kinematics (direct and inverse). By calculating the core of the problem in hardware, we can make full use of the redundancy by implementing more complex self-motion algorithms. A key component of our architecture is the calculation in VLSI hardware of the Singular Value Decomposition of the manipulator Jacobian. Recent advances in VLSI have allowed the mapping of complex algorithms to hardware using systolic arrays with advanced computer arithmetic algorithms, such as the coordinate rotation (CORDIC) algorithms. We use CORDIC arithmetic in the novel design of our special-purpose VLSI array, which is used in the computation of the Direct Kinematics Solution (DKS), the manipulator Jacobian, and the Jacobian Pseudoinverse. Application-specific (subtask-dependent) portions of the inverse kinematics are handled in parallel by a DSP processor which interfaces with the custom hardware and the host machine. The architecture and algorithm development is valid for general redundant manipulators and a wide range of processors currently available and under development commercially.
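
The CORDIC primitive mentioned in this abstract can be illustrated with a minimal rotation-mode sketch. This is not the paper's VLSI design: the function name `cordic_rotate`, the iteration count, and the use of floating point are assumptions made for readability. In hardware the factors 2^-i would be bit shifts on fixed-point values, and the arctangent table and gain would be precomputed constants.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians using CORDIC micro-rotations.

    Hypothetical helper for illustration; converges for |angle| up to
    roughly 1.74 rad (the sum of all the elementary angles).
    """
    # Elementary angles atan(2^-i); a hardware design would store these in a table.
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i); accumulate the gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    # Undo the accumulated gain so the result has the original length.
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, theta)` approximates `(cos(theta), sin(theta))` using only additions, sign tests, and scalings by powers of two inside the loop.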

6.
7.
Parallel architectures and algorithms for image component labeling   Cited by: 1 (self-citations: 0, citations by others: 1)
A survey and a characterization of the various parallel algorithms and architectures developed for the problem of labeling digitized images over the last two decades are presented. It is shown that four basic parallel techniques underlie the various parallel algorithms for this problem. However, because most of these techniques have been developed at a theoretical level, it is still not clear which techniques are most efficient in practical terms. Parallel architectures and parallel models of computation that implement these techniques are also studied.

8.
In this paper we present two algorithms for the parallel solution of first-order linear recurrences. We show that the algorithms can be used to efficiently solve both scalar and blocked versions of the problem on vector and SIMD architectures. The first algorithm is a parallel approach whose resulting code can be explicitly vectorized, making it suitable for efficient execution on vector architectures such as the Cray 2. The second algorithm is a modified recursive approach designed to reduce the communication overhead encountered in SIMD architectures such as the Connection Machine 2 (CM-2). We present the performance exhibited by the parallel algorithm implementations on the Cray 2 and CM-2 for both scalar and blocked versions of the recurrence problem.
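
To make the problem concrete, the following is a minimal sketch of how a scalar first-order linear recurrence x[i] = a[i] * x[i-1] + b[i] can be parallelized by recursive doubling. It is a generic textbook technique, not either of the two algorithms from the paper. The function names are hypothetical, and the list comprehensions stand in for the element-wise vector or SIMD operations a real implementation would use.

```python
def solve_recurrence_serial(a, b, x0):
    """Reference solution: x[i] = a[i] * x[i-1] + b[i], starting from x0."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs


def solve_recurrence_doubling(a, b, x0):
    """Recursive doubling: treat each step as an affine map x -> A*x + B.

    Composing affine maps is associative, so after ceil(log2(n)) rounds
    (A[i], B[i]) describes the combined effect of steps 0..i on x0.
    """
    n = len(a)
    A, B = list(a), list(b)
    step = 1
    while step < n:
        # Every index is updated independently within a round, which is
        # what makes the round suitable for vector or SIMD execution.
        new_A = [A[i] * A[i - step] if i >= step else A[i] for i in range(n)]
        new_B = [A[i] * B[i - step] + B[i] if i >= step else B[i] for i in range(n)]
        A, B = new_A, new_B
        step *= 2
    return [A[i] * x0 + B[i] for i in range(n)]
```

The doubling version performs more total arithmetic than the serial loop but needs only a logarithmic number of dependent rounds, which is the usual trade-off for this class of methods.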

9.
Fast BVH Construction on GPUs   Cited by: 1 (self-citations: 0, citations by others: 1)
We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. Both algorithms are combined into a hybrid algorithm that removes existing bottlenecks in GPU construction performance and scalability, leading to significantly decreased build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal cost are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.
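
The linear ordering step behind the first algorithm can be sketched as follows. This is only an illustration of Morton-code ordering, not the paper's GPU implementation: the function names, the 10-bit quantization per axis, and the assumption that centroids are already normalized to the unit cube are all choices made here for the example, and the code runs serially on the CPU.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three integer coordinates."""
    code = 0
    for i in range(bits):
        code |= ((ix >> i) & 1) << (3 * i + 2)
        code |= ((iy >> i) & 1) << (3 * i + 1)
        code |= ((iz >> i) & 1) << (3 * i)
    return code


def quantize(v, bits=10):
    """Map a coordinate in [0, 1] to an integer grid cell (clamped)."""
    cells = 1 << bits
    return min(max(int(v * cells), 0), cells - 1)


def morton_order(centroids, bits=10):
    """Return primitive indices sorted by the Morton code of their centroid.

    `centroids` is a list of (x, y, z) tuples assumed to lie in the unit cube.
    """
    codes = [
        morton3d(quantize(x, bits), quantize(y, bits), quantize(z, bits), bits)
        for (x, y, z) in centroids
    ]
    return sorted(range(len(centroids)), key=lambda i: codes[i])
```

Primitives that are adjacent in this ordering tend to be close in space, so contiguous ranges of the sorted list are natural candidates for BVH nodes. Each code is computed independently of the others, which is why this step maps well to many cores.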

10.
We present a parallel toolkit for pairwise distance computation in massive networks. Computing the exact shortest paths between a large number of vertices is a costly operation, and serial algorithms are not practical for billion-scale graphs. We first describe an efficient parallel method to solve the single source shortest path problem on commodity hardware with no shared memory. Using it as a building block, we introduce a new parallel algorithm to estimate the shortest paths between arbitrary pairs of vertices. Our method exploits data locality, produces highly accurate results, and allows batch computation of shortest paths with 7% average error in graphs that contain billions of edges. The proposed algorithm is up to two orders of magnitude faster than previously suggested algorithms and does not require large amounts of memory or expensive high-end servers. We further leverage this method to estimate the closeness and betweenness centrality metrics, which involve systems challenges dealing with indexing, joining, and comparing large datasets efficiently. In one experiment, we mined a real-world Web graph with 700 million nodes and 12 billion edges to identify the most central vertices and calculated more than 63 billion shortest paths in 6 h on a 20-node commodity cluster. Copyright © 2014 John Wiley & Sons, Ltd.

11.
Modern complex embedded applications in multiple application fields impose stringent and continuously increasing functional and parametric demands. To adequately serve these applications, massively parallel multi-processor systems on a single chip (MPSoCs) are required. This paper is devoted to the design of scalable communication architectures of massively parallel hardware multi-processors for highly demanding applications. We demonstrate that in massively parallel hardware multi-processors the influence of the communication network on both throughput and circuit area dominates the influence of the processors, while the traditionally used flat communication architectures do not scale well as parallelism increases. Therefore, we propose to design highly optimized application-specific partitioned hierarchical organizations of the communication architectures by exploiting the regularity and hierarchy of the actual information flows of a given application. We developed related communication architecture synthesis strategies and incorporated them into our quality-driven model-based multi-processor design methodology and related automated architecture exploration framework. Using this framework we performed a large series of architecture synthesis experiments. Some of the results of the experiments are presented in this paper. They demonstrate many features of the synthesized communication architectures and show that our method and related framework are able to efficiently synthesize highly scalable communication architectures even for high-end massively parallel multi-processors that have to satisfy extremely stringent computation demands.

12.
A number of recent studies have revealed that Optical Transpose Interconnection Systems (OTIS) are promising candidates for future high-performance parallel computers. In this paper, we present and evaluate two general methods for algorithm development on the OTIS. The proposed methods are general in the sense that no specific factor network or problem domain is assumed. They allow efficient mapping of a wide class of algorithms onto the OTIS. These methods are based on grids and pipelines, popular structures that support a vast body of parallel applications including linear algebra, divide-and-conquer algorithms, sorting, and FFT computation. Timing models for measuring the performance of the proposed methods are also provided. Through these models, the performance of various algorithms on the OTIS is evaluated and compared with that of their counterparts on conventional electronic interconnection systems. This study confirms the viability of the OTIS as an attractive alternative for large-scale parallel architectures. Finally, we show how the proposed methods can be used to design parallel algorithms for linear algebra on the OTIS.

13.
Stackless traversal algorithms for ray tracing acceleration structures require significantly less storage per ray than ordinary stack-based ones. This advantage is important for massively parallel rendering methods, where there are many rays in flight. On SIMD architectures, a commonly used acceleration structure is the MBVH, which has multiple bounding boxes per node for improved parallelism. It scales to branching factors higher than two, for which, however, only stack-based traversal methods have been proposed so far. In this paper, we introduce a novel stackless traversal algorithm for MBVHs with up to four-way branching. Our approach replaces the stack with a small bitmask, supports dynamic ordered traversal, and has a low computation overhead. We also present efficient implementation techniques for recent CPU, MIC (Intel Xeon Phi) and GPU (NVIDIA Kepler) architectures.

14.
Dynamic programming (DP) is a popular technique for solving combinatorial search and optimization problems. This paper focuses on one type of DP, called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine-grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On an algorithmic level, both parallelism and locality benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but also portable performance across the 8-core Sun Niagara and the quad-core Intel Clovertown.
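
For readers unfamiliar with NPDP, the sketch below shows a standard textbook instance of the class, optimal matrix-chain parenthesization, evaluated diagonal by diagonal. It is chosen here only to show the nonuniform dependence pattern; the abstract does not say which benchmark problems the paper uses, and this code implements neither the percolation technique nor the pipelining algorithm. The function name is hypothetical.

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix t has shape dims[t] x dims[t + 1]. Cell m[i][j] depends on
    m[i][k] and m[k + 1][j] for every split point k, so the number of
    dependencies grows with j - i (the "nonserial polyadic" pattern).
    """
    n = len(dims) - 1
    m = [[0] * n for _ in range(n)]
    for span in range(1, n):
        # All cells on one diagonal (fixed span) depend only on shorter
        # spans, so they could be computed in parallel.
        for i in range(n - span):
            j = i + span
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return m[0][n - 1]
```

Because each cell reads a whole row segment and a whole column segment of the table, neighboring cells touch very different memory, which is the locality problem the abstract refers to.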

15.
16.
The NERSC and Lawrence Berkeley National Laboratory visualization group has developed the Visapult tool to attack grand challenge problems. Visapult is a distributed, parallel, volume rendering application that leverages parallel computation and high-performance networking resources that are on the same scale as the supercomputers generating the data. We've improved Visapult's effectiveness using aggressive network tuning and network protocol modifications. In particular, we used a new connectionless protocol based on the user datagram protocol (UDP) to improve network efficiency from 25 percent to 88 percent of line rate on multigigabit networks. This connectionless protocol also dramatically reduces the latency of network event delivery, improving the responsiveness of wide area distributed interactive graphics applications as compared to transmission control protocol (TCP) streams. We believe that this UDP-based protocol, as well as transport encodings and algorithms that can tolerate loss gracefully, will become a fundamental component of future grid visualization architectures.

17.
This paper presents a parallel volume rendering algorithm that can render a 256×256×225 voxel medical data set at over 15 Hz and a 512×512×334 voxel data set at over 7 Hz on a 32-processor Silicon Graphics Challenge. The algorithm achieves these results by minimizing each of the three components of execution time: computation time, synchronization time, and data communication time. Computation time is low because the parallel algorithm is based on the recently reported shear-warp serial volume rendering algorithm, which is over five times faster than previous serial algorithms. The algorithm uses run-length encoding to exploit coherence and an efficient volume traversal to reduce overhead. Synchronization time is minimized by using dynamic load balancing and a task partition that minimizes synchronization events. Data communication costs are low because the algorithm is implemented for shared-memory multiprocessors, a class of machines with hardware support for low-latency fine-grain communication and hardware caching to hide latency. We draw two conclusions from our implementation. First, we find that on shared-memory architectures data redistribution and communication costs do not dominate rendering time. Second, we find that cache locality requirements impose a limit on parallelism in volume rendering algorithms. Specifically, our results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets.

18.
The grid and the mesh of trees (MOT) are among the best-known parallel architectures in the literature. Both enjoy efficient VLSI layouts, simplicity of topology, and a large number of parallel algorithms that can execute efficiently on them. One drawback of these architectures is that algorithms that perform best on one of them do not perform very well on the other. Thus there is a gap between the algorithmic capabilities of these two architectures. We propose a new class of parallel architectures, called the mesh-connected trees (MCT), that can execute grid algorithms as efficiently as the grid, and MOT algorithms as efficiently as the MOT, up to a constant amount of slowdown. In particular, the MCT topology contains the MOT as a subgraph and emulates the grid via an embedding with dilation 3 and congestion 2. This significant computational versatility comes at no additional VLSI area cost over these earlier networks. Many topological, routing, and embedding properties analyzed here suggest that the MCT architecture is also a serious competitor for the hypercube. In fact, while the MCT is much simpler and cheaper than the hypercube, for all the algorithms we developed, the running time complexity on the MCT matches that of well-known hypercube algorithms. We also present an interesting variant of the MCT architecture that admits both the MOT and the torus as its subgraphs. While most of the discussion in this paper is focused on the MCT architecture itself, these analyses can be easily extended to the variant of the MCT presented here.

19.
Data-parallel, volume-rendering algorithms   Cited by: 1 (self-citations: 0, citations by others: 1)
In this presentation, we consider the image-composition scheme for parallel volume rendering in which each processor is assigned a portion of the volume. A processor renders its data by using any existing volume-rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector-processing capabilities. The resulting images from all processors are then combined (composited) in visibility order to form the final image. The major advantage of this approach is that, as viewing and shading parameters change, only 2D partial images, and not 3D volume data, are communicated among processors. Through experimental results and performance analysis, we show that our parallel algorithm is amenable to extremely efficient implementations on distributed memory, multiple instruction-multiple data (MIMD), vector-processor architectures. This algorithm is also very suitable for hardware implementation based on image composition architectures. It supports various volume-rendering algorithms, and it can be extended to provide load-balanced execution.
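
The compositing step described in this abstract can be sketched with the standard "over" operator. The abstract does not specify the blending formula or the image layout, so the details here are assumptions: pixels are premultiplied-alpha `(r, g, b, a)` tuples, each partial image is a flat list of pixels, and the partial images are already sorted front to back. The function names are hypothetical.

```python
from functools import reduce


def over(front, back):
    """Composite one premultiplied RGBA pixel over another."""
    transmittance = 1.0 - front[3]
    # With premultiplied alpha, colour and alpha use the same formula.
    return tuple(f + transmittance * b for f, b in zip(front, back))


def composite(partials):
    """Combine per-processor partial images given in front-to-back order.

    `partials` is a list of images; each image is a list of RGBA pixels
    of the same length. Returns the final image as a list of pixels.
    """
    return [reduce(over, pixels) for pixels in zip(*partials)]
```

Only these 2D partial images need to be exchanged between processors. Because `over` is associative, neighboring partial images can also be merged pairwise in a tree, provided their visibility order is preserved.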

20.
Image restoration is a significant process commonly applied in many research fields. In particular, image deconvolution algorithms play a very important role in the research methodology of astrophysics, where recorded images are frequently subjected to deconvolution. In this paper, we introduce a novel image deconvolution algorithm that is competitive in terms of restored image quality when compared to classical approaches. We present parallelizations of this algorithm to make it competitive in terms of processing speed as well. We also present the image deconvolution web portal (IDEWEP), which uses web services technologies and primarily aims to make this general parallel deconvolution method accessible through a web interface. Both the quality of the restored images and the running times of the sequential and parallel versions have been successfully tested on several sequential and parallel architectures. The IDEWEP portal greatly eases access to and use of the parallel algorithm on high-performance architectures. As a contribution to the scientific community, open source sequential and parallel codes are provided and can be freely downloaded from our web portal.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号