Similar literature
 Found 20 similar documents; search time: 46 ms
1.
In a sort-last-sparse parallel volume rendering system on distributed memory multicomputers, increasing the number of processors yields a very good performance improvement in the rendering phase, because each processor can render images locally without communicating with other processors. In the compositing phase, however, a processor has to exchange local images with other processors. When the number of processors exceeds a threshold, the image compositing time becomes a bottleneck. In this paper, we propose three compositing methods to efficiently reduce the compositing time in parallel volume rendering. They are the binary-swap with bounding rectangle (BSBR) method, the binary-swap with run-length encoding and static load-balancing (BSLC) method, and the binary-swap with bounding rectangle and run-length encoding (BSBRC) method. The proposed methods were implemented on an SP2 parallel machine along with the binary-swap compositing method. The experimental results show that the BSBRC method has the best performance among these four methods.
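As a rough illustration of the two ideas the BSBR and BSBRC methods are named for, the sketch below computes the bounding rectangle of the non-blank pixels in a local image and run-length encodes a row of pixels. It is not the paper's implementation: the function names, the use of 0 as the blank pixel value, and the list-of-lists image layout are assumptions made for this example.

```python
def bounding_rectangle(image, blank=0):
    """Return (top, left, bottom, right) of the non-blank pixels, or None if all blank.

    `image` is assumed to be a list of rows; `blank` is the assumed background value.
    """
    coords = [(r, c)
              for r, row in enumerate(image)
              for c, pixel in enumerate(row)
              if pixel != blank]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


def run_length_encode(pixels):
    """Collapse consecutive equal pixel values into [value, count] runs."""
    runs = []
    for pixel in pixels:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([pixel, 1])  # start a new run
    return runs
```

Both steps shrink the data a processor would have to send when most of its local image is empty, which is the situation the abstract describes for sparse sub-images.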

2.
Communication costs for parallel volume-rendering algorithms   Total citations: 2, self-citations: 0, cited by others: 2
The computational expense of volume rendering motivates the development of parallel implementations on multicomputers. Parallelism achieves higher frame rates, which provide more natural viewing control and enhanced comprehension of 3D structure. Although many parallel implementations exist, we have no framework to compare their relative merits independent of host hardware. The article attempts to establish that framework by enumerating and classifying parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network. It determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.

3.
Multicast communication services, in which the same message is delivered from a source node to an arbitrary number of destination nodes, are being provided in new-generation multicomputers. Broadcast is a special case of multicast in which a message is delivered to all nodes in the network. The nCUBE-2, a wormhole-routed hypercube multicomputer, provides hardware support for broadcast and a restricted form of multicast in which the destinations form a subcube. However, the broadcast routing algorithm adopted in the nCUBE-2 is not deadlock-free. In this paper, four multicast wormhole routing strategies for 2-D mesh multicomputers are proposed and studied. All of the algorithms are shown to be deadlock-free. These are the first deadlock-free multicast wormhole routing algorithms ever proposed. A simulation study has been conducted that compares the performance of these multicast algorithms under dynamic network traffic conditions in a 2-D mesh. The results indicate that a dual-path routing algorithm offers performance advantages over tree-based, multipath, and fixed-path algorithms.

4.
Data-parallel, volume-rendering algorithms   Total citations: 1, self-citations: 0, cited by others: 1
In this presentation, we consider the image-composition scheme for parallel volume rendering in which each processor is assigned a portion of the volume. A processor renders its data by using any existing volume-rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector-processing capabilities. The resulting images from all processors are then combined (composited) in visibility order to form the final image. The major advantage of this approach is that, as viewing and shading parameters change, only 2D partial images, and not 3D volume data, are communicated among processors. Through experimental results and performance analysis, we show that our parallel algorithm is amenable to extremely efficient implementations on distributed memory, multiple instruction-multiple data (MIMD), vector-processor architectures. This algorithm is also very suitable for hardware implementation based on image composition architectures. It supports various volume-rendering algorithms, and it can be extended to provide load-balanced execution.
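Compositing partial images in visibility order is commonly done with the standard "over" operator. The sketch below applies it to a single pixel position, assuming premultiplied (color, alpha) pairs with scalar color and a nearest-first ordering; the abstract does not say which compositing operator or pixel format the authors use, so the function names and data layout here are illustrative assumptions.

```python
def over(front, back):
    """Composite two premultiplied (color, alpha) pixels, front over back."""
    color_f, alpha_f = front
    color_b, alpha_b = back
    # The back pixel only shows through the part the front pixel leaves uncovered.
    return (color_f + (1.0 - alpha_f) * color_b,
            alpha_f + (1.0 - alpha_f) * alpha_b)


def composite_in_visibility_order(partial_pixels):
    """Fold per-processor pixels, listed nearest-first, into one final pixel."""
    result = (0.0, 0.0)  # start from a fully transparent pixel
    for pixel in partial_pixels:
        result = over(result, pixel)
    return result
```

For example, `composite_in_visibility_order([(0.5, 0.5), (1.0, 1.0)])` blends a half-transparent front pixel with an opaque one behind it. Because only these 2D pixels are combined, no 3D volume data needs to move between processors, which is the advantage the abstract highlights.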

5.
Parallelizing the Data Cube   Total citations: 1, self-citations: 0, cited by others: 1
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.

6.
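Hong Xiong, Dai Guangming. Microcomputer Development (《微机发展》), 2004, 14(8): 44-46
The core of visualization in scientific computing is the visualization of three-dimensional data fields, and the current research focus in 3D visualization is volume rendering. This paper surveys the state of research on volume rendering of three-dimensional irregular data fields. On that basis, by analyzing and comparing existing volume rendering techniques and algorithms for irregular data fields, it predicts future trends for the technique and the research directions that deserve attention. Besides improving existing algorithms and combining different algorithms, research should also address hardware and system-level acceleration and, together with walkthrough techniques, develop efficient visualization techniques and parallel algorithms for irregular data fields in 3D space.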

7.
A new algorithm for interactive graphics on multicomputers   Total citations: 1, self-citations: 0, cited by others: 1
As nonshared-memory multiple instruction, multiple data (MIMD) systems become more common, it becomes important to develop parallel rendering algorithms for them. These systems, known as multicomputers, can produce data sets so large that it is difficult to visualize the data on conventional graphics systems, especially if the visualization proceeds in tandem with the calculation. Parallel systems must run interactive graphics to allow convenient visualizations of their computations. While few parallel systems currently have a frame buffer that will support interactive rendering, such systems should be more common in the future. This article describes an algorithm suited for interactive polygon rendering, where the model's image on screen generally has frame-to-frame coherence. The algorithm uses this coherence to perform load-balancing calculations in parallel with the other calculations. The algorithm also uses an optimized version of personalized all-to-all communication, where all processors communicate with all other processors.

8.
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient algorithms for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. The most significant improvement of our methods is that a processor does not need to construct the send/receive data sets for a redistribution. Based on the packing/unpacking information derived from the BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution and vice versa, a processor can pack/unpack array elements into (from) messages directly. To evaluate the performance of our methods, we have implemented them along with Thakur's methods and the PITFALLS method on an IBM SP2 parallel machine. The experimental results show that our algorithms outperform Thakur's methods and the PITFALLS method for all test samples. This result encourages us to use the proposed algorithms for array redistribution.
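To make the redistribution problem concrete, the sketch below uses the usual definition of a one-dimensional block-cyclic distribution: with block size b over P processors, global element i lives in block i // b, which is owned by processor (i // b) mod P. It enumerates, element by element, where each entry sits before and after a BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution. This is the naive baseline, not the paper's method, which is described as avoiding exactly this kind of explicit send/receive set construction; the function names are hypothetical.

```python
def owner_and_local_index(i, block, nprocs):
    """Map global index i to (processor, local index) under BLOCK-CYCLIC(block)."""
    block_id, offset = divmod(i, block)
    proc = block_id % nprocs                       # blocks are dealt out round-robin
    local = (block_id // nprocs) * block + offset  # position in that processor's local array
    return proc, local


def redistribution_plan(n, k, r, nprocs):
    """List (global index, source proc, destination proc) for
    BLOCK-CYCLIC(k*r) -> BLOCK-CYCLIC(r) on an n-element array."""
    plan = []
    for i in range(n):
        src, _ = owner_and_local_index(i, k * r, nprocs)
        dst, _ = owner_and_local_index(i, r, nprocs)
        plan.append((i, src, dst))
    return plan
```

With n = 8, k = 2, r = 1 and two processors, elements 1 and 2 change owner while elements 0 and 3 stay put, which shows the regular pattern that such redistribution algorithms can exploit.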

9.
In distributed memory multicomputers, local memory accesses are much faster than those involving interprocessor communication. To reduce or even eliminate interprocessor communication, the array elements in programs must be carefully distributed to the local memory of processors for parallel execution. We devote our efforts to techniques for allocating array elements of nested loops onto multicomputers in a communication-free fashion for parallelizing compilers. We first analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without interblock communication. The arrays can be partitioned under the communication-free criteria with nonduplicate or duplicate data. Finally, a heuristic method is proposed for mapping the partitioned array elements and iterations onto fixed-size multicomputers with load balancing taken into account. Based on these methods, the nested loops can execute without any communication overhead on the distributed memory multicomputers. Moreover, the performance of the strategies with nonduplicate and duplicate data for matrix multiplication is studied.

10.
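With the advent of multiprocessors, parallel techniques have attracted wide attention and become an important means of speeding up problem solving. However, while parallel techniques accelerate computation, they also sharply increase the number of processors required and significantly raise the parallel cost. To address this problem, the paper examines the shortcomings of three parallel maximum-finding algorithms on the PRAM (Parallel Random Access Machine) model and proposes a parallel maximum-finding algorithm based on data partitioning whose parallel cost is better than that of the balanced tree algorithm, the fast search method, and the doubly logarithmic depth tree method. The data-partitioning algorithm effectively addresses the uneven distribution of work among processors, the excessive demand for processors, and the demanding implementation conditions of existing parallel methods, and it suggests a direction for reducing the parallel cost of similar parallel algorithms.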
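The general idea of maximum finding by data partitioning can be sketched as follows: split the input into one chunk per processor, let each processor scan its chunk sequentially, then combine the per-chunk maxima. The code below simulates this sequentially in Python. The abstract does not give the paper's actual algorithm, so the chunking rule, the function name, and the final combining step are assumptions made for illustration only.

```python
def partitioned_max(values, nprocs):
    """Find the maximum by splitting `values` into at most `nprocs` chunks.

    Each chunk stands in for the data assigned to one (simulated) processor.
    """
    if not values:
        raise ValueError("values must be non-empty")
    chunk = -(-len(values) // nprocs)  # ceiling division: elements per processor
    # Phase 1: each simulated processor scans its own chunk sequentially.
    local_maxima = [max(values[start:start + chunk])
                    for start in range(0, len(values), chunk)]
    # Phase 2: combine the per-chunk maxima into the global maximum.
    return max(local_maxima)
```

With n elements and p processors, phase 1 takes about n / p comparisons per processor, so the processor count can be chosen to keep the cost (processors multiplied by time) low instead of dedicating one processor to each element or each pair.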

11.
This paper presents a parallel volume rendering algorithm that can render a 256×256×225 voxel medical data set at over 15 Hz and a 512×512×334 voxel data set at over 7 Hz on a 32-processor Silicon Graphics Challenge. The algorithm achieves these results by minimizing each of the three components of execution time: computation time, synchronization time, and data communication time. Computation time is low because the parallel algorithm is based on the recently-reported shear-warp serial volume rendering algorithm which is over five times faster than previous serial algorithms. The algorithm uses run-length encoding to exploit coherence and an efficient volume traversal to reduce overhead. Synchronization time is minimized by using dynamic load balancing and a task partition that minimizes synchronization events. Data communication costs are low because the algorithm is implemented for shared-memory multiprocessors, a class of machines with hardware support for low-latency fine-grain communication and hardware caching to hide latency. We draw two conclusions from our implementation. First, we find that on shared-memory architectures data redistribution and communication costs do not dominate rendering time. Second, we find that cache locality requirements impose a limit on parallelism in volume rendering algorithms. Specifically, our results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets.

12.
The high demand for 3-D scenes on embedded systems draws developers' attention to using all the resources of current low-power processors and adding dedicated hardware as a graphics accelerator unit for real-time realistic scene rendering. Photon mapping, one of the most powerful techniques for rendering highly realistic 3-D images, requires large numbers of floating-point operations and is very time-consuming. To exploit multiprocessor systems for rendering 3-D scenes, this paper proposes parallel photon-mapping rendering on a homogeneous multiprocessor SoC (MPSoC) platform with a mesh NoC that uses an adaptive wormhole routing method to communicate packets among cores. To make efficient use of the MPSoC platform, the paper explores methods for improving load balancing, using memory efficiently, and reducing communication cost in order to achieve a scalable application. The resulting MPSoC platform is verified and evaluated by cycle-accurate simulations for different sizes of the mesh NoC. As expected, the proposed methods obtain excellent load balancing and run up to 44.3 times faster on an 8-by-8 MPSoC platform than on a single-core MPSoC platform.

13.
The authors compare the performance of two join algorithms on both cube and ring interconnections for message-based multicomputers, and investigate the effects that the number of processors and the type of interconnection scheme have on the performance. First, the parallel hybrid-hash join algorithm and the parallel join-index join algorithm for both the cube and ring connected multicomputers are presented. The performance of these algorithms is then compared through analytical cost modeling. The result shows that the join-index join algorithm gives good performance only when the join selectivity is very small, and the hybrid-hash join algorithm performs consistently well under most situations. It is shown that an algorithm on the cube topology yields better execution time than the same algorithm on the ring topology. Furthermore, increasing the number of processors improves execution time more significantly for the cube than for the ring configuration. The applicability of join indexes to the parallel database algorithms is also discussed.

14.
A balanced parallel algorithm to sort a sequence of items on a linear array of processors is presented. The length of the sequence may range from small to arbitrarily large. For a short sequence, the output of the sorted sequence begins at the step following the last input of the whole sequence. For an arbitrarily long sequence, the time complexity is optimal under realistic hardware conditions. A variation of the algorithm is also introduced. Both algorithms require far less local memory than that required by a different approach of balanced computation. Any number of balanced processors can be connected to deliver more computing power without increasing the memory size of each processor.

15.
This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.

16.
In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access on distributed memory multicomputers. Since the redistribution is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods for multi-dimensional array redistribution. Building on previous work, the basic-cycle calculation technique, we present a basic-block calculation (BBC) technique and a complete-dimension calculation (CDC) technique. We also developed a theoretical model to analyze the computation costs of these two techniques. The theoretical model shows that the BBC method has smaller indexing costs and performs well for redistribution with a small array size. The CDC method has smaller packing/unpacking costs and performs well when the array size is large. When these two techniques were implemented on an IBM SP2 parallel machine along with the PITFALLS method and Prylli's method, the experimental results showed that the BBC method has the smallest execution time of these four algorithms when the array size is small. The CDC method has the smallest execution time of these four algorithms when the array size is large.

17.
Partitioning and mapping of nested loops for linear array multicomputers   Total citations: 1, self-citations: 1, cited by others: 0
In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project an n-dimensional computational structure, representing an n-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.

18.
We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, in which each processor is a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve interactive isosurface rendering on a 1024² screen for all the datasets tested up to the maximum size of the main memory of our platform.

19.
In this paper, we propose a prefix code matching parallel load-balancing method (PCMPLB) to efficiently deal with the load imbalance of solution-adaptive finite element application programs on distributed memory multicomputers. The main idea of the PCMPLB method is first to construct a prefix code tree for processors. Based on the prefix code tree, a schedule for performing load transfer among processors can be determined by concurrently and recursively dividing the tree into two subtrees and finding a maximum matching for processors in the two subtrees until the leaves of the prefix code tree are reached. We have implemented the PCMPLB method on an SP2 parallel machine and compared its performance with two load-balancing methods, the directed diffusion method and the multilevel diffusion method, and five mapping methods, the AE/ORB method, the AE/MC method, the MLkP method, the PARTY library method, and the JOSTLE-MS method. An unstructured finite element graph, Truss, was used as a test sample. During the execution, Truss was refined five times. Three criteria are used for the performance evaluation: the execution time of the mapping/load-balancing methods, the execution time of an application program under different mapping/load-balancing methods, and the speedups achieved by the mapping/load-balancing methods for an application program. The experimental results show that (1) if a mapping method is used for the initial partitioning and this mapping method or a load-balancing method is used in each refinement, the execution time of an application program under a load-balancing method is less than that under the mapping method, and (2) the execution time of an application program under the PCMPLB method is less than that under the directed diffusion method and the multilevel diffusion method.

20.
In this work, image-space-parallel direct volume rendering (DVR) of unstructured grids is investigated for distributed-memory architectures. A hypergraph-partitioning-based model is proposed for the adaptive screen partitioning problem in this context. The proposed model aims to balance the rendering loads of processors while trying to minimize the amount of data replication. In the parallel DVR framework we adopted, each data primitive is statically owned by its home processor, which is responsible for replicating its primitives on other processors. Two appropriate remapping models are proposed by enhancing the above model for use within this framework. These two remapping models aim to minimize the total volume of communication in data replication while balancing the rendering loads of processors. Based on the proposed models, a parallel DVR algorithm is developed. The experiments conducted on a PC cluster show that the proposed remapping models achieve better speedup values compared to the remapping models previously suggested for image-space-parallel DVR.
