1.

Efficient Compositing Methods for the Sort-Last-Sparse Parallel Volume Rendering System on Distributed Memory Multicomputers

Yang Don-Lin Yu Jen-Chih Chung Yeh-Ching 《The Journal of supercomputing》2001,18(2):201-220

In the sort-last-sparse parallel volume rendering system on distributed memory multicomputers, one can achieve a very good performance improvement in the rendering phase by increasing the number of processors. This is because each processor can render images locally without communicating with other processors. However, in the compositing phase, a processor has to exchange local images with other processors. When the number of processors exceeds a threshold, the image compositing time becomes a bottleneck. In this paper, we propose three compositing methods to efficiently reduce the compositing time in parallel volume rendering. They are the binary-swap with bounding rectangle (BSBR) method, the binary-swap with run-length encoding and static load-balancing (BSLC) method, and the binary-swap with bounding rectangle and run-length encoding (BSBRC) method. The proposed methods were implemented on an SP2 parallel machine along with the binary-swap compositing method. The experimental results show that the BSBRC method has the best performance among these four methods.
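The two ingredients named in this abstract, a bounding rectangle and run-length encoding of blank pixels, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a single opacity value per pixel, and the convention that 0.0 means "blank" are all assumptions made for the example.

```python
def bounding_rectangle(image):
    """Return (top, left, bottom, right) of the non-blank pixels, or None.

    `image` is a list of rows of opacity values; 0.0 marks a blank pixel
    (an assumption of this sketch).
    """
    rows = [r for r, row in enumerate(image) if any(a > 0 for a in row)]
    if not rows:
        return None
    cols = [c for row in image for c, a in enumerate(row) if a > 0]
    return (min(rows), min(cols), max(rows), max(cols))


def rle_encode(row):
    """Split a row into (is_blank, length) runs plus the non-blank values."""
    runs, values = [], []
    for a in row:
        blank = (a == 0)
        if runs and runs[-1][0] == blank:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([blank, 1])   # start a new run
        if not blank:
            values.append(a)          # only non-blank pixels are stored
    return [tuple(r) for r in runs], values


def rle_decode(runs, values):
    """Rebuild the original row from the runs and the non-blank values."""
    it = iter(values)
    out = []
    for blank, length in runs:
        out.extend(0.0 if blank else next(it) for _ in range(length))
    return out
```

In a sort-last-sparse setting, only the pixels inside the rectangle (or only the non-blank values plus the run lengths) would need to be exchanged between processors, which is where the reduction in compositing traffic would come from.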

2.

Parallel rendering of volumetric data set on distributed-memory architectures

C. Montani R. Perego R. Scopigno 《Concurrency and Computation》1993,5(2):153-167

A solution is proposed to the problem of interactive visualization and rendering of volume data. Designed for parallel distributed memory MIMD architectures, the volume rendering system is based on the ray tracing (RT) visualization technique, the Sticks representation scheme (a data structure exploiting data coherence for the compression of classified data sets), the use of a slice-partitioning technique for the distribution of the data between the processing nodes and the consequent ray-data-flow parallelizing strategy. The system has been implemented on two different architectures: an Inmos Transputer network and a hypercube nCUBE 6400 architecture. The high number of processors of this latter machine has allowed us to exploit a second level of parallelism (parallelism on image space, or parallelism on pixels) in order to arrive at a higher degree of scalability. In both proposals, the similarities between the chosen data-partitioning strategy, the communications pattern of the visualization processes and the topology of the physical system architecture represent the key points and provide improved software design and efficiency. Moreover, the partitioning strategy used and the network interconnection topology reduce the communications overhead and allow for an efficient implementation of a static load-balancing technique based on the prerendering of a low resolution image. Details of the practical issues involved in the parallelization process of volumetric RT, commonly encountered problems (i.e. termination and deadlock prevention) and the software migration process between different architectures are discussed.

3.

Parallel processing of an object space for image synthesis using ray tracing (Total citations: 1; self-citations: 1; citations by others: 0)

Hiroaki Kobayashi Tadao Nakamura Yoshiharu Shigei 《The Visual computer》1987,3(1):13-22

This paper presents a novel parallel processing system for image synthesis using ray tracing. An object space is divided into parts (subspaces), each of which is allocated to a processor. The processor simultaneously detects the intersections of the surfaces of each object and a fixed number of rays over the whole space, and calculates the local intensity on an object in each subspace. The global intensities of pixels on a screen are calculated simultaneously by another kind of processor. We also present the optimal data structure, based on an adaptive division algorithm, for parallel processing of the object space.

4.

Parallel implementation of a ray tracing algorithm for distributed memory parallel computers

Tong-Yee Lee C. S. Raghavendra John B. Nicholas 《Concurrency and Computation》1997,9(10):947-965

Ray tracing is a well known technique to generate life-like images. Unfortunately, ray tracing complex scenes can require large amounts of CPU time and memory storage. Distributed memory parallel computers with large memory capacities and high processing speeds are ideal candidates to perform ray tracing. However, the computational cost of rendering pixels and patterns of data access cannot be predicted until runtime. To parallelize such an application efficiently on distributed memory parallel computers, the issues of database distribution, dynamic data management and dynamic load balancing must be addressed. In this paper, we present a parallel implementation of a ray tracing algorithm on the Intel Delta parallel computer. In our database distribution, a small fraction of the database is duplicated on each processor, while the remaining part is evenly distributed among groups of processors. In the system, there are multiple copies of the entire database in the memory of groups of processors. Dynamic data management is achieved by an ALRU cache scheme which can exploit image coherence to reduce data movements in ray tracing consecutive pixels. We balance load among processors by distributing subimages to processors in a global fashion based on previous workload requests. The success of our implementation depends crucially on a number of parameters which are experimentally evaluated. © 1997 John Wiley & Sons, Ltd.

5.

Orthogonal multiprocessor sharing memory with an enhanced mesh for integrated image understanding

《CVGIP: Image Understanding》1991,53(1):31-45

This paper proposes a new parallel architecture, which has the potential to support low-level image processing as well as intermediate and high-level vision analysis tasks efficiently. The integrated architecture consists of an SIMD mesh of processors enhanced with multiple broadcast buses, an MIMD multiprocessor with orthogonal access buses, and a two-dimensional shared memory array. Low-level image processing is performed on the mesh processor, while intermediate and high-level vision analysis is performed on the orthogonal multiprocessor. The interaction between the two levels is supported by a common shared memory. Concurrent computations and I/O are made possible by partitioning the memory into disjoint spaces so that each processor system can access a different memory space. To illustrate the power of such a two-level system, we present efficient parallel algorithms for a variety of problems from low-level image processing to high-level vision. Representative problems include matrix based computations, histogramming and key counting operations, image component labeling, pyramid computations, Hough transform, pattern clustering, and scene labeling. Through computational complexity analysis, we show that the integrated architecture meets the processing requirements of most image understanding tasks.

6.

Direct visualization of volume data (Total citations: 5; self-citations: 0; citations by others: 5)

Yoo T.S. Neumann U. Fuchs H. Pizer S.M. Cullip T. Rhoades J. Whitaker R. 《Computer Graphics and Applications, IEEE》1992,12(4):63-71

A combination of segmentation tools and fast volume renderers that provides an interactive exploration environment for volume visualization is discussed. The tools and renderers include mechanisms that distribute volume data across multiple processors, as well as image compositing techniques and solutions to representation problems in the selection and display of subregions within bounding volumes. A volume visualization technique using the interactive control of images rendered directly from volume data coupled with a user-controlled semantic classification tool is described. The variations of parallel volume rendering being explored on the Pixel-Planes 5 system and the region-of-interest selection methods and the interactive tools used by the system are presented. The flexibility and power of combining volume rendering with region-of-interest selection techniques are demonstrated using examples of medical imaging applications.

7.

A Heterogeneous Mixed-Mode Execution Model for Massively Parallel Systems

《Journal of Parallel and Distributed Computing》1999,56(1):2-16

In this paper, we consider a massively parallel system that is composed of heterogeneous processors, that is, processors with different processing power, and that combines the advantages of the SIMD and MIMD architectures. The heterogeneous mixed-mode (HeMM) execution model is composed of two main components, which operate in the well-known SIMD and MIMD paradigms. The main computing power comes from a component that is composed of a massive number of processors and operates in a data parallel manner. The other component is composed of a few (or even one) fast processors which operate in the MIMD paradigm. The operation of a small number of processors in an MIMD paradigm has been well demonstrated through actual systems. The processors in this component add flexibility to the execution of parallel programs, so that execution adjusts to the changing parallelism of the program to enhance performance. Based on this execution model, we analyze the performance gains that are obtainable with this new system. We show that substantial performance gains can be obtained by using the HeMM system.

8.

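Optimization Methods for Digital Image Processing Algorithms on VLIW DSPs

Zhang Fan, Ge Yingzeng, Dou Yong 《微计算机应用》 (Microcomputer Applications) 2008,29(10)

Digital image processing is widely used in aerospace, biomedical engineering, communications engineering, industry and engineering, military and public security, culture and the arts, and other fields. Because some applications impose real-time and environmental constraints, images are usually processed on a digital signal processor (DSP). DSPs based on the very long instruction word (VLIW) architecture are widely used in real-time image processing because of their low power consumption, simple hardware structure, and good parallelism. Based on the characteristics of image processing algorithms and of the VLIW DSP architecture, this paper proposes a general method for optimizing image processing algorithms on VLIW DSPs, comprising memory optimization and instruction-level parallelism optimization. Finally, the proposed method is applied to several commonly used image processing algorithms, and the experimental results show that it achieves good optimization.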

9.

On parallel processing systems: Amdahl's law generalized and some results on optimal design

Kleinrock L. Huang J.-H. 《IEEE transactions on pattern analysis and machine intelligence》1992,18(5):434-447

The authors model a job in a parallel processing system as a sequence of stages, each of which requires a certain integral number of processors for a certain interval of time. They derive the speedup of the system for two cases: systems with no arrivals, and systems with arrivals. In the case with no arrivals, their speedup result is a generalization of Amdahl's law (G.M. Amdahl, 1967). They extend the notion of power as previously applied to general queuing and computer-communication systems to their case of parallel processing systems. They find the optimal job input and the optimal number of processors to use so that power is maximized. Many of the results for the case of arrivals are the same as for the case of no arrivals. It is found that the average number of jobs in the system with arrivals equals unity when power is maximized. They also model a job in such a way that the number of processors required continuously varies over time. The same performance indices and parameters studied in the discrete model are evaluated for this continuous model.
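To make the stage-based view of a job concrete, the sketch below computes speedup for a job described as a list of stages. It is a simplified illustration, not the paper's derivation: each stage is given here as a pair (work, width), where `work` is its single-processor time and `width` is the largest number of processors it can use, and a stage is assumed to slow down proportionally when fewer processors are available. Both the representation and that assumption are mine.

```python
def speedup(stages, processors):
    """Speedup of a staged job on `processors` identical processors.

    `stages` is a list of (work, width) pairs: `work` is the time the
    stage takes on one processor, `width` the most processors it can use.
    """
    serial_time = sum(work for work, _ in stages)
    parallel_time = sum(work / min(width, processors) for work, width in stages)
    return serial_time / parallel_time


# Two stages reproduce the classical Amdahl's law: a serial fraction of 0.1
# and a parallel fraction of 0.9 that can use every available processor.
amdahl_job = [(0.1, 1), (0.9, 1000)]
```

With `amdahl_job` and 10 processors the function evaluates 1 / (0.1 + 0.9 / 10), which is the familiar Amdahl expression; jobs with more than two stages give the staged generalization under the stated assumptions.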

10.

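Parallel Implementation of a Scene Matching Algorithm on a Multi-DSP System

Wang Jiao, Zhang Tianxu, Yan Luxin, Yang Weidong 《微计算机信息》 (Microcomputer Information) 2007,23(17):151-153

Gray-level cross-correlation matching is an effective scene matching algorithm, characterized by strong adaptability and high matching accuracy. However, cross-correlation matching is computationally expensive, and in precision terminal guidance, which has very demanding real-time requirements, the matching speed often falls short. The cross-correlation matching algorithm was parallelized using a spatial-parallel method. Experiments with different numbers of processors show that the spatial-parallel method achieves high speedup and parallel efficiency and is an effective way to parallelize cross-correlation matching. When a larger image size increases the computational load, processors can be added to raise the system's processing capacity so that the algorithm still meets the real-time requirement.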
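A minimal sketch of spatially partitioned cross-correlation matching follows. It is illustrative only: the abstract does not specify the correlation formula or the partitioning, so the normalized cross-correlation used here, the split of the search area into horizontal bands, and all function names are assumptions. Python threads stand in for the separate DSPs and are not expected to give a real speedup for this pure-Python arithmetic.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def ncc(image, template, top, left):
    """Normalized cross-correlation of `template` with one image window."""
    h, w = len(template), len(template[0])
    win = [image[top + i][left + j] for i in range(h) for j in range(w)]
    tpl = [v for row in template for v in row]
    mean_w, mean_t = sum(win) / len(win), sum(tpl) / len(tpl)
    num = sum((a - mean_w) * (b - mean_t) for a, b in zip(win, tpl))
    den = math.sqrt(sum((a - mean_w) ** 2 for a in win)
                    * sum((b - mean_t) ** 2 for b in tpl))
    return num / den if den else 0.0  # flat windows carry no information


def best_in_band(image, template, row_start, row_end):
    """Best (score, row, col) among windows whose top row is in the band."""
    n_cols = len(image[0]) - len(template[0]) + 1
    return max((ncc(image, template, r, c), r, c)
               for r in range(row_start, row_end) for c in range(n_cols))


def match(image, template, workers=2):
    """Split the search rows into bands, search each band, keep the best."""
    n_rows = len(image) - len(template) + 1
    step = -(-n_rows // workers)  # ceiling division
    bands = [(s, min(s + step, n_rows)) for s in range(0, n_rows, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: best_in_band(image, template, *b), bands)
        return max(results)
```

Because every band is searched independently and the only shared step is the final maximum, the work divides among processors with very little communication, which is consistent with the high parallel efficiency the abstract reports.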

11.

A new parallel ray-tracing system based on object decomposition

Hyun-Joon Kim Chong-Min Kyung 《The Visual computer》1996,12(5):244-253

We propose a new parallel ray-tracing hardware architecture in which processors are connected as a ring. Most parallel ray-tracing algorithms subdivide the whole object space into subregions; a processor handles only rays entering the subregion assigned to it. Here we assign each processor objects that are spread over the whole object space. The processors trace rays on their own objects. The respective partial results are combined to form the final image. This scheme is especially suitable for synthesizing animated sequences because objects need not be reallocated for every frame. Preliminary results show a speed-up factor almost linearly proportional to the number of processors.

12.

RTS: A system to simulate the real time cost behaviour of parallel computations

Bin Qin Howard A. Sholl Reda A. Ammar 《Software》1988,18(10):967-985

In this paper, we present a software tool, RTS (real time simulator), that analyses the time cost behaviour of parallel computations through simulation. It is assumed in RTS that the computer system which supports the executions of parallel computations has a limited number of processors, that all processors have the same speed, and that they communicate with each other through a shared memory. In RTS, the time cost of a parallel computation is defined as a function of the input, the algorithm, the data structure, the processor speed, the number of processors, the processor power allocation, the communication and the execution environment. How RTS models the time cost is first discussed in the paper. In the model, a locking technique is used to manipulate the access to the shared memory, processing power is equally allocated among all the operations that are currently being performed in parallel in the computer system, and the number of operations in the execution environment of a parallel computation changes from time to time. How RTS works and how the simulation is used to do time cost analysis are also discussed.

13.

Image composition schemes for sort-last polygon rendering on 2D mesh multicomputers

Tong-Yee Lee Raghavendra C.S. Nicholas J.B. 《IEEE transactions on visualization and computer graphics》1996,2(3):202-217

In a sort-last polygon rendering system, the efficiency of image composition is very important for achieving fast rendering. In this paper, the implementation of a sort-last rendering system on a general purpose multicomputer system is described. A two-phase sort-last-full image composition scheme is described first, and then many variants of it are presented for 2D mesh message-passing multicomputers, such as the Intel Delta and Paragon. All the proposed schemes are analyzed and experimentally evaluated on Caltech's Intel Delta machine for our sort-last parallel polygon renderer. Experimental results show that sort-last-sparse strategies are better suited than sort-last-full schemes for software implementation on a general purpose multicomputer system. Further, interleaved composition regions perform better than coherent regions. In a large multicomputer system, performance can be improved by carefully scheduling the tasks of rendering and communication. Using 512 processors to render our test scenes, the peak rendering rate achieved on a 282,144 triangle dataset is close to 4.6 million triangles per second, which is comparable to the speed of current state-of-the-art graphics workstations.

14.

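Design of the Memory for a Special-Purpose Volume Rendering Architecture

Wu Xiaofeng, Sun Jizhou, Wei Jizeng 《计算机工程》 (Computer Engineering) 2007,33(6):275-277

A memory model is proposed for a special-purpose volume rendering architecture suited to the ray casting algorithm. The volume data are partitioned into sub-volumes according to the number of processors. Sub-volumes are assigned to different processing units according to their spatial coordinates, and the voxels within a sub-volume are assigned to the corresponding locations in the memory of the respective processing unit. The paper describes how sub-volumes are distributed among processors and how voxels are addressed and accessed in memory.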

15.

Analysis of multidimensional images on the Connection Machine system

Giampiero Marcenaro Massimo Tistarelli 《Concurrency and Computation》1991,3(6):699-713

The Connection Machine (CM) has been demonstrated to be an efficient and fast computational engine for the solution of many problems related to image processing. The high-level parallelism of the CM naturally fits many large-scale data intensive applications. In this paper the implementation of parallel algorithms for the analysis of multidimensional images on the CM is presented. Different aspects in the analysis of multidimensional images are considered. In the field of artificial vision, the implementation of algorithms for the filtering of image sequences (both in space and time) and the estimation of the optical flow is described and some results in terms of accuracy and computation time are presented. The processing of three-dimensional images is investigated in the field of biomedical engineering. In this case the goal is the development of algorithms for the 3-D reconstruction of human body segments and their visualization. The parallel implementations exploit the fine grain parallelism allowed by the CM, processing each point of the data on a different processor. This mechanism is allowed by the possibility of dynamically reconfiguring the connectivity of the CM nodes and of defining a huge number of virtual processors. Moreover, as the CM processors operate on one-bit data, it is possible to tune the number of bits for each data point to match the accuracy required by the application.

16.

Algorithms for rendering realistic terrain image sequences and their parallel implementation

Gennady Agranov Craig Gotsman 《The Visual computer》1995,11(9):455-464

We present algorithms for rendering realistic images of large terrains and their implementation on a parallel computer for rapid production of terrain-animation sequences. “Large” means datasets too large for RAM. A hybrid ray-casting and projection technique incorporates quadtree subdivision techniques and filtering using precomputed bit masks. Hilbert space-filling curves determine the image-pixel rendering order. A parallel version of the algorithm is based on a Meiko parallel computer architecture, designed to relieve dataflow bottlenecks and exploit temporal image coherence. Our parallel system, incorporating 26 processors, can generate a full-color terrain image at video resolution (without noticeable aliasing artifacts) every 2 s, including I/O and communication overheads.
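The idea of visiting pixels along a Hilbert space-filling curve can be sketched with the standard index-to-coordinate conversion for a square grid whose side is a power of two. This is a generic textbook construction, not the authors' code; the function names and the restriction to power-of-two image sizes are assumptions of the sketch.

```python
def hilbert_point(n, d):
    """Map index d (0 <= d < n*n) to (x, y) on an n-by-n Hilbert curve.

    n must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # flip the quadrant before the swap below
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # swap axes to rotate the sub-curve
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Pixel coordinates of an n-by-n image in Hilbert-curve order."""
    return [hilbert_point(n, d) for d in range(n * n)]
```

Consecutive pixels in this order are always neighbours, so data touched for one pixel is likely to be reused for the next, which is presumably why such an ordering helps when the terrain dataset does not fit in RAM.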

17.

Dynamic load balancing for parallel polygon rendering (Total citations: 2; self-citations: 0; citations by others: 2)

Whitman S. 《Computer Graphics and Applications, IEEE》1994,14(4):41-48

Using parallel processing for visualization speeds up computer graphics rendering of complex data sets. A parallel algorithm designed for polygon scan conversion and rendering is presented which supports fast rendering of highly complex data sets using advanced lighting models. Dedicated graphics rendering engines do not necessarily suit such data sets, although they can support real-time update of moderately complex scenes using simple lighting. Advantages to using a software-based approach include the feasibility of adding special rendering features to the program and the capability of integrating a parallel scientific application with a parallel graphics renderer. A new work decomposition strategy presented, called task adaptive, is based on dynamically partitioning the amount of computational work left at a given time. The algorithm uses a heuristic for dynamic task decomposition in which image space tasks are partitioned without requiring interruption of the partitioned processor. A sophisticated memory referencing strategy lets local memory access graphics data during rendering. This permits implementation of the algorithm on a distributed memory multiprocessor. An in-depth analysis of the overhead costs accompanying parallel processing shows where performance is adequate or could be improved.

18.

Parallel Shear-Warp Factorization Volume Rendering Using Efficient 1-D and 2-D Partitioning Schemes for Distributed Memory Multicomputers

Ching-Feng Lin Don-Lin Yang Yeh-Ching Chung 《The Journal of supercomputing》2002,22(3):277-302

3-D data visualization is very useful for medical imaging and computational fluid dynamics. Volume rendering can be used to exhibit the shape and volumetric properties of 3-D objects. However, volume rendering requires a considerable amount of time to process the large volume of data. To deliver the necessary rendering rates, parallel hardware architectures such as distributed memory multicomputers offer viable solutions. The challenge is to design efficient parallel algorithms that utilize the hardware parallelism effectively. In this paper, we present two efficient parallel volume rendering algorithms, the 1D-partition and 2D-partition methods, based on the shear-warp factorization for distributed memory multicomputers. The 1D-partition method has a performance bound on the size of the volume data. If the number of processors is less than a threshold, the 1D-partition method can deliver a good rendering rate. If the number of processors is over a threshold, the 2D-partition method can be used. To evaluate the performance of these two algorithms, we implemented the proposed methods along with the slice data partitioning, volume data partitioning, and sheared volume data partitioning methods on an IBM SP2 parallel machine. Six volume data sets were used as the test samples. The experimental results show that the proposed methods outperform other compatible algorithms for all test samples. When the number of processors is over a threshold, the experimental results also demonstrate that the 2D-partition method is better than the 1D-partition method.

19.

A new algorithm for interactive graphics on multicomputers (Total citations: 1; self-citations: 0; citations by others: 1)

Ellsworth D.A. 《Computer Graphics and Applications, IEEE》1994,14(4):33-40

As nonshared-memory multiple instruction, multiple data (MIMD) systems become more common, it becomes important to develop parallel rendering algorithms for them. These systems, known as multicomputers, can produce data sets so large that it is difficult to visualize the data on conventional graphics systems, especially if the visualization proceeds in tandem with the calculation. Parallel systems must run interactive graphics to allow convenient visualizations of their computations. While few parallel systems currently have a frame buffer that will support interactive rendering, such systems should be more common in the future. This article describes an algorithm suited for interactive polygon rendering, where the model's image on screen generally has frame-to-frame coherence. The algorithm uses this coherence to perform load-balancing calculations in parallel with the other calculations. The algorithm also uses an optimized version of personalized all-to-all communication, where all processors communicate with all other processors.

20.

Optimal parallel scheduling of M-way join queries

Farshad Fotouhi Jason Leigh Satyendra P. Rana 《Information Systems》1991,16(6):627-639

The problem of computing multirelation (M-way) join queries on uniprocessor architectures has been considered by many researchers in the past. This paper lays the necessary foundation for work involving optimization of M-way joins in parallel architectures. We explain the inadequacies of previous uniprocessor strategies and describe a more suitable formulation based on the concept of matching in graph theory to approach the problem in a parallel environment. It has been shown that the problem of optimizing M-way joins is an NP-hard problem and hence we would expect that in a parallel processing environment the search space of possible solutions (join schedules) would be enormous, especially when a variable number of processors are considered. Our strategy seeks to reduce the region to search by partitioning the search space according to the number of available processors. Based on this a significant portion of the search space, which will produce non-optimal join schedules, may be ignored.