Similar Documents
20 similar documents found.
1.
《Parallel Computing》2002,28(7-8):967-993
This paper describes a software architecture that allows image processing researchers to develop parallel applications in a transparent manner. The architecture's main component is an extensive library of data parallel low level image operations capable of running on homogeneous distributed memory MIMD-style multicomputers. Since the library has an application programming interface identical to that of an existing sequential library, all parallelism is completely hidden from the user. The first part of the paper discusses implementation aspects of the parallel library, and shows how sequential as well as parallel operations are implemented on the basis of so-called parallelizable patterns. A library built in this manner is easily maintainable, as extensive code redundancy is avoided. The second part of the paper describes the application of performance models to ensure efficiency of execution on all target platforms. Experiments show that for a realistic application the performance predictions are highly accurate. These results indicate that the core of the architecture forms a powerful basis for automatic parallelization and optimization of a wide range of imaging software.
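A minimal sketch of the idea behind such a library: the caller sees a sequential-looking operation, while scatter/compute/gather message passing is hidden inside it. This assumes MPI (not necessarily the paper's internals); the entry point `parallel_unary_op` is a hypothetical name, and for brevity the image height is assumed to divide evenly among processes.

```cpp
// Hypothetical sketch: a sequential-looking unary pixel operation whose
// parallelism is hidden inside the library call. Assumes MPI and an image
// height divisible by the number of processes.
#include <mpi.h>
#include <vector>
#include <cstdio>

// Applies op(x) = 255 - x (a simple invert) to a w x h 8-bit image.
void parallel_unary_op(std::vector<unsigned char>& img, int w, int h) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_rows = h / size;                    // block row distribution
    std::vector<unsigned char> local(local_rows * w);

    // Scatter rows, apply the pixel operation locally, gather the result.
    MPI_Scatter(img.data(), local_rows * w, MPI_UNSIGNED_CHAR,
                local.data(), local_rows * w, MPI_UNSIGNED_CHAR,
                0, MPI_COMM_WORLD);
    for (unsigned char& p : local) p = 255 - p;   // the "sequential" kernel
    MPI_Gather(local.data(), local_rows * w, MPI_UNSIGNED_CHAR,
               img.data(), local_rows * w, MPI_UNSIGNED_CHAR,
               0, MPI_COMM_WORLD);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int w = 640, h = 480;
    std::vector<unsigned char> img(w * h, 100);   // dummy image on all ranks
    parallel_unary_op(img, w, h);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) std::printf("pixel[0] = %d\n", img[0]);  // 155
    MPI_Finalize();
}
```

Compiled with `mpicxx` and launched under `mpirun`, every rank executes `main`, but only the library routine is aware of the data distribution, which is the transparency the paper aims for.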

2.
Cellular computing architectures represent an important class of computation characterized by simple processing elements, local interconnect and massive parallelism. These architectures are a good match for many image and video processing applications and can be substantially accelerated with reconfigurable computers. We present a flexible software/hardware framework for the design, implementation and automatic synthesis of cellular image processing algorithms. The system provides an extremely flexible set of parallel, pipelined and time-multiplexed components which can be tailored through reconfigurable hardware for particular applications. The most novel aspects of our framework include a highly pipelined architecture for multi-scale cellular image processing as well as support for several different pattern recognition applications. In this paper, we describe the system in detail and present our performance assessments. The system achieved speed-ups of at least 100× for computationally expensive sub-problems and 10× for end-to-end applications compared to software implementations.
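To make the computational pattern concrete, here is a plain serial sketch of one cellular step: each cell's new value depends only on its 3×3 neighborhood, which is exactly the locality that pipelined FPGA implementations exploit. The thresholded-average rule is an illustrative stand-in, not one of the paper's algorithms.

```cpp
// Minimal sketch of one cellular-processing step: every cell updates from
// its 3x3 neighborhood only (here: a thresholded neighborhood average).
// Illustrative only -- the paper maps such steps onto pipelined hardware.
#include <vector>

std::vector<int> cellular_step(const std::vector<int>& in, int w, int h) {
    std::vector<int> out(in.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int sum = 0, n = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                        sum += in[yy * w + xx];
                        ++n;
                    }
                }
            out[y * w + x] = (sum / n > 128) ? 255 : 0;  // local rule
        }
    }
    return out;
}
```

Because each output cell reads only a fixed local window, successive steps (and successive scales) can be chained into a hardware pipeline with one stage per step.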

3.
Highly regular multi-processor architectures are well suited to inherently parallelizable applications, such as most of those in the image processing domain. Systems embedded on a single programmable chip (SoPC) allow hardware designers to tailor every aspect of the architecture to match specific application needs. These platforms are now large enough to embed an increasing number of cores, allowing implementation of a multi-processor architecture with an embedded communication network. In this paper we present the parallelization and the embedding of a real-time image stabilization algorithm on a SoPC platform. Our overall hardware implementation method is based on meeting the algorithm's processing power and communication needs through refinement of a generic parallel architecture model. The actual implementation is done by choosing and parameterizing readily available reconfigurable hardware modules and customizable, commercially available IPs (Intellectual Property cores). We present both the software and hardware implementations, with performance results on a Xilinx SoPC target.

4.
A popular approach to providing nonexperts in parallel computing with an easy-to-use programming model is to design a software library consisting of a set of preparallelized routines, and to hide the intricacies of parallelization behind the library's API. However, for regular domain problems (such as simple matrix manipulations or low-level image processing applications, in which all elements in a regular subset of a dense data field are accessed in turn), the speedup obtained with many such library-based parallelization tools is often suboptimal. This is because interoperation optimization (that is, time-optimization of communication steps across library calls) is generally not incorporated in the library implementations. We present a simple, efficient, finite-state-machine-based approach for communication minimization of library-based data parallel regular domain problems. In the approach, referred to as lazy parallelization, a sequential program is parallelized automatically at runtime by inserting communication primitives and memory management operations whenever necessary. Apart from being simple and cheap, lazy parallelization guarantees to generate legal, correct, and efficient parallel programs at all times. The effectiveness of the approach is demonstrated by analyzing the performance characteristics of two typical regular domain problems obtained from the field of low-level image processing. Experimental results show significant performance improvements over nonoptimized parallel applications. Moreover, the obtained communication behavior is found to be optimal with respect to the abstraction level of message passing programs.
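The lazy-parallelization idea can be sketched as a small state machine: each distributed data structure remembers how it is currently laid out, and communication is inserted at runtime only when the next operation requires a different layout. The states and operations below are illustrative stand-ins for the paper's actual scheme.

```cpp
// Hedged sketch of lazy parallelization: track each image's distribution
// state in a tiny finite state machine and insert communication only when
// the next operation actually needs a different state.
#include <cstdio>

enum class State { Replicated, RowDistributed };   // where the data lives

struct DistributedImage {
    State state = State::Replicated;

    // Called before every library operation: converts state lazily.
    void require(State needed) {
        if (state == needed) return;               // no communication needed
        if (needed == State::RowDistributed)
            std::puts("scatter rows");             // stand-in for a scatter
        else
            std::puts("gather rows");              // stand-in for a gather
        state = needed;
    }
};

int main() {
    DistributedImage img;
    img.require(State::RowDistributed);  // first parallel op: scatter once
    img.require(State::RowDistributed);  // next op reuses layout: no traffic
    img.require(State::Replicated);      // sequential op: gather back
}
```

Running the sketch prints one scatter and one gather; the second `require` call costs nothing, which is precisely the interoperation saving across library calls.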

5.
This paper presents an in-depth study of implementing a parallel processing algorithm for spaceborne synthetic aperture radar (SAR) on a distributed shared memory (DSM) HPC platform. Parallel schemes built on two programming models, message passing and OpenMP, are compared, and on that basis a process-based shared-variable parallel model is proposed. This model overcomes the drawbacks of the former two; experimental tests and a real SAR imaging application show it to be an efficient and stable parallel scheme.
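A rough POSIX sketch of the general mechanism behind a process-based shared-variable model (not the paper's exact design): cooperating processes read and write one array through shared memory rather than exchanging messages. Error checks are omitted for brevity.

```cpp
// Sketch: parent and child processes share an array through POSIX shared
// memory instead of message passing (link with -lrt on older Linux).
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const int n = 8;
    int fd = shm_open("/sar_buf", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, n * sizeof(float));
    float* buf = static_cast<float*>(
        mmap(nullptr, n * sizeof(float), PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0));

    if (fork() == 0) {                 // child: process second half in place
        for (int i = n / 2; i < n; ++i) buf[i] = i * 2.0f;
        _exit(0);
    }
    for (int i = 0; i < n / 2; ++i) buf[i] = i * 2.0f;  // parent: first half
    wait(nullptr);                     // "barrier": child done, results visible

    for (int i = 0; i < n; ++i) std::printf("%.0f ", buf[i]);
    std::puts("");
    shm_unlink("/sar_buf");
    return 0;
}
```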

6.
Design and Implementation of an OpenMP System for Clusters
OpenMP has become the programming standard for shared memory architectures thanks to its ease of use and its support for incremental parallelization. Cluster systems are now the mainstream platform for high-performance computing, so research on cluster OpenMP systems is of great value for advancing the development and adoption of parallel applications. The authors use the software DSM system JIAJIA as the OpenMP runtime and, combined with a front-end compiler, OMP2JIA, implement an OpenMP/JIAJIA computing environment on a cluster system. To improve performance, they also extend the OpenMP directives according to the characteristics of cluster systems and optimize the back-end runtime library. Using 11 OpenMP applications, the authors compare this environment with a hardware cc-NUMA system that supports OpenMP (SGI 2100). The results show an average 7-node speedup of 4.62 for their cluster OpenMP system versus 4.55 for the SGI 2100; the two systems perform comparably.
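For illustration, this is the kind of incremental OpenMP code such a system accepts; a front end like OMP2JIA would lower the directive into calls to the software-DSM runtime, whereas the sketch below simply relies on a native OpenMP compiler.

```cpp
// Plain OpenMP: one directive turns a serial loop into a parallel one.
// A cluster OpenMP system would execute this atop a software DSM instead
// of hardware shared memory.
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    static double a[n];
    double sum = 0.0;

    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        a[i] = i * 0.5;
        sum += a[i];
    }
    std::printf("sum = %f (threads: %d)\n", sum, omp_get_max_threads());
}
```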

7.
Computational fluid dynamics (CFD) is one of the fastest-growing fields of fluid mechanics, used to analyze fluid flow situations through simulations carried out on computing machines. For complex configurations, the number of grid points is so large that the computational time required to obtain results is very high. Parallel computing is adopted to reduce the computational time of CFD by utilizing available computing resources. Parallel computing tools such as OpenMP, MPI, CUDA, combinations of these, and a few others are used to parallelize CFD software. This article provides a comprehensive state-of-the-art review of important CFD areas and parallelization strategies for the related software. Issues related to computational time complexity and the parallelization of CFD software are highlighted. The benefits and issues of using various parallel computing tools for the parallelization of CFD software are briefly discussed. Open areas of CFD where parallelization has received little attention are identified, and parallel computing tools that can be useful for parallelizing CFD software are spotlighted. A few suggestions for future work in parallel computing of CFD software are also provided.

8.
Recent advances in semiconductor technologies make it possible to integrate many processor cores in a small device package. The parallel execution capability of such multi-core processors can be exploited to enhance the performance of many traditional sequential applications. There have been numerous research activities to develop parallelization techniques using the OpenMP programming model in order to speed up sequential applications such as the H.264/AVC codec, but mostly in the PC environment. Therefore, it is difficult to understand which parallelization technique fits well with the H.264/AVC encoder on an embedded multi-core architecture. In this paper, we present parallelization techniques applicable to the H.264/AVC encoder on ARM MPCore using the OpenMP programming model. Further, we propose an analytical model for the performance estimation of the H.264/AVC encoder, and we then verify the model's accuracy by performing simulations using a hardware/software co-verification tool. Our experimental results show that the parallelization techniques proposed in this paper for the embedded multi-core platform improve encoder performance by up to 2.36 times, and that the parallelization technique exploiting data-level parallelism outperforms the one using task-level parallelism by 41%. It is also observed that balancing loads among processor cores is a critical parameter in achieving better scalability in the encoder.
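As a hedged illustration of data-level parallelism in an H.264/AVC encoder: slices of a frame are independent in H.264, so a single OpenMP directive can distribute them across cores. `encode_slice` and the slice boundaries below are hypothetical stand-ins for the real per-slice pipeline.

```cpp
// Illustrative sketch of slice-level (data-level) parallelism with OpenMP.
#include <omp.h>
#include <vector>
#include <cstdio>

struct Slice { int first_mb_row, last_mb_row; };

static void encode_slice(const Slice& s) {
    // Stand-in for motion estimation, transform, entropy coding per slice.
    volatile double work = 0;
    for (int i = 0; i < 1000000; ++i) work += i * 1e-9;
    (void)s;
}

int main() {
    std::vector<Slice> slices = {{0, 17}, {18, 35}, {36, 53}, {54, 71}};
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic)   // dynamic helps balance load
    for (int i = 0; i < (int)slices.size(); ++i)
        encode_slice(slices[i]);
    std::printf("encoded %zu slices in %.3f s\n",
                slices.size(), omp_get_wtime() - t0);
}
```

The `schedule(dynamic)` clause reflects the abstract's observation that load balancing across cores is critical for scalability, since slices rarely cost the same.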

9.
The time‐dependent Maxwell equations are one of the most important approaches to describing dynamic or wide‐band frequency electromagnetic phenomena. A sequential finite‐volume, characteristic‐based procedure for solving the time‐dependent, three‐dimensional Maxwell equations has been successfully implemented in Fortran before. Due to its need for a large memory space and high demand on CPU time, it is impossible to test the code for a large array. Hence, it is essential to implement the code on a parallel computing system. In this paper, we discuss an efficient and scalable parallelization of the sequential Fortran time‐dependent Maxwell equations solver using High Performance Fortran (HPF). The background to the project, the theory behind the efficiency being achieved, the parallelization methodologies employed and the experimental results obtained on the Cray T3E massively parallel computing system will be described in detail. Experimental runs show that the execution time is reduced drastically through parallel computing. The code is scalable up to 98 processors on the Cray T3E and has a performance similar to that of an MPI implementation. Based on the experimentation carried out in this research, we believe that a high‐level parallel programming language such as HPF is a fast, viable and economical approach to parallelizing many existing sequential codes which exhibit a lot of parallelism. Copyright © 2003 John Wiley & Sons, Ltd.
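The heart of such a solver is a grid sweep in which every point updates independently, which is the structure HPF data distributions exploit. Below is a standard 1D Yee/FDTD update in normalized units as a sketch; the paper's solver is a 3D finite-volume characteristic-based scheme, so this only illustrates the parallelizable pattern, not the paper's method.

```cpp
// Standard 1D FDTD (Yee) update in normalized units: each inner loop is
// fully data-parallel across grid points.
#include <vector>
#include <cstdio>
#include <cmath>

int main() {
    const int n = 200, steps = 500;
    const double c = 0.5;                       // Courant number (<= 1 in 1D)
    std::vector<double> ez(n, 0.0), hy(n, 0.0);

    for (int t = 0; t < steps; ++t) {
        for (int i = 0; i < n - 1; ++i)         // independent across i
            hy[i] += c * (ez[i + 1] - ez[i]);
        for (int i = 1; i < n; ++i)             // independent across i
            ez[i] += c * (hy[i] - hy[i - 1]);
        ez[0] = std::exp(-(t - 30.0) * (t - 30.0) / 100.0);  // soft source
    }
    std::printf("ez[n/2] = %g\n", ez[n / 2]);
}
```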

10.
Segmenting DR (Digital Radiography) images is an important task in industrial DR image processing. The Chan–Vese (C-V) algorithm segments DR images well, but it is computationally expensive and cannot meet real-time requirements in industrial applications. This paper uses the cost-effective CUDA (Compute Unified Device Architecture) technology to parallelize C-V segmentation of DR images, and combines shared-memory techniques with a mix of independent and merged computation to greatly improve the computational efficiency of the C-V method. Segmentation experiments on real industrial DR images show speedups of 32 to 44 times, demonstrating that CUDA-parallelized C-V segmentation of DR images is efficient and feasible.
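A greatly simplified view of why C-V maps well to CUDA: each iteration alternates a global reduction (the two region means) with a fully independent per-pixel update. The sketch below omits the level-set curvature/length term and is serial C++, so it illustrates the data parallelism rather than reproducing the paper's kernels.

```cpp
// Simplified Chan-Vese-style iteration (curvature term omitted): alternate
// between region means c1/c2 and per-pixel relabeling. The per-pixel loop is
// the part a CUDA kernel would run with one thread per pixel.
#include <vector>
#include <cstdio>

void cv_segment(const std::vector<float>& img, std::vector<int>& in_region,
                int iters) {
    for (int it = 0; it < iters; ++it) {
        double s1 = 0, n1 = 0, s2 = 0, n2 = 0;   // reduction step
        for (size_t i = 0; i < img.size(); ++i) {
            if (in_region[i]) { s1 += img[i]; n1 += 1; }
            else              { s2 += img[i]; n2 += 1; }
        }
        double c1 = n1 ? s1 / n1 : 0, c2 = n2 ? s2 / n2 : 0;

        for (size_t i = 0; i < img.size(); ++i) {   // data-parallel step
            double d1 = (img[i] - c1) * (img[i] - c1);
            double d2 = (img[i] - c2) * (img[i] - c2);
            in_region[i] = d1 < d2;
        }
    }
}

int main() {
    std::vector<float> img = {10, 12, 11, 200, 210, 205, 9, 198};
    std::vector<int> lab(img.size(), 0);
    lab[0] = 1;                       // crude initialization
    cv_segment(img, lab, 10);
    for (int v : lab) std::printf("%d ", v);
    std::puts("");
}
```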

11.
Dynamic programming is an important paradigm that has been widely used to solve problems in various areas such as control theory, operations research, biology and computer science. We generalize the finite automaton formal model for dynamic programming, deriving pipeline parallel algorithms. The optimality of these algorithms is established for the new class of non‐decreasing finite automata. As an intermediate step toward the construction of a skeleton for the automatic parallelization of dynamic programming, we have developed a tool for the implementation of pipeline algorithms. The tool maps the processes in the pipeline onto the target architecture following a mix of block and cyclic policies adapted to the grain of the machine. Based on the former tool, the automatic parallelization of dynamic programming is straightforward. The use of the model and its associated tools is illustrated with the Single Resource Allocation Problem. The performance and portability of these tools are compared with specific 'hand-made' code written by experienced programmers. The experimental results on distributed memory and shared distributed memory architectures prove the scalability of the proposed paradigm and its associated tools. Copyright © 2000 John Wiley & Sons, Ltd.
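For concreteness, here is the Single Resource Allocation recurrence in serial form; in the pipeline formulation, each stage i would be a process consuming the previous stage's row of values as they stream in. The profit function used is an arbitrary stand-in, not taken from the paper.

```cpp
// Serial sketch of the Single Resource Allocation DP recurrence
//   G[i][m] = max over x of (f(i, x) + G[i-1][m - x]).
// In the pipeline version, stage i consumes the G[i-1] row streamed from
// its predecessor.
#include <vector>
#include <algorithm>
#include <cstdio>

int main() {
    const int tasks = 4, units = 10;
    auto f = [](int task, int x) { return (task + 1) * x - x * x / 4; };

    std::vector<int> prev(units + 1, 0), cur(units + 1);
    for (int i = 0; i < tasks; ++i) {            // one pipeline stage per i
        for (int m = 0; m <= units; ++m) {
            cur[m] = 0;
            for (int x = 0; x <= m; ++x)
                cur[m] = std::max(cur[m], f(i, x) + prev[m - x]);
        }
        prev = cur;                              // "stream" row downstream
    }
    std::printf("best total profit = %d\n", prev[units]);
}
```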

12.
In this paper, we present a faster-than-real-time implementation of a class of dense stereo vision algorithms on a low-power massively parallel SIMD architecture, the CSX700. With two cores, each with 96 processing elements, this SIMD architecture provides a peak computation power of 96 GFLOPS while consuming only 9 Watts, making it an excellent candidate for embedded computing applications. Exploiting the full features of this architecture, we have developed schemes for an efficient parallel implementation with a minimum of overhead. For the sum of squared differences (SSD) algorithm and for VGA (640 × 480) images with disparity ranges of 16 and 32, we achieve a performance of 179 and 94 frames per second (fps), respectively. For HDTV (1,280 × 720) images with disparity ranges of 16 and 32, we achieve a performance of 67 and 35 fps, respectively. We have also implemented more accurate, and hence more computationally expensive, variants of the SSD, and for most cases, particularly for VGA images, we have achieved faster-than-real-time performance. Our results clearly demonstrate that, by developing careful parallelization schemes, the CSX architecture can provide excellent performance and flexibility for various embedded vision applications.
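The core computation being accelerated is block matching: for each pixel, the disparity with the minimum sum of squared differences over a small window wins. Below is a serial sketch with a toy example; the CSX700 implementation instead evaluates many pixels in lockstep across its processing elements.

```cpp
// Serial SSD block matching: slide a window over the disparity range and
// keep the offset with the minimum sum of squared differences.
#include <vector>
#include <climits>
#include <cstdio>

int best_disparity(const std::vector<int>& L, const std::vector<int>& R,
                   int w, int x, int y, int win, int dmax) {
    int best_d = 0;
    long best_ssd = LONG_MAX;
    for (int d = 0; d <= dmax; ++d) {
        long ssd = 0;
        for (int dy = -win; dy <= win; ++dy)
            for (int dx = -win; dx <= win; ++dx) {
                int diff = L[(y + dy) * w + (x + dx)]
                         - R[(y + dy) * w + (x + dx - d)];
                ssd += (long)diff * diff;
            }
        if (ssd < best_ssd) { best_ssd = ssd; best_d = d; }
    }
    return best_d;
}
// Caller must keep (x - dmax - win) >= 0 and the window inside the image.

int main() {
    int w = 16, h = 8;
    std::vector<int> L(w * h, 0), R(w * h, 0);
    L[4 * w + 10] = 255;             // feature at x=10 in the left image
    R[4 * w + 7] = 255;              // same feature shifted 3 px in the right
    std::printf("disparity = %d\n", best_disparity(L, R, w, 10, 4, 1, 5));
}
```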

13.
With the spread of multi-core CPU architectures, the growing complexity of applications and the expansion of data sets, the serial code in Matlab-based legacy systems can no longer exploit a system's potential performance and cannot cope with the demands of processing today's large data sets. Matlab's parallel computing model provides parallel support for data-intensive processing tasks. Starting from extending the system architecture and parallelizing the business code, this paper analyzes the key points and methods of refactoring a legacy system for parallelism; experimental data from the parallel refactoring of an application case demonstrate the performance gains of the refactored system when processing large data sets.

14.
Recent research in reduced instruction set computer architectures has emphasized the importance of the empirical approach to designing computer architectures: architectural features are analyzed for utility and cost with respect to the system software that uses them. This approach has resulted in architectural simulators that allow computer designers to vary the features of the architecture being simulated and to analyze how the addition or removal of these features affects the cost and performance of the architecture. In this paper we apply this technique to a new area: reconfigurable architectures. Our approach is to use an empirical methodology that emphasizes the interaction between the target software and the reconfigurability features of parallel architectures. We have developed a set of tools, the reconfigurable architecture workbench, that assists in this methodology by allowing parallel programs to be simulated on a target architecture in order to study the performance implications of various reconfigurability features. The workbench is based on a framework, the PCI model, which describes the range of parallel programs, parallel architectures, and reconfiguration features. We present details of the design and implementation of a prototype workbench, GT-RAW. GT-RAW is being used to study the utility of one dimension of reconfiguration for image processing and image understanding applications. We present an example of the experiments that are being conducted with GT-RAW as a demonstration of our empirical methodology.

15.
Image segmentation is a very important step in the computerized analysis of digital images. The maxflow/mincut approach has been successfully used to obtain minimum-energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg–Tarjan preflow‐push algorithm on the Cray XMT‐2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64-bit word. It is thus well‐suited to the parallelization of graph-theoretic algorithms, such as preflow‐push. We describe the implementation of the preflow‐push code on the XMT‐2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 3200² pixels in size, which is well beyond the largest previously reported in the literature. Copyright © 2013 John Wiley & Sons, Ltd.
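For reference, here is a compact serial version of the generic preflow-push algorithm on an adjacency matrix: active vertices push excess along admissible edges (residual edges going exactly one level downhill) and relabel when stuck. This is the textbook algorithm the paper parallelizes, not the XMT-2 code itself.

```cpp
// Generic serial push-relabel (preflow-push) max flow, O(V^2 E) variant.
#include <vector>
#include <algorithm>
#include <cstdio>

int max_flow(std::vector<std::vector<int>> cap, int s, int t) {
    int n = (int)cap.size();
    std::vector<int> h(n, 0);            // heights (labels)
    std::vector<long> excess(n, 0);
    h[s] = n;
    for (int v = 0; v < n; ++v) {        // saturate all source edges
        excess[v] += cap[s][v];
        excess[s] -= cap[s][v];
        cap[v][s] += cap[s][v];
        cap[s][v] = 0;
    }
    bool active = true;
    while (active) {
        active = false;
        for (int u = 0; u < n; ++u) {
            if (u == s || u == t || excess[u] <= 0) continue;
            active = true;
            int min_h = 2 * n;
            for (int v = 0; v < n && excess[u] > 0; ++v) {
                if (cap[u][v] <= 0) continue;
                if (h[u] == h[v] + 1) {               // admissible: push
                    long d = std::min<long>(excess[u], cap[u][v]);
                    cap[u][v] -= d; cap[v][u] += d;
                    excess[u] -= d; excess[v] += d;
                } else {
                    min_h = std::min(min_h, h[v]);
                }
            }
            if (excess[u] > 0) h[u] = min_h + 1;      // stuck: relabel
        }
    }
    return (int)excess[t];               // all excess has drained to t
}

int main() {
    // 0 = source, 3 = sink
    std::vector<std::vector<int>> cap = {
        {0, 3, 2, 0},
        {0, 0, 1, 2},
        {0, 0, 0, 2},
        {0, 0, 0, 0},
    };
    std::printf("max flow = %d\n", max_flow(cap, 0, 3));  // expect 4
}
```

Pushes and relabels at different vertices touch mostly disjoint state, which is why the algorithm suits the XMT-2's fine-grained per-word synchronization.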

16.
Diminishing returns from increased clock frequencies and instruction‐level parallelism have forced computer architects to adopt architectures that exploit wider parallelism through multiple processor cores. While emerging many‐core architectures have progressed at a remarkable rate, concerns arise regarding the performance and productivity of the numerous parallel‐programming tools for application development. Development of parallel applications on many‐core processors often requires developers to familiarize themselves with the unique characteristics of a target platform while attempting to maximize performance and maintain correctness of their applications. The family of partitioned global address space (PGAS) programming models comprises the current state of the art in balancing performance and programmability. One such PGAS approach is SHMEM, a lightweight, shared‐memory programming library that has demonstrated high performance and productivity potential for parallel‐computing systems with distributed‐memory architectures. In this paper, we present research, design, and analysis of a new SHMEM infrastructure specifically crafted for low‐level PGAS on modern and emerging many‐core processors featuring dozens of cores or more. Our approach (with a new library known as TSHMEM) is investigated and evaluated atop two generations of Tilera architectures, which are among the most sophisticated and scalable many‐core processors to date, and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE‐Gx and TILEPro many‐core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and application studies with other communication libraries. Our results with barrier primitives provided by the Tilera libraries show dissimilar performance between the TILE‐Gx and TILEPro; therefore, TSHMEM's barrier design takes an alternative approach and leverages the on‐chip mesh network to provide consistent low‐latency performance. In addition, our experiments with TSHMEM show that naive collective algorithms consistently outperformed linear distributed collective algorithms when executed in an SMP‐centric environment. In leveraging these insights for the design of TSHMEM, our approach outperforms the OpenSHMEM reference implementation, achieves similar-to-positive performance relative to OpenMP and OSHMPI atop MPICH, and supports similar libraries in delivering high‐performance parallel computing to emerging many‐core systems. Copyright © 2015 John Wiley & Sons, Ltd.
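As a portable sketch of the kind of consistent-latency barrier discussed above, here is a sense-reversing atomic-counter barrier in standard C++; TSHMEM's actual design instead leverages the Tilera on-chip mesh network, so this only shows the general shape of the primitive.

```cpp
// Sense-reversing barrier on an atomic counter: the last arrival resets the
// counter and flips the shared sense, releasing all waiters.
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

class SenseBarrier {
    std::atomic<int> count_;
    std::atomic<bool> sense_{false};
    const int n_;
public:
    explicit SenseBarrier(int n) : count_(n), n_(n) {}
    void wait() {
        bool my_sense = !sense_.load();
        if (count_.fetch_sub(1) == 1) {      // last arrival
            count_.store(n_);                // reset before releasing
            sense_.store(my_sense);          // flip: releases the waiters
        } else {
            while (sense_.load() != my_sense) std::this_thread::yield();
        }
    }
};

int main() {
    const int n = 4;
    SenseBarrier bar(n);
    std::vector<std::thread> ts;
    for (int i = 0; i < n; ++i)
        ts.emplace_back([&, i] {
            std::printf("thread %d before barrier\n", i);
            bar.wait();
            std::printf("thread %d after barrier\n", i);
        });
    for (auto& t : ts) t.join();
}
```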

17.
The growing gap between sustained and peak performance for scientific applications is a well‐known problem in high‐performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX‐6 vector processor, and compares it against the cache‐based IBM Power3 and Power4 superscalar architectures, across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines many low‐level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall results demonstrate that the SX‐6 achieves high performance on a large fraction of our application suite and often significantly outperforms the cache‐based architectures. However, certain classes of applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX‐6 effectively. Copyright © 2005 John Wiley & Sons, Ltd.

18.
Despite using multiple concurrent processors, a typical high‐performance parallel application is long‐running, taking hours, even days to arrive at a solution. To modify a running high‐performance parallel application, the programmer has to stop the computation, change the code, redeploy, and enqueue the updated version to be scheduled to run, thus wasting not only the programmer's time, but also expensive computing resources. To address these inefficiencies, this article describes how dynamic software updates (DSU) can be used to modify a parallel application on the fly, thus saving the programmer's time and using expensive computing resources more productively. The net effect of updating parallel applications dynamically can reduce the total time that elapses between posing a problem and arriving at a solution, otherwise known as time‐to‐discovery. To explore the benefits of dynamic updates for high performance applications, this article takes a two‐pronged approach. First, we describe our experiences of building and evaluating a system for dynamically updating applications running on a parallel cluster. We then review a large body of literature describing the existing state of the art in DSU and point out how this research can be applied to high‐performance applications. Our experimental results indicate that DSU have the potential to become a powerful tool in reducing time‐to‐discovery for high‐performance parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.
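One low-level building block such systems can use on POSIX platforms is runtime code loading via `dlopen`/`dlsym`, sketched below. A real DSU system must also quiesce running threads or MPI ranks and migrate state; the library name `./step_v2.so` and the symbol `compute_step` are hypothetical.

```cpp
// Minimal illustration of one DSU building block: swapping a function
// implementation at runtime with dlopen/dlsym (link with -ldl).
#include <dlfcn.h>
#include <cstdio>

using StepFn = double (*)(double);

int main() {
    void* lib = dlopen("./step_v2.so", RTLD_NOW);   // load updated code
    if (!lib) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    auto step = reinterpret_cast<StepFn>(dlsym(lib, "compute_step"));
    if (!step) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    // From here on, the long-running loop calls the *updated* kernel.
    for (int i = 0; i < 3; ++i)
        std::printf("step(%d) = %f\n", i, step(i));

    dlclose(lib);
    return 0;
}
```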

19.
Event services based on publish–subscribe architectures are well‐established components of distributed computing applications. Recently, an event service has been proposed as part of the common component architecture (CCA) for high‐performance computing (HPC) applications. In this paper we describe our implementation, experimental evaluation, and initial experience with a high‐performance CCA event service that exploits efficient communications mechanisms commonly used on HPC platforms. We describe the CCA event service model and briefly discuss the possible implementation strategies of the model. We then present the design and implementation of the event service using the aggregate remote memory copy interface as an underlying communication layer for this mechanism. Two alternative implementations are presented and evaluated on a Cray XD‐1 platform. The performance results demonstrate that event delivery latencies are low and that the event service is able to achieve high‐throughput levels. Finally, we describe the use of the event service in an application for high‐speed processing of data from a mass spectrometer and conclude by discussing some possible extensions to the event service for other HPC applications. Published in 2009 by John Wiley & Sons, Ltd.
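The publish-subscribe model itself can be shown in a few lines: subscribers register callbacks on a topic, and `publish` fans events out to them. This toy in-process version only illustrates the model; the paper's implementation instead delivers events with one-sided transfers over the aggregate remote memory copy interface.

```cpp
// Toy in-process publish-subscribe core illustrating the event-service model.
#include <functional>
#include <map>
#include <string>
#include <vector>
#include <cstdio>

class EventService {
    std::map<std::string,
             std::vector<std::function<void(const std::string&)>>> subs_;
public:
    void subscribe(const std::string& topic,
                   std::function<void(const std::string&)> cb) {
        subs_[topic].push_back(std::move(cb));
    }
    void publish(const std::string& topic, const std::string& payload) {
        for (auto& cb : subs_[topic]) cb(payload);   // fan-out to subscribers
    }
};

int main() {
    EventService es;
    es.subscribe("spectrum", [](const std::string& p) {
        std::printf("analyzer got: %s\n", p.c_str());
    });
    es.publish("spectrum", "scan #42");
}
```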

20.
The amount of big data collected during human–computer interactions requires natural language processing (NLP) applications to be executed efficiently, especially in parallel computing environments. Scalability and performance are critical in many NLP applications such as search engines or web indexers. However, there is a lack of mathematical models helping users to design and apply scheduling theory for NLP approaches. Moreover, many researchers and software architects reported various difficulties related to common NLP benchmarks. Therefore, this paper aims to introduce and demonstrate how to apply a scheduling model for a class of keyword extraction approaches. Additionally, we propose methods for the overall performance evaluation of different algorithms, which are based on processing time and correctness (quality) of answers. Finally, we present a set of experiments performed in different computing environments together with obtained results that can be used as reference benchmarks for further research in the field.
