
1.

Highly optimized implementation of OpenCV for the Cell Broadband Engine

Hiroki Sugano, Ryusuke Miyamoto 《Computer Vision and Image Understanding》2010,114(11):1273-1281

Real-time image recognition is increasingly required in embedded applications such as automotive systems, robotics, and entertainment. To realize real-time image recognition on such systems, we need libraries optimized for embedded processors. OpenCV is one of the most widely used libraries for computer vision applications and has many functions optimized for Intel processors, but no function is optimized for embedded processors. We present a parallel implementation of the OpenCV library on the Cell Broadband Engine (Cell), which is one of the most widely used high-performance embedded processors. Experimental results show that most of the functions optimized for the Cell processor are faster than the functions optimized for the Intel Core 2 Duo E6850 at 3.00 GHz.

2.

The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor (cited by: 1; self-citations: 0; citations by others: 1)

Michael Gschwind 《International journal of parallel programming》2007,35(3):233-262

As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of available parallelism in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores make it possible to achieve this performance by parallelizing and offloading computation-intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model with SPEs. We also give an example of scheduling code to be memory-latency tolerant using software pipelining techniques in the SPE. This paper is based in part on “Chip multiprocessing and the Cell Broadband Engine”, ACM Computing Frontiers 2006.

3.

Implementation of mixed precision in solving systems of linear equations on the Cell processor

Jakub Kurzak, Jack Dongarra 《Concurrency and Computation》2007,19(10):1371-1385

This paper describes the design concepts behind implementations of mixed‐precision linear algebra routines targeted for the Cell processor. It describes in detail the implementation of code to solve a linear system of equations using Gaussian elimination in single precision with iterative refinement of the solution to full double‐precision accuracy. By utilizing this approach, the algorithm achieves close to an order of magnitude higher performance on the Cell processor than the standard double‐precision algorithm. The code is effectively an implementation of the high‐performance LINPACK benchmark, as it meets all of the requirements concerning the problem being solved and the numerical properties of the solution. Copyright © 2007 John Wiley & Sons, Ltd.
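
The single-precision solve with double-precision refinement described in this abstract can be illustrated with a minimal sketch. It is not the authors' Cell code: the helper names (`f32`, `solve_single`, `residual`, `refine`) are hypothetical, single precision is emulated by rounding through the standard `struct` module, and the elimination is simply rerun for each correction, whereas a production implementation would factor the matrix once and reuse the factors.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_single(A, b):
    """Gaussian elimination with partial pivoting; every intermediate is rounded to float32."""
    n = len(b)
    # Augmented matrix [A | b], stored in emulated single precision.
    M = [[f32(v) for v in row] + [f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    # Back substitution, also in emulated single precision.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, b, x):
    """r = b - A x, accumulated in double precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iters=5):
    """Single-precision solve followed by double-precision iterative refinement."""
    x = solve_single(A, b)
    for _ in range(iters):
        r = residual(A, b, x)        # residual in double precision
        d = solve_single(A, r)       # correction computed in single precision
        x = [xi + di for xi, di in zip(x, d)]  # update kept in double precision
    return x
```

For a well-conditioned matrix, each refinement pass shrinks the error by roughly the single-precision rounding level, so a handful of passes is usually enough to reach double-precision accuracy; the fixed count `iters=5` is an assumption for illustration, and a real solver would stop on a residual tolerance instead.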

4.

The Bottom-Up Implementation of One MILC Lattice QCD Application on the Cell Blade

Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb 《International journal of parallel programming》2009,37(5):488-507

We report the results of the bottom-up implementation of one MILC lattice quantum chromodynamics (QCD) application on the Cell Broadband Engine™ processor. In our implementation, we preserve MILC’s framework for scaling the application to run on a large number of compute nodes and accelerate computationally intensive kernels on the Cell’s synergistic processor elements. Speedups of 3.4 × for the 8 × 8 × 16 × 16 lattice and 5.7 × for the 16 × 16 × 16 × 16 lattice are obtained when comparing our implementation of the MILC application executed on a 3.2 GHz Cell processor to the standard MILC code executed on a quad-core 2.33 GHz Intel Xeon processor. We provide an empirical model to predict application performance for a given lattice size. We also show that performance of the compute-intensive part of the application on the Cell processor is limited by the bandwidth between main memory and the Cell’s synergistic processor elements, whereas performance of the application’s parallel execution framework is limited by the bandwidth between main memory and the Cell’s power processor element.

5.

Microwave tomography for breast cancer detection on Cell broadband engine processors

Meilian Xu, Parimala Thulasiraman, Sima Noghanian 《Journal of Parallel and Distributed Computing》2012

Microwave tomography (MT) is a safe screening modality that can be used for breast cancer detection. The technique uses the dielectric property contrasts between different breast tissues at microwave frequencies to determine the existence of abnormalities. Our proposed MT approach is an iterative process that involves two algorithms: Finite-Difference Time-Domain (FDTD) and Genetic Algorithm (GA). It is a compute-intensive problem: (i) the number of iterations can be quite large to detect small tumors; (ii) many fine-grained computations and discretizations of the object under screening are required for accuracy. In our earlier work, we developed a parallel algorithm for microwave tomography on CPU-based homogeneous, multi-core, distributed memory machines. The performance improvement was limited due to communication and synchronization latencies inherent in the algorithm. In this paper, we exploit the parallelism of microwave tomography on the Cell BE processor. Since FDTD is a numerical technique with regular memory accesses, intensive floating point operations and SIMD type operations, the algorithm can be efficiently mapped onto the Cell processor, achieving significant performance. The initial implementation of FDTD on Cell BE with 8 SPEs is 2.9 times faster than an eight node shared memory machine and 1.45 times faster than an eight node distributed memory machine. In this work, we modify the FDTD algorithm by overlapping computations with communications during asynchronous DMA transfers. The modified algorithm also orchestrates the computations to fully use data between DMA transfers to increase the computation-to-communication ratio. We see a 54% improvement on 8 SPEs (27.9% on 1 SPE) for the modified FDTD in comparison to our original FDTD algorithm on Cell BE. We further reduce the synchronization latency between GA and FDTD by using mechanisms such as double buffering. We also propose a performance prediction model based on DMA transfers, number of instructions and operations, the processor frequency and DMA bandwidth. We show that the execution time from our prediction model is comparable (within 0.5 s difference) with the execution time of the experimental results on one SPE.

6.

Parallelization and performance comparison of the conjugate gradient equation solver on multicore Cell and Xeon computers

Fadi N. Sibai, Mohammad Saad, Hashir K. Kidwai 《Concurrency and Computation》2011,23(18):2463-2467

Multicore accelerators are used today to supplement traditional superscalar processors in massively parallel computer nodes with extra floating‐point computation power. This paper presents our parallelization, performance enhancement, and evaluation of the conjugate gradient (CG) linear equation solver with enhanced matrix multiplication on the Cell Broadband Engine accelerator. The paper also compares the CG performance results on the Cell with two CG implementations on a computer with two quad-core Xeon processors, one with OpenMP and the other with OpenMPI. We also report the enhancements made to the CG code, in particular to matrix multiplication and synchronization, and a performance analysis of CG for heptadiagonal matrices on single and dual Cell Broadband Engine packages with 8 and 16 synergistic processing elements and on Xeon. We also report the communication and computation time breakdowns and the floating point operations per second ratio. Our parallel CG solver is shown to scale well with data size, grid dimensionality, and number of cores. Copyright © 2011 John Wiley & Sons, Ltd.
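
For readers unfamiliar with the solver named in this abstract, the following is a minimal serial sketch of the textbook conjugate gradient iteration applied to a heptadiagonal matrix. It is not the authors' parallel Cell or Xeon code. The matrix is assumed here to be the 7-point Laplacian on a small 3D grid with zero boundary values, one common source of heptadiagonal systems; the function names and the grid size are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def laplacian_7pt(nx, ny, nz):
    """Return a matrix-free product with the 7-point Laplacian on an nx*ny*nz grid.

    Each row has the diagonal entry 6 and up to six -1 entries for the grid
    neighbours, which gives a symmetric positive definite heptadiagonal matrix.
    """
    def idx(i, j, k):
        return (i * ny + j) * nz + k

    def matvec(u):
        out = [0.0] * (nx * ny * nz)
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    s = 6.0 * u[idx(i, j, k)]
                    if i > 0:
                        s -= u[idx(i - 1, j, k)]
                    if i < nx - 1:
                        s -= u[idx(i + 1, j, k)]
                    if j > 0:
                        s -= u[idx(i, j - 1, k)]
                    if j < ny - 1:
                        s -= u[idx(i, j + 1, k)]
                    if k > 0:
                        s -= u[idx(i, j, k - 1)]
                    if k < nz - 1:
                        s -= u[idx(i, j, k + 1)]
                    out[idx(i, j, k)] = s
        return out

    return matvec


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)       # the one matrix-vector product per iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Each iteration costs one matrix-vector product and two inner products. In a parallel implementation the inner products are global reductions, which is one reason matrix multiplication and synchronization are natural tuning targets for this solver.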

7.

Operating-state evaluation of multimode processes based on GMM and Bayesian inference

Zou Xiaoyu, Chang Yuqing, Wang Fuli, Zhou Yang 《Control Theory & Applications》 (控制理论与应用) 2016,33(2):164-171

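To maximize overall economic benefit, a production process should be kept at its optimal operating-state grade. To address the problem of judging how good the operating-state grade of a multimode process is, an operating-state grade evaluation method is proposed. The method builds one Gaussian mixture model (GMM) for the multimode data belonging to the same operating-state grade, which ensures accurate feature extraction and avoids the mode-partitioning problem. For online evaluation, Bayesian inference is used to determine the posterior probability that the current operating state belongs to each grade, and a sliding window is introduced to decide the current operating-state grade, which effectively solves the online evaluation problem for multimode processes. For "non-optimal" operating states, a contribution calculation method based on partial derivatives of the variables is proposed to trace the cause variables responsible for the non-optimal grade. Finally, the effectiveness of the proposed method is verified on the Tennessee–Eastman (TE) process.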
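
The online evaluation step (one GMM per grade, a Bayesian posterior over grades, and a sliding window) can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the data are one-dimensional, the GMM parameters are given by hand rather than fitted, and the window rule (average the posteriors of the most recent samples and pick the largest) is an assumed choice because the abstract does not specify one. All names are hypothetical.

```python
import math
from collections import deque


def gmm_pdf(x, components):
    """Density of a 1-D Gaussian mixture; components is a list of (weight, mean, std)."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in components
    )


def grade_posteriors(x, grade_models, priors):
    """Bayes' rule: P(grade | x) is proportional to P(x | grade) * P(grade)."""
    joint = {g: priors[g] * gmm_pdf(x, comps) for g, comps in grade_models.items()}
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}


def evaluate_online(samples, grade_models, priors, window=3):
    """Decide a grade for each new sample from the window-averaged posteriors."""
    recent = deque(maxlen=window)   # keeps only the last `window` posterior dicts
    decisions = []
    for x in samples:
        recent.append(grade_posteriors(x, grade_models, priors))
        avg = {g: sum(p[g] for p in recent) / len(recent) for g in grade_models}
        decisions.append(max(avg, key=avg.get))
    return decisions


# Hypothetical example: each grade spans two operating modes, so each grade's
# GMM has two components and no separate mode identification is needed.
models = {
    "optimal": [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)],
    "non-optimal": [(0.5, 5.0, 1.0), (0.5, 15.0, 1.0)],
}
priors = {"optimal": 0.5, "non-optimal": 0.5}
decisions = evaluate_online([0.0, 10.0, 0.2, 5.0, 15.0, 5.1], models, priors)
```

In this example the fourth sample already looks non-optimal on its own, but the window still holds two optimal samples, so the decision switches only at the fifth sample. This shows how a window trades response time for robustness to isolated outliers.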

8.

Parallel Performance Analysis of the Improved Quasi-Minimal Residual Method on Bulk Synchronous Parallel Architectures

Yang Tianruo, Lin Hai-Xiang 《The Journal of supercomputing》1999,13(2):191-210

For the solution of unsymmetric linear systems of equations, we have proposed an improved version of the quasi-minimal residual (IQMR) method [21] that uses the Lanczos process as a major component and combines elements of numerical stability and parallel algorithm design. For the Lanczos process, stability is obtained by a coupled two-term procedure that generates Lanczos vectors scaled to unit length. The algorithm is derived such that all inner products and matrix-vector multiplications of a single iteration step are independent, and the communication time required for the inner products can be overlapped efficiently with computation time. In this paper, we use the Bulk Synchronous Parallel (BSP) model to design a fully efficient, scalable and portable parallel IQMR algorithm and to provide accurate performance prediction of the algorithm for a wide range of architectures including the Cray T3D, the Parsytec GC/PowerPlus, and a cluster of workstations connected by an Ethernet. This performance model gives us useful insight into the time complexity of the IQMR method using only a few system-dependent parameters based on simple and accurate cost modeling. The theoretical performance predictions are compared with measured timing results of a numerical application from ocean flow simulation.

9.

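Research on a unified object model supporting HLA simulation and parallel rendering (cited by: 1; self-citations: 0; citations by others: 1)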

Wang Zonghui, Xiong Hua, Jiang Xiaohong, Shi Jiaoying 《Journal of Computer Research and Development》 (计算机研究与发展) 2008,45(2):329-336

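With the development of computer hardware, software, networks, and societal demand, distributed interactive simulation applications place high requirements on the ability to manage and render large-scale complex scenes and large numbers of simulation entity objects. Existing HLA simulation applications suffer from problems such as poor real-time performance in visual simulation of large-scale complex scenes and the separation of simulation entity object management from rendering object management. To address these problems, a unified object model is proposed. It comprises a heterogeneous entity object tree, an operation record list, and a unified access interface, and it achieves efficient organization and unified management of simulation entity objects and rendering objects. The unified object model builds an efficient data exchange bridge between the HLA simulation platform and the parallel rendering platform, reduces the development effort of integrating the two, and effectively supports real-time rendering of large-scale complex scenes and large numbers of simulation entity objects, which helps improve the real-time performance of simulation. The model is general with respect to HLA simulation platforms and parallel rendering platforms. Finally, experimental results and analysis are given.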

10.

A petroleum-topic crawling strategy based on a word co-occurrence model and DOM

Li Cunhe, Li Han 《Microcomputer Applications》 (微计算机应用) 2008,29(2):28-31

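A word co-occurrence model based on the DOM tree is proposed. The document's structural information is first used to build a DOM tree; co-occurrence information for the topic words in the document is then counted according to the structural characteristics of the DOM tree; finally, a vector space model is used to collect and classify petroleum-topic web pages. The approach improves on the original word co-occurrence model and emphasizes the use of position information to optimize it. Experiments show that the strategy improves both collection and classification performance to some degree.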

11.

Numerical study on the feasibility of dynamic evolving neural-fuzzy inference system for approximation of compressive strength of dry-cast concrete

《Applied Soft Computing》2014

This paper assesses the effectiveness of dynamic evolving neural-fuzzy inference system (DENFIS) models in predicting the compressive strength of dry-cast concretes, and compares their prediction performance with that of regression, neural network (NN) and ANFIS models. The results of this study emphasize the capabilities of online first-order and offline high-order Takagi–Sugeno (TSK) type DENFIS models for prediction purposes, whereas offline first-order TSK-type DENFIS models did not produce reliable results. Comparing the results of an elite high-order DENFIS model with those predicted by the selected NN, regression and ANFIS models showed that the DENFIS model was more effective than the regression model, while its performance was similar to or slightly better than that of the other artificial prediction tools.