期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Performing BMMC Permutations Efficiently on Distributed-Memory Multiprocessors with MPI

T. H. Cormen J. C. Clippinger 《Algorithmica》1999,24(3-4):349-370

This paper presents an architecture-independent method for performing BMMC permutations on multiprocessors with distributed memory. All interprocessor communication uses the MPI function MPI_Sendrecv_replace(). The number of elements and number of processors must be powers of 2, with at least one element per processor, and there is no inherent upper bound on the ratio of elements per processor. Our method transmits only data without transmitting any source or target indices, which conserves network bandwidth. When data is transmitted, the source and target processors implicitly agree on each other's identity and the indices of the elements being transmitted. A C-callable implementation of our method is available from Netlib. The implementation allows preprocessing (which incurs a modest cost) to be factored out for multiple runs of the same permutation, even if on different data. Data may be laid out in any one of several ways: processor-major, processor-minor, or anything in between. Experimental results indicate that our method works well compared with several other candidate methods on three different platforms. In particular, the slower the interconnection network, the greater the relative advantage of our method. Received June 1, 1997; revised March 10, 1998. 相似文献

2.

基于最小二乘法的跑道摩擦系数测试方法

史艳红王立文《微计算机信息》2009,25(23)

介绍了车载跑道摩擦系数测试设备的测量原理和嵌入式PC控制系统组成,分析了测量过程中跑道摩擦系数测量载体-测量轮所受垂直正压力和水平摩擦力产生波动的影响因素.为消除水平力、垂直力波动对跑道摩擦系数测量结果带来的误差,采用了渐消记忆递推最小二乘数据处理方法.结果表明该方法可消除测量过程中大部分中高频干扰信号,使单次测量数据精度提高,保证摩擦系数测量结果的准确性. 相似文献

3.

A Weighted Least-Squares Approach to Parameter Estimation Problems Based on Binary Measurements

Colinet E. Juillard J. 《Automatic Control, IEEE Transactions on》2010,55(1):148-152

We present a new approach to parameter estimation problems based on binary measurements, motivated by the need to add integrated low-cost self-test features to microfabricated devices. This approach is based on the use of original weighted least-squares criteria: as opposed to other existing methods, it requires no dithering signal and it does not rely on an approximation of the quantizer. In this technical note, we focus on a simple choice for the weights and establish some asymptotical properties of the corresponding criterion. To achieve this, the assumption that the quantizer's input is Gaussian and centered is made. In this context, we prove that the proposed criterion is locally convex and that it is possible to use a simple gradient descent to find a consistent estimate of the unknown system parameters, regardless of the presence of measurement noise at the quantizer's input. 相似文献

4.

分布存储系统中优化通信的冗余计算分割 总被引：1，自引：0，他引：1

陈莉张兆庆冯晓兵《计算机学报》2003,26(2):180-187

针对并行循环套序列,提出一种冗余计算分割的通信优化方法,根据数据流分析,文中给出用以确定每个循环套的冗余计算量的一般方法,并在此基础上提出冗余计算分割的实现和判定,针对规则依赖的程序,该文还提出了一个高效的冗余计算分割的实现方法,该技术已经在一个并行编译器中实现,试验结果表明,它比传统的通信优化技术有明显的优越性。相似文献

5.

多处理器操作系统的性能分析

何莉张越峰《电脑与信息技术》2005,13(6):10-13

个人计算机近期最主要的发展之一是使用多处理器以提高计算能力。为了分析个人计算机额外处理器所获得的性能，研究操作系统性能与计算机体系结构的相关性，我们用功能模块设计了一个双处理器计算机系统，采用修改过的快速傅立叶算法验证加强计算的能力。Windows NT4．0版本和Linux 2.0版本采用双处理器系统运行快速傅立叶算法，无需线程技术，使第二个处理器产生的性能得到了确认。文章探讨了这方面的细节问题。相似文献

6.

Adaptive Scheduling of Computations and Communications on Distributed-Memory Systems

《Journal of Parallel and Distributed Computing》2000,60(6):716-740

Compile-time scheduling is one approach to extract parallelism which has proved effective when the execution behavior is predictable. Unfortunately, the performance of most priority-based scheduling algorithms is computation dependent. Scheduling based on the concept of earliest-startable-task produces reasonably short schedules only when available parallelism is large enough to cover the communications. A priority-based decision is more effective when parallelism is low. We propose a scheduling in which the decision function combines two concepts: (1) task-level as global priority and (2) earliest-task-first as local priority The degree of dominance of one of the above concepts is controlled by a computation profile factor that is the ratio of task parallelism to communication. It is shown that the above factor is an upper bound on the deviation of schedule length from optimum. To tune the solution finish time the above scheduler is iteratively applied on the computation graph. In each iteration, the newly generated schedule is used to sharpen the task-levels which contribute in finding shorter schedules in the next iteration. Evaluation is carried out for a wide category of computation graphs with communications for which optimum schedules are known. It is found that pure local scheduling and static priority-based scheduling significantly deviate from the optimum under specific problem instances. Our approach to adapting the scheduling decision to the computation profile is able to produce near-optimum solutions via a much reduced number of iterations than other approaches. 相似文献

7.

Evolution-Based Scheduling of Computations and Communications on Distributed-Memory Multicomputers

Al-Mouhamed Mayez 《Computer Journal》1999,42(5):373-390

相似文献

8.

Scheduling of Precedence-Constrained Tasks on Multiprocessors

Price C. C.; Salama M. A. 《Computer Journal》1990,33(3):219-229

相似文献

9.

超立方体连接的分布式存贮MIMD上稠密线性代数方程组求解 总被引：2，自引：0，他引：2

莫则尧李晓梅《计算机工程与设计》1994,(5):26-33

在大规模科学计算中，求解线性代数方程组是一个非常重要的课题。而在分布式存贮的ＭＩＭＤ上如何求解稠密线性代数方程组，数据平衡与机间通讯是两个最大的影响因素。本文针对超立方体连接的分布式ＭＩＭＤ系统上高斯消去法的具体实现展开了讨论。首先，我们介绍两种非选主元的高期去法的通讯策略，然后将其推广到选主元的高斯消去法，最后提出一种新的算法。使处理机效率大大提高，基本达到全并行工作。部分已有实验数据也在文中给相似文献

10.

Parallel Implementation of a Least-Squares Spectral Element Solver for Incompressible Flow Problems

Margreet Nool Michael M. J. Proot 《The Journal of supercomputing》2004,28(2):135-148

Least-squares spectral element methods (LSQSEM) are based on two important and successful numerical methods: spectral/hp element methods and least-squares finite element methods. Least-squares methods lead to symmetric and positive definite algebraic systems which circumvent the Ladyzhenskaya–Babuka–Brezzi (LBB) stability condition and consequently allow the use of equal order interpolation polynomials for all variables. In this paper, we present results obtained with a parallel implementation of the least-squares spectral element solver on a distributed memory machine (Cray T3E) and on a virtual shared memory machine (SGI Origin 3800). 相似文献

11.

Multiprocessors and run-time compilation

Joel Saltz Harry Berryman Janet Wu 《Concurrency and Computation》1991,3(6):573-592

Run-time preprocessing plays a major role in many efficient algorithms in computer science, as well as playing an important role in exploiting multiprocessor architectures. We give examples that elucidate the importance of run-time preprocessing and show how these optimizations can be integrated into compilers. To support our arguments, we describe transformations implemented in prototype multiprocessor compilers and present benchmarks from the iPSC2/860, the CM-2 and the Encore Multimax/320. 相似文献

12.

The PMESC Programming Library for Distributed-Memory MIMD Computers

S. Crivelli E. R. Jessup 《Journal of Parallel and Distributed Computing》1999,57(3):1575

Efficient programming of task-parallel problems, where the number and execution times of the computational tasks can vary unpredictably, demands an asynchronous and adaptive approach. In this sort of approach, however, such fundamental programming issues as load sharing, data sharing, and termination detection can present difficult programming problems. This paper presents the PMESC library for managing task-parallel problems on distributed-memory MIMD computers within the context of the SPMD (single program, multiple data) programming model. PMESC offers support for all of the application-independent programming issues involved in SPMD task-parallel computation in a portable and efficient way while still allowing users to customize their codes. Because different problems may require different strategies to achieve good performance, PMESC is based on a straightforward model in which different building blocks can be easily put together and changed to accommodate the particular needs of the different applications. The library provides an interface that allows users to program a virtual machine and thereby ignore the details associated with message passing and machine architecture. These features make PMESC accessible to a wide variety of users. 相似文献

13.

Algorithmic Performance of Dataflow Multiprocessors

Carlson W.W. Kai Hwang 《Computer》1985,18(12):30-40

相似文献

14.

近似三对角Toeplitz方程组的快速分布式并行算法

单润红高峰宋君强李晓梅《计算机研究与发展》2004,41(7):1281-1285

利用近似三对角Toeplitz矩阵的特殊结构,提出了一种新的求解近似三对角Toeplitz方程组的快速算法．在三对角Toeplitz矩阵的近似LU分解的基础上,利用“分而治之”的思想,并结合秦九韶技术和特殊的数学技巧减少大量的冗余计算,提出了求解近似Toeplitz三对角方程组的快速分布式并行算法,并在理论上证明了算法具有近似于线性的加速比．最后通过数值实验证明,新的并行算法具有较高的并行效率,并且当矩阵阶数n足够大时,算法的加速比趋近于线性加速比．相似文献

15.

一类Toeplitz三对角方程组的一种分布式并行算法 总被引：3，自引：0，他引：3

骆志刚李晓梅《计算机学报》2001,24(2):173-178

文中提出一类Toeplitz三对角方程组的一种分布式并行算法。该算法以系数矩阵的分解为基础,充分利用了系数矩阵结构的特殊性,算法因并行化而引入的冗余计算量非常少,算法的通信机制简单,通信量仅与处理机台数p有关,与方程组规模n无关,算法具有很高的并行效率,理论分析和数值试验表明,其加速比Sp(n)→p(n→ ∞）,此为线性加速比的理想情况。文中给出了算法在分布存储多计算机系统上的数值试验结果。相似文献

16.

Migration in Single Chip Multiprocessors

《Computer Architecture Letters》2002,1(1):12-12

Global communication costs in future single-chipmultiprocessors will increase linearly with distance. In this paper,we revisit the issues of locality and load balance in order totake advantage of these new costs. We present a technique whichsimultaneously migrates data and threads based on vectors specifyinglocality and resource usage. This technique improves performanceon applications with distinguishable locality and imbalancedresource usage. 64% of the ideal reduction in execution timewas achieved on an application with these traits while no improvementwas obtained on a balanced application with little locality. 相似文献

17.

Multiprocessors from a software perspective

Amarasinghe S.P. Anderson J.M. Wilson C.S. Shih-Wei Liao Murphy B.R. French R.S. Lam M.S. Hall M.W. 《Micro, IEEE》1996,16(3):52-61

Like many architectural techniques that originated with mainframes. the use of multiple processors in a single computer is becoming popular in workstations and even personal computers. Multiprocessors constitute a significant percentage of recent workstation sales, and highly affordable multiprocessor personal computers are available in local computer stores. Once again, we find ourselves in a familiar situation: hardware is ahead of software. Because of the complexity of parallel programming, multiprocessors today are rarely used to speed up individual applications. Instead, they usually function as cycle-servers that achieve increased system throughput by running multiple tasks simultaneously. Automatic parallelization by a compiler is a particularly attractive approach to software development for multiprocessors, as it enables ordinary sequential programs to take advantage of the multiprocessor hardware without user involvement. This article looks to the future by examining some of the latest research results in automatic parallelization technology 相似文献

18.

Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors 总被引：6，自引：2，他引：6

Ying Chen Frank Dehne Todd Eavis Andrew Rau-Chaplin 《Distributed and Parallel Databases》2004,15(3):219-236

The pre-computation of data cubes is critical to improving the response time of On-Line Analytical Processing (OLAP) systems and can be instrumental in accelerating data mining tasks in large data warehouses. In order to meet the need for improved performance created by growing data sizes, parallel solutions for generating the data cube are becoming increasingly important. This paper presents a parallel method for generating data cubes on a shared-nothing multiprocessor. Since no (expensive) shared disk is required, our method can be used on low cost Beowulf style clusters consisting of standard PCs with local disks connected via a data switch. Our approach uses a ROLAP representation of the data cube where views are stored as relational tables. This allows for tight integration with current relational database technology.We have implemented our parallel shared-nothing data cube generation method and evaluated it on a PC cluster, exploring relative speedup, local vs. global schedule trees, data skew, cardinality of dimensions, data dimensionality, and balance tradeoffs. For an input data set of 2,000,000 rows (72 Megabytes), our parallel data cube generation method achieves close to optimal speedup; generating a full data cube of 227 million rows (5.6 Gigabytes) on a 16 processors cluster in under 6 minutes. For an input data set of 10,000,000 rows (360 Megabytes), our parallel method, running on a 16 processor PC cluster, created a data cube consisting of 846 million rows (21.7 Gigabytes) in under 47 minutes. 相似文献

19.

Priority-Driven Scheduling of Periodic Task Systems on Multiprocessors 总被引：5，自引：3，他引：5

Joël Goossens Shelby Funk Sanjoy Baruah 《Real-Time Systems》2003,25(2-3):187-205

The scheduling of systems of periodic tasks upon multiprocessor platforms is considered. Utilization-based conditions are derived for determining whether a periodic task system meets all deadlines when scheduled using the earliest deadline first scheduling algorithm (EDF) upon a given multiprocessor platform. A new priority-driven algorithm is proposed for scheduling periodic task systems upon multiprocessor platforms: this algorithm is shown to successfully schedule some task systems for which EDF may fail to meet all deadlines. 相似文献

20.

Thread Scheduling for Multiprogrammed Multiprocessors

N. S. Arora R. D. Blumofe C. G. Plaxton 《Theory of Computing Systems》2001,34(2):115-144

We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed collection of processes, while below this level, the operating system kernel schedules processes onto a fixed collection of processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make efficient use of whatever processor resources are provided by the kernel. Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with work T ₁ and critical-path length T _∈ fty , and for any number P of processes, our scheduler executes the computation in expected time O(T ₁ /P _A + T _∈ fty P/P _A ) , where P _A is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant factor, and achieves linear speedup whenever P is small relative to the parallelism T ₁ /T _∈ fty . Online publication February 26, 2001. 相似文献