共查询到20条相似文献,搜索用时 15 毫秒
1.
This paper presents an architecture-independent method for performing BMMC permutations on multiprocessors with distributed memory. All interprocessor communication uses the MPI function MPI_Sendrecv_replace(). The number of elements and number of processors must be powers of 2, with at least one element per processor, and there is no inherent upper bound on the ratio of elements per processor. Our method transmits only data without transmitting any source or target indices, which conserves network bandwidth. When data is transmitted, the source and target processors implicitly agree on each other's identity and the indices of the elements being transmitted. A C-callable implementation of our method is available from Netlib. The implementation allows preprocessing (which incurs a modest cost) to be factored out for multiple runs of the same permutation, even if on different data. Data may be laid out in any one of several ways: processor-major, processor-minor, or anything in between. Experimental results indicate that our method works well compared with several other candidate methods on three different platforms. In particular, the slower the interconnection network, the greater the relative advantage of our method. Received June 1, 1997; revised March 10, 1998. 相似文献
2.
介绍了车载跑道摩擦系数测试设备的测量原理和嵌入式PC控制系统组成,分析了测量过程中跑道摩擦系数测量载体-测量轮所受垂直正压力和水平摩擦力产生波动的影响因素.为消除水平力、垂直力波动对跑道摩擦系数测量结果带来的误差,采用了渐消记忆递推最小二乘数据处理方法.结果表明该方法可消除测量过程中大部分中高频干扰信号,使单次测量数据精度提高,保证摩擦系数测量结果的准确性. 相似文献
3.
4.
5.
个人计算机近期最主要的发展之一是使用多处理器以提高计算能力。为了分析个人计算机额外处理器所获得的性能,研究操作系统性能与计算机体系结构的相关性,我们用功能模块设计了一个双处理器计算机系统,采用修改过的快速傅立叶算法验证加强计算的能力。Windows NT4.0版本和Linux 2.0版本采用双处理器系统运行快速傅立叶算法,无需线程技术,使第二个处理器产生的性能得到了确认。文章探讨了这方面的细节问题。 相似文献
6.
《Journal of Parallel and Distributed Computing》2000,60(6):716-740
Compile-time scheduling is one approach to extract parallelism which has proved effective when the execution behavior is predictable. Unfortunately, the performance of most priority-based scheduling algorithms is computation dependent. Scheduling based on the concept of earliest-startable-task produces reasonably short schedules only when available parallelism is large enough to cover the communications. A priority-based decision is more effective when parallelism is low. We propose a scheduling in which the decision function combines two concepts: (1) task-level as global priority and (2) earliest-task-first as local priority The degree of dominance of one of the above concepts is controlled by a computation profile factor that is the ratio of task parallelism to communication. It is shown that the above factor is an upper bound on the deviation of schedule length from optimum. To tune the solution finish time the above scheduler is iteratively applied on the computation graph. In each iteration, the newly generated schedule is used to sharpen the task-levels which contribute in finding shorter schedules in the next iteration. Evaluation is carried out for a wide category of computation graphs with communications for which optimum schedules are known. It is found that pure local scheduling and static priority-based scheduling significantly deviate from the optimum under specific problem instances. Our approach to adapting the scheduling decision to the computation profile is able to produce near-optimum solutions via a much reduced number of iterations than other approaches. 相似文献
7.
8.
9.
超立方体连接的分布式存贮MIMD上稠密线性代数方程组求解 总被引:2,自引:0,他引:2
在大规模科学计算中,求解线性代数方程组是一个非常重要的课题。而在分布式存贮的MIMD上如何求解稠密线性代数方程组,数据平衡与机间通讯是两个最大的影响因素。本文针对超立方体连接的分布式MIMD系统上高斯消去法的具体实现展开了讨论。首先,我们介绍两种非选主元的高期去法的通讯策略,然后将其推广到选主元的高斯消去法,最后提出一种新的算法。使处理机效率大大提高,基本达到全并行工作。部分已有实验数据也在文中给 相似文献
10.
Least-squares spectral element methods (LSQSEM) are based on two important and successful numerical methods: spectral/hp element methods and least-squares finite element methods. Least-squares methods lead to symmetric and positive definite algebraic systems which circumvent the Ladyzhenskaya–Babuka–Brezzi (LBB) stability condition and consequently allow the use of equal order interpolation polynomials for all variables. In this paper, we present results obtained with a parallel implementation of the least-squares spectral element solver on a distributed memory machine (Cray T3E) and on a virtual shared memory machine (SGI Origin 3800). 相似文献
11.
Run-time preprocessing plays a major role in many efficient algorithms in computer science, as well as playing an important role in exploiting multiprocessor architectures. We give examples that elucidate the importance of run-time preprocessing and show how these optimizations can be integrated into compilers. To support our arguments, we describe transformations implemented in prototype multiprocessor compilers and present benchmarks from the iPSC2/860, the CM-2 and the Encore Multimax/320. 相似文献
12.
Efficient programming of task-parallel problems, where the number and execution times of the computational tasks can vary unpredictably, demands an asynchronous and adaptive approach. In this sort of approach, however, such fundamental programming issues as load sharing, data sharing, and termination detection can present difficult programming problems. This paper presents the PMESC library for managing task-parallel problems on distributed-memory MIMD computers within the context of the SPMD (single program, multiple data) programming model. PMESC offers support for all of the application-independent programming issues involved in SPMD task-parallel computation in a portable and efficient way while still allowing users to customize their codes. Because different problems may require different strategies to achieve good performance, PMESC is based on a straightforward model in which different building blocks can be easily put together and changed to accommodate the particular needs of the different applications. The library provides an interface that allows users to program a virtual machine and thereby ignore the details associated with message passing and machine architecture. These features make PMESC accessible to a wide variety of users. 相似文献
13.
14.
利用近似三对角Toeplitz矩阵的特殊结构,提出了一种新的求解近似三对角Toeplitz方程组的快速算法.在三对角Toeplitz矩阵的近似LU分解的基础上,利用“分而治之”的思想,并结合秦九韶技术和特殊的数学技巧减少大量的冗余计算,提出了求解近似Toeplitz三对角方程组的快速分布式并行算法,并在理论上证明了算法具有近似于线性的加速比.最后通过数值实验证明,新的并行算法具有较高的并行效率,并且当矩阵阶数n足够大时,算法的加速比趋近于线性加速比. 相似文献
15.
一类Toeplitz三对角方程组的一种分布式并行算法 总被引:3,自引:0,他引:3
文中提出一类Toeplitz三对角方程组的一种分布式并行算法。该算法以系数矩阵的分解为基础,充分利用了系数矩阵结构的特殊性,算法因并行化而引入的冗余计算量非常少,算法的通信机制简单,通信量仅与处理 机台数p有关,与方程组规模n无关,算法具有很高的并行效率,理论分析和数值试验表明,其加速比Sp(n)→p(n→ ∞),此为线性加速比的理想情况。文中给出了算法在分布存储多计算机系统上的数值试验结果。 相似文献
16.
《Computer Architecture Letters》2002,1(1):12-12
Global communication costs in future single-chipmultiprocessors will increase linearly with distance. In this paper,we revisit the issues of locality and load balance in order totake advantage of these new costs. We present a technique whichsimultaneously migrates data and threads based on vectors specifyinglocality and resource usage. This technique improves performanceon applications with distinguishable locality and imbalancedresource usage. 64% of the ideal reduction in execution timewas achieved on an application with these traits while no improvementwas obtained on a balanced application with little locality. 相似文献
17.
Amarasinghe S.P. Anderson J.M. Wilson C.S. Shih-Wei Liao Murphy B.R. French R.S. Lam M.S. Hall M.W. 《Micro, IEEE》1996,16(3):52-61
Like many architectural techniques that originated with mainframes. the use of multiple processors in a single computer is becoming popular in workstations and even personal computers. Multiprocessors constitute a significant percentage of recent workstation sales, and highly affordable multiprocessor personal computers are available in local computer stores. Once again, we find ourselves in a familiar situation: hardware is ahead of software. Because of the complexity of parallel programming, multiprocessors today are rarely used to speed up individual applications. Instead, they usually function as cycle-servers that achieve increased system throughput by running multiple tasks simultaneously. Automatic parallelization by a compiler is a particularly attractive approach to software development for multiprocessors, as it enables ordinary sequential programs to take advantage of the multiprocessor hardware without user involvement. This article looks to the future by examining some of the latest research results in automatic parallelization technology 相似文献
18.
Ying Chen Frank Dehne Todd Eavis Andrew Rau-Chaplin 《Distributed and Parallel Databases》2004,15(3):219-236
The pre-computation of data cubes is critical to improving the response time of On-Line Analytical Processing (OLAP) systems and can be instrumental in accelerating data mining tasks in large data warehouses. In order to meet the need for improved performance created by growing data sizes, parallel solutions for generating the data cube are becoming increasingly important. This paper presents a parallel method for generating data cubes on a shared-nothing multiprocessor. Since no (expensive) shared disk is required, our method can be used on low cost Beowulf style clusters consisting of standard PCs with local disks connected via a data switch. Our approach uses a ROLAP representation of the data cube where views are stored as relational tables. This allows for tight integration with current relational database technology.We have implemented our parallel shared-nothing data cube generation method and evaluated it on a PC cluster, exploring relative speedup, local vs. global schedule trees, data skew, cardinality of dimensions, data dimensionality, and balance tradeoffs. For an input data set of 2,000,000 rows (72 Megabytes), our parallel data cube generation method achieves close to optimal speedup; generating a full data cube of 227 million rows (5.6 Gigabytes) on a 16 processors cluster in under 6 minutes. For an input data set of 10,000,000 rows (360 Megabytes), our parallel method, running on a 16 processor PC cluster, created a data cube consisting of 846 million rows (21.7 Gigabytes) in under 47 minutes. 相似文献
19.
The scheduling of systems of periodic tasks upon multiprocessor platforms is considered. Utilization-based conditions are derived for determining whether a periodic task system meets all deadlines when scheduled using the earliest deadline first scheduling algorithm (EDF) upon a given multiprocessor platform. A new priority-driven algorithm is proposed for scheduling periodic task systems upon multiprocessor platforms: this algorithm is shown to successfully schedule some task systems for which EDF may fail to meet all deadlines. 相似文献
20.
We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming.
We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed
collection of processes, while below this level, the operating system kernel schedules processes onto a fixed collection of
processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make
efficient use of whatever processor resources are provided by the kernel.
Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with
work T
1
and critical-path length T
∈
fty , and for any number P of processes, our scheduler executes the computation in expected time O(T
1
/P
A
+ T
∈
fty P/P
A
) , where P
A
is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant
factor, and achieves linear speedup whenever P is small relative to the parallelism T
1
/T
∈
fty .
Online publication February 26, 2001. 相似文献