1.
Multiple addition is the problem of adding N b-bit integers. Prefix sums and multiple addition play fundamental roles in many algorithms, particularly on the reconfigurable mesh (R-Mesh). Scaling algorithms on the R-Mesh to run with the same or increased efficiency on fewer processors is a challenging and important proposition. In this paper, we present algorithms that scale with increasing efficiency for multiple addition, prefix sums, and matrix-vector multiplication. Along the way, we obtain an improved multiple addition algorithm.
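As an illustration (ours, not the paper's): the prefix-sums primitive the abstract refers to can be sketched serially as a work-efficient Blelloch-style scan; on an R-Mesh, each tree level would execute in parallel. All names below are our own.

```python
# Minimal sketch of work-efficient exclusive prefix sums (scan), the
# primitive the paper builds on. On an R-Mesh each level of the
# up-sweep/down-sweep would run in parallel; here both are serial loops.

def exclusive_scan(a):
    """Blelloch-style exclusive prefix sums; len(a) must be a power of two."""
    x = list(a)
    n = len(x)
    # Up-sweep: build partial sums in a balanced-tree pattern.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] += x[i - d]
        d *= 2
    # Down-sweep: distribute prefixes back down the tree.
    x[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            x[i - d], x[i] = x[i], x[i] + x[i - d]
        d //= 2
    return x

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```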
2.
In this paper we present efficient algorithms for packet routing on the reconfigurable linear array and the reconfigurable two-dimensional mesh. We introduce algorithms that are efficient in the worst case and algorithms that are better on average. The time bounds presented are better than those achievable on the conventional mesh and by previously known algorithms. We consider two variants of the reconfigurable mesh. In the first model, M_r, the processors are attached to a reconfigurable bus whose individual edge connections are bidirectional. In the second model, M_mr, the processors are attached to two unidirectional buses. We present lower bounds and nearly matching upper bounds for packet routing on these two models. As a consequence, we solve two of the open problems mentioned in [9].
Received August 17, 1998; revised November 3, 1999.
3.
Jayram Moorkanikara Nageswaran Andrew Felch Ashok Chandrasekhar Nikil Dutt Richard Granger Alex Nicolau Alex Veidenbaum 《International journal of parallel programming》2009,37(4):345-369
Even though computing systems have increased in transistor count, switching speed, and number of processors, most programs exhibit limited speedup due to the serial dependencies of existing algorithms. Analysis of intrinsically parallel systems such as brain circuitry has led to the identification of novel architecture designs, and also of new algorithms that can exploit the features of modern multiprocessor systems. In this article we describe the details of a brain-derived vision (BDV) algorithm that is derived from the anatomical structure and physiological operating principles of thalamo-cortical brain circuits. We show that many characteristics of the BDV algorithm lend themselves to implementation on the IBM CELL architecture, and yield impressive speedups that equal or exceed the performance of specialized solutions such as FPGAs. Mapping this algorithm to the IBM CELL is non-trivial, and we suggest various approaches to deal with parallelism, task granularity, communication, and memory locality. We also show that a cluster of three or more PS3s containing IBM CELL processors provides a promising platform for brain-derived algorithms, exhibiting a speedup of more than 140× over a desktop PC implementation, and thus enabling real-time object recognition for robotic systems.
4.
Optical interconnections have attracted the attention of many engineers and scientists due to their potential for gigahertz transfer rates and concurrent access to the bus in a pipelined fashion. These unique characteristics of optical interconnections give us the opportunity to reconsider traditional algorithms designed for ideal parallel computing models, such as PRAMs. Since the PRAM model is far from practical, not all algorithms designed on this model can be implemented on a realistic parallel computing system. From this point of view, we study Cole’s pipelined merge sort [Cole R. Parallel merge sort. SIAM J Comput 1988;14:770–85] on the CREW PRAM and extend it in an innovative way to an optical interconnection model, the LARPBS (Linear Array with Reconfigurable Pipelined Bus System) model [Pan Y, Li K. Linear array with a reconfigurable pipelined bus system—concepts and applications. J Inform Sci 1998;106:237–58]. Although Cole’s algorithm is optimal, communication details were not provided, because the algorithm was designed for a PRAM. We close this gap in our sorting algorithm on the LARPBS model and obtain an O(log N)-time optimal sorting algorithm using O(N) processors. This is a substantial improvement over the previous best sorting algorithm on the LARPBS model, which runs in O(log N log log N) worst-case time using N processors [Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212–22]. Our solution allows processors to be assigned and reused efficiently. We also discover two new properties of Cole’s sorting algorithm, which are presented as lemmas in this paper.
5.
This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors. Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. Our parallel simplex algorithm assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage of this method is that our algorithm is generic and can be applied, without rewriting computer code, to any optimization problem to which the non-parallel Nelder–Mead algorithm is applicable. The method is also easily scalable to any degree of parallelization up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational savings that, in some experiments, reach up to three times the number of processors.
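As a rough illustration of the parameter-level idea described above (our own simplified sketch, not the authors' code, which also handles expansion and contraction steps): each worker evaluates the reflected trial point of a different simplex vertex concurrently. The `rosenbrock` objective and all parameter choices are assumptions for demonstration.

```python
# Sketch of parameter-level parallelism in a Nelder-Mead-style step: each
# of the n non-best vertices proposes a reflection through the centroid of
# the other vertices, all trial points are evaluated concurrently, and the
# simplex is rebuilt from the better of each vertex/reflection pair.
import numpy as np
from multiprocessing import Pool

def rosenbrock(x):
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1 - x[:-1]) ** 2))

def parallel_simplex_step(f, simplex, workers):
    simplex = sorted(simplex, key=f)               # best vertex first
    best = simplex[0]
    trials = []
    for v in simplex[1:]:                          # one trial point per processor
        centroid = np.mean([u for u in simplex if u is not v], axis=0)
        trials.append(centroid + (centroid - v))   # reflection of v
    scores = workers.map(f, trials)                # concurrent evaluations
    new = [best] + [t if s < f(v) else v           # keep whichever is better
                    for v, t, s in zip(simplex[1:], trials, scores)]
    return new

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    simplex = [rng.normal(size=4) for _ in range(5)]  # 5 vertices in R^4
    with Pool(4) as workers:
        for _ in range(200):
            simplex = parallel_simplex_step(rosenbrock, simplex, workers)
    print(min(map(rosenbrock, simplex)))
```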
6.
Jerry L. Trahan Mingxian Jin Wittaya Chantamas Johnnie W. Baker 《Journal of Parallel and Distributed Computing》2010
The MASC (Multiple ASsociative Computing) model is a multi-SIMD model that uses control parallelism to coordinate the interaction of data-parallel threads and supports associative SIMD computing on each of its threads. A wide range of algorithms has been developed for this model. Research on using this model in real-time system applications and on building a scalable MASC architecture is currently quite active. In this paper, we present simulations between MASC and reconfigurable bus-based models, e.g., various versions of the Reconfigurable Multiple Bus Machine (RMBM). Constant-time simulations of the basic RMBM by MASC and vice versa are obtained. Simulations of the segmenting RMBM, fusing RMBM, and extended RMBM by MASC in non-constant time are also discussed. By taking advantage of previously established relationships between RMBM and two other popular parallel computational models, namely the Reconfigurable Mesh (RM) and the Parallel Random Access Machine (PRAM), we extend our simulation results to further categorize the power of the MASC model in relation to RM and PRAM.
7.
Genetic algorithms are randomized search methods modeled on the evolutionary mechanisms of the natural world, and have been applied successfully to many large-scale combinatorial optimization problems. By parallelizing the genetic algorithm on today's widely available parallel computer systems, the speed bottleneck of the standard genetic algorithm can be overcome. In this paper, a coarse-grained parallel genetic algorithm is implemented in C++ in an MPI parallel environment. Drawing on the characteristics of the parallel genetic algorithm, we propose a strategy for optimizing logistics-distribution routes, give the corresponding algorithm, and validate it experimentally. The results show that, compared with the traditional genetic algorithm, the parallel genetic algorithm improves computation speed, reduces the average overhead time, and finds a better minimum total route length.
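A minimal sketch of the coarse-grained (island-model) scheme the paper describes, written here in Python with mpi4py rather than the authors' C++/MPI code; the toy one-max objective, operators, and parameters are all our assumptions.

```python
# Island-model parallel GA sketch: each MPI rank evolves its own
# subpopulation and periodically exchanges its best individual with every
# other rank. Run with e.g.: mpiexec -n 4 python island_ga.py
import random
from mpi4py import MPI

def fitness(bits):                       # toy objective: count of 1-bits
    return sum(bits)

def evolve(pop, generations=20, mut=0.01):
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: len(pop) // 2]   # truncation selection
        children = []
        while len(children) < len(pop) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]    # one-point crossover
            children.append([g ^ (random.random() < mut) for g in child])
        pop = parents + children
    return pop

comm = MPI.COMM_WORLD
random.seed(comm.Get_rank())             # different islands, different seeds
pop = [[random.randint(0, 1) for _ in range(64)] for _ in range(50)]
for epoch in range(10):
    pop = evolve(pop)
    migrants = comm.allgather(max(pop, key=fitness))  # best of every island
    pop.sort(key=fitness, reverse=True)
    pop[-len(migrants):] = migrants      # migrants replace the worst locals
best = max(comm.allgather(max(pop, key=fitness)), key=fitness)
if comm.Get_rank() == 0:
    print("best fitness:", fitness(best))
```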
8.
The HIRLAM (high resolution limited area modelling) limited-area atmospheric model was originally developed and optimized for shared memory vector-based computers, and has been used for operational weather forecasting on such machines for several years. This paper describes the algorithms applied to obtain a highly parallel implementation of the model, suitable for distributed memory machines. The performance results presented indicate that the parallelization effort has been successful, and the Norwegian Meteorological Institute will run the parallel version in production on a Cray T3E.
9.
This paper introduces the parallelization, on a distributed-memory multicomputer, of two iterative methods for finding all the roots of a given polynomial. The parallel algorithms share the computation of the roots among the processors and perform a total exchange of the data at each step. Since the amount of communication is the main drawback of this approach, we study the effect of the network topology on the performance of the algorithms. In particular, we show that among the classical processor network topologies (ring, 2-D torus, or n-cube), the hypercube topology minimizes the communications. For each topology, the optimal number of processors is computed. Experiments on the hypercube FPS T40 illustrate the results.
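For illustration (ours, not the paper's code), the dimension-ordered total exchange that makes the hypercube communication-efficient can be simulated serially: in step k, each node forwards to the neighbor whose id differs in bit k exactly those items whose destinations lie across that dimension.

```python
# Serial simulation of a total (all-to-all) exchange on a d-dimensional
# hypercube of p = 2^d nodes. After d steps each item reaches its destination.
def hypercube_total_exchange(p, outboxes):
    """outboxes[src] is a list of (dst, payload) pairs; p must be a power of two."""
    d = p.bit_length() - 1
    buf = [list(box) for box in outboxes]
    for k in range(d):
        nxt = [[] for _ in range(p)]
        for node in range(p):
            peer = node ^ (1 << k)  # neighbor across dimension k
            for dst, payload in buf[node]:
                # Cross this dimension iff bit k of dst disagrees with bit k of node.
                target = peer if ((dst ^ node) >> k) & 1 else node
                nxt[target].append((dst, payload))
        buf = nxt
    return buf

p = 8
boxes = [[(t, f"{s}->{t}") for t in range(p)] for s in range(p)]
delivered = hypercube_total_exchange(p, boxes)
assert all(dst == node for node in range(p) for dst, _ in delivered[node])
print(delivered[3])  # the 8 items addressed to node 3, one from each source
```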
10.
We present efficient parallel algorithms for using a pyramid computer to determine convexity properties of digitized black/white pictures and labeled figures. Algorithms are presented for deciding convexity, identifying extreme points of convex hulls, and using extreme points in a variety of fashions. For a pyramid computer with a base of n simple processing elements arranged in an n^{1/2} × n^{1/2} square, the running times of the algorithms range from Θ(log n) to find the extreme points of a convex figure in a digitized picture, to Θ(n^{1/6}) to find the diameter of a labeled figure, Θ(n^{1/4} log n) to find the extreme points of every figure in a digitized picture, and Θ(n^{1/2}) to find the extreme points of every labeled set of processing elements. Our results show that the pyramid computer can be used to obtain efficient solutions to nontrivial problems in image analysis. We also show the sensitivity of efficient pyramid-computer algorithms to the rate at which essential data can be compressed. Finally, we show that a wide variety of techniques are needed to make full and efficient use of the pyramid architecture. This research was partially supported by National Science Foundation Grants MCS-83-01019, DCR-8507851, DCR-8608640, IRI-8800514 and an Incentives for Excellence Award from the Digital Equipment Corporation.
11.
Whole-body vibration (WBV) occupational exposure levels were determined with 29 different types of machinery. Two distinct WBV exposure profiles were found and reproduced in a controlled exposure situation that involved the transformation of a vehicle. A sample of 45 volunteer subjects was recruited to assess the effects of WBV on their cognitive/motor performance. Two different tests were selected and applied: an Action Judgment Test, which was primarily designed to examine the relation between the distribution of attention and the resultant reaction to ever-changing conditions, and an Omega Test, designed to examine the precision and attention in handling mechanisms. The results show that the vibration exposure level affects the degree of impairment. The subjects presented a lower performance level when exposed to higher vibration levels, as the time required to correct their errors more than doubled. No significant differences were found when comparing the performance for gender or age variables. The obtained results can be used to improve the characteristics of work vehicles, in order to reduce the corresponding adverse effects and, consequently, improve the corresponding working conditions.
12.
Bitonic sort is one of the fastest oblivious parallel sorting algorithms known so far. Due to its high modularity, bitonic sort can be mapped to different interconnection networks. In this paper, the bitonic sort algorithm is mapped to the chained-cubic tree (CCT) interconnection network. It is shown that the computation time of the bitonic sort on a CCT (BSCCT) algorithm is O((n/p)·log(np)) and that the communication cost is O(p·log²p), assuming that n keys are evenly distributed among the p processors that comprise a given CCT network. Simulation is implemented and used to assess the performance of the BSCCT algorithm in terms of computation time, communication cost, message delay, and key comparisons. Simulation results showed that the BSCCT algorithm achieves a speedup that is almost 12-fold relative to bitonic sort on a single processor when 1024 processors were used to sort 32M keys.
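For reference, the compare-exchange network that BSCCT maps onto the CCT is the textbook bitonic sort, sketched serially below (a generic version, not the authors' CCT mapping).

```python
# Textbook bitonic sort, serial form of the compare-exchange network.
# len(a) must be a power of two. Stage `size` merges bitonic runs of that
# length; there are O(log^2 n) compare-exchange phases overall, which is
# where the log-squared communication term comes from when keys are
# blocked across p processors.
def bitonic_sort(a):
    n = len(a)
    size = 2
    while size <= n:                 # merge bitonic runs of this length
        stride = size // 2
        while stride >= 1:           # compare-exchange at this distance
            for i in range(n):
                j = i ^ stride
                if j > i:
                    ascending = (i & size) == 0
                    if (a[i] > a[j]) == ascending:
                        a[i], a[j] = a[j], a[i]
            stride //= 2
        size *= 2
    return a

print(bitonic_sort([7, 3, 0, 9, 4, 8, 5, 2]))  # [0, 2, 3, 4, 5, 7, 8, 9]
```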
13.
Hiroaki Kobayashi Satoshi Nishimura Hideyuki Kubota Tadao Nakamura Yoshiharu Shigei 《The Visual computer》1988,4(4):197-209
Static and dynamic load balancing strategies for a multiprocessor system running a ray tracing algorithm based on constant subdivision are presented. An object space is divided into regular cubes (subspaces), whose boundary planes are perpendicular to the coordinate axes, and these are allocated to the processors in the system. Here, load balancing among the processors is the most important problem. First, within the category of static load balancing, strategies for mapping the subspaces onto the processors are evaluated by simulation. We then propose a hierarchical multiprocessor system that realizes dynamic load balancing on top of the static scheme. Its architecture can overcome the limitations of static load balancing in a large-scale multiprocessor system.
14.
Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations in the past three decades. To ensure an acceptable level of statistical accuracy, relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular we focus on fast algorithms for the performance-bottleneck operation of particle-to-grid interpolation.
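The particle-to-grid deposition named as the bottleneck has, in its simplest 1-D linear (cloud-in-cell) form, the following shape; this is a generic NumPy sketch, not the authors' GPU kernel. On a GPU the scatter-add must be made atomic or reorganized, which is what their fast algorithms address.

```python
# 1-D linear (cloud-in-cell) particle-to-grid charge deposition on a
# periodic grid: each particle splits its charge between the two nearest
# grid points in proportion to its offset within the cell.
import numpy as np

def deposit_charge(x, q, nx, dx):
    """x: particle positions, q: charges, nx: grid points, dx: cell size."""
    rho = np.zeros(nx)
    cell = np.floor(x / dx).astype(int)     # index of the left grid point
    frac = x / dx - cell                    # normalized offset within the cell
    # np.add.at performs an unbuffered scatter-add (repeated indices accumulate).
    np.add.at(rho, cell % nx, q * (1.0 - frac))
    np.add.at(rho, (cell + 1) % nx, q * frac)
    return rho / dx

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 64.0, size=100_000)    # particles on a periodic domain
rho = deposit_charge(x, np.ones_like(x), nx=64, dx=1.0)
print(rho.sum())                            # total charge is conserved: 100000.0
```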
15.
16.
Abdel-Elah 《Performance Evaluation》2005,60(1-4):223-236
In this paper we investigate the performance of distributed heuristic search methods based on a well-known heuristic search algorithm, iterative deepening A* (IDA*). The contribution of this paper includes proposing and assessing a distributed algorithm for IDA*. The assessment is based on space, time, and solution quality, quantified in terms of several performance parameters such as generated search space and real execution time. The experiments are conducted on a cluster computer system consisting of 16 hosts built around a general-purpose network. The objective of this research is to investigate the feasibility of cluster computing as an alternative for hosting applications requiring intensive graph search. The results reveal that cluster computing improves on the performance of IDA* at a reasonable cost.
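The serial IDA* core that the paper distributes is, in its standard form, the following; this is a generic sketch on a toy graph, and the paper's distribution strategy itself is not reproduced here.

```python
# Standard serial IDA*: depth-first search bounded by f = g + h, with the
# bound raised to the smallest overshooting f-value on each iteration.
import math

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)],
         "C": [("D", 1)], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}        # admissible lower bounds to D

def ida_star(start, goal):
    def search(node, g, bound, path):
        f = g + h[node]
        if f > bound:
            return f, None                  # overshoot: candidate next bound
        if node == goal:
            return f, path
        nxt = math.inf
        for child, w in graph[node]:
            if child not in path:           # avoid cycles on the current path
                t, found = search(child, g + w, bound, path + [child])
                if found:
                    return t, found
                nxt = min(nxt, t)           # smallest f that exceeded the bound
        return nxt, None

    bound = h[start]
    while True:                             # deepen until the goal is reached
        bound, found = search(start, 0, bound, [start])
        if found:
            return found, bound
        if bound == math.inf:
            return None, bound

path, cost = ida_star("A", "D")
print(path, cost)  # ['A', 'B', 'C', 'D'] 4
```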
17.
The main objective of this paper is to describe a realistic framework for understanding the parallel performance of high-dimensional image processing algorithms in the context of heterogeneous networks of workstations (NOWs). As a case study, this paper explores techniques for mapping hyperspectral image analysis techniques onto fully heterogeneous NOWs. Hyperspectral imaging is a new technique in remote sensing that has gained tremendous popularity in many research areas, including satellite imaging and aerial reconnaissance. The automation of techniques able to transform massive amounts of hyperspectral data into scientific understanding in valid response times is critical for space-based Earth science and planetary exploration. Using an evaluation strategy based on comparing the efficiency achieved by a heterogeneous algorithm on a fully heterogeneous NOW with that evidenced by its homogeneous version on a homogeneous NOW with the same aggregate performance as the heterogeneous one, we develop a detailed analysis of parallel algorithms that integrate the spatial and spectral information in the image data through mathematical morphology concepts. For comparative purposes, performance data for the tested algorithms on Thunderhead (a large-scale Beowulf cluster at NASA’s Goddard Space Flight Center) are also provided. Our detailed investigation of the parallel properties of the proposed morphological algorithms provides several intriguing findings that may help image analysts in the selection of parallel techniques and strategies for specific applications.
Antonio Plaza
18.
Juan A. Villar Francisco J. Andújar José L. Sánchez Francisco J. Alfaro José A. Gámez José Duato 《Journal of Parallel and Distributed Computing》2013
High-radix switches reduce network cost and improve network performance, especially in large switch-based interconnection networks. However, the current integration scale poses problems for implementing such switches in a single chip. An interesting alternative for building high-radix switches consists of combining several current smaller single-chip switches to obtain switches with a greater number of ports. A key design issue for this kind of high-radix switch is the internal switch configuration, specifically, the correspondence between the ports of these high-radix switches and the ports of their smaller internal single-chip switches. In this paper we use artificial intelligence and data mining techniques to obtain the optimal internal configuration of all the switches in the network of large supercomputers running parallel applications. Simulation results show that, using the resultant switch configurations, it is possible to achieve performance similar to that of single-chip switches with the same radix, which would be unfeasible at the current integration scale.
19.
As the size of the Internet grows by orders of magnitude in terms of users, IP addresses, and routers, and as the links we use (be they wired, optical, or wireless) continuously evolve and provide varying reliability and quality of service, the IP-based network architecture that we know so well will have to evolve and change. Both scalability and QoS have become key issues. We are currently conducting a research project that revisits the IP routing architecture and proposes new designs for routers. As part of this effort, this paper discusses a packet network architecture called a cognitive packet network (CPN), in which intelligent capabilities for routing and flow control are moved towards the packets, rather than being concentrated in the nodes. In this paper we outline the design of the CPN architecture, and discuss the quality-of-service based routing algorithm that we have designed and implemented. We then present our test-bed and report on extensive measurement experiments that we have conducted.
20.
《Ergonomics》2012,55(8):780-799
Heat stress can be a significant problem for pilots wearing protective clothing during flights, because such clothing provides extra insulation that prevents evaporative heat loss. Heat stress can influence human cognitive activity, which might be critical in the flying situation, which requires efficient and error-free performance. This study investigated the effect of wearing protective clothing under various ambient conditions on physiological and cognitive performance. On several occasions, eight subjects were exposed for 3 h to three different environmental conditions: 0°C at 80% RH, 23°C at 63% RH, and 40°C at 19% RH. The subjects were equipped with thermistors and dressed as they normally are for flights (including helmet, two layers of underwear, and an uninsulated survival suit). During three separate exposures the subjects carried out two cognitive performance tests (Vigilance test and DG test). Performance was scored as correct, incorrect, missed reaction, and reaction time. Skin temperature, deep-body temperature, heart rate, oxygen consumption, temperature and humidity inside the clothing, sweat loss, subjective sensation of temperature, and thermal comfort were measured. Rises in rectal temperature, skin temperature, heart rate, and body water loss indicated a high level of heat stress in the 40°C condition in comparison with 0°C and 23°C. Performance on the DG test was unaffected by ambient temperature. However, the number of incorrect reactions in the Vigilance test was significantly higher at 40°C than at 23°C (p = 0.006) or 0°C (p = 0.03). The effect on Vigilance performance correlated with changes in deep-body temperature, in accordance with earlier studies demonstrating that cognitive performance is virtually unaffected unless environmental conditions are sufficient to change deep-body temperature.