Found 20 similar documents; search took 15 ms.
1.
Jae-Hyun Park 《The Journal of supercomputing》2011,55(3):432-447
High-radix multistage interconnection networks are popular interconnection technologies for parallel supercomputers and cluster computers. In this paper, we present a new dynamically fault-tolerant high-radix multistage interconnection network that uses fully-adaptive self-routing. To devise the fully-adaptive self-routing that recovers from misrouting around link faults in such networks, we introduce an abstract algebraic analysis of the topological structure of the high-radix Delta network. The presented interconnection network provides multiple paths by using all the links of all the stages of the network. We also present a mathematical analysis of the reliability of the interconnection network for quantitative comparison against other networks. The MTTF of the proposed 64×64 network is 2.2 times greater than that of the cyclic Banyan network, while the hardware cost of the proposed network is half that of the cyclic Banyan network and the 2D ring-Banyan network.
2.
This paper proposes a novel scheme, named ER-TCP, which transparently masks failures that occur on the server nodes of a cluster from clients, at TCP connection granularity. In this scheme, TCP connections at the server side are actively and fully replicated to maintain consistency, so that they can be migrated to healthy nodes during a failure. A log mechanism is designed to cooperate with the replication, keeping the communication performance penalty small and letting the scheme scale beyond a few nodes, even when the nodes have different processing capacities. We built a prototype system with ER-TCP on a four-node cluster and conducted a series of experiments on it. The experimental results show that ER-TCP imposes a relatively small penalty on communication performance, especially when it is used to synchronize multiple replicas. Results with real applications show that ER-TCP incurs a small performance sacrifice on a web server at light load, and that it can be used to distribute files very efficiently and reliably.
Hai Jin
3.
The hybrid dynamic parallel scheduling algorithm for load balancing on Chained-Cubic Tree interconnection networks (Cited by: 1; self-citations: 0; others: 1)
The Chained-Cubic Tree (CCT) interconnection network topology was recently proposed as a continuation of extended efforts to improve interconnection network performance. This topology, which promises to combine the best properties of the hypercube and tree topologies, needs to be investigated in depth in order to evaluate its performance among other interconnection network topologies. This work is a complementary effort in which load balancing is investigated as one of the most important aspects of performance improvement. This paper proposes a new load balancing algorithm for CCT interconnection networks. The proposed algorithm, called the Hybrid Dynamic Parallel Scheduling Algorithm (HD-PSA), is a combination of two common load balancing strategies: dynamic load balancing and parallel scheduling. The performance of the proposed algorithm is evaluated both analytically and experimentally in terms of various performance metrics, including execution time, load balancing accuracy, communication cost, number of task hops, and task locality.
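The abstract does not spell out HD-PSA's internal steps. As a point of reference for the metrics it lists, a classic dynamic load-balancing baseline — greedily handing each task to the currently least-loaded node — can be sketched as follows (the `greedy_balance` helper and its task costs are illustrative, not taken from the paper):

```python
# Illustrative dynamic load-balancing sketch (NOT the paper's HD-PSA):
# each arriving task goes to the node with the smallest current load.
import heapq

def greedy_balance(task_costs, num_nodes):
    """Assign each task to the least-loaded node; return assignment and loads."""
    heap = [(0.0, n) for n in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in enumerate(task_costs):
        load, node = heapq.heappop(heap)   # least-loaded node so far
        assignment[task] = node
        heapq.heappush(heap, (load + cost, node))
    loads = [0.0] * num_nodes
    for task, node in assignment.items():
        loads[node] += task_costs[task]
    return assignment, loads
```

Against such a baseline, the execution-time and load-balancing-accuracy metrics in the abstract can be compared directly.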
4.
Keqin Li 《The Journal of supercomputing》2012,60(2):223-247
In this paper, scheduling parallel tasks on multiprocessor computers with dynamically variable voltage and speed is addressed as a combinatorial optimization problem. Two problems are defined, namely, minimizing schedule length with an energy consumption constraint and minimizing energy consumption with a schedule length constraint. The first problem has applications in general multiprocessor and multicore computing systems where energy consumption is an important concern, and in mobile computers where energy conservation is a main concern. The second problem has applications in real-time multiprocessing systems and environments where a timing constraint is a major requirement. Our scheduling problems are defined such that the energy-delay product is optimized by fixing one factor and minimizing the other. Power-aware scheduling of parallel tasks has rarely been discussed before; our investigation makes an initial attempt at energy-efficient scheduling of parallel tasks on multiprocessor computers with dynamic voltage and speed. Our scheduling problems contain three nontrivial subproblems, namely, system partitioning, task scheduling, and power supplying. Each subproblem should be solved efficiently so that heuristic algorithms with good overall performance can be developed, and this decomposition into three subproblems makes the design and analysis of heuristic algorithms tractable. A unique feature of our work is to compare the performance of our algorithms against optimal solutions analytically and to validate the results experimentally, rather than comparing heuristic algorithms among themselves only experimentally. The harmonic system partitioning and processor allocation scheme is used, which divides a multiprocessor computer into clusters of equal size and schedules tasks of similar sizes together to increase processor utilization. A three-level energy/time/power allocation scheme is adopted for a given schedule, such that the schedule length is minimized for a given amount of energy, or the energy consumed is minimized without missing a given deadline. The performance of our heuristic algorithms is analyzed, and accurate performance bounds are derived. Simulation data which validate our analytical results are also presented. We find that our analytical results provide very accurate estimates of the expected normalized schedule length and the expected normalized energy consumption, and that our heuristic algorithms produce solutions very close to optimal.
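The energy/time trade-off behind both problems can be made concrete with the power model commonly assumed in this line of work (assumed here, not quoted from the paper): running a task of work w at speed s draws power s**alpha, takes time w/s, and therefore consumes energy w * s**(alpha - 1), so slowing down saves energy at the cost of schedule length:

```python
# Hedged sketch of the standard dynamic-power model (an assumption here):
# power = s**ALPHA, so energy for work w at speed s is w * s**(ALPHA - 1).
ALPHA = 3.0  # typical exponent for dynamic CMOS power

def exec_time(w, s):
    """Time to finish work w at speed s."""
    return w / s

def energy(w, s, alpha=ALPHA):
    """Energy = power * time = s**alpha * (w / s)."""
    return w * s ** (alpha - 1)

def min_energy_for_deadline(w, deadline, alpha=ALPHA):
    """For a single task, the slowest speed meeting the deadline
    minimizes energy, since energy grows with s."""
    s = w / deadline
    return energy(w, s, alpha)
```

For example, halving the speed doubles the execution time but (with alpha = 3) cuts the energy to one quarter, which is exactly the tension the two constrained problems above fix from opposite ends.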
5.
Rosa Filgueira Jesús Carretero David E. Singh Alejandro Calderón Alberto Núñez 《The Journal of supercomputing》2012,59(1):361-391
This work presents an optimization of MPI communications, called Dynamic-CoMPI, which uses two techniques to reduce the impact of communications and non-contiguous I/O requests in parallel applications. These techniques are independent of the application and complementary to each other. The first technique is an optimization of the Two-Phase collective I/O technique from ROMIO, called the Locality-aware strategy for Two-Phase I/O (LA-Two-Phase I/O). In order to increase the locality of file accesses, LA-Two-Phase I/O solves the Linear Assignment Problem (LAP) to find an optimal I/O data communication schedule. The main purpose of this technique is to reduce the number of communications involved in collective I/O operations. The second technique, called Adaptive-CoMPI, is based on run-time compression of the MPI messages exchanged by applications. Both techniques can be applied to any application, because both are transparent to the users. Dynamic-CoMPI has been validated using several MPI benchmarks and real HPC applications. The results show that, for many of the considered scenarios, important reductions in execution time are achieved by reducing the size and number of messages. Additional benefits of our approach are reductions in total communication time and network contention, enhancing not only performance but also scalability.
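The Linear Assignment Problem at the heart of LA-Two-Phase I/O can be illustrated in a few lines: given a cost matrix cost[i][j] (e.g. the communication volume if aggregator i is placed on process j — the matrix values below are made up), pick a one-to-one assignment of minimum total cost. Brute force over permutations suffices for a sketch; a real implementation would use the Hungarian algorithm:

```python
# Minimal LAP sketch (illustrative; the paper's schedule and costs differ).
from itertools import permutations

def solve_lap(cost):
    """Return (assignment, total_cost) minimizing sum of cost[i][perm[i]]."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_cost, best_perm = c, perm
    return best_perm, best_cost
```

Minimizing this total cost is what lets the I/O schedule keep data movement local.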
6.
Performance evaluation of the parallel processing producer–distributor–consumer network architecture
《Computer Standards & Interfaces》2013,36(6):596-604
The CSMA/CD access method is no longer invoked in switched, full-duplex Ethernet, but the industrial protocols still take the presence of the method into account. The parallel processing producer–distributor–consumer network architecture (ppPDC) was designed specifically to actively utilize the frame queuing. The network nodes process frames in parallel, which shortens the time needed to perform a cycle of communication, especially in cases when frame processing times within the nodes are not uniform. The experiments show that the achievable cycle times of the ppPDC architecture are an order of magnitude shorter than in the well-known sequential PDC protocol.
7.
Service robots will soon become an essential part of modern society. As they have to move and act in human environments, it is essential for them to be provided with a fast and reliable tracking system that localizes people in their neighborhood. It is therefore important to select the most appropriate filter to estimate the position of these persons. This paper presents three efficient implementations of multisensor human tracking based on different Bayesian estimators: the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF), and the Sampling Importance Resampling (SIR) particle filter. The system implemented on a mobile robot is explained, introducing the methods used to detect and estimate the position of multiple people. Then, the solutions based on the three filters are discussed in detail. Several real experiments are conducted to evaluate their performance, which is compared in terms of accuracy, robustness, and execution time of the estimation. The results show that a solution based on the UKF can perform as well as particle filters and is often a better choice when computational efficiency is a key issue.
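All three estimators share the same predict/update cycle; a scalar linear Kalman filter shows it in its simplest form (the paper's filters are multisensor and nonlinear — this one-dimensional version, with assumed noise variances, is only illustrative):

```python
# Minimal 1-D Kalman filter: one predict + update step for a static state.
# q, r are assumed process/measurement noise variances, not from the paper.
def kalman_step(x, p, z, q=1e-3, r=0.1):
    """x, p: state estimate and its variance; z: new measurement."""
    # Predict: the state model is static, so only uncertainty grows.
    p = p + q
    # Update: blend prediction and measurement via the Kalman gain.
    k = p / (p + r)
    x = x + k * (z - x)
    p = (1 - k) * p
    return x, p
```

Iterating over noisy measurements of a fixed position drives the estimate toward the true value while the variance shrinks; the EKF and UKF generalize exactly this loop to nonlinear motion and measurement models.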
8.
YUZHEN GE LAYNE T. WATSON EMMANUEL G. COLLINS JR 《International journal of systems science》2013,44(11):1069-1076
A distributed version of a homotopy algorithm for solving the H2/H∞ mixed-norm controller synthesis problem is presented. The main purpose of the study is to explore the possibility of achieving high performance at low cost. Existing UNIX workstations running PVM (Parallel Virtual Machine) are utilized. Only the Jacobian matrix computation is distributed, and therefore the modification to the original sequential code is minimal. The same algorithm has also been implemented on an Intel Paragon parallel machine. Our implementation shows that acceptable speed-up is achieved, and the larger the problem size, the higher the speed-up. Compared with the results from the Intel Paragon, the study concludes that utilizing existing UNIX workstations can be a very cost-effective approach to shortening computation time. Furthermore, this economical way to achieve high-performance computation can easily be realized and incorporated in a practical industrial design environment.
9.
The recent advance of multicore architectures and the deployment of multiprocessors as mainstream computing platforms have given rise to a new concurrent programming impetus. Software transactional memories (STM) are one of the most promising approaches to take up this challenge. The aim of an STM system is to relieve the application programmer of the management of synchronization when writing multiprocess programs: the programmer's task is to decompose the program into a set of sequential tasks that access shared objects, and to decompose each task into atomic units of computation. The management of the required synchronization is ensured by the associated STM system. This paper presents two existing STM systems, and a new one based on a time-window mechanism. The paper, which focuses mainly on STM principles, has an introductory and survey flavor.
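The core STM idea — a transaction records what it reads, buffers what it writes, and only publishes at commit time if nothing it read has changed — can be sketched in a toy form (this is a generic read-validation design, not any of the paper's three systems, and it ignores thread safety of the commit itself):

```python
# Toy STM sketch: versioned objects, commit-time read validation.
class STM:
    def __init__(self):
        self.value = {}    # committed shared objects
        self.version = {}  # per-object version counters

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, tx, key):
        tx["reads"][key] = self.version.get(key, 0)  # remember version seen
        return tx["writes"].get(key, self.value.get(key))

    def write(self, tx, key, val):
        tx["writes"][key] = val  # buffered until commit

    def commit(self, tx):
        # Abort if any object this transaction read was updated since.
        for key, ver in tx["reads"].items():
            if self.version.get(key, 0) != ver:
                return False
        for key, val in tx["writes"].items():
            self.value[key] = val
            self.version[key] = self.version.get(key, 0) + 1
        return True
```

A transaction that read a value later overwritten by a concurrent commit fails validation and must retry, which is how atomic units of computation stay consistent without programmer-managed locks.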
10.
Advances in computer technology, together with the rapid emergence of multicore processors, have made many-core personal computers available and more affordable. Networks of workstations and clusters of many-core SMPs have become an attractive option for high-performance computing, providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. Finding ways to extract unused and idle computing power from these resources, to improve overall performance, and to fully utilize the underlying new hardware platforms are major topics in this field of research. This paper introduces the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments; it not only generates a timing-graph representation of a parallel application, but also provides charts of the application's execution performance data. The goal in developing this toolkit is to give application developers a better understanding of the application's behavior among the computing nodes selected for a particular execution. Additionally, multiple execution results of a given application under development can be combined and overlapped, permitting application developers to perform "what-if" analysis, i.e., to understand more deeply the utilization of the allocated computational resources. Experiments using this toolkit have shown its effectiveness in the development and performance tuning of parallel applications, extending its use to the teaching of message-passing and shared-memory parallel programming courses.
Tien-Hsiung Weng
11.
Muzhou Xiong Michael Lees Wentong Cai Suiping Zhou Malcolm Yoke Hean Low 《The Visual computer》2010,26(5):367-383
This paper proposes a rule-based motion planning system for agent-based crowd simulation, consisting of sets of rules for both collision avoidance and collision response. In order to avoid an oncoming collision, a set of rules for velocity sampling and evaluation is proposed, which aims to choose a velocity with an expected time to collision larger than a predefined threshold. In order to improve efficiency over existing methods, the sampling procedure terminates upon finding an appropriate velocity. However, the proposed motion planning system does not guarantee collision-free movement; in case of collision, another set of rules directs the agent to make a corresponding response. The experimental results show that the proposed approach can be applied in different scenarios while keeping the simulation execution efficient.
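The velocity-sampling rule can be sketched under assumptions the abstract does not spell out: agents are discs, candidate velocities are tried in order, and the first one whose time-to-collision with a neighbour exceeds the threshold is accepted (early termination, as described above):

```python
# Illustrative sketch of threshold-based velocity sampling for two discs.
import math

def time_to_collision(p, v, q, u, radius):
    """Earliest time two discs (combined radius `radius`) at positions p, q
    with velocities v, u come into contact; inf if they never do."""
    rx, ry = q[0] - p[0], q[1] - p[1]
    wx, wy = u[0] - v[0], u[1] - v[1]          # relative velocity
    a = wx * wx + wy * wy
    b = 2 * (rx * wx + ry * wy)
    c = rx * rx + ry * ry - radius * radius
    if a == 0:                                 # no relative motion
        return 0.0 if c <= 0 else math.inf
    disc = b * b - 4 * a * c
    if disc < 0:
        return math.inf                        # paths never intersect
    t = (-b - math.sqrt(disc)) / (2 * a)
    return t if t >= 0 else math.inf

def pick_velocity(candidates, p, q, u, radius, threshold):
    """First sampled velocity whose time-to-collision exceeds the threshold."""
    for v in candidates:
        if time_to_collision(p, v, q, u, radius) > threshold:
            return v
    return None                                # collision response takes over
```

Returning `None` corresponds to the case the abstract mentions: no safe velocity is found, and the collision-response rules handle the outcome instead.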
12.
PC grid is a cost-effective grid-computing platform that attracts users by allocating to their massively parallel applications as many desktop computers as requested. A challenge, however, is how to distribute the necessary files to remote computing nodes that may not be connected to the same network file system, may have insufficient disk space to hold entire files, and may even be powered off asynchronously.

Targeting PC grid, the AgentTeamwork grid-computing middleware deploys a hierarchy of mobile agents to remote desktops to launch, monitor, checkpoint, and resume a parallel and distributed computing job. To achieve high-speed file distribution, AgentTeamwork takes advantage of its agent hierarchy. The system partitions files into stripes at the tree root if they are random-access files, duplicates them at each tree level if they are shared among all remote nodes, fragments them into smaller messages if they are too large to relay to a lower tree level, aggregates such messages into a larger fragment if they are in transit to the same subtree, and returns output files to the user along multiple paths established within the tree. To achieve fault-tolerant file delivery, each agent periodically takes a snapshot of in-transit and in-memory file messages together with its user job, and resumes from the latest snapshot after an accidental crash.

This paper presents an implementation of AgentTeamwork's file-distribution algorithm, covering file partitioning, transfer, checkpointing, and consistency maintenance, together with its competitive performance.
Jumpei Miyauchi
13.
Pseudorandom number generators are required for many computational tasks, such as stochastic modelling and simulation. This paper investigates serial and parallel implementations of a Linear Congruential Generator for Graphics Processing Units (GPUs) based on the binary representation of the normal number $\alpha _{2,3}$ . We adapted two methods of modular reduction which allowed us to perform most operations in 64-bit integer arithmetic, improving on the original implementation based on 106-bit double-double operations and resulting in a four-fold increase in efficiency. We found that our implementation is faster than existing methods in the literature, and our generation rate is close to the limiting rate imposed by the efficiency of writing to a GPU's global memory.
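The recurrence being parallelized is the plain LCG update x ← (a·x + c) mod m. A sketch with Knuth's MMIX constants (assumed here; these are not the normal-number-derived parameters of the paper's α₂,₃ generator) also shows the jump-ahead trick that lets independent streams be assigned to parallel threads, since affine maps compose:

```python
# Illustrative 64-bit LCG with O(log k) jump-ahead (constants are Knuth's
# MMIX parameters, NOT the paper's alpha_{2,3}-based generator).
MASK = (1 << 64) - 1                 # modulus 2**64 via masking
A = 6364136223846793005
C = 1442695040888963407

def lcg(seed):
    """Infinite stream of LCG outputs."""
    x = seed & MASK
    while True:
        x = (A * x + C) & MASK
        yield x

def jump(seed, k):
    """State after k steps, in O(log k): compose the affine map f(x)=Ax+C.
    f^(m+n) = f^m ∘ f^n, so square-and-multiply on (multiplier, increment)."""
    a, c = A, C        # current power-of-two map
    ak, ck = 1, 0      # accumulated map (identity)
    while k:
        if k & 1:
            ak, ck = (a * ak) & MASK, (a * ck + c) & MASK
        a, c = (a * a) & MASK, (a * c + c) & MASK
        k >>= 1
    return (ak * (seed & MASK) + ck) & MASK
```

On a GPU, each thread would `jump` to its own offset in the sequence and then iterate locally, which is one standard way to parallelize an LCG.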
14.
Deployment strategies for distributed complex event processing (Cited by: 1; self-citations: 0; others: 1)
Several complex event processing (CEP) middleware solutions have been proposed in the past. They act by processing primitive events generated by sources, extracting new knowledge in the form of composite events, and delivering them to interested sinks. Event-based applications often involve a large number of sources and sinks, possibly dispersed over a wide geographical area. To better support these scenarios, the CEP middleware can be internally built around several, distributed processors, which cooperate to provide the processing and routing service. This paper introduces and compares different deployment strategies for a CEP middleware, which define (i) how the processing load is distributed over different processors and (ii) how these processors interact to produce the required results and to deliver them to sinks. Our evaluation compares the presented solutions and shows their benefits with respect to a centralized deployment, both in terms of network traffic and in terms of forwarding delay.
15.
Chunye Gong Weimin Bao Guojian Tang Bo Yang Jie Liu 《The Journal of supercomputing》2014,68(3):1521-1537
The computational complexity of the Caputo fractional reaction–diffusion equation is \(O(MN^2)\), compared with \(O(MN)\) for the traditional reaction–diffusion equation, where \(M\) and \(N\) are the numbers of time steps and grid points. An efficient parallel solution for the Caputo fractional reaction–diffusion equation with an explicit difference method is proposed. The parallel solution, which is implemented with the MPI parallel programming model, consists of three procedures: preprocessing, the parallel solver, and postprocessing. The parallel solver involves parallel tridiagonal matrix–vector multiplication, vector–vector addition, and constant–vector multiplication. The sum of constant–vector multiplications is optimized. To the authors' knowledge, this is the first parallel solution for the Caputo fractional reaction–diffusion equation. The experimental results show that the parallel solution agrees well with the analytic solution. The parallel solution on a single Intel Xeon X5540 CPU runs more than three times faster than the serial solution on a single X5540 CPU core, and it scales quite well on a distributed-memory cluster system.
16.
Chowdhury Farhan Ahmed Syed Khairuzzaman Tanbeer Byeong-Soo Jeong Young-Koo Lee 《Applied Intelligence》2011,34(2):181-198
Traditional frequent pattern mining methods consider an equal profit/weight for all items and only binary occurrences (0/1) of the items in transactions. High utility pattern mining has become a very important research issue in data mining; it considers the non-binary frequency values of items in transactions and a different profit value for each item. However, most existing high utility pattern mining algorithms suffer from the level-wise candidate generation-and-test problem and generate too many candidate patterns. Moreover, they need several database scans, the number of which depends directly on the maximum candidate length. In this paper, we present a novel tree-based candidate pruning technique, called HUC-Prune (High Utility Candidates Prune), to solve these problems. Our technique uses a novel tree structure, called the HUC-tree (High Utility Candidates tree), to capture important utility information about the candidate patterns. HUC-Prune avoids the level-wise candidate generation process by adopting a pattern growth approach. In contrast to existing algorithms, its number of database scans is completely independent of the maximum candidate length. Extensive experimental results show that our algorithm is very efficient for high utility pattern mining and that it outperforms the existing algorithms.
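The problem statement — utility of an item in a transaction is quantity times unit profit, and an itemset is high-utility if its summed utility across supporting transactions meets a threshold — can be made concrete with a naive brute-force miner (HUC-Prune itself is tree-based and far more efficient; this sketch only pins down the definitions):

```python
# Naive high-utility itemset mining by exhaustive enumeration (illustrative).
from itertools import combinations

def high_utility_itemsets(transactions, profit, min_util):
    """transactions: list of {item: quantity}; profit: {item: unit profit}.
    Returns {itemset: utility} for itemsets with utility >= min_util."""
    items = sorted(profit)
    result = {}
    for r in range(1, len(items) + 1):
        for itemset in combinations(items, r):
            util = 0
            for t in transactions:
                if all(i in t for i in itemset):        # transaction supports it
                    util += sum(t[i] * profit[i] for i in itemset)
            if util >= min_util:
                result[itemset] = util
    return result
```

The exponential enumeration here is exactly the cost that candidate-pruning techniques such as HUC-Prune are designed to avoid.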
17.
《Computers & chemistry》1996,20(4):439-448
This paper describes a parallel cross-validation (PCV) procedure for testing the predictive ability of multi-layer feed-forward (MLF) neural network models trained by the generalized delta learning rule. The PCV program has been parallelized to operate in a local-area computer network. Development and execution of the parallel application were aided by the HYDRA programming environment, which is extensively described in Part I of this paper. A brief theoretical introduction to MLF networks is given, and the problems associated with the validation of predictive abilities are discussed. Furthermore, this paper comprises a general outline of the PCV program. Finally, the parallel PCV application is used to validate the predictive ability of an MLF network modeling a chemical non-linear function approximation problem which is described extensively in the literature.
18.
Pedro Furtado 《Distributed and Parallel Databases》2009,25(1-2):71-96
Consider data warehouses as large data repositories queried for analysis and data mining in a variety of application contexts. A query over such data may take a long time to process on a regular PC. Consider partitioning the data across a set of PCs (nodes), with either a parallel database server or any database server at each node, plus an engine-independent middleware. The nodes and network may not even be fully dedicated to the data warehouse. In such a scenario, care must be taken to handle processing heterogeneity and availability, so we study and propose efficient solutions for this. We concentrate on three main contributions: a performance-wise index, measuring relative performance; a replication degree; and a flexible chunk-wise organization with on-demand processing. These contributions extend previous work on declustering and replication, and they are generic in the sense that they can be applied in very different contexts and with different data partitioning approaches. We evaluate their merits with a prototype implementation of the system.
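The interplay between a per-node performance index and on-demand chunk processing can be sketched as follows (the scheduling details are assumed for illustration, not taken from the paper): chunks are handed out as nodes become free, so a node with twice the performance index naturally ends up processing about twice as many chunks:

```python
# Illustrative on-demand chunk scheduler driven by a performance index.
# Deterministic simulation: chunk cost on node n is 1 / perf_index[n].
import heapq

def on_demand_schedule(num_chunks, perf_index):
    """Return how many chunks each node processes under on-demand dispatch."""
    heap = [(0.0, n) for n in range(len(perf_index))]  # (time node is free, id)
    heapq.heapify(heap)
    counts = [0] * len(perf_index)
    for _ in range(num_chunks):
        t, n = heapq.heappop(heap)           # next node to become free
        counts[n] += 1
        heapq.heappush(heap, (t + 1.0 / perf_index[n], n))
    return counts
```

This is the sense in which chunk-wise organization absorbs heterogeneity: no static partition is needed, because faster or less-loaded nodes simply pull more work.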
19.
We examine the class of multi-linear representations (MLR) for expressing probability distributions over discrete variables. MLR have recently been considered as intermediate representations that facilitate inference in distributions represented as graphical models. We show that MLR is an expressive representation of discrete distributions that can concisely represent classes of distributions which have exponential size in other commonly used representations, while supporting probabilistic inference in time linear in the size of the representation. Our key contribution is a set of techniques for learning bounded-size distributions represented using MLR which support efficient probabilistic inference. We demonstrate experimentally that the MLR representations we learn support accurate and very efficient inference.
20.
In the framework of heavy mid-level processing for high-speed imaging, a nonlinear two-dimensional network is proposed that allows the implementation of active curve algorithms. This efficient type of algorithm is usually prohibitive for real-time image processing because of its computational load and a structure ill-suited to serial or parallel architectures. A different implementation philosophy is proposed here, in which the active curve is generated by a propagation phenomenon inspired by biological modeling. A programmable nonlinear reaction–diffusion system is proposed under front control and technological constraints. Geometric multiscale processing is presented, and this opens a discussion about electronic implementation.