Found 20 similar documents; search took 15 ms.
1.
Jae-Hyun Park 《The Journal of supercomputing》2011,55(3):432-447
High-radix multistage interconnection networks are popular interconnection technologies for parallel supercomputers and cluster computers. In this paper, we present a new dynamically fault-tolerant high-radix multistage interconnection network that uses fully-adaptive self-routing. To devise the fully-adaptive self-routing that recovers from misrouting around link faults in such networks, we introduce an abstract algebraic analysis of the topological structure of the high-radix Delta network. The presented interconnection network provides multiple paths by using all the links of all the stages of the network. We also present a mathematical analysis of the reliability of the interconnection network for quantitative comparison against other networks. The MTTF of the proposed 64×64 network is 2.2 times greater than that of the cyclic Banyan network, while the hardware cost of the proposed network is half that of the cyclic Banyan network and the 2D ring-Banyan network.
2.
This paper proposes a novel scheme, named ER-TCP, which transparently masks failures that occur on the server nodes of a cluster from clients, at TCP connection granularity. In this scheme, TCP connections at the server side are actively and fully replicated to maintain consistency, so that they can be migrated to healthy nodes during a failure. A log mechanism is designed to cooperate with the replication, keeping the communication performance penalty small and letting the scheme scale beyond a few nodes, even when the nodes have different processing capacities. We built a prototype system with ER-TCP on a four-node cluster and conducted a series of experiments on it. The experimental results show that ER-TCP imposes a relatively small penalty on communication performance, especially when it is used to synchronize multiple replicas. Results with real applications show that ER-TCP incurs a small performance sacrifice on a web server at light load, and that it can be used to distribute files very efficiently and reliably.
Hai Jin
3.
The hybrid dynamic parallel scheduling algorithm for load balancing on Chained-Cubic Tree interconnection networks (Cited by: 1; self-citations: 0; others: 1)
The Chained-Cubic Tree (CCT) interconnection network topology was recently proposed as a continuation of extended efforts to improve interconnection network performance. This topology, which promises to combine the best properties of the hypercube and tree topologies, needs to be investigated in depth in order to evaluate its performance among other interconnection network topologies. This work is a complementary effort in which load balancing is investigated as one of the most important aspects of performance improvement. This paper proposes a new load balancing algorithm for CCT interconnection networks. The proposed algorithm, called the Hybrid Dynamic Parallel Scheduling Algorithm (HD-PSA), is a combination of two common load balancing strategies: dynamic load balancing and parallel scheduling. The performance of the proposed algorithm is evaluated both analytically and experimentally in terms of various performance metrics, including execution time, load balancing accuracy, communication cost, number of task hops, and task locality.
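The abstract does not spell out HD-PSA's internal steps. As a point of reference for the metrics it lists, a classic dynamic load-balancing baseline — greedily handing each task to the currently least-loaded node — can be sketched as follows (the `greedy_balance` helper and its task costs are illustrative, not taken from the paper):

```python
# Illustrative dynamic load-balancing sketch (NOT the paper's HD-PSA):
# each arriving task goes to the node with the smallest current load.
import heapq

def greedy_balance(task_costs, num_nodes):
    """Assign each task to the least-loaded node; return assignment and loads."""
    heap = [(0.0, n) for n in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in enumerate(task_costs):
        load, node = heapq.heappop(heap)   # least-loaded node so far
        assignment[task] = node
        heapq.heappush(heap, (load + cost, node))
    loads = [0.0] * num_nodes
    for task, node in assignment.items():
        loads[node] += task_costs[task]
    return assignment, loads
```

Against such a baseline, the execution-time and load-balancing-accuracy metrics in the abstract can be compared directly.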
4.
Keqin Li 《The Journal of supercomputing》2012,60(2):223-247
In this paper, scheduling parallel tasks on multiprocessor computers with dynamically variable voltage and speed is addressed as a combinatorial optimization problem. Two problems are defined, namely, minimizing schedule length with an energy consumption constraint and minimizing energy consumption with a schedule length constraint. The first problem has applications in general multiprocessor and multicore computing systems where energy consumption is an important concern, and in mobile computers where energy conservation is a main concern. The second problem has applications in real-time multiprocessing systems and environments where a timing constraint is a major requirement. Our scheduling problems are defined such that the energy-delay product is optimized by fixing one factor and minimizing the other. Power-aware scheduling of parallel tasks has rarely been discussed before; our investigation makes an initial attempt at energy-efficient scheduling of parallel tasks on multiprocessor computers with dynamic voltage and speed. Our scheduling problems contain three nontrivial subproblems, namely, system partitioning, task scheduling, and power supplying. Each subproblem should be solved efficiently so that heuristic algorithms with good overall performance can be developed, and this decomposition into three subproblems makes the design and analysis of heuristic algorithms tractable. A unique feature of our work is to compare the performance of our algorithms against optimal solutions analytically and to validate the results experimentally, rather than comparing heuristic algorithms among themselves only experimentally. The harmonic system partitioning and processor allocation scheme is used, which divides a multiprocessor computer into clusters of equal size and schedules tasks of similar sizes together to increase processor utilization. A three-level energy/time/power allocation scheme is adopted for a given schedule, such that the schedule length is minimized for a given amount of energy, or the energy consumed is minimized without missing a given deadline. The performance of our heuristic algorithms is analyzed, and accurate performance bounds are derived. Simulation data which validate our analytical results are also presented. We find that our analytical results provide very accurate estimates of the expected normalized schedule length and the expected normalized energy consumption, and that our heuristic algorithms produce solutions very close to optimal.
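The energy/time trade-off behind both problems can be made concrete with the power model commonly assumed in this line of work (assumed here, not quoted from the paper): running a task of work w at speed s draws power s**alpha, takes time w/s, and therefore consumes energy w * s**(alpha - 1), so slowing down saves energy at the cost of schedule length:

```python
# Hedged sketch of the standard dynamic-power model (an assumption here):
# power = s**ALPHA, so energy for work w at speed s is w * s**(ALPHA - 1).
ALPHA = 3.0  # typical exponent for dynamic CMOS power

def exec_time(w, s):
    """Time to finish work w at speed s."""
    return w / s

def energy(w, s, alpha=ALPHA):
    """Energy = power * time = s**alpha * (w / s)."""
    return w * s ** (alpha - 1)

def min_energy_for_deadline(w, deadline, alpha=ALPHA):
    """For a single task, the slowest speed meeting the deadline
    minimizes energy, since energy grows with s."""
    s = w / deadline
    return energy(w, s, alpha)
```

For example, halving the speed doubles the execution time but (with alpha = 3) cuts the energy to one quarter, which is exactly the tension the two constrained problems above fix from opposite ends.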
5.
Rosa Filgueira Jesús Carretero David E. Singh Alejandro Calderón Alberto Núñez 《The Journal of supercomputing》2012,59(1):361-391
This work presents an optimization of MPI communications, called Dynamic-CoMPI, which uses two techniques to reduce the impact of communications and non-contiguous I/O requests in parallel applications. These techniques are independent of the application and complementary to each other. The first technique is an optimization of the Two-Phase collective I/O technique from ROMIO, called the Locality-aware strategy for Two-Phase I/O (LA-Two-Phase I/O). In order to increase the locality of file accesses, LA-Two-Phase I/O solves the Linear Assignment Problem (LAP) to find an optimal I/O data communication schedule. The main purpose of this technique is to reduce the number of communications involved in collective I/O operations. The second technique, called Adaptive-CoMPI, is based on run-time compression of the MPI messages exchanged by applications. Both techniques can be applied to any application, because both are transparent to the users. Dynamic-CoMPI has been validated using several MPI benchmarks and real HPC applications. The results show that, for many of the considered scenarios, important reductions in execution time are achieved by reducing the size and number of messages. Additional benefits of our approach are reductions in total communication time and network contention, enhancing not only performance but also scalability.
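The Linear Assignment Problem at the heart of LA-Two-Phase I/O can be illustrated in a few lines: given a cost matrix cost[i][j] (e.g. the communication volume if aggregator i is placed on process j — the matrix values below are made up), pick a one-to-one assignment of minimum total cost. Brute force over permutations suffices for a sketch; a real implementation would use the Hungarian algorithm:

```python
# Minimal LAP sketch (illustrative; the paper's schedule and costs differ).
from itertools import permutations

def solve_lap(cost):
    """Return (assignment, total_cost) minimizing sum of cost[i][perm[i]]."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_cost, best_perm = c, perm
    return best_perm, best_cost
```

Minimizing this total cost is what lets the I/O schedule keep data movement local.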
6.
Performance evaluation of the parallel processing producer–distributor–consumer network architecture
《Computer Standards & Interfaces》2013,36(6):596-604
The CSMA/CD access method is no longer invoked in switched, full-duplex Ethernet, but the industrial protocols still take the presence of the method into account. The parallel processing producer–distributor–consumer network architecture (ppPDC) was designed specifically to actively utilize the frame queuing. The network nodes process frames in parallel, which shortens the time needed to perform a cycle of communication, especially in cases when frame processing times within the nodes are not uniform. The experiments show that the achievable cycle times of the ppPDC architecture are an order of magnitude shorter than in the well-known sequential PDC protocol.
7.
Service robots will soon become an essential part of modern society. As they have to move and act in human environments, it is essential for them to be provided with a fast and reliable tracking system that localizes people in their neighborhood. It is therefore important to select the most appropriate filter to estimate the position of these persons. This paper presents three efficient implementations of multisensor human tracking based on different Bayesian estimators: the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF), and the Sampling Importance Resampling (SIR) particle filter. The system implemented on a mobile robot is explained, introducing the methods used to detect and estimate the position of multiple people. Then, the solutions based on the three filters are discussed in detail. Several real experiments are conducted to evaluate their performance, which is compared in terms of accuracy, robustness, and execution time of the estimation. The results show that a solution based on the UKF can perform as well as particle filters and is often a better choice when computational efficiency is a key issue.
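All three estimators share the same predict/update cycle; a scalar linear Kalman filter shows it in its simplest form (the paper's filters are multisensor and nonlinear — this one-dimensional version, with assumed noise variances, is only illustrative):

```python
# Minimal 1-D Kalman filter: one predict + update step for a static state.
# q, r are assumed process/measurement noise variances, not from the paper.
def kalman_step(x, p, z, q=1e-3, r=0.1):
    """x, p: state estimate and its variance; z: new measurement."""
    # Predict: the state model is static, so only uncertainty grows.
    p = p + q
    # Update: blend prediction and measurement via the Kalman gain.
    k = p / (p + r)
    x = x + k * (z - x)
    p = (1 - k) * p
    return x, p
```

Iterating over noisy measurements of a fixed position drives the estimate toward the true value while the variance shrinks; the EKF and UKF generalize exactly this loop to nonlinear motion and measurement models.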
8.
YUZHEN GE LAYNE T. WATSON EMMANUEL G. COLLINS JR 《International journal of systems science》2013,44(11):1069-1076
A distributed version of a homotopy algorithm for solving the H2/H∞ mixed-norm controller synthesis problem is presented. The main purpose of the study is to explore the possibility of achieving high performance at low cost. Existing UNIX workstations running PVM (Parallel Virtual Machine) are utilized. Only the Jacobian matrix computation is distributed, and therefore the modification to the original sequential code is minimal. The same algorithm has also been implemented on an Intel Paragon parallel machine. Our implementation shows that acceptable speed-up is achieved, and the larger the problem size, the higher the speed-up. Compared with the results from the Intel Paragon, the study concludes that utilizing existing UNIX workstations can be a very cost-effective approach to shortening computation time. Furthermore, this economical way to achieve high-performance computation can easily be realized and incorporated in a practical industrial design environment.
9.
The recent advance of multicore architectures and the deployment of multiprocessors as mainstream computing platforms have given rise to a new concurrent programming impetus. Software transactional memories (STM) are one of the most promising approaches to take up this challenge. The aim of an STM system is to relieve the application programmer of the management of synchronization when writing multiprocess programs: the programmer's task is to decompose the program into a set of sequential tasks that access shared objects, and to decompose each task into atomic units of computation. The management of the required synchronization is ensured by the associated STM system. This paper presents two existing STM systems, and a new one based on a time-window mechanism. The paper, which focuses mainly on STM principles, has an introductory and survey flavor.
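The core STM idea — a transaction records what it reads, buffers what it writes, and only publishes at commit time if nothing it read has changed — can be sketched in a toy form (this is a generic read-validation design, not any of the paper's three systems, and it ignores thread safety of the commit itself):

```python
# Toy STM sketch: versioned objects, commit-time read validation.
class STM:
    def __init__(self):
        self.value = {}    # committed shared objects
        self.version = {}  # per-object version counters

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, tx, key):
        tx["reads"][key] = self.version.get(key, 0)  # remember version seen
        return tx["writes"].get(key, self.value.get(key))

    def write(self, tx, key, val):
        tx["writes"][key] = val  # buffered until commit

    def commit(self, tx):
        # Abort if any object this transaction read was updated since.
        for key, ver in tx["reads"].items():
            if self.version.get(key, 0) != ver:
                return False
        for key, val in tx["writes"].items():
            self.value[key] = val
            self.version[key] = self.version.get(key, 0) + 1
        return True
```

A transaction that read a value later overwritten by a concurrent commit fails validation and must retry, which is how atomic units of computation stay consistent without programmer-managed locks.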
10.
Advances in computer technology, together with the rapid emergence of multicore processors, have made many-core personal computers available and more affordable. Networks of workstations and clusters of many-core SMPs have become an attractive option for high-performance computing, providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. Finding ways to extract unused and idle computing power from these resources, to improve overall performance, and to fully utilize the underlying new hardware platforms are major topics in this field of research. This paper introduces the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments; it not only generates a timing-graph representation of a parallel application, but also provides charts of the application's execution performance data. The goal in developing this toolkit is to give application developers a better understanding of the application's behavior among the computing nodes selected for a particular execution. Additionally, multiple execution results of a given application under development can be combined and overlapped, permitting application developers to perform "what-if" analysis, i.e., to understand more deeply the utilization of the allocated computational resources. Experiments using this toolkit have shown its effectiveness in the development and performance tuning of parallel applications, extending its use to the teaching of message-passing and shared-memory parallel programming courses.
Tien-Hsiung Weng
11.
Muzhou Xiong Michael Lees Wentong Cai Suiping Zhou Malcolm Yoke Hean Low 《The Visual computer》2010,26(5):367-383
This paper proposes a rule-based motion planning system for agent-based crowd simulation, consisting of sets of rules for both collision avoidance and collision response. In order to avoid an oncoming collision, a set of rules for velocity sampling and evaluation is proposed, which aims to choose a velocity with an expected time to collision larger than a predefined threshold. In order to improve efficiency over existing methods, the sampling procedure terminates upon finding an appropriate velocity. However, the proposed motion planning system does not guarantee collision-free movement; in case of collision, another set of rules directs the agent to make a corresponding response. The experimental results show that the proposed approach can be applied in different scenarios while keeping the simulation execution efficient.
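The velocity-sampling rule can be sketched under assumptions the abstract does not spell out: agents are discs, candidate velocities are tried in order, and the first one whose time-to-collision with a neighbour exceeds the threshold is accepted (early termination, as described above):

```python
# Illustrative sketch of threshold-based velocity sampling for two discs.
import math

def time_to_collision(p, v, q, u, radius):
    """Earliest time two discs (combined radius `radius`) at positions p, q
    with velocities v, u come into contact; inf if they never do."""
    rx, ry = q[0] - p[0], q[1] - p[1]
    wx, wy = u[0] - v[0], u[1] - v[1]          # relative velocity
    a = wx * wx + wy * wy
    b = 2 * (rx * wx + ry * wy)
    c = rx * rx + ry * ry - radius * radius
    if a == 0:                                 # no relative motion
        return 0.0 if c <= 0 else math.inf
    disc = b * b - 4 * a * c
    if disc < 0:
        return math.inf                        # paths never intersect
    t = (-b - math.sqrt(disc)) / (2 * a)
    return t if t >= 0 else math.inf

def pick_velocity(candidates, p, q, u, radius, threshold):
    """First sampled velocity whose time-to-collision exceeds the threshold."""
    for v in candidates:
        if time_to_collision(p, v, q, u, radius) > threshold:
            return v
    return None                                # collision response takes over
```

Returning `None` corresponds to the case the abstract mentions: no safe velocity is found, and the collision-response rules handle the outcome instead.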
12.
PC grid is a cost-effective grid-computing platform that attracts users by allocating to their massively parallel applications as many desktop computers as requested. A challenge, however, is how to distribute the necessary files to remote computing nodes that may not be connected to the same network file system, may have insufficient disk space to hold entire files, and may even be powered off asynchronously.

Targeting PC grid, the AgentTeamwork grid-computing middleware deploys a hierarchy of mobile agents to remote desktops to launch, monitor, checkpoint, and resume a parallel and distributed computing job. To achieve high-speed file distribution, AgentTeamwork takes advantage of its agent hierarchy. The system partitions files into stripes at the tree root if they are random-access files, duplicates them at each tree level if they are shared among all remote nodes, fragments them into smaller messages if they are too large to relay to a lower tree level, aggregates such messages into a larger fragment if they are in transit to the same subtree, and returns output files to the user along multiple paths established within the tree. To achieve fault-tolerant file delivery, each agent periodically takes a snapshot of in-transit and in-memory file messages together with its user job, and resumes from the latest snapshot after an accidental crash.

This paper presents an implementation of AgentTeamwork's file-distribution algorithm, covering file partitioning, transfer, checkpointing, and consistency maintenance, together with its competitive performance.
Jumpei Miyauchi
13.
Pseudorandom number generators are required for many computational tasks, such as stochastic modelling and simulation. This paper investigates serial and parallel implementations of a Linear Congruential Generator for Graphics Processing Units (GPUs) based on the binary representation of the normal number $\alpha _{2,3}$ . We adapted two methods of modular reduction which allowed us to perform most operations in 64-bit integer arithmetic, improving on the original implementation based on 106-bit double-double operations and resulting in a four-fold increase in efficiency. We found that our implementation is faster than existing methods in the literature, and our generation rate is close to the limiting rate imposed by the efficiency of writing to a GPU's global memory.
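The recurrence being parallelized is the plain LCG update x ← (a·x + c) mod m. A sketch with Knuth's MMIX constants (assumed here; these are not the normal-number-derived parameters of the paper's α₂,₃ generator) also shows the jump-ahead trick that lets independent streams be assigned to parallel threads, since affine maps compose:

```python
# Illustrative 64-bit LCG with O(log k) jump-ahead (constants are Knuth's
# MMIX parameters, NOT the paper's alpha_{2,3}-based generator).
MASK = (1 << 64) - 1                 # modulus 2**64 via masking
A = 6364136223846793005
C = 1442695040888963407

def lcg(seed):
    """Infinite stream of LCG outputs."""
    x = seed & MASK
    while True:
        x = (A * x + C) & MASK
        yield x

def jump(seed, k):
    """State after k steps, in O(log k): compose the affine map f(x)=Ax+C.
    f^(m+n) = f^m ∘ f^n, so square-and-multiply on (multiplier, increment)."""
    a, c = A, C        # current power-of-two map
    ak, ck = 1, 0      # accumulated map (identity)
    while k:
        if k & 1:
            ak, ck = (a * ak) & MASK, (a * ck + c) & MASK
        a, c = (a * a) & MASK, (a * c + c) & MASK
        k >>= 1
    return (ak * (seed & MASK) + ck) & MASK
```

On a GPU, each thread would `jump` to its own offset in the sequence and then iterate locally, which is one standard way to parallelize an LCG.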
14.
Deployment strategies for distributed complex event processing (Cited by: 1; self-citations: 0; others: 1)
Several complex event processing (CEP) middleware solutions have been proposed in the past. They act by processing primitive events generated by sources, extracting new knowledge in the form of composite events, and delivering them to interested sinks. Event-based applications often involve a large number of sources and sinks, possibly dispersed over a wide geographical area. To better support these scenarios, the CEP middleware can be internally built around several, distributed processors, which cooperate to provide the processing and routing service. This paper introduces and compares different deployment strategies for a CEP middleware, which define (i) how the processing load is distributed over different processors and (ii) how these processors interact to produce the required results and to deliver them to sinks. Our evaluation compares the presented solutions and shows their benefits with respect to a centralized deployment, both in terms of network traffic and in terms of forwarding delay.
15.
Chunye Gong Weimin Bao Guojian Tang Bo Yang Jie Liu 《The Journal of supercomputing》2014,68(3):1521-1537
The computational complexity of the Caputo fractional reaction–diffusion equation is \(O(MN^2)\), compared with \(O(MN)\) for the traditional reaction–diffusion equation, where \(M\) and \(N\) are the numbers of time steps and grid points. An efficient parallel solution for the Caputo fractional reaction–diffusion equation with an explicit difference method is proposed. The parallel solution, which is implemented with the MPI parallel programming model, consists of three procedures: preprocessing, the parallel solver, and postprocessing. The parallel solver involves parallel tridiagonal matrix–vector multiplication, vector–vector addition, and constant–vector multiplication. The sum of constant–vector multiplications is optimized. To the authors' knowledge, this is the first parallel solution for the Caputo fractional reaction–diffusion equation. The experimental results show that the parallel solution agrees well with the analytic solution. The parallel solution on a single Intel Xeon X5540 CPU runs more than three times faster than the serial solution on a single X5540 CPU core, and it scales quite well on a distributed-memory cluster system.
16.
Chowdhury Farhan Ahmed Syed Khairuzzaman Tanbeer Byeong-Soo Jeong Young-Koo Lee 《Applied Intelligence》2011,34(2):181-198
Traditional frequent pattern mining methods consider an equal profit/weight for all items and only binary occurrences (0/1) of the items in transactions. High utility pattern mining has become a very important research issue in data mining; it considers the non-binary frequency values of items in transactions and a different profit value for each item. However, most existing high utility pattern mining algorithms suffer from the level-wise candidate generation-and-test problem and generate too many candidate patterns. Moreover, they need several database scans, the number of which depends directly on the maximum candidate length. In this paper, we present a novel tree-based candidate pruning technique, called HUC-Prune (High Utility Candidates Prune), to solve these problems. Our technique uses a novel tree structure, called the HUC-tree (High Utility Candidates tree), to capture important utility information about the candidate patterns. HUC-Prune avoids the level-wise candidate generation process by adopting a pattern growth approach. In contrast to existing algorithms, its number of database scans is completely independent of the maximum candidate length. Extensive experimental results show that our algorithm is very efficient for high utility pattern mining and that it outperforms the existing algorithms.
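The problem statement — utility of an item in a transaction is quantity times unit profit, and an itemset is high-utility if its summed utility across supporting transactions meets a threshold — can be made concrete with a naive brute-force miner (HUC-Prune itself is tree-based and far more efficient; this sketch only pins down the definitions):

```python
# Naive high-utility itemset mining by exhaustive enumeration (illustrative).
from itertools import combinations

def high_utility_itemsets(transactions, profit, min_util):
    """transactions: list of {item: quantity}; profit: {item: unit profit}.
    Returns {itemset: utility} for itemsets with utility >= min_util."""
    items = sorted(profit)
    result = {}
    for r in range(1, len(items) + 1):
        for itemset in combinations(items, r):
            util = 0
            for t in transactions:
                if all(i in t for i in itemset):        # transaction supports it
                    util += sum(t[i] * profit[i] for i in itemset)
            if util >= min_util:
                result[itemset] = util
    return result
```

The exponential enumeration here is exactly the cost that candidate-pruning techniques such as HUC-Prune are designed to avoid.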
17.
《Computers & chemistry》1996,20(4):439-448
This paper describes a parallel cross-validation (PCV) procedure for testing the predictive ability of multi-layer feed-forward (MLF) neural network models trained by the generalized delta learning rule. The PCV program has been parallelized to operate in a local-area computer network. Development and execution of the parallel application were aided by the HYDRA programming environment, which is extensively described in Part I of this paper. A brief theoretical introduction to MLF networks is given, and the problems associated with the validation of predictive abilities are discussed. Furthermore, this paper comprises a general outline of the PCV program. Finally, the parallel PCV application is used to validate the predictive ability of an MLF network modeling a chemical non-linear function approximation problem which is described extensively in the literature.
18.
Pedro Furtado 《Distributed and Parallel Databases》2009,25(1-2):71-96
Consider data warehouses as large data repositories queried for analysis and data mining in a variety of application contexts. A query over such data may take a long time to process on a regular PC. Consider partitioning the data across a set of PCs (nodes), with either a parallel database server or any database server at each node, plus an engine-independent middleware. The nodes and network may not even be fully dedicated to the data warehouse. In such a scenario, care must be taken to handle processing heterogeneity and availability, so we study and propose efficient solutions for this. We concentrate on three main contributions: a performance-wise index, measuring relative performance; a replication degree; and a flexible chunk-wise organization with on-demand processing. These contributions extend previous work on declustering and replication, and they are generic in the sense that they can be applied in very different contexts and with different data partitioning approaches. We evaluate their merits with a prototype implementation of the system.
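The interplay between a per-node performance index and on-demand chunk processing can be sketched as follows (the scheduling details are assumed for illustration, not taken from the paper): chunks are handed out as nodes become free, so a node with twice the performance index naturally ends up processing about twice as many chunks:

```python
# Illustrative on-demand chunk scheduler driven by a performance index.
# Deterministic simulation: chunk cost on node n is 1 / perf_index[n].
import heapq

def on_demand_schedule(num_chunks, perf_index):
    """Return how many chunks each node processes under on-demand dispatch."""
    heap = [(0.0, n) for n in range(len(perf_index))]  # (time node is free, id)
    heapq.heapify(heap)
    counts = [0] * len(perf_index)
    for _ in range(num_chunks):
        t, n = heapq.heappop(heap)           # next node to become free
        counts[n] += 1
        heapq.heappush(heap, (t + 1.0 / perf_index[n], n))
    return counts
```

This is the sense in which chunk-wise organization absorbs heterogeneity: no static partition is needed, because faster or less-loaded nodes simply pull more work.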
19.
We examine the class of multi-linear representations (MLR) for expressing probability distributions over discrete variables. MLR have recently been considered as intermediate representations that facilitate inference in distributions represented as graphical models. We show that MLR is an expressive representation of discrete distributions that can concisely represent classes of distributions which have exponential size in other commonly used representations, while supporting probabilistic inference in time linear in the size of the representation. Our key contribution is a set of techniques for learning bounded-size distributions represented using MLR which support efficient probabilistic inference. We demonstrate experimentally that the MLR representations we learn support accurate and very efficient inference.
20.
In the framework of heavy mid-level processing for high-speed imaging, a nonlinear two-dimensional network is proposed that allows the implementation of active curve algorithms. This efficient type of algorithm is usually prohibitive for real-time image processing because of its computational load and a structure ill-suited to serial or parallel architectures. A different implementation philosophy is proposed here, in which the active curve is generated by a propagation phenomenon inspired by biological modeling. A programmable nonlinear reaction–diffusion system is proposed under front control and technological constraints. Geometric multiscale processing is presented, and this opens a discussion about electronic implementation.