期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The design and implementation of Visuel performance monitoring and analysis toolkit for cluster and grid environments

Kuan-Ching Li Hsun-Chang Chang 《The Journal of supercomputing》2007,40(3):299-317

The computing power provided by high performance and low-cost PC-based clusters with Grid platforms are attractive and they are equal or superior to supercomputers and mainframes. In this paper, we present implementation and design rationale of Visuel toolkit for MPI parallel program performance measurement and analysis in cluster and grid environments. Most of performance visualization tools available today for high-performance platforms show solely system performance data (e.g., CPU load, memory usage, network bandwidth, server average load), and thus, being suitable for computing system activity visualization. The Visuel (Visuel (in French language) = to visualize) toolkit is web-based interface designed to show performance activities of all computing nodes of a distributed environment involved in the execution of MPI parallel program, such as CPU load level and memory usage of each computing node. In addition, this toolkit is able to display comparative performance data charts of MPI parallel applications and multiple executions under investigation. The usage of this toolkit shows that it outperforms in easing the process of investigation of parallel applications.

Hsun-Chang ChangEmail:

相似文献

2.

Combating I-O bottleneck using prefetching: model,algorithms, and ramifications

Akshat Verma Sandeep Sen 《The Journal of supercomputing》2008,45(2):205-235

Multiple memory models have been proposed to capture the effects of memory hierarchy culminating in the I-O model of Aggarwal and Vitter (Commun. ACM 31(9):1116–1127, [1988]). More than a decade of architectural advancements have led to new features that are not captured in the I-O model—most notably the prefetching capability. We propose a relatively simple Prefetch model that incorporates data prefetching in the traditional I-O models and show how to design optimal algorithms that can attain close to peak memory bandwidth. Unlike (the inverse of) memory latency, the memory bandwidth is much closer to the processing speed, thereby, intelligent use of prefetching can considerably mitigate the I-O bottleneck. For some fundamental problems, our algorithms attain running times approaching that of the idealized random access machines under reasonable assumptions. Our work also explains more precisely the significantly superior performance of the I-O efficient algorithms in systems that support prefetching compared to ones that do not.

Sandeep SenEmail:

相似文献

3.

Performance-based parallel application toolkit for high-performance clusters 总被引：1，自引：1，他引：0

Kuan-Ching Li Tien-Hsiung Weng 《The Journal of supercomputing》2009,48(1):43-65

Advances in computer technology, encompassed with fast emerging of multicore processor technology, have made the many-core personal computers available and more affordable. The availability of network of workstations and cluster of many-core SMPs have made them an attractive solution for high performance computing by providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. In order to search alternative ways to extract unused and idle computing power from these computing resources targeting to improve overall performance, as well as to fully utilize the underlying new hardware platforms, these are major topics in this field of research. In this research paper, the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments is introduced; not only generating parallel applications’ timing graph representation, but also to provide application execution’s performance data charts. The goal in developing this toolkit is to permit application developers have a better understanding of the application’s behavior among selected computing nodes purposed for that particular execution. Additionally, multiple execution results of a given application under development can be combined and overlapped, permitting application developers to perform “what-if” analysis, i.e., to deeper understand the utilization of allocated computational resources. Experimentations using this toolkit have shown its effectiveness on the development and performance tuning of parallel applications, extending the use in teaching of message passing, and shared memory model parallel programming courses.

Tien-Hsiung WengEmail:

相似文献

4.

Dynamic load balancing with adaptive factoring methods in scientific applications 总被引：1，自引：1，他引：0

Ricolindo L. Cariño Ioana Banicescu 《The Journal of supercomputing》2008,44(1):41-63

To improve the performance of scientific applications with parallel loops, dynamic loop scheduling methods have been proposed. Such methods address performance degradations due to load imbalance caused by predictable phenomena like nonuniform data distribution or algorithmic variance, and unpredictable phenomena such as data access latency or operating system interference. In particular, methods such as factoring, weighted factoring, adaptive weighted factoring, and adaptive factoring have been developed based on a probabilistic analysis of parallel loop iterates with variable running times. These methods have been successfully implemented in a number of applications such as: N-Body and Monte Carlo simulations, computational fluid dynamics, and radar signal processing. The focus of this paper is on adaptive weighted factoring (AWF), a method that was designed for scheduling parallel loops in time-stepping scientific applications. The main contribution of the paper is to relax the time-stepping requirement, a modification that allows the AWF to be used in any application with a parallel loop. The modification further allows the AWF to adapt to load imbalance that may occur during loop execution. Results of experiments to compare the performance of the modified AWF with the performance of the other loop scheduling methods in the context of three nontrivial applications reveal that the performance of the modified method is comparable to, and in some cases, superior to the performance of the most recently introduced adaptive factoring method.

Ioana BanicescuEmail:

相似文献

5.

LAPT: A locality-aware page table for thread and data mapping

《Parallel Computing》2016

The performance and energy efficiency of current systems is influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction, and which cache or main memory bank responded to it. In this context, the locality of the memory accesses plays a key role for the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping).In this paper, we present LAPT, a hardware-based mechanism to store the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC Benchmark Suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively, (6.7% and 5.3% on average). 相似文献

6.

Mapping functions and data redistribution for parallel files

Florin Isaila Walter F. Tichy 《The Journal of supercomputing》2008,46(3):213-236

Data distribution in memory or on disks is an important factor influencing the performance of parallel applications. On the other hand, programs or systems, like a parallel file system, frequently redistribute data between memory and disks. This paper presents a generalization of previous approaches of the redistribution problem. We introduce algorithms for mapping between two arbitrary distributions of a data set. The algorithms are optimized for multidimensional array partitions. We motivate our approach and present potential utilizations. The paper also presents a case study, the employment of mapping functions, and redistribution algorithms in a parallel file system.

Walter F. TichyEmail:

相似文献

7.

Design and implementation of a workflow-based resource broker with information system on computational grids 总被引：1，自引：0，他引：1

Chao-Tung Yang Kuan-Chou Lai Po-Chi Shih 《The Journal of supercomputing》2009,47(1):76-109

The grid is a promising infrastructure that can allow scientists and engineers to access resources among geographically distributed environments. Grid computing is a new technology which focuses on aggregating resources (e.g., processor cycles, disk storage, and contents) from a large-scale computing platform. Making grid computing a reality requires a resource broker to manage and monitor available resources. This paper presents a workflow-based resource broker whose main functions are matching available resources with user requests and considering network information statuses during matchmaking in computational grids. The resource broker provides a graphic user interface for accessing available and the appropriate resources via user credentials. This broker uses the Ganglia and NWS tools to monitor resource status and network-related information, respectively. Then we propose a history-based execution time estimation model to predict the execution time of parallel applications, according to previous execution results. The experimental results show that our model can accurately predict the execution time of embarrassingly parallel applications. We also report on using the Globus Toolkit to construct a grid platform called the TIGER project that integrates resources distributed across five universities in Taichung city, Taiwan, where the resource broker was developed.

Po-Chi ShihEmail:

相似文献

8.

Real-time image segmentation based on a parallel and pipelined watershed algorithm

Dang Ba Khac Trieu Tsutomu Maruyama 《Journal of Real-Time Image Processing》2007,2(4):319-329

The watershed transformation is a popular image segmentation technique for gray scale images. This paper describes a real-time image segmentation based on a parallel and pipelined watershed algorithm which is designed for hardware implementation. In our algorithm: (1) pixels in a given image are repeatedly scanned from top-left to bottom-right, and then from bottom-right to top-left, in order to achieve high performance on a pipelined circuit by simplifying memory access sequences, (2) all steps in the algorithm are executed at the same time in the pipelined circuit, (3) the amount of data that are scanned is gradually reduced as the calculation progresses by memorizing which data are modified in the previous scan, and (4) N pixels can be processed in parallel. In our current implementation on an off-the-shelf field-programmable gate array board, up to four pixels can be processed in parallel. The performance for 512 × 512 pixel images is fast enough to be the first step in real-time applications.

Tsutomu Maruyama (Corresponding author)Email:

相似文献

9.

An implementation of parallel file distribution in an agent hierarchy

Munehiro Fukuda Jumpei Miyauchi 《The Journal of supercomputing》2009,47(3):255-285

PC grid is a cost-effective grid-computing platform that attracts users by allocating to their massively parallel applications as many desktop computers as requested. However, a challenge is how to distribute necessary files to remote computing nodes that may be unconnected to the same network file system, equipped with insufficient disk space to keep entire files, and even powered off asynchronously. Targeting PC grid, the AgentTeamwork grid-computing middleware deploys a hierarchy of mobile agents to remote desktops so as to launch, monitor, check-point, and resume a parallel and distributed computing job. To achieve high-speed file distribution, AgentTeamwork takes advantage of its agent hierarchy. The system partitions files into stripes at the tree root if they are random-access files, duplicates them at each tree level if they are shared among all remote nodes, fragments them into smaller messages if they are too large to relay to a lower tree level, aggregates such messages in a larger fragment if they are in transit to the same subtree, and returns output files to the user along multi-paths established within the tree. To achieve fault-tolerant file delivery, each agent periodically takes a snapshot of in-transit and on-memory file messages with its user job, and thus resumes them from the latest snapshot when they crash accidentally. This paper presents an implementation and its competitive performance of AgentTeamwork’s file-distribution algorithm including file partitioning, transfer, check-pointing, and consistency maintenance.

Jumpei MiyauchiEmail:

相似文献

10.

Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM

Soyeon Park Seung Ryoul Maeng 《The Journal of supercomputing》2006,35(2):141-154

A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network. For more lightweight fault-tolerant DSM, in this paper, we mainly focused on eliminating shared memory checkpointing during failure-free execution. Each node independently takes the checkpoints of execution states and non-shared data only. When a node fails, it regenerates its pages from the remote copies in live nodes. In order to efficiently reconstruct pages, we also introduced a XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-free execution, can be applicable to any version of remote copies either backward or forward for recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing. This research is supported by KISTEP under the National Research Laboratory program. 相似文献

11.

Custom parallel caching schemes for hardware-accelerated image compression

Su-Shin Ang George A. Constantinides Wayne Luk Peter Y. K. Cheung 《Journal of Real-Time Image Processing》2008,3(4):289-302

In an effort to achieve lower bandwidth requirements, video compression algorithms have become increasingly complex. Consequently, the deployment of these algorithms on field programmable gate arrays (FPGAs) is becoming increasingly desirable, because of the computational parallelism on these platforms as well as the measure of flexibility afforded to designers. Typically, video data are stored in large and slow external memory arrays, but the impact of the memory access bottleneck may be reduced by buffering frequently used data in fast on-chip memories. The order of the memory accesses, resulting from many compression algorithms are dependent on the input data (Jain in Proceedings of the IEEE, pp. 349–389, 1981). These data-dependent memory accesses complicate the exploitation of data re-use, and subsequently reduce the extent to which an application may be accelerated. In this paper, we present a hybrid memory sub-system which is able to capture data re-use effectively in spite of data-dependent memory accesses. This memory sub-system is made up of a custom parallel cache and a scratchpad memory. Further, the framework is capable of exploiting 2D spatial locality, which is frequently exhibited in the access patterns of image processing applications. In a case study involving the quad-tree structured pulse code modulation (QSDPCM) application, the impact of data dependence on memory accesses is shown to be significant. In comparison with an implementation which only employs an SPM, performance improvements of up to 1.7× and 1.4× are observed through actual implementation on two modern FPGA platforms. These performance improvements are more pronounced for image sequences exhibiting greater inter-frame movements. In addition, reductions of on-chip memory resources by up to 3.2× are achievable using this framework. These results indicate that, on custom hardware platforms, there is substantial scope for improvement in the capture of data re-use when memory accesses are data dependent.

Su-Shin AngEmail: Email:

相似文献

12.

TreadMarks: shared memory computing on networks of workstations 总被引：2，自引：0，他引：2

Amza C. Cox A.L. Dwarkadas S. Keleher P. Honghui Lu Rajamony R. Weimin Yu Zwaenepoel W. 《Computer》1996,29(2):18-28

Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value 相似文献

13.

Toward the parallelization of GSL

Jose Ignacio Aliaga Francisco Almeida Jose Manuel Badía Sergio Barrachina Vicente Blanco Maria Castillo Rafael Mayo Enrique S. Quintana Gregorio Quintana Alfredo Remón Casiano Rodríguez Francisco de Sande Adrian Santos 《The Journal of supercomputing》2009,48(1):88-114

In this paper, we present our joint efforts to design and develop parallel implementations of the GNU Scientific Library for a wide variety of parallel platforms. The multilevel software architecture proposed provides several interfaces: a sequential interface that hides the parallel nature of the library to sequential users, a parallel interface for parallel programmers, and a web services based interface to provide remote access to the routines of the library. The physical level of the architecture includes platforms ranging from distributed and shared-memory multiprocessors to hybrid systems and heterogeneous clusters. Several well-known operations arising in discrete mathematics and sparse linear algebra are used to illustrate the challenges, benefits, and performance of different parallelization approaches.

Adrian SantosEmail:

相似文献

14.

A runtime resolution scheme for priority boost conflict in implicit coscheduling

Jung-Lok Yu Jin-Soo Kim Seung-Ryoul Maeng 《The Journal of supercomputing》2007,40(1):1-28

High-performance parallel and scientific applications are composed of multiple processes running on distinct CPUs that communicate frequently. Due to the synchronization needs of such applications, performance is greatly hampered if their processes are not scheduled simultaneously on the CPUs. Implicit coscheduling (ICS) is a well-known technique to address this problem in multi-programmed clusters, however, traditional ICS schemes do not incorporate steps to adequately deal with priority boost conflicts, leading to significantly degraded performance. In this paper, we propose the use of runtime difference in contention across nodes to provide more sophisticated coscheduling decisions in response to the conflicts. We also present a novel coscheduling scheme termed PROC (Process ReOrdering-based Coscheduling) that adaptively regulates the scheduling sequence of conflicting processes based on the rescheduling latency of their correspondents in remote nodes. We perform extensive simulation-based experiments using both synthetic and realistic workloads to analyze the performance of PROC compared to alternatives such as local scheduling, a widely used batch scheduling, gang scheduling, and existing ICS schemes. The results show that all ICS schemes commonly experience priority boost conflicts, and that the proposed PROC significantly outperforms other ICS alternatives (or batch scheduling) by up to 50.4% (or 72.5%) in the average job response time. This improvement is achieved by reducing wasted idle time and spinning time without sacrificing fairness.

Seung-Ryoul MaengEmail:

相似文献

15.

Efficient integration of fine-grained access control and resource brokering in grid

P. Mazzoleni B. Crispo S. Sivasubramanian E. Bertino 《The Journal of supercomputing》2009,49(1):108-126

In this paper, we present a novel resource brokering service for grid systems which considers authorization policies of the grid nodes in the process of selecting the resources to be assigned to a request. We argue such an integration is needed to avoid scheduling requests onto resources the policies of which do not authorize their execution. Our service, implemented in Globus as a part of Monitoring and Discovery Service (MDS), is based on the concept of fine-grained access control (FGAC) which enables participating grid nodes to specify fine-grained policies concerning the conditions under which grid clients can access their resources. Since the process of evaluating authorization policies, in addition to checking the resource requirements, can be a potential bottleneck for a large scale grid, we also analyze the problem of the efficient evaluation of FGAC policies. In this context, we present GroupByRule, a novel method for policy organization and compare its performance with other strategies.

E. BertinoEmail:

相似文献

16.

Impact of platform heterogeneity on the design of parallel algorithms for morphological processing of high-dimensional image data

Antonio Plaza Javier Plaza David Valencia 《The Journal of supercomputing》2007,40(1):81-107

The main objective of this paper is to describe a realistic framework to understand parallel performance of high-dimensional image processing algorithms in the context of heterogeneous networks of workstations (NOWs). As a case study, this paper explores techniques for mapping hyperspectral image analysis techniques onto fully heterogeneous NOWs. Hyperspectral imaging is a new technique in remote sensing that has gained tremendous popularity in many research areas, including satellite imaging and aerial reconnaissance. The automation of techniques able to transform massive amounts of hyperspectral data into scientific understanding in valid response times is critical for space-based Earth science and planetary exploration. Using an evaluation strategy which is based on comparing the efficiency achieved by an heterogeneous algorithm on a fully heterogeneous NOW with that evidenced by its homogeneous version on a homogeneous NOW with the same aggregate performance as the heterogeneous one, we develop a detailed analysis of parallel algorithms that integrate the spatial and spectral information in the image data through mathematical morphology concepts. For comparative purposes, performance data for the tested algorithms on Thunderhead (a large-scale Beowulf cluster at NASA’s Goddard Space Flight Center) are also provided. Our detailed investigation of the parallel properties of the proposed morphological algorithms provides several intriguing findings that may help image analysts in selection of parallel techniques and strategies for specific applications.

Antonio PlazaEmail:

相似文献

17.

Trustworthy remote compiling services for grid-based scientific applications

Yaohang Li Daniel Chen Xiaohong Yuan 《The Journal of supercomputing》2007,41(2):119-131

Grid computing, which is characterized by large-scale sharing and collaboration of dynamic resources, is becoming an emerging computing platform on a global scale for data-intensive and computation-intensive scientific application. However, the complications of large-scale scientific computations and simulations harnessing massive computing resources are compounded by extensive heterogeneity in environments arising from “the Grid.” Scientists and engineers lack an intuitive grid-based compilation tool, which has contributed to the difficulty of exploiting these diverse resources and developing their applications on the grid. While manual configuration of various toolkits simplifying the end-to-end completion of a job is adequate for a computational grid with a limited number of nodes, the compilation procedure becomes inefficient for a computational grid with an increasing number of heterogeneous computational service providers. On the other hand, a global-scale computational grid is a potentially untrustworthy computing environment. How to take advantage of the potentially untrustworthy grid resources to provide trustworthy computational services for large-scale scientific applications is another critical issue. In this article, a remote compiling service for a heterogeneous computational grid is developed. In addition to running compilation tasks, the remote compiling service provides security enforcement and validation facilities, including intermediate value checking, secure source program submission, restricted compilation, and binary inspection, to support trustworthy compilation and execution of grid-based scientific applications. Overall, it is expected that our remote compiling services on the grid can tackle the heterogeneity problem of the grid and provide a secure, trustworthy, reliable, and state-of-the-art mechanism to develop grid-aware scientific applications.

Xiaohong YuanEmail:

相似文献

18.

多核处理器中混合分布式共享存储空间的实时划分技术

陈小文陈书明鲁中海 Axel Jantsch 《计算机工程与科学》2012,34(7):54-59

在多核处理器芯片中,分布式共享存储DSM虽然提供了统一的全局寻址的存储空间,但却引入了虚地址向实地址转换的开销,这对性能产生了负面的影响。我们注意到,在并行程序的执行过程中,被处理的数据属性(私有或共享)并不是一成不变的。并行程序中不同的数据具有不同的属性,即使同一数据在程序的不同执行阶段也可能具有不同的属性。本文首先详细地阐述了一种混合式的分布式共享存储空间,支持对共享数据采用全局寻址的虚地址访问而对私有数据采用快速寻址的实地址访问;进而提出了一种针对混合式的分布式共享存储空间的实时划分技术。该技术根据并行程序中数据的属性,在程序运行时,实时地调整和划分分布式共享存储空间。当数据为私有时,通过实地址访问加快数据的访问速度,当数据为共享时则维持虚地址访问,从而减少整个并行程序运行过程中的地址转换开销,提高系统的性能。实际应用程序的实验结果表明,与传统的分布式共享存储空间相比,实时划分的混合式的分布式共享存储空间具有性能优势,性能的提升比例与具体的网络规模、计算规模、并行程序映射方式等有关。在我们的实验中,性能的提升比例最高为13.14%,最低为6.98%。相似文献

19.

An energy-efficient protocol for data gathering and aggregation in wireless sensor networks

Ming Liu Jiannong Cao Yuan Zheng Haigang Gong Xiaomin Wang 《The Journal of supercomputing》2008,43(2):107-125

Data gathering is a major function of many applications in wireless sensor networks (WSNs). The most important issue in designing a data gathering algorithm is how to save energy of sensor nodes while meeting the requirement of applications/users such as sensing area coverage. In this paper, we propose a novel hierarchical clustering protocol (DEEG) for long-lived sensor network. DEEG achieves a good performance in terms of lifetime by minimizing energy consumption for in-network communications and balancing the energy load among all the nodes, the proposed protocol achieves a good performance in terms of network lifetime. DEEG can also handle the energy hetergenous capacities and guarantee that out-network communications always occur in the subregion with high energy reserved. Furthermore, it introduces a simple but efficient approach to cope with the area coverage problem. We evaluate the performance of the proposed protocol using a simple temperature sensing application. Simulation results show that our protocol significantly outperforms LEACH and PEGASIS in terms of network lifetime and the amount of data gathered.

Xiaomin WangEmail:

相似文献

20.

Persistent clustered main memory index for accelerating <Emphasis Type="Italic">k</Emphasis>-NN queries on high dimensional datasets

Alexander Thomasian Lijuan Zhang 《Multimedia Tools and Applications》2008,38(2):253-270

Similarity search implemented via k-nearest neighbor— k-NN queries on multidimensional indices is an extremely useful paradigm for content-based image retrieval. As the dimensionality of feature vectors increases the curse of dimensionality sets in, i.e., the performance of k-NN search of disk-resident indices in the R-tree family degrades rapidly due to the overlap in index pages in high dimensions. This problem is dealt with in this study by utilizing the double filtering effect of clustering and indexing. The clustering algorithm ensures that the largest cluster fits into main memory and that only clusters closest to a query point need to be searched and hence loaded into main memory. We organize the data in each cluster according to the ordered-partition—OP-tree main memory resident index, which is not prone to the curse of dimensionality and highly efficient for processing k-NN queries. We serialize an OP-tree by writing its dynamically allocated nodes into contiguous memory locations, optimize its parameters, and make it persistent by writing it to disk. The time to read and write clusters constituting an OP-tree with a single sequential access to disk benefits from higher data transfer rates of modern disk drives. The performance of the index is further improved by applying the Karhunen–Loève transformation—KLT to the dataset, since this results in a more efficient computation of distances for k-NN queries. We compare OP-trees and sequential scans with and without a KL-transformation and with and without using a shortcut method in calculating Euclidean distances. A comparison against the OMNI-sequential scan is also reported. We finally compare a clustered and persistent version of the OP-tree against a clustered version of the SR-tree and the VA-file method. CPU time is measured and elapsed time is estimated in this study. It is observed that the OP-tree index outperforms the other two methods and that the improvement increases with the number of dimensions.

Lijuan ZhangEmail:

相似文献