期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Performance‐steered design of software architectures for embedded multicore systems

Alessio Bechini Cosimo Antonio Prete 《Software》2002,32(12):1155-1173

相似文献

2.

Dynamic software updates for parallel high‐performance applications

Dong Kwan Kim Eli Tilevich Calvin J. Ribbens 《Concurrency and Computation》2011,23(4):415-434

Despite using multiple concurrent processors, a typical high‐performance parallel application is long‐running, taking hours, even days to arrive at a solution. To modify a running high‐performance parallel application, the programmer has to stop the computation, change the code, redeploy, and enqueue the updated version to be scheduled to run, thus wasting not only the programmer's time, but also expensive computing resources. To address these inefficiencies, this article describes how dynamic software updates (DSU) can be used to modify a parallel application on the fly, thus saving the programmer's time and using expensive computing resources more productively. The net effect of updating parallel applications dynamically can reduce the total time that elapses between posing a problem and arriving at a solution, otherwise known as time‐to‐discovery. To explore the benefits of dynamic updates for high performance applications, this article takes a two‐pronged approach. First, we describe our experiences of building and evaluating a system for dynamically updating applications running on a parallel cluster. We then review a large body of literature describing the existing state of the art in DSU and point out how this research can be applied to high‐performance applications. Our experimental results indicate that DSU have the potential to become a powerful tool in reducing time‐to‐discovery for high‐performance parallel applications. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

3.

Cluster computing: the commodity supercomputer

Mark Baker Rajkumar Buyya 《Software》1999,29(6):551-576

The availability of high‐speed networks and increasingly powerful commodity microprocessors is making the usage of clusters, or networks, of computers an appealing vehicle for cost effective parallel computing. Clusters, built using Commodity‐Off‐The‐Shelf (COTS) hardware components as well as free, or commonly used, software, are playing a major role in redefining the concept of supercomputing. In this paper we discuss the reasons why COTS‐based clusters are becoming popular environments for running supercomputing applications. We describe the current enabling technologies and present four state‐of‐the‐art cluster‐based projects. Finally, we summarise our findings and draw a number of conclusions relating to the usefulness and likely future of cluster computing. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献

4.

Agent based scientific simulation and modeling

L. Blni D. C. Marinescu J. R. Rice P. Tsompanopoulou E. A. Vavalis 《Concurrency and Computation》2000,12(9):845-861

The simulation and modeling of complex physical systems often involves many components because (i) the physical system itself has components of differing natures, (ii) parallel computing strategies require many (somewhat independent) components, and (iii) existing simulation software applies only to simpler geometrical shapes and physical situations. We discuss how agent based networks are applied to such multi‐component applications. The network agents are used to (a) control the execution of existing solvers on sub‐components, (b) mediate between sub‐components, and (c) coordinate the execution of the ensemble. This paper focuses on partial differential equation (PDE) models as an instance of the approach and describes the implementation of networks using the PELLPACK problem solving environment for PDEs and the Bond system for agent based computing. Copyright © 2000 John Wiley & Sons, Ltd. 相似文献

5.

Real-time computing platform for spiking neurons (RT-spike) 总被引：1，自引：0，他引：1

Ros E. Ortigosa E.M. Agis R. Carrillo R. Arnold M. 《Neural Networks, IEEE Transactions on》2006,17(4):1050-1063

A computing platform is described for simulating arbitrary networks of spiking neurons in real time. A hybrid computing scheme is adopted that uses both software and hardware components to manage the tradeoff between flexibility and computational power; the neuron model is implemented in hardware and the network model and the learning are implemented in software. The incremental transition of the software components into hardware is supported. We focus on a spike response model (SRM) for a neuron where the synapses are modeled as input-driven conductances. The temporal dynamics of the synaptic integration process are modeled with a synaptic time constant that results in a gradual injection of charge. This type of model is computationally expensive and is not easily amenable to existing software-based event-driven approaches. As an alternative we have designed an efficient time-based computing architecture in hardware, where the different stages of the neuron model are processed in parallel. Further improvements occur by computing multiple neurons in parallel using multiple processing units. This design is tested using reconfigurable hardware and its scalability and performance evaluated. Our overall goal is to investigate biologically realistic models for the real-time control of robots operating within closed action-perception loops, and so we evaluate the performance of the system on simulating a model of the cerebellum where the emulation of the temporal dynamics of the synaptic integration process is important. 相似文献

6.

Grids and Grid technologies for wide‐area distributed computing

Mark Baker Rajkumar Buyya Domenico Laforenza 《Software》2002,32(15):1437-1466

The last decade has seen a substantial increase in commodity computer and network performance, mainly as a result of faster hardware and more sophisticated software. Nevertheless, there are still problems, in the fields of science, engineering, and business, which cannot be effectively dealt with using the current generation of supercomputers. In fact, due to their size and complexity, these problems are often very numerically and/or data intensive and consequently require a variety ofheterogeneous resources that are not available on a single machine. A number of teams have conducted experimental studies on the cooperative use of geographically distributed resources unified to act as a single powerful computer. This new approach is known by several names, such as metacomputing, scalable computing, global computing, Internet computing, and more recently peer‐to‐peer or Grid computing. The early efforts in Grid computing started as a project to link supercomputing sites, but have now grown far beyond their original intent. In fact, many applications can benefit from the Grid infrastructure, including collaborative engineering, data exploration, high‐throughput computing, and of course distributed supercomputing. Moreover, due to the rapid growth of the Internet and Web, there has been a rising interest in Web‐based distributed computing, and many projects have been started and aim to exploit the Web as an infrastructure for running coarse‐grained distributed and parallel applications. In this context, the Web has the capability to be a platform for parallel and collaborative work as well as a key technology to create a pervasive and ubiquitous Grid‐based infrastructure. This paper aims to present the state‐of‐the‐art of Grid computing and attempts to survey the major international efforts in developing this emerging technology. Copyright © 2002 John Wiley & Sons, Ltd. 相似文献

7.

Performance-based parallel application toolkit for high-performance clusters 总被引：1，自引：1，他引：0

Kuan-Ching Li Tien-Hsiung Weng 《The Journal of supercomputing》2009,48(1):43-65

Advances in computer technology, encompassed with fast emerging of multicore processor technology, have made the many-core personal computers available and more affordable. The availability of network of workstations and cluster of many-core SMPs have made them an attractive solution for high performance computing by providing computational power equal or superior to supercomputers or mainframes at an affordable cost using commodity components. In order to search alternative ways to extract unused and idle computing power from these computing resources targeting to improve overall performance, as well as to fully utilize the underlying new hardware platforms, these are major topics in this field of research. In this research paper, the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments is introduced; not only generating parallel applications’ timing graph representation, but also to provide application execution’s performance data charts. The goal in developing this toolkit is to permit application developers have a better understanding of the application’s behavior among selected computing nodes purposed for that particular execution. Additionally, multiple execution results of a given application under development can be combined and overlapped, permitting application developers to perform “what-if” analysis, i.e., to deeper understand the utilization of allocated computational resources. Experimentations using this toolkit have shown its effectiveness on the development and performance tuning of parallel applications, extending the use in teaching of message passing, and shared memory model parallel programming courses.

Tien-Hsiung WengEmail:

相似文献

8.

A novel strategy for building interoperable MPI environment in heterogeneous high performance systems

Francisco Isidro Massetto Liria Matsumoto Sato Kuan-Ching Li 《The Journal of supercomputing》2012,60(1):87-116

Breakthrough advances in microprocessor technology and efficient power management have altered the course of development of processors with the emergence of multi-core processor technology, in order to bring higher level of processing. The utilization of many-core technology has boosted computing power provided by cluster of workstations or SMPs, providing large computational power at an affordable cost using solely commodity components. Different implementations of message-passing libraries and system softwares (including Operating Systems) are installed in such cluster and multi-cluster computing systems. In order to guarantee correct execution of message-passing parallel applications in a computing environment other than that originally the parallel application was developed, review of the application code is needed. In this paper, a hybrid communication interfacing strategy is proposed, to execute a parallel application in a group of computing nodes belonging to different clusters or multi-clusters (computing systems may be running different operating systems and MPI implementations), interconnected with public or private IP addresses, and responding interchangeably to user execution requests. Experimental results demonstrate the feasibility of this proposed strategy and its effectiveness, through the execution of benchmarking parallel applications. 相似文献

9.

Hybrid Access Cache Indexing Framework Adapted to GPU

下载免费PDF全文

Hongjun Zhang Yanjun Wu Heng Zhang Libo Zhang 《International Journal of Software and Informatics》2021,11(2):195-216

Hash tables, as a type of data indexing structure that provides efficient data access based on key values, are widely used in various computer applications, especially in system software, databases, and high-performance computing field that requires extremely high performance. In network, cloud computing and IoT services, hash tables have become the core system components of cache systems. However, with the large-scale increase in the amount of large-scale data, performance bottlenecks have gradually emerged in systems designed with a multi-core CPU as the core of the hash table structure. There is an urgent need to further improve the high performance and scalability of the hash tables. With the increasing popularity of general-purpose Graphic Processing Units (GPUs) and the substantial improvement of hardware computing capabilities and concurrency performance, various types of system software tasks with parallel computing as the core have been optimized on the GPU and have achieved considerable performance promotion. Due to the sparseness and randomness, using the existing parallel structure of the hash tables directly on the GPUs will inevitably bring high-frequency memory access and frequent bus data transmission, which affects the performance of the hash tables on the GPUs. This study focuses on the analysis of memory access, hit ratio, and index overhead of hash table indexes in the cache system. A hybrid access cache indexing framework CCHT (Cache Cuckoo Hash Table) adapted to GPU is proposed and provided. The cache strategy suitable to different requirements of hit ratios and index overheads allows concurrent execution of write and query operations, maximizing the use of the computing performance and concurrency characteristics of GPU hardware, reducing memory access and bus transferring overhead. Through GPU hardware implementation and experimental verification, CCHT has better performance than other cache indexing hash tables while ensuring cache hit ratios. 相似文献

10.

The implementation and evaluation of the use of CORBA in an engineering design application

Susan D. Urban Ling Fu Jami J. Shah 《Software》1999,29(14):1313-1338

Many computer applications today require some form of distributed computing to allow different software components to communicate. Several different commercial products now exist based on the Common Object Request Broker Architecture (CORBA) of the Object Management Group. The use of such tools, however, often requires the modification of existing systems, rather than the development of new applications. The objective of this research has been to integrate the use of a CORBA tool into an existing engineering design application for the purpose of (1) evaluating the amount of re‐engineering that is involved to effectively integrate distributed object computing into an existing application, and (2) evaluating the use and performance of distributed object computing in an engineering domain, which often requires the transfer of large amounts of information. The results of this work demonstrate that CORBA technology can be easily integrated into existing applications. The ease of the integration as well as the efficiency of the resulting system, however, depends upon the degree of modification that developers are willing to consider in the re‐engineering process. The most transparent approach to the use of CORBA requires less modification and generally produces less efficient performance. The less transparent approach to the use of CORBA can potentially require significant system modification but produce greater performance gains. This work outlines issues that must be considered for the partitioning of functionality between the client and the server, development of an IDL interface, development of client and server‐side wrappers, and support for concurrent, multi‐user access. In addition, this work also provides performance and implementation comparisons of different techniques for the use of wrappers and for the transfer of large data files between the client and the server. Performance comparisons for the incorporation of concurrent access are also presented. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献

11.

Experiences with networked parallel computing

Peter Hoogerbrugge Ravi Mirchandaney 《Concurrency and Computation》1995,7(1):1-16

The performance and proliferation of workstations continues to increase at a rapid rate. However, the practical utilization of workstation networks for parallel computing is still in its infancy. This is due to the relative immaturity of programming tools, low bandwidth networks such as Ethernet, and high message latencies. However, programming tools are becoming more mature and network bandwidths are increasing rapidly. Hence, networks of commodity workstations may prove to be practical for certain classes of parallel applications. This paper describes our experiences with two applications parallelized on a network of Sun workstations. The first application is from Shell's petroleum engineering department. This program quantitatively derives rock and porefill composition from well-log data, using a compute-intensive iterative optimization procedure. The second application is time filtering, which is a fundamental operation performed on seismic traces. Through our experiments we identify the limits of networked parallel computing based on the current state of network technology. We also provide a discussion on the possible impact of future high speed networks on networked parallel computing. 相似文献

12.

一种适应GPU的混合访问缓存索引框架

张鸿骏武延军张珩张立波《软件学报》2020,31(10):3038-3055

散列表（hash table）作为一类根据关键码值（key value）提供高效数据访问的数据索引结构,其广泛应用于各类计算机应用中,尤其是在对性能要求极高的系统软件、数据库以及高性能计算领域.在网络、云计算和物联网服务方面,以散列表为核心结构已经成为缓存系统的重要系统组件.然而,随着大规模数据量的大幅度增加,以多核CPU为核心设计散列表结构的系统已经逐渐出现性能瓶颈,亟需进一步改进散列表的高性能和可扩展性.随着通用图形处理器（graphic processing unit,简称GPU）的日益普及以及硬件计算能力和并发性能的大幅度提升,各类以并行计算为核心的系统软件任务在GPU上进行了优化设计并得到可观的性能提升.由于存在稀疏性和随机性,采用现有散列表的并行结构直接在GPU上应用势必会带来高频次的内存访问和频繁的总线数据传输,影响了散列表在GPU上的性能发挥.重点分析了缓存系统中散列表索引的内存访问、命中率与索引开销,提出并设计了一种适应GPU的混合访问缓存索引框架CCHT（cache cuckoo hash table）,提供了两种适应不同命中率和索引开销要求的缓存策略,允许写入与查询操作并发执行,最大程度地利用了GPU硬件的计算性能与并发特性,减少了内存访问与总线传输.通过在GPU硬件上的实现与实验验证,CCHT在保证缓存命中率的同时,性能优于其他用于缓存索引的散列表. 相似文献

13.

GridBLAST: a Globus‐based high‐throughput implementation of BLAST in a Grid computing framework

Arun Krishnan 《Concurrency and Computation》2005,17(13):1607-1623

Improvements in the performance of processors and networks have made it feasible to treat collections of workstations, servers, clusters and supercomputers as integrated computing resources or Grids. However, the very heterogeneity that is the strength of computational and data Grids can also make application development for such an environment extremely difficult. Application development in a Grid computing environment faces significant challenges in the form of problem granularity, latency and bandwidth issues as well as job scheduling. Currently existing Grid technologies limit the development of Grid applications to certain classes, namely, embarrassingly parallel, hierarchical parallelism, work flow and database applications. Of all these classes, embarrassingly parallel applications are the easiest to develop in a Grid computing framework. The work presented here deals with creating a Grid‐enabled, high‐throughput, standalone version of a bioinformatics application, BLAST, using Globus as the Grid middleware. BLAST is a sequence alignment and search technique that is embarrassingly parallel in nature and thus amenable to adaptation to a Grid environment. A detailed methodology for creating the Grid‐enabled application is presented, which can be used as a template for the development of similar applications. The application has been tested on a ‘mini‐Grid’ testbed and the results presented here show that for large problem sizes, a distributed, Grid‐enabled version can help in significantly reducing execution times. Copyright © 2005 John Wiley & Sons, Ltd. 相似文献

14.

Scalable rendering on PC clusters

Wylie B. Pavlakos C. Lewis V. Moreland K. 《Computer Graphics and Applications, IEEE》2001,21(4):62-69

Sandia National Laboratories use PC clusters and commodity graphics cards to achieve higher rendering performance on extreme data sets. The main obstacle in using cluster-based graphics systems is the difficulty in realizing the full aggregate performance of all the individual graphics accelerators, particularly for very large data sets that exceed the capacity and performance characteristics of any one single node. Based on our efforts to achieve higher performance, we present results from a parallel sort-last implementation that the scalable rendering project at Sandia National Laboratories generated. Our sort-last library (libpglc) can be linked to an existing parallel application to achieve high rendering rates. We ran performance tests on a 64-node PC cluster populated with commodity graphics cards. Applications using libpglc have demonstrated rendering performance of 300 million polygons per second $approximately two orders of magnitude greater than the performance on an SGI Infinite Reality system for similar applications 相似文献

15.

Performance evaluation of the SX‐6 vector architecture for scientific computations

Leonid Oliker Andrew Canning Jonathan Carter John Shalf David Skinner Stphane Ethier Rupak Biswas Jahed Djomehri Rob Van der Wijngaart 《Concurrency and Computation》2005,17(1):69-93

The growing gap between sustained and peak performance for scientific applications is a well‐known problem in high‐performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX‐6 vector processor, and compares it against the cache‐based IBM Power3 and Power4 superscalar architectures, across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines many low‐level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall results demonstrate that the SX‐6 achieves high performance on a large fraction of our application suite and often significantly outperforms the cache‐based architectures. However, certain classes of applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX‐6 effectively. Copyright © 2005 John Wiley & Sons, Ltd. 相似文献

16.

User transparency: a fully sequential programming model for efficient data parallel image processing

F. J. Seinstra D. Koelma 《Concurrency and Computation》2004,16(6):611-644

Although many image processing applications are ideally suited for parallel implementation, most researchers in imaging do not benefit from high‐performance computing on a daily basis. Essentially, this is due to the fact that no parallelization tools exist that truly match the image processing researcher's frame of reference. As it is unrealistic to expect imaging researchers to become experts in parallel computing, tools must be provided to allow them to develop high‐performance applications in a highly familiar manner. In an attempt to provide such a tool, we have designed a software architecture that allows transparent (i.e. sequential) implementation of data parallel imaging applications for execution on homogeneous distributed memory MIMD‐style multicomputers. This paper presents an extensive overview of the design rationale behind the software architecture, and gives an assessment of the architecture's effectiveness in providing significant performance gains. In particular, we describe the implementation and automatic parallelization of three well‐known example applications that contain many fundamental imaging operations: (1) template matching; (2) multi‐baseline stereo vision; and (3) line detection. Based on experimental results we conclude that our software architecture constitutes a powerful and user‐friendly tool for obtaining high performance in many important image processing research areas. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

17.

OpenMP‐oriented applications for distributed shared memory architectures

Ami Marowka Zhenying Liu Barbara Chapman 《Concurrency and Computation》2004,16(4):371-384

The rapid rise of OpenMP as the preferred parallel programming paradigm for small‐to‐medium scale parallelism could slow unless OpenMP can show capabilities for becoming the model‐of‐choice for large scale high‐performance parallel computing in the coming decade. The main stumbling block for the adaptation of OpenMP to distributed shared memory (DSM) machines, which are based on architectures like cc‐NUMA, stems from the lack of capabilities for data placement among processors and threads for achieving data locality. The absence of such a mechanism causes remote memory accesses and inefficient cache memory use, both of which lead to poor performance. This paper presents a simple software programming approach called copy‐inside–copy‐back (CC) that exploits the data privatization mechanism of OpenMP for data placement and replacement. This technique enables one to distribute data manually without taking away control and flexibility from the programmer and is thus an alternative to the automat and implicit approaches. Moreover, the CC approach improves on the OpenMP‐SPMD style of programming that makes the development process of an OpenMP application more structured and simpler. The CC technique was tested and analyzed using the NAS Parallel Benchmarks on SGI Origin 2000 multiprocessor machines. This study shows that OpenMP improves performance of coarse‐grained parallelism, although a fast copy mechanism is essential. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

18.

A Grid-based Virtual Reactor: Parallel performance and adaptive load balancing

Vladimir V. Korkhov Valeria V. Krzhizhanovskaya P.M.A. Sloot 《Journal of Parallel and Distributed Computing》2008

We address the problem of porting parallel distributed applications from static homogeneous cluster environments to dynamic heterogeneous Grid resources. We introduce a generic technique for adaptive load balancing of parallel applications on heterogeneous resources and evaluate it using a case study application: a Virtual Reactor for simulation of plasma chemical vapour deposition. This application has a modular architecture with a number of loosely coupled components suitable for distribution over the Grid. It requires large parameter space exploration that allows using Grid resources for high-throughput computing. The Virtual Reactor contains a number of parallel solvers originally designed for homogeneous computer clusters that needed adaptation to the heterogeneity of the Grid. In this paper we study the performance of one of the parallel solvers, apply the technique developed for adaptive load balancing, evaluate the efficiency of this approach and outline an automated procedure for optimal utilization of heterogeneous Grid resources for high-performance parallel computing. 相似文献

19.

Principles to Support Modular Software Construction

下载免费PDF全文

Jack B. Dennis 《计算机科学技术学报》2017,32(1):3-10

The construction of large software systems is always achieved through assembly of independently written components — program modules. For these software components to work together, they must share a common set of data types and principles for representing structured data such as arrays of values and files. This common set of tools for creating and operating on data objects is provided by the infrastructure of the computer system: the hardware, operating system and runtime code. Because the nature and properties of these tools are crucial for correct operation of software components and their inter-operation, it is essential to have a precise specification that may be used for verifying correctness of application software on one hand, and to verify correctness of system behavior on the other. We call such a specification a program execution model (PXM). It is evident that the properties of the PXM implemented by a computer system can have serious impact on the ability of application programmers to practice modular software construction. This paper discusses the concept of program execution models and presents a set of principles that a PXM must satisfy to provide a sound basis for modular software construction. Because parallel program execution on computer systems with many processing units is an essential part of contemporary computing environments, the expression of parallelism and modular software construction using components involving parallel operations is included in this treatment. The conclusion is that it is possible to build computer systems that implement a PXM within which any parallel program may be used, unmodified, as a component for building more substantial parallel programs. 相似文献

20.

基于集群的海洋遥感图像融合并行计算策略

李先涛曾志张丰刘仁义《计算机应用与软件》2012,(1):84-87

集群体系下的大规模并行计算,是高性能计算的基础。遥感图像处理效率的提高,有赖于并行计算技术的应用。在分析已有网格计算环境下分布式任务分配方法的基础上,针对海上遥感图像目标物数量相对较少的特点,首先利用四叉树结构理念对目标区域进行划分,同时采用动态负载均衡的任务分配策略与并行计算思想,提出对目标区域图像进行融合处理的集群体系任务分配算法处理模型。通过对比验证,表明该集群体系下算法模型能有效地提高图像融合的速度。相似文献