20 similar documents found (search time: 0 ms)
1.
There has been an increasing research interest in extending the use of Java towards high-performance demanding applications such as scalable Web servers, distributed multimedia applications, and large-scale scientific applications. However, extending Java to a multicomputer environment and improving the low performance of current Java implementations pose great challenges to both the systems developer and application designer. In this survey, we describe and classify 14 relevant proposals and environments that tackle Java's performance bottlenecks in order to make the language an effective option for high-performance network-based computing. We further survey significant performance issues while exposing the potential benefits and limitations of current solutions in such a way that a framework for future research efforts can be established. Most of the proposed solutions can be classified according to some combination of three basic parameters: the model adopted for inter-process communication, language extensions, and the implementation strategy. In addition, where appropriate to each individual proposal, we examine other relevant issues, such as interoperability, portability, and garbage collection. Copyright © 2002 John Wiley & Sons, Ltd.
2.
Guillermo L. Taboada, Juan Touriño, Ramón Doallo, Aamir Shafi, Mark Baker, Bryan Carpenter. Concurrency and Computation, 2011, 23(18): 2382-2403
Since its release, the Java programming language has attracted considerable attention from the high-performance computing (HPC) community because of its portability, high programming productivity, and built-in multithreading and networking support. As a consequence, several initiatives have been taken to develop a high-performance Java message-passing library to program distributed memory architectures, such as clusters. The performance of Java message-passing applications relies heavily on communications performance, so the design and implementation of the low-level communication devices that support message-passing libraries is an important research issue in Java for HPC. MPJ Express is our Java message-passing implementation for developing high-performance parallel Java applications. Its public release currently contains three communication devices: the first is built using the Java New Input/Output (NIO) package for TCP/IP; the second is specifically designed for the Myrinet Express library on Myrinet; and the third supports thread-based shared memory communications. Although these devices have been successfully deployed in many production environments, previous performance evaluations of MPJ Express suggest that the buffering layer, tightly coupled with these devices, incurs copying overhead, which is one of the main performance penalties. This paper presents a more efficient Java message-passing communications device, based on Java Input/Output sockets, that avoids this buffering overhead. Moreover, this device implements several optimizations of Java message-passing communications, both in the communication protocol and in its support for HPC hardware.
To evaluate its benefits, this paper compares the performance of this device with that of other Java and native message-passing libraries on various high-speed networks, such as Gigabit Ethernet, Scalable Coherent Interface, Myrinet, and InfiniBand, as well as in a shared memory multicore scenario. The reported reduction in communication overhead encourages the upcoming incorporation of this device into MPJ Express (http://mpj-express.org). Copyright © 2011 John Wiley & Sons, Ltd.
3.
P. E. Hadjidoukas, V. V. Dimakopoulos, M. Delakis, C. Garcia. Concurrency and Computation, 2009, 21(15): 1819-1837
We present the development of a novel high-performance face detection system using a neural network-based classification algorithm and an efficient parallelization with OpenMP. We discuss the design of the system in detail along with an experimental assessment. Our parallelization strategy starts with one level of threads and moves to the exploitation of nested parallel regions in order to further improve the image-processing capability by up to 19%. The presented system is able to process images in real time (38 images/sec) by sustaining almost linear speedups on a system with a quad-core processor and a particular OpenMP runtime library. Copyright © 2009 John Wiley & Sons, Ltd.
4.
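A high-performance computer algebra environment, HHPCAS, is proposed. It integrates several existing computer algebra packages and, by adding kernel extension functions, external calls, and similar mechanisms, and by combining cluster management software with a parallel environment, provides a high-performance environment for computer algebra. Based on a Slot/Ticket model, HHPCAS effectively manages the available computing resources and job priority levels, makes full use of the strengths of the various computer algebra packages, and provides a parallel message-passing mechanism that distributes large, complex computations evenly across the compute nodes, overcoming the limited memory and computing power of a single machine. Tests with a parallel difference-substitution method show that HHPCAS can provide an effective computing platform for symbolic computation and automated reasoning.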
5.
This paper considers Java as an implementation language for the initial part of a computer algebra library. It describes the design of basic arithmetic and multivariate polynomial interfaces and classes, which are then employed in advanced parallel and distributed Gröbner basis algorithms and applications. The library is type-safe, owing to its design with Java's generic type parameters, and thread-safe, owing to its use of Java's concurrent programming facilities. We report on the performance of the polynomial arithmetic and of applications built upon the core library.
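The abstract does not show the library's interfaces, so the following is only a minimal sketch of the underlying idea: a sparse multivariate polynomial stored as a map from exponent vectors to coefficients, with addition and multiplication. It is written in Python rather than Java, uses exact `Fraction` coefficients, and the class name `Poly` is hypothetical; it does not reproduce the paper's generic, type-safe Java design.

```python
from fractions import Fraction


class Poly:
    """Sparse multivariate polynomial: exponent tuple -> nonzero coefficient."""

    def __init__(self, terms=None):
        # Store exact rational coefficients and drop zero terms.
        self.terms = {e: Fraction(c) for e, c in (terms or {}).items() if c != 0}

    def __add__(self, other):
        out = dict(self.terms)
        for e, c in other.terms.items():
            out[e] = out.get(e, 0) + c
        return Poly(out)  # the constructor removes cancelled terms

    def __mul__(self, other):
        out = {}
        for e1, c1 in self.terms.items():
            for e2, c2 in other.terms.items():
                e = tuple(a + b for a, b in zip(e1, e2))  # multiply monomials
                out[e] = out.get(e, 0) + c1 * c2
        return Poly(out)

    def __eq__(self, other):
        return isinstance(other, Poly) and self.terms == other.terms

    def __repr__(self):
        return f"Poly({self.terms!r})"
```

With `x = Poly({(1, 0): 1})` and `y = Poly({(0, 1): 1})`, the product `(x + y) * Poly({(1, 0): 1, (0, 1): -1})` yields the terms of x^2 - y^2, because the mixed xy terms cancel and zero coefficients are dropped.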
6.
With the advancement of new processor and memory architectures, supercomputers with multicore and multinode architectures have become general tools for large-scale engineering and scientific simulations. However, the nonuniform latencies between intranode and internode communications on these machines introduce new challenges that must be addressed to achieve optimal performance. In this paper, a novel hybrid solver especially designed for supercomputers with multicore and multinode architectures is proposed. The new hybrid solver is characterized by a two-level parallel computing approach based on the strategies of two-level partitioning and two-level condensation. It distinguishes between intranode and internode communications to minimize communication overheads. Moreover, it further reduces the size of the interface equation system to improve its convergence rate. Three numerical experiments of structural linear static analysis were conducted on the DAWNING-5000A supercomputer to demonstrate the validity and efficiency of the proposed method. Test results show that the proposed approach outperformed the conventional Schur complement method. Copyright © 2014 John Wiley & Sons, Ltd.
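For reference, the conventional Schur complement (static condensation) step that the proposed solver is compared against can be written for a single subdomain as follows, splitting its unknowns into interior degrees of freedom $u_I$ and interface degrees of freedom $u_B$ (the notation is assumed here, not taken from the paper):

```latex
\begin{aligned}
&\begin{bmatrix} K_{II} & K_{IB} \\ K_{BI} & K_{BB} \end{bmatrix}
 \begin{bmatrix} u_I \\ u_B \end{bmatrix}
 = \begin{bmatrix} f_I \\ f_B \end{bmatrix}, \\
&u_I = K_{II}^{-1}\,(f_I - K_{IB}\,u_B), \\
&\underbrace{(K_{BB} - K_{BI} K_{II}^{-1} K_{IB})}_{S}\, u_B = f_B - K_{BI} K_{II}^{-1} f_I .
\end{aligned}
```

Only the interface system in $S$ has to be solved across processes; the interior unknowns are then recovered locally from the second line. One plausible reading of the abstract's "two-level condensation" is that this elimination is applied once within each node and again across nodes, so that the global interface system becomes smaller, but the paper's exact formulation may differ.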
7.
This paper surveys research on power management techniques for high-performance systems, including both commercial high-performance clusters and scientific high-performance computing (HPC) systems. Power consumption has risen rapidly to an intolerable scale, resulting in both high operating costs and high failure rates; it is now a major concern and poses new challenges for the development of high-performance systems. In this paper, we first review the basic mechanisms that underlie power management techniques. We then survey two fundamental techniques for power management: metrics and profiling. After that, we review the research for the two major types of high-performance systems: commercial clusters and supercomputers. Building on this, we discuss the new opportunities and problems presented by the recent adoption of virtualization techniques and present the most recent research in this area. Finally, we summarize and discuss future research directions. Copyright © 2010 John Wiley & Sons, Ltd.
8.
Anubhav Jain, Shyue Ping Ong, Wei Chen, Bharat Medasani, Xiaohui Qu, Michael Kocher, Miriam Brafman, Guido Petretto, Gian-Marco Rignanese, Geoffroy Hautier, Daniel Gunter, Kristin A. Persson. Concurrency and Computation, 2015, 27(17): 5037-5059
This paper introduces FireWorks, workflow software for running high-throughput calculation workflows at supercomputing centers. FireWorks has been used to complete over 50 million CPU-hours' worth of computational chemistry and materials science calculations at the National Energy Research Scientific Computing Center. It has been designed to serve the demanding high-throughput computing needs of these applications, with extensive support for (i) concurrent execution through job packing, (ii) failure detection and correction, (iii) provenance and reporting for long-running projects, (iv) automated duplicate detection, and (v) dynamic workflows (i.e., modifying the workflow graph during runtime). We have found that these features are highly relevant to enabling modern data-driven and high-throughput science applications, and we discuss our implementation strategy, which rests on Python and NoSQL databases (MongoDB). Finally, we present performance data and limitations of our approach along with planned future work. Copyright © 2015 John Wiley & Sons, Ltd.
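Two of the listed features, automated duplicate detection and dependency-ordered execution, can be illustrated with a small Python sketch. This is not the FireWorks API: the function names, the SHA-256 fingerprint of each task's JSON specification used as the duplicate key, and the plain dictionary standing in for the MongoDB result store are all assumptions made for illustration. Job packing, failure handling, and runtime modification of the graph are omitted, and parent results are not passed to child tasks.

```python
import hashlib
import json
from graphlib import TopologicalSorter


def spec_key(spec):
    """Stable fingerprint of a task spec, used for duplicate detection."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_workflow(tasks, deps, execute, store):
    """Run tasks in dependency order, reusing stored results for duplicate specs.

    tasks:   task name -> spec (JSON-serializable dict)
    deps:    task name -> set of parent task names
    execute: callable(spec) -> result
    store:   dict standing in for a persistent result database
    Returns (results by task name, number of tasks actually executed).
    """
    graph = {name: deps.get(name, set()) for name in tasks}
    results, executed = {}, 0
    for name in TopologicalSorter(graph).static_order():  # parents come first
        key = spec_key(tasks[name])
        if key in store:  # duplicate: reuse the earlier result
            results[name] = store[key]
        else:
            results[name] = store[key] = execute(tasks[name])
            executed += 1
    return results, executed
```

Running the same workflow twice against the same `store` executes nothing the second time, because every task's fingerprint is already present; a real system would also have to decide how to treat specs that differ only in irrelevant metadata.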
9.
Kasper Peeters. Computer Physics Communications, 2007, 176(8): 550-558
Field theory is an area in physics with a deceptively compact notation. Although general-purpose computer algebra systems, built around generic list-based data structures, can be used to represent and manipulate field-theory expressions, this often leads to cumbersome input formats, unexpected side effects, or the need for a lot of special-purpose code. This makes a direct translation of problems from paper to computer and back needlessly time-consuming and error-prone. A prototype computer algebra system is presented which features TeX-like input, graph data structures, lists with Young-tableaux symmetries, and a multiple-inheritance property system. The usefulness of this approach is illustrated with a number of explicit field-theory problems.
10.
In this article, an improved prescribed performance adaptive control strategy is developed to handle the output tracking control problem for a class of high-order nonlinear systems with actuator faults. The actuator faults considered include bias-fault and gain-fault models. The technique of adding a power integrator is used to handle the controller design for the high-order system. With the help of backstepping and classical adaptive control, an output tracking control scheme is proposed that guarantees that all signals of the closed-loop system are bounded and that the tracking error converges to a predetermined region in finite time. Finally, the feasibility of the presented control method is verified through simulation results.
11.
Today, cluster-based computing is the mainstream architecture for high-end computer systems. Balanced system design is critical for large-scale cluster systems to achieve high efficiency. This paper describes practical experience with DeepComp high-end computer systems in pursuing a balanced system design, and presents methodologies for designing balanced large-scale cluster systems. A method for balancing the central processing unit (CPU) and the memory hierarchy is described. For balancing compute nodes and I/O systems, two approaches are given: a maximum bandwidth criterion and the maximum number of compute nodes that can concurrently access the I/O systems. Experience with Lenovo high-end cluster systems shows that these methods are effective. Lenovo's strategies toward a balanced system design for both petascale and 10-petascale high productivity computing systems (HPCSs) are also presented.
12.
Despite using multiple concurrent processors, a typical high-performance parallel application is long-running, taking hours or even days to arrive at a solution. To modify a running high-performance parallel application, the programmer has to stop the computation, change the code, redeploy, and enqueue the updated version to be scheduled to run, wasting not only the programmer's time but also expensive computing resources. To address these inefficiencies, this article describes how dynamic software updates (DSU) can be used to modify a parallel application on the fly, saving the programmer's time and using expensive computing resources more productively. Updating parallel applications dynamically can reduce the total time that elapses between posing a problem and arriving at a solution, otherwise known as time-to-discovery. To explore the benefits of dynamic updates for high-performance applications, this article takes a two-pronged approach. First, we describe our experiences of building and evaluating a system for dynamically updating applications running on a parallel cluster. We then review a large body of literature describing the existing state of the art in DSU and point out how this research can be applied to high-performance applications. Our experimental results indicate that DSU has the potential to become a powerful tool for reducing time-to-discovery in high-performance parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.
13.
Salman Pervez, Ganesh Gopalakrishnan, Robert M. Kirby, Rajeev Thakur, William Gropp. Software, 2010, 40(1): 23-43
There is a growing need to address the complexity of verifying the numerous concurrent protocols employed in high-performance computing software. Today's verification approaches consist of testing detailed implementations of these protocols. Unfortunately, this approach can seldom show the absence of bugs, and serious bugs often escape into deployed software. An approach called model checking has been shown to be eminently helpful in debugging such protocols early in the software life cycle, because it can represent and exhaustively analyze simplified formal protocol models. However, the effectiveness of model checking has yet to be adequately demonstrated in high-performance computing. This paper presents a case study of a concurrent protocol that was thought to be sufficiently well tested but proved to contain two highly non-obvious deadlocks. These bugs were automatically detected through model checking, and the protocol models in which they were detected were also easy to create. Recent work in our group demonstrates that even this tedium of model creation can be eliminated by employing dynamic source-code-level analysis methods. Our case study comes from the important domain of Message Passing Interface (MPI)-based programming, which is universally employed for simulating and predicting anything from the structural integrity of combustion chambers to the path of hurricanes. Given this and similar success stories, we argue that model checking must be both taught and widely used within HPC. Copyright © 2009 John Wiley & Sons, Ltd.
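The idea of exhaustively analyzing a simplified protocol model can be illustrated with a toy explicit-state model checker. The sketch below is a hypothetical Python example, not the authors' tool or the MPI protocol from the case study: it explores every interleaving of processes that acquire and release locks, and reports reachable states in which some process is unfinished but none can move.

```python
from collections import deque


def find_deadlocks(programs):
    """Explore every interleaving of `programs` and return deadlocked states.

    Each program is a list of ("acquire", lock) / ("release", lock) steps.
    A state is (program counters, frozenset of (lock, owner) pairs).
    A deadlock is a reachable state in which some program is unfinished
    but no program can take a step.
    """
    start = (tuple(0 for _ in programs), frozenset())
    seen, queue, deadlocks = {start}, deque([start]), []
    while queue:
        pcs, owners = queue.popleft()
        held = {lock for lock, _ in owners}
        successors = []
        for pid, prog in enumerate(programs):
            if pcs[pid] == len(prog):
                continue  # program already finished
            op, lock = prog[pcs[pid]]
            if op == "acquire" and lock in held:
                continue  # blocked: lock is already held
            if op == "acquire":
                new_owners = owners | {(lock, pid)}
            else:
                new_owners = owners - {(lock, pid)}
            new_pcs = pcs[:pid] + (pcs[pid] + 1,) + pcs[pid + 1:]
            successors.append((new_pcs, new_owners))
        unfinished = any(pc < len(p) for pc, p in zip(pcs, programs))
        if unfinished and not successors:
            deadlocks.append(pcs)
        for state in successors:
            if state not in seen:  # visit each state once
                seen.add(state)
                queue.append(state)
    return deadlocks
```

For two processes that take locks `A` and `B` in opposite orders, the search reports the single deadlock in which each process holds one lock and waits for the other; if both take the locks in the same order, the list is empty. Testing might never hit that interleaving, whereas the breadth-first search visits it by construction.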
14.
This paper describes the ARGUS prototype, a high-density, low-power supercomputer built from an IXIA network analyzer chassis and load modules. The prototype is configured as a diskless distributed system that is scalable to 128 processors in a single 9U chassis. The entire system has a footprint of 0.25 m² (2.5 ft²), a volume of 0.09 m³ (3.3 ft³), and a maximum power consumption of less than 2200 W. We compare and contrast the characteristics of ARGUS with those of various machines, including our on-site 32-node Beowulf cluster and LANL's Green Destiny. Our results show that the computing density (Gflops/ft³) of ARGUS is about 30 times higher than that of the Beowulf and about three times higher than that of Green Destiny, with comparable performance. Copyright © 2006 John Wiley & Sons, Ltd.
15.
Enric Tejedor, Montse Farreras, David Grove, Rosa M. Badia, Gheorghe Almasi, Jesus Labarta. Concurrency and Computation, 2012, 24(18): 2421-2448
Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs follows a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one-sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs in terms of both programmability and performance, and compare it with that of the IBM X10 language. Copyright © 2012 John Wiley & Sons, Ltd.
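The data-flow idea, deriving task dependencies from the data each task reads and writes, can be sketched as follows. This is a simplified, hypothetical illustration in Python, not ClusterSs or the StarSs runtime: tasks are given in program order with explicit read and write sets, and the sketch records read-after-write, write-after-write, and write-after-read dependencies. Real runtimes may remove the last two kinds through data renaming, which this sketch does not do.

```python
def build_task_graph(tasks):
    """Derive data-flow dependencies from per-task read and write sets.

    tasks: list of (name, reads, writes) tuples in program order.
    Returns: task name -> set of earlier task names it must wait for.
    """
    last_writer = {}  # datum -> most recent task that wrote it
    readers = {}      # datum -> tasks that read it since that write
    deps = {}
    for name, reads, writes in tasks:
        waits = set()
        for d in reads:  # read-after-write (true dependence)
            if d in last_writer:
                waits.add(last_writer[d])
        for d in writes:
            if d in last_writer:  # write-after-write
                waits.add(last_writer[d])
            waits |= readers.get(d, set())  # write-after-read
        deps[name] = waits
        for d in reads:
            readers.setdefault(d, set()).add(name)
        for d in writes:  # this task becomes the latest writer
            last_writer[d] = name
            readers[d] = set()
    return deps
```

For example, if `t1` writes `a`, `t2` and `t3` each read `a` and write their own outputs, and `t4` reads those outputs and then overwrites `a`, then `t2` and `t3` depend only on `t1` and can run concurrently, while `t4` must wait for all three.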
16.
Achille Peternier, Cesare Pautasso, Walter Binder, Daniele Bonetta. Concurrency and Computation, 2014, 26(1): 71-97
Although modern computer hardware offers an increasing number of processing elements organized in nonuniform memory access (NUMA) architectures, prevailing middleware engines for executing business processes, workflows, and Web service compositions have not been optimized to properly exploit the abundant processing resources of such machines. Factors limiting performance include inefficient thread scheduling by the operating system, which can result in suboptimal use of system memory and CPU caches, and sequential code sections that cannot take advantage of multiple available cores. In this article, we study the performance of the JOpera process execution engine on recent multicore machines. We first evaluate its performance without any dedicated optimization for multicore hardware, showing that additional cores do not significantly improve performance even though the engine has a multithreaded design. We therefore apply optimizations based on replication, together with an improved, hardware-aware use of the underlying resources such as NUMA nodes and CPU caches. With these optimizations, we achieve speedups ranging from a factor of 2 to a factor of 20 (depending on the target machine) compared with a baseline execution 'as is'. Copyright © 2012 John Wiley & Sons, Ltd.
17.
Video servers are essential in video-on-demand and other multimedia applications. In this paper, we present Odyssey, our high-performance clustered constant-bit-rate (CBR) video server. Odyssey is built from PCs connected by switched Ethernet. It efficiently supports normal play as well as interactive browsing functions such as fast-forward and fast-backward. We designed a set of algorithms for scheduling, synchronization, and admission control that results in high resource utilization. Odyssey is able to deliver a large number of video streams. Copyright © 2003 John Wiley & Sons, Ltd.
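The abstract does not describe Odyssey's admission control algorithm, so the following is only a generic sketch of bandwidth-based admission for constant-bit-rate streams: a new stream is admitted only if the total reserved rate stays within a fixed share of capacity. The class name, the use of integer kbit/s rates, and the 90% reservation limit (leaving headroom for interactive operations such as fast-forward) are all assumptions.

```python
class AdmissionController:
    """Admit constant-bit-rate streams while reserved bandwidth fits a budget."""

    def __init__(self, capacity_kbps, reserve_percent=90):
        # Only part of the capacity is handed out; the rest is headroom.
        self.limit_kbps = capacity_kbps * reserve_percent // 100
        self.active = {}  # stream id -> reserved rate in kbit/s
        self.reserved_kbps = 0

    def request(self, stream_id, rate_kbps):
        """Admit the stream if its rate still fits; return True if admitted."""
        if stream_id in self.active:
            return True
        if self.reserved_kbps + rate_kbps > self.limit_kbps:
            return False
        self.active[stream_id] = rate_kbps
        self.reserved_kbps += rate_kbps
        return True

    def release(self, stream_id):
        """Free the bandwidth held by a finished or stopped stream."""
        self.reserved_kbps -= self.active.pop(stream_id, 0)
```

With 100,000 kbit/s of capacity and 1,500 kbit/s streams, the controller admits 60 streams and rejects the 61st; releasing any stream frees room for one more.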
18.
Hyunjin Lee, Lei Jin, Kiyeon Lee, Socrates Demetriades, Michael Moeng, Sangyeun Cho. Software, 2010, 40(3): 239-258
Simulation is indispensable in computer architecture research. Researchers increasingly rely on detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on system performance. However, the slow speed of conventional execution-driven architecture simulators is a serious impediment to research productivity. This paper describes Two-Phase Trace-driven Simulation (TPTS), a novel framework for fast multicore processor architecture simulation that splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is incurred only once, during the cycle-accurate, simulation-based trace generation phase, and can be omitted in the repeated trace-driven simulations. We report our experience with tsim, an event-driven multicore processor architecture simulator based on the TPTS framework that models a detailed memory hierarchy, interconnect, and coherence protocol. By applying aggressive event filtering, tsim achieves a simulation speed of 146 million simulated instructions per second when running 16-thread parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.
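The two-phase split can be illustrated with a deliberately tiny Python sketch, which is not tsim: phase 1 runs once over a stream of memory addresses, filters out L1 hits (the "uninteresting" events), and records only misses along with the hit cycles elapsed since the previous miss; phase 2 replays that trace cheaply for many different miss latencies. The small fully associative LRU L1, the one-cycle hit cost, and the single fixed miss latency are simplifying assumptions; TPTS itself models the memory hierarchy, interconnect, and coherence protocol in detail.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class TraceEvent:
    gap_cycles: int  # cycles of filtered (uninteresting) events since the last record
    address: int     # address of an L1 miss, i.e. an event kept in the trace


def generate_trace(addresses, l1_lines=4, hit_cycles=1):
    """Phase 1 (run once): filter L1 hits and record only misses.

    A tiny fully associative LRU cache stands in for the detailed simulator.
    Returns the trace and the cycles accumulated after the last miss.
    """
    l1, trace, gap = OrderedDict(), [], 0
    for addr in addresses:
        if addr in l1:
            l1.move_to_end(addr)  # hit: refresh LRU position, fold into gap
            gap += hit_cycles
        else:
            trace.append(TraceEvent(gap, addr))
            gap = 0
            l1[addr] = True
            if len(l1) > l1_lines:
                l1.popitem(last=False)  # evict least recently used line
    return trace, gap


def replay(trace, tail_cycles, miss_latency):
    """Phase 2 (run many times): estimate total cycles for one miss latency."""
    return sum(e.gap_cycles + miss_latency for e in trace) + tail_cycles
```

Because the trace is generated once, sweeping `miss_latency` over many values costs only one pass over the much shorter trace per value. The sketch ignores any feedback of timing on which events occur, a simplification that real trace-driven simulators have to handle more carefully.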
19.
Gang-Youl Jeong. Journal of the Society for Information Display, 2009, 17(9): 723-734
A high-performance, high-efficiency LED-backlight driving system for liquid-crystal-display panels is presented. The proposed LED-backlight driving system is composed of a high-efficiency DC-DC converter capable of operating over a universal AC input voltage range (75-265 V) and a high-performance LED-backlight sector-dimming controller. The high efficiency of the system is achieved by using an asymmetrical half-bridge DC-DC converter that utilizes a new voltage-driven synchronous rectifier, together with an LED-backlight sector-dimming controller that regulates current using lossless power semiconductor switches (MOSFETs). The power semiconductor switches of the proposed DC-DC converter, including the synchronous rectifier switch, operate with zero-voltage switching, achieving high efficiency and low switch voltage stress through asymmetrical PWM and synchronous rectification techniques. To achieve high performance, the proposed driving system performs sector dimming and current regulation using low-cost microcontrollers and MOSFET switching, resulting in high contrast and brightness. A 100-W laboratory prototype was built and tested, and the experimental results verify the feasibility of the proposed system.
20.
Tetsuo Minami, Tomokazu Shiga, Shigeo Mikoshiba, Gerrit Oversluizen. Journal of the Society for Information Display, 2004, 12(2): 191-197
It is well known that the luminous efficiency of PDPs can be improved by increasing the Xe content in the panel. For instance, the efficiency improves by a factor of 1.7 when the Xe content is increased from 3.5% to 30%. However, the sustain pulse voltage then rises from 180 to 230 V, a factor of 1.3. It was found that this increase in the sustain pulse voltage can be suppressed by increasing the sustain pulse frequency, and that high-frequency operation further increases the luminous efficiency. If the Xe content is increased from 3.5% to 30% and the drive pulse frequency is increased from 147 to 313 kHz, the luminous efficiency becomes 2.7 times higher and the luminance 4.5 times higher, while the increase in the sustain pulse voltage is limited to a factor of 1.1, from 180 to 200 V. The mechanism for attaining high efficiency and low-voltage operation is thought to be as follows. A train of pulses is applied during the sustain period. As the sustain pulse frequency increases, the pulses follow one another more quickly, and a portion of the space charge created by the previous pulse remains when the next pulse is applied. Owing to the priming effect of this space charge, the discharge current builds up faster, the discharge current pulse becomes narrower, ion-heating loss is reduced, and the effective electron temperature is optimized so that Xe atoms are excited more efficiently. The intensity of the Xe 147-nm radiation, which is dominant in low-pressure Xe discharges, saturates with respect to electron density because of plasma saturation; this sets the upper limit of the sustain pulse frequency.