Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Since its release, the Java programming language has attracted considerable attention from the high‐performance computing (HPC) community because of its portability, high programming productivity, and built‐in multithreading and networking support. As a consequence, several initiatives have been taken to develop a high‐performance Java message‐passing library to program distributed memory architectures, such as clusters. The performance of Java message‐passing applications relies heavily on communication performance. Thus, the design and implementation of low‐level communication devices that support message‐passing libraries is an important research issue in Java for HPC. MPJ Express is our Java message‐passing implementation for developing high‐performance parallel Java applications. Its public release currently contains three communication devices: the first is built using the Java New Input/Output (NIO) package for TCP/IP; the second is specifically designed for the Myrinet Express library on Myrinet; and the third supports thread‐based shared memory communications. Although these devices have been successfully deployed in many production environments, previous performance evaluations of MPJ Express suggest that the buffering layer, tightly coupled with these devices, incurs a certain degree of copying overhead, which represents one of the main performance penalties. This paper presents a more efficient Java message‐passing communications device, based on Java Input/Output sockets, that avoids this buffering overhead. Moreover, this device implements several strategies, both in the communication protocol and in the HPC hardware support, which optimize Java message‐passing communications. To evaluate its benefits, this paper compares the performance of this device with that of other Java and native message‐passing libraries on various high‐speed networks, such as Gigabit Ethernet, Scalable Coherent Interface, Myrinet, and InfiniBand, as well as in a shared memory multicore scenario. The reported communication overhead reduction encourages the upcoming incorporation of this device in MPJ Express (http://mpj-express.org). Copyright © 2011 John Wiley & Sons, Ltd.

2.
Providing high‐performance inter‐node communication is a key capability for running high‐performance computing applications efficiently on parallel architectures. In fact, current system deployments aggregate a significant number of cores interconnected via advanced networking hardware with Remote Direct Memory Access (RDMA) mechanisms that enable zero‐copy and kernel‐bypass features. The use of Java for parallel programming is becoming more promising thanks to some useful characteristics of this language, particularly its built‐in multithreading support, portability, easy‐to‐learn properties, and high productivity, along with the continuous increase in the performance of the Java virtual machine. However, current parallel Java applications generally suffer from inefficient communication middleware, mainly based on protocols with high communication overhead that do not take full advantage of RDMA‐enabled networks. This paper presents efficient low‐level Java communication devices that overcome these constraints by fully exploiting the underlying RDMA hardware, providing low‐latency and high‐bandwidth communications for parallel Java applications. The performance evaluation conducted on representative RDMA networks and parallel systems has shown significant point‐to‐point performance increases compared with previous Java communication middleware, yielding up to a 40% improvement in application‐level performance on 4096 cores of a Cray XE6 supercomputer. Copyright © 2015 John Wiley & Sons, Ltd.

3.
Although OSGi was designed with embedded systems in mind, its current support is insufficient for coping with one main characteristic of many embedded systems: real‐time performance. This article analyzes key issues in providing OSGi with real‐time Java performance, covering motivational issues, different ways of integrating the two, and the challenges stemming from the integration. It also contributes a general framework for introducing real‐time performance in OSGi, which is called the real‐time for OSGi framework. The framework uses real‐time Java virtual machines and the real‐time specification for Java. The adoption of this framework allows cyber‐physical systems to experience real‐time Java performance in their applications. The framework introduces several integration levels for OSGi and the real‐time specification for Java, as well as specific real‐time OSGi services. An empirical implementation was carried out using standard software, which was extended with the newly defined services. Copyright © 2012 John Wiley & Sons, Ltd.

4.
The Java language is popular because of its platform independence, making it useful in many technologies ranging from embedded devices to high‐performance systems. The platform‐independent property of Java, which is visible at the Java bytecode level, is only made possible thanks to the availability of a Virtual Machine (VM), which needs to be designed specifically for each underlying hardware platform. More specifically, the same Java bytecode should run properly on a 32‐bit or a 64‐bit VM. In this paper, we compare the behavioral characteristics of 32‐bit and 64‐bit VMs using a large set of Java benchmarks. This is done using the Jikes Research VM as well as the IBM JDK 1.4.0 production VM on a PowerPC‐based IBM machine. By running the PowerPC machine in both 32‐bit and 64‐bit mode we are able to compare 32‐bit and 64‐bit VMs. We conclude that the space an object takes in the heap in 64‐bit mode is 39.3% larger on average than in 32‐bit mode. We identify three reasons for this: (i) the larger pointer size, (ii) the increased header and (iii) the increased alignment. The minimally required heap size is 51.1% larger on average in 64‐bit than in 32‐bit mode. From our experimental setup using hardware performance monitors, we observe that 64‐bit computing typically results in a significantly larger number of data cache misses at all levels of the memory hierarchy. In addition, we observe that when a sufficiently large heap is available, the IBM JDK 1.4.0 VM is 1.7% slower on average in 64‐bit mode than in 32‐bit mode. Copyright © 2005 John Wiley & Sons, Ltd.
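
The three causes of object growth named in this abstract (pointer size, header size, alignment) can be illustrated with a small size model. This is only a sketch: the header sizes, alignments, and the example object below are hypothetical values chosen for illustration, not the layouts of the Jikes Research VM or the IBM JDK, and the result is not meant to reproduce the 39.3% average reported above.

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def object_size(ref_fields: int, primitive_bytes: int,
                pointer_bytes: int, header_bytes: int, alignment: int) -> int:
    """Heap footprint of one object: header + references + primitives, aligned."""
    raw = header_bytes + ref_fields * pointer_bytes + primitive_bytes
    return align_up(raw, alignment)


# Hypothetical layouts (assumptions for illustration only).
LAYOUT_32 = dict(pointer_bytes=4, header_bytes=8, alignment=4)
LAYOUT_64 = dict(pointer_bytes=8, header_bytes=16, alignment=8)

# A made-up object with two reference fields and one 4-byte int field.
size_32 = object_size(ref_fields=2, primitive_bytes=4, **LAYOUT_32)
size_64 = object_size(ref_fields=2, primitive_bytes=4, **LAYOUT_64)
growth_percent = (size_64 - size_32) / size_32 * 100

print(size_32, size_64, growth_percent)
```

Under these assumed layouts, the 64‐bit object pays for wider references, a larger header, and padding up to the coarser alignment boundary; objects dominated by primitive fields would grow much less.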

5.
As an object‐oriented programming language and a platform‐independent environment, Java has been attracting much attention. However, the trade‐off between portability and performance has not spared Java. The initial performance of Java programs has been poor, due to the interpretive nature of the environment. In this paper we present the communication performance results of three different types of message‐passing programs: native, Java and native communications, and pure Java. Despite concerns about performance and numerical issues, we believe the obtained results confirm that high‐performance parallel computing in Java is possible, as the technology matures and the approach is pragmatic. Copyright © 2000 John Wiley & Sons, Ltd.

6.
Data‐driven programming models such as many‐task computing (MTC) have been prevalent for running data‐intensive scientific applications. MTC applies over‐decomposition to enable distributed scheduling. To achieve extreme scalability, MTC proposes a fully distributed task scheduling architecture that employs as many schedulers as there are compute nodes to make scheduling decisions. Achieving distributed load balancing and best exploiting data locality are two important goals for the best performance of distributed scheduling of data‐intensive applications. Our previous research proposed a data‐aware work‐stealing technique to optimize both load balancing and data locality by using both dedicated and shared task ready queues in each scheduler. Tasks were organized in queues based on the input data size and location. A distributed key‐value store was used to manage task metadata. We implemented the technique in MATRIX, a distributed MTC task execution framework. In this work, we devise an analytical suboptimal upper bound of the proposed technique, compare MATRIX with other scheduling systems, and explore the scalability of the technique at extreme scales. Results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution. Copyright © 2015 John Wiley & Sons, Ltd.

7.
Programming frameworks are an accepted fixture in the object‐oriented world, motivated by the need for code reuse, developer guidance and restriction. A new trend is emerging where frameworks require domain experts to provide declarations using a domain‐specific language, influencing the structure and behaviour of the resulting application. These mechanisms address concerns such as user privacy. Although many popular open platforms such as Android are based on declaration‐driven frameworks, current implementations provide ad hoc and narrow solutions to concerns raised by their openness to non‐certified developers. Most widely used frameworks fail to address serious privacy leaks and provide the user with little insight into application behaviour. To address these shortcomings, we show that declaration‐driven frameworks can limit privacy leaks, as well as guide developers, independently of the underlying programming paradigm. To do so, we identify concepts that underlie declaration‐driven frameworks and apply them systematically to an object‐oriented language, Java, and a dynamic functional language, Racket. The resulting programming framework generators are used to develop a prototype mobile application, illustrating how we mitigate a common class of privacy leaks. Finally, we explore the possible design choices and propose principles for developing domain‐specific language compilers that produce frameworks, applicable across a spectrum of programming paradigms. Copyright © 2016 John Wiley & Sons, Ltd.

8.
Diminishing returns from increased clock frequencies and instruction‐level parallelism have forced computer architects to adopt architectures that exploit wider parallelism through multiple processor cores. While emerging many‐core architectures have progressed at a remarkable rate, concerns arise regarding the performance and productivity of numerous parallel‐programming tools for application development. Development of parallel applications on many‐core processors often requires developers to familiarize themselves with unique characteristics of a target platform while attempting to maximize performance and maintain correctness of their applications. The family of partitioned global address space (PGAS) programming models comprises the current state of the art in balancing performance and programmability. One such PGAS approach is SHMEM, a lightweight, shared‐memory programming library that has demonstrated high performance and productivity potential for parallel‐computing systems with distributed‐memory architectures. In this paper, we present research, design, and analysis of a new SHMEM infrastructure specifically crafted for low‐level PGAS on modern and emerging many‐core processors featuring dozens of cores and more. Our approach (with a new library known as TSHMEM) is investigated and evaluated atop two generations of Tilera architectures, which are among the most sophisticated and scalable many‐core processors to date, and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE‐Gx and TILEPro many‐core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and applications studies with other communication libraries. Our results with barrier primitives provided by the Tilera libraries show dissimilar performance between the TILE‐Gx and TILEPro; therefore, TSHMEM's barrier design takes an alternative approach and leverages the on‐chip mesh network to provide consistent low‐latency performance. In addition, our experiments with TSHMEM show that naive collective algorithms consistently outperformed linear distributed collective algorithms when executed in an SMP‐centric environment. In leveraging these insights for the design of TSHMEM, our approach outperforms the OpenSHMEM reference implementation, achieves similar or better performance than OpenMP and OSHMPI atop MPICH, and supports similar libraries in delivering high‐performance parallel computing to emerging many‐core systems. Copyright © 2015 John Wiley & Sons, Ltd.

9.
Complex graphical user interfaces (GUIs) that support a large amount of user interaction require a fast response time, a rich set of building blocks for an esthetic look‐and‐feel, and a development environment that supports ongoing change. On the World Wide Web, client‐side technologies offer more of these features than do server‐side solutions. Java and JavaScript are the two most popular languages used for client‐side GUI implementations. Java implementations require a user to download a plug‐in that contains a virtual machine to execute the Java byte‐code. The installation and maintenance of this plug‐in is sometimes an insurmountable barrier to using Java. JavaScript lacks some of the desirable features of Java, such as easy‐to‐use object‐oriented features and a GUI class library, but does not require a plug‐in. We have enhanced JavaScript by implementing a new language, Object‐JavaScript (OJS), and by providing an OJS library of GUI components, thus making it a viable alternative to Java. Copyright © 2000 John Wiley & Sons, Ltd.

10.
Nowadays, there is a strong trend towards rendering to higher‐resolution displays and at high frame rates. This development aims at delivering more detail and better accuracy, but it also comes at a significant cost. Although graphics cards continue to evolve with an ever‐increasing amount of computational power, the speed gain is easily counteracted by increasingly complex and sophisticated shading computations. For real‐time applications, the direct consequence is that image resolution and temporal resolution are often the first candidates to bow to the performance constraints (e.g. although full HD is possible, PS3 and XBox often render at lower resolutions). In order to achieve high‐quality rendering at a lower cost, one can exploit temporal coherence (TC). The underlying observation is that a higher resolution and frame rate do not necessarily imply a much higher workload, but a larger amount of redundancy and a higher potential for amortizing rendering over several frames. In this survey, we investigate methods that make use of this principle and provide practical and theoretical advice on how to exploit TC for performance optimization. These methods not only allow incorporating more computationally intensive shading effects into many existing applications, but also offer exciting opportunities for extending high‐end graphics applications to lower‐spec consumer‐level hardware. To this end, we first introduce the notion and main concepts of TC, including an overview of historical methods. We then describe a general approach, image‐space reprojection, with several implementation algorithms that facilitate reusing shading information across adjacent frames. We also discuss data‐reuse quality and performance related to reprojection techniques. Finally, in the second half of this survey, we demonstrate various applications that exploit TC in real‐time rendering.

11.
In this paper, we present Jcluster, an efficient Java parallel environment that provides some critical services, in particular automatic load balancing and high‐performance communication, for developing parallel applications in Java on a large‐scale heterogeneous cluster. In the Jcluster environment, we implement a task scheduler based on a transitive random stealing (TRS) algorithm. Performance evaluations show that, on a large‐scale cluster, the TRS‐based scheduler lets any idle node obtain a task from another node with far fewer steal attempts than random stealing (RS), a well‐known dynamic load‐balancing algorithm. For communication, we implement a high‐performance PVM‐like and MPI‐like message‐passing interface in pure Java using asynchronous multithreaded transmission. The communication performance of the Jcluster environment, LAM‐MPI, and mpiJava on LAM‐MPI is compared using the Java Grande Forum's pingpong benchmark. Copyright © 2005 John Wiley & Sons, Ltd.

12.
This paper introduces FireWorks, workflow software for running high‐throughput calculation workflows at supercomputing centers. FireWorks has been used to complete over 50 million CPU‐hours worth of computational chemistry and materials science calculations at the National Energy Research Supercomputing Center. It has been designed to serve the demanding high‐throughput computing needs of these applications, with extensive support for (i) concurrent execution through job packing, (ii) failure detection and correction, (iii) provenance and reporting for long‐running projects, (iv) automated duplicate detection, and (v) dynamic workflows (i.e., modifying the workflow graph during runtime). We have found that these features are highly relevant to enabling modern data‐driven and high‐throughput science applications, and we discuss our implementation strategy that rests on Python and NoSQL databases (MongoDB). Finally, we present performance data and limitations of our approach along with planned future work. Copyright © 2015 John Wiley & Sons, Ltd.

13.
Chee Shin Yeo  Rajkumar Buyya 《Software》2006,36(13):1381-1419
In utility‐driven cluster computing, cluster Resource Management Systems (RMSs) need to know the specific needs of different users in order to allocate resources according to their needs. This in turn is vital to achieve service‐oriented Grid computing that harnesses resources distributed worldwide based on users' objectives. Recently, numerous market‐based RMSs have been proposed to make use of real‐world market concepts and behavior to assign resources to users for various computing platforms. The aim of this paper is to develop a taxonomy that characterizes and classifies how market‐based RMSs can support utility‐driven cluster computing in practice. The taxonomy is then mapped to existing market‐based RMSs designed for both cluster and other computing platforms to survey current research developments and identify outstanding issues. Copyright © 2006 John Wiley & Sons, Ltd.

14.
Distributed Java virtual machine (dJVM) systems enable concurrent Java applications to transparently run on clusters of commodity computers. This is achieved by supporting Java's shared‐memory model over multiple JVMs distributed across the cluster's computer nodes. In this work, we describe and evaluate selective dynamic diffing and lazy home allocation, two new runtime techniques that enable dJVMs to efficiently support memory sharing across the cluster. Specifically, the two proposed techniques, either in isolation or in combination, can help reduce the overheads due to message traffic, extra memory space, and high latency of remote memory accesses that such dJVM systems incur when implementing their memory‐coherence protocol. In order to evaluate the performance‐related benefits of dynamic diffing and lazy home allocation, we implemented both techniques in Cooperative JVM (CoJVM), a basic dJVM system we developed in previous work. We then carried out performance comparisons between the basic CoJVM and modified CoJVM versions that use our proposed techniques, for five representative concurrent Java applications (matrix multiply, LU, Radix, fast Fourier transform, and SOR). Our experimental results showed that dynamic diffing and lazy home allocation significantly reduced memory sharing overheads. The reduction resulted in considerable gains in the CoJVM system's performance, ranging from 9% up to 20%, in four out of the five applications, with resulting speedups varying from 6.5 up to 8.1 for an 8‐node cluster of computers. Copyright © 2007 John Wiley & Sons, Ltd.
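
The abstract does not say how selective dynamic diffing works internally. As background only, the sketch below shows the generic twin‐and‐diff idea used by many software distributed shared memory systems: keep an unmodified copy (a "twin") of a memory region, and send only the byte runs that changed to the region's home node. The function names and the 16‐byte region are hypothetical, and this should not be read as CoJVM's actual protocol.

```python
from typing import List, Tuple

Diff = List[Tuple[int, bytes]]  # (offset, changed bytes) runs


def make_diff(twin: bytes, current: bytes) -> Diff:
    """Collect the byte runs where current differs from its twin copy."""
    if len(twin) != len(current):
        raise ValueError("twin and current must have the same length")
    diff: Diff = []
    start = None  # offset where the current run of changed bytes began
    for i, (old, new) in enumerate(zip(twin, current)):
        if old != new:
            if start is None:
                start = i
        elif start is not None:
            diff.append((start, current[start:i]))
            start = None
    if start is not None:  # a run that extends to the end of the region
        diff.append((start, current[start:]))
    return diff


def apply_diff(home_copy: bytearray, diff: Diff) -> None:
    """Patch the home node's copy in place with the received runs."""
    for offset, data in diff:
        home_copy[offset:offset + len(data)] = data


# A node modifies three bytes of a 16-byte region it had cached.
twin = bytes(16)
working = bytearray(twin)
working[3], working[4], working[15] = 7, 8, 1

diff = make_diff(twin, bytes(working))
home = bytearray(16)
apply_diff(home, diff)
print(diff, home == working)
```

Only the changed runs travel over the network, which is the kind of message‐traffic saving the abstract refers to; the cost is the extra memory for the twin, one of the overheads the abstract lists.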

15.
Big data has drawn considerable attention from the computer science research community, business executives, and decision makers. As the volume of data grows, it calls for performance‐oriented data‐intensive processing frameworks such as MapReduce, which can scale computation on large commodity clusters. Hadoop MapReduce processes data in the Hadoop Distributed File System as jobs scheduled according to the YARN fair scheduler and capacity scheduler. However, with advancement and dynamic changes in hardware and operating environments, the performance of clusters is greatly affected. Various efforts in the literature have been made to address the issues of heterogeneity (i.e., clusters consisting of virtual machines and machines with different hardware), network communication, data locality, better resource utilization, and run‐time scheduling. In this paper, we present a survey of the research efforts made so far to improve Hadoop MapReduce scheduling. We classify the scheduling algorithms and techniques proposed in the literature according to the areas they address and present a taxonomy. Furthermore, we also discuss open issues and challenges in the scheduling of MapReduce to improve its performance. Copyright © 2015 John Wiley & Sons, Ltd.

16.
Grid computing technologies are now being largely deployed with the widespread adoption of the Globus Toolkit as the industrial standard Grid middleware. However, its inherent steep learning curve discourages non‐experts from using these technologies. Therefore, to increase the use of Grid computing, it is important to have high‐level tools that simplify the process of remote task execution. In this paper we introduce a middleware, developed on top of the Java Commodity Grid, which offers an object‐oriented, user‐friendly application programming interface for the Java language that eases remote task execution for computationally intensive applications. Copyright © 2006 John Wiley & Sons, Ltd.

17.
Programming for large‐scale, multicore‐based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function‐level parallelism that targets productivity. StarSs deploys a data‐flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one‐sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs, both in terms of programmability and performance and compare it to that of the IBM X10 language. Copyright © 2012 John Wiley & Sons, Ltd.

18.
Java just‐in‐time compilers often compile only hot methods because the compilation overhead is a part of the running time. This requires precise and efficient hot spot detection, which includes distinguishing hot methods from cold ones, detecting them as early as possible, and paying a small detection overhead. Hot spot detection is especially important in embedded applications because they behave more like the start‐up phase of a regular application, in which methods are not executed heavily and the hot methods are therefore not clear‐cut. Because a long‐running method is likely to be a hot method, we can detect a hot method by measuring its running time during interpretation. However, precise measurement of the running time during execution is too expensive, especially in embedded systems, so many counter‐based heuristics, such as Oracle's HotSpot heuristic, have been proposed to estimate it. One problem is that although the overhead of these heuristics is low, they do not estimate the running time precisely, which may lead to imprecise hot spot detection. This paper proposes a new hot spot detection heuristic called flow‐sensitive runtime estimation, which can estimate the running time more precisely than others with a relatively low overhead. It only counts important bytecode instructions dynamically, but it can obtain the precise count of all interpreted bytecode instructions with a simple arithmetic calculation. We also propose a static analysis technique to predict those hot methods that spend a huge amount of execution time once invoked, so as to compile them at their first invocation. Our experimental results show that these techniques can improve performance by an average of 7.4% compared with the HotSpot heuristic for the benchmarks when they run once, which is often regarded as showing the start‐up phase behavior. Even for real embedded Java applications such as the digital TV Java Xlet applications, our techniques can improve the user response time by an average of 7.1%. Copyright © 2015 John Wiley & Sons, Ltd.
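
The abstract states that counting only a few "important" instructions can still yield an exact total through simple arithmetic, but it gives no details. One way such a scheme could work, shown here purely as an assumption, is to count basic‐block entries and multiply each count by the block's statically known length. The block names, lengths, and threshold below are made up; this is not the paper's flow‐sensitive runtime estimation algorithm.

```python
from typing import Dict, Iterable


def estimate_instructions(block_lengths: Dict[str, int],
                          trace: Iterable[str]) -> int:
    """Total interpreted instructions, derived from block-entry counts only."""
    entry_counts: Dict[str, int] = {}
    for block in trace:  # one counter update per block entry, not per instruction
        entry_counts[block] = entry_counts.get(block, 0) + 1
    # Every instruction in a basic block runs once per entry, so the exact
    # total is the sum over blocks of (entries * static block length).
    return sum(block_lengths[b] * n for b, n in entry_counts.items())


def is_hot(estimate: int, threshold: int) -> bool:
    """Flag a method for compilation once its estimated work passes a threshold."""
    return estimate >= threshold


# Hypothetical method: 4-instruction prologue, 10-instruction loop body
# executed 50 times, 2-instruction epilogue.
block_lengths = {"entry": 4, "loop_body": 10, "exit": 2}
trace = ["entry"] + ["loop_body"] * 50 + ["exit"]

estimate = estimate_instructions(block_lengths, trace)
print(estimate, is_hot(estimate, threshold=500))
```

A plain invocation counter would record this method as called once, whereas the block‐based estimate reflects the work done inside its loop, which is the imprecision in counter‐based heuristics that the abstract points to.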

19.
A significant number of real‐time control applications include computational activities where the results have to be delivered at precise instants, rather than within a deadline. The performance of such systems significantly degrades if outputs are generated before or after the desired target time. This work presents a general methodology that can be used to design and analyze target‐sensitive applications in which the timing parameters of the computational activities are tightly coupled with the physical characteristics of the system to be controlled. For the sake of clarity, the proposed methodology is illustrated through a sample case study used to show how to derive and verify real‐time constraints from the mission requirements. Software implementation issues involved in mapping the computational activities into tasks running on a real‐time kernel are also discussed to identify the kernel mechanisms necessary to enforce timing constraints and analyze the feasibility of the application. Finally, a set of experiments is presented to validate the proposed methodology. Copyright © 2015 John Wiley & Sons, Ltd.

20.
Over the past few years, research and development in bioinformatics (e.g. genomic sequence alignment) has grown steadily, fueling continuing demands for vast computing power to support better performance. This trend usually requires solutions involving parallel computing techniques because cluster computing technology reduces execution times and increases genomic sequence alignment efficiency. One example, mpiBLAST, is a parallel version of NCBI BLAST that combines NCBI BLAST with message passing interface (MPI) standards. However, as most laboratories cannot build up powerful cluster computing environments, Grid computing framework concepts have been designed to meet the need. Grid computing environments coordinate the resources of distributed virtual organizations and satisfy the various computational demands of bioinformatics applications. In this paper, we report on designing and implementing a BioGrid framework, called G‐BLAST, that performs genomic sequence alignments using Grid computing environments and accessible mpiBLAST applications. G‐BLAST is also suitable for cluster computing environments with a server node and several client nodes. G‐BLAST is able to select the most appropriate work nodes, dynamically fragment genomic databases, and self‐adjust according to performance data. To enhance G‐BLAST capability and usability, we also employ a WSRF Grid Service Portal and a Grid Service GUI desktop application for general users to submit jobs and host administrators to maintain work nodes. Copyright © 2008 John Wiley & Sons, Ltd.
