
1.

Evaluating ARM HPC clusters for scientific workloads

Jahanzeb Maqbool Sangyoon Oh Geoffrey C. Fox 《Concurrency and Computation》2015,27(17):5390-5410

The power consumption of modern high‐performance computing (HPC) systems that are built using power hungry commodity servers is one of the major hurdles for achieving Exascale computation. Several efforts have been made by the HPC community to encourage the use of low‐powered system‐on‐chip (SoC) embedded processors in large‐scale HPC systems. These initiatives have successfully demonstrated the use of ARM SoCs in HPC systems, but there is still a need to analyze the viability of these systems for HPC platforms before a case can be made for Exascale computation. The major shortcomings of current ARM‐HPC evaluations include a lack of detailed insights about performance levels on distributed multicore systems and performance levels for benchmarking in large‐scale applications running on HPC. In this paper, we present a comprehensive evaluation of results that covers major aspects of server and HPC benchmarking for ARM‐based SoCs. For the experiments, we built an unconventional cluster of ARM Cortex‐A9s that is referred to as Weiser and ran single‐node benchmarks (STREAM, Sysbench, and PARSEC) and multi‐node scientific benchmarks (High‐performance Linpack (HPL), NASA Advanced Supercomputing (NAS) Parallel Benchmark, and Gadget‐2) in order to provide a baseline for performance limitations of the system. Based on the experimental results, we claim that the performance of ARM SoCs depends heavily on the memory bandwidth, network latency, application class, workload type, and support for compiler optimizations. During server‐based benchmarking, we observed that when performing memory intensive benchmarks for database transactions, x86 performed 12% better for multithreaded query processing. However, ARM performed four times better for performance to power ratios for a single core and 2.6 times better on four cores. 
We noticed that emulated double precision floating point in Java ran three to four times slower than C for CPU‐bound benchmarks. Even though Intel x86 performed slightly better in computation‐oriented applications, ARM showed better scalability in I/O‐bound applications for shared memory benchmarks. We incorporated support for ARM in the MPJ‐Express runtime and performed a comparative analysis of two widely used message passing libraries. We obtained similar results for network bandwidth, large‐scale application scaling, floating‐point performance, and energy efficiency for clusters in message passing evaluations (NPB and Gadget‐2 with MPJ‐Express and MPICH). Our findings can be used to evaluate the energy efficiency of ARM‐based clusters for server workloads and scientific workloads and to provide a guideline for building energy‐efficient HPC clusters. Copyright © 2015 John Wiley & Sons, Ltd.

2.

Animal‐borne wireless network: Remote imaging of community ecology

Shinkyu Park Konrad H. Aschenbach Manjur Ahmed William L. Scott Naomi E. Leonard Kyler Abernathy Greg Marshall Mike Shepard Nuno C. Martins 《Journal of Field Robotics》2019,36(6):1141-1165

This article describes the design, construction, and field‐testing of a standalone networked animal‐borne monitoring system conceived to study community ecology remotely. The system consists of an assemblage of identical battery‐powered sensing devices with wireless communication capabilities that are each collar‐mounted on a study animal and together form a mobile ad hoc network. The sensing modalities of each device include high‐definition video, inertial accelerometry, and location resolved via a global positioning system module. Our system is conceived to use information exchange across the network to enable the devices to jointly decide without supervision when and how to use each sensing modality. The ultimate goal is to extend battery life while making sure that important events are appropriately documented. This requires judicious use of highly informative but power‐hungry sensing modalities, such as video, because battery capacity is constrained by stringent weight and dimension restrictions. We have proposed algorithms to regulate sensing rates, data transmission among devices, and triggering for video recording based on location and animal group movements and configuration. We have also developed the hardware and firmware of our devices to reliably execute these algorithms in the exacting conditions of real‐life deployments. We describe validation of the performance and reliability of our system using deployment results for a mission in Gorongosa National Park (Mozambique) to monitor two species in their natural habitat: the waterbuck and the African buffalo. We present movement data and snapshots of animal point‐of‐view videos collected by 14 fully operational devices collared on 10 waterbucks and 4 buffaloes.

3.

Evaluating vector data type usage in OpenCL kernels

Jianbin Fang Ana Lucia Varbanescu Xiangke Liao Henk Sips 《Concurrency and Computation》2015,27(17):4586-4602

Open Computing Language (OpenCL) is an open, functionally portable programming model for a large range of highly parallel processors. To provide users with access to the underlying platforms, OpenCL has explicit support for features such as local memory and vector data types (VDTs). However, these are often low‐level, hardware‐specific features, which can be detrimental to performance on different platforms. In this paper, we focus on VDTs and investigate their usage in a systematic way. First, we propose two different approaches (inter‐vdt and intra‐vdt) to use VDTs in OpenCL kernels, and show how to translate scalar OpenCL kernels to vectorized ones. After obtaining vectorized code, we evaluate the performance effects of using VDTs with two types of benchmarks: micro‐benchmarks and macro‐benchmarks. With micro‐benchmarks, we study the execution model of VDTs and the role of the compiler‐aided vectorizer on five devices. With macro‐benchmarks, we explore the changes of memory access patterns before and after using VDTs, and the resulting performance impact. Our evaluation not only provides insights into how OpenCL's VDTs are mapped on different processors, but also indicates that using such data types introduces changes in both computation and memory accesses. Based on the lessons learned, we discuss how to deal with performance portability in the presence of VDTs. Copyright © 2014 John Wiley & Sons, Ltd.

4.

XML messaging for mobile devices: From requirements to implementation

《Computer Networks》2007,51(16):4634-4654

In recent years, both the number and capabilities of mobile devices have increased rapidly to the point where the mobile world is becoming a significant part of the Internet. Another recent trend is the increase in XML use for communication between applications. However, the mobile world has been reluctant to adopt XML due to its verbosity and processing needs. We consider here the problem of providing an XML-based messaging system for mobile devices. We analyze the requirements that the environment places on such a system and elaborate on these requirements by concentrating on three components that seem most amenable to improvements, namely XML processing interfaces, XML serialization, and message transfer protocols. In tandem with the analysis we also present the design and implementation of our messaging system that addresses these requirements. Our experimentation with this system is extensive and was performed entirely on real devices and real wireless networks. Based on our implementation and experimentation we conclude that there is potential for improvement in XML messaging. The largest gains are achieved by using an asynchronous programming style and by using a compact serialization format. The improvements can also be integrated individually into existing systems.

5.

Energy-efficient multisite offloading policy using Markov decision process for mobile cloud computing

《Pervasive and Mobile Computing》2016

Mobile systems, such as smartphones, are becoming the primary platform of choice for a user’s computational needs. However, mobile devices still suffer from limited resources such as battery life and processor performance. To address these limitations, a popular approach used in mobile cloud computing is computation offloading, where resource-intensive mobile components are offloaded to more resourceful cloud servers. Prior studies in this area have focused on a form of offloading where only a single server is considered as the offloading site. Because mobile devices can now access multiple cloud providers, it is possible for them to save more energy by offloading energy-intensive components to multiple cloud servers. The method proposed in this paper differentiates the data- and computation-intensive components of an application and performs a multisite offloading in a data- and process-centric manner. In this paper, we present a novel model to describe the energy consumption of a multisite application execution and use a discrete time Markov chain (DTMC) to model fading wireless mobile channels. We adopt a Markov decision process (MDP) framework to formulate the multisite partitioning problem as a delay-constrained, least-cost shortest path problem on a state transition graph. Our proposed Energy-efficient Multisite Offloading Policy (EMOP) algorithm, built on a value iteration algorithm (VIA), finds an efficient solution to the multisite partitioning problem. Numerical simulations show that our algorithm considers the different capabilities of sites to distribute appropriate components such that there is a lower energy cost for data transfer from the mobile to the cloud. A multisite offloading execution using our proposed EMOP algorithm achieved a greater reduction in the energy consumption of mobiles when compared to a single site offloading execution.
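The combination of a Markov channel model and value iteration described in this abstract can be illustrated with a toy example. The sketch below is not the EMOP algorithm: it ignores the delay constraint, assumes components run in a fixed order, and uses invented energy figures, site names (`site_a`, `site_b`), and channel probabilities purely to make the code runnable. It only shows how repeated Bellman backups over (component, channel state) pairs yield a minimum-expected-energy placement policy.

```python
# Toy value iteration for multisite offloading. All numbers and names are
# hypothetical; the delay constraint from the abstract is NOT modelled.

LOCAL_ENERGY = [5.0, 1.0, 4.0]  # energy to run component i on the mobile
OFFLOAD_ENERGY = {              # energy to ship component i over a good channel
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [2.5, 2.0, 1.0],
}
BAD_CHANNEL_FACTOR = 3.0        # transfers cost 3x more on a bad channel
CHANNEL_P = {                   # two-state Markov chain for the wireless channel
    "good": {"good": 0.8, "bad": 0.2},
    "bad": {"good": 0.4, "bad": 0.6},
}
ACTIONS = ("mobile",) + tuple(OFFLOAD_ENERGY)


def step_cost(i, channel, action):
    """Energy spent placing component i with the given action and channel."""
    if action == "mobile":
        return LOCAL_ENERGY[i]
    factor = 1.0 if channel == "good" else BAD_CHANNEL_FACTOR
    return OFFLOAD_ENERGY[action][i] * factor


def value_iteration(tol=1e-9, max_sweeps=1000):
    """Sweep Bellman backups until the largest value change is below tol."""
    n = len(LOCAL_ENERGY)
    # State (n, channel) is terminal and keeps value 0.
    values = {(i, c): 0.0 for i in range(n + 1) for c in CHANNEL_P}
    policy = {}
    for _ in range(max_sweeps):
        delta = 0.0
        for i in range(n):
            for c in CHANNEL_P:
                # Expected cost-to-go after the channel moves to its next state.
                future = sum(p * values[(i + 1, c2)]
                             for c2, p in CHANNEL_P[c].items())
                costs = {a: step_cost(i, c, a) + future for a in ACTIONS}
                best = min(costs, key=costs.get)
                delta = max(delta, abs(costs[best] - values[(i, c)]))
                values[(i, c)] = costs[best]
                policy[(i, c)] = best
        if delta < tol:
            break
    return values, policy


values, policy = value_iteration()
for state in sorted(policy):
    print(state, policy[state], round(values[state], 3))
```

With these made-up numbers the policy sends the first component to one site, keeps the cheap second component on the mobile, and sends the third to the other site, which is the kind of per-component, per-site decision the abstract describes.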

6.

ThingsMigrate: Platform‐independent migration of stateful JavaScript Internet of Things applications

Kumseok Jung Julien Gascon‐Samson Shivanshu Goyal Armin Rezaiean‐Asel Karthik Pattabiraman 《Software》2021,51(1):117-155

The Internet of Things (IoT) has gained wide popularity both in academic and industrial contexts. Unlike traditional embedded devices with specialized firmware, modern IoT devices accommodate general‐purpose operating systems, allowing developers to run more sophisticated applications written in high‐level languages like JavaScript. Because IoT devices are subject to resource constraints like available battery power, we need to dynamically migrate a running process between different devices to prevent losing state. However, it is challenging to apply migration techniques using memory snapshots across the heterogeneous pool of IoT devices. We present ThingsMigrate, a middleware providing platform‐independent migration of JavaScript processes across IoT devices. Prior to execution, ThingsMigrate instruments the source code of a given program to expose its internal state. During run‐time, the transformed program produces on demand a JSON snapshot of its current state, from which new code is generated to resume execution. Thus, ThingsMigrate enables process migration entirely in the application space without any modifications to the underlying virtual machine (VM), providing VM‐independence. We present three versions of ThingsMigrate, each building on the previous to optimize for run‐time latency and memory consumption. We report on the experience of building each successive version and discuss the insights gained and the learning outcomes. We evaluated ThingsMigrate against standard benchmarks, over two IoT platforms and a cloud‐like environment. We show that it can migrate even highly CPU‐intensive applications, with average run‐time latency overhead of 33% and memory overhead of 78%. ThingsMigrate supports multiple subsequent migrations without introducing additional overhead over each subsequent migration.

7.

TinyVM: an energy‐efficient execution infrastructure for sensor networks

Kirak Hong Jiin Park Sungho Kim Taekhoon Kim Hwangho Kim Bernd Burgstaller Bernhard Scholz 《Software》2012,42(10):1193-1209

Energy‐efficient implementation techniques for virtual machines (VMs) have so far received little attention: conventional wisdom claims that VMs have an adverse effect on energy consumption, and VM‐based applications are therefore short‐lived. In this paper, we argue that bytecode interpretation is affordable if we synthesize VMs specifically for energy efficiency. We present TinyVM, an execution infrastructure that seamlessly integrates with C and nesC/TinyOS‐based programming environments. TinyVM achieves high code density through the use of compressed bytecode as the primary program representation. Compressed bytecode allows rapid application deployment with low communication overhead. TinyVM executes compressed bytecode in place, which eliminates the need for a decompression stage and thereby reduces memory consumption on sensor nodes. Our infrastructure automates the creation of energy‐efficient application‐specific VMs. Applications are partitioned into machine code, bytecode, and VM instruction set extensions. Partitioning is manually controlled and/or fully guided by a discrete optimization problem that produces a partitioning with lowest energy consumption for a given program size limit. We provide experimental results for sensor network benchmarks and for selected applications on various CPU architectures including Atmega128‐based motes and the ARM‐based Intel iMote2. TinyVM has been released under the GNU General Public License. Copyright © 2011 John Wiley & Sons, Ltd.

8.

EPE‐Mobile—A framework for early performance estimation of mobile applications


Thiago Soares Fernandes Álvaro Freitas Moreira Érika Cota 《Software》2018,48(1):85-104

Considering the constrained resources of mobile devices, a thorough performance evaluation of a mobile application is crucial. However, performance evaluation in the mobile domain is still a manual and time‐consuming task. The diversity of mobile devices only increases the complexity of this task. We propose EPE‐Mobile, a framework to automate early performance estimation in mobile applications. It is composed of a configurable library of basic operations and an engine that automatically creates a synthetic program based on the specification of a new app. The synthetic program that EPE‐Mobile generates provides feedback for mobile developers at the first design stages and before the actual implementation of a new application. The fast evaluation can also guide developers in optimizing their applications or in choosing devices with the best trade‐off between cost and performance to run a given application. Finally, developers can reuse the data collection infrastructure of the framework to collect performance data during all development stages. We validate the proposed framework using 4 applications from the Android Play Store. Based on their specifications, 4 synthetic programs were generated and executed on different devices. We compared the results to those obtained from the execution of the actual applications on the same devices. Experimental results show that it is possible to create synthetic applications with similar behavior to that of real applications and, thus, classify devices based on the actual application needs. The framework uses aspect‐oriented programming to collect the metrics of interest. This approach provides increased modularity and separation of concerns, thus facilitating the improvement of the framework itself, by adding other metrics or basic operations.

9.

Accurate power modeling of modern mobile application processors

《Journal of Systems Architecture》2017

The power modeling of mobile application processors (APs) is a challenging task due to their complexity. The existing power models and their associated devices have mostly been made obsolete by recent hardware developments. In this paper, we propose an enhanced power model for use in modern mobile devices. The model accurately estimates the power consumption of AP components and utilizes the runtime usage information of each hardware component. We evaluated the model accuracy using various benchmarks, as well as popular smartphone applications, on multiple devices that employ different APs. The evaluation shows that our model achieves a mean absolute percentage error (MAPE) of 5.1%.
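For readers unfamiliar with the metric quoted above, MAPE averages the absolute relative error between measured and estimated values and expresses it as a percentage. The sketch below uses made-up power readings in watts, not data from the paper, and the function name `mape` is chosen here for illustration.

```python
def mape(measured, estimated):
    """Mean absolute percentage error between two equal-length sequences."""
    if not measured or len(measured) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(m == 0 for m in measured):
        raise ValueError("MAPE is undefined when a measured value is zero")
    total = sum(abs((m - e) / m) for m, e in zip(measured, estimated))
    return 100.0 * total / len(measured)


# Hypothetical measured vs. model-estimated power draw, in watts.
measured_w = [2.0, 4.0, 5.0]
estimated_w = [2.1, 3.8, 5.0]
print(f"MAPE = {mape(measured_w, estimated_w):.2f}%")
```

Here the three relative errors are 5%, 5%, and 0%, so the result is about 3.33%. A reported MAPE of 5.1% therefore means the model's estimates deviate from measured power by roughly 5% on average.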

10.

Controlling energy without compromising system performance in mobile grid environments

Li Chunlin Li Layuan 《Computers & Electrical Engineering》2010,36(3):503-517

The challenges confronting mobile grid systems are limited CPU power, limited memory, small screens, short battery life, and intermittent disconnection. Considering these limitations, this paper aims to control energy consumption without compromising system performance in a mobile grid. We focus on the use of mobile devices in the mobile grid environment, where they can serve two important functions: as service consumers or as valuable service providers. The proposed approach aims not only to reduce energy consumption but also to improve system performance. Utility functions are used to express grid users’ requirements, resource providers’ benefit functions, and the system’s objectives. Dynamic programming is used to optimize the total utility function of the mobile grid. A distributed energy control algorithm is proposed that decomposes the mobile grid system optimization problem into sub-problems. To verify the efficiency of the proposed algorithm, its performance is evaluated experimentally.

11.

Energy constrained resource allocation optimization for mobile grids

Chunlin Li Layuan Li 《Journal of Parallel and Distributed Computing》2010

A mobile grid incorporates mobile devices into Grid systems. However, mobile devices at present have severe limitations in terms of processing, memory capabilities, and energy. Minimizing the energy usage of mobile devices poses significant challenges in mobile grids. This paper presents energy constrained resource allocation optimization for mobile grids. The goal of the paper is not only to reduce energy consumption, but also to improve the application utility in a mobile grid environment with a limited energy charge, ensuring battery lifetime and the deadlines of the grid applications. The application utility depends not only on its allocated resources, including computation and communication resources, but also on the consumed energy. This leads to a coupled utility model, where the utilities are functions of allocated resources and consumed energy. Energy constrained resource allocation optimization is formulated as a utility optimization problem, which can be decomposed into two subproblems; the interaction between the two subproblems is controlled through a pricing variable. The paper proposes a price-based distributed energy constrained resource allocation optimization algorithm. The performance of the algorithm is evaluated in simulation.

12.

On‐the‐Fly Power‐Aware Rendering


Victor Arellano Rui Wang Diego Gutierrez Hujun Bao 《Computer Graphics Forum》2018,37(4):155-166

Power saving is a prevailing concern in desktop computers and, especially, in battery‐powered devices such as mobile phones. This is generating a growing demand for power‐aware graphics applications that can extend battery life, while preserving good quality. In this paper, we address this issue by presenting a real‐time power‐efficient rendering framework, able to dynamically select the rendering configuration with the best quality within a given power budget. Different from the current state of the art, our method does not require precomputation of the whole camera‐view space, nor Pareto curves to explore the vast power‐error space; as such, it can also handle dynamic scenes. Our algorithm is based on two key components: our novel power prediction model, and our runtime quality error estimation mechanism. These components allow us to search for the optimal rendering configuration at runtime, being transparent to the user. We demonstrate the performance of our framework on two different platforms: a desktop computer, and a mobile device. In both cases, we produce results close to the maximum quality, while achieving significant power savings.

13.

NAS Parallel Benchmarks with CUDA and beyond

Gabriell Araujo Dalvan Griebler Dinei A. Rockenbach Marco Danelutto Luiz G. Fernandes 《Software》2023,53(1):53-80

NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide easy‐to‐use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmark versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.
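To make the threads-per-block parameter concrete: for a one-dimensional CUDA launch, the number of blocks is normally the element count divided by the block size, rounded up, and any threads beyond the element count in the last block do no useful work. The sketch below is plain Python arithmetic, not code from the NPB CUDA implementation; the function name `launch_config` and the problem size are assumptions for illustration.

```python
def launch_config(n_elements, threads_per_block):
    """Return (blocks, idle_threads) for a 1D launch covering n_elements."""
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("both arguments must be positive")
    # Ceiling division: enough blocks so every element gets one thread.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    idle_threads = blocks * threads_per_block - n_elements
    return blocks, idle_threads


# Hypothetical problem size; real NPB classes define their own sizes.
N = 1_000_000
for tpb in (32, 64, 128, 256, 512, 1024):
    blocks, idle = launch_config(N, tpb)
    print(f"threads/block={tpb:4d}  blocks={blocks:6d}  idle threads={idle}")
```

The arithmetic only fixes the grid shape. Which block size runs fastest depends on occupancy, register pressure, and memory access patterns, which is why the paper studies the choice empirically.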

14.

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems

Min Seung-Jai Basumallik Ayon Eigenmann Rudolf 《International journal of parallel programming》2003,31(3):225-249

This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed systems. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in only a few applications. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, result in a 31% average improvement on our benchmarks.

15.

The impact of distributed programming abstractions on application energy consumption

《Information and Software Technology》2013,55(9):1602-1613

With battery capacities remaining a key physical constraint for mobile devices, energy efficiency has become an important software design consideration. Distributed programming abstractions (e.g., sockets, RPC, messages, etc.) are an essential component of modern software, but their energy consumption characteristics are poorly understood. The programmer has few practical guidelines to choose the right abstraction for energy-constrained scenarios. In this article, we report on the findings of a systematic study we conducted to compare and contrast major distributed programming abstractions in terms of their energy consumption patterns. By varying the abstractions with the rest of the functionality fixed, we measure and analyze the impact of distributed programming abstractions on application energy consumption. Based on our findings, we present a set of practical guidelines for the programmer to select an abstraction that satisfies the energy consumption constraints in place. Our other guidelines can steer future efforts in creating energy efficient distributed programming abstractions.

16.

Bringing Scheme programming to the iPhone—Experience

Engineer Bainomugisha Jorge Vallejos Elisa Gonzalez Boix Pascal Costanza Theo D'Hondt Wolfgang De Meuter 《Software》2012,42(3):331-356

The iPhone SDK provides a powerful platform for the development of applications that make use of iPhone capabilities, such as sensors, GPS, Wi‐Fi, or Bluetooth connectivity. We observe that so far the development of iPhone applications has mostly been restricted to using Objective‐C. However, developing applications in plain Objective‐C on the iPhone OS suffers from limitations, such as the need for explicit memory management and the lack of a syntactic extension mechanism. Moreover, when developing distributed applications in Objective‐C, programmers have to manually deal with distribution concerns, such as service discovery, remote communication, and failure handling. In this paper, we discuss our experience in porting the Scheme programming language to the iPhone OS and how it can be used together with Objective‐C to develop iPhone applications. To support the interaction between Scheme programs and the underlying iPhone APIs, we have implemented a language symbiosis layer that enables programmers to access the iPhone SDK libraries from Scheme. In addition, we have designed high‐level distribution constructs to ease the development of distributed iPhone applications in an event‐driven style. We validate and discuss these constructs with a series of examples, including an iPod controller, a maps application, and a distributed multiplayer Scrabble‐like game. We discuss the lessons learned from this experience for other programming language ports to mobile platforms. Copyright © 2011 John Wiley & Sons, Ltd.

17.

CAD data visualization on mobile devices using sequential constrained Delaunay triangulation

Sang Wook Yang Hyun Chan Lee 《Computer aided design》2009,41(5):375-384

3D graphic rendering in mobile application programs is becoming increasingly popular with rapid advances in mobile device technology. Current 3D graphic rendering engines for mobile devices do not provide triangulation capabilities for surfaces; therefore, mobile 3D graphic applications have been dealing only with pre-tessellated geometric data. Since triangulation is comparatively expensive in terms of computation, real-time tessellation cannot be easily implemented on mobile devices with limited resources. No research has yet been reported on real-time triangulation on mobile devices. In this paper, we propose a real-time triangulation algorithm for visualization on mobile devices based on sequential constrained Delaunay triangulation. We apply a compact data structure and a sequential triangulation process for visualization of CAD data on mobile devices. In order to achieve a high performance and compact implementation of the triangulation, the nature of the CAD data is fully considered in the computational process. This paper also presents a prototype implementation for a mobile 3D CAD viewer running on a handheld Personal Digital Assistant (PDA).

18.

Vector data flow analysis for SIMD optimizations on OpenCL programs

Yu‐Te Lin Jenq‐Kuen Lee 《Concurrency and Computation》2016,28(5):1629-1654

Multi‐core systems equipped with micro processing units and accelerators such as digital signal processors (DSPs) and graphics processing units (GPUs) have become a major trend in processor design in recent years in attempts to meet ever‐increasing application performance requirements. Open Computing Language (OpenCL) is one of the programming languages that include new extensions proposed to exploit the computing power of these kinds of processors. Among the newly extended language features, the single‐instruction multiple‐data (SIMD) linguistics and vector types are added to OpenCL to exploit hardware features of the accelerators. The addition makes it necessary to consider how traditional compiler data flow analysis can be adapted to meet the optimization requirements of vector linguistics. In this paper, we propose a calculus framework to support the data flow analysis of vector constructs for OpenCL programs that compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions. We then show that the data flow analysis for OpenCL vector linguistics can be performed based on the data access functions. Based on the information gathered from data flow analysis, we illustrate a set of SIMD optimizations on OpenCL programs. The experimental results incorporating our calculus and our proposed compiler optimizations show that the proposed SIMD optimizations can provide average performance improvements of 22% on x86 CPUs and 4% on Advanced Micro Devices (AMD) GPUs. Of the 15 selected benchmarks, 11 are improved on x86 CPUs, and six are improved on AMD GPUs. The proposed framework has the potential to be used to construct other SIMD optimizations on OpenCL programs. Copyright © 2015 John Wiley & Sons, Ltd.

19.

A dynamic binary translation system in a client/server environment

《Journal of Systems Architecture》2015,61(7):307-319

With rapid advances in mobile computing, multi-core processors and expanded memory resources are being made available in new mobile devices. This trend will allow a wider range of existing applications to be migrated to mobile devices, for example, running desktop applications in IA-32 (x86) binaries on ARM-based mobile devices transparently using dynamic binary translation (DBT). However, the overall performance could significantly affect the energy consumption of the mobile devices because it is directly linked to the number of instructions executed and the overall execution time of the translated code. Hence, even though the capability of today’s mobile devices will continue to grow, the concern over translation efficiency and energy consumption will put more constraints on a DBT for mobile devices, in particular for thin mobile clients, than for servers. Increasing network accessibility and bandwidth in various environments make many network servers highly accessible to thin mobile clients. Those network servers are usually equipped with a substantial amount of resources. This provides an opportunity for DBT on thin clients to leverage such powerful servers. However, designing such a DBT for a client/server environment requires many critical considerations. In this work, we looked at those design issues and developed a distributed DBT system based on a client/server model. It consists of two dynamic binary translators: an aggressive dynamic binary translator/optimizer on the server that services the translation/optimization requests from thin clients, and a thin DBT on each thin client that performs lightweight binary translation and basic emulation functions on its own. With such a two-translator client/server approach, we successfully off-load the DBT overhead of the thin client to the server and achieve a significant performance improvement over the non-client/server model. 
Experimental results show that the DBT of the client/server model could achieve 37% and 17% improvement over that of the non-client/server model for x86/32-to-ARM emulation using MiBench and SPEC CINT2006 benchmarks with test inputs, respectively, and 84% improvement using SPLASH-2 benchmarks running two emulation threads.

20.

PSkel: A stencil programming framework for CPU‐GPU systems

Alyson D. Pereira Luiz Ramos Luís F. W. Góes 《Concurrency and Computation》2015,27(17):4938-4953

The use of Graphics Processing Units (GPUs) for high‐performance computing has gained growing momentum in recent years. Unfortunately, GPU‐programming platforms like Compute Unified Device Architecture (CUDA) are complex, user unfriendly, and increase the complexity of developing high‐performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU‐GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high‐level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high‐level abstraction for stencil programming on heterogeneous CPU‐GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU‐only and GPU‐only parallel applications, respectively. Copyright © 2015 John Wiley & Sons, Ltd.
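The idea of splitting one stencil computation between two devices can be shown with a small example. The sketch below is not PSkel's API: it is a plain Python 3-point averaging stencil in which the names `stencil_1d`, `partitioned_stencil`, and `gpu_fraction` are invented for illustration, and both partitions run sequentially on the host. It demonstrates only the partitioning logic: each share reads one neighbouring (halo) cell from the other share, so the concatenated result matches the unpartitioned one.

```python
def stencil_1d(data, start, stop):
    """3-point average over data[start:stop]; neighbours outside data count as 0."""
    n = len(data)
    out = []
    for i in range(start, stop):
        left = data[i - 1] if i > 0 else 0.0
        right = data[i + 1] if i < n - 1 else 0.0
        out.append((left + data[i] + right) / 3.0)
    return out


def partitioned_stencil(data, gpu_fraction):
    """Split the index range into a 'CPU' share and a 'GPU' share.

    Both shares are computed on the host here; in a real heterogeneous
    runtime each would be dispatched to its own device.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    split = int(len(data) * (1.0 - gpu_fraction))
    cpu_part = stencil_1d(data, 0, split)          # reads halo cell data[split]
    gpu_part = stencil_1d(data, split, len(data))  # reads halo cell data[split - 1]
    return cpu_part + gpu_part


grid = [float(i * i) for i in range(8)]
whole = stencil_1d(grid, 0, len(grid))
mixed = partitioned_stencil(grid, gpu_fraction=0.75)
print(mixed == whole)
```

Because both shares read from the same input array, no explicit halo copy is needed in this sketch. On separate devices the boundary cells would have to be exchanged between device memories after every iteration, which is one of the costs a partitioning ratio has to balance.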