Similar Documents
20 similar documents found.
1.
Open Computing Language (OpenCL) is an open, functionally portable programming model for a large range of highly parallel processors. To provide users with access to the underlying platforms, OpenCL has explicit support for features such as local memory and vector data types (VDTs). However, these are often low-level, hardware-specific features, which can be detrimental to performance on different platforms. In this paper, we focus on VDTs and investigate their usage in a systematic way. First, we propose two different approaches (inter-vdt and intra-vdt) to using VDTs in OpenCL kernels, and show how to translate scalar OpenCL kernels into vectorized ones. After obtaining vectorized code, we evaluate the performance effects of using VDTs with two types of benchmarks: micro-benchmarks and macro-benchmarks. With micro-benchmarks, we study the execution model of VDTs and the role of the compiler-aided vectorizer on five devices. With macro-benchmarks, we explore the changes in memory access patterns before and after using VDTs, and the resulting performance impact. Not only does our evaluation provide insights into how OpenCL's VDTs are mapped onto different processors, but it also indicates that using such data types introduces changes in both computation and memory accesses. Based on the lessons learned, we discuss how to deal with performance portability in the presence of VDTs. Copyright © 2014 John Wiley & Sons, Ltd.
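To make the kernel translation concrete, here is a minimal OpenCL C sketch (ours, for illustration; not the paper's kernels, and the names are invented) of a scalar elementwise-add kernel and a float4 variant in which each work-item handles four elements, so the global work size shrinks by a factor of four and each memory access widens to 16 bytes:

```c
/* Scalar kernel: one float per work-item. */
__kernel void add_scalar(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

/* Vectorized kernel: one float4 per work-item. Launch with a
   global size of N/4; each load/store now moves 16 contiguous
   bytes, changing the memory access pattern under study. */
__kernel void add_vec4(__global const float4 *a,
                       __global const float4 *b,
                       __global float4 *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```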

2.
The Internet of Things (IoT) is an emerging technology paradigm where millions of sensors and actuators help monitor and manage physical, environmental, and human systems in real time. The inherent closed-loop responsiveness and decision making of IoT applications make them ideal candidates for low-latency, scalable stream processing platforms. Distributed stream processing systems (DSPS) hosted in cloud data centers are becoming the vital engine for real-time data processing and analytics in any IoT software architecture. But the efficacy and performance of contemporary DSPS have not been rigorously studied for IoT applications and data streams. Here, we propose RIoTBench, a real-time IoT benchmark suite, along with performance metrics, to evaluate DSPS for streaming IoT applications. The benchmark includes 27 common IoT tasks classified across various functional categories and implemented as modular microbenchmarks. Further, we define four IoT application benchmarks composed from these tasks based on common patterns of data preprocessing, statistical summarization, and predictive analytics that are intrinsic to the closed-loop IoT decision-making life cycle. These are coupled with four stream workloads sourced from real IoT observations on smart cities and smart health, with peak stream rates that range from 500 to 10 000 messages/second from up to 3 million sensors. We validate the RIoTBench suite for the popular Apache Storm DSPS on the Microsoft Azure public cloud and present empirical observations. This suite can be used by DSPS researchers for performance analysis and resource scheduling, by IoT practitioners to evaluate DSPS platforms, and even reused within IoT solutions.

3.
By using the principle of fixed-time benchmarking, it is possible to compare a wide range of computers, from a small personal computer to the most powerful parallel supercomputer, on a single scale. Fixed-time benchmarks promise greater longevity than those based on a particular problem size and are more appropriate for “grand challenge” capability comparison. We present the design of a benchmark, SLALOM, that adjusts automatically to the computing power available and corrects several deficiencies in existing benchmarks: it is highly scalable, solves a real problem, includes input and output times, and can be run on parallel computers of all kinds, using any convenient language. The benchmark provides an estimate of the size of problem solvable on scientific computers. It can also be used to demonstrate a new source of superlinear speedup in parallel computers. Results spanning six orders of magnitude for contemporary computers of various architectures are presented.
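A fixed-time benchmark inverts the usual measurement: instead of timing a fixed problem, it grows the problem until a fixed time budget is spent and reports the largest size solved (SLALOM's budget is one minute). Below is a minimal C sketch of such a driver loop, with a stand-in workload rather than SLALOM's actual radiosity computation:

```c
#include <stdio.h>
#include <time.h>

/* Stand-in O(n^2) workload; SLALOM's real problem is a radiosity
   solve. Included only to make the driver loop runnable. */
static double solve_problem(long n)
{
    double s = 0.0;
    for (long i = 0; i < n * n; i++)
        s += 1.0 / (double)(i + 1);
    return s;
}

int main(void)
{
    const double budget = 1.0;   /* fixed time budget in seconds */
    long n = 1;

    /* Grow the problem until the budget is exceeded; the figure of
       merit is the largest size solved within the fixed time. */
    for (;;) {
        clock_t t0 = clock();
        solve_problem(n);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (elapsed > budget)
            break;
        n *= 2;   /* coarse doubling search; a real driver refines */
    }
    printf("largest n solved within %.1f s: %ld\n", budget, n / 2);
    return 0;
}
```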

4.
5.
SPEC CPU2000: measuring CPU performance in the New Millennium
Henning, J.L. Computer, 2000, 33(7): 28–35.
As computers and software have become more powerful, it seems almost human nature to want the biggest and fastest toy you can afford. But how do you know if your toy is tops? Even if your application never does any I/O, it's not just the speed of the CPU that dictates performance: cache, main memory, and compilers also play a role, and software applications have differing performance requirements. So whom do you trust to provide this information? The Standard Performance Evaluation Corporation (SPEC) is a nonprofit consortium whose members include hardware vendors, software vendors, universities, customers, and consultants. SPEC's mission is to develop technically credible and objective component- and system-level benchmarks for multiple operating systems and environments, including high-performance numeric computing, Web servers, and graphical subsystems. On 30 June 2000, SPEC retired the CPU95 benchmark suite. Its replacement is CPU2000, a new CPU benchmark suite with 19 applications that have never before been in a SPEC CPU suite. The article discusses how SPEC developed this benchmark suite and what the benchmarks do.

6.
7.
The increasing number of mobile devices with ever‐growing capabilities makes them useful for running scientific applications. However, these applications have high computational demands, whereas mobile devices have limited capabilities when compared with non‐mobile devices. More importantly, mobile devices rely on batteries for their power supply. We initially measure the battery consumption of different versions of known micro‐benchmarks representing common programming primitives found in scientific applications. Then, we analyze the performance of such micro‐benchmarks in CPU‐intensive mobile applications. We apply good programming practices and code refactorings to reduce battery consumption of scientific mobile applications. Our results show the reduction in energy usage from applying these refactorings to three scientific applications, and we consequently propose guidelines for high‐performance computing applications. Our focus is on Android, the dominant mobile operating system. As a long‐term contribution, our results represent one more step in the progress towards hybrid distributed infrastructures comprising fixed and mobile nodes, that is, the so‐called mobile grids. Copyright © 2016 John Wiley & Sons, Ltd.

8.
Open Computing Language (OpenCL) is a new industry standard for task-parallel and data-parallel heterogeneous computing on a variety of modern CPUs, GPUs, DSPs, and other microprocessor designs. OpenCL is vendor independent and hence not specialized for any particular compute device. To develop efficient OpenCL applications for a particular platform, we still need a deeper understanding of how architectural features interact with the OpenCL model and the computing devices. For this purpose, we design and implement an OpenCL micro-benchmark suite for GPUs and CPUs. In this paper, we introduce the implementation of our OpenCL micro-benchmarks and present measured results for hardware and software features such as the performance of mathematical operations, bus bandwidths, memory architectures, branch synchronization, and scalability on two multi-core CPUs, the AMD Athlon II X2 250 and Intel Pentium Dual-Core E5400, and two different GPUs, the NVIDIA GeForce GTX 460se and AMD Radeon HD 6850. We also compare our measurements with existing benchmarks to demonstrate the soundness and correctness of our benchmark suite.
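A typical kernel in such a suite isolates one hardware feature at a time. As an illustration (ours, not the paper's code), the OpenCL C sketch below estimates arithmetic throughput with a long chain of dependent multiply-adds, timed on the host via OpenCL profiling events; writing the result back prevents the compiler from eliminating the work:

```c
/* Arithmetic microbenchmark: ITERS dependent multiply-adds per
   work-item. The dependence chain keeps the ALU busy, and the
   final store defeats dead-code elimination. */
#define ITERS 4096

__kernel void fma_chain(__global float *out, float seed)
{
    float x = seed + (float)get_global_id(0);
    for (int i = 0; i < ITERS; i++)
        x = mad(x, 0.999f, 0.001f);
    out[get_global_id(0)] = x;
}
```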

9.
Heterogeneous performance prediction models are valuable tools to accurately predict application runtime, allowing for efficient design space exploration and application mapping. Existing performance models require intricate system architecture knowledge, making the modeling task difficult. In this research, we propose a regression-based performance prediction framework for general-purpose graphics processing unit (GPGPU) clusters that statistically abstracts the system architecture characteristics, enabling performance prediction without detailed system architecture knowledge. The regression-based framework targets deterministic synchronous iterative algorithms using our synchronous iterative GPGPU execution model and is broken into two components: a computation component that models the GPGPU device and host computations, and a communication component that models the network-level communications. The computation component's regression models use algorithm characteristics such as the number of floating-point operations and total bytes as predictor variables and are trained using several small, instrumented executions of synchronous iterative algorithms spanning a range of floating-point-operations-to-byte requirements. The regression models for network-level communications are developed using micro-benchmarks and employ data transfer size and processor count as predictor variables. Our performance prediction framework achieves prediction accuracy over 90% compared with the actual implementations for several tested GPGPU cluster configurations. The end goal of this research is to offer the scientific computing community an accurate and easy-to-use performance prediction framework that empowers users to optimally utilize heterogeneous resources. Copyright © 2013 John Wiley & Sons, Ltd.
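In spirit, the two components reduce to fitted linear models of the following form (a hedged paraphrase of the abstract; the coefficient names are ours, not the paper's):

```latex
T_{\mathrm{comp}} \approx \beta_0 + \beta_1\,\mathrm{FLOPs} + \beta_2\,\mathrm{Bytes},
\qquad
T_{\mathrm{comm}} \approx \alpha_0 + \alpha_1\,\mathrm{MsgSize} + \alpha_2\,P
```

where the beta coefficients are fitted by least squares over the small instrumented runs, the alpha coefficients come from the network micro-benchmarks, and P is the processor count; a full runtime prediction combines both terms over the algorithm's iterations.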

10.
Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it is useful to trade increased computation costs for reduced memory costs. The contributions of this paper are three-fold: we provide a detailed analysis of the memory performance of seven memory-intensive benchmarks; we describe Computation Regrouping, a source-level approach to improving the performance of memory-bound applications by increasing temporal locality to eliminate cache and TLB misses; and we demonstrate significant performance improvement by applying Computation Regrouping to our suite of seven benchmarks. Using Computation Regrouping, we observe a geometric mean speedup of 1.90, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory stall time.
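The transformation is easiest to see in a small hedged C sketch (ours, not one of the paper's seven benchmarks): two separate passes over a large array each pay the full memory latency, while the regrouped version performs both operations on each element while it is still cache-resident:

```c
#define N (1L << 24)

/* Before: two full traversals. For arrays larger than the cache,
   a[] is evicted between passes, so the second loop misses again. */
void two_pass(const float *a, float *sum, float *sq)
{
    for (long i = 0; i < N; i++) *sum += a[i];
    for (long i = 0; i < N; i++) *sq  += a[i] * a[i];
}

/* After: the two computations are regrouped into one traversal,
   increasing temporal locality so each element is used twice per
   fetch. */
void one_pass(const float *a, float *sum, float *sq)
{
    for (long i = 0; i < N; i++) {
        float x = a[i];
        *sum += x;
        *sq  += x * x;
    }
}
```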

11.
Benchmarks are key tools for evaluating computer systems, but the arrival of the big data era makes developing benchmarks for big data systems far more challenging, and no big data benchmark suite has yet won broad acceptance in academia or industry. This paper builds SIAT-Bench, a traffic big data benchmark suite for the Hadoop platform, from a production traffic big data system. Program behavior is characterized by attributes selected at multiple levels, and a clustering algorithm is used to analyze the similarity of different program-input pairs. Based on the clustering results, representative programs and input data sets are selected for SIAT-Bench. Experimental results show that SIAT-Bench preserves the diversity of program behaviors while eliminating redundancy from the benchmark suite.

12.
We introduce a bounding volume hierarchy based on the Slab Cut Ball. This novel type of enclosing shape provides an attractive balance between tightness of fit, cost of overlap testing, and memory requirement. The hierarchy construction algorithm includes a new method for building tight bounding volumes in worst-case O(n) time, which means our tree data structure is constructed in O(n log n) time using traditional top-down building methods. A fast overlap test between two slab cut balls is also proposed, requiring as few as 28–99 arithmetic operations, including the transformation cost. Practical collision detection experiments confirm that our tree data structure is amenable to high-performance collision queries. In all the tested benchmarks, our bounding volume hierarchy consistently outperforms the sphere tree, and it is also faster than the OBB tree in five out of six scenes. In particular, in close-proximity situations our method is asymptotically faster than the sphere tree and also outperforms the OBB tree.
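For intuition, a slab cut ball can be modeled as a sphere clipped by a slab of half-width t centered on the sphere's center (an assumption of this sketch; the paper's shape and its 28–99-operation test are more refined). The conservative C test below never misses a true overlap, combining the sphere test with the two slab axes as candidate separating axes:

```c
#include <math.h>

typedef struct { double x, y, z; } Vec3;

/* Sphere (c, r) clipped by the slab |dot(p - c, n)| <= t,
   with n a unit vector. */
typedef struct { Vec3 c, n; double r, t; } SlabCutBall;

static double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

/* Are A and B separated along A's slab axis? A projects exactly to
   a half-extent of min(t, r); B is over-approximated by its sphere. */
static int separated(const SlabCutBall *A, const SlabCutBall *B)
{
    double half = fmin(A->t, A->r);
    return fabs(dot(A->c, A->n) - dot(B->c, A->n)) > half + B->r;
}

/* Conservative overlap query: may report contact where there is
   none (a tighter test prunes more), but never the reverse. */
int may_overlap(const SlabCutBall *A, const SlabCutBall *B)
{
    Vec3 d = { A->c.x - B->c.x, A->c.y - B->c.y, A->c.z - B->c.z };
    double rr = A->r + B->r;
    if (dot(d, d) > rr * rr) return 0;  /* bounding spheres disjoint */
    if (separated(A, B)) return 0;      /* A's slab axis separates  */
    if (separated(B, A)) return 0;      /* B's slab axis separates  */
    return 1;
}
```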

13.
Fei Liu, Bixin Li, Rupesh Nasre. Software, 2016, 46(5): 601–623.
Pointer analysis is a key static analysis during compilation. Several client analyses and transformations rely on precise pointer information to optimize programs, so it is paramount to improve the efficiency of pointer analysis. A critical piece of an inclusion-based pointer analysis is online cycle detection, whose efficacy significantly influences the efficiency of the whole analysis. Existing approaches perform poorly when they guess cycle formation in the constraint graph: the number of false cycle-detection triggers of state-of-the-art methods is considerably high (over 99% on Standard Performance Evaluation Corporation (SPEC) benchmarks). In this paper, we propose bootstrapping as a way to improve the cycle detection predictability of pointer analysis. The main idea is to run a sequence of increasingly precise analyses, each feeding the next, more precise analysis to improve its efficiency. In this process, we develop a new notion of pointer equivalence called constraint equivalence. Using Steensgaard's fast unification algorithm as the bootstrap, we devise a new cycle detection method for Andersen's inclusion-based analysis. We measure the effectiveness of our approach on a suite of programs including SPEC 2000 benchmarks and two open-source programs, and find that our method reduces the number of false cycle detections by almost 22× compared with a state-of-the-art method. This leads to an overall analysis time improvement of 18% on average. Copyright © 2015 John Wiley & Sons, Ltd.
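The payoff of cycle detection is that all constraint-graph nodes on a cycle share the same points-to set and can be collapsed into one node. Below is a hedged C sketch of that collapse machinery, a plain union-find (the paper's bootstrapped trigger logic and points-to set union are omitted):

```c
#include <stdlib.h>

/* Union-find over constraint-graph nodes; collapsed nodes share a
   representative, so later propagation treats them as one. */
typedef struct { int *parent; int n; } UnionFind;

UnionFind *uf_new(int n)
{
    UnionFind *uf = malloc(sizeof *uf);
    uf->n = n;
    uf->parent = malloc((size_t)n * sizeof(int));
    for (int i = 0; i < n; i++) uf->parent[i] = i;
    return uf;
}

int uf_find(UnionFind *uf, int x)
{
    while (uf->parent[x] != x) {            /* path halving */
        uf->parent[x] = uf->parent[uf->parent[x]];
        x = uf->parent[x];
    }
    return x;
}

/* Merge every node on a detected cycle into one representative;
   a real analysis would union their points-to sets here too. */
void collapse_cycle(UnionFind *uf, const int *cycle, int len)
{
    int rep = uf_find(uf, cycle[0]);
    for (int i = 1; i < len; i++)
        uf->parent[uf_find(uf, cycle[i])] = rep;
}
```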

14.
Collecting accurate program metrics is often complicated by environmental artifacts such as operating system workload, cache operation, and processor configuration. This paper demonstrates the ability of the IN-Tune system to make accurate and repeatable measurements of program metrics by analyzing the computational workload of programs in the SPEC95 benchmark suite. It shows that metrics characteristic of program performance can be collected in both lightly loaded and heavily loaded environments without corruption. The IN-Tune system accomplishes this by creating unique 'virtual performance registers' for each process or kernel thread monitored on an Intel processor. Further, the paper investigates the effect of optimization on the performance of the benchmarks. The results clearly show improvements in the quality of code generated by the compiler when optimizations are performed and that, whereas measurements of time can be misleading, the IN-Tune measurements are not. Copyright © 2000 John Wiley & Sons, Ltd.
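IN-Tune's own interface is not shown in the abstract; as a hedged modern stand-in, per-thread counting of retired instructions on Linux looks like the C sketch below, using the perf_event_open system call (pid 0, cpu -1 makes the counter follow the calling thread across CPUs, much as a per-thread 'virtual performance register' would):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0, cpu = -1: count this thread wherever it runs. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;            /* workload under test */
    for (int i = 0; i < 1000000; i++) x += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    read(fd, &count, sizeof count);
    printf("instructions retired: %lld\n", count);
    close(fd);
    return 0;
}
```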

15.
16.
The increasing attention on deep learning has tremendously spurred the design of intelligence processing hardware. The variety of emerging intelligence processors requires standard benchmarks for fair comparison and system optimization (in both software and hardware). However, existing benchmarks are unsuitable for benchmarking intelligence processors because they are neither diverse nor representative, and the lack of a standard benchmarking methodology further exacerbates the problem. In this paper, we propose BenchIP, a benchmark suite and benchmarking methodology for intelligence processors. The benchmark suite in BenchIP consists of two sets of benchmarks: microbenchmarks and macrobenchmarks. The microbenchmarks consist of single-layer networks; they are mainly designed for bottleneck analysis and system optimization. The macrobenchmarks contain state-of-the-art industrial networks, so as to offer a realistic comparison of different platforms. We also propose a standard benchmarking methodology built upon an industrial software stack and evaluation metrics that comprehensively reflect various characteristics of the evaluated intelligence processors. BenchIP is utilized for evaluating various hardware platforms, including CPUs, GPUs, and accelerators, and will be open-sourced soon.

17.
We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multi-zone, is derived from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on four different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.
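The multi-zone structure maps naturally onto hybrid code: coarse-grain parallelism between zones (here, between MPI ranks) and fine-grain OpenMP loop parallelism within each zone. Below is a hedged C sketch of one time step, with a placeholder Jacobi stencil and one-value halo exchange rather than the NPB reference implementation:

```c
#include <mpi.h>

#define NZ 4      /* zones owned by each rank (illustrative) */
#define NX 1024   /* grid points per zone (illustrative) */

static double zone[NZ][NX], buf[NZ][NX];

static void time_step(int rank, int nranks)
{
    /* Fine grain: update each zone's interior with OpenMP. */
    for (int z = 0; z < NZ; z++) {
        #pragma omp parallel for
        for (int i = 1; i < NX - 1; i++)
            buf[z][i] = 0.5 * (zone[z][i - 1] + zone[z][i + 1]);
        #pragma omp parallel for
        for (int i = 1; i < NX - 1; i++)
            zone[z][i] = buf[z][i];
    }

    /* Coarse grain: exchange boundary values with neighbouring
       ranks after the step (placeholder one-value halo). */
    int left  = (rank + nranks - 1) % nranks;
    int right = (rank + 1) % nranks;
    MPI_Sendrecv(&zone[0][NX - 2], 1, MPI_DOUBLE, right, 0,
                 &zone[0][0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    for (int step = 0; step < 10; step++)
        time_step(rank, nranks);
    MPI_Finalize();
    return 0;
}
```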

18.
Traditional methods for creating dynamic objects and characters from static drawings involve careful tweaking of animation curves and/or simulation parameters. Sprite sheets offer a more drawing‐centric solution, but they do not encode timing information or the logic that determines how objects should transition between poses and cannot generalize outside the given drawings. We present an approach for creating dynamic sprites that leverages sprite sheets while addressing these limitations. In our system, artists create a drawing, deform it to specify a small number of example poses, and indicate which poses can be interpolated. To make the object move, we design a procedural simulation to navigate the pose manifold in response to external or user‐controlled forces. Powerful artistic control is achieved by allowing the artist to specify both the pose manifold and how it is navigated, while physics is leveraged to provide timing and generality. We used our method to create sprites with a range of different dynamic properties. Copyright © 2014 John Wiley & Sons, Ltd.
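One simple reading of navigating a pose manifold under physics is a damped spring in blend-weight space: an external force picks a target pose, and the spring supplies the timing. A hedged C sketch follows (ours; the paper's procedural simulation is richer than a single scalar weight):

```c
#include <math.h>
#include <stdio.h>

/* Blend weight w in [0,1] interpolates between two example poses
   the artist marked as interpolable; a critically damped spring
   drives w toward a force-selected target, so timing emerges from
   physics instead of hand-authored animation curves. */
typedef struct { double w, v; } PoseState;

static void step_pose(PoseState *s, double target, double k, double dt)
{
    double c = 2.0 * sqrt(k);               /* critical damping */
    double a = k * (target - s->w) - c * s->v;
    s->v += a * dt;                         /* semi-implicit Euler */
    s->w += s->v * dt;
}

int main(void)
{
    PoseState s = { 0.0, 0.0 };
    for (int frame = 0; frame < 60; frame++) {   /* 1 s at 60 fps */
        step_pose(&s, 1.0, 100.0, 1.0 / 60.0);
        printf("frame %2d: weight %.3f\n", frame, s.w);
    }
    return 0;
}
```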

19.
Industry, academia, and end users urgently need big data benchmarks to evaluate existing big data systems, improve current techniques, and develop new ones. This paper reviews the main work on big data benchmark development in recent years and compares their strengths and weaknesses. On this basis, it proposes a series of considerations for developing new big data benchmarks: 1) to evaluate both the individual component tools of a big data platform and the platform as a whole, both component-oriented benchmarks and whole-platform benchmarks are needed, the latter being an organic combination of the former; 2) beyond SQL queries, workloads must include the full range of complex analysis functions required by big data analytics tasks, covering all classes of application demands; 3) besides performance metrics (response time and throughput), other metrics should also be evaluated, including scalability, fault tolerance, energy efficiency, and security.

20.
This paper describes the quattor tool suite, a new system for the installation, configuration, and management of operating systems and application software for computing fabrics. At present, Unix derivatives such as Linux and Solaris are supported. Quattor is a powerful, portable, and modular open-source solution that has been shown to scale to thousands of computing nodes and offers a significant reduction in management costs for large computing fabrics. The quattor tool suite includes innovations over existing solutions that make it very useful for computing fabrics integrated into grid environments. Evaluations of the tool suite in current large-scale computing environments are presented.

