首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
    
Thread‐level speculation can be used to take advantage of multicore architectures for JavaScript in web applications. We extend previous studies with these main contributions; we implement thread‐level speculation in the state‐of‐the art just‐in‐time‐enabled JavaScript engine V8 and make the measurements in the Chromium web browser both from Google instead of using an interpreted JavaScript engine. We evaluate the thread‐level speculation and just‐in‐time compilation combination on 15 very popular web applications, 20 HTML5 demos from the JS1K competition, and 4 Google Maps use cases. The performance is evaluated on two, four, and eight cores. The results clearly show that it is possible to successfully combine thread‐level speculation and just‐in‐time compilation. This makes it possible to take advantage of multicore architectures for web applications while hiding the details of parallel programming from the programmer. Further, our results show an average speedup for the thread‐level speculation and just‐in‐time compilation combination by a factor of almost 3 on four cores and over 4 on eight cores, without changing any of the JavaScript source code. Copyright © 2016 John Wiley & Sons, Ltd.  相似文献   

2.
Technological directions for innovative HPC software environments are discussed in this paper. We focus on industrial user requirements of heterogeneous multidisciplinary applications, performance portability, rapid prototyping and software reuse, integration and interoperability of standard tools. The various issues are demonstrated with reference to the PQE2000project and its programming environment Skeleton-based Integrated Environment (). includes a coordination language, , allowing the designers to express, in a primitive and structured way, efficient combinations of data parallelism and task parallelism. The goal is achieving fast development and good efficiency for applications in different areas. Modules developed with standard languages and tools are encapsulated into structures to form the global application. Performance models associated to the coordination language allow powerful optimizations to be introduced both at run time and at compile time without the direct intervention of the programmer. The paper also discusses the features of the environment related to debugging, performance analysis tools, visualization and graphical user interface. A discussion of the results achieved in some applications developed using the environment concludes the paper.  相似文献   

3.
    
Two image processing applications, edge detection and image resizing, are studied in this paper on two HPC platforms namely the Cell BE and the Blue Gene/L machines. In this paper we focus on the performance scalability of the studied applications. Our results show that the scale of the problem to be solved highly affects the fitness of the platform. If the data set size is to fit into the Cell core, the fast on‐chip inter‐core communication of a multi‐core system pays back for its high technology design. On the other hand, the overhead of the distant communication in the massively parallel Blue Gene/L machine will only show its benefits for huge data set size that otherwise mandates multiple round‐trip data communications between the local memory of a core and main memory. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

4.
共享存储并行多目标跟踪   总被引:1,自引:0,他引:1  
高度的运算复杂性制约了粒子滤波在实际的多目标视频跟踪系统中的应用。为克服性能瓶颈,探索了一种基于OpenMP共享存储并行编程模型的粗粒度并行多目标跟踪系统的实现方法。在共享变量中维护被跟踪目标的列表,每一个目标用一个独立的粒子滤波器进行跟踪。根据处理单元的数目确定线程数量和每个线程跟踪的目标数量。与对应的串行版本相比,该并行系统将可实时跟踪的目标数目由2个增加到了8个,具有更大的实用价值。  相似文献   

5.
    
  相似文献   

6.
    
NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide ease of use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmarks versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.  相似文献   

7.
    
Heterogeneous platforms composed of multiple different types of computing devices (such as CPUs, GPUs, and Intel MICs) have been widely used recently. However, most of parallel applications developed in such a heterogeneous platform usually only utilize a certain kind of computing device due to the lack of easy-to-use heterogeneous cooperative parallel programming models. To reduce the difficulty of heterogeneous cooperative parallel programming, a directive-based heterogeneous cooperative parallel programming framework called HeteroPP is proposed. HeteroPP provides an easier way for programmers to fully exploit multiple different types of computing devices to concurrently and cooperatively perform data-parallel applications on heterogeneous platforms. An extension to OpenMP directives and clauses is proposed to make it possible for programmers to easily offload a data-parallel compute kernel to multiple different types of computing devices. A source-to-source compiler is designed to help programmers to automatically generate multiple device-specific compute kernels that can be concurrently and cooperatively performed on heterogeneous platforms. Many experiments are conducted with 12 typical data-parallel applications implemented with HeteroPP on a heterogeneous CPU-GPU-MIC platform. The results show that HeteroPP not only greatly simplifies the heterogeneous cooperative parallel programming, but also can fully utilize the CPUs, GPU, and MIC to efficiently perform these applications.  相似文献   

8.
We compare the performance of three major programming models on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular, predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by modifying either the application programming interfaces or the implementations of the programming models on this type of tightly-coupled multiprocessor platform.  相似文献   

9.
    
Nowadays, parallel architectures are changing so fast that there is a need for scalable and efficient tools to analyze and predict the performance of parallel applications. Analytical models are proved to be a useful approximation for characterizing parallel algorithms, but developing accurate analytical models is a hard issue, and, in general, they provide coarse performance predictions due to their intrinsic lack of accuracy. In this paper, we describe in detail the Tools for Instrumentation and Analysis (TIA) framework, an easy‐to‐use tool that automatically obtains accurate performance models by means of analytical expressions. This framework automatizes most of its internal tasks, reducing opportunities for human error, and it only requires the user to focus on the metrics and execution parameters that might influence the performance, those that should be considered in the modeling process. Its main advantage over other tools is that TIA uses model selection techniques that allow the automation of the modeling process. As a case of study, the use of TIA to obtain analytical models of different implementations of the broadcast collective communication in a cluster of multicores is shown. The results obtained by TIA are evaluated and compared with theoretical approaches based on the LogGP model. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

10.
并行程序设计模型和语言   总被引:17,自引:0,他引:17       下载免费PDF全文
安虹  陈国良 《软件学报》2002,13(1):118-124
并行计算技术的发展已有20多年的历史了.时至今日,高性能并行计算仍然缺乏有效的并行程序设计方法和工具,使得编写并行程序、理解并行程序的行为、调试和优化并行程序的性能都很困难.从分析并行程序设计困难的原因入手,指出了当前各种高性能并行机系统支持的并行程序设计方法存在的诸多问题,综述了并行程序设计模型和语言的研究现状,给出了并行程序设计模型的评价标准,并提出了这一研究领域所面临的挑战性问题,指出了一些未来可能的发展方向.  相似文献   

11.
    
Programming for large‐scale, multicore‐based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function‐level parallelism that targets productivity. StarSs deploys a data‐flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one‐sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs, both in terms of programmability and performance and compare it to that of the IBM X10 language. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

12.
张杨  张冬雯  王一拙 《计算机应用》2014,34(11):3096-3099
针对使用并行库JOMP的程序在性能方面存在的不足,提出一个可以分离并行逻辑和功能逻辑的并行框架。该框架对程序中需要并行处理的部分进行标记,采用面向方面和运行时反射技术实现被标记部分的处理,其中面向方面技术用于实现并行逻辑的分离和编织,运行时反射技术用于获取运行时被标记部分的相关信息,以并行库(waxberry)的方式实现了该并行框架。使用基准测试程序JGF套件中的三个测试程序对并行库进行了测试,实验结果表明,应用该并行库的程序可以获得较好的性能。  相似文献   

13.
    
Load balancing increases the efficient use of existing resources for parallel and distributed applications. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed in order to control available resources as efficiently as possible by utilizing idle resources and using task migration. Simultaneously, at a finer granularity level, advances in algorithmic strategies for dynamically balancing computational loads by data redistribution have been proposed in order to respond to variations in processor performance during the execution of a given parallel application. Combining strategies from each level of granularity can result in a system which delivers advantages of both. The resulting integration is systemic in nature and transfers the responsibility of efficient resource utilization from the application programmer to the runtime system. This paper presents the design and implementation of a system that combines an algorithmic fine-grained data parallel load balancing strategy with a systemic coarse-grained task-parallel load balancing strategy, and reports on recent experimental results of running a computationally intensive scientific application under this integrated system. The experimental results indicate that a distributed runtime environment which combines both task and data migration can provide performance advantages with little overhead. It also presents proposals for performance enhancements of the implementation, as well as future explorations for effective resource management. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

14.
    
The research domain of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia data. High-performance computing techniques are necessary to satisfy the ever increasing computational demands of MMCA applications. The introduction of Graphics Processing Units (GPUs) in modern cluster systems presents application developers with a challenge. While GPUs are well known to be capable of providing significant performance improvements, the programming complexity vastly increases. To this end, we have extended a user transparent parallel programming model for MMCA, named Parallel-Horus, to allow the execution of compute intensive operations on the GPUs present in the cluster. The most important class of operations in the MMCA domain are convolutions, which are typically responsible for a large fraction of the execution time. Existing optimization approaches for CUDA kernels in general as well as those specific to convolution operations are too limited in both performance and flexibility. In this paper, we present a new optimization approach, called adaptive tiling, to implement a highly efficient, yet flexible, library-based convolution operation for modern GPUs. To the best of our knowledge, our implementation is the most optimized and best performing implementation of 2D convolution in the spatial domain available to date.  相似文献   

15.
WAPM:适合广域分布式计算的并行编程模型   总被引:1,自引:0,他引:1  
早期的MPI与OpenMP等编程模型由于扩展性限制或并行粒度的差异而不适合于大规模的广域动态Internet环境.提出了一个用于广域网络范围内的并行编程模型(WAPM),为应用的分布式计算的编程提供了一个新的可行解决方案.WAPM由通信库、通信协议和应用编程接口组成,并且具有通用编程、自适应并行、容错性等特点,通过选择合适的编程语言,就可形成一个广域范围内的并行程序设计环境.以分布式计算平台P2HP为工作平台,描述了WAPM分布式计算的实施过程.实验结果表明,WAPM是一个通用的、可行的、性能较好的编程模型.  相似文献   

16.
    
Multicore processors form the basis of most traditional high performance parallel processing architectures. Early experiences with these computers showed significant performance problems, both with regard to computation and inter‐process communication. The transition from Purple, an IBM POWER5‐based machine, to Cielo, a Cray XE6, as the main capability computing platform for the United States Department of Energy's Advanced Simulation and Computing campaign provides an opportunity to reexamine these issues after experiences with a few generations of multicore‐based machines. Experiences with Purple identified some important characteristics that led to strong performance of complex scientific application programs at very large scales. Herein, we compare the performance of some Advanced Simulation and Computing mission critical applications at capability scale across this transition to multicore processors. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

17.
A major step towards practical use of parallel computers is the integration of cost into transformational or derivational software development methods. The difficulties with doing this come from the wide variety of parallel architectures possible and the effects of memory access and congestion phenomena. This paper presents a model of costs for uniform architectures that is compatible with refinement-based methods of development, that is much simpler than those previously suggested, but which accurately assesses the costs of an implemented computation. Decisions about architecture type and machine size can be made at any stage in the development, even at the end.  相似文献   

18.
描述了一个基于Jini技术的互联网计算资源共享框架JiniFrame,JiniFrame框架把互联网环境下的机器按照功能角色分为客户机、代理机和主机3种实体,共同协作完成并行应用的求解.JiniFrarne框架的可扩展性和跨平台性得益于基于Java的Jini技术,易用性与编程灵活性得益于框架中网络节点的组织方式与灵活的编程模型,同时支持主从模式与分治模式的编程.通过对实际分布式计算任务的案例分析与实验,表明了JiniFrame框架的正确性与高效性.  相似文献   

19.
提出了因特网上基于节点角色的计算资源共享平台——RB-CRSP。设计时充分考虑节点的角色性和功能性,把因特网上的网络资源按照角色划分为服务器端节点、协调节点、工作机节点与客户机节点四类实体,通过配合RB-CRSP的应用编程模式,完成并行分布式计算。分析了RB-CRSP中的自适应资源调度策略,该策略考虑了节点的硬件信息与可信誉机制,实现了平台的负载均衡性;在动态的因特网环境下,利用面向工作机的容错方式保证了平台的可靠性。案例程序选择了典型的并行BenchMark程序:N皇后问题,测试结果表明,RB-CRSP可以方便聚集异构环境下的空闲计算资源,平台的性能与机器硬件条件和可靠性密切相关。  相似文献   

20.
基于模式的并行编程环境中任务队列模式的研究与实现   总被引:1,自引:0,他引:1  
并行程序的设计是并行计算的难点之一。本文在基于模式的并行编程方法的基础上,对一种典型的并行计算与通信模式-任务队列模式进行了深入的研究,并在基于模式的并行编程环境中对该模式进行了实现。本文将通过两个典型的应用实例说明在基于模式的并行编程环境中使用任务队列模式进行问题的并行求解与并行程序开发的过程,并从实现效率和可编程性方面对使用任务队列模式的并行程序和传统的MPI/PVM实现的并行程序进行了分析与比较。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号