Similar literature: 20 matching results.
1.
Scientific applications often need to write out large arrays and associated metadata periodically for visualization or restart purposes. In this paper, we present active buffering, a high-level transparent buffering scheme for collective I/O, in which processors actively organize their idle memory into a hierarchy of buffers for periodic output data. It utilizes idle memory on the processors, yet makes no assumption regarding runtime memory availability. Active buffering can perform background I/O while the computation is going on, is extensible to remote I/O for more efficient data migration, and can be implemented in a portable style in today's parallel I/O libraries. It can also mask performance problems of scientific data formats used by many scientists. Performance experiments with both synthetic benchmarks and real simulation codes on multiple platforms show that active buffering can greatly reduce the visible I/O cost from the application's point of view.
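The core idea, enqueue output into idle memory and let a background thread perform the actual write, can be sketched in Python (a deliberately minimal, single-buffer stand-in for the paper's buffer hierarchy; the class and method names are illustrative assumptions):

```python
import queue
import threading

class ActiveBuffer:
    """Buffer periodic output in memory and write it in the background."""
    def __init__(self, path):
        self.q = queue.Queue()  # the paper's buffer hierarchy, collapsed to one queue
        self.path = path
        self.writer = threading.Thread(target=self._drain, daemon=True)
        self.writer.start()

    def write(self, data):
        # The compute thread only enqueues; the actual I/O overlaps computation.
        self.q.put(data)

    def _drain(self):
        with open(self.path, "w") as f:
            while True:
                item = self.q.get()
                if item is None:      # sentinel: flush and stop
                    break
                f.write(item)

    def close(self):
        self.q.put(None)
        self.writer.join()
```

From the application's point of view, each `write` returns immediately; only `close` waits for the background drain to finish.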

2.
In compiling applications for distributed memory machines, runtime analysis is required when data to be communicated cannot be determined at compile-time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the required runtime analysis; the library is currently implemented on several different systems. We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile-time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multi-block Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and that the compiler-parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library.

3.
Many embedded systems must meet hard or soft real-time requirements. To ensure these requirements are satisfied, it is not enough to measure the execution time of individual tasks or code segments; the real-time performance of the whole system must also be determined. Targeting the two main causes of the poor accuracy of traditional software-only measurement methods, and drawing on the idea of fusing multi-source data, this paper builds on an improved software-only method for measuring real-time code execution time and proposes a fusion mechanism for analyzing the real-time performance of embedded systems. With this mechanism, designers and developers can pinpoint timing errors, locate the code that must be optimized to resolve missed deadlines, and determine the schedulability of a real-time task set in a given environment.

4.
Compiler-directed locality optimization techniques are effective in reducing the number of cycles spent in off-chip memory accesses. Recently, methods have been developed that transform memory layouts of data structures at compile-time to improve spatial locality of nested loops beyond current control-centric (loop nest-based) optimizations. Most of these data-centric transformations use a single static (program-wide) memory layout for each array. A disadvantage of these static layout-based locality enhancement strategies is that they might fail to optimize codes that manipulate arrays, which demand different layouts in different parts of the code. We introduce a new approach, which extends current static layout optimization techniques by associating different memory layouts with the same array in different parts of the code. We call this strategy "quasidynamic layout optimization." In this strategy, the compiler determines memory layouts (for different parts of the code) at compile time, but layout conversions occur at runtime. We show that the possibility of dynamically changing memory layouts during the course of execution adds a new dimension to the data locality optimization problem. Our strategy employs a static layout optimizer module as a building block and, by repeatedly invoking it for different parts of the code, it checks whether runtime layout modifications bring additional benefits beyond static optimization. Our experiments indicate significant improvements in execution time over static layout-based locality enhancing techniques.
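A toy illustration of a runtime layout conversion between two code phases, assuming one phase scans rows and the next scans columns; the helper names are hypothetical and nested lists stand in for real array storage:

```python
def to_col_major(a):
    """Runtime layout conversion: row-major 2-D list -> column-major."""
    return [list(col) for col in zip(*a)]

def row_sums(a):
    # Phase 1 walks rows; row-major storage gives unit-stride access here.
    return [sum(row) for row in a]

def col_sums(a_cm):
    # Phase 2 walks columns; after the conversion each column is contiguous,
    # so this phase also enjoys unit-stride access.
    return [sum(col) for col in a_cm]
```

The quasidynamic strategy pays the conversion cost between phases only when the locality gain in the next phase outweighs it.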

5.
Portable (standards-compliant) systems software is usually associated with unavoidable overhead from the standards-prescribed interface. For example, consider the POSIX Threads facility for thread-specific data (TSD) in multithreaded code. The first TSD reference must be preceded by pthread_getspecific(), typically implemented as a function or macro of 40-50 instructions. This paper proposes a method that uses the runtime specialization facility of the Tempo program specializer to convert such unavoidable source code into simple memory references of one or two instructions for execution. Consequently, the source code remains standards-compliant while the executed code performs similarly to direct global variable access. Measurements show significant performance gains over a range of code sizes. A random number generator (10 lines of C) shows a speedup of 4.8 times on a SPARC and 2.2 times on a Pentium. A time converter (2,800 lines) was sped up by 14 and 22 percent, respectively, and a parallel genetic algorithm system (14,000 lines) was sped up by 13 and 5 percent.
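The effect of runtime specialization, paying for the general lookup once so that only a direct reference remains on the hot path, can be mimicked in Python (an analogy only: the paper's Tempo specializer operates on C source, and `make_accessors` is a hypothetical name):

```python
def make_accessors(tsd, key):
    """Resolve the slot for `key` once, at specialization time; the returned
    closures are direct references with no per-access lookup."""
    cell = tsd.setdefault(key, [None])  # one-element list stands in for the TSD slot
    def get():
        return cell[0]                  # plain load, analogous to 1-2 instructions
    def put(value):
        cell[0] = value
    return get, put
```

The generic dictionary lookup (the analogue of `pthread_getspecific()`) happens only inside `make_accessors`; every later `get()` is a direct slot read.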

6.
A parallel algorithm for time-dependent Monte Carlo transport problems
This paper presents a parallel algorithm for time-dependent Monte Carlo (MC) transport problems, and discusses and optimizes the loading and execution modes of the parallel program. A parallel random-number generator that, in the ideal case, requires no communication is designed for parallel MC computation. Time-dependent MC transport involves a large amount of I/O; reading the residual-particle data files in particular consumes substantial I/O time, so three parallel I/O algorithms are proposed to address this. Finally, performance results are reported: compared with the serial execution time, the parallel execution time on 64 processors is 30 times shorter.
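The communication-free random-number idea can be sketched by giving each rank its own deterministically seeded generator (distinct per-rank seeding is a simple stand-in; the paper's generator guarantees stream independence by construction, and the function name and seed are illustrative):

```python
import random

def make_streams(num_ranks, base_seed=20240101):
    """One generator per rank, seeded deterministically from the rank number,
    so no inter-process communication is needed to coordinate random numbers."""
    return [random.Random(base_seed + rank) for rank in range(num_ranks)]
```

Because each rank can recompute its seed locally, adding or restarting processes requires no message exchange.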

7.
8.
Agent technology is emerging as an important concept for the development of distributed complex systems. A number of mobile agent systems have been developed in the last decade; however, most of them support only Java mobile agents. In order to provide distributed applications with code mobility, this article presents the Mobile-C library, which allows the mobile agent platform Mobile-C to be embedded in an application to support mobile C/C++ code carried by mobile agents. Mobile-C uses a C/C++ interpreter as its Agent Execution Engine (AEE). Using mobile C/C++ code, it is easy to interface with a variety of low-level hardware devices and legacy systems, and through the Mobile-C library, Mobile-C can run on heterogeneous platforms with various operating systems. The library has a small footprint to meet the stringent memory constraints of mechatronic and embedded systems, and it contains different categories of Application Programming Interfaces (APIs) in both the binary and agent spaces to facilitate the design of mobile-agent-based applications. In addition, a rich set of existing APIs for the C/C++ interpreter employed as the AEE gives an application complete information about, and control over, the mobile C/C++ code residing in Mobile-C. With the synchronization mechanism the library provides for both spaces, simultaneous processes across the binary and agent spaces can be coordinated to ensure correct runtime ordering and avoid unexpected race conditions. A performance comparison indicates that Mobile-C is about two times faster than JADE in agent migration. The application of the Mobile-C library is illustrated by dynamic runtime control of a mobile robot's behavior using mobile agents.

9.
李明煜  夏虞斌  陈海波 《软件学报》2022,33(6):2012-2029
A trusted execution environment (TEE) is an architectural solution for privacy-preserving computation that provides confidentiality and integrity for privacy-sensitive data and code; in recent years it has become a research hotspot for machine-learning privacy protection, encrypted databases, blockchain security, and similar scenarios. This paper addresses the performance of systems protected by the new trusted hardware. We first profile the new hardware (Intel SGXv2) and find that, once large secure memory is configured, the paging overhead that dominated Intel SGXv1 is no longer the main bottleneck. Configuring large secure memory, however, raises two new problems: first, the usable range of normal memory shrinks, aggravating paging overhead for normal applications, especially big-data applications; second, secure memory is typically underutilized, lowering overall physical-memory utilization. To address these problems, we propose a novel lightweight code-migration scheme that dynamically moves the code of normal applications into secure memory while leaving their data in place. The migrated code can use secure memory, avoiding the drastic performance degradation caused by disk paging. Experimental results show that the method reduces the paging-induced overhead of normal applications by 73.2% to 98.7% without affecting the isolation or normal use of secure applications.

10.
STREAM is a benchmark for measuring memory performance on microprocessors, and achieving high STREAM performance on the multi-core, multi-threaded FT1000 processor is a challenging task. Based on the multi-level cache structure, the instruction pipelines of the four STREAM kernels are optimized: a multi-level loop-unrolling scheme is designed according to the number of registers, the amount of data prefetching is determined by instruction latency and cache-line size, and the optimized subroutines are written in assembly. A parallel STREAM program is then designed on top of OpenMP, with an optimized localized data-placement scheme. Test results show that the optimized STREAM kernels outperform the original serial versions by 19.2% to 64.2%. After optimization, the parallel program reaches a peak memory bandwidth of 8.5 GB/s, up to 22.7% higher than before optimization.

11.

One of the most challenging issues in big data research is the inability to process a large volume of information in a reasonable time. Hadoop and Spark are two frameworks for distributed data processing: Hadoop is a very popular and general platform for big data processing, while Spark, because of its in-memory programming model, is an open-source framework well suited to iterative algorithms. In this paper, the Hadoop and Spark big data processing frameworks are evaluated and compared in terms of runtime, memory and network usage, and CPU efficiency. To this end, the k-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes within both frameworks. The results show that the KNN algorithm runs 4 to 4.5 times faster on Spark than on Hadoop, and that Hadoop consumes more resources, including CPU and network. It is concluded that Spark uses the CPU more efficiently than Hadoop, whereas Hadoop's memory usage is lower than Spark's.
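A minimal, single-machine KNN in the same spirit (Spark or Hadoop would distribute the distance computations over partitions of `train`; this sketch only shows the algorithm being benchmarked, and all names are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Plain k-nearest-neighbor majority vote over (point, label) pairs."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    # In a distributed setting this sort/top-k would be a map over partitions
    # followed by a reduce; here it is a single local pass.
    nearest = sorted(train, key=lambda t: sq_dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```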


12.
皮艾迪  喻剑  周笑波 《计算机应用》2017,37(12):3586-3591
The Spark framework is increasingly used by enterprises for big-data analytics, but because Spark is usually deployed in distributed and cloud environments, the system's complexity grows, and monitoring Spark performance and locating the jobs that cause performance degradation has always been difficult. To address this, a real-time monitoring and analysis method for Spark performance in distributed container environments is proposed and implemented. First, job runtime resource-consumption information is collected and integrated by instrumenting Spark and monitoring API files in the Docker containers. Then a Gaussian mixture model (GMM) is trained on historical Spark job data. Finally, the trained model classifies the runtime resource-consumption profiles of Spark jobs and identifies the jobs responsible for performance degradation. Experimental results show that the method detects 90.2% of anomalous jobs while adding only 4.7% overhead to Spark job performance. The method reduces the troubleshooting workload and helps users find anomalous Spark jobs more quickly.
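The classification step can be illustrated with a single Gaussian per metric rather than the paper's full GMM (a deliberate simplification; the function names and the 3-sigma threshold are illustrative assumptions):

```python
import math

def fit_gaussian(samples):
    """Fit mean and variance of one resource metric from job history."""
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / len(samples)
    return mu, max(var, 1e-12)   # floor the variance to avoid division by zero

def is_anomalous(x, mu, var, k=3.0):
    """Flag a job whose metric lies more than k standard deviations
    from the historical mean."""
    return abs(x - mu) > k * math.sqrt(var)
```

A GMM generalizes this by mixing several such Gaussians, so multi-modal workloads (e.g. distinct job classes) are each modeled by their own component.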

13.
An MPI parallel programming model for CBEA on-chip multi-core processors supporting multiple memory-access techniques
Existing CBEA (Cell Broadband Engine Architecture) programming models mostly target stream-processing-style "bulk data transfer" applications, while traditional applications with irregular memory accesses perform poorly. Based on the Cell architecture, this paper proposes an MPI parallel programming model that supports both bulk-transfer and irregular-access applications. Communication is decomposed onto the PPE (PowerPC Processing Element), broadening the model's applicability; under a unified memory-access interface, runtime memory-access profiling information guides the selection and optimization of access methods to improve computational efficiency. Experimental results show that the proposed model supports multiple memory-access patterns, achieves good parallel speedup, and delivers roughly 30% to 50% better performance than comparable techniques.

14.
Nitzan Niv  Assaf Schuster 《Software》2001,31(15):1439-1459
In this paper we propose a mechanism that provides distributed shared memory (DSM) systems with a flexible sharing granularity. The size of the shared memory units is dynamically determined by the system during runtime. This size can range from that of a single variable up to the size of the entire shared memory space. During runtime, the DSM transparently adapts the granularity to the memory access pattern of the application in each phase of its execution. This adaptation, called ComposedView, provides efficient data sharing in software DSM while preserving sequential consistency. Neither complex code analysis nor annotation by the programmer or the compiler are required. Our experiments indicate a substantial performance boost (up to 80% speed‐up improvement) when running a large set of applications using our method, compared to running these benchmark applications with the best fixed granularity. Copyright © 2001 John Wiley & Sons, Ltd.

15.
Compiler optimization aims to exploit the optimization opportunities in a program and improve its compilation or execution efficiency. Dead-code elimination, one of the most widely used compiler optimizations, removes unreachable code to speed up program execution. The execution paths of many applications depend on the values of runtime input parameters, and on some branch paths, in combination with those runtime values, dead code may exist that existing dead-code elimination passes can hardly optimize...
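Parameter-value-driven dead-branch elimination can be sketched with Python's `ast` module: substitute the known runtime values into the source and fold the conditionals that become constant (a toy stand-in for the compiler-level pass the abstract describes; `specialize` is a hypothetical name):

```python
import ast

def specialize(source, known):
    """Fold branches whose condition depends only on parameters with known
    runtime values, removing the unreachable side."""
    class Folder(ast.NodeTransformer):
        def visit_Name(self, node):
            # Replace reads of known parameters with their constant values.
            if isinstance(node.ctx, ast.Load) and node.id in known:
                return ast.copy_location(ast.Constant(known[node.id]), node)
            return node

        def visit_If(self, node):
            self.generic_visit(node)   # fold names inside the test first
            try:
                expr = ast.fix_missing_locations(ast.Expression(body=node.test))
                cond = eval(compile(expr, "<cond>", "eval"), {})
            except Exception:
                return node            # condition not fully known; keep the branch
            return node.body if cond else node.orelse

    tree = Folder().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

Applied to a function whose `mode` parameter is known at specialization time, the unreachable `else` arm disappears from the emitted source.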

16.
Trusted Execution Environment (TEE) is an architectural solution for secure computing that requires confidentiality and integrity for private data and code. In recent years, TEE has become a research hotspot for machine learning privacy protection, encrypted databases, blockchain security, etc. This paper addresses the performance of systems running on this new trusted hardware. We analyze the performance of the new trusted hardware, Intel SGXv2, and find that the paging overhead of SGXv1 is no longer the main issue in SGXv2 once large secure memory is configured. However, a large secure memory setup leads to two new problems. First, the available range of normal memory is narrowed, which increases the memory pressure of normal applications, especially big data applications. Second, secure memory is usually underutilized, resulting in low overall physical memory utilization. To solve these problems, this paper proposes a new lightweight code migration approach, which dynamically migrates the code of normal applications into secure memory while leaving the data in place. The migrated code can use secure memory and avoid the drastic performance degradation caused by disk swapping. Experimental results show that the proposed approach can reduce the runtime overhead of normal applications by 73.2% to 98.7% without affecting the isolation or use of secure applications.

17.
Martonosi  M. Gupta  A. Anderson  T.E. 《Computer》1995,28(4):32-40
To improve program memory performance, programmers and compiler writers can transform an application so that its memory-referencing behavior better exploits the memory hierarchy. The challenge in achieving these transformations is the difficulty of statically analyzing or reasoning about an application's referencing behavior and interactions. In addition, many performance-monitoring tools collect high-level information that is too coarse to analyze specific memory performance bugs. We describe MemSpy, a performance-monitoring tool designed to help programmers discern where and why memory bottlenecks occur. MemSpy guides programmers toward program transformations that improve memory performance through detailed statistics on cache-miss causes and frequency. Because of the natural link between data-reference patterns and memory performance, MemSpy helps programmers understand the interactions between data structures and code segments by displaying statistics in terms of both the program's data and code structures, rather than code structures alone.
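MemSpy's data-oriented miss statistics can be approximated by a toy direct-mapped cache model that attributes each miss to the data structure being referenced (the cache geometry, class name, and labels are illustrative assumptions, not MemSpy's implementation):

```python
class CacheModel:
    """Direct-mapped cache model that counts misses per data-structure label."""
    def __init__(self, num_lines=64, line_size=64):
        self.num_lines, self.line_size = num_lines, line_size
        self.resident = [None] * num_lines   # which line currently occupies each slot
        self.misses = {}

    def access(self, label, addr):
        line = addr // self.line_size
        slot = line % self.num_lines
        if self.resident[slot] != line:      # miss: fill the line
            self.resident[slot] = line
            self.misses[label] = self.misses.get(label, 0) + 1
```

A unit-stride walk touches each cache line eight times (for 8-byte elements), so only one access in eight misses; a cache-line-stride walk misses on every access, and the per-label counts make that contrast visible per data structure.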

18.
Dynamic languages are suitable for developing specific applications where runtime adaptability is an important issue. On the contrary, statically typed languages commonly provide better compile‐time type error detection and more opportunities for compiler optimizations. Because both approaches offer different benefits, there exist programming languages that support hybrid dynamic and static typing. However, the existing hybrid typing languages commonly do not gather type information of dynamic references at compile time, missing opportunities for improving compile‐time error detection and runtime performance. Therefore, we propose some design principles to implement hybrid typing languages that continue gathering type information of dynamically typed references. This type information is used to perform compile‐time type checking of the dynamically typed code and improve its runtime performance. As an example, we have implemented a hybrid typing language following the proposed design principles. We have evaluated the runtime performance and memory consumption of the generated code. The average performance of the dynamic and hybrid typing code is at least 2.53× and 4.51× better than the related approaches for the same platform, consuming less memory resources. Copyright © 2014 John Wiley & Sons, Ltd.

19.
Specialization of code is used to improve the performance of applications. However, specialization based on ineffective profiles deteriorates performance, and existing value profiling algorithms do not yet address the code size explosion incurred by specialization. This problem can be mitigated by capturing, through profiling, the data that is useful for specializing code with minimum code growth. In this article, we present an approach to optimizing code through value profiling and specialization with code transformation. The values of parameters selected through code analysis are captured in intervals that automatically adapt to the dynamic behavior of the application. The code is then specialized based on the value profiles. The specialized code contains optimizations and may be converted back to the generalized code through a transformation. This approach enables optimization through specialization with minimum code size and no runtime overhead. Experiments performed on the Itanium-II (IA-64) architecture with the icc compiler v9.0 show a significant improvement in the performance of the SPEC 2000 benchmarks.
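Value profiling followed by specialization on a hot value can be sketched as a wrapper that profiles the first argument and installs a precomputed fast path once one value dominates (this assumes a pure function of one argument, and all names and thresholds are hypothetical, not the paper's algorithm):

```python
from collections import Counter

def profile_and_specialize(fn, warmup=100, threshold=0.9):
    """Profile argument values; once one value accounts for `threshold` of
    calls after `warmup` invocations, serve it from a specialized cache."""
    counts = Counter()
    cache = {}
    def wrapper(x):
        counts[x] += 1
        total = sum(counts.values())
        if total >= warmup and counts[x] / total >= threshold:
            if x not in cache:
                cache[x] = fn(x)   # compute the specialized result once
            return cache[x]
        return fn(x)               # cold path: run the general code
    wrapper.counts = counts
    return wrapper
```

Only values the profile identifies as hot get a specialized path, which mirrors the paper's goal of limiting specialization-induced code growth to profitable cases.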

20.

In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for program’s execution. Efficiently implementing host code is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPU and graphics processing units; the programmer is responsible for explicitly managing node’s and device’s memory, synchronizing computations with data transfers between devices of potentially different nodes and for optimizing data transfers between devices’ memories and nodes’ main memories, e.g., by using pinned main memory for accelerating data transfers and overlapping the transfers with computations. We develop distributed OpenCL/CUDA abstraction layer (dOCAL)—a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over the state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communications; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.

